## IEEE COMPSAC19 Workshop on Data Flow Models and Extreme-Scale Computing Panel: Automatic Macro Data Flow Execution on Global Address Space Multicore Systems by OSCAR Compiler with Locality and Power Optimization



### Hironori Kasahara, Ph.D., IEEE Fellow, IPSJ Fellow; Senior Executive VP, Waseda University; IEEE Computer Society President 2018

URL: http://www.kasahara.cs.waseda.ac.jp/

1980 BS, 1982 MS, 1985 Ph.D., Dept. EE, Waseda Univ.; 1985 Visiting Scholar, U. of California, Berkeley; 1986 Assistant Prof.; 1988 Associate Prof.; 1989-90 Research Scholar, U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D; 1997 Prof.; 2004 Director, Advanced Multicore Research Institute; 2017 member, the Engineering Academy of Japan and the Science Council of Japan; Nov. 2018 Senior Vice President, Waseda Univ.

- 1987 IFAC World Congress Young Author Prize
- 1997 IPSJ Sakai Special Research Award
- 2005 STARC Academia-Industry Research Award
- 2008 LSI of the Year Second Prize
- 2008 Intel Asia Academic Forum Best Research Award
- 2010 IEEE CS Golden Core Member Award
- 2014 Minister of Edu., Sci. & Tech. Research Prize
- 2015 IPSJ Fellow, 2017 IEEE Fellow, Eta Kappa Nu

Reviewed Papers: 216, Invited Talks: 176, Granted Patents: 48 (Japan, US, GB, China), Articles in Newspapers, Web News, and other Media incl. TV: 605

Committees in Societies and Government: 258

- IEEE Computer Society: President 2018, Executive Committee (2017-19), BoG (2009-14), Strategic Planning Committee Chair 2017, Multicore STC Chair (2012-), Japan Chair (2005-07), IEEE TAB (2018)
- IPSJ Chair: HG for Magazine. & J. Edit, Sig. on ARC.
- [METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee
- [Cabinet Office] CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
- [MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees, etc.

**Question 1: Program Execution Model (PXM) vs. Programming Model (PM)** 

**Question 3: Of the programmability of dataflow models**

A Macro Task Graph, or Macro Data Flow Graph, is generated from a sequential program written in C or Fortran (PM) by Earliest Executable Condition (EEC) Analysis, and is executed using static or dynamic task scheduling (PXM).

#### **Generation of Coarse Grain Tasks**

- Macro-tasks (MTs)
  - Block of Pseudo Assignments (BPA): Basic Block (BB)
  - Repetition Block (RB): natural loop
  - Subroutine Block (SB): subroutine



# Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)



#### **Earliest Executable Conditions**

| Macrotask No. | Earliest Executable Condition                                    |
|---------------|------------------------------------------------------------------|
| 1             |                                                                  |
| 2             | 1<sub>2</sub>                                                    |
| 3             | (1)<sub>3</sub>                                                  |
| 4             | 2<sub>4</sub> OR (1)<sub>3</sub>                                 |
| 5             | (4)<sub>5</sub> AND [ 2<sub>4</sub> OR (1)<sub>3</sub> ]         |
| 6             | 3 OR (2)<sub>4</sub>                                             |
| 7             | 5 OR (4)<sub>6</sub>                                             |
| 8             | (2)<sub>4</sub> OR (1)<sub>3</sub>                               |
| 9             | (8)<sub>9</sub>                                                  |
| 10            | (8)<sub>10</sub>                                                 |
| 11            | 8<sub>9</sub> OR 8<sub>10</sub>                                  |
| 12            | 11<sub>12</sub> AND [ 9 OR (8)<sub>10</sub> ]                    |
| 13            | 11<sub>13</sub> OR 11<sub>12</sub>                               |
| 14            | (8)<sub>9</sub> OR (8)<sub>10</sub>                              |
| 15            | 2<sub>15</sub>                                                   |
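As a rough illustration of how a scheduler might test such conditions at run time, the sketch below encodes three rows of the table as boolean expressions over observed events. The event encoding is an assumption made for this example and is not taken from the slides: a bare number i is read as "macro-task i has finished", and a parenthesized number (i) followed by j is read as "macro-task i has decided to branch toward macro-task j". The names `done`, `branch`, `EEC`, and `is_ready` are hypothetical.

```python
# Hypothetical event encoding (an assumption of this sketch):
#   ("done", i)      -> macro-task i has finished
#   ("branch", i, j) -> macro-task i has decided to branch toward macro-task j


def done(i):
    return ("done", i)


def branch(i, j):
    return ("branch", i, j)


# Three rows of the table, each written as an OR over alternatives;
# every alternative is an AND over the events it lists.
EEC = {
    6: [[done(3)], [branch(2, 4)]],        # row 6: 3 OR (2)4
    7: [[done(5)], [branch(4, 6)]],        # row 7: 5 OR (4)6
    8: [[branch(2, 4)], [branch(1, 3)]],   # row 8: (2)4 OR (1)3
}


def is_ready(task, observed):
    """Return True when at least one alternative of the task's condition holds."""
    return any(all(event in observed for event in alt) for alt in EEC[task])


# Example: only the branch from macro-task 1 toward macro-task 3 is known so far.
observed = {branch(1, 3)}
print(sorted(t for t in EEC if is_ready(t, observed)))
```

Under this reading, a macro-task can become ready as soon as a branch decision rules out the producer it would otherwise wait for, which is what lets it start earlier than a plain "all predecessors finished" rule would allow.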

## **OSCAR Parallelizing Compiler**

To improve effective performance, cost-performance, and software productivity, and to reduce power consumption

#### **Multigrain Parallelization** (LCPC1991, 2001, 04)

Coarse-grain parallelism among loops and subroutines (2000 on SMP) and near-fine-grain parallelism among statements (1992), in addition to loop parallelism

#### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2; Cache 2001, 03)

Software Coherent Control (2017)

#### **Data Transfer Overlapping** (2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)
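To make the benefit of overlapping concrete, here is a minimal timing sketch that compares a fully serial execution with a double-buffered one, in which the transfer of the next block runs while the current block is being computed. The two-buffer model, the function names, and the sample timings are assumptions for illustration only; they do not describe how the OSCAR compiler schedules its DMA transfers.

```python
def serial_time(transfer, compute):
    """Each block is transferred and then computed, with no overlap."""
    return sum(transfer) + sum(compute)


def overlapped_time(transfer, compute):
    """Double buffering: while block i is computed, block i+1 is transferred.

    Assumes at least one block and exactly two buffers, so a new step
    starts only when both the running computation and the running
    transfer have finished.
    """
    n = len(transfer)
    total = transfer[0]                      # the first transfer cannot be hidden
    for i in range(n):
        next_transfer = transfer[i + 1] if i + 1 < n else 0
        total += max(compute[i], next_transfer)
    return total


# Hypothetical timings in arbitrary units.
transfer = [2, 2, 2]
compute = [3, 3, 3]
print(serial_time(transfer, compute), overlapped_time(transfer, compute))
```

When computation dominates, only the first transfer stays visible; when transfers dominate, the total approaches the sum of the transfer times plus the last computation.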

#### **Power Reduction**

(2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

Reduction of power consumption through compiler-controlled DVFS and power gating with hardware support.
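The idea behind compiler-controlled DVFS can be pictured with a small sketch: when a task has slack before its deadline, a lower frequency and voltage may still finish in time while using less energy. The operating points below, the function name `pick_operating_point`, and the textbook approximation that dynamic energy scales with V² times the cycle count are all assumptions for illustration; none of them comes from the slides or from the OSCAR compiler itself.

```python
# Hypothetical operating points: (frequency in MHz, supply voltage in V).
OPERATING_POINTS = [(200, 0.8), (400, 1.0), (800, 1.2)]


def pick_operating_point(cycles, deadline_us, points=OPERATING_POINTS):
    """Return the (freq_mhz, volt) pair that meets the deadline with the
    lowest relative dynamic energy, or None if no point is fast enough."""
    best = None
    for freq_mhz, volt in points:
        time_us = cycles / freq_mhz          # MHz equals cycles per microsecond
        if time_us > deadline_us:
            continue                         # too slow for this deadline
        energy = volt ** 2 * cycles          # relative energy, E ~ V^2 * cycles
        if best is None or energy < best[0]:
            best = (energy, freq_mhz, volt)
    return None if best is None else (best[1], best[2])


# A task of 400,000 cycles with a 1500-microsecond deadline.
print(pick_operating_point(400_000, 1500))
```

A real implementation would also have to account for the time and energy cost of switching between operating points, which this sketch ignores.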



## Multicore Program Development Using OSCAR API V2.0

Flow shown in the figure, from sequential program to executable multicore code:

- **Sequential Application Program in Fortran or C** (Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)
- Parallelization for homogeneous and heterogeneous targets
  - Manual parallelization / power reduction
  - **Accelerator Compiler / User**: add "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks
  - Waseda OSCAR **Parallelizing Compiler**
    - Coarse grain task parallelization
    - **Data Localization**
    - **DMAC** data transfer
    - Power reduction using **DVFS, Clock/Power gating**
- **OSCAR API for Homogeneous and/or Heterogeneous Multicores and Manycores**: directives for thread generation, memory, data transfer using DMA, and power management
  - **Parallelized API Fortran or C program**
    - Proc0: code with directives (Thread 0)
    - Proc1: code with directives (Thread 1)
    - Accelerator 1 code
    - Accelerator 2 code
- Code generation
  - **Low Power Homogeneous Multicore Code Generation**: API Analyzer, existing sequential compiler
  - **Low Power Heterogeneous Multicore Code Generation**: API Analyzer (available from Waseda), existing sequential compiler
  - Server Code Generation: **OpenMP** compiler
- **Generation of parallel machine codes using sequential compilers**
  - Homogeneous multicores from Vendor A (SMP servers): Intel, ARM, IBM, AMD, Infineon, Renesas, RISC-V
  - Heterogeneous multicores from Vendor B: FPGA, various accelerators
  - Shared memory servers

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

**OSCAR:** Optimally Scheduled Advanced Multiprocessor. **API:** Application Program Interface.

#### **Question 2: System-level API and Fine-Grain Parallelism**

When the task grain is too fine, such as a single basic block, the OSCAR compiler applies static scheduling with task fusion. For fine-grain tasks, static scheduling is usually more efficient than dynamic scheduling. The OSCAR API is used for execution on various multicores and global address space multiprocessor systems.

Macro Task Fusion for Static Task Scheduling used in car engine control
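The combination of task fusion and static scheduling described above can be sketched as follows. This is a generic illustration under stated assumptions, not the OSCAR compiler's algorithm: `fuse_chains` merges only linear chains (a task whose single successor has no other predecessor), and `static_schedule` is a simple list-scheduling heuristic that ignores communication cost. The task names, costs, and both function names are hypothetical.

```python
def fuse_chains(cost, succ):
    """Merge a task into its predecessor whenever the two form a linear chain.

    cost: {task: execution cost}; succ: {task: [successor tasks]} with an
    entry for every task. Returns new (cost, succ) dictionaries.
    """
    cost = dict(cost)
    succ = {t: list(s) for t, s in succ.items()}
    changed = True
    while changed:
        changed = False
        pred = {t: [] for t in cost}
        for t, successors in succ.items():
            for s in successors:
                pred[s].append(t)
        for a in list(cost):
            if len(succ[a]) == 1 and len(pred[succ[a][0]]) == 1:
                b = succ[a][0]
                cost[a] += cost.pop(b)       # the fused task keeps the name of a
                succ[a] = succ.pop(b)
                changed = True
                break                        # recompute predecessors and retry
    return cost, succ


def static_schedule(cost, succ, cores):
    """Assign every task to a core ahead of time; returns (assignment, makespan).

    Assumes the task graph is acyclic.
    """
    pred = {t: [] for t in cost}
    for t, successors in succ.items():
        for s in successors:
            pred[s].append(t)
    finish, core_free, assignment = {}, [0] * cores, {}
    remaining = set(cost)
    while remaining:
        # Among tasks whose predecessors are all scheduled, take the costliest first.
        ready = [t for t in remaining if all(p in finish for p in pred[t])]
        task = max(ready, key=lambda t: (cost[t], t))
        data_ready = max((finish[p] for p in pred[task]), default=0)
        core = min(range(cores), key=lambda k: max(core_free[k], data_ready))
        start = max(core_free[core], data_ready)
        finish[task] = core_free[core] = start + cost[task]
        assignment[task] = core
        remaining.remove(task)
    return assignment, max(finish.values())


# Hypothetical fine-grain graph: a -> b -> c -> e and d -> e.
cost = {"a": 1, "b": 1, "c": 1, "d": 4, "e": 2}
succ = {"a": ["b"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
fused_cost, fused_succ = fuse_chains(cost, succ)
print(fused_cost, static_schedule(fused_cost, fused_succ, cores=2))
```

Fusing the chain a, b, c into one task removes two synchronization points before scheduling, and because the assignment is fixed ahead of time no scheduling decisions remain at run time.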



## OSCAR API Ver. 2.0 for Homogeneous (LCPC2009) / Heterogeneous (LCPC2010) Multicores and Manycores

API for Parallel Processing on Various Multicores, Power Management, Hardware and Software Cache Control, and Local Memory Management

Manual Download: http://www.kasahara.cs.waseda.ac.jp/api/regist.php?lang=en&ver=2.1

### List of Directives (22 directives)

- Parallel Execution API
  - parallel sections (\*)
  - flush (\*)
  - critical (\*)
  - execution
- Memory Mapping API
  - threadprivate (\*)
  - distributedshared
  - onchipshared
- Synchronization API
  - groupbarrier
- Data Transfer API
  - dma\_transfer
  - dma\_contiguous\_parameter
  - dma\_stride\_parameter
  - dma\_flag\_check
  - dma\_flag\_send
- Power Control API
  - fvcontrol
  - get\_fvstatus
- Timer API
  - get\_current\_time
- Accelerator
  - accelerator\_task\_entry
- Cache Control
  - cache\_writeback
  - cache\_selfinvalidate
  - complete\_memop
  - noncacheable
  - aligncache
- 2 hint directives for OSCAR compiler
  - accelerator\_task
  - oscar\_comment

from V2.0

(\* from OpenMP)