how is sppyecifically multicore programming …...october 11, 2007, 13:00 to 14:00, 20th lcpc, uiuc...

How is specifically multicore p yprogramming different from

traditional parallel computing?Hironori Kasahara

Professor Department of Computer ScienceProfessor, Department of Computer ScienceDirector, Advanced Chip-Multiprocessor

Research InstituteResearch Institute Waseda University

Tokyo JapanTokyo, Japanhttp://www.kasahara.cs.waseda.ac.jp

1October 11, 2007, 13:00 to 14:00, 20th LCPC, UIUC

Multi-core EverywhereMulti-core from embedded to supercomputersMulti core from embedded to supercomputers

Consumer Electronics (Embedded)Mobile Phone, Game, Digital TV, Car Navi, DVD, IBM/ Sony/ Toshiba Cell, Fuijtsu FR1000, y , j ,NEC/ARMMPCore&MP211, Panasonic Uniphier, Renesas SH multi-core(RP1)

PCs, ServersC C 2 iIntel Dual-Core Xeon, Core 2 Duo, Montecito

AMD Quad and Dual-Core OpteronWSs, Deskside & Highend Servers

IBM Power4 5 5+ pSeries690(32way) p5 550Q(8 way)IBM Power4,5,5+, pSeries690(32way), p5 550Q(8 way) , Sun N iagara(SparcT1,T2), SGI ALTIX350,

SupercomputersIBM Blue Gene/L: 360TFLOPS, 2005, (BG/P, 2007?)IBM BlueGene/L　

Lawrence Livermore National Laboratory2005/04稼動予定

OSCAR Type Multi-core Chip by Renesas in METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof.Kasahara)

Low power CMP based 128K processor chips

Huge number of parallel applications must be developed in a few years

Lawrence Livermore National Laboratory2005/04稼動予定

低消費電力チップマルチプロセッサをベースとしたスパコン

must be developed in a few yearsBest programming method?

Sequential programming which can65,536-プロセッサチップ（128Kプロセッサ

360TFLOPS

2

Sequential programming which can be easily parallelized by compiler

Ex.) Restricted C without pointers and recursive calls

　＝毎秒360兆回の　　　　浮動小数点計算）

キャビネットあたり1024プロセッサチップ

１プロセッサチップ上に２プロセッサ集積

Market of Consumer Electronics1 T illi D ll i 20101 Trillion Dollars in 2010 (World Wide)

1 6 兆円

W /W 市場規模

1 6 兆円

W /W 市場規模

市場規

1 2 兆円

1 4 兆円

兆円

携帯電話

市場規

1 2 兆円

1 4 兆円

兆円

携帯電話

規模（出荷額）

1 . 5 兆円

2 兆円ﾃﾞｼﾞﾀﾙｶﾒﾗ

ﾞｼﾞﾀﾞﾞｵ

薄型ﾃﾚﾋﾞ

規模（出荷額）

1 . 5 兆円

2 兆円ﾃﾞｼﾞﾀﾙｶﾒﾗ

ﾞｼﾞﾀﾞﾞｵ

薄型ﾃﾚﾋﾞ

0

5 千億円

1 兆円ﾃﾞｼﾞﾀﾙﾋﾞﾃﾞｵ

ｶｰﾅﾋﾞ

D V D ﾚｺｰﾀﾞ

0

5 千億円

1 兆円ﾃﾞｼﾞﾀﾙﾋﾞﾃﾞｵ

ｶｰﾅﾋﾞ

D V D ﾚｺｰﾀﾞ

0` 0 3 ` 0 4 ` 0 5 ` 0 6` 0 2

0` 0 3 ` 0 4 ` 0 5 ` 0 6` 0 2

１２１２７６７６４９４９デジタルスチルカメラ（M 台）

年平均成長率％‘０７‘０３


年平均成長率％‘０７‘０３

Dig Camera

Annual Growth Rates

2005.5.11NEDOロードマップ報告会電子・情報技術開発部「技術開発戦略」より

４３４３１１４１１４２７２７PC用 DV D（記録型）（M 台）

７４７４３３３３３．６３．６D V Dレコーダ（M 台）

４５４５２７２７６６デジタル TV（M 台）


４３４３１１４１１４２７２７PC用 DV D（記録型）（M 台）

７４７４３３３３３．６３．６D V Dレコーダ（M 台）

４５４５２７２７６６デジタル TV（M 台）

１２１２７６７６４９４９デジタルスチルカメラ（M 台）Dig. CameraDig. TVDVD RecorderDVD for PC

3１１１１２０．９２０．９１４．０１４．０自動車用半導体需要（B$）

８８６７０６７０４９０４９０携帯電話（M 台）

４３４３１１４１１４２７２７PC用 DV D（記録型）（M 台）

１１１１２０．９２０．９１４．０１４．０自動車用半導体需要（B$）

８８６７０６７０４９０４９０携帯電話（M 台）

４３４３１１４１１４２７２７PC用 DV D（記録型）（M 台）Mobile PhoneLSI for Cars

T i ff ti f t fIf you want to make parallel programming

• To improve effective performance, cost-performance and productivity and reduce consumed power

M ll li id i h t it– More parallelism considering core heterogeneity• Multigrain Parallelization

– coarse-grain parallelism in addition to loop parallelismcoarse grain parallelism in addition to loop parallelism– Limited size of memory management considering real-time

• Data localization for local memories and distributed shoared memories in addition to cache

– Data Transfer Overlapping using DMA controllers• Data transfer overhead hiding by overlapping task execution and• Data transfer overhead hiding by overlapping task execution and

data transfer using DMA or data pre-fetching– Power reduction controlling FV and power shut-down

• Reduction of consumed power by compiler control of frequency,voltage and power shut down with hardware supports.

4

OSCAR Heterogeneous MulticoreOSCAR Heterogeneous Multicore

• OSCAR Type Memory• OSCAR Type Memory Architecture

• LPM– Local Program Memoryg y

• LDM– Local Data Memory

• DSMDi ib d Sh d M– Distributed Shared Memory

• CSM– Centralized Shared Memory

• On Chip and/or Off Chipp p• DTU

– Data Transfer Unit• Interconnection Network

M lti l B– Multiple Buses– Split Transaction Buses– CrossBar …

5

OSCAR Multi-Core ArchitectureCMP (chip multiprocessor 0)

CMP m

PE0 PE

CMP (chip multiprocessor 0)0

CPU

I/ODevicesI/O

Devices0 PE1 PE n

LDM/D-cacheLPM/

CSM j I/OCMP k

CPUDTC

DSMI-Cache

CSN t k I t fFVR

Intra-chip connection network

CSMNetwork InterfaceFVR

FVR

CSM / L2 Cache

(Multiple Buses, Crossbar, etc) FVR

FVRFVR FVR FVR FVR

Inter-chip connection network (Crossbar, Buses, Multistage network, etc)CSM: central shared mem. LDM : local data mem.

FVR

6DSM: distributed shared mem.DTC: Data Transfer Controller

LPM : local program mem.FVR: frequency / voltage control register

Power Reduction by Power Supply, Clock Frequencyand Voltage Control by OSCAR Compiler

• Shortest execution time mode

7

Current approach in METI/NEDO Multicore ProjectOperMP(Section, Flush, Critical) + Directives for data assignment for LDM, DSM, On-

chip CSM Off chip CSM DMA transfers and FV & Power Shut down controlchip CSM, Off-chip CSM,DMA transfers and FV & Power Shut-down control

Executable

Translate into

parallel

API to specify data assignment,

data transfer,

Sequential Application

P codes for each vendor

chip

parallel codes for

each vender

,power reduction

control

Program（Subset of C Language)

RealtimA

Image

Was

Backend compilerMach. Codes

Proc0ScheduledTasksT1 St SH l i

APIdecoder

Sequential Compilerm

e Cons

Applicati

e, Secure

seda OSC

Backend Compiler

T1 Stop

Proc1ScheduledTasks

SH multicore

Mach.CodesAPI

decoderSequential Compilersum

er Eon Prog

e Audio S

etc.

CA

R C

o

B k d C il

FR-V

TasksT2 T4

Proc2S h d l d

decoder Compiler

lectronicgram

sStream

in

mpiler

Backend Compiler

（）

ScheduledTasks

T3 T6 Slow

Mach.CodesSequential

CompilerAPIdecoder

csng （CELL）

Data Transfer by DTC(DMAC)

Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

Centralized scheduling codeHierarchical Multigrain Parallel Processing code

SECTIONSSECTION SECTION1st layer

Distributed scheduling

d

T0 T1 T2 T3

MT1_1

T4 T5 T6 T7MT1_1

1st layer

code SYNC SENDMT1_2

SYNC RECVMT1_3SB

MT1_2DOALL

MT1 4MT1-3

1_4_11_4_11_3_2

1_3_3

1_3_1

1_3_2

1_3_3

1_3_1

1_3_2

1_3_3

1_3_1MT1_4RB MT1-4

1_4_2

1_4_4

1_4_3

1_4_2

1_4_4

1_4_31_3_4

1 3 6

1_3_5

1_3_4

1 3 6

1_3_5

1_3_4

1 3 6

1_3_5

1_3_1

1_3_2 1_3_3 1_3_41_4_11_3_6 1_3_6 1_3_6

END SECTIONS2nd layer

1_3_5 1_3_6

1_4_21_4_3 1_4_4

9Thread group0

Thread group1

2nd layer

2nd layer

Processor Block DiagramProcessor Block DiagramCCN:Cache controllerIL DL I t ti /D t

Core #3CPU FPUCore #2

600MHzIL, DL: Instruction/Data

local memory

I$32K

D$32K

IL DL

CPU FPU

CCNI$32K

D$32K

CPU FPUCore #1

$ $CPU FPUCore #0

CPU FPU rolle

r

URAM: User RAMGCPG: Global clock pulse generator

IL8K

DL16K

CCN

URAM 128K

32K 32KIL8K

DL16K

CCNI$32K

D$32K

IL8K

DL16K

CCNI$32K

D$32K

CPU FPU

CCN

p co

ntr

LCPGgLCPG: Local CPG for each coreLBSC: SRAM controller

URAM 128K8K 16KURAM 128K

IL8K

DL16K

URAM 128K Snoo

p(S

NC

)LCPG0-3LCPG0-3LCPG

0-3LCPG0-3

On-chip system bus (SHwy)

LBSC: SRAM controllerDBSC: DDR2 controller

300MHzS (0 3

DBSC PCIExp

HWGCPGIPLBSC

104 laneDDR2

32bitSRAM32bit

An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Controlpp g

TIM

EE

11

Chip OverviewProcess Technology

90nm, 8-layer, triple-Vth, CMOS

Chip Size 97 6mm2 (9 88mm x 9 88mm)Core Core Chip Size 97.6mm (9.88mm x 9.88mm)

Supply Voltage 1.0V (internal), 1.8/3.3V (I/O)

Power Consumption

0.6 mW/MHz/CPU @ 600MHz (90nm G)

#0 PeripheralsCore

#1

Consumption 600MHz (90nm G)

Clock Frequency 600MHz

CPU Performance 4320 MIPS (Dhrystone 2.1)

SNC

Core Core

#3FPU Performance 16.8 GFLOPS

I/D Cache 32KB 4way set-associative (each)

#2#3

( )

ILRAM/OLRAM 8KB/16KB (each CPU)

URAM 128KB (each CPU)

P k FCBGA 554 i 29SH4A Multicore SoC Chip Package FCBGA 554pin, 29mm x 29mm

ISSCC07 Paper No.5.3, Y. Yoshida, et al., “A 4320MIPS Four-Processor Core SMP/AMP with

12

pIndividually Managed Clock Frequency for Low Power Consumption”

Performance on a DevelopedSH Multi core (RP1: SH X3)SH Multi-core (RP1: SH-X3)

Using Compiler and APIAudio AAC* Encoder Image Susan Smoothing

up

5.0

4.02 95

3.825.0

4.03.43

d u

p

Speed

3.01.91

2.95

3.01.86

2.70

Speed

2.0

1.0

1.002.0

1.0

1.00

Number of Processors1 2 3 4

01 2 3 4

0Number of Processors

Page. 13

Number of Processors**) Mibench Embedded application benchmark by Michigan Univ.*) ISO Advanced Audio Coding：

Number of Processors

Data Localization11

2

1

32

PE0 PE1

12 1

3 4 56

23 45

6 7 8910 1112

3

4

2

6 7

8

14

18

7

6 7 8910 1112

1314 15 16

5

8

9

11

15

18

19

25

8 9 101718 19 2021 22

10

11

13 16

25

29

112324 25 26

dlg0dlg3dlg1 dlg2

17 20

21

22 26

30

12 1314

15

2728 29 3031 32

33Data Localization Group

dlg023 24

27 28

32

14MTG MTG after Division A schedule for two processors

15 33Data Localization Group31

how is sppyecifically multicore programming …...october 11, 2007, 13:00 to 14:00, 20th lcpc, uiuc...

Documents