how is sppyecifically multicore programming …...october 11, 2007, 13:00 to 14:00, 20th lcpc, uiuc...
TRANSCRIPT
How is specifically multicore p yprogramming different from
traditional parallel computing?Hironori Kasahara
Professor Department of Computer ScienceProfessor, Department of Computer ScienceDirector, Advanced Chip-Multiprocessor
Research InstituteResearch Institute Waseda University
Tokyo JapanTokyo, Japanhttp://www.kasahara.cs.waseda.ac.jp
1October 11, 2007, 13:00 to 14:00, 20th LCPC, UIUC
Multi-core EverywhereMulti-core from embedded to supercomputersMulti core from embedded to supercomputers
Consumer Electronics (Embedded)Mobile Phone, Game, Digital TV, Car Navi, DVD, IBM/ Sony/ Toshiba Cell, Fuijtsu FR1000, y , j ,NEC/ARMMPCore&MP211, Panasonic Uniphier, Renesas SH multi-core(RP1)
PCs, ServersC C 2 iIntel Dual-Core Xeon, Core 2 Duo, Montecito
AMD Quad and Dual-Core OpteronWSs, Deskside & Highend Servers
IBM Power4 5 5+ pSeries690(32way) p5 550Q(8 way)IBM Power4,5,5+, pSeries690(32way), p5 550Q(8 way) , Sun N iagara(SparcT1,T2), SGI ALTIX350,
SupercomputersIBM Blue Gene/L: 360TFLOPS, 2005, (BG/P, 2007?)IBM BlueGene/L
Lawrence Livermore National Laboratory2005/04稼動予定
OSCAR Type Multi-core Chip by Renesas in METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof.Kasahara)
Low power CMP based 128K processor chips
Huge number of parallel applications must be developed in a few years
Lawrence Livermore National Laboratory2005/04稼動予定
低消費電力チップマルチプロセッサをベースとしたスパコン
must be developed in a few yearsBest programming method?
Sequential programming which can65,536-プロセッサチップ(128Kプロセッサ
360TFLOPS
2
Sequential programming which can be easily parallelized by compiler
Ex.) Restricted C without pointers and recursive calls
=毎秒360兆回の 浮動小数点計算)
キャビネットあたり1024プロセッサチップ
1プロセッサチップ上に2プロセッサ集積
Market of Consumer Electronics1 T illi D ll i 20101 Trillion Dollars in 2010 (World Wide)
1 6 兆 円
W /W 市 場 規 模
1 6 兆 円
W /W 市 場 規 模
市場規
1 2 兆 円
1 4 兆 円
兆 円
携 帯 電 話
市場規
1 2 兆 円
1 4 兆 円
兆 円
携 帯 電 話
規模(出荷額)
1 . 5 兆 円
2 兆 円テ ゙ シ ゙ タ ル カ メ ラ
゙ シ ゙ タ ゙ ゙ オ
薄 型 テ レ ヒ ゙
規模(出荷額)
1 . 5 兆 円
2 兆 円テ ゙ シ ゙ タ ル カ メ ラ
゙ シ ゙ タ ゙ ゙ オ
薄 型 テ レ ヒ ゙
0
5 千 億 円
1 兆 円テ ゙ シ ゙ タ ル ヒ ゙テ ゙ オ
カ ー ナ ヒ ゙
D V D レ コ ー タ ゙
0
5 千 億 円
1 兆 円テ ゙ シ ゙ タ ル ヒ ゙テ ゙ オ
カ ー ナ ヒ ゙
D V D レ コ ー タ ゙
0` 0 3 ` 0 4 ` 0 5 ` 0 6` 0 2
0` 0 3 ` 0 4 ` 0 5 ` 0 6` 0 2
121276764949デ ジ タル ス チ ル カメラ(M 台 )
年 平 均 成長 率 %‘07‘03
121276764949デ ジ タル ス チ ル カメラ(M 台 )
年 平 均 成長 率 %‘07‘03
Dig Camera
Annual Growth Rates
2005.5.11NEDOロードマップ報告会電子・情報技術開発部「技術開発戦略」より
43431141142727PC用 DV D(記 録 型 )(M 台 )
747433333.63.6D V Dレコー ダ (M 台 )
4545272766デ ジ タル TV(M 台 )
121276764949デ ジ タル ス チ ル カメラ(M 台 )
43431141142727PC用 DV D(記 録 型 )(M 台 )
747433333.63.6D V Dレコー ダ (M 台 )
4545272766デ ジ タル TV(M 台 )
121276764949デ ジ タル ス チ ル カメラ(M 台 )Dig. CameraDig. TVDVD RecorderDVD for PC
3111120.920.914.014.0自 動 車 用 半 導 体 需 要 (B$)
88670670490490携 帯 電 話 (M 台 )
43431141142727PC用 DV D(記 録 型 )(M 台 )
111120.920.914.014.0自 動 車 用 半 導 体 需 要 (B$)
88670670490490携 帯 電 話 (M 台 )
43431141142727PC用 DV D(記 録 型 )(M 台 )Mobile PhoneLSI for Cars
T i ff ti f t fIf you want to make parallel programming
• To improve effective performance, cost-performance and productivity and reduce consumed power
M ll li id i h t it– More parallelism considering core heterogeneity• Multigrain Parallelization
– coarse-grain parallelism in addition to loop parallelismcoarse grain parallelism in addition to loop parallelism– Limited size of memory management considering real-time
• Data localization for local memories and distributed shoared memories in addition to cache
– Data Transfer Overlapping using DMA controllers• Data transfer overhead hiding by overlapping task execution and• Data transfer overhead hiding by overlapping task execution and
data transfer using DMA or data pre-fetching– Power reduction controlling FV and power shut-down
• Reduction of consumed power by compiler control of frequency,voltage and power shut down with hardware supports.
4
OSCAR Heterogeneous MulticoreOSCAR Heterogeneous Multicore
• OSCAR Type Memory• OSCAR Type Memory Architecture
• LPM– Local Program Memoryg y
• LDM– Local Data Memory
• DSMDi ib d Sh d M– Distributed Shared Memory
• CSM– Centralized Shared Memory
• On Chip and/or Off Chipp p• DTU
– Data Transfer Unit• Interconnection Network
M lti l B– Multiple Buses– Split Transaction Buses– CrossBar …
5
OSCAR Multi-Core ArchitectureCMP (chip multiprocessor 0)
CMP m
PE0 PE
CMP (chip multiprocessor 0)0
CPU
I/ODevicesI/O
Devices0 PE1 PE n
LDM/D-cacheLPM/
CSM j I/OCMP k
CPUDTC
DSMI-Cache
CSN t k I t fFVR
Intra-chip connection network
CSMNetwork InterfaceFVR
FVR
CSM / L2 Cache
(Multiple Buses, Crossbar, etc) FVR
FVRFVR FVR FVR FVR
Inter-chip connection network (Crossbar, Buses, Multistage network, etc)CSM: central shared mem. LDM : local data mem.
FVR
6DSM: distributed shared mem.DTC: Data Transfer Controller
LPM : local program mem.FVR: frequency / voltage control register
Power Reduction by Power Supply, Clock Frequencyand Voltage Control by OSCAR Compiler
• Shortest execution time mode
7
Current approach in METI/NEDO Multicore ProjectOperMP(Section, Flush, Critical) + Directives for data assignment for LDM, DSM, On-
chip CSM Off chip CSM DMA transfers and FV & Power Shut down controlchip CSM, Off-chip CSM,DMA transfers and FV & Power Shut-down control
Executable
Translate into
parallel
API to specify data assignment,
data transfer,
Sequential Application
P codes for each vendor
chip
parallel codes for
each vender
,power reduction
control
Program(Subset of C Language)
RealtimA
Image
Was
Backend compilerMach. Codes
Proc0ScheduledTasksT1 St SH l i
APIdecoder
Sequential Compilerm
e Cons
Applicati
e, Secure
seda OSC
Backend Compiler
T1 Stop
Proc1ScheduledTasks
SH multicore
Mach.CodesAPI
decoderSequential Compilersum
er Eon Prog
e Audio S
etc.
CA
R C
o
B k d C il
FR-V
TasksT2 T4
Proc2S h d l d
decoder Compiler
lectronicgram
sStream
in
mpiler
Backend Compiler
( )
ScheduledTasks
T3 T6 Slow
Mach.CodesSequential
CompilerAPIdecoder
csng (CELL)
Data Transfer by DTC(DMAC)
Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing
Centralized scheduling codeHierarchical Multigrain Parallel Processing code
SECTIONSSECTION SECTION1st layer
Distributed scheduling
d
T0 T1 T2 T3
MT1_1
T4 T5 T6 T7MT1_1
1st layer
code SYNC SENDMT1_2
SYNC RECVMT1_3SB
MT1_2DOALL
MT1 4MT1-3
1_4_11_4_11_3_2
1_3_3
1_3_1
1_3_2
1_3_3
1_3_1
1_3_2
1_3_3
1_3_1MT1_4RB MT1-4
1_4_2
1_4_4
1_4_3
1_4_2
1_4_4
1_4_31_3_4
1 3 6
1_3_5
1_3_4
1 3 6
1_3_5
1_3_4
1 3 6
1_3_5
1_3_1
1_3_2 1_3_3 1_3_41_4_11_3_6 1_3_6 1_3_6
END SECTIONS2nd layer
1_3_5 1_3_6
1_4_21_4_3 1_4_4
9Thread group0
Thread group1
2nd layer
2nd layer
Processor Block DiagramProcessor Block DiagramCCN:Cache controllerIL DL I t ti /D t
Core #3CPU FPUCore #2
600MHzIL, DL: Instruction/Data
local memory
I$32K
D$32K
IL DL
CPU FPU
CCNI$32K
D$32K
CPU FPUCore #1
$ $CPU FPUCore #0
CPU FPU rolle
r
URAM: User RAMGCPG: Global clock pulse generator
IL8K
DL16K
CCN
URAM 128K
32K 32KIL8K
DL16K
CCNI$32K
D$32K
IL8K
DL16K
CCNI$32K
D$32K
CPU FPU
CCN
p co
ntr
LCPGgLCPG: Local CPG for each coreLBSC: SRAM controller
URAM 128K8K 16KURAM 128K
IL8K
DL16K
URAM 128K Snoo
p(S
NC
)LCPG0-3LCPG0-3LCPG
0-3LCPG0-3
On-chip system bus (SHwy)
LBSC: SRAM controllerDBSC: DDR2 controller
300MHzS (0 3
DBSC PCIExp
HWGCPGIPLBSC
104 laneDDR2
32bitSRAM32bit
An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Controlpp g
TIM
EE
11
Chip OverviewProcess Technology
90nm, 8-layer, triple-Vth, CMOS
Chip Size 97 6mm2 (9 88mm x 9 88mm)Core Core Chip Size 97.6mm (9.88mm x 9.88mm)
Supply Voltage 1.0V (internal), 1.8/3.3V (I/O)
Power Consumption
0.6 mW/MHz/CPU @ 600MHz (90nm G)
#0 PeripheralsCore
#1
Consumption 600MHz (90nm G)
Clock Frequency 600MHz
CPU Performance 4320 MIPS (Dhrystone 2.1)
SNC
Core Core
#3FPU Performance 16.8 GFLOPS
I/D Cache 32KB 4way set-associative (each)
#2#3
( )
ILRAM/OLRAM 8KB/16KB (each CPU)
URAM 128KB (each CPU)
P k FCBGA 554 i 29SH4A Multicore SoC Chip Package FCBGA 554pin, 29mm x 29mm
ISSCC07 Paper No.5.3, Y. Yoshida, et al., “A 4320MIPS Four-Processor Core SMP/AMP with
12
pIndividually Managed Clock Frequency for Low Power Consumption”
Performance on a DevelopedSH Multi core (RP1: SH X3)SH Multi-core (RP1: SH-X3)
Using Compiler and APIAudio AAC* Encoder Image Susan Smoothing
up
5.0
4.02 95
3.825.0
4.03.43
d u
p
Speed
3.01.91
2.95
3.01.86
2.70
Speed
2.0
1.0
1.002.0
1.0
1.00
Number of Processors1 2 3 4
01 2 3 4
0Number of Processors
Page. 13
Number of Processors**) Mibench Embedded application benchmark by Michigan Univ.*) ISO Advanced Audio Coding:
Number of Processors
Data Localization11
2
1
32
PE0 PE1
12 1
3 4 56
23 45
6 7 8910 1112
3
4
2
6 7
8
14
18
7
6 7 8910 1112
1314 15 16
5
8
9
11
15
18
19
25
8 9 101718 19 2021 22
10
11
13 16
25
29
112324 25 26
dlg0dlg3dlg1 dlg2
17 20
21
22 26
30
12 1314
15
2728 29 3031 32
33Data Localization Group
dlg023 24
27 28
32
14MTG MTG after Division A schedule for two processors
15 33Data Localization Group31