Throughput-Effective On-Chip Networks for Manycore Accelerators
Throughput-Effective On-Chip Networks for Manycore Accelerators
Ali Bakhoda, John Kim¹, and Tor M. Aamodt (¹KAIST, Korea)
Manycore Accelerators and NoC
Manycore accelerators: the prevalent example is high-end GPUs, with tens of thousands of threads running at the same time in a Bulk Synchronous Parallel programming style. 3 of the top 5 supercomputers use them.
Based on the Nov. 2010 Top500 list
Primary goal: higher application-level throughput.
The NoC in accelerators needs a different perspective from CPUs and is not well studied in this context.
The Need for Throughput-Effective NoCs
[Figure: average throughput (IPC, x-axis, 190 to 310) versus inverse chip area (1/mm², y-axis), with iso-throughput-effectiveness contours from 0.30 to 0.55 IPC/mm²; an ideal NoC lies toward less area and higher throughput.]
Throughput-effective design: improves application-level performance per unit chip area.
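To make the figure of merit concrete, here is a minimal sketch of the throughput-effectiveness computation, IPC per mm² of chip area; the design points and numbers below are illustrative assumptions, not measurements from the paper.

```python
# Throughput-effectiveness = application-level throughput / chip area.
# The IPC and area numbers below are illustrative assumptions only.

def throughput_effectiveness(avg_ipc: float, chip_area_mm2: float) -> float:
    """Application-level throughput per unit chip area, in IPC/mm^2."""
    return avg_ipc / chip_area_mm2

# Two hypothetical design points in the plot's coordinate system
# (x = average IPC, y = 1 / chip area):
baseline = throughput_effectiveness(avg_ipc=230.0, chip_area_mm2=520.0)
wide_noc = throughput_effectiveness(avg_ipc=290.0, chip_area_mm2=640.0)
print(f"baseline: {baseline:.3f} IPC/mm^2, wider channels: {wide_noc:.3f} IPC/mm^2")
```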
Contributions
Study the impact of the NoC on application-level performance:
Traditional improvements (router latency reduction) have minimal impact on application-level performance.
Increasing channel width gives high performance gains at a high area cost, so consider application-level throughput per unit NoC area.
Throughput is correlated with the injection rate of a few nodes: a many-to-few-to-many traffic pattern.
Propose a throughput-effective NoC design: a checkerboard network with a multi-port router structure.
Outline
Introduction
Baseline architecture
NoC properties in accelerators
Throughput-Effective NoC design
Experimental results
Conclusion
Accelerator Overview
[Figure: compute cores connected through the network-on-chip to MC+L2 nodes, each backed by GDDR; each compute core has a dispatch queue and a memory-miss waiting queue.]
Baseline Network
A mesh with MCs at the periphery of the chip, similar to Tilera's TILE64 or Intel's 80-core Teraflops chip: simple and scalable.
Dimension-order routing, virtual-channel flow control, 4-cycle routers.
[Figure: the baseline mesh, as in the accelerator overview.]
Finding a Balanced Design
[Figure: application-level throughput and throughput per area versus the bandwidth limit of an ideal interconnect, expressed as a fraction of off-chip DRAM bandwidth (0.2 to 1.6); the bisection bandwidth of the baseline mesh marks the balanced design point.]
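One way to read this slide: size the mesh channels so that NoC bisection bandwidth roughly matches what the off-chip DRAM can supply, since on-chip bandwidth beyond that point is wasted area. A minimal sketch of the balance check, where every parameter (mesh radix, channel width, clock, MC count and bandwidth) is an illustrative assumption:

```python
# Balance check: compare mesh bisection bandwidth against aggregate off-chip
# DRAM bandwidth. All parameters are illustrative assumptions.

def mesh_bisection_bw_gbs(k: int, channel_bytes: int, freq_ghz: float) -> float:
    """Bisection bandwidth of a k x k mesh: k links cross the middle cut
    in each direction, each moving channel_bytes per cycle."""
    return 2 * k * channel_bytes * freq_ghz

dram_bw_gbs = 8 * 16.0  # e.g. 8 memory controllers at 16 GB/s each
noc_bw_gbs = mesh_bisection_bw_gbs(k=6, channel_bytes=16, freq_ghz=1.0)
print(f"NoC bisection / DRAM bandwidth = {noc_bw_gbs / dram_bw_gbs:.2f}")
# A ratio near 1 is the "balanced" point; much above 1 spends area on
# bandwidth the DRAM cannot feed.
```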
Gap between Balanced Mesh and Ideal NoC
[Figure: on the same throughput-versus-inverse-area axes, the balanced mesh sits well below the ideal NoC across the 0.30 to 0.55 IPC/mm² contours.]
Outline
Introduction
Baseline architecture
NoC properties in accelerators
Throughput-Effective NoC design
Experimental results
Conclusion
NoC Properties in Manycore Accelerators
Router latency has minimal impact on application-level throughput: aggressive 1-cycle routers instead of 4-cycle routers yield only a 2.3% application-level speedup.
Channel bandwidth is very important: doubling it gives a 27% speedup, but at a quadratic increase in area.
[Figure: harmonic-mean speedup bars for 1-cycle routers versus 2x bandwidth, on a 0% to 20% scale.]
2x Channel Bandwidth
[Figure: the 2x-bandwidth design raises throughput over the balanced mesh but at added area, still short of the ideal NoC on the IPC/mm² contours.]
Many-to-Few-to-Many Traffic Pattern
[Figure: many compute cores C0..Cn send requests over the request network to a few memory controllers MC0..MCm; the MCs' injection bandwidth feeds the reply network back to the cores.]
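A toy generator for this pattern may help: many cores address their requests to a handful of MCs, and every reply retraces the pair. Node counts and names are illustrative assumptions, not the paper's configuration.

```python
# Many-to-few-to-many: n cores -> m MCs (requests), m MCs -> n cores (replies).
# Node counts are illustrative assumptions.
import random

N_CORES, N_MCS = 28, 8
cores = [f"C{i}" for i in range(N_CORES)]
mcs = [f"MC{j}" for j in range(N_MCS)]

def request(core: str) -> tuple[str, str]:
    """Many-to-few: each core picks one of the few MCs as destination."""
    return core, random.choice(mcs)

# Average request load seen by one MC node, relative to one core's injection:
print(f"each MC absorbs ~{N_CORES / N_MCS:.1f}x a core's injection rate")
src, mc = request(cores[0])
print("reply:", (mc, src))  # few-to-many: the reply retraces the pair
```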
Outline
Introduction
Baseline architecture
NoC properties in accelerators
Throughput-Effective NoC design
Experimental results
Conclusion
Throughput-Effective Network Design
Reduce area: checkerboard routing, channel slicing.
Increase performance: checkerboard placement, multi-port routers at MCs.
Checkerboard Routing: Half-Routers
Half-routers: no turns allowed, so connectivity is limited; this saves roughly 50% of the router crossbar area.
Full-routers: normal routers with complete connectivity.
Half-routers are placed at every other node, as sketched below.
[Figure: half-router versus full-router connectivity among the injection, ejection, north, south, east, and west ports.]
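A minimal sketch of the connectivity constraint; the port names, and the assumption that injection and ejection remain fully connected, are illustrative.

```python
# Half-router connectivity: straight-through and local traffic only,
# no X<->Y dimension turns. Port names are illustrative.
STRAIGHT = {("west", "east"), ("east", "west"),
            ("north", "south"), ("south", "north")}

def half_router_allows(in_port: str, out_port: str) -> bool:
    """A half-router forbids turns between dimensions; injection and
    ejection remain available, which is why its crossbar is ~half size."""
    if in_port == "inject" or out_port == "eject":
        return True
    return (in_port, out_port) in STRAIGHT

assert half_router_allows("west", "east")       # pass straight through
assert not half_router_allows("west", "north")  # a turn: not allowed
```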
Routing Restriction (1)
Routing from a full-router to a half-router that is an odd number of columns away and not in the same row.
Solution: use YX routing instead of XY routing in this case, so the single turn lands on a full-router.
[Figure: example YX path through full- and half-routers.]
Routing Restriction (2)
Routing from a half-router to a half-router that is an even number of columns away and not in the same row.
Solution: two turns are needed: (1) route to an intermediate full-router using YX, then (2) to the destination using XY. This requires an extra VC to avoid deadlock.
[Figure: example two-turn path through an intermediate full-router.]
Routing Restriction (3)
Routing between full-routers that are an odd number of columns apart. We avoid this case altogether by using a different MC placement (next two slides). A sketch of the complete rule set follows.
[Figure: the problematic full-router-to-full-router case.]
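Putting the three restrictions together, here is a sketch of the route-mode choice on a 2D mesh. The parity convention (half-routers where x+y is odd) is an assumption for illustration; the paper only requires an alternating checkerboard layout.

```python
# Checkerboard route selection: pick XY, YX, or a two-turn route so that
# no turn lands on a half-router. Assumes half-routers where (x+y) is odd.

def is_half(x: int, y: int) -> bool:
    return (x + y) % 2 == 1

def route_mode(src: tuple[int, int], dst: tuple[int, int]) -> str:
    (sx, sy), (dx, dy) = src, dst
    if sx == dx or sy == dy:
        return "XY"              # same row or column: no turn at all
    if not is_half(dx, sy):
        return "XY"              # the XY turn node (dx, sy) is a full router
    if not is_half(sx, dy):
        return "YX"              # restriction (1): fall back to YX
    # Restriction (2): both single-turn nodes are half-routers, so route YX
    # to an intermediate full-router, then XY to the destination; the second
    # leg uses an extra virtual channel to stay deadlock-free. The full-to-
    # full odd-column case (restriction 3) would also land here; the paper
    # avoids it via checkerboard MC placement instead.
    return "YX-then-XY"

print(route_mode((0, 0), (1, 2)))  # full -> half, odd columns apart: YX
print(route_mode((0, 1), (2, 3)))  # half -> half, even columns apart: YX-then-XY
```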
Throughput-Effective Network Design
Reduce area: checkerboard routing, channel slicing.
Increase performance: checkerboard placement, multi-port routers at MCs.
Placement of MCs
Exploit the many-to-few pattern: place the MCs at half-router nodes.
Half-routers can communicate with all nodes with no penalty in the common BSP case, where compute cores communicate with MCs, not with each other. A placement sketch follows.
[CMP-MSI'08] "Extending the Scalability of Single Chip Stream Processors with On-chip Caches", Bakhoda et al. [ISCA'09] "Achieving Predictable Performance Through Better Memory Controller Placement in Many-Core CMPs", Abts et al.
[Figure: checkerboard mesh distinguishing half-routers, compute-core routers, and memory-controller routers; the MCs sit at half-router nodes.]
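A small sketch of the placement constraint; the mesh size, MC count, and the particular half-nodes chosen are illustrative assumptions, as the paper evaluates concrete placements of its own.

```python
# Constrain MC placement to half-router nodes of a k x k checkerboard mesh.
# Mesh size, MC count, and the specific nodes picked are illustrative.
K, N_MCS = 6, 8

half_nodes = [(x, y) for x in range(K) for y in range(K) if (x + y) % 2 == 1]
mc_nodes = half_nodes[:N_MCS]  # any choice satisfying the constraint
core_nodes = [(x, y) for x in range(K) for y in range(K)
              if (x, y) not in mc_nodes]

assert all((x + y) % 2 == 1 for x, y in mc_nodes)  # every MC is a half-router
print(f"{len(mc_nodes)} MCs on half-routers, {len(core_nodes)} compute nodes")
```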
Throughput-Effective Network Design
Reduce area: checkerboard routing, channel slicing.
Increase performance: checkerboard placement, multi-port routers at MCs.
Multi-port routers at MCs
Reduce the bottleneck at the few nodes by increasing their terminal bandwidth: give the MC routers more injection ports.
Minimal area overhead (~1% of total NoC area), with speedups of up to 25%.
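The effect is easy to see with a toy drain loop; the port counts, backlog, and cycle budget below are illustrative assumptions.

```python
# Why extra injection ports help the few MC nodes: with p ports a reply
# router can inject up to p flits per cycle instead of one. Illustrative.
from collections import deque

def flits_injected(backlog: int, ports: int, cycles: int) -> int:
    """Drain a reply backlog through `ports` injection ports for `cycles`."""
    queue = deque(range(backlog))
    injected = 0
    for _ in range(cycles):
        for _ in range(min(ports, len(queue))):
            queue.popleft()
            injected += 1
    return injected

print(flits_injected(backlog=100, ports=1, cycles=50))  # 50: port-limited
print(flits_injected(backlog=100, ports=2, cycles=50))  # 100: bottleneck eased
```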
Throughput-Effective Network Design
Reduce area: checkerboard routing, channel slicing.
Increase performance: checkerboard placement, multi-port routers at MCs.
Outline
Introduction
Baseline architecture
NoC properties in accelerators
Throughput-Effective NoC design
Experimental results
Conclusion
Methodology
Compute simulation: GPGPU-Sim (2.2.1b).
NoC simulation: BookSim 2, integrated into GPGPU-Sim as the network simulator.
Area estimates: Orion 2.0.
Benchmarks: 24 CUDA applications, including the Rodinia benchmarks.
Results
Combining checkerboard routing and placement, channel slicing, and multi-port routers at the MCs yields an overall harmonic-mean speedup of 17% across 24 benchmarks over the balanced baseline, with a total NoC area reduction of 43%.
[Figure: per-benchmark speedups from -20% to 80% across the 24 applications, falling into low-speedup/low-traffic, low-speedup/high-traffic, and high-speedup/high-traffic groups.]
Throughput-Effective NoC
[Figure: the throughput-effective design ("Thr. Eff.") moves toward the ideal NoC, with both less area and higher throughput than the balanced mesh and the 2x-bandwidth design.]
Summary
Throughput-effective design: consider the system-level performance impact together with the area impact of the NoC.
Observations: NoC bandwidth is more important than latency in accelerators, and traffic follows a many-to-few-to-many pattern.
A throughput-effective NoC for accelerators: checkerboard routing and placement, multi-port MC routers, and channel slicing.
Thank you
Backups…
Channel Slicing – Double networks
Divide the single network into two physical networks, each with half the bisection bandwidth of the original, keeping overall bisection bandwidth constant.
Saves area, due to the quadratic dependency of crossbar area on channel bandwidth (see the sketch below).
Increases serialization latency, but compute accelerators are not latency sensitive.
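The area saving follows directly from the quadratic scaling; a back-of-the-envelope sketch, with the channel width and the area constant as illustrative assumptions:

```python
# Channel slicing arithmetic: crossbar area grows roughly quadratically with
# channel width, so two half-width networks need about half the crossbar
# area of one full-width network. Width and constant are illustrative.

def crossbar_area(channel_bits: int, k: float = 1.0) -> float:
    return k * channel_bits ** 2

single = crossbar_area(256)
sliced = 2 * crossbar_area(256 // 2)
print(f"sliced / single crossbar area = {sliced / single:.2f}")  # 0.50
```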
Results
Memory controller placement alone: a harmonic-mean speedup of 13% over the balanced baseline design.
[Figure: per-benchmark speedups from -20% to 80% for checkerboard MC placement across the 24 applications.]
Results: Checkerboard Routing
Less than 1% performance loss compared to DOR with the same resources, while reducing total router area by 14.2%.
[Figure: per-benchmark performance relative to DOR, ranging from 70% to 120%, across the 24 applications.]
Results: Channel Slicing
Average change in performance is under 1%, with a NoC area reduction of 37%.
[Figure: per-benchmark speedups from -7% to 14% across the 24 applications.]
Top 5 systems
Top500 list, November 2010:
1. Tianhe-1A: NUDT TH MPP, Xeon X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C
2. Jaguar: Cray XT5-HE, Opteron 6-core 2.6 GHz
3. Nebulae: Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU
4. TSUBAME 2.0: HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows
5. Hopper: Cray XE6, 12-core 2.1 GHz
Alternative MC placement example
Many-to-Few-to-Many Traffic Pattern
[Figure: the request network carries traffic from the cores' output bandwidth (C0..Cn) into the MCs' input bandwidth (MC0..MCm); the reply network carries traffic from the MCs' output bandwidth back into the cores' input bandwidth.]