from hypercubes to dragonflies · r log n) / l = aspect ratio where k = radix b = total bandwidth n...
TRANSCRIPT
![Page 1: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/1.jpg)
Jul 21, 2008 IAA: ‹#›
From Hypercubes to Dragonflies a short history of interconnect
William J. Dally Computer Science Department
Stanford University
IAA Workshop July 21, 2008
![Page 2: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/2.jpg)
Jul 21, 2008
Outline
• The low-radix era • High-radix routers and networks • To ExaScale and beyond • NoCs the final frontier
IAA: ‹#›
![Page 3: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/3.jpg)
Jul 21, 2008
Partial Timeline
Date Event Features
1983 Caltech Cosmic Cube Hypercube, programmed transfer
1985 Torus Routing Chip torus, routed, wormhole, virtual channels
1987 iPSC/2 routed hypercube
1990 Virtual-channel flow control
1991 J-Machine
1992 Paragon, T3D, CM5
1994 Vulcan
1995 T3E
1996 Reliable Router Link level retry
2000 NoCs
2001 SP2, Quadrics
2002 X1
2004 Global adaptive routing
2005 High-Radix Routers
2006 YARC/BW IAA: ‹#›
Low-Radix Era
![Page 4: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/4.jpg)
Jul 21, 2008
The Low-Radix Era
IAA: ‹#›
![Page 5: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/5.jpg)
Jul 21, 2008
The Cosmic Cube
IAA: ‹#›
• Caltech 1983 • Hypercube topology • No routers – programmed transfers for every hop • Store and forward: T = L/B x H
![Page 6: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/6.jpg)
Jul 21, 2008
Torus Routing Chip
• Caltech 1985, 3µm CMOS • Torus topology
– Topology driven by technology constraints (pins, bisection)
• Wormhole routing: T = L/B + H • Virtual channels to break deadlock IAA: ‹#›
![Page 7: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/7.jpg)
Jul 21, 2008
1985-2004 The Low-Radix Era
• Low-radix (4 ≤ k ≤ 8) routers • Torus, mesh, or Clos (fat-tree) topologies • Wormhole or virtual-cut-through flow control • Virtual channels for deadlock avoidance and
performance • Almost exclusively minimal routing • Delta, Paragon, T3D, SP-1, CM-5, T3E, SP-2, … • … but router bandwidth was increasing exponentially
IAA: ‹#›
![Page 8: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/8.jpg)
Some Routers MARS Router
1984 Torus Routing Chip
1985 Network Design Frame
1988
MDP 1991
Reliable Router 1994 MAP 1998 Imagine
2002
Robert Mullins
![Page 9: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/9.jpg)
Jul 21, 2008
High Radix Routers and Networks
IAA: ‹#›
![Page 10: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/10.jpg)
Jul 21, 2008
Bandwidth
IAA: ‹#›
Velio 3003 1296 ball BGA 280 3.2G pairs
440Gb/s in +440Gb/s out
![Page 11: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/11.jpg)
Jul 21, 2008 IAA: ‹#›
0.1
1
10
100
1000
10000
1985 1990 1995 2000 2005 2010
year
band
wid
th p
er ro
uter
nod
e (G
b/s)
Torus Routing ChipIntel iPSC/2J-MachineCM-5Intel Paragon XPCray T3DMIT AlewifeIBM VulcanCray T3ESGI Origin 2000AlphaServer GS320IBM SP Switch2Quadrics QsNetCray X1Velio 3003IBM HPSSGI Altix 3000Cray XT3YARC
BlackWidow
Router Bandwidth Scaling ~100x in 10 years
![Page 12: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/12.jpg)
Jul 21, 2008 Google: 12
High-Radix Router
Router
Router
![Page 13: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/13.jpg)
Jul 21, 2008 Google: 13
High-Radix Router
Router
Router
Low-radix (small number of fat ports) High-radix (large number of skinny ports)
Router Router
![Page 14: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/14.jpg)
Jul 21, 2008 Google: 14
Latency
Latency = H tr + L / b = 2trlogkN + 2kL / B
where k = radix B = total Bandwidth N = # of nodes L = message size
![Page 15: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/15.jpg)
Jul 21, 2008 Google: 15
Latency vs. Radix
0
50
100
150
200
250
300
0 50 100 150 200 250radix
late
ncy
(nse
c)
2003 technology 2010 technology
Optimal radix ~ 40
Optimal radix ~ 128
Serialization latency increases Header latency decreases
![Page 16: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/16.jpg)
Jul 21, 2008 Google: 16
Determining Optimal Radix
Latency = Header Latency + Serialization Latency
= H tr + L / b = 2trlogkN + 2kL / B
Optimal radix k log2 k = (B tr log N) / L = Aspect Ratio
where k = radix B = total Bandwidth N = # of nodes L = message size
![Page 17: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/17.jpg)
Jul 21, 2008
High-Radix Router
• Many router structures scale as P2 (or P2V2) – Allocators particularly difficult – Not feasible with P=64 and V~8
• Decompose router – Each sub-allocation is feasible
• Put the buffers where they do the most good
• YARC – Cray 2006 – 64 ports, each 18.75Gb/s (3x6.25) – Tiled hierarchical design
• 8x8 array of 8x8 subswitches • Buffering at subswitch inputs
IAA: ‹#›
![Page 18: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/18.jpg)
Jul 21, 2008 Google: 18
High-Radix Switch Architectures (II)
output 1 output 2
output k
input 1
input k
· · ·
input 2
· · ·
(a) Baseline design (b) Fully buffered crossbar
output 1
output 2
output k
input 1
input k
· · ·
input 2
· · ·
![Page 19: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/19.jpg)
Jul 21, 2008 Google: 19
(b) Fully buffered crossbar
output 1
output 2
output k
input 1
input k
· · ·
input 2
· · ·
High-Radix Switch Architectures (III)
output 1 output 2
output k
input 1
input k
· · ·
input 2
· · ·
(a) Baseline design
subswitch
(c) Hierarchical crossbar
![Page 20: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/20.jpg)
Jul 21, 2008
Global Adaptive Routing Enables new Topologies
• VAL gives optimal worst-case throughput • MIN gives optimal benign traffic performance • UGAL (Universal Globally Adaptive Load-balance)
– [Singh ’05] – Routes benign traffic minimally – Starts routing like VAL if load imbalance in channel queues – In the worst-case, degenerates into VAL, thus giving optimal
worst-case throughput
20
![Page 21: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/21.jpg)
Jul 21, 2008 21
Hnm 4. Hnm= non-min path (sid) length
UGAL
s
d Hm
1. Hm= shortest path (SP) length
qm
2. qm= congestion of the outgoing channel for SP
i
3. Pick i, a random intermediate node
qnm
5. qnm= congestion of the outgoing channel for sid
6. Choose SP if Hmqm≤ Hnmqnm; else route via i, minimally in each phase
![Page 22: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/22.jpg)
Jul 21, 2008 22
CQR: TOR throughput
Switches to non-minimal at 0.12
![Page 23: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/23.jpg)
Jul 21, 2008 23
CQR: TOR latency
![Page 24: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/24.jpg)
Jul 21, 2008 24
UGAL report card
Algo Θ benign Θ adv Θ avg
Throughput (frac of capacity)
VAL 0.5 0.5 0.5 MIN 1.0 0.02 0.02
UGAL 1.0 0.5 0.5
VAL 0.5 0.5 0.5 MIN 1.0 0.33 0.63
UGAL 1.0 0.5 0.7
VAL 0.5 0.5 0.5 MIN 1.0 0.2 0.52
UGAL 1.0 0.5 0.63
64 node topology
K64
8 x 8 torus
64 node CCC
![Page 25: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/25.jpg)
Jul 21, 2008 Google: 25
Transient Imbalance
0
2
4
6
8
10
12
14
16
Routers in the middle stage of the network
Maxim
um
buffer
siz
e
![Page 26: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/26.jpg)
Jul 21, 2008 Google: 26
With Adaptive Routing
0
2
4
6
8
10
12
14
16
Routers in the middle stage of the network
Maxim
um
buffer
siz
e
![Page 27: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/27.jpg)
Jul 21, 2008 Google: 27
High-Radix Topology
• Use high radix, k, to get low hop count – H = logk(N) – Hop count ~ cost
• Provide good performance on both benign and adversarial traffic patterns – Rules out butterfly networks - no path diversity
• H = logk(N) - optimal • Dismal throughput on worst-case traffic
– Clos networks work OK • H = 2logk(N) - with short circuit • But twice the hop count needed on benign traffic
– Cayley graphs have nice properties but are hard to route
![Page 28: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/28.jpg)
Jul 21, 2008 High-Radix Interconnection Networks
Clos Networks Delivering Predictable Performance Since 1953
![Page 29: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/29.jpg)
IAA: ‹#›
IEEE
Spe
ctru
m O
nlin
e, Ju
ly 1
6, 2
008
![Page 30: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/30.jpg)
Jul 21, 2008
Flattened Butterfly (ISCA’07)
![Page 31: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/31.jpg)
Jul 21, 2008
Dragonfly Topology
R2 Rn-1
R1 R0 Rn-2
Interconnection Network
intra-group interconnection network
R0 R1 Ra-1
G0 G1 Gg-1 Gg-1
Inter-group Interconnection Network
![Page 32: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/32.jpg)
Jul 21, 2008
Dragonfly Topology Example
P0 P1
R0
P2 P3
R1
P4 P5
R2
P6 P7
R3
P24 P25 P26 P27 P28 P29 P30 P31
P8 P9
R4
P10 P11
R5
P12 P13
R6
P14 P15
R7
R12
R13
R14
R15
G2
G4 G5 G6 G7 G8
G3
G0 G1
![Page 33: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/33.jpg)
Jul 21, 2008
Topology Cost Comparison
![Page 34: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/34.jpg)
Jul 21, 2008
A Summary of Where We Are
• High radix routers and networks – Best able to convert high pin bandwidth into value
• Router organization – Hierarchical switch and allocator, subswitch buffers
• Routing – Global adaptive routing enables aggressive topologies
• Topology – Flattened butterfly gives minimal diameter (hence cost) – Dragonfly minimizes number of long (expensive) links
IAA: ‹#›
![Page 35: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/35.jpg)
Jul 21, 2008
To ExaScale and Beyond
IAA: ‹#›
![Page 36: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/36.jpg)
1 EFlop/s Strawman
• 4 FPUs+RegFiles/Core (=6 GF @1.5GHz) • 1 Chip = 742 Cores (=4.5TF/s)
• 213MB of L1I&D; 93MB of L2 • 1 Node = 1 Proc Chip + 16 DRAMs (16GB) • 1 Group = 12 Nodes + 12 Routers (=54TF/s) • 1 Rack = 32 Groups (=1.7 PF/s)
• 384 nodes / rack • 3.6EB of Disk Storage included • 1 System = 583 Racks (=1 EF/s)
• 68 MW w’aggressive assumptions • 166 MILLION cores • 680 MILLION FPUs • 3.6PB = 0.0036 bytes/flops
Sizing done by “balancing” power budgets with achievable capabilities
![Page 37: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/37.jpg)
Jul 21, 2008
Some Exa Observations
• Its all about energy (J/bit) – more on this later • Two levels of network – on-chip (Noc) and off-chip • System network is a dragonfly
– 12 CMPs, 31 local, 21 global links per router – 12 Parallel networks – enables tuning and tapering
IAA: ‹#›
![Page 38: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/38.jpg)
Jul 21, 2008
Exa Energy Budget - Fixed
• Allocating 60% power to FPUs doesn’t leave much for communication
• Leads to a steep BW taper
IAA: ‹#›
![Page 39: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/39.jpg)
Jul 21, 2008
Exa Energy Budget Adaptive
• 1060 cores per chip (vs 742) 4 FPUs/core • Can sustain 1 access per cycle at L1 or L2
– But nothing else
• Can use all power at L3, DRAM, or globally • Adapt actual power budget to demand of application
– Throttle to stay in bounds IAA: ‹#›
![Page 40: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/40.jpg)
Jul 21, 2008 IAA: ‹#›
Its all about Joules/bit
10pJ/FLOP in 32nm 128FLOPs of energy to read a word from DRAM 32FLOPs of energy to send word within cabinet 256FLOPs to cross machine
![Page 41: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/41.jpg)
Jul 21, 2008
NoCs
IAA: ‹#›
![Page 42: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/42.jpg)
Jul 21, 2008
Exa NoC
• Much of the communication is on chip
• First 4-levels of storage hierarchy
• The bulk of all bandwidth • What does this network look
like?
IAA: ‹#›
![Page 43: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/43.jpg)
Jul 21, 2008 IAA: ‹#›
On Chip Interconnect
• Enabling circuit technology – Links, switches, buffers – 10x - 100x improvements in b/J – Drives topologies, routing, flow
control
• Network design – Efficient topology/routing – Flow control (trade BW for
buffers) – Low latency uArch
R4 R7R6R5
R8 R11R10R9
R12 R15R14R13
R0 R3R2R1
![Page 44: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/44.jpg)
Jul 21, 2008 High-Radix Interconnection Networks
Evaluation Comparison
0
0.2
0.4
0.6
0.8
1
1.2
mesh flatbfly
Latency Energy-Delay Product
0
0.2
0.4
0.6
0.8
1
1.2
mesh flatbfly
~35% ~65%
Flattened butterfly can be extended to on-chip networks to achieve lower latency and lower energy.
![Page 45: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/45.jpg)
Jul 21, 2008
Summary
• Low-radix era – Developed key network technologies
• Wormhole FC, virtual channels, etc… • Matched network design to packaging and signaling constraints
– 4 < k < 8 torus, mesh, and Clos networks
• High-radix era – Router bandwidth increasing 100x per decade – Increasing network radix better able to exploit increased router bandwidth – Requires partitioned router organization with subswitch buffers. – Global adaptive routing enables efficient topologies – Flattened butterflies and dragonflies – Data centers now driving these technologies
• NoCs – Bulk of bandwidth is on-chip – Flattened butterfly is ideal topology here too! – fewer hops – Many open questions in NoC design
IAA: ‹#›
![Page 46: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/46.jpg)
Jul 21, 2008
Challenges
• Power – at all levels – Circuits/devices – Topology/routing/flow control – On and off chip
• Network architecture – Indirect adaptive routing – Dealing with cable aggregation
(cables and WDM)
• On-chip networks – all levels – Circuits – Topology/routing/flow control – On/off chip interfaces
• Network interfaces – Can have zero overhead send
and receive (J-Machine, M-Machine)
• Programming abstractions – Abstract the communication
hierarchy (Sequoia) – Vertical, not horizontal is what
matters – Focus on essential issues, not
incidentals (like MPI occupancy)
• Device/Circuit Technology – Focus on pJ/bit
IAA: ‹#›
![Page 47: From Hypercubes to Dragonflies · r log N) / L = Aspect Ratio where k = radix B = total Bandwidth N = # of nodes L = message size . ... 1 P4 P5 R 2 P6 P7 R 3 P24 P25 P26 P27 P28 P29](https://reader031.vdocuments.mx/reader031/viewer/2022022803/5c87dcc209d3f2df188cc8cf/html5/thumbnails/47.jpg)
Jul 21, 2008 Google: 47
Some very good books