3. interconnection networks
DESCRIPTION
3. Interconnection Networks . Historical Perspective. Early machines were: Collection of microprocessors. Communication was performed using bi-directional queues between nearest neighbors. Messages were forwarded by processors on path. “Store and forward” networking - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/1.jpg)
3. Interconnection Networks
![Page 2: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/2.jpg)
Historical Perspective• Early machines were:
• Collection of microprocessors.• Communication was performed using bi-directional queues
between nearest neighbors.• Messages were forwarded by processors on path.
• “Store and forward” networking• There was a strong emphasis on topology in algorithms,
in order to minimize the number of hops = minimize time
![Page 3: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/3.jpg)
Network Analogy
• To have a large number of transfers occurring at once, you need a large number of distinct wires.
• Networks are like streets:• Link = street.• Switch = intersection.• Distances (hops) = number of blocks traveled.• Routing algorithm = travel plan.
• Properties:• Latency: how long to get between nodes in the
network.• Bandwidth: how much data can be moved per unit
time.• Bandwidth is limited by the number of wires and the rate at
which each wire can accept data.
![Page 4: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/4.jpg)
Design Characteristics of a Network• Topology (how things are connected):
• Crossbar, ring, 2-D and 3-D meshs or torus, hypercube, tree, butterfly, perfect shuffle ....
• Routing algorithm (path used):• Example in 2D torus: all east-west then all
north-south (avoids deadlock).• Switching strategy:
• Circuit switching: full path reserved for entire message, like the telephone.
• Packet switching: message broken into separately-routed packets, like the post office.
• Flow control (what if there is congestion):• Stall, store data temporarily in buffers, re-route data
to other nodes, tell source node to temporarily halt, discard, etc.
![Page 5: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/5.jpg)
Performance Properties of a Network: Latency• Diameter: the maximum (over all pairs of nodes) of the
shortest path between a given pair of nodes.• Latency: delay between send and receive times
• Latency tends to vary widely across architectures• Vendors often report hardware latencies (wire time)• Application programmers care about software
latencies (user program to user program)• Observations:
• Hardware/software latencies often differ by 1-2 orders of magnitude
• Maximum hardware latency varies with diameter, but the variation in software latency is usually negligible
• Latency is important for programs with many small messages
![Page 6: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/6.jpg)
Performance Properties of a Network: Bandwidth• The bandwidth of a link = w * 1/t
• w is the number of wires• t is the time per bit
• Bandwidth typically in Gigabytes (GB), i.e., 8* 220 bits• Effective bandwidth is usually lower than physical link
bandwidth due to packet overhead.Routing and control header
Data payload
Error code
Trailer
• Bandwidth is important for applications with mostly large messages
![Page 7: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/7.jpg)
Performance Properties of a Network: Bisection Bandwidth
• Bisection bandwidth: bandwidth across smallest cut that divides network into two equal halves
• Bandwidth across “narrowest” part of the network
bisection cut
not a bisectioncut
bisection bw= link bw bisection bw = sqrt(n) * link bw
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
![Page 8: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/8.jpg)
Network Topology• In the past, there was considerable research in network
topology and in mapping algorithms to topology.• Key cost to be minimized: number of “hops” between
nodes (e.g. “store and forward”)• Modern networks hide hop cost (i.e., “wormhole
routing”), so topology is no longer a major factor in algorithm performance.
• Example: On IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• Need some background in network topology• Algorithms may have a communication topology• Topology affects bisection bandwidth.
![Page 9: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/9.jpg)
![Page 10: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/10.jpg)
![Page 11: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/11.jpg)
Linear and Ring Topologies
• Linear array
• Diameter = n-1; average distance ~n/3.• Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
• Diameter = n/2; average distance ~ n/4.• Bisection bandwidth = 2.• Natural for algorithms that work with 1D arrays.
![Page 12: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/12.jpg)
Meshes and Tori
Two dimensional mesh • Diameter = 2 * (sqrt(n ) – 1)• Bisection bandwidth = sqrt(n)
• Generalizes to higher dimensions (Cray T3D used 3D Torus).• Natural for algorithms that work with 2D and/or 3D arrays.
Two dimensional torus• Diameter = sqrt(n )• Bisection bandwidth = 2* sqrt(n)
![Page 13: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/13.jpg)
Hypercubes• Number of nodes n = 2d for dimension d.
• Diameter = d. • Bisection bandwidth = n/2.
• 0d 1d 2d 3d 4d
• Popular in early machines (Intel iPSC, NCUBE).• Lots of clever algorithms.
• Greycode addressing:• Each node connected to
d others with 1 bit different. 001000
100
010 011
111
101
110
![Page 14: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/14.jpg)
Trees
• Diameter = log n.• Bisection bandwidth = 1.• Easy layout as planar graph.• Many tree algorithms (e.g., summation).• Fat trees avoid bisection bandwidth problem:
• More (or wider) links near top.• Example: Thinking Machines CM-5.
![Page 15: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/15.jpg)
Butterflies with n = (k+1)2^k nodes• Diameter = 2k.• Bisection bandwidth = 2^k.• Cost: lots of wires.• Used in BBN Butterfly.• Natural for FFT.
O 1O 1
O 1 O 1
butterfly switchmultistage butterfly network
![Page 16: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/16.jpg)
![Page 17: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/17.jpg)
Topologies in Real MachinesRed Storm (Opteron + Cray network, future)
3D Mesh
Blue Gene/L 3D Torus
SGI Altix Fat tree
Cray X1 4D Hypercube*
Myricom (Millennium) Arbitrary
Quadrics (in HP Alpha server clusters)
Fat tree
IBM SP Fat tree (approx)
SGI Origin Hypercube
Intel Paragon (old) 2D Mesh
BBN Butterfly (really old) Butterfly
olde
r n
ewer
Many of these are approximations:E.g., the X1 is really a “quad bristled hypercube” and some of the fat trees are not as fat as they should be at the top
![Page 18: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/18.jpg)
Performance Models
![Page 19: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/19.jpg)
![Page 20: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/20.jpg)
![Page 21: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/21.jpg)
![Page 22: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/22.jpg)
![Page 23: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/23.jpg)
Latency and Bandwidth Model
• Time to send message of length n is roughly
• Topology is assumed irrelevant.• Often called “ model” and written
• Usually >> >> time per flop.• One long message is cheaper than many short ones.
• Can do hundreds or thousands of flops for cost of one message.• Lesson: Need large computation-to-communication ratio to
be efficient.
Time = latency + n*cost_per_word = latency + n/bandwidth
Time = + n*
nn
![Page 24: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/24.jpg)
Alpha-Beta Parameters on Current Machines• These numbers were obtained empirically
machine
T3E/Shm 1.2 0.003T3E/MPI 6.7 0.003IBM/LAPI 9.4 0.003IBM/MPI 7.6 0.004Quadrics/Get 3.267 0.00498Quadrics/Shm 1.3 0.005Quadrics/MPI 7.3 0.005Myrinet/GM 7.7 0.005Myrinet/MPI 7.2 0.006Dolphin/MPI 7.767 0.00529Giganet/VIPL 3.0 0.010GigE/VIPL 4.6 0.008GigE/MPI 5.854 0.00872
is latency in usecsis BW in usecs per Byte
How well does the model Time = + n*predict actual performance?
![Page 25: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/25.jpg)
End to End Latency Over Time
nCube/2
nCube/2 CM5
CM5 CS2
CS2
SP1SP2ParagonT3D
T3DSPP
KSRSPPCenju3
T3E
T3E
SP-Power3
QuadricsMyrinet
Quadrics1
10
100
1000
1990 1992 1994 1996 1998 2000 2002Year (approximate)
usec
• Latency has not improved significantly, unlike Moore’s Law• T3E (shmem) was lowest point – in 1997
Data from Kathy Yelick, UCB and NERSC
![Page 26: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/26.jpg)
Send Overhead Over Time
• Overhead has not improved significantly; T3D was best• Lack of integration; lack of attention in software
Myrinet2K
Dolphin
T3E
Cenju4
CM5
CM5
Meiko
MeikoParagon
T3D
Dolphin
Myrinet
SP3
SCI
Compaq
NCube/2
T3E0
2
4
6
8
10
12
14
1990 1992 1994 1996 1998 2000 2002Year (approximate)
usec
Data from Kathy Yelick, UCB and NERSC
![Page 27: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/27.jpg)
Bandwidth Chart
0
50
100
150
200
250
300
350
400
2048 4096 8192 16384 32768 65536 131072
Message Size (Bytes)
Ban
dwid
th (M
B/s
ec)
T3E/MPIT3E/ShmemIBM/MPIIBM/LAPICompaq/PutCompaq/GetM2K/MPIM2K/GMDolphin/MPIGiganet/VIPLSysKonnect
Data from Mike Welcome, NERSC
![Page 28: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/28.jpg)
1
10
100
1000
10000
8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072
T3E/Shm
T3E/MPI
IBM/LAPI
IBM/MPI
Quadrics/Shm
Quadrics/MPI
Myrinet/GM
Myrinet/MPI
GigE/VIPL
GigE/MPI
Drop Page Fields Here
Sum of model
size
machine
Model Time Varying Message Size & Machines
![Page 29: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/29.jpg)
1
10
100
1000
10000
8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072
T3E/Shm
T3E/MPI
IBM/LAPI
IBM/MPI
Quadrics/Shm
Quadrics/MPI
Myrinet/GM
Myrinet/MPI
GigE/VIPL
GigE/MPI
Drop Page Fields Here
Sum of gap
size
machine
Measured Message Time
![Page 30: 3. Interconnection Networks](https://reader033.vdocuments.mx/reader033/viewer/2022061606/56816014550346895dcf151b/html5/thumbnails/30.jpg)
Results: EEL and Overhead
0
5
10
15
20
25
T3E/M
PI
T3E/Shm
em
T3E/E-R
eg
IBM/M
PI
IBM/LA
PI
Quadri
cs/M
PI
Quadri
cs/P
ut
Quadri
cs/G
et
M2K/M
PI
M2K/G
M
Dolphin
/MPI
Gigane
t/VIP
L
usec
Send Overhead (alone) Send & Rec Overhead Rec Overhead (alone) Added Latency
Data from Mike Welcome, NERSC