Design of Scalable Network Considering Diameter and Cable Delay
Kentaro Sano
Tohoku University, JAPAN
Design of Scalable Network, K Sano
Agenda
Introduction
Assumption
Preliminary evaluation & candidate networks
Cable length and delay
Simulator & emulator
Summary
Introduction
Feasibility study: 2012-2014
3 teams working on next-gen supercomputers
Tohoku-NEC-JAMSTEC team
Working group for interconnection network subsystem
[Figure: W.G. for interconnection subsystem (Osaka University, Tohoku University)]
Background and Objective
More nodes with higher performance
Requires a high-performance, scalable network
Application demands
Global/collective communication
Local communication (e.g., p2p with 3D decomposition)
Usability, performance robustness
Scalability
Goal: find a NW for next-gen supercomputers
Explore the design space under application demands and technology constraints
Small-diameter NW using high-radix SWs that is also good at local p2p communication
Performance, cost, power, usability, reliability
Assumption for Design Space Exploration
System scale
~65536 SMP nodes
Technology
~64x64 full crossbar switch
10+ GB/s per link
[Figure: 64 x 64 switch: full crossbar with input queues 1 to 64 and output buffers (virtual cut-through with virtual channels)]
[Figure: System overview: nodes 1 to 65536 connected by a network of n planes (for SMP), each plane a fat tree or hybrid NW]
[Figure: IB technology roadmap]
Preliminary Evaluation
Typical topologies
Full fat tree
3D / 5D torus
Dragonfly
[Figure: Full fat tree, n-D torus, and Dragonfly topologies (N: node, SW: switch)]
Comparison of Topologies
Too large diameter for low-D torus
Too many links for high-D torus / dragonfly
Fat tree looks good, but long cables?
Topology                 | Full fat tree          | 3D torus     | 5D torus           | Dragonfly
Nodes                    | 65,536                 | 65,536       | 65,536             | 65,536
Organization             | 3 stages, 64 x 32 x 32 | 64 x 32 x 32 | 16 x 8 x 8 x 8 x 8 | all-to-all (1D 16, 2D 16 x 16)
Node injection BW [GB/s] | 10                     | 10           | 10                 | 10
Bisection BW [TB/s]      | 320                    | 20           | 80                 | 160
Min to max hops          | 2 ~ 6                  | 1 ~ 63       | 1 ~ 23             | 2 ~ 5
Min to max delay [ns]    | 100 ~ 500              | 100 ~ 6300   | 100 ~ 2300         | 100 ~ 400
Links                    | 196,608                | 196,608      | 1,310,720          | 468,736
Switches                 | 5120                   | within nodes | within nodes       | 4096
Note: no cable delay considered in the delay figures.
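The bisection-bandwidth figures for the fat tree and the two tori can be reproduced with a short calculation. The sketch below is not from the slides. It assumes the usual textbook formulas (a torus is cut across its longest dimension, with wraparound links doubling the cut; a full fat tree lets half of the nodes inject at full rate), 10 GB/s per link, and 1 TB = 1024 GB, a scaling chosen only because it matches the table. Function names are illustrative.

```python
from math import prod

LINK_BW_GBS = 10   # per-link bandwidth from the assumption slide
GB_PER_TB = 1024   # assumption: binary scaling, chosen because it matches the table


def torus_bisection_tbs(dims, link_bw=LINK_BW_GBS):
    """Bisection bandwidth [TB/s] of a torus with even-sized dimensions."""
    nodes = prod(dims)
    # Cutting across the longest dimension severs 2 links (direct + wraparound)
    # for every node position in the remaining dimensions.
    cut_links = 2 * nodes // max(dims)
    return cut_links * link_bw / GB_PER_TB


def full_fat_tree_bisection_tbs(nodes, injection_bw=LINK_BW_GBS):
    """Bisection bandwidth [TB/s] of a full (non-blocking) fat tree."""
    return (nodes // 2) * injection_bw / GB_PER_TB


print(full_fat_tree_bisection_tbs(65536))      # 320.0
print(torus_bisection_tbs((64, 32, 32)))       # 20.0
print(torus_bisection_tbs((16, 8, 8, 8, 8)))   # 80.0
```

The Dragonfly figure depends on how the global links are arranged, so it is left out of this sketch.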
Full Fat Tree
Small diameter, but big latency via spine SWs
Max # of hops is limited especially with high-radix SWs.
Cable length grows with # of nodes.
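The switch and link counts of this fat tree follow from the island structure shown on the slide. The sketch below is an illustration, not the author's sizing tool. It assumes that every 64-port leaf and second-stage switch splits its ports 32 down and 32 up, and that spine switches use all 64 ports downward. Constant and function names are hypothetical.

```python
RADIX = 64                # ports per switch (64 x 64 crossbar)
HALF = RADIX // 2         # assumption: 32 down-links and 32 up-links per non-spine SW
LEAVES_PER_ISLAND = 32    # stage-1 switches per island (from the slide)
MIDS_PER_ISLAND = 32      # stage-2 switches per island (from the slide)
ISLANDS = 64


def fat_tree_size():
    """Count nodes, switches, and links of the 3-stage fat tree."""
    nodes = ISLANDS * LEAVES_PER_ISLAND * HALF
    leaf_to_mid_links = ISLANDS * LEAVES_PER_ISLAND * HALF
    mid_to_spine_links = ISLANDS * MIDS_PER_ISLAND * HALF
    # Assumption: each spine switch terminates RADIX up-links from the islands.
    spines = mid_to_spine_links // RADIX
    switches = ISLANDS * (LEAVES_PER_ISLAND + MIDS_PER_ISLAND) + spines
    links = nodes + leaf_to_mid_links + mid_to_spine_links
    return {"nodes": nodes, "spines": spines, "switches": switches, "links": links}


print(fat_tree_size())
# {'nodes': 65536, 'spines': 1024, 'switches': 5120, 'links': 196608}
```

Under these assumptions the totals agree with the 5120 switches and 196,608 links listed for the full fat tree in the topology comparison.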
[Figure: Full fat tree. 32 nodes per leaf SW; 32 + 32 SWs per island; 1024 nodes per island; 65536 nodes in 64 islands; 64 links, 10 GB/s per link; islands joined by spine SWs; max 6 hops]
Another Candidate: FTT Hybrid
Hierarchical network
Local fat tree (group)
256 nodes
2-stage fat tree
Only short cables in a small fat tree
Global 2D torus
16x16 of 256-node groups
Short cables to connect adjacent groups
512 links between groups
Expected advantages
Shorter cables
Expandable & flexible
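The scale and hop counts of the FTT hybrid can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: the torus diameter uses the standard half-ring formula, and a path is assumed to spend two hops inside the local fat tree at each end (node to leaf SW, leaf SW to upper SW). Names are hypothetical.

```python
from math import prod

GROUP_NODES = 256        # nodes per local 2-stage fat tree (from the slide)
TORUS_DIMS = (16, 16)    # global 2D torus of groups (from the slide)
LOCAL_HOPS_PER_END = 2   # assumption: node -> leaf SW -> upper SW at each end


def torus_diameter(dims):
    """Worst-case hop count in a torus: half the ring length in each dimension."""
    return sum(k // 2 for k in dims)


def ftt_summary(group_nodes=GROUP_NODES, dims=TORUS_DIMS):
    """Total size and worst-case hop counts of the fat tree + torus hybrid."""
    global_hops = torus_diameter(dims)
    return {
        "groups": prod(dims),
        "nodes": group_nodes * prod(dims),
        "max_torus_hops": global_hops,
        "max_hops": global_hops + 2 * LOCAL_HOPS_PER_END,
    }


print(ftt_summary())
# {'groups': 256, 'nodes': 65536, 'max_torus_hops': 16, 'max_hops': 20}
```

With these assumptions the result matches the 65536-node target and the 16-hop torus maximum and 20-hop overall maximum quoted later in the deck.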
[Figure: FTT (Fat Tree & Torus) hybrid. Local fat tree: group G (labels: 128, 128, 128; 256 nodes). Global NW: 2D torus of 16 x 16 groups.]
Comparison Summary
Detailed & quantitative evaluation
Full fat tree and FTT hybrid
Consider more details about implementation & apps
Topology     | Features                             | Diameter | # of Links | Note
Fat tree     | General-purpose, high usability      | ◎        | ○          | High cable delay?
Low-D torus  | Good cost performance, extendability | ×        | ○          | -
High-D torus | Good cost performance, extendability | ○        | ×          | -
Dragonfly    | Pseudo high-radix NW                 | ◎        | ×          | -
FTT-hybrid   | Combination of fat tree and torus    | ○        | ○          | Low cable delay?
Cable Length and Delay
Preliminary estimation based on expected implementation
Boards (node, switch)
Cabinets (node, switch)
Floor layout
Cabling
[Figure: FTT-hybrid layout example (cabinets C0 to C3)]
Preliminary Result
[Figure: Fat tree (max 6 hops). Nodes A, B, D, E under stage 1, stage 2, and stage 3 (spine switch). Cables: node to stage 1, 2 m, 10 ns; stage 1 to stage 2, 20 m, 100 ns; stage 2 to stage 3, 80 m, 400 ns. Percentage labels: 0.05 %, 1.5 %, 98.4 %.]
[Figure: FTT-hybrid (max 20 hops). Nodes A, B, D, E under two SW stages plus the 2D torus (1 ~ 16 hops). Cables: node to SW, 2 m, 10 ns; SW to SW, 10 m, 50 ns; torus link, 15 m, 75 ns. Percentage labels: 0.05 %, 0.33 %, 99.6 %.]
No big difference in Max cable delay
Fat tree = 1020ns + (5 SW-delay)
Hybrid = 1395ns + (19 SW-delay)
Hybrid can have shorter delay for local p2p communication.
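The fat-tree figure can be rebuilt from the cable lengths. The sketch below is an illustration, not the author's estimation method. It assumes a propagation delay of 5 ns per metre, which is what the length and delay pairs in the figure imply (2 m and 10 ns, 20 m and 100 ns, 80 m and 400 ns), and a worst-case path that climbs to the spine and back down. Names are hypothetical.

```python
NS_PER_M = 5  # assumption inferred from the figure: 2 m -> 10 ns, 80 m -> 400 ns


def cable_delay_ns(lengths_m, ns_per_m=NS_PER_M):
    """Total cable propagation delay [ns] along a path of cable lengths [m]."""
    return sum(ns_per_m * length for length in lengths_m)


# Worst-case fat-tree path: node -> stage 1 -> stage 2 -> spine, then back down.
up = [2, 20, 80]
fat_tree_path = up + up[::-1]

hops = len(fat_tree_path)   # 6 cables traversed
switches = hops - 1         # 5 switches between the two end nodes

print(cable_delay_ns(fat_tree_path), hops, switches)  # 1020 6 5
```

Other paths can be evaluated by passing a different list of cable lengths to `cable_delay_ns`.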
Example of 3D Mesh Communication
3D decomposition and adjacent communication
[Figure: Data exchange among 3D subgrids (axes x, y, z)]
[Figure: Mapping onto the FTT hybrid. Global NW, 2D torus of 16 x 16 groups: x, y. Local group G: z (x & y can be assigned).]
Latency (4 hops)
= 195ns + (4 SW-delay) : x, y
= 120ns + (3 SW-delay) : z
Much shorter than Fat tree
= 1020ns + (5 SW-delay) : x, y, z
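To compare these expressions as single numbers, a per-switch delay has to be chosen. The sketch below plugs in 100 ns per switch purely as an illustrative assumption; the slides leave the SW delay symbolic, so the totals are only as good as that placeholder. Names are hypothetical.

```python
SW_DELAY_NS = 100  # assumption for illustration only; the slides keep this symbolic

# (cable delay [ns], number of switches traversed), taken from the slide
PATHS = {
    "FTT-hybrid x, y": (195, 4),
    "FTT-hybrid z": (120, 3),
    "Fat tree x, y, z": (1020, 5),
}


def total_latency_ns(cable_ns, switches, sw_delay_ns=SW_DELAY_NS):
    """End-to-end latency [ns] = cable delay + switches * per-switch delay."""
    return cable_ns + switches * sw_delay_ns


for name, (cable_ns, switches) in PATHS.items():
    print(f"{name}: {total_latency_ns(cable_ns, switches)} ns")
# FTT-hybrid x, y: 595 ns
# FTT-hybrid z: 420 ns
# Fat tree x, y, z: 1520 ns
```

The larger the assumed switch delay, the smaller the relative advantage of the hybrid, since the cable-delay difference stays fixed while the switch counts differ by only one or two.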
Quantitative Evaluation (On-going)
Software simulator (OPNET-based)
Purpose
Get rough results quickly
Validate collective comm.
Rough SW model
Simple arbitration
No back pressure
Limited NW size
~8192 nodes
Hardware emulator
FPGA-based emulator
Obtain detailed results
Cycle accurate model
Real arbitration, flit-level transmission, back pressure
Large NW : ~65536 nodes
[Figure: Switch structure and delay model. Input ports 0 to 63 and output ports 0 to 63, with routing and switching between them. Labels: switch delay, routing delay, switching delay, Tx & Rx delay (Rx delay given by the sending SW), transferring delay, buffering delay.]
Hardware Emulator Overview
FPGA cluster
4 x host PCs
4 x FPGAs / PC
4 x 10G SFP+ ports / FPGA
Implementation
SW for nodes (on Linux)
HW for switches (on FPGA)
[Figure: Node of FPGA cluster: host PC with 4 x FPGA boards (DE5-NET, Stratix V); other nodes not installed yet]
FPGA board (DE5-NET)
FPGA: ALTERA Stratix V 5SGXEA7N2F45C2
QDR II+ SRAM A to D: 18 Mbits each (20-bit addressing for 18-bit data), x18 @ 500 MHz, 1 GB/s for read/write
DDR3 DRAM A, B: PC3-12800 (DDR3-1600), 2 GB as default (up to 8 GB), x64 @ 800 MHz (DDR) up to 1066 MHz, 12.8 GB/s each
10G SFP+ A to D (Tx, Rx): 10 Gbps+ each, used for 10G Ethernet
PCIe 3.0 x8: 8 GB/s (Tx, Rx)
Hardware Emulator Overview
[Photo: 64-port 10GbE switch; node of FPGA cluster with 4 x FPGA boards and SFP+ 10GbE ports; other nodes not installed yet]
Summary
Design space exploration for small-diameter NWs with high-radix switches
Technology constraint
Application demands
global and local-p2p communication
Two candidates after topology comparison
Full fat tree & FTT-hybrid
Preliminary evaluation for cable length & delay
Future (on-going) work
Quantitative evaluation with simulation & emulation
Application performance estimation