aka outing - hpc-ai advisory council

28
F AST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS AKA. DFSSSP ROUTING Torsten Hoefler and Jens Domke All used images belong to the producers.

Upload: others

Post on 12-Nov-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

FAST AND DEADLOCK-FREE ROUTING FOR ARBITRARY INFINIBAND NETWORKS AKA. DFSSSP ROUTING

Torsten Hoefler and Jens Domke

All used images belong to the producers.

New challenging problems need more bandwidth Scientific computing

Big data analytics

Distributed graph problems

InfiniBand offers advanced routing (kind of) IB spec mandates deadlock-free deterministic

distributed routing

OpenSM is the standard routing software Optimal algorithms for fat-tree and tori networks

MinHop for others (can produce credit loops)

Commercial alternatives exist

MOTIVATION FOR BETTER ROUTING

T. Hoefler and J. Domke Slide 2 of 28

FORGET FULL BISECTION BANDWIDTH Expensive topologies do not guarantee high bandwidth

Deterministic oblivious routing cannot reach full bandwidth! See Valiant’s lower bound

Random routing is asymptotically optimal but looses locality

InfiniBand routing: Deterministic oblivious, destination-based, simple

Linear forwarding table (LFT) at each switch

LID mask control (LMC) enables multiple addresses per port

Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

T. Hoefler and J. Domke Slide 3 of 28

BUT MY VENDOR SAID “NON-BLOCKING”

Two communications 16, 414

Full bisection bandwidth network

No full bandwidth observed!

Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

T. Hoefler and J. Domke Slide 4 of 28

SO HOW BAD IS CONGESTION?

Lower Bound! (~ GigE Speed)

Reality?

CHiC Supercomputer: • slightly aged but reflects routing • 566 nodes, full bisection IB fat-tree • no endpoint congestion! • effective Bisection Bandwidth: 0.699

Microbenchmarks (yet to be seen in

practice)

Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

T. Hoefler and J. Domke Slide 5 of 28

BUT I HAVE A CLEVER SUBNET MANAGER! OpenSM (IB) routing algorithms: minhop (finds minimal paths, balances number of routes local

at each switch)

updn/dnup (uses Up*/Down* turn-control, limits choice but routes contain no credit loops)

ftree (fat-tree optimized routing, no credit loops)

dor/torus-2QoS (dimension order routing for k-ary n-cubes, DOR might generate credit loops)

lash (uses DOR and breaks credit loops with virtual lanes)

It’s clever if you have a fat-tree or a torus But beware if you add or remove one link!

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 6 of 28

WHY ARBITRARY TOPOLOGIES? Many networks grow over time or fulfill more

than one purpose

Fat-trees and butterflies are hard to grow

Tori networks may have undesirable properties

IB supports arbitrary topologies!

Hybrid networks exist:

T. Hoefler and J. Domke Slide 7 of 28

EFFECTIVE BISECTION BANDWIDTH A measure for global network performance Considers routing

Can be measured with a benchmark

More realistic then bisection bandwidth

Effective Bisection Bandwidth (eBB) benchmark Divide network into equal partitions A and B

combinations

Find one peer in B for each node in A pairings

Huge number of patterns Statistics converge fast (~1000 measurements)

Implemented in Netgauge/eBB (download and try!)

Hoefler et al.: Multistage Switches are not Crossbars: Effects of Static Routing in High-Performance Networks

T. Hoefler and J. Domke Slide 8 of 28

ACHIEVING HIGH BANDWIDTH Model network as G=(VP[VC, E)

Path r(u,v) is a path between u,v 2 VP

Routing R consists of P(P-1) paths

Edge load l(e) = number of paths on e 2 E

Edge forwarding index ¼(G,R)=maxe2E l(e)

¼(G,R) is an upper bound to congestion!

Goal is to find R that minimizes ¼(G,R)

Shown to be NP-hard in the general case

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 9 of 28

A SIMPLE HEURISTIC

Keep it simple, greedily minimize ¼(G,R)

SSSP routing starts an SSSP run at each node

Finds paths with minimal edge-load l(e)

Updates routing tables in reverse

Essentially SDSP

Updates l(e) between runs

Strives for global balancing

An example …

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 10 of 28

Source: penny-arcade.com

STEP 1/3 Step 1:

Source-node 0:

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 11 of 28

STEP 2/3 Step 2:

Source-node 1:

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 12 of 28

STEP 3/3 Step 3:

Source-node 2:

¼(G,R)=2

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 13 of 28

EVALUATION - ODIN

Simulation Benchmark

(Netgauge; pattern: eBB)

Simulation predicts 5% improvement

Benchmark shows 18% improvement!

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 14 of 28

EVALUATION - DEIMOS

Simulation Benchmark

(Netgauge; pattern: eBB)

Simulation predicts 23% improvement

Benchmark shows 40% improvement!

T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-Scale InfiniBand Networks

T. Hoefler and J. Domke Slide 15 of 28

IT WORKS, IS THAT ALL? JUST SSSP?

Shown to run well on real systems in practice Odin (128 nodes, 18% eBB speedup)

Deimos (726 nodes, 40% eBB speedup)

Lomonosov1 (~4.5k nodes, ~10-20% Graph500 speedup)

Unfortunately not!

SSSP Routing may create credit loops On certain topologies

To be proven if some topologies are loop-free

Problematic in production environments (and interesting in theory )

1Lomonosov experiments were executed by Anton Korzh and Alexander Naumov

T. Hoefler and J. Domke Slide 16 of 28

WHAT ARE CREDITS AND WHY DO THEY LOOP?

IB uses credit-based flow-control egress messages sent only if receive-buffer available

Very similar to deadlocks in wormhole-routed systems

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 17 of 28

DEAL WITH CREDIT LOOPS Prevent (UP*/Down*, turn-based routing) Limits routing options

Resolve (LASH, use VLs to break cycles) Consumes additional buffers

Ignore (MinHop, DOR) Potential resolution: packet timeouts

Discouraged by IB specification

Others: Bubble Routing etc. Not supported by current specification

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 18 of 28

USING VLS TO AVOID DEADLOCKS Pioneered with LASH, example:

Deadlock!

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 19 of 28

USING VLS TO AVOID DEADLOCKS Pioneered with LASH, example:

2 VLs resolve deadlock

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 20 of 28

DEADLOCK-FREE SSSP ROUTING Perform normal SSSP

Detect cycles

“Break” cycle by adding new VL, rinse, repeat

VLs are expensive, how many do we need?

NP-hard problem

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 21 of 28

IS IT PRACTICAL? DO WE HAVE ENOUGH VLS?

Usually yes!

But it depends on the topology and network size

IB supports up to 15 VLs

Current hardware has 8 VLs

0

2

4

6

8

10

12

14

16

ChiC

Deimos

Odin

JUROPA

Ranger

Tsubame

Needed #(virtual lanes)

LASH

DFSSSP

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 22 of 28

MPI BENCHMARKS

47% faster execution (compared to MinHop)

Same performance for SSSP and DFSSSP

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0 512 1024 1536 2048 2560 3072 3584 4096

Runtime in s

Elements in buffer (#floats)

MPI All-to-All on 128 nodes (1 core per node)

MinHop

LASHDFSSSP

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 23 of 28

EBB AND NAS PARALLEL BENCHMARKS Measured on Deimos at TU Dresden

1 MPI process per node (except for 1024)

0

50

100

150

200

250

300

350

400

128 256 512 1024

eBB in MiByte/s

Number of MPI-processes

MinHop

LASHSSSPDFSSSP

0

50

100

150

200

250

121 256 484 1024

Gflop/s (total)

Number of MPI-processes

MinHop

LASHSSSPDFSSSP

BT, class C – solver

Netgauge, eBB

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 24 of 28

IS IT PRACTICAL? WHAT ABOUT LARGE SCALE? Up to 64% higher eBB (depending on #nodes and topology)

Runtime can be an issue! (for systems w/ #nodes > 5.000)

0

0.2

0.4

0.6

0.8

1

CHiC

Deimos

JUROPA

Odin

Ranger

Tsubame

Eff. bisection bandwidth

MinHopUp*/Down*FatTree

LASHDORSSSP

DFSSSP

10-4

10-2

100

102

104

CHiC

Deimos

JUROPA

Odin

Ranger

Tsubame

Runtime in s

MinHopUp*/Down*FatTree

LASHDORSSSP

DFSSSP

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 25 of 28

HOW DO I GET/USE DFSSSP? OpenFabrics Alliance integrated DFSSSP into the

official OFED stack in May, 2012 http://www.openfabrics.org/downloads/

management/opensm-3.3.16.tar.gz

Start OpenSM with option “-R dfsssp” or set “routing_engine dfsssp” in configuration file

Configure QoS for switches/HCAs to equally use VLs

(Please don’t use DFSSSP in opensm-3.3.14/15)

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 26 of 28

HOW DO I GET/USE DFSSSP? QoS config example for OpenSM:

qos TRUE qos_max_vls 8 qos_high_limit 4 qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64 qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4 qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

Enable SL queries for path setup in Open MPI: Configure Open MPI with

“--with-openib --enable-openib-dynamic-sl”

Run your application with “--mca btl_openib_ib_path_record_service_level 1”

Domke, Hoefler, Nagel: Deadlock-Free Oblivious Routing for Arbitrary Topologies

T. Hoefler and J. Domke Slide 27 of 28

DFSSSP can offer: Same or better eff. bisection bandwidth

(compared to other routing algorithms)

Huge increase in application performance and speed-up of MPI-collectives (especially for irregular topologies)

Usability on arbitrary topologies and efficient usage of VLs

Free and easy to use (integration in OFED)

Enhancement for other technologies (similar to IB)

CONCLUSION

T. Hoefler and J. Domke Slide 28 of 28