application-specific topology-aware mapping for three dimensional topologies
Post on 05-Jan-2016
Application-specific Topology-aware Mapping for Three Dimensional Topologies
Abhinav Bhatelé, Laxmikant V. Kalé
Outline
• Motivation
• The Mapping Problem
• Static Mapping: 3D Stencil
• Load Balancing: NAMD
• Future Work
The network latency for wormhole routing is
(Lf/B)*D + L/B
where Lf = length of each flit, B = bandwidth, D = number of hops, L = length of message
Lionel M. Ni and Philip K. McKinley, “A Survey of Wormhole Routing Techniques in Direct Networks”, Computer, Volume 26, Issue 2, pages 62-76, 1993
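A minimal numeric sketch of the latency formula above. The flit size, bandwidth, hop count, and message sizes are hypothetical values chosen only for illustration; they are not taken from the slides or from any specific machine.

```python
def wormhole_latency(flit_len, bandwidth, hops, msg_len):
    """Latency = (Lf / B) * D + L / B, as in the formula above."""
    return (flit_len / bandwidth) * hops + msg_len / bandwidth


def distance_share(flit_len, bandwidth, hops, msg_len):
    """Fraction of the total latency contributed by the hop-dependent term."""
    hop_term = (flit_len / bandwidth) * hops
    return hop_term / wormhole_latency(flit_len, bandwidth, hops, msg_len)


# Hypothetical parameters (assumptions, not measurements).
FLIT_BYTES = 32
BANDWIDTH = 100e6  # bytes per second
HOPS = 10

for msg_bytes in (10, 1000, 100000):
    share = distance_share(FLIT_BYTES, BANDWIDTH, HOPS, msg_bytes)
    print(f"{msg_bytes:>6} bytes: hop term is {share:.1%} of latency")
```

Under this formula alone, with no contention, the hop-dependent term matters mainly for small messages; for large messages the L/B term dominates.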
Message Latencies
[Figure: two plots of time (ms) versus number of processors (512 to 4096). The first shows NN series labeled 10, 100, 1000, 10000, and 100000; the second shows NN and RND series for the same labels.]
NN = Near Neighbor, RND = Random
Hardware Latencies
• Blue Gene/L
  – Near neighbor: < 1 µs
  – Worst case: 7 µs
• Blue Gene/P
  – Near neighbor: < 1 µs
  – Worst case: 5 µs
• Corresponding differences for MPI messages
Topology-aware mapping
• Problem: Given an object communication graph and a processor graph, find an optimal mapping that
  – Minimizes communication
  – Ensures load balance
• Metric for communication traffic
  – Hop-bytes = number of links (hops) traversed × message size
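The hop-bytes metric can be sketched as follows. This assumes a 3D torus with shortest-path routing; the object names, coordinates, and message sizes are hypothetical.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus (wraparound links)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(messages, placement, dims):
    """Sum of hops * bytes over all (src_object, dst_object, nbytes) messages."""
    return sum(
        torus_hops(placement[src], placement[dst], dims) * nbytes
        for src, dst, nbytes in messages
    )


dims = (4, 4, 4)  # hypothetical 4 x 4 x 4 torus
messages = [("A", "B", 1000), ("A", "C", 500)]

# Two candidate mappings of objects to processor coordinates.
near = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (3, 3, 3)}  # C is 3 hops away via wraparound
far = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (2, 2, 2)}   # C is 6 hops away

print(hop_bytes(messages, near, dims))  # 1*1000 + 3*500 = 2500
print(hop_bytes(messages, far, dims))   # 1*1000 + 6*500 = 4000
```

A lower hop-bytes total means the same messages occupy fewer links, which is what a topology-aware mapping tries to achieve.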
Machine Topology
• Information required at runtime
  – No. of processors in the allocated partition
  – No. of processors along each dimension
  – Physical coordinates of each processor
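One way to hold the three pieces of runtime information listed above is a small partition object. This is a sketch only: the `Partition` class and the assumption that ranks are numbered with X varying fastest are hypothetical. Real machines expose coordinates through system-specific interfaces that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical description of an allocated 3D partition."""
    dims: tuple  # processors along X, Y, Z

    @property
    def num_procs(self):
        nx, ny, nz = self.dims
        return nx * ny * nz

    def rank_to_coords(self, rank):
        """Assumes ranks are laid out with X varying fastest, then Y, then Z."""
        nx, ny, _ = self.dims
        return (rank % nx, (rank // nx) % ny, rank // (nx * ny))

    def coords_to_rank(self, x, y, z):
        """Inverse of rank_to_coords under the same layout assumption."""
        nx, ny, _ = self.dims
        return x + nx * (y + ny * z)


part = Partition((8, 4, 2))
print(part.num_procs)           # 64
print(part.rank_to_coords(37))  # (5, 0, 1)
```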
Communication Graph
• Static
  – 3D Stencil: regular communication graph
• Dynamic
  – Molecular dynamics application
  – Changes as atoms migrate from one processor to another
Static Graph - 3D Stencil
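A rough sketch of why mapping matters for a regular 3D stencil. It assumes one stencil element per processor, a periodic stencil with six neighbors per element, and a 3D torus of the same shape with shortest-path routing. It is an illustration under those assumptions, not the benchmark used in the talk.

```python
import itertools
import random


def torus_hops(a, b, dims):
    """Shortest hop count between two torus coordinates."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def stencil_edges(dims):
    """Yield (cell, neighbor) pairs: six neighbors per cell, periodic boundaries."""
    for cell in itertools.product(*(range(d) for d in dims)):
        for axis in range(3):
            for step in (-1, 1):
                nb = list(cell)
                nb[axis] = (nb[axis] + step) % dims[axis]
                yield cell, tuple(nb)


def total_hops(mapping, dims):
    """Total hops over all stencil edges for a cell -> processor mapping."""
    return sum(torus_hops(mapping[a], mapping[b], dims) for a, b in stencil_edges(dims))


dims = (4, 4, 4)
cells = list(itertools.product(*(range(d) for d in dims)))

# Topology-aware mapping: cell (x, y, z) goes to processor (x, y, z).
identity = {c: c for c in cells}

# Random mapping: cells are assigned to processors in shuffled order.
shuffled = cells[:]
random.Random(0).shuffle(shuffled)
random_map = dict(zip(cells, shuffled))

print("topology-aware:", total_hops(identity, dims))
print("random:        ", total_hops(random_map, dims))
```

With the identity mapping every stencil neighbor is exactly one hop away, so the total equals the number of edges. A random placement cannot do better and typically does considerably worse.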
Performance
[Figure: time per iteration (secs), 0 to 4.5, versus number of processors (512 to 8192) for Random, Round-Robin, and Topology mappings]
Hop counts
[Figure: hop count (in millions), 0 to 2, versus number of processors (512 to 8192) for Random, Round-robin, and Topology mappings]
Dynamic Graph - NAMD
• Molecular Dynamics (MD) application
• Simulation box is a 3D cell full of atoms
[Figure: patches and computes]
Load Balancing in NAMD
• Measurement-based (Charm++)
  – Principle of persistence
• Patches are statically mapped
  – Orthogonal recursive bisection
• Computes can be migrated
• Load balancing framework gathers the communication information
• Goal
  – Minimize communication
  – Maximize load balance
[Figure: patches, bonded computes, and non-bonded computes, with X, Y, and Z axes]
Old strategy
• Greedy approach
• Pick the heaviest compute
• Place it on a processor with one of the patches OR
• On a processor which already has a compute for this patch
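The greedy steps above might look roughly like the sketch below. The data structures, the choice of the least loaded candidate, and the absence of any overload limit are assumptions made for illustration; this is not NAMD's actual load balancer code.

```python
def greedy_place(computes, patch_home, num_procs):
    """Place computes greedily, heaviest first.

    computes:   dict of compute name -> (load, (patch_a, patch_b))
    patch_home: dict of patch name -> processor holding that patch
    Returns a dict of compute name -> processor.
    """
    load = [0.0] * num_procs
    hosts = {}      # patch -> processors that already hold a compute for it
    placement = {}

    # Pick the heaviest compute first.
    for name, (cost, patches) in sorted(computes.items(), key=lambda kv: -kv[1][0]):
        candidates = set()
        for patch in patches:
            candidates.add(patch_home[patch])      # processor with one of the patches
            candidates |= hosts.get(patch, set())  # or one with a compute for this patch
        # Assumption: among candidates, prefer the least loaded (lowest rank on ties).
        proc = min(candidates, key=lambda q: (load[q], q))
        placement[name] = proc
        load[proc] += cost
        for patch in patches:
            hosts.setdefault(patch, set()).add(proc)
    return placement


# Hypothetical example: three patches on three processors.
patch_home = {"P1": 0, "P2": 1, "P3": 2}
computes = {
    "c1": (5.0, ("P1", "P2")),
    "c2": (3.0, ("P1", "P2")),
    "c3": (1.0, ("P2", "P3")),
}
print(greedy_place(computes, patch_home, num_procs=3))
# {'c1': 0, 'c2': 1, 'c3': 2}
```

Note that this candidate rule considers only where the patches and related computes already live, not how far apart those processors are on the network.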
[Figure: Patch 1 and Patch 2 on a 3D torus, with an inner brick and an outer brick]
Hop-bytes
[Figure: two panels, "NAMD on Blue Gene/L (VN mode)" and "NAMD on Blue Gene/L (CO mode)", each plotting hop-bytes (MB), 0 to 3000, versus number of processors (512 to 8192) for the Old and Topology strategies; annotation: ~17 %]
Future Work
• Reason for contention
  – Heavy communication exceeding bandwidth
  – Link contention (such as in deterministic routing)
• Use UPC/PAPI on Blue Gene/L and P
Future Work
• Automatic Mapping
  – Initial Static Mapping
  – Use case: meshing applications
• Extend work on the Charm++ load balancers
  – Section-multicast aware load balancers
  – Useful in matrix multiplication
Future Work
• Optimization on other topologies
  – SiCortex (Kautz Graph)
  – Infiniband clusters (Fat-tree)
Summary
• Topology mapping helps!
  – Especially for heavily communication-bound applications
• Static mapping
• Dynamic mapping during load balancing
• Automatic mapping to relieve the user