HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware
Collective Communications on Many-core Clusters
Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra
Dec 2, 2011
@ICL Lunch Talk
Agenda
• Introduction
• Related work
• Kernel-assisted approach
• HierKNEM
• Experiments
• Conclusion
Introduction
• Multi-core clusters introduce hardware hierarchies.
• Message passing is still the dominant programming model.
• Programming libraries aim to handle these hierarchies internally.
• Collective communication is critical to application performance.
Problem: Tuned Collective
• It cannot see the boundaries introduced by the hierarchies of multi-core clusters.
• It builds a logical topology without runtime hardware topology information.
Topology-Unaware: Mismatch problem*
Figure: Open MPI Tuned Allgather Ring algorithm under different process-core binding cases (--bycore and --bynode). Processes P0 to P3 are placed on Core0 to Core3 of Node 0 and Node 1; one panel orders the cores as Core0 Core1 Core2 Core3, the other as Core0 Core2 Core1 Core3.
* T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, Cluster 2011
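The mismatch can be illustrated with a small, self-contained sketch. It is not Open MPI code: the helper names `node_of` and `inter_node_ring_links` are hypothetical, and the two placement rules are simplified assumptions about what --bycore (fill a node's cores before moving to the next node) and --bynode (deal ranks round-robin across nodes) do. The sketch counts how many of the ring's rank-to-next-rank links cross a node boundary, which is the number of messages that must leave a node in every step of a ring Allgather.

```python
def node_of(rank, n_nodes, cores_per_node, binding):
    """Return the node hosting `rank` under a simplified binding rule.

    Assumption: "bycore" packs consecutive ranks onto one node until its
    cores are full; "bynode" places consecutive ranks on consecutive nodes.
    """
    if binding == "bycore":
        return rank // cores_per_node
    if binding == "bynode":
        return rank % n_nodes
    raise ValueError(f"unknown binding: {binding}")


def inter_node_ring_links(n_nodes, cores_per_node, binding):
    """Count ring links (rank -> rank + 1, wrapping around) that cross nodes."""
    n_procs = n_nodes * cores_per_node
    crossings = 0
    for rank in range(n_procs):
        right = (rank + 1) % n_procs  # right-hand neighbour in the logical ring
        if (node_of(rank, n_nodes, cores_per_node, binding)
                != node_of(right, n_nodes, cores_per_node, binding)):
            crossings += 1
    return crossings


if __name__ == "__main__":
    for binding in ("bycore", "bynode"):
        links = inter_node_ring_links(32, 24, binding)
        print(f"{binding}: {links} of {32 * 24} ring links cross nodes")
```

Under these assumptions, a rank-ordered ring only crosses the network once per node with the by-core placement, whereas with the by-node placement every link is an inter-node link, even though the algorithm itself is unchanged.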
Related work
• Cheetah: R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011
• Distance-aware framework: T. Ma et al., Process Distance-aware Adaptive MPI Collective Communications, CLUSTER 2011
Figure labels: SBGP and BCOL components (three of each), shown with IB links, NUMA links, and intra-socket links.
Status of Kernel-assisted One-sided Single-copy Inter-Process Communication
• KNEM (0.9.7) and LIMIC (0.5.5)
• XPMEM (Cross-Process Memory Mapping)
• CMA (Cross Memory Attach)
Development of kernel-assisted approach in MPI stacks
• Intra-node p2p comm.: MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC)
• Intra-node collective comm.: KNEM Coll
T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. ICPP 2011
• Inter- and intra-node collective comm.: HierKNEM Coll
T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012
Hardware Environment
• Stremi Cluster
• 32 nodes
• Node: AMD’s 24-core
• Gigabit Ethernet
• Parapluie Cluster
• 32 nodes
• Node: AMD’s 24-core
• 20G InfiniBand
Software Environment
• Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7
• KNEM version 0.9.6, LIMIC 0.5.5
• IMB-3.2 (cache on)
• Unless otherwise noted, the same mapping between cores and processes is always used (--bycore).
Broadcast Performance
Figure: Aggregate Broadcast bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).
Figure callouts: "More than 30 times!!" and "More than twice".
Reduce Performance
Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).
Allgather Performance
Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).
Topology-aware Operations
Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).
Core per Node Scalability
Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2MB messages on multicore clusters (32 nodes).