HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters

Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra

Dec. 2, 2011

@ICL Lunch Talk



Agenda

• Introduction

• Related work

• Kernel-assisted approach

• HierKNEM

• Experiments

• Conclusion

Introduction

• Multi-core clusters introduce hardware hierarchies.

• Message passing is still the dominant programming model.

• Programming libraries want to handle these hierarchies internally.

• Collective communication is critical to application performance.

Problem: Tuned Collective

• The Tuned collective component cannot see the boundaries introduced by the hierarchies of multi-core clusters.

• It builds its logical topology without runtime hardware topology information.

Topology-Unaware: Mismatch problem*

Figure: Open MPI Tuned Allgather Ring algorithm under different process-core binding cases. Two panels, --bycore (cores ordered Core0, Core1, Core2, Core3) and --bynode (cores ordered Core0, Core2, Core1, Core3), each showing processes P0 to P3 across Node 0 and Node 1. Additional axis labels on the slide: # of nodes, # of cores.

* T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, Cluster 2011
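The mismatch can be illustrated with a small simulation. This is a minimal sketch, not Open MPI code: it assumes a ring in which each rank sends to rank + 1, and it models the two bindings in a simplified way (bycore fills one node before moving to the next, bynode deals ranks round-robin across nodes). The function names `node_of` and `inter_node_ring_links` are hypothetical.

```python
def node_of(rank, num_nodes, cores_per_node, policy):
    """Return the node hosting `rank` under a simplified binding policy."""
    if policy == "bycore":
        # Fill all cores of a node before moving to the next node.
        return rank // cores_per_node
    if policy == "bynode":
        # Deal ranks round-robin across nodes.
        return rank % num_nodes
    raise ValueError(f"unknown policy: {policy}")


def inter_node_ring_links(num_nodes, cores_per_node, policy):
    """Count ring links (rank -> rank + 1) that cross a node boundary."""
    total = num_nodes * cores_per_node
    crossings = 0
    for rank in range(total):
        right = (rank + 1) % total
        here = node_of(rank, num_nodes, cores_per_node, policy)
        there = node_of(right, num_nodes, cores_per_node, policy)
        if here != there:
            crossings += 1
    return crossings


if __name__ == "__main__":
    for policy in ("bycore", "bynode"):
        print(policy, inter_node_ring_links(2, 2, policy))
```

With 2 nodes and 2 cores per node, the rank-order ring crosses the network on 2 links under bycore but on all 4 links under bynode, even though the hardware is identical.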


Related work

• Cheetah

R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011

• Distance-aware framework

T. Ma et al., Process Distance-Aware Adaptive MPI Collective Communications, CLUSTER 2011

Figure labels: SBGP, BCOL; IB links, NUMA links, intra-socket links.


Status of Kernel-assisted One-sided Single-copy Inter-Process Communication

• KNEM (0.9.7) and LIMIC (0.5.5)

• XPMEM (Cross-Process Memory Mapping)

• CMA (Cross Memory Attach)
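To show why single-copy transfers matter, the sketch below compares the bytes moved by a copy-in/copy-out transfer through a shared staging buffer with a direct single copy. It is only a model of data movement inside one Python process, under the assumption that a shared-memory transport stages data in fixed-size chunks; it does not call KNEM, LIMIC, XPMEM or CMA, and the function names and chunk size are hypothetical.

```python
def copy_in_copy_out(src, chunk):
    """Model a double-copy transfer through a shared staging buffer."""
    dst = bytearray(len(src))
    shared = bytearray(chunk)          # stands in for a shared-memory segment
    moved = 0
    for off in range(0, len(src), chunk):
        part = src[off:off + chunk]
        shared[:len(part)] = part                      # copy 1: sender -> shared
        dst[off:off + len(part)] = shared[:len(part)]  # copy 2: shared -> receiver
        moved += 2 * len(part)
    return bytes(dst), moved


def single_copy(src):
    """Model a kernel-assisted transfer: one copy, sender -> receiver."""
    dst = bytearray(len(src))
    dst[:] = src
    return bytes(dst), len(src)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4
    _, double = copy_in_copy_out(payload, chunk=100)
    _, single = single_copy(payload)
    print("bytes moved, double copy:", double)
    print("bytes moved, single copy:", single)
```

In this model the double-copy path always moves twice the message size, which is the memory traffic that the kernel-assisted mechanisms listed above are designed to avoid.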

Development of kernel-assisted approach in MPI stacks

• Intra-node p2p comm.: MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC)

• Intra-node collective comm.: KNEM Coll

T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. ICPP 2011

• Inter- and intra-node collective comm.: HierKNEM Coll

T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012


Framework of HierKNEM

Subgroup: Intra-node Comm., Inter-node Comm.

Broadcast

Figure: Bcast with 64 processes on Dancer’s 8 nodes (8 cores/node), 256KB message size. Labels: inter-node forward, KNEM read; leader processes, non-leader processes; Send, Recv, KNEM copy.
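The leader/non-leader split for Broadcast can be sketched as a transfer plan. This is an illustrative model, not the HierKNEM implementation: it assumes the leader of each node is its lowest rank (the root on the root's node), that leaders forward the message along a simple chain, and that every non-leader reads from its local leader. The function name `hier_bcast_plan` and these choices are hypothetical.

```python
def hier_bcast_plan(node_of_rank, root):
    """Return (inter_node_edges, intra_node_edges) for a two-level broadcast.

    node_of_rank[r] is the node hosting rank r. Each edge is (source, dest).
    """
    groups = {}
    for rank, node in enumerate(node_of_rank):
        groups.setdefault(node, []).append(rank)

    root_node = node_of_rank[root]
    # Assumption: the root leads its own node; elsewhere the lowest rank leads.
    leaders = {
        node: (root if node == root_node else min(ranks))
        for node, ranks in groups.items()
    }

    # Assumption: leaders forward along a chain that starts at the root's node.
    order = [root_node] + sorted(n for n in groups if n != root_node)
    inter = [(leaders[a], leaders[b]) for a, b in zip(order, order[1:])]

    # Non-leaders fetch the data from their local leader (KNEM read in the talk).
    intra = [
        (leaders[node], rank)
        for node in order
        for rank in groups[node]
        if rank != leaders[node]
    ]
    return inter, intra


if __name__ == "__main__":
    mapping = [r // 8 for r in range(64)]   # 8 nodes, 8 cores/node, by core
    inter, intra = hier_bcast_plan(mapping, root=0)
    print(len(inter), "inter-node transfers,", len(intra), "intra-node reads")
```

For the 64-process configuration in the figure caption, the plan contains 7 inter-node transfers between leaders and 56 intra-node reads, and the same counts result whichever way ranks are bound to nodes.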

Reduce

Figure labels: Intra-node Comm., Inter-node Comm., New_Comm.; inter-node forward, KNEM read/write.
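A two-level reduction can be sketched in the same spirit. The code below is a generic hierarchical reduce, not HierKNEM's algorithm: it assumes the operation is associative and commutative, combines the contributions of each node first (the intra-node stage), and then combines the per-node partial results (the inter-node stage). The function name `hier_reduce` is hypothetical.

```python
import functools
import operator


def hier_reduce(values, node_of_rank, op=operator.add):
    """Reduce values[r] over all ranks in two stages: per node, then across nodes.

    Assumes `op` is associative and commutative, so regrouping is safe.
    """
    # Stage 1: intra-node reduction into one partial result per node.
    partial = {}
    for rank, node in enumerate(node_of_rank):
        if node in partial:
            partial[node] = op(partial[node], values[rank])
        else:
            partial[node] = values[rank]

    # Stage 2: inter-node reduction over the per-node partial results.
    return functools.reduce(op, (partial[node] for node in sorted(partial)))


if __name__ == "__main__":
    values = list(range(8))
    by_node = [r % 2 for r in range(8)]   # 2 nodes, ranks dealt round-robin
    print(hier_reduce(values, by_node))
```

Because only one partial result per node enters the second stage, the number of values crossing the network equals the number of nodes rather than the number of processes.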

Allgather: Topology-aware Ring
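One way to build a ring that does not depend on the process-core binding is to order ranks by the node that hosts them. The sketch below assumes the runtime can report the node of each rank; it only illustrates the ring ordering and does not model the KNEM copies or the actual HierKNEM Allgather. The function names are hypothetical.

```python
def topology_aware_ring(node_of_rank):
    """Order ranks so that ranks on the same node are adjacent in the ring."""
    return sorted(range(len(node_of_rank)), key=lambda r: (node_of_rank[r], r))


def inter_node_links(ring, node_of_rank):
    """Count ring links whose two endpoints sit on different nodes."""
    n = len(ring)
    return sum(
        node_of_rank[ring[i]] != node_of_rank[ring[(i + 1) % n]]
        for i in range(n)
    )


if __name__ == "__main__":
    by_node = [0, 1, 0, 1]                 # 2 nodes, ranks dealt round-robin
    naive = list(range(4))                 # ring in plain rank order
    aware = topology_aware_ring(by_node)
    print("rank-order ring:", inter_node_links(naive, by_node))
    print("topology-aware ring:", inter_node_links(aware, by_node))
```

With this ordering the ring crosses the network once per node, so the by-node binding that produced 4 inter-node links in rank order produces only 2.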


Hardware Environment

• Stremi Cluster

• 32 nodes

• Node: 24 AMD cores

• Gigabit Ethernet

• Parapluie Cluster

• 32 nodes

• Node: 24 AMD cores

• 20G InfiniBand

Software Environment

• Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7

• KNEM 0.9.6, LIMIC 0.5.5

• IMB-3.2 (cache on)

• Unless otherwise noted, processes are always mapped to cores the same way (--bycore).

Broadcast Performance

Figure: Aggregate Broadcast bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes). Annotations on the plots: "More than 30 times!!" and "More than twice".

Reduce Performance

Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).

Allgather Performance

Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).

Topology-aware Operations

Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).

Core per Node Scalability

Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2MB messages on multicore clusters (32 nodes).

Conclusion

• HierKNEM achieves large speedups by overlapping inter- and intra-node communication.

• HierKNEM is topology-aware: its performance is immune to changes in the underlying process-core binding.

• HierKNEM provides a linear speedup as the number of cores per node increases.
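The benefit of overlap can be estimated with a standard two-stage pipeline model. This is a back-of-the-envelope sketch, not a measurement: it assumes a message split into equal segments, a fixed inter-node time and a fixed intra-node time per segment, and perfect overlap between the two stages. The function name and the timing values are hypothetical.

```python
def pipeline_time(segments, t_inter, t_intra, overlap):
    """Estimate the time to move `segments` segments through two stages.

    Without overlap, each segment pays both stage costs back to back.
    With overlap, the slower stage dominates once the pipeline is full.
    """
    if segments <= 0:
        raise ValueError("segments must be positive")
    if not overlap:
        return segments * (t_inter + t_intra)
    return t_inter + t_intra + (segments - 1) * max(t_inter, t_intra)


if __name__ == "__main__":
    # Hypothetical per-segment costs in arbitrary time units.
    print("no overlap:", pipeline_time(8, 1.0, 1.0, overlap=False))
    print("overlap:   ", pipeline_time(8, 1.0, 1.0, overlap=True))
```

In this model, when the two stages cost about the same, overlapping them approaches a factor-of-two reduction as the number of segments grows; larger gains have to come from other effects such as reduced memory copies.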