Collective Communication Optimizations
Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
Amith R. Mamidala, Abhinav Vishnu and Dhabaleswar K. Panda
Scaling Alltoall Collective on Multi-core Systems
Rahul Kumar, Amith Mamidala and Dhabaleswar K. Panda
Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda
Presented By: Md. Wasi-ur-Rahman
Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
Introduction
• Motivated by use in multi-core clusters
• Recent advances in multi-core architecture have enabled higher process density per node
• MPI is the most popular programming model for parallel applications
• MPI_Allgather is one of the collective operations in MPI, and it is used extensively
• InfiniBand is widely deployed to support communication in large clusters
• RDMA offers the most efficient and scalable performance features
• An efficient MPI_Allgather is highly desirable for all MPI applications
MPI_Allgather: Recursive Doubling Algorithm
Recursive Algorithm (Contd.)
Recursive Doubling with multiple processes/node
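For reference, a minimal sketch (our own, not the papers' code) of the recursive-doubling allgather on which the baseline is built: in step k, every process exchanges the 2^k blocks it has accumulated so far with the partner whose rank differs in bit k, so the full vector is assembled in log2(p) steps. The sketch assumes a power-of-two number of processes and a contiguous datatype.

/* Hypothetical sketch of recursive-doubling allgather (power-of-two ranks).
 * Compile with an MPI C compiler such as mpicc; names are ours. */
#include <mpi.h>
#include <string.h>

int allgather_rd(const void *sendbuf, int count, MPI_Datatype dtype,
                 void *recvbuf, MPI_Comm comm)
{
    int rank, size, type_size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_size(dtype, &type_size);

    /* Place our own block at offset rank*count in the result buffer. */
    memcpy((char *)recvbuf + (size_t)rank * count * type_size,
           sendbuf, (size_t)count * type_size);

    /* log2(size) exchange steps; the number of blocks held doubles each step. */
    for (int mask = 1, step = 0; mask < size; mask <<= 1, step++) {
        int partner = rank ^ mask;
        /* First block index currently held by us / by the partner's group. */
        int my_root  = (rank    >> step) << step;
        int dst_root = (partner >> step) << step;

        MPI_Sendrecv((char *)recvbuf + (size_t)my_root  * count * type_size,
                     mask * count, dtype, partner, 0,
                     (char *)recvbuf + (size_t)dst_root * count * type_size,
                     mask * count, dtype, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}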
Problems with this approach
1. No buffer sharing
2. No control over scheduling
3. No overlapping possible between network communication and data copying
Problem Statement
• How can the extra copy cost be avoided?
• Can data copying be overlapped with the network operation while it is in progress? (see the sketch below)
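The slides do not spell out the mechanism, but the overlap asked about above can be pictured with a minimal MPI sketch of our own (the function name, buffer layout, and offsets are hypothetical; the actual design uses a node-level shared buffer and RDMA): the inter-node transfer of already-copied data is posted as a nonblocking operation, and the next intra-node copy proceeds while that transfer is in flight.

/* Minimal sketch (ours, not the paper's code) of overlapping data copying
 * with the network operation.  Offsets and names are hypothetical. */
#include <mpi.h>
#include <string.h>

void exchange_step_with_overlap(char *shared_buf, const char *next_block,
                                size_t block_len, int remote_peer,
                                MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Nonblocking inter-node exchange of the block already in shared_buf. */
    MPI_Irecv(shared_buf + block_len, (int)block_len, MPI_BYTE,
              remote_peer, 0, comm, &reqs[0]);
    MPI_Isend(shared_buf, (int)block_len, MPI_BYTE,
              remote_peer, 0, comm, &reqs[1]);

    /* Overlap: copy the next local block into the shared buffer while the
     * network transfer progresses, instead of serializing the two. */
    memcpy(shared_buf + 2 * block_len, next_block, block_len);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

Called once per exchange step, this hides the copy time behind the network time whenever the copy finishes before the transfer does.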
Algorithm
Performance Evaluation
Two Comparisons
1. Comparison between the original algorithm and the new design
2. Comparison between the non-overlapping and the overlapping versions of this design
Experimental Results
Experimental Results (Contd.)
Overlap Benefits
Conclusion & Future Work
• Implemented a common buffer for inter- and intra-node communication
• Incorporated into MVAPICH
• Apply to MPI_Allgather algorithms for odd numbers of processes
• Application-level study
• Running on higher core counts
Scaling Alltoall Collective on Multi-core Systems
Offload Architecture
• The network processing is offloaded to the network interface
• The NIC is able to send messages, relieving the CPU
Onload Architecture
• In the onload architecture, the CPU is involved in communication in addition to performing the computation
• Overlap between communication and computation is not possible
Bi-directional Bandwidth : InfiniPath (Onload)
Bi-directional Bandwidth : ConnectX
Bi-directional Bandwidth : InfiniHost III (offload)
This may be due to the congestion factor at the network interface when many processes communicate through it
Motivation
Receive-side distribution is more costly than send-side aggregation
Problem Statement
• Can shared memory help to avoid network transactions?
• Can the performance for AlltoAll be improved for multi-core clusters?
Leader Based Algorithm for AlltoAll
AlltoAll Leader based Algorithm
• With two cores per node, the number of inter-node communications by each core doubles
• Latency almost doubles as cores/node increase
Proposed Design
Step 1: Intra-node Communication
Step 2: AlltoAll Inter-node Communication in each group
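To make the two steps concrete, here is a hedged reconstruction (ours, not the paper's implementation) of a single-leader variant: one core per node aggregates its peers' outgoing data, the leaders perform the inter-node AlltoAll, and the results are redistributed locally. The paper performs the intra-node steps through shared memory and may group cores differently; here plain MPI_Gather/MPI_Scatter stand in, ranks are assumed block-mapped to nodes, and every node is assumed to have the same core count.

/* Hedged sketch of a leader-based MPI_Alltoall (our reconstruction).
 * B is the number of bytes each process sends to every other process. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int alltoall_leader(const char *sendbuf, char *recvbuf, int B, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int lrank, c;                           /* local rank, cores per node  */
    MPI_Comm_rank(node_comm, &lrank);
    MPI_Comm_size(node_comm, &c);
    int n = p / c;                          /* number of nodes             */
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    char *agg = NULL, *stage = NULL, *recvstage = NULL, *dist = NULL;
    if (lrank == 0) {
        agg       = malloc((size_t)c * p * B);   /* [src local l][dest d]  */
        stage     = malloc((size_t)c * p * B);   /* grouped by dest node   */
        recvstage = malloc((size_t)c * p * B);   /* grouped by source node */
        dist      = malloc((size_t)c * p * B);   /* [dest local k][src]    */
    }

    /* Step 1: every core hands its full outgoing buffer to the node leader. */
    MPI_Gather((void *)sendbuf, p * B, MPI_BYTE, agg, p * B, MPI_BYTE,
               0, node_comm);

    if (lrank == 0) {
        /* Pack so that all data bound for node t is contiguous. */
        for (int t = 0; t < n; t++)
            for (int l = 0; l < c; l++)
                for (int k = 0; k < c; k++)
                    memcpy(stage + ((size_t)t * c * c + l * c + k) * B,
                           agg   + ((size_t)l * p + t * c + k) * B, B);

        /* Step 2: a single inter-node AlltoAll among the leaders. */
        MPI_Alltoall(stage, c * c * B, MPI_BYTE,
                     recvstage, c * c * B, MPI_BYTE, leader_comm);

        /* Unpack into per-destination-core order for the scatter below. */
        for (int s = 0; s < n; s++)
            for (int l = 0; l < c; l++)
                for (int k = 0; k < c; k++)
                    memcpy(dist      + ((size_t)k * p + s * c + l) * B,
                           recvstage + ((size_t)s * c * c + l * c + k) * B, B);
    }

    /* Step 3: the leader distributes each core's incoming blocks. */
    MPI_Scatter(dist, p * B, MPI_BYTE, recvbuf, p * B, MPI_BYTE, 0, node_comm);

    if (lrank == 0) { free(agg); free(stage); free(recvstage); free(dist); }
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}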
Performance Results: AlltoAll (InfiniPath)
InfiniPath with 512 byte message
AlltoAll: InfiniHost III
AlltoAll: ConnectX
CPMD Application
• CPMD makes extensive use of AlltoAll
• The proposed algorithm shows better performance on the 128-core system
This demonstrates scalability: as the system size increases, the new algorithm performs better
Conclusion & Future Work
• The proposed design reduces MPI_Alltoall time by 55%
• Speeds up CPMD by 33%
• Evaluate on a 10GigE system in the future
• Extend this work to other collectives
Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Allgather
• Each process broadcasts a vector to each of the other processes
• Algorithms used:
  Recursive Doubling (small messages): tcomm = ts * log(p) + tw * (p - 1) * m
  Ring (large messages): tcomm = (ts + tw * m) * (p - 1)
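As an illustration of these costs (our own numbers, not from the slides): with p = 64 processes, recursive doubling pays the startup term ts only log2(64) = 6 times, whereas ring pays it 63 times, while both move the same total volume, tw * 63 * m. Hence recursive doubling is preferred for small messages, where startup latency dominates, and ring for large messages, where the bandwidth term dominates and its nearest-neighbor pattern keeps contention low.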
Scaling on Multi-cores
Problem Statement
• Can there be an algorithm that is multi-core and NUMA aware, achieving better performance and scalability as both core counts and system size increase?
Single Leader – Performance
Proposed Multi-Leader Scheme
Multi-Leader Scheme – Step 1
Multi-Leader Scheme – Step 2
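A compact sketch of the leader-based allgather pattern behind these steps (our own reconstruction with one leader per node; the proposed multi-leader scheme splits each node into several such groups): gather locally to the leader, allgather among the leaders, then hand the assembled vector back within the node. Intra-node steps are shown with MPI calls, whereas the design exchanges them through shared memory; ranks are assumed block-mapped to nodes with equal core counts per node.

/* Hedged sketch of a leader-based allgather (one leader per node). */
#include <mpi.h>
#include <stdlib.h>

int allgather_leader(const void *sendbuf, int count, MPI_Datatype dtype,
                     void *recvbuf, MPI_Comm comm)
{
    int rank, p, type_size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    MPI_Type_size(dtype, &type_size);

    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int lrank, c;
    MPI_Comm_rank(node_comm, &lrank);
    MPI_Comm_size(node_comm, &c);
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Step 1: gather this node's blocks at the leader. */
    char *node_blocks = (lrank == 0) ? malloc((size_t)c * count * type_size)
                                     : NULL;
    MPI_Gather((void *)sendbuf, count, dtype, node_blocks, count, dtype,
               0, node_comm);

    /* Step 2: allgather among the leaders only. */
    if (lrank == 0)
        MPI_Allgather(node_blocks, c * count, dtype,
                      recvbuf, c * count, dtype, leader_comm);

    /* Step 3: the leader hands the complete result to its local ranks. */
    MPI_Bcast(recvbuf, p * count, dtype, 0, node_comm);

    free(node_blocks);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}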
Performance Results
Multi-Leader pt2pt vs shmem
Performance in large scale multi-cores
Proposed Unified Scheme
Conclusion & Future Work
• The proposed multi-leader scheme shows improved scalability and reduced memory contention
• Future work: devise an algorithm that can choose the number of leaders optimally in every scenario
• Real-world applications
• Examine the benefits of kernel-based zero-copy intra-node exchanges for large messages
Thank You