
Page 1

Collective Communication Optimizations

Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
Amith R. Mamidala, Abhinav Vishnu and Dhabaleswar K. Panda

Scaling Alltoall Collective on Multi-core Systems
Rahul Kumar, Amith Mamidala and Dhabaleswar K. Panda

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda

Presented By: Md. Wasi-ur-Rahman

Page 2

Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand

Page 3

Introduction
• Motivated by use in multi-core clusters
• Recent advances in multi-core architectures have enabled higher process density per node
• MPI is the most popular programming model for parallel applications
• MPI_Allgather is one of the MPI collective operations and is used extensively (a minimal usage sketch follows this slide)
• InfiniBand is widely deployed to support communication in large clusters
• RDMA offers highly efficient and scalable performance
• An efficient MPI_Allgather is highly desirable for all MPI applications
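For reference, a minimal MPI_Allgather call is sketched below: every rank contributes M integers and ends up with the full p*M-element vector ordered by rank. This is an illustrative sketch only; the buffer names and element count are chosen here and do not come from the slides.

/* Minimal MPI_Allgather usage sketch (not code from the papers). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

enum { M = 4 };                        /* elements contributed per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int send[M];
    for (int i = 0; i < M; i++)
        send[i] = rank * M + i;        /* this rank's contribution */

    int *recv = malloc((size_t)p * M * sizeof *recv);
    MPI_Allgather(send, M, MPI_INT, recv, M, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("rank 0 now holds %d elements from all %d ranks\n", p * M, p);

    free(recv);
    MPI_Finalize();
    return 0;
}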

Page 4

MPI_Allgather: Recursive Doubling Algorithm
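This slide is a figure. As a rough sketch of the recursive doubling exchange pattern (assuming the number of processes is a power of two and each process contributes one fixed-size block; an illustration, not the MVAPICH implementation):

#include <mpi.h>
#include <string.h>

/* Recursive doubling Allgather sketch: at step k each rank exchanges all the
 * blocks it currently holds with partner rank ^ 2^k, so the gathered data
 * doubles every step and the exchange finishes in log2(p) steps. */
void allgather_recursive_doubling(const void *sendbuf, void *recvbuf,
                                  int blksz, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);           /* assumed to be a power of two */

    char *rbuf = (char *)recvbuf;
    /* Place our own block at its final position. */
    memcpy(rbuf + (size_t)rank * blksz, sendbuf, (size_t)blksz);

    for (int dist = 1; dist < p; dist *= 2) {
        int peer = rank ^ dist;
        int my_off   = (rank / dist) * dist;   /* first block we hold    */
        int peer_off = (peer / dist) * dist;   /* first block peer holds */
        MPI_Sendrecv(rbuf + (size_t)my_off * blksz, dist * blksz, MPI_BYTE,
                     peer, 0,
                     rbuf + (size_t)peer_off * blksz, dist * blksz, MPI_BYTE,
                     peer, 0, comm, MPI_STATUS_IGNORE);
    }
}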

Page 5

Recursive Algorithm (Contd.)

Page 6

Recursive Doubling with multiple processes per node

Page 7

Problems with this approach

1. No buffer sharing
2. No control over scheduling
3. No overlapping possible between network communication and data copying

Page 8

Problem Statement

• How can the extra copy cost be avoided?

• Can data copying be overlapped with ongoing network operations?

Page 9

Algorithm

Page 10

Performance Evaluation

Two comparisons:
1. Comparison between the original algorithm and the new design
2. Comparison between the non-overlapping and overlapping versions of this design

Page 11

Experimental Results

Page 12

Experimental Results (Contd.)

Page 13

Overlap Benefits

Page 14

Conclusion & Future Work

• Implemented a common buffer for inter- and intra-node communication
• Incorporated into MVAPICH
• Apply to MPI_Allgather algorithms for odd numbers of processes
• Application-level study
• Run on higher core counts

Page 15

Scaling Alltoall Collective on Multi-core Systems

Page 16

Offload Architecture

• Network processing is offloaded to the network interface
• The NIC can send messages on its own, relieving the CPU

Page 17

Onload Architecture

• In an onload architecture, the CPU is involved in communication in addition to performing the computation
• Overlap between communication and computation is not possible

Page 18

Bi-directional Bandwidth: InfiniPath (Onload)

Page 19

Bi-directional Bandwidth: ConnectX

Page 20

Bi-directional Bandwidth: InfiniHost III (Offload)

This may be due to congestion at the network interface when many cores access it simultaneously

Page 21

Motivation

Receive-side distribution is more costly than send-side aggregation

Page 22

Problem Statement

• Can shared memory help to avoid network transactions?

• Can the performance for AlltoAll be improved for multi-core clusters?

Page 23

Leader Based Algorithm for AlltoAll

Page 24

AlltoAll Leader based Algorithm

• With two cores per node, the number of inter-node communications performed by each core doubles
• Latency almost doubles as the number of cores per node increases
(A rough sketch of the generic leader-based staging follows.)
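The generic leader-based staging (every rank ships its send buffer to the node leader, the leaders perform the Alltoall among themselves, and results are scattered back) can be sketched roughly as follows. This is an illustration only, assuming MPI-3, the same number of ranks per node, and node-contiguous global ranks; it is not the paper's code, nor the design it proposes.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Leader-based Alltoall sketch.  `m` is the per-destination message size in
 * bytes; sendbuf/recvbuf hold p blocks of m bytes each, ordered by rank. */
void alltoall_leader(const void *sendbuf, void *recvbuf, int m, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* Node-local communicator plus a communicator of the node leaders. */
    MPI_Comm node, leaders;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node);
    int lrank, c;
    MPI_Comm_rank(node, &lrank);
    MPI_Comm_size(node, &c);
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    int nnodes = p / c;
    char *gath = NULL, *pack = NULL, *xchg = NULL, *scat = NULL;
    if (lrank == 0) {
        gath = malloc((size_t)c * p * m);   /* step-1 gather target          */
        pack = malloc((size_t)c * p * m);   /* reordered for leader Alltoall */
        xchg = malloc((size_t)c * p * m);   /* leader Alltoall result        */
        scat = malloc((size_t)c * p * m);   /* reordered for the scatter     */
    }

    /* Step 1 (intra-node): every rank ships its send buffer to the leader. */
    MPI_Gather(sendbuf, p * m, MPI_BYTE, gath, p * m, MPI_BYTE, 0, node);

    if (lrank == 0) {
        /* Make the data for each destination node j contiguous:
         * one c*c*m chunk per node, ordered (local source s, local dest d). */
        for (int j = 0; j < nnodes; j++)
            for (int s = 0; s < c; s++)
                memcpy(pack + ((size_t)j * c * c + (size_t)s * c) * m,
                       gath + ((size_t)s * p + (size_t)j * c) * m,
                       (size_t)c * m);

        /* Step 2 (inter-node): Alltoall among the node leaders only. */
        MPI_Alltoall(pack, c * c * m, MPI_BYTE, xchg, c * c * m, MPI_BYTE, leaders);

        /* Reorder so local destination d gets its p blocks in source-rank order. */
        for (int d = 0; d < c; d++)
            for (int j = 0; j < nnodes; j++)
                for (int s = 0; s < c; s++)
                    memcpy(scat + ((size_t)d * p + (size_t)j * c + s) * m,
                           xchg + ((size_t)j * c * c + (size_t)s * c + d) * m,
                           (size_t)m);
        MPI_Comm_free(&leaders);
    }

    /* Final intra-node step: hand every local rank its result. */
    MPI_Scatter(scat, p * m, MPI_BYTE, recvbuf, p * m, MPI_BYTE, 0, node);

    if (lrank == 0) { free(gath); free(pack); free(xchg); free(scat); }
    MPI_Comm_free(&node);
}

The drawback noted on the slide above is visible in this sketch: the leader serializes the node's entire traffic, while the direct approach makes every core perform its own (larger number of) inter-node transfers.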

Page 25

Proposed Design

Step 1: Intra-node communication
Step 2: AlltoAll inter-node communication in each group

Page 26

Performance Results: AlltoAll (InfiniPath)

Page 27

InfiniPath with 512 byte message

Page 28

AlltoAll: InfiniHost III

Page 29

AlltoAll: ConnectX

Page 30

CPMD Application

• CPMD makes extensive use of AlltoAll
• The proposed algorithm shows better performance on a 128-core system
• This demonstrates scalability: as the system size increases, the new algorithm performs better

Page 31

Conclusion & Future Work

• The proposed design reduces MPI_Alltoall time by 55%
• Speeds up CPMD by 33%
• Evaluate on 10GigE systems in the future
• Extend this work to other collectives

Page 32

Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters

Page 33

Allgather

• Each process broadcasts a vector to every other process

• Algorithms used:
  Recursive Doubling (small messages): t_comm = t_s * log(p) + t_w * (p - 1) * m
  Ring (large messages): t_comm = (t_s + t_w * m) * (p - 1)

Page 34

Scaling on Multi-cores

Page 35

Problem Statement

• Can an algorithm be designed that is multi-core and NUMA aware, achieving better performance and scalability as both core counts and system size increase?

Page 36

Single Leader – Performance

Page 37

Proposed Multi-Leader Scheme

Page 38

Multi-Leader Scheme – Step 1

Page 39

Multi-Leader Scheme – Step 2
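Steps 1 and 2 appear as figures in the slides. As a rough illustration of how a multi-leader Allgather can be staged (one leader per group of cores, e.g. per NUMA socket), here is a sketch; it assumes MPI-3, node-contiguous global ranks, and equal group sizes, and it is not the paper's implementation, which relies on MVAPICH-internal shared-memory channels.

#include <mpi.h>
#include <stdlib.h>

/* Multi-leader Allgather sketch.
 * Step 1: gather each group's contributions at its leader.
 * Step 2: Allgather among all group leaders.
 * Final step: broadcast the assembled vector within each group.
 * `m` is the per-process contribution in bytes. */
void allgather_multi_leader(const void *sendbuf, void *recvbuf, int m,
                            int groups_per_node, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    MPI_Comm node, group, leaders;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node);
    int lrank, c;
    MPI_Comm_rank(node, &lrank);
    MPI_Comm_size(node, &c);

    int gsize = c / groups_per_node;            /* ranks per group */
    MPI_Comm_split(node, lrank / gsize, lrank, &group);
    int grank;
    MPI_Comm_rank(group, &grank);
    /* All group leaders across the job form one communicator, ordered by rank. */
    MPI_Comm_split(comm, grank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);

    /* Step 1: gather the group's contributions at its leader. */
    char *gbuf = (grank == 0) ? malloc((size_t)gsize * m) : NULL;
    MPI_Gather(sendbuf, m, MPI_BYTE, gbuf, m, MPI_BYTE, 0, group);

    /* Step 2: Allgather among the leaders.  With node-contiguous ranks and
     * contiguous groups, every gsize*m chunk lands at its final offset. */
    if (grank == 0) {
        MPI_Allgather(gbuf, gsize * m, MPI_BYTE, recvbuf, gsize * m, MPI_BYTE,
                      leaders);
        MPI_Comm_free(&leaders);
        free(gbuf);
    }

    /* Final step: each leader shares the assembled vector within its group. */
    MPI_Bcast(recvbuf, p * m, MPI_BYTE, 0, group);

    MPI_Comm_free(&group);
    MPI_Comm_free(&node);
}

With groups_per_node = 1 this degenerates to the single-leader scheme; using multiple leaders spreads the intra-node gather and broadcast traffic across sockets, which is the contention effect the following results examine.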

Page 40

Performance Results

Page 41

Multi-Leader pt2pt vs shmem

Page 42

Performance in large scale multi-cores

Page 43

Proposed Unified Scheme

Page 44

Conclusion & Future Work
• The proposed multi-leader scheme shows improved scalability and reduced memory contention
• Future work: devise an algorithm that can choose the number of leaders optimally in every scenario
• Real-world applications
• Examine the benefits of kernel-based zero-copy intra-node exchanges for large messages

Page 45

Thank You