Collective Communication Optimizations
Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
Amith R. Mamidala, Abhinav Vishnu and Dhabaleswar K. Panda
Scaling Alltoall Collective on Multi-core Systems
Rahul Kumar, Amith Mamidala and Dhabaleswar K. Panda
Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda
Presented By: Md. Wasi-ur-Rahman
Efficient Shared Memory and RDMA based design for MPI_Allgather over InfiniBand
Introduction
• Motivated by use in multi-core clusters
• Recent advances in multi-core architecture have enabled higher process density per node
• MPI is the most popular programming model for parallel applications
• MPI_Allgather is one of the collective operations in MPI, and it is used extensively
• InfiniBand is widely deployed to support communication in large clusters
• RDMA offers the most efficient and scalable performance features
• An efficient MPI_Allgather is highly desirable for all MPI applications
MPI_Allgather: Recursive Doubling Algorithm
Recursive Algorithm (Contd.)
Recursive Doubling with multiple processes/node
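For reference, a minimal sketch (our own, not the papers' code) of the recursive-doubling allgather on which the baseline is built: in step k, every process exchanges the 2^k blocks it has accumulated so far with the partner whose rank differs in bit k, so the full vector is assembled in log2(p) steps. The sketch assumes a power-of-two number of processes and a contiguous datatype.

/* Hypothetical sketch of recursive-doubling allgather (power-of-two ranks).
 * Compile with an MPI C compiler such as mpicc; names are ours. */
#include <mpi.h>
#include <string.h>

int allgather_rd(const void *sendbuf, int count, MPI_Datatype dtype,
                 void *recvbuf, MPI_Comm comm)
{
    int rank, size, type_size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    MPI_Type_size(dtype, &type_size);

    /* Place our own block at offset rank*count in the result buffer. */
    memcpy((char *)recvbuf + (size_t)rank * count * type_size,
           sendbuf, (size_t)count * type_size);

    /* log2(size) exchange steps; the number of blocks held doubles each step. */
    for (int mask = 1, step = 0; mask < size; mask <<= 1, step++) {
        int partner = rank ^ mask;
        /* First block index currently held by us / by the partner's group. */
        int my_root  = (rank    >> step) << step;
        int dst_root = (partner >> step) << step;

        MPI_Sendrecv((char *)recvbuf + (size_t)my_root  * count * type_size,
                     mask * count, dtype, partner, 0,
                     (char *)recvbuf + (size_t)dst_root * count * type_size,
                     mask * count, dtype, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}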
Problems with this approach
1. No buffer sharing
2. No control over scheduling
3. No overlapping possible between network communication and data copying
Problem Statement
• How can the extra copy cost be avoided?
• Can data copying be overlapped with the network operation while it is in progress? (see the sketch below)
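The slides do not spell out the mechanism, but the overlap asked about above can be pictured with a minimal MPI sketch of our own (the function name, buffer layout, and offsets are hypothetical; the actual design uses a node-level shared buffer and RDMA): the inter-node transfer of already-copied data is posted as a nonblocking operation, and the next intra-node copy proceeds while that transfer is in flight.

/* Minimal sketch (ours, not the paper's code) of overlapping data copying
 * with the network operation.  Offsets and names are hypothetical. */
#include <mpi.h>
#include <string.h>

void exchange_step_with_overlap(char *shared_buf, const char *next_block,
                                size_t block_len, int remote_peer,
                                MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Nonblocking inter-node exchange of the block already in shared_buf. */
    MPI_Irecv(shared_buf + block_len, (int)block_len, MPI_BYTE,
              remote_peer, 0, comm, &reqs[0]);
    MPI_Isend(shared_buf, (int)block_len, MPI_BYTE,
              remote_peer, 0, comm, &reqs[1]);

    /* Overlap: copy the next local block into the shared buffer while the
     * network transfer progresses, instead of serializing the two. */
    memcpy(shared_buf + 2 * block_len, next_block, block_len);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

Called once per exchange step, this hides the copy time behind the network time whenever the copy finishes before the transfer does.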
Algorithm
Performance Evaluation
Two Comparisons
1. Comparison between the original algorithm and the new design
2. Comparison between the non-overlapping and the overlapping versions of this design
Experimental Results
Experimental Results (Contd.)
Overlap Benefits
Conclusion & Future Work
• Implemented a common buffer for inter- and intra-node communication
• Incorporated into MVAPICH
• Apply to MPI_Allgather algorithms for odd numbers of processes
• Application-level study
• Running on higher core counts
Scaling Alltoall Collective on Multi-core Systems
Offload Architecture
• The network processing is offloaded to the network interface
• The NIC is able to send messages, relieving the CPU
Onload Architecture
• In the onload architecture, the CPU is involved in communication in addition to performing the computation
• Overlap between communication and computation is not possible
Bi-directional Bandwidth : InfiniPath (Onload)
Bi-directional Bandwidth : ConnectX
Bi-directional Bandwidth : InfiniHost III (offload)
This may be due to the congestion factor at the network interface when many processes communicate through it
Motivation
Receive-side distribution is more costly than send-side aggregation
Problem Statement
• Can shared memory help to avoid network transactions?
• Can the performance for AlltoAll be improved for multi-core clusters?
Leader Based Algorithm for AlltoAll
AlltoAll Leader based Algorithm
• With two cores per node, the number of inter-node communications by each core doubles
• Latency almost doubles as cores/node increase
Proposed Design
Step 1: Intra-node Communication
Step 2: AlltoAll Inter-node Communication in each group
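To make the two steps concrete, here is a hedged reconstruction (ours, not the paper's implementation) of a single-leader variant: one core per node aggregates its peers' outgoing data, the leaders perform the inter-node AlltoAll, and the results are redistributed locally. The paper performs the intra-node steps through shared memory and may group cores differently; here plain MPI_Gather/MPI_Scatter stand in, ranks are assumed block-mapped to nodes, and every node is assumed to have the same core count.

/* Hedged sketch of a leader-based MPI_Alltoall (our reconstruction).
 * B is the number of bytes each process sends to every other process. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int alltoall_leader(const char *sendbuf, char *recvbuf, int B, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int lrank, c;                           /* local rank, cores per node  */
    MPI_Comm_rank(node_comm, &lrank);
    MPI_Comm_size(node_comm, &c);
    int n = p / c;                          /* number of nodes             */
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    char *agg = NULL, *stage = NULL, *recvstage = NULL, *dist = NULL;
    if (lrank == 0) {
        agg       = malloc((size_t)c * p * B);   /* [src local l][dest d]  */
        stage     = malloc((size_t)c * p * B);   /* grouped by dest node   */
        recvstage = malloc((size_t)c * p * B);   /* grouped by source node */
        dist      = malloc((size_t)c * p * B);   /* [dest local k][src]    */
    }

    /* Step 1: every core hands its full outgoing buffer to the node leader. */
    MPI_Gather((void *)sendbuf, p * B, MPI_BYTE, agg, p * B, MPI_BYTE,
               0, node_comm);

    if (lrank == 0) {
        /* Pack so that all data bound for node t is contiguous. */
        for (int t = 0; t < n; t++)
            for (int l = 0; l < c; l++)
                for (int k = 0; k < c; k++)
                    memcpy(stage + ((size_t)t * c * c + l * c + k) * B,
                           agg   + ((size_t)l * p + t * c + k) * B, B);

        /* Step 2: a single inter-node AlltoAll among the leaders. */
        MPI_Alltoall(stage, c * c * B, MPI_BYTE,
                     recvstage, c * c * B, MPI_BYTE, leader_comm);

        /* Unpack into per-destination-core order for the scatter below. */
        for (int s = 0; s < n; s++)
            for (int l = 0; l < c; l++)
                for (int k = 0; k < c; k++)
                    memcpy(dist      + ((size_t)k * p + s * c + l) * B,
                           recvstage + ((size_t)s * c * c + l * c + k) * B, B);
    }

    /* Step 3: the leader distributes each core's incoming blocks. */
    MPI_Scatter(dist, p * B, MPI_BYTE, recvbuf, p * B, MPI_BYTE, 0, node_comm);

    if (lrank == 0) { free(agg); free(stage); free(recvstage); free(dist); }
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}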
Performance Results: AlltoAll (InfiniPath)
InfiniPath with 512 byte message
AlltoAll: InfiniHost III
AlltoAll: ConnectX
CPMD Application
• CPMD makes extensive use of AlltoAll
• The proposed algorithm shows better performance on the 128-core system
This demonstrates scalability: as the system size increases, the new algorithm performs better
Conclusion & Future Work
• The proposed design reduces MPI_Alltoall time by 55%
• Speeds up CPMD by 33%
• Evaluate on a 10GigE system in the future
• Extend this work to other collectives
Designing Multi-Leader-Based Allgather Algorithms for Multi-Core Clusters
Allgather
• Each process broadcasts a vector to each of the other processes
• Algorithms used:
  Recursive Doubling (small messages): tcomm = ts * log(p) + tw * (p - 1) * m
  Ring (large messages): tcomm = (ts + tw * m) * (p - 1)
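As an illustration of these costs (our own numbers, not from the slides): with p = 64 processes, recursive doubling pays the startup term ts only log2(64) = 6 times, whereas ring pays it 63 times, while both move the same total volume, tw * 63 * m. Hence recursive doubling is preferred for small messages, where startup latency dominates, and ring for large messages, where the bandwidth term dominates and its nearest-neighbor pattern keeps contention low.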
Scaling on Multi-cores
Problem Statement
• Can there be an algorithm that is multi-core and NUMA aware, achieving better performance and scalability as both core counts and system size increase?
Single Leader – Performance
Proposed Multi-Leader Scheme
Multi-Leader Scheme – Step 1
Multi-Leader Scheme – Step 2
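A compact sketch of the leader-based allgather pattern behind these steps (our own reconstruction with one leader per node; the proposed multi-leader scheme splits each node into several such groups): gather locally to the leader, allgather among the leaders, then hand the assembled vector back within the node. Intra-node steps are shown with MPI calls, whereas the design exchanges them through shared memory; ranks are assumed block-mapped to nodes with equal core counts per node.

/* Hedged sketch of a leader-based allgather (one leader per node). */
#include <mpi.h>
#include <stdlib.h>

int allgather_leader(const void *sendbuf, int count, MPI_Datatype dtype,
                     void *recvbuf, MPI_Comm comm)
{
    int rank, p, type_size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    MPI_Type_size(dtype, &type_size);

    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                        &node_comm);
    int lrank, c;
    MPI_Comm_rank(node_comm, &lrank);
    MPI_Comm_size(node_comm, &c);
    MPI_Comm_split(comm, lrank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

    /* Step 1: gather this node's blocks at the leader. */
    char *node_blocks = (lrank == 0) ? malloc((size_t)c * count * type_size)
                                     : NULL;
    MPI_Gather((void *)sendbuf, count, dtype, node_blocks, count, dtype,
               0, node_comm);

    /* Step 2: allgather among the leaders only. */
    if (lrank == 0)
        MPI_Allgather(node_blocks, c * count, dtype,
                      recvbuf, c * count, dtype, leader_comm);

    /* Step 3: the leader hands the complete result to its local ranks. */
    MPI_Bcast(recvbuf, p * count, dtype, 0, node_comm);

    free(node_blocks);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}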
Performance Results
Multi-Leader pt2pt vs shmem
Performance in large scale multi-cores
Proposed Unified Scheme
Conclusion & Future Work
• The proposed multi-leader scheme shows improved scalability and reduced memory contention
• Future work: devise an algorithm that can choose the number of leaders optimally in every scenario
• Real-world applications
• Examine the benefits of kernel-based zero-copy intra-node exchanges for large messages
Thank You