Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters
Rinku Gupta, Dell Computers
Dhabaleswar Panda, The Ohio State University
Pavan Balaji, The Ohio State University
Jarek Nieplocha, Pacific Northwest National Lab
Contents
• Motivation
• Design Issues
• RDMA-based Broadcast
• RDMA-based All Reduce
• Conclusions and Future Work
Motivation
• Communication Characteristics of Parallel Applications
• Point-to-Point Communication
  o Send and Receive primitives
• Collective Communication
  o Barrier, Broadcast, Reduce, All Reduce
  o Built over Send-Receive Communication primitives
• Communication Methods for Modern Protocols
• Send and Receive Model
• Remote Direct Memory Access (RDMA) Model
Remote Direct Memory Access
• Remote Direct Memory Access (RDMA) Model
  o RDMA Write
  o RDMA Read (Optional)
• Widely supported by modern protocols and architectures
  o Virtual Interface Architecture (VIA)
  o InfiniBand Architecture (IBA)
• Open Questions
  o Can RDMA be used to optimize Collective Communication? [rin02]
  o Do we need to rethink algorithms optimized for Send-Receive?
[rin02]: “Efficient Barrier using Remote Memory Operations on VIA-based Clusters”, Rinku Gupta, V. Tipparaju, J. Nieplocha, D. K. Panda. Presented at Cluster 2002, Chicago, USA
Send-Receive and RDMA Communication Models
[Figure, two panels: Send/Recv and RDMA Write. Each panel shows a sender (S) and a receiver (R), each with a registered user buffer and a NIC. In the Send/Recv panel both sides have a descriptor; in the RDMA Write panel only one descriptor appears.]
Benefits of RDMA
• RDMA gives a shared memory illusion
• Receive operations are typically expensive
• RDMA is Receiver transparent
• Supported by VIA and InfiniBand architecture
• A novel unexplored method
Contents
• Motivation
• Design Issues
  o Buffer Registration
  o Data Validity at Receiver End
  o Buffer Reuse
• RDMA-based Broadcast
• RDMA-based All Reduce
• Conclusions and Future Work
Buffer Registration
• Static Buffer Registration
  o Contiguous region in memory for every communicator
  o Address exchange is done during initialization time
• Dynamic Buffer Registration (Rendezvous)
  o User buffers are registered during the operation, when needed
  o Address exchange is done during the operation
Data Validity at Receiver End
• Interrupts
  o Too expensive; might not be supported
• Use the Immediate field of the VIA descriptor
  o Consumes a receive descriptor
• RDMA Write a special byte to a pre-defined location
Buffer Reuse
• Static Buffer Registration
  o Buffers need to be reused
  o Explicit notification has to be sent to the sender
• Dynamic Buffer Registration
  o No buffer reuse
Contents
• Motivation
• Design Issues
• RDMA-based Broadcast
  o Design Issues
  o Experimental Results
  o Analytical Models
• RDMA-based All Reduce
• Conclusions and Future Work
Buffer Registration and Initialization
• Static Registration Scheme (for size <= 5K bytes)
[Figure: processes P0 to P3, each holding a buffer of constant-size blocks and a Notify Buffer; every entry is initialized to -1.]
• Dynamic Registration Scheme (for size > 5K bytes): Rendezvous scheme
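The size threshold on this slide amounts to a simple dispatch between the two schemes. The sketch below only illustrates that rule: the function name, the constant name, and the reading of "5K" as 5 × 1024 bytes are assumptions, not taken from MVICH.

```python
# Assumption: "5K" on the slide means 5 * 1024 bytes.
STATIC_LIMIT_BYTES = 5 * 1024


def choose_registration_scheme(message_bytes: int) -> str:
    """Return the registration scheme a message of this size would use."""
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    # Up to the threshold: the statically registered buffer is used.
    # Above it: the user buffer is registered on demand (rendezvous).
    return "static" if message_bytes <= STATIC_LIMIT_BYTES else "dynamic"
```

With this reading, a 5120-byte message still takes the static path and a 5121-byte message is the first to use rendezvous.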
Data Validity at Receiver End
• Broadcast counter = 1 (First Broadcast with Root P0)
[Figure: buffers of P0 to P3 with constant block size and entries initialized to -1. Labels: Data size, Broadcast counter, Notify Buffer.]
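One possible reading of the "Data size" and "Broadcast counter" labels is a small framing scheme: the sender writes the size, then the payload, then the counter, and the receiver polls until both fields look valid. The sketch below simulates that idea in local memory. The field widths, their order, and every function name are assumptions made for illustration; `rdma_write` is a local stand-in rather than a VIA call, and the sketch covers only a fresh block (all -1), not buffer reuse.

```python
import struct
from typing import Optional

EMPTY = -1                   # initial value of every field, as in the figure
FIELD = struct.Struct("<i")  # assumption: 4-byte signed size and counter fields


def new_block(block_size: int) -> bytearray:
    """Create a block in which every 4-byte word reads as -1."""
    return bytearray(b"\xff" * block_size)


def rdma_write(block: bytearray, payload: bytes, counter: int) -> None:
    """Stand-in for the sender's RDMA Write: size, payload, then counter."""
    FIELD.pack_into(block, 0, len(payload))
    block[FIELD.size:FIELD.size + len(payload)] = payload
    FIELD.pack_into(block, FIELD.size + len(payload), counter)


def try_receive(block: bytearray, expected_counter: int) -> Optional[bytes]:
    """Return the payload once it is complete, else None (keep polling)."""
    (size,) = FIELD.unpack_from(block, 0)
    if size == EMPTY:
        return None          # nothing has arrived yet
    (counter,) = FIELD.unpack_from(block, FIELD.size + size)
    if counter != expected_counter:
        return None          # trailing counter not written yet
    return bytes(block[FIELD.size:FIELD.size + size])
```

Because the counter is the last field written, a receiver that sees the expected counter can treat the payload as complete, provided the bytes of one write land in order.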
Buffer Reuse
[Figure: Broadcast Buffer and Notify Buffer of P0 to P3, with entries holding the values 1 and 2.]
Performance Test Bed
• 16 nodes, each with a 1 GHz PIII, 33 MHz PCI bus, 512 MB RAM
• Machines connected using a GigaNet cLAN 5300 switch
• MVICH version: mvich-1.0
  o Integration with MVICH-1.0
  o MPI_Send modified to support RDMA Write
• Timings were taken for varying block sizes
  o Tradeoff between number of blocks and size of blocks
RDMA Vs Send-Receive Broadcast (16 nodes)
[Figure: Latency (us) vs. Message Size (bytes), 4 to 4608 bytes. Series: RDMA with 4K, 3K, 2K and 1K bytes/block, and Send-Receive. Annotations: 19.7%, 14.4%.]
• Improvement ranging from 14.4% (large messages) to 19.7% (small messages)
• A block size of 3K performs the best
Analytical and Experimental Comparison (16 nodes): Broadcast
[Figure: Latency (us) vs. Message Size (bytes), 4 to 4608 bytes. Series: Analytical, Experimental.]
• Error of less than 7%
RDMA Vs Send-Receive for Large Clusters (Analytical Model Estimates: Broadcast)
[Figure, two panels: 512 Node Broadcast and 1024 Node Broadcast. Latency (us) vs. Message Size (bytes), 4 to 4096 bytes. Series: Send-Receive, RDMA. Annotations on each panel: 16%, 21%.]
• Estimated improvement ranging from 16% (small messages) to 21% (large messages) for large clusters of 512 and 1024 nodes
Contents
• Motivation
• Design Issues
• RDMA-based Broadcast
• RDMA-based All Reduce
  o Degree-K tree
  o Experimental Results (Binomial & Degree-K)
  o Analytical Models (Binomial & Degree-K)
• Conclusions and Future Work
Degree-K tree-based Reduce
[Figure: reduce trees over P0 to P7 for K = 1, K = 3 and K = 7, with each group labeled by its step number. K = 1 uses steps 1 to 3 (four groups, then two, then one); K = 3 uses steps 1 and 2 (two groups, then one); K = 7 uses a single step.]
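The step labels in this figure can be reproduced with a short schedule generator. This is a minimal sketch under the assumption that, at each step, groups of K + 1 surviving processes form and the K higher-ranked members send to the lowest-ranked one; the function name and the output format are illustrative, not the paper's.

```python
def degree_k_reduce_schedule(n: int, k: int) -> list[list[tuple[int, list[int]]]]:
    """Return, per step, a list of (receiver, senders) groups for n processes."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be at least 1")
    steps = []
    stride = 1  # distance between processes still holding partial results
    while stride < n:
        groups = []
        for root in range(0, n, stride * (k + 1)):
            # Up to k peers, spaced `stride` apart, send to `root`.
            senders = [root + i * stride
                       for i in range(1, k + 1)
                       if root + i * stride < n]
            if senders:
                groups.append((root, senders))
        steps.append(groups)
        stride *= k + 1
    return steps


for k in (1, 3, 7):
    schedule = degree_k_reduce_schedule(8, k)
    print(f"K = {k}: {len(schedule)} step(s)")
```

For 8 processes this yields 3, 2 and 1 steps for K = 1, 3 and 7. A higher degree trades fewer steps for more messages arriving at each receiver per step.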
Experimental Evaluation
• Integrated into MVICH-1.0
• Reduction Operation = MPI_SUM
• Data type = 1 INT (data size = 4 bytes)
• Count = 1 (4 bytes) to 1024 (4096 bytes)
• Finding the optimal Degree-K
• Experimental Vs Analytical (best case & worst case)
• Experimental and analytical comparison of Send-Receive with RDMA
Choosing the Optimal Degree-K for All Reduce
[Figure: Optimal Degree-K (16 nodes). Latency (us) vs. Message Size (bytes), 4 to 4096 bytes. Series: Degree 1, Degree 3, Degree 7, Degree 15.]

| Nodes | 4-256B | 256B-1KB | Beyond 1KB |
|---|---|---|---|
| 4 nodes | Degree-3 | Degree-3 | Degree-1 |
| 8 nodes | Degree-7 | Degree-3 | Degree-1 |
| 16 nodes | Degree-3 | Degree-3 | Degree-1 |

• For lower message sizes, higher degrees perform better than degree-1 (binomial)
Degree-K RDMA-based All Reduce Analytical Model
• Experimental timings fall between the best case and the worst case analytical estimates
• For lower message sizes, higher degrees perform better than degree-1 (binomial)
[Figure: Experimental Vs Analytical (Degree 3: 16 nodes). Latency (us) vs. Message Size (bytes), 4 to 4096 bytes. Series: Analytical (Best), Analytical (Worst), Experimental.]

| Nodes | 4-256B | 256B-1KB | Beyond 1KB |
|---|---|---|---|
| 4 nodes | Degree-3 | Degree-3 | Degree-1 |
| 8 nodes | Degree-7 | Degree-3 | Degree-1 |
| 16 nodes | Degree-3 | Degree-3 | Degree-1 |
| 512 nodes | Degree-3 | Degree-3 | Degree-1 |
| 1024 nodes | Degree-3 | Degree-3 | Degree-1 |
Binomial Send-Receive Vs Optimal & Binomial Degree-K RDMA (16 nodes): All Reduce
[Figure: Latency (us) vs. Message Size (bytes), 4 to 4096 bytes. Series: Binomial Send-Receive, Optimal Degree-K RDMA, Binomial RDMA. Annotations: 38.13%, 9%.]
• Improvement ranging from 9% (large messages) to 38.13% (small messages) for the optimal degree-K RDMA-based All Reduce compared to Binomial Send-Receive
Binomial Send-Receive Vs Binomial & Optimal Degree-K All Reduce for large clusters
[Figure, two panels: 512 Node All Reduce and 1024 Node All Reduce. Latency (us) vs. Message Size (bytes), 4 to 4096 bytes. Series: Binomial Send-Receive, Optimal Degree-K (best case), Optimal Degree-K (worst case), Binomial RDMA. Annotations: 35-40% and 14%; 35-41% and 14%.]
• Improvement ranging from 14% (large messages) to 35-40% (small messages) for the optimal degree-K RDMA-based All Reduce compared to Binomial Send-Receive
Contents
• Motivation
• Design Issues
• RDMA-based Broadcast
• RDMA-based All Reduce
• Conclusions and Future Work
Conclusions
• Novel method to implement the collective communication library
• Degree-K algorithm to exploit the benefits of RDMA
• Implemented the RDMA-based Broadcast and All Reduce
  o Broadcast: 19.7% improvement for small and 14.4% for large messages (16 nodes)
  o All Reduce: 38.13% for small messages, 9.32% for large messages (16 nodes)
• Analytical models for Broadcast and All Reduce
  o Estimate performance benefits for large clusters
  o Broadcast: 16-21% for 512 and 1024 node clusters
  o All Reduce: 14-40% for 512 and 1024 node clusters
Future Work
• Exploit the RDMA Read feature if available
  o Round-trip cost design issues
• Extend to MPI-2.0
  o One-sided Communication
• Extend framework to emerging InfiniBand architecture
Thank You!
For more information, please visit the NBC Home Page: http://nowlab.cis.ohio-state.edu
Network Based Computing Group, The Ohio State University
Backup Slides
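Receiver Side Best for Large messages (Analytical Model)
[Figure: timelines for P1, P2 and P3, each made up of the segments Tt, Tn, Ts and To.]
Receiver-side time (best case) $= (T_t \times k) + T_n + T_s + T_o + T_c$, where $k$ is the number of sending nodes.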
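Receiver Side Worst for Large messages (Analytical Model)
[Figure: timelines for P1, P2 and P3, each made up of the segments Tt, Tn, Ts and To.]
Receiver-side time (worst case) $= (T_t \times k) + T_n + T_s + (T_o \times k) + T_c$, where $k$ is the number of sending nodes.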
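The best-case and worst-case receiver-side expressions on these backup slides differ only in how often the To term is paid. The sketch below evaluates both so the gap can be seen directly. The class and function names are illustrative, the microsecond unit is an assumption based on the latency axes of the charts, and no real measurements are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Cost terms named as on the slides (assumed to be in microseconds)."""
    Tt: float
    Tn: float
    Ts: float
    To: float
    Tc: float


def receiver_best(c: Costs, k: int) -> float:
    """Best case: (Tt * k) + Tn + Ts + To + Tc, with k sending nodes."""
    return c.Tt * k + c.Tn + c.Ts + c.To + c.Tc


def receiver_worst(c: Costs, k: int) -> float:
    """Worst case: (Tt * k) + Tn + Ts + (To * k) + Tc, with k sending nodes."""
    return c.Tt * k + c.Tn + c.Ts + c.To * k + c.Tc
```

The two cases coincide for k = 1 and otherwise differ by To × (k − 1), so the spread between the estimates grows with the tree degree.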
Buffer Registration and Initialization
• Static Registration Scheme (for size <= 5K)
[Figure: processes P0 to P3, each holding a buffer of constant-size blocks (5K+1); blocks are labeled by process, e.g. P1, P2, P3.]
• Each block is of size 5K+1. Every process has N blocks, where N is the number of processes in the communicator.
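The sizing rule above fixes the static buffer footprint per process. A small sketch of the arithmetic follows; it assumes "5K" means 5 × 1024 bytes and, purely for illustration, that block i is reserved for the process of rank i. Neither the names nor that mapping come from the slides.

```python
# "5K+1" from the slide, assuming K = 1024.
BLOCK_BYTES = 5 * 1024 + 1


def static_buffer_bytes(num_processes: int) -> int:
    """Total static buffer size per process: N blocks of 5K+1 bytes each."""
    if num_processes < 1:
        raise ValueError("a communicator has at least one process")
    return num_processes * BLOCK_BYTES


def block_offset(peer_rank: int, num_processes: int) -> int:
    """Byte offset of the block assumed to belong to `peer_rank`."""
    if not 0 <= peer_rank < num_processes:
        raise ValueError("peer_rank out of range")
    return peer_rank * BLOCK_BYTES
```

For the four-process communicator in the figure this comes to 20484 bytes per process, and the footprint grows linearly with the communicator size.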
Data Validity at Receiver End
[Figure: four snapshots of the buffers of P0 to P3, showing incoming blocks (Data 1, Data 2) and the resulting Computed Data.]