Kernel-Level Support for Scalable Intra-Node Collective Communications
Hyun-Wook Jin and Joong-Yeon Cho
System Software Laboratory
Dept. of Computer Science and Engineering
Konkuk University
Contents
• MPI intra-node communication
• Intra-node collective communication
  – MPI_Bcast()
  – MPI_Gather()
• Conclusions and future work
Multi/Many-Core Processors
• Xeon 5100 Series (Woodcrest): 2 cores
• Xeon 5500 Series (Gainestown): 4 cores
• Xeon 5600 Series (Westmere-EP): 6 cores
• Xeon E5-2600 Series (Sandy Bridge-EP): 8 cores
• Xeon E5-2600 v2 Series (Ivy Bridge-EP): 12 cores
• Xeon E5-2600 v3 Series (Haswell-EP): 18 cores
• Xeon E7 v4 Family (Broadwell): 24 cores
• Xeon Platinum Series (Skylake): 28 cores
• Xeon Phi X100 Series (Knights Corner): 61 cores
• Xeon Phi 7200 Series (Knights Landing): 72 cores
MPI Intra-Node Communication
• Loopback
  – NIC provides a loopback path
  – Two DMAs
• Shared memory
  – Communicate through a memory area shared between MPI processes
  – Two data copies
[Figure: two data paths between Process A (Processor i) and Process B (Processor j). Loopback: send buffer, DMA, NIC, DMA, receive buffer. Shared memory: send buffer, copy, shared memory, copy, receive buffer.]
MPI Intra-Node Communication
• Memory mapping
  – Directly move a message from the source buffer to the destination buffer by means of kernel-level support
  – Single data copy
    • Beneficial for large messages
[Figure: direct copy from the send buffer of Process A (Processor i) to the receive buffer of Process B (Processor j)]
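The difference between the shared-memory path and the memory-mapping path comes down to how many times the payload is copied. The sketch below is a single-process Python model, not real inter-process code: the buffers are plain `bytearray` objects, and the function names `two_copy_transfer` and `one_copy_transfer` are made up for illustration. It only counts the bytes each path has to move.

```python
def two_copy_transfer(send_buf: bytes, shared: bytearray, recv_buf: bytearray) -> int:
    """Model of the shared-memory path: the sender copies in, the receiver copies out."""
    n = len(send_buf)
    shared[:n] = send_buf        # copy 1: send buffer -> shared memory
    recv_buf[:n] = shared[:n]    # copy 2: shared memory -> receive buffer
    return 2 * n                 # bytes moved by the protocol


def one_copy_transfer(send_buf: bytes, recv_buf: bytearray) -> int:
    """Model of the memory-mapping path: one direct copy."""
    n = len(send_buf)
    recv_buf[:n] = send_buf      # single copy: send buffer -> receive buffer
    return n


message = bytes(range(256)) * 1024          # 256 KB payload (hypothetical size)
shared = bytearray(len(message))            # stands in for the shared region
recv_a = bytearray(len(message))
recv_b = bytearray(len(message))

moved_two = two_copy_transfer(message, shared, recv_a)
moved_one = one_copy_transfer(message, recv_b)
print("two-copy bytes moved:", moved_two)
print("one-copy bytes moved:", moved_one)
```

Both receivers end up with the same data; the one-copy path simply moves half as many bytes, which is why it tends to pay off for large messages.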
Kernel-Level Support for MMapping
• LiMIC2
  – Opened the era of one-copy intra-node communication
    • H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, “LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster,” In Proc. of ICPP-05, Jun. 2005.
    • H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, “Lightweight Kernel-Level Primitives for High-Performance MPI Intra-Node Communication over Multi-Core Systems,” In Proc. of IEEE Cluster 2007, Sep. 2007.
  – LiMIC2-0.5 was publicly released with MVAPICH2-1.4RC1 (Jun. 2009)
  – LiMIC2-0.5.6 is being released with the latest MVAPICH2
    • mvapich2-src]$ ./configure --with-limic2 [omit other configure options]
    • mvapich2-src]$ mpirun_rsh -np 4 -hostfile ~/hosts MV2_SMP_USE_LIMIC2=1 [path to application]
Kernel-Level Support for MMapping
• CMA
  – In-kernel implementation + new system calls
    • J. Vienne, “Benefits of Cross Memory Attach for MPI Libraries on HPC Clusters,” In Proc. of XSEDE 14, Jul. 2014.
  – Default intra-node communication channel for large messages in MVAPICH2
• XPMEM
  – Supports memory mapping to user-level address space
    • B. Kocoloski and J. Lange, “XEMEM: Efficient Shared Memory for Composed Applications on Multi-OS/R Exascale Systems,” In Proc. of HPDC 2015, 2015.
Intra-Node Collective Communication
• MPI_Bcast()
  – Broadcasts a message from the root to all other processes of the communicator
    • One-to-Many: Root -> Other processes
  – MVAPICH2 (version 2.3) uses the collective-aware shared memory
• MPI_Gather()
  – Gathers together values from a group of processes
    • Many-to-One: All processes -> Root
  – MVAPICH2 (version 2.3) uses the kernel-level support (either CMA or LiMIC2) for large messages
MPI_BCAST
MPI_Bcast() in MVAPICH2 (v.2.3)
[Figure: the source buffer of the root process, the collective-aware shared memory, and the destination buffers of the other processes (Processes B, C, …, N)]
1. The root process copies 8KB data blocks to the shared memory
2. The other processes copy the 8KB data blocks to their destination buffers
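The two steps above can be sketched as a small single-process model. This is an illustration under assumptions, not MVAPICH2 code: the function name `bcast_via_shared_memory` is hypothetical, the "shared memory" is a single 8KB `bytearray` staging block, and the receivers copy one after another rather than concurrently. Only the 8KB block size comes from the slide.

```python
BLOCK = 8 * 1024  # 8 KB block size, as stated on the slide


def bcast_via_shared_memory(src: bytes, dests: list[bytearray]) -> int:
    """Broadcast src into every buffer in dests through one staging block.

    Returns the total number of bytes copied by the root and the receivers.
    """
    shared_block = bytearray(BLOCK)      # stands in for the shared-memory region
    copied = 0
    for off in range(0, len(src), BLOCK):
        chunk = src[off:off + BLOCK]
        n = len(chunk)
        shared_block[:n] = chunk         # step 1: root copies a block into shared memory
        copied += n
        for dst in dests:                # step 2: each receiver copies the block out
            dst[off:off + n] = shared_block[:n]
            copied += n
    return copied


message = bytes(i % 251 for i in range(256 * 1024))      # 256 KB test payload
receivers = [bytearray(len(message)) for _ in range(3)]   # three non-root processes
total = bcast_via_shared_memory(message, receivers)
print("bytes copied in total:", total)
```

In this model the root copies the message once and every receiver copies it once more, so the total copy volume grows with the number of receivers even though each receiver only handles its own share.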
How bad is LiMIC2 for MPI_Bcast()?
• Experimentally applied LiMIC2 instead of shared memory
  – Shows up to 548% higher latency
Why not use LiMIC2 in MPI_Bcast()?
• What we expected…
[Figure: timeline of MPI_Bcast() across P0 (Root), P1, P2, and P3: send descriptor, memory mapping, data copy, memory unmapping, return]
Why not use LiMIC2 in MPI_Bcast()?
• What actually happened…
[Figure: timeline of MPI_Bcast() across P0 (Root), P1, P2, and P3: send descriptor, memory mapping (get_user_pages()), data copy, memory unmapping, return]
MPI_Bcast() with LiMIC2-overlap
• The root performs memory mapping and the others reuse (share) the mapped area
[Figure: timeline of MPI_Bcast() with LiMIC2-overlap across P0 (Root), P1, P2, and P3: send descriptor, memory mapping, data copy, memory unmapping, return]
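One way to reason about why sharing the mapping helps is a toy cost model. It rests on an assumption the slides do not spell out: that when every receiver maps the root's buffer itself, the mapping steps end up effectively serialized, while the data copies still run in parallel. The function names and the timing values below are hypothetical and chosen only to make the arithmetic visible.

```python
def latency_per_receiver_mapping(receivers: int, t_map: float, t_copy: float) -> float:
    """Every receiver maps the root buffer itself.

    Assumption: the mapping steps are serialized, so the last receiver waits
    for all mappings before it can start its copy.
    """
    return receivers * t_map + t_copy


def latency_shared_mapping(t_map: float, t_copy: float) -> float:
    """The root maps once; receivers reuse the mapping and copy in parallel."""
    return t_map + t_copy


T_MAP = 2.0    # hypothetical cost of one mapping, in arbitrary time units
T_COPY = 10.0  # hypothetical cost of one data copy, in the same units

for receivers in (3, 19, 119):
    separate = latency_per_receiver_mapping(receivers, T_MAP, T_COPY)
    shared = latency_shared_mapping(T_MAP, T_COPY)
    print(f"{receivers:4d} receivers: per-receiver mapping {separate:6.1f}, "
          f"shared mapping {shared:5.1f}")
```

Under these assumptions the per-receiver-mapping cost grows linearly with the number of receivers, while the shared-mapping cost stays flat, which matches the direction of the measurements reported for the larger system.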
Preliminary Measurement Results
• 20-core system
  – Intel Xeon Haswell Deca-Core x 2
  – LiMIC2-overlap reduces the latency by up to 68%
• 120-core system
  – Intel Xeon IvyBridge 15-Core x 8
  – LiMIC2-overlap reduces the latency by up to 84%
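To put the percentages in perspective, a latency reduction can be converted into a speedup factor. The helper below assumes "reduces the latency by r%" means (old - new) / old = r / 100; the function name is made up for illustration, and the figures are best-case ("up to") values from the slide.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Speedup factor implied by a latency reduction given in percent."""
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


for system, reduction in (("20-core", 68.0), ("120-core", 84.0)):
    factor = speedup_from_reduction(reduction)
    print(f"{system}: {reduction:.0f}% lower latency is about {factor:.2f}x faster")
```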
What’s going on in MVAPICH2 Bcast?
[Figure: the source buffer of the root process, the collective-aware shared memory, and the destination buffers of the other processes (Processes B, C, …, N)]
1. The root process copies 8KB data blocks to the shared memory
2. The other processes copy the 8KB data blocks to their destination buffers
What’s going on in MVAPICH2 Bcast?
• Collective-aware shared memory
• LiMIC2-overlap
* Message size: 256KB
* Some profiling overheads are included
What’s going on in MVAPICH2 Bcast?
• Data copy operations are not overlapped as much as expected
[Figure: timeline of copy steps 1 and 2 per block; axes: Block ID and Time]
MPI_GATHER
MPI_Gather() in MVAPICH2 (v.2.3)
[Figure: the source buffers of Processes P1, P2, …, P(N-1), an intermediate buffer, and the destination buffer of the root process (P0)]
1. Allocates an intermediate buffer
2. Moves messages to the intermediate buffer via point-to-point communication
3. Copies the gathered messages to the destination buffer
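The three steps above can be modelled in a few lines to show where the extra copy comes from. This is a single-process sketch with hypothetical names (`gather_with_intermediate`), equal-sized contributions, and plain slice assignments standing in for point-to-point transfers; placing the root's own contribution directly into the destination buffer is an assumption of the sketch.

```python
def gather_with_intermediate(sources: list[bytes]) -> tuple[bytearray, int]:
    """sources[i] is the equal-sized contribution of rank i (rank 0 is the root).

    Returns the destination buffer and the number of bytes copied.
    """
    size = len(sources[0])
    dest = bytearray(size * len(sources))
    copied = 0

    dest[0:size] = sources[0]            # root's own data (assumed placed directly)
    copied += size

    # Step 1: allocate an intermediate buffer for the non-root contributions.
    intermediate = bytearray(size * (len(sources) - 1))

    # Step 2: move each message into the intermediate buffer
    # (a slice assignment stands in for point-to-point communication).
    for i, src in enumerate(sources[1:]):
        intermediate[i * size:(i + 1) * size] = src
        copied += size

    # Step 3: copy the gathered messages into the destination buffer.
    dest[size:] = intermediate
    copied += len(intermediate)
    return dest, copied


sources = [bytes([rank]) * 4 for rank in range(4)]   # 4 ranks, 4 bytes each
dest, copied = gather_with_intermediate(sources)
print(bytes(dest), copied)
```

In this model each non-root contribution is copied twice, once into the intermediate buffer and once into the destination buffer, on top of whatever the point-to-point channel itself costs.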
Is it OK to use LiMIC2 in MPI_Gather()?
* Message size: 256KB
* Some profiling overheads are included
Why not use LiMIC2 in MPI_Gather()?
[Figure: timeline of MPI_Gather() across P0 (Root), P1, P2, and P3: send descriptor, then three rounds of memory mapping, data copy, and memory unmapping (gather for P1, gather for P2, gather for P3), then return]
MPI_Gather() with LiMIC2-overlap
[Figure: timeline of MPI_Gather() with LiMIC2-overlap across P0 (Root), P1, P2, and P3: memory mapping, send descriptor, data copy, memory unmapping, return]
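One plausible reading of this timeline, and it is only a reading, is that the root maps its destination buffer once and each non-root process then copies its own block into that mapping in parallel. The sketch below models that idea with Python threads and a `memoryview`; the function name `gather_with_shared_mapping` and the use of threads in place of MPI processes are assumptions made for illustration.

```python
import threading


def gather_with_shared_mapping(sources: list[bytes]) -> tuple[bytearray, int]:
    """sources[i] is the equal-sized contribution of rank i (rank 0 is the root).

    Returns the destination buffer and the number of mappings that were made.
    """
    size = len(sources[0])
    dest = bytearray(size * len(sources))

    mapped = memoryview(dest)        # stands in for the single mapping made by the root
    mappings = 1
    mapped[0:size] = sources[0]      # root's own contribution

    def copy_own_block(rank: int) -> None:
        # Each non-root "process" writes only its own, disjoint slice.
        mapped[rank * size:(rank + 1) * size] = sources[rank]

    workers = [threading.Thread(target=copy_own_block, args=(rank,))
               for rank in range(1, len(sources))]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    mapped.release()                 # stands in for the single unmapping
    return dest, mappings


sources = [bytes([rank]) * 4 for rank in range(4)]   # 4 ranks, 4 bytes each
dest, mappings = gather_with_shared_mapping(sources)
print(bytes(dest), mappings)
```

Compared with mapping, copying, and unmapping once per sender at the root, this arrangement needs a single mapping and lets the copies proceed independently of one another.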
Preliminary Measurement Results
• 20-core system
  – LiMIC2-overlap reduces the latency by up to 88%
• 120-core system
  – LiMIC2-overlap reduces the latency by up to 50%
  – Different algorithms matter (e.g., binomial tree algorithm)
CONCLUSIONS
Concluding Remarks
• Intra-node collective communication
  – MPI_Bcast()
    • One-to-Many communication
    • Implemented using collective-aware shared memory
  – MPI_Gather()
    • Many-to-One communication
    • Implemented using point-to-point communication
• LiMIC2-overlap
  – New interfaces
    • Memory mapping reuse
    • Flexibility of who can perform data copy
  – Up to 84% latency reduction for MPI_Bcast()
  – Up to 88% latency reduction for MPI_Gather()
Ongoing Work
• Other collectives
  – MPI_Scatter()
    • LiMIC2-overlap reduces the latency by up to 78% on the 20-core system
  – MPI_Allgather()
    • LiMIC2-overlap reduces the latency by up to 38% on the 20-core system
Ongoing Work
• Overlapping between collective communication and computation
[Figure: two timelines across P0 (Root), P1, P2, and P3 showing MPI_Bcast(), return, and computation; the second timeline adds a sync step before return]
Future Work
LiMIC3
ParaMo 2019
• The 1st International Workshop on Parallel Programming Models in High-Performance Cloud
  – Co-located with Euro-Par 2019
  – Date: August 26, 2019
  – Venue: Göttingen, Germany
Thank You!