TRANSCRIPT
MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at Mellanox Booth (SC '19)
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Overview of the MVAPICH2 Project
• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (Supercomputing 2002)
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 14th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system
MVAPICH2 Release Timeline and Downloads
[Chart: cumulative number of downloads from the OSU site (0 to 600,000) versus time, Sep 2004 to Apr 2019, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3]
Architecture of MVAPICH2 Software Family (HPC and DL)
High Performance Parallel Programming Models
Message Passing Interface (MPI)
PGAS (UPC, OpenSHMEM, CAF, UPC++)
Hybrid – MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime
Diverse APIs and Mechanisms: Point-to-point Primitives, Collectives Algorithms, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Job Startup, Introspection & Analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter)
Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
Transport Protocols: RC, SRD, UD, DC
Modern Features: UMR, ODP, SR-IOV, Multi-Rail
Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM
Modern Features: Optane*, NVLink, CAPI*
* Upcoming
MVAPICH2 Software Family (Requirements and Matching Library)
• MVAPICH2 – MPI with IB, iWARP, Omni-Path, and RoCE
• MVAPICH2-X – Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE
• MVAPICH2-GDR – MPI with IB, RoCE & GPU and support for Deep Learning
• MVAPICH2-Virt – HPC Cloud with MPI & IB
• MVAPICH2-EA – Energy-aware MPI with IB, iWARP, and RoCE
• OEMT – MPI Energy Monitoring Tool
• OSU INAM – InfiniBand Network Analysis and Monitoring
• OMB – Microbenchmarks for Measuring MPI and PGAS Performance
MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
[Diagram: two nodes, each with two QPI-connected CPUs, GPUs attached as PCIe devices, host memory buffers, and an IB adapter]
Communication paths:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
... and more
• For each path different schemes: Shared_mem, IPC, GPUDirect RDMA, pipeline, ...
• Critical for runtimes to optimize data movement while hiding the complexity
• Connected as PCIe devices – flexibility but complexity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(GPU data movement handled inside MVAPICH2)
High Performance and High Productivity
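As a concrete illustration of the interface above, here is a minimal sketch (not from the slides; buffer size and data type are illustrative) of a CUDA-aware point-to-point exchange in which rank 0 sends a GPU-resident buffer directly to rank 1:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int count = 1 << 20;      /* 1M floats; illustrative size */
    float *devbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The communication buffer lives in GPU memory. With a CUDA-aware MPI
       library (e.g., MVAPICH2-GDR), the device pointer can be passed
       directly to MPI calls thanks to Unified Virtual Addressing. */
    cudaMalloc((void **)&devbuf, count * sizeof(float));

    if (rank == 0) {
        /* ... produce data in devbuf with a kernel or cudaMemcpy ... */
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}
```

The only CUDA-specific step is the allocation; the MPI calls themselves are the standard ones, which is the productivity benefit highlighted above.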
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
MVAPICH2-GDR 2.3.2
• Released on 08/08/2019
• Major Features and Enhancements
  – Based on MVAPICH2 2.3.1
  – Support for CUDA 10.1
  – Support for PGI 19.x
  – Enhanced intra-node and inter-node point-to-point performance
  – Enhanced MPI_Allreduce performance for DGX-2 systems
  – Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode
  – Enhanced performance of datatype support for GPU-resident data
    • Zero-copy transfer when P2P access is available between GPUs through NVLink/PCIe
  – Enhanced GPU-based point-to-point and collective tuning
    • OpenPOWER systems such as ORNL Summit and LLNL Sierra; ABCI system @ AIST; Owens and Pitzer systems @ Ohio Supercomputer Center
  – Scaled Allreduce to 24,576 Volta GPUs on Summit
  – Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8/POWER9 systems
  – Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
  – Enhanced small-message performance for CUDA-Aware MPI_Put and MPI_Get
  – Flexible support for running TensorFlow (Horovod) jobs
Optimized MVAPICH2-GDR Design
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth versus message size (1 B to 8 KB), MV2-GDR 2.3 versus MV2 without GDR; with GDR: 1.85 us latency (11X better) and 9x-10x higher bandwidth/bi-bandwidth]
MVAPICH2-GDR 2.3; Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores; NVIDIA Volta V100 GPU; Mellanox ConnectX-4 EDR HCA; CUDA 9.0; Mellanox OFED 4.0 with GPUDirect RDMA
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: average time steps per second (TPS) versus number of processes (4-32) for 64K-particle and 256K-particle runs; MV2+GDR delivers ~2X higher TPS than MV2 in both cases]
Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland
[Charts: normalized execution time versus number of GPUs for Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs)]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and the COSMO application
COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
  – Multi-stream communication for IPC
  – CMA-based intra-node communication support
  – Support for OpenPOWER and NVLink with GDRCopy 2
  – Maximal overlap in MPI datatype processing
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1
• Up to 16% higher Device-to-Device (D2D) bandwidth on OpenPOWER + NVLink interconnect
• Up to 30% higher D2D bandwidth on DGX-1 with NVLink
[Charts: point-to-point D-D bandwidth (MB/s) versus message size, 1-stream versus 4-stream CUDA IPC designs; 16% better at 128 KB - 4 MB on OpenPOWER and 30% better at 16 KB - 4 MB on DGX-1]
Available since MVAPICH2-GDR 2.3a
CMA-based Intra-node Communication Support
• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H bandwidth
[Charts: intra-node point-to-point H2H latency and bandwidth versus message size, MV2-GDR with CMA versus without CMA; ~30% better with CMA in both cases]
MVAPICH2-GDR 2.3.2; Intel Broadwell (E5-2680 v4 @ 2.4 GHz) node with 28 cores; NVIDIA Tesla K80 GPU and Mellanox ConnectX-4 EDR HCA; CUDA 8.0; Mellanox OFED 4.0 with GPUDirect RDMA
Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & Alltoall)
[Charts: Reduce and Alltoall latency (us) versus message size (4 B to 1 MB) for MVAPICH2-GDR, SpectrumMPI 10.1.0.2, and OpenMPI 3.0.0 on one node with PPN=20; annotated improvements of 3.6X and 5.2X for small messages and 1.2X-3.2X for large messages]
Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages respectively
D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)
[Charts: intra-node and inter-node D-to-D latency (small and large messages) and bandwidth versus message size]
Platform: OpenPOWER (POWER9, ppc64le) nodes equipped with dual-socket CPUs, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect
Intra-node latency: 0.76 us (with GDRCopy); intra-node bandwidth: 65.48 GB/sec for 4 MB (via NVLink2)
Inter-node latency: 2.18 us (with GDRCopy 2.0); inter-node bandwidth: 23 GB/sec for 4 MB (via 2-port EDR)
D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)
[Charts: intra-node D-H and H-D latency (small and large messages) and bandwidth versus message size, Spectrum MPI versus MV2-GDR]
Platform: OpenPOWER (POWER9, ppc64le) nodes equipped with dual-socket CPUs, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect
Intra-node D-H latency: 0.49 us (with GDRCopy); D-H bandwidth: 16.70 GB/sec for 2 MB (via NVLink2)
Intra-node H-D latency: 0.49 us (with GDRCopy 2.0); H-D bandwidth: 26.09 GB/sec for 2 MB (via NVLink2)
Managed Memory Performance (OpenPOWER Intra-node)
[Charts: intra-node latency, bandwidth, and bi-directional bandwidth for managed-memory (MD-to-MD) buffers]
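The "MD" label above appears to denote managed (unified) memory buffers. A minimal sketch of such a transfer, assuming a CUDA-aware MPI library like MVAPICH2-GDR and an illustrative message size:

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int count = 1 << 22;      /* illustrative message size */
    float *mbuf;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Managed (unified) memory: the same pointer is valid on host and
       device, and pages migrate on demand between CPU and GPU. */
    cudaMallocManaged((void **)&mbuf, count * sizeof(float),
                      cudaMemAttachGlobal);

    if (rank == 0)
        MPI_Send(mbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(mbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(mbuf);
    MPI_Finalize();
    return 0;
}
```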
MVAPICH2 with SHARP Support (Preliminary Results)
[Charts: Allreduce latency (us) versus message size (4-512 bytes), with and without SHARP: GPU Allreduce on 2 and 4 nodes (MVAPICH2-GDR-Next) and Host Allreduce on 2 and 8 nodes (MVAPICH2 2.3.2); annotated latencies with SHARP of 2.34 us, 3.41 us, 3.14 us, and 3.97 us]
Platform: OpenPOWER (POWER9, ppc64le) nodes equipped with dual-socket CPUs, 4 Volta V100 GPUs, and 2-port EDR InfiniBand interconnect
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
MPI Datatype Support in MVAPICH2
• Datatypes support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable MPI library to optimize non-contiguous data
At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
• Inside MVAPICH2
  – Use datatype-specific CUDA kernels to pack data in chunks
  – Efficiently move data between nodes using RDMA
  – In progress – currently optimizes vector and hindexed datatypes
  – Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
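A hedged, self-contained sketch of this pattern for a GPU-resident 2D field: a vector datatype describes one non-contiguous column (a halo boundary), and the same derived type is used on both the send and receive sides. The field dimensions, neighbor ranks, and the use of MPI_Sendrecv are illustrative, not taken from the slides.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange one boundary column of an NX x NY row-major field that lives
   in GPU memory; a CUDA-aware MPI packs/unpacks it with GPU kernels. */
#define NX 1024
#define NY 1024

int main(int argc, char **argv)
{
    int rank, size;
    double *field;
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaMalloc((void **)&field, (size_t)NX * NY * sizeof(double));

    /* NX blocks of 1 element each, separated by a stride of NY elements:
       one column of the row-major field. */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;

    /* Send our first column to the left neighbor and receive the right
       neighbor's boundary into our last column. */
    MPI_Sendrecv(field,            1, column, left,  0,
                 field + (NY - 1), 1, column, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(field);
    MPI_Finalize();
    return 0;
}
```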
MPI Datatype Processing (Computation Optimization)
• Comprehensive support
  – Targeted kernels for regular datatypes – vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
Common scenario (waste of computing resources on CPU and GPU):
    MPI_Isend(A, ..., datatype, ...);
    MPI_Isend(B, ..., datatype, ...);
    MPI_Isend(C, ..., datatype, ...);
    MPI_Isend(D, ..., datatype, ...);
    …
    MPI_Waitall(…);
*A, B, … contain non-contiguous MPI datatypes
Application: COMB
16 GPUs on a POWER9 system (test Comm mpi Mesh cuda Device Buffers mpi_type); times in seconds

| Library | pre-comm | post-recv | post-send | wait-recv | wait-send | post-comm | start-up | test-comm | bench-comm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spectrum MPI 10.3 | 0.0001 | 0.0000 | 1.6021 | 1.7204 | 0.0112 | 0.0001 | 0.0004 | 7.7383 | 83.6229 |
| MVAPICH2-GDR 2.3.2 | 0.0001 | 0.0000 | 0.0862 | 0.0871 | 0.0018 | 0.0001 | 0.0009 | 0.3558 | 4.4396 |
| MVAPICH2-GDR 2.3.3 (upcoming) | 0.0001 | 0.0000 | 0.0030 | 0.0032 | 0.0001 | 0.0001 | 0.0009 | 0.0133 | 0.1602 |

Annotated improvements: 18x and 27x
– Improvements due to enhanced support for GPU-kernel based packing/unpacking routines
Run scripts pushed to the COMB GitHub repo: https://github.com/LLNL/Comb/pull/2
Application: HYPRE – BoomerAMG
[Chart: sum of setup-phase and solve-phase times (seconds) versus number of GPUs (16, 32, 64, 128); Spectrum-MPI 10.3.0.1: 0.97, 1.196, 1.436, 1.756; MVAPICH2-GDR 2.3.2: 0.94, 1.01, 1.082, 1.628; annotated improvements of 16% and 25%]
RUN MVAPICH2-GDR 2.3.2:
    export MV2_USE_CUDA=1 MV2_USE_GDRCOPY=0 MV2_USE_RDMA_CM=0
    export MV2_USE_GPUDIRECT_LOOPBACK=0 MV2_HYBRID_BINDING_POLICY=spread MV2_IBA_HCA=mlx5_0:mlx5_3
    OMP_NUM_THREADS=20 lrun -n 128 -N 32 mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18
RUN Spectrum-MPI 10.3.0.1:
    OMP_NUM_THREADS=20 lrun -n 128 -N 32 --smpiargs "-gpu --disable_gdr" mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
  – Benefits of CUDA-Aware MPI with TensorFlow
  – Optimized Collectives for Deep Learning
  – Out-of-core DNN Training
• Conclusions
Data Parallel Training with TensorFlow (TF)
• Need to understand several options currently available
• gRPC (official support)
  – Open-source – can be enhanced by others
  – Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
  – Use gRPC for bootstrap and rendezvous
  – Actual communication is in "X"
  – X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
  – Baidu – the first one to use MPI collectives for TF
  – Horovod – uses NCCL, MPI, or any other future library (e.g., IBM DDL support recently added)
A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni and D. K. Panda, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19. https://arxiv.org/abs/1810.11112
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce
• Up to 11% better performance on the RI2 cluster (16 GPUs)
• Near-ideal – 98% scaling efficiency
[Chart: images/second (higher is better) versus number of GPUs (1-16) for Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (proposed), and Ideal; MVAPICH2-GDR 2.3 (MPI-Opt) is up to 11% faster than MVAPICH2 2.3 (basic CUDA support)]
A. A. Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation", CCGrid '19, https://arxiv.org/abs/1810.11112
MVAPICH2-GDR vs. NCCL2: Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
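For reference, the MPI side of this comparison is a standard MPI_Allreduce issued on GPU device buffers; a minimal sketch (the buffer size, MPI_SUM operation, and in-place reduction are illustrative, and a CUDA-aware MPI library such as MVAPICH2-GDR is assumed):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int count = 1 << 24;      /* 16M floats (64 MB), illustrative */
    float *grads;

    MPI_Init(&argc, &argv);

    /* Gradients live on the GPU; the reduced result is written in place. */
    cudaMalloc((void **)&grads, count * sizeof(float));

    MPI_Allreduce(MPI_IN_PLACE, grads, count, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    cudaFree(grads);
    MPI_Finalize();
    return 0;
}
```

This is roughly the operation a Horovod-over-MPI build issues for each fused gradient tensor during data-parallel training.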
[Charts: Allreduce latency (us, log scale) versus message size; MVAPICH2-GDR is ~3X better than NCCL2 for small messages (4 B - 64 KB) and ~1.2X better for large messages]
Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPU, and EDR InfiniBand interconnect
MVAPICH2-GDR vs. NCCL2: Allreduce Optimization (DGX-2)
• Optimized designs in upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on a DGX-2 machine
[Charts: Allreduce latency (us, log scale) and bandwidth (GB/s) versus message size (4 B - 64 MB), MVAPICH2-GDR 2.3.2 versus NCCL 2.4; up to 5.8X lower latency and 2.4X higher bandwidth]
Platform: NVIDIA DGX-2 system @ PSC (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
MVAPICH2-GDR: MPI_Allreduce (Device Buffers) on Summit
• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
[Charts on 1,536 GPUs: latency (us) versus message size (4 B - 16 KB), MVAPICH2-GDR-2.3.2 up to 1.6X better than NCCL 2.4; bandwidth (GB/s) for 32 MB - 256 MB messages, up to 1.7X better; 128 MB-message bandwidth versus number of GPUs (24-1,536) for SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, NCCL 2.4, and MVAPICH2-GDR-2.3.2, up to 1.7X better]
Platform: dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and 2-port InfiniBand EDR interconnect
Distributed Training with TensorFlow and MVAPICH2-GDR on Summit
• ResNet-50 training using the TensorFlow benchmark on Summit – 1,536 Volta GPUs!
• 1,281,167 (~1.2 million) images (ImageNet-1k)
• Time/epoch = 3.6 seconds
• Total time (90 epochs) = 3.6 x 90 ≈ 324 seconds – under 6 minutes!
[Chart: images per second (thousands) versus number of GPUs (1-1,536) for NCCL 2.4 and MVAPICH2-GDR 2.3.2; MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k; *errors were observed for NCCL2 beyond 96 GPUs]
Platform: the Summit supercomputer (#1 on Top500.org) – 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
New Benchmark for Image Segmentation on Summit
• Near-linear scaling may be achieved by tuning Horovod/MPI
• Optimizing MPI/Horovod towards large message sizes for high-resolution images
• Develop a generic image segmentation benchmark
• Tuned DeepLabV3+ model using the benchmark and Horovod; up to 1.3X better than default
*Anthony et al., "Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems" (submission under review)
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training CIFAR-10 network on CIFAR-10 dataset
  – Scale-out on 128 GPUs for training GoogLeNet network on ImageNet dataset
[Chart: GoogLeNet (ImageNet) training time (seconds) versus number of GPUs (8-128) for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); Caffe beyond its supported scale is marked as an invalid use case]
OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/
Scalability and Large (Out-of-core) Models?
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 for image recognition: current frameworks can only go up to a small batch size of 45
  – Next-generation models such as Neural Machine Translation (NMT) are ridiculously large (billions of parameters) and will require even more memory!
  – Can we exploit new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• General intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows the potential of managed-memory designs that can provide performance with negligible/no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training, HiPC '18
HyPar-Flow (HF): Hybrid Parallelism for TensorFlow
• CPU-based results
  – AMD EPYC
  – Intel Xeon
• Excellent speedups for
  – VGG-19
  – ResNet-110
  – ResNet-1000 (1k layers)
• Able to train "future" models
  – e.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)
• 110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 cluster)
*Awan et al., "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models", arXiv '19. https://arxiv.org/pdf/1911.05146.pdf
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Conclusions
• The MVAPICH2-GDR library provides optimized MPI communication on InfiniBand and RoCE clusters with GPUs
• Supports both x86 and OpenPOWER with NVLink
• Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions (scale-up and scale-out) for High-Performance Deep Learning
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Multiple Events at SC '19
• Presentations at the OSU and X-Scale booth (#2094)
  – Members of the MVAPICH, HiBD, and HiDL teams
  – External speakers
• Presentations in the SC main program (tutorials, workshops, BoFs, posters, and the doctoral showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/
Funding Acknowledgments
Funding support by [logos of funding agencies]
Equipment support by [logos of equipment vendors]
Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– C.-H. Chu (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Kandadi (M.S.)
Past Students – A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– S. Chakraborthy (Ph.D.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– R. Rajachandrasekar (Ph.D.)
– D. Shankar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
– X. Lu
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– Kamal Raj (M.S.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– A. Quentin (Ph.D.)
– B. Ramesh (M. S.)
– S. Xu (M.S.)
– J. Lin
– M. Luo
– E. Mancini
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– M. S. Ghazimeersaeed
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– N. Sarkauskas (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni
– Q. Zhou (Ph.D.)
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Follow us on Twitter: https://twitter.com/mvapich