1
Davide Rossetti, Elena Agostini
S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT
2
GPUDIRECT ELSEWHERE AT GTC2017
H7130 - CONNECT WITH THE EXPERTS: NVIDIA GPUDIRECT TECHNOLOGIES ON MELLANOX NETWORK INTERCONNECTS (Today 5pm, D.Rossetti @Mellanox)
S7155 - OPTIMIZED INTER-GPU COLLECTIVE OPERATIONS WITH NCCL (Tue 9am, S.Jeaugey @NVIDIA)
S7142 - MULTI-GPU PROGRAMMING MODELS (Wed 1pm, S.Potluri, J.Krauss @NVIDIA)
S7489 - CLUSTERING GPUS WITH ETHERNET (Wed 4pm, F.Osman @Broadcom)
S7356 - MVAPICH2-GDR: PUSHING THE FRONTIER OF HPC AND DEEP LEARNING (Thu 2pm, D.K.Panda @OSU)
3
AGENDA
GPUDirect technologies
NVLINK-enabled multi-GPU systems
GPUDirect P2P
GPUDirect RDMA
GPUDirect Async
Async Benchmarks & applications results
4
INTRODUCTION TO GPUDIRECT TECHNOLOGIES
5
GPUDIRECT FAMILY¹
Technologies, enabling products!!!
GPUDIRECT SHARED GPU-SYSMEM: GPU pinned memory shared with other RDMA-capable devices. Avoids intermediate copies.
GPUDIRECT P2P: accelerated GPU-GPU memory copies, inter-GPU direct load/store access.
GPUDIRECT RDMA²: direct GPU to 3rd party device transfers, e.g. direct I/O, optimized inter-node communication.
GPUDIRECT ASYNC: direct GPU to 3rd party device synchronizations, e.g. optimized inter-node communication.
[1] https://developer.nvidia.com/gpudirect [2] http://docs.nvidia.com/cuda/gpudirect-rdma
6
GPUDIRECT scopes
• GPUDirect P2P → data
• Intra-node
• GPUs both master and slave
• Over PCIe or NVLink
• GPUDirect RDMA → data
• Inter-node
• GPU slave, 3rd party device master
• Over PCIe
• GPUDirect Async → control
• GPU & 3rd party device, master & slave
• Over PCIe
[Diagram: GPUDirect RDMA/P2P cover the data plane between GPUs and the HCA; GPUDirect Async covers the control plane between host, GPU and HCA.]
7
NVLINK-enabled Multi-GPU servers
8
NVIDIA DGX-1: AI Supercomputer-in-a-Box
170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U — 3200W
9
DGX-1 SYSTEM TOPOLOGY
GPU – CPU link:
PCIe
12.5+12.5 GB/s eff BW
GPUDirect P2P:
GPU – GPU link is NVLink
Cube mesh topology
not all-to-all
GPUDirect RDMA:
GPU – NIC link is PCIe
10
IBM MINSKY
2 POWER8 with NVLink
4 NVIDIA Tesla P100 GPUs
256 GB System Memory
2 SSD storage devices
High-speed interconnect: IB or Ethernet
Optional: up to 1 TB System Memory, PCIe-attached NVMe storage
11
IBM MINSKY SYSTEM TOPOLOGY
[Diagram: per socket, one POWER8 CPU and two P100 GPUs with GPU memory; NVLink at 80 GB/s between GPUs and CPU, 115 GB/s to system memory.]
GPU – CPU link:
2x NVLINK
40+40 GB/s raw BW
GPUDirect P2P:
GPU – GPU link is 2x NVLink
Two cliques topology
GPUDirect RDMA:
Not supported
12
GPUDIRECT AND MULTI-GPU SYSTEMS: THE CASE OF DGX-1
13
HOW TO’S
Device topology, link type and capabilities
GPUa - GPUb link: P2P over NVLINK vs PCIe, speed, etc
Same for CPU – GPU link: NVLINK or PCIe
Same for NIC – GPU link (HWLOC)
Select an optimized GPU/CPU/NIC combination in MPI runs
Enable GPUDirect RDMA
14
CUDA LINK CAPABILITIES
// CUDA driver API
typedef enum CUdevice_P2PAttribute_enum {
    CU_DEVICE_P2P_ATTRIBUTE_PERFORMANCE_RANK        = 0x01,
    CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED        = 0x02,
    CU_DEVICE_P2P_ATTRIBUTE_NATIVE_ATOMIC_SUPPORTED = 0x03
} CUdevice_P2PAttribute;
cuDeviceGetP2PAttribute(int *value, CUdevice_P2PAttribute attrib, CUdevice srcDevice, CUdevice dstDevice)

// CUDA runtime API
cudaDeviceGetP2PAttribute(int *value, enum cudaDeviceP2PAttr attr, int srcDevice, int dstDevice)

Basic info, GPU-GPU links only.
PERFORMANCE_RANK: a relative value indicating the performance of the link between two GPUs (NVLINK ranks higher than PCIe).
NATIVE_ATOMIC_SUPPORTED: whether GPU kernels can do remote native atomics over the link.
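For example, a minimal runtime-API sketch (an illustration, not part of the original deck) that prints these attributes for every GPU pair:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int access = 0, rank = 0, atomics = 0;
            cudaDeviceGetP2PAttribute(&access,  cudaDevP2PAttrAccessSupported,       src, dst);
            cudaDeviceGetP2PAttribute(&rank,    cudaDevP2PAttrPerformanceRank,       src, dst);
            cudaDeviceGetP2PAttribute(&atomics, cudaDevP2PAttrNativeAtomicSupported, src, dst);
            printf("GPU%d -> GPU%d: access=%d perf_rank=%d native_atomics=%d\n",
                   src, dst, access, rank, atomics);
        }
    }
    return 0;
}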
15
GPUDIRECT P2P: NVLINK VS PCIE
Note: some GPUs are not directly connected, e.g. GPU0-GPU7. Note 2: others have multiple potential links (NVLINK and PCIe) but cannot use both at the same time!!!
NVLINK transparently picked if available
cudaSetDevice(0);
cudaMalloc(&buf0, size);
cudaDeviceCanAccessPeer(&access, 0, 1);
assert(access == 1);
cudaDeviceEnablePeerAccess(1, 0);
cudaSetDevice(1);
cudaMalloc(&buf1, size);
…
cudaSetDevice(0);
cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);
16
MULTI GPU RUNS ON DGX-1
Create wrapper script
Use local MPI rank (MPI implementation dependent)
Don’t use CUDA_VISIBLE_DEVICES, hurts P2P!!!
Environment variables to pass selection down to MPI and app
In the application, call cudaSetDevice() with the value of the USE_GPU environment variable (see the sketch after the script below)
Run wrapper script
Select best GPU/CPU/NIC for each MPI rank
$ cat wrapper.sh
if [ ! -z $OMPI_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$OMPI_COMM_WORLD_LOCAL_RANK
elif [ ! -z $MV2_COMM_WORLD_LOCAL_RANK ]; then
    lrank=$MV2_COMM_WORLD_LOCAL_RANK
fi
if (( $lrank > 7 )); then echo "too many ranks"; exit; fi
case ${HOSTNAME} in
*dgx*)
USE_GPU=$((2*($lrank%4)+$lrank/4)) # 0,2,4,6,1,3,5,7
export USE_SOCKET=$(($USE_GPU/4)) # 0,0,1,1,0,0,1,1
HCA=mlx5_$(($USE_GPU/2)) # 0,1,2,3,0,1,2,3
export OMPI_MCA_btl_openib_if_include=${HCA}
export MV2_IBA_HCA=${HCA}
export USE_GPU;;
…
esac
numactl --cpunodebind=${USE_SOCKET} -l $@
$ mpirun -np N wrapper.sh myapp param1 param2 …
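Inside the application, the selection can then be picked up from the environment; a minimal sketch (assuming the USE_GPU variable exported by the wrapper above):

#include <stdlib.h>
#include <cuda_runtime.h>

// Bind this MPI rank to the GPU chosen by wrapper.sh
static void bind_rank_to_gpu(void)
{
    const char *e = getenv("USE_GPU");   // exported by the wrapper
    int gpu = e ? atoi(e) : 0;           // fall back to GPU 0 if unset
    cudaSetDevice(gpu);
}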
17
NVML¹ NVLINK
Link discovery and info APIs

nvmlDeviceGetNvLinkVersion(nvmlDevice_t device, unsigned int link, unsigned int *version)
nvmlDeviceGetNvLinkState(nvmlDevice_t device, unsigned int link, nvmlEnableState_t *isActive)
nvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link, nvmlNvLinkCapability_t capability, unsigned int *capResult)
nvmlDeviceGetNvLinkRemotePciInfo(nvmlDevice_t device, unsigned int link, nvmlPciInfo_t *pci)

¹ http://docs.nvidia.com/deploy/nvml-api/
nvmlDevice handles are separate from CUDA GPU IDs (NVML enumerates all devices, vs CUDA_VISIBLE_DEVICES)
NVML_NVLINK_MAX_LINKS=6
See later for capabilities
domain:bus:device.function PCI identifier of the device on the other side of the link; can be a socket PCIe bridge (IBM POWER8)
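For example, a minimal sketch (an illustration; link with -lnvidia-ml) that walks the NVLink links of GPU 0:

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t active;
        if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS ||
            active != NVML_FEATURE_ENABLED)
            continue;                              // link absent or down
        unsigned int p2p = 0;
        nvmlDeviceGetNvLinkCapability(dev, link, NVML_NVLINK_CAP_P2P_SUPPORTED, &p2p);
        nvmlPciInfo_t pci;
        nvmlDeviceGetNvLinkRemotePciInfo(dev, link, &pci);
        printf("link %u: active, P2P=%u, remote=%s\n", link, p2p, pci.busId);
    }
    nvmlShutdown();
    return 0;
}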
18
NVLINK CAPABILITIES
nvidia-smi nvlink -i <GPU id> -c

typedef enum nvmlNvLinkCapability_enum {
    NVML_NVLINK_CAP_P2P_SUPPORTED  = 0,
    NVML_NVLINK_CAP_SYSMEM_ACCESS  = 1,
    NVML_NVLINK_CAP_P2P_ATOMICS    = 2,
    NVML_NVLINK_CAP_SYSMEM_ATOMICS = 3,
    NVML_NVLINK_CAP_SLI_BRIDGE     = 4,
    NVML_NVLINK_CAP_VALID          = 5,
} nvmlNvLinkCapability_t;
On DGX-1
19
NVLINK COUNTERS
Per GPU (-i 0), per link (-l <0..3>)
Two sets of counters (-g <0|1>)
Per set counter types: cycles, packets, bytes (-sc xyz)
Reset individually (-r <0|1>)
On DGX-1
20
NVML TOPOLOGY
GPU-GPU & GPU-CPU topology query¹ APIs

nvmlDeviceGetTopologyNearestGpus(nvmlDevice_t device, nvmlGpuTopologyLevel_t level, unsigned int *count, nvmlDevice_t *deviceArray)
nvmlDeviceGetTopologyCommonAncestor(nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuTopologyLevel_t *pathInfo)
nvmlSystemGetTopologyGpuSet(unsigned int cpuNumber, unsigned int *count, nvmlDevice_t *deviceArray)
nvmlDeviceGetCpuAffinity(nvmlDevice_t device, unsigned int cpuSetSize, unsigned long *cpuSet)

nvmlGpuTopologyLevel_t values:
NVML_TOPOLOGY_INTERNAL, NVML_TOPOLOGY_SINGLE, NVML_TOPOLOGY_MULTIPLE,
NVML_TOPOLOGY_HOSTBRIDGE, NVML_TOPOLOGY_CPU, NVML_TOPOLOGY_SYSTEM

¹ http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html
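A minimal usage sketch (an illustration; assumes at least two GPUs and that 64 CPUs fit in a single unsigned long affinity word):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlInit();
    nvmlDevice_t g0, g1;
    nvmlDeviceGetHandleByIndex(0, &g0);
    nvmlDeviceGetHandleByIndex(1, &g1);

    nvmlGpuTopologyLevel_t level;
    nvmlDeviceGetTopologyCommonAncestor(g0, g1, &level);
    printf("GPU0-GPU1 common ancestor level: %d\n", (int)level);

    unsigned long cpuSet = 0;
    nvmlDeviceGetCpuAffinity(g0, 1, &cpuSet);      // 1 word = 64 CPUs
    printf("GPU0 CPU affinity mask: 0x%lx\n", cpuSet);

    nvmlShutdown();
    return 0;
}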
21
SYSTEM TOPOLOGY
$ nvidia-smi topo -m
On DGX-1
22
SYSTEM TOPOLOGY
$ nvidia-smi topo -mp
On DGX-1, PCIe only
25
GPUDIRECT P2P
26
DGX-1 P2P PERFORMANCE
in CUDA toolkit samples
Sources: samples/1_Utilities/p2pBandwidthLatencyTest
Binary: samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
p2pBandwidthLatencyTest
27
DGX-1 P2P PERFORMANCE
In CUDA toolkit demo suite:
/usr/local/cuda-8.0/extras/demo_suite/busGrind -h
Usage: -h: print usage
-p [0,1] enable or disable pinned memory tests (default on)
-u [0,1] enable or disable unpinned memory tests (default off)
-e [0,1] enable or disable p2p enabled memory tests (default on)
-d [0,1] enable or disable p2p disabled memory tests (default off)
-a enable all tests
-n disable all tests
busGrind
28
Intra-node MPI BW
GPU-aware MPI running over GPUDirect P2P. Dual IVB Xeon 2U server (K40, PCIe) vs DGX-1 (P100, NVLink).
[Chart: bandwidth (GB/s, 0-40) vs message size (1 KB to 16 MB); curves k40-pcie, k40-pcie-bidir, p100-nvlink, p100-nvlink-bidir.]
~35 GB/sec Bi-dir
~17 GB/sec Uni-dir
29
GPUDIRECT RDMA
30
GPUDirect RDMA over RDMA networks
For the Linux RDMA subsystem
open-source nvidia_peer_memory kernel module¹
important bug fix in ver 1.0-3 !!!
enables NVIDIA GPUDirect RDMA on the OpenFabrics stack
Multiple vendors
Mellanox²: ConnectX-3 to ConnectX-5, IB/RoCE
Chelsio³: T5, iWARP
Others to come
for better network communication latency
¹ https://github.com/Mellanox/nv_peer_memory  ² http://www.mellanox.com/page/products_dyn?product_family=116  ³ http://www.chelsio.com/gpudirect-rdma
31
GPUDirect RDMA over InfiniBand
Benchmarking bandwidth

For bandwidth:
$ git clone git://git.openfabrics.org/~grockah/perftest.git
$ cd perftest
$ ./autogen.sh
$ export CUDA_H_PATH=/usr/local/cuda-8.0/include/cuda.h
$ ./configure --prefix=$HOME/test
$ make all install

E.g. host to GPU memory (H-G) BW test:
server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a server.name.org

GPU to GPU memory (G-G) BW test:
server$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda
client$ ~/test/bin/ib_write_bw -n 1000 -O -a --use_cuda server.name.org
32
DGX-1 GPUDirect RDMA uni-dir BW
IB message size, 5000 iterations, RC protocol
[Chart: RDMA BW (MB/s) vs message size (2 B to 8 MB), DGX-1, ConnectX-4, P100 SXM, socket0 to socket1; curves HH, HG, GH, GG (host or GPU memory at each end).]
33
GPUDIRECT ASYNC
34
GPUDIRECT ASYNC
Communications prepared by CPU
• hardly parallelizable, branch-intensive work
• GPU orchestrates the flow
Run by the GPU front-end unit
• the same unit that schedules GPU work
• now also scheduling network communications
Leverage the GPU front-end unit
35
GPUDIRECT ASYNC OVER OFA VERBS
• Prerequisites: nvidia_peer_memory driver, GDRcopy1 library
• CUDA 8.0+ Stream Memory Operations (MemOps) APIs
• MLNX OFED 4.0+ Peer-Direct Async Verbs APIs
• libgdsync2: bridging CUDA & IB Verbs
• MPI: experimental support in MVAPICH2-GDS3
• libmp: lightweight, MPI-like stream-sync primitives, internal benchmarking
¹ http://github.com/NVIDIA/gdrcopy  ² http://github.com/gpudirect/libgdsync, devel branch  ³ see D.K. Panda's talk (S7356)
SW stack bits
36
SW STACK
[Diagram: SW stack. Hardware: GPU and HCA. Kernel mode: NV display driver and nv_peer_mem; IB core with mlx5 and cxgb4 drivers, with extensions for Async. User mode: CUDA driver/runtime and libgdsync bridging to IB Verbs (with Async extensions); on top, applications and benchmarks via libmp, Open MPI and MVAPICH2 (with RDMA extensions for Async). Components are a mix of open-source, proprietary and mixed licensing.]
37
GPUDIRECT ASYNC OVER INFINIBAND
May need special HCA configuration on Kepler/Maxwell GPUs, e.g. on Mellanox:
$ mlxconfig -d /dev/mst/mtxxx_pciconf0 set NON_PREFETCHABLE_PF_BAR=1
$ reboot
Enable GPU peer mappings:
$ cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RegistryDwords="PeerMappingOverride=1"
Requirements
38
CUDA STREAM MemOps

// Polling on a 32/64-bit word
CU_STREAM_WAIT_VALUE_GEQ   = 0x0,
CU_STREAM_WAIT_VALUE_EQ    = 0x1,
CU_STREAM_WAIT_VALUE_AND   = 0x2,
CU_STREAM_WAIT_VALUE_NOR   = 0x3,
CU_STREAM_WAIT_VALUE_FLUSH = 1<<30   // guarantees memory consistency for RDMA
CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWaitValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

// 32/64-bit word write
CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1
CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
CUresult cuStreamWriteValue64(CUstream stream, CUdeviceptr addr, cuuint64_t value, unsigned int flags);

// Lower-overhead batched work submission
CU_STREAM_MEM_OP_WAIT_VALUE_32       = 1,
CU_STREAM_MEM_OP_WRITE_VALUE_32      = 2,
CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3,
CU_STREAM_MEM_OP_WAIT_VALUE_64       = 4,
CU_STREAM_MEM_OP_WRITE_VALUE_64      = 5
CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);
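For example, a minimal driver-API sketch (an illustration, not from the original deck): the stream blocks until the CPU writes a flag in registered host memory, then the stream itself writes a second word. On some setups stream MemOps may first need to be enabled in the driver (e.g. via the NVreg_EnableStreamMemOPs=1 module option); treat that as an assumption to verify for your driver version.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(void)
{
    CUdevice dev; CUcontext ctx; CUstream stream;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);

    // two 32-bit flags in registered, device-mapped host memory
    unsigned int *flag = (unsigned int *)calloc(2, sizeof(*flag));
    cuMemHostRegister(flag, 2 * sizeof(*flag), CU_MEMHOSTREGISTER_DEVICEMAP);
    CUdeviceptr dflag;
    cuMemHostGetDevicePointer(&dflag, flag, 0);

    // queued on the stream: wait until flag[0] == 1, then write 2 into flag[1]
    cuStreamWaitValue32(stream, dflag, 1, CU_STREAM_WAIT_VALUE_EQ);
    cuStreamWriteValue32(stream, dflag + sizeof(unsigned int), 2, 0);

    flag[0] = 1;                        // CPU releases the stream
    cuStreamSynchronize(stream);
    printf("flag[1] = %u\n", flag[1]);  // expect 2

    cuMemHostUnregister(flag);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}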
39
CUDA STREAM MemOps
APIs features
• batching multiple consecutive MemOps saves ~1.5 µs per op
  • use cuStreamBatchMemOp()
• APIs accept device pointers
  • memory needs registration (cuMemHostRegister)
  • device pointer retrieval (cuMemHostGetDevicePointer)
• 3rd party device PCIe resources (aka BARs)
  • assumed physically contiguous & uncached
  • special flag needed in cuMemHostRegister
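As a sketch of the batching point above (an illustration; d_flag and d_doorbell are device pointers obtained via cuMemHostGetDevicePointer(), and "doorbell" stands for a hypothetical 3rd-party BAR word whose registration would additionally need the CU_MEMHOSTREGISTER_IOMEMORY flag):

#include <string.h>
#include <cuda.h>

// Queue a wait + a write as one cuStreamBatchMemOp() submission,
// saving the per-op overhead mentioned above.
static void post_wait_and_doorbell(CUstream stream, CUdeviceptr d_flag, CUdeviceptr d_doorbell)
{
    CUstreamBatchMemOpParams ops[2];
    memset(ops, 0, sizeof(ops));
    ops[0].operation         = CU_STREAM_MEM_OP_WAIT_VALUE_32;   // wait until *d_flag >= 1
    ops[0].waitValue.address = d_flag;
    ops[0].waitValue.value   = 1;
    ops[0].waitValue.flags   = CU_STREAM_WAIT_VALUE_GEQ;
    ops[1].operation          = CU_STREAM_MEM_OP_WRITE_VALUE_32; // then ring the doorbell
    ops[1].writeValue.address = d_doorbell;
    ops[1].writeValue.value   = 1;
    ops[1].writeValue.flags   = 0;
    cuStreamBatchMemOp(stream, 2, ops, 0);   // one submission instead of two
}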
40
ASYNC: LIBMP
41
LIBMP
Lightweight message passing library
thin layer on top of IB Verbs
in-order receive buffer matching
no tags, no wildcards, no data types
zero copy transfers, uses flow control of IB RC transport
Eases benchmarking of multi-GPU applications
Not released (yet)
42
LIBMP
Prototype message passing library

                     PREPARED BY   TRIGGERED BY
CPU synchronous      CPU           CPU
Stream synchronous   CPU           GPU front-end unit
Kernel initiated     CPU           GPU SMs
Kernel initiated CPU GPU SMs
43
LIBMP CPU-SYNC COMMUNICATION

// Send/Recv
int mp_irecv (void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);
int mp_isend (void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Put/Get
int mp_window_create(void *addr, size_t size, mp_window_t *window_t);
int mp_iput (void *src, int size, mp_reg_t *src_reg, int peer, size_t displ, mp_window_t *dst_window_t, mp_request_t *req);
int mp_iget (void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ, mp_window_t *src_window_t, mp_request_t *req);

// Wait
int mp_wait (mp_request_t *req);
int mp_wait_all (uint32_t count, mp_request_t *req);
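A minimal usage sketch of the CPU-synchronous path (an illustration; the buffer registration call, named mp_register() here, is not shown on the slide and its exact signature is an assumption):

mp_request_t reqs[2];
// assumption: sreg/rreg were obtained from a registration call such as mp_register(buf, size, &reg)
mp_irecv(rbuf, size, peer, &rreg, &reqs[0]);   // post the receive first (in-order matching)
mp_isend(sbuf, size, peer, &sreg, &reqs[1]);   // matching send to the same peer
mp_wait_all(2, reqs);                          // CPU blocks until both requests complete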
44
LIBMP STREAM-SYNC COMMUNICATION

// Send
int mp_isend_on_stream (void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req, cudaStream_t stream);

// Put/Get
int mp_iput_on_stream (void *src, int size, mp_reg_t *src_reg, int peer, size_t displ, mp_window_t *dst_window_t, mp_request_t *req, cudaStream_t stream);
int mp_iget_on_stream (void *dst, int size, mp_reg_t *dst_reg, int peer, size_t displ, mp_window_t *src_window_t, mp_request_t *req, cudaStream_t stream);

// Wait
int mp_wait_on_stream (mp_request_t *req, cudaStream_t stream);
int mp_wait_all_on_stream (uint32_t count, mp_request_t *req, cudaStream_t stream);
45
CUDA stream-sync communications
Loop {
    mp_irecv(…);
    compute<<<…, stream>>>(buf);
    mp_isend_on_stream(…);
    mp_wait_all_on_stream(…);
}
[Diagram: timeline with CPU, GPU stream and HCA lanes.]
46
LIBMP STREAM-SYNC batch submission

// Prepare single requests, IB Verbs side only
int mp_send_prepare (void *buf, int size, int peer, mp_reg_t *mp_reg, mp_request_t *req);

// Post set of prepared requests
int mp_isend_post_on_stream (mp_request_t *req, cudaStream_t stream);
int mp_isend_post_all_on_stream (uint32_t count, mp_request_t *req, cudaStream_t stream);
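A minimal usage sketch (an illustration; NREQS, peers, sbufs and sreg are placeholders):

mp_request_t sreqs[NREQS];
for (int p = 0; p < NREQS; ++p)
    mp_send_prepare(sbufs[p], size, peers[p], &sreg, &sreqs[p]);  // CPU builds the IB Verbs work, nothing posted yet
// ... later, a single stream submission posts the whole prepared set in order
mp_isend_post_all_on_stream(NREQS, sreqs, stream);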
47
LIBMP KERNEL-SYNC COMMUNICATION
// Host side code
// extract descriptors from prepared requests
int mp::mlx5::get_descriptors(send_desc_t *send_info, mp_request_t *req);
int mp::mlx5::get_descriptors(wait_desc_t *wait_info, mp_request_t *req);

// Device side code
__device__ int mp::device::mlx5::send(send_desc_t *send_info);
__device__ int mp::device::mlx5::wait(wait_desc_t *wait_info);
48
Kernel-sync communication
Loop {
    mp_irecv(…, rreq);
    mp_get_descriptors(wi, rreq);
    mp_send_prepare(…, sreq);
    mp_get_descriptors(si, sreq);
    compute_and_communicate<<<…, stream>>>(buf, wi, si) {
        do_something(buf);
        mlx5::send(si);
        do_something_else();
        mlx5::wait(wi);
        keep_working();
    }
}
[Diagram: timeline with CPU, GPU stream and HCA lanes.]
49
ASYNC: experimental MPI BINDINGS
50
CODE SAMPLE
MPI_Comm comm = MPI_COMM_WORLD;
cudaStreamCreate (&stream);
Loop {
kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
cudaStreamSynchronize (stream);
MPI_Irecv(buf2, count, MPI_INT, peer, 0, comm, &req[0]);
MPI_Isend(buf1, count, MPI_INT, peer, 0, comm, &req[1]);
MPI_Waitall (2, req, statuses);
}
Standard GPU-aware MPI
51
STREAM-SYNCHRONOUS VERSION
MPI_Comm stream_comm;
cudaStreamCreate (&stream);
MPIX_Comm_dup_with_stream( MPI_COMM_WORLD, &stream_comm, &stream);
Loop {
kernel<<<nblocks, nthreads, 0, stream>>>(buf1, buf2);
MPI_Recv(buf2, count, MPI_BYTE, peer, 0, stream_comm);
MPI_Send(buf1, count, MPI_BYTE, peer, 0, stream_comm);
}
MPIX_Wait_stream_completion(stream_comm);
Experimental MPI extensions
52
ASYNC BENCHMARKS
LibMP models
HPGMG-FV
CoMD
Lulesh2
53
LIBMP MODELS
Three different execution models:
• CPU-synchronous (default)
• CUDA Stream-synchronous, CPU asynchronous (SA)
• communications triggered by a CUDA Stream
• CUDA Kernel-synchronous (or Kernel-Initiated, KI)
• Communications triggered by CUDA Kernel threads
To evaluate performance, three multi-process applications were ported from MPI to LibMP
Summary
54
HPGMG-CUDA
55
HPGMG-FV
Overview
High Performance Geometric Multigrid (Lawrence Berkeley National Laboratory):
• Geometric multi-grid solver for variable-coefficient elliptic problems on isotropic Cartesian grids using the Finite Volume method
• F-cycle: multiple V-cycles; in each, a finer grid is smoothed and restricted into a coarser grid. A direct solver is applied to the coarsest grid and the solution is then iteratively interpolated back to the finer grids.
Multi-process execution:
• Each grid is divided into several same-size boxes
• Workload is fairly distributed among processes
56
HPGMG-FV CUDA
Overview
GPU acceleration
Multi-process and multi-GPU, MPI
Level size threshold:
• smaller (coarser) levels computed on CPU
• larger (finer) levels computed on GPU
57
HPGMG-FV CUDA
Multi-process execution: MPI communications

[Diagram: V-cycle levels; each level runs Smoother, Residual and Exchange boundary; coarser levels run on the CPU, finer levels on the GPU; levels are connected by Restriction (finer to coarser) and Interpolation (coarser to finer).]

Inter-level communication: restriction (moving from a finer level to a coarser level) and interpolation (moving from a coarser level to a finer level).
Intra-level communication: exchange_boundary function (boundary region exchange between processes).
59
EXCHANGE BOUNDARY IMPLEMENTATION
Code comparison: MPI vs SA-Model

MPI Version:
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
cudaDeviceSynchronize();
for (p = 0; p < num_peers; p++) {
    MPI_Isend(sBuf, …, sReqs[p]);
}
local_kernel<<<…, stream>>>();
MPI_Waitall(rReqs, …);
unpack_kernel<<<…, stream>>>(rBuf, …);
MPI_Waitall(sReqs, …);

LibMP SA-Model Version:
for (p = 0; p < num_peers; p++) {
    mp_irecv(rBuf, …, rReqs[p]);
}
pack_kernel<<<…, stream>>>(sBuf, …);
for (p = 0; p < num_peers; p++) {
    mp_isend_on_stream(sBuf, …, sReqs[p], stream);
}
local_kernel<<<…, stream>>>();
mp_wait_all_on_stream(…, rReqs, stream);
unpack_kernel<<<…, stream>>>(rBuf, …);
mp_wait_all_on_stream(…, sReqs, stream);
60
MPI IMPLEMENTATION
Consecutive exchange_boundary() calls, CUDA Visual Profiler

[Profiler timeline: Host, GPU Stream and Network lanes; markers for synchronization, CUDA kernels and communications (Isend, WaitAll).]
61
MPI IMPLEMENTATION
Consecutive exchange_boundary() calls, CUDA Visual Profiler

[Profiler timeline: Host, GPU Stream and Network lanes; GPU idle time visible between kernels.]
63
SA MODEL IMPLEMENTATION
Consecutive exchange_boundary() calls, CUDA Visual Profiler

[Profiler timeline: Host, GPU Stream and Network lanes; communications (Isend, WaitRecv, WaitIsend) appear as CUDA 8.0 driver functions on the stream.]
64
SA MODEL IMPLEMENTATION
Wilkes cluster - performance gain
• Wilkes cluster (Cambridge, UK):
  • Tesla K20 GPUs
  • CUDA 8.0
  • Mellanox Connect-IB, OFED 3.2
  • OpenMPI
• Up to 24% time improvement wrt MPI
• 4 processes, single GPU level:
  • 64% CPU work time reduction
  • 28% GPU activity time reduction
• The bigger the box size, the larger the messages: communication overhead becomes less important
65
KI MODEL IMPLEMENTATION
How to
For each exchange_boundary() execution:
• Kernel fusion, single vs three CUDA kernels
• Move communications inside fused kernel
• Respect mutual dependencies (hard!)
[Diagram: the CPU launches a single fused exchange_boundary kernel on the GPU.]
67
KI MODEL IMPLEMENTATION
exchange_boundary() fused kernel - CUDA blocks organization

[Diagram: the fused kernel's CUDA blocks are partitioned into Set 1, Set 2 and Set 3; coordination uses the CUB library and a GPU global memory variable.]
68
KI MODEL IMPLEMENTATION
Wilkes cluster - performance gain
• Wilkes cluster (Cambridge, UK):
  • Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• Up to 26% time improvement wrt MPI
• 4 processes, single GPU level:
• 77% CPU work time reduction
• 32% GPU activity time reduction
69
DGX-1 CLUSTER BENCHMARKS
Performance gain
• DGX-1 cluster (NVIDIA):
• Tesla P100 GPUs
• CUDA 8.0
• Mellanox ConnectX-4 OFED 3.4
• LibMP v2.0
• OpenMPI 1.10.5
• 2 processes only: about 8% less than
Wilkes
• Gain wrt MPI:
• CPU sync default
• SA model
• KI model
70
COMD-CUDA
71
COMD-CUDA
Implementation
• CoMD (www.exmatex.org project) is a classical molecular dynamics proxy application
• O(N²) within cutoff
• It considers materials where the interatomic potentials are short range and the simulation requires the evaluation of all forces between atom pairs within the cutoff distance
• Dependency between CUDA kernels and communications
• Explored SA-model only
72
SA MODEL IMPLEMENTATION
Communication periods - Code comparison

MPI Version:
Repeat for 3 times {
    load_kernel<<<…, stream>>>(sBufA, …);
    load_kernel<<<…, stream>>>(sBufB, …);
    cudaDeviceSynchronize();
    MPI_Sendrecv(sBufA, rBufA, …);
    MPI_Sendrecv(sBufB, rBufB, …);
    cuda_tasks(stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
}

LibMP SA-Model Version:
for (p = 0; p < (3 x 2); p++) {
    mp_irecv(…, rReqs[p]);
}
for (p = 0; p < (3 x 2); p += 2) {
    load_kernel<<<…, stream>>>(sBufA, …);
    mp_isend_on_stream(…, sReqs[p], stream);
    load_kernel<<<…, stream>>>(sBufB, …);
    mp_isend_on_stream(…, sReqs[p+1], stream);
    cuda_tasks(stream);
    mp_wait_all_on_stream(…, rReqs+p, stream);
    unload_kernel<<<…, stream>>>(rBufA, …);
    unload_kernel<<<…, stream>>>(rBufB, …);
    mp_wait_all_on_stream(…, sReqs+p, stream);
}
73
SA MODEL PERFORMANCE GAIN
Wilkes cluster - communications gain
• Wilkes cluster (Cambridge, UK):
  • Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• 25% ~ 35% time gain
• Communication periods only
• Gain SA model wrt MPI
74
SA MODEL PERFORMANCE GAIN
DGX-1 cluster - communications
• DGX-1 cluster (NVIDIA):
• Tesla P100 GPUs
• CUDA 8.0
• Mellanox ConnectX-4 OFED 3.4
• LibMP v2.0
• OpenMPI 1.10.5
• 4 processes only: 13% less than
Wilkes
• Communication periods only
• Gain SA model wrt MPI
75
LULESH2-CUDA
76
LULESH2-CUDA
Implementation
• Proxy application developed at Lawrence Livermore National Laboratory
• “… It approximates the hydrodynamics equations discretely by partitioning the
spatial problem domain into a collection of volumetric elements defined by a
mesh.” ( https://codesign.llnl.gov/lulesh.php )
• Dependency between CUDA kernels and communications
• Explored SA-model only
77
SA MODEL IMPLEMENTATION
Communication periods - Code comparison

MPI Version:
Repeat up to 26 times {
    MPI_Irecv(rBuf, …);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(sBuf, …);
    cudaDeviceSynchronize();
    cudaMemcpyAsync(…, stream);
    MPI_Isend(sBuf, …);
}
MPI_Waitall(…);
……
Repeat up to 26 times {
    MPI_Wait(…);
    cuda_kernel<<<…, stream>>>(rBuf, …);
    cudaMemcpyAsync(…, stream);
    cudaDeviceSynchronize();
}

LibMP SA-Model Version:
Repeat up to 26 times {
    mp_irecv(rBuf, …, rReqs);
}
……
Repeat up to 26 times {
    cuda_kernel<<<…, stream>>>(…);
    cudaMemcpyAsync(…, stream);
    mp_isend_on_stream(…, sReqs, stream);
}
mp_wait_all_on_stream(…, sReqs, stream);
……
Repeat up to 26 times {
    mp_wait_on_stream(…, rReqs, stream);
    cudaMemcpyAsync(…, stream);
    cuda_kernel<<<…, stream>>>(…);
}
78
SA MODEL IMPLEMENTATION
Wilkes cluster - performance gain
• Progressively increasing the number
of iterations to intensify the GPU
workload
• Wilkes cluster (Cambridge, UK):
• Tesla K20 GPUs
• CUDA 8.0
• Mellanox Connect-IB OFED 3.2
• OpenMPI
• Average gain of 13%
• 27 and 64 processes
79
GAME OVER
80