Multi-GPU Programming | GTC 2013
Multi-GPU Programming
Levi Barnes, Developer Technology, NVIDIA
Hands-on Multi-GPU Training
S3529
Hands-on Lab: Multi-GPU Acceleration Example
Justin Luitjens
Wednesday 17:00-17:50, Rm 230A1
Where can I find multi-GPU nodes?
GPU Test Drive
http://www.nvidia.com/GPUTestDrive
Commercial/Academic Clusters
— Emerald (8 x M2090)
— Keeneland (3/2 x M2070)
— Tsubame (3 x M2050)
— JSCC RAS (8x M2090)
— Amazon Cloud (2 x C2050)
Outline
Why More GPUs?
Executing on multiple GPUs
Hiding inter-GPU communication
Tuning for node topology
CUDA C Case study
— One Host, Many GPUs
— Many Hosts, Many GPUs
Move to multiple GPUs
[Timeline figure: a single CPU driving one GPU; the CPU, MemCpy, and GPU rows show serial computation, communication, and idle time.]
Move to multiple GPUs
[Figure: duplicating the whole node, one CPU per GPU, multiplies cost ($$ per added CPU/GPU pair); attaching several GPUs to a single CPU is the cheaper path.]
Move to multiple GPUs
[Timeline figure: one CPU issuing memcopies and kernels to GPUs 1-4; the four GPUs execute concurrently.]
NVIDIA K10 – multiple GPUs on a board
[Figure: each K10 board carries two GPUs, and a single CPU can host several boards.]
Executing on multiple GPUs
Managing multiple GPUs from a single CPU thread
cudaSetDevice() - sets the current GPU
cudaSetDevice( 0 );
kernel<<<...>>>(...);
cudaMemcpyAsync(...);
cudaSetDevice( 1 );
kernel<<<...>>>(...);
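As a minimal, hedged sketch of this pattern (the kernel `scale` and all sizes are illustrative assumptions, not from the slides), one host thread can drive every visible GPU in a loop:

```cuda
// Minimal sketch: one host thread drives all visible GPUs.
// The kernel `scale` and the buffer sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    float *d_x[16];
    for (int dev = 0; dev < num_gpus && dev < 16; dev++) {
        cudaSetDevice(dev);                        // make this GPU current
        cudaMalloc(&d_x[dev], n * sizeof(float));  // lands on device `dev`
        scale<<<(n + 255) / 256, 256>>>(d_x[dev], 2.0f, n); // async launch
    }
    for (int dev = 0; dev < num_gpus && dev < 16; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();                   // wait for each GPU
    }
    printf("launched on %d GPU(s)\n", num_gpus);
    return 0;
}
```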
Peer-to-peer memcopy
cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,
                     void* src_addr, int src_dev,
                     size_t num_bytes, cudaStream_t stream )
[Figure: without peer access, the copy GPU 0 → CPU → GPU 1 is staged through host memory at ~5 GB/s.]
Enabling GPUDirect
cudaDeviceEnablePeerAccess( peer_device, 0 )
Enables current GPU to access addresses on peer_device GPU
cudaDeviceCanAccessPeer( &accessible, dev_X, dev_Y )
Checks whether dev_X can access memory of dev_Y
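A hedged sketch of the check-then-enable sequence (the function and buffer names are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

// Hedged sketch: enable peer access, then copy device-to-device directly.
// copy_peer, d_dst, d_src are illustrative names.
void copy_peer(void *d_dst, int dst_dev, void *d_src, int src_dev,
               size_t num_bytes, cudaStream_t stream)
{
    int accessible = 0;
    cudaDeviceCanAccessPeer(&accessible, dst_dev, src_dev);
    if (accessible) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);  // flags must be 0
    }
    // Direct over PCIe when access is enabled; staged through the host
    // otherwise (the ~5 GB/s path shown above).
    cudaMemcpyPeerAsync(d_dst, dst_dev, d_src, src_dev, num_bytes, stream);
}
```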
Peer-to-peer memcopy
With peer access enabled, the same call moves data directly between GPUs:
— ~6.6 GB/s one way on PCIe gen. 2 (12 GB/s on gen. 3)
— 6 + 6 = 12 GB/s duplex on PCIe gen. 2 (22 GB/s on gen. 3)
Unified Virtual Addressing (UVA)
The driver/GPU can determine from an address where the data resides.
Requires:
64-bit Linux or 64-bit Windows with TCC driver
Fermi or later architecture GPUs (compute capability 2.0 or higher)
CUDA 4.0 or later
Peer-to-peer memcopy under UVA
cudaMemcpyPeerAsync( void* dst_addr, int dst_dev,
                     void* src_addr, int src_dev,
                     size_t num_bytes, cudaStream_t stream )
With UVA, plain cudaMemcpyAsync also works, with the copy kind set to cudaMemcpyDefault:
cudaMemcpyAsync( void* dst_addr, void* src_addr,
                 size_t num_bytes, cudaMemcpyDefault,
                 cudaStream_t stream )
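A hedged sketch of a direction-agnostic copy under UVA (the wrapper name is illustrative):

```cuda
#include <cuda_runtime.h>

// Hedged sketch: under UVA, cudaMemcpyDefault lets the driver infer the
// copy direction from the pointer values themselves.
void copy_any(void *dst, const void *src, size_t num_bytes,
              cudaStream_t stream)
{
    // dst and src may each be host or device pointers; no explicit
    // HostToDevice / DeviceToHost / DeviceToDevice kind is needed.
    cudaMemcpyAsync(dst, src, num_bytes, cudaMemcpyDefault, stream);
}
```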
Hiding inter-GPU communication
Overlap kernel and memory copy
Requirements:
— D2H or H2D memcopy from pinned memory
— Device with compute capability ≥ 1.1 (G84 and later)
— Kernel and memcopy in different, non-0 streams
[Timeline figure: the memcopy proceeds on the copy engine while the GPU executes a kernel.]
Pinned memory
cudaHostAlloc(void** pHost, size_t size, int flags)
cudaFreeHost(void* pHost)
allocates/frees pinned (page-locked) CPU memory
cudaHostRegister(void* pHost, size_t size, int flags)
cudaHostUnregister(void* pHost)
pins/unpins previously allocated memory
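A hedged sketch of both routes to pinned memory (the sizes are arbitrary):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    size_t size = 1 << 20;

    // 1) Allocate pinned (page-locked) host memory directly.
    float *a = NULL;
    cudaHostAlloc((void**)&a, size, cudaHostAllocDefault);
    cudaFreeHost(a);

    // 2) Pin an existing heap allocation in place, then unpin it.
    float *b = (float*)malloc(size);
    cudaHostRegister(b, size, cudaHostRegisterDefault);
    cudaHostUnregister(b);
    free(b);
    return 0;
}
```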
Overlap kernel and memory copy
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaHostAlloc(&src, size, 0);
…
cudaMemcpyAsync( dst, src, size, dir, stream1 );
kernel<<<grid, block, 0, stream1>>>(…);
cudaMemcpyAsync( dst, src, size, dir, stream2 );
kernel<<<grid, block, 0, stream2>>>(…);
[Timeline figure: the copy in one stream potentially overlaps the kernel in the other.]
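A hedged, self-contained version of the pattern above (the kernel `inc` and the two buffers are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

__global__ void inc(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a, *d_a, *d_b;
    cudaStream_t stream1, stream2;

    cudaHostAlloc((void**)&h_a, bytes, cudaHostAllocDefault); // pinned source
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // The copy in stream1 can overlap the kernel in stream2:
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);
    inc<<<(n + 255) / 256, 256, 0, stream2>>>(d_b, n);

    cudaDeviceSynchronize();
    return 0;
}
```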
Creating CUDA Events
cudaEventCreate(cudaEvent_t *event)
cudaEventDestroy(cudaEvent_t event)
cudaEventRecord(cudaEvent_t event, cudaStream_t stream=0)
Using CUDA Events
cudaEventSynchronize(cudaEvent_t event)
cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event)
Expressing dependency through events
cudaEvent_t ev;
cudaEventCreate( &ev );
…
cudaMemcpyAsync( dst, src, size, dir, stream1 );
cudaMemcpyAsync( dst2, src2, size, dir, stream2 );
cudaEventRecord( ev, stream2 );
cudaStreamWaitEvent( stream1, ev );  // stream1 waits for the copy in stream2
kernel<<<grid, block, 0, stream1>>>(…);
kernel<<<grid, block, 0, stream2>>>(…);
[Timeline figure: the kernel in stream1 starts only after the copy in stream2 completes.]
Multi-GPU Streams and Events
The Rules
CUDA streams and events are per device (GPU)
— Each device has its own default stream (aka 0- or NULL-stream)
Streams and:
— Kernels: can be launched to a stream only if the stream’s GPU is current
— Memcopies: can be issued to any stream
— Events: can be recorded only to a stream if the stream’s GPU is current
Synchronization/query:
— It is OK to query or synchronize with any event/stream
Example 1
cudaStream_t streamA, streamB;
cudaEvent_t eventA, eventB;
cudaSetDevice( 0 );
cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0
cudaEventCreate( &eventA );
cudaSetDevice( 1 );
cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1
cudaEventCreate( &eventB );
kernel<<<..., streamB>>>(...);
cudaEventRecord( eventB, streamB );
cudaEventSynchronize( eventB );
OK:
• device 1 is current
• eventB and streamB belong to device 1
Example 2
cudaStream_t streamA, streamB;
cudaEvent_t eventA, eventB;
cudaSetDevice( 0 );
cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0
cudaEventCreate( &eventA );
cudaSetDevice( 1 );
cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1
cudaEventCreate( &eventB );
kernel<<<..., streamA>>>(...);
cudaEventRecord( eventB, streamB );
cudaEventSynchronize( eventB );
ERROR:
• device 1 is current
• streamA belongs to device 0
Example 3
cudaStream_t streamA, streamB;
cudaEvent_t eventA, eventB;
cudaSetDevice( 0 );
cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0
cudaEventCreate( &eventA );
cudaSetDevice( 1 );
cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1
cudaEventCreate( &eventB );
kernel<<<..., streamB>>>(...);
cudaEventRecord( eventA, streamB );
cudaEventSynchronize( eventB );
ERROR:
• eventA belongs to device 0
• streamB belongs to device 1
Example 4
cudaStream_t streamA, streamB;
cudaEvent_t eventA, eventB;
cudaSetDevice( 0 );
cudaStreamCreate( &streamA ); // streamA and eventA belong to device-0
cudaEventCreate( &eventA );
cudaSetDevice( 1 );
cudaStreamCreate( &streamB ); // streamB and eventB belong to device-1
cudaEventCreate( &eventB );
kernel<<<..., streamB>>>(...);
cudaEventRecord( eventB, streamB );
cudaSetDevice( 0 );
cudaEventSynchronize( eventB );
kernel<<<..., streamA>>>(...);
OK:
• device-0 is current
• synchronizing/querying events/streams of
other devices is allowed
• here, device-0 won’t start executing the
kernel until device-1 finishes its kernel
Example 5
int gpu_A = 0;
int gpu_B = 1;
cudaSetDevice( gpu_A );
cudaMalloc( &d_A, num_bytes );
int accessible = 0;
cudaDeviceCanAccessPeer( &accessible, gpu_B, gpu_A );
if( accessible )
{
cudaSetDevice(gpu_B );
cudaDeviceEnablePeerAccess( gpu_A, 0 );
kernel<<<...>>>( d_A);
}
Even though the kernel executes on gpu_B, it accesses (via PCIe) memory allocated on gpu_A
Subdividing the problem
Stage 1
Compute halo regions
Stage 2
Compute internal regions
Exchange halo regions
Code Pattern
for( int istep=0; istep<nsteps; istep++)
{
    // compute halos first, then the interior, on each GPU
    for( int i=0; i<num_gpus; i++ )
    {
        cudaSetDevice( gpu[i] );
        kernel_halo<<<..., stream_compute[i]>>>( ... );
        cudaEventRecord( event_i[i], stream_compute[i] );
        kernel_int<<<..., stream_compute[i]>>>( ... );
    }
    // exchange halos while the interior kernels run
    for( int i=0; i<num_gpus-1; i++ ) {
        cudaStreamWaitEvent( stream_copy[i], event_i[i] );
        cudaMemcpyPeerAsync( ..., stream_copy[i] );
    }
    for( int i=1; i<num_gpus; i++ )
        cudaMemcpyPeerAsync( ..., stream_copy[i] );
    // synchronize all GPUs at the end of the step
    for( int i=0; i<num_gpus; i++ )
    {
        cudaSetDevice( gpu[i] );
        cudaDeviceSynchronize();
        // swap input/output pointers
    }
}
[Timeline figure: per GPU, stream_compute runs the halo kernel then the interior kernel; stream_copy performs the halo exchange concurrently with the interior kernel.]
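The pattern assumes per-GPU streams and events have been created up front; a hedged setup sketch using the slides' array names (the device mapping and MAX_GPUS bound are illustrative):

```cuda
#include <cuda_runtime.h>

#define MAX_GPUS 8   // illustrative upper bound

// Hedged sketch of the per-GPU setup the pattern above assumes.
cudaStream_t stream_compute[MAX_GPUS], stream_copy[MAX_GPUS];
cudaEvent_t  event_i[MAX_GPUS];
int          gpu[MAX_GPUS];

void setup(int num_gpus)
{
    for (int i = 0; i < num_gpus; i++) {
        gpu[i] = i;                            // simple 1:1 device mapping
        cudaSetDevice(gpu[i]);                 // objects belong to this GPU
        cudaStreamCreate(&stream_compute[i]);  // halo + interior kernels
        cudaStreamCreate(&stream_copy[i]);     // halo exchange copies
        cudaEventCreate(&event_i[i]);          // "halo kernel done" marker
    }
}
```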
OpenACC in 1 slide
#pragma acc data copyin(in[0:SIZE+2])
#pragma acc kernels loop present(out, in)
for (int i=1; i< SIZE+1; i++) {
out[i] = 0.5*in[i] +
0.25*in[i-1] +
0.25*in[i+1];
}
#pragma acc update host(out)
Multiple GPUs with OpenACC
#pragma omp parallel num_threads(4)
{
int t = omp_get_thread_num();
acc_set_device_num(t, acc_device_nvidia);
#pragma acc data copyin(in)
#pragma acc kernels loop present(in)
for (int i=1; i< 1+SIZE; i++) {
out[i] = 0.5*in[i] +
0.25*in[i-1] +
0.25*in[i+1]; }
#pragma acc update host(out)
}
//compile with -mp -acc
Multiple GPUs with OpenACC
#pragma omp parallel num_threads(4)
{
int t = omp_get_thread_num();
acc_set_device_num(t, acc_device_nvidia);
#pragma acc data copyin(in)
#pragma acc kernels loop present(in)
for (int i=1+t*SIZE/4; i< 1+(t+1)*SIZE/4; i++) {
out[i] = 0.5*in[i] +
0.25*in[i-1] +
0.25*in[i+1]; }
#pragma acc update host(out)
}
//compile with -mp -acc
Multiple GPUs with OpenACC
#pragma omp parallel num_threads(4)
{
int t = omp_get_thread_num();
acc_set_device_num(t, acc_device_nvidia);
#pragma acc data copyin(in[t*SIZE/4:SIZE/4+2])
#pragma acc kernels loop present(in[t*SIZE/4:SIZE/4+2])
for (int i=1+t*SIZE/4; i< 1+(t+1)*SIZE/4; i++) {
out[i] = 0.5*in[i] +
0.25*in[i-1] +
0.25*in[i+1]; }
#pragma acc update host(out[t*SIZE/4+1:SIZE/4])
}
//compile with -mp -acc
Tuning for your topology
CPU/GPU interface
[Figure: four GPUs attached to the host over PCIe.]
Example: 4-GPU Topology
[Figure: GPU-0/GPU-1 and GPU-2/GPU-3 each share a PCIe switch; a third switch connects both pairs to the host system.]
• Two ways to exchange
– Left-right approach
– Pairwise approach
Example: Left-Right Approach for 4 GPUs
Stage 1: send “right” / receive from “left”
Stage 2: send “left” / receive from “right”
[Figure: in each stage the transfers flow through the PCIe switch tree.]
Achieved throughput: ~15 GB/s
Example: Left-Right Approach for 8 GPUs
[Figure: two 4-GPU PCIe switch trees attached through an IOH to two CPUs.]
Achieved aggregate throughput: ~34 GB/s
Example: Pairwise Approach for 4 GPUs
Stage 1: even-odd pairs
Stage 2: odd-even pairs
No contention for PCIe links
Example: Even-Odd Stage of Pairwise Approach for 8 GPUs
[Figure: in the odd-even stage, some pairs span different PCIe switch trees, so their transfers contend on the shared switches and the IOH.]
• Odd-even stage:
– Will always have contention for 8 or more GPUs
• Even-odd stage:
– Will not have contention
1D Communication
For 2 GPUs, the approaches are equivalent
Left-Right approach better for the other cases
In all cases, duplex improves bandwidth
Code for the Left-Right Approach
// gpu[] = device ids; d_out[], d_in[] = per-GPU buffers
for( int i=0; i<num_gpus-1; i++ ) // “right” stage
    cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );
for( int i=0; i<num_gpus; i++ )
    cudaStreamSynchronize( stream[i] );
for( int i=1; i<num_gpus; i++ ) // “left” stage
    cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );
PCIe contention
Dropping the synchronize between the two stages lets the “right” and “left” transfers run at the same time and contend for the same PCIe links:
for( int i=0; i<num_gpus-1; i++ ) // “right” stage
    cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );
//for( int i=0; i<num_gpus; i++ )
//  cudaStreamSynchronize( stream[i] );
for( int i=1; i<num_gpus; i++ ) // “left” stage
    cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );
[Figure: both stages in flight at once; transfers collide on the shared switch links.]
Fix
for( int i=0; i<num_gpus-1; i++ ) // “right” stage
cudaMemcpyPeerAsync( d_out[i+1], gpu[i+1], d_in[i], gpu[i], num_bytes, stream[i] );
for( int i=0; i<num_gpus; i++ )
cudaStreamSynchronize( stream[i] );
for( int i=1; i<num_gpus; i++ ) // “left” stage
cudaMemcpyPeerAsync( d_out[i-1], gpu[i-1], d_in[i], gpu[i], num_bytes, stream[i] );
Better performance with Synchronize
[Figure: with the synchronize restored, the “right” stage completes before the “left” stage begins, so each stage has the PCIe links to itself.]
Determining Topology/Locality of a System
Hardware Locality tool:
— http://www.open-mpi.org/projects/hwloc/
— Cross-OS, cross-platform
lspci -tvv
Hiding inter-node communication
Multiple GPUs, multiple nodes
cudaMemcpyAsync( ..., stream_halo[i] );   // D2H: halo to the host
cudaStreamSynchronize( stream_halo[i] );
MPI_Sendrecv( ... );                      // exchange halos over the network
cudaMemcpyAsync( ..., stream_halo[i] );   // H2D: received halo to the GPU
[Figure: two nodes, each with a CPU driving four GPUs; halos cross the network between the hosts.]
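A hedged, fleshed-out version of this staged exchange (all buffer and variable names are illustrative; a plain, non-CUDA-aware MPI suffices):

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Hedged sketch: stage the halo through pinned host buffers around an
// MPI exchange. Names (h_send, h_recv, d_halo_*) are illustrative.
void exchange_halo(float *d_halo_out, float *d_halo_in,
                   float *h_send, float *h_recv, size_t bytes,
                   int neighbor, cudaStream_t stream_halo)
{
    cudaMemcpyAsync(h_send, d_halo_out, bytes,
                    cudaMemcpyDeviceToHost, stream_halo);
    cudaStreamSynchronize(stream_halo);        // halo is now on the host
    MPI_Sendrecv(h_send, (int)bytes, MPI_BYTE, neighbor, 0,
                 h_recv, (int)bytes, MPI_BYTE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_halo_in, h_recv, bytes,  // received halo back to GPU
                    cudaMemcpyHostToDevice, stream_halo);
}
```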
GPU-aware MPI
MVAPICH
OpenMPI
Cray
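With a CUDA-aware MPI build, device pointers can be passed to MPI calls directly, dropping the host staging copies; a hedged sketch (buffer names illustrative):

```cuda
#include <mpi.h>

// Hedged sketch: a CUDA-aware MPI accepts device pointers directly
// (possibly using GPUDirect RDMA underneath).
void exchange_halo_gpu_aware(float *d_halo_out, float *d_halo_in,
                             int count, int neighbor)
{
    MPI_Sendrecv(d_halo_out, count, MPI_FLOAT, neighbor, 0,
                 d_halo_in,  count, MPI_FLOAT, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```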
GPU-aware MPI
S3047
Introduction to CUDA-aware MPI and NVIDIA GPUDirect™
Jiri Kraus
Wednesday 16:00-16:50
Rm 230C
Overlapping MPI and PCIe Transfers
for( int i=0; i<num_gpus-1; i++ )               // intra-node halo, “right” stage
    cudaMemcpyPeerAsync( ..., stream_halo[i] );
for( int i=0; i<num_gpus; i++ )
    cudaStreamSynchronize( stream_halo[i] );
for( int i=1; i<num_gpus; i++ )                 // intra-node halo, “left” stage
    cudaMemcpyPeerAsync( ..., stream_halo[i] );
cudaSetDevice( gpu[0] );                        // D2H: boundary halos to the host
cudaMemcpyAsync( ..., stream_halo[0] );
cudaSetDevice( gpu[num_gpus-1] );
cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );
MPI_Sendrecv( ... );
for( int i=0; i<num_gpus; i++ )
    cudaStreamSynchronize( stream_halo[i] );
cudaSetDevice( gpu[0] );                        // H2D: received halos to the GPUs
cudaMemcpyAsync( ..., stream_halo[0] );
MPI_Sendrecv( ... );
cudaSetDevice( gpu[num_gpus-1] );
cudaMemcpyAsync( ..., stream_halo[num_gpus-1] );
NVIDIA® GPUDirect™ Support for RDMA: Direct Communication Between GPUs and PCIe Devices
[Figure: two nodes; on each, the GPUs and the network card share the PCIe fabric, so RDMA can move GPU memory to the network card without staging through system memory.]
Case Study
Case Study: TTI FWM
• TTI Forward Wave Modeling
– Fundamental part of TTI RTM
– 3DFD, 8th order in space, 2nd order in time
– 1D domain decomposition
• Experiments:
– Throughput increase over 1 GPU
– Single node, 4-GPU “tree”
Case Study: TTI FWM
• Experiments:
– 512x512x512 cube
– Requires ~7 GB working set
– Single node, 4-GPU “tree”
– 95% scaling to 8 GPUs/node
Case Study: Time Breakdown
Single step (single 8-GPU node):
Halo computation: 1.1 ms
Internal computation: 12.7 ms
Halo exchange: 5.9 ms
Total: 13.8 ms (1.1 + 12.7; the 5.9 ms exchange overlaps the internal computation)
Communication is completely hidden
— ~95% scaling: halo + internal takes 13.8 ms, vs. 13.0 ms if the computation were not split into halo and internal parts
— There is even enough slack for slower (network) communication
Case Study: Multiple Nodes
Test system:
— 3 servers, each with 2 GPUs, Infiniband DDR interconnect
Performance (speedup over 1 GPU):
— 512x512x512 domain:
1 node x 2 GPUs: 1.98x
2 nodes x 1 GPU: 1.97x
2 nodes x 2 GPUs: 3.98x
3 nodes x 2 GPUs: 4.50x (communication takes longer than the internal computation)
— 768x768x768 domain:
3 nodes x 2 GPUs: 5.60x
Conclusions:
— Communication (PCIe and IB DDR) is hidden when each GPU gets ~100 slices
Network is ~68% of all communication time
— IB QDR hides communication when each GPU gets ~70 slices
Summary
Multiple GPUs can stretch your compute dollar
Peer-to-peer copies can move data directly between GPUs
Streams and asynchronous kernels/copies facilitate concurrent execution
Many apps can scale to 8 GPUs and beyond, with great scaling to many GPUs
Get started
Hands-on Training: S3529, Wednesday 5pm
GPU Test Drive: http://www.nvidia.com/GPUTestDrive
Visit us at the CUDA expert table