Unified Memory on P100 (Oak Ridge Leadership Computing Facility, 1/11/2017)
TRANSCRIPT
UNIFIED MEMORY ON P100 Peng Wang
HPC Developer Technology, NVIDIA
2
OVERVIEW
How UM works in P100
UM optimization
UM and directives
3
HOW UNIFIED MEMORY WORKS IN P100
4
HETEROGENEOUS ARCHITECTURES Memory hierarchy
[Diagram: CPU attached to system memory; GPU 0 through GPU N each attached to their own GPU memory]
5
UNIFIED MEMORY Starting with Kepler and CUDA 6
[Diagram: left, custom data management across separate system memory and GPU memory; right, the developer view with Unified Memory as a single memory space]
6
UNIFIED MEMORY IMPROVES DEV CYCLE
Without Unified Memory: Identify Parallel Regions → Allocate and Move Data (often time-consuming and bug-prone) → Implement GPU Kernels → Optimize GPU Kernels
With Unified Memory: Identify Parallel Regions → Implement GPU Kernels → Optimize Data Locality → Optimize GPU Kernels
7
UNIFIED MEMORY Single pointer for CPU and GPU
CPU code:

void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

GPU code with Unified Memory:

void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
8
UNIFIED MEMORY ON PRE-PASCAL Code example explained
Pages allocated before they are used – cannot oversubscribe GPU
Pages migrate to GPU only on kernel launch – cannot migrate on-demand
cudaMallocManaged(&ptr, ...);  // pages are populated in GPU memory
*ptr = 1;                      // CPU page fault: data migrates to CPU
qsort<<<...>>>(ptr);           // kernel launch: data migrates to GPU
9
UNIFIED MEMORY ON PRE-PASCAL Kernel launch triggers bulk page migrations
[Diagram: GPU memory (~0.3 TB/s) and system memory (~0.1 TB/s) connected over PCI-E; cudaMallocManaged populates pages in GPU memory, CPU page faults migrate data to system memory, and a kernel launch triggers a bulk page migration back to the GPU]
10
PAGE MIGRATION ENGINE Supports virtual memory demand paging
49-bit Virtual Addresses
Sufficient to cover 48-bit CPU address + all GPU memory
GPU page faulting capability
Can handle thousands of simultaneous page faults
Up to 2 MB page size
Better TLB coverage of GPU memory
11
UNIFIED MEMORY ON PASCAL Now supports GPU page faults
If GPU does not have a VA translation, it issues an interrupt to CPU
Unified Memory driver could decide to map or migrate depending on heuristics
Pages populated and data migrated on first touch
cudaMallocManaged(&ptr, ...);  // empty, no pages anywhere (similar to malloc)
*ptr = 1;                      // CPU page fault: data is allocated on CPU
qsort<<<...>>>(ptr);           // GPU page fault: data migrates to GPU
12
UNIFIED MEMORY ON PASCAL True on-demand page migrations
[Diagram: GPU memory (~0.7 TB/s) and system memory (~0.1 TB/s) connected over an interconnect; cudaMallocManaged populates no pages; each GPU page fault triggers an on-demand migration, or the VA is mapped to system memory]
13
UNIFIED MEMORY ON PASCAL Improvements over previous GPU generations
On-demand page migration
GPU memory oversubscription is now practical (*)
Concurrent access to memory from CPU and GPU (page-level coherency)
Can access OS-controlled memory on supporting systems
(*) on pre-Pascal you can use zero-copy but the data will always stay in system memory
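The oversubscription point above can be sketched in CUDA C (a minimal sketch, not from the talk; the kernel name and the 24 GB size are illustrative, assuming a Pascal-class GPU with 16 GB of memory):

```cuda
#include <cuda_runtime.h>

// Grid-stride kernel so a fixed grid covers any n.
__global__ void scale(float *a, size_t n) {
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        a[i] *= 2.0f;
}

void oversubscribe_demo(void) {
    size_t n = 6ULL << 30;                       // 6G floats = 24 GB, more than a 16 GB P100
    float *a;
    cudaMallocManaged(&a, n * sizeof(float));    // virtual allocation only; no pages populated yet
    for (size_t i = 0; i < n; i++) a[i] = 1.0f;  // first touch on CPU: pages live in system memory
    scale<<<1024, 256>>>(a, n);                  // GPU page faults migrate pages on demand;
    cudaDeviceSynchronize();                     // cold pages are evicted to make room
    cudaFree(a);
}
```

On pre-Pascal hardware an allocation larger than GPU memory is not usable this way, since all pages had to be resident in GPU memory at kernel launch (zero-copy being the workaround, as noted above).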
14
UM USE CASE: HPGMG
Fine grids are offloaded to the GPU (throughput-optimized cores, TOC); coarse grids are processed on the CPU (latency-optimized cores, LOC)
[Diagram: multigrid V-cycle and F-cycle; levels above a size threshold run on the GPU, levels below it on the CPU]
15
HPGMG AMR PROXY Data locality and reuse of AMR levels
16
OVERSUBSCRIPTION RESULTS
[Chart: application throughput (MDOF/s, 0-200) vs. application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for x86 CPU, K40, P100 (x86 PCI-e), and P100 (P8 NVLINK); the P100 memory size (16 GB) is marked. At the three smallest sizes all 5 levels fit in GPU memory; at 28.9 GB only 2 levels fit; at 58.6 GB only 1 level fits.]
x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads)
17
UNIFIED MEMORY: ATOMICS
Pre-Pascal: atomics from the GPU are atomic only for that GPU
GPU atomics to peer memory are not atomic for remote GPU
GPU atomics to CPU memory are not atomic for CPU operations
Pascal: Unified Memory enables wider scope for atomic operations
NVLINK supports native atomics in hardware (both CPU-GPU and GPU-GPU)
PCI-E will have software-assisted atomics (only CPU-GPU)
18
UNIFIED MEMORY ATOMICS System-Wide Atomics
__global__ void mykernel(int *addr) {
    atomicAdd_system(addr, 10);
}

void foo() {
    int *addr;
    cudaMallocManaged(&addr, 4);
    *addr = 0;
    mykernel<<<...>>>(addr);
    __sync_fetch_and_add(addr, 10);
}

System-wide atomics are not available on Kepler / Maxwell.
Pascal enables system-wide atomics:
- Direct support of atomics over NVLink
- Software-assisted over PCIe
19
UNIFIED MEMORY: MULTI-GPU
Pre-Pascal: direct access requires P2P support, otherwise falls back to sysmem
Use CUDA_MANAGED_FORCE_DEVICE_ALLOC to mitigate this
Pascal: Unified Memory works very similar to CPU-GPU scenario
When GPU A accesses GPU B's memory, GPU A takes a page fault
The driver can decide to migrate the page from GPU B to GPU A, or map it into GPU A's page tables
GPUs can map each other's memory, but the CPU cannot access GPU memory directly
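A minimal multi-GPU sketch of the behavior above (hypothetical kernel; assumes two Pascal-class GPUs in one process):

```cuda
#include <cuda_runtime.h>

__global__ void touch(int *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

void two_gpu_demo(size_t n) {
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));

    cudaSetDevice(0);
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);  // GPU 0 faults; pages migrate to GPU 0
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);  // GPU 1 faults; per-page heuristics decide
    cudaDeviceSynchronize();                               // whether to migrate or map GPU 0's memory

    cudaFree(data);
}
```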
20
UNIFIED MEMORY OPTIMIZATION
21
GENERAL GUIDELINES
UM overhead and its remedies:
- Migration: moving the data is limited by CPU-GPU interconnect bandwidth
- Page faults: updating the page table costs ~10s of μs per page while execution stalls. Solution: prefetch
- Redundant transfer of read-only data. Solution: read duplication
- Thrashing: with infrequent access, migration overhead can exceed the locality benefit. Solution: mapping
22
NEW HINTS API IN CUDA 8
cudaMemPrefetchAsync(ptr, length, destDevice, stream)
- Migrates data to destDevice: overlaps with compute
- Updates the page table: much lower overhead than a page fault in a kernel
- Async operation that follows CUDA stream semantics

cudaMemAdvise(ptr, length, advice, device)
- Specifies the allocation and usage policy for a memory region
- The user can set and unset it at any time
23
PREFETCH Simple code example
void foo(cudaStream_t s) {
    char *data;
    cudaMallocManaged(&data, N);
    init_data(data, N);
    // GPU faults are expensive: prefetch to avoid excess faults
    cudaMemPrefetchAsync(data, N, myGpuId, s);
    // potentially other compute ...
    mykernel<<<..., s>>>(data, N);
    // CPU faults are less expensive, but may still be worth avoiding
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);
    // potentially other compute ...
    cudaStreamSynchronize(s);
    use_data(data, N);
    cudaFree(data);
}
24
PREFETCH EXPERIMENT
__global__ void inc(float *a, int n)
{
int gid = blockIdx.x*blockDim.x + threadIdx.x;
a[gid] += 1;
}
cudaMallocManaged(&a, n*sizeof(float));
for (int i = 0; i < n; i++)
a[i] = 1;
timer.Start();
cudaMemPrefetchAsync(a, n*sizeof(float), 0, NULL);
inc<<<grid, block>>>(a, n);
cudaDeviceSynchronize();
timer.Stop();
n = 128M; times in ms (effective bandwidth in GB/s):

              | Total time | Kernel time | Prefetch time
No prefetch   | 476 (2.3)  | 476 (2.3)   | N/A
Prefetch      | 49 (22)    | 2 (536)     | 47 (11)

Page faults within a kernel are expensive.
25
HPGMG: DATA PREFETCHING
Prefetch the next level while performing computations on the current level
Use cudaMemPrefetchAsync with a non-blocking stream to overlap with the default stream
[Diagram: compute and prefetch pipelined across multigrid levels 0-4; while one level is computed, the next is prefetched and cold levels are evicted from GPU memory]
26
RESULTS WITH USER HINTS
[Chart: application throughput (MDOF/s, 0-200) vs. application working set (1.4, 4.7, 8.6, 28.9, 58.6 GB) for x86 CPU, K40, P100 (x86 PCI-e), P100 + hints (x86 PCI-e), P100 (P8 NVLINK), and P100 + hints (P8 NVLINK); the P100 memory size (16 GB) is marked. At the three smallest sizes all 5 levels fit in GPU memory; at 28.9 GB only 2 levels fit; at 58.6 GB only 1 level fits.]
x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads)
27
READ DUPLICATION
cudaMemAdviseSetReadMostly
Use when data is mostly read and occasionally written

init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
mykernel<<<...>>>(data, N);
use_data(data, N);
Read-only copy will be created on GPU page fault
CPU reads will not page fault
28
READ DUPLICATION
Prefetching creates read-duplicated copy of data and avoids page faults
Note: writes are allowed but will generate page fault and remapping
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
cudaMemPrefetchAsync(data, N, myGpuId, cudaStreamLegacy);
mykernel<<<...>>>(data, N);
use_data(data, N);

A read-only copy will be created during the prefetch
CPU and GPU reads will not fault
29
DIRECT MAPPING Preferred location and direct access
cudaMemAdviseSetPreferredLocation
Set preferred location to avoid migrations
First access will page fault and establish mapping
cudaMemAdviseSetAccessedBy
Pre-map data to avoid page faults
First access will not page fault
Actual data location can be anywhere
30
DIRECT MAPPING
CUDA 8 performance hints (works with cudaMallocManaged)
// set preferred location to CPU to avoid migrations
cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

// keep this region mapped to my GPU to avoid page faults
cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, myGpuId);

// prefetch data to CPU and establish GPU mapping
cudaMemPrefetchAsync(ptr, size, cudaCpuDeviceId, cudaStreamLegacy);
31
HYBRID IMPLEMENTATION Data sharing between CPU and GPU
Level N (large) and level N+1 (small) are shared between the CPU and GPU
[Diagram: on level N the smoother, residual, and restriction run as GPU kernels; on level N+1 they run as CPU functions; both levels live in the same memory]
To avoid frequent migrations, allocate level N+1 pinned in CPU memory
32
DIRECT MAPPING Use case: HPGMG
Problem: excessive faults and migrations at the CPU-GPU crossover point
Solution: pin coarse levels to CPU memory and map them into GPU page tables (no page faults)
Pre-Pascal: allocate the data with cudaMallocHost, or malloc + cudaHostRegister
~5% performance improvement on Tesla P100
33
MPI AND UNIFIED MEMORY
Using Unified Memory with CUDA-aware MPI needs explicit support from the MPI implementation:
Check with your MPI implementation of choice for their support
Unified Memory is supported in OpenMPI since 1.8.5 and MVAPICH2-GDR since 2.2b
Setting the preferred location may help improve the performance of CUDA-aware MPI with managed pointers
34
MULTI-GPU PERFORMANCE: HPGMG
MPI buffers can be allocated with cudaMalloc, cudaMallocHost, cudaMallocManaged
CUDA-aware MPI can stage managed buffers through system or device memory
[Chart: throughput (MDOF/s, 0-90) on 2x K40 with P2P support for four configurations: managed; managed + CUDA-aware; zero-copy; device + CUDA-aware]
35
UNIFIED MEMORY AND DIRECTIVES
36
W/o UM:

module.F90:
module particle_array
real,dimension(:,:),allocatable :: zelectron
!$acc declare create(zelectron)

setup.F90:
allocate(zelectron(6,me))
call init(zelectron)
!$acc update device(zelectron)

pushe.F90:
real mpgc(4,me)
!$acc declare create(mpgc)
!$acc parallel loop
do m=1,me
  zelectron(1,m)=mpgc(1,m)
enddo
!$acc end parallel
!$acc update host(zelectron)
call foo(zelectron)

W/ UM:

module.F90:
module particle_array
real,dimension(:,:),allocatable :: zelectron

setup.F90:
allocate(zelectron(6,me))
call init(zelectron)

pushe.F90:
real mpgc(4,me)
!$acc parallel loop
do m=1,me
  zelectron(1,m)=...
enddo
!$acc end parallel
call foo(zelectron)

Remove all data directives. Add "-ta=managed" to compile.
37
UM OPENACC USE CASE: GTC
Plasma code for fusion science
10X slowdown in key compute routine when turning on Unified Memory
module.F90:
module particle_array
real,dimension(:,:),allocatable :: zelectron

setup.F90:
allocate(zelectron(6,me))
call init(zelectron)

pushe.F90:
real mpgc(4,me)
!$acc parallel loop
do m=1,me
  zelectron(1,m)=mpgc(1,m)
enddo
!$acc end parallel
call foo(zelectron)
Problem: mpgc is an automatic array, allocated and deallocated each time the routine is entered.
The page-fault cost is therefore paid on every entry, even though no data migration occurs.
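The same pitfall is easy to reproduce in CUDA C: allocating managed scratch memory on every call repays the fault cost each time. A hedged sketch of one fix, caching the allocation across calls (the function and kernel names are illustrative, not GTC code):

```cuda
#include <cuda_runtime.h>

__global__ void mykernel(float *scratch, size_t n) { /* ... */ }

// Cached managed scratch buffer: allocated once and reused across calls,
// so page faults are paid only on the first entry.
static float *scratch = NULL;
static size_t scratch_bytes = 0;

void pushe_step(size_t bytes, size_t n) {
    if (bytes > scratch_bytes) {            // (re)allocate only when the size grows
        if (scratch) cudaFree(scratch);
        cudaMallocManaged(&scratch, bytes);
        scratch_bytes = bytes;
    }
    mykernel<<<256, 256>>>(scratch, n);     // subsequent calls hit warm, already-mapped pages
    cudaDeviceSynchronize();
}
```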
38
UNIFIED MEMORY SUMMARY
On-demand page migration
Performance overhead: migration, page fault
Optimization: prefetch, duplicate, mapping