CS179: GPU ProgrammingLecture 5: Memory
Today GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling
Memory Overview
- Very slow access: between host and device
- Slow access: global memory, local memory (both stored off-chip)
- Fast access: shared memory, constant memory, texture memory
- Very fast access: register memory
Global Memory
- Read/write
- Shared between blocks and grids
- Persists across multiple kernel executions
- Very slow to access
- No caching!
Constant Memory
- Read-only on the device
- Cached per multiprocessor -- fairly quick
- Cache can broadcast to all active threads
Texture Memory
- Read-only on the device
- 2D caching -- quick access
- Filtering methods available
Shared Memory
- Read/write per block
- Memory is shared within a block
- Generally quick, but has bad worst cases
Local Memory
- Read/write per thread
- Not too fast (stored off-chip)
- Each thread can only see its own local memory
- Indexable (can hold arrays)
Register Memory
- Read/write per thread
- Extremely fast
- Each thread can only see its own register memory
- Not indexable (can't hold arrays)
Syntax: Register Memory
- Default memory type
- Declare as normal -- no special syntax:
  int var = 1;
- Only accessible by the current thread
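A minimal sketch of the point above: plain scalar locals declared inside a kernel normally end up in registers. The kernel and parameter names here are illustrative, not from the slides.

```cuda
// Ordinary scalar locals (i, x) are candidates for register storage,
// private to each thread. Kernel name and parameters are illustrative.
__global__ void scale_add(const float *in, float *out, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lives in a register
    if (i < n) {
        float x = in[i];      // register variable, private to this thread
        out[i] = a * x + 1.0f;
    }
}
```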
Syntax: Local Memory
- "Global" variables for threads: persist across device functions called by a thread
- Declare with the __device__ __local__ keywords:
  __device__ __local__ int var = 1;
- Can also just use __local__
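The `__local__` qualifier reflects early CUDA; in current CUDA, per-thread storage that cannot live in registers (because registers are not indexable) is placed in local memory automatically. A hedged sketch, with an illustrative kernel name:

```cuda
// A per-thread array indexed with a runtime value cannot sit in
// registers, so the compiler typically places it in local memory.
// Kernel and names are illustrative.
__global__ void per_thread_counts(const int *data, int n) {
    int counts[8] = {0};                  // per-thread, likely local memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        counts[data[i] & 7] += 1;         // dynamic index forces indexable storage
    // counts would normally be combined into a global result here
}
```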
Syntax: Shared Memory
- Shared across threads in a block, not across blocks
- Cannot use pointers, but can use array syntax for arrays
- Declare with the __device__ __shared__ keywords:
  __device__ __shared__ int var[];
- Can also just use __shared__
- Size can be left off an extern array declaration and supplied at kernel launch
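A sketch of the unsized-array form: the array must be declared `extern`, and its byte size is passed as the third launch parameter. The kernel name is illustrative.

```cuda
// An unsized shared array is declared `extern`; its size comes from the
// third <<<...>>> launch parameter. Names are illustrative.
__global__ void reverse_block(const int *in, int *out) {
    extern __shared__ int tile[];         // size set at launch
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                      // wait until the whole tile is loaded
    out[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}

// Launch: reverse_block<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);
```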
Syntax: Global Memory
- Allocated with cudaMalloc
- Can pass pointers between host and kernel
- Transfer is slow!
- Declare file-scope variables with the __device__ keyword:
  __device__ int var = 1;
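A sketch of the cudaMalloc path: allocate global memory on the host, pass the pointer to a kernel, then copy results back (the slow host-device transfer). Error checking is omitted; names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Kernel writes into a global-memory buffer via a pointer passed from the host.
__global__ void set_to_one(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 1;
}

void example(int n) {
    int *d_buf;
    cudaMalloc(&d_buf, n * sizeof(int));             // allocate in global memory
    set_to_one<<<(n + 255) / 256, 256>>>(d_buf, n);
    int *h_buf = (int *)malloc(n * sizeof(int));
    // the device-to-host copy is the very slow transfer from the overview
    cudaMemcpy(h_buf, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
    free(h_buf);
    cudaFree(d_buf);
}
```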
Syntax: Constant Memory
- Declare with the __device__ __constant__ keywords:
  __device__ __constant__ int var = 1;
- Can also just use __constant__
- Set from the host using cudaMemcpyToSymbol (or cudaMemcpy):
  cudaMemcpyToSymbol(var, src, count);
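A sketch tying the two halves together: the host fills a `__constant__` symbol with cudaMemcpyToSymbol, and every thread reads it through the constant cache (broadcast-friendly, since all threads read the same words). Names are illustrative.

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[4];   // lives in constant memory

// Every thread reads the same coeffs: served by the constant cache broadcast.
__global__ void poly_eval(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

void setup() {
    float h_coeffs[4] = {1.f, 2.f, 3.f, 4.f};
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));  // host -> constant
}
```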
Syntax: Texture Memory To be discussed later…
Memory Issues
- Each multiprocessor has a set amount of memory
- This limits the number of resident blocks:
  (# of blocks) x (memory used per block) <= total memory
- Trade-off: many blocks using little memory each, or fewer blocks using lots of memory
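A worked instance of the inequality above, with hypothetical numbers:

```cuda
// Hypothetical numbers: if a multiprocessor has 48 KB of shared memory and
// each block declares a 12 KB tile, then at most 48 KB / 12 KB = 4 blocks
// can be resident on that multiprocessor at once, regardless of other
// resources. Kernel name is illustrative.
__global__ void big_tile_kernel(float *out) {
    __shared__ float tile[12 * 1024 / sizeof(float)];  // 12 KB per block
    tile[threadIdx.x] = 0.0f;                          // (illustrative use)
    out[threadIdx.x] = tile[threadIdx.x];
}
```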
Memory Issues
- Register memory is limited!
- Similar trade-off to shared memory per block: many threads using few registers each, or few threads using many registers
- The former is better: more parallelism
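One way to steer this trade-off, sketched below, is CUDA's `__launch_bounds__` qualifier, which lets the compiler cap register use per thread in favor of more resident blocks. The kernel name and the bound values are illustrative.

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// with at most 256 threads per block and a target of 4 resident blocks,
// the compiler limits registers per thread accordingly.
__global__ void __launch_bounds__(256, 4) heavy_kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;   // (illustrative body)
}
```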
Memory Issues
- Global accesses: slow!
- Can be sped up when accesses are contiguous
- Memory coalescing: arranging a warp's accesses so they combine into contiguous transactions
- Coalesced accesses are: contiguous, in-order, and aligned
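A sketch of the contrast: the same copy written two ways. In `coalesced`, consecutive threads touch consecutive floats; in `strided`, each thread jumps by `stride`, scattering the warp's accesses. Kernel names are illustrative.

```cuda
// Consecutive threads read consecutive addresses: contiguous, in-order,
// and (given an aligned allocation) aligned -- the warp's reads coalesce.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Each thread jumps by `stride` elements: addresses are skipped between
// neighboring threads, so the accesses do not coalesce.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```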
Memory Coalescing: Aligned Accesses
- Threads read 4, 8, or 16 bytes at a time from global memory
- Accesses must be aligned in memory!
- Good: accesses at 0x00, 0x04, ..., 0x14 (each on a 4-byte boundary)
- Bad: an access at 0x07 in the same range breaks alignment
- Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE?
Memory Coalescing: Aligned Accesses
- Also bad: the beginning of the access pattern is unaligned
Memory Coalescing: Aligned Accesses
- Built-in types force alignment
- float3 is 12 B while float4 is 16 B, so float3 array elements do not stay 16-byte aligned!
- To align a struct, use __align__(x) // x = 4, 8, 16
- cudaMalloc aligns the start of each allocated block automatically
- cudaMallocPitch aligns the start of each row for 2D arrays
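A sketch of the `__align__` fix for the float3 problem above; the struct name is illustrative:

```cuda
// float3 carries 12 bytes of payload, so consecutive array elements drift
// off 16-byte boundaries. __align__(16) pads each element to 16 bytes
// (the same effect as using float4), so every element starts aligned.
struct __align__(16) point3 {
    float x, y, z;   // 12 bytes of payload, padded to 16 by __align__(16)
};
```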
Memory Coalescing: Contiguous Accesses
- Contiguous = the accessed memory sits together
- Example of non-contiguous access: threads 3 and 4 swap their accesses
Memory Coalescing: Contiguous Accesses
Which is better?
  index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y);
  index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x);
(The first: consecutive threadIdx.x values map to consecutive addresses.)
Memory Coalescing: Contiguous Accesses
Case 1: Contiguous accesses
[Diagram: threads (0,0), (0,1), (1,0), (1,1) accessing banks 0-3 in order]
Memory Coalescing: In-order Accesses
- In-order accesses: do not skip addresses; access addresses in memory order
- Bad examples: on the left, address 140 is skipped; on the right, many addresses are skipped
Memory Coalescing
Good example: [diagram of in-order accesses with no skipped addresses]
Memory Coalescing
- Not as much of an issue on newer hardware
- Many restrictions relaxed -- e.g., accesses no longer need to be sequential
- However, coalescing and alignment are still good practice!
Memory Issues
- Shared memory can also be limiting
- It is broken up into banks; optimal when an entire warp reads shared memory together
- Each bank services only one thread at a time
- Bank conflict: two threads try to access the same bank -- causes slowdowns!
Bank Conflicts
- Bad: many threads trying to access the same bank
Bank Conflicts
- Good: few to no bank conflicts
Bank Conflicts
- Banks service 32-bit words; each bank's addresses repeat every 64 bytes
- Bank 0 services 0x00, 0x40, 0x80, etc.; bank 1 services 0x04, 0x44, 0x84, etc.
- To avoid multiple threads hitting the same bank:
  - Keep data spread out across banks
  - Split data larger than 4 bytes into multiple accesses
  - Be careful of data elements with an even stride
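A common fix for the even-stride problem, sketched below: pad each row of a 2D shared tile by one word so that walking down a column no longer hits a single bank. Sizes and names are illustrative, and the 16-wide tile assumes the 16-bank layout described above.

```cuda
#define TILE 16   // matches the assumed 16-bank layout; illustrative size

// Without the +1 pad, tile[r][x] for fixed x all map to the same bank
// (stride of exactly TILE words); the pad shifts each row by one bank.
__global__ void column_sum(const float *in, float *out) {
    __shared__ float tile[TILE][TILE + 1];  // +1 word of padding per row
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];
    __syncthreads();
    if (y == 0) {
        float s = 0.0f;
        for (int r = 0; r < TILE; ++r)
            s += tile[r][x];                // column walk: conflict-free with pad
        out[x] = s;
    }
}
```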
Broadcasting
- Fast distribution of data to threads
- Happens when an entire warp tries to access the same address
- The word is broadcast to all threads in one read
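A sketch of a broadcast-friendly read: every thread in the block reads the same shared-memory word, which the hardware serves in a single broadcast instead of serializing. The kernel assumes `data[0]` is nonzero; names are illustrative.

```cuda
// All threads read the same shared variable `pivot`: one broadcast read,
// no bank conflict. Assumes data[0] != 0; names are illustrative.
__global__ void scale_by_first(float *data, int n) {
    __shared__ float pivot;
    if (threadIdx.x == 0) pivot = data[0];  // one thread loads the value
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] /= pivot;            // whole warp reads pivot: broadcast
}
```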
Summary
- The best memory management balances memory optimization with parallelism
- Break the problem up into coalesced chunks
- Process data in shared memory, then copy results back to global memory
- Remember to avoid bank conflicts!
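The pattern above can be sketched as a single kernel: coalesced load into shared memory, process within the block, coalesced write back to global memory. The "processing" (a three-point neighbor average) is illustrative.

```cuda
#define BLOCK 256   // illustrative block size

// Load a coalesced chunk into shared memory, smooth it there, write back.
__global__ void smooth(const float *in, float *out, int n) {
    __shared__ float buf[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[threadIdx.x] = in[i];     // coalesced load into shared
    __syncthreads();
    if (i < n) {
        float c = buf[threadIdx.x];
        // fall back to the center value at block/array edges so we never
        // read an uninitialized or out-of-range shared element
        float l = (threadIdx.x > 0) ? buf[threadIdx.x - 1] : c;
        float r = (threadIdx.x < BLOCK - 1 && i + 1 < n) ? buf[threadIdx.x + 1] : c;
        out[i] = (l + c + r) / 3.0f;         // coalesced write back to global
    }
}
```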
Next Time
- Texture memory
- CUDA applications in graphics