CS179 GPU Programming
Intro to CUDA: Part II
Lecture originally by Luke Durant and Tamas Szalay
Today – More CUDA
  More overview
  How to use in programs
  Matrix multiplication, with code
  Compiling CUDA
CUDA Summary
  What is CUDA?
    Different interface to underlying hardware
    Functions to interface host and device (memory copy, etc.)
    Library to simplify hardware interaction
  Kernels are small programs/functions
    A thread executes a kernel
  A block executes a group of threads (of the same kernel)
    All on one multiprocessor, can share some data
  A grid executes multiple blocks (also of the same kernel)
    Blocks are scheduled in arbitrary order, with no thread-safety guarantees between them
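The thread/block/grid hierarchy above can be sketched with a one-dimensional kernel. This is a minimal illustration, not code from the lecture: `scale_kernel`, `scale_on_gpu`, and the block size of 256 are hypothetical choices.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Kernel: every thread runs this same function on one element.
// (scale_kernel is an illustrative name, not from the lecture.)
__global__ void scale_kernel(float *data, float factor, int n)
{
    // Global index = block offset within the grid + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the last block may be only partly used
        data[i] *= factor;
}

// Host helper: copy in, launch a grid of blocks, copy back.
// Returns 0 on success, -1 on any CUDA error.
int scale_on_gpu(float *host_data, float factor, int n)
{
    assert(n > 0);
    float *dev_data = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    if (cudaMalloc((void **)&dev_data, bytes) != cudaSuccess)
        return -1;
    cudaError_t err = cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                         // threads per block (assumed)
        int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        scale_kernel<<<blocks, threads>>>(dev_data, factor, n);
        // This copy waits for the kernel to finish before reading the result.
        err = cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_data);
    return err == cudaSuccess ? 0 : -1;
}
```

Because blocks run in arbitrary order, the kernel only touches the element owned by its own thread.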
By Analogy
  Global memory, shared memory, constant memory…
    Gets confusing: think by analogy with graphics
  Kernels -> shaders
  Global memory -> buffer objects
    CUDA can access global memory as textures too
  Grid -> single render call
  Things shaders do not have
    Shared memory
    Arbitrary read-write (scattering)
      In shaders, in/out arrays are indexed automatically
    Thread block division, threadIdx, blockIdx
CUDA Layers
CUDA Layers
  Rarely need to use driver
  First labs will concentrate on using runtime
    Sufficient for most things
  Later ones will use libraries briefly
    CUBLAS, CUFFT
    Can even handle CPU-GPU memory transfer
    Fast!
CUFFT Benchmark
Using CUDA
  Notice that CUFFT is much slower with memory transfer
    PCIe 2.0 is 0.5 GB/s per lane
      e.g. x16 is 8 GB/s
    But still have scheduling overhead
      Need to transfer some data to start a grid, for example
  Want to copy data back and forth as little as possible
  We will only be using CUDA synchronously
    Though async interfaces exist
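To get a feel for the transfer cost, here is a rough worked estimate. The 1024x1024 single-precision matrix is a hypothetical example, and the figure assumes the peak x16 bandwidth quoted above while ignoring latency and scheduling overhead, so a real transfer would take longer.

```latex
t_{\text{transfer}} \approx \frac{1024 \times 1024 \times 4\ \text{B}}{8 \times 10^{9}\ \text{B/s}}
  = \frac{4\,194\,304\ \text{B}}{8 \times 10^{9}\ \text{B/s}}
  \approx 5.2 \times 10^{-4}\ \text{s} \approx 0.52\ \text{ms}
```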
Common Program Flow
  Programming GPU is all about memory
    Minimize global memory access, host/device transfer
  Consider matrix example from last lecture
    Copy input matrices to graphics card
    Start kernel grid
    Each block copies sub-matrices into shared memory and multiplies
    Result is copied back onto host machine
  Let’s do this in detail
Matrix Multiplication
  Computing AxB = C, of inner dimension wA (the width of A)
  Calculating each sub-matrix Csub as the product of two long rectangular matrices
    Each multiplied as Csub-sized blocks and accumulated
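Written out, the accumulation looks like the sum below. The symbol b (the side length of a square Csub) and the superscript (k) for the k-th block along the inner dimension are notation introduced here for illustration, and the sketch assumes wA is a multiple of b.

```latex
C_{\text{sub}} = \sum_{k=0}^{wA/b - 1} A_{\text{sub}}^{(k)} \, B_{\text{sub}}^{(k)}
```

Each term is a product of two b x b blocks: the k-th block of the strip of A to the left of Csub and the k-th block of the strip of B above it.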
Matrix Multiplication
  Want sub-matrices as large as possible
    Each thread block is a sub-matrix
    Each thread computes a single element of Csub
    Maximum threads/block is 512, so choose Csub to be 16x16
    Grid size is then determined by the dimensions of C divided by 16
  But how do we step through the A, B sub-matrices?
    Simple: with a big for loop in the kernel
    Loading pair Asub, Bsub into shared memory
Memory Benefits
  So why are we doing it this way again?
  Pretend each thread just computes an element of C by stepping along entire length of A, B
    We have ~1 global memory access per arithmetic instruction
      Global memory access is around 400 clock cycles
      Multiplication is around 10
      This is very bad!
  We have a fixed number of arithmetic instructions
    Want to reduce memory accesses instead
Memory Benefits
  If we instead load Asub, Bsub into shared memory and multiply them into Csub there…
    Takes 256 global accesses
    But we get 16x16x16 arithmetic operations
    Which effectively corresponds to a 16x speedup!
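The 16x figure follows from the ratio of arithmetic to global memory traffic, using the slide's own counts (256 accesses against 16x16x16 operations) and comparing with the roughly one access per operation of the naive version:

```latex
\frac{\text{arithmetic operations}}{\text{global accesses}}
  = \frac{16 \times 16 \times 16}{256}
  = \frac{4096}{256}
  = 16
  \qquad \text{versus} \qquad \approx 1 \ \text{(naive)}
```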
Matrix Code: Setup
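The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the host-side setup step (allocate global memory, copy a matrix to the card, copy a result back). The helper names `upload_matrix` and `download_matrix` are hypothetical; only `cudaMalloc`, `cudaMemcpy`, and `cudaFree` are actual runtime calls.

```cuda
#include <assert.h>
#include <stddef.h>
#include <cuda_runtime.h>

// Allocate device (global) memory for a rows x cols float matrix and copy
// the host data into it. Returns NULL on failure.
// (upload_matrix is an illustrative helper, not part of the CUDA runtime.)
float *upload_matrix(const float *host, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    float *dev = NULL;
    size_t bytes = (size_t)rows * cols * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return NULL;
    if (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return NULL;
    }
    return dev;
}

// Copy a device matrix back to the host and release the device memory.
// Returns 0 on success, -1 on failure.
int download_matrix(float *host, float *dev, int rows, int cols)
{
    size_t bytes = (size_t)rows * cols * sizeof(float);
    cudaError_t err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}
```

In the matrix example, A and B would each be uploaded once, and C downloaded once after the kernel finishes, keeping host/device traffic to a minimum.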
Matrix Code: Launch
  Note <<<dimGrid, dimBlock>>> syntax used to launch
  Can pass values, pointers to global memory, etc.
  Will talk more about syntax in recitation
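The launch code itself is missing from the transcript, so here is a hedged sketch of what a 2D launch with `<<<dimGrid, dimBlock>>>` can look like. The kernel `index_kernel`, the wrapper `launch_index_kernel`, and the requirement that both dimensions be multiples of 16 are assumptions made for this example.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 16   // 16x16 = 256 threads per block, as chosen for Csub

// Illustrative kernel: each thread writes its own linear index into C.
// It receives a pointer to global memory and a plain value (width).
__global__ void index_kernel(float *C, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    C[row * width + col] = (float)(row * width + col);
}

// Launch one thread per element of a height x width matrix and copy the
// result to host_C. Returns 0 on success, -1 on any CUDA error.
int launch_index_kernel(float *host_C, int height, int width)
{
    assert(height % BLOCK_SIZE == 0 && width % BLOCK_SIZE == 0);
    float *dev_C = NULL;
    size_t bytes = (size_t)height * width * sizeof(float);
    if (cudaMalloc((void **)&dev_C, bytes) != cudaSuccess)
        return -1;

    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);                  // threads per block
    dim3 dimGrid(width / BLOCK_SIZE, height / BLOCK_SIZE);  // blocks per grid
    index_kernel<<<dimGrid, dimBlock>>>(dev_C, width);

    cudaError_t err = cudaMemcpy(host_C, dev_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_C);
    return err == cudaSuccess ? 0 : -1;
}
```

Note that x indexes columns and y indexes rows, so the grid is (width/16) blocks wide and (height/16) blocks tall.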
Matrix Code: Loading
  Note that each thread only loads one element of each sub-matrix, and indexes shared memory via threadIdx
    And thus the threads need to be synchronized
  threadIdx.x, threadIdx.y
Matrix Code: Multiplication
  Note we need to synchronize again
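The kernel code from the loading and multiplication slides is not in the transcript. The sketch below shows one common way to write the tiled kernel described here, with both synchronization points marked. It is not the lecture's exact code: the names `matmul_tiled`, `matmul_on_gpu`, `As`, `Bs`, the row-major layout, and the assumption that all dimensions are multiples of 16 are choices made for this example.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

#define TILE 16

// C = A * B with A (hA x wA) and B (wA x wB), row-major.
// One block computes one TILE x TILE sub-matrix of C, one thread per element.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int wA, int wB)
{
    __shared__ float As[TILE][TILE];   // current sub-matrix of A
    __shared__ float Bs[TILE][TILE];   // current sub-matrix of B

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float sum = 0.0f;
    for (int t = 0; t < wA / TILE; ++t) {          // the "big for loop"
        // Loading: each thread fetches one element of each sub-matrix.
        As[ty][tx] = A[row * wA + t * TILE + tx];
        Bs[ty][tx] = B[(t * TILE + ty) * wB + col];
        __syncthreads();   // wait until the whole tile is loaded

        // Multiplication: all reads here come from shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();   // wait before the next iteration overwrites the tiles
    }
    C[row * wB + col] = sum;
}

// Host wrapper: returns 0 on success, -1 on any CUDA error.
int matmul_on_gpu(const float *A, const float *B, float *C, int hA, int wA, int wB)
{
    assert(hA % TILE == 0 && wA % TILE == 0 && wB % TILE == 0);
    size_t bytesA = (size_t)hA * wA * sizeof(float);
    size_t bytesB = (size_t)wA * wB * sizeof(float);
    size_t bytesC = (size_t)hA * wB * sizeof(float);
    float *dA = NULL, *dB = NULL, *dC = NULL;

    cudaError_t err = cudaMalloc((void **)&dA, bytesA);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dB, bytesB);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dC, bytesC);
    if (err == cudaSuccess) err = cudaMemcpy(dA, A, bytesA, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, B, bytesB, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 dimBlock(TILE, TILE);
        dim3 dimGrid(wB / TILE, hA / TILE);
        matmul_tiled<<<dimGrid, dimBlock>>>(dA, dB, dC, wA, wB);
        err = cudaMemcpy(C, dC, bytesC, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);   // cudaFree on a NULL pointer does nothing
    cudaFree(dB);
    cudaFree(dC);
    return err == cudaSuccess ? 0 : -1;
}
```

The first `__syncthreads()` ensures no thread reads a tile element before it has been written; the second ensures no thread overwrites a tile that another thread is still reading.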
Matrix Multiplication
  This is all confusing, sub-parallelization
  Read over the CUDA Programmer’s Guide
    Covers matrix multiplication in greater detail
    Plus, explains how to do everything
CUDA Code
  A couple CUDA language features to keep an eye out for
    Some special identifiers, __device__, __shared__
    function<<< … >>>() kernel launch syntax
    Special data types: dim3, but also have float2…float4, etc. (like vec2…vec4 in GLSL)
    CUDA runtime functions start with “cuda”
    Driver functions start with just “cu”
  Again, more on coding specifics in recitation
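A small sketch touching several of these features at once: a `__device__` function, the `float4` vector type, `dim3`, the launch syntax, and runtime calls prefixed with "cuda". The names `dot4`, `dot4_kernel`, and `dot4_on_gpu` are hypothetical, and the block size of 128 is an arbitrary choice.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from device code.
__device__ float dot4(float4 a, float4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// __global__: a kernel, launched from the host with <<< ... >>>.
__global__ void dot4_kernel(const float4 *a, const float4 *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = dot4(a[i], b[i]);
}

// Host wrapper: returns 0 on success, -1 on any CUDA error.
int dot4_on_gpu(const float4 *a, const float4 *b, float *out, int n)
{
    assert(n > 0);
    size_t inBytes = (size_t)n * sizeof(float4);
    size_t outBytes = (size_t)n * sizeof(float);
    float4 *dA = NULL, *dB = NULL;
    float *dOut = NULL;

    cudaError_t err = cudaMalloc((void **)&dA, inBytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dB, inBytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dOut, outBytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, a, inBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, b, inBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 dimBlock(128);                                // unused y, z default to 1
        dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);
        dot4_kernel<<<dimGrid, dimBlock>>>(dA, dB, dOut, n);
        err = cudaMemcpy(out, dOut, outBytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : -1;
}
```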
Compiling CUDA
  What happens when you compile?
    Some of the code has to be built for the GPU…
  nvcc is NVIDIA’s CUDA compiler
    Extracts and compiles C code intended for device
    Then calls main compiler (gcc, or cl if using Windows) on remainder (host code)
    Typically operates on files ending in .cu
  Setting up Makefiles can be a pain
    So we do it for you
Emulation Mode
  CUDA programs can also be compiled/linked to emulation library
    Some of you may need to do this, if your personal computers don’t support CUDA
  Two problems:
    Very very slow (obviously)
    Synchronous and deterministic: if you have multithreading bugs, you might not see them
  Still, rather useful for testing/debugging
    Download the SDK, try it out
    Compile it with ‘make emu=1’ to use emulation
Homework
  Grab the CUDA programming manual
  Check out table of contents so you know what’s in it
  Read matrix multiplication code
  Understand it conceptually
  Coding details will be explained in recitation