cs101c gpu programmingcourses.cms.caltech.edu/cs101gpu/2013/lec10_cuda_intro_2.pdf · 1 cs179 gpu...

CS179 GPU Programming
Intro to CUDA: Part II

Lecture originally by Luke Durant and Tamas Szalay


Today – More CUDA
More overview
How to use in programs
Matrix multiplication, with code
Compiling CUDA


CUDA Summary
What is CUDA?
  Different interface to underlying hardware
  Functions to interface host and device (memory copy, etc.)
  Library to simplify hardware interaction
Kernels are small programs/functions
A thread executes a kernel
A block executes a group of threads (of same kernel)
  All on one multiprocessor, can share some data
A grid executes multiple blocks (also of same kernel)
  Blocks are scheduled arbitrarily, no thread-safety
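
The thread/block/grid hierarchy can be sketched with a minimal kernel in which every thread works out its own position in the grid. This is an illustration only: the names writeGlobalIndex, dOut and hOut and the sizes (1000 elements, 256 threads per block) are assumptions chosen for the example, not taken from the lecture.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: every thread runs this same function and uses its block and thread
// indices to pick the one array element it is responsible for.
__global__ void writeGlobalIndex(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // position within the whole grid
    if (i < n)                                      // last block may run past the end
        out[i] = i;
}

int main()
{
    const int n = 1000;                // hypothetical problem size
    const int threadsPerBlock = 256;   // hypothetical block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up

    int *dOut = NULL;
    if (cudaMalloc((void **)&dOut, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    // One grid of `blocks` blocks, each with `threadsPerBlock` threads.
    writeGlobalIndex<<<blocks, threadsPerBlock>>>(dOut, n);

    int *hOut = (int *)malloc(n * sizeof(int));
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != i) ++mismatches;
    printf("%d blocks of %d threads, %d mismatches\n", blocks, threadsPerBlock, mismatches);

    free(hOut);
    cudaFree(dOut);
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Because blocks may run in any order, the kernel is written so that no thread depends on a result produced by another block.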


By Analogy
Global memory, shared memory, constant memory…
  Gets confusing: think by analogy with graphics
Kernels -> shaders
Global memory -> buffer objects
  CUDA can access global memory as textures too
Grid -> single render call
Things shaders do not have
  Shared memory
  Arbitrary read-write (scattering)
    In shaders, in/out arrays indexed automatically
  Thread block division, threadIdx, blockIdx


CUDA Layers


CUDA Layers
Rarely need to use driver
First labs will concentrate on using runtime
  Sufficient for most things
Later ones will use libraries briefly
  CUBLAS, CUFFT
  Can even handle CPU-GPU memory transfer
  Fast!


CUFFT Benchmark


Using CUDA
Notice that CUFFT is much slower with memory transfer
  PCIe 2.0 is 0.5 GB/s per lane, in each direction
    e.g. x16 is 8 GB/s
  But still have scheduling overhead
    Need to transfer some data to start a grid, for example
Want to copy data back and forth as little as possible
We will only be using CUDA synchronously
  Though async interfaces exist
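
One way to see the transfer cost on a particular machine is to time a single synchronous host-to-device copy with CUDA events. The sketch below is only that: the 64 MiB buffer size and the names hBuf and dBuf are arbitrary assumptions, and the measured rate depends on the bus, the GPU and whether host memory is pinned, so no expected figure is given.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary test size

    float *hBuf = (float *)malloc(bytes);  // ordinary (pageable) host memory
    float *dBuf = NULL;
    if (hBuf == NULL || cudaMalloc((void **)&dBuf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }
    memset(hBuf, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the synchronous copy with two events on the default stream.
    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has been reached

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("Copied %zu bytes in %.3f ms (%.2f GB/s)\n",
           bytes, ms, (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    free(hBuf);
    return EXIT_SUCCESS;
}
```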


Common Program Flow
Programming GPU is all about memory
  Minimize global memory access, host/device transfer
Consider matrix example from last lecture
  Copy input matrices to graphics card
  Start kernel grid
  Each block copies sub-matrices into shared memory and multiplies
  Result is copied back onto host machine
Let’s do this in detail
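
The host-side flow (copy in, launch, copy out) might look roughly like the sketch below. To keep the flow visible it uses a deliberately naive kernel, one thread per element of C reading straight from global memory, rather than the shared-memory version. The names matMulNaive, hA/dA and so on, the 64x64 size and the fill values are assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Naive kernel: each thread computes one element of C = A x B (square, n x n,
// row-major) by walking a full row of A and a full column of B in global memory.
__global__ void matMulNaive(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const int n = 64;  // hypothetical matrix size
    const size_t bytes = n * n * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // 1. Copy input matrices to the graphics card.
    float *dA = NULL, *dB = NULL, *dC = NULL;
    check(cudaMalloc((void **)&dA, bytes), "cudaMalloc A");
    check(cudaMalloc((void **)&dB, bytes), "cudaMalloc B");
    check(cudaMalloc((void **)&dC, bytes), "cudaMalloc C");
    check(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice), "copy B");

    // 2. Start the kernel grid: 16x16 threads per block, enough blocks to cover C.
    dim3 dimBlock(16, 16);
    dim3 dimGrid((n + 15) / 16, (n + 15) / 16);
    matMulNaive<<<dimGrid, dimBlock>>>(dA, dB, dC, n);
    check(cudaGetLastError(), "kernel launch");

    // 3. Copy the result back onto the host (this also waits for the kernel).
    check(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost), "copy C");

    // Every element is a sum of n products 1 * 2, so it should equal 2 * n.
    int bad = 0;
    for (int i = 0; i < n * n; ++i)
        if (hC[i] != 2.0f * n) ++bad;
    printf("%d of %d elements are wrong\n", bad, n * n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return bad == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```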


Matrix Multiplication
Computing A x B = C, of inner dimension wA
Calculating each sub-matrix Csub as product of two long rectangular matrices
  Each multiplied as Csub-sized blocks and accumulated


Matrix Multiplication
Want sub-matrices as large as possible
  Each thread block is a sub-matrix
  Each thread computes a single element of Csub
Maximum threads/block is 512 (on compute capability 1.x hardware), so choose Csub to be 16x16
  Grid size is then determined by the dimensions of C divided by 16
But how do we step through the A, B sub-matrices?
  Simple: with a big for loop in the kernel
  Loading pair Asub, Bsub into shared memory


Memory Benefits
So why are we doing it this way again?
Pretend each thread just computes an element of C by stepping along entire length of A, B
  We have ~1 global memory access per arithmetic instruction
  Global memory access is around 400 clock cycles
  Multiplication is around 10
  This is very bad!
We have a fixed number of arithmetic instructions
  Want to reduce memory accesses instead


Memory Benefits
If we instead load Asub, Bsub into shared memory and multiply them into Csub there…
  Takes 256 global accesses per sub-matrix (one per element of a 16x16 tile)
  But we get 16x16x16 arithmetic operations
  Which effectively corresponds to a 16x speedup!
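
The factor of 16 can be checked by counting global loads for one pair of 16x16 tiles. This is a rough count under the assumption that the kernel is limited by global memory traffic, so "speedup" here really means a reduction in global accesses.

```latex
\begin{align*}
\text{multiply-adds per tile pair} &= 16 \times 16 \times 16 = 4096 \\
\text{naive global loads (one from } A \text{, one from } B \text{ per multiply-add)} &= 2 \times 4096 = 8192 \\
\text{tiled global loads (each element of } A_{sub}, B_{sub} \text{ loaded once)} &= 2 \times 16 \times 16 = 512 \\
\text{reduction in global loads} &= \frac{8192}{512} = 16
\end{align*}
```

In general, a tile of width $T$ reduces global loads by a factor of $T$, which is why larger sub-matrices are preferred when they fit.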


Matrix Code: Setup


Matrix Code: Launch

Note <<<dimGrid, dimBlock>>> syntax used to launch

Can pass values, pointers to global memory, etc.
Will talk more about syntax in recitation
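
Since the original code slide did not survive, here is a hedged stand-in for a launch. It builds dimBlock and dimGrid for a matrix whose dimensions are multiples of 16, passes both plain values and a pointer to global memory, and checks that every element was visited exactly once. The kernel countVisits and the 48x80 size are invented for this illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread increments the one element it maps to, so a correct launch
// configuration leaves every element equal to 1.
__global__ void countVisits(int *visits, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        visits[row * width + col] += 1;
}

int main()
{
    const int height = 48, width = 80;  // hypothetical size of C, multiples of 16
    const size_t bytes = width * height * sizeof(int);

    int *dVisits = NULL;
    if (cudaMalloc((void **)&dVisits, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemset(dVisits, 0, bytes);

    dim3 dimBlock(16, 16);                   // one thread per element of a 16x16 tile
    dim3 dimGrid(width / 16, height / 16);   // one block per tile of C

    // Kernel arguments: a pointer to global memory plus two values.
    countVisits<<<dimGrid, dimBlock>>>(dVisits, width, height);

    int *hVisits = (int *)malloc(bytes);
    cudaError_t err = cudaMemcpy(hVisits, dVisits, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    int bad = 0;
    for (int i = 0; i < width * height; ++i)
        if (hVisits[i] != 1) ++bad;
    printf("grid %ux%u, block %ux%u, %d elements not visited exactly once\n",
           dimGrid.x, dimGrid.y, dimBlock.x, dimBlock.y, bad);

    free(hVisits);
    cudaFree(dVisits);
    return bad == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```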


Matrix Code: Loading

Note that each thread loads only one element of each sub-matrix, and indexes shared memory via threadIdx

And thus they need to be synchronized

threadIdx.x, threadIdx.y


Matrix Code: Multiplication

Note we need to synchronize again
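
The loading and multiplication steps, with their two synchronization points, can be sketched as one kernel. This is not the code from the original slides, which were lost in extraction. It is a reconstruction in the same spirit, and it assumes square n x n row-major matrices with n a multiple of 16. The names matMulTiled, TILE, As and Bs are chosen for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // width of Csub, so each block has 16x16 = 256 threads

__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];  // current sub-matrix of A
    __shared__ float Bs[TILE][TILE];  // current sub-matrix of B

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;  // element of C this thread computes
    int col = blockIdx.x * TILE + tx;

    float sum = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {  // the "big for loop" over tile pairs
        // Each thread loads one element of Asub and one element of Bsub.
        As[ty][tx] = A[row * n + (t * TILE + tx)];
        Bs[ty][tx] = B[(t * TILE + ty) * n + col];
        __syncthreads();  // all loads must finish before anyone multiplies

        for (int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();  // all multiplies must finish before tiles are overwritten
    }
    C[row * n + col] = sum;
}

static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const int n = 64;  // hypothetical size, a multiple of TILE
    const size_t bytes = n * n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) {  // small integers keep float sums exact
        hA[i] = (float)(i % 7);
        hB[i] = (float)(i % 5);
    }

    float *dA = NULL, *dB = NULL, *dC = NULL;
    check(cudaMalloc((void **)&dA, bytes), "cudaMalloc A");
    check(cudaMalloc((void **)&dB, bytes), "cudaMalloc B");
    check(cudaMalloc((void **)&dC, bytes), "cudaMalloc C");
    check(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice), "copy B");

    dim3 dimBlock(TILE, TILE);
    dim3 dimGrid(n / TILE, n / TILE);
    matMulTiled<<<dimGrid, dimBlock>>>(dA, dB, dC, n);
    check(cudaGetLastError(), "kernel launch");
    check(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost), "copy C");

    int bad = 0;  // compare against a straightforward CPU reference
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[r * n + k] * hB[k * n + c];
            if (hC[r * n + c] != ref) ++bad;
        }
    printf("%d of %d elements differ from the CPU reference\n", bad, n * n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return bad == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Dropping either __syncthreads() call would let some threads read a tile that other threads have not finished writing, or overwrite one still in use.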


Matrix Multiplication
This is all confusing, sub-parallelization
Read over the CUDA Programmer’s Guide
  Covers matrix multiplication in greater detail
  Plus, explains how to do everything


CUDA Code
A couple of CUDA language features to keep an eye out for
  Some special qualifiers: __device__, __shared__
  function<<< … >>>() kernel launch syntax
  Special data types: dim3, but also have float2…float4, etc. (like vec2…vec4 in GLSL)
  CUDA runtime functions start with “cuda”
  Driver functions start with just “cu”
Again, more on coding specifics in recitation
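
A few of these features can be seen together in a small sketch: a __device__ helper, a kernel launched with <<< >>>, dim3 launch dimensions, the float2 vector type and runtime calls prefixed with "cuda". The function names dot2 and squaredLengths and the four-element input are invented for this illustration. The kernel qualifier __global__ is not named on the slide but is standard CUDA.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// __device__: runs on the GPU and is callable only from device code.
__device__ float dot2(float2 a, float2 b)
{
    return a.x * b.x + a.y * b.y;
}

// __global__: a kernel, launched from the host with <<< >>>.
__global__ void squaredLengths(const float2 *v, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = dot2(v[i], v[i]);
}

int main()
{
    const int n = 4;  // hypothetical tiny input
    float2 hV[n];
    float hOut[n];
    for (int i = 0; i < n; ++i)
        hV[i] = make_float2((float)i, 1.0f);  // built-in constructor for float2

    float2 *dV = NULL;
    float *dOut = NULL;
    if (cudaMalloc((void **)&dV, n * sizeof(float2)) != cudaSuccess ||
        cudaMalloc((void **)&dOut, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }
    cudaMemcpy(dV, hV, n * sizeof(float2), cudaMemcpyHostToDevice);

    dim3 dimBlock(n);  // unspecified y and z components default to 1
    dim3 dimGrid(1);
    squaredLengths<<<dimGrid, dimBlock>>>(dV, dOut, n);

    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i)
        printf("|(%g, %g)|^2 = %g\n", hV[i].x, hV[i].y, hOut[i]);

    cudaFree(dV);
    cudaFree(dOut);
    return EXIT_SUCCESS;
}
```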


Compiling CUDA
What happens when you compile?
  Some of the code has to be built for the GPU…
nvcc is NVIDIA’s CUDA compiler
  Extracts and compiles C code intended for device
  Then calls main compiler (gcc, or cl if using Windows) on remainder (host code)
  Typically operates on files ending in .cu
Setting up Makefiles can be a pain
  So we do it for you


Emulation Mode
CUDA programs can also be compiled/linked to emulation library
  Some of you may need to do this, if your personal computers don’t support CUDA
Two problems:
  Very very slow (obviously)
  Synchronous and deterministic: if you have multithreading bugs, you might not see them
Still, rather useful for testing/debugging
Download the SDK, try it out
  Compile it with ‘make emu=1’ to use emulation


Homework
Grab the CUDA programming manual
  Check out table of contents so you know what’s in it
Read matrix multiplication code
  Understand it conceptually
  Coding details will be explained in recitation
