
Page 1: CS179: GPU Programming

CS179: GPU Programming
Lecture 8: More CUDA Runtime

Page 2: CS179: GPU Programming

Today:
- CUDA arrays for textures
- CUDA runtime
- Helpful CUDA functions

Page 3: CS179: GPU Programming

CUDA Arrays

Recall texture memory:
- Used to store large data
- Stored on the GPU
- Accessible to all blocks and threads

Page 4: CS179: GPU Programming

CUDA Arrays

- We used texture memory for buffers (lab 3), which allows vertex data to remain on the GPU
- How else can we access texture memory? CUDA arrays

Page 5: CS179: GPU Programming

CUDA Arrays

Why CUDA arrays over normal arrays?
- Better caching, including 2D caching for spatial locality
- Supports wrapping/clamping
- Supports filtering

Page 6: CS179: GPU Programming

CUDA Linear Textures

"Textures," but in global memory. Usage (see the sketch after this list):
- Step 1: Create a texture reference: texture<TYPE> tex
  - TYPE = float, float3, int, etc.
- Step 2: Bind memory to the texture reference: cudaBindTexture(offset, tex, devPtr, size);
- Step 3: Read data on the device via tex1Dfetch(tex, x)
  - x is the index of the element we want to read
- Step 4: Clean up when finished: cudaUnbindTexture(tex)
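Putting the four steps together, a minimal sketch (devPtr, devOut, and N are hypothetical names, not from the slides):

texture<float, 1, cudaReadModeElementType> tex;   // Step 1: file-scope texture reference

__global__ void readKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);              // Step 3: fetch element i
}

void runExample(float *devPtr, float *devOut, int N) {
    cudaBindTexture(NULL, tex, devPtr, N * sizeof(float));  // Step 2: bind device memory
    readKernel<<<(N + 255) / 256, 256>>>(devOut, N);
    cudaUnbindTexture(tex);                       // Step 4: clean up
}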

Page 7: CS179: GPU Programming

CUDA Linear Textures

Texture reference properties: texture<type, dim, mode> texRef
- type = float, int, float3, etc.
- dim = # of dimensions (1, 2, or 3)
- mode:
  - cudaReadModeElementType: standard read
  - cudaReadModeNormalizedFloat: maps integers to floats, e.g. 0 -> 0.0 and 255 -> 1.0
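For example, two illustrative declarations (the names are placeholders):

texture<float, 1, cudaReadModeElementType> texA;      // 1D float texture, raw element reads
texture<uchar4, 2, cudaReadModeNormalizedFloat> texB; // 2D uchar4 texture; reads return floats in [0.0, 1.0]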

Page 8: CS179: GPU Programming

CUDA Linear Textures

Important warning:
- Textures live in a global space of memory, so threads can read and write the texture at the same time
- This can cause synchronization problems! Do not rely on thread execution order, ever.

Page 9: CS179: GPU Programming

CUDA Linear Textures

Other limitations:
- Only 1D, which can make indexing and caching a bit less convenient
- Pitch may not be ideal for a 2D array
- Not read-write

Solution: CUDA arrays

Page 10: CS179: GPU Programming

CUDA Arrays

- Live in texture memory space
- Accessed via texture fetches

Page 11: CS179: GPU Programming

CUDA Arrays

Step 1: Create a channel description
- Tells us texture attributes
- cudaCreateChannelDesc(int x, int y, int z, int w, enum mode)
  - x, y, z, w are the number of bits per component
  - mode is cudaChannelFormatKindFloat, etc.
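For example (a sketch; descF and desc4 are hypothetical names):

cudaChannelFormatDesc descF = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);   // one 32-bit float component
cudaChannelFormatDesc desc4 = cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned); // four 8-bit unsigned components (uchar4)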

Page 12: CS179: GPU Programming

CUDA Arrays

Step 2: Allocate memory
- Must be done dynamically
- Use cudaMallocArray(cudaArray **array, const struct cudaChannelFormatDesc *desc, size_t width, size_t height)
- Most global memory functions work with CUDA arrays too: cudaMemcpyToArray, etc.

Page 13: CS179: GPU Programming

CUDA Arrays

Step 3: Create a texture reference
- texture<TYPE, dim, mode> texRef, just as before
- Parameters must match the channel description where applicable

Step 4: Edit texture settings
- Settings are encoded as texRef struct members

Page 14: CS179: GPU Programming

CUDA Arrays

Step 5: Bind the texture reference to the array
- cudaBindTextureToArray(texRef, array)

Step 6: Access the texture
- Similar to before, but now we have more options: tex1D(texRef, x), tex2D(texRef, x, y)
- An end-to-end sketch follows below
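A minimal end-to-end sketch of Steps 1-6 (h_data, d_out, W, and H are hypothetical names):

texture<float, 2, cudaReadModeElementType> texRef;             // Step 3: texture reference

__global__ void copyKernel(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(texRef, x, y);                  // Step 6: texture fetch
}

void runArrayExample(const float *h_data, float *d_out, int W, int H) {
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);  // Step 1
    cudaArray *array;
    cudaMallocArray(&array, &desc, W, H);                      // Step 2: allocate the array
    cudaMemcpyToArray(array, 0, 0, h_data,                     // global-memory-style copy
                      W * H * sizeof(float), cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texRef, array);                     // Step 5: bind (Step 4 settings left at defaults)
    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    copyKernel<<<grid, block>>>(d_out, W, H);
    cudaUnbindTexture(texRef);
    cudaFreeArray(array);
}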

Page 15: CS179: GPU Programming

CUDA Arrays

Final notes:
- Coordinates can be normalized to [0, 1] if in float mode
- Filter modes: nearest point or linear; tells CUDA how to blend the texture
- Wrap vs. clamp:
  - Wrap: out-of-bounds accesses wrap around to the other side, e.g. (1.5, 0.5) -> (0.5, 0.5)
  - Clamp: out-of-bounds accesses are set to the border value, e.g. (1.5, 0.5) -> (1.0, 0.5)
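As a sketch, these settings (Step 4) map onto struct members of the texture reference, e.g. the texRef declared above:

texRef.normalized = true;                     // coordinates given in [0, 1] instead of [0, W)
texRef.filterMode = cudaFilterModeLinear;     // linear blending (vs. cudaFilterModePoint)
texRef.addressMode[0] = cudaAddressModeWrap;  // wrap out-of-bounds x (requires normalized coordinates)
texRef.addressMode[1] = cudaAddressModeClamp; // clamp out-of-bounds y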

Page 16: CS179: GPU Programming

CUDA Arrays

[Figure: point sampling vs. linear sampling]

Page 17: CS179: GPU Programming

CUDA Arrays

[Figure: wrap vs. clamp addressing]

Page 18: CS179: GPU Programming

CUDA Runtime

- Nothing new: every cuda____ function is part of the runtime
- Lots of other helpful functions
- Many runtime functions are aimed at making your program robust: check properties of the card, set up multiple GPUs, etc.
- Necessary for multi-platform development!

Page 19: CS179: GPU Programming

CUDA Runtime

Starting the runtime:
- Simply call any cuda_____ function!

CUDA can waste a lot of resources:
- Stop CUDA with cudaThreadExit()
- Called automatically on CPU exit, but you may want to call it earlier

Page 20: CS179: GPU Programming

CUDA Runtime

Getting devices and properties:
- cudaGetDeviceCount(int *n): returns the # of CUDA-capable devices
  - Can use this to check whether the machine is CUDA-capable!
- cudaSetDevice(int n): makes device n the currently used device
- cudaGetDeviceProperties(struct cudaDeviceProp *prop, int n): loads data from device n into prop (see the sketch below)
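A hedged sketch putting these together, checking for a CUDA-capable device and printing a few properties:

#include <cstdio>

void reportDevice() {
    int n;
    cudaGetDeviceCount(&n);
    if (n == 0) {
        printf("No CUDA-capable device found\n");
        return;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: %zu bytes global memory, %d multiprocessors\n",
           prop.name, prop.totalGlobalMem, prop.multiProcessorCount);
}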

Page 21: CS179: GPU Programming

Device Properties

- char name[256]: ASCII identifier of the GPU
- size_t totalGlobalMem: total global memory available
- size_t sharedMemPerBlock: shared memory available per block
- int regsPerBlock: how many registers we have per block
- int warpSize: size of our warps
- size_t memPitch: maximum pitch allowed for array allocation
- int maxThreadsPerBlock: maximum number of threads per block
- int maxThreadsDim[3]: maximum sizes of a block

Page 22: CS179: GPU Programming

Device Properties

- int maxGridSize[3]: maximum grid sizes
- size_t totalConstMem: maximum available constant memory
- int major, int minor: major and minor versions of CUDA support
- int clockRate: clock rate of the device in kHz
- size_t textureAlignment: memory alignment required for textures
- int deviceOverlap: does this device allow memory copies while a kernel is running? (0 = no, 1 = yes)
- int multiProcessorCount: # of multiprocessors on the device

Page 23: CS179: GPU Programming

Device Properties

Uses?
- Get the actual values for memory limits, instead of guessing
- Write programs that work across multiple systems
- Pick the best device

Page 24: CS179: GPU Programming

Device Properties

Getting the best device:
- Pick a metric (e.g., most multiprocessors could be good)

int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
    int max_mp = 0, best_device = 0;
    for (device = 0; device < num_devices; device++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        int mp_count = prop.multiProcessorCount;
        if (mp_count > max_mp) {
            max_mp = mp_count;
            best_device = device;
        }
    }
    cudaSetDevice(best_device);
}

Page 25: CS179: GPU Programming

Device Properties

- We can also use this to run on multiple GPUs
- Each GPU must have its own host thread: multithread on the CPU, with each thread driving a different device
- Set the device for a thread using cudaSetDevice(n); a sketch follows
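A minimal sketch of the pattern, assuming C++11 std::thread for the host threads (worker and launchAll are hypothetical names):

#include <thread>
#include <vector>

void worker(int device) {
    cudaSetDevice(device);   // all CUDA calls made by this thread now target this device
    // ... allocate, copy, and launch kernels on this GPU ...
}

void launchAll() {
    int n;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> threads;
    for (int d = 0; d < n; d++)
        threads.emplace_back(worker, d);  // one host thread per GPU
    for (auto &t : threads)
        t.join();
}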

Page 26: CS179: GPU Programming

CUDA Runtime

Synchronization note:
- Most calls to the GPU/CUDA are asynchronous
- Some are synchronous (usually things dealing with memory)
- Can force synchronization: cudaThreadSynchronize()
  - Blocks until the device has finished all preceding work
  - Good for error checking, timing, etc. (see the sketch below)
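For example, a sketch of error checking at a known point (dummy is a hypothetical kernel):

#include <cstdio>

__global__ void dummy() {}

void checkedLaunch() {
    dummy<<<1, 32>>>();                     // asynchronous launch
    cudaThreadSynchronize();                // block until the device is done
    cudaError_t err = cudaGetLastError();   // any launch error is now visible
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
}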

Page 27: CS179: GPU Programming

CUDA Events

Great for timing! Can place event markers in CUDA to measure time. Example code:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// DO SOME GPU CODE HERE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);

Page 28: CS179: GPU Programming

CUDA Streams

Streams manage concurrency and ordering:
- e.g., call malloc, then kernel 1, then kernel 2, etc.
- Calls in different streams are asynchronous!
- You don't know where each stream is in the code at any given time

Page 29: CS179: GPU Programming

Using Streams

- Create a stream: cudaStreamCreate(cudaStream_t *stream)
- Copy memory using async calls: cudaMemcpyAsync(…, cudaStream_t stream)
- Pass the stream to a kernel as another launch parameter: kernel<<<gridDim, blockDim, sMem, stream>>>
- Query whether a stream is done: cudaStreamQuery(cudaStream_t stream)
  - Returns cudaSuccess if the stream is done, cudaErrorNotReady otherwise
- Block the process until a stream is done: cudaStreamSynchronize(cudaStream_t stream)
- Destroy the stream & clean up: cudaStreamDestroy(cudaStream_t stream)

Page 30: CS179: GPU Programming

Using Streams

Example:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);

cudaThreadSynchronize();

Page 31: CS179: GPU Programming

Next Time

Lab 4 recitation:
- 3D textures
- Pixel Buffer Objects (PBOs)
- Fractals!