
Page 1: CS179: GPU Programming

CS179: GPU Programming
Lecture 8: More CUDA Runtime

Page 2: CS179: GPU Programming

Today:
- CUDA arrays for textures
- CUDA runtime
- Helpful CUDA functions

Page 3: CS179: GPU Programming

CUDA Arrays

Recall texture memory:
- Used to store large data
- Stored on the GPU
- Accessible to all blocks and threads

Page 4: CS179: GPU Programming

CUDA Arrays

- We used texture memory for buffers (lab 3), which allows vertex data to remain on the GPU
- How else can we access texture memory? CUDA arrays

Page 5: CS179: GPU Programming

CUDA Arrays

Why CUDA arrays over normal arrays?
- Better caching, including 2D caching for spatial locality
- Supports wrapping/clamping
- Supports filtering

Page 6: CS179: GPU Programming

CUDA Linear Textures

"Textures," but in global memory. Usage (see the sketch after this list):
- Step 1: Create a texture reference: texture<TYPE> tex
  - TYPE = float, float3, int, etc.
- Step 2: Bind memory to the texture reference: cudaBindTexture(offset, tex, devPtr, size);
- Step 3: Read data on the device via tex1Dfetch(tex, x)
  - x is the index of the element we want to read
- Step 4: Clean up when finished: cudaUnbindTexture(tex)
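Putting the four steps together, a minimal sketch (devPtr, devOut, and N are hypothetical names, not from the slides):

texture<float, 1, cudaReadModeElementType> tex;   // Step 1: file-scope texture reference

__global__ void readKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex, i);              // Step 3: fetch element i
}

void runExample(float *devPtr, float *devOut, int N) {
    cudaBindTexture(NULL, tex, devPtr, N * sizeof(float));  // Step 2: bind device memory
    readKernel<<<(N + 255) / 256, 256>>>(devOut, N);
    cudaUnbindTexture(tex);                       // Step 4: clean up
}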

Page 7: CS179: GPU Programming

CUDA Linear Textures

Texture reference properties: texture<type, dim, mode> texRef
- type = float, int, float3, etc.
- dim = # of dimensions (1, 2, or 3)
- mode:
  - cudaReadModeElementType: standard read
  - cudaReadModeNormalizedFloat: maps integers to floats, e.g. 0 -> 0.0 and 255 -> 1.0
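For example, two illustrative declarations (the names are placeholders):

texture<float, 1, cudaReadModeElementType> texA;      // 1D float texture, raw element reads
texture<uchar4, 2, cudaReadModeNormalizedFloat> texB; // 2D uchar4 texture; reads return floats in [0.0, 1.0]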

Page 8: CS179: GPU Programming

CUDA Linear Textures

Important warning:
- Textures live in a global space of memory, so threads can read and write the texture at the same time
- This can cause synchronization problems! Do not rely on thread execution order, ever.

Page 9: CS179: GPU Programming

CUDA Linear Textures

Other limitations:
- Only 1D, which can make indexing and caching a bit less convenient
- Pitch may not be ideal for a 2D array
- Not read-write

Solution: CUDA arrays

Page 10: CS179: GPU Programming

CUDA Arrays

- Live in texture memory space
- Accessed via texture fetches

Page 11: CS179: GPU Programming

CUDA Arrays

Step 1: Create a channel description
- Tells us texture attributes
- cudaCreateChannelDesc(int x, int y, int z, int w, enum mode)
  - x, y, z, w are the number of bits per component
  - mode is cudaChannelFormatKindFloat, etc.
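For example (a sketch; descF and desc4 are hypothetical names):

cudaChannelFormatDesc descF = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);   // one 32-bit float component
cudaChannelFormatDesc desc4 = cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned); // four 8-bit unsigned components (uchar4)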

Page 12: CS179: GPU Programming

CUDA Arrays

Step 2: Allocate memory
- Must be done dynamically
- Use cudaMallocArray(cudaArray **array, const struct cudaChannelFormatDesc *desc, size_t width, size_t height)
- Most global memory functions work with CUDA arrays too: cudaMemcpyToArray, etc.

Page 13: CS179: GPU Programming

CUDA Arrays

Step 3: Create a texture reference
- texture<TYPE, dim, mode> texRef, just as before
- Parameters must match the channel description where applicable

Step 4: Edit texture settings
- Settings are encoded as texRef struct members

Page 14: CS179: GPU Programming

CUDA Arrays

Step 5: Bind the texture reference to the array
- cudaBindTextureToArray(texRef, array)

Step 6: Access the texture
- Similar to before, but now we have more options: tex1D(texRef, x), tex2D(texRef, x, y)
- An end-to-end sketch follows below
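A minimal end-to-end sketch of Steps 1-6 (h_data, d_out, W, and H are hypothetical names):

texture<float, 2, cudaReadModeElementType> texRef;             // Step 3: texture reference

__global__ void copyKernel(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(texRef, x, y);                  // Step 6: texture fetch
}

void runArrayExample(const float *h_data, float *d_out, int W, int H) {
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);  // Step 1
    cudaArray *array;
    cudaMallocArray(&array, &desc, W, H);                      // Step 2: allocate the array
    cudaMemcpyToArray(array, 0, 0, h_data,                     // global-memory-style copy
                      W * H * sizeof(float), cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texRef, array);                     // Step 5: bind (Step 4 settings left at defaults)
    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    copyKernel<<<grid, block>>>(d_out, W, H);
    cudaUnbindTexture(texRef);
    cudaFreeArray(array);
}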

Page 15: CS179: GPU Programming

CUDA Arrays

Final notes:
- Coordinates can be normalized to [0, 1] if in float mode
- Filter modes: nearest point or linear; tells CUDA how to blend the texture
- Wrap vs. clamp:
  - Wrap: out-of-bounds accesses wrap around to the other side, e.g. (1.5, 0.5) -> (0.5, 0.5)
  - Clamp: out-of-bounds accesses are set to the border value, e.g. (1.5, 0.5) -> (1.0, 0.5)
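As a sketch, these settings (Step 4) map onto struct members of the texture reference, e.g. the texRef declared above:

texRef.normalized = true;                     // coordinates given in [0, 1] instead of [0, W)
texRef.filterMode = cudaFilterModeLinear;     // linear blending (vs. cudaFilterModePoint)
texRef.addressMode[0] = cudaAddressModeWrap;  // wrap out-of-bounds x (requires normalized coordinates)
texRef.addressMode[1] = cudaAddressModeClamp; // clamp out-of-bounds y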

Page 16: CS179: GPU Programming

CUDA Arrays

[Figure: point sampling vs. linear sampling]

Page 17: CS179: GPU Programming

CUDA Arrays

[Figure: wrap vs. clamp addressing]

Page 18: CS179: GPU Programming

CUDA Runtime

- Nothing new: every cuda____ function is part of the runtime
- Lots of other helpful functions
- Many runtime functions are aimed at making your program robust: check properties of the card, set up multiple GPUs, etc.
- Necessary for multi-platform development!

Page 19: CS179: GPU Programming

CUDA Runtime

Starting the runtime:
- Simply call any cuda_____ function!

CUDA can waste a lot of resources:
- Stop CUDA with cudaThreadExit()
- Called automatically on CPU exit, but you may want to call it earlier

Page 20: CS179: GPU Programming

CUDA Runtime

Getting devices and properties:
- cudaGetDeviceCount(int *n): returns the # of CUDA-capable devices
  - Can use this to check whether the machine is CUDA-capable!
- cudaSetDevice(int n): makes device n the currently used device
- cudaGetDeviceProperties(struct cudaDeviceProp *prop, int n): loads data from device n into prop (see the sketch below)
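A hedged sketch putting these together, checking for a CUDA-capable device and printing a few properties:

#include <cstdio>

void reportDevice() {
    int n;
    cudaGetDeviceCount(&n);
    if (n == 0) {
        printf("No CUDA-capable device found\n");
        return;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("%s: %zu bytes global memory, %d multiprocessors\n",
           prop.name, prop.totalGlobalMem, prop.multiProcessorCount);
}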

Page 21: CS179: GPU Programming

Device Properties

- char name[256]: ASCII identifier of the GPU
- size_t totalGlobalMem: total global memory available
- size_t sharedMemPerBlock: shared memory available per block
- int regsPerBlock: how many registers we have per block
- int warpSize: size of our warps
- size_t memPitch: maximum pitch allowed for array allocation
- int maxThreadsPerBlock: maximum number of threads per block
- int maxThreadsDim[3]: maximum sizes of a block

Page 22: CS179: GPU Programming

Device Properties

- int maxGridSize[3]: maximum grid sizes
- size_t totalConstMem: maximum available constant memory
- int major, int minor: major and minor versions of CUDA support
- int clockRate: clock rate of the device in kHz
- size_t textureAlignment: memory alignment required for textures
- int deviceOverlap: does this device allow memory copies while a kernel is running? (0 = no, 1 = yes)
- int multiProcessorCount: # of multiprocessors on the device

Page 23: CS179: GPU Programming

Device Properties

Uses?
- Get the actual values for memory limits, instead of guessing
- Write programs that work across multiple systems
- Pick the best device

Page 24: CS179: GPU Programming

Device Properties

Getting the best device:
- Pick a metric (e.g., most multiprocessors could be good)

int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
    int max_mp = 0, best_device = 0;
    for (device = 0; device < num_devices; device++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        int mp_count = prop.multiProcessorCount;
        if (mp_count > max_mp) {
            max_mp = mp_count;
            best_device = device;
        }
    }
    cudaSetDevice(best_device);
}

Page 25: CS179: GPU Programming

Device Properties

- We can also use this to run on multiple GPUs
- Each GPU must have its own host thread: multithread on the CPU, with each thread driving a different device
- Set the device for a thread using cudaSetDevice(n); a sketch follows
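A minimal sketch of the pattern, assuming C++11 std::thread for the host threads (worker and launchAll are hypothetical names):

#include <thread>
#include <vector>

void worker(int device) {
    cudaSetDevice(device);   // all CUDA calls made by this thread now target this device
    // ... allocate, copy, and launch kernels on this GPU ...
}

void launchAll() {
    int n;
    cudaGetDeviceCount(&n);
    std::vector<std::thread> threads;
    for (int d = 0; d < n; d++)
        threads.emplace_back(worker, d);  // one host thread per GPU
    for (auto &t : threads)
        t.join();
}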

Page 26: CS179: GPU Programming

CUDA Runtime

Synchronization note:
- Most calls to the GPU/CUDA are asynchronous
- Some are synchronous (usually things dealing with memory)
- Can force synchronization: cudaThreadSynchronize()
  - Blocks until the device has finished all preceding work
  - Good for error checking, timing, etc. (see the sketch below)
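For example, a sketch of error checking at a known point (dummy is a hypothetical kernel):

#include <cstdio>

__global__ void dummy() {}

void checkedLaunch() {
    dummy<<<1, 32>>>();                     // asynchronous launch
    cudaThreadSynchronize();                // block until the device is done
    cudaError_t err = cudaGetLastError();   // any launch error is now visible
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
}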

Page 27: CS179: GPU Programming

CUDA Events

Great for timing! Can place event markers in CUDA to measure time. Example code:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// DO SOME GPU CODE HERE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);

Page 28: CS179: GPU Programming

CUDA Streams

Streams manage concurrency and ordering:
- e.g., call malloc, then kernel 1, then kernel 2, etc.
- Calls in different streams are asynchronous!
- You don't know where each stream is in the code at any given time

Page 29: CS179: GPU Programming

Using Streams

- Create a stream: cudaStreamCreate(cudaStream_t *stream)
- Copy memory using async calls: cudaMemcpyAsync(…, cudaStream_t stream)
- Pass the stream to a kernel as another launch parameter: kernel<<<gridDim, blockDim, sMem, stream>>>
- Query whether a stream is done: cudaStreamQuery(cudaStream_t stream)
  - Returns cudaSuccess if the stream is done, cudaErrorNotReady otherwise
- Block the process until a stream is done: cudaStreamSynchronize(cudaStream_t stream)
- Destroy the stream & clean up: cudaStreamDestroy(cudaStream_t stream)

Page 30: CS179: GPU Programming

Using Streams

Example:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);

cudaThreadSynchronize();

Page 31: CS179: GPU Programming

Next Time

Lab 4 recitation:
- 3D textures
- Pixel Buffer Objects (PBOs)
- Fractals!