CudaDMA: Emulating DMA engines on GPUs for Performance and Programmability
Brucek Khailany (NVIDIA Research) Michael Bauer (Stanford) Henry Cook (UC Berkeley)
CudaDMA overview
A library for efficient bulk transfers between global and
shared memory in CUDA kernels (not host<->device copies)
Motivation: Ease programmer burden for high performance
http://code.google.com/p/cudadma/
[Diagram: CPU with its DRAM connected over PCIe to the GPU; GPU DRAM and L2 feed multiple SMs, each with its own shared memory; cudaDMA handles the global<->shared transfers.]
Motivation: data shape != thread shape
Thread block size/shape often mismatches the shared data size/shape
— Complex kernel code (lots of ‘if’ statements, thread index math)
Goal: Decouple data shape from thread block dimensions
[Figure: a 6x7 input data tile loaded by a 4x5 thread block]
Example: 3D finite difference stencil
8th order in space, 1st order in time computation
— Thread per (x,y) location
— Step through Z-dimension
— Load 2D halos into shared for
each step in Z-dimension
Programmer challenges
— How to split halo transfers across threads?
— Memory B/W optimizations
Example copy code for 3D stencil (fragment; ty, tx, local_input1/2, g_curr, c_offset defined earlier in the kernel):

/////////////////////////////////////////
// update the data slice in smem
s_data[ty][tx]       = local_input1[radius];
s_data[ty][tx+BDIMX] = local_input2[radius];
if( threadIdx.y < radius ) // halo above/below
{
    s_data[threadIdx.y][tx]       = g_curr[c_offset - radius*dimx];
    s_data[threadIdx.y][tx+BDIMX] = g_curr[c_offset - radius*dimx + BDIMX];
}
if( threadIdx.y >= radius && threadIdx.y < 2*radius )
{
    s_data[threadIdx.y+BDIMY][tx]       = g_curr[c_offset + (BDIMY-radius)*dimx];
    s_data[threadIdx.y+BDIMY][tx+BDIMX] = g_curr[c_offset + (BDIMY-radius)*dimx + BDIMX];
}
if( threadIdx.x < radius ) // halo left/right
{
    s_data[ty][threadIdx.x]                = g_curr[c_offset - radius];
    s_data[ty][threadIdx.x+2*BDIMX+radius] = g_curr[c_offset + 2*BDIMX];
}
__syncthreads();
CudaDMA approach
CudaDMA library
— Block transfers explicitly declared in CUDA kernels
— Primarily used for “streaming” data through shared memory
— Common access patterns supported
— Implemented as C++ objects instantiated in CUDA kernels
Object member functions used to initiate “DMA transfers”
Advantages
— Simple, maintainable user code
— Access patterns independent of thread block dimensions
— Optimized library implementations for global memory bandwidth
Kernel pseudocode with CudaDMA
CudaDMA objects declared
at top of the kernel
— Fixed access pattern
Kernel loops over large
dataset
— Copy data to shared
— Barrier
— Process data in shared
— Barrier
— …
__global__
void cuda_dma_kernel(float *data)
{
    __shared__ float buffer[NUM_ELMTS];
    cudaDMAStrided<false,ALIGNMENT>
        dma_ld(EL_SZ,EL_CNT,EL_STRIDE);
    for (int i=0; i<NUM_ITERS; i++) {
        dma_ld.execute_dma(&data[A*i], buffer);
        __syncthreads();
        process_buffer(buffer);
        __syncthreads();
    }
}
Supported access patterns
CudaDMASequential
CudaDMAStrided
CudaDMAIndirect
— Scatter/Gather
CudaDMAHalo
— 2D halo regions
Specifying access patterns
Access pattern described with parameters
Up to 5 parameters for strided patterns
— BYTES_PER_ELMT - the size of each element in bytes
— NUM_ELMTS - the number of elements to be transferred
— ALIGNMENT - whether elements are 4-, 8-, or 16-byte aligned
— src_stride - the stride between source elements in bytes
— dst_stride - the stride in bytes between elements after they have been transferred
Similar parameters used for other patterns
# of threads independent of access pattern
Optimizations and tuning
Optimizations performed by CudaDMA implementations
— Pointer casting enables vector loads and stores
— Hoisting of pointer arithmetic into the constructor
— Memory coalescing and shared memory bank conflict avoidance
Considerations for memory bandwidth performance
— Use compile-time constant template parameters if possible
— Load at maximum alignment (highest performance at 16 bytes)
— Size, #threads: Highest performance with <=64 bytes per thread
64 threads: 4 KB transfers; 128 threads: 8 KB transfers; …
Predictable performance
Strided access pattern
— 2 KB transfers
— Tesla C2070 (ECC Off)
— 128 threads per thread block
participating in the DMA
— 2 thread blocks per SM
Similar results for other
access patterns
Element size (Bytes)   Total elements   GB/s
32                     64               73.4
64                     32               73.5
128                    16               73.6
256                    8                73.5
512                    4                73.5
1024                   2                83.7
2048                   1                84.5
Optimizations using warp specialization
CudaDMA supports splitting
thread blocks into
compute and DMA warps
Supports producer-consumer
synchronization functions
— On compute warps:
Non-blocking start_async_dma()
Blocking wait_for_dma_finish()
— On DMA warps: execute_dma()
includes synchronization
[Diagram: timeline of compute warps and DMA warps over iterations i and i+1; each iteration, the compute warps issue start_async_dma() at barrier 1 and block in wait_for_dma_finish() at barrier 2, while the DMA warps' execute_dma() spans both named barriers.]
Uses named barriers in PTX
— bar.arrive, bar.sync
Warp specialization buffering techniques
Usually one set of DMA warps per buffer
Single-Buffering
— 1 buffer, 1 warp group
Double-buffering
— 2 buffers, 2 warp groups
Manual double-buffering
— 2 buffers, 1 warp group
Experimental results
Up to 1.15x speedups over tuned 3D finite difference
— Using cudaDMA with warp
specialization on a Tesla C2050
Up to 3.2x speedup on SGEMV
— Large speedups at small matrix sizes
— Compared to Magma BLAS library
2.74x speedup on
finite-element CFD code
Status
Recent library changes
— [Nov, 2011] Library released on Google Code
Included optimized implementations for many sequential, strided, and halo
access patterns
cudaDMA objects with and without warp specialization supported
— [May, 2012] Enhancements to cudaDMAIndirect
In progress
— Kepler-optimized implementation
Exploring the use of loads through the texture cache
— Future productization discussions
Summary
CudaDMA
— A library for efficient bulk transfers between global and shared
memory in CUDA kernels
— Supports asynchronous “DMA” transfers using warp specialization
and inline PTX producer-consumer synchronization instructions
Detailed documentation and source code:
— http://code.google.com/p/cudadma/
— Bauer, Cook, and Khailany, “CudaDMA: Optimizing GPU Memory
Bandwidth via Warp Specialization”, SC’11
Feedback? bkhailany@nvidia.com