CUDA > J. Rühmkorf > July 22nd 2009
Slide 1
NVIDIA CUDA: The Compute Unified Device Architecture
Jens Rühmkorf, Tech Talk, DLR Köln-Porz, July 22nd 2009
Slide 2
References
University of Illinois at Urbana-Champaign, Wen-Mei Hwu & David Kirk, course ECE 498 AL, Spring 2009: http://courses.ece.illinois.edu/ece498/al/
Website about General-Purpose Computation on Graphics Hardware: http://gpgpu.org/developer/cuda
ACM Queue, Vol. 6 No. 2, March/April 2008 (Issue on GPGPU): http://mags.acm.org/queue/20080304/
Dr. Dobb's: CUDA, Supercomputing for the Masses, Part 1-13: http://www.ddj.com/architect/207200659
NVIDIA CUDA Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online: http://developer.nvidia.com/object/gpu-gems-3.html
Slide 3
Multi- and Manycore Architectures: A Difficult Road Lies Ahead
Don Knuth on Multicore Architectures
“[…] my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop”
In: InformIT, April 25th 2008, http://www.informit.com/articles/article.aspx?p=1193856
Slide 4
Overview
A high level view on CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
Slide 5
Multicore and Manycore (1): Structural Differences
Multicore: yoke of oxen
  Each core optimized for executing a single thread
Manycore: flock of chickens
  Cores optimized for aggregate throughput, deemphasizing individual performance
Slide 6
Multicore and Manycore (2): Technical Characteristics
Specifications          Core i7 960                        GTX285
Processing Elements     4 cores, 4-way SIMD @ 3.2 GHz      30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max)  4 cores, 2 threads, 4-width SIMD:  30 cores, 32 SIMD vectors, 32-width SIMD:
                        32 strands                         30720 strands
SP GFLOP/s              102                                1080
Memory Bandwidth        25.6 GB/s                          159 GB/s
Register File           -                                  1.875 MB
Local Store             -                                  480 kB
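The strand counts and peak GFLOP/s figures in the table can be reproduced with simple arithmetic. The sketch below is a plain C model, not part of the original slides. The function names are hypothetical, and the flops-per-lane values used in the usage note (2 for the Core i7, 3 for the GTX285) are assumptions chosen because they reproduce the table's numbers.

```c
/* Resident strands = cores x threads (or SIMD vectors) per core x SIMD width. */
static long resident_strands(long cores, long threads_per_core, long simd_width) {
    return cores * threads_per_core * simd_width;
}

/* Peak single-precision GFLOP/s =
   cores x SIMD lanes x flops per lane per cycle x clock in GHz.
   flops_per_lane_per_cycle is an assumption of this sketch,
   not a value taken from the table. */
static double peak_sp_gflops(int cores, int simd_lanes,
                             int flops_per_lane_per_cycle, double ghz) {
    return cores * simd_lanes * flops_per_lane_per_cycle * ghz;
}
```

With these assumptions, resident_strands(4, 2, 4) gives the 32 strands of the Core i7 and resident_strands(30, 32, 32) the 30720 strands of the GTX285; peak_sp_gflops(4, 4, 2, 3.2) is about 102 and peak_sp_gflops(30, 8, 3, 1.5) is 1080.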
Slide 7
Multicore and Manycore (3): Performance Comparison: CPU vs. GPU
[Figure: CPU vs. GPU; y-axis: floating point operations per second, single precision]
Slide 8
An Example of the Physical Reality Behind CUDA
[Figure: CPU (host) and GPU with local DRAM (device)]
Slide 9
CUDA Processing Flow
Slide 10
CUDA in a Nutshell: Key Characteristics
CUDA is designed for wide SIMD/SPMD parallelism & scalability
CUDA provides 3 key abstractions, i.e. a hierarchy:
  of thread groups,
  of shared memories, and
  of barrier synchronization
CUDA programs are written in C + extensions
OpenCL is inspired by CUDA, but HW & SW vendor neutral
  Programming model essentially identical
Slide 11
Hello World (hello-world.cu)
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
int main() {
    // d_A, d_B, d_C: device arrays of N floats (allocation not shown);
    // N is assumed to be a multiple of 256
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N / 256, 256>>>(d_A, d_B, d_C);
}
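The launch above covers all elements only if N is a multiple of 256. A common way to handle other sizes is to round the block count up and guard the tail inside the kernel. The sketch below models that index arithmetic in plain C on the host: the two loops stand in for blockIdx.x and threadIdx.x. The function names are hypothetical and nothing here calls the CUDA API.

```c
/* Number of blocks needed to cover n elements (ceiling division). */
static int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Host-side emulation of a guarded vecAdd launch: the outer loop plays the
   role of blockIdx.x, the inner loop the role of threadIdx.x. */
static void vec_add_emulated(const float *a, const float *b, float *c,
                             int n, int threads_per_block) {
    int num_blocks = blocks_for(n, threads_per_block);
    for (int block = 0; block < num_blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int i = thread + threads_per_block * block; /* global index */
            if (i < n) {                                /* guard the tail */
                c[i] = a[i] + b[i];
            }
        }
    }
}
```

For example, blocks_for(1000, 256) yields 4 blocks, that is 1024 threads, of which the last 24 fail the guard and do nothing.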
Slide 12
Overview
A high level view on CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
Slide 13
CUDA Programming Model: Structure of a CUDA application
Integrated host + device application C program
  Serial or modestly parallel parts in host C code
  Highly parallel parts in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device)
  KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device)
  KernelB<<< nBlk, nTid >>>(args);
Slide 14
CUDA Programming Model: CUDA Devices and Threads
A compute device
  Is a coprocessor to the CPU or host
  Has its own DRAM (device memory)
  Runs many threads in parallel
  Is typically a GPU but can also be another type of parallel processing device
Express data-parallel portions as device kernels (which run on many threads)
Slide 15
CUDA Programming Model: Arrays of Parallel Threads
Execute a kernel by specifying arrays of threads
  All threads run the same code (SPMD)
  Use thread ID to compute memory addresses & make control decisions
[Figure: threads with threadID 0 to 7, each executing:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
Slide 16
CUDA Programming Model: Use Thread Blocks for (Scalable) Cooperation
Divide monolithic thread array into multiple blocks
  Threads within a block can cooperate via shared memory, atomic operations, and barrier synchronization
  Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N - 1, each with threadID 0 to 7; every thread executes:]
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
Slide 17
CUDA Programming Model: Organisation of Thread Blocks
Thread blocks can be one-, two- or three-dimensional arrays
The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device
Each kernel is executed as a batch of threads
This batch is organized as a grid of thread blocks
[Figure: 2-dimensional thread blocks]
Slide 18
CUDA Programming Model: Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
  Block ID: 1D, 2D, or 3D
  Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
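To illustrate how two-dimensional IDs simplify addressing, the following plain C sketch combines a block ID, the block dimensions and a thread ID into the global column and row of one thread, and then into a row-major offset. The idx2 struct is a hypothetical stand-in for CUDA's built-in index variables (blockIdx, blockDim, threadIdx); the code runs on the host only.

```c
/* Minimal stand-in for CUDA's built-in 2D index values (hypothetical name). */
typedef struct { int x, y; } idx2;

/* Global column (x) and row (y) of a thread in a 2D grid of 2D blocks. */
static idx2 global_coords(idx2 block_idx, idx2 block_dim, idx2 thread_idx) {
    idx2 g;
    g.x = block_idx.x * block_dim.x + thread_idx.x; /* column */
    g.y = block_idx.y * block_dim.y + thread_idx.y; /* row */
    return g;
}

/* Row-major offset of that element in an array that is `width` elements wide. */
static int linear_offset(idx2 g, int width) {
    return g.y * width + g.x;
}
```

For instance, thread (3, 5) of block (1, 2) with 16 x 16 threads per block works on column 19, row 37; in a 64-element-wide array this is offset 37 * 64 + 19 = 2387.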
Slide 19
Overview
A high level view on CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
Slide 20
CUDA Memory Model: Overview
Global memory
  Main means of communicating R/W data between host and device
  Contents visible to all threads
  Long latency access
We will focus on global memory for now
  Constant and texture memory will not be covered here
[Figure: Host; Grid with Block (0, 0) and Block (1, 0), each with Shared Memory and Threads (0, 0) and (1, 0) with their Registers; Global Memory, Constant Memory, Texture Memory]
Slide 21
CUDA Memory Model: CUDA Device Memory Allocation
cudaMalloc()
  Allocates object in the device global memory
  Requires two parameters
    Address of a pointer to the allocated object
    Size of allocated object
cudaFree()
  Frees object from the device global memory
    Pointer to freed object
[Figure: Host; Grid with Block (0, 0) and Block (1, 0), each with Shared Memory and Threads (0, 0) and (1, 0) with their Registers; Global Memory]
Slide 22
CUDA Memory Model: CUDA Host-Device Data Transfer
cudaMemcpy()
  Memory data transfer
  Requires four parameters
    Pointer to destination
    Pointer to source
    Number of bytes to copy
    Type of transfer
  Type of transfer is one of:
    Host to Host
    Host to Device
    Device to Host
    Device to Device
Asynchronous transfer is available via cudaMemcpyAsync()
[Figure: Host; Grid with Block (0, 0) and Block (1, 0), each with Shared Memory and Threads (0, 0) and (1, 0) with their Registers; Global Memory]
Slide 23
Overview
A high level view on CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
Slide 24
CUDA API: Extended C
[Figure: compilation flow. Integrated source (foo.cu) -> cudacc (EDG C/C++ frontend, Open64 Global Optimizer) -> GPU Assembly (foo.s) -> OCG -> G80 SASS (foo.sass); CPU Host Code (foo.cpp) -> gcc / cl]
Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
Slide 25
CUDA API: C for CUDA
Function type specifiers
  __global__, __device__, __host__
Variable type specifiers
  __device__, __shared__, __constant__
Keywords
  threadIdx, blockIdx
Intrinsics / builtin functions
  __syncthreads()
Runtime API
  Memory, symbol, execution management
Function launch
Example (image-convolution.cu):
__device__ float filter[N];
__global__ void convolve(float* image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}
// Allocate GPU memory
float* myimage;
cudaMalloc((void**)&myimage, bytes);
// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Slide 26
CUDA API: CUDA Function Type Qualifiers (1)
                                 executed on:   only callable from:
__device__ float deviceFunc()    device         device
__global__ void kernelFunc()     device         host
__host__ float hostFunc()        host           host
__global__:
  defines a kernel function
  must return void
__device__ and __host__ can be used together
Slide 27
CUDA API: CUDA Function Type Qualifiers (2)
__device__ functions cannot have their address taken
For functions executed on the device:
  no recursion
  no static variable declarations inside the function
  no variable number of arguments
Slide 28
CUDA API: CUDA Variable Type Qualifiers
__device__
  Resides in global memory space,
  Has the lifetime of an application,
  Is accessible from all the threads within the grid and from the host through the runtime library.
__shared__ (optionally used together with __device__)
  Resides in the shared memory space of a thread block,
  Has the lifetime of the block,
  Is only accessible from the threads within the block.
Not covered here:
__constant__ (optionally used together with __device__)
  Resides in constant memory space,
  Has the lifetime of an application,
  Is accessible from all the threads within the grid and from the host through the runtime library.
Slide 29
CUDA API: Calling a Kernel Function – Execution Configuration
A kernel function ( == __global__ function) must be called with an execution configuration:
__global__ void kernelFunc(...) {...};
dim3 dimGrid(100, 50);       // 5000 thread blocks
dim3 dimBlock(4, 8, 8);      // 256 threads per block
size_t sharedMemBytes = 64;  // 64 bytes of shared memory
kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);
Any call to a kernel function is asynchronous from CUDA 1.0 on
  Explicit synchronization needed for blocking
Slide 30
Overview
A high level view on CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
Slide 31
A Simple CUDA Example: Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  Leave shared memory usage until later
  Local, register usage
  Thread ID usage
  Memory data transfer API between host and device
Assume square matrix for simplicity
Slide 32
Simple CUDA Example: Square Matrix Multiplication
P = M * N of size WIDTH x WIDTH
Here: without tiling!
  One thread calculates one element of P
  M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each of size WIDTH x WIDTH]
Slide 33
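Memory Layout of a Matrix in C
[Figure: 4 x 4 matrix M, rows from top to bottom:]
M0,0 M1,0 M2,0 M3,0
M0,1 M1,1 M2,1 M3,1
M0,2 M1,2 M2,2 M3,2
M0,3 M1,3 M2,3 M3,3
[and its linearized layout in memory, row after row:]
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3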
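The row-by-row layout shown on this slide corresponds to a simple offset formula. As a minimal sketch in plain C (the function names and the fixed WIDTH of 4 are assumptions matching the 4 x 4 figure), with the first index of M taken as the column and the second as the row:

```c
enum { WIDTH = 4 };

/* Offset of element M(col, row) in the linearized, row-major array. */
static int offset_of(int col, int row) {
    return row * WIDTH + col;
}

/* Inverse mapping: recover (col, row) from a linear offset. */
static void coords_of(int offset, int *col, int *row) {
    *col = offset % WIDTH;
    *row = offset / WIDTH;
}
```

Under this convention M3,0 sits at offset 3 and M0,1 directly after it at offset 4, which is why walking along a row touches consecutive addresses while walking down a column jumps by WIDTH elements.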
Slide 34
Step 1: Matrix Multiplication, A Simple Host Version in C
// Matrix multiplication on the (CPU) host;
// inputs are single precision, sums are accumulated in double precision
void matrixMulOnHost(float* M, float* N, float* P, int width) {
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            double sum = 0;
            for (int k = 0; k < width; ++k) {
                double a = M[i * width + k];
                double b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
[Figure: M, N, and P of size WIDTH x WIDTH; i selects the row of M and P, j selects the column of N and P, k runs along the row of M and down the column of N]
Slide 35
Step 2: Input Matrix Data Transfer (Host-sided Code)
void matrixMulOnDevice(float* M, float* N, float* P, int width) {
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    // 1. Allocate and load M, N to device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    // Allocate P on the device
    cudaMalloc(&Pd, size);
Slide 36
Step 3: Output Matrix Data Transfer (Host-sided Code)
    // 2. Kernel invocation code (to be shown later)
    …
    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    // Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
Slide 37
Step 4: Kernel Function (1)
// Matrix multiplication kernel: per-thread code
__global__ void matrixMulKernel(float* Md, float* Nd, float* Pd, int width) {
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    // see next page…
Slide 38
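Step 4: Kernel Function (2)
    for (int k = 0; k < width; ++k) {
        float Melement = Md[k + threadIdx.y * width];
        float Nelement = Nd[threadIdx.x + k * width];
        Pvalue += Melement * Nelement;
    }
    // Each thread writes its one element of the result to device memory
    int i = threadIdx.x + threadIdx.y * width;
    Pd[i] = Pvalue;
}
[Figure: Md, Nd, and Pd of size WIDTH x WIDTH; thread (tx, ty) combines row ty of Md with column tx of Nd, both indexed by k, into element (tx, ty) of Pd]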
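The kernel's indexing can be checked without a GPU by running the per-thread body once for every (tx, ty) pair on the host. The following plain C sketch does that; the function names are hypothetical, and tx and ty are passed as arguments instead of being read from threadIdx.

```c
/* Per-thread body of the matrix multiplication kernel, with the thread
   coordinates (tx, ty) passed in explicitly. */
static void matrix_mul_thread(const float *Md, const float *Nd, float *Pd,
                              int width, int tx, int ty) {
    float Pvalue = 0;
    for (int k = 0; k < width; ++k) {
        /* row ty of Md times column tx of Nd */
        Pvalue += Md[ty * width + k] * Nd[k * width + tx];
    }
    Pd[ty * width + tx] = Pvalue;
}

/* Emulates a launch with a single block of width x width threads by
   visiting every thread coordinate in turn. */
static void matrix_mul_emulated(const float *Md, const float *Nd, float *Pd,
                                int width) {
    for (int ty = 0; ty < width; ++ty)
        for (int tx = 0; tx < width; ++tx)
            matrix_mul_thread(Md, Nd, Pd, width, tx, ty);
}
```

Because every thread writes a different element of Pd and only reads Md and Nd, the order in which the emulation visits the threads does not affect the result.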
Slide 39
Step 5: Kernel Invocation (Host-sided Code)
// Insert into step 2. from before
// Setup the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(width, width);
// Launch the device computation threads!
matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
Slide 40
Example Far from Ideal: Only One Thread Block Used
One block of threads computes matrix Pd
  Each thread computes one element of Pd
Each thread
  Loads a row of matrix Md
  Loads a column of matrix Nd
  Performs one multiply and one addition for each pair of Md and Nd elements
  Compute to off-chip memory access ratio close to 1:1 (not very high)
Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1, Block 1; Thread (2, 2) combines the Md row (3, 2, 5, 4) with the Nd column (2, 4, 2, 6) into the Pd element 48]
Slide 41
CUDA: A Bright Future?
But let your word be: Yes, yes; no, no. Anything beyond that comes from evil.
Matthew 5:37
Slide 42
NVIDIA CUDA: Appendix Best Practices & Things to Watch Out For
Slide 43
Appendix: Best Practices & Things to Watch Out For
Obtain relevant hardware data
Compiling a CUDA program
Linking
Debugging
C for CUDA vs. CUDA Driver API
Watch out: floating point computations
Unsupported C language elements
Branching of code
Coalesced access to device global memory
Access patterns to avoid bank conflicts
Slide 44
Obtain Relevant Hardware Data
Make sure to obtain relevant additional hardware data
  Call cudaGetDeviceProperties()
Slide 45
Compiling a CUDA Program (1): Parallel Thread eXecution (PTX)
Parallel Thread eXecution (PTX)
  Virtual machine and ISA (Instruction Set Architecture)
  Programming model
  Execution resources and state
[Figure: C/C++ CUDA Application -> NVCC -> CPU Code and PTX Code (virtual) -> PTX to Target Compiler -> target code for G80 … GPU (physical)]
Example source:
float4 me = gx[gtid];
me.x += me.y * me.z;
Corresponding PTX code:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
Slide 46
Compiling a CUDA Program (2): NVCC as a Compiler Driver
Any source file containing CUDA language extensions must be compiled with NVCC
NVCC is a compiler driver
  Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
NVCC outputs:
  C code (host CPU code)
    Must then be compiled with the rest of the application using another tool
  PTX
    Object code directly
    Or, PTX source, interpreted at runtime
Slide 47
Linking
Any executable with CUDA code requires two dynamic libraries:
  The CUDA runtime library (cudart)
  The CUDA core library (cuda)
Slide 48
Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (enabled via nvcc -deviceemu) runs completely on the host using the CUDA runtime
  No need of any device and CUDA driver
  Each device thread is emulated with a host thread
Running in device emulation mode, one can:
  Use host native debug support (breakpoints, inspection, etc.)
  Access any device-specific data from host code and vice versa
  Call any host function from device code (e.g. printf) and vice versa
  Detect deadlock situations caused by improper usage of __syncthreads()
Slide 49
Device Emulation Mode Pitfalls
Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results.
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode.
Slide 50
CUDA Driver API vs. C for CUDA (1): Extended C
[Figure: compilation flow. Integrated source (foo.cu) -> cudacc (EDG C/C++ frontend, Open64 Global Optimizer) -> GPU Assembly (foo.s) -> OCG -> G80 SASS (foo.sass); CPU Host Code (foo.cpp) -> gcc / cl]
Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
Slide 51
CUDA Driver API vs. C for CUDA (2): Mutually Exclusive, Choose One or the Other
The C runtime for CUDA handles kernel loading and sets up kernels before they are launched. The implicit code initialization, CUDA context management, CUDA module management (cubin and function mapping), kernel configuration, and parameter passing are all performed by the C runtime for CUDA.
It comprises two principal parts:
  The low-level functions (cuda_runtime_api.h) have a C-style interface that does not require compilation with nvcc.
  The high-level functions (cuda_runtime.h) have a C++-style interface built on top of the low-level functions.
Of these, the high-level functions are the most commonly used. They wrap some of the low-level functions, using overloading, references, and default arguments. These wrappers can be used from C++ code and can be compiled with any C++ compiler.
The driver API is a lower-level API than the runtime API. When compared with the runtime API, the driver API has these advantages:
  No dependency on the runtime library
  More control over devices (for example, only the driver API enables one CPU thread to control multiple GPUs)
  No C extensions in the host code, so compilers other than the default CPU compiler can be used
Its primary disadvantages are:
  Verbose code
  Greater difficulty in debugging
  No device emulation
A key point is that for every runtime API function, there is an equivalent driver API function. The driver API, however, includes other functions missing in the runtime API, such as those for migrating a context from one host thread to another.
Slide 52
CUDA Driver API vs. C for CUDA (3), Example: Vector Addition Using C for CUDA
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;
// create CUDA device & context
cudaSetDevice(0); // pick first device
// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];
// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);
// allocate device memory
float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
cudaMalloc((void**)&pDeviceMemA, cnDimension * sizeof(float));
cudaMalloc((void**)&pDeviceMemB, cnDimension * sizeof(float));
cudaMalloc((void**)&pDeviceMemC, cnDimension * sizeof(float));
// copy host vectors to device
cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);
// copy result from device to host
cudaMemcpy((void*)pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);
delete[] pA;
delete[] pB;
delete[] pC;
cudaFree(pDeviceMemA);
cudaFree(pDeviceMemB);
cudaFree(pDeviceMemC);
Slide 53
CUDA Driver API vs. C for CUDA (4), Example: Vector Addition Using CUDA Driver API
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;
CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;
// create CUDA device & context
cuInit(0);
cuDeviceGet(&hDevice, 0); // pick first device
cuCtxCreate(&hContext, 0, hDevice);
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];
// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);
// allocate memory on the device
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));
// copy host vectors to device
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));
// set up parameter values
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);
#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)
int offset = 0;
void* ptr;
ptr = (void*)(size_t)pDeviceMemA;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemB;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemC;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
cuParamSetSize(hFunction, offset);
// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);
// copy the result from device back to host
cuMemcpyDtoH((void*)pC, pDeviceMemC, cnDimension * sizeof(float));
delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);
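The ALIGN_UP macro in the driver API example rounds the running parameter offset up to the next multiple of the parameter's alignment before each parameter is placed. The plain C sketch below shows the same arithmetic as functions. The names align_up and place_param are hypothetical, and the alignment is assumed to be a power of two, as the bit mask requires.

```c
#include <stddef.h>

/* Rounds offset up to the next multiple of alignment.
   Assumes alignment is a power of two. */
static size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Places one parameter of the given size and alignment in a packed
   parameter layout: returns the offset at which it starts and advances
   *offset to the first byte after it. */
static size_t place_param(size_t *offset, size_t size, size_t alignment) {
    size_t at = align_up(*offset, alignment);
    *offset = at + size;
    return at;
}
```

For example, after a 4-byte parameter at offset 0, an 8-byte pointer with 8-byte alignment is placed at offset 8 rather than 4, and the total parameter size becomes 16 bytes.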
Slide 54
Watch Out: Floating Point Computations, Differing Results of FP Computations
Results of floating-point computations will slightly differ because of:
  Different compiler outputs, instruction sets
  Use of extended precision for intermediate results
There are various options to force strict single precision on the host
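The effect of intermediate precision can be seen on the host alone. The following plain C sketch (function names are hypothetical) sums the same single-precision value many times, once with a single-precision accumulator and once with a double-precision one. The exact values depend on the compiler and its floating-point settings, so no output is quoted here.

```c
/* Sums n copies of x using a single-precision accumulator. */
static float sum_single(float x, long n) {
    float sum = 0.0f;
    for (long i = 0; i < n; ++i) {
        sum += x;
    }
    return sum;
}

/* Sums the same values using a double-precision accumulator. */
static double sum_double(float x, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; ++i) {
        sum += x;
    }
    return sum;
}
```

Calling both with x = 0.1f and n = 10000000 gives results that differ from each other, which is the kind of deviation to expect when comparing a host reference implementation with device output.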
Slide 55
Watch Out: Floating Point Computations, Single and Double Precision Operations
Double precision
  No deviations from the IEEE 754 standard
Single precision
  Denormals and signalling NaNs are not supported;
  Only two IEEE rounding modes are supported (chop and round-to-nearest-even); and
  The precision of division/square root is slightly lower than single precision.
Slide 56
Limitations (1): Only a Subset of C Available
C for CUDA offers only a subset of the C language:
  Recursion-free
  Function-pointer-free
Functions reside in the global device memory, therefore we cannot obtain their addresses
Slide 57
Limitations (2): Branching in Program Code
For best performance
  Threads should be running in groups of 32 threads (32 threads = 1 warp)
  All threads of a warp should take the same execution path
  Otherwise, branching will probably hurt
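Whether a branch splits a warp depends only on how the condition varies across the 32 threads of that warp. The plain C sketch below counts, for a given branch condition, how many warps contain threads that disagree. All names are hypothetical, and the model only looks at which side of the branch each thread takes; it does not model timing.

```c
enum { WARP_SIZE = 32 };

/* Returns nonzero if the thread with this ID takes the "if" side of a branch. */
typedef int (*branch_predicate)(int thread_id);

/* Counts the warps in which threads disagree on the branch (divergent warps). */
static int count_divergent_warps(int num_threads, branch_predicate takes_branch) {
    int divergent = 0;
    for (int first = 0; first < num_threads; first += WARP_SIZE) {
        int taken = 0, total = 0;
        for (int t = first; t < first + WARP_SIZE && t < num_threads; ++t) {
            taken += takes_branch(t) ? 1 : 0;
            ++total;
        }
        if (taken != 0 && taken != total) {
            ++divergent;
        }
    }
    return divergent;
}

/* Branch on the parity of the thread ID: neighbouring threads disagree. */
static int by_thread_parity(int thread_id) {
    return thread_id % 2 == 0;
}

/* Branch on the parity of the warp ID: a whole warp takes the same path. */
static int by_warp_parity(int thread_id) {
    return (thread_id / WARP_SIZE) % 2 == 0;
}
```

For a block of 256 threads, branching on by_thread_parity makes all 8 warps divergent, while branching on by_warp_parity leaves none divergent even though half of the threads still take each path.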
Slide 58
Coalesced Access to Device Global Memory
High Priority: Ensure global memory accesses are coalesced whenever possible
Global memory loads and stores by threads of a half warp (16 threads) are coalesced by the device in as few as one transaction (or two transactions in the case of 128-bit words)
But: certain access requirements have to be met
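As a rough illustration of such access requirements, the plain C sketch below checks one simplified rule for 32-bit words: thread k of the half warp reads the k-th word of a 64-byte-aligned segment. This is an assumption of the sketch, modelled on the strict rule documented for early devices (compute capability 1.0 and 1.1); later devices are more lenient, and the function names are hypothetical.

```c
#include <stdint.h>

enum { HALF_WARP = 16, WORD_BYTES = 4 };

/* Simplified check: returns 1 if thread k reads the k-th 4-byte word of a
   64-byte-aligned segment for every k, and 0 otherwise. */
static int is_coalesced(const uintptr_t addr[HALF_WARP]) {
    uintptr_t base = addr[0];
    if (base % (HALF_WARP * WORD_BYTES) != 0) {
        return 0; /* segment does not start on a 64-byte boundary */
    }
    for (int k = 0; k < HALF_WARP; ++k) {
        if (addr[k] != base + (uintptr_t)k * WORD_BYTES) {
            return 0; /* thread k is not reading word k */
        }
    }
    return 1;
}

/* Fills addr with the byte addresses read by 16 threads when thread k
   reads word k * stride_words, starting at byte address base. */
static void strided_addresses(uintptr_t base, int stride_words,
                              uintptr_t addr[HALF_WARP]) {
    for (int k = 0; k < HALF_WARP; ++k) {
        addr[k] = base + (uintptr_t)k * stride_words * WORD_BYTES;
    }
}
```

Under this model a unit-stride read starting at an aligned address qualifies, while the same read shifted by one word, or a read with a stride of two words, does not.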
Slide 59
Coalesced Access (Reading Floats)
Slide 60
Uncoalesced Access (Reading Floats)
Slide 61
Shared Memory – Bank Conflicts
Shared Memory
  16 KB
  Organized in 16 banks, 1 KB each
Shared Memory
  As fast as a register …
  … if no bank conflicts occur!
Bank conflict: more than one thread in the same half-warp accesses the same bank
  Access needs to be serialized
  Cost = max (# of concurrent accesses to a single bank)
Slide 62
Shared Memory – No Bank Conflicts
[Figure: access patterns without bank conflicts]
  Linear addressing, step size = 1 word
  Random permutation
  Linear addressing, step size = 3 words
  Broadcast
Slide 63
Shared Memory – Bank Conflicts
[Figure: access patterns with bank conflicts]
  Linear addressing, step size = 2 words
  Linear addressing, step size = 8 words
  No conflict or 5-way conflict
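The linear addressing cases from the two bank conflict slides can be reproduced with a small model. The plain C sketch below assumes 16 banks with consecutive 32-bit words in consecutive banks, and a half-warp in which thread k reads word k * stride. The function name is hypothetical, and broadcast (all threads reading the same word) is deliberately not modelled.

```c
enum { NUM_BANKS = 16, HALF_WARP = 16 };

/* Degree of bank conflict for a half-warp in which thread k reads the
   32-bit word k * stride_words: the largest number of threads that hit
   the same bank. A result of 1 means the access is conflict-free.
   Broadcast is not modelled. */
static int conflict_degree(int stride_words) {
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int k = 0; k < HALF_WARP; ++k) {
        int bank = (k * stride_words) % NUM_BANKS;
        hits[bank] += 1;
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

In this model step sizes of 1 and 3 words give a degree of 1 (no conflict), a step size of 2 words gives a 2-way conflict and a step size of 8 words an 8-way conflict. More generally, odd step sizes are conflict-free because they share no common factor with the 16 banks.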