
Page 1: Introduction to CUDA heterogeneous programming

Introduction to CUDA heterogeneous programming

Katia [email protected]

Scientific Computing and Visualization

Boston University

Page 2: Introduction to CUDA heterogeneous programming

CUDA
• Architecture
• C Language extensions
• Terminology

CUDA Basics
• Hello, World!
• CUDA kernels
• Blocks and threads overview

GPU memory
• Memory management
• Parallel kernels
• Threads synchronization
• Race conditions and atomic operations

Page 3: Introduction to CUDA heterogeneous programming

Architecture

NVIDIA Tesla M2070:

• Core clock: 1.15 GHz
• Single instruction, 448 CUDA cores
• 1.15 GHz x 1 instruction x 448 cores = 515 Gigaflops double precision (peak)

• 1.03 Tflops single precision (peak)

• 6 GB total dedicated memory (deviceQuery reports about 5.3 GB usable)

• Delivers performance at about 10% of the cost and 5% of the power of a CPU

Page 4: Introduction to CUDA heterogeneous programming

Architecture

CUDA:

• Compute Unified Device Architecture

• General Purpose Parallel Computing Architecture by NVIDIA

• Supports traditional OpenGL graphics

Page 5: Introduction to CUDA heterogeneous programming

Architecture

Memory Bandwidth: the rate at which data can be read from or stored into memory, expressed in bytes per second

Intel Xeon X5650: 32 GB/s
Tesla M2070: 148 GB/s

Page 6: Introduction to CUDA heterogeneous programming

Architecture

Tesla M2070 Processor:
• Streaming Multiprocessors (SM): 14
• Streaming Processors on each SM: 32

Total: 14 x 32 = 448 Cores

Each Streaming Multiprocessor supports up to 1536 concurrent threads.

Page 7: Introduction to CUDA heterogeneous programming

Architecture

CUDA:

SIMT philosophy: Single Instruction Multiple Thread

• Computationally intensive: the time spent on computation significantly exceeds the time spent transferring data to and from GPU memory.
• Massively parallel: the computations can be broken down into hundreds or thousands of independent units of work.

Page 8: Introduction to CUDA heterogeneous programming

Architecture

# Copy tutorial files
scc1 % cp -r /scratch/katia/cuda .

# Request interactive session on the node with GPU
scc1 % qrsh -l gpus=1

# Change directory
scc1-ha1 % cd deviceQuery

# Set environment variables to link to CUDA 5.0
scc1-ha1 % module load cuda/5.0

# Execute deviceQuery program
scc1-ha1 % ./deviceQuery

Page 9: Introduction to CUDA heterogeneous programming

Architecture

Information that we will need later in this tutorial:

CUDA Driver Version / Runtime Version: 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768

Page 10: Introduction to CUDA heterogeneous programming

CUDA Architecture

Information that we will need later in this tutorial:

Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535

Page 11: Introduction to CUDA heterogeneous programming

CUDA Architecture

# Change directory
scc1-ha1 % cd bandwidthTest

# Execute bandwidthTest program
scc1-ha1 % ./bandwidthTest

Query device capabilities and measure GPU/CPU bandwidth. This is a simple test program that measures the memcopy bandwidth of the GPU and the memcpy bandwidth across PCI-e.

Page 12: Introduction to CUDA heterogeneous programming

CUDA Terminology

CUDA:

Device: the GPU and its memory (device memory)

Host: the CPU and its memory (host memory)

Page 13: Introduction to CUDA heterogeneous programming

CUDA: C Language Extensions

CUDA:

• Based on industry-standard C

• Language extensions allow heterogeneous programming

• APIs for memory and device management

Page 14: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA: Basic example HelloCuda1.cu

#include <stdio.h>

int main(void)
{
    printf("Hello, Cuda! \n");
    return(0);
}

To build the program, use nvcc compiler:

scc-he1: % nvcc -o helloCuda1 helloCuda1.cu

Page 15: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA language closely follows C/C++ syntax with a minimum set of extensions:

Function to be executed on the device (GPU) and called from host code:

__global__ void foo() { . . . }

Function to be executed on the device (GPU) and callable only from device code:

__device__ void foo() { . . . }

The NVCC compiler will compile the functions that run on the device, and the host compiler (gcc) will take care of all other functions that run on the host (e.g. main()).

Page 16: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

#include <stdio.h>

__global__ void cudakernel(void)
{
    printf("Hello, I am CUDA kernel ! Nice to meet you!\n");
}

Page 17: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

int main(void)
{
    printf("Hello, Cuda! \n");

    cudakernel<<<1,1>>>();
    cudaDeviceSynchronize();

    printf("Nice to meet you too! Bye, CUDA\n");
    return(0);
}

Page 18: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

cudakernel<<<N,M>>>();

cudaDeviceSynchronize();

Triple angle brackets indicate that the function will be executed on the device (GPU). Such a function is called a kernel.

A kernel is always of type void. The program returns immediately after launching the kernel; to prevent the program from finishing before the kernel completes, we have to call cudaDeviceSynchronize().
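In the launch configuration <<<N,M>>>, N is the number of thread blocks and M is the number of threads per block, so <<<1,1>>> launches a single thread.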

Page 19: Introduction to CUDA heterogeneous programming

CUDA: C Language Extensions

There are a number of CUDA runtime functions:

Device management:
cudaGetDeviceCount(), cudaGetDeviceProperties()

Error management:
cudaGetLastError(), cudaSafeCall(), cudaCheckError()

Device memory management:
cudaMalloc(), cudaFree(), cudaMemcpy()
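As an illustration (not one of the tutorial files), the device and error management calls can be combined into a small program that lists the available GPUs; a minimal sketch, compiled with nvcc, might look like this:

#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // Translate the error code into a readable message
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, %d multiprocessors\n",
               dev, prop.name, prop.multiProcessorCount);
    }
    return 0;
}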

Page 20: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA: Basic example HelloCuda2.cu

To build the program, use nvcc compiler:

scc-he1: % nvcc -o helloCuda2 helloCuda2.cu -arch sm_20

The ability to print from within a kernel requires Compute Capability 2.0 or higher. To request support for Compute Capability 2.0, we add this option to the compilation command line.

Page 21: Introduction to CUDA heterogeneous programming

Hello, Cuda!

CUDA: Basic example HelloCudaBlock.cu

#include <stdio.h>

__global__ void cudakernel(void)
{
    printf("Hello, I am CUDA block %d !\n", blockIdx.x);
}

int main(void)
{
    . . .
    cudakernel<<<16,1>>>();
    . . .
}

To simplify the compilation process we will use a Makefile:

% make HelloCudaBlock

Page 22: Introduction to CUDA heterogeneous programming

CUDA: C Language Extensions

CUDA provides special variables for thread identification in the kernel:

dim3 threadIdx; // thread ID within the block

dim3 blockIdx; // block ID within the grid

dim3 blockDim; // number of threads per block

dim3 gridDim; // number of blocks in the grid

In the simple 1-dimensional case, we use only the first component of each variable, e.g. threadIdx.x
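For example (a sketch, not one of the tutorial files), a kernel can combine these variables to compute a unique global index for each thread and the total number of threads in the grid; as before, printing from a kernel requires compiling with -arch sm_20:

#include <stdio.h>

__global__ void indexDemo(void)
{
    // Global index of this thread within the whole grid
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;

    // Total number of threads launched in the grid
    int totalThreads = gridDim.x * blockDim.x;

    printf("Thread %d of %d\n", globalId, totalThreads);
}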

Page 23: Introduction to CUDA heterogeneous programming

CUDA: Blocks and Threads

[Diagram: execution alternates between serial code on the Host and kernels (Kernel A, Kernel B) launched as grids of thread blocks on the Device]

Page 24: Introduction to CUDA heterogeneous programming

CUDA: C Language Extensions

CUDA: Basic example HelloCudaThread.cu

#include <stdio.h>

__global__ void cudakernel(void)
{
    printf("Hello, I am CUDA thread %d !\n", threadIdx.x);
}

int main(void)
{
    . . .
    cudakernel<<<1,16>>>();
    . . .
}

Page 25: Introduction to CUDA heterogeneous programming

CUDA: Blocks and Threads

• One kernel is executed on the device at a time

• Many threads execute each kernel

• Each thread executes the same code (SPMD)

• Threads are grouped into thread blocks

• Kernel is a grid of thread blocks

• Threads are scheduled as sets of warps

• Warp is a group of 32 threads

• The SM executes the same instruction on all threads in the warp

• Blocks cannot synchronize and can run in any order

Page 26: Introduction to CUDA heterogeneous programming

Vector Addition Example

CUDA: vectorAdd.cu

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

Page 27: Introduction to CUDA heterogeneous programming

Vector Addition Example

CUDA: vectorAdd.cu

[Diagram: threadIdx.x runs 0..7 within each block, for blockIdx.x = 0, 1, 2, 3]

int i = blockDim.x * blockIdx.x + threadIdx.x;

Unlike blocks, threads have mechanisms to communicate and synchronize

Page 28: Introduction to CUDA heterogeneous programming

Vector Addition Example

CUDA: vectorAdd.cu device memory allocation

int main(void)
{
    . . .
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);

    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);
    . . .
}
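The slides do not show where size and err come from; a minimal sketch, assuming the host vectors are fixed-size arrays (so that the &A, &B, &C arguments used on the next slides are valid) and assuming a vector length of 50000 elements, could be:

#define NUM_ELEMENTS 50000                        // length assumed for illustration

float A[NUM_ELEMENTS], B[NUM_ELEMENTS], C[NUM_ELEMENTS];   // host vectors
int numElements = NUM_ELEMENTS;
size_t size = numElements * sizeof(float);        // bytes in each vector
cudaError_t err = cudaSuccess;                    // holds return codes from CUDA calls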

Page 29: Introduction to CUDA heterogeneous programming

Vector Addition Example

CUDA: vectorAdd.cu

int main(void)
{
    . . .
    // Copy input values to the device
    cudaMemcpy(d_A, &A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, &B, size, cudaMemcpyHostToDevice);
    . . .
}

Page 30: Introduction to CUDA heterogeneous programming

Vector Addition Example

CUDA: vectorAdd.cu

int main(void)
{
    . . .
    // Launch the Vector Add CUDA Kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();
    . . .
}

Page 31: Introduction to CUDA heterogeneous programming

Vector Addition Example

CUDA: vectorAdd.cu

int main(void)
{
    . . .
    // Copy result back to host
    cudaMemcpy(&C, d_C, size, cudaMemcpyDeviceToHost);

    // Clean-up
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    . . .
}
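The CUDA sample this example follows also verifies the result on the host; a short sketch of that check (not shown on the slide, and assuming <math.h> and <stdlib.h> are included) is:

// Verify that each element of C equals A + B within a small tolerance
for (int i = 0; i < numElements; i++) {
    if (fabs(A[i] + B[i] - C[i]) > 1e-5) {
        fprintf(stderr, "Result verification failed at element %d!\n", i);
        exit(EXIT_FAILURE);
    }
}
printf("Test PASSED\n");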

Page 32: Introduction to CUDA heterogeneous programming

Timing CUDA kernel

CUDA: vectorAddTime.cu

float memsettime;
cudaEvent_t start, stop;

// initialize CUDA timer
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// CUDA Kernel
. . .

// stop CUDA timer
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&memsettime, start, stop);
printf(" *** CUDA execution time: %f *** \n", memsettime);
cudaEventDestroy(start);
cudaEventDestroy(stop);
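cudaEventElapsedTime() reports the elapsed time between the two recorded events in milliseconds.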

Page 33: Introduction to CUDA heterogeneous programming

Timing CUDA kernel

CUDA: vectorAddTime.cu

scc-ha1 % make

# Specify the number of threads per block
scc-ha1 % vectorAddTime 128

Explore the CUDA kernel execution time based on the block size.

Remember:
• CUDA Streaming Multiprocessor executes threads in warps (32 threads)
• There is a maximum of 1024 threads per block (for our GPU)
• There is a maximum of 1536 threads per multiprocessor (for our GPU)
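For example, with a block size of 512, three blocks (1536 threads) can be resident on one multiprocessor at a time, while a block size of 1024 allows only one resident block and leaves 512 of the 1536 thread slots unused.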

Page 34: Introduction to CUDA heterogeneous programming

Dot Product

CUDA: dotProd1.cu

[Diagram: each pair a_i, b_i is multiplied and the products are summed into C]

C = A * B = (a0, a1, a2, a3) * (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3

Page 35: Introduction to CUDA heterogeneous programming

Dot Product

CUDA: dotProd1.cu

A block of threads shares common memory, called shared memory

Shared Memory is extremely fast on-chip memory

To declare shared memory use __shared__ keyword

Shared Memory is not visible to the threads in other blocks

Page 36: Introduction to CUDA heterogeneous programming

Dot Product

CUDA: dotProd1.cu

#define N 512

__global__ void dot(int *a, int *b, int *c)
{
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    // Thread 0 sums the pairwise products
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum += temp[i];
        *c = sum;
    }
}

What if thread 0 starts to calculate the sum before the other threads have completed their multiplications?

Page 37: Introduction to CUDA heterogeneous programming

Thread Synchronization

CUDA: dotProd1.cu

#define N 512

__global__ void dot(int *a, int *b, int *c)
{
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();

    // Thread 0 sums the pairwise products
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < N; i++)
            sum += temp[i];
        *c = sum;
    }
}
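__syncthreads() acts as a barrier: no thread in the block proceeds past it until every thread in the block has reached it, so all products are in temp[] before thread 0 starts summing.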

Page 38: Introduction to CUDA heterogeneous programming

Thread Synchronization

CUDA: dotProd1.cu

int main(void)
{
    . . .
    // copy input vectors to the device
    . . .

    // Launch CUDA kernel
    dot<<<1, N>>>(dev_A, dev_B, dev_C);

    . . .
    // copy the result back from the device
    . . .
}

But our vector length is limited to the maximum block size. Can we use multiple blocks?

Page 39: Introduction to CUDA heterogeneous programming

Race Condition

CUDA: dotProd2.cu

[Diagram: Block 0 computes a0*b0 + a1*b1 + a2*b2 + a3*b3 into its local sum, Block 1 computes a4*b4 + a5*b5 + a6*b6 + a7*b7 into its local sum, and both sums are added into C]

Page 40: Introduction to CUDA heterogeneous programming

Race Condition

CUDA: dotProd2.cu

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel(int *a, int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
            sum += temp[i];
        *c += sum;
    }
}

Blocks interfere with each other when updating *c: a race condition.

Page 41: Introduction to CUDA heterogeneous programming

Race Condition

CUDA: dotProd2.cu

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

__global__ void dotProductKernel(int *a, int *b, int *c)
{
    __shared__ int temp[THREADS_PER_BLOCK];

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < THREADS_PER_BLOCK; i++)
            sum += temp[i];
        atomicAdd(c, sum);
    }
}
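The host-side code for dotProd2.cu is not shown on the slides; a minimal sketch, assuming <stdlib.h> is included and assuming host vectors a, b and device pointers dev_a, dev_b, dev_c (names assumed here), might look like this:

int main(void)
{
    int size = N * sizeof(int);
    int *a, *b, c = 0;
    int *dev_a, *dev_b, *dev_c;

    // Allocate host input vectors (initialization omitted)
    a = (int *)malloc(size);
    b = (int *)malloc(size);

    // Allocate device memory and copy inputs; the result starts at zero
    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, sizeof(int));
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, &c, sizeof(int), cudaMemcpyHostToDevice);

    // One block per THREADS_PER_BLOCK elements (N is a multiple of the block size)
    dotProductKernel<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

    // Copy the result back and clean up
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b);
    return 0;
}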

Page 42: Introduction to CUDA heterogeneous programming

Atomic Operations

Race conditions: behavior depends upon the relative timing of multiple event sequences. They can occur when an implied read-modify-write is interruptible.

An uninterruptible read-modify-write is atomic:

atomicAdd()   atomicInc()
atomicSub()   atomicDec()
atomicMin()   atomicExch()
atomicMax()   atomicCAS()
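As an illustration beyond the slides, atomicAdd() lets many threads update a single counter safely; this sketch (names are assumptions) counts the positive elements of an array:

__global__ void countPositive(const int *data, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        // The read-modify-write on *count is performed as one uninterruptible operation
        atomicAdd(count, 1);
    }
}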

Page 43: Introduction to CUDA heterogeneous programming

CUDA Best Practices

NVIDIA’s link: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

1. Assess: locate the slowest part of the code
   gcc -O2 -g -pg myprog.c
   gprof ./a.out > profile.txt

2. Parallelize: use CUDA to parallelize the code; use optimized cu* libraries where possible

3. Optimize: overlap data transfers, fine-tune operation sequences

4. Deploy: compare the outcome with the original expectations

Page 44: Introduction to CUDA heterogeneous programming

CUDA Debugging

CUDA-GDB - GNU Debugger that runs on Linux and Mac: http://developer.nvidia.com/cuda-gdb

The NVIDIA Parallel Nsight debugging and profiling tool for Microsoft Windows Vista and Windows 7 is available as a free plugin for Microsoft Visual Studio: http://developer.nvidia.com/nvidia-parallel-nsight

Page 45: Introduction to CUDA heterogeneous programming

This tutorial has been made possible by the Scientific Computing and Visualization group at Boston University.

Katia [email protected]

http://www.bu.edu/tech/research/training/tutorials/list/