
Page 1: Basic CUDA  Programming

Basic CUDA Programming

Computer Architecture 2014 (Prof. Chih-Wei Liu)

Final Project – CUDA Tutorial

TA Cheng-Yen Yang ([email protected])

Page 2: Basic CUDA  Programming

From Graphics to General Purpose Processing – CPU vs GPU

CPU: general-purpose computation (SISD)

GPU: data-parallel computation (SIMD)

Page 3: Basic CUDA  Programming


What is CUDA?

Compute Unified Device Architecture

Hardware or software?

A programming model

A parallel computing platform

[Figure: CUDA software stack (Application / CUDA / Platform) and CUDA thread organization: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid contains thread blocks Block (0,0) through Block (1,1), and each block contains threads indexed as Thread (x,y,z). Figure 3.2: An Example of CUDA Thread Organization. Courtesy: NVIDIA.]

[Figure: G80 hardware block diagram: Host, Input Assembler, and Thread Execution Manager feed an array of streaming multiprocessors with parallel data caches and texture units; load/store units connect them to Global Memory.]

Page 4: Basic CUDA  Programming

Heterogeneous computing: CPU+GPU Cooperation

Host-Device Architecture: the CPU is the host; the GPU, with its own local DRAM, is the device.

Page 5: Basic CUDA  Programming

Heterogeneous computing: NVIDIA CUDA Compiler (NVCC)

NVCC separates the source code into CPU and GPU parts. For host code, NVCC invokes a typical C compiler such as GCC, the Intel C compiler, or the Microsoft C compiler.

All device code is compiled by NVCC itself; device source files should use the ".cu" extension.

Every executable containing CUDA code requires:

CUDA core library (cuda)

CUDA runtime library (cudart)

Page 6: Basic CUDA  Programming

Heterogeneous computing: CUDA Code Execution (1/2)

[Figure: execution flow: serial code runs on the host; for each parallel section the host transfers data to device memory, launches a kernel on the device, transfers the results back, and then continues with the next serial section.]

Page 7: Basic CUDA  Programming

Heterogeneous computing: CUDA Code Execution (2/2)

Page 8: Basic CUDA  Programming

Heterogeneous computing: NVIDIA G80 series

Texture Processor Cluster (TPC)

Streaming Multiprocessor (SM)

Streaming Processor (SP) / CUDA core

Special Function Unit (SFU)

For the NVIDIA 8800 (G80), the number of SPs is 128.

Page 9: Basic CUDA  Programming

Heterogeneous computing: NVIDIA G80 series – CUDA mode

[Figure: G80 in CUDA mode: Host, Input Assembler, and Thread Execution Manager feed an array of streaming multiprocessors, each with a parallel data cache and texture unit; load/store units connect them to Global Memory.]

Page 10: Basic CUDA  Programming

CUDA Programming Model (1/7)

CUDA defines:

Programming model

Memory model

These help developers map applications or algorithms onto CUDA devices more easily and clearly.

NVIDIA GPUs have a different architecture from common CPUs, so it is important to follow CUDA's programming model to obtain higher performance.

Page 11: Basic CUDA  Programming

CUDA Programming Model (2/7)

C/C++ for CUDA:

Subset of C with extensions

C++ templates for GPU code

CUDA goals:

Scale code to 100s of cores and 1000s of parallel threads

Facilitate heterogeneous computing: CPU + GPU

CUDA defines:

Programming model

Memory model

Page 12: Basic CUDA  Programming

CUDA Programming Model (3/7)

CUDA Kernels and Threads:

Parallel portions of an application are executed on the device as kernels, and only one kernel is executed at a time.

All threads execute the same kernel at a time.

Differences between CUDA threads and CPU threads:

CUDA threads are extremely lightweight

Very little creation overhead

Fast switching

CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can only use a few.

Page 13: Basic CUDA  Programming

CUDA Programming Model (4/7)

Arrays of Parallel Threads:

A CUDA kernel is executed by an array of threads

All threads run the same code

Each thread has an ID that it uses to compute memory addresses and make control decisions

Page 14: Basic CUDA  Programming

CUDA Programming Model (5/7)

Thread Batching:

A kernel launches a grid of thread blocks

Threads within a block cooperate via shared memory

Threads in different blocks cannot cooperate

This allows programs to transparently scale to different GPUs

Page 15: Basic CUDA  Programming

CUDA Programming Model (6/7)

CUDA Programming Model:

A kernel is executed by a grid of thread blocks

Blocks in a grid can be arranged in 1D or 2D

A thread block is a batch of threads

Threads in a block can be arranged in 1D, 2D, or 3D

Threads within a block can share data through shared memory and synchronize their execution

Threads from different blocks cannot cooperate

Page 16: Basic CUDA  Programming

CUDA Programming Model (7/7)

Memory Model:

Registers: per thread; data lifetime = thread lifetime

Local memory: per-thread off-chip memory (physically in device DRAM); data lifetime = thread lifetime

Shared memory: per-thread-block on-chip memory; data lifetime = block lifetime

Global (device) memory: accessible by all threads as well as the host (CPU); data lifetime = from allocation to deallocation

Host (CPU) memory: not directly accessible by CUDA threads
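To make these memory spaces concrete, here is a minimal kernel sketch (illustrative only; the function and array names are assumptions, not part of the lab code, and the grid is assumed to exactly cover the input with a block size of 256):

__global__ void memory_spaces_demo(float *g_in, float *g_out)
{
    // "tile" lives in on-chip shared memory; one copy per thread block
    __shared__ float tile[256];

    // unqualified scalars such as "idx" and "x" normally live in registers
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // read from global (device) memory and stage the value in shared memory
    tile[threadIdx.x] = g_in[idx];
    __syncthreads();                 // wait until the whole block has written its element

    float x = tile[threadIdx.x] * 2.0f;

    // write the result back to global memory
    g_out[idx] = x;
}

Launched as, for example, memory_spaces_demo<<<N/256, 256>>>(d_in, d_out), each block stages its 256 elements in shared memory before writing results back to global memory.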

Page 17: Basic CUDA  Programming

CUDA C Basic


Page 18: Basic CUDA  Programming

GPU Memory Allocation / Release

Memory allocation on GPU:

cudaMalloc(void **pointer, size_t nbytes)

Preset a value for a specific memory area:

cudaMemset(void *pointer, int value, size_t count)

Release a memory allocation:

cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );

Page 19: Basic CUDA  Programming

Data Copies

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);

direction specifies the locations (host or device) of src and dst

Blocks the CPU thread: returns after the copy is complete

Doesn't start copying until previous CUDA calls complete

enum cudaMemcpyKind:

cudaMemcpyHostToDevice

cudaMemcpyDeviceToHost

cudaMemcpyDeviceToDevice
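Putting cudaMalloc and cudaMemcpy together, a typical host-to-device round trip might look like the following sketch (h_a and d_a are hypothetical names for a host buffer and a device buffer):

int n = 1024;
size_t nbytes = n * sizeof(float);

float *h_a = (float*)malloc(nbytes);   // host buffer
float *d_a = 0;                        // device buffer
cudaMalloc( (void**)&d_a, nbytes );

// host -> device before the kernel runs
cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );

// ... launch kernels that read/write d_a ...

// device -> host to collect the results
cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost );

cudaFree( d_a );
free( h_a );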

Page 20: Basic CUDA  Programming

Function Qualifiers

__global__ : invoked from host (CPU) code, cannot be called from device (GPU) code, must return void

__device__ : called from other GPU functions, cannot be called from host (CPU) code

__host__ : can only be executed by the CPU, called from the host
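A minimal sketch of how the qualifiers combine (the function names here are illustrative, not from the lab code):

// callable only from device code
__device__ float square(float x)
{
    return x * x;
}

// kernel: launched from the host, runs on the device, must return void
__global__ void square_all(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = square(data[idx]);
}

// ordinary host function (the __host__ qualifier is the default and may be omitted)
__host__ void run(float *d_data, int n)
{
    square_all<<<(n + 255) / 256, 256>>>(d_data, n);
}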

Page 21: Basic CUDA  Programming

Variable Qualifiers (GPU code)

__device__

Stored in device memory (large capacity, high latency, uncached)

Allocated with cudaMalloc (__device__ qualifier implied)

Accessible by all threads

Lifetime: application

__shared__

Stored in on-chip shared memory (SRAM, low latency)

Allocated by the execution configuration or at compile time

Accessible by all threads in the same thread block

Lifetime: duration of the thread block

Unqualified variables:

Scalars and built-in vector types are stored in registers

Arrays may be in registers or local memory (registers are not addressable)
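A short declaration sketch of the three cases (names are illustrative only, and a block size of at most 128 is assumed):

// file-scope __device__ variable: global device memory, lifetime of the application
__device__ float d_scale = 2.0f;

__global__ void scale_demo(float *out, int n)
{
    // __shared__ array: one copy per thread block, allocated at compile time
    __shared__ float tile[128];

    // unqualified scalar: normally stored in a register
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        tile[threadIdx.x] = d_scale * out[idx];
        out[idx] = tile[threadIdx.x];
    }
}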

Page 22: Basic CUDA  Programming

CUDA Built-in Device Variables

All __global__ and __device__ functions have access to these automatically defined variables

dim3 gridDim; Dimensions of the grid in blocks (at most 2D)

dim3 blockDim; Dimensions of the block in threads

dim3 blockIdx; Block index within the grid

dim3 threadIdx; Thread index within the block


Page 23: Basic CUDA  Programming

Executing Code on the GPU

Kernels are C functions with some restrictions:

Can only access GPU memory

Must have void return type

No variable number of arguments ("varargs")

Not recursive

No static variables

Function arguments are automatically copied from CPU to GPU memory

Page 24: Basic CUDA  Programming

Launching Kernels

Modified C function call syntax:

kernel<<<dim3 grid, dim3 block>>>(…);

Execution configuration ("<<< >>>"):

Grid dimensions: x and y

Thread-block dimensions: x, y, and z

dim3 grid(16, 16);
dim3 block(16, 16);
kernel1<<<grid, block>>>(…);

kernel2<<<32, 512>>>(…);

Page 25: Basic CUDA  Programming

Data Decomposition

Often we want each thread in a kernel to access a different element of an array.

[Figure: with blockDim.x = 5, array elements 0 through 14 are covered by three blocks; within each block threadIdx.x runs 0 through 4, and the global element index is:]

idx = blockIdx.x * blockDim.x + threadIdx.x

Page 26: Basic CUDA  Programming

Data Decomposition Example: Increment Array Elements

Increment N-element vector a by scalar b

CPU program

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    …
    increment_cpu(a, b, N);
}

CUDA program

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    …
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

Page 27: Basic CUDA  Programming

LAB0: Setup CUDA Environment & Device Query


Page 28: Basic CUDA  Programming

CUDA Environment Setup

Check your NVIDIA GPU's compute capability (GPU generation)

https://developer.nvidia.com/cuda-gpus

Download the CUDA development files

https://developer.nvidia.com/cuda-toolkit-archive

CUDA driver

CUDA toolkit

PC Room ED417B used version 2.3.

CUDA SDK (optional)

Install CUDA

Test CUDA

Device Query (included in the sample codes)

Page 29: Basic CUDA  Programming

Setup CUDA for MS Visual Studio (Windows XP / VS2005–2010)

In PC Room ED417B:

CUDA device: NVIDIA 8400 GS

CUDA toolkit version: 2.0 or 2.3

VS2005 or VS2008

From scratch: http://forums.nvidia.com/index.php?showtopic=30273

CUDA VS Wizard: http://sourceforge.net/projects/cudavswizard/

Or modify an existing project

Page 30: Basic CUDA  Programming

Setup CUDA for MS Visual Studio 2012 (Win7/VS2012)

MS Visual Studio 2012 installed

Download and install CUDA toolkit 5.0

https://developer.nvidia.com/cuda-toolkit-50-archive

Follow the instructions on this blog: http://blog.norture.com/2012/10/gpu-parallel-programming-in-vs2012-with-nvidia-cuda/

You can start to write a CUDA program by creating a new project from VS2012's template.

If you have projects from a previous version of VS, it is recommended to re-create them using VS2012's new template.

Page 31: Basic CUDA  Programming

CUDA Device Query Example (CUDA toolkit 6.0)

Page 32: Basic CUDA  Programming

Lab1: First CUDA Program

Yet Another Random Number Sequence Generator

Page 33: Basic CUDA  Programming

Yet Another Random Number Sequence Generator

Implemented on both CPU and GPU

Functionality:

Given a random integer array A holding 8192 elements, generated by rand()

Re-generate the random numbers by repeated multiplication, 256 times, without regard to any overflow that occurs:

B[i]_new = B[i]_old * A[i]  (CPU)

C[i]_new = C[i]_old * A[i]  (GPU)

Check the consistency between the CPU and GPU results.
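For reference, a minimal sketch of the CPU side of this functionality (assuming SIZE = 8192 and that B has been initialized from A; the names follow the lab description, but the actual lab code may differ):

#define SIZE 8192

void cpu_kernel(int *B, int *A)
{
    for (int i = 0; i < SIZE; i++) {
        // multiply by A[i] 256 times; integer overflow is ignored on purpose
        for (int j = 0; j < 256; j++)
            B[i] = B[i] * A[i];
    }
}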

Page 34: Basic CUDA  Programming

Data Manipulation between Host and Device

cudaError_t cudaMalloc( void** devPtr, size_t count )

Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr

cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)

Copies count bytes from memory area pointed to by src to the memory area pointed to by dst

kind indicates the type of memory transfer:

cudaMemcpyHostToHost

cudaMemcpyHostToDevice

cudaMemcpyDeviceToHost

cudaMemcpyDeviceToDevice

cudaError_t cudaFree( void* devPtr )

Frees the memory space pointed to by devPtr

float GPU_kernel(int *B, int *A) {

    // Create pointers for memory space on the device
    int *dA, *dB;

    // Allocate memory space on the device
    cudaMalloc( (void**) &dA, sizeof(int)*SIZE );
    cudaMalloc( (void**) &dB, sizeof(int)*SIZE );

    // Copy the data to be calculated to the device
    cudaMemcpy( dA, A, sizeof(int)*SIZE, cudaMemcpyHostToDevice );
    cudaMemcpy( dB, B, sizeof(int)*SIZE, cudaMemcpyHostToDevice );

    // Launch the kernel
    cuda_kernel<<<1,1>>>(dB, dA);

    // Copy the output back to the host
    cudaMemcpy( B, dB, sizeof(int)*SIZE, cudaMemcpyDeviceToHost );

    // Free memory space on the device
    cudaFree( dA );
    cudaFree( dB );

    // (return value, e.g. elapsed time, omitted on the slide)
}

Page 35: Basic CUDA  Programming

Now, go and finish your first CUDA program !!!


Page 36: Basic CUDA  Programming

Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip

Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )

main.cu : random input generation, output validation, result reporting

device.cu : launches the GPU kernel, GPU kernel code

parameter.h

Fill in the appropriate APIs in GPU_kernel() in device.cu

Page 37: Basic CUDA  Programming

Lab2: Make the Parallel Code Faster

Yet Another Random Number Sequence Generator

Page 38: Basic CUDA  Programming

Parallel Processing in CUDA

Parallel code can be partitioned into blocks and threads

cuda_kernel<<<nBlk, nTid>>>(…)

Multiple tasks will be initialized, each with a different block id and thread id

The tasks are dynamically scheduled

Tasks within the same block will be scheduled on the same streaming multiprocessor

Each task takes care of a single data partition according to its block id and thread id

Page 39: Basic CUDA  Programming

Locate Data Partition by Built-in Variables

Built-in Variables

gridDim: x, y

blockIdx: x, y

blockDim: x, y, z

threadIdx: x, y, z

[Figure: CUDA thread organization: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains thread blocks Block (0,0) through Block (1,1), and each block contains threads indexed as Thread (x,y,z). Figure 3.2: An Example of CUDA Thread Organization. Courtesy: NVIDIA.]

Page 40: Basic CUDA  Programming

Data Partition for Previous Example

When processing 64 integer data with cuda_kernel<<<2, 2>>>(…), the array is split into consecutive partitions, one per task:

TASK 0: blockIdx.x = 0, threadIdx.x = 0

TASK 1: blockIdx.x = 0, threadIdx.x = 1

TASK 2: blockIdx.x = 1, threadIdx.x = 0

TASK 3: blockIdx.x = 1, threadIdx.x = 1

Each task covers one slice of the array, starting at "head" and spanning "length" elements:

int total_task = gridDim.x * blockDim.x;
int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;

int length = SIZE / total_task;
int head   = task_sn * length;

Page 41: Basic CUDA  Programming

Processing Single Data Partition

__global__ void cuda_kernel ( int *B, int *A ) {

    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockDim.x * blockIdx.x + threadIdx.x;

    int length = SIZE / total_task;
    int head   = task_sn * length;

    for ( int i = head ; i < head + length ; i++ ) {
        // B[i] = B[i] * A[i], repeated 256 times (overflow ignored)
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
    }

    return;
}

Page 42: Basic CUDA  Programming

Parallelize Your Program !!!


Page 43: Basic CUDA  Programming

Partition the kernel into threads:

Increase nTid from 1 to 512

Keep nBlk = 1

Group threads into blocks:

Adjust nBlk and see if it helps

Maintain the total number of threads below 512, and make sure that 8192 is divisible by that number

e.g. nBlk * nTid < 512
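A sketch of how the launch configuration might be swept (nBlk and nTid follow the slides, and dA/dB are the device buffers from GPU_kernel; the exact lab code may differ):

// Lab2 step 1: one block, more threads
int nBlk = 1;
int nTid = 256;                       // try 1, 2, 4, ..., 512
cuda_kernel<<<nBlk, nTid>>>(dB, dA);

// Lab2 step 2: split the threads across several blocks,
// keeping nBlk * nTid the same and a divisor of 8192
nBlk = 4;
nTid = 64;
cuda_kernel<<<nBlk, nTid>>>(dB, dA);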

Page 44: Basic CUDA  Programming

Lab3: Resolve Memory Contention

Yet Another Random Number Sequence Generator

Page 45: Basic CUDA  Programming

Parallel Memory Architecture

Memory is divided into banks to achieve high bandwidth

Each bank can service one address per cycle

Successive 32-bit words are assigned to successive banks

[Figure: memory divided into 16 banks, Bank 0 through Bank 15.]

Page 46: Basic CUDA  Programming

Lab2 Review

When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), the blocked partition from Lab2 makes the four threads access elements 16 words apart:

[Figure: Iteration 1: THREAD 0..3 access A[0], A[16], A[32], A[48]; all four addresses fall into the same bank: CONFLICT. Iteration 2: the threads access A[1], A[17], A[33], A[49]; again the same bank: CONFLICT.]

Page 47: Basic CUDA  Programming

How about Interleaved Accessing?

When processing 64 integer data with cuda_kernel<<<1, 4>>>(…), interleaved accesses spread the threads across different banks:

[Figure: Iteration 1: THREAD 0..3 access A[0], A[1], A[2], A[3]; each address falls into a different bank: NO CONFLICT. Iteration 2: the threads access A[4], A[5], A[6], A[7]: NO CONFLICT.]

Page 48: Basic CUDA  Programming

Implementation of Interleaved Accessing

cuda_kernel<<<1, 4>>>(…)

Each task starts at its own offset ("head") and strides over the array by the total number of tasks ("stripe"):

head = task_sn

stripe = total_task
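A sketch of the interleaved version of the kernel, following the head/stripe scheme above (one plausible realization; the lab's reference code may differ in details):

__global__ void cuda_kernel ( int *B, int *A ) {

    int total_task = gridDim.x * blockDim.x;
    int task_sn    = blockDim.x * blockIdx.x + threadIdx.x;

    int head   = task_sn;        // each task starts at its own index
    int stripe = total_task;     // and jumps ahead by the number of tasks

    // neighbouring threads now touch neighbouring elements in every iteration
    for ( int i = head ; i < SIZE ; i += stripe ) {
        for ( int j = 0 ; j < 256 ; j++ )
            B[i] = B[i] * A[i];
    }

    return;
}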

Page 49: Basic CUDA  Programming

Improve Your Program !!!


Page 50: Basic CUDA  Programming

Modify the original kernel code to access data in an interleaved manner

cuda_kernel() in device.cu

Adjust nBlk and nTid as in Lab2 and examine the effect

Maintain the total number of threads below 512, and make sure that 8192 is divisible by that number

e.g. nBlk * nTid < 512


Page 51: Basic CUDA  Programming

Thank You

http://twins.ee.nctu.edu.tw/~skchen/lab3.zip

Final project issue

Subject: Porting & optimizing any algorithm on any multi-core

Demo: 1 week after final exam @ ED412

Group: 1~2 persons per group

* Group member & demo time should be registered after final exam @ ED412
