CUDA Parallel Programming Tutorial

Richard Membarth ([email protected])
Hardware-Software-Co-Design
Friedrich-Alexander University of Erlangen-Nuremberg
19.03.2009



Outline

◮ Tasks for CUDA
◮ CUDA programming model
◮ Getting started
◮ Example codes


Tasks for CUDA

◮ Provide ability to run code on GPU

◮ Manage resources

◮ Partition data to fit on cores

◮ Schedule blocks to cores



Data Partitioning

◮ Partition data into smaller blocks that can be processed by one core
◮ Up to 512 threads in one block
◮ All blocks define the grid
◮ All blocks execute the same program (kernel)
◮ Blocks are independent of each other
◮ Only ONE kernel at a time


Tesla Architecture

◮ 30 cores, 240 ALUs (1 mul-add)

◮ Each ALU: 1 mul-add + 1 mul per cycle: 240 * (2+1) * 1.3 GHz = 936 GFLOPS

◮ 4.0 GB GDDR3, 102 GB/s Mem BW, 4GB/s PCIe BW to CPU



CUDA: Extended C

◮ Function qualifiers

◮ Variable qualifiers

◮ Built-in keywords

◮ Intrinsics

◮ Function calls



Function Qualifiers

◮ Functions: __device__ , __global__ , __host__

__global__ void filter(int *in, int *out) {
    ...
}

◮ Default: __host__
◮ No function pointers
◮ No recursion
◮ No static variables
◮ No variable number of arguments
◮ No return value



Variable Qualifiers

◮ Variables: __device__ , __constant__ , __shared__

__constant__ float matrix[10] = {1.0f, ...};
__shared__ int s_data[32][2];

◮ Default: variables reside in registers



Built-In Variables

◮ Available inside of kernel code

◮ Thread index within current block: threadIdx.x , threadIdx.y , threadIdx.z
◮ Block index within grid: blockIdx.x , blockIdx.y
◮ Dimension of grid: gridDim.x , gridDim.y
◮ Dimension of block: blockDim.x , blockDim.y , blockDim.z

◮ Warp size: warpSize



Intrinsics

◮ void __syncthreads();

◮ Synchronizes all threads of the current block
◮ Use in conditional code may lead to deadlocks
◮ Intrinsics exist for most mathematical functions, e.g.
  __sinf(x), __cosf(x), __expf(x), ...

◮ Texture functions



Function Calls

◮ Launch parameters:
  ◮ Grid dimension (up to 2D)
  ◮ Block dimension (up to 3D)
  ◮ Optional: shared memory size
  ◮ Optional: stream ID
◮ kernel<<<grid, block, shared_mem, stream>>>();

__global__ void filter(int *in, int *out);
...
dim3 grid(16, 16);
dim3 block(16, 16);
filter<<<grid, block, 0, 0>>>(in, out);
filter<<<grid, block>>>(in, out);



Getting Started

◮ Compiler path

◮ Sample Makefile

◮ Debugging

◮ Memory management

◮ Time measurement



Compiler Path

◮ gcc/g++ compiler for host code
◮ nvcc compiler for device code
◮ gcc/g++ for linking
◮ icc/icpc works as well



Building the Program

Makefile offers different options:

◮ Production mode: make
◮ Debug mode: make dbg=1
◮ Emulation mode: make emu=1
◮ Debug+Emulation mode: make dbg=1 emu=1



Debugging

SDK offers wrappers for function calls:
◮ For CUDA function calls: cutilSafeCall(function);
◮ For kernel launches (calls cudaThreadSynchronize() internally): cutilCheckMsg(function);
◮ For SDK functions: cutilCheckError(function);

Additional tools (recommended):

◮ CudaVisualProfiler

◮ valgrind – in emulation mode only, there is no MMU on the GPU!

◮ gdb – in emulation mode: #ifdef __DEVICE_EMULATION__

◮ real (!) gdb support, for GNU Linux – unfortunately 32bit only :(



Memory Management

◮ Host manages GPU memory

◮ cudaMalloc(void **pointer, size_t size);

◮ cudaMemset(void *pointer, int value, size_t count);

◮ cudaFree(void *pointer);

Memcopy for GPU:
◮ cudaMemcpy(void *dst, void *src, size_t size, cudaMemcpyKind direction);
◮ cudaMemcpyKind:
  ◮ cudaMemcpyHostToDevice
  ◮ cudaMemcpyDeviceToHost
  ◮ cudaMemcpyDeviceToDevice
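Put together, a minimal allocate/copy/free round trip using only the calls above looks roughly like this (a sketch; size and variable names are illustrative, error checking omitted):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t n = 256;
    const size_t bytes = n * sizeof(float);
    float host_data[256];
    for (size_t i = 0; i < n; i++) host_data[i] = (float)i;

    // Allocate device memory and clear it.
    float *device_data;
    cudaMalloc((void **)&device_data, bytes);
    cudaMemset(device_data, 0, bytes);

    // Host -> device, then back again.
    cudaMemcpy(device_data, host_data, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(host_data, device_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(device_data);
    return 0;
}
```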



Time Measurement

◮ Initialization biases execution time
◮ Don't measure the first kernel launch!
◮ SDK provides a timer:

int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);
...
cutStopTimer(timer);
cutGetTimerValue(timer);
cutDeleteTimer(timer);

◮ Use events for asynchronous functions:

cudaEvent_t start_event, stop_event;
cutilSafeCall(cudaEventCreate(&start_event));
cutilSafeCall(cudaEventCreate(&stop_event));
// record in stream 0 to ensure that all previous CUDA calls have completed
cudaEventRecord(start_event, 0);
...
cudaEventRecord(stop_event, 0);
// block until the event is actually recorded
cudaEventSynchronize(stop_event);
cudaEventElapsedTime(&time_memcpy, start_event, stop_event);



Example

Vector addition:

◮ CPU Implementation

◮ Host code

◮ Device code



Vector Addition - CPU Implementation

void vector_add(float *iA, float *iB, float *oC, int width) {
    int i;
    for (i = 0; i < width; i++) {
        oC[i] = iA[i] + iB[i];
    }
}


Vector Addition - GPU Initialization

// include CUDA and SDK headers - CUDA 2.1
#include <cutil_inline.h>
// include CUDA and SDK headers - CUDA 2.0
#include <cuda.h>
#include <cutil.h>
// include kernels
#include "vector_add_kernel.cu"

int main(int argc, char **argv) {
    int dev;
    // CUDA 2.1
    dev = cutGetMaxGflopsDeviceId();
    cudaSetDevice(dev);
    // CUDA 2.0
    CUT_DEVICE_INIT(argc, argv);
}


Vector Addition - Launch Kernel

// setup execution parameters
dim3 grid(1, 1);
dim3 threads(num_elements, 1);

// execute the kernel
vec_add<<<grid, threads>>>(device_idata_A, device_idata_B, device_odata_C);
cudaThreadSynchronize();


Vector Addition - Kernel Function

__global__ void vector_add(float *iA, float *iB, float *oC) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    oC[idx] = iA[idx] + iB[idx];
}
