CUDA - 101 Basics


Page 1: CUDA - 101 Basics

Page 2: Overview

• What is CUDA?
• Data Parallelism
• Host-Device model
• Thread execution
• Matrix-multiplication

Page 3: GPU revised!

Page 4: What is CUDA?

• Compute Unified Device Architecture
• Programming interface to the GPU
• Supports C/C++ and Fortran natively
  – Third-party wrappers for Python, Java, MATLAB, etc.

• Various libraries available
  – cuBLAS, cuFFT and many more…
  – https://developer.nvidia.com/gpu-accelerated-libraries

Pages 5-8: CUDA computing stack

Page 9: Data Parallel programming

[Diagram: inputs i1, i2, i3, …, iN are each processed by the kernel to produce outputs o1, o2, o3, …, oN]

Page 10: Data parallel algorithm

• Dot product: C = A . B

[Diagram: the kernel combines each pair (A1, B1), (A2, B2), (A3, B3), …, (AN, BN) to produce C1, C2, C3, …, CN]
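
As a rough illustration (not code from the slides), a kernel in this style combines one pair per thread. The sketch below assumes one thread per element and a single block; all names are chosen for illustration only.

__global__ void pairwiseKernel(const float *A, const float *B, float *C)
{
    int i = threadIdx.x;      // each thread handles one pair (A[i], B[i])
    C[i] = A[i] * B[i];       // per-element product; a full dot product would also sum the C[i]
}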

Page 11: Host-Device model

[Diagram: CPU (Host) and GPU (Device)]

Page 12: Threads

• A thread is an instance of the kernel program
  – Independent in a data-parallel model
  – Can be executed on a different core
• Host tells the device to run a kernel program
  – And how many threads to launch
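
In CUDA C the host does this with the triple-angle-bracket launch syntax. A minimal sketch, where the kernel name, arguments and thread counts are purely illustrative:

// Ask the device to run "myKernel" with 4 blocks of 256 threads each (4 * 256 threads in total).
myKernel<<<4, 256>>>(d_input, d_output);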

Page 13: Matrix-Multiplication

Page 14: CPU-only Matrix Multiplication

Execute this code for all elements of P
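
The code on this slide is not captured in the transcript; a plausible CPU-only version, assuming square width x width matrices stored in flat row-major arrays, might look like this:

// Compute P = M * N on the CPU, one output element at a time.
void MatrixMulOnHost(const float *M, const float *N, float *P, int width)
{
    for (int row = 0; row < width; ++row)
        for (int col = 0; col < width; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < width; ++k)
                sum += M[row * width + k] * N[k * width + col];  // dot product of M's row and N's column
            P[row * width + col] = sum;                          // one element of P per loop iteration
        }
}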

Page 15: Memory Indexing in C (and CUDA)

M(i, j) = M[i + j * width]
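
For example, taking i as the position within a row and j as the row number (an assumption of this sketch; the slide does not spell out which index is which):

// With width = 4, element M(2, 1) lives at flat index 2 + 1 * 4 = 6.
float M[4 * 4];
M[2 + 1 * 4] = 5.0f;   // writes the element the slide would call M(2, 1)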

Page 16: CUDA version - I

Page 17: CUDA program flow

• Allocate input and output memory on host
  – Do the same for device
• Transfer input data from host -> device
• Launch kernel on device
• Transfer output data from device -> host

Page 18: Allocating Device memory

• Host tells the device when to allocate and free memory on the device
• Functions for the host program
  – cudaMalloc(memory reference, size)
  – cudaFree(memory reference)
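
In code this looks roughly like the sketch below; the variable names and the size computation are illustrative (width is assumed to be defined elsewhere):

float *d_M = NULL;                              // "d_" marks a device pointer
size_t size = width * width * sizeof(float);
cudaMalloc((void **)&d_M, size);                // allocate device memory; d_M now refers to it
/* ... use d_M in kernels ... */
cudaFree(d_M);                                  // release the device memory when finished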

Page 19: Transfer Data to/from device

• Again, the host tells the device when to transfer data
• cudaMemcpy(target, source, size, flag)
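
The flag names the direction of the copy; a short sketch with illustrative pointer names:

// Host -> device: copy input matrix M to its device buffer
cudaMemcpy(d_M, h_M, size, cudaMemcpyHostToDevice);

// Device -> host: copy the result P back after the kernel has run
cudaMemcpy(h_P, d_P, size, cudaMemcpyDeviceToHost);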

Page 20: CUDA version - 2

[Diagram: data flow between Host Memory and Device Memory]

• Allocate matrix M on device; transfer M from host -> device
• Allocate matrix N on device; transfer N from host -> device
• Allocate matrix P on device
• Execute kernel on device
• Transfer P from device -> host
• Free device memories for M, N and P
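
Putting these steps together, the host side could look like the following sketch. The function and variable names are illustrative, and the single-block launch assumes width x width threads fit in one block; it pairs with the kernel sketched on the next page.

void MatrixMulOnDevice(const float *M, const float *N, float *P, int width)
{
    size_t size = width * width * sizeof(float);
    float *d_M, *d_N, *d_P;

    // Allocate M and N on the device and transfer them from the host
    cudaMalloc((void **)&d_M, size);
    cudaMemcpy(d_M, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **)&d_N, size);
    cudaMemcpy(d_N, N, size, cudaMemcpyHostToDevice);

    // Allocate the output matrix P on the device
    cudaMalloc((void **)&d_P, size);

    // Execute the kernel on the device: one thread per element of P
    dim3 dimBlock(width, width);
    MatrixMulKernel<<<1, dimBlock>>>(d_M, d_N, d_P, width);

    // Transfer P from device to host, then free the device memories
    cudaMemcpy(P, d_P, size, cudaMemcpyDeviceToHost);
    cudaFree(d_M);
    cudaFree(d_N);
    cudaFree(d_P);
}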

Page 21: Matrix Multiplication Kernel

• Kernel specifies the function to be executed on the device
  – Parameters = device memories, width
  – Thread = each element of the output matrix P
  – Dot product of M's row and N's column
  – Write the dot product at the current location
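
A kernel along these lines, as a hedged sketch (names are illustrative; it assumes a single block of width x width threads, so threadIdx alone identifies the element):

__global__ void MatrixMulKernel(const float *M, const float *N, float *P, int width)
{
    // Each thread computes one element of the output matrix P
    int col = threadIdx.x;
    int row = threadIdx.y;

    // Dot product of M's row and N's column
    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += M[row * width + k] * N[k * width + col];

    // Write the dot product at the current location
    P[row * width + col] = sum;
}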

Page 22: Extensions: Function qualifiers
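
The table on this slide is not reproduced in the transcript; in brief, CUDA C adds the qualifiers below (standard CUDA, not taken from the slide; the declarations are illustrative):

__global__ void myKernel(float *data);      // runs on the device, launched from the host with <<< >>>
__device__ float deviceHelper(float x);     // runs on the device, callable only from device code
__host__ float hostFunction(float x);       // runs on the host; this is also the default with no qualifier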

Page 23: Extensions: Thread indexing

• All threads execute the same code
  – But they need to work on separate memory data
• threadIdx.x & threadIdx.y
  – These variables automatically receive corresponding values for their threads
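
Inside a kernel this looks like the following sketch (the kernel and its arguments are illustrative):

__global__ void scale(float *data, float factor)
{
    int i = threadIdx.x;          // each thread automatically sees its own index value
    data[i] = data[i] * factor;   // so every thread works on a separate element
}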

Page 24: Thread Grid

• Represents the group of all threads to be executed for a particular kernel
• Two-level hierarchy
  – Grid is composed of blocks
  – Each block is composed of threads
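
Both levels are given at launch time; a short sketch using the dim3 type, with purely illustrative kernel name and sizes:

dim3 dimGrid(4, 4);       // the grid: 4 x 4 = 16 blocks
dim3 dimBlock(16, 16);    // each block: 16 x 16 = 256 threads
myKernel<<<dimGrid, dimBlock>>>(d_data);   // 16 blocks * 256 threads = 4096 threads in total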

Page 25: Thread Grid

[Diagram: a width x width arrangement of threads, with coordinates running from (0, 0), (1, 0), (2, 0), …, (width-1, 0) along the first row down to (0, width-1), …, (width-1, width-1) in the last row]

Page 26: Conclusion

• Sample code and tutorials
• CUDA nodes?
• Programming guide
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• SDK
  – https://developer.nvidia.com/cuda-downloads
  – Available for Windows, Mac and Linux
  – Lots of sample programs

Page 27: QUESTIONS?