
Page 1:

Introduction to CUDA - Data Parallelism and Threads

Lesson 1.4

Page 2:

Objective

• To learn about data parallelism and the basic features of CUDA C, a heterogeneous parallel programming interface that enables exploitation of data parallelism
• Hierarchical thread organization
• Main interfaces for launching parallel execution
• Thread index to data index mapping

Page 3:

Data Parallelism - Vector Addition Example

[Figure: vectors A (A[0], A[1], A[2], …, A[N-1]) and B (B[0], B[1], B[2], …, B[N-1]) are added element by element to produce vector C (C[0], C[1], C[2], …, C[N-1]).]
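Each element of C depends only on the corresponding elements of A and B, so all N additions are independent and can run in parallel. For contrast, a minimal sequential C sketch of the same computation (the function name vecAddSequential and the parameter n are illustrative, not from the slides):

// Sequential vector addition: every iteration is independent of the others,
// which is exactly the data parallelism a CUDA kernel can exploit.
void vecAddSequential(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];   // element i depends only on A[i] and B[i]
}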

Page 4:

CUDA/OpenCL – Execution Model

• Heterogeneous host+device application C program
• Serial parts in host C code
• Parallel parts in device SPMD kernel C code

[Figure: execution alternates between serial host code and parallel device kernels:
Serial Code (host) … Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host) … Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);]
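As a rough sketch of this alternation (KernelA, KernelB, nBlk, and nTid are the names from the slide; the kernel bodies and the run function here are illustrative placeholders only):

#include <cuda_runtime.h>

// Placeholder SPMD kernels; each thread touches one element of data,
// which is assumed to hold at least nBlk * nTid floats.
__global__ void KernelA(float *data) { data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
__global__ void KernelB(float *data) { data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

void run(float *d_data, int nBlk, int nTid)
{
    // ... serial host code ...
    KernelA<<< nBlk, nTid >>>(d_data);   // parallel part runs on the device
    cudaDeviceSynchronize();             // host waits, then continues serially
    // ... serial host code ...
    KernelB<<< nBlk, nTid >>>(d_data);
    cudaDeviceSynchronize();
}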

Page 5:

From Natural Language to Electrons

Natural Language (e.g., English)
Algorithm
High-Level Language (C/C++ …)
   ↓ Compiler
Instruction Set Architecture
Microarchitecture
Circuits
Electrons

© Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond

Page 6:

The ISA

• An Instruction Set Architecture (ISA) is a contract between the hardware and the software.
• As the name suggests, it is a set of instructions that the architecture (hardware) can execute.

Page 7:

A program at the ISA level

• A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.
• Program instructions operate on data stored in memory or provided by an Input/Output (I/O) device.

Page 8:

A Von-Neumann Processor

[Figure: a von Neumann processor: Memory and I/O connected to a Processing Unit (ALU and Register File) and a Control Unit (PC and IR).]

A thread is a “virtualized” or “abstracted” Von-Neumann Processor.

Page 9:

Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (SPMD)
– Each thread has indexes that it uses to compute memory addresses and make control decisions

[Figure: threads 0, 1, 2, …, 254, 255 of a grid each execute:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
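Wrapping the two lines above in a complete kernel gives a sketch like the following (the kernel name vecAddKernel and the bounds check against n are illustrative additions; the index computation is taken from the slide):

// Each thread computes one element of C. The bounds check guards the case
// where the grid contains more threads than there are vector elements.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // thread index → data index
    if (i < n)
        C[i] = A[i] + B[i];
}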

Page 10:

Thread Blocks: Scalable Cooperation

• Divide thread array into multiple blocks
• Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
• Threads in different blocks do not interact

[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1; each block contains threads 0 … 255, and every thread executes:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
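On the host side, the grid is sized so that the blocks together cover all n elements. A minimal launch sketch, assuming 256-thread blocks and device pointers d_A, d_B, d_C that were already allocated and copied to the device (these names, and vecAddKernel from the previous page, are illustrative):

// Launch enough 256-thread blocks to cover all n elements.
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // ceil(n / 256)
vecAddKernel<<< blocksPerGrid, threadsPerBlock >>>(d_A, d_B, d_C, n);
cudaDeviceSynchronize();   // wait for the kernel before using d_C on the host

Because blocks do not interact, the same kernel runs unchanged whether the grid holds one block or thousands of blocks.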

Page 11:

blockIdx and threadIdx

• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …

[Figure: a device grid of blocks (0,0), (0,1), (1,0), (1,1); Block (1,1) is expanded to show its threads, indexed Thread (0,0,0) … (0,0,3), (0,1,0) … (0,1,3), and (1,0,0) … (1,0,3).]
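For multidimensional data, the 2D index components map directly onto rows and columns. A minimal 2D sketch, assuming an image stored row-major with the given width and height (the kernel name scaleImage and its parameters are illustrative, not from the slides):

// Each thread handles one pixel; 2D blockIdx/threadIdx give the pixel coordinates.
__global__ void scaleImage(float *img, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x component → column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y component → row
    if (col < width && row < height)
        img[row * width + col] *= factor;              // row-major 2D → 1D mapping
}

A matching launch would use dim3 dimensions, e.g. dim3 block(16, 16); dim3 grid((width + 15) / 16, (height + 15) / 16); scaleImage<<< grid, block >>>(d_img, width, height, 2.0f); where d_img is an illustrative device pointer.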

Page 12:

To learn more, read Chapter 3.