
Page 1:

PARALLEL PROGRAMMING

MANY-CORE COMPUTING:

CUDA INTRODUCTION (3/5)

Rob van Nieuwpoort

[email protected]

Page 2:

Schedule

1. Introduction, performance metrics & analysis

2. Many-core hardware, low-level optimizations

3. GPU hardware and CUDA class 1: basics

4. CUDA class 2: advanced

5. Case study: LOFAR telescope with many-cores

Page 3:

GPU hardware introduction

Page 4:

It's all about the memory

Page 5:

Integration into host system

Typically PCI Express 2.0 x16

Theoretical speed: 8 GB/s

Protocol overhead brings this down to 6 GB/s

In reality: 4 – 6 GB/s

Version 3.0 is coming soon

Double bandwidth

Less protocol overhead

Page 6:

Lessons from Graphics Pipeline

Throughput is paramount

must paint every pixel within frame time

scalability

Create, run, & retire lots of threads very rapidly

measured 14.8 billion threads/s on an increment() kernel (see the sketch below)

Use multithreading to hide latency

1 stalled thread is OK if 100 are ready to run

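For scale, an increment() kernel of the kind measured above is tiny; a minimal sketch (the parameter names and the bounds check are assumptions, not the original benchmark code):

__global__ void increment(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] += 1; // one thread increments one element
}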

Page 7:

CPU vs GPU

Movie

The Mythbusters

Jamie Hyneman & Adam Savage

Discovery Channel

Appearance at NVIDIA’s NVISION 2008

Page 8:

Why is this different from a CPU?

Different goals produce different designs

GPU assumes work load is highly parallel

CPU must be good at everything, parallel or not

CPU: minimize latency experienced by 1 thread

big on-chip caches

sophisticated control logic

GPU: maximize throughput of all threads

# threads in flight limited by resources => lots of resources (registers, etc.)

multithreading can hide latency => skip the big caches

share control logic across many threads

Page 9:

Flynn’s taxonomy revisited

                      Single Data   Multiple Data
Single instruction    SISD          SIMD
Multiple instruction  MISD          MIMD

GPUs don’t fit!

Page 10:

Key architectural ideas

SIMT (Single Instruction Multiple Thread) execution

HW automatically handles divergence (see the sketch below)

Hardware multithreading

HW resource allocation & thread scheduling

HW relies on threads to hide latency

Context switching is (basically) free

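To illustrate the divergence point above, a minimal sketch (not from the slides): when threads of the same warp take different branches, the hardware serializes the two paths and re-converges afterwards, with no extra code from the programmer.

__global__ void divergent(int *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i % 2 == 0)
    out[i] = 2 * i; // even-numbered threads take this path
  else
    out[i] = -i;    // odd-numbered threads take this one; the warp runs both paths serially
}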

Page 11:

GPU hardware: ATI

Page 12:

CPU vs GPU Chip

AMD Magny-Cours (6 cores) ATI 4870 (800 cores)

Page 13:

Latest generation ATI

Northern Islands

1 chip: HD 6970

1536 cores

176 GB/sec memory bandwidth

2.7 tflops single, 675 gflops double precision

Maximum power: 250 Watts

299 euros!

2 chips: HD 6990

3072 cores, 5.1 tflops, 575 euros!

Comparison: the entire 72-node DAS-4 VU cluster has 4.4 tflops

Page 14:

ATI 5870 architecture overview

Page 15:

ATI 5870 SIMD engine

Each of the 20 SIMD engines has:

16 thread processors x 5 stream cores = 80 scalar stream processing units

20 * 16 * 5 = 1600 cores total

32KB Local Data Share

its own control logic and runs from a shared set of threads

a dedicated fetch unit with 8KB L1 cache

a 64KB global data share to communicate with other SIMD engines


Page 16:

ATI 5870 thread processor

Each thread processor includes:

4 stream cores + 1 special function stream core

general purpose registers

FMA in a single clock


Page 17:

ATI 5870 Memory Hierarchy

EDC (Error Detection Code)

CRC checks on data transfers for improved reliability at high clock speeds

Bandwidths

Up to 1 TB/sec L1 texture fetch bandwidth

Up to 435 GB/sec between L1 & L2

153.6 GB/s to device memory

PCI-e 2.0, 16x: 8GB/s to main memory


Page 18:

ATI programming models

Low-level: CAL (assembly)

High-level: Brook+

Originally developed at Stanford University

Streaming language

Performance is not great

Now: OpenCL


Page 19:

GPU Hardware: NVIDIA

Page 20:

Reading material

Reader:

NVIDIA’s Next Generation CUDA Compute Architecture: Fermi

Recommended further reading:

CUDA: Compute Unified Device Architecture

Page 21:

[Figure: Fermi die overview: a Host Interface and GigaThread Engine feed four GPCs, each containing a Raster Engine and four SMs (each SM paired with a Polymorph Engine), with a central L2 Cache and six memory controllers around the edge.]

Fermi

Consumer: GTX 480, 580

GPGPU: Tesla C2050

More memory, ECC

1.0 teraflops single precision

515 gigaflops double precision

Up to 16 streaming multiprocessors (SMs)

GTX 580: 16

GTX 480: 15

C2050: 14

SMs are independent

Page 22:

[Figure: the same Fermi die overview as on page 21.]

Fermi Streaming Multiprocessor (SM)

32 cores per SM (512 cores total)

64KB configurable L1 cache / shared memory

32,768 32-bit registers


Page 23:

CUDA Core Architecture

Decoupled floating-point and integer data paths

Double-precision throughput is 50% of single precision

Integer operations optimized for extended precision

64-bit and wider data element sizes

Predication field for all instructions

Fused multiply-add (FMA)

Page 24:

Memory Hierarchy

Configurable L1 cache per SM

16KB L1 cache / 48KB Shared

48KB L1 cache / 16KB Shared

Shared 768KB L2 cache
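The L1/shared split can be selected per kernel through the CUDA runtime; a minimal sketch (my_kernel is a placeholder name):

// request the 16KB L1 / 48KB shared split for this kernel;
// cudaFuncCachePreferL1 requests the opposite split
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);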

[Figure: memory hierarchy: registers → L1 cache / shared memory → L2 cache → device memory, with host memory attached over the PCI-e bus.]

Page 25:

Multiple Memory Scopes

[Figure: per-thread local memory (one per thread), per-SM shared memory (one per block), and per-device global memory shared by all kernels.]

Per-thread private memory

Each thread has its own local memory

Stacks, other private data

Per-SM shared memory

Small memory close to the processor, low latency

Device memory

GPU frame buffer

Can be accessed by any thread in any SM

Page 26:

Unified Load/Store Addressing

[Figure: in the non-unified address space, local, shared, and device memory are separate 32-bit spaces reached through distinct pointers *p_local, *p_shared, *p_device; in the unified address space, all three live in one 40-bit space reached through a single pointer *p.]

Page 27:

Atomic Operations

Device memory is not coherent!

Share data between streaming multiprocessors

Read / Modify / Write

Fermi increases atomic performance by 5x to 20x
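As an example of the read / modify / write pattern, a histogram sketch using atomicAdd (the names and the 256-bin assumption are illustrative):

__global__ void histogram(int *bins, const unsigned char *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(&bins[data[i]], 1); // atomic read/modify/write in device memory
}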


Page 28:

ECC (Error-Correcting Code)

All major internal memories are ECC protected

Register file, L1 cache, L2 cache

DRAM protected by ECC (on Tesla only)

ECC is a must have for many computing applications


Page 29:

NVIDIA GPUs become more generic

Expand performance sweet spot of the GPU

Caching

Concurrent kernels

Double precision floating point

C++

Full integration in modern software development environment

Debugging

Profiling

Bring more users, more applications to the GPU


Page 30:

Programming NVIDIA GPUs

Page 31:

CUDA

CUDA: Scalable parallel programming

C/C++ extensions

Provide straightforward mapping onto hardware

Good fit to GPU architecture

Maps well to multi-core CPUs too

Scale to 1000s of cores & 100,000s of threads

GPU threads are lightweight — create / switch is free

GPU needs 1000s of threads for full utilization


Page 32:

Parallel Abstractions in CUDA

Hierarchy of concurrent threads

Lightweight synchronization primitives

Shared memory model for cooperating threads


Page 33:

Hierarchy of concurrent threads

Parallel kernels composed of many threads

All threads execute the same sequential program, called the kernel

Threads are grouped into thread blocks

Threads in the same block can cooperate

Threads in different blocks cannot!

All thread blocks are organized in a Grid

Threads/blocks have unique IDs

[Figure: a single thread t; a block b of threads t0 t1 … tB]

Page 34:

Grids, Thread Blocks and Threads

[Figure: a grid of 2×3 thread blocks; each block holds a 3×4 array of threads indexed (row, column).]

Page 35:

CUDA Model of Parallelism

CUDA virtualizes the physical hardware

Devices have

Different numbers of SMs

Different compute capabilities (Fermi = 2.0)

block is a virtualized streaming multiprocessor (threads, shared memory)

thread is a virtualized scalar processor (registers, PC, state)

Scheduled onto physical hardware without pre-emption

threads/blocks launch & run to completion

blocks should be independent

[Figure: several blocks, each with its own shared memory, on top of a common device memory.]

Page 36:

Hardware Memory Spaces in CUDA

[Figure: the host accesses device memory and constant memory; within the grid, each block has its own shared memory and each thread its own registers.]

Page 37:

Device Memory

CPU and GPU have separate memory spaces

Data is moved across PCI-e bus

Use functions to allocate/set/copy memory on GPU

Very similar to corresponding C functions

Pointers are just addresses

Can’t tell from the pointer value whether the address is on the CPU or GPU

Must exercise care when dereferencing:

Dereferencing a CPU pointer on the GPU will likely crash, and vice versa


Page 38:

Additional memories

Textures

Read-only

Data resides in device memory

Different read path, includes specialized caches

Constant memory

Data resides in device memory

Manually managed

Small (e.g., 64KB)

Use when all threads in a block read the same address

Serializes otherwise
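A minimal sketch of constant memory in use (the symbol coeffs and its size are illustrative):

__constant__ float coeffs[16]; // resides in constant memory

__global__ void scale(float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  data[i] *= coeffs[0]; // all threads read the same address: broadcast, no serialization
}

void setup_coeffs(const float *h_coeffs) {
  // the host fills the constant buffer before launching the kernel
  cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}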

Page 39:

GPU Memory Allocation / Release

Host (CPU) manages device (GPU) memory:

cudaMalloc(void **pointer, size_t nbytes)

cudaMemset(void *pointer, int val, size_t count)

cudaFree(void* pointer)

int n = 1024;
int nbytes = n * sizeof(int);
int* data = 0;
cudaMalloc((void**)&data, nbytes); // allocate device memory
cudaMemset(data, 0, nbytes);       // zero it
cudaFree(data);                    // release it

Page 40:

Data Copies

cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);

returns after the copy is complete

blocks CPU thread until all bytes have been copied

doesn’t start copying until previous CUDA calls complete

enum cudaMemcpyKind

cudaMemcpyHostToDevice

cudaMemcpyDeviceToHost

cudaMemcpyDeviceToDevice

Non-blocking copies are also available

DMA transfers, overlap computation and communication
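A sketch of the non-blocking variant (the stream, buffer names, and kernel are assumptions; hostA should be allocated with cudaMallocHost, since truly asynchronous copies require pinned host memory):

cudaStream_t stream;
cudaStreamCreate(&stream);

// returns immediately; copy and kernel are queued in order on the same stream
cudaMemcpyAsync(deviceA, hostA, nbytes, cudaMemcpyHostToDevice, stream);
my_kernel<<<blocks, threads, 0, stream>>>(deviceA);

cudaStreamSynchronize(stream); // wait until both have finished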


Page 41:

CUDA Variable Type Qualifiers

Variable declaration            Memory    Scope    Lifetime
int var;                        register  thread   thread
int array_var[10];              local     thread   thread
__shared__ int shared_var;      shared    block    block
__device__ int global_var;      device    grid     application
__constant__ int constant_var;  constant  grid     application
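For context, a short sketch showing each qualifier in place (all names are illustrative):

__device__   int global_var;   // device memory, grid scope, application lifetime
__constant__ int constant_var; // constant memory, grid scope, application lifetime

__global__ void example(void) {
  int var;                   // register, private to each thread
  int array_var[10];         // local memory, private to each thread
  __shared__ int shared_var; // shared memory, one copy per block
}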

Page 42:

C for CUDA

Philosophy: provide a minimal set of extensions necessary

Function qualifiers:

__global__ void my_kernel() { }
__device__ float my_device_func() { }

Execution configuration:

dim3 grid_dim(100, 50);  // 5000 thread blocks
dim3 block_dim(4, 8, 8); // 256 threads per block (1.3M threads total)
my_kernel <<< grid_dim, block_dim >>> (...); // launch kernel

Built-in variables and functions valid in device code:

dim3 gridDim;   // grid dimension
dim3 blockDim;  // block dimension
dim3 blockIdx;  // block index
dim3 threadIdx; // thread index
void __syncthreads(); // thread synchronization

Page 43:

Calculating the global thread index

[Figure: a grid of three thread blocks, each containing threads 0 1 2 3; blockDim.x is the width of one block.]

"global" thread index:

blockDim.x * blockIdx.x + threadIdx.x;

Page 44:

Calculating the global thread index

"global" thread index:

blockDim.x * blockIdx.x + threadIdx.x;

Example: thread 1 of block 2, with blockDim.x = 4: 4 * 2 + 1 = 9

[Figure: the same three thread blocks as on the previous slide.]

Page 45:

Vector add

void vector_add(int size, float* a, float* b, float* c) {
  for(int i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
  }
}

Page 46:

Vector addition GPU code

// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}

int main() {
  // initialization code here ...

  // launch N/256 blocks of 256 threads each (assumes N is a multiple of 256)
  vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);

  // cleanup code here ...
}

(GPU code and host code can be in the same file)

Page 47:

Vector addition host code

int main(int argc, char** argv) {
  float *hostA, *deviceA, *hostB, *deviceB, *hostC, *deviceC;
  int size = N * sizeof(float);

  // allocate host memory
  hostA = (float*)malloc(size);
  hostB = (float*)malloc(size);
  hostC = (float*)malloc(size);

  // initialize A, B arrays here...

  // allocate device memory
  cudaMalloc((void**)&deviceA, size);
  cudaMalloc((void**)&deviceB, size);
  cudaMalloc((void**)&deviceC, size);

Page 48:

Vector addition host code

  // transfer the data from the host to the device
  cudaMemcpy(deviceA, hostA, size, cudaMemcpyHostToDevice);
  cudaMemcpy(deviceB, hostB, size, cudaMemcpyHostToDevice);

  // launch N/256 blocks of 256 threads each
  vector_add<<<N/256, 256>>>(deviceA, deviceB, deviceC);

  // transfer the result back from the GPU to the host
  cudaMemcpy(hostC, deviceC, size, cudaMemcpyDeviceToHost);

  // cleanup: cudaFree the device buffers and free the host buffers here...
}


Page 49:

CUDA shared memory

Page 50:

Using shared memory

// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input) {
  // compute this thread's global index
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  if(i > 0) {
    // each thread loads two elements from device memory
    int x_i = input[i];
    int x_i_minus_one = input[i-1];
    result[i] = x_i - x_i_minus_one;
  }
}

Page 51:

(same code as on the previous slide)

Page 52:

(same code again) The next thread also reads input[i]: adjacent threads fetch overlapping elements, so each input element is read from device memory twice.

Page 53:

Using shared memory

// BLOCK_SIZE must match the block size used at launch, e.g. #define BLOCK_SIZE 256
__global__ void adj_diff(int *result, int *input) {
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  __shared__ int s_data[BLOCK_SIZE]; // shared, 1 elt / thread

  // each thread reads 1 device memory elt, stores it in s_data
  s_data[threadIdx.x] = input[i];

  // avoid race condition: ensure all loads are complete
  __syncthreads();

  if(threadIdx.x > 0) {
    result[i] = s_data[threadIdx.x] - s_data[threadIdx.x-1];
  } else if(i > 0) {
    // handle thread block boundary
    result[i] = s_data[threadIdx.x] - input[i-1];
  }
}

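A matching launch might look as follows (a sketch: BLOCK_SIZE, n, and the device pointers are assumptions, with n a multiple of BLOCK_SIZE):

#define BLOCK_SIZE 256

// one thread block per BLOCK_SIZE-element subset
adj_diff<<< n / BLOCK_SIZE, BLOCK_SIZE >>>(d_result, d_input);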

Page 54:

A Common Programming Strategy

Partition data into subsets that fit into shared memory

Page 55:

A Common Programming Strategy

Handle each data subset with one thread block


Page 56:

A Common Programming Strategy

Load the subset from device memory to shared memory, using multiple threads to exploit memory-level parallelism

Page 57:

A Common Programming Strategy

Perform the computation on the subset from shared memory

Page 58:

A Common Programming Strategy

Copy the result from shared memory back to device memory
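Putting the five steps above together: a minimal sketch of the pattern, assuming a hypothetical kernel that reverses each BLOCK_SIZE-element subset of the input.

#define BLOCK_SIZE 256

__global__ void reverse_subsets(float *out, const float *in) {
  __shared__ float tile[BLOCK_SIZE];   // one subset fits in shared memory
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

  tile[threadIdx.x] = in[i];           // load the subset with many threads
  __syncthreads();                     // ensure the whole tile is loaded

  // compute on the subset in shared memory, write the result to device memory
  out[i] = tile[blockDim.x - 1 - threadIdx.x];
}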