
CUDA - Parallel Computing on GPUs

Richard Membarth <richard.membarth@cs.fau.de>

Hardware-Software-Co-Design, University of Erlangen-Nuremberg

19.03.2009

Friedrich-Alexander University of Erlangen-Nuremberg, Richard Membarth


Outline

◮ Key concepts behind modern graphics architectures
◮ Understanding these allows you to:
  ◮ Understand the difference between GPU and CPU code
  ◮ Optimize compute kernels
  ◮ Get a feeling for what kind of tasks might benefit from GPUs

CPU vs. GPU Hardware

CPU

◮ Little space spent on execution units

◮ Optimized for latency

[Figure: a CPU core: Fetch/Decode, Out-of-Order Control Logic, Branch Predictor, Memory Pre-Fetcher, Data Cache, a single ALU, and an Execution Context]

CPU vs. GPU Hardware

GPU

◮ Remove components that help a single instruction stream run faster

[Figure: the CPU core from above with Out-of-Order Control Logic, Branch Predictor, Memory Pre-Fetcher, and Data Cache removed one by one ⇒ a simplified core with only Fetch/Decode, an ALU, and an Execution Context]

GPU Parallelism

◮ Shader program on one execution unit

[Figure: one simplified core (Fetch/Decode, ALU, Execution Context) running the shader program below]

    add r0.w, -r0.z, c1.w
    mul r4.xy, r4, r0.w
    cmp r0.w, r0.z, c1.z, c1.w
    mov r4.zw, c1
    cmp r1.w, -c0.x, r4.z, r4.w
    mul r0.w, r0.w, r1.w
    cmp r0.xy, -r0.w, r0, r4
    mul r4.xy, r0, r0
    mul r4.z, r0.y, r0.x
    mad r1.xyz, r1, c2.w, c2
    mov r0.z, c1.w

GPU Parallelism

◮ Use the same program for two, then four units

[Figure: two, then four simplified cores (Fetch/Decode, ALU, Execution Context), all running the same shader program as above]

GPU Parallelism

◮ Use the same program for even more units

◮ Moore's Law: GPU parallelism is doubling every year

[Figure: sixteen simplified cores (Fetch/Decode, ALU, Execution Context), all running the same program]

GPU Hardware - SIMD

◮ Use the same instruction unit for more than one stream

[Figure: one SIMD core: a single Fetch/Decode unit driving 8 ALUs, with a Shared Context holding 8 per-lane contexts (Ctx); the whole core is replicated 16 times across the chip]

    vec add r0.w, -r0.z, c1.w
    vec mul r4.xy, r4, r0.w
    vec cmp r0.w, r0.z, c1.z, c1.w
    vec mov r4.zw, c1
    vec cmp r1.w, -c0.x, r4.z, r4.w
    vec mul r0.w, r0.w, r1.w
    vec cmp r0.xy, -r0.w, r0, r4
    vec mul r4.xy, r0, r0
    vec mul r4.z, r0.y, r0.x
    vec mad r1.xyz, r1, c2.w, c2
    vec mov r0.z, c1.w

SIMD Processing

◮ Explicit vector instructions
  ◮ Intel/AMD/PPC SSE
  ◮ Cell B.E.
  ◮ Intel Larrabee

◮ Scalar instructions, implicit vectorization
  ◮ NVIDIA GPUs
  ◮ AMD GPUs
  ◮ SPMD-like concept: Single Instruction, Multiple Thread (SIMT)
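The implicit-vectorization model can be illustrated with a minimal CUDA kernel; this is a sketch, and the kernel name `saxpy` and its parameters are illustrative, not taken from the slides. The programmer writes scalar per-thread code; the hardware groups threads into warps that execute in lockstep on the SIMD ALUs, with no explicit vector types.

    // Hedged sketch: scalar code per thread, vectorized implicitly by the hardware.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        // Each thread handles one element; a warp of such threads
        // runs the same instruction simultaneously on the SIMD lanes.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

Contrast this with SSE, where the programmer must spell out 4-wide vector operations explicitly via intrinsics.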

SIMT

[Figure: eight ALU lanes stepping through the branch below over time; on a divergent branch, lanes taking the untaken side are masked, so both sides execute sequentially before the lanes reconverge at the unconditional code]

    if (diff > 0) {
        diff = a[x] - b[x];
        s = exp(-c * diff);
        d += s;
    } else {
        r = b[x];
        s = exp(-d * a[x]);
        d -= s;
    }
    <unconditional code>

◮ Divergent threads within a warp execute both branch paths with inactive lanes masked, reducing ALU utilization

Memory Latency

◮ No caches on chip

◮ Access to memory takes a few hundred clock cycles

◮ Threads often perform only a few instructions

◮ No other instruction can be scheduled due to data dependencies

⇒ Stalls!
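A rough back-of-the-envelope estimate (the concrete numbers here are illustrative assumptions, not from the slides): if a memory access stalls a thread group for about 400 cycles and a group has only about 20 cycles of independent arithmetic to issue, a core needs on the order of

\[
  N_{\text{groups}} \gtrsim \frac{t_{\text{mem}}}{t_{\text{compute}}}
  = \frac{400\ \text{cycles}}{20\ \text{cycles}} = 20
\]

resident groups to keep its ALUs busy while others wait on memory.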

Memory Stalls

[Figure: timeline of four thread groups (Threads 1-8, 9-16, 17-24, 25-32) on one SIMD core (Fetch/Decode, 8 ALUs, Shared Context); each group alternates between Runnable and Stall phases, and the core switches to the next runnable group whenever the current one stalls on memory]

GPU Throughput

[Figure: the same four-group timeline; each group finishes later (Done!), but the core is kept busy throughout because a runnable group always covers another group's stall]

◮ Increase run-time of one group, maximize throughput of many groups

◮ GPU: high latency, high throughput

Context Storage

◮ Large context storage

◮ At most 24 contexts

⇒ Scalability

[Figure: one core with Fetch/Decode, 8 ALUs, and a Shared Context partitioned into per-group contexts (Ctx)]

Blocking

◮ One core can execute many contexts in parallel

◮ Data has to be partitioned into independent parts

◮ One core works on such independent blocks

Blocking

◮ User defines mapping of blocks to data

◮ Same kernel is executed on all blocks

◮ Blocks must be processed independently

◮ Scheduling of blocks is done by hardware

⇒ Thousands of threads in parallel
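The block-to-data mapping can be sketched in CUDA; the kernel name `brighten`, the tile size, and the parameters below are made up for this example. Each block independently processes one 16×16 tile of an image, and the hardware schedules blocks onto cores in any order.

    // Illustrative sketch: one thread per pixel, one block per 16x16 tile.
    __global__ void brighten(float *img, int width, int height, float offset) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)        // guard partial tiles at the border
            img[y * width + x] += offset;   // blocks are fully independent
    }

    // Host side: the user chooses the mapping of blocks to data;
    // the hardware decides when and where each block runs.
    // dim3 threads(16, 16);
    // dim3 blocks((width + 15) / 16, (height + 15) / 16);
    // brighten<<<blocks, threads>>>(img, width, height, 0.1f);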

GPU Performance

◮ 16 cores

◮ 8 ALUs per core

◮ 128 ALUs

◮ 1 Mul-Add per cycle ⇒ 256 GFLOPs (@ 1 GHz)

[Figure: chip overview with Work Distributor, Input Assembly, Rasterizer, Output Blend, Video Decode, texture units (Tex), and DRAM surrounding the 16 cores]

(Tesla C1060: 30 cores, 1.3 GHz ⇒ 624 GFLOPS)
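The peak numbers follow from counting a fused multiply-add as two floating-point operations per cycle:

\[
16\ \text{cores} \times 8\ \tfrac{\text{ALUs}}{\text{core}} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \times 1\,\text{GHz} = 256\ \text{GFLOPs}
\]
\[
30\ \text{cores} \times 8\ \tfrac{\text{ALUs}}{\text{core}} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \times 1.3\,\text{GHz} = 624\ \text{GFLOPs}
\]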

Summary

◮ Remove components that help a single instruction stream
◮ Use multiple units in parallel
◮ Use SIMT
◮ Schedule multiple blocks (hide latency)

[Figure: the evolution from a complex CPU core (Fetch/Decode, Out-of-Order Control Logic, Branch Predictor, Memory Pre-Fetcher, Cache, one ALU) to a GPU chip of many simple SIMD cores with Work Distributor, Input Assembly, Rasterizer, Output Blend, Video Decode, and DRAM]

Questions?

[Photo: Krakow, Pontifical Residency (courtesy of Robert Grimm)]