TRANSCRIPT
CUDA - Parallel Computing on GPUs
Richard Membarth, [email protected]
Hardware-Software-Co-Design, University of Erlangen-Nuremberg
19.03.2009
Friedrich-Alexander University of Erlangen-Nuremberg, Richard Membarth
Outline
◮ Key concepts behind modern graphics architectures
◮ Understanding these allows you to
  ◮ Understand the difference between GPU and CPU code
  ◮ Optimize compute kernels
  ◮ Get a feeling for what kind of tasks might benefit from GPUs
CPU vs. GPU Hardware
CPU
◮ Little space spent on execution units
◮ Optimized for latency
[Diagram: CPU core, mostly control hardware: Fetch/Decode, Out-of-Order Control Logic, Branch Predictor, Memory Pre-Fetcher, Data Cache, a single ALU, and the Execution Context]
CPU vs. GPU Hardware

GPU

◮ Remove components that help a single instruction stream run faster

[Diagram: the CPU core with Out-of-Order Control Logic, Branch Predictor, Memory Pre-Fetcher, and Data Cache progressively stripped away ⇒ a simplified core with only Fetch/Decode, ALU, and Execution Context]
GPU Parallelism

◮ Run the shader program on one execution unit
◮ Use the same program for two units
◮ Use the same program for more units
◮ Use the same program for even more units
◮ Moore's Law: GPU parallelism is doubling every year

[Diagram: the simplified core (Fetch/Decode, ALU, Execution Context) replicated 16 times]

Shader program:

add r0.w, -r0.z, c1.w
mul r4.xy, r4, r0.w
cmp r0.w, r0.z, c1.z, c1.w
mov r4.zw, c1
cmp r1.w, -c0.x, r4.z, r4.w
mul r0.w, r0.w, r1.w
cmp r0.xy, -r0.w, r0, r4
mul r4.xy, r0, r0
mul r4.z, r0.y, r0.x
mad r1.xyz, r1, c2.w, c2
mov r0.z, c1.w
GPU Hardware - SIMD

◮ Use the same instruction unit for more than one stream

[Diagram: one SIMD core: a single Fetch/Decode unit feeding 8 ALUs, with a Shared Context holding 8 per-stream contexts; the core is replicated 16 times across the chip]

Vectorized shader program:

vec add r0.w, -r0.z, c1.w
vec mul r4.xy, r4, r0.w
vec cmp r0.w, r0.z, c1.z, c1.w
vec mov r4.zw, c1
vec cmp r1.w, -c0.x, r4.z, r4.w
vec mul r0.w, r0.w, r1.w
vec cmp r0.xy, -r0.w, r0, r4
vec mul r4.xy, r0, r0
vec mul r4.z, r0.y, r0.x
vec mad r1.xyz, r1, c2.w, c2
vec mov r0.z, c1.w
SIMD Processing

◮ Explicit vector instructions
  ◮ Intel/AMD/PPC SSE
  ◮ Cell B.E.
  ◮ Intel Larrabee
◮ Scalar instructions, implicit vectorization
  ◮ NVIDIA GPUs
  ◮ AMD GPUs
  ◮ SPMD-like concept: Single Instruction, Multiple Thread (SIMT)
SIMT

[Diagram: 8 ALUs stepping through the branch below in lockstep over time; on the divergent branch, ALUs whose condition does not hold are masked off while the other path executes]

diff = a[x] - b[x];
if (diff > 0) {
  s = exp(-c * diff);
  d += s;
} else {
  r = b[x];
  s = exp(-d * a[x]);
  d -= s;
}
<unconditional code>
Memory Latency

◮ No caches on chip
◮ A memory access takes a few hundred clock cycles
◮ Threads often execute only a few instructions between memory accesses
◮ No other instruction of the same thread can be scheduled due to data dependencies
⇒ Stalls!
Memory Stalls

[Diagram: one SIMD core interleaving four thread groups (Threads 1-8, 9-16, 17-24, 25-32) over time; whenever the running group stalls on memory, a runnable group is switched in]
GPU Throughput

[Diagram: the four interleaved thread groups; each alternates between Stall and Runnable until all four finish (Done!)]

◮ Interleaving increases the run time of one group, but maximizes the throughput of many groups
◮ GPU: high latency, high throughput
Context storage

◮ Large context storage
◮ At most 24 contexts
⇒ Scalability

[Diagram: SIMD core with Fetch/Decode, 8 ALUs, and the Shared Context divided into individual contexts]
Blocking

◮ One core can execute many contexts in parallel
◮ Data has to be partitioned into independent parts
◮ One core works on such independent blocks

Blocking

◮ The user defines the mapping of blocks to data
◮ The same kernel is executed on all blocks
◮ Blocks must be processed independently
◮ Scheduling of blocks is done by hardware
⇒ Thousands of threads in parallel
GPU Performance

◮ 16 cores
◮ 8 ALUs per core
◮ 128 ALUs
◮ 1 Mul-Add per cycle ⇒ 256 GFLOPS (@ 1 GHz)

[Diagram: full chip with 16 cores, Work Distributor, texture units (Tex), Input Assembly, Rasterizer, Output Blend, Video Decode, and DRAM interfaces]

(Tesla C1060: 30 cores, 1.3 GHz ⇒ 624 GFLOPS)
Summary

◮ Remove components that help a single instruction stream
◮ Use multiple units in parallel
◮ Use SIMT
◮ Schedule multiple blocks (hide latency)

[Diagram: the full GPU chip next to the single CPU core from the beginning]
Questions?
Krakow, Pontifical Residency (photo courtesy of Robert Grimm)