GPU and the Brick Wall


Page 1: Gpu and The Brick Wall

Graphics Processing Unit (GPU) Architecture and Programming

TU/e 5kk73
Zhenyu Ye
Bart Mesman
Henk Corporaal

2010-11-08

Page 2: Gpu and The Brick Wall

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends


Page 4: Gpu and The Brick Wall

System Architecture

Page 5: Gpu and The Brick Wall

GPU Architecture
NVIDIA Fermi: 512 Processing Elements (PEs)

Page 6: Gpu and The Brick Wall

What Can It Do?
Render triangles.

NVIDIA GTX480 can render 1.6 billion triangles per second!

Page 7: Gpu and The Brick Wall

General-Purpose Computing

ref: http://www.nvidia.com/object/tesla_computing_solutions.html

Page 8: Gpu and The Brick Wall

The Vision of NVIDIA"Within the next few years, there will be single-chip graphics

devices more powerful and versatile than any graphics system that has ever been built, at any price." 

-- David Kirk, NVIDIA, 1998

Page 9: Gpu and The Brick Wall

Single-Chip GPU vs. Fastest Supercomputers

ref: http://www.llnl.gov/str/JanFeb05/Seager.html

Page 10: Gpu and The Brick Wall

Top500 Super Computer in June 2010

Page 11: Gpu and The Brick Wall

GPU Will Top the List in Nov 2010

Page 12: Gpu and The Brick Wall

The Gap Between CPU and GPU

ref: Tesla GPU Computing Brochure

Page 13: Gpu and The Brick Wall

GPU Has 10x Comp Density

Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.

Page 14: Gpu and The Brick Wall

Evolution of Intel Pentium

Pentium I, Pentium II, Pentium III, Pentium IV

Chip area breakdown

Q: What can you observe? Why?

Page 15: Gpu and The Brick Wall

Extrapolation of Single-Core CPU
If we extrapolate the trend, in a few generations the Pentium will look like:

Of course, we know it did not happen. 

Q: What happened instead? Why?

Page 16: Gpu and The Brick Wall

Evolution of Multi-core CPUs

Penryn, Bloomfield, Gulftown, Beckton

Chip area breakdown

Q: What can you observe? Why?

Page 17: Gpu and The Brick Wall

Let's Take a Closer Look

Less than 10% of the total chip area is used for actual execution.

Q: Why?

Page 18: Gpu and The Brick Wall

The Memory Hierarchy

Notes on energy at 45nm:
64-bit Int ADD takes about 1 pJ.
64-bit FP FMA takes about 200 pJ.

It seems we cannot further increase the computational density.

Page 19: Gpu and The Brick Wall

The Brick Wall -- UC Berkeley's View

Power Wall: power expensive, transistors free
Memory Wall: memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link

Page 20: Gpu and The Brick Wall

The Brick Wall -- UC Berkeley's View

Power Wall: power expensive, transistors free
Memory Wall: memory slow, multiplies fast
ILP Wall: diminishing returns on more ILP HW

Power Wall + Memory Wall + ILP Wall = Brick Wall

David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link

Page 21: Gpu and The Brick Wall

How to Break the Brick Wall?

Hint: how to exploit the parallelism inside the application?

Page 22: Gpu and The Brick Wall

Step 1: Trade Latency with Throughput

Hide the memory latency through fine-grained interleaved threading.

Page 23: Gpu and The Brick Wall

Interleaved Multi-threading

Page 24: Gpu and The Brick Wall

Interleaved Multi-threading

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Page 25: Gpu and The Brick Wall

Interleaved Multi-threading

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: ?
Cons: ?

Page 26: Gpu and The Brick Wall

Interleaved Multi-threading

The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency

Fine-grained interleaved multi-threading:
Pros: remove branch predictor, OOO scheduler, large cache
Cons: register pressure, etc.

Page 27: Gpu and The Brick Wall

Fine-Grained Interleaved Threading

Pros: reduce cache size, no branch predictor, no OOO scheduler

Cons: register pressure, thread scheduler, requires huge parallelism

Without and with fine-grained interleaved threading

Page 28: Gpu and The Brick Wall

HW Support
The register file supports zero-overhead context switching between interleaved threads.

Page 29: Gpu and The Brick Wall

Can We Make Further Improvement?

Reducing the large cache gives 2x computational density.

Q: Can we make further improvements?

Hint: We have only utilized thread-level parallelism (TLP) so far.

Page 30: Gpu and The Brick Wall

Step 2: Single Instruction Multiple Data

SSE has 4 data lanes. GPUs have 8/16/24/... data lanes.

GPU uses wide SIMD: 8/16/24/... processing elements (PEs).
CPU uses short SIMD: usually a vector width of 4.

Page 31: Gpu and The Brick Wall

Hardware Support
Supporting interleaved threading + SIMD execution

Page 32: Gpu and The Brick Wall

Single Instruction Multiple Thread (SIMT)

Hide vector width using scalar threads.

Page 33: Gpu and The Brick Wall

Example of SIMT Execution
Assume 32 threads are grouped into one warp, as sketched below.
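For concreteness, a minimal CUDA sketch of how a scalar thread can recover its warp and lane position, assuming the 32-thread warps of Fermi (the function and buffer names are illustrative):

    __global__ void whichWarp(int *warpOf, int *laneOf) {
        int tid = threadIdx.x;       // scalar thread index within the block
        warpOf[tid] = tid / 32;      // consecutive groups of 32 threads form a warp
        laneOf[tid] = tid % 32;      // lane within the warp (0..31), i.e. the SIMD slot
    }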

Page 34: Gpu and The Brick Wall

Step 3: Simple Core

The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.

Lightweight PE: Fused Multiply-Add (FMA)

SFU: Special Function Unit

Page 35: Gpu and The Brick Wall

NVIDIA's Motivation of Simple Core

"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train."

--Bill Dally, NVIDIA

Page 36: Gpu and The Brick Wall

Review: How Did We Reach Here?
NVIDIA Fermi, 512 Processing Elements (PEs)

Page 37: Gpu and The Brick Wall

Throughput Oriented Architectures

1. Fine-grained interleaved threading (~2x comp density)
2. SIMD/SIMT (>10x comp density)
3. Simple core (~2x comp density)

Key architectural features of a throughput-oriented processor.

ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. (link)

Page 38: Gpu and The Brick Wall

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 39: Gpu and The Brick Wall

CUDA Programming
Massive number (>10000) of light-weight threads.

Page 40: Gpu and The Brick Wall

Express Data Parallelism in Threads 

Compare thread program with vector program.

Page 41: Gpu and The Brick Wall

Vector Program

Scalar program:

float A[4][8];
do-all(i=0;i<4;i++){
    do-all(j=0;j<8;j++){
        A[i][j]++;
    }
}

Vector program (vector width of 8)

float A[4][8];
do-all(i=0;i<4;i++){
    movups xmm0, [ &A[i][0] ]
    incps xmm0
    movups [ &A[i][0] ], xmm0
}

Vector width is exposed to programmers.
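Note that incps is illustrative pseudo-assembly: SSE has no packed-increment instruction. A real host-side C version using SSE intrinsics, where the hardware vector width of 4 is explicit, might look like this:

    #include <xmmintrin.h>                          // SSE intrinsics

    void incRows(float A[4][8]) {
        __m128 ones = _mm_set1_ps(1.0f);            // broadcast 1.0f to all 4 lanes
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 8; j += 4) {        // step by the vector width of 4
                __m128 v = _mm_loadu_ps(&A[i][j]);  // movups: unaligned load
                v = _mm_add_ps(v, ones);            // packed add stands in for "incps"
                _mm_storeu_ps(&A[i][j], v);         // movups: unaligned store
            }
        }
    }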

Page 42: Gpu and The Brick Wall

CUDA Program

Scalar program:

float A[4][8];
do-all(i=0;i<4;i++){
    do-all(j=0;j<8;j++){
        A[i][j]++;
    }
}

CUDA program

float A[4][8];

kernelF<<<dim3(4,1), dim3(8,1)>>>(A);

__global__ void kernelF(float (*A)[8]){
    int i = blockIdx.x;
    int j = threadIdx.x;
    A[i][j]++;
}

• CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP).

• Hardware converts TLP into DLP at run time.
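The slide omits the host side and memory transfers. A complete, runnable version might look like the sketch below; the flat indexing and the cudaMalloc/cudaMemcpy boilerplate are additions, not part of the slide:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernelF(float *A) {
        int i = blockIdx.x;                          // block index selects the row
        int j = threadIdx.x;                         // thread index selects the column
        A[i * 8 + j]++;                              // each thread touches one element
    }

    int main() {
        float hA[4][8] = {};                         // host array, zero-initialized
        float *dA;
        cudaMalloc(&dA, sizeof(hA));                 // device copy in global memory
        cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
        kernelF<<<dim3(4,1), dim3(8,1)>>>(dA);       // 4 blocks of 8 threads
        cudaMemcpy(hA, dA, sizeof(hA), cudaMemcpyDeviceToHost);
        cudaFree(dA);
        printf("A[3][7] = %.1f\n", hA[3][7]);        // prints 1.0
        return 0;
    }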

Page 43: Gpu and The Brick Wall

Two Levels of Thread Hierarchy

kernelF<<<dim3(4,1), dim3(8,1)>>>(A);

__global__ void kernelF(float (*A)[8]){
    int i = blockIdx.x;
    int j = threadIdx.x;
    A[i][j]++;
}

Page 44: Gpu and The Brick Wall

Multi-dimension Thread and Block ID

kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

__global__ void kernelF(float (*A)[8]){
    int i = gridDim.x  * blockIdx.y  + blockIdx.x;
    int j = blockDim.x * threadIdx.y + threadIdx.x;
    A[i][j]++;
}

Both the grid and the thread block can have a two-dimensional index, as the launch sketch below shows.
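A sketch of the launch side. Note that CUDA's built-ins are gridDim and blockDim; there is no threadDim, which is why the kernel above flattens indices the way it does:

    dim3 grid(2, 2);              // 2x2 = 4 thread blocks
    dim3 block(4, 2);             // 4x2 = 8 threads per block
    kernelF<<<grid, block>>>(A);
    // inside the kernel:
    //   i = gridDim.x  * blockIdx.y  + blockIdx.x;    // flattened block id, 0..3
    //   j = blockDim.x * threadIdx.y + threadIdx.x;   // flattened thread id, 0..7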

Page 45: Gpu and The Brick Wall

Scheduling Thread Blocks on SM
Example: scheduling 4 thread blocks on 3 SMs.

Page 46: Gpu and The Brick Wall

Executing Thread Block on SM

Executed on machine with width of 4:

Executed on machine with width of 8:

Note: the number of Processing Elements (PEs) is transparent to the programmer.

kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

__global__ void kernelF(float (*A)[8]){
    int i = gridDim.x  * blockIdx.y  + blockIdx.x;
    int j = blockDim.x * threadIdx.y + threadIdx.x;
    A[i][j]++;
}

Page 47: Gpu and The Brick Wall

Multiple Levels of Memory Hierarchy

Name      Cached?   Latency (cycles)        Access
Global    L1/L2     200~400 (cache miss)    R/W
Shared    No        1~3                     R/W
Constant  Yes       1~3                     Read-only
Texture   Yes       ~100                    Read-only
Local     L1/L2     200~400 (cache miss)    R/W
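A minimal sketch of how these memory spaces appear in CUDA source, assuming a 16x16 thread block (the names are illustrative):

    __constant__ float coeff[9];            // constant memory: cached, read-only in kernels

    __global__ void memSpaces(float *g) {   // g points to global memory (off-chip DRAM)
        __shared__ float tile[16][16];      // shared memory: on-chip, per thread block
        int i = threadIdx.y, j = threadIdx.x;
        float r = g[i * 16 + j];            // r lives in a register (local mem if spilled)
        tile[i][j] = r * coeff[0];
        __syncthreads();                    // make shared-memory writes visible block-wide
        g[i * 16 + j] = tile[i][j];
    }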

Page 48: Gpu and The Brick Wall

Explicit Management of Shared Mem
Shared memory is frequently used to exploit locality.

Page 49: Gpu and The Brick Wall

Shared Memory and Synchronization

kernelF<<<dim3(1,1), dim3(16,16)>>>(A);

__global__ void kernelF(float (*A)[16]){
    __shared__ float smem[16][16]; //allocate smem
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter with 3x3 window

3x3 window on image

Image data in DRAM

Page 50: Gpu and The Brick Wall

Shared Memory and Synchronization

kernelF<<<dim3(1,1), dim3(16,16)>>>(A);

__global__ void kernelF(float (*A)[16]){
    __shared__ float smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    __syncthreads(); // thread waits at barrier
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter over 3x3 window

3x3 window on image

Stage data in shared mem

Page 51: Gpu and The Brick Wall

Shared Memory and Synchronization

kernelF<<<dim3(1,1), dim3(16,16)>>>(A);

__global__ void kernelF(float (*A)[16]){
    __shared__ float smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads(); // every thread is ready
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter over 3x3 window

3x3 window on image

all threads finish the load

Page 52: Gpu and The Brick Wall

Shared Memory and Synchronization

kernelF<<<dim3(1,1), dim3(16,16)>>>(A);

__global__ void kernelF(float (*A)[16]){
    __shared__ float smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i][j];
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1]
              + smem[i-1][j]
              ...
              + smem[i+1][j+1] ) / 9;
}

Example: average filter over 3x3 window

3x3 window on image

Start computation
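The slide elides the middle terms of the 9-point sum and ignores the image borders (smem[i-1][j-1] is out of bounds for the first row and column). A complete single-block version, with border clamping and a separate output array as assumptions of this sketch, might look like:

    __device__ int clampi(int v, int lo, int hi) {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    __global__ void avgFilter(const float *A, float *B) {  // 16x16 image, one block
        __shared__ float smem[16][16];
        int i = threadIdx.y, j = threadIdx.x;
        smem[i][j] = A[i * 16 + j];       // stage the tile in shared memory
        __syncthreads();                  // wait until every thread has loaded
        float sum = 0.0f;
        for (int di = -1; di <= 1; di++)  // 3x3 window, clamped at the borders
            for (int dj = -1; dj <= 1; dj++)
                sum += smem[clampi(i + di, 0, 15)][clampi(j + dj, 0, 15)];
        B[i * 16 + j] = sum / 9.0f;
    }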

Page 53: Gpu and The Brick Wall

Programmers Think in Threads

Q: Why go through all this hassle?

Page 54: Gpu and The Brick Wall

Why Use Threads Instead of Vectors?

Thread pros:
• Portability. Machine width is transparent in the ISA.
• Productivity. Programmers do not need to care about the vector width of the machine.

Thread cons:
• Manual synchronization. Gives up lock-step execution within a vector.
• Thread scheduling can be inefficient.
• Debugging. "Threads considered harmful": thread programs are notoriously hard to debug.

Page 55: Gpu and The Brick Wall

Features of CUDA

• Programmers explicitly express DLP in terms of TLP.
• Programmers explicitly manage the memory hierarchy.
• etc.

Page 56: Gpu and The Brick Wall

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 57: Gpu and The Brick Wall

Micro-architecture
GF100 micro-architecture

Page 58: Gpu and The Brick Wall

HW Groups Threads Into Warps
Example: 32 threads per warp

Page 59: Gpu and The Brick Wall

Example of Implementation
Note: NVIDIA may use a more complicated implementation.

Page 60: Gpu and The Brick Wall

Example

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Assume warp 0 and warp 1 are scheduled for execution.

Page 61: Gpu and The Brick Wall

Read Src Op

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Read source operands:
r1 for warp 0
r4 for warp 1

Page 62: Gpu and The Brick Wall

Buffer Src Op

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Push ops to operand collector:
r1 for warp 0
r4 for warp 1

Page 63: Gpu and The Brick Wall

Read Src Op

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Read source operands:
r2 for warp 0
r5 for warp 1

Page 64: Gpu and The Brick Wall

Buffer Src Op

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Push ops to operand collector:
r2 for warp 0
r5 for warp 1

Page 65: Gpu and The Brick Wall

Execute

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Compute the first 16 threads in the warp.

Page 66: Gpu and The Brick Wall

Execute

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Compute the last 16 threads in the warp.

Page 67: Gpu and The Brick Wall

Write Back

Program (Address: Inst):
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5

Write back:
r0 for warp 0
r3 for warp 1

Page 68: Gpu and The Brick Wall

Other High-Performance GPUs

• ATI Radeon 5000 series.

Page 69: Gpu and The Brick Wall

ATI Radeon 5000 Series Architecture

Page 70: Gpu and The Brick Wall

Radeon SIMD Engine

• 16 Stream Cores (SC)
• Local Data Share

Page 71: Gpu and The Brick Wall

VLIW Stream Core (SC)

Page 72: Gpu and The Brick Wall

Local Data Share (LDS)

Page 73: Gpu and The Brick Wall

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 74: Gpu and The Brick Wall

Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity


Page 76: Gpu and The Brick Wall

Shared Mem Contains Multiple Banks

Page 77: Gpu and The Brick Wall

Compute Capability
Architecture information is needed to perform these optimizations.

ref: NVIDIA, "CUDA C Programming Guide", (link)
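The architecture information can also be queried at run time through the CUDA runtime API; a small host-side sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("registers per block: %d\n", prop.regsPerBlock);
        return 0;
    }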

Page 78: Gpu and The Brick Wall

Shared Memory (compute capability 2.x)

Without bank conflict:

With bank conflict:
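On compute capability 2.x, shared memory has 32 banks, and successive 32-bit words map to successive banks. A sketch of the classic conflict case (column access into a 32x32 array) and the usual padding fix, assuming a single warp of 32 threads:

    __global__ void bankDemo(float *out) {
        __shared__ float a[32][32];   // a[t][0]: all 32 addresses map to bank 0
        __shared__ float b[32][33];   // padding column spreads them over all 32 banks
        int t = threadIdx.x;          // assume a single warp of 32 threads
        a[t][0] = t;
        b[t][0] = t;
        __syncthreads();
        out[t] = a[t][0]              // 32-way bank conflict: accesses are serialized
               + b[t][0];             // conflict-free: one access per bank
    }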

Page 79: Gpu and The Brick Wall

Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory alignment and coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity

Page 80: Gpu and The Brick Wall

Global Memory in Off-Chip DRAM
Address space is interleaved among multiple channels.

Page 81: Gpu and The Brick Wall

Global Memory

Page 82: Gpu and The Brick Wall

Global Memory

Page 83: Gpu and The Brick Wall

Global Memory
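A sketch of the access patterns involved, assuming one warp of 32 consecutive threads: in the first kernel the warp's loads fall in consecutive words and coalesce into few memory transactions; in the second they are scattered by the stride:

    __global__ void coalesced(const float *in, float *out) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t];               // consecutive threads, consecutive addresses
    }

    __global__ void strided(const float *in, float *out, int stride) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = in[t * stride];      // e.g. stride = 32 scatters the warp's accesses
    }                                 // over many memory segments: poor bandwidth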

Page 84: Gpu and The Brick Wall

Roofline Model
Identify the performance bottleneck: computation bound vs. bandwidth bound.
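The roofline itself is a one-line formula: attainable Gflop/s = min(peak Gflop/s, operational intensity x bandwidth). A host-side sketch, taking GTX480-class figures (~1345 Gflop/s single precision, ~177 GB/s DRAM bandwidth) as assumptions:

    #include <cstdio>
    #include <algorithm>

    int main() {
        double peak = 1345.0, bw = 177.0;                 // assumed GTX480-class figures
        for (double oi = 0.25; oi <= 64.0; oi *= 2) {     // operational intensity, flop/byte
            double attainable = std::min(peak, oi * bw);  // the roofline
            printf("OI %6.2f flop/byte -> %7.1f Gflop/s (%s bound)\n",
                   oi, attainable, oi * bw < peak ? "bandwidth" : "compute");
        }
        return 0;
    }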

Page 85: Gpu and The Brick Wall

Optimization Is Key for Attainable GFlop/s

Page 86: Gpu and The Brick Wall

Computation, Bandwidth, Latency
Illustrating three bottlenecks in the Roofline model.

Page 87: Gpu and The Brick Wall

Today's Topics

• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends

Page 88: Gpu and The Brick Wall

Trends

Coming architectures:
• Intel's Larrabee successor: Many Integrated Core (MIC)
• CPU/GPU fusion: Intel Sandy Bridge, AMD Llano

Page 89: Gpu and The Brick Wall

Intel Many Integrated Core (MIC)
32-core version of MIC:

Page 90: Gpu and The Brick Wall

Intel Sandy Bridge

Highlights:
• Reconfigurable L3 shared between CPU and GPU
• Ring bus

Page 91: Gpu and The Brick Wall

Sandy Bridge's New CPU-GPU interface 

ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech, (link)

Page 92: Gpu and The Brick Wall

Sandy Bridge's New CPU-GPU interface 

ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech, (link)

Page 93: Gpu and The Brick Wall

AMD Llano Fusion APU (expected Q3 2011)

Notes:
• CPU and GPU do not share a cache?
• Unknown interface between CPU and GPU

Page 94: Gpu and The Brick Wall

GPU Research in ES Group

GPU research in the Electronic Systems group.
http://www.es.ele.tue.nl/~gpuattue/