Pervasive Massively Multithreaded GPU
Processors
Michael C. Shebanow
Sr. Arch Mgr, GPUs
ACM International Conference on Computing Frontiers 2009: Pervasive Massively Multithreaded GPU Processors
The “Real” Title
This talk is about SIMT Processors
The Past, Present, and a glimpse of the Future
The Past
Brief Chronology of GPUs at NVIDIA
[Timeline figure, 1993-2006: GPUs from 3dfx (Voodoo 1, Voodoo 2, Banshee, Voodoo 3), Gigapixel, and NVIDIA (NV1, NV2, NV3, NV3.5, NV4, NV5, NV10, NV11, NV15, NV17, NV20, NV25, NV2A, NV30, NV31, NV34, NV35, NV36, NV40, NV41, NV43, NV44, G70, G71, G72, G73, G80; also shown: Monet, R300, Merlot, Pinot).]
API milestones and flagship parts:
1997: DirectX 5: Riva 128
1998: DirectX 6, multitexturing: Riva TNT
1999: DirectX 7, T&L / TextureStageState: GeForce 256
2001: DirectX 8, SM 1.x: GeForce 3; Cg
2003: DirectX 9, SM 2.0: GeForce FX
2004: DirectX 9.0c, SM 3.0: GeForce 6
Landmark games: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3
Early NVIDIA GPUs (Precambrian Eon)
NV1 (1995): forward texturing
Traverse in texel space, generate pixels (vs. conventional "reverse" texturing, where pixel locations are sampled in texture space)
Quadratic patches: different from the DirectX polygon rendering approach
Integrated audio
Precambrian (cont'd)
NV3: Riva 128 (Aug 1997)
1st 128-bit memory bus ("wider is better")
DirectX 3 support
1 pix/clk @ 100 MHz
Unified memory for frame buffer and texture
16b Z / 16b color
Integrated VGA from Weitek
Shades of Programmability (Phanerozoic Eon)
NV4 - Riva TNT (Summer 1998)
2 pix/clk @ 90 MHz
DirectX 5
Dual texturing @ 1 pix/clk
Register combiners
Rudimentary Shader Processors
Early programmable shading: Google "register combiners", http://developer.nvidia.com/object/registercombiners.html
“Fixed function but programmable”
General Combiner Flow
The Birth of Modern GPUs(Cenozoic Eon)
NV20 - GeForce3 (Feb 2001)
4 pix/clk @ 240 MHz, 2 bilinear tex/pix
DirectX 8
Shaders!
Programmable vertex shaders
"Configurable" pixel shaders
[Diagram: configurable pixel shader stage: three inputs (Input 0-2) feed an OP unit writing three temporaries (Temp 0-2).]
ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;
Shaders: Before and After
Halo, © Bungie, Elder Scrolls 3: Morrowind, © Bethesda
Fully Programmable Shader Engines (Cretaceous Period)
NV30, NV31: GeForce FX (Jan 2003)
4 pix/clk @ 500 MHz (Ultra); 8 pix/clk for Z-only
128-pin DDR DRAM interface
Superset of DirectX 9: FP32 programmable pixel shader
Mainstream derivative: NV31
Not a stellar market success
GeForce FX Shader Program Examples
Programmability Improved (Paleocene Epoch)
NV40 - GeForce 6800 (April 2004)
16 pix/clk @ 500 MHz ; DX9 Shader Model 3.0
256-pin DRAM interface
Transition from AGP to PCI-E
Evolved NV3x shader; focused perf/area effort
SLI Re-born
[Block diagram: NV40 GPU pipeline. Host Interface / Front End feeds Geometry (with vertex texture fetch via the L2 cache), then Rasterize, then Quad Distribute into 4 Shader Pipelines (each with a FIFO and pixel texture fetch), then Quad Collect, Raster Op (ROP), and the Frame Buffer Interface (FBI) to FB memory; Display and video paths also read the frame buffer.]
The Present
Modern Shader Processors – The "SM" (Pliocene Epoch)
G80: GeForce 8800 (Nov 2006)
24 pix/clk @ 575 MHz
384-bit local memory interface
Virtual memory remapping for system and frame buffer
DirectX 10
Unified shader for vertex, geometry, and pixel programs
Compute!
G8x
[Block diagram: G8x GPU. A Host Unit and Front End feed a command FIFO, Data Assembler, Primitive Control, and Setup/Raster/Zcull. Work is distributed across Texture-Processor Clusters (TPCs); each TPC contains geometry/raster interfaces, a Texture Unit with a texture L1 cache, a PreROP unit, and Streaming Multiprocessors (SMs) behind an SM Controller (SMC). Each SM contains 8 Streaming Processors (SP0-SP7) with register files, 2 SFUs, Shared Memory, a Data L1 cache, and Instruction Fetch/Decode backed by an Instruction L1 cache. A Hub connects the TPCs to 6 texture/L2/ROP partitions, each with its own DRAM frame-buffer (FB) interface.]
NVIDIA Tesla: Scalable High-Density Computing, Massively Multi-threaded Parallel Computing
Unified Design
[Diagram: discrete design vs. unified design. The discrete design chains separate shader stages (Shader A through Shader D), each with its own input and output buffers; the unified design time-shares a single Shader Core among all shader types, with one set of input and output buffers per type.]
Streaming Multiprocessor (SM)
[Diagram: one SM within a TPC: I-Cache, MT Issue, C-Cache, 8 SPs, a shared DP unit, 2 SFUs, and Shared Memory.]
Streaming Multiprocessor (SM)
8 Streaming Processors (SP)
8 SP FMAs, 1 shared DP FMA
2 Super Function Units (SFU)
Multi-threaded instruction dispatch: 1 to 768 threads active
One SIMD instruction per 16/32 threads
Hot clock 1.5 GHz, "tepid" clock 750 MHz; 24 GFLOPS
32 KB local register file (RFn)
16 KB global register file (GRF), aka Shared Memory
SM Conceptual Block Diagram
[Diagram: per-warp PCs (Warp 0 through Warp K) feed a Fetch Unit backed by an Instruction Cache; a Sched Unit issues the single instruction (SI) of the selected warp to the ALUs and LSU, which operate on the multi-threaded (MT) Register Files.]
The Future
The CMOS “Canvas”
[Figure: a 20 mm × 20 mm die, the CMOS "canvas" available to the architect.]
The Ideal Processor?
[Diagram: the ideal processor: a die tiled edge to edge with math units (M).]
The Processor We Live With?
[Diagram: the processor we live with: math units (M) interleaved with "glue" units (G), the control, scheduling, and data-movement logic surrounding each math unit.]
Performance = Total Area × Computational Area Efficiency × Achieved Dynamic Efficiency
What is SIMT?
[Diagram: SIMT positioned between SIMD and MIMD.]
SIMD versus MIMD versus SIMT?
SIMD: “Synchronous Internally Parallel”
MIMD: “Asynchronous Externally Parallel”
SIMT: “Quasi-Synchronous Externally Parallel”
SIMT = “Near” MIMD Programming Model w/ SIMD Implementation Efficiencies
[Diagram: SIMD vector instructions (VLD R1,R0,#imm; VMA R3,R1,R2,R3; ADD R0,R0,#imm) operate on all lanes of packed registers at once; in the MIMD/SIMT columns, each thread executes its own scalar load, multiply-add, and add instructions on its own registers.]
SIMT Multithreaded Execution
SIMT: Single-Instruction Multi-Thread: executes one instruction across many independent threads
Warp: a set of 32 parallel threads that execute a SIMT instruction
SIMT provides easy single-thread scalar programming with SIMD efficiency
Hardware implements zero-overhead warp and thread scheduling
SIMT threads can execute independently
A SIMT warp diverges and converges when threads branch independently
Best efficiency and performance when threads of a warp execute together
[Diagram: the single-instruction multi-thread instruction scheduler interleaves warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96.]
A Few Open SIMT Problems
Control Divergence
Data Divergence
Data Representation
Coherence
Diversity
Control Divergence
A_Code;
if (cond) {
    B_Code;
    while (cond) {
        C_Code;
        if (cond) {
            D_Code;
        } else {
            E_Code;
        }
        F_Code;
    }
    G_Code;
} else {
    H_Code;
}
I_Code;
[Diagram: control-flow graph of the code above, nodes A through I with taken (T) / not-taken (NT) edges, annotated with the immediate dominator and immediate post-dominator of each control-flow operation.]
Control-flow divergence can happen at control-flow operations.
Why is Control Divergence Bad?
Loss of efficiency in SIMD execution if threads on different execution paths are executed together
Unequal path execution delays imply the "wait or stay diverged" dilemma
[Diagram: a branch splitting 41 threads into 25 taken and 16 not taken.]
Data Access via Pointers in Parallel Programs
Pointers represent a major problem in parallel programs
The location that a pointer references cannot be resolved until runtime
struct {
    int x;
    int y;
} *p;
int z = p->y;
LD R1,R0[4] // R0 = p
[Pipeline diagram: FETCH, DECODE, ISSUE, ADDRESS, CACHE, WB; the location a pointer load touches is resolved only at the ADDRESS stage, late in the pipeline.]
Data Divergence
SIMT magnifies the pointer problem
Non-converged memory accesses = data divergence
Classic scatter/gather problem
[Pipeline diagram: one FETCH/DECODE/ISSUE fans out into multiple parallel ADDRESS/CACHE/WB sequences against Memory, one per divergent address.]
Data Representation: The AOS versus SOA Dilemma
AOS (array of structures):

#define NNN nnn
struct {
    type1 field1;
    type2 field2;
    ...
} data[NNN];

SOA (structure of arrays):

#define NNN nnn
struct {
    type1 field1[NNN];
    type2 field2[NNN];
    ...
} data;
AOS versus SOA in Memory
[Diagram: memory layouts. AOS places all fields of record i contiguously (Field1[i], Field2[i], ..., FieldN[i]) before record i+1, so a scalar access to one record is dense. SOA places each field in its own contiguous array (Field1[0..N-1], then Field2[0..N-1], ...), so a vector access to the same field across consecutive records is dense.]
AOS versus SOA: How to Choose?
Programmer: pick AOS
Natural way to think about data: group related fields
In some cases, better memory access efficiency (sparse access to records)
SIMT: pick SOA
Threads executing the same code want to access the same data element at the same time
Very convenient for HW
How to reconcile?
Descriptors
AKA "capabilities"; for example, Plessey 250, Cambridge CAP, Intel 432
D3D employs a form of descriptor
"Resource descriptors" are capabilities
Major language issue for parallel programming?
Diversity: CPU-GPU Détente?
Really SISD vs. SIMT
Sequential applications on SIMT hardware?
Conversely, thread parallel applications on multi-core scalar machines?
Room for both?
Coherent Caches?
Some small planes have built-in parachutes
Really good idea?
Fact: existing GPUs don’t support cache coherency
Bad?
Should coherent caches be added?
The Future Revisited
So what is the future in high performance computing?
1. SIMT
2. Lots of cores
3. Clouds
The Demise of ILP
Uniprocessor performance improvements are crawling to a halt
Very hard to architecturally extract more ILP from single threads
[Plot: uniprocessor performance (ps/instruction) vs. year, 1980-2020, log scale from 1e+0 to 1e+7: a 52%/year improvement trend flattens to 19%/year, with component trends labeled ps/gate 19%, gates/clock 9%, clocks/inst 18%.]
Parallel Processing
Conjecture: most problems worth solving can be solved via a parallel program
SIMT fundamentally a better model than either SIMD or MIMD
146X: Medical Imaging, U of Utah
36X: Molecular Dynamics, U of Illinois, Urbana
18X: Video Transcoding, Elemental Tech
50X: Matlab Computing, AccelerEyes
100X: Astrophysics, RIKEN
149X: Financial Simulation, Oxford
47X: Linear Algebra, Universidad Jaime
20X: 3D Ultrasound, Techniscan
130X: Quantum Chemistry, U of Illinois, Urbana
30X: Gene Sequencing, U of Maryland
Scaling
Can a single GPU do it all?
Systems have to scale to multiple boxes
Programming systems have to scale with them
Final Thoughts
The future is bright for parallel programming
Future supercomputers = networked SIMT-based processing systems
Thanks!
mshebanow@nvidia.com