Pervasive Massively Multithreaded GPU
Processors
Michael C. Shebanow
Sr. Arch Mgr, GPUs
ACM International Conference on Computing Frontiers 2009: Pervasive Massively Multithreaded GPU Processors
The “Real” Title
This talk is about SIMT Processors
The Past, Present, and a glimpse of the Future
The Past
Brief Chronology of GPUs at NVIDIA
[Timeline figure, 1993-2006: GPUs from 3dfx (Voodoo 1, Voodoo 2, Banshee, Voodoo 3), Gigapixel, and NVIDIA (NV1, NV2, NV3, NV3.5, NV4, NV5, NV10, NV11, NV15, NV17, NV20, NV25, NV2A, NV30, NV31, NV34, NV35, NV36, NV40, NV41, NV43, NV44, G70, G71, G72, G73, G80; also shown: Monet, R300, Merlot, Pinot).]
API milestones and flagship parts:
1997: DirectX 5: Riva 128
1998: DirectX 6, multitexturing: Riva TNT
1999: DirectX 7, T&L / TextureStageState: GeForce 256
2001: DirectX 8, SM 1.x: GeForce 3; Cg
2003: DirectX 9, SM 2.0: GeForce FX
2004: DirectX 9.0c, SM 3.0: GeForce 6
Landmark games: Half-Life, Quake 3, Giants, Halo, Far Cry, UE3
Early NVIDIA GPUs (Precambrian Eon)
NV1 (1995): forward texturing
Traverse in texel space, generate pixels (vs. conventional "reverse" texturing, where pixel locations are sampled in texture space)
Quadratic patches: different from the DirectX polygon rendering approach
Integrated audio
Precambrian (cont'd)
NV3: Riva 128 (Aug 1997)
1st 128-bit memory bus ("wider is better")
DirectX 3 support
1 pix/clk @ 100 MHz
Unified memory for frame buffer and texture
16b Z / 16b color
Integrated VGA from Weitek
Shades of Programmability (Phanerozoic Eon)
NV4 - Riva TNT (Summer 1998)
2 pix/clk @ 90 MHz
DirectX 5
Dual texturing @ 1 pix/clk
Register combiners
Rudimentary Shader Processors
Early programmable shading: Google "register combiners", http://developer.nvidia.com/object/registercombiners.html
“Fixed function but programmable”
General Combiner Flow
The Birth of Modern GPUs(Cenozoic Eon)
NV20 - GeForce3 (Feb 2001)
4 pix/clk @ 240 MHz, 2 bilinear tex/pix
DirectX 8
Shaders!
Programmable vertex shaders
"Configurable" pixel shaders
[Diagram: configurable pixel shader stage: three inputs (Input 0-2) feed an OP unit writing three temporaries (Temp 0-2).]
ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;
Shaders: Before and After
Halo, © Bungie, Elder Scrolls 3: Morrowind, © Bethesda
Fully Programmable Shader Engines (Cretaceous Period)
NV30, NV31: GeForce FX (Jan 2003)
4 pix/clk @ 500 MHz (Ultra); 8 pix/clk for Z-only
128-pin DDR DRAM interface
Superset of DirectX 9: FP32 programmable pixel shader
Mainstream derivative: NV31
Not a stellar market success
GeForce FX Shader Program Examples
Programmability Improved (Paleocene Epoch)
NV40 - GeForce 6800 (April 2004)
16 pix/clk @ 500 MHz ; DX9 Shader Model 3.0
256-pin DRAM interface
Transition from AGP to PCI-E
Evolved NV3x shader; focused perf/area effort
SLI Re-born
[Block diagram: NV40 GPU pipeline. Host Interface / Front End feeds Geometry (with vertex texture fetch via the L2 cache), then Rasterize, then Quad Distribute into 4 Shader Pipelines (each with a FIFO and pixel texture fetch), then Quad Collect, Raster Op (ROP), and the Frame Buffer Interface (FBI) to FB memory; Display and video paths also read the frame buffer.]
The Present
Modern Shader Processors – The "SM" (Pliocene Epoch)
G80: GeForce 8800 (Nov 2006)
24 pix/clk @ 575 MHz
384-bit local memory interface
Virtual memory remapping for system and frame buffer
DirectX 10
Unified shader for vertex, geometry, and pixel programs
Compute!
G8x
[Block diagram: G8x GPU. A Host Unit and Front End feed a command FIFO, Data Assembler, Primitive Control, and Setup/Raster/Zcull. Work is distributed across Texture-Processor Clusters (TPCs); each TPC contains geometry/raster interfaces, a Texture Unit with a texture L1 cache, a PreROP unit, and Streaming Multiprocessors (SMs) behind an SM Controller (SMC). Each SM contains 8 Streaming Processors (SP0-SP7) with register files, 2 SFUs, Shared Memory, a Data L1 cache, and Instruction Fetch/Decode backed by an Instruction L1 cache. A Hub connects the TPCs to 6 texture/L2/ROP partitions, each with its own DRAM frame-buffer (FB) interface.]
NVIDIA Tesla: Scalable High-Density Computing, Massively Multi-threaded Parallel Computing
Unified Design
[Diagram: discrete design vs. unified design. The discrete design chains separate shader stages (Shader A through Shader D), each with its own input and output buffers; the unified design time-shares a single Shader Core among all shader types, with one set of input and output buffers per type.]
Streaming Multiprocessor (SM)
[Diagram: one SM within a TPC: I-Cache, MT Issue, C-Cache, 8 SPs, a shared DP unit, 2 SFUs, and Shared Memory.]
Streaming Multiprocessor (SM)
8 Streaming Processors (SP)
8 SP FMAs, 1 shared DP FMA
2 Super Function Units (SFU)
Multi-threaded instruction dispatch: 1 to 768 threads active
One SIMD instruction per 16/32 threads
Hot clock 1.5 GHz, "tepid" clock 750 MHz; 24 GFLOPS
32 KB local register file (RFn)
16 KB global register file (GRF), aka Shared Memory
SM Conceptual Block Diagram
[Diagram: per-warp PCs (Warp 0 through Warp K) feed a Fetch Unit backed by an Instruction Cache; a Sched Unit issues the single instruction (SI) of the selected warp to the ALUs and LSU, which operate on the multi-threaded (MT) Register Files.]
The Future
The CMOS “Canvas”
[Figure: a 20 mm × 20 mm die, the CMOS "canvas" available to the architect.]
The Ideal Processor?
[Diagram: the ideal processor: a die tiled edge to edge with math units (M).]
The Processor We Live With?
[Diagram: the processor we live with: math units (M) interleaved with "glue" units (G), the control, scheduling, and data-movement logic surrounding each math unit.]
Performance = Total Area × Computational Area Efficiency × Achieved Dynamic Efficiency
What is SIMT?
[Diagram: SIMT positioned between SIMD and MIMD.]
SIMD versus MIMD versus SIMT?
SIMD: “Synchronous Internally Parallel”
MIMD: “Asynchronous Externally Parallel”
SIMT: “Quasi-Synchronous Externally Parallel”
SIMT = “Near” MIMD Programming Model w/ SIMD Implementation Efficiencies
[Diagram: SIMD vector instructions (VLD R1,R0,#imm; VMA R3,R1,R2,R3; ADD R0,R0,#imm) operate on all lanes of packed registers at once; in the MIMD/SIMT columns, each thread executes its own scalar load, multiply-add, and add instructions on its own registers.]
SIMT Multithreaded Execution
SIMT: Single-Instruction Multi-Thread: executes one instruction across many independent threads
Warp: a set of 32 parallel threads that execute a SIMT instruction
SIMT provides easy single-thread scalar programming with SIMD efficiency
Hardware implements zero-overhead warp and thread scheduling
SIMT threads can execute independently
A SIMT warp diverges and converges when threads branch independently
Best efficiency and performance when threads of a warp execute together
[Diagram: the single-instruction multi-thread instruction scheduler interleaves warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96.]
A Few Open SIMT Problems
Control Divergence
Data Divergence
Data Representation
Coherence
Diversity
Control Divergence
A_Code;
if (cond) {
    B_Code;
    while (cond) {
        C_Code;
        if (cond) {
            D_Code;
        } else {
            E_Code;
        }
        F_Code;
    }
    G_Code;
} else {
    H_Code;
}
I_Code;
[Diagram: control-flow graph of the code above, nodes A through I with taken (T) / not-taken (NT) edges, annotated with the immediate dominator and immediate post-dominator of each control-flow operation.]
Control-flow divergence can happen at control-flow operations.
Why is Control Divergence Bad?
Loss of efficiency in SIMD execution if threads on different execution paths are executed together
Unequal path execution delays imply the "wait or stay diverged" dilemma
[Diagram: a branch splitting 41 threads into 25 taken and 16 not taken.]
Data Access via Pointers in Parallel Programs
Pointers represent a major problem in parallel programs
The location that a pointer references cannot be resolved until runtime
struct {
    int x;
    int y;
} *p;
int z = p->y;
LD R1,R0[4] // R0 = p
[Pipeline diagram: FETCH, DECODE, ISSUE, ADDRESS, CACHE, WB; the location a pointer load touches is resolved only at the ADDRESS stage, late in the pipeline.]
Data Divergence
SIMT magnifies the pointer problem
Non-converged memory accesses = data divergence
Classic scatter/gather problem
[Pipeline diagram: one FETCH/DECODE/ISSUE fans out into multiple parallel ADDRESS/CACHE/WB sequences against Memory, one per divergent address.]
Data Representation: The AOS versus SOA Dilemma
AOS (array of structures):

#define NNN nnn
struct {
    type1 field1;
    type2 field2;
    ...
} data[NNN];

SOA (structure of arrays):

#define NNN nnn
struct {
    type1 field1[NNN];
    type2 field2[NNN];
    ...
} data;
AOS versus SOA in Memory
[Diagram: memory layouts. AOS places all fields of record i contiguously (Field1[i], Field2[i], ..., FieldN[i]) before record i+1, so a scalar access to one record is dense. SOA places each field in its own contiguous array (Field1[0..N-1], then Field2[0..N-1], ...), so a vector access to the same field across consecutive records is dense.]
AOS versus SOA: How to Choose?
Programmer: pick AOS
Natural way to think about data: group related fields
In some cases, better memory access efficiency (sparse access to records)
SIMT: pick SOA
Threads executing the same code want to access the same data element at the same time
Very convenient for HW
How to reconcile?
Descriptors
AKA "capabilities"; for example, Plessey 250, Cambridge CAP, Intel 432
D3D employs a form of descriptor
"Resource descriptors" are capabilities
Major language issue for parallel programming?
Diversity: CPU-GPU Détente?
Really SISD vs. SIMT
Sequential applications on SIMT hardware?
Conversely, thread parallel applications on multi-core scalar machines?
Room for both?
Coherent Caches?
Some small planes have built-in parachutes
Really good idea?
Fact: existing GPUs don’t support cache coherency
Bad?
Should coherent caches be added?
The Future Revisited
So what is the future in high performance computing?
1. SIMT
2. Lots of cores
3. Clouds
The Demise of ILP
Uniprocessor performance improvements are crawling to a halt
Very hard to architecturally extract more ILP from single threads
[Plot: uniprocessor performance (ps/instruction) vs. year, 1980-2020, log scale from 1e+0 to 1e+7: a 52%/year improvement trend flattens to 19%/year, with component trends labeled ps/gate 19%, gates/clock 9%, clocks/inst 18%.]
Parallel Processing
Conjecture: most problems worth solving can be solved via a parallel program
SIMT fundamentally a better model than either SIMD or MIMD
146X: Medical Imaging, U of Utah
36X: Molecular Dynamics, U of Illinois, Urbana
18X: Video Transcoding, Elemental Tech
50X: Matlab Computing, AccelerEyes
100X: Astrophysics, RIKEN
149X: Financial Simulation, Oxford
47X: Linear Algebra, Universidad Jaime
20X: 3D Ultrasound, Techniscan
130X: Quantum Chemistry, U of Illinois, Urbana
30X: Gene Sequencing, U of Maryland
Scaling
Can a single GPU do it all?
Systems have to scale to multiple boxes
Programming systems have to scale with them
Final Thoughts
The future is bright for parallel programming
Future supercomputers = networked SIMT-based processing systems
Thanks!
mshebanow@nvidia.com