TRANSCRIPT
The Future of GPU Computing
Bill Dally
Chief Scientist & Sr. VP of Research, NVIDIA
Bell Professor of Engineering, Stanford University
November 18, 2009
Outline
- Single-thread performance is no longer scaling
- Performance = Parallelism
- Efficiency = Locality
- Applications have lots of both
- Machines need lots of cores (parallelism) and an exposed storage hierarchy (locality)
- A programming system must abstract this
- The future is even more parallel
Single-threaded processor performance is no longer scaling
Moore's Law
- In 1965 Gordon Moore predicted the number of transistors on an integrated circuit would double every year
- Later revised to 18 months
- Also predicted L³ power scaling for constant function
- No prediction of processor performance
Moore, Electronics 38(8), April 19, 1965
![Page 6: The Future of GPU Computing - GTC On Demand...The Future of Computing Bill Dally Chief Scientist & Sr. VP of Research, NVIDIA Bell Professor of Engineering, Stanford University November](https://reader033.vdocuments.mx/reader033/viewer/2022060902/609eb475a8ce855f475d63ee/html5/thumbnails/6.jpg)
MoreTransistors
MoreValue
MorePerformance
Architecture Applications
The End of ILP Scaling
[Plot: performance in ps/instruction vs. calendar year, 1980-2020, log scale spanning 1e+0 to 1e+7]
Dally et al., "The Last Classical Computer," ISAT Study, 2001
Explicit Parallelism is Now Attractive
[Plot: actual perf (ps/inst) vs. a linear extrapolation, 1980-2020; the gap between the two grows from 30:1 to 1,000:1 to 30,000:1]
Dally et al., "The Last Classical Computer," ISAT Study, 2001
Single-Thread Processor Performance vs. Calendar Year
[Plot: performance (vs. VAX-11/780), 1978-2006; growth of 25%/year, then 52%/year, then 20%/year]
Source: Hennessy & Patterson, CAAQA, 4th Edition
Single-threaded processor performance is no longer scaling
Performance = Parallelism
Chips are power limited, and most power is spent moving data
CMOS Chip is our Canvas
[Die image: a 20 mm × 20 mm chip]
4,000 64b FPUs fit on a chip
[Diagram: 20 mm die tiled with FPUs]
- 64b FPU: 0.1 mm², 50 pJ/op, 1.5 GHz
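A quick check of the slide's numbers (my arithmetic, not from the talk): 4,000 FPUs at 0.1 mm² each exactly tile the 20 mm × 20 mm die, and running them all flat out already reaches a chip-level power budget:

```latex
\begin{aligned}
4{,}000 \times 0.1\,\text{mm}^2 &= 400\,\text{mm}^2 = (20\,\text{mm})^2 \\
4{,}000 \times 1.5\,\text{GHz} \times 50\,\text{pJ/op} &= 300\,\text{W}
\end{aligned}
```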
Moving a word across the die = 10 FMAs
Moving a word off chip = 20 FMAs
[Diagram: 20 mm die]
- 64b FPU: 0.1 mm², 50 pJ/op, 1.5 GHz
- 64b 1 mm on-chip channel: 25 pJ/word (10 mm: 250 pJ, 4 cycles)
- 64b off-chip channel: 1 nJ/word
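The FMA equivalences follow directly from the energies in the figure, taking one FMA as one 50 pJ FPU op:

```latex
\begin{aligned}
\text{across the 20 mm die:}\; & 20\,\text{mm} \times 25\,\text{pJ/mm} = 500\,\text{pJ} = 10 \times 50\,\text{pJ} \;\Rightarrow\; 10\ \text{FMAs} \\
\text{off chip:}\; & 1\,\text{nJ} = 1{,}000\,\text{pJ} = 20 \times 50\,\text{pJ} \;\Rightarrow\; 20\ \text{FMAs}
\end{aligned}
```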
Chips are power limited
Most power is spent moving data
Efficiency = Locality
Performance = Parallelism
Efficiency = Locality
Scientific Applications
- Large data sets
- Lots of parallelism
- Increasingly irregular (AMR)
  - Irregular and dynamic data structures
  - Requires efficient gather/scatter
- Increasingly complex models
- Lots of locality
- Global solution sometimes bandwidth limited
  - Less locality in these phases
Performance = Parallelism
Efficiency = Locality
Fortunately, most applications have lots of both.
Amdahl’s law doesn’t apply to most future applications.
Exploiting parallelism and locality requires:
- Many efficient processors (to exploit parallelism)
- An exposed storage hierarchy (to exploit locality)
- A programming system that abstracts this
Tree-structured machines
[Diagram: four processors (P), each with its own L1, joined by a network to a shared L2, joined by another network to L3]
Optimize use of scarce bandwidth
- Provide a rich, exposed storage hierarchy
- Explicitly manage data movement on this hierarchy
- Reduces demand, increases utilization
[Dataflow diagram of a cell-based solver pass: gather cell → compute flux states → compute numerical flux → compute cell interior → advance cell, taking elements (current) to elements (new); inputs include read-only table-lookup data (master element), face geometry, cell orientations, and cell geometry]
Fermi is a throughput computer
- 512 efficient cores
- Rich storage hierarchy: shared memory, L1, L2, GDDR5 DRAM
[Die diagram: DRAM interfaces, host interface, and the GigaThread scheduler surrounding the core array and L2]
Fermi
[Die photo]
Avoid Denial Architecture
- Single-thread processors are in denial about parallelism and locality
- They provide two illusions:
  - Serial execution: denies parallelism; tries to exploit parallelism with ILP, which is inefficient and has limited scalability
  - Flat memory: denies locality; tries to provide the illusion with caches, which are very inefficient when the working set doesn't fit in the cache
- These illusions inhibit performance and efficiency
CUDA Abstracts the GPU Architecture
The programmer sees many cores and an exposed storage hierarchy, but is isolated from the details.
CUDA as a Stream Language
- Launch a cooperative thread array (CTA):
  foo<<<nblocks, nthreads>>>(x, y, z);
- Explicit control of the memory hierarchy:
  __shared__ float a[SIZE];
  - Also enables communication between threads of a CTA
- Allows access to arbitrary data within a kernel
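A minimal sketch combining the two mechanisms above (the kernel, names, and tiling scheme are illustrative assumptions, not code from the talk): each CTA stages a tile of its input into __shared__ memory, synchronizes, then computes from the on-chip copy instead of re-reading DRAM for every neighbor access.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256

// Illustrative kernel: each cooperative thread array (CTA) stages a tile of
// the input into on-chip __shared__ memory, synchronizes, then computes a
// 3-point average from the local copy.
__global__ void blur1d(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];               // tile plus one halo cell per side

    int g = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int l = threadIdx.x + 1;                       // local index (skip left halo)

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)                          // left halo
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)             // right halo
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;

    __syncthreads();                               // barrier across the CTA

    if (g < n)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

int main()
{
    const int n = 1 << 20;                         // a multiple of TILE
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    blur1d<<<n / TILE, TILE>>>(d_in, d_out, n);    // launch a grid of CTAs
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[1] = %f\n", h[1]);                 // (0 + 1 + 2) / 3 = 1.0
    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
```

Built with nvcc; the staging pattern is exactly what an exposed shared-memory hierarchy is meant to support.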
Examples
- 146x: interactive visualization of volumetric white matter connectivity
- 36x: ionic placement for molecular dynamics simulation on GPU
- 19x: transcoding HD video stream to H.264
- 17x: fluid mechanics in MATLAB using .mex file CUDA function
- 100x: astrophysics N-body simulation
- 149x: financial simulation of LIBOR model with swaptions
- 47x: GLAME@lab, an M-script API for GPU linear algebra
- 20x: ultrasound medical imaging for cancer diagnostics
- 24x: highly optimized object-oriented molecular dynamics
- 30x: Cmatch exact string matching to find similar proteins and gene sequences
Current CUDA Ecosystem
- Applications: oil & gas, finance, medical, biophysics, numerics, imaging, CFD, DSP, EDA
- Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision
- Languages: C, C++, DirectX, Fortran, OpenCL, Python
- Compilers: PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP
- Consultants and OEMs: ANEO, GPU Tech
- Over 200 universities teaching CUDA: UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU, ...
Ease of Programming
Source: Nicolas Pinto, MIT
The future is even more parallel
CPU scaling ends, GPU continues
[Plot: performance projection through 2017]
Source: Hennessy & Patterson, CAAQA, 4th Edition
DARPA Study Identifies Four Challenges for ExaScale Computing
Report published September 28, 2008. Four major challenges:
- Energy and power
- Memory and storage
- Concurrency and locality
- Resiliency
The number-one issue is power:
- Extrapolations of current architectures and technology indicate over 100 MW for an exaflop!
- Power also constrains what we can put on a chip
Available at www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf
Energy and Power Challenge
- Heterogeneous architecture
  - A few latency-optimized processors
  - Many (100s-1,000s) throughput-optimized processors, optimized for ops/J
- Efficient processor architecture
  - Simple control: in-order, multi-threaded
  - SIMT execution to amortize overhead
- Agile memory system to capture locality
  - Keeps data and instruction access local
- Optimized circuit design
  - Minimize energy/op
  - Minimize cost of data movement
* This section is a projection based on Moore's law and does not represent a committed roadmap
An NVIDIA ExaScale Machine in 2017
- 2017 GPU node, 300 W (GPU + memory + supply)
  - 2,400 throughput cores (7,200 FPUs), 16 CPUs, single chip
  - 40 TFLOPS (SP), 13 TFLOPS (DP)
  - Deep, explicit on-chip storage hierarchy
  - Fast communication and synchronization
- Node memory
  - 128 GB DRAM, 2 TB/s bandwidth
  - 512 GB phase-change/flash for checkpoint and scratch
- Cabinet, ~100 kW
  - 384 nodes: 15.7 PFLOPS (SP), 50 TB DRAM
  - Dragonfly network, 1 TB/s per-node bandwidth
- System, ~10 MW
  - 128 cabinets: 2 EFLOPS (SP), 6.4 PB DRAM
  - Distributed EB disk array for file system
  - Dragonfly network with active optical links
- RAS
  - ECC on all memory and links
  - Option to pair cores for self-checking (or use application-level checking)
  - Fast local checkpoint
* This section is a projection based on Moore's law and does not represent a committed roadmap
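The cabinet- and system-level figures are consistent with the per-node numbers (my arithmetic, rounding as the slide does):

```latex
\begin{aligned}
\text{cabinet power} &\approx 384 \times 300\,\text{W} \approx 115\,\text{kW} \quad (\sim 100\,\text{kW}) \\
\text{system nodes} &= 128 \times 384 = 49{,}152 \\
\text{system perf (SP)} &\approx 49{,}152 \times 40\,\text{TFLOPS} \approx 2\,\text{EFLOPS} \\
\text{system DRAM} &\approx 49{,}152 \times 128\,\text{GB} \approx 6.3\,\text{PB}
\end{aligned}
```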
Conclusion
Performance = Parallelism
Efficiency = Locality
Applications have lots of both.
GPUs have lots of cores (parallelism) and an exposed storage hierarchy (locality).