What’s New with GPGPU? (jowens/talks/berkeley-glunch-060831.pdf)
TRANSCRIPT
![Page 1](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/1.jpg)
What’s New with GPGPU?
John Owens
Assistant Professor, Electrical and Computer Engineering
Institute for Data Analysis and Visualization
University of California, Davis
![Page 2](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/2.jpg)
Microprocessor Scaling is Slowing
[Figure: Perf (ps/Inst) on a log scale (1e+0 to 1e+7) vs. year (1980 to 2020); trend labels 52%/year and 19%/year; component labels ps/gate 19%, Gates/clock 9%, Clocks/inst 18%]
[courtesy of Bill Dally]
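The three component rates labeled on the chart can be checked against the 52%/year headline figure. The sketch below assumes the components (ps/gate, gates/clock, clocks/inst) are independent per-year improvement factors that compound multiplicatively; the slide does not state this, and the function name is illustrative only.

```python
def combined_rate(*annual_rates):
    """Compound several per-year improvement rates into one overall rate."""
    factor = 1.0
    for rate in annual_rates:
        factor *= 1.0 + rate
    return factor - 1.0


# Component rates as labeled on the slide: ps/gate, gates/clock, clocks/inst.
overall = combined_rate(0.19, 0.09, 0.18)
print(f"combined improvement: {overall:.1%} per year")
```

Under that assumption the product comes to roughly 53% per year, close to the 52%/year trend on the chart.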
![Page 3](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/3.jpg)
Today’s Microprocessors
• Scalar programming model with no native data parallelism
  • SSE is the exception
• Few arithmetic units – little area
• Optimized for complex control
• Optimized for low latency, not high bandwidth
• Result: poor match for many apps
[Die photo: Pentium III, 28.1M transistors]
![Page 4](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/4.jpg)
Future Potential is Large
• 2001: 30:1
• 2011: 1000:1
[Figure: Perf (ps/Inst) and Linear (ps/Inst) on a log scale (1e-4 to 1e+7) vs. year (1980 to 2020); trend labels 52%/year, 74%/year, 19%/year; gap labels 30:1, 1,000:1, 30,000:1]
[courtesy of Bill Dally]
![Page 5](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/5.jpg)
Parallel Processing is the Future
Major vendors supporting multicore
• Intel, AMD
Excitement about IBM Cell
Hardware support for threads
Interest in general-purpose programmability on GPUs
Universities must teach thinking in parallel
![Page 6](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/6.jpg)
Long-Term Trend: CPU vs. GPU
[Figure label: GPU]
![Page 7](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/7.jpg)
Recent GPU Performance Trends
Programmable 32-bit FP multiplies per second
[Figure labels (price, processor): $308 7800GTX; $278 X1900XTX; $334 P4X3.4. Bandwidth labels: 54.4 GB/s, 6 GB/s, 49.6 GB/s]
Data courtesy Ian Buck; from Owens et al. 2005 [EG STAR]
![Page 8](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/8.jpg)
Functionality Improves Too!
10 years ago:
• Graphics done in software
5 years ago:
• Full graphics pipeline
Today:
• 40x geometry, 13x fill vs. 5 yrs ago
• Programmable!
Programmable, data parallel processing on every desktop
The GPU is the first commercial data-parallel processor
![Page 9](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/9.jpg)
The Rendering Pipeline
• Application: Compute 3D geometry; make calls to graphics API
• Geometry: Transform geometry from 3D to 2D (in parallel)
• Rasterization: Generate fragments from 2D geometry (in parallel)
• Composite: Combine fragments into image
[Diagram label: GPU]
![Page 10](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/10.jpg)
The Programmable Rendering Pipeline
• Application: Compute 3D geometry; make calls to graphics API
• Geometry (Vertex): Transform geometry from 3D to 2D; vertex programs
• Rasterization (Fragment): Generate fragments from 2D geometry; fragment programs
• Composite: Combine fragments into image
[Diagram label: GPU]
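The four stages can be mimicked in a few lines of software to show where the programmable parts plug in. This is a toy sketch, not a description of real GPU hardware: `vertex_program` and `fragment_program` are hypothetical stand-ins for user-supplied programs, depth is interpolated linearly in screen space, and compositing is assumed to be a simple nearest-depth test.

```python
def vertex_program(v):
    """Hypothetical vertex program: perspective-project (x, y, z) to 2D."""
    x, y, z = v
    return (x / z, y / z, z)


def fragment_program(x, y, depth):
    """Hypothetical fragment program: shade a fragment by inverse depth."""
    return round(1.0 / depth, 3)


def edge(a, b, p):
    """Signed-area test: which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height):
    """Yield (x, y, depth) for every pixel center covered by a 2D triangle."""
    a, b, c = tri
    area = edge(a, b, c)
    if area == 0:
        return  # degenerate triangle produces no fragments
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            if all(w * area >= 0 for w in (w0, w1, w2)):
                # Barycentric interpolation of depth (screen-space linear).
                depth = (w0 * a[2] + w1 * b[2] + w2 * c[2]) / area
                yield x, y, depth


def render(triangles, width, height):
    color = [[0.0] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for tri in triangles:                                   # Application
        projected = [vertex_program(v) for v in tri]        # Geometry (Vertex)
        for x, y, depth in rasterize(projected, width, height):  # Rasterization
            shade = fragment_program(x, y, depth)           # (Fragment)
            if depth < zbuf[y][x]:                          # Composite
                zbuf[y][x] = depth
                color[y][x] = shade
    return color


far = [(0, 0, 2), (16, 0, 2), (0, 16, 2)]   # covers the whole 4x4 image
near = [(0, 0, 1), (2, 0, 1), (0, 2, 1)]    # covers three pixels in one corner
image = render([far, near], 4, 4)
```

Because compositing keeps the nearest fragment per pixel, submitting the triangles in the opposite order yields the same image.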
![Page 11](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/11.jpg)
NVIDIA GeForce 6800 3D Pipeline
[Diagram blocks: Triangle Setup; Z-Cull; Shader Instruction Dispatch; L2 Tex; Fragment Crossbar; four Memory Partitions. Stage labels: Vertex, Fragment, Composite]
Courtesy Nick Triantos, NVIDIA
![Page 12](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/12.jpg)
Programming a GPU for Graphics
• Application specifies geometry, which is rasterized
• Each fragment is shaded w/ SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes
![Page 13](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/13.jpg)
Programming a GPU for GP Programs
• Draw a screen-sized quad
• Run a SIMD program over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass
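The pattern on this slide can be sketched on the CPU to make the data flow explicit. This is a minimal illustration, assuming a 2D list stands in for a texture; `run_pass`, `blur_kernel`, and `multipass` are hypothetical names, and the neighbor-averaging kernel is only an example computation.

```python
def run_pass(kernel, texture):
    """Emulate drawing a screen-sized quad: call the kernel once per output element."""
    height, width = len(texture), len(texture[0])
    return [[kernel(texture, x, y) for x in range(width)] for y in range(height)]


def blur_kernel(texture, x, y):
    """Example kernel: gather the four neighbors (with wraparound) and average them.

    The kernel only reads from the input texture and returns its own output
    value; it never writes to another element (no scatter).
    """
    h, w = len(texture), len(texture[0])
    return (texture[y][(x - 1) % w] + texture[y][(x + 1) % w]
            + texture[(y - 1) % h][x] + texture[(y + 1) % h][x]) / 4.0


def multipass(kernel, texture, passes):
    """Feed each pass's output buffer back in as the next pass's input texture."""
    for _ in range(passes):
        texture = run_pass(kernel, texture)
    return texture


grid = [[0.0] * 4 for _ in range(4)]
grid[0][0] = 16.0
result = multipass(blur_kernel, grid, passes=2)
```

Each pass allocates a fresh output buffer, which mirrors the restriction that a pass reads from one buffer and writes to another.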
![Page 14](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/14.jpg)
GPUs are fast (why?) …
Characteristics of computation permit efficient hardware implementations
• High amount of parallelism …
• … exploited by graphics hardware
• High latency tolerance and feed-forward dataflow …
• … allow very deep pipelines
• … allow optimization for bandwidth not latency
Simple control
• Restrictive programming model
Competition between vendors
![Page 15](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/15.jpg)
… but GPU programming is hard
Must think in graphics metaphors
Requires parallel programming (CPU-GPU, task, data, instruction)
Restrictive programming models and instruction sets
Primitive tools
Rapidly changing interfaces
Big picture: Every time I say “GPU”, substitute “parallel processor”
![Page 16](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/16.jpg)
GPGPU in Scientific Visualization
Flowfield Vis: “a million particles with ease” @ 60 fps
Tensorfield Vis: CPU – GPU Speedup = 1 : 150 (!!!)
GPGPU and VIS: a winning team
(winner of IEEE 2005 VIS Competition)
Focus and Context Raycasting: High-quality interactive visualization of large datasets on $400 hardware
Online decompression: Interactively visualize 20 GB of data on a 256 MB GPU
Courtesy Jens Krüger (TU München)
![Page 17](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/17.jpg)
GPGPU Effects
Fire
Water
Smoke
Let your Virtual Reality come to life with interactive GPGPU effects!
(all of these effects are simulated and rendered on the GPU in real time)
Courtesy Jens Krüger (TU München)
![Page 18](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/18.jpg)
Challenge: Programming Systems

| | CPU | GPU |
|---|---|---|
| Programming Model | Scalar | Stream? Data-Parallel? |
| High-Level Abstractions/Libraries | STL, GNU SL, MPI, … | Brook, Scout, sh, Glift |
| Low-Level Languages | C, Fortran, … | GLSL, Cg, HLSL, … |
| Compilers | gcc, vendor-specific, … | Vendor-specific |
| Performance Analysis Tools | gdb, vtune, Purify, … | Shadesmith, NVPerfHUD |
| Docs | Lots | None |

[Diagram labels: applications (CPU), kernels (GPU)]
![Page 19](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/19.jpg)
Glift: Data Structures for GPUs
Goal
• Simplify creation and use of random-access GPU data structures for graphics and GPGPU programming
Contributions
• Abstraction for GPU data structures
• Glift template library
• Iterator computation model for GPUs
Aaron E. Lefohn, Joe Kniss, Robert Strzodka, Shubhabrata Sengupta, and John D. Owens. “Glift: An Abstraction for Generic, Efficient GPU Data Structures”. ACM TOG Jan 2006.
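To give a flavor of what a random-access data structure abstraction with an iterator might look like, here is a small sketch. It is not Glift's actual interface: the class and method names are invented, and a nested Python list stands in for a 2D texture. It only illustrates separating the logical index space from the physical storage layout.

```python
class PackedArray1D:
    """Hypothetical 1D array stored in a fixed-width 2D 'texture'."""

    def __init__(self, size, tex_width):
        self.size = size
        self.tex_width = tex_width
        rows = (size + tex_width - 1) // tex_width  # round up to whole rows
        self.texture = [[0.0] * tex_width for _ in range(rows)]

    def translate(self, i):
        """Map a logical 1D index to a physical (row, column) address."""
        if not 0 <= i < self.size:
            raise IndexError(i)
        return divmod(i, self.tex_width)

    def __getitem__(self, i):
        row, col = self.translate(i)
        return self.texture[row][col]

    def __setitem__(self, i, value):
        row, col = self.translate(i)
        self.texture[row][col] = value

    def __len__(self):
        return self.size

    def __iter__(self):
        """Visit elements in logical order, hiding the 2D layout from the caller."""
        for i in range(self.size):
            yield self[i]


squares = PackedArray1D(size=10, tex_width=4)
for i in range(len(squares)):
    squares[i] = float(i * i)
```

Code that uses `squares[i]` or iterates over `squares` never sees the row/column layout, so the layout could change without touching the callers.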
![Page 20](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/20.jpg)
Today’s Vendor Support
High-Level Graphics Language
OpenGL, D3D
Low-Level Device Driver
![Page 21](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/21.jpg)
Possible Future Vendor Support
High-Level Graphics Language
High-Level Compute Lang.
OpenGL, D3D, Compute
Low-Level Device Driver
Low-Level API
![Page 22](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/22.jpg)
ATI CTM
Low-level interface to GPU
![Page 23](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/23.jpg)
Big Picture Research Targets
• Data structures
  • Top-down approach rather than bottom-up … what SHOULD we support?
  • Interaction of algorithms and data structures
  • Export to multiple architectures
• Self-tuning code
• Communication between GPUs
• Programming systems for multiple parallel architectures
  • Major obstacle: difficulty of programming. Danger of fragmentation! Opportunity for education as well.
  • Learn from past
  • Explore portable primitives
What we’re good at: using GPUs as first class computing resources
We don’t know what architecture will win. But we know it will be parallel.
![Page 24](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/24.jpg)
Rob Pike on Languages
Conclusion
• A highly parallel language used by non-experts.
Power of notation
• Good: make it easier to express yourself
• Better: hide stuff you don't care about
• Best: hide stuff you do care about
Give the language a purpose.
Exposing Parallelism
Control Flow
Data Locality
Synchronization
![Page 25](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/25.jpg)
Moving Forward …
What will DX10 give us?
What works well now?
What doesn’t work well now?
What will improve in the future?
What will continue to be difficult?
![Page 26](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/26.jpg)
The New DX10 Pipeline
[Diagram stages: Input Assembler → Vertex Shader → Geometry Shader → Setup/Rasterizer → Pixel Shader → Output Merger. Memory resources: Vertex Buffer, Index Buffer, Texture (x3), Stream Buffer (via stream out), Render Target, Depth/Stencil. Shader inputs: Sampler (x3), Constant (x3). Legend: memory, programmable, fixed]
[courtesy David Blythe]
![Page 27](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/27.jpg)
What Runs Well on GPUs?
GPUs win when …
• Limited data reuse
• High arithmetic intensity: defined as math operations per memory op
  • Attacks the memory wall: are all mem ops necessary?
• Common error: Not comparing against optimized CPU implementation

| | Memory BW | Cache BW |
|---|---|---|
| P4 3GHz | 6 GB/s | 44 GB/s |
| NV GF 6800 | 36 GB/s | -- |
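The definition of arithmetic intensity can be applied to a concrete kernel. The sketch below uses SAXPY (`y[i] = a * x[i] + y[i]`) and the memory bandwidth figures from this slide. It assumes 32-bit words, counts one multiply and one add against two loads and one store per element, and ignores caches entirely, so the results are rough upper bounds rather than measurements; the function names are illustrative.

```python
def arithmetic_intensity(math_ops, memory_ops):
    """Math operations performed per memory operation."""
    return math_ops / memory_ops


def bandwidth_bound_gflops(intensity, bandwidth_gb_s, bytes_per_word=4):
    """Upper bound on math throughput when memory traffic is the bottleneck."""
    gwords_per_second = bandwidth_gb_s / bytes_per_word
    return intensity * gwords_per_second


# SAXPY per element: 2 math ops (multiply, add), 3 memory ops (2 loads, 1 store).
saxpy = arithmetic_intensity(2, 3)

# Memory bandwidth values as listed on the slide.
for name, bandwidth in (("NV GF 6800", 36.0), ("P4 3GHz", 6.0)):
    bound = bandwidth_bound_gflops(saxpy, bandwidth)
    print(f"{name}: at most {bound:.1f} GFLOPS on SAXPY")
```

With an intensity below one op per word, both processors are limited by memory traffic, so under these assumptions the ratio between them is just the 6x ratio of memory bandwidths.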
![Page 28](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/28.jpg)
Arithmetic Intensity
[Figure: GFLOPS and GFloats/sec for R300, R360, R420; annotation: 7x Gap]
Historical growth rates (per year):
• Compute: 71%
• DRAM bandwidth: 25%
• DRAM latency: 5%
[courtesy Ian Buck]
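The growth rates listed on this slide imply a widening gap between compute and memory bandwidth. A short sketch of that arithmetic follows, assuming both rates stay constant and compound annually; the function name is illustrative.

```python
import math


def gap_growth(compute_rate, bandwidth_rate):
    """Per-year factor by which compute outpaces DRAM bandwidth."""
    return (1.0 + compute_rate) / (1.0 + bandwidth_rate)


per_year = gap_growth(0.71, 0.25)                   # rates from the slide
doubling_years = math.log(2) / math.log(per_year)   # time for the gap to double
print(f"gap grows {per_year:.3f}x per year, doubling every {doubling_years:.1f} years")
```

Under these assumptions the compute-to-bandwidth ratio grows by about 1.37x per year, which is why the arithmetic intensity needed to keep the math units busy keeps rising.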
![Page 29](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/29.jpg)
Arithmetic Intensity
GPU wins when… Arithmetic intensity
[Figure: Segment benchmark; annotations: 3.7 ops per word, 11 GFLOPS; series: GeForce 7800 GTX, Pentium 4 3.0 GHz]
[courtesy Ian Buck]
![Page 30](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/30.jpg)
Memory Bandwidth
GPU wins when… Streaming memory bandwidth
[Figure: SAXPY and FFT benchmarks; series: GeForce 7800 GTX, Pentium 4 3.0 GHz]
[courtesy Ian Buck]
![Page 31](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/31.jpg)
Memory Bandwidth
Streaming Memory System
• Optimized for sequential performance
GPU cache is limited
• Optimized for texture filtering
• Read-only
• Small
Local storage
• CPU >> GPU
[Figure series: GeForce 7800 GTX, Pentium 4]
[courtesy Ian Buck]
![Page 32](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/32.jpg)
What Will (Hopefully) Improve?
Orthogonality
• Instruction sets
• Features
• Tools
Stability
Interfaces, APIs, libraries, abstractions
• Necessary as graphics and GPGPU converge!
![Page 33](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/33.jpg)
What Won’t Change?
Rate of progress
Precision (64b floating point?)
Parallelism
• Won’t sacrifice performance
Difficulty of programming parallel hardware
• … but APIs and libraries may help
Concentration on entertainment apps
![Page 34](https://reader033.vdocuments.mx/reader033/viewer/2022051921/600e687b0e5b0701d41b6ae3/html5/thumbnails/34.jpg)
GPGPU Top Ten
The Killer App
Programming models and tools
GPU in tomorrow’s computer?
Data conditionals
Relationship to other parallel hw/sw
Managing rapid change in hw/sw (roadmaps)
Performance evaluation and cliffs
Philosophy of faults and lack of precision
Broader toolbox for computation / data structures
Wedding graphics and GPGPU techniques