Piko: A Framework for Authoring Programmable Graphics Pipelines
Anjul Patney and Stanley Tzeng (UC Davis and NVIDIA)
Kerry A. Seitz, Jr. and John D. Owens (UC Davis)
What does an efficient graphics pipeline look like?
Renderer               Platform       Algorithm
Unreal Engine 4        GPU            Rasterization with deferred shading
Unity 5                GPU            Rasterization with forward / deferred shading
Disney Hyperion        Multicore CPU  Path tracing with deferred shading
Pixar RenderMan        Multicore CPU  Reyes with path tracing
Solid Angle Arnold     Multicore CPU  Path tracing
Media Molecule Dreams  GPU            Point-based rendering with deferred shading
Problem
Efficient graphics pipeline implementations are hard to write and the design space is hard to explore.
Vision
[Diagram: a graph of pipeline stages (Stage A to Stage F) mapped through an unknown layer, marked "?", onto CPU and GPU]
• High-level programmability
• High performance
• Flexibility
Existing Work
Software Pipelines on GPUs
CudaRaster, RenderAnts, VoxelPipe, FreePipe
OptiX and Embree
Programmable engines for accelerating ray tracing on specific platforms.
GRAMPS
• Introduces flexible graphics pipelines
• Abstracts stages in classes
• Abstracts communication by queues
[Sugerman et al. 2009]
Halide
• Introduces programmable image pipelines
• Applies well to shorter and more regular image-processing pipelines
[Ragan-Kelley et al. 2012]
What are the fundamentals of high performance?
• Parallelism
• Execution locality
• Data locality
• Producer-consumer locality
Spatial tiling
Efficient graphics pipelines utilize spatial tiling
• Packet ray tracing
• SIMD fragment shading on GPUs
• Tiled rendering on mobile GPUs
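To make the idea of spatial tiling concrete, here is a minimal C++ sketch of how a primitive's screen-space bounding box could be mapped to the tiles it overlaps. It is not taken from Piko: the names `BBox` and `overlappedTiles`, the inclusive pixel bounds, and the row-major tile numbering are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical screen-space bounding box in pixels (inclusive bounds).
struct BBox { int minX, minY, maxX, maxY; };

// Return the IDs of all tiles that `box` overlaps. Tiles are numbered in
// row-major order over a screenW x screenH pixel screen (an assumption).
std::vector<int> overlappedTiles(const BBox& box, int screenW, int screenH,
                                 int tileW, int tileH) {
    assert(screenW > 0 && screenH > 0 && tileW > 0 && tileH > 0);
    std::vector<int> tiles;

    // A box that lies entirely off-screen touches no tile.
    if (box.maxX < 0 || box.maxY < 0 ||
        box.minX >= screenW || box.minY >= screenH)
        return tiles;

    const int tilesX = (screenW + tileW - 1) / tileW;  // tiles per row, rounded up

    // Clamp the box to the screen, then convert pixel to tile coordinates.
    const int x0 = std::max(box.minX, 0) / tileW;
    const int y0 = std::max(box.minY, 0) / tileH;
    const int x1 = std::min(box.maxX, screenW - 1) / tileW;
    const int y1 = std::min(box.maxY, screenH - 1) / tileH;

    for (int ty = y0; ty <= y1; ++ty)
        for (int tx = x0; tx <= x1; ++tx)
            tiles.push_back(ty * tilesX + tx);
    return tiles;
}
```

With 8x8 tiles on a 32x32 screen, a box spanning pixels (6,6) to (9,9) straddles a tile corner and lands in four tiles, so work on it can be grouped with other primitives that touch the same tiles.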
Vision
[Diagram: a graph of pipeline stages (Stage A to Stage F) mapped by Piko onto CPU and GPU]
• High-level programmability
• High performance
• Flexibility
System Walkthrough
[Diagram: pikoc toolchain. Components: Pipe Description (Piko), pikoc, Pipe Implementation (C++ / PTX), Host Interface (C++), Host Code (C++), CPU Compiler, Device Compiler, Executable]
• Host Code (C++) is device-independent.
• The Pipe Description (Piko) is the pipeline description (a graph of stages).
• pikoc is a Clang- and LLVM-based infrastructure.
Problem
Efficient graphics pipeline implementations are hard to write and the design space is hard to explore.
Approach
Use programmable spatial tiling to help author efficient and flexible graphics pipelines.
Programmable Spatial Tiling
We need three answers from the pipeline author:
• How does data map to spatial tiles? AssignTile( )
• How do we schedule tiles at runtime? Schedule( )
• What do we compute for each tile? Process( )
Each stage in a pipeline consists of these three "phases."
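The three phases can be illustrated with a small CPU-only C++ sketch. This is a toy model, not Piko's runtime: `Point`, `ToyStage`, and `runStage` are hypothetical names, the 8x8 tiles on a 32-pixel-wide screen are arbitrary, and the round-robin tile-to-core policy is an assumption chosen for simplicity.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical input record: a point with integer pixel coordinates.
struct Point { int x, y; };

// A toy stage exposing the three phases as member functions.
struct ToyStage {
    static constexpr int kTile = 8;    // tile edge in pixels (assumed)
    static constexpr int kTilesX = 4;  // tiles per row for a 32-pixel-wide screen

    // AssignTile: which tile does this input belong to?
    int assignTile(const Point& p) const {
        return (p.y / kTile) * kTilesX + (p.x / kTile);
    }
    // Schedule: which core runs this tile? Round-robin is an assumption.
    int schedule(int tileID, int numCores) const { return tileID % numCores; }
    // Process: what is computed for each input? Here, a stand-in value.
    int process(const Point& p) const { return p.x + p.y; }
};

// Run the phases in order and return, per core, the outputs it produced.
std::vector<std::vector<int>> runStage(const ToyStage& stage,
                                       const std::vector<Point>& inputs,
                                       int numCores) {
    assert(numCores > 0);

    // Phase 1: bin every input into its tile.
    std::map<int, std::vector<Point>> bins;  // tileID -> inputs in that tile
    for (const Point& p : inputs) bins[stage.assignTile(p)].push_back(p);

    // Phases 2 and 3: pick a core for each populated tile, then process it.
    std::vector<std::vector<int>> perCore(numCores);
    for (const auto& bin : bins) {
        const int core = stage.schedule(bin.first, numCores);
        for (const Point& p : bin.second)
            perCore[core].push_back(stage.process(p));
    }
    return perCore;
}
```

Keeping the three decisions in separate functions is what lets a tool reason about each one independently, for example by comparing the tiling or scheduling choices of neighboring stages.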
[Diagram: Stage A, Stage B, and Stage C, each made of AssignTile, Schedule, and Process phases]
[Diagram: flow through one stage. Input Scene -> Input Primitives -> A -> Populated Bins -> S -> Execution Cores -> P -> Final Output. Legend: A = AssignTile (AssignBin in the original legend), S = Schedule, P = Process]
Phases help identify optimization opportunities.
[Diagram: Stage A, Stage B, Stage C, Stage D]
• Identical tile size
• Identical data-to-tile mapping (identical AssignTile result)
• Identical tile-to-core mapping (identical Schedule result)
Stages can be fused into one.
[Diagram: Stage A, fused Stage B + Stage C, Stage D]
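The fusion test described above can be sketched as a comparison of per-stage tiling choices. The C++ below is a simplified illustration, not pikoc's actual analysis: `StageConfig`, `canFuse`, and `fuseStages` are hypothetical names, the policies are reduced to string labels, and the pipeline is assumed to be a linear chain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of the tiling-related choices a stage makes.
struct StageConfig {
    std::string name;
    int tileW, tileH;            // tile size in pixels
    std::string assignPolicy;    // label for the data-to-tile mapping
    std::string schedulePolicy;  // label for the tile-to-core mapping
};

// Treat two adjacent stages as fusable when tile size, data-to-tile mapping,
// and tile-to-core mapping all match. A real compiler may need more checks.
bool canFuse(const StageConfig& a, const StageConfig& b) {
    assert(a.tileW > 0 && a.tileH > 0 && b.tileW > 0 && b.tileH > 0);
    return a.tileW == b.tileW && a.tileH == b.tileH &&
           a.assignPolicy == b.assignPolicy &&
           a.schedulePolicy == b.schedulePolicy;
}

// Greedily merge runs of adjacent fusable stages in a linear pipeline and
// return one name per resulting kernel, such as "B+C".
std::vector<std::string> fuseStages(const std::vector<StageConfig>& stages) {
    std::vector<std::string> kernels;
    for (std::size_t i = 0; i < stages.size(); ++i) {
        if (i > 0 && canFuse(stages[i - 1], stages[i]))
            kernels.back() += "+" + stages[i].name;
        else
            kernels.push_back(stages[i].name);
    }
    return kernels;
}
```

For a four-stage chain in which only B and C agree on all three choices, this yields three kernels: A, B+C, and D.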
Phases help explore pipeline implementations
[Diagram series: a pipeline of Vertex Shade (VS), Geometry Shade (GS), Raster (Rst), Fragment Shade (FS), and Composite (Cmp), with each stage drawn as four parallel instances in alternative arrangements]
Evaluation
Piko pipelines are easy to express and customize
[Figure: four example pipelines; the stage boxes drawn for each are listed below, edges not shown]
Triangle Raster: VS, Rast, FS, Setup, Comp
Stereo Raster: VS, Rast, FS, Setup, Comp, FS, Comp
Reyes: Split, Dice, Sample, Shade, Comp
Raster-Raytrace: VS, Rast, Trace, Setup, FS, Comp
Piko lets us explore implementation alternatives
Three implementations of the raster pipeline (VS, Setup, Rast, FS, Comp):
• No tiling, complete stage fusion
• Tiling with fusion
• Tiling with no fusion
[Plot: relative frame time vs. shader complexity (# lights) for the Fairy Forest scene; series: NVIDIA GPU, Multicore CPU; a Baseline is marked]
Piko enables high-performance code generation
[Plot: rendering time (ms) for cudaraster [Laine and Karras 2011] vs. Piko Raster on the Fairy Forest, Buddha, Mecha, and Dragon scenes]
Performance is within 3.3-5.5x of hand-optimized code.
Piko enables high-performance code generation
[Plot: split performance (Mpatches / second) for Micropolis [Weber et al. 2015] vs. Piko Reyes]
Split performance is within 30% of hand-optimized GPU Reyes.
Summary
Piko enables programmability and performance
[Diagram: a graph of pipeline stages (Stage A to Stage F) mapped by Piko onto CPU and GPU]
• High-level programmability
• High performance
• Flexibility
Our work is not done
Piko can be improved
• Utilization of shared local memory
• Support for dynamic scheduling of pipeline work
The search for a graphics abstraction is not over
• Do tiles have to be 2D, uniform, and limited to one configuration per stage?
• Are there other abstractions that enable high-level programmability and achieve high performance?
Acknowledgments
Discussions and advice: Tim Foley, Jonathan Ragan-Kelley, Aaron Lefohn, Matt Pharr, Mark Lacey, Kayvon Fatahalian, Bill Mark, Marco Salvi, Chuck Lingle, Jason Mak, Edmund Yan, Calina Copos, Mike Steffen, Alex Elkman
NVVM help: Vinod Grover, Sean Lee
Financial support: Intel Science and Technology Center (VC), NVIDIA Research Fellowship, Intel Ph.D. Fellowship, National Science Foundation Fellowship, NVIDIA, AMD, NSF, UC Lab Fees
Assets: AMD, Intel (Project Offset), Ingo Wald, Bay Raitt, Stanford
Thank you!
github.com/piko-dev/piko-public
Extra Slides
RasterPipe pipe;
pipe.allocate(...);
pipe.prepare();
pipe.run_single();
unsigned* pixels = pipe.pikoScreen.getData();
glDrawPixels(screenW, screenH, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
Host Code is device independent.
Unmodified C++
A pipeline is a C++ class declaration.
class RasterPipe : public PikoPipe {
  VertexShaderStage vertexShader_;
  RasterStage raster_;
  PikoScreen pikoScreen_;
  ...
  RasterPipe() {
    pikoConnect(vertexShader_, raster_, 0, 0);
  }
  ...
};
Connections indicate pipeline structure.
Stages are instantiated as objects.
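Because connections define the pipeline structure, a pipeline can be viewed as a directed graph of stages. The sketch below is a hypothetical stand-in, not Piko's `PikoPipe`: `ToyPipe`, `addStage`, `connect`, and `executionOrder` are invented names, stages are reduced to strings, and the two trailing arguments seen in `pikoConnect` are omitted because their meaning is not stated here. It derives an order of stages that respects every connection using Kahn's topological sort.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// A toy pipeline: stages are nodes, connections are directed edges.
class ToyPipe {
public:
    // Register a stage and return its index.
    std::size_t addStage(const std::string& name) {
        names_.push_back(name);
        successors_.emplace_back();
        return names_.size() - 1;
    }

    // Record that the output of `from` feeds the input of `to`.
    void connect(std::size_t from, std::size_t to) {
        assert(from < names_.size() && to < names_.size());
        successors_[from].push_back(to);
    }

    // Kahn's algorithm: stage names in an order that respects every
    // connection. The result is shorter than the stage count if the graph
    // contains a cycle.
    std::vector<std::string> executionOrder() const {
        std::vector<std::size_t> pending(names_.size(), 0);  // unmet inputs
        for (const auto& outs : successors_)
            for (std::size_t to : outs) ++pending[to];

        std::queue<std::size_t> ready;
        for (std::size_t s = 0; s < names_.size(); ++s)
            if (pending[s] == 0) ready.push(s);

        std::vector<std::string> order;
        while (!ready.empty()) {
            const std::size_t s = ready.front();
            ready.pop();
            order.push_back(names_[s]);
            for (std::size_t to : successors_[s])
                if (--pending[to] == 0) ready.push(to);
        }
        return order;
    }

private:
    std::vector<std::string> names_;
    std::vector<std::vector<std::size_t>> successors_;
};
```

Stages can be registered in any order; only the connections determine the order returned.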
Each phase is a member function.
class RasterStage : public Stage<8, 8, 32, raster_stri, Pixel> {
  inline void AssignTile(raster_stri p) {
    ...
    this->assignToBin(p, binID);
    ...
  }
  inline void schedule(int binID) {
    this->specifySchedule(LOAD_BALANCE);
  }
  inline void process(raster_stri p) {
    ...
    this->emit(Pixel(pos, color), 0);
    ...
  }
};
A stage is a C++ class definition.
Built-in routines identify common scenarios.
Templates specify tiling configuration.
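As an illustration of how template parameters can carry a tiling configuration, here is a self-contained C++ sketch. It is not Piko's `Stage` template. Reading the five parameters as tile width, tile height, threads per tile, input type, and output type is an assumption, and `TiledStage`, `ToyRasterStage`, `Triangle`, and `Fragment` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical stand-ins for a stage's input and output record types.
struct Triangle { float x[3], y[3]; };
struct Fragment { int x, y; unsigned color; };

// A toy base class whose template parameters carry the tiling configuration.
// The meaning given to each parameter here is assumed, not documented.
template <int TileW, int TileH, int ThreadsPerTile, typename In, typename Out>
struct TiledStage {
    static_assert(TileW > 0 && TileH > 0, "tile dimensions must be positive");
    static_assert(ThreadsPerTile > 0, "need at least one thread per tile");

    using Input = In;
    using Output = Out;

    static constexpr int tileWidth = TileW;
    static constexpr int tileHeight = TileH;
    static constexpr int threadsPerTile = ThreadsPerTile;

    // Number of tiles needed to cover a screen, rounding partial tiles up.
    static constexpr int tileCount(int screenW, int screenH) {
        return ((screenW + TileW - 1) / TileW) *
               ((screenH + TileH - 1) / TileH);
    }

    // Pixels each thread covers if a tile is split evenly (rounded up).
    static constexpr int pixelsPerThread() {
        return (TileW * TileH + ThreadsPerTile - 1) / ThreadsPerTile;
    }
};

// A concrete stage only has to pick its configuration.
struct ToyRasterStage : TiledStage<8, 8, 32, Triangle, Fragment> {};
```

Because the configuration is part of the type, it is available at compile time, so a change of tile size only touches the template arguments.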
pikoc implements the pipeline description.
[Diagram: pikoc internals. Components: Pipeline, Stages, clang, pikoc frontend, Kernel plan, pikoc backend, Host Interface, Pipe Implementation, clang, libNVVM]
Frontend walks the AST and performs high-level optimizations.
Backend uses LLVM to generate optimized device code.
WIP Slides