mantle for developers
DESCRIPTION
Keynote presentation about Mantle by Johan Andersson at AMD Developer Summit 2013 (APU13)
TRANSCRIPT
MANTLE FOR DEVELOPERS
JOHAN ANDERSSON – TECHNICAL DIRECTOR
FROSTBITE
ELECTRONIC ARTS
Simplify advanced development
Improve performance
Enable developers to innovate
Challenge the status quo
Mantle?
Control
GPU performance
CPU performance
Programmability
Platforms
Developer impact areas
Traditional Model: Black Box
‒ Middle-ground abstraction – compromise between performance & “usability”
‒ Hidden resource memory & state
‒ Resource CPU access tied to device context
‒ Driver analyzes & synchronizes implicitly
Explicit Model: Mantle
‒ Thin low-level abstraction to expose how hardware works
‒ App explicit memory management
‒ Resources are globally accessible
‒ App explicit resource state transitions
Control
New model
Tell when render target will be used as a texture
‒ And many more resource state transitions
Don’t destroy resources that GPU is using
‒ Keep track with fences or frames
Manual dynamic resource renaming
‒ No DISCARD for driver resource renaming
Resource memory tiling
Powerful validation layer will help!
App responsibility
Control
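The "keep track with fences or frames" responsibility can be sketched as a deferred-destroy queue. This is a minimal illustration in Python, not Mantle API code: `DeferredDestroyQueue`, `destroy_later` and `collect` are hypothetical names, and the sketch assumes resources are queued in non-decreasing frame order.

```python
from collections import deque


class DeferredDestroyQueue:
    """Holds resources until the GPU has finished the frame that last used them."""

    def __init__(self):
        # (last_used_frame, resource) pairs, assumed to arrive in frame order.
        self._pending = deque()

    def destroy_later(self, resource, last_used_frame):
        # Record the last frame whose command buffers reference the resource.
        self._pending.append((last_used_frame, resource))

    def collect(self, completed_frame):
        """Return resources that are now safe to free.

        completed_frame is the newest frame the GPU has finished, for example
        the frame whose fence has signaled.
        """
        freed = []
        while self._pending and self._pending[0][0] <= completed_frame:
            freed.append(self._pending.popleft()[1])
        return freed


queue = DeferredDestroyQueue()
queue.destroy_later("shadow_map", last_used_frame=10)
queue.destroy_later("old_vertex_buffer", last_used_frame=11)

still_in_flight = queue.collect(completed_frame=9)   # GPU has not finished frame 10
after_frame_10 = queue.collect(completed_frame=10)
after_frame_11 = queue.collect(completed_frame=11)
```

The same bookkeeping would also cover manual renaming: the old copy of a dynamic buffer is queued for destruction (or recycling) while a fresh copy is written for the next frame.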
App high-level decisions & optimizations
‒ Has full scene information
‒ Easier to optimize performance & memory
Flexible & efficient memory management
‒ Linear frame allocators
‒ Memory pools
‒ Pinned memory
Reduced development time
‒ For advanced game engines & apps
‒ Easier to get to target performance & robustness
Explicit control enables
Control
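A linear frame allocator, one of the strategies listed above, might look roughly like this. The sketch only models offsets into one pre-allocated block; the class name, the 256-byte default alignment and the capacity are assumptions for illustration.

```python
class LinearFrameAllocator:
    """Bump allocator over one block of memory that is reset every frame."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0

    def allocate(self, size, alignment=256):
        # Round the current offset up to the requested alignment.
        start = (self.offset + alignment - 1) // alignment * alignment
        if start + size > self.capacity:
            raise MemoryError("frame allocator exhausted")
        self.offset = start + size
        return start

    def reset(self):
        # Only safe once the GPU no longer reads this frame's data.
        self.offset = 0


allocator = LinearFrameAllocator(capacity=1024)
first = allocator.allocate(100)
second = allocator.allocate(100)
allocator.reset()
reused = allocator.allocate(64)
```

Allocation is a single add and compare, and freeing is one reset per frame, which is why this pattern suits per-frame constants and other short-lived data.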
Light-weight driver
‒ Easier to develop & maintain
‒ Reduced CPU draw call overhead
Transient resources
‒ Alias render targets within frame
‒ Major memory savings
‒ No need to pre-allocate everything
Explicit control enables
Control
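Aliasing render targets within a frame can be sketched as assigning targets to shared memory slots whenever their lifetimes do not overlap. The function below is a simple greedy illustration; the target names, sizes and pass indices are made up, and a real engine would also need the resource state transitions described earlier.

```python
def alias_render_targets(targets):
    """Assign each (name, size, first_pass, last_pass) target to a memory slot.

    Two targets may share a slot when their pass ranges do not overlap.
    Returns (assignment, slot_sizes).
    """
    slots = []        # each slot: {"size": int, "free_after": int}
    assignment = {}
    for name, size, first_pass, last_pass in sorted(targets, key=lambda t: t[2]):
        for index, slot in enumerate(slots):
            if slot["free_after"] < first_pass:
                # Previous user is done: reuse the slot, growing it if needed.
                slot["size"] = max(slot["size"], size)
                slot["free_after"] = last_pass
                assignment[name] = index
                break
        else:
            slots.append({"size": size, "free_after": last_pass})
            assignment[name] = len(slots) - 1
    return assignment, [slot["size"] for slot in slots]


# Hypothetical frame: sizes in MB, lifetimes as pass indices.
targets = [
    ("gbuffer", 32, 0, 2),
    ("shadow_map", 16, 1, 2),
    ("bloom", 8, 3, 4),
    ("tonemap", 32, 4, 5),
]
assignment, slot_sizes = alias_render_targets(targets)
aliased_total = sum(slot_sizes)
unaliased_total = sum(size for _, size, _, _ in targets)
```

In this made-up frame the four targets fit in two slots, 64 MB instead of 88 MB.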
CPU performance
CPU perf
Descriptor sets
Monolithic pipelines
Command buffers
Core concepts
Table with resource references to bind to graphics or compute pipeline
Replaces traditional resource stage binding
‒ Major performance & flexibility advantage
‒ Closer to how the hardware works
App managed - lots of strategies possible!
‒ Tiny vs huge sets
‒ Single vs multiple
‒ Static vs semi-static vs dynamic
Example 1: Single simple dynamic descriptor set
‒ Bind everything you need for a single draw call
‒ Close to DX/GL model but share between stages
Descriptor sets
CPU perf
[Diagram: dynamic descriptor set with slots VertexBuffer (VS), Texture0 (VS+PS), Constants (VS), Texture1 (PS), Texture2 (PS), Sampler0 (VS+PS); legend: Link, Sampler, Image, Memory]
Example 2: Reuse static set with nesting
‒ Reduce update time & memory usage
[Diagram: a dynamic descriptor set linked to a static descriptor set. Slot labels: Constants (VS), Link, Texture3 (PS), Texture4 (PS), Sampler0 (VS+PS), Texture2 (PS), Texture1 (PS), Sampler1 (PS), VertexBuffer (VS), Texture0 (VS+PS); legend: Link, Sampler, Image, Memory]
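The nesting idea from Example 2 can be illustrated with a small model in which a slot is either a resource or a link to another set. This is a sketch only: `Link`, `resolve` and `build_dynamic_set` are hypothetical names, and the split of resources between the static and dynamic set is an assumption rather than the layout on the slide.

```python
class Link:
    """A descriptor slot that points at another descriptor set."""

    def __init__(self, target):
        self.target = target


def resolve(descriptor_set):
    """Flatten a descriptor set, following links, into a list of resource names."""
    resources = []
    for slot in descriptor_set:
        if isinstance(slot, Link):
            resources.extend(resolve(slot.target))
        else:
            resources.append(slot)
    return resources


# Built once and shared by many draw calls (assumed contents).
static_set = ["VertexBuffer", "Texture0", "Texture1", "Texture2", "Sampler1"]


def build_dynamic_set(constants):
    # Only the small per-draw part is rebuilt; the static set is referenced.
    return [constants, Link(static_set), "Texture3", "Texture4", "Sampler0"]


draw_a = build_dynamic_set("ConstantsA")
draw_b = build_dynamic_set("ConstantsB")
```

Each draw updates five slots instead of nine, and both draws reference the same static set, which is where the reduced update time and memory usage would come from.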
CPU perf
Shader stages & select graphics state combined into single object
‒ No runtime compilation or patching needed!
‒ Significantly less runtime overhead to use
Supports parallel building & caching
‒ Fast loading times
Usage & management up to the app
‒ Static vs dynamic creation
‒ Amount of pipelines
‒ State usage
Monolithic pipelines
[Diagram: pipeline state spanning the stages IA, VS, HS, Tessellator, DS, GS, RS, PS, DB, CB]
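One way an app might manage monolithic pipelines is a cache keyed by shaders plus state, built in parallel at load time. The sketch below assumes a hypothetical `compile_fn` callback standing in for pipeline creation; the shader and state names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


class PipelineCache:
    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._pipelines = {}

    @staticmethod
    def key(vertex_shader, pixel_shader, state):
        # Shaders plus the selected state identify one monolithic pipeline.
        return (vertex_shader, pixel_shader, tuple(sorted(state.items())))

    def prebuild(self, descriptions, workers=4):
        """Compile all distinct pipelines up front, in parallel."""
        keys = list(dict.fromkeys(self.key(*d) for d in descriptions))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for key, pipeline in zip(keys, pool.map(self._compile, keys)):
                self._pipelines[key] = pipeline

    def get(self, vertex_shader, pixel_shader, state):
        # At draw time this is a lookup, with no compilation or patching.
        return self._pipelines[self.key(vertex_shader, pixel_shader, state)]


compile_calls = []


def fake_compile(key):
    # Stand-in for real pipeline creation.
    compile_calls.append(key)
    return "pipeline:" + key[0] + "+" + key[1]


descriptions = [
    ("mesh_vs", "opaque_ps", {"blend": "off", "depth_test": "less"}),
    ("mesh_vs", "glass_ps", {"blend": "alpha", "depth_test": "less"}),
    ("mesh_vs", "opaque_ps", {"depth_test": "less", "blend": "off"}),  # same pipeline
]
cache = PipelineCache(fake_compile)
cache.prebuild(descriptions)
pipeline = cache.get("mesh_vs", "glass_ps", {"blend": "alpha", "depth_test": "less"})
```

Whether pipelines are created statically like this or on demand is one of the choices the slide leaves to the app.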
Issue pipelined graphics & compute commands into a command buffer
‒ Bind graphics state, descriptor sets, pipeline
‒ Draw calls
‒ Render targets
‒ Clears
‒ Memory transfers
‒ NOT: resource mapping
Fully independent objects
‒ Create multiple every frame
‒ Or pre-build up front and reuse
Command buffers
CPU perf
Automatically extracts parallelism out of most apps
Doesn’t scale beyond 2-3 cores
Additional latency
Driver thread often bottleneck – can collide with app threads
[Diagram: Game, Render and Driver work spread over CPU 0, CPU 1 and CPU 2]
CPU perf
DX/GL parallelism
App can go fully wide with its rendering – minimal latency
Close to linear scaling with CPU cores
No driver threads – no overhead – no contention
Frostbite’s approach on all consoles – and on PC with Mantle!
[Diagram: Game and Render work spread across CPU 0 through CPU 4, with no driver thread]
CPU perf
Parallel dispatch with Mantle
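The structure of parallel dispatch can be sketched as follows: each worker records its own independent command buffer, and the buffers are then submitted in a fixed order. The function names and scene chunks are hypothetical, and Python threads are used only to show the shape of the code; they do not demonstrate the near-linear scaling a native engine would see.

```python
from concurrent.futures import ThreadPoolExecutor


def record_command_buffer(draw_calls):
    """Record one independent command buffer; no shared state is touched."""
    commands = ["bind_pipeline"]
    for draw in draw_calls:
        commands.append("draw " + draw)
    return commands


def record_frame(scene_chunks, workers):
    # Each worker records its own command buffer. pool.map returns results in
    # input order, so submission order stays deterministic.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(record_command_buffer, scene_chunks))


def submit(command_buffers):
    # Stand-in for a single queue submission of all recorded buffers.
    return [command for buffer in command_buffers for command in buffer]


chunks = [["terrain", "trees"], ["soldiers"], ["vehicles", "particles"]]
buffers = record_frame(chunks, workers=3)
submitted = submit(buffers)
```

Because command buffers are fully independent objects, no lock or driver thread sits between the workers; ordering is decided only at submission.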
GPU performance
GPU perf
Thanks to improved CPU performance – CPU will rarely be a bottleneck for the GPU
‒ CPU could help GPU more:
  ‒ Less brute force rendering
  ‒ Improve culling
Shader pipeline object – driver optimizations
‒ Can optimize with pipeline state knowledge
‒ Can optimize across all shader stages
Resource states
‒ Gives driver a lot more knowledge & flexibility
‒ Apps can avoid expensive/redundant transitions, such as surface decompression
Expose existing GPU functionality
‒ Quad & Rect-lists
‒ HW-specific MSAA & depth data access
‒ Programmable sample patterns
‒ And more..
GPU optimizations
Modern GPUs are heterogeneous machines with multiple engines
‒ Graphics pipeline
‒ Compute pipeline(s)
‒ DMA transfer
‒ Video encode/decode
‒ More…
Mantle exposes queues for the engines + synchronization primitives
Queues
GPU perf
[Diagram: Graphics, Compute, DMA and further queues feeding the GPU]
Async DMA transfers
‒ Copy resources in parallel with graphics or compute
Queue use cases
GPU perf
[Diagram: the DMA queue runs Copy while the Graphics queue runs Render, Other render and Use copy]
Async compute together with graphics
‒ ALU heavy compute work at the same time as memory/ROP bound work to utilize idle units
[Diagram: the Compute queue runs Non-shadowed lighting while the Graphics queue runs GBuffer, Shadowmap 0, Shadowmap 1 and Final lighting]
Multiple compute kernels collaborating
‒ Can be faster than über-kernel
‒ Example: Compute geometry backend & compute rasterizer
[Diagram: Compute 0 runs Compute Geometry, Compute 1 runs Compute Rasterizer, Graphics runs Ordinary Rendering]
[Diagram: the Compute queue runs Process steps while the Graphics queue runs Draw0, Draw1 and Draw2]
Compute as frontend for graphics pipeline
‒ Compute runs asynchronously ahead and prepares & optimizes geometry for graphics pipeline
Game engines will build large GPU job graphs
‒ Move away from single sequential submission
‒ Just as we already have done on CPU
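A GPU job graph spread over several queues can be modeled with a few lines of scheduling logic. The sketch below is an illustration only: the job names follow the async compute example, the durations are invented, and `schedule` is a hypothetical helper that treats dependencies as the synchronization points between queues.

```python
def schedule(jobs):
    """Compute start/end times for jobs on per-engine queues.

    jobs maps name -> (queue, duration, dependencies) and must be listed in a
    valid submission order (dependencies before dependents). A job starts when
    its queue is free and all of its dependencies have finished.
    """
    queue_free_at = {}
    finish = {}
    timeline = {}
    for name, (queue, duration, dependencies) in jobs.items():
        ready = max((finish[dep] for dep in dependencies), default=0)
        start = max(ready, queue_free_at.get(queue, 0))
        finish[name] = start + duration
        queue_free_at[queue] = finish[name]
        timeline[name] = (queue, start, finish[name])
    return timeline


# Hypothetical durations in milliseconds.
jobs = {
    "gbuffer":              ("graphics", 4, []),
    "shadowmap0":           ("graphics", 2, []),
    "shadowmap1":           ("graphics", 2, []),
    "nonshadowed_lighting": ("compute", 3, ["gbuffer"]),
    "final_lighting":       ("graphics", 3, ["shadowmap1", "nonshadowed_lighting"]),
}
timeline = schedule(jobs)
frame_time = max(end for _, _, end in timeline.values())
serial_time = sum(duration for _, duration, _ in jobs.values())
```

With these made-up numbers the non-shadowed lighting overlaps the shadow map passes, so the frame takes 11 ms instead of the 14 ms a single sequential submission would need.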
Programmability
Programmability
Explicit control of GPU queues and synchronization, finally!
‒ Implement your own Alternate-Frame-Rendering
‒ Or something more exotic..
Use case: Workstation rendering with 4-8 GPUs
‒ Super high-quality rendering & simulation
‒ Load balance graphics & compute job graphs across GPUs
‒ 20-40 TFlops in a single machine!
Use case: Low-latency rendering
‒ Important for VR and competitive games
‒ Latency optimized GPU job graph scheduling
‒ VR: Simultaneously drive 2 GPUs (1 per eye)
Explicit Multi-GPU
Programmability
Command buffer predication & flow control
‒ GPU affecting/skipping submitted commands
‒ Go beyond DrawIndirect / DispatchIndirect
‒ Advanced variable workloads
‒ Advanced culling optimizations
Write occlusion query results into GPU buffer
‒ No CPU roundtrip needed
‒ Can drive predicated rendering
‒ Or use results directly in shaders (lens flares)
New mechanisms
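How occlusion results in a GPU buffer could drive predication can be shown with a toy command interpreter. Everything here is a stand-in: the command names are invented, and the visible sample counts are given as literals rather than produced by rasterization.

```python
def gpu_execute(commands, gpu_buffer):
    """Tiny interpreter standing in for the GPU front end."""
    executed = []
    skip_until_end = False
    for command in commands:
        op = command[0]
        if op == "end_predicate":
            skip_until_end = False
        elif skip_until_end:
            continue
        elif op == "write_occlusion":
            # Query result lands in GPU memory, not on the CPU.
            _, slot, visible_samples = command
            gpu_buffer[slot] = visible_samples
        elif op == "begin_predicate":
            # Skip the following commands when the stored result is zero.
            skip_until_end = gpu_buffer[command[1]] == 0
        elif op == "draw":
            executed.append(command[1])
    return executed


commands = [
    ("write_occlusion", 0, 120),   # bounding box of the castle is visible
    ("write_occlusion", 1, 0),     # bounding box of the cave is hidden
    ("begin_predicate", 0), ("draw", "castle"), ("end_predicate",),
    ("begin_predicate", 1), ("draw", "cave"), ("end_predicate",),
    ("draw", "sky"),
]
gpu_buffer = [0, 0]
drawn = gpu_execute(commands, gpu_buffer)
```

The CPU submits the whole list once and never reads the query results back; the skip decision is made while the commands execute.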
Programmability
Mantle supports bindless resources
‒ Shaders can select resources to use instead of static binding from CPU
‒ Extension of the descriptor set support
Key component that will open up a lot of opportunities!
Examples
‒ Performance optimizations – less data to update
‒ Logic & data structures that live fully on the GPU
  ‒ Scene culling & rendering
‒ Material representations
  ‒ Deferred shading
  ‒ Raytracing
Bindless resources
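The bindless idea, where the shader picks its resources from data instead of the CPU binding them per draw, can be sketched like this. The table contents, the `albedo_index` field and the `shade` function are all hypothetical and only model the lookup pattern.

```python
# Table of all textures, resident for the whole frame (hypothetical names).
texture_table = ["brick_albedo", "metal_albedo", "grass_albedo"]

# Per-material data lives in a GPU-side buffer; it stores indices, not bindings.
material_buffer = [
    {"albedo_index": 2},   # material 0: grass
    {"albedo_index": 0},   # material 1: brick
]


def shade(material_id):
    """Stand-in for a shader that picks its own texture at run time."""
    material = material_buffer[material_id]
    return texture_table[material["albedo_index"]]


def draw_scene(object_material_ids):
    binds = 1  # the table is bound once, not once per object
    shaded = [shade(material_id) for material_id in object_material_ids]
    return shaded, binds


shaded, binds = draw_scene([0, 1, 1, 0])
```

Changing an object's material then means writing a new index into a buffer, which is the "less data to update" benefit listed above.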
Platforms
Mantle gives us strong benefits on Windows today
‒ Console-like performance & programmability on both Windows 7 and Windows 8
‒ For us, well worth the dev time!
DX & GL are the industry standards
‒ Needed for platforms that do not support Mantle
‒ Needed by devs who do not want/need more control
‒ Have to have fallback paths for GL/DX, but not limit oneself to them
Mantle and PlayStation 4 will drive our future Frostbite designs & optimizations
‒ PS4 graphics API has great programmability & performance as well
‒ Share concepts, methods & optimization strategies
Today
Platforms
Want to see Mantle on Linux and Mac!
‒ Would enable support for our full engine & rendering
‒ Significantly easier to do efficient renderer with Mantle than with OpenGL
Use cases:
‒ Workstations
‒ R&D
  ‒ Not limited by WDDM
‒ Games
  ‒ Mantle + SteamOS = powerful combination!
Linux & Mac
Platforms
Mobile architectures are getting closer in capabilities to desktop GPUs
Want graphics API that allows apps to fully utilize the hardware
‒ Power efficient
‒ High performance
‒ Programmable
Major opportunity with Mantle – leapfrog GL4, DX11
‒ For mobile SoC vendors
‒ For Google and Apple
Mobile
Platforms
Mantle is designed to be a thin hardware abstraction
‒ Not tied to AMD’s GCN architecture
‒ Forward compatible
‒ Extensions for architecture- and platform-specific functionality
Mantle would be a much more efficient graphics API for other vendors as well
‒ Most Mantle functionality can be supported on today’s modern GPUs
Want to see future version of Mantle supported on all platforms and on all modern GPUs!
‒ Become an active industry standard with IHVs and ISVs collaborating
‒ Enable us developers to innovate with great performance & programmability everywhere
Multi-vendor?
Platforms
Mantle support is in development
‒ Core renderer (closer to PS4 than DX11)
‒ Implement all rendering techniques used in BF4 (many!)
‒ CPU optimizations (parallel dispatch, descriptor sets)
‒ GPU optimizations (minimize transitions, MSAA)
‒ R&D for advanced GPU optimizations
‒ Memory management
‒ Multi-GPU support
‒ ~2 months of work
Update targeting late December
Battlefield 4
Frostbite
Very different rendering compared to BF4
Frostbite Mantle renderer will work out of the box
Focus on APU performance
Plants vs Zombies: Garden Warfare
Frostbite
All Frostbite games designed with Mantle
‒ 15 games in development across all of EA
Advanced Mantle rendering & use cases
‒ Lots of exciting R&D opportunities!
Want multi-vendor & multi-platform support!
Future
Frostbite
THE END
Email: [email protected]
Web: http://frostbite.com
Twitter: @repi