DIRECT3D AND THE FUTURE OF GRAPHICS APIS
Dave Oldcorn, AMD
Dan Baker, Oxide Games
Johan Andersson, EA / DICE
2| AMD Direct3D Futures | March 20th, 2014
NITROUS AND DX12
Dan Baker, Partner, Oxide Games
HAVEN’T WE BEEN HERE BEFORE?
Goal of DX9
– Remember state blocks?
Goal of DX10
– Large state groups
Goal of DX11
– Deferred contexts
Are we actually getting faster, or are CPUs just faster?
– Quite possibly no perf improvements due to API features in 10 years
Maybe adding features isn’t the answer…
DEEPLY ROOTED PROBLEM
Coding design philosophies clash with the real world
– OOP, data hiding and polymorphic design clash with task-driven, data-parallel code
Evident in language trends: a striking disconnect between what is considered good code and what is fast
Gap has always been there, but has grown in recent years
– 15 years ago, processors often bound by computation
– Now, usually bound by cache misses, serialization, pipeline stalls, etc.
– Multi-core CPUs are ineffectively utilized
‘Heavy Iron’ (e.g. big objects, opaque memory) is a dead end for performance
The revolt is beginning in high performance graphics APIs, but will spread
BUT… HOW MUCH FASTER?
Biggest problem with industry today: Acceptance
Only 1 secret in API design: That it can be done.
– And isn’t that hard
– And our code isn’t that ugly
Star Swarm already demonstrating what is possible on a PC
D3D12 FEATURES THAT NITROUS USES
True de-coupled multi-core rendering
– Expecting near linear thread scheduling
Manual hazard tracking
– Hazards have been resolved already
Memory heaps
– Bigger chunks of memory pool grouping make management simpler
Descriptor tables
– Table exposure allows a cheaper way of binding textures
– Allows texture bindings to be shared between non-adjacent batches
WHAT’S DIFFERENT NOW?
Then:
Spec Written
Spec Reviewed
API implemented
Released to public
First Engine use
Analysis done
WHAT’S DIFFERENT NOW?
Now:
Create Spec
Implement Spec
Prototype on Actual Engines
Analyze
Discuss with IHVs, ISVs
(Diagram labels: “Start Here”; “If Ready, exit here to prep for release”)
IN THE SPIRIT OF CONTRIBUTING
Oxide is proud to announce that we have a prototype of Nitrous running on D3D12
*PR DISCLAIMER* This is not an official announcement regarding D3D12 support
Porting from other modern APIs is much simpler than porting from D3D11 to D3D12
EXPECTED RESULTS
CPU driver overhead largely put to rest
Huge increases in driver reliability
Huge decreases in frame latency, expecting median frame latency to be 1.5 frames
– Increased perceptual responsiveness
Never a dropped frame or stall due to driver API issues
– *Other OS events could cause stalls
Driver should be far smaller, simpler to implement, IHVs can spend more time on optimizations
DIRECT3D12 AND THE FUTURE OF GRAPHICS APIS
Dave Oldcorn, Direct3D12 Driver Architect, AMD
THE PROBLEM
Mismatch between existing Direct3D and hardware capabilities
– Lots of CPU cores, but only one stream of data
– State communication in small chunks
– “Hidden” work: hard to predict from any one given call what the overhead might be; implicit memory management
– Hardware evolving away from classical register programming
API LANDSCAPE
Gap between PC ‘raw’ 3D APIs and the hardware has opened up
Very high level APIs now ubiquitous; easy to access even for casual developers, plenty of choice
The PC APIs sit in a middle ground
[Figure: API landscape. Axis: capability, ease of use, distance from 3D engine. Labels: Application; Game Engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech); Flash / Silverlight; D3D7/8; D3D9; OpenGL; D3D11; Opportunity; Console APIs; Metal (register level access)]
WHAT ARE THE CONSEQUENCES?
WHAT ARE THE SOLUTIONS?
SEQUENTIAL API
Sequential API: state for given draw comes from arbitrary previous time
Some states must be reconciled on the CPU (“delayed validation”)
– All contributing state needs to be visible
GPU isn’t like this, uses command buffers
– Must save and restore state at start and end
[Figure: API input stream (..., Draw, Set PS CB, Draw x 5, Set VS CB, Draw x 3, Set Blend, Set PS, Set RT state, Draw, Set VS VB, Draw, ...) shown next to the state contributing to a single draw (PS CB, VS CB, blend state, PS, RT state, plus more set earlier)]
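A minimal Python sketch of why a sequential API forces the CPU to see all contributing state. This is not D3D code: the command tuples and slot names such as `"PS CB"` are made up for illustration. It replays a command stream and records the full state each draw ends up using.

```python
def resolve_draw_states(commands):
    """Replay a sequential command stream and return, for every draw,
    a snapshot of all state that contributes to it."""
    current = {}      # state accumulated from every earlier "set" call
    snapshots = []
    for cmd in commands:
        if cmd[0] == "set":
            _, slot, value = cmd
            current[slot] = value            # may affect draws far in the future
        elif cmd[0] == "draw":
            snapshots.append(dict(current))  # state set at arbitrary earlier times
        else:
            raise ValueError(f"unknown command: {cmd!r}")
    return snapshots


# Hypothetical stream, loosely following the slide's diagram.
stream = [
    ("set", "PS CB", "cb0"),
    ("draw",),
    ("set", "VS CB", "cb1"),
    ("draw",),
    ("set", "Blend", "alpha"),
    ("draw",),
]
states = resolve_draw_states(stream)
```

The last draw depends on three separate earlier calls, so nothing can be validated for it until the whole history has been replayed.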
THREADING A SEQUENTIAL API
Sequential API threading
– Simple producer / consumer model: extra latency; buffering has a cost; more threading would mean dividing tasks on a finer grain
– Bottlenecked on application or driver thread: difficult to extract parallelism (Amdahl’s Law)
[Figure: Application (application simulation, prebuild thread 0, prebuild thread 1, application render thread); Runtime / Driver (driver thread); GPU execution queue (queued buffer 0, queued buffer 1, queued buffer 2, ...)]
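The Amdahl’s Law reference can be made concrete with a short Python sketch. The function name and the example fractions are illustrative assumptions, not measurements from any engine.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when only `parallel_fraction` of the work
    can be spread across `threads` (Amdahl's Law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / threads)


# If half of the frame's CPU work is stuck on one render or driver thread,
# extra cores can never deliver more than a 2x speedup.
two_cores = amdahl_speedup(0.5, 2)
many_cores = amdahl_speedup(0.5, 1_000_000)
```

Under this model, the serial render/driver thread sets the ceiling regardless of how many prebuild threads are added.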
COMMAND BUFFER API
GPUs only listen to command buffers
Let the app build them
– Command Lists, at the API level
Solves sequential API CPU issues
[Figure: Application (application simulation; thread 0 and thread 1 each build a command buffer); Runtime / Driver; GPU execution queue (queued buffer 0, queued buffer 1, ...)]
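A structural sketch in Python of application-built command lists: each worker records its own list independently, and the lists are then queued in a fixed order. The function names and the `("draw", id)` command format are hypothetical; Python threads will not show the real CPU speedup, so this only illustrates the shape of the approach.

```python
from concurrent.futures import ThreadPoolExecutor


def build_command_list(draws):
    """Record one command list from a slice of the scene (made-up format)."""
    return [("draw", d) for d in draws]


def build_frame(draw_chunks, workers=4):
    """Build one command list per chunk on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whatever order threads finish in
        return list(pool.map(build_command_list, draw_chunks))


def submit(queue, command_lists):
    """Stand-in for queue submission: append whole lists, in order."""
    for command_list in command_lists:
        queue.extend(command_list)
    return queue


chunks = [[0, 1], [2, 3], [4]]
gpu_queue = submit([], build_frame(chunks))
```

No worker reads another worker's state, which is what removes the single render-thread bottleneck of the sequential model.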
BETTER SCHEDULING
App has much more control over scheduling work
– Both CPU side and GPU
Threads share very little in the way of resources
Many more options for streaming assets
[Figure: two timelines. D3D11: CB building threads tend to interfere (create thread, driver thread; render work, create work, GPU executes). D3D12: CB building threads more independent (create thread, build threads). Caption: GPU load still added but only after queuing]
PIPELINE OBJECTS
Pipeline objects get rid of JIT and enable LTCG for GPUs
Decouple interface and implementation
We’re aware that this is a hairpin bend for many graphics engines to negotiate.
– Many engines don’t think in terms of predicting state up front
– The benefits are worth it
[Figure: Simplified dataflow through pipeline. Stages: Index Process, VS, Primitive Generation, Rasteriser, PS, Rendertarget Output, with three “?” marks on the diagram]
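A minimal Python sketch of the pipeline-object idea: the whole state combination is described up front and “linked” once at create time, so nothing needs compiling at draw time. `PipelineDesc`, `PipelineCache` and the field names are illustrative assumptions, not the D3D12 interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDesc:
    """Everything that must be known up front (hypothetical subset)."""
    vs: str
    ps: str
    blend: str
    rt_format: str


class PipelineCache:
    def __init__(self):
        self._objects = {}
        self.link_count = 0   # how many times the expensive link step ran

    def create(self, desc):
        """Link all stages together once per unique description."""
        if desc not in self._objects:
            self.link_count += 1
            # Stand-in for whole-pipeline compilation and linking.
            self._objects[desc] = (
                f"linked({desc.vs}+{desc.ps}|{desc.blend}|{desc.rt_format})"
            )
        return self._objects[desc]


cache = PipelineCache()
opaque = cache.create(PipelineDesc("vs_main", "ps_lit", "opaque", "rgba8"))
again = cache.create(PipelineDesc("vs_main", "ps_lit", "opaque", "rgba8"))
blended = cache.create(PipelineDesc("vs_main", "ps_lit", "alpha", "rgba8"))
```

The cost for the engine is the one the slide names: it has to predict its state combinations before drawing rather than discover them one call at a time.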
RENDER OBJECT BINDING MISMATCH
Hardware uses tables in video memory
BUT still programmed like a register solution
– So one bind becomes:
  Allocate a new chunk of video memory
  Create a new copy of the entire table
  Update the one entry
  Write the register with the new table base address
[Figure: On-chip root table (1 per stage) holds pointers to tables (SR: here, textures; CB: constant buffers). Each pointer leads to an SRD table in GPU memory, whose entries are pointers to (+ params of) a resource in GPU memory]
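The cost of a single register-style bind on table-based hardware can be sketched in Python as follows. The class, its counters, and the use of a list as “video memory” are stand-ins chosen for illustration; the sequence of operations follows the slide.

```python
class RegisterStyleBinder:
    """Models one bind = allocate + copy whole table + patch + repoint."""

    def __init__(self, table_size):
        self.memory = [[None] * table_size]  # stand-in for video memory
        self.table_base = 0                  # "register": index of live table
        self.entries_copied = 0

    def bind(self, slot, resource):
        old_table = self.memory[self.table_base]
        new_table = list(old_table)          # allocate a chunk, copy the entire table
        self.entries_copied += len(old_table)
        new_table[slot] = resource           # update the one entry
        self.memory.append(new_table)
        self.table_base = len(self.memory) - 1   # write the new base address


binder = RegisterStyleBinder(table_size=8)
binder.bind(2, "texA")
binder.bind(5, "texB")
live_table = binder.memory[binder.table_base]
```

Two single-slot binds copied sixteen entries here, and the copy cost grows with table size rather than with the amount of state that actually changed.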
DESCRIPTOR TABLES
Several tables of each type of resource
– Easy to divide up by frequency
Tables can be of arbitrary size; dynamically indexed to provide bindless textures
Changing a table pointer is cheap
Updating a descriptor in a table is not
[Figure: On-chip table with entries SR.T[0] to SR.T[3], UAV, CB.T[0], CB.T[1] and Samp, each a pointer to a table. SR.T[0] (textures table 0) points to an SRD table in GPU memory holding SR.T[0][0] to SR.T[0][2]; CB.T[1] (constbuf table 1) points to a table holding CB.T[1][0] and CB.T[1][1]]
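A small Python model of descriptor tables divided by update frequency. `DescriptorTables`, its counters and the table names are hypothetical; the point is only that switching a table pointer touches one root entry, while descriptors are written when a table is built.

```python
class DescriptorTables:
    def __init__(self):
        self.tables = {}            # name -> descriptors in "GPU memory"
        self.root = {}              # on-chip root table: slot -> table name
        self.descriptor_writes = 0  # the expensive operation
        self.pointer_writes = 0     # the cheap operation

    def create_table(self, name, descriptors):
        self.tables[name] = list(descriptors)
        self.descriptor_writes += len(descriptors)

    def set_table(self, slot, name):
        if name not in self.tables:
            raise KeyError(name)
        self.root[slot] = name
        self.pointer_writes += 1

    def fetch(self, slot, index):
        """Dynamic indexing through the table pointer, as in SR.T[slot][index]."""
        return self.tables[self.root[slot]][index]


heap = DescriptorTables()
heap.create_table("per_frame", ["shadow_map", "env_map"])
heap.create_table("material_a", ["albedo_a", "normal_a"])
heap.create_table("material_b", ["albedo_b", "normal_b"])
writes_after_setup = heap.descriptor_writes

heap.set_table(0, "per_frame")
heap.set_table(1, "material_a")
first = heap.fetch(1, 0)
heap.set_table(1, "material_b")   # switch material: one pointer write only
second = heap.fetch(1, 0)
```

Because tables are built ahead of time, the same table can also be pointed at again by a later, non-adjacent batch without rewriting any descriptors.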
KEY INNOVATIONS
Innovation: CPU-side and GPU-side wins
Command buffers: build on many threads; control of scheduling; lower latency; simplified state tracking
Pipeline state objects: link at create time; no JIT shader compiles; efficient batched updates; cheaper state updates; enables LTCG
Bind objects in groups: cheap to change group (CPU side and GPU side); fits hardware paradigm
Move work to Create: predictability; enables optimisations
KEY INNOVATIONS
Innovation: CPU-side and GPU-side wins
Explicit synchronisation: efficiency; required for bindless textures; less overhead
Explicit memory management: efficiency; predictability; application flexibility; zero copy; control over placement
Do less: predictability, efficiency; enables aggressive schedule; FEWER BUGS
NEW PROBLEMS (AND TIPS TO SOLVE THEM)
NEW VISIBLE LIMITS
More draws in does not automatically mean more triangles out
– You will not see full rendering rates with triangles averaging 1 pixel each.
– Wireframe mode should look different to filled rendering
NEW VISIBLE LIMITS
Feeding the GPU much more efficiently means exploring interesting new limits that weren’t visible before
10k/frame of anything is ~1µs per thing.
GPU pipeline depth is likely to be 1-10µs (1k-10k cycles).
Specific limit: context registers
– Shader tables are NOT in the context
– Compute doesn’t bottleneck on context
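The per-item arithmetic behind these limits, as a short Python sketch. The frame times and the 1 GHz clock are assumptions chosen to match the slide's round numbers (a 10 ms frame gives exactly 1µs per item; a 60 Hz frame gives roughly 1.7µs).

```python
def per_item_budget_us(frame_time_ms, items_per_frame):
    """Average time available per item, in microseconds."""
    return frame_time_ms * 1000.0 / items_per_frame


def microseconds_to_cycles(microseconds, clock_ghz):
    """Convert a duration to clock cycles at the given clock rate."""
    return microseconds * clock_ghz * 1000.0


budget_10ms = per_item_budget_us(10.0, 10_000)          # assumed 100 fps frame
budget_60hz = per_item_budget_us(1000.0 / 60.0, 10_000) # assumed 60 Hz frame
depth_cycles = microseconds_to_cycles(10.0, 1.0)        # assumed 1 GHz clock
```

At 10k items per frame the per-item budget is the same order of magnitude as the quoted pipeline depth, which is why limits such as context registers start to become visible.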
APPLICATION IN CHARGE
Application is arbiter of correct rendering
– This is a serious responsibility
– The benefits of D3D12 aren’t readily available without this condition
Applications must be warning-free on the debug layer
Different opportunities for driver intervention
APPLICATION IN CHARGE
No driver thread in play
– App can target much lower latency
– BUT implies app has to be ready with new GPU work
[Figure: timelines of app render, driver and GPU for frames 1 to 3. D3D11: first work sent to driver; driver buffers Present; no dead GPU time after 1st frame (but extra latency). Without buffering: no buffered Present reveals dead time on GPU]
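A toy Python timing model of the unbuffered case, to show where GPU dead time comes from. It assumes the app renders frames back to back and that each frame's GPU work can start only once the app has finished submitting it; the function name and the millisecond figures are illustrative only.

```python
def gpu_dead_time(app_times, gpu_times):
    """Total GPU idle time between frames when nothing buffers ahead.
    Idle time before the very first frame is not counted."""
    app_done = 0.0   # when the app finishes submitting the current frame
    gpu_free = 0.0   # when the GPU finishes the previous frame
    dead = 0.0
    for i, (app_t, gpu_t) in enumerate(zip(app_times, gpu_times)):
        app_done += app_t
        start = max(app_done, gpu_free)
        if i > 0:
            dead += start - gpu_free   # GPU waited for the app
        gpu_free = start + gpu_t
    return dead


# App slower than GPU: the GPU idles 2 ms before frames 2 and 3.
app_bound = gpu_dead_time([10.0, 10.0, 10.0], [8.0, 8.0, 8.0])
# App faster than GPU: new work is always ready, no dead time.
gpu_bound = gpu_dead_time([5.0, 5.0, 5.0], [8.0, 8.0, 8.0])
```

In this model the only way to remove the idle gaps without reintroducing buffering latency is for the app to have the next frame's work ready before the GPU finishes the current one.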
USE COMMAND BUFFERS SPARINGLY
Each API command list maps to a single hardware command buffer
Starting / ending a command list has an overhead
– Writes full 3D state, may flush caches or idle GPU
We think a good rule of thumb will be to target around 100 command buffers/frame
– Use the multiple submission API where possible
[Figure: Multiple applications running on system. Application 0 queue: CB0 CB1 CB2. Application 1 queue: CB0. GPU executes: CB0 CB1 CB2 CB0]
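One way to act on the rule of thumb is to cap the number of command lists per frame and spread the draws across them. The Python sketch below does this with a simple even split; the function name and the default of 100 are assumptions taken from the slide's suggested target, not a tuned value.

```python
def split_into_command_lists(draws, target_lists=100):
    """Split draws, in order, into at most `target_lists` command lists."""
    if target_lists < 1:
        raise ValueError("target_lists must be >= 1")
    if not draws:
        return []
    per_list = -(-len(draws) // target_lists)   # ceiling division
    return [draws[i:i + per_list] for i in range(0, len(draws), per_list)]


frame_draws = list(range(10_000))
command_lists = split_into_command_lists(frame_draws)
small_frame = split_into_command_lists(list(range(250)))
```

With 10,000 draws this yields 100 lists of 100 draws each, so the fixed start/end overhead of a command list is paid 100 times per frame rather than once per draw.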
ROUND-UP
ALL-NEW
There’s a learning curve here for all of us
In the main it’s a shallow one
– Compared at least to the general problem of multithreaded rendering
  Multithread is always hard.
– Simpler design means fewer bugs and more predictable performance
WHAT AMD PLAN TO DELIVER
An early preview driver “soon”
Release driver for Direct3D12 launch
Continuous engagement
– With Microsoft
– With ISVs
  Bring your opinions to us and to Microsoft.
DX12 AND FROSTBITE
Johan Andersson, Technical Director
DX12 AND FROSTBITE
PC is very important for EA and we’ve been pushing hard to improve graphics capabilities on Windows
Excited to be working with Microsoft and the IHVs on Direct3D again!
Good & very healthy collaboration between Microsoft, the IHVs and us game/engine developers
DX12 is a really big step forward from DX11 or GL4
DX12 FEATURES AND FROSTBITE
Key DX12 features that are a great fit for Frostbite:
– Efficient parallel command buffers
– Descriptor tables
– Pipeline objects
– Explicit resource synchronization
– Explicit memory management
DX12 is still in development, so we are actively working with Microsoft & the IHVs to help make sure all of it fits together and is efficient
DX12 PLATFORMS
DX12 support on Windows 7 & most existing PC hardware is critical for us
– Huge user base still on Windows 7
– Gamers would see major benefits without upgrading
DX12 support on Xbox One is critical for us
– Will lead to improved performance & quality for future Xbox One titles
– Almost all of our games are cross platform Gen4/PC
– Easier development – renderer is shared between Windows & Xbox One
Looking forward to DX12 on mobile/tablets
– Power efficiency & low overhead is really key
– Need larger user base to target on Windows for mobile
DX12 AND FROSTBITE
We are building a DX12 renderer for Frostbite!
– Will work on GPUs from all vendors – benefits a wide set of gamers
Expected benefits over DX11:
– More stable and consistent performance
– Higher overall performance
– Move our design target – richer & more detailed game worlds
– Thinner drivers – easier to work with / less of a black box
– More control for us developers – new techniques & optimizations
Really happy that the full Windows & Xbox ecosystems are moving to a low-level graphics API!
QUESTIONS