Direct3D12 and the future of graphics APIs
Dave Oldcorn, Direct3D12 Technical Lead, AMD
#| AMD Direct3D Futures | March 20th, 2014
The Problem
The problem
Mismatch between existing Direct3D and hardware capabilities
  Lots of CPU cores, but only one stream of data
  State communication in small chunks
  Hidden work
    Hard to predict from any one given call what the overhead might be
  Implicit memory management
  Hardware evolving away from classical register programming
API landscape
Gap between PC raw 3D APIs and the hardware has opened up
Very high level APIs now ubiquitous; easy to access even for casual developers, plenty of choice
The PC APIs sit in a middle ground
[Figure: API landscape between the application and the metal (register level access). Axis: capability, ease of use, distance from 3D engine. Labels: Game Engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech), Flash / Silverlight, D3D7/8, D3D9, OpenGL, D3D11, Console APIs, Opportunity.]
What are the consequences?
What are the solutions?
Sequential API
Sequential API: state for a given draw comes from an arbitrary previous time
  Some states must be reconciled on the CPU (delayed validation)
  All contributing state needs to be visible
GPU isn't like this; it uses command buffers
  Must save and restore state at start and end
[Figure: API input stream: ... Draw, Set PS CB, Draw x 5, Set VS CB, Draw x 3, Set Blend, Set PS, Set RT state, Draw, Set VS VB, Draw ... State contributing to the draw: (more, earlier), PS CB, VS CB, Blend state, PS, RT state.]
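To make the delayed-validation point concrete, here is a minimal sketch in Python. It is a toy model, not Direct3D code: the class `SequentialContext` and the slot names are hypothetical, chosen only to show that a draw has to gather state set at arbitrary earlier times.

```python
class SequentialContext:
    """Toy model of a sequential (D3D11-style) API; not real Direct3D."""

    def __init__(self):
        self.state = {}       # slot name -> last value set
        self.dirty = set()    # slots changed since the previous draw
        self.commands = []    # what would end up in the GPU command buffer

    def set_state(self, slot, value):
        # Setting state is cheap: record it and defer the real work.
        self.state[slot] = value
        self.dirty.add(slot)

    def draw(self):
        # Delayed validation: only at draw time is the full state visible,
        # so the CPU reconciles every slot changed since the last draw.
        reconciled = sorted(self.dirty)
        self.dirty.clear()
        self.commands.append(("draw", dict(self.state), reconciled))
        return reconciled


ctx = SequentialContext()
ctx.set_state("PS CB", 1)
ctx.draw()
ctx.set_state("VS CB", 2)
ctx.set_state("Blend", "alpha")
touched = ctx.draw()              # slots reconciled for the second draw
snapshot = ctx.commands[-1][1]    # full state the second draw depends on
```

The second draw only changed two slots, yet its snapshot still includes the pixel shader constant buffer set before the first draw, which is the hidden dependency the slide describes.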
Threading a sequential API
Sequential API threading
  Simple producer / consumer model
  Extra latency
  Buffering has a cost
More threading would mean dividing tasks on a finer grain
  Bottlenecked on application or driver thread
  Difficult to extract parallelism (Amdahl's Law)
[Figure: thread diagram. Labels: Application simulation, Prebuild Thread 0, Prebuild Thread 1, Application Render Thread, Application, Driver Thread, Runtime / Driver, GPU Execution Queue, Queued Buffer 0, Queued Buffer 1, Queued Buffer 2, ...]
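The Amdahl's Law remark can be checked with a few lines of arithmetic. The sketch below is illustrative only; the 40% / 60% split between parallel prebuild work and the serial render/driver thread is an assumption, not a figure from the talk.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Assumed split: 40% of frame CPU time is prebuild work that can be spread
# over threads; 60% stays on the single render/driver thread.
speedups = {n: round(amdahl_speedup(0.4, n), 2) for n in (1, 2, 4, 8)}
ceiling = 1.0 / 0.6   # limit as the thread count grows without bound
```

Under that assumed split, no number of extra prebuild threads can push the speedup past roughly 1.67x, because the serial thread dominates.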
Command buffer API
GPUs only listen to command buffers
Let the app build them
  Command Lists, at the API level
Solves sequential API CPU issues
[Figure: thread diagram. Labels: Application simulation, Thread 0, Thread 1, Build Cmd Buffer (twice), Application, Runtime / Driver, GPU Execution Queue, Queued Buffer 0, Queued Buffer 1, ...]
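As a rough illustration of the model, the sketch below records independent command lists on worker threads and then submits them in an order the application chooses. It is a structural toy in Python, not the Direct3D 12 API: `build_command_list`, `record_frame` and `submit` are hypothetical names, and Python threads here show the shape of the design rather than real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def build_command_list(chunk_id, draws):
    """Record one chunk of the frame into its own list (no shared state)."""
    return [f"chunk{chunk_id}:draw{d}" for d in draws]


def record_frame(chunks, workers=4):
    # Each chunk is recorded independently, so the threads never contend.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(build_command_list, i, draws)
                   for i, draws in enumerate(chunks)]
        # Submission order is chosen by the app, not by thread timing.
        return [f.result() for f in futures]


def submit(queue, command_lists):
    # Stand-in for handing finished lists to the GPU execution queue.
    for command_list in command_lists:
        queue.extend(command_list)


gpu_queue = []
submit(gpu_queue, record_frame([range(2), range(3), range(1)]))
```

Because each list carries its own state, there is no single render thread through which all calls must funnel.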
Better scheduling
App has much more control over scheduling work
  Both CPU side and GPU
Threads don't really share much resource
Many more options for streaming assets
[Figure: thread timelines. D3D11: CB building threads tend to interfere. D3D12: CB building threads more independent. Other labels: Driver thread, Create thread, Build threads, Render work, Create work, GPU executes, GPU load still added but only after queuing.]
Pipeline objects
Pipeline objects get rid of JIT and enable LTCG for GPUs
Decouple interface and implementation
We're aware that this is a hairpin bend for many graphics engines to negotiate.
  Many engines don't think in terms of predicting state up front
  The benefits are worth it
[Figure: simplified dataflow through pipeline. Labels: VS, PS, Index Process, Primitive Generation, Rasteriser, Rendertarget Output, ???]
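One way to picture the shift is a cache of fully specified pipeline objects that is filled before rendering starts. The following Python sketch is a toy model under that assumption; `PipelineCache` and the shader and format strings are hypothetical, and the "link" step is just a counter.

```python
class PipelineCache:
    """Toy model: 'linking' shaders and fixed state into one object."""

    def __init__(self):
        self.compiles = 0
        self._cache = {}

    def create(self, vs, ps, blend, rt_format):
        # All state that affects code generation is known here, so the
        # expensive link step can run at create time, off the draw path.
        key = (vs, ps, blend, rt_format)
        if key not in self._cache:
            self.compiles += 1
            self._cache[key] = f"pso({vs},{ps},{blend},{rt_format})"
        return self._cache[key]


cache = PipelineCache()
# Load time: predict the combinations the frame will need.
opaque = cache.create("vs_main", "ps_lit", "opaque", "rgba8")
glass = cache.create("vs_main", "ps_lit", "alpha", "rgba8")
compiles_at_load = cache.compiles

# Draw time: asking for a known combination is only a lookup.
bound = cache.create("vs_main", "ps_lit", "opaque", "rgba8")
compiles_during_frame = cache.compiles - compiles_at_load
```

The engine-side cost is that every combination has to be predicted up front, which is the "hairpin bend" mentioned above.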
Render object binding mismatch
Hardware uses tables in video memory
BUT still programmed like a register solution
  So one bind becomes:
    Allocate a new chunk of video memory
    Create a new copy of the entire table
    Update the one entry
    Write the register with the new table base address
[Figure: on-chip root table (1 per stage) holding a pointer to a table (here, textures; SR) and a pointer to a table (constant buffers; CB). Each points to an SRD table in GPU memory, whose entries are pointers to (+ params of) a resource in GPU memory.]
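The four steps of a single bind can be sketched as follows. This is a toy Python model with a hypothetical function name and a made-up 16-entry table; it only counts how many table entries get rewritten when one slot changes.

```python
def bind_register_style(table, slot, descriptor):
    """Model one bind when tables are driven like registers.

    Returns the new table and the number of entries written to memory.
    """
    new_table = list(table)        # allocate a chunk and copy the whole table
    new_table[slot] = descriptor   # update the one entry
    # The caller would then write the register with the new base address.
    return new_table, len(new_table)


table = ["tex%d" % i for i in range(16)]   # assumed table size
writes = 0
for slot, tex in [(0, "rock"), (1, "moss"), (2, "sand")]:
    table, copied = bind_register_style(table, slot, tex)
    writes += copied
```

Three single-slot binds end up writing 48 entries in this model, which is the mismatch the slide points at.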
Descriptor Tables
Several tables of each type of resource
  Easy to divide up by frequency
Tables can be of arbitrary size; dynamically indexed to provide bindless textures
Changing a pointer in the root table is cheap
Updating a descriptor in a table is not so cheap
  Some dynamic descriptors are a requirement but avoid in general.
[Figure: on-chip root table with entries SR.T[0], SR.T[1], SR.T[2], SR.T[3], UAV, CB.T[0], CB.T[1], Samp. One entry is a pointer to textures table 0, an SRD table in GPU memory holding SR.T[0][0], SR.T[0][1], SR.T[0][2]; another is a pointer to constbuf table 1, holding CB.T[1][0], CB.T[1][1].]
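A minimal sketch of the idea, again as a toy Python model rather than the real API: tables are built once, grouped by how often they change, and switching material is a single pointer write into the root table. `RootTable`, the slot labels and the texture names are hypothetical.

```python
class RootTable:
    """Toy model of an on-chip root table pointing at tables in GPU memory."""

    def __init__(self):
        self.slots = {}           # root slot -> descriptor table (by reference)
        self.pointer_writes = 0

    def set_table(self, slot, table):
        self.slots[slot] = table  # one pointer write, no copying
        self.pointer_writes += 1

    def fetch(self, slot, index):
        # Dynamic indexing into a table of arbitrary size (bindless style).
        return self.slots[slot][index]


# Tables built once, split by how often they change.
per_frame = ("shadow_map", "env_map")
materials = {
    "rock": ("rock_albedo", "rock_normal"),
    "moss": ("moss_albedo", "moss_normal"),
}

root = RootTable()
root.set_table("SR.T[0]", per_frame)             # once per frame
seen = []
for name in ("rock", "moss", "rock"):
    root.set_table("SR.T[1]", materials[name])   # per material: pointer swap
    seen.append(root.fetch("SR.T[1]", 0))
```

The tables are immutable tuples here on purpose: the cheap path is swapping which table a root slot points at, not editing descriptors in place.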
KEY innovations
Table columns: Innovation; CPU-side win; GPU-side win (wins listed in slide order)
Command buffers: Build on many threads; Control of scheduling; Lower latency; Simplified state tracking
Pipeline state objects: Link at create time; No JIT shader compiles; Efficient batched updates; Cheaper state updates; Enables LTCG
Bind objects in groups: Cheap to change group; Cheap to change group; Fits hardware paradigm
Move work to Create: Predictability; Enables optimisations
KEY innovations
Table columns: Innovation; CPU-side win; GPU-side win (wins listed in slide order)
Explicit Synchronisation: Efficiency; Required for bindless textures; Less overhead
Explicit Memory Management: Efficiency; Predictability; Application flexibility; Zero copy; Control over placement
Do less: Predictability, Efficiency; Enables aggressive schedule
FEWER BUGS
NEW PROBLEMS
(And tips to solve them)
New visible limits
More draws in does not automatically mean more triangles out
  You will not see full rendering rates with triangles averaging 1 pixel each.
  Wireframe mode should look different to filled rendering
New visible limits
Feeding the GPU much more efficiently means exploring interesting new limits that weren't visible before
10k/frame of anything is ~1µs per thing.
GPU pipeline depth is likely to be 1-10µs (1k-10k cycles).
Specific limit: context registers
  Root shader table is NOT in the context
  Compute doesn't bottleneck on context
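The per-item figures can be sanity-checked with simple arithmetic. The sketch below assumes a 60 fps target and a 1 GHz GPU clock; neither number is stated in the slides, and the function names are hypothetical.

```python
def per_item_budget_us(items_per_frame, fps=60.0):
    """Time available per item, in microseconds, at a given frame rate."""
    frame_us = 1_000_000.0 / fps
    return frame_us / items_per_frame


def cycles_to_us(cycles, clock_hz=1e9):
    """Convert GPU cycles to microseconds (the 1 GHz clock is an assumption)."""
    return cycles / clock_hz * 1e6


budget = per_item_budget_us(10_000)                     # between 1 and 2 us
depth = (cycles_to_us(1_000), cycles_to_us(10_000))     # about 1 and 10 us
```

With 10,000 items in a frame, the time available per item is of the same order as the pipeline depth, so per-item costs that used to be hidden become visible.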
Application in charge
Application is arbiter of correct rendering
  This is a serious responsibility
  The benefits of D3D12 aren't readily available without this condition
Applications must be warning-free on the debug layer
Different opportunities for driver intervention
Consider controlling risk by avoiding riskier techniques
Application in charge
No driver thread in play
  App can target much lower latency
  BUT implies app has to be ready with new GPU work
[Figure: frame timelines for App Render (Frame 1, Frame 2, Frame 3), Driver (F1, F2, F3) and GPU (F1, F2, F3). Labels: D3D11: No dead GPU time after 1st frame (but extra latency); First work sent to driver; Driver buffers Present, no future dead time; No buffered present reveals dead time on GPU; Dead Time.]
Use command buffers sparingly
Each API command list maps to a single hardware command buffer
Starting / ending a command list has an overhead
  Writes full 3D state, may flush caches or idle GPU
We think a good rule of thumb will be to target around 100 command buffers/frame
  Use the multiple submission API where possible
[Figure: multiple applications running on system. Labels: Application 0 queue (CB0, CB1, CB2, CB0), Application 1 queue (CB0, CB1, CB2, CB0), GPU executes.]
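A back-of-envelope model shows why the count matters. Everything numeric below is made up for illustration (20 µs to start and end a list, 50 µs per submit call); only the shape of the trade-off is the point, and `frame_overhead_us` is a hypothetical helper.

```python
def frame_overhead_us(num_command_lists, per_list_us, per_submit_us,
                      lists_per_submit=1):
    """Fixed per-frame overhead from starting, ending and submitting lists.

    All costs are illustrative placeholders, not measurements.
    """
    submits = -(-num_command_lists // lists_per_submit)   # ceiling division
    return num_command_lists * per_list_us + submits * per_submit_us


# Hypothetical costs: 20 us to start/end a list, 50 us per submit call.
one_by_one = frame_overhead_us(100, 20, 50)
batched = frame_overhead_us(100, 20, 50, lists_per_submit=25)
too_many = frame_overhead_us(5000, 20, 50, lists_per_submit=25)
```

In this model, batching submissions removes most of the submit cost, while the per-list cost still scales with the number of lists, so thousands of tiny command lists eat a large share of the frame.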
Round-up
All-new
There's a learning curve here for all of us
In the main it's a shallow one
  Compared at least to the general problem of multithreaded rendering
  Multithread is always hard.
Simpler design means fewer bugs and more predictable performance
What AMD plan to deliver
Release driver for Direct3D12 launch
Continuous engagement
  With Microsoft
  With ISVs
Bring your opinions to us and to Microsoft.
QUESTIONS