DIRECT3D12 AND THE FUTURE OF GRAPHICS APIS

Dave Oldcorn, Direct3D12 Technical Lead, AMD

AMD Direct3D Futures | March 20th, 2014

THE PROBLEM

Mismatch between existing Direct3D and hardware capabilities

– Lots of CPU cores, but only one stream of data

– State communication in small chunks

– “Hidden” work

Hard to predict from any one given call what the overhead might be

Implicit memory management

– Hardware evolving away from classical register programming


API LANDSCAPE

Gap between PC ‘raw’ 3D APIs and the hardware has opened up

Very high level APIs now ubiquitous; easy to access even for casual developers, plenty of choice

The PC APIs sit in a middle ground

[Figure: API landscape between the application and the metal, along an axis labelled "Capability, ease of use, distance from 3D engine". Labels: Application; Game Engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech); Flash / Silverlight; D3D7/8; D3D9; OpenGL; D3D11; Console APIs; Opportunity; Metal (register level access).]


WHAT ARE THE CONSEQUENCES?

WHAT ARE THE SOLUTIONS?


SEQUENTIAL API

Sequential API: state for given draw comes from arbitrary previous time

Some states must be reconciled on the CPU (“delayed validation”)

– All contributing state needs to be visible

The GPU isn’t like this; it uses command buffers

– Must save and restore state at start and end

[Figure: API input stream (..., Draw, Set PS CB, Draw x 5, Set VS CB, Draw x 3, Set Blend, Set PS, Set RT state, Draw, Set VS VB, Draw, ...) shown next to the state contributing to a single draw: PS CB, VS CB, Blend state, PS, RT state, and more set earlier.]
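To make the "arbitrary previous time" point concrete, here is a minimal sketch of how a sequential call stream resolves into per-draw state. The tuple encoding of calls and the function name `state_for_draws` are assumptions made for illustration; they are not part of Direct3D.

```python
def state_for_draws(api_calls):
    """Replay a sequential call stream and record the state each draw sees."""
    current = {}      # slot name -> most recently set value
    snapshots = []    # one resolved state dictionary per draw
    for call in api_calls:
        if call[0] == "set":
            _, slot, value = call
            current[slot] = value           # state persists until overwritten
        elif call[0] == "draw":
            snapshots.append(dict(current))  # a draw needs ALL contributing state
    return snapshots


# Hypothetical call stream, loosely following the slide's example.
calls = [
    ("set", "PS CB", "cb_a"),
    ("draw",),
    ("set", "VS CB", "cb_b"),
    ("set", "Blend", "alpha"),
    ("draw",),
]
resolved = state_for_draws(calls)
```

The second draw still depends on the `PS CB` value set before the first draw, which is why the whole history has to be visible before a draw can be validated.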


THREADING A SEQUENTIAL API

Sequential API threading

– Simple producer / consumer model

Extra latency

Buffering has a cost

More threading would mean dividing tasks at a finer grain

– Bottlenecked on application or driver thread

Difficult to extract parallelism (Amdahl’s Law)

[Figure: sequential API threading. Application: application simulation, Prebuild Thread 0, Prebuild Thread 1 and the Application Render Thread. Runtime / Driver: Driver Thread. GPU Execution Queue: Queued Buffer 0, Queued Buffer 1, Queued Buffer 2, ...]
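The Amdahl's Law limit mentioned above can be worked through numerically. This is a small sketch of the standard formula, speedup = 1 / ((1 - p) + p / n); the 50% parallel fraction used in the example is an assumed figure for illustration, not a measurement from the talk.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Assumed example: half of the frame's CPU work is stuck on one
# application or driver thread, the other half scales perfectly.
speedup_8_threads = amdahl_speedup(0.5, 8)
```

Under that assumption, eight threads give less than a 2x speedup, and no thread count can reach 2x while the serial half remains.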


COMMAND BUFFER API

GPUs only listen to command buffers

Let the app build them

– Command Lists, at the API level

Solves sequential API CPU issues

[Figure: command buffer API threading. Application: application simulation, then Thread 0 and Thread 1 each build a command buffer. Runtime / Driver. GPU Execution Queue: Queued Buffer 0, Queued Buffer 1, ...]
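As a rough illustration of letting the application build command buffers on several threads, the sketch below records independent lists in a thread pool and then queues them in a fixed order. Everything here (`build_command_list`, `render_frame`, the tuple commands) is a hypothetical stand-in, not the Direct3D 12 API.

```python
from concurrent.futures import ThreadPoolExecutor


def build_command_list(draws):
    """Record one command list. Each worker owns its list, so no locking is needed."""
    return [("draw", d) for d in draws]


def render_frame(draw_batches, workers=4):
    """Build one command list per batch in parallel, then queue them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, whichever thread finishes first.
        command_lists = list(pool.map(build_command_list, draw_batches))
    gpu_queue = []
    for command_list in command_lists:
        gpu_queue.extend(command_list)  # stand-in for submitting to the GPU queue
    return gpu_queue


queued = render_frame([[1, 2], [3], [4, 5]])
```

The design point is that recording is independent per list, while submission order stays under the application's control.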


BETTER SCHEDULING

App has much more control over scheduling work

– Both CPU side and GPU

Threads don’t really share much resource

Many more options for streaming assets

[Figure: thread timelines with rows for render work, create work and GPU execution. D3D11 (driver thread, create thread): CB building threads tend to interfere. D3D12 (create thread, build threads): CB building threads more independent; GPU load still added but only after queuing.]


PIPELINE OBJECTS

Pipeline objects get rid of JIT (just-in-time) shader compilation and enable LTCG (link-time code generation) for GPUs

Decouple interface and implementation

We’re aware that this is a hairpin bend for many graphics engines to negotiate.

– Many engines don’t think in terms of predicting state up front

– The benefits are worth it

[Figure: Simplified dataflow through pipeline. Boxes labelled Index Process, VS, Primitive Generation, Rasteriser, PS and Rendertarget Output, with three "?" markers.]
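One way to picture "predicting state up front" is an immutable pipeline description that is linked once at create time and then reused. The sketch below is an assumption-laden model: `PipelineDesc`, `PipelineCache` and their fields are invented names for illustration and do not mirror the real Direct3D 12 structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDesc:
    """All state that affects code generation, declared together (hypothetical fields)."""
    vertex_shader: str
    pixel_shader: str
    blend: str
    render_target_format: str


class PipelineCache:
    """Links each distinct description once, at create time."""

    def __init__(self):
        self._linked = {}
        self.link_count = 0

    def create(self, desc):
        if desc not in self._linked:
            self.link_count += 1  # the expensive step happens here, not at draw time
            self._linked[desc] = ("linked", desc)
        return self._linked[desc]


cache = PipelineCache()
desc = PipelineDesc("vs_main", "ps_main", "alpha", "rgba8")
first = cache.create(desc)
second = cache.create(PipelineDesc("vs_main", "ps_main", "alpha", "rgba8"))
```

Because the description is frozen and hashable, asking for the same combination twice returns the already linked object, so nothing needs compiling when the draw is issued.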


RENDER OBJECT BINDING MISMATCH

Hardware uses tables in video memory

BUT still programmed like a register solution

– So one bind becomes:

Allocate a new chunk of video memory

Create a new copy of the entire table

Update the one entry

Write the register with the new table base address

[Figure: on-chip root table (1 per stage) with SR and CB entries, each a pointer to a table (here, textures and constant buffers). The SRD table in GPU memory holds entries that are each a pointer to (plus parameters of) a resource in GPU memory.]
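The four steps listed for a single bind can be sketched as a toy model. `RegisterStyleBinder` and its fields are hypothetical names; the sketch only counts how much copying one bind triggers when a table in memory is driven through a register-style interface.

```python
class RegisterStyleBinder:
    """Toy model: every bind clones the whole table and repoints the register."""

    def __init__(self, table_size):
        self.video_memory = [[None] * table_size]  # allocated tables
        self.table_base = 0                        # "register": index of the live table
        self.entries_copied = 0

    def bind(self, slot, resource):
        old_table = self.video_memory[self.table_base]
        new_table = list(old_table)                # allocate a chunk and copy the entire table
        self.entries_copied += len(old_table)
        new_table[slot] = resource                 # update the one entry
        self.video_memory.append(new_table)
        self.table_base = len(self.video_memory) - 1  # write the new table base address


binder = RegisterStyleBinder(table_size=8)
binder.bind(3, "texture_a")
```

A single one-entry change copies all eight entries in this model, which is the mismatch the slide describes.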


DESCRIPTOR TABLES

Several tables of each type of resource

– Easy to divide up by frequency

Tables can be of arbitrary size; dynamically indexed to provide bindless textures

Changing a pointer in the root table is cheap

Updating a descriptor in a table is not so cheap

– Some dynamic descriptors are a requirement but avoid in general.

[Figure: on-chip root table with entries SR.T[0] to SR.T[3], UAV, CB.T[0], CB.T[1] and Samp. SR.T[0] is a pointer to a table (textures table 0) in GPU memory holding SRDs SR.T[0][0] to SR.T[0][2]; CB.T[1] is a pointer to a table (constbuf table 1) holding CB.T[1][0] and CB.T[1][1].]
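A minimal sketch of the idea, assuming tables are plain lists and the root table is a small dictionary of references: switching a root entry repoints to a prebuilt table without touching any descriptors. `RootTable`, the slot strings and the texture names are all made up for illustration.

```python
class RootTable:
    """Toy root table: each slot holds a reference to a descriptor table."""

    def __init__(self):
        self.slots = {}

    def set_table(self, slot, table):
        self.slots[slot] = table          # one pointer write, no table copy

    def fetch(self, slot, index):
        return self.slots[slot][index]    # dynamic index into the bound table


# Tables divided by update frequency (hypothetical contents).
per_frame = ["shadow_map", "env_map"]
material_a = ["brick_albedo", "brick_normal"]
material_b = ["wood_albedo", "wood_normal"]

root = RootTable()
root.set_table("SR.T[0]", per_frame)
root.set_table("SR.T[1]", material_a)
before = root.fetch("SR.T[1]", 0)
root.set_table("SR.T[1]", material_b)     # cheap: only the root pointer changes
after = root.fetch("SR.T[1]", 0)
```

Both material tables stay intact, so swapping between them repeatedly never rewrites a descriptor.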


KEY INNOVATIONS

Innovation | CPU-side win | GPU-side win

Command buffers | Build on many threads | Control of scheduling

Lower latency

Simplified state tracking

Pipeline state objects

Link at create time | No JIT shader compiles

Efficient batched updates

Cheaper state updates | Enables LTCG

Bind objects in groups | Cheap to change group | Cheap to change group

Fits hardware paradigm

Move work to Create | Predictability | Enables optimisations


KEY INNOVATIONS

Innovation | CPU-side win | GPU-side win

Explicit Synchronisation

Efficiency | Required for bindless textures

Less overhead

Explicit Memory Management

Efficiency | Predictability

Application flexibility

Zero copy | Control over placement

Do less | Predictability, Efficiency

Enables aggressive schedule | FEWER BUGS


NEW PROBLEMS (AND TIPS TO SOLVE THEM)


NEW VISIBLE LIMITS

More draws in does not automatically mean more triangles out

– You will not see full rendering rates with triangles averaging 1 pixel each.

– Wireframe mode should look different to filled rendering


NEW VISIBLE LIMITS

Feeding the GPU much more efficiently means exploring interesting new limits that weren’t visible before

10k/frame of anything is ~1µs per thing.

GPU pipeline depth is likely to be 1-10µs (1k-10k cycles).

Specific limit: context registers

– Root shader table is NOT in the context

– Compute doesn’t bottleneck on context
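The "~1µs per thing" figure is simple arithmetic on the frame time. The sketch below shows the calculation; the 60 and 100 frames-per-second rates are assumed values chosen for illustration.

```python
def per_item_budget_us(items_per_frame, frames_per_second):
    """Average time available per item, in microseconds, if items run back to back."""
    frame_time_us = 1_000_000 / frames_per_second
    return frame_time_us / items_per_frame


budget_at_60 = per_item_budget_us(10_000, 60)    # about 1.67 microseconds
budget_at_100 = per_item_budget_us(10_000, 100)  # 1 microsecond
```

Either way the per-item budget is of the same order as the 1-10µs pipeline depth quoted above, which is why these limits become visible.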


APPLICATION IN CHARGE

Application is arbiter of correct rendering

– This is a serious responsibility

– The benefits of D3D12 aren’t readily available without this condition

Applications must be warning-free on the debug layer

Different opportunities for driver intervention

Consider controlling risk by avoiding riskier techniques


APPLICATION IN CHARGE

No driver thread in play

– App can target much lower latency

– BUT implies app has to be ready with new GPU work

[Figure: frame timelines for App Render, Driver and GPU across frames 1 to 3. D3D11: the first work is sent to the driver and the driver buffers Present, so there is no dead GPU time after the 1st frame (but extra latency). With no buffered Present, dead time on the GPU is revealed.]


USE COMMAND BUFFERS SPARINGLY

Each API command list maps to a single hardware command buffer

Starting / ending a command list has an overhead

– Writes full 3D state, may flush caches or idle GPU

We think a good rule of thumb will be to target around 100 command buffers/frame

– Use the multiple submission API where possible

[Figure: multiple applications running on the system. Application 0 queue holds CB0, CB1, CB2; Application 1 queue holds CB0; the "GPU executes" row shows CB0, CB1, CB2, CB0.]
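One possible way to act on the rule of thumb is to split a frame's draws into at most about 100 lists and hand them over in a single batched submission. This is only a sketch: `split_into_command_lists` and `submit_batch` are hypothetical helpers, and the real submission call is not modelled.

```python
import math


def split_into_command_lists(draws, target_lists=100):
    """Partition draws into at most target_lists contiguous command lists."""
    if not draws:
        return []
    per_list = math.ceil(len(draws) / target_lists)
    return [draws[i:i + per_list] for i in range(0, len(draws), per_list)]


def submit_batch(gpu_queue, command_lists):
    """Stand-in for a multiple-submission call: one call queues every list."""
    gpu_queue.extend(command_lists)
    return 1  # number of submission calls made


frame_draws = list(range(10_000))
lists = split_into_command_lists(frame_draws)
queue = []
calls_made = submit_batch(queue, lists)
```

With 10,000 draws this yields 100 lists of 100 draws each, so the fixed start/end cost of a list is paid 100 times rather than once per draw.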


ROUND-UP


ALL-NEW

There’s a learning curve here for all of us

In the main it’s a shallow one

– Compared at least to the general problem of multithreaded rendering

Multithreading is always hard.

– Simpler design means fewer bugs and more predictable performance


WHAT AMD PLAN TO DELIVER

Release driver for Direct3D12 launch

Continuous engagement

– With Microsoft

– With ISVs

Bring your opinions to us and to Microsoft.


QUESTIONS
