Transcript
Page 1: Rendering Battlefield 4 with Mantle by Yuriy ODonnell

RENDERING BATTLEFIELD 4 WITH MANTLE

Johan Andersson, Yuriy O’Donnell
Electronic Arts

DX11: Avg: 78 fps, Min: 42 fps

Mantle: Avg: 120 fps, Min: 94 fps (+58%!)

Core i7-3970x, AMD Radeon R9 290x, 1080p ULTRA

BF4 MANTLE GOALS

Goals:

– Significantly improve CPU performance

– More consistent & stable performance

– Improve GPU performance where possible

– Add support for a new Mantle rendering backend in a live game
  – Minimize changes to engine interfaces
  – Compatible with built PC content

– Work on a wide set of hardware
  – APU to quad-GPU
  – But x64 only (32-bit Windows needs to die)

Non-goals:

– Design new renderer from scratch for Mantle

– Take advantage of asymmetric MGPU (APU+discrete)

– Optimize video memory consumption

BF4 MANTLE STRATEGIC GOALS

Prove that low-level graphics APIs work outside of consoles

Push the industry towards low-level graphics APIs everywhere

Build a foundation for the future that we can build great games on

AGENDA

– Shaders
– Pipelines
– Memory
– Resources
– Command buffers
– Queues
– Multiple GPUs

SHADERS

SHADER CONVERSION

DX11 bytecode shaders get converted to AMDIL, and the resource mapping is applied, using the ILC tool

– Done at load time

– Don’t have to change our shaders!

Have full source & control over the process

Could write AMDIL directly or use other frontends if wanted

SHADER RESOURCES

Shader resource bind points replaced with a resource table object: the descriptor set

– This is how the hardware accesses the shader resources

– Flat list of images, buffers and samplers used by any of the shader stages

– Vertex shader streams converted to vertex shader buffer loads

Engine assigns each shader resource to a specific slot in the descriptor set(s)

– Can share slots between shader stages = smaller descriptor sets

– The mapping takes a while to wrap one’s head around
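
A minimal sketch of the slot-assignment idea, assuming hypothetical stage and resource names. It is not the ILC tool's actual mapping, only an illustration of how sharing a slot between stages keeps the flat list small.

```python
def assign_slots(stage_resources):
    """Assign each distinct resource one slot in a flat descriptor set.

    stage_resources maps a shader stage name to the ordered list of
    resource names that stage uses (all names here are hypothetical).
    """
    slots = {}      # resource name -> slot index in the flat descriptor set
    mapping = {}    # stage -> list of slot indices, in the stage's own order
    for stage, resources in stage_resources.items():
        indices = []
        for name in resources:
            if name not in slots:
                slots[name] = len(slots)  # first use claims the next free slot
            indices.append(slots[name])
        mapping[stage] = indices
    return slots, mapping


slots, mapping = assign_slots({
    "vertex": ["vertex_stream0", "view_constants"],
    "pixel": ["view_constants", "albedo_tex", "linear_sampler"],
})
```

Here `view_constants` is used by both stages but occupies a single slot, so five bindings fit in four slots.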

DESCRIPTOR SETS

Very simple usage in BF4: for each draw call write flat list of resources

– Essentially direct replacement of SetTexture/SetConstantBuffer/SetInputStream

Single dynamic descriptor set object per frame

– Sub-allocate for each draw call and write list of resources

~15000 resource slots written per frame in BF4, still very fast
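
The per-frame sub-allocation described above could look roughly like the following. This is a CPU-side model only: the class name, capacity, and resource names are assumptions, and a real backend would write descriptors into GPU-visible memory.

```python
class FrameDescriptorSet:
    """One large descriptor set per frame; each draw call gets a contiguous range."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.used = 0

    def write_draw(self, resources):
        """Sub-allocate len(resources) slots, write them, return the base offset."""
        count = len(resources)
        if self.used + count > len(self.slots):
            raise MemoryError("descriptor set full for this frame")
        base = self.used
        self.slots[base:base + count] = resources
        self.used += count
        return base

    def reset(self):
        """Called once the GPU is done with the frame; ranges are reused."""
        self.used = 0


frame_set = FrameDescriptorSet(capacity=16384)
first = frame_set.write_draw(["albedo_tex", "view_constants"])
second = frame_set.write_draw(["shadow_map", "light_constants", "linear_sampler"])
```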

DESCRIPTOR SETS – FUTURE OPTIMIZATIONS

Use static descriptor sets when possible

Reduce resource duplication by reusing & sharing more across shader stages

Nested descriptor sets

PIPELINES

COMPUTE PIPELINES

1:1 mapping between pipeline & shader

No state built into pipeline

Can execute in parallel with rendering

~100 compute pipelines in BF4

GRAPHICS PIPELINES

All graphics shader stages combined to a single pipeline object together with important graphics state

~10000 graphics pipelines in BF4 on a single level, ~25 MB of video memory

Could use a smaller working pool of active state objects to keep a reasonable amount in memory

– Has not been required for us

PRE-BUILDING PIPELINES

Graphics pipeline creation is an expensive operation, do it at load time instead of at runtime!

– Creating one of our graphics pipelines takes ~10-60 ms

– Pre-build using N parallel low-priority jobs

– Avoid 99.9% of runtime stalls caused by pipeline creation!

Requires knowing the graphics pipeline state that will be used with the shaders

– Primitive type

– Render target formats

– Render target write masks

– Blend modes

Not fully trivial to know all state, may require engine changes / pre-defining use cases

– Important to design for!
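
A rough sketch of load-time pre-building, assuming a hypothetical `PipelineKey` that bundles the state listed above and a placeholder `create_pipeline` standing in for the expensive driver call. Job priority is not modelled here.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineKey:
    # Hypothetical key: the state that must be known up front.
    shader: str
    primitive_type: str
    rt_formats: tuple
    rt_write_masks: tuple
    blend_mode: str


def create_pipeline(key):
    # Placeholder for the expensive driver call (~10-60 ms each per the slides).
    return f"pipeline<{key.shader}|{key.primitive_type}|{key.blend_mode}>"


def prebuild(keys, workers=4):
    """Build every distinct pipeline at load time using parallel jobs."""
    unique = list(dict.fromkeys(keys))  # drop duplicates, keep order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        built = list(pool.map(create_pipeline, unique))
    return dict(zip(unique, built))


keys = [
    PipelineKey("terrain", "triangle_list", ("rgba8",), (0xF,), "opaque"),
    PipelineKey("terrain", "triangle_list", ("rgba8",), (0xF,), "opaque"),
    PipelineKey("particles", "triangle_list", ("rgba16f",), (0xF,), "additive"),
]
pipelines = prebuild(keys)
```

At draw time the renderer would look up the pipeline by the same key, and only a miss would fall back to on-demand creation.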

PIPELINE CACHE

Cache built pipelines in both a memory cache and a disk cache

– Improved loading times

– Max 300 MB

– Simple LRU policy

– LZ4 compressed

Database signature:

– Driver version

– Vendor ID

– Device ID
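
One possible shape for the in-memory side of such a cache, with a byte budget, LRU eviction, and a signature check. The class, the tiny budget, and the signature values are illustrative assumptions; LZ4 compression and the disk format are left out.

```python
from collections import OrderedDict


class PipelineCache:
    """In-memory LRU cache of built pipeline blobs with a byte budget."""

    def __init__(self, signature, max_bytes):
        self.signature = signature      # (driver version, vendor ID, device ID)
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()    # key -> blob, least recently used first

    def matches(self, signature):
        """A cache written for another driver or device must be discarded."""
        return self.signature == signature

    def get(self, key):
        blob = self.entries.get(key)
        if blob is not None:
            self.entries.move_to_end(key)   # mark as most recently used
        return blob

    def put(self, key, blob):
        if key in self.entries:
            self.used_bytes -= len(self.entries.pop(key))
        self.entries[key] = blob
        self.used_bytes += len(blob)
        while self.used_bytes > self.max_bytes:
            _, evicted = self.entries.popitem(last=False)  # least recently used
            self.used_bytes -= len(evicted)


cache = PipelineCache(signature=("drv-1", 0x1002, 0x67B0), max_bytes=8)
cache.put("a", b"1234")
cache.put("b", b"5678")
cache.get("a")            # "a" becomes most recently used
cache.put("c", b"9abc")   # over budget: evicts "b"
```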

DYNAMIC STATE OBJECTS

Graphics state is only set with the pipeline object and 5 dynamic state objects

– State objects: color blend, raster, viewport, depth-stencil, MSAA

– No other parameters such as in DX11 with stencil ref or SetViewport functions

Frostbite use case:

– Pre-create when possible

– Otherwise on-demand creation (hash map)

– Only ~100 state objects!

Still possible to end up with lots of state objects

– Esp. with state object float & integer values (depth bounds, depth bias, viewport)

– But no need to store all permutations in memory, objects are fast to create & app manages lifetimes
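
The on-demand (hash map) creation path can be sketched as below. `DepthStencilDesc` and its fields are hypothetical, and `create_fn` stands in for the real state-object creation call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthStencilDesc:
    # Hypothetical description; frozen so it can be used as a dict key.
    depth_test: bool
    depth_write: bool
    stencil_ref: int
    depth_bounds: tuple


class StateObjectCache:
    """Create a dynamic state object on first use, then reuse it by description."""

    def __init__(self, create_fn):
        self.create_fn = create_fn
        self.objects = {}   # description -> state object
        self.created = 0

    def get(self, desc):
        obj = self.objects.get(desc)
        if obj is None:
            obj = self.create_fn(desc)
            self.objects[desc] = obj
            self.created += 1
        return obj


cache = StateObjectCache(create_fn=lambda desc: ("depth_stencil_state", desc))
a = cache.get(DepthStencilDesc(True, True, 0, (0.0, 1.0)))
b = cache.get(DepthStencilDesc(True, True, 0, (0.0, 1.0)))
c = cache.get(DepthStencilDesc(True, False, 1, (0.0, 1.0)))
```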

MEMORY

MEMORY MANAGEMENT

Mantle devices expose multiple memory heaps with different characteristics

– Can be different between devices, drivers and OSes

User explicitly places resources in wanted heaps

– Driver suggests preferred heaps when creating objects, not a requirement

Type    Size      Page   CPU access                                              GPU Read  GPU Write  CPU Read  CPU Write
Local   256 MB    65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  130       170        0.0058    2.8
Local   4096 MB   65535  (none)                                                  130       180        0         0
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  2.6       2.6        0.1       3.3
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent                               2.6       2.6        3.2       2.9
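
As an illustration of explicit heap placement, the sketch below picks a heap from the table above by required CPU-access flags and a chosen rating column. The selection rule and the function name are assumptions, and the slide does not state units for the ratings, so they are only compared relative to each other.

```python
GPU_READ, GPU_WRITE, CPU_READ, CPU_WRITE = 3, 4, 5, 6

HEAPS = [
    # (type, size_mb, cpu_access_flags, gpu_read, gpu_write, cpu_read, cpu_write)
    ("Local", 256, {"CpuVisible", "CpuGpuCoherent", "CpuUncached", "CpuWriteCombined"}, 130, 170, 0.0058, 2.8),
    ("Local", 4096, set(), 130, 180, 0, 0),
    ("Remote", 16106, {"CpuVisible", "CpuGpuCoherent", "CpuUncached", "CpuWriteCombined"}, 2.6, 2.6, 0.1, 3.3),
    ("Remote", 16106, {"CpuVisible", "CpuGpuCoherent"}, 2.6, 2.6, 3.2, 2.9),
]


def pick_heap(required_flags=(), rating_column=GPU_READ):
    """Return the index of the heap with all required flags and the best rating."""
    best_index = None
    for index, heap in enumerate(HEAPS):
        if not set(required_flags) <= heap[2]:
            continue  # heap lacks a required CPU-access flag
        if best_index is None or heap[rating_column] > HEAPS[best_index][rating_column]:
            best_index = index
    if best_index is None:
        raise LookupError("no heap satisfies the requested flags")
    return best_index


render_target_heap = pick_heap(rating_column=GPU_WRITE)
readback_heap = pick_heap({"CpuVisible"}, rating_column=CPU_READ)
upload_heap = pick_heap({"CpuVisible"}, rating_column=CPU_WRITE)
```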

FROSTBITE MEMORY HEAPS

System Shared Mapped

– CPU memory that is GPU visible.

– Write combined & persistently mapped = easy & fast to write to in parallel at any time

System Shared Pinned

– CPU cached for readback.

– Not used much

Video Shared

– GPU memory accessible by CPU. Used for descriptor sets and dynamic buffers

– Max 256 MB (legacy constraint)

– Avoid keeping persistently mapped as VidMM doesn’t like this and can decide to move it back to CPU memory

Video Private

– GPU private memory.

– Used for render targets, textures and other resources CPU does not need to access

MEMORY REFERENCES

WDDM needs to know which memory allocations are referenced for each command buffer

– In order to make sure they are resident and not paged out

– Max ~1700 memory references are supported

– Overhead with having lots of references

Engine needs to keep track of what memory is referenced while building the command buffers

– Easy & fast to do

– Each reference is either read-only or read/write

– We use a simple global list of references shared for all command buffers.
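
A small model of such a global reference list, assuming string handles for the pooled allocations. The merge rule (any write upgrades the reference to read/write) is an assumption about how the bookkeeping might be done, not a documented Mantle behaviour.

```python
MAX_MEMORY_REFERENCES = 1700  # approximate limit quoted on the slide


class MemoryReferenceList:
    """Global set of memory allocations referenced by this frame's command buffers."""

    def __init__(self):
        self.refs = {}   # allocation handle -> True if read-only

    def add(self, handle, read_only):
        if handle in self.refs:
            # A write anywhere makes the whole reference read/write.
            self.refs[handle] = self.refs[handle] and read_only
            return
        if len(self.refs) >= MAX_MEMORY_REFERENCES:
            raise RuntimeError("too many memory references for one submission")
        self.refs[handle] = read_only

    def for_submit(self):
        """List of (handle, read_only) pairs to pass along with the command buffers."""
        return sorted(self.refs.items())


refs = MemoryReferenceList()
refs.add("texture_pool_0", read_only=True)
refs.add("rt_pool_0", read_only=False)
refs.add("texture_pool_0", read_only=True)
refs.add("buffer_pool_0", read_only=True)
refs.add("buffer_pool_0", read_only=False)   # upgraded to read/write
```

Because resources are pooled into large chunks, the number of distinct handles stays far below the number of resources.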

MEMORY POOLING

Pooling memory allocations was required for us

– Sub-allocate within larger 1-32 MB chunks

– All resources store a memory handle + offset

– Not as elegant as just void* on consoles

– Fragmentation can be a concern, but was not much of an issue for us in practice

GPU virtual memory mapping is fully supported, can simplify & optimize management
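
The handle + offset scheme can be sketched with a first-fit free list per chunk. The allocator below is an assumption about one workable design, with tiny sizes for readability; each new chunk would be a single driver allocation in practice.

```python
class PooledAllocator:
    """Sub-allocates from large chunks; returns (chunk handle, offset) pairs."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.free_ranges = []   # per chunk: sorted list of (offset, size)

    def allocate(self, size, alignment=256):
        if size > self.chunk_size:
            raise ValueError("allocation larger than a chunk")
        for handle, ranges in enumerate(self.free_ranges):
            for i, (offset, free_size) in enumerate(ranges):
                aligned = -(-offset // alignment) * alignment  # round up
                padding = aligned - offset
                if padding + size <= free_size:
                    # First fit: keep the padding and the tail as free ranges.
                    pieces = []
                    if padding:
                        pieces.append((offset, padding))
                    tail = free_size - padding - size
                    if tail:
                        pieces.append((aligned + size, tail))
                    ranges[i:i + 1] = pieces
                    return handle, aligned
        # No chunk had room: create a new one and retry.
        self.free_ranges.append([(0, self.chunk_size)])
        return self.allocate(size, alignment)

    def free(self, handle, offset, size):
        ranges = self.free_ranges[handle]
        ranges.append((offset, size))
        ranges.sort()
        merged = [ranges[0]]
        for start, length in ranges[1:]:
            last_start, last_length = merged[-1]
            if last_start + last_length == start:   # coalesce neighbours
                merged[-1] = (last_start, last_length + length)
            else:
                merged.append((start, length))
        self.free_ranges[handle] = merged


pool = PooledAllocator(chunk_size=1024)
a = pool.allocate(300)
b = pool.allocate(300)
c = pool.allocate(600)   # does not fit in chunk 0, opens chunk 1
pool.free(*a, 300)
d = pool.allocate(512)   # reuses the coalesced hole at the start of chunk 0
```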

OVERCOMMITTING VIDEO MEMORY

Avoid overcommitting video memory!

– Will lead to severe stalls as VidMM moves blocks of memory back and forth

– VidMM is a black box

– One of the biggest issues we ran into during development

Recommendations

– Balance memory pools

– Make sure to use read-only memory references

– Use memory priorities

MEMORY PRIORITIES

Setting priorities on the memory allocations helps VidMM choose what to page out when it has to

5 priority levels

– Very high = Render targets with MSAA

– High = Render targets and UAVs

– Normal = Textures

– Low = Shader & constant buffers

– Very low = vertex & index buffers

MEMORY RESIDENCY FUTURE

For best results manage which resources are in video memory yourself & keep only ~80% used

– Avoid all stalls

– Can async DMA in and out

We are thinking of redesigning to fully avoid possibility of overcommitting

Hoping WDDM’s memory residency management can be simplified & improved in the future

RESOURCE MANAGEMENT

RESOURCE LIFETIMES

App manages lifetime of all resources

– Have to make sure GPU is not using an object or memory while we are freeing it on the CPU

– How we’ve always worked with GPUs on the consoles

– Multi-GPU adds some additional complexity that consoles do not have

We keep track of lifetimes on a per frame granularity

– Queues for object destruction & free memory operations

– Add to queue at any time on the CPU

– Process queues when GPU command buffers for the frame are done executing

– Tracked with command buffer fences
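
A minimal sketch of per-frame deferred destruction, assuming integer fence values that increase every frame. The class is hypothetical, and a real engine would also guard `release` with a lock so it can be called from any thread.

```python
from collections import deque


class DeferredReleaseQueue:
    """Delay destruction until the GPU has finished the frame that used the object."""

    def __init__(self):
        self.current = []        # releases requested during the frame being built
        self.pending = deque()   # (frame fence value, release callbacks)

    def release(self, callback):
        """May be called at any time on the CPU; nothing is freed yet."""
        self.current.append(callback)

    def end_frame(self, fence_value):
        """Tie this frame's releases to the fence signalled by its last command buffer."""
        self.pending.append((fence_value, self.current))
        self.current = []

    def process(self, completed_fence_value):
        """Run releases for every frame the GPU has finished executing."""
        while self.pending and self.pending[0][0] <= completed_fence_value:
            _, callbacks = self.pending.popleft()
            for callback in callbacks:
                callback()


freed = []
queue = DeferredReleaseQueue()
queue.release(lambda: freed.append("old_texture"))
queue.end_frame(fence_value=1)
queue.release(lambda: freed.append("old_buffer"))
queue.end_frame(fence_value=2)
queue.process(completed_fence_value=1)   # only frame 1 has finished on the GPU
```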

LINEAR FRAME ALLOCATOR

We use multiple linear allocators with Mantle for both transient buffers & images

– Used for huge amount of small constant data and other GPU frame data that CPU writes

– Easy to use and very low overhead

– Don’t have to care about lifetimes or state

Fixed memory buffers for each frame

– Super cheap sub-allocation from any thread

– If full, use heap allocation (also fast due to pooling)

Alternative: ring buffers

– Requires being able to stall & drain pipeline at any allocation if full, additional complexity for us
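
The linear allocator with heap fallback could be modelled like this. The lock is a stand-in for the atomic add a native implementation would likely use, and `heap_allocate` is a hypothetical callback into the pooled allocator.

```python
import threading


class LinearFrameAllocator:
    """Bump allocator over a fixed per-frame buffer with a heap fallback."""

    def __init__(self, capacity, heap_allocate):
        self.capacity = capacity
        self.offset = 0
        self.heap_allocate = heap_allocate   # fallback used when the buffer is full
        self.lock = threading.Lock()         # stand-in for an atomic add

    def allocate(self, size, alignment=16):
        with self.lock:
            start = -(-self.offset // alignment) * alignment   # round up
            if start + size <= self.capacity:
                self.offset = start + size
                return ("frame", start)
        return ("heap", self.heap_allocate(size))

    def reset(self):
        """Called when the frame's buffer can be reused; nothing is freed individually."""
        with self.lock:
            self.offset = 0


allocator = LinearFrameAllocator(capacity=64, heap_allocate=lambda size: f"heap_block_{size}")
a = allocator.allocate(20)
b = allocator.allocate(20)
c = allocator.allocate(20)   # frame buffer is full, falls back to the heap
```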

TILING

Textures should be tiled for performance

– Explicitly handled in Mantle, user selects linear or tiled

– Some formats (BC) can’t be accessed as linear by the GPU

On consoles we handle tiling offline as part of our data processing pipeline

– We know the exact tiling formats and have separate resources per platform

For Mantle

– Tiling formats are opaque, can be different between GPU architectures and image types

– Tile textures with DMA image upload from SystemShared to VideoPrivate
  – Linear source, tiled destination
  – Free

COMMAND BUFFERS

Command buffers are the atomic unit of work dispatched to the GPU

– Separate creation from execution

– No “immediate context” a la DX11 that can execute work at any call

– Makes resource synchronization and setup significantly easier & faster

Typical BF4 scenes have ~50 command buffers per frame

– Reasonable tradeoff for us with submission overhead vs CPU load-balancing

COMMAND BUFFER SOURCES

Frostbite has 2 separate sources of command buffers

– World rendering
  – Rendering the world with tons of objects, lots of draw calls
  – Have all frame data up front
  – All resources except for render targets are read-only
  – No resource state transitions
  – Generated in parallel up front each frame

– Immediate rendering (“the rest”)
  – Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc
  – Managing resource state, memory and running on different queues (graphics, compute, DMA)
  – Sequentially generated in a single job, simulate an immediate context by splitting the command buffer

Both are very important and have different requirements

RESOURCE TRANSITIONS

Key design in Mantle to significantly lower driver overhead & complexity

– Explicit hazard tracking by the app/engine

– Drives architecture-specific caches & compression

– AMD: FMASK, CMASK, HTILE

– Enables explicit memory management

Examples:

– Optimal render target writes → Graphics shader read-only

– Compute shader write-only → DrawIndirect arguments

Mantle has a strong validation layer that tracks transitions which is a major help

MANAGING RESOURCE TRANSITIONS

Engines need a clear design on how to handle state transitions

Multiple approaches possible:

– Sequential in-order command buffers
  – Generate one command buffer at a time, in order
  – Transition resources on-demand when doing operations on them, very simple
  – Recommendation: start with this

– Out-of-order multiple command buffers
  – Track state per command buffer, fix up transitions when order of command buffers is known

– Hybrid approaches & more
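
The sequential, on-demand approach can be sketched as a tracker that records a transition only when a resource's state actually changes. State names such as `render_target_write` and `shader_read` are illustrative labels, not Mantle enum values.

```python
class TransitionTracker:
    """Tracks one state per resource and records transitions on demand."""

    def __init__(self):
        self.state = {}      # resource -> current state (missing = initial/unknown)
        self.commands = []   # commands recorded in submission order

    def require(self, resource, new_state):
        old_state = self.state.get(resource)
        if old_state != new_state:
            self.commands.append(("transition", resource, old_state, new_state))
            self.state[resource] = new_state

    def draw(self, render_target, textures):
        self.require(render_target, "render_target_write")
        for texture in textures:
            self.require(texture, "shader_read")
        self.commands.append(("draw", render_target))


tracker = TransitionTracker()
tracker.draw("gbuffer", ["albedo_tex"])
tracker.draw("hdr_target", ["gbuffer", "albedo_tex"])
```

The second draw transitions `gbuffer` from write to read but emits nothing for `albedo_tex`, which is already in the required state.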

MANAGING RESOURCE TRANSITIONS IN FROSTBITE

Current approach in Frostbite is quite basic:

– We keep track of a single state for each resource (not subresource)

– The “immediate rendering” transitions resources as needed depending on operation

– The out of order “world rendering” command buffers don’t need to transition states
  – Already have write access to RTs and read access to all resources, set up outside them
  – Avoids the problem of them not knowing the state during generation

Works now but as we do more general parallel rendering it will have to change

– Track resource state for each command buffer & fixup between command buffers

QUEUES

Universal queue can do graphics, compute and presents

We also use additional queues to parallelize GPU operations:

– DMA queue – Improve perf with faster transfers & avoiding idling graphics while transferring

– Compute queue – Improve perf by utilizing idle ALU and updating resources simultaneously with gfx

More GPUs = more queues!

QUEUES SYNCHRONIZATION

Order of execution within a queue is sequential

Synchronize multiple queues with GPU semaphores (signal & wait)

Also works across multiple GPUs

[Diagram: compute and graphics queue timelines synchronized with semaphore signal (S) and wait (W)]

QUEUES SYNCHRONIZATION CONT

Started out with explicit semaphores

– Error prone to handle when having lots of different semaphores & queues

– Difficult to visualize & debug

Switched to a representation more similar to a job graph

– Just a model on top of the semaphores

GPU JOB GRAPH

Each GPU job has a list of dependencies (other command buffers)

– Dependencies have to finish before the job can run on its queue

– The dependencies can be from any queue

Was easier to work with, debug and visualize

– Really extendable going forward

[Diagram: job graph with Graphics 1, DMA, Compute and Graphics 2 jobs]
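
A rough model of a job graph layered on top of semaphores, assuming jobs are added in submission order so every dependency already exists. Job names, queue names, and the semaphore naming scheme are hypothetical.

```python
class GpuJobGraph:
    """Jobs name their dependencies; semaphores are derived from the graph."""

    def __init__(self):
        self.jobs = {}    # job name -> (queue, dependency names)
        self.order = []   # submission order

    def add_job(self, name, queue, depends_on=()):
        for dep in depends_on:
            if dep not in self.jobs:
                raise ValueError(f"{name} depends on unknown job {dep}")
        self.jobs[name] = (queue, tuple(depends_on))
        self.order.append(name)

    def build_submissions(self):
        """Return per-queue command lists with signal/wait pairs for cross-queue edges."""
        signals = {}   # job name -> semaphores to signal after it
        waits = {}     # job name -> semaphores to wait on before it
        for name in self.order:
            queue, deps = self.jobs[name]
            for dep in deps:
                if self.jobs[dep][0] != queue:   # same-queue order is implicit
                    semaphore = f"sem:{dep}->{name}"
                    signals.setdefault(dep, []).append(semaphore)
                    waits.setdefault(name, []).append(semaphore)
        queues = {}
        for name in self.order:
            queue, _ = self.jobs[name]
            commands = queues.setdefault(queue, [])
            commands += [("wait", s) for s in waits.get(name, [])]
            commands.append(("execute", name))
            commands += [("signal", s) for s in signals.get(name, [])]
        return queues


graph = GpuJobGraph()
graph.add_job("upload", "dma")
graph.add_job("graphics_1", "graphics", depends_on=["upload"])
graph.add_job("lighting", "compute", depends_on=["graphics_1"])
graph.add_job("graphics_2", "graphics", depends_on=["graphics_1", "lighting"])
submissions = graph.build_submissions()
```

The dependency from `graphics_2` to `graphics_1` needs no semaphore because both run on the same queue, where execution order is sequential.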

ASYNC DMA

AMD GPUs have dedicated hardware DMA engines, let’s use them!

– Uploading through DMA is faster than on universal queue, even if blocking

– DMA has alignment restrictions, so we have to support falling back to copies on the universal queue

Use case: Frame buffer & texture uploads

– Used by resource initial data uploads and our UpdateSubresource

– Guaranteed to be finished before the GPU universal queue starts rendering the frame

Use case: Multi-GPU frame buffer copy

– Peer-to-peer copy of the frame buffer to the GPU that will present it

ASYNC COMPUTE

Frostbite has lots of compute shader passes that could run in parallel with graphics work

– HBAO, blurring, classification, tile-based lighting, etc

Running as async compute can improve GPU performance by utilizing “free” ALU

– For example while doing shadowmap rendering (ROP bound)

ASYNC COMPUTE – TILE-BASED LIGHTING

3 sequential compute shaders

– Input: zbuffer & gbuffer

– Output: HDR texture/UAV

Runs in parallel with graphics pipeline that renders to other targets

[Diagram: compute and graphics queue timelines. Passes shown: TileZ, Gbuffer, Shadowmaps, Reflection, Distort, Transp, Cull lights, Lighting; synchronized with semaphore signal (S) and wait (W)]

ASYNC COMPUTE – TILE-BASED LIGHTING

We manually prepare the resources for the async compute

– Important to not access the resources on other queues at the same time (unless read-only state)

– Have to transition resources on the queue that last used them

Up to 80% faster in our initial tests, but not fully reliable

– But is a pretty small part of the frame time

– Not in BF4 yet

MULTI-GPU

Multi-GPU alternatives:

– AFR – Alternate Frame Rendering (1-4 GPUs of the same power)

– Heterogeneous AFR – 1 small + 1 big GPU (APU + Discrete)

– SFR – Split Frame Rendering

– Multi-GPU Job Graph – Primary strong GPU + slave GPUs helping

Frostbite supports AFR natively

– No synchronization points within the frame

– For resources that are not rendered every frame: re-render resources for each GPU
  – Example: sky envmap update on weather change

With Mantle multi-GPU is explicit and we have to build support for it ourselves

MULTI-GPU AFR WITH MANTLE

All resources explicitly duplicated on each GPU with async DMA

– Hidden internally in our rendering abstraction

Every frame alternate which GPU we build command buffers for and are using resources from

Our UpdateSubresource has to make sure it updates resources on all GPUs

Presenting the screen has to in some modes copy the frame buffer to the GPU that owns the display

Bonus:

– Can simulate multi-GPU mode even with single GPU to debug AFR issues!

– Multi-GPU works in windowed mode!

MULTI-GPU ISSUES

GPUs are independently rendering & presenting to the screen – can cause micro-stuttering

– Frames are not presented at regular intervals

– Frame rate can be high but presentation & gameplay is not smooth

– FCAT is a good tool to analyse this

[Diagram: GPU0 and GPU1 timelines for frames 0-3, each ending in a present (P), with an irregular presentation interval]

We need to introduce dependency & dampening between the GPUs to alleviate this – frame pacing

[Diagram: GPU0 and GPU1 timelines for frames 0-3 with presents (P) at the ideal, regular presentation interval]

FRAME PACING

Measure average frame rate on each GPU

– Short history (10-30 frames)

– Filter out spikes

Insert delay on the GPU before each present

– Force the frame times to become more regular and GPUs to align

– Delay value is based on the calculated avg frame rate

[Diagram: GPU0 and GPU1 timelines for frames 0-3 with a delay (D) inserted before each present (P)]
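
A sketch of how the delay might be derived, under stated assumptions: the slides do not give the formula, so the median-based spike filter, the `spike_factor` threshold, and dividing the average frame time by the GPU count to get the target present interval are all guesses at one reasonable scheme.

```python
from collections import deque
from statistics import median


class FramePacer:
    """Estimates a steady frame time and the delay to insert before a present."""

    def __init__(self, gpu_count, history=20, spike_factor=2.0):
        self.gpu_count = gpu_count
        self.frame_times = deque(maxlen=history)   # short history of GPU frame times (ms)
        self.spike_factor = spike_factor

    def add_frame_time(self, ms):
        self.frame_times.append(ms)

    def average_frame_time(self):
        """Average of the history after dropping frames far above the median."""
        typical = median(self.frame_times)
        kept = [t for t in self.frame_times if t <= typical * self.spike_factor]
        return sum(kept) / len(kept)

    def delay_before_present(self, ms_since_last_present):
        """Delay so presents land one interval apart across all GPUs."""
        interval = self.average_frame_time() / self.gpu_count
        return max(0.0, interval - ms_since_last_present)


pacer = FramePacer(gpu_count=2)
for ms in [20.0, 20.0, 20.0, 100.0, 20.0]:
    pacer.add_frame_time(ms)
early_delay = pacer.delay_before_present(ms_since_last_present=4.0)
late_delay = pacer.delay_before_present(ms_since_last_present=12.0)
```

With two GPUs each taking about 20 ms per frame, the target interval between presents is 10 ms, so a present arriving 4 ms after the previous one is held back and a late one is not delayed at all.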

CONCLUSION

MANTLE DEV RECOMMENDATIONS

The validation layer is a critical friend!

You’ll end up with a lot of object & memory management code, try to share it with console code

Make sure you have control over memory usage and can avoid overcommitting video memory

Build a robust solution for resource state management early

Figure out how to pre-create your graphics pipelines, can require engine design changes

Build for multi-GPU support from the start, easier than to retrofit

FUTURE

Second wave of Frostbite Mantle titles

Adapt Frostbite core rendering layer based on learnings from Mantle

– Refine binding & buffer updates to further reduce overhead

– Virtual memory management

– More async compute & async DMAs

– Multi-GPU job graph R&D

Linux

– Would like to see how our Mantle renderer behaves with different memory management & driver model

QUESTIONS?

Email: [email protected]

http://frostbite.com

Twitter: @repi

MANTLE SQUAD

Frostbite

– Johan Andersson

– Jasper Bekkers

– Yuriy O’Donnell

– Arne Schober

– Graham Wihlidal

AMD

– Brian Bennett

– Michael Grossfeld

– Guennadi Riguer

