
Rendering Battlefield 4 with Mantle

Johan Andersson & Yuriy O'Donnell, Electronic Arts

#

Core i7-3970x, AMD Radeon R9 290x, 1080p ULTRA
DX11: avg 78 fps, min 42 fps
Mantle: avg 120 fps, min 94 fps (+58%!)

#BF4 Mantle goals
Goals:
Significantly improve CPU performance
More consistent & stable performance
Improve GPU performance where possible
Add support for a new Mantle rendering backend in a live game
Minimize changes to engine interfaces
Compatible with built PC content
Work on a wide set of hardware: APU to quad-GPU
But x64 only (32-bit Windows needs to die)
Non-goals:

Design new renderer from scratch for Mantle

Take advantage of asymmetric MGPU (APU+discrete)

Optimize video memory consumption

#BF4 Mantle strategic goals

Prove that low-level graphics APIs work outside of consoles

Push the industry towards low-level graphics APIs everywhere

Build a foundation for the future that we can build great games on

#Agenda

Shaders
Pipelines
Memory
Resources
Command buffers
Queues
Multiple GPUs

#Shaders

#Shader conversion
DX11 bytecode shaders get converted to AMDIL & the mapping applied using the ILC tool
Done at load time
Don't have to change our shaders!

Have full source & control over the process

Could write AMDIL directly or use other frontends if wanted

#Shader resources
Shader resource bind points replaced with a resource table object - a descriptor set
This is how the hardware accesses the shader resources
Flat list of images, buffers and samplers used by any of the shader stages
Vertex shader streams converted to vertex shader buffer loads

Engine assigns each shader resource to a specific slot in the descriptor set(s)
Can share slots between shader stages = smaller descriptor sets
The mapping takes a while to wrap one's head around

#Descriptor sets
Very simple usage in BF4: for each draw call write a flat list of resources
Essentially a direct replacement of SetTexture/SetConstantBuffer/SetInputStream

Single dynamic descriptor set object per frame
Sub-allocate for each draw call and write the list of resources (sketched below)

~15000 resource slots written per frame in BF4, still very fast
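As a rough illustration of the "single dynamic descriptor set per frame, sub-allocate per draw call" idea, here is a minimal C++ sketch. GpuDescriptor, FrameDescriptorPool and the backing storage are hypothetical stand-ins, not the actual Mantle or Frostbite interfaces.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical flat descriptor record: an image, buffer or sampler handle.
struct GpuDescriptor { uint64_t handle; uint32_t kind; };

// One large dynamic descriptor set per frame; each draw call sub-allocates a
// contiguous range of slots and writes its flat resource list into it.
class FrameDescriptorPool {
public:
    explicit FrameDescriptorPool(uint32_t slotCapacity)
        : m_slots(slotCapacity), m_next(0) {}

    void beginFrame() { m_next.store(0, std::memory_order_relaxed); }

    // Returns the first slot of the draw call's resource table, or -1 if the
    // per-frame pool is exhausted.
    int32_t allocateAndWrite(const GpuDescriptor* resources, uint32_t count) {
        uint32_t first = m_next.fetch_add(count, std::memory_order_relaxed);
        if (first + count > m_slots.size())
            return -1;
        std::memcpy(&m_slots[first], resources, count * sizeof(GpuDescriptor));
        return static_cast<int32_t>(first);
    }

private:
    std::vector<GpuDescriptor> m_slots; // would live in CPU-visible GPU memory
    std::atomic<uint32_t> m_next;       // bump pointer shared by all threads
};
```

Each draw call would then bind the frame's descriptor set once and reference its resources via the returned slot offset.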

#Descriptor sets

#Descriptor sets future optimizations

Use static descriptor sets when possible

Reduce resource duplication by reusing & sharing more across shader stages

Nested descriptor sets

#Pipelines

#Compute pipelines
1:1 mapping between pipeline & shader

No state built into pipeline

Can execute in parallel with rendering

~100 compute pipelines in BF4

#Graphics pipelines
All graphics shader stages combined into a single pipeline object together with important graphics state

~10000 graphics pipelines in BF4 on a single level, ~25 MB of video memory

Could use a smaller working pool of active state objects to keep a reasonable amount in memory
Has not been required for us

#Pre-building pipelines
Graphics pipeline creation is an expensive operation - do it at load time instead of runtime!
Creating one of our graphics pipelines takes ~10-60 ms each
Pre-build using N parallel low-priority jobs (see the sketch below)
Avoids 99.9% of runtime stalls caused by pipeline creation!

Requires knowing the graphics pipeline state that will be used with the shaders:
Primitive type
Render target formats
Render target write masks
Blend modes

Not fully trivial to know all state, may require engine changes / pre-defining use cases
Important to design for!
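A sketch of the load-time pre-build using N parallel jobs. GraphicsPipelineDesc and createGraphicsPipeline are placeholders for the real pipeline description and the expensive Mantle pipeline creation call; job priorities are omitted since they are platform specific.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical pipeline description/object; createGraphicsPipeline() stands
// in for the expensive driver-side pipeline build (~10-60 ms per pipeline).
struct GraphicsPipelineDesc { /* shaders, primitive type, RT formats, blend, ... */ };
struct GraphicsPipeline {};
static GraphicsPipeline* createGraphicsPipeline(const GraphicsPipelineDesc&) {
    return new GraphicsPipeline();
}

// Pre-build every known pipeline at load time with N parallel jobs so that
// almost no pipeline has to be created during gameplay.
static void prebuildPipelines(const std::vector<GraphicsPipelineDesc>& descs,
                              std::vector<GraphicsPipeline*>& out,
                              unsigned workerCount)
{
    out.assign(descs.size(), nullptr);
    std::atomic<size_t> cursor{0};
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < workerCount; ++i) {
        workers.emplace_back([&] {
            // Each worker grabs the next unbuilt pipeline until none are left.
            for (size_t job; (job = cursor.fetch_add(1)) < descs.size();)
                out[job] = createGraphicsPipeline(descs[job]);
        });
    }
    for (std::thread& w : workers)
        w.join();
}
```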

#Pipeline cache
Cache built pipelines both in a memory cache and a disk cache
Improved loading times
Max 300 MB
Simple LRU policy
LZ4 compressed

Database signature (see the cache sketch below):
Driver version
Vendor ID
Device ID
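One way the in-memory side of such a cache could look; this is a sketch, not the Frostbite implementation. CacheSignature mirrors the driver version / vendor ID / device ID fields from the slide, while the disk serialization and LZ4 compression are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// The disk cache is only valid for the exact driver/GPU it was built on.
struct CacheSignature {
    std::string driverVersion;
    uint32_t    vendorId;
    uint32_t    deviceId;
    bool operator==(const CacheSignature& o) const {
        return driverVersion == o.driverVersion &&
               vendorId == o.vendorId && deviceId == o.deviceId;
    }
};

// In-memory LRU cache of built pipeline binaries, keyed by a hash of the
// pipeline state; total size is capped (the deck mentions a 300 MB budget).
// Assumes each key is inserted at most once.
class PipelineCache {
public:
    explicit PipelineCache(size_t maxBytes) : m_maxBytes(maxBytes), m_bytes(0) {}

    const std::vector<uint8_t>* find(uint64_t key) {
        auto it = m_entries.find(key);
        if (it == m_entries.end())
            return nullptr;
        m_lru.splice(m_lru.begin(), m_lru, it->second.lruPos); // mark most recent
        return &it->second.blob;
    }

    void insert(uint64_t key, std::vector<uint8_t> blob) {
        m_bytes += blob.size();
        m_lru.push_front(key);
        m_entries.emplace(key, Entry{std::move(blob), m_lru.begin()});
        while (m_bytes > m_maxBytes && !m_lru.empty()) {   // evict least recent
            auto victim = m_entries.find(m_lru.back());
            m_bytes -= victim->second.blob.size();
            m_entries.erase(victim);
            m_lru.pop_back();
        }
    }

private:
    struct Entry { std::vector<uint8_t> blob; std::list<uint64_t>::iterator lruPos; };
    size_t m_maxBytes, m_bytes;
    std::list<uint64_t> m_lru;
    std::unordered_map<uint64_t, Entry> m_entries;
};
```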

#Dynamic state objects
Graphics state is only set with the pipeline object and 5 dynamic state objects
State objects: color blend, raster, viewport, depth-stencil, MSAA
No other parameters such as in DX11 with stencil ref or SetViewport functions

Frostbite use case:
Pre-create when possible
Otherwise on-demand creation (hash map, sketched below)
Only ~100 state objects!

Still possible to end up with lots of state objects
Esp. with state object float & integer values (depth bounds, depth bias, viewport)
But no need to store all permutations in memory - objects are fast to create & the app manages lifetimes
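A sketch of the on-demand path with a hash map, shown for one of the five state object types. ColorBlendDesc and createColorBlendState are illustrative stand-ins rather than the actual Mantle structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical blend-state description; the other four state objects
// (raster, viewport, depth-stencil, MSAA) would be cached the same way.
struct ColorBlendDesc {
    uint32_t srcFactor, dstFactor, blendOp, writeMask;
    bool operator==(const ColorBlendDesc& o) const {
        return srcFactor == o.srcFactor && dstFactor == o.dstFactor &&
               blendOp == o.blendOp && writeMask == o.writeMask;
    }
};

struct ColorBlendDescHash {
    size_t operator()(const ColorBlendDesc& d) const {
        uint64_t h = 14695981039346656037ull;               // FNV-1a style hash
        uint32_t fields[4] = { d.srcFactor, d.dstFactor, d.blendOp, d.writeMask };
        for (uint32_t v : fields) { h ^= v; h *= 1099511628211ull; }
        return static_cast<size_t>(h);
    }
};

struct ColorBlendState {};   // opaque driver object stand-in
static ColorBlendState* createColorBlendState(const ColorBlendDesc&) {
    return new ColorBlendState();
}

// Pre-create common states where possible; otherwise create on demand and
// cache, since state objects are cheap to build and the app owns lifetimes.
class BlendStateCache {
public:
    ColorBlendState* getOrCreate(const ColorBlendDesc& desc) {
        auto it = m_states.find(desc);
        if (it != m_states.end())
            return it->second;
        ColorBlendState* s = createColorBlendState(desc);
        m_states.emplace(desc, s);
        return s;
    }
private:
    std::unordered_map<ColorBlendDesc, ColorBlendState*, ColorBlendDescHash> m_states;
};
```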

#Memory

#Memory management
Mantle devices expose multiple memory heaps with different characteristics
Can be different between devices, drivers and OSes

User explicitly places resources in wanted heaps
Driver suggests preferred heaps when creating objects, not a requirement

Type / Size / Page / CPU access / GPU read / GPU write / CPU read / CPU write:
Local / 256 MB / 65535 / CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined / 130 / 170 / 0.0058 / 2.8
Local / 4096 MB / 65535 / (no CPU access) / 130 / 180 / 0 / 0
Remote / 16106 MB / 65535 / CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined / 2.6 / 2.6 / 0.1 / 3.3
Remote / 16106 MB / 65535 / CpuVisible|CpuGpuCoherent / 2.6 / 2.6 / 3.2 / 2.9

#Frostbite memory heaps
System Shared Mapped
CPU memory that is GPU visible. Write combined & persistently mapped = easy & fast to write to in parallel at any time

System Shared Pinned
CPU cached for readback. Not used much

Video Shared
GPU memory accessible by the CPU. Used for descriptor sets and dynamic buffers
Max 256 MB (legacy constraint)
Avoid keeping it persistently mapped, as VidMM doesn't like this and can decide to move it back to CPU memory

Video Private
GPU private memory. Used for render targets, textures and other resources the CPU does not need to access (see the heap-selection sketch below)
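As a rough illustration of how resources might be routed to these four heap classes, here is a small sketch; the enum names and the routing policy are inferred from the descriptions above, not taken from Frostbite code.

```cpp
// Hypothetical mapping from resource usage to the four heap categories
// described in this slide; names are illustrative.
enum class HeapClass {
    SystemSharedMapped,  // CPU write-combined, persistently mapped
    SystemSharedPinned,  // CPU cached, for readback
    VideoShared,         // GPU memory visible to the CPU (max 256 MB)
    VideoPrivate         // GPU-only
};

enum class ResourceUsage { RenderTarget, Texture, DescriptorSet, DynamicBuffer,
                           UploadStaging, ReadbackBuffer };

HeapClass chooseHeap(ResourceUsage usage) {
    switch (usage) {
    case ResourceUsage::RenderTarget:
    case ResourceUsage::Texture:        return HeapClass::VideoPrivate;
    case ResourceUsage::DescriptorSet:
    case ResourceUsage::DynamicBuffer:  return HeapClass::VideoShared;
    case ResourceUsage::UploadStaging:  return HeapClass::SystemSharedMapped;
    case ResourceUsage::ReadbackBuffer: return HeapClass::SystemSharedPinned;
    }
    return HeapClass::VideoPrivate;
}
```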

#Memory references
WDDM needs to know which memory allocations are referenced for each command buffer
In order to make sure they are resident and not paged out
Max ~1700 memory references are supported
Overhead with having lots of references

Engine needs to keep track of what memory is referenced while building the command buffers
Easy & fast to do
Each reference is either read-only or read/write
We use a simple global list of references shared for all command buffers (sketched below)
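A sketch of the kind of global, deduplicated reference list described here, assuming a hypothetical GpuMemHandle type; the assembled list would be handed to the driver at command buffer submission so WDDM can keep those allocations resident.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

using GpuMemHandle = uint64_t;   // stand-in for a Mantle memory object handle

// Each referenced allocation is flagged read-only or read/write.
struct MemoryReference { GpuMemHandle mem; bool readOnly; };

class MemoryReferenceList {
public:
    // Called from any thread while command buffers are being built.
    void add(GpuMemHandle mem, bool readOnly) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_index.find(mem);
        if (it == m_index.end()) {
            m_index.emplace(mem, m_refs.size());
            m_refs.push_back({mem, readOnly});
        } else if (!readOnly) {
            m_refs[it->second].readOnly = false;   // upgrade to read/write
        }
    }

    const std::vector<MemoryReference>& forSubmit() const { return m_refs; }
    void clear() { m_refs.clear(); m_index.clear(); }

private:
    std::mutex m_mutex;
    std::vector<MemoryReference> m_refs;              // shared by all command buffers
    std::unordered_map<GpuMemHandle, size_t> m_index; // dedupe (max ~1700 refs supported)
};
```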

#Memory pooling
Pooling memory allocations was required for us
Sub-allocate within larger 1-32 MB chunks (sketched after this slide)
All resources stored as memory handle + offset
Not as elegant as just a void* on consoles
Fragmentation can be a concern - not too many issues for us in practice

GPU virtual memory mapping is fully supported, can simplify & optimize management
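A minimal sketch of sub-allocating from large chunks and addressing resources as handle + offset. allocateGpuChunk and the handle type are stand-ins for the real Mantle allocation call; a production allocator would also track free lists and handle fragmentation.

```cpp
#include <cstdint>
#include <vector>

using GpuMemHandle = uint64_t;   // stand-in for a Mantle memory handle

static GpuMemHandle allocateGpuChunk(uint64_t /*size*/) {
    static GpuMemHandle next = 1;   // stub; a real version calls the Mantle allocator
    return next++;
}

// Every resource ends up addressed as memory handle + offset rather than a
// raw pointer; allocations are sub-allocated from larger 1-32 MB chunks.
struct PoolAllocation { GpuMemHandle mem; uint64_t offset; };

class ChunkPool {
public:
    explicit ChunkPool(uint64_t chunkSize) : m_chunkSize(chunkSize) {}

    // Assumes size <= chunkSize and power-of-two alignment.
    PoolAllocation allocate(uint64_t size, uint64_t alignment) {
        uint64_t aligned = (m_used + alignment - 1) & ~(alignment - 1);
        if (m_chunks.empty() || aligned + size > m_chunkSize) {
            m_chunks.push_back(allocateGpuChunk(m_chunkSize)); // start a new chunk
            aligned = 0;
        }
        m_used = aligned + size;
        return { m_chunks.back(), aligned };
    }

private:
    uint64_t m_chunkSize;
    uint64_t m_used = 0;
    std::vector<GpuMemHandle> m_chunks;
};
```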

#Overcommitting video memory
Avoid overcommitting video memory!
Will lead to severe stalls as VidMM moves blocks and moves memory back and forth
VidMM is a black box
One of the biggest issues we ran into during development

Recommendations:
Balance memory pools
Make sure to use read-only memory references
Use memory priorities

#Memory priorities
Setting priorities on the memory allocations helps VidMM choose what to page out when it has to

5 priority levels:
Very high = render targets with MSAA
High = render targets and UAVs
Normal = textures
Low = shader & constant buffers
Very low = vertex & index buffers

#Memory residency future
For best results, manage which resources are in video memory yourself & keep only ~80% used
Avoids all stalls
Can async DMA in and out

We are thinking of redesigning to fully avoid the possibility of overcommitting

Hoping WDDM's memory residency management can be simplified & improved in the future

#Resource management

#Resource lifetimes
App manages the lifetime of all resources
Have to make sure the GPU is not using an object or memory while we are freeing it on the CPU
How we've always worked with GPUs on the consoles
Multi-GPU adds some additional complexity that consoles do not have

We keep track of lifetimes on a per-frame granularity
Queues for object destruction & free-memory operations (sketched below)
Add to queue at any time on the CPU
Process queues when the GPU command buffers for the frame are done executing
Tracked with command buffer fences
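A sketch of per-frame deferred destruction driven by command buffer fences; GpuFence and isFenceSignaled are stand-ins for the real fence objects and query, and the stub always reports completion.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for a command buffer fence; a real implementation would query the
// fence submitted with the frame's command buffers.
using GpuFence = uint64_t;
static bool isFenceSignaled(GpuFence) { return true; }

// Destruction requests can be queued from any CPU thread at any time; they
// are only executed once the GPU has finished the frame that last used them.
class FrameGarbageCollector {
public:
    void queueDestroy(std::function<void()> destroyOp) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push_back(std::move(destroyOp));
    }

    // Called when the frame's command buffers have been submitted.
    void onFrameSubmitted(GpuFence frameFence) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_inFlight.push_back({frameFence, std::move(m_pending)});
        m_pending.clear();
    }

    // Called once per frame: run the queued operations for every frame whose
    // command buffers the GPU has finished executing.
    void collect() {
        std::lock_guard<std::mutex> lock(m_mutex);
        while (!m_inFlight.empty() && isFenceSignaled(m_inFlight.front().fence)) {
            for (auto& op : m_inFlight.front().ops)
                op();
            m_inFlight.pop_front();
        }
    }

private:
    struct FrameBatch { GpuFence fence; std::vector<std::function<void()>> ops; };
    std::mutex m_mutex;
    std::vector<std::function<void()>> m_pending; // current frame's queue
    std::deque<FrameBatch> m_inFlight;            // waiting on GPU completion
};
```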

#Linear frame allocator
We use multiple linear allocators with Mantle for both transient buffers & images
Used for a huge amount of small constant data and other GPU frame data that the CPU writes
Easy to use and very low overhead
Don't have to care about lifetimes or state

Fixed memory buffers for each frame
Super cheap sub-allocation from any thread (see the sketch below)
If full, use heap allocation (also fast due to pooling)

Alternative: ring buffers
Requires being able to stall & drain the pipeline at any allocation if full - additional complexity for us
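A sketch of the per-frame linear allocator idea: a fixed, pre-mapped buffer with an atomic bump pointer so any thread can sub-allocate; class and member names are illustrative.

```cpp
#include <atomic>
#include <cstdint>

// Per-frame linear allocator over a fixed, persistently mapped buffer.
// Sub-allocation is a single atomic add, so any thread can use it; when the
// buffer is full the caller falls back to (pooled) heap allocation.
class LinearFrameAllocator {
public:
    LinearFrameAllocator(uint8_t* base, uint64_t capacity)
        : m_base(base), m_capacity(capacity), m_offset(0) {}

    void beginFrame() { m_offset.store(0, std::memory_order_relaxed); }

    // Returns nullptr when the fixed buffer is exhausted.
    uint8_t* allocate(uint64_t size, uint64_t alignment) {
        uint64_t padded = size + alignment - 1;              // reserve worst case
        uint64_t start = m_offset.fetch_add(padded, std::memory_order_relaxed);
        if (start + padded > m_capacity)
            return nullptr;
        uint64_t aligned = (start + alignment - 1) & ~(alignment - 1);
        return m_base + aligned;
    }

private:
    uint8_t* m_base;                 // CPU-writable, GPU-visible memory
    uint64_t m_capacity;
    std::atomic<uint64_t> m_offset;  // bump pointer shared by all threads
};
```

Since nothing is freed individually, there is no lifetime or state tracking per allocation; the whole buffer is simply reset at the start of the next frame that no longer needs it.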

#Tiling
Textures should be tiled for performance
Explicitly handled in Mantle - the user selects linear or tiled
Some formats (BC) can't be accessed as linear by the GPU

On consoles we handle tiling offline as part of our data processing pipeline
We know the exact tiling formats and have separate resources per platform

For Mantle:
Tiling formats are opaque, can be different between GPU architectures and image types
Tile textures with a DMA image upload from System Shared to Video Private
Linear source, tiled destination
Free

#Command buffers

#Command buffers
Command buffers are the atomic unit of work dispatched to the GPU
Separate creation from execution
No immediate context a la DX11 that can execute work at any call
Makes resource synchronization and setup significantly easier & faster

Typical BF4 scenes have around ~50 command buffers per frame
Reasonable tradeoff for us between submission overhead and CPU load-balancing

#Command buffer sources
Frostbite has 2 separate sources of command buffers

World rendering
Rendering the world with tons of objects, lots of draw calls. Have all frame data up front
All resources except for render targets are read-only
No resource state transitions
Generated in parallel