Performance tips for Windows Store apps using DirectX and C++ Max McMullen Principal Development Lead – Direct3D Microsoft Corporation 4-102

Upload: kory-banks

Post on 17-Dec-2015


Agenda

Overview

Measuring rendering performance

Power efficient GPU characteristics

Optimizing for power efficient GPUs

Overview

Optimizing for the Windows 8/RT OS

New form factors and platforms require new optimizations

Windows uses DirectX to get every pixel on screen

Direct3D 11.1 provides new APIs to optimize rendering

Use optimized Windows 8/RT platforms

All Windows Store apps use DirectX for rendering

WWA and XAML make optimized use of Direct2D and Direct3D 11.1

Direct2D and Direct2D Effects fully leverage Direct3D 11.1

But sometimes you really need to use Direct3D itself…

What you should know

Basics of building a C++ Windows Store app

Direct3D fundamentals

Measuring rendering performance

Many useful tools for Windows performance optimization: Visual Studio Performance Profiler, Visual Studio Graphics Diagnostics, hardware partner tools…

Two primary tools used to optimize Direct3D usage in the Windows 8/RT OS:

Basic: FPS/time measurement in app/microbenchmarks

Advanced: GPUView

How do you measure rendering performance?

Frames per second (FPS)

Quick but sometimes misleading

C++/DirectX Windows Store apps sync to the display refresh

Measure render time, not present

Call ID3D11DeviceContext::Flush instead of IDXGISwapChain::Present

Infrequent output: file output

Frequent output: look at FPSCounter.cpp in the GeometryRealization sample
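The in-app measurement described above can be sketched as a rolling average of frame durations. This is a minimal sketch, not the FPSCounter.cpp code from the GeometryRealization sample: the class name FrameTimer and its methods are hypothetical, and it assumes the app supplies each frame's duration in milliseconds (for example from two std::chrono::steady_clock readings taken around the render work).

```cpp
#include <cstddef>
#include <deque>

// Hypothetical helper: keeps the most recent frame durations (milliseconds)
// and reports their average and the equivalent frames per second.
class FrameTimer {
public:
    explicit FrameTimer(std::size_t window) : m_window(window) {}

    // Record the duration of one frame, dropping the oldest sample
    // once more than `window` samples are held.
    void AddFrame(double milliseconds) {
        m_samples.push_back(milliseconds);
        m_total += milliseconds;
        if (m_samples.size() > m_window) {
            m_total -= m_samples.front();
            m_samples.pop_front();
        }
    }

    // Average frame duration over the current window, 0 if empty.
    double AverageMs() const {
        return m_samples.empty()
                   ? 0.0
                   : m_total / static_cast<double>(m_samples.size());
    }

    // Frames per second implied by the average duration.
    double Fps() const {
        const double avg = AverageMs();
        return avg > 0.0 ? 1000.0 / avg : 0.0;
    }

private:
    std::size_t m_window;
    std::deque<double> m_samples;
    double m_total = 0.0;
};
```

Because presents sync to the display refresh, an FPS figure taken around Present tends to sit at the refresh rate; feeding this helper the time spent on render work alone shows how much headroom a frame actually has.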

Demo: FPS measurement

GPUView

Part of the Windows Performance Toolkit

ETW Logging of CPU and GPU work

Measures graphics performance: FPS, startup time, glitching, render time, latency

Enables detailed analysis of CPU and GPU workloads and interdependencies

GPUView – Record and Analyze

Install
  x86: Windows Performance Toolkit
  ARM: Windows Kits\8.0\Windows Performance Toolkit\Redistributables\WPTarm-arm_en-us.msi

Record
  Run log.cmd to start
  Perform action
  Run log.cmd to stop

Analyze
  Data captured in merged.etl, load in GPUView

GPUView - Interface

CPU Threads

Flip Queue

CPU Queues

GPU Hardware Queue

GPUView Interface: GPU Hardware Queue

The GPU Hardware Queue shows command buffers rendering on the GPU. Command buffers move from the CPU Queue to the GPU Hardware Queue when the hardware is ready to receive more commands.

Demo: GPUView

Power efficient GPU characteristics

What to expect with power efficient GPUs

Feature level 9_1 or 9_3

Limited available bandwidth

Both immediate render and tiled render GPUs

Limited shader instruction throughput

Feature Level 9.x (FL9.1, FL9.3)

Feature Level               9.1                        9.3
Texture size                2048x2048                  4096x4096
Pixel shader instructions   64 arithmetic, 32 sample   512 total

Real-time render limitations generally occur before reaching these maximums

GPU Memory Bandwidth

Baseline requirement: 1.9 GB/sec benchmarked

7.5 I/O operations per screen pixel, 1366x768x32bpp@60Hz

I/O Cost   Operation
1          Screen Fill w/Solid Color
2          Screen Fill w/Texture
3          Screen Fill w/Texture & Alpha Blend
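The 1.9 GB/sec baseline and the 7.5 I/O operations per pixel are two views of the same number, which the following sketch works through. The function and constant names are illustrative only; the inputs are the figures from the slide (1366x768, 32 bits per pixel, 60 Hz).

```cpp
#include <cstdint>

// Bytes moved per second when every screen pixel is read or written
// `ioPerPixel` times per frame.
constexpr double BandwidthBytesPerSecond(std::uint32_t width,
                                         std::uint32_t height,
                                         std::uint32_t bytesPerPixel,
                                         std::uint32_t refreshHz,
                                         double ioPerPixel) {
    return static_cast<double>(width) * height * bytesPerPixel * refreshHz *
           ioPerPixel;
}

// One I/O operation per pixel at 1366x768, 32 bpp, 60 Hz:
// 251,781,120 bytes/sec, roughly 252 MB/sec.
constexpr double kOneOperation =
    BandwidthBytesPerSecond(1366, 768, 4, 60, 1.0);

// 7.5 operations per pixel: 1,888,358,400 bytes/sec, roughly 1.9 GB/sec.
constexpr double kBaseline =
    BandwidthBytesPerSecond(1366, 768, 4, 60, 7.5);
```

Read against the cost table, a full-screen textured alpha blend costs 3 operations per pixel, so about two and a half such layers per frame use up the whole baseline budget.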

Immediate render (diagram: GPU shader cores, memory bus, graphics memory)

Tiled render (diagram sequence: GPU shader cores, memory bus, graphics memory)

Shader instruction throughput

Fill rates on GPUs depend on a number of factors:
  Memory bandwidth
  Blend mode
  Shader cores
  Shader complexity
  Etc.

Power efficient GPUs become shader throughput bound at approximately 4 pixel shader instructions

Optimizing for low power GPUs

Bandwidth optimization: basics

Render opaque objects front-to-back with z-buffering

Disable alpha blending for opaque objects

Use geometry to trim large transparent areas
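One way to get the front-to-back order for opaque objects is to sort the draw list by view depth before submitting it. The sketch below is an illustration under assumptions: DrawItem and SortForRendering are hypothetical names, and drawing transparent objects afterwards in far-to-near order is common practice rather than something the slide states.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw record: `viewDepth` is the distance from the camera.
struct DrawItem {
    int id;
    float viewDepth;
    bool opaque;
};

// Reorders items so opaque ones come first, nearest to farthest, letting the
// z-buffer reject hidden pixels early. Transparent items follow, farthest to
// nearest, so blending composites in the usual order.
void SortForRendering(std::vector<DrawItem>& items) {
    auto firstTransparent = std::stable_partition(
        items.begin(), items.end(),
        [](const DrawItem& d) { return d.opaque; });

    std::sort(items.begin(), firstTransparent,
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth < b.viewDepth;
              });

    std::sort(firstTransparent, items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return a.viewDepth > b.viewDepth;
              });
}
```

Alpha blending would then be enabled only for the second group, which matches the advice to disable it for opaque objects.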

Bandwidth optimization: compress resources

Direct3D supports texture compression at all feature levels
  BC1: 4 bits/pixel for RGB formats, 6x compression ratio
  BC2, BC3: 8 bits/pixel for RGBA formats, 4x compression ratio

Smaller resources also mean faster downloads of your app
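The compression ratios follow from the block layout: BC formats store each 4x4 pixel block in a fixed number of bytes (8 for BC1, 16 for BC2 and BC3). The sketch below computes the size of a single mip level; the function names are illustrative, and the 6x figure assumes the comparison is against 24-bit RGB source data.

```cpp
#include <cstddef>

// Size in bytes of one mip level of a block-compressed texture.
// Dimensions are rounded up to whole 4x4 blocks.
constexpr std::size_t BlockCompressedBytes(std::size_t width,
                                           std::size_t height,
                                           std::size_t bytesPerBlock) {
    return ((width + 3) / 4) * ((height + 3) / 4) * bytesPerBlock;
}

// Size in bytes of the same level stored uncompressed.
constexpr std::size_t UncompressedBytes(std::size_t width,
                                        std::size_t height,
                                        std::size_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}
```

For a 1024x1024 texture this gives 512 KB for BC1 against 3 MB of 24-bit RGB, and 1 MB for BC3 against 4 MB of 32-bit RGBA.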

Bandwidth optimization: quantize resources

Use the 16-bit formats added to Direct3D 11.1:
  DXGI_FORMAT_B5G6R5_UNORM
  DXGI_FORMAT_B5G5R5A1_UNORM
  DXGI_FORMAT_B4G4R4A4_UNORM
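As an illustration of what quantizing to a 16-bit format involves, the sketch below packs 8-bit RGB into the B5G6R5 layout, halving the bytes per pixel relative to a 32-bit format. PackB5G6R5 is a hypothetical helper, not a Direct3D API, and the bit positions assume the usual DXGI convention of naming components starting from the least significant bits.

```cpp
#include <cstdint>

// Quantizes 8-bit-per-channel RGB to 16 bits in the B5G6R5 layout:
// blue in bits 0-4, green in bits 5-10, red in bits 11-15.
// Channels are truncated here; a production converter might round or dither.
constexpr std::uint16_t PackB5G6R5(std::uint8_t r, std::uint8_t g,
                                   std::uint8_t b) {
    return static_cast<std::uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) |
                                      (b >> 3));
}
```

The trade-off is visible banding on smooth gradients, so this suits content where the lost low bits are unlikely to be noticed.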

Bandwidth optimization: flip present

Must use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL

OS automatically uses “fullscreen” flips when:
  Swapchain buffer dimensions match the desktop resolution
  Swapchain format is DXGI_FORMAT_B8G8R8A8_UNORM*
  App is the only content onscreen

Buffer dimensions need to be converted correctly from device independent pixels (dips)

Just create the swapchain with zero width and height to get the right size

using namespace Windows::Graphics::Display;

float ConvertDipsToPixels(float dips)
{
    static const float dipsPerInch = 96.0f;
    return floor(dips * DisplayProperties::LogicalDpi / dipsPerInch + 0.5f);
}

Platform::Agile<Windows::UI::Core::CoreWindow> m_window;

float swapchainWidth = ConvertDipsToPixels(m_window->Bounds.Width);
float swapchainHeight = ConvertDipsToPixels(m_window->Bounds.Height);
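To see what the conversion produces, here is the same rounding with the DPI passed in explicitly so it can be evaluated outside a Windows Store app. DipsToPixels is a hypothetical stand-in for ConvertDipsToPixels, and the DPI values in the note below are example inputs, not values queried from a device.

```cpp
#include <cmath>

// Same arithmetic as ConvertDipsToPixels, with the logical DPI supplied by
// the caller instead of read from DisplayProperties::LogicalDpi.
// Adding 0.5 before floor rounds to the nearest whole pixel.
inline float DipsToPixels(float dips, float logicalDpi) {
    const float dipsPerInch = 96.0f;
    return std::floor(dips * logicalDpi / dipsPerInch + 0.5f);
}
```

At 96 DPI a 1366-dip-wide window is 1366 pixels wide; at 144 DPI a 100-dip length becomes 150 pixels. A swapchain sized from unconverted dips would not match the desktop resolution on a scaled display, which would rule out the fullscreen flip.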

Demo: Optimized flip presents

Bandwidth optimization: tiled render GPUs

Minimize command buffer flushes
  Don’t map resources in use by the GPU, use DISCARD and NO_OVERWRITE

Minimize scene flushes
  Visit RenderTargets only once per frame
  Don’t update resources in use by the GPU from the CPU, use DISCARD and NO_OVERWRITE with ID3D11DeviceContext1::CopySubresourceRegion1

Use scissors when updating small portions of a RenderTarget
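A common way to apply DISCARD and NO_OVERWRITE to a dynamic buffer is to append with NO_OVERWRITE until the buffer is full, then start over with DISCARD. The sketch below models only the bookkeeping, with no Direct3D calls; DynamicBufferCursor, Allocation, and WriteMode are hypothetical names, and in a real app the two modes would correspond to D3D11_MAP_WRITE_NO_OVERWRITE and D3D11_MAP_WRITE_DISCARD (or the matching copy flags).

```cpp
#include <cstddef>

// Which mode to request for a write into a dynamic buffer.
enum class WriteMode { NoOverwrite, Discard };

struct Allocation {
    std::size_t offset;  // byte offset to write at
    WriteMode mode;      // mode to request for this write
};

// Hypothetical sub-allocator for a dynamic buffer of `capacity` bytes.
// Appends use NoOverwrite because they touch only bytes that no earlier draw
// reads. The first write, and any write that would run past the end, uses
// Discard and restarts at offset zero.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity)
        : m_capacity(capacity) {}

    // `size` must not exceed the buffer capacity.
    Allocation Allocate(std::size_t size) {
        if (m_first || m_cursor + size > m_capacity) {
            m_first = false;
            m_cursor = size;
            return {0, WriteMode::Discard};
        }
        const std::size_t offset = m_cursor;
        m_cursor += size;
        return {offset, WriteMode::NoOverwrite};
    }

private:
    std::size_t m_capacity;
    std::size_t m_cursor = 0;
    bool m_first = true;
};
```

Neither mode asks the driver to wait for the GPU to finish with the old contents, which is what avoids the flush.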

Bandwidth optimization: tiled render GPUs

New Direct3D APIs provide hints to avoid unnecessary copies

Rendering artifacts if used incorrectly

Bandwidth optimization: Discard* APIs

m_swapChain->Present(1, 0); // present the image on the display

ComPtr<ID3D11View> view; m_renderTargetView.As(&view); // get the view on the RT

m_d3dContext->DiscardView(view.Get()); // discard the view

Use ID3D11DeviceContext1::DiscardView and ID3D11DeviceContext1::DiscardResource to prevent unnecessary tile copies

Artifacts if used incorrectly


Shader instruction throughput

Power efficient GPUs have limited throughput for full precision

Minimum precision hints increase throughput when precision doesn’t matter

Specifies minimum rather than actual precision
  min16float, min16int, min10int

Don’t change precision often

20-25% improvement in practice with min16float

Minimum precision

static const float brightThreshold = 0.5f;

Texture2D sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    float3 brightColor = 0;

    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= 4.0f;

    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);

    return float4(brightColor, 1.0f);
}

Minimum precision

static const min16float brightThreshold = (min16float)0.5;

Texture2D<min16float4> sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;

    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;
    brightColor /= (min16float)4.0;

    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);

    return float4(brightColor, 1.0f);
}

Minimum precision – bad usage

static const min16float brightThreshold = (min16float)0.5;

Texture2D<min16float4> sourceTexture : register(t0);

float4 DownScale3x3BrightPass(QuadVertexShaderOutput input) : SV_TARGET
{
    min16float3 brightColor = 0;

    // Gather 16 adjacent pixels (each bilinear sample reads a 2x2 region)
    brightColor  = sourceTexture.Sample(linearSampler, input.tex, int2(-1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1,-1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2(-1, 1)).rgb;
    brightColor += sourceTexture.Sample(linearSampler, input.tex, int2( 1, 1)).rgb;

    // Bad: this constant uses a different precision type than the rest of the shader
    brightColor /= (min10int)4.0;

    // Brightness thresholding
    brightColor = max(0, brightColor - brightThreshold);

    return float4(brightColor, 1.0f);
}

Wrap-up

Optimize!

Use the right tools and techniques to measure performance

Tune for power efficient GPUs’ unique performance characteristics

Direct3D 11.1 and Windows 8 provide the APIs to fully leverage power efficient GPUs

Resources

Build 2012 Talk: 3-113 Graphics with the Direct3D11.1 API made easy

Build 2012 Talk: 3-109 Developing a Windows Store app using C++ and DirectX

Visual Studio 2012 Remote Debugging: http://blogs.msdn.com/b/dsvc/archive/2012/10/26/windows-rt-windows-store-app-debugging.aspx

FPS Counter in GeometryRealization sample: http://code.msdn.microsoft.com/windowsapps/Geometry-Realization-963be8b7#content

GPUView: http://msdn.microsoft.com/en-us/library/windows/desktop/jj585574(v=vs.85).aspx

Direct3D11.1: http://msdn.microsoft.com/en-us/library/windows/desktop/hh404562(v=vs.85).aspx

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.