Threading Successes 06 Allegorithmic
![Page 1: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/1.jpg)
Allegorithmic Substance
Threaded Middleware
![Page 2: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/2.jpg)
Procedural textures on multi-core
• Other than framerate and features, what else can you do with extra CPU power?
• We’ll look at Allegorithmic’s middleware, Substance.
![Page 3: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/3.jpg)
Procedural textures are valuable for modern games
• Have a LOT of textures.
• Want shorter loading times (faster starts, teleportations or zooms).
• Need to reduce texture memory on a disc, for download, and/or in RAM.
• Can benefit from more flexible and reusable assets.
![Page 4: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/4.jpg)
Introducing Substance
In Q2 2007, Allegorithmic started a complete re-engineering of ProFX2 (authoring tool and engine), named Substance.
Unit tests were done very early to ensure that Substance could target streaming.
Cross-platform: PC, PS3, Xbox, etc.
Expected linear multi-thread scalability.
![Page 5: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/5.jpg)
What is Substance?
Substance is a middleware product composed of two elements.
• Substance Authoring Tool
  • lets you create procedural textures
  • creates texture packages of a few kilobytes!
  • A cooker compiles generic data into binaries optimized for a specific platform or user.
• Substance Engine
  • generates bitmap textures on the fly.
![Page 6: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/6.jpg)
Less FPS?
• More textures, not less FPS
  • Substance consumes idle cycles, not frames
• Graphics bitrates follow Moore's law
  • Higher poly count → bigger worlds
  • Higher filter rate → larger textures
  • Desired texture volume grows faster than RAM
• Streaming is a necessity
  • But HDD net bitrate does not follow. Bottleneck!
• Modern gameplay entails sudden bitrate bursts
  • This is worsened by HDD seeks and entails stalls.
![Page 7: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/7.jpg)
No, a stable and high FPS.
• Even masked, a stall is actually an FPS drop.
• Substance works in Random Access Memory.
• The gamer zooms or teleports:
  • Give 4 cores and a GPU to Substance.
  • Sacrifice 1 or 2 frames.
  • Substance generates and caches 1-2M new texels.
  • The stall does not hinder gameplay.
• Substance diminishes stalls.
• Substance helps to maintain a high FPS.
![Page 8: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/8.jpg)
Performance issue: streaming in games
• DVD or HDD net bitrate is 2 or 6 MB/s
• Our aim: add a stable 4 MB/s without the GPU
• Requires billions of intermediate pixels/s.
• Can CPUs compete with GPUs?
• Opportunity: cores are still under-exploited in most game engines.
• Texture processing is privileged in the new multi-core architectures.
![Page 9: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/9.jpg)
The architecture was designed with these issues in mind:
• Homogeneous CPU and GPU versions
• Streaming (~1-10 CPU cycles per pixel)
• SIMD & MT for the multi-core generations
• No cache nor threading pollution
• Fine-grained jobs and lockless sync.
• Low memory footprint
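As a rough sanity check on the streaming target, a per-pixel cycle budget translates directly into per-core throughput. The sketch below assumes a 3 GHz clock, which is an illustrative value and not a figure from the slides.

```python
def pixels_per_second(clock_hz: float, cycles_per_pixel: float) -> float:
    """Throughput of one core if every pixel costs a fixed number of cycles."""
    return clock_hz / cycles_per_pixel

CLOCK_HZ = 3.0e9  # assumed clock speed, not taken from the slides

fast = pixels_per_second(CLOCK_HZ, 1)   # best case: 1 cycle per pixel
slow = pixels_per_second(CLOCK_HZ, 10)  # worst case: 10 cycles per pixel
print(f"{slow / 1e9:.1f} to {fast / 1e9:.1f} Gpixels/s per core")
```

Under that assumption, the 1-10 cycles budget corresponds to roughly 0.3 to 3 Gpixels/s per core.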
![Page 10: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/10.jpg)
The theoretical benefit was calculated
New architectures come with enhanced SIMD. Expected x10 compared to standard C++.
Tricks and algorithmic changes could give another x10 on some filters, like DXT.
We were confident that our image processes could be well threaded, partly because we generate textures asynchronously.
Hence the CPU version of ProFX2 could be accelerated by a factor of x25-x100.
![Page 11: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/11.jpg)
This is the approach taken to address the issue:
Simple inner-loop tests showed that optimized SSE2-4 code could give a boost of x10.
Find a data layout coherent with micro parallelism (SIMD and pipeline), low level threading, cache and memory handling.
OpenMP is then used to test strategies before designing a specific MT HAL
![Page 12: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/12.jpg)
Here’s the code that was developed to make this possible:
• A SIMD HAL is ready for PC, Xbox, PS3.
• OpenMP easily gives 85% MT linearity.
• Our MT HAL is converging towards a model of lockless synchronization, 95% expected.
• The cooker precomputes data that will help synchronization and MT efficiency.
• Our API exposes asynchronous commands. Perfect to share cores with a game loop!
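The asynchronous command idea can be illustrated with a small sketch. This is not the Substance API: `AsyncTextureQueue`, `request`, and `generate_texture` are hypothetical names, and the generation job is a stand-in. The point is only the pattern, where a request returns a handle immediately and the game loop keeps running until the result is ready.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

def generate_texture(name: str, size: int) -> bytes:
    """Stand-in for a texture generation job (hypothetical)."""
    # A real engine would evaluate a compositing graph here.
    return bytes((x ^ y) & 0xFF for y in range(size) for x in range(size))

class AsyncTextureQueue:
    """Hypothetical wrapper: each command returns a handle right away."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def request(self, name: str, size: int) -> Future:
        return self._pool.submit(generate_texture, name, size)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)

textures = AsyncTextureQueue()
pending = textures.request("bricks", 64)

frames = 0
while not pending.done():
    frames += 1
    time.sleep(0.001)  # stand-in for rendering one frame

texture = pending.result()
textures.shutdown()
print(len(texture), "bytes ready")
```

In CPython the global interpreter lock prevents CPU-bound threads from running truly in parallel, so this only demonstrates the command and handle pattern, not the scaling figures quoted on the slide.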
![Page 13: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/13.jpg)
The compositing graph, node-based image processing
• Authoring Tool: non-linear editing
• Engine: efficient high-level structure
• Graph (DAG) contains 3 types of nodes:
  • Sources: procedural noise, bitmaps, SVGs
  • Filters: blend, HSL, TRS, warp, blur, etc.
  • Outputs: coherent diffuse & normal maps, etc.
• Main advantages:
  • Libraries, capsules: instantiation of subgraphs
  • Complex variants: fast to create and compute
  • Dynamic custom branches (e.g. aging textures)
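A compositing graph of this kind can be evaluated by visiting nodes in dependency order. The sketch below is a minimal illustration using Python's standard `graphlib` module. The node names, the tiny 4x4 grayscale images, and the filter functions are all hypothetical, chosen only to show sources feeding filters feeding an output.

```python
from graphlib import TopologicalSorter

SIZE = 4  # hypothetical 4x4 grayscale images, stored as flat lists in [0, 1]

def noise_source():
    # Deterministic stand-in for procedural noise.
    return [((x * 7 + y * 13) % 16) / 15 for y in range(SIZE) for x in range(SIZE)]

def flat_source():
    return [0.5] * (SIZE * SIZE)

def blend(a, b):
    return [(p + q) / 2 for p, q in zip(a, b)]

def invert(a):
    return [1.0 - p for p in a]

# node name -> (function, names of the nodes it reads from)
graph = {
    "noise": (noise_source, []),          # source
    "flat": (flat_source, []),            # source
    "blend": (blend, ["noise", "flat"]),  # filter
    "diffuse": (invert, ["blend"]),       # output
}

def evaluate(graph):
    """Evaluate every node once, inputs before the nodes that use them."""
    order = TopologicalSorter({name: deps for name, (_, deps) in graph.items()})
    results = {}
    for node in order.static_order():
        func, deps = graph[node]
        results[node] = func(*(results[d] for d in deps))
    return results

images = evaluate(graph)
print(len(images["diffuse"]), "texels in the output map")
```

Because the graph is acyclic, a topological order always exists, which is what makes instancing subgraphs and adding custom branches cheap to schedule.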
![Page 14: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/14.jpg)
The compositing graph, node-based image processing
![Page 15: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/15.jpg)
Threading strategies
High level threading:
• Task decomposition: 1 node (filter) per thread
• Graph splitting ensures task independence
Low level threading:
• Data decomposition: 1 strip of blocks per thread
• Dispatcher ensures non-conflicting areas
• Pixel-to-pixel filters are concatenated.
• Streamed R/W, no L2 cache pollution
• Temporary blocks in private L1 double buffers
• Intermediate images never allocated
• Lockless reactive sync and cache friendly
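The low level strategy can be sketched as follows: the image is cut into strips, each worker owns one strip, and chained pixel-to-pixel filters are fused into a single pass so no intermediate image is allocated. The filters, sizes, and function names here are hypothetical, and the real engine's dispatcher and L1 double buffering are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, STRIP_ROWS = 64, 64, 16

def brighten(p: int) -> int:
    return min(p + 40, 255)

def invert(p: int) -> int:
    return 255 - p

def fuse(*filters):
    """Concatenate pixel-to-pixel filters into one pass per pixel."""
    def fused(p):
        for apply_filter in filters:
            p = apply_filter(p)
        return p
    return fused

def process_strip(src, dst, first_row, last_row, kernel):
    # Each worker writes only its own rows, so no lock is needed.
    for i in range(first_row * WIDTH, last_row * WIDTH):
        dst[i] = kernel(src[i])

src = bytearray((x + y) & 0xFF for y in range(HEIGHT) for x in range(WIDTH))
dst = bytearray(len(src))
kernel = fuse(brighten, invert)

# Non-overlapping row ranges, one per job.
strips = [(r, min(r + STRIP_ROWS, HEIGHT)) for r in range(0, HEIGHT, STRIP_ROWS)]
with ThreadPoolExecutor(max_workers=4) as pool:
    jobs = [pool.submit(process_strip, src, dst, a, b, kernel) for a, b in strips]
    for job in jobs:
        job.result()  # re-raise any worker exception
```

The design choice to note is that correctness comes from the partitioning itself: strips never overlap, so workers need no synchronization while they run.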
![Page 16: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/16.jpg)
Threading sub graphs (1/11) by nodes (high level)
![Page 17: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/17.jpg)
Threading sub graphs (2/11) by nodes, caching
![Page 18: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/18.jpg)
Threading sub graphs (3/11) by nodes
![Page 19: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/19.jpg)
Threading sub graphs (4/11) by strips (low level)
![Page 20: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/20.jpg)
Threading sub graphs (5/11) remove from cache
![Page 21: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/21.jpg)
Threading sub graphs (6/11) by strips
![Page 22: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/22.jpg)
Threading sub graphs (7/11) remove from cache
![Page 23: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/23.jpg)
Threading sub graphs (8/11) by strips
![Page 24: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/24.jpg)
Threading sub graphs (9/11) remove from cache
![Page 25: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/25.jpg)
Threading sub graphs (10/11) by strips
![Page 26: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/26.jpg)
Threading sub graphs (11/11) update cache, and finished
![Page 27: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/27.jpg)
Expect more streaming bandwidth
• Substance generates 4 MB of compressed textures per second.
• Cumulate this with classical streaming.
• 50+ MB/s loading with 4 cores and 1 GPU.
[Chart: loading bandwidth (axis 0 to 50) for streaming, 1/2 core, 1 core, 2 cores, 4 cores, and add GPU; legend: GPU, Penryn, Yorkfield, CD, DVD]
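To put the 4 MB/s figure in terms of texels, the compressed rate can be converted using the standard DXT block sizes (a 4x4 texel block takes 8 bytes in DXT1 and 16 bytes in DXT5). The slides do not say which DXT variant is meant, so both are shown, and 1 MB is taken as 10**6 bytes as an assumption.

```python
BLOCK_TEXELS = 4 * 4                       # DXT formats compress 4x4 texel blocks
BYTES_PER_BLOCK = {"DXT1": 8, "DXT5": 16}  # standard block sizes

def texels_per_second(bytes_per_second: float, fmt: str) -> float:
    """Texels delivered per second for a given compressed byte rate."""
    return bytes_per_second / BYTES_PER_BLOCK[fmt] * BLOCK_TEXELS

RATE = 4 * 1_000_000  # 4 MB/s from the slide; 1 MB = 10**6 bytes is an assumption

for fmt in BYTES_PER_BLOCK:
    print(fmt, f"{texels_per_second(RATE, fmt) / 1e6:.0f} Mtexels/s")
```

Under these assumptions the target corresponds to about 8 million texels per second in DXT1, or 4 million in DXT5.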
![Page 28: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/28.jpg)
Here’s how close we got to the theoretical best performance:
• DXT compression at 2G pixels/s (same as what high-end GPUs can do in 2007).
• 8-bit SVG (cooked) rendering at 20G/s.
• 8G/s anti-aliasing with 4 sub-samples.
• In most cases 4 cores give a x3.8 boost.
• Some filters are more problematic, but solutions have been designed in detail and will be implemented between Q2 and Q4 2008.
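The x3.8 boost on 4 cores can be read as a parallel efficiency. The sketch below also applies Amdahl's law, which is a model brought in here for illustration and not something the slides use, to estimate the serial fraction that speedup would imply. The 8-core projection is purely hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    return speedup / cores

def amdahl_serial_fraction(speedup: float, cores: int) -> float:
    """Serial fraction s solving speedup = 1 / (s + (1 - s) / cores)."""
    return (cores / speedup - 1) / (cores - 1)

def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    return 1 / (serial_fraction + (1 - serial_fraction) / cores)

eff = parallel_efficiency(3.8, 4)
s = amdahl_serial_fraction(3.8, 4)
print(f"efficiency {eff:.0%}, implied serial fraction {s:.1%}")
print(f"projected 8-core speedup {amdahl_speedup(s, 8):.1f}x")
```

The efficiency works out to 95%, which matches the scaling the MT HAL was expected to reach.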
![Page 29: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/29.jpg)
Here’s the new performance profile:
[Chart: per-filter performance (axis 0 to 2000) for DXT, TRS, Blend, and SVG; series: ProFX2, Substance]
• Substance and ProFX2 figures are for one core.
• 4 cores: 3.8 times more fillrate.
• ProFX2: SVG GPU. Substance: SVG CPU.
• SVG AA: 2G pixels/s per core
![Page 30: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/30.jpg)
This is future-proofed
• The cooker precomputes whatever helps to linearise computations.
• Scalable code: SSE4 added in one day thanks to the SIMD HAL
• Scalable threading: our two strategies scale
  • A few functions dispatch virtual CPU "shaders"
  • 64-cores ready ↔ code a new dispatcher?
• Multiplatform design.
![Page 31: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/31.jpg)
What’s next?
![Page 32: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/32.jpg)
Procedural diffuse map
![Page 33: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/33.jpg)
Coherent procedural normal map
![Page 34: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/34.jpg)
Complex procedural environment map
![Page 35: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/35.jpg)
This scene is made entirely of procedural textures
![Page 36: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/36.jpg)
Future sources of bandwidth
• SIMD code can be better pipelined in ASM.
• Our cooker can optimize a lot of things.
• Authoring tool will have a RT profiler.
• Artists gaining experience with Substance will also optimize their packages better.
• Artist feedback will also help us to improve the expressiveness of each filter.
  • ~30-50 filters per texture, main perf. divisor.
![Page 37: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/37.jpg)
Here’s how you can best take advantage of procedural textures
• Anticipate texture generation requests.
• Predict visibility (HOM, PVS).
• Create mipmaps. Access levels JIT.
• Cache the useful texels.
• Adapt texture resolution to workload.
• Use texture variants, less tiling textures or details.
• Show a higher texel/pixel ratio.
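Two of these recommendations, generating mip levels just in time and caching the useful texels, can be combined in a small sketch. `TexelCache` and `fake_generate` are hypothetical names, the one byte per texel layout is an assumption, and a least-recently-used policy is just one reasonable choice of eviction rule.

```python
from collections import OrderedDict

class TexelCache:
    """Hypothetical LRU cache of generated mip levels with a texel budget."""

    def __init__(self, budget_texels: int, generate) -> None:
        self.budget = budget_texels
        self.generate = generate      # callable(name, level) -> bytes
        self.entries = OrderedDict()  # (name, level) -> bytes
        self.used = 0

    def get(self, name: str, level: int) -> bytes:
        key = (name, level)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        data = self.generate(name, level)  # generate just in time
        self.entries[key] = data
        self.used += len(data)
        # Evict least recently used levels until the budget is respected.
        while self.used > self.budget and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        return data

def fake_generate(name: str, level: int) -> bytes:
    side = 64 >> level          # level 0 is 64x64, one byte per texel (assumed)
    return bytes(side * side)

cache = TexelCache(budget_texels=5000, generate=fake_generate)
cache.get("rock", 0)  # 4096 texels, fits the budget
cache.get("rock", 1)  # 1024 more texels exceed the budget, so level 0 is evicted
print(list(cache.entries), cache.used)
```

A game would typically request the coarse level first and ask for finer levels only as the object approaches, which keeps the cache filled with texels that are actually visible.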
![Page 38: Threading Successes 06 Allegorithmic](https://reader033.vdocuments.mx/reader033/viewer/2022060121/55935d5a1a28ab63648b467c/html5/thumbnails/38.jpg)
What do you think?
• Have you tried something like this?
• Have you rejected trying something like this?