GDC 2014: Boosting your ARM mobile 3D rendering performance with Umbra

Antwan Hätälä
Umbra 3 Lead programmer

Boosting your ARM mobile 3D rendering performance with Umbra 3

INDEX
• Who are we?
• Games
• What is Umbra 3 and occlusion culling
• Bringing our system to the PlayStation 4
• Experiences and benefits
• Lessons learned

UMBRA SOFTWARE
Occlusion culling middleware for 3D games

Founded in 2007

14 employees

Based in Helsinki, Finland

Support office in Seattle, WA

Same problem – Different solutions

Mo Money – Mo Problems

“Level artists are there to fill the world with content. Integrating Umbra saved us not only artist time but the time to create and maintain an efficient visibility culling solution. Umbra’s support provides us with the solutions and features that we need.”

“Umbra’s technology is playing an important role in the creation of our next universe, by freeing our artists from the burden of manual markups typically associated with polygon soup.”

Occlusion culling basics

Occlusion Culling: Why bother?

• Process and render only what's visible
• Improved frame rate and rendering performance
• Allows you to put more detail into levels and create larger levels

What is Umbra?

Umbra 3 occluder rasterizer

• Determines visible objects fast to save further work on both CPU and GPU
• Rasterizes automatically generated proprietary occluder models on the CPU
  • Operates in low resolution, generates conservative (dilated) results
• Rasterization is embarrassingly parallel in nature
  • Parallelize across CPU cores
  • Process multiple pixels/elements in SIMD
  • Optimized for SSE, AltiVec, Cell and ARM NEON
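The idea of testing objects against a low-resolution, conservative buffer can be sketched in portable C. This is not Umbra's implementation (its occluder models and rasterizer are proprietary); the buffer size and the names `occlusion_buffer_t` and `rect_is_occluded` are hypothetical, chosen only for illustration. The sketch assumes each coarse pixel stores the depth of the occluder surface covering it, and it reports an object as hidden only when the object's nearest depth lies behind every pixel it touches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OCC_W 8   /* hypothetical low-resolution buffer width  */
#define OCC_H 8   /* hypothetical low-resolution buffer height */

/* Per coarse pixel: occluder depth in [0, 1]; 1.0f means nothing was drawn. */
typedef struct {
    float depth[OCC_H][OCC_W];
} occlusion_buffer_t;

/* Returns true only if the rectangle [x0, x1] x [y0, y1] (inclusive, in
   coarse pixels) is behind the stored occluder depth at every pixel it
   touches. Any doubt resolves to "visible", which keeps the test conservative. */
static bool rect_is_occluded(const occlusion_buffer_t *buf,
                             int x0, int y0, int x1, int y1,
                             float nearest_depth)
{
    assert(buf != NULL);
    assert(0 <= x0 && x0 <= x1 && x1 < OCC_W);
    assert(0 <= y0 && y0 <= y1 && y1 < OCC_H);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (nearest_depth <= buf->depth[y][x])
                return false;   /* may be in front here: treat as visible */
    return true;
}
```

A caller would rasterize occluders into the buffer first and then query each object's screen-space bounds; only objects for which the query returns false need further CPU and GPU work.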

NEON overview

• Processing of multiple data elements (2 to 16) in a single instruction
• Separate execution pipeline: can execute in parallel with ARM
• Separate register file: 16 128-bit regs (or 32 64-bit), SP floats or 8- to 64-bit integers
• Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9
  • For mobile 3D title purposes, it will be there
• Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies
  • For multi-platform, target A9 and enjoy free benefits on more advanced platforms
• Used in one of three ways
  • Inline assembly
  • Compiler intrinsics
  • Compiler auto-vectorization
• Similar to SSE, AltiVec but for best performance you need to know your platform
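Of the three ways to use NEON listed above, compiler intrinsics are the easiest to show in a small sketch. The function name `add_arrays` and the `HAVE_NEON` macro are hypothetical; the intrinsics `vld1q_f32`, `vaddq_f32` and `vst1q_f32` come from `arm_neon.h`. A scalar fallback is included so the same source also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Adds two float arrays element-wise; n must be a multiple of 4. */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
#if HAVE_NEON
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);       /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));   /* 4 additions per instruction */
    }
#else
    for (size_t i = 0; i < n; ++i)               /* scalar fallback */
        dst[i] = a[i] + b[i];
#endif
}
```

Whichever route is taken, the generated code is still worth inspecting, since cycle counts differ between cores.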

NEON common best practices

• Collaborate with the compiler, but keep an eye on the output
  • Align your data when possible
  • Inline functions that operate on SIMD values
  • Use __restrict to let the compiler reorder
  • Watch for register spilling
• Schedule enough NEON work, even when it might be redundant
  • Loading data from ARM registers is relatively cheap, storing back is expensive
  • Hide load/store latencies by interleaving with computation (unroll your loops)
• Never interleave VFP instructions with NEON
  • Means pipeline flush, tens of cycles of penalty
  • Watch for "s" register use in compiler output
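Two of these practices, restrict-qualified pointers and loop unrolling, can be illustrated in plain C. This is a minimal sketch with a hypothetical function `scale_add`; it uses the standard C99 `restrict` keyword, which plays the same role as the `__restrict` extension mentioned above. Whether a given compiler actually vectorizes or reorders the loop is not guaranteed, so the disassembly still needs checking.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = a[i] * scale + b[i]. The restrict qualifiers promise that the
   three arrays do not overlap, so loads and stores may be reordered.
   n must be a multiple of 4 (the unroll factor). */
static void scale_add(float *restrict dst, const float *restrict a,
                      const float *restrict b, float scale, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Issue all loads first so their latency overlaps with the math. */
        float a0 = a[i], a1 = a[i + 1], a2 = a[i + 2], a3 = a[i + 3];
        float b0 = b[i], b1 = b[i + 1], b2 = b[i + 2], b3 = b[i + 3];
        dst[i]     = a0 * scale + b0;
        dst[i + 1] = a1 * scale + b1;
        dst[i + 2] = a2 * scale + b2;
        dst[i + 3] = a3 * scale + b3;
    }
}
```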

NEON optimization tricks

• No penalty from interleaving 2-wide ops with 4-wide ops
  • Cortex-A8/A9 does 64-bit float operations per cycle
  • vget_high_xxx, vget_low_xxx to address quadword halves
• Narrow to 64 bits early
  • 16x4 and 8x8 are also 64 bits, for many operations 32 bits per channel not needed
  • Even if CPU can churn out 128 bits per cycle, savings to be had in result latency etc.
  • Use VMOVN or coupled operation and narrow
• Careful with your constants
  • VMOV and VMVN can encode lots of useful constants
  • Compilers do a good job of constant encoding, but can’t choose the constants for you
• Killer instructions
  • Shift-and-insert: VSRI, VSLI
  • Byte permute by table lookup: VTBL, VTBX
  • Gather load and scatter store: VLD2-4, VST2-4
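To make the byte-permute instructions concrete, the following portable C sketch models the per-byte behavior of VTBL and VTBX with a single 8-byte table. The function names `model_vtbl1_u8` and `model_vtbx1_u8` are hypothetical; they are scalar models for reasoning about results, not replacements for the instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Model of VTBL with one 8-byte table: each output byte is table[idx[i]],
   and an out-of-range index produces 0. */
static void model_vtbl1_u8(uint8_t out[8], const uint8_t table[8],
                           const uint8_t idx[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = (idx[i] < 8) ? table[idx[i]] : 0;
}

/* Model of VTBX: same lookup, but an out-of-range index leaves the
   destination byte unchanged. */
static void model_vtbx1_u8(uint8_t inout[8], const uint8_t table[8],
                           const uint8_t idx[8])
{
    for (int i = 0; i < 8; ++i)
        if (idx[i] < 8)
            inout[i] = table[idx[i]];
}
```

The difference in out-of-range handling is what allows several VTBX lookups to be chained over tables larger than one instruction can address.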

NEON optimization example

Example routine: gather sign bits of a large array of float values

function gather_signbits(flt_array):
    let output_bitmap = bitmap of size len(flt_array)
    foreach elem in flt_array at index idx:
        if (elem < 0)
            set_bit(output_bitmap, idx)
        else
            clear_bit(output_bitmap, idx)
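As a scalar baseline, the pseudocode can be written in C roughly as follows. The bitmap layout (bit `idx` stored in byte `idx / 8` at position `idx % 8`) is an assumption made for this sketch; the slides do not specify one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sets bit idx of bitmap when flt_array[idx] < 0 and clears it otherwise.
   bitmap must hold at least (count + 7) / 8 bytes. */
static void gather_signbits(const float *flt_array, size_t count,
                            uint8_t *bitmap)
{
    assert(flt_array != NULL && bitmap != NULL);
    memset(bitmap, 0, (count + 7) / 8);   /* clear_bit for every index */
    for (size_t idx = 0; idx < count; ++idx)
        if (flt_array[idx] < 0.0f)        /* set_bit for negative values */
            bitmap[idx / 8] |= (uint8_t)(1u << (idx % 8));
}
```

A reference like this is useful mainly for checking the output of vectorized versions.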

NEON optimization example: first attempt

• Sufficient unrolling: handle 16 elements in one iteration
• Compare 4 values per instruction
• Bitwise and for correct bit offsets
• Collapse with vertical or (pairwise add)

20: add.w r2, r0, #32
24: vld1.64 {d28-d29}, [r0 :128]
28: vld1.64 {d24-d25}, [r2 :128]
2c: add.w r2, r0, #16
30: vclt.f32 q14, q14, #0
34: vld1.64 {d26-d27}, [r2 :128]
38: add.w r2, r0, #48 ; 0x30
3c: vclt.f32 q12, q12, #0
40: vand q14, q8, q14
44: vld1.64 {d30-d31}, [r2 :128]
48: vclt.f32 q13, q13, #0
4c: vand q13, q11, q13
50: vclt.f32 q15, q15, #0
54: vand q12, q10, q12
58: vand q15, q9, q15
5c: vorr q13, q14, q13
60: vorr q12, q12, q15
64: vorr q12, q13, q12
68: vpadd.i32 d24, d24, d25
6c: vpadd.i32 d24, d24, d24
70: vst1.32 {d24[0]}, [r0 :32], r1
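The data flow of this first attempt can be modeled lane by lane in portable C. The contents of the constant registers q8 to q11 are not shown in the listing, so the sketch assumes each lane's constant is a single bit whose position equals the element index; the function name `gather16_compare` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the compare / AND / OR / pairwise-add sequence for 16 floats. */
static uint32_t gather16_compare(const float *p)
{
    assert(p != NULL);
    uint32_t lanes[4] = {0, 0, 0, 0};   /* one accumulator per SIMD lane */
    for (unsigned vec = 0; vec < 4; ++vec) {
        for (unsigned lane = 0; lane < 4; ++lane) {
            /* vclt.f32 #0: all ones when the value is negative, else zero */
            uint32_t mask = (p[4 * vec + lane] < 0.0f) ? UINT32_MAX : 0u;
            /* vand with a constant (assumed: bit position = element index) */
            uint32_t bit = UINT32_C(1) << (4 * vec + lane);
            /* vorr merges the four vectors lane by lane */
            lanes[lane] |= mask & bit;
        }
    }
    /* Two vpadd.i32 steps: the bits are disjoint, so the sum equals the OR. */
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Each of the four input vectors needs its own 128-bit constant in this scheme, which is the cost the next slide reduces.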

NEON optimization example: shift-and-insert, narrow early

• Compare with zero = shift sign bit
• Can shift and combine simultaneously with VSRI instruction
• Narrow to 16 bits (VMOVN) before proceeding further
• Half the amount of constants

18: vld1.64 {d18-d19}, [r0 :128]
1c: add.w r3, r0, #16
20: adds r1, #4
22: vshr.u32 q9, q9, #19
26: vld1.64 {d20-d21}, [r3 :128]
2a: add.w r3, r0, #32
2e: vsri.32 q9, q10, #23
32: vld1.64 {d20-d21}, [r3 :128]
36: add.w r3, r0, #48 ; 0x30
3a: vsri.32 q9, q10, #27
3e: vld1.64 {d20-d21}, [r3 :128]
42: vsri.32 q9, q10, #31
46: vmovn.i32 d18, q9
4a: vand d18, d18, d16
4e: vshl.u16 d18, d18, d17
52: vpaddl.u16 d18, d18
56: vpadd.i32 d18, d18, d18
5a: vst1.32 {d18[0]}, [r0 :32], r2
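The shift-and-insert version can also be followed in a portable C model. The values of the constant registers d16 and d17 are not shown in the listing; the sketch assumes d16 holds the mask 0x1111 in every 16-bit lane and d17 holds the per-lane left shifts 0, 1, 2, 3. The helper names `vsri32`, `float_bits` and `gather16_vsri` are hypothetical. Under those assumptions, element `4*j + i` (vector `j`, lane `i`) ends up in output bit `4*(3 - j) + i`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of VSRI.32 on one lane: shift src right by n and insert it into
   dst, keeping the top n bits of dst. Used here with 1 <= n <= 31. */
static uint32_t vsri32(uint32_t dst, uint32_t src, unsigned n)
{
    uint32_t keep = ~(UINT32_MAX >> n);
    return (dst & keep) | (src >> n);
}

/* Raw IEEE-754 bit pattern of a float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Gathers the sign bits of 16 floats into 16 bits, following the listing. */
static uint16_t gather16_vsri(const float *p)
{
    assert(p != NULL);
    uint32_t sum = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        uint32_t acc = float_bits(p[lane]) >> 19;         /* vshr #19: sign -> bit 12 */
        acc = vsri32(acc, float_bits(p[4 + lane]), 23);   /* sign -> bit 8 */
        acc = vsri32(acc, float_bits(p[8 + lane]), 27);   /* sign -> bit 4 */
        acc = vsri32(acc, float_bits(p[12 + lane]), 31);  /* sign -> bit 0 */
        uint16_t narrow = (uint16_t)acc;                  /* vmovn.i32 */
        narrow &= 0x1111;                                 /* vand, assumed d16 */
        narrow = (uint16_t)(narrow << lane);              /* vshl.u16, assumed d17 */
        sum += narrow;                                    /* vpaddl + vpadd */
    }
    return (uint16_t)sum;
}
```

Because this version reads the raw sign bit instead of comparing with zero, it reports -0.0 as negative, unlike the `elem < 0` test in the pseudocode.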

Thank you.

For more on Umbra 3, go to:

umbra3.com

Follow us on Twitter @umbrasoftware
