PPU Optimisation Lesson

Engineer Learning Session #1: Optimisation Tips for PowerPC. Ben Hanke, April 16th 2009

Upload: slantsixgames

Post on 28-Nov-2014



TRANSCRIPT

Page 1: PPU Optimisation Lesson

Engineer Learning Session #1: Optimisation Tips for PowerPC

Ben Hanke, April 16th 2009

Page 2: PPU Optimisation Lesson

PlayStation 3 PPU

Page 3: PPU Optimisation Lesson

Optimized PlayStation 3 PPU

Page 4: PPU Optimisation Lesson

Consoles with PowerPC

Page 5: PPU Optimisation Lesson

Established Common Sense

“90% of execution time is spent in 10% of the code”

“Programmers are really bad at knowing what really needs optimizing, so they should be guided by profiling”

“Why should I bother optimizing X? It only runs N times per frame”

“Processors are really fast these days so it doesn’t matter”

“The compiler can optimize that better than I can”

Page 6: PPU Optimisation Lesson

Alternative View

The compiler isn’t always as smart as you think it is.

Really bad things can happen in the most innocent looking code because of huge penalties inherent in the architecture.

A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.

It’s easier to write more efficient code up front than to go and find it all later with a profiler.

Page 7: PPU Optimisation Lesson

PPU Hardware Threads

We have two of them running on alternate cycles

If one stalls, other thread runs

Not multi-core:

Shared execution units

Shared access to memory and cache

Most registers duplicated

Ideal usage:

Threads filling in stalls for each other without thrashing cache

Page 8: PPU Optimisation Lesson

PS3 Cache

Level 1 Instruction Cache

32 KB, 2-way set associative, 128-byte cache line

Level 1 Data Cache

32 KB, 4-way set associative, 128-byte cache line

Write-through to L2

Level 2 Data and Instruction Cache

512 KB, 8-way set associative

Write-back

128 byte cache line

Page 9: PPU Optimisation Lesson

Cache Miss Penalties

L1 cache miss = 40 cycles

L2 cache miss = ~1000 cycles!

In other words, random reads from memory are excruciatingly expensive!

Reading data with large strides is very bad

Consider smaller data structures, or group data that will be read together
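The "group data that will be read together" advice above can be sketched as a hot/cold split. This is an illustrative example, not from the slides: by moving rarely-touched fields into a separate struct, the hot loop's stride shrinks and more elements fit per 128-byte cache line.

```cpp
#include <cstddef>

// Hypothetical entity split into hot and cold halves. The hot struct is
// 16 bytes, so eight of them share one 128-byte cache line; the cold data
// no longer pollutes the cache during the tight loop.
struct EntityHot  { float posX, posY, posZ; float radius; };   // 16 bytes
struct EntityCold { char  debugName[64];    int   editorId; }; // touched rarely

float SumRadii(const EntityHot* hot, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += hot[i].radius;   // sequential 16-byte stride, cache friendly
    return sum;
}
```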

Page 10: PPU Optimisation Lesson

Virtual Functions

What happens when you call a virtual function?

What does this code do?

virtual void Update() {}

May touch a cache line at the vtable address unnecessarily

Consider batching by type for a better iCache pattern

If you know the type, maybe you don’t need the virtual call, saving a touch of memory to read the function address

Even better: maybe the data you actually want to manipulate can be kept close together in memory?
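The batching-by-type idea can be sketched like this (types and names are illustrative, not from the slides): instead of one heterogeneous list of base pointers, where each element costs a vtable load and an unpredictable indirect branch, keep one homogeneous array per concrete type and update each with a direct, inlinable call.

```cpp
#include <vector>

// Two hypothetical enemy types with non-virtual, inlinable Update methods.
struct Tank   { int hp;   void Update() { ++hp; } };
struct Sniper { int ammo; void Update() { --ammo; } };

struct World
{
    std::vector<Tank>   tanks;    // contiguous, one type: predictable code path
    std::vector<Sniper> snipers;

    void UpdateAll()
    {
        // Direct calls: no vtable read, good iCache locality per batch.
        for (Tank& t : tanks)     t.Update();
        for (Sniper& s : snipers) s.Update();
    }
};
```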

Page 11: PPU Optimisation Lesson

Data Hazards Ahead

Page 12: PPU Optimisation Lesson

Spot the Stall

int SlowFunction(int & a, int & b)

{

a = 1;

b = 2;

return a + b;

}

Page 13: PPU Optimisation Lesson

Method 1

Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)

{

a = 1;

b = 2;

return a + b;

}

Page 14: PPU Optimisation Lesson

Method 2

int FastFunction( int * __restrict a, int * __restrict b )

{

*a = 1;

*b = 2;

return *a + *b; // we promise that a != b

}

Page 15: PPU Optimisation Lesson

__restrict Keyword

__restrict only works with pointers, not references (which sucks). Aliasing rules only apply between identical types. It can also be applied to the implicit this pointer in member functions.

Put it after the closing parenthesis of the parameter list. This stops the compiler worrying that you passed a class data member to the function.
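The member-function form might look like this (a minimal sketch using GCC's `__restrict` extension; the struct and names are illustrative). The qualifier after the parameter list promises that `this` does not alias the arguments, so members need not be reloaded after stores through the pointer parameters.

```cpp
// Qualifying the implicit this pointer with __restrict (GCC/MSVC extension,
// not standard C++). Without it, the compiler must assume pDamage could
// point at m_health and reload m_health after every store through pDamage.
struct Enemy
{
    int m_health;

    int ApplyDamage(const int* __restrict pDamage) __restrict
    {
        m_health -= *pDamage;
        return m_health;   // can stay in a register: no aliasing assumed
    }
};
```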

Page 16: PPU Optimisation Lesson

Load-Hit-Store

What is it?

Write followed by read

PPU store queue

Average latency 40 cycles

Snoop bits 52 through 61 (implications?)

Page 17: PPU Optimisation Lesson

True LHS

Type casting between register files:

float floatValue = (float)intValue;

float posY = vPosition.X();

Data member as loop counter

while( m_Counter-- ) {}

Aliasing:

void swap( int & a, int & b ) { int t = a; a = b; b = t; }

Page 18: PPU Optimisation Lesson

Workaround: Data member as loop counter

int counter = m_Counter; // load into register

while( counter-- ) // register will decrement

{

doSomething();

}

m_Counter = 0; // Store to memory just once

Page 19: PPU Optimisation Lesson

Workarounds

Keep data in the same domain as much as possible

Reorder your writes and reads to allow space for latency

Consider using word flags instead of many packed bools

Load flags word into register

Perform logical bitwise operations on flags in register

Store new flags value

e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) | newCanSeeFlag | kAngry;
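A runnable sketch of the flags-word workaround above (the flag constants and function name are illustrative): load the word once, do all the bitwise work on the register copy, and store once, instead of reading and writing several packed bools.

```cpp
#include <cstdint>

// Hypothetical flag bits packed into one word.
enum : std::uint32_t
{
    kCanSeePlayer = 1u << 0,
    kAngry        = 1u << 1,
    kAlive        = 1u << 2,
};

// All bit twiddling happens on the register copy; the caller loads m_Flags
// once before calling and stores the result once afterwards.
std::uint32_t UpdateFlags(std::uint32_t flags, bool bCanSeePlayer)
{
    const std::uint32_t seeBit = bCanSeePlayer ? kCanSeePlayer : 0u;
    return (flags & ~kCanSeePlayer) | seeBit | kAngry;
}
```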

Page 20: PPU Optimisation Lesson

False LHS Case

Store queue snooping only compares bits 52 through 61

So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC.

Writing to and reading from memory 4KB apart

Writing to and reading from memory on sub-word boundary, e.g. packed short, byte, bool

Write a bool then read a nearby one: ~40-cycle stall

Page 21: PPU Optimisation Lesson

Example

struct BaddyState

{

bool m_bAlive;

bool m_bCanSeePlayer;

bool m_bAngry;

bool m_bWearingHat;

}; // 4 bytes

Page 22: PPU Optimisation Lesson

Where might we stall?

if ( m_bAlive )

{

m_bCanSeePlayer = LineOfSightCheck( this, player );

if ( m_bCanSeePlayer && !m_bAngry )

{

m_bAngry = true;

StartShootingPlayer();

}

}

Page 23: PPU Optimisation Lesson

Workaround

if ( m_bAlive ) // load, compare

{

const bool bAngry = m_bAngry; // load

const bool bCanSeePlayer = LineOfSightCheck( this, player );

m_bCanSeePlayer = bCanSeePlayer; // store

if (bCanSeePlayer && !bAngry ) // compare registers

{

m_bAngry = true; // store

StartShootingPlayer();

}

}

Page 24: PPU Optimisation Lesson

Loop + Singleton Gotcha

What happens here?

for( int i = 0; i < enemyCount; ++i )

{

EnemyManager::Get().DispatchEnemy();

}

Page 25: PPU Optimisation Lesson

Workaround

EnemyManager & enemyManager = EnemyManager::Get();

for( int i = 0; i < enemyCount; ++i )

{

enemyManager.DispatchEnemy();

}

Page 26: PPU Optimisation Lesson

Branch Hints

Branch mis-prediction can hurt your performance

24-cycle penalty to flush the instruction queue

If you know a certain branch is rarely taken, you can use a static branch hint, e.g.

if ( __builtin_expect( bResult, 0 ) )

Far better to eliminate the branch!

Use logical bitwise operations to mask results

This is far easier and more applicable in SIMD code
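The masking idea can be sketched in scalar integer code (my example, not from the slides; the SIMD version uses the same pattern with vector compare and select). A condition is turned into an all-ones or all-zeros mask and the two candidate results are blended, so there is no branch to mispredict.

```cpp
#include <cstdint>

// Branchless min: build a mask that is 0xFFFFFFFF when a < b, else 0,
// then blend a and b with bitwise ops. Nothing here can mispredict.
std::int32_t SelectMin(std::int32_t a, std::int32_t b)
{
    const std::int32_t mask = -static_cast<std::int32_t>(a < b); // -1 or 0
    return (a & mask) | (b & ~mask);
}
```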

Page 27: PPU Optimisation Lesson

Floating Point Branch Elimination

__fsel, __fsels – Floating point select

float min( float a, float b )

{

return ( a < b ) ? a : b;

}

float min( float a, float b )

{

return __fsels( a - b, b, a );

}
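__fsel/__fsels are PPC compiler intrinsics; on other targets their semantics (select the second argument when the first is >= 0, else the third) can be emulated for testing. A minimal sketch, with the emulation function name my own:

```cpp
// Stand-in for the __fsels intrinsic: returns b when x >= 0, else a.
// The real instruction does this branchlessly in the float pipeline.
inline float fsel_emulated(float x, float b, float a)
{
    return (x >= 0.0f) ? b : a;
}

// min() rewritten in fsel form, mirroring the slide: a - b >= 0 means
// b is the smaller (or equal) value, so select b.
float MinBranchless(float a, float b)
{
    return fsel_emulated(a - b, b, a);
}
```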

Page 28: PPU Optimisation Lesson

Microcoded Instructions

A single instruction expands to several ops fetched from microcode ROM, creating a pipeline bubble.

Common example of one to avoid: shifting by a variable amount.

int a = b << c;

Minimum 11-cycle latency.

If you know the range of values, it can be better to switch to a fixed shift!

switch( c )

{

case 1: a = b << 1; break;

case 2: a = b << 2; break;

case 3: a = b << 3; break;

default: break;

}

Page 29: PPU Optimisation Lesson

Loop Unrolling

Why unroll loops?

Fewer branches

Better concurrency and instruction pipelining; hides latency

More opportunities for compiler to optimise

Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies

How many times is enough?

On average, about 4-6 times works well

Best for loops where num iterations is known up front

Need to think about the spare: iterationCount % unrollCount

Page 30: PPU Optimisation Lesson

Picking up the Spare

If you can, artificially pad your data with safe values to keep the element count a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.

for ( int i = 0; i < numElements; i += 4 )

{

InlineTransformation( pElements[ i+0 ] );

InlineTransformation( pElements[ i+1 ] );

InlineTransformation( pElements[ i+2 ] );

InlineTransformation( pElements[ i+3 ] );

}

Page 31: PPU Optimisation Lesson

Picking up the Spare

If you can’t pad, run the unrolled loop for numElements & ~3 iterations instead, then run to completion in a second loop (i carries over from the first loop).

for ( ; i < numElements; ++i )

{

InlineTransformation( pElements[ i ] );

}
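Putting the unrolled body and the remainder loop together gives a self-contained sketch (the InlineTransformation body here is a placeholder of my own; the slides leave it abstract):

```cpp
// Placeholder transformation: doubles the element in place.
inline void InlineTransformation(int& x) { x *= 2; }

void TransformAll(int* pElements, int numElements)
{
    int i = 0;
    // Main body, unrolled by 4: runs for the largest multiple of 4.
    for (; i < (numElements & ~3); i += 4)
    {
        InlineTransformation(pElements[i + 0]);
        InlineTransformation(pElements[i + 1]);
        InlineTransformation(pElements[i + 2]);
        InlineTransformation(pElements[i + 3]);
    }
    // Pick up the spare 0-3 elements; i carries over from the first loop.
    for (; i < numElements; ++i)
        InlineTransformation(pElements[i]);
}
```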

Page 32: PPU Optimisation Lesson

Picking up the Spare

Alternative method (trade-off: only one branch, but longer code generated).

switch( numElements & 3 )

{

case 3: InlineTransformation( pElements[ i+2 ] );

case 2: InlineTransformation( pElements[ i+1 ] );

case 1: InlineTransformation( pElements[ i+0 ] );

case 0: break;

}

Page 33: PPU Optimisation Lesson

Loop unrolled… now what?

If you unrolled your loop 4 times, you might be able to use SIMD

Use AltiVec intrinsics; align your data to 16 bytes

128-bit registers operate on four 32-bit values in parallel

Most SIMD instructions have 1 cycle throughput

Consider using SOA data instead of AOS

AOS: arrays of structures with interleaved posX, posY, posZ fields

SOA: a structure containing one array per field, dimensioned for all elements
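The two layouts can be sketched side by side (sizes and names are illustrative). With SOA, four consecutive X values are contiguous, so they share a cache line and map directly onto one 128-bit SIMD register; the scalar loop below is the form that vectorises trivially four floats at a time.

```cpp
#include <cstddef>

const std::size_t MAX_PARTICLES = 1024;   // illustrative capacity

// AOS: fields interleaved per element; loading four X values means
// gathering from four separate 12-byte-apart locations.
struct ParticleAoS { float posX, posY, posZ; };
ParticleAoS particlesAoS[MAX_PARTICLES];

// SOA: one contiguous array per field.
struct ParticlesSoA
{
    float posX[MAX_PARTICLES];
    float posY[MAX_PARTICLES];
    float posZ[MAX_PARTICLES];
};
ParticlesSoA particlesSoA;

// Scalar reference loop over the SOA layout; a SIMD version would process
// posX four at a time from one aligned 16-byte load.
void TranslateX(ParticlesSoA& p, float dx, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        p.posX[i] += dx;
}
```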

Page 34: PPU Optimisation Lesson

Example: FloatToHalf
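The transcript does not preserve the code from this slide. As a hedged stand-in, here is a minimal scalar float-to-half conversion (truncating rounding, flush-to-zero for small values, no NaN handling) of the general kind a FloatToHalf example would show; it is my sketch, not the author's code.

```cpp
#include <cstdint>
#include <cstring>

// Convert an IEEE 754 single (1/8/23 bits) to a half (1/5/10 bits).
// Uses memcpy for the type pun rather than a float->int cast, which on
// PPU would itself be an LHS-prone cross-domain move.
std::uint16_t FloatToHalf(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));

    const std::uint32_t sign     = (bits >> 16) & 0x8000u;
    std::int32_t        exponent = static_cast<std::int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const std::uint32_t mantissa = (bits >> 13) & 0x3FFu;   // truncate low bits

    if (exponent <= 0)  return static_cast<std::uint16_t>(sign);            // flush to zero
    if (exponent >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u);  // clamp to infinity

    return static_cast<std::uint16_t>(sign
        | (static_cast<std::uint32_t>(exponent) << 10) | mantissa);
}
```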

Page 35: PPU Optimisation Lesson

Example: FloatToHalf4

Page 36: PPU Optimisation Lesson

Questions?