PPU Optimisation Lesson

Engineer Learning Session #1: Optimisation Tips for PowerPC. Ben Hanke, April 16th 2009

Upload: slantsixgames

Post on 28-Nov-2014



TRANSCRIPT

Page 1: PPU Optimisation Lesson

Engineer Learning Session #1: Optimisation Tips for PowerPC

Ben Hanke, April 16th 2009

Page 2: PPU Optimisation Lesson

PlayStation 3 PPU

Page 3: PPU Optimisation Lesson

Optimized PlayStation 3 PPU

Page 4: PPU Optimisation Lesson

Consoles with PowerPC

Page 5: PPU Optimisation Lesson

Established Common Sense

“90% of execution time is spent in 10% of the code”

“Programmers are really bad at knowing what really needs optimizing, so they should be guided by profiling”

“Why should I bother optimizing X? It only runs N times per frame”

“Processors are really fast these days so it doesn’t matter”

“The compiler can optimize that better than I can”

Page 6: PPU Optimisation Lesson

Alternative View

The compiler isn’t always as smart as you think it is.

Really bad things can happen in the most innocent looking code because of huge penalties inherent in the architecture.

A generally sub-optimal code base will nickel-and-dime you for a big chunk of your frame rate.

It’s easier to write more efficient code up front than to go and find it all later with a profiler.

Page 7: PPU Optimisation Lesson

PPU Hardware Threads

We have two of them running on alternate cycles

If one stalls, other thread runs

Not multi-core:

Shared execution units

Shared access to memory and cache

Most registers duplicated

Ideal usage:

Threads filling in stalls for each other without thrashing cache

Page 8: PPU Optimisation Lesson

PS3 Cache

Level 1 Instruction Cache

32 KB, 2-way set associative, 128-byte cache line

Level 1 Data Cache

32 KB, 4-way set associative, 128-byte cache line

Write-through to L2

Level 2 Data and Instruction Cache

512 KB, 8-way set associative

Write-back

128 byte cache line

Page 9: PPU Optimisation Lesson

Cache Miss Penalties

L1 cache miss = 40 cycles

L2 cache miss = ~1000 cycles!

In other words, random reads from memory are excruciatingly expensive!

Reading data with large strides is very bad

Consider smaller data structures, or group data that will be read together
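The "group data that will be read together" advice above can be sketched as a hot/cold split. This is an illustrative example, not from the slides: by moving rarely-touched fields into a separate struct, the hot loop's stride shrinks and more elements fit per 128-byte cache line.

```cpp
#include <cstddef>

// Hypothetical entity split into hot and cold halves. The hot struct is
// 16 bytes, so eight of them share one 128-byte cache line; the cold data
// no longer pollutes the cache during the tight loop.
struct EntityHot  { float posX, posY, posZ; float radius; };   // 16 bytes
struct EntityCold { char  debugName[64];    int   editorId; }; // touched rarely

float SumRadii(const EntityHot* hot, std::size_t count)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += hot[i].radius;   // sequential 16-byte stride, cache friendly
    return sum;
}
```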

Page 10: PPU Optimisation Lesson

Virtual Functions

What happens when you call a virtual function?

What does this code do?

virtual void Update() {}

May touch a cache line at the vtable address unnecessarily

Consider batching by type for a better iCache pattern

If you know the type, maybe you don’t need the virtual call, saving a touch of memory to read the function address

Even better: maybe the data you actually want to manipulate can be kept close together in memory?
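The batching-by-type idea can be sketched like this (types and names are illustrative, not from the slides): instead of one heterogeneous list of base pointers, where each element costs a vtable load and an unpredictable indirect branch, keep one homogeneous array per concrete type and update each with a direct, inlinable call.

```cpp
#include <vector>

// Two hypothetical enemy types with non-virtual, inlinable Update methods.
struct Tank   { int hp;   void Update() { ++hp; } };
struct Sniper { int ammo; void Update() { --ammo; } };

struct World
{
    std::vector<Tank>   tanks;    // contiguous, one type: predictable code path
    std::vector<Sniper> snipers;

    void UpdateAll()
    {
        // Direct calls: no vtable read, good iCache locality per batch.
        for (Tank& t : tanks)     t.Update();
        for (Sniper& s : snipers) s.Update();
    }
};
```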

Page 11: PPU Optimisation Lesson

Data Hazards Ahead

Page 12: PPU Optimisation Lesson

Spot the Stall

int SlowFunction(int & a, int & b)

{

a = 1;

b = 2;

return a + b;

}

Page 13: PPU Optimisation Lesson

Method 1

Q: When will this work and when won’t it?

inline int FastFunction(int & a, int & b)

{

a = 1;

b = 2;

return a + b;

}

Page 14: PPU Optimisation Lesson

Method 2

int FastFunction( int * __restrict a, int * __restrict b )

{

*a = 1;

*b = 2;

return *a + *b; // we promise that a != b

}

Page 15: PPU Optimisation Lesson

__restrict Keyword

__restrict only works with pointers, not references (which sucks). Aliasing rules only apply between identical types. It can also be applied to the implicit this pointer in member functions.

Put it after the closing parenthesis of the parameter list. This stops the compiler worrying that you passed a class data member to the function.
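The member-function form might look like this (a minimal sketch using GCC's `__restrict` extension; the struct and names are illustrative). The qualifier after the parameter list promises that `this` does not alias the arguments, so members need not be reloaded after stores through the pointer parameters.

```cpp
// Qualifying the implicit this pointer with __restrict (GCC/MSVC extension,
// not standard C++). Without it, the compiler must assume pDamage could
// point at m_health and reload m_health after every store through pDamage.
struct Enemy
{
    int m_health;

    int ApplyDamage(const int* __restrict pDamage) __restrict
    {
        m_health -= *pDamage;
        return m_health;   // can stay in a register: no aliasing assumed
    }
};
```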

Page 16: PPU Optimisation Lesson

Load-Hit-Store

What is it?

Write followed by read

PPU store queue

Average latency 40 cycles

Snoop bits 52 through 61 (implications?)

Page 17: PPU Optimisation Lesson

True LHS

Type casting between register files:

float floatValue = (float)intValue;

float posY = vPosition.X();

Data member as loop counter

while( m_Counter-- ) {}

Aliasing:

void swap( int & a, int & b ) { int t = a; a = b; b = t; }

Page 18: PPU Optimisation Lesson

Workaround: Data member as loop counter

int counter = m_Counter; // load into register

while( counter-- ) // register will decrement

{

doSomething();

}

m_Counter = 0; // Store to memory just once

Page 19: PPU Optimisation Lesson

Workarounds

Keep data in the same domain as much as possible

Reorder your writes and reads to allow space for latency

Consider using word flags instead of many packed bools

Load flags word into register

Perform logical bitwise operations on flags in register

Store new flags value

e.g. m_Flags = ( initialFlags & ~kCanSeePlayer ) | newCanSeeFlag | kAngry;
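A runnable sketch of the flags-word workaround above (the flag constants and function name are illustrative): load the word once, do all the bitwise work on the register copy, and store once, instead of reading and writing several packed bools.

```cpp
#include <cstdint>

// Hypothetical flag bits packed into one word.
enum : std::uint32_t
{
    kCanSeePlayer = 1u << 0,
    kAngry        = 1u << 1,
    kAlive        = 1u << 2,
};

// All bit twiddling happens on the register copy; the caller loads m_Flags
// once before calling and stores the result once afterwards.
std::uint32_t UpdateFlags(std::uint32_t flags, bool bCanSeePlayer)
{
    const std::uint32_t seeBit = bCanSeePlayer ? kCanSeePlayer : 0u;
    return (flags & ~kCanSeePlayer) | seeBit | kAngry;
}
```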

Page 20: PPU Optimisation Lesson

False LHS Case

Store queue snooping only compares bits 52 through 61

So a false LHS occurs if you read from an address while a different item in the queue matches addr & 0xFFC.

Writing to and reading from memory 4KB apart

Writing to and reading from memory on sub-word boundary, e.g. packed short, byte, bool

Write a bool then read a nearby one: ~40-cycle stall

Page 21: PPU Optimisation Lesson

Example

struct BaddyState

{

bool m_bAlive;

bool m_bCanSeePlayer;

bool m_bAngry;

bool m_bWearingHat;

}; // 4 bytes

Page 22: PPU Optimisation Lesson

Where might we stall?

if ( m_bAlive )

{

m_bCanSeePlayer = LineOfSightCheck( this, player );

if ( m_bCanSeePlayer && !m_bAngry )

{

m_bAngry = true;

StartShootingPlayer();

}

}

Page 23: PPU Optimisation Lesson

Workaround

if ( m_bAlive ) // load, compare

{

const bool bAngry = m_bAngry; // load

const bool bCanSeePlayer = LineOfSightCheck( this, player );

m_bCanSeePlayer = bCanSeePlayer; // store

if (bCanSeePlayer && !bAngry ) // compare registers

{

m_bAngry = true; // store

StartShootingPlayer();

}

}

Page 24: PPU Optimisation Lesson

Loop + Singleton Gotcha

What happens here?

for( int i = 0; i < enemyCount; ++i )

{

EnemyManager::Get().DispatchEnemy();

}

Page 25: PPU Optimisation Lesson

Workaround

EnemyManager & enemyManager = EnemyManager::Get();

for( int i = 0; i < enemyCount; ++i )

{

enemyManager.DispatchEnemy();

}

Page 26: PPU Optimisation Lesson

Branch Hints

Branch mis-prediction can hurt your performance

24-cycle penalty to flush the instruction queue

If you know a certain branch is rarely taken, you can use a static branch hint, e.g.

if ( __builtin_expect( bResult, 0 ) )

Far better to eliminate the branch!

Use logical bitwise operations to mask results

This is far easier and more applicable in SIMD code
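The masking idea can be sketched in scalar integer code (my example, not from the slides; the SIMD version uses the same pattern with vector compare and select). A condition is turned into an all-ones or all-zeros mask and the two candidate results are blended, so there is no branch to mispredict.

```cpp
#include <cstdint>

// Branchless min: build a mask that is 0xFFFFFFFF when a < b, else 0,
// then blend a and b with bitwise ops. Nothing here can mispredict.
std::int32_t SelectMin(std::int32_t a, std::int32_t b)
{
    const std::int32_t mask = -static_cast<std::int32_t>(a < b); // -1 or 0
    return (a & mask) | (b & ~mask);
}
```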

Page 27: PPU Optimisation Lesson

Floating Point Branch Elimination

__fsel, __fsels – Floating point select

float min( float a, float b )

{

return ( a < b ) ? a : b;

}

float min( float a, float b )

{

return __fsels( a - b, b, a );

}
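__fsel/__fsels are PPC compiler intrinsics; on other targets their semantics (select the second argument when the first is >= 0, else the third) can be emulated for testing. A minimal sketch, with the emulation function name my own:

```cpp
// Stand-in for the __fsels intrinsic: returns b when x >= 0, else a.
// The real instruction does this branchlessly in the float pipeline.
inline float fsel_emulated(float x, float b, float a)
{
    return (x >= 0.0f) ? b : a;
}

// min() rewritten in fsel form, mirroring the slide: a - b >= 0 means
// b is the smaller (or equal) value, so select b.
float MinBranchless(float a, float b)
{
    return fsel_emulated(a - b, b, a);
}
```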

Page 28: PPU Optimisation Lesson

Microcoded Instructions

A single instruction expands to several ops fetched from microcode ROM, creating a pipeline bubble.

Common example of one to avoid: shifting by a variable amount.

int a = b << c;

Minimum 11-cycle latency.

If you know the range of values, it can be better to switch to a fixed shift!

switch( c )

{

case 1: a = b << 1; break;

case 2: a = b << 2; break;

case 3: a = b << 3; break;

default: break;

}

Page 29: PPU Optimisation Lesson

Loop Unrolling

Why unroll loops?

Fewer branches

Better concurrency and instruction pipelining; hides latency

More opportunities for compiler to optimise

Only works if code can actually be interleaved, e.g. inline functions, no inter-iteration dependencies

How many times is enough?

On average, about 4-6 times works well

Best for loops where num iterations is known up front

Need to think about the spare: iterationCount % unrollCount

Page 30: PPU Optimisation Lesson

Picking up the Spare

If you can, artificially pad your data with safe values to keep the element count a multiple of the unroll count. In this example, you might process up to 3 dummy items in the worst case.

for ( int i = 0; i < numElements; i += 4 )

{

InlineTransformation( pElements[ i+0 ] );

InlineTransformation( pElements[ i+1 ] );

InlineTransformation( pElements[ i+2 ] );

InlineTransformation( pElements[ i+3 ] );

}

Page 31: PPU Optimisation Lesson

Picking up the Spare

If you can’t pad, run the unrolled loop for numElements & ~3 iterations instead, then run to completion in a second loop (i carries over from the first loop).

for ( ; i < numElements; ++i )

{

InlineTransformation( pElements[ i ] );

}
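Putting the unrolled body and the remainder loop together gives a self-contained sketch (the InlineTransformation body here is a placeholder of my own; the slides leave it abstract):

```cpp
// Placeholder transformation: doubles the element in place.
inline void InlineTransformation(int& x) { x *= 2; }

void TransformAll(int* pElements, int numElements)
{
    int i = 0;
    // Main body, unrolled by 4: runs for the largest multiple of 4.
    for (; i < (numElements & ~3); i += 4)
    {
        InlineTransformation(pElements[i + 0]);
        InlineTransformation(pElements[i + 1]);
        InlineTransformation(pElements[i + 2]);
        InlineTransformation(pElements[i + 3]);
    }
    // Pick up the spare 0-3 elements; i carries over from the first loop.
    for (; i < numElements; ++i)
        InlineTransformation(pElements[i]);
}
```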

Page 32: PPU Optimisation Lesson

Picking up the Spare

Alternative method (trade-off: only one branch, but longer code generated).

switch( numElements & 3 )

{

case 3: InlineTransformation( pElements[ i+2 ] );

case 2: InlineTransformation( pElements[ i+1 ] );

case 1: InlineTransformation( pElements[ i+0 ] );

case 0: break;

}

Page 33: PPU Optimisation Lesson

Loop unrolled… now what?

If you unrolled your loop 4 times, you might be able to use SIMD

Use AltiVec intrinsics; align your data to 16 bytes

128-bit registers operate on four 32-bit values in parallel

Most SIMD instructions have 1 cycle throughput

Consider using SOA data instead of AOS

AOS: arrays of structures with interleaved posX, posY, posZ fields

SOA: a structure containing one array per field, dimensioned for all elements
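The two layouts can be sketched side by side (sizes and names are illustrative). With SOA, four consecutive X values are contiguous, so they share a cache line and map directly onto one 128-bit SIMD register; the scalar loop below is the form that vectorises trivially four floats at a time.

```cpp
#include <cstddef>

const std::size_t MAX_PARTICLES = 1024;   // illustrative capacity

// AOS: fields interleaved per element; loading four X values means
// gathering from four separate 12-byte-apart locations.
struct ParticleAoS { float posX, posY, posZ; };
ParticleAoS particlesAoS[MAX_PARTICLES];

// SOA: one contiguous array per field.
struct ParticlesSoA
{
    float posX[MAX_PARTICLES];
    float posY[MAX_PARTICLES];
    float posZ[MAX_PARTICLES];
};
ParticlesSoA particlesSoA;

// Scalar reference loop over the SOA layout; a SIMD version would process
// posX four at a time from one aligned 16-byte load.
void TranslateX(ParticlesSoA& p, float dx, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        p.posX[i] += dx;
}
```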

Page 34: PPU Optimisation Lesson

Example: FloatToHalf
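The transcript does not preserve the code from this slide. As a hedged stand-in, here is a minimal scalar float-to-half conversion (truncating rounding, flush-to-zero for small values, no NaN handling) of the general kind a FloatToHalf example would show; it is my sketch, not the author's code.

```cpp
#include <cstdint>
#include <cstring>

// Convert an IEEE 754 single (1/8/23 bits) to a half (1/5/10 bits).
// Uses memcpy for the type pun rather than a float->int cast, which on
// PPU would itself be an LHS-prone cross-domain move.
std::uint16_t FloatToHalf(float f)
{
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));

    const std::uint32_t sign     = (bits >> 16) & 0x8000u;
    std::int32_t        exponent = static_cast<std::int32_t>((bits >> 23) & 0xFFu) - 127 + 15;
    const std::uint32_t mantissa = (bits >> 13) & 0x3FFu;   // truncate low bits

    if (exponent <= 0)  return static_cast<std::uint16_t>(sign);            // flush to zero
    if (exponent >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u);  // clamp to infinity

    return static_cast<std::uint16_t>(sign
        | (static_cast<std::uint32_t>(exponent) << 10) | mantissa);
}
```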

Page 35: PPU Optimisation Lesson

Example: FloatToHalf4

Page 36: PPU Optimisation Lesson

Questions?