C++ on Next-Gen Consoles: Effective Code for New Architectures
Pete Isensee
Development Manager
Microsoft Game Technology Group
Last Year at GDC
  Chris Hecker ranted
  What did he say?
    Programmers: danger ahead
    Out-of-order execution: good
    In-order execution: bad
    Microsoft and Sony are going to screw you
    You are so hosed. Game over, man.
    “There’s absolutely nothing you can do about this”
Console Hardware Architectures
  Optimized to do floating-point math
  Optimized for multithreaded tasks
  Optimized to run games
  Not optimized to run general-purpose code
  Not optimized to do branch prediction, code reordering, instruction pipelining or other out-of-order magic
  Large L2 caches
  Large latencies
We’re Game Programmers. We Love Challenges.
  We will make games on these consoles
  The solution is not assembly language
  The solution is to tailor our C/C++ engines, inner loops and bottleneck functions to the realities of the hardware
  Remember: C++ code can make or break your game’s performance
Not Covering
  Profiling (do it)
  Multithreading (do it)
  Memory allocation (avoid in game loop)
  Compiler settings (experiment)
  Exception handling (avoid it)
Topics for Today
  Thinking about L2
    Optimize memory access
    Use CPU caches effectively
  Thinking about in-order processing
    Avoid function call overhead
    Tips for efficient math
    Avoid hidden C++ inefficiencies
Optimize Memory Access
  Proverb: thou shalt treat memory as if it were thy hard drive
  You will be memory-bound on new consoles
  Recommendations
    Never read from the same place twice in a frame
    Read data sequentially
    Write data sequentially
    Use everything you read
Minimize Data Passes
  Game frame loops often access data twice
    Or three times
    Or more
  Optimize for a single pass
  Consider less frequent operations
    AI
    Physics, collision
    Networking
    Particle systems
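One way to read the single-pass advice, sketched with a hypothetical `Particle` record and update steps that do not appear in the talk: rather than walking the array once per operation, fuse the operations so each element is loaded and stored once per frame.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical particle record, used only for illustration.
struct Particle {
    float pos;
    float vel;
    float age;
};

// Multiple passes: every loop streams the whole array through the cache again.
void updateMultiPass(std::vector<Particle>& ps, float dt) {
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].vel -= 9.8f * dt;
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].pos += ps[i].vel * dt;
    for (std::size_t i = 0; i < ps.size(); ++i) ps[i].age += dt;
}

// Single pass: each particle is read and written once, in sequential order.
void updateSinglePass(std::vector<Particle>& ps, float dt) {
    for (std::size_t i = 0; i < ps.size(); ++i) {
        Particle& p = ps[i];
        p.vel -= 9.8f * dt;
        p.pos += p.vel * dt;
        p.age += dt;
    }
}
```

Both functions perform the same arithmetic per particle; only the number of trips through memory differs. Whether the fused form is measurably faster depends on the data size and the target, so it is worth profiling.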
Multiple Pass Architecture
Pointer Aliasing Explained
void init( float *a, const float *b ) {
a[0] = 1.0f - *b;
a[1] = 1.0f - *b;
}
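Nominal case: a and b point to separate memory
  With *b = 0.0, a[] goes from {0.0, 0.0} to {1.0, 1.0}
Worst case: b aliases a[0]
float a[2]={0.0f};
init( a, &a[0] );
  a[] goes from {0.0, 0.0} to {1.0, 0.0}: the store to a[0] changes *b before a[1] is computed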
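To make the two cases concrete, here is a small sketch that calls the slide's `init` once with separate storage and once with `b` pointing into `a`. The helper names `nominalCase` and `worstCase` are made up for this illustration.

```cpp
// Same function as on the slide. The compiler must assume a and b may overlap,
// so it has to reload *b after the store to a[0].
void init(float* a, const float* b) {
    a[0] = 1.0f - *b;
    a[1] = 1.0f - *b;
}

// Nominal case: b points at separate storage.
void nominalCase(float out[2]) {
    float b = 0.0f;
    out[0] = out[1] = 0.0f;
    init(out, &b);        // out becomes {1.0f, 1.0f}
}

// Worst case: b aliases a[0], so the first store changes *b.
void worstCase(float out[2]) {
    out[0] = out[1] = 0.0f;
    init(out, &out[0]);   // out becomes {1.0f, 0.0f}
}
```

Because the aliased call is legal, the compiler cannot keep `*b` in a register across the first store. That extra load is the cost the talk is pointing at.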
A Solution: Restrict
  Restrict keyword tells the compiler there’s no aliasing
  Restrict permits the compiler to generate much more efficient code
void init( float* __restrict a,
const float* __restrict b ) {
a[0] = 1.0f - *b; // compiler can do
a[1] = 1.0f - *b; // the right thing
}
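A minimal sketch of two ways to express the no-aliasing intent. `__restrict` is a compiler extension (accepted by MSVC, GCC and Clang), not standard C++. The function names `initRestrict` and `initLocal` are hypothetical.

```cpp
// Extension form: the caller promises that a and b never overlap.
// Passing overlapping pointers here is undefined behavior.
void initRestrict(float* __restrict a, const float* __restrict b) {
    a[0] = 1.0f - *b;   // *b can be loaded once and kept in a register
    a[1] = 1.0f - *b;
}

// Portable alternative: read *b into a local, so no reload is needed
// regardless of what the compiler can prove about aliasing.
void initLocal(float* a, const float* b) {
    const float v = 1.0f - *b;
    a[0] = v;
    a[1] = v;
}
```

Note that `initLocal` changes the result in the aliased case (both elements get the same value), which is usually what the author intended anyway.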
What to Restrict
  Use restrict widely
    Function parameters that are pointers
    Local pointers
    Pointers in structs/classes
  But not:
    Function return types
    Casts
    Global pointers (maybe)
    References (maybe)
Use the CPU Caches Effectively
  The L2 cache is your best friend
  Using the cache well is an art
  Ensure you have a good profiler by your side
Keep the Working Set Small
  Pack commonly used data together
  Frequently used data might deserve its own struct/class
  Keep rarely used data separate
    Example: texture file names
  Consider bitfields
    Bitfields are extremely efficient on PowerPC
  Consider other forms of lossless compression
Inefficient Structs Are Bad Mojo
struct InefficientCar {
bool manual; // padding here
wheel wheels[8]; // 8 wheels?
bool convertible; // more pad
char engine; // 4 bits used
char file[32]; // rarely used
double maxAccel; // double?
};
sizeof(InefficientCar) = 80
Carefully Design Structures
struct EfficientCar {
    wheel wheels[4];        // 4 wheels
    wheel *moreWheels;
    char *file;             // stored elsewhere
    float maxAccel;         // float
    unsigned engine:4;      // bitfields
    unsigned manual:1;
    unsigned convertible:1;
};
sizeof(EfficientCar) = 32
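The two layouts can be compared directly with `sizeof`. The sketch below assumes a 4-byte `wheel` type, which the talk does not define. With that assumption, 4-byte pointers and 8-byte-aligned doubles, the definitions are consistent with the 80 and 32 bytes quoted on the slides. Sizes are implementation-defined, so a 64-bit build will typically report a larger `EfficientCar`.

```cpp
#include <cstddef>

// Hypothetical 4-byte wheel type; the talk does not show its definition.
struct wheel { float radius; };

struct InefficientCar {
    bool   manual;        // followed by padding
    wheel  wheels[8];
    bool   convertible;
    char   engine;        // only 4 bits used
    char   file[32];      // rarely used
    double maxAccel;
};

struct EfficientCar {
    wheel    wheels[4];
    wheel*   moreWheels;  // extra wheels, if any, stored elsewhere
    char*    file;        // stored elsewhere
    float    maxAccel;
    unsigned engine : 4;
    unsigned manual : 1;
    unsigned convertible : 1;
};

// Check the relationship rather than hard-coding platform-specific sizes.
static_assert(sizeof(EfficientCar) < sizeof(InefficientCar),
              "packed layout should be smaller");
```

Printing both `sizeof` values on the actual target is a quick way to confirm where padding ends up.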
Choose the Right Container
  Prefer contiguous containers
    Or at least mostly contiguous
    Examples: array, vector, deque
  Avoid node-based containers
    List, set/map, binary trees, hash tables
  If you must use a tree, consider a custom allocator for memory locality
  Vector + std::sort is often faster (and smaller) than set or map or hash tables, by an order of magnitude
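A sketch of the vector-plus-sort idea using only the standard library: fill a contiguous vector, sort it once, then answer membership queries with a binary search. The names `buildIndex` and `contains` are illustrative.

```cpp
#include <algorithm>
#include <vector>

// Build once: sort the contiguous storage and drop duplicates,
// which gives the same logical contents as a std::set<int>.
void buildIndex(std::vector<int>& ids) {
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
}

// Query: binary search over contiguous memory instead of chasing node pointers.
bool contains(const std::vector<int>& sortedIds, int id) {
    return std::binary_search(sortedIds.begin(), sortedIds.end(), id);
}
```

This fits data that is built in bulk and then mostly read. If elements are inserted and removed constantly, the cost of keeping the vector sorted has to be weighed against the better memory locality.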
Avoid Function Call Overhead
  Function call overhead was a surprising cause of performance issues on Xbox
  The same is true on Xbox 360 and PS3
  Fortunately, there are lots of solutions
  Research compiler settings. On Xbox 360:
    Inline “any suitable”
    Enable link-time code generation
  Spend time ensuring the compiler is inlining the right things
Avoid Virtual Functions
  Weigh the limitations of virtual functions
    Adds a branch instruction
    Branch is always mispredicted
    Compiler is limited in how it can optimize
  Consider replacing
    virtual void Draw() = 0;
  With
    Xbox360.cpp: void Draw() { ... }
    Windows.cpp: void Draw() { ... }
    PS3.cpp: void Draw() { ... }
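The per-platform replacement can be sketched in a single file by letting the preprocessor stand in for the build system. The macros `BUILD_XBOX360` and `BUILD_PS3` and the counter `g_drawCalls` are hypothetical. In a real project each definition would live in its own .cpp file and only one of them would be linked.

```cpp
// Shared declaration: a plain function, no base class and no vtable.
void Draw();

// Hypothetical counter so the sketch has an observable effect.
int g_drawCalls = 0;

#if defined(BUILD_XBOX360)      // body would live in Xbox360.cpp
void Draw() { ++g_drawCalls; /* Xbox 360 rendering */ }
#elif defined(BUILD_PS3)        // body would live in PS3.cpp
void Draw() { ++g_drawCalls; /* PS3 rendering */ }
#else                           // default; body would live in Windows.cpp
void Draw() { ++g_drawCalls; /* Windows rendering */ }
#endif

// Call sites make a direct call that the compiler can resolve and may inline.
void RenderFrame() { Draw(); }
```

The choice of implementation moves from run time (a vtable lookup) to build time, which only works when exactly one implementation is needed per executable.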
Maximize Leaf Functions
  Leaf functions don’t call other functions, ever
  If a potential leaf function calls another function, the high-level function:
    Is much less likely to be inlined
    Must set up a stack frame
    Must set up registers
  Potential solutions
    Remove the inner function completely
    Inline the inner function
    Provide two versions of the outer function
Unroll Inner Loops
  Compiler can’t unroll loops where n is variable
  Even unrolling from ++i to i+=4 can be a significant gain
    Eliminates three branch instructions
    Increases opportunity for code scheduling
  Don’t forget to hoist invariants out, too
Example Unrolling
// original
for( i=a.beg(); i!=a.end(); ++i )
    process(i);
// unrolled (assumes the element count is a multiple of 4)
e = a.end();
for( i=a.beg(); i!=e; i+=4 ) {
    process(i);
    process(i+1);
    process(i+2);
    process(i+3);
}
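The slide's unrolled loop only terminates correctly when the element count is a multiple of four. A minimal sketch that also handles the leftover elements, assuming a hypothetical `process` that doubles each value:

```cpp
#include <cstddef>

// Hypothetical per-element work.
inline void process(float& x) { x *= 2.0f; }

void processAll(float* data, std::size_t n) {
    std::size_t i = 0;
    const std::size_t unrolledEnd = n - (n % 4);  // loop bound hoisted out of the loop
    for (; i < unrolledEnd; i += 4) {             // one loop branch per four elements
        process(data[i]);
        process(data[i + 1]);
        process(data[i + 2]);
        process(data[i + 3]);
    }
    for (; i < n; ++i)                            // remaining 0 to 3 elements
        process(data[i]);
}
```

The four independent calls in the main loop body also give the compiler more room to schedule instructions on an in-order core.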
Pass Native Types by Value
  Tradition says that “large” types are passed by pointer or reference, but be careful
    New consoles have really large registers
  Native types include
    64-bit int (__int64)
    VMX vector (__vector4) – 128 bits!
  Pass structs by pointer or reference
    One exception: pass structs consisting of bitfields <= 64 bits by value
Know Data Type Performance
  int32 and int64 have equivalent perf
  float and double have equivalent perf
  int8 and int16 are slower than int
    They generate extra instructions
    High bits cleared or sign-extended
    Example: int32 adds 2X faster than int16 adds
  Recommendations
    Store as smallest type required
    Load into int32, int64 or double for calculations
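The store-small, compute-wide recommendation might look like the sketch below. The `Samples` struct and its element count are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Stored compactly as 16-bit values to keep the working set small.
struct Samples {
    std::int16_t values[8];
};

// Arithmetic uses a 32-bit accumulator; each element is widened once on load,
// and no intermediate result is narrowed back to 16 bits.
std::int32_t sum(const Samples& s) {
    std::int32_t total = 0;
    for (std::size_t i = 0; i < 8; ++i)
        total += s.values[i];
    return total;
}
```

A side benefit of the wider accumulator is that the running total can exceed the 16-bit range without wrapping.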
Use Native Vector Types
  In CS 101, you learned to create abstract data types, such as matrices
typedef std::vector<float> vec;    // 4 floats, heap-allocated
typedef std::vector<vec> matrix;   // 4 vecs, each a separate allocation
  This code is an abomination
    At least on Xbox 360 and PS3
  Xbox 360 and PS3 have dedicated vector math units called VMX units
    Use them!
Your Math Buddies
  __vector4 (4 32-bit floats; 128-bit register)
  XMVECTOR (typedef for __vector4)
  XMMATRIX (array of 4 __vector4s)
  XMVECTOR operators (+,-,*,/)
  Hundreds of XMVECTOR and XMMATRIX functions
  Xbox 360-specific, but similar constructs in PS3 compilers
Avoid Floating-Point Branches
  FP branches are slow
    Cache has to be flushed
    ~10X slower than int branches
  Avoid loops with float test expressions
  Eliminate altogether if possible
    Can be faster to calculate values you won’t use!
  Compare integers instead
  Replace with fsel when possible
    10-20X performance gain
The fsel Option in Detail
Definition of hardware implementation (the PowerPC instruction selects on a >= 0):
float fsel(float a, float b, float c)
{
    return ( a >= 0.0f ) ? b : c;
}
You can replace expressions like
v = ( w < x ) ? y : z; // slow
With faster expressions like
v = fsel( w - x, z, y ); // turbo
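A portable sketch of the select idiom, using a hypothetical helper `selectNonNegative` whose convention is stated explicitly: it returns `b` when `a >= 0` and `c` otherwise. Argument order differs between write-ups and compiler intrinsics, so check the documentation for the intrinsic on your platform before substituting it.

```cpp
// Hypothetical portable stand-in for a floating-point select.
// Convention used here: returns b when a >= 0, otherwise c.
inline float selectNonNegative(float a, float b, float c) {
    return (a >= 0.0f) ? b : c;
}

// Branching version of max.
inline float maxBranch(float x, float y) {
    if (x >= y) return x;
    return y;
}

// Select version: the comparison becomes a subtraction plus a select.
inline float maxSelect(float x, float y) {
    return selectNonNegative(x - y, x, y);
}
```

The portable helper may still compile to a branch. The gain the talk describes comes from mapping it to the hardware select on targets that have one. Rewriting a comparison as a subtraction also changes behavior for NaN and for infinities of the same sign, so the two versions are only interchangeable for ordinary finite inputs.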
Prefer Platform-Specific Funcs
  The C runtime (CRT) is not usually the best option when performance matters
  Xbox 360 examples
    Prefer CreateFile to fopen or C++ streams
      Options for asynchronous reads and other goodness
    Prefer XMemCpy to memcpy
      2-6X faster
    Prefer XMemSet to memset
      8-14X faster
Avoid Hidden C++ Inefficiencies
  C++ rocks the house!
  C++ can bring your game to its knees!
  Consider these innocuous snippets
    Quaternion q;
    s.push_back( k );
    if( (float)i > f ) obj->Draw();
    GameObject arr[1000];
    a = b + c;
    i++;
C++ is Dangerous
  With power comes responsibility
  Beware constructors
    Is initialization the right thing to do?
  Beware hidden allocations
  Conversion casts may have significant cost
  Use virtual functions with care
  Beware overloaded operators
    Stick to known idioms
    Operator++ should be a constant-time operation.
    Really.
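Two of these hidden costs can be sidestepped with small changes, sketched below with made-up function names. `buildIds` reserves capacity so `push_back` does not reallocate inside the loop. `countAbove` converts the float threshold to an integer once, instead of casting every element to float before comparing. It assumes the threshold is finite and fits in an `int`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// push_back may reallocate and copy every element when capacity runs out.
// Reserving up front moves the single allocation outside the hot loop.
std::vector<int> buildIds(std::size_t count) {
    std::vector<int> ids;
    ids.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        ids.push_back(static_cast<int>(i));
    return ids;
}

// For an integer i, (float)i > f is equivalent to i > floor(f) when f is finite
// and in range, so the conversion can be hoisted and the loop compares integers.
std::size_t countAbove(const std::vector<int>& v, float threshold) {
    const int limit = static_cast<int>(std::floor(threshold));
    std::size_t n = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] > limit) ++n;
    return n;
}
```

Neither change alters what the code computes for in-range inputs. They only remove work that the original snippets hide behind ordinary-looking syntax.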
Summary
  There absolutely are many things you can do to efficiently program next-gen consoles
  Two key issues: L2/memory and in-order processing
  Treat memory as you would a hard disk
  Watch out for those branches; use tricks like fsel
  Prefer a light C++ touch
What’s Next
  Our games are only as good as the weakest member of the team
  Share what you’ve learned
  “The sharing of ideas allows us to stand on one another’s shoulders instead of on one another’s feet” – Jim Warren
Questions
  [email protected]
  Fill out your feedback forms