![Page 1: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/1.jpg)
Coding for Multiple Cores
Bruce Dawson & Chuck Walbourn
Programmers
Game Technology Group
![Page 2: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/2.jpg)
Why multi-threading/multi-core?
Clock rates are stagnant
Future CPUs will be predominantly multi-thread/multi-core
Xbox 360 has 3 cores
PS3 will be multi-core
More than 70% of PC sales will be multi-core by end of 2006
Most Windows Vista systems will be multi-core
Two performance possibilities:
Single-threaded? Minimal performance growth
Multi-threaded? Exponential performance growth
![Page 3: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/3.jpg)
Design for Multithreading
Good design is critical
Bad multithreading can be worse than no multithreading
Deadlocks, synchronization bugs, poor performance, etc.
![Page 4: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/4.jpg)
Bad Multithreading
[Diagram: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5]
![Page 5: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/5.jpg)
Good Multithreading
[Diagram: Main Thread, Game Thread, Rendering Thread, Physics, Animation/Skinning, Particle Systems, Networking, File I/O]
![Page 6: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/6.jpg)
Another Paradigm: Cascades
[Diagram: Thread 1 through Thread 5; stages Input, Physics, AI, Rendering, Present; Frame 1 through Frame 4]
Advantages:
Synchronization points are few and well-defined
Disadvantages:
Increases latency (for constant frame rate)
Needs simple (one-way) data flow
![Page 7: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/7.jpg)
Typical Threaded Tasks
File Decompression
Rendering
Graphics Fluff
Physics
![Page 8: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/8.jpg)
File Decompression
Most common CPU heavy thread on the Xbox 360
Easy to multithread
Allows use of aggressive compression to improve load times
Don’t throw a thread at a problem better solved by offline processing
Texture compression, file packing, etc.
![Page 9: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/9.jpg)
Rendering
Separate update and render threads
Rendering on multiple threads (D3DCREATE_MULTITHREADED) works poorly
Exception: Xbox 360 command buffers
Special case of cascades paradigm
Pass render state from update to render
With constant workload gives same latency, better frame rate
With increased workload gives same frame rate, worse latency
![Page 10: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/10.jpg)
Graphics Fluff
Extra graphics that don't affect play
Procedurally generated animating cloud textures
Cloth simulations
Dynamic ambient occlusion
Procedurally generated vegetation, etc.
Extra particles, better particle physics, etc.
Easy to synchronize
Potentially expensive, but if the core is otherwise idle...?
![Page 11: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/11.jpg)
Physics?
Could cascade from update to physics to rendering
Makes use of three threads
May be too much latency
Could run physics on many threads
Uses many threads while doing physics
May leave threads mostly idle elsewhere
![Page 12: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/12.jpg)
Overcommitted Multithreading?
[Diagram: Game Thread, Rendering Thread, Physics, Animation/Skinning, Particle Systems]
![Page 13: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/13.jpg)
How Many Threads?
No more than one CPU intensive software thread per core
3-6 on Xbox 360
1-? on PC (1-4 for now, need to query)
Too many busy threads add complexity and lower performance
Context switches are not free
Can have many non-CPU intensive threads
I/O threads that block, or intermittent tasks
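The "need to query" note for PC can be sketched in portable C++. This is a minimal sketch, assuming the C++11 standard library (`std::thread::hardware_concurrency()`) as a stand-in for a platform query; `choose_worker_count` and `default_worker_count` are hypothetical helper names, not part of the talk. The reported value typically counts logical processors, so SMT threads are included.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: decide how many CPU-intensive worker threads to create.
// `reserved` is the number of hardware threads kept free for existing heavy
// threads such as update or rendering.
unsigned choose_worker_count(unsigned hardware_threads, unsigned reserved) {
    // hardware_concurrency() may return 0 when the count is unknown.
    if (hardware_threads == 0) hardware_threads = 1;
    // Always keep at least one worker, even on a single-core machine.
    if (hardware_threads <= reserved) return 1;
    return hardware_threads - reserved;
}

// Query the machine and reserve one hardware thread for the main thread.
unsigned default_worker_count() {
    unsigned count = choose_worker_count(std::thread::hardware_concurrency(), 1);
    assert(count >= 1);
    return count;
}
```

Keeping the policy in a pure function makes it easy to test against machine configurations you do not have on hand.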
![Page 14: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/14.jpg)
Simultaneous Multi-Threading
Be careful with Simultaneous Multi-Threading (SMT) threads
Not the same as double the number of cores
Can give a small perf boost
Can cause a perf drop
Can avoid scheduler latency
Ideally one heavy thread per core plus some additional intermittent threads
![Page 15: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/15.jpg)
Case Study: Kameo (Xbox 360)
Started single threaded
Rendering was taking half of time—put on separate thread
Two render-description buffers created to communicate from update to render
Linear read/write access for best cache usage
Doesn't copy const data
File I/O and decompress on other threads
![Page 16: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/16.jpg)
Separate Rendering Thread
[Diagram: Update Thread and Render Thread exchanging Buffer 0 and Buffer 1]
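The two-buffer handoff between update and render can be sketched as follows. This is an illustrative sketch only, assuming C++11 `std::thread`, `std::mutex`, and `std::condition_variable`; `RenderDescription`, `FrameExchange`, and `run_frames` are hypothetical names and this is not Kameo's actual implementation. The update thread fills one buffer while the render thread reads the other, and the two only synchronize once per frame.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct RenderDescription { std::vector<int> draw_calls; };

class FrameExchange {
public:
    // Update thread: the buffer that may be filled for the next frame.
    RenderDescription& write_buffer() { return buffers_[write_index_]; }

    // Update thread: hand the filled buffer to the renderer and flip buffers.
    // Blocks until the renderer has released the previously published frame.
    void publish() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !frame_ready_; });
        ready_index_ = write_index_;
        write_index_ = 1 - write_index_;
        frame_ready_ = true;
        cv_.notify_all();
    }

    // Render thread: wait for a published frame and read it without locking.
    const RenderDescription& acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return frame_ready_; });
        return buffers_[ready_index_];
    }

    // Render thread: finished reading, so the update thread may publish again.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(frame_ready_);  // release() without acquire() is a bug
            frame_ready_ = false;
        }
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    RenderDescription buffers_[2];
    int write_index_ = 0;   // touched only by the update thread
    int ready_index_ = 0;   // protected by mutex_
    bool frame_ready_ = false;
};

// Runs `frame_count` frames; each frame issues three draw calls tagged with
// the frame number. Returns the sum of all tags seen by the render thread.
int run_frames(int frame_count) {
    FrameExchange exchange;
    int rendered_sum = 0;
    std::thread render([&] {
        for (int f = 0; f < frame_count; ++f) {
            const RenderDescription& desc = exchange.acquire();
            for (int id : desc.draw_calls) rendered_sum += id;
            exchange.release();
        }
    });
    for (int f = 0; f < frame_count; ++f) {
        exchange.write_buffer().draw_calls.assign(3, f);
        exchange.publish();
    }
    render.join();
    return rendered_sum;
}
```

The update thread can run at most one frame ahead of the renderer, which is the latency cost the cascades slide describes.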
![Page 17: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/17.jpg)
Case Study: Kameo (Xbox 360)
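| Core | Thread | Software threads |
|------|--------|------------------|
| 0 | 0 | Game update |
| 0 | 1 | File I/O |
| 1 | 0 | Rendering |
| 1 | 1 | (none listed) |
| 2 | 0 | XAudio |
| 2 | 1 | File decompression |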
Total usage was ~2.2-2.5 cores
![Page 18: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/18.jpg)
Case Study: Project Gotham Racing
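| Core | Thread | Software threads |
|------|--------|------------------|
| 0 | 0 | Update, physics, rendering, UI |
| 0 | 1 | Audio update, networking |
| 1 | 0 | Crowd update, texture decompression |
| 1 | 1 | Texture decompression |
| 2 | 0 | XAudio |
| 2 | 1 | (none listed) |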
Total usage was ~2.0-3.0 cores
![Page 19: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/19.jpg)
Managing Your Threads
Creating threads
Synchronizing
Terminating
Don't use TerminateThread()
Bad idea on Windows: leaves the process in an indeterminate state, doesn't allow clean-up, etc.
Unavailable on Xbox 360
Instead return from your thread function, or call ExitThread
![Page 20: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/20.jpg)
Creating Threads Poorly

    const int stackSize = 0;
    HANDLE hThread = CreateThread(0, stackSize, ThreadFunctionBad, 0, 0, 0);
    // Do work on main thread here.
    for (;;) {
        // Wait for child thread to complete
        DWORD exitCode;
        GetExitCodeThread(hThread, &exitCode);
        if (exitCode != STILL_ACTIVE)
            break;
    }
    ...
    DWORD __stdcall ThreadFunctionBad(void* data)
    {
    #ifdef WIN32
        SetThreadAffinityMask(GetCurrentThread(), 8);
    #endif
        // Do child thread work here.
        return 0;
    }

CreateThread doesn't initialize C runtime
Stack size of zero means inherit parent's stack size
Busy waiting is bad!
Don't forget to close the thread handle when done with it
Be careful with thread affinities on Windows
![Page 21: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/21.jpg)
Creating Threads Well

    const int stackSize = 65536;
    HANDLE hThread = (HANDLE)_beginthreadex(0, stackSize, ThreadFunction, 0, 0, 0);
    // Do work on main thread here.
    // Wait for child thread to complete
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    ...
    unsigned __stdcall ThreadFunction(void* data)
    {
    #ifdef XBOX
        // On Xbox 360 you must explicitly assign
        // software threads to hardware threads.
        XSetThreadProcessor(GetCurrentThread(), 2);
    #endif
        // Do child thread work here.
        return 0;
    }

_beginthreadex initializes CRT
Specify stack size on Xbox 360
WaitForSingleObject is the correct way to wait for a thread to exit
Don't forget to close the thread handle when done with it
Thread affinities must be specified on Xbox 360
![Page 22: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/22.jpg)
Alternative: OpenMP
Available in VC++ 2005
Simple way to parallelize loops and some other constructs
Works best on long symmetric tasks—particles?
Game tasks are short—16.6 ms
Many game tasks are not symmetric
OpenMP is nice, but not ideal
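A symmetric particle update is the kind of loop OpenMP handles well. The sketch below is illustrative only: `Particle` and `integrate` are hypothetical names, and the loop body is a toy integration step. Without an OpenMP-enabling compiler switch the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

struct Particle { float position; float velocity; };

// Advance every particle by one time step.
void integrate(std::vector<Particle>& particles, float dt) {
    assert(dt >= 0.0f);  // this sketch only steps forward in time
    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long count = static_cast<long>(particles.size());
    // Each iteration touches only its own particle, so the iterations are
    // independent and need no synchronization.
    #pragma omp parallel for
    for (long i = 0; i < count; ++i) {
        particles[i].position += particles[i].velocity * dt;
    }
}
```

This works because every iteration costs about the same and shares no data; loops with uneven or dependent iterations are where the slide's "not ideal" caveat applies.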
![Page 23: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/23.jpg)
Available Synchronization Objects
Events
Semaphores
Mutexes
Critical Sections
Don't use SuspendThread()
Some titles have used this for synchronization
Can easily lead to deadlocks
Interacts badly with Visual Studio debugger
![Page 24: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/24.jpg)
Exclusive Access: Mutex

    // Initialize
    HANDLE mutex = CreateMutex(0, FALSE, 0);

    // Use
    void ManipulateSharedData() {
        WaitForSingleObject(mutex, INFINITE);
        // Manipulate stuff...
        ReleaseMutex(mutex);
    }

    // Destroy
    CloseHandle(mutex);
![Page 25: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/25.jpg)
Exclusive Access: CRITICAL_SECTION

    // Initialize
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);

    // Use
    void ManipulateSharedData() {
        EnterCriticalSection(&cs);
        // Manipulate stuff...
        LeaveCriticalSection(&cs);
    }

    // Destroy
    DeleteCriticalSection(&cs);
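Both examples above pair the acquire and release calls by hand, so an early return between them would leave the lock held. A scoped lock avoids that. The sketch below assumes C++11 `std::mutex` and `std::lock_guard` as a portable stand-in for the Win32 objects on the slides; `manipulate_shared_data` and `run_workers` are hypothetical names.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_shared_mutex;
long g_shared_counter = 0;  // the shared data, protected by g_shared_mutex

void manipulate_shared_data(int amount) {
    // The lock is released when `lock` goes out of scope, on every path.
    std::lock_guard<std::mutex> lock(g_shared_mutex);
    if (amount == 0) return;  // early return still releases the lock
    g_shared_counter += amount;
}

// Starts `thread_count` threads that each add 1 to the counter
// `increments_per_thread` times, then returns the final counter value.
long run_workers(int thread_count, int increments_per_thread) {
    assert(thread_count > 0);
    g_shared_counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                manipulate_shared_data(1);
            }
        });
    }
    for (std::thread& worker : workers) worker.join();
    return g_shared_counter;
}
```

Taking the lock once per increment is deliberately heavy here to exercise contention; real code would follow the later advice to synchronize rarely and hold locks briefly.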
![Page 26: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/26.jpg)
Lockless programming
Trendy technique to use clever programming to share resources without locking
Includes InterlockedXXX(), lockless message passing, Double Checked Locking, etc.
Very hard to get right:
Compiler can reorder instructions
CPU can reorder instructions
CPU can reorder reads and writes
Not as fast as avoiding synchronization entirely
![Page 27: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/27.jpg)
Lockless Messages: Buggy

    void SendMessage(void* input) {
        // Wait for the message to be 'empty'.
        while (g_msg.filled)
            ;
        memcpy(g_msg.data, input, MESSAGESIZE);
        g_msg.filled = true;
    }

    void GetMessage() {
        // Wait for the message to be 'filled'.
        while (!g_msg.filled)
            ;
        memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
        g_msg.filled = false;
    }
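The code above is buggy because nothing stops the compiler or CPU from reordering the copy relative to the `filled` flag, or from reading the flag only once. One possible repair, sketched here under the assumption that C++11 `std::atomic` is available (it postdates this talk), uses acquire/release ordering on the flag. The names `send_message`, `get_message`, and `exchange_messages` are hypothetical (renamed to avoid the Win32 `SendMessage`/`GetMessage` macros), and the sketch is only valid for exactly one sender and one receiver.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>

const std::size_t MESSAGESIZE = 16;

struct Message {
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};

Message g_msg;

void send_message(const void* input) {
    // Acquire: the receiver's read of `data` is complete before we overwrite it.
    while (g_msg.filled.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    std::memcpy(g_msg.data, input, MESSAGESIZE);
    // Release: the copy above is visible before the flag reads as true.
    g_msg.filled.store(true, std::memory_order_release);
}

void get_message(void* output) {
    // Acquire: once the flag reads true, the sender's copy is visible.
    while (!g_msg.filled.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    std::memcpy(output, g_msg.data, MESSAGESIZE);
    // Release: our read of `data` is complete before the slot reads as empty.
    g_msg.filled.store(false, std::memory_order_release);
}

// Sends the integers 1..count through the slot and returns the sum received.
int exchange_messages(int count) {
    assert(count >= 0);
    int received_sum = 0;
    std::thread consumer([&] {
        for (int i = 0; i < count; ++i) {
            char buffer[MESSAGESIZE];
            get_message(buffer);
            int value = 0;
            std::memcpy(&value, buffer, sizeof(value));
            received_sum += value;
        }
    });
    for (int i = 1; i <= count; ++i) {
        char buffer[MESSAGESIZE] = {};
        std::memcpy(buffer, &i, sizeof(i));
        send_message(buffer);
    }
    consumer.join();
    return received_sum;
}
```

Both sides still spin while waiting, so this fixes correctness, not the cost of busy waiting.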
![Page 28: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/28.jpg)
Synchronization tips/costs:
Synchronization is moderately expensive when there is no contention
Hundreds to thousands of cycles
Synchronization can be arbitrarily expensive when there is contention!
Goals:
Synchronize rarely
Hold locks briefly
Minimize shared data
![Page 29: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/29.jpg)
Beware hidden synchronization:
Allocations are (generally) a synch point
Consider per-thread heaps with no locking
HEAP_NO_SERIALIZE flag avoids lock on Win32 heaps
Consider custom single-purpose allocators
Consider avoiding memory allocations!
Avoid synch in in-house profilers
D3DCREATE_MULTITHREADED causes synchronization on almost every Direct3D call
![Page 30: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/30.jpg)
Threading File I/O & Decompression
First: use large reads and asynchronous I/O
Then: consider compression to accelerate loading
Don't do format conversions etc. that are better done at build time!
Have resource proxies to allow rendering to continue
![Page 31: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/31.jpg)
File I/O Implementation Details
vector<Resource*> g_resources;
Worst design: decompressor locks g_resources while decompressing
Better design: decompressor adds resources to vector after decompressing
Still requires renderer to synch on every resource access
Best design: two Resource* vectors
Renderer has private vector, no locking required
Decompressor uses shared vector, syncs when adding new Resource*
Renderer moves Resource* from shared to private vector once per frame
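The two-vector design can be sketched as follows. This is a minimal sketch, assuming C++11 `std::mutex` and `std::thread`; `ResourceRegistry`, `add_loaded`, `collect_new`, and `load_and_collect` are hypothetical names rather than anything from the talk. The renderer takes the lock once per frame, just long enough to swap out the shared vector.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Resource { int id; };

class ResourceRegistry {
public:
    // Decompressor thread: publish a fully decompressed resource.
    void add_loaded(Resource* resource) {
        assert(resource != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        shared_.push_back(resource);
    }

    // Render thread, once per frame: move new resources to the private vector.
    void collect_new() {
        std::vector<Resource*> incoming;
        {
            // Hold the lock only for the swap, not for the append below.
            std::lock_guard<std::mutex> lock(mutex_);
            incoming.swap(shared_);
        }
        private_.insert(private_.end(), incoming.begin(), incoming.end());
    }

    // Render thread only: no locking required.
    const std::vector<Resource*>& render_list() const { return private_; }

private:
    std::mutex mutex_;
    std::vector<Resource*> shared_;   // protected by mutex_
    std::vector<Resource*> private_;  // owned by the render thread
};

// Simulates a decompressor thread feeding a render loop on the calling thread.
// Returns how many resources the renderer ended up with.
std::size_t load_and_collect(int resource_count) {
    const std::size_t total = static_cast<std::size_t>(resource_count);
    std::vector<Resource> storage(total);
    ResourceRegistry registry;
    std::thread decompressor([&] {
        for (std::size_t i = 0; i < total; ++i) {
            storage[i].id = static_cast<int>(i);
            registry.add_loaded(&storage[i]);
        }
    });
    // Each loop iteration stands in for one rendered frame.
    while (registry.render_list().size() < total) {
        registry.collect_new();
        std::this_thread::yield();
    }
    decompressor.join();
    return registry.render_list().size();
}
```

Because `render_list()` is touched only by the render thread, per-resource access during drawing involves no synchronization at all.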
![Page 32: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/32.jpg)
Profiling multi-threaded apps
Need thread-aware profilers
Profiling may hide many synchronization stalls
Home-grown spin locks make profiling harder
Consider instrumenting calls to synchronization functions
Don't use locks in instrumentation—use TLS variables to store results
Windows: Intel VTune, AMD CodeAnalyst, and the Visual Studio Team System Profiler
Xbox 360: PIX, XbPerfView, etc.
![Page 33: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/33.jpg)
PIX timing capture
![Page 34: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/34.jpg)
Naming Threads

    typedef struct tagTHREADNAME_INFO {
        DWORD dwType;      // must be 0x1000
        LPCSTR szName;     // pointer to name (in user addr space)
        DWORD dwThreadID;  // thread ID (-1=caller thread)
        DWORD dwFlags;     // reserved for future use, must be zero
    } THREADNAME_INFO;

    void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName) {
        THREADNAME_INFO info;
        info.dwType = 0x1000;
        info.szName = szThreadName;
        info.dwThreadID = dwThreadID;
        info.dwFlags = 0;

        __try {
            // The debugger intercepts this exception and reads the name from info.
            // The argument count is in ULONG_PTR units so it is also correct on 64-bit builds.
            RaiseException(0x406D1388, 0, sizeof(info) / sizeof(ULONG_PTR), (ULONG_PTR*)&info);
        } __except (EXCEPTION_CONTINUE_EXECUTION) {
        }
    }

    SetThreadName(-1, "Main thread");
![Page 35: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/35.jpg)
Other Ideas
Debugging tips for MT
Visual Studio does support multi-threaded debugging
Use threads window
Use @hwthread in watch window on Xbox 360
KD and WinDBG support multi-threaded debugging
Thread Local Storage (TLS)
__declspec(thread) declares per-thread variables
But doesn't work in dynamically loaded DLLs
TlsAlloc is less efficient, less convenient, but works in dynamically loaded DLLs
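Per-thread variables are what make lock-free instrumentation counters possible. The sketch below assumes the C++11 `thread_local` keyword as a portable counterpart to `__declspec(thread)`; `t_event_count`, `record_event`, and `count_events_on_new_thread` are hypothetical names.

```cpp
#include <cassert>
#include <thread>

// Each thread gets its own copy, initialized to zero when the thread starts.
thread_local int t_event_count = 0;

// No lock needed: a thread only ever touches its own counter.
void record_event() { ++t_event_count; }

// Records `events` events on a fresh thread and returns that thread's count.
int count_events_on_new_thread(int events) {
    assert(events >= 0);
    int result = 0;
    std::thread worker([&] {
        for (int i = 0; i < events; ++i) record_event();
        result = t_event_count;  // read the worker's own copy
    });
    worker.join();
    return result;
}
```

The calling thread's counter is unaffected by the worker, so results have to be gathered explicitly (here through `result`) when a thread finishes.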
![Page 36: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/36.jpg)
Windows tips
Avoid using D3DCREATE_MULTITHREADED
It’s easy, it works, it’s really really slow
Best to do all calls to Direct3D from a single thread
Could pass off locked resource pointers to a queue for loading threads to work with
Test on multiple machines and configurations
Single-core, SMT (i.e. Hyper-Threading), Dual-core, Intel and AMD chips, Multi-socket multicore (4+ cores)
![Page 37: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/37.jpg)
Windows API features
WaitForMultipleObjects
Obviously better than a series of WaitForSingleObject calls
The OS is highly optimized around multithreading and event-based blocking
I/O Completion Ports
Very efficient way to have the OS assign a pool of worker threads to incoming I/O requests
Useful construct for implementing a game server
![Page 38: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/38.jpg)
SMT versus Multicore
OS returns number of logical processors in GetSystemInfo(), so a 2 could mean an SMT machine with only 1 actual core, or 2 cores
Detailed Win32 APIs exposing this distinction not available until Windows XP x64, Windows Server 2003 SP1, Windows Vista, etc.
GetLogicalProcessorInformation()
For now you have to use CPUID detailed by Intel and AMD to parse this out…
![Page 39: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/39.jpg)
Timing with Multiple Cores
RDTSC is not always synced between cores!
As your thread moves from core to core, results of RDTSC counter deltas may be nonsense
CPU frequency itself can change at run-time through speed step technologies
See Power Management APIs for more information
Best thing to do is use Win32 API QueryPerformanceCounter / QueryPerformanceFrequency
See DirectX SDK article Game Timing and Multiple Cores
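The same idea, timing through an OS-provided monotonic counter instead of reading RDTSC directly, can be sketched in portable C++. This is a sketch only, assuming C++11 `std::chrono::steady_clock` as a stand-in for the QueryPerformanceCounter / QueryPerformanceFrequency pair; `elapsed_ms` and `time_sleep_ms` are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;  // monotonic: never goes backwards

// Convert a pair of time points to elapsed milliseconds.
double elapsed_ms(Clock::time_point start, Clock::time_point end) {
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Time a sleep of `sleep_ms` milliseconds and return the measured duration.
double time_sleep_ms(int sleep_ms) {
    assert(sleep_ms >= 0);
    const Clock::time_point start = Clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms));
    const Clock::time_point end = Clock::now();
    return elapsed_ms(start, end);
}
```

Deltas stay meaningful even if the scheduler moves the thread to another core between the two `now()` calls, which is the property raw RDTSC deltas lack.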
![Page 40: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/40.jpg)
Thread Micromanagement
Use SetThreadAffinityMask with caution!
May be useful for assigning ‘heavy’ work threads
This mask is technically a hint, not a commitment
RDTSC-based instrumenting will require locking the game threads to a single core
Otherwise let the Windows scheduler do the right thing
CreateDevice/Reset might have a side-effect on the calling thread’s affinity with software vertex processing enabled
![Page 41: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/41.jpg)
Thread Micromanagement (cont)
Be careful about boosting thread priority
If the priority is too high, you could cause the system to hang and become unresponsive
If the priority is too low, the thread may starve
![Page 42: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/42.jpg)
DLLs and Multithreading
DllMain for every DLL is informed of thread creation/destruction
For some DLLs this is required to initialize TLS
For many this is a waste of time, so call DisableThreadLibraryCalls() from your DllMain during process creation (DLL_PROCESS_ATTACH)
The OS serializes access to the entry point
This means threads created during DllMain won’t start for a while, so don’t wait on them in the DLL startup
![Page 43: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/43.jpg)
Resources
Multithreading Applications in Win32, Jim Beveridge & Robert Wiener, Addison-Wesley, 1997
Multiprocessor Considerations for Kernel-Mode Drivers
http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc
Determining Logical Processors per Physical Processor
http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/knowledgebase/43842.htm
GetLogicalProcessorInformation
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/getlogicalprocessorinformation.asp
Double checked locking
http://en.wikipedia.org/wiki/Double-checked_locking
![Page 44: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/44.jpg)
Resources
GDC 2006 Presentations
http://msdn.com/directx/presentations
DirectX Developer Center
http://msdn.com/directx
XNA Developer Center
http://msdn.com/xna
Xbox Developer Center (Registered Devs Only)
https://xds.xbox.com
XNA, DirectX, XACT Forums
http://msdn.com/directx/forums
Email [email protected] (DirectX Feedback)
[email protected] (Xbox Developers Only)
[email protected] (XNA Feedback)
![Page 45: Coding for multiple cores](https://reader036.vdocuments.mx/reader036/viewer/2022062320/55a502051a28abc1248b458f/html5/thumbnails/45.jpg)
© 2006 Microsoft Corporation. All rights reserved.
Microsoft, DirectX, Xbox 360, the Xbox logo, and XNA are either registered trademarks or trademarks of Microsoft Corporation in the United States and / or other countries.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.