11/6/11
Multicore Strategies for Games
Prof. Aaron Lanterman
School of Electrical and Computer Engineering
Georgia Institute of Technology
2
Bad multithreading
Thread 1
Thread 2
Thread 3
Thread 4
Thread 5
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
3
Good multithreading
[Diagram labels: Main Thread; Game Thread; Rendering Thread; Physics; Animation/Skinning; Particle Systems; Networking; File I/O]
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
4
Another paradigm: cascades
Thread 1: Input
Thread 2: Physics
Thread 3: AI
Thread 4: Rendering
Thread 5: Present
• Advantages:
– Synchronization points are few and well-defined
• Disadvantages:
– Increases latency (for constant frame rate)
– Needs simple (one-way) data flow
– For balance, each chunk needs to take a similar amount of time
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
5
Typical task: file decompression
• Most common CPU-heavy thread on the Xbox 360
• Easy to multithread
• Allows use of aggressive compression to improve load times
• Don’t throw a thread at a problem better solved by offline processing
– Texture compression, file packing, etc.
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
6
Typical task: rendering
• Separate update and render threads
• Rendering on multiple threads usually works poorly
– GPU can have trouble if multiple threads try to talk to it at once (Xbox 360 command buffers are supposed to be OK)
• Special case of cascades paradigm
– Pass render state from update to render
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
7
Separate rendering thread
Update Thread
Buffer 1
Render Thread
Buffer 0
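A minimal sketch of the two-buffer handoff in the diagram, written in standard C++ rather than taken from the slides. The names `RenderState`, `DoubleBuffer`, and `run_frames` are hypothetical, and the blocking policy (update waits until the renderer has finished the previous frame) is an assumption; a real engine may prefer to drop or interpolate frames instead.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct RenderState { int frame = 0; };  // hypothetical per-frame render description

class DoubleBuffer {
public:
    // Update thread only: the buffer currently being filled.
    RenderState& write_buffer() { return buffers_[write_index_]; }

    // Update thread: hand the filled buffer to the renderer and switch buffers.
    void publish() {
        std::unique_lock<std::mutex> lock(mutex_);
        // Wait until the renderer has taken and finished the previous frame,
        // so the buffer we are about to write next is no longer being read.
        changed_.wait(lock, [this] { return !frame_ready_ && !rendering_; });
        read_index_ = write_index_;
        write_index_ = 1 - write_index_;
        frame_ready_ = true;
        changed_.notify_all();
    }

    // Render thread: block until a frame is published, then read it lock-free.
    const RenderState& begin_render() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return frame_ready_; });
        frame_ready_ = false;
        rendering_ = true;
        return buffers_[read_index_];
    }

    // Render thread: release the buffer so the update thread may reuse it.
    void end_render() {
        std::lock_guard<std::mutex> lock(mutex_);
        rendering_ = false;
        changed_.notify_all();
    }

private:
    RenderState buffers_[2];
    int write_index_ = 0;
    int read_index_ = 0;
    bool frame_ready_ = false;
    bool rendering_ = false;
    std::mutex mutex_;
    std::condition_variable changed_;
};

// Runs frame_count frames; returns the frame numbers the render thread saw.
std::vector<int> run_frames(int frame_count) {
    assert(frame_count >= 0);
    DoubleBuffer buffers;
    std::vector<int> rendered;
    std::thread render([&] {
        for (int i = 0; i < frame_count; ++i) {
            const RenderState& state = buffers.begin_render();
            rendered.push_back(state.frame);
            buffers.end_render();
        }
    });
    for (int f = 0; f < frame_count; ++f) {  // calling thread acts as update thread
        buffers.write_buffer().frame = f;
        buffers.publish();
    }
    render.join();
    return rendered;
}
```

The only synchronization is at the frame boundary; while a frame is in flight, each thread touches its own buffer without taking a lock.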
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
8
Typical task: graphics fluff
• Extra graphics that don’t affect play
– Procedurally generated animating cloud textures
– Cloth simulations
– Procedurally generated vegetation, etc.
– Extra particles, better particle physics, etc.
• Can run at lower frame rate
• Easy to synchronize
• One game had one thread manipulating cloth, another thread handling cloth shadows
• On single-core machines, can drop or simplify the fluff without affecting gameplay
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
9
Typical tasks: physics?
• Could cascade from update to physics to rendering
– Makes use of three threads
– May be too much latency
• Could run physics on many threads
– Uses many threads while doing physics
– May leave threads mostly idle elsewhere
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
10
Careful with simultaneous multi-threading
• Not the same as double the number of cores
• Can give a small performance boost…
– …if first thread is underutilizing execution resources because of dependency stalls
• Can cause a performance drop
– Two threads may fight over L1 cache
• Can avoid scheduler latency
– A thread may be ready to run, but the OS waits for the current “scheduling quantum” to expire before running it
– Hardware threads can wake up faster; works well if you have a thread that mostly sleeps but needs to wake quickly on demand
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
11
How many threads?
• No more than one CPU-intensive software thread per core
– 3-6 on Xbox 360
– 1-? on PC (1-4 for now, need to query)
• Too many busy threads add complexity and lower performance
– Context switches are not free
• Can have many non-CPU-intensive threads
– I/O threads that block, or intermittent tasks
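The "need to query" step on PC could be sketched as follows in standard C++. The helper names are hypothetical and not from the slides. Note that `std::thread::hardware_concurrency()` is only a hint: it counts hardware threads, so on SMT machines it can exceed the number of physical cores, and it may return 0 when the count is unknown.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: how many CPU-heavy threads to start, given the number
// of hardware threads and how many are already taken (e.g. by the main thread).
unsigned busy_thread_budget(unsigned hardware_threads, unsigned reserved) {
    if (hardware_threads == 0) hardware_threads = 1;  // count unknown: assume one
    unsigned budget =
        hardware_threads > reserved ? hardware_threads - reserved : 1;
    assert(budget >= 1);  // always allow at least one busy thread
    return budget;
}

// Same calculation using the value reported by the standard library.
unsigned busy_thread_budget_for_this_machine(unsigned reserved = 1) {
    return busy_thread_budget(std::thread::hardware_concurrency(), reserved);
}
```

Blocking I/O threads and intermittent tasks do not count against this budget, in line with the last bullet above.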
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
12
Rare’s Kameo
Screenshots from www.rareware.com
13
Case study: Kameo (1)
• Started out as single threaded
– Was going to be an original Xbox game, but it was decided to make it an Xbox 360 launch title
• CPU usage split was 51/49 for update/render, so rendering was put on separate thread
– Two render-description buffers created to communicate from update to render
– Linear read/write access for best cache usage
– Doesn’t copy const data
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
14
Case study: Kameo (2)
• Decompression thread:
– Saved space on DVD and improved load times
– Cost was some spare CPU cycles
• Actually two threads for file I/O
– One for reading and one for decompressing, because some calls can block for ~0.5s doing directory lookups
• Multithreading added about six months before launch - but it worked!
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
15
Case study: Kameo (3)
Core  Thread  Software threads
0     0       Game update
0     1       File I/O
1     0       Rendering
1     1       -
2     0       XAudio
2     1       File decompression
• Total usage was ~2.2-2.5 cores
Usage figures shown on the slide: 80-99%, 80-99%, 50%
Screenshot from www.rareware.com
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
16
Bizarre Creations’ Project Gotham Racing 3
See http://media.xbox360.gamespy.com/media/741/741362/vids_1.html for movie clips
Screenshot from projectgothamracing3.com/screenshots
17
Case Study: Project Gotham Racing 3
Core  Thread  Software threads
0     0       Update, physics, rendering, UI
0     1       Audio update, networking
1     0       Crowd update, texture decompression
1     1       Texture decompression
2     0       XAudio
2     1       -
• Total usage was ~2.0-3.0 cores
Screenshot from projectgothamracing3.com/screenshots
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
18
Available synchronization objects
• Critical sections (locks)
• Semaphores (alas not in XNA)
• Mutexes
• Don’t suspend threads
– Some games have used this for synchronization
– Can easily lead to deadlocks
– Interacts badly with Visual Studio debugger
Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
19
Synchronization tips/costs:
• Synchronization is moderately expensive when there is no contention
– Hundreds to thousands of cycles
• Synchronization can be arbitrarily expensive when there is contention!
• Goals:
– Synchronize rarely
– Hold locks briefly
– Minimize shared data
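One way to read the three goals together, as a sketch in standard C++ (the function name `sum_of_squares` and the workload are made up for illustration): each thread accumulates into a private local variable and takes the lock exactly once, at the end, to merge its result.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Sums i*i for i in [0, n) using thread_count threads.
long long sum_of_squares(int n, int thread_count) {
    assert(thread_count > 0);
    std::mutex total_mutex;
    long long total = 0;  // the only shared data
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&, t] {
            long long local = 0;  // private to this thread: no lock needed
            for (int i = t; i < n; i += thread_count)
                local += static_cast<long long>(i) * i;
            // Synchronize rarely and briefly: one short critical section.
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;
        });
    }
    for (std::thread& th : threads) th.join();
    return total;
}
```

Locking inside the inner loop would give the same answer, but every iteration would pay the synchronization cost and the threads would contend for the same lock.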
Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
20
Avoid “effective single-threading”
• Requiring exclusive access to a popular resource can make multi-threading a complex way of doing single-threading on multiple threads
• Want to use synchronization primitives to guarantee multiple threads won’t modify resources simultaneously, while designing so that they generally won't anyway.
Notes from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
21
Beware hidden synchronization
• Memory allocation (e.g., malloc in C)
– All sorts of ways to alleviate the problem
• File access
• Using D3DCREATE_MULTITHREADED if developing with unmanaged code
• “False sharing” - artefact of cache structure
– Performance issue, not a correctness issue
– Bruce Dawson, “Multicore Memory Coherence: The Hidden Perils of Sharing Data,” PowerPoint presentation
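A common way to sidestep false sharing is to give each thread's hot data its own cache line. The sketch below is an illustration in standard C++, not code from the cited presentation; `kCacheLine = 64` is an assumption that is typical for x86 and should be checked for the target hardware, and the names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumed line size; check your target
constexpr int kThreads = 4;

// alignas pads and aligns each counter so that no two share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

// Each thread increments only its own counter; returns the grand total.
long count_without_false_sharing(long per_thread) {
    assert(per_thread >= 0);
    PaddedCounter counters[kThreads];
    std::thread threads[kThreads];
    for (int t = 0; t < kThreads; ++t) {
        threads[t] = std::thread([&counters, t, per_thread] {
            for (long i = 0; i < per_thread; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    long total = 0;
    for (int t = 0; t < kThreads; ++t) {
        threads[t].join();
        total += counters[t].value.load();
    }
    return total;
}
```

Without the `alignas`, the four counters would likely sit in one cache line and the cores would keep invalidating each other's copy. The result would still be correct, only slower, which is the point of the bullet above.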
Information from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
22
Things to avoid
• Threads terminating other threads
– Can’t do it on Xbox 360, discouraged on Windows
• Mutexes
– Aren’t as fast as critical section locks
Information from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
23
Lockless programming
• Spin locks
– Write-release/read-acquire semantics
• Interlocked instructions
• Difficult to get right:
– Very hard for native C++ Xbox 360 coding
– .NET makes some of this easier
• Bruce Dawson, “Lockless Programming Considerations for Xbox 360 and Microsoft Windows”
– msdn2.microsoft.com/en-us/library/bb310595.aspx
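To make "write-release/read-acquire" concrete, here is a minimal spin lock sketch. It uses standard C++ `std::atomic` as a stand-in for the platform Interlocked instructions the slide refers to; that substitution, and the names `SpinLock` and `count_with_spin_lock`, are assumptions for illustration. Platform-specific details for Xbox 360 are covered in the article cited above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

class SpinLock {
public:
    void lock() {
        // Acquire: later reads/writes cannot move above a successful exchange.
        while (locked_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();  // be polite if the lock is held
    }
    void unlock() {
        // Release: earlier writes become visible before the lock reads as free.
        locked_.store(false, std::memory_order_release);
    }
private:
    std::atomic<bool> locked_{false};
};

// Increments a plain (non-atomic) counter under the spin lock.
long count_with_spin_lock(int thread_count, long per_thread) {
    assert(thread_count > 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                std::lock_guard<SpinLock> guard(lock);  // lock()/unlock() via RAII
                ++counter;
            }
        });
    }
    for (std::thread& th : threads) th.join();
    return counter;
}
```

Spinning only pays off when the lock is held for a very short time; otherwise a blocking lock is usually the better choice.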
Information from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
24
What about OpenMP?
• Industry tends to shy away from OpenMP and similar solutions
• Prefers more direct control
#pragma omp parallel default(none) shared(n,x,y) private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        x[i] += y[i];
}
(Example from somewhere on web; can’t remember where)
25
XNA specific notes (1)
• GraphicsDevice is somewhat thread-safe
– Cannot render from more than one thread at a time
– Can create resources and SetData while another thread renders
• ContentManager is not thread-safe
– OK to have multiple instances, but only one per thread
• Input is not threadable
– Windows games must read input on the main game thread
• Audio and networking are thread-safe
Slide from Shawn Hargreaves, “Understanding XNA Framework Performance”
26
XNA specific notes (2)
• Catalin’s suggestion: Keep rendering on main thread (Thread 1 on Xbox 360)
– Game class does some behind-the-scenes graphics stuff
Great article: Catalin Zima, “Multi-threading for your XNA Game,” http://www.ziggyware.com/readarticle.php?article_id=221
27
Common mistake
• Creating a new thread on every iteration of the game loop
– Creating and releasing threads has a lot of overhead…
– …especially if you are running in Visual Studio (i.e. “in the debugger”)…
– …and especially if you are running on the Xbox 360 from Visual Studio
• Better to create the threads you need at the beginning
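The "create once, reuse every frame" advice might look like the sketch below. It is written in standard C++ rather than XNA/C#, and `FrameWorker` and `run_game_loop` are hypothetical names: the thread is started in the constructor, sleeps on a condition variable between frames, and is joined in the destructor.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

class FrameWorker {
public:
    explicit FrameWorker(std::function<void(int)> job)
        : job_(std::move(job)), thread_([this] { run(); }) {}

    ~FrameWorker() {
        { std::lock_guard<std::mutex> lock(mutex_); quit_ = true; }
        wake_.notify_all();
        thread_.join();
    }

    // Game loop: kick off this frame's work on the already-running thread.
    void start_frame(int frame) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!pending_);  // the previous frame must have been waited for
            frame_ = frame;
            pending_ = true;
        }
        wake_.notify_all();
    }

    // Game loop: block until the worker has finished the current frame.
    void wait_frame() {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return !pending_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return pending_ || quit_; });
            if (quit_) return;
            int frame = frame_;
            lock.unlock();
            job_(frame);  // do the work without holding the lock
            lock.lock();
            pending_ = false;
            wake_.notify_all();
        }
    }

    std::function<void(int)> job_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool pending_ = false;
    bool quit_ = false;
    int frame_ = 0;
    std::thread thread_;  // declared last so it starts after the other members
};

// One worker thread serves every frame; returns the sum of frame numbers seen.
int run_game_loop(int frame_count) {
    int work_done = 0;
    FrameWorker worker([&work_done](int frame) { work_done += frame; });
    for (int f = 1; f <= frame_count; ++f) {
        worker.start_frame(f);
        // ... the main thread would do its own per-frame work here ...
        worker.wait_frame();
    }
    return work_done;
}
```

Only one thread is ever created, however many frames run; the per-frame cost is a wake-up and a wait rather than thread creation and teardown.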
28
Take a step back…
• Always ask: “should I be doing this on the CPU at all?”
• GPU has ridiculous amounts of computing power
– Look for tasks with a high ratio of computation to CPU-GPU communication
• HLSL is HLSL whether you’re using managed or unmanaged code on the CPU