11/6/11

Multicore Strategies for Games

Prof. Aaron Lanterman

School of Electrical and Computer Engineering, Georgia Institute of Technology

Bad multithreading

[Diagram: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5]

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Good multithreading

[Diagram labels: Main Thread, Game Thread, Rendering Thread, Physics, Animation/Skinning, Particle Systems, Networking, File I/O]

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Another paradigm: cascades

[Diagram: Thread 1: Input, Thread 2: Physics, Thread 3: AI, Thread 4: Rendering, Thread 5: Present]

•  Advantages:
   –  Synchronization points are few and well-defined
•  Disadvantages:
   –  Increases latency (for constant frame rate)
   –  Needs simple (one-way) data flow
   –  For balance, each chunk needs to take a similar amount of time

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Typical task: file decompression

•  Most common CPU-heavy thread on the Xbox 360
•  Easy to multithread
•  Allows use of aggressive compression to improve load times
•  Don’t throw a thread at a problem better solved by offline processing
   –  Texture compression, file packing, etc.

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Typical task: rendering

•  Separate update and render threads
•  Rendering on multiple threads usually works poorly
   –  GPU can have trouble if multiple threads try to talk to it at once (Xbox 360 command buffers are supposed to be OK)
•  Special case of cascades paradigm
   –  Pass render state from update to render

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Separate rendering thread

[Diagram: Update Thread and Render Thread, with Buffer 0 and Buffer 1 between them]

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
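The two-buffer handoff in this diagram could be sketched as follows. This is only an illustration in standard C++, not the Xbox 360 or XNA API: `RenderState`, `DoubleBuffer`, and `run_frames` are hypothetical names, and a real render description would hold far more than a frame number. The update thread fills one buffer while the render thread reads the other, and the two threads only synchronize when they trade buffers.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a render description.
struct RenderState { int frame = -1; };

class DoubleBuffer {
public:
    // Update thread: wait until the slot for `frame` is no longer being rendered.
    RenderState& begin_write(int frame) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return consumed_ >= frame - 1; });
        return slots_[frame % 2];
    }
    // Update thread: publish the slot that was just filled.
    void end_write() {
        { std::lock_guard<std::mutex> lock(m_); ++produced_; }
        cv_.notify_all();
    }
    // Render thread: wait until `frame` has been published.
    const RenderState& begin_read(int frame) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return produced_ > frame; });
        return slots_[frame % 2];
    }
    // Render thread: hand the slot back to the update thread.
    void end_read() {
        { std::lock_guard<std::mutex> lock(m_); ++consumed_; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int produced_ = 0;  // frames fully written
    int consumed_ = 0;  // frames fully rendered
    RenderState slots_[2];
};

// Runs an update thread and a render thread for `frame_count` frames and
// returns the frame numbers in the order the render thread saw them.
std::vector<int> run_frames(int frame_count) {
    assert(frame_count >= 0);
    DoubleBuffer exchange;
    std::vector<int> rendered;
    std::thread update([&] {
        for (int f = 0; f < frame_count; ++f) {
            RenderState& state = exchange.begin_write(f);
            state.frame = f;  // stand-in for building the render description
            exchange.end_write();
        }
    });
    std::thread render([&] {
        for (int f = 0; f < frame_count; ++f) {
            const RenderState& state = exchange.begin_read(f);
            rendered.push_back(state.frame);  // stand-in for issuing draw calls
            exchange.end_read();
        }
    });
    update.join();
    render.join();
    return rendered;
}
```

Because the update thread may run at most one frame ahead, this arrangement adds up to a frame of latency, which is the cascade trade-off noted earlier.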

Typical task: graphics fluff

•  Extra graphics that don’t affect play
   –  Procedurally generated animating cloud textures
   –  Cloth simulations
   –  Procedurally generated vegetation, etc.
   –  Extra particles, better particle physics, etc.
•  Can run at lower frame rate
•  Easy to synchronize
•  One game had one thread manipulating cloth, another thread handling cloth shadows
•  On single-core machines, can drop or simplify the fluff without affecting gameplay

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Typical tasks: physics?

•  Could cascade from update to physics to rendering
   –  Makes use of three threads
   –  May be too much latency
•  Could run physics on many threads
   –  Uses many threads while doing physics
   –  May leave threads mostly idle elsewhere

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Careful with simultaneous multi-threading

•  Not the same as doubling the number of cores
•  Can give a small performance boost…
   –  …if the first thread is underutilizing execution resources because of dependency stalls
•  Can cause a performance drop
   –  Two threads may fight over the L1 cache
•  Can avoid scheduler latency
   –  A software thread may be ready to run, but the OS waits for the current “scheduling quantum” to expire before running it
   –  Hardware threads can wake up faster; works well if you have a thread that mostly sleeps but needs to wake quickly on demand

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

How many threads?

•  No more than one CPU-intensive software thread per core
   –  3-6 on Xbox 360
   –  1-? on PC (1-4 for now, need to query)
•  Too many busy threads add complexity and lower performance
   –  Context switches are not free
•  Can have many non-CPU-intensive threads
   –  I/O threads that block, or intermittent tasks

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
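On PC the thread count has to be queried at run time. A minimal sketch in standard C++ might look like the following; `busy_thread_budget` and `detect_busy_thread_budget` are hypothetical helper names. Note that `std::thread::hardware_concurrency()` reports logical processors (hardware threads), so the number of hardware threads per core is passed in as an assumption, since standard C++ offers no portable way to ask for it.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: budget one CPU-intensive thread per physical core.
// `logical_cpus` is the count the OS reports; `hw_threads_per_core` is an
// assumption supplied by the caller (for example 2 on an SMT machine).
unsigned busy_thread_budget(unsigned logical_cpus, unsigned hw_threads_per_core) {
    assert(hw_threads_per_core >= 1);
    if (logical_cpus == 0) return 1;  // count unknown: fall back to one busy thread
    unsigned cores = logical_cpus / hw_threads_per_core;
    return cores > 0 ? cores : 1;
}

// Query the machine at startup instead of hard-coding a thread count.
unsigned detect_busy_thread_budget(unsigned hw_threads_per_core) {
    // hardware_concurrency() may return 0 when the value cannot be determined.
    return busy_thread_budget(std::thread::hardware_concurrency(),
                              hw_threads_per_core);
}
```

For example, six hardware threads at two per core gives a budget of three busy threads, the lower end of the Xbox 360 range on the slide. Blocking I/O threads and intermittent tasks would not count against this budget.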

Rare’s Kameo

Screenshots from www.rareware.com

Case study: Kameo (1)

•  Started out as single-threaded
   –  Was going to be an original Xbox game, but the decision was made to make it an Xbox 360 launch title
•  CPU usage split was 51/49 for update/render, so rendering was put on a separate thread
   –  Two render-description buffers created to communicate from update to render
   –  Linear read/write access for best cache usage
   –  Doesn’t copy const data

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Case study: Kameo (2)

•  Decompression thread:
   –  Saved space on DVD and improved load times
   –  Cost was some spare CPU cycles
•  Actually two threads for file I/O
   –  One for reading and one for decompressing, because some calls can block for ~0.5s doing directory lookups
•  Multithreading added about six months before launch - but it worked!

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Case study: Kameo (3)

Core  Thread  Software threads
0     0       Game update
0     1       File I/O
1     0       Rendering
1     1       (none)
2     0       XAudio
2     1       File decompression

•  Total usage was ~2.2-2.5 cores

Usage figures shown on the slide: 80-99%, 80-99%, 50%

Screenshot from www.rareware.com

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Bizarre Creations’ Project Gotham Racing 3

See http://media.xbox360.gamespy.com/media/741/741362/vids_1.html for movie clips

Screenshot from projectgothamracing3.com/screenshots

Case Study: Project Gotham Racing 3

Core  Thread  Software threads
0     0       Update, physics, rendering, UI
0     1       Audio update, networking
1     0       Crowd update, texture decompression
1     1       Texture decompression
2     0       XAudio
2     1       (none)

•  Total usage was ~2.0-3.0 cores

Screenshot from projectgothamracing3.com/screenshots

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Available synchronization objects

•  Critical sections (locks)
•  Semaphores (alas not in XNA)
•  Mutexes
•  Don’t suspend threads
   –  Some games have used this for synchronization
   –  Can easily lead to deadlocks
   –  Interacts badly with Visual Studio debugger

Slide adapted from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Synchronization tips/costs:

•  Synchronization is moderately expensive when there is no contention
   –  Hundreds to thousands of cycles
•  Synchronization can be arbitrarily expensive when there is contention!
•  Goals:
   –  Synchronize rarely
   –  Hold locks briefly
   –  Minimize shared data

Slide from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
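One way to read the three goals together: do the heavy work on thread-private data and take the lock once, briefly, to publish the result. The sketch below illustrates that pattern with a parallel sum in standard C++; `parallel_sum` is a hypothetical name and the workload is only a stand-in.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Sums `values` using `thread_count` threads. Each thread accumulates into a
// private local variable and locks the shared total exactly once.
long long parallel_sum(const std::vector<int>& values, unsigned thread_count) {
    assert(thread_count >= 1);
    long long total = 0;     // the only shared, mutable data
    std::mutex total_mutex;  // guards `total`
    std::vector<std::thread> workers;
    const std::size_t chunk = (values.size() + thread_count - 1) / thread_count;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(values.size(), t * chunk);
        const std::size_t end = std::min(values.size(), begin + chunk);
        workers.emplace_back([&, begin, end] {
            // Unsynchronized work on thread-private data.
            const long long local = std::accumulate(
                values.begin() + begin, values.begin() + end, 0LL);
            // Synchronize rarely, hold the lock briefly.
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Locking around every single addition would give the same answer, but each thread would then pay the synchronization cost once per element instead of once per chunk.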

Avoid “effective single-threading”

•  Requiring exclusive access to a popular resource can make multi-threading a complex way of doing single-threading on multiple threads

•  Want to use synchronization primitives to guarantee multiple threads won’t modify resources simultaneously, while designing so that they generally won't anyway.

Notes from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Beware hidden synchronization

•  Memory allocation (e.g., malloc in C)
   –  All sorts of ways to alleviate the problem
•  File access
•  Using D3DCREATE_MULTITHREADED if developing with unmanaged code
•  “False sharing” - artefact of cache structure
   –  Performance issue, not a correctness issue
   –  Bruce Dawson, “Multicore Memory Coherence: The Hidden Perils of Sharing Data,” PowerPoint presentation

Information from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
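False sharing arises when threads write to different variables that happen to live on the same cache line, so the line keeps bouncing between cores even though no data is logically shared. A common remedy is to pad per-thread data out to a full cache line. The sketch below assumes a 64-byte line (`kCacheLine` is an assumption; the right value depends on the target CPU), and `PaddedCounter` and `count_without_false_sharing` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; check the target hardware before relying on it.
constexpr std::size_t kCacheLine = 64;
constexpr int kThreads = 4;

// alignas pads and aligns each counter so it occupies its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should fill exactly one assumed cache line");

// Each thread increments only its own counter; the result is the same with
// or without padding, since false sharing affects speed, not correctness.
long count_without_false_sharing(long iterations) {
    assert(iterations >= 0);
    std::array<PaddedCounter, kThreads> counters{};
    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (long i = 0; i < iterations; ++i) ++counters[t].value;
        });
    }
    for (auto& w : workers) w.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Removing the `alignas` would pack the four counters next to each other in memory; the function would still return the same total, only potentially more slowly.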

Things to avoid

•  Threads terminating other threads
   –  Can’t do it on Xbox 360, discouraged on Windows
•  Mutexes
   –  Aren’t as fast as critical section locks

Information from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation

Lockless programming

•  Spin locks
   –  Write-release/read-acquire semantics
•  Interlocked instructions
•  Difficult to get right:
   –  Very hard for native C++ Xbox 360 coding
   –  .NET makes some of this easier
•  Bruce Dawson, “Lockless Programming Considerations for Xbox 360 and Microsoft Windows”
   –  msdn2.microsoft.com/en-us/library/bb310595.aspx

Information from Bruce Dawson & Chuck Walbourn, Microsoft Game Technology Group, “Coding for Multiple Cores,” PowerPoint presentation
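As a rough illustration of a spin lock with acquire/release semantics, here is a sketch using standard C++ atomics rather than the Win32 or Xbox 360 interlocked functions the slide has in mind. `SpinLock` and `spin_locked_count` are hypothetical names. The atomic test-and-set plays the role of an interlocked instruction; acquiring the lock uses acquire ordering and releasing it uses release ordering, so writes made inside the critical section are visible to the next thread that takes the lock.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class SpinLock {
public:
    void lock() {
        // Atomic read-modify-write with acquire ordering: later reads and
        // writes cannot be reordered before the lock is taken.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while spinning
        }
    }
    void unlock() {
        // Release ordering: earlier writes are published before the lock
        // is seen as free.
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Several threads increment a plain int protected only by the spin lock.
int spin_locked_count(int thread_count, int increments) {
    assert(thread_count >= 0 && increments >= 0);
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&lock, &counter, increments] {
            for (int i = 0; i < increments; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Spinning only makes sense when the lock is held for a very short time; for longer waits an OS lock that puts the thread to sleep is usually the better choice.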

What about OpenMP?

•  Industry tends to shy away from OpenMP and similar solutions

•  Prefers more direct control

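/* Fork a team of threads; n, x, and y are shared, i is private to each thread. */
#pragma omp parallel default(none) shared(n,x,y) private(i)
{
    /* Divide the loop iterations among the threads in the team. */
    #pragma omp for
    for (i = 0; i < n; i++)
        x[i] += y[i];
}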

(Example from somewhere on web; can’t remember where)

XNA specific notes (1)

•  GraphicsDevice is somewhat thread-safe
   –  Cannot render from more than one thread at a time
   –  Can create resources and SetData while another thread renders
•  ContentManager is not thread-safe
   –  OK to have multiple instances, but only one per thread
•  Input is not threadable
   –  Windows games must read input on the main game thread
•  Audio and networking are thread-safe

Slide from Shawn Hargreaves, “Understanding XNA Framework Performance”

XNA specific notes (2)

•  Catalin’s suggestion: Keep rendering on main thread (Thread 1 on Xbox 360)
   –  Game class does some behind-the-scenes graphics stuff

Great article: Catalin Zima, “Multi-threading for your XNA Game,” http://www.ziggyware.com/readarticle.php?article_id=221

Common mistake

•  Creating a new thread on every iteration of the game loop
   –  Creating and releasing threads has a lot of overhead…
   –  …especially if you are running in Visual Studio (i.e. “in the debugger”)…
   –  …and especially if you are running on the Xbox 360 from Visual Studio

•  Better to create the threads you need at the beginning
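A sketch of the alternative: create one worker thread before the game loop and hand it a job each frame. This is written in standard C++ rather than XNA's C#, and `FrameWorker` and `run_game_loop` are hypothetical names; the per-frame job here is a trivial stand-in for something like a particle update.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class FrameWorker {
public:
    FrameWorker() : thread_([this] { run(); }) {}
    ~FrameWorker() {
        { std::lock_guard<std::mutex> lock(m_); quit_ = true; }
        cv_.notify_all();
        thread_.join();
    }
    // Queue a job for the already-running worker thread.
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); ++pending_; }
        cv_.notify_all();
    }
    // Block until every submitted job has finished.
    void wait_idle() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return quit_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // quit requested, nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock
            { std::lock_guard<std::mutex> lock(m_); --pending_; }
            cv_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    int pending_ = 0;
    bool quit_ = false;
    std::thread thread_;  // declared last so it starts after the other members exist
};

// The worker is created once; each frame only queues work and waits for it.
int run_game_loop(int frames) {
    assert(frames >= 0);
    FrameWorker worker;
    int jobs_done = 0;
    for (int f = 0; f < frames; ++f) {
        worker.submit([&jobs_done] { ++jobs_done; });  // stand-in for per-frame work
        // ... the main thread would do its own update work here ...
        worker.wait_idle();  // end-of-frame synchronization point
    }
    return jobs_done;
}
```

The thread creation cost is paid once at startup; each frame pays only for a lock and a condition-variable wake-up.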

Take a step back…

•  Always ask: “should I be doing this on the CPU at all?”
•  GPU has ridiculous amounts of computing power
   –  Look for tasks with a high ratio of compute to CPU-GPU communication
•  HLSL is HLSL whether you’re using managed or unmanaged code on the CPU