Task Management for Irregular-Parallel Workloads on the GPU

Stanley Tzeng, Anjul Patney, and John D. Owens
University of California, Davis


Page 1: Task Management for Irregular-Parallel Workloads on the GPU

Stanley Tzeng, Anjul Patney, and John D. Owens
University of California, Davis

Page 2: Introduction – Parallelism in Graphics Hardware

[Figure: two "Input Queue → Process → Output Queue" diagrams contrasting the two workload types, with unprocessed and processed work units flowing through the queues.]

• Regular workloads: good matching of input size to output size (vertex processor, fragment processor, etc.).

• Irregular workloads: output size is hard to estimate given the input (geometry processor, recursion).

Page 3: Motivation – Programmable Pipelines

• Increased programmability allows different programmable pipelines to run on the GPU.

• We want to explore how pipelines can be efficiently mapped onto the GPU.

  • What if your pipeline has irregular stages?

  • How should data between pipeline stages be stored?

  • What about load balancing across all parallel units?

  • What if your pipeline is geared more towards task parallelism than data parallelism?

Our paper addresses these issues!

Page 4: In Other Words…

• Imagine that these pipeline stages were actually bricks.

• Then we are providing the mortar between the bricks.

[Figure: pipeline stages drawn as bricks; our system is the mortar between them.]

Page 5: Related Work

• Alternative pipelines on the GPU:
  • RenderAnts [Zhou et al. 2009]
  • FreePipe [Liu et al. 2009]
  • OptiX [NVIDIA 2010]

• Distributed queuing on the GPU:
  • GPU dynamic load balancing [Cederman et al. 2008]
  • Multi-CPU work

• Reyes on the GPU:
  • Subdivision [Patney et al. 2008]
  • DiagSplit [Fisher et al. 2009]
  • Micropolygon rasterization [Fatahalian et al. 2009]

Page 6: Ingredients for Mortar

Questions that we need to address:

• What is the proper granularity for tasks? → Warp-size work granularity

• How to avoid global synchronizations? → Uberkernels

• How many threads to launch? → Persistent threads

• How to distribute tasks evenly? → Task donation

Page 7: Warp-Size Work Granularity

• Problem: We want to emulate task-level parallelism on the GPU without a loss in efficiency.

• Solution: We choose a block size of 32 threads per block (one warp).
  • Removes messy synchronization barriers.
  • We can view each block as a MIMD thread. We call these blocks processors.

[Figure: several processors (P) mapped onto each physical processor.]
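As a rough illustration of this granularity, a minimal CUDA sketch (not the paper's code; taskKernel and the 8-blocks-per-SM figure are assumptions for the example):

    // Minimal sketch: one 32-thread block (a single warp) per "processor".
    // A one-warp block runs in lockstep, so no __syncthreads() is needed
    // between steps of the work it does.
    __global__ void taskKernel(int *work, int n) {
        int lane = threadIdx.x;          // lane within this processor
        int proc = blockIdx.x;           // this block's processor id
        int i = proc * 32 + lane;
        if (work != nullptr && i < n)
            work[i] += 1;                // placeholder for real work
    }

    int main() {
        int numSMs = 0;
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
        int numProcessors = numSMs * 8;  // assumed: 8 blocks per SM fills the GPU
        taskKernel<<<numProcessors, 32>>>(nullptr, 0);
        cudaDeviceSynchronize();
        return 0;
    }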

Page 8: Uberkernel Processor Utilization

• Problem: We want to eliminate global kernel barriers for better processor utilization.

• Uberkernels pack multiple execution routes into one kernel.

[Figure: two separate kernels (Pipeline Stage 1 → Pipeline Stage 2, with data flowing between them) folded into a single uberkernel that contains both stages.]
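A minimal sketch of the idea (illustrative only; the task layout and stage names are assumptions, not the paper's API):

    // Uberkernel sketch: both pipeline stages live in one kernel, and a
    // per-task tag picks the execution route, so no global kernel barrier
    // separates the stages.
    enum Stage { STAGE1, STAGE2 };

    struct PipeTask { Stage stage; int payload; };

    __device__ void runStage1(PipeTask &t) { /* ... */ t.stage = STAGE2; }
    __device__ void runStage2(PipeTask &t) { /* ... */ }

    __device__ void uberkernelStep(PipeTask &t) {
        switch (t.stage) {                     // the uberkernel branch decision
            case STAGE1: runStage1(t); break;  // its output flows on to stage 2
            case STAGE2: runStage2(t); break;
        }
    }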

Page 9: Persistent Thread Scheduler Emulation

• Problem: If the input is irregular, how many threads do we launch?

• Solution: Launch enough to fill the GPU, and keep them alive so they keep fetching work.

Life of a thread: Spawn → Fetch Data → Process Data → Write Output → Death.

Life of a persistent thread: the same stages, but the thread loops from Write Output back to Fetch Data instead of dying.

How do we know when to stop? When there is no more work left.
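A persistent-thread main loop, as a minimal sketch (Queue, Task, dequeue, and process are hypothetical stand-ins, not the paper's code):

    // Each 32-thread processor loops, fetching work until the queue is
    // drained, and only then retires.
    struct Task  { int data; };
    struct Queue { Task *items; int head; int tail; };

    __device__ bool dequeue(Queue *q, Task *out) {
        int idx = atomicAdd(&q->head, 1);   // claim the next queue slot
        if (idx >= q->tail) return false;   // queue drained: no more work
        *out = q->items[idx];
        return true;
    }

    __device__ void process(const Task *t) { /* run the uberkernel on t */ }

    __global__ void persistentKernel(Queue *q) {
        __shared__ Task task;
        __shared__ int  haveWork;
        for (;;) {
            if (threadIdx.x == 0)           // one lane fetches for the warp
                haveWork = dequeue(q, &task);
            __syncthreads();                // trivial for a one-warp block
            if (!haveWork) return;          // no work left: the thread "dies"
            process(&task);
            __syncthreads();
        }
    }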

Page 10: Memory Management System

• Problem: We need to ensure that our processors are constantly working, not idle.

• Solution: Design a software memory management system.

• How each processor fetches work is based on our queuing strategy.

• We look at four strategies:
  • Block queues
  • Distributed queues
  • Task stealing
  • Task donation

Page 11: A Word About Locks

• To obtain exclusive access to a queue, each queue has a lock.

• Our current implementation uses spin locks, which are very slow on GPUs:

    while (atomicCAS(lock, 0, 1) == 1);

• We want to use as few locks as possible.
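For context, a slightly fuller sketch of how such a spin lock might be acquired and released by a 32-thread processor (a minimal sketch, assuming lock is a global int with 0 = free, 1 = held; not the paper's exact code):

    // Only lane 0 spins, so the warp doesn't hammer the lock 32 times over.
    __device__ void acquire(int *lock) {
        if (threadIdx.x == 0)
            while (atomicCAS(lock, 0, 1) == 1)  // spin until we flip 0 -> 1
                ;
        __syncthreads();                        // rest of the warp waits here
    }

    __device__ void release(int *lock) {
        __syncthreads();                        // all lanes done with the queue
        __threadfence();                        // make our writes visible first
        if (threadIdx.x == 0)
            atomicExch(lock, 0);                // hand the lock back
    }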

Page 12: Block Queuing

• One dequeue shared by all processors: read from one end, write back to the other.

[Figure: processors P1–P4 reading from one end of a single shared dequeue and writing to the other.]

Until the last element, fetching work from the queue requires no lock. Writing back to the queue is serial: each processor obtains the lock before writing.

+ Excellent load balancing
− Horrible lock contention
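A minimal sketch of the two ends (field names are illustrative; acquire/release are the spin-lock helpers sketched above):

    // Block queue: reads are lock-free via an atomic read cursor; writes
    // are serialized through the single lock -- the contention hotspot.
    struct BlockQueue { Task *items; int readHead; int writeTail; int lock; };

    __device__ bool fetch(BlockQueue *q, Task *out) {
        __shared__ int idx;
        if (threadIdx.x == 0)
            idx = atomicAdd(&q->readHead, 1);   // no lock needed to read
        __syncthreads();
        if (idx >= q->writeTail) return false;  // nothing left to fetch
        *out = q->items[idx];
        return true;
    }

    __device__ void writeBack(BlockQueue *q, const Task *t) {
        acquire(&q->lock);                      // every processor serializes here
        if (threadIdx.x == 0)                   // one lane appends for the warp
            q->items[q->writeTail++] = *t;
        release(&q->lock);
    }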

Page 13: Distributed Queuing

• Each processor has its own dequeue (called a bin) that it reads from and writes to.

[Figure: processors P1–P4, each with its own bin; a processor that has finished all its work can only idle.]

+ Eliminates locks
− Idle processors

Page 14: Task Stealing

• Same as the distributed queuing scheme, but now processors can steal work from another bin.

[Figure: processors P1–P4 with their bins; a processor that has finished all its work steals from a neighboring processor's bin.]

+ Very low idle time
− Big bin sizes
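A minimal stealing sketch (illustrative; it reuses the hypothetical fetch from the block-queue sketch, and a real implementation would guard the victim's bin with a lock or a lock-free deque):

    // Each processor owns bins[blockIdx.x]; when its own bin is empty it
    // sweeps the other bins and steals from the first non-empty one.
    __device__ bool fetchWithStealing(BlockQueue *bins, int numBins, Task *out) {
        if (fetch(&bins[blockIdx.x], out))       // try our own bin first
            return true;
        for (int i = 1; i < numBins; ++i) {
            int victim = (blockIdx.x + i) % numBins;
            if (fetch(&bins[victim], out))       // steal one task
                return true;
        }
        return false;                            // every bin is empty: retire
    }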

Page 15: Task Donation

• When a bin is full, its processor can give work to someone else.

[Figure: processors P1–P4 with their bins; a processor whose bin is full donates its work to someone else's bin.]

+ Smaller memory usage
− More complicated
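And a matching donation sketch (illustrative; capacity is an assumed parameter, the fullness check is only advisory, and the append still takes the target bin's lock inside writeBack):

    // When our own bin is full, push the new task into the first bin with
    // room instead of overflowing -- this is what bounds bin sizes.
    __device__ void donate(BlockQueue *bins, int numBins, int capacity,
                           const Task *t) {
        for (int i = 0; i < numBins; ++i) {
            int target = (blockIdx.x + i) % numBins;  // start with our own bin
            BlockQueue *b = &bins[target];
            if (b->writeTail - b->readHead < capacity) {  // room left?
                writeBack(b, t);                  // append under b's lock
                return;
            }
        }
        // All bins full: a real system must stall or grow the queues here.
    }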

Page 16: Evaluating the Queues

• Main measure for comparison: how many iterations each processor sits idle, whether due to lock contention or waiting for other processors to finish.

• We use a synthetic work generator to precisely control the conditions.

Page 17: Average Idle Iterations Per Processor

[Chart: average idle iterations per processor, split into lock contention and idle waiting, for the four schemes. The block queue idles on the order of 25,000 iterations, dominated by lock contention; the distributed queue, task stealing, and task donation all sit near 200, with about the same performance.]

Page 18: How it All Fits Together

[Figure: the persistent-thread lifecycle (Spawn → Fetch Data → Process Data → Write Output → Death), with Fetch Data and Write Output going through global memory.]

Page 19: Our Version

[Figure: our version of the lifecycle (Spawn → Fetch Data → Write Output → Death), with the processing step handled by uberkernels, Fetch Data and Write Output served by the task-donation memory system, and persistent threads keeping the whole loop alive.]

Page 20: Our Version

[Same figure as above.]

We spawn 32 threads per thread group / block in our grid. These are our processors.

Page 21: Our Version

[Same figure as above.]

Each processor grabs work to process.

Page 22: Our Version

[Same figure as above.]

The uberkernel decides how to process the current work unit.

Page 23: Our Version

[Same figure as above.]

Once the work is processed, thread execution returns to fetching more work.

Page 24: Our Version

[Same figure as above.]

When there is no work left in the queue, the threads retire.

Page 25: APPLICATION: REYES

Page 26: Pipeline Overview

Subdivision / Tessellation: start with smooth surfaces, obtain micropolygons.

[Figure: the Reyes pipeline (Scene → Subdivision / Tessellation → Shading → Rasterization / Sampling → Composition and Filtering → Image), with the current stage highlighted.]

Page 27: Pipeline Overview

Shading: shade the micropolygons.

[Same pipeline figure, with Shading highlighted.]

Page 28: Pipeline Overview

Rasterization / Sampling: map micropolygons to screen space.

[Same pipeline figure, with Rasterization / Sampling highlighted.]

Page 29: Pipeline Overview

Composition and Filtering: reconstruct pixels from the obtained samples.

[Same pipeline figure, with Composition and Filtering highlighted.]

Page 30: Pipeline Overview

[Same pipeline figure, annotated by workload regularity:]

• Subdivision / Tessellation: irregular input and output
• Shading: regular input and output
• Rasterization / Sampling: regular input, irregular output
• Composition and Filtering: irregular input, regular output

Page 31: Split and Dice

• We combine the patch split and dice stages into one kernel.

• Bins are loaded with the initial patches.

• One processor works on one patch at a time. A processor can write split patches back into the bins.

• Output is a buffer of micropolygons.

Page 32: Split and Dice

• 32 threads on 16 CPs: 16 threads each work in u and in v.

• Calculate the u and v thresholds, then take the uberkernel branch decision:
  • Branch 1 splits the patch again.
  • Branch 2 dices the patch into micropolygons.

[Figure: a patch is fetched from its bin; split patches are written back into the bin, while diced output goes to the µpoly buffer.]
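A minimal sketch of that branch decision (illustrative only; Patch, MicropolyBuffer, the estimate/split/dice helpers, and MAX_GRID are hypothetical stand-ins for the paper's split/dice logic):

    // Hypothetical types and helpers standing in for the real ones.
    struct BlockQueue;                           // bin type from earlier sketches
    struct Patch { /* control points, domain bounds */ };
    struct MicropolyBuffer { /* micropolygon output + atomic cursor */ };
    #define MAX_GRID 16                          // assumed dicing threshold

    __device__ int  estimateU(const Patch *p);   // measure needed rate in u
    __device__ int  estimateV(const Patch *p);   // measure needed rate in v
    __device__ void splitPatch(const Patch *p, Patch *a, Patch *b);
    __device__ void dice(const Patch *p, int u, int v, MicropolyBuffer *out);
    __device__ void writeBackPatch(BlockQueue *bin, const Patch *p);

    // One processor holds one patch: measure how finely it must be diced
    // along u and v, then either split again or dice into micropolygons.
    __device__ void splitOrDice(Patch *p, BlockQueue *bin, MicropolyBuffer *out) {
        int uSize = estimateU(p);
        int vSize = estimateV(p);
        if (uSize > MAX_GRID || vSize > MAX_GRID) {
            Patch a, b;                          // Branch 1: split the patch again
            splitPatch(p, &a, &b);
            writeBackPatch(bin, &a);             // halves go back into the bin
            writeBackPatch(bin, &b);
        } else {
            dice(p, uSize, vSize, out);          // Branch 2: emit micropolygons
        }
    }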

Page 33: Sampling

• Stamp out samples for each micropolygon; one processor per micropolygon patch.

• Since only the output is irregular, use a block queue.

• Write out to a sample buffer.

[Figure: processors P1–P4 read and stamp micropolygons from a µpolygon queue, reserve space with an atomic add on a shared counter, and write samples out to global memory.]

Page 34: Sampling

[Same content as the previous slide, plus a question:]

Q: Why are you using a block queue? Didn't you just show that block queues have high contention when processors write back into them?

A: True. However, we're not writing into this queue! We're just reading from it, so there is no write-back idle time.
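A minimal sketch of the write-out step (illustrative; the Sample layout and helper name are assumptions):

    // Processors stamp samples into a local buffer, then one atomicAdd on
    // a global counter reserves space in the sample buffer, and the 32
    // lanes copy the samples out cooperatively.
    struct Sample { float x, y, depth; };

    __device__ void emitSamples(Sample *sampleBuf, int *counter,
                                const Sample *stamped, int count) {
        __shared__ int base;
        if (threadIdx.x == 0)
            base = atomicAdd(counter, count);   // reserve count slots
        __syncthreads();
        for (int i = threadIdx.x; i < count; i += 32)
            sampleBuf[base + i] = stamped[i];   // cooperative write-out
    }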

Page 35: Smooth Surfaces, High Detail

16 samples per pixel, at more than 15 frames per second on a GeForce GTX 280.

Page 36: What's Next

• What other (and better) abstractions are there for programmable pipelines?

• How will future GPU designs affect software schedulers?

• For Reyes: what is the right model for real-time micropolygon shading on the GPU?

Page 37: Acknowledgments

• Matt Pharr, Aaron Lefohn, and Mike Houston
• Jonathan Ragan-Kelley
• Shubho Sengupta
• Bay Raitt and Headus Inc. for models
• National Science Foundation
• SciDAC
• NVIDIA Graduate Fellowship

Page 38: Thank You