study: persistent threads style programming model for gpu...

36
A Study of Persistent Threads Style Programming Model for GPU Computing Kshitij Gupta [/shi/ /tij/] UC Davis GTC 2012 | San Jose

Upload: others

Post on 20-May-2020

17 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

A Study of Persistent Threads Style Programming Model for GPU Computing

Kshitij Gupta [/shi/ /tij/]

UC Davis GTC 2012 | San Jose

Page 2: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

A Study of Persistent Threads Style GPU Programming for GPGPU Workloads

Kshitij Gupta, Jeff A. Stuart, John D. Owens

UC Davis InPar 2012 | San Jose

Page 3: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Outline

GPGPU Programming (“nonPT”)

Limitations

Introduction to “PT”

Use Cases

Observations/Discussion

Page 4: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

* We will use CUDA terminology, but the same† discussion can be extended to OpenCL

NOTE: (CUDA | OpenCL) Terminology

CUDA OpenCL

Thread Work item

Warp --

Thread block Work group

Grid Index space

Local memory Private memory

Shared memory Local memory

Global memory Global memory

Scalar core Processing element

Multi-processor (SM) Compute unit

Page 5: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: GPGPU Programming Hierarchy

DR

AM

GPU

SM_3 SM_2 SM_1 SM_0

Page 6: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: GPGPU Programming Hierarchy

DR

AM

Warp

P0_0 P1_0 P7_0 P0_X P1_X P7_X SIMT

SM0

P0 P1 P7

Virtualize

SIMD

Page 7: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: GPGPU Programming Hierarchy

DR

AM

Warp

P0_0 P1_0 P7_0 P0_X P1_X P7_X SIMT

SM0

P0 P1 P7

Block

W0 W1 WN

Virtualize VirtualizE

Page 8: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: GPGPU Programming Hierarchy

DR

AM

Warp

P0_0 P1_0 P7_0 P0_X P1_X P7_X SIMT

SM0

P0 P1 P7

Block

W0 W1 WN

B0 B1 BM (a) SPMD

Virtualize VirtualizE ViRtUaLiZe

Page 9: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: GPGPU Programming Hierarchy

DR

AM

Warp

P0_0 P1_0 P7_0 P0_X P1_X P7_X SIMT

SM0

P0 P1 P7

Block

W0 W1 WN

(a) SPMD

(b)

B0 B1 BM

Virtualize VirtualizE ViRtUaLiZe VIRTUALIZE!!

Page 10: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: Workload Evolution

Unified Shader Core

ibuff

obuff

ibuff

obuff

ibuff

obuff

Shader A

Shader B

Shader C

(a) Pre-2006: Discrete cores

Core C

Core B

Core A

ibuff

obuff

ibuff

obuff

ibuff

obuff

(b) 2006: Stream programming – CUDA architecture with unified cores; along

with ‘C for CUDA’

Page 11: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: Workload Evolution

Unified Shader Core

ibuff

obuff

ibuff

obuff

ibuff

obuff

Kernel A

Kernel B

Kernel C

(i)

(ii)

(iii)

(a) Pre-2006: Discrete cores

Core C

Core B

Core A

ibuff

obuff

ibuff

obuff

ibuff

obuff

(b) 2006: Stream programming – CUDA architecture with unified cores; along

with ‘C for CUDA’

(c) Today: A sample of irregular workload patterns

ibuff

obuff obuff

ibuff

obuff

ibuff

ibuff

obuff

sync

sync

Page 12: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Core GPGPU Programming Characteristics (& Limitations)

1. Host-Device Interface Master-slave processing

Kernel size

2. Device-side Properties Lifetime of a Block

Hardware Scheduler

Block State

3. Memory Consistency Intra-block

Inter-block

4. Kernel Invocations Producer-consumer

Spawning kernels

Irregular workloads

Page 13: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Preliminaries: Example – Image Processing (outside Pixar HQ)

Grid

Kernel

Block

Thread

Page 14: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

A sample image of 128x128 pixels divided into 16 blocks

vertical motion blur; processed on a 4-SM GPU

Inp

ut

Imag

e

Ou

tpu

t Im

age

G

PU

0 1 2 3

Page 15: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

0 1 2 3

2 3 1 0

1 0 3 2

3 2 0 1

A sample image of 128x128 pixels divided into 16 blocks nonPT illustration

vertical motion blur; processed on a 4-SM GPU

Softw

are vie

w

Hard

ware

view

In

pu

t Im

age

O

utp

ut

Imag

e

16 blocks mapped to the 4 SMs in random order

GP

U

0 1 2 3

Page 16: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Persistent Threads: Properties

1. Maximal Launch – A kernel uses only as many threads as can be concurrently scheduled on the SMs

2. Software, not hardware, schedules work

Page 17: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

0 1 2 3

2 3 1 0

1 0 3 2

3 2 0 1

A sample image of 128x128 pixels divided into 16 blocks nonPT illustration

vertical motion blur; processed on a 4-SM GPU

Softw

are vie

w

Hard

ware

view

In

pu

t Im

age

O

utp

ut

Imag

e

16 blocks mapped to the 4 SMs in random order

GP

U

0 1 2 3

0 1 2 3

PT illustration

4 SMs 4 thread groups

0 1 2 3

Page 18: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Persistent Threads: Properties

1. Maximal Launch – A kernel uses only as many threads as can be concurrently scheduled on the SMs Thread-group (v/s thread-block) Upper-bound: maximal launch

Lower-bound: 1

2. Software, not hardware, schedules work Work-queues Several optimizations possible

In his paper – single global FIFO

Page 19: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Common Communication Patterns

Linear

Diagonal

Zig-Zag

Scanline

Wavefront

Pinwheel

Checker

.

.

.

Page 20: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

• μ-kernel benchmarks

• Workload comprises of FMAs

• Nvidia GeForce GTX295

Use Cases

Page 21: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

PT Use Cases

# Use Case Scenario Advantage of Persistent Threads

1 CPU-GPU

Synchronization

Kernel A produces a variable amount of data that must be consumed by Kernel B

nonPT implementations require a round-trip communication to the host to launch Kernel B with the exact number of blocks corresponding to work items produced by Kernel A.

2 Load Balancing Traversing an irregularly-structured, hierarchical data structure

PT implementations build an efficient queue to allow a single kernel to produce a variable amount of output per thread and load balance those outputs onto threads for further processing.

3 Maintaining Active

State

A kernel accumulates a single value across a large number of threads, or Kernel A wants to pass data to Kernel B through shared memory or registers

Because a PT kernel processes many more items per block than a nonPT kernel, it can effectively leverage shared memory across a larger block size for an application like a global reduction.

4 Global

Synchronization

Global synchronization within a kernel across workgroups

In a nonPT kernel, synchronizing across blocks within a kernel is not possible because blocks run to completion and cannot wait for blocks that have not yet been scheduled. The PT model ensures that all blocks are resident and thus allows global synchronization.

Page 22: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #1: CPU-GPU Synchronization

Scenario Advantage of Persistent Threads

Kernel A produces a variable amount of data that must be

consumed by Kernel B

nonPT implementations require a round-trip communication to the host to launch Kernel B with the exact number of blocks corresponding to work items produced by Kernel A.

CPU

GPU’-kC

GPU-kP

data

param

CPU

CPU

GPU-kP

data

param

GPU-kC

param

data

dat

a

read-back barrier trb

tcpu

tlaunch

(a) nonPT (b) PT

Page 23: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #1: CPU-GPU Synchronization

Page 24: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #2: Load Balancing/Irregular Parallelism

Scenario Advantage of Persistent Threads

Traversing an irregularly-structured, hierarchical data

structure

PT implementations build an efficient queue to allow a single kernel to produce a variable amount of output per thread and load balance those outputs onto threads for further processing.

Page 25: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #2: Workload Illustration – Tree(s)

Initial Inputs

# o

f le

vels

Initial Inputs

# o

f le

vels

(a) Full Tree

(b) Tilted Tree

Page 26: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #2: Workload – Complete Tree

Page 27: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #2: Workload – Tilted Tree

Page 28: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #3: Maintaining Active State

Scenario Advantage of Persistent Threads

A kernel accumulates a single value across a large number of

threads, or Kernel A wants to pass data to Kernel B through shared

memory or registers

Because a PT kernel processes many more items per block than a nonPT kernel, it can effectively leverage shared memory across a larger block size for an application like a global reduction.

Kernel-X2

Kernel-X1

Kernel-X3

GPU Kernel-X’

(a) nonPT (b) PT

Page 29: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #3: Workload Illustration – Reduction

Page 30: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #3: Workload – Reduction

Page 31: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #4: Global Synchronization

Scenario Advantage of Persistent Threads

Global synchronization within a kernel across workgroups

In a nonPT kernel, synchronizing across blocks within a kernel is not possible because blocks run to completion and cannot wait for blocks that have not yet been scheduled. The PT model ensures that all blocks are resident and thus allows global synchronization.

Kernel-X2

Kernel-X1

cross-block barrier #1

cross-block barrier #2

Kernel-X3

GPU Kernel-X’

PT barrier #1

PT barrier #2

(a) nonPT (b) PT

Page 32: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Use Case #4: Global Synchronization

Page 33: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Portability & Usability

# Use Case Occupancy Scheduling Comments

1 CPU-GPU

Synchronization ----- ----- indirect; CPU-GPU workload partitioning

2

Load Balancing/Irregular

Parallelism -----

non-trivial when sophisticated queuing structures (local + global) and work stealing/donation optimizations are used

3 Maintaining Active

State

different kernel organization and partitioning strategies

4 Global

Synchronization ----- hard to debug as occupancy changes

Page 34: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Looking Ahead…

# Use Case Disucssion

1 CPU-GPU

Synchronization • Less of an issue on future consumer systems; but HPC still a problem • We expect this to be addressed in the future

2

Load Balancing/Irregular

Parallelism • Provide support for queues

3 Maintaining Active

State • Very hard to solve!

4 Global

Synchronization • Kernel launch might be cheaper than synchronizing across an entire

chip in future-generation hardware

Page 35: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Looking Ahead…

Return-on-Investment

Power?

Modifications?

Native support would not require a complete re-making of the underlying hardware

Small changes could lead to reasonable gains

Augment existing APIs?

Page 36: Study: Persistent Threads Style Programming Model for GPU ...on-demand.gputechconf.com/...Persistent-Threads-Style-Programmin… · A Study of Persistent Threads Style Programming

Kshitij Gupta www.kshitijgupta.com

Thank You!