Divergence-Aware Warp Scheduling

Timothy G. Rogers (1), Mike O’Connor (2), Tor M. Aamodt (1)
(1) The University of British Columbia, (2) NVIDIA Research

MICRO 2013, Davis, CA



•  10,000s of concurrent threads
•  Threads are grouped into warps
•  The scheduler picks a warp to issue each cycle

[Figure: GPU block diagram. Labels: Streaming Multiprocessor (shown twice), Warp Scheduler (warps W1, W2, …), Memory Unit, L1D, L2 cache, Main Memory; threads grouped into a warp.]

2 Types of Divergence

•  Branch divergence: affects functional unit utilization
   [Figure: the threads of a warp reach if(…) { … }; active mask 1 0 1 0.]
•  Memory divergence: can waste memory bandwidth
   [Figure: the threads of a warp access separate locations in main memory.]

Aware of branch divergence AND aware of memory divergence. Focus on improving performance.
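As a rough illustration of the two effects named on this slide, the sketch below measures lane utilization for an active mask and counts the distinct cache lines one warp-wide load touches. The 128-byte line size, the function names, and the example addresses are assumptions made for illustration; none of them come from the slides.

```python
from typing import Sequence

LINE_BYTES = 128  # assumed cache-line size; not stated on the slide


def simd_utilization(active_mask: Sequence[int]) -> float:
    """Fraction of lanes doing useful work under branch divergence."""
    return sum(active_mask) / len(active_mask)


def lines_touched(addresses: Sequence[int], active_mask: Sequence[int]) -> int:
    """Distinct cache lines requested by the active lanes of one warp load."""
    return len({a // LINE_BYTES for a, on in zip(addresses, active_mask) if on})


mask = [1, 0, 1, 0]            # active mask shown on the slide
coalesced = [0, 4, 8, 12]      # hypothetical: neighbouring 4-byte words
diverged = [0, 256, 384, 512]  # hypothetical: every lane in a different line

print(simd_utilization(mask))                  # 0.5
print(lines_touched(coalesced, [1, 1, 1, 1]))  # 1
print(lines_touched(diverged, [1, 1, 1, 1]))   # 4
print(lines_touched(diverged, mask))           # 2
```

The last two calls show how the two kinds of divergence interact: the same divergent load touches fewer lines once branch divergence has switched lanes off.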

Motivation

•  Transfer locality management from SW to HW
•  Software solutions:
   •  Complicate programming
   •  Are not always performance portable
   •  Are not guaranteed to improve performance
   •  Are sometimes impossible
•  Improve performance of programs with memory divergence
   •  Parallel irregular applications
   •  Economically important (server computing, big data)

Programmability Case Study: Sparse Vector-Matrix Multiply

2 versions from SHOC:

•  Divergent version
   •  Divergence
   •  Each thread has locality
•  GPU-optimized version
   •  Added complication
   •  Dependent on warp size
   •  Parallel reduction
   •  Explicit scratchpad use

Previous Work vs. Divergence-Aware Warp Scheduling

Previous work: scheduling used to capture intra-thread locality (MICRO 2012)
•  Reactive: detects interference, then throttles
•  Unaware of branch divergence: all warps treated equally
•  Outperformed by profiled static throttling
•  Case study: divergent code has a 50% slowdown

Divergence-Aware Warp Scheduling
•  Proactive: predict and be proactive
•  Branch divergence aware: adapts to branch divergence
•  Outperforms the static solution
•  Case study: divergent code has a <4% slowdown

[Figure: a warp scheduler issuing warps W1, W2, W3, …, WN to the memory unit and L1D. Labels: "Lost Locality Detected", per-warp Go/Stop decisions, and active masks 1 0 1 1 0 0 and 1 1 1 1 1 1.]

Divergence-Aware Warp Scheduling

How to be proactive
•  Identify where locality exists
•  Limit the number of warps executing in high locality regions

Adapt to branch divergence
•  Create cache footprint prediction in high locality regions
•  Account for number of active lanes to create per-warp footprint prediction
•  Change the prediction as branch divergence occurs

Where is the locality?
•  Examine every load instruction in program

[Figure: bar chart of hits and misses PKI (y-axis 0 to 60) for each static load instruction in the GC workload, Load 1 through Load 13. A "Loop" annotation marks a group of the loads.]

Locality Concentrated in Loops

Locality In Loops: Limit Study

[Figure: bar chart (Average) of the fraction of cache hits in loops, y-axis 0 to 1, split into "Line accessed this iteration", "Line accessed last iteration", and "Other".]

•  Hits on data accessed in immediately previous trip
•  How much data should we keep around?
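One way such a breakdown could be computed is sketched below. It assumes an unbounded cache (so every re-reference counts as a hit) and a hypothetical trace format of (iteration, line) pairs for a single thread; the slides do not describe how the limit study was actually instrumented.

```python
from collections import Counter


def classify_hits(trace):
    """Bucket every re-reference by how recently the line was last touched.

    trace: (iteration, line) pairs in program order for one thread.
    Assumes an unbounded cache, so any re-reference is counted as a hit.
    """
    last_seen = {}  # line -> iteration of its most recent access
    counts = Counter()
    for iteration, line in trace:
        if line in last_seen:
            gap = iteration - last_seen[line]
            if gap == 0:
                counts["this iteration"] += 1
            elif gap == 1:
                counts["last iteration"] += 1
            else:
                counts["other"] += 1
        last_seen[line] = iteration
    return counts


# Hypothetical trace: line "A" is reused at once and on the next trip,
# line "B" is reused two trips later.
trace = [(0, "A"), (0, "A"), (1, "A"), (1, "B"), (3, "B")]
print(dict(classify_hits(trace)))
```

If most hits land in the first two buckets, keeping roughly one iteration's worth of data per warp in the cache is enough, which is the question the slide raises.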


DAWS Objectives

1.  Predict the amount of data accessed by each warp in a loop iteration.

2.  Schedule warps in loops so that aggregate predicted footprint does not exceed L1D.
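The second objective can be sketched as an admission check over predicted footprints. The greedy, priority-ordered policy, the function name, and the 256-line capacity (32 KB of L1D at an assumed 128 bytes per line) are illustrative assumptions, not the hardware mechanism described in the paper.

```python
def admit_warps(predicted, l1d_lines):
    """Let warps into loops while the summed predicted footprint fits.

    predicted: (warp_id, lines) pairs in priority order (assumed ordering).
    Returns the admitted warp ids and the cache lines they are expected to use.
    """
    admitted, used = [], 0
    for warp_id, lines in predicted:
        if used + lines <= l1d_lines:
            admitted.append(warp_id)
            used += lines
    return admitted, used


L1D_LINES = 32 * 1024 // 128  # assumed 128-byte lines in a 32 KB L1D
predicted = [(0, 160), (1, 96), (2, 32), (3, 8)]  # hypothetical footprints
print(admit_warps(predicted, L1D_LINES))  # ([0, 1], 256)
```

Warps 2 and 3 are held back because warps 0 and 1 already fill the predicted capacity; they would be admitted once an earlier warp's footprint shrinks or it leaves the loop.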

Observations that enable prediction

•  Memory divergence in static instructions is predictable
   [Figure: Warp 0 and Warp 1 execute the same static load; the load diverges to main memory for both warps.]
•  Data touched by divergent loads is dependent on the active mask
   [Figure: a warp with active mask 1 1 1 1 makes 4 accesses to main memory; with active mask 1 0 1 0 it makes 2 accesses.]

Both are used to create the cache footprint prediction.

Online characterization to create cache footprint prediction

1.   Detect loops with locality
     •  Some loops have locality, some don’t
     •  Limit multithreading in the loops with locality
2.   Classify loads in the loop
     •  Each load in while(…) { load 1 … load 2 } is classified as Diverged or Not Diverged
3.   Compute footprint from active mask
     •  Warp 0 is in a loop with locality; its active mask is all ones
     •  Diverged load: 4 accesses
     •  Not-diverged load: 1 access
     •  Warp 0’s footprint = 4 + 1 = 5 cache lines
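The footprint arithmetic in step 3 can be written out as a small function. This is a minimal sketch that assumes a diverged load touches one line per active lane and a non-diverged load touches one line per warp; the function name and the boolean encoding of the load classification are made up for illustration.

```python
def predict_footprint(active_mask, loads_diverged):
    """Predicted cache lines one warp touches per loop iteration.

    active_mask: one 0/1 entry per lane.
    loads_diverged: one bool per static load in the loop (True = diverged).
    """
    active = sum(active_mask)
    if active == 0:
        return 0  # no lane is in the loop, so nothing is fetched
    # Diverged load: one line per active lane. Otherwise: one line per warp.
    return sum(active if diverged else 1 for diverged in loads_diverged)


print(predict_footprint([1, 1, 1, 1], [True, False]))  # 5, as on the slide
print(predict_footprint([1, 0, 1, 0], [True, False]))  # 3
```

The second call shows the prediction shrinking as lanes drop out, which is what lets the scheduler admit more warps after branch divergence.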

DAWS Operation Example

Example Compressed Sparse Row Kernel

    int C[]={0,64,96,128,160,160,192,224,256};
    void sum_row_csr(float* A, …) {
        float sum = 0;
        int i = C[tid];
        while(i < C[tid+1]) {
            sum += A[i];
            ++i;
        }
        …

[Figure: timeline for two 4-thread warps (Warp0, Warp1) and a cache drawn holding four lines. The sequence below is reconstructed from the slide labels.]

Time 0
•  Warp0, 1st iteration, active mask 1 1 1 1: no locality detected = no footprint. Go.
•  Memory divergence: the cache fills with A[0], A[64], A[96], A[128].
•  Early warps profile the loop for later warps.

Time 1
•  Warp0, 2nd iteration, active mask 1 1 1 1: all four threads hit (want to capture spatial locality).
•  Locality detected, 1 diverged load detected. Footprint = 4 x 1 (4 active threads).
•  Warp1, 1st iteration, active mask 0 1 1 1: footprint = 3 x 1. Stop.
•  Warp0 continues in the loop: Hit x30 for each thread.

Time 2
•  Warp0, 33rd iteration, active mask 1 0 0 0: divergent branch. Warp 0 has branch divergence, so its footprint decreased.
•  Warp1, 1st iteration, active mask 0 1 1 1: Go.
•  Cache holds A[32], A[160], A[192], A[224]: both warps capture spatial locality together.
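The active masks and footprints in this example can be replayed from the row-offset array C alone. The sketch below assumes 4-thread warps, 1-based iteration numbers, and a 4-line cache (a reading of the figure, not a stated parameter); the helper names are hypothetical.

```python
C = [0, 64, 96, 128, 160, 160, 192, 224, 256]  # row offsets from the slide
WARP_SIZE = 4       # 4-thread warps, as drawn in the example
DIVERGED_LOADS = 1  # the single load of A[i]
CACHE_LINES = 4     # assumed from the four cache entries in the figure


def active_mask(warp, iteration):
    """Lanes still inside the while loop on a 1-based iteration number."""
    tids = range(warp * WARP_SIZE, (warp + 1) * WARP_SIZE)
    return [int(C[t] + iteration - 1 < C[t + 1]) for t in tids]


def footprint(warp, iteration):
    """Active lanes times diverged loads, as in 'Footprint = 4 x 1'."""
    return sum(active_mask(warp, iteration)) * DIVERGED_LOADS


def warp1_may_enter(warp0_iteration):
    """Would Warp1's first iteration fit next to Warp0's current footprint?"""
    return footprint(0, warp0_iteration) + footprint(1, 1) <= CACHE_LINES


print(active_mask(0, 2), footprint(0, 2))    # [1, 1, 1, 1] 4
print(active_mask(1, 1), footprint(1, 1))    # [0, 1, 1, 1] 3
print(active_mask(0, 33), footprint(0, 33))  # [1, 0, 0, 0] 1
print(warp1_may_enter(2), warp1_may_enter(33))  # False True
```

Under these assumptions Warp1 is stopped while Warp0 has four active lanes (4 + 3 lines exceeds the cache) and released on Warp0's 33rd iteration (1 + 3 lines fits), matching the Stop and Go labels.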

Methodology

GPGPU-Sim (version 3.1.0)
•  30 streaming multiprocessors
•  32 warp contexts (1024 threads total)
•  32 KB L1D per streaming multiprocessor
•  1 MB unified L2 cache

Compared schedulers
•  Cache-Conscious Wavefront Scheduling (CCWS)
•  Profile-based Best-SWL
•  Divergence-Aware Warp Scheduling (DAWS)

More schedulers in paper

Sparse MM Case Study Results
•  Performance (normalized to optimized version)

[Figure: bar chart of divergent-code execution time, normalized to the optimized version; y-axis 0 to 2.]

Within 4% of optimized with no programmer input

Sparse MM Case Study Results
•  Properties (normalized to optimized version)

[Figure: bar chart with y-axis 0 to 3 and one clipped bar labeled 13.3; an axis label reads "Divergent code off-chip accesses".]

•  <20% increase in off-chip accesses
•  Divergent code issues 2.8x fewer instructions
•  Divergent version now has potential energy advantages

Cache-Sensitive Applications
•  Breadth First Search (BFS)
•  Memcached-GPU (MEMC)
•  Sparse Matrix-Vector Multiply (SPMV-Scalar)
•  Garbage Collector (GC)
•  K-Means Clustering (KMN)

Cache-insensitive applications are in the paper.

Results

[Figure: bar chart of normalized speedup (y-axis 0 to 1.8) for CCWS, Best-SWL, and DAWS on BFS, MEMC, SPMV-Scalar, GC, KMN, and HMean.]

•  Outperforms Best-SWL where branch divergence is high
•  Overall 26% improvement over CCWS

Summary

Divergent loads in GPU programs
•  Software solutions complicate programming

DAWS
•  Captures opportunities by accounting for divergence

Overall 26% performance improvement over CCWS

Case study: divergent code performs within 4% of code optimized to minimize divergence

Questions?