Divergence-Aware Warp Scheduling
Timothy G. Rogers¹, Mike O'Connor², Tor M. Aamodt¹
¹The University of British Columbia  ²NVIDIA Research
MICRO 2013, Davis, CA
GPU
[Diagram: a GPU contains many Streaming Multiprocessors; each has a Warp Scheduler, a Memory Unit, and an L1D cache, and all share an L2 cache and Main Memory. Threads are grouped into warps W1, W2, …]
• 10,000s of concurrent threads
• Grouped into warps
• Scheduler picks a warp to issue each cycle
2 Types of Divergence
• Branch divergence: within a warp, e.g. if(…){…}, some threads take one path and others take the other (active mask 1 0 1 0). Affects functional unit utilization.
• Memory divergence: the threads of a warp access different cache lines on a single load, which can waste memory bandwidth.
This work is aware of both branch divergence and memory divergence, and focuses on improving performance.
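As a minimal illustration (a hypothetical kernel written for this transcript, not taken from the talk), the sketch below shows both kinds of divergence in one warp: the data-dependent if() causes branch divergence, and the indirect gather causes memory divergence.

    // Hypothetical CUDA kernel, for illustration only.
    __global__ void divergence_demo(const int *idx, const float *A,
                                    float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        if (idx[tid] % 2 == 0)          // branch divergence: lanes of a warp disagree
            out[tid] = A[idx[tid]];     // memory divergence: scattered gather addresses
        else
            out[tid] = 0.0f;
    }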
Motivation
• Transfer locality management from SW to HW
• Software solutions:
  • Complicate programming
  • Not always performance portable
  • Not guaranteed to improve performance
  • Sometimes impossible
• Improve performance of programs with memory divergence:
  • Parallel irregular applications
  • Economically important (server computing, big data)
Programmability Case Study: Sparse Vector-Matrix Multiply
• 2 versions from SHOC
• Divergent version: each thread has locality, but suffers divergence
• GPU-optimized version: added complication: dependent on warp size, parallel reduction, explicit scratchpad use
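For concreteness, here is a simplified sketch (assumed, not the exact SHOC source; array and kernel names are illustrative) of the two styles being compared. The divergent version keeps each row's reuse inside one thread; the GPU-optimized version assigns one warp per row and adds a scratchpad-based parallel reduction, which ties it to the warp size.

    // Divergent version: one thread per row of the CSR matrix. Simple to write,
    // and each thread has locality on its own row, but rows of different length
    // cause branch divergence and the gathers are memory divergent.
    __global__ void spmv_csr_scalar(const int *rowPtr, const int *col,
                                    const float *val, const float *x,
                                    float *y, int numRows)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < numRows) {
            float sum = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[col[j]];
            y[row] = sum;
        }
    }

    // GPU-optimized version: one 32-thread warp per row, with an explicit
    // scratchpad (shared memory) used for the parallel reduction.
    __global__ void spmv_csr_vector(const int *rowPtr, const int *col,
                                    const float *val, const float *x,
                                    float *y, int numRows)
    {
        __shared__ float partial[128];               // assumes blockDim.x == 128
        int lane = threadIdx.x & 31;                 // lane within the warp
        int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
        float sum = 0.0f;
        if (row < numRows)
            for (int j = rowPtr[row] + lane; j < rowPtr[row + 1]; j += 32)
                sum += val[j] * x[col[j]];
        partial[threadIdx.x] = sum;
        __syncwarp();
        for (int off = 16; off > 0; off >>= 1) {     // parallel reduction, warp-size dependent
            if (lane < off)
                partial[threadIdx.x] += partial[threadIdx.x + off];
            __syncwarp();
        }
        if (lane == 0 && row < numRows)
            y[row] = partial[threadIdx.x];           // lane 0 holds the row sum
    }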
Previous Work vs. Divergence-Aware Warp Scheduling
• Previous work (CCWS, MICRO 2012): scheduling used to capture intra-thread locality. Reactive: detects interference (lost locality), then throttles. Unaware of branch divergence: all warps treated equally. Outperformed by profiled static throttling. Case study: divergent code 50% slowdown.
• Divergence-Aware Warp Scheduling: proactive: predict and be proactive. Adapts to branch divergence. Outperforms the static solution. Case study: divergent code <4% slowdown.
[Diagram: warp scheduler with warps W1 … WN, an L1D, and a memory unit; when lost locality is detected for a warp, the scheduler stops some warps (Stop) and lets others issue (Go)]
Divergence-Aware Warp Scheduling
How to be proactive:
• Identify where locality exists
• Limit the number of warps executing in high-locality regions
• Create a cache footprint prediction in high-locality regions
How to adapt to branch divergence:
• Account for the number of active lanes to create the per-warp footprint prediction
• Change the prediction as branch divergence occurs
Where is the locality?
• Examine every load instruction in the program
[Chart: hits and misses per kilo-instruction (PKI) for each static load instruction, Load 1 through Load 13, in the GC workload; a bracket marks the loads that sit inside a loop]
Locality is concentrated in loops.
Locality In Loops Limit Study
[Chart: fraction of cache hits in loops, averaged across workloads, broken down into lines accessed this iteration, lines accessed last iteration, and other]
• Hits are on data accessed in the immediately previous loop trip
• How much data should we keep around?
DAWS Objectives
1. Predict the amount of data accessed by each warp in a loop iteration.
2. Schedule warps in loops so that the aggregate predicted footprint does not exceed the L1D.
Observations that enable prediction
• Memory divergence in static instructions is predictable
• The data touched by divergent loads depends on the active mask
[Diagram: the same divergent load generates 4 accesses to main memory for a warp with active mask 1 1 1 1, but only 2 accesses for a warp with active mask 1 0 1 0]
Both observations are used to create the cache footprint prediction.
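A software sketch of how these observations could be turned into the prediction and the scheduling rule from the objectives above. This is an illustrative model only; the structure and names are assumptions, not the hardware design. Each diverged load in the loop is charged roughly one cache line per active lane, each non-diverged load one line, and a warp is only prioritized into the loop if the aggregate predicted footprint still fits in the L1D.

    // Illustrative software model of the footprint prediction; not the DAWS hardware.
    struct WarpInfo {
        unsigned activeMask;    // one bit per lane of the warp
        int divergedLoads;      // loads in the loop classified as memory-divergent
        int convergedLoads;     // loads in the loop that stay converged
    };

    // Predicted cache lines touched by one loop iteration of this warp:
    // a diverged load touches ~1 line per active lane, a converged load ~1 line.
    int predictedFootprint(const WarpInfo &w)
    {
        int activeLanes = 0;
        for (unsigned m = w.activeMask; m != 0; m >>= 1)
            activeLanes += m & 1;
        return w.divergedLoads * activeLanes + w.convergedLoads;
    }

    // A warp is prioritized into the high-locality loop only if its predicted
    // footprint plus those of warps already in the loop fits in the L1D.
    bool mayEnterLoop(const WarpInfo &candidate,
                      const WarpInfo *inLoop, int numInLoop, int l1dLines)
    {
        int aggregate = predictedFootprint(candidate);
        for (int i = 0; i < numInLoop; ++i)
            aggregate += predictedFootprint(inLoop[i]);
        return aggregate <= l1dLines;   // otherwise the warp is de-prioritized
    }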
Online characterization to create cache footprint prediction
1. Detect loops with locality: some loops have locality and some don't; limit multithreading only in the loops that do.
2. Classify the loads in the loop, e.g. in while(…){ load 1 … load 2 }, load 1 is diverged and load 2 is not diverged.
3. Compute the footprint from the active mask: for Warp 0, the diverged load generates one access per active lane (4 in the example) and the non-diverged load generates 1 access, so Warp 0's footprint is 5 cache lines.
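Plugging the example above into the illustrative predictedFootprint() sketch from earlier gives the same arithmetic:

    // Warp 0: all 4 lanes active, one diverged load and one non-diverged load.
    WarpInfo warp0 = { 0xF /* active mask 1 1 1 1 */, 1, 1 };
    int lines = predictedFootprint(warp0);   // 1*4 + 1 = 5 cache lines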
Example Compressed Sparse Row Kernel

    // Row pointers for the compressed sparse row (CSR) matrix.
    int C[] = {0, 64, 96, 128, 160, 160, 192, 224, 256};

    void sum_row_csr(float* A, …) {
        float sum = 0;
        int i = C[tid];            // each thread walks its own row
        while (i < C[tid + 1]) {   // rows have different lengths: branch divergence
            sum += A[i];           // threads read scattered addresses: memory divergence
            ++i;
        }
        …
    }
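A quick reading of the row-pointer array above (a derived observation, not stated on the slide): the row lengths C[t+1] - C[t] are 64, 32, 32, 32, 0, 32, 32, 32 for threads 0 through 7. Thread 0's long row and thread 4's empty row are what produce the branch divergence seen in the operation example below.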
DAWS Operation Example
[Timeline figure: Warp 0 and Warp 1 execute the CSR kernel above; cache contents, per-warp footprints, and scheduler decisions (Go/Stop) are shown at Time 0, Time 1, and Time 2]
• Time 0 (Warp 0's 1st iteration): no locality has been detected yet, so there is no footprint; early warps profile the loop for later warps. Memory divergence on the load is detected.
• Time 1 (Warp 0's 2nd iteration): locality detected and 1 diverged load detected; with 4 active threads (active mask 1 1 1 1), Warp 0's footprint = 4 × 1 = 4 cache lines and it Goes, while Warp 1 is Stopped so the aggregate footprint stays within the cache. The cache holds A[0], A[64], A[96], A[128], and Warp 0 keeps hitting in those lines (Hit ×30 on each): the spatial locality we want to capture.
• Time 2 (Warp 0's 33rd iteration): Warp 0 now has branch divergence (active mask 1 0 0 0), so its footprint decreases. Warp 1 (1st iteration, active mask 0 1 1 1, footprint = 3 × 1) can now Go; both warps capture spatial locality together. The cache holds A[32], A[160], A[192], A[224].
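With the same illustrative model, the Time 2 predictions work out as follows (the CSR loop body has one diverged load and no converged loads):

    WarpInfo w0 = { 0x8 /* active mask 1 0 0 0 */, 1, 0 };   // 1 active lane
    WarpInfo w1 = { 0x7 /* active mask 0 1 1 1 */, 1, 0 };   // 3 active lanes
    int f0 = predictedFootprint(w0);   // 1 cache line
    int f1 = predictedFootprint(w1);   // 3 cache lines
    // f0 + f1 = 4 lines fits in the example's cache, so both warps may go.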
Methodology
• GPGPU-Sim (version 3.1.0): 30 streaming multiprocessors; 32 warp contexts (1024 threads total); 32KB L1D per streaming multiprocessor; 1MB unified L2 cache
• Compared schedulers: Cache-Conscious Wavefront Scheduling (CCWS); profile-based Best-SWL; Divergence-Aware Warp Scheduling (DAWS)
• More schedulers in the paper
Sparse MM Case Study Results
• Performance (normalized to the optimized version)
[Chart: execution time of the divergent code, normalized to the optimized version]
• The divergent code comes within 4% of the optimized version with no programmer input
Sparse MM Case Study Results
• Properties (normalized to the optimized version)
[Chart: off-chip accesses of the divergent code, normalized to the optimized version]
• Less than 20% increase in off-chip accesses
• The divergent code issues 2.8× fewer instructions, so the divergent version now has potential energy advantages
Cache-Sensitive Applications
• Breadth First Search (BFS)
• Memcached-GPU (MEMC)
• Sparse Matrix-Vector Multiply (SPMV-Scalar)
• Garbage Collector (GC)
• K-Means Clustering (KMN)
Cache-insensitive applications in the paper.
Results
[Chart: normalized speedup for BFS, MEMC, SPMV-Scalar, GC, KMN, and the harmonic mean (HMean), comparing CCWS, Best-SWL, and DAWS]
• DAWS outperforms Best-SWL on highly branch-divergent applications
• Overall 26% improvement over CCWS
Summary
• Problem: divergent loads in GPU programs; software solutions complicate programming
• Solution: DAWS captures these opportunities by accounting for divergence
• Result: overall 26% performance improvement over CCWS; case study: divergent code performs within 4% of code optimized to minimize divergence
Questions?