
Divergence-Aware Warp Scheduling

Timothy G. Rogers (1), Mike O’Connor (2), Tor M. Aamodt (1)

(1) The University of British Columbia, (2) NVIDIA Research

MICRO 2013, Davis, CA


[Figure: a GPU made of several streaming multiprocessors. Each one holds a warp scheduler (warps W1, W2, …), a memory unit, and an L1D cache, backed by a shared L2 cache and main memory. Threads are drawn grouped into warps.]

•  10,000s of concurrent threads
•  Grouped into warps
•  Scheduler picks a warp to issue each cycle


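2 Types of Divergence

•  Memory Divergence: the threads of one warp access different parts of main memory. Can waste memory bandwidth.
•  Branch Divergence: the threads of one warp take different paths through a branch such as if(…) { … }, so only some lanes stay active (for example, active mask 1 0 1 0). Affects functional unit utilization.

This work is aware of branch divergence AND aware of memory divergence, with a focus on improving performance.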
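As a rough illustration of the two kinds of divergence, the sketch below counts how many distinct cache lines one warp-wide load touches (memory divergence) and how many lanes are active (branch divergence). The 32-lane warp, the 128-byte cache line, and the function names are assumptions made for the example; the slides do not state them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WARP_SIZE 32    /* assumed warp width */
#define LINE_BYTES 128  /* assumed cache line size in bytes */

/* Number of active lanes in a warp; branch divergence lowers this. */
int active_lanes(uint32_t mask) {
    int n = 0;
    while (mask) {
        mask &= mask - 1;  /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Number of distinct cache lines touched by one warp-wide load.
 * addr[i] is lane i's byte address; lanes whose mask bit is 0 are skipped. */
int lines_touched(const uint64_t addr[WARP_SIZE], uint32_t mask) {
    assert(addr != NULL);
    uint64_t seen[WARP_SIZE];
    int n_seen = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (!((mask >> lane) & 1u)) continue;
        uint64_t line = addr[lane] / LINE_BYTES;
        int duplicate = 0;
        for (int j = 0; j < n_seen; ++j) {
            if (seen[j] == line) { duplicate = 1; break; }
        }
        if (!duplicate) seen[n_seen++] = line;
    }
    return n_seen;
}
```

With consecutive 4-byte addresses a full warp touches a single line, while a stride of one line per lane touches one line per active lane.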


Motivation

•  Transfer locality management from SW to HW

•  Software solutions:
    •  Complicate programming
    •  Not always performance portable
    •  Not guaranteed to improve performance
    •  Sometimes impossible

•  Improve performance of programs with memory divergence
    •  Parallel irregular applications
    •  Economically important (server computing, big data)


Programmability Case Study: Sparse Vector-Matrix Multiply

2 versions from SHOC

•  Divergent Version
    •  Divergence
    •  Each thread has locality

•  GPU-Optimized Version
    •  Added complication
    •  Dependent on warp size
    •  Parallel reduction
    •  Explicit scratchpad use


Previous Work

Previous Work: scheduling used to capture intra-thread locality (MICRO 2012)
•  Reactive: detects interference (lost locality detected), then throttles
•  Unaware of branch divergence: all warps treated equally
•  Outperformed by profiled static throttling
•  Case Study: Divergent code 50% slowdown

Divergence-Aware Warp Scheduling
•  Proactive: predict and be proactive
•  Branch divergence aware: adapt to branch divergence
•  Outperform static solution
•  Case Study: Divergent code <4% slowdown

[Figure: warp scheduler, memory unit, and L1D with warps W1 … WN marked Go or Stop. One warp is drawn with active mask 1 0 1 1 0 0 and another with active mask 1 1 1 1 1 1.]


Divergence-Aware Warp Scheduling

How to be proactive
•  Identify where locality exists
•  Limit the number of warps executing in high locality regions

Adapt to branch divergence
•  Create cache footprint prediction in high locality regions
•  Account for number of active lanes to create per-warp footprint prediction
•  Change the prediction as branch divergence occurs


Where is the locality?
•  Examine every load instruction in program

[Figure: hits and misses per thousand instructions (PKI, axis 0 to 60) for each static load instruction in the GC workload, Load 1 through Load 13. The loads with locality are marked as belonging to a loop.]

Locality Concentrated in Loops


Locality In Loops: Limit Study

How much data should we keep around?

[Figure: fraction of cache hits in loops (axis 0 to 1), averaged over the workloads and split into three categories.]
•  Line accessed this iteration
•  Line accessed last iteration: hits on data accessed in the immediately previous trip
•  Other


DAWS Objectives

1.  Predict the amount of data accessed by each warp in a loop iteration.

2.  Schedule warps in loops so that aggregate predicted footprint does not exceed L1D.
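
The second objective can be sketched as a simple admission check. This is a hypothetical greedy policy written for illustration, not the hardware mechanism from the talk: the function name, the index-order priority, and the convention that a footprint of 0 means "no locality predicted" are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Marks each warp Go (1) or Stop (0) so that the summed predicted
 * footprint of the Go warps never exceeds the L1D capacity. Footprints
 * and capacity are both measured in cache lines.
 * Assumed priority: warps are considered in index order, and a later,
 * smaller warp may still be admitted after an earlier one is stopped.
 * Returns the total footprint of the warps allowed to run. */
int schedule_warps(const int footprint[], int go[], size_t n_warps,
                   int l1d_lines) {
    assert(l1d_lines >= 0);
    int used = 0;
    for (size_t w = 0; w < n_warps; ++w) {
        assert(footprint[w] >= 0);
        if (used + footprint[w] <= l1d_lines) {
            go[w] = 1;  /* fits in the remaining capacity */
            used += footprint[w];
        } else {
            go[w] = 0;  /* would overflow the cache: throttle */
        }
    }
    return used;
}
```

For the 32 KB L1D used in the evaluation and an assumed 128-byte line, the capacity argument would be 32768 / 128 = 256 lines.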


Observations that enable prediction
•  Memory divergence in static instructions is predictable
•  Data touched by divergent loads is dependent on the active mask

[Figure: Warp 0 and Warp 1 execute the same static load and both show memory divergence across main memory. A warp with active mask 1 1 1 1 makes 4 accesses; a warp with active mask 1 0 1 0 makes 2 accesses.]

Both used to create cache footprint prediction


Online characterization to create cache footprint prediction

1.   Detect loops with locality

2.   Classify loads in the loop

3.   Compute footprint from active mask

Step 1: some loops have locality, some don't. Limit multithreading in the loops with locality.

Step 2: the loads inside a loop with locality are each classified as Diverged or Not Diverged.

    while(…) {
        load 1
        …
        load 2
    }

Step 3: for Warp 0 with all lanes active, one diverged load (4 accesses) plus one non-diverged load (1 access) gives Warp 0's footprint = 5 cache lines.
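
The footprint arithmetic above can be written as a small function. This is a minimal sketch assuming that each diverged load touches one line per active lane and each non-diverged load touches one line per warp, which is how the 4 + 1 = 5 example reads. The function names and the 4-lane mask used in the usage note are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits of an active mask (one bit per lane). */
int popcount32(uint32_t mask) {
    int n = 0;
    while (mask) {
        mask &= mask - 1;  /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Predicted number of cache lines a warp touches in one loop iteration.
 * diverged_loads:  loads classified as Diverged (one line per active lane)
 * converged_loads: loads classified as Not Diverged (one line per warp) */
int predict_footprint(uint32_t active_mask, int diverged_loads,
                      int converged_loads) {
    assert(diverged_loads >= 0 && converged_loads >= 0);
    if (active_mask == 0) return 0;  /* no active lane issues any load */
    return popcount32(active_mask) * diverged_loads + converged_loads;
}
```

Calling predict_footprint(0xF, 1, 1) reproduces the 5 cache lines of the example, and clearing mask bits as lanes leave the loop lowers the prediction.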

DAWS Operation Example

Example Compressed Sparse Row Kernel

    int C[]={0,64,96,128,160,160,192,224,256};   // row start offsets into A
    void sum_row_csr(float* A, …) {
        float sum = 0;
        int i = C[tid];             // first element of this thread's row
        while(i < C[tid+1]) {       // trip count differs per thread
            sum += A[i];
            ++i;
        }
        …

[Figure: animated timeline (Time0, Time1, Time2) of two 4-thread warps running this loop, with the cache contents and the Go/Stop decision for each warp.]

Time0
•  Warp0, 1st iteration, active mask 1 1 1 1 (4 active threads)
•  Memory divergence: the load brings A[0], A[64], A[96], A[128] into the cache
•  No locality detected = no footprint

Time1
•  Warp0, 2nd iteration, active mask 1 1 1 1: Hit, Hit, Hit, Hit
•  Want to capture spatial locality: each line goes on to hit 30 more times (Hit x30)
•  Locality detected, 1 diverged load detected: Warp0's footprint = 4X1
•  Early warps profile the loop for later warps: Warp1 (1st iteration, active mask 0 1 1 1) gets footprint = 3X1 and is told to Stop

Time2
•  Warp0, 33rd iteration, active mask 1 0 0 0 (divergent branch)
•  Warp 0 has branch divergence: its footprint decreased, so Warp1 can Go
•  The cache holds A[32], A[160], A[192], A[224]
•  Both warps capture spatial locality together
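
The active masks in this example follow directly from the row offsets in C[]. The sketch below is a host-side illustration, not part of the talk: it recomputes the mask of each warp at a given loop iteration, assuming 4-lane warps as drawn on the slide and a hypothetical helper name.

```c
#include <assert.h>

#define WARP_LANES 4  /* assumed: the slide draws 4-thread warps */

/* Row offsets from the example kernel. */
static const int C[] = {0, 64, 96, 128, 160, 160, 192, 224, 256};

/* Active mask of a warp on the given 1-based loop iteration.
 * Bit i is set when lane i's thread is still inside the while loop.
 * Bit 0 is lane 0, which the slide prints as the leftmost digit. */
unsigned active_mask(int warp, int iteration) {
    assert(warp >= 0 && warp < 2 && iteration >= 1);
    unsigned mask = 0;
    for (int lane = 0; lane < WARP_LANES; ++lane) {
        int tid = warp * WARP_LANES + lane;
        int trips = C[tid + 1] - C[tid];  /* loop trip count of this thread */
        if (iteration <= trips) mask |= 1u << lane;
    }
    return mask;
}
```

Thread 0 owns a 64-element row while threads 1 to 3 own 32-element rows, so Warp0 drops from four active lanes to one at its 33rd iteration. Thread 4 owns an empty row, so Warp1 starts with three active lanes.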


Methodology

GPGPU-Sim (version 3.1.0)
•  30 Streaming Multiprocessors
•  32 warp contexts (1024 threads total)
•  32 KB L1D per streaming multiprocessor
•  1 MB unified L2 cache

Compared Schedulers
•  Cache-Conscious Wavefront Scheduling (CCWS)
•  Profile-based Best-SWL
•  Divergence-Aware Warp Scheduling (DAWS)

More schedulers in paper


Sparse MM Case Study Results
•  Performance (normalized to optimized version)

[Figure: execution time of the divergent code, normalized to the optimized version (axis 0 to 2).]

Within 4% of optimized with no programmer input


Sparse MM Case Study Results
•  Properties (normalized to optimized version)

[Figure: off-chip accesses of the divergent code, normalized to the optimized version (axis 0 to 3, with one bar off the scale at 13.3).]

•  <20% increase in off-chip accesses
•  Divergent code issues 2.8x fewer instructions
•  Divergent version now has potential energy advantages


Cache-Sensitive Applications
•  Breadth First Search (BFS)
•  Memcached-GPU (MEMC)
•  Sparse Matrix-Vector Multiply (SPMV-Scalar)
•  Garbage Collector (GC)
•  K-Means Clustering (KMN)

Cache-Insensitive Applications in paper


Results

[Figure: normalized speedup (axis 0 to 1.8) of CCWS, Best-SWL, and DAWS on BFS, MEMC, SPMV-Scalar, GC, and KMN, plus the harmonic mean (HMean).]

•  Outperform Best-SWL in highly branch-divergent applications
•  Overall 26% improvement over CCWS


Summary

Divergent loads in GPU programs
•  Software solutions complicate programming

DAWS
•  Captures opportunities by accounting for divergence

Overall 26% performance improvement over CCWS

Case Study: Divergent code performs within 4% of code optimized to minimize divergence

Questions?
