PL-4049: Cache Coherence for GPU Architectures, by Arrvindh Shriraman and Tor Aamodt

Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, and Tor M. Aamodt, "Cache Coherence for GPU Architectures," in Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), 2013.


DESCRIPTION

PL-4049, "Cache Coherence for GPU Architectures," presented by Arrvindh Shriraman and Tor Aamodt at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT

Page 1: Cache Coherence for GPU Architectures

Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, and Tor M. Aamodt, "Cache Coherence for GPU Architectures," in Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), 2013.

Pages 2-5: Agenda

Challenges with CPU coherence on GPUs

Temporal Coherence: rethinking coherence for GPUs

What is the cost of providing coherence?

Page 6: Why provide coherence?

1. Inter-workgroup communication
2. Atomic operations
3. Task queues

Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems, ISPASS 2012
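To make the first motivation concrete, here is a minimal CPU-thread analogy in C++ (not code from the talk; the Task, producer, and consumer names are mine). One group of threads publishes work and another consumes it; the handoff only works if the producer's stores become visible to the consumer, which is exactly what hardware coherence (plus the release/acquire ordering used below) provides.

```cpp
// Minimal CPU-thread analogy (illustrative only, not the talk's GPU code):
// one "workgroup" publishes a task, another consumes it. Without coherent
// caches the consumer could keep reading a stale copy of `ready` or `task`.
#include <atomic>
#include <cstdio>
#include <thread>

struct Task { int payload; };

Task task;                       // data written by the producer workgroup
std::atomic<bool> ready{false};  // flag the consumer workgroup polls

void producer() {
    task.payload = 42;                            // 1. fill the task
    ready.store(true, std::memory_order_release); // 2. publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    std::printf("consumed task with payload %d\n", task.payload);
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}
```

On a GPU whose L1 caches are not kept coherent, the consumer side of such a pattern could keep hitting on a stale copy of the flag, which is why inter-workgroup communication and task queues motivate coherence support in the first place.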

Page 7: Cache Coherence

[Diagram: the programmer's view - four processors (P) above a single shared memory.]

Appearance: one global copy of every location

Page 8: Cache Coherence

[Diagram: a multicore with per-core L1s and L2 banks over memory, beside a GPU with many L1s sharing an L2 over memory.]

Page 9: Cache Coherence

[Diagram: a heterogeneous system - CPU cores and GPU cores, each with their own L1s and L2s, sharing memory.]

How to provide coherence?

Page 10: Challenges

Pages 11-14: Challenges with coherence

[Diagram: two L1 caches and a shared L2, with a coherence transaction stepping through messages 1, 2, 3 between them.]

Pages 15-18: Challenge 1: Traffic

[Diagram: coherence messages flowing between the L1s and the shared L2.]

30% more traffic than current GPUs

Pages 19-23: Challenge 2: Buffer Overhead

[Diagram: a protocol buffer attached to the shared L2 to track in-flight coherence transactions.]

Coherence protocol buffers require 28% of total L2

Pages 24-25: Challenge 3: Complexity

Incoherent protocol: 4 states. Coherent protocol: 16 states.

Page 26: Coherence Overhead

Coherence messages bring: 1. traffic to transfer them, 2. area overhead, 3. protocol complexity.

How to achieve coherence without messages?

Page 27: TEMPORAL COHERENCE

Pages 28-31: Temporal Coherence

Time-based approach: trigger protocol events on timer alerts instead of exchanging coherence messages between the L1s and the shared L2.

Pages 32-37: Temporal Coherence

The L1s and the shared L2 all observe a single synchronized clock.

On a load, the L1 receives the data together with a local timestamp LT; the cached copy is valid only while TIME < LT.

The shared L2 keeps a global timestamp GT per block; the block may be shared in L1s only while TIME < GT.
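A minimal sketch of this rule in C++ (the names L1Entry, L2Entry, and the helper functions are mine, not the paper's): an L1 copy counts as a hit only while the current time is below its local timestamp, and the L2 remembers the furthest-out lease it has handed to any L1 as the block's global timestamp.

```cpp
// Illustrative sketch of the Temporal Coherence timestamp rule.
// Structure and function names are assumptions for this example.
#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;    // value of the single synchronized counter

struct L1Entry {
    int  data = 0;
    Time local_timestamp = 0;  // LT: lease granted by the shared L2
};

struct L2Entry {
    int  data = 0;
    Time global_timestamp = 0; // GT: furthest-out lease handed to any L1
};

// L1 side: "Valid if TIME < LT". Expired entries simply self-invalidate.
bool l1_valid(const L1Entry& e, Time now) {
    return now < e.local_timestamp;
}

// L2 side: on a load, grant a lease of `lifetime` cycles and remember it
// in GT ("Shared if TIME < GT"). No invalidation messages are ever sent.
L1Entry l2_serve_load(L2Entry& e, Time now, Time lifetime) {
    const Time lease = now + lifetime;
    e.global_timestamp = std::max(e.global_timestamp, lease);
    return L1Entry{e.data, lease};
}
```

Because validity is checked locally against the shared clock, expired L1 copies disappear without any traffic, which is what removes the invalidation messages counted in Challenge 1.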

Pages 38-42: Temporal Coherence

At TIME = 0, an L1 loads a line; the shared L2 replies with the data and a timestamp of 20, so the line is shared in that L1 until time 20.

Pages 43-45: Temporal Coherence (continued)

At TIME = 5, a second L1 loads the same line; the L2 replies with a timestamp of 25, so the line is now shared until time 25.

Pages 46-49: Temporal Coherence (continued)

At TIME = 15, a write for the same line reaches the shared L2. Copies are still live in the L1s (timestamps 20 and 25), so the write cannot complete yet and waits.

Pages 50-53: Temporal Coherence (continued)

Only at TIME = 25, once the last outstanding timestamp has expired, does the write complete at the shared L2.
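The stalling write path of this strong variant can be condensed into a few lines of C++. This is a sketch under my own naming (L2Entry, serve_load, serve_write_strong), not the paper's implementation; its output reproduces the trace on the preceding slides: leases expiring at 20 and 25, and a write issued at time 15 completing at 25.

```cpp
// Sketch of the write path in the strong variant: a write cannot complete
// at the shared L2 while any L1 may still hold a valid copy (TIME < GT).
// Illustrative only; names and the fixed 20-cycle lifetime are assumptions.
#include <algorithm>
#include <cstdint>
#include <cstdio>

using Time = std::uint64_t;

struct L2Entry {
    int  data = 0;
    Time global_timestamp = 0;  // GT: furthest-out lease handed to any L1
};

// Load: grant a lease and extend GT.
Time serve_load(L2Entry& e, Time now, Time lifetime) {
    const Time lease = now + lifetime;
    e.global_timestamp = std::max(e.global_timestamp, lease);
    return lease;
}

// Write: completes only once all leases have expired; the stall is modeled
// here by returning the completion time instead of blocking.
Time serve_write_strong(L2Entry& e, Time now, int value) {
    const Time done = std::max(now, e.global_timestamp);  // wait for TIME >= GT
    e.data = value;
    return done;
}

int main() {
    L2Entry line;
    std::printf("load at 0   -> shared till %llu\n",
                (unsigned long long)serve_load(line, 0, 20));          // 20
    std::printf("load at 5   -> shared till %llu\n",
                (unsigned long long)serve_load(line, 5, 20));          // 25
    std::printf("write at 15 -> completes at %llu\n",
                (unsigned long long)serve_write_strong(line, 15, 1));  // 25
    return 0;
}
```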

Page 54: Temporal Coherence

No coherence messages

All transactions are 2-hop

Protocol complexity minimal

Supports strong and weak memory models

Enables optimized communication (ask me later...)

Page 55: How to set the block lifetime?

• Longer lifetimes: writes may stall
• Shorter lifetimes: temporal locality may go unexploited

• Lifetime predictor at the L2, adjusted on three events (see the sketch below):
  - Load to an expired block (exploit temporal locality)
  - Store to an unexpired block (reduce write stalls)
  - Eviction of an unexpired block (reduce L2 eviction stalls)
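Here is one hypothetical shape such a predictor could take, sketched in C++. The three adjustment events come from the slide above, but the single global prediction, the step size, the bounds, and the starting value are all my own assumptions, not the paper's parameters.

```cpp
// Hypothetical lifetime predictor at the L2 (illustrative only; the
// event-driven adjustments follow the slide, but the step size, bounds,
// and single global prediction are simplifying assumptions).
#include <algorithm>
#include <cstdint>

using Time = std::uint64_t;

class LifetimePredictor {
public:
    // Lifetime (in cycles) to attach to the next load response.
    Time predict() const { return lifetime_; }

    // Load hit an L2 block whose L1 leases had already expired:
    // a longer lifetime could have kept exploiting temporal locality.
    void on_load_to_expired_block()    { lifetime_ = std::min(lifetime_ + kStep, kMax); }

    // Store arrived while L1 leases were still live: shorten future
    // lifetimes to reduce write stalls.
    void on_store_to_unexpired_block() { lifetime_ = lifetime_ > kStep ? lifetime_ - kStep : kMin; }

    // L2 evicted a block with live leases: shorten lifetimes to reduce
    // L2 eviction stalls.
    void on_evict_unexpired_block()    { lifetime_ = lifetime_ > kStep ? lifetime_ - kStep : kMin; }

private:
    static constexpr Time kStep = 100;   // assumed adjustment step (cycles)
    static constexpr Time kMin  = 0;
    static constexpr Time kMax  = 8000;  // assumed upper bound
    Time lifetime_ = 800;                // assumed starting prediction
};
```

Widening the lifetime on loads to expired blocks recovers missed reuse, while the two shrink events trade some of that reuse away to keep writes and L2 evictions from stalling.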

Pages 56-61: Temporal Coherence (Weak)

[Diagram: a write reaches the shared L2 while L1 copies are still live (timestamps 20 and 25).]

Stalling that write is sensitive to lifetime misprediction, causes resource stalls, and hurts GPU applications.

Goal: eliminate write stalls!

Pages 62-68: Temporal Coherence (Weak)

At TIME = 15, a write arrives while L1 copies are still live (timestamps 20 and 25); those copies now hold old data. Instead of stalling, the write completes at the L2 immediately, and the L2 returns the block's timestamp (25) to the writing core, which records it. A later fence waits on the recorded timestamps while the stale L1 copies expire on their own (the first at time 20).

Pages 69-72: Temporal Coherence (Weak)

At TIME = 25 the last recorded timestamp has expired, no stale copies remain in any L1, and the fence completes.
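A minimal C++ sketch of this weak variant, again under my own naming rather than the paper's code: the write completes immediately and the L2 hands back the block's global timestamp; the core keeps the maximum timestamp returned by its outstanding writes, and a fence simply waits until that time has passed.

```cpp
// Sketch of the non-stalling writes in the weak variant (illustrative;
// the Core bookkeeping with a single max timestamp is my simplification
// of the idea on the preceding slides).
#include <algorithm>
#include <cstdint>
#include <cstdio>

using Time = std::uint64_t;

struct L2Entry {
    int  data = 0;
    Time global_timestamp = 0;  // GT: furthest-out lease handed to any L1
};

// Write completes immediately; the L2 hands back GT so the writer knows
// when all possibly-stale L1 copies will have expired.
Time serve_write_weak(L2Entry& e, Time now, int value) {
    e.data = value;
    return std::max(now, e.global_timestamp);
}

struct Core {
    Time pending_ = 0;  // latest timestamp returned by any outstanding write

    void on_write_ack(Time t) { pending_ = std::max(pending_, t); }

    // The fence sends no messages; it just waits for TIME >= pending_.
    Time fence_completes_at(Time now) const { return std::max(now, pending_); }
};

int main() {
    L2Entry line;
    line.global_timestamp = 25;  // leases from earlier loads (slides: 20 and 25)

    Core core;
    core.on_write_ack(serve_write_weak(line, /*now=*/15, /*value=*/1));
    std::printf("fence issued at 15 completes at %llu\n",
                (unsigned long long)core.fence_completes_at(15));  // 25
    return 0;
}
```

As on the earlier slides, no coherence messages are needed; the fence only compares the recorded timestamp against the shared clock.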

Page 73: Temporal Coherence (Weak)

No access stalls

Efficient for GPU applications

Allows aggressive lifetime predictors

Supports weak memory models


Page 75: Coherence Applications

• Lock-based programs: Barnes Hut, Cloth Physics, Place-and-Route
• Stencil: Max-Flow Min-Cut, 3D equation solver
• Load balancing: Octree Partitioning

Pages 76-82: Interconnect Traffic - GPU applications (do not need coherence)

[Bar chart, interconnect traffic normalized to the non-coherent baseline (NO-CC), y-axis 0 to 2: MESI reaches 2.3x; GPU-VI pays a write-through penalty (annotated ".8x, Wr-Through"); TC needs no coherence messages (annotated ".3x, No msgs").]

Page 83: Coherence Applications (same list as Page 75)

Pages 84-90: Speedup - Coherence Applications

[Bar chart, speedup from 0 to 1.75 for NO-L1, MESI, GPU-VI, and TC on the coherence applications. Annotation: "Need a 32KB directory."]

Pages 91-95: Protocol Complexity

Number of states (L1 stable / L1 transient / L2 stable / L2 transient):

Non-Coherent: 2 / 2 / 2 / 2
GPU-VI: 2 / 1 / 5 / 10
Temporal Coherence: 2 / 1 / 5 / 3

Page 96: What did we learn?

• Throughput and heterogeneous architectures require a more streamlined caching framework.

• Single-chip integration enables mechanisms that we can exploit to simplify communication protocols.

• Efficient coherence protocols enable programmers to deploy accelerators for a wider range of purposes.

Page 97: Obtain GPGPU-Sim with coherence support

http://www.ece.ubc.ca/~isingh/gpgpusim-ruby.tar.gz

Contact: [email protected] or [email protected]

Page 98: Interconnect Energy

[Figure: normalized interconnect energy and power, broken down into link (dynamic), router (dynamic), link (static), and router (static), for NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, and TCW, shown separately for inter-workgroup and intra-workgroup applications.]

Page 99: Results from the HPCA-19 paper (Figures 8-10, Sections 8.2-8.3)

[Figure 8: Breakdown of interconnect traffic for coherent and non-coherent GPU memory systems. Panel (a): inter-workgroup communication benchmarks (HSP, KMN, LPS, NDL, RG, SR, AVG); panel (b): intra-workgroup communication benchmarks (BH, CC, CL, DLB, STN, VPR, AVG). Bars for NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, and TCW, split into RCL, INV, REQ, ATO, ST, and LD traffic.]

[Figure 9: Breakdown of interconnect power and energy - link (dynamic), router (dynamic), link (static), and router (static) - normalized, for the same configurations and the inter- and intra-workgroup application sets.]

into a shared queue. As a result, the task-fetching and task-inserting invalidation latencies lie on the critical path for a large number of threads. TCW eliminates this critical-path invalidation latency in DLB and performs up to 2x faster than the invalidation-based protocols.

Figures 8(a) and 8(b) show the breakdown of interconnect traffic between different coherence protocols. LD, ST, and ATO are the data traffic from load, store, and atomic requests. MESI performs atomic operations at the L1 cache and this traffic is included in ST. REQ refers to control traffic for all protocols. INV and RCL are invalidation and recall traffic, respectively.

MESI's write-allocate policy at the L1 significantly increases store traffic due to unnecessary refills of write-once data. On average, MESI increases interconnect traffic over the baseline non-coherent GPU by 75% across all applications. The write-through GPU-VI and GPU-VIni introduce unnecessary invalidation and recall traffic, averaging to a traffic overhead of 31% and 30% for applications without inter-workgroup communication. TCW removes all invalidations and recalls and as a result reduces interconnect traffic by 56% over MESI, 23% over GPU-VI and 23% over GPU-VIni for this set of applications.

8.2 Power

Figure 9 shows the breakdown of interconnect power and energy usage. TCW lowers the interconnect power usage by 21%, 10% and 8%, and interconnect energy usage by 36%, 13% and 8% over MESI, GPU-VI and GPU-VIni, respectively. The reductions are both in dynamic power, due to lower interconnect traffic, and static power, due to fewer virtual channel buffers in TCW.

[Figure 10: (a) Harmonic mean speedup and (b) normalized average interconnect traffic, across all applications, for TCS, TCW-FIXED, and TCW.]

8.3 TC-Weak vs. TC-Strong

Figures 10(a) and 10(b) compare the harmonic mean performance and average interconnect traffic, respectively, across all applications for TC-Strong and TC-Weak. TCS implements TC-Strong with the FIXED-DELTA prediction scheme proposed in LCC [34, 54], which selects a single fixed lifetime that works best across all applications. TCS uses a fixed lifetime prediction of 800 core cycles, which was found to yield the best harmonic mean performance over other lifetime values. TCW-FIXED uses TC-Weak and a fixed lifetime of 3200 core cycles, which was found to be best performing over other values. TCW implements TC-Weak with the proposed predictor, as before.

TCW-FIXED has the same predictor as TCS but outperforms it by 15% while reducing traffic by 13%. TCW achieves a 28% improvement in performance over TCS and reduces interconnect traffic by 26%. TC-Strong has a trade-off between additional write stalls with higher lifetimes and additional L1 misses with lower lifetimes. TC-Weak avoids this trade-off by not stalling writes. This permits longer lifetimes and fewer L1 misses, improving performance and reducing traffic over TC-Strong.


Page 100: (backup slide repeating the paper figures and text shown on Page 99)

Page 101: TC-Strong vs TC-Weak

[Figure: speedup across all applications for TCSUO, TCS, TCSOO, TCW, and TCW with the predictor; one panel uses a single fixed lifetime for all applications, the other the best lifetime for each application.]