Cooperative Cache Scrubbing Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^ PACT 2014 * ^

Upload: fleur

Post on 07-Jan-2016


Page 1: Cooperative Cache Scrubbing

Cooperative Cache Scrubbing

Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^

PACT 2014

Page 2: Cooperative Cache Scrubbing

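Multicore Challenge

[Diagram: a chip with four processors (P), each with its own cache ($), and a last-level cache (LLC), connected to memory (DRAM). Software stack: application, managed language runtime environment, operating system.]

Objects rapidly allocated and short-lived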

Page 3: Cooperative Cache Scrubbing

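Problem: Allocation Wall

[Diagram: a chip with four processors (P), each with its own cache ($), and a last-level cache (LLC), connected to memory (DRAM), now showing data labeled DEAD. Software stack: application, managed language runtime environment, operating system.]

Objects rapidly allocated and short-lived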

Page 4: Cooperative Cache Scrubbing

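Problem: Bandwidth & Power Wall

[Diagram: a chip with four processors (P), each with its own cache ($), and a last-level cache (LLC), connected to memory (DRAM), showing data labeled DEAD and a block of zeros (00000000000000). Software stack: application, managed language runtime environment, operating system.]

Objects rapidly allocated and short-lived
Zero initialization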

Page 5: Cooperative Cache Scrubbing

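Cooperative Cache Scrubbing

[Diagram: a chip with four processors (P), each with its own cache ($), and a last-level cache (LLC), connected to memory (DRAM), showing data labeled DEAD, a block of zeros (00000000000000), and the labels write and read. Software stack: application, managed language runtime environment, operating system.]

Objects rapidly allocated and short-lived
Zero initialization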

Page 6: Cooperative Cache Scrubbing

Generational Garbage Collection

Young objects die quickly
Nursery
    Traced for live objects
    Copy to mature space
    Reclaimed ‘en masse’

[Diagram labels: Nursery, Mature, LLC (8MB), lines marked DEAD.]
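The nursery steps above (trace, copy survivors, reclaim everything else at once) can be sketched as a toy collector. This is an illustration only, not the collector used in the talk: the dictionary-based heap, the function name `collect_nursery`, and the object names are all hypothetical, and a real generational collector would also treat pointers from the mature space into the nursery as roots, which this sketch omits.

```python
def collect_nursery(nursery, mature, roots):
    """Toy nursery collection.

    nursery, mature: dict mapping object id -> list of referenced object ids.
    roots: ids reachable from outside the heap (stacks, globals).
    Returns the number of nursery objects that died.
    """
    # Trace: find every nursery object reachable from the roots.
    worklist = [r for r in roots if r in nursery]
    survivors = set()
    while worklist:
        obj = worklist.pop()
        if obj in survivors:
            continue
        survivors.add(obj)
        worklist.extend(ref for ref in nursery[obj] if ref in nursery)

    # Copy: survivors move to the mature space.
    for obj in survivors:
        mature[obj] = nursery[obj]

    # Reclaim en masse: the whole nursery is now dead and can be reused.
    dead = len(nursery) - len(survivors)
    nursery.clear()
    return dead


nursery = {"a": ["b"], "b": [], "c": ["a"], "d": []}
mature = {}
dead_count = collect_nursery(nursery, mature, roots=["a"])
print(dead_count, sorted(mature))
```

After the call, every address range the nursery occupied holds dead data, which is the property the later slides exploit.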

Page 7: Cooperative Cache Scrubbing

Dead Lines in LLC (8MB)


Page 8: Cooperative Cache Scrubbing

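Dead Data Written Back?

[Diagram: a chip with four processors (P), each with its own cache ($), and a last-level cache (LLC), connected to memory (DRAM), showing data labeled DEAD. Software stack: application, managed language runtime environment, operating system.]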

Page 9: Cooperative Cache Scrubbing

Useless Write Backs (8MB LLC)


Page 10: Cooperative Cache Scrubbing

Cooperative Cache Scrubbing

Communicate managed language’s semantic information to hardware
Caches ‘scrub’ dead lines
    Invalidate
    Unset dirty bit
Zero lines without fetch
Result
    Better cache management
    Avoid traffic to DRAM
    Save DRAM energy

[Slide annotations: writes, reads.]
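The effect of unsetting dirty bits on dead lines can be illustrated with a toy write-back cache. This is a sketch under stated assumptions, not the simulated hardware from the talk: `ToyCache`, its LRU policy, the 8-line capacity, and the line addresses are all made up for illustration, and the model covers only write-backs (it does not model zeroing without fetch).

```python
from collections import OrderedDict


class ToyCache:
    """Fully associative write-back cache with LRU replacement (toy model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # line address -> dirty flag, LRU first
        self.dram_writes = 0

    def write(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
        elif len(self.lines) >= self.capacity:
            _, dirty = self.lines.popitem(last=False)  # evict the LRU line
            if dirty:
                self.dram_writes += 1  # dirty eviction costs a DRAM write
        self.lines[line] = True

    def scrub(self, lines):
        """Clear the dirty bit of cached lines the runtime knows are dead."""
        for line in lines:
            if line in self.lines:
                self.lines[line] = False


def run(scrub_dead):
    cache = ToyCache(capacity_lines=8)
    nursery = range(0, 8)
    for line in nursery:            # the application fills the nursery
        cache.write(line)
    if scrub_dead:                  # after collection the nursery is dead
        cache.scrub(nursery)
    for line in range(100, 108):    # new data pushes the old lines out
        cache.write(line)
    return cache.dram_writes


print(run(scrub_dead=False), run(scrub_dead=True))
```

Without scrubbing, every dead line is written back when it is evicted; with scrubbing, the same evictions cause no DRAM writes in this model.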

Page 11: Cooperative Cache Scrubbing

Dead Data Written in Cache?

Young objects die quickly
Nursery
    Traced for live objects
    Copy to mature space
    Reclaimed ‘en masse’

[Diagram labels: Nursery, Mature, LLC, lines marked DEAD, a block of zeros (0000000).]

Page 12: Cooperative Cache Scrubbing

Dead Lines Written in LLC (8MB)


Page 13: Cooperative Cache Scrubbing

SW-HW Cooperative Scrubbing

Software
    Identify cache line-aligned dead/zero region
    Generational Immix collector (stop-the-world)
        After nursery collection, call scrub instruction on each line in entire range
        Call zero instructions to zero region (32KB)
Hardware
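The software side described above amounts to walking a dead region one cache line at a time. The sketch below shows that address arithmetic. It is illustrative only: the 64-byte line size is an assumption (the slide gives no line size), and `aligned_lines`, `scrub_region`, and the `scrub_line` callback are hypothetical names standing in for the scrub instruction. Only lines that lie entirely inside the region are visited, on the assumption that a partially covered line could still hold live data.

```python
LINE_SIZE = 64  # bytes; assumed, not stated on the slide


def aligned_lines(start, end, line_size=LINE_SIZE):
    """Addresses of cache lines lying entirely inside [start, end)."""
    first = -(-start // line_size) * line_size  # round start up to a line
    last = (end // line_size) * line_size       # round end down to a line
    return list(range(first, last, line_size))


def scrub_region(start, end, scrub_line):
    """Call scrub_line(addr) once per whole line; return the line count."""
    lines = aligned_lines(start, end)
    for addr in lines:
        scrub_line(addr)
    return len(lines)


scrubbed = []
count = scrub_region(100, 300, scrubbed.append)
print(count, scrubbed)
print(len(aligned_lines(0, 32 * 1024)))  # lines in a 32KB region
```

With 64-byte lines, a 32KB zeroing region corresponds to 512 line-granularity operations.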

Page 14: Cooperative Cache Scrubbing

SW-HW Cooperative Scrubbing

Software
Hardware
    Scrubbing (LLC)
        clinvalidate: invalidates cache line
        clundirty: clears dirty bit
        clclean: clears dirty bit, moves line to LRU
    Zeroing (L2)
        clzero: zero cache line without fetch
    Modifications to MESI cache coherence protocol
        Back-propagation from LLC to L1/L2 cache levels
        Local coherence transitions (no off-chip)

[Slide annotations, existing instructions: PowerPC’s dcbi, ARM; PowerPC’s dcbz.]
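The four instructions listed above can be modeled on a single toy cache set to make their differences concrete. This is a minimal sketch, not the proposed hardware: `ToySet`, its dictionary layout, and the 64-byte line size are assumptions, and the model merges the LLC (where scrubbing is applied) and the L2 (where zeroing is applied) into one structure. The `store` method is included only as a contrast, assuming an ordinary write miss fetches the line from DRAM first.

```python
LINE_BYTES = 64  # assumed line size


class ToySet:
    """One cache set; self.order lists tags from LRU (index 0) to MRU."""

    def __init__(self):
        self.lines = {}      # tag -> {"dirty": bool, "data": bytes}
        self.order = []
        self.dram_reads = 0

    def _touch(self, tag):
        if tag in self.order:
            self.order.remove(tag)
        self.order.append(tag)

    def store(self, tag):
        """Ordinary write: a miss fetches the line from DRAM first."""
        if tag not in self.lines:
            self.dram_reads += 1
            self.lines[tag] = {"dirty": False, "data": bytes(LINE_BYTES)}
        self.lines[tag]["dirty"] = True
        self._touch(tag)

    def clinvalidate(self, tag):
        """Drop the line without writing it back."""
        if self.lines.pop(tag, None) is not None:
            self.order.remove(tag)

    def clundirty(self, tag):
        """Clear the dirty bit; replacement order is unchanged."""
        if tag in self.lines:
            self.lines[tag]["dirty"] = False

    def clclean(self, tag):
        """Clear the dirty bit and make the line the next eviction victim."""
        if tag in self.lines:
            self.lines[tag]["dirty"] = False
            self.order.remove(tag)
            self.order.insert(0, tag)

    def clzero(self, tag):
        """Install an all-zero, modified line without fetching from DRAM."""
        self.lines[tag] = {"dirty": True, "data": bytes(LINE_BYTES)}
        self._touch(tag)


s = ToySet()
s.store("a")
s.store("b")
s.clclean("b")
s.clzero("c")
print(s.order, s.dram_reads)
```

In this model the two stores each cost a DRAM read, while `clzero` installs its line with none.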

Page 15: Cooperative Cache Scrubbing

MESI Coherence Transitions

[State diagram: MESI states M, E, S and I, with transitions labeled clclean/- and clinvalidate/-.]

Page 16: Cooperative Cache Scrubbing

MESI Coherence Transitions

[State diagram: MESI states M, E, S and I, with transitions labeled clzero/-, clzero/BusInvalidate and BusInvalidate. Note on slide: external: from another LLC.]

Page 17: Cooperative Cache Scrubbing

Methodology

Sniper simulator
    4 cores, 8MB shared L3 (LLC), McPAT
    Extensions for JVM
        Works with JIT compiler
        Emulate system calls (futex & nanosleep)
        JVM-simulator communication with new instruction
Jikes RVM 3.1.2 and DaCapo benchmarks
    Generational Immix garbage collector
    4 application threads, 4 GC threads
    2x minimum heap
    Replay compilation, 2nd invocation

Page 18: Cooperative Cache Scrubbing

DRAM Writes (8MB nursery)

[Bar chart. Y-axis: Writes/Baseline (%), 0 to 120. X-axis: antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan, Mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Page 21: Cooperative Cache Scrubbing

DRAM Reads (8MB nursery)

[Bar chart. Y-axis: Reads/Baseline (%), 0 to 225. X-axis: antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan, Mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Page 26: Cooperative Cache Scrubbing

Dynamic DRAM Energy (8MB nursery)

[Bar chart. Y-axis: Energy Reduction (%), 0 to 80. X-axis: Mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Page 28: Cooperative Cache Scrubbing

Total DRAM Energy

[Bar chart. Y-axis: Energy Reduction (%), -5 to 25. X-axis: 4M, 8M, 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. Callout on chart: -22%.]

Page 30: Cooperative Cache Scrubbing

Total DRAM Traffic

[Bar chart. Y-axis: Traffic Reduction (%), -50 to 100. X-axis: 4M, 8M, 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. Callout on chart: -14x.]

Page 31: Cooperative Cache Scrubbing

clclean+clzero Improvements

[Bar chart. Y-axis: 0% to 100%. X-axis: DRAM Reads, DRAM Writes, Total DRAM Traffic, LLC misses, Execution time, Dynamic DRAM Energy, Total DRAM Energy. Series: 4MB, 8MB, 16MB.]

Page 32: Cooperative Cache Scrubbing

Related Work

Cooperative cache management
    ESKIMO by Isen & John, Micro 09
        Useless reads and writes to DRAM by sequential C programs
        Reduce energy
        Require large map in hardware, extra cache bits
    Wang et al., PACT 02 / ISCA 03; Sartor et al., 05
        C & Fortran static analysis to give cache hints to evict or keep data
Zero initialization [Yang et al., OOPSLA 11]
    Studied costs in time, cache and traffic
    Use non-temporal writes to DRAM, increase bandwidth

Page 33: Cooperative Cache Scrubbing

Conclusions

Software-hardware cooperative cache scrubbing
    Leverages region allocation semantics
    Changes to MESI coherence protocol
    New multicore architectural simulation methodology
Reductions
    59% traffic
    14% DRAM energy
    4.6% execution time

http://users.elis.ugent.be/~jsartor/

Page 35: Cooperative Cache Scrubbing

Execution Time (8MB nursery)

[Bar chart. Y-axis: Execution Time Reduction (%), 0 to 7. X-axis: Mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Page 36: Cooperative Cache Scrubbing

Changes to MESI coherence protocol

State | clinvalidate | clundirty/clclean | clzero | BusInvalidate
M | invalidate L1/L2 (no WB) → I | invalidate L1/L2 (no WB) → E (clclean LRU) | ⁄ | invalidate L1/L2 (no WB) → I
E | invalidate L1/L2 → I | invalidate L1/L2 (clclean LRU) | → M | invalidate L1/L2 → I
S | invalidate L1/L2 → I | invalidate L1/L2 (clclean LRU) | BusInvalidate → M | invalidate L1/L2 → I
I | ⁄ | ⁄ | BusInvalidate → M | (no entry in transcript)
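The transition table above can be encoded as a lookup from (state, instruction) to (next state, bus message). The sketch below is one reading of the flattened table, not a verified copy of the authors' protocol: entries where the table lists no next state (clclean on E and S, and all scrub instructions on I) are assumed to leave the state unchanged, and the names `TRANSITIONS`, `step`, and `run` are hypothetical.

```python
# (state, instruction) -> (next state, bus message or None)
TRANSITIONS = {
    ("M", "clinvalidate"): ("I", None),
    ("E", "clinvalidate"): ("I", None),
    ("S", "clinvalidate"): ("I", None),
    ("I", "clinvalidate"): ("I", None),   # assumed: no change
    ("M", "clclean"): ("E", None),        # dirty bit cleared, no write-back
    ("E", "clclean"): ("E", None),        # assumed: state unchanged
    ("S", "clclean"): ("S", None),        # assumed: state unchanged
    ("I", "clclean"): ("I", None),        # assumed: no change
    ("M", "clzero"): ("M", None),
    ("E", "clzero"): ("M", None),
    ("S", "clzero"): ("M", "BusInvalidate"),
    ("I", "clzero"): ("M", "BusInvalidate"),
}


def step(state, instruction):
    """Return (next_state, bus_message) for one instruction."""
    return TRANSITIONS[(state, instruction)]


def run(state, instructions):
    """Apply instructions in order; collect the bus messages they emit."""
    messages = []
    for instruction in instructions:
        state, message = step(state, instruction)
        if message is not None:
            messages.append(message)
    return state, messages


print(run("I", ["clzero", "clclean", "clinvalidate"]))
```

In this encoding only clzero on a shared or invalid line produces a bus message; every scrub transition is local.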

Page 37: Cooperative Cache Scrubbing

Total DRAM Energy (8MB nursery)

[Bar chart. Y-axis: Energy Reduction (%), -10 to 60. X-axis: antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan, Mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.]

Page 38: Cooperative Cache Scrubbing

Execution Time Across Nurseries


Page 39: Cooperative Cache Scrubbing

Execution Time


Page 40: Cooperative Cache Scrubbing

Dynamic DRAM Energy 8MB Nursery
