cooperative cache scrubbing

Post on 07-Jan-2016

DESCRIPTION

Cooperative Cache Scrubbing. Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^. PACT 2014. PowerPoint presentation.

TRANSCRIPT

Cooperative Cache Scrubbing

Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^

PACT 2014

Multicore Challenge

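[Diagram: a chip with four processors (P), each with its own cache ($), sharing a last-level cache (LLC) and connected to memory (DRAM). The software stack on top consists of the application, the managed language runtime environment, and the operating system.]

Objects rapidly allocated and short-lived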

Problem: Allocation Wall

[Diagram: four processors (P) with caches ($), the LLC, and memory (DRAM), beneath the application, the managed language runtime environment, and the operating system. Cache lines are marked DEAD.]

Objects rapidly allocated and short-lived

Problem: Bandwidth & Power Wall

[Diagram: four processors (P) with caches ($), the LLC, and memory (DRAM), beneath the application, the managed language runtime environment, and the operating system. Cache lines are marked DEAD; a run of zeros (00000000000000) is labeled "Zero initialization".]

Objects rapidly allocated and short-lived

Cooperative Cache Scrubbing

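[Diagram: four processors (P) with caches ($), the LLC, and memory (DRAM), beneath the application, the managed language runtime environment, and the operating system. Cache lines are marked DEAD, a run of zeros (00000000000000) is labeled "Zero initialization", and the DRAM traffic is labeled "write" and "read".]

Objects rapidly allocated and short-lived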

Generational Garbage Collection

Young objects die quickly
Nursery: traced for live objects; copy to mature space; reclaimed ‘en masse’

[Diagram: the heap split into nursery and mature space, shown next to the 8MB LLC, whose lines are marked DEAD.]
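The nursery collection summarized on this slide can be sketched roughly as follows. This is a minimal illustration and not the collector used in the talk: `Obj`, `collect_nursery`, and the list-based spaces are hypothetical names, and the sketch simply traces everything reachable from the roots, whereas a real generational collector would use a remembered set instead of tracing mature objects.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    name: str
    refs: list = field(default_factory=list)  # outgoing references


def collect_nursery(roots, nursery, mature):
    """Trace live nursery objects, copy them to the mature space,
    then reclaim the whole nursery at once. Returns the dead objects."""
    in_nursery = {id(o) for o in nursery}
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue
        seen.add(id(obj))
        if id(obj) in in_nursery:
            mature.append(obj)  # survivor: "copied" to the mature space
        stack.extend(obj.refs)
    dead = [o for o in nursery if id(o) not in seen]
    nursery.clear()  # the nursery is reclaimed en masse
    return dead
```

After the call, every address range the nursery occupied holds only dead data, which is the property the rest of the talk exploits.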

Dead Lines in LLC (8MB)

Dead Data Written Back?

[Diagram: four processors (P) with caches ($), the LLC, and memory (DRAM), beneath the application, the managed language runtime environment, and the operating system. Lines are marked DEAD.]

Useless Write Backs (8MB LLC)

Cooperative Cache Scrubbing

Communicate managed language’s semantic information to hardware
Caches ‘scrub’ dead lines: invalidate; unset dirty bit (slide annotation: writes)
Zero lines without fetch (slide annotation: reads)
Result: better cache management; avoid traffic to DRAM; save DRAM energy

Dead Data Written in Cache?

Young objects die quickly
Nursery: traced for live objects; copy to mature space; reclaimed ‘en masse’

[Diagram: nursery and mature space next to the LLC; LLC lines are marked DEAD, alongside a run of zeros (0000000).]

Dead Lines Written in LLC (8MB)

SW-HW Cooperative Scrubbing

Software
Identify cache line-aligned dead/zero region
Generational Immix collector (stop-the-world)
After nursery collection, call scrub instruction on each line in entire range
Call zero instructions to zero region (32KB)
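The software side described above can be sketched as a loop over cache-line-aligned addresses. This is only an illustration under stated assumptions: the 64-byte line size is assumed (the slides do not give it), and `scrub` and `zero` are hypothetical callbacks standing in for the scrub and zero instructions.

```python
LINE_SIZE = 64            # bytes per cache line (assumed, not from the slides)
ZERO_REGION = 32 * 1024   # zeroing granularity named on the slide (32KB)


def aligned_lines(start, end, line_size=LINE_SIZE):
    """Return the addresses of cache lines lying entirely inside [start, end)."""
    first = (start + line_size - 1) // line_size * line_size  # round up
    last = end // line_size * line_size                        # round down
    return range(first, last, line_size)


def scrub_range(start, end, scrub):
    """Call the scrub callback once per whole line; return the line count."""
    count = 0
    for addr in aligned_lines(start, end):
        scrub(addr)
        count += 1
    return count


def zero_region(start, zero, size=ZERO_REGION):
    """Call the zero callback once per line of one region."""
    for addr in aligned_lines(start, start + size):
        zero(addr)
```

Rounding inward matters: a line that is only partly dead may still hold live data, so it is skipped rather than scrubbed.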

Hardware

Scrubbing (LLC)
clinvalidate: invalidates cache line (compare PowerPC’s dcbi, ARM)
clundirty: clears dirty bit
clclean: clears dirty bit, moves line to LRU

Zeroing (L2)
clzero: zero cache line without fetch (compare PowerPC’s dcbz)

Modifications to MESI cache coherence protocol
Back-propagation from LLC to L1/L2 cache levels
Local coherence transitions (no off-chip)
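To make the effect of the four instructions concrete, here is a toy model of a single cache line. It is a sketch rather than the hardware design: the `CacheLine` fields, the 64-byte line size, and the `evict_first` flag (standing in for "moves line to LRU") are assumptions made for illustration, and coherence is ignored entirely.

```python
from dataclasses import dataclass

LINE_SIZE = 64  # bytes per cache line (assumed)


@dataclass
class CacheLine:
    valid: bool = False
    dirty: bool = False
    evict_first: bool = False      # stands in for the LRU position
    data: bytes = bytes(LINE_SIZE)

    def needs_writeback_on_evict(self):
        # Only valid, dirty lines cost a DRAM write when evicted.
        return self.valid and self.dirty


def clinvalidate(line):
    line.valid = False             # line is dropped without write back
    line.dirty = False


def clundirty(line):
    line.dirty = False             # line stays cached but is now clean


def clclean(line):
    line.dirty = False
    line.evict_first = True        # clean and first in line for eviction


def clzero(line):
    line.data = bytes(LINE_SIZE)   # zeroed in place, no fetch from DRAM
    line.valid = True
    line.dirty = True              # the zeroed line is a modified line
```

In this model all three scrub operations make `needs_writeback_on_evict()` false, which is how scrubbing avoids useless writes; `clzero` avoids the read that a normal store miss would trigger.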

MESI Coherence Transitions

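[State diagram over the MESI states M, E, S, and I. Edges are labeled clclean/- and clinvalidate/-; the "-" after the slash indicates that no bus action is issued. The edge layout is not preserved in this transcript.]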

MESI Coherence Transitions

[State diagram over the MESI states M, E, S, and I. Edges are labeled clzero/-, clzero/BusInvalidate, and BusInvalidate. The edge layout is not preserved in this transcript.]

BusInvalidate edges are external: from another LLC

Methodology

Sniper simulator: 4 cores, 8MB shared L3 (LLC), McPAT
Extensions for JVM: works with JIT compiler; emulate system calls (futex & nanosleep); JVM-simulator communication with new instruction
Jikes RVM 3.1.2 and DaCapo benchmarks: Generational Immix garbage collector; 4 application, 4 GC threads; 2x minimum heap; replay compilation, 2nd invocation

DRAM Writes (8MB nursery)

[Chart: DRAM writes relative to baseline (%), y-axis 0 to 120, per benchmark (antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan) and mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. The slide appears three times in the deck; the plotted values are not preserved in this transcript.]

DRAM Reads (8MB nursery)

[Chart: DRAM reads relative to baseline (%), y-axis 0 to 225, per benchmark (antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan) and mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. The slide appears five times in the deck; the plotted values are not preserved in this transcript.]

Dynamic DRAM Energy (8MB nursery)

[Chart: dynamic DRAM energy reduction (%), y-axis 0 to 80, mean over the benchmarks. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. The slide appears twice in the deck; the plotted values are not preserved in this transcript.]

Total DRAM Energy

[Chart: total DRAM energy reduction (%), y-axis -5 to 25, grouped by 4M, 8M, and 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. One bar is annotated -22%. The slide appears twice in the deck; the other plotted values are not preserved in this transcript.]

Total DRAM Traffic

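[Chart: total DRAM traffic reduction (%), y-axis -50 to 100, grouped by 4M, 8M, and 16M. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. One bar is annotated -14x. The other plotted values are not preserved in this transcript.]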

clclean+clzero Improvements

[Chart: improvements with clclean+clzero, y-axis 0% to 100%, for DRAM reads, DRAM writes, total DRAM traffic, LLC misses, execution time, dynamic DRAM energy, and total DRAM energy. Series: 4MB, 8MB, 16MB. The plotted values are not preserved in this transcript.]

Related Work

Cooperative cache management
ESKIMO by Isen & John, MICRO 09: useless reads and writes to DRAM by sequential C programs; reduce energy; require large map in hardware, extra cache bits
Wang et al., PACT 02 / ISCA 03; Sartor et al., 05: C & Fortran static analysis to give cache hints to evict or keep data

Zero initialization [Yang et al., OOPSLA 11]: studied costs in time, cache and traffic; use non-temporal writes to DRAM, increase bandwidth

Conclusions

Software-hardware cooperative cache scrubbing
Leverages region allocation semantics
Changes to MESI coherence protocol
New multicore architectural simulation methodology
Reductions: 59% traffic, 14% DRAM energy, 4.6% execution time

http://users.elis.ugent.be/~jsartor/

Execution Time (8MB nursery)

[Chart: execution time reduction (%), y-axis 0 to 7, mean over the benchmarks. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. The plotted values are not preserved in this transcript.]

Changes to MESI coherence protocol

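Columns: clinvalidate, clundirty/clclean, clzero, BusInvalidate

State M
clinvalidate: invalidate L1/L2 (no WB), next state I
clundirty/clclean: invalidate L1/L2 (no WB), next state E (clclean: LRU)
clzero: ⁄
BusInvalidate: invalidate L1/L2 (no WB), next state I

State E
clinvalidate: invalidate L1/L2, next state I
clundirty/clclean: invalidate L1/L2 (clclean: LRU)
clzero: next state M
BusInvalidate: invalidate L1/L2, next state I

State S
clinvalidate: invalidate L1/L2, next state I
clundirty/clclean: invalidate L1/L2 (clclean: LRU)
clzero: BusInvalidate, next state M
BusInvalidate: invalidate L1/L2, next state I

State I
clinvalidate: ⁄
clundirty/clclean: ⁄
clzero: BusInvalidate, next state M
BusInvalidate: no entry in the transcript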
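As a rough illustration, the LLC-level state changes in this table can be encoded as a lookup. The encoding rests on two readings that are assumptions: a "⁄" cell is taken to mean "no change", and a cell that lists no next state is taken to leave the state unchanged. `TRANSITIONS`, `EXTERNAL_BUS_INVALIDATE`, and `step` are hypothetical names, and the L1/L2 back-invalidation and LRU side effects are left out.

```python
# (state, operation) -> (next state, bus action issued or None)
TRANSITIONS = {
    ("M", "clinvalidate"): ("I", None),
    ("M", "clclean"): ("E", None),
    ("M", "clzero"): ("M", None),          # "⁄" read as no change (assumption)
    ("E", "clinvalidate"): ("I", None),
    ("E", "clclean"): ("E", None),         # no next state listed (assumed same)
    ("E", "clzero"): ("M", None),
    ("S", "clinvalidate"): ("I", None),
    ("S", "clclean"): ("S", None),         # no next state listed (assumed same)
    ("S", "clzero"): ("M", "BusInvalidate"),
    ("I", "clinvalidate"): ("I", None),    # "⁄" read as no change (assumption)
    ("I", "clclean"): ("I", None),         # "⁄" read as no change (assumption)
    ("I", "clzero"): ("M", "BusInvalidate"),
}

# External BusInvalidate (from another LLC) sends any held copy to I.
EXTERNAL_BUS_INVALIDATE = {"M": "I", "E": "I", "S": "I"}


def step(state, op):
    """Return (next_state, bus_action) for a local scrub or zero operation."""
    if op == "clundirty":
        op = "clclean"  # the table lists both in one column
    return TRANSITIONS[(state, op)]
```

Only `clzero` on a line in S or I issues a bus action in this encoding; every scrub transition is local, matching the "no off-chip" note on the hardware slide.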

Total DRAM Energy (8MB nursery)

[Chart: total DRAM energy reduction (%), y-axis -10 to 60, per benchmark (antlr, avrora, bloat, fop, jython, luindex, lusearch, lusearch.fix, pmd, sunflow, xalan) and mean. Series: clinvalidate, clundirty, clclean, clzero, clclean+clzero. The plotted values are not preserved in this transcript.]

Execution Time Across Nurseries

Execution Time

Dynamic DRAM Energy 8MB Nursery
