Jayesh Gaur*, Alaa Alameldeen**, Sreenivas Subramoney*
*Microarchitecture Research Lab (MRL), Intel India
**Memory Architecture Lab (MAL), Intel Oregon
ISCA 2016, Seoul, Korea


Page 1 (source: isca2016.eecs.umich.edu/wp-content/uploads/2016/07/5A-1.pdf, 2016-07-05):

Jayesh Gaur*, Alaa Alameldeen**, Sreenivas Subramoney*

* Microarchitecture Research Lab (MRL), Intel India

**Memory Architecture Lab (MAL), Intel OregonISCA 2016

Seoul, Korea

Page 2:

Motivation

• Memory continues to be the bottleneck for modern CPUs

• Larger Last Level Cache (LLC) capacity improves performance and power

• Higher hit rates → better performance

• Fewer off-chip DRAM accesses → lower power and energy

• However, this comes at a cost: area and leakage

• Cache compression is an attractive option

• Increased capacity at lower area

But compression can interfere with replacement policies. We present a new compression architecture to address this.

Page 3:

How Does Cache Compression Work?

Prior works have tried to change the SRAM layout to allow compression, but changing a dense, timing-sensitive SRAM layout is difficult.

[Figure: set/way organization with a tag array holding 2X tags, a data array, compression logic on fills, and decompression logic on reads; roughly an 8% increase in area.]

Data is fragmented across the set: how do we associate tags with data?

Page 4:

Agenda

• Practical Architecture for Compressed Cache

• Interaction between Compression and Replacement Policies

• Base-Victim Proposal

• Results

• Performance

• Power

• Conclusions


Page 5:

Creating a Compressed LLC


• Tags per Set are doubled

• Exactly two tags are associated with each way

• Tag hit to data fetch is optimized

• Only 64B of data for every two tags

• Data corresponding to the tags is compressed

[Figure: each of the four ways pairs Tag0 and Tag1 with a single 64B data entry holding Data 0 and Data 1.]

Data-0 size + Data-1 size <= 64B
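The per-way size invariant above can be sketched as a tiny model. This is illustrative only; names such as `CompressedWay` are assumptions, not structures from the paper.

```python
# Sketch of one compressed LLC way: two tags share a single 64B data entry,
# so the two compressed lines must fit into 64 bytes together.

WAY_SIZE = 64  # bytes of physical data storage per way

class CompressedWay:
    def __init__(self):
        self.lines = {}  # tag -> compressed size in bytes (at most two tags)

    def fits(self, size):
        """True if a compressed line of `size` bytes can still be placed here."""
        return len(self.lines) < 2 and sum(self.lines.values()) + size <= WAY_SIZE

    def insert(self, tag, size):
        if not self.fits(size):
            raise ValueError("Data-0 size + Data-1 size must stay <= 64B")
        self.lines[tag] = size

way = CompressedWay()
way.insert("A", 48)
print(way.fits(16))  # True: 48 + 16 <= 64
print(way.fits(24))  # False: 48 + 24 > 64
```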

Page 6:

Performance gain from compression

[Figure: per-trace IPC ratio and DRAM read ratio over baseline, with compression-friendly traces on the left and compression-unfriendly traces on the right.]

Average 12% loss! Hit rates are lower in general. Why did larger LLC capacity lower performance?

Page 7:

Issues with Compressed Cache

Partner line victimization

• The replacement policy is broken because of size limitations

• Performs worse than the baseline, with many negative outliers

[Figure: a 4-way set with per-way data sizes 48+16, 24+40, 64+0, and 32+24 bytes, NRU age bits, and an incoming request of size 24. The LRU candidate way does not have space.]

Allocating into the LRU way will victimize the partner MRU line! We need to increase capacity but also preserve the replacement policy.
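The arithmetic behind partner-line victimization can be checked with a tiny example using the sizes from the slide (the framing and names are illustrative):

```python
WAY_SIZE = 64

# One way of the compressed set: an MRU line of 48B shares the way with
# the 16B line that the replacement policy picked as its LRU victim.
mru_partner_size = 48
lru_victim_size = 16
incoming_size = 24

# Evicting only the LRU line frees just the space that line occupied:
free_space = WAY_SIZE - mru_partner_size
print(free_space)                   # 16 bytes
print(incoming_size <= free_space)  # False

# The 24B incoming line does not fit, so allocating into this way must
# also evict the MRU partner, a line the replacement policy would never
# have chosen on its own.
```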

Page 8:

• Extra capacity (Tag-1 in each way) logically belongs to a Victim Cache

• Tag 0 victims are cached in Tag 1 space

• Base replacement policy strictly maintained in Tag 0

• Guarantees baseline cache hit behavior, performance

• Victim cache is always clean

• Partner line victimization is easy

[Figure: each of the four ways pairs a base (B) line in Tag 0 with a victim (V) line in Tag 1.]

Opportunistic Victim Cache

Page 9:

Compressed LLC: Miss

• A read miss allocates into the Tag0 LRU position

• If the incoming size exceeds the space available in Tag0, the partner line in the Tag1 victim cache is victimized

• A line in the victim cache was never present in the baseline, so losing it cannot be poorer than the baseline

• If a Tag0 victim is created, it is inserted into the Tag1 victim cache

[Figure: example set, entries shown as <Tag, Size>. Baseline cache, ways 0-3: D,24 | C,8 | A,48 | B,24 (LRU); victim cache: F,32 | E,8 | X,16 | Y,32. Incoming request: Z,48.]

Pages 10-14: successive animation frames of the same miss example.

Page 15:

The victim cache is always clean: dirty lines are written back to memory. Data is returned to the core after decompression.
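The miss flow walked through above can be sketched as a small model. This is a simplification with assumed names (`BaseVictimSet`, etc.); real tag and data management is more involved.

```python
WAY_SIZE = 64

class BaseVictimSet:
    """Toy model of one set: Tag0 ('base') strictly follows baseline LRU,
    Tag1 ('victim') is an opportunistic, always-clean victim cache."""

    def __init__(self, num_ways=4):
        self.ways = [{"base": None, "victim": None} for _ in range(num_ways)]
        self.lru = list(range(num_ways))  # head = baseline LRU way

    def allocate_on_miss(self, tag, size, dirty=False):
        way = self.lru.pop(0)              # baseline replacement choice
        slot = self.ways[way]
        # If the incoming line leaves no room, victimize the Tag1 partner
        # (it is clean by construction, so it is simply dropped).
        victim_size = slot["victim"][1] if slot["victim"] else 0
        if size + victim_size > WAY_SIZE:
            slot["victim"] = None
        evicted = slot["base"]             # same victim the baseline would pick
        slot["base"] = (tag, size, dirty)
        self.lru.append(way)               # new line becomes MRU
        if evicted is not None:
            etag, esize, edirty = evicted
            if not edirty:                 # dirty victims go to memory instead
                self._try_cache_victim(etag, esize)
        return evicted

    def _try_cache_victim(self, tag, size):
        # Opportunistically keep the clean Tag0 victim in any way with room.
        for slot in self.ways:
            base_size = slot["base"][1] if slot["base"] else 0
            if slot["victim"] is None and base_size + size <= WAY_SIZE:
                slot["victim"] = (tag, size)
                return True
        return False

s = BaseVictimSet()
for tag, size in [("A", 48), ("B", 24), ("C", 8), ("D", 24)]:
    s.allocate_on_miss(tag, size)
s.allocate_on_miss("Z", 48)               # evicts A; A is kept in a Tag1 slot
print(any(w["victim"] == ("A", 48) for w in s.ways))  # True
```

Because Tag0 always holds exactly what the baseline cache would hold, hit behavior can never be worse than the baseline.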

Page 16:

Compressed LLC: Hit in Victim $

• On a hit in the Tag1 victim cache, the data is moved to Tag0

• The cache behaves as if the read miss were served from memory, but the latency is just an LLC lookup, which gives the performance

• Management of the victim cache is critical and needs more analysis

[Figure: example set, entries shown as <Tag, Size>. Baseline cache, ways 0-3: D,24 | C,56 | A,48 | B,24; victim cache: F,32 | E,8 | X,16 | Y,32. Incoming request: E,8.]
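The victim-hit path can be sketched as follows. Again the structures are assumed and simplified; in particular, eviction of the Tag0 line displaced by the promotion is omitted for brevity.

```python
WAY_SIZE = 64

def promote(ways, lru, line):
    """Refill `line` into the Tag0 way chosen by the baseline policy, as if
    the access had missed and been served from memory (eviction of the old
    Tag0 line omitted for brevity)."""
    tag, size = line
    way = lru.pop(0)
    victim = ways[way]["victim"]
    if victim and size + victim[1] > WAY_SIZE:
        ways[way]["victim"] = None        # partner no longer fits
    ways[way]["base"] = (tag, size)
    lru.append(way)                       # promoted line becomes MRU

def lookup(ways, lru, tag):
    for slot in ways:                     # normal Tag0 hit
        if slot["base"] and slot["base"][0] == tag:
            return "base-hit"
    for slot in ways:                     # Tag1 victim-cache hit
        if slot["victim"] and slot["victim"][0] == tag:
            line = slot["victim"]
            slot["victim"] = None
            promote(ways, lru, line)      # served at LLC-lookup latency
            return "victim-hit"
    return "miss"

ways = [{"base": ("C", 56), "victim": ("E", 8)},
        {"base": ("A", 48), "victim": None}]
lru = [0, 1]
print(lookup(ways, lru, "E"))   # victim-hit
print(ways[0]["base"])          # ('E', 8): moved from Tag1 into Tag0
```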

Pages 17-21: successive animation frames of the same victim-hit example.

Page 22:


Configuration

• x86 core running at 4 GHz

• 2MB, 16-way inclusive LLC per core

• Not Recently Used (NRU) replacement

• DDR3-1600, 15-15-15-34

• Base-Delta-Immediate (BDI) compression

• Decompression latency of 2 cycles

Category        Traces
SPECFP 06       30
SPECINT 06      29
Productivity    14
Client          27
Overall         100

On average, each cache line gets compressed to 55% of its size. Doubling the tags should get most of the gains.
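The flavor of BDI compression can be illustrated with a toy size calculation. Parameters here are illustrative; the real algorithm tries several base and delta widths and handles zero/immediate cases.

```python
def bdi_size(words, delta_bytes=2):
    """Toy Base-Delta-Immediate size: a 64B line of eight 8-byte words is
    stored as one 8-byte base plus one narrow delta per word, if every word
    is within a signed `delta_bytes` delta of the first word; otherwise the
    line stays uncompressed at 64 bytes."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    if all(-limit <= w - base < limit for w in words):
        return 8 + delta_bytes * len(words)
    return 64

# Pointer-like data compresses well: small deltas from a common base.
line = [0x1000 + 8 * i for i in range(8)]
print(bdi_size(line))        # 24 bytes, ~38% of the original 64B

# Unrelated values do not: the line stays uncompressed.
print(bdi_size([0x1000, 0xDEADBEEF, 0, 0, 0, 0, 0, 0]))  # 64
```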

Page 23:

Results: IPC Gain

[Figure: IPC ratio over baseline for SPECFP, SPECINT, Productivity, Client, and Average, shown for compression-friendly traces and overall; bars compare a 3MB uncompressed LLC against Opportunistic Compression, with ratios between roughly 1.03 and 1.13.]

An 8% area addition gives performance equal to a 50% area increase. Good gains across various categories of workloads.

Page 24:

Results: Correlation with Hit Rate Improvement

[Figure: per-trace IPC ratio and DRAM read ratio over baseline, split into compression-friendly and compression-unfriendly traces.]

Hit rate is always >= the baseline hit rate, with no negative outliers. Memory traffic is reduced by 16% on average.

Page 25:

Effect of Baseline Replacement Policy

[Figure: IPC ratio over an NRU baseline for SRRIP, SRRIP + Compression, CHAR, and CHAR + Compression across SPECFP, SPECINT, Productivity, Client, and Average, for compression-friendly traces and overall.]

Good gains with various state-of-the-art replacement policies. The design increases capacity while retaining the benefits of good replacement!

Page 26:

Energy Savings

Power saved in DRAM compensates for the increased power in the LLC. Overall, 6.5% energy savings.

[Figure: per-trace DRAM read ratio and energy ratio (with word enables) over baseline.]

Page 27:


Conclusions

• Cache Compression increases capacity with low area impact

• But compression interferes with replacement policies

• We propose Base-Victim compression

• Opportunistic Victim cache created by compression

• Preserves gains from replacement policies

• No costly SRAM layout changes

• All changes in the cache controller

• ~50% increase in capacity with 8% area addition
