the v-way cache: demand-based associativity via global

Post on 01-Nov-2021

10 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

The V-Way Cache:

Demand-Based Associativity via Global Replacement

Moinuddin K. Qureshi David Thompson Yale N. Patt

Department of Electrical and Computer Engineering,

The University of Texas at Austin.

{moin, dave, patt}@hps.utexas.edu

The International Symposium on Computer Architecture - 2005 1/23

Introduction

Need for efficient management of secondary caches.

Ideal cache: fully associative with OPT replacement.

The International Symposium on Computer Architecture - 2005 2/23

Introduction

Need for efficient management of secondary caches.

Ideal cache: fully associative with OPT replacement.

0

10

20

30

40

50

Per

cent

age

redu

ctio

n in

mis

s ra

te 256kB 16-way + LRU256kB FA + LRU256kB FA + OPT512kB 8-way + LRU

The International Symposium on Computer Architecture - 2005 2/23

Fully Associative Caches: Cost v/s Benefit

Benefits

Conflict miss eliminationGlobal Replacement (finds the best victim)

Cost

Significant increase in the number of tag comparisonsIncreased access latencyIncreased power consumption

The International Symposium on Computer Architecture - 2005 3/23

Fully Associative Caches: Cost v/s Benefit

Benefits

Conflict miss eliminationGlobal Replacement (finds the best victim)

Cost

Significant increase in the number of tag comparisonsIncreased access latencyIncreased power consumption

Can we get the benefits of a fully associative cache without payingthe cost?

The International Symposium on Computer Architecture - 2005 3/23

Outline

Introduction

Example of Local and Global Replacement

The V-Way Cache

Evaluation

Related Work and Conclusion

The International Symposium on Computer Architecture - 2005 4/23

Example of Local Replacement

X = {x0, x1, x2, x3}

Y = {y0, y1, y2, y3}

SET A

SET B

x0 x1 x2 x3

y0 y1 y2 y3

WORKING SETdx0

dx1

dx2

dx3

dy0

dy1

dy2

dy3

DATA−STORETAG−STOREADDRESS

The International Symposium on Computer Architecture - 2005 5/23

Example of Local Replacement

SET A

SET B

x0 x1 x2 x3

y0 y1 y2 y3

WORKING SETdx0

dx1

dx2

dx3

dy0

dy1

dy2

dy3

DATA−STORETAG−STOREADDRESS

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

The International Symposium on Computer Architecture - 2005 5/23

Example of Local Replacement

SET A

SET B

x1 x2 x3

y0 y1 y2 y3

WORKING SET

dx1

dx2

dx3

dy0

dy1

dy2

dy3

DATA−STORETAG−STOREADDRESS

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

x4

dx4

The International Symposium on Computer Architecture - 2005 5/23

Example of Local Replacement

SET A

SET B

x1 x2 x3

y0 y1 y2 y3

WORKING SET

dx1

dx2

dx3

dy0

dy1

dy2

dy3

DATA−STORETAG−STOREADDRESS

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

x4

dx4

DORMANT WAY

THRASH

The International Symposium on Computer Architecture - 2005 5/23

Example of Local Replacement

SET A

SET B

x1 x2 x3

y0 y1 y2 y3

WORKING SET

dx1

dx2

dx3

dy0

dy1

dy2

dy3

DATA−STORETAG−STOREADDRESS

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

x4

dx4

DORMANT WAY

THRASH

Static partitioning of resources.

The International Symposium on Computer Architecture - 2005 5/23

Example of Global Replacement

Y = {y0, y1, y2, y3}

X = {x0, x1, x2, x3}

SET B1

SET A1

SET B0

SET A0 x0 x2

x1 x3

y0 y2

y1 y3

dy0

dx3

dx2

dy3

dy1

dx1

dy2

ADDRESS

dx0

DATA−STORETAG−STORE

WORKING SET

REDISTRIBUTED

The International Symposium on Computer Architecture - 2005 6/23

Example of Global Replacement

SET B1

SET A1

SET B0

SET A0 x0 x2

x1 x3

y0 y2

y1 y3

dy0

dx3

dx2

dy3

dy1

dx1

dy2

ADDRESS

dx0

DATA−STORETAG−STORE

WORKING SET

REDISTRIBUTED

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

The International Symposium on Computer Architecture - 2005 6/23

Example of Global Replacement

SET B1

SET A1

SET B0

SET A0 x0 x2

x1 x3

y0 y2

y1 y3

dy0

dx3

dx2

dy3

dy1

dx1

dy2

ADDRESS

dx0

DATA−STORETAG−STORE

WORKING SET

REDISTRIBUTED

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

The International Symposium on Computer Architecture - 2005 6/23

Example of Global Replacement

SET B1

SET A1

SET B0

SET A0 x0 x2

x1 x3

y0 y2

y1

dy0

dx3

dx2

dy1

dx1

dy2

ADDRESS

dx0

DATA−STORETAG−STORE

WORKING SET

REDISTRIBUTED

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

x4

dx4

The International Symposium on Computer Architecture - 2005 6/23

Example of Global Replacement

SET B1

SET A1

SET B0

SET A0 x0 x2

x1 x3

y0 y2

y1

dy0

dx3

dx2

dy1

dx1

dy2

ADDRESS

dx0

DATA−STORETAG−STORE

WORKING SET

REDISTRIBUTED

X = {x0, x1, x2, x3, x4}

Y = {y0, y1, y2}

x4

dx4

Dynamic sharing of resources!!

The International Symposium on Computer Architecture - 2005 6/23

Outline

Introduction

Example of Local and Global Replacement

The V-Way Cache

Evaluation

Related Work and Conclusion

The International Symposium on Computer Architecture - 2005 7/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

HIT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

HIT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

RPTRV

ARRAY

SCHEME

GLOBAL

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

DATA

HIT

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

MISS

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

MISS

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

MISS

Configuration Tag Access Data Replacement

Set-Associative Fast Local

Fully-Associative

V-Way

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

MISS

Configuration Tag Access Data Replacement

Set-Associative Fast Local

Fully-Associative Slow Global

V-Way

The International Symposium on Computer Architecture - 2005 8/23

The V-Way Cache

TAGCOMPARE

FPTRSELECT

RPTRV

ARRAY

SCHEME

GLOBALDATA

DATA

FPTRTAGSTATUS

REPLACEMENT

INDEX OFFTAG

TAG−STORE DATA−STORE

MISS

Configuration Tag Access Data Replacement

Set-Associative Fast Local

Fully-Associative Slow Global

V-Way Fast Global

The International Symposium on Computer Architecture - 2005 8/23

A Practical Global Replacement Algorithm

LRU is impractical because there are thousands of lines

Second level cache access stream is a filtered version of theprogram access stream

Reuse frequency is skewed towards the low end0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

15+

Reuse count

0

10

20

30

40

Per

cent

age

of L

2 ev

icti

ons

The International Symposium on Computer Architecture - 2005 9/23

Reuse Replacement

00 01 10 11

VICTIMIZE

INITIALIZE

TESTTESTTESTTEST

TABLE

PTR

REUSE COUNTER

ACCESS ACCESS ACCESS

ACCESS

The International Symposium on Computer Architecture - 2005 10/23

Reuse Replacement

00 01 10 11

VICTIMIZE

INITIALIZE

TESTTESTTESTTEST

ACCESS ACCESS ACCESS

ACCESS

TABLE

PTR

REUSE COUNTER

11

01

00

The International Symposium on Computer Architecture - 2005 10/23

Reuse Replacement

00 01 10 11

VICTIMIZE

INITIALIZE

TESTTESTTESTTEST

TABLEREUSE COUNTER

PTR

10

01

00ACCESS ACCESS ACCESS

ACCESS

The International Symposium on Computer Architecture - 2005 10/23

Reuse Replacement

00 01 10 11

VICTIMIZE

INITIALIZE

TESTTESTTESTTEST

ACCESS ACCESS

ACCESS

TABLEREUSE COUNTER

PTR

10

00

00ACCESS

The International Symposium on Computer Architecture - 2005 10/23

Reuse Replacement

00 01 10 11

VICTIMIZE

INITIALIZE

TESTTESTTESTTEST

ACCESS

TABLEREUSE COUNTER

PTR

10

00

00ACCESS ACCESS

ACCESS

The International Symposium on Computer Architecture - 2005 10/23

Victim Distance for Reuse Replacement

Problem of variable replacement latencyAverage victim distance: 3.9Worst case victim distance: 1888

The International Symposium on Computer Architecture - 2005 11/23

Victim Distance for Reuse Replacement

Problem of variable replacement latencyAverage victim distance: 3.9Worst case victim distance: 1888

SolutionTest eight counters each cycleLimit search to five cycles

1 2 3 4 5

98.9%Probability (victim)

Latency (in cycles)

91.3% 96.9% 99.2%98.3%

The International Symposium on Computer Architecture - 2005 11/23

Outline

Introduction

Example of Local and Global Replacement

The V-Way Cache

Evaluation

Related Work and Conclusion

The International Symposium on Computer Architecture - 2005 12/23

Evaluation Outline

Experimental Methodology

Reduction in Misses with the V-Way Cache

Comparing Reuse Replacement and LRU

Storage, Latency, and Energy Cost

Impact on IPC

The International Symposium on Computer Architecture - 2005 13/23

Experimental Methodology

First level I-cache, D-cache: 16kB, 2-way, 64B linesize, LRU

Baseline L2: Unified, 256kB, 8-way, 128B linesize, LRU

Benchmarks: SPEC CPU2000

The International Symposium on Computer Architecture - 2005 14/23

Reduction in Misses with the V-Way Cache

Primary upper bound: Fully associative cache

Secondary upper bound: Double sized cache

The International Symposium on Computer Architecture - 2005 15/23

Reduction in Misses with the V-Way Cache

Primary upper bound: Fully associative cache

Secondary upper bound: Double sized cache

0

10

20

30

40

50

60

70

80

90

100

Per

cent

age

redu

ctio

n in

mis

s ra

te 256kB V-way + Reuse

256kB FA + LRU512kB 8-way + LRU

perlb

mkcra

ftymes

afac

erec

vorte

xap

simcf vp

r

gcc

ammp

twolf

bzip2

parse

rgz

ipsw

imga

lgel

amea

n

The International Symposium on Computer Architecture - 2005 15/23

Comparing Reuse Replacement and LRU

Comparison of miss-rate for LRU and Reuse replacement

Bmk bzip2 crafty gcc gzip mcf parser perl. twolf vortex

LRU 34.6 1.1 3.8 2.4 29.5 32.7 0.1 36.5 8.5

Reuse 35.0 1.0 3.8 2.4 29.9 32.9 0.1 35.4 7.1

Bmk vpr ammp apsi facerec galgel mesa swim amean

LRU 11.0 50.0 34.8 50.7 8.3 3.4 65.3 23.3

Reuse 10.5 50.0 34.8 50.6 8.5 3.5 65.3 23.2

The International Symposium on Computer Architecture - 2005 16/23

Storage, Latency, and Energy Cost

Storage needed for extra tags, FPTR, RPTR, and Reuse bits

Line-size Miss-rate reduction Increase in area

128 B 13.2% 5.8%

256 B 14.9% 2.9%

Delay due to more tags and FPTR selection: 0.13 ns

Energy in accessing bigger tag-store

Parallel lookup Baseline V-Way

1.02nJ 0.35nJ 0.40nJ

The International Symposium on Computer Architecture - 2005 17/23

Storage, Latency, and Energy Cost

Storage needed for extra tags, FPTR, RPTR, and Reuse bits

Line-size Miss-rate reduction Increase in area

128 B 13.2% 5.8%

256 B 14.9% 2.9%

Delay due to more tags and FPTR selection: 0.13 ns

Energy in accessing bigger tag-store

Parallel lookup Baseline V-Way

1.02nJ 0.35nJ 0.40nJ

The International Symposium on Computer Architecture - 2005 17/23

Storage, Latency, and Energy Cost

Storage needed for extra tags, FPTR, RPTR, and Reuse bits

Line-size Miss-rate reduction Increase in area

128 B 13.2% 5.8%

256 B 14.9% 2.9%

Delay due to more tags and FPTR selection: 0.13 ns

Energy in accessing bigger tag-store

Parallel lookup Baseline V-Way

1.02nJ 0.35nJ 0.40nJ

The International Symposium on Computer Architecture - 2005 17/23

Impact on IPC

Pipeline: 12 stage, 8 wide with 128 entry reservation station

L1 hit latency of 2 cycles and L2 hit latency of 10 cycles

L3/Main memory: access-latency of 80 cycles

The International Symposium on Computer Architecture - 2005 18/23

Impact on IPC

Pipeline: 12 stage, 8 wide with 128 entry reservation station

L1 hit latency of 2 cycles and L2 hit latency of 10 cycles

L3/Main memory: access-latency of 80 cycles

0

5

10

15

20

25

30

35

40

45

Per

cent

age

IPC

impr

ovem

ent

over

bas

e

V-Way (same hit latency as base)V-Way (additional one cycle hit latency)

bzip2

crafty gcc

gzip

mcfpa

rser

perlb

mktw

olfvo

rtex

vpr

ammp

apsi

facere

cga

lgel

mesa

swim

gmea

n

The International Symposium on Computer Architecture - 2005 18/23

Outline

Introduction

Example of Local and Global Replacement

The V-Way Cache

Evaluation

Related Work and Conclusion

The International Symposium on Computer Architecture - 2005 19/23

Related Work

Extra storage for conflict misses: Victim cache [Jouppi ISCA’90]

Multi-probe techniquesPredictive sequential associative cache [Calder+ HPCA’96]

Adaptive group associative cache [Peir+ ASPLOS’98]

Cache indexing functionSkewed associativity [Seznec ISCA’93]

Prime-modulo indexing [Kharbutli+ HPCA’04]

Software managed fully associative cache: IIC [Hallnor+ ISCA’00]

The International Symposium on Computer Architecture - 2005 20/23

Other Possible Applications of the V-Way Cache

Platform for global replacement with inbuilt shadow directory

Tag inclusion data exclusion [Piranha ISCA’00]

Cache compression [Hallnor+ HPCA’05]

Interaction with NuRAPID [Chishti+ MICRO’03]

The International Symposium on Computer Architecture - 2005 21/23

Conclusion

Traditional cache assumes uniform accesses across sets

Global replacement allows the V-Way cache to vary thenumber of valid ways depending on the set demand

Reuse replacement is fast and performs comparable to LRU

V-Way cache can lower miss-rate and improve performance

V-Way cache can serve as an infrastructure for otheroptimizations

The International Symposium on Computer Architecture - 2005 22/23

Questions

The International Symposium on Computer Architecture - 2005 23/23

top related