Highly-Associative Caches for Low-Power Processors

Michael Zhang

Krste Asanovic

{rzhang|krste}@lcs.mit.edu

Motivation

Cache uses 30-60% of processor energy in embedded systems
  Example: 43% for StrongARM-1

Many academic studies on cache
  [Albera, Bahar, '98]: Power and performance trade-offs
  [Amrutur, Horowitz, '98, '00]: Speed and power scaling
  [Bellas, Hajj, Polychronopoulos, '99]: Dynamic cache management
  [Ghose, Kamble, '99]: Power reduction through sub-banking, etc.
  [Inoue, Ishihara, Murakami, '99]: Way-predicting set-associative cache
  [Kin, Gupta, Mangione-Smith, '97]: Filter cache
  [Ko, Balsara, Nanda, '98]: Multilevel caches for RISC and CISC
  [Wilton, Jouppi, '94]: CACTI cache model

Many industrial low-power processors use CAM (content-addressable memory)
  ARM3: 64-way set-associative [Furber et al. '89]
  StrongARM: 32-way set-associative [Santhanam et al. '98]
  Intel XScale: 32-way set-associative, '01

CAM: Fast and Energy-Efficient

Talk Outline

Structural Comparison

Area and Delay Comparison

Energy Comparison

Related work

Conclusion

Set-Associative RAM-tag Cache

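Not energy-efficient
  All ways are read out

Two-phase approach
  More energy-efficient
  2X latency

[Figure: set-associative RAM-tag cache with two ways shown. The address is split into Tag, Index, and Offset fields; each way has Tag, Status, and Data arrays and its own tag comparator (=?).]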
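The trade-off on this slide can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the authors' model: the function names, the `AccessCost` record, and the idea of counting array reads as a stand-in for energy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AccessCost:
    hit: bool
    tag_reads: int   # number of tag ways read
    data_reads: int  # number of data ways read
    phases: int      # sequential array accesses on the critical path


def split_address(addr, index_bits, offset_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def ram_tag_access(tags, addr, index_bits, offset_bits, two_phase):
    """Model one read. tags[way][index] is the stored tag, or None if invalid."""
    tag, index, _ = split_address(addr, index_bits, offset_bits)
    ways = len(tags)
    hit = any(tags[w][index] == tag for w in range(ways))
    if two_phase:
        # Phase 1 reads every tag; phase 2 reads only the matching data way.
        return AccessCost(hit, ways, 1 if hit else 0, 2)
    # Conventional access: tags and data of all ways are read in parallel.
    return AccessCost(hit, ways, ways, 1)
```

In this sketch the conventional access reads `ways` data arrays in one phase, while the two-phase access reads at most one data array but takes two sequential phases, which corresponds to the "2X latency" bullet.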

Set-Associative RAM-tag Sub-bank

Not energy-efficient
  All ways are read out

Two-phase approach
  More energy-efficient
  2X latency

Sub-banking
  1 sub-bank = 1 way

Low-swing bitlines
  Only for reads; writes are performed full-swing

Wordline gating

[Figure: RAM-tag sub-bank block diagram. Blocks shown: address decoder (addr input) driving the global wordline (gwl), offset decoders (offset input) driving local wordlines (lwl), tag SRAM cells with tag comparator, data SRAM cells, sense amps, and I/O bus. Labels 32 and 128 appear on the diagram.]

CAM-tag Cache

Only one sub-bank activated

Associativity within sub-bank

Easy to implement high associativity

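[Figure: CAM-tag cache. The address is split into Tag, Bank, and Offset fields; each sub-bank holds Tag, Status, and Data, generates a HIT? signal, and the selected Word is read out.]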
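A minimal behavioral sketch of the lookup described on this slide follows. It is an illustration only: the class and method names are hypothetical, and the 32-byte line size in the usage note is an assumption chosen so that 8 sub-banks of 32 ways give an 8KB cache with 1-KB sub-banks.

```python
def split_cam_address(addr, bank_bits, offset_bits):
    """Split an address into (tag, bank, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    bank = (addr >> offset_bits) & ((1 << bank_bits) - 1)
    tag = addr >> (offset_bits + bank_bits)
    return tag, bank, offset


class CamTagCache:
    def __init__(self, bank_bits, ways, offset_bits):
        self.bank_bits = bank_bits
        self.offset_bits = offset_bits
        # One sub-bank per bank value; each entry is (tag, line) or None.
        self.banks = [[None] * ways for _ in range(1 << bank_bits)]

    def fill(self, addr, line, way):
        """Place a line in the given way of the sub-bank selected by addr."""
        tag, bank, _ = split_cam_address(addr, self.bank_bits, self.offset_bits)
        self.banks[bank][way] = (tag, line)

    def read(self, addr):
        """Return the addressed byte on a hit, or None on a miss."""
        tag, bank, offset = split_cam_address(addr, self.bank_bits, self.offset_bits)
        # Only the selected sub-bank is searched; all of its tags are compared.
        match_lines = [e is not None and e[0] == tag for e in self.banks[bank]]
        if not any(match_lines):
            return None  # miss: no entry matched, so no data is read
        way = match_lines.index(True)
        return self.banks[bank][way][1][offset]
```

For example, `CamTagCache(bank_bits=3, ways=32, offset_bits=5)` models 8 sub-banks that are each 32-way associative. Associativity here is simply the number of entries searched inside one sub-bank, which is why raising it does not require reading more data arrays.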

CAM-tag Cache Sub-bank

Only one sub-bank activated

Associativity within sub-bank

Easy to implement high associativity

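[Figure: CAM-tag sub-bank block diagram. Blocks shown: CAM-tag array (tag input) driving the global wordline (gwl), offset decoders (offset input) driving local wordlines (lwl), SRAM cells, sense amps, and I/O bus. Labels 32 and 128 appear on the diagram.]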

CAM Functionality and Energy Usage

10-T CAM cell with separate write/search lines and low-swing match line

[Figure: CAM cell schematic (SRAM storage plus XOR compare) with signals WL, Bit, Bit_b, SBit, SBit_b, and match, shown once for a match case and once for a mismatch case.]

CAM energy dissipation
  Search lines
  Match lines
  Drivers
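The match and mismatch cases can be sketched behaviorally. This is a logic-level illustration only, assuming each cell XORs its stored bit with the search bit and that any mismatching cell discharges a precharged match line; it does not model the 10-T circuit, the low-swing signaling, or actual energy, and the function names are hypothetical.

```python
def match_line(stored_tag, search_tag, width):
    """Return 1 if the precharged match line stays high, else 0."""
    for i in range(width):
        stored = (stored_tag >> i) & 1
        search = (search_tag >> i) & 1
        if stored ^ search:
            # A single mismatching cell is enough to discharge the line.
            return 0
    return 1


def search(cam_rows, search_tag, width):
    """Broadcast search_tag to every row and collect the match lines."""
    return [match_line(row, search_tag, width) for row in cam_rows]
```

In this model every row sees the broadcast search bits, and each row that mismatches discharges its match line, which lines up with the search lines and match lines listed as energy contributors on the slide.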

CAM-tag Cache Sub-bank Layout

1-KB cache sub-bank implemented in 0.25 µm CMOS technology

10% area overhead over RAM-tag cache

[Figure: sub-bank layout, with blocks labeled "2x12x32 CAM Array" and "32x64 RAM Array".]

Delay Comparison

[Figure: timing diagram of the two critical paths.
  RAM-tag cache critical path stages: global wordline decoding from the index bits (gwl), local wordline decoding with the decoded offset (lwl), tag readout and data readout, tag comparison against the tag bits, data out.
  CAM-tag cache critical path stages: tag bits broadcasting, tag comparison, gwl, local wordline decoding with the decoded offset (lwl), data readout, data out.]

Within 3% of each other

Hit Energy Comparison

[Figure: bar chart of hit energy per access for an 8KB cache, in pJ (axis 0 to 450), versus associativity and implementation (1-way RAM, 2-way RAM, 4-way RAM, 8-way RAM, 8-way CAM, 16-way CAM, 32-way CAM), for benchmarks LZW, ijpeg, pegwit, perl, m88ksim, gcc, and their average.]

Miss Rate Results

[Figure: six plots of miss rate versus associativity (1-way, 2-way, 4-way, 8-way, 16-way, 32-way, 64-way) for 8KB and 16KB caches, one panel each for LZW, ijpeg, m88ksim, perl, pegwit, and gcc.]

Total Access Energy (pegwit)

Pegwit – High miss rate for high associativity

[Figure: total energy per access for an 8KB cache, in pJ (axis 0 to 2500), versus miss energy expressed in multiples of 32-bit read access energy (32X, 64X, 128X, 256X, 512X, 1024X), with one curve each for 1-RAM, 2-RAM, 4-RAM, 8-RAM, 8-CAM, 16-CAM, and 32-CAM.]
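One plausible reading of the "total access energy" plotted here is hit energy plus miss rate times miss energy, with miss energy swept as a multiple of the 32-bit read access energy. The sketch below works through that arithmetic; the formula is an assumption about how the plot was built, and the function names and any numbers passed in are hypothetical, not taken from the slides.

```python
MISS_MULTIPLES = (32, 64, 128, 256, 512, 1024)


def total_energy_per_access(hit_energy_pj, miss_rate, miss_multiple, read32_energy_pj):
    """Assumed model: E_total = E_hit + miss_rate * (multiple * E_read32)."""
    miss_energy_pj = miss_multiple * read32_energy_pj
    return hit_energy_pj + miss_rate * miss_energy_pj


def sweep_miss_multiples(hit_energy_pj, miss_rate, read32_energy_pj):
    """Total energy per access for each miss-energy multiple on the x-axis."""
    return {
        k: total_energy_per_access(hit_energy_pj, miss_rate, k, read32_energy_pj)
        for k in MISS_MULTIPLES
    }
```

Under this model a configuration with a higher hit energy but a lower miss rate overtakes a cheaper-per-hit configuration once the miss-energy multiple is large enough, which is the kind of crossover these sweeps are meant to expose.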

Total Access Energy (perl)

Perl – Very low miss rate for high associativity

[Figure: total energy per access for an 8KB cache, in pJ (axis 0 to 500), versus miss energy expressed in multiples of 32-bit read access energy (32X, 64X, 128X, 256X, 512X, 1024X), with one curve each for 1-RAM, 2-RAM, 4-RAM, 8-RAM, 8-CAM, 16-CAM, and 32-CAM.]

Other Advantages of CAM-tag

Hit signal generated earlier
  Simplifies pipelines

Simplified store operation
  Wordline only enabled during a hit
  Stores can happen in a single cycle
  No write buffer necessary
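The store behavior can be sketched at the logic level. This is a hypothetical illustration, assuming the match line of each row directly gates that row's wordline, so a write can only land on a matching row; the function name and data layout are not from the slides.

```python
def cam_store(tags, data, search_tag, value):
    """Single-step store: write only the row whose tag matches.

    Returns True on a hit (data written), False on a miss (nothing written).
    """
    wrote = False
    for row, stored_tag in enumerate(tags):
        wordline = stored_tag == search_tag  # wordline enabled only on a hit
        if wordline:
            data[row] = value
            wrote = True
    return wrote
```

Because the tag check and the write enable are the same event in this model, there is no separate "check, then write" step to buffer, which is one way to read the "no write buffer necessary" bullet.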

Related Work

CACTI and CACTI2 [Wilton and Jouppi, '94], [Reinman and Jouppi, '99]
  Accurate delay and energy estimate
    Results within 10%
  Energy estimate not suited for low-power designs
  Typical low-power features not included in CACTI
    Sub-banking
    Low-swing bitlines
    Wordline gating
    Separate CAM search line
    Low-swing match lines
  Energy estimate 10X greater than our model for one CAM-tag cache sub-bank

Our results closely agree with [Amrutur and Horowitz, '98]

Conclusion

CAM tags: high performance and low power
  Energy consumption of 32-way CAM < 2-way RAM
  Easy to implement highly-associative tags
  Low area overhead (10%)
  Comparable access delay
  Better CPI by reducing miss rate

Thank You!

http://www.cag.lcs.mit.edu/scale/