cs152 – computer architecture and engineering, lecture 13 – …cs152/fa04/lecnotes/lec7-1.pdf
TRANSCRIPT
CS 152 L13 Cache I () UC Regents Fall 2004 © UCB
2004-10-14 Dave Patterson
(www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
CS152 – Computer Architecture and Engineering
Lecture 13 – Cache I
The Big Picture: Where are we now?
So far: Focus on processor datapath and control.
Next: Focus on the memory system.
[Figure, shown twice: the five components (Processor with Datapath and Control, Memory, Input, Output), with the current focus highlighted.]
Today’s Lecture - Caches
Memory hierarchy
Static memory design
Locality
Cache design
1977: DRAM faster than microprocessors
Apple ][ (1977)
Steve Wozniak, Steve Jobs
CPU: 1000 ns DRAM: 400 ns
Since 1980, CPU has outpaced DRAM ...
CPU: 60% per year (2X in 1.5 years)
DRAM: 9% per year (2X in 10 years)
[Figure: performance (1/latency) vs. year, 1980–2000, log scale from 10 to 1000, with CPU and DRAM curves diverging]
Gap grew 50% per year
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.
Basic Idea: Variable-latency memory port
[Figure: the processor reads from and writes to an upper-level memory (small, fast), which is backed by a lower-level memory (large, slow); block Blk X sits in the upper level, block Blk Y in the lower level]
Data in the upper level is returned with lower latency.
Data in the lower level is returned with higher latency.
Administrivia - Lab 3, HW 3 ...
Lab 3 “no forwarding” Xilinx demo on 10/15 (tomorrow)
Homework 3 due 10/20 (Wednesday), 283 Soda, in CS 152 box at 5 PM
Lab 3 final demo on 10/22 (Friday); report due 10/25 (Monday, 11:59 PM)
2004 Memory Hierarchy: Apple iMac G5
iMac G5: 1.6 GHz, $1299.00

Level:            Reg   L1 Inst  L1 Data  L2    DRAM  Disk
Size:             1K    64K      32K      512K  256M  80G
Latency (cycles): 1     3        3        11    88    1e7

Registers are managed by the compiler; the caches by hardware; DRAM and disk by the OS, hardware, and application.
Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Goal: Illusion of large, fast, cheap memory
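As a sanity check, the cycle counts above can be converted to wall-clock latencies at the 1.6 GHz clock (an illustrative sketch, not from the slides):

```python
# Convert the iMac G5 latency table from cycles to nanoseconds at the
# 1.6 GHz clock (illustrative; the lecture's own figures are rounded).

CLOCK_HZ = 1.6e9  # iMac G5 clock rate

def cycles_to_ns(cycles):
    """Wall-clock latency, in nanoseconds, of the given cycle count."""
    return cycles / CLOCK_HZ * 1e9

# L1 hit: 3 cycles ~ 1.9 ns; L2: 11 cycles ~ 6.9 ns; DRAM: 88 cycles = 55 ns
assert abs(cycles_to_ns(3) - 1.875) < 1e-9
assert abs(cycles_to_ns(11) - 6.875) < 1e-9
assert abs(cycles_to_ns(88) - 55.0) < 1e-9
```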
iMac’s PowerPC 970: All caches on-chip
Registers (1K)
L1 (64K Instruction)
L1 (32K Data)
L2 (512K)
Latency: A closer look
Read latency: Time to return first byte of a random access

Level:            Reg   L1 Inst  L1 Data  L2    DRAM  Disk
Size:             1K    64K      32K      512K  256M  80G
Latency (cycles): 1     3        3        11    88    1e7
Latency (sec):    0.6n  1.9n     1.9n     6.9n  55n   12.5m
Hz (1/latency):   1.6G  533M     533M     145M  18M   80

Architect’s latency toolkit:
(1) Parallelism. Request data from N 1-bit-wide memories at the same time. Overlaps latency cost for all N bits. Provides N times the bandwidth.
(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle, receive it N cycles later.
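Toolkit item (2) can be sketched as a timeline: with N cycles of latency but one request issued per cycle, the first reply takes N cycles and then replies stream back one per cycle (illustrative sketch):

```python
# Toolkit item (2): pipelined memory. One request enters per cycle;
# each completes N cycles after it was issued.

def completion_cycles(num_requests, latency_cycles):
    """Cycle at which each request's data comes back."""
    return [issue + latency_cycles for issue in range(num_requests)]

done = completion_cycles(num_requests=6, latency_cycles=4)
# First word arrives after the full 4-cycle latency; after that the
# memory delivers one word per cycle.
assert done == [4, 5, 6, 7, 8, 9]
```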
Programs with locality cache well ...
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
[Figure: memory address vs. time, one dot per access, from Hatfield and Gerald’s data]
Q. Point out bad locality behavior ...
Annotations on the plot mark a region of spatial locality, a region of temporal locality, and a region marked “Bad”.
The caching algorithm in one slide
Temporal locality: Keep most recently accessed data closer to processor.
Spatial locality: Move contiguous blocks in the address space to upper levels.
[Figure: upper-level memory (holding Blk X) between the processor and a lower-level memory (holding Blk Y)]
Caching terminology
[Figure: upper-level memory (holding Blk X) between the processor and a lower-level memory (holding Blk Y)]
Hit: Data appears in an upper-level block (ex: Blk X).
Miss: Data retrieval from the lower level is needed (ex: Blk Y).
Hit Rate: The fraction of memory accesses found in the upper level.
Miss Rate: 1 - Hit Rate
Hit Time: Time to access the upper level. Includes the hit/miss check.
Miss Penalty: Time to replace a block in the upper level + deliver it to the CPU.
Hit Time << Miss Penalty
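A standard way to combine these terms (implied by the slide, though not written on it) is the average memory access time, AMAT = Hit Time + Miss Rate × Miss Penalty:

```python
# AMAT = Hit Time + Miss Rate * Miss Penalty, using the terms defined
# on this slide (the formula itself is standard, not on the slide).

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as the inputs."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers (not from the lecture): 3-cycle hit time,
# 5% miss rate, 88-cycle miss penalty to DRAM.
assert abs(amat(3, 0.05, 88) - 7.4) < 1e-9
```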
Static Memory Design
Review: Two inverters store a bit
The other elements in a memory circuit control reading and writing
16-transistor circuit. Most transistors implement read/write semantics
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
Holds
value
0 1 0
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
Holds
value
1 0 1
Example: Flip-Flop
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
D Q
CLK
15
For use in arrays: Static RAM (SRAM) cell
Writing a bit: Drive the bit lines with the new data and activate the word line.
[Figure: cell with word line high; bit lines driven to 1 and 0]
!"#$%&'())* ++,!-.)'/ 012-3/414- 56&1'--
!"#$%&#'()'*"+,(-"*.$+&/"0(
1 234)(-'##
1 5$+6'+(7'##(! #"8'+(9'0/&%,:(;&6;'+(7"/%<=&%((((((((((((((((((
1 >"(+'?+'/;(+'@A&+'9(
1 2&*.#'(+'$9(! ?$/%'+($77'//(
1 2%$09$+9(B-(.+"7'//(! 0$%A+$#(?"+(&0%'6+$%&"0(8&%;(#"6&7
1 C34)(-'##
1 2*$##'+(7'##(! ;&6;'+(9'0/&%,:(#"8'+(7"/%<=&%(
1 >''9/(.'+&"9&7(+'?+'/;:($09(+'?+'/;($?%'+(+'$9(
1 -"*.#'D(+'$9(! #"06'+($77'//(%&*'(
1 2.'7&$#(B-(.+"7'//(! 9&??&7A#%(%"(&0%'6+$%'(8&%;(#"6&7(7&+7A&%/
8"+9(#&0'
=&%(#&0' =&%(#&0'
8"+9(#&0'
=&%(#&0'
!"#$%&'()&*$+',,#&#-.#$/#01##-$+',,#&#-0$(#(2&*$0*%#3$'3$0"#$/'0 .#445
bit
word
Reading a bit: Activate the word line and let the cell drive the bit lines.
[Figure: cell with word line high; bit lines settle to 1 and 0]
!"#$%&'())* ++,!-.)'/ 012-3/414- 56&1'--
!"#$%&#'()'*"+,(-"*.$+&/"0(
1 234)(-'##
1 5$+6'+(7'##(! #"8'+(9'0/&%,:(;&6;'+(7"/%<=&%((((((((((((((((((
1 >"(+'?+'/;(+'@A&+'9(
1 2&*.#'(+'$9(! ?$/%'+($77'//(
1 2%$09$+9(B-(.+"7'//(! 0$%A+$#(?"+(&0%'6+$%&"0(8&%;(#"6&7
1 C34)(-'##
1 2*$##'+(7'##(! ;&6;'+(9'0/&%,:(#"8'+(7"/%<=&%(
1 >''9/(.'+&"9&7(+'?+'/;:($09(+'?+'/;($?%'+(+'$9(
1 -"*.#'D(+'$9(! #"06'+($77'//(%&*'(
1 2.'7&$#(B-(.+"7'//(! 9&??&7A#%(%"(&0%'6+$%'(8&%;(#"6&7(7&+7A&%/
8"+9(#&0'
=&%(#&0' =&%(#&0'
8"+9(#&0'
=&%(#&0'
!"#$%&'()&*$+',,#&#-.#$/#01##-$+',,#&#-0$(#(2&*$0*%#3$'3$0"#$/'0 .#445
bit
word
1 0
Putting it all together: an SRAM array
4/12/04 © UCB Spring 2004, CS152 / Kubiatowicz, Lec 19.13
Random Access Memory (RAM) Technology
° Why do computer designers need to know about RAM technology?
• Processor performance is usually limited by memory bandwidth
• As IC densities increase, lots of memory will fit on the processor chip
- Tailor on-chip memory to specific needs
- Instruction cache
- Data cache
- Write buffer
° What makes RAM different from a bunch of flip-flops?
• Density: RAM is much denser
Static RAM Cell
[Figure: 6-Transistor SRAM Cell between the bit and bit’ lines, selected by the word line (row select); in the array cell, one inverter pair is replaced with pullups to save area]
° Write:
1. Drive bit lines (bit=1, bit’=0)
2. Select row
° Read:
1. Precharge bit and bit’ to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit’
Typical SRAM Organization: 16-word x 4-bit
[Figure: a grid of SRAM cells, 16 words of 4 bits. Address bits A0–A3 feed an address decoder that drives word lines Word 0, Word 1, ... Word 15. Each column of cells shares a bit-line pair with a write driver and precharger (inputs Din 0–Din 3, controlled by WrEn and Precharge) at one end and a sense amp (outputs Dout 0–Dout 3) at the other.]
Q: Which is longer: the word line or the bit line?
Logic Diagram of a Typical SRAM
[Figure: a 2^N-words x M-bit SRAM with an N-bit address bus A, an M-bit data bus D, and control inputs WE_L and OE_L]
° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
• A new control signal, output enable (OE_L), is needed
• WE_L is asserted (Low), OE_L is deasserted (High)
- D serves as the data input pin
• WE_L is deasserted (High), OE_L is asserted (Low)
- D is the data output pin
• Both WE_L and OE_L are asserted:
- Result is unknown. Don’t do that!!!
° Although you could change the VHDL to do what you desire, you must do the best with what you’ve got (vs. what you need)
Word and bit lines slow down as the array grows larger! Architects specify the number of rows and columns.
[Figure annotations: four write drivers feed parallel data I/O lines; add muxes to select a subset of bits]
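The shared-D-pin protocol described above can be sketched as a small truth-table model (a hypothetical Python model, not a real SRAM interface):

```python
# Hypothetical model of the shared D pin described above.
# WE_L and OE_L are active low, so False (= low level) means asserted.

def d_pin_mode(we_l, oe_l):
    """Arguments hold the electrical level: False = low = asserted."""
    if not we_l and oe_l:       # WE_L asserted (low), OE_L deasserted
        return "input"          # D carries write data into the SRAM
    if we_l and not oe_l:       # OE_L asserted (low), WE_L deasserted
        return "output"         # SRAM drives read data onto D
    if not we_l and not oe_l:   # both asserted
        return "unknown"        # bus contention: don't do that!
    return "idle"               # neither asserted

assert d_pin_mode(we_l=False, oe_l=True) == "input"
assert d_pin_mode(we_l=True, oe_l=False) == "output"
assert d_pin_mode(we_l=False, oe_l=False) == "unknown"
```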
Cache Design Example
CPU address space: An array of “blocks”
[Figure: memory drawn as 32-byte blocks numbered 0, 1, 2, ... 2^27 - 1]
32-bit Memory Address: the upper 27 bits say which block, the lower 5 bits say which byte (0–31) within the block.
The job of a cache is to hold a “popular” subset of blocks.
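The address split described above can be sketched with shifts and masks (illustrative; the field widths are taken from the slide):

```python
# Split a 32-bit address into block number and byte offset for the
# 32-byte blocks above.

BLOCK_BITS = 5                    # 32-byte blocks -> 5 offset bits

def split_address(addr):
    block = addr >> BLOCK_BITS               # upper 27 bits: which block
    offset = addr & ((1 << BLOCK_BITS) - 1)  # lower 5 bits: which byte
    return block, offset

# Address 0x1041 lives at byte 1 of block 0x82 (0x1041 // 32 == 0x82).
assert split_address(0x1041) == (0x82, 0x01)
```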
One approach: Fully Associative Cache
Address fields: Cache Tag (bits 31–5, 27 bits) and Byte Select (bits 4–0; ex: 0x01).
[Figure: the cache data holds 4 blocks of Byte 0 ... Byte 31. Each block stores a valid bit and a 27-bit block # (“tag”). Comparators (=) match the incoming tag against every stored tag in parallel; a valid match raises Hit and returns bytes of the “hit” cache line.]
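The tag-compare-and-byte-select operation above can be sketched in software (an illustrative model; the hardware performs all the tag compares in parallel):

```python
# Illustrative model of a fully associative lookup: compare the
# incoming tag against every stored tag, check the valid bit, then
# select the requested byte from the matching block.

BLOCK_BITS = 5

class FullyAssociativeCache:
    def __init__(self, num_blocks=4):
        # Each entry: (valid bit, 27-bit tag, 32-byte block of data)
        self.entries = [(False, None, None)] * num_blocks

    def lookup(self, addr):
        tag = addr >> BLOCK_BITS                  # "Which block?"
        offset = addr & ((1 << BLOCK_BITS) - 1)   # "Byte Select"
        for valid, stored_tag, block in self.entries:
            if valid and stored_tag == tag:       # one tag comparator
                return True, block[offset]        # hit: return the byte
        return False, None                        # miss

cache = FullyAssociativeCache()
cache.entries[0] = (True, 0x7, bytes(range(32)))  # install one block
assert cache.lookup((0x7 << 5) | 3) == (True, 3)  # hit, byte 3
assert cache.lookup(0x8 << 5) == (False, None)    # tag mismatch: miss
```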
Conclusions
Program locality is why building a memory hierarchy makes sense
Latency toolkit: hierarchy design, bit-wise parallelism, pipelining.
Cache operation: compare tags, detect hits, select bytes.
In practice: how many rows, how many columns, how many arrays.