cs152 – computer architecture and engineering, lecture 13 – …cs152/fa04/lecnotes/lec7-1.pdf
TRANSCRIPT
CS 152 L13 Cache I () UC Regents Fall 2004 © UCB
2004-10-14 Dave Patterson
(www.cs.berkeley.edu/~patterson)
John Lazzaro (www.cs.berkeley.edu/~lazzaro)
www-inst.eecs.berkeley.edu/~cs152/
CS152 – Computer Architecture and Engineering
Lecture 13 – Cache I
The Big Picture: Where are we now?
So far: Focus on processor datapath and control.
Next: Focus on the memory system.
[Figure, shown twice: the five components (Processor with Datapath and Control, Memory, Input, Output), with the current focus highlighted.]
Today’s Lecture - Caches
Memory hierarchy
Static memory design
Locality
Cache design
1977: DRAM faster than microprocessors
Apple ][ (1977)
Steve Wozniak, Steve Jobs
CPU: 1000 ns DRAM: 400 ns
Since 1980, CPU has outpaced DRAM ...
CPU: 60% per year (2X in 1.5 years)
DRAM: 9% per year (2X in 10 years)
[Figure: performance (1/latency) vs. year, 1980–2000, log scale from 10 to 1000, with CPU and DRAM curves diverging]
Gap grew 50% per year
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.
Basic Idea: Variable-latency memory port
[Figure: the processor reads from and writes to an upper-level memory (small, fast), which is backed by a lower-level memory (large, slow); block Blk X sits in the upper level, block Blk Y in the lower level]
Data in the upper level is returned with lower latency.
Data in the lower level is returned with higher latency.
Administrivia - Lab 3, HW 3 ...
Lab 3 “no forwarding” Xilinx demo on 10/15 (tomorrow)
Homework 3 due 10/20 (Wednesday), 283 Soda, in CS 152 box at 5 PM
Lab 3 final demo on 10/22 (Friday); report due 10/25 (Monday, 11:59 PM)
2004 Memory Hierarchy: Apple iMac G5
iMac G5: 1.6 GHz, $1299.00

Level:            Reg   L1 Inst  L1 Data  L2    DRAM  Disk
Size:             1K    64K      32K      512K  256M  80G
Latency (cycles): 1     3        3        11    88    1e7

Registers are managed by the compiler; the caches by hardware; DRAM and disk by the OS, hardware, and application.
Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Goal: Illusion of large, fast, cheap memory
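As a sanity check, the cycle counts above can be converted to wall-clock latencies at the 1.6 GHz clock (an illustrative sketch, not from the slides):

```python
# Convert the iMac G5 latency table from cycles to nanoseconds at the
# 1.6 GHz clock (illustrative; the lecture's own figures are rounded).

CLOCK_HZ = 1.6e9  # iMac G5 clock rate

def cycles_to_ns(cycles):
    """Wall-clock latency, in nanoseconds, of the given cycle count."""
    return cycles / CLOCK_HZ * 1e9

# L1 hit: 3 cycles ~ 1.9 ns; L2: 11 cycles ~ 6.9 ns; DRAM: 88 cycles = 55 ns
assert abs(cycles_to_ns(3) - 1.875) < 1e-9
assert abs(cycles_to_ns(11) - 6.875) < 1e-9
assert abs(cycles_to_ns(88) - 55.0) < 1e-9
```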
iMac’s PowerPC 970: All caches on-chip
Registers (1K)
L1 (64K Instruction)
L1 (32K Data)
L2 (512K)
Latency: A closer look
Read latency: Time to return first byte of a random access

Level:            Reg   L1 Inst  L1 Data  L2    DRAM  Disk
Size:             1K    64K      32K      512K  256M  80G
Latency (cycles): 1     3        3        11    88    1e7
Latency (sec):    0.6n  1.9n     1.9n     6.9n  55n   12.5m
Hz (1/latency):   1.6G  533M     533M     145M  18M   80

Architect’s latency toolkit:
(1) Parallelism. Request data from N 1-bit-wide memories at the same time. Overlaps latency cost for all N bits. Provides N times the bandwidth.
(2) Pipeline memory. If memory has N cycles of latency, issue a request each cycle, receive it N cycles later.
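Toolkit item (2) can be sketched as a timeline: with N cycles of latency but one request issued per cycle, the first reply takes N cycles and then replies stream back one per cycle (illustrative sketch):

```python
# Toolkit item (2): pipelined memory. One request enters per cycle;
# each completes N cycles after it was issued.

def completion_cycles(num_requests, latency_cycles):
    """Cycle at which each request's data comes back."""
    return [issue + latency_cycles for issue in range(num_requests)]

done = completion_cycles(num_requests=6, latency_cycles=4)
# First word arrives after the full 4-cycle latency; after that the
# memory delivers one word per cycle.
assert done == [4, 5, 6, 7, 8, 9]
```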
Programs with locality cache well ...
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
[Figure: memory address vs. time, one dot per access, from Hatfield and Gerald’s data]
Q. Point out bad locality behavior ...
Annotations on the plot mark a region of spatial locality, a region of temporal locality, and a region marked “Bad”.
The caching algorithm in one slide
Temporal locality: Keep most recently accessed data closer to processor.
Spatial locality: Move contiguous blocks in the address space to upper levels.
[Figure: upper-level memory (holding Blk X) between the processor and a lower-level memory (holding Blk Y)]
Caching terminology
[Figure: upper-level memory (holding Blk X) between the processor and a lower-level memory (holding Blk Y)]
Hit: Data appears in an upper-level block (ex: Blk X).
Miss: Data retrieval from the lower level is needed (ex: Blk Y).
Hit Rate: The fraction of memory accesses found in the upper level.
Miss Rate: 1 - Hit Rate
Hit Time: Time to access the upper level. Includes the hit/miss check.
Miss Penalty: Time to replace a block in the upper level + deliver it to the CPU.
Hit Time << Miss Penalty
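A standard way to combine these terms (implied by the slide, though not written on it) is the average memory access time, AMAT = Hit Time + Miss Rate × Miss Penalty:

```python
# AMAT = Hit Time + Miss Rate * Miss Penalty, using the terms defined
# on this slide (the formula itself is standard, not on the slide).

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same units as the inputs."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers (not from the lecture): 3-cycle hit time,
# 5% miss rate, 88-cycle miss penalty to DRAM.
assert abs(amat(3, 0.05, 88) - 7.4) < 1e-9
```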
Static Memory Design
Review: Two inverters store a bit
The other elements in a memory circuit control reading and writing
16-transistor circuit. Most transistors implement read/write semantics
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
Holds
value
0 1 0
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
Holds
value
1 0 1
Example: Flip-Flop
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
!"#$%&'())* ++,!-.)'/ 012-)34$5$%& 67&1'-8
!"#$%&'(&)#'*+,#-*.
/ 0"12*&1'3" 4".2#1.&,4-3&5"#$%&
164-276&!"#$% #$1869
/ :#-8;&1-&<&5"#$% 4".2#1.&,4-3&
5"#$%&164-276&$&'()* #$1869
!
8#;
<
."12*&1'3" 8#-8;&1-&<&5"#$%
8#;
8#;=
8#;
8#;
8#;=
8#;=
8#;
8#;=
D Q
CLK
15
For use in arrays: Static RAM (SRAM) cell
Writing a bit: Drive the bit lines with the new data and activate the word line.
[Figure: cell with word line high; bit lines driven to 1 and 0]
!"#$%&'())* ++,!-.)'/ 012-3/414- 56&1'--
!"#$%&#'()'*"+,(-"*.$+&/"0(
1 234)(-'##
1 5$+6'+(7'##(! #"8'+(9'0/&%,:(;&6;'+(7"/%<=&%((((((((((((((((((
1 >"(+'?+'/;(+'@A&+'9(
1 2&*.#'(+'$9(! ?$/%'+($77'//(
1 2%$09$+9(B-(.+"7'//(! 0$%A+$#(?"+(&0%'6+$%&"0(8&%;(#"6&7
1 C34)(-'##
1 2*$##'+(7'##(! ;&6;'+(9'0/&%,:(#"8'+(7"/%<=&%(
1 >''9/(.'+&"9&7(+'?+'/;:($09(+'?+'/;($?%'+(+'$9(
1 -"*.#'D(+'$9(! #"06'+($77'//(%&*'(
1 2.'7&$#(B-(.+"7'//(! 9&??&7A#%(%"(&0%'6+$%'(8&%;(#"6&7(7&+7A&%/
8"+9(#&0'
=&%(#&0' =&%(#&0'
8"+9(#&0'
=&%(#&0'
!"#$%&'()&*$+',,#&#-.#$/#01##-$+',,#&#-0$(#(2&*$0*%#3$'3$0"#$/'0 .#445
bit
word
Reading a bit: Activate the word line and let the cell drive the bit lines.
[Figure: cell with word line high; bit lines settle to 1 and 0]
!"#$%&'())* ++,!-.)'/ 012-3/414- 56&1'--
!"#$%&#'()'*"+,(-"*.$+&/"0(
1 234)(-'##
1 5$+6'+(7'##(! #"8'+(9'0/&%,:(;&6;'+(7"/%<=&%((((((((((((((((((
1 >"(+'?+'/;(+'@A&+'9(
1 2&*.#'(+'$9(! ?$/%'+($77'//(
1 2%$09$+9(B-(.+"7'//(! 0$%A+$#(?"+(&0%'6+$%&"0(8&%;(#"6&7
1 C34)(-'##
1 2*$##'+(7'##(! ;&6;'+(9'0/&%,:(#"8'+(7"/%<=&%(
1 >''9/(.'+&"9&7(+'?+'/;:($09(+'?+'/;($?%'+(+'$9(
1 -"*.#'D(+'$9(! #"06'+($77'//(%&*'(
1 2.'7&$#(B-(.+"7'//(! 9&??&7A#%(%"(&0%'6+$%'(8&%;(#"6&7(7&+7A&%/
8"+9(#&0'
=&%(#&0' =&%(#&0'
8"+9(#&0'
=&%(#&0'
!"#$%&'()&*$+',,#&#-.#$/#01##-$+',,#&#-0$(#(2&*$0*%#3$'3$0"#$/'0 .#445
bit
word
1 0
Putting it all together: an SRAM array
4/12/04 © UCB Spring 2004, CS152 / Kubiatowicz, Lec 19.13
Random Access Memory (RAM) Technology
° Why do computer designers need to know about RAM technology?
• Processor performance is usually limited by memory bandwidth
• As IC densities increase, lots of memory will fit on the processor chip
- Tailor on-chip memory to specific needs
- Instruction cache
- Data cache
- Write buffer
° What makes RAM different from a bunch of flip-flops?
• Density: RAM is much denser
Static RAM Cell
[Figure: 6-Transistor SRAM Cell between the bit and bit’ lines, selected by the word line (row select); in the array cell, one inverter pair is replaced with pullups to save area]
° Write:
1. Drive bit lines (bit=1, bit’=0)
2. Select row
° Read:
1. Precharge bit and bit’ to Vdd or Vdd/2 => make sure equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on column detects difference between bit and bit’
Typical SRAM Organization: 16-word x 4-bit
[Figure: a grid of SRAM cells, 16 words of 4 bits. Address bits A0–A3 feed an address decoder that drives word lines Word 0, Word 1, ... Word 15. Each column of cells shares a bit-line pair with a write driver and precharger (inputs Din 0–Din 3, controlled by WrEn and Precharge) at one end and a sense amp (outputs Dout 0–Dout 3) at the other.]
Q: Which is longer: the word line or the bit line?
Logic Diagram of a Typical SRAM
[Figure: a 2^N-words x M-bit SRAM with an N-bit address bus A, an M-bit data bus D, and control inputs WE_L and OE_L]
° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
• A new control signal, output enable (OE_L), is needed
• WE_L is asserted (Low), OE_L is deasserted (High)
- D serves as the data input pin
• WE_L is deasserted (High), OE_L is asserted (Low)
- D is the data output pin
• Both WE_L and OE_L are asserted:
- Result is unknown. Don’t do that!!!
° Although you could change the VHDL to do what you desire, you must do the best with what you’ve got (vs. what you need)
Word and bit lines slow down as the array grows larger! Architects specify the number of rows and columns.
[Figure annotations: four write drivers feed parallel data I/O lines; add muxes to select a subset of bits]
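The shared-D-pin protocol described above can be sketched as a small truth-table model (a hypothetical Python model, not a real SRAM interface):

```python
# Hypothetical model of the shared D pin described above.
# WE_L and OE_L are active low, so False (= low level) means asserted.

def d_pin_mode(we_l, oe_l):
    """Arguments hold the electrical level: False = low = asserted."""
    if not we_l and oe_l:       # WE_L asserted (low), OE_L deasserted
        return "input"          # D carries write data into the SRAM
    if we_l and not oe_l:       # OE_L asserted (low), WE_L deasserted
        return "output"         # SRAM drives read data onto D
    if not we_l and not oe_l:   # both asserted
        return "unknown"        # bus contention: don't do that!
    return "idle"               # neither asserted

assert d_pin_mode(we_l=False, oe_l=True) == "input"
assert d_pin_mode(we_l=True, oe_l=False) == "output"
assert d_pin_mode(we_l=False, oe_l=False) == "unknown"
```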
Cache Design Example
CPU address space: An array of “blocks”
[Figure: memory drawn as 32-byte blocks numbered 0, 1, 2, ... 2^27 - 1]
32-bit Memory Address: the upper 27 bits say which block, the lower 5 bits say which byte (0–31) within the block.
The job of a cache is to hold a “popular” subset of blocks.
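The address split described above can be sketched with shifts and masks (illustrative; the field widths are taken from the slide):

```python
# Split a 32-bit address into block number and byte offset for the
# 32-byte blocks above.

BLOCK_BITS = 5                    # 32-byte blocks -> 5 offset bits

def split_address(addr):
    block = addr >> BLOCK_BITS               # upper 27 bits: which block
    offset = addr & ((1 << BLOCK_BITS) - 1)  # lower 5 bits: which byte
    return block, offset

# Address 0x1041 lives at byte 1 of block 0x82 (0x1041 // 32 == 0x82).
assert split_address(0x1041) == (0x82, 0x01)
```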
One approach: Fully Associative Cache
Address fields: Cache Tag (bits 31–5, 27 bits) and Byte Select (bits 4–0; ex: 0x01).
[Figure: the cache data holds 4 blocks of Byte 0 ... Byte 31. Each block stores a valid bit and a 27-bit block # (“tag”). Comparators (=) match the incoming tag against every stored tag in parallel; a valid match raises Hit and returns bytes of the “hit” cache line.]
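The tag-compare-and-byte-select operation above can be sketched in software (an illustrative model; the hardware performs all the tag compares in parallel):

```python
# Illustrative model of a fully associative lookup: compare the
# incoming tag against every stored tag, check the valid bit, then
# select the requested byte from the matching block.

BLOCK_BITS = 5

class FullyAssociativeCache:
    def __init__(self, num_blocks=4):
        # Each entry: (valid bit, 27-bit tag, 32-byte block of data)
        self.entries = [(False, None, None)] * num_blocks

    def lookup(self, addr):
        tag = addr >> BLOCK_BITS                  # "Which block?"
        offset = addr & ((1 << BLOCK_BITS) - 1)   # "Byte Select"
        for valid, stored_tag, block in self.entries:
            if valid and stored_tag == tag:       # one tag comparator
                return True, block[offset]        # hit: return the byte
        return False, None                        # miss

cache = FullyAssociativeCache()
cache.entries[0] = (True, 0x7, bytes(range(32)))  # install one block
assert cache.lookup((0x7 << 5) | 3) == (True, 3)  # hit, byte 3
assert cache.lookup(0x8 << 5) == (False, None)    # tag mismatch: miss
```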
Conclusions
Program locality is why building a memory hierarchy makes sense
Latency toolkit: hierarchy design, bit-wise parallelism, pipelining.
Cache operation: compare tags, detect hits, select bytes.
In practice: how many rows, how many columns, how many arrays.