
Page 1: Caches and Virtual Memory

ENGS 116 Lecture 13 1

Caches and Virtual Memory

Vincent H. Berk

October 31st, 2008

Reading for Today: Sections C.1 – C.3 (Jouppi article)

Reading for Monday: Sections C.4 – C.7

Reading for Wednesday: Sections 5.1 – 5.3

Page 2: Caches and Virtual Memory

ENGS 116 Lecture 12 2

Improving Cache Performance

• Average memory-access time (AMAT) = Hit time + Miss rate × Miss penalty (ns or clocks); a small worked example follows the list below

• Improve performance by:

1. Reducing the miss rate

2. Reducing the miss penalty

3. Reducing the time to hit in the cache
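
To make the formula concrete, here is a minimal sketch in C; the numbers are illustrative, not taken from the lecture:

    #include <stdio.h>

    /* AMAT = Hit time + Miss rate * Miss penalty (all in clock cycles here) */
    double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Hypothetical cache: 1-cycle hit, 5% miss rate, 50-cycle penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 50.0)); /* 3.50 */
        return 0;
    }

Shrinking any one of the three terms shrinks AMAT, which is exactly the structure of the list above.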

Page 3: Caches and Virtual Memory

ENGS 116 Lecture 12 3

Reducing Miss Rate

• Larger blocks
• Larger cache
• Higher associativity

Page 4: Caches and Virtual Memory

ENGS 116 Lecture 12 4

Classifying Misses: 3 Cs

• Compulsory: The very first access to a block cannot be in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses that occur even in an infinite cache)

• Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in fully associative, size X cache)

• Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way set-associative cache of size X; the sketch below shows how such collisions arise)
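
A minimal sketch of a collision, using a hypothetical 8 KB direct-mapped cache with 32-byte blocks (parameters invented for illustration):

    #include <stdio.h>

    /* Set index for a set-associative cache:
       index = (address / block_size) % number_of_sets */
    unsigned set_index(unsigned addr, unsigned block_size, unsigned num_sets)
    {
        return (addr / block_size) % num_sets;
    }

    int main(void)
    {
        /* Hypothetical 8 KB direct-mapped cache, 32-byte blocks: 256 sets */
        unsigned a = 0x0000, b = 0x2000; /* addresses 8 KB apart */
        printf("index(a) = %u, index(b) = %u\n",
               set_index(a, 32, 256), set_index(b, 32, 256));
        /* Both print 0: alternating accesses to a and b conflict-miss
           forever in this cache, yet would coexist in one set of a
           2-way set-associative cache of the same total size. */
        return 0;
    }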

Page 5: Caches and Virtual Memory

ENGS 116 Lecture 12 5

3Cs Absolute Miss Rate (SPEC92)

[Figure: Miss rate per type (0 to 0.14) vs. cache size (1 to 128 KB) for SPEC92, with curves for 1-, 2-, 4-, and 8-way associativity. The stacked components are conflict, capacity, and compulsory misses; the compulsory component is vanishingly small.]

Page 6: Caches and Virtual Memory

ENGS 116 Lecture 12 6

2:1 Cache Rule

[Figure: The same miss-rate-per-type breakdown (conflict, capacity, compulsory) vs. cache size (1 to 128 KB) for 1-, 2-, 4-, and 8-way associativity, illustrating the rule below.]

Miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2

Page 7: Caches and Virtual Memory

ENGS 116 Lecture 12 7

3Cs Relative Miss Rate

[Figure: The same data normalized: miss rate per type as a percentage (0% to 100%) of the total vs. cache size (1 to 128 KB), for 1-, 2-, 4-, and 8-way associativity, split into conflict, capacity, and compulsory components.]

Flaws: assumes a fixed block size. Good: provides insight.

Page 8: Caches and Virtual Memory

ENGS 116 Lecture 12 8

How Can We Reduce Misses?

• 3 Cs: Compulsory, Capacity, Conflict

• In all cases, assume total cache size not changed

• What happens if we:

1) Change Block Size: Which of 3Cs is obviously affected?

2) Change Associativity: Which of 3Cs is obviously affected?

3) Change Compiler: Which of 3Cs is obviously affected?

Page 9: Caches and Virtual Memory

ENGS 116 Lecture 12 9

1. Reduce Misses via Larger Block Size

[Figure: Miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K to 256K. Larger blocks first reduce the miss rate, then increase it for the smaller caches.]

Page 10: Caches and Virtual Memory

ENGS 116 Lecture 12 10

2. Reduce Misses: Larger Cache Size

• Obvious improvement, but:

– Longer hit time

– Higher cost

• Each cache size favors a block size, based on memory bandwidth

AMAT = Hit time + Miss rate × Miss penalty (ns or clocks)

Page 11: Caches and Virtual Memory

ENGS 116 Lecture 12 11

3. Reduce Misses via Higher Associativity

• 2:1 Cache Rule:

– Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2

• Beware: Execution time is final measure!

– Will clock cycle time increase?

• 8-Way is almost fully associative

Page 12: Caches and Virtual Memory

ENGS 116 Lecture 12 12

Example: Avg. Memory Access Time vs. Miss Rate

• Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. CCT of direct mapped

Cache Size (KB)   1-way   2-way   4-way   8-way
      1           2.33    2.15    2.07    2.01
      2           1.98    1.86    1.76    1.68
      4           1.72    1.67    1.61    1.53
      8           1.46    1.48    1.47    1.43
     16           1.29    1.32    1.32    1.32
     32           1.20    1.24    1.25    1.27
     64           1.14    1.20    1.21    1.23
    128           1.10    1.17    1.18    1.20

(Entries shown in red on the original slide mark cases where A.M.A.T. is not improved by more associativity.)
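
Each entry follows from AMAT = CCT × Hit time + Miss rate × Miss penalty. A sketch of the computation in C; the clock-cycle-time factors come from the slide, but the miss rates and penalty below are placeholders, not the SPEC92 data behind the table:

    #include <stdio.h>

    /* AMAT in units of the direct-mapped clock cycle: higher associativity
       stretches the cycle by cct, so a hit costs cct "cycles". */
    double amat(double cct, double miss_rate, double miss_penalty)
    {
        return cct * 1.0 + miss_rate * miss_penalty;
    }

    int main(void)
    {
        double cct[4]  = { 1.00, 1.10, 1.12, 1.14 };     /* 1-, 2-, 4-, 8-way */
        double rate[4] = { 0.050, 0.041, 0.038, 0.037 }; /* hypothetical miss rates */
        for (int i = 0; i < 4; i++)
            printf("%d-way: AMAT = %.2f\n",
                   1 << i, amat(cct[i], rate[i], 25.0)); /* 25-cycle penalty assumed */
        return 0;
    }

The tension in the table is visible in the code: associativity lowers the miss-rate term but raises the hit-time term, and for the larger caches the second effect wins.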

Page 13: Caches and Virtual Memory

ENGS 116 Lecture 12 13

Reducing Miss Penalty

• Multilevel caches
• Read priority over write

AMAT = Hit time + Miss rate × Miss penalty (ns or clocks)

Page 14: Caches and Virtual Memory

ENGS 116 Lecture 12 14

1. Reduce Miss Penalty: L2 Caches

• L2 equations:

AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)

• Definitions:

– Local miss rate — misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)

– Global miss rate — misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)

– Global miss rate is what matters — it indicates what fraction of memory accesses from the CPU go all the way to main memory (see the sketch below)
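
A sketch of the L2 equations and the local/global distinction in C (parameter values illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative parameters, in clock cycles */
        double hit_l1 = 1.0,  miss_rate_l1 = 0.04; /* 4% of CPU accesses miss L1 */
        double hit_l2 = 10.0, miss_rate_l2 = 0.25; /* local: 25% of L1 misses miss L2 */
        double penalty_l2 = 100.0;                 /* main-memory access */

        /* Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 * Miss Penalty_L2 */
        double penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;

        /* AMAT = Hit Time_L1 + Miss Rate_L1 * Miss Penalty_L1 */
        double amat = hit_l1 + miss_rate_l1 * penalty_l1;

        /* Global L2 miss rate: fraction of ALL CPU accesses reaching memory */
        double global_l2 = miss_rate_l1 * miss_rate_l2;

        printf("AMAT = %.2f cycles\n", amat);              /* 1 + 0.04*35 = 2.40 */
        printf("global L2 miss rate = %.3f\n", global_l2); /* 0.010 */
        return 0;
    }

Note how misleading the local rate is here: 25% sounds terrible, yet only 1% of CPU accesses actually go all the way to main memory.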

Page 15: Caches and Virtual Memory

ENGS 116 Lecture 12 15

Comparing Local and Global Miss Rates

• 32 KByte 1st-level cache; increasing 2nd-level cache size

• Global miss rate close to single level cache rate provided L2 >> L1

• Don’t use local miss rate

• L2 not tied to CPU clock cycle!

• Cost & A.M.A.T.

• Generally fast hit times and fewer misses

• Since hits are few, target miss reduction

Page 16: Caches and Virtual Memory

ENGS 116 Lecture 12 16

L2 Cache Block Size & A.M.A.T.

• 32 KB L1, 8-byte path to memory

Relative CPU time vs. L2 block size:

Block size (bytes):   16    32    64    128   256   512
Relative CPU time:    1.36  1.28  1.27  1.34  1.54  1.95

(64-byte blocks give the minimum.)

Page 17: Caches and Virtual Memory

ENGS 116 Lecture 12 17

2. Reduce Miss Penalty: Read Priority over Write on Miss

• Write-through caches with write buffers can create RAW conflicts between main-memory reads on cache misses and still-buffered writes

• Simply waiting for the write buffer to empty might increase the read-miss penalty (by 50% on the old MIPS 1000)

• Instead, check the write-buffer contents before the read; if there are no conflicts, let the memory access continue (see the sketch below)

• Write Back?

– Read miss replacing dirty block

– Normal: Write dirty block to memory, and then do the read

– Instead copy the dirty block to a write buffer, then do the read, and then do the write

– The CPU stalls less frequently, since it restarts as soon as the read finishes
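
A sketch of the conflict check on a read miss, assuming a simple four-entry write buffer (the structure and names are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    struct wb_entry {
        bool     valid;
        uint32_t block_addr; /* address of the buffered block */
    };

    struct wb_entry write_buffer[WB_ENTRIES];

    /* On a read miss, scan the buffer instead of draining it. Only a
       matching buffered write is a RAW hazard; otherwise the memory
       read may start immediately. */
    bool read_conflicts_with_write_buffer(uint32_t block_addr)
    {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
                return true;  /* forward the data or drain this entry first */
        return false;         /* safe: let the read go to memory now */
    }

    int main(void)
    {
        write_buffer[0] = (struct wb_entry){ true, 0x1000 };
        /* A read of 0x1000 must wait or forward; a read of 0x2000 proceeds */
        return read_conflicts_with_write_buffer(0x2000) ? 1 : 0;
    }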

Page 18: Caches and Virtual Memory

ENGS 116 Lecture 12 18

Reducing Hit Time

• Avoiding address translation during cache indexing

AMAT = Hit time + Miss rate × Miss penalty (ns or clocks)

Page 19: Caches and Virtual Memory

ENGS 116 Lecture 13 19

1. Fast Hits by Avoiding Address Translation

• Send the virtual address to the cache? Called a Virtually Addressed Cache or Virtual Cache, vs. a Physical Cache

– Every time the process is switched, the cache logically must be flushed; otherwise we get false hits

» Cost is the time to flush + “compulsory” misses from the empty cache

– Must handle aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address

• Solution to aliases

– HW guarantees each block a unique physical address, OR page coloring is used to ensure virtual and physical addresses match in the last x bits (see the sketch below)

• Solution to cache flush

– Add a process-identifier tag that identifies the process as well as the address within the process: cannot get a hit if the wrong process
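
The page-coloring guarantee reduces to bit arithmetic: if virtual and physical addresses agree in the low bits the cache uses, every alias of a physical page indexes the same set. A sketch, assuming a hypothetical 14-bit color (e.g., a 16 KB direct-mapped cache):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Assume cache index + block offset occupy the low 14 bits; page
       coloring then requires VA and PA to match in those bits. */
    #define COLOR_BITS 14
    #define COLOR_MASK ((1u << COLOR_BITS) - 1)

    bool mapping_ok(uint32_t va, uint32_t pa)
    {
        return (va & COLOR_MASK) == (pa & COLOR_MASK);
    }

    int main(void)
    {
        /* Two candidate physical frames for the same virtual page */
        printf("%d\n", mapping_ok(0x0001A000, 0x0005A000)); /* 1: low bits match */
        printf("%d\n", mapping_ok(0x0001A000, 0x00044000)); /* 0: OS must not pick this frame */
        return 0;
    }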

Page 20: Caches and Virtual Memory

ENGS 116 Lecture 13 20

Virtually Addressed Caches

[Figure: Three organizations.
(1) Conventional: CPU → TB (translation buffer) → cache → MEM; the VA is translated to a PA before the cache is accessed.
(2) Virtually addressed cache: CPU → cache (VA tags) → TB → MEM; translation happens only on a miss, which creates the synonym problem.
(3) Overlapped: cache access (PA tags) proceeds in parallel with VA translation, backed by an L2 cache; this requires the cache index to remain invariant across translation.]

Page 21: Caches and Virtual Memory

ENGS 116 Lecture 13 21

2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address

• If the index is the physical part of the address, tag access can start in parallel with translation, and the result is compared against the physical tag

• This limits the cache to the page size: what if we want bigger caches while using the same trick?

– Higher associativity moves the barrier to the right (see the sketch below)

– Page coloring (the OS requires that all aliases share their lower address bits; in effect, set-associative pages!)

[Figure: 32-bit address split. Bits 31–12: page address; bits 11–0: page offset. The cache's address tag, index, and block offset are drawn below, with the index and block offset falling entirely within the page offset.]
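
The "barrier" is easy to compute: the index and block-offset bits must fit inside the page offset, so the cache can only grow with associativity. A sketch:

    #include <stdio.h>

    /* Virtually indexed, physically tagged cache:
       cache_size / associativity <= page_size,
       i.e. max cache size = page_size * associativity. */
    unsigned max_cache_bytes(unsigned page_size, unsigned associativity)
    {
        return page_size * associativity;
    }

    int main(void)
    {
        /* With 4 KB pages: direct mapped tops out at 4 KB, 8-way at 32 KB */
        printf("1-way: %u KB\n", max_cache_bytes(4096, 1) / 1024);
        printf("8-way: %u KB\n", max_cache_bytes(4096, 8) / 1024);
        return 0;
    }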

Page 22: Caches and Virtual Memory

ENGS 116 Lecture 14 22

Virtual Memory

• Virtual address (2^32, 2^64) to physical address (2^28) mapping

• Virtual memory in terms of cache:

– Cache block?

– Cache miss?

• How is virtual memory different from caches?

– What controls replacement

– Size (transfer unit, mapping mechanisms)

– Lower-level use

Page 23: Caches and Virtual Memory

ENGS 116 Lecture 14 23

Figure 5.36 The logical program in its contiguous virtual address space is shown on the left; it consists of four pages A, B, C, and D.

[Figure: Virtual addresses 0–16K hold pages A, B, C, and D; physical main memory (0–28K) holds C, A, and B at non-contiguous physical addresses, while page D resides on disk.]

Page 24: Caches and Virtual Memory

ENGS 116 Lecture 14 24

Figure 5.37 Typical ranges of parameters for caches and virtual memory.

Parameter           First-level cache        Virtual memory
Block (page) size   16 – 128 bytes           4096 – 65,536 bytes
Hit time            1 – 2 clock cycles       40 – 100 clock cycles
Miss penalty        8 – 100 clock cycles     700,000 – 6,000,000 clock cycles
  (access time)     (6 – 60 clock cycles)    (500,000 – 4,000,000 clock cycles)
  (transfer time)   (2 – 40 clock cycles)    (200,000 – 2,000,000 clock cycles)
Miss rate           0.5 – 10%                0.00001 – 0.001%
Data memory size    0.016 – 1 MB             16 – 8192 MB

Page 25: Caches and Virtual Memory

ENGS 116 Lecture 14 25

Virtual Memory

• 4 Questions for Virtual Memory (VM)?

– Q1: Where can a block be placed in the upper level?

fully associative, set associative, or direct mapped?

– Q2: How is a block found if it is in the upper level?

– Q3: Which block should be replaced on a miss?

random or LRU?

– Q4: What happens on a write?

write back or write through?

• Other issues: size; pages or segments or hybrid

Page 26: Caches and Virtual Memory

ENGS 116 Lecture 14 26

Figure 5.40 The mapping of a virtual address to a physical address via a page table.

[Figure: The virtual address splits into a virtual page number and a page offset. The virtual page number indexes the page table, which supplies the physical page frame; concatenated with the unchanged page offset, this forms the physical address into main memory.]
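
A minimal sketch of this translation in C, assuming a flat one-level table, 4 KB pages, and a 32-bit address space (field widths hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12 /* 4 KB pages */
    #define PAGE_MASK ((1u << PAGE_BITS) - 1)
    #define NUM_PAGES (1u << (32 - PAGE_BITS))

    struct pte {
        uint32_t frame : 16; /* physical page-frame number */
        uint32_t valid : 1;  /* page present in main memory? */
    };

    struct pte page_table[NUM_PAGES];

    /* Split the VA into virtual page number + offset, index the page
       table, and splice the frame number onto the unchanged offset. */
    uint32_t translate(uint32_t va)
    {
        uint32_t vpn = va >> PAGE_BITS;
        if (!page_table[vpn].valid)
            return 0; /* page fault: the OS would fetch the page from disk */
        return ((uint32_t)page_table[vpn].frame << PAGE_BITS) | (va & PAGE_MASK);
    }

    int main(void)
    {
        page_table[0x12345] = (struct pte){ .frame = 0x00AB, .valid = 1 };
        printf("PA = 0x%08X\n", translate(0x12345678)); /* 0x000AB678 */
        return 0;
    }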

Page 27: Caches and Virtual Memory

ENGS 116 Lecture 14 27

Fast Translation: Translation Buffer (TLB)

• Cache of translated addresses

• Data portion usually includes physical page frame number, protection field, valid bit, use bit, and dirty bit

• Alpha 21064 data TLB: 32-entry fully associative

[Figure: Alpha 21064 data TLB. The 43-bit virtual address divides into a 30-bit page-frame address and a 13-bit page offset. The 30-bit tag is compared against all 32 entries at once (32:1 MUX); each entry holds V (valid), R (read), and W (write) bits plus a 21-bit physical page number. On a hit, the 21-bit physical page number (high-order bits) joins the 13-bit page offset (low-order bits) to form the 34-bit physical address.]
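
In software terms, a fully associative lookup is a scan over all 32 entries; the hardware performs the 32 tag compares in parallel through the 32:1 MUX. A sketch with the field widths from the slide (everything else hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 32 /* Alpha 21064 data TLB: 32-entry, fully associative */

    struct tlb_entry {
        bool     valid;
        uint64_t tag;   /* 30-bit virtual page-frame address */
        uint64_t frame; /* 21-bit physical page number */
        /* protection, use, and dirty bits omitted */
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* 43-bit VA = 30-bit page-frame address + 13-bit page offset.
       Hit: 34-bit PA = physical page number concatenated with offset. */
    bool tlb_lookup(uint64_t va, uint64_t *pa)
    {
        uint64_t tag    = va >> 13;
        uint64_t offset = va & 0x1FFF;
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].tag == tag) {
                *pa = (tlb[i].frame << 13) | offset;
                return true; /* TLB hit */
            }
        }
        return false; /* TLB miss: walk the page table */
    }

    int main(void)
    {
        uint64_t pa;
        tlb[0] = (struct tlb_entry){ true, 0x12345, 0xAB };
        return tlb_lookup((0x12345ull << 13) | 0x1FF, &pa) ? 0 : 1; /* hit */
    }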

Page 28: Caches and Virtual Memory

ENGS 116 Lecture 14 28

Selecting a Page Size

• Reasons for a larger page size

– Page table size is inversely proportional to the page size, so memory is saved (quantified in the sketch below)

– Fast cache hit time is easy when cache ≤ page size (VA caches); a bigger page makes it feasible as the cache grows in size

– Transferring larger pages to or from secondary storage, possibly over a network, is more efficient

– The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory, thereby reducing TLB misses

• Reasons for a smaller page size

– Fragmentation: don't waste storage; data must be contiguous within a page

– Quicker process start for small processes

• Hybrid solution: multiple page sizes

– Alpha: 8 KB, 16 KB, 32 KB, 64 KB pages (43, 47, 51, 55 virtual address bits)
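
The inverse relationship in the first point is easy to quantify. A sketch, assuming a flat page table over a 32-bit virtual address space with 8-byte entries:

    #include <stdio.h>

    int main(void)
    {
        /* Flat table: entries = 2^(32 - page_bits), 8 bytes each */
        for (unsigned page_bits = 12; page_bits <= 16; page_bits++) {
            unsigned long entries = 1ul << (32 - page_bits);
            printf("%3u KB pages -> %8lu entries, %5lu KB of table\n",
                   (1u << page_bits) / 1024, entries, entries * 8 / 1024);
        }
        /* 4 KB pages -> 1M entries (8 MB of table);
           64 KB pages -> 64K entries (512 KB of table) */
        return 0;
    }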

Page 29: Caches and Virtual Memory

ENGS 116 Lecture 14 29

Alpha VM Mapping

• “64-bit” address divided into 3 segments

– seg0 (bit 63 = 0): user code/heap

– seg1 (bit 63 = 1, bit 62 = 1): user stack

– kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS

• Three-level page table, each level one page

– Alpha uses only 43 bits of VA

– (future: minimum page size up to 64 KB → 55 bits of VA)

• PTE bits: valid, kernel & user, read & write enable (no reference, use, or dirty bit)

– What do you do?

[Figure: Alpha page-table walk. The virtual address = seg0/seg1 selector (21 bits, all 0s or all 1s) + level1 + level2 + level3 (10 bits each) + page offset (13 bits). The page table base register plus level1 selects an entry in the L1 page table; that entry plus level2 selects an entry in the L2 page table; that entry plus level3 selects the L3 page table entry holding the physical page-frame number, which is concatenated with the page offset to form the physical address. Each PTE is 8 bytes with a 32-bit address field. A walk in software is sketched below.]
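
A sketch of the three-level walk as an OS miss handler might perform it; frame_to_table is a hypothetical helper, so this compiles as a fragment rather than a full program:

    #include <stdint.h>

    /* Alpha-style 43-bit VA: three 10-bit table indices + 13-bit offset */
    #define OFFSET_BITS 13
    #define LEVEL_BITS  10
    #define LEVEL_MASK  ((1u << LEVEL_BITS) - 1)

    typedef uint64_t pte_t;         /* 8-byte page table entry */
    #define PTE_VALID  1u
    #define PTE_PFN(e) ((e) >> 32)  /* hypothetical frame-number field */

    /* Hypothetical helper: pointer to the one-page table held in a frame */
    pte_t *frame_to_table(uint64_t pfn);

    uint64_t page_walk(pte_t *l1_table, uint64_t va, int *fault)
    {
        uint64_t idx1 = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & LEVEL_MASK;
        uint64_t idx2 = (va >> (OFFSET_BITS + LEVEL_BITS)) & LEVEL_MASK;
        uint64_t idx3 = (va >> OFFSET_BITS) & LEVEL_MASK;

        pte_t e1 = l1_table[idx1];                    /* level 1 */
        if (!(e1 & PTE_VALID)) { *fault = 1; return 0; }
        pte_t e2 = frame_to_table(PTE_PFN(e1))[idx2]; /* level 2 */
        if (!(e2 & PTE_VALID)) { *fault = 1; return 0; }
        pte_t e3 = frame_to_table(PTE_PFN(e2))[idx3]; /* level 3 */
        if (!(e3 & PTE_VALID)) { *fault = 1; return 0; }

        *fault = 0;
        return (PTE_PFN(e3) << OFFSET_BITS) | (va & ((1u << OFFSET_BITS) - 1));
    }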

Page 30: Caches and Virtual Memory

Memory Hierarchy

[Figure: memory-hierarchy diagram.]

Page 31: Caches and Virtual Memory

ENGS 116 Lecture 14 31

Protection

• Prevent separate processes from accessing each other's memory

– Violations cause a segmentation fault: SIGSEGV

– Essential for multitasking systems

– An operating-system issue

• Each process has its own state

– Page tables

– Heap, text, and stack pages

– Registers, PC

• To prevent processes from modifying their own page tables:

– Rings of protection, kernel vs. user

• To prevent processes from modifying other processes' memory:

– Page tables point to distinct physical pages

Page 32: Caches and Virtual Memory

ENGS 116 Lecture 14 32

Protection 2

• Each page needs:

– PID bit

– Read/Write/Execute bits

• Each process needs:

– Stack frame page(s)

– Text or code pages

– Data or heap pages

– A state table keeping:

» PC and other CPU status registers

» The state of all registers

A sketch of the per-access permission check follows.
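
The check itself, with hypothetical PTE bits:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-page permission bits */
    #define PTE_READ   (1u << 0)
    #define PTE_WRITE  (1u << 1)
    #define PTE_EXEC   (1u << 2)
    #define PTE_KERNEL (1u << 3) /* reachable only from the kernel ring */

    enum access { ACC_READ = PTE_READ, ACC_WRITE = PTE_WRITE, ACC_EXEC = PTE_EXEC };

    /* Every access is checked against the page's bits and the CPU's ring;
       a violation raises SIGSEGV instead of touching memory. Because each
       process's page table points at distinct physical pages, a process
       cannot even name another process's memory. */
    bool access_allowed(uint32_t pte_flags, enum access type, bool user_mode)
    {
        if (user_mode && (pte_flags & PTE_KERNEL))
            return false; /* user code touching a kernel page */
        return (pte_flags & (uint32_t)type) != 0;
    }

    int main(void)
    {
        uint32_t user_page = PTE_READ | PTE_WRITE;
        return (access_allowed(user_page, ACC_READ, true) &&
                !access_allowed(PTE_KERNEL | PTE_WRITE, ACC_WRITE, true)) ? 0 : 1;
    }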

Page 33: Caches and Virtual Memory

ENGS 116 Lecture 14 33

Alpha 21064

• Separate instruction & data TLBs & caches

• TLBs fully associative

• TLB updates in SW (“Private Arch Lib”)

• Caches 8KB direct mapped, write through

• Critical 8 bytes first

• Prefetch instr. stream buffer

• 2 MB L2 cache, direct mapped, WB (off-chip)

• 256 bit path to main memory, 4 x 64-bit modules

• Victim buffer: to give read priority over write

• 4-entry write buffer between D$ & L2$

[Figure: Alpha 21064 memory hierarchy, showing the instruction and data caches with their stream buffer, write buffer, and victim buffer.]