
Page 1: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

THE MEMORY HIERARCHY

Jehan-François Pâris, jfparis@uh.edu

Page 2: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Chapter Organization

• Technology overview
• Caches
  – Cache associativity, write through and write back, …
• Virtual memory
  – Page table organization, the translation lookaside buffer (TLB), page fault handling, memory protection
• Virtual machines
• Cache consistency

Page 3: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

TECHNOLOGY OVERVIEW

Page 4: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Dynamic RAM

• Standard solution for main memory since the 70s
  – Replaced magnetic core memory
• Bits are stored on capacitors
  – A charged state represents a one
• Capacitors discharge
  – Must be dynamically refreshed
  – Achieved by accessing each cell several thousand times each second

Page 5: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Dynamic RAM

[Figure: a DRAM cell: an nMOS transistor gated by the row select line connects the column select line to a capacitor tied to ground]

Page 6: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The role of the nMOS transistor

[Figure: the cell transistor, with the row select line as gate, the column select line as source, and the capacitor as drain]

• When the gate is positive with respect to the ground, electrons are attracted to the gate (the "field effect") and current can go through
• Normally, no current can go from the source to the drain

Not on the exam

Page 7: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Magnetic disks

[Figure: a disk drive: platters, R/W heads on a moving arm, and the servo that positions the arm]

Page 8: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Magnetic disk (I)

• Data are stored on circular tracks
• Tracks are partitioned into a variable number of fixed-size sectors
• If the disk drive has more than one platter, all tracks corresponding to the same position of the R/W head form a cylinder

Page 9: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Magnetic disk (II)

• Disk spins at a speed varying between
  – 5,400 rpm (laptops) and
  – 15,000 rpm (Seagate Cheetah X15, …)
• Accessing data requires
  – Positioning the head on the right track: seek time
  – Waiting for the data to reach the R/W head: on the average, half a rotation

Page 10: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Disk access times

• Dominated by seek time and rotational delay
• We try to reduce seek times by placing all data that are likely to be accessed together on nearby tracks or the same cylinder
• Cannot do as much for rotational delay
  – On the average, half a rotation

Page 11: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Average rotational delay

RPM      Delay (ms)
 5,400   5.6
 7,200   4.2
10,000   3.0
15,000   2.0
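The table follows from the time of half a revolution; a quick Python check (not part of the original deck):

# Average rotational delay = time for half a revolution.
for rpm in (5400, 7200, 10000, 15000):
    delay_ms = 0.5 * 60_000 / rpm   # 60,000 ms per minute
    print(f"{rpm:>6} rpm: {delay_ms:.1f} ms")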

Page 12: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Overall performance

• Disk access times are still dominated by rotational latency
  – Were 8-10 ms in the late 70s, when rotational speeds were 3,000 to 3,600 RPM
• Disk capacities and maximum transfer rates have done much better
  – Pack many more tracks per platter
  – Pack many more bits per track

Page 13: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The internal disk controller

• Printed circuit board attached to the disk drive
  – As powerful as the CPU of a personal computer of the early 80s
• Functions include
  – Speed buffering
  – Disk scheduling
  – …

Page 14: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Reliability issues

• Disk drives have more reliability issues than most other computer components
  – Moving parts eventually wear out
  – Infant mortality
  – Producing perfect magnetic surfaces would be too costly
• Disks have bad blocks

Page 15: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Disk failure rates

• Failure rates follow a bathtub curve
  – High infant mortality
  – Low failure rate during useful life
  – Higher failure rates as disks wear out

Page 16: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Disk failure rates (II)

[Figure: the bathtub curve: failure rate vs. time, with infant mortality, useful life, and wearout phases]

Page 17: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Disk failure rates (III)

• The infant mortality effect can last for months for disk drives
• Cheap ATA disk drives seem to age less gracefully than SCSI drives

Page 18: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

MTTF

• Disk manufacturers advertise very high Mean Times To Fail (MTTF) for their products
  – 500,000 to 1,000,000 hours, that is, 57 to 114 years
• Does not mean that a disk will last that long!
• Means that disks will fail at an average rate of one failure per 500,000 to 1,000,000 hours during their useful life

Page 19: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

More MTTF Issues (I)

• Manufacturers' claims are not supported by solid experimental evidence
• They are obtained by submitting disks to a stress test at high temperature and extrapolating the results to ideal conditions
  – The procedure raises many issues

Page 20: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

More MTTF Issues (II)

• Failure rates observed in the field are much higher
  – Can go up to 8 to 9 percent per year
• The corresponding MTTFs are 11 to 12.5 years
• If we have 100 disks and an MTTF of 12.5 years, we can expect an average of 8 disk failures per year
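The arithmetic behind these figures, as a quick Python check (not part of the original deck):

HOURS_PER_YEAR = 24 * 365

for mttf_hours in (500_000, 1_000_000):
    print(f"MTTF {mttf_hours:,} h = {mttf_hours / HOURS_PER_YEAR:.0f} years")

# 100 disks with a 12.5-year MTTF fail, during their useful life, at a rate of
print(100 / 12.5, "failures per year")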

Page 21: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Bad blocks (I)

• Also known as
  – Irrecoverable read errors
  – Latent sector errors
• Can be caused by
  – Defects in the magnetic substrate
  – Problems during the last write

Page 22: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Bad blocks (II)

• The disk controller uses a redundant encoding that can detect and correct many errors
• When the internal disk controller detects a bad block, it
  – Marks it as unusable
  – Remaps the logical block address of the bad block to spare sectors
• Each disk is extensively tested during a burn-in period before being released

Page 23: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The memory hierarchy (I)

Level   Device                          Access time
1       Fastest registers (2 GHz CPU)   0.5 ns
2       Main memory                     10-60 ns
3       Secondary storage (disk)        7 ms
4       Mass storage (CD-ROM library)   a few s

Page 24: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The memory hierarchy (II)

• To make sense of these numbers, let us consider an analogy

Page 25: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Writing a paper (I)

Level   Resource            Access time
1       Open book on desk   1 s
2       Book on desk
3       Book in library
4       Book far away

Page 26: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Writing a paper (II)

Level   Resource            Access time
1       Open book on desk   1 s
2       Book on desk        20-120 s
3       Book in library
4       Book far away

Page 27: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Writing a paper (III)

Level   Resource            Access time
1       Open book on desk   1 s
2       Book on desk        20-140 s
3       Book in library     162 days
4       Book far away

Page 28: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Writing a paper (IV)

Level   Resource            Access time
1       Open book on desk   1 s
2       Book on desk        20-140 s
3       Book in library     162 days
4       Book far away       63 years

Page 29: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Major issues

• Huge gaps between
  – CPU speeds and SDRAM access times
  – SDRAM access times and disk access times
• The two problems have very different solutions
  – The gap between CPU speeds and SDRAM access times is handled by hardware
  – The gap between SDRAM access times and disk access times is handled by a combination of software and hardware

Page 30: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Why?

• Having hardware handle an issue
  – Complicates hardware design
  – Offers a very fast solution
  – Standard approach for very frequent actions
• Letting software handle an issue
  – Cheaper
  – Has a much higher overhead
  – Standard approach for less frequent actions

Page 31: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Will the problem go away?

• It will become worse
  – RAM access times are not improving as fast as CPU power
  – Disk access times are limited by the rotational speed of the disk drive

Page 32: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

What are the solutions?

• To bridge the CPU/DRAM gap:
  – Interpose between the CPU and the DRAM smaller, faster memories that cache the data the CPU currently needs
    • Cache memories
    • Managed by the hardware and invisible to the software (OS included)

Page 33: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

What are the solutions?

• To bridge the DRAM/disk drive gap:
  – Store in main memory the data blocks that are currently accessed (the I/O buffer)
  – Manage memory space and disk space as a single resource (virtual memory)
• The I/O buffer and virtual memory are managed by the OS and invisible to the user processes

Page 34: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Why do these solutions work?

• Locality principle:
  – Spatial locality: at any time a process only accesses a small portion of its address space
  – Temporal locality: this subset does not change too frequently

Page 35: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Can we think of examples?

• The way we write programs
• The way we act in everyday life
  – …

Page 36: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

CACHING

Page 37: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The technology

• Caches use faster static RAM (SRAM)
  – Similar organization to that of D flip-flops
• Can have
  – Separate caches for instructions and data
    • Great for pipelining
  – A unified cache

Page 38: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (I)

• Consider a closed-stack library
  – Customers bring book requests to the circulation desk
  – Librarians go to the stacks to fetch the requested books
• This solution is used in national libraries
  – Costlier than the open-stack approach
  – Much better control of assets

Page 39: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (II)

• Librarians have noted that some books get asked for again and again
  – Want to put them closer to the circulation desk
    • Would result in much faster service
• The problem is how to locate these books
  – They will not be at the right location!

Page 40: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (III)

• The librarians come up with a great solution
  – They put behind the circulation desk shelves with 100 book slots numbered from 00 to 99
  – Each slot is a home for the most recently requested book whose call number ends with the slot number
    • 3141593 can only go in slot 93
    • 1234567 can only go in slot 67

Page 41: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (IV)

The call number of the book I need is 3141593

Let me see if it's in bin 93

Page 42: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (V)

• To let the librarian do her job, each slot must contain either
  – Nothing, or
  – A book and its call number
• There are many books whose call number ends in 93, or in any two given digits

Page 43: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (VI)

Could I now get the book whose call number is 4444493?

Sure

Page 44: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A little story (VII)

• This time the librarian will
  – Go to bin 93
  – Find that it contains a book with a different call number
• She will
  – Bring that book back to the stacks
  – Fetch the new book

Page 45: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Basic principles

• Assume we want to store in a faster memory the 2^n words that are currently accessed by the CPU
  – Can be instructions or data or even both
• When the CPU needs to fetch an instruction or load a word into a register
  – It will look first into the cache
  – Can have a hit or a miss

Page 46: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Cache hits

• Occur when the requested word is found in the cache
  – The cache avoided a memory access
  – The CPU can proceed

Page 47: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Cache misses

• Occur when the requested word is not found in the cache
  – Will need to access the main memory
  – Will bring the new word into the cache
    • Must make space for it by expelling one of the cache entries
      – Need to decide which one

Page 48: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Handling writes (I)

• When the CPU has to store the contents of a register into main memory
  – The write will update the cache
• If the modified word is already in the cache
  – Everything is fine
• Otherwise
  – Must make space for it by expelling one of the cache entries

Page 49: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Handling writes (II)

• Two ways to handle writes
  – Write through:
    • Each write updates both the cache and the main memory
  – Write back:
    • Writes are not propagated to the main memory until the updated word is expelled from the cache

Page 50: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Handling writes (III)

[Figure: with write through, each CPU write goes to both the cache and the RAM; with write back, the write updates the cache now and the RAM later]

Page 51: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Pros and cons

• Write through:
  – Ensures that memory is always up to date
    • Expelled cache entries can simply be overwritten
• Write back:
  – Faster writes
  – Complicates the cache expulsion procedure
    • Must write back cache entries that have been modified in the cache

Page 52: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Picking the right solution

• Caches use write through:
  – Provides simpler cache expulsions
  – Can minimize the write-through overhead with additional circuitry
• I/O buffers and virtual memory use write back:
  – The write-through overhead would be too high

Page 53: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A better write through (I)

• Add a small buffer to speed up the write performance of write-through caches
  – At least four words
• Holds modified data until they are written into main memory
  – The cache can proceed as soon as the data are written into the write buffer

Page 54: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A better write through (II)

[Figure: with plain write through, CPU writes go to the cache and the RAM; with the improved scheme, they go to the cache and the write buffer, which drains into the RAM later]

Page 55: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A very basic cache

• Has 2^n entries
• Each entry contains
  – A word (4 bytes)
  – Its RAM address
    • The sole way to identify the word
  – A bit indicating whether the cache entry contains something useful

Page 56: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A very basic cache (I)

[Figure: the cache as a table of entries, each holding a valid bit (Y/N), a tag containing the RAM address, and the word contents; actual caches are much bigger]

Page 57: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A very basic cache (II)

[Figure: the same cache with its entries indexed 000 to 111; each entry holds a valid bit, a tag (RAM address), and the word contents]

Page 58: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Comments (I)

• The cache organization we have presented is nothing but a hardware implementation of a hash table
• Each entry has
  – A key: the word address
  – A value: the word contents plus a valid bit

Page 59: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Comments (II)

• The hash function is
  h(k) = (k / 4) mod N
  where k is the key and N is the cache size
  – Can be computed very fast
• Unlike conventional hash tables, this organization has no provision for handling collisions
  – It uses expulsion to resolve collisions

Page 60: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Managing the cache

• Each word fetched into the cache can occupy a single cache location
  – Specified by bits n+1 to 2 of its address
• Two words with the same bits n+1 to 2 cannot be in the cache at the same time
  – Happens whenever the addresses of the two words differ by a multiple K × 2^(n+2)

Page 61: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• Assume the cache can contain 8 words
• If word 48 is in the cache, it will be stored at cache index (48/4) mod 8 = 12 mod 8 = 4
• In our case 2^(n+2) = 2^(3+2) = 32
• The only possible cache index for word 80 would also be (80/4) mod 8 = 20 mod 8 = 4
• Same for words 112, 144, 176, …

Page 62: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Managing the cache

• Each word fetched into the cache can occupy a single cache location
  – Specified by bits n+1 to 2 of its address
• Two words with the same bits n+1 to 2 cannot be in the cache at the same time
  – Happens whenever the addresses of the two words differ by a multiple K × 2^(n+2)

Page 63: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Saving cache space

• We do not need to store the whole address of each word in the cache
  – Bits 1 and 0 will always be zero
  – Bits n+1 to 2 can be inferred from the cache index
    • If the cache has 8 entries, bits 4 to 2
• Will only store in the tag the remaining bits of the address

Page 64: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A very basic cache (III)

The cache uses bits 4 to 2 of the word address as its index.

[Figure: the 8-entry cache, entries indexed 000 to 111; each entry holds a valid bit, a tag storing bits 31:5 of the address, and the word]

Page 65: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Storing a new word in the cache

• The location of the new word's entry is obtained from the LSB of the word address
  – Discard the 2 LSB
    • Always zero for a well-aligned word
  – The n next LSB give the cache index for a cache of size 2^n

[Diagram: word address = MSB of word address | n next LSB | 00]

Page 66: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Accessing a word in the cache (I)

• Start with the word address
• Remove the two least significant bits
  – Always zero

[Diagram: word address → word address minus its two LSB]

Page 67: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Accessing a word in the cache (II)

• Split the remainder of the address into
  – The n least significant bits
    • The word's index in the cache
  – The cache tag

[Diagram: word address minus two LSB = cache tag | n LSB]

Page 68: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Towards a better cache

• Our cache takes into account the temporal locality of accesses
  – Repeated accesses to the same location
• But not their spatial locality
  – Accesses to neighboring locations
• Cache space is poorly used
  – Need 27 + 1 bits of overhead (a tag holding bits 31:5, plus a valid bit) to store 32 bits of data

Page 69: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Multiword cache (I)

• Each cache entry will contain a block of 2, 4, 8, … words with consecutive addresses
  – Requires words to be well aligned
    • A pair of words should start at an address that is a multiple of 2×4 = 8
    • A group of four words should start at an address that is a multiple of 4×4 = 16

Page 70: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Multiword cache (II)

[Figure: a multiword cache with entries indexed 000 to 111; each entry holds a valid bit, a tag storing bits 31:6 of the address, and a block of several words]

Page 71: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Multiword cache (III)

• Has 2^n entries, each containing 2^m words
• Each entry contains
  – 2^m words
  – A tag
  – A bit indicating whether the cache entry contains useful data

Page 72: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Storing a new word in the cache

• The location of the new word's entry is obtained from the LSB of the word address
  – Discard the 2 + m LSB
    • Always zero for a well-aligned group of words
  – Take the n next LSB for a cache of size 2^n

[Diagram: address = MSB of address | n next LSB | 2 + m LSB]

Page 73: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• Assume
  – The cache can contain 8 entries
  – Each block contains 2 words
• Words 48 and 52 belong to the same block
  – If word 48 is in the cache, it will be stored at cache index (48/8) mod 8 = 6 mod 8 = 6
  – If word 52 is in the cache, it will be stored at cache index (52/8) mod 8 = 6 mod 8 = 6

Page 74: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Selecting the right block size

• Larger block sizes improve the performance of the cache
  – Allow us to exploit the spatial locality of accesses
• Three limitations
  – The spatial locality effect is less pronounced if the block size exceeds 128 bytes
  – Too many collisions in very small caches
  – Large blocks take more time to be fetched into the cache

Page 75: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

[Figure: miss rate (0% to 40%) vs. block size (16 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]

Page 76: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Collision effect in small cache

• Consider a 4 KB cache
  – If the block size is 16 B, that is, 4 words, the cache will have 256 blocks
  – …
  – If the block size is 128 B, that is, 32 words, the cache will have 32 blocks
    • Too many collisions

Page 77: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Problem

• Consider a very small cache with 8 entries and a block size of 8 bytes (2 words)
  – Which words will be fetched into the cache when the CPU accesses the words at addresses 32, 48, 60 and 80?
  – How will these words be stored in the cache?

Page 78: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (I)

• Since the block size is 8 bytes
  – The 3 LSB of the address select one of the 8 bytes in a block
• Since the cache holds 8 blocks
  – The next 3 LSB of the address form the cache index
• As a result, the tag has 32 – 3 – 3 = 26 bits

Page 79: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (II)

• Consider the word at address 32
• The cache index is (32/2^3) mod 2^3 = (32/8) mod 8 = 4
• The block tag is 32/2^6 = 32/64 = 0

Row 4   Tag=0   32 33 34 35 36 37 38 39

Page 80: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (III)

• Consider the word at address 48
• The cache index is (48/8) mod 8 = 6
• The block tag is 48/64 = 0

Row 6   Tag=0   48 49 50 51 52 53 54 55

Page 81: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (IV)

• Consider the word at address 60
• The cache index is (60/8) mod 8 = 7
• The block tag is 60/64 = 0

Row 7   Tag=0   56 57 58 59 60 61 62 63

Page 82: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (V)

• Consider the word at address 80
• The cache index is (80/8) mod 8 = 10 mod 8 = 2
• The block tag is 80/64 = 1

Row 2   Tag=1   80 81 82 83 84 85 86 87
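The four answers can be checked with a short Python sketch (8-entry cache, 8-byte blocks, as stated in the problem; not part of the original deck):

# 3 offset bits (8-byte blocks) and 3 index bits (8 entries)
for addr in (32, 48, 60, 80):
    index = (addr // 8) % 8
    tag = addr // 64
    start = (addr // 8) * 8
    print(f"address {addr}: row {index}, tag {tag}, "
          f"block holds bytes {start}-{start + 7}")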

Page 83: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Set-associative caches (I)

• Can be seen as 2, 4, or 8 direct-mapped caches operating in parallel
• Reduce collisions

Page 84: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Set-associative caches (II)

[Figure: a two-way set-associative cache: two banks of entries, each indexed 000 to 111, and each entry holding a valid bit, a tag storing bits 31:5 of the address, and a block]

Page 85: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Set-associative caches (III)

• Advantage:
  – We take care of more collisions
    • Like a hash table with a fixed bucket size
  – Results in lower miss rates than direct-mapped caches
• Disadvantage:
  – Slower access
• Best solution if the miss penalty is very big

Page 86: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Fully associative caches

• The dream!
• A block can occupy any position in the cache
• Requires an associative memory
  – Content-addressable
  – Like our brain!
• Remains a dream

Page 87: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Designing RAM to support caches

• The RAM is connected to the CPU through a "bus"
  – Its clock rate is much slower than the CPU clock rate
• Assume that a RAM access takes
  – 1 bus clock cycle to send the address
  – 15 bus clock cycles to initiate a read
  – 1 bus clock cycle to send a word of data

Page 88: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Designing RAM to support caches

• Assume
  – The cache block size is 4 words
  – A one-word bank of DRAM
• Fetching a cache block would take 1 + 4×15 + 4×1 = 65 bus clock cycles
  – The transfer rate is 0.25 byte/bus cycle
    • Awful!

Page 89: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Designing RAM to support caches

• Could
  – Double the bus width (from 32 to 64 bits)
  – Have a two-word bank of DRAM
• Fetching a cache block would take 1 + 2×15 + 2×1 = 33 bus clock cycles
  – The transfer rate is 0.48 byte/bus cycle
    • Much better
• A costly solution

Page 90: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Designing RAM to support caches

• Could
  – Have an interleaved memory organization
  – Four one-word banks of DRAM
  – A 32-bit bus

[Figure: four RAM banks (0 to 3) sharing a 32-bit bus]

Page 91: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Designing RAM to support caches

• Can do the 4 accesses in parallel
• Must still transmit the block 32 bits by 32 bits
• Fetching a cache block would take 1 + 15 + 4×1 = 20 bus clock cycles
  – The transfer rate is 0.80 byte/bus cycle
• Even better
• Much cheaper than having a 64-bit bus
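A small Python sketch comparing the three organizations (16-byte blocks and the timing parameters from the slides; not part of the original deck):

block_bytes = 16  # four 4-byte words
configs = {
    "one-word bank, 32-bit bus": 1 + 4 * 15 + 4 * 1,   # 65 cycles
    "two-word bank, 64-bit bus": 1 + 2 * 15 + 2 * 1,   # 33 cycles
    "four interleaved banks":    1 + 15 + 4 * 1,       # 20 cycles
}
for name, cycles in configs.items():
    print(f"{name}: {cycles} cycles, {block_bytes / cycles:.2f} bytes/cycle")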

Page 92: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

ANALYZING CACHE PERFORMANCE

Page 93: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Memory stalls

• Can divide CPU time into
  – N_EXEC clock cycles spent executing instructions
  – N_MEM_STALLS cycles spent waiting for memory accesses
• We have
  CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE

Page 94: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Memory stalls

• We assume that
  – Cache access times can be neglected
  – Most CPU cycles spent waiting for memory accesses are caused by cache misses
• Distinguishing between read stalls and write stalls:
  N_MEM_STALLS = N_RD_STALLS + N_WR_STALLS

Page 95: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Read stalls

• Fairly simple:
  N_RD_STALLS = N_MEM_RD × Read miss rate × Read miss penalty

Page 96: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Write stalls (I)

• Two causes of delays
  – Must fetch missing blocks before updating them
    • We update at most 8 bytes of the block!
  – Must take into account the cost of write through
    • The buffering delay depends on the proximity of writes, not the number of cache misses
      – Writes too close to each other

Page 97: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Write stalls (II)

• We have
  N_WR_STALLS = N_WRITES × Write miss rate × Write miss penalty + N_WR_BUFFER_STALLS
• In practice, there are very few buffer stalls if the buffer contains at least four words

Page 98: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Global impact

• We have
  N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
• and also
  N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/instruction) × Cache miss penalty

Page 99: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• The miss rate of the instruction cache is 2 percent
• The miss rate of the data cache is 4 percent
• In the absence of memory stalls, each instruction would take 2 cycles
• The miss penalty is 100 cycles
• 36 percent of instructions access the main memory
• How many cycles are lost due to cache misses?

Page 100: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (I)

• Impact of instruction cache misses: 0.02×100 = 2 cycles/instruction
• Impact of data cache misses: 0.36×0.04×100 = 1.44 cycles/instruction
• Total impact of cache misses: 2 + 1.44 = 3.44 cycles/instruction

Page 101: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (II)

• Average number of cycles per instruction: 2 + 3.44 = 5.44 cycles/instruction
• Fraction of time wasted: 3.44/5.44 = 63 percent

Page 102: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Problem

• Redo the example with the following data
  – The miss rate of the instruction cache is 3 percent
  – The miss rate of the data cache is 5 percent
  – In the absence of memory stalls, each instruction would take 2 cycles
  – The miss penalty is 100 cycles
  – 40 percent of instructions access the main memory

Page 103: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution

• The fraction of time wasted to memory stalls is 71 percent
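Both results can be checked with a small Python helper (parameters as in the two problems; not part of the original deck):

def stall_impact(i_miss, d_miss, mem_frac, base_cpi=2, penalty=100):
    stalls = (i_miss + mem_frac * d_miss) * penalty  # stall cycles per instruction
    return stalls, stalls / (base_cpi + stalls)      # and fraction of time wasted

print(stall_impact(0.02, 0.04, 0.36))  # (3.44, 0.632...) -> 63 percent
print(stall_impact(0.03, 0.05, 0.40))  # (5.0, 0.714...)  -> 71 percent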

Page 104: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Average memory access time

• Some authors call it AMAT:
  T_AVERAGE = T_CACHE + f × T_MISS
  where f is the cache miss rate
• Times can be expressed
  – In nanoseconds
  – In number of cycles

Page 105: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• A cache has a hit rate of 96 percent
• Accessing data
  – In the cache requires one cycle
  – In the memory requires 100 cycles
• What is the average memory access time?

Page 106: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution

• Miss rate = 1 – Hit rate = 0.04
• Applying the formula:
  T_AVERAGE = 1 + 0.04×100 = 5 cycles


Page 107: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Impact of a better hit rate

• What would be the impact of improving the hit rate of the cache from 96 to 98 percent?

Page 108: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution

• New miss rate = 1 – New hit rate = 0.02
• Applying the formula:
  T_AVERAGE = 1 + 0.02×100 = 3 cycles

When the hit rate is above 80 percent, small improvements in the hit rate result in much lower miss rates
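A tiny Python helper reproducing both results (one-cycle cache access and a 100-cycle miss penalty assumed, as above; not part of the original deck):

def amat(hit_rate, t_cache=1, t_miss=100):
    return t_cache + (1 - hit_rate) * t_miss

print(round(amat(0.96), 2))  # 5.0 cycles
print(round(amat(0.98), 2))  # 3.0 cycles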

Page 109: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Examples

• Old hit rate: 80 percent; new hit rate: 90 percent
  – The miss rate goes from 20 to 10 percent!
• Old hit rate: 94 percent; new hit rate: 98 percent
  – The miss rate goes from 6 to 2 percent!

Page 110: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

In other words

It's the miss rate, stupid!

Page 111: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Improving cache hit rate

• Two complementary techniques
  – Using set-associative caches
    • Must check the tags of all blocks with the same index value
      – Slower
    • Have fewer collisions
      – Fewer misses
  – Using a cache hierarchy

Page 112: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A cache hierarchy (I)

[Figure: a cache hierarchy: the CPU accesses L1; L1 misses go to L2, L2 misses go to L3, and L3 misses go to RAM]

Page 113: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

A cache hierarchy

• The topmost cache
  – Is optimized for speed, not miss rate
  – Is rather small
  – Uses a small block size
• As we go down the hierarchy
  – Cache sizes increase
  – Block sizes increase
  – Cache associativity levels increase

Page 114: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• The cache miss rate per instruction is 2 percent
• In the absence of memory stalls, each instruction would take one cycle
• The cache miss penalty is 100 ns
• The clock rate is 4 GHz
• How many cycles are lost due to cache misses?

Page 115: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (I)

• Duration of a clock cycle: 1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
• Cache miss penalty: 100 ns = 400 cycles
• Total impact of cache misses: 0.02×400 = 8 cycles/instruction

Page 116: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (II)

• Average number of cycles per instruction: 1 + 8 = 9 cycles/instruction
• Fraction of time wasted: 8/9 = 89 percent

Page 117: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example (cont'd)

• How much faster would the processor be if we added an L2 cache that
  – Has a 5 ns access time
  – Would reduce the miss rate to main memory to 0.5 percent?
• We will see later how to get that

Page 118: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (I)

• L2 cache access time: 5 ns = 20 cycles
• Impact of cache misses per instruction: L1 cache misses + L2 cache misses = 0.02×20 + 0.005×400 = 0.4 + 2.0 = 2.4 cycles/instruction
• Average number of cycles per instruction: 1 + 2.4 = 3.4 cycles/instruction

Page 119: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (II)

• Fraction of time wasted: 2.4/3.4 = 71 percent
• CPU speedup: 9/3.4 = 2.6

Page 120: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

How to get the 0.005 miss rate

• The wanted miss rate corresponds to a combined cache hit rate of 99.5 percent
• Let H1 be the hit rate of the L1 cache and H2 the hit rate of the L2 cache
• The combined hit rate of the cache hierarchy is
  H = H1 + (1 – H1)H2

Page 121: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

How to get the 0.005 miss rate

• We have 0.995 = 0.98 + 0.02 H2
• H2 = (0.995 – 0.98)/0.02 = 0.75
  – Quite feasible!
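A two-line Python check of this computation (not part of the original deck):

H1 = 0.98
H2 = (0.995 - H1) / (1 - H1)
print(round(H2, 2))                  # 0.75
print(round(H1 + (1 - H1) * H2, 3))  # 0.995, the combined hit rate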

Page 122: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Can we do better? (I)

• Keep the 98 percent hit rate of the L1 cache
• Raise the hit rate of the L2 cache to 85 percent
  – The L2 cache is now slower: 6 ns, that is, 24 cycles
• Impact of cache misses per instruction: L1 cache misses + L2 cache misses = 0.02×24 + 0.02×0.15×400 = 0.48 + 1.2 = 1.68 cycles/instruction

Page 123: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The verdict

• Fraction of time wasted: 1.68/2.68 = 63 percent
• CPU speedup: 9/2.68 = 3.36

Page 124: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Would a faster L2 cache help?

• Redo the example assuming
  – The hit rate of the L1 cache is still 98 percent
  – A new, faster L2 cache
    • Access time reduced to 3 ns
    • Hit rate of only 50 percent

Page 125: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The verdict

• Fraction of time wasted: 81 percent
• CPU speedup: 1.72

The new L2 cache with a lower access time but a higher miss rate performs much worse than the original L2 cache

Page 126: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Cache replacement policy

• Not an issue in direct-mapped caches
  – We have no choice!
• An issue in set-associative caches
  – The best policy is least recently used (LRU)
    • Expels from the cache a block in the same set as the incoming block
    • Picks the block that has not been accessed for the longest period of time

Page 127: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Implementing LRU policy

• Easy when each set contains two blocks
  – We attach to each block a use bit that is
    • Set to 1 when the block is accessed
    • Reset to 0 when the other block is accessed
  – We expel the block whose use bit is 0
• Much more complicated for higher associativity levels
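A minimal Python sketch of one two-block set with its use bit; the names are illustrative, not from the slides (tracking the LRU way is equivalent to keeping a use bit per block):

class TwoWaySet:
    """One set of a two-way set-associative cache with LRU replacement."""
    def __init__(self):
        self.blocks = [None, None]   # tags of the two resident blocks
        self.lru = 0                 # index of the least recently used way

    def access(self, tag):
        if tag in self.blocks:
            way = self.blocks.index(tag)
            self.lru = 1 - way       # the other way is now least recently used
            return "hit"
        self.blocks[self.lru] = tag  # expel the LRU block
        self.lru = 1 - self.lru
        return "miss"

s = TwoWaySet()
print(s.access("A"), s.access("B"), s.access("A"), s.access("C"))
# miss miss hit miss -- the last access expels B, the LRU block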

Page 128: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

REALIZATIONS

Page 129: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Caching in a multicore organization

• Multicore organizations often involve multiple chips
  – Say four chips with four cores per chip
• Have a cache hierarchy on each chip
  – L1, L2, L3
  – Some caches are private, others are shared
• Accessing a cache on the same chip is much faster than accessing a cache on another chip

Page 130: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

AMD 16-core system (I)

• AMD 16-core system
  – Sixteen cores on four chips
• Each core has a 64-KB L1 and a 512-KB L2 cache
• Each chip has a 2-MB shared L3 cache

Page 131: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

[Figure: cache-to-cache access costs, each labeled X/Y where X is the latency in cycles and Y the bandwidth in bytes/cycle]

Page 132: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

AMD 16-core system (II)

• Observe that access times are non-uniform
  – It takes more time to access the L1 or L2 cache of another core than to access the shared L3 cache
  – It takes more time to access caches on another chip than local caches
  – Access times and bandwidths depend on the chip interconnect topology

Page 133: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

VIRTUAL MEMORY

Page 134: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Main objective (I)

• To allow programmers to write programs that reside
  – Partially in main memory
  – Partially on disk

Page 135: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Main objective (II)

[Figure: two address spaces, each larger than main memory, mapped partly into main memory]

Page 136: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Motivation

• Most programs do not access their whole address space at the same time
• Compilers go through several phases
  – Lexical analysis
  – Preprocessing (C, C++)
  – Syntactic analysis
  – Semantic analysis
  – …

Page 137: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Advantages (I)

• VM allows programmers to write programs that would not otherwise fit in main memory
  – They will run, although much more slowly
  – Very important in the 70s and 80s
• VM allows the OS to allocate the main memory much more efficiently
  – Do not waste precious memory space
  – Still important today

Page 138: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Advantages

• VM lets programmers use
  – Sparsely populated
  – Very large address spaces

Page 139: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Sparsely populated address spaces

• Lets programmers put different items apart from each other
  – Code segment
  – Data segment
  – Stack
  – Shared libraries
  – Mapped files

Wait until you take 4330 to study this

Page 140: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Big difference with caching

• The miss penalty is much bigger
  – Around 5 ms
  – Assuming a memory access time of 50 ns, 5 ms equals 100,000 memory accesses
  – For caches, the miss penalty was around 100 cycles

Page 141: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Consequences

• Will use much larger block sizes
  – Blocks, here called pages, measure 4 KB, 8 KB, … with 4 KB an unofficial standard
• Will use fully associative mapping to reduce misses, here called page faults
• Will use write back to reduce disk accesses
  – Must keep track of modified (dirty) pages in memory

Page 142: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Virtual memory

• Combines two big ideas
  – Non-contiguous memory allocation: processes are allocated page frames scattered all over the main memory
  – On-demand fetch: process pages are brought into main memory when they are accessed for the first time
• The MMU takes care of almost everything

Page 143: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Main memory

• Divided into fixed-size page frames
  – Allocation units
  – Sizes are powers of 2 (512 B, …, 4 KB, …)
  – Properly aligned
  – Numbered 0, 1, 2, …

Page 144: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Program address space

• Divided into fixed-size pages
  – Same sizes as page frames
  – Properly aligned
  – Also numbered 0, 1, 2, …

Page 145: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• Will allocate non-contiguous page frames to the pages of a process

[Figure: pages 0 to 2 of a process mapped to scattered page frames in main memory]

Page 146: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

Page number   Frame number
0             0
1             4
2             2

Page 147: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• Assuming 1 KB pages and page frames

Virtual addresses    Physical addresses
0 to 1,023           0 to 1,023
1,024 to 2,047       4,096 to 5,119
2,048 to 3,071       2,048 to 3,071

Page 148: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• Observing that 2^10 is written in binary as a one followed by ten zeroes
• We will write 0-0 for ten zeroes and 1-1 for ten ones

Virtual addresses     Physical addresses
000 0-0 to 000 1-1    000 0-0 to 000 1-1
001 0-0 to 001 1-1    100 0-0 to 100 1-1
010 0-0 to 010 1-1    010 0-0 to 010 1-1

Page 149: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• The ten least significant bits of the address do not change

Virtual addresses     Physical addresses
000 0-0 to 000 1-1    000 0-0 to 000 1-1
001 0-0 to 001 1-1    100 0-0 to 100 1-1
010 0-0 to 010 1-1    010 0-0 to 010 1-1

Page 150: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• Must only map page numbers to page frame numbers

Page number   Page frame number
000           000
001           100
010           010

Page 151: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• The same in decimal

Page number   Page frame number
0             0
1             4
2             2

Page 152: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The mapping

• Since page numbers are always in sequence, they are redundant; the table only needs to store the page frame numbers

Page frame number
0
4
2

Page 153: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The algorithm

• Assume the page size is 2^p
• Remove the p least significant bits from the virtual address to obtain the page number
• Use the page number to find the corresponding page frame number in the page table
• Append the p least significant bits of the virtual address to the page frame number to get the physical address
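A minimal Python sketch of this algorithm, using the 1 KB pages (p = 10) and the page table from the mapping slides above (not part of the original deck):

PAGE_SIZE = 1024                    # 2**10, so p = 10
page_table = {0: 0, 1: 4, 2: 2}     # page number -> page frame number

def translate(vaddr):
    page = vaddr >> 10               # drop the p least significant bits
    offset = vaddr & (PAGE_SIZE - 1) # the p bits that pass through unchanged
    frame = page_table[page]         # a missing entry would be a page fault
    return (frame << 10) | offset

print(translate(1500))  # page 1, offset 476 -> frame 4 -> 4572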

Page 154: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Realization

[Figure: address translation: a virtual address with page number 2 and 10-bit offset 897 is looked up in the page table; the page frame number found there is concatenated with the unchanged offset to form the physical address]

Page 155: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The offset

• The offset contains all the bits that remain unchanged through the address translation process
• It is a function of the page size

Page size   Offset
1 KB        10 bits
2 KB        11 bits
4 KB        12 bits

Page 156: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The page number

• Contains the other bits of the virtual address
• Assuming 32-bit addresses

Page size   Offset    Page number
1 KB        10 bits   22 bits
2 KB        11 bits   21 bits
4 KB        12 bits   20 bits

Page 157: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Internal fragmentation

• Each process now occupies an integer number of pages
• The actual process space is not a round number
  – The last page of a process is rarely full
• On the average, half a page is wasted per process
  – Not a big issue
  – Called internal fragmentation

Page 158: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

On-demand fetch (I)

• Most processes terminate without having accessed their whole address space
  – Code handling rare error conditions, …
• Other processes go through multiple phases during which they access different parts of their address space
  – Compilers

Page 159: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

On-demand fetch (II)

• VM systems do not fetch the whole address space of a process when it is brought into memory
• They fetch individual pages on demand, when they get accessed for the first time
  – Page miss or page fault
• When memory is full, they expel from memory pages that are not currently in use

Page 160: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

On-demand fetch (III)

• The pages of a process that are not in main memory reside on disk
  – In the executable file for the program being run, for the pages in the code segment
  – In a special swap area, for the data pages that were expelled from main memory

Page 161: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

On-demand fetch (IV)

[Figure: code pages come into main memory from the executable on disk; expelled data pages go to and from the swap area]

Page 162: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

On-demand fetch (V)

• When a process tries to access data that are not present in main memory
  – The MMU hardware detects that the page is missing and causes an interrupt
  – The interrupt wakes up the page fault handler
  – The page fault handler puts the process in the waiting state and brings the missing page into main memory

Page 163: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Advantages

• VM systems use main memory more efficiently than other memory management schemes
  – Give each process more or less what it needs
• Process sizes are not limited by the size of main memory
  – Greatly simplifies program organization

Page 164: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Sole disadvantage

• Bringing pages from disk is a relatively slow operation
  – Takes milliseconds, while memory accesses take nanoseconds
    • Ten thousand to a hundred thousand times slower

Page 165: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The cost of a page fault

• Let
  – Tm be the main memory access time
  – Td the disk access time
  – f the page fault rate
  – Ta the average access time of the VM

  Ta = (1 – f) Tm + f (Tm + Td) = Tm + f Td

Page 166: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• Assume Tm = 50 ns and Td = 5 ms

f        Mean memory access time
10^-3    50 ns + 5 ms/10^3 = 5,050 ns
10^-4    50 ns + 5 ms/10^4 = 550 ns
10^-5    50 ns + 5 ms/10^5 = 100 ns
10^-6    50 ns + 5 ms/10^6 = 55 ns
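The table can be reproduced with a few lines of Python (not part of the original deck):

Tm = 50e-9   # 50 ns
Td = 5e-3    # 5 ms

for f in (1e-3, 1e-4, 1e-5, 1e-6):
    Ta = Tm + f * Td
    print(f"f = {f:.0e}: Ta = {Ta * 1e9:,.0f} ns")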

Page 167: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Conclusion

• Virtual memory works best when page fault rate is less than a page fault per 100,000 instructions

Page 168: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Locality principle (I)

• A process that would access its pages in a totally unpredictable fashion would perform very poorly in a VM system unless all its pages are in main memory

Page 169: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Locality principle (II)

• Process P randomly accesses a very large array consisting of n pages
• If m of these n pages are in main memory, the page fault frequency of the process will be (n – m)/n
• Must switch to another algorithm

Page 170: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Tuning considerations

• In order to achieve an acceptable performance, a VM system must ensure that each process has in main memory all the pages it is currently referencing
• When this is not the case, the system performance will quickly collapse

Page 171: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

First problem

• A virtual memory system has
  – 32-bit addresses
  – 8 KB pages
• What are the sizes of the
  – Page number field?
  – Offset field?

Page 172: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (I)

• Step 1: Convert the page size to a power of 2
  8 KB = 2^__ B
• Step 2: The exponent is the length of the offset field

Page 173: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Solution (II)

• Step 3: Size of the page number field = address size – offset size
  Here 32 – __ = __ bits
• Highlight the text in the box to see the answers:

13 bits for the offset and 19 bits for the page number

Page 174: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

PAGE TABLE REPRESENTATION

Page 175: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Page table entries

• A page table entry (PTE) contains
  – A page frame number
  – Several special bits
• Assuming 32-bit addresses, all fit into four bytes

[PTE layout: page frame number | special bits]

Page 176: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The special bits (I)

• Valid bit: 1 if the page is in main memory, 0 otherwise
• Missing bit: 1 if the page is not in main memory, 0 otherwise
• The two serve the same function using opposite conventions

Page 177: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The special bits (II)

• Dirty bit: 1 if the page has been modified since it was brought into main memory, 0 otherwise
  – A dirty page must be saved in the process swap area on disk before being expelled from main memory
  – A clean page can be immediately expelled

Page 178: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The special bits (III)

• Page-referenced bit: 1 if the page has been recently accessed, 0 otherwise
  – Often simulated in software

Page 179: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Where to store page tables

• Use a three-level approach
• Store parts of the page table
  – In high-speed registers located in the MMU: the translation lookaside buffer (TLB) (good solution)
  – In main memory (bad solution)
  – On disk (ugly solution)

Page 180: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The translation look aside buffer

• A small high-speed memory
  – Contains a fixed number of PTEs
  – Content-addressable memory
• Entries include the page frame number and the page number

[TLB entry layout: page number | page frame number | special bits]

Page 181: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Realizations (I)

• TLB of the Intrinsity FastMATH
  – 32-bit addresses
  – 4 KB pages
  – Fully associative TLB with 16 entries
  – Each entry occupies 64 bits
    • 20 bits for the page number
    • 20 bits for the page frame number
    • Valid bit, dirty bit, …

Page 182: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Realizations (II)

• TLB of the UltraSPARC III
  – 64-bit addresses
    • The maximum program size is 2^44 bytes, that is, 16 TB
  – Supported page sizes are 4 KB, 16 KB, 64 KB, and 4 MB ("superpages")

Page 183: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Realizations (III)

• TLB of the UltraSPARC III
  – Dual direct-mapped (?) TLB
    • 64 entries for code pages
    • 64 entries for data pages
  – Each entry occupies 64 bits
    • Page number and page frame number
    • Context
    • Valid bit, dirty bit, …

Page 184: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The context (I)

• Conventional TLBs contain the PTEs of a specific address space
  – Must be flushed each time the OS switches from the current process to a new process
    • A frequent action in any modern OS
  – Introduces a significant time penalty

Page 185: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The context (II)

• The UltraSPARC III architecture adds to TLB entries a context identifying a specific address space
  – Page mappings from different address spaces can coexist in the TLB
  – A TLB hit now requires a match on both the page number and the context
  – Eliminates the need to flush the TLB

Page 186: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

TLB misses

• When a PTE cannot be found in the TLB, a TLB miss is said to occur
• TLB misses can be handled
  – By the computer firmware:
    • The cost of a miss is one extra memory access
  – By the OS kernel:
    • The cost of a miss is two context switches

Page 187: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Letting SW handle TLB misses

• As for other exceptions, must save the current value of the PC in the EPC register
• Must also assert the exception by the end of the clock cycle during which the memory access occurs
  – In MIPS, must prevent the WB cycle from occurring after the MEM cycle that generated the exception

Page 188: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Example

• Consider the instruction
  lw $1, 0($2)
  – If the translation of the address in $2 is not in the TLB, we must prevent any update of $1

Page 189: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Performance implications

• When TLB misses are handled by the firmware, they are very cheap
  – A TLB hit rate of 99% is very good: the average access cost will be
    Ta = 0.99×Tm + 0.01×2Tm = 1.01 Tm
• Less true if TLB misses are handled by the kernel

Page 190: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Storing the rest of the page table

• Page tables are too large to be stored whole in main memory
  – Will store the active part of the PT in main memory
  – Other entries on disk
• Three solutions
  – Linear page tables
  – Multilevel page tables
  – Hashed page tables

Page 191: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Storing the rest of the page table

• We will review these solutions even though page table organizations are an operating system topic

Page 192: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Linear page tables (I)

• Store the PT in virtual memory (the VMS solution)
• Very large page tables need more than 2 levels (3 levels on the MIPS R3000)

Page 193: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Linear page tables (II)

[Figure: linear page tables stored in virtual memory, with only parts of each PT resident in physical memory]

Page 194: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Linear page tables (III)

• Assuming a page size of 4 KB
  – Each page of virtual memory requires 4 bytes of physical memory (for its PTE)
  – Each PT maps 4 GB of virtual addresses
  – A PT will occupy 4 MB
  – Storing these 4 MB in virtual memory will require only 4 KB of physical memory (for the PTEs mapping the PT itself)

Page 195: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Multi-level page tables (I)

• The PT is divided into
  – A master index that always remains in main memory
  – Sub-indexes that can be expelled

Page 196: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Multi-level page tables (II)

[Figure: the virtual address is split into a primary index, a secondary index, and an offset; the primary index selects an entry of the master index, which points to a sub-index; the secondary index selects the frame number there, and the offset passes unchanged into the physical address]

Page 197: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Multi-level page tables (III)

• Especially suited to a page size of 4 KB and 32-bit virtual addresses
• Will allocate
  – 10 bits of the address for the first level,
  – 10 bits for the second level, and
  – 12 bits for the offset
• The master index and the sub-indexes will all have 2^10 entries and occupy 4 KB
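A small Python sketch of that bit slicing, assuming the 10/10/12 split above; the dict-based master index and sub-indexes are illustrative stand-ins for the in-memory tables (not part of the original deck):

def split(vaddr):
    """Split a 32-bit virtual address into (primary, secondary, offset)."""
    offset = vaddr & 0xFFF              # low 12 bits
    secondary = (vaddr >> 12) & 0x3FF   # next 10 bits
    primary = vaddr >> 22               # top 10 bits
    return primary, secondary, offset

def translate(vaddr, master_index):
    p1, p2, offset = split(vaddr)
    sub_index = master_index[p1]        # a missing sub-index would fault
    frame = sub_index[p2]
    return (frame << 12) | offset

master = {0: {1: 7}}                    # hypothetical mapping for the demo
print(hex(translate(0x1ABC, master)))   # 0x7abc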

Page 198: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Hashed page tables (I)

• Only contain the pages that are in main memory
  – PTs are much smaller
• Also known as inverted page tables

Page 199: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Hashed page table (II)

[Figure: the page number (PN) is hashed; the matching table entry holds the PN and the corresponding page frame number (PFN)]

Page 200: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Selecting the right page size

• Increasing the page size
  – Increases the length of the offset
  – Decreases the length of the page number
  – Reduces the size of page tables
    • Fewer entries
  – Increases internal fragmentation
• 4 KB seems to be a good choice

Page 201: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

MEMORY PROTECTION

Page 202: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Objective

• Unless we have an isolated single-user system, we must prevent users from
  – Accessing
  – Deleting
  – Modifying
  the address spaces of other processes, including the kernel

Page 203: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Historical considerations

• Earlier operating systems for personal computers did not have any protection
  – They were single-user machines
  – They typically ran one program at a time
• Windows 2000, Windows XP, Vista and Mac OS X are protected

Page 204: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Memory protection (I)

• VM ensures that processes cannot access page frames that are not referenced in their page table
• Can refine this control by distinguishing among
  – Read access
  – Write access
  – Execute access
• Must also prevent processes from modifying their own page tables

Page 205: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Dual-mode CPU

• Requires a dual-mode CPU
• Two CPU modes
  – A privileged mode, or executive mode, that allows the CPU to execute all instructions
  – A user mode that allows the CPU to execute only safe, unprivileged instructions
• The state of the CPU is determined by a special bit

Page 206: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Switching between states

• User mode will be the default mode for all programs
  – Only the kernel can run in supervisor mode
• Switching from user mode to supervisor mode is done through an interrupt
  – Safe, because the jump address is at a well-defined location in main memory

Page 207: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Memory protection (II)

• Has additional advantages:
  – Prevents programs from corrupting the address spaces of other programs
  – Prevents programs from crashing the kernel
    • Not true for device drivers, which are inside the kernel
• A required part of any multiprogramming system

Page 208: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

INTEGRATING CACHES AND VM

Page 209: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The problem

• In a VM system, each byte of memory has two addresses
  – A virtual address
  – A physical address
• Should cache tags contain virtual addresses or physical addresses?

Page 210: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Discussion

• Using virtual addresses
  – Directly available
  – Bypass the TLB
  – Cache entries are specific to a given address space
  – Must flush the caches when the OS selects another process
• Using physical addresses
  – Must access the TLB first
  – Cache entries are not specific to a given address space
  – Do not have to flush the caches when the OS selects another process

Page 211: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

The best solution

• Let the cache use physical addresses
  – No need to flush the cache at each context switch
  – The TLB access delay is tolerable

Page 212: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Processing a memory access (I)

if virtual_address in TLB:
    get physical_address
else:
    create TLB miss exception
    break

I use Python because it is very compact: hetland.org/writing/instant-python.html

Page 213: THE MEMORY HIERARCHY Jehan-François Pâris jfparis@uh.edu

Processing a memory access (II)

if read_access:
    while data not in cache:
        stall
    deliver data to CPU
else:  # write_access
    …  # continues on next page


Processing a memory access (III)

if write_access_OK:
    while data not in cache:
        stall
    write data into cache
    update dirty bit
    put data and address in write buffer
else:
    # illegal access
    create TLB miss exception
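
Putting the three fragments together, here is a minimal runnable sketch of the same flow. The dict-based TLB, cache and write buffer, the 4KB page size, and the read_from_ram stub are all assumptions made to keep the example self-contained; real hardware does this in parallel.

def read_from_ram(address):   # stub standing in for a slow RAM access
    return 0

tlb = {}             # virtual page number -> (frame number, write_ok bit)
cache = {}           # physical address -> data
write_buffer = []    # FIFO queue of (address, data) write-throughs

class TLBMiss(Exception):
    pass

def access(virtual_address, data=None, write=False):
    page, offset = divmod(virtual_address, 4096)     # assumed 4KB pages
    if page not in tlb:
        raise TLBMiss(page)       # the OS refills the TLB from the page table
    frame, write_ok = tlb[page]
    physical_address = frame * 4096 + offset
    if not write:
        if physical_address not in cache:            # "stall" until present
            cache[physical_address] = read_from_ram(physical_address)
        return cache[physical_address]               # deliver data to CPU
    if not write_ok:
        raise TLBMiss(page)       # illegal access, as in the slide above
    cache[physical_address] = data                   # write data into cache
    write_buffer.append((physical_address, data))    # queue the write-through

tlb[0] = (5, True)                # map page 0 to frame 5, writable
access(12, data=42, write=True)
print(access(12))                 # 42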


More Problems (I)

• A virtual memory system has a virtual address space of 4 Gigabytes and a page size of 4 Kilobytes. Each page table entry occupies 4 bytes.


More Problems (II)

• How many bits are used for the byte offset?
• Since 4K = 2___, the byte offset will use ___ bits.

Since 4KB = 2^12 bytes, the byte offset uses 12 bits


More Problems (III)

• How many bits are used for the page number?
• Since 4G = 2___, we will have ___-bit virtual addresses. Since the byte offset occupies ___ of these ___ bits, ___ bits are left for the page number.

The page number uses 20 bits of the address


More Problems (IV)

• What is the maximum number of page table entries in a page table?
• Address space / Page size = 2___ / 2___ = 2___ PTEs

2^20 page table entries
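
The arithmetic can be checked directly in Python (the 4 MB total uses the 4-byte PTE size given above):

print((2 ** 32) // (2 ** 12))   # 1048576 = 2^20 page table entries
print((2 ** 20) * 4)            # 4194304 bytes, i.e. 4 MB of page table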


More problems (V)

• A computer has 32 bit addresses and a page size of one kilobyte.

• How many bits are used to represent the page number?
  ___ bits
• What is the maximum number of entries in a process page table?
  2___ entries


Answer

• As 1KB = 2^10 bytes, the byte offset occupies 10 bits
• The page number uses the remaining 22 bits of the address


Some review questions

• Why are TLB entries 64 bits wide while page table entries only require 32 bits?

• What would be the main disadvantage of a virtual memory system lacking a dirty bit?

• What is the big limitation of VM systems that cannot prevent processes from executing the contents of any arbitrary page in their address space?


Answers

• We need extra space for storing the virtual page number: a page table locates an entry by its position, but a TLB must store the page number explicitly as a tag

• It would have to write back to disk all pages that it expels, even when they were not modified

• It would make the system less secure: a process could be tricked into executing code injected into its data pages


VIRTUAL MACHINES


Key idea

• Let different operating systems run at the same time on a single computer
  – Windows, Linux and Mac OS
  – A real-time OS and a conventional OS
  – A production OS and a new OS being tested


How it is done

• A hypervisor (also called a VM monitor) defines two or more virtual machines

• Each virtual machine has
  – Its own virtual CPU
  – Its own virtual physical memory
  – Its own virtual disk(s)


The virtualization process

[Diagram: the hypervisor maps the actual hardware (CPU, memory, disk) onto two or more sets of virtual hardware, each with its own virtual CPU, memory and disk.]


Reminder

• In a conventional OS,
  – Kernel executes in privileged/supervisor mode
    • Can do virtually everything
  – User processes execute in user mode
    • Cannot modify their page tables
    • Cannot execute privileged instructions


[Diagram: user processes run in user mode; the kernel runs in privileged mode; a system call switches the CPU from user mode to privileged mode.]


Two virtual machines

[Diagram: with two virtual machines, only the hypervisor runs in privileged mode; both VM kernels and all user processes run in user mode.]


Explanations (II)

• Whenever the kernel of a VM issues a privileged instruction, an interrupt occurs
  – The hypervisor takes control and does the physical equivalent of what the VM attempted to do:
    • Must convert virtual RAM addresses into physical RAM addresses
    • Must convert virtual disk block addresses into physical block addresses


Translating a block address

[Diagram: the VM kernel asks to access block x, y of its virtual disk; the hypervisor answers "that's block v, w of the actual disk" and issues an access to block v, w of the actual disk.]
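
A minimal sketch of that translation step, assuming the hypervisor keeps a simple per-VM mapping table; the dict, the block-address pairs and every number are made up for illustration:

block_map = {('vm1', (7, 3)): (42, 9)}   # virtual block (x, y) -> actual block (v, w)

def translate_block(vm, virtual_block):
    # The hypervisor turns a VM's virtual disk block into an actual disk block
    return block_map[(vm, virtual_block)]

print(translate_block('vm1', (7, 3)))    # (42, 9): block v, w of the actual disk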


Handling I/Os

• Difficult task because
  – Wide variety of devices
  – Some devices may be shared among several VMs
    • Printers
    • Shared disk partition
      – Want to let Linux and Windows access the same files


Virtual Memory Issues

• Each VM kernel manages its own memory
  – Its page tables map program virtual addresses into what it believes to be physical addresses


The dilemma

[Diagram: the VM kernel reports that page 735 of process A is stored in page frame 435; the hypervisor knows that this is really page frame 993 of the actual RAM.]


The solution (I)

• Address translation must remain fast!
  – The hypervisor lets each VM kernel manage its own page tables but does not use them
    • They contain bogus mappings!
  – It maintains instead its own shadow page tables with the correct mappings
    • Used to handle TLB misses


Why it works

• Most memory accesses go through the TLB

• The system can tolerate slower page table updates


The solution (II)

• To keep its shadow page tables up to date, the hypervisor must track any changes made by the VM kernels
• Mark the page tables read-only
  – Each attempt by a VM kernel to update them results in an interrupt
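
A toy sketch of the whole mechanism, with every structure assumed for illustration: the guest's page-table write traps (because its table is read-only) and the hypervisor mirrors the update into the shadow table after translating the frame number.

guest_page_table = {}    # guest virtual page -> guest "physical" frame (bogus)
frame_map = {435: 993}   # guest frame -> actual RAM frame; the hypervisor's secret
shadow_page_table = {}   # guest virtual page -> actual frame; feeds the TLB

def on_guest_pte_write(page, guest_frame):
    # The guest's table is read-only, so its write traps here; the hypervisor
    # applies the update and mirrors it with the correct mapping
    guest_page_table[page] = guest_frame
    shadow_page_table[page] = frame_map[guest_frame]

on_guest_pte_write(735, 435)     # the numbers from "The dilemma" above
print(shadow_page_table[735])    # 993: the frame the TLB will really use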


Nastiest Issue

• The whole VM approach assumes that a kernel executing in user mode will behave exactly like a kernel executing in privileged mode, except that privileged instructions will be trapped

• Not true for all architectures!
  – Intel x86 Pop flags (POPF) instruction
  – …


POPF instruction

• Pops the top of the stack into the lower 16 bits of EFLAGS
  – Designed for a 16-bit architecture
• EFLAGS contains the interrupt enable flag (IE)
• When executed in privileged mode, POPF updates all flags
• When executed in user mode, POPF updates all flags but the IE flag
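
A toy model of why this breaks trap-and-emulate: executed in user mode, POPF neither traps nor changes IE, so a guest kernel never learns that its attempt to disable interrupts was ignored. The flags layout below is drastically simplified and assumed.

IE = 0x0200    # interrupt-enable bit (position assumed for this toy model)

def popf(stack, flags, privileged):
    value = stack.pop()
    if privileged:
        return value                       # all flags updated, including IE
    return (value & ~IE) | (flags & IE)    # user mode: IE silently preserved

# A guest kernel in user mode tries to clear IE (disable interrupts)...
print(hex(popf([0x0000], flags=IE, privileged=False)))   # 0x200: IE still set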


Solutions

1. Modify the instruction set and eliminate instructions like POPF
   • IBM redesigned the instruction set of their 360 series for the 370 series
2. Mask it through clever software
   • Dynamic "binary translation" when direct execution of code could not work (VMware)


Other Approaches (I)

• Can use the VM approach to let binaries written in a specific machine language run on a machine with a different instruction set
• Called emulators
• Have a huge performance penalty
  – Still work fairly well when the target machine is much faster than the original architecture
  – Let us run very old binaries


Other Approaches (II)

• Can use the VM approach to let programs written in an arbitrary low-level language run on many different architectures
• Java virtual machine (JVM)
  – Ported to many architectures
  – Allows execution of programs written in "bytecode"
  – Professes to be inherently safe


CP/CMS (I)

• IBM was the dominant computer manufacturer during the 60's and the 70's
  – Machines were designed for batch processing
  – Lacked any decent time-sharing OS
• A time-sharing OS was wanted by universities
  – TSS/360 was not a great success


CP/CMS (II)

• IBM Cambridge Scientific Center
  – In Cambridge, MA
  – Developed a combination of
    • A Control Program (CP) supporting virtual machines
    • A time-sharing OS (CMS) for a single user
• Was a great success!


CP/CMS (III)

• How it worked

[Diagram: four copies of CMS, each running on its own VM, on top of CP (the hypervisor).]


CACHE CONSISTENCY


The problem

• Specific to architectures with
  – Several processors sharing the same main memory
  – Multicore architectures
• Each core/processor has its own private cache
  – Needed for performance
• Problems occur when the same data are present in two or more private caches


An example (I)

[Diagram: two CPUs, each with a private cache holding x = 0, share the same RAM.]


An example (II)

[Diagram: the first CPU increments x, so its cache now holds x = 1; the second CPU still assumes x = 0.]


Our Objective

• Single copy serializability
  – All operations on all the variables should have the same effect as if they were executed
    • in sequence, with
    • a single copy of each variable


One-copy serializability rules

1. Whenever a process accesses a variable, it always gets the value stored by the processor that updated that variable last
2. A processor accessing a variable sees all updates applied to that variable in the same order
   – The exact order does not matter as long as everybody agrees on it


An example

[Diagram: one CPU sets x to 1 while another resets x to 0; the two other CPUs, whose caches hold x = ?, must apply the two updates in the same order.]


Big problem

• When a processor updates a cached variable, the new value of the variable is not immediately written into the main memory
  – Perfect one-copy serializability is not feasible


New rules

1. Whenever a process accesses a variable, it always gets the value stored by the processor that updated that variable last, if the updates are sufficiently separated in time
2. A processor accessing a variable sees all updates applied to that variable in the same order
   – No compromise is possible here


A remark

• Data consistency issues appear in many disguises
  – Cache consistency
  – Distributed shared memory
    • work done in the early to mid 90's
  – Distributed file systems
  – Distributed databases


An example (I)

• UNIX workstations use a distributed file system called NFS (Network File System)
• An NFS installation comprises
  – client workstations
  – a centralized server
• NFS allows client workstations to cache the contents of the files they access
• What happens when two workstations access the same file?


An example (II)

[Diagram: workstations A and B each cache a modified copy (x' and x'') of the same file x stored on the server: inconsistent updates.]


Possible Approaches (I)

• Always keep a single copy:
  – Guarantees one-copy serializability
  – Would make the system too slow
    • No caching!
• Prevent shared access:
  – Guarantees one-copy serializability
  – Would be very slow and complicated


Possible Approaches (II)

• Replicate and update:
  – Allows multiple processors to cache variables already cached by other processors
  – Whenever a processor updates a cached variable, it propagates the update to all other caches holding a copy of the variable
  – Costly because processors tend to repeatedly update the same variable
    • Temporal locality of accesses


Possible Approaches (III)

• Replicate and invalidate:
  – Allows multiple processors to cache variables already cached by other processors
  – Whenever a processor updates a cached variable, we invalidate all other cached copies of the variable
    • Works well with write-through caches
      – Will get the correct value later from RAM


A realization: Snoopy caches

• All caches are linked to the main memory through a shared bus
  – All caches observe the writes performed by other caches
• When a cache notices that another cache performs a write on a memory location that it has in its cache, it invalidates the corresponding cache block
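
A compact sketch of the snooping idea; the Bus and Cache classes and their methods are invented for illustration, but the invalidate-on-observed-write behavior is the one described above:

class Bus:                                   # shared bus: every cache sees all writes
    def __init__(self):
        self.caches, self.ram = [], {}
    def write_through(self, source, address, value):
        self.ram[address] = value
        for cache in self.caches:
            if cache is not source:          # snoop: invalidate other copies
                cache.data.pop(address, None)

class Cache:
    def __init__(self, bus):
        self.data, self.bus = {}, bus
        bus.caches.append(self)
    def read(self, address):
        if address not in self.data:         # miss: fetch from RAM
            self.data[address] = self.bus.ram.get(address, 0)
        return self.data[address]
    def write(self, address, value):
        self.data[address] = value
        self.bus.write_through(self, address, value)

bus = Bus()
c1, c2 = Cache(bus), Cache(bus)
bus.ram['x'] = 2
c1.read('x'); c2.read('x')    # both caches hold x = 2
c1.write('x', 0)              # write-through; c2 invalidates its copy of x
print(c2.read('x'))           # 0: fetched again from RAM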


An example (I)

[Diagram: the first CPU's cache holds x = 2; the second CPU fetches x = 2 from RAM, which also holds x = 2.]


An example (II)

[Diagram: both caches have now fetched x, so both hold x = 2; RAM still holds x = 2.]


An example (III)

[Diagram: the first CPU resets x to 0 in its cache; the second cache and RAM still hold x = 2.]


An example (IV)

[Diagram: the first cache performs a write-through of x = 0 to RAM; the second cache detects the write-through and invalidates its copy of x.]


An example (V)

[Diagram: when the second CPU next wants to access x, its cache gets the correct value, x = 0, from RAM.]


A last correctness condition

• Caches cannot reorder their memory updates
  – The cache-to-RAM write buffer must be FIFO
    • First in, first out


Example

• A CPU performs
  – x = 0;
  – x++;  // sets x to 1
• The final value of x in the CPU cache is 1
• If the write buffer reorders the write-through requests, the final value of x in RAM (and in the other caches) will be 0
  – Ouch!
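
The same scenario in a few lines of Python, with an assumed deque-based write buffer standing in for the hardware:

from collections import deque

write_buffer = deque([('x', 0), ('x', 1)])   # the CPU performed x = 0, then x++
ram = {}
while write_buffer:
    address, value = write_buffer.popleft()  # FIFO drain: RAM ends with x = 1
    ram[address] = value
print(ram['x'])                              # 1, matching the CPU's cache
# A buffer that reordered the two writes would leave x = 0 in RAM. Ouch!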


Miscellaneous fallacies (I)

• Segmented address spaces

– Address is segment number + offset in segment

– Supposed to let programmers organize their address space into meaningful segments

– Programmers—and compilers—hate them


Miscellaneous fallacies (II)

• Ignoring virtual memory behavior when accessing large two-dimensional arrays

– Must access array in a way that minimizes number of page faults

– Done by all good mathematical software libraries
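
The classic illustration, sketched in Python; the point is the loop order, which decides whether consecutive accesses stay on the same page of a row-major array (Python's nested lists are not truly contiguous, so treat this as a stand-in for a C or NumPy array):

N = 1024
a = [[0] * N for _ in range(N)]

for i in range(N):        # row-major order: walks each row sequentially,
    for j in range(N):    # touching each page of the array only once
        a[i][j] += 1

for j in range(N):        # column-major order on the same array: each step
    for i in range(N):    # jumps a whole row and may fault on every access
        a[i][j] += 1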


Miscellaneous fallacies (III)

• Believing that you can virtualize any CPU architecture
  – Some are much more difficult than others


Concluding remarks

• As before, we have seen how human ingenuity has worked around hardware limitations
  – Cannot increase CPU speed above 3 to 4 GHz
    • Pipelining, multicore architectures
  – RAM is slower than the CPU
    • Caches
  – Hard disks are much slower than RAM
    • Virtual memory, I/O buffering