THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010


TRANSCRIPT

Page 1: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

THIRD REVIEW SESSION

Jehan-François Pâris, May 5, 2010

Page 2: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

MATERIALS (I)

• Memory hierarchies:
– Caches
– Virtual memory
– Protection
– Virtual machines
– Cache consistency

Page 3: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

MATERIALS (II)

• I/O Operations
– More about disks
– I/O operation implementation: busses, memory-mapped I/O, specific I/O instructions
– RAID organizations

Page 4: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

MATERIALS (III)

• Parallel Architectures
– Shared memory multiprocessors
– Computer clusters
– Hardware multithreading
– SISD, SIMD, MIMD, …
– Roofline performance model

Page 5: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

CACHING AND VIRTUAL MEMORY

Page 6: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Common objective

• Make a combination of
– Small, fast and expensive memory
– Large, slow and cheap memory
look like
– A single large and fast memory

• Fetch policy is fetch on demand

Page 7: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Questions to ask

• What are the transfer units?
• How are they placed in the faster memory?
• How are they accessed?
• How do we handle misses?
• How do we implement writes?
and more generally
• Are these tasks performed by the hardware or the OS?

Page 8: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Transfer units

• Blocks or pages containing 2^n bytes
– Always properly aligned
• If a block or a page contains 2^n bytes, the n LSBs of its start address will be all zeroes

Page 9: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Examples

• If block size is 4 words
– Corresponds to 16 = 2^4 bytes
– 4 LSBs of block address will be all zeroes
• If page size is 4 KB
– Corresponds to 2^2×2^10 = 2^12 bytes
– 12 LSBs of page address will be all zeroes
– Remaining bits of address form page number

Page 10: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Examples

With a page size of 4 KB, the 32-bit address of the first byte in a page has the form XXXXXXXXXXXXXXXXXXXX<12 zeroes>; an in-page address has the form XXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYY, that is, a 20-bit page number followed by a 12-bit offset.

Page 11: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Consequence

• In a 32-bit architecture
– We identify a block or a page of size 2^n bytes by the 32 – n MSBs of its address
– Will be called
• Tag (caches)
• Page number (virtual memory)

Page 12: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Placement policy

• Two extremes
– Each block can only occupy a fixed position in the faster memory
• Direct mapping (many caches)
– Each page can occupy any position in the faster memory
• Full associativity (virtual memory)

Page 13: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Direct mapping

• Assume
– Cache has 2^m entries
– Block size is 2^n bytes
– a is the block address (with its n LSBs removed)
• The block will be placed at cache position a % 2^m
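To make the arithmetic concrete, here is a small C sketch of direct mapping; the values of n, m, and the address are illustrative, not from the slides:

#include <stdio.h>
#include <stdint.h>

/* Direct mapping: block size 2^n bytes, cache with 2^m entries.
   n = 3 and m = 3 are illustrative values, not fixed by the slide. */
enum { N = 3, M = 3 };

int main(void) {
    uint32_t addr  = 0x12345678;          /* arbitrary byte address      */
    uint32_t block = addr >> N;           /* block address: drop n LSBs  */
    uint32_t index = block % (1u << M);   /* cache position: a % 2^m     */
    uint32_t tag   = addr >> (N + M);     /* tag: drop the n + m LSBs    */
    printf("block=%u index=%u tag=%u\n", block, index, tag);
    return 0;
}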

Page 14: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Consequence

• The tag identifying the cache block will be the start address of the block with its n + m LSBs removed
– the original n LSBs because they are known to be all zeroes
– the next m LSBs because they are given by a % 2^m

Page 15: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Consequence

Block start address → remove the n LSBs (they are all zeroes) → block address → remove the m additional LSBs (given by a % 2^m) → tag

Page 16: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A cache whose block size is 8 bytes

[Figure: a direct-mapped cache with eight entries, indexed 000 to 111. Each entry holds a valid bit (Y/N), a tag (bits 31:6 of the address), and an 8-byte block of two words.]

Page 17: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Fully associative solution

• Used in virtual memory systems

• Each page can occupy any free page frame in main memory

• Use a page table
– Without a redundant first column

Page #  Frame #
0       4
1       7
2       27
3       44
4       5
…       …

Page 18: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solutions with limited associativity

• A cache of size 2^m with associativity level k lets a given block occupy any of k possible locations in the cache
• Implementation looks very much like k caches of size 2^m/k put together
• All possible cache locations for a block have the same position a % (2^m/k) in each of the smaller caches

Page 19: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A set-associative cache with k=2

[Figure: a two-way set-associative cache with eight sets, indexed 000 to 111. Each of the two ways holds a valid bit (Y/N), a tag (bits 31:5 of the address), and a block.]

Page 20: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Accessing an entry

• In a cache, use hardware to compute the possible cache position for the block containing the data
– a % 2^m for a cache using direct mapping
– a % (2^m/k) for a cache of associativity level k

• Check then if the cache entry is valid using its valid bit

Page 21: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Accessing an entry

• In a VM system, hardware checks the TLB to find the frame containing a given page number

• TLB entries contain– A page number (tag)– A frame number– A valid bit– A dirty bit

Page 22: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Accessing an entry

[Figure: a TLB entry holds a page number, a page frame number, and the valid and dirty bits.]

• The valid bit indicates if the mapping is valid
• The dirty bit indicates whether we need to save the page contents when we expel it

Page 23: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Accessing an entry

• If page mapping is not in the TLB, must consult the page table and update the TLB– Can be done by hardware or software

Page 24: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Realization

[Figure: translating a virtual address. The virtual address is split into a page number and a 10-bit offset; the page number indexes the PAGE TABLE to obtain a page frame number (here page 2 maps to frame 897), which is concatenated with the unchanged offset to form the physical address.]
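A minimal C sketch of this translation, assuming a toy four-entry page table; only the page 2 → frame 897 mapping comes from the figure, the other values are made up for illustration:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 10                    /* 10-bit offset, as in the figure */

/* Toy page table: entry i holds the frame number of page i.
   Only page 2 -> frame 897 comes from the figure. */
static const uint32_t page_table[] = { 100, 101, 897, 102 };

uint32_t translate(uint32_t vaddr) {
    uint32_t page   = vaddr >> OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
    uint32_t frame  = page_table[page];
    return (frame << OFFSET_BITS) | offset;
}

int main(void) {
    uint32_t v = (2u << OFFSET_BITS) | 57;    /* page 2, offset 57 */
    printf("virtual 0x%x -> physical 0x%x\n", v, translate(v));
    return 0;
}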

Page 25: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling cache misses

• Cache hardware fetches missing block
• Often overwriting an existing entry
• Which one?
– The one that occupies the same location if the cache uses direct mapping
– One of those that occupy the same location if the cache is set-associative

Page 26: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling cache misses

• Before expelling a cache entry, we must– Check its dirty bit– Save its contents if dirty bit is on.

Page 27: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling page faults

• OS fetches missing page• Often overwriting an existing page• Which one?

– One that was not recently used• Selected by page replacement policy

Page 28: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling page faults

• Before expelling a page, we must– Check its dirty bit– Save its contents if dirty bit is on.

Page 29: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling writes (I)

• Two ways to handle writes– Write through:

• Each write updates both the cache and the main memory

– Write back:• Writes are not propagated to the main

memory until the updated word is expelled from the cache

Page 30: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling writes (II)

[Figure: with write through, each CPU write goes to the cache and immediately on to RAM; with write back, the write goes to the cache and reaches RAM later.]

Page 31: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Pros and cons

• Write through:
– Ensures that memory is always up to date
• Expelled cache entries can be overwritten
• Write back:
– Faster writes
– Complicates the cache expulsion procedure
• Must write back cache entries that have been modified in the cache

Page 32: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A better write through (I)

• Add a small buffer to speed up write performance of write-through caches
– At least four words
• Holds modified data until they are written into main memory
– Cache can proceed as soon as data are written into the write buffer

Page 33: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A better write through (II)

[Figure: with plain write through, CPU writes go from the cache straight to RAM; with the better scheme, they go into a write buffer that later empties into RAM.]

Page 34: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Designing RAM to support caches

• RAM connected to CPU through a "bus"
– Clock rate much slower than CPU clock rate
• Assume that a RAM access takes
– 1 bus clock cycle to send the address
– 15 bus clock cycles to initiate a read
– 1 bus clock cycle to send a word of data

Page 35: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Designing RAM to support caches

• Assume
– Cache block size is 4 words
– One-word bank of DRAM
• Fetching a cache block would take 1 + 4×15 + 4×1 = 65 bus clock cycles
– Transfer rate is 0.25 byte/bus cycle
• Awful!

Page 36: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Designing RAM to support caches

• Could
– Have an interleaved memory organization
– Four one-word banks of DRAM
– A 32-bit bus

[Figure: four one-word RAM banks (bank 0 to bank 3) attached to a shared 32-bit bus.]

Page 37: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Designing RAM to support caches

• Can do the 4 accesses in parallel
• Must still transmit the block 32 bits by 32 bits
• Fetching a cache block would take 1 + 15 + 4×1 = 20 bus clock cycles
– Transfer rate is 0.80 byte/bus cycle
• Even better
• Much cheaper than having a 64-bit bus
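A quick C check of both figures, using the bus costs given above:

#include <stdio.h>

/* Bus cycles to fetch a 4-word (16-byte) cache block, using the slide's
   costs: 1 cycle to send the address, 15 to initiate a read, 1 per word. */
int main(void) {
    int one_bank  = 1 + 4 * 15 + 4 * 1;   /* sequential accesses: 65 cycles */
    int four_bank = 1 + 15 + 4 * 1;       /* interleaved banks:   20 cycles */
    printf("one bank: %d cycles, %.2f bytes/cycle\n",
           one_bank, 16.0 / one_bank);
    printf("four interleaved banks: %d cycles, %.2f bytes/cycle\n",
           four_bank, 16.0 / four_bank);
    return 0;
}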

Page 38: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

PERFORMANCE ISSUES

Page 39: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Memory stalls

• Can divide CPU time into
– N_EXEC clock cycles spent executing instructions
– N_MEM_STALLS cycles spent waiting for memory accesses
• We have CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE

Page 40: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Memory stalls

• We assume that
– cache access times can be neglected
– most CPU cycles spent waiting for memory accesses are caused by cache misses

Page 41: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Global impact

• We have

N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
• and also
N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) × Cache miss penalty

Page 42: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• Miss rate of instruction cache is 2 percent
• Miss rate of data cache is 5 percent
• In the absence of memory stalls, each instruction would take 2 cycles
• Miss penalty is 100 cycles
• 40 percent of instructions access the main memory

• How many cycles are lost due to cache misses?

Page 43: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (I)

• Impact of instruction cache misses: 0.02×100 = 2 cycles/instruction
• Impact of data cache misses: 0.40×0.05×100 = 2 cycles/instruction
• Total impact of cache misses: 2 + 2 = 4 cycles/instruction

Page 44: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (II)

• Average number of cycles per instruction: 2 + 4 = 6 cycles/instruction
• Fraction of time wasted: 4/6 = 67 percent
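The same arithmetic as a small C program; all numbers are the example's:

#include <stdio.h>

/* Numbers from the example: 2% I-cache misses, 5% D-cache misses,
   100-cycle penalty, 40% of instructions access memory, base CPI 2. */
int main(void) {
    double i_stall = 0.02 * 100;           /* 2 cycles/instruction */
    double d_stall = 0.40 * 0.05 * 100;    /* 2 cycles/instruction */
    double cpi     = 2 + i_stall + d_stall;
    printf("stalls = %.1f cycles/instr, CPI = %.1f, wasted = %.0f%%\n",
           i_stall + d_stall, cpi, 100 * (i_stall + d_stall) / cpi);
    return 0;
}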

Page 45: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Average memory access time

• Some authors call it AMAT (average memory access time):
T_AVERAGE = T_CACHE + f × T_MISS
where f is the cache miss rate
• Times can be expressed
– In nanoseconds
– In number of cycles
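A minimal C sketch of the formula, exercised with the numbers of the example on the following slides (1-cycle cache access, 4 percent miss rate, 100-cycle memory access):

#include <stdio.h>

/* T_AVERAGE = T_CACHE + f * T_MISS, all expressed in cycles here. */
double amat(double t_cache, double miss_rate, double t_miss) {
    return t_cache + miss_rate * t_miss;
}

int main(void) {
    /* 96% hit rate -> 4% miss rate, 1-cycle cache, 100-cycle memory. */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.04, 100.0));
    return 0;
}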

Page 46: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• A cache has a hit rate of 96 percent• Accessing data

– In the cache requires one cycle– In the memory requires 100 cycles

• What is the average memory access time?

Page 47: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution

• Miss rate = 1 – Hit rate = 0.04
• Applying the formula
T_AVERAGE = 1 + 0.04×100 = 5 cycles

Page 48: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

In other words

It's the miss rate, stupid!

Page 49: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Improving cache hit rate

• Two complementary techniques
– Using set-associative caches
• Must check tags of all blocks with the same index values
– Slower
• Have fewer collisions
– Fewer misses
– Using a cache hierarchy

Page 50: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A cache hierarchy

• Topmost cache– Optimized for speed, not miss rate– Rather small– Uses a small block size

• As we go down the hierarchy– Cache sizes increase– Block sizes increase– Cache associativity level increases

Page 51: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• Cache miss rate per instruction is 3 percent
• In the absence of memory stalls, each instruction would take one cycle
• Cache miss penalty is 100 ns
• Clock rate is 4 GHz

• How many cycles are lost due to cache misses?

Page 52: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (I)

• Duration of clock cycle: 1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
• Cache miss penalty: 100 ns = 400 cycles
• Total impact of cache misses: 0.03×400 = 12 cycles/instruction

Page 53: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (II)

• Average number of cycles per instruction: 1 + 12 = 13 cycles/instruction
• Fraction of time wasted: 12/13 = 92 percent

A very good case for hardware multithreading

Page 54: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example (cont'd)

• How much faster would the processor if we added a L2 cache that – Has a 5 ns access time– Would reduce miss rate to main memory to

one percent?

Page 55: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (I)

• L2 cache access time: 5 ns = 20 cycles
• Impact of cache misses per instruction: L1 cache misses + L2 cache misses = 0.03×20 + 0.01×400 = 0.6 + 4.0 = 4.6 cycles/instruction
• Average number of cycles per instruction: 1 + 4.6 = 5.6 cycles/instruction

Page 56: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (II)

• Fraction of time wasted: 4.6/5.6 = 82 percent
• CPU speedup: 13/5.6 ≈ 2.32

Page 57: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Problem

• Redo the second part of the example assuming that the secondary cache
– Has a 3 ns access time
– Can reduce the miss rate to main memory to one percent

Page 58: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution

• Fraction of time wasted: 86 percent
• CPU speedup: 1.22

The new L2 cache, with a lower access time but a higher miss rate, performs much worse than the first L2 cache.

Page 59: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• A virtual memory has a page fault rate of 10^-4 faults per memory access
• Accessing data
– In the memory requires 100 ns
– On disk requires 5 ms
• What is the average memory access time?
T_avg = 100 ns + 10^-4 × 5 ms = 600 ns

Page 60: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The cost of a page fault

• Let
– T_m be the main memory access time
– T_d the disk access time
– f the page fault rate
– T_a the average access time of the VM
T_a = (1 – f) T_m + f (T_m + T_d) = T_m + f T_d

Page 61: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• Assume T_m = 50 ns and T_d = 5 ms

f        Mean memory access time
10^-3    50 ns + 5 ms/10^3 = 5,050 ns
10^-4    50 ns + 5 ms/10^4 = 550 ns
10^-5    50 ns + 5 ms/10^5 = 100 ns
10^-6    50 ns + 5 ms/10^6 = 55 ns
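The table above can be reproduced with a few lines of C:

#include <stdio.h>

/* T_a = T_m + f * T_d with T_m = 50 ns and T_d = 5 ms (5,000,000 ns). */
int main(void) {
    double t_m = 50.0, t_d = 5e6;    /* both in nanoseconds */
    double f[] = { 1e-3, 1e-4, 1e-5, 1e-6 };
    for (int i = 0; i < 4; i++)
        printf("f = %g: T_a = %g ns\n", f[i], t_m + f[i] * t_d);
    return 0;
}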

Page 62: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

In other words

It's the page fault rate, stupid!

Page 63: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Locality principle (I)

• A process that would access its pages in a totally unpredictable fashion would perform very poorly in a VM system unless all its pages are in main memory

Page 64: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Locality principle (II)

• Process P accesses randomly a very large array consisting of n pages

• If m of these n pages are in main memory, the page fault frequency of the process will be (n – m)/n

• Must switch to another algorithm

Page 65: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

First problem

• A virtual memory system has– 32 bit addresses– 4 KB pages

• What are the sizes of the – Page number field?– Offset field?

Page 66: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (I)

• Step 1: Convert page size to a power of 2
4 KB = 2^12 B
• Step 2: The exponent is the length of the offset field

Page 67: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution (II)

• Step 3: Size of page number field = Address size – Offset size
Here 32 – 12 = 20 bits

12 bits for the offset and 20 bits for the page number
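A minimal C sketch of this 20/12 split; the address value is arbitrary:

#include <stdio.h>
#include <stdint.h>

/* 32-bit addresses, 4 KB pages: 12-bit offset, 20-bit page number. */
int main(void) {
    uint32_t addr   = 0xCAFEBABE;
    uint32_t offset = addr & 0xFFF;    /* low 12 bits  */
    uint32_t page   = addr >> 12;      /* high 20 bits */
    printf("page = 0x%05x, offset = 0x%03x\n", page, offset);
    return 0;
}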

Page 68: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

MEMORY PROTECTION

Page 69: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Objective

• Unless we have an isolated single-user system, we must prevent users from
– Accessing
– Deleting
– Modifying
the address spaces of other processes, including the kernel

Page 70: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Memory protection (I)

• VM ensures that processes cannot access page frames that are not referenced in their page table
• Can refine control by distinguishing among
– Read access
– Write access
– Execute access
• Must also prevent processes from modifying their own page tables

Page 71: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Dual-mode CPU

• Requires a dual-mode CPU
• Two CPU modes
– Privileged mode (or executive mode) that allows the CPU to execute all instructions
– User mode that allows the CPU to execute only safe unprivileged instructions
• State of CPU is determined by a special bit

Page 72: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Switching between states

• User mode will be the default mode for all programs
– Only the kernel can run in supervisor mode
• Switching from user mode to supervisor mode is done through an interrupt
– Safe because the jump address is at a well-defined location in main memory

Page 73: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Memory protection (II)

• Has additional advantages:
– Prevents programs from corrupting address spaces of other programs
– Prevents programs from crashing the kernel
• Not true for device drivers, which are inside the kernel
• Required part of any multiprogramming system

Page 74: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

INTEGRATING CACHES AND VM

Page 75: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The problem

• In a VM system, each byte of memory has two addresses
– A virtual address
– A physical address

• Should cache tags contain virtual addresses or physical addresses?

Page 76: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Discussion

• Using virtual addresses
– Directly available
– Bypass the TLB
– Cache entries are specific to a given address space
– Must flush caches when the OS selects another process
• Using physical addresses
– Must access the TLB first
– Cache entries are not specific to a given address space
– Do not have to flush caches when the OS selects another process

Page 77: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The best solution

• Let the cache use physical addresses
– No need to flush the cache at each context switch
– TLB access delay is tolerable

Page 78: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

VIRTUAL MACHINES

Page 79: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Key idea

• Let different operating systems run at the same time on a single computer
– Windows, Linux and Mac OS
– A real-time OS and a conventional OS
– A production OS and a new OS being tested

Page 80: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

How it is done

• A hypervisor (VM monitor) defines two or more virtual machines
• Each virtual machine has
– Its own virtual CPU
– Its own virtual physical memory
– Its own virtual disk(s)

Page 81: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Two virtual machines

[Figure: two virtual machines. The hypervisor runs in privileged mode; each VM's kernel and the user processes above it run in user mode.]

Page 82: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Translating a block address

[Figure: the VM kernel asks to access block (x, y) of its virtual disk; the hypervisor translates this into block (v, w) of the actual disk and performs the access.]

Page 83: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Handling I/Os

• Difficult task because
– Wide variety of devices
– Some devices may be shared among several VMs
• Printers
• Shared disk partitions
– Want to let Linux and Windows access the same files

Page 84: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Virtual Memory Issues

• Each VM kernel manages its own memory
– Its page tables map program virtual addresses into pseudo-physical addresses
• It treats these addresses as physical addresses

Page 85: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The dilemma

[Figure: user process A and the VM kernel believe that page 735 of process A is stored in page frame 435; the hypervisor knows it is actually in page frame 993 of the real RAM.]

Page 86: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The solution (I)

• Address translation must remain fast!
– The hypervisor lets each VM kernel manage its own page tables but does not use them
• They contain bogus mappings!
– It maintains instead its own shadow page tables with the correct mappings
• Used to handle TLB misses

Page 87: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The solution (II)

• To keep its shadow page tables up to date, hypervisor must track any changes made by the VM kernels

• Mark page tables read-only

Page 88: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Nastiest Issue

• The whole VM approach assumes that a kernel executing in user mode will behave exactly like a kernel executing in privileged mode

• Not true for all architectures!– Intel x86 Pop flags (POPF) instruction– …

Page 89: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solutions

1. Modify the instruction set and eliminate instructions like POPF
• IBM redesigned the instruction set of their 360 series for the 370 series
2. Mask it through clever software
• Dynamic "binary translation" when direct execution of code could not work (VMware)

Page 90: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

CACHE CONSISTENCY

Page 91: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The problem

• Specific to architectures with
– Several processors sharing the same main memory
– Multicore architectures
• Each core/processor has its own private cache
– A must for performance
• Happens when the same data are present in two or more private caches

Page 92: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (I)

[Figure: two CPUs, each with a private cache holding x = 0, share the same RAM.]

Page 93: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (II)

[Figure: one CPU increments x, so its cache now holds x = 1; the other CPU still assumes x = 0.]

Page 94: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example

[Figure: one CPU sets x to 1 while the other CPU also updates it; unless both CPUs apply the two updates in the same order, their caches can end up holding different values of x.]

Page 95: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Rules

1. Whenever a processor accesses a variable, it always gets the value stored by the processor that updated that variable last, provided the updates are sufficiently separated in time
2. A processor accessing a variable sees all updates applied to that variable in the same order
– No compromise is possible here

Page 96: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A realization: Snoopy caches

• All caches are linked to the main memory through a shared bus
– All caches observe the writes performed by the other caches
• When a cache notices that another cache performs a write on a memory location that it holds, it invalidates the corresponding cache block

Page 97: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (I)

[Figure: the first CPU's cache holds x = 2; the second CPU fetches x = 2 from RAM.]

Page 98: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (II)

[Figure: the second CPU has also fetched x, so both caches now hold x = 2.]

Page 99: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (III)

[Figure: the first CPU resets x to 0 in its cache; the second cache still holds x = 2.]

Page 100: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (IV)

[Figure: the first cache performs a write-through of x = 0 to RAM; the second cache detects the write-through on the bus and invalidates its copy of x.]

Page 101: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An example (V)

[Figure: when the second CPU next accesses x, its cache fetches the correct value x = 0 from RAM.]

Page 102: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A last correctness condition

• Caches cannot reorder their memory updates
– The cache-to-RAM write buffer must be FIFO
• First in, first out


Page 104: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Miscellaneous fallacies

• Segmented address spaces
– Address is segment number + offset in segment
– Programmers hate them

• Ignoring virtual memory behavior when accessing large two-dimensional arrays

• Believing that you can virtualize any CPU architecture

Page 105: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

DEPENDABILITY

Page 106: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Reliability and Availability

• Reliability
– Probability R(t) that the system will be up at time t if it was up at time t = 0
• Availability
– Fraction of time the system is up
• Reliability and availability do not measure the same thing!

Page 107: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

MTTF, MTTR and MTBF

• MTTF is the mean time to failure

• MTTR is the mean time to repair

• 1/MTTF is the failure rate

• MTBF, the mean time between failures, is

MTBF = MTTF + MTTR

Page 108: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Reliability

• As a first approximation

R(t) = exp(–t/MTTF)

– Not true if failure rate varies over time
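A small C sketch of this approximation; the 20-year MTTF and 5-year horizon anticipate the example posed a few slides below, and the result is about 0.78:

#include <stdio.h>
#include <math.h>

/* R(t) = exp(-t / MTTF), assuming a constant failure rate. */
int main(void) {
    double mttf = 20.0, t = 5.0;    /* both in years */
    printf("R(%g years) = %.3f\n", t, exp(-t / mttf));
    return 0;
}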

Page 109: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Availability

• Measured by

(MTTF)/(MTTF + MTTR) = MTTF/MTBF

– MTTR is very important

Page 110: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• A server crashes on the average once a month
• When this happens, it takes six hours to reboot it
• What is the server availability?

Page 111: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution

• MTBF = 30 days
• MTTR = six hours = ¼ day
• MTTF = 29 ¾ days
• Availability is 29.75/30 ≈ 99.2 %

Page 112: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• A disk drive has an MTTF of 20 years.
• What is the probability that the data it contains will not be lost over a period of five years?

Page 113: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Example

• A disk farm contains 100 disks whose MTTF is 20 years.

• What is the probability that no data will be lost over a period of five years?

Page 114: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solution

• The aggregate failure rate of the disk farm is 100 × (1/20) = 5 failures per year
• The mean time to failure of the farm is 1/5 year
• We apply the formula
R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10^-11

Page 115: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID Arrays

Page 116: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Today’s Motivation

• We use RAID today for
– Increasing disk throughput by allowing parallel access
– Eliminating the need to make disk backups
• Disks are too big to be backed up in an efficient fashion

Page 117: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 0

• No replication
• Advantages:
– Simple to implement
– No overhead
• Disadvantage:
– If the array has n disks, its failure rate is n times the failure rate of a single disk

Page 118: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID levels 0 and 1

[Figure: RAID level 0 stripes blocks across the disks; RAID level 1 adds a mirror copy of each disk.]

Page 119: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 1

• Mirroring:
– Two copies of each disk block
• Advantages:
– Simple to implement
– Fault-tolerant
• Disadvantage:
– Requires twice the disk capacity of normal file systems

Page 120: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 2

• Instead of duplicating the data blocks we use an error correction code
• Very bad idea because disk drives either work correctly or do not work at all
– Only possible errors are omission errors
– We need an omission correction code
• A parity bit is enough to correct a single omission

Page 121: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID levels 2 and 3

[Figure: RAID level 2 uses several check disks; RAID level 3 uses a single parity disk.]

Page 122: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 3

• Requires N+1 disk drives
– N drives contain data (1/N of each data block)
• Block b[k] is partitioned into N fragments b[k,1], b[k,2], ..., b[k,N]
– The parity drive contains the exclusive or of these N fragments:
p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]

Page 123: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

How parity works
• Truth table for XOR (same as parity)

A B A⊕B
0 0 0
0 1 1
1 0 1
1 1 0

Page 124: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Recovering from a disk failure
• A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1

D0 D1 P
0  0  0
0  1  1
1  0  1
1  1  0

• Since D1⊕P = D0 and D0⊕P = D1, either data disk can be rebuilt from the other data disk and the parity disk
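A small C sketch of parity generation and recovery; N = 4 and the fragment values are illustrative:

#include <stdio.h>

/* Parity over N data fragments, and recovery of a lost fragment. */
enum { N = 4 };

int main(void) {
    unsigned char b[N] = { 0x12, 0x34, 0x56, 0x78 }, p = 0;
    for (int i = 0; i < N; i++) p ^= b[i];       /* p = b0 ^ b1 ^ ... ^ b3 */

    /* Pretend fragment 2 is lost: XOR of the parity and the survivors
       rebuilds it. */
    unsigned char rebuilt = p;
    for (int i = 0; i < N; i++)
        if (i != 2) rebuilt ^= b[i];
    printf("lost 0x%02x, rebuilt 0x%02x\n", b[2], rebuilt);
    return 0;
}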

Page 125: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

How RAID level 3 works (I)

• Assume we have N + 1 disks
• Each block is partitioned into N equal chunks

[Figure: a block split into four chunks (N = 4 in the example).]

Page 126: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

How RAID level 3 works (II)

• XOR the data chunks to compute the parity chunk
• Each chunk is written into a separate disk

Page 127: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

How RAID level 3 works (III)

• Each read/write involves all disks in the RAID array
– Cannot do two or more reads/writes in parallel
– Performance of the array is no better than that of a single disk

Page 128: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 4 (I)

• Requires N+1 disk drives
– N drives contain data
• Individual blocks, not chunks
– Blocks with the same disk address form a stripe

Page 129: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 4 (II)

• Parity drive contains the exclusive or of the N blocks in the stripe:
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]

• Parity block now reflects contents of several blocks!

• Can now do parallel reads/writes

Page 130: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID levels 4 and 5

[Figure: RAID level 4 keeps all parity blocks on a single drive, which becomes a bottleneck; RAID level 5 distributes the parity blocks over all drives.]

Page 131: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID LEVEL 5

• The single parity drive of RAID level 4 is involved in every write
– Will limit parallelism
• RAID 5 distributes the parity blocks among the N+1 drives
– Much better

Page 132: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

The small write problem

• Specific to RAID 5
• Happens when we want to update a single block
– The block belongs to a stripe
– How can we compute the new value of the parity block?

[Figure: a stripe consisting of blocks b[k], b[k+1], b[k+2], ... and their parity block p[k].]

Page 133: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

First solution
• Read the values of the N-1 other blocks in the stripe
• Recompute
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
– N-1 reads
– 2 writes (new block and new parity block)

Page 134: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Second solution
• Assume we want to update block b[m]
• Read the old values of b[m] and parity block p[k]
• Compute
new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
– 2 reads (old values of block and parity block)
– 2 writes (new block and new parity block)
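The same computation in a couple of lines of C; all byte values are illustrative:

#include <stdio.h>

/* RAID level 5 small write: new parity from the old block, the new
   block, and the old parity (2 reads + 2 writes overall). */
int main(void) {
    unsigned char old_b = 0x5A, new_b = 0x3C, old_p = 0x77;
    unsigned char new_p = new_b ^ old_b ^ old_p;
    printf("new parity = 0x%02x\n", new_p);
    return 0;
}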

Page 135: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID level 6 (I)

• Not part of the original proposal
– Two check disks
– Tolerates two disk failures
– More complex updates

Page 136: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

RAID level 6 (II)

• Has become more popular as disks are becoming
– Bigger
– More vulnerable to irrecoverable read errors
• The most frequent cause of RAID level 5 array failures is
– An irrecoverable read error occurring while the contents of a failed disk are reconstituted

Page 137: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

CONNECTING I/O DEVICES

Page 138: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Busses

• Connecting computer subsystems with each other was traditionally done through busses

• A bus is a shared communication link connecting multiple devices

• Transmit several bits at a time
– Parallel busses

Page 139: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Busses

Page 140: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Examples

• Processor-memory busses
– Connect CPU with memory modules
– Short and high-speed
• I/O busses
– Longer
– Wide range of data bandwidths
– Connect to memory through the processor-memory bus or a backplane bus

Page 141: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Synchronous busses

• Include a clock in the control lines
• Bus protocols expressed as actions to be taken at each clock pulse
• Have very simple protocols
• Disadvantages
– All bus devices must run at the same clock rate
– Due to clock skew issues, cannot be both fast and long

Page 142: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Asynchronous busses

• Have no clock
• Can accommodate a wide variety of devices
• Have no clock skew issues
• Require a handshaking protocol before any transmission
– Implemented with extra control lines

Page 143: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Advantages of busses

• Cheap
– One bus can link many devices
• Flexible
– Can add devices

Page 144: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Disadvantages of busses

• Shared devices
– Can become bottlenecks
• Hard to run many parallel lines at high clock speeds

Page 145: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

New trend

• Away from parallel shared busses
• Towards serial point-to-point switched interconnections
– Serial
• One bit at a time
– Point-to-point
• Each line links a specific device to another specific device

Page 146: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

x86 bus organization

• Processor connects to peripherals through two chips (bridges)
– North Bridge
– South Bridge

Page 147: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

x86 bus organization

[Figure: the CPU connects to the North Bridge, which in turn connects to the South Bridge.]

Page 148: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

North bridge

• Essentially a DMA controller
– Lets the disk controller access main memory w/o any intervention of the CPU
• Connects the CPU to
– Main memory
– Optional graphics card
– South Bridge

Page 149: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

South Bridge

• Connects North bridge to a wide variety of I/O busses

Page 150: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Communicating with I/O devices

• Two solutions
– Memory-mapped I/O
– Special I/O instructions

Page 151: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Memory mapped I/O

• A portion of the address space is reserved for I/O operations
– Writes to any of these addresses are interpreted as I/O commands
– Reading from these addresses gives access to
• Error bit
• I/O completion bit
• Data being read

Page 152: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Memory mapped I/O

• User processes cannot access these addresses
– Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion

Page 153: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Dedicated I/O instructions

• Privileged instructions that cannot be executed by user processes
– Only the kernel can issue them
• Prevents user processes from accessing the disk in an uncontrolled fashion

Page 154: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Polling

• Simplest way for an I/O device to communicate with the CPU

• CPU periodically checks the status of pending I/O operations
– High CPU overhead

Page 155: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

I/O completion interrupts

• Notify the CPU that an I/O operation has completed
• Allows the CPU to do something else while waiting for the completion of an I/O operation
– Multiprogramming
• I/O completion interrupts are processed by the CPU between instructions
– No internal instruction state to save

Page 156: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Interrupt levels

• See previous chapter

Page 157: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Direct memory access

• DMA
• Lets the disk controller access main memory w/o any intervention of the CPU

Page 158: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

DMA and virtual memory

• A single DMA transfer may cross page boundaries with
– One page being in main memory
– One missing page

Page 159: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solutions

• Make DMA work with virtual addresses
– The issue is then dealt with by the virtual memory subsystem
• Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries

Page 160: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An Example

[Figure: a DMA transfer spanning a page boundary is broken into two DMA transfers, one per page.]
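A minimal C sketch of this splitting, with an illustrative page size and transfer parameters:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u    /* illustrative page size */

/* Split a DMA transfer into chunks that never cross a page boundary. */
void dma_split(uint32_t addr, uint32_t len) {
    while (len > 0) {
        uint32_t room  = PAGE_SIZE - (addr % PAGE_SIZE);
        uint32_t chunk = len < room ? len : room;
        printf("DMA: addr=0x%x len=%u\n", addr, chunk);
        addr += chunk;
        len  -= chunk;
    }
}

int main(void) {
    dma_split(0x1F00, 0x300);    /* crosses one page boundary */
    return 0;
}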

Page 161: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

DMA and cache hierarchy

• Three approaches for handling temporary inconsistencies between caches and main memory

Page 162: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Solutions

1. Route all DMA accesses through the cache
– Bad solution
2. Have the OS selectively
– Invalidate affected cache entries when performing a read
– Force an immediate flush of dirty cache entries when performing a write
3. Have specific hardware do the same

Page 163: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Benchmarking I/O

Page 164: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Benchmarks

• Specific benchmarks for
– Transaction processing
• Emphasis on speed and graceful recovery from failures
– Atomic transactions:
• All or nothing behavior

Page 165: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

An important observation

• Very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput
– Unless we sequentially access very large ranges of data
• 512 KB and more

Page 166: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Major fallacies

• Since rated MTTFs of disk drives exceed one million hours, disks can last more than 100 years
– The MTTF expresses the failure rate during the disk's actual lifetime
• Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature
– They are up to ten times higher

Page 167: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Major fallacies

• Neglecting to do end-to-end checks
– …
• Using magnetic tapes to back up disks
– Tape formats can quickly become obsolete
– Disk bit densities have grown much faster than tape data densities.

Page 168: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

WRITING PARALLEL PROGRAMS

Page 169: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Overview

• Some problems are embarrassingly parallel
– Many computer graphics tasks
– Brute force searches in cryptography or password guessing
• Much more difficult for other applications
– Communication overhead among sub-tasks
– Amdahl's law
– Balancing the load

Page 170: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Amdahl's Law

• Assume a sequential process takes
– t_p seconds to perform operations that could be performed in parallel
– t_s seconds to perform purely sequential operations
• The maximum speedup will be
(t_p + t_s)/t_s
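A one-line check of the formula in C; the 95 s / 5 s split is an illustrative workload, not from the slides:

#include <stdio.h>

/* Maximum speedup (t_p + t_s) / t_s, with times in seconds. */
int main(void) {
    double t_p = 95.0, t_s = 5.0;    /* illustrative workload */
    printf("max speedup = %.1fx\n", (t_p + t_s) / t_s);    /* 20x */
    return 0;
}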

Page 171: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Balancing the load

• Must ensure that workload is equally divided among all the processors

• Worst case is when one of the processors does much more work than all others

Page 172: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

A last issue

• Humans like to address issues one after the other
– We have meeting agendas
– We do not like to be interrupted
– We write sequential programs

Page 173: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

MULTIPROCESSOR ORGANIZATIONS

Page 174: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Shared memory multiprocessors

[Figure: several processing units (PUs), each with a private cache, connected through an interconnection network to shared RAM and I/O.]

Page 175: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Shared memory multiprocessor

• Can offer
– Uniform memory access to all processors (UMA)
• Easiest to program
– Non-uniform memory access to all processors (NUMA)
• Can scale up to larger sizes
• Offers faster access to nearby memory

Page 176: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Computer clusters

[Figure: several nodes, each with its own PU, cache, and RAM, linked by an interconnection network.]

Page 177: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Computer clusters

• Very easy to assemble
• Can take advantage of high-speed LANs
– Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing

Page 178: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

HARDWARE MULTITHREADING

Page 179: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

General idea

• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
– Increased cost of cache misses

Page 180: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Implementation

• Entirely controlled by the hardware
– Unlike multiprogramming
• Requires a processor capable of
– Keeping track of the state of each thread
• One set of registers (including the PC) for each concurrent thread
– Quickly switching among concurrent threads

Page 181: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Approaches

• Fine-grained multithreading:
– Switches between threads for each instruction
– Provides the highest throughput
– Slows down the execution of individual threads

Page 182: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Approaches

• Coarse-grained multithreading:
– Switches between threads whenever a long stall is detected
– Easier to implement
– Cannot eliminate all stalls

Page 183: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Approaches

• Simultaneous multithreading:
– Takes advantage of the ability of modern hardware to perform different tasks in parallel for instructions of different threads
– Best solution

Page 184: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

ALPHABET SOUP

Page 185: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Classification

• SISD:
– Single instruction, single data
– Conventional uniprocessor architecture
• MIMD:
– Multiple instructions, multiple data
– Conventional multiprocessor architecture

Page 186: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Classification

• SIMD:
– Single instruction, multiple data
– Perform the same operations on a set of similar data
• Think of adding two vectors

for (i = 0; i < VECSIZE; i++)
    sum[i] = a[i] + b[i];

Page 187: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

PERFORMANCE ISSUES

Page 188: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Roofline model

• Takes into account
– Memory bandwidth
– Floating-point performance
• Introduces arithmetic intensity
– Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
– Measured in FLOPS/byte

Page 189: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Roofline model

• Attainable GFLOPS/s = Min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)
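A minimal C sketch of the model; the bandwidth and peak figures are illustrative:

#include <stdio.h>

/* Attainable GFLOPS/s = min(bandwidth * intensity, peak FP performance). */
double roofline(double bw_gbs, double peak_gflops, double intensity) {
    double bound = bw_gbs * intensity;    /* memory-bandwidth bound */
    return bound < peak_gflops ? bound : peak_gflops;
}

int main(void) {
    /* 16 GB/s of memory bandwidth and a 16 GFLOPS/s peak, for example. */
    for (double ai = 0.5; ai <= 4.0; ai *= 2)
        printf("AI = %.1f -> %.1f GFLOPS/s\n",
               ai, roofline(16.0, 16.0, ai));
    return 0;
}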

Page 190: THIRD REVIEW SESSION Jehan-François Pâris May 5, 2010

Roofline model

[Figure: roofline plot of attainable GFLOPS/s (y-axis, 0 to 20) versus arithmetic intensity (x-axis, 0 to 5). At low arithmetic intensity, floating-point performance is limited by memory bandwidth; at high intensity, it levels off at the peak floating-point performance.]