THIRD REVIEW SESSION
Jehan-François Pâris, May 5, 2010
MATERIALS (I)
• Memory hierarchies:
  – Caches
  – Virtual memory
  – Protection
  – Virtual machines
  – Cache consistency
MATERIALS (II)
• I/O Operations
  – More about disks
  – I/O operation implementation:
    • Busses
    • Memory-mapped I/O
    • Specific I/O instructions
  – RAID organizations
MATERIALS (III)
• Parallel Architectures
  – Shared memory multiprocessors
  – Computer clusters
  – Hardware multithreading
  – SISD, SIMD, MIMD, …
  – Roofline performance model
CACHING AND VIRTUAL MEMORY
Common objective
• Make a combination of
  – Small, fast and expensive memory
  – Large, slow and cheap memory
  look like
  – A single large and fast memory
• Fetch policy is fetch on demand
Questions to ask
• What are the transfer units?
• How are they placed in the faster memory?
• How are they accessed?
• How do we handle misses?
• How do we implement writes?
and more generally
• Are these tasks performed by the hardware or the OS?
Transfer units
• Blocks or pages containing 2^n bytes
  – Always properly aligned
• If a block or a page contains 2^n bytes, the n LSBs of its start address will be all zeroes
Examples
• If block size is 4 words
  – Corresponds to 16 = 2^4 bytes
  – 4 LSBs of block address will be all zeroes
• If page size is 4 KB
  – Corresponds to 2^2×2^10 = 2^12 bytes
  – 12 LSBs of page address will be all zeroes
  – Remaining bits of address form the page number
Examples
[Figure: a 32-bit address with a 4 KB page size splits into a 20-bit page number and a 12-bit offset; the address of the first byte in a page ends in 12 zeroes.]
Consequence
• In a 32-bit architecture, we identify a block or a page of size 2^n bytes by the 32 – n MSBs of its address
• These MSBs are called the
  • Tag (caches)
  • Page number (virtual memory)
Placement policy
• Two extremes
  – Each block can only occupy a fixed location in the faster memory
    • Direct mapping (many caches)
  – Each page can occupy any location in the faster memory
    • Full associativity (virtual memory)
Direct mapping
• Assume
  – Cache has 2^m entries
  – Block size is 2^n bytes
  – a is the block address (with its n LSBs removed)
• The block will be placed at cache position a % 2^m
Consequence
• The tag identifying the cache block will be the start address of the block with its n + m LSBs removed
  – the original n LSBs because they are known to be all zeroes
  – the next m LSBs because they are equal to a % 2^m
Consequence
Block start address
  → remove the n LSBs (they are all zeroes) →
Block address
  → remove the m additional LSBs given by a % 2^m →
Tag
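A minimal C sketch of this decomposition, assuming a 32-bit byte address, 8-byte blocks (n = 3) and an 8-entry direct-mapped cache (m = 3), as in the figure that follows:

#include <stdint.h>
#include <stdio.h>

#define N 3   /* block size = 2^3 = 8 bytes  (assumption) */
#define M 3   /* cache size = 2^3 = 8 entries (assumption) */

int main(void) {
    uint32_t address = 0x12345678;                   /* example byte address        */
    uint32_t offset  = address & ((1u << N) - 1);    /* n LSBs: byte within block   */
    uint32_t block   = address >> N;                 /* block address, n LSBs removed */
    uint32_t index   = block & ((1u << M) - 1);      /* a % 2^m: cache position     */
    uint32_t tag     = block >> M;                   /* remaining 32-(n+m) MSBs     */
    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}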
A cache whose block size is 8 bytes
[Figure: a direct-mapped cache with 8-byte blocks and eight entries indexed 000 through 111; each entry holds a valid bit (Y/N), a tag (bits 31:6 of the address), and two data words.]
Fully associative solution
• Used in virtual memory systems
• Each page can occupy any free page frame in main memory
• Use a page table
  – Without the redundant first column:

  Page #   Frame #
  0        4
  1        7
  2        27
  3        44
  4        5
  …        …
Solutions with limited associativity
• A cache of size 2^m with associativity level k lets a given block occupy any of k possible locations in the cache
• Implementation looks very much like k caches of size 2^m/k put together
• All possible cache locations for a block have the same position a % (2^m/k) in each of the smaller caches
A set-associative cache with k=2
[Figure: a two-way set-associative cache; two banks of eight entries indexed 000 through 111, each entry holding a valid bit (Y/N), a tag (bits 31:5 of the address), and a block.]
Accessing an entry
• In a cache, use hardware to compute the possible cache position for the block containing the data
  – a % 2^m for a cache using direct mapping
  – a % (2^m/k) for a cache of associativity level k
• Then check whether the cache entry is valid using its valid bit
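As a hedged sketch of the position computation above (the parameter values in main are illustrative only):

#include <stdint.h>
#include <stdio.h>

/* Set number a block maps to in a k-way set-associative cache with
   num_blocks entries in total; k = 1 reduces to direct mapping. */
static uint32_t cache_set(uint32_t block_address, uint32_t num_blocks, uint32_t k) {
    uint32_t num_sets = num_blocks / k;    /* 2^m / k sets      */
    return block_address % num_sets;       /* a % (2^m / k)     */
}

int main(void) {
    uint32_t a = 0x2468ACF;                                   /* example block address */
    printf("direct mapped: position %u\n", cache_set(a, 8, 1)); /* a % 8 */
    printf("2-way:         position %u\n", cache_set(a, 8, 2)); /* a % 4 */
    return 0;
}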
Accessing an entry
• In a VM system, hardware checks the TLB to find the frame containing a given page number
• TLB entries contain
  – A page number (tag)
  – A frame number
  – A valid bit
  – A dirty bit
Accessing an entry
[TLB entry: page number | page frame number | valid and dirty bits]
• The valid bit indicates if the mapping is valid
• The dirty bit indicates whether we need to save the page contents when we expel the page
Accessing an entry
• If the page mapping is not in the TLB, we must consult the page table and update the TLB
  – Can be done by hardware or software
Realization
[Figure: a virtual address is split into a page number and an offset; the page table maps the page number (here, page 2) to a page frame number (here, frame 897); the physical address is the frame number followed by the unchanged offset.]
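A toy C sketch of this translation, assuming 4 KB pages and an in-memory array standing in for the page table (the page and frame numbers loosely follow the figure):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 12                     /* 4 KB pages (assumption)        */
#define NUM_PAGES   1024                   /* toy page-table size            */

static uint32_t page_table[NUM_PAGES];     /* page number -> frame number    */

uint32_t translate(uint32_t virtual_address) {
    uint32_t page   = virtual_address >> OFFSET_BITS;
    uint32_t offset = virtual_address & ((1u << OFFSET_BITS) - 1);
    uint32_t frame  = page_table[page];    /* a real MMU would also check a valid bit */
    return (frame << OFFSET_BITS) | offset;
}

int main(void) {
    page_table[2] = 897;                   /* page 2 -> frame 897, as in the figure */
    printf("0x%x\n", translate((2u << OFFSET_BITS) | 0x057));
    return 0;
}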
Handling cache misses
• Cache hardware fetches the missing block
• Often overwriting an existing entry
• Which one?
  – The one that occupies the same location, if the cache uses direct mapping
  – One of those that occupy the same set, if the cache is set-associative
Handling cache misses
• Before expelling a cache entry, we must
  – Check its dirty bit
  – Save its contents if the dirty bit is on
Handling page faults
• OS fetches the missing page
• Often overwriting an existing page
• Which one?
  – One that was not recently used
    • Selected by the page replacement policy
Handling page faults
• Before expelling a page, we must
  – Check its dirty bit
  – Save its contents if the dirty bit is on
Handling writes (I)
• Two ways to handle writes
  – Write through:
    • Each write updates both the cache and the main memory
  – Write back:
    • Writes are not propagated to main memory until the updated word is expelled from the cache
Handling writes (II)
[Figure: with write through, CPU writes go to the cache and to RAM at once; with write back, they go to the cache and are copied to RAM later.]
Pros and cons
• Write through:
  – Ensures that memory is always up to date
    • Expelled cache entries can simply be overwritten
• Write back:
  – Faster writes
  – Complicates the cache expulsion procedure
    • Must write back cache entries that have been modified in the cache
A better write through (I)
• Add a small write buffer to speed up the write performance of write-through caches
  – At least four words
• Holds modified data until they are written into main memory
  – Cache can proceed as soon as the data are written into the write buffer
A better write through (II)
[Figure: with plain write through, the CPU waits for RAM on every write; with the write buffer, the CPU writes into the cache and the buffer, and the buffer drains into RAM in the background.]
Designing RAM to support caches
• RAM is connected to the CPU through a "bus"
  – Its clock rate is much slower than the CPU clock rate
• Assume that a RAM access takes
  – 1 bus clock cycle to send the address
  – 15 bus clock cycles to initiate a read
  – 1 bus clock cycle to send a word of data
Designing RAM to support caches
• Assume
  – Cache block size is 4 words
  – One one-word bank of DRAM
• Fetching a cache block would take 1 + 4×15 + 4×1 = 65 bus clock cycles
  – Transfer rate is 0.25 byte/bus cycle
• Awful!
Designing RAM to support caches
• Could
  – Have an interleaved memory organization
  – Four one-word banks of DRAM
  – A 32-bit bus
[Figure: four one-word RAM banks (bank 0 through bank 3) sharing a 32-bit bus.]
Designing RAM to support caches
• Can do the 4 accesses in parallel
• Must still transmit the block 32 bits at a time
• Fetching a cache block would take 1 + 15 + 4×1 = 20 bus clock cycles
  – Transfer rate is 0.80 byte/bus cycle
• Even better
• Much cheaper than having a 64-bit bus
PERFORMANCE ISSUES
Memory stalls
• Can divide CPU time into
  – N_EXEC clock cycles spent executing instructions
  – N_MEM_STALLS cycles spent waiting for memory accesses
• We have
  CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE
Memory stalls
• We assume that
  – cache access times can be neglected
  – most CPU cycles spent waiting for memory accesses are caused by cache misses
Global impact
• We have
  N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
• and also
  N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) × Cache miss penalty
Example
• Miss rate of the instruction cache is 2 percent
• Miss rate of the data cache is 5 percent
• In the absence of memory stalls, each instruction would take 2 cycles
• Miss penalty is 100 cycles
• 40 percent of instructions access the main memory
• How many cycles are lost due to cache misses?
Solution (I)
• Impact of instruction cache misses
  0.02×100 = 2 cycles/instruction
• Impact of data cache misses
  0.40×0.05×100 = 2 cycles/instruction
• Total impact of cache misses
  2 + 2 = 4 cycles/instruction
Solution (II)
• Average number of cycles per instruction
  2 + 4 = 6 cycles/instruction
• Fraction of time wasted
  4/6 = 67 percent
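The same computation, as a small C sketch using the numbers of the example:

#include <stdio.h>

int main(void) {
    double base_cpi        = 2.0;    /* cycles/instruction without stalls          */
    double i_miss_rate     = 0.02;   /* instruction cache miss rate                */
    double d_miss_rate     = 0.05;   /* data cache miss rate                       */
    double mem_access_frac = 0.40;   /* fraction of instructions accessing memory  */
    double miss_penalty    = 100.0;  /* cycles                                     */

    double stall_cpi = i_miss_rate * miss_penalty
                     + mem_access_frac * d_miss_rate * miss_penalty;   /* 2 + 2 = 4 */
    double cpi = base_cpi + stall_cpi;                                 /* 6         */
    printf("stall cycles/instr = %.1f, CPI = %.1f, wasted = %.0f%%\n",
           stall_cpi, cpi, 100.0 * stall_cpi / cpi);
    return 0;
}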
Average memory access time
• Some authors call it AMAT
  T_AVERAGE = T_CACHE + f × T_MISS
  where f is the cache miss rate
• Times can be expressed
  – In nanoseconds
  – In number of cycles
Example
• A cache has a hit rate of 96 percent
• Accessing data
  – In the cache requires one cycle
  – In the memory requires 100 cycles
• What is the average memory access time?
Solution
• Miss rate = 1 – Hit rate = 0.04
• Applying the formula
  T_AVERAGE = 1 + 0.04×100 = 5 cycles
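A one-line application of the AMAT formula in C, using the example's numbers:

#include <stdio.h>

int main(void) {
    double t_cache      = 1.0;    /* cycles to access the cache          */
    double miss_rate    = 0.04;   /* 1 - 0.96 hit rate                   */
    double miss_penalty = 100.0;  /* cycles to access the memory         */

    double amat = t_cache + miss_rate * miss_penalty;  /* T_AVERAGE = T_CACHE + f * T_MISS */
    printf("AMAT = %.1f cycles\n", amat);              /* prints 5.0 */
    return 0;
}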
In other words
It's the miss rate, stupid!
Improving cache hit rate
• Two complementary techniques
  – Using set-associative caches
    • Must check the tags of all blocks with the same index value
      – Slower
    • Have fewer collisions
      – Fewer misses
  – Using a cache hierarchy
A cache hierarchy
• Topmost cache
  – Optimized for speed, not miss rate
  – Rather small
  – Uses a small block size
• As we go down the hierarchy
  – Cache sizes increase
  – Block sizes increase
  – Cache associativity level increases
Example
• Cache miss rate per instruction is 3 percent
• In the absence of memory stalls, each instruction would take one cycle
• Cache miss penalty is 100 ns
• Clock rate is 4 GHz
• How many cycles are lost due to cache misses?
Solution (I)
• Duration of a clock cycle
  1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
• Cache miss penalty
  100 ns = 400 cycles
• Total impact of cache misses
  0.03×400 = 12 cycles/instruction
Solution (II)
• Average number of cycles per instruction
  1 + 12 = 13 cycles/instruction
• Fraction of time wasted
  12/13 = 92 percent
A very good case for hardware multithreading
Example (cont'd)
• How much faster would the processor be if we added an L2 cache that
  – Has a 5 ns access time
  – Would reduce the miss rate to main memory to one percent?
Solution (I)
• L2 cache access time
  5 ns = 20 cycles
• Impact of cache misses per instruction
  L1 cache misses + L2 cache misses = 0.03×20 + 0.01×400 = 0.6 + 4.0 = 4.6 cycles/instruction
• Average number of cycles per instruction
  1 + 4.6 = 5.6 cycles/instruction
Solution (II)
• Fraction of time wasted
  4.6/5.6 = 82 percent
• CPU speedup
  13/5.6 ≈ 2.3
Problem
• Redo the second part of the example assuming that the secondary cache
  – Has a 3 ns access time
  – Can reduce the miss rate to main memory to one percent
Solution
• Fraction of time wasted
  86 percent
• CPU speedup
  1.22
The new L2 cache, with a lower access time but a higher miss rate, performs much worse than the first L2 cache
Example
• A virtual memory has a page fault rate of 10^-4 faults per memory access
• Accessing data
  – In the memory requires 100 ns
  – On disk requires 5 ms
• What is the average memory access time?
  T_avg = 100 ns + 10^-4 × 5 ms = 600 ns
The cost of a page fault
• Let
  – T_m be the main memory access time
  – T_d the disk access time
  – f the page fault rate
  – T_a the average access time of the VM
  T_a = (1 – f) T_m + f (T_m + T_d) = T_m + f T_d
Example
• Assume T_m = 50 ns and T_d = 5 ms

  f       Mean memory access time
  10^-3   50 ns + 5 ms/10^3 = 5,050 ns
  10^-4   50 ns + 5 ms/10^4 = 550 ns
  10^-5   50 ns + 5 ms/10^5 = 100 ns
  10^-6   50 ns + 5 ms/10^6 = 55 ns
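A small C sketch that reproduces the table above from the formula T_a = T_m + f T_d:

#include <stdio.h>

int main(void) {
    double t_m = 50e-9;   /* main memory access time: 50 ns */
    double t_d = 5e-3;    /* disk access time: 5 ms         */
    double rates[] = { 1e-3, 1e-4, 1e-5, 1e-6 };

    for (int i = 0; i < 4; i++) {
        double t_a = t_m + rates[i] * t_d;   /* Ta = Tm + f * Td */
        printf("f = %g  ->  Ta = %g ns\n", rates[i], t_a * 1e9);
    }
    return 0;
}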
In other words
It's the page fault rate, stupid!
Locality principle (I)
• A process that would access its pages in a totally unpredictable fashion would perform very poorly in a VM system unless all its pages are in main memory
Locality principle (II)
• Process P randomly accesses a very large array consisting of n pages
• If m of these n pages are in main memory, the page fault frequency of the process will be (n – m)/n
• Must switch to another algorithm
First problem
• A virtual memory system has
  – 32-bit addresses
  – 4 KB pages
• What are the sizes of the
  – Page number field?
  – Offset field?
Solution (I)
• Step 1: Convert the page size to a power of 2
  4 KB = 2^12 B
• Step 2: The exponent is the length of the offset field
Solution (II)
• Step 3: Size of page number field = Address size – Offset size
  Here 32 – 12 = 20 bits
• Answer: 12 bits for the offset and 20 bits for the page number
MEMORY PROTECTION
Objective
• Unless we have an isolated single-user system, we must prevent users from
  – Accessing
  – Deleting
  – Modifying
  the address spaces of other processes, including the kernel
Memory protection (I)
• VM ensures that processes cannot access page frames that are not referenced in their page table
• Can refine control by distinguishing among
  – Read access
  – Write access
  – Execute access
• Must also prevent processes from modifying their own page tables
Dual-mode CPU
• Requires a dual-mode CPU
• Two CPU modes
  – Privileged mode (or executive mode), which allows the CPU to execute all instructions
  – User mode, which allows the CPU to execute only safe, unprivileged instructions
• The state of the CPU is determined by a special bit
Switching between states
• User mode is the default mode for all programs
  – Only the kernel can run in supervisor mode
• Switching from user mode to supervisor mode is done through an interrupt
  – Safe because the jump address is at a well-defined location in main memory
Memory protection (II)
• Has additional advantages:
  – Prevents programs from corrupting the address spaces of other programs
  – Prevents programs from crashing the kernel
    • Not true for device drivers, which are inside the kernel
• Required part of any multiprogramming system
INTEGRATING CACHES AND VM
The problem
• In a VM system, each byte of memory has two addresses
  – A virtual address
  – A physical address
• Should cache tags contain virtual addresses or physical addresses?
Discussion
• Using virtual addresses
  – Directly available
  – Bypass the TLB
  – Cache entries are specific to a given address space
  – Must flush the caches when the OS selects another process
• Using physical addresses
  – Must access the TLB first
  – Cache entries are not specific to a given address space
  – Do not have to flush the caches when the OS selects another process
The best solution
• Let the cache use physical addresses
  – No need to flush the cache at each context switch
  – TLB access delay is tolerable
VIRTUAL MACHINES
Key idea
• Let different operating systems run at the same time on a single computer
  – Windows, Linux and Mac OS
  – A real-time OS and a conventional OS
  – A production OS and a new OS being tested
How it is done
• A hypervisor (VM monitor) defines two or more virtual machines
• Each virtual machine has
  – Its own virtual CPU
  – Its own virtual physical memory
  – Its own virtual disk(s)
Two virtual machines
[Figure: the hypervisor runs in privileged mode; each virtual machine, consisting of a VM kernel and its user processes, runs in user mode.]
Translating a block address
[Figure: a VM kernel asks to access block (x, y) of its virtual disk; the hypervisor translates this into block (v, w) of the actual disk and performs the access.]
Handling I/Os
• Difficult task because
  – Wide variety of devices
  – Some devices may be shared among several VMs
    • Printers
    • Shared disk partitions
      – Want to let Linux and Windows access the same files
Virtual Memory Issues
• Each VM kernel manages its own memory
  – Its page tables map program virtual addresses into pseudo-physical addresses
    • It treats these addresses as physical addresses
The dilemma
[Figure: user process A and its VM kernel believe that page 735 of process A is stored in page frame 435; the hypervisor knows it actually resides in page frame 993 of the real RAM.]
The solution (I)
• Address translation must remain fast!
  – The hypervisor lets each VM kernel manage its own page tables but does not use them
    • They contain bogus mappings!
  – It maintains instead its own shadow page tables with the correct mappings
    • Used to handle TLB misses
The solution (II)
• To keep its shadow page tables up to date, the hypervisor must track any changes made by the VM kernels
• It marks the VM kernels' page tables read-only
Nastiest Issue
• The whole VM approach assumes that a kernel executing in user mode will behave exactly like a kernel executing in privileged mode
• Not true for all architectures!
  – Intel x86 POPF (pop flags) instruction
  – …
Solutions
1. Modify the instruction set and eliminate instructions like POPF
   • IBM redesigned the instruction set of their 360 series for the 370 series
2. Mask the problem through clever software
   • Dynamic "binary translation" when direct execution of the code could not work (VMware)
CACHE CONSISTENCY
The problem
• Specific to architectures with
  – Several processors sharing the same main memory
  – Multicore architectures
• Each core/processor has its own private cache
  – A must for performance
• Happens when the same data are present in two or more private caches
An example (I)
[Figure: two CPUs share the same RAM; each private cache holds x = 0.]
An example (II)
[Figure: one CPU increments x, so its cache now holds x = 1; the other CPU still assumes x = 0.]
An example
[Figure: the two CPUs update x one after the other; their caches end up with x = ? because the final value depends on the order in which the updates are applied. Both CPUs must apply the two updates in the same order.]
Rules
1. Whenever a processor accesses a variable, it always gets the value stored by the processor that updated that variable last, provided the updates are sufficiently separated in time
2. A processor accessing a variable sees all updates applied to that variable in the same order
   – No compromise is possible here
A realization: Snoopy caches
• All caches are linked to the main memory through a shared bus
  – All caches can observe the writes performed by the other caches
• When a cache notices that another cache performs a write to a memory location that it holds, it invalidates the corresponding cache block
An example (I)
[Figure: the first CPU's cache holds x = 2; the second CPU's cache fetches x = 2 from RAM.]
An example (II)
[Figure: both caches now hold x = 2, since the second cache has also fetched x.]
An example (III)
[Figure: the first CPU resets x to 0, so its cache holds x = 0; the other cache still holds x = 2.]
An example (IV)
[Figure: the first cache performs a write-through of x = 0 to RAM; the second cache detects the write-through on the shared bus and invalidates its copy of x.]
An example (V)
[Figure: when the second CPU wants to access x again, its cache gets the correct value x = 0 from RAM.]
A last correctness condition
• Caches cannot reorder their memory updates
  – The cache-to-RAM write buffer must be FIFO
    • First in, first out
Miscellaneous fallacies
• Segmented address spaces
  – Address is a segment number + an offset in the segment
  – Programmers hate them
• Ignoring virtual memory behavior when accessing large two-dimensional arrays
• Believing that you can virtualize any CPU architecture
DEPENDABILITY
Reliability and Availability
• Reliability
  – Probability R(t) that the system will be up at time t if it was up at time t = 0
• Availability
  – Fraction of time the system is up
• Reliability and availability do not measure the same thing!
MTTF, MTTR and MTBF
• MTTF is the mean time to failure
• MTTR is the mean time to repair
• 1/MTTF is the failure rate
• MTBF, the mean time between failures, is
  MTBF = MTTF + MTTR
Reliability
• As a first approximation
R(t) = exp(–t/MTTF)
– Not true if failure rate varies over time
Availability
• Measured by
(MTTF)/(MTTF + MTTR) = MTTF/MTBF
– MTTR is very important
Example
• A server crashes on average once a month
• When this happens, it takes six hours to reboot it
• What is the server availability?
Solution
• MTBF = 30 days
• MTTR = 6 hours = 0.25 day
• MTTF = 29.75 days
• Availability is 29.75/30 = 99.2 %
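A small C check of this computation, assuming (as the problem states) a crash every 30 days and a six-hour repair time:

#include <stdio.h>

int main(void) {
    double mtbf = 30.0;          /* days between crashes           */
    double mttr = 6.0 / 24.0;    /* six hours to reboot, in days   */
    double mttf = mtbf - mttr;   /* MTBF = MTTF + MTTR             */

    double availability = mttf / mtbf;                          /* MTTF / (MTTF + MTTR) */
    printf("availability = %.1f%%\n", 100.0 * availability);    /* about 99.2% */
    return 0;
}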
Example
• A disk drive has an MTTF of 20 years
• What is the probability that the data it contains will not be lost over a period of five years?
Example
• A disk farm contains 100 disks whose MTTF is 20 years.
• What is the probability that no data will be lost over a period of five years?
Solution
• The aggregate failure rate of the disk farm is 100×1/20 = 5 failures per year
• The mean time to failure of the farm is 1/5 year
• We apply the formula
  R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10^-11
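The same computation in C (link with the math library, e.g. -lm):

#include <math.h>
#include <stdio.h>

int main(void) {
    double mttf_disk = 20.0;                 /* years                              */
    int    n_disks   = 100;
    double mttf_farm = mttf_disk / n_disks;  /* 0.2 year: failure rates add up     */
    double t         = 5.0;                  /* years                              */

    double r = exp(-t / mttf_farm);          /* R(t) = exp(-t/MTTF)                */
    printf("R(5 years) = %g\n", r);          /* about 1.4e-11                      */
    return 0;
}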
RAID Arrays
Today’s Motivation
• We use RAID today for
  – Increasing disk throughput by allowing parallel access
  – Eliminating the need to make disk backups
    • Disks are too big to be backed up in an efficient fashion
RAID LEVEL 0
• No replication
• Advantages:
  – Simple to implement
  – No overhead
• Disadvantage:
  – If the array has n disks, its failure rate is n times the failure rate of a single disk
[Figure: RAID levels 0 and 1; RAID level 1 adds mirror disks.]
RAID LEVEL 1
• Mirroring:
  – Two copies of each disk block
• Advantages:
  – Simple to implement
  – Fault-tolerant
• Disadvantage:
  – Requires twice the disk capacity of normal file systems
RAID LEVEL 2
• Instead of duplicating the data blocks, we use an error correction code
• Very bad idea because disk drives either work correctly or do not work at all
  – The only possible errors are omission errors
  – We need an omission correction code
    • A parity bit is enough to correct a single omission
[Figure: RAID level 2 uses several check disks; RAID level 3 uses a single parity disk.]
RAID LEVEL 3
• Requires N+1 disk drives
  – N drives contain data (1/N of each data block)
    • Block b[k] is now partitioned into N fragments b[k,1], b[k,2], ..., b[k,N]
  – The parity drive contains the exclusive or (XOR) of these N fragments
    p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
How parity works?
• Truth table for XOR (same as parity)

  A  B  A⊕B
  0  0   0
  0  1   1
  1  0   1
  1  1   0
Recovering from a disk failure
• A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1

  D0  D1  P
  0   0   0
  0   1   1
  1   0   1
  1   1   0

  D1⊕P = D0   D0⊕P = D1
  0           0
  0           1
  1           0
  1           1
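A minimal C illustration of parity-based recovery; the data byte values are arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t d0 = 0xA5, d1 = 0x3C;      /* two data fragments (arbitrary values)  */
    uint8_t p  = d0 ^ d1;              /* parity fragment                         */

    /* If disk D0 fails, its contents can be rebuilt from D1 and P */
    uint8_t rebuilt_d0 = d1 ^ p;
    printf("recovered D0 = 0x%02X (original 0x%02X)\n", rebuilt_d0, d0);
    return 0;
}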
How RAID level 3 works (I)
• Assume we have N + 1 disks
• Each block is partitioned into N equal chunks
[Figure: a block split into four chunks (N = 4 in the example).]
How RAID level 3 works (II)
• XOR the data chunks to compute the parity chunk
• Each chunk is written to a separate disk
[Figure: the four data chunks plus the parity chunk each go to a different disk.]
How RAID level 3 works (III)
• Each read/write involves all disks in the RAID array
  – Cannot do two or more reads/writes in parallel
  – Performance of the array is no better than that of a single disk
RAID LEVEL 4 (I)
• Requires N+1 disk drives
  – N drives contain data
    • Individual blocks, not chunks
  – Blocks with the same disk address form a stripe
[Figure: a stripe of data blocks plus its parity block.]
RAID LEVEL 4 (II)
• The parity drive contains the exclusive or of the N blocks in the stripe
  p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• The parity block now reflects the contents of several blocks!
• Can now do parallel reads/writes
RAID levels 4 and 5
[Figure: RAID level 4 keeps all parity blocks on a dedicated drive, which becomes a bottleneck; RAID level 5 spreads the parity blocks over all drives.]
RAID LEVEL 5
• The single parity drive of RAID level 4 is involved in every write
  – Will limit parallelism
• RAID level 5 distributes the parity blocks among the N+1 drives
  – Much better
The small write problem
• Specific to RAID level 5
• Happens when we want to update a single block
  – The block belongs to a stripe
  – How can we compute the new value of the parity block?
[Figure: a stripe consisting of data blocks b[k], b[k+1], b[k+2], ... and parity block p[k].]
First solution
• Read the values of the N-1 other blocks in the stripe
• Recompute
  p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
  – N-1 reads
  – 2 writes (new block and new parity block)
Second solution
• Assume we want to update block b[m]
• Read the old values of b[m] and of the parity block p[k]
• Compute
  new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
  – 2 reads (old values of the block and of the parity block)
  – 2 writes (new block and new parity block)
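A small C sketch of this second solution, using single bytes as stand-ins for blocks and checking the result against a full recomputation of the parity:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t stripe[4] = { 0x11, 0x22, 0x33, 0x44 };         /* data blocks (toy values)     */
    uint8_t parity = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3];

    /* Update block 2 using the "second solution": 2 reads, 2 writes */
    uint8_t old_block = stripe[2], new_block = 0x55;
    parity    = new_block ^ old_block ^ parity;             /* new p = new b ^ old b ^ old p */
    stripe[2] = new_block;

    /* Check against the first solution (full recomputation) */
    uint8_t check = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3];
    printf("parity ok: %s\n", parity == check ? "yes" : "no");
    return 0;
}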
RAID level 6 (I)
• Not part of the original proposal
  – Two check disks
  – Tolerates two disk failures
  – More complex updates
RAID level 6 (II)
• Has become more popular as disks are becoming
  – Bigger
  – More vulnerable to irrecoverable read errors
• The most frequent cause of RAID level 5 array failures is
  – An irrecoverable read error occurring while the contents of a failed disk are being reconstituted
CONNECTING I/O DEVICES
Busses
• Connecting computer subsystems with each other was traditionally done through busses
• A bus is a shared communication link connecting multiple devices
• Busses transmit several bits at a time
  – Parallel busses
Examples
• Processor-memory busses
  – Connect the CPU with the memory modules
  – Short and high-speed
• I/O busses
  – Longer
  – Wide range of data bandwidths
  – Connect to memory through the processor-memory bus or a backplane bus
Synchronous busses
• Include a clock in the control lines
• Bus protocols are expressed as actions to be taken at each clock pulse
• Have very simple protocols
• Disadvantages
  – All bus devices must run at the same clock rate
  – Due to clock skew issues, cannot be both fast and long
Asynchronous busses
• Have no clock
• Can accommodate a wide variety of devices
• Have no clock skew issues
• Require a handshaking protocol before any transmission
  – Implemented with extra control lines
Advantages of busses
• Cheap
  – One bus can link many devices
• Flexible
  – Can add devices
Disadvantages of busses
• Shared devices
  – Can become bottlenecks
• Hard to run many parallel lines at high clock speeds
New trend
• Away from parallel shared buses
• Towards serial point-to-point switched interconnections
  – Serial
    • One bit at a time
  – Point-to-point
    • Each line links a specific device to another specific device
x86 bus organization
• The processor connects to peripherals through two chips (bridges)
  – North Bridge
  – South Bridge
[Figure: the CPU connects to the North Bridge, which in turn connects to the South Bridge.]
North bridge
• Essentially a DMA controller
  – Lets the disk controller access main memory without any intervention of the CPU
• Connects the CPU to
  – Main memory
  – An optional graphics card
  – The South Bridge
South Bridge
• Connects North bridge to a wide variety of I/O busses
Communicating with I/O devices
• Two solutions– Memory-mapped I/O– Special I/O instructions
Memory mapped I/O
• A portion of the address space is reserved for I/O operations
  – Writes to any of these addresses are interpreted as I/O commands
  – Reading from these addresses gives access to
    • The error bit
    • The I/O completion bit
    • The data being read
Memory mapped I/O
• User processes cannot access these addresses
  – Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion
Dedicated I/O instructions
• Privileged instructions that cannot be executed by user processes
  – Only the kernel can issue them
• Prevents user processes from accessing the disk in an uncontrolled fashion
Polling
• Simplest way for an I/O device to communicate with the CPU
• The CPU periodically checks the status of pending I/O operations
  – High CPU overhead
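A hedged C sketch combining memory-mapped I/O with polling; the register addresses and the READY bit are hypothetical, and code like this only makes sense inside a kernel or device driver:

#include <stdint.h>

/* Hypothetical device register addresses; real values are platform-specific. */
#define DEV_STATUS ((volatile uint32_t *)0xFFFF0000u)
#define DEV_DATA   ((volatile uint32_t *)0xFFFF0004u)
#define READY_BIT  0x1u

/* Busy-wait until the device reports completion, then read one word of data.
   These addresses are not accessible to user processes. */
uint32_t read_device_word(void) {
    while ((*DEV_STATUS & READY_BIT) == 0)
        ;                                  /* polling the I/O completion bit */
    return *DEV_DATA;
}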
I/O completion interrupts
• Notify the CPU that an I/O operation has completed
• Allow the CPU to do something else while waiting for the completion of an I/O operation
  – Multiprogramming
• I/O completion interrupts are processed by the CPU between instructions
  – No internal instruction state to save
Interrupt levels
• See previous chapter
Direct memory access
• DMA• Lets disk controller access main memory w/o
any intervention of the CPU
DMA and virtual memory
• A single DMA transfer may cross page boundaries with
  – One page being in main memory
  – One missing page
Solutions
• Make DMA work with virtual addresses
  – The issue is then dealt with by the virtual memory subsystem
• Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries
An Example
[Figure: a DMA transfer spanning two pages is broken into two smaller DMA transfers, one per page.]
DMA and cache hierarchy
• Three approaches for handling temporary inconsistencies between caches and main memory
Solutions
1. Route all DMA accesses through the cache
   – Bad solution
2. Have the OS selectively
   – Invalidate affected cache entries when performing a read
   – Force an immediate flush of dirty cache entries when performing a write
3. Have specific hardware do the same
Benchmarking I/O
Benchmarks
• Specific benchmarks for
  – Transaction processing
    • Emphasis on speed and graceful recovery from failures
    • Atomic transactions: all-or-nothing behavior
An important observation
• Very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput
  – Unless we sequentially access very large ranges of data
    • 512 KB and more
Major fallacies
• Since the rated MTTFs of disk drives exceed one million hours, a disk can last more than 100 years
  – The MTTF expresses the failure rate during the disk's actual lifetime
• Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature
  – They are actually up to ten times higher
Major fallacies
• Neglecting to do end-to-end checks
  – …
• Using magnetic tapes to back up disks
  – Tape formats can quickly become obsolete
  – Disk bit densities have grown much faster than tape data densities
WRITING PARALLEL PROGRAMS
Overview
• Some problems are embarrassingly parallel
  – Many computer graphics tasks
  – Brute force searches in cryptography or password guessing
• Much more difficult for other applications
  – Communication overhead among sub-tasks
  – Amdahl's law
  – Balancing the load
Amdahl's Law
• Assume a sequential process takes
  – t_p seconds to perform operations that could be performed in parallel
  – t_s seconds to perform purely sequential operations
• The maximum speedup will be
  (t_p + t_s)/t_s
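A tiny C illustration of the formula; the values of t_p and t_s are assumptions chosen for the example:

#include <stdio.h>

int main(void) {
    double t_p = 90.0;   /* seconds of parallelizable work (assumed)    */
    double t_s = 10.0;   /* seconds of purely sequential work (assumed) */

    double max_speedup = (t_p + t_s) / t_s;            /* with arbitrarily many processors */
    printf("maximum speedup = %.1f\n", max_speedup);   /* 10.0 */
    return 0;
}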
Balancing the load
• Must ensure that workload is equally divided among all the processors
• Worst case is when one of the processors does much more work than all others
A last issue
• Humans like to address issues one after the other
  – We have meeting agendas
  – We do not like to be interrupted
  – We write sequential programs
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
[Figure: several processing units, each with its own cache, connected through an interconnection network to a shared RAM and I/O.]
Shared memory multiprocessor
• Can offer
  – Uniform memory access to all processors (UMA)
    • Easiest to program
  – Non-uniform memory access to all processors (NUMA)
    • Can scale up to larger sizes
    • Offers faster access to nearby memory
Computer clusters
[Figure: several nodes, each consisting of a processing unit, a cache and its own RAM, connected through an interconnection network.]
Computer clusters
• Very easy to assemble
• Can take advantage of high-speed LANs
  – Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing
HARDWARE MULTITHREADING
General idea
• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
  – The increased cost of cache misses
Implementation
• Entirely controlled by the hardware
  – Unlike multiprogramming
• Requires a processor capable of
  – Keeping track of the state of each thread
    • One set of registers (including the PC) for each concurrent thread
  – Quickly switching among concurrent threads
Approaches
• Fine-grained multithreading:
  – Switches between threads for each instruction
  – Provides the highest throughputs
  – Slows down the execution of individual threads
Approaches
• Coarse-grained multithreading:
  – Switches between threads whenever a long stall is detected
  – Easier to implement
  – Cannot eliminate all stalls
Approaches
• Simultaneous multithreading:
  – Takes advantage of the ability of modern hardware to work in parallel on instructions from different threads
  – Best solution
ALPHABET SOUP
Classification
• SISD:
  – Single instruction, single data
  – Conventional uniprocessor architecture
• MIMD:
  – Multiple instructions, multiple data
  – Conventional multiprocessor architecture
Classification
• SIMD:
  – Single instruction, multiple data
  – Performs the same operation on a set of similar data
    • Think of adding two vectors

  for (i = 0; i < VECSIZE; i++)
      sum[i] = a[i] + b[i];
PERFORMANCE ISSUES
Roofline model
• Takes into account
  – Memory bandwidth
  – Floating-point performance
• Introduces arithmetic intensity
  – Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  – Measured in FLOPs/byte
Roofline model
• Attainable GFLOPS/s =
  Min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
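A small C sketch of the roofline formula; the peak bandwidth and peak floating-point figures are illustrative assumptions:

#include <stdio.h>

/* min(peak memory bandwidth * arithmetic intensity, peak floating-point performance) */
double attainable_gflops(double arithmetic_intensity,
                         double peak_bw_gbs, double peak_gflops) {
    double bw_bound = peak_bw_gbs * arithmetic_intensity;    /* memory-bound region   */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;  /* compute-bound plateau */
}

int main(void) {
    double peak_bw = 16.0, peak_fp = 16.0;   /* GB/s and GFLOPS/s (assumed values) */
    for (double ai = 0.25; ai <= 4.0; ai *= 2)
        printf("AI = %.2f -> %.1f GFLOPS/s\n", ai, attainable_gflops(ai, peak_bw, peak_fp));
    return 0;
}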
Roofline model
[Figure: attainable GFLOPS/s as a function of arithmetic intensity; the flat roof is the peak floating-point performance, while at low arithmetic intensity the attainable floating-point performance is limited by memory bandwidth.]