THIRD REVIEW SESSION
Jehan-François Pâris, May 5, 2010
MATERIALS (I)
• Memory hierarchies:
  – Caches
  – Virtual memory
  – Protection
  – Virtual machines
  – Cache consistency
MATERIALS (II)
• I/O Operations
  – More about disks
  – I/O operation implementation:
    • Busses
    • Memory-mapped I/O
    • Specific I/O instructions
  – RAID organizations
MATERIALS (III)
• Parallel Architectures
  – Shared memory multiprocessors
  – Computer clusters
  – Hardware multithreading
  – SISD, SIMD, MIMD, …
  – Roofline performance model
CACHING AND VIRTUAL MEMORY
Common objective
• Make a combination of
  – Small, fast and expensive memory
  – Large, slow and cheap memory
  look like
  – A single large and fast memory
• Fetch policy is fetch on demand
Questions to ask
• What are the transfer units?
• How are they placed in the faster memory?
• How are they accessed?
• How do we handle misses?
• How do we implement writes?
and more generally
• Are these tasks performed by the hardware or the OS?
Transfer units
• Blocks or pages containing 2^n bytes
  – Always properly aligned
• If a block or a page contains 2^n bytes, the n LSBs of its start address will be all zeroes
Examples
• If block size is 4 words
  – Corresponds to 16 = 2^4 bytes
  – 4 LSBs of block address will be all zeroes
• If page size is 4 KB
  – Corresponds to 2^2×2^10 = 2^12 bytes
  – 12 LSBs of page address will be all zeroes
  – Remaining bits of address form the page number
Examples
[Figure: a 32-bit address with a 4 KB page size splits into a 20-bit page number and a 12-bit offset; the address of the first byte in a page ends in 12 zeroes.]
Consequence
• In a 32-bit architecture, we identify a block or a page of size 2^n bytes by the 32 – n MSBs of its address
• These MSBs are called the
  • Tag (caches)
  • Page number (virtual memory)
Placement policy
• Two extremes
  – Each block can only occupy a fixed location in the faster memory
    • Direct mapping (many caches)
  – Each page can occupy any location in the faster memory
    • Full associativity (virtual memory)
Direct mapping
• Assume
  – Cache has 2^m entries
  – Block size is 2^n bytes
  – a is the block address (with its n LSBs removed)
• The block will be placed at cache position a % 2^m
Consequence
• The tag identifying the cache block will be the start address of the block with its n + m LSBs removed
  – the original n LSBs because they are known to be all zeroes
  – the next m LSBs because they are equal to a % 2^m
Consequence
Block start address
  → remove the n LSBs (they are all zeroes) →
Block address
  → remove the m additional LSBs given by a % 2^m →
Tag
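A minimal C sketch of this decomposition, assuming a 32-bit byte address, 8-byte blocks (n = 3) and an 8-entry direct-mapped cache (m = 3), as in the figure that follows:

#include <stdint.h>
#include <stdio.h>

#define N 3   /* block size = 2^3 = 8 bytes  (assumption) */
#define M 3   /* cache size = 2^3 = 8 entries (assumption) */

int main(void) {
    uint32_t address = 0x12345678;                   /* example byte address        */
    uint32_t offset  = address & ((1u << N) - 1);    /* n LSBs: byte within block   */
    uint32_t block   = address >> N;                 /* block address, n LSBs removed */
    uint32_t index   = block & ((1u << M) - 1);      /* a % 2^m: cache position     */
    uint32_t tag     = block >> M;                   /* remaining 32-(n+m) MSBs     */
    printf("offset=%u index=%u tag=0x%x\n", offset, index, tag);
    return 0;
}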
A cache whose block size is 8 bytes
[Figure: a direct-mapped cache with 8-byte blocks and eight entries indexed 000 through 111; each entry holds a valid bit (Y/N), a tag (bits 31:6 of the address), and two data words.]
Fully associative solution
• Used in virtual memory systems
• Each page can occupy any free page frame in main memory
• Use a page table
  – Without the redundant first column:

  Page #   Frame #
  0        4
  1        7
  2        27
  3        44
  4        5
  …        …
Solutions with limited associativity
• A cache of size 2^m with associativity level k lets a given block occupy any of k possible locations in the cache
• Implementation looks very much like k caches of size 2^m/k put together
• All possible cache locations for a block have the same position a % (2^m/k) in each of the smaller caches
A set-associative cache with k=2
[Figure: a two-way set-associative cache; two banks of eight entries indexed 000 through 111, each entry holding a valid bit (Y/N), a tag (bits 31:5 of the address), and a block.]
Accessing an entry
• In a cache, use hardware to compute the possible cache position for the block containing the data
  – a % 2^m for a cache using direct mapping
  – a % (2^m/k) for a cache of associativity level k
• Then check whether the cache entry is valid using its valid bit
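As a hedged sketch of the position computation above (the parameter values in main are illustrative only):

#include <stdint.h>
#include <stdio.h>

/* Set number a block maps to in a k-way set-associative cache with
   num_blocks entries in total; k = 1 reduces to direct mapping. */
static uint32_t cache_set(uint32_t block_address, uint32_t num_blocks, uint32_t k) {
    uint32_t num_sets = num_blocks / k;    /* 2^m / k sets      */
    return block_address % num_sets;       /* a % (2^m / k)     */
}

int main(void) {
    uint32_t a = 0x2468ACF;                                   /* example block address */
    printf("direct mapped: position %u\n", cache_set(a, 8, 1)); /* a % 8 */
    printf("2-way:         position %u\n", cache_set(a, 8, 2)); /* a % 4 */
    return 0;
}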
Accessing an entry
• In a VM system, hardware checks the TLB to find the frame containing a given page number
• TLB entries contain
  – A page number (tag)
  – A frame number
  – A valid bit
  – A dirty bit
Accessing an entry
[TLB entry: page number | page frame number | valid and dirty bits]
• The valid bit indicates if the mapping is valid
• The dirty bit indicates whether we need to save the page contents when we expel the page
Accessing an entry
• If the page mapping is not in the TLB, we must consult the page table and update the TLB
  – Can be done by hardware or software
Realization
[Figure: a virtual address is split into a page number and an offset; the page table maps the page number (here, page 2) to a page frame number (here, frame 897); the physical address is the frame number followed by the unchanged offset.]
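A toy C sketch of this translation, assuming 4 KB pages and an in-memory array standing in for the page table (the page and frame numbers loosely follow the figure):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 12                     /* 4 KB pages (assumption)        */
#define NUM_PAGES   1024                   /* toy page-table size            */

static uint32_t page_table[NUM_PAGES];     /* page number -> frame number    */

uint32_t translate(uint32_t virtual_address) {
    uint32_t page   = virtual_address >> OFFSET_BITS;
    uint32_t offset = virtual_address & ((1u << OFFSET_BITS) - 1);
    uint32_t frame  = page_table[page];    /* a real MMU would also check a valid bit */
    return (frame << OFFSET_BITS) | offset;
}

int main(void) {
    page_table[2] = 897;                   /* page 2 -> frame 897, as in the figure */
    printf("0x%x\n", translate((2u << OFFSET_BITS) | 0x057));
    return 0;
}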
Handling cache misses
• Cache hardware fetches the missing block
• Often overwriting an existing entry
• Which one?
  – The one that occupies the same location, if the cache uses direct mapping
  – One of those that occupy the same set, if the cache is set-associative
Handling cache misses
• Before expelling a cache entry, we must
  – Check its dirty bit
  – Save its contents if the dirty bit is on
Handling page faults
• OS fetches the missing page
• Often overwriting an existing page
• Which one?
  – One that was not recently used
    • Selected by the page replacement policy
Handling page faults
• Before expelling a page, we must
  – Check its dirty bit
  – Save its contents if the dirty bit is on
Handling writes (I)
• Two ways to handle writes
  – Write through:
    • Each write updates both the cache and the main memory
  – Write back:
    • Writes are not propagated to main memory until the updated word is expelled from the cache
Handling writes (II)
[Figure: with write through, CPU writes go to the cache and to RAM at once; with write back, they go to the cache and are copied to RAM later.]
Pros and cons
• Write through:
  – Ensures that memory is always up to date
    • Expelled cache entries can simply be overwritten
• Write back:
  – Faster writes
  – Complicates the cache expulsion procedure
    • Must write back cache entries that have been modified in the cache
A better write through (I)
• Add a small write buffer to speed up the write performance of write-through caches
  – At least four words
• Holds modified data until they are written into main memory
  – Cache can proceed as soon as the data are written into the write buffer
A better write through (II)
[Figure: with plain write through, the CPU waits for RAM on every write; with the write buffer, the CPU writes into the cache and the buffer, and the buffer drains into RAM in the background.]
Designing RAM to support caches
• RAM is connected to the CPU through a "bus"
  – Its clock rate is much slower than the CPU clock rate
• Assume that a RAM access takes
  – 1 bus clock cycle to send the address
  – 15 bus clock cycles to initiate a read
  – 1 bus clock cycle to send a word of data
Designing RAM to support caches
• Assume
  – Cache block size is 4 words
  – One one-word bank of DRAM
• Fetching a cache block would take 1 + 4×15 + 4×1 = 65 bus clock cycles
  – Transfer rate is 0.25 byte/bus cycle
• Awful!
Designing RAM to support caches
• Could
  – Have an interleaved memory organization
  – Four one-word banks of DRAM
  – A 32-bit bus
[Figure: four one-word RAM banks (bank 0 through bank 3) sharing a 32-bit bus.]
Designing RAM to support caches
• Can do the 4 accesses in parallel
• Must still transmit the block 32 bits at a time
• Fetching a cache block would take 1 + 15 + 4×1 = 20 bus clock cycles
  – Transfer rate is 0.80 byte/bus cycle
• Even better
• Much cheaper than having a 64-bit bus
PERFORMANCE ISSUES
Memory stalls
• Can divide CPU time into
  – N_EXEC clock cycles spent executing instructions
  – N_MEM_STALLS cycles spent waiting for memory accesses
• We have
  CPU time = (N_EXEC + N_MEM_STALLS) × T_CYCLE
Memory stalls
• We assume that
  – cache access times can be neglected
  – most CPU cycles spent waiting for memory accesses are caused by cache misses
Global impact
• We have
  N_MEM_STALLS = N_MEM_ACCESSES × Cache miss rate × Cache miss penalty
• and also
  N_MEM_STALLS = N_INSTRUCTIONS × (N_MISSES/Instruction) × Cache miss penalty
Example
• Miss rate of the instruction cache is 2 percent
• Miss rate of the data cache is 5 percent
• In the absence of memory stalls, each instruction would take 2 cycles
• Miss penalty is 100 cycles
• 40 percent of instructions access the main memory
• How many cycles are lost due to cache misses?
Solution (I)
• Impact of instruction cache misses
  0.02×100 = 2 cycles/instruction
• Impact of data cache misses
  0.40×0.05×100 = 2 cycles/instruction
• Total impact of cache misses
  2 + 2 = 4 cycles/instruction
Solution (II)
• Average number of cycles per instruction
  2 + 4 = 6 cycles/instruction
• Fraction of time wasted
  4/6 = 67 percent
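The same computation, as a small C sketch using the numbers of the example:

#include <stdio.h>

int main(void) {
    double base_cpi        = 2.0;    /* cycles/instruction without stalls          */
    double i_miss_rate     = 0.02;   /* instruction cache miss rate                */
    double d_miss_rate     = 0.05;   /* data cache miss rate                       */
    double mem_access_frac = 0.40;   /* fraction of instructions accessing memory  */
    double miss_penalty    = 100.0;  /* cycles                                     */

    double stall_cpi = i_miss_rate * miss_penalty
                     + mem_access_frac * d_miss_rate * miss_penalty;   /* 2 + 2 = 4 */
    double cpi = base_cpi + stall_cpi;                                 /* 6         */
    printf("stall cycles/instr = %.1f, CPI = %.1f, wasted = %.0f%%\n",
           stall_cpi, cpi, 100.0 * stall_cpi / cpi);
    return 0;
}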
Average memory access time
• Some authors call it AMAT
  T_AVERAGE = T_CACHE + f × T_MISS
  where f is the cache miss rate
• Times can be expressed
  – In nanoseconds
  – In number of cycles
Example
• A cache has a hit rate of 96 percent
• Accessing data
  – In the cache requires one cycle
  – In the memory requires 100 cycles
• What is the average memory access time?
Solution
• Miss rate = 1 – Hit rate = 0.04
• Applying the formula
  T_AVERAGE = 1 + 0.04×100 = 5 cycles
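A one-line application of the AMAT formula in C, using the example's numbers:

#include <stdio.h>

int main(void) {
    double t_cache      = 1.0;    /* cycles to access the cache          */
    double miss_rate    = 0.04;   /* 1 - 0.96 hit rate                   */
    double miss_penalty = 100.0;  /* cycles to access the memory         */

    double amat = t_cache + miss_rate * miss_penalty;  /* T_AVERAGE = T_CACHE + f * T_MISS */
    printf("AMAT = %.1f cycles\n", amat);              /* prints 5.0 */
    return 0;
}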
In other words
It's the miss rate, stupid!
Improving cache hit rate
• Two complementary techniques
  – Using set-associative caches
    • Must check the tags of all blocks with the same index value
      – Slower
    • Have fewer collisions
      – Fewer misses
  – Using a cache hierarchy
A cache hierarchy
• Topmost cache
  – Optimized for speed, not miss rate
  – Rather small
  – Uses a small block size
• As we go down the hierarchy
  – Cache sizes increase
  – Block sizes increase
  – Cache associativity level increases
Example
• Cache miss rate per instruction is 3 percent
• In the absence of memory stalls, each instruction would take one cycle
• Cache miss penalty is 100 ns
• Clock rate is 4 GHz
• How many cycles are lost due to cache misses?
Solution (I)
• Duration of a clock cycle
  1/(4 GHz) = 0.25×10^-9 s = 0.25 ns
• Cache miss penalty
  100 ns = 400 cycles
• Total impact of cache misses
  0.03×400 = 12 cycles/instruction
Solution (II)
• Average number of cycles per instruction
  1 + 12 = 13 cycles/instruction
• Fraction of time wasted
  12/13 = 92 percent
A very good case for hardware multithreading
Example (cont'd)
• How much faster would the processor be if we added an L2 cache that
  – Has a 5 ns access time
  – Would reduce the miss rate to main memory to one percent?
Solution (I)
• L2 cache access time
  5 ns = 20 cycles
• Impact of cache misses per instruction
  L1 cache misses + L2 cache misses = 0.03×20 + 0.01×400 = 0.6 + 4.0 = 4.6 cycles/instruction
• Average number of cycles per instruction
  1 + 4.6 = 5.6 cycles/instruction
Solution (II)
• Fraction of time wasted
  4.6/5.6 = 82 percent
• CPU speedup
  13/5.6 ≈ 2.3
Problem
• Redo the second part of the example assuming that the secondary cache
  – Has a 3 ns access time
  – Can reduce the miss rate to main memory to one percent
Solution
• Fraction of time wasted
  86 percent
• CPU speedup
  1.22
The new L2 cache, with a lower access time but a higher miss rate, performs much worse than the first L2 cache
Example
• A virtual memory has a page fault rate of 10^-4 faults per memory access
• Accessing data
  – In the memory requires 100 ns
  – On disk requires 5 ms
• What is the average memory access time?
  T_avg = 100 ns + 10^-4 × 5 ms = 600 ns
The cost of a page fault
• Let
  – T_m be the main memory access time
  – T_d the disk access time
  – f the page fault rate
  – T_a the average access time of the VM
  T_a = (1 – f) T_m + f (T_m + T_d) = T_m + f T_d
Example
• Assume T_m = 50 ns and T_d = 5 ms

  f       Mean memory access time
  10^-3   50 ns + 5 ms/10^3 = 5,050 ns
  10^-4   50 ns + 5 ms/10^4 = 550 ns
  10^-5   50 ns + 5 ms/10^5 = 100 ns
  10^-6   50 ns + 5 ms/10^6 = 55 ns
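A small C sketch that reproduces the table above from the formula T_a = T_m + f T_d:

#include <stdio.h>

int main(void) {
    double t_m = 50e-9;   /* main memory access time: 50 ns */
    double t_d = 5e-3;    /* disk access time: 5 ms         */
    double rates[] = { 1e-3, 1e-4, 1e-5, 1e-6 };

    for (int i = 0; i < 4; i++) {
        double t_a = t_m + rates[i] * t_d;   /* Ta = Tm + f * Td */
        printf("f = %g  ->  Ta = %g ns\n", rates[i], t_a * 1e9);
    }
    return 0;
}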
In other words
It's the page fault rate, stupid!
Locality principle (I)
• A process that would access its pages in a totally unpredictable fashion would perform very poorly in a VM system unless all its pages are in main memory
Locality principle (II)
• Process P randomly accesses a very large array consisting of n pages
• If m of these n pages are in main memory, the page fault frequency of the process will be (n – m)/n
• Must switch to another algorithm
First problem
• A virtual memory system has
  – 32-bit addresses
  – 4 KB pages
• What are the sizes of the
  – Page number field?
  – Offset field?
Solution (I)
• Step 1: Convert the page size to a power of 2
  4 KB = 2^12 B
• Step 2: The exponent is the length of the offset field
Solution (II)
• Step 3: Size of page number field = Address size – Offset size
  Here 32 – 12 = 20 bits
• Answer: 12 bits for the offset and 20 bits for the page number
MEMORY PROTECTION
Objective
• Unless we have an isolated single-user system, we must prevent users from
  – Accessing
  – Deleting
  – Modifying
  the address spaces of other processes, including the kernel
Memory protection (I)
• VM ensures that processes cannot access page frames that are not referenced in their page table
• Can refine control by distinguishing among
  – Read access
  – Write access
  – Execute access
• Must also prevent processes from modifying their own page tables
Dual-mode CPU
• Requires a dual-mode CPU
• Two CPU modes
  – Privileged mode (or executive mode), which allows the CPU to execute all instructions
  – User mode, which allows the CPU to execute only safe, unprivileged instructions
• The state of the CPU is determined by a special bit
Switching between states
• User mode is the default mode for all programs
  – Only the kernel can run in supervisor mode
• Switching from user mode to supervisor mode is done through an interrupt
  – Safe because the jump address is at a well-defined location in main memory
Memory protection (II)
• Has additional advantages:
  – Prevents programs from corrupting the address spaces of other programs
  – Prevents programs from crashing the kernel
    • Not true for device drivers, which are inside the kernel
• Required part of any multiprogramming system
INTEGRATING CACHES AND VM
The problem
• In a VM system, each byte of memory has two addresses
  – A virtual address
  – A physical address
• Should cache tags contain virtual addresses or physical addresses?
Discussion
• Using virtual addresses
  – Directly available
  – Bypass the TLB
  – Cache entries are specific to a given address space
  – Must flush the caches when the OS selects another process
• Using physical addresses
  – Must access the TLB first
  – Cache entries are not specific to a given address space
  – Do not have to flush the caches when the OS selects another process
The best solution
• Let the cache use physical addresses
  – No need to flush the cache at each context switch
  – TLB access delay is tolerable
VIRTUAL MACHINES
Key idea
• Let different operating systems run at the same time on a single computer
  – Windows, Linux and Mac OS
  – A real-time OS and a conventional OS
  – A production OS and a new OS being tested
How it is done
• A hypervisor (VM monitor) defines two or more virtual machines
• Each virtual machine has
  – Its own virtual CPU
  – Its own virtual physical memory
  – Its own virtual disk(s)
Two virtual machines
[Figure: the hypervisor runs in privileged mode; each virtual machine, consisting of a VM kernel and its user processes, runs in user mode.]
Translating a block address
[Figure: a VM kernel asks to access block (x, y) of its virtual disk; the hypervisor translates this into block (v, w) of the actual disk and performs the access.]
Handling I/Os
• Difficult task because
  – Wide variety of devices
  – Some devices may be shared among several VMs
    • Printers
    • Shared disk partitions
      – Want to let Linux and Windows access the same files
Virtual Memory Issues
• Each VM kernel manages its own memory
  – Its page tables map program virtual addresses into pseudo-physical addresses
    • It treats these addresses as physical addresses
The dilemma
[Figure: user process A and its VM kernel believe that page 735 of process A is stored in page frame 435; the hypervisor knows it actually resides in page frame 993 of the real RAM.]
The solution (I)
• Address translation must remain fast!
  – The hypervisor lets each VM kernel manage its own page tables but does not use them
    • They contain bogus mappings!
  – It maintains instead its own shadow page tables with the correct mappings
    • Used to handle TLB misses
The solution (II)
• To keep its shadow page tables up to date, the hypervisor must track any changes made by the VM kernels
• It marks the VM kernels' page tables read-only
Nastiest Issue
• The whole VM approach assumes that a kernel executing in user mode will behave exactly like a kernel executing in privileged mode
• Not true for all architectures!
  – Intel x86 POPF (pop flags) instruction
  – …
Solutions
1. Modify the instruction set and eliminate instructions like POPF
   • IBM redesigned the instruction set of their 360 series for the 370 series
2. Mask the problem through clever software
   • Dynamic "binary translation" when direct execution of the code could not work (VMware)
CACHE CONSISTENCY
The problem
• Specific to architectures with
  – Several processors sharing the same main memory
  – Multicore architectures
• Each core/processor has its own private cache
  – A must for performance
• Happens when the same data are present in two or more private caches
An example (I)
[Figure: two CPUs share the same RAM; each private cache holds x = 0.]
An example (II)
[Figure: one CPU increments x, so its cache now holds x = 1; the other CPU still assumes x = 0.]
An example
[Figure: the two CPUs update x one after the other; their caches end up with x = ? because the final value depends on the order in which the updates are applied. Both CPUs must apply the two updates in the same order.]
Rules
1. Whenever a processor accesses a variable, it always gets the value stored by the processor that updated that variable last, provided the updates are sufficiently separated in time
2. A processor accessing a variable sees all updates applied to that variable in the same order
   – No compromise is possible here
A realization: Snoopy caches
• All caches are linked to the main memory through a shared bus
  – All caches can observe the writes performed by the other caches
• When a cache notices that another cache performs a write to a memory location that it holds, it invalidates the corresponding cache block
An example (I)
[Figure: the first CPU's cache holds x = 2; the second CPU's cache fetches x = 2 from RAM.]
An example (II)
[Figure: both caches now hold x = 2, since the second cache has also fetched x.]
An example (III)
[Figure: the first CPU resets x to 0, so its cache holds x = 0; the other cache still holds x = 2.]
An example (IV)
[Figure: the first cache performs a write-through of x = 0 to RAM; the second cache detects the write-through on the shared bus and invalidates its copy of x.]
An example (V)
[Figure: when the second CPU wants to access x again, its cache gets the correct value x = 0 from RAM.]
A last correctness condition
• Caches cannot reorder their memory updates
  – The cache-to-RAM write buffer must be FIFO
    • First in, first out
Miscellaneous fallacies
• Segmented address spaces
  – Address is a segment number + an offset in the segment
  – Programmers hate them
• Ignoring virtual memory behavior when accessing large two-dimensional arrays
• Believing that you can virtualize any CPU architecture
DEPENDABILITY
Reliability and Availability
• Reliability
  – Probability R(t) that the system will be up at time t if it was up at time t = 0
• Availability
  – Fraction of time the system is up
• Reliability and availability do not measure the same thing!
MTTF, MTTR and MTBF
• MTTF is the mean time to failure
• MTTR is the mean time to repair
• 1/MTTF is the failure rate
• MTBF, the mean time between failures, is
  MTBF = MTTF + MTTR
Reliability
• As a first approximation
R(t) = exp(–t/MTTF)
– Not true if failure rate varies over time
Availability
• Measured by
(MTTF)/(MTTF + MTTR) = MTTF/MTBF
– MTTR is very important
Example
• A server crashes on average once a month
• When this happens, it takes six hours to reboot it
• What is the server availability?
Solution
• MTBF = 30 days
• MTTR = 6 hours = 0.25 day
• MTTF = 29.75 days
• Availability is 29.75/30 = 99.2 %
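A small C check of this computation, assuming (as the problem states) a crash every 30 days and a six-hour repair time:

#include <stdio.h>

int main(void) {
    double mtbf = 30.0;          /* days between crashes           */
    double mttr = 6.0 / 24.0;    /* six hours to reboot, in days   */
    double mttf = mtbf - mttr;   /* MTBF = MTTF + MTTR             */

    double availability = mttf / mtbf;                          /* MTTF / (MTTF + MTTR) */
    printf("availability = %.1f%%\n", 100.0 * availability);    /* about 99.2% */
    return 0;
}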
Example
• A disk drive has an MTTF of 20 years
• What is the probability that the data it contains will not be lost over a period of five years?
Example
• A disk farm contains 100 disks whose MTTF is 20 years.
• What is the probability that no data will be lost over a period of five years?
Solution
• The aggregate failure rate of the disk farm is 100×1/20 = 5 failures per year
• The mean time to failure of the farm is 1/5 year
• We apply the formula
  R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10^-11
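The same computation in C (link with the math library, e.g. -lm):

#include <math.h>
#include <stdio.h>

int main(void) {
    double mttf_disk = 20.0;                 /* years                              */
    int    n_disks   = 100;
    double mttf_farm = mttf_disk / n_disks;  /* 0.2 year: failure rates add up     */
    double t         = 5.0;                  /* years                              */

    double r = exp(-t / mttf_farm);          /* R(t) = exp(-t/MTTF)                */
    printf("R(5 years) = %g\n", r);          /* about 1.4e-11                      */
    return 0;
}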
RAID Arrays
Today’s Motivation
• We use RAID today for
  – Increasing disk throughput by allowing parallel access
  – Eliminating the need to make disk backups
    • Disks are too big to be backed up in an efficient fashion
RAID LEVEL 0
• No replication
• Advantages:
  – Simple to implement
  – No overhead
• Disadvantage:
  – If the array has n disks, its failure rate is n times the failure rate of a single disk
[Figure: RAID levels 0 and 1; RAID level 1 adds mirror disks.]
RAID LEVEL 1
• Mirroring:
  – Two copies of each disk block
• Advantages:
  – Simple to implement
  – Fault-tolerant
• Disadvantage:
  – Requires twice the disk capacity of normal file systems
RAID LEVEL 2
• Instead of duplicating the data blocks, we use an error correction code
• Very bad idea because disk drives either work correctly or do not work at all
  – The only possible errors are omission errors
  – We need an omission correction code
    • A parity bit is enough to correct a single omission
[Figure: RAID level 2 uses several check disks; RAID level 3 uses a single parity disk.]
RAID LEVEL 3
• Requires N+1 disk drives
  – N drives contain data (1/N of each data block)
    • Block b[k] is now partitioned into N fragments b[k,1], b[k,2], ..., b[k,N]
  – The parity drive contains the exclusive or (XOR) of these N fragments
    p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
How parity works?
• Truth table for XOR (same as parity)

  A  B  A⊕B
  0  0   0
  0  1   1
  1  0   1
  1  1   0
Recovering from a disk failure
• A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1

  D0  D1  P
  0   0   0
  0   1   1
  1   0   1
  1   1   0

  D1⊕P = D0   D0⊕P = D1
  0           0
  0           1
  1           0
  1           1
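A minimal C illustration of parity-based recovery; the data byte values are arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t d0 = 0xA5, d1 = 0x3C;      /* two data fragments (arbitrary values)  */
    uint8_t p  = d0 ^ d1;              /* parity fragment                         */

    /* If disk D0 fails, its contents can be rebuilt from D1 and P */
    uint8_t rebuilt_d0 = d1 ^ p;
    printf("recovered D0 = 0x%02X (original 0x%02X)\n", rebuilt_d0, d0);
    return 0;
}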
How RAID level 3 works (I)
• Assume we have N + 1 disks
• Each block is partitioned into N equal chunks
[Figure: a block split into four chunks (N = 4 in the example).]
How RAID level 3 works (II)
• XOR the data chunks to compute the parity chunk
• Each chunk is written to a separate disk
[Figure: the four data chunks plus the parity chunk each go to a different disk.]
How RAID level 3 works (III)
• Each read/write involves all disks in the RAID array
  – Cannot do two or more reads/writes in parallel
  – Performance of the array is no better than that of a single disk
RAID LEVEL 4 (I)
• Requires N+1 disk drives
  – N drives contain data
    • Individual blocks, not chunks
  – Blocks with the same disk address form a stripe
[Figure: a stripe of data blocks plus its parity block.]
RAID LEVEL 4 (II)
• The parity drive contains the exclusive or of the N blocks in the stripe
  p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• The parity block now reflects the contents of several blocks!
• Can now do parallel reads/writes
RAID levels 4 and 5
[Figure: RAID level 4 keeps all parity blocks on a dedicated drive, which becomes a bottleneck; RAID level 5 spreads the parity blocks over all drives.]
RAID LEVEL 5
• The single parity drive of RAID level 4 is involved in every write
  – Will limit parallelism
• RAID level 5 distributes the parity blocks among the N+1 drives
  – Much better
The small write problem
• Specific to RAID level 5
• Happens when we want to update a single block
  – The block belongs to a stripe
  – How can we compute the new value of the parity block?
[Figure: a stripe consisting of data blocks b[k], b[k+1], b[k+2], ... and parity block p[k].]
First solution
• Read the values of the N-1 other blocks in the stripe
• Recompute
  p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
  – N-1 reads
  – 2 writes (new block and new parity block)
Second solution
• Assume we want to update block b[m]
• Read the old values of b[m] and of the parity block p[k]
• Compute
  new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
  – 2 reads (old values of the block and of the parity block)
  – 2 writes (new block and new parity block)
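A small C sketch of this second solution, using single bytes as stand-ins for blocks and checking the result against a full recomputation of the parity:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t stripe[4] = { 0x11, 0x22, 0x33, 0x44 };         /* data blocks (toy values)     */
    uint8_t parity = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3];

    /* Update block 2 using the "second solution": 2 reads, 2 writes */
    uint8_t old_block = stripe[2], new_block = 0x55;
    parity    = new_block ^ old_block ^ parity;             /* new p = new b ^ old b ^ old p */
    stripe[2] = new_block;

    /* Check against the first solution (full recomputation) */
    uint8_t check = stripe[0] ^ stripe[1] ^ stripe[2] ^ stripe[3];
    printf("parity ok: %s\n", parity == check ? "yes" : "no");
    return 0;
}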
RAID level 6 (I)
• Not part of the original proposal
  – Two check disks
  – Tolerates two disk failures
  – More complex updates
RAID level 6 (II)
• Has become more popular as disks are becoming
  – Bigger
  – More vulnerable to irrecoverable read errors
• The most frequent cause of RAID level 5 array failures is
  – An irrecoverable read error occurring while the contents of a failed disk are being reconstituted
CONNECTING I/O DEVICES
Busses
• Connecting computer subsystems with each other was traditionally done through busses
• A bus is a shared communication link connecting multiple devices
• Busses transmit several bits at a time
  – Parallel busses
Examples
• Processor-memory busses
  – Connect the CPU with the memory modules
  – Short and high-speed
• I/O busses
  – Longer
  – Wide range of data bandwidths
  – Connect to memory through the processor-memory bus or a backplane bus
Synchronous busses
• Include a clock in the control lines
• Bus protocols are expressed as actions to be taken at each clock pulse
• Have very simple protocols
• Disadvantages
  – All bus devices must run at the same clock rate
  – Due to clock skew issues, cannot be both fast and long
Asynchronous busses
• Have no clock
• Can accommodate a wide variety of devices
• Have no clock skew issues
• Require a handshaking protocol before any transmission
  – Implemented with extra control lines
Advantages of busses
• Cheap
  – One bus can link many devices
• Flexible
  – Can add devices
Disadvantages of busses
• Shared devices
  – Can become bottlenecks
• Hard to run many parallel lines at high clock speeds
New trend
• Away from parallel shared buses
• Towards serial point-to-point switched interconnections
  – Serial
    • One bit at a time
  – Point-to-point
    • Each line links a specific device to another specific device
x86 bus organization
• The processor connects to peripherals through two chips (bridges)
  – North Bridge
  – South Bridge
[Figure: the CPU connects to the North Bridge, which in turn connects to the South Bridge.]
North bridge
• Essentially a DMA controller
  – Lets the disk controller access main memory without any intervention of the CPU
• Connects the CPU to
  – Main memory
  – An optional graphics card
  – The South Bridge
South Bridge
• Connects North bridge to a wide variety of I/O busses
Communicating with I/O devices
• Two solutions– Memory-mapped I/O– Special I/O instructions
Memory mapped I/O
• A portion of the address space is reserved for I/O operations
  – Writes to any of these addresses are interpreted as I/O commands
  – Reading from these addresses gives access to
    • The error bit
    • The I/O completion bit
    • The data being read
Memory mapped I/O
• User processes cannot access these addresses
  – Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion
Dedicated I/O instructions
• Privileged instructions that cannot be executed by user processes
  – Only the kernel can issue them
• Prevents user processes from accessing the disk in an uncontrolled fashion
Polling
• Simplest way for an I/O device to communicate with the CPU
• The CPU periodically checks the status of pending I/O operations
  – High CPU overhead
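A hedged C sketch combining memory-mapped I/O with polling; the register addresses and the READY bit are hypothetical, and code like this only makes sense inside a kernel or device driver:

#include <stdint.h>

/* Hypothetical device register addresses; real values are platform-specific. */
#define DEV_STATUS ((volatile uint32_t *)0xFFFF0000u)
#define DEV_DATA   ((volatile uint32_t *)0xFFFF0004u)
#define READY_BIT  0x1u

/* Busy-wait until the device reports completion, then read one word of data.
   These addresses are not accessible to user processes. */
uint32_t read_device_word(void) {
    while ((*DEV_STATUS & READY_BIT) == 0)
        ;                                  /* polling the I/O completion bit */
    return *DEV_DATA;
}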
I/O completion interrupts
• Notify the CPU that an I/O operation has completed
• Allow the CPU to do something else while waiting for the completion of an I/O operation
  – Multiprogramming
• I/O completion interrupts are processed by the CPU between instructions
  – No internal instruction state to save
Interrupt levels
• See previous chapter
Direct memory access
• DMA• Lets disk controller access main memory w/o
any intervention of the CPU
DMA and virtual memory
• A single DMA transfer may cross page boundaries with
  – One page being in main memory
  – One missing page
Solutions
• Make DMA work with virtual addresses
  – The issue is then dealt with by the virtual memory subsystem
• Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries
An Example
[Figure: a DMA transfer spanning two pages is broken into two smaller DMA transfers, one per page.]
DMA and cache hierarchy
• Three approaches for handling temporary inconsistencies between caches and main memory
Solutions
1. Route all DMA accesses through the cache
   – Bad solution
2. Have the OS selectively
   – Invalidate affected cache entries when performing a read
   – Force an immediate flush of dirty cache entries when performing a write
3. Have specific hardware do the same
Benchmarking I/O
Benchmarks
• Specific benchmarks for
  – Transaction processing
    • Emphasis on speed and graceful recovery from failures
    • Atomic transactions: all-or-nothing behavior
An important observation
• Very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput
  – Unless we sequentially access very large ranges of data
    • 512 KB and more
Major fallacies
• Since the rated MTTFs of disk drives exceed one million hours, a disk can last more than 100 years
  – The MTTF expresses the failure rate during the disk's actual lifetime
• Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature
  – They are actually up to ten times higher
Major fallacies
• Neglecting to do end-to-end checks
  – …
• Using magnetic tapes to back up disks
  – Tape formats can quickly become obsolete
  – Disk bit densities have grown much faster than tape data densities
WRITING PARALLEL PROGRAMS
Overview
• Some problems are embarrassingly parallel
  – Many computer graphics tasks
  – Brute force searches in cryptography or password guessing
• Much more difficult for other applications
  – Communication overhead among sub-tasks
  – Amdahl's law
  – Balancing the load
Amdahl's Law
• Assume a sequential process takes
  – t_p seconds to perform operations that could be performed in parallel
  – t_s seconds to perform purely sequential operations
• The maximum speedup will be
  (t_p + t_s)/t_s
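A tiny C illustration of the formula; the values of t_p and t_s are assumptions chosen for the example:

#include <stdio.h>

int main(void) {
    double t_p = 90.0;   /* seconds of parallelizable work (assumed)    */
    double t_s = 10.0;   /* seconds of purely sequential work (assumed) */

    double max_speedup = (t_p + t_s) / t_s;            /* with arbitrarily many processors */
    printf("maximum speedup = %.1f\n", max_speedup);   /* 10.0 */
    return 0;
}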
Balancing the load
• Must ensure that workload is equally divided among all the processors
• Worst case is when one of the processors does much more work than all others
A last issue
• Humans like to address issues one after the other
  – We have meeting agendas
  – We do not like to be interrupted
  – We write sequential programs
MULTIPROCESSOR ORGANIZATIONS
Shared memory multiprocessors
[Figure: several processing units, each with its own cache, connected through an interconnection network to a shared RAM and I/O.]
Shared memory multiprocessor
• Can offer
  – Uniform memory access to all processors (UMA)
    • Easiest to program
  – Non-uniform memory access to all processors (NUMA)
    • Can scale up to larger sizes
    • Offers faster access to nearby memory
Computer clusters
[Figure: several nodes, each consisting of a processing unit, a cache and its own RAM, connected through an interconnection network.]
Computer clusters
• Very easy to assemble
• Can take advantage of high-speed LANs
  – Gigabit Ethernet, Myrinet, …
• Data exchanges must be done through message passing
HARDWARE MULTITHREADING
General idea
• Let the processor switch to another thread of computation while the current one is stalled
• Motivation:
  – The increased cost of cache misses
Implementation
• Entirely controlled by the hardware
  – Unlike multiprogramming
• Requires a processor capable of
  – Keeping track of the state of each thread
    • One set of registers (including the PC) for each concurrent thread
  – Quickly switching among concurrent threads
Approaches
• Fine-grained multithreading:
  – Switches between threads for each instruction
  – Provides the highest throughputs
  – Slows down the execution of individual threads
Approaches
• Coarse-grained multithreading:
  – Switches between threads whenever a long stall is detected
  – Easier to implement
  – Cannot eliminate all stalls
Approaches
• Simultaneous multithreading:
  – Takes advantage of the ability of modern hardware to work in parallel on instructions from different threads
  – Best solution
ALPHABET SOUP
Classification
• SISD:
  – Single instruction, single data
  – Conventional uniprocessor architecture
• MIMD:
  – Multiple instructions, multiple data
  – Conventional multiprocessor architecture
Classification
• SIMD:
  – Single instruction, multiple data
  – Performs the same operation on a set of similar data
    • Think of adding two vectors

  for (i = 0; i < VECSIZE; i++)
      sum[i] = a[i] + b[i];
PERFORMANCE ISSUES
Roofline model
• Takes into account
  – Memory bandwidth
  – Floating-point performance
• Introduces arithmetic intensity
  – Total number of floating-point operations in a program divided by the total number of bytes transferred to main memory
  – Measured in FLOPs/byte
Roofline model
• Attainable GFLOPS/s =
  Min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
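A small C sketch of the roofline formula; the peak bandwidth and peak floating-point figures are illustrative assumptions:

#include <stdio.h>

/* min(peak memory bandwidth * arithmetic intensity, peak floating-point performance) */
double attainable_gflops(double arithmetic_intensity,
                         double peak_bw_gbs, double peak_gflops) {
    double bw_bound = peak_bw_gbs * arithmetic_intensity;    /* memory-bound region   */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;  /* compute-bound plateau */
}

int main(void) {
    double peak_bw = 16.0, peak_fp = 16.0;   /* GB/s and GFLOPS/s (assumed values) */
    for (double ai = 0.25; ai <= 4.0; ai *= 2)
        printf("AI = %.2f -> %.1f GFLOPS/s\n", ai, attainable_gflops(ai, peak_bw, peak_fp));
    return 0;
}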
Roofline model
[Figure: attainable GFLOPS/s as a function of arithmetic intensity; the flat roof is the peak floating-point performance, while at low arithmetic intensity the attainable floating-point performance is limited by memory bandwidth.]