STORAGE AND I/O
Jehan-François Pâris, [email protected]
TRANSCRIPT
Chapter Organization
• Availability and Reliability
• Technology review
– Solid-state storage devices
– I/O Operations
– Reliable Arrays of Inexpensive Disks
DEPENDABILITY
Reliability and Availability
• Reliability
– Probability R(t) that the system will be up at time t if it was up at time t = 0
• Availability
– Fraction of time the system is up
• Reliability and availability do not measure the same thing!
Which matters?
• It depends:
– Reliability for real-time systems
• Flight control, process control, …
– Availability for many other applications
• DSL service, file server, web server, …
MTTF, MTTR and MTBF
• MTTF is mean time to failure
• MTTR is mean time to repair
• 1/MTTF is failure rate
• MTBF, the mean time between failures, is
MTBF = MTTF + MTTR
Reliability
• As a first approximation
R(t) = exp(–t/MTTF)
– Not true if failure rate varies over time
Availability
• Measured by
(MTTF)/(MTTF + MTTR) = MTTF/MTBF
– MTTR is very important
• A good MTTR requires that we detect the failure quickly
The nine notation
• Availability is often expressed in "nines"
– 99 percent is two nines
– 99.9 percent is three nines
– …
• Formula is –log10(1 – A)
• Example: –log10(1 – 0.999) = –log10(10⁻³) = 3
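The nines formula can be sketched in a few lines of Python (an illustrative check; the function name `nines` is mine):

```python
import math

def nines(availability):
    """Number of nines of availability: -log10(1 - A)."""
    return -math.log10(1 - availability)

print(nines(0.99))   # two nines
print(nines(0.999))  # three nines
```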
Example
• A server crashes on the average once a month
• When this happens, it takes 12 hours to reboot it
• What is the server availability?
Solution
• MTBF = 30 days
• MTTR = 12 hours = ½ day
• MTTF = 29 ½ days
• Availability is 29.5/30 = 98.3 %
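As a quick check of the arithmetic, here is the availability formula in Python (a hedged sketch; the helper name is mine):

```python
def availability(mttf, mttr):
    """Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

# Server example: MTBF = 30 days, MTTR = 0.5 day, so MTTF = 29.5 days
print(f"{availability(29.5, 0.5):.1%}")  # 98.3%
```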
Keep in mind
• A 99 percent availability is not as great as we might think
– One hour down every 100 hours
– Fifteen minutes down every 24 hours
Example
• A disk drive has an MTTF of 20 years.
• What is the probability that the data it contains will not be lost over a period of five years?
Example
• A disk farm contains 100 disks whose MTTF is 20 years.
• What is the probability that no data will be lost over a period of five years?
Solution
• The aggregate failure rate of the disk farm is 100 × 1/20 = 5 failures/year
• The mean time to failure of the farm is 1/5 year
• We apply the formula
R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10⁻¹¹
– Almost zero chance!
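Both examples follow from R(t) = exp(–t/MTTF); a small Python sketch (function name mine, constant failure rate assumed):

```python
import math

def reliability(t, mttf):
    """R(t) = exp(-t/MTTF), valid only for a constant failure rate."""
    return math.exp(-t / mttf)

# One disk with a 20-year MTTF, over five years:
print(reliability(5, 20))    # about 0.78

# Farm of 100 such disks: aggregate MTTF is 20/100 = 0.2 year
print(reliability(5, 0.2))   # about 1.4e-11
```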
TECHNOLOGY OVERVIEW
Disk drives
• See previous chapter
• Recall that the disk access time is the sum of
– The disk seek time (to get to the right track)
– The disk rotational latency
– The actual transfer time
Flash drives
• Widely used in flash drives, most MP3 players and some small portable computers
• Similar technology to EEPROM
• Two technologies
What about flash?
• Widely used in flash drives, most MP3 players and some small portable computers
• Several important limitations
– Limited write bandwidth
• Must erase a whole block of data before overwriting it
– Limited endurance
• 10,000 to 100,000 write cycles
Storage Class Memories
• Solid-state storage
– Non-volatile
– Much faster than conventional disks
• Numerous proposals:
– Ferro-electric RAM (FRAM)
– Magneto-resistive RAM (MRAM)
– Phase-Change Memories (PCM)
Phase-Change Memories

[Figure: a phase-change memory device — no moving parts, a crossbar organization, a data cell]
Phase-Change Memories
• Cells contain a chalcogenide material that has two states
– Amorphous, with high electrical resistivity
– Crystalline, with low electrical resistivity
• Quickly cooling the material from above its fusion point leaves it in the amorphous state
• Slowly cooling the material from above its crystallization point leaves it in the crystalline state
Projections
• Target date: 2012
• Access time: 100 ns
• Data rate: 200–1000 MB/s
• Write endurance: 10⁹ write cycles
• Read endurance: no upper limit
• Capacity: 16 GB
• Capacity growth: > 40% per year
• MTTF: 10–50 million hours
• Cost: < $2/GB
Interesting Issues (I)
• Disks will remain much cheaper than SCM for some time
• Could use SCMs as intermediary level between main memory and disks
[Figure: storage hierarchy — main memory, then SCM, then disk]
A last comment
• The technology is still experimental
• Not sure when it will come to the market
• Might even never come to the market
Interesting Issues (II)
• Rather narrow gap between SCM access times and main memory access times
• Main memory and SCM will interact
– As the L3 cache interacts with the main memory
– Not as the main memory now interacts with the disk
RAID Arrays
Today’s Motivation
• We use RAID today for
– Increasing disk throughput by allowing parallel access
– Eliminating the need to make disk backups
• Disks are too big to be backed up in an efficient fashion
RAID LEVEL 0
• No replication
• Advantages:
– Simple to implement
– No overhead
• Disadvantage:
– If the array has n disks, its failure rate is n times the failure rate of a single disk
[Figure: RAID levels 0 and 1 — RAID level 0 stripes blocks across the disks; RAID level 1 mirrors each disk]
RAID LEVEL 1
• Mirroring:
– Two copies of each disk block
• Advantages:
– Simple to implement
– Fault-tolerant
• Disadvantage:
– Requires twice the disk capacity of normal file systems
RAID LEVEL 2
• Instead of duplicating the data blocks we use an error correction code
• Very bad idea because disk drives either work correctly or do not work at all
– Only possible errors are omission errors
– We need an omission correction code
• A parity bit is enough to correct a single omission
[Figure: RAID levels 2 and 3 — RAID level 2 uses several check disks; RAID level 3 uses a single parity disk]
RAID LEVEL 3
• Requires N+1 disk drives
– N drives contain data (1/N of each data block)
• Block b[k] is now partitioned into N fragments b[k,1], b[k,2], ..., b[k,N]
– Parity drive contains the exclusive or of these N fragments
p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
How does parity work?
• Truth table for XOR (same as parity)

A B A⊕B
0 0  0
0 1  1
1 0  1
1 1  0
Recovering from a disk failure
• A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1

D0 D1 P
0  0  0
0  1  1
1  0  1
1  1  0

D1⊕P=D0  D0⊕P=D1
0        0
0        1
1        0
1        1
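The recovery rule above — XOR the surviving disks to rebuild the failed one — works on whole blocks too; a minimal Python sketch with made-up two-byte blocks:

```python
def parity(blocks):
    """XOR all the given blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

d0 = b"\x0f\xaa"
d1 = b"\xf0\x55"
p = parity([d0, d1])          # contents of the parity disk

# If D0 fails, XORing the surviving disks reconstructs it:
recovered = parity([d1, p])
print(recovered == d0)        # True
```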
How RAID level 3 works (I)
• Assume we have N + 1 disks
• Each block is partitioned into N equal chunks

[Figure: a block split into four chunks (N = 4 in the example)]
How RAID level 3 works (II)
• XOR the data chunks to compute the parity chunk
• Each chunk is written into a separate disk

[Figure: the four data chunks and the parity chunk, each on its own disk]
How RAID level 3 works (III)
• Each read/write involves all disks in the RAID array
– Cannot do two or more reads/writes in parallel
– Performance of the array is not better than that of a single disk
RAID LEVEL 4 (I)
• Requires N+1 disk drives
– N drives contain data
• Individual blocks, not chunks
– Blocks with the same disk address form a stripe

[Figure: a stripe of blocks at the same address across the disks, plus its parity block]
RAID LEVEL 4 (II)
• Parity drive contains the exclusive or of the N blocks in the stripe
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Parity block now reflects contents of several blocks!
• Can now do parallel reads/writes
[Figure: RAID levels 4 and 5 — RAID level 4's dedicated parity disk is a bottleneck; RAID level 5 spreads the parity blocks over all the disks]
RAID LEVEL 5
• The single parity drive of RAID level 4 is involved in every write
– Will limit parallelism
• RAID level 5 distributes the parity blocks among the N+1 drives
– Much better
The small write problem
• Specific to RAID 5
• Happens when we want to update a single block
– The block belongs to a stripe
– How can we compute the new value of the parity block?

[Figure: a stripe containing blocks b[k], b[k+1], b[k+2], … and their parity block p[k]]
First solution
• Read the values of the N-1 other blocks in the stripe
• Recompute
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
– N-1 reads
– 2 writes (new block and new parity block)
Second solution
• Assume we want to update block b[m]
• Read the old values of b[m] and parity block p[k]
• Compute
new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
– 2 reads (old values of block and parity block)
– 2 writes (new block and new parity block)
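A quick check that the two solutions yield the same parity block (illustrative Python with made-up one-byte blocks):

```python
def xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe with three data blocks and their parity block
b0, b1, b2 = b"\x11", b"\x22", b"\x44"
old_p = xor(xor(b0, b1), b2)

# Second solution: update b1 without reading the whole stripe
new_b1 = b"\x99"
new_p = xor(xor(new_b1, b1), old_p)

# First solution: recompute parity from all the data blocks
print(new_p == xor(xor(b0, new_b1), b2))  # True
```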
RAID level 6 (I)
• Not part of the original proposal
– Two check disks
– Tolerates two disk failures
– More complex updates
RAID level 6 (II)
• Has become more popular as disks become
– Bigger
– More vulnerable to irrecoverable read errors
• Most frequent cause of RAID level 5 array failures is
– An irrecoverable read error occurring while the contents of a failed disk are reconstituted
RAID level 6 (III)
• Typical array size is 12 disks
• Space overhead is 2/12 = 16.7 %
• Sole real issue is the cost of small writes
– Three reads and three writes:
• Read the old value of the block being updated, old parity block P, old parity block Q
• Write the new value of the block being updated, new parity block P, new parity block Q
CONCLUSION (II)
• The low cost of disk drives made RAID level 1 attractive for small installations
• Otherwise pick
– RAID level 5 for higher parallelism
– RAID level 6 for higher protection
• Can tolerate one disk failure plus irrecoverable read errors
A review question
• Consider an array consisting of four 750 GB disks
• What is the storage capacity of the array if we organize it
– As a RAID level 0 array?
– As a RAID level 1 array?
– As a RAID level 5 array?
The answers
• Consider an array consisting of four 750 GB disks
• What is the storage capacity of the array if we organize it
– As a RAID level 0 array? 3 TB
– As a RAID level 1 array? 1.5 TB
– As a RAID level 5 array? 2.25 TB
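The answers follow from the space overhead of each RAID level; a hedged Python sketch (the function is mine, capacities in GB):

```python
def raid_capacity(n, disk_gb, level):
    """Usable capacity of an n-disk array for a few RAID levels."""
    if level == 0:
        return n * disk_gb            # no redundancy
    if level == 1:
        return n * disk_gb / 2        # every block mirrored
    if level == 5:
        return (n - 1) * disk_gb      # one disk's worth of parity
    if level == 6:
        return (n - 2) * disk_gb      # two disks' worth of check data
    raise ValueError(level)

for level in (0, 1, 5):
    print(level, raid_capacity(4, 750, level) / 1000, "TB")
```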
CONNECTING I/O DEVICES
Busses
• Connecting computer subsystems with each other was traditionally done through busses
• A bus is a shared communication link connecting multiple devices
• Busses transmit several bits at a time
– Parallel buses
Examples
• Processor-memory busses
– Connect the CPU with memory modules
– Short and high-speed
• I/O busses
– Longer
– Wide range of data bandwidths
– Connect to memory through the processor-memory bus or a backplane bus
Standards
• FireWire
– For external use
– 63 devices per channel
– 4 signal lines
– 400 Mb/s or 800 Mb/s
– Up to 4.5 m
Standards
• USB 2.0
– For external use
– 127 devices per channel
– 2 signal lines
– 1.5 Mb/s (Low Speed), 12 Mb/s (Full Speed) and 480 Mb/s (Hi-Speed)
– Up to 5 m
Standards
• USB 3.0
– For external use
– Adds a 5 Gb/s transfer rate (SuperSpeed)
– Maximum distance is still 5 m
Standards
• PCI Express
– For internal use
– 1 device per channel
– 2 signal lines per "lane"
– Multiples of 250 MB/s: 1x, 2x, 4x, 8x, 16x and 32x
– Up to 0.5 m
Standards
• Serial ATA
– For internal use
– Connects cheap disks to the computer
– 1 device per channel
– 4 data lines
– 300 MB/s
– Up to 1 m
Standards
• Serial Attached SCSI (SAS)
– For external use
– 4 devices per channel
– 4 data lines
– 300 MB/s
– Up to 8 m
Synchronous busses
• Include a clock in the control lines
• Bus protocols expressed in actions to be taken at each clock pulse
• Have very simple protocols
• Disadvantages
– All bus devices must run at the same clock rate
– Due to clock skew issues, cannot be both fast and long
Asynchronous busses
• Have no clock
• Can accommodate a wide variety of devices
• Have no clock skew issues
• Require a handshaking protocol before any transmission
– Implemented with extra control lines
Advantages of busses
• Cheap
– One bus can link many devices
• Flexible
– Can add devices
Disadvantages of busses
• The shared link
– Can become a bottleneck
• Hard to run many parallel lines at high clock speeds
New trend
• Away from parallel shared buses
• Towards serial point-to-point switched interconnections
– Serial
• One bit at a time
– Point-to-point
• Each line links a specific device to another specific device
x86 bus organization
• Processor connects to peripherals through two chips (bridges)
– North Bridge
– South Bridge
x86 bus organization

[Figure: the CPU connects to the North Bridge, which connects to the South Bridge]
North bridge
• Essentially a DMA controller– Lets disk controller access main memory w/o
any intervention of the CPU• Connects CPU to
– Main memory– Optional graphics card– South Bridge
South Bridge
• Connects North bridge to a wide variety of I/O busses
Communicating with I/O devices
• Two solutions– Memory-mapped I/O– Special I/O instructions
Memory mapped I/O
• A portion of the address space is reserved for I/O operations
– Writes to any of these addresses are interpreted as I/O commands
– Reading from these addresses gives access to
• Error bit
• I/O completion bit
• Data being read
Memory mapped I/O
• User processes cannot access these addresses
– Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion
Dedicated I/O instructions
• Privileged instructions that cannot be executed by user processes
– Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion
Polling
• Simplest way for an I/O device to communicate with the CPU
• CPU periodically checks the status of pending I/O operations
– High CPU overhead
I/O completion interrupts
• Notify the CPU that an I/O operation has completed
• Allow the CPU to do something else while waiting for the completion of an I/O operation
– Multiprogramming
• I/O completion interrupts are processed by the CPU between instructions
– No internal instruction state to save
Interrupts levels
• See previous chapter
Direct memory access
• DMA
• Lets the disk controller access main memory w/o any intervention of the CPU
DMA and virtual memory
• A single DMA transfer may cross page boundaries with
– One page being in main memory
– One missing page
Solutions
• Make DMA work with virtual addresses
– The issue is then dealt with by the virtual memory subsystem
• Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries
An Example
[Figure: a transfer spanning several pages is broken into one DMA transfer per page]
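The second solution — breaking a transfer into a chain of transfers that stay within page boundaries — can be sketched as follows (a hedged illustration; 4 KB pages are assumed and the function name is mine):

```python
PAGE_SIZE = 4096  # assumed page size

def split_dma(addr, length, page_size=PAGE_SIZE):
    """Split a transfer into chunks that never cross a page boundary."""
    chunks = []
    while length > 0:
        # Bytes remaining before the next page boundary
        room = page_size - (addr % page_size)
        n = min(room, length)
        chunks.append((addr, n))
        addr += n
        length -= n
    return chunks

# A 10000-byte transfer starting 100 bytes into a page:
print(split_dma(4196, 10000))
# [(4196, 3996), (8192, 4096), (12288, 1908)]
```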
DMA and cache hierarchy
• Three approaches for handling temporary inconsistencies between caches and main memory
Solutions
1. Run all DMA accesses through the cache
– Bad solution
2. Have the OS selectively
– Invalidate affected cache entries when performing a read
– Force an immediate flush of dirty cache entries when performing a write
3. Have specific hardware do the same
Benchmarking I/O
Benchmarks
• Specific benchmarks for
– Transaction processing
• Emphasis on speed and graceful recovery from failures
– Atomic transactions:
• All-or-nothing behavior
An important observation
• Very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput
– Unless we sequentially access very large ranges of data
• 512 KB and more
Major fallacies
• Since rated MTTFs of disk drives exceed one million hours, a disk can last more than 100 years
– The MTTF expresses the failure rate during the disk's actual lifetime
• Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature
– They are up to ten times higher
Major fallacies
• Neglecting to do end-to-end checks– …
• Using magnetic tapes to back up disks
– Tape formats can quickly become obsolescent
– Disk bit densities have grown much faster than tape data densities
Can you read these?

[Slide: pictures of obsolete storage media that can no longer be read on an old PC — but you can still read this printed text]