STORAGE AND I/O
Jehan-François Pâris, [email protected]
TRANSCRIPT
Chapter Organization
• Availability and Reliability
• Technology review
– Solid-state storage devices
– I/O Operations
– Reliable Arrays of Inexpensive Disks
DEPENDABILITY
Reliability and Availability
• Reliability
– Probability R(t) that the system will be up at time t if it was up at time t = 0
• Availability
– Fraction of time the system is up
• Reliability and availability do not measure the same thing!
Which matters?
• It depends:
– Reliability for real-time systems
• Flight control, process control, …
– Availability for many other applications
• DSL service, file server, web server, …
MTTF, MTTR and MTBF
• MTTF is mean time to failure
• MTTR is mean time to repair
• 1/MTTF is failure rate
• MTBF, the mean time between failures, is
MTBF = MTTF + MTTR
Reliability
• As a first approximation
R(t) = exp(–t/MTTF)
– Not true if failure rate varies over time
Availability
• Measured by
(MTTF)/(MTTF + MTTR) = MTTF/MTBF
– MTTR is very important
• A good MTTR requires that we detect the failure quickly
The nine notation
• Availability is often expressed in "nines"
– 99 percent is two nines
– 99.9 percent is three nines
– …
• Formula is –log10(1 – A)
• Example: –log10(1 – 0.999) = –log10(10⁻³) = 3
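The nines formula can be sketched in a few lines of Python (an illustrative check; the function name `nines` is mine):

```python
import math

def nines(availability):
    """Number of nines of availability: -log10(1 - A)."""
    return -math.log10(1 - availability)

print(nines(0.99))   # two nines
print(nines(0.999))  # three nines
```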
Example
• A server crashes on the average once a month
• When this happens, it takes 12 hours to reboot it
• What is the server availability?
Solution
• MTBF = 30 days
• MTTR = 12 hours = ½ day
• MTTF = 29 ½ days
• Availability is 29.5/30 = 98.3 %
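As a quick check of the arithmetic, here is the availability formula in Python (a hedged sketch; the helper name is mine):

```python
def availability(mttf, mttr):
    """Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

# Server example: MTBF = 30 days, MTTR = 0.5 day, so MTTF = 29.5 days
print(f"{availability(29.5, 0.5):.1%}")  # 98.3%
```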
Keep in mind
• A 99 percent availability is not as great as we might think
– One hour down every 100 hours
– Fifteen minutes down every 24 hours
Example
• A disk drive has an MTTF of 20 years.
• What is the probability that the data it contains will not be lost over a period of five years?
Example
• A disk farm contains 100 disks whose MTTF is 20 years.
• What is the probability that no data will be lost over a period of five years?
Solution
• The aggregate failure rate of the disk farm is 100 × 1/20 = 5 failures/year
• The mean time to failure of the farm is 1/5 year
• We apply the formula
R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10⁻¹¹
– Almost zero chance!
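Both examples follow from R(t) = exp(–t/MTTF); a small Python sketch (function name mine, constant failure rate assumed):

```python
import math

def reliability(t, mttf):
    """R(t) = exp(-t/MTTF), valid only for a constant failure rate."""
    return math.exp(-t / mttf)

# One disk with a 20-year MTTF, over five years:
print(reliability(5, 20))    # about 0.78

# Farm of 100 such disks: aggregate MTTF is 20/100 = 0.2 year
print(reliability(5, 0.2))   # about 1.4e-11
```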
TECHNOLOGY OVERVIEW
Disk drives
• See previous chapter
• Recall that the disk access time is the sum of
– The disk seek time (to get to the right track)
– The disk rotational latency
– The actual transfer time
Flash drives
• Widely used in flash drives, most MP3 players and some small portable computers
• Similar technology to EEPROM
• Two technologies
What about flash?
• Widely used in flash drives, most MP3 players and some small portable computers
• Several important limitations
– Limited write bandwidth
• Must erase a whole block of data before overwriting it
– Limited endurance
• 10,000 to 100,000 write cycles
Storage Class Memories
• Solid-state storage
– Non-volatile
– Much faster than conventional disks
• Numerous proposals:
– Ferro-electric RAM (FRAM)
– Magneto-resistive RAM (MRAM)
– Phase-Change Memories (PCM)
Phase-Change Memories

[Figure: a phase-change memory device — no moving parts, a crossbar organization, a data cell]
Phase-Change Memories
• Cells contain a chalcogenide material that has two states
– Amorphous, with high electrical resistivity
– Crystalline, with low electrical resistivity
• Quickly cooling the material from above its fusion point leaves it in the amorphous state
• Slowly cooling the material from above its crystallization point leaves it in the crystalline state
Projections
• Target date: 2012
• Access time: 100 ns
• Data rate: 200–1000 MB/s
• Write endurance: 10⁹ write cycles
• Read endurance: no upper limit
• Capacity: 16 GB
• Capacity growth: > 40% per year
• MTTF: 10–50 million hours
• Cost: < $2/GB
Interesting Issues (I)
• Disks will remain much cheaper than SCM for some time
• Could use SCMs as intermediary level between main memory and disks
[Figure: storage hierarchy — main memory, then SCM, then disk]
A last comment
• The technology is still experimental
• Not sure when it will come to the market
• Might even never come to the market
Interesting Issues (II)
• Rather narrow gap between SCM access times and main memory access times
• Main memory and SCM will interact
– As the L3 cache interacts with the main memory
– Not as the main memory now interacts with the disk
RAID Arrays
Today’s Motivation
• We use RAID today for
– Increasing disk throughput by allowing parallel access
– Eliminating the need to make disk backups
• Disks are too big to be backed up in an efficient fashion
RAID LEVEL 0
• No replication
• Advantages:
– Simple to implement
– No overhead
• Disadvantage:
– If the array has n disks, its failure rate is n times the failure rate of a single disk
[Figure: RAID levels 0 and 1 — RAID level 0 stripes blocks across the disks; RAID level 1 mirrors each disk]
RAID LEVEL 1
• Mirroring:
– Two copies of each disk block
• Advantages:
– Simple to implement
– Fault-tolerant
• Disadvantage:
– Requires twice the disk capacity of normal file systems
RAID LEVEL 2
• Instead of duplicating the data blocks we use an error correction code
• Very bad idea because disk drives either work correctly or do not work at all
– Only possible errors are omission errors
– We need an omission correction code
• A parity bit is enough to correct a single omission
[Figure: RAID levels 2 and 3 — RAID level 2 uses several check disks; RAID level 3 uses a single parity disk]
RAID LEVEL 3
• Requires N+1 disk drives
– N drives contain data (1/N of each data block)
• Block b[k] is now partitioned into N fragments b[k,1], b[k,2], ..., b[k,N]
– Parity drive contains the exclusive or of these N fragments
p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]
How does parity work?
• Truth table for XOR (same as parity)

A B A⊕B
0 0  0
0 1  1
1 0  1
1 1  0
Recovering from a disk failure
• A small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate the failure of either D0 or D1

D0 D1 P
0  0  0
0  1  1
1  0  1
1  1  0

D1⊕P=D0  D0⊕P=D1
0        0
0        1
1        0
1        1
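The recovery rule above — XOR the surviving disks to rebuild the failed one — works on whole blocks too; a minimal Python sketch with made-up two-byte blocks:

```python
def parity(blocks):
    """XOR all the given blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

d0 = b"\x0f\xaa"
d1 = b"\xf0\x55"
p = parity([d0, d1])          # contents of the parity disk

# If D0 fails, XORing the surviving disks reconstructs it:
recovered = parity([d1, p])
print(recovered == d0)        # True
```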
How RAID level 3 works (I)
• Assume we have N + 1 disks
• Each block is partitioned into N equal chunks

[Figure: a block split into four chunks (N = 4 in the example)]
How RAID level 3 works (II)
• XOR the data chunks to compute the parity chunk
• Each chunk is written into a separate disk

[Figure: the four data chunks and the parity chunk, each on its own disk]
How RAID level 3 works (III)
• Each read/write involves all disks in the RAID array
– Cannot do two or more reads/writes in parallel
– Performance of the array is not better than that of a single disk
RAID LEVEL 4 (I)
• Requires N+1 disk drives
– N drives contain data
• Individual blocks, not chunks
– Blocks with the same disk address form a stripe

[Figure: a stripe of blocks at the same address across the disks, plus its parity block]
RAID LEVEL 4 (II)
• Parity drive contains the exclusive or of the N blocks in the stripe
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Parity block now reflects contents of several blocks!
• Can now do parallel reads/writes
[Figure: RAID levels 4 and 5 — RAID level 4's dedicated parity disk is a bottleneck; RAID level 5 spreads the parity blocks over all the disks]
RAID LEVEL 5
• The single parity drive of RAID level 4 is involved in every write
– Will limit parallelism
• RAID level 5 distributes the parity blocks among the N+1 drives
– Much better
The small write problem
• Specific to RAID 5
• Happens when we want to update a single block
– The block belongs to a stripe
– How can we compute the new value of the parity block?

[Figure: a stripe containing blocks b[k], b[k+1], b[k+2], … and their parity block p[k]]
First solution
• Read the values of the N-1 other blocks in the stripe
• Recompute
p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1]
• Solution requires
– N-1 reads
– 2 writes (new block and new parity block)
Second solution
• Assume we want to update block b[m]
• Read the old values of b[m] and parity block p[k]
• Compute
new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k]
• Solution requires
– 2 reads (old values of block and parity block)
– 2 writes (new block and new parity block)
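A quick check that the two solutions yield the same parity block (illustrative Python with made-up one-byte blocks):

```python
def xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe with three data blocks and their parity block
b0, b1, b2 = b"\x11", b"\x22", b"\x44"
old_p = xor(xor(b0, b1), b2)

# Second solution: update b1 without reading the whole stripe
new_b1 = b"\x99"
new_p = xor(xor(new_b1, b1), old_p)

# First solution: recompute parity from all the data blocks
print(new_p == xor(xor(b0, new_b1), b2))  # True
```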
RAID level 6 (I)
• Not part of the original proposal
– Two check disks
– Tolerates two disk failures
– More complex updates
RAID level 6 (II)
• Has become more popular as disks become
– Bigger
– More vulnerable to irrecoverable read errors
• Most frequent cause of RAID level 5 array failures is
– An irrecoverable read error occurring while the contents of a failed disk are reconstituted
RAID level 6 (III)
• Typical array size is 12 disks
• Space overhead is 2/12 = 16.7 %
• Sole real issue is the cost of small writes
– Three reads and three writes:
• Read the old value of the block being updated, old parity block P, old parity block Q
• Write the new value of the block being updated, new parity block P, new parity block Q
CONCLUSION (II)
• The low cost of disk drives made RAID level 1 attractive for small installations
• Otherwise pick
– RAID level 5 for higher parallelism
– RAID level 6 for higher protection
• Can tolerate one disk failure plus irrecoverable read errors
A review question
• Consider an array consisting of four 750 GB disks
• What is the storage capacity of the array if we organize it
– As a RAID level 0 array?
– As a RAID level 1 array?
– As a RAID level 5 array?
The answers
• Consider an array consisting of four 750 GB disks
• What is the storage capacity of the array if we organize it
– As a RAID level 0 array? 3 TB
– As a RAID level 1 array? 1.5 TB
– As a RAID level 5 array? 2.25 TB
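The answers follow from the space overhead of each RAID level; a hedged Python sketch (the function is mine, capacities in GB):

```python
def raid_capacity(n, disk_gb, level):
    """Usable capacity of an n-disk array for a few RAID levels."""
    if level == 0:
        return n * disk_gb            # no redundancy
    if level == 1:
        return n * disk_gb / 2        # every block mirrored
    if level == 5:
        return (n - 1) * disk_gb      # one disk's worth of parity
    if level == 6:
        return (n - 2) * disk_gb      # two disks' worth of check data
    raise ValueError(level)

for level in (0, 1, 5):
    print(level, raid_capacity(4, 750, level) / 1000, "TB")
```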
CONNECTING I/O DEVICES
Busses
• Connecting computer subsystems with each other was traditionally done through busses
• A bus is a shared communication link connecting multiple devices
• Busses transmit several bits at a time
– Parallel buses
Examples
• Processor-memory busses
– Connect the CPU with memory modules
– Short and high-speed
• I/O busses
– Longer
– Wide range of data bandwidths
– Connect to memory through the processor-memory bus or a backplane bus
Standards
• FireWire
– For external use
– 63 devices per channel
– 4 signal lines
– 400 Mb/s or 800 Mb/s
– Up to 4.5 m
Standards
• USB 2.0
– For external use
– 127 devices per channel
– 2 signal lines
– 1.5 Mb/s (Low Speed), 12 Mb/s (Full Speed) and 480 Mb/s (Hi-Speed)
– Up to 5 m
Standards
• USB 3.0
– For external use
– Adds a 5 Gb/s transfer rate (SuperSpeed)
– Maximum distance is still 5 m
Standards
• PCI Express
– For internal use
– 1 device per channel
– 2 signal lines per "lane"
– Multiples of 250 MB/s: 1x, 2x, 4x, 8x, 16x and 32x
– Up to 0.5 m
Standards
• Serial ATA
– For internal use
– Connects cheap disks to the computer
– 1 device per channel
– 4 data lines
– 300 MB/s
– Up to 1 m
Standards
• Serial Attached SCSI (SAS)
– For external use
– 4 devices per channel
– 4 data lines
– 300 MB/s
– Up to 8 m
Synchronous busses
• Include a clock in the control lines
• Bus protocols expressed in actions to be taken at each clock pulse
• Have very simple protocols
• Disadvantages
– All bus devices must run at the same clock rate
– Due to clock skew issues, cannot be both fast and long
Asynchronous busses
• Have no clock
• Can accommodate a wide variety of devices
• Have no clock skew issues
• Require a handshaking protocol before any transmission
– Implemented with extra control lines
Advantages of busses
• Cheap
– One bus can link many devices
• Flexible
– Can add devices
Disadvantages of busses
• The shared link
– Can become a bottleneck
• Hard to run many parallel lines at high clock speeds
New trend
• Away from parallel shared buses
• Towards serial point-to-point switched interconnections
– Serial
• One bit at a time
– Point-to-point
• Each line links a specific device to another specific device
x86 bus organization
• Processor connects to peripherals through two chips (bridges)
– North Bridge
– South Bridge
x86 bus organization

[Figure: the CPU connects to the North Bridge, which connects to the South Bridge]
North bridge
• Essentially a DMA controller– Lets disk controller access main memory w/o
any intervention of the CPU• Connects CPU to
– Main memory– Optional graphics card– South Bridge
South Bridge
• Connects North bridge to a wide variety of I/O busses
Communicating with I/O devices
• Two solutions– Memory-mapped I/O– Special I/O instructions
Memory mapped I/O
• A portion of the address space is reserved for I/O operations
– Writes to any of these addresses are interpreted as I/O commands
– Reading from these addresses gives access to
• Error bit
• I/O completion bit
• Data being read
Memory mapped I/O
• User processes cannot access these addresses
– Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion
Dedicated I/O instructions
• Privileged instructions that cannot be executed by user processes
– Only the kernel can
• Prevents user processes from accessing the disk in an uncontrolled fashion
Polling
• Simplest way for an I/O device to communicate with the CPU
• CPU periodically checks the status of pending I/O operations
– High CPU overhead
I/O completion interrupts
• Notify the CPU that an I/O operation has completed
• Allow the CPU to do something else while waiting for the completion of an I/O operation
– Multiprogramming
• I/O completion interrupts are processed by the CPU between instructions
– No internal instruction state to save
Interrupts levels
• See previous chapter
Direct memory access
• DMA
• Lets the disk controller access main memory w/o any intervention of the CPU
DMA and virtual memory
• A single DMA transfer may cross page boundaries with
– One page being in main memory
– One missing page
Solutions
• Make DMA work with virtual addresses
– The issue is then dealt with by the virtual memory subsystem
• Break DMA transfers crossing page boundaries into chains of transfers that do not cross page boundaries
An Example
[Figure: a transfer spanning several pages is broken into one DMA transfer per page]
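The second solution — breaking a transfer into a chain of transfers that stay within page boundaries — can be sketched as follows (a hedged illustration; 4 KB pages are assumed and the function name is mine):

```python
PAGE_SIZE = 4096  # assumed page size

def split_dma(addr, length, page_size=PAGE_SIZE):
    """Split a transfer into chunks that never cross a page boundary."""
    chunks = []
    while length > 0:
        # Bytes remaining before the next page boundary
        room = page_size - (addr % page_size)
        n = min(room, length)
        chunks.append((addr, n))
        addr += n
        length -= n
    return chunks

# A 10000-byte transfer starting 100 bytes into a page:
print(split_dma(4196, 10000))
# [(4196, 3996), (8192, 4096), (12288, 1908)]
```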
DMA and cache hierarchy
• Three approaches for handling temporary inconsistencies between caches and main memory
Solutions
1. Run all DMA accesses through the cache
– Bad solution
2. Have the OS selectively
– Invalidate affected cache entries when performing a read
– Force an immediate flush of dirty cache entries when performing a write
3. Have specific hardware do the same
Benchmarking I/O
Benchmarks
• Specific benchmarks for
– Transaction processing
• Emphasis on speed and graceful recovery from failures
– Atomic transactions:
• All-or-nothing behavior
An important observation
• Very difficult to operate a disk subsystem at a reasonable fraction of its maximum throughput
– Unless we sequentially access very large ranges of data
• 512 KB and more
Major fallacies
• Since rated MTTFs of disk drives exceed one million hours, a disk can last more than 100 years
– The MTTF expresses the failure rate during the disk's actual lifetime
• Disk failure rates in the field match the MTTFs mentioned in the manufacturers' literature
– They are up to ten times higher
Major fallacies
• Neglecting to do end-to-end checks– …
• Using magnetic tapes to back up disks
– Tape formats can quickly become obsolescent
– Disk bit densities have grown much faster than tape data densities
Can you read these?

[Slide: pictures of obsolete storage media that can no longer be read on an old PC — but you can still read this printed text]