
Page 1: FFS, LFS, and RAID

FFS, LFS, and RAID

Andy Wang
COP 5611

Advanced Operating Systems

Page 2: FFS, LFS, and RAID

UNIX Fast File System

Designed to improve performance of UNIX file I/O

Two major areas of performance improvement:
Bigger block sizes
Better on-disk layout for files

Page 3: FFS, LFS, and RAID

Block Size Improvement

A 4x block size quadrupled the amount of data retrieved per disk fetch

But larger blocks could lead to fragmentation problems
So fragments were introduced

Small files stored in fragments
Fragments are addressable
But not independently fetchable
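A minimal sketch of the fragment idea, with sizes (4 KB blocks, 1 KB fragments) assumed only for illustration: a small file is charged whole fragments rather than a whole block.

```c
/* Illustrative only: block and fragment sizes are assumptions, not FFS's
 * actual on-disk parameters. A small file occupies fragments, not a block. */
#include <stdio.h>

#define BLOCK_SIZE    4096
#define FRAGMENT_SIZE 1024

int main(void) {
    long file_size  = 1800;                                 /* a small file */
    long fragments  = (file_size + FRAGMENT_SIZE - 1) / FRAGMENT_SIZE;
    long frag_bytes = fragments * FRAGMENT_SIZE;

    printf("%ld bytes -> %ld fragments (%ld bytes) instead of one %d-byte block\n",
           file_size, fragments, frag_bytes, BLOCK_SIZE);   /* 2 fragments */
    return 0;
}
```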

Page 4: FFS, LFS, and RAID

Disk Layout Improvements

Aimed toward avoiding disk seeks
Bad if finding related files takes many seeks
Very bad if finding all the blocks of a single file requires seeks
Spatial locality: keep related things close together on disk

Page 5: FFS, LFS, and RAID

Cylinder Groups

A cylinder group: a set of consecutive disk cylinders in the FFS

Files in the same directory stored in the same cylinder group

Within a cylinder group, tries to keep things contiguous

But must not let a cylinder group fill up

Page 6: FFS, LFS, and RAID

Locations for New Directories

Put new directory in relatively empty cylinder group

What is “empty”?
Many free i_nodes
Few directories already there
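As a rough sketch of this placement policy (the struct, fields, and weighting are assumptions for illustration, not actual FFS code):

```c
/* A minimal sketch: put a new directory in a relatively empty cylinder
 * group, i.e. one with many free i_nodes and few directories. */
#include <stddef.h>

struct cg_summary {
    int free_inodes;   /* free i_nodes in this cylinder group */
    int ndirs;         /* directories already in this group   */
};

/* Return the index of the "emptiest" cylinder group. */
int pick_cg_for_new_dir(const struct cg_summary *cg, size_t ncg) {
    size_t best = 0;
    long best_score = -1;
    for (size_t i = 0; i < ncg; i++) {
        /* Favor many free i_nodes, penalize existing directories.
         * The weight 16 is an arbitrary illustrative choice. */
        long score = (long)cg[i].free_inodes - 16L * cg[i].ndirs;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return (int)best;
}
```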

Page 7: FFS, LFS, and RAID

The Importance of Free Space

FFS must not run too close to capacity
No room for new files
Layout policies ineffective when too few free blocks
Typically, FFS needs 10% of the total blocks free to perform well

Page 8: FFS, LFS, and RAID

Performance of FFS

4x to 15x the bandwidth of old UNIX file system

Depending on size of disk blocks

Performance of the original file system was limited by CPU speed
Due to memory-to-memory buffer copies

Page 9: FFS, LFS, and RAID

FFS Not the Ultimate Solution

Based on technology of the early 80s
And file usage patterns of those times
In modern systems, FFS achieves only ~5% of raw disk bandwidth

Page 10: FFS, LFS, and RAID

The Log-Structured File System

Large caches can catch almost all reads
But most writes have to go to disk
So FS performance can be limited by writes
So, produce a FS that writes quickly
Like an append-only log

Page 11: FFS, LFS, and RAID

Basic LFS Architecture

Buffer writes, send them sequentially to disk
Data blocks
Attributes
Directories
And almost everything else

Converts small sync writes to large async writes
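A minimal sketch of this buffering idea, with segment and block sizes assumed for illustration (not the Sprite LFS implementation):

```c
/* Buffer blocks in memory and write them to disk as one large,
 * sequential, segment-sized transfer at the head of the log. */
#include <stdio.h>
#include <string.h>

#define SEG_SIZE (512 * 1024)   /* assumed segment size */
#define BLK_SIZE 4096           /* assumed block size   */

static char segment_buf[SEG_SIZE];
static size_t seg_used;

/* Stand-in for one large sequential write at the current head of the log. */
static void write_segment_to_log(const void *buf, size_t len) {
    (void)buf;
    printf("flush %zu bytes to the log head as one sequential write\n", len);
}

/* Append one block to the current segment; flush when the segment fills. */
void lfs_append_block(const void *block) {
    if (seg_used + BLK_SIZE > SEG_SIZE) {
        write_segment_to_log(segment_buf, seg_used);  /* large async write */
        seg_used = 0;
    }
    memcpy(segment_buf + seg_used, block, BLK_SIZE);
    seg_used += BLK_SIZE;
}
```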

Page 12: FFS, LFS, and RAID

A Simple Log Disk Structure

(Diagram: the log, written left to right)

File A Block 7 | File Z Block 1 | File M Block 202 | File A Block 3 | File F Block 1 | File A Block 7 | File L Block 26 | File L Block 25 | ← Head of Log

Page 13: FFS, LFS, and RAID

Key Issues in Log-Based Architecture

1. Retrieving information from the log

No matter how well you cache, sooner or later you have to read

2. Managing free space on the disk

You need contiguous space to write - in the long run, how do you get more?

Page 14: FFS, LFS, and RAID

Finding Data in the Log

Give me block 25 of file L
Or, give me block 1 of file F

(Diagram: the same log as on the previous slide; without an index, finding these blocks would require scanning it)

Page 15: FFS, LFS, and RAID

Retrieving Information From the Log

Must avoid sequential scans of disk to read files

Solution: store index structures in the log
The index is essentially the most recent version of the i_node

Page 16: FFS, LFS, and RAID

Finding Data in the Log

How do you find all blocks of file Foo?

(Diagram: the log contains Foo Block 1, Foo Block 2, Foo Block 3, and a stale copy of Foo Block 1)

Page 17: FFS, LFS, and RAID

Finding Data in the Log with an I_node

(Diagram: the same log, with an i_node whose pointers locate the current copies of Foo Blocks 1, 2, and 3 and skip the stale copy of Block 1)

Page 18: FFS, LFS, and RAID

How Do You Find a File’s I_node?

You could search sequentially
LFS optimizes by writing i_node maps to the log
The i_node map points to the most recent version of each i_node
A file system's i_nodes cover multiple blocks of i_node map

Page 19: FFS, LFS, and RAID

How Do You Find the Inode?

The Inode Map

Page 20: FFS, LFS, and RAID

How Do You Find Inode Maps?

Use a fixed region on disk that always points to the most recent i_node map blocks

But cache i_node maps in main memory
Small enough that few disk accesses are required to find i_node maps
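A minimal sketch of the resulting lookup chain, with invented structures and helpers (not Sprite LFS's actual on-disk layout): fixed checkpoint region, then i_node map, then i_node, then data block.

```c
#include <stdint.h>

struct inode {                 /* simplified on-log i_node */
    uint64_t block_addr[12];   /* pointers to data blocks in the log */
};

struct imap {                  /* cached i_node map */
    uint64_t inode_addr[1024]; /* log address of the latest copy of each i_node */
};

struct checkpoint {            /* fixed region on disk */
    struct imap *current_imap; /* points at the most recent i_node map */
};

/* Assumed helpers that read a given log address from disk (or cache). */
struct inode *read_inode_at(uint64_t log_addr);
void         *read_block_at(uint64_t log_addr);

void *lfs_read_block(struct checkpoint *cp, int inum, int blkno) {
    uint64_t iaddr = cp->current_imap->inode_addr[inum]; /* find the i_node  */
    struct inode *ip = read_inode_at(iaddr);             /* read it from log */
    return read_block_at(ip->block_addr[blkno]);         /* then the block   */
}
```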

Page 21: FFS, LFS, and RAID

Finding I_node Maps

(Diagram: new i_node map blocks at the head of the log, with an old i_node map block left earlier in the log)

Page 22: FFS, LFS, and RAID

Reclaiming Space in the Log

Eventually, the log reaches the end of the disk partition

So LFS must reuse disk space, such as blocks holding superseded data

Space can be reclaimed in background or when needed

Goal is to maintain large free extents on disk

Page 23: FFS, LFS, and RAID

Example of Need for Reuse

Head of log

New data to be logged

Page 24: FFS, LFS, and RAID

Major Alternatives for Reusing Log

Threading
+ Fast

- Fragmentation

- Slower reads

Head of log

New data to be logged

Page 25: FFS, LFS, and RAID

Major Alternatives for Reusing Log

Copying
+ Simple

+ Avoids fragmentation

- Expensive

New data to be logged

Page 26: FFS, LFS, and RAID

LFS Space Reclamation Strategy

Combination of copying and threading
Copy to free large fixed-size segments
Thread free segments together
Try to collect long-lived data permanently into segments

Page 27: FFS, LFS, and RAID

A Threaded, Segmented Log

Head of log

Page 28: FFS, LFS, and RAID

Cleaning a Segment

1. Read several segments into memory

2. Identify the live blocks

3. Write live data back (hopefully) into a smaller number of segments
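A minimal sketch of these three steps; the types, batch size, and helper functions are assumptions for illustration rather than the real cleaner's interface:

```c
#define CLEAN_BATCH 16                /* clean a few tens of segments at once */

struct block   { int file_id; int blkno; };            /* identity of a block */
struct segment { int nblocks; struct block *blocks; }; /* plus summary info   */

/* Assumed helpers */
int  read_segments(struct segment *segs, int max);
int  is_live(const struct block *b);    /* crosscheck with the owning i_node */
void append_to_clean_segment(const struct block *b);
void mark_segment_free(struct segment *s);

void clean_some_segments(void) {
    static struct segment segs[CLEAN_BATCH];
    int n = read_segments(segs, CLEAN_BATCH);          /* 1. read into memory */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < segs[i].nblocks; j++)
            if (is_live(&segs[i].blocks[j]))           /* 2. identify live    */
                append_to_clean_segment(&segs[i].blocks[j]); /* 3. write back */
        mark_segment_free(&segs[i]);   /* source segment is now free to reuse */
    }
}
```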

Page 29: FFS, LFS, and RAID

Identifying Live Blocks

Hard to track down live blocks of all files
Instead, each segment maintains a segment summary block
Identifying what is in each block

Crosscheck blocks with owning i_node’s block pointers

Written at end of log write, for low overhead

Page 30: FFS, LFS, and RAID

Segment Cleaning Policies

What are some important questions?
When do you clean segments?
How many segments to clean?
Which segments to clean?
How to group blocks in their new segments?

Page 31: FFS, LFS, and RAID

When to Clean

Periodically
Continuously
During off-hours
When disk is nearly full
On demand
LFS uses a threshold system

Page 32: FFS, LFS, and RAID

How Many Segments to Clean

The more segments cleaned at once, the better the reorganization of the disk
But the higher the cost of cleaning

LFS cleans a few tens of segments at a time
Until the disk drops below a threshold value

Empirically, LFS not very sensitive to this factor

Page 33: FFS, LFS, and RAID

Which Segments to Clean?

Cleaning segments with lots of dead data gives great benefit

Some segments are hot, some segments are cold

But “cold” free space is more valuable than “hot” free space

Since cold blocks tend to stay cold

Page 34: FFS, LFS, and RAID

Cost-Benefit Analysis

u = utilization (fraction of the segment still live)
A = age of the data in the segment
Benefit to cost = (1 - u) * A / (1 + u)
Cleaning costs 1 + u (read the whole segment, rewrite the live fraction u); the benefit is the freed space 1 - u, weighted by how long it is likely to stay free
Clean cold segments when they have some free space, hot segments only when they have a lot of free space
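A small worked example of this policy (the utilizations and ages are made up): a cold segment with modest free space can score higher than a hot segment with more free space.

```c
/* Cost of cleaning a segment: read it (1) plus rewrite its live fraction (u).
 * Benefit: the space freed (1 - u), weighted by the age of its data. */
#include <stdio.h>

static double benefit_to_cost(double u, double age) {
    return (1.0 - u) * age / (1.0 + u);
}

int main(void) {
    /* A cold segment with only some free space ... */
    printf("cold, u=0.80, age=1000: %.1f\n", benefit_to_cost(0.80, 1000));
    /* ... can still beat a hot segment with much more free space. */
    printf("hot,  u=0.50, age=100:  %.1f\n", benefit_to_cost(0.50, 100));
    return 0;
}
```

Here the cold segment scores about 111 versus about 33 for the hot one, matching the rule above.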

Page 35: FFS, LFS, and RAID

What to Put Where?

Given a set of live blocks and some cleaned segments, which block goes where?
Order blocks by age
Write them to segments, oldest first

Goal is very cold, highly utilized segments

Page 36: FFS, LFS, and RAID

Goal of LFS Cleaning

(Diagram: two plots of number of segments versus segment fullness, from empty to 100% full; the goal is a bimodal distribution in which most segments are either nearly empty or nearly full)

Page 37: FFS, LFS, and RAID

Performance of LFS

On modified Andrew benchmark, 20% faster than FFS

LFS can create and delete 8 times as many files per second as FFS

LFS can read 1 ½ times as many small files

LFS slower than FFS at sequential reads of randomly written files

Page 38: FFS, LFS, and RAID

Logical Locality vs. Temporal Locality

Logical locality (spatial locality): Normal file systems keep a file’s data blocks close together

Temporal locality: LFS keeps data written at the same time close together

When temporal locality = logical locality
The systems perform the same

Page 39: FFS, LFS, and RAID

Major Innovations of LFS

Abstraction: everything is a log
Temporal locality
Use of caching to shape disk access patterns
Cache most reads
Optimized writes

Separating full and empty segments

Page 40: FFS, LFS, and RAID

Where Did LFS Look For Performance Improvements?

Minimized disk access
Only write when segments filled up

Increased size of data transfers
Write whole segments at a time

Improving locality
Assuming temporal locality, a file's blocks are all adjacent on disk
And temporally related files are nearby

Page 41: FFS, LFS, and RAID

Parallel Disk Access and RAID

One disk can only deliver data at its maximum rate

So to get more data faster, get it from multiple disks simultaneously

Saving on rotational latency and seek time

Page 42: FFS, LFS, and RAID

Utilizing Disk Access Parallelism

Some parallelism available just from having several disks

But not much
Instead of satisfying each access from one disk, use multiple disks for each access

Store part of each data block on several disks

Page 43: FFS, LFS, and RAID

Disk Parallelism Example

(Diagram: open/read/write requests pass through the file system and are served by several disks in parallel)

Page 44: FFS, LFS, and RAID

Data Striping

Transparently distributing data over multiple disks

Benefits:
Increases disk parallelism
Faster response for big requests

Major parameters:
Number of disks
Size of the data interleaf (the striping unit)
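A minimal sketch of the striping arithmetic, with the disk count and interleaf size assumed for illustration:

```c
/* Map a logical block number onto a disk and an offset within that disk. */
#include <stdio.h>

#define NUM_DISKS   4    /* number of disks in the array (assumed)     */
#define STRIPE_UNIT 1    /* blocks per disk per stripe (the interleaf) */

static void map_block(long lbn, int *disk, long *disk_block) {
    long stripe = lbn / (NUM_DISKS * STRIPE_UNIT);
    long within = lbn % (NUM_DISKS * STRIPE_UNIT);
    *disk       = (int)(within / STRIPE_UNIT);
    *disk_block = stripe * STRIPE_UNIT + within % STRIPE_UNIT;
}

int main(void) {
    for (long lbn = 0; lbn < 8; lbn++) {
        int d; long b;
        map_block(lbn, &d, &b);
        printf("logical block %ld -> disk %d, block %ld\n", lbn, d, b);
    }
    return 0;
}
```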

Page 45: FFS, LFS, and RAID

Fine vs. Coarse Grained Data Interleaving

Fine grained data interleaving
+ High data rate for all requests
But only one request per disk array
Lots of time spent positioning

Coarse grained data interleaving
+ Large requests access many disks
+ Many small requests handled at once
Small I/O requests access few disks

Page 46: FFS, LFS, and RAID

Reliability of Disk Arrays

Without disk arrays, failure of one disk among N loses 1/Nth of the data

With disk arrays (fine grained across all N disks), failure of one disk loses all data

An array of N disks is roughly 1/Nth as reliable as one disk
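As a rough, hedged back-of-the-envelope (assuming independent, identically distributed disk failures), the array's mean time to failure shrinks with the number of disks:

```latex
% Under the (assumed) independence approximation:
\mathrm{MTTF}_{\mathrm{array}} \approx \frac{\mathrm{MTTF}_{\mathrm{disk}}}{N}
\qquad \text{e.g., } \frac{100{,}000~\text{hours}}{10~\text{disks}} = 10{,}000~\text{hours}
```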

Page 47: FFS, LFS, and RAID

Adding Reliability to Disk Arrays

Buy more reliable disks
Build redundancy into the disk array

Multiple levels of disk array redundancy possible

Most organizations can prevent any data loss from single disk failure

Page 48: FFS, LFS, and RAID

Basic Reliability Mechanisms

Duplicate data
Parity for error detection
Error-correcting codes for detection and correction

Page 49: FFS, LFS, and RAID

Parity Methods

Can use parity to detect multiple errors
But typically used to detect a single error

If hardware errors are self-identifying, parity can also correct errors

When data is written, parity must be written, too
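A minimal sketch of block parity with XOR (disk count and block size are illustrative assumptions): writing data means updating parity, and a single self-identified lost block can be rebuilt from the survivors.

```c
#include <string.h>

#define NDATA 4        /* data disks (assumed) */
#define BLK   4096     /* block size (assumed) */

/* parity = d[0] ^ d[1] ^ ... ^ d[NDATA-1], byte by byte */
void compute_parity(unsigned char data[NDATA][BLK], unsigned char parity[BLK]) {
    memset(parity, 0, BLK);
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLK; i++)
            parity[i] ^= data[d][i];
}

/* Rebuild one lost data block from the surviving blocks plus parity. */
void rebuild_block(unsigned char data[NDATA][BLK], unsigned char parity[BLK],
                   int lost, unsigned char out[BLK]) {
    memcpy(out, parity, BLK);
    for (int d = 0; d < NDATA; d++)
        if (d != lost)
            for (int i = 0; i < BLK; i++)
                out[i] ^= data[d][i];
}
```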

Page 50: FFS, LFS, and RAID

Error-Correcting Code

Based on Hamming codes, mostly
Not only detects an error, but identifies which bit is wrong

Page 51: FFS, LFS, and RAID

RAID Architectures

Redundant Arrays of Independent Disks
Basic architectures for organizing disks into arrays
Assuming independent control of each disk
A standard classification scheme divides architectures into levels

Page 52: FFS, LFS, and RAID

Non-Redundant Disk Arrays (RAID Level 0)

No redundancy at all
So, what we just talked about: plain striping
Any disk failure causes data loss

Page 53: FFS, LFS, and RAID

Non-Redundant Disk Array Diagram (RAID Level 0)

(Diagram: requests pass through the file system; data is striped across the disks with no redundancy)

Page 54: FFS, LFS, and RAID

Mirrored Disks (RAID Level 1)

Each disk has a second disk that mirrors its contents
Writes go to both disks
No data striping

+ Reliability is doubled

+ Read access faster

- Write access slower

- Expensive and inefficient

Page 55: FFS, LFS, and RAID

Mirrored Disk Diagram (RAID Level 1)

(Diagram: each write passes through the file system to both disks of a mirrored pair)

Page 56: FFS, LFS, and RAID

Memory-Style ECC (RAID Level 2)

Some disks in the array are used to hold ECC
E.g., 4 data disks require 3 ECC disks

+ More efficient than mirroring

+ Can correct, not just detect, errors

- Still fairly inefficient

Page 57: FFS, LFS, and RAID

Memory-Style ECC Diagram (RAID Level 2)

(Diagram: requests pass through the file system to data disks plus dedicated ECC disks)

Page 58: FFS, LFS, and RAID

Bit-Interleaved Parity (RAID Level 3)

Each disk stores one bit of each data block

One disk in array stores parity for other disks

+ More efficient than Levels 1 and 2

- Parity disk doesn’t add bandwidth

Page 59: FFS, LFS, and RAID

Bit-Interleaved RAID Diagram (Level 3)

(Diagram: data bit-interleaved across the data disks, with one dedicated parity disk)

Page 60: FFS, LFS, and RAID

Block-Interleaved Parity (RAID Level 4)

Like bit-interleaved, but data is interleaved in blocks of arbitrary size
The size is called the striping unit
Small read requests use 1 disk

+ More efficient data access than level 3

+ Satisfies many small requests at once

- Parity disk can be a bottleneck

- Small writes require 4 I/Os
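A minimal sketch of the small-write penalty (the I/O helpers are assumed, not a real driver API): read the old data and old parity, compute the new parity, then write the new data and new parity, for 4 I/Os total.

```c
#define BLK 4096

/* Assumed block I/O helpers for illustration. */
void read_block (int disk, long blkno, unsigned char buf[BLK]);
void write_block(int disk, long blkno, const unsigned char buf[BLK]);

void small_write(int data_disk, int parity_disk, long blkno,
                 const unsigned char new_data[BLK]) {
    unsigned char old_data[BLK], parity[BLK];

    read_block(data_disk,   blkno, old_data);   /* I/O 1: old data   */
    read_block(parity_disk, blkno, parity);     /* I/O 2: old parity */

    /* new parity = old parity XOR old data XOR new data */
    for (int i = 0; i < BLK; i++)
        parity[i] ^= old_data[i] ^ new_data[i];

    write_block(data_disk,   blkno, new_data);  /* I/O 3: new data   */
    write_block(parity_disk, blkno, parity);    /* I/O 4: new parity */
}
```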

Page 61: FFS, LFS, and RAID

Block-Interleaved Parity Diagram (RAID Level 4)

(Diagram: data block-interleaved across the data disks, with one dedicated parity disk)

Page 62: FFS, LFS, and RAID

Block-Interleaved Distributed-Parity (RAID Level 5)

Spread the parity out over all disks

+ No parity disk bottleneck

+ All disks contribute read bandwidth

– Requires 4 I/Os for small writes
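A minimal sketch of one way to rotate parity across the array (real RAID 5 layouts vary; this modulo placement is an illustrative assumption):

```c
#include <stdio.h>

#define NDISKS 5   /* disks in the array (assumed) */

/* Rotate the parity disk per stripe so no single disk becomes a bottleneck. */
int parity_disk_for_stripe(long stripe) {
    return (int)(stripe % NDISKS);
}

int main(void) {
    for (long s = 0; s < 5; s++)
        printf("stripe %ld: parity on disk %d\n", s, parity_disk_for_stripe(s));
    return 0;
}
```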

Page 63: FFS, LFS, and RAID

Block-Interleaved Distributed-Parity Diagram (RAID Level 5)

(Diagram: data and parity blocks distributed across all of the disks)

Page 64: FFS, LFS, and RAID

Other RAID Configurations

RAID 6
Can survive two disk failures

RAID 10 (RAID 1+0)
Data striped across mirrored pairs

RAID 01 (RAID 0+1)
Mirroring two RAID 0 arrays

RAID 15, RAID 51

Page 65: FFS, LFS, and RAID

Where Did RAID Look For Performance Improvements?

Parallel use of disks

Improve overall delivered bandwidth by getting data from multiple disks

Biggest problem is small write performance

But we know how to deal with small writes . . .

Page 66: FFS, LFS, and RAID

Bonus

Given N disks in RAID 1/10/01/15/51, what is the expected number of disk failures before data loss? (1/2 critique)

Given 1-TB disks and probability p for a bit to fail silently, what is the probability of irrecoverable data loss for RAID 1/5/6/10/01/15/51 after a single disk failure? (1/2 critique)