
File system consistency

File system consistency and recovery

File systems can become inconsistent due to power loss, system crashes, etc.

• Data loss: user data in the files is lost (the file system itself can still be used, though).

• File system metadata error: the metadata is not in a consistent state, e.g., the free block bitmap does not agree with the actual block usage. This is a more severe problem (because the file system may not be usable).

How can we recover from a potentially inconsistent state after a crash or power loss?

• Most file systems only recover from file system metadata errors, i.e., they bring the file system metadata back to a consistent state.

Why can file systems become inconsistent?

The major reason for the crash-consistency problem is that a file system update takes multiple steps. For example, in the Unix file system, creating a file needs at least the following steps:

• Create a new inode and initialize it.
• Mark the inode bitmap to reflect the inode allocation.
• Add an entry to the directory block for the new file (pointing to the new inode).

If the system crashes before all steps are finished, then the file system is only partially updated on disk, causing problems.

Also, because of caching, the operations can reach the disk in any order.

An update consistency example

Assume we have a simple Unix file system as in the following figure, and we want to expand the only file inside:

[Figure: a mini file system consisting of an inode bitmap, a data bitmap, inodes, and data blocks. Before the update, the file’s inode has size=1 and pointers (4, null, null, null). After expanding the file, the inode has size=2 and pointers (4, 5, null, null), and the data bitmap marks block 5 as allocated.]

As can be seen, we need to write at least three blocks to disk:
• the block bitmap (block B)
• the inode itself (block I)
• the new data block (block D)


Let’s imagine that only a single write operation succeeds. We have three cases:

• Just the block D is written to disk. The data is on disk, but no inode points to it and no bitmap entry indicates that D is allocated. This is not a problem at all from the perspective of file system consistency.
• Just the updated inode I is written to disk. In this case, the inode points to block 5. But block 5 has not yet been written, and thus the inode points to garbage. Also, the bitmap tells us that block 5 has not been allocated, but the inode says it has. This is an inconsistency problem.
• Just the updated bitmap B is written back. In this case, the bitmap indicates block 5 is allocated, but no inode points to it. This is an inconsistency problem. It is called a space leak, as block 5 would never be used by the file system.


Now assume that two writes succeed. We again have three cases:

• Blocks I and B are written to disk. In this case, the file system metadata is completely consistent: the inode points to block 5, and the bitmap indicates block 5 is in use. The problem is that block 5 has garbage in it again.
• Blocks I and D are written to disk. In this case, the inode points to the correct data, but the inode record disagrees with the bitmap record. This is an inconsistency again.
• Blocks B and D are written to disk. In this case, the inode and bitmap are not consistent. Even though the block was written and the bitmap indicates its usage, we have no idea which file it belongs to. This is also a space leak.
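The six partial-update cases above can be enumerated mechanically. The following is a minimal sketch, assuming a toy model in which each of the three writes either fully reached the disk or did not; the function names `classify` and `all_crash_states` are hypothetical and exist only for this illustration.

```python
from itertools import combinations

# Toy model of the example: appending block 5 to a file requires three
# writes: B (data bitmap), I (inode), D (data block).
WRITES = ("B", "I", "D")

def classify(done):
    """Describe the on-disk state when only the writes in `done` reached disk."""
    bitmap_marks_5 = "B" in done   # bitmap says block 5 is allocated
    inode_points_5 = "I" in done   # inode points to block 5
    data_written = "D" in done     # block 5 holds the new data

    problems = []
    if inode_points_5 != bitmap_marks_5:
        if bitmap_marks_5:
            problems.append("space leak")
        else:
            problems.append("inode/bitmap inconsistency")
    if inode_points_5 and not data_written:
        problems.append("inode points to garbage")
    return problems or ["consistent"]

def all_crash_states():
    """Map every non-empty, incomplete subset of writes to its classification."""
    return {
        subset: classify(set(subset))
        for k in (1, 2)
        for subset in combinations(WRITES, k)
    }

if __name__ == "__main__":
    for subset, result in all_crash_states().items():
        print("+".join(subset), "->", ", ".join(result))
```

Note that "consistent" here only refers to the metadata: in the D-only case the new data is on disk but unreachable.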

How to solve the crash consistency problem?

As indicated by the example, the root cause of the crash consistency problem is that file system update operations are not atomic.

It is, however, hard to make all operations atomic, because the disk only commits one write at a time, and a crash or power loss may occur between any two steps.

Solution 1: the file system checker

Here is one solution to the file system consistency problem:
• Ignore the problem completely while updating the file system.
• Fix it later (e.g., at the next boot) if there is a problem.

Many tools are available to perform checks and fixes on file systems:
• The fsck tool for Unix/Linux.
• The chkdsk tool on Windows.
• Each of these tools must be designed specifically for a particular file system.
• They cannot fix every problem, e.g., how do you fix garbage data blocks? The real goal is to make sure the file system metadata is internally consistent.

fsck details

Here is an outline of what fsck does:
• Superblock: perform sanity checks on the superblock, e.g., detect problems such as a file system size that is smaller than the number of blocks allocated.
• Free blocks: fsck scans the inodes and indirect pointer blocks to determine which blocks are currently allocated. Then it compares this result with the free block bitmap. If there is any inconsistency, it is resolved by trusting the information from the inodes (i.e., resetting the bitmap).
• Inode state: each inode is checked for corruption. For example, each inode should have valid fields. If an inode is considered suspect (e.g., it has a corrupted type field), it is cleared.

fsck details

• Inode links: the link count of each inode is also verified. fsck scans the entire directory tree and records the number of links to each inode. If the result differs from the inode’s record, then usually the inode’s count is fixed. If an allocated inode is discovered but no directory refers to it, it is moved to the lost+found directory.
• Duplicates: fsck also checks for duplicate pointers, i.e., cases where two different inodes refer to the same block. If one inode is obviously bad, it may be cleared. Alternatively, the pointed-to block could be copied, giving each inode its own copy as desired.

fsck details

• Bad blocks: if a pointer points to something outside its valid range, e.g., it has an address that refers to a block beyond the partition size, then it may be cleared. fsck cannot do anything too intelligent here, since it does not know what is supposed to be there.
• Directory checks: make sure that “.” and “..” are the first entries, that each inode referred to in a directory entry is allocated, that no directory is linked to more than once in the entire hierarchy, etc.
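Three of the checks above (free blocks, inode links, duplicates) can be sketched over in-memory tables. This is an illustration under simplifying assumptions, not how any real fsck is implemented: inodes are modeled as a plain dictionary from inode number to block list, directories as name-to-inode dictionaries, and all function names are hypothetical.

```python
from collections import Counter

def check_free_blocks(inodes, bitmap):
    """Rebuild the block bitmap from the inodes, which fsck trusts.

    `inodes` maps inode number -> list of block numbers it points to.
    `bitmap` is a list of booleans (True = allocated). Returns the fixed
    bitmap and the list of block numbers whose bit had to change.
    """
    in_use = {b for blocks in inodes.values() for b in blocks}
    fixed = [i in in_use for i in range(len(bitmap))]
    changed = [i for i, (old, new) in enumerate(zip(bitmap, fixed)) if old != new]
    return fixed, changed

def check_link_counts(link_counts, directories):
    """Compare each inode's recorded link count with a directory scan.

    `link_counts` maps inode number -> recorded link count.
    `directories` is a list of dicts mapping name -> inode number.
    Returns (corrected counts, inodes to move to lost+found).
    """
    seen = Counter(ino for d in directories for ino in d.values())
    corrected = {ino: seen[ino] for ino in link_counts if seen[ino] > 0}
    orphans = sorted(ino for ino in link_counts if seen[ino] == 0)
    return corrected, orphans

def find_duplicates(inodes):
    """Return blocks that are referenced by more than one inode."""
    owners = {}
    for ino, blocks in inodes.items():
        for b in blocks:
            owners.setdefault(b, []).append(ino)
    return {b: inos for b, inos in owners.items() if len(inos) > 1}
```

Every function has to visit all inodes or all directory entries, which is exactly why a full check scales with the size of the file system rather than with the amount of damage.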

Problem with fsck

The biggest problem of fsck: it is too slow.
• The file system is unusable before fsck finishes.
• It takes many minutes to several hours, depending on the disk size and I/O bandwidth.
• Most work is redundant: scan the entire disk just to see if one of the three writes didn’t manage to complete?!

Solution 2: ordered updates

If we are careful about the order of updates, we are in a much better situation. Consider our simple example again: we need to write back the blocks B, I, and D. Suppose we impose an order on the updates:

1. Update block B. If a crash occurs after this step, we have a space leak, since block D is marked as allocated but no one uses it.
2. Update block I. If a crash occurs after this step, the file system metadata is completely consistent; there is just garbage in block D.
3. Update block D.

In general, if we impose a certain order on the updates, then we will at most get the space leak problem (because there will never be an allocated inode or data block that is not marked in the on-disk bitmap).

• The file system can be brought online immediately after a crash, while a background fsck runs to garbage collect leaked blocks.
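The effect of the write order can be checked by simulating a crash after every prefix of the sequence. The sketch below assumes the same toy model as the example (B marks block 5 in the bitmap, I makes the inode point to block 5, D writes the data) and deliberately ignores garbage data, since only metadata consistency is at stake; `crash_states` and `worst_case` are hypothetical names.

```python
def crash_states(order):
    """Yield the set of completed writes after each possible crash point."""
    for k in range(len(order) + 1):
        yield set(order[:k])

def worst_case(order):
    """Return 'inconsistent', 'space leak', or 'ok' for a given write order.

    B = bitmap marks block 5, I = inode points to block 5, D = data written.
    Garbage data in block 5 is not counted as a metadata problem here.
    """
    worst = "ok"
    for done in crash_states(order):
        if "I" in done and "B" not in done:
            return "inconsistent"      # inode uses a block the bitmap calls free
        if "B" in done and "I" not in done:
            worst = "space leak"       # block marked used, but nobody owns it
    return worst

if __name__ == "__main__":
    print(worst_case(["B", "I", "D"]))
    print(worst_case(["I", "B", "D"]))
```

Any order that writes B before I can leak at worst, while any order that writes I before B has a crash point where the inode references a block the bitmap considers free.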

Synchronous writes

Older systems often perform synchronous writes, that is, they force an update order, e.g.:

• Write the new inode to disk before the directory entry.
• Remove the directory name before deallocating the inode.
• Write the cleared inode to disk before updating the cylinder group free bitmap.

The major drawbacks:

• This implies that the cache is write-through: when a change is made in the cache, it is immediately written to the disk.
• The CPU has to wait for a disk operation to finish before issuing the next one. The net effect: file operations proceed at disk speed, and performance drops significantly.

Delayed write back

In general, we want to maintain an order of updates:
• Never write a pointer before initializing the structure it points to.
• Never reuse a resource before nullifying all pointers to it.
• Never clear the last pointer to a live resource before setting a new one.

But we also want to keep the blocks in the buffer as long as possible:
• If we create a file A, use it for a short time, and then delete it, no disk traffic is needed (everything happens in the buffer; compare this to the synchronous writes approach).
• Multiple updates to the same block can be merged. E.g., if several files are added to the same directory, only one write-back is needed when buffered, compared to several disk writes in the synchronous writes approach.

Delayed write back with order constraints

How can we use delayed writes while still preserving order?
• Establish dependence orders between blocks.
• Write back any block as long as it has no pending dependence.

Example: say you create file A:

• Block X contains the inode.
• Block Y contains a directory block.
• Create file A in inode block X and directory block Y.
• Y→X means Y depends on X (X must be written before Y).

There are some problems with this approach:
• Block aging: some blocks always have a dependency and will never get written back.
• Cyclic dependencies may occur (next slide).
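Block-level dependencies form a directed graph, and a valid write-back order is a topological order of that graph. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); the block names X and Y follow the example above, while `write_back_order` and `has_cycle` are hypothetical helpers.

```python
from graphlib import TopologicalSorter, CycleError

def write_back_order(depends_on):
    """Return an order in which dirty blocks can be flushed.

    `depends_on` maps a block to the set of blocks that must reach disk
    before it. Raises CycleError when no valid order exists.
    """
    return list(TopologicalSorter(depends_on).static_order())

def has_cycle(depends_on):
    """Report whether the dependencies make write-back impossible."""
    try:
        write_back_order(depends_on)
        return False
    except CycleError:
        return True

# Creating file A: directory block Y depends on inode block X.
create_only = {"Y": {"X"}}

# If another operation on the same two blocks needs Y on disk before X,
# each block waits for the other and neither can be written.
both_directions = {"Y": {"X"}, "X": {"Y"}}

if __name__ == "__main__":
    print(write_back_order(create_only))
    print(has_cycle(both_directions))
```

Tracking dependencies per block is what makes the cycle possible: two unrelated operations that happen to share the same pair of blocks can impose opposite orders.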

Cyclic dependency

Suppose we want to create file A and unlink file B:
• Assume both files are in the same directory block and inode block.
• To create file A: create the inode on disk, then update the directory block.
• To unlink file B: update the directory block first, then update the inode on disk.

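[Figure: an inode block and a directory block in three states. Original state: the inode block holds inodes 3 and 5; the directory block holds entries B→5 and C→3. After creating file A: inode 6 is added to the inode block and entry A→6 to the directory block, so the directory block depends on the inode block. After removing file B: entry B→5 is removed and inode 5 is freed, so the inode block also depends on the directory block.]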

Such a cyclic dependency prevents both operations from proceeding.

Soft updates

How to solve these problems? Use soft updates!
• Keep track of dependencies at a much finer level. E.g., for a block containing 64 inodes, the system can maintain up to 64 dependency structures, one for each inode in the buffer.
• Each dependency structure maintains an old value and a new value.
• Write blocks in any order.
• When writing a block, temporarily roll back any changes you cannot yet commit to disk (as if those changes had not occurred yet).
• Lock the rolled-back version so applications don’t see it.
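The roll-back idea can be sketched with per-field change records. This is a simplified model, not the BSD implementation: blocks are dictionaries, each `Change` keeps the old and new value of one field plus at most one prerequisite change, and the roll-back is applied to a private copy of the block, which sidesteps the locking a real kernel needs. The names `Change`, `Cache`, `update`, and `write_block` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Change:
    """One tracked update: a field of a block going from `old` to `new`."""
    block: str
    key: Any
    old: Any                           # None means "entry absent"
    new: Any
    needs: Optional["Change"] = None   # change that must be on disk first
    on_disk: bool = False

@dataclass
class Cache:
    memory: dict                       # block name -> {key: value}, latest state
    disk: dict                         # block name -> {key: value}, on-disk state
    pending: list = field(default_factory=list)

    def update(self, block, key, new, needs=None):
        """Apply a change in memory and remember how to undo it."""
        change = Change(block, key, self.memory[block].get(key), new, needs)
        self.memory[block][key] = new
        self.pending.append(change)
        return change

    def write_block(self, block):
        """Write one block, rolling back changes whose prerequisite is unmet."""
        image = dict(self.memory[block])           # private copy to be written
        for c in self.pending:
            if c.block != block or c.on_disk:
                continue
            if c.needs is not None and not c.needs.on_disk:
                image[c.key] = c.old               # temporarily undo
            else:
                c.on_disk = True
        self.disk[block] = {k: v for k, v in image.items() if v is not None}
```

With the create-A/unlink-B scenario, writing the directory block first puts only the removal of B on disk, writing the inode block then commits both inode changes, and a second write of the directory block finally adds the entry for A. The memory copy always shows the newest state.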

Soft updates illustration

[Figure: memory copy and disk copy of the inode block and directory block during write-back, in five panels: “after metadata updates”, “write directory block first” (undo file A’s change), “write inode block (no dependency now)”, “redo file A’s change”, and “write the directory block second time”. The steps are numbered 1 to 4, with “lock buffer” and “unlock buffer” marking where the rolled-back buffer is hidden from applications.]

Soft updates

Here are the operations in the BSD FFS that require soft updates:
• file creation
• file removal
• directory creation
• directory removal
• file/directory rename
• block allocation
• indirect block manipulation
• free map management

However, soft updates require intricate knowledge of each file system data structure and thus add a fair amount of complexity to the system. (Introduced in 4.4BSD, now available across the BSD lines.)

Solution 3: journaling

The most popular solution to the file system consistency problem is to use journaling:

• Tools like fsck are slow because they perform unnecessary work: examining all data for a few possible errors.
• If we knew where the error is, we could focus on just those parts, saving most of the time.
• Borrow a trick from database systems: keep notes (the journal) on what we are going to do, and use the notes to replay anything that is not done yet.
• Also known as write-ahead logging.
• First appeared in the Cedar file system; used in many modern file systems, e.g., ext3/ext4, reiserfs, IBM JFS, Windows NTFS, Apple HFS+, etc.

Journal and transaction

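• Journal is an allocated space in the file system for note taking, i.e., a log. The journal can also be placed on a separate device.
• Transaction is an individual “note” in the journal.

[Figure: disk layout with superblock, journal, bitmap, ... (rest of file system). The journal holds a sequence of transactions. Each transaction has a begin block (the transaction header), contents blocks, and an end block; the header records the transaction ID and, for each contents block (e.g., contents block 0), its disk address.]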

How does journaling work?

The idea: when updating the disk, before overwriting the structures in place (in the file system), first write down a transaction describing what you are about to do (hence, write-ahead logging).

• Keeping the journal guarantees that if a crash takes place during the update of the file system, you can go back, look at the note, and try again. You know exactly what to fix after a crash.
• But what if the system crashes while the journal is being updated?
  • No problem! Since a transaction is written before the real file system modification, the file system is still consistent. We just discard the transaction!
• A journaling file system does introduce some performance implications, as it increases disk writes.
  • Usually okay: transactions are written sequentially. Transactions can also be merged in memory.
  • An important research topic in journaling file systems.

Data journaling

Using our simple example, we wish to update the inode (I) block, the bitmap (B) block, and the data block (D). Before writing them to their final disk locations, we first write them to the journal. A transaction would look like:

begin | I | D | B | end

• Five blocks are written to the transaction.
• The begin block marks the starting point of a transaction. It also includes information about the pending update to the file system (e.g., the final disk addresses of I, D, B).
• The end block marks the end of a transaction.
• This is known as physical logging, as it puts the exact physical contents of the update in the journal.
• An alternative, logical logging, puts a logical representation of the update in the journal (e.g., “this update wishes to append data block D to file X, flip the last bit in block B”). It is much more difficult to implement, but saves space in the journal.

Checkpointing

To checkpoint the file system is to bring it up to date with the pending updates in the journal. Successfully checkpointing the file system means it is updated without error. Thus, in a journaling file system, the basic sequence of operations is:

1. Journal write: write the transaction (e.g., begin | I | D | B | end) to the log and wait for these writes to complete.
2. Checkpoint: write the update (e.g., blocks I, D, B) to the file system.

Note: the transactions do not have to be committed to the journal after every change in the file system.

• Frequent journal writes ensure data safety at the cost of a performance drop.
• Transactions can also be kept in memory for a longer time: better performance (e.g., transaction merging), but a higher risk of losing more data in case of a crash.

Journal commit

What happens when a crash occurs during the journal update?
• As said, we simply discard the transaction in that case.
• But that means we need to be able to tell whether a transaction is valid or not!

Writing the transaction in a single write call is unsafe! For example, if we issue the write of begin | I | D | B | end in a single call, the system may schedule the writes in any order, e.g.:

1. Write the begin, I, B, and end blocks first.
2. Then write the block D.

If the system crashes between steps 1 and 2, then we’ll have a corrupted transaction:

begin | I | ?? | B | end

Journal commit

Why is a transaction like begin | I | ?? | B | end in the journal a problem?

• It looks valid since it has valid begin and end blocks.
• We cannot tell that the block D in the transaction is wrong. It is just arbitrary user data.
• If the system reboots and runs recovery, it will copy the block ?? to the file system.

Then how do we correctly update the journal? In two steps:

1. Journal write: write the transaction (the begin block and the contents blocks) to the journal, and wait for the writes to complete.
2. Journal commit: write the transaction commit block to the journal, and wait for the write to complete. The transaction is now committed.

So it looks like:

begin | I | D | B | (write barrier) | commit

Journal commit

Another way to make the journal commit safe is to include a checksum in every transaction, computed over the contents of the transaction:

begin | I | D | B | end (plus a checksum over the contents)

• This enables the transaction to be written to the journal in a single I/O call without incurring a wait, improving performance.
• When the transaction is read, if the stored checksum does not match the checksum computed over the transaction, the transaction is corrupted.
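The checksum test can be sketched as follows, assuming CRC32 from Python's standard `zlib` module as the checksum function (the slides do not say which checksum a given file system uses). The transaction layout and the names `make_transaction` and `is_valid` are hypothetical.

```python
import zlib

def _payload(blocks):
    """Serialize (disk_address, data) pairs into one byte string."""
    return b"".join(addr.to_bytes(8, "big") + data for addr, data in blocks)

def make_transaction(txid, blocks):
    """Build a journal transaction with a CRC32 checksum over its contents.

    `blocks` is a list of (disk_address, data_bytes) pairs.
    """
    return {
        "txid": txid,
        "blocks": list(blocks),
        "checksum": zlib.crc32(_payload(blocks)),
    }

def is_valid(tx):
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(_payload(tx["blocks"])) == tx["checksum"]

if __name__ == "__main__":
    tx = make_transaction(1, [(10, b"I"), (5, b"D"), (2, b"B")])
    # Simulate a crash that left garbage where block D should be.
    torn = dict(tx, blocks=[(10, b"I"), (5, b"?"), (2, b"B")])
    print(is_valid(tx), is_valid(torn))
```

Because validity is decided from the contents alone, the journal writer no longer has to wait between the contents and the end of the transaction.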

Recovery in journaling file system

Recovery in a journaling file system is easy:
• If a crash happens before the transaction is committed to the journal, then just discard the transaction and skip the pending update. The file system is still consistent.
• If a crash happens after the transaction is committed to the journal but before the checkpoint step completes, then the file system can recover by replaying the transaction, that is, by executing the transaction according to the information recorded in its begin block. This is also called redo logging, or roll-forward.
• It is okay for a crash to happen at any point. We can always recover the system to a consistent state.
• The recovery time is proportional only to the size of the journal, not to the size of the disk (as in the case of fsck).
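Redo recovery amounts to one scan over the journal. The sketch below assumes a simplified record format of tuples (`begin` with the final block addresses, `data`, `commit`) and a dictionary standing in for the disk; neither the format nor the function name `recover` comes from a real file system.

```python
def recover(journal, disk):
    """Replay committed transactions from `journal` onto `disk`.

    `journal` is a list of records in log order:
      ("begin", txid, [addr, ...])  final addresses of the content blocks
      ("data", txid, bytes)         one content block, in address order
      ("commit", txid)              transaction is complete
    `disk` maps block address -> bytes and is updated in place.
    Returns the list of transaction ids that were replayed.
    """
    open_tx = {}      # txid -> (addresses, collected blocks)
    replayed = []
    for record in journal:
        kind, txid = record[0], record[1]
        if kind == "begin":
            open_tx[txid] = (record[2], [])
        elif kind == "data" and txid in open_tx:
            open_tx[txid][1].append(record[2])
        elif kind == "commit" and txid in open_tx:
            addrs, blocks = open_tx.pop(txid)
            if len(addrs) == len(blocks):
                for addr, data in zip(addrs, blocks):
                    disk[addr] = data          # redo: idempotent overwrite
                replayed.append(txid)
    # Anything left in open_tx never committed and is simply discarded.
    return replayed
```

Replaying is safe even if the checkpoint had partly or fully completed before the crash, because writing the same block contents again leaves the disk unchanged. The scan touches only the journal, which is why recovery time does not depend on the disk size.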

Some more details

• File systems like ext3 typically do not commit a transaction to the journal each time an update to the file system is made. Instead, for performance reasons, they wait for several file system updates and merge all of them into a single transaction. This avoids excessive write traffic to disk.
• The journal needs to be cleaned! The journal area on disk is not infinite, thus after each successful checkpoint step, we need to mark the transactions that have been checkpointed. This gives us the basic steps in the system:
  1. Journal write: write the contents of the transaction (the begin block and contents blocks) to the journal, and wait for the writes to complete.
  2. Journal commit: write the commit block to the journal, and wait for the write to complete.
  3. Checkpoint: write the contents of the transaction to the file system.
  4. Clean journal: mark the transaction free in the journal.

Metadata journaling

Including every data block (e.g., D) in the transaction is expensive (we have to write it twice). Metadata journaling only keeps the file system metadata in the journal. For example, a transaction would be:

begin | I | B | end

But when do we write out the data block D? We have two choices. The first is:

1. Commit the transaction to the journal.
2. Checkpoint.
3. Write D to disk.

There’s a potential problem here: if the system crashes before step 3, then the inode for the file will point to garbage data, since the block D is not in the journal and will not be recovered.

• This is called non-ordered mode.

Metadata journaling

The second choice is to write the data block D first:

1. Write D to disk.
2. Commit the transaction to the journal.
3. Checkpoint.

This guarantees that an inode pointer will never point to garbage. This is the same principle used in the ordered update approach: “write the pointed-to object before the object with the pointer to it.”

• This is called ordered mode.

Different journaling modes

So far, we’ve seen several different modes in a journaling file system. They can usually be configured by a user (e.g., in ext3):

• Data journaling
• Non-ordered metadata journaling
• Ordered metadata journaling

All are able to keep the file system metadata consistent.

Log-structured file system

Log-structured file system (LFS)

All the file systems we’ve seen so far use somewhat similar strategies to organize and update data on the disk: data blocks are indexed by some structure, and modifying a file updates those data blocks in place (meaning we find the block we want to change and change it).

The log-structured file system (LFS) uses an interesting disk layout and update strategy:

• Never overwrite files or directories in place on the disk. Instead, when you make changes, create a new copy of the data and put it somewhere else on the disk (also called shadow paging or copy-on-write).
• The entire file system is a big log (or journal)! Compare this to a journaling file system, which only uses the journal as an aid for recovery; its file data and metadata are still organized in the usual way.

The motivation behind LFS

The designers of LFS based their rationale on the following observations:

• Memory sizes were growing: more data could be cached in memory, so file system read performance would improve. Disk traffic would increasingly consist of writes, and file system performance would largely be determined by writes.
• The gap between random and sequential I/O performance is large and growing: disk bandwidth increases much faster than seek and rotational delays decrease. One has to use the disk in a sequential manner in order to get a large performance advantage.
• Existing file systems (at that time) performed poorly on many common workloads: for example, FFS writes multiple small blocks when creating a new file. Even though FFS would try to place all of these blocks within the same cylinder group, it would still incur many short seeks and subsequent rotational delays, and was thus unable to reach the peak sequential bandwidth.

Organization of LFS

• The entire file system is a big journal composed of segments, plus a header area called the checkpoint region that contains global file system parameters.

[Figure: the disk starts with the checkpoint region (CR), followed by a sequence of segments. Each segment contains a segment summary, data blocks, and inodes. The segment summary holds a checksum, a time, and, for each block in the segment, its file number and global block number.]

• The segment size is usually 512 KB to 1 MB, to take advantage of sequential disk I/O.

How to store file data?

• Everything is written to a segment, including the file data and metadata such as inodes and indirect pointer blocks.

[Figure: segments holding the inodes and data blocks of file1 and file2. After appending a block to file1, changing the middle block of file2, and creating file3, the new blocks of file1, file2, and file3, together with the imap, are written to the following segments.]

• When a file is updated, old contents are not overwritten; instead, new contents are written to a new location.

Where to find the inodes then?

• Since all the inodes are now scattered across segments, LFS uses an inode map (imap) to index the location of the newest copy of each inode (remember that new inodes are constantly written to new locations). Compare this with a page table.

[Figure: the checkpoint region (CR) and a segment containing imap blocks, including the root imap.]

• Each imap is a fixed-size block of inode location mapping info. The checkpoint region contains the current location of all the imap blocks.
• Most of these imap blocks will be cached in memory. (Compare this with a two-level page table!)
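The interplay between the append-only log and the imap can be sketched in a few lines. This is an in-memory toy, not the on-disk format of LFS: the log is a Python list, an inode is a dictionary of block pointers, segments and the checkpoint region are left out, and the class name `TinyLFS` is hypothetical.

```python
class TinyLFS:
    """Append-only log with an inode map (imap), in memory only."""

    def __init__(self):
        self.log = []      # the "disk": a list of blocks, never overwritten
        self.imap = {}     # inode number -> log address of its newest inode

    def _append(self, block):
        """Add a block at the end of the log and return its address."""
        self.log.append(block)
        return len(self.log) - 1

    def write(self, ino, offset, data):
        """Write one data block of file `ino` and a new copy of its inode."""
        old = self.log[self.imap[ino]] if ino in self.imap else {"ptrs": {}}
        addr = self._append(("data", ino, offset, data))
        inode = {"ptrs": {**old["ptrs"], offset: addr}}
        self.imap[ino] = self._append(inode)   # imap now points to the new inode
        return addr

    def read(self, ino, offset):
        """Follow imap -> newest inode -> data block."""
        inode = self.log[self.imap[ino]]
        return self.log[inode["ptrs"][offset]][3]

if __name__ == "__main__":
    fs = TinyLFS()
    fs.write(1, 0, b"old")
    fs.write(1, 0, b"new")
    print(fs.read(1, 0), len(fs.log))
```

After the second write, the old data block and the old inode are still in the log but are no longer reachable through the imap, which is the dead data that cleaning has to reclaim.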

Garbage collection?

• LFS keeps writing newer versions of a file. One option is to keep the older versions around (a versioning file system).
• LFS instead only keeps the latest live version of a file.

[Figure: before cleaning, several segments contain dead data. Cleaning relocates their live blocks; after cleaning, the emptied segments are free.]

• LFS runs garbage collection to reclaim the old data.
• Read in a number of segments, and write out a new set of segments with just the live blocks within them.

How to determine if a block is alive?

Use the segment summary block:
• The segment summary block includes metadata for all the blocks within a segment.
• For each block in a segment, the inode number of the file it belongs to is recorded in the segment summary block; the file offset is also recorded.
• For a block D, use the segment summary block info to locate its inode I through the imap, look up D’s offset T in I, and check whether the address recorded there is equal to D’s address. (Yes: D is live. No: D is garbage.)
• LFS optimizes this search by recording a version number in the imap. The version number is increased when a file is deleted or truncated to length 0. The version number is also kept in the segment summary block for each block. Compare these two numbers to determine whether a block is alive or not.
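The liveness test, including the version-number shortcut, can be sketched as a single function. The table layouts below (dictionaries for the segment summary, the imap, and the inodes) and the name `is_live` are assumptions made for this illustration only.

```python
def is_live(block_addr, summary, imap, inodes):
    """Decide whether the block at `block_addr` is still referenced.

    `summary` maps block address -> (inode number, file offset, version),
    as recorded in the segment summary block.
    `imap` maps inode number -> (inode address, current version).
    `inodes` maps inode address -> {file offset: block address}.
    """
    ino, offset, version = summary[block_addr]
    if ino not in imap:
        return False                       # file no longer exists
    inode_addr, current_version = imap[ino]
    if version != current_version:
        return False                       # file was deleted or truncated
    # Versions match: the inode must still point to this exact address.
    return inodes[inode_addr].get(offset) == block_addr
```

The version comparison lets the cleaner discard blocks of deleted or truncated files without reading their inodes at all; only blocks whose version still matches need the full pointer check.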

Cleaning policy

When to perform file system cleaning in LFS?
• By a low-priority background process that runs continuously to perform cleaning.
• During idle time.
• Or when you have to (because the disk is full).

Which segments should be cleaned?
• A more challenging question, the subject of many research papers.
• The obvious choice is the most fragmented segments, but this is not the best choice.
• The original LFS used a policy to clean “cold” segments more often than “hot” segments. A cold segment is one whose data blocks rarely change. A hot segment is one whose contents are frequently being updated.

Characteristics of LFS

• Very good write performance.
• But depends on a large memory cache for read performance.
• Cleaning became the focus of much controversy in LFS, and concerns over cleaning costs perhaps limited LFS’s initial impact.
• Nevertheless, the intellectual legacy of LFS lives on in modern file systems such as ZFS (which also uses a copy-on-write approach).
