
Fault Isolation and Quick Recovery in Isolation File Systems

Lanyue Lu Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau

University of Wisconsin - Madison

1


File-System Availability Is Critical

Main data access interface
➡ desktop, laptop, mobile devices, file servers

A wide range of failures
➡ resource allocation, metadata corruption
➡ failed I/O operations, incorrect system states

A small fault can cause global failures
➡ e.g., a single corrupted bit can impact the whole file system

Global failures considered harmful
➡ read-only, crash

2

Server Virtualization

[Diagram: a hypervisor runs guest virtual machines VM1, VM2, and VM3; their disk images VMDK1, VMDK2, and VMDK3 all live on one shared file system.]

3


[Diagram: a fault in the shared file system, e.g., metadata corruption, makes it read-only or crashes it; all VMs (VM1, VM2, VM3) and their images (VMDK1, VMDK2, VMDK3) are affected.]

6


Our Solution

A new abstraction for fault isolation
➡ support multiple independent fault domains
➡ protect a group of files for a domain

Isolation file systems
➡ fine-grained fault isolation
➡ quick recovery

7

Introduction

Study of Failure Policies

Isolation File Systems

Challenges

8


Questions to Answer

What global failure policies are used?
➡ failure types
➡ number of each type

What are the root causes of global failures?
➡ related data structures
➡ number of each cause

9


Methodology

Three major file systems
➡ Ext3 (Linux 2.6.32), Ext4 (Linux 2.6.32)
➡ Btrfs (Linux 3.8)

Analyze source code
➡ identify types of global failures
➡ count related error-handling functions
➡ correlate global failures to data structures

10

Q1: What global failure policies are used?

11


Global Failure Policies

Definition
➡ a failure which impacts all users of the file system, or even the operating system

Read-Only
➡ e.g., ext3_error():
➡ mark the file system as read-only
➡ abort the journal

12


Read-Only Example

ext3/balloc.c, 2.6.32:

read_block_bitmap(...)
{
        /* locate the on-disk block bitmap of this block group */
        bitmap_blk = desc->bg_block_bitmap;
        bh = sb_getblk(sb, bitmap_blk);
        if (!bh) {
                /* global reaction: the whole file system is marked
                   read-only and the journal is aborted */
                ext3_error(sb, "Cannot read block bitmap");
                return NULL;
        }
}

13

Global Failure Policies

Definition
➡ a failure which impacts all users of the file system, or even the operating system

Read-Only
➡ e.g., ext3_error():
➡ mark the file system as read-only
➡ abort the journal

Crash
➡ e.g., BUG(), ASSERT(), panic()
➡ crash the file system or the operating system

14


Crash Example

btrfs/disk-io.c, 3.8:

open_ctree(...)
{
        /* read a tree root node from disk */
        root->node = read_tree_block(...);
        /* global reaction: BUG_ON() crashes the kernel if the read failed */
        BUG_ON(!root->node);
}

15

[Bar chart: number of global failure instances per file system, with each bar split into Read-Only and Crash. Totals: Ext3 193, Ext4 409, Btrfs 829.]

16

Read-only and crash are common in modern file systems

Over 67% of global failures will crash the system

17

Q2: What are the root causes of global failures?

18


Global Failure Causes

Metadata corruption
➡ metadata inconsistency is detected
➡ e.g., a block/inode bitmap corruption

19


Metadata Corruption Example

ext3/dir.c, 2.6.32:

ext3_check_dir_entry(...)
{
        /* validate the on-disk record length of a directory entry */
        rlen = ext3_rec_len_from_disk();
        if (rlen < EXT3_DIR_REC_LEN(1)) {
                error = "rec_len is too small";
                /* global reaction: the whole file system goes read-only */
                ext3_error(sb, error);
        }
}

20

Global Failure Causes

Metadata corruption
➡ metadata inconsistency is detected
➡ e.g., a block/inode bitmap corruption

I/O failure
➡ metadata I/O failure and journaling failure
➡ e.g., fail to read an inode block

21


I/O Failure Example

ext4/namei.c, 2.6.32:

empty_dir(...)
{
        /* try to read the first directory block */
        bh = ext4_bread(NULL, inode, &err);
        if (!bh && err)
                /* global reaction: the error handler fails the whole
                   file system (remount read-only or panic) */
                EXT4_ERROR_INODE(inode,
                        "fail to read directory block");
}

22

Global Failure Causes

Metadata corruption
➡ metadata inconsistency is detected
➡ e.g., a block/inode bitmap corruption

I/O failure
➡ metadata I/O failure and journaling failure
➡ e.g., fail to read an inode block

Software bugs
➡ unexpected states detected
➡ e.g., an allocated block is not in a valid range

23


Software Bug Example

ext3/balloc.c, 2.6.32:

ext3_rsv_window_add(...)
{
        /* walk the red-black tree of reservation windows */
        if (start < this->rsv_start)
                p = &(*p)->rb_left;
        else if (start > this->rsv_end)
                p = &(*p)->rb_right;
        else {
                /* overlapping window: an unexpected internal state */
                rsv_window_dump(root, 1);
                /* global reaction: BUG() crashes the kernel */
                BUG();
        }
}

24

Ext3

[Figure 1: Failure Types. This figure shows the failure types for each file system; the total number of global failure instances is on top of each bar (Ext3: 193, Ext4: 409, Btrfs: 829), with each bar split into Read-Only and Crash.]

Ext3 explicitly validates the integrity of metadata in many places, especially at the I/O boundary when reading from disk. For example, Ext3 validates a directory entry before traversing that directory, and checks that the inode bitmap is in a correct state before allocating a new inode. Unfortunately, as indicated by the Metadata Corruption column, if Ext3 detects a corruption in any of these structures, it causes a global failure. The I/O Failure column similarly shows that Ext3 causes global failures when an individual I/O request fails. Finally, the Software Bugs column shows that there are a significant number of internal assertions (such as BUG_ON), which are utilized to validate file system state at runtime; these also cause a global failure when invoked. We observe that nearly all global failures in Ext3 are due to problems with metadata and other file system internal state, and not user data.

For each data structure, we also check whether it is shared across different files. As shown in the last column of Table 1, most metadata structures are organized in a shared manner and thus can cause global failures. However, even for local structures, such as indirect blocks, a global failure can still occur.

Data Structure   MC   IOF    SB   Shared
b-bitmap          2     2         Yes
i-bitmap          1     1         Yes
inode             1     2     2   Yes
super             1               Yes
dir-entry         4     4     3   Yes
gdt               3           2   Yes
indir-blk         1     1         No
xattr             5     2     1   No
block                         5   Yes/No
journal                 3    27   Yes
journal head                 31   Yes
buf head                     16   Yes
handle                 22     9   Yes
transaction                  28   Yes
revoke                        2   Yes
other             1          11   Yes/No
Total            19    37   137   = 193

Table 1: Global Failure Causes of Ext3. This table shows the failure causes for Ext3, in terms of data structures, failure causes, and their related numbers. MC: Metadata Corruption; IOF: I/O Failures; SB: Software Bugs; Shared: whether this structure is shared by multiple files or directories.

2.3 Discussion

A file is the basic file system abstraction used to store the user's data in a logically isolated unit. Users can read from and write to a file. Another basic abstraction is a directory, which maps a file name to the file itself. Files and directories are usually organized as a directory tree.

A namespace holds a logical group of files or directories. To protect files in a shared environment, different applications are isolated within separated namespaces. Typical examples include chroot, BSD jail, Solaris Zones, and virtual machines. However, these abstractions do not provide any fault isolation within a file system. Files and directories only represent and isolate data logically for applications. Within a file system, different files and directories share metadata and system state; when faults are related to this shared metadata, global failure policies are triggered.

Therefore, file system abstractions lack a fine-grained fault isolation mechanism. Current file systems implicitly use a single fault domain; a fault in one file may cause a global reaction, thus affecting all clients of the file system.

3 New Abstraction: File Pod

To address the problem of inadequate fault isolation in file systems, we propose a new abstraction, called a file pod, for fine-grained fault isolation in file systems. A file pod is an abstract file system partition that contains a group of files and their related metadata. Each file pod is isolated as an independent fault domain, with its own failure policy. Any failure related to a file pod only affects itself, not the whole file system. For example, if metadata corruption is detected within a file pod and the failure policy is to remount as read-only, then a Swarm file system marks only that pod as read-only, without affecting other consistent file pods.

File pods allow applications to control the failure policy of their own files and related metadata, instead of letting the file system manage failures globally. Furthermore, this new abstraction supports flexible bindings between namespaces and fault domains; thus it can be used in a wide range of environments, such as server virtualization.

25



All global failures are caused by metadata and system states

Both local and shared metadata can cause global failures

26


Not Only Local File Systems

Shared-disk file system OCFS2
➡ inspired by the Ext3 design
➡ used in virtualization environments
➡ hosts virtual machine images
➡ allows multiple Linux guests to share a file system

Global failures are also prevalent
➡ a single piece of corrupted metadata can fail the whole file system on multiple nodes!

27


Current Abstractions

File and directory
➡ metadata is shared across different files and directories

Namespace
➡ virtual machines, chroot, BSD jail, Solaris Zones
➡ multiple namespaces still share a file system

Partitions
➡ multiple file systems on separate partitions
➡ a single panic on one partition can crash the whole operating system
➡ static partitions vs. dynamic partitions
➡ managing many partitions is cumbersome

28


All files on a file system implicitly share a single fault domain

Current file-system abstractions do not provide fine-grained fault isolation

29

Introduction

Study of Failure Policies

Isolation File Systems
➡ New Abstraction
➡ Fault Isolation
➡ Quick Recovery
➡ Preliminary Implementation on Ext3

Challenges

30


Isolation File Systems

Fine-grained partitioning
➡ files are isolated into separate domains

Independent
➡ faulty units do not affect healthy units

Fine-grained recovery
➡ repair a faulty unit quickly
➡ instead of checking the whole file system

Elastic
➡ a domain can dynamically grow and shrink in size

31


New Abstraction

File Pod
➡ an abstract partition
➡ contains a group of files and related metadata
➡ an independent fault domain

Operations (a usage sketch follows below)
➡ create a file pod
➡ set / get a file pod's attributes
    ➡ failure policy
    ➡ recovery policy
➡ bind / unbind a file to a pod
➡ share a file between pods

32
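The talk does not spell out a concrete API, so the following is only a minimal userspace sketch of how an application might drive these operations. All names (pod_create, pod_set_attr, pod_bind) and the policy enums are hypothetical stubs invented for illustration, not the authors' interface.

/* Hypothetical file-pod API sketch; every function here is an
 * illustrative stub, not an interface defined by the paper. */
#include <stdio.h>

enum pod_failure_policy  { POD_FAIL_READONLY, POD_FAIL_CRASH };
enum pod_recovery_policy { POD_RECOVER_ONLINE, POD_RECOVER_OFFLINE };

static int pod_create(const char *name) {
        printf("create pod '%s'\n", name);
        return 1;                              /* pretend pod id */
}

static void pod_set_attr(int pod, enum pod_failure_policy f,
                         enum pod_recovery_policy r) {
        printf("pod %d: failure policy %d, recovery policy %d\n", pod, f, r);
}

static void pod_bind(int pod, const char *path) {
        printf("bind '%s' to pod %d\n", path, pod);
}

int main(void) {
        /* one pod per virtual machine image, each an independent fault domain */
        int pod = pod_create("vm1");
        pod_set_attr(pod, POD_FAIL_READONLY, POD_RECOVER_ONLINE);
        pod_bind(pod, "/vmimages/vm1.vmdk");
        return 0;
}

With an interface of this shape, a hypervisor could place each VMDK in its own pod, so a metadata fault behind one image no longer turns the whole shared file system read-only.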

[Diagram: a directory tree rooted at /, containing directories d1, d2, d3, and d4.]

33

[Diagram: the same tree, with Pod1 and Pod2 grouping different subtrees into separate file pods.]

34

Introduction

Study of Failure Policies

Isolation File Systems
➡ New Abstraction
➡ Fault Isolation
➡ Quick Recovery
➡ Preliminary Implementation on Ext3

Challenges

35


Metadata Isolation

Observation
➡ metadata is organized in a shared manner
➡ hard to isolate a metadata failure

For example
➡ multiple inodes are stored in a single inode block
➡ an I/O failure can affect multiple files (see the sketch below)

[Diagram: an inode block holding many inodes; a single block read failure touches all of them.]

36
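To make the inode-block sharing concrete, the sketch below uses typical Ext3 parameters (4 KB blocks, 128-byte on-disk inodes, hence 32 inodes per block); the constants and inode numbers are assumptions for illustration, not values from the paper.

/* Sketch: which on-disk inode block does an inode live in?
 * The constants are common Ext3 defaults, assumed for illustration. */
#include <stdio.h>

#define BLOCK_SIZE        4096                       /* bytes per FS block */
#define INODE_SIZE        128                        /* bytes per on-disk inode */
#define INODES_PER_BLOCK  (BLOCK_SIZE / INODE_SIZE)  /* 32 */

static unsigned long inode_block(unsigned long ino) {
        return (ino - 1) / INODES_PER_BLOCK;         /* inode numbers start at 1 */
}

int main(void) {
        /* four different files whose inodes share one inode block:
         * a single failed read of that block affects all four */
        unsigned long inos[] = { 65, 70, 80, 96 };
        for (int i = 0; i < 4; i++)
                printf("inode %lu -> inode block %lu\n", inos[i], inode_block(inos[i]));
        return 0;
}

All four inodes map to the same inode block, so one unreadable block makes several otherwise unrelated files inaccessible; this is exactly the sharing that an isolation file system has to break up.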


Key Idea 1: Isolate metadata for file pods

37


Localize Failures

Local failures
➡ convert global failures to local failures
➡ same failure semantics
➡ only fail the faulty pod (a sketch follows below)

Read-Only
➡ mark only the faulty file pod as read-only

Crash
➡ crash a file pod instead of the whole system
➡ provide the same initial state after the crash

38
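To contrast this with the global ext3_error() behaviour shown earlier, here is a minimal sketch assuming a hypothetical per-pod error handler: the failure flag moves from the whole file system to the pod that owns the faulty metadata. struct file_pod, pod_error(), and fs_error() are illustrative only, not the authors' implementation.

/* Sketch: localize a failure to one pod instead of the whole file system.
 * All types and handlers here are hypothetical. */
#include <stdio.h>
#include <stdbool.h>

struct file_pod {
        int  id;
        bool read_only;                  /* per-pod failure state */
};

static bool fs_read_only = false;        /* global failure state (old behaviour) */

static void fs_error(const char *msg) {
        fprintf(stderr, "fs error: %s -> whole FS read-only\n", msg);
        fs_read_only = true;
}

static void pod_error(struct file_pod *pod, const char *msg) {
        fprintf(stderr, "pod %d error: %s -> only this pod read-only\n",
                pod->id, msg);
        pod->read_only = true;
}

int main(void) {
        struct file_pod pods[2] = { {1, false}, {2, false} };

        fs_error("cannot read block bitmap");            /* global reaction */
        pod_error(&pods[0], "cannot read block bitmap"); /* localized reaction */

        printf("fs read-only: %d, pod 1: %d, pod 2: %d\n",
               fs_read_only, pods[0].read_only, pods[1].read_only);
        return 0;
}

The failure semantics stay the same (the faulty metadata becomes read-only); only the blast radius shrinks from the whole file system to a single pod.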

[Diagram: the directory tree partitioned into Pod1 and Pod2.]

39

[Diagram: a fault, e.g., metadata corruption, occurs inside one pod; only that pod is failed, while the other pod stays available.]

40

Introduction

Study of Failure Policies

Isolation File Systems
➡ New Abstraction
➡ Fault Isolation
➡ Quick Recovery
➡ Preliminary Implementation on Ext3

Challenges

41


Quick Recovery

File system recovery is slow
➡ a small error requires a full check
➡ many random read requests
➡ 7 hours just to sequentially read a 2 TB disk

42


A small fault requires a full check (slow!)

43


Key Idea 2: Minimize the file-system checking range during recovery

44


Quick Recovery

Metadata isolation
➡ the file pod is the unit of recovery
➡ check and recover each pod independently
➡ both online and offline

When to recover?
➡ leverage the file system's internal detection mechanisms

How to recover more efficiently? (see the sketch below)
➡ only check the faulty pod
➡ narrow the check down to certain data structures

45
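The following sketch illustrates the pod-scoped checking idea under an assumed (made-up) mapping from block groups to pods: a fault in one pod triggers a scan of only that pod's groups rather than the entire disk. Nothing here reflects the authors' actual implementation.

/* Sketch: pod-scoped checking vs. whole-file-system checking.
 * The ownership table and group count are invented for illustration. */
#include <stdio.h>

#define NGROUPS 8

/* which pod owns each block group (hypothetical mapping) */
static const int group_owner[NGROUPS] = { 1, 1, 2, 2, 2, 3, 3, 3 };

static void check_group(int g) {
        printf("checking block group %d\n", g);  /* stand-in for real fsck work */
}

/* traditional recovery: a fault anywhere forces a scan of every group */
static void check_whole_fs(void) {
        for (int g = 0; g < NGROUPS; g++)
                check_group(g);
}

/* pod-scoped recovery: only the faulty pod's groups are scanned */
static void check_pod(int faulty_pod) {
        for (int g = 0; g < NGROUPS; g++)
                if (group_owner[g] == faulty_pod)
                        check_group(g);
}

int main(void) {
        puts("full check:");
        check_whole_fs();
        puts("pod-scoped check (pod 2 is faulty):");
        check_pod(2);
        return 0;
}

The work drops from all block groups to just the faulty pod's groups; narrowing further to specific data structures (e.g., only the suspect bitmaps) would shrink it again.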

Introduction

Study of Failure Policies

Isolation File Systems
➡ New Abstraction
➡ Fault Isolation
➡ Quick Recovery
➡ Preliminary Implementation on Ext3

Challenges

46


Ext3 Layout

A disk is divided into block groups
➡ physical partitioning for disk locality

Disk layout: a sequence of block groups, each laid out as
SB (super block) | GDTs (group descriptor tables) | BM (block bitmap) | IM (inode bitmap) | Inodes (inode table) | Blocks (data blocks)

47
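For reference, the sketch below shows the standard Ext3 mapping from an inode number to its block group; the inodes-per-group value is a typical mkfs default assumed here, and the inode numbers are made up. Files whose inodes land in the same group share that group's bitmaps and inode table, which is the sharing the next slide highlights.

/* Sketch: standard Ext3 inode-number -> block-group mapping.
 * INODES_PER_GROUP is a typical default, assumed for illustration. */
#include <stdio.h>

#define INODES_PER_GROUP 8192

static unsigned long inode_group(unsigned long ino) {
        return (ino - 1) / INODES_PER_GROUP;     /* inode numbers start at 1 */
}

int main(void) {
        /* four files (f1..f4) whose inodes fall into the same block group,
         * so they share its block bitmap, inode bitmap, and inode table */
        unsigned long inos[] = { 8200, 9001, 12000, 16000 };
        for (int i = 0; i < 4; i++)
                printf("inode %lu -> block group %lu\n", inos[i], inode_group(inos[i]));
        return 0;
}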


[Diagram: the disk as a long sequence of block groups, each laid out as SB | GDTs | BM | IM | Inodes | Blocks. Files f1, f2, f3, and f4 all have metadata in one group: multiple files can share a single block group.]

48


SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

f1

f2

f3

f4

multiple files can share a single block group

48

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

f1

f2

f3

f4

f5

multiple files can share a single block group

one file can span multiple block groups

48

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

f1

f2

f3

f4

f5

multiple files can share a single block group

one file can span multiple block groups

48

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

f1

f2

f3

f4

f5

multiple files can share a single block group

one file can span multiple block groups

48

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

f1

f2

f3

f4

f5

multiple files can share a single block group

one file can span multiple block groups

48

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

SB GDTs BM InodesIM Blocks Blocks

f1

f2

f3

f4

f5

multiple files can share a single block group

one file can span multiple block groups

48

Layout

49

Layout

A file pod contains multiple block groups
➡ each block group maps to only one file pod
➡ performance locality and fault isolation

49

Layout

A file pod contains multiple block groups
➡ each block group maps to only one file pod
➡ performance locality and fault isolation
➡ see the mapping sketch after this slide

disk layout

POD1 POD2 POD3

49
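As a rough illustration of this layout (a user-space sketch only; the geometry constants and names below are hypothetical and not taken from the prototype), a pod is simply a set of whole block groups, so the pod that owns any block follows directly from the block-group geometry:

#include <stdint.h>
#include <stdio.h>

#define BLOCKS_PER_GROUP 8192   /* hypothetical ext3-style geometry */
#define NR_GROUPS        12

/* pod_of_group[g] records which file pod owns block group g.
 * Each group belongs to exactly one pod, so no block is ever
 * shared between pods. */
static int pod_of_group[NR_GROUPS] = {
    1, 1, 1, 1,   /* POD1: groups 0-3  */
    2, 2, 2, 2,   /* POD2: groups 4-7  */
    3, 3, 3, 3,   /* POD3: groups 8-11 */
};

/* Which pod does an absolute block number fall into? */
static int pod_of_block(uint64_t block)
{
    uint64_t group = block / BLOCKS_PER_GROUP;
    return pod_of_group[group];
}

int main(void)
{
    printf("block 100   -> pod %d\n", pod_of_block(100));    /* pod 1 */
    printf("block 40000 -> pod %d\n", pod_of_block(40000));  /* pod 2 */
    printf("block 90000 -> pod %d\n", pod_of_block(90000));  /* pod 3 */
    return 0;
}

Because the mapping is per block group, a fault confined to one group's metadata can be attributed to exactly one pod.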

Data Structures

50

Data Structures

Pod-related structure
➡ no extra mapping structures

50

Data Structures

Pod-related structure
➡ no extra mapping structures
➡ the pod identifier is embedded in the group descriptors
➡ group descriptors are loaded into memory
➡ see the struct sketch after this slide

[Figure: a block group (SB, GDTs, BM, IM, Inodes, Blocks) with the pod identifier stored in its group descriptor]

50
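A minimal sketch of this embedding, assuming a simplified group-descriptor layout (the real ext3_group_desc lives in the kernel headers; the struct and field names below are illustrative only):

#include <stdint.h>
#include <stdio.h>

/* Simplified, hypothetical stand-in for an ext3-style group descriptor.
 * The design only needs one extra field, the pod identifier, so no
 * separate pod-to-group mapping table is required. Descriptors are
 * loaded into memory at mount time, so the pod of any group can be
 * found without additional I/O. */
struct group_desc {
    uint32_t block_bitmap;   /* block number of the block bitmap   */
    uint32_t inode_bitmap;   /* block number of the inode bitmap   */
    uint32_t inode_table;    /* first block of the inode table     */
    uint16_t free_blocks;    /* free blocks in this group          */
    uint16_t free_inodes;    /* free inodes in this group          */
    uint16_t pod_id;         /* NEW: the file pod owning this group */
};

/* In-memory descriptor table, filled at mount time. */
static struct group_desc gdt[12];

static uint16_t group_pod(unsigned group)
{
    return gdt[group].pod_id;
}

int main(void)
{
    gdt[0].pod_id = 1;
    printf("group 0 belongs to pod %u\n", group_pod(0));
    return 0;
}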

Algorithms

51

Algorithms

Pod-based inode and block allocation
➡ preserve the original allocator's locality
➡ allocation does not cross pod boundaries

51

POD1 POD2 POD3

52

POD1 POD2 POD3

1. first, allocate within the same pod

2. otherwise, claim an empty block group (sketched after this slide)

52
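The preference order above can be sketched in a few lines (find_group_for_pod and the arrays are hypothetical, user-space stand-ins, not the prototype's code): prefer a group the pod already owns, otherwise claim an empty group, and never spill into a group owned by another pod.

#include <stdio.h>

#define NR_GROUPS 12

static int pod_of_group[NR_GROUPS];   /* 0 = unassigned */
static int free_blocks[NR_GROUPS];

/* Pick a block group for an allocation issued on behalf of `pod`.
 * 1. prefer a group already owned by the pod that still has space;
 * 2. otherwise claim an empty (unassigned) group for the pod;
 * 3. never cross into a group owned by a different pod. */
static int find_group_for_pod(int pod)
{
    for (int g = 0; g < NR_GROUPS; g++)
        if (pod_of_group[g] == pod && free_blocks[g] > 0)
            return g;
    for (int g = 0; g < NR_GROUPS; g++)
        if (pod_of_group[g] == 0 && free_blocks[g] > 0) {
            pod_of_group[g] = pod;   /* this group now belongs to the pod */
            return g;
        }
    return -1;   /* pod is full: fail locally instead of spilling over */
}

int main(void)
{
    for (int g = 0; g < NR_GROUPS; g++)
        free_blocks[g] = 8192;
    pod_of_group[0] = 1;
    pod_of_group[1] = 2;

    printf("pod 1 -> group %d\n", find_group_for_pod(1));  /* group 0 */
    free_blocks[0] = 0;
    printf("pod 1 -> group %d\n", find_group_for_pod(1));  /* claims group 2 */
    return 0;
}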

Algorithms

53

Algorithms

Pod-based inode and block allocation
➡ preserve the original allocator's locality
➡ allocation does not cross pod boundaries

De-fragmentation
➡ potential internal fragmentation

53

Algorithms

Pod-based inode and block allocation
➡ preserve the original allocator's locality
➡ allocation does not cross pod boundaries

De-fragmentation
➡ potential internal fragmentation
➡ de-fragmentation for file pods
➡ a similar solution exists in Ext4

53

Journaling

54

Journaling

Virtual transaction
➡ contains updates from only one pod

T1 T2 T3

Pod 1

On-disk journal

Pod 2 Pod 3

independent transactions

54

Journaling

Virtual transaction
➡ contains updates from only one pod
➡ better performance isolation

T1 T2 T3

Pod 1

On-disk journal

Pod 2 Pod 3

independent transactions

54

Journaling

Virtual transaction
➡ contains updates from only one pod
➡ better performance isolation
➡ commit multiple virtual transactions in parallel
➡ see the reservation sketch after this slide

T1 T2 T3

Pod 1

On-disk journal

Pod 2 Pod 3

journal reservation

independent transactions

shared journal

54
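One way to picture virtual transactions and journal reservation (a user-space sketch under assumed names; the prototype's actual JBD changes are not shown): each pod batches only its own updates, and committing a transaction reserves a private slice of the shared on-disk journal, so commits from different pods can proceed in parallel.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define JOURNAL_BLOCKS 1024

/* One virtual transaction per pod: it batches only that pod's
 * metadata updates, so committing it cannot be blocked by, or
 * block, another pod's updates. */
struct vtxn {
    int      pod_id;
    uint32_t nr_blocks;      /* journal blocks this transaction needs */
};

/* The on-disk journal is still shared; "journal reservation" hands
 * each committing transaction its own contiguous slice. */
static _Atomic uint32_t journal_head;   /* next free journal block */

static int reserve_journal(struct vtxn *t, uint32_t *start)
{
    uint32_t old = atomic_fetch_add(&journal_head, t->nr_blocks);
    if (old + t->nr_blocks > JOURNAL_BLOCKS)
        return -1;   /* journal full; a real system would roll back and checkpoint */
    *start = old;
    return 0;
}

int main(void)
{
    struct vtxn t1 = { .pod_id = 1, .nr_blocks = 8 };
    struct vtxn t2 = { .pod_id = 2, .nr_blocks = 4 };
    uint32_t s1, s2;

    if (reserve_journal(&t1, &s1) == 0 && reserve_journal(&t2, &s2) == 0)
        printf("pod 1 commits at block %u, pod 2 at block %u\n", s1, s2);
    return 0;
}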

Introduction

Study of Failure Policies

Isolation File Systems

New Abstraction

Fault Isolation

Quick Recovery

Preliminary Implementation on Ext3

Challenges

55

Status

56

Status

What we did
➡ a simple prototype for Ext3
➡ provides read-only isolation

56

Status

What we did
➡ a simple prototype for Ext3
➡ provides read-only isolation

What we plan to do
➡ crash isolation

56

Status

What we did
➡ a simple prototype for Ext3
➡ provides read-only isolation

What we plan to do
➡ crash isolation
➡ quick recovery after failure

56

Status

What we did
➡ a simple prototype for Ext3
➡ provides read-only isolation (see the sketch after this slide)

What we plan to do
➡ crash isolation
➡ quick recovery after failure
➡ other file systems: Ext4 and Btrfs

56
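For the read-only isolation the prototype provides, the intended behavior can be sketched as a pod-local variant of ext3_error (pod_error and the flag array below are hypothetical names, not the prototype's code): a detected fault marks only the affected pod read-only, while the other pods keep serving reads and writes.

#include <stdbool.h>
#include <stdio.h>

#define NR_PODS 8

/* Per-pod read-only flags; a fault flips one flag instead of
 * remounting the whole file system read-only. */
static bool pod_readonly[NR_PODS];

static void pod_error(int pod, const char *msg)
{
    fprintf(stderr, "pod %d: %s, marking pod read-only\n", pod, msg);
    pod_readonly[pod] = true;      /* other pods keep full service */
}

static int pod_write_allowed(int pod)
{
    return !pod_readonly[pod];
}

int main(void)
{
    pod_error(2, "cannot read block bitmap");
    printf("pod 2 writable? %d\n", pod_write_allowed(2));   /* 0 */
    printf("pod 3 writable? %d\n", pod_write_allowed(3));   /* 1 */
    return 0;
}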

Challenges

57

Challenges

Metadata isolation
➡ tree-based directory structure
➡ globally shared metadata: super block, journal
➡ shared system states: block allocation tree

57

Challenges

Metadata isolation
➡ tree-based directory structure
➡ globally shared metadata: super block, journal
➡ shared system states: block allocation tree

Local failure
➡ is it correct to continue running?
➡ light-weight, stateless crash for a pod

57

Challenges

Metadata isolation
➡ tree-based directory structure
➡ globally shared metadata: super block, journal
➡ shared system states: block allocation tree

Local failure
➡ is it correct to continue running?
➡ light-weight, stateless crash for a pod

Performance
➡ potential overhead of managing pods
➡ better performance isolation
➡ better scalability

57

58

Failure is not an option.

58

Failure is not an option. -- NASA

58

59

Global failure is not an option;

59

Global failure is not an option;

local failure with quick recovery

59

Global failure is not an option;

local failure with quick recovery

is an option.

59

Global failure is not an option;

local failure with quick recovery

is an option.

-- Isolation File Systems

59

60

Questions?

60
