Fault Isolation and Quick Recovery in Isolation File Systems
Lanyue Lu Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau
University of Wisconsin - Madison
1
File-System Availability Is Critical
Main data access interface
➡ desktop, laptop, mobile devices, file servers
A wide range of failures
➡ resource allocation, metadata corruption
➡ failed I/O operations, incorrect system states
A small fault can cause global failures
➡ e.g., a single bit flip can impact the whole file system
Global failures considered harmful
➡ read-only, crash
2
Server Virtualization
Hypervisor
Shared file system
Guest virtual machines
VM1 VM2 VM3
VMDK1 VMDK2 VMDK3
3
VM1 VM2 VM3
VMDK1 VMDK2 VMDK3
e.g., metadata corruption
4
VM1 VM2 VM3
VMDK1 VMDK2 VMDK3
e.g., metadata corruption
5
VM1 VM2 VM3
VMDK1 VMDK2 VMDK3
Read-Only or Crash
All VMs are affected
e.g., metadata corruption
6
Our Solution
A new abstraction for fault isolation
➡ support multiple independent fault domains
➡ protect a group of files for a domain
Isolation file systems
➡ fine-grained fault isolation
➡ quick recovery
7
Introduction
Study of Failure Policies
Isolation File Systems
Challenges
8
Questions to Answer
What global failure policies are used?
➡ failure types
➡ number of each type
What are the root causes of global failures?
➡ related data structures
➡ number of each cause
9
Methodology
Three major file systems
➡ Ext3 (Linux 2.6.32), Ext4 (Linux 2.6.32)
➡ Btrfs (Linux 3.8)
Analyze source code
➡ identify types of global failures
➡ count related error-handling functions
➡ correlate global failures to data structures
10
Q1: What global failure policies are used?
11
Global Failure Policies
Definition
➡ a failure which impacts all users of the file system or even the operating system
Read-Only
➡ e.g., ext3_error():
➡ mark file system as read-only
➡ abort the journal
12
Read-Only Example
ext3/balloc.c, 2.6.32

read_block_bitmap(...) {
    bitmap_blk = desc->bg_block_bitmap;
    bh = sb_getblk(sb, bitmap_blk);
    if (!bh) {
        ext3_error(sb, "Cannot read block bitmap");
        return NULL;
    }
}
13
Global Failure Policies
Definition
➡ a failure which impacts all users of the file system or even the operating system
Read-Only
➡ e.g., ext3_error():
➡ mark file system as read-only
➡ abort the journal
Crash
➡ e.g., BUG(), ASSERT(), panic()
➡ crash the file system or operating system
14
Crash Example
btrfs/disk-io.c, 3.8

open_ctree(...) {
    root->node = read_tree_block(...);
    BUG_ON(!root->node);
15
[Figure: bar chart of the number of global-failure instances per file system, split into Read-Only and Crash. Ext3: 193, Ext4: 409, Btrfs: 829.]
16
Read-only and crash are common in modern file systems
Over 67% of global failures will crash the system
17
Q2: What are the root causes of global failures?
18
Global Failure Causes
Metadata corruption
➡ metadata inconsistency is detected
➡ e.g., a block/inode bitmap corruption
19
Metadata Corruption Example
ext3/dir.c, 2.6.32

ext3_check_dir_entry(...) {
    rlen = ext3_rec_len_from_disk();
    if (rlen < EXT3_DIR_REC_LEN(1)) {
        error = "rec_len is too small";
        ext3_error(sb, error);
    }
20
Global Failure Causes
Metadata corruption
➡ metadata inconsistency is detected
➡ e.g., a block/inode bitmap corruption
I/O failure
➡ metadata I/O failure and journaling failure
➡ e.g., fail to read an inode block
21
I/O Failure Example
ext4/namei.c, 2.6.32

empty_dir(...) {
    bh = ext4_bread(NULL, inode, &err);
    if (!bh && err)
        EXT4_ERROR_INODE(inode,
            "fail to read directory block");
22
Global Failure Causes
Metadata corruption
➡ metadata inconsistency is detected
➡ e.g., a block/inode bitmap corruption
I/O failure
➡ metadata I/O failure and journaling failure
➡ e.g., fail to read an inode block
Software bugs
➡ unexpected states detected
➡ e.g., allocated block is not in a valid range
23
Software Bug Example
ext3/balloc.c, 2.6.32

ext3_rsv_window_add(...) {
    if (start < this->rsv_start)
        p = &(*p)->rb_left;
    else if (start > this->rsv_end)
        p = &(*p)->rb_right;
    else {
        rsv_window_dump(root, 1);
        BUG();
    }
24
Ext3
[Figure 1: Failure Types. This figure shows the failure types (Read-Only vs. Crash) for each file system; the total number of global failure instances is on top of each bar: Ext3 193, Ext4 409, Btrfs 829.]
Ext3 explicitly validates the integrity of metadata in many places, especially at the I/O boundary when reading from disk. For example, Ext3 validates a directory entry before traversing that directory and Ext3 checks that the inode bitmap is in a correct state before allocating a new inode. Unfortunately, as indicated by the Metadata Corruption column, if Ext3 detects a corruption in any of these structures, it causes a global failure. The I/O Failure column similarly shows that Ext3 causes global failures when an individual I/O request fails. Finally, the Software Bugs column shows that there are a significant number of internal assertions (such as BUG_ON), which are utilized to validate file system state at runtime, and these also cause a global failure when invoked. We observe that nearly all of the global failures in Ext3 are due to problems with metadata and other file system internal state, and not user data.

For each data structure, we also check whether it is shared across different files. As shown in the last column of Table 1, most metadata structures are organized in a shared manner and thus can cause global failures. However, even for local structures, such as indirect blocks, a global failure can still occur.

Table 1: Global Failure Causes of Ext3. This table shows the failure causes for Ext3, in terms of data structures, failure causes, and their related numbers. MC: Metadata Corruption; IOF: I/O Failures; SB: Software Bugs; Shared: whether this structure is shared by multiple files or directories.

Data Structure / MC / IOF / SB / Shared
b-bitmap 2 2 Yes
i-bitmap 1 1 Yes
inode 1 2 2 Yes
super 1 Yes
dir-entry 4 4 3 Yes
gdt 3 2 Yes
indir-blk 1 1 No
xattr 5 2 1 No
block 5 Yes/No
journal 3 27 Yes
journal head 31 Yes
buf head 16 Yes
handle 22 9 Yes
transaction 28 Yes
revoke 2 Yes
other 1 11 Yes/No
Total 19 37 137 (= 193)

2.3 Discussion
A file is the basic file system abstraction used to store the user's data in a logically isolated unit. Users can read from and write to a file. Another basic abstraction is a directory, which maps a file name to the file itself. Files and directories are usually organized as a directory tree.
A namespace holds a logical group of files or directories. To protect files in a shared environment, different applications are isolated within separate namespaces. Typical examples include chroot, BSD jail, Solaris Zones, and virtual machines.
However, these abstractions do not provide any fault isolation within a file system. Files and directories only represent and isolate data logically for applications. Within a file system, different files and directories share metadata and system state; when faults are related to these shared metadata, global failure policies are triggered.
Therefore, file system abstractions lack a fine-grained fault isolation mechanism. Current file systems implicitly use a single fault domain; a fault in one file may cause a global reaction, thus affecting all clients of the file system.

3 New Abstraction: File Pod
To address the problem of inadequate fault isolation in file systems, we propose a new abstraction, called a file pod, for fine-grained fault isolation in file systems.
A file pod is an abstract file system partition that contains a group of files and their related metadata. Each file pod is isolated as an independent fault domain, with its own failure policy. Any failure related to a file pod only affects itself, not the whole file system. For example, if metadata corruption is detected within a file pod and the failure policy is to remount as read-only, then a Swarm file system marks only that pod as read-only, without affecting other consistent file pods.
File pods allow applications to control the failure policy of their own files and related metadata, instead of letting the file system manage the failures globally. Furthermore, this new abstraction supports flexible bindings between namespaces and fault domains; thus it can be used in a wide range of environments, such as server virtualization
25
All global failures are caused by metadata and system states
Both local and shared metadata can cause global failures
26
Not Only Local File Systems
Shared-disk file system: OCFS2
➡ inspired by the Ext3 design
➡ used in virtualization environments
➡ hosts virtual machine images
➡ allows multiple Linux guests to share a file system
Global failures are also prevalent
➡ a single piece of corrupted metadata can fail the whole file system on multiple nodes!
27
Current Abstractions
File and directory
➡ metadata is shared across different files or directories
Namespace
➡ virtual machines, chroot, BSD jail, Solaris Zones
➡ multiple namespaces still share a file system
Partitions
➡ multiple file systems on separate partitions
➡ a single panic on one partition can crash the whole operating system
➡ static partitions, dynamic partitions
➡ management of many partitions
28
All files on a file system implicitly share a single fault domain
Current file-system abstractions do not provide fine-grained fault isolation
29
Introduction
Study of Failure Policies
Isolation File Systems
New Abstraction
Fault Isolation
Quick Recovery
Preliminary Implementation on Ext3
Challenges
30
Isolation File Systems
Fine-grained partitioning
➡ files are isolated into separate domains
Independent
➡ faulty units will not affect healthy units
Fine-grained recovery
➡ repair a faulty unit quickly
➡ instead of checking the whole file system
Elastic
➡ dynamically grow and shrink in size
31
New Abstraction
File Pod
➡ an abstract partition
➡ contains a group of files and related metadata
➡ an independent fault domain
Operations (a sketch of such an interface follows below)
➡ create a file pod
➡ set / get a file pod's attributes
➡ failure policy
➡ recovery policy
➡ bind / unbind a file to a pod
➡ share a file between pods
32
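To make these operations concrete, here is a minimal user-level sketch. None of these calls exist in Linux; pod_create(), pod_set_attr(), and pod_bind() are hypothetical names standing in for whatever system-call or ioctl interface an isolation file system might expose.

/* Hypothetical file-pod interface, sketched as user-space stubs. */
#include <stdio.h>

enum pod_failure_policy  { POD_FAIL_READONLY, POD_FAIL_CRASH };
enum pod_recovery_policy { POD_RECOVER_ONLINE, POD_RECOVER_OFFLINE };

static int next_pod_id = 1;

static int pod_create(const char *mnt)                  /* create a file pod */
{
    printf("create pod on %s\n", mnt);
    return next_pod_id++;
}

static void pod_set_attr(int pod, enum pod_failure_policy f,
                         enum pod_recovery_policy r)    /* set pod policies */
{
    printf("pod %d: failure policy %d, recovery policy %d\n", pod, f, r);
}

static void pod_bind(int pod, const char *path)         /* bind a file to a pod */
{
    printf("pod %d: bind %s\n", pod, path);
}

int main(void)
{
    /* Give one virtual machine image its own fault domain. */
    int pod = pod_create("/mnt/ext3");
    pod_set_attr(pod, POD_FAIL_READONLY, POD_RECOVER_ONLINE);
    pod_bind(pod, "/mnt/ext3/vm1.vmdk");
    return 0;
}

With such an interface, a fault related to vm1.vmdk would be confined to its pod; the unbind and share operations listed above would follow the same pattern.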
[Figure: a directory tree rooted at /, containing directories d1, d2, d3, and d4.]
33
[Figure: the same directory tree partitioned into two file pods, Pod1 and Pod2.]
34
Introduction
Study of Failure Policies
Isolation File Systems
New Abstraction
Fault Isolation
Quick Recovery
Preliminary Implementation on Ext3
Challenges
35
Metadata Isolation
Observation
➡ metadata is organized in a shared manner
➡ hard to isolate a failure for metadata
For example
➡ multiple inodes are stored in a single inode block
➡ an I/O failure can affect multiple files
[Figure: an inode block holding many inodes; a single block read failure touches all of them.]
36
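As a rough worked example: with 4 KB blocks and 128-byte Ext3 inodes (the classic on-disk size), one inode block holds 4096 / 128 = 32 inodes, so a single failed block read can affect up to 32 unrelated files.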
Key Idea 1:
Isolate metadata for file pods
37
Localize Failures
Local failures (see the sketch below)
➡ convert global failures to local failures
➡ same failure semantics
➡ only fail the faulty pod
Read-Only
➡ mark a file pod as read-only
Crash
➡ crash a file pod instead of the whole system
➡ provide the same initial states after a crash
38
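The sketch below illustrates the read-only case in user space, assuming a per-pod read_only flag; it is not the authors' kernel implementation, only an analogy for how an ext3_error()-style handler could be scoped to one pod.

/* Localizing a read-only failure: each pod keeps its own failure state,
 * so a fault freezes writes to that pod only (hypothetical user-space model). */
#include <stdbool.h>
#include <stdio.h>

struct file_pod {
    int  id;
    bool read_only;                 /* per-pod failure state */
};

/* Pod-scoped analogue of a global "mark file system read-only" handler. */
static void pod_error(struct file_pod *pod, const char *msg)
{
    fprintf(stderr, "pod %d: %s -> marking pod read-only\n", pod->id, msg);
    pod->read_only = true;          /* other pods stay read-write */
}

static bool pod_may_write(const struct file_pod *pod)
{
    return !pod->read_only;
}

int main(void)
{
    struct file_pod vm1 = { 1, false }, vm2 = { 2, false };

    pod_error(&vm1, "metadata corruption detected");    /* fault in pod 1 */
    printf("vm1 writable: %d, vm2 writable: %d\n",
           pod_may_write(&vm1), pod_may_write(&vm2));   /* prints 0 and 1 */
    return 0;
}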
[Figure: the directory tree with Pod1 and Pod2.]
39
[Figure: the same tree; a fault, e.g., metadata corruption, occurs inside one pod and only that pod is failed.]
40
Introduction
Study of Failure Policies
Isolation File Systems
New Abstraction
Fault Isolation
Quick Recovery
Preliminary Implementation on Ext3
Challenges
41
Quick Recovery
File system recovery is slow
➡ a small error requires a full check
➡ many random read requests
➡ 7 hours to sequentially read a 2 TB disk
42
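(As a rough check on that figure, assuming about 80 MB/s of sequential disk bandwidth: 2 TB / 80 MB/s ≈ 25,000 seconds ≈ 7 hours; a check dominated by random reads is far slower still.)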
a small fault requires a full check (slow!)
43
Key Idea 2:
Minimize the file system checking range during recovery
44
Quick Recovery
Metadata isolation
➡ file pod as the unit of recovery
➡ check and recover independently
➡ both online and offline
When to recover?
➡ leverage the internal detection mechanism
How to recover more efficiently? (see the sketch below)
➡ only check the faulty pod
➡ narrow down to certain data structures
45
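The following is a toy sketch of pod-scoped checking, assuming a hypothetical pod_of[] map from block groups to pods; it only illustrates why check time scales with the faulty pod rather than with the whole disk.

/* Pod-scoped checking: walk only the block groups owned by the faulty pod. */
#include <stdio.h>

#define NGROUPS 8

/* Hypothetical ownership map: pod_of[g] is the pod that owns block group g. */
static const int pod_of[NGROUPS] = { 1, 1, 2, 2, 2, 3, 3, 3 };

static void check_group(int g)
{
    printf("checking block group %d\n", g);   /* stand-in for real fsck work */
}

static void pod_fsck(int faulty_pod)
{
    for (int g = 0; g < NGROUPS; g++)
        if (pod_of[g] == faulty_pod)
            check_group(g);                   /* healthy pods are skipped */
}

int main(void)
{
    pod_fsck(2);        /* a fault in pod 2 only scans groups 2, 3, and 4 */
    return 0;
}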
Introduction
Study of Failure Policies
Isolation File Systems
New Abstraction
Fault Isolation
Quick Recovery
Preliminary Implementation on Ext3
Challenges
46
Ext3 Layout
A disk is divided into block groups
➡ physical partitioning for disk locality
[Figure: the disk laid out as a sequence of block groups; one block group = SB (super block) | GDTs (group descriptors) | BM (block bitmap) | IM (inode bitmap) | Inodes (inode table) | Blocks (data blocks).]
(A per-group descriptor sketch follows below.)
47
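For reference, each entry in the GDTs records where that group's metadata lives. The struct below is a simplified, host-endian sketch modeled on the kernel's struct ext3_group_desc (which the earlier read_block_bitmap() example consults via desc->bg_block_bitmap); field widths and endianness are simplified here.

#include <stdint.h>
#include <stdio.h>

/* Simplified per-block-group descriptor (modeled on struct ext3_group_desc). */
struct group_desc {
    uint32_t bg_block_bitmap;        /* block number of the block bitmap (BM) */
    uint32_t bg_inode_bitmap;        /* block number of the inode bitmap (IM) */
    uint32_t bg_inode_table;         /* first block of the inode table        */
    uint16_t bg_free_blocks_count;
    uint16_t bg_free_inodes_count;
    uint16_t bg_used_dirs_count;
};

int main(void)
{
    struct group_desc g = { 101, 102, 103, 4096, 2048, 10 };
    printf("block bitmap at %u, inode bitmap at %u, inode table at %u\n",
           g.bg_block_bitmap, g.bg_inode_bitmap, g.bg_inode_table);
    return 0;
}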
[Figure: many block groups laid out across the disk (each: SB | GDTs | BM | IM | Inodes | Blocks); files f1, f2, f3, and f4 are spread over the groups.]
multiple files can share a single block group
48
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
f1
f2
f3
f4
f5
multiple files can share a single block group
one file can span multiple block groups
48
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
f1
f2
f3
f4
f5
multiple files can share a single block group
one file can span multiple block groups
48
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
f1
f2
f3
f4
f5
multiple files can share a single block group
one file can span multiple block groups
48
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
f1
f2
f3
f4
f5
multiple files can share a single block group
one file can span multiple block groups
48
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
SB GDTs BM InodesIM Blocks Blocks
f1
f2
f3
f4
f5
multiple files can share a single block group
one file can span multiple block groups
48
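A minimal sketch of how this mapping plays out: blocks and inodes land in block groups by simple division, so nearby small files share a group while a large file crosses group boundaries. The group sizes below are typical Ext3 defaults and the block/inode numbers for f1–f5 are illustrative, not values from the slides.

```c
#include <stdio.h>

/* Typical Ext3 parameters for 4 KB blocks; values are illustrative only. */
#define BLOCKS_PER_GROUP 32768u
#define INODES_PER_GROUP 8192u

static unsigned group_of_block(unsigned block)
{
    return block / BLOCKS_PER_GROUP;
}

static unsigned group_of_inode(unsigned ino)
{
    return (ino - 1) / INODES_PER_GROUP;   /* inode numbers start at 1 */
}

int main(void)
{
    /* Two small files allocated near each other: same block group. */
    printf("f1 block 100   -> group %u\n", group_of_block(100));
    printf("f2 block 2000  -> group %u\n", group_of_block(2000));

    /* One large file whose blocks cross a group boundary. */
    printf("f5 block 32000 -> group %u\n", group_of_block(32000));
    printf("f5 block 40000 -> group %u\n", group_of_block(40000));

    printf("inode 12345    -> group %u\n", group_of_inode(12345));
    return 0;
}
```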
Layout
A file pod contains multiple block groups➡ one block group only maps to one file pod➡ performance locality and fault isolation
disk layout
POD1 POD2 POD3
49
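A minimal user-space sketch of this layout, assuming each block group carries a pod id so that a pod owns several groups while each group belongs to exactly one pod. The table and function names are hypothetical, not the prototype's code.

```c
#include <stdio.h>

#define NR_GROUPS 9

/* One pod id per block group: a group maps to at most one pod,
 * while a pod can own many groups (hypothetical layout table). */
static unsigned pod_of_group[NR_GROUPS] = {
    1, 1, 1,                 /* POD1 */
    2, 2, 2,                 /* POD2 */
    3, 3, 3,                 /* POD3 */
};

static void print_pod_groups(unsigned pod)
{
    printf("pod %u owns groups:", pod);
    for (unsigned g = 0; g < NR_GROUPS; g++)
        if (pod_of_group[g] == pod)
            printf(" %u", g);
    printf("\n");
}

int main(void)
{
    for (unsigned pod = 1; pod <= 3; pod++)
        print_pod_groups(pod);
    return 0;
}
```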
Data Structures
Pod-related structure➡ no extra mapping structures➡ embedded in group descriptors➡ group descriptors are loaded into memory (see the sketch below)
[Figure: a block group (SB, GDTs, BM, IM, inodes, data blocks) with the pod id embedded in its group descriptor]
50
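A sketch of the "no extra mapping structures" idea, assuming the pod id is folded into the on-disk group descriptor that is already cached at mount time. The struct below is heavily simplified and the bg_pod field is a hypothetical addition, not part of the real ext3_group_desc.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified on-disk group descriptor; the real struct ext3_group_desc
 * has more fields and no pod id -- bg_pod is the hypothetical addition. */
struct group_desc {
    uint32_t bg_block_bitmap;   /* block no. of the block bitmap */
    uint32_t bg_inode_bitmap;   /* block no. of the inode bitmap */
    uint32_t bg_inode_table;    /* first block of the inode table */
    uint16_t bg_free_blocks;
    uint16_t bg_free_inodes;
    uint16_t bg_pod;            /* pod this block group belongs to */
};

/* Group descriptors are loaded into memory at mount time, so finding the
 * pod of any group is a plain array lookup -- no separate mapping table. */
static struct group_desc gdt[4] = {
    { .bg_pod = 1 }, { .bg_pod = 1 }, { .bg_pod = 2 }, { .bg_pod = 3 },
};

static unsigned pod_of_group(unsigned group)
{
    return gdt[group].bg_pod;
}

int main(void)
{
    for (unsigned g = 0; g < 4; g++)
        printf("group %u -> pod %u\n", g, pod_of_group(g));
    return 0;
}
```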
Algorithms
Pod-based inode and block allocation➡ preserve the original allocator's locality➡ allocation will not cross pod boundaries
51
POD1 POD2 POD3
1. within the same pod
2. an empty block group
52
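A minimal sketch of the two-step group selection shown above: first try a block group already owned by the pod, otherwise claim an empty group, which then joins the pod. The data structures and the free-space check are assumptions for illustration, not the prototype's allocator.

```c
#include <stdio.h>

#define NR_GROUPS 6
#define NO_POD    0

struct group {
    unsigned pod;          /* NO_POD if the group is still unassigned */
    unsigned free_blocks;
};

static struct group groups[NR_GROUPS] = {
    { 1, 0 }, { 1, 50 }, { 2, 800 }, { NO_POD, 1000 }, { 3, 300 }, { NO_POD, 1000 },
};

/* Pick a block group for an allocation on behalf of 'pod':
 * 1. a group already owned by the pod with free space,
 * 2. otherwise an empty (unassigned) group, which is claimed for the pod.
 * Allocation never spills into another pod's groups. */
static int pick_group(unsigned pod)
{
    for (int g = 0; g < NR_GROUPS; g++)
        if (groups[g].pod == pod && groups[g].free_blocks > 0)
            return g;
    for (int g = 0; g < NR_GROUPS; g++)
        if (groups[g].pod == NO_POD) {
            groups[g].pod = pod;   /* grow the pod by one group */
            return g;
        }
    return -1;                     /* no space without violating isolation */
}

int main(void)
{
    printf("pod 1 -> group %d\n", pick_group(1));  /* group 1: same pod */
    groups[1].free_blocks = 0;
    printf("pod 1 -> group %d\n", pick_group(1));  /* group 3: empty group joins pod 1 */
    return 0;
}
```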
Algorithms
Pod-based inode and block allocation➡ preserve the original allocator's locality➡ allocation will not cross pod boundaries
De-fragmentation➡ potential internal fragmentation➡ de-fragmentation for file pods➡ similar to the existing solution in Ext4
53
Journaling
Virtual transaction➡ contains updates only from one pod➡ better performance isolation➡ commit multiple virtual transactions in parallel
[Figure: per-pod virtual transactions T1 (Pod 1), T2 (Pod 2), T3 (Pod 3) are independent transactions committed via journal reservation into the shared on-disk journal]
54
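A minimal sketch of the virtual-transaction idea, assuming each pod collects its own dirty blocks in a per-pod transaction and commits it independently into the shared journal; because transactions are self-contained, several could be committed in parallel. The types and commit path are illustrative, not JBD's actual interfaces.

```c
#include <stdio.h>

#define MAX_BLOCKS 8

/* One virtual transaction per pod: it only ever holds that pod's
 * metadata updates, so pods do not entangle each other's commits. */
struct vtxn {
    unsigned pod;
    unsigned nr;
    unsigned blocks[MAX_BLOCKS];   /* block numbers dirtied by this pod */
};

static void vtxn_add(struct vtxn *t, unsigned block)
{
    if (t->nr < MAX_BLOCKS)
        t->blocks[t->nr++] = block;
}

/* Commit one virtual transaction into the shared on-disk journal.
 * Each commit reserves its own journal space, so independent pods
 * could issue these commits in parallel. */
static void vtxn_commit(const struct vtxn *t)
{
    printf("journal: commit T%u (pod %u):", t->pod, t->pod);
    for (unsigned i = 0; i < t->nr; i++)
        printf(" blk %u", t->blocks[i]);
    printf("\n");
}

int main(void)
{
    struct vtxn t1 = { .pod = 1 }, t2 = { .pod = 2 }, t3 = { .pod = 3 };

    vtxn_add(&t1, 100); vtxn_add(&t1, 101);        /* pod 1 updates */
    vtxn_add(&t2, 9000);                           /* pod 2 updates */
    vtxn_add(&t3, 20000); vtxn_add(&t3, 20001);    /* pod 3 updates */

    vtxn_commit(&t1);
    vtxn_commit(&t2);
    vtxn_commit(&t3);
    return 0;
}
```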
Introduction
Study of Failure Policies
Isolation File Systems
New Abstraction
Fault Isolation
Quick Recovery
Preliminary Implementation on Ext3
Challenges
55
Status
What we did➡ a simple prototype for Ext3➡ provide read-only isolation
What we plan to do ➡ crash isolation➡ quick recovery after failure➡ other file systems: Ext4 and Btrfs
56
Challenges
Metadata isolation➡ tree-based directory structure➡ globally shared metadata: super block, journal➡ shared system states: block allocation tree
Local failure ➡ is it correct to continue to run ?➡ light-weight, stateless crash for a pod
Performance➡ potential overhead of managing pods➡ better performance isolation ➡ better scalability
57
Failure is not an option. -- NASA
58
Global failure is not an option;
local failure with quick recovery
is an option.
-- Isolation File Systems
59
Questions ?
60