17 1 Embedded Software Lab. Embedded Software Lab Daejun Park, Eunsoo Park Lecture 12 EXT4

Post on 28-Dec-2015


Overview FS

According to Chapter 12, the VFS gives users an abstract view of file systems.

So in this lecture we cover the specific file systems connected to the VFS:
• EXT2, EXT3
• EXT4

[Figure: the VFS sits above the specific file system implementations (EXT2/3, EXT4, NTFS, F2FS, ...)]

Ext2 Disk Data Structure

The superblock and the group descriptors are duplicated in each block group.

The block bitmap and the inode bitmap must each be stored in a single block.

We will cover each component of a block group:
• Super block, group descriptors, bitmaps, inode table

Super Block

The superblock is 2 sectors (1024 bytes) that describe the file system:
• Volume label
• Block size
• # of blocks per group
• # of reserved blocks before the 1st block group
• Block group number of this superblock
• Count of free inodes & blocks (total over all groups)

The 1st superblock is 1024 bytes past the beginning of the file system
• The first two sectors are used to store boot code

Super Block(2)

<ext2_super_block>

Type        Field                 Description
__le32      s_inodes_count        # of inodes in filesystem
__le32      s_blocks_count        # of blocks in filesystem
__le32      s_free_blocks_count   Free blocks counter
__le32      s_free_inodes_count   Free inodes counter
__le32      s_log_block_size      Block size (0: 1024 bytes, 1: 2048 bytes, …)
__le32      s_blocks_per_group    # of blocks per group
__le32      s_inodes_per_group    # of inodes per group
__le16      s_state               Status flag (mounted, unmounted, error)
__le16      s_block_group_nr      Block group number of this superblock
char [64]   s_last_mounted        Pathname of last mount point
…           …                     …

Additional fields exist for ext3 compatibility (journaling) and for e2fsck.
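As a minimal sketch of how two of these fields are used: the block size follows from s_log_block_size, and the primary superblock sits at a fixed byte offset. The macro and function names here are illustrative only, not kernel API.

```c
#include <stdint.h>

/* The primary superblock starts 1024 bytes into the file system. */
#define SUPERBLOCK_OFFSET 1024u

/* Block size in bytes: 0 -> 1024, 1 -> 2048, 2 -> 4096, ... */
static uint32_t block_size_from_log(uint32_t s_log_block_size)
{
    return 1024u << s_log_block_size;
}

/* Index of the file system block that contains the primary superblock. */
static uint32_t superblock_block_index(uint32_t block_size)
{
    return SUPERBLOCK_OFFSET / block_size;
}
```

With 1KB blocks the superblock is block 1. With 4KB blocks it lives inside block 0, after the 1024 bytes of boot code space.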

Group Descriptor, Bitmap

<ext2_group_desc>

Type         Field                  Description
__le32       bg_block_bitmap        Block number of block bitmap
__le32       bg_inode_bitmap        Block number of inode bitmap
__le32       bg_inode_table         Block number of first inode table block
__le16       bg_free_blocks_count   Number of free blocks in the group
__le16       bg_free_inodes_count   Number of free inodes in the group
__le16       bg_used_dirs_count     Number of directories in the group
__le16       bg_pad                 Alignment to word
__le32 [3]   bg_reserved            Nulls to pad out 24 bytes
…            …                      …

Can we derive the number of blocks in a group from the size of one block? Yes, because the block bitmap must fit in a single block.
Ex) ext2 with one block = 4KB
• One bitmap block holds 4096 * 8 = 32K bits, one bit per block
• 32K * 4KB = 128MB is the maximum capacity of one group
From this we can determine how many blocks and block groups fit in a partition.
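The arithmetic above can be sketched as follows. The function names are hypothetical, and the sketch assumes every group is sized to the maximum that one bitmap block allows.

```c
#include <stdint.h>

/* One bitmap block holds 8 bits per byte, one bit per block in the group. */
static uint64_t max_blocks_per_group(uint64_t block_size)
{
    return block_size * 8;
}

/* Largest amount of disk space one block group can cover, in bytes. */
static uint64_t max_group_bytes(uint64_t block_size)
{
    return max_blocks_per_group(block_size) * block_size;
}

/* Number of block groups needed for a partition, rounded up. */
static uint64_t group_count(uint64_t total_blocks, uint64_t block_size)
{
    uint64_t per_group = max_blocks_per_group(block_size);
    return (total_blocks + per_group - 1) / per_group;
}
```

For 4KB blocks this gives 32768 blocks per group and 128MB per group, matching the slide.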

Inode Table

Inode Table
• Multiple consecutive blocks, each of which contains a predefined number of inodes.

Inode
• All inodes have the same size: 128 bytes
• Each inode corresponds to one file and stores the file’s primary metadata, such as the file’s size, ownership, and temporal information.
• An inode is allocated to each file and directory
• A directory holds file/directory names and pointers to inodes in the table
• The inode points to the file content blocks
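Because all inodes have the same size, an inode number can be turned into a position in the inode table with plain arithmetic. The sketch below assumes inode numbers start at 1 and uses a hypothetical struct and function name. The per-group inode count would come from s_inodes_per_group.

```c
#include <stdint.h>

struct inode_location {
    uint32_t group;           /* block group holding the inode */
    uint32_t block_in_table;  /* block offset inside that group's inode table */
    uint32_t byte_in_block;   /* byte offset of the inode inside that block */
};

static struct inode_location locate_inode(uint32_t ino,
                                          uint32_t inodes_per_group,
                                          uint32_t inode_size,
                                          uint32_t block_size)
{
    struct inode_location loc;
    uint32_t index = (ino - 1) % inodes_per_group;  /* position within the group */

    loc.group = (ino - 1) / inodes_per_group;
    loc.block_in_table = (index * inode_size) / block_size;
    loc.byte_in_block = (index * inode_size) % block_size;
    return loc;
}
```

The real table block is then bg_inode_table of that group plus block_in_table.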

Inode

<ext2_inode>

Type                       Field           Description
__le16                     i_mode          File type and access rights
__le16                     i_uid           Owner identifier
__le32                     i_size          File length in bytes
__le16                     i_links_count   Hard links counter
__le32                     i_blocks        Number of data blocks of the file (counted in 512-byte units)
__le32 [EXT2_N_BLOCKS]     i_block         Pointers to data blocks
__le32                     i_file_acl      File access control list
__le32                     i_dir_acl       Directory access control list
union                      osd1, osd2      OS-specific info

Inode(2)

• Access Control Lists (ACL)
 - File protection mechanism in Unix filesystems
 - An ACL can be associated with each file
 - A user may specify, for each of his files, the names of specific users and the privileges to be given to these users
 - Linux 2.6 fully supports ACLs by making use of inode extended attributes (extended attributes were introduced mainly to support ACLs)

Inode(3)

File_type   Description        Explanation
0           Unknown
1           Regular file       Needs data blocks only when it starts to have data (when first created, it is empty and has no data blocks)
2           Directory          Data blocks store filenames together with the corresponding inode numbers
                               Such data blocks contain structures of type ext2_dir_entry_2
                               EXT2_NAME_LEN: 255
3           Character device   No data block, just an inode
4           Block device       No data block, just an inode
5           Named pipe         No data block, just an inode
6           Socket             No data block, just an inode
7           Symbolic link      If the pathname is at most 60 characters, it is stored in the inode
                               If the pathname is longer than 60 characters, one data block is needed

Inode(4)

<ext2_dir_entry_2>

Type                    Field       Description
__le32                  inode       Inode number
__le16                  rec_len     Directory entry length (offset to the next entry)
__u8                    name_len    Filename length (real)
__u8                    file_type   File type
char [EXT2_NAME_LEN]    name        Filename (the entry is padded to a multiple of 4 for efficiency)

[Figure: example directory block. rec_len gives the offset to the next entry; a deleted entry is skipped by adding its length to the rec_len of the previous entry (12+16).]
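The "multiple of 4" rule and the "12+16" deletion case can be sketched as below. The 8-byte header is the sum of the four fixed fields in the table, and the function names are illustrative.

```c
#include <stdint.h>

/* Fixed part of an entry: inode (4) + rec_len (2) + name_len (1) + file_type (1). */
#define DIR_ENTRY_HEADER 8u

/* Smallest rec_len for a given name length, rounded up to a multiple of 4. */
static uint16_t dir_rec_len(uint8_t name_len)
{
    return (uint16_t)((DIR_ENTRY_HEADER + name_len + 3u) & ~3u);
}

/* Deleting an entry: the previous entry absorbs its length so a scan skips it. */
static uint16_t merged_rec_len(uint16_t prev_rec_len, uint16_t deleted_rec_len)
{
    return (uint16_t)(prev_rec_len + deleted_rec_len);
}
```

A 3-character name needs 12 bytes and a 7-character name needs 16. Deleting the second one leaves the first with rec_len 28.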

Inode(5)

Inode & Directory
• A directory maps a file name to the related inode
• A directory is itself a file (supporting the file hierarchy)

[Figure: resolving /home/alphabet.txt. The inode table (left) holds the root directory inode, the home directory inode, and the file inode; the disk blocks (right) hold the root directory entries (usr, home, dev, etc, ...), the home directory entries (reports.doc, hello.c, alphabet.txt, ...), and the file contents.]

Inode(6)

[Figure: disk layout with boot block, super block, inode table, and data blocks. The root directory inode points to data block 20, whose entries list File1.c, mydir, myfile, and mydir2. File1.c uses data blocks 21 to 23, mydir uses data block 24 (entries a.hwp, b.c, Test.c, Note.doc), and myfile uses direct blocks plus an indirect block.]

Memory Data Structures

• For performance, most of the information stored in the disk data structures of an Ext2 partition is copied into RAM when the file system is mounted

• The kernel uses the page cache to keep the disk data structures up-to-date

In dynamic mode, the data is kept in a cache as long as the associated object is in use; when the file is closed or the data block is deleted, the data may be removed from the cache.

Memory Data Structures(2)

[Figure: the VFS superblock object (s_fs_info) in memory is linked to the superblock on disk through a buffer head]

Memory Data Structures(3)

ext2_fill_super()
• Allocates the buffers for these objects and reads the disk data into them or points to them
• [Figure: state of the memory data structures after completion]

The s_debts field maintains the balance between regular files and directories. s_debts increases as the number of directories increases; otherwise it decreases.

Creating EXT Filesystem
mke2fs (utility for making an EXT2 FS)

1. Initializes the superblock and the group descriptors.

2. For each block group, reserves all the disk blocks needed to store the superblock, the group descriptors, the inode table, and the two bitmaps.

3. Initializes the inode bitmap and the data block bitmap of each block group to 0.

4. Initializes the inode table of each block group.

5. Creates the / root directory.

6. Creates the lost+found directory, which is used by e2fsck to link the lost and found defective blocks.

7. Updates the inode bitmap and the data block bitmap of the block group in which the two previous directories have been created.

8. Groups the defective blocks (if any) in the lost+found directory.

Memory Data Structures(4)

• Inode Object
 - For each component of the pathname that is not already in the dentry cache, a new dentry object and a new inode object are created.

• When the VFS accesses an Ext2 disk inode, it creates a corresponding inode descriptor of type ext2_inode_info

• The inode descriptor includes:
 - The whole VFS inode object
 - Most of the fields found in the disk’s inode structure that are not kept in the VFS inode
 - The i_next_alloc_block and i_next_alloc_goal fields, which store the logical block number and the physical block number of the disk block most recently allocated to the file
 - The i_acl and i_default_acl fields, which point to the ACLs of the file

Methods

Ext2 Super Block Operations: ext2_sops <fs/ext2/super.c>
• Point to the EXT2-specific operations
• ..., alloc_inode, read_inode, write_inode, ...

Ext2 inode Operations
• include directory operations in terms of EXT2
• include regular file operations in terms of EXT2
• if some methods are NULL, the VFS generic methods are called, or nothing is done.

Methods(2)

[Figure: operations tables for <EXT2 Inode Operations> and <EXT2 file Operations>]

Managing Disk Space

We will cover the inode and data block operations in terms of two goals:
• Avoid file fragmentation
• Volume management must work as fast as possible.

An FS tries to keep the blocks of a file in contiguous order.

However, blocks can become scattered, and file holes make volumes bigger.

Managing Disk Space – Creating Inode

• Creating inodes: the block group for the new inode is chosen by
 - find_group_orlov() (directories)
 - find_group_other() (other files)

Managing Disk Space – Deleting Inode

• Deleting inodes: clear_inode()

Managing Disk Space – Data Blocks Addressing

• Data Blocks Addressing

Blocks may be referred to either by their relative position inside the file (their file block number) or by their position inside the disk partition (LBN, logical block number)

Given an offset f inside a file:
• Derive the file block number from f
• Translate the file block number to the LBN

It is hard to translate a file block number into an LBN

EXT2 provides a method to store the connection between each file block number and the LBN on disk

We will look at the i_block field thoroughly

Managing Disk Space – Data Blocks Addressing(2)

• The i_block field in the disk inode is an array of EXT2_N_BLOCKS components that contain logical block numbers.

[Figure: the i_block array. The first entries point directly to 4KB data blocks (file offsets 0, 4096, 4096 * 2, ...); the last three entries point to the indirect, double indirect, and triple indirect blocks.]

A 4KB block contains the pointers to 1024 LBNs

We can calculate the upper bound of the file size for each level of indirection (4KB blocks, 4B pointers):
• direct: 12 * 4KB = 48KB
• single indirect: (4KB/4B) * 4KB = 1024 * 4KB = 4MB, so 4MB + 48KB in total
• double indirect: (4KB/4B) * (4KB/4B) * 4KB = 4GB, so 4GB + 4MB + 48KB in total
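The two translation steps and the size bounds can be sketched in code. This assumes 12 direct pointers and 4-byte block pointers, as on the slide. The function names are hypothetical and do not correspond to kernel functions.

```c
#include <stdint.h>

#define DIRECT_BLOCKS 12u   /* direct pointers in i_block */
#define POINTER_SIZE  4u    /* each block pointer is 4 bytes */

/* File block number that contains byte offset f. */
static uint64_t file_block_number(uint64_t f, uint32_t block_size)
{
    return f / block_size;
}

/* Indirection level needed to reach a file block:
 * 0 = direct, 1 = indirect, 2 = double indirect, 3 = triple indirect,
 * -1 = beyond what triple indirection can address. */
static int indirection_level(uint64_t file_block, uint32_t block_size)
{
    uint64_t per_block = block_size / POINTER_SIZE;  /* pointers per block */
    uint64_t limit = DIRECT_BLOCKS;
    uint64_t span = 1;
    int level;

    if (file_block < limit)
        return 0;
    for (level = 1; level <= 3; level++) {
        span *= per_block;   /* blocks reachable through this level */
        limit += span;
        if (file_block < limit)
            return level;
    }
    return -1;
}

/* Cumulative maximum file size in bytes using up to `level` indirection. */
static uint64_t max_bytes_up_to(int level, uint32_t block_size)
{
    uint64_t per_block = block_size / POINTER_SIZE;
    uint64_t blocks = DIRECT_BLOCKS;
    uint64_t span = 1;
    int i;

    for (i = 1; i <= level; i++) {
        span *= per_block;
        blocks += span;
    }
    return blocks * block_size;
}
```

With 4KB blocks, file block 12 is the first one that needs the indirect block, and max_bytes_up_to(2, 4096) equals 4GB + 4MB + 48KB.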

File Hole

• A file hole is a portion of a regular file that contains null characters (“\0”) and is not stored in any data block on disk

• File holes were introduced to avoid wasting disk space.
• A block is assigned to a file only when the process needs to write data into it

Condition: if i_size > 512 * i_blocks, the file has a hole
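A minimal sketch of the condition on the slide, with a hypothetical function name. Since i_blocks is counted in 512-byte units, the comparison tells us the file is larger than the space allocated to it. Small holes may not trip this check, so treat it as a heuristic.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the file length exceeds the space actually allocated to it. */
static bool looks_sparse(uint64_t i_size, uint64_t i_blocks)
{
    return i_size > 512u * i_blocks;
}
```

For example, a 1MB file with only 8 sectors (4KB) allocated is reported as having a hole.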

Allocating a data block

• Try to keep the metadata and the data blocks close together
• Try to keep the files under the same directory close together

• ext2_get_block() searches for a free block
• File fragmentation should be reduced

Releasing a Data block

EXT3 Overview

[Figure: ext3 = ext2 + journaling]

• Inter-compatible
– Ext2 can be converted to Ext3
– Ext3 can be read by Ext2

• Ext3 adds journaling for consistency
– The journal is a small, circular area that is written before the changes are written to their final location on disk
– After a crash, read the journal to ensure all write operations were completed
  • Redo any that were not completed

EXT3 Filesystem

• Designed with two simple concepts in mind:
– To be a journaling filesystem
– To be compatible with the old Ext2 filesystem

• Journaling Filesystems
– Updates to filesystem blocks might be kept in dynamic memory for a long period of time before being flushed to disk
– A dramatic event such as a power-down failure or a system crash might thus leave the filesystem in an inconsistent state
– To overcome this problem, each traditional Unix filesystem is checked before being mounted, which can take a long time
– A journaling filesystem avoids running time-consuming consistency checks on the whole filesystem
– Instead, it looks in a special disk area, named the journal, that contains the most recent disk write operations

EXT3 Journaling

• The idea behind Ext3 journaling
– First, a copy of the blocks to be written is stored in the journal
– When the I/O data transfer to the journal is completed (in short, the data is committed to the journal), the blocks are written in the filesystem

• When a system failure occurs before a commit to the journal
– Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete
– e2fsck ignores them.

• When a system failure occurs after a commit to the journal
– The copies of the blocks are valid, and e2fsck writes them from the journal into the filesystem.

EXT3 Journaling(3)

• The first block in the journal is the journal superblock, which contains the address of the first logging data and its sequence number.

• Updates are done in transactions, and each transaction has a sequence number.

• Each transaction starts with a descriptor block that contains the transaction sequence number and a list of the blocks being updated.

• Following the descriptor block are the updated blocks.

• When the updates have been written to disk, a commit block is written with the same sequence number.

• Checkpoint = writing the transaction's blocks to their final location on disk
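The role of the sequence numbers and the commit block during recovery can be sketched with a simplified in-memory model. The struct below is a hypothetical summary of what a journal scan finds, not the JBD on-disk format. The rule it encodes, replaying only transactions whose commit block carries the expected sequence number, is the one described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of one transaction found while scanning the journal. */
struct journal_txn {
    uint32_t descriptor_seq;  /* sequence number in the descriptor block */
    bool     has_commit;      /* was a commit block found? */
    uint32_t commit_seq;      /* sequence number in the commit block */
};

/* Count the transactions that recovery may replay, starting at first_seq.
 * Scanning stops at the first transaction that is out of sequence
 * or lacks a matching commit block. */
static size_t replayable_count(const struct journal_txn *log, size_t n,
                               uint32_t first_seq)
{
    size_t i;
    uint32_t expected = first_seq;

    for (i = 0; i < n; i++) {
        if (log[i].descriptor_seq != expected)
            break;
        if (!log[i].has_commit || log[i].commit_seq != expected)
            break;
        expected++;
    }
    return i;
}
```

A transaction that was still being written when the system failed has no commit block, so it and everything after it are ignored.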

Ext3 Journaling(2)

Before committing, file manipulations are gathered into a unit called a “transaction”.

[Figure: manipulations A, B, and C are gathered into one transaction. In the journal section the transaction is written as a descriptor block, the updated blocks, and a commit block.]

EXT3 Journaling modes

There are three journaling modes

Mode        Role                                                                Pros & Cons
Journal     All filesystem data and metadata changes are logged                 Safest and slowest
Ordered     Only changes to filesystem metadata are logged into the journal;    Default Ext3 journaling mode
            data blocks are written to disk before the metadata is committed
Writeback   Only changes to filesystem metadata are logged                      Fastest mode but not safe; this is the method
                                                                                found on the other journaling filesystems

Ext3 – Journal Structure

The journal section is located in the filesystem itself or in another partition.
• In the filesystem, inode number 8 points to the journal section (.journal)
• It has no directory entry, so users cannot see it.

[Figure: the journal is a circular buffer. The journal super block (s_start) points to the first transaction; each transaction consists of a descriptor block (with a header), the logged blocks, and a commit block.]

EXT3 JBD(Journaling Block Device) Layer

JBD must also protect itself from system failures that could corrupt the journal. It does so via three fundamental units:
• Log Record
 - Describes a single update of a disk block of the journaling filesystem
• Atomic Operation Handle
 - Includes the log records relative to a single high-level change of the filesystem
 - Typically, each system call modifying the filesystem gives rise to a single atomic operation handle
 - To start an atomic operation, the Ext3 filesystem invokes the journal_start() JBD function, which allocates, if necessary, a new atomic operation handle and inserts it into the current transaction
• Transaction
 - Includes several atomic operation handles whose log records are marked valid for e2fsck at the same time.

[Figure: a file operation produces log records (blocks), which are grouped into a transaction]

EXT3 JBD(Journaling Block Device) Layer(2)

How a transaction works
• Complete: all log records included in the transaction have been written to the journal (e2fsck can use them)
 - t_state = T_FINISHED
• Incomplete: e2fsck ignores an incomplete transaction
 - t_state can be set to one of these flags: T_RUNNING, T_LOCKED, T_FLUSH, T_COMMIT

How Journaling Works(2)

<Ordered Mode>

[Figure: timeline with the phases Preparation, JBD Write Operation, and Done!, marked by Start, Commit complete, and CheckPoint]
• journal_get_write_access(): registers the target buffer head with JBD
• kjournald2

EXT4-Overview

• EXT4: stable code as of October 2008, in Linux 2.6.28
– preliminary development version in Linux 2.6.19
– easy to upgrade from ext3
– utilizes the previous work and focuses on adding advanced features
– a new scalable, enterprise-ready file system in a short time

• Maintainers
– Theodore Ts'o
– Andreas Dilger

EXT4-Usage

2010/1/15 Google announced that it would upgrade its storage infrastructure from ext2 to ext4.

2010/12/14 Google announced they would use ext4, instead of YAFFS on Android 2.3

EXT4-features

• Bigger file/filesystem size support.
– Compared to ext3, ext4 supports files 8 times larger
– and filesystems 65536 times larger.

• I/O performance improvement
– delayed allocation, multi-block allocator, extent map, and persistent preallocation
– Fast fsck: flex_bg and uninit_bg
– Reliability: journal checksumming
– Maintenance: online defragmentation
– Misc: backward compatibility with ext2/ext3, nanosecond timestamps, subdirectory scalability, etc.

EXT3 vs. EXT4

Scalability Enhancements

• ext3: 16TB file system size limit
– caused by the 32-bit block number
– 4KB (1 block size) X 2^32 (blocks_count: unsigned int) = 16TB

• ext4: 1EB
– 48-bit block numbers
– 4KB X 2^48 = 2^(12+48) B = 1EB (roughly 1000^6 bytes, i.e. 1TB X 1000^2)
– Metadata in the superblock, the group descriptors, and the journal:
  • New fields added for the most significant 32 bits of the block-counter variables s_free_blocks_count, s_blocks_count, and s_r_blocks_count
– JBD -> JBD2 (supports 48-bit block addresses)

Why not 64-bit support?
• 1EB is enough in the current situation
• A 1EB file system would take 119 years to finish one full e2fsck, so reliability becomes an issue
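Both size limits follow from the same shift, sketched below with a hypothetical helper. It assumes a power-of-two block size and a result that fits in 64 bits.

```c
#include <stdint.h>

/* Largest file system size in bytes for a given block size and
 * block-number width. Valid while the result fits in 64 bits. */
static uint64_t max_fs_bytes(uint32_t block_size, unsigned block_number_bits)
{
    return (uint64_t)block_size << block_number_bits;
}
```

max_fs_bytes(4096, 32) is 2^44 bytes (16TB) and max_fs_bytes(4096, 48) is 2^60 bytes (1EB). Their ratio of 2^16 is the 65536 times quoted among the ext4 features.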

Scalability Enhancements(2)

• Extent: represents a range of contiguous physical blocks
• Efficient for representing large files
• Better CPU utilization, fewer metadata I/Os
• One extent: up to 2^15 contiguous blocks (128MB, 1 block = 4KB)
• Up to 4 extents are stored in the ext4 inode structure, after an extent header

[Figure: ext4_inode i_block[EXT4_N_BLOCKS], a 15 * 4 bytes array (60 bytes), holds a 12-byte header followed by extent0 to extent3 of 12 bytes each]

Scalability Enhancements(3)

/* This is the extent on-disk structure. It's used at the bottom of the tree. */
struct ext4_extent {
    __le32 ee_block;     /* first logical block extent covers */
    __le16 ee_len;       /* number of blocks covered by extent */
    __le16 ee_start_hi;  /* high 16 bits of physical block */
    __le32 ee_start_lo;  /* low 32 bits of physical block */
};

/* This is index on-disk structure. It's used at all the levels except the bottom. */
struct ext4_extent_idx {
    __le32 ei_block;     /* index covers logical blocks from 'block' */
    __le32 ei_leaf_lo;   /* pointer to the physical block of the next
                          * level. leaf or next index could be there */
    __le16 ei_leaf_hi;   /* high 16 bits of physical block */
    __u16  ei_unused;
};

struct ext4_extent_header {
    __le16 eh_magic;       /* probably will support different formats */
    __le16 eh_entries;     /* number of valid entries */
    __le16 eh_max;         /* capacity of store in entries */
    __le16 eh_depth;       /* has tree real underlying blocks? */
    __le32 eh_generation;  /* generation of the tree */
};

• ee_len is 16 bits wide and its MSB is reserved as a flag, which is why the # of contiguous blocks per extent is 2^15
• eh_magic: tells extent-mapped data apart from block-mapped data, for robustness
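How the split fields combine can be sketched as follows. The struct is a host-endian stand-in for ext4_extent, with the little-endian conversion left out. The helper names are illustrative. The threshold mirrors the 2^15 limit on the slide, with lengths above it marking an uninitialized (preallocated) extent.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXT_INIT_MAX_LEN 32768u  /* 2^15 blocks */

/* Host-endian copy of the on-disk extent fields. */
struct extent_fields {
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start_lo;
};

/* 48-bit physical start block: high 16 bits above the low 32 bits. */
static uint64_t extent_start(const struct extent_fields *e)
{
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start_lo;
}

/* A length above 2^15 marks a preallocated (uninitialized) extent. */
static bool extent_is_uninitialized(const struct extent_fields *e)
{
    return e->ee_len > EXT_INIT_MAX_LEN;
}

/* Number of blocks covered, with the uninitialized marker removed. */
static uint32_t extent_length(const struct extent_fields *e)
{
    return e->ee_len <= EXT_INIT_MAX_LEN ? e->ee_len
                                         : e->ee_len - EXT_INIT_MAX_LEN;
}
```

An extent with ee_start_hi = 1 and ee_start_lo = 2 starts at physical block 0x100000002, which a single 32-bit field could not express.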

Scalability Enhancements(4)

Scalability Enhancements(5)

• Large files
– Ext3 file size is limited by the i_blocks counter in Linux.
  • File system level: block size 4KB, max file size 4TB = ((4KB/4B)^3 X 4KB)
  • Linux limitation: i_blocks is counted in sectors (512B), so 2^32 X 512B = 2TB
– ext4: feature HUGE_FILE added
  • 32-bit logical block numbers with extents, 2^32 X 4KB = 16TB

• Large number of files
– Ext3 allocates inodes statically, so the fixed number of inodes limits the # of files
– Dynamic inode tables: a cluster of contiguous inode table blocks (ITBC) can be allocated on demand.
– 64-bit inode layout:
  • 15-bit relative block number: 2^15 = 4K X 8 bits (block bitmap)
  • 4-bit offset: 4KB (1 block) / 256B (default ext4 inode structure) = 2^4 (16)

Scalability Enhancements(6)

• Directory scalability
– ext3: maximum of 32,000 subdirectories; entries are kept in a linked list, which is very inefficient with large numbers of entries
– ext4: stores directory entries in a constant-depth HTree data structure
  • (a specialized BTree-like structure using 32-bit hashes)

• Large inodes and fast extended attributes
– In ext3, the default inode structure size is 128 bytes (already crowded)
– In ext4, the default inode structure size is 256 bytes
– fixed-field section: nanosecond timestamps, fast extended attributes (EAs)

Reliability Enhancements

• Reliability is very important to ext3 and is one of the reasons for its vast popularity.
– robust metadata design, internal redundancy at various levels, and built-in integrity checking using checksums.
– The speed at which a file system is recovered after corruption is also important.

• Unused inode count and fast e2fsck
– (next slide)

• Checksumming
– Adding metadata checksumming
– easily detects corruption and avoids blindly trusting the data
– group descriptors and the journal have a checksum added

Reliability Enhancements(2)

• Unused inode count and fast e2fsck
– The uninitialized groups and inode table high watermark feature allows much of the lengthy e2fsck pass 1 scanning to be safely skipped.
– reduces the total time taken by e2fsck by 2 to 20 times
– enabled at mke2fs time or using tune2fs via the “-O uninit_groups” option.
– the kernel stores the number of unused inodes at the end of each block group’s inode table.

• EXT3
– e2fsck time grows linearly with the total number of inodes, regardless of how many are used.
– e2fsck takes the same amount of time with zero used files as with 2.1M used files.

• EXT4 with the unused inode high watermark feature
– e2fsck time depends only on the number of used inodes.

[Figure: e2fsck time for ext3 with 0, 100k, and 2.1M files and for ext4 with 100k and 2.1M files]

Block Allocation Enhancements

• Persistent preallocation
– Preallocates blocks for a file up-front
– Useful for DBs and streaming media servers
– ensures contiguous allocation as far as possible for a file
– blocks are allocated but uninitialized
– The MSB of the extent length field indicates whether a given extent contains uninitialized data.

• Delayed block allocation
– block allocations are postponed to page flush time rather than done during the write()
– Combines many block allocation requests into a single request
  • Reduces fragmentation and saves CPU cycles.
  • Avoids unnecessary block allocation for short-lived files
– There is a trade-off between performance and reliability
– 30% improved throughput, 50% reduction in CPU usage

Block Allocation Enhancements(2)

• Online defragmentation
– with age, the filesystem can still become quite fragmented
– e4defrag
  • Creates a temporary inode and allocates contiguous extents using multiple block allocation
  • Copies the original file data to the page cache and flushes the dirty pages to the temporary inode’s blocks
  • Migrates the block pointers from the temporary inode to the original inode

Ext3 Vs Ext4 in terms of Scalability

Problems with the Ext3 block allocator
• Lack of free extent information across the file system
 - Uses only the bitmap to search for free blocks to reserve
 - Searches for free blocks only inside the reservation window
• Doesn’t differentiate allocation for small / large files

Multiple Blocks Allocator

• EXT3 block reservation
– subsequent block requests for a file are served before they get interleaved with requests from other files
– per-file reservation window

• EXT4 Multiple Blocks Allocator
– Different strategies for different allocation requests
– Per-block-group buddy cache
  • Contiguous multiple blocks are allocated at once to prevent file fragmentation.
  • Builds per-block-group free extent information based on the on-disk block bitmap to guide the search for free extents
  • Generated at filesystem mount time and stored in memory using a buddy structure.

Multiple Blocks Allocator(2)

• Different strategies for different allocation requests
– Better allocation for small and large files

• The Ext4 multiple block allocator maintains two preallocated spaces
– Small allocation requests:
  • per-CPU locality group preallocation
  • used so that small files are placed closer together on disk
– Large allocation requests:
  • per-file (per-inode) preallocation
  • used so that larger files are less interleaved

• Which preallocation space to use
– depends on the total size derived from the current file size and the allocation request size.
– If the total size < stream_req blocks, the per-CPU locality group preallocation space is used.
– Default is 16 (/proc/fs/ext4/<partition>/stream_req)
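The choice between the two spaces can be sketched as below. The slide does not say how the total size is derived, so this sketch assumes a plain sum of the current file size and the request size, both in blocks. The enum and function names are hypothetical, and the strict "<" follows the slide.

```c
#include <stdint.h>

enum prealloc_space {
    PREALLOC_LOCALITY_GROUP,  /* per-CPU, for small files */
    PREALLOC_PER_INODE        /* per-file, for large files */
};

#define DEFAULT_STREAM_REQ 16u  /* blocks, the default quoted on the slide */

/* Assumption: total size = current file size + request size, in blocks. */
static enum prealloc_space choose_prealloc(uint64_t file_blocks,
                                           uint64_t request_blocks,
                                           uint64_t stream_req)
{
    uint64_t total = file_blocks + request_blocks;
    return total < stream_req ? PREALLOC_LOCALITY_GROUP : PREALLOC_PER_INODE;
}
```

With the default of 16, a 4-block file asking for 4 more blocks stays in the locality group, while a 10-block file asking for 10 more uses its per-inode preallocation.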

Multiple Block Allocator(3)

• Per-block-group buddy cache
– Used when blocks can’t be allocated from the preallocation
– Contiguous free blocks of a block group are managed by the buddy system in memory (2^0 to 2^13).

Multiple Blocks Allocator(4)

• Per-block-group buddy cache
– Blocks unused by the current allocation are added to the inode preallocation
– With inode preallocation, these blocks are assigned preferentially when the next block allocation comes. Consequently, contiguous multiple blocks are used.
– A file smaller than 16 blocks is added to the per-CPU locality group to pack small files together