File system concepts
• Ease of searching for specific data
– File to group data: variable size, naming
– Directory to group files
[Figure: a directory holds (file name, file offset) entries, each pointing to a file's data]
Unix file systems history
[Figure: timeline, 1970 to 2010]
• Unix file system (System V, 1974)
• Berkeley fast file system (BSD 4.2, 1984)
• HFS (1985)
• Minix file system (Minix, 1987)
• Journaling file system (AIX, 1990)
• Log-structured file system (1991)
• Extended file system (Linux, 1992)
• Ext2 file system (1993)
• XFS (IRIX, 1994)
• HFS+ (1998)
• Journaling file system (OS/2, 1999)
• Ext3 file system (2001)
• Journaling file system (Linux, 2001)
• XFS (Linux, 2002)
• Ext4 file system (2008)
• BTRFS (2009)
• F2FS (2012)
DOS/Windows file systems history
• File Allocation Table
– FAT (8bit, 1977) / FAT12 (1980) / FAT16 (1984)
  • Targeted at floppy disks
– HPFS (OS/2, 1989)
– FAT32/VFAT (1996)
– exFAT (2006)
• NTFS
– Since Windows NT 3.1 (1993)
Network/distributed file systems
• Network file systems
– Mount remote file system to local directory
– Network File System
– Server Message Block / CIFS (Samba)
– AppleTalk Filing Protocol
• Distributed file system
– Share storage device to build a large file system
– Andrew File System
– Google file system
– Hadoop file system (HDFS)
File system interfaces
• R. C. Daley, P. G. Neumann, A General-Purpose File System For Secondary Storage, 1965
– Defined what a file system is and how it works
– Concepts of user, file, directory, directory hierarchy
– Backup storage and its usage
  • Incremental backup / weekly full backup recovery
• POSIX [IEEE 1003 / Richard Stallman / 1988]
– Standardized file system interfaces
– Standard I/O API
– Direct I/O API
– Memory mapped I/O API
File system interface : stream I/O
• Buffered and line-by-line I/O interface
• Header: <stdio.h>
• Handle: FILE *fp;
• Functions
– fopen, fclose
– fprintf, fscanf
– fgets, fputs
– fread, fwrite
– fseek, ftell
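#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fp;
    char *str;

    /* fopen returns NULL on failure. */
    if ((fp = fopen("main.c", "r")) != NULL)
    {
        str = malloc(4096);
        /* fgets reads at most size - 1 bytes and NUL-terminates the line. */
        while (str != NULL && fgets(str, 4096, fp))
            printf("%s", str);
        fclose(fp);
        free(str);
    }
    return 0;
}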
File system interface : direct I/O
• Header: <fcntl.h>, <unistd.h>, …
• Handle: int fd;
• Functions
– open, creat, close
– read, write
– lseek, lseek64
– posix_fallocate, posix_fadvise
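#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd;
    void *buf;
    ssize_t n;

    /* open() takes O_* flags, not an fopen-style mode string; it returns -1 on failure. */
    if ((fd = open("main.c", O_RDONLY)) >= 0)
    {
        buf = malloc(4096);
        /* Write only the n bytes actually read; the last chunk is usually short. */
        while (buf != NULL && (n = read(fd, buf, 4096)) > 0)
            write(STDOUT_FILENO, buf, (size_t)n);
        close(fd);
        free(buf);
    }
    return 0;
}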
File system interface : mmap I/O
• Memory access to read/write a file
• Header: <sys/mman.h>
• Handle: void *ptr;
• Functions
– void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
– int munmap(void *addr, size_t length)
File system interface : mmap I/O
• Example
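#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    int fd;
    off_t length;
    void *buf;

    /* open() takes O_* flags and returns -1 on failure. */
    if ((fd = open("main.c", O_RDONLY)) >= 0)
    {
        /* Seeking to the end returns the file size in bytes. */
        length = lseek(fd, 0, SEEK_END);
        if (length > 0)
        {
            buf = mmap(NULL, (size_t)length, PROT_READ, MAP_PRIVATE, fd, 0);
            /* mmap reports failure with MAP_FAILED, not NULL. */
            if (buf != MAP_FAILED)
            {
                write(STDOUT_FILENO, buf, (size_t)length);
                munmap(buf, (size_t)length);
            }
        }
        close(fd);
    }
    return 0;
}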
Stream I/O illustrated
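[Figure: stream I/O call path across the Application, libc, VFS, and Page cache layers]
• fopen → open → sys_open
• fgets → read → sys_read; the data ("Hello, Guys") is held in the libc buffer, and a second fgets is shown against the same buffer
• fprintf ("Hello, World") → fflush → write → sys_write
• fclose → close → sys_close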
Memory mapped I/O illustrated
[Figure: memory mapped I/O call path across the Application, libc, VFS, and Page cache layers; the sample file contents are Korean text]
• mmap → sys_mmap
• c=buf[0] → page fault → aops->readpage()
• buf[1]='\n' → page replacement → aops->writepage()
• munmap → sys_munmap
File system design elements
• Space allocation
– Contiguous allocation vs. fragmented allocation
– File to block mapping management
– Managing free space
• Name space management
– File naming: name length, case sensitivity, …
  • e.g., early UNIX file system / FAT uses the 8.3 naming scheme
– Directory hierarchy
  • Single-level array
  • Tree-structured multi-level directory
  • Graph-structured directory
Disk layout and file abstraction
• Abstractions in file system
– File data
– Inode: per-file metadata
  • name, size, data location, modified time, owner, …
– Directory hierarchy
– Superblock
– Meta data for free space management
[Figure: open question of where to place File a block 0, File a block 1, Inode a, Dir b, and the Superblock on disk]
Allocated/free space management
• Bitmap approach (ext*fs)
– Low storage overhead
– High free space search cost
• Linked List approach (FAT)
– Low free space search cost
[Figure: example block bitmap 11011000]
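The search cost of the bitmap approach can be seen in a minimal sketch. It assumes that a 1 bit means "allocated", that bit 0 is the leftmost bit of the example bitmap, and that the function names `bitmap_find_free` and `bitmap_set` are hypothetical helpers, not part of any real file system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first free (0) bit, or -1 if every block is
 * allocated. Bit 0 is the most significant bit of map[0], matching the
 * left-to-right reading of a bitmap such as 11011000 (assumption). */
long bitmap_find_free(const uint8_t *map, size_t nblocks)
{
    assert(map != NULL);
    for (size_t i = 0; i < nblocks; i++) {
        uint8_t mask = (uint8_t)(0x80u >> (i % 8));
        if ((map[i / 8] & mask) == 0)
            return (long)i;      /* linear scan: cost grows with nblocks */
    }
    return -1;
}

/* Marks block i as allocated. */
void bitmap_set(uint8_t *map, size_t i)
{
    map[i / 8] |= (uint8_t)(0x80u >> (i % 8));
}
```

With the bitmap 11011000 stored as the byte 0xD8, the first free block is index 2; once it is marked allocated, the next search has to scan further, which is the "high search cost" noted above.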
Allocated/free space management
• Tree-based approach
– Inode and indirect blocks
– Extents: (start block number, contiguous blocks)
[Figure: an inode (file name, attributes) holds direct block pointers plus single, double, and triple indirect pointers; each indirect block points either to data blocks or to further indirect blocks]
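To make the direct/indirect layout concrete, the sketch below classifies a logical block number by how many levels of indirection are needed to reach it. The function name `indirect_level` is hypothetical, and the values used in the usage note (12 direct pointers, 1024 pointers per indirect block) are assumptions typical of a 4 KB block with 4-byte block numbers, not figures from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * or -1 if the logical block lies beyond what the inode can address. */
int indirect_level(uint64_t lblock, uint32_t ndirect, uint32_t ptrs_per_block)
{
    assert(ptrs_per_block > 0);
    uint64_t p = ptrs_per_block;

    if (lblock < ndirect)
        return 0;
    lblock -= ndirect;
    if (lblock < p)
        return 1;
    lblock -= p;
    if (lblock < p * p)
        return 2;
    lblock -= p * p;
    if (lblock < p * p * p)      /* fits in 64 bits for realistic p */
        return 3;
    return -1;
}
```

Under those assumed values, logical blocks 0 to 11 are direct, 12 to 1035 need one indirect lookup, and block 1036 is the first that needs a double indirect lookup.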
Allocated/free space management
• Tree-based approach
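– B-tree (XFS, btrfs, …)
  • Useful for extent-based allocation
[Figure: two B-trees of (start block, length) extents. File allocation: (1, 3), (7, 1), (10, 4), indexed by cumulative length 3, 4, 8 so that they cover logical blocks 1 to 8. Free space: (14, 5), (4, 3), (8, 2), (0, 1), indexed by extent length 5, 3, 2.]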
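The extent mapping in the figure, (1, 3), (7, 1), (10, 4), can be sketched as a lookup from a logical block to a physical block. This is a simplified illustration: it scans a flat array instead of descending a B-tree, it numbers logical blocks from 0 rather than 1, and the names `struct extent` and `extent_lookup` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct extent {
    unsigned long start;   /* first physical block of the run */
    unsigned long count;   /* number of contiguous blocks */
};

/* Maps a 0-based logical block to its physical block.
 * Returns -1 when the logical block is past the end of the file. */
long extent_lookup(const struct extent *ext, size_t n, unsigned long logical)
{
    assert(ext != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (logical < ext[i].count)
            return (long)(ext[i].start + logical);
        logical -= ext[i].count;   /* skip this extent */
    }
    return -1;
}
```

For the three extents above, the first three logical blocks map to physical blocks 1 to 3, the fourth maps to block 7, and the last four map to blocks 10 to 13. A B-tree keyed by cumulative length replaces the linear scan with a logarithmic search.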
Directory implementation
• Array
– Easy to manage
– File name length limit
• Linear list
– Variable length file name
– Hard to manage
• Hash table
– Indexed by file name: fast search
– Hash collision
[Figure: an array of fixed-size name slots (RUN.EXE, README.TXT, DATA.DB, …) next to a linear list whose variable-length entries can also hold "Long named file.docx"]
Characteristics (FAT file system)
• Background: 1970s
– Personal computer
– Floppy disks (~ 1MB)
• 8.3 name space
– Case insensitive
– Long name format extension
• No protection mechanism
• No consistency guarantee
– Repair tools: chkdsk, scandisk
• File data location management
– Linked list approach
• FAT entry (1 entry / 1 cluster)
– Next cluster number (cluster: 512 bytes ~ 32 KB)
– 0: free, -1: end of file
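[Figure: disk layout of Boot block, FAT, Backup FAT, Root dir., Data. The root directory entry for A.EXE points to its first cluster; the file allocation table entries 00003, 00005, 00006, -1 chain the file's clusters, and free entries hold 0.]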
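Following a file's cluster chain through the table can be sketched as below. The sketch keeps the slide's convention (0 = free, -1 = end of file) rather than the real on-disk end-of-chain markers, and the name `fat_chain` and the toy table in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAT_FREE 0      /* slide convention: entry 0 means free */
#define FAT_EOF  (-1)   /* slide convention: -1 ends the chain */

/* Walks the chain starting at cluster `first`, storing up to `max`
 * cluster numbers in `out`. Returns the chain length. The walk stops on
 * FAT_EOF, on a free or out-of-range entry (corrupt chain), or after
 * nfat steps (cycle guard). */
size_t fat_chain(const int32_t *fat, size_t nfat, int32_t first,
                 int32_t *out, size_t max)
{
    assert(fat != NULL);
    size_t n = 0;
    int32_t c = first;

    while (c != FAT_EOF && c > FAT_FREE && (size_t)c < nfat && n < nfat) {
        if (n < max)
            out[n] = c;
        n++;
        c = fat[c];     /* each entry holds the next cluster number */
    }
    return n;
}
```

With a toy table `{0, 0, 3, 5, 0, 6, -1, 0}` and a file whose first cluster is 2, the walk visits clusters 2, 3, 5, 6. Reading the n-th cluster of a file therefore costs n table lookups, which is the price of the linked-list approach.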
Directory
• A special file made of 32-byte directory entries
• Entries
– File name: 11 bytes (name 8, extension 3)
– Attributes
  • Read-only, hidden, system, sub-directory, archive, long file name
– ctime, atime, mtime
  • Bit widths: year (7), month (4), day (5), hour (5), min (6), second/2 (5)
– First data cluster
– File size (max. 4 GB)
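The timestamp bit widths listed above, year (7), month (4), day (5), hour (5), min (6), second/2 (5), pack into two 16-bit words. A minimal sketch follows; the helper names are hypothetical, and the year field is taken as an offset from 1980, which is how FAT stores it but is not stated on the slide.

```c
#include <assert.h>
#include <stdint.h>

struct fat_datetime { int year, month, day, hour, min, sec; };

/* Date word: year-1980 in bits 15..9, month in 8..5, day in 4..0. */
uint16_t fat_pack_date(int year, int month, int day)
{
    assert(year >= 1980 && year <= 2107);   /* 7 bits cover 128 years */
    return (uint16_t)(((year - 1980) << 9) | (month << 5) | day);
}

/* Time word: hour in bits 15..11, minute in 10..5, second/2 in 4..0. */
uint16_t fat_pack_time(int hour, int min, int sec)
{
    return (uint16_t)((hour << 11) | (min << 5) | (sec / 2));
}

struct fat_datetime fat_unpack(uint16_t date, uint16_t time)
{
    struct fat_datetime t;
    t.year  = 1980 + (date >> 9);
    t.month = (date >> 5) & 0x0F;
    t.day   = date & 0x1F;
    t.hour  = time >> 11;
    t.min   = (time >> 5) & 0x3F;
    t.sec   = (time & 0x1F) * 2;            /* 2-second resolution */
    return t;
}
```

Because only second/2 is stored, an odd second such as 13:45:59 comes back as 13:45:58 after a round trip.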
Long name extension
• Combining consecutive directory entries
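– First entry: normal directory entry (first 11 characters)
– LFN entries
  • File name segment: 26 bytes
  • Reserved critical entries
    – First data cluster
    – File type, sequence number, etc.
[Figure: the name "Introduction to File System.pptx" split across entries. The normal entry holds "Introductio" with attributes, ctime, atime, mtime, first data cluster (FDC), and length; the LFN entries hold "n to File" and "System.pptx", each with a sequence number, a file type, and a first-cluster field kept for compatibility.]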
Boot sector
• Boot strap
• File system summary
– File system size (sectors)
– Logical sector size
– Cluster size
– # of FATs
– Root directory entries
  • Root directory first cluster
– Volume label
– Drive number
Free space management
• Next free cluster pointer
– FAT32 maintains the last allocated cluster number
  • Makes it possible to undelete recently deleted files
– Produces fragmentation
[Figure: FAT entries 0 0 0 0 00003 00005 00006 -1 with a pointer to the last allocated cluster]
Characteristics (ext file system)
• Background
– Linux operating system: multi-user
– Evolving from desktop to server and real-time systems
• Based on block groups
– Each block group works as an independent file system
– Inode, directory, file data
• Inodes for allocation and attribute management
• Journaling support from ext3
Block group
• Ext file system = an array of block groups
• Block group size: determined by block size
– 4 KB block → 128 MB block group
– Why? The data block bitmap must fit in a single block
• Block group descriptor fields: bg_block_bitmap, bg_inode_bitmap, bg_inode_table, bg_free_blocks_count, bg_free_inodes_count, …
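The 128 MB figure follows from the bitmap constraint: one bitmap block of B bytes holds 8 × B bits, one per data block, so a group spans 8 × B blocks of B bytes each. A small sketch of that arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap block holds block_size * 8 bits, one bit per data block. */
uint64_t blocks_per_group(uint32_t block_size)
{
    assert(block_size > 0);
    return (uint64_t)block_size * 8;
}

/* Group size in bytes = blocks per group * block size = 8 * B * B. */
uint64_t group_size_bytes(uint32_t block_size)
{
    return blocks_per_group(block_size) * block_size;
}
```

For a 4096-byte block this gives 32768 blocks per group and 134217728 bytes, that is 128 MB; a 1024-byte block gives only 8 MB per group.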
Directory
• ext3 and later support HTree: a hashed index for directory entry lookup [Daniel Phillips, A Directory Index for Ext2, Linux Symposium '02]
Free space management
• Data block bitmap / inode bitmap in each block group
• Block allocation rule
– Top-level directory's inode
  • In an empty block group, if possible
  • Otherwise, in the block group with the most free inodes
– Other inodes and data blocks
  • In the block group where its inode or parent resides, if possible
  • Otherwise, in the nearest following block group with more free blocks than average
[Figure: top-level directories /usr, /home, /var, /etc placed in different block groups]
Storage implementation layers
[Figure: layered storage stack, top to bottom]
• Virtual File System
• File systems: Ext4, FAT, NFS, FUSE, YAFFS
• Page cache
• Block device, device mapper, network stack, MTD
• I/O scheduler: CFQ, noop, anticipatory
• Device driver, FTL
Introduction to VFS
• Hosts different file system implementations
– IPC mechanisms (PIPE, FIFO, socket, …) too
• Abstracts generic file system implementations
– Directory traversal
– Page cache
• Interfaces with POSIX system call APIs
– File descriptor management
[Figure: one directory tree mixing file systems: / with usr (ext4), home (btrfs), boot (squashfs), local (xfs), vmware.socket (socket)]
VFS implementation
• Maps system calls to file-system-specific methods
• Generic objects
– Superblock: specific file system
– Inode: specific file
– Dentry: a directory entry
– File: an open file
VFS operations
• File system specific operations
– Pseudo object-oriented programming model
  • File system specific object: e.g., sb->s_fs_info
  • File system specific operations: e.g., sb->s_op->sync_fs()
– Generic object + FS specific object + operations = VFS
• VFS internal objects
– To handle file system status
  • struct file_system_type: file system mounting
  • struct vfsmount: file system mount point
  • struct files_struct: file descriptor management
    – struct file *fd_array[NR_OPEN_DEFAULT]
  • struct fs_struct: process status (working dir, …)
– Dentry cache
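The "generic object + FS specific object + operations" model can be illustrated in user space with a table of function pointers. This is only a sketch of the pattern: the structures are cut down to two fields, `sync_fs` is given a simplified one-argument signature, and `toyfs`, `toyfs_info`, and `vfs_sync` are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct super_block;

/* Operations table: each file system supplies its own function pointers. */
struct super_operations {
    int (*sync_fs)(struct super_block *sb);
};

/* Generic object shared by every file system. */
struct super_block {
    const struct super_operations *s_op;  /* FS-specific methods */
    void *s_fs_info;                      /* FS-specific private data */
};

/* Hypothetical file system "toyfs" with its own private state. */
struct toyfs_info { int dirty; int syncs; };

static int toyfs_sync_fs(struct super_block *sb)
{
    struct toyfs_info *info = sb->s_fs_info;
    info->dirty = 0;
    info->syncs++;
    return 0;
}

static const struct super_operations toyfs_ops = { .sync_fs = toyfs_sync_fs };

/* Generic layer: dispatches without knowing which file system it is. */
int vfs_sync(struct super_block *sb)
{
    assert(sb != NULL);
    if (sb->s_op == NULL || sb->s_op->sync_fs == NULL)
        return -1;                        /* operation not implemented */
    return sb->s_op->sync_fs(sb);
}
```

A caller fills in a `struct super_block` with `&toyfs_ops` and a pointer to its `toyfs_info`, then calls `vfs_sync(&sb)`; the generic code reaches the file system only through `s_op`, which is what lets one code path serve many file systems.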
VFS operations example
• sys_open (fs/open.c)
– Main routine: do_sys_open()
– Allocate fd: get_unused_fd_flags()
– Walk path and open a file: do_filp_open()
  • lookup_fast(), __d_lookup(): dentry cache lookup
  • i_op->lookup(dir, dentry): repeat to target inode
  • d_op->d_hash(dentry, name), d_op->d_compare(dentry, name1, name2)
  • f_op->open(inode, file)
  • i_op->create(dir, dentry, mode)
  • s_op->alloc_inode(sb)
Superblock API
• Superblock object
– Per mounted file system instance
• Superblock operations
– Superblock management
  • write_super(), put_super()
– Inode management
  • alloc_inode(), write_inode()
– File system management
  • sync_fs(), free_fs()
• Initialization: get_sb() function
Type Name
list_head s_list
dev_t s_dev
list_head s_inodes
list_head s_files
super_operations s_op
void * s_fs_info
… …
Inode API
• Inode object
• Inode operations
– Inode management
  • create, truncate, setattr, fallocate, …
– Directory management
  • lookup, link, unlink, symlink, mkdir, …
Type Name
super_block i_sb
list_head i_dentry
unsigned long i_ino
atomic_t i_count
uid_t i_uid
struct timespec i_atime
loff_t i_size
address_space i_mapping
inode_operations i_op
file_operations i_fop
void * i_private
File API
• File object
• File operations
– llseek(), read(), write(), mmap()
– open(), release(), flush()
Type Name
struct path f_path
int f_flags
loff_t f_pos
address_space f_mapping
file_operations f_op
void * private_data
Directory API
• Dentry object
• Operations
– d_revalidate
– d_hash
– d_compare
– d_delete, d_release
– d_iput, d_dname
• Almost no need to implement
– Exception: case insensitive file name
Type Name
struct inode d_inode
hlist_node d_hash
struct dentry d_parent
struct qstr d_name
list_head d_subdirs
list_head d_alias
dentry_operations d_op
void* d_fsdata
char d_iname[]
VFS objects relationships
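[Figure: links among VFS objects (superblock, inodes, dentries, files)]
• superblock → inodes: s_inode_list
• inode → superblock: i_sb
• inode → dentries: i_dentry
• dentry → inode: d_inode
• dentry → superblock: d_sb
• file → inode: f_mapping->host
• file → superblock: f_sb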
Page cache implementation
• Integrated with process virtual memory module
• Page associated with inode
– page->mapping, index
• address_space object
– Target file
– Page cache tree (radix)
• address_space operations
– I/O: writepage, readpage, writepages, readpages
– Block allocation and mapping: bmap()
Type Name
struct inode host
radix_tree_root page_tree
long nrpages
address_space_operations a_ops
… …
Delayed allocation
Page cache implementation
• Useful API
– page = find_get_page(mapping, index)
– SetPageDirty(page), ClearPageDirty(page)
• Flush thread
– Write-back dirty pages
  • On-demand, dirty threshold, free threshold
– Per allocation-group
  • Per disk if device mapper is not used
  • Per top-level virtual device if device mapper is applied