P.J.Braam/CMU -- 1
Linux Virtual File System
Peter J. Braam
Aims
• Present the data structures in Linux VFS
• Provide information about flow of control
• Describe methods and invariants needed to implement a new file system
• Illustrate with some examples
File access
History
• BSD implemented VFS for NFS; the aim was to dispatch to different filesystems
• VMS had an elaborate filesystem
• NT/Win95 have VFS type interfaces
• Newer systems integrate VM with buffer cache.
Linux Filesystems
• Media based
– ext2 - Linux native
– ufs - BSD
– fat - DOS FS
– vfat - win 95
– hpfs - OS/2
– minix - well….
– Isofs - CDROM
– sysv - Sysv Unix
– hfs - Macintosh
– affs - Amiga Fast FS
– NTFS - NT’s FS
– adfs - Acorn-strongarm
• Network
– nfs
– Coda
– AFS - Andrew FS
– smbfs - LanManager
– ncpfs - Novell
• Special ones
– procfs - /proc
– umsdos - Unix in DOS
– userfs - redirector to user
Linux Filesystems (ctd)
• Forthcoming:
– devfs - device file system
– DFS - DCE distributed FS
• Varia:
– cfs - crypt filesystem
– cfs - cache filesystem
– ftpfs - ftp filesystem
– mailfs - mail filesystem
– pgfs - Postgres versioning file system
• Linux serves (unrelated to the VFS!)
– NFS - user & kernel
– Coda
– AppleShare - netatalk/CAP
– SMB - samba
– NCP - Novell
Linux is Obsolete
Andrew Tanenbaum
Usefulness
File access
Linux VFS
• Multiple interfaces build up VFS:
– files
– dentries
– inodes
– superblock
– quota
• VFS can do all caching & provides utility functions to FS
• FS provides methods to VFS; many are optional
User level file access
• Typical user level types and code:
– pathnames: “/myfile”
– file descriptors: fd = open(“/myfile”…)
– attributes in struct stat: stat(“/myfile”, &mybuf), chmod, chown...
– offsets: write, read, lseek
– directory handles: DIR *dh = opendir(“/mydir”)
– directory entries: struct dirent *ent = readdir(dh)
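The user-level types listed above can be exercised in one short routine. This is a minimal sketch using only POSIX calls; the function name `user_level_demo` and the scratch file it creates are assumptions made for illustration, not part of the talk.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes `msg` to dir/name, reads it back through a
 * file descriptor, stats it, and finds it again with readdir().
 * Returns the size reported by stat(), or -1 on any failure. */
long user_level_demo(const char *dir, const char *name, const char *msg)
{
    char path[512];
    char back[128];
    struct stat st;
    struct dirent *ent;
    int found = 0;

    assert(dir != NULL && name != NULL && msg != NULL);
    size_t len = strlen(msg);
    if (len >= sizeof(back))
        return -1;
    if (snprintf(path, sizeof(path), "%s/%s", dir, name) >= (int)sizeof(path))
        return -1;

    /* pathname -> file descriptor */
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    /* offsets: write, rewind with lseek, read back */
    ssize_t n = -1;
    if (write(fd, msg, len) == (ssize_t)len && lseek(fd, 0, SEEK_SET) == 0)
        n = read(fd, back, sizeof(back));
    close(fd);
    if (n != (ssize_t)len || memcmp(back, msg, len) != 0) {
        unlink(path);
        return -1;
    }

    /* attributes in struct stat */
    if (stat(path, &st) != 0) {
        unlink(path);
        return -1;
    }

    /* directory handle and directory entries */
    DIR *dh = opendir(dir);
    if (dh == NULL) {
        unlink(path);
        return -1;
    }
    while ((ent = readdir(dh)) != NULL)
        if (strcmp(ent->d_name, name) == 0)
            found = 1;
    closedir(dh);

    unlink(path);
    return found ? (long)st.st_size : -1;
}
```

Every call in the sketch enters the kernel through the VFS, regardless of which file system holds the directory.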
VFS
• Manages kernel level file abstractions in one format for all file systems
• Receives system call requests from user level (e.g. write, open, stat, link)
• Interacts with a specific file system based on mount point traversal
• Receives requests from other parts of the kernel, mostly from memory management
File system level
• Individual File Systems
– responsible for managing file & directory data
– responsible for managing meta-data: timestamps, owners, protection etc
– translates data between
• particular FS data: e.g. disk data, NFS data, Coda/AFS data
• VFS data: attributes etc in standard format
– e.g. nfs_getattr(….) returns attributes in VFS format, acquires attributes in NFS format to do so.
Anatomy of stat system call
sys_stat(path, buf) {
    dentry = namei(path);                       /* establish VFS data */
    if ( dentry == NULL ) return -ENOENT;
    inode = dentry->d_inode;
    rc = inode->i_op->i_permission(inode);      /* call into inode layer of filesystem */
    if ( rc ) { dput(dentry); return -EPERM; }  /* drop the dentry on the error path too */
    rc = inode->i_op->i_getattr(inode, buf);    /* call into inode layer of filesystem */
    dput(dentry);
    return rc;
}
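The dispatch pattern in the stat pseudocode can be sketched in user space with a table of function pointers. Everything here is hypothetical (`toy_inode`, `toy_attr`, the `blockfs_*` functions); it is not the kernel's API, only an illustration of how the VFS calls FS methods and receives attributes translated into one standard format.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Attributes in the common "VFS format" (hypothetical, for illustration). */
struct toy_attr {
    long size;
    int  mode;
};

struct toy_inode;

/* Method table supplied by the file system; permission is optional. */
struct toy_inode_ops {
    int (*permission)(struct toy_inode *);
    int (*getattr)(struct toy_inode *, struct toy_attr *);
};

struct toy_inode {
    const struct toy_inode_ops *i_op;
    long fs_blocks;     /* FS-private: file length in 512-byte blocks */
    int  fs_readonly;   /* FS-private: read-only flag */
};

/* A made-up file system that stores lengths in blocks. */
static int blockfs_permission(struct toy_inode *inode)
{
    (void)inode;
    return 0;           /* always allow in this sketch */
}

static int blockfs_getattr(struct toy_inode *inode, struct toy_attr *out)
{
    /* translate FS-private data into the VFS format */
    out->size = inode->fs_blocks * 512;
    out->mode = inode->fs_readonly ? 0444 : 0644;
    return 0;
}

const struct toy_inode_ops blockfs_iops = { blockfs_permission, blockfs_getattr };

/* VFS side: knows nothing about blocks, only calls through i_op. */
int toy_stat(struct toy_inode *inode, struct toy_attr *buf)
{
    if (inode == NULL)
        return -ENOENT;
    assert(inode->i_op != NULL && inode->i_op->getattr != NULL);
    if (inode->i_op->permission != NULL && inode->i_op->permission(inode) != 0)
        return -EPERM;
    return inode->i_op->getattr(inode, buf);
}
```

The NULL check on `permission` mirrors the idea that many FS methods are optional and the VFS falls back to a default when one is absent.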
Anatomy of fstatfs system call
sys_fstatfs(fd, buf) {   /* for things like “df” */
    file = fget(fd);                                /* translate fd to VFS data structure */
    if ( file == NULL ) return -EBADF;
    superb = file->f_dentry->d_inode->i_super;
    rc = superb->sb_op->sb_statfs(superb, buf);     /* call into superblock layer of filesystem */
    fput(file);                                     /* release the reference taken by fget */
    return rc;
}
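From user level, the same path (fd, then file, then superblock) is reached through calls such as `fstatvfs`, the POSIX interface that tools like `df` can use. A minimal sketch, assuming a POSIX system; the helper name `fs_block_size` is made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Hypothetical helper: returns the block size of the file system that
 * holds `path`, or 0 if the path cannot be opened or queried. */
unsigned long fs_block_size(const char *path)
{
    struct statvfs sv;
    unsigned long bsize = 0;

    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;
    /* the kernel resolves fd to its superblock and asks the FS for statistics */
    if (fstatvfs(fd, &sv) == 0)
        bsize = sv.f_bsize;
    close(fd);
    return bsize;
}
```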
Data structures
• VFS data structures for:
– VFS handle to the file: inode (BSD: vnode)
– User instantiated file handle: file (BSD: file)
– The whole filesystem: superblock (BSD: vfs)
– A name to inode translation: dentry
Shorthand method notation
• super block methods: sss_methodname
• inode methods: iii_methodname
• dentry methods: ddd_methodname
• file methods: fff_methodname
• instead of inode->i_op->lookup we write iii_lookup
namei
VFS                                           FS
struct dentry *namei(parent, name) {
    if (dentry = d_lookup(parent, name))      ddd_hash(parent, name)
                                              ddd_revalidate(dentry)
    else                                      iii_lookup(parent, name)
}
struct inode *iget(ino, dev) {
    /* try cache else .. */                   sss_read_inode(…)
}
Superblocks
• Handle metadata only (attributes etc)
• Responsible for retrieving and storing metadata from the FS media or peers
• Struct superblocks hold things like:
– device, blocksize, dirty flags, list of dirty inodes
– super operations
– wait queue
– pointer to the root inode of this FS
Super Operations (sss_)
• Ops on Inodes:
– read_inode
– put_inode
– write_inode
– delete_inode
– clear_inode
– notify_change
• Superblock manips:
– read_super (mount)
– put_super (unmount)
– write_super (unmount)
– statfs (attributes)
Inodes
• Inodes are VFS abstraction for the file
• Inode has operations (iii_methods)
• VFS maintains an inode cache, NOT the individual FS’s (compare NT, BSD etc)
• Inodes contain an FS specific area where:
– ext2 stores disk block numbers etc
– AFS would store the FID
• Extraordinary inode ops are good for dealing with stale NFS file handles etc.
What’s inside an inode - 1
• caching: list_head i_hash, list_head i_list, list_head i_dentry, int i_count
• Identifies file: long i_ino, int i_dev
• Usual stuff: {m,a,c}time, {u,g}id, mode, size, n_link
What’s inside an inode -2
• Which FS: superblock i_sb, inode_ops i_op
• waiting: wait objects, semaphore, lock
• For mmap, networking: vm_area_struct, pipe/socket info, page information
• FS specific info (block numbers, fids etc): union { ext2fs_inode_info i_ext2; nfs_inode_info i_nfs; coda_inode_info i_coda; .. } u
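The FS-specific area can be pictured as a union embedded in a common structure: the generic fields come first, and each file system uses only its own union member. The sketch below is illustrative only; the struct layouts, field names, and fill functions are assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

enum toy_fs_type { TOY_EXT2, TOY_CODA };

/* Hypothetical per-FS private data. */
struct toy_ext2_info { unsigned long block[4]; };                /* disk block numbers */
struct toy_coda_info { unsigned long volume, vnode, unique; };   /* a FID-like triple */

struct toy_inode {
    unsigned long    i_ino;   /* generic: identifies the file */
    enum toy_fs_type i_fs;    /* generic: which FS owns this inode */
    union {                   /* FS-specific area, one member in use at a time */
        struct toy_ext2_info ext2_i;
        struct toy_coda_info coda_i;
    } u;
};

/* A disk FS records where the data lives on the medium. */
void toy_ext2_fill(struct toy_inode *inode, unsigned long ino, unsigned long first_block)
{
    assert(inode != NULL);
    memset(inode, 0, sizeof(*inode));
    inode->i_ino = ino;
    inode->i_fs = TOY_EXT2;
    for (int i = 0; i < 4; i++)
        inode->u.ext2_i.block[i] = first_block + (unsigned long)i;
}

/* A network FS records the identifier its server understands. */
void toy_coda_fill(struct toy_inode *inode, unsigned long ino,
                   unsigned long volume, unsigned long vnode, unsigned long unique)
{
    assert(inode != NULL);
    memset(inode, 0, sizeof(*inode));
    inode->i_ino = ino;
    inode->i_fs = TOY_CODA;
    inode->u.coda_i.volume = volume;
    inode->u.coda_i.vnode = vnode;
    inode->u.coda_i.unique = unique;
}
```

The union costs each inode the size of its largest member, which is the price of letting the VFS allocate and cache inodes without knowing which file system will fill them in.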
Inode state
• Inode can be on one or two lists:
– (hash & in_use) or (hash & dirty) or unused
– inode has a use count i_count
• Transitions
– unused → hash: iget calls sss_read_inode
– dirty → in_use: sss_write_inode
– hash → unused: call on sss_clear_inode, but if i_nlink = 0: iput calls sss_delete_inode when i_count falls to 0
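The use-count rules above can be sketched as a tiny user-space cache. This is a simplified model, not kernel code: the fixed-size array, the `toy_` names, and the counters are assumptions, and dirty tracking and reclaiming of unused inodes are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NINODES 8

struct toy_inode {
    unsigned long i_ino;
    int i_count;   /* use count */
    int i_nlink;   /* links remaining on the FS */
    int hashed;    /* 1 while the inode can be found in the cache */
};

/* Superblock methods supplied by the file system. */
struct toy_super_ops {
    void (*read_inode)(struct toy_inode *);
    void (*delete_inode)(struct toy_inode *);
};

static struct toy_inode toy_cache[TOY_NINODES];
int toy_reads, toy_deletes;   /* counters so the behaviour can be observed */

static void toyfs_read_inode(struct toy_inode *inode)
{
    inode->i_nlink = 1;       /* pretend the attributes came from storage */
    toy_reads++;
}

static void toyfs_delete_inode(struct toy_inode *inode)
{
    (void)inode;
    toy_deletes++;
}

const struct toy_super_ops toyfs_sops = { toyfs_read_inode, toyfs_delete_inode };

/* Try the cache first; on a miss, take an unused slot and ask the FS to fill it. */
struct toy_inode *toy_iget(const struct toy_super_ops *sop, unsigned long ino)
{
    struct toy_inode *free_slot = NULL;

    for (int i = 0; i < TOY_NINODES; i++) {
        struct toy_inode *in = &toy_cache[i];
        if (in->hashed && in->i_ino == ino) {
            in->i_count++;                 /* cache hit: no FS call */
            return in;
        }
        if (!in->hashed && free_slot == NULL)
            free_slot = in;
    }
    if (free_slot == NULL)
        return NULL;                       /* full; a real cache would reclaim unused inodes */
    free_slot->i_ino = ino;
    free_slot->i_count = 1;
    free_slot->hashed = 1;
    sop->read_inode(free_slot);            /* unused -> hash */
    return free_slot;
}

/* Drop one reference; delete on the FS only when unlinked and no longer used. */
void toy_iput(const struct toy_super_ops *sop, struct toy_inode *inode)
{
    assert(inode != NULL && inode->i_count > 0);
    if (--inode->i_count > 0)
        return;
    if (inode->i_nlink == 0) {
        sop->delete_inode(inode);
        inode->hashed = 0;                 /* hash -> unused */
    }
    /* otherwise the inode stays hashed with count 0, ready for a later hit */
}
```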
Inode Cache
[Figure: the inode cache. Boxes: inode_hashtable, dirty inodes, used inodes, unused inodes, and Fs storage. Arrow labels: sss_read_inode (iget); (mark_inode_dirty); sss_write_inode (sync one); sss_clear_inode (freeing inos) or sss_delete_inode (iput), media fs only.]
Players:
1. iget: if i_count>0 ++
2. iput: if i_count>1 --
3. free_inodes
4. syncing inodes
Red Hat Software sold 240,000 copies of Red Hat Linux in 1997 and expects to reach 400,000 in 1998.
Estimates of installed servers (InfoWorld):
- Linux: 7 million
- OS/2: 5 million
- Macintosh: 1 million
Sales
Inode operations (iii_)
• lookup: return inode
– calls iget
• creation/removal
– create
– link
– unlink
– symlink
– mkdir
– rmdir
– mknod
– rename
• symbolic links
– readlink
– follow link
• pages
– readpage, writepage, updatepage - read or write page. Generic for mediafs.
– bmap - return disk block number of logical block
• special operations
– revalidate - see dentry sect
– truncate
– permission
Dentry world
• Dentry is a name to inode translation structure
• Cached aggressively by VFS
• Eliminates lookups by FS & private caches
– timing on Coda FS: ls -lR 1000 files after priming cache
• linux 2.0.32: 7.2secs
• linux 2.1.92: 0.6secs
– disk fs: less benefit, NFS even more
• Negative entries!
• Namei is dramatically simplified
Inside dentries
• name
• pointer to inode
• pointer to parent dentry
• list head of children
• chains for lots of lists
• use count
Dentry associated lists
[Figure: two panels, “dentry inode relationship” and “dentry tree relationship”. Legend: inode, dentry, d_inode pointer, d_parent pointer.]
• d_alias chains (inode i_dentry list head)
– place: d_instantiate
– remove: dentry_iput
• d_child chains (inode i_dentry list head)
– place: d_alloc
– remove: d_prune, d_invalidate, d_put
Dcache
[Figure: dentry_hashtable (d_hash chains, list head dhash(parent, name)) and unused dentries (d_lru chains). Arrow labels: namei, iii_lookup, d_add; prune, d_invalidate, d_drop.]
• namei tries cache: d_lookup
– ddd_compare
• Success: ddd_revalidate
– d_invalidate if fails
– proceed if success
• Failure: iii_lookup
– find inode
– iget
• sss_read_inode
– finish:
• d_add
– can give negative entry in dcache
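The lookup sequence above, including negative entries, can be modelled in a few lines. This sketch is a deliberate simplification: one directory, a linear scan instead of hash chains, no revalidation, and a made-up file system that knows a single name. All `toy_` identifiers are assumptions for illustration.

```c
#include <assert.h>
#include <string.h>

#define TOY_DCACHE_SIZE 16
#define TOY_NAME_MAX    32

struct toy_dentry {
    char name[TOY_NAME_MAX];
    long ino;      /* 0 marks a negative entry: the name is known not to exist */
    int  in_use;
};

static struct toy_dentry toy_dcache[TOY_DCACHE_SIZE];
int toy_fs_lookups;   /* counts how often the FS had to be asked */

/* Stand-in for iii_lookup: a hypothetical FS that only contains "passwd". */
static long toyfs_lookup(const char *name)
{
    toy_fs_lookups++;
    return strcmp(name, "passwd") == 0 ? 11 : 0;
}

/* Stand-in for d_lookup: search the cache by name. */
static struct toy_dentry *toy_d_lookup(const char *name)
{
    for (int i = 0; i < TOY_DCACHE_SIZE; i++)
        if (toy_dcache[i].in_use && strcmp(toy_dcache[i].name, name) == 0)
            return &toy_dcache[i];
    return NULL;
}

/* Stand-in for d_add: remember the result, positive or negative. */
static void toy_d_add(const char *name, long ino)
{
    for (int i = 0; i < TOY_DCACHE_SIZE; i++) {
        if (!toy_dcache[i].in_use) {
            strncpy(toy_dcache[i].name, name, TOY_NAME_MAX - 1);
            toy_dcache[i].name[TOY_NAME_MAX - 1] = '\0';
            toy_dcache[i].ino = ino;
            toy_dcache[i].in_use = 1;
            return;
        }
    }
    /* cache full: this sketch simply skips caching */
}

/* Returns the inode number for `name`, or 0 if the name does not exist. */
long toy_namei(const char *name)
{
    assert(name != NULL && strlen(name) < TOY_NAME_MAX);
    struct toy_dentry *d = toy_d_lookup(name);
    if (d != NULL)
        return d->ino;                 /* cache hit, possibly a negative one */
    long ino = toyfs_lookup(name);     /* cache miss: ask the file system */
    toy_d_add(name, ino);              /* ino == 0 caches a negative entry */
    return ino;
}
```

Repeated lookups of a missing name reach the FS only once, which is the point of negative entries.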
Dentry methods
• ddd_revalidate: can force new lookup
• ddd_hash: compute hash value of name
• ddd_compare: are names equal?
• ddd_delete, ddd_put, ddd_iput: FS cleanup opportunity
Dentry particulars:
• ddd_hash and ddd_compare have to deal with extraordinary cases for msdos/vfat:
– case insensitive
– long and short filename pleasantries
• ddd_revalidate -- can force new lookup if inode not in use:
– used for NFS/SMBfs aging
– used for Coda/AFS callbacks
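For a case-insensitive file system, the hash and compare methods must agree: two names that compare equal have to land on the same hash chain. A minimal sketch of such a pair, assuming ASCII names only; the function names and the djb2-style hash are choices made for this example, not what msdos/vfat actually use, and long/short name handling is omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hash a name after folding each byte to lower case. */
unsigned long toy_ci_hash(const char *name)
{
    unsigned long h = 5381;   /* djb2-style string hash */

    assert(name != NULL);
    for (; *name != '\0'; name++)
        h = h * 33 + (unsigned long)tolower((unsigned char)*name);
    return h;
}

/* Returns 0 when the names are equal ignoring ASCII case, nonzero otherwise. */
int toy_ci_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 1;
        a++;
        b++;
    }
    return *a != *b;   /* equal only if both strings ended together */
}
```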
Dijkstra probably hates me
Linus Torvalds
Style
Memory mapping
• vm_area structure has
– vm_operations
– inode, addresses etc.
• vm_operations
– map, unmap
– swapin, swapout
– nopage -- read when page isn’t in VM
• mmap
– calls on iii_readpage
– keeps a use count on the inode until unmap
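Seen from user level, the mmap path looks like the sketch below: the file is mapped, the descriptor is closed, and the pages are still readable because the mapping itself keeps the file in use until `munmap`. Only POSIX calls are used; the helper name `mmap_roundtrip` and the scratch file are assumptions for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes `msg` to `path`, maps the file, and checks the
 * mapped bytes. Returns 1 if they match, 0 if not, -1 on error. */
int mmap_roundtrip(const char *path, const char *msg)
{
    int ok = -1;

    assert(path != NULL && msg != NULL);
    size_t len = strlen(msg);
    assert(len > 0);                  /* mmap rejects zero-length mappings */

    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, msg, len) == (ssize_t)len) {
        char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            close(fd);                /* the mapping outlives the descriptor */
            fd = -1;
            ok = (memcmp(p, msg, len) == 0);   /* first touch faults the page in */
            munmap(p, len);
        }
    }
    if (fd >= 0)
        close(fd);
    unlink(path);
    return ok;
}
```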