File Systems: Design and Implementation
OS Fall’02
File Systems: Design and Implementation
Operating Systems, Fall 2002
What is it all about?
- A file system is a service that supports an abstract representation of the secondary storage
  - Supported by the OS
- Why is a file system needed?
  - What is so special about secondary storage (as opposed to main memory)?
Memory Hierarchy
[Figure: memory hierarchy by typical capacity. Levels: main memory; secondary storage (disks); off-line storage (tapes, CDs, etc.)]
Main memory vs. Secondary storage
Main memory:
- Small (MB/GB)
- Expensive
- Fast (10^-6/10^-7 sec)
- Volatile
- Directly accessible by the CPU
  - Interface: (virtual) memory address
Secondary storage:
- Large (GB/TB)
- Cheap
- Slow (10^-2/10^-3 sec)
- Persistent
- Cannot be directly accessed by the CPU
  - Data must first be brought into main memory
Some numbers…
- 1 GB = 2^30 ~ 10^9 bytes
- 1 TB = 2^40 ~ 10^12 bytes (terabyte)
- 1 PB = 2^50 ~ 10^15 bytes (petabyte)
- 1 EB = 2^60 ~ 10^18 bytes (exabyte)
- 2^32 ~ 4 x 10^9: genome base pairs
- 2^64 ~ 16 x 10^18: brain electrons
- 2^256 ~ 65,536 x 10^72: particles in the Universe
Secondary storage structure
- A number of disks directly attached to the computer
- Network-attached disks accessible through a fast network
  - Storage Area Network (SAN)
- Simple disks
- Smart disks
Internal disk structure
Data Access
- Sector size is the minimum read/write unit of data (usually 1KB)
  - Access: (#surface, #track, #sector)
- Smart disk drives hide the internal disk layout
  - Access: (#sector)
- Moving the arm assembly (seek) is expensive
  - Sequential access is 100 times faster than random access
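The two access schemes differ only in who performs the address arithmetic. The following is a minimal sketch of that arithmetic, assuming an idealized geometry with the same number of sectors on every track (real drives use zoned layouts). The struct and function names are hypothetical and do not belong to any real disk interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified geometry: every track has the same number of
 * sectors. Real drives put more sectors on the outer tracks. */
struct disk_geometry {
    uint32_t surfaces;           /* recording surfaces (one head each) */
    uint32_t tracks;             /* tracks per surface */
    uint32_t sectors_per_track;
};

/* The (#surface, #track, #sector) triple, all components 0-based. */
struct disk_address {
    uint32_t surface;
    uint32_t track;
    uint32_t sector;
};

/* Map a triple to a single linear #sector. The sectors of one track are
 * numbered first, then the same track on the next surface, so consecutive
 * numbers can be read without moving the arm. */
uint64_t to_linear(const struct disk_geometry *g, struct disk_address a)
{
    assert(a.surface < g->surfaces);
    assert(a.track < g->tracks);
    assert(a.sector < g->sectors_per_track);
    return ((uint64_t)a.track * g->surfaces + a.surface)
               * g->sectors_per_track + a.sector;
}

/* Inverse mapping: recover the triple from a linear #sector. */
struct disk_address from_linear(const struct disk_geometry *g, uint64_t n)
{
    struct disk_address a;
    a.sector  = (uint32_t)(n % g->sectors_per_track);
    n /= g->sectors_per_track;
    a.surface = (uint32_t)(n % g->surfaces);
    a.track   = (uint32_t)(n / g->surfaces);
    return a;
}
```

With 4 surfaces and 32 sectors per track, (surface 1, track 2, sector 3) maps to linear sector (2*4 + 1)*32 + 3 = 291.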
Overview
- File system services
- File system interface
- File system implementation
  - Finding files and their data
  - Reading and writing
  - Other issues
- Performance is the paramount issue for the file system implementation
File System services
- The file system is a layer between the secondary storage and the application
- Presents the secondary storage as a collection of persistent objects with unique names, called files
- Provides mechanisms for mapping the data between the secondary storage and the main memory
What is a file (Hebrew: קובץ)?
- A file is a named, persistent collection of data
- Unstructured, sequential (UNIX)
  - Data is accessed by specifying the offset
- Collection of records (database systems)
  - Supports associative access: give me all records with "Name=Yossi"
- Attributes: owner, permissions, modification time, size, etc.
File system interface
- File data access
  - READ: bring a specified chunk of data from the file into the process virtual address space
  - WRITE: write a specified chunk of data from the process virtual address space to the file
- CREATE, DELETE, SEEK, TRUNCATE
- open, close, set_attributes
Accessing File Data: File Control Block
- A control structure, the File Control Block (FCB), is associated with each file in the file system
- Each FCB has a unique identifier (FCB ID)
  - UNIX: i-node, identified by i-node number
- FCB structure:
  - File attributes
  - A data structure for accessing the file's data
Accessing File Data
- Given the file name:
  - Get to the file's FCB using the file system catalog
  - Use the FCB to get to the desired offset within the file data
Accessing File Data: Catalog
- The catalog maps a file name to the FCB
- Checks permissions
  - This could be done on every file data access
  - Inefficient: instead, do it once, when the file is first referenced
- file_handle = open(file_name): search the catalog and bring the FCB into memory
  - UNIX: the in-memory FCB is the in-core i-node
- close(file_handle): release the FCB from memory
The Catalog Organization
- FCBs are stored in predefined locations on the disk
  - UNIX: i-node list
- Hierarchical structure:
  - Some FCBs are just a list of pointers to other FCBs: directories
  - UNIX: a directory is a file whose data is an array of (file_name, i-node#) pairs
- Recursive mapping
Searching the UNIX catalog
- /a/b/c => i-node of /a/b/c
- Get the root i-node:
  - The i-node number of '/' is pre-defined (2)
- Use the root i-node to get to the '/' data
- Search for (a, i-node#) in the root's data
- Get a's i-node
- Get to a's data and search for (b, i-node#)
- Get b's i-node
- Etc…
- Permissions are checked all along the way
  - Each dir in the path must be (at least) executable
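The lookup steps can be sketched over a toy in-memory i-node table. This is only an illustration under simplifying assumptions: the table is indexed directly by i-node number, a directory holds at most a few entries, and a single flag stands in for the execute permission. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ROOT_INO    2   /* pre-defined i-node number of '/' */
#define MAX_ENTRIES 8   /* toy limit on directory size */

struct dir_entry { const char *name; int ino; };

/* Toy i-node: a directory's data is an array of (file_name, i-node#). */
struct inode {
    int is_dir;
    int executable;     /* stands in for the directory search permission */
    int n_entries;
    struct dir_entry entries[MAX_ENTRIES];
};

/* Search one directory for a name of length len; i-node# or -1. */
static int dir_search(const struct inode *dir, const char *name, size_t len)
{
    for (int i = 0; i < dir->n_entries; i++) {
        const char *e = dir->entries[i].name;
        if (strlen(e) == len && strncmp(e, name, len) == 0)
            return dir->entries[i].ino;
    }
    return -1;
}

/* Resolve an absolute path, one component at a time, starting at the
 * root i-node. Returns the i-node number, or -1 if a component is
 * missing or a directory on the way is not searchable. */
int path_lookup(const struct inode *table, int table_size, const char *path)
{
    assert(table != NULL && path != NULL && ROOT_INO < table_size);
    if (path[0] != '/')
        return -1;
    int ino = ROOT_INO;
    const char *p = path;
    while (*p) {
        while (*p == '/')                 /* skip separators */
            p++;
        if (!*p)
            break;
        size_t len = strcspn(p, "/");     /* length of next component */
        const struct inode *dir = &table[ino];
        if (!dir->is_dir || !dir->executable)
            return -1;                    /* permission check on the way */
        ino = dir_search(dir, p, len);
        if (ino < 0 || ino >= table_size)
            return -1;
        p += len;
    }
    return ino;
}
```

Each component costs one directory search plus one i-node fetch, which is why the result is worth caching at open time rather than repeating the walk on every access.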
Allocating disk blocks to file data
- Assume unstructured files
  - Array of bytes
- Efficient offset -> disk block mapping
- Efficient disk access for both sequential and random patterns
  - Minimizing the number of seeks
- Efficient space utilization
  - Minimizing external/internal fragmentation
Static and Contiguous Allocation
- Allocate each file a fixed number of blocks at creation time
- Efficient offset lookup
  - Only the block # of offset 0 is needed
- Efficient disk access
- Inefficient space utilization
  - Internal, external fragmentation
- No support for dynamic extension
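Why the offset lookup is efficient can be seen in a few lines. This is a minimal sketch, assuming a 1 KB block size and a hypothetical catalog entry that stores only the first block and the fixed block count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 1024u    /* assumed block size, for illustration only */

/* Hypothetical catalog entry for a statically, contiguously allocated file. */
struct contig_file {
    uint32_t first_block;   /* disk block that holds offset 0 */
    uint32_t n_blocks;      /* fixed at creation time */
};

/* Disk block holding byte `offset`, or -1 if the offset lies beyond the
 * fixed allocation (the file cannot grow). One division, no disk access. */
int64_t contig_block_of(const struct contig_file *f, uint64_t offset)
{
    assert(f != NULL);
    uint64_t index = offset / BLOCK_SIZE;
    if (index >= f->n_blocks)
        return -1;
    return (int64_t)f->first_block + (int64_t)index;
}
```

The -1 result for an offset past the last block is the "no support for dynamic extension" limitation in code form.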
Static and Contiguous Allocation
[Figure; surviving label: Catalog]
Extent-based allocation
- Files get blocks in contiguous chunks called extents
  - Multiple contiguous allocations
- For large files, a B-tree is used for efficient offset lookup
Extent-based allocation
[Figure: a disk of 20 blocks, numbered 0 to 19, holding the files foo.c, bar.c, and core.666]
Catalog:
  foo.c     (0,3) (7,2) (16,2)
  bar.c     (3,1) (12,4)
  core.666  (8,3) (18,1)
Extent-based allocation
- Efficient offset lookup and disk access
- Support for dynamic growth/shrink
- Dynamic memory allocation techniques are used (e.g., first-fit)
- Suffers from external fragmentation
  - Use compaction
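The offset lookup over an extent list can be sketched as follows, assuming each catalog pair means (first block, number of blocks), which is how the foo.c entry (0,3) (7,2) (16,2) is read here. The linear scan is an illustrative simplification; for large files the deck points to a B-tree instead. Names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One extent: `length` contiguous disk blocks starting at `start`. */
struct extent {
    uint32_t start;
    uint32_t length;
};

/* Map a logical block index within the file to a disk block by walking
 * the extent list in order. Returns -1 if the index is past the end. */
int64_t extent_block_of(const struct extent *ext, size_t n_ext,
                        uint64_t logical)
{
    assert(ext != NULL || n_ext == 0);
    for (size_t i = 0; i < n_ext; i++) {
        if (logical < ext[i].length)
            return (int64_t)ext[i].start + (int64_t)logical;
        logical -= ext[i].length;   /* skip this extent entirely */
    }
    return -1;
}
```

For foo.c, logical blocks 0 to 2 land on disk blocks 0 to 2, logical blocks 3 and 4 on disk blocks 7 and 8, and logical blocks 5 and 6 on disk blocks 16 and 17.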
Single-block allocation
- Extent-based allocation with a fixed extent size of one disk block
- File blocks are scattered anywhere on the disk
  - Inefficient sequential access
- UNIX block allocation
- Linked allocation
  - MS-DOS File Allocation Table (FAT)
Block Allocation in UNIX
- 10 direct pointers
- 1 single indirect pointer: points to a block of N pointers to blocks
- 1 double indirect pointer: points to a block of N pointers, each of which points to a block of N pointers to blocks
- 1 triple indirect pointer…
- Overall, addresses 10 + N + N^2 + N^3 disk blocks
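The capacity formula 10 + N + N^2 + N^3, and the number of pointer blocks that must be read to reach a given file block, can be worked out in code. This is a sketch with hypothetical function names; the value N = 256 used in the note below is an assumption (1 KB blocks holding 4-byte pointers), not a figure from the deck.

```c
#include <assert.h>
#include <stdint.h>

#define N_DIRECT 10     /* direct pointers in the i-node */

/* Total number of blocks addressable: 10 + N + N^2 + N^3. */
uint64_t max_file_blocks(uint64_t n)
{
    assert(n > 0);
    return N_DIRECT + n + n * n + n * n * n;
}

/* Number of pointer blocks that must be read to find logical block b
 * (0-based): 0 = direct, 1 = single, 2 = double, 3 = triple indirect.
 * Returns -1 if b is beyond the maximum file size. */
int indirection_level(uint64_t b, uint64_t n)
{
    assert(n > 0);
    if (b < N_DIRECT)
        return 0;
    b -= N_DIRECT;
    if (b < n)
        return 1;
    b -= n;
    if (b < n * n)
        return 2;
    b -= n * n;
    if (b < n * n * n)
        return 3;
    return -1;
}
```

With N = 256 the limit is 10 + 256 + 65,536 + 16,777,216 = 16,843,018 blocks. The first 10 blocks need no extra reads, which is what makes the scheme cheap for small files.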
Block Allocation in UNIX
[Figure: an i-node with pointers Direct 1 to Direct 10, Indirect, Double indirect, and Triple indirect. Data blocks 1 to 10 are reached directly; later blocks are reached through single indirect (Ind), double indirect (Dbl), and triple indirect (Trpl) pointer blocks.]
Block Allocation in UNIX
- Optimized for small files
  - Outdated empirical studies indicate that 98% of all files are under 80 KB
- Poor performance for random access to large files
- No external fragmentation
- Wasted space in pointer blocks for large sparse files
- Modern UNIX implementations use extent-based allocation
Linked Allocation
- Each file is a linked list of disk blocks
- Offset lookup:
  - Efficient for sequential access
  - Inefficient for random access
- Access to large files may be inefficient as the blocks are scattered
  - Solution: block clustering
- No fragmentation; some space is wasted on the pointer in each block
Linked Allocation
[Figure; surviving label: Catalog]
File Allocation Table (FAT)
- A section at the beginning of the disk is set aside to contain the table
  - Indexed by the block numbers on disk
  - An entry for each disk block (or for a cluster thereof)
- Blocks belonging to the same file are chained
- The last file block, unused blocks and bad blocks have special markings
FAT
[Figure; surviving label: Catalog entry]
FAT Pros and Cons
- Improved random access
  - Just search a small table instead of the whole disk
- Inefficient sequential access
  - Seek back to the table and forth to the block for each file block!
- Block allocation is easy
  - Just find the first block marked 0
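Chain following and free-block search over a FAT can be sketched as below. The marker values are assumptions made for the example: 0 marks an unused block, as on the slide, while the end-of-file and bad-block values are hypothetical. Block 0 is assumed to be reserved for the table itself, so 0 never appears as a valid "next" pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAT_FREE 0u            /* unused block */
#define FAT_EOF  0xFFFFFFFFu   /* hypothetical marker: last block of a file */
#define FAT_BAD  0xFFFFFFFEu   /* hypothetical marker: bad block */

/* Follow the chain `index` hops from the file's first block and return
 * the disk block reached, or FAT_EOF if the chain ends earlier. Only the
 * in-memory table is touched, not the data blocks. */
uint32_t fat_block_of(const uint32_t *fat, size_t fat_len,
                      uint32_t first, size_t index)
{
    assert(fat != NULL);
    uint32_t b = first;
    while (index > 0) {
        if (b >= fat_len)
            return FAT_EOF;
        uint32_t next = fat[b];
        if (next == FAT_EOF || next == FAT_FREE || next == FAT_BAD)
            return FAT_EOF;
        b = next;
        index--;
    }
    return b;
}

/* Block allocation: find the first entry marked free, skipping the
 * reserved block 0. Returns -1 when the disk is full. */
int64_t fat_find_free(const uint32_t *fat, size_t fat_len)
{
    assert(fat != NULL);
    for (size_t b = 1; b < fat_len; b++)
        if (fat[b] == FAT_FREE)
            return (int64_t)b;
    return -1;
}
```

Random access still walks the chain hop by hop, but every hop is a table lookup rather than a read of a data block.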
Free space management
- Disk bitmap: represent the disk block allocation as an array of bits
  - One bit for each disk block: 1 - non-allocated block, 0 - allocated block
  - Simple and efficient in finding free blocks
  - Wastes space on disk
- Linked list of free blocks (UNIX)
  - Efficient for finding a single free block
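A disk bitmap can be sketched in a few lines, using the slide's convention that a 1 bit means a non-allocated (free) block. The function names are hypothetical, and the bit-by-bit scan is kept deliberately simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i of the map describes disk block i: 1 = free, 0 = allocated. */

/* Return the first free block among n_blocks, or -1 if none is free. */
int64_t bitmap_find_free(const uint8_t *map, size_t n_blocks)
{
    assert(map != NULL);
    for (size_t i = 0; i < n_blocks; i++)
        if (map[i / 8] & (1u << (i % 8)))
            return (int64_t)i;
    return -1;
}

/* Mark a block as allocated (clear its bit). */
void bitmap_allocate(uint8_t *map, size_t block)
{
    assert(map != NULL);
    map[block / 8] &= (uint8_t)~(1u << (block % 8));
}

/* Mark a block as free again (set its bit). */
void bitmap_release(uint8_t *map, size_t block)
{
    assert(map != NULL);
    map[block / 8] |= (uint8_t)(1u << (block % 8));
}
```

The map costs one bit per disk block, so a disk of 2^20 blocks needs 128 KB of bitmap.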
Next: File System continued
- File I/O
  - Organization, performance
- Atomicity and consistency
- Etc...
File I/O
- The CPU cannot access the file data directly
- The data must first be brought into main memory
  - How to do this efficiently?
- Read/Write mapping using the buffer cache
- Memory mapped files
Read/Write Mapping
- File data is made available to applications via a pre-allocated main memory region
  - Buffer cache
- The file system transfers data between the buffer cache and the disk in units of disk blocks
- The data is explicitly copied between the buffer cache and the application address space
Read/Write Mapping
[Figure labels: Main Memory, Kernel, Buffer Cache, File A, File B, File C]
Reading data (Disk block=1K)
unsigned char buf[8192];
unsigned char *ptr = buf + 126;
fd = open("C", …);
lseek(fd, 1324, SEEK_SET);  // 1324 = 1024 + 300
read(fd, ptr, 1848);        // 724 + 1024 + 100 = 1848
[Figure: bytes 1324 to 3172 of File C pass through the kernel buffer cache and are copied to ptr inside the user buffer buf]
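The arithmetic in the comments (724 + 1024 + 100 = 1848) comes from splitting the byte range at disk block boundaries. A minimal sketch of that splitting, with the 1 KB block size of the slide; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 1024u    /* disk block = 1K, as on the slide */

/* The part of a request that falls inside one disk block. */
struct piece {
    uint64_t block;             /* logical block number within the file */
    uint32_t offset_in_block;   /* first byte used inside that block */
    uint32_t length;            /* bytes used from that block */
};

/* Split the byte range [offset, offset + count) into per-block pieces.
 * Returns the number of pieces stored in out (at most max_pieces). */
size_t split_request(uint64_t offset, uint64_t count,
                     struct piece *out, size_t max_pieces)
{
    assert(out != NULL);
    size_t n = 0;
    while (count > 0 && n < max_pieces) {
        uint32_t in_block = (uint32_t)(offset % BLOCK_SIZE);
        uint32_t len = BLOCK_SIZE - in_block;   /* room left in the block */
        if (len > count)
            len = (uint32_t)count;
        out[n].block = offset / BLOCK_SIZE;
        out[n].offset_in_block = in_block;
        out[n].length = len;
        n++;
        offset += len;
        count -= len;
    }
    return n;
}
```

For offset 1324 and count 1848 this yields three pieces: 724 bytes of block 1 starting at byte 300, all 1024 bytes of block 2, and the first 100 bytes of block 3.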
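Writing data (Disk block=1K)
unsigned char buf[8192];
unsigned char *ptr = buf + 126;
fd = open("C", …);
lseek(fd, 1324, SEEK_SET);  // 1324 = 1024 + 300
write(fd, ptr, 1848);       // 724 + 1024 + 100 = 1848
[Figure: the 1848 bytes at ptr inside the user buffer buf are copied into the kernel buffer cache for bytes 1324 to 3172 of File C; an unallocated region is marked at offset 3172]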
Buffer Cache management
- All disk I/O goes through the buffer cache
  - Both user data and control data (e.g., i-nodes) are cached
- LRU replacement
- Dirty (modified) marker to indicate whether write-back is needed
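LRU replacement with a dirty marker can be shown on a toy cache. This is a sketch under loud assumptions: the "disk" is an in-memory array, the cache has only two slots so that evictions happen quickly, and LRU is tracked with a logical clock rather than a linked list. None of the names come from a real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  1024
#define DISK_BLOCKS 8
#define CACHE_SLOTS 2           /* tiny on purpose, to force evictions */

/* Simulated disk plus I/O counters, so the effect of caching is visible. */
static unsigned char disk[DISK_BLOCKS][BLOCK_SIZE];
static int disk_reads, disk_writes;

struct buffer {
    int valid;                  /* slot holds a disk block */
    int dirty;                  /* modified in memory: write-back needed */
    uint32_t block;             /* which disk block is cached here */
    uint64_t last_used;         /* logical time of last access, for LRU */
    unsigned char data[BLOCK_SIZE];
};

static struct buffer cache[CACHE_SLOTS];
static uint64_t now;

/* Return the buffer caching `block`, reading it from disk on a miss.
 * With no empty slot, the least recently used buffer is evicted and,
 * if dirty, written back first. */
struct buffer *cache_get(uint32_t block)
{
    assert(block < DISK_BLOCKS);
    struct buffer *victim = NULL;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct buffer *b = &cache[i];
        if (b->valid && b->block == block) {        /* hit */
            b->last_used = ++now;
            return b;
        }
    }
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct buffer *b = &cache[i];
        if (!b->valid) { victim = b; break; }       /* empty slot wins */
        if (victim == NULL || b->last_used < victim->last_used)
            victim = b;
    }
    if (victim->valid && victim->dirty) {           /* write-back */
        memcpy(disk[victim->block], victim->data, BLOCK_SIZE);
        disk_writes++;
    }
    memcpy(victim->data, disk[block], BLOCK_SIZE);
    disk_reads++;
    victim->valid = 1;
    victim->dirty = 0;
    victim->block = block;
    victim->last_used = ++now;
    return victim;
}

/* Modify one byte in the cache only; the disk is updated on eviction. */
void cache_write_byte(uint32_t block, uint32_t off, unsigned char v)
{
    assert(off < BLOCK_SIZE);
    struct buffer *b = cache_get(block);
    b->data[off] = v;
    b->dirty = 1;
}

unsigned char cache_read_byte(uint32_t block, uint32_t off)
{
    assert(off < BLOCK_SIZE);
    return cache_get(block)->data[off];
}
```

A byte written to block 0 reaches the simulated disk only when block 0 is evicted to make room for a third block; until then repeated accesses cost no disk I/O.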
Advantages
- Strict separation of concerns
- Hiding disk access peculiarities from the user
  - Block size, memory alignment, memory allocation in multiples of the block size, etc.
- Disk blocks are cached
  - Aggregation for small transfers (locality)
  - Block re-use across processes
  - Transient data might never be written to disk
Disadvantages
- Extra copying
  - Disk -> buffer cache -> user space
- Vulnerability to failures
  - The file system does not care much about the user data blocks
  - The control data blocks (metadata) are the real problem
    - E.g., i-nodes and pointer blocks can be in the cache when a failure occurs
    - As a result, the file system's internal state might be corrupted
A complete UNIX example
Memory mapped files
- A file (or a portion thereof) is mapped into a contiguous region of the process virtual memory
  - UNIX: mmap system call
- The mapping operation is very efficient: just marking
- Access to the file is governed by the virtual memory subsystem
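A small example of reading a file through a mapping instead of through read() calls. It is a sketch that assumes a POSIX system; the function name is hypothetical, and error handling is reduced to returning -1.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Count occurrences of byte c in the file open on fd by mapping the
 * whole file into the address space. Returns -1 on error. */
long count_byte_mmap(int fd, unsigned char c)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size == 0)
        return 0;                   /* zero-length mappings are rejected */

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    long count = 0;
    for (size_t i = 0; i < len; i++)   /* plain memory reads: pages are */
        if (p[i] == c)                 /* brought in on demand by the VM */
            count++;

    int rc = munmap(p, len);
    assert(rc == 0);
    (void)rc;
    return count;
}
```

No user buffer and no explicit copy appear in the loop: the file contents are accessed with ordinary loads, and the virtual memory subsystem decides when the disk is read.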
Mmapped files: Pros and Cons
- Advantages:
  - Reduced copying
  - No need for a pre-allocated buffer cache in main memory
- Disadvantages:
  - Less or no control over the actual disk writing: the file data becomes volatile
  - A mapped area must fit in the virtual address space
Reliability and Recovery
- File system data consists of
  - Control data (metadata), user data
- Failures can cause data loss and corruption
  - Cached data
  - A power failure during a sector write may physically corrupt the data stored in the sector
Metadata vs. User data
- Loss or corruption of the metadata might lead to a massive user data loss
  - File systems must care about the metadata
- File systems usually do not care much about the user data
  - Operation semantics?
  - Users must care about their data themselves (e.g., backups)
Reliability and caching
- Caching affects the WRITE semantics
  - The write operation returns
  - Is it guaranteed that the requested data is indeed written to disk?
  - What if some data blocks in the cache are metadata blocks?
- Solutions
  - write-through: writes go straight to disk instead of waiting in the cache
  - write-back: dirty blocks are written asynchronously
User data reliability in UNIX
- Based on the write-back policy
  - User data is written back to disk periodically
  - POSIX compatible semantics
  - Commands like sync and fsync are used to force writing of the dirty blocks
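An application that cannot wait for the periodic write-back can force it. The sketch below assumes a POSIX system; write_durably is a hypothetical helper name, and retrying after an interrupted call (EINTR) is left out for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to fd, then force them to the device.
 * Returns 0 on success and -1 on error. */
int write_durably(int fd, const void *data, size_t len)
{
    const char *p = data;
    assert(p != NULL || len == 0);
    while (len > 0) {
        /* write() may return as soon as the data sits in the cache */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            perror("write");
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    /* fsync() returns once the dirty blocks of fd have been written out */
    return fsync(fd);
}
```

Without the final fsync call, a successful return from write says only that the data reached the buffer cache.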
Metadata reliability
- Based on the write-through policy
  - Updates are written to disk immediately
- Some data is not written in place
  - Can go back to the last consistent version
- Some data is replicated
  - UNIX superblock
- The file system goes through a consistency check/repair cycle at boot time
  - fsck, ScanDisk
Metadata reliability using logging
- Write-through negatively affects performance
  - Think about random access
- Solution: maintain a sequential log of metadata updates, the journal
  - IBM's Journaled File System (JFS)
Journaled File System (JFS)
- Operations logged (journaled):
  - create, link, mkdir, truncate, allocating write, …
  - Each operation may involve several metadata updates (a transaction)
- Once an operation is logged, it returns
  - Write-ahead logging
- The disk writes are performed asynchronously
  - Aggregation is possible
JFS: Journal maintenance
- A cursor (pointer) is maintained
- The cursor is advanced once the updated blocks associated with the transaction are written to disk (hardened)
  - Hardened transaction records can be deleted from the journal
- Upon recovery: re-do all the operations starting from the last cursor position
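The cursor mechanism can be illustrated with a toy journal. This is not JFS code: the record format, the fixed-size log, and all names are hypothetical, and each "metadata block" is a single integer. Records are idempotent ("set block to value"), which is what makes re-doing them after a crash safe.

```c
#include <assert.h>
#include <stddef.h>

#define N_META  4       /* toy metadata blocks, one int each */
#define LOG_CAP 16      /* toy journal capacity */

/* One journal record: "set metadata block `block` to `value`". */
struct record { int block; int value; };

struct journal {
    struct record log[LOG_CAP];
    size_t tail;        /* records [0, tail) have been logged */
    size_t cursor;      /* records [0, cursor) are hardened on disk */
};

void journal_init(struct journal *j)
{
    j->tail = 0;
    j->cursor = 0;
}

/* Write-ahead logging: the operation returns once it is in the log. */
int journal_log(struct journal *j, int block, int value)
{
    assert(block >= 0 && block < N_META);
    if (j->tail == LOG_CAP)
        return -1;
    j->log[j->tail].block = block;
    j->log[j->tail].value = value;
    j->tail++;
    return 0;
}

/* Asynchronous part: apply up to n logged records to the disk blocks
 * and advance the cursor past each one that has been hardened. */
void journal_harden(struct journal *j, int *disk, size_t n)
{
    for (; n > 0 && j->cursor < j->tail; n--) {
        disk[j->log[j->cursor].block] = j->log[j->cursor].value;
        j->cursor++;
    }
}

/* Recovery: re-do everything from the last cursor position. */
void journal_recover(struct journal *j, int *disk)
{
    journal_harden(j, disk, j->tail - j->cursor);
}
```

Recovery time depends only on the distance between cursor and tail, not on the number of metadata blocks on the disk.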
JFS: Pros and Cons
- Advantages:
  - Asynchronous metadata writes
  - Fast recovery: depends on the journal size and not on the file system size
- Disadvantages:
  - Extra writes
  - Space wasted by the journal (insignificant)
Log Structured File System
- Ousterhout & Douglis (1992)
- Caching is enough for good read performance
- Writes are the real performance bottleneck
  - Writing back cached user blocks may require many random disk accesses
  - Write-through for reliability denies optimizations
  - Logging solves the problem for metadata
Log Structured File System
- The idea: everything is a log
- Each write, both data and control, is appended to the sequential log
- The problem: how to locate files and data efficiently for random access by reads
- The solution: use a floating file map
Log structured file system
[Figure: the log and its supermap in three states: before, after a block change, and after a block addition]
Next: Networking and distributed systems
Last: New storage architectures
- Storage Area Networks, Network Attached Storage, Object Disks, file systems, etc...