![Page 1: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/1.jpg)
Transactions and Reliability
Sarah Diesburg
Operating Systems
COP 4610
![Page 2: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/2.jpg)
Motivation
File systems have lots of metadata: free blocks, directories, file headers, indirect blocks
Metadata is heavily cached for performance
![Page 3: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/3.jpg)
Problem
System crashes
  The OS needs to ensure that the file system does not reach an inconsistent state
Example: move a file between directories
  Remove the file from the old directory
  Add the file to the new directory
What happens when a crash occurs in the middle?
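
A minimal in-memory sketch of the question above. The directories are modeled as plain dictionaries and the crash as an exception; the names `move`, `Crash`, and `report.txt` are illustrative assumptions, not part of any real file system.

```python
class Crash(Exception):
    """Simulated system crash."""


def move(old_dir, new_dir, name, crash_midway=False):
    """Move a directory entry in two separate steps, as on the slide."""
    inode = old_dir.pop(name)   # step 1: remove the file from the old directory
    if crash_midway:
        raise Crash()           # the system goes down between the two steps
    new_dir[name] = inode       # step 2: add the file to the new directory


old_dir = {"report.txt": 42}    # name -> inode number (hypothetical values)
new_dir = {}

try:
    move(old_dir, new_dir, "report.txt", crash_midway=True)
except Crash:
    pass

# After the crash the file is listed in neither directory.
lost = "report.txt" not in old_dir and "report.txt" not in new_dir
```

In this model the file's contents may still exist on disk, but no directory refers to it, which is the kind of inconsistent state the OS has to prevent or repair.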
![Page 4: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/4.jpg)
UNIX File System (Ad Hoc Failure-Recovery)
Metadata handling:
  Uses a synchronous write-through caching policy
  A call to update metadata does not return until the changes are propagated to disk
  Updates are ordered
  When crashes occur, run fsck to repair in-progress operations
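
A user-level analogy of the write-through behavior, sketched in Python. This is not how UFS is implemented inside the kernel; it only shows a call that does not return until the data has been handed to stable storage. The function name `write_through` and the file name are assumptions.

```python
import os
import tempfile


def write_through(path, data):
    """Write data and block until the OS reports it has been flushed to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device


workdir = tempfile.mkdtemp()
meta_path = os.path.join(workdir, "metadata.bin")
write_through(meta_path, b"inode 7: size=4096")
```

The cost is latency: every call waits for the disk, which is why the slides reserve this policy for metadata.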
![Page 5: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/5.jpg)
Some Examples of Metadata Handling
Undo effects not yet visible to users
  If a new file is created, but not yet added to the directory: delete the file
Continue effects that are visible to users
  If file blocks are already allocated, but not recorded in the bitmap: update the bitmap
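
The two repair rules can be sketched over a toy in-memory model. The data layout below (a dictionary of inodes, a dictionary for the directory, a set for the block bitmap) and the function name `repair` are assumptions for illustration; a real fsck works on on-disk structures and checks far more.

```python
def repair(inodes, directory, block_bitmap):
    """Apply the two rules from the slide to a toy file system.

    inodes:       {inode_no: [block_no, ...]}
    directory:    {file_name: inode_no}
    block_bitmap: set of block numbers marked as allocated
    """
    referenced = set(directory.values())

    # Undo: a file was created but never added to the directory, so delete it.
    for inode_no in [i for i in inodes if i not in referenced]:
        del inodes[inode_no]

    # Continue: blocks belong to a visible file but are missing from the bitmap.
    for blocks in inodes.values():
        block_bitmap.update(blocks)


inodes = {1: [10, 11], 2: [12]}   # inode 2 was never linked into the directory
directory = {"a.txt": 1}
block_bitmap = {10}               # block 11 was allocated but not recorded

repair(inodes, directory, block_bitmap)
```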
![Page 6: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/6.jpg)
UFS User Data Handling
Uses a write-back policy
  Modified blocks are written to disk at 30-second intervals
  Unless a user issues the sync system call
Data updates are not ordered
  In many cases, consistent metadata is good enough
![Page 7: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/7.jpg)
Example: Vi
Vi saves changes by doing the following:
1. Writes the new version in a temp file
   Now we have old_file and new_temp
2. Moves the old version to a different temp file
   Now we have new_temp and old_temp
3. Moves the new version into the real file
   Now we have new_file and old_temp
4. Removes the old version
   Now we have new_file
![Page 8: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/8.jpg)
Example: Vi
When crashes occur
  Looks for the leftover files
  Moves forward or backward depending on the integrity of files
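
A sketch of the four-step save and the matching recovery, assuming the file being saved already exists. The `.new` and `.old` suffixes, the function names, and the rule "a temp file that survived step 2 is complete" are assumptions for illustration, not Vi's actual naming or integrity checks.

```python
import os
import tempfile


def save(path, data):
    """Save data over an existing file using the four steps on the slide."""
    new_tmp, old_tmp = path + ".new", path + ".old"
    with open(new_tmp, "wb") as f:   # 1. write the new version to a temp file
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(path, old_tmp)         # 2. move the old version aside
    os.rename(new_tmp, path)         # 3. move the new version into place
    os.remove(old_tmp)               # 4. remove the old version


def recover(path):
    """Look for leftover files and move forward or backward."""
    new_tmp, old_tmp = path + ".new", path + ".old"
    if os.path.exists(path):
        # Crash before step 2 or after step 3: the real file is intact,
        # so any leftover temp file can be discarded.
        for leftover in (new_tmp, old_tmp):
            if os.path.exists(leftover):
                os.remove(leftover)
    elif os.path.exists(new_tmp):
        # Crash between steps 2 and 3: step 1 finished, so move forward.
        os.rename(new_tmp, path)
        if os.path.exists(old_tmp):
            os.remove(old_tmp)
    elif os.path.exists(old_tmp):
        # Only the old version survives: move backward.
        os.rename(old_tmp, path)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "notes.txt")
with open(path, "wb") as f:
    f.write(b"v1")
save(path, b"v2")
```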
![Page 9: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/9.jpg)
Transaction Approach
A transaction groups operations as a unit, with the following characteristics:
  Atomic: all operations either happen or they do not (no partial operations)
  Serializable: transactions appear to happen one after the other
  Durable: once a transaction happens, it is recoverable and can survive crashes
![Page 10: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/10.jpg)
More on Transactions
A transaction is not done until it is committed
Once committed, a transaction is durable
If a transaction fails to complete, it must roll back as if it did not happen at all
Critical sections are atomic and serializable, but not durable
![Page 11: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/11.jpg)
Transaction Implementation (One Thread)
Example: money transfer
Begin transaction
x = x - 1;
y = y + 1;
Commit
![Page 12: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/12.jpg)
Transaction Implementation (One Thread)
Common implementations involve the use of a log, a journal that is never erased
A file system uses a write-ahead log to track all transactions
![Page 13: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/13.jpg)
Transaction Implementation (One Thread)
Once the changes to accounts x and y are recorded in the log, the log is committed to disk in a single write
Actual changes to those accounts are done later
![Page 14: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/14.jpg)
Transaction Illustrated
x = 1; y = 1;
x = 1; y = 1;
![Page 15: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/15.jpg)
Transaction Illustrated
x = 1; y = 1;
x = 0; y = 2;
![Page 16: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/16.jpg)
Transaction Illustrated
x = 1; y = 1;
x = 0; y = 2;
begin transaction
old x: 1
old y: 1
new x: 0
new y: 2
commit
Commit the log to disk before updating the actual values on disk
![Page 17: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/17.jpg)
Transaction Steps
1. Mark the beginning of the transaction
2. Log the changes in account x
3. Log the changes in account y
4. Commit
5. Modify account x on disk
6. Modify account y on disk
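
The six steps can be sketched with an in-memory "disk" and log. The record format (tuples tagged `begin`, `update`, `commit`), the transaction id, and the function name `transfer` are assumptions for illustration; a real file system writes these records to a journal region on disk.

```python
def transfer(disk, log, txid):
    """Move one unit from account x to account y using a write-ahead log."""
    new_x = disk["x"] - 1
    new_y = disk["y"] + 1

    log.append(("begin", txid))                            # 1. mark the beginning
    log.append(("update", txid, "x", disk["x"], new_x))    # 2. log old and new x
    log.append(("update", txid, "y", disk["y"], new_y))    # 3. log old and new y
    log.append(("commit", txid))                           # 4. commit

    disk["x"] = new_x                                      # 5. modify x on disk
    disk["y"] = new_y                                      # 6. modify y on disk


disk = {"x": 1, "y": 1}
log = []
transfer(disk, log, txid=1)
```

The ordering is the point: nothing on the "disk" changes until the commit record is in the log.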
![Page 18: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/18.jpg)
Scenarios of Crashes
If a crash occurs after the commit
  Replay the log to update the accounts
If a crash occurs before or during the commit
  Roll back and discard the transaction
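
A sketch of recovery under both scenarios, assuming a log made of tuples tagged `begin`, `update`, and `commit` (a hypothetical record format, not a real journal layout). Committed transactions are replayed; transactions without a commit record are ignored, which is safe because the disk is never modified before the commit.

```python
def recover(disk, log):
    """Replay committed transactions; discard uncommitted ones."""
    committed = {record[1] for record in log if record[0] == "commit"}
    for record in log:
        if record[0] == "update" and record[1] in committed:
            _, _, key, _old_value, new_value = record
            disk[key] = new_value   # redo; applying it twice is harmless
    return disk


# Transaction 1 committed but the crash hit before the disk was updated.
# Transaction 2 crashed before its commit record was written.
crash_log = [
    ("begin", 1),
    ("update", 1, "x", 1, 0),
    ("update", 1, "y", 1, 2),
    ("commit", 1),
    ("begin", 2),
    ("update", 2, "x", 0, -1),
]
recovered = recover({"x": 1, "y": 1}, crash_log)
```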
![Page 19: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/19.jpg)
Two-Phase Locking (Multiple Threads)
Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)
Solution: two-phase locking
1. Acquire all locks
2. Perform updates and release all locks
Thread A cannot see thread B’s changes until thread B commits and releases its locks
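
A sketch of the two phases using Python threads. The `Account` class and `transfer` function are illustrative assumptions. Acquiring the locks in a fixed order (sorted by account name) is an addition of this sketch to avoid deadlock; the slide does not prescribe a lock order. The two accounts are assumed to be distinct.

```python
import threading


class Account:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Transfer between two distinct accounts under two-phase locking."""
    # Phase 1: acquire every lock the transaction needs before touching data.
    locks = [a.lock for a in sorted((src, dst), key=lambda a: a.name)]
    for lock in locks:
        lock.acquire()
    try:
        src.balance -= amount
        dst.balance += amount
    finally:
        # Phase 2: release all locks only after the updates are done.
        for lock in reversed(locks):
            lock.release()


x = Account("x", 100)
y = Account("y", 100)


def worker(src, dst):
    for _ in range(50):
        transfer(src, dst, 1)


threads = [threading.Thread(target=worker, args=(x, y)) for _ in range(4)]
threads += [threading.Thread(target=worker, args=(y, x)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no thread can observe an account between the debit and the credit, the total across both accounts stays constant.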
![Page 20: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/20.jpg)
Transactions in File Systems
Almost all file systems built since 1985 use write-ahead logging
  NTFS, HFS+, ext3, ext4, …
+ Eliminates running fsck after a crash
+ Write-ahead logging provides reliability
- All modifications need to be written twice
![Page 21: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/21.jpg)
Log-Structured File System (LFS)
If logging is so great, why don’t we treat everything as log entries?
Log-structured file system
  Everything is a log entry (file headers, directories, data blocks)
  Write the log only once
  Use version stamps to distinguish between old and new entries
![Page 22: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/22.jpg)
More on LFS
New log entries are always appended to the end of the existing log
  All writes are sequential
  Seeks only occur during reads
  Not so bad due to temporal locality and caching
Problem:
  Need to create more contiguous space all the time
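
A toy model of the idea, assuming the log is a Python list of `(version, key, value)` entries. The class `LogFS` and its `clean` method are illustrative assumptions; `clean` stands in for whatever mechanism reclaims space, which the slides do not describe.

```python
class LogFS:
    """Toy log-structured store: every write is an append."""

    def __init__(self):
        self.log = []       # append-only list of (version, key, value)
        self.version = 0    # version stamp, increases with every write

    def write(self, key, value):
        self.version += 1
        self.log.append((self.version, key, value))   # always at the end

    def read(self, key):
        # Scan from the newest entry backward; the first match is the
        # entry with the highest version stamp for this key.
        for _version, entry_key, value in reversed(self.log):
            if entry_key == key:
                return value
        raise KeyError(key)

    def clean(self):
        """Drop superseded entries to reclaim contiguous space."""
        latest = {}
        for entry in self.log:
            latest[entry[1]] = entry    # later entries replace earlier ones
        self.log = sorted(latest.values())   # version stamps are unique


fs = LogFS()
fs.write("a", "first")
fs.write("b", "other")
fs.write("a", "second")
entries_before = len(fs.log)
fs.clean()
entries_after = len(fs.log)
```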
![Page 23: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/23.jpg)
RAID and Reliability
So far, we assume that we have a single disk
What if we have multiple disks?
  The chance that at least one disk fails increases
RAID: redundant array of independent disks
  Standard way of organizing disks and classifying the reliability of multi-disk systems
  General methods: data duplication, parity, and error-correcting codes (ECC)
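
A back-of-the-envelope sketch of why more disks mean more failures, assuming each disk fails independently with the same probability over some period. The function name and the 1% figure are assumptions chosen only for illustration.

```python
def p_any_failure(p_single, n_disks):
    """Probability that at least one of n independent disks fails."""
    return 1 - (1 - p_single) ** n_disks


one_disk = p_any_failure(0.01, 1)
ten_disks = p_any_failure(0.01, 10)
```

Under these assumptions, ten disks are nearly ten times as likely to suffer a failure as one disk, which is why an array without redundancy is less reliable than a single disk.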
![Page 24: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/24.jpg)
RAID 0
No redundancy
Uses block-level striping across disks
  i.e., 1st block stored on disk 1, 2nd block stored on disk 2
Failure causes data loss
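
A sketch of the striping rule as a mapping from a logical block number to a disk and a block offset on that disk. The function name is an assumption, and numbering here starts at 0, whereas the slide counts disks and blocks from 1.

```python
def locate_block(logical_block, n_disks):
    """Return (disk_index, block_offset_on_that_disk), both 0-based."""
    return logical_block % n_disks, logical_block // n_disks


placement = [locate_block(b, 2) for b in range(4)]
```

With two disks, consecutive blocks alternate between disk 0 and disk 1, so a large sequential read can use both disks at once.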
![Page 25: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/25.jpg)
Non-Redundant Disk Array Diagram (RAID Level 0)
[Diagram labels: open(foo), read(bar), write(zoo); File System]
![Page 26: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/26.jpg)
Mirrored Disks (RAID Level 1)
Each disk has a second disk that mirrors its contents
  Writes go to both disks
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient
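
A toy model of a mirrored pair, with each disk represented as a Python list of blocks. The class name and the `failed` flags are assumptions for illustration.

```python
class MirroredPair:
    """Two disks holding identical contents."""

    def __init__(self, n_blocks):
        self.disks = [[None] * n_blocks, [None] * n_blocks]
        self.failed = [False, False]

    def write(self, block_no, data):
        # Every write goes to both disks (skipping one that has failed).
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                disk[block_no] = data

    def read(self, block_no):
        # Either copy can serve a read; use the first working disk.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                return disk[block_no]
        raise IOError("both disks have failed")


pair = MirroredPair(4)
pair.write(2, b"payload")
pair.failed[0] = True            # simulate losing the first disk
survivor_copy = pair.read(2)
```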
![Page 27: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/27.jpg)
Mirrored Disk Diagram (RAID Level 1)
[Diagram labels: open(foo), read(bar), write(zoo); File System]
![Page 28: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/28.jpg)
Memory-Style ECC (RAID Level 2)
Some disks in the array are used to hold ECC
  Byte to detect error, extra bits for error correction
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
  e.g., 4 data disks require 3 ECC disks
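
The "4 data disks require 3 ECC disks" figure matches a single-error-correcting Hamming code, which is the code usually associated with RAID Level 2; the slides do not name the code, so treat that as an assumption. Such a code needs the smallest number of check disks r with 2^r >= m + r + 1 for m data disks.

```python
def ecc_disks(data_disks):
    """Smallest r such that 2**r >= data_disks + r + 1 (Hamming bound)."""
    r = 1
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r


for_four = ecc_disks(4)
for_ten = ecc_disks(10)
```

The overhead shrinks as the array grows, but for small arrays it stays close to the cost of mirroring.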
![Page 29: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/29.jpg)
Memory-Style ECC Diagram (RAID Level 2)
[Diagram labels: open(foo), read(bar), write(zoo); File System]
![Page 30: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/30.jpg)
Bit-Interleaved Parity (RAID Level 3)
Uses bit-level striping across disks
  i.e., 1st byte stored on disk 1, 2nd byte stored on disk 2
One disk in the array stores parity for the other disks
  No detection bits needed, relies on disk controller to detect errors
+ More efficient than Levels 1 and 2
- Parity disk doesn’t add bandwidth
![Page 31: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/31.jpg)
Parity Method
Disk 1: 1001
Disk 2: 0101
Disk 3: 1000
Parity: 0100 = 1001 xor 0101 xor 1000
To recover disk 2
  Disk 2: 0101 = 1001 xor 1000 xor 0100
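
The same arithmetic in Python, using the values from the slide. The function names are assumptions for illustration.

```python
from functools import reduce
from operator import xor


def parity(blocks):
    """XOR of all data blocks."""
    return reduce(xor, blocks, 0)


def recover(surviving_blocks, parity_block):
    """Rebuild the one missing block from the survivors and the parity."""
    return reduce(xor, surviving_blocks, parity_block)


disks = [0b1001, 0b0101, 0b1000]
parity_block = parity(disks)
rebuilt_disk2 = recover([disks[0], disks[2]], parity_block)
```

This works because XOR is its own inverse: XORing the parity with every surviving block cancels them out and leaves the missing one.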
![Page 32: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/32.jpg)
Bit-Interleaved RAID Diagram (Level 3)
[Diagram labels: open(foo), read(bar), write(zoo); File System]
![Page 33: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/33.jpg)
Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved, but data is interleaved in blocks
+ More efficient data access than level 3
- Parity disk can be a bottleneck
- Small writes require 4 I/Os
  Read the old block
  Read the old parity
  Write the new block
  Write the new parity
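
A sketch of the four I/Os for a small write. The new parity is computed with the standard XOR identity (new parity = old parity xor old block xor new block), which the slide implies but does not spell out. The class `Raid4`, the I/O counter, and the one-block-per-disk model are assumptions for illustration.

```python
class Raid4:
    """Toy array: one integer block per data disk plus one parity block."""

    def __init__(self, data_blocks):
        self.data = list(data_blocks)
        self.parity = 0
        for block in self.data:
            self.parity ^= block
        self.io_count = 0

    def small_write(self, disk_no, new_block):
        old_block = self.data[disk_no]       # I/O 1: read the old block
        self.io_count += 1
        old_parity = self.parity             # I/O 2: read the old parity
        self.io_count += 1
        self.data[disk_no] = new_block       # I/O 3: write the new block
        self.io_count += 1
        # I/O 4: write the new parity, without reading the other data disks
        self.parity = old_parity ^ old_block ^ new_block
        self.io_count += 1


array = Raid4([0b1001, 0b0101, 0b1000])
array.small_write(1, 0b1111)
```

Every small write touches the parity disk, which is why that disk becomes the bottleneck.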
![Page 34: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/34.jpg)
Block-Interleaved Parity Diagram (RAID Level 4)
[Diagram labels: open(foo), read(bar), write(zoo); File System]
![Page 35: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/35.jpg)
Block-Interleaved Distributed-Parity (RAID Level 5)
Sort of the most general level of RAID
Spreads the parity out over all disks
+ No parity disk bottleneck
+ All disks contribute read bandwidth
- Requires 4 I/Os for small writes
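
A sketch of one way to spread parity over all disks: rotate the parity block by one disk per stripe. The particular rotation used here is an assumption; real implementations offer several layouts, and the slides do not specify one. Numbering is 0-based.

```python
def raid5_layout(stripe, n_disks):
    """Return (parity_disk, data_disks) for the given stripe."""
    parity_disk = (n_disks - 1 - stripe) % n_disks
    data_disks = [d for d in range(n_disks) if d != parity_disk]
    return parity_disk, data_disks


parity_per_stripe = [raid5_layout(s, 4)[0] for s in range(5)]
```

Over any run of four consecutive stripes each disk holds parity exactly once, so parity writes are shared evenly instead of landing on one disk.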
![Page 36: Transactions and Reliability](https://reader031.vdocuments.mx/reader031/viewer/2022013012/56814d60550346895dbaa5b8/html5/thumbnails/36.jpg)
Block-Interleaved Distributed-Parity Diagram (RAID Level 5)
[Diagram labels: open(foo), read(bar), write(zoo); File System]