![Page 1: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/1.jpg)
Certifying a Crash-safe File System
Haogang Chen
Thesis Advisors: Frans Kaashoek and Nickolai Zeldovich
![Page 4: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/4.jpg)
File systems should not lose data
• People use file systems to store permanent data
• Computers can crash anytime
  • power failures
  • hardware failures (unplug USB drive)
  • software bugs
• File systems should not lose or corrupt data in case of crashes
![Page 7: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/7.jpg)
File systems are complex and have bugs
• Linux ext4: ~60,000 lines of code
• Some bugs are serious: data loss, security exploits, etc.
[Figure: Cumulative number of bug patches in Linux file systems [Lu et al., FAST’13]. y-axis: # of patches for bugs (0 to 600); x-axis: Dec-03 to May-11; series: ext3, xfs, jfs, reiserfs, ext4, btrfs]
![Page 10: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/10.jpg)
Research on avoiding bugs in file systems
• Most research is on finding bugs (reduces the number of bugs)
  • Crash injection (e.g., EXPLODE [OSDI’06])
  • Symbolic execution (e.g., EXE [Oakland’06])
  • Design modeling (e.g., in Alloy [ABZ’08])
• Some elimination of bugs by proving (incomplete, and no crashes):
  • FS without directories [Arkoudas et al. 2004]
  • BilbyFS [Keller 2014]
  • UBIFS [Ernst et al. 2013]
![Page 13: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/13.jpg)
Dealing with crashes is hard
• Crashes expose many partially-updated states
  • Reasoning about all failure cases is hard
• Performance optimizations lead to more tricky partial states
  • Disk I/O is expensive
  • Buffer updates in memory

    commit 353b67d8ced4dc53281c88150ad295e24bc4b4c5
    Author: Jan Kara <[email protected]>
    Date:   Sat Nov 26 00:35:39 2011 +0100
    Title:  jbd: Issue cache flush after checkpointing

    --- a/fs/jbd/checkpoint.c
    +++ b/fs/jbd/checkpoint.c
    @@ -504,7 +503,25 @@ int cleanup_journal_tail(journal_t *journal)
             spin_unlock(&journal->j_state_lock);
             return 1;
         }
    +    spin_unlock(&journal->j_state_lock);
    +
    +    /*
    +     * We need to make sure that any blocks that were recently written out
    +     * --- perhaps by log_do_checkpoint() --- are flushed out before we
    +     * drop the transactions from the journal. It's unlikely this will be
    +     * necessary, especially with an appropriately sized journal, but we
    +     * need this to guarantee correctness. Fortunately
    +     * cleanup_journal_tail() doesn't get called all that often.
    +     */
    +    if (journal->j_flags & JFS_BARRIER)
    +        blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
    +    spin_lock(&journal->j_state_lock);
    +    if (!tid_gt(first_tid, journal->j_tail_sequence)) {
    +        spin_unlock(&journal->j_state_lock);
    +        /* Someone else cleaned up journal so return 0 */
    +        return 0;
    +    }

A patch for Linux’s write-ahead logging (jbd) in 2012: “Is it safe to omit a disk write barrier here?”
“It's unlikely this will be necessary, … but we need this to guarantee correctness. Fortunately this function doesn't get called all that often.”
![Page 15: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/15.jpg)
Goal: certify a file system under crashes
• FSCQ: first certified crash-safe file system
A complete file system with a machine-checkable proof that its implementation meets its specification, both under normal execution and under any sequence of crashes, including crashes during recovery.
![Page 16: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/16.jpg)
Contributions
• CHL: Crash Hoare Logic
  • Specification framework for crash-safety of storage
  • Crash condition and recovery semantics
  • Automation to reduce proof effort
• FSCQ: the first certified crash-safe file system
  • Basic Unix-like file system (no hard-links, no concurrency)
  • Precise specification for the core subset of POSIX
  • I/O performance on par with Linux ext4
  • CPU overhead is high
![Page 21: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/21.jpg)
FSCQ runs standard Unix programs
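[Diagram: how FSCQ is built and run]
• FSCQ (written in Coq): Crash Hoare Logic (CHL), top-level specification, internal specifications, program, proof
• Coq proof checker checks the proof: OK
• Code extraction produces FSCQ’s Haskell code
• Haskell compiler builds FSCQ’s FUSE server from that code plus Haskell libraries & FUSE driver
• At run time: programs ($ mv src dest, $ git clone repo…, $ make) issue syscalls to the Linux kernel, which makes FUSE upcalls to FSCQ’s FUSE server; disk read(), write(), sync() go to /dev/sda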
![Page 22: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/22.jpg)
FSCQ’s Trusted Computing Base
[Diagram: the components involved in building and running FSCQ: Coq proof checker; FSCQ (written in Coq) with Crash Hoare Logic (CHL), top-level specification, internal specifications, program, and proof; code extraction; FSCQ’s Haskell code; Haskell compiler; Haskell libraries & FUSE driver; FSCQ’s FUSE server; Linux kernel; /dev/sda; connected by syscalls, FUSE upcalls, and disk read(), write(), sync()]
![Page 23: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/23.jpg)
Outline
• Crash safety
  • What is the correct behavior after a crash?
• Challenge 1: formalizing crashes
  • Crash Hoare Logic (CHL)
• Challenge 2: incorporating performance optimizations
  • Disk sequences
• Building a complete file system
• Evaluation
![Page 24: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/24.jpg)
What is crash safety?
• What guarantees should a file system provide when it crashes and reboots?
• Look it up in the POSIX standard?
![Page 26: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/26.jpg)
POSIX is vague about crash behavior
• POSIX’s goal was to specify “common-denominator” behavior
• Gives freedom to file systems to implement their own optimizations
“[...] a power failure [...] can cause data to be lost. The data may be associated with a file that is still open, with one that has been closed, with a directory, or with any other internal system data structures associated with permanent storage. This data can be lost, in whole or part, so that only careful inspection of file contents could determine that an update did not occur.”
IEEE Std 1003.1, 2013 Edition
![Page 29: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/29.jpg)
What is crash safety?
• What guarantees should a file system provide when it crashes and reboots?
• Look it up in the POSIX standard? (Too vague)
• A simple and useful definition is transactional
  • Atomicity: every file-system call is all-or-nothing
  • Durability: every call persists on disk when it returns
• Run every file-system call inside a transaction, using write-ahead logging.
![Page 34: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/34.jpg)
Write-ahead logging
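➡ log_begin()
➡ log_write(2, ‘a’)
➡ log_write(8, ‘b’)
➡ log_write(5, ‘c’)
➡ log_commit()
[Diagram: the log holds the entries (2, a), (8, b), (5, c) and a commit record of 3; disk blocks 2, 5, and 8 now hold a, c, and b]
1. Append writes to the log
2. Set commit record
3. Apply the log to disk locations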
![Page 38: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/38.jpg)
Write-ahead logging
• Recovery: after crash, replay (apply) any committed transaction in the log
• Atomicity: either all writes appear on disk or none do
• Durability: all changes are persisted on disk when log_commit() returns
➡ log_begin()
➡ log_write(2, ‘a’)
➡ log_write(8, ‘b’)
➡ log_write(5, ‘c’)
➡ log_commit()
[Diagram: disk blocks 2, 5, and 8 hold a, c, and b; the log is empty again]
1. Append writes to the log
2. Set commit record
3. Apply the log to disk locations
4. Truncate the log
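The four steps and the recovery rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not FSCQ's implementation: `ToyDisk` is a hypothetical in-memory stand-in for a disk, every update is treated as synchronous, and the function names simply mirror the calls shown on the slide.

```python
class ToyDisk:
    """Hypothetical in-memory disk: a data region plus a log region."""

    def __init__(self):
        self.blocks = {}        # data region: address -> value
        self.log = []           # log entries as (address, value) pairs
        self.commit_record = 0  # number of committed log entries


def log_begin(disk):
    # A new transaction starts from an empty log.
    assert disk.log == [] and disk.commit_record == 0


def log_write(disk, addr, value):
    # Step 1: append the write to the log instead of updating the block.
    disk.log.append((addr, value))


def log_apply(disk):
    # Step 3: copy every committed entry to its home location.
    for addr, value in disk.log[:disk.commit_record]:
        disk.blocks[addr] = value


def log_truncate(disk):
    # Step 4: clearing the commit record empties the log.
    disk.commit_record = 0
    disk.log = []


def log_commit(disk):
    # Step 2: setting the commit record is the commit point.
    disk.commit_record = len(disk.log)
    log_apply(disk)
    log_truncate(disk)


def log_recover(disk):
    # After a crash: replay a committed transaction, discard anything else.
    if disk.commit_record > 0:
        log_apply(disk)
    log_truncate(disk)


disk = ToyDisk()
log_begin(disk)
log_write(disk, 2, "a")
log_write(disk, 8, "b")
log_write(disk, 5, "c")
log_commit(disk)
print(disk.blocks)  # {2: 'a', 8: 'b', 5: 'c'}
```

Because `log_apply` only rewrites blocks with values already stored in the log, running it twice is harmless, which is what makes replay during recovery safe.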
![Page 42: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/42.jpg)
Example: transac$onal crash safety
• Q: How to formally define what happens when the computer crashes?
• Q: How to formally specify the behavior of “create” in the presence of crash and recovery?

    def create(dir, name):
        log_begin()
        newfile = allocate_inode()
        newfile.init()
        dir.add(name, newfile)
        log_commit()

    … after crash …

    def log_recover():
        if committed:
            log_apply()
            log_truncate()
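One informal way to see what a specification of `create` has to cover is to crash a toy version at every possible point, run recovery, and check that the disk holds either the old state or the new state. The sketch below does that by testing, which is far weaker than the proof the talk is about. Everything in it is an assumption made for illustration: `CrashingDisk`, the `budget` counter, the block names `"inode:1"` and `"dir:/"`, and the choice to always truncate during recovery. Crashes during recovery itself are not exercised.

```python
class Crash(Exception):
    """Raised to simulate a crash just before the next disk update."""


class CrashingDisk:
    """Hypothetical disk that crashes after a fixed number of atomic updates."""

    def __init__(self, crash_after):
        self.blocks = {}           # data region: address -> value
        self.log = []              # on-disk log entries (address, value)
        self.committed = False     # on-disk commit record
        self.budget = crash_after  # updates allowed before the crash

    def step(self):
        # Called before every update, so each update is all-or-nothing.
        if self.budget == 0:
            raise Crash()
        self.budget -= 1


def log_write(disk, addr, value):
    disk.step()
    disk.log.append((addr, value))


def log_apply(disk):
    for addr, value in disk.log:
        disk.step()
        disk.blocks[addr] = value


def log_truncate(disk):
    disk.step()
    disk.log, disk.committed = [], False


def log_commit(disk):
    disk.step()
    disk.committed = True  # commit point
    log_apply(disk)
    log_truncate(disk)


def log_recover(disk):
    if disk.committed:
        log_apply(disk)
    log_truncate(disk)  # also discards an uncommitted transaction


def create(disk, name):
    # Toy stand-in for create(dir, name): one inode block, one directory block.
    log_write(disk, "inode:1", "file")
    log_write(disk, "dir:/", name)
    log_commit(disk)


def check_all_crash_points(max_steps=10):
    old, new = {}, {"inode:1": "file", "dir:/": "a.txt"}
    results = []
    for crash_after in range(max_steps):
        disk = CrashingDisk(crash_after)
        try:
            create(disk, "a.txt")
            crashed = False
        except Crash:
            crashed = True
        disk.budget = 10**9  # assumption: recovery runs without crashing
        log_recover(disk)
        assert disk.blocks in (old, new)          # atomicity
        assert crashed or disk.blocks == new      # durability
        results.append(disk.blocks == new)
    return results


print(check_all_crash_points())
```

In this toy model the outcome flips from the old state to the new state exactly at the update that sets the commit record.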
![Page 45: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/45.jpg)
Approach: Crash Hoare Logic
• Crash condition: all intermediate disk states (plus two end-states)
• CHL’s disk model matches what most other file systems assume:
  • Writing a single block is an atomic operation, no data corruption

{pre} code {post} {crash}

SPEC   disk_write(a, v)
PRE    a ⟼ v0
POST   a ⟼ v
CRASH  a ⟼ v0 ∨ a ⟼ v
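To make the three parts of the specification concrete, the following sketch checks them against a toy synchronous write. It is an illustration, not CHL: the predicates `pre`, `post`, and `crash` are hypothetical Python lambdas standing in for the PRE, POST, and CRASH lines, and the generator yields a snapshot of the disk at each point where a crash could occur.

```python
def disk_write(disk, a, v):
    """Atomic single-block write; yields the disk state at each crash point."""
    yield dict(disk)  # a crash here leaves the old value on disk
    disk[a] = v
    yield dict(disk)  # a crash here leaves the new value on disk


def check_spec(a, v0, v):
    pre = lambda d: d[a] == v0                 # PRE:   a maps to v0
    post = lambda d: d[a] == v                 # POST:  a maps to v
    crash = lambda d: d[a] == v0 or d[a] == v  # CRASH: a maps to v0 or to v

    disk = {a: v0}
    assert pre(disk)
    crash_states = list(disk_write(disk, a, v))  # runs the write to completion
    assert all(crash(s) for s in crash_states)
    assert post(disk)
    return crash_states


print(check_spec(7, "old", "new"))  # [{7: 'old'}, {7: 'new'}]
```

The crash condition is deliberately weaker than the postcondition: it has to hold at every intermediate point, including the two end states.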
![Page 49: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/49.jpg)
Asynchronous disk I/O
• For performance, hard drive caches writes in its internal volatile buffer
  • Writes do not persist immediately
• Disk flushes the buffer to media in background
  • Writes might be reordered
• Use write barrier (disk_sync) to force flushing the buffer
  • Make data persistent & enforce ordering
  • Disk syncs are expensive!
![Page 57: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/57.jpg)
Formalizing asynchronous disk I/O
• Challenge: when the system crashes, the disk might lose some of the recent writes
• Idea: use value-sets: a ⟼ ⟨v0, vs⟩
  • Read returns the latest value: v0
  • Write adds a value to the set: a ⟼ ⟨v, {v0} ∪ vs⟩
  • Sync discards previous values: a ⟼ ⟨v0, ∅⟩
  • Reboot chooses a random value: a ⟼ ⟨v′, ∅⟩, where v′ ∈ {v0} ∪ vs
Example: starting from a ⟼ 0, b ⟼ 0, run
  disk_write(a, 1)
  disk_write(b, 2)
  disk_write(a, 3)
Q: What are the possible disk states if the system crashes after the 3 writes?
A: 6 cases: a ⟼ 0 or 1 or 3, b ⟼ 0 or 2
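The value-set model can be sketched in a few lines of Python. This is a toy illustration, not FSCQ's actual definition: the class name `ValueSetDisk`, the per-address `sync`, and the `crash_states` helper are all assumptions made for the example. It replays the three writes from the slide and enumerates the states a reboot could choose.

```python
from itertools import product


class ValueSetDisk:
    """Toy asynchronous disk: each address maps to (latest value, set of older values)."""

    def __init__(self, initial):
        # Every block starts synced: a latest value and no older values pending.
        self.blocks = {a: (v, frozenset()) for a, v in initial.items()}

    def read(self, a):
        latest, _ = self.blocks[a]
        return latest

    def write(self, a, v):
        latest, older = self.blocks[a]
        # The previous latest value joins the set of values a crash may expose.
        self.blocks[a] = (v, older | {latest})

    def sync(self, a):
        latest, _ = self.blocks[a]
        self.blocks[a] = (latest, frozenset())

    def crash_states(self):
        """Enumerate every disk state that a reboot could pick."""
        addrs = sorted(self.blocks)
        choices = [sorted({self.blocks[a][0]} | self.blocks[a][1]) for a in addrs]
        return [dict(zip(addrs, combo)) for combo in product(*choices)]


disk = ValueSetDisk({"a": 0, "b": 0})
disk.write("a", 1)
disk.write("b", 2)
disk.write("a", 3)
states = disk.crash_states()
print(len(states))  # 6
```

The enumeration gives the 6 cases from the answer on the slide: `a` is 0, 1, or 3, and `b` is 0 or 2. Calling `disk.sync("a")` before the crash would leave 3 as the only possible value of `a`.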
![Page 60: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/60.jpg)
CHL asynchronous disk model
• Specifications for disk_write, disk_read, and disk_sync are axioms
• “disk ⊨ …” means the disk address space entails the predicate
SPEC  disk_write(a, v)
PRE   disk ⊨ a ⟼ ⟨v0, vs⟩
POST  disk ⊨ a ⟼ ⟨v, {v0} ∪ vs⟩
CRASH disk ⊨ a ⟼ ⟨v0, vs⟩ ∨ a ⟼ ⟨v, {v0} ∪ vs⟩
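One way to read the disk_write specification is as three executable predicates over a disk state. The sketch below is only an illustration under that reading: the dictionary encoding of the disk and the helper names `pre`, `post`, and `crash` are hypothetical, and a runtime check is of course much weaker than the axiom used in the proofs.

```python
def disk_write(disk, a, v):
    """Return the disk after an asynchronous write; disk maps addr -> (latest, older)."""
    v0, vs = disk[a]
    new_disk = dict(disk)
    new_disk[a] = (v, vs | {v0})
    return new_disk


def pre(disk, a, v0, vs):
    return disk[a] == (v0, vs)


def post(disk, a, v, v0, vs):
    return disk[a] == (v, vs | {v0})


def crash(disk, a, v, v0, vs):
    # A crash may be observed either before or after the write took effect.
    return pre(disk, a, v0, vs) or post(disk, a, v, v0, vs)


before = {"a": (0, frozenset())}
after = disk_write(before, "a", 1)
print(pre(before, "a", 0, frozenset()), post(after, "a", 1, 0, frozenset()))
```

Both the state before the write and the state after it satisfy `crash`, which is the disjunction in the CRASH clause.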
![Page 66: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/66.jpg)
Abstraction layers
• Each abstraction layer forms an address space
• Representation invariants connect logical states between layers
Physical disk (log): a ⟼ ⟨v0, vs⟩
  ↕ log_rep
Logical disk: a ⟼ v
  ↕ files_rep
Files: inum ⟼ file (file0, file1, file2, …, filen)
  ↕ dir_rep
Directory tree
![Page 68: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/68.jpg)
Example: representation invariant
• old_state and new_state are “logical disks” exposed by the logging system
• log_rep connects transaction state to an on-disk representation
  • Describes the log’s on-disk layout using many ⟼ primitives
SPEC  log_write(a, v)
PRE   disk ⊨ log_rep(ActiveTxn, start_state, old_state)
      old_state ⊨ a ⟼ v0
POST  disk ⊨ log_rep(ActiveTxn, start_state, new_state)
      new_state ⊨ a ⟼ v
CRASH disk ⊨ log_rep(ActiveTxn, start_state, any_state)
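To make the role of the representation invariant concrete, here is a minimal sketch that treats the logical disk as a function of lower-level state. It assumes, purely for illustration, that an active transaction is a list of (address, value) pairs overlaid on unchanged data blocks; the names `logical_disk` and `active_txn` are hypothetical, and FSCQ's real log_rep describes an on-disk layout rather than a Python dictionary.

```python
def logical_disk(data_blocks, active_txn):
    """Abstraction function: data blocks overlaid with the active transaction's writes."""
    view = dict(data_blocks)
    for addr, value in active_txn:
        view[addr] = value  # later writes in the transaction win
    return view


def log_write(active_txn, a, v):
    # Only the transaction changes; the data blocks (start_state) stay untouched.
    return active_txn + [(a, v)]


data_blocks = {"a": 0, "b": 5}
txn = []
old_state = logical_disk(data_blocks, txn)
txn = log_write(txn, "a", 7)
new_state = logical_disk(data_blocks, txn)
print(old_state, new_state)
```

Here `old_state` and `new_state` play the roles they have in the log_write specification: they differ only at address `a`, while the start state underneath is unchanged.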
![Page 70: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/70.jpg)
Certifying procedures
• bmap: return the block address at a given offset for an inode
[Figure: the code of bmap annotated with PRE, POST, and CRASH conditions]

    def bmap(inode, bnum):
        if bnum >= NDIRECT:
            indirect = log_read(inode.blocks[NDIRECT])
            return indirect[bnum - NDIRECT]
        else:
            return inode.blocks[bnum]
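The bmap pseudocode can be run as ordinary Python once its surroundings are stubbed out. In the sketch below everything except the body of `bmap` is an assumption made for the example: the value of `NDIRECT`, the `Inode` dataclass, and the dictionary-backed `log_read`.

```python
from dataclasses import dataclass, field

NDIRECT = 10  # hypothetical number of direct block pointers per inode

# Hypothetical logical disk: block address -> block contents.
LOGICAL_DISK = {}


def log_read(addr):
    """Stand-in for the logging layer's read of one block."""
    return LOGICAL_DISK[addr]


@dataclass
class Inode:
    # blocks[0:NDIRECT] are direct pointers; blocks[NDIRECT] is the indirect block's address.
    blocks: list = field(default_factory=list)


def bmap(inode, bnum):
    if bnum >= NDIRECT:
        indirect = log_read(inode.blocks[NDIRECT])
        return indirect[bnum - NDIRECT]
    else:
        return inode.blocks[bnum]


LOGICAL_DISK[500] = [700, 701, 702]
inode = Inode(blocks=list(range(100, 100 + NDIRECT)) + [500])
print(bmap(inode, 3))            # direct pointer: 103
print(bmap(inode, NDIRECT + 1))  # through the indirect block: 701
```

Only the indirect branch calls `log_read`, which is why that branch is the one that needs the callee's pre-, post-, and crash conditions.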
![Page 74: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/74.jpg)
Certifying procedures
• Follow the control flow graph
• Need pre/post/crash conditions for each called procedure
• Chain pre- and postconditions, forming proof obligations
• CHL: combines crash conditions, yielding more proof obligations
[Figure: control flow graph of procedure bmap() (if, log_read, return) from PRE to POST, with a CRASH condition]
![Page 77: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/77.jpg)
Proof automation
• CHL follows the CFG, and generates proof obligations
• CHL solves trivial obligations automatically (common case)
• Remaining proof effort: changing representation invariants
  • Show that rep invariant holds at entry and exit
[Figure: control flow graph of procedure bmap() (if, log_read, return) with PRE, POST, and CRASH conditions, each annotated with inodes_rep]
![Page 81: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/81.jpg)
Specifying an entire system call (simplified)
SPEC  create(dnum, fn)
PRE   disk ⊨ log_rep(NoTxn, start_state)
      start_state ⊨ dir_rep(tree) ∧
      ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST  disk ⊨ log_rep(NoTxn, new_state)
      new_state ⊨ dir_rep(new_tree) ∧
      new_tree = tree.update(path, fn, EmptyFile)
CRASH disk ⊨ log_rep(NoTxn, start_state) ∨
             log_rep(NoTxn, new_state) ∨
             log_rep(ActiveTxn, start_state, any_state) ∨
             log_rep(CommittingTxn, start_state, new_state)
• The four crash cases are abbreviated as would_recover_either(start_state, new_state)
![Page 82: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/82.jpg)
Specifying an entire system call (simplified)
SPEC  create(dnum, fn)
PRE   disk ⊨ log_rep(NoTxn, start_state)
      start_state ⊨ dir_rep(tree) ∧
      ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST  disk ⊨ log_rep(NoTxn, new_state)
      new_state ⊨ dir_rep(new_tree) ∧
      new_tree = tree.update(path, fn, EmptyFile)
CRASH disk ⊨ would_recover_either(start_state, new_state)
![Page 84: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/84.jpg)
Specifying log recovery
• log_recover() is idempotent:
  • Crash condition implies its own precondition
  • OK to run log_recover() again after a crash during log_recover() itself
SPEC  log_recover()
PRE   disk ⊨ would_recover_either(last_state, committed_state)
POST  disk ⊨ log_rep(NoTxn, last_state) ∨
             log_rep(NoTxn, committed_state)
CRASH disk ⊨ would_recover_either(last_state, committed_state)
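Idempotence can be illustrated with a toy recovery procedure that is interrupted at every possible point and then rerun. This is a sketch under simplifying assumptions, not FSCQ's log_recover: the dictionary layout of the disk, the `crash_after` parameter, and the modeling of "clear the commit flag and the log" as one atomic step are all hypothetical.

```python
def fresh_disk():
    # Hypothetical on-disk state: a committed log that has not been applied yet.
    return {"data": {"a": 0, "b": 0}, "log": [("a", 1), ("b", 2)], "committed": True}


def log_recover(disk, crash_after=None):
    """Replay a committed log, then clear it. Returns False if a simulated crash interrupted it."""
    steps = 0
    if disk["committed"]:
        for addr, value in disk["log"]:
            if crash_after is not None and steps >= crash_after:
                return False
            disk["data"][addr] = value  # replaying the same entry twice is harmless
            steps += 1
    if crash_after is not None and steps >= crash_after:
        return False
    # Assumption: clearing the commit flag and the log is one atomic step.
    disk["committed"] = False
    disk["log"] = []
    return True


results = []
for k in range(4):
    disk = fresh_disk()
    log_recover(disk, crash_after=k)  # may stop part-way through
    log_recover(disk)                 # rerun after the simulated reboot
    results.append(disk["data"])
print(results)
```

Whatever the interruption point, the rerun ends in the committed state. The interrupted run always leaves a disk that still satisfies the precondition of recovery, which is the sense in which the crash condition implies the precondition.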
![Page 88: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/88.jpg)
Recovery execution semantics
• Joint execution of two procedures: bmap ⨝ log_recover
• Whenever bmap (or log_recover) crashes, run log_recover after reboot
[Figure: control flow graph of procedure bmap() (if, log_read, return) from PRE to POST; the CRASH condition leads into log_recover, which ends in RECOVER]
![Page 89: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/89.jpg)
End-to-end specification
• create() is atomic, if log_recover() runs after every crash
• POST is stronger than RECOVER
SPEC    create(dnum, fn) ⨝ log_recover()
PRE     disk ⊨ log_rep(NoTxn, start_state)
        start_state ⊨ dir_rep(tree) ∧
        ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST    disk ⊨ log_rep(NoTxn, new_state)
        new_state ⊨ dir_rep(new_tree) ∧
        new_tree = tree.update(path, fn, EmptyFile)
RECOVER disk ⊨ log_rep(NoTxn, start_state) ∨
               log_rep(NoTxn, new_state)
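The atomicity claim can be explored with a small simulation that crashes a logged operation at every step, optionally crashes recovery as well, and then lets recovery run to completion. This is a toy sketch: the two-block transaction `atomic_two_write`, the step-budget crash mechanism, and the on-disk layout are assumptions made for the example, not FSCQ code.

```python
class Crash(Exception):
    """Raised to simulate a crash before the next disk operation."""


def step(budget):
    # Each disk operation consumes one unit of budget; an empty budget means "crash now".
    if budget[0] == 0:
        raise Crash()
    budget[0] -= 1


def apply_log(disk, budget):
    for addr, value in list(disk["log"]):
        step(budget)
        disk["data"][addr] = value
    step(budget)
    # Assumption: clearing the commit flag and the log is a single atomic disk write.
    disk["committed"] = False
    disk["log"] = []


def atomic_two_write(disk, budget):
    for entry in [("a", 1), ("b", 2)]:
        step(budget)
        disk["log"].append(entry)
    step(budget)
    disk["committed"] = True  # commit point
    apply_log(disk, budget)


def log_recover(disk, budget):
    if disk["committed"]:
        apply_log(disk, budget)
    else:
        step(budget)
        disk["log"] = []


def run_with_recovery(crash_at, recover_crash_at=None):
    disk = {"data": {"a": 0, "b": 0}, "log": [], "committed": False}
    try:
        atomic_two_write(disk, [crash_at])
    except Crash:
        if recover_crash_at is not None:
            try:
                log_recover(disk, [recover_crash_at])
            except Crash:
                pass  # recovery itself crashed; run it again after reboot
        log_recover(disk, [10**9])
    return disk["data"]


outcomes = {
    tuple(sorted(run_with_recovery(k, r).items()))
    for k in range(7)
    for r in (None, 0, 1, 2)
}
print(outcomes)
```

Across all crash points the set of outcomes contains only the start state and the new state, mirroring the two disjuncts of RECOVER. A run that never crashes ends in the new state, which matches the remark that POST is stronger than RECOVER.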
![Page 90: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/90.jpg)
CHL summary
• Key ideas: crash conditions and recovery semantics
• CHL benefit: enables precise failure specifications
  • Allows for automatic chaining of pre/post/crash conditions
  • Reduces proof burden
• CHL cost: must write crash condition for every function, loop, etc.
  • Crash conditions are often simple (above logging layer)
![Page 91: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/91.jpg)
Outline
• Crash safety
  • What is the correct behavior after a crash?
• Challenge 1: formalizing crashes
  • Crash Hoare Logic (CHL)
• Challenge 2: incorporating performance optimizations
  • Disk sequences
• Building a complete file system
• Evaluation
✔
![Page 92: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/92.jpg)
FSCQ implements many optimizations
• Group commit
  • Buffer transactions in memory, and flush them in a single batch
  • Relax durability guarantee
• Log-bypass writes
  • File data writes go to the disk (buffer cache) directly
• Log checksums
  • Checksum log entries to reduce write barriers
• Deferred apply
  • Apply the log only when the log is full
![Page 96: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/96.jpg)
Example: group commit
➡ mkdir(‘d’)
➡ create(‘d/a’)
➡ rename(‘d/a’, ‘d/b’)
➡ fsync(‘d’)
[Figure: transaction cache in memory; log and data regions on disk]
1. Each file-system call forms a transaction, which is buffered in the transaction cache
2. fsync() flushes cached transactions to the on-disk log in a batch
  • Preserve order
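The two steps can be sketched as a small class. It is a minimal illustration, assuming transactions are opaque (name, writes) pairs and the on-disk log is a Python list; the class and attribute names are hypothetical and no real disk I/O is performed.

```python
class GroupCommitLog:
    """Toy group commit: buffer transactions in memory, flush them in one ordered batch."""

    def __init__(self):
        self.txn_cache = []    # in-memory transactions, oldest first
        self.on_disk_log = []  # flushed transactions, oldest first
        self.flush_count = 0   # number of batches written to the log

    def run(self, name, writes):
        # Step 1: each file-system call forms a transaction and is only buffered.
        self.txn_cache.append((name, writes))

    def fsync(self):
        # Step 2: flush all cached transactions at once, preserving their order.
        if self.txn_cache:
            self.on_disk_log.extend(self.txn_cache)
            self.txn_cache.clear()
            self.flush_count += 1


log = GroupCommitLog()
log.run("mkdir('d')", {"dir:d": "created"})
log.run("create('d/a')", {"file:d/a": "created"})
log.run("rename('d/a','d/b')", {"file:d/a": "removed", "file:d/b": "created"})
buffered_before_fsync = len(log.on_disk_log)
log.fsync()
print(buffered_before_fsync, [name for name, _ in log.on_disk_log], log.flush_count)
```

Before `fsync` nothing has reached the log, which is the relaxed durability guarantee mentioned earlier; afterwards the three transactions appear in issue order and were written as a single batch.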
![Page 97: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/97.jpg)
Challenge: formalizing group commit
➡ mkdir(‘d’)
➡ create(‘d/a’)
• Many more crash states (e.g., before or after mkdir())
• On-disk state after a crash may be unrelated to create() itself, and instead depend on some previous operations
SPEC  create(dnum, fn)
PRE   disk ⊨ log_rep(NoTxn, start_state)
      start_state ⊨ dir_rep(tree) ∧
      ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST  disk ⊨ log_rep(NoTxn, new_state)
      new_state ⊨ dir_rep(new_tree) ∧
      new_tree = tree.update(path, fn, EmptyFile)
CRASH disk ⊨ would_recover_either(start_state, new_state)
![Page 103: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/103.jpg)
Specification idea: disk sequences
• Each (cached) system call adds a new logical disk to the sequence
• Each logical disk has a corresponding tree
• Capture the idea that metadata updates must be ordered
[Figure: disk sequence disk0 (flushed state), disk1, …, diskn (latest), each related to a tree by tree_rep; in-memory transactions txn1, txn2, …, txnn; write-ahead log]
![Page 104: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/104.jpg)
New specification with disk sequence
• Specification isn’t more complicated
SPEC  create(dnum, fn)
PRE   disk ⊨ log_rep(NoTxn, disk_seq)
      disk_seq.latest ⊨ dir_rep(tree) ∧
      ∃ path, tree[path].node = dnum ∧ fn ∉ tree[path]
POST  disk ⊨ log_rep(NoTxn, disk_seq ++ {new_state})
      new_state ⊨ dir_rep(new_tree) ∧
      new_tree = tree.update(path, fn, EmptyFile)
CRASH disk ⊨ would_recover_any(disk_seq ++ {new_state})
![Page 105: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/105.jpg)
Specification for fsync on directories
• After fsync(), there is only one possible on-disk state (the latest one)
SPEC  fsync(dir_inum)
PRE   disk ⊨ log_rep(NoTxn, disk_seq)
      disk_seq.latest ⊨ tree_rep(tree) ∧
      IsDir(find_inum(tree, dir_inum))
POST  disk ⊨ log_rep(NoTxn, {disk_seq.latest})
CRASH disk ⊨ would_recover_any(disk_seq)
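The create and fsync specifications can be mirrored by a toy disk sequence. The sketch assumes a logical disk is just a directory tree encoded as a dictionary from directory name to a set of file names; the class `DiskSeq` and its methods are hypothetical stand-ins for disk_seq, `++`, and would_recover_any.

```python
class DiskSeq:
    """Toy disk sequence: disks[0] is the flushed state, disks[-1] the latest."""

    def __init__(self, flushed_tree):
        self.disks = [flushed_tree]

    @property
    def latest(self):
        return self.disks[-1]

    def create(self, dirname, fn):
        tree = self.latest
        # Precondition: the directory exists and does not contain fn yet.
        if dirname not in tree or fn in tree[dirname]:
            raise ValueError("precondition of create does not hold")
        new_tree = {d: set(files) for d, files in tree.items()}
        new_tree[dirname].add(fn)
        self.disks.append(new_tree)  # disk_seq ++ {new_state}

    def fsync(self):
        # After fsync only the latest logical disk remains possible.
        self.disks = [self.latest]

    def would_recover_any(self):
        """Every logical disk that recovery might produce after a crash."""
        return list(self.disks)


seq = DiskSeq({"d": set()})
seq.create("d", "a")
seq.create("d", "b")
before = seq.would_recover_any()
seq.fsync()
after = seq.would_recover_any()
print(len(before), after)
```

Before `fsync` a crash may recover to any of three trees (no files, only `a`, or both `a` and `b`), always in prefix order, which is how the sequence captures ordered metadata updates. After `fsync` the only candidate is the latest tree.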
![Page 106: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/106.jpg)
Formalization techniques for optimizations
• Group commit
  • Disk sequences: captures ordered metadata updates
• Log-bypass writes
  • Disk relations: enforces safety w.r.t. metadata updates
• Log checksums
  • Checksum model: soundly reasons about hash collision
✔
![Page 107: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/107.jpg)
Outline
• Crash safety
  • What is the correct behavior after a crash?
• Challenge 1: formalizing crashes
  • Crash Hoare Logic (CHL)
• Challenge 2: incorporating performance optimizations
  • Disk sequences
• Building a complete file system
• Evaluation
![Page 112: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/112.jpg)
FSCQ: building a complete file system
• File system design is close to v6 Unix (+ logging)
• Implementation aims to reduce proof effort
  • Many precise internal abstraction layers
    • e.g., split File and Inode
  • Reuse proven components
    • e.g., general bitmap allocator
  • Simpler specifications
    • e.g., no hard link ⇒ tree spec
[Figure: FSCQ components: FSCQ system calls, Directory, Directory tree, Block-level file, Inode, Bitmap allocator, Buffer cache, Write-ahead log; Directory tree highlighted]
![Page 113: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/113.jpg)
Evaluation
• What bugs do FSCQ’s theorems eliminate?
• How much development effort is required for FSCQ?
• How well does FSCQ perform?
![Page 115: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/115.jpg)
Does FSCQ eliminate bugs?
• One data point: once theorems proven, no implementation bugs in proven code
  • Did find some mistakes in spec, as a result of end-to-end checks
  • E.g., forgot to specify that extending a file should zero-fill
• Systematic study
  • Categorize bugs from Linux kernel’s patch history
  • Manually examine if FSCQ can eliminate bugs in each category
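The zero-fill example shows how a proof can succeed against a spec that is too weak. The following Python sketch is an assumption-laden illustration, not FSCQ's specification: files are modeled as `bytes`, and all four function names are hypothetical. It contrasts a spec that omits zero-fill with one that requires it.

```python
def weak_spec(old, new, new_len):
    """Incomplete spec: the file has the new length and keeps its old prefix."""
    return len(new) == new_len and new[:len(old)] == old


def strong_spec(old, new, new_len):
    """Strengthened spec: additionally, every newly added byte must be zero."""
    return weak_spec(old, new, new_len) and new[len(old):] == bytes(new_len - len(old))


def extend_leaky(old, new_len):
    """Hypothetical buggy implementation: exposes stale bytes in the new region."""
    return old + b"\xff" * (new_len - len(old))


def extend_zero(old, new_len):
    """Hypothetical correct implementation: zero-fills the new region."""
    return old + bytes(new_len - len(old))


old = b"abc"
leaky = extend_leaky(old, 6)
zeroed = extend_zero(old, 6)
results = {
    "weak accepts leaky": weak_spec(old, leaky, 6),
    "strong accepts leaky": strong_spec(old, leaky, 6),
    "strong accepts zeroed": strong_spec(old, zeroed, 6),
}
```

The weak spec accepts the leaky implementation, so a proof against it says nothing about stale data; only an end-to-end check that compares behavior with expectations reveals the gap.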
![Page 118: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/118.jpg)
FSCQ’s theorems eliminate many bugs

| Bug category | Example | Prevented? |
| --- | --- | --- |
| Mistakes in logging logic | combining incompatible optimizations | ✔ |
| Misuse of logging API | releasing indirect block in two transactions | ✔ |
| Mistakes in recovery protocol | issuing write barrier in the wrong order | ✔ |
| Improper corner-case handling | running out of blocks during rename | ✔ |
| Low-level bugs | double free, integer overflow | Some (memory safe) |
| Returning incorrect error code | | Some |
| Concurrency | | Not supported |
| Security | | Not supported |
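The logging and recovery rows all concern one property: after a crash and recovery, a transaction is either fully applied or not applied at all. The Python sketch below is a minimal model of a write-ahead log under stated assumptions, not FSCQ's implementation: the names `Disk`, `commit`, `apply_log`, and `recover` are hypothetical, each persistent update is treated as atomic, and recovery itself is assumed not to crash. It crashes the commit at every possible point and records the recovered state.

```python
class Crash(Exception):
    """Raised to simulate a crash once the write budget is exhausted."""


class Disk:
    def __init__(self, data, crash_after=None):
        self.data = dict(data)          # home locations of blocks
        self.log = []                   # logged (address, value) pairs
        self.committed = False          # commit record
        self.writes_left = crash_after  # None means "never crash"

    def step(self):
        # Each persistent update costs one write; crash when none are left.
        if self.writes_left is not None:
            if self.writes_left == 0:
                raise Crash()
            self.writes_left -= 1


def apply_log(disk):
    for addr, val in disk.log:          # install logged updates in place
        disk.step()
        disk.data[addr] = val
    disk.step()                         # truncate log and clear commit record
    disk.log, disk.committed = [], False


def commit(disk, updates):
    for addr, val in updates:           # 1. append every update to the log
        disk.step()
        disk.log.append((addr, val))
    disk.step()                         # 2. write the commit record
    disk.committed = True
    apply_log(disk)                     # 3. install, then truncate


def recover(disk):
    disk.writes_left = None             # assumption: recovery does not crash
    if disk.committed:
        apply_log(disk)                 # replay a committed transaction
    else:
        disk.log = []                   # discard an uncommitted one


old = {0: "a", 1: "b"}
updates = [(0, "x"), (1, "y")]
new = {0: "x", 1: "y"}

outcomes = []
for n in range(7):  # crash before each of the 6 writes, plus one crash-free run
    disk = Disk(old, crash_after=n)
    try:
        commit(disk, updates)
    except Crash:
        pass
    recover(disk)
    outcomes.append(disk.data)
```

Every recovered state equals either `old` or `new`. Reordering steps 1 and 2, or installing before the commit record is written, breaks that property, which is the kind of mistake the table's first three rows describe.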
![Page 121: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/121.jpg)
Development effort
• Total of ~50,000 lines of verified code, specs, and proofs in Coq
  • > 50% reusable infrastructure
• Comparison: ext4 has ~60,000 lines of C code (many more features)
• What’s the cost of adding new features to FSCQ?
[Pie chart: share of lines per component. Components: CHL infrastructure, General data structures, Write-ahead log, Buffer cache, Inodes and files, Directories, Top-level API. Slice labels as extracted: 4%, 12%, 7%, 5%, 21%, 8%, 44%]
![Page 126: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/126.jpg)
Change effort proportional to scope of change
• Indirect blocks:
  • + 1,500 lines in Inode
• Write-back buffer cache:
  • + 2,300 lines beneath log
  • ~ 600 lines in rest of FSCQ
• Group commit:
  • + 1,800 lines in Log
  • ~ 100 lines in rest of FSCQ
• Changed lines include code, specs and proofs
[Diagram: layer stack listing FSCQ system calls, Directory, Directory tree, Block-level file, Inode, Bitmap allocator, Buffer cache, Write-ahead log]
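The figures above can be put in proportion with a few lines of arithmetic. This Python sketch uses only numbers from the slides, plus two labeled assumptions: the total is taken as exactly 50,000 lines although the slide says "~50,000", and the indirect-blocks change is assumed to touch no lines outside Inode because the slide lists none.

```python
TOTAL_LINES = 50_000  # assumption: the slide gives "~50,000"

# (lines in the affected layer, lines in the rest of FSCQ)
changes = {
    "indirect blocks": (1_500, 0),  # assumption: nothing listed outside Inode
    "write-back buffer cache": (2_300, 600),
    "group commit": (1_800, 100),
}

summary = {}
for feature, (local, elsewhere) in changes.items():
    total = local + elsewhere
    summary[feature] = {
        "total": total,
        # share of the whole code base touched by this feature
        "share_of_codebase_pct": round(100 * total / TOTAL_LINES, 1),
        # share of the change that stayed inside the affected layer
        "local_pct": round(100 * local / total, 1),
    }
```

Under these assumptions each feature touches under 6% of the code base, and at least 79% of each change stays inside the layer being modified.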
![Page 127: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/127.jpg)
Performance comparison
• File-system-intensive workload
  • LFS “largefile” benchmark
  • mailbench, a qmail-like mail server
• Compare with ext4 (non-certified) in default mode
  • Mount option: async,data=ordered
• Use FUSE to forward and serialize requests (disable concurrency)
• Running on a hard disk on a desktop
  • Quad-core Intel i7-980X 3.33 GHz / 24 GB / Hitachi HDS721010CLA332
  • Linux 3.11 / GHC 8.0.1 / all file systems run on a separate partition
![Page 129: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/129.jpg)
FSCQ Performance
• FSCQ’s CPU overhead is high
• FSCQ’s I/O performance is on par with ext4
[Bar chart: running time in seconds (axis 0 to 30) for FSCQ and ext4 on largefile and mailbench]

Number of disk I/Os per operation

| | largefile write | largefile sync | mailbench write | mailbench sync |
| --- | --- | --- | --- | --- |
| FSCQ | 1.0 | 1.0 | 50.0 | 9.8 |
| ext4 | 1.0 | 1.0 | 38.0 | 12.3 |
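To make "on par" concrete, the I/O counts above can be turned into FSCQ-to-ext4 ratios. The Python sketch below uses only the table's numbers; the variable names are illustrative.

```python
# Disk I/Os per operation, copied from the table above.
ios = {
    "FSCQ": {"largefile write": 1.0, "largefile sync": 1.0,
             "mailbench write": 50.0, "mailbench sync": 9.8},
    "ext4": {"largefile write": 1.0, "largefile sync": 1.0,
             "mailbench write": 38.0, "mailbench sync": 12.3},
}

# Ratio above 1 means FSCQ issues more I/Os than ext4 for that operation.
ratios = {
    op: round(ios["FSCQ"][op] / ios["ext4"][op], 2)
    for op in ios["FSCQ"]
}
```

On mailbench, FSCQ issues roughly a third more I/Os per write and about a fifth fewer per sync; on largefile the two are identical.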
![Page 130: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/130.jpg)
Future directions
• Extracting to native code
  • Reduce both CPU overhead and TCB
• Certifying crash-safe applications
  • Use FSCQ’s top-level spec to certify a mail server or a KV store
• Supporting concurrency
  • Run FSCQ in a multi-user environment
  • Exploit both I/O concurrency and parallelism
![Page 132: Cerfying a Crash-safe File System · Cerfying a Crash-safe File System ... + * --- perhaps by log_do_checkpoint() --- are flushed out before we + * drop the transactions from the](https://reader033.vdocuments.mx/reader033/viewer/2022051800/5ac2df9e7f8b9a433f8e913d/html5/thumbnails/132.jpg)
Conclusion
• CHL helps specify and prove crash safety
  • Crash conditions
  • Recovery execution semantics
• FSCQ: first certified crash-safe file system
  • Precise specification in presence of crashes
  • I/O performance on par with Linux ext4
  • Moderate development effort
https://github.com/mit-pdos/fscq-impl