![Page 1: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/1.jpg)
Fine-Grained Fault Tolerance using Device Checkpoints
Asim Kadav with Matthew Renzelmann and Michael M. Swift
University of Wisconsin-Madison
The (old) elephant in the room
★ Device drivers (majority of kernel code)
★ 3rd-party developers + OS kernel
★ Recipe for disaster
Extensive past work on reliability research

| Improvement | System | Validation: drivers | Validation: buses | Validation: classes |
| --- | --- | --- | --- | --- |
| Isolation | Nooks [SOSP 03] | 6 | 1 | 2 |
| Isolation | XFI [OSDI 06] | 2 | 1 | 1 |
| Isolation | CuriOS [OSDI 08] | 2 | 1 | 2 |
| Type safety | SafeDrive [OSDI 06] | 6 | 2 | 3 |
| Type safety | Singularity [EuroSys 06] | 1 | 1 | 1 |
| Specification | Nexus [OSDI 08] | 2 | 1 | 2 |
| Specification | Termite [SOSP 09] | 2 | 1 | 2 |
| Recovery | Shadow Drivers [OSDI 04] | 13 | 1 | 3 |
| Static analysis tools | Windows SDV [EuroSys 06] | All | All | All |
| Static analysis tools | Coverity [CACM 10] | All | All | All |
| Static analysis tools | Coccinelle [EuroSys 08] | All | All | All |

Observation 1: Solutions that limit changes to the kernel and apply to lots of drivers have real impact
Observation 2: Most systems focus on improving isolation and detection, not on recovery
Driver failure recovery limited to driver restart
★ Restart driver upon failure
  ★ SafeDrive and MINIX approach
  ★ Can break applications
★ Restart and replay upon failure
  ★ Shadow driver approach
  ★ Always record state of driver
  ★ Perform restart and log replay upon failure
  ★ Transparent to applications
[Figure: Shadow drivers. Labels: Applications, Kernel, Driver-Kernel Interface, Shadow Driver, Device Driver, Device]
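The restart-and-replay idea above can be illustrated with a small simulation. This is a minimal sketch, not the Shadow Drivers implementation: `ToyDriver`, `ShadowLog` and the `set_mtu`/`set_promisc` operations are hypothetical names chosen for illustration.

```python
class ToyDriver:
    """Hypothetical driver whose configuration lives only in memory."""

    def __init__(self):
        self.config = {}

    def call(self, op, value):
        # Each configuration call mutates driver state.
        self.config[op] = value


class ShadowLog:
    """Records configuration calls so they can be replayed after a restart."""

    def __init__(self, driver):
        self.driver = driver
        self.log = []

    def call(self, op, value):
        self.log.append((op, value))  # always record state-changing calls
        self.driver.call(op, value)

    def recover(self):
        # Restart: discard the failed instance and create a fresh one.
        self.driver = ToyDriver()
        # Replay: reapply the recorded calls in their original order.
        for op, value in self.log:
            self.driver.call(op, value)


shadow = ShadowLog(ToyDriver())
shadow.call("set_mtu", 1500)
shadow.call("set_promisc", True)
before = dict(shadow.driver.config)
shadow.recover()
after = dict(shadow.driver.config)
```

In this toy model recovery is transparent because every state change went through the log; a real restart would also have to re-run the driver's hardware initialization.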
Problem 1: Restart-based driver recovery is slow
Shadow drivers restart the driver upon failure, which can be slow
[Figure: Restart times, on an axis from 0 ms to 2,000 ms, for 8139too (net), e1000 (net), ens1371 (sound) and psmouse (input)]
Driver re-initialization probes hardware again
★ What does slow device re-initialization hurt?
  ★ Fault tolerance: Driver recovery
  ★ Virtualization: Live migration
  ★ OS functions: Fast reboot
Allocate device structures
Set chipset specific ops
Map BAR and I/O ports
Register device operations
Detect chipset capabilities
Cold boot device
Verify EEPROM checksum
Device self test
Configure device
Device ready
Problem 2: Shadow drivers assume drivers follow class behavior
★ Class definition includes:
  ★ Callbacks registered with the bus, device and kernel subsystem
How many drivers follow class behavior and how much code does this add and ...
[Figure: shadow drivers. Labels: network driver, network card, bus, net device subsystem, kernel; callbacks: probe, xmit, config]
Problem 2(a): Drivers do behave outside class definitions
★ Non-class behavior that affects recovery:
  - procfs/sysfs interactions and unique ioctls
Linux sound card config via sysfs:
$ echo 1 > /sys/class/sound/mixer/device/enable
Windows WLAN card config via private ioctls
At least 16% of drivers have non-class behavior and may not recover correctly using shadow drivers
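A toy example of why non-class behavior can defeat class-based recovery. This is only a sketch under stated assumptions: `ToySoundDriver`, `set_volume` and `sysfs_write_enable` are hypothetical, and the recovery layer is reduced to a list of the class-interface calls it observed.

```python
class ToySoundDriver:
    """Hypothetical driver configured through two different paths."""

    def __init__(self):
        self.state = {"volume": 0, "mixer_enabled": False}

    def set_volume(self, level):
        # Class-defined callback: visible to a class-based recovery layer.
        self.state["volume"] = level

    def sysfs_write_enable(self, on):
        # Non-class path (for example a sysfs attribute): not visible to it.
        self.state["mixer_enabled"] = on


def restart_and_replay(class_log):
    """Rebuild the driver using only calls seen on the class interface."""
    fresh = ToySoundDriver()
    for level in class_log:
        fresh.set_volume(level)
    return fresh


driver = ToySoundDriver()
class_log = []

class_log.append(7)              # the class call is recorded...
driver.set_volume(7)
driver.sysfs_write_enable(True)  # ...the sysfs write is not

recovered = restart_and_replay(class_log)
```

After recovery the volume is back, but the setting made through the non-class path is silently lost.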
Problem 2(b): Too many classes
★ "Understanding Modern Device Drivers", ASPLOS 2012
[Figure: taxonomy of Linux device drivers. Top-level classes: char (52%), block (16%), net (27%), other (5%). Labeled sub-classes with shares: ata (1%), mtd (1.5%), scsi (9.6%), gpu (3.9%), media (10.5%), isdn (3.4%), sound (10%), video (5.2%). Other labels: cdrom, ide, md (RAID), mmc, network RAID, floppy, tape, acpi, bluetooth, crypto, firewire, input, joystick, keyboard, mouse, touchscreen, tablet, game port, serio, leds, pcm, midi, mixer, thermal, tty, atm, ethernet, infiniband, wireless, wimax, token ring, gpio, tpm, serial, display, lcd, backlight, pata, sata, disk, fiber channel, iscsi, usb-storage, osd, raid, drm, vga, bus drivers, xen/lguest, dma/pci libs, radio, digital video broadcasting, wan, uwb, driver libraries]
Class-specific driver recovery leads to a large kernel recovery subsystem
Fine-Grained Fault Tolerance (FGFT)
Fine-grained isolation
★ Runs driver entry points like transactions
★ Relies on code generation to limit new code in kernel
★ Requires incremental overhead/changes to drivers
★ Shifts burden of fault tolerance to faulty code
Checkpoint-based recovery
★ Provides fast and correct recovery semantics
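One way to picture "runs driver entry points like transactions" is the sketch below. It is an illustration under assumptions, not FGFT's generated code: `run_as_transaction`, `DriverFault` and `set_ring_size` are hypothetical names, and a Python exception stands in for a detected fault.

```python
import copy


class DriverFault(Exception):
    """Stands in for a detected memory error or processor exception."""


def run_as_transaction(state, entry_point, *args):
    """Run entry_point on a private copy of state; commit only on success."""
    working = copy.deepcopy(state)
    try:
        result = entry_point(working, *args)
    except DriverFault:
        # Abort: drop the working copy, keep the original state untouched.
        return state, None, False
    # Commit: the working copy becomes the new driver state.
    return working, result, True


def set_ring_size(state, size):
    """Hypothetical entry point that faults after a partial update."""
    state["ring_size"] = size
    if size <= 0:
        raise DriverFault("invalid ring size")
    return size


state = {"ring_size": 256}
state, result, ok = run_as_transaction(state, set_ring_size, 512)
good = (dict(state), result, ok)
state, result, ok = run_as_transaction(state, set_ring_size, -1)
bad = (dict(state), result, ok)
```

The second call fails after modifying its copy, yet the caller still sees the state committed by the first call and simply receives an error.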
Outline
Introduction
Fine-grained isolation
Checkpoint-based recovery
Evaluation and Conclusions
Unit of fault tolerance: Driver entry point
★ Provide fault tolerance to specific driver entry points
★ Can be applied to untested code or code marked suspicious by static or runtime tools
[Figure: network driver (entry points probe, xmit, config) and network card, shown first under whole driver isolation and then under FGFT isolation]
Transactional support through code generation
[Figure: a get ringparam call passes through generated stubs between the network driver and an SFI network driver; labels: netdev, stubs, result]

Range Table

| Address | Access rights |
| --- | --- |
| 0xffffa000 | Read |
| 0xffffa008 | Write |
| 0xffffa00a | Read |

★ Detects and recovers from:
  ★ Memory errors like invalid pointer accesses
  ★ Structural errors like malformed structures
  ★ Processor exceptions like divide by zero, stack corruption
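A range table of this kind could be checked as in the sketch below. This is a hedged illustration, not FGFT's SFI implementation. The slide lists only start addresses, so three things here are assumptions: each entry extends up to the next one, the table ends at a hypothetical `end` address, and a writable range may also be read.

```python
import bisect

READ, WRITE = "read", "write"


class RangeTable:
    """Maps address ranges to the access right granted to isolated code."""

    def __init__(self, entries, end):
        # entries: (start_address, right) pairs sorted by start address.
        # end: first address past the last range (an assumed value).
        self.starts = [start for start, _ in entries]
        self.rights = [right for _, right in entries]
        self.end = end

    def allows(self, address, access):
        if not self.starts or address < self.starts[0] or address >= self.end:
            return False  # outside every known range: treat as a fault
        index = bisect.bisect_right(self.starts, address) - 1
        right = self.rights[index]
        # Assumption: write permission implies read permission.
        return access == right or (access == READ and right == WRITE)


table = RangeTable(
    [(0xFFFFA000, READ), (0xFFFFA008, WRITE), (0xFFFFA00A, READ)],
    end=0xFFFFA010,
)
checks = [
    table.allows(0xFFFFA004, READ),
    table.allows(0xFFFFA004, WRITE),
    table.allows(0xFFFFA008, WRITE),
    table.allows(0xFFFFA00A, WRITE),
    table.allows(0xDEADBEEF, READ),
]
```

A denied access is where an isolation layer would abort the entry point and trigger recovery instead of letting the access reach kernel memory.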
Outline
Introduction
Fine-grained isolation
Checkpoint-based recovery
Conclusion
Checkpointing drivers is hard
★ Easy to capture memory state
★ Device state is not captured
  ★ Device configuration space
  ★ Internal device registers and counters
  ★ Memory buffer addresses used for DMA
★ Unique for every device
Intuition: Operating systems already capture device state during power management
[Figure: checkpoint of a network driver and its network card]
Intuition with power management
★ Refactor power management code for device checkpoints
  ★ Correct: Developer captures unique device semantics
  ★ Fast: Avoids probe and latency critical for applications
★ Ask developers to export checkpoint/restore in their drivers
![Page 50: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/50.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Disable device
Save DMA state
Suspend device
Restore config state
Restore register state
Restore or reset DMA state
Re-attach/Enable device
Device Ready
Suspend Resume
17
![Page 51: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/51.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Suspend device
Restore config state
Restore register state
Restore or reset DMA state
Re-attach/Enable device
Device Ready
Suspend Resume
17
![Page 52: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/52.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Restore config state
Restore register state
Restore or reset DMA state
Re-attach/Enable device
Device Ready
Suspend Resume
17
![Page 53: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/53.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Restore config state
Restore register state
Restore or reset DMA state
Re-attach/Enable device
Device Ready
Suspend Resume
17
![Page 54: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/54.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Restore config state
Restore register state
Restore or reset DMA state
Re-attach/Enable device
Device Ready
Resume Checkpoint
17
![Page 55: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/55.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Restore config state
Restore register state
Restore or reset DMA state
Re-attach/Enable device
Resume Checkpoint
17
![Page 56: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/56.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Restore config state
Restore register state
Restore or reset DMA state
Resume Checkpoint
17
![Page 57: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/57.jpg)
Device checkpoint/restore from PM code
17
Save config state
Save register state
Save DMA state
Restore config state
Restore register state
Restore or reset DMA state
RestoreCheckpoint
17
![Page 58: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/58.jpg)
Device checkpoint/restore from PM code
17
Checkpoint:
Save config state
Save register state
Save DMA state
Restore:
Restore config state
Restore register state
Restore or reset DMA state
Suspend/resume code provides device checkpoint functionality
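The save and restore steps listed on this slide can be illustrated with a toy model. This is a minimal sketch, not the FGFT implementation or the Linux power-management API: `ToyDevice`, its three fields, and the `checkpoint`/`restore` functions are hypothetical names chosen for illustration, and the choice to reset (rather than restore) DMA state by default is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDevice:
    """Hypothetical stand-in for a device with three kinds of state."""
    config: dict = field(default_factory=dict)     # e.g. bus configuration
    registers: dict = field(default_factory=dict)  # device registers
    dma_ring: list = field(default_factory=list)   # in-flight DMA descriptors


def checkpoint(dev: ToyDevice) -> dict:
    """Save config, register and DMA state (the suspend-side steps)."""
    return {
        "config": copy.deepcopy(dev.config),
        "registers": copy.deepcopy(dev.registers),
        "dma_ring": copy.deepcopy(dev.dma_ring),
    }


def restore(dev: ToyDevice, saved: dict, reset_dma: bool = True) -> None:
    """Restore config and register state; reset or restore DMA state."""
    dev.config = copy.deepcopy(saved["config"])
    dev.registers = copy.deepcopy(saved["registers"])
    dev.dma_ring = [] if reset_dma else copy.deepcopy(saved["dma_ring"])


dev = ToyDevice(config={"irq": 11}, registers={"ctrl": 0x1}, dma_ring=["desc0"])
saved = checkpoint(dev)

# The device stays in use after the checkpoint is taken.
dev.registers["ctrl"] = 0xDEAD   # simulated faulty driver write
dev.dma_ring.append("desc1")

restore(dev, saved)
print(dev.config, dev.registers, dev.dma_ring)
```

The sketch leaves out the re-attach and enable steps of a full resume, matching the shorter restore path listed on the slide.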
![Page 59: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/59.jpg)
Synergy of isolation and fast checkpoints
18
[Figure, built up over several slides: the kernel calls network driver entry points (xmit, get ringparam), passing a netdev structure. A get ringparam call goes through stubs to the SFI network driver, which works on its own copy of netdev; a checkpoint (C) is taken before the call runs, and on an error (err) the checkpoint is restored (R).]
Range Table

| Address | Access rights |
| --- | --- |
| 0xffffa000 | Read |
| 0xffffa008 | Write |
| 0xffffa00a | Read |

FGFT provides transactional execution of driver entry points
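A range table of this kind can be pictured as a lookup from address ranges to access rights that the isolated driver code consults before touching memory. The sketch below is an assumption-laden illustration: `RangeTable`, `Access` and `allows` are hypothetical names, and the range lengths are invented because the slide lists only start addresses and rights.

```python
from enum import Flag, auto


class Access(Flag):
    READ = auto()
    WRITE = auto()


class RangeTable:
    """Toy range table: maps address ranges to access rights."""

    def __init__(self):
        self._entries = []  # list of (start, length, rights)

    def add(self, start: int, length: int, rights: Access) -> None:
        self._entries.append((start, length, rights))

    def allows(self, addr: int, wanted: Access) -> bool:
        """True if addr lies in a range that grants the wanted access."""
        for start, length, rights in self._entries:
            if start <= addr < start + length and wanted in rights:
                return True
        return False


table = RangeTable()
# Addresses and rights follow the slide; the lengths are assumptions.
table.add(0xFFFFA000, 8, Access.READ)
table.add(0xFFFFA008, 2, Access.WRITE)
table.add(0xFFFFA00A, 4, Access.READ)

print(table.allows(0xFFFFA000, Access.READ))   # read inside a readable range
print(table.allows(0xFFFFA000, Access.WRITE))  # write to a read-only range
print(table.allows(0xFFFFB000, Access.READ))   # address outside every range
```

A linear scan keeps the example short; a real table would more likely be sorted or hashed so that each check stays cheap.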
![Page 72: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/72.jpg)
How does this give us transactional execution?
19
★ Atomicity: All-or-nothing execution
  ★ Driver state: Run code in SFI module
  ★ Device state: Explicitly checkpoint/restore state
★ Isolation: Serialization to hide incomplete transactions
  ★ Re-use existing device locks to lock driver
  ★ Two-phase locking
★ Consistency: Only valid (kernel, driver and device) states
  ★ Higher-level mechanisms to roll back external actions
  ★ At-most-once device action guarantee to applications
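The three properties above can be combined in a small model of a transactional driver entry point. This is a sketch under stated assumptions, not FGFT code: `run_as_transaction`, `device_lock` and the two toy entry points are hypothetical, a single lock held for the whole call stands in for two-phase locking, and dictionary copies stand in for the SFI module's copy of driver state and for the device checkpoint.

```python
import copy
import threading

device_lock = threading.Lock()   # stands in for the existing device lock


def run_as_transaction(entry_point, driver_state: dict, device_state: dict):
    """Run entry_point with all-or-nothing semantics.

    Returns (ok, result). On failure neither state dict is changed.
    """
    with device_lock:                                    # isolation: serialize callers
        device_checkpoint = copy.deepcopy(device_state)  # device checkpoint
        state_copy = copy.deepcopy(driver_state)         # driver works on a copy
        try:
            result = entry_point(state_copy, device_state)
        except Exception as exc:
            # Abort: restore the device and discard the driver-state copy.
            device_state.clear()
            device_state.update(device_checkpoint)
            return False, exc
        # Commit: copy the driver's changes back.
        driver_state.clear()
        driver_state.update(state_copy)
        return True, result


def get_ringparam(state, device):
    state["calls"] = state.get("calls", 0) + 1
    return device["rx_ring"]


def buggy_set_ringparam(state, device):
    device["rx_ring"] = -1                        # corrupts the device
    state["calls"] = "garbage"                    # corrupts driver state
    raise MemoryError("simulated driver fault")   # then faults


driver = {"calls": 0}
device = {"rx_ring": 256}

print(run_as_transaction(get_ringparam, driver, device))
ok, err = run_as_transaction(buggy_set_ringparam, driver, device)
print(ok, driver, device)
```

The failed call is reported to the caller as an error and is not retried, which is one way to read the at-most-once guarantee on the slide.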
![Page 76: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/76.jpg)
Outline
20
Introduction
Fine-grained isolation
Checkpoint-based recovery
Evaluation & Conclusions
![Page 77: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/77.jpg)
Evaluation platform
21
★ Platform:
  ★ Implemented in Linux 2.6.29
  ★ 2.5 GHz Intel Core 2 Quad w/ 4 GB DDR2 DRAM
  ★ Six drivers across three classes
★ Criteria:
  ★ Latency of recovery: How fast is it?
  ★ Correctness of recovery: How well does it work?
  ★ Incremental effort: How much work is it?
  ★ Performance: How much does it cost?

| Driver | Class | Bus |
| --- | --- | --- |
| 8139too | net | PCI |
| e1000 | net | PCI |
| r8169 | net | PCI |
| pegasus | net | USB |
| ens1371 | sound | PCI |
| psmouse | input | serio |

![Page 79: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/79.jpg)
Recovery speedup
22
FGFT provides significant speedup in driver recovery and improves system availability
[Bar chart: Recovery times, restart recovery vs. FGFT recovery, y-axis 0 ms to 2,000 ms. Bar labels:]

| Driver | Restart recovery (ms) | FGFT recovery (ms) |
| --- | --- | --- |
| 8139too | 310 | 0.07 |
| e1000 | 1800 | 295 |
| pegasus | 150 | 5 |
| r8169 | 120 | 0.04 |
| ens1371 | 1030 | 115 |
| psmouse | 680 | 410 |

![Page 83: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/83.jpg)
Static and dynamic fault injection
23

| Driver | Injected Faults | Native Crashes | FGFT Crashes |
| --- | --- | --- | --- |
| 8139too | 43 | 43 | NONE |
| e1000 | 47 | 47 | NONE |
| r8169 | 36 | 36 | NONE |
| pegasus | 34 | 33 | NONE |
| ens1371 | 22 | 21 | NONE |
| psmouse | 46 | 46 | NONE |
| TOTAL | 258 | 256 | NONE |

FGFT recovers from multiple failures: 1) restores non-class state and 2) does not affect other threads
![Page 86: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/86.jpg)
Programming effort
24

| Driver | LOC | Isolation annotations: Driver | Isolation annotations: Kernel | Recovery additions: LOC Moved | Recovery additions: LOC Added |
| --- | --- | --- | --- | --- | --- |
| 8139too | 1,904 | 15 | 20 | 26 | 4 |
| e1000 | 13,973 | 32 | | 32 | 10 |
| r8169 | 2,993 | 10 | | 17 | 5 |
| pegasus | 1,541 | 26 | 12 | 22 | 5 |
| ens1371 | 2,110 | 23 | 66 | 16 | 6 |
| psmouse | 2,448 | 11 | 19 | 19 | 6 |

FGFT requires a loadable kernel module (1200 LOC) and 38 lines of kernel changes to trap processor exceptions
![Page 88: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/88.jpg)
Throughput with isolation and recovery
25
[Bar chart: e1000 Network Card, throughput as a percentage of the baseline (844 Mbps), y-axis 0 to 100; netperf on Intel quad-core machines. Legend: Native, FGFT-I/O-all, FGFT-off-I/O, FGFT-I/O-1/2. Throughput bar labels: 100, 93, 100, 96. CPU: 2.4%, 2.4%, 2.9%, 3.4%.]
FGFT can isolate and recover high bandwidth devices at low overhead without adding kernel subsystems
![Page 95: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/95.jpg)
Summary
26
★ FGFT runs driver code as transactions
  ★ Provides fault tolerance at incremental performance cost and programmer effort
★ Introduced device checkpoints
  ★ Provides fast and complete recovery semantics
★ Fast device checkpoints should be explored in other domains like fast reboot, upgrade, etc.
![Page 98: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/98.jpg)
Extra slides
★ Unlike suspend, devices continue to be accessed after a checkpoint
  ★ Rely on drivers following ACPI specifications for correctness
28
![Page 99: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/99.jpg)
Latency for device checkpoint/restore
29

| Driver | Class | Bus | Checkpoint Times | Restore Times |
| --- | --- | --- | --- | --- |
| 8139too | net | PCI | 33μs | 62μs |
| e1000 | net | PCI | 32μs | 280ms |
| r8169 | net | PCI | 26μs | 30μs |
| pegasus | net | USB | 0μs | 4ms |
| ens1371 | sound | PCI | 33μs | 111ms |
| psmouse | input | serio | 0μs | 390ms |

Fast checkpoint/restore using suspend/resume
![Page 100: Fine-Grained Fault Tolerance using Device Checkpointspages.cs.wisc.edu/~kadav/app/fgft-slides.pdf · Fine-Grained Fault Tolerance using Device Checkpoints ... Cold boot device Verify](https://reader035.vdocuments.mx/reader035/viewer/2022062908/5aecb41a7f8b9a66258eea4e/html5/thumbnails/100.jpg)
Transforming drivers to run as FGFT
30
[Figure: a driver with user supplied annotations passes through a source transformation and is split into two modules, backed by run-time support.]
Static modifications:
Driver with annotations (user supplied annotations)
Source transformation (adds driver transactions)
Main driver module
SFI driver module (SFI = software fault isolated)
Run-time support:
Communication and recovery support (1200 LOC)
Object tracking
Marshaling/Demarshaling
Kernel undo log
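The slide names a kernel undo log without showing how it works. The sketch below shows the general undo-log technique, assuming the log records a compensating action for each kernel-side effect made during a driver transaction; `UndoLog`, `toy_alloc` and the `allocated` set are hypothetical names, not part of FGFT or the Linux kernel.

```python
class UndoLog:
    """Toy undo log: records a compensating action for each logged call."""

    def __init__(self):
        self._undo_actions = []

    def record(self, undo_fn, *args):
        self._undo_actions.append((undo_fn, args))

    def commit(self):
        """Transaction succeeded: drop the log, keep the effects."""
        self._undo_actions.clear()

    def rollback(self):
        """Transaction failed: run compensating actions, newest first."""
        while self._undo_actions:
            undo_fn, args = self._undo_actions.pop()
            undo_fn(*args)


allocated = set()   # stands in for kernel-side allocations


def toy_alloc(name, log):
    allocated.add(name)
    log.record(allocated.discard, name)   # undo = release the allocation


log = UndoLog()
toy_alloc("rx_buffer", log)
toy_alloc("tx_buffer", log)
log.rollback()      # simulated driver fault: both allocations are undone
print(allocated)
```

Undoing in reverse order matters when later actions depend on earlier ones, for example a buffer registered with a structure that was allocated first.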