OPERATING SYSTEMS MEET FAULT TOLERANCE
Microkernel-Based Operating Systems
Björn Döbel
Dresden, 29.01.2013
“If there’s more than one possible outcome of a job or task, and one of those outcomes will result in disaster or an undesirable consequence, then somebody will do it that way.” (Edward Murphy Jr.)
Dresden, 29.01.2013 OS Resilience slide 2 of 44
Outline
• Murphy and the OS: Is it really that bad?
• Fault-Tolerant Operating Systems
  – Minix3
  – CuriOS
  – L4ReAnimator
• Creative OS Debugging
  – Detecting race conditions with DataCollider
Why Things go Wrong
• Programming in C:
  This pointer is certainly never going to be NULL!
• Layering vs. responsibility:
  Of course, someone in the higher layers will already have checked this return value.
• Concurrency:
  This struct is shared between an IRQ handler and a kernel thread. But they will never execute in parallel.
• Hardware interaction:
  But the device spec said this was not allowed to happen!
• Hypocrisy:
  I’m a cool OS hacker. I won’t make mistakes, so I don’t need to test my code!
A Classic Study
• A. Chou et al.: An empirical study of operating system errors, SOSP 2001
• Automated software error detection (today: http://www.coverity.com)
• Target: Linux (1.0 - 2.4)
  – Where are the errors?
  – How are they distributed?
  – How long do they survive?
  – Do bugs cluster in certain locations?
Revalidation of Chou’s Results
• N. Palix et al.: Faults in Linux: Ten years later, ASPLOS 2011
• 10 years of work on tools to decrease error counts - has it worked?
• Repeated Chou’s analysis up to Linux 2.6.34
Linux: Lines of Code
Fault Rate per Subdirectory (2001)
Fault Rate per Subdirectory (2011)
Bug Lifetimes (2011)
Bug Distribution
Break
• Faults are an issue.
• Hardware-related stuff is worst.
• Now what can the OS do about it?
Minix3 – A Fault-tolerant OS
[Figure: Minix3 architecture, layers from top to bottom]
User Processes: Shell, Make, User, ..., Other
Server Processes: File, PM, Reinc, ..., Other
Device Processes: Disk, TTY, Net, Printer, Other
Kernel: Kernel, Clock Task, System Task
Minix3: Fault Tolerance
• Address Space Isolation
  – Applications only access private memory
  – Faults do not spread to other components
• User-level OS services
  – Principle of Least Privilege
  – Fine-grain control over resource access
    • e.g., DMA only for specific drivers
• Small components
  – Easy to replace (micro-reboot)
Minix3: Fault Detection
• Fault model: transient errors caused by software bugs
• Fix: Component restart
• Reincarnation server monitors components
  – Program termination (crash)
  – CPU exception (div by 0)
  – Heartbeat messages
• Users may also indicate that something is wrong
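The monitoring loop described on this slide can be sketched as follows. This is a toy model under stated assumptions, not Minix3 code: the names `Component`, `ReincarnationServer`, `heartbeat_timeout`, and `check` are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A monitored driver or server (hypothetical model, not Minix3 code)."""
    name: str
    last_heartbeat: float = 0.0
    crashed: bool = False      # would be set on program termination or a CPU exception
    restarts: int = 0


class ReincarnationServer:
    """Restarts components that crashed or missed their heartbeat deadline."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout
        self.components = {}

    def register(self, name, now):
        comp = Component(name, last_heartbeat=now)
        self.components[name] = comp
        return comp

    def heartbeat(self, name, now):
        self.components[name].last_heartbeat = now

    def check(self, now):
        """Return the names of all components restarted in this round."""
        restarted = []
        for comp in self.components.values():
            silent = now - comp.last_heartbeat > self.heartbeat_timeout
            if comp.crashed or silent:
                self._restart(comp, now)
                restarted.append(comp.name)
        return restarted

    def _restart(self, comp, now):
        # A fresh instance starts without the old in-memory state.
        comp.crashed = False
        comp.last_heartbeat = now
        comp.restarts += 1


rs = ReincarnationServer(heartbeat_timeout=2.0)
rs.register("disk", now=0.0)
rs.register("net", now=0.0)
rs.heartbeat("disk", now=1.5)      # the disk driver reports in
restarted = rs.check(now=3.0)      # the net driver has been silent since 0.0
```

Only the silent component is restarted here; a crash or CPU exception would be modeled by setting `crashed` before the next `check` round.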
Repair
• Restarting a component is insufficient:
  – Applications may depend on restarted component
  – After restart, component state is lost
• Minix3: explicit mechanisms
  – Reincarnation server signals applications about restart
  – Applications store state at data store server
  – In any case: program interaction needed
    • Restarted app: store/recover state
    • User apps: recover server connection
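A minimal sketch of the two kinds of program interaction listed above, assuming a hypothetical in-process `DataStore` and a hypothetical `on_restart` notification (neither is the real Minix3 interface): the restarted server recovers its state, and the client re-establishes its connection.

```python
class DataStore:
    """Keeps component state outside the component so it survives a restart."""

    def __init__(self):
        self._saved = {}

    def store(self, owner, state):
        # Store a copy so later changes by the owner do not leak in unsaved.
        self._saved[owner] = dict(state)

    def recover(self, owner):
        return dict(self._saved.get(owner, {}))


class CounterServer:
    """Hypothetical server that checkpoints its state after every request."""

    def __init__(self, name, data_store):
        self.name = name
        self.data_store = data_store
        # Restarted app: recover state if a previous instance stored any.
        self.state = data_store.recover(name) or {"requests": 0}

    def handle_request(self):
        self.state["requests"] += 1
        self.data_store.store(self.name, self.state)
        return self.state["requests"]


class Client:
    def __init__(self, server):
        self.server = server

    def on_restart(self, new_server):
        # User app: recover the server connection after being signalled.
        self.server = new_server


ds = DataStore()
client = Client(CounterServer("counter", ds))
client.server.handle_request()
client.server.handle_request()

# Simulated crash and restart: a new instance replaces the old one.
client.on_restart(CounterServer("counter", ds))
after_restart = client.server.handle_request()
```

The request counter continues from the stored value instead of starting at zero, but only because both the server and the client contain explicit recovery code.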
Break
• Minix3 fault tolerance:
  – Architectural Isolation
  – Explicit monitoring and notifications
• Other approaches:
  – CuriOS: smart session state handling
  – L4ReAnimator: semi-transparent restart in a capability-based system
CuriOS: Servers and Sessions
• State recovery is tricky
  – Minix3: Data Store for application data
  – But: applications interact
• Servers store session-specific state
• Server restart requires potential rollback for every participant
[Figure: a Server box holding Server State, Client A State, and Client B State; Client A and Client B shown as separate boxes]
CuriOS: Server State Regions
• CuriOS kernel (CuiK) manages dedicated session memory: Server State Regions
• SSRs are managed by the kernel and attached to a client-server connection
[Figure: a Server box holding only Server State; Client State A shown next to Client A, Client State B shown next to Client B]
CuriOS: Protecting Sessions
• SSR gets mapped only when a client actually invokes the server
• Solves another problem: failure while handling A’s request will never corrupt B’s session state
[Figure, animation steps: (1) Client A issues call(); (2) Client A State appears inside the Server next to Server State; (3) the Server issues reply() and holds only Server State again; (4)-(6) the same sequence for Client B with Client B State]
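The mapping discipline can be sketched as below. This is a toy model, assuming hypothetical names (`SessionKernel`, `ServerStateRegion`, `Server.mapped`); real CuriOS enforces the mapping through memory protection, which plain Python objects cannot do.

```python
class ServerStateRegion:
    """Per-connection session memory, owned by the kernel (toy model)."""

    def __init__(self, client):
        self.client = client
        self.data = {}


class Server:
    def __init__(self):
        self.mapped = None   # the only SSR the server can reach right now

    def handle(self, request):
        data = self.mapped.data
        if request == "crash":
            data["corrupted"] = True          # damage stays inside the mapped SSR
            raise RuntimeError("server fault")
        data["count"] = data.get("count", 0) + 1
        return data["count"]


class SessionKernel:
    """Maps a client's SSR into the server only for the duration of a call."""

    def __init__(self, server):
        self.server = server
        self.ssrs = {}

    def connect(self, client):
        self.ssrs[client] = ServerStateRegion(client)

    def call(self, client, request):
        self.server.mapped = self.ssrs[client]     # map on call()
        try:
            return self.server.handle(request)
        finally:
            self.server.mapped = None              # unmap on reply() or failure


server = Server()
kernel = SessionKernel(server)
kernel.connect("A")
kernel.connect("B")
kernel.call("B", "inc")
try:
    kernel.call("A", "crash")
except RuntimeError:
    pass
a_state = kernel.ssrs["A"].data
b_state = kernel.ssrs["B"].data
```

The failure while serving A leaves B's region untouched, because B's SSR was never reachable during that request.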
CuriOS: Transparent Restart
• CuriOS is a Single-Address-Space OS:
  – Every application runs on the same page table (with modified access rights)
[Figure: one virtual memory layout (OS, A, A, B, B) drawn four times, labeled "Virt. Memory", "A Running", "B Running", and "Shared Mem"]
Transparent Restart
• Single Address Space
  – Each object has unique address
  – Identical in all programs
  – Server := C++ object
• Restart
  – Replace old C++ object with new one
  – Reuse previous memory location
  – References in other applications remain valid
  – OS blocks access during restart
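The restart steps above can be illustrated with a small model. It is a sketch under assumptions: `AddressSpace`, `EchoServer`, and `SERVER_ADDRESS` are hypothetical names, an integer key stands in for a memory address, and a lock stands in for the OS blocking access.

```python
import threading


class AddressSpace:
    """Maps fixed addresses to server objects; the mapping is the same for all programs."""

    def __init__(self):
        self._objects = {}
        self._locks = {}

    def place(self, address, obj):
        self._objects[address] = obj
        self._locks[address] = threading.Lock()

    def invoke(self, address, method, *args):
        # Callers wait here while a restart holds the lock for this address.
        with self._locks[address]:
            return getattr(self._objects[address], method)(*args)

    def restart(self, address, factory):
        with self._locks[address]:
            # Replace the old object, reusing the previous location.
            self._objects[address] = factory()


class EchoServer:
    created = 0

    def __init__(self):
        EchoServer.created += 1
        self.generation = EchoServer.created

    def echo(self, text):
        return f"gen{self.generation}:{text}"


space = AddressSpace()
SERVER_ADDRESS = 0x1000
space.place(SERVER_ADDRESS, EchoServer())

client_ref = SERVER_ADDRESS                      # the client only keeps the address
before = space.invoke(client_ref, "echo", "hi")
space.restart(SERVER_ADDRESS, EchoServer)
after = space.invoke(client_ref, "echo", "hi")   # same reference, new server object
```

The client never updates `client_ref`, which is the sense in which the restart is transparent in this model.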
L4ReAnimator: Restart on L4Re
• L4Re Applications
  – Loader component: ned
  – Detects application termination: parent signal
  – Restart: re-execute Lua init script (or parts of it)
  – Problem after restart: capabilities
    • No single component knows everyone owning a capability to an object
    • Minix3 signals won’t work
L4Re: Session Creation
[Figure: Client, Server, and Loader boxes; labels appear step by step]
Session Creation Capability
(1) create
(2) Mapped during startup
(3) factory.create()
Session Capability
(4) create
(5) use
L4Re: Server Crash
[Figure: Client, Server, and Loader boxes; the Client still holds the Session Capability]
(6) exit: Kernel destroys memory, server objects (channels...)
(7) restart
L4Re: Restarted Server
[Figure: Client, restarted Server, and Loader boxes; the Client still holds the old Session Capability]
use: Error!
L4ReAnimator
• Only the application itself can detect that a capability vanished
• Kernel raises Capability fault
• Application needs to re-obtain the capability: execute capability fault handler
• Capfault handler: application-specific
  – Create new communication channel
  – Restore session state
• Programming model:
  – Capfault handler provided by server implementor
  – Handling transparent for application developer
  – Semi-transparency
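The programming model above can be sketched as follows. Everything here is a hypothetical stand-in rather than the L4Re API: `CapKernel` models capability lookup, `CapabilityFault` models the kernel-raised fault, and `SessionStub` plays the role of the client-side library that the server implementor would ship.

```python
class CapabilityFault(Exception):
    """Raised when a capability no longer refers to a live object (toy model)."""


class CapKernel:
    def __init__(self):
        self._objects = {}
        self._next_cap = 1

    def create(self, obj):
        cap = self._next_cap
        self._next_cap += 1
        self._objects[cap] = obj
        return cap

    def destroy(self, cap):
        del self._objects[cap]

    def invoke(self, cap, message):
        if cap not in self._objects:
            raise CapabilityFault(cap)
        return self._objects[cap](message)


class SessionStub:
    """Client-side stub with a capfault handler supplied by the server implementor."""

    def __init__(self, kernel, open_session):
        self.kernel = kernel
        self.open_session = open_session
        self.cap = open_session()
        self.faults_handled = 0

    def call(self, message):
        try:
            return self.kernel.invoke(self.cap, message)
        except CapabilityFault:
            self._capfault_handler()
            return self.kernel.invoke(self.cap, message)   # retry once

    def _capfault_handler(self):
        # Create a new communication channel; restoring session state
        # would also happen here and is left out of this sketch.
        self.cap = self.open_session()
        self.faults_handled += 1


kernel = CapKernel()

def open_session():
    # Hypothetical session factory of a server that upper-cases messages.
    return kernel.create(lambda message: message.upper())

stub = SessionStub(kernel, open_session)
first = stub.call("ping")
kernel.destroy(stub.cap)        # server crash: the kernel destroys its objects
second = stub.call("ping")      # detected lazily, on the next use
```

Application code only ever calls `stub.call`, so the recovery stays hidden from the application developer while the stub author still has to write it.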
L4ReAnimator: Cleanup
• Some channels have resources attached (e.g., frame buffer for graphical console)
• Resource may come from a different source (e.g., frame buffer from memory manager)
• Resources remain intact (stale) upon crash
• Client ends up using old version of the resource
• Requires additional app-specific knowledge
• Unmap handler
Summary
• L4ReAnimator
  – Capfault: Clients detect server restarts lazily
  – Capfault Handler: application-specific knowledge on how to regain access to the server
  – Unmap handler: clean up old resources after restart
• All these frameworks only deal with software errors.
• What about hardware faults?
Hardware Errors
• Hardware fails all the time
• Permanent faults
  – Studies: 2% chance that a DRAM DIMM fails within a year
  – Hardware ECC cannot catch them all!
  – 5% chance of a disk failure within a year
• Decreasing transistor sizes → increase in rate of transient (soft) errors
  – No longer only a space/aviation problem
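To get a feeling for what the per-device rates above mean for a whole machine, the following sketch computes the chance that at least one of several devices fails within a year. It assumes independent failures, and the device counts (8 DIMMs, 4 disks) are made-up illustration values; neither assumption comes from the slides.

```python
def p_at_least_one_failure(p_single, n_devices):
    """Probability that at least one of n independent devices fails."""
    return 1.0 - (1.0 - p_single) ** n_devices


# Per-device annual rates from the slide: 2% per DRAM DIMM, 5% per disk.
p_dimm_machine = p_at_least_one_failure(0.02, 8)   # hypothetical: 8 DIMMs
p_disk_machine = p_at_least_one_failure(0.05, 4)   # hypothetical: 4 disks

print(f"at least one DIMM fails: {p_dimm_machine:.1%}")
print(f"at least one disk fails: {p_disk_machine:.1%}")
```

Under these assumptions the per-machine risk is several times the per-device rate, and it keeps growing with the number of devices.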
Fault Tolerance: State of the Union
[Figure: 2x2 matrix; columns: non-COTS, COTS; rows: Hardware errors, Software errors]
Entries, in the order in which they appear on the slide:
• RAD-hard CPUs, Redundant Multithreading
• HP NonStop, IBM z/OS
• SeL4, Minix3, Carburizer
• SWIFT, Encoded Processing
• Romain
Process-Level Redundancy [Shye 2007]
Binary recompilation
• Complex, unprotected compiler
• Architecture-dependent
Reuse OS mechanisms
System calls for replica synchronization
Additional synchronization events
Virtual memory fault isolation
• Restricted to Linux user-level programs
Microkernel-based
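The idea of synchronizing replicas at a defined point and comparing their outputs can be sketched as below. This is a generic majority-voting illustration, not the mechanism of PLR or Romain: `replicated_call`, `healthy`, and `bit_flipped` are hypothetical names, and the replicas are plain functions rather than processes.

```python
from collections import Counter


def replicated_call(replicas, *args):
    """Run every replica, compare outputs at the sync point, and vote."""
    outputs = [replica(*args) for replica in replicas]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replica outputs")
    # Replicas that disagree with the majority are reported as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    return winner, faulty


def healthy(x):
    return x * 2


def bit_flipped(x):
    # Simulated soft error: one bit of the result is flipped.
    return (x * 2) ^ 0b100


result, faulty = replicated_call([healthy, bit_flipped, healthy], 21)
```

With three replicas a single wrong output is outvoted and identified; with only two replicas a mismatch could be detected but not corrected.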
Transparent Replication as OS Service
Application
L4 RuntimeEnvironment
L4/Fiasco.OC microkernel
Dresden, 29.01.2013 OS Resilience slide 33 of 44
![Page 59: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/59.jpg)
Transparent Replication as OS Service
ReplicatedApplication
L4 RuntimeEnvironment Romain
L4/Fiasco.OC microkernel
Dresden, 29.01.2013 OS Resilience slide 33 of 44
![Page 60: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/60.jpg)
Transparent Replication as OS Service
UnreplicatedApplication
ReplicatedApplication
L4 RuntimeEnvironment Romain
L4/Fiasco.OC microkernel
Dresden, 29.01.2013 OS Resilience slide 33 of 44
![Page 61: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/61.jpg)
Transparent Replication as OS Service
ReplicatedDriver
UnreplicatedApplication
ReplicatedApplication
L4 RuntimeEnvironment Romain
L4/Fiasco.OC microkernel
Dresden, 29.01.2013 OS Resilience slide 33 of 44
![Page 62: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/62.jpg)
Transparent Replication as OS Service
[Figure: system stack. Top: Replicated Driver, Unreplicated Application, Replicated Application. Middle: L4 Runtime Environment and Romain. Bottom: L4/Fiasco.OC microkernel. Part of the stack is labelled Reliable Computing Base.]
![Page 66: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/66.jpg)
Romain: Structure
[Figure: three replicas attached to one master. Components shown in the master: System Call Proxy, Resource Manager, and a comparison of replica states (=).]
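The labels on this slide suggest a control flow that can be sketched as a toy model: replicas stop at a system call, the master compares their externally visible state (the = in the figure), and the system call proxy performs the call once on behalf of all replicas. This is not Romain's code. The names `ReplicaState`, `majority_state`, and `proxy_syscall` are hypothetical, and the majority-voting policy is an assumption; the slides only show that a comparison takes place.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ReplicaState:
    """Externally visible state of one replica at a system call (toy model)."""
    syscall: str
    args: Tuple[int, ...]


def majority_state(states: List[ReplicaState]) -> ReplicaState:
    """Return the state shared by a strict majority of replicas.

    Raises RuntimeError if no strict majority exists (assumed policy).
    """
    state, votes = Counter(states).most_common(1)[0]
    if votes * 2 <= len(states):
        raise RuntimeError("no majority among replicas")
    return state


def proxy_syscall(
    states: List[ReplicaState],
    execute: Callable[[ReplicaState], int],
) -> Tuple[List[int], List[int]]:
    """Compare replica states, run the call once, give every replica the result.

    Returns (per-replica results, indices of replicas that disagreed).
    """
    winner = majority_state(states)
    faulty = [i for i, s in enumerate(states) if s != winner]
    result = execute(winner)  # the call itself is performed exactly once
    return [result] * len(states), faulty
```

With three replicas, one deviating replica is outvoted and reported in the second return value; with only two replicas a mismatch can be detected but not resolved, so `majority_state` raises.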
![Page 69: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/69.jpg)
Resource Management: Capabilities
[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master.]
![Page 70: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/70.jpg)
Partitioned Capability Tables
[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master. Legend: marked used; master private.]
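One plausible reading of the legend, sketched below as an assumption rather than as Romain's implementation: slots the master keeps for its own use are marked used in every replica's allocator, so replicas never hand out those indices and all replicas allocate the same slot numbers in the same order. `CapAllocator` and `make_replica_allocators` are hypothetical names, and first-fit allocation is chosen only for illustration.

```python
from typing import Iterable, List, Set


class CapAllocator:
    """Toy first-fit allocator over capability slots 1..size."""

    def __init__(self, size: int, reserved: Iterable[int] = ()) -> None:
        self.size = size
        # Reserved (master-private) slots start out "marked used".
        self.used: Set[int] = set(reserved)
        if any(not 1 <= slot <= size for slot in self.used):
            raise ValueError("reserved slot outside the table")

    def alloc(self) -> int:
        """Return the lowest free slot and mark it used."""
        for slot in range(1, self.size + 1):
            if slot not in self.used:
                self.used.add(slot)
                return slot
        raise MemoryError("capability table full")


def make_replica_allocators(
    n_replicas: int, size: int, master_private: Iterable[int]
) -> List[CapAllocator]:
    """Give each replica its own allocator with the same reserved slots."""
    private = set(master_private)
    return [CapAllocator(size, private) for _ in range(n_replicas)]
```

Because every replica starts from the same reserved set and uses the same deterministic policy, a capability index means the same thing in each replica, which keeps index values comparable across replicas.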
![Page 73: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/73.jpg)
Replica Memory Management
[Figure: Replica 1 and Replica 2, each with three memory regions mapped rw, ro, ro, and the Master.]
![Page 74: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/74.jpg)
Shared Memory
• Not under complete control of the master
• Standard technique: trap & emulate
– Execution overhead (100x to 1000x)
– Adds complexity to the RCB: disassembler 6,000 LoC, tiny emulator 500 LoC
• Our implementation: copy & execute
![Page 82: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/82.jpg)
Copy & Execute
[Figure: Master and Replica side by side. Replica: mov eax, [ebx] traps on a shared-memory access. Master: load replica state; NOP; NOP; ...; NOP, with the replica's mov eax, [ebx] copied in; restore master state.]
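The sequence in the figure (load replica state, run the copied instruction, restore master state) can be modelled as follows. This is a toy model under stated assumptions: the real mechanism runs the copied instruction bytes natively on the CPU, which is how it avoids a disassembler and emulator, whereas here a Python callable stands in for the machine instruction. `copy_and_execute` and `mov_eax_from_ebx` are hypothetical names, and handing the resulting registers back to the replica is an assumed final step not shown on the slide.

```python
from typing import Callable, Dict, List

Registers = Dict[str, int]
# A callable stands in for a copied machine instruction: it receives the
# register file and shared memory and mutates the registers.
Instruction = Callable[[Registers, List[int]], None]


def mov_eax_from_ebx(regs: Registers, shared_mem: List[int]) -> None:
    """Toy stand-in for `mov eax, [ebx]`: load the word at address ebx."""
    regs["eax"] = shared_mem[regs["ebx"]]


def copy_and_execute(
    instr: Instruction,
    replica_regs: Registers,
    master_regs: Registers,
    shared_mem: List[int],
) -> Registers:
    """Run one trapping replica instruction inside the master (toy model).

    Returns the register state to hand back to the replica (assumed step).
    """
    saved_master = dict(master_regs)   # remember the master state
    master_regs.clear()
    master_regs.update(replica_regs)   # load replica state
    instr(master_regs, shared_mem)     # execute the copied instruction
    result = dict(master_regs)         # registers after the instruction
    master_regs.clear()
    master_regs.update(saved_master)   # restore master state
    return result
```

The design point the model preserves is that the master never interprets the instruction: it only swaps register state around a single execution.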
![Page 83: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/83.jpg)
Overhead vs. Unreplicated Execution
![Page 84: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/84.jpg)
Romain Lines of Code
Base code (main, logging, locking): 325
Application loader: 375
Replica manager: 628
Redundancy: 153
Memory manager: 445
System call proxy: 311
Shared memory: 281
Total: 2,518

Fault injector: 668
GDB server stub: 1,304
![Page 85: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/85.jpg)
Hardening the RCB
• We need: dedicated mechanisms to protect the RCB (HW or SW)
• We have: full control over software
• Use FT-encoding compiler?
– Has not been done for kernel code yet
• RAD-hardened hardware?
– Too expensive
Why not split cores into resilient and non-resilient ones?
[Figure: one ResCore next to ten NonResCores.]
![Page 86: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/86.jpg)
Summary
• OS-level techniques to tolerate SW and HW faults
• Address-space isolation
• Microreboots
• Various ways of handling session state
• Replication against hardware errors
![Page 87: Operating Systems Meet Fault Tolerance - Microkernel-Based ...os.inf.tu-dresden.de/Studium/KMB/WS2012/15-Resilience.pdf · OPERATING SYSTEMS MEET FAULT TOLERANCE Microkernel-Based](https://reader036.vdocuments.mx/reader036/viewer/2022081406/5f11e46667d5915796592a26/html5/thumbnails/87.jpg)
Further Reading
• Minix3: Jorrit Herder, Ben Gras, Philip Homburg, Andrew S. Tanenbaum: Fault Isolation for Device Drivers, DSN 2009
• CuriOS: Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, Roy H. Campbell: CuriOS: Improving Reliability through Operating System Structure, OSDI 2008
• L4ReAnimator: Dirk Vogt, Björn Döbel, Adam Lackorzynski: Stay strong, stay safe: Enhancing Reliability of a Secure Operating System, IIDS 2010
• PLR: Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Ramesh Peri: Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, DSN 2007
• Romain:
– Björn Döbel, Hermann Härtig, Michael Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
– Björn Döbel, Hermann Härtig: Who watches the watchmen? - Protecting Operating System Reliability Mechanisms, HotDep 2012