Microreboot: A Cheap Technique for Recovery
George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando Fox
Presented By Riyad
Motivation
● Production software has many transient bugs
● Rebooting can “cure” failures caused by transient bugs
● Rebooting is expensive and causes nontrivial service disruption and downtime
● Microreboot (µRB)!!!
Microreboot (µRB)
● Reboot individual fine-grained components
● Similar to an application reboot, but with
○ Orders of magnitude faster recovery
○ Fewer failed requests during recovery
○ Less work lost due to recovery
● Rejuvenate the system without shutting it down
● System needs to be designed to be microrebootable from the ground up
µRB Goals
● Reduce system recovery time
● Minimize failure’s disruption to system and users
● Preserve in-memory data
Crash-Only System Design
● Don’t attempt complex recovery procedures
● Upon detecting a failure, crash gracefully
● Keep state in stable storage
● Ensure consistency of state and data before crashing
● Recover from failure by rebooting the application
µRB System Design
● Fine-grained components
○ Component-level µRB and fast initialization
○ Huge components lower the benefit of µRB
● State segregation
○ Prevents reading inconsistent state during recovery
○ Separates data recovery from application recovery
● Decoupling
○ Lowers disruption across the system during recovery
µRB System Design
● Retryable requests
○ Minimize the number of failures during recovery
● Leases
○ Improve the reliability of cleaning up after μRBs, which may otherwise leak resources
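To make the lease idea concrete, here is a minimal Python sketch of lease-based cleanup. It is an illustration only, not the mechanism used in the modified JBoss: the LeaseTable class, its method names, and the release callback are all hypothetical. The point is that a resource whose holder was microrebooted stops being renewed and is reclaimed once its lease expires.

```python
import time


class LeaseTable:
    """Hypothetical helper: tracks resources handed out under time-limited leases."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable clock, so behaviour can be tested
        self._leases = {}            # resource id -> (expiry time, release callback)

    def acquire(self, resource_id, ttl_seconds, release):
        """Record a lease; `release` is called with the id if the lease expires."""
        self._leases[resource_id] = (self._clock() + ttl_seconds, release)

    def renew(self, resource_id, ttl_seconds):
        """A live holder extends its lease; a microrebooted holder simply stops calling this."""
        _, release = self._leases[resource_id]
        self._leases[resource_id] = (self._clock() + ttl_seconds, release)

    def reap_expired(self):
        """Release every resource whose lease has run out and return their ids."""
        now = self._clock()
        expired = [rid for rid, (expiry, _) in self._leases.items() if expiry <= now]
        for rid in expired:
            _, release = self._leases.pop(rid)
            release(rid)
        return expired
```

A periodic call to reap_expired() then bounds how long a leaked resource can outlive the component that held it, without the recovery code having to know what each component had allocated.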
Research Questions
● Are μRBs effective in recovering from failures?
● Are μRBs any better than JVM restarts?
● Are μRBs useful in clusters?
● Do μRB-friendly architectures incur a performance overhead?
Experiment
● J2EE application
● JBoss server (modified to support µRB)
● eBid, a crash-only application based on RUBiS
● MySQL for persistent state
● FastS/SSM for session state
Injected Faults
● Deadlocks
● Infinite loops
● Memory leaks
● Transient Java exceptions
● Corrupted data structures
● Out-of-memory errors
● Low-level faults underneath the JVM layer
Failure Detection
● A response counts as failed on a network-level error, an HTTP 4xx or 5xx error, or keywords indicative of failure (e.g., “exception,” “failed,” “error”)
● Each request is also submitted in parallel to the fault-injected application and to a known-good application; a discrepancy between the two results counts as a failure
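The two detection rules above can be sketched in a few lines of Python. This is a simplified illustration, not the detector from the experiments: the function names are hypothetical, a response is modelled as a (status, body) pair with status None standing for a network-level error, and the comparison against the good instance is a plain equality check.

```python
FAILURE_KEYWORDS = ("exception", "failed", "error")  # keywords listed on the slide


def looks_failed(status, body):
    """Rule 1: network-level error, HTTP 4xx/5xx, or a failure keyword in the body."""
    if status is None:                      # assumption: None models a network-level error
        return True
    if 400 <= status <= 599:
        return True
    text = body.lower()
    return any(keyword in text for keyword in FAILURE_KEYWORDS)


def is_failure(faulty, reference):
    """Rule 2 on top of rule 1: any discrepancy with the known-good instance is a failure.

    `faulty` and `reference` are (status, body) pairs for the same request.
    """
    if looks_failed(*faulty):
        return True
    return faulty != reference
```

The comparison step matters because a fault can produce a well-formed page with wrong content, which the first rule alone would not catch.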
Recovery Group
● EJBs might maintain references to other EJBs
○ Such EJBs cannot be microrebooted individually
● Whenever an EJB is to be microrebooted, microreboot the transitive closure of its inter-EJB dependents as a group
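Computing a recovery group amounts to a graph traversal. The Python sketch below assumes the dependency information is available as a mapping from each component to the components it holds references to; the function name, the input format, and the bean names in the usage note are hypothetical. It follows one reading of the slide: a dependent is a component that holds a reference to the one being rebooted.

```python
from collections import deque


def recovery_group(target, references):
    """Return `target` plus every component that transitively depends on it.

    `references` maps a component name to the set of components it references.
    """
    # Invert the edges: for each component, who holds a reference to it?
    dependents = {}
    for holder, targets in references.items():
        for referenced in targets:
            dependents.setdefault(referenced, set()).add(holder)

    # Breadth-first walk over the inverted edges gives the transitive closure.
    group = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for holder in dependents.get(current, ()):
            if holder not in group:
                group.add(holder)
                queue.append(holder)
    return group
```

For example, with references = {"BeanA": {"BeanB"}, "BeanB": {"BeanC"}}, rebooting BeanC pulls in BeanB and BeanA, while rebooting BeanA affects only BeanA.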
Recovery Manager
● Reboots at different granularities:
○ EJBs (recovery group)
○ The WAR
○ All of eBid
○ The JVM that runs JBoss
○ The operating system
● Reboots the components related to the failed URL
● Tries the cheapest recovery first
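The "cheapest first" policy can be sketched as a simple escalation loop in Python. This is an illustrative sketch, not the recovery manager's actual code: the recover function and its reboot and still_failing callbacks are hypothetical, and the level names only mirror the list on the slide.

```python
# Recovery levels from the slide, ordered from cheapest to most expensive.
LEVELS = [
    "EJB recovery group",
    "WAR",
    "eBid application",
    "JVM/JBoss",
    "operating system",
]


def recover(reboot, still_failing, levels=LEVELS):
    """Try each level in order; stop at the first one that clears the failure.

    `reboot(level)` performs the reboot at that level (hypothetical callback).
    `still_failing()` re-checks whether the failure persists (hypothetical callback).
    Returns the level that worked, or None if every level was tried in vain.
    """
    for level in levels:
        reboot(level)
        if not still_failing():
            return level
    return None
```

The cost of this policy is that a failure curable only at a coarse level pays for every cheaper attempt first.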
μRB Failure Recovery
Failed Requests
Failure + Recovery
Client-perceived Availability
μRB in Cluster
Client-perceived Availability
● When response latency exceeds 8 seconds, users get distracted
Performance Impact
Limitations
● µRB can leave the system inconsistent if updates aren’t atomic
● µRB can leak resources if they aren’t allocated through the application server (e.g., through the Java Native Interface)
● Can delay a full reboot when that is the only way to recover
Limitations
● Recovers only from transient bugs
● Considerable design effort needed
● Not suitable for
○ Existing monolithic applications
○ Languages (C/C++) that don’t have a Java EE-like framework
● Experiments covered only one recovery group closure with 5 EJBs
Microreboot
● Cheap alternative to full system recovery
● Restart components “with a clean state”
● Reduces recovery time, failed requests, functional disruptions
● Only suitable for applications with fine-grained components
Discussion
● Do μRBs lead to overengineering (e.g., AbstractAbstractObjectFactory)?
● Are the modifications needed for μRB worthwhile for existing monolithic applications?
● Is a recovery technique possible for CPU-bound applications?