
Microreboot: A Technique for Cheap Recovery

George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando Fox

Presented By Riyad

Motivation
● Production software has many transient bugs
● Rebooting can “cure” failures caused by transient bugs
● Rebooting is expensive and causes nontrivial service disruption and downtime
● Microreboot (µRB)!!!

Microreboot (µRB)
● Reboot an individual fine-grained component
● Similar to an application reboot, but with
○ Order-of-magnitude faster recovery
○ Fewer failed requests during recovery
○ Less work lost due to recovery
● Rejuvenates the system without shutting it down
● The system needs to be designed to be microrebootable from the ground up

µRB Goals
● Reduce system recovery time
● Minimize a failure’s disruption to the system and its users
● Preserve in-memory data

Crash-Only System Design
● Don’t attempt a complex recovery process
● Upon detecting a failure, crash gracefully
● Keep state in stable storage
● Ensure consistency of state and data before crashing
● Recover from failure by rebooting the application
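The crash-only idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the paper: `StableStore` and `Counter` are hypothetical names, and an in-memory dict stands in for real stable storage.

```python
class StableStore:
    """Stands in for stable storage such as a database (hypothetical)."""

    def __init__(self):
        self._data = {}

    def write(self, key, value):
        # A single assignment models one atomic, durable update.
        self._data[key] = value

    def read(self, key, default=None):
        return self._data.get(key, default)


class Counter:
    """Crash-only component: it keeps no state that must survive a crash."""

    def __init__(self, store):
        self.store = store  # all important state lives outside the component

    def increment(self):
        value = self.store.read("count", 0) + 1
        self.store.write("count", value)
        return value


store = StableStore()
component = Counter(store)
component.increment()
component.increment()

# "Crash" the component by discarding it; recovery is just a fresh start.
component = Counter(store)
result = component.increment()
```

Because the count lives in the store, the restarted component continues from where the old one stopped, and `result` is 3.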

µRB System Design
● Fine-grained components
○ Component-level µRB and fast initialization
○ Large components reduce the benefit of µRB
● State segregation
○ Prevents reading inconsistent state during recovery
○ Separates data recovery from application recovery
● Decoupling
○ Lowers disruption across the system during recovery

µRB System Design
● Retryable requests
○ Minimize the number of failed requests during recovery
● Leases
○ Make cleanup after μRBs more reliable; without them, resources may leak
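How leases help with cleanup can be sketched as follows. This is a minimal illustration, not the paper's implementation: `LeaseTable` and the resource names are hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class LeaseTable:
    """Tracks resources handed out under time-limited leases (hypothetical)."""

    def __init__(self, duration):
        self.duration = duration
        self._expiry = {}  # resource name -> time at which the lease lapses

    def acquire(self, resource, now):
        self._expiry[resource] = now + self.duration

    def renew(self, resource, now):
        # Only a lease that is still held can be renewed.
        if resource in self._expiry:
            self._expiry[resource] = now + self.duration

    def reclaim_expired(self, now):
        """Free every resource whose holder failed to renew in time."""
        expired = [r for r, t in self._expiry.items() if t <= now]
        for resource in expired:
            del self._expiry[resource]
        return sorted(expired)

    def held(self):
        return sorted(self._expiry)


leases = LeaseTable(duration=10)
leases.acquire("db-connection", now=0)
leases.acquire("file-handle", now=0)

# The holder of "file-handle" keeps renewing. The holder of "db-connection"
# is microrebooted and never renews or releases its lease.
leases.renew("file-handle", now=8)

reclaimed = leases.reclaim_expired(now=12)
```

Here `reclaimed` is `["db-connection"]`: the resource is freed even though the microrebooted holder never released it, while the renewed lease on `"file-handle"` stays in force.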

Research Questions
● Are μRBs effective in recovering from failures?
● Are μRBs any better than JVM restarts?
● Are μRBs useful in clusters?
● Do μRB-friendly architectures incur a performance overhead?

Experiment
● J2EE application
● JBoss server (modified to support µRB)
● eBid, a crash-only application based on RUBiS
● MySQL for persistent state
● FastS/SSM for session state

Injected Faults
● Deadlocks
● Infinite loops
● Memory leaks
● Transient Java exceptions
● Corrupted data structures
● Out-of-memory errors
● Low-level faults underneath the JVM layer

Failure Detection
● Flags a network-level error, an HTTP 4xx or 5xx error, or keywords indicative of failure (e.g., “exception,” “failed,” “error”)
● Submits each request in parallel to the fault-injected application and to a known-good application; any discrepancy between the two results is a “failure”
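The two detection approaches above can be sketched as simple predicates. This is an illustration only: the function names, the `(status, body)` response shape, and the sample responses are assumptions made for the example, not the detectors used in the experiments.

```python
FAILURE_KEYWORDS = ("exception", "failed", "error")


def looks_failed(status, body):
    """Flag an HTTP 4xx/5xx status or a failure keyword in the body.

    A status of None models a network-level error (no response at all).
    """
    if status is None or 400 <= status < 600:
        return True
    text = body.lower()
    return any(keyword in text for keyword in FAILURE_KEYWORDS)


def differs_from_reference(response, reference):
    """Flag any discrepancy with the response of the known-good instance."""
    return response != reference


good = (200, "<html>Bid accepted</html>")
faulty = (200, "<html>Bid accepted for item null</html>")
```

The `faulty` response passes the status and keyword check but differs from `good`, which suggests why the comparison approach can catch failures the simpler check misses.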

Recovery Group
● EJBs might maintain references to other EJBs
○ Such EJBs cannot be microrebooted individually
● Whenever an EJB is to be microrebooted, microreboot the transitive closure of its inter-EJB dependents as a group
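Computing a recovery group amounts to a graph traversal. The sketch below assumes a hypothetical `references` map (component name to the set of components it holds references to) and made-up component names; it is not how the modified JBoss derives the groups.

```python
from collections import deque


def recovery_group(target, references):
    """Return target plus every component that transitively depends on it.

    `references` maps a component to the set of components it holds
    references to (a hypothetical representation of inter-EJB references).
    """
    # Invert the map: for each component, which components refer to it?
    dependents = {}
    for holder, targets in references.items():
        for referenced in targets:
            dependents.setdefault(referenced, set()).add(holder)

    # Breadth-first walk over the dependents, starting from the target.
    group = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in group:
                group.add(dependent)
                queue.append(dependent)
    return group


references = {
    "PlaceBid": {"Item", "User"},
    "ViewItem": {"Item"},
    "Item": {"Category"},
    "User": set(),
    "Category": set(),
}
```

With this map, `recovery_group("Item", references)` yields `{"Item", "PlaceBid", "ViewItem"}`, while `recovery_group("PlaceBid", references)` yields only `{"PlaceBid"}` because nothing refers to it.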

Recovery Manager
● Microreboots or reboots:
○ EJBs (recovery group)
○ The WAR
○ All of eBid
○ The JVM that runs JBoss
○ The operating system
● Reboots the component related to the failed URL
● Tries the cheapest recovery first
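The "cheapest recovery first" policy can be sketched as a loop over the levels listed above. The `is_healthy_after` callback is a hypothetical stand-in for performing the reboot at a given level and re-checking for the failure; the real recovery manager is more involved.

```python
RECOVERY_LEVELS = [
    "EJB recovery group",
    "WAR",
    "eBid application",
    "JVM",
    "operating system",
]


def recover(is_healthy_after):
    """Try each recovery level in order, cheapest first.

    `is_healthy_after(level)` is a hypothetical callback that performs the
    (micro)reboot at that level and reports whether the failure went away.
    Returns the list of levels that were attempted.
    """
    attempted = []
    for level in RECOVERY_LEVELS:
        attempted.append(level)
        if is_healthy_after(level):
            break
    return attempted


# Example: a fault that only clears once the whole application is restarted.
attempts = recover(lambda level: level == "eBid application")
```

In this example `attempts` is `["EJB recovery group", "WAR", "eBid application"]`: two cheaper recoveries are tried and fail before the third level clears the fault.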

μRB Failure Recovery

Failed Requests

Failure + Recovery

Client-perceived Availability

μRB in Cluster

Client-perceived Availability
● When response latency exceeds 8 seconds, users get distracted

Performance Impact

Limitations
● µRB can leave the system inconsistent if updates aren’t atomic
● µRB can leak resources if they aren’t allocated through the application server (e.g., via the Java Native Interface)
● Can delay a full reboot when that is the only way to recover

Limitations
● Recovers only from transient bugs
● Considerable design effort needed
● Not suitable for
○ Existing monolithic applications
○ Languages (C/C++) that lack a Java EE-like framework
● Experiment used only one recovery group closure with 5 EJBs

Microreboot
● Cheap alternative to full system recovery
● Restarts components “with a clean state”
● Reduces recovery time, failed requests, and functional disruption
● Only suitable for applications with fine-grained components

Discussion
● Do μRBs lead to overengineering (AbstractAbstractObjectFactory)?
● Are the modifications needed for μRB worthwhile for existing monolithic applications?
● Is it possible to have such a recovery technique for CPU-bound applications?
