Checkpointing and Recovery
Purpose
• Consider a long-running application
– Regularly checkpoint the application
• Expensive task
– In case of failure, restore to the previous checkpoint
• What happens in the case of a distributed application?
– One (or more) processes fail
– Restoration to previous checkpoint should be done consistently
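The single-process case above can be sketched as follows. This is a minimal illustration, not a production design: the function `run`, its parameters `checkpoint_every` and `fail_at`, and the file name `app.ckpt` are all hypothetical names introduced here, and the "application" is just a running sum.

```python
import os
import pickle
import tempfile


def run(total_steps, path, checkpoint_every=100, fail_at=None):
    """Sum 0..total_steps-1, saving progress to `path` every few steps."""
    # On start-up, restore from the previous checkpoint if one exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        # Hypothetical fault injection, used only to demonstrate recovery.
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["acc"] += state["step"]
        state["step"] += 1
        # Checkpointing is expensive, so it is done only periodically.
        if state["step"] % checkpoint_every == 0:
            with open(path, "wb") as f:
                pickle.dump(state, f)
    return state["acc"]


def demo():
    """Crash partway through, then restart from the last checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.ckpt")
        try:
            run(1000, path, fail_at=250)
        except RuntimeError:
            pass  # work done after the last checkpoint is lost
        return run(1000, path)  # resumes from the checkpoint, not from zero
```

The restarted run redoes only the steps taken after the last saved checkpoint. The distributed case is harder because each process restoring its own latest checkpoint independently does not, in general, give a consistent global state.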
Examples
What to Save?
• Depends on the application
– Could be as simple as just program counter information
– Could be the state of the entire process, including messages received, etc.
Stable Storage
• Checkpoints must survive failure of processes (including failure during a disk write)
– A simple approach for stable storage
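The transcript does not preserve the approach shown on this slide. As an assumption, one common way to keep a checkpoint intact when a failure hits during a disk write is to write to a temporary file, force it to disk, and then atomically rename it over the old checkpoint. The function names `save_checkpoint` and `load_checkpoint` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Replace the checkpoint at `path` without ever exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint next to the old one, under a temporary name.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to the storage device
        # The rename either fully happens or does not happen at all, so a
        # crash leaves either the old checkpoint or the new one.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Read back the most recently completed checkpoint."""
    with open(path) as f:
        return json.load(f)
```

This sketch protects against a crash in the middle of a write. It does not protect against loss of the disk itself, which would need a second copy on independent storage.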
Approaches
• Asynchronous
– The local checkpoints at different processes are taken independently
• Synchronous
– The local checkpoints at different processes are coordinated
– They may not be at the same time
Asynchronous Checkpointing
• Problem
– Domino effect
[Figure: domino effect triggered by a failed process]
Other Issues with Asynchronous Checkpointing
• Useless checkpoints
• Need for garbage collection
• Recovery requires significant coordination
Asynchronous Checkpointing (Continued)
• Identify dependency between different checkpoint intervals
• This information is stored along with checkpoints in a stable storage
• When a failed process is repaired, it requests this information from the others to determine the need for rollback
Two Examples of Asynchronous Checkpointing
• Bhargava and Lian
• Wang et al
Algorithm by Bhargava and Lian
• Draw an edge from c_{i,x} to c_{j,y} if either
– i = j and y = x+1
– i ≠ j and a message m is sent from I_{i,x} and received in I_{j,y}
• Where I_{i,x} is the interval between c_{i,x-1} and c_{i,x}
• The rollback recovery line is used for recovery as well as garbage collection
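The edge rules above can be turned into a small reachability computation. The following is a sketch under stated assumptions, not the published algorithm verbatim: the volatile (not yet checkpointed) state of each process is modeled as one extra node at the end of its chain, a failure marks that node for the failed process, and every node reachable from a marked node is treated as undone. The names `build_rollback_graph`, `recovery_line`, and `last_index` are hypothetical.

```python
from collections import defaultdict, deque


def build_rollback_graph(last_index, messages):
    """last_index[i]: index of process i's last node (its volatile state).
    messages: tuples (i, x, j, y), sent in interval x of i, received in
    interval y of j, where interval x ends at node (i, x)."""
    edges = defaultdict(set)
    for i, last in last_index.items():
        for x in range(last):
            edges[(i, x)].add((i, x + 1))      # rule: i = j and y = x + 1
    for i, x, j, y in messages:
        if i != j:
            edges[(i, x)].add((j, y))          # rule: message from I(i,x) to I(j,y)
    return edges


def recovery_line(last_index, messages, failed):
    """Return {process: index of the latest node that survives the rollback}."""
    edges = build_rollback_graph(last_index, messages)
    # A failed process loses its volatile state (its last node).
    start = [(p, last_index[p]) for p in failed]
    undone = set(start)
    queue = deque(start)
    while queue:                               # breadth-first reachability
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in undone:
                undone.add(nxt)
                queue.append(nxt)
    return {
        i: max(x for x in range(last + 1) if (i, x) not in undone)
        for i, last in last_index.items()
    }


# Two processes, one checkpoint each (node 1), volatile state at node 2.
# P0 sends before its checkpoint and P1 receives before its checkpoint;
# P1 then sends after its checkpoint and P0 receives before its own.
msgs = [(0, 1, 1, 1), (1, 2, 0, 1)]
print(recovery_line({0: 2, 1: 2}, msgs, failed=[1]))   # {0: 0, 1: 0}
```

In this example the failure of P1 forces both processes back to their initial states, which is the domino effect. With no messages, the same call returns `{0: 2, 1: 1}`: P0 keeps its current state and P1 restarts from its own checkpoint. Checkpoints that lie before the recovery line are candidates for garbage collection.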
Algorithm by Wang et al
• Difference
– If a message sent from I_{i,x} is received in I_{j,y}, then draw an edge from c_{i,x-1} to c_{j,y}
• The recovery line obtained is similar to that of Bhargava and Lian
• Advantage
– The number of useful checkpoints is at most N(N+1)/2
• This can be shown by bounding the number of checkpoints that are ahead of the recovery line
Coordinated Checkpointing
• Using diffusing computation
– How can we use diffusing computation to obtain a consistent snapshot?
Algorithm by Tamir and Sequin
• Blocking checkpoint
– A coordinator decides when a checkpoint is taken
– The coordinator sends a request message to all processes
– Each process
  • Stops executing
  • Flushes the channels
  • Takes a tentative checkpoint
  • Replies to the coordinator
– When all processes have replied, the coordinator asks them to make their tentative checkpoints permanent
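The steps above can be sketched as a single-threaded simulation. This is an illustration under simplifying assumptions, not the published protocol: message passing is replaced by direct method calls, a process's state is a single integer, and "flushing the channels" is modeled as applying every in-transit message in a hypothetical `inbox` list before the checkpoint is taken. All class and method names are made up for this example.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.running = True
        self.inbox = []          # messages still in transit to this process
        self.tentative = None    # checkpoint taken but not yet committed
        self.permanent = None    # last committed checkpoint

    def on_request(self):
        """Handle the coordinator's checkpoint request."""
        self.running = False                 # stop executing
        for msg in self.inbox:               # flush: deliver in-transit messages
            self.state += msg
        self.inbox.clear()
        self.tentative = self.state          # take a tentative checkpoint
        return "ready"                       # reply to the coordinator

    def on_commit(self):
        """Make the tentative checkpoint permanent and resume."""
        self.permanent = self.tentative
        self.tentative = None
        self.running = True


def coordinated_checkpoint(processes):
    """Coordinator: request, wait for all replies, then commit."""
    replies = [p.on_request() for p in processes]
    if all(r == "ready" for r in replies):
        for p in processes:
            p.on_commit()
        return True
    return False


procs = [Process("p0"), Process("p1")]
procs[0].state = 3
procs[1].inbox = [5, 2]
coordinated_checkpoint(procs)
print([p.permanent for p in procs])   # [3, 7]
```

In this sketch a process holds at most two checkpoints at any moment: the previous permanent one and the new tentative one, with the old one replaced only at commit time.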
Algorithm by Tamir and Sequin (Continued)
• How many checkpoints need to be stored per process?
Checkpointing in Timed Systems
• What if clocks are perfectly synchronized?
Checkpointing in Timed Systems (Continued)
• What if clocks are loosely synchronized?
– Assume the maximum clock drift is known
• All processes take a checkpoint at a fixed (local) time
– After the checkpoint, a process does not send any messages for twice the maximum clock drift
– The set of local checkpoints is guaranteed to be consistent
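The silent period can be expressed as a simple guard on sending. This is a sketch only: the function `may_send` and the parameter names `local_time`, `checkpoint_time`, and `max_drift` are hypothetical, and all three values are assumed to be in the same time unit on the process's own clock.

```python
def may_send(local_time, checkpoint_time, max_drift):
    """Return True if a process is allowed to send a message right now.

    Sending is blocked from the moment the local checkpoint is taken until
    twice the maximum clock drift has elapsed on the local clock.
    """
    if local_time < checkpoint_time:
        return True                      # checkpoint not taken yet
    return local_time >= checkpoint_time + 2 * max_drift


# Checkpoint scheduled at local time 100, maximum drift of 5 time units.
print(may_send(99, 100, 5))    # True: before the checkpoint
print(may_send(104, 100, 5))   # False: inside the silent period
print(may_send(110, 100, 5))   # True: silent period is over
```

The intent of the pause is that a process whose clock runs behind has time to take its own checkpoint before it can receive a message that was sent after the sender's checkpoint.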
Minimal Checkpoint Coordination
• Approach by Koo and Toueg
– Require processes to take a checkpoint only if they have to
Logging Protocols
• Pessimistic
• Optimistic
• Causal