
Checkpointing and Recovery

Purpose

• Consider a long-running application

– Regularly checkpoint the application

• Expensive task

– In case of failure, restore to the previous checkpoint

• What happens in the case of a distributed application?

– One (or more) processes fail

– Restoration to previous checkpoint should be done consistently

Examples

What to Save?

• Depends on the application

– Could be as simple as just the program counter information

– Could be the state of the entire process, including messages received, etc.

Stable Storage

• Checkpoints must survive failure of processes (including failure during a disk write)

– A simple approach for stable storage
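The slide does not spell out the simple approach. One common way to survive a crash during a disk write is to write the new checkpoint to a temporary file, force it to disk, and then atomically rename it over the old one. The sketch below assumes a POSIX-like file system and a JSON-serializable state; the names `save_checkpoint` and `load_checkpoint` are hypothetical and chosen for illustration only.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write `state` so that `path` always holds a complete checkpoint.

    Assumption: renaming a file within one directory is atomic, so a crash
    leaves either the old checkpoint or the new one, never a partial file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename
    # never has to cross file systems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # force the data to disk before renaming
        os.replace(tmp_path, path)    # atomically switch to the new checkpoint
    except BaseException:
        # On any failure, discard the partial file and keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recent complete checkpoint."""
    with open(path) as f:
        return json.load(f)
```

A process would call `save_checkpoint("app.ckpt", state)` periodically and `load_checkpoint("app.ckpt")` on restart. This covers process crashes only; surviving a disk failure would need a second copy on another device.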

Approaches

• Asynchronous

– The local checkpoints at different processes are taken independently

• Synchronous

– The local checkpoints at different processes are coordinated

– They may not be taken at the same time

Asynchronous Checkpointing

• Problem

– Domino effect

[Figure: domino effect; label: failed process]

Other Issues with Asynchronous Checkpointing

• Useless checkpoints

• Need for garbage collection

• Recovery requires significant coordination

Asynchronous Checkpointing (Continued)

• Identify dependency between different checkpoint intervals

• This information is stored along with the checkpoints in stable storage

• When a failed process recovers, it requests this information from the others to determine the need for rollback

Two Examples of Asynchronous Checkpointing

• Bhargava and Lian

• Wang et al

Algorithm by Bhargava and Lian

• Draw an edge from $c_{i,x}$ to $c_{j,y}$ if either

– $i = j$ and $y = x+1$

– $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$

• Where $I_{i,x}$ is the interval between $c_{i,x-1}$ and $c_{i,x}$

• The resulting recovery line is used for rollback recovery as well as garbage collection
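The edge rules above can be turned into a small recovery-line computation. The sketch below is a minimal illustration and not the authors' code. It assumes the usual formulation: the last checkpoint of each process stands for its current (volatile) state, every checkpoint reachable from a failed process's volatile state is marked, and the last unmarked checkpoint of each process forms the recovery line. The function names and the message tuple layout are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(num_ckpts, messages):
    """Build the rollback-dependency graph.

    num_ckpts[i]: number of checkpoints of process i, indexed 0..n-1;
                  the last index is assumed to be the volatile state.
    messages:     tuples (i, x, j, y) meaning a message was sent in
                  interval I_{i,x} and received in interval I_{j,y}.
    """
    edges = defaultdict(set)
    for i, n in enumerate(num_ckpts):
        for x in range(n - 1):
            edges[(i, x)].add((i, x + 1))   # rule: i = j and y = x + 1
    for i, x, j, y in messages:
        if i != j:
            edges[(i, x)].add((j, y))       # rule: i != j and message dependency
    return edges


def recovery_line(num_ckpts, messages, failed):
    """Return {process: index of the checkpoint it must restart from}."""
    edges = build_graph(num_ckpts, messages)
    marked = set()
    # Start from the volatile state of every failed process.
    queue = deque((i, num_ckpts[i] - 1) for i in failed)
    while queue:
        node = queue.popleft()
        if node in marked:
            continue
        marked.add(node)
        queue.extend(edges[node])
    line = {}
    for i, n in enumerate(num_ckpts):
        unmarked = [x for x in range(n) if (i, x) not in marked]
        line[i] = max(unmarked) if unmarked else None
    return line


# Two processes with three checkpoints each. P0 sends a message in I_{0,2}
# that P1 receives in I_{1,1}. If P0 fails, P1 must also roll back.
print(recovery_line([3, 3], [(0, 2, 1, 1)], failed=[0]))   # {0: 1, 1: 0}
```

In the example, P0 loses the state in which it sent the message, so P1's checkpoint 1, which recorded the receipt, becomes unusable and P1 falls back to checkpoint 0. Checkpoints older than the recovery line can be garbage collected.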

Algorithm by Wang et al

• Difference

– If a message sent from $I_{i,x}$ is received in $I_{j,y}$, then draw an edge from $c_{i,x-1}$ to $c_{j,y}$

• The recovery line obtained is similar to that of Bhargava and Lian

• Advantage

– The number of useful checkpoints is at most $N(N+1)/2$

• This can be shown by bounding the number of checkpoints that are ahead of the recovery line

Coordinated Checkpointing

• Using diffusing computation

– How can we use diffusing computation to obtain a consistent snapshot?

Algorithm by Tamir and Sequin

• Blocking checkpoint

– A coordinator decides when a checkpoint is taken

– Coordinator sends a request message to all processes

– Each process

    • Stops executing

    • Flushes the channels

    • Takes a tentative checkpoint

    • Replies to the coordinator

– When all processes have replied, the coordinator asks them to make their tentative checkpoints permanent
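The blocking protocol above can be sketched as a single-threaded simulation. This is an illustrative outline under stated assumptions, not the published algorithm: message passing is replaced by direct method calls, channel flushing is omitted (the sketch assumes no messages are in flight), and processes are assumed to resume once the checkpoint is permanent. The class and function names are hypothetical.

```python
class Process:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = state
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_checkpoint_request(self):
        """Handle the coordinator's request."""
        self.running = False                # stop executing
        # Flushing the channels is omitted: no in-flight messages are assumed.
        self.tentative = dict(self.state)   # take a tentative checkpoint
        return self.pid                     # reply to the coordinator

    def on_commit(self):
        """Make the tentative checkpoint permanent."""
        self.permanent = self.tentative
        self.tentative = None
        self.running = True                 # assumption: resume after commit


def coordinate(processes):
    """Run one checkpoint round; return True if it became permanent."""
    replies = {p.on_checkpoint_request() for p in processes}
    if replies == {p.pid for p in processes}:
        for p in processes:
            p.on_commit()
        return True
    return False


procs = [Process(0, {"x": 1}), Process(1, {"x": 2})]
coordinate(procs)
print([p.permanent for p in procs])   # [{'x': 1}, {'x': 2}]
```

Keeping the tentative checkpoint separate from the permanent one means that a failure in the middle of a round leaves the previous permanent checkpoint intact.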

Algorithm by Tamir and Sequin (Continued)

• How many checkpoints need to be stored per process?

Checkpointing in Timed Systems

• What if clocks are perfectly synchronized?

Checkpointing in Timed Systems (Continued)

• What if clocks are loosely synchronized?

– Max clock drift, $\delta$, is known?

• All processes take a checkpoint at a fixed (local) time

– After the checkpoint, a process does not send any messages for $2\delta$

– The set of local checkpoints is guaranteed to be consistent
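The reason for the silent period can be checked with a few lines of arithmetic. The sketch below writes the maximum clock drift as `delta` (a name chosen for this sketch) and assumes that every local clock is within `delta` of real time, so two clocks can differ by up to twice that amount. It also assumes the worst case of zero message delay. The function names are hypothetical.

```python
def checkpoint_real_time(T, offset):
    """Real time at which a clock reading (real_time + offset) shows T."""
    return T - offset


def is_consistent(T, offsets, wait):
    """Check that no message sent after a checkpoint can arrive before
    another process's checkpoint.

    offsets: clock offset of each process relative to real time.
    wait:    how long each process stays silent after its checkpoint.
    Worst case assumed: messages are delivered instantly.
    """
    ckpts = [checkpoint_real_time(T, o) for o in offsets]
    return all(ci + wait >= cj for ci in ckpts for cj in ckpts)


delta = 1.0
offsets = [-delta, 0.0, delta]          # clocks at both extremes of the drift
print(is_consistent(100.0, offsets, wait=2 * delta))   # True
print(is_consistent(100.0, offsets, wait=0.0))         # False
```

With no silent period, the process with the fastest clock checkpoints first and could send a message that the slowest process receives before its own checkpoint, which would make the receiver's checkpoint record a message the sender's checkpoint never sent.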

Minimal Checkpoint Coordination

• Approach by Koo and Toueg

– Require processes to take a checkpoint only if they have to
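One way to read "only if they have to" is that a process must checkpoint when the initiator's new checkpoint would otherwise record a message whose sending is not recorded anywhere. The sketch below is a simplified illustration of that idea and not the full protocol: it assumes each process tracks the set of processes it has received messages from since its last checkpoint, and it omits the tentative/permanent phases and message labels. The function name and the `received_from` structure are hypothetical.

```python
def processes_to_checkpoint(initiator, received_from):
    """Return the set of processes that must checkpoint with the initiator.

    received_from[p]: processes from which p has received a message
                      since p's last checkpoint (assumed bookkeeping).
    """
    must = {initiator}
    stack = [initiator]
    while stack:
        p = stack.pop()
        # If p's checkpoint records a message from q, q must checkpoint
        # too, so that the send is recorded as well.
        for q in received_from.get(p, ()):
            if q not in must:
                must.add(q)
                stack.append(q)
    return must


received_from = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}}
print(sorted(processes_to_checkpoint("A", received_from)))   # ['A', 'B', 'C']
```

In the example, D is left alone even though it received a message from A, because D's existing checkpoint predates that receipt and stays consistent with A's new checkpoint.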

Logging Protocols

• Pessimistic

• Optimistic

• Causal
