checkpointing and recovery. purpose consider a long running application –regularly checkpoint the...

Checkpointing and Recovery

Upload: meagan-moody

Post on 28-Dec-2015

Purpose

• Consider a long-running application

  – Regularly checkpoint the application

    • Checkpointing is an expensive task

  – In case of failure, restore to the previous checkpoint

• What happens in the case of a distributed application?

  – One (or more) processes fail

  – Restoration to the previous checkpoint should be done consistently

Examples

What to Save?

• Depends on the application

  – Could be as simple as just the program counter information

  – Could be the state of the entire process, including messages received, etc.

Stable Storage

• Checkpoints must survive failure of processes (including failure during a disk write)

  – A simple approach for stable storage
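
The slide does not spell out which approach it has in mind. One simple and widely taught scheme keeps two copies of every checkpoint, writes them one after the other, and guards each copy with a checksum, so a crash in the middle of a write can damage at most one copy. The following is a minimal Python sketch of that idea; the class name `StableStore`, the file names `copy_a` and `copy_b`, and the JSON encoding are illustrative assumptions, not something taken from the slides.

```python
import json
import os
import zlib


class StableStore:
    """Hypothetical two-copy store: a torn write damages at most one copy."""

    def __init__(self, directory):
        self.paths = [os.path.join(directory, "copy_a"),
                      os.path.join(directory, "copy_b")]

    def _write_copy(self, path, version, payload):
        body = json.dumps({"version": version, "payload": payload}).encode()
        record = zlib.crc32(body).to_bytes(4, "big") + body
        with open(path, "wb") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to the device before returning

    def _read_copy(self, path):
        try:
            with open(path, "rb") as f:
                record = f.read()
        except FileNotFoundError:
            return None
        crc, body = record[:4], record[4:]
        if len(record) < 4 or zlib.crc32(body).to_bytes(4, "big") != crc:
            return None  # missing, truncated, or corrupted copy
        return json.loads(body)

    def _latest(self):
        copies = [c for c in map(self._read_copy, self.paths) if c is not None]
        return max(copies, key=lambda c: c["version"]) if copies else None

    def write(self, payload):
        latest = self._latest()
        version = latest["version"] + 1 if latest else 1
        # Finish the first copy completely before touching the second one.
        for path in self.paths:
            self._write_copy(path, version, payload)

    def read(self):
        latest = self._latest()
        return latest["payload"] if latest else None
```

If a crash interrupts the write of the first copy, the second copy still holds the previous checkpoint; if it interrupts the write of the second copy, the first copy already holds the new one. Either way `read()` returns a complete checkpoint.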

Approaches

• Asynchronous

  – The local checkpoints at different processes are taken independently

• Synchronous

  – The local checkpoints at different processes are coordinated

  – They may not be taken at the same time

Asynchronous Checkpointing

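• Problem

  – Domino effect: rolling back the failed process can force other processes to roll back as well, and in the worst case the rollbacks cascade all the way back to the initial state

[Figure: domino effect, with the failed process marked]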

Other Issues with Asynchronous Checkpointing

• Useless checkpoints

• Need for garbage collection

• Recovery requires significant coordination

Asynchronous Checkpointing (Continued)

• Identify dependency between different checkpoint intervals

• This information is stored along with the checkpoints in stable storage

• When a process recovers after a failure, it requests this information from the others to determine the need for rollback

Two Examples of Asynchronous Checkpointing

• Bhargava and Lian

• Wang et al.

Algorithm by Bhargava and Lian

• Draw an edge from $c_{i,x}$ to $c_{j,y}$ if either

  – $i = j$ and $y = x + 1$, or

  – $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$

• Here $I_{i,x}$ is the interval between $c_{i,x-1}$ and $c_{i,x}$

• The rollback recovery line is used for recovery as well as for garbage collection
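
The edge rule above can be sketched in Python as follows. The slide does not say how the recovery line is read off the graph, so the marking step here is an assumption: it treats the current (volatile) state of each process as its last checkpoint, marks everything reachable from the volatile state of a failed process, and takes the latest unmarked checkpoint of each process. The function names and the tuple encoding of messages are hypothetical.

```python
from collections import defaultdict


def build_graph(counts, messages):
    """Build the checkpoint dependency graph.

    counts[i] is the number of checkpoints of process i, numbered
    0..counts[i]-1; the last one stands for the volatile current state.
    messages holds tuples (i, x, j, y): sent in I_{i,x}, received in I_{j,y}.
    """
    edges = defaultdict(set)
    for i, count in enumerate(counts):
        for x in range(count - 1):
            edges[(i, x)].add((i, x + 1))   # i = j and y = x + 1
    for i, x, j, y in messages:
        if i != j:
            edges[(i, x)].add((j, y))       # message from I_{i,x} to I_{j,y}
    return edges


def recovery_line(counts, messages, failed):
    """Latest usable checkpoint index per process (assumed marking rule).

    Assumes every process has an initial checkpoint 0 plus a volatile state,
    so no message is ever received in an interval numbered 0.
    """
    edges = build_graph(counts, messages)
    stack = [(p, counts[p] - 1) for p in failed]  # volatile state is lost
    marked = set(stack)
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in marked:
                marked.add(nxt)
                stack.append(nxt)
    return [max(x for x in range(count) if (i, x) not in marked)
            for i, count in enumerate(counts)]
```

For example, with two processes that each hold checkpoints 0 and 1 plus a volatile state 2, a message sent by process 0 in $I_{0,2}$ and received by process 1 in $I_{1,1}$ forces process 1 back to checkpoint 0 when process 0 fails: `recovery_line([3, 3], [(0, 2, 1, 1)], failed=[0])` gives `[1, 0]`.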

Algorithm by Wang et al.

• Difference

  – If a message sent from $I_{i,x}$ is received in $I_{j,y}$, then draw an edge from $c_{i,x-1}$ to $c_{j,y}$

• The recovery line obtained is similar to that obtained by Bhargava and Lian

• Advantage

  – The number of useful checkpoints is at most $N(N+1)/2$

• This can be shown by bounding the number of checkpoints that are ahead of the recovery line

Coordinated Checkpointing

• Using diffusing computation

  – How can we use diffusing computation to obtain a consistent snapshot?
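
The slides pose this as a question and do not give an answer. One well-known possibility is a marker-based snapshot in the style of Chandy and Lamport: the initiator checkpoints and sends a marker on every outgoing channel, and each process checkpoints when it sees its first marker and passes the marker on. The single-threaded Python simulation below is only a sketch of that idea, assuming FIFO channels and integer "transfer" messages; `Node`, `MARKER`, and `snapshot` are hypothetical names.

```python
from collections import deque

MARKER = object()  # special message that diffuses the checkpoint request


class Node:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance
        self.checkpoint = None  # local state saved by the snapshot
        self.recording = set()  # incoming channels still being recorded
        self.recorded = {}      # source pid -> messages caught in transit


def snapshot(nodes, channels, initiator):
    """channels[(src, dst)] is a FIFO deque of integers and markers."""

    def take_checkpoint(node):
        node.checkpoint = node.balance
        node.recording = {s for (s, d) in channels if d == node.pid}
        node.recorded = {s: [] for s in node.recording}
        for (s, d), queue in channels.items():
            if s == node.pid:
                queue.append(MARKER)  # diffuse the request to neighbours

    take_checkpoint(nodes[initiator])
    while any(channels.values()):
        for (src, dst), queue in channels.items():
            if not queue:
                continue
            msg = queue.popleft()
            node = nodes[dst]
            if msg is MARKER:
                if node.checkpoint is None:
                    take_checkpoint(node)   # first marker: checkpoint now
                node.recording.discard(src)  # this channel is now flushed
            else:
                node.balance += msg
                if src in node.recording:
                    node.recorded[src].append(msg)  # message was in transit
```

The saved local states plus the recorded in-transit messages form the snapshot. With FIFO channels, no message sent after a sender's checkpoint can be counted in a receiver's checkpoint, which is the consistency condition the slide asks about.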

Algorithm by Tamir and Sequin

• Blocking checkpoint

  – A coordinator decides when a checkpoint is taken

  – The coordinator sends a request message to all processes

  – Each process

    • Stops executing

    • Flushes the channels

    • Takes a tentative checkpoint

    • Replies to the coordinator

  – When all processes have replied, the coordinator asks them to make their tentative checkpoints permanent
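
The two phases described above can be sketched as a small single-threaded Python simulation. This is an illustration under simplifying assumptions, not the published algorithm: process state is a single integer, in-transit messages sit in a per-process `inbox` list, and nothing else runs while the checkpoint is in progress. The names `Process` and `coordinated_checkpoint` are hypothetical.

```python
class Process:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = state
        self.inbox = []        # messages still in transit to this process
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_request(self):
        self.running = False           # stop executing
        for msg in self.inbox:         # flush the channels: deliver everything
            self.state += msg
        self.inbox.clear()
        self.tentative = self.state    # take a tentative checkpoint
        return self.pid                # reply to the coordinator

    def on_commit(self):
        self.permanent = self.tentative  # tentative becomes permanent
        self.tentative = None
        self.running = True              # resume execution


def coordinated_checkpoint(processes):
    """Coordinator: request, collect replies, then commit."""
    replies = [p.on_request() for p in processes]
    if sorted(replies) != sorted(p.pid for p in processes):
        return False                     # a reply is missing: do not commit
    for p in processes:
        p.on_commit()
    return True
```

Because every process is stopped and every channel is empty when the tentative checkpoints are taken, the set of checkpoints cannot contain a message that is recorded as received but not as sent.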

Algorithm by Tamir and Sequin

• How many checkpoints need to be stored per process?

Checkpointing in Timed Systems

• What if the clocks are perfectly synchronized?

Checkpointing in Timed Systems

• What if clocks are loosely synchronized?

  – Suppose the maximum clock drift, $\epsilon$, is known

• All processes take a checkpoint at a fixed (local) time

  – After the checkpoint, a process does not send any messages for $2\epsilon$

  – The set of local checkpoints is guaranteed to be consistent
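
A small Python check can make the argument concrete. It is a sketch under assumptions that the slide does not state: each local clock differs from real time by a fixed offset of at most the maximum drift (called `eps` here), clocks run at the same rate after that, and message delay is never negative. The function names are hypothetical.

```python
def checkpoint_real_time(local_time, offset):
    """Real time at which a clock that is `offset` ahead shows `local_time`."""
    return local_time - offset


def is_consistent(local_time, offsets, silence):
    """True if no message sent after the sender's checkpoint can arrive
    before the receiver's checkpoint.

    offsets -- clock offset of every process relative to real time
    silence -- how long a process stays quiet after its checkpoint
    """
    for sender in offsets:
        earliest_send = checkpoint_real_time(local_time, sender) + silence
        for receiver in offsets:
            # With zero delay, the earliest receive equals the earliest send.
            if earliest_send < checkpoint_real_time(local_time, receiver):
                return False
    return True
```

With `eps = 1` and offsets `[-1, 0, 1]`, a silence period of `2 * eps` passes the check, while a silence period of `0` fails it, because the process with the fast clock could send a message that reaches the process with the slow clock before that process has taken its checkpoint.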

Minimal Checkpoint Coordination

• Approach by Koo and Toueg

  – Require processes to take a checkpoint only if they have to

Logging Protocols

• Pessimistic

• Optimistic

• Causal