
Rollback-Recovery Protocols in Message-Passing Systems

Based on

A Survey of Rollback-Recovery Protocols

in Message-Passing Systems

by

Mootaz Elnozahy Lorenzo Alvisi

Yi-Min Wang David B. Johnson

Motivation

• Large distributed systems have vast computing potential.

• In these systems a machine can stop participating in execution of a distributed application as a result of:

– disconnection from the network

– shut down or reboot by the user

– power failure

If any of these events occurs, we say that the node has failed.

• The computing potential is hampered by the nodes’ susceptibility to failures.

– There is a need to preserve the correctness of a distributed execution despite failures.

Rollback Recovery

• Periodically use stable storage (e.g., disk) to save the processes’ state, and possibly some additional useful data, during failure-free execution.

– A saved state of a process is called a checkpoint.

• Upon a failure, restart a failed process from one of the saved checkpoints.

– This reduces the amount of lost computation.

• Of course, when recovering, consistency between processes must be maintained.

Flavors of Rollback Recovery

• There are techniques that

– rely on the application to decide when and what to save, or

– provide the programmer with linguistic constructs to be added to the application.

• There are also techniques, called transparent techniques, that do not require any intervention on the part of the application or the programmer.

• We focus on transparent rollback recovery.

System Model

• A constant number of processes (N)

– Communicate only through messages

– Interact with outside world through messages

– Cooperate to execute a distributed program

System Model: Communication

• Most protocols assume that the communication network is immune to partitioning.

• Some protocols assume reliable FIFO delivery of messages.

• Other protocols assume unreliable communication, which means that messages can be

– lost

– duplicated

– reordered

System Model: Failures

• A process that fails

– loses its volatile state

– stops execution

– does not send any more messages

Such behavior is called fail-stop.

• Processes have a stable storage device that survives failures.

• The number of tolerated failures in different protocols varies from 1 to N.

– Some protocols do not tolerate failures during recovery.

Consistent System States

• A global state of a message-passing system consists of:

– the individual states of all processes

– the states of the communication channels

• A consistent global state is a global state in which if a process’s state reflects a message receipt, then the state of the corresponding sender reflects sending that message

Consistent System States (2)

• Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation.

• The goal of a rollback-recovery protocol is to bring the system into a consistent state
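
The consistency condition can be checked mechanically. The following is a minimal sketch, assuming that each process numbers its local events 1, 2, 3, and so on, and that a global state is given as a map from each process to the number of the last local event it reflects. The function name `is_consistent` and the tuple layout for messages are illustrative choices, not part of the survey.

```python
def is_consistent(cut, messages):
    """Return True if no state in `cut` reflects a receipt whose send is missing.

    cut:      {process: number of the last local event included in the state}
    messages: iterable of (sender, send_event, receiver, receive_event)
    """
    for sender, send_event, receiver, receive_event in messages:
        received = receive_event <= cut[receiver]
        sent = send_event <= cut[sender]
        if received and not sent:
            return False  # the receiver remembers a message nobody sent
    return True


# Hypothetical run: P0 sends m as its event 3; P1 receives m as its event 2.
messages = [("P0", 3, "P1", 2)]
both_reflected = is_consistent({"P0": 3, "P1": 2}, messages)
receipt_without_send = is_consistent({"P0": 2, "P1": 2}, messages)
in_transit = is_consistent({"P0": 3, "P1": 1}, messages)
```

Note that a state in which the send is reflected but the receipt is not (an in-transit message) still counts as consistent under this definition.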

Consistent Global Checkpoints and Recovery Line

• A consistent global checkpoint is a set of N checkpoints, one from each process, forming a consistent system state.

– Any consistent global checkpoint can be used to restart process execution upon failure.

• It is desirable to minimize the amount of lost work by restoring the system to the most recent consistent global checkpoint, which is called the recovery line.

Orphan Messages and Orphan Processes

• A message m sent by a process Pi that has failed is an orphan message, if the system cannot guarantee regeneration of the same m during the recovery of Pi.

• A process Pk whose state depends on a non-deterministic event (e.g., receipt of a message) that cannot be reproduced is called an orphan process.

– The existence of orphan processes violates the integrity of the execution and therefore must be prevented.

In Transit Messages

• A message that has been sent but not yet received is called an in-transit message.

• Do rollback-recovery protocols have to guarantee the delivery of in-transit messages?

– It depends on whether reliable communication is assumed.

In-Transit Messages: Reliable Communication

• Reliable communication protocols cannot ensure reliability of message delivery if processes fail.

• For example, if an in-transit message is lost because the intended receiver has failed, then

– conventional communication protocols will generate a timeout and inform the sender that the message cannot be delivered.

– In a rollback-recovery system, however, the receiver will eventually recover, and therefore the system must:

    • mask the timeout from the application program at the sender process, and

    • make in-transit messages available to the intended receiver process after it recovers.

In Transit Messages: Unreliable Communication

• If unreliable communication is assumed, then:

– In-transit messages lost due to failure of the receiver cannot be distinguished from those lost due to communication failures.

– Loss of an in-transit message is a legal event.

• Therefore, the recovery protocol need not handle in-transit messages in any special way.

Interactions with the Outside World

• A message-passing system often interacts with the outside world to receive input data or show the outcome of a computation.

• If a failure occurs, the outside world cannot be relied on to roll back:

– a printer cannot roll back the effects of printing a character

– an automatic teller machine cannot recover the money that it dispensed to a customer.

Interactions with the Outside World: Output Messages

• It is therefore necessary that the outside world perceive a consistent behavior of the system despite failures.

• Before sending output to the outside world, the system must ensure that the state from which the output is sent can be recovered.

– This is commonly called the output commit problem.

Interactions with the Outside World: Input Messages

• Input messages that a system receives from the outside world may not be reproducible during recovery.

– It may not be possible for the outside world to regenerate them.

– Recovery protocols must arrange to save these input messages so that they can be retrieved when needed for execution replay after a failure.

• A common approach is to save each input message on stable storage before allowing the application program to process it.

Stable Storage

• Rollback recovery uses stable storage to save checkpoints, event logs, and other recovery-related information despite failures.

• Stable storage in rollback recovery is only an abstraction.

– It is often confused with the disk storage used to implement it.

Stable Storage (2)

• There are different implementation styles of stable storage:

– In a system that tolerates only a single failure, stable storage may consist of the volatile memory of another process.

– In a system that wishes to tolerate an arbitrary number of transient failures, stable storage may consist of a local disk in each host.

– In a system that tolerates non-transient failures, stable storage must consist of a persistent medium outside the host on which a process is running. A replicated file system is a possible implementation in such systems.

Garbage Collection

• As the application progresses and more recovery information is collected, a subset of the stored recovery information may become useless.

– Deletion of such useless recovery information is called garbage collection.

• A common approach to garbage collection is to identify the recovery line and discard all data relating to events that occurred before that line.

– For example, processes that coordinate their checkpoints to form consistent states will always restart from the most recent checkpoint of each process, and so all previous checkpoints can be discarded.

Z-Cycles and Z-Paths

• A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints.

• Let $\rightarrow$ denote Lamport’s happened-before relation.

• Let $c_{i,x}$ denote the $x$th checkpoint of process $P_i$.

• Define the execution portion between two consecutive checkpoints on the same process to be the checkpoint interval (starting with the earlier checkpoint).

• Let $\mathit{send}_i$ and $\mathit{deliver}_i$ be the communication events by process $P_i$.

Definition of Z-Path

Given two checkpoints $c_{i,x}$ and $c_{j,y}$, a Z-path exists between $c_{i,x}$ and $c_{j,y}$ if and only if one of the following two conditions holds:

1. $x < y$ and $i = j$; or

2. There exists a sequence of messages $[m_0, m_1, \ldots, m_n]$, $n \ge 0$, such that:

– $c_{i,x} \rightarrow \mathit{send}_i(m_0)$;

– $\forall l < n$, either $\mathit{deliver}_k(m_l)$ and $\mathit{send}_k(m_{l+1})$ are in the same checkpoint interval, or $\mathit{deliver}_k(m_l) \rightarrow \mathit{send}_k(m_{l+1})$; and

– $\mathit{deliver}_j(m_n) \rightarrow c_{j,y}$

Z-Cycles and Z-Paths (2)

• A Z-cycle is a Z-path that begins and ends with the same checkpoint.

– In the example figure, [m5, m4, m3] is a Z-cycle that starts and ends at checkpoint c2,2.

– [m1, m2] and [m3, m4] are Z-paths between c0,1 and c2,2.

The Z-Cycles Theory

• The Z-cycle theory was first introduced as a framework for reasoning about consistent system states.

• The theory has proved a powerful tool for reasoning about a class of protocols known as communication-induced checkpointing.

– In particular, it has been proven that a checkpoint involved in a Z-cycle cannot become part of a consistent state in a system that uses only checkpoints.
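
The Z-path definition translates into a search over messages. The sketch below is illustrative only and rests on an assumed encoding: every message records the checkpoint interval in which it was sent and the interval in which it was delivered, where interval k of a process is the execution between its checkpoints k and k+1 (the interval "starts with the earlier checkpoint"). The names `Msg`, `has_z_path` and `in_z_cycle` are hypothetical.

```python
from collections import deque, namedtuple

# send_iv / recv_iv: index of the checkpoint that starts the interval (assumed encoding)
Msg = namedtuple("Msg", "sender send_iv receiver recv_iv")


def has_z_path(messages, i, x, j, y):
    """True if a Z-path exists from checkpoint (i, x) to checkpoint (j, y)."""
    if i == j and x < y:                      # condition 1
        return True
    # Condition 2: m0 must be sent by process i after checkpoint x.
    start = [m for m in messages if m.sender == i and m.send_iv >= x]
    seen, queue = set(start), deque(start)
    while queue:
        m = queue.popleft()
        if m.receiver == j and m.recv_iv < y:  # delivered before checkpoint y
            return True
        for nxt in messages:
            # The next message is sent by the receiver of m, in the same
            # interval as the delivery of m or in a later one.
            if nxt not in seen and nxt.sender == m.receiver and nxt.send_iv >= m.recv_iv:
                seen.add(nxt)
                queue.append(nxt)
    return False


def in_z_cycle(messages, i, x):
    """A checkpoint is in a Z-cycle if a message-based Z-path leads back to it."""
    return has_z_path(messages, i, x, i, x)


# Hypothetical pattern: P0 sends after its checkpoint 1; P1 delivers that message
# in interval 0 and, in the same interval, sends a message that P0 delivers
# before its checkpoint 1.
example = [Msg(0, 1, 1, 0), Msg(1, 0, 0, 0)]
cycle_found = in_z_cycle(example, 0, 1)
```

In this hypothetical pattern, checkpoint 1 of P0 lies on a Z-cycle, so by the result above it cannot belong to any consistent global checkpoint.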

Types of Rollback Recovery Protocols

[Figure: Rollback recovery divides into checkpoint-based and log-based protocols]

Checkpoint-based and Log-based Recovery Protocols

• Checkpoint-based rollback recovery protocols, a.k.a. checkpointing protocols, rely only on checkpointing to achieve fault-tolerance.

• Log-based rollback recovery protocols, a.k.a. logging protocols, combine checkpointing with logging of non-deterministic events.

Checkpoint-Based Protocols

• Rely only on checkpointing to achieve fault-tolerance.

– Upon a failure, strive to restore the system to the most recent consistent set of checkpoints (a.k.a. the recovery line).

• The checkpointing protocols differ in the amount of cooperation between processes.

Classification of Checkpoint-based Protocols

[Figure: Checkpoint-based protocols divide into uncoordinated, communication-induced, and coordinated]


1. Uncoordinated checkpointing – each process takes its checkpoints independently

2. Coordinated checkpointing – processes coordinate their checkpoints in order to save a system-wide consistent state

3. Communication-induced checkpointing – forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.

Uncoordinated Checkpointing

• A.k.a. independent checkpointing

• A process decides when to take a checkpoint independently of other processes.

– It chooses the most convenient time, for example, when the amount of state information is small.

• The processes record dependencies among the checkpoints during the failure-free execution, in order to determine a consistent global checkpoint during recovery.

• Uncoordinated checkpointing protocols inherently suffer from the domino effect

Rollback Propagation and The Domino Effect

• Upon a failure of one or more processes, the dependencies induced by messages may force some of the processes that did not fail to roll back.

– This is commonly called rollback propagation.

– If the processes have to roll back to the beginning of the computation, this is called the domino effect.

Failure of P2 causes rollback to the beginning of the computation

Monitoring the Dependencies

• Let ci,x be the xth checkpoint of process Pi. We call x the checkpoint index.

• Let Ii,x denote the interval between checkpoints ci,x-1 and ci,x. We call it the checkpoint interval.

• If process Pi at interval Ii,x sends a message m to Pj, it piggybacks the pair (i,x) on m.

• When Pj receives m during interval Ij,y, it records the dependency from Ii,x to Ij,y

– the dependency is later saved onto stable storage when Pj takes checkpoint cj,y.

Monitoring the Dependencies (2)

• The recorded dependencies are used at recovery time to calculate the recovery line. There are two methods for doing so:

– Rollback-dependency graphs

– Checkpoint graphs

Rollback-Dependency Graphs

• Consider the system at the time of a failure. Let C be the set of all the checkpoints, F the set of failure points of the failed processes, and L the set of current states of the living processes.

• Denote the current state of a process Pi (failed or living) that follows a checkpoint ci,x by ci,x+1

• A rollback-dependency graph is a graph $G(V,E)$ such that:

– $V = C \cup F \cup L$

– $E$ contains an edge from $c_{i,x}$ to $c_{j,y}$ only if either

(1) $i \neq j$, and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$, or

(2) $i = j$ and $y = x + 1$

• If there is an edge from ci,x to cj,y and a failure forces Ii,x to be rolled back, then Ij,y must also be rolled back.

– This is why it is called a “rollback-dependency graph”.

Rollback-Dependency Graphs (2)

Algorithm to discover the recovery line:

– Mark the failure points.

– Mark all the nodes reachable from the failure points.

– In each process, the latest unmarked checkpoint belongs to the recovery line.

[Figure: Rollback-dependency graph]
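
The marking procedure on a rollback-dependency graph can be sketched as a plain reachability search. This is a minimal illustration under stated assumptions: checkpoint indices start at 0 (the initial state), interval x of a process ends at its checkpoint x, and the node with index `last checkpoint + 1` stands for the current state (a failure point for failed processes). The function name and the tuple layout for messages are hypothetical.

```python
from collections import deque


def recovery_line(last_ckpt, messages, failed):
    """Compute the recovery line from a rollback-dependency graph.

    last_ckpt: {process: index of its latest stable checkpoint}
    messages:  list of (sender, send_interval, receiver, recv_interval)
    failed:    set of failed processes
    Returns {process: node index to restart from}; index last_ckpt[p] + 1
    means the process keeps its current state.
    """
    edges = {}
    for p, k in last_ckpt.items():
        for x in range(k + 1):                       # edge c(p,x) -> c(p,x+1)
            edges.setdefault((p, x), set()).add((p, x + 1))
    for s, x, r, y in messages:                      # edge c(s,x) -> c(r,y)
        edges.setdefault((s, x), set()).add((r, y))

    marked = {(p, last_ckpt[p] + 1) for p in failed}  # mark the failure points
    queue = deque(marked)
    while queue:                                     # mark everything reachable
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)

    # The latest unmarked node of each process belongs to the recovery line.
    return {p: max(x for x in range(k + 2) if (p, x) not in marked)
            for p, k in last_ckpt.items()}


# Hypothetical domino scenario with two processes; P0 fails.
domino = recovery_line({0: 1, 1: 1}, [(0, 2, 1, 1), (1, 1, 0, 1)], failed={0})
```

In the hypothetical scenario, the two crossing messages force both processes back to their initial checkpoints, which is the domino effect in miniature.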

Checkpoint Graphs

• Checkpoint graphs are similar to rollback-dependency graphs, except:

– when a message is sent from $I_{i,x}$ and received in $I_{j,y}$, a directed edge is drawn from $c_{i,x-1}$ to $c_{j,y}$ (instead of from $c_{i,x}$ to $c_{j,y}$).

– failure points are not included in $V$ ($V = C \cup L$)

Rollback-dependency graph: $c_{i,x} \rightarrow c_{j,y}$

Checkpoint graph: $c_{i,x-1} \rightarrow c_{j,y}$

Checkpoint Graphs (2)

[Figure: Checkpoint graph]

• The checkpoint graph represents the happened-before relationship between the checkpoints.

• The recovery line is calculated by the rollback propagation algorithm, which at each step rolls back the processes according to the recorded dependencies.

Rollback Propagation Algorithm

include the last checkpoint of each failed process as an element in set RootSet;
include the current state of each surviving process as an element in RootSet;
mark all checkpoints reachable by following at least one edge from any member of RootSet;
while (at least one member of RootSet is marked)
    replace each marked element in RootSet by the last unmarked checkpoint of the same process;
    mark all checkpoints reachable by following at least one edge from any member of RootSet
end

RootSet is the recovery line.
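
The rollback propagation algorithm can be written out as runnable code. The sketch below is one possible rendering, not a definitive implementation; it assumes checkpoint indices start at 0, that interval x of a process ends at its checkpoint x, and that the current state of a surviving process is the node `last checkpoint + 1`. The function name and the message tuple layout are hypothetical.

```python
def rollback_propagation(last_ckpt, messages, failed):
    """Compute the recovery line on a checkpoint graph.

    last_ckpt: {process: index of its latest stable checkpoint}
    messages:  list of (sender, send_interval, receiver, recv_interval)
    failed:    set of failed processes
    """
    edges = {}
    for p, k in last_ckpt.items():
        top = k if p in failed else k + 1             # failed: no current-state node
        for x in range(top):
            edges.setdefault((p, x), set()).add((p, x + 1))
    for s, x, r, y in messages:                       # edge c(s,x-1) -> c(r,y)
        edges.setdefault((s, x - 1), set()).add((r, y))

    def reachable(roots):
        """Nodes reachable by following at least one edge from any root."""
        seen, stack = set(), list(roots)
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # RootSet: last checkpoint of failed processes, current state of survivors.
    root = {p: (k if p in failed else k + 1) for p, k in last_ckpt.items()}
    marked = reachable(root.items())
    while any((p, x) in marked for p, x in root.items()):
        for p, x in list(root.items()):
            if (p, x) in marked:
                # Replace by the last unmarked checkpoint of the same process.
                root[p] = max(c for c in range(x) if (p, c) not in marked)
        marked |= reachable(root.items())
    return root


# Same hypothetical domino scenario as a checkpoint graph; P0 fails.
line = rollback_propagation({0: 1, 1: 1}, [(0, 2, 1, 1), (1, 1, 0, 1)], failed={0})
```

The initial checkpoint of every process is never the target of an edge in this encoding, so the replacement step always finds an unmarked checkpoint.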

Rollback-Dependency Graphs vs. Checkpoint Graphs

• The rollback-dependency graph and the checkpoint graph approaches are equivalent.

– They always produce the same recovery line (as indeed they do in the example).

[Figure: Checkpoint graph and rollback-dependency graph for the same example]

Recovery

• In order to be able to calculate the recovery line some process needs to collect all the dependency data recorded by all the processes.

• A process recovering from a failure broadcasts a dependency request message

• Each process that receives a dependency request

– stops execution

– replies with the local dependency information

• Then, the initiator

– calculates the recovery line based on the received data

– broadcasts a rollback request message containing the recovery line

Recovery (2)

• A process whose current state belongs to the recovery line resumes execution.

• Otherwise, it rolls back to a checkpoint indicated by the recovery line.

[Figure: Example with processes P0, P1, P2, messages m0 to m3, and the resulting recovery line]

Garbage Collection

• In order to prevent memory overflow and reduce storage overhead only useful checkpoints should be kept.

• Any checkpoint that precedes the recovery lines for all possible combinations of process failures can be discarded.

Garbage Collection Algorithm

• Build a rollback-dependency graph as if all the processes have failed.

• Run the algorithm for discovery of the recovery line.

– The resulting recovery line is called the global recovery line.

• All the checkpoints taken before the recovery line are obsolete.

Garbage Collection Example

• As the example shows, when the global recovery line is unable to advance because of rollback propagation, a large number of non-obsolete checkpoints may need to be retained.

Disadvantages of Uncoordinated Checkpointing

• Susceptible to the domino effect

• A process may take checkpoints that will never be part of a global consistent state. Such checkpoints

– incur storage overhead

– do not advance the recovery line

• A process needs to maintain multiple checkpoints and to run garbage collection to reclaim checkpoints that are no longer needed.

• Not suitable for output commit, because output commit requires global coordination to compute the recovery line

Coordinated Checkpointing

• The processes cooperate in order to form a consistent global checkpoint.

• Only one checkpoint needs to be maintained on stable storage at all times.

– No need for garbage collection

– Reduced storage overhead

• Recovery is less complicated than in uncoordinated checkpointing.

• Expensive output commit

– A global checkpoint is needed before output can be committed to the outside world.

Preventing Dependencies

• The main purpose of coordination is to avoid dependencies between the local checkpoints belonging to the same logical global checkpoint.

• The coordinated checkpointing protocols differ in the way they prevent dependencies.

Classification of Coordinated Checkpointing Protocols

[Figure: Coordinated protocols divide into blocking, synchronized, and non-blocking]

Blocking Checkpoint Coordination

• Blocking Checkpoint Coordination is the most straightforward approach to implement coordinated checkpointing.

• A coordinator process orchestrates the checkpointing by sending a request to checkpoint to each process.

– This is not very scalable.

Blocking Checkpoint Coordination (2)

• Two checkpoints become dependent in the following case:

[Figure: checkpoint A on P0 and checkpoint B on P1]

– If P0 rolls back to A, then P1 has to roll back to a checkpoint that precedes B.

• Blocking checkpoint coordination prevents dependencies by blocking communication while the checkpointing protocol executes.

– This can result in large overhead due to blocking.

Algorithm for Blocking Checkpoint Coordination

• A coordinator takes a checkpoint and broadcasts a checkpoint request message to all processes.

• When a process receives this message, it

– stops its execution

– flushes all the communication channels

– takes a tentative checkpoint

– sends an acknowledgment message back to the coordinator.

• After the coordinator receives acknowledgments from all processes, it broadcasts a commit message

Algorithm for Blocking Checkpoint Coordination (2)

• After receiving the commit message, each process

– removes the old permanent checkpoint

– atomically makes the tentative checkpoint permanent.

– resumes execution
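
The two phases of the blocking protocol can be illustrated with a small single-threaded simulation. This is a sketch under simplifying assumptions: message passing is replaced by direct method calls, channel flushing is omitted, and the class and function names (`Process`, `coordinated_checkpoint`) are hypothetical.

```python
import copy


class Process:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = state
        self.permanent = copy.deepcopy(state)  # initial permanent checkpoint
        self.tentative = None
        self.blocked = False

    def on_checkpoint_request(self):
        """Phase 1: stop, take a tentative checkpoint, acknowledge."""
        self.blocked = True                    # no sends or receives from here on
        self.tentative = copy.deepcopy(self.state)
        return self.pid                        # acknowledgment

    def on_commit(self):
        """Phase 2: replace the old permanent checkpoint and resume."""
        self.permanent = self.tentative
        self.tentative = None
        self.blocked = False


def coordinated_checkpoint(processes):
    """Coordinator logic: commit only after every process has acknowledged."""
    acks = {p.on_checkpoint_request() for p in processes}
    if acks != {p.pid for p in processes}:
        return False
    for p in processes:
        p.on_commit()
    return True


# Hypothetical run with three processes.
procs = [Process(i, {"x": i}) for i in range(3)]
procs[1].state["x"] = 42
committed = coordinated_checkpoint(procs)
```

Because every process is blocked between its tentative checkpoint and the commit, no message can be sent after one checkpoint and received before another, which is how the dependency shown earlier is avoided.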

[Figure: Blocking checkpoint coordination among processes P0, P1, P2, with messages m0 to m5]

Non-Blocking Checkpoint Coordination

• An initiator sends all the other processes a request to take a checkpoint.

• If the channels are FIFO and reliable, the following algorithm can be used.

– The initiator takes a checkpoint and sends a checkpoint request to all processes.

– Each process takes a checkpoint upon receiving the first checkpoint request and rebroadcasts it to all processes.


Non-Blocking Checkpoint Coordination (2)

• If the channels are not FIFO the checkpointing request can be piggybacked on every post-checkpoint message.

• Note that in both cases (FIFO & non-FIFO), each checkpoint request contains the identity of its initiator along with the sequence number of the request.


Synchronized Checkpoint Clocks

• Loosely synchronized clocks can be used to trigger the local checkpointing actions at approximately the same time, instead of waiting for a request from an initiator.

• All processes take checkpoints at predefined times, according to their local clock.

• A process takes a checkpoint and blocks for a period that equals the sum of

– the maximum deviation between clocks, and

– the maximum time to detect a failure in another process in the system.

Synchronized Checkpoint Clocks (2)

• If a failure occurs in another process during checkpointing or the waiting time, the taken checkpoint is discarded and the protocol is aborted.

Synchronized Checkpoint Clocks: Optimization

• Instead of blocking during the waiting time, a process can continue execution, but include a checkpointing request in the messages it sends

• A process that receives a checkpointing request with id i starts the ith checkpointing period if it has not started it yet.

– The attached message is delivered only after the process has performed the ith checkpoint.

Minimal Checkpointing Coordination

• If all processes participate in every checkpoint, the system does not scale.

– The number of processes involved in the checkpoint needs to be reduced.

• Observation: only those processes that have communicated with the initiator of the current checkpoint, either directly or indirectly since the last checkpoint, need to take new checkpoints

Algorithm for Minimal Checkpointing Coordination

• The checkpoint initiator identifies all processes with which it has communicated since the last checkpoint and sends them a request.

• Upon receiving the request, each process in turn identifies all processes it has communicated with since its last checkpoint and sends them a request, and so on, until no more processes can be identified.

– The checkpoint request is distributed hierarchically, instead of by a single initiator.

– The rest of the protocol is done according to either blocking or non-blocking approach
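
The set of processes that must take part is the transitive closure of "has communicated with since the last checkpoint", starting from the initiator. A minimal sketch, assuming this communication information is available as a dictionary (in a real protocol each process would only know its own entry, and the request would travel in messages); the function name and the dictionary layout are hypothetical:

```python
def checkpoint_participants(initiator, talked_to):
    """Return every process that must checkpoint together with `initiator`.

    talked_to: {process: set of processes it has communicated with
                since its last checkpoint}
    """
    participants = {initiator}
    frontier = [initiator]
    while frontier:
        p = frontier.pop()
        for q in talked_to.get(p, ()):     # p forwards the request to q
            if q not in participants:
                participants.add(q)
                frontier.append(q)
    return participants


# Hypothetical system: P3 and P4 only talk to each other, so they are not involved.
talked_to = {
    "P0": {"P1"},
    "P1": {"P0", "P2"},
    "P2": {"P1"},
    "P3": {"P4"},
    "P4": {"P3"},
}
involved = checkpoint_participants("P0", talked_to)
```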

Communication-induced Checkpointing

• Strikes a balance between uncoordinated and coordinated checkpointing.

– Allows processes to take some checkpoints independently. These checkpoints are called local checkpoints.

– Guarantees the eventual progress of the recovery line by forcing processes to take additional checkpoints, called forced checkpoints.

Communication-induced Checkpointing (2)

• Communication-induced checkpointing piggybacks protocol-related information on each application message.

– In contrast with coordinated checkpointing, no special coordination messages are exchanged.

• The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line.

• The forced checkpoint must be taken before the application may process the contents of the message.

– This incurs high latency and overhead.

– The number of forced checkpoints needs to be reduced.

Classification of Communication-induced Protocols

[Figure: Communication-induced protocols divide into index-based and model-based]

Model-based Checkpointing

• Model-based checkpointing relies on preventing patterns of communications and checkpoints that could result in inconsistent states among the existing checkpoints.

– A model is set up to detect the possibility that such patterns could be forming within the system.

– A checkpoint is usually forced to prevent the undesirable patterns from occurring.

– The decision to force a checkpoint is made locally using the available information.

Model-based Checkpointing Algorithms

• The MSR model:

– In every checkpoint interval, all message-receiving events precede all message-sending events.

– This can be maintained by taking an additional checkpoint before every message-receiving event that is not separated from its previous message-sending event by a checkpoint.

• Another way to prevent the domino effect is to avoid rollback propagation completely by taking a checkpoint immediately after every message-sending event.
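
The MSR rule needs only one bit of local state per process: whether a message has been sent since the last checkpoint. The following is a minimal sketch of that rule; the class name `MSRProcess` and its counters are hypothetical and exist only to make the behavior observable.

```python
class MSRProcess:
    """Keeps every checkpoint interval in 'receives before sends' order."""

    def __init__(self):
        self.sent_since_checkpoint = False
        self.checkpoints = 0
        self.forced = 0

    def checkpoint(self, forced=False):
        self.checkpoints += 1
        if forced:
            self.forced += 1
        self.sent_since_checkpoint = False   # a new interval begins

    def send(self):
        self.sent_since_checkpoint = True

    def receive(self):
        # A receive after a send in the same interval would break the model,
        # so a checkpoint is forced before the message is delivered.
        if self.sent_since_checkpoint:
            self.checkpoint(forced=True)


# Hypothetical event sequence: receive, send, receive, receive.
p = MSRProcess()
p.receive()
p.send()
p.receive()   # forces a checkpoint
p.receive()   # same interval, no send since the checkpoint: nothing forced
```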

Unnecessary Checkpoints

• Model-based checkpointing usually takes more forced checkpoints than necessary.

– The model used to detect possible inconsistencies is not precise and therefore forces local checkpoints to prevent the formation of undesirable patterns that may never actually materialize.

– It is possible that two processes detect the potential for inconsistent checkpoints and independently force local checkpoints to prevent the formation of undesirable patterns that could be prevented by a single forced checkpoint.

Index-based Checkpointing

• Index-based communication-induced checkpointing works by assigning monotonically increasing indexes to global checkpoints, such that the checkpoints having the same index at different processes form a consistent state.

• The indexes are piggybacked on application messages to help receivers decide when they should force a checkpoint.

• For instance, the protocol by Briatico et al. forces a process to take a checkpoint upon receiving a message with a piggybacked index greater than the local index.
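
The index rule can be sketched in a few lines. The slide states only the trigger (a piggybacked index greater than the local index); that the forced checkpoint adopts the received index is an assumption made here so that checkpoints with equal indexes line up. The class name `IndexedProcess` and its fields are hypothetical.

```python
class IndexedProcess:
    def __init__(self):
        self.index = 0
        self.checkpoints = [0]        # indexes of the checkpoints taken so far

    def local_checkpoint(self):
        """Independent checkpoint: move to the next index."""
        self.index += 1
        self.checkpoints.append(self.index)

    def send(self, payload):
        return (self.index, payload)  # piggyback the local index

    def receive(self, message):
        piggybacked, payload = message
        if piggybacked > self.index:
            # Forced checkpoint, taken before the message is delivered.
            # Assumption: it adopts the piggybacked index.
            self.index = piggybacked
            self.checkpoints.append(self.index)
        return payload


# Hypothetical run: P takes two local checkpoints, then sends to Q.
p, q = IndexedProcess(), IndexedProcess()
p.local_checkpoint()
p.local_checkpoint()
q.receive(p.send("hello"))            # Q is forced to checkpoint with index 2
q.receive(p.send("again"))            # same index: nothing forced
```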

Detailed Classification of Checkpoint-based Protocols

[Figure: Checkpoint-based protocols: uncoordinated; communication-induced (index-based, model-based); coordinated (blocking, synchronized, non-blocking)]




Log-based Rollback Recovery

• Log-based recovery views the execution of a process as a sequence of state intervals.

– An interval starts with a non-deterministic event, such as:

    • Receipt of a message (from a process or the outside world)

    • Reading the contents of the local clock

    • Interrupt

– Execution during an interval is deterministic.

• A process that is started from the same state and is subjected to the same non-deterministic event yields the same output.

Logging Protocols and the PWD assumption

• Log-based recovery assumes that

– all non-deterministic events that a process executes can be identified, and that

– the information necessary to replay each event during recovery can be logged.

• Such information is called a determinant of the event.

• Together, these conditions constitute the piecewise deterministic (PWD) assumption.

Logging Protocols and the PWD assumption (2)

• If the PWD assumption holds, log-based rollback-recovery protocols can recover a failed process and replay its execution exactly as it occurred before the failure. Therefore, they are:

– useful when interactions with the outside world are frequent, because they eliminate the need to take expensive checkpoints before sending output.

– generally not susceptible to the domino effect, thereby allowing processes to use uncoordinated checkpointing if desired.

Log-based Rollback Recovery (2)

• During failure-free operation, each process

– logs the determinants of all the non-deterministic events that it observes to stable storage

– periodically takes checkpoints to limit the amount of work during recovery.

• The pre-failure execution of a failed process can be reconstructed during recovery up to the first non-deterministic event whose determinant is not logged.

• The system must guarantee that upon recovery of all failed processes, there are no orphan processes.
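
Replay from a checkpoint plus a determinant log can be sketched as follows. This is an illustration under simplifying assumptions: the only non-deterministic events are message deliveries, a determinant is just the delivered value tagged with a sequence number, and "stable storage" is an ordinary dictionary. The names `step` and `LoggedProcess` are hypothetical.

```python
def step(state, msg):
    """Deterministic execution inside one state interval (arbitrary example)."""
    return (state * 31 + msg) % 1_000_003


class LoggedProcess:
    def __init__(self):
        self.state, self.seq = 0, 0
        self.checkpoint = (0, 0)     # (state, seq) saved on stable storage
        self.stable_log = {}         # seq -> determinant, on stable storage

    def deliver(self, msg, logged=True):
        self.seq += 1
        if logged:                   # logged=False models a lost determinant
            self.stable_log[self.seq] = msg
        self.state = step(self.state, msg)

    def take_checkpoint(self):
        self.checkpoint = (self.state, self.seq)
        self.stable_log.clear()      # older determinants are no longer needed

    def recover(self):
        """Replay logged events until the first missing determinant."""
        state, seq = self.checkpoint
        while seq + 1 in self.stable_log:
            seq += 1
            state = step(state, self.stable_log[seq])
        self.state, self.seq = state, seq
        return seq


# Hypothetical run: all determinants logged, so the pre-failure state is rebuilt.
p = LoggedProcess()
p.deliver(5)
p.take_checkpoint()
p.deliver(7)
p.deliver(9)
before_failure = p.state
p.state = None                       # the failure wipes volatile state
p.recover()
```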

Logging Protocols: Recoverable and Stable States

• A state interval is recoverable if there is sufficient information to replay the execution up to that state interval despite any future failures in the system.

• Also, a state interval is stable if the determinant of the non-deterministic event that started it is logged on stable storage.

• A recoverable state interval is always stable, but the opposite is not always true.

Message Logging Example

• Assume that processes P1 and P2 fail before logging the determinants corresponding to the deliveries of m6 and m5, respectively. Then:

– m7 is an orphan message

– P0 is an orphan process

• States X, Y and Z form the maximum recoverable state, i.e., the most recent recoverable consistent system state.

Classification of Log-based Protocols

[Figure: Log-based protocols divide into pessimistic, causal, and optimistic]


There are three main types of logging protocols, depending on when the determinants are logged to stable storage.

1. pessimistic logging – the application blocks waiting for the determinant of each non-deterministic event to be stored on stable storage before the effects become visible.

2. optimistic logging – the application does not block, and determinants are spooled to stable storage asynchronously.

3. causal logging – a balance between optimistic and pessimistic logging.

Types of Logging Protocols and Orphan Processes

• Pessimistic logging guarantees that orphan processes are never created due to a failure.

– Simplifies recovery, garbage collection and output commit, at the expense of higher failure-free performance overhead.

• Optimistic logging reduces the failure-free performance overhead, but allows orphan processes to be created due to failures.

– The possibility of having orphans complicates recovery, garbage collection and output commit.

• Causal logging attempts to combine the advantages of low performance overhead and fast output commit.

– May require complex recovery and garbage collection.

The No-Orphans Consistency Condition

• Let e be a non-deterministic event that occurs at process p. We define:

• Depend(e) – the set of processes that are affected by a non-deterministic event e. This set consists of p, and any process whose state depends on the event e according to Lamport’s happened-before relation.

• Log(e) – the set of processes that have logged a copy of e’s determinant in their volatile memory.

• Stable(e) – a predicate that is true if e’s determinant is logged on stable storage.

The No-Orphans Consistency Condition (2)

• When a subset of processes fails, a surviving process that depends on an event e is not an orphan if:

$\forall e: \neg \mathit{Stable}(e) \Rightarrow \mathit{Depend}(e) \subseteq \mathit{Log}(e)$

• This property is called the always-no-orphans condition. It stipulates that if any surviving process depends on an event e, then either

– the event is logged on stable storage, or

– that process has a copy of the determinant of event e.

• If neither condition is true, then the process is an orphan, because it depends on an event e that cannot be regenerated during recovery since its determinant has been lost.
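
The condition is a statement about three sets per event, so it can be checked directly. In the sketch below each event is a dictionary holding Depend(e), Log(e) and Stable(e); this representation and the function names are assumptions made for illustration. The second function applies the orphan definition after a failure: a determinant is lost when it is not stable and every process in Log(e) has failed.

```python
def always_no_orphans(events):
    """Check: for every e, not Stable(e) implies Depend(e) is a subset of Log(e)."""
    return all(e["stable"] or e["depend"] <= e["log"] for e in events)


def orphans_after_failure(events, failed):
    """Surviving processes that depend on an event whose determinant is lost."""
    orphans = set()
    for e in events:
        determinant_lost = not e["stable"] and e["log"] <= failed
        if determinant_lost:
            orphans |= e["depend"] - failed
    return orphans


# Hypothetical events. e1 is on stable storage; e2 is only in P1's volatile
# log although P0 already depends on it.
e1 = {"depend": {"P1", "P2"}, "log": {"P1"}, "stable": True}
e2 = {"depend": {"P0", "P1"}, "log": {"P1"}, "stable": False}

condition_holds = always_no_orphans([e1, e2])
orphans = orphans_after_failure([e1, e2], failed={"P1"})
```

In this hypothetical run the condition is violated by e2, and a failure of P1 indeed leaves P0 as an orphan.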




Pessimistic Logging

• The determinant of each non-deterministic event is logged before it can affect the computation.

• In their most straightforward form, pessimistic protocols log each determinant to stable storage before the event is allowed to affect the computation.

– This approach is called synchronous logging.

• This incurs significant performance overhead during failure-free execution.

Pessimistic Logging Example

• During failure-free operation the logs of processes P0, P1 and P2 contain the determinants needed to replay messages {m0, m4, m7}, {m1, m3, m6} and {m2, m5}, respectively.

Advantages of Pessimistic Logging

1. Processes can commit output to the outside world without running a special protocol.

2. The frequency of checkpoints can be determined by trading off the desired runtime performance with the desired protection of the on-going execution.

3. Functioning processes that are not affected by failures continue to operate and never become orphans.

– This is highly desirable in practical systems.

4. Older checkpoints and determinants of non-deterministic events that occurred before the most recent checkpoint can be discarded.

Hardware Techniques for Reducing Performance Overhead

• The performance overhead of synchronous logging can be lowered by using special hardware.

• Examples:

– a fast non-volatile semiconductor memory to implement stable storage

    • improves performance by orders of magnitude.

– a special bus to guarantee atomic logging of all messages exchanged in the system.

    • ensures that the log of one machine is automatically stored on a designated backup without blocking the execution of the application program.

    • requires that all non-deterministic events be converted into external messages.

Sender-Based Message Logging

• Some pessimistic logging systems reduce the overhead of synchronous logging without relying on hardware.

• For example, Sender-Based Message Logging (SBML):

– Keeps the determinants corresponding to the delivery of each message m in the volatile memory of its sender.

– The determinant of m, which consists of its content and the order in which it was delivered, is logged in several steps:

    • Before sending m, the sender logs its content in volatile memory.

    • The receiver of m responds with an acknowledgment that includes the order in which the message was delivered.

    • The sender adds the ordering information to the determinant.

• SBML tolerates only one failure and cannot handle non-deterministic events internal to a process.
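The three logging steps above can be sketched as follows. This is a minimal illustration, assuming in-memory Python objects stand in for processes and that the acknowledgment arrives synchronously; the names `SbmlSender`, `SbmlReceiver`, `rsn` and `replay_order` are hypothetical and not taken from any SBML implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SbmlReceiver:
    """Hypothetical receiver that numbers messages in delivery order."""
    next_rsn: int = 0
    delivered: list = field(default_factory=list)

    def deliver(self, msg_id, content):
        rsn = self.next_rsn            # order in which this message is delivered
        self.next_rsn += 1
        self.delivered.append(content)
        return msg_id, rsn             # acknowledgment carrying the delivery order


@dataclass
class SbmlSender:
    """Hypothetical sender that keeps determinants in volatile memory."""
    log: dict = field(default_factory=dict)   # msg_id -> determinant
    next_id: int = 0

    def send(self, receiver, content):
        msg_id = self.next_id
        self.next_id += 1
        # Step 1: log the content before sending; the order is not yet known.
        self.log[msg_id] = {"content": content, "rsn": None}
        # Step 2: the receiver delivers and acknowledges with the delivery order.
        ack_id, rsn = receiver.deliver(msg_id, content)
        # Step 3: complete the determinant with the ordering information.
        self.log[ack_id]["rsn"] = rsn
        return msg_id

    def replay_order(self):
        """Logged contents sorted by the order the receiver delivered them."""
        entries = sorted(self.log.values(), key=lambda d: d["rsn"])
        return [d["content"] for d in entries]
```

Because the log lives only in the sender's volatile memory, it is lost if the sender also fails, which is consistent with SBML tolerating only one failure.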

Relaxing Logging Atomicity

• m2 and m4 are allowed to affect P2 before they are logged, but must be logged before m6 is sent.

• The performance overhead of pessimistic logging can be reduced by delivering a message or an event while deferring its logging until the host communicates with another host or with the outside world.
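Deferred logging can be sketched as follows. This is an illustration only, assuming a Python list stands in for stable storage; the class `DeferredLoggingHost` and its methods are hypothetical names.

```python
class DeferredLoggingHost:
    """Hypothetical host that delivers first and logs before it communicates."""

    def __init__(self):
        self.pending = []   # determinants delivered but not yet logged
        self.stable = []    # stands in for stable storage

    def deliver(self, determinant):
        # The event takes effect immediately; its logging is deferred.
        self.pending.append(determinant)

    def flush(self):
        # Write all pending determinants to stable storage in delivery order.
        self.stable.extend(self.pending)
        self.pending.clear()

    def send(self, message):
        # Logging must complete before this host can affect another host
        # or the outside world.
        self.flush()
        return message
```

The cost of writing to stable storage is paid once per outgoing communication rather than once per delivered event.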

Relaxing Logging Atomicity (2)

• Systems that separate logging of an event from its delivery may lose the last messages delivered before a failure.

– This may be a problem for applications that assume that processes communicate through reliable channels.

– This problem does not arise in protocols that log messages at the sender or do not assume reliable communication channels.




Optimistic Logging

• Processes log determinants asynchronously to stable storage.

• Determinants are kept in a volatile log, which is periodically flushed to stable storage.

– No need to block waiting for the determinants to be written to stable storage

– Temporary creation of orphan processes is permitted

• Needs garbage collection

– multiple checkpoints may be kept

• Slower output commit

– requires coordination to ensure no failure revokes output

• More complicated recovery

– has two flavors: synchronous recovery and asynchronous recovery
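The volatile log and its periodic flush can be sketched as follows. This is a minimal illustration, assuming Python lists stand in for volatile memory and stable storage; `OptimisticLog` and its method names are hypothetical.

```python
class OptimisticLog:
    """Hypothetical determinant log that is flushed asynchronously."""

    def __init__(self):
        self.volatile = []   # determinants not yet on stable storage
        self.stable = []     # stands in for stable storage

    def record(self, determinant):
        # Non-blocking: the process continues without waiting for a write.
        self.volatile.append(determinant)

    def flush(self):
        # Called periodically, not on every event.
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash(self):
        # A failure loses whatever was still in the volatile log.
        lost = list(self.volatile)
        self.volatile.clear()
        return lost
```

The determinants returned by `crash` are exactly the ones that cannot be replayed, so any process that depends on them becomes an orphan.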

Optimistic Logging and Orphan Processes

• If a process fails, the determinants in its volatile log are lost and some state intervals cannot be recovered.

• If the failed process sent a message during any of the lost intervals the receiver of the message becomes an orphan process.

• When recovery is complete there are no orphan processes.

– The orphan processes are rolled back until their states do not depend on any message whose determinant has been lost.
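The rollback rule for orphans can be sketched as follows. This is an illustration under stated assumptions: each state interval carries the full (transitive, cumulative) set of intervals it depends on, and the function name `latest_non_orphan` is hypothetical.

```python
def latest_non_orphan(intervals, lost):
    """Index of the last state interval that depends on no lost interval.

    intervals: list of sets; intervals[k] holds the (process, index) pairs
               that state interval k depends on. Dependencies are assumed
               cumulative, so later intervals include earlier ones.
    lost:      set of (process, index) pairs whose determinants were lost.
    Returns -1 if every interval is orphaned.
    """
    last_ok = -1
    for k, deps in enumerate(intervals):
        if deps & lost:
            break            # this interval and all later ones are orphaned
        last_ok = k
    return last_ok
```

For example, a process whose third interval depends on a lost interval of P2 is rolled back to its second interval.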

Optimistic Logging Example

• Note that the processes keep multiple checkpoints

– non-trivial garbage collection is needed.

• If P0 wants to commit output in state X, it must:

– log m4 and m5 to stable storage

– ask P2 to log m2 and m5 to stable storage

• Suppose P2 fails before the determinant for m5 is logged to stable storage.

P1 becomes an orphan and needs to roll back to B, which forces P0 to roll back to A.

Synchronous Recovery and Dependency Tracking

• During synchronous recovery all processes run a recovery protocol to compute the maximum recoverable state based on:

– Logged determinants and checkpoints

– Dependency information gathered during the failure-free execution.

• There are two approaches to dependency tracking: direct and transitive

– In both, during failure-free execution, each process increments a state interval index at the beginning of each state interval.

Direct Dependency Tracking

• The state interval index is piggybacked on each outgoing message.

• The receiver records the dependency directly caused by the message.

• These direct dependencies are assembled at recovery time to obtain complete dependency information.
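Assembling direct dependencies at recovery time amounts to a graph traversal. The following is a sketch, assuming the recorded dependencies have been gathered into one dictionary; the function name `assemble_dependencies` and the `(process, index)` encoding of state intervals are hypothetical.

```python
from collections import deque


def assemble_dependencies(direct, start):
    """Collect every state interval that `start` depends on.

    direct: dict mapping a state interval (process, index) to the set of
            intervals it directly depends on, as recorded by receivers.
    start:  the state interval whose full dependency set is wanted.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in direct.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)   # follow the dependency chain further
    return seen
```

The failure-free cost is a single index per message; the traversal is paid only when recovery is needed.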

Transitive Dependency Tracking

• Each process Pi maintains a size-N vector TDi, where:

– TDi[i] is Pi’s current state interval index

– TDi[j], j ≠ i, is the highest index of any state interval of Pj on which Pi depends.

• TDi is sent in each outgoing message and is updated on each receipt of a message.

• Each interval of Pi is associated with a vector timestamp.

– Two intervals are dependent if their vectors are comparable, i.e. all entries of one vector are not bigger than the corresponding entries of the other

• Disadvantage: generally incurs a higher failure-free overhead for piggybacking and maintaining the dependency vectors

• Advantage: allows faster output commit and recovery.
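The vector maintenance can be sketched as follows. The slides do not spell out the update rule, so this assumes the usual one: each delivery starts a new state interval, and every other entry takes the component-wise maximum with the piggybacked vector. The class name `TdProcess` is hypothetical.

```python
class TdProcess:
    """Hypothetical process maintaining a transitive dependency vector."""

    def __init__(self, pid, n):
        self.pid = pid
        self.td = [0] * n    # td[pid] is this process's own interval index

    def send(self):
        # A copy of the vector is piggybacked on each outgoing message.
        return list(self.td)

    def receive(self, piggybacked):
        # Assumption: delivering a message starts a new state interval.
        self.td[self.pid] += 1
        for j, index in enumerate(piggybacked):
            if j != self.pid:
                # Inherit the sender's dependencies transitively.
                self.td[j] = max(self.td[j], index)
```

With three processes, a chain of messages P0 to P1, P1 to P2, P2 to P0 leaves P0 depending on interval 1 of both P1 and P2, even though P1 never sent to P0 directly.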

Asynchronous Recovery

• In asynchronous recovery, a failed process restarts by broadcasting a rollback announcement.

• If upon receiving a rollback announcement a process detects that it has become an orphan with respect to that announcement it

– rolls back

– broadcasts its own rollback announcement.
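The reaction to a rollback announcement can be sketched as follows. This is a simplified illustration: it assumes each process holds a dependency vector as in transitive tracking, that an announcement names the sender and its last recoverable state interval, and that the process restores a checkpoint without replaying logged messages. The function name `handle_announcement` is hypothetical.

```python
def handle_announcement(td, checkpoints, sender, recoverable):
    """Return (new_td, rolled_back) after a rollback announcement.

    td:          current dependency vector of the receiving process.
    checkpoints: saved dependency vectors, oldest first.
    sender:      index of the process that announced the rollback.
    recoverable: highest state interval index the sender could recover.
    """
    if td[sender] <= recoverable:
        return td, False                 # not an orphan: nothing to do
    for saved in reversed(checkpoints):
        if saved[sender] <= recoverable:
            return list(saved), True     # newest checkpoint that is not orphaned
    raise ValueError("no checkpoint free of the lost dependency")
```

When `rolled_back` is true, the caller would then broadcast its own rollback announcement.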

Incarnation Tracking

• When a process restarts execution from a checkpoint, we will say that it starts a new incarnation.

• Multiple incarnations of a process may coexist in the system with asynchronous recovery

– each process needs to track the dependency of its state on every incarnation of all processes to correctly detect orphaned states.

• Dependency tracking can be limited to a single incarnation of each process by forcing a process Pi to delay delivery of messages carrying a dependency on an unknown incarnation of a process Pj, until Pi receives all the preceding rollback announcements from Pj.
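The delayed-delivery rule can be sketched as follows. This illustration assumes incarnations are numbered from 0, that each message carries the incarnation numbers it depends on, and that rollback announcements from one process arrive in order; it does not check whether a delivered message is itself orphaned. `IncarnationFilter` and the message layout are hypothetical.

```python
class IncarnationFilter:
    """Hypothetical filter that holds back messages from unknown incarnations."""

    def __init__(self):
        self.known = {}      # process -> highest incarnation announced so far
        self.delayed = []    # messages waiting for a rollback announcement

    def _deliverable(self, msg):
        return all(inc <= self.known.get(proc, 0)
                   for proc, inc in msg["deps"].items())

    def on_message(self, msg):
        # Deliver only if every incarnation the message depends on is known.
        if self._deliverable(msg):
            return [msg]
        self.delayed.append(msg)
        return []

    def on_announcement(self, process, incarnation):
        # A rollback announcement makes a new incarnation known.
        self.known[process] = max(self.known.get(process, 0), incarnation)
        ready = [m for m in self.delayed if self._deliverable(m)]
        self.delayed = [m for m in self.delayed if not self._deliverable(m)]
        return ready
```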

Exponential Rollbacks

• In asynchronous recovery protocols a single failure can cause another process to roll back an exponential number of times.

– This is known as the exponential rollbacks phenomenon.

Pi rolls back 2^(i-1) times

Dealing with Exponential Rollbacks

• Several ways have been proposed:

– Distinguish failure announcements from rollback announcements and broadcast only the former

– Piggyback the original rollback announcement from the failed process on every subsequent rollback announcement that it triggers.

– Piggyback all rollback announcements on every application message




Causal Logging

• Ensures the always-no-orphans property by requiring that the determinant of each non-deterministic event that precedes the state of a process, according to Lamport’s happened-before relation, is either stable or available locally to that process.

• The determinant of each of these events contains the order in which its original receiver delivered the corresponding message.

• The message sender, as in sender-based message logging, logs the message content.

Causal Logging Example

• Process P0 “guides” the recovery of P1 and P2 since it knows the order in which P1 should replay receipt of m1 and m3.

• The contents of m1 are obtained from the sender log of P0. The contents of m3 are deterministically regenerated during the recovery of P1 and P2.

In state X the determinants of m0, m1, m2, m3 and m4 are either on stable storage or in volatile memory in P0.

Messages m5 and m6 may be lost upon the failure.

Advantages of Causal Logging

• Causal Logging has the failure-free performance advantages of optimistic logging while retaining most of the advantages of pessimistic logging

– avoids synchronous access to stable storage except during output commit.

– allows each process to commit output independently

• the sender processor simply needs to save its log to stable storage

– never creates orphans

– limits the rollback of any failed process to the most recent checkpoint on stable storage.

– reduces the storage overhead and the amount of work at risk.

• The above advantages come at the expense of a more complex recovery protocol.

Tracking Causality

• Processes piggyback the non-stable determinants in their volatile log on the messages they send to other processes.

• On receiving a message, a process first adds any piggybacked determinant to its volatile determinant log and then delivers the message to the application.

• The determinants are stored and sent in the form of an antecedence graph.
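The piggybacking rule can be sketched as follows. For brevity this illustration keeps determinants in a flat dictionary rather than an antecedence graph, and assumes the caller supplies an event id and determinant for each delivery; `CausalProcess` and its methods are hypothetical names.

```python
class CausalProcess:
    """Hypothetical process that piggybacks non-stable determinants."""

    def __init__(self, name):
        self.name = name
        self.volatile = {}    # event id -> determinant, not yet known stable
        self.inbox = []       # payloads delivered to the application

    def send(self, payload):
        # Piggyback every determinant that is not yet known to be stable.
        return {"payload": payload, "determinants": dict(self.volatile)}

    def receive(self, message, event_id, determinant):
        # First merge the piggybacked determinants into the volatile log ...
        self.volatile.update(message["determinants"])
        # ... record the determinant of this delivery ...
        self.volatile[event_id] = determinant
        # ... and only then deliver the message to the application.
        self.inbox.append(message["payload"])

    def mark_stable(self, event_ids):
        # Determinants known to be on stable storage need no more piggybacking.
        for event_id in event_ids:
            self.volatile.pop(event_id, None)
```

After a chain P0 to P1 to P2, P2 holds the determinants of both deliveries, so they survive a failure of P1.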

Antecedence Graph

• The antecedence graph of a process P is a directed graph G(V,E) such that:

– V is the set of non-deterministic events that precede P’s current state (according to happened-before)

– E contains an edge v → u if and only if v precedes u (according to happened-before)

Antecedence Graph Example

Efficient Transmission of Antecedence Graphs

• Carrying the entire graph on each application message is unacceptable.

• Solution: any message between processes p and q carries only the difference between the sender's current graph and the graph piggybacked on the previous message exchanged between them.

• Furthermore, if p has recently received a message from q, it can exclude the graph portions that have been piggybacked on that message.

• This technique has low overhead in practice
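Sending only the difference can be sketched as follows. This is an illustration, assuming a graph is represented as a pair of Python sets (vertices, edges) and that the sender remembers what the peer already holds; the function names `graph_delta` and `graph_merge` are hypothetical.

```python
def graph_delta(current, known_to_peer):
    """Part of `current` that the peer has not yet seen.

    Both arguments are (vertices, edges) pairs of sets.
    """
    vertices, edges = current
    known_vertices, known_edges = known_to_peer
    return vertices - known_vertices, edges - known_edges


def graph_merge(graph, piece):
    """Graph obtained by adding a received piece to a local graph."""
    return graph[0] | piece[0], graph[1] | piece[1]
```

The receiver reconstructs the full graph by merging each piece into what it already holds, so the size of the piggybacked data tracks the number of new events rather than the size of the whole graph.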

Family Based Logging

• Further reduction of the overhead is possible if the system is willing to tolerate a number of failures that is less than N.

• Family Based Logging (FBL) protocols are parameterized by the number of tolerated failures f.

– For f < N, each non-deterministic event is logged in the volatile store of f + 1 different hosts.

• Propagation of information about an event stops when it has been recorded in f + 1 processes.

– Sender-based logging is used to support message replay during recovery, and determinants are piggybacked on application messages.
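The stopping rule for propagation can be sketched as follows. This illustration assumes each determinant tracks the set of hosts known to have logged it; the class `FblDeterminant` and the function `to_piggyback` are hypothetical names.

```python
class FblDeterminant:
    """Hypothetical determinant that records which hosts hold a copy."""

    def __init__(self, event_id, origin):
        self.event_id = event_id
        self.logged_at = {origin}    # hosts known to have logged this event


def to_piggyback(determinants, f):
    """Determinants that still need propagation to tolerate f failures."""
    # Stop once f + 1 hosts hold a copy: f failures cannot destroy all of them.
    return [d for d in determinants if len(d.logged_at) < f + 1]
```

With f = 1, a determinant stops being piggybacked as soon as one host besides its origin has logged it.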

Family Based Logging (2)

• FBL protocols do not access stable storage except for checkpointing.

– Reducing access to stable storage in turn reduces performance overhead and implementation complexity.

• An implementation for the protocol with f = 1 confirms that the performance overhead is very small.

• The described causal logging protocol is an FBL protocol corresponding to the case of f = N.

Detailed Classification of Rollback Recovery Protocols

[Figure: classification tree. Recoverable node labels: Uncoordinated, Coordinated, Blocking, Non-blocking, Synchronized, Index-based, Model-based, Log-based, Pessimistic, Optimistic, Causal]


Comparison
