Distributed Systems, CS 15-440
Fault Tolerance - Part III
Lecture 19, Nov 25, 2013
Mohammad Hammoud
Today…
Last Two Sessions:
Fault Tolerance - Part II: Reliable request-reply communication
Quiz 2
Today’s Session:
Fault Tolerance - Part III: Reliable group communication, atomicity, recovery
Announcements:
Quiz 2 grades are out
PS4 (the last assignment) is due on Dec 2, 2013 by 11:59PM
P4 (the last project) is due on Dec 5, 2013 by 11:59PM
Final Exam is on Sunday, Dec 8, 2013 at 9:00AM, Room 2051 (all topics are included; open book, open notes)
Objectives
Discussion on Fault Tolerance
General background on fault tolerance
Process resilience, failure detection and reliable communication
Atomicity and distributed commit protocols
Recovery from failures
Reliable Communication
Reliable Request-Reply Communication
Reliable Group Communication
Reliable Group Communication
Just as we considered reliable request-reply communication, we also need to consider reliable multicasting services
E.g., Election algorithms use multicasting schemes
Reliable Group Communication
A Basic Reliable-Multicasting Scheme
Atomic Multicasting
Reliable Multicasting
Reliable multicasting means that a message sent to a group of processes should be delivered to each member of that group
A distinction should be made between:
Reliable communication in the presence of faulty processes
Reliable communication when processes are assumed to operate correctly
In the presence of faulty processes, multicasting is considered reliable when it can be guaranteed that all non-faulty group members receive the message
Basic Reliable Multicasting Questions
What happens if, during multicasting, a process P joins or leaves a group?
Should the sent message be delivered?
Should P (if joining) also receive the message?
What happens if the (sending) process crashes during multicasting?
What about message ordering?
A Simple Case: Reliable Multicasting with Feedback Messages
Consider the case when a single sender S wants to multicast a message to multiple receivers
A message multicast by S may be lost partway and delivered to some, but not all, of the intended receivers
Assume that messages are received in the same order as they are sent
Reliable Multicasting with Feedback Messages
[Figure: The sender, which keeps a history buffer, multicasts M25 over the network to four receivers whose Last values are 24, 24, 23, and 24. The receivers with Last = 24 return ACK25; the receiver with Last = 23 reports Missed 24.]
An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004)
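The feedback scheme in the figure can be sketched in a few lines of Python. This is a minimal, single-threaded sketch under stated assumptions: the names `Receiver`, `Sender`, `history`, and the `("ACK", n)` / `("MISSED", n)` reply tuples are hypothetical and do not come from the lecture, the network is replaced by direct method calls, and the sender is assumed to still hold every message a receiver may report as missing.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    last: int                                  # highest sequence number delivered so far
    delivered: list = field(default_factory=list)

    def receive(self, seq, payload):
        """Return ("ACK", seq) after in-order delivery, or ("MISSED", n) on a gap."""
        if seq == self.last + 1:
            self.last = seq
            self.delivered.append(payload)
            return ("ACK", seq)
        if seq <= self.last:
            return ("ACK", self.last)          # duplicate: acknowledge again, do not redeliver
        return ("MISSED", self.last + 1)       # gap: report the first missing message


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.history = {}                      # history buffer: seq -> payload

    def multicast(self, seq, payload):
        self.history[seq] = payload
        for receiver in self.receivers:
            kind, missing = receiver.receive(seq, payload)
            while kind == "MISSED":
                # Retransmit the missing message from the history buffer, then retry.
                receiver.receive(missing, self.history[missing])
                kind, missing = receiver.receive(seq, payload)
        # Every receiver has acknowledged seq, so the buffered messages can be dropped.
        self.history.clear()


receivers = [Receiver(24), Receiver(24), Receiver(23), Receiver(24)]
sender = Sender(receivers)
sender.history[24] = "M24"                     # assumed: M24 is still in the history buffer
sender.multicast(25, "M25")
print([r.last for r in receivers])
```

The receiver that starts with Last = 23 first reports the gap, gets M24 retransmitted from the history buffer, and only then delivers M25. A real implementation would also need timers and a way to limit the number of feedback messages, which this sketch leaves out.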
Atomic Multicast
C1: What is often needed in a distributed system is the guarantee that a message is delivered to either all processes or none at all
C2: It is also generally required that all messages are delivered in the same order to all processes
Satisfying C1 and C2 results in what we call atomic multicast
Atomic multicast:
Ensures that non-faulty processes maintain a consistent view
Forces reconciliation when a process recovers and rejoins the group
Virtual Synchrony
A multicast message m is uniquely associated with a list of processes to which it should be delivered
This delivery list corresponds to a group view (G)
In principle, the delivery of m is allowed to fail:
When a group-membership change is the result of the sender of m crashing
Accordingly, m may either be delivered to all remaining processes, or ignored by each of them
Or when a group-membership change is the result of a receiver of m crashing
Accordingly, m may be ignored by every other receiver, which corresponds to the situation in which the sender of m crashed before m was sent
A reliable multicast with this property is said to be “virtually synchronous”
The Principle of Virtual Synchrony
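[Figure: timelines of processes P1 to P4, with reliable multicast implemented by multiple point-to-point messages. Initially G = {P1, P2, P3, P4}. P3 crashes, the view becomes G = {P1, P2, P4}, and the partial multicast from P3 is discarded. P3 later rejoins, and the view becomes G = {P1, P2, P3, P4} again.]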
Message Ordering
Four different virtually synchronous multicast orderings are distinguished:
1. Unordered multicasts
2. FIFO-ordered multicasts
3. Causally-ordered multicasts
4. Totally-ordered multicasts
1. Unordered multicasts
A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes
Process P1    Process P2     Process P3
Sends m1      Receives m1    Receives m2
Sends m2      Receives m2    Receives m1
Three communicating processes in the same group
2. FIFO-Ordered Multicasts
With FIFO-Ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent
Process P1    Process P2     Process P3     Process P4
Sends m1      Receives m1    Receives m3    Sends m3
Sends m2      Receives m3    Receives m1    Sends m4
              Receives m2    Receives m2
              Receives m4    Receives m4
Four processes in the same group with two different senders.
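One way to obtain FIFO-ordered delivery is to number each sender's messages and hold back any message that arrives ahead of its predecessor. The sketch below is illustrative only: the class name `FifoDelivery`, the per-sender sequence numbers starting at 0, and the hold-back dictionary are assumptions, not something the slides prescribe.

```python
from collections import defaultdict


class FifoDelivery:
    """Delivers messages from each sender in the order in which they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.held = defaultdict(dict)      # sender -> {seq: message} held back
        self.delivered = []                # messages handed to the application

    def receive(self, sender, seq, message):
        self.held[sender][seq] = message
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1


p3 = FifoDelivery()
p3.receive("P4", 0, "m3")
p3.receive("P1", 1, "m2")   # arrives before m1 and is held back
p3.receive("P1", 0, "m1")   # releases m1 and then m2
p3.receive("P4", 1, "m4")
print(p3.delivered)
```

Messages from different senders may still interleave differently at different receivers; only the per-sender order is fixed, which is exactly what the table above allows.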
3-4. Causally-Ordered and Total-Ordered Multicasts
Causally-ordered multicasts preserve potential causality between different messages
If message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender or not, the communication layer at each receiver will always deliver m1 before m2
Total-ordered multicasts require that when messages are delivered, they are delivered in the same order to all group members (regardless of whether message delivery is unordered, FIFO-ordered, or causally-ordered)
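Causally-ordered delivery is commonly implemented with vector timestamps; the slide does not name a mechanism, so the following is a sketch under that assumption. The class `CausalDelivery` and its fields are hypothetical names. A message from process `s` with timestamp `ts` is delivered once it is the next message expected from `s` and every message it depends on from other processes has already been delivered.

```python
class CausalDelivery:
    def __init__(self, n):
        self.clock = [0] * n    # clock[k] = number of messages from process k delivered so far
        self.pending = []       # received but not yet deliverable
        self.delivered = []

    def _deliverable(self, sender, ts):
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[k] <= self.clock[k] for k in range(len(ts)) if k != sender
        )

    def receive(self, sender, ts, message):
        self.pending.append((sender, ts, message))
        progress = True
        while progress:                       # keep going while a delivery unblocks others
            progress = False
            for item in list(self.pending):
                s, t, m = item
                if self._deliverable(s, t):
                    self.pending.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(m)
                    progress = True


# Process 0 multicasts m1; process 1 delivers m1 and then multicasts m2.
# Process 2 happens to receive m2 first.
p2 = CausalDelivery(3)
p2.receive(1, [1, 1, 0], "m2")   # held back: depends on m1 from process 0
p2.receive(0, [1, 0, 0], "m1")   # delivers m1, which then releases m2
print(p2.delivered)
```

Because m1 causally precedes m2, the receiver delivers m1 first even though m2 arrived earlier and came from a different sender.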
Virtually Synchronous Reliable Multicasting
A virtually synchronous reliable multicasting that offers total-ordered delivery of messages is what we refer to as atomic multicasting
Multicast                  Basic Message Ordering     Total-Ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
Six different versions of virtually synchronous reliable multicasting
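Total-ordered delivery, the property that distinguishes the last three rows of the table, can be obtained in several ways. The sketch below uses a fixed sequencer, which is a technique chosen here for illustration and not one the slides specify; `Sequencer` and `Member` are hypothetical names, and failures of the sequencer are ignored.

```python
class Sequencer:
    """Assigns one global sequence number to every multicast message."""

    def __init__(self):
        self.next_number = 0

    def assign(self, message):
        number = self.next_number
        self.next_number += 1
        return number, message


class Member:
    """Delivers messages strictly in global sequence-number order."""

    def __init__(self):
        self.expected = 0
        self.held = {}
        self.delivered = []

    def receive(self, number, message):
        self.held[number] = message
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1


sequencer = Sequencer()
first = sequencer.assign("m1")
second = sequencer.assign("m2")

x, y = Member(), Member()
x.receive(*first)
x.receive(*second)
y.receive(*second)   # arrives out of order at y and is held back
y.receive(*first)
print(x.delivered == y.delivered)
```

Both members deliver the messages in the same order regardless of arrival order. Combining this with the membership handling of virtual synchrony is what would make the multicast atomic; that part is not shown.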
Distributed Commit
The atomic multicasting problem is an example of a more general problem, known as distributed commit
The distributed commit problem involves having an operation performed by every member of a process group, or by none at all
With reliable multicasting, the operation is the delivery of a message
With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction
Distributed commit is often established by means of a coordinator and participants
One-Phase Commit Protocol
In a simple scheme, a coordinator can tell all participants whether or not to (locally) perform the operation in question
This scheme is referred to as a one-phase commit protocol
The main drawback of the one-phase commit protocol is that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator
In practice, more sophisticated schemes are needed
The most commonly used one is the two-phase commit protocol
Two-Phase Commit Protocol
Assuming that no failures occur, the two-phase commit protocol (2PC) consists of the following two phases, each consisting of two steps:
Phase I: Voting Phase
Step 1
• The coordinator sends a VOTE_REQUEST message to all participants.
Step 2
• When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.
Phase II: Decision Phase
Step 1
• The coordinator collects all votes from the participants.
• If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants.
• However, if one participant has voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message.
Step 2
• Each participant that voted for a commit waits for the final reaction by the coordinator.
• If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction.
• Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
2PC Finite State Machines
The finite state machine for the coordinator in 2PC:
INIT -> WAIT: on Commit, send Vote-request
WAIT -> ABORT: on Vote-abort, send Global-abort
WAIT -> COMMIT: on Vote-commit, send Global-commit
The finite state machine for a participant in 2PC:
INIT -> WAIT: on Vote-request, send Vote-commit
INIT -> ABORT: on Vote-request, send Vote-abort
WAIT -> ABORT: on Global-abort, send ACK
WAIT -> COMMIT: on Global-commit, send ACK
2PC Algorithm
Actions by coordinator:
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
Two-Phase Commit Protocol
Actions by participants:
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received; /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log; }
    else if DECISION == GLOBAL_ABORT { write GLOBAL_ABORT to local log; }
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}
Two-Phase Commit Protocol
Actions for handling decision requests:
/* executed by separate thread */
while true {
    wait until any incoming DECISION_REQUEST is received; /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}
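The 2PC pseudocode can be condensed into a small runnable simulation. This is only a sketch under simplifying assumptions: everything runs in one thread, messages are replaced by method calls, a vote that never arrives is modelled by the hypothetical `reachable` flag, and the decision-request exchange between blocked participants is omitted. The names `Participant` and `run_2pc` are illustrative; the log entries reuse the message names from the slides.

```python
class Participant:
    def __init__(self, can_commit=True, reachable=True):
        self.can_commit = can_commit
        self.reachable = reachable        # False models a vote that times out
        self.log = ["INIT"]

    def on_vote_request(self):
        if not self.reachable:
            return None                   # the coordinator sees a timeout
        vote = "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"
        self.log.append(vote)             # log the vote before sending it
        return vote

    def on_decision(self, decision):
        if self.reachable:
            self.log.append(decision)


def run_2pc(participants, coordinator_commits=True):
    log = ["START_2PC"]
    # Phase I: request and collect votes; a missing vote counts as a timeout.
    votes = [p.on_vote_request() for p in participants]
    if coordinator_commits and all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)                  # log the decision before announcing it
    # Phase II: announce the decision to all participants.
    for p in participants:
        p.on_decision(decision)
    return decision, log


group = [Participant(), Participant(), Participant()]
print(run_2pc(group))
print(run_2pc([Participant(), Participant(can_commit=False)]))
```

A single VOTE_ABORT or a single missing vote is enough to force GLOBAL_ABORT, while GLOBAL_COMMIT requires a VOTE_COMMIT from every participant and from the coordinator itself.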
Recovery
So far, we have mainly concentrated on algorithms that allow us to tolerate faults
However, once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state
In what follows we focus on:
What it actually means to recover to a correct state
When and how the state of a distributed system can be recorded and recovered, by means of checkpointing and message logging
Recovery
Error Recovery
Checkpointing
Message Logging
Error Recovery
Once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state
Fundamental to fault tolerance is the recovery from an error
The idea of error recovery is to replace an erroneous state with an error-free state
There are essentially two forms of error recovery:
1. Backward recovery
2. Forward recovery
Backward Recovery
In backward recovery, the main issue is to bring the system from its present erroneous state “back” to a previously correct state
It is necessary to record the system’s state from time to time on stable storage, and to restore such a recorded state when things go wrong
Each time (part of) the system’s present state is recorded, a checkpoint is said to be made
Some problems with backward recovery:
Restoring a system or a process to a previous state is generally expensive (in terms of performance)
Some states can never be rolled back (e.g., typing rm -fr * in UNIX)
Forward Recovery
When the system detects that it has made an error, forward recovery takes the system state at the time of the error and corrects it, so that the system can move forward
Forward recovery is typically faster than backward recovery, but it requires knowing in advance which errors may occur
Some systems make use of both forward and backward recovery for different errors or different parts of one error
Why Checkpointing?
In fault-tolerant distributed systems, backward recovery requires that systems “regularly” save their states onto stable storage
This process is referred to as checkpointing
Checkpointing consists of storing a “distributed snapshot” of the current application state and later using it to restart the execution in case of a failure
Recovery Line
In capturing a distributed snapshot, if a process P has recorded the receipt of a message, m, then there should be also a process Q that has recorded the sending of m
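[Figure: timelines of processes P and Q from their initial state to a failure, with a message m sent from Q to P and several snapshots along each timeline. A pair of snapshots for which both senders and receivers of messages can be identified jointly forms a distributed snapshot and is a recovery line; a pair that records a receipt without the matching send is not a recovery line.]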
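The recovery-line condition can be written as a short check. The sketch below assumes a simplified model that is not in the slides: each process has one candidate checkpoint identified by a time on a shared clock, and each message is a hypothetical tuple `(sender, sent_at, receiver, received_at)`.

```python
def is_recovery_line(checkpoints, messages):
    """Return True if no checkpoint records a receipt whose send is not recorded.

    checkpoints: process name -> time at which its checkpoint was taken
    messages:    list of (sender, sent_at, receiver, received_at)
    """
    for sender, sent_at, receiver, received_at in messages:
        receipt_recorded = received_at <= checkpoints[receiver]
        send_recorded = sent_at <= checkpoints[sender]
        if receipt_recorded and not send_recorded:
            return False
    return True


m = [("Q", 5, "P", 6)]                                 # Q sends m at time 5, P receives it at time 6
print(is_recovery_line({"P": 7, "Q": 6}, m))           # both the send and the receipt are recorded
print(is_recovery_line({"P": 7, "Q": 4}, m))           # receipt recorded, send not recorded
```

A set of checkpoints in which the send is recorded but the receipt is not still passes this check; the message is then simply in transit at the time of the snapshot.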
Checkpointing
Checkpointing can be of two types:
1. Independent Checkpointing: each process simply records its local state from time to time in an uncoordinated fashion
2. Coordinated Checkpointing: all processes synchronize to jointly write their states to local stable storage
Which algorithm among the ones we’ve studied can be used to implement coordinated checkpointing?
A simple solution is to use 2PC
Domino Effect
Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks
With coordinated checkpointing, the saved state is automatically globally consistent; hence, the domino effect is inherently avoided
[Figure: timelines of processes P and Q with independently taken checkpoints. After a failure, each candidate pair of checkpoints turns out not to be a recovery line, so the rollback cascades.]
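The cascaded rollback behind the domino effect can be illustrated with a small search for the most recent recovery line. This is a sketch under assumptions that are not in the slides: checkpoints are times on a shared clock, every process has an initial checkpoint at time 0, all message times are greater than 0, and messages are hypothetical tuples `(sender, sent_at, receiver, received_at)`.

```python
def find_recovery_line(checkpoints, messages):
    """Roll processes back until no checkpoint records a receipt without its send.

    checkpoints: process name -> sorted list of checkpoint times, starting with 0
    messages:    list of (sender, sent_at, receiver, received_at)
    """
    chosen = {p: len(times) - 1 for p, times in checkpoints.items()}  # start at the latest
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            receiver_cp = checkpoints[receiver][chosen[receiver]]
            sender_cp = checkpoints[sender][chosen[sender]]
            if received_at <= receiver_cp and sent_at > sender_cp:
                chosen[receiver] -= 1      # roll the receiver back one checkpoint
                changed = True
    return {p: checkpoints[p][i] for p, i in chosen.items()}


# Independent checkpoints interleaved with messages in both directions.
independent = {"P": [0, 4, 8], "Q": [0, 6, 10]}
traffic = [("P", 9, "Q", 9), ("Q", 7, "P", 7), ("P", 5, "Q", 5), ("Q", 2, "P", 3)]
print(find_recovery_line(independent, traffic))
```

With these interleaved checkpoints, each rollback exposes another inconsistent pair, and both processes end up back at their initial state. If both processes instead checkpoint at the same coordinated points, for example `{"P": [0, 5], "Q": [0, 5]}` with a single message `("P", 2, "Q", 3)`, the latest checkpoints already form a recovery line.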
Why Message Logging?
Considering that checkpointing is an expensive operation, techniques have been sought to reduce the number of checkpoints, but still enable recovery
An important technique in distributed systems is message logging
The basic idea is that if transmission of messages can be replayed, we can still reach a globally consistent state, yet without having to restore that state from stable storage
In practice, the combination of having fewer checkpoints and message logging is more efficient than having to take many checkpoints
Message Logging
Message logging can be of two types:
1. Sender-based logging: A process can log its messages before sending them off
2. Receiver-based logging: A receiving process can first log an incoming message before delivering it to the application
When a sending or a receiving process crashes, it can restore the most recently checkpointed state, and from there on “replay” the logged messages (Is it fine for non-deterministic behaviors?)
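Receiver-based logging combined with a checkpoint can be sketched as follows. The class `LoggedProcess` is hypothetical, the application state is reduced to a running sum, and in-memory fields stand in for stable storage. Replay reproduces the pre-crash state here only because handling a message is deterministic, which is the concern raised by the question in parentheses above.

```python
class LoggedProcess:
    """Receiver-based logging: log each message before delivering it."""

    def __init__(self):
        self.state = 0            # application state (a running sum in this sketch)
        self.checkpoint = 0       # last checkpointed state
        self.message_log = []     # messages received since the last checkpoint

    def deliver(self, value):
        self.message_log.append(value)   # log first ...
        self.state += value              # ... then hand the message to the application

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.message_log.clear()         # earlier messages are covered by the checkpoint

    def recover(self):
        self.state = self.checkpoint     # restore the most recent checkpoint
        for value in self.message_log:   # replay logged messages in their original order
            self.state += value


p = LoggedProcess()
p.deliver(3)
p.take_checkpoint()
p.deliver(4)
p.deliver(5)
state_before_crash = p.state
p.state = None                           # simulate losing the volatile state in a crash
p.recover()
print(p.state == state_before_crash)
```

Only one checkpoint is needed for the whole run; the two messages delivered after it are recovered from the log rather than from a newer checkpoint.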
Replay of Messages and Orphan Processes
Caveat: Incorrect replay of messages after recovery can lead to orphan processes
[Figure: timelines of processes P, Q, and R exchanging messages M1, M2, and M3, where M1 is a logged message and M2 is an unlogged message. Q crashes and later recovers: M1 is replayed, M2 can never be replayed, and M3 becomes an orphan.]
Objectives
Discussion on Fault Tolerance
General background on fault tolerance
Process resilience, failure detection and reliable communication
Atomicity and distributed commit protocols
Recovery from failures
All Covered!
Next Class
Distributed File Systems-Part I
Thank You!