chapter 8, fault tolerance newpourhaji.ir/upload/2015/04/5535fe983e725.pdf · 2015-04-21 · fault...

Ali Asghar Pourhaji Kazem, Spring 2015

DISTRIBUTED SYSTEMS: Principles and Paradigms

Second Edition

ANDREW S. TANENBAUM

MAARTEN VAN STEEN

Chapter 8: Fault Tolerance


Fault Tolerance Basic Concepts

• Being fault tolerant is strongly related to what are called dependable systems

• Dependability implies the following:
  1. Availability: Readiness for usage
  2. Reliability: Continuity of service delivery
  3. Safety: Very low probability of catastrophes
  4. Maintainability: How easily a failed system can be repaired


Types of Faults

• Transient: Transient faults occur once and then disappear
  • If the operation is repeated, the fault goes away. A bird flying through the beam of a microwave transmitter may cause lost bits on some network

• Intermittent: An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on
  • A loose contact on a connector will often cause an intermittent fault

• Permanent: A permanent fault is one that continues to exist until the faulty component is replaced


Terminology

• Failure: When a component is not living up to its specifications, a failure occurs

• Error: That part of a component’s state that can lead to a failure

• Fault: The cause of an error

What to do about faults
• Fault prevention: prevent the occurrence of a fault
• Fault tolerance: build a component such that it can mask the presence of faults
• Fault removal: reduce the presence, number, and seriousness of faults
• Fault forecasting: estimate the present number, future incidence, and consequences of faults


Failure Models

Figure 8-1. Different types of failures.


Failure Masking by Redundancy

• If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes

• The key technique for masking faults is to use redundancy

• Three kinds are possible:
  • Information redundancy
  • Time redundancy
  • Physical redundancy



• Information redundancy: With information redundancy, extra bits are added to allow recovery from garbled bits

• Time redundancy: With time redundancy, an action is performed, and then if need be, it is performed again

• Physical redundancy: With physical redundancy, extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some components



Figure 8-2. Triple modular redundancy.
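Physical redundancy with voting can be sketched in a few lines. This is a minimal illustration of a triple-modular-redundancy stage, assuming each replica is a plain function; the names majority_vote, tmr_stage, good and faulty are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def tmr_stage(replicas, x):
    """Feed the same input to every replica and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    # A correctly working replica (hypothetical computation).
    return x + 1


def faulty(x):
    # A replica that has failed and produces garbage.
    return -999


# One faulty replica out of three is masked by the voter.
result = tmr_stage([good, faulty, good], 41)
```

With two faulty replicas that disagree with each other, no value reaches a majority and the voter returns None, so the fault is no longer masked.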


PROCESS RESILIENCE (1)

• The first topic we discuss is protection against process failures, which is achieved by replicating processes into groups

• The key approach to tolerating a faulty process is to organize several identical processes into a group

• The key property that all groups have is that when a message is sent to the group itself, all members of the group receive it

• In this way, if one process in a group fails, hopefully some other process can take over for it


PROCESS RESILIENCE (2)

• Basic issue
  • Protect yourself against faulty processes by replicating and distributing computations in a group

• Flat groups
  • Good for fault tolerance as information exchange immediately occurs with all group members; however, may impose more overhead as control is completely distributed (hard to implement).

• Hierarchical groups
  • All communication goes through a single coordinator ⇒ neither really fault tolerant nor scalable, but relatively easy to implement.


Flat Groups versus Hierarchical Groups

Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.


Groups and failure masking (1)

• K-fault tolerant group
  • When a group can mask any k concurrent member failures (k is called the degree of fault tolerance).

• How large does a k-fault tolerant group need to be?
  • Assume crash/performance failure semantics ⇒ a total of k + 1 members are needed to survive k member failures.
  • Assume arbitrary failure semantics, and group output defined by voting ⇒ a total of 2k + 1 members are needed to survive k member failures.

• Assumption
  • All members are identical, and process all input in the same order ⇒ only then are we sure that they do exactly the same thing.


Groups and failure masking (2)

• Scenario
  • Assuming arbitrary failure semantics, we need 3k + 1 group members to survive the attacks of k faulty members. Such arbitrary failures are also known as Byzantine failures

• Essence
  • We are trying to reach a majority vote among the group of loyalists, in the presence of k traitors ⇒ need 2k + 1 loyalists.
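The group-size rules from these two slides can be written down as a small helper. This is only a sketch; the function name and the three model labels are hypothetical.

```python
def min_group_size(k, failure_model):
    """Smallest group that tolerates k faulty members under the given model."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "crash":
        # One surviving member is enough to answer.
        return k + 1
    if failure_model == "arbitrary-voting":
        # k wrong answers must be outvoted by k + 1 correct ones.
        return 2 * k + 1
    if failure_model == "byzantine-agreement":
        # 2k + 1 loyalists plus the k traitors themselves.
        return 3 * k + 1
    raise ValueError("unknown failure model: " + failure_model)


sizes = {m: min_group_size(1, m)
         for m in ("crash", "arbitrary-voting", "byzantine-agreement")}
```

For k = 1 this gives 2, 3 and 4 members respectively.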


Agreement in Faulty Systems (1)

(a) what they send to each other

(b) what each one got from the other

(c) what each one got in second step


Agreement in Faulty Systems (2)

Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.


Failure Detection

• We detect failures through timeout mechanisms
  • Setting timeouts properly is very difficult and application dependent
  • You cannot distinguish process failures from network failures
  • We need to consider failure notification throughout the system:
    • Gossiping (i.e., proactively disseminate a failure detection)
    • On failure detection, pretend you failed as well
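A timeout-based detector can be sketched as follows. This is a minimal illustration, assuming processes send periodic heartbeats and that time is passed in explicitly; the class and method names are hypothetical. Note that the detector only suspects a process: a slow network produces the same silence as a crash.

```python
class HeartbeatDetector:
    """Suspects a process when no heartbeat arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process name -> time of last heartbeat

    def heartbeat(self, process, now):
        # Record that `process` was alive at time `now`.
        self.last_seen[process] = now

    def suspected(self, now):
        # Every process that has been silent for longer than the timeout.
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=4.0)   # p2 stays silent
suspects = detector.suspected(now=5.0)
```

Choosing the timeout value is the hard part: too short and healthy processes are suspected, too long and real failures go unnoticed.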


Reliable Communication

• So far we have concentrated on process resilience (by means of process groups). What about reliable communication channels?

• Error detection
  • Framing of packets to allow for bit error detection
  • Use of frame numbering to detect packet loss

• Error correction
  • Add so much redundancy that corrupted packets can be automatically corrected
  • Request retransmission of lost, or last N packets
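Both detection techniques can be illustrated together. This is a sketch under assumptions: a frame is modeled as a (sequence number, payload, checksum) tuple, and CRC-32 from Python's standard zlib module stands in for whatever checksum a real link layer uses. The function names are hypothetical.

```python
import zlib


def make_frame(seq, payload):
    """Frame a payload with a sequence number and a CRC-32 checksum."""
    return (seq, payload, zlib.crc32(payload))


def is_intact(frame):
    """Bit error detection: recompute the checksum and compare."""
    _seq, payload, checksum = frame
    return zlib.crc32(payload) == checksum


def missing_frames(received_seqs):
    """Packet loss detection: gaps in the received sequence numbers."""
    seen = set(received_seqs)
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


frame = make_frame(0, b"hello")
corrupted = (frame[0], b"hellp", frame[2])  # payload garbled in transit
```

A receiver that finds gaps or a bad checksum can then request retransmission of the affected frames.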


RPC Semantics in the Presence of Failures (1)

Five different classes of failures that can occur in RPC systems:

1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.



• Client cannot locate server: Raise an exception or send a signal to the client, leading to a loss in transparency.

• Lost request messages: Start a timer when sending a request. If the timer expires before a reply is received, send the request again. The server would need to detect duplicate requests.
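The timer-and-resend rule for lost requests can be sketched as a retry loop. This is an illustration only: the channel is simulated, a return value of None stands for "timer expired without a reply", and the names call_with_retry and lossy_channel are hypothetical.

```python
def call_with_retry(send, request, max_attempts=5):
    """Resend `request` until a reply arrives or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        reply = send(request)  # None models the timer expiring
        if reply is not None:
            return reply, attempt
    raise TimeoutError("no reply after %d attempts" % max_attempts)


def lossy_channel(drop_first):
    """Simulated channel that loses the first `drop_first` requests."""
    state = {"calls": 0}

    def send(request):
        state["calls"] += 1
        if state["calls"] <= drop_first:
            return None
        return "reply to " + request

    return send


reply, attempts = call_with_retry(lossy_channel(2), "read block 7")
```

Because the client cannot tell a lost request from a lost reply, the same request may reach the server more than once, which is why duplicate detection is needed on the server side.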


Server Crashes (1)

• Server crashes: From the client side, a server crash before executing the request is indistinguishable from a crash after executing it.

• We need to decide on what we expect from the server
  • At-least-once semantics
    • The server guarantees it will carry out an operation at least once, no matter what.
  • At-most-once semantics
    • The server guarantees it will carry out an operation at most once
  • Guarantee-nothing semantics!
  • Exactly-once semantics.


Server Crashes (2)

Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.



Server Crashes (3)

Three events that can happen at the server:
• Send the completion message (M)
• Print the text (P)
• Crash (C)


Server Crashes (4)

These events can occur in six different orderings:

1. M → P → C: A crash occurs after sending the completion message and printing the text.

2. M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.

3. P → M → C: A crash occurs after sending the completion message and printing the text.

4. P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.

5. C (→ P → M): A crash happens before the server could do anything.

6. C (→ M → P): A crash happens before the server could do anything.
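The six orderings can be enumerated and combined with a client reissue strategy to see how often the text ends up printed. This is a sketch under two assumptions: a reissued request is handled by the recovered server without a second crash, and the four strategy names below are the ones being compared. All identifiers are hypothetical.

```python
from itertools import permutations


def executed(order):
    """Events that really happen: everything before the crash C."""
    return order[:order.index("C")]


def outcome(order, reissue):
    """Classify the result as ZERO, OK or DUP prints of the text."""
    done = executed(order)
    acked = "M" in done          # did the client see the completion message?
    prints = done.count("P")
    if reissue(acked):
        prints += 1              # assumed: the retry succeeds without a crash
    return {0: "ZERO", 1: "OK", 2: "DUP"}[prints]


STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "only when ACKed": lambda acked: acked,
    "only when not ACKed": lambda acked: not acked,
}

orderings = ["".join(p) for p in permutations("MPC")]
table = {name: {o: outcome(o, rule) for o in orderings}
         for name, rule in STRATEGIES.items()}
```

Every row of the resulting table contains at least one ZERO or DUP entry: under these assumptions no client strategy gives exactly-once behavior for all six orderings.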


Server Crashes (5)

Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.



• Lost Reply Messages
  • Detecting lost replies can be hard, because it can also be that the server has crashed. You don’t know whether the server has carried out the operation

• Solution
  • Set a timer on the client. If it expires without a reply, then send the request again. If requests are idempotent, then they can be repeated without ill effects.



• Client Crashes:
  • Creates orphans. An orphan is an active computation on the server for which there is no client waiting. Dealing with orphans:

• Extermination.
  • The client logs each request in a file before sending it. After a reboot the file is checked and the orphan is explicitly killed off. Expensive, cannot locate grand-orphans, etc.

• Reincarnation.
  • Divide time into sequentially numbered epochs. When a client reboots, it broadcasts a message declaring a new epoch. This allows servers to terminate orphan computations.

• Gentle reincarnation.
  • A server tries to locate the owner of orphans before killing the computation.

• Expiration.
  • Each RPC is given a quantum of time to finish its job. If it cannot finish, then it asks for another quantum. After a crash, a client need only wait for a quantum to make sure all orphans are gone.



• An idempotent operation is one that can be repeated as often as necessary without any harm being done, e.g., reading a block from a file.

• In general, try to make RPC/RMI methods idempotent if possible. If not, it can be dealt with in a couple of ways.
  • Use a sequence number with each request so the server can detect duplicates. But now the server needs to keep state for each client.
  • Have a bit in the message to distinguish between original and duplicate transmission.
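Duplicate detection with sequence numbers can be sketched for a non-idempotent operation. This is a minimal illustration, assuming a bank-style deposit as the operation and an in-memory reply cache; the class name DedupServer and its fields are hypothetical.

```python
class DedupServer:
    """Executes each (client, sequence number) request at most once."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # (client_id, seq) -> cached reply; per-client state

    def deposit(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:
            # Retransmission: answer from the cache, do not execute again.
            return self.replies[key]
        self.balance += amount          # the non-idempotent part
        self.replies[key] = self.balance
        return self.balance


server = DedupServer()
first = server.deposit("c1", 1, 100)
retry = server.deposit("c1", 1, 100)   # same request sent again after a timeout
second = server.deposit("c1", 2, 50)
```

The reply cache is exactly the per-client state mentioned above; a real server would also need a rule for discarding old entries.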


Basic Reliable Multicasting Schemes (1)

• Although most transport layers offer reliable point-to-point channels, they rarely offer reliable communication to a collection of processes

• The best they can offer is to let each process set up a point-to-point connection to each other process it wants to communicate with

• Obviously, such an organization is not very efficient as it may waste network bandwidth

• Nevertheless, if the number of processes is small, achieving reliability through multiple reliable point-to-point channels is a simple and often straightforward solution


Basic Reliable Multicasting Schemes (2)

• To go beyond this simple case, we need to define precisely what reliable multicasting is

• Intuitively, it means that a message that is sent to a process group should be delivered to each member of that group

• It is less clear what should happen when a process joins the group, or a sender crashes, during communication

• To cover such situations, a distinction should be made between reliable communication in the presence of faulty processes, and reliable communication when processes are assumed to operate correctly


Basic Reliable-Multicasting Schemes

Figure 8-9. A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback.


Basic Reliable Multicasting Schemes (2)

• The sending process assigns a sequence number to each message it multicasts

• We assume that messages are received in the order they are sent

• In this way, it is easy for a receiver to detect it is missing a message

• Each multicast message is stored locally in a history buffer at the sender

• Assuming the receivers are known to the sender, the sender simply keeps the message in its history buffer until each receiver has returned an acknowledgment

• If a receiver detects it is missing a message, it may return a negative acknowledgment, requesting the sender for a retransmission

• Alternatively, the sender may automatically retransmit the message when it has not received all acknowledgments within a certain time
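The scheme above can be sketched with a sender that keeps a history buffer and receivers that answer with ACK or NACK. This is only an illustration: the network is replaced by direct method calls, and the class names Sender and Receiver and the tuple-shaped feedback are hypothetical.

```python
class Sender:
    """Keeps each multicast message until every known receiver has ACKed it."""

    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.next_seq = 0
        self.history = {}  # seq -> (message, receivers that still owe an ACK)

    def multicast(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = (message, set(self.receivers))
        return seq, message

    def ack(self, receiver, seq):
        if seq in self.history:
            _message, pending = self.history[seq]
            pending.discard(receiver)
            if not pending:
                del self.history[seq]  # everyone has it: drop from the buffer

    def retransmit(self, seq):
        return seq, self.history[seq][0]


class Receiver:
    """Detects a gap in sequence numbers and asks for the missing message."""

    def __init__(self):
        self.expected = 0

    def receive(self, seq, message):
        if seq > self.expected:
            return ("NACK", self.expected)  # a message is missing
        if seq == self.expected:
            self.expected += 1
        return ("ACK", seq)                 # new message or harmless duplicate


sender = Sender(["r1", "r2"])
r1, r2 = Receiver(), Receiver()
m0 = sender.multicast("a")
m1 = sender.multicast("b")
feedback_r1 = r1.receive(*m0)
sender.ack("r1", 0)
feedback_r2 = r2.receive(*m1)        # r2 never saw message 0
```

Message 0 stays in the history buffer until r2 has received the retransmission and acknowledged it.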


Scalability in Reliable Multicasting - Feedback suppression

Basic idea
• Let a process P suppress its own feedback when it notices another process Q is already asking for a retransmission

Assumptions
• All receivers listen to a common feedback channel to which feedback messages are submitted
• Process P schedules its own feedback message randomly, and suppresses it when observing another feedback message
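Random scheduling with suppression can be simulated in a few lines. This is a sketch under assumptions: each receiver draws a random delay, and a feedback message becomes visible to everyone `propagation` time units after it is sent. The function and parameter names are hypothetical.

```python
import random


def schedule_feedback(receivers, rng, propagation=0.0):
    """Return (receivers that send a NACK, receivers that suppress theirs)."""
    delays = {r: rng.random() for r in receivers}
    first = min(delays.values())
    # A receiver still sends if its timer fires before it can observe
    # the earliest feedback message on the shared channel.
    sent = [r for r in receivers if delays[r] <= first + propagation]
    suppressed = [r for r in receivers if r not in sent]
    return sent, suppressed


sent, suppressed = schedule_feedback(["P", "Q", "R"], random.Random(1))
```

With instantaneous propagation only one retransmission request is sent. The longer feedback takes to reach the others, the more duplicate requests slip through.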


Nonhierarchical Feedback Control

Figure 8-10. Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.


Hierarchical Feedback Control (1)

Basic solution
• Construct a hierarchical feedback channel in which all submitted messages are sent only to the root. Intermediate nodes aggregate feedback messages before passing them on.

Observation
• Intermediate nodes can easily be used for retransmission purposes


Hierarchical Feedback Control (2)

Figure 8-11. The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests.


Atomic Multicast

• In particular, what is often needed in a distributed system is the guarantee that a message is delivered to either all processes or to none at all

• In addition, it is generally also required that all messages are delivered in the same order to all processes

• This is also known as the atomic multicast problem


Virtual Synchrony (1)

• We again adopt a model in which the distributed system consists of a communication layer

• Within this communication layer, messages are sent and received

• A received message is locally buffered in the communication layer until it can be delivered to the application that is logically placed at a higher layer



Figure 8-12. The logical organization of a distributed system to distinguish between message receipt and message delivery.



Figure 8-13. The principle of virtual synchronous multicast.


Message Ordering (1)

Four different orderings are distinguished:
• Unordered multicasts
• FIFO-ordered multicasts
• Causally-ordered multicasts
• Totally-ordered multicasts



• A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes.

• In the case of reliable FIFO-ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent
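FIFO-ordered delivery can be sketched as a hold-back buffer in the communication layer. This is a minimal illustration, assuming every sender numbers its messages from 0; the class name FifoDelivery and its fields are hypothetical.

```python
class FifoDelivery:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.buffer = {}     # (sender, seq) -> received but undelivered message
        self.delivered = []  # messages handed to the application, in order

    def receive(self, sender, seq, message):
        self.buffer[(sender, seq)] = message
        nxt = self.next_seq.get(sender, 0)
        # Deliver as long as the next expected message is in the buffer.
        while (sender, nxt) in self.buffer:
            self.delivered.append(self.buffer.pop((sender, nxt)))
            nxt += 1
        self.next_seq[sender] = nxt


layer = FifoDelivery()
layer.receive("P1", 1, "m2")          # arrives early: received, not delivered
held_back = list(layer.delivered)
layer.receive("P1", 0, "m1")          # gap closed: both are delivered in order
```

Messages from different senders are not ordered relative to each other; only the per-sender order is enforced.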



• Reliable causally-ordered multicast delivers messages so that potential causality between different messages is preserved

• In other words, if a message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender, then the communication layer at each receiver will always deliver m2 after it has received and delivered m1
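One common way to enforce causal delivery uses vector timestamps; the slides do not prescribe this mechanism, so treat it as an assumption. In the sketch below, processes are numbered 0..n-1, a message from sender s carries a vector ts, and it is delivered only when it is the next message from s and everything it depends on has been delivered. All names are hypothetical.

```python
class CausalDelivery:
    """Holds back a message until all causally preceding ones are delivered."""

    def __init__(self, n):
        self.clock = [0] * n   # messages delivered so far, per sender
        self.pending = []      # received but not yet deliverable
        self.delivered = []

    def _deliverable(self, sender, ts):
        next_from_sender = ts[sender] == self.clock[sender] + 1
        deps_met = all(ts[k] <= self.clock[k]
                       for k in range(len(ts)) if k != sender)
        return next_from_sender and deps_met

    def receive(self, sender, ts, message):
        self.pending.append((sender, ts, message))
        progress = True
        while progress:            # one delivery may unblock others
            progress = False
            for item in list(self.pending):
                s, t, m = item
                if self._deliverable(s, t):
                    self.pending.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(m)
                    progress = True


# P0 multicasts m1; P1 delivers m1 and then multicasts m2.
p2 = CausalDelivery(3)
p2.receive(1, [1, 1, 0], "m2")    # arrives first but depends on m1
before = list(p2.delivered)
p2.receive(0, [1, 0, 0], "m1")
```

At process 2, m2 is held back until m1 has been delivered, even though the two messages come from different senders.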



• Besides the three discussed orderings, there may be the additional constraint that message delivery is to be totally ordered as well

• Total-ordered delivery means that regardless of whether message delivery is unordered, FIFO ordered, or causally ordered, it is required additionally that when messages are delivered, they are delivered in the same order to all group members



Figure 8-14. Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.



Figure 8-15. Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.


Implementing Virtual Synchrony (1)

Figure 8-16. Six different versions of virtually synchronous reliable multicasting.




Figure 8-17. (a) Process 4 notices that process 7 has crashed and sends a view change.



Figure 8-17. (b) Process 6 sends out all its unstable messages, followed by a flush message.



Figure 8-17. (c) Process 6 installs the new view when it has received a flush message from everyone else.



Two-Phase Commit (1)

Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.




Figure 8-19. Actions taken by a participant P when residing in state READY and having contacted another participant Q.



Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol.
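The coordinator's decision rule in 2PC can be sketched as follows. This is an illustration under assumptions, not the listing of Figure 8-20: participants are modeled as functions that return their vote, a return value of None stands for a timeout, and the function name and log list are hypothetical.

```python
def two_phase_commit(participants):
    """Run one 2PC round; `participants` maps a name to a vote function."""
    log = ["START_2PC"]

    # Phase 1: send VOTE_REQUEST to everyone and collect the votes.
    votes = {name: vote() for name, vote in participants.items()}

    # Phase 2: commit only if every participant voted to commit.
    # A missing vote (None, i.e. a timeout) is treated like VOTE_ABORT.
    if all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"

    log.append(decision)  # record the decision before telling anyone
    return decision, log


all_yes = {"P": lambda: "VOTE_COMMIT", "Q": lambda: "VOTE_COMMIT"}
one_silent = {"P": lambda: "VOTE_COMMIT", "Q": lambda: None}

commit_result = two_phase_commit(all_yes)
abort_result = two_phase_commit(one_silent)
```

Writing the decision to the log before multicasting it is what lets a recovering coordinator repeat the same decision after a crash.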



Figure 8-21. (a) The steps taken by a participant process in 2PC.



Figure 8-21. (b) The steps for handling incoming decision requests.


Three-Phase Commit (1)

The states of the coordinator and each participant satisfy the following two conditions:

1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.

2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.


Three-Phase Commit (2)

Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.



Recovery: Background

Essence
• When a failure occurs, we need to bring the system into an error-free state:
  • Forward error recovery: Find a new state from which the system can continue operation
  • Backward error recovery: Bring the system back into a previous error-free state

Practice
• Use backward error recovery, requiring that we establish recovery points

Observation
• Recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from where to recover


Recovery – Stable Storage

Figure 8-23. (a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.


Consistent Recovery State

Requirement
• Every message that has been received is also shown to have been sent in the state of the sender

Recovery line
• Assuming processes regularly checkpoint their state, the recovery line is the most recent consistent global checkpoint.
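The requirement can be checked mechanically for a given set of checkpoints. This is a sketch under assumptions: each process has one checkpoint taken at a known local time, and each message is described by its send and receive times; the function name and the tuple layout are hypothetical.

```python
def is_consistent(checkpoints, messages):
    """Check that no checkpoint records a receipt without the matching send.

    checkpoints: process -> local time at which its checkpoint was taken
    messages:    list of (sender, send_time, receiver, receive_time)
    """
    for sender, send_time, receiver, receive_time in messages:
        received = receive_time <= checkpoints[receiver]
        sent = send_time <= checkpoints[sender]
        if received and not sent:
            return False  # receipt is recorded, but the send is not
    return True


# P1 sends at time 5; P2 receives at time 7.
messages = [("P1", 5, "P2", 7)]
both_recorded = is_consistent({"P1": 6, "P2": 8}, messages)
receipt_only = is_consistent({"P1": 4, "P2": 8}, messages)
neither = is_consistent({"P1": 4, "P2": 6}, messages)
```

Only the middle case is inconsistent. A message that was sent but not yet received is allowed: it is simply in transit at the time of the checkpoint.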


Independent Checkpointing

Figure 8-25. The domino effect.


Characterizing Message-Logging Schemes

Figure 8-26. Incorrect replay of messages after recovery, leading to an orphan process.
