![Page 1: Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery](https://reader036.vdocuments.mx/reader036/viewer/2022081504/5697bfe01a28abf838cb3498/html5/thumbnails/1.jpg)
Chapter 11
Fault Tolerance
Topics
- Introduction
- Process Resilience
- Reliable Group Communication
- Recovery
Basic Concepts
- Failure = state of a system in which the system fails to meet its contract
- Error = part of the system state that leads to failure (e.g., a value differing from its intended value)
- Fault = cause of an error, e.g., resulting from
  - Design errors
  - Manufacturing faults
  - Deterioration
  - External disturbance
Fault → Error → Failure
Remark:
Presence of a fault does not ensure that an error will occur, e.g., a memory bit stuck at 0 causes no error as long as the value stored in it is 0.
Characteristics of Faults: Duration
- Permanent fault
  - Once a component fails, it never works correctly again
  - Easiest to diagnose
- Transient fault
  - Occurs one time only
  - 10 times as likely as permanent faults
- Intermittent fault
  - Re-occurring
  - May appear to be transient (if long period)
  - Hard and expensive to detect
Fault Tolerance Includes…
- Fault tolerance: the system can avoid a failure despite the occurrence of faults
- Availability: probability that the system works correctly at a given instant of time
- Reliability: expected time between failures
- Safety: absence of catastrophic consequences of a fault
- Maintainability: ease of recovering from a failure (incl. automatic recognition of faults)
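As a rough numeric illustration of availability, the sketch below uses the common steady-state approximation MTTF / (MTTF + MTTR). That formula is an assumption brought in for illustration and is not stated on the slide; the function name and the numbers are hypothetical.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming the common
    approximation MTTF / (MTTF + MTTR) (not taken from the slide)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: on average 999 h of correct service, then 1 h of repair.
availability = steady_state_availability(999.0, 1.0)
downtime_hours_per_year = (1.0 - availability) * 365 * 24
```

With these made-up figures the system is up 99.9% of the time, which corresponds to roughly 8.8 hours of downtime per year.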
Failure Models

| Type of failure | Description |
|---|---|
| Crash failure | A server halts, but is working correctly until it halts |
| Omission failure | A server fails to respond to incoming requests |
| Omission failure: receive omission | A server fails to receive incoming messages |
| Omission failure: send omission | A server fails to send messages |
| Timing failure | A server's response lies outside the specified time interval |
| Response failure | The server's response is incorrect |
| Response failure: value failure | The value of the response is wrong |
| Response failure: state transition failure | The server deviates from the correct flow of control |
| Arbitrary failure (Byzantine failure) | A server may produce arbitrary responses at arbitrary times |
How to Overcome Failures?
- Design servers that are able to announce that they might fail in the near future?
- Design a DS that is able to detect that a server is down and/or no longer works correctly?
- Design a DS that is able to mask faults via redundancy?
Failure Masking
Hide occurrence of faults using redundancy
- Information (e.g., additional bits, i.e., error-correcting codes such as the Hamming code)
- Time (e.g., retry an operation; an aborted transaction may be repeated without any side effects)
- Physical
  - Hardware (replicated equipment)
  - Software (replicated server processes/threads)
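Information redundancy can be illustrated with a small Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. This is a minimal sketch; the function names and the bit layout (parity bits at positions 1, 2 and 4) are choices made for this example, not taken from the slides.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 0 means "no error detected"
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1  # a transient fault flips one bit
recovered = hamming74_decode(corrupted)
```

Here `recovered` equals the original data bits even though one bit of the stored word was flipped, so the fault is masked and never becomes a failure.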
Hardware Redundancy
- Passive (static)
  - Uses fault masking to hide occurrence of faults
  - No action from the system is required
  - e.g., voting (see next slide)
- Active (dynamic)
  - Uses comparison for detection and/or diagnosis
  - Removes faulty hardware from the system (reconfiguration)
- Hybrid
  - Combines both approaches
  - Masking until diagnosis is complete
  - Expensive, but better for achieving higher reliability
Failure Masking by Redundancy
Triple modular redundancy.
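The voting idea behind triple modular redundancy can be sketched as follows: three replicas compute the same function and a voter forwards the majority output. The names `majority_vote` and `tmr` are hypothetical, and the sketch assumes the voter itself does not fail.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def tmr(modules, x):
    """Run all replicated modules on the same input and vote on the result."""
    return majority_vote([module(x) for module in modules])


def good(x):
    return x * x


def faulty(x):
    return -1  # a replica with a permanent fault


masked = tmr([good, faulty, good], 4)
```

A single faulty replica is outvoted, so `masked` is the correct result; with two replicas failing in different ways there is no majority and the voter raises an error instead of masking the fault.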
Stand-by Sparing
- Only one module is driving outputs
- Other modules are
  - Idle (hot spares)
  - Shut down (cold spares)
- In case of error detection, switch to a new module
- Hot spares
  - No power-up delays
  - But may have significant power consumption
- Cold spares
  - Vice versa to hot spares
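The switch-over logic of stand-by sparing can be sketched in a few lines. This is only an illustration: the class name is hypothetical, modules are modeled as plain callables, and "error detection" is simplified to a raised exception.

```python
class StandbySparing:
    """One active module drives the output; spares take over on errors."""

    def __init__(self, modules):
        self.modules = list(modules)
        self.active = 0  # index of the module currently driving outputs

    def call(self, x):
        while self.active < len(self.modules):
            try:
                return self.modules[self.active](x)
            except Exception:
                # Error detected: switch to the next spare and retry.
                self.active += 1
        raise RuntimeError("all modules have failed")


def broken(x):
    raise RuntimeError("module fault")


def spare(x):
    return x + 1


system = StandbySparing([broken, spare])
output = system.call(1)
```

After the call, `system.active` points at the spare, which keeps driving the outputs for later requests.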
Failure Masking by Software Redundancy
How to improve reliability?
What can we do to mask thread/process faults?
Process Resilience
- Protection against process failures
- Group of identical processes provides redundancy
  - Software: multiple processes on the same machine
  - Hardware: processes on different machines
- Multicast communication ensures all members receive all messages (often atomic and ordered)
- Processes can join and leave groups dynamically, e.g., to replace failed processes
- Membership protocol ensures agreement on group membership at any given time
Flat Groups versus Hierarchical Groups
a) Communication in a flat group.
b) Communication in a simple hierarchical group.
Group Management
1. Use a single group server with a single database: a typical single point of failure
2. Use a single database but several group servers (standby solution)
3. Manage groups in a distributed way, i.e., every outsider wanting to enter a group sends a corresponding enter_group message via reliable multicast to every current group member, but
   - When does a new group member get all the group-internal messages?
   - When leaving the group, what about messages that were already sent but not yet received?
Agreement in Faulty Systems (1)
Ensure all non-faulty processes reach consensus in a finite number of steps
1. Reliable processes, faulty communication (omission faults): two-army problem
2. Reliable communication, faulty processes (Byzantine faults)
Agreement in Faulty Systems (2)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldiers).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.
Agreement in Faulty Systems (3)
The same as in previous slide, except now with 2 loyal generals and one traitor.
With m faulty processes, at least 2m+1 correctly functioning processes are required to reach an agreement, i.e., 3m+1 processes in total.
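The two cases on these slides can be replayed with a small simulation of the vector exchange: every general announces its strength, assembles a vector, forwards that vector to the others, and then takes a per-element majority over the vectors it received. This is a sketch under simplifying assumptions: the troop strengths are hypothetical, a traitor lies in every message, and all of its lies are distinct values.

```python
from collections import Counter

UNKNOWN = None  # marks a vector element without a majority


def byzantine_agreement(strengths, traitors):
    """Simulate the vector exchange among generals.

    strengths: dict mapping general id -> true troop strength.
    traitors:  set of ids that lie in every message they send.
    Returns {loyal id: result vector as dict id -> value or UNKNOWN}.
    """
    ids = sorted(strengths)
    bogus = iter(range(1000, 10**6))  # hypothetical lies, all distinct

    def said(sender, honest_value):
        # A traitor replaces whatever it should send by a fresh lie.
        return next(bogus) if sender in traitors else honest_value

    # Every general g assembles a vector of the announced strengths.
    vector = {
        g: {s: strengths[s] if s == g else said(s, strengths[s]) for s in ids}
        for g in ids
    }

    # Every general s passes its whole vector to every other general g.
    received = {
        g: {s: {i: said(s, vector[s][i]) for i in ids} for s in ids if s != g}
        for g in ids
    }

    # Per element, keep a value only if a strict majority of the
    # received vectors agree on it.
    result = {}
    for g in ids:
        if g in traitors:
            continue
        result[g] = {}
        for i in ids:
            votes = Counter(received[g][s][i] for s in received[g])
            value, count = votes.most_common(1)[0]
            result[g][i] = value if 2 * count > len(received[g]) else UNKNOWN
    return result


four = byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3})
three = byzantine_agreement({1: 1, 2: 2, 3: 3}, traitors={3})
```

With four generals and one traitor, every loyal general ends up with the same vector (1, 2, UNKNOWN, 4), so agreement is reached. With three generals and one traitor, no element reaches a majority and every entry stays UNKNOWN.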
Reliable Group Communication
Basic Reliable-Multicasting Schemes
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail
a) Message transmission
b) Reporting feedback
Scalability
- Feedback implosion: sender is swamped with feedback messages
- Nonhierarchical multicast:
  - Use NACKs
  - Feedback suppression: NACKs are multicast to everyone, which prevents other receivers from sending NACKs if they have already seen one
    - Reduces (N)ACK load on the server
    - Receivers have to be coordinated so they do not all multicast NACKs at the same time
    - Multicasting feedback also interrupts processes that have successfully received the messages
Nonhierarchical Feedback Control
Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
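The suppression effect can be sketched with a tiny timing model: every receiver that missed the message waits for its own randomly chosen delay, and cancels its NACK if it hears another receiver's NACK first. The function name, the receiver labels and the delay values are hypothetical, and a single fixed propagation delay is assumed for all receivers.

```python
def nacks_sent(timer_delays, propagation_delay):
    """Return the receivers that actually multicast a NACK.

    timer_delays: dict receiver -> randomly chosen wait before sending a NACK.
    propagation_delay: time until a multicast NACK is seen by the others.
    """
    first = min(timer_delays.values())
    heard_at = first + propagation_delay  # when the first NACK reaches everyone
    return sorted(
        receiver
        for receiver, fire_time in timer_delays.items()
        if fire_time == first or fire_time < heard_at
    )


# Four receivers missed the message and scheduled a retransmission request.
delays = {"A": 0.30, "B": 0.10, "C": 0.12, "D": 0.50}
senders = nacks_sent(delays, propagation_delay=0.05)
```

In this run B fires first and C fires before B's NACK has arrived, so two NACKs are sent while A and D are suppressed. This is why the timers need to be spread out well relative to the propagation delay.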
Hierarchical Feedback Control
The essence of hierarchical reliable multicasting.
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.
Atomic Multicast: Virtual Synchrony
- Deliver a message either to all group members (in the same order), or to none
  - Requires agreement about group membership
  - Replica crash?
- Process group:
  - Group view: list of processes the sender has when a message is sent
  - Each message is uniquely associated with a group
  - View changes need to be ordered with respect to message transmissions: the message is delivered either to the old or to the new view
  - Special case: sender failure
![Page 26: Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery](https://reader036.vdocuments.mx/reader036/viewer/2022081504/5697bfe01a28abf838cb3498/html5/thumbnails/26.jpg)
Virtual Synchrony (2)
The principle of virtually synchronous multicast. If the sender crashes during the multicast, the message may either be delivered to all remaining processes or ignored by each of them.
Implementing Virtual Synchrony
- A message m sent in view Gi is stable if it was received by all members of Gi
  - Only stable messages are delivered
- View changes are announced
  - By the arriving/departing node or the failure-detecting node via a view change message, followed by any unstable messages in the old view, followed by a flush message
  - View is changed after the flush message has arrived from all members of the old view
Implementing Virtual Synchrony (2)
a) Process 4 notices that process 7 has crashed and sends a view change.
b) Process 6 sends out all its unstable messages, followed by a flush message.
c) Process 6 installs the new view when it has received a flush message from everyone else.
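The flush steps above can be sketched as a simplified, synchronous simulation: after a crash is noticed, every surviving process forwards its unstable messages to the other survivors and then sends a flush; a process installs the new view once it holds a flush from every other survivor. The function name, the process ids and the message labels are hypothetical, and message loss during the exchange is not modeled.

```python
def change_view(old_view, crashed, unstable):
    """Simulate one view change.

    old_view: list of process ids in the current view.
    crashed:  set of process ids that have failed.
    unstable: dict process id -> set of messages it holds that are not
              yet known to have reached all members of the old view.
    Returns (new_view, messages held per survivor, installed flag per survivor).
    """
    new_view = [p for p in old_view if p not in crashed]
    held = {p: set(unstable.get(p, set())) for p in new_view}
    flushes = {p: set() for p in new_view}

    for sender in new_view:
        for receiver in new_view:
            if receiver == sender:
                continue
            # First the unstable messages of the old view are forwarded ...
            held[receiver] |= unstable.get(sender, set())
            # ... then the flush message follows.
            flushes[receiver].add(sender)

    # A process installs the new view once every other survivor has flushed.
    installed = {p: flushes[p] == set(new_view) - {p} for p in new_view}
    return new_view, held, installed


new_view, held, installed = change_view(
    old_view=[4, 5, 6, 7],
    crashed={7},
    unstable={4: set(), 5: {"m1"}, 6: {"m1", "m2"}},
)
```

After the exchange all survivors hold the same set of messages, so each message from the old view is delivered either by every remaining process or by none before the new view is installed.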