failure tolerancetschwarz/coen 317/failure.pdf · • chandy and lamport: distributed snapshot ......

Failure ToleranceDistributed Systems

Santa Clara University

Distributed Checkpointing

Distributed Checkpointing• Capture the global state of a distributed system • Chandy and Lamport: Distributed snapshot

• Reflects a consistent, global state • If process P has received a message from Q • Then global state should show that process Q

sent a message to process P

Distributed Checkpointing• Global state presented by a cut

• Consistent cuts: • Messages shown received are shown sent • Messages shown sent are either

• received or • in transit

Distributed Checkpointing• Represent distributed system as a system of

processes connected by unidirectional point-to-point

communication

Distributed Checkpointing• Distributed snapshot

• Anybody can start snapshot • Initiating process P records its own state • Process P sends a marker along all of its outgoing

channels • Process Q upon receiving first marker

• Records its state • Sends a marker to all of its neighbors • Starts recording all incoming channels

• Process Q upon receiving subsequent markers • Stops recording on channel on which the marker arrived

Distributed Checkpointing• Process Q upon receiving last marker

• Send • own state • messages on channels monitored

• to the initiating state

Distributed Checkpointing• Termination Detection:

• Use snapshot protocol • If Q receives a marker for the first time

• Sending process becomes its predecessors • If Q is done with the snapshot, sends a DONE

message to predecessor • This still allows for messages in transit

Distributed Checkpointing• Termination detection:

• Need snapshot where all channels are empty • Q returns DONE only if

• All of Q’s successors have returned a DONE message

• Q has not received any message between the point it recorded its state and the point it had received the marker along each of its incoming channel

• In all other cases, Q sends a CONTINUE message

Distributed Checkpointing• Termination detection

• When initiating process receives only DONE messages • No regular messages are in transit • Thus, computation is terminated

Failure Types• Dependability consists of

• Availability • System is ready to be used

• Reliability • System can run continually without failure

• Safety • In a failure condition, nothing catastrophic

happens • Maintainability

• How easy can a failed system be repaired

Failure Types• Dependability:

• System that breaks down for a millisecond every hour • Availability > 99.9999 % • Reliability is low

• System breaks down only for two weeks every July • Availability ~ 96% • Reliability is high

Failure Types• Failure: system cannot meet its promises • Error: part of the system state that may lead to a

failure • Fault: cause of an error

Failure Types• Transient faults

• occur once and the disappear • If the operation is repeated, fault goes away • Example:

• Bird flies through the beam of a microwave transmitter • and possibly gets roasted

Failure Types• Intermittent fault

• Fault occurs • Goes away • Fault returns

Failure Types• Permanent fault

• Fault appears • Continues to exist until the faulty component is

repaired

Failure Types• Crash failure

• Server halts, but it is working correctly until it has • Omission failure

• A server fails to respond to incoming messages • Receive omission

• Server fails to receive incoming messages • Send omission

• Server fails to send messages

Failure Types• Timing failure

• A server’s response lies outside the specified time interval • Response failure

• A server’s response is incorrect • Value failure

• The value of the response is wrong • State transition failure

• The server deviates from the correct flow of control • Arbitrary / Byzantine failure

• A server may produce arbitrary responses at arbitrary times

Failure Types• Fail-stop failure

• Fail stop server stops producing output • Others can detect this state

• Fail-silent failure • Fail silent server stops producing output • Others cannot distinguish this from a server that is

slow • Fail-safe failure:

• Server acts arbitrarily • But other servers can recognize its output as false

Failure Masking• Failure masking by redundancy

• Erasure correcting codes • Replication

Failure MaskingTriple Modular Redundancy

Process Resilience• Organize processes into groups

• Groups can be • dynamic • run membership protocols • hierarchical

Process Resilience

Process Resilience• Leader election

• Bully algorithm • Process with highest ID wins

Process Resilience• Leader Election using a ring

Process Resilience• Agreement in Faulty Systems

Process Resilience• Byzantine general problem

• In the presence of byzantine failure • Can only decide on a single value is >2/3 of

the participants are not faulty

Process Resilience• Byzantine General Problem;

• Lamport algorithm • Each process has to share a value with all

others • But processes can lie and can misrepresent

their value • Goal: All processes accept values from the

non-faulty processes

Process Resilience• Lamport algorithm (1982)

• Each process sends its value to all other processes

• Values are gathered into vectors • Each process sends these vectors to everybody

else • Every process accepts values with a majority

Process Resilience

Reliable Group Communication

• Problem: • How to get messages to the members of a

process group • Reliable multicasting

• Without process failures: • Problem assumes that there is a join and

leave protocol for processes • Often: members receive messages in exactly

the same order


Simple solution if all receivers are known and assumed to not fail


• Tradeoffs: • Explicit retransmission requests or

retransmissions when acks are missing • Use multicast or point-to-point transmission for

retransmissions • Use piggy-backing in order save network

bandwidth


• Scalability in Reliable Multicasting • Simple scheme cannot support large numbers • Optimization:

• Get rid of acks • Only send retransmission requests • Difficult to get messages out of history buffer.

• Use cumulative acks


• Scalability in reliable multicasting • Feedback suppression • Implemented in Scalable Reliable Multicasting

(SRM) by Floyd (97) • Never ack receipt of messages • Whenever a process sends a retransmission

request (NACK), it multicasts to everyone • Servers that receive this multicast suppress

their own NACK message


• Feedback suppression scales reasonably well • Problems:

• Receivers need to schedule feedback messages accurately • Otherwise, too many will send out their NACK

anyway • Feedback still interrupts processes that received the

message • Could form a separate multicast process for those

that have not received • But that is difficult to do over a wide area network


• Hierarchical Feedback Control


• Atomic multicast (in the presence of failures) • Make a distinction between receiving and

delivering a message


• Each message is associated with a group view • The processes on the delivery list

• Changes in group membership • Announced by a group view change message

• Problem: • Message based on old group view needs to be

delivered before the group view change message is delivered


• Virtual Synchronicity • Reliable multicast where multicast message to a

group view G is delivered to all non-faulty processes in G


• Gives several possibilities for ordering • Unordered multicasts • Fifo ordered multicasts • Causally-ordered multicasts • Totally-ordered multicasts


• Virtually synchronous reliable multicasting with totally-ordered delivery of messages is called

• Atomic multicasting


• ISIS: Implementing atomic multicast • Build on TCP as a reliable point-to-point

communication • Assumes that messages sent out by a sender

arrive in that order (TCP property) • Multicasting message with group view

• Same as sending individual messages to all members in the group


• Processes keep messages until they know that every other process has received m • In that case m is stable • ONLY STABLE MESSAGES ARE DELIVERED

• This is also true for view-change messages • Forwarding of messages guarantees that a

message delivered to one non-faulty process is received by everyone in the group

• Can require any process to send message to all members of the group


• Processing a group change • Process receives group change message

• Forwards any unstable message for the old group to all processes in the new group and marks them as stable • ISIS / TCP assumes that these messages are

never lost • All messages to the old group received by one

process are therefore guaranteed to be received by all non-faulty process in the old group


• When process P no longer has unstable messages: • Multicasts a flush message to the new group • When P receives flush messages from all

members of the new group, it installs the new view


• When process Q receives message sent to the old group • If Q still believes itself to be in the old group:

• Delivers message (unless it has already received it and considers it a duplicate)

• If Q has received view change message • Forwards any unstable message • Then sends flush message to the new group


• Need more protocol in order to deal with failure during a view change

• Details in Birman’s book or the papers on ISIS

Checkpointing• Revocery

• Forward recovery • Bring system to a new, failure free state

• Backward recovery • Bring system back to an old, failure free state

and start over

Checkpointing• Distributed snapshot to establish recovery line

Checkpointing• Domino effect

Checkpointing• Need to do coordinated checkpointing instead of

individual checkpointing • Simpler solution:

• Two-phase blocking protocol • Coordinator broadcasts a CHECKPOINT_REQ • Processes receiving CHECKPOINT_REQ create

local checkpoint • queue messages from the application • block until they receive CHECKPOINT_DONE

• Coordinator sends CHECKPOINT_DONE after receiving acks from everyone

Checkpointing• Techniques used to reduce checkpoints

• Message logging • Can lead to orphans

Checkpointing• Pessimistic logging protocols

• Ensure that for each non-stable message there is at most one process depending on it

• Optimistic logging protocols • Any orphan process depending on some

message is rolled back until it now longer depend on the message

failure tolerancetschwarz/coen 317/failure.pdf · • chandy and lamport: distributed snapshot ......

Documents