failure tolerancetschwarz/coen 317/failure.pdf · • chandy and lamport: distributed snapshot ......
TRANSCRIPT
Distributed Checkpointing• Capture the global state of a distributed system • Chandy and Lamport: Distributed snapshot
• Reflects a consistent, global state • If process P has received a message from Q • Then global state should show that process Q
sent a message to process P
Distributed Checkpointing• Global state presented by a cut
• Consistent cuts: • Messages shown received are shown sent • Messages shown sent are either
• received or • in transit
Distributed Checkpointing• Represent distributed system as a system of
processes connected by unidirectional point-to-point
communication
Distributed Checkpointing• Distributed snapshot
• Anybody can start snapshot • Initiating process P records its own state • Process P sends a marker along all of its outgoing
channels • Process Q upon receiving first marker
• Records its state • Sends a marker to all of its neighbors • Starts recording all incoming channels
• Process Q upon receiving subsequent markers • Stops recording on channel on which the marker arrived
Distributed Checkpointing• Process Q upon receiving last marker
• Send • own state • messages on channels monitored
• to the initiating state
Distributed Checkpointing• Termination Detection:
• Use snapshot protocol • If Q receives a marker for the first time
• Sending process becomes its predecessors • If Q is done with the snapshot, sends a DONE
message to predecessor • This still allows for messages in transit
Distributed Checkpointing• Termination detection:
• Need snapshot where all channels are empty • Q returns DONE only if
• All of Q’s successors have returned a DONE message
• Q has not received any message between the point it recorded its state and the point it had received the marker along each of its incoming channel
• In all other cases, Q sends a CONTINUE message
Distributed Checkpointing• Termination detection
• When initiating process receives only DONE messages • No regular messages are in transit • Thus, computation is terminated
Failure Types• Dependability consists of
• Availability • System is ready to be used
• Reliability • System can run continually without failure
• Safety • In a failure condition, nothing catastrophic
happens • Maintainability
• How easy can a failed system be repaired
Failure Types• Dependability:
• System that breaks down for a millisecond every hour • Availability > 99.9999 % • Reliability is low
• System breaks down only for two weeks every July • Availability ~ 96% • Reliability is high
Failure Types• Failure: system cannot meet its promises • Error: part of the system state that may lead to a
failure • Fault: cause of an error
Failure Types• Transient faults
• occur once and the disappear • If the operation is repeated, fault goes away • Example:
• Bird flies through the beam of a microwave transmitter • and possibly gets roasted
Failure Types• Permanent fault
• Fault appears • Continues to exist until the faulty component is
repaired
Failure Types• Crash failure
• Server halts, but it is working correctly until it has • Omission failure
• A server fails to respond to incoming messages • Receive omission
• Server fails to receive incoming messages • Send omission
• Server fails to send messages
Failure Types• Timing failure
• A server’s response lies outside the specified time interval • Response failure
• A server’s response is incorrect • Value failure
• The value of the response is wrong • State transition failure
• The server deviates from the correct flow of control • Arbitrary / Byzantine failure
• A server may produce arbitrary responses at arbitrary times
Failure Types• Fail-stop failure
• Fail stop server stops producing output • Others can detect this state
• Fail-silent failure • Fail silent server stops producing output • Others cannot distinguish this from a server that is
slow • Fail-safe failure:
• Server acts arbitrarily • But other servers can recognize its output as false
Process Resilience• Organize processes into groups
• Groups can be • dynamic • run membership protocols • hierarchical
Process Resilience• Byzantine general problem
• In the presence of byzantine failure • Can only decide on a single value is >2/3 of
the participants are not faulty
Process Resilience• Byzantine General Problem;
• Lamport algorithm • Each process has to share a value with all
others • But processes can lie and can misrepresent
their value • Goal: All processes accept values from the
non-faulty processes
Process Resilience• Lamport algorithm (1982)
• Each process sends its value to all other processes
• Values are gathered into vectors • Each process sends these vectors to everybody
else • Every process accepts values with a majority
Reliable Group Communication
• Problem: • How to get messages to the members of a
process group • Reliable multicasting
• Without process failures: • Problem assumes that there is a join and
leave protocol for processes • Often: members receive messages in exactly
the same order
Reliable Group Communication
• Tradeoffs: • Explicit retransmission requests or
retransmissions when acks are missing • Use multicast or point-to-point transmission for
retransmissions • Use piggy-backing in order save network
bandwidth
Reliable Group Communication
• Scalability in Reliable Multicasting • Simple scheme cannot support large numbers • Optimization:
• Get rid of acks • Only send retransmission requests • Difficult to get messages out of history buffer.
• Use cumulative acks
Reliable Group Communication
• Scalability in reliable multicasting • Feedback suppression • Implemented in Scalable Reliable Multicasting
(SRM) by Floyd (97) • Never ack receipt of messages • Whenever a process sends a retransmission
request (NACK), it multicasts to everyone • Servers that receive this multicast suppress
their own NACK message
Reliable Group Communication
• Feedback suppression scales reasonably well • Problems:
• Receivers need to schedule feedback messages accurately • Otherwise, too many will send out their NACK
anyway • Feedback still interrupts processes that received the
message • Could form a separate multicast process for those
that have not received • But that is difficult to do over a wide area network
Reliable Group Communication
• Atomic multicast (in the presence of failures) • Make a distinction between receiving and
delivering a message
Reliable Group Communication
• Each message is associated with a group view • The processes on the delivery list
• Changes in group membership • Announced by a group view change message
• Problem: • Message based on old group view needs to be
delivered before the group view change message is delivered
Reliable Group Communication
• Virtual Synchronicity • Reliable multicast where multicast message to a
group view G is delivered to all non-faulty processes in G
Reliable Group Communication
• Gives several possibilities for ordering • Unordered multicasts • Fifo ordered multicasts • Causally-ordered multicasts • Totally-ordered multicasts
Reliable Group Communication
• Virtually synchronous reliable multicasting with totally-ordered delivery of messages is called
• Atomic multicasting
Reliable Group Communication
• ISIS: Implementing atomic multicast • Build on TCP as a reliable point-to-point
communication • Assumes that messages sent out by a sender
arrive in that order (TCP property) • Multicasting message with group view
• Same as sending individual messages to all members in the group
Reliable Group Communication
• Processes keep messages until they know that every other process has received m • In that case m is stable • ONLY STABLE MESSAGES ARE DELIVERED
• This is also true for view-change messages • Forwarding of messages guarantees that a
message delivered to one non-faulty process is received by everyone in the group
• Can require any process to send message to all members of the group
Reliable Group Communication
• Processing a group change • Process receives group change message
• Forwards any unstable message for the old group to all processes in the new group and marks them as stable • ISIS / TCP assumes that these messages are
never lost • All messages to the old group received by one
process are therefore guaranteed to be received by all non-faulty process in the old group
Reliable Group Communication
• When process P no longer has unstable messages: • Multicasts a flush message to the new group • When P receives flush messages from all
members of the new group, it installs the new view
Reliable Group Communication
• When process Q receives message sent to the old group • If Q still believes itself to be in the old group:
• Delivers message (unless it has already received it and considers it a duplicate)
• If Q has received view change message • Forwards any unstable message • Then sends flush message to the new group
Reliable Group Communication
• Need more protocol in order to deal with failure during a view change
• Details in Birman’s book or the papers on ISIS
Checkpointing• Revocery
• Forward recovery • Bring system to a new, failure free state
• Backward recovery • Bring system back to an old, failure free state
and start over
Checkpointing• Need to do coordinated checkpointing instead of
individual checkpointing • Simpler solution:
• Two-phase blocking protocol • Coordinator broadcasts a CHECKPOINT_REQ • Processes receiving CHECKPOINT_REQ create
local checkpoint • queue messages from the application • block until they receive CHECKPOINT_DONE
• Coordinator sends CHECKPOINT_DONE after receiving acks from everyone