crash fault detection in celerating environments


Upload: auryon

Post on 05-Jan-2016



Crash Fault Detection in Celerating Environments

Srikanth Sastry, Scott M. Pike

([email protected]) ([email protected])

Implementing ◊P

• Implementable under (some models of) partial synchrony.
• Popular model: unknown bounds on message delay (Δ) and relative process speeds (Φ).

Round-trip time (RTT) = outgoing message delay + message processing time + incoming message delay

[Figure: the local ◊P module sends a PING and receives an ACK. Outgoing message delay ≤ Δ; ack generation time ≤ f(Φ); incoming message delay ≤ Δ.]

RTT ≤ Δ + f(Φ) + Δ

RTT is bounded above! This bound on RTT can be adaptively estimated.

Action Clocks in Accelerating Environments

[Figure: process speed vs. real time, with the de facto bound on round-trip time (RTT) marked. The estimated bound on RTT starts at k action ticks. A round trip then takes 2k action-clock ticks (Timeout! False suspicion), so the new estimate on RTT is now 2k action ticks. The next round trip takes 4k action-clock ticks (Timeout! False suspicion). And so on, leading to an infinite stream of mistakes!]

Faster processes ⇒ more action-clock ticks per RTT ⇒ action-clock timer continually times out

Measuring Time

• Two techniques:
  – Action clocks: counting the number of actions
  – Real-time clocks: independent device to measure time (e.g., hardware clocks, NTP).
• Either technique works in environments that do NOT accelerate or decelerate arbitrarily
• But in Celerating environments, where processes can accelerate or decelerate arbitrarily, each technique fails independently.

Local Adaptive Estimation of RTT

• Start timer with some arbitrary (small) value
• If timer expires without receiving a message, suspect the process
• If a message arrives after timer expiry, trust the process and increase the timer value.
• Eventually timer value exceeds the bound on RTT.
• After which correct processes will never be suspected.
• Any crashed process is permanently suspected.

But how do processes measure time?
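The adaptive estimation steps listed under "Local Adaptive Estimation of RTT" can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the class and function names are hypothetical, time is an abstract number, and the timer value is doubled on each false suspicion (the poster's figures show k, 2k, 4k; the text itself only says "increase").

```python
class AdaptiveTimeoutMonitor:
    """Monitors one remote process with an adaptively growing timeout.

    Hypothetical sketch: names and the doubling policy are assumptions.
    """

    def __init__(self, initial_timeout=1):
        self.timeout = initial_timeout  # current estimate of the RTT bound
        self.suspected = False
        self.mistakes = 0               # number of false suspicions so far

    def on_timer_expired(self):
        # No reply arrived within the current estimate: suspect the process.
        self.suspected = True

    def on_reply(self):
        # A reply from a suspected process shows the suspicion was wrong:
        # trust the process again and increase the timer value.
        if self.suspected:
            self.suspected = False
            self.mistakes += 1
            self.timeout *= 2


def run(rtts, initial_timeout=1):
    """Feed observed round-trip times to a monitor; None means no reply (crash)."""
    monitor = AdaptiveTimeoutMonitor(initial_timeout)
    for rtt in rtts:
        if rtt is None or rtt > monitor.timeout:
            monitor.on_timer_expired()
        if rtt is not None:
            monitor.on_reply()
    return monitor
```

With a fixed RTT the monitor makes a few mistakes while its estimate grows and then stops suspecting the correct process; once replies stop arriving, the process stays suspected.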

Distributed Systems

[Figure: processes in a network, with crashed processes marked "Crash!"]

A collection of autonomous computers (processes) connected through a communication network

• But processes can crash!
• Maintain correctness despite crashes
• Fault tolerance through crash detection
• Crash detection determined by synchronism in the system

Failure Detectors

• Failure detectors: distributed system service to detect process crashes.
• Failure detectors provide (potentially) incorrect information.
• Still powerful enough to solve important problems.
• E.g., distributed consensus, leader election, wait-free scheduling, contention management.
• Failure detector implementations often require partial synchrony.
• One well-known failure detector is ◊P, the eventually perfect failure detector.

Eventually Perfect Failure Detector

[Figure: two fault patterns (Fault Pattern 1, Fault Pattern 2), each shown alongside the corresponding ◊P outputs over time, with processes labeled Live or Crashed.]

Crash Detection and System Models

• Synchrony: restrictive model; crash detection possible.
• Asynchrony: permissive model; crash detection impossible.
• Partial synchrony: crash detection possible; greater fidelity to real-world systems.

Real-time Clocks in Decelerating Environments

[Figure: process step time (process speed) vs. real time, with successive Msg Send / Msg Recv pairs. The estimate on round-trip time is k real-time ticks. Timeout! False suspicion: the new estimate on round-trip time is now 2k real-time ticks. Timeout! False suspicion. And so on, leading to an infinite stream of mistakes!]

Slower processes ⇒ longer duration to generate and process messages ⇒ unbounded RTT (in real time)

Solving the Celeration Problem

• Bi-chronal timer
  – A vectored composition of action timer and real-time timer.
  – Measures time in terms of actions as well as real time.
  – All processes use separate local bi-chronal timers.
  – Timer expires only when both the action timer and the real-time timer expire.
• The action timer insulates ◊P from deceleration.
• The real-time timer insulates ◊P from acceleration.
• Bi-chronal clocks insulate ◊P from transient network behavior.
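A bi-chronal timer as described above might look something like the following minimal sketch. The class name, method names, and the doubling in `grow` are assumptions made for illustration; the poster specifies only that the timer combines an action timer with a real-time timer and expires when both have expired. The current real time is passed in explicitly so the sketch stays deterministic.

```python
class BiChronalTimer:
    """Hypothetical sketch of a timer that measures both actions and real time."""

    def __init__(self, action_limit, real_limit):
        self.action_limit = action_limit  # timeout in action-clock ticks
        self.real_limit = real_limit      # timeout in real-time units
        self.reset(now=0.0)

    def reset(self, now):
        # Restart both components, e.g. when a new PING is sent.
        self.actions = 0
        self.started_at = now

    def tick(self):
        # Called once per action (step) executed by the local process.
        self.actions += 1

    def expired(self, now):
        # The composite timer expires only when BOTH components have expired.
        action_expired = self.actions >= self.action_limit
        real_expired = (now - self.started_at) >= self.real_limit
        return action_expired and real_expired

    def grow(self):
        # After a false suspicion, enlarge both components (doubling is an assumption).
        self.action_limit *= 2
        self.real_limit *= 2
```

In a real process, `now` could come from a monotonic source such as Python's `time.monotonic()`, with `tick()` called from the process's main step loop.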

Bi-Chronal Timers in Non-Celerating Environments

• Hardware upgrades often accelerate process speeds
  – Action clocks precipitate ◊P mistakes during acceleration
  – Bi-chronal clocks are immune to acceleration
• Multiple process crashes (in a server farm), DoS attacks, and such can decelerate processes to a crawl
  – Real-time clocks precipitate ◊P mistakes during deceleration
  – Bi-chronal clocks are immune to deceleration
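The contrast between the three kinds of timer can be illustrated with a toy simulation. Everything in it is an assumption chosen for illustration rather than taken from the poster: message delay is 1 real-time unit each way, handling a message costs 10 actions, all processes run at the same speed, the speed doubles or halves every round, the monitored process never crashes (so every expiry is a false suspicion), and timeouts double after each mistake.

```python
def make_rounds(n, speed_of, delay=1.0, proc_actions=10.0):
    """Return n (rtt_real, rtt_actions) pairs for a correct, replying process.

    speed_of(i) gives the process speed in round i, in actions per real-time unit.
    """
    rounds = []
    for i in range(n):
        speed = speed_of(i)
        rtt_real = 2 * delay + proc_actions / speed     # RTT measured in real time
        rtt_actions = 2 * delay * speed + proc_actions  # same RTT in action ticks
        rounds.append((rtt_real, rtt_actions))
    return rounds


def count_mistakes(rounds, kind):
    """Count false suspicions for kind in {'action', 'real', 'bichronal'}."""
    real_limit = action_limit = 1.0
    mistakes = 0
    for rtt_real, rtt_actions in rounds:
        real_expired = rtt_real > real_limit
        action_expired = rtt_actions > action_limit
        if kind == "action":
            expired = action_expired
        elif kind == "real":
            expired = real_expired
        else:  # bi-chronal: expires only when both components expire
            expired = action_expired and real_expired
        if expired:
            mistakes += 1
            real_limit *= 2
            action_limit *= 2
    return mistakes


accelerating = make_rounds(20, lambda i: 2.0 ** i)
decelerating = make_rounds(20, lambda i: 0.5 ** i)
for name, rounds in [("accelerating", accelerating), ("decelerating", decelerating)]:
    for kind in ("action", "real", "bichronal"):
        print(name, kind, count_mistakes(rounds, kind))
```

Under these assumptions the action timer is mistaken in every accelerating round and the real-time timer in every decelerating round, while the bi-chronal timer makes only a handful of early mistakes in both runs.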

Conclusion

• Many existing ◊P implementations are subtly broken
• Bi-chronal clocks provide a simple solution
• Additionally, they insulate systems from transient behavior
• Future work:
  – Properties and behavior of Bi-chronal clocks
  – Use of Bi-chronal clocks in other applications
  – Other approaches to dealing with Celeration

• Asynchrony: Unbounded message delay and process speeds

• Synchrony: Known bounds on message delay and process speeds

• Partial Synchrony: Between synchrony and asynchrony