Crash Fault Detection in Celerating Environments

Srikanth Sastry and Scott M. Pike

([email protected]) ([email protected])

Implementing ◊P

• Implementable under (some models of) partial synchrony.
• Popular model: unknown bounds on message delay and on relative process speeds.

Round Trip Time (RTT) = Outgoing message delay + message processing time + incoming message delay

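[Figure: the local ◊P module sends a PING to a remote process and receives an ACK in return.]

• Outgoing message delay ≤ the bound on message delay
• Ack generation time ≤ f(the bound on relative process speeds)
• Incoming message delay ≤ the bound on message delay

RTT ≤ (bound on message delay) + f(bound on relative process speeds) + (bound on message delay)

RTT is bounded above! This bound on RTT can be adaptively estimated.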
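The adaptive estimation of the RTT bound can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `AdaptiveTimeout`, the starting value of 1.0, and the doubling policy are all assumptions chosen for the example. The only property it relies on is that the timeout grows after every false suspicion, so it eventually exceeds any fixed bound on RTT.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveTimeout:
    """Hypothetical monitor for one remote process (names are illustrative)."""
    timeout: float = 1.0    # arbitrary small starting value (assumption)
    suspected: bool = False
    mistakes: int = 0       # suspicions that were later retracted

    def observe_round(self, rtt: Optional[float]) -> None:
        """Process one PING/ACK round.

        rtt is the observed round-trip time, or None if no ACK ever arrives
        (for example, because the remote process has crashed).
        """
        if rtt is not None and rtt <= self.timeout:
            self.suspected = False      # ACK arrived before the timer expired
            return
        self.suspected = True           # timer expired: suspect the process
        if rtt is not None:
            # The ACK arrived after expiry, so the suspicion was a mistake:
            # trust the process again and enlarge the timeout.
            self.suspected = False
            self.mistakes += 1
            self.timeout *= 2           # doubling is an assumed growth policy


def demo() -> AdaptiveTimeout:
    """Feed a few RTTs that never exceed 10 time units, then a crash."""
    detector = AdaptiveTimeout()
    for rtt in [3, 7, 10, 9, 10, 10]:
        detector.observe_round(rtt)
    return detector
```

Once the timeout exceeds the largest RTT that actually occurs, a live process is never suspected again, while a round with no ACK at all leaves the process suspected.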

Local Adaptive Estimation of RTT

• Start timer with some arbitrary (small) value
• If timer expires without receiving a message, suspect the process
• If a message arrives after timer expiry, trust the process and increase the timer value.
• Eventually timer value exceeds the bound on RTT.
• After which correct processes will never be suspected.
• Any crashed process is permanently suspected.

But how do processes measure time?

Measuring Time

• Two techniques:
  – Action clocks: Counting the number of actions
  – Real-time clocks: Independent device to measure time (e.g., hardware clocks, NTP).
• Either technique works in environments that do NOT accelerate or decelerate arbitrarily
• But in Celerating environments, where processes can accelerate or decelerate arbitrarily, each technique fails independently.

Action Clocks in Accelerating Environments

[Figure: process speed versus real time. Labels: de facto bound on Round-Trip Time (RTT), k action-clock ticks; estimated bound on RTT, k action ticks; 2k action-clock ticks; Timeout! False suspicion; new estimate on RTT is now 2k action ticks; and so on, leading to an infinite stream of mistakes!]

Faster processes → More action-clock ticks per RTT → Action clock timer continually times out
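To see why an action-clock timer keeps timing out when a process accelerates, consider the toy simulation below. It is a sketch under stated assumptions, not a model taken from the poster: the RTT is held constant in real time, the process speed is multiplied by `speed_growth` every round, and the timeout doubles after each false suspicion. The function name and all parameter values are hypothetical.

```python
def count_action_clock_mistakes(rounds: int,
                                speed_growth: float,
                                rtt_real: float = 1.0,
                                initial_speed: float = 10.0,
                                initial_timeout_ticks: float = 1.0) -> int:
    """Count false suspicions made by an action-clock timer.

    speed is measured in actions per unit of real time, so one RTT spans
    rtt_real * speed action-clock ticks at the current speed.
    """
    speed = initial_speed
    timeout_ticks = initial_timeout_ticks
    mistakes = 0
    for _ in range(rounds):
        rtt_ticks = rtt_real * speed      # actions executed during one RTT
        if rtt_ticks > timeout_ticks:     # timer expires before the ACK arrives
            mistakes += 1                 # false suspicion
            timeout_ticks *= 2            # assumed policy: double the estimate
        speed *= speed_growth             # 1.0 = steady speed, > 1.0 = accelerating
    return mistakes


steady = count_action_clock_mistakes(rounds=50, speed_growth=1.0)
accelerating = count_action_clock_mistakes(rounds=50, speed_growth=2.0)
print(steady, accelerating)
```

With a steady speed the estimate overtakes the RTT after a few doublings and the mistakes stop. When the speed doubles every round, the RTT measured in ticks stays ahead of the estimate, so every round produces another mistake.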

Distributed Systems

A collection of autonomous computers (processes) connected through a communication network

• But processes can crash!
• Maintain correctness despite crashes
• Fault tolerance through crash detection
• Crash detection determined by synchronism in the system

Crash Detection and System Models

[Figure: spectrum of system models.]

• Synchrony: restrictive model; crash detection possible.
• Partial synchrony: crash detection possible; greater fidelity to real-world systems.
• Asynchrony: permissive model; crash detection impossible.

Failure Detectors

• Failure detectors: Distributed system service to detect process crashes.
• Failure detectors provide (potentially) incorrect information.
• Still powerful enough to solve important problems.
• E.g., distributed consensus, leader election, wait-free scheduling, contention management.
• Failure detector implementations often require partial synchrony.

Eventually Perfect Failure Detector

• One well-known failure detector is ◊P, the eventually perfect failure detector.

[Figure: two fault patterns (Fault Pattern 1 and Fault Pattern 2), each showing processes as Live or Crashed over time alongside the corresponding ◊P outputs.]

Real-time Clocks in Decelerating Environments

[Figure: process step time (process speed) versus real time, with successive Msg Send / Msg Recv pairs. Labels: estimate on round-trip time is k real-time ticks; new estimate on round-trip time is now 2k real-time ticks; and so on, leading to an infinite stream of mistakes!]

Slower processes → Longer duration to generate and process messages → Unbounded RTT (in real time)

Solving the Celeration Problem

• Bi-chronal timer
  – A vectored composition of action timer and real-time timer.
  – Measures time in terms of actions as well as real-time.
  – All processes use separate local bi-chronal timers.
  – Timer expires only when both action timer and the real-time timer expire.
• The action timer insulates ◊P from deceleration.
• The real-time timer insulates ◊P from acceleration.
• Bi-chronal clocks insulate ◊P from transient network behavior.

Bi-Chronal Timers in Non-Celerating Environments

• Hardware upgrades often accelerate process speeds
  – Action clocks precipitate ◊P mistakes during acceleration
  – Bi-chronal clocks are immune to acceleration
• Multiple process crashes (in a server farm), DoS attacks, and such can decelerate processes to a crawl
  – Real-time clocks precipitate ◊P mistakes during deceleration
  – Bi-chronal clocks are immune to deceleration

Conclusion

• Many existing ◊P implementations are subtly broken
• Bi-chronal clocks provide a simple solution
• Additionally, they insulate systems from transient behavior
• Future work:
  – Properties and behavior of Bi-chronal clocks
  – Use of Bi-chronal clocks in other applications
  – Other approaches to dealing with Celeration
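The bi-chronal timer rule (the timer expires only when both the action timer and the real-time timer have expired) can be illustrated with a short Python sketch. Everything beyond that rule is an assumption made for the example: the class and function names, the starting values, the policy of doubling both components after a false suspicion, and the toy traces in which either the RTT in action ticks or the RTT in real time doubles every round.

```python
from dataclasses import dataclass


@dataclass
class BiChronalTimer:
    """Hypothetical two-component timeout (action ticks and real time)."""
    action_timeout: float = 1.0   # action-clock ticks (assumed starting value)
    real_timeout: float = 0.25    # real-time units (assumed starting value)
    mistakes: int = 0

    def expired(self, elapsed_actions: float, elapsed_real: float) -> bool:
        # Expire only when BOTH component timers have expired.
        return (elapsed_actions > self.action_timeout
                and elapsed_real > self.real_timeout)

    def observe_round(self, rtt_actions: float, rtt_real: float) -> None:
        # The ACK of a live process arrives after rtt_actions ticks and
        # rtt_real time units; expiry before that is a false suspicion.
        if self.expired(rtt_actions, rtt_real):
            self.mistakes += 1
            self.action_timeout *= 2   # assumed policy: double both components
            self.real_timeout *= 2


def run(rounds: int, action_growth: float, real_growth: float) -> int:
    """Return the number of false suspicions over a toy trace."""
    timer = BiChronalTimer()
    rtt_actions, rtt_real = 10.0, 1.0
    for _ in range(rounds):
        timer.observe_round(rtt_actions, rtt_real)
        rtt_actions *= action_growth   # > 1.0 models acceleration
        rtt_real *= real_growth        # > 1.0 models deceleration
    return timer.mistakes


print(run(100, 2.0, 1.0), run(100, 1.0, 2.0))
```

In the accelerating trace the RTT in ticks grows without bound, but the real-time component soon covers the constant real-time RTT, so the mistakes stop. In the decelerating trace the roles are reversed and the action component stops the mistakes.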

• Asynchrony: Unbounded message delay and process speeds

• Synchrony: Known bounds on message delay and process speeds

• Partial Synchrony: Between synchrony and asynchrony
