Distributed Algorithms 2g1513 L16 – by Ali Ghodsi and Seif Haridi Failure Detection

Upload: turner

Post on 09-Feb-2016

Page 1: Distributed Algorithms 2g1513

Distributed Algorithms 2g1513

L16: Failure Detection
by Ali Ghodsi and Seif Haridi

Page 2: Distributed Algorithms 2g1513

04/22/23 Ali Ghodsi and Seif Haridi 2

Failure Detection

Failure Detector
  A module which uses timeouts to detect failures

Useful abstraction for building systems
  Programming becomes easier

May give false positives
  Process A wrongly thinks process C is dead
  Process B thinks process C is alive

We will not care what the failure detectors of crashed processes think!
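
A minimal sketch of a timeout-based detector, assuming processes announce themselves with periodic heartbeats (the slides only say "timeouts"). The class `TimeoutFailureDetector` and its methods are hypothetical names chosen for illustration. The example reproduces a false positive: C is only slow, yet it is suspected for a while.

```python
class TimeoutFailureDetector:
    """Suspects a process when no heartbeat arrived within `timeout` time units."""

    def __init__(self, processes, timeout):
        self.timeout = timeout
        # Time of the most recent heartbeat seen from each process.
        self.last_heard = {p: 0.0 for p in processes}

    def heartbeat(self, process, now):
        """Record that a heartbeat from `process` arrived at time `now`."""
        self.last_heard[process] = now

    def suspects(self, now):
        """Return the set of processes whose last heartbeat is older than the timeout."""
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


fd = TimeoutFailureDetector(["A", "B", "C"], timeout=3.0)
for p in ("A", "B", "C"):
    fd.heartbeat(p, now=1.0)
fd.heartbeat("A", now=4.0)
fd.heartbeat("B", now=4.0)

# C's heartbeat is delayed, not lost, so this suspicion is a false positive.
suspected_at_5 = fd.suspects(now=5.0)

# Once the late heartbeat arrives, the suspicion is withdrawn.
fd.heartbeat("C", now=6.0)
suspected_at_6 = fd.suspects(now=6.0)
```

A longer timeout reduces false positives but delays the detection of real crashes.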

Page 3: Distributed Algorithms 2g1513

Different Failure Scenarios

Failure Pattern, F
  Actual view of crashed processes at a certain time
  Monotonic

F(1)=∅, F(2)=∅, F(3)={P2}, F(7)={P2,P4}

[Figure: timelines of processes P1, P2, P3, P4 over times 1 to 8]
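
The failure pattern can be written as a function from time to the set of crashed processes. The sketch below assumes that nothing changes at the times the slide does not list (4, 5, 6 and 8); `F` follows the slide, while `is_monotonic` is a hypothetical helper name.

```python
def F(t):
    """Failure pattern from the slide: the set of processes crashed by time t.

    Assumption: the pattern stays constant between the times listed on the slide.
    """
    if t >= 7:
        return {"P2", "P4"}
    if t >= 3:
        return {"P2"}
    return set()


def is_monotonic(pattern, times):
    """Crashed processes never recover, so each set must contain the previous one."""
    return all(pattern(a) <= pattern(b) for a, b in zip(times, times[1:]))


monotonic = is_monotonic(F, list(range(1, 9)))
```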

Page 4: Distributed Algorithms 2g1513

Detections

Suspicions, H
  What a detector thinks at process P and time t

Process 3 thinks at time 8: H(3, 8)={1,4}
  Erroneously thinks 1 has crashed, has detected 4’s crash, has not detected 2’s crash

[Figure: timelines of processes P1, P2, P3, P4 over times 1 to 8]
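
Comparing the suspicion H(3, 8) with the failure pattern at time 8 is plain set arithmetic. The sketch below uses process numbers for both sets; the variable names are illustrative only.

```python
crashed = {2, 4}      # F(8): processes that have actually crashed by time 8
suspected = {1, 4}    # H(3, 8): what process 3's detector believes at time 8

false_suspicions = suspected - crashed   # suspected although still alive
detected = suspected & crashed           # crashed and correctly suspected
missed = crashed - suspected             # crashed but not yet suspected
```

Here `false_suspicions` is `{1}`, `detected` is `{4}` and `missed` is `{2}`, matching the three cases on the slide.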

Page 5: Distributed Algorithms 2g1513

Completeness and Accuracy

Two important types of requirements

Completeness
  The detector will detect a crashed process

Accuracy
  The detector will not detect a non-crashed process

Trivial to satisfy only one requirement (how?)
Both impossible in an asynchronous system!
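
One possible answer to the "how?" question, as a sketch: a detector that suspects everyone is trivially complete, and a detector that suspects no one is trivially accurate. All function names below are hypothetical.

```python
def suspect_everyone(processes):
    """Trivially complete: every crashed process is suspected (so is everyone else)."""
    return set(processes)


def suspect_no_one(processes):
    """Trivially accurate: no live process is ever suspected (nor any crashed one)."""
    return set()


def is_complete(suspected, crashed):
    return crashed <= suspected


def is_accurate(suspected, crashed):
    return suspected <= crashed


processes = {1, 2, 3, 4}
crashed = {2, 4}

all_suspected = suspect_everyone(processes)
none_suspected = suspect_no_one(processes)

results = {
    "suspect_everyone": (is_complete(all_suspected, crashed), is_accurate(all_suspected, crashed)),
    "suspect_no_one": (is_complete(none_suspected, crashed), is_accurate(none_suspected, crashed)),
}
```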

Page 6: Distributed Algorithms 2g1513

Practical Requirements

Strong Completeness
  Every crashed process is eventually detected by all processes

For all failure patterns
  For all possible behaviors of a detector
    There exists a time t after which all crashed processes are detected by all processes

We will only study detectors with this property

Page 7: Distributed Algorithms 2g1513

Practical Requirements

Strongly Accurate
  No process is ever suspected unless it has crashed

For all failure patterns
  For all possible behaviors of a detector
    For all correct processes P and Q, P will never suspect Q

Quite strong assumption
  No premature timeouts

Page 8: Distributed Algorithms 2g1513

Practical Requirements

Weakly Accurate
  There exists a correct process which is never suspected by anyone

For all failure patterns
  For all possible behaviors of a detector
    There exists a correct process P
    All correct processes will never suspect P

Quite strong assumption
  No premature timeouts

Page 9: Distributed Algorithms 2g1513

Practical Requirements

Eventually Strongly Accurate
  After some finite time, t, the detector is strongly accurate

Eventually Weakly Accurate
  After some finite time, t, the detector is weakly accurate

After some time, the requirements are fulfilled
  Prior to that, any behavior is possible!

Weak assumptions
  Think about Eventually Weakly Accurate!

Page 10: Distributed Algorithms 2g1513

Four Established Detectors

Perfect Detector (P)
  Complete, Strongly Accurate

Strong Detector (S)
  Complete, Weakly Accurate

Eventually Perfect Detector (◊P)
  Complete, Eventually Strongly Accurate

Eventually Strong Detector (◊S)
  Complete, Eventually Weakly Accurate

Page 11: Distributed Algorithms 2g1513

Programming Difference (1/2)

Programming without failure detectors
  Can never receive from all processes (might block because of a failure)

General technique
  Assume at most t nodes can fail
  Broadcast to all nodes
  Receive N-t messages

See the Initially Dead Consensus
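
The "receive N-t messages" technique can be sketched as follows. The inbox is simulated with a queue that already holds the messages of the live nodes; `receive_quorum` is a hypothetical name, and a real implementation would block on the network where this sketch pops from the queue.

```python
from collections import deque


def receive_quorum(inbox, n, t):
    """Take exactly n - t messages, never waiting for the up to t that may be missing."""
    needed = n - t
    received = []
    while len(received) < needed:
        received.append(inbox.popleft())  # a real receive would block here
    return received


N, T = 5, 2
crashed = {4, 5}  # assumption for the example: exactly t nodes are dead

# Every live node has broadcast one message; crashed nodes sent nothing.
inbox = deque(
    (sender, f"hello from {sender}")
    for sender in range(1, N + 1)
    if sender not in crashed
)

msgs = receive_quorum(inbox, N, T)
senders = [sender for sender, _ in msgs]
```

Waiting for a fourth message here would block forever, which is why the count stops at N-t.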

Page 12: Distributed Algorithms 2g1513

Programming Difference (2/2)

Programming with failure detectors

General technique
  Broadcast
  Receive from all nodes
    Failed nodes will timeout (completeness)

Code:
  if collect<msg, par> from q
    print(q+" said "+msg);
  else
    print(q+" looks dead!");
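
A rough Python rendering of the `collect` idiom, assuming the inbox and the detector's current suspicions are given as plain data. The function `collect` below is a hypothetical stand-in for the primitive on the slide, not its actual definition.

```python
def collect(sender, inbox, suspected):
    """Return the sender's message, or None once the detector suspects the sender.

    Assumption: `inbox` maps senders to delivered messages and `suspected`
    is the detector's current output. A real collect would block until one
    of the two conditions holds.
    """
    if sender in inbox:
        return inbox[sender]
    if sender in suspected:
        return None
    raise RuntimeError("would keep waiting: no message and no suspicion yet")


inbox = {"q1": "hi"}
suspected = {"q2"}

lines = []
for q in ("q1", "q2"):
    msg = collect(q, inbox, suspected)
    if msg is not None:
        lines.append(q + " said " + msg)
    else:
        lines.append(q + " looks dead!")
```

Completeness is what guarantees that the loop does not wait forever on a crashed sender.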

Page 13: Distributed Algorithms 2g1513

Pitfalls with Detectors

Two pitfalls:

A tries to send a message to all, fails halfway through.
  B might get the message, C might not!

A sends a message to all, B gets it, but C erroneously detects A as dead

Page 14: Distributed Algorithms 2g1513

Consensus: Rotating Coordinator for S

xi = input

for r:=1 to N do
  if p=r then
    forall j do send <value, xi, r> to j;

  if collect<value, x’, r> from pr then
    xi = x’;
end
decide xi

How many failures can this tolerate?
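
A small round-by-round simulation of the loop above, under simplifying assumptions that are not in the slides: rounds are synchronous, crashed processes are dead from the start, and a false suspicion simply makes a process skip the coordinator's value. The names `run_rotating_coordinator` and `false_suspicions` are hypothetical.

```python
def run_rotating_coordinator(inputs, crashed, false_suspicions=frozenset()):
    """Simulate N rounds of the rotating coordinator.

    inputs: dict mapping process ids 1..N to their input values.
    crashed: ids that are dead from the start (simplifying assumption).
    false_suspicions: (observer, coordinator) pairs where the observer
        wrongly times out on a live coordinator and keeps its old value.
    Returns the decisions of the live processes.
    """
    n = len(inputs)
    x = dict(inputs)
    for r in range(1, n + 1):
        if r in crashed:
            continue  # completeness: everyone suspects p_r and moves on
        proposal = x[r]
        for i in x:
            if i in crashed or (i, r) in false_suspicions:
                continue
            x[i] = proposal  # collected the coordinator's value
    return {i: v for i, v in x.items() if i not in crashed}


decisions = run_rotating_coordinator(
    inputs={1: 0, 2: 1, 3: 0, 4: 0},
    crashed={1},
    false_suspicions={(3, 2)},  # 3 wrongly suspects coordinator 2
)
```

In this run processes 3 and 4 are never suspected, so once one of them has been coordinator all live processes hold the same value. This suggests, under weak accuracy, that the loop still works when all but one process crash.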

Page 15: Distributed Algorithms 2g1513

Tolerance of Eventuality (1/3)

Eventually perfect detector, cannot solve consensus with resilience t > n/2

Proof by contradiction:
  Assume it is possible, and assume N=10 and t=6
  The ◊P detector initially tolerates any behavior

Red nodes dead. Green nodes alive. Detectors behave perfectly.
Consensus will be 1 at some time t1

[Figure: four green nodes, each holding value 1]

Page 16: Distributed Algorithms 2g1513

Tolerance of Eventuality (2/3)

Eventually perfect detector, cannot solve consensus with resilience t > n/2

Proof by contradiction:
  Assume it is possible, and assume N=10 and t=6
  The ◊P detector initially tolerates any behavior

Red nodes dead. Blue nodes alive. Detectors behave perfectly.
Consensus will be 0 at some time t0

[Figure: four blue nodes, each holding value 0]

Page 17: Distributed Algorithms 2g1513

Tolerance of Eventuality (3/3)

Eventually perfect detector, cannot solve consensus with resilience t > n/2

Proof by contradiction:
  Assume it is possible, and assume N=10 and t=6
  The ◊P detector initially tolerates any behavior

[Figure: four blue nodes holding value 0 and four green nodes holding value 1]

Until time t1, green nodes think blue and red nodes are dead. Hence, agreement on 1
Until time t0, blue nodes think green and red nodes are dead. Hence, agreement on 0
Green and blue nodes are all alive but decide differently: contradiction
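
The argument needs two disjoint groups of live nodes, each large enough to decide on its own (N-t nodes). A quick check of when such groups fit, using a hypothetical helper name:

```python
def disjoint_groups_fit(n, t):
    """True if two disjoint groups of n - t processes fit among n processes."""
    return 2 * (n - t) <= n


slide_example = disjoint_groups_fit(10, 6)   # two groups of 4 out of 10 nodes
minority_faults = disjoint_groups_fit(10, 4)  # two groups of 6 do not fit in 10
```

With N=10 and t=6 the green and blue groups of four coexist, which is what makes the split decision possible. With t=4 any two groups of six overlap, so the construction breaks down.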

Page 18: Distributed Algorithms 2g1513

Consensus: Rotating Coordinator for ◊S

For the eventually strong detector
  The trivial rotating coordinator will not work
  Why?
    “Eventually” might imply no consensus in first round!

Trivial solution:
  Rotate forever
  Eventually all nodes collect one coordinator: consensus

Problem?
  Termination: How do we know when to finish?

Page 19: Distributed Algorithms 2g1513

Idea for termination

Bound the number of failures
  Fewer than a third might fail (t < n/3)

Similar to rotating coordinator for S:
  1) Everyone sends vote to coordinator c
  2) c picks majority vote V, and broadcasts V
  3) Every node gets the broadcast, changes vote to V
  4) Change coordinator c and goto 1)

Page 20: Distributed Algorithms 2g1513

Consensus: Rotating Coordinator for ◊S

xi := input
r:=0
while true do
begin
  r:=r+1
  c:=(r mod N)+1                         { rotate to coordinator c }
  send <value, xi, r> to pc              { all send value to coord }

Page 21: Distributed Algorithms 2g1513

Consensus: Rotating Coordinator for ◊S

xi := input
r:=0
while true do
begin
  r:=r+1
  c:=(r mod N)+1                         { rotate to coordinator c }
  send <value, xi, r> to pc              { all send value to coord }

  if i==c then                           { coord only }
  begin
    msgs[0]:=0; msgs[1]:=0;              { reset 0 and 1 counter }
    for x:=1 to N-t do
    begin
      receive <value, V, R> from q       { receive N-t msgs }
      msgs[V]:=msgs[V]+1;                { increase relevant counter }
    end
    if msgs[0]>msgs[1] then v:=0 else v:=1 end    { choose majority value }
    forall j do send <outcome, v, r> to pj        { send v to all }
  end

Page 22: Distributed Algorithms 2g1513

Consensus: Rotating Coordinator for ◊S

xi := input
r:=0
while true do
begin
  r:=r+1
  c:=(r mod N)+1                         { rotate to coordinator c }
  send <value, xi, r> to pc              { all send value to coord }

  if i==c then                           { coord only }
  begin
    msgs[0]:=0; msgs[1]:=0;              { reset 0 and 1 counter }
    for x:=1 to N-t do
    begin
      receive <value, V, R> from q       { receive N-t msgs }
      msgs[V]:=msgs[V]+1;                { increase relevant counter }
    end
    if msgs[0]>msgs[1] then v:=0 else v:=1 end    { choose majority value }
    forall j do send <outcome, v, r> to pj        { send v to all }
  end

  if collect<outcome, v, r> from pc then { collect value from coord }
  begin
    xi := v                              { change input to v }
  end
end

Page 23: Distributed Algorithms 2g1513

Loop Invariant

If k ≥ N-t nodes agree on a value V before a round
  Then at least k nodes agree on V after the round

Why?
  At most t did not vote V
  Will only change value to X if X proposed by coord
  Coord only proposes X if majority of N-t voted X
    N-t > 2N/3, majority of N-t is more than N/3 nodes
    More than N/3 voted X, and N/3 > t, so X has to be V
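
The counting argument can be checked by brute force for small systems. The sketch below enumerates every way the coordinator might receive N-t of the votes when at least N-t nodes hold V, and applies the same majority rule as the pseudocode (ties go to 1). The function names are hypothetical.

```python
from itertools import combinations


def coordinator_choice(votes):
    """Majority rule from the pseudocode: 0 only on a strict majority of zeros."""
    zeros = votes.count(0)
    ones = votes.count(1)
    return 0 if zeros > ones else 1


def invariant_holds(n, t):
    """Check that the coordinator always proposes V when at least n - t nodes hold V."""
    for v in (0, 1):
        for k in range(n - t, n + 1):
            votes = [v] * k + [1 - v] * (n - k)
            # Any n - t of the n votes may be the ones that reach the coordinator.
            for sample in combinations(votes, n - t):
                if coordinator_choice(list(sample)) != v:
                    return False
    return True


holds_with_t_below_third = invariant_holds(7, 2)   # t < N/3
holds_with_t_equal_third = invariant_holds(6, 2)   # t = N/3
```

With N=7 and t=2 every sample of five votes contains at least three votes for V. With N=6 and t=2 a two-against-two tie is possible and the coordinator may propose the other value, so the bound t < n/3 is needed for this argument.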

Page 24: Distributed Algorithms 2g1513

Enforcing Decision

Coordinator checks if all N-t voted the same
  Broadcast that information

If coordinator says all N-t voted the same
  Decide for that value!

Page 25: Distributed Algorithms 2g1513

Consensus: Rotating Coordinator for ◊S

xi := input
r:=0
while true do
begin
  r:=r+1
  c:=(r mod N)+1                         { rotate to coordinator c }
  send <value, xi, r> to pc              { all send value to coord }

  if i==c then                           { coord only }
  begin
    msgs[0]:=0; msgs[1]:=0;              { reset 0 and 1 counter }
    for x:=1 to N-t do
    begin
      receive <value, V, R> from q       { receive N-t msgs }
      msgs[V]:=msgs[V]+1;                { increase relevant counter }
    end
    if msgs[0]>msgs[1] then v:=0 else v:=1 end              { choose majority value }
    if msgs[0]==0 or msgs[1]==0 then d:=1 else d:=0 end     { all same? }
    forall j do send <outcome, d, v, r> to pj               { send d and v to all }
  end

  if collect<outcome, d, v, r> from pc then   { collect value from coord }
  begin
    xi := v                              { change input to v }
    if d then decide(v)                  { decide if d is true }
  end
end
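
A rough simulation of the pseudocode above, under assumptions that are not in the slides: rounds are synchronous, crashed processes are dead from the start, the coordinator receives the values of the N-t lowest-numbered live processes, and detector misbehavior is given by a function. The names `run_consensus` and `suspects` are hypothetical.

```python
def run_consensus(inputs, t, crashed, suspects, max_rounds=20):
    """Simulate the rotating coordinator with decision flag d.

    inputs: dict mapping ids 1..N to binary inputs.
    crashed: ids dead from the start (at most t of them).
    suspects(i, c, r): True if process i wrongly gives up on live coordinator c in round r.
    Returns a dict of decisions made by the live processes.
    """
    n = len(inputs)
    x = dict(inputs)
    decided = {}
    alive = [i for i in sorted(x) if i not in crashed]
    for r in range(1, max_rounds + 1):
        c = (r % n) + 1
        if c in crashed:
            continue  # completeness: everyone eventually suspects c
        # Assumption: the first N - t values to arrive come from the lowest ids.
        votes = [x[i] for i in alive][: n - t]
        zeros, ones = votes.count(0), votes.count(1)
        v = 0 if zeros > ones else 1
        d = zeros == 0 or ones == 0
        for i in alive:
            if i != c and suspects(i, c, r):
                continue  # premature timeout: keep the old value
            x[i] = v
            if d and i not in decided:
                decided[i] = v
        if len(decided) == len(alive):
            break
    return decided


# The detector misbehaves arbitrarily in rounds 1 and 2, then becomes accurate.
decisions = run_consensus(
    inputs={1: 0, 2: 1, 3: 1, 4: 0},
    t=1,
    crashed={4},
    suspects=lambda i, c, r: r < 3,
)
```

In this run nobody decides while the detector misbehaves. In round 4 every live process adopts the coordinator's value 1, and in round 5 the coordinator sees identical votes, sets d, and everyone decides 1.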

Page 26: Distributed Algorithms 2g1513

Liveness: Decide will happen

Eventually some node q will not be falsely suspected
  Eventually q is coord
    Everyone collects its vote V
    Everyone adopts V
  From now on all k nodes will vote V
  Next time q is coord, d=1
    Everyone decides

Page 27: Distributed Algorithms 2g1513

Summary

Failure Detectors simplify programming
  Can solve consensus and many other problems

Two main requirements
  Completeness (detecting failed processes)
  Accuracy (not detecting alive nodes)

Two main classes
  Those that behave well always
  Those that eventually behave well

Page 28: Distributed Algorithms 2g1513

Summary

Failure Detectors
  Simple abstraction

Characterization
  Completeness
  Accuracy (strongly vs. weakly)

Four important classes of detectors
  Can be used to solve consensus with high resilience