Download - Group Communication: from practice to theory
![Page 1: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/1.jpg)
SOFSEM 2006 André Schiper 1
Group Communication:from practice to theory
André SchiperEPFL, Lausanne,
Switzerland
![Page 2: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/2.jpg)
SOFSEM 2006 André Schiper 2
Outline
• Introduction to fault tolerance• Replication for fault tolerance• Group communication for replication• Implementation of group communication
Practical issues
Theoretical issues
![Page 3: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/3.jpg)
SOFSEM 2006 André Schiper 3
Introduction
• Computer systems become every day more complex• Probability of faults increases• Need to develop techniques to achieve fault tolerance
![Page 4: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/4.jpg)
SOFSEM 2006 André Schiper 4
Introduction (2)
Fault tolerant techniques developed over the year
• Transactions (all or nothing property)• Checkpointing (prevents state loss)• Replication (masks faults)
![Page 5: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/5.jpg)
SOFSEM 2006 André Schiper 5
Transactions
• ACID properties (Atomicity, Consistency, Isolation, Durability)
begin-transactionremove (100, account-1)deposit (100, account-2)
end-transaction
Crash during the transaction: rollback to the state before the beginning of the transaction
account-1 account-2
site 1 site 2
![Page 6: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/6.jpg)
SOFSEM 2006 André Schiper 6
Checkpointing
Crash of p1: rollback of p1 to the latest saved state
chkpt chkpt
chkpt
chkpt
Xcrash
p1
p3
p2
![Page 7: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/7.jpg)
SOFSEM 2006 André Schiper 7
Replication
• Crash of s1 is masked to the client• Upon recovery of s1: state transfer to bring s1 up-to-date
s1
s2
s3
client
replicated server
request
response
![Page 8: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/8.jpg)
SOFSEM 2006 André Schiper 8
Introduction to FT (6)
Comparison of the three techniques • Only replication masks crashes (i.e., ensures high
availability)• Transactions and checkpointing: progress is only possible
after recovery of the crashed process
We will concentrate on replication
![Page 9: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/9.jpg)
SOFSEM 2006 André Schiper 9
Outline
• Introduction to fault tolerance• Replication for fault tolerance• Group communication for replication• Implementation of group communication
![Page 10: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/10.jpg)
SOFSEM 2006 André Schiper 10
Context
Replication in the general context of client-server interaction:
c srequest
response
serverclients(several)
![Page 11: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/11.jpg)
SOFSEM 2006 André Schiper 11
Context (2)
• Fault tolerance by replicating the server
cs2
request
response
s3
s1
clients(several)
server s
![Page 12: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/12.jpg)
SOFSEM 2006 André Schiper 12
Correctness criterion: linearizability
• Need to keep the replica consistent• Consistency defined in terms of the operations executed by
the clients
ci
cj
S1
S2
S3
op
op
![Page 13: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/13.jpg)
SOFSEM 2006 André Schiper 13
Correctness criterion: linearizability (2)
Example:• Server: FIFO queue• Operations: enqueue/dequeue
![Page 14: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/14.jpg)
SOFSEM 2006 André Schiper 14
Correctness criterion: linearizability (3)
Example 1
p
qexecution
execution
enq(a)
enq(b)
deq( )
deq( )
b
a
linearizable execution
![Page 15: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/15.jpg)
SOFSEM 2006 André Schiper 15
Correctness criterion: linearizability (4)
Example 2
p
q
enq(a)
enq(b)
deq( )
deq( )
b
a
non linearizable execution
![Page 16: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/16.jpg)
SOFSEM 2006 André Schiper 16
Replication techniques
Two main replication techniques for linearizability
• Primary-backup replication (or passive replication)• Active replication (or state machine replication)
![Page 17: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/17.jpg)
SOFSEM 2006 André Schiper 17
Primary-backup replication
client
S1 primary
S2 backup
S3 backup
request
update
response
![Page 18: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/18.jpg)
SOFSEM 2006 André Schiper 18
Primary-backup replication
• Crash cases
client
S1 primary
S2 backup
S3 backup
request
update
response
1 2 3
![Page 19: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/19.jpg)
SOFSEM 2006 André Schiper 19
Primary-backup replication (3)
• Crash detection usually based on time-outs• Time-outs may lead to incorrectly suspect the crash of a
process• The technique must work correctly even if processes are
incorrectly suspected to have crashed
![Page 20: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/20.jpg)
SOFSEM 2006 André Schiper 20
Active replication
client
S1
S2
S3
request response
wait for the first reply
Crash of a server replica transparent to the client
![Page 21: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/21.jpg)
SOFSEM 2006 André Schiper 21
Active replication (2)
• If more than one client, the requests must be received by all the replicas in the same order
• Ensured by a communication primitive called total order broadcast (or atomic broadcast)
• Complexity of active replication hidden in the implementation this primitive
S1
S3
C
C’
S2
![Page 22: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/22.jpg)
SOFSEM 2006 André Schiper 22
Outline
• Introduction to fault tolerance• Replication for fault tolerance• Group communication for replication• Implementation of group communication
![Page 23: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/23.jpg)
SOFSEM 2006 André Schiper 23
Role of group communication
• Active replication: requires communication primitive that orders client requests
• Passive replication: have shown the issues to be addressed• Group communication: communication infrastructure that
provides solutions to these problems
Replication technique
Group communication
Transport layer
![Page 24: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/24.jpg)
SOFSEM 2006 André Schiper 24
Role of group communication (2)
Let g be a group with members p, q, r• multicast(g, m): allows m to be multicast to g without
knowing the membership of g• IP-multicast (UDP) also provides such a feature• Group communication provides stronger guarantees
(reliability, order, etc.)
![Page 25: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/25.jpg)
SOFSEM 2006 André Schiper 25
Role of group communication (3)
We will discuss:
• Group communication for active replication• Group communication for passive replication
![Page 26: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/26.jpg)
SOFSEM 2006 André Schiper 26
Atomic broadcast for active replication
Group communication primitive for active replication:• atomic broadcast (denoted sometimes abcast)• also called total order broadcast
Replication technique
Group communication
Transport layer
abcast (g, m) adeliver (m)
receive (m)send (m) to p
![Page 27: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/27.jpg)
SOFSEM 2006 André Schiper 27
Atomic broadcast for active replication (2)
S1
S3
C
C’
S2
abcast(gS, req)
abcast(gS, req’)
group gS
![Page 28: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/28.jpg)
SOFSEM 2006 André Schiper 28
Atomic broadcast: specification
• If p executes abcast(g, m) and does not crash, then all processes in g eventually adeliver m
• If some process in g adelivers m and does not crash, then all processes in g that do not crash eventually adeliver m
• If p, q in g adeliver m and m’, then they adeliver them in the same order
![Page 29: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/29.jpg)
SOFSEM 2006 André Schiper 29
Role of group communication (3)
We will discuss:
• Group communication for active replication• Group communication for passive replication
![Page 30: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/30.jpg)
SOFSEM 2006 André Schiper 30
Generic broadcast for passive replication
• Atomic broadcast can also be used to implement passive replication
• Better solution: generic broadcast• Same spec as atomic broadcast, except that not all
messages are ordered:– Generic broadcast based on a conflict relation on the messages– Conflicting messages are ordered, non conflicting messages are
not ordered
![Page 31: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/31.jpg)
SOFSEM 2006 André Schiper 31
Generic broadcast for passive replication (2)
• Two types of messages:– Update– PrimaryChange: issued if a process suspects the current
primary Upon delivery: cyclic permutation of process listNew primary: process at the head of the process list
• Conflict relation:PrimaryChange Update
PrimaryChange no conflict conflict
Update conflict conflict
![Page 32: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/32.jpg)
SOFSEM 2006 André Schiper 32
Generic broadcast for passive replication (3)
client
S1 primary
S2 backup
S3 backup
request
Update
PrimaryChange
![Page 33: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/33.jpg)
SOFSEM 2006 André Schiper 33
Generic broadcast for passive replication (4)
Another way to view the strategy:• Replicas numbered 0 .. n-1• Computation divided into rounds• In round r, the primary is the process with number r mod n• When process p suspects the primary of round r, it
broadcasts newRound (= primaryChange)newRound Update
newRound no conflict conflict
Update conflict conflict
Conflict relation
All processes deliver Update in the same round
![Page 34: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/34.jpg)
SOFSEM 2006 André Schiper 34
Other group communication issues
To discuss:• Static vs. dynamic group• Crash-stop vs. crash-recovery model
![Page 35: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/35.jpg)
SOFSEM 2006 André Schiper 35
Static vs. dynamic groups
• Static group: group whose membership does not change during the lifetime of the system
• A static group is sometimes too limitative from a practical point of view: may want to replace a crashed replica with a new replica
• Dynamic group: group whose membership changes during the lifetime of the system
• Dynamic group requires to address two problems:– How to add / remove processes from the group (group
membership problem)– Semantics of communication primitives
![Page 36: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/36.jpg)
SOFSEM 2006 André Schiper 36
Group membership problem
• Need to distinguish– (1) group – (2) membership of the group at time t
• New notion: view of group g defined by: (i, s)– i identifies the view– s: membership of view i (set of processes)– membership i also denoted vi
![Page 37: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/37.jpg)
SOFSEM 2006 André Schiper 37
Group membership problem (2)
Changing the membership of a group:• add (g, p): for adding p to group g• remove (g, p): for removing p from group g
Usual requirement: all the members of a group see the same sequence of views:
if (i, sp) the view i of p
and (i, sq) the view i of q
then sp = sq
![Page 38: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/38.jpg)
SOFSEM 2006 André Schiper 38
Process recovery
• Usual theoretical model: crash-stop – processes do not have access to stable storage (disk)– if a process crashes, its whole state is lost (i.e. no recovery
possible)
• Drawback of the static crash-stop model: if non null crash probability, eventually all processes will have crashed
• Drawback of the dynamic crash-stop model: not tolerant to catastrophic failures (crash of all the processes in a group)
• To tolerate catastrophic failures: crash-recovery model
![Page 39: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/39.jpg)
SOFSEM 2006 André Schiper 39
Combining the different GC models
Combining static/dynamic with crash-stop/crash-recovery:
1. Static groups in the crash-stop modelConsidered in most theoretical papers
2. Dynamic groups in the crash-stop model Considered in most existing GC systems
3. Static groups in the crash-recovery model4. Dynamic groups in the crash-recovery model
![Page 40: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/40.jpg)
SOFSEM 2006 André Schiper 40
Quorum systems vs. group communication
• Quorum system: context of read/write operations• Typical situation:
– Access a majority of replicas to perform a read operation
– Access a majority of replicas to perform a write operation
![Page 41: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/41.jpg)
SOFSEM 2006 André Schiper 41
Quorum systems vs. group communication
Example: • a replicated server managing an integer with read/write
operations (called register)
• operation to perform: increment
![Page 42: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/42.jpg)
SOFSEM 2006 André Schiper 42
Quorum systems vs. group communication (2)
• Solution with quorum system:
• Solution with group communication
clients1
s2
s3
clients1
s2
s3
Read
Increment
Write
Increment
Critical section
![Page 43: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/43.jpg)
SOFSEM 2006 André Schiper 43
Outline
• Introduction to fault tolerance• Replication for fault tolerance• Group communication for replication• Implementation of group communication
![Page 44: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/44.jpg)
SOFSEM 2006 André Schiper 44
Implementation of group communication
Context:• Static groups• Crash-stop model• Non-malicious processes
• Atomic broadcast• Generic broadcast
Consensus
Atomicbroadcast
Genericbroadcast
Common denominator: Consensus
![Page 45: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/45.jpg)
SOFSEM 2006 André Schiper 45
Consensus (informal)
p1
p2
p3
4
7
1
77
7
![Page 46: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/46.jpg)
SOFSEM 2006 André Schiper 46
Consensus (formal)
A set of processes, each pi with an initial value vi
Three properties:
Validity: If a process decides v, then v is the initial value of some process
Agreement: No two processes decide differentlyTermination: Every process eventually decides some value
![Page 47: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/47.jpg)
SOFSEM 2006 André Schiper 47
Impossibility result
Asynchronous system:• No bound on the message transmission delay• No bound on the process relative speeds
Fischer-Lynch-Paterson impossibility result (1985):Consensus is not solvable in an asynchronous system with a
deterministic algorithm and reliable links if one single process may crash
![Page 48: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/48.jpg)
SOFSEM 2006 André Schiper 48
Impossibility of atomic broadcast
• By contradiction
p1
p2
p3
4
7
1
7
7
7Abcast(4)
Abcast(7)
Abcast(1)
Adeliver(7)
Adeliver(7)
Adeliver(7)
![Page 49: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/49.jpg)
SOFSEM 2006 André Schiper 49
Models for solving consensus
Synchronous system:– There is a known bound on the transmission delay of messages– There is a known bound on the process relative speeds
Consensus solvable with up to n-1 faulty processes
Drawback: • Requires to be pessimistic (large bounds)• Large bounds lead to a large blackout period in case of a
crash
![Page 50: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/50.jpg)
SOFSEM 2006 André Schiper 50
Models for solving consensus (2)
Partially synchronous system (Dwork, Lynch, Stockmeyer, 1988)
Two variants:• There is a bound on the transmission delay of messages
and on the process relative speed, but these bounds are not known
• There are known bounds on the transmission delay of messages and on the process relative speed, but these bounds hold only from some unknown point on.
Consensus solvable with a majority of correct processes
![Page 51: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/51.jpg)
SOFSEM 2006 André Schiper 51
Models for solving consensus (3)
Failure detector model (Chandra,Toueg, 1995)• Asynchronous model augmented with an oracle (called
failure detector) defined by abstract properties
• A failure detector defined by a completeness property and an accuracy property
pFDpr FDr
qFDq
network
![Page 52: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/52.jpg)
SOFSEM 2006 André Schiper 52
Failure detector model (2)
Example: failure detector S• Strong completeness: eventually every process that crashes
is permanently suspected to have crashed by every correct process
• Eventual weak accuracy: There is a time after which some correct process is never suspected by any correct process
Consensus solvable with S and a majority of correct processes
Drawback: • requires reliable links
![Page 53: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/53.jpg)
SOFSEM 2006 André Schiper 53
RbR model
• Joint work with Bernadette Charron-Bost• Combination of:
– Gafni 1998, Round-by-round failure detectors(unifies synchronous model and failure detector model)
– Santoro, Widmayer, 1989, Time is not a healer (show that dynamic faults have the same destructive effect on solving consensus as asynchronicity)
• Existing models for solving consensus have put emphasis on static faults
• Our new model handles static faults and dynamic faults in the same way
![Page 54: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/54.jpg)
SOFSEM 2006 André Schiper 54
RbR machine
• Distributed algorithm A on the set of processes
• Communication predicate P on the HO’s Examples: P1 : p , r > 0 : | HO(p, r) | > n/2
P2 : r0, HO 2, p : HO(p, r0) = HO
• A problem is solved by a pair ( A, P )
In round r, process p receives messages from the set HO(p, r)
st st’
Round r
p
![Page 55: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/55.jpg)
SOFSEM 2006 André Schiper 55
Round-by-Round (RbR) model (2)
• Assume q HO(p, r): in the RbR model, no tentative to justify this (e.g, channel failure, crash of q, send-omission failure of q, etc.)
• This removes the difference between static and dynamic faults (static: crash-stop; dynamic: channel fault, crash-recovery)
• In the RbR model, no notion of faulty process
![Page 56: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/56.jpg)
SOFSEM 2006 André Schiper 56
Consensus algorithms
• A lot ….
• Most influential algorithms:– Consensus algorithm based on the failure detector S (Chandra-
Toueg, 1995)– Paxos (Lamport, 1989-1998)
![Page 57: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/57.jpg)
SOFSEM 2006 André Schiper 57
Consensus algorithms (2)
The CT (Chandra-Toueg) and Paxos algorithms have strong similarities but also important differences:
• CT based on rotating coordinator / Paxos based on a dynamically chosen leader
• CT requires reliable links / Paxos tolerates lossy links• The condition for liveness clearly defined in CT • The condition for liveness less formally defined in Paxos
(can be formally expressed in the RbR model)
![Page 58: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/58.jpg)
SOFSEM 2006 André Schiper 58
Paxos in the RbR modelInitialization xp := vp
votep := V {?}, initially ? voteToSendp:=false; decidedp:=false; tsp:=0
Round r = 4Φ – 3 :Sr
p : send xp, tsp to leaderp(Φ) Tr
p :if p = leaderp(Φ) and #x, ts rcvd > n/2 then let θ be the largest θ from v, θ received votep := one v such that v,θ received voteToSendp := true
Round r = 4Φ – 2 :Sr
p : if p = leaderp(Φ) and voteToSendp then send votep to all processes
Trp :
if receivedv from leaderp(Φ) then xp := v ; tsp := Φ
Round r = 4Φ – 1 :Sr
p :if tsp = Φ then send ack to leaderp(Φ)
Trp :
if p = leaderp(Φ) and #ack rcvd > n/2 then
DECIDE (votep); decidedp := true
Round r = 4Φ Sr
p : if p = coordp(Φ) and decidedp then send votep to all processes
Trp :
if received v and not decidedp then DECIDE(v) ); decidedp := true voteToSendp := false leader
![Page 59: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/59.jpg)
SOFSEM 2006 André Schiper 59
Paxos (2)
CONDITION FOR SAFETY CONDITION FOR LIVENESS
None
Φ0 > 0 :
p : | HO(p, Φ0) | > n/2
and p, q: leaderp(Φ0) = leaderq(Φ0)
and leaderp(Φ0) K(Φ0)
• Kernel of round r: K (r) = HO (p, r)
• Kernel of a phase Φ : K(Φ) = K (r)
p
r (phase = sequence of rounds)
![Page 60: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/60.jpg)
SOFSEM 2006 André Schiper 60
Atomic broadcast
• A lot of algorithms published …• Easy to implement using a sequence of instance of
consensus• Consider atomic broadcast within a static group g:
– Consensus within g on a set of messages– Let consensus #k decide on the set Msg(k)– Each process delivers the messages in Msg(k) before those in
Msg(k+1)– Each process delivers the messages in Msg(k) in a deterministic
order (e.g., in the order of the message ids)
![Page 61: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/61.jpg)
SOFSEM 2006 André Schiper 61
Atomic broadcast (2)
Leads to the following algorithm:
• Each process p manages a counter k and a set undeliveredMessages
• Upon abcast(m) do broadcast(m)• Upon reception of m do
– add m to undeliveredMessages– if no consensus algorithm is running then
Msg(k) consensus (undeliveredMessages)deliver messages in Msg(k) in some deterministic order
![Page 62: Group Communication: from practice to theory](https://reader035.vdocuments.mx/reader035/viewer/2022062814/5681684f550346895dde4d23/html5/thumbnails/62.jpg)
SOFSEM 2006 André Schiper 62
Conclusion
• Necessarily superficial presentation of group communication
• Comments:– Static/crash-stop model has reached maturity– Maturity not yet reached in the other models (static/crash-
recovery, dynamic/crash-stop, …)– With the RbR model we hope to bridge the gap between
static/crash-stop and static/crash-recovery (ongoing implementation work)
– More work needed to quantitatively compare the various atomic broadcast algorithms (and other algorithms)