TRANSCRIPT
Fundamentals Stream Session 9: Fault Tolerance & Dependability II
CSC 253
Gordon Blair, François Taïani
Distributed Systems
CSC253 / 2005-06 G. Blair/ F. Taiani 2
Overview of the Session
Investigate advanced issues of replication
- Passive replication: the output commit problem; how to provide exactly-once semantics
- Active replication: fault-tolerant totally ordered multicast; link with consensus algorithms
Replication & Consistency
Reminder: passive replication (aka primary-backup)
- FEs communicate with a single primary Replica Manager (RM), which must then communicate with the secondary RMs
- Requires election of a new primary in case of primary failure
- Only tolerates crash faults (silent failure of replicas)
- Variant: the primary's state is saved to stable storage (cold passive); saving to stable storage is known as "checkpointing"
[Diagram: clients (C) connect via front ends (FE) to the primary RM, which propagates updates to two backup RMs (see W8)]
Consistency & Recovery
"A system recovers correctly if its internal state is consistent with the observable behaviour of the system before the failure" [Strom and Yemini 1985]
[Diagram: system state before and after a failure; caption: "Let's pretend this never happened"]
Replication & Consistency
The problem with passive replication
- Primary/backup hand-over consistency issue: when the primary crashes, the backup might be lagging behind; on recovery the backup does not resume exactly where the primary left off; risk of inconsistency from the client's point of view
- How to avoid this? Synchronise the backup with the primary (aka checkpointing) more frequently; but too frequent means high overhead: "enough synchronisation but not more"
Our Assumptions
In the following we assume:
- messages that are sent do arrive (FIFO reliable communication)
- the switch from primary to backup is transparent to the client
- the client will replay requests for which it does not get a reply
Our goal: a "smooth" hand-over to the backup on primary crash
We don't consider the following cases: backup crashes; client crashes; any arbitrary failure ("wrong" messages)
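The client-side replay assumption can be sketched in Python. This is an illustrative sketch, not the lecture's own code; `flaky_send` simulates a channel that loses the first two requests.

```python
import itertools

def call_with_retry(send, request, max_attempts=5):
    """Resend the same request until a reply arrives.

    On its own this gives at-least-once semantics at the transport
    level; the server must deduplicate to achieve exactly-once.
    """
    for _ in range(max_attempts):
        reply = send(request)       # returns None on timeout
        if reply is not None:
            return reply
    raise TimeoutError(f"no reply after {max_attempts} attempts")

# Simulated lossy channel: the first two sends time out.
_attempts = itertools.count()
def flaky_send(request):
    return ("ok", request) if next(_attempts) >= 2 else None
```

Run against the flaky channel, the client transparently masks the two lost requests and eventually returns `("ok", "+2")` for `call_with_retry(flaky_send, "+2")`.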
Replication & Consistency
- Do not show to clients results that the backup has not seen
- Known as the output commit problem
- How to avoid this?
[Diagram: the primary replies to the client before checkpointing to the backup; when it crashes, the backup resumes from an older state than the one the client has already observed]
Output Commit Problem: Algorithm
- Always checkpoint before sending the reply
- Similar to what is done in 2PC distributed commit
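The checkpoint-before-reply rule can be sketched as follows. This is a minimal illustration with made-up class names, assuming the checkpoint message reliably reaches the backup:

```python
class Backup:
    """Holds the last state the primary checkpointed."""
    def __init__(self):
        self.state = 0

    def checkpoint(self, state):
        self.state = state


class Primary:
    def __init__(self, backup):
        self.state = 0
        self.backup = backup

    def handle(self, delta):
        self.state += delta                 # 1. execute the request
        self.backup.checkpoint(self.state)  # 2. checkpoint BEFORE replying
        return self.state                   # 3. only now send the reply
```

Because the backup is updated before the reply leaves the primary, a client can never observe a result the backup has not seen.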
[Diagram: for each request the primary first checkpoints its new state to the backup, and only then sends the reply to the client]
There is still a problem with this new algorithm. Which one?
More-than-Once Problem
- The primary might crash before sending its reply
- The client will time out and resend its request
- But the request is then executed twice!
[Diagram: the primary checkpoints its new state to the backup but crashes before replying; the client times out and replays its request (3 bis), which the backup executes a second time]
How do we avoid this?
Exactly-Once: Solution 1
To discard duplicate requests due to a crash:
- Attach a running request ID to each request
- The request ID is specific to each client (several clients may be active)
- Primary and backup remember the last request ID from a given client
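Solution 1 can be sketched as a server that remembers, per client, the ID of the last request it executed. Names are illustrative, not from the lecture:

```python
class Server:
    """Discard duplicates by remembering each client's last request ID."""
    def __init__(self):
        self.state = 0
        self.last_id = {}   # client -> last executed request ID

    def handle(self, client, req_id, delta):
        if self.last_id.get(client) == req_id:
            return None     # duplicate: already executed, do nothing
        self.state += delta
        self.last_id[client] = req_id
        return self.state
```

Note that on a duplicate this server "does nothing": the replayed request gets no meaningful reply, which hints at why this scheme breaks once another client has advanced the state in between.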
[Diagram: requests carry IDs ([+2]1, [3]2); primary and backup store the last executed ID per client. On the replayed request 2 (bis), the backup says: "I've already seen this request (or rather my previous incarnation did). I don't do anything."]
This solution might break with multiple clients. Do you see how?
Multi-Client Problem
- The problem is not that C sees the effect of D's operation: this can happen in a failure-free execution, so it is a valid possibility
- The problem is that no failure-free execution could ever return 8
[Diagram: after D's request brings the state to 8, the backup detects C's replayed request 2 as a duplicate ("I've already seen this request from C. I don't do anything.") and C ends up seeing 8, a result no failure-free execution could produce]
Exactly-Once: Solution 2 (multi-client safe)
- Replies are logged along with the request IDs
- Old replies can be replayed: "as if" the network were very slow
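Solution 2 can be sketched by extending the per-client record with the logged reply itself, so a duplicate request replays the old answer instead of being silently dropped. This is an illustrative sketch with invented names:

```python
class ReplyLogServer:
    """Exactly-once, multi-client safe: log each reply with its
    request ID and replay it when a duplicate arrives."""
    def __init__(self):
        self.state = 0
        self.log = {}   # client -> (last request ID, logged reply)

    def handle(self, client, req_id, delta):
        if client in self.log and self.log[client][0] == req_id:
            return self.log[client][1]   # duplicate: replay the old reply
        self.state += delta
        self.log[client] = (req_id, self.state)
        return self.state
```

Even after another client D has moved the state forward, C's replayed request gets back the reply C would have received in a failure-free run, not the current state.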
[Diagram: primary and backup log each reply with its request ID (C=[2],1 then C=[6],2; D=[8],1). On C's replayed request 2 (bis), the backup says: "I've already seen this request from C. I send the logged reply back", returning 6, the equivalent failure-free reply]
Notes on Solution 2
- "Smooth" hand-over from primary to backup on crash: all operations are executed exactly once
- Primary failure is masked but not completely transparent: client C receives its reply much later than in the failure-free case
- From C's perspective, it is as if the network were very slow: the replay is handled by C's middleware, transparently to the application; the reply simply takes a while to come back; this could also happen in a failure-free run
- A major disturbance (server crash) is replaced by a minor annoyance (network delay)
- Graceful degradation: a lesser quality of service, but still running (see W8: QoS)
Further Thoughts
- The previous algorithm does not scale to many clients or a large state: with millions of clients and a big database it becomes intractable
- Solution: use logging
  - "save" a log of the operations performed on the primary, either on stable storage or by sending it to the backup
  - regularly "flush" the log by checkpointing the whole server state
  - on recovery: latest checkpoint + reapply the current log
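The checkpoint-plus-log scheme can be sketched as follows; names are illustrative, and the log is kept in memory here where a real system would use stable storage or the backup:

```python
class LoggingServer:
    """Recovery sketch: periodic full checkpoint plus an operation log.
    Recovery = restore the latest checkpoint, then reapply the log."""
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []

    def handle(self, delta):
        self.log.append(delta)   # "save" the operation before applying it
        self.state += delta
        return self.state

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log = []            # flush the log: it is now redundant

    def recover(self):
        self.state = self.checkpoint
        for delta in self.log:   # reapply operations logged since then
            self.state += delta
        return self.state
```

Checkpointing bounds the recovery time (only operations since the last checkpoint are replayed) while the log keeps the per-request cost small.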
- All the above assumes a sequential (i.e. single-threaded) server, whose state is saved when no request is in progress; this is much more difficult with a concurrent (multithreaded) server, where a "hot" checkpointing/backup ability is needed for consistency; not addressed here
What to Remember from This
- The actual algorithm is not important; what is important are the issues that were raised
- Output commit problem: clients should not see changes that have not been made permanent
- Duplicate requests and exactly-once semantics: no time-out and retry gives at-most-once semantics; time-out and retry gives at-least-once semantics; exactly-once requires some atomicity mechanism
- Here, what is atomic is the "checkpointing" message to the backup: either the backup receives it or it does not
Active Replication Reminder
- Front Ends multicast requests to every Replica Manager
- Appropriate reliable group communication is needed; ordering guarantees are crucial (see W4: Group Communication)
- Tolerates crash faults and arbitrary faults
[Diagram: clients (C) connect via front ends (FE), which multicast each request to all three RMs (see W8, W4)]
Realising Active Replication
- Group communication is needed, with the total ordering property (see the W4 lecture for what happens without total ordering)
- In W4, two solutions for totally ordered multicast were presented: a centralised sequencer, and time-stamping with logical clocks
- Problem: neither of them is fault-tolerant: the centralised sequencer is a single point of failure, and with time-stamping the crash of any participant blocks the algorithm
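The sequencer-based approach from W4 can be sketched as follows (illustrative names, assuming reliable delivery). The sketch makes the weakness concrete: every message goes through the one `Sequencer` object, so its crash stops all ordering.

```python
import itertools

class Sequencer:
    """Centralised sequencer: stamps every message with a global
    sequence number. A single point of failure."""
    def __init__(self):
        self.counter = itertools.count()

    def stamp(self, msg):
        return (next(self.counter), msg)


class Receiver:
    """Delivers messages strictly in sequence-number order,
    holding back any that arrive early."""
    def __init__(self):
        self.next_seq = 0
        self.pending = {}    # seq -> message received out of order
        self.delivered = []

    def receive(self, seq, msg):
        self.pending[seq] = msg
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
```

Even if the network reorders the stamped messages, every receiver delivers them in the same global order, which is exactly what active replication needs.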
- We need a fault-tolerant (atomic) totally ordered multicast: tolerating crash faults if active replication is used against them, and arbitrary faults if active replication is used against them (see W4)
Total Ordering and Consensus
- Realising total ordering is equivalent to realising distributed consensus
- Distributed consensus: all participants start by proposing a value; at the end of the algorithm one of the proposed values has been picked, and everybody agrees on this choice
- From distributed consensus to total ordering: each participant proposes the next message it would like to accept; using consensus, everybody agrees on the next message; this message is the next one delivered
- Fault-tolerant consensus gives fault-tolerant total ordering
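The reduction from consensus to total ordering can be sketched abstractly. Here `decide` stands in for a real (fault-tolerant) consensus call; the sketch uses `min` as a trivially deterministic stand-in, and all names are illustrative:

```python
def total_order(outgoing, decide):
    """Build a total order by repeated consensus rounds.

    `outgoing` maps each participant to the messages it wants to
    broadcast, in its local send order. In each round every participant
    proposes the next message it would like to accept, and `decide`
    (an abstract consensus call) picks one proposal that everybody
    then delivers next.
    """
    delivered = []
    pending = {p: list(msgs) for p, msgs in outgoing.items()}
    while any(pending.values()):
        proposals = [msgs[0] for msgs in pending.values() if msgs]
        chosen = decide(proposals)   # everybody agrees on this message
        delivered.append(chosen)
        for msgs in pending.values():
            if msgs and msgs[0] == chosen:
                msgs.pop(0)          # the chosen message is now delivered
    return delivered
```

Because `decide` returns the same value at every participant, every participant computes the same `delivered` sequence, i.e. a total order.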
[Diagram: participants A-D each propose a value; consensus picks one of the proposed values (here 7) and everybody decides on it]
Total Ordering and Consensus
[Diagram: participants A-D hold pending messages m1, m2, m3 in their atomic broadcast queues; repeated consensus selects the next message to deliver (here m3), producing the same totally ordered queue at every participant]
Fault-Tolerant Consensus
- Main idea 1: do not be blocked by crashed processes: rely on failure detection to stop waiting for crashed processes
- Main idea 2: propagate the influence of crashed processes: before crashing, a process might have communicated with others; these messages must be shared with all non-crashed processes (or with none of them), as they might influence the consensus outcome
- The properties of the failure detector are essential: in reality false positives happen (time-outs and slow networks); different algorithms exist for different classes of imperfect failure detectors; the more imperfect the detector, the fewer crashes can be tolerated
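A simple timeout-based failure detector can be sketched as follows (illustrative names). It makes the false-positive problem visible: a process whose heartbeats are merely delayed by a slow network is wrongly suspected, which is exactly why consensus algorithms must cope with imperfect detectors.

```python
class HeartbeatDetector:
    """Timeout-based failure detector: a process is suspected if its
    last heartbeat is older than `timeout` time units."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}   # process -> time of last heartbeat

    def heartbeat(self, process, now):
        self.last_seen[process] = now

    def suspects(self, now):
        # A suspicion is revoked as soon as a new heartbeat arrives,
        # so slow (not crashed) processes are only suspected temporarily.
        return {p for p, t in self.last_seen.items()
                if now - t > self.timeout}
```

With a timeout of 5, a process silent since time 0 is suspected at time 8; if its delayed heartbeat then arrives, the suspicion is dropped again.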
- FT consensus algorithms even exist for arbitrary failures, but even more redundancy is required
FT Consensus (strong failure detector, crash faults)
For information only. Not exam material.
[Chandra & Toueg, 1996]
Expected Learning Outcomes
At the end of this 8th unit:
- You should understand what the output commit problem is about
- You should appreciate the mechanisms involved in realising exactly-once semantics in the presence of crash faults
- You should understand why fault-tolerant totally ordered multicast is essential to active replication
- You should understand the relationship between totally ordered multicast and distributed consensus
References
- E. N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, David B. Johnson, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems", ACM Computing Surveys, Vol. 34, Issue 3 (September 2002), pp. 375-408. A very good and extensive survey of the algorithmic issues.
- Tushar Deepak Chandra, Sam Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems", Journal of the ACM (JACM), Vol. 43, Issue 2 (March 1996), pp. 225-267. Fundamental consensus algorithms under various assumptions.