Post on 20-Dec-2015
CPSC 668 Set 17: Fault-Tolerant Register Simulations 1
CPSC 668 Distributed Algorithms and Systems
Fall 2009
Prof. Jennifer Welch
Fault-Tolerant Shared Memory Simulations
• Previous algorithms implemented shared variable on top of message passing, assuming no failures.
• What if some processors might crash?
• Can we still provide a shared read/write variable on top of message passing?
• Yes, even in an asynchronous system, if we have enough nonfaulty processors.
• First, we must specify a failure-prone shared memory.
Specification of f-Resilient Shared Memory
• Inputs are invocations on the shared object.
• Outputs are responses of the shared object.
• A sequence of inputs and outputs is allowable iff:
– there is a partitioning of proc. indices into "faulty" and "nonfaulty", with at most f faulty, such that the following hold
– Correct Interaction: each proc. alternates invocations and matching responses
– Nonfaulty Liveness: Every invocation by a nonfaulty proc. has a matching response
– Extended Linearizability: Linearizability holds for all the completed operations and some subset of the pending operations
Assumptions for Algorithm
• Each read/write variable ("register") to be simulated has
– one reader and
– one writer
– (next topic will be to build more powerful variables out of these)
• There are n procs. which are cooperating to simulate a collection of such variables
• Underlying communication system is asynchronous message passing
• n > 2f (less than half the processors can crash)
Main Ideas of Algorithm
• Each simulated register has a replica stored at each of the n procs., not just at the designated reader and writer of that register.
• Use the redundant storage to provide fault-tolerance.
• Describe algorithm just for one simulated register; use a separate copy of the same algorithm in parallel for each simulated register.
Writing the Simulated Register
• generate the next sequence number
• send a message with the value and the sequence number to all the procs.
– each recipient updates its local copy of the register
• wait to get back an ack from > n/2 procs.
– safe since n - f > n/2
• do the ack for the write
Reading the Simulated Register
• send a request to all the procs.
– each recipient sends back current value of its replica
• wait to get reply from > n/2 procs.
• return value associated with largest sequence number
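The write and read steps above can be sketched as a single-process Python model. This is only an illustration under stated assumptions: the names `Replica`, `SimulatedRegister`, and the `heard` parameter (the set of procs. whose replies arrive in time) are hypothetical, messages are modeled as direct method calls, and a missing majority raises an error where a real process would simply keep waiting.

```python
class Replica:
    """One processor's local copy of the simulated register."""

    def __init__(self, initial):
        self.seq, self.value, self.crashed = 0, initial, False

    def on_write(self, seq, value):
        if self.crashed:
            return None                      # a crashed processor never answers
        if seq > self.seq:                   # keep only the newest value
            self.seq, self.value = seq, value
        return "ack"

    def on_read(self):
        return None if self.crashed else (self.seq, self.value)


class SimulatedRegister:
    def __init__(self, n, initial=0):
        self.replicas = [Replica(initial) for _ in range(n)]
        self.next_seq = 0

    def _contact(self, heard, op):
        """Apply op at the replicas in `heard` (default: all); require > n/2 replies."""
        indices = range(len(self.replicas)) if heard is None else heard
        replies = [op(self.replicas[i]) for i in indices]
        replies = [r for r in replies if r is not None]
        if 2 * len(replies) <= len(self.replicas):
            raise RuntimeError("no majority replied; a real process would keep waiting")
        return replies

    def write(self, value, heard=None):
        self.next_seq += 1                   # generate the next sequence number
        seq = self.next_seq
        self._contact(heard, lambda r: r.on_write(seq, value))

    def read(self, heard=None):
        replies = self._contact(heard, lambda r: r.on_read())
        return max(replies, key=lambda pair: pair[0])[1]   # largest sequence number wins
```

With n = 5, a write heard only by procs. 0, 1, 2 is still seen by a read that hears only from procs. 2, 3, 4, because the two majorities share proc. 2.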
Key Idea for Correctness
• Each read should return the value of "the most recent" write.
• Each read or write communicates with > n/2 procs., so the set of procs. participating in operation O1 is guaranteed to intersect with the set of procs. participating in any other operation O2.
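The intersection claim follows from a counting argument. A short derivation, where Q_1 and Q_2 (names introduced here for illustration) are the sets of procs. that O1 and O2 communicate with, each of size > n/2, and their union has at most n members:

```latex
|Q_1 \cap Q_2| \;=\; |Q_1| + |Q_2| - |Q_1 \cup Q_2| \;>\; \frac{n}{2} + \frac{n}{2} - n \;=\; 0
```

So at least one proc. takes part in both operations.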
But What About Asynchrony?
• The underlying communication system is asynchronous:
– message on behalf of one operation could be overtaken by a message on behalf of a later operation.
• Avoid such problems by adding additional mechanism to the algorithm:
– reader and writer keep track of "status" of each link
– don't send a msg on a link until ack from previous msg has been received
Outline of Correctness Proof
Interesting part is proving linearizability.
• Let ts(W) = sequence number of W
• Let ts(R) = sequence number of write that R reads from
• Let O1 → O2 denote O1 finishes before O2 starts
Key lemmas:
• If W1 → W2, then ts(W1) < ts(W2)
• If W → R, then ts(W) ≤ ts(R)
• If R → W, then ts(R) < ts(W)
• If R1 → R2, then ts(R1) ≤ ts(R2)
Matching Lower Bound on Resiliency
Theorem (10.22): No simulation of a 1-reader, 1-writer read/write linearizable register using n procs and asynchronous message passing can tolerate f ≥ n/2 crash failures.
Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.
Lower Bound Proof
• Partition procs into two sets, S0 and S1, each of size f.
• Let α0 be admissible exec. of A s.t.
– initial value of simulated register is 0
– all procs. in S1 crash initially
– proc. p0 in S0 invokes write(1) at time 0 and no other operations are invoked.
– the write completes at some time t0 without any proc. in S0 receiving a message from any proc. in S1: must happen since A is supposed to tolerate f failures.
Lower Bound Proof
• Let α1 be admissible exec. of A s.t.
– initial value of simulated register is 0
– all procs. in S0 crash initially
– proc. p1 in S1 invokes a read at time t0+1 and no other operations are invoked.
– the read completes at some time t1 without any proc. in S1 receiving a message from any proc. in S0: must happen since A is supposed to tolerate f failures
– the read returns 0: must be since A guarantees linearizability
Lower Bound Proof
• Now create admissible execution β by "merging" the views of procs in S0 from α0 and the views of procs in S1 from α1:
– messages that go between S0 and S1 are delayed so that they don't arrive until after time t1.
• β is not linearizable, since read(0) follows write(1). Contradiction.
Lower Bound Diagram for n = 2
[Figure: timelines of p0 and p1 in three executions, with times 0, t0, t0+1 and t1 marked. α0: p0 performs write(1). α1: p1 performs read(0). β: p0 performs write(1) and then p1 performs read(0).]
Simulating R/W Registers Using R/W Registers
• The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing.
• How can we get more powerful (flexible) registers, i.e., with
– more readers
– more writers
• We'll start with a warm-up:
– simulate multi-valued register using binary-valued registers
– 1-reader and 1-writer
Wait-Free Register Simulations
• Asynchronous model
• Linearizable shared registers
• Wait-free
– tolerate any number of crash failures
• We want to simulate one kind of (n-1)-resilient shared memory with another kind of (n-1)-resilient memory
– recall earlier definition of f-resilient shared memory
– recall earlier definition of one kind of communication system simulating another
Alternative Definition of Wait-Free Simulation
• Alternative definition for the wait-free case:
• The failure-free version of one communication system simulates the failure-free version of the other, and
• for any prefix of an admissible execution of the simulation algorithm in which pi has a pending operation, there is an extension in which the operation completes and only pi takes steps.
• Equivalent to previous definition, sometimes more convenient.
Proving Linearizability
• We've seen one approach:
– explicitly construct a permutation and prove that it has the desired properties
• Alternative approach:
– identify a time point for each operation, between invocation and response: linearization points
– Linearization points give the permutation
– Obviously real-time order is preserved
– Just need to show that legality holds
Overview of Register Simulations
[Figure: the register types covered by the simulations in this set, weakest first:]
• single-reader, single-writer, binary-valued
• single-reader, single-writer, multi-valued
• multi-reader, single-writer, multi-valued
• multi-reader, multi-writer, multi-valued
Multi-Valued From Binary
• Some ideas…
• Use a different binary register to store each bit of the multi-valued register being simulated
• Read algorithm is to read all the binary registers and return the resulting value
• Write algorithm is to write the new bits in some order
• Difficulties arise if the reader overlaps a slow write and sees some new bits and some old bits
A Unary Approach
• Suppose the simulated register is to take on the values {0,…,K-1}.
• Use an array of K binary registers, B[0..K-1]
– represent value v by having B[v] = 1 and the other entries 0
• Read algorithm: read B[0], B[1],…, until finding the first 1; return the index
• Write algorithm: zero out the old entry of B and set the new entry
Problems with Unary Approach
• OK if reads and writes don't overlap.
• If they do, have to worry about
– reader never finding a 1 in B
– new-old inversion: writer writes 1, then 2, but reader reads 2, then 1.
• Counter-example execution on next slide
– since binary registers are linearizable, we just mark the linearization points of the reads and writes on the binary registers
Counter-Example
Initially B[0] = B[1] = B[2] = 0 and B[3] = 1
[Figure: timelines of the writer and the reader, with the linearization points of the low-level operations marked.]
• Writer: write 1 (write 1 to B[1], write 0 to B[3]), then write 2 (write 1 to B[2], write 0 to B[1])
• Reader: read 2 (read 0 from B[0], read 0 from B[1], read 1 from B[2]), then read 1 (read 0 from B[0], read 1 from B[1])
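The counter-example can be replayed step by step. The sketch below is an illustration, not the slide's own code: each generator performs one low-level operation per step, the helper names (`naive_write`, `naive_read`, `step`) are hypothetical, and both the writer's order (set the new entry, then clear the old one) and the interleaving are assumptions chosen to be consistent with the low-level operations listed in the figure.

```python
def naive_write(B, old, new):
    """Naive unary write: one low-level write per step."""
    B[new] = 1
    yield
    B[old] = 0


def naive_read(B, results):
    """Naive unary read: scan upward, one low-level read per step."""
    for i in range(len(B)):
        if B[i] == 1:
            results.append(i)      # return the index of the first 1
            return
        yield
    results.append(None)           # fell off the end without finding a 1


def step(op):
    """Run one low-level operation of a high-level operation."""
    next(op, None)


B = [0, 0, 0, 1]                   # simulated value is 3 initially
results = []

r1 = naive_read(B, results)
step(r1)                           # read 0 from B[0]
step(r1)                           # read 0 from B[1]
w1 = naive_write(B, old=3, new=1)
step(w1)                           # write 1 to B[1]
step(w1)                           # write 0 to B[3]; write 1 is complete
w2 = naive_write(B, old=1, new=2)
step(w2)                           # write 1 to B[2]
step(r1)                           # read 1 from B[2]; first read returns 2
r2 = naive_read(B, results)
step(r2)                           # read 0 from B[0]
step(r2)                           # read 1 from B[1]; second read returns 1
step(w2)                           # write 0 to B[1]; write 2 is complete
```

After the replay, `results` holds the two returned values in order: the newer value 2 first and the older value 1 second, which is the new-old inversion.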
Corrected Multi-Valued Algorithm
• To prevent "falling off the edge" of the end of B without finding a 1, write algorithm only clears (sets to 0) entries that are smaller than the entry that is set (to 1)
• To prevent new-old inversions, read algorithm scans up to find first 1, and then scans down to make sure those entries are still 0.
– returns smallest value associated with a 1 entry in B that is observed during the downward scan
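A minimal sequential sketch of the corrected reader and writer, assuming B is a Python list of bits that stands in for the binary registers (the function names `mv_write` and `mv_read` are hypothetical, and the order "set the new entry first, then clear downward" is an assumption). It shows the shape of the two scans only; it does not model concurrency.

```python
def mv_write(B, v):
    """Set B[v], then clear only the entries below v."""
    B[v] = 1
    for i in range(v - 1, -1, -1):
        B[i] = 0


def mv_read(B):
    """Scan up to the first 1, then scan back down and keep the smallest 1 seen."""
    i = 0
    while B[i] == 0:               # upward scan
        i += 1
    v = i
    for j in range(i - 1, -1, -1): # downward scan over the entries below
        if B[j] == 1:
            v = j
    return v
```

Entries above the current value may keep stale 1s (for example B = [0, 0, 1, 1] after writing 2 over 3), which is why the reader returns the smallest index holding a 1.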
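Multi-Valued Construction
[Figure: the reader alg. and the writer alg. sit between the high-level reader and writer and the binary registers B[0], …, B[K-1], each holding 0/1. The reader alg. turns a high-level read into low-level reads of B; the writer alg. turns a high-level write into low-level writes to B.]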
Algorithm is Wait-Free
• Algorithm for writer does not involve any waiting: just do at most K (low-level) writes
• Algorithm for reader does not involve any waiting: just do at most 2K-1 (low-level) reads.
Algorithm Ensures Linearizability
• Describe an ordering of the (high-level) operations that is obviously legal (by the definition of the ordering)
• Then show that it respects real-time ordering of non-overlapping operations.
• Fix any admissible execution of the algorithm.
• Fix any linearization of the low-level operations (on the binary registers)
– exists since the execution is admissible, which implies the underlying communication system (the binary registers) behaves properly (is linearizable)
Reads-From Relations
• Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations.
• High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.
Reads-From Diagram
[Figure: high-level write 1 contains the low-level writes "write 1 to B[1]" and "write 0 to B[0]". High-level read 1 contains the low-level reads "read 0 from B[0]", "read 1 from B[1]" and "read 0 from B[0]". Arrows mark the low-level reads-from relationships and the high-level reads-from relationship.]
Construct Permutation
• Place all (high-level) writes in the order in which they occur
– no concurrent writes
• Consider each (high-level) read in the order in which they occur
– no concurrent reads
• Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.
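The construction can be sketched as a small function. This is an illustration only: the name `build_permutation` and the input encoding are assumptions, with `writes` listing write ids in the order they occur, `reads` listing (read id, write it reads from) pairs in the order they occur, and `None` standing for the initial value.

```python
def build_permutation(writes, reads):
    """Place each read just before the write that follows the write it reads from."""
    after = {None: []}             # reads of the initial value come before all writes
    for w in writes:
        after[w] = []
    for r, w in reads:             # reads are considered in the order they occur
        after[w].append(r)
    perm = list(after[None])
    for w in writes:
        perm.append(w)
        perm.extend(after[w])      # these reads sit between w and the next write
    return perm
```

For example, with writes W1, W2 and reads R1 (from W1), R2 (initial value), R3 (from W1), R4 (from W2), the result is R2, W1, R1, R3, W2, R4.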
Correctness of Permutation
• Permutation is legal by construction
– each read is placed after the write that it reads from
• Why does it preserve order of non-overlapping operations?
– two writes: by construction
– a read that precedes a write in the execution: OK, since the read cannot read from a later write.
Correctness of Permutation
Lemma (10.1): Suppose
• (high-level) read R returns v
• R reads B[u], with u < v, during its upward scan
• this read of B[u] reads from a (low-level) write contained in high-level write W1
Then R reads from a write that follows W1.
Figure for Lemma 10.1
[Figure: a high-level write w contains "write 1 to B[w]" and "write 0 to B[u]". A high-level write v contains "write 1 to B[v]". A high-level read v contains "read 0 from B[u]" (during upward scan, u < v) and "read 1 from B[v]" (top of upward scan or during downward scan). Arrows mark the low-level reads-from relationships and the high-level reads-from relationship.]
Correctness of Permutation
• Two cases remain to show that real-time order of non-overlapping operations is preserved:
– a write that precedes a read in the execution
– two reads
• Proofs of both cases are by contradiction, showing that there is a situation that violates Lemma 10.1.
Multi-Reader from Single-Reader
• First consider a simple idea:
• Use a different single-reader register for each reader (Val[1],…,Val[n]).
– n is number of readers
• Write algorithm: write the new value in each of the single-reader registers
• Read algorithm: read your own single-reader register and return that value
Counter-Example
Suppose 0 is initial value of multi-reader register. Suppose n = 2.
[Figure: timelines of pw, p1 and p2. pw performs write 1 (write 1 to Val[1], then write 1 to Val[2]). p1 performs read 1 (read 1 from Val[1]). p2 performs read 0 (read 0 from Val[2]).]
New Idea for Correct Algorithm
• Have the multi-reader algorithm write some information to the single-reader registers to prevent new-old inversions on the simulated register.
• This is provably necessary…
Readers Must Write
Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write.
Proof: Suppose in contradiction there is an algorithm in which readers never write.
Readers Must Write
• pw is the writer, p1 and p2 are the readers
• initial value of simulated register is 0
• S1 is the set of single-reader registers that are read by p1
• S2 is the set of single-reader registers that are read by p2
Readers Must Write
• Consider execution in which pw writes 1 to the simulated register.
• The write algorithm performs a series of writes, w1,…,wk, to the single-reader registers.
• Each wj is a write to a register in either S1 or S2.
• Let $v_j^i$ be the value that would be returned if pi were to do a read immediately after wj
Readers Must Write
[Figure: pw performs write 1 as the low-level writes w1, …, wj, wj+1, …, wk. A read by pi placed immediately after wj returns $v_j^i$.]
Readers Must Write
• For each reader (p1 and p2), there is a point when the writes w1, …, wk cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new).
• For p1: $v_1^1 = v_2^1 = \dots = v_{a-1}^1 = 0$ and $v_a^1 = \dots = v_k^1 = 1$
• For p2: $v_1^2 = v_2^2 = \dots = v_{b-1}^2 = 0$ and $v_b^2 = \dots = v_k^2 = 1$
Readers Must Write
• Why must a and b be different?
• a marks the point when p1's view of the simulated register's current value changes from old to new. So wa must write to a register in S1.
• Similarly, wb must write to a register in S2.
• W.l.o.g., assume a < b.
Readers Must Write
[Figure: pw performs write 1 as the low-level writes w1, …, wa, wa+1, …, wk. After wa, p1 performs a read that returns $v_a^1 = 1$ and p2 performs a read that returns $v_a^2 = 0$.]
Readers Must Write
• Where did we use the assumption in this proof that readers don't write?
• The writer doing the slow write of 1 is oblivious to whether any readers are concurrently reading.
• The readers are oblivious to each other.
Corrected Multi-Reader Algorithm
• As part of the algorithm for the read on the simulated register, announce the value to be returned.
• Before deciding what value to return, check what values have been returned by previous reads and don't pick anything earlier.
• Need timestamps to be able to determine relative age of returned values.
• Reader pi uses row i of a matrix to report its most recently returned value to all the other readers (remember, we only have single-reader variables at our disposal)
Writer's Algorithm
• get the next sequence number
– use integers that are increased by one each time
• write value and sequence number to Val[1],…,Val[n] (one copy for each reader)
Reader pi's Algorithm
• read the value and timestamp written by the writer to Val[i]
• read the value and timestamp written by each reader pj to Report[j,i]
• choose the value-timestamp pair with the largest timestamp
• write that pair to row i of Report
• return value associated with that pair
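The writer's and reader's steps can be sketched together in Python. This is a sequential illustration under stated assumptions: the class name `MultiReaderRegister` and the generator `write_steps` (one low-level write per step, so a write can be paused halfway) are hypothetical, readers are indexed from 0 instead of 1, and plain lists stand in for the single-reader registers.

```python
class MultiReaderRegister:
    def __init__(self, n, initial=0):
        self.n = n
        self.seq = 0
        # val[i]: written by the writer, read only by reader i
        self.val = [(0, initial)] * n
        # report[i][j]: written by reader i, read only by reader j
        self.report = [[(0, initial)] * n for _ in range(n)]

    def write_steps(self, value):
        """Writer's algorithm, one low-level write per step."""
        self.seq += 1                          # next sequence number
        for i in range(self.n):
            self.val[i] = (self.seq, value)    # one copy for each reader
            yield

    def write(self, value):
        for _ in self.write_steps(value):
            pass

    def read(self, i):
        """Reader i's algorithm."""
        pairs = [self.val[i]] + [self.report[j][i] for j in range(self.n)]
        best = max(pairs, key=lambda p: p[0])  # largest timestamp
        for j in range(self.n):
            self.report[i][j] = best           # announce in row i of Report
        return best[1]
```

If a write of 1 is paused after updating only the first reader's copy, that reader returns 1 and announces it, so a later read by the other reader also returns 1 rather than the old value.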
Multi-Reader Construction
[Figure, shown for 3 readers: the writer alg. writes to Val; each reader alg. reads Val and Report and writes to Report.]
Correctness of Multi-Reader Algorithm
• Wait-free
– writer does n low-level writes
– reader does n+1 low-level reads and n low-level writes
• To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves real-time order of non-overlapping operations.
Constructing the Permutation
• Put in all writes in the order in which they occur in the execution
– since single-writer, writes do not overlap
• Consider the reads in the order of their responses in the execution.
– read R reads from write W if W generates the timestamp associated with the value R returns
– place R immediately before the write that follows W
• By construction, the permutation is legal.
Preserving Real-Time Order
• write-write: by construction of π
• read-write: Suppose R precedes W in α. Then R cannot read from W or any succeeding write, so R is placed in π before W.
• write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in π after W.
• read-read: Suppose Ri by pi precedes Rj by pj in α. Then pj reads Ri's timestamp or a larger one from Report[i,j]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.
Multi-Writer from Single-Writer
• Idea:
– each writer should announce each value it wants to write to all the readers, by writing the value to its own (SW,MR) register.
– each reader reads all the values written by the writers and returns the latest one
• How to determine latest value?
– use timestamps
– new wrinkle is that multiple processes generate timestamps, need to coordinate
Using Vector Timestamps
• Data structure VT at each proc consisting of a vector of m integers
– m is the number of writers
• To get a new timestamp, writer pi increments VT[i] by one
• To compare timestamps, use lexicographic order
– This is a total order that extends the partial order defined for vector timestamps
Writer pw's Algorithm
• get the next vector timestamp:
– read the timestamp written by each writer to TS[0],…,TS[m-1]
– extract the i-th entry of each TS[i]
– increment own entry by 1
– write my new timestamp to TS[w]
• write value and timestamp to Val[w]
Reader pr's Algorithm
• read the value and timestamp written by each writer to Val[0], …, Val[m-1]
• choose the value-timestamp pair with the largest timestamp
• return value associated with that pair
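The writer's and reader's algorithms can be sketched in Python, where tuples already compare lexicographically. This is a sequential illustration only: the class name `MultiWriterRegister` is hypothetical, and plain lists `ts` and `val` stand in for the single-writer multi-reader registers TS and Val.

```python
class MultiWriterRegister:
    def __init__(self, m, initial=0):
        self.m = m
        zero = (0,) * m
        self.ts = [zero] * m                   # TS[w]: latest timestamp of writer w
        self.val = [(zero, initial)] * m       # Val[w]: (timestamp, value) of writer w

    def write(self, w, value):
        # read every TS[i] and extract its i-th entry
        new_ts = [self.ts[i][i] for i in range(self.m)]
        new_ts[w] += 1                         # increment own entry
        new_ts = tuple(new_ts)
        self.ts[w] = new_ts                    # publish the new timestamp
        self.val[w] = (new_ts, value)          # then publish value and timestamp

    def read(self):
        # read all Val[w]; tuple comparison is lexicographic
        return max(self.val, key=lambda pair: pair[0])[1]
```

With two writers, successive non-overlapping writes by p1, p0, p1 get timestamps (0,1), (1,1), (1,2), which increase in lexicographic order, so each read returns the most recent value.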
Multi-Writer Construction
[Figure, shown for 3 readers and 2 writers: each writer alg. reads TS and writes to TS and Val; each reader alg. reads Val.]
Correctness of Multi-Writer Algorithm
• Wait-free
– writer does m low-level reads and 2 low-level writes
– reader does m low-level reads
• To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves real-time order of non-overlapping operations.
Constructing the Permutation
• Put in all writes in timestamp order
– Lemma 10.6 shows this preserves order of non-overlapping writes
• Consider the reads in the order of their responses in the execution.
– read R reads from write W if W generates the timestamp associated with the value R returns
– place R immediately before the write that follows W
• By construction, the permutation is legal.
Preserving Real-Time Order
• write-write: by construction of π
• read-write: Suppose R precedes W in α. Then R cannot read from W or any succeeding write, so R is placed in π before W.
• write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in π after W.
• read-read: Suppose Ri by pi precedes Rj by pj in α. By Lemmas 10.6 and 10.7, pj reads Ri's timestamp or a larger one from Val[ ]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.
Atomic Snapshot Objects (ASO)
• An array of elements:
– each one can be updated by just one proc.
– a proc. can scan the whole array "atomically"
• Useful abstraction for designing shared memory algorithms
• Can be wait-free implemented from read/write variables
ASO Sequential Specification
• Operations are
– invocation scan_i, response return_i(V) where V is an array of n values, 0 ≤ i ≤ n-1
– invocation update_i(d) where d is a data value, response ack_i, 0 ≤ i ≤ n-1
• Legal sequences: for each V returned by a scan, V[i] equals parameter of latest preceding update_i (or the initial value of entry i if there is none)
ASO Example
• Suppose array = [a,b,c] initially.
• This sequence is legal:
update_1(x), update_2(y), scan([a,x,y]), update_0(z), scan([z,x,y])
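The sequential specification can be checked mechanically. A minimal sketch, assuming a hypothetical encoding in which an update is the tuple ("update", i, d) and a scan is ("scan", V); the function name `is_legal` is also an assumption.

```python
def is_legal(initial, ops):
    """Check a sequence of updates and scans against the ASO sequential spec."""
    state = list(initial)
    for op in ops:
        if op[0] == "update":
            _, i, d = op
            state[i] = d               # entry i now holds the latest update_i value
        else:
            _, view = op
            if list(view) != state:    # a scan must return the latest values
                return False
    return True
```

Applied to the example above, the sequence is accepted, while a scan returning [a,b,y] after both update_1(x) and update_2(y) would be rejected.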
Sketch of Implementation
• Store each array entry ("segment") in a different read/write variable
• Update algorithm:
– write to the variable holding that segment
• Scan algorithm:
– Collect (read) all the values in the segments twice
– If no segment is updated during the "double collect", then we got a valid snapshot; return it
• Issues:
– how to tell if a segment is updated?
– what to do if a segment is updated?
Detecting Updates
• Simple idea is to tag each value stored in a segment with a counter (1,2,3,…)
– requires unbounded space
• More complex, bounded-space, solution is given in the textbook
– uses a "handshaking" mechanism
Reacting to Update During Scan
• If a scanner observes enough changes to a particular segment, then the corresponding updater has performed a complete update during this scan
• Embed a scan at the beginning of each update:
– the view obtained in this scan is written with the data to the segment
• Scanner returns view obtained in last collect
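The double collect and the embedded scan can be sketched together using the simple unbounded-counter tagging, not the bounded handshaking solution from the textbook. This is a sequential illustration under stated assumptions: the class name `Snapshot` and the `interfere` test hook (which lets updates land between the two collects) are hypothetical, and "enough changes" is taken here to mean two observed changes to the same segment.

```python
class Snapshot:
    def __init__(self, initial):
        # each segment holds (data, counter, view from the embedded scan)
        self.seg = [(d, 0, list(initial)) for d in initial]

    def _collect(self):
        return list(self.seg)              # one low-level read per segment

    def update(self, i, d):
        view = self.scan()                 # embedded scan at the start of the update
        _, counter, _ = self.seg[i]
        self.seg[i] = (d, counter + 1, view)

    def scan(self, interfere=None):
        moved = [0] * len(self.seg)
        while True:
            a = self._collect()
            if interfere:
                interfere()                # hypothetical hook for concurrent updates
            b = self._collect()
            changed = [j for j in range(len(a)) if a[j][1] != b[j][1]]
            if not changed:
                return [s[0] for s in b]   # clean double collect
            for j in changed:
                moved[j] += 1
                if moved[j] == 2:          # second change: borrow the updater's view
                    return list(b[j][2])
```

When the same segment changes in two successive double collects, the second update began after the scan did, so the view written by that update is a snapshot taken within the scan's interval.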
Complexity of ASO Algorithm
• Number of building-block read/write variables is O(n) (although some are large)
• Scan algorithm uses O(n^2) low-level reads and writes.
• Update algorithm uses O(n^2) low-level reads and writes.