CPSC 668 Set 17: Fault-Tolerant Register Simulations 1 CPSC 668 Distributed Algorithms and Systems Fall 2009 Prof. Jennifer Welch

Post on 20-Dec-2015


CPSC 668 Set 17: Fault-Tolerant Register Simulations 1

CPSC 668 Distributed Algorithms and Systems

Fall 2009

Prof. Jennifer Welch

Fault-Tolerant Shared Memory Simulations

• Previous algorithms implemented a shared variable on top of message passing, assuming no failures.

• What if some processors might crash?

• Can we still provide a shared read/write variable on top of message passing?

• Yes, even in an asynchronous system, if we have enough nonfaulty processors.

• First, we must specify a failure-prone shared memory.

Specification of f-Resilient Shared Memory

• Inputs are invocations on the shared object.

• Outputs are responses of the shared object.

• A sequence of inputs and outputs is allowable iff:

– there is a partitioning of proc. indices into "faulty" and "nonfaulty"

– Correct Interaction: each proc. alternates invocations and matching responses

– Nonfaulty Liveness: Every invocation by a nonfaulty proc. has a matching response

– Extended Linearizability: Linearizability holds for all the completed operations and some subset of the pending operations

Assumptions for Algorithm

• Each read/write variable ("register") to be simulated has

– one reader and

– one writer

– (next topic will be to build more powerful variables out of these)

• There are n procs. which are cooperating to simulate a collection of such variables

• Underlying communication system is asynchronous message passing

• n > 2f (less than half the processors can crash)


Main Ideas of Algorithm

• Each simulated register has a replica stored at each of the n procs., not just at the designated reader and writer of that register.

• Use the redundant storage to provide fault-tolerance.

• Describe algorithm just for one simulated register; use a separate copy of the same algorithm in parallel for each simulated register.

Writing the Simulated Register

• generate the next sequence number

• send a message with the value and the sequence number to all the procs.

– each recipient updates its local copy of the register

• wait to get back an ack from > n/2 procs.

– safe since n - f > n/2

• do the ack for the write

Reading the Simulated Register

• send a request to all the procs.

– each recipient sends back current value of its replica

• wait to get reply from > n/2 procs.

• return value associated with largest sequence number
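The write and read steps can be sketched in Python. This is a sequential sketch, not the algorithm from the slides as it would run on a network. Message passing is replaced by direct access to a list of replicas. The `responders` argument is an assumption introduced here to stand for the set of procs. whose replies arrive. A missing majority is modelled as an error, whereas the real algorithm would keep waiting. The class and method names are hypothetical.

```python
class QuorumRegister:
    """One simulated register replicated at n procs. (sequential sketch)."""

    def __init__(self, n, initial):
        self.n = n
        self.seq = 0                          # writer's sequence number
        self.replicas = [(0, initial)] * n    # (sequence number, value) per proc.

    def _require_majority(self, responders):
        # More than n/2 replies are needed. The real algorithm would keep
        # waiting; this sketch raises an error instead.
        if 2 * len(responders) <= self.n:
            raise RuntimeError("fewer than a majority of procs. responded")

    def write(self, value, responders):
        self._require_majority(responders)
        self.seq += 1                         # generate the next sequence number
        for i in responders:                  # each recipient updates its replica
            if self.seq > self.replicas[i][0]:
                self.replicas[i] = (self.seq, value)

    def read(self, responders):
        self._require_majority(responders)
        replies = [self.replicas[i] for i in responders]
        seq, value = max(replies, key=lambda reply: reply[0])
        return value                          # value with the largest sequence number


reg = QuorumRegister(5, initial=0)
reg.write(7, responders={0, 1, 2})
print(reg.read(responders={2, 3, 4}))         # proc. 2 is in both majorities
```

The read hears from only one proc. that saw the write, yet that one reply carries the largest sequence number, so the new value is returned.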


Key Idea for Correctness

• Each read should return the value of "the most recent" write.

• Each read or write communicates with > n/2 procs., so the set of procs. participating in operation O1 is guaranteed to intersect with the set of procs. participating in any other operation O2.
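The intersection claim follows from a counting argument. As a sketch, write Q1 and Q2 (names introduced here, not in the slides) for the sets of procs. that respond in O1 and O2. Each has more than n/2 members and their union has at most n members, so:

```latex
\[
|Q_1 \cap Q_2| \;=\; |Q_1| + |Q_2| - |Q_1 \cup Q_2| \;>\; \frac{n}{2} + \frac{n}{2} - n \;=\; 0 .
\]
```

Hence at least one proc. participates in both operations.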

But What About Asynchrony?

• The underlying communication system is asynchronous:

– message on behalf of one operation could be overtaken by a message on behalf of a later operation.

• Avoid such problems by adding additional mechanism to the algorithm:

– reader and writer keep track of "status" of each link

– don't send a msg on a link until ack from previous msg has been received

Outline of Correctness Proof

Interesting part is proving linearizability.

• Let ts(W) = sequence number of W

• Let ts(R) = sequence number of write that R reads from

• Let O1 → O2 denote O1 finishes before O2 starts

Key lemmas:

• If W1 → W2, then ts(W1) < ts(W2)

• If W → R, then ts(W) ≤ ts(R)

• If R → W, then ts(R) < ts(W)

• If R1 → R2, then ts(R1) ≤ ts(R2)

Matching Lower Bound on Resiliency

Theorem (10.22): No simulation of a 1-reader, 1-writer read/write linearizable register using n procs and asynchronous message passing can tolerate f ≥ n/2 crash failures.

Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.

Lower Bound Proof

• Partition procs into two sets, S0 and S1, each of size f.

• Let α0 be an admissible exec. of A s.t.

– initial value of simulated register is 0

– all procs. in S1 crash initially

– proc. p0 in S0 invokes write(1) at time 0 and no other operations are invoked.

– the write completes at some time t0 without any proc in S0 receiving a message from any proc in S1: must happen since A is supposed to tolerate f failures.

[Figure: execution α0, with the procs. divided into S0 and S1 and p0 in S0.]

Lower Bound Proof

• Let α1 be an admissible exec. of A s.t.

– initial value of simulated register is 0

– all procs. in S0 crash initially

– proc. p1 in S1 invokes a read at time t0+1 and no other operations are invoked.

– the read completes at some time t1 without any proc. in S1 receiving a message from any proc. in S0: must happen since A is supposed to tolerate f failures

– the read returns 0: must be since A guarantees linearizability

[Figure: execution α1, showing p1.]

Lower Bound Proof

• Now create admissible execution β by "merging" the views of procs in S0 from α0 and the views of procs in S1 from α1:

– messages that go between S0 and S1 are delayed so that they don't arrive until after time t1.

• β is not linearizable, since the read that returns 0 follows write(1). Contradiction.

[Figure: merged execution β, combining α0 (p0 in S0) and α1 (p1 in S1); messages between S0 and S1 are delayed until after t1.]

Lower Bound Diagram for n = 2

[Figure: timelines for p0 and p1 with time marks 0, t0, t0+1, t1, in three executions. α0: p0 performs write(1). α1: p1 performs a read that returns 0. β: p0 performs write(1) and p1 then performs a read that returns 0.]

Simulating R/W Registers Using R/W Registers

• The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing.

• How can we get more powerful (flexible) registers, i.e., with

– more readers

– more writers

• We'll start with a warm-up:

– simulate multi-valued register using binary-valued registers

– 1-reader and 1-writer

Wait-Free Register Simulations

• Asynchronous model

• Linearizable shared registers

• Wait-free

– tolerate any number of crash failures

• We want to simulate one kind of (n-1)-resilient shared memory with another kind of (n-1)-resilient memory

– recall earlier definition of f-resilient shared memory

– recall earlier definition of one kind of communication system simulating another

Alternative Definition of Wait-Free Simulation

• Alternative definition for the wait-free case:

• The failure-free version of one communication system simulates the failure-free version of the other, and

• for any prefix of an admissible execution of the simulation algorithm in which pi has a pending operation, there is an extension in which the operation completes and only pi takes steps.

• Equivalent to previous definition, sometimes more convenient.

Proving Linearizability

• We've seen one approach:

– explicitly construct a permutation and prove that it has the desired properties

• Alternative approach:

– identify a time point for each operation, between invocation and response: linearization points

– Linearization points give the permutation

– Obviously real-time order is preserved

– Just need to show that legality holds

Overview of Register Simulations

[Figure: chain of simulations, each kind of register built from the one listed before it.]

• single-reader, single-writer, binary-valued

• single-reader, single-writer, multi-valued

• multi-reader, single-writer, multi-valued

• multi-reader, multi-writer, multi-valued

Multi-Valued From Binary

• Some ideas…

• Use a different binary register to store each bit of the multi-valued register being simulated

• Read algorithm is to read all the binary registers and return the resulting value

• Write algorithm is to write the new bits in some order

• Difficulties arise if the reader overlaps a slow write and sees some new bits and some old bits

A Unary Approach

• Suppose the simulated register is to take on the values {0,…,K-1}.

• Use an array of K binary registers, B[0..K-1]

– represent value v by having B[v] = 1 and the other entries 0

• Read algorithm: read B[0], B[1],…, until finding the first 1; return the index

• Write algorithm: zero out the old entry of B and set the new entry

Problems with Unary Approach

• OK if reads and writes don't overlap.

• If they do, have to worry about

– reader never finding a 1 in B

– new-old inversion: writer writes 1, then 2, but reader reads 2, then 1.

• Counter-example execution on next slide

– since binary registers are linearizable, we just mark the linearization points of the reads and writes on the binary registers

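Counter-Example

Initially B[0] = B[1] = B[2] = 0 and B[3] = 1

[Figure: timeline of the low-level operations of the writer and the reader, marked at their linearization points.]

• Writer, write 1: write 1 to B[1], write 0 to B[3]

• Writer, write 2: write 1 to B[2], write 0 to B[1]

• Reader, first read (returns 2): read 0 from B[0], read 0 from B[1], read 1 from B[2]

• Reader, second read (returns 1): read 0 from B[0], read 1 from B[1]

• One interleaving consistent with the figure: read 0 from B[0]; read 0 from B[1]; write 1 to B[1]; write 0 to B[3]; write 1 to B[2]; read 1 from B[2]; read 0 from B[0]; read 1 from B[1]; write 0 to B[1]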
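The new-old inversion in this counter-example can be replayed step by step. The following Python sketch makes some assumptions: each generator step stands for one low-level operation, the schedule string names which proc. takes the next step ("R" for the reader, "W" for the writer), and the writer sets the new entry before clearing the old one, as in the counter-example. All function names are hypothetical.

```python
def naive_write(B, old, new):
    """Unary write: set the new entry, then clear the old one."""
    B[new] = 1
    yield                  # one low-level write done
    B[old] = 0
    yield                  # one low-level write done


def naive_read(B, results):
    """Unary read: scan upward and record the index of the first 1."""
    for i in range(len(B)):
        bit = B[i]
        if bit == 1:
            results.append(i)
            yield          # one low-level read done, high-level read returns
            return
        yield              # one low-level read done, keep scanning


def writer(B):
    yield from naive_write(B, old=3, new=1)    # write 1
    yield from naive_write(B, old=1, new=2)    # write 2


def reader(B, results):
    yield from naive_read(B, results)          # first high-level read
    yield from naive_read(B, results)          # second high-level read


def replay(schedule):
    B = [0, 0, 0, 1]                           # initially B[3] = 1
    results = []
    procs = {"R": reader(B, results), "W": writer(B)}
    for who in schedule:
        next(procs[who])                       # one low-level step
    return results


print(replay("RRWWWRRRW"))   # prints [2, 1]
```

The writer writes 1 and then 2, but the two reads return 2 and then 1, which no linearization can explain.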

Corrected Multi-Valued Algorithm

• To prevent "falling off the edge" of the end of B without finding a 1, write algorithm only clears (sets to 0) entries that are smaller than the entry that is set (to 1)

• To prevent new-old inversions, read algorithm scans up to find first 1, and then scans down to make sure those entries are still 0.

– returns smallest value associated with a 1 entry in B that is observed during the downward scan
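A minimal sequential sketch of the corrected algorithm in Python follows. It assumes that the writer sets the new entry before clearing the smaller ones and that it clears them from high index to low; the slide does not state either order. The `BinaryRegisters` class and the function names are hypothetical. The counters are there only to make the number of low-level operations visible.

```python
class BinaryRegisters:
    """Array B[0..K-1] of binary registers that counts low-level operations."""

    def __init__(self, K, initial):
        self.bits = [0] * K
        self.bits[initial] = 1
        self.reads = 0
        self.writes = 0

    def read(self, i):
        self.reads += 1
        return self.bits[i]

    def write(self, i, bit):
        self.writes += 1
        self.bits[i] = bit


def write_value(B, v):
    B.write(v, 1)                     # set the new entry
    for i in range(v - 1, -1, -1):    # clear only the smaller entries
        B.write(i, 0)


def read_value(B):
    i = 0
    while B.read(i) == 0:             # upward scan to the first 1
        i += 1
    v = i
    for j in range(i - 1, -1, -1):    # downward scan over the smaller entries
        if B.read(j) == 1:
            v = j                     # remember the smallest 1 seen
    return v


B = BinaryRegisters(K=4, initial=3)
write_value(B, 1)
print(read_value(B))                  # prints 1
```

Entries above the written value are never cleared, so some entry at or above the current value always holds a 1 and the upward scan stops before running past B[K-1].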

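Multi-Valued Construction

[Figure: the reader alg. and the writer alg. sit between the high-level reader and writer and the binary registers B[0], …, B[K-1], each holding 0/1. High-level reads and writes go to the reader alg. and the writer alg., which read and write the binary registers.]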


Algorithm is Wait-Free

• Algorithm for writer does not involve any waiting: just do at most K (low-level) writes

• Algorithm for reader does not involve any waiting: just do at most 2K-1 (low-level) reads.

Algorithm Ensures Linearizability

• Describe an ordering of the (high-level) operations that is obviously legal (by the definition of the ordering)

• Then show that it respects real-time ordering of non-overlapping operations.

• Fix any admissible execution of the algorithm.

• Fix any linearization of the low-level operations (on the binary registers)

– exists since the execution is admissible, which implies the underlying communication system (the binary registers) behaves properly (is linearizable)

Reads-From Relations

• Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations.

• High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.

Reads-From Diagram

[Figure: a high-level write 1, consisting of write 1 to B[1] and write 0 to B[0], and a high-level read 1, consisting of read 0 from B[0], read 1 from B[1], and read 0 from B[0]. Arrows mark the low-level reads-from relationships and the high-level reads-from relationship.]

Construct Permutation

• Place all (high-level) writes in the order in which they occur

– no concurrent writes

• Consider each (high-level) read in the order in which they occur

– no concurrent reads

• Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.

Correctness of Permutation

• Permutation is legal by construction

– each read is placed after the write that it reads from

• Why does it preserve order of non-overlapping operations?

– two writes: by construction

– a read that precedes a write in the execution: OK, since the read cannot read from a later write.

Correctness of Permutation

Lemma (10.1): Suppose

• (high-level) read R returns v

• R reads B[u], with u < v, during its upward scan

• this read of B[u] reads from a (low-level) write contained in high-level write W1

Then R reads from a write that follows W1.

Figure for Lemma 10.1

[Figure: high-level write w contains write 1 to B[w] and write 0 to B[u]; high-level write v contains write 1 to B[v]. High-level read v reads 0 from B[u] during its upward scan (u < v) and reads 1 from B[v] at the top of its upward scan or during its downward scan. Arrows mark the low-level reads-from relationships and the high-level reads-from relationship.]

Correctness of Permutation

• Two cases remain to show that real-time order of non-overlapping operations is preserved:

– a write that precedes a read in the execution

– two reads

• Proofs of both cases are by contradiction, showing that there is a situation that violates Lemma 10.1.

Multi-Reader from Single-Reader

• First consider a simple idea:

• Use a different single-reader register for each reader (Val[1],…,Val[n]).

– n is number of readers

• Write algorithm: write the new value in each of the single-reader registers

• Read algorithm: read your own single-reader register and return that value

Counter-Example

Suppose 0 is initial value of multi-reader register. Suppose n = 2.

[Figure: pw performs write 1 as write 1 to Val[1] followed by write 1 to Val[2]. After the write to Val[1], p1 reads 1 from Val[1] and returns 1. Then, before the write to Val[2], p2 reads 0 from Val[2] and returns 0.]


New Idea for Correct Algorithm

• Have the multi-reader algorithm write some information to the single-reader registers to prevent new-old inversions on the simulated register.

• This is provably necessary…


Readers Must Write

Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write.

Proof: Suppose in contradiction there is an algorithm in which readers never write.


Readers Must Write

• pw is the writer, p1 and p2 are the readers

• initial value of simulated register is 0

• S1 is the set of single-reader registers that are read by p1

• S2 is the set of single-reader registers that are read by p2

Readers Must Write

• Consider execution in which pw writes 1 to the simulated register.

• The write algorithm performs a series of writes, w1,…,wk, to the single-reader registers.

• Each wj is a write to a register in either S1 or S2.

• Let $v_j^i$ be the value that would be returned if pi were to do a read immediately after wj

Readers Must Write

[Figure: pw performs write 1 as the low-level writes w1, …, wj, wj+1, …, wk; a read by pi immediately after wj returns $v_j^i$.]

Readers Must Write

• For each reader (p1 and p2), there is a point when the writes w1, …, wk cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new).

• For p1: $v_1^1 = v_2^1 = \dots = v_{a-1}^1 = 0$ and $v_a^1 = \dots = v_k^1 = 1$

• For p2: $v_1^2 = v_2^2 = \dots = v_{b-1}^2 = 0$ and $v_b^2 = \dots = v_k^2 = 1$

Readers Must Write

• Why must a and b be different?

• a marks the point when p1's view of the simulated register's current value changes from old to new. So wa must write to a register in S1.

• Similarly, wb must write to a register in S2.

• Each register has only one reader, so S1 and S2 are disjoint and a single write cannot be both wa and wb.

• W.l.o.g., assume a < b.

Readers Must Write

[Figure: pw performs write 1 as the low-level writes w1, …, wa, wa+1, …, wk. Between wa and wa+1, p1 reads and returns $v_a^1 = 1$, and then p2 reads and returns $v_a^2 = 0$.]


Readers Must Write

• Where did we use the assumption in this proof that readers don't write?

• The writer doing the slow write of 1 is oblivious to whether any readers are concurrently reading.

• The readers are oblivious to each other.


Corrected Multi-Reader Algorithm

• As part of the algorithm for the read on the simulated register, announce the value to be returned.

• Before deciding what value to return, check what values have been returned by previous reads and don't pick anything earlier.

• Need timestamps to be able to determine relative age of returned values.

• Reader pi uses row i of a matrix to report its most recently returned value to all the other readers (remember, we only have single-reader variables at our disposal)

Writer's Algorithm

• get the next sequence number

– use integers that are increased by one each time

• write value and sequence number to Val[1],…,Val[n] (one copy for each reader)

Reader pi's Algorithm

• read the value and timestamp written by the writer to Val[i]

• read the value and timestamp written by each reader pj to Report[j,i]

• choose the value-timestamp pair with the largest timestamp

• write that pair to row i of Report

• return value associated with that pair
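The writer's and reader's steps can be sketched sequentially in Python. The sketch makes some assumptions: readers are numbered from 0 rather than 1, and the low-level single-reader registers are plain list entries. The writer is a generator that pauses after each low-level write so that a slow write can be simulated. `MultiReaderRegister` and `write_steps` are hypothetical names.

```python
class MultiReaderRegister:
    """Single-writer register for n readers (sequential sketch)."""

    def __init__(self, n, initial):
        self.n = n
        self.seq = 0
        # Val[i]: written by the writer, read only by reader i.
        self.val = [(0, initial)] * n
        # Report[j][i]: written by reader j, read only by reader i.
        self.report = [[(0, initial)] * n for _ in range(n)]

    def write_steps(self, value):
        """Generator that pauses after each low-level write."""
        self.seq += 1                          # next sequence number
        pair = (self.seq, value)
        for i in range(self.n):
            self.val[i] = pair                 # one copy for each reader
            yield

    def write(self, value):
        for _ in self.write_steps(value):
            pass

    def read(self, i):
        candidates = [self.val[i]]
        candidates += [self.report[j][i] for j in range(self.n)]
        best = max(candidates, key=lambda pair: pair[0])   # largest timestamp
        for j in range(self.n):
            self.report[i][j] = best           # announce in row i of Report
        return best[1]


reg = MultiReaderRegister(n=2, initial=0)
slow_write = reg.write_steps(1)
next(slow_write)                               # only Val[0] has the new value
print(reg.read(0), reg.read(1))                # prints 1 1
```

Reader 1 has not yet received the new value in its own Val entry, but it finds the newer pair announced by reader 0 in Report and returns 1 rather than the old 0.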

Multi-Reader Construction

[Figure: construction for 3 readers. The writer alg. writes Val; each of the three reader algs. reads Val and Report and writes Report.]

Correctness of Multi-Reader Algorithm

• Wait-free

– writer does n low-level writes

– reader does n+1 low-level reads and n low-level writes

• To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves real-time order of non-overlapping operations.

Constructing the Permutation

• Put in all writes in the order in which they occur in the execution

– since single-writer, writes do not overlap

• Consider the reads in the order of their responses in the execution.

– read R reads from write W if W generates the timestamp associated with the value R returns

– place R immediately before the write that follows W

• By construction, the permutation is legal.

Preserving Real-Time Order

• write-write: by construction of π

• read-write: Suppose R precedes W in α. Then R cannot read from W or any succeeding write, so R is placed in π before W.

• write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in π after W.

• read-read: Suppose Ri by pi precedes Rj by pj in α. Then pj reads Ri's timestamp or a larger one from Report[i,j]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.

Multi-Writer from Single-Writer

• Idea:

– each writer should announce each value it wants to write to all the readers, by writing the value to its own (SW,MR) register.

– each reader reads all the values written by the writers and returns the latest one

• How to determine latest value?

– use timestamps

– new wrinkle is that multiple processes generate timestamps, need to coordinate

Using Vector Timestamps

• Data structure VT at each proc consisting of a vector of m integers

– m is the number of writers

• To get a new timestamp, writer pi increments VT[i] by one

• To compare timestamps, use lexicographic order

– This is a total order that extends the partial order defined for vector timestamps

Writer pw's Algorithm

• get the next vector timestamp:

– read the timestamp written by each writer to TS[0],…,TS[m-1]

– extract the i-th entry of each TS[i]

– increment own entry by 1

– write my new timestamp to TS[w]

• write value and timestamp to Val[w]


Reader pr's Algorithm

• read the value and timestamp written by each writer to Val[0], …, Val[m-1]

• choose the value-timestamp pair with the largest timestamp

• return value associated with that pair
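The writer's and reader's algorithms for the multi-writer register can be sketched sequentially in Python. The sketch assumes that vector timestamps are stored as tuples, because Python compares tuples lexicographically, and that the multi-reader registers TS[w] and Val[w] are plain list entries. `MultiWriterRegister` is a hypothetical name.

```python
class MultiWriterRegister:
    """Register for m writers built from single-writer registers (sequential sketch)."""

    def __init__(self, m, initial):
        zero = (0,) * m
        self.m = m
        self.ts = [zero] * m                  # TS[w]: latest timestamp of writer w
        self.val = [(zero, initial)] * m      # Val[w]: (timestamp, value) of writer w

    def write(self, w, value):
        # Entry i of the new vector is the i-th entry of TS[i].
        vt = [self.ts[i][i] for i in range(self.m)]
        vt[w] += 1                            # increment own entry by 1
        stamp = tuple(vt)
        self.ts[w] = stamp                    # write my new timestamp to TS[w]
        self.val[w] = (stamp, value)          # write value and timestamp to Val[w]

    def read(self):
        # Tuples compare lexicographically, giving a total order on timestamps.
        stamp, value = max(self.val, key=lambda pair: pair[0])
        return value


reg = MultiWriterRegister(m=2, initial=None)
reg.write(0, "a")                             # timestamp (1, 0)
reg.write(1, "b")                             # timestamp (1, 1)
print(reg.read())                             # prints b
```

Writer 1 picks up writer 0's entry before incrementing its own, so its timestamp (1, 1) is larger than (1, 0) and the later write wins.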

Multi-Writer Construction

[Figure: construction for 3 readers and 2 writers. The two writer algs. read and write TS and write Val; the three reader algs. read Val.]


Correctness of Multi-Writer Algorithm• Wait-free

– writer does m low-level reads and 2 low-level writes

– reader does m low-level reads

• To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves real-time order of non-overlapping operations.

Constructing the Permutation

• Put in all writes in timestamp order

– Lemma 10.6 shows this preserves order of non-overlapping writes

• Consider the reads in the order of their responses in the execution.

– read R reads from write W if W generates the timestamp associated with the value R returns

– place R immediately before the write that follows W

• By construction, the permutation is legal.

Preserving Real-Time Order

• write-write: by construction of π

• read-write: Suppose R precedes W in α. Then R cannot read from W or any succeeding write, so R is placed in π before W.

• write-read: Suppose W precedes R in α. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in π after W.

• read-read: Suppose Ri by pi precedes Rj by pj in α. By Lemmas 10.6 and 10.7, pj reads Ri's timestamp or a larger one from Val[ ]. So Rj reads from the same write that Ri reads from or a later write. Thus Rj is placed in π after Ri.

Atomic Snapshot Objects (ASO)

• An array of elements:

– each one can be updated by just one proc.

– a proc. can scan the whole array "atomically"

• Useful abstraction for designing shared memory algorithms

• Can be wait-free implemented from read/write variables

ASO Sequential Specification

• Operations are

– invocation scan_i, response return_i(V) where V is an array of n values, 0 ≤ i ≤ n-1

– invocation update_i(d) where d is a data value, response ack_i, 0 ≤ i ≤ n-1

• Legal sequences: for each V returned by a scan, V[i] equals parameter of latest preceding update_i (or the initial value of entry i if there is none)

ASO Example

• Suppose array = [a,b,c] initially.

• This sequence is legal:

update_1(x), update_2(y), scan([a,x,y]), update_0(z), scan([z,x,y])

Sketch of Implementation

• Store each array entry ("segment") in a different read/write variable

• Update algorithm:

– write to the variable holding that segment

• Scan algorithm:

– Collect (read) all the values in the segments twice

– If no segment is updated during the "double collect", then we got a valid snapshot: return it

• Issues:

– how to tell if a segment is updated?

– what to do if a segment is updated?

Detecting Updates

• Simple idea is to tag each value stored in a segment with a counter (1,2,3,…)

– requires unbounded space

• More complex, bounded-space, solution is given in the textbook

– uses a "handshaking" mechanism

Reacting to Update During Scan

• If a scanner observes enough changes to a particular segment, then the corresponding updater has performed a complete update during this scan

• Embed a scan at the beginning of each update:

– the view obtained in this scan is written with the data to the segment

• Scanner returns view obtained in last collect

Complexity of ASO Algorithm

• Number of building-block read/write variables is O(n) (although some are large)

• Scan algorithm uses O(n^2) low-level reads and writes.

• Update algorithm uses O(n^2) low-level reads and writes.