Posted on 14-Oct-2014
Distributed Systems
Consistency & Replication (II)
Client-centric Consistency Models
• Guarantees for a single client
• How to hide inconsistencies from a client?
  – … assuming a data store where concurrent conflicting updates are rare
    • … and relatively easy to resolve
• Examples:
  – DNS
    • Single naming authority per zone
    • "Lazy" propagation of updates
  – WWW
    • No write-write conflicts
    • Usually acceptable to serve slightly out-of-date pages from a cache
  – Bayou (Terry et al., 1994)
Eventual Consistency
• The principle of a mobile user accessing different replicas of a distributed database.
• If no updates take place for some time, all replicas gradually converge to a consistent state.
Alternative client-centric models
• xi[t]: version of object x at local copy Li at time t
  – … the result of a series of writes performed since system initialization at Li
  – WS(xi[t]): that series of writes
  – WS(xi[t1]; xj[t2]): the series of writes at Li at time t1 that have also been performed at copy Lj by the later time t2
• Assume an "owner" for each data item
  – … to avoid write-write conflicts
• Monotonic reads
• Monotonic writes
• Read-your-writes
• Writes-follow-reads
Monotonic Reads
• The read operations performed by a single process P at two different local copies of the same data store:
  a) A monotonic-read consistent data store
  b) A data store that does not provide monotonic reads
• If a process has seen a value of x at time t, it will never see an older value at a later time.
• Example: replicated mailboxes with on-demand propagation of updates
• WS(x1) is part of WS(x2).
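The guarantee above can be sketched in code: the client carries the set of write IDs it has observed, and a replica may serve a read only if it has applied all of them. A minimal, hypothetical sketch (class and names are illustrative, not from any real system):

```python
class Replica:
    """A local copy of the data store, tracking its applied write IDs."""
    def __init__(self, name):
        self.name = name
        self.applied = set()   # IDs of writes applied here, i.e. WS(x)
        self.value = None

    def write(self, wid, value):
        self.applied.add(wid)
        self.value = value

def monotonic_read(read_set, replica):
    """Serve the read only if every write the client has seen is part of
    this replica's write set; otherwise the replica must sync first."""
    if not read_set <= replica.applied:
        raise RuntimeError(f"{replica.name} must first sync missing writes")
    read_set |= replica.applied   # these writes are now 'seen' by the client
    return replica.value

l1, l2 = Replica("L1"), Replica("L2")
l1.write("w1", "mail#1")
seen = set()
assert monotonic_read(seen, l1) == "mail#1"   # read at L1: client now saw w1
l2.write("w1", "mail#1")                      # propagate w1 to L2 on demand
assert monotonic_read(seen, l2) == "mail#1"   # now L2 may serve the client
```

This mirrors the mailbox example: a replica that has not yet received "w1" refuses the read until it catches up.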
Monotonic Writes
• The write operations performed by a single process P at two different local copies of the same data store:
  a) A monotonic-write consistent data store
  b) A data store that does not provide monotonic-write consistency
• If an update is made to a copy, all preceding updates must have been completed first.
• Example: a software library
• FIFO propagation of updates by each process.
• A write may affect only part of the state of a data item.
• There is no guarantee that x at L2 has the same value as x at L1 at the time W(x1) completed.
Read Your Writes
  a) A data store that provides read-your-writes consistency
  b) A data store that does not
• A write is completed before a successive read, no matter where the read takes place.
• Negative examples: updates of Web pages; changes of passwords.
• The effects of the previous write at L1 have not yet been propagated!
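As a sketch (with purely illustrative names), read-your-writes can be enforced by shipping the client's write set with each read and applying any missing writes before answering, which fixes the password example:

```python
def read_your_writes(write_set, replica, all_writes):
    """Before serving the read, apply the client's writes missing here."""
    for wid in sorted(write_set - replica["applied"]):
        replica["value"] = all_writes[wid]   # fetch the write from its origin
        replica["applied"].add(wid)
    return replica["value"]

all_writes = {"w1": "new-password"}                # a hypothetical write log
l1 = {"applied": {"w1"}, "value": "new-password"}  # where the write happened
l2 = {"applied": set(), "value": "old-password"}   # update not yet propagated
assert read_your_writes({"w1"}, l1, all_writes) == "new-password"
assert read_your_writes({"w1"}, l2, all_writes) == "new-password"
```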
Writes Follow Reads
  a) A writes-follow-reads consistent data store
  b) A data store that does not provide writes-follow-reads consistency
• Any successive write will be performed on a copy that is up-to-date with the value most recently read by the process.
• Example: updates of a newsgroup: responses are visible only after the original posting has been received.
Implementing client-centric models (I)
• Globally unique ID per write operation
  – Assigned by the initiating server
• Per-client state:
  – Read set
    • IDs of the writes relevant to the client's read operations
  – Write set
    • IDs of the writes performed by the client
• Major performance issue:
  – The size of the read/write sets
Implementing client-centric models (II)
• Monotonic read:
  – When a client issues a read, the server is given the client's read set to check whether all the identified writes have taken place locally
    • If not, the server contacts other servers to ensure that it is brought up-to-date
  – After the read, the client's read set is updated with the server's "relevant" writes
• Monotonic write:
  – When a client issues a write, the server is given the client's write set
    • … to ensure that all specified writes have been applied (in order)
  – The write operation's ID is appended to the client's write set
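The monotonic-write path above can be sketched as follows; the write IDs, value, and log representation are illustrative:

```python
def handle_write(server_log, write_set, new_wid, new_value, state):
    """Apply every write in the client's write set (in order) before the
    new write, then append the new write's ID to both logs."""
    for wid in write_set:                 # the write set is kept in order
        if wid not in server_log:
            server_log.append(wid)        # fetch & apply the missing write
    server_log.append(new_wid)            # now the new write may proceed
    state[new_wid] = new_value
    write_set.append(new_wid)             # the client now depends on it

server_log = ["w1"]                       # this server missed w2 so far
ws, state = ["w1", "w2"], {}
handle_write(server_log, ws, "w3", "v3", state)
assert server_log == ["w1", "w2", "w3"]   # w2 applied before w3
assert ws == ["w1", "w2", "w3"]
```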
Implementing client-centric models (III)
• Read-your-writes:
  – Before serving a read request, the server fetches (from other servers) all writes in the client's write set
• Writes-follow-reads:
  – The server is first brought up-to-date with the writes in the client's read set
  – After the write, the new ID is added to the client's write set, along with the IDs in the read set
    • … as these have become "relevant" for the write just performed
Implementing client-centric models (IV)
• Grouping a client's read and write operations into sessions
  – A session is typically associated with an application
    • … but may also be associated with an application that can be temporarily shut down (e.g., an email agent)
  – What if the client never closes a session?
• How to represent the read & write sets?
  – A list of IDs for write operations
    • … not all of these are actually needed!
Implementing client-centric models (V)
• Using vector timestamps to improve efficiency:
  – When server Si accepts a write operation, it assigns to it a globally unique ID (WID) and a timestamp ts(WID)
  – Each server maintains a vector RCVD(i)
    • RCVD(i)[j] := timestamp of the latest write initiated at server Sj that has been received & processed at Si
  – The server returns its current vector timestamp with its responses to read/write requests
  – The client adjusts the timestamps of its own read/write sets
Implementing client-centric models (VI)
• Efficient representation of a read/write set A:
  – VT(A): vector timestamp
    • VT(A)[i] := maximum timestamp of all operations in A that were initiated at server Si
  – Union of two sets of write IDs:
    • VT(A+B)[i] := max{ VT(A)[i], VT(B)[i] }
  – Efficient check that A is contained in B:
    • VT(A)[i] <= VT(B)[i] for all i
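The vector-timestamp operations on this slide translate directly into code; the three-server vectors below are illustrative:

```python
def vt_union(a, b):
    """VT(A+B)[i] = max{ VT(A)[i], VT(B)[i] }"""
    return [max(x, y) for x, y in zip(a, b)]

def vt_contains(a, b):
    """A is contained in B iff VT(A)[i] <= VT(B)[i] for all i."""
    return all(x <= y for x, y in zip(a, b))

A = [3, 1, 0]   # latest operation in A initiated at S1, S2, S3
B = [3, 2, 1]
assert vt_contains(A, B)            # every op in A is also in B
assert not vt_contains(B, A)
assert vt_union(A, B) == [3, 2, 1]
```

Compared with explicit lists of write IDs, a set is represented in O(number of servers) space, and union and containment cost one pass over the vector.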
Replica Placement (I)
• The logical organization of different kinds of copies of a data store into three concentric rings: permanent replicas, server-initiated replicas, and client-initiated replicas.
Replica Placement (II)
• Permanent copies
  – The basis of the distributed data store
  – Examples from the Web:
    • Anycasting & round-robin clusters
    • Mirror sites
• Server-initiated replicas
  – Push caches
    • Dynamic replication to handle bursts
    • Read-only
  – Content Distribution Networks (CDNs)
• Client-initiated replicas
  – Improve access time to data
    • Danger of "stale" data
  – Private vs shared caches
Server-Initiated Replicas
• Counting access requests from different clients.
• Deletion threshold: del(S, F)
• Replication threshold: rep(S, F)
• A routing DB is used to determine the "closest" server for a client C
  – e.g., P := the closest server for both C1 & C2
  – cntQ(P, F): count of the requests for file F seen by server Q that originate from clients closest to P
• At each server, keep:
  – a count of accesses for each file
  – the originating clients
• Dynamic decisions to delete/migrate/replicate file F to server S
  – Extra care to ensure that at least one copy remains!
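The threshold-driven decision above can be sketched as a simple rule; the function name and constants are made up for illustration:

```python
def placement_decision(count, del_threshold, rep_threshold, copies_left):
    """Illustrative rule: replicate hot files, drop cold ones (but never
    the last remaining copy), otherwise keep or consider migration."""
    if count > rep_threshold:
        return "replicate"
    if count < del_threshold and copies_left > 1:
        return "delete"
    return "keep-or-migrate"    # between the thresholds, or last copy

# A heavily requested file gets replicated toward its clients:
assert placement_decision(100, del_threshold=5, rep_threshold=50,
                          copies_left=2) == "replicate"
# A cold file survives if it is the last copy:
assert placement_decision(1, 5, 50, copies_left=1) == "keep-or-migrate"
```

Here del(S, F) < rep(S, F), so the band between the two thresholds is where migration toward the requesting clients is considered.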
Update propagation
• State vs operations:
  – Notification of an update
    • Invalidation protocols
    • Best for a low read/write ratio
  – Transfer of data from one copy to another
    • Transfer of the actual data … or a log of changes
    • Batching
    • Best for a relatively high read/write ratio
  – Propagation of the update operation to other copies
    • Active replication
• Pull vs push:
  – Push: replicas maintain a high degree of consistency
    • Updates are expected to be of use to multiple readers
  – Pull: best for a low read/write ratio
  – Hybrid scheme based on the lease model
• Unicast vs multicast:
  – Push: multicast group
  – Pull: a single server or client requests an update
Leases
• A promise by a server that it will push updates for a specified time period
  – After expiration, the client has to "pull" for updates
• Alternatives:
  – Age-based leases
    • Depending on the last time an item was modified
    • Long-lasting leases for items that are expected to remain unmodified
  – Renewal-frequency-based leases
    • Short-term leases for clients that only occasionally ask for a specific item
  – Leases based on state-space overhead at the server
    • Lower expiration times as the server approaches overload
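The three lease policies above can be sketched as a single duration function; every constant and weight here is made up for illustration, not taken from any real system:

```python
def lease_duration(age, renewal_rate, load, base=60.0, cap=3600.0):
    """Illustrative lease policy combining the three alternatives:
    - age-based: items unmodified for longer get longer leases
    - renewal-frequency-based: frequent requesters get longer leases
    - state-space-based: durations shrink as the server nears overload
    age is seconds since last modification; load is in [0, 1]."""
    d = base * (1.0 + age / 600.0) * (1.0 + renewal_rate)
    d *= max(0.0, 1.0 - load)     # near overload, leases expire quickly
    return min(cap, d)

stable = lease_duration(age=6000, renewal_rate=1.0, load=0.1)
fresh = lease_duration(age=0, renewal_rate=1.0, load=0.1)
assert stable > fresh             # older items get longer leases
# The same item gets a much shorter lease on an almost-overloaded server:
assert lease_duration(600, 1.0, 0.99) < lease_duration(600, 1.0, 0.1)
```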
Pull versus Push Protocols
• Comparison between push-based & pull-based protocols in the case of multiple-client, single-server systems:

  Issue                      Push-based                                Pull-based
  State of server            List of client replicas and caches        None
  Messages sent              Update (and possibly fetch update later)  Poll and update
  Response time at client    Immediate (or fetch-update time)          Fetch-update time

• A stateful server keeps track of all caches.
Remote-Write Protocols (I)
• Primary-based remote-write protocol with a fixed server to which all read & write operations are forwarded.
Remote-Write Protocols (II)
• The principle of the primary-backup protocol.
Primary-backup protocols
• Blocking updates
  – … a straightforward implementation of sequential consistency
    • The primary orders all updates
    • Processes see the effects of their most recent write
• Non-blocking updates
  – … reduce the blocking delay for the process that initiated the update
    • The process only waits for the primary's ACK
  – Fault tolerance?
Local-Write Protocols (I)
• Primary-based local-write protocol in which a single copy is migrated between processes.
• How to keep track of each data item's current location?
Local-Write Protocols (II)
• Primary-backup protocol in which the primary migrates to the process wanting to perform an update.
• Suitable for disconnected operation.
Active Replication (I)
• The problem of replicated invocations.
Active Replication (II)
(a) Forwarding an invocation request from a replicated object.
(b) Returning a reply to a replicated object.
Gifford's quorum scheme (I)
• Version numbers or timestamps per copy
• A number of votes is assigned to each physical copy
  – The "weight" is related to the demand for that particular copy
  – totV(g): total number of votes for a group g of RMs
  – totV: total votes over all copies
• Obtain a quorum before a read/write:
  – R votes before a read
  – W votes before a write
  – W > 0.5 * totV: no write-write conflicts
  – R + W > totV: no read-write conflicts
• Any quorum pair must contain common copies
  – In case of a partition, it is not possible to perform conflicting operations on the same copy
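The two quorum constraints can be checked mechanically; the vote assignments below are illustrative:

```python
def valid_quorum(votes, R, W):
    """Check Gifford's constraints: W > totV/2 prevents write-write
    conflicts, and R + W > totV prevents read-write conflicts (any read
    quorum must intersect any write quorum)."""
    tot = sum(votes)
    return W > tot / 2 and R + W > tot

assert valid_quorum([1, 1, 1], R=2, W=2)       # majority read & write quorums
assert not valid_quorum([1, 1, 1], R=1, W=2)   # a read could miss the last write
assert valid_quorum([1, 1, 1], R=1, W=3)       # ROWA: read one, write all
```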
Gifford's quorum scheme (II)
• Read:
  – Version-number inquiries to find a set g of RMs with totV(g) >= R
  – Not all copies need to be up-to-date
    • Every read quorum contains at least one current copy
• Write:
  – Version-number inquiries to find a set g of up-to-date RMs with totV(g) >= W
  – If there are insufficient up-to-date copies, replace a non-current copy with a copy of a current one
• Groups of RMs can be configured to provide different performance/reliability characteristics
  – Decrease W to make writes cheaper
  – Decrease R to make reads cheaper
Gifford's quorum scheme (III)
• Performance penalty for reads
  – Due to the need to collect a read quorum
• Support for copies on the local disks of clients
  – Assigned zero votes: "weak representatives"
    • These copies cannot be included in a quorum
  – After obtaining a read quorum, a read may be carried out on the local copy if it is up-to-date
• Blocking probability:
  – In some cases, a quorum cannot be obtained
Gifford's quorum scheme (IV)

                                 Example 1   Example 2   Example 3
  Latency         Replica 1      75          75          75
  (milliseconds)  Replica 2      65          100         750
                  Replica 3      65          750         750
  Voting          Replica 1      1           2           1
  configuration   Replica 2      0           1           1
                  Replica 3      0           1           1
  Quorum          R              1           2           1
  sizes           W              1           3           3

Derived performance of the file suite:

                                 Example 1   Example 2   Example 3
  Read    Latency                65          75          75
          Blocking probability   0.01        0.0002      0.000001
  Write   Latency                75          100         750
          Blocking probability   0.01        0.0101      0.03

• The examples assume 99% availability for each RM.
• Example 1: a file with a high read/write ratio
• Example 2: a file with a moderate read/write ratio
• Example 3: a file with a very high read/write ratio
• Reads can be satisfied by the local RM, but writes must also access at least one remote RM.
Quorum-Based Protocols
• Three examples of the voting algorithm:
  a) A correct choice of read & write set
  b) A choice that may lead to write-write conflicts
  c) A correct choice, known as ROWA (read one, write all)
Transactions with Replicated Data
• Better performance
  – Concurrent service
  – Reduced latency
• Higher availability
• Fault tolerance
  – What if a replica fails or becomes isolated?
    • Upon rejoining, it must "catch up"
• Replicated transaction service
  – Data replicated at a set of replica managers (RMs)
• Replication transparency
  – One-copy serializability
  – Read one, write all
• Failures must be observed to have "happened before" any active transactions at other servers.
Network Partitions
• Separate but viable groups of servers
• Optimistic schemes validate on recovery
  – Available copies with validation
• Pessimistic schemes limit availability until recovery
• Figure: a network partition separates replicas of B; transaction T performs withdraw(B) in one partition while transaction U performs deposit(B) in the other.
Fault Tolerance
• Design to recover after a failure with no loss of (committed) data.
• Designs for fault tolerance:
  – Single server: fail and recover
  – Primary server with "trailing" backups
  – Replicated service
Fault Tolerance = ?
• Define the correctness criteria.
• When two replicas are separated by a network partition:
  – Both are deemed "incorrect" & stop serving.
  – One (the master) continues & the other ceases service.
  – One (the master) continues to accept updates & both continue to supply reads (of possibly stale data).
  – Both continue service & subsequently synchronize.
Passive Replication (I)
• At any time, the system has a single primary RM
• One or more secondary backup RMs
• Front ends communicate with the primary; the primary executes requests and propagates updates to all backups
• If the primary fails, one backup is promoted to primary
• The new primary starts from the "coordination phase" for each new request
• What happens if the primary crashes before/during/after the agreement phase?
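A toy sketch of the request path and fail-over described above; the state, operations, and class are illustrative, and a real system would need agreement on the promotion:

```python
class PassiveGroup:
    """rms[0] is the primary; the rest are backups (toy model)."""
    def __init__(self, n):
        self.rms = [{"state": 0} for _ in range(n)]

    def request(self, delta):
        primary, backups = self.rms[0], self.rms[1:]
        primary["state"] += delta          # the primary executes the request
        for b in backups:                  # coordination: push the new state
            b["state"] = primary["state"]
        return primary["state"]            # then answer the front end

    def fail_primary(self):
        self.rms.pop(0)                    # promote the first backup

g = PassiveGroup(3)                        # F+1 = 3 RMs tolerate F = 2 failures
g.request(5)
g.fail_primary()                           # the backup already holds state 5
assert g.request(3) == 8                   # no committed update was lost
```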
Passive Replication (II)
• Figure: clients (C) talk through front ends (FE) to the primary RM, which propagates updates to the backup RMs.
Passive replication (III)
• Satisfies linearizability
• Front end: looks up the new primary when the current primary does not respond
• The primary RM is a performance bottleneck
• Can tolerate F failures with F+1 RMs
• A variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
• Sun NIS (Yellow Pages) uses passive replication: clients can contact the primary or backup servers for reads, but only the primary server for updates
Active replication (I)
• RMs are state machines with equivalent roles
• Front ends communicate the client requests to the RM group using totally ordered reliable multicast
• RMs process requests independently & reply to the front end (correct RMs process each request identically)
• The front end can synthesize the final response to the client (tolerating Byzantine failures)
• Active replication provides sequential consistency if the multicast is reliable & totally ordered
• Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses
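The voting step at the front end can be sketched as follows (the replies are illustrative):

```python
from collections import Counter

def synthesize(replies, f):
    """Return the value backed by at least F+1 identical replies out of
    2F+1; with at most F Byzantine RMs, at least one of those F+1
    replies came from a correct RM, so the value is correct."""
    value, count = Counter(replies).most_common(1)[0]
    if count >= f + 1:
        return value
    raise RuntimeError("fewer than F+1 matching replies")

assert synthesize(["42", "42", "bad"], f=1) == "42"   # one faulty RM masked
```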
Active replication (II)
• Figure: clients (C) talk through front ends (FE) to a group of replica managers (RM), each of which processes every request.
Replication Architectures
• How many replicas are required?
  – All or a majority?
• Forward all updates as soon as they are received.
• Two-phase commit protocol
  – The contacted replica acts as coordinator
  – What if one of the replicas isn't available?
• Primary-copy replication
• Figure: a transaction T issues getBalance(A) and deposit(B) against replica managers holding copies of A and B.
Available Copies Replication
• Not all copies will always be available.
• Failures:
  – Timeout at a failed replica
  – Rejection by a recovering, unsynchronized replica
• Figure: copies of A are held at servers X and Y; copies of B at servers M, N, and P. Transaction T performs getBalance(A) and deposit(B), while transaction U performs getBalance(B) and deposit(A).
Local Validation
• Failure & recovery events must not occur during a transaction.
• Example:
  – T reads A before server X's failure, therefore T → failX
  – T observes server N's failure when it writes B, therefore failN → T
  – failN → T.getBalance(A); T.deposit(B) → failX
  – failX → U.getBalance(B); U.deposit(A) → failN
• Failure and recovery must be serialized just like a transaction: they occur before or after a transaction, but not during it.
• Together these orderings give: server X fails, followed by transaction U, followed by server N's failure, followed by transaction T, followed by server X's failure. This is inconsistent (X cannot fail both before U and after T), so the transactions must not be allowed to commit.