Replication and Consistency


Page 1: Replication and Consistency


Replication and Consistency

• Data and Object Replication

• Data Consistency Model

• Implementation Issues

Page 2: Replication and Consistency


Replication

• What to replicate?
  – Data, files, processes, objects, services, etc.

• Why replicate?
  – Enhance performance:
    • Caching data in clients and servers to reduce latency
    • Server farms to share workload
  – Increase availability:
    • Reduce downtime due to individual server failure through server replication
    • Support network partitions and disconnected operations through caching (e.g., the email cache in POP3)
  – Fault tolerance:
    • Highly available data is not necessarily current data. A fault-tolerant service should guarantee correct behavior despite a certain number and type of faults.
    • Correctness: freshness, timeliness
    • If up to f servers can exhibit Byzantine failures, at least 2f+1 servers are needed, as sketched below.
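To see why 2f+1 replicas tolerate f Byzantine servers: a client that collects a reply from every server holds at least f+1 identical correct replies, which outvote the at most f arbitrary ones. A minimal voting sketch in Python (the helper name and reply format are illustrative, not from the slides):

    from collections import Counter

    def majority_read(replies, f):
        """Given replies from all 2f+1 servers, at most f of which may be
        Byzantine (returning arbitrary values), the correct value still
        holds a strict majority of at least f+1 votes."""
        assert len(replies) == 2 * f + 1
        value, votes = Counter(replies).most_common(1)[0]
        if votes >= f + 1:
            return value
        raise RuntimeError("no majority; more than f faulty servers?")

    # Example: f = 1, so 3 servers; one of them lies.
    print(majority_read(["v1", "v1", "garbage"], f=1))   # -> "v1"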

Page 3: Replication and Consistency


Challenges

• Replication Transparency
  – Clients should not be aware of the presence of multiple physical copies; data are organized as logical objects.
  – Exactly one set of return values
  – "We use replication techniques ubiquitously to guarantee consistent performance and high availability. Although replication brings us closer to our goals, it cannot achieve them in a perfectly transparent manner." [Werner Vogels, CTO of Amazon, Jan 2009]

• Consistency
  – Whether the operations on a collection of replicated objects produce results that meet the specification of correctness for those objects
  – A read should return the value of the last write operation on the data
  – Global synchronization for atomic update operations (transactions), but it is too costly
  – Relax the atomic update requirement so as to avoid instantaneous global synchronization
    • To what extent can consistency be loosened?
    • Without a global clock, how do we define the "last" write? At what overhead?
  – Overhead of keeping copies up to date

Page 4: Replication and Consistency


Replication of Shared Objects

• Replicating a shared remote object without correctly handling concurrent invocations leads to consistency problems.
  – W(x) R(x) at one site, but R(x) W(x) at another site

• Replicas need additional synchronization to ensure that concurrent invocations are performed in a correct order at each replica.
  – Need support for concurrent access to shared objects
  – Need support for coordination among local and remote accesses

Page 5: Replication and Consistency


Consistency Models

• A consistency model is a contract between processes and the data store. For replicated objects in the data store that are updated by multiple processes, the model specifies the consistency guarantees that the data store makes about the return values of read ops.
  – E.g., x=1; x=2; print(x) in the same process P1?
  – E.g., x=1 (at P1 @ 5:00pm); x=2 (at P2 @ 1 ns after 5:00pm); print(x) (at P1)?

Page 6: Replication and Consistency


Overview of Consistency Models

• Strict Consistency
• Linearizable Consistency
• Sequential Consistency
• Causal Consistency
• FIFO Consistency
• Weak Consistency
• Release Consistency
• Entry Consistency

(Top of the list: ease of programming, less efficiency. Bottom of the list: harder to program, more efficient.)

Page 7: Replication and Consistency


Strict Consistency

• A read of data item x returns the value of the most recent write on x.
  – Most recent in the sense of absolute global time
  – E.g., x=1; x=2; print(x) in the same process P1?
  – E.g., x=1 (at P1 @ 5:00pm); x=2 (at P2 @ 1 ns after 5:00pm); print(x) (at P1)?

• A strictly consistent store requires that all writes are instantaneously visible to all processes.

[Figure: event timelines. (a) A strictly consistent store; (b) a store that is not strictly consistent.]

Page 8: Replication and Consistency


Linearizability

• Operations are assumed to receive a timestamp using a globally available clock, but one with only finite precision.

• A data store is linearizable when each op is timestamped and
  i) The result of any execution is the same as if the ops by all processes on the data store were executed in some sequential order;
  ii) The ops of each individual process appear in this sequence in the order specified by its program;
  iii) If ts_op1(x) < ts_op2(x), then op1(x) should precede op2(x) in this sequence.

Page 9: Replication and Consistency


Sequential Consistency

• A data store is sequentially consistent if
  i) The result of any execution is the same as if the ops by all processes on the data store were executed in some sequential order;
  ii) The ops of each process appear in this sequence in the order specified by its program.

[Figure: writer/observer timelines. (a) A sequentially consistent data store. (b) A data store that is not sequentially consistent.]

Page 10: Replication and Consistency


Possible Executions in the SC Model

Process P1: x = 1; print(y, z);
Process P2: y = 1; print(x, z);
Process P3: z = 1; print(x, y);

Assume initially x = y = z = 0. "Prints" lists the outputs in execution order; the signature concatenates P1's output, then P2's, then P3's.

(a) x = 1; print(y, z); y = 1; print(x, z); z = 1; print(x, y);
    Prints: 001011    Signature: 001011

(b) x = 1; y = 1; print(x, z); print(y, z); z = 1; print(x, y);
    Prints: 101011    Signature: 101011

(c) y = 1; z = 1; print(x, y); print(x, z); x = 1; print(y, z);
    Prints: 010111    Signature: 110101

(d) y = 1; x = 1; z = 1; print(x, z); print(y, z); print(x, y);
    Prints: 111111    Signature: 111111

Is signature 001001 valid under the SC model?
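One way to answer such questions mechanically is to enumerate every interleaving of the six statements that respects each process's program order and collect the resulting signatures; sequential consistency permits exactly those. A brute-force Python sketch (names are illustrative):

    from itertools import permutations

    # The three programs; initially x = y = z = 0.
    P = [
        [("write", "x"), ("print", "y", "z")],   # P1
        [("write", "y"), ("print", "x", "z")],   # P2
        [("write", "z"), ("print", "x", "y")],   # P3
    ]

    def sc_signatures():
        """All signatures permitted by SC: run every interleaving of the
        six statements that preserves program order, then concatenate
        P1's output, P2's, and P3's."""
        stmts = [(p, s) for p, prog in enumerate(P) for s in prog]  # 0..5
        sigs = set()
        for order in permutations(range(6)):
            # Program order: statement 2p must precede statement 2p+1.
            if any(order.index(2 * p) > order.index(2 * p + 1) for p in range(3)):
                continue
            mem = {"x": 0, "y": 0, "z": 0}
            out = ["", "", ""]
            for i in order:
                proc, (kind, *vars_) = stmts[i]
                if kind == "write":
                    mem[vars_[0]] = 1
                else:
                    out[proc] += f"{mem[vars_[0]]}{mem[vars_[1]]}"
            sigs.add("".join(out))
        return sigs

    print("001011" in sc_signatures())   # True: execution (a) above
    print("001001" in sc_signatures())   # False: not sequentially consistent

Signature 001001 would require P3 to observe y=1 but not yet x=1, while P1's program order forces x=1 before its print of y=0; no single sequential order satisfies both.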

Page 11: Replication and Consistency


Remarks on the SC Model

• The SC model was conceived by Lamport in 1979 for the shared memory of multiprocessors.
  – Maintains the program order of each thread
  – Gives a coherent view of any data item

• Low efficiency in implementation
  – Each read/write op must be atomic.
  – Atomic reads/writes on distributed shared objects cost too much.

Page 12: Replication and Consistency


Causal Consistency

• In SC, two ops w(x) and w(y) must provide a consistent view to all processes.

• In CC, the requirement is relaxed if w(x) and w(y) have no causal relationship (Hutto and Ahamad, '90).

• That is, writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines.

[Figure: an event sequence that is valid in CC but invalid in SC.]

Page 13: Replication and Consistency


Causal Consistency (Cont)

a) A violation of a causally-consistent store.
b) A correct sequence of events in a causally-consistent store.

Page 14: Replication and Consistency


Remarks on Causal Consistency

• The causal consistency model allows concurrent writes to be seen in a different order on different machines, so concurrent writes can be executed in parallel (overlapped) to improve efficiency.

• It relies on compiler/run-time support for constructing and maintaining a dependency graph that captures causality between the operations; a common realization is sketched below.
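One common concrete form of that dependency tracking is a vector clock per process: writes are causally ordered exactly when their timestamps are componentwise ordered, and incomparable timestamps mark the concurrent writes that replicas may apply in different orders. A minimal sketch (the class and API are illustrative, not from the slides):

    def causally_precedes(a, b):
        """w(a) must be applied before w(b) at every replica iff a <= b
        componentwise and a != b; incomparable timestamps mean the
        writes are concurrent and may be applied in any order."""
        return all(x <= y for x, y in zip(a, b)) and a != b

    class VectorClock:
        """One clock per process: increment your own slot on a local
        write, take the componentwise max when applying a remote write,
        and tag each write with the current vector."""
        def __init__(self, n, proc_id):
            self.v, self.id = [0] * n, proc_id
        def local_write(self):
            self.v[self.id] += 1
            return tuple(self.v)
        def apply_remote(self, ts):
            self.v = [max(a, b) for a, b in zip(self.v, ts)]

    p0, p1, p2 = (VectorClock(3, i) for i in range(3))
    w1 = p0.local_write()          # (1, 0, 0)
    p1.apply_remote(w1)            # P1 sees w1 ...
    w2 = p1.local_write()          # (1, 1, 0): causally after w1
    w3 = p2.local_write()          # (0, 0, 1): concurrent with w1, w2
    print(causally_precedes(w1, w2))                             # True
    print(causally_precedes(w1, w3), causally_precedes(w3, w1))  # False False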

Page 15: Replication and Consistency


Weak Consistency

• Weak consistency relaxes the ordering requirement by distinguishing synchronization ops from normal ops (Dubois et al. 1988):
  – Accesses to synchronization variables associated with a data store are sequentially consistent.
  – No operation on a synchronization variable is allowed to be performed until all previous writes have been completed everywhere.
  – No read or write ops are allowed to be performed until all previous operations to synchronization variables have been performed.
(A toy sketch of these rules follows.)
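A toy, single-address-space sketch of the three rules (illustrative only, not a real distributed implementation): normal writes stay local and unordered, while sync() plays the role of an access to the synchronization variable, completing this process's prior writes "everywhere" and pulling in everyone else's synced writes before the process continues.

    class WeakStore:
        """Toy model of weak consistency for one process. `shared`
        stands in for the globally visible state; between sync() calls,
        reads may be stale and local writes are invisible to others."""
        shared = {}

        def __init__(self):
            self.view, self.pending = dict(WeakStore.shared), {}

        def write(self, item, value):             # normal op: local only
            self.view[item] = value
            self.pending[item] = value

        def read(self, item):                     # normal op: maybe stale
            return self.view.get(item)

        def sync(self):                           # the synchronization op
            WeakStore.shared.update(self.pending)  # my writes complete everywhere
            self.pending.clear()
            self.view = dict(WeakStore.shared)     # and I see all synced writes

    p, q = WeakStore(), WeakStore()
    p.write("x", 1)
    print(q.read("x"))    # None: p has not synced, so a stale view is legal
    p.sync(); q.sync()
    print(q.read("x"))    # 1: after synchronization the stores agree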

Page 16: Replication and Consistency


Weak Consistency (cont')

[Figure: an invalid sequence under WC; a valid sequence under WC.]

• WC enforces consistency on a group of operations, not on individual reads/writes.
• WC guarantees the time when consistency holds, not the form of consistency.

Page 17: Replication and Consistency


Summary of Consistency Models

(a) Consistency models not using synchronization operations:

  Strict           Absolute time ordering of all shared accesses matters.
  Linearizability  All processes must see all shared accesses in the same order. Accesses are furthermore ordered according to a (non-unique) global timestamp.
  Sequential       All processes see all shared accesses in the same order. Accesses are not ordered in time.
  Causal           All processes see causally-related shared accesses in the same order.
  FIFO             All processes see writes from each process in the order they were issued. Writes from different processes may not always be seen in that order.

(b) Models with synchronization operations:

  Weak             Shared data can be counted on to be consistent only after a synchronization is done.
  Release          Shared data are made consistent when a critical region is exited.
  Entry            Shared data pertaining to a critical region are made consistent when a critical region is entered.

Page 18: Replication and Consistency


Eventually Consistent Distributed Data Store

• In such distributed data stores, all replicas gradually become consistent if there are no updates for a long time.

• Examples:
  – In distributed web servers, web pages are usually updated by a single authority.
  – A web cache may not return the latest contents, but this inconsistency is acceptable in some situations.
  – DNS implements EC.

• Usages or target applications:
  – Most ops involve reading data
  – Lack of simultaneous updates
  – A relatively high degree of inconsistency can be tolerated

Page 19: Replication and Consistency


Client-Centric Eventual Consistency

• EC stores require only that updates are guaranteed to propagate to all replicas. Conflicts due to concurrent writes are often easy to resolve. Cheap implementation.

• Eventually consistent data stores work fine as long as clients always access the same replica.

• Problems arise when different replicas are accessed by the same process at different times, motivating client-centric guarantees: causal consistency for a single client.

Page 20: Replication and Consistency


Variations of Client-Centric EC

• Monotonic reads
  – If a process reads the value of a data item x, any successive read on x by that process will never return an older value of x.
  – E.g., mail read from a mailbox in SF will also be seen in the mailbox in NY.

• Monotonic writes
  – A write by a process on a data item x is completed before any successive write on x by the same process.
  – Similar to FIFO consistency, but the monotonic-write model is about the behavior of a single process.

• Read your writes
  – A write on data item x by a process will always be seen by a successive read on x by the same process.
  – E.g., integrated editor and browser; integrated dvips and ghostview.

• Writes follow reads
  – A write on x by a process following its previous read of x is guaranteed to take place on the same or a more recent value of x than was read.

(A client-side sketch of the first guarantee follows.)
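A client-side sketch of monotonic reads, assuming per-item version numbers on replicas (the replica layout and exception are illustrative): the session tracks the highest version it has returned and rejects a replica that lags behind; a real client would redirect the read to a fresher replica instead of raising.

    class StaleReplica(Exception):
        pass

    class Session:
        """Remember the highest version of each item this session has
        seen, and refuse reads from replicas still behind it."""
        def __init__(self):
            self.last_seen = {}                    # item -> version

        def read(self, replica, item):
            version, value = replica[item]         # replica: item -> (version, value)
            if version < self.last_seen.get(item, -1):
                raise StaleReplica(f"{item} at v{version} < v{self.last_seen[item]}")
            self.last_seen[item] = version
            return value

    sf = {"inbox": (2, ["msg1", "msg2"])}          # up-to-date replica
    ny = {"inbox": (1, ["msg1"])}                  # lagging replica
    s = Session()
    s.read(sf, "inbox")                            # ok; session now at v2
    try:
        s.read(ny, "inbox")                        # would violate monotonic reads
    except StaleReplica as e:
        print("rejected:", e)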

Page 21: Replication and Consistency


Outline

• Object Replication
• Data Consistency Model (client-side)
• Implementation Issues (server-side)
  – Update propagation: distributing updates to replicas, independent of consistency model
  – Other consistency-model-specific issues
• Case Studies

Page 22: Replication and Consistency


Replica Placement

• A major design issue is to decide where, when, and by whom copies of the data store are to be placed.

• Permanent replicas
  – The initial set of replicas that constitutes a distributed store
  – For example:
    • Distributed web servers
    • Server mirrors
    • Distributed database systems (shared-nothing architecture vs. federated DB)

Page 23: Replication and Consistency


Replica Placement: Server-Initiated

• Dynamic placement of replicas is key to a content delivery network (CDN).
  – A CDN company installs hundreds of CDN servers throughout the Internet.
  – The CDN replicates its customers' content on the CDN servers. When a provider updates content, the CDN updates its servers.

• Key issue: when and where replicas should be created or deleted.

[Figure: an origin server in North America feeding a CDN distribution node, which updates CDN servers in S. America, Europe, and Asia.]

Page 24: Replication and Consistency


CDN Content Placement

• An algorithm (Rabinovich et al. 1999):
  – Each server keeps track of access counts per file, and where the access requests come from.
  – Assume that, given a client C, each server can determine which of the servers is closest to C.

Page 25: Replication and Consistency


Server-Initiated Replication (cont')

• An algorithm (Rabinovich et al. 1999):
  – Initial placement
  – Migration or replication of objects to servers in the proximity of clients that issue many requests for those objects
    • Count access requests to file F at server S from different clients
    • Deletion threshold del(S,F) and replication threshold rep(S,F), with del < rep:
      – If #access(S,F) <= del(S,F), remove the file, unless it is the last copy
      – If #access(S,F) >= rep(S,F), duplicate the file somewhere
      – If del(S,F) < #access(S,F) < rep(S,F), consider migrating the file
(A sketch of this three-way decision follows.)
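A sketch of the three-way decision, with hypothetical threshold values (the real algorithm also chooses which server to replicate or migrate to, based on where the counted requests came from):

    def placement_action(access_count, del_threshold, rep_threshold):
        """Per-(server, file) decision in a Rabinovich-style scheme:
        thresholds with del < rep split the access count into
        delete / migrate / replicate regions."""
        assert del_threshold < rep_threshold
        if access_count <= del_threshold:
            return "delete (unless this is the last copy)"
        if access_count >= rep_threshold:
            return "replicate to a server near the requesting clients"
        return "consider migrating toward the requesting clients"

    for count in (3, 40, 400):   # hypothetical access counts
        print(count, "->", placement_action(count, del_threshold=5, rep_threshold=100))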

Page 26: Replication and Consistency


Replica Placement: Client-Initiated

• Cache: local storage used to temporarily hold a copy of recently accessed data, mainly to reduce request time and traffic on the net.
  – Client-side cache (forward cache in the Web)
  – Proxy cache
  – Proxy cache network

[Figure: clients issuing HTTP requests through a proxy server, which forwards cache misses to the origin servers and relays the HTTP responses back.]

Page 27: Replication and Consistency

Update Propagation

• Invalidation vs. update protocols
  – In an invalidation protocol, replicas are invalidated by a small message.
  – In an update protocol, replicas are brought up to date by providing the modified data, or the specific update operations (active replication).

• Pull vs. push protocols
  – Push-based (server-based) suits applications with high read-to-update ratios (why?)
  – Pull-based (client-based) is often used by client caches

• Unicasting versus multicasting

Push vs. pull in a multiple-clients-single-server system:

  Issue                    Push-based                                 Pull-based
  State of server          List of client replicas and caches         None
  Messages sent            Update (and possibly fetch update later)   Poll and update
  Response time at client  Immediate (or fetch-update time)           Fetch-update time

Page 28: Replication and Consistency


Lease-Based Update Propagation

• [Gray & Cheriton '89]: A lease is a server promise that updates will be pushed for a specific time. When a lease expires, the client is forced to
  – pull the modified data from the server, if any exists, or
  – request a new lease for pushing updates.

• [Duvvuri et al. '00]: Flexible lease system, where the lease period can be dynamically adapted:
  – Age-based leases, based on the last time the item was modified: long-lasting leases for inactive data items.
  – Renewal-frequency-based leases: long-term leases granted to clients whose caches need to be refreshed often.
  – Based on server-side state-space overhead: an overloaded server should reduce the lease period so that it needs to keep track of fewer clients, as leases expire more quickly.

• [Yin et al. '99]: Leases on objects as well as on volumes (i.e., collections of related objects).

(A minimal client-side lease sketch follows.)
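A minimal client-side sketch of the basic lease mechanism, with a trivially stubbed origin server (the Server class and durations are illustrative; the adaptive policies above would vary only duration_s):

    import time

    class Lease:
        """While the lease is valid, the server has promised to push
        updates, so a cached read is safe; once it expires, the client
        must pull fresh data (or renew the lease)."""
        def __init__(self, duration_s):
            self.expires = time.monotonic() + duration_s

        def valid(self):
            return time.monotonic() < self.expires

    class Server:                      # stand-in for the origin server
        def __init__(self):
            self.data = {"page": "v1"}
        def fetch(self, item):         # the pull path, used after expiry
            return self.data[item]

    server, cache = Server(), {"page": "v1"}
    lease = Lease(duration_s=0.05)
    print(cache["page"] if lease.valid() else server.fetch("page"))  # cached read
    server.data["page"] = "v2"
    time.sleep(0.06)                   # lease expires; pushes no longer promised
    print(cache["page"] if lease.valid() else server.fetch("page"))  # pulls "v2"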

Page 29: Replication and Consistency


Volume Lease-Based Update

• Leases provide good performance when the cost of lease renewal is amortized over many reads.

• If the renewal cost is high, the lease time must be long enough; but a long lease implies a high chance of reading stale data.

• A volume lease combines a long object lease with a short volume lease:
  – The long object lease saves reading time.
  – The short volume lease keeps objects updated.
  – Exploits spatial locality between objects in a volume.
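With the same Lease sketch as above, the volume-lease read rule is just a conjunction; renewing the single short volume lease revalidates every cached object in that volume at once, which is what amortizes the renewal cost:

    def can_read_locally(object_lease, volume_lease):
        """A cached object may be served locally only while BOTH its
        long object lease and the short lease on the enclosing volume
        are valid; one volume-lease renewal covers the whole volume."""
        return object_lease.valid() and volume_lease.valid()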

Page 30: Replication and Consistency


Other Impl. Issues of Consistency Models

• Passive replication (primary-backup organization)
  – A single primary replica manager at any time and one or more secondary replica managers
  – Writes can be carried out only on the primary copy
  – Two write strategies:
    • Remote-write protocols
    • Local-write protocols

• Active replication:
  – There are multiple replica managers and writes can be carried out at any replica

• Cache coherence protocols

Page 31: Replication and Consistency


Passive Replication: Remote Write

• A write is handled only by the remote primary server, and the backup servers are updated accordingly; a read is performed locally.

• E.g., Sun Network Information Service (NIS, formerly Yellow Pages)

Page 32: Replication and Consistency


Consistency Model of Passive Replication

• Blocking update: wait until the backups are updated.
  – A blocking update of the backup servers must be atomic so as to implement sequential consistency: the primary can sequence all incoming writes, and all processes see all writes in the same order from any backup server.

• Nonblocking update: return as soon as the primary is updated.
  – What happens if a backup fails after the update is acknowledged?
  – Consistency model with non-blocking update??

• Requires atomic multicasting in the presence of failures!
  – Primary replica failure
  – Group membership change
  – Virtual synchrony implementation

(A sketch of the blocking variant follows.)
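A sketch of the blocking remote-write path (simplified: each backup's acknowledgement is modeled as a synchronous call, and failures are ignored). Because the primary alone assigns sequence numbers and the client is acknowledged only after every backup has applied the write, every server exposes writes in the same order:

    class ReplicaServer:
        def __init__(self):
            self.store, self.seq = {}, 0
        def apply(self, seq, item, value):
            self.store[item] = (seq, value)

    def remote_write(primary, backups, item, value):
        """Blocking primary-backup update: the primary serializes the
        write, pushes it to every backup, and acknowledges the client
        only once all backups have applied it."""
        primary.seq += 1
        seq = primary.seq
        primary.apply(seq, item, value)
        for b in backups:               # blocking step: one ack per backup
            b.apply(seq, item, value)
        return seq                      # only now acknowledge the client

    primary, backups = ReplicaServer(), [ReplicaServer(), ReplicaServer()]
    remote_write(primary, backups, "x", 42)
    print([s.store["x"] for s in [primary] + backups])   # identical everywhere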

Page 33: Replication and Consistency


Passive Replication: Local-Write

• Based on object migration
  – Multiple successive writes can be carried out locally

• How to locate the data item?
  – Broadcast, home-based approaches, forwarding pointers, or a hierarchical location service

Page 34: Replication and Consistency


Remarks

• Without backup servers
  – Implements linearizability

• With backup servers
  – Reads/writes can be carried out simultaneously.
  – Consistency model??

• Applicable to mobile computing
  – A mobile host serves as the primary server before disconnecting.
  – While disconnected, all update operations are carried out locally; other processes can still perform read operations.
  – When connecting again, updates are propagated from the primary to the backups.

Page 35: Replication and Consistency


Active Replication

• In active replication, each replica has an associated process that carries out update ops.
  – A client makes a request (via a front end) and the request is multicast to the group of replica managers.
  – Totally-ordered reliable multicast, based on Lamport timestamps
  – Implements sequential consistency

• An alternative is based on a central coordinator (sequencer), as sketched below:
  – First forward each op to the sequencer for a unique sequence number
  – Then forward the op, together with the number, to all replicas

• Another problem is replicated invocations:
  – Object A invokes object B, which in turn invokes object C; if object B is replicated, each replica will invoke C independently, in principle.
  – This problem occurs in any client-server system.
  – No satisfactory solutions.
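A sketch of the sequencer alternative (class names are illustrative): the sequencer hands out gapless sequence numbers, and each replica buffers out-of-order deliveries, so all replicas apply the same total order regardless of message arrival order.

    import itertools

    class Sequencer:
        """Central coordinator: assigns a unique, gapless sequence
        number per operation, fixing one total order for all replicas."""
        def __init__(self):
            self._next = itertools.count()
        def assign(self, op):
            return (next(self._next), op)

    class Replica:
        """Applies operations strictly in sequence-number order,
        buffering any that arrive early."""
        def __init__(self):
            self.applied, self.pending, self.next_seq = [], {}, 0
        def deliver(self, seq, op):
            self.pending[seq] = op
            while self.next_seq in self.pending:   # apply in order, no gaps
                self.applied.append(self.pending.pop(self.next_seq))
                self.next_seq += 1

    seqr, r1, r2 = Sequencer(), Replica(), Replica()
    a = seqr.assign("x=1"); b = seqr.assign("y=2")
    r1.deliver(*a); r1.deliver(*b)
    r2.deliver(*b); r2.deliver(*a)                 # arrives out of order
    print(r1.applied == r2.applied)                # True: same total order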

Page 36: Replication and Consistency


Replication-Aware Solution to Replicated Invocations

• Forwarding an invocation request from a replicated object.
• Returning a reply to a replicated object.

Page 37: Replication and Consistency


In Summary

• Replication
  – Data, file (web files), and object replication
  – Reliability and performance

• Client-perceived consistency models
  – Strong consistency
  – Weak consistency
  – Eventual consistency and its client-centric variations

• Server implementation (update propagation)

• CAP Theorem [Brewer '00, Gilbert & Lynch '02]
  – Of the three properties of shared-data systems: data consistency, service availability, and tolerance to network partitions, only two can be achieved at any given time.

Page 38: Replication and Consistency

CAP Theorem

• Three core systemic requirements in designing and deploying applications in a distributed environment:
  – Consistency, Availability, and Partition tolerance
  – Partition tolerance: no set of failures less than total network failure is allowed to cause the system to respond incorrectly.
  – Availability (minimal latency)
    • Amazon claims that just an extra one tenth of a second in their response times would cost them 1% in sales. Google noticed that just a half-second increase in latency caused traffic to drop by a fifth.


Page 39: Replication and Consistency

CAP

• If we want A and B to be highly available (i.e., working with minimal latency), and we want our nodes N1 to Nn (where n could be hundreds or even thousands) to remain tolerant of network partitions (lost messages, undeliverable messages, hardware outages, process failures), then sometimes we will get cases where some nodes think that V is V0 and other nodes think that V is V1.


Page 40: Replication and Consistency

CAP Theorem Proof

• Impossibility proof:
  – If M is an async message (no clocks at all), N1 has no way of knowing whether N2 gets the message.
  – Even with guaranteed delivery of M, N1 has no way of knowing whether a message is delayed by a partition event or by something failing in N2.
  – Making M synchronous doesn't help, because that treats the write by A on N1 and the update event from N1 to N2 as one atomic op, which gives us the same latency issues.
  – Even in a partially-synchronous model, atomicity cannot be guaranteed.
    • In a partially-synchronous network, each site has a local clock and all clocks increment at the same rate, but the clocks are not synchronized to display the same value at the same instant.

Page 41: Replication and Consistency

CAP Theorem Proof (cont')

• Consider a transaction (i.e., a unit of work based around the persistent data item V) called α; then α1 could be the write op from before and α2 could be the read.
  – On a local system, this would easily be handled by a database with some simple locking, isolating any attempt to read in α2 until α1 completes safely.
  – In the distributed model, though, with nodes N1 and N2 to worry about, the intermediate synchronizing message also has to complete. Unless we can control when α2 happens, we can never guarantee it will see the same data values that α1 writes. All methods to add control (blocking, isolation, centralized management, etc.) will impact either partition tolerance or the availability of α1 (A) and/or α2 (B).


Page 42: Replication and Consistency

Dealing with CAP

• Drop partition tolerance
  – But forgoing partition tolerance limits scaling.

• Drop availability
  – On encountering a partition event, affected services simply wait until data is consistent, and therefore remain unavailable during that time. Controlling this could be complex in large-scale systems.

• Drop consistency
  – Many inconsistencies don't actually require much work to resolve.
  – The eventually consistent model works for most apps.

• BASE: Basically Available, Soft-state, Eventually consistent (the logical opposite of ACID in databases: atomicity, consistency, isolation, durability)
