Consistency in Optimistic Replication Systems
by
Sunny Ming-Cheung Ho
B.Sc.(Honours), University of British Columbia, 1999
AN ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Department of Computer Science)
We accept this essay as conforming to the required standard
The University of British Columbia
August 2003
© Sunny Ming-Cheung Ho, 2003
Abstract
Consistency is a major issue in optimistic replication. Four concepts related to consistency will be discussed: eventual consistency, bounded inconsistency, client-centric consistency, and conflict resolution. Eventual consistency is concerned with ways of ensuring that all replicas eventually converge to a common state. Bounded inconsistency relates to limiting the amount of inconsistency between any two replicas. Client-centric consistency refers to providing clients with guarantees on the quality of data in a replica. Conflict resolution is used to reconcile conflicting updates made to replicas.
Contents
Abstract
Contents
Acknowledgements
1 Introduction
1.1 Pessimistic Replication
1.2 Optimistic Replication
2 Eventual Consistency
2.1 System Model
2.2 State-Transfer Systems
2.3 Operation-Transfer Systems
2.3.1 Update Propagation
2.3.2 Update Scheduling
2.3.3 Dealing with Conflicts
2.4 Discussion
3 Bounded Inconsistency
3.1 Continuum between Strong and Optimistic Consistency
3.2 TACT
3.2.1 Numerical Error
3.2.2 Order Error
3.2.3 Staleness
3.3 Discussion
4 Client-Centric Consistency
4.1 Read Your Writes
4.2 Monotonic Reads
4.3 Writes Follow Reads
4.4 Monotonic Writes
4.5 Implementation
4.6 Discussion
5 Conflict Resolution
5.1 Conflict Detection
5.1.1 Syntactic Conflict Detection
5.1.2 Semantic Conflict Detection
5.2 Conflict Resolution
5.3 Discussion
6 Conclusion
Bibliography
Acknowledgements
I would like to thank my supervisor, Dr. Norm Hutchinson, for his support, patience, and guidance during the course of my studies in the Master's program.
Sunny Ming-Cheung Ho
The University of British Columbia
August 2003
Chapter 1
Introduction
Data replication can be used to improve availability and performance [7]. It can allow
data to remain accessible even when there are node and network failures [7], and
it can also guard against permanent data loss when a replica fails [8]. Performance
can be improved by allowing concurrent access to replicas and by reducing network
latency because accesses can be directed to nearby replicas [7].
Pessimistic replication and optimistic replication are two contrasting replica-
tion models. They represent the two extremes in the availability-consistency tradeoff
[13]. Pessimistic replication favours consistency over availability, while optimistic
replication favours availability over consistency. In describing these two models, I
assume that replicas can accept read and write requests when there are no network
or node failures in the system.
1.1 Pessimistic Replication
In pessimistic replication, users never observe any inconsistencies in the replicated
data. In terms of consistency, it appears to the users as if there is only one replica
[7]. Conceptually, an update made to a replica is synchronously propagated to all
the other replicas. When nodes or networks fail, access to data may be denied
to prevent users from viewing inconsistent data [7]. For instance, in the presence
of network partitions, access to data may be denied until the
partition is healed. Also, if a replica is unavailable (perhaps due to node failure),
it may prevent other replicas from being accessed temporarily until either the node
failure is detected [7] or the node recovers.
1.2 Optimistic Replication
Optimistic replication allows users to access any replica for reading or writing even
when there are network failures or when some replicas are unavailable. This property
of optimistic replication has two significant implications. First, the states of replicas
can be temporarily mutually inconsistent. An update can be applied to a single
replica without the update being synchronously applied to other replicas. There
may even be a substantial time lag from when an update is applied at a replica
to when it is eventually propagated to other replicas, which may result in stale
reads. Second, concurrent updates to different replicas may introduce conflicts. For
instance, in an optimistically-replicated airline reservation system [13], two replicas
may accept a reservation for the same seat. Despite these two drawbacks, optimistic
replication presents a number of advantages over pessimistic replication [7]:
• Availability. Data availability is high as accesses to data are never blocked.
• Networking flexibility. Networks do not need to be fully connected for replicas
to remain fully accessible.
• Scalability. A greater number of sites can be supported because synchronous
communication is not needed for accepting updates.
This essay will introduce some issues related to consistency in optimistic
replication. In particular, four concepts related to consistency that researchers have
investigated will be discussed: eventual consistency, bounded inconsistency, client-
centric consistency, and conflict resolution. Eventual consistency is concerned with
ways of ensuring that all replicas eventually converge to a common state. Bounded
inconsistency relates to limiting the amount of inconsistency between any two repli-
cas. Client-centric consistency refers to providing clients with guarantees on the
quality of the data when accessing a replica. Conflict resolution is used to reconcile
conflicting updates made to replicas.
Chapter 2
Eventual Consistency
A fundamental goal in optimistic replication is for all the replicas to eventually con-
verge to a common state [7]. Researchers have studied various mechanisms for pro-
viding eventual consistency. Saito and Shapiro [7] provide a comprehensive overview
of some of the techniques and methods developed for achieving eventual consistency,
from which I present the basic concepts.
2.1 System Model
Conceptually, updates are initiated at a single replica and eventually propagated
to all other replicas. Updates are applied immediately to the local replica. There
are two forms in which updates can be propagated from one replica to another. In
state-transfer systems [7], the entire replica is transferred and overwrites another
replica. The updates applied to the replica are implicitly propagated when the
replica is transferred. In operation-transfer systems [7], each replica maintains a
write log that contains updates initiated locally as well as updates received from
other replicas. Only the updates in the write log are propagated. There are varying
methods for achieving eventual consistency using either state-transfer or operation-
transfer.
2.2 State-Transfer Systems
Achieving eventual consistency in state-transfer systems only requires that the most
recent replica be propagated and copied over the other replicas. This can be easily
accomplished by associating a timestamp with each replica that denotes the time
when it was last updated. To carry out update propagation, two replicas compare
their timestamps, and the replica with the more recent timestamp is copied over
the older replica. Given enough rounds of propagation, every replica will eventually
contain the most recent version. Assuming that clocks at the replicas are loosely
synchronized, using physical clocks for timestamping replicas is acceptable.
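The timestamp-comparison scheme above can be sketched as follows. This is a minimal illustration; the class and function names (`Replica`, `sync`) are hypothetical, a real system would ship state over a network rather than copy fields, and, as discussed next, this last-writer-wins policy silently discards the loser of any concurrent update.

```python
import time

class Replica:
    """A state-transfer replica: the whole value is shipped, tagged with
    the time of its last update. All names here are hypothetical."""

    def __init__(self):
        self.value = None
        self.timestamp = 0.0  # physical-clock time of last local update

    def write(self, value, now=None):
        self.value = value
        self.timestamp = now if now is not None else time.time()

def sync(a, b):
    """One round of propagation: the replica with the more recent
    timestamp overwrites the older one (last-writer-wins)."""
    if a.timestamp > b.timestamp:
        b.value, b.timestamp = a.value, a.timestamp
    elif b.timestamp > a.timestamp:
        a.value, a.timestamp = b.value, b.timestamp
```

Given enough pairwise `sync` rounds, every replica ends up holding the value with the globally largest timestamp.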
Even if concurrent updates to different replicas are not detected, eventual
consistency can still be achieved as long as one timestamp can be unambiguously
determined to be larger than the other whenever two timestamps are compared.
Without detection of concurrent updates, though, some updates may be lost. For
instance, if there are two file replicas and both are updated concurrently, then the
replica with the more recent timestamp will overwrite the other replica, and the
update that was made at the overwritten replica will essentially be lost. However,
ignoring concurrent updates is generally not an acceptable policy. A possible way
to achieve eventual consistency even when concurrent updates are detected is to
create a reconciled version of the two replicas containing the concurrent updates
and to timestamp the reconciled version in a way so that it is considered more
recent than either replica [1]. The details of detecting concurrent updates relate to
conflict detection and resolution and will be discussed in Chapter 5.
2.3 Operation-Transfer Systems
Eventual consistency in operation-transfer systems can be achieved through two
mechanisms: update propagation and update scheduling [7].
2.3.1 Update Propagation
Updates made at each replica need to be propagated to all other replicas. This
can be achieved through epidemic propagation. In epidemic propagation, a replica
propagates both the updates in its log that were initiated at the replica as well as
updates it received from other replicas. One advantage of epidemic propagation
is that updates may still reach all the replicas even if the network is never fully
connected at any point in time.
To efficiently determine the updates to propagate to another replica, each
replica maintains a timestamp vector that indicates the set of writes in its write log.
First of all, each update initiated at a replica receives a unique timestamp from that
replica. The timestamp can come from a physical clock, Lamport logical clock [6],
or an increasing counter. The timestamp vector, TV, has an entry for each replica.
At a given replica, a value of i for entry j indicates that the replica has received
all updates with timestamp less than or equal to i from replica j. Hence, a replica
can simply update its own entry when it accepts an update initiated locally. When
one replica wants to propagate its updates to another replica, it first obtains the
other replica’s timestamp vector to determine which writes it has that the receiving
replica doesn’t have. Then the replica scans its write log and sends the updates that
are missing at the receiving replica. After sending the updates, the replica sends its
timestamp vector to the receiving replica to allow the receiving replica to update its
timestamp vector. The receiving replica updates its timestamp vector by comparing
the entries in its own timestamp vector with those of the other timestamp vector
and taking the pairwise maximum for each entry.
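The timestamp-vector bookkeeping can be sketched as follows; the log-entry format `(origin, timestamp, op)` and all names are hypothetical.

```python
class LogReplica:
    """Operation-transfer replica sketch. Each log entry is a tuple
    (origin_replica_id, timestamp, operation); names are hypothetical."""

    def __init__(self, rid, n):
        self.rid = rid
        self.clock = 0                # per-replica increasing counter
        self.log = []                 # write log of (origin, ts, op)
        self.tv = [0] * n             # tv[j]: all ops from j with ts <= tv[j] received

    def local_write(self, op):
        """Accept a local update: stamp it and advance our own TV entry."""
        self.clock += 1
        self.log.append((self.rid, self.clock, op))
        self.tv[self.rid] = self.clock

def propagate(sender, receiver):
    """Send every logged update the receiver's timestamp vector says it
    lacks, then merge the vectors by pairwise maximum."""
    missing = [e for e in sender.log if e[1] > receiver.tv[e[0]]]
    receiver.log.extend(missing)
    receiver.tv = [max(s, r) for s, r in zip(sender.tv, receiver.tv)]
```

One anti-entropy exchange in each direction leaves both replicas with the same set of updates and identical timestamp vectors.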
2.3.2 Update Scheduling
Even if all the updates have propagated to all the replicas, each replica may have
received them in a different order. Update scheduling deals with ways of ordering
updates in a write log so that when each replica executes its own sequence of updates,
the result is that all the replicas are identical. A simple way of achieving this goal
is to totally order the updates at every replica (i.e., the order of updates is identical
at each replica). However, this is not always required for replicas to converge. For
example, if updates only contain increment and decrement operations that commute,
then two replicas will converge to the same value as long as they both have the same
set of updates.
Update scheduling is applied when either an update is initiated at a local
replica or when a replica receives a set of updates through update propagation. The
write log may need to be undone and redone when update scheduling is performed.
Update scheduling can be categorized as syntactic or semantic scheduling [7].
Syntactic Scheduling
In syntactic scheduling, updates in a write log are ordered according to the times-
tamp values of the updates. This results in a total ordering of the updates at every
replica and ensures that the replicas are identical if they have received the same
set of updates. If physical clocks are used for the timestamp values, the replicas’
clocks should be loosely synchronized so that new updates do not get ordered before
older updates in the write log (that have a “newer” timestamp due to clocks being
excessively out of sync). Using Lamport logical clocks [6] for timestamps ensures
that causal dependencies based on Lamport's happened-before relationship are
maintained across all logs.
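A syntactic schedule can be derived by sorting on `(timestamp, origin replica id)`. Breaking timestamp ties with the replica id is an assumption added here to make the order total, since the text above only specifies ordering by timestamp value.

```python
def syntactic_schedule(log):
    """Order updates by (timestamp, origin replica id). The replica id
    breaks ties, so every replica derives the identical total order from
    the same set of updates. Entries are (origin, ts, op), a hypothetical
    format used for illustration."""
    return sorted(log, key=lambda entry: (entry[1], entry[0]))
```

Because the key is deterministic, two replicas holding the same set of updates, in any arrival order, produce the same schedule.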
Semantic Scheduling
Semantic scheduling takes into account the semantics of the updates themselves to
determine an ordering for the updates in the write log. This may reduce update
conflicts and the amount of undoing and redoing in the write log [7]. For instance,
suppose a write log contains two concurrent updates to a file system in which one
update modifies a file in a directory while another update deletes the parent direc-
tory. Logically, a file cannot be modified after its parent directory has been deleted.
Semantic scheduling could ensure that the update which modifies the file is ordered
before the update that deletes the parent directory, regardless of any timestamps
that may be associated with those updates. As another example, consider updates
that have only commutative increment or decrement operations. Using semantic
scheduling instead of syntactic scheduling would avoid any undoing of the write log
since the replicas would be mutually consistent once all the updates in the system
have propagated to all the replicas.
Update Commitment
Update commitment is a supplemental mechanism in the update scheduling process
that finalizes the position of updates in a write log. Even though total update
propagation combined with syntactic or semantic scheduling will ensure eventual
consistency, update commitment is useful for stabilizing tentative data and for log
management.
When updates are initially inserted into the write log, they are in a tentative
state because the effect of the updates may change if updates propagated from other
replicas are inserted in an earlier position in the log. For instance, in a meeting-room
scheduling application, consider the case when a user reserves a room at one replica
and another user concurrently makes a conflicting reservation at another replica.
Let the first reservation be R1, and the conflicting reservation be RC. Suppose RC
is propagated to the first replica and inserted into the write log at a position earlier
than R1’s position. This requires R1 to be undone before RC is inserted into the
log. When R1 is redone, a conflict results. Depending on the conflict detection and
resolution mechanism being used, the effects of update R1 may be nullified, and
only reservation RC holds. If R1 had been committed, then its position in the write
log is fixed and no updates received later through propagation would be inserted
before it in the write log. Hence the effect of R1 on the replica at the time it is
executed in the log will be certain. Within an update log, the proportion of committed
updates gives an indication of the stability of the replicated data. The greater the
percentage of committed updates, the greater the degree of data stability as there
would be fewer updates in the log whose effects when executed on the replica could
potentially change.
Also, for practical reasons, the number of updates in a write log cannot
accumulate indefinitely. Since committed updates do not need to be undone and
redone anymore, they may be permanently removed from the write log to save
space. There are various mechanisms for committing updates. One mechanism is to
use a primary replica to commit updates [9]. The primary replica simply fixes the
position of updates in its write log and propagates the commitment information to
other replicas. Tentative updates are ordered after the committed updates in the
log.
2.3.3 Dealing with Conflicts
Conflict detection and resolution in operation-transfer systems will be discussed in
Chapter 5.
2.4 Discussion
As mentioned by Saito and Shapiro [7], the choice of whether to use state-transfer
or log-transfer really depends on the constraints and goals of the system being
implemented as each has its own advantages over the other. One advantage of
state-transfer over log-transfer is that there is no additional overhead for storing
an update log. However, in state-transfer, the time to propagate an entire replica
increases linearly with the size of the replica, while propagation time is independent
of replica size using log-transfer. In log-transfer, though, the size of the log can grow
quickly and consume space if updates occur frequently. It is issues like these that
the designers of optimistic replication systems need to carefully consider.
Chapter 3
Bounded Inconsistency
One of the problems with optimistic replication is that there are no bounds on
how far apart any two replicas may be at any moment. Even though eventual
consistency ensures that replicas eventually converge to a common state, at any
given moment, two replicas may have diverged far apart from each other. Without
any bounds on the degree of divergence, some applications cannot be practically
employed. For instance, consider an airline reservation system that has two replicas
where each replica can accept reservations [13]. If there were no guarantees on how
far the two replicas may diverge from each other, then in the worst case, each seat can
become double-booked. In their paper [13], Yu and Vahdat explore different ways
of bounding inconsistency among replicas in an optimistic replication environment,
and I present some of their work below.
3.1 Continuum between Strong and Optimistic Consis-
tency
Yu and Vahdat [13] note that there is a continuum between strong consistency and
optimistic consistency. Strong consistency is the same kind of consistency provided
by pessimistic replication in which users never observe any inconsistencies in the
replicated data. Optimistic consistency refers to the kind of consistency offered
by optimistic replication systems in which replicas eventually converge but may be
highly divergent from one another at any given moment. Along the continuum
the maximum distance between any replica and a “consistent” image [13] (that
represents a replica that has had all the writes in the system applied to it) varies
from zero to infinity. A distance of zero corresponds to strong consistency and a
distance of infinity corresponds to optimistic consistency.
Moving along the continuum involves a trade-off between consistency and
availability. Availability increases as consistency decreases because there is less prob-
ability that each write will require synchronous communication with other replicas
to ensure a certain level of mutual consistency.
Yu and Vahdat present a set of metrics that allow applications to specify
their desired level of consistency throughout the range between strong consistency
and optimistic consistency. These consistency requirements essentially translate into
bounds on the degree of divergence between replicas. These metrics are implemented
in the form of a middleware toolkit called TACT (Tunable Availability/Consistency
Tradeoff). For a given read/write request, TACT determines whether coordination
with other replicas is needed to ensure that the inconsistency bounds are not ex-
ceeded. If coordination is not needed, then the read/write is processed locally, but
if coordination with other replicas is necessary, TACT blocks the read/write request
until it is able to pull updates from other replicas and/or push updates to other
replicas in order to stay within the inconsistency bounds. The replication model
assumed by Yu and Vahdat is based on operation-transfer. Updates are considered
tentative until committed using a commitment algorithm that uses the minimum
value in its timestamp vector to serialize and commit the updates with timestamp
values less than or equal to that minimum value.
3.2 TACT
TACT includes three metrics for specifying consistency requirements: Numerical
Error, Order Error, and Staleness. Numerical error limits the difference between
the numerical weight of writes applied locally and the numerical weight of writes
in the “consistent” image. Order error bounds the number of tentative writes at a
replica. Staleness specifies the maximum real time before the most recent write in
the system must be propagated to a replica. If all three metrics are bounded at zero,
then TACT essentially enforces pessimistic replication. If there are no bounds, then
TACT essentially implements optimistic replication.
3.2.1 Numerical Error
Numerical Error bounds the numerical difference between the sum of the weighted
updates seen locally at a replica and the sum of the weighted updates in a “consis-
tent” image. Each update has a weight associated with it. For instance, if a replica
stores the balance of a bank account, then the associated weight for each update to
the balance may be the amount of the withdrawal or deposit. In this case, numerical
error bounds the discrepancy between the bank balance at a replica and the actual
bank balance. Each replica is responsible for pushing local updates to other replicas
to ensure that the numerical error bound is not violated. For replica values that are
not inherently numerical in nature, weights can be assigned to updates to denote
the importance for the update to be quickly propagated to other replicas. A heavily
weighted update is more likely to force propagation than a lightly weighted update.
Two kinds of numerical error can be specified: absolute numerical error and relative
numerical error.
The notation used for describing the algorithms for implementing numerical
error will be consistent with that presented in the paper by Yu and Vahdat [13]. Vi
denotes the value of replica i, and Vfinal denotes the replica value if all writes in the
system were applied. Note that the value of Vfinal may not necessarily be known at
any replica.
Absolute Numerical Error
Absolute numerical error is used to specify a bound on the absolute difference be-
tween Vi for any replica i and Vfinal. In other words, |Vfinal − Vi| ≤ α for some
α ≥ 0. This bounds the maximum difference between the value of a replica and the
actual value of a “consistent” image.
Yu and Vahdat’s method for implementing an absolute numerical error bound,
α, involves dividing up α among all the replicas. The division may not necessarily be
uniform, and each replica is aware of the allocated values for all the other replicas.
For instance, if the absolute numerical error for a bank balance is α = $100 and
there are ten replicas, then one possible allocation is for two replicas to be allocated
$2 each and for the other eight replicas to be allocated $12 each, totalling $100
across all replicas. Each update is associated with a numerical weight. For a replica
that stores a bank balance and processes updates for deposits and withdrawals, the
amount of the deposit or withdrawal can be the associated weight for the update.
Each replica maintains two local variables for each of the other replicas. These
variables record the sum of the positive and negative weighted updates that have
been accepted locally but not yet propagated to the other replicas. Using the bank
balance example, a deposit may have a positive weight and a withdrawal may have a
negative weight. When a replica receives a request for an update, the replica checks
whether accepting the write locally would violate the allocated absolute numerical
error values for any other replica. This can be done by checking whether either local
variable associated with the replica would have a value that exceeds the allocated
numerical error for that replica if the update is accepted locally. If any bounds are
exceeded, the replica pushes local updates from its log or the requested update
to the other replicas until it is able to accept the update. The local variables asso-
ciated with the contacted replicas are also updated. Local updates may be blocked
until the replica is able to successfully propagate the necessary updates.
This method for implementing the absolute numerical error bound is con-
servative in that one replica may propagate updates to another replica even when
the absolute numerical error bound is not violated. For instance, if a $100 absolute
numerical error bound on a bank balance is evenly divided among ten replicas, then
a deposit of $50 at one replica will cause the update to be propagated to all the other
replicas even though the largest difference between a replica’s bank balance and the
actual bank balance is only $50. However, since a replica has no local knowledge of
the unpropagated updates at other replicas, it is possible that other replicas each
have $10 worth of unpropagated deposits, in which case, accepting the $50 deposit
without propagating it to other replicas would violate the numerical error.
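The per-replica bookkeeping for absolute numerical error might look roughly like the following sketch. All names are hypothetical, the push channel is a stand-in callback, and the real TACT protocol is considerably more involved.

```python
class NumBoundedReplica:
    """Sketch of TACT-style absolute numerical error enforcement
    (hypothetical names). alloc[j] is the absolute error allocated to
    replica j; pos[j]/neg[j] track the positive and negative weight of
    updates accepted locally but not yet propagated to j."""

    def __init__(self, rid, alloc):
        self.rid = rid
        self.alloc = alloc                      # {replica_id: allocated error}
        self.pos = {j: 0.0 for j in alloc}
        self.neg = {j: 0.0 for j in alloc}
        self.value = 0.0
        self.pending = []                       # unpropagated local updates

    def must_push(self, weight):
        """Replicas whose allocation would be exceeded by accepting an
        update of this weight locally."""
        out = []
        for j in self.alloc:
            if j == self.rid:
                continue
            if self.pos[j] + max(weight, 0.0) > self.alloc[j] or \
               -(self.neg[j] + min(weight, 0.0)) > self.alloc[j]:
                out.append(j)
        return out

    def write(self, weight, push):
        """Accept a weighted update, first pushing pending updates to any
        replica whose bound would otherwise be violated. push(j, ops)
        stands in for the real propagation channel."""
        for j in self.must_push(weight):
            push(j, self.pending)
            self.pos[j] = self.neg[j] = 0.0     # j is now caught up
        self.value += weight
        self.pending.append(weight)
        for j in self.alloc:
            if j == self.rid:
                continue
            if weight >= 0:
                self.pos[j] += weight
            else:
                self.neg[j] += weight
```

Note the conservatism described above: a push is triggered from local knowledge alone, even if the global bound would not actually have been violated.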
Relative Numerical Error
Relative numerical error bounds the absolute numerical error at a replica (i.e.,
|Vfinal − Vi|) in relation to |Vfinal|. Each replica can specify its own relative error,
γi = |Vfinal − Vi| / |Vfinal|, and it is assumed that each replica knows the relative numerical
error bound of all the other replicas. Before a replica accepts an update, it may need
to push updates from its log or the requested update to other replicas to ensure that
the other replicas’ relative numerical error bounds are not violated upon accepting
a local update. Specifically, each replica must ensure that for all other replicas with
a specified relative numerical error bound, say γj , that |Vfinal − Vj | ≤ γj × |Vfinal|.
If γj × |Vfinal| is considered as an absolute numerical error bound, then it is con-
ceivable to apply the method used to bound absolute numerical error in order to
determine whether any relative error bounds are violated. However, the problem is
that replicas will generally not know |Vfinal|.
The method proposed by Yu and Vahdat to enforce the relative numerical
error bound is to convert the relative numerical error bound into an absolute numer-
ical error bound by using a conservative estimate of |Vfinal| based on information
local to each replica. To show how this method works, suppose that all the replicas
are initially within their own relative numerical error bounds. Each replica can lo-
cally compute its estimate of |Vfinal| using Vi and γi. Specifically, at each replica,
−|Vfinal − Vi| ≥ −γi × |Vfinal|. Also, |Vfinal| − |Vi| ≥ −|Vfinal − Vi|. Combining and
rearranging these inequalities results in |Vfinal| ≥ |Vi| / (1 + γi), which gives a lower bound
for |Vfinal|. Using this lower bound for |Vfinal| in the expression
|Vfinal − Vj| ≤ γj × |Vfinal| generates the expression |Vfinal − Vj| ≤ γj × |Vi| / (1 + γi). The
expression γj × |Vi| / (1 + γi) can be treated as an allocated absolute numerical error bound
for replica j. Hence, the method used to implement absolute numerical error can
be applied to determine whether a replica needs to push updates to another replica
before accepting an update.
Since a lower bound on |Vfinal| is used as an estimate for the value of |Vfinal|,
the method for enforcing relative numerical error is conservative and may push
updates even when the relative numerical error bounds are not actually violated.
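The conversion from a relative bound to a conservative absolute bound follows directly from the derivation; a sketch with hypothetical variable names:

```python
def relative_to_absolute(v_i, gamma_i, gamma_j):
    """Estimate replica j's relative numerical error bound as an absolute
    bound, using only replica i's local value. |Vfinal| is conservatively
    under-estimated as |Vi| / (1 + gamma_i), so the returned bound never
    exceeds gamma_j * |Vfinal| when replica i is within its own bound."""
    v_final_lower_bound = abs(v_i) / (1.0 + gamma_i)
    return gamma_j * v_final_lower_bound
```

The result can then be fed into the same machinery used for absolute numerical error.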
3.2.2 Order Error
Order error bounds the number of tentative writes at a replica. Each replica can
locally bound order error by checking if the number of tentative updates in its
write log would exceed the specified order error limit if it accepts a local update.
Each replica can specify its own order error limit independent of other replicas. If
the order error is exceeded, the replica performs a write commitment algorithm to
commit updates in its write log to reduce the number of tentative writes. One way
of committing updates, as suggested by Yu and Vahdat, is to find the smallest value
in the replica’s timestamp vector and to use that value to denote the set of the
updates that could be committed. All updates with timestamps less than or equal
to the minimum value in the timestamp vector can be committed.
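The commitment rule can be sketched as follows (hypothetical names; log entries are `(origin, timestamp, op)` as elsewhere in this chapter):

```python
def committable(tv, log):
    """Every update whose timestamp is <= the minimum entry of the
    replica's timestamp vector has been seen by all replicas and may be
    committed. Log entries are (origin, ts, op); names are hypothetical."""
    cutoff = min(tv)
    return [e for e in log if e[1] <= cutoff]

def enforce_order_error(tv, log, limit):
    """If the number of tentative writes exceeds the order error limit,
    run commitment and return how many writes remain tentative."""
    tentative = len(log)
    if tentative <= limit:
        return tentative
    committed = committable(tv, log)
    return tentative - len(committed)
```

This sketch only counts tentative writes; a full implementation would also fix the committed prefix of the log and garbage-collect it.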
3.2.3 Staleness
Staleness specifies the maximum real time before the most recent write in the sys-
tem must be propagated to a replica. To bound staleness, each replica maintains
a real-time vector, RT, that denotes the real time of the latest update received
from another replica. If the staleness bound is T , then each replica checks if
currentTime − RT[i] < T for each server i. If a replica discovers that the staleness
bound is violated, it must pull updates from those replicas that violate the
staleness bound. The real-time clocks at the replicas should be loosely synchronized.
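The staleness check reduces to a scan of the real-time vector; a sketch with hypothetical names:

```python
def stale_replicas(rt_vector, bound, now):
    """Return the replicas whose latest received update is older than the
    staleness bound T; updates must be pulled from these replicas before
    proceeding. rt_vector[i] holds the real time of the newest update
    received from replica i. Names are hypothetical."""
    return [i for i, t in enumerate(rt_vector) if now - t >= bound]
```

An empty result means the check currentTime − RT[i] < T holds for every server i and no pull is required.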
3.3 Discussion
The three metrics are not expected to be exported to end users. Instead, end users
should deal with semantically meaningful consistency bounds, and it is up to the
application programmer to realize those bounds using the metrics provided by TACT
[12].
One potential problem with using the commitment protocol suggested by Yu
and Vahdat is that an inactive replica could prevent other replicas from committing
updates in their write log, which in turn would prevent a replica from accepting
additional writes if the order error limit was already reached. The staleness metric
could conceivably be utilized to avoid this problem.
Availability to replicas may be adversely affected because local reads and
writes may be blocked until the local replica is able to successfully push updates to or
pull updates from other replicas in order to adhere to the inconsistency bounds [13].
As mentioned, numerical error actively pushes local updates to other replicas,
while order error and staleness pull updates from other replicas. If a replica is
unavailable, perhaps due to node or network failure, it may prevent other replicas
from accepting local writes, as the unavailable replica would not be able to receive
or send writes as required.
Chapter 4
Client-Centric Consistency
In an optimistic replication system, a user who accesses a single replica will always
see consistent data [8]. If the user switches from one replica to another, he may see
inconsistent data [8]. Even if bounded inconsistency is used (e.g., TACT), the user
may still observe inconsistencies, as long as the extreme case of strong consistency
is not being enforced. For example, a user may issue a write to a replica and then
try to read what was just written. If the read is processed by a different replica and
the write has not yet propagated to that replica, then the user will read stale data.
From a practical perspective, this problem is relevant to mobile computing
environments, where clients may connect to the nearest replica for access to data.
The form of consistency that addresses this problem is called client-centric
consistency [8]. Terry et al. present a particular solution to this problem,
session guarantees [10], and I present the main aspects of their solution.
Terry defines a session to be a “sequence of read and write operations performed
during the execution of an application” [10]. Consistency guarantees associated
with a session give assurances to the application that the replicas that are
accessed are consistent with respect to the operations that have previously been
requested during the session. There are four guarantees, each of which can be
independently applied, in any combination, to a session:
• Read Your Writes
• Monotonic Reads
• Writes Follow Reads
• Monotonic Writes
Conceptually, clients and applications can have multiple sessions simultaneously.
Reads and writes within a session may access different replicas, but from a
consistency point of view, it appears to the application as if a single shared
replica is being accessed.
In providing such guarantees, availability is traded off for consistency. Since
the underlying replication model is assumed to be optimistic, it is possible, perhaps
due to node or network failures, that none of the available replicas contains the
required updates to meet the consistency guarantees associated with a session, in
which case access to data will be denied.
We define DependentWrites(R) for a read R to be the smallest set of writes
at the replica processing R such that executing R on that set returns the same result
as processing R using the entire replica. This is essentially the same as the definition
of RelevantWrites in Terry’s paper [10].
4.1 Read Your Writes
The Read Your Writes guarantee ensures that any reads made during the session are
processed at replicas that contain all preceding writes made during the same session.
To see the motivation for this guarantee, consider the case when a user updates a
database and then tries to read the data that was updated. The Read Your Writes
guarantee ensures that if the update and subsequent read request were made during
the same session, then the read request would be performed at a replica that already
includes the update. Without this guarantee, it is possible for the read to return
stale data from a replica that has not received the update, possibly leaving the user
confused.
Since session guarantees only give the illusion of a single shared replica, it is
possible for reads and writes in a given session to be interleaved with writes outside
the session. Hence, in the database example, the read may return more up-to-date
information than a previous update made during the session.
4.2 Monotonic Reads
The Monotonic Reads guarantee ensures that reads in a session are only processed
by replicas that contain all writes seen by previous reads made in the same session.
This means that reads will return data that is at least as recent as what has been
previously read during the session. To be precise, if a session guarantees Monotonic
Reads, then a read request in the session can only be processed at a replica which
contains all the writes in DependentWrites(R), for every preceding read request, R,
made in the session.
As an example, consider a replicated database. Suppose the user wishes to
read a data value that had been previously read during the same session. With the
Monotonic Reads guarantee, the user is ensured that the value returned is at least
as recent as what was initially read. Without the Monotonic Reads guarantee, it is
possible that stale data may be returned if an out-of-date replica is accessed for the
latter read request.
4.3 Writes Follow Reads
The Writes Follow Reads guarantee ensures that writes are applied only to replicas
that contain all the writes seen by previous reads during the session. Specifically,
before a write can be accepted at a replica, the replica must contain
DependentWrites(R) for each read R that preceded the write during the session.
As an example of how this guarantee can be used, consider a newsgroup
service [10]. Upon reading a posting, a user may post a reply to it. Using Writes
Follow Reads ensures that users of the service will see the reply only if the original
posting is also available at the server.
4.4 Monotonic Writes
The Monotonic Writes guarantee ensures that a write is accepted at a server only if
all preceding writes issued during the session are already at the server. An example
of where this might be useful would be when a programmer updates a software
library and subsequently updates the application to use the updated library [10].
Using the Monotonic Writes guarantee ensures that the write that updated the
application will only be applied to a server that contains the write for updating
the library, preventing the situation where a server has the updated version of the
application but an outdated version of the library.
4.5 Implementation
Terry suggests a practical implementation of session guarantees based on
timestamp vectors (which Terry calls version vectors). Each server maintains a
timestamp vector that indicates the writes it has in its log. Timestamp values can
be based on any monotonically increasing clock.
On the client side, each session is associated with one or two timestamp
vectors, depending on the session guarantees chosen for the session. One timestamp
vector, the write-set [10], gives an indication of the writes made by the client
and is used in any combination of the four session guarantees except when
only Monotonic Reads is used. The other timestamp vector, the read-set [10], is
associated with the reads requested by the client and is used in any combination
of the four session guarantees except when only the Monotonic Writes guarantee
is used. Upon reading from a server, the server's timestamp vector is merged with
the client's read-set: each entry in the read-set is set to the maximum of the
corresponding entries in the two vectors. Upon writing to a server, the client
receives a timestamp for the write and uses it to update the corresponding entry
in its write-set timestamp vector. Finally, one timestamp vector dominates another
timestamp vector when all of its entries are at least as large as the corresponding
entries in the other vector.
The four guarantees would be implemented as follows using the read-set and
write-set of a session:
Monotonic Reads
When accessing a server to process a read request, a check is made as to whether
the server’s timestamp vector dominates the session’s read-set.
Read Your Writes
When accessing a server to process a read request, a check is made as to whether
the server’s timestamp vector dominates the session’s write-set.
Monotonic Writes
When accessing a server to process a write request, a check is made as to whether
the server’s timestamp vector dominates the session’s write-set.
Writes Follow Reads
When accessing a server to process a write request, a check is made as to whether
the server’s timestamp vector dominates the session’s read-set.
One simplification in the suggested implementation is that DependentWrites(R)
is conservatively estimated to be the set of all writes applied at the replica.
Without this simplification, computing DependentWrites(R) for a read request
may be very expensive, especially for a complex query [10].
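The bookkeeping above can be sketched as follows. This is a minimal illustration, not Terry's actual implementation: timestamp vectors are represented as dictionaries mapping server identifiers to timestamps, and all names (dominates, merge_read_set, can_read, can_write) are assumptions made here for clarity.

```python
def dominates(v1, v2):
    """v1 dominates v2 when every entry of v1 is at least the matching entry of v2."""
    return all(v1.get(s, 0) >= v2.get(s, 0) for s in set(v1) | set(v2))

def merge_read_set(read_set, server_vector):
    """After a read, fold the server's vector into the session's read-set."""
    for s, ts in server_vector.items():
        read_set[s] = max(read_set.get(s, 0), ts)

def can_read(server_vector, read_set, write_set, guarantees):
    """A server may process a read only if its vector dominates the required sets."""
    if "monotonic_reads" in guarantees and not dominates(server_vector, read_set):
        return False
    if "read_your_writes" in guarantees and not dominates(server_vector, write_set):
        return False
    return True

def can_write(server_vector, read_set, write_set, guarantees):
    """A server may accept a write only if its vector dominates the required sets."""
    if "monotonic_writes" in guarantees and not dominates(server_vector, write_set):
        return False
    if "writes_follow_reads" in guarantees and not dominates(server_vector, read_set):
        return False
    return True
```

A client library would call can_read or can_write before directing an operation to a server, and merge_read_set (or a write-set update) afterwards; if no reachable server passes the check, access is denied, as discussed below.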
4.6 Discussion
There appears to have been no further research in this area since session guarantees
were introduced by Terry, which may be an indication that the solution is
satisfactory.
I believe that the trade-off between availability and consistency need not
be as simple as denying access when the session guarantees cannot be met by a
replica. When none of the available replicas can satisfy a given session guarantee,
perhaps because the only replica that would have allowed access has crashed, the
user may prefer access to a stale replica over no access at all. For example, for a
replicated email inbox, allowing access to a stale version of the inbox may be
preferable to denying access to all the inboxes because they do not satisfy the
constraints of some session guarantee. This is really a matter of policy that can
be implemented on top of the session guarantees, and one that, I believe, a system
designer should be aware of.
Chapter 5
Conflict Resolution
Conflict resolution is a major issue in optimistic replication systems. By definition,
optimistic replication systems allow concurrent updates to different replicas, and
the concurrent updates may cause the replicas to be in a mutually conflicting state.
In this chapter, I will introduce the general ideas behind conflict detection and
resolution and present some mechanisms that have been devised for dealing with
conflicts.
5.1 Conflict Detection
Before conflicts can be resolved, they must first be detected. Conflicts can be
detected either syntactically or semantically [7].
5.1.1 Syntactic Conflict Detection
Given any two replicas, a syntactic conflict exists if concurrent updates have been
applied to the replicas. Hence, detection of a syntactic conflict simply means
detecting the presence of concurrent updates. However, concurrent updates do not
necessarily mean that there is a semantic conflict as far as the application is
concerned. For instance, concurrent updates that reserve different meeting rooms
in a replicated calendar program are not considered an application-level conflict.
Semantic conflicts are considered later in this chapter.
In state-transfer systems, a syntactic conflict can be reliably detected by
associating version vectors with each replica [7]. Each replica has a version vector
with one entry for each replica. A version vector indicates the set of writes in a
replica. When an update is applied locally to a replica, it increments the timestamp
value in its own version vector entry. During update propagation, when one replica
overwrites another replica, the version vector of the source replica also overwrites
the version vector of the destination replica. A version vector V1 is said to dominate
version vector V2 if every entry in V1 is equal to or larger than the corresponding
entry in V2. A syntactic conflict exists between two replicas when neither version
vector dominates the other. In operation-transfer systems, a syntactic conflict can
be detected by comparing the timestamp vectors of the replicas. If neither vector
dominates the other, then a syntactic conflict exists.
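This dominance test can be sketched concretely. The fragment below is illustrative only; version vectors are represented as dictionaries mapping replica identifiers to update counters, a representation assumed here rather than taken from any particular system.

```python
def dominates(v1, v2):
    # v1 dominates v2 when every entry of v1 is >= the matching entry of v2
    return all(v1.get(r, 0) >= v2.get(r, 0) for r in set(v1) | set(v2))

def syntactic_conflict(v1, v2):
    # concurrent updates exist exactly when neither vector dominates the other
    return not dominates(v1, v2) and not dominates(v2, v1)

# Replicas A and B each applied one local update without seeing the other's:
va = {"A": 2, "B": 1}
vb = {"A": 1, "B": 2}
assert syntactic_conflict(va, vb)                    # concurrent: conflict
assert not syntactic_conflict(va, {"A": 1, "B": 1})  # va dominates: no conflict
```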
5.1.2 Semantic Conflict Detection
In semantic conflict detection, two concurrent updates conflict if they would violate
the semantics of the application had the updates been applied to the same replica.
For example, using semantic conflict detection in a replicated airline reservation
system, two concurrent reservations for the same seat would be considered a semantic
conflict. However, if the concurrent reservations were for different seats, then there
would not be a semantic conflict. In state-transfer systems, detection of semantic
conflicts would require the replication system to have semantic knowledge of the
contents of the replicas to determine whether a semantic conflict exists. In
log-transfer systems, a possible way to detect semantic conflicts is for each update
to check whether a semantic conflict would occur if the update were executed
against the current state of the replica. Bayou [9] uses this approach.
5.2 Conflict Resolution
Once a conflict is detected, it needs to be resolved. Resolution of conflicts inherently
requires knowledge of the application semantics. For instance, in a file system,
conflicting updates to file replicas cannot be resolved by the file system without
knowledge of the semantics of the data in the file. Conceptually, when conflicts in
replicas are resolved, the conflicting replicas are overwritten with a new value that
is the result of reconciling the conflicting replicas.
Conflicts can be resolved manually or automatically. Manual resolution requires
user intervention, while automatic resolution does not. Using the example of
the meeting-scheduling application, if a semantic conflict is detected for conflicting
reservations for a meeting room, manual resolution may simply notify the user of
the conflict and let the user decide how to resolve it. With automatic resolution,
the system has the means to attempt conflict resolution without user intervention,
perhaps because the system understands the semantics of the replica contents or
it provides mechanisms for application-specific resolution. Automatic resolution is
preferred in optimistic replication systems since updates may propagate
epidemically and the user may not be available at the time a conflict is detected. Coda
and Bayou are two systems that support automatic conflict resolution through the
provision of mechanisms that allow applications to specify how conflicts are to be
resolved.
Coda is a distributed file system [4]. Conflicting updates to files are possible
due to network partitions or disconnected client operation. Coda uses version
vectors to syntactically detect file conflicts. To facilitate automatic conflict
resolution, Coda allows users to install application-specific resolvers (ASRs),
which are programs that can be invoked by Coda to resolve file conflicts [5].
Upon detecting a file conflict, Coda locates the associated ASR for the file and
executes it.
Bayou is a weakly consistent replicated database storage system [9]. It uses
log-transfer for update propagation and ensures eventual consistency by totally
ordering the update logs at all replicas. Bayou supports application-specific
conflict detection and resolution through two mechanisms: dependency checks and
merge procedures. Each update is associated with an application-specified
dependency check and merge procedure. The dependency check is used to determine
whether the update semantically conflicts with any previous update in the log
(i.e., the current state of the replica). The dependency check contains a query
and an expected result. When the query is run on the replica and the value matches
the expected result, the update is considered not to conflict with any previous
updates, and the update is then executed. However, if the query result does not
match the expected result, then a semantic conflict is detected, and the merge
procedure is run. The merge procedure is application-specific and is used to
resolve the conflict. Since Bayou uses log-transfer, updates in the log may be
undone and redone numerous times during update scheduling; a dependency check is
run every time its associated update is executed. As a concrete example, consider
a meeting-scheduling application. An update to reserve a meeting room may include
as its dependency check a query on whether the room is available at a given time,
while the associated merge procedure may try to reserve another room.
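The meeting-room example can be sketched as follows. This is a toy illustration, not Bayou's actual interface: the database is a plain dictionary mapping (room, time) pairs to owners, and the single reserve function combines the dependency check, the update, and the merge procedure, whose names and shapes are all assumptions.

```python
def reserve(db, user, room, time, alternates=()):
    # Dependency check: the query (is the slot free?) with expected result None.
    if db.get((room, time)) is None:
        db[(room, time)] = user      # expected result matched: apply the update
        return room
    # Expected result not met: a semantic conflict was detected, so run the
    # merge procedure, which tries the same time slot in each alternate room.
    for alt in alternates:
        if db.get((alt, time)) is None:
            db[(alt, time)] = user
            return alt
    return None                      # conflict could not be resolved automatically

db = {}
assert reserve(db, "alice", "r1", "9am") == "r1"
assert reserve(db, "bob", "r1", "9am", alternates=("r2",)) == "r2"
```

In Bayou proper, the check and merge procedure travel with the update in the log and are re-executed whenever scheduling undoes and redoes the update, which this standalone sketch does not capture.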
5.3 Discussion
Bayou’s dependency check is a flexible application-independent mechanism. In
fact, one can simply ignore conflicts in Bayou by using null dependency checks,
in which case the updates simply modify the underlying database without checking
for application-specific conflicts [11]. The dependency check mechanism has also
been adopted in other recent systems. Oceanstore, a global utility infrastructure
intended to provide persistent, highly available storage, has adopted the
dependency check for detecting file conflicts [3]. IceCube [2] is a log
reconciliation system that uses application-specified dependency checks (called
preconditions in IceCube) to detect conflicts semantically.
Chapter 6
Conclusion
Consistency in optimistic replication is a major issue. This essay introduced four
concepts related to consistency: eventual consistency, bounded inconsistency, client-
centric consistency, and conflict resolution.
Eventual consistency is one of the fundamental requirements in optimistic
replication. It refers to the requirement that all replicas eventually be mutually
consistent despite the presence of concurrent updates at different replicas. The
basic idea behind achieving eventual consistency is that updates need to be
propagated to all replicas and ordered in a way that leaves the replicas mutually
consistent. As mentioned in this essay, updates can be propagated either implicitly
using a state-transfer mechanism or explicitly with a log-transfer mechanism. Using
log-transfer for propagation requires some form of update scheduling, either
syntactic or semantic. Syntactic scheduling involves only comparing the timestamps
of the updates, while semantic scheduling determines an order based on the
semantics of the updates.
One of the problems in optimistic replication is that the contents of replicas
may diverge significantly from one another. This may lead to a high rate of
application-level update conflicts, which may be unacceptable for some applications,
such as an airline reservation system. Bounded inconsistency refers to bounding the
degree of divergence among replicas. Yu and Vahdat developed the TACT toolkit
that allows applications to specify inconsistency bounds along three dimensions:
numerical error, order error, and staleness.
Client-centric consistency deals with the problem that users may see inconsistent
data with respect to their own sequence of reads and writes when the sequence
involves accesses to different replicas. A solution proposed by Terry [10] is
session guarantees. A session is an abstraction for the sequence of reads and
writes a user or application performs. Four independent consistency guarantees may
be associated with a session: Read Your Writes, Monotonic Reads, Writes Follow
Reads, and Monotonic Writes. An efficient implementation involves associating
timestamp vectors with each replica and with the read and write sets of each
session.
One of the key differences between optimistic replication and pessimistic
replication is that concurrent updates are permitted in optimistic replication while
they are prohibited in pessimistic replication. Concurrent updates may lead to
conflicts in the contents of the replicas in terms of application semantics. Detection
of conflicts can be done syntactically or semantically. Syntactic conflict detection
simply detects potential application-level conflicts by the presence of concurrent
updates. Normally, this kind of conflict detection is the only option in systems where
the application semantics of the replicas are unknown to the replication system.
Semantic conflict detection only detects concurrent updates that violate application
semantics.
Resolving conflicts inherently requires knowledge of the semantics of the
replica contents. Although manual resolution requiring user intervention is
possible, it is undesirable since conflicts may be detected long after the user is no longer
using the system. Some recent systems have provided mechanisms for supporting
application-specific conflict resolution. One such system is Bayou, which provides
applications with the means to detect and resolve application-level conflicts
through its dependency checks and merge procedures.
I believe that Bayou’s dependency check mechanism represents a significant
research result in optimistic replication. This is supported by the fact that two
systems that were developed after Bayou, Oceanstore and IceCube, both adopted
features from Bayou’s novel conflict detection mechanism. The flexibility of the
mechanism and its application-independent interface, which decouples the replica-
tion system from the application semantics, makes it easy to be adopted in other
replication systems. One possible concern with using dependency checks is the larger
update log entries and their impact on propagation delay and storage space at the
replicas.
The commitment protocols used in optimistic replication systems seem to
lack robustness. For instance, Bayou’s commitment protocol uses a primary replica
to commit updates. However, the primary replica becomes a single point of failure.
Also, in TACT, the use of the minimum timestamp in the timestamp vector for
purposes of update commitment is ineffective if there are inactive replicas. As I
alluded to in Chapter 3, a possible way around this problem is for a replica to
contact all other replicas. However, this is clearly not scalable and may not even be
possible when network connectivity is intermittent. I believe that there is potential
for further research to make update commitment more robust.
Bibliography
[1] D. Stott Parker, Jr., Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. Detection of mutual inconsistency in distributed systems. IEEE Transactions on Software Engineering, Vol. 9, No. 3, pp. 240-247, May 1983.
[2] A. Kermarrec, A. Rowstron, M. Shapiro, and P. Druschel. The IceCube approach to the reconciliation of divergent replicas. Proceedings of the Twentieth ACM Symposium on Principles of Distributed Computing, August 2001.
[3] John Kubiatowicz, David Bindel, Yan Chen, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Christopher Wells, and Ben Zhao. OceanStore: An architecture for global-scale persistent storage. Proceedings of ACM ASPLOS, November 2000.
[4] Puneet Kumar. Mitigating the Effects of Optimistic Replication in a Distributed File System. PhD thesis, Carnegie Mellon University, December 1994.
[5] Puneet Kumar and M. Satyanarayanan. Flexible and safe resolution of file conflicts. USENIX Winter Technical Conference, pp. 95-106, 1995.
[6] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, Vol. 21, No. 7, 1978.
[7] Y. Saito and M. Shapiro. Replication: Optimistic approaches. Hewlett-Packard Technical Report HPL-2002-33, March 2002.
[8] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.
[9] D.B. Terry, M.M. Theimer, K. Petersen, A.J. Demers, M.J. Spreitzer, and C.H. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. ACM Symposium on Operating Systems Principles, pp. 172-183, December 1995.
[10] D.B. Terry, M.M. Theimer, K. Petersen, A.J. Demers, M.J. Spreitzer, and B.B. Welch. Session guarantees for weakly consistent replicated data. Proceedings of the Third International Conference on Parallel and Distributed Information Systems, pp. 140-149, September 1994.
[11] D.B. Terry, M.M. Theimer, K. Petersen, and M.J. Spreitzer. The case for non-transparent replication: Examples from Bayou. IEEE Data Engineering Bulletin, Vol. 21, No. 4, pp. 12-20, December 1998.
[12] Haifeng Yu and Amin Vahdat. Building replicated internet services using TACT: A toolkit for tunable availability and consistency tradeoffs. The Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems, October 2000.
[13] Haifeng Yu and Amin Vahdat. Design and evaluation of a continuous consistency model for replicated services. Fourth Symposium on Operating Systems Design and Implementation, 2000.