
Orbe: Scalable Causal Consistency Using Dependency Matrices and Physical Clocks

Jiaqing Du (EPFL), Sameh Elnikety (Microsoft Research), Amitabha Roy (EPFL), Willy Zwaenepoel (EPFL)

Abstract

We propose two protocols that provide scalable causal consistency for both partitioned and replicated data stores using dependency matrices (DM) and physical clocks. The DM protocol supports basic read and update operations and uses two-dimensional dependency matrices to track dependencies in a client session. It utilizes the transitivity of causality and sparse matrix encoding to keep dependency metadata small and bounded. The DM-Clock protocol extends the DM protocol to support read-only transactions using loosely synchronized physical clocks.

We implement the two protocols in Orbe, a distributed key-value store, and evaluate them experimentally. Orbe scales out well, incurs relatively small overhead over an eventually consistent key-value store, and outperforms an existing system that uses explicit dependency tracking to provide scalable causal consistency.

1 Introduction

Distributed data stores are a critical infrastructure component of many online services. Choosing a consistency model for such data stores is difficult. The CAP theorem [7, 10] shows that among Consistency, Availability, and (network) Partition-tolerance, a replicated system can only have two properties out of the three.

A strong consistency model, such as linearizability [11] and sequential consistency [15], does not allow high availability under network partitions. In contrast, eventual consistency [24], a weak model, provides high availability and partition-tolerance, as well as low update latency. It guarantees that replicas eventually converge to the same state provided no updates take place for a long time. However, it does not guarantee any order on applying replicated updates. Causal consistency [2] is weaker than sequential consistency but stronger than eventual consistency. It guarantees that replicated updates are applied at each replica in an order that preserves causality [2, 14] while providing availability under network partitions. Furthermore, client operations have low latency because they are executed at a local replica and do not require coordination with other replicas.

Copyright © 2013 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SOCC'13, October 01-03, 2013, Santa Clara, CA, USA. Copyright 2013 ACM 978-1-4503-2428-1/13/10 $15.00. http://dx.doi.org/10.1145/2523616.2523628

The problem addressed in this paper is providing a scalable and efficient implementation of causal consistency for both partitioned and replicated data stores. Most existing causally consistent systems [13, 19, 22] adopt variants of version vectors [2, 21], which are designed for purely replicated data stores. Version vectors do not scale when partitioning is added to support a data set that is too large to fit on a single server. They still view all partitions as one logical replica and require a single serialization point across partitions for replication, which limits the replication throughput [17]. A scalable solution should allow replicas of different partitions to exchange updates in parallel without serializing them at a centralized component.

COPS [17] identifies this problem and provides a solution that explicitly tracks causal dependencies at the client side. A client stores every accessed item as dependency metadata and associates this metadata with each update operation issued to the data store. When an update is propagated from one replica to another for replication, it carries the dependency metadata. The update is applied at the remote replica only when all its dependencies are satisfied at that replica. COPS provides good scalability. However, tracking every accessed item explicitly can lead to large dependency metadata, which increases storage and communication overhead and affects throughput. Although COPS employs a number of techniques to reduce the size of dependency metadata, it does not fundamentally solve the problem. When supporting causally consistent read-only transactions, the dependency metadata overhead is still high under many workloads.

In this paper, we present two protocols and one optimization that provide a scalable and efficient implementation of causal consistency for both partitioned and replicated data stores.

The first protocol uses two-dimensional dependency matrices (DMs) to compactly track dependencies at the client side. We call it the DM protocol. This protocol supports basic read and update operations. It associates with each update the dependency matrix of its client session. Each element in a dependency matrix is a scalar value that represents all dependencies from the corresponding data store server. The size of dependency matrices is bounded by the total number of servers in the system. Furthermore, the DM protocol resets the dependency matrix of a client session after each update operation, because prior dependencies need not be tracked due to the transitivity of causality. With sparse matrix encoding, the DM protocol keeps the dependency metadata of each update small.

The second protocol extends the DM protocol to support causally consistent read-only transactions by using loosely synchronized physical clocks. We call it the DM-Clock protocol. In addition to the dependency metadata required by the DM protocol, this protocol assigns to each state an update timestamp obtained from a local physical clock and guarantees that the update timestamp order of causally related states is consistent with their causal order. With this property, the DM-Clock protocol provides causally consistent snapshots of the data store to read-only transactions by assigning them a snapshot timestamp, which is also obtained from a local physical clock.

We also propose dependency cleaning, an optimization that further reduces the size of dependency metadata in the DM and DM-Clock protocols. It is based on the observation that once a state and its dependencies are fully replicated, any subsequent read of the state does not introduce new dependencies to the client session. However, more messages need to be exchanged for a server to be able to decide that a state is fully replicated, which leads to a tradeoff that we study. Dependency cleaning is a general technique and can be applied to other causally consistent systems.

We implement the two protocols and dependency cleaning in Orbe, a distributed key-value store, and evaluate them experimentally. Our evaluation shows that Orbe scales out as the number of data partitions increases. Compared with an eventually consistent system, it incurs relatively little performance overhead for a large spectrum of workloads. It outperforms COPS under many workloads when supporting read-only transactions.

[Figure 1: System architecture. The data set is replicated by multiple data centers. Clients are collocated with the data store in the data center and are used by the application tier to access the data store.]

In this paper, we make the following contributions:

• The DM protocol, which provides scalable causal consistency and uses dependency matrices to keep the size of dependency metadata under control.

• The DM-Clock protocol, which provides read-only transactions with causal snapshots using loosely synchronized physical clocks by extending the DM protocol.

• The dependency cleaning optimization, which reduces the size of dependency metadata.

• An implementation of the above protocols and optimization in Orbe, as well as an extensive performance evaluation.

2 Model and Definition

In this section we describe our system model and define causality and causal consistency.

2.1 Architecture

We assume a distributed key-value store that manages a large set of data items. The key-value store provides two basic operations to the clients:

• PUT(key, val): A PUT operation assigns value val to an item identified by key. If item key does not exist, the system creates a new item with initial value val. If key exists, a new version storing val is created.

• val ← GET(key): The GET operation returns the value of the item identified by key.

An additional operation that provides read-only transactions is introduced later in Section 4.
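For concreteness, the following is a minimal C++ sketch of a client-facing interface for these operations; the class and method names are illustrative and are not taken from Orbe's implementation.

    // Illustrative client-facing interface for the key-value store described
    // above; names are hypothetical, not taken from Orbe's implementation.
    #include <string>
    #include <vector>

    class KeyValueStoreClient {
    public:
        virtual ~KeyValueStoreClient() = default;

        // PUT(key, val): create the item if it does not exist, otherwise
        // create a new version storing val.
        virtual void Put(const std::string& key, const std::string& val) = 0;

        // val <- GET(key): return the value of the item identified by key.
        virtual std::string Get(const std::string& key) = 0;

        // <vals> <- GET-TX(<keys>): causally consistent read-only transaction,
        // introduced in Section 4.
        virtual std::vector<std::string> GetTx(
            const std::vector<std::string>& keys) = 0;
    };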

The data store is partitioned into N partitions, and each partition is replicated by M replicas. A data item is assigned to a partition based on the hash value of its key.


In a typical configuration, as shown in Figure 1, the data store is replicated at M different data centers for high availability and low operation latency. The data store is fully replicated; all N partitions are present at each data center.

The application tier relies on the clients to access the underlying data store. A client is collocated with the data store servers in a particular data center and only accesses those servers in the same data center. A client does not issue the next operation until it receives the reply to the current one. Each operation happens in the context of a client session. A client session maintains a small amount of metadata that tracks the dependencies of the session.

2.2 Causal Consistency

Causality is a happens-before relationship between two events [2, 14]. We denote causal order by ⇝. For two operations a and b, if a ⇝ b, we say b depends on a, or a is a dependency of b. a ⇝ b if and only if one of the following three rules holds:

• Thread-of-execution. a and b are in a single thread of execution, and a happens before b.

• Reads-from. a is a write operation and b is a read operation, and b reads the state created by a.

• Transitivity. There is some other operation c such that a ⇝ c and c ⇝ b.

We define the nearest dependencies of a state as all the states that it directly depends on, without relying on the transitivity of causality.

To provide causal consistency when replicating updates, a replica does not apply an update propagated from another replica until all its causal dependency states are installed locally.

3 DM Protocol

In this section, we present the DM protocol, which provides scalable causal consistency using dependency matrices. This protocol is scalable because replicas of different partitions exchange updates for replication in parallel, without requiring a global serialization point.

Dependency matrices are the natural extension, to systems that support both replication and partitioning, of version vectors, which are designed for systems that support only replication. A row of a dependency matrix is a version vector that stores the dependencies from the replicas of one partition.

A client stores the nearest dependencies of its session in a dependency matrix and associates it with each update request to the data store. After a partition at the client's local data center executes the update, it propagates the update with its dependency matrix to the replicas of that partition at remote data centers. Using the dependency matrix of the received update, a remote replica waits to apply the update until all partitions at its data center store the dependency states of the update.

  N           number of partitions
  M           number of replicas per partition
  c           a client
  DM_c        dependency matrix of c, N×M elements
  PDT_c       physical dependency timestamp of c
  p^m_n       the server that runs the mth replica of the nth partition
  VV^m_n      (logical) version vector of p^m_n, M elements
  PVV^m_n     physical version vector of p^m_n, M elements
  Clock^m_n   current physical clock time of p^m_n
  d           an item, a tuple 〈k, v, ut, put, dm, rid〉
  k           item key
  v           item value
  ut          (logical) update timestamp
  put         physical update timestamp (used only in the DM-Clock protocol)
  dm          dependency matrix, N×M elements
  rid         source replica id
  t           a read-only transaction, a tuple 〈st, rs〉
  st          (physical) snapshot timestamp
  rs          readset, a set of read items

Table 1: Definition of symbols.

3.1 Definitions

The DM protocol introduces dependency tracking data structures at both the client and server side. It also associates dependency metadata with each item. Table 1 provides a summary of the symbols used in the protocol. We explain their meanings in detail below.

Client States. Without loss of generality, we assume a client has one session to the data store. A client c maintains for its session a dependency matrix, DM_c, which consists of N×M non-negative integer elements. DM_c tracks the nearest dependencies of a client session. DM_c[n][m] indicates that the client session potentially depends on the first DM_c[n][m] updates at partition p^m_n, the mth replica of the nth partition.

Server States. Each partition maintains a version vector (VV) [2, 21]. The version vector of partition p^m_n is VV^m_n, which consists of M non-negative integer elements. VV^m_n[m] counts the number of updates p^m_n has executed locally. VV^m_n[i] (i ≠ m) indicates that p^m_n has applied the first VV^m_n[i] updates propagated from p^i_n, a replica of the same partition.

A partition updates an item either by executing an update request from its clients or by applying a propagated update from one of its replicas at other data centers. We call the partition that updates an item to the current value by executing a client request the source partition of the item.

Item Metadata. We represent an item d as a tuple 〈k, v, ut, dm, rid〉. k is a unique key that identifies the item. v is the value of the item. ut is the update timestamp, the logical creation time of the item at its source partition. dm is the dependency matrix, which consists of N×M non-negative integer elements. dm[n][m] indicates that d potentially depends on the first dm[n][m] updates at partition p^m_n, a prefix of its update history. rid is the source replica id, the replica id of the item's source partition.

We use sparse matrix encoding to encode dependency matrices. Zero elements in a dependency matrix do not use any bits after encoding. Only non-zero elements contribute to the actual size.
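As an illustration of this encoding, the sketch below keeps only the non-zero elements of an N×M dependency matrix as (partition, replica, timestamp) triples; it is a minimal example of the idea, not Orbe's actual wire format.

    // Sparse encoding of an N x M dependency matrix: only non-zero elements
    // are kept, as (partition index, replica index, timestamp) triples.
    // Illustrative encoding, not necessarily Orbe's wire format.
    #include <cstdint>
    #include <vector>

    struct DmEntry {
        uint32_t partition;  // n: partition index
        uint32_t replica;    // m: replica index
        uint64_t timestamp;  // DM[n][m]: depends on first `timestamp` updates of p^m_n
    };

    // Encode: keep only the non-zero elements of the dense matrix dm[N][M].
    std::vector<DmEntry> EncodeSparse(const std::vector<std::vector<uint64_t>>& dm) {
        std::vector<DmEntry> out;
        for (uint32_t n = 0; n < dm.size(); ++n)
            for (uint32_t m = 0; m < dm[n].size(); ++m)
                if (dm[n][m] != 0)
                    out.push_back({n, m, dm[n][m]});
        return out;
    }

    // Decode back into a dense N x M matrix (zero-filled by default).
    std::vector<std::vector<uint64_t>> DecodeSparse(const std::vector<DmEntry>& entries,
                                                    uint32_t N, uint32_t M) {
        std::vector<std::vector<uint64_t>> dm(N, std::vector<uint64_t>(M, 0));
        for (const auto& e : entries) dm[e.partition][e.replica] = e.timestamp;
        return dm;
    }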

3.2 Protocol

We now describe how the DM protocol executes GET and PUT operations and replicates PUTs.

GET. Client c sends a request 〈GET k〉 to a partition at the local data center, where k is the key of the item to read. Upon receiving the request, partition p^m_n obtains the read item, d, and sends a reply 〈GETREPLY v_d, ut_d, rid_d〉 back to the client. Upon receiving the reply, the client updates its dependency matrix: DM_c[n][rid_d] ← max(DM_c[n][rid_d], ut_d). It then hands the read value, v_d, to the caller of GET.

PUT. Client c sends a request 〈PUT k, v, DM_c〉 to a partition at the local data center, where k is the item key and v is the update value. Upon receiving the request, p^m_n performs the following steps: 1) increment VV^m_n[m]; 2) create a new version d for the item identified by k; 3) assign the item key: k_d ← k; 4) assign the item value: v_d ← v; 5) assign the update timestamp: ut_d ← VV^m_n[m]; 6) assign the dependency matrix: dm_d ← DM_c; 7) assign the source replica id: rid_d ← m. These steps form one atomic operation, and none of them is blocking. p^m_n stores d on stable storage and overwrites the existing version if there is one. It then sends a reply 〈PUTREPLY ut_d, rid_d〉 back to the client. Upon receiving the reply, the client updates its dependency matrix: DM_c ← 0 (reset all elements to zero) and DM_c[n][rid_d] ← ut_d.
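The client-side bookkeeping implied by the GET and PUT rules above can be summarized in a few lines. The sketch below uses a dense matrix for clarity and hypothetical names; it is not Orbe's implementation.

    // Client-side dependency tracking in the DM protocol (illustrative).
    // dm_[n][m] = the client session depends on the first dm_[n][m]
    // updates executed at partition p^m_n.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    class ClientSession {
    public:
        ClientSession(uint32_t N, uint32_t M)
            : dm_(N, std::vector<uint64_t>(M, 0)) {}

        // Rule for a GET reply <v_d, ut_d, rid_d> from partition n:
        // DM_c[n][rid_d] <- max(DM_c[n][rid_d], ut_d).
        void OnGetReply(uint32_t n, uint32_t rid, uint64_t ut) {
            dm_[n][rid] = std::max(dm_[n][rid], ut);
        }

        // Rule for a PUT reply <ut_d, rid_d> from partition n:
        // reset the whole matrix (transitivity of causality), then record the PUT.
        void OnPutReply(uint32_t n, uint32_t rid, uint64_t ut) {
            for (auto& row : dm_) std::fill(row.begin(), row.end(), 0);
            dm_[n][rid] = ut;
        }

        // Dependency matrix attached to the next PUT request.
        const std::vector<std::vector<uint64_t>>& dm() const { return dm_; }

    private:
        std::vector<std::vector<uint64_t>> dm_;
    };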

Update Replication. A partition propagates its local updates to its replicas at remote data centers in their update timestamp order. To replicate a newly updated item, d, partition p^s_n sends an update replication request 〈REPLICATE k_d, v_d, ut_d, dm_d, rid_d〉 to all other replicas. A partition also applies updates propagated from other replicas in their update timestamp order. Upon receiving the request, partition p^m_n guarantees causal consistency by performing the following steps:

1) p^m_n checks if it has installed the dependency states of d specified by dm_d[n]. p^m_n waits until VV^m_n ≥ dm_d[n], i.e., VV^m_n[i] ≥ dm_d[n][i] for 0 ≤ i ≤ M−1.

2) p^m_n checks if causality is satisfied at the other local partitions. p^m_n waits until VV^m_j ≥ dm_d[j], for 0 ≤ j ≤ N−1 and j ≠ n. It needs to send a message to p^m_j for dependency checking only if dm_d[j] contains at least one non-zero element.

3) If there is currently no value stored for item k_d at p^m_n, it simply stores d. If p^m_n has an existing version d′ such that k_d′ = k_d, it orders the two versions deterministically by concatenating the update timestamp (high order bits) and the source replica id (low order bits). If d is ordered after d′, p^m_n overwrites d′ with d. Otherwise, d is discarded.

4) p^m_n updates its version vector: VV^m_n[s] ← ut_d. As updates are propagated in order, we have the following invariant before VV^m_n[s] is updated: VV^m_n[s] + 1 = ut_d.

[Figure 2: An example of the DM protocol with two partitions replicated at two data centers. A client updates item X and then item Y at two partitions. X and Y are propagated concurrently, but their installation at the remote data center is constrained by causality.]
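The following sketch illustrates the dependency check of steps 1 and 2 from the perspective of the receiving replica p^m_n. The blocking waits are collapsed into a boolean predicate, and the helper that queries another local partition's version vector stands in for an intra-datacenter message; all names are our own.

    // Dependency check performed by replica p^m_n before applying a propagated
    // update d (steps 1 and 2 above). Illustrative sketch: a real server would
    // block or retry rather than return false, and FetchLocalVV would be an RPC.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    using Vec = std::vector<uint64_t>;   // a version vector (M elements)
    using Matrix = std::vector<Vec>;     // an N x M dependency matrix

    // True if every element of vv is >= the corresponding element of required.
    bool Covers(const Vec& vv, const Vec& required) {
        for (std::size_t i = 0; i < required.size(); ++i)
            if (vv[i] < required[i]) return false;
        return true;
    }

    bool DependenciesSatisfied(uint32_t n,           // this partition's index
                               const Vec& local_vv,  // VV^m_n
                               const Matrix& dm_d,   // dependency matrix of d
                               // assumed helper: fetch VV^m_j from local partition j
                               Vec (*FetchLocalVV)(uint32_t j)) {
        // Step 1: dependencies on this partition's own replicas.
        if (!Covers(local_vv, dm_d[n])) return false;

        // Step 2: dependencies on other local partitions; an all-zero row
        // needs no message at all.
        for (uint32_t j = 0; j < dm_d.size(); ++j) {
            if (j == n) continue;
            bool all_zero = true;
            for (uint64_t x : dm_d[j]) if (x != 0) { all_zero = false; break; }
            if (all_zero) continue;
            if (!Covers(FetchLocalVV(j), dm_d[j])) return false;
        }
        return true;
    }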

Example. We give an example to explain the necessity for dependency matrices. Consider a system with two partitions replicated at two data centers (N = 2 and M = 2) in Figure 2. The client first updates item X at the local data center. The update is handled by partition p^0_0. Upon receiving a response, the client updates item Y at partition p^0_1. Causality through the client session dictates that X ⇝ Y. At the other data center, p^1_1 applies Y only after p^1_0 applies X. The figure shows the dependency matrices propagated with the updates. p^1_1 waits until the version vector of p^1_0 is no less than [1, 0], which guarantees that X has been replicated.

Correctness. The DM protocol uses dependency matrices to track the nearest dependencies of a client session or an item. It resets the dependency matrix of a client session after each PUT to keep its size after encoding small. This does not affect correctness. By only remembering the update timestamp of the PUT, the protocol utilizes the transitivity of causality to track dependencies correctly.

An element in a dependency matrix, a scalar value, is the maximum update timestamp of all the nearest dependency items from the corresponding partition. Since a partition always propagates local updates to, and applies remote updates from, other replicas in their update timestamp order, once it applies an update from another replica, it must have applied all the other updates with a smaller update timestamp from the same replica. Therefore, if a partition satisfies the dependency requirement specified by the dependency matrix of an update, it must have installed all the dependencies of the update.

3.3 Cost over Eventual Consistency

Compared with typical implementations of eventual consistency, the DM protocol introduces some overhead to capture causality. This is a reasonable price to pay for the stronger semantics.

Storage Overhead. The protocol keeps a dependency matrix and other metadata for each item. The per-item dependency metadata is the major storage overhead of causal consistency. We keep it small by tracking only the nearest dependencies and compressing it with sparse matrix encoding. In addition, the protocol requires that a client session maintain a dependency matrix and a partition maintain a version vector. These global states are small and negligible.

Communication Overhead. Checking the dependencies of an update during replication requires a partition to send at most N−1 messages to other local partitions. If a row in the dependency matrix contains all zeros, then the corresponding partition does not need to be checked, since the update does not directly depend on any state managed by the replicas of that partition. In addition, the dependency metadata carried by update replication messages also contributes to inter-datacenter network traffic.

3.4 Conflict Detection and Resolution

The above protocol orders updates of the same item deterministically by using the update timestamp and source replica id. If there are no updates for a long enough time, replicas of the same partition eventually converge to the same state while respecting causality. However, the dependency metadata does not indicate whether two updates from different replicas conflict.

To support conflict detection, we extend the DM protocol by introducing one more member to the existing dependency metadata: the item dependency timestamp. We denote the item dependency timestamp of an item d by idt_d. When d's source partition p^s_n creates d, it assigns to idt_d the update timestamp of the existing version of item k_d if it exists, or −1 otherwise. When partition p^m_n applies d after dependency checking, it handles the existing version d′ as follows. If d was created after d′ was replicated at p^s_n, then idt_d = ut_d′, and d and d′ do not conflict; p^m_n overwrites d′ with d. If d and d′ were created concurrently by different replicas, then idt_d ≠ ut_d′ and they conflict. In that case, either the application can be notified to resolve the conflict using application semantics, or p^m_n orders the two conflicting updates deterministically as outlined in Section 3.2.
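A compact sketch of this conflict check, with our own field and function names; the deterministic fallback is the (update timestamp, source replica id) order of Section 3.2.

    // Conflict detection with the item dependency timestamp (idt), illustrative.
    // d is the propagated version, d_prime the version currently stored locally.
    #include <cstdint>

    struct ItemVersion {
        uint64_t ut;    // (logical) update timestamp
        int64_t  idt;   // update timestamp of the version observed by d's creator, or -1
        uint32_t rid;   // source replica id
    };

    enum class Outcome { kOverwrite, kConflict };

    Outcome Resolve(const ItemVersion& d, const ItemVersion& d_prime) {
        if (d.idt == static_cast<int64_t>(d_prime.ut)) {
            // d was created after d_prime was replicated at d's source partition:
            // no conflict, the new version wins.
            return Outcome::kOverwrite;
        }
        // Concurrent updates from different replicas: report the conflict, or fall
        // back to the deterministic order below (Section 3.2).
        return Outcome::kConflict;
    }

    // Deterministic fallback ordering: true if d should overwrite d_prime,
    // comparing (update timestamp, source replica id) lexicographically.
    bool OrderedAfter(const ItemVersion& d, const ItemVersion& d_prime) {
        if (d.ut != d_prime.ut) return d.ut > d_prime.ut;
        return d.rid > d_prime.rid;
    }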

4 DM-Clock Protocol

In this section, we describe the DM-Clock protocol, which extends the DM protocol to support causally consistent read-only transactions. Many applications can benefit from a programming interface that provides a causally consistent view of multiple items, as an example later in this section shows.

Compared with the DM protocol, the DM-Clock protocol keeps multiple versions of each item. It also requires access to physical clocks. It assigns each item version a physical update timestamp, which imposes on causally related item versions a total order consistent with the (partial) causal order. A read-only transaction obtains its snapshot timestamp by reading the physical clock at the first partition it accesses, its originating partition. The DM-Clock protocol then provides a causally consistent snapshot of the data store, including the latest item versions with a physical update timestamp no greater than the transaction's snapshot timestamp.

4.1 Read-only Transaction

With the DM-Clock protocol, the key-value store also provides a transactional read operation:

• 〈vals〉 ← GET-TX(〈keys〉): This operation returns the values of a set of items identified by keys. The returned values are causally consistent.

A read-only transaction provides a causally consistent snapshot of the data store. Assume xr and yr are two versions of items X and Y, respectively. If a read-only transaction reads xr and yr, and xr ⇝ yr, then there does not exist another version of X, xo, such that xr ⇝ xo ⇝ yr.

We give a concrete example to illustrate the application of read-only transactions. Assume Alice wants to share some photos with friends through an online social network such as Facebook. She first changes the permission of an album from "public" to "friends-only" and then uploads some photos to that album. When these two updates are propagated to and applied at remote replicas, causality ensures that their occurrence order is preserved: the permission update operation happens before the photo upload operation. However, it is possible that Bob, not a friend of Alice, first reads the permission of the album as "public" and then sees the photos that were uploaded after the album was changed to "friends-only". Enclosing the album permission check and the viewing of the photos in a causally consistent read-only transaction prevents this undesirable outcome. In a causally consistent snapshot, the photos cannot be viewed if the permission change that causally precedes their upload is not observed.

4.2 Definitions

Physical Clocks. The DM-Clock protocol uses loosely synchronized physical clocks. We assume each server is equipped with a hardware clock that increases monotonically. A clock synchronization protocol, such as the Network Time Protocol (NTP) [1], keeps the clock skew under control. The clock synchronization precision does not affect the correctness of our protocol. We use Clock^m_n to denote the current physical clock time at partition p^m_n.

Dependency Metadata. Compared with the DM protocol, the DM-Clock protocol introduces additional dependency metadata. Each version of an item d has a physical update timestamp, put_d, which is the physical clock time at the source partition of d when it is created. Different versions of an item are sorted in the item's version chain using their physical update timestamps. A client c maintains a physical dependency time, PDT_c, for its session. This variable stores the greatest physical update timestamp of all the states a client session depends on. Each partition p^m_n maintains a physical version vector, PVV^m_n, a vector of M physical timestamps. PVV^m_n[i] (0 ≤ i ≤ M−1, i ≠ m) indicates the physical time of p^i_n seen by p^m_n. This value comes either from replicated updates or from heartbeat messages.

4.3 Protocol

We now describe how the DM-Clock protocol extends the DM protocol to support read-only transactions.

GET. When a partition returns the read version d back to the client, it also includes its physical update timestamp put_d. Upon receiving the reply, the client updates its physical dependency time: PDT_c ← max(PDT_c, put_d).

PUT. An update request from client c to partition p^m_n also includes PDT_c. When p^m_n receives the request, it first checks whether PDT_c < Clock^m_n. If not, it delays the update request until the condition holds. When p^m_n creates a new version d for the item identified by k_d, it also assigns d the physical update time: put_d ← Clock^m_n. It then inserts d into the version chain of item k_d using ut_d. The reply message back to the client also includes put_d. Upon receiving the reply message, the client updates its physical dependency time: PDT_c ← max(PDT_c, put_d).

With the above read and update rules, our protocol provides the following property on causally related states: for any two item versions x and y, if x ⇝ y, then put_x < put_y.

Update Replication. An update replication request of d also includes put_d. When partition p^m_n receives d from p^s_n, and after d's dependencies are satisfied, it inserts d into the version chain of item k_d using put_d. p^m_n then updates its physical version vector: PVV^m_n[s] ← put_d.

Heartbeat Broadcasting. A partition periodically broadcasts its current physical clock time to its replicas at remote data centers. It sends out heartbeat messages and updates in physical timestamp order. When p^m_n receives a heartbeat message with physical time pt from p^s_n, it updates its physical version vector: PVV^m_n[s] ← pt. We use ∆ to denote the heartbeat broadcasting interval. (Our implementation of Orbe sets ∆ to 10 ms.) A partition skips sending a heartbeat message to a replica if there was an outgoing update replication message to that replica within the previous ∆ time. Hence, heartbeat messages are not needed when replicas exchange updates frequently enough.

GET-TX. A read-only transaction t maintains a physical snapshot timestamp st_t and a readset rs_t. Client c sends a request 〈GETTX kset〉 to a partition chosen by some load balancing algorithm, where kset is the set of items to read.

When t is initialized at p^m_o, the originating partition, it reads the local hardware clock to obtain its snapshot timestamp: st_t ← Clock^m_o − ∆. We provide a slightly older snapshot to a transaction, by subtracting some amount of time from the latest physical clock time, to reduce the probability of a transaction being delayed and the duration of the delay. p^m_o reads the items specified by kset one by one. If p^m_o does not store an item required by t, it reads the item from another local partition that stores the item.

Before t reads an item at partition p^m_n, it first waits until two conditions hold: 1) Clock^m_n > st_t; 2) min({PVV^m_n[i] | 0 ≤ i ≤ M−1, i ≠ m}) > st_t. t then chooses the latest version d such that put_d ≤ st_t from the version chain of the read item and adds d to its readset rs_t. After t finishes reading all the requested items, it sends a reply 〈GETTXREPLY rs_t〉 back to the client. The client handles each retrieved item version in rs_t one by one in the same way as for a GET operation.
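A sketch of the per-partition read performed inside GET-TX: wait until the two conditions hold, then return the latest version whose physical update timestamp does not exceed the snapshot timestamp. Data structures and names are illustrative.

    // Per-partition read inside GET-TX (illustrative). The version chain is
    // assumed to be sorted by physical update timestamp, newest first, and the
    // PVV vector to be non-empty.
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Version {
        uint64_t put;        // physical update timestamp
        std::string value;
    };

    // Condition checks before reading at partition p^m_n.
    bool SafeToRead(uint64_t clock_now,                // Clock^m_n
                    const std::vector<uint64_t>& pvv,  // PVV^m_n entries for other replicas
                    uint64_t st) {                     // snapshot timestamp st_t
        if (clock_now <= st) return false;                       // 1) Clock^m_n > st_t
        uint64_t min_pvv = *std::min_element(pvv.begin(), pvv.end());
        return min_pvv > st;                                     // 2) min(PVV^m_n) > st_t
    }

    // Choose the latest version d with put_d <= st_t (chain sorted newest first).
    const Version* SnapshotRead(const std::vector<Version>& chain, uint64_t st) {
        for (const auto& v : chain)
            if (v.put <= st) return &v;
        return nullptr;  // no version visible in this snapshot
    }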

By delaying a read-only transaction under the above conditions, our protocol achieves the following property: the snapshot of a transaction includes all item versions with a physical update timestamp no greater than its snapshot timestamp, if no failure happens.

If failures happen, a read-only transaction may be blocked. We provide solutions to this problem in Section 5, where we discuss failure handling.

Correctness. To see why our protocol provides causally consistent read-only transactions, consider xr ⇝ yr in the definition in Section 4.1 again. With the first property, if a transaction t reads xr and yr, then put_xr < put_yr ≤ st_t. With the second property, t only reads the latest version of an item with a physical update timestamp no greater than st_t. Hence there does not exist a version xo such that put_xr < put_xo ≤ st_t. Therefore, it is impossible that there exists xo such that xr ⇝ xo ⇝ yr. Our protocol provides causally consistent read-only transactions.

Conflict Detection and Resolution. The above DM-Clock protocol cannot tell whether two updates conflict or not. We employ the same technique used by the DM protocol in Section 3.4 to detect conflicts. We do not need the conflict resolution part here, since the DM-Clock protocol uses physical update timestamps to totally order different versions of the same item.

4.4 Garbage Collection

The DM-Clock protocol stores multiple versions of each item. We briefly describe how to garbage-collect old item versions to keep the storage footprint small. Partitions within the same data center periodically exchange the snapshot timestamps of their oldest active transactions. If a partition does not have any active read-only transactions, it sends out its latest physical clock time. At each round of garbage collection, a partition chooses the minimum of the received timestamps as the safe garbage collection timestamp. With this timestamp, a partition scans the version chain of each item it stores. It keeps only the latest item version created before the safe garbage collection timestamp (if there is one) and the versions created after the timestamp. It removes all the other versions, which are not needed by active or future read-only transactions.
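A minimal sketch of this garbage collection round, assuming the version chain of an item is kept sorted from newest to oldest; names and data structures are our own.

    // Garbage collection of old versions (illustrative). `peer_timestamps` are the
    // oldest-active-transaction snapshot timestamps exchanged by the local
    // partitions; the minimum of them is the safe GC timestamp.
    #include <algorithm>
    #include <cstdint>
    #include <list>
    #include <vector>

    struct Version {
        uint64_t put;  // physical update timestamp
        // value omitted
    };

    uint64_t SafeGcTimestamp(const std::vector<uint64_t>& peer_timestamps) {
        return *std::min_element(peer_timestamps.begin(), peer_timestamps.end());
    }

    // Keep every version created after `safe_ts`, plus the single latest version
    // created at or before it; remove the rest. The chain is sorted newest first.
    void Prune(std::list<Version>& chain, uint64_t safe_ts) {
        bool kept_old = false;
        for (auto it = chain.begin(); it != chain.end();) {
            if (it->put > safe_ts) {
                ++it;                   // still visible to future transactions
            } else if (!kept_old) {
                kept_old = true; ++it;  // latest version at or before safe_ts
            } else {
                it = chain.erase(it);   // older versions no longer needed
            }
        }
    }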

5 Failure Handling

We briefly describe how the DM and DM-Clock protocols handle failures.

5.1 DM Protocol

Client Failures. When a client fails, it stops issuing new requests to the data store. The failure of a client does not affect other clients or the data store. Recovery is not needed since a client only stores soft state for dependency tracking.

Partition Server Failures. A partition maintains a redo log on stable storage, which stores all installed update operations in update timestamp order. A failed partition recovers by replaying the log. Checkpointing can be used to accelerate the recovery process. The partition then synchronizes its state with the other replicas at remote data centers by exchanging locally installed updates.

The current design of the DM protocol does not tolerate partition failures within a data center. However, it can be extended to tolerate such failures by replicating each data partition within the same data center using standard techniques, such as primary copy [4, 20], Paxos [16], and chain replication [23].

Data Center Failures. The DM protocol tolerates the failure of an entire data center, for example due to a power outage, as well as network partitions among data centers. If a data center fails and recovers later, it rebuilds the data store state by recovering each partition in parallel. If the network partitions and heals later, the data center updates the data store state by synchronizing the operation logs within each replication group in parallel. If a data center fails permanently and cannot recover, any updates that originated in the failed data center and have not been propagated out are lost. This is inevitable due to the nature of causally consistent replication, which allows low-latency local updates without requiring coordination across data centers.

5.2 DM-Clock Protocol

The DM-Clock protocol uses the same failure handling techniques as the DM protocol, except that it treats read-only transactions specially.

With the DM-Clock protocol, before reading an item at a partition, a read-only transaction requires that the partition has executed all local updates and applied all remote updates with update timestamps no greater than the snapshot timestamp of the transaction. However, if a remote replica fails or the network among data centers partitions, a transaction might be delayed for a long time, because it does not know whether there are updates from a remote replica that should be included in its snapshot but have not been propagated.

Two approaches can solve this problem. If a transaction is delayed longer than a certain threshold, its originating partition re-executes it using a smaller snapshot timestamp to avoid blocking on the (presumably) failed or disconnected remote partition. With this approach, the transaction provides relatively stale item versions until the remote partition reconnects. A transaction delayed for a long enough time can also switch to a two-round protocol similar to the one used in Eiger [18]. In this case, the transaction returns relatively fresh data but may need two rounds of messages to finish.

6 Dependency Cleaning

In this section, we present dependency cleaning, a technique that further reduces the size of dependency metadata in the DM and DM-Clock protocols. This idea is general and can be applied to other causally consistent systems.

6.1 Intuition

Although our DM and DM-Clock protocols effectively reduce the size of dependency metadata by using dependency matrices and a few other techniques, for big data sets managed by a large number of servers it is still possible that the dependency matrix of an update has many non-zero elements. For instance, in some statistics workloads a client may scan a large number of items located at many different partitions and then update a single item. In this case, the dependency metadata can be many times bigger than the actual application payload, which incurs more inter-datacenter traffic for update replication. In addition, checking the dependencies of such an update during replication requires a large number of messages.

The hidden assumption behind tracking dependencies at the client side is that a client does not know whether a state it accesses has been fully replicated at all replicas. To guarantee causal consistency, the client has to remember all the nearest dependency states it accesses and associate them with subsequent update operations. Hence, when an update is propagated to a remote replica, the remote replica uses its dependency metadata to check whether all its dependency states are present there. This approach is pessimistic because it assumes the dependency states are not replicated at all replicas. Most existing solutions for causal consistency are built on this assumption. It is a valid assumption if the network connecting the replicas fails or disconnects often, which is what the early works assume [21, 22]. However, it is not realistic for modern data center applications. Data centers of the same organization are often connected by high-speed, low-latency, and reliable fiber links. Most of the time, network partitions among data centers happen only because of accidents, and they are rare. Therefore, we argue that one should not be pessimistic about dependency tracking for this type of application. If a state and its dependencies are known to be fully replicated at all replicas, a client does not need to include it in the dependency metadata when reading it. With this observation, we can substantially reduce the size of the dependency metadata.

6.2 Protocol Extension

We describe how to extend the DM and DM-Clock protocols to support dependency cleaning. To track updates that are replicated at all replicas, we introduce a full replication version vector (RVV) at each partition. At partition p^m_n, RVV^m_n indicates that the first RVV^m_n[i] updates of p^i_n (0 ≤ i ≤ M−1) have been fully replicated.

Update Replication. We add the following extensions to the update replication process. After p^m_n propagates an item version d to all other replicas, it requires them to send back a replication acknowledgment message after they apply d. Similar to propagating local updates in their update timestamp order, a partition also sends replication acknowledgments in the same order. Once p^m_n receives replication acknowledgments of d from all other replicas, it increments RVV^m_n[m]. p^m_n then sends a full-replication completion message of d to the other replicas. Similarly, the full-replication completion messages are also sent in update timestamp order. When partition p^i_n receives the full-replication completion of d, it increments RVV^i_n[m] (0 ≤ i ≤ M−1 and i ≠ m). Since partition p^m_n increments an element of VV^m_n when applying a replicated update but increments the corresponding element in RVV^m_n only after that update is fully replicated, RVV^m_n ≤ VV^m_n always holds.

GET and GET-TX. We now describe how the RVV is used to perform dependency cleaning when the data store handles read operations. Assume a client sends a read request to partition p^m_n and an item version d is selected. If RVV^m_n[rid_d] ≥ ut_d, p^m_n knows that d and all its dependency states have been fully replicated, and there is no need to include d in the dependencies of the client session. p^m_n sends a reply message back to the client without including d's update timestamp. Upon receiving the reply, the client keeps its dependency matrix unchanged. This technique can be applied to read-only transactions similarly. Therefore, by marking an item version as fully replicated, this technique "cleans" the dependency it introduces to the client that reads it.

Normally, the duration for which RVV^m_n[rid_d] < ut_d holds is short. Under moderate system loads, this duration is roughly 1.5 WAN round-trip latencies plus two times the write latency of stable storage, which is normally a few hundred milliseconds. As a consequence, for a broad spectrum of applications, most read operations do not generate dependencies, which keeps the dependency metadata small.
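The read-path check is a single comparison against the RVV. The sketch below illustrates it with hypothetical message and field names:

    // Dependency cleaning on the read path (illustrative). If the selected
    // version and all of its dependencies are known to be fully replicated,
    // the reply omits the update timestamp and the client records no dependency.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct GetReply {
        std::string value;
        bool has_dependency;   // false if dependency cleaning applies
        uint64_t ut;           // update timestamp, valid only if has_dependency
        uint32_t rid;          // source replica id, valid only if has_dependency
    };

    GetReply BuildGetReply(const std::string& value,
                           uint64_t ut_d, uint32_t rid_d,
                           const std::vector<uint64_t>& rvv /* RVV^m_n */) {
        GetReply reply{value, true, ut_d, rid_d};
        // RVV^m_n[rid_d] >= ut_d means d (and, transitively, everything d depends
        // on) is fully replicated, so the client need not track it.
        if (rvv[rid_d] >= ut_d) {
            reply.has_dependency = false;
            reply.ut = 0;
            reply.rid = 0;
        }
        return reply;
    }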

6.3 Message Overhead

Dependency cleaning involves a tradeoff. It reduces the size of dependency metadata and the cost of dependency checking at the expense of more network messages for sending replication acknowledgments and full-replication completions.

Assume an update depends on states from k partitions other than its source partition. For a system with M replicas, without dependency cleaning, it takes M−1 WAN messages to propagate the update to all other replicas at remote data centers. The dependency matrix in each of the M−1 WAN messages has at least k non-zero rows. (M−1)k LAN messages are required for checking the dependencies of the propagated update at remote replicas. With dependency cleaning, it requires 3(M−1) WAN messages to replicate an update. For many workloads, however, the dependency matrix of an update then contains mostly zero elements, and almost no LAN messages are needed for dependency checking.
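As a concrete example of this tradeoff, with M = 3 replicas and an update that depends on states from k = 4 other partitions, replication costs 2 WAN plus 8 LAN messages without dependency cleaning, versus 6 WAN messages (and typically no LAN dependency checks) with it. The small helper below just restates these counts:

    // Message counts for replicating one update, following Section 6.3.
    #include <cstdio>

    struct Cost { int wan; int lan; };

    // Without dependency cleaning: M-1 WAN messages, (M-1)*k LAN dependency checks.
    Cost WithoutCleaning(int M, int k) { return {M - 1, (M - 1) * k}; }

    // With dependency cleaning: 3(M-1) WAN messages (replicate, acknowledge,
    // full-replication completion); LAN checks are rare because most rows are zero.
    Cost WithCleaning(int M) { return {3 * (M - 1), 0}; }

    int main() {
        const int M = 3, k = 4;  // example: 3 replicas, dependencies on 4 partitions
        const Cost a = WithoutCleaning(M, k), b = WithCleaning(M);
        std::printf("without cleaning: %d WAN, %d LAN\n", a.wan, a.lan);
        std::printf("with cleaning:    %d WAN, ~%d LAN\n", b.wan, b.lan);
        return 0;
    }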


7 Evaluation

We evaluate the DM protocol, the DM-Clock protocol, and dependency cleaning in Orbe, a multiversion key-value store that supports both partitioning and replication. In particular, we answer the following questions:

• Does Orbe scale as the number of partitions increases?

• What is the overhead of providing causal consistency compared with eventual consistency?

• How does Orbe compare with COPS?

• Is dependency cleaning an effective technique for reducing the size of dependency metadata?

7.1 Implementation and Setup

We implement Orbe in C++ and use Google's Protocol Buffers for message serialization. We partition the data set across a group of servers using consistent hashing [12]. We run NTP to keep physical clocks synchronized. NTP can be configured to adjust the clock frequency to catch up or fall back to a target. Hence, physical clocks always move forward during synchronization, a requirement for correctness in Orbe.

As part of the application tier, servers that run Orbe clients are in the same data center as the Orbe partition servers. A client chooses a partition in its local data center as its originating partition by a load balancing scheme. The client then issues all its operations to its originating partition. If the originating partition does not store an item required by a client request, it executes the operation at the local partition that manages the required item.

Orbe's underlying key-value store keeps all key-value pairs in main memory. A key points to a linked list that contains the different versions of the same item. The operation log resides on disk. The system performs group commit to write multiple updates in one disk write. A PUT operation inserts a new version into the version chain of the updated item and adds a record to the operation log. During replication, replicas of the same partition exchange their operation logs. Each replica replays the logs from other replicas and applies the updates one by one after dependency checking.

We run the DM-Clock protocol in all experiments, even where the DM protocol would suffice, because it is a superset of the DM protocol. We set the heartbeat broadcasting interval ∆ to 10 ms. By default, dependency cleaning is disabled. We enable it in one experiment, where we mention its use explicitly (see Section 7.6).

We run experiments on a local cluster where all servers are connected by a single GigE switch. All servers in the cluster are Dell PowerEdge SC1425 machines running Linux 3.2.0. Each server has two Intel Xeon processors, 4 GB of DDR2 memory, one 7200 rpm 160 GB SATA disk, and one GigE network port. The round-trip network latency in our local cluster is between 120 and 180 microseconds. We enable the hardware cache of the disk. The latency of writing a small amount of data (64 B) to the disk is around 450 microseconds. We partition the local cluster into multiple logical "data centers" as necessary. We introduce an additional 120 milliseconds of network latency for messages among replicas of the same partition located at different logical data centers.

  Operation             Echo   GET-10B   PUT-1B   PUT-16B   PUT-128B
  Throughput (K op/s)   71.3   61.4      36.8     36.4      30.2

Table 2: Maximum throughput of client operations on a single partition server without replication.

7.2 Microbenchmarks

We first evaluate the basic performance characteristics of Orbe through microbenchmarks. We replicate the data set in two data centers. At each data center, the data set is partitioned across eight servers. Each partition loads one million data items during initialization. For each preloaded item, the size of a key is eight bytes and the value is ten bytes.

In the first experiment, we examine the capability of a single partition server. We launch enough clients to saturate the server. A GET operation reads a randomly selected item from its originating partition. The PUT operation also operates on the originating partition, updating a random item with update values of different sizes. For comparison, we also introduce an Echo operation, which simply returns the operation argument to the clients.

As shown in Table 2, a partition server can process Echo operations at about 70K ops/s, GET operations at about 60K ops/s, and PUT operations at about 30K ops/s. The throughput of Echo indicates the message processing capability of our hardware. As the update value size increases in PUT, the throughput drops slightly due to the increased cost of memory copies. In all cases, the CPU is the bottleneck.

In the second experiment, we measure operation latencies. For this experiment, GET and PUT choose items located at the originating partition with a probability of 50% and at other local partitions with the other 50%. A GET-TX operation reads six items in total: one from its originating partition and five from other local partitions.

Figure 3 shows the latency distribution of the four operations. The Echo operation shows the baseline as it takes only one round-trip latency to finish. Each GET and PUT requires either one or two rounds of messages within the same data center (depending on the location of the requested item), which results in two clear groups of latencies. The latency of executing a client operation is low, because Orbe does not require a partition server to coordinate with its replicas at other data centers to process GET and PUT operations. None of the (microsecond-scale) client operations depend on replication operations that incur the 120 ms latency between data centers.

[Figure 3: Latency distribution of client operations. The curves show the CDF of operation latency (in microseconds) for Echo, Get, Put, and TxGet.]

7.3 Scalability

We now examine the scalability of Orbe with an increasing number of partitions. We set up two data centers with two to eight partitions each.

We first use three workloads that are a configurable mix of PUTs and GETs. Items are selected from the originating partition with a probability of 50% and from other local partitions for the other 50%. PUT operations update items with ten-byte values. For the read-only transaction workload, each GET-TX reads one item from its originating partition and five from other local partitions. Figure 4 shows the throughput of Orbe as the number of partitions increases. Regardless of the put:get ratio, Orbe scales out with an increasing number of partitions. Because Orbe propagates updates across partitions in parallel, it is able to utilize more servers to provide higher throughput.

[Figure 4: Maximum throughput of varied workloads with two to eight partitions. The legend gives the put:get ratio.]

7.4 Comparison with Eventual Consistency

To show the overhead of providing causal consistency in our protocols, we compare Orbe with an eventually consistent key-value store, which is implemented in Orbe's codebase. We set up two data centers of three partitions each. A client accesses items randomly selected from the three local partitions with different put:get ratios. PUT updates an item with a value of 60 bytes.

Figure 5 shows the throughput of Orbe and the eventually consistent key-value store. For an almost read-only workload, they have similar throughput. For an almost update-only workload, Orbe's throughput is about 24% lower. The minor degradation in throughput is a reasonable price to pay for the much improved semantics over eventual consistency.

[Figure 5: Maximum throughput of workloads with varied put:get ratios for both Orbe (causal consistency) and eventual consistency.]

The major overhead of implementing causal consistency comes from 1) network messages for dependency checking and 2) processing, storing, and transmitting the dependency metadata. Figure 6 shows this overhead with two curves. The first is the average number of dependency checking messages per replicated update. The second is the percentage of dependency metadata in the update replication traffic in Orbe. When the workload is almost update-only, the metadata percentage is small, and so is the number of dependency checking messages per replicated update. When the workload becomes read-heavy, the numbers go up, but they level off once GETs dominate the workload.

The contents of the dependency matrices explain the numbers in Figure 6. As Orbe tracks only the nearest dependencies, an update depends only on the previous update and the reads since the previous update in the same client session. With a high put:get ratio, the dependency matrix contains only a few non-zero elements. With a low put:get ratio, reads generate a large number of dependencies, but the total number of elements in a dependency matrix is bounded by the number of partition servers in the data store.

[Figure 6: Average number of dependency checking messages per replicated update and percentage of dependency metadata in the update replication traffic with varied put:get ratios in Orbe.]

7.5 Comparison with COPSWe compare Orbe with COPS [17], which also pro-

vides causal consistency for both partitioned and repli-cated data stores. We implement COPS in Orbe’s code-base. For an apples-to-apples comparison, we enableread-only transaction support in both Orbe and COPS(which is called COPS-GT in prior work [17]).

COPS explicitly tracks each item version read and up-dated at the client side. It associates a set of dependencyitem versions with each update. A dependency matrixin Orbe plays the same role as a set of dependency itemversions in COPS, but most of the time it takes less spaceusing sparse encoding.

Although COPS relies on a number of techniques to reduce the size of dependency metadata, the metadata can still become considerable because COPS has to track the complete dependencies to support read-only transactions, while Orbe tracks only the nearest dependencies. In addition, the execution time of the two-round transactional reading protocol in COPS limits how frequently the dependency metadata can be garbage-collected. During the fixed interval between two successive garbage collections, the more operations a client issues, the more dependency state it creates. Hence, the size of the dependency metadata is highly sensitive to the inter-operation delays at each client. COPS sets the garbage collection interval to six seconds [17]; we use the same value in our COPS implementation. A back-of-the-envelope calculation of this effect is sketched below.
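The following sketch illustrates why the garbage-collection interval matters. The 10 ms inter-operation delay is an assumed illustrative value, not a measurement from the paper; the interval, partition, and data-center counts come from the experimental setup described here.

```python
# Back-of-the-envelope sketch: how much dependency state can accumulate
# between two garbage collections. The inter-operation delay is an assumption.
gc_interval_s = 6.0          # garbage-collection interval used for COPS here
inter_op_delay_s = 0.010     # assumed 10 ms client think time per operation

ops_between_gcs = int(gc_interval_s / inter_op_delay_s)   # ~600 operations
# COPS may track (up to) every version read or written in that window, so its
# dependency set can grow with the number of operations issued.
cops_upper_bound = ops_between_gcs

# Orbe's dependency matrix is bounded by the number of (partition, replica)
# pairs, independent of how many operations the client issues.
num_partitions, num_datacenters = 3, 2
orbe_upper_bound = num_partitions * num_datacenters

print(cops_upper_bound, orbe_upper_bound)   # 600 vs. 6
```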

We set up two data centers of three partitions each. A client accesses data items randomly selected from the three partitions with a configurable put:get ratio. Figure 7 illustrates the throughput of Orbe and COPS with different client inter-operation delays. Figure 8 shows the average number of states on which an update depends. For COPS, this is the number of dependency item versions. For Orbe, this is the number of non-zero elements in the dependency matrix. Orbe provides consistently higher throughput than COPS as it tracks fewer dependency states and spends fewer CPU cycles on message serialization and transmission. Figure 8 suggests that Orbe and COPS should have similar throughput when the inter-operation delays are longer than one second.

Figure 7: Maximum throughput of operations with varied inter-operation delays for both Orbe and COPS. The legend gives the put:get ratio.

Figure 8: Average number of dependencies for each PUT operation with varied inter-operation delays for both Orbe and COPS. The legend gives the put:get ratio.

By tracking fewer dependency states and efficiently encoding the dependency matrix, Orbe also reduces the inter-datacenter network traffic for update replication among replicas of the same partition. Figure 9 compares the aggregated replication traffic (transmission only) between Orbe and COPS.

Tracking fewer states at the client side not only reduces the client's memory footprint but also consumes fewer CPU cycles, as fewer temporary objects are created and destroyed for tracking dependencies. Figure 10 shows the CPU utilization of a server that runs a group of clients, for Orbe and COPS separately. For this measurement, we run all clients on a single powerful server to saturate the data store and record the CPU utilization of that server. Orbe is more efficient: it uses fewer CPU cycles per operation because it manages less state.

Figure 9: Aggregated replication transmission traffic across data centers with varied inter-operation delays for both Orbe and COPS. The legend gives the put:get ratio.

Figure 10: CPU utilization of a server that runs a group of clients with varied inter-operation delays for both Orbe and COPS. The legend gives the put:get ratio.

7.6 Dependency Cleaning

Dependency cleaning removes the necessity for dependency tracking when a client reads a fully replicated item version. However, it increases the cost of update replication, as it requires additional inter-datacenter messages to mark an item version as fully replicated.
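The read-side effect of dependency cleaning can be sketched as follows. The reply format and the fully_replicated flag are hypothetical, intended only to illustrate the idea that reads of fully replicated versions need not be recorded as dependencies; this is not Orbe's actual API.

```python
# Hypothetical sketch of a client read with dependency cleaning enabled.
# 'fully_replicated' is an assumed flag a partition could return once it has
# learned, via extra inter-datacenter messages, that the version is installed
# at every replica. Names are illustrative, not Orbe's implementation.
def on_get_reply(session_deps, reply):
    """session_deps: set of (dc, partition, seq) triples tracked by the session."""
    if not reply["fully_replicated"]:
        # The version might still be missing at some replica, so a later
        # update in this session must not be applied remotely before it.
        session_deps.add((reply["dc"], reply["partition"], reply["seq"]))
    # Otherwise no dependency is recorded: every replica already stores the
    # version, so no dependency-check message will be needed later.
    return reply["value"]

# Example: reading a fully replicated version adds nothing to the session.
deps = set()
on_get_reply(deps, {"dc": 1, "partition": 2, "seq": 17,
                    "fully_replicated": True, "value": b"x"})
print(deps)   # set()
```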

In this experiment, we show the benefits of dependency cleaning for workloads that read from a large number of partitions and update only a few. We set up two data centers and vary the number of partitions from one to eight at each data center. A client reads a randomly selected item from each of the local partitions and updates one random item at its originating partition.

Figure 11 shows the maximum throughput of Orbe with and without dependency cleaning enabled. When the system has only one partition, all reads and updates go to that partition, and dependency cleaning does not help because no network message is required for dependency checking. In this case, the throughput of Orbe with dependency cleaning enabled is slightly lower, because it requires more messages for update replication. However, the throughput drop is small, because we let update replication messages piggyback replication acknowledgements and full-replication completion messages and batch these messages whenever possible.

As we increase the number of partitions, the throughput of Orbe with dependency cleaning enabled is higher, and the throughput gap increases as the system has more partitions. With dependency cleaning, an update does not depend on states from other partitions most of the time, although it reads states from those partitions. As a result, dependency checking on replicated updates does not incur network messages at the remote data center. By removing this part of the overhead, dependency cleaning improves the overall throughput.

Figure 11: Maximum throughput of operations with and without dependency cleaning enabled.

Figure 12: Aggregated replication transmission traffic with and without dependency cleaning enabled.

Dependency cleaning also reduces the size of the dependency metadata for an update. Figure 12 shows the aggregated replication traffic from all partitions. The traffic decreases as the number of partitions increases because the put:get ratio decreases. Figure 13 shows the average percentage of metadata in the update replication traffic.

Figure 13: Average percentage of dependency metadata in update replication traffic with and without dependency cleaning enabled.

8 Related Work

There have been many causally consistent systems in the literature, such as lazy replication [13], Bayou [22], and WinFS [19]. They use various techniques derived from the causal memory algorithm [2]. However, these systems target only full replication. None of them considers scalable causal consistency for partitioned and replicated data stores, to which Orbe provides a solution.

COPS [17] identifies the problem of causal consistency for both partitioned and replicated data stores and gives a solution. It tracks every accessed state as dependency metadata at the client side. To support causally consistent read-only transactions, a client has to track the complete dependencies explicitly. Although COPS garbage-collects the dependency metadata periodically, the metadata size may still be large under many workloads and can affect performance. In comparison, Orbe relies on dependency matrices to track the nearest dependencies and keeps the dependency metadata small and bounded. Orbe provides causally consistent read-only transactions using loosely synchronized clocks. Orbe requires one round of messages to execute a read-only transaction in the failure-free mode, while COPS requires at most two rounds.
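The one-round structure can be sketched as follows. This is a simplified illustration of reading at a physical-clock snapshot, capturing only the message pattern described above; it omits the dependency metadata and clock-skew handling of the actual DM-Clock protocol, and all names here are assumptions.

```python
# Simplified, illustrative sketch of a one-round read-only transaction that
# reads at a snapshot chosen from the local physical clock. Not the complete
# DM-Clock protocol; class and function names are assumptions.
import time

class PartitionStub:
    """In-memory stand-in for a partition server (for illustration only)."""
    def __init__(self):
        self.versions = {}          # key -> list of (update_ts, value)

    def put(self, key, value, update_ts):
        self.versions.setdefault(key, []).append((update_ts, value))

    def read_at(self, key, snapshot_ts):
        # Return the latest version whose update timestamp is <= snapshot_ts.
        # A real server would also ensure that no pending update with a
        # smaller timestamp can still arrive.
        visible = [(ts, v) for ts, v in self.versions.get(key, [])
                   if ts <= snapshot_ts]
        return max(visible)[1] if visible else None

def read_only_txn(keys_by_partition, partitions):
    # Pick a snapshot timestamp from the local physical clock, then ask each
    # involved partition once (in parallel in a real system): one round total.
    snapshot_ts = int(time.time() * 1e6)
    return {key: partitions[pid].read_at(key, snapshot_ts)
            for pid, keys in keys_by_partition.items()
            for key in keys}

# Example usage with two partition stubs.
parts = {0: PartitionStub(), 1: PartitionStub()}
parts[0].put("a", "a1", update_ts=1)
parts[1].put("b", "b1", update_ts=2)
print(read_only_txn({0: ["a"], 1: ["b"]}, parts))   # {'a': 'a1', 'b': 'b1'}
```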

Eiger [18] is recent follow-up work on COPS that provides causal consistency for a distributed column store. It proposes a new protocol for read-only transactions using Lamport clocks [14]. Although Eiger also needs at most two rounds of messages to execute a read-only transaction, a client tracks only the nearest dependencies. In addition, Eiger provides causally consistent update-only transactions. However, Eiger still tracks every accessed state at the client side.

ChainReaction [3] implements causal consistency on top of chain replication [23]. ChainReaction also tracks every accessed state at the client side. To solve the problem of large dependency metadata when providing read-only transactions, it uses a global sequencer service at each data center to totally order update operations and read-only transactions. However, the sequencer service adds one round-trip of network latency within the data center to every update operation and is a potential performance bottleneck. In comparison, Orbe does not have any centralized component. It provides causal consistency for update-anywhere replication and relies on loosely synchronized physical clocks to implement read-only transactions.

Bolt-on causal consistency [5] provides causal consistency on top of existing eventually consistent data stores. It inserts a shim layer between the data store and the application layer to ensure the safety properties of causal consistency. It relies on the application to maintain explicit causality relationships. Tracking causality based on application semantics is precise, but it requires the application developers to specify causal relationships among application operations. In contrast, Orbe provides causal consistency directly in the data store, without requiring coordination from the application.

Physical clocks have been used in many distributed systems. For example, Spanner [8] implements serializable transactions in a geographically replicated and partitioned data store. It provides external consistency [11] based on synchronized clocks with bounded uncertainty, called TrueTime, requiring access to GPS and atomic clocks. Clock-SI [9] uses loosely synchronized clocks to provide snapshot isolation [6] to transactions in a purely partitioned data store. In comparison, Orbe targets causal consistency, a weaker consistency model. It provides causally consistent read-only transactions using loosely synchronized clocks in a partitioned and replicated data store.

9 Conclusion

In this paper, we propose two scalable protocols that efficiently provide causal consistency for partitioned and replicated data stores. The DM protocol extends version vectors to two-dimensional dependency matrices and relies on the transitivity of causality to keep dependency metadata small and bounded. The DM-Clock protocol relies on loosely synchronized physical clocks to provide causally consistent read-only transactions. We implement the two protocols in a distributed key-value store. We show that they incur relatively small overhead for tracking causal dependencies and outperform a prior approach based on explicit dependency tracking.

Acknowledgments. We thank the SOCC program committee and especially our shepherd, Indranil Gupta, for their constructive feedback. Daniele Sciascia also provided useful comments on final revisions of this work.

References

[1] The Network Time Protocol. http://www.ntp.org, 2013.

[2] M. Ahamad, G. Neiger, J. E. Burns, P. Kohli, and P. W. Hutto. Causal memory: Definitions, implementation, and programming. Distributed Computing, 9(1):37–49, 1995.

[3] S. Almeida, J. Leitão, and L. Rodrigues. ChainReaction: a causal+ consistent datastore based on chain replication. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 85–98. ACM, 2013.

[4] P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd International Conference on Software Engineering, pages 562–570. IEEE Computer Society Press, 1976.

[5] P. Bailis, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Bolt-on causal consistency. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 761–772. ACM, 2013.

[6] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 1–10. ACM, 1995.

[7] E. A. Brewer. Towards robust distributed systems (abstract). In Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing, page 7. ACM, 2000.

[8] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, et al. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, pages 251–264. USENIX Association, 2012.

[9] J. Du, S. Elnikety, and W. Zwaenepoel. Clock-SI: Snapshot isolation for partitioned data stores using loosely synchronized clocks. In Proceedings of the 32nd International Symposium on Reliable Distributed Systems. IEEE, 2013.

[10] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.

[11] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990.

[12] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 654–663. ACM, 1997.

[13] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing high availability using lazy replication. ACM Transactions on Computer Systems, 10(4):360–391, 1992.

[14] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.

[15] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 100(9):690–691, 1979.

[16] L. Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, 1998.

[17] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, pages 401–416. ACM, 2011.

[18] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Stronger semantics for low-latency geo-replicated storage. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, pages 313–328. USENIX Association, 2013.

[19] D. Malkhi and D. Terry. Concise version vectors in WinFS. Distributed Computing, 20(3):339–353, 2005.

[20] B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, pages 8–17. ACM, 1988.

[21] D. S. Parker Jr., G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. Edwards, S. Kiser, and C. Kline. Detection of mutual inconsistency in distributed systems. IEEE Transactions on Software Engineering, (3):240–247, 1983.

[22] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible update propagation for weakly consistent replication. In ACM SIGOPS Operating Systems Review, volume 31, pages 288–301. ACM, 1997.

[23] R. van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pages 91–104, 2004.

[24] W. Vogels. Eventually consistent. Communications of the ACM, 52(1):40–44, 2009.