
A Step Towards Coordination-efficient Programming With Lattices

Yihan Lin∗
UC Berkeley
[email protected]

Varun Naik
UC Berkeley
[email protected]

Chenggang Wu
UC Berkeley
[email protected]

∗Student not in CS 262A.

ABSTRACT

Minimizing coordination among concurrently executing operations is crucial to scaling systems. Due to the cost of coordination, research has explored coordination-free systems that provide low latency and high scalability at the cost of weak consistency guarantees. BloomL, as an example, is a programming language that allows the development of coordination-free, eventually consistent (a form of weak consistency) systems by composing lattices that are resilient to message reordering and duplication. However, the focus of BloomL is more on its declarative logic programming model than on its performance.

This paper instead presents an in-depth study of the performance aspect of lattices, as well as their role in building systems that offer strong consistency. For each type of lattice, we investigate the most efficient thread-safe implementation in a single-node, multicore setting. We then show that for systems that provide strong consistency, their coordination-free components can be implemented via lattice composition, and when coordination is strictly required, lattices can serve as useful tools to minimize coordination overhead.

1. INTRODUCTION

Coordination is one of the most crucial factors that prevent systems from scaling [3]. In the distributed setting, coordination happens in the form of synchronous message passing between different nodes. In the single-node, multicore setting, coordination can occur via protected shared memory access by different threads. In both cases, nodes and threads have to wait for each other, which can incur a large amount of overhead. Therefore, minimizing coordination is vital to scaling systems.

In recent years, there have been efforts to build coordination-free eventually consistent systems that provide high scalability. For example, BloomL [7] lets programmers build such systems by lattice composition. Since every lattice supports commutative, associative, and idempotent operations, it can be easily verified that the entire system is resilient to message reordering and duplication. Therefore, the system is eventually consistent without coordination. One key contribution of BloomL is that its bottom-up approach to system building offers a systematic way to reason about when coordination is strictly necessary.

However, BloomL's lattice implementation is not sufficiently performant. One reason is that BloomL is implemented as a DSL in Ruby, which is not optimized for performance. More importantly, BloomL assumes single-threaded execution within each node, and its built-in lattices are not designed to take advantage of a multicore setting. Besides performance, another issue is that although a coordination-free system offers low latency, high availability, and partition tolerance, the reduced consistency guarantee mandated by the CAP theorem [6] makes the system difficult to reason about, and sometimes it is unclear whether an eventually consistent system meets applications' consistency requirements [4].

On the other hand, high-performance database systems and operating systems have explored coordination-efficient implementation techniques to achieve new levels of performance on multicore hardware. For example, Hekaton [8], a key component of Microsoft SQL Server, introduces a lock-free implementation of the Bw-tree [13] to minimize coordination overhead between threads when performing concurrent operations on a key-value store. Although many of these works demonstrate clever ideas, the approaches used to design and build these systems are rather ad hoc and difficult to generalize.

In this paper, we study how the bottom-up lattice composition approach from BloomL can provide generalizable lessons about how to build strongly consistent systems that take advantage of multicore hardware. Our contributions can be summarized as follows:

1. We investigate efficient lattice implementations that take advantage of multicore hardware.

2. We build a high-performance lattice library in C++ whose lattices can be composed to construct useful system components.

3. We use lattice composition to build the coordination-free components of a timestamp ordered multi-version concurrency control protocol that guarantees serializability.

4. When coordination is strictly necessary, we propose a lattice-based implementation strategy that minimizes the coordination overhead under certain workloads.

The remainder of the paper proceeds as follows. In Section 1.1, we give a brief introduction to lattices. Section 2 presents the current status of our lattice library and lessons learned from developing high-performance lattices designed to scale on multicore hardware. In Section 3, we first offer an introduction to MVCC, and then discuss our findings on implementing MVCC with lattice composition. We compare related work in Section 4, discuss future work in Section 5, and conclude in Section 6.

1.1 Lattices

A bounded join semilattice consists of a set S, a binary operator ⊔, and a "zero" value ⊥. The operator ⊔ is called the "least upper bound" and satisfies the following properties:

Commutativity: ⊔(a, b) = ⊔(b, a), ∀a, b ∈ S
Associativity: ⊔(⊔(a, b), c) = ⊔(a, ⊔(b, c)), ∀a, b, c ∈ S
Idempotence: ⊔(a, a) = a, ∀a ∈ S

⊔ introduces a partial order between elements of S. For any two elements a, b in S, if ⊔(a, b) = b, then we say that b's order is higher than a's. The "zero" value ⊥ is the smallest element in S. Therefore, it follows that ∀a ∈ S, ⊔(a, ⊥) = a. For brevity, in this paper we use "lattice" to refer to "bounded join semilattice" and "merge function" to refer to "least upper bound".
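To make the terminology concrete, the following is a minimal C++ sketch of how a bounded join semilattice could be expressed as a class, together with a MaxIntLattice instantiation. The names (Lattice, merge, reveal, lub) mirror the operations discussed in Section 2, but the code is illustrative rather than our library's exact interface.

    #include <algorithm>
    #include <cstdint>

    // Sketch of a bounded join semilattice: a stored element of type T,
    // a "zero" (bottom) element, and a least-upper-bound (lub) function.
    // The lub must be commutative, associative, and idempotent.
    template <typename T>
    class Lattice {
     public:
      explicit Lattice(const T& bottom) : element_(bottom) {}
      virtual ~Lattice() = default;

      // Merge an input into the current element: element_ = lub(element_, input).
      void merge(const T& input) { element_ = lub(element_, input); }

      // Return the current lattice element.
      const T& reveal() const { return element_; }

     protected:
      virtual T lub(const T& a, const T& b) const = 0;
      T element_;
    };

    // Example instantiation: MaxIntLattice with bottom = INT32_MIN and lub = max.
    class MaxIntLattice : public Lattice<int32_t> {
     public:
      MaxIntLattice() : Lattice<int32_t>(INT32_MIN) {}
     protected:
      int32_t lub(const int32_t& a, const int32_t& b) const override {
        return std::max(a, b);
      }
    };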

The key observation that motivates our research is that lattices can serve as building blocks for many systems. BloomL focuses on exploring the use of lattices in a distributed environment, in which each node has private memory and communication between nodes occurs through message passing. We instead explore the use of lattices in a multicore environment, in which all threads have a private stack and shared heap memory, and communication between threads occurs through protected shared memory access.

2. THE LATTICE LIBRARY

In this section, we first provide descriptions of the lattices currently supported by our library. We then present performance microbenchmarks of various thread-safe implementations of lattices and discuss the lessons learned.

2.1 Core Lattices

We introduce several base lattices as well as composite lattices supported by our library. The simplest lattice is BoolLattice, with data type boolean. The lattice element starts with value ⊥ = false, and the ⊔ operator is defined as ∨. Effectively, once a true is merged into the lattice, the lattice element remains true.

MaxIntLattice has data type integer. Theoretically, the lattice element should start with value ⊥ = −∞. However, as shown in Section 2.2, in reality ⊥ depends on the representation of the integer. The ⊔ operator takes the maximum of the input element and the current lattice element. Effectively, the lattice element monotonically increases as more inputs are merged into the lattice. For completeness, we implement an analogous MinIntLattice.

GrowOnlySetLattice has data type set. The lattice element is initially an empty set, and the ⊔ operator takes the input elements and inserts them into the set. Notice that since GrowOnlySetLattice only accepts element insertion, once an element is inserted, it will never be deleted, and the size of the set grows monotonically.

In addition to the three base lattices described above, our library also contains LatticeValuedArrayLattice, an array whose elements are lattices of the same type, and LatticeValuedMapLattice, a map in which keys are of any type and values are lattices of the same type. A summary of all lattices supported by our library is shown in Figure 1.
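To illustrate how a composite lattice derives its merge function from its value lattices, the sketch below shows one plausible merge rule for a lattice-valued map: merge key-wise, adopting new keys and delegating to the value lattice's own merge for keys that already exist. It assumes a copyable ValueLattice type exposing merge(); it is not our library's exact code.

    #include <unordered_map>

    // Sketch of a lattice-valued map lattice. Merging an input map is key-wise:
    // absent keys are inserted, present keys merge their value lattices.
    // ValueLattice is assumed to expose merge(const ValueLattice&).
    template <typename Key, typename ValueLattice>
    class LatticeValuedMapLattice {
     public:
      void merge(const std::unordered_map<Key, ValueLattice>& input) {
        for (const auto& kv : input) {
          auto it = map_.find(kv.first);
          if (it == map_.end()) {
            map_.insert(kv);              // new key: adopt the incoming lattice
          } else {
            it->second.merge(kv.second);  // existing key: merge value lattices
          }
        }
      }

      const std::unordered_map<Key, ValueLattice>& reveal() const { return map_; }

     private:
      std::unordered_map<Key, ValueLattice> map_;
    };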

2.2 Implementation

While BloomL is implemented in Ruby, we want to implement our library in a highly performant language. C would provide the desired performance, but for ease of programming, we want to represent our lattice types as subclasses of a generic lattice type in an object-oriented fashion. Therefore, we choose to implement the library in C++. In this section, we discuss our implementations of MaxIntLattice and GrowOnlySetLattice in detail. To determine whether our multi-threaded lattice implementations provide a performance improvement over the single-threaded implementation, we also implement single-threaded versions of MaxIntLattice and GrowOnlySetLattice.

MaxIntLattice uses a signed 32-bit representation of integers, so the initial value is −2^31. The single-threaded MaxIntLattice merge function sets the data item to the maximum of the data item and the new value to be merged. This requires a branch instruction and, if the branch succeeds, a store instruction. In addition to merge, the lattice also exposes a reveal function that returns the current value of the integer, and an assign function that sets the current value to the input integer. We have three multi-threaded implementations of MaxIntLattice. Two implementations perform the same merge operation as the single-threaded version, synchronized with a software-level lock. The first one uses a mutex from the C++ STL, while the second one uses a spinlock (atomic_flag from the C++ STL). In addition, we implement a lock-free version of the integer lattice that uses an atomic hardware-level instruction to perform the merge operation. The atomic version encapsulates the integer in a C++ atomic object and invokes a compare-and-swap instruction on this object in a loop to assign the new value. The loop terminates if the assignment succeeds, or if another thread sets the value of the data item to something greater than the value that the current thread is trying to merge. These three multi-threaded MaxIntLattice implementations expose the same methods as the single-threaded version.
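The lock-free merge loop described above could look roughly like the following sketch (illustrative, not our exact code): the compare-and-swap retries until either the input value is installed or the stored value already dominates it.

    #include <atomic>
    #include <cstdint>

    // Sketch of the lock-free MaxIntLattice: retry compare_exchange until our
    // value is installed, or stop early once the stored value is >= the input
    // (the merge is idempotent, so nothing more needs to be done).
    class AtomicMaxIntLattice {
     public:
      AtomicMaxIntLattice() : element_(INT32_MIN) {}

      void merge(int32_t input) {
        int32_t current = element_.load(std::memory_order_relaxed);
        // On compare_exchange failure, `current` is reloaded with the new value.
        while (current < input &&
               !element_.compare_exchange_weak(current, input,
                                               std::memory_order_relaxed)) {
        }
      }

      int32_t reveal() const { return element_.load(std::memory_order_relaxed); }

      void assign(int32_t value) {
        element_.store(value, std::memory_order_relaxed);
      }

     private:
      std::atomic<int32_t> element_;
    };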

GrowOnlySetLattice contains a set of signed 32-bit integers in the shared-memory area. The single-threaded GrowOnlySetLattice uses the C++ STL unordered_set type, which uses a hash table internally for efficient lookups and amortized constant-time inserts.


Name                        Data Type   Zero value (⊥)    Merge function ⊔(a, b)
BoolLattice                 Boolean     false             a ∨ b
MaxIntLattice               Integer     −∞                max(a, b)
MinIntLattice               Integer     +∞                min(a, b)
GrowOnlySetLattice          Set         ∅ (empty set)     a ∪ b (set union)
LatticeValuedArrayLattice   Array       empty array       lattice-specific
LatticeValuedMapLattice     Map         empty map         lattice-specific

Figure 1: Lattice Definitions

The underlying unordered_set automatically ignores any items that the user attempts to add if they already exist in the set. The lattice exposes a merge function that adds several elements at a time, as well as an optimized insert function that adds only one element. The interface also includes the set-specific operations find (search for an element) and size (get the total number of elements). The user can obtain a copy of the internal set by calling the reveal function.

C++ unordered_set is not thread-safe. However, concurrent_unordered_set from Intel's Threading Building Blocks (TBB) [14] is thread-safe, so we use it in a multi-threaded implementation of GrowOnlySetLattice. C++ does not support encapsulation of an unordered_set in an atomic object. Locking the entire set with a spinlock or mutex would effectively serialize all updates and eliminate the opportunity for performance improvement. As a result, we have only one multi-threaded implementation of GrowOnlySetLattice, as described above. As in the single-threaded case, the set ignores duplicates. The multi-threaded set lattice exposes the same methods as the single-threaded version. Since these methods access the shared data item, they all require coordination.
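A sketch of the multi-threaded set lattice is shown below, assuming TBB's concurrent_unordered_set and the method names used in the text (insert, merge, find, size, reveal); the exact signatures in our library may differ.

    #include <tbb/concurrent_unordered_set.h>
    #include <unordered_set>
    #include <cstdint>

    // Sketch of the multi-threaded GrowOnlySetLattice built on TBB's
    // concurrent_unordered_set; duplicates are ignored by the underlying set.
    class ConcurrentSetLattice {
     public:
      // Optimized path: insert a single element.
      void insert(int32_t element) { set_.insert(element); }

      // Merge several elements at once.
      void merge(const std::unordered_set<int32_t>& elements) {
        for (int32_t e : elements) set_.insert(e);
      }

      bool find(int32_t element) const { return set_.count(element) > 0; }
      size_t size() const { return set_.size(); }

      // Return a copy of the internal set.
      std::unordered_set<int32_t> reveal() const {
        return std::unordered_set<int32_t>(set_.begin(), set_.end());
      }

     private:
      tbb::concurrent_unordered_set<int32_t> set_;
    };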

TBB concurrent_unordered_set uses a split-ordered list [15] to store all elements. Each bucket serves as a shortcut into a linked list ordered by the last k bits of the hash value, where the number of buckets n is 2^k. There are no software-level locks. Instead, the implementation uses an atomic compare-and-swap instruction to update the next pointer of a node in the split-ordered list when inserting an element. If two threads concurrently try to modify the same next pointer, then the compare-and-swap instruction for one thread fails, so the thread simply moves to the next node in the list and retries. The class uses another compare-and-swap instruction to atomically double the number of buckets on resize. If any operation accesses an uninitialized bucket, the class uses a compare-and-swap instruction to initialize it safely. Lastly, the class uses an atomic fetch-and-increment instruction to adjust the list's element count on insert.

By default, C++ unordered_set starts with 0 buckets, and a resize operation occurs when the load factor exceeds 1.0. Its hash table computes hashes of integers using the identity function, which maps every integer n to n. As a result, it experiences no collisions. TBB concurrent_unordered_set starts with 8 buckets and maximum load factor 4.0 unless otherwise specified. Its default hash function is derived from Knuth's multiplicative hashing method [2], which could be computationally expensive. Knowing that the number of buckets for the multi-threaded set lattice must be a power of 2, we specify the initial number of buckets to be 16 for both the single-threaded and the multi-threaded set lattice. In addition, we set the maximum load factor to 1.0, to reduce the number of hash collisions before a resize. To achieve a fair comparison, we manually force the multi-threaded version to also use the identity function for its hash table.

Figure 2: MaxIntLattice Implementation Comparison

2.3 Microbenchmark & Discussions

We run all of our performance benchmarks on a quad-core machine running Mac OS X 10.11, with 256 KB of per-core L2 cache, 6 MB of L3 cache, and 8 GB of RAM. We compile our code using GNU g++ with -O2 optimizations enabled. We compile for the C++11 standard so that we can access certain functionality needed by our spinlock and atomic MaxIntLattice. Each data point in each benchmark reflects an average across 5 iterations of that particular experiment, to reduce the impact of fluctuations caused by other system processes.

2.3.1 MaxIntLattice

Our first experiment compares the efficiency of the three multi-threaded MaxIntLattice implementations described in Section 2.2. We generate 10^5 random integers and measure the time it takes to merge these integers into the lattice. For all implementations, we use four threads to perform the merge operations concurrently (each thread performs 25,000 merge operations). As shown in Figure 2, the lock-free implementation outperforms the spinlock implementation by one order of magnitude, and outperforms the mutex implementation by two orders of magnitude. Therefore, the lock-free implementation incurs far less coordination overhead than the other two alternatives. The general lesson learned is that we should use a lock-free implementation whenever possible.
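The benchmark setup can be sketched as follows, assuming a lattice with a thread-safe merge(int32_t) such as the lock-free AtomicMaxIntLattice sketch in Section 2.2; the helper name and the fixed seed are illustrative.

    #include <chrono>
    #include <random>
    #include <thread>
    #include <vector>

    // Sketch of the merge benchmark: num_threads threads each merge a disjoint
    // slice of total_ops random integers into one shared lattice, and we time
    // the whole run in seconds.
    double benchmark_merge(AtomicMaxIntLattice& lattice, int num_threads = 4,
                           int total_ops = 100000) {
      std::vector<int32_t> inputs(total_ops);
      std::mt19937 gen(42);  // fixed seed for repeatability
      for (auto& v : inputs) v = static_cast<int32_t>(gen());

      auto start = std::chrono::steady_clock::now();
      std::vector<std::thread> threads;
      int per_thread = total_ops / num_threads;
      for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
          for (int i = t * per_thread; i < (t + 1) * per_thread; ++i)
            lattice.merge(inputs[i]);
        });
      }
      for (auto& th : threads) th.join();
      auto end = std::chrono::steady_clock::now();
      return std::chrono::duration<double>(end - start).count();
    }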


Figure 3: MaxIntLattice Read Only

Our next pair of experiments investigates the scalability of the lock-free multi-threaded MaxIntLattice implementation under read-only and write-only workloads. In the read-only experiment, we measure the time it takes to process 10^7 reveal operations as we increase the number of threads, and compare it against the single-threaded implementation. In the write-only experiment, we measure the time it takes to process 10^5 assign operations as we increase the number of threads, and compare it against the single-threaded implementation. As shown in Figure 3 and Figure 4, we notice that in both experiments, the lock-free multi-threaded implementation performs orders of magnitude worse than the single-threaded implementation. This is because the coordination overhead incurred by multi-threading dominates the overhead of reveal and assign (reading and writing an integer). Therefore, the lesson is that if the complexity of the actual work to be done is far less than the complexity caused by coordination, then a single-threaded implementation will suffice. For the read-only workload, the performance of the multi-threaded implementation improves as we increase the number of threads because different threads can read from the lattice concurrently. Since we are using a quad-core machine, adding more threads beyond a certain limit (4 in our case) only introduces additional context switching overhead, which accounts for the performance degradation as the number of threads exceeds 11. For the write-only workload, the performance of the multi-threaded implementation becomes worse as we increase the number of threads because concurrent writes to the same integer object cause high contention and need to be serialized. As a result, our lock-free implementation does not scale with a high contention workload.

2.3.2 GrowOnlySetLattice

Figure 5 compares the times to insert integer values into a single-threaded GrowOnlySetLattice and into the multi-threaded version for various numbers of threads. Our first microbenchmark measures the time to insert the integers in the range [0, 10^6) into a set lattice. The single-threaded case simply inserts these numbers into the set lattice in order. In the multi-threaded case, each thread inserts the numbers in a specific subrange of the given range in order, where each subrange has the same size up to rounding error. Since the benchmark only calls insert and not find, this is a write-only workload.

Figure 4: MaxIntLattice Write Only

When we run the sequential benchmark, the multi-threaded version of GrowOnlySetLattice running on only one thread experiences a higher latency than the single-threaded version. This occurs because concurrent_unordered_set has a small overhead compared to unordered_set. We view this as part of the cost of coordination among multiple threads. As the number of threads increases, the latency decreases, achieving a minimum value at 10 threads. For small numbers of threads, the latency decreases as the concurrency increases, with maximal concurrency occurring when at least 1 thread runs on each core. The latency continues to decrease even after the number of threads increases past 4, because the scheduler does not necessarily run each thread on a different core. As with MaxIntLattice, the overhead due to context switching between different threads eventually counteracts this decrease in latency.

We also test a workload that inserts random integers, to compare the performance of the two versions when some hash collisions occur. The benchmark always uses the same random seed to generate the same array of 10^6 integers. In the single-threaded case, these elements are inserted into the set lattice in order. In the multi-threaded case, each thread inserts the elements in a subarray in order, where the subarrays are determined in a similar manner as the subranges in the sequential benchmark. We use the same parameters to initialize the lattice as for the sequential benchmark.

Figure 5: SetLattice Benchmark

The minimum latency for the multi-threaded set lattice occurs at 8 threads. A crucial difference between the two benchmarks is that the ratio of the single-threaded lattice latency to the optimal multi-threaded lattice latency is much higher for the random benchmark (4.117) than for the sequential benchmark (1.554). For the sequential benchmark, in both the single-threaded case and the multi-threaded case, there are no collisions. Instead, the largest factor that influences latency is the time to perform writes. As the number of threads increases, context switching overhead becomes the dominating factor and limits speedup. For the random benchmark, the largest factor that influences latency is the number of hash collisions. The collisions force the hash tables in both the single-threaded and the multi-threaded set lattices to search for elements in a linked list, which is an expensive operation that dominates the cost of context switching. However, since different cores can perform these searches concurrently, we observe speedup as the number of threads increases.

The key difference between MaxIntLattice and GrowOnlySetLattice is that multiple threads can modify the data item concurrently for the set lattice, but not for the integer lattice. Even though the sequential integer workload and the random integer workload are write-only, they both experience low contention due to the properties of the underlying concurrent unordered set. As a result, the multi-threaded set lattice outperforms the single-threaded version when any non-trivial number of threads perform inserts. In the next section, we explore a lattice-based implementation that scales to larger numbers of threads even in workloads with high contention.

3. LATTICE MVCC

Having implemented our high-performance lattice library, the next step is to compose lattices from the library to build useful system components. BloomL demonstrates that lattices can be composed to build eventually consistent systems such as a versioned key-value store. We would like to take a step forward and study how lattices can contribute to building systems that offer strong consistency. In this project, we choose to implement an in-memory key-value store with timestamp ordered multi-version concurrency control (MVCC). MVCC is a transaction facility that guarantees serializability, the strongest consistency level for databases. We begin by reviewing the basic mechanism of MVCC as well as how serializability is guaranteed. We then discuss how the coordination-free components of MVCC can be implemented via lattice composition and how lattices can help identify where coordination is strictly necessary. Finally, we investigate several implementation challenges that could result in poor scalability and show how lattices can help build MVCC in a scalable, coordination-efficient way.

3.1 MVCC

In timestamp ordered concurrency control (T/O) [5], each transaction is assigned a unique timestamp by the transaction manager, and the timestamp is attached to all read and write requests issued on behalf of this transaction. The system enforces serializability by requiring the transaction execution to obey the timestamp ordering. In traditional T/O, data items are updated in place. MVCC improves upon traditional T/O by allowing different transactions to create different versions of the same data item. There are two main advantages to multi-versioning. First of all, since there is no overwriting of data items, writers never conflict with each other. Secondly, since the system retains old versions of data items, a read request with an old timestamp can read data from the past. As a result, read requests never abort any transaction. Although these benefits come at a cost in storage, as storage gets cheaper, more DBMSs [8, 11] are starting to employ MVCC.

At a high level, MVCC operates as follows. A read request R that reads data item D is processed by reading the version of D with the largest timestamp less than the timestamp of R, and adding the timestamp of R to D's set of read timestamps. A write request W that writes to data item D is processed by checking the least timestamp S from D's set of read timestamps and D's set of write timestamps that is greater than the timestamp of W. If S corresponds to a read timestamp, then the system aborts W. Otherwise, the system accepts W and creates a new version of D with its timestamp. The proof that this mechanism ensures serializability can be found in [5].
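These two rules can be written down concretely. The sketch below is a single-threaded illustration of the timestamp checks only; it ignores the 2PC buffering discussed next and is not our lattice-based implementation.

    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <set>
    #include <string>

    // Per-data-item bookkeeping for the timestamp-ordering rules above.
    struct DataItem {
      std::map<uint64_t, std::string> versions;  // write timestamp -> value
      std::set<uint64_t> read_timestamps;
      std::set<uint64_t> write_timestamps;
    };

    // Read at timestamp ts: record ts as a read timestamp and return the version
    // with the largest write timestamp smaller than ts (nullptr if none exists).
    const std::string* process_read(DataItem& d, uint64_t ts) {
      d.read_timestamps.insert(ts);
      auto it = d.versions.lower_bound(ts);  // first version with wts >= ts
      if (it == d.versions.begin()) return nullptr;
      return &std::prev(it)->second;
    }

    // Write at timestamp ts: abort if the smallest timestamp greater than ts,
    // among read and write timestamps, belongs to a read; otherwise install a
    // new version at ts.
    bool process_write(DataItem& d, uint64_t ts, const std::string& value) {
      auto r = d.read_timestamps.upper_bound(ts);
      auto w = d.write_timestamps.upper_bound(ts);
      bool next_is_read =
          r != d.read_timestamps.end() &&
          (w == d.write_timestamps.end() || *r < *w);
      if (next_is_read) return false;  // abort the write
      d.write_timestamps.insert(ts);
      d.versions[ts] = value;
      return true;
    }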

3.2 Lattice Composition

We now discuss how MVCC can be partially built with lattice composition. Since a transaction may modify multiple data items, in order to guarantee atomicity and prevent cascading aborts, MVCC has to employ a two-phase commit protocol (2PC), which adds additional complexity to our implementation [5]. Before a transaction commits, its writes are processed in a private workspace and are therefore not reflected in the database. When a transaction commits, the transaction manager (TM) first issues a pre-write request for each data item updated by the transaction. The pre-write is handled following the same mechanism described in Section 3.1. If a pre-write is not aborted, it is then buffered. When all pre-writes are successfully buffered, the TM issues a commit-write request for each data item updated. The commit-write is accepted immediately, and the new version becomes visible to readers. If a pre-write is aborted, then abort-write requests are issued for all data items updated. For a read request, if the version it should read is buffered due to a pre-write, then it has to be buffered and periodically re-checked to see if the pre-write gets committed or aborted.

Therefore, for each write request, its lifecycle can be summarized as follows. When the pre-write comes in, it can either be buffered or aborted. If the pre-write is successfully buffered, it can either commit or abort depending on whether the pre-writes for other data items go through. Eventually, the system determines that the version is no longer useful, and therefore it is garbage-collected. Based on this observation, for any write request, its status change can be modeled by the bounded lattice shown in Figure 6. In this case, the partial order is defined by the "happens after" relationship between two states. For example, "Abort" has a higher order than "Buffer" because according to the MVCC protocol, the status of the write request can be upgraded from "Buffer" to "Abort", but not vice versa. We can also easily verify that the least upper bound defined by this lattice is commutative, associative, and idempotent. Note that if MVCC is implemented correctly, under no circumstances will the status of a write request be updated from "Commit" to "Abort", or vice versa. We define the least upper bound of "Commit" and "Abort" to be "Garbage" for completeness.

Figure 6: Write Status Lattice

Figure 7: Write Write Interference
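A sketch of how the write status lattice of Figure 6 might be coded is shown below; the enum values, the numeric ordering, and the explicit Commit/Abort case are illustrative rather than our exact implementation.

    #include <algorithm>

    // Write statuses ordered by "happens after":
    // Bottom < Buffer < {Commit, Abort} < Garbage.
    // Commit and Abort are incomparable; their join is defined to be Garbage.
    enum class WriteStatus { Bottom, Buffer, Commit, Abort, Garbage };

    WriteStatus lub(WriteStatus a, WriteStatus b) {
      if (a == b) return a;  // idempotence
      // Join of the two incomparable elements is Garbage.
      if ((a == WriteStatus::Commit && b == WriteStatus::Abort) ||
          (a == WriteStatus::Abort && b == WriteStatus::Commit))
        return WriteStatus::Garbage;
      // Otherwise the two statuses are comparable and the join is the later one.
      return std::max(a, b);
    }

    class WriteStatusLattice {
     public:
      void merge(WriteStatus s) { status_ = lub(status_, s); }
      WriteStatus reveal() const { return status_; }
     private:
      WriteStatus status_ = WriteStatus::Bottom;
    };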

Similarly, for each read request, its lifecycle can be summarized as follows. When the read request comes in, if the closest version from the past is committed, then the read will output the version immediately. Otherwise (when the version is buffered due to a pre-write) the read will have to be buffered. Later, when the version is committed, the read request can safely output the version. When the transaction commits or aborts, the read request is committed or aborted, respectively. When the system determines that the read timestamp is no longer useful, it is garbage-collected. We can model the status change of a read request by the bounded lattice shown in Figure 8, with the same partial order definition as the write status lattice.

After showing that the status associated with read timestamps and write timestamps can be implemented with lattices, the next step is to investigate whether the read timestamp set and the write timestamp set for each key can be implemented via lattice composition. To implement the read timestamp set, we use a lattice-valued map lattice, where the key is the timestamp and the value is the corresponding read status lattice. To prove that this map satisfies the lattice requirements, it is sufficient to show that updates to the map are commutative, associative, and idempotent. According to the mechanism described in Section 3.1, a read request only needs to find the version associated with the most recent committed write timestamp smaller than its timestamp. Therefore, a pair of read requests will never interfere with each other, and hence updates to the map are re-orderable. For each read timestamp, since the merge function of its read status lattice is idempotent, it follows that the entire read timestamp map is resilient to message duplication. Hence, the read timestamp map is a lattice.

Figure 8: Read Status Lattice

Figure 9: Read Write Interference

Proving that the write timestamp map is a lattice requires additional effort. Recall that each pre-write request has to search for the smallest committed write timestamp or outputted/committed read timestamp bigger than itself. Therefore, it seems that write requests can interfere with each other. In order for a pair of writes to interfere with each other, one of them has to be a pre-write, and the other one has to be a commit-write. If the pre-write has a bigger timestamp than the commit-write, then the commit-write has no effect on the pre-write, regardless of the ordering between the two requests. Therefore, fortunately, the interference happens only when the pre-write has a smaller timestamp than the commit-write (and no committed read timestamp exists in between them). We prove that the write timestamp map still satisfies the lattice requirements in this situation.

Consider the timeline in Figure 7. Initially, the pre-write at timestamp b (pre-write(b)) is buffered, and read(c) is buffered waiting for the pre-write to commit. Consider a pair of requests, pre-write(a) and commit-write(b). If commit-write(b) arrives before pre-write(a), then the pre-write will be buffered because the smallest committed timestamp bigger than itself is commit-write(b). If pre-write(a) arrives before commit-write(b), although the smallest committed timestamp bigger than itself is no longer commit-write(b), we know that it will never correspond to a committed read, because otherwise the read would have outputted a version that has not committed yet, which violates the MVCC protocol. Consequently, pre-write(a) will still be buffered, and therefore updates to the map are re-orderable. The rest of the proof is similar to the proof for the read timestamp map. In conclusion, the write timestamp map is also a lattice. One may argue that due to message reordering, it is possible that output(c) arrives before pre-write(a) and commit-write(b). As we will see in Section 3.4, since a pair of read and write requests is not re-orderable, they have to be synchronized. Therefore, efforts are made to make sure that commit-write(b) and output(c) appear on the timeline atomically.

Figure 10: Shared Memory Low Contention

After showing that the read timestamp map and write timestamp map for each key are lattices, the next step is to investigate whether the two maps together form a lattice. It is easy to verify that read and write requests are not re-orderable. Consider the timeline shown in Figure 9. Initially, write(a) is committed. Consider a pair of requests, pre-write(b) and read(c). If pre-write(b) arrives before read(c), then both pre-write(b) and read(c) will be buffered. However, if read(c) arrives before pre-write(b), then read(c) is outputted (the version at timestamp a is read) and pre-write(b) is aborted. Therefore, reordering the two requests yields different states, and the two maps together do not form a lattice.

As a result, we choose to implement the lattice MVCC key-value store with Intel TBB concurrent_unordered_map, whose keys are the keys in the key-value store and whose values correspond to a read timestamp map, a write timestamp map, and the versions associated with each write timestamp.

To summarize, we find that MVCC can only be partially implemented via lattice composition. This meets our expectation, as lattices are designed to build eventually consistent systems whereas MVCC guarantees strong consistency. In the following sections, we explore how to build the lattice MVCC key-value store in a coordination-efficient way.

3.3 Shared Memory Challenges

As a first attempt, we build a lattice-based MVCC key-value store with shared memory, where different threads can modify the database simultaneously. Due to the no-overwriting nature of MVCC, we expect our implementation to scale nicely. However, several challenges prevent the database from scaling. First of all, recall that in MVCC, each read request needs to find the version associated with the most recent committed write timestamp smaller than its read timestamp. Each pre-write request also needs to search for the smallest committed write timestamp or outputted/committed read timestamp bigger than itself. We implement the read timestamp map and the write timestamp map with an (ordered) C++ map. If we used unordered_map, the search latency would grow linearly with respect to the number of timestamps in the map, which is clearly not scalable. Based on our observation in the MaxIntLattice experiment in Section 2.3, we should use a lock-free ordered map implementation to minimize the coordination overhead. However, to our knowledge, there is no lock-free ordered map implementation available, and therefore we have to use a spinlock for each key to prevent race conditions. Since a spinlock does not allow concurrent modifications to the timestamp map, it creates high contention when many transactions modify the same key. To make things worse, we notice that every read request is essentially a write to the read timestamp map. So every read request and write request to the same key has to be serialized.

Figure 11: Shared Memory High Contention
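The per-key locking just described might look like the following sketch; the field and helper names are ours and the structure is simplified relative to our actual implementation.

    #include <atomic>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <string>

    // Per-key state guarded by a spinlock: since we know of no lock-free ordered
    // map, every reader and writer of a key is serialized on this lock.
    struct KeyState {
      KeyState() { lock.clear(); }  // start in the unlocked state

      std::atomic_flag lock;                      // per-key spinlock
      std::map<uint64_t, std::string> versions;   // write timestamp -> value
      std::set<uint64_t> read_timestamps;         // every read also inserts here
    };

    // Run `body` while holding the key's spinlock.
    template <typename F>
    void with_key_lock(KeyState& k, F body) {
      while (k.lock.test_and_set(std::memory_order_acquire)) {}  // spin
      body();
      k.lock.clear(std::memory_order_release);
    }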

We now discuss the performance benchmark of our shared memory lattice MVCC key-value store. The experiment is carried out using the same settings as the base lattice benchmarks in Section 2.3. We benchmark our implementation against a low-contention workload and a high-contention workload. Under the low-contention workload, each request accesses a key chosen randomly among 1000 keys. Under the high-contention workload, all requests access the same key. For each type of workload, we also vary the read-write ratio and investigate how it affects performance. The read-heavy workload consists of 10^5 requests, 90% of which are reads and 10% of which are writes. The write-heavy workload also consists of 10^5 requests, 90% of which are writes and 10% of which are reads. We measure the inverse throughput (the time to process all requests) as we increase the number of threads. We also measure the performance of a single-threaded implementation as a baseline comparison.

Under the low-contention workload, Figure 10 shows that the performance of our shared memory implementation improves near-linearly as we increase the number of threads. The improvement slows down when the number of threads exceeds 4, and the performance starts to degrade when the number of threads exceeds 8. We observe this trend in both the write-heavy workload and the read-heavy workload. Under a low-contention workload, it is highly likely that at any time, different threads are accessing different keys. Since accesses to different keys can be processed concurrently with TBB concurrent_unordered_map, our implementation scales well under low contention, regardless of the read-write ratio. Since we are using a quad-core machine, adding more threads beyond a certain limit (8 in our case) only introduces additional context switching overhead, which accounts for the performance degradation as the number of threads exceeds 8. It is also worth noting that with an appropriate number of threads (4-8), our implementation manages to outperform the single-threaded implementation by a factor of 3.5.

Figure 12: Distributed Implementation Architecture

However, under the high-contention workload, Figure 11 shows that the shared memory implementation performs worse than the single-threaded implementation in both the read-heavy workload and the write-heavy workload. As we increase the number of threads, the shared memory implementation performs even worse. The reason is that under a high contention workload, all requests are accessing the same key, and the spinlock serializes accesses to the timestamp map of that key. Therefore, increasing the number of threads only adds contention and context switching overhead. As discussed earlier, since every read request is a write to the read timestamp map, we observe the same performance degradation pattern under both workloads.

3.4 Emulating a Distributed Setting

As shown in Section 3.3, the shared memory key-value store implementation performs extremely poorly under high contention. To solve this problem, we observe from both the base lattice benchmarks and the lattice MVCC benchmark that the single-threaded implementation offers high performance as no coordination is needed. Therefore, it would be nice if every thread could run on its own for coordination-free operations, and coordinate with the others only when necessary. Our lattice composition study from Section 3.2 shows that write requests do not have to coordinate with each other, as the write timestamp map can be implemented via lattice composition. The same argument holds for read requests. To this end, we present an implementation that emulates a distributed setting (we call it "distributed" from now on), in which each thread manages its own single-threaded database replica, and coordination occurs through shared memory access to a lattice. We show that the distributed implementation manages to scale well under high-contention workloads.

3.4.1 Implementation

Figure 12 shows the architecture of our distributed key-value store implementation, consisting of 4 database replicas. Each database replica is managed by a single thread. Since write requests do not conflict with each other, we can let each thread handle write requests locally without coordinating with the others. Whenever a read request comes in, in order to guarantee serializability, two things have to happen. First of all, the read request finds the correct version to read among all the replicas. Secondly, if the read is successfully outputted, all replicas have to be made aware of the read timestamp, as well as the write timestamp associated with the version being read, so that future pre-write requests can properly abort. Specifically, as shown in Figure 12, if thread A receives a read request, it first broadcasts the read timestamp to the other threads. Each thread, after being notified, merges its most recent write timestamp smaller than the read timestamp into a shared MaxIntLattice. When all threads finish merging, thread A decides whether the read request has to be buffered or can be outputted. If it is buffered, no further action is necessary, and the buffered read is periodically re-checked. Otherwise, the read timestamp and the write timestamp associated with the version being read are propagated atomically to each database replica. To ensure correctness, during coordination, all threads stop accepting new requests to make sure that the shared MaxIntLattice gets the most up-to-date write timestamp.
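The coordination step in which every replica merges into the shared MaxIntLattice can be sketched as follows, reusing the lock-free integer lattice sketch from Section 2.2; the counter-based wait is an illustrative stand-in for our actual notification mechanism.

    #include <atomic>
    #include <cstdint>

    // Sketch of read coordination: each replica thread merges its largest local
    // write timestamp below the read timestamp into a shared MaxIntLattice, and
    // the coordinating thread waits until all replicas have contributed.
    struct ReadCoordinator {
      AtomicMaxIntLattice candidate;  // join of the per-replica candidates
      std::atomic<int> replies{0};    // how many replicas have merged so far

      // Called by each replica thread with its local candidate timestamp.
      void contribute(int32_t local_candidate) {
        candidate.merge(local_candidate);
        replies.fetch_add(1, std::memory_order_release);
      }

      // Called by the coordinating thread; spins until all replicas reply, then
      // returns the largest write timestamp below the read timestamp seen by any
      // replica, which determines whether the read is buffered or outputted.
      int32_t await(int num_replicas) {
        while (replies.load(std::memory_order_acquire) < num_replicas) {}
        return candidate.reveal();
      }
    };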

Since read requests also do not conflict with each other, an alternative implementation is to let each thread process read requests without coordination. In this case, whenever a pre-write request comes in, threads have to coordinate and figure out whether the pre-write can be buffered. If so, then the pre-write has to be synchronously propagated to the other replicas so that future read requests can be properly buffered. The implementation is similar to the write-optimized version described above, with a shared MinIntLattice acting as the coordinator.

3.4.2 Benchmark

We benchmark the distributed key-value store implementation under high contention using the same workloads and experiment settings as for the shared memory implementation. For the write-optimized implementation, we notice from Figure 13 that under the write-heavy workload, as we increase the number of threads, the inverse throughput improves near-linearly until the number of threads reaches 4. The performance improvement is expected, as no coordination is needed between write requests. Therefore, different threads on different cores can concurrently process write requests. Since we are using a quad-core machine, adding more threads beyond a certain limit (4 in our case) only introduces additional context switching overhead. Under the read-heavy workload, as we increase the number of threads, the inverse throughput degrades, and the performance is worse than the baseline (single-threaded implementation). This is also expected, as every read request requires coordination among the threads, and every thread has to read from its local database replica. Fortunately, since the overhead of coordination is small compared to the overhead of processing read and write requests, the performance does not become noticeably worse as we increase the number of threads.

Figure 13: Distributed Write-optimized

We observe the opposite performance trend for the read-optimized implementation, as shown in Figure 14. Under the read-heavy workload, the inverse throughput improves as we increase the number of threads until all cores are fully utilized. Under the write-heavy workload, the inverse throughput becomes slightly worse as we increase the number of threads.

To summarize, under high contention, our write-optimized distributed implementation scales well with the write-heavy workload, and the read-optimized distributed implementation scales well with the read-heavy workload. If the implementation is not optimized for the workload, then the performance experiences tolerable degradation as the number of threads increases. Notably, both implementations outperform the shared memory approach.

Admittedly, our current distributed implementation can only favor one type of workload (either read-heavy or write-heavy). However, it is possible to build an adapter that dynamically switches between the two implementations based on the observed workload trace. Given our focus on lattices and coordination-efficient programming, implementing self-adaptive MVCC is beyond the scope of this paper.

Figure 14: Distributed Read-optimized

3.5 Discussion

In this section, we discuss the role of lattices in achieving a coordination-efficient implementation of MVCC. First of all, lattices help us reason about when coordination is strictly necessary. In our case, since the read timestamp map and the write timestamp map can be implemented via lattice composition, we know that write requests (and read requests) do not need to coordinate with each other. This leads us to the distributed implementation, where only one type of request (read or write) needs to coordinate. Secondly, lattices can act as coordinators to minimize coordination overhead. In our write-optimized implementation, for example, since we only care about the most recent timestamp, the order in which merge requests from different threads arrive does not matter. As a result, we can simply use a shared MaxIntLattice as a coordinator without needing to rely on consensus protocols such as Paxos [12]. In summary, lattices not only offer a good way to identify coordination-free system components, but also serve as useful tools to reduce coordination overhead.

4. RELATED WORK

As mentioned earlier, BloomL explores coordination-free programming for distributed systems using lattices. However, it focuses more on language development than on performance. It shows how the Bloom interpreter can support the evaluation of lattice-based code, and demonstrates its programmability by showing that the number of lines of code required to program an eventually consistent key-value store in BloomL is much smaller than in Java. We instead focus on the performance optimization of lattices. Instead of Ruby, we use C++ (known to be one of the fastest languages) to build a high-performance lattice library. Instead of making a single-threaded assumption within each node, we focus on investigating efficient lattice implementations that scale with the number of cores.

In recent years, there have been efforts [7, 16] to build eventually consistent systems that offer low latency, high availability, and partition tolerance. Although these systems can be attractive due to their high performance and scalability, it is sometimes unclear whether an application is suitable to run on such systems. One major issue is that eventually consistent systems require input from developers to guarantee application-level consistency. For example, CRDTs [16] rely on developers to provide a merge function in order to reconcile divergent replica states, and coming up with reconciliation logic that makes sense for the application can be non-trivial. Bailis et al. show in [3] that if developers specify a set of invariants that must hold true for an application, and if all invariants pass invariant-confluence tests, then the database is eventually consistent without coordination. This work has a similar issue in that generating a complete set of invariants for an application is not easy. If developers miss certain invariants, the execution may violate the application's consistency requirements. Another issue with eventually consistent systems is that the result returned by a read request is difficult to interpret, as it may not reflect the most up-to-date information. The system must perform coordination to reconcile conflicts between replicas. On the other hand, although systems that offer strong consistency require coordination and are susceptible to network partitions and node failures, they are much easier to reason about and require no additional effort from developers to ensure application-level consistency [10]. Therefore, for this project we decided to target these systems and study how they can be implemented with minimal coordination using lattices.

Recent high-performance systems take advantage of multicore settings [1, 8]. As an example, Hekaton [8] introduces a lock-free BW-tree implementation that uses atomic read-modify-write CPU instructions to handle concurrent modifications to a key-value store. However, these works focus more on presenting their clever ideas and demonstrating the systems' effectiveness rather than showing how their coordination-efficient implementation techniques can be generalized to build other systems. We instead use a bottom-up system design approach to investigate how lattices can be composed to build coordination-efficient systems. Through our lattice MVCC example, we also provide lessons learned on how developers could reason about when coordination is strictly necessary with lattices.

Faleiro et al. present an alternative MVCC key-value store implementation via sharding [9]. In their case, the key-value store is partitioned into several shards, each of which is managed by a thread. An advantage of this approach is that different threads do not need to coordinate for read-write synchronization. Another benefit is that having each thread responsible for only a certain range of keys results in a smaller cache footprint and reduced cache coherence overhead. Their solution, however, has a few downsides. First of all, their implementation has to dispatch each request to the proper thread responsible for handling the request (the key range managed by the thread has to contain the key that the request reads or modifies), which incurs overhead. Secondly, during the commit phase, coordination is required to ensure that reads and modifications to different keys are successfully handled by different threads. Finally, the sharding approach does not scale under high contention, in which transactions access a small number of "hot" keys. In this scenario, it is possible that a small number of threads are busy processing requests serially, while the majority are idling. Our distributed lattice MVCC implementation requires no additional coordination during the commit phase, as reads and modifications to different keys are handled by a single thread. It also manages to provide certain degrees of scalability even when the workload has high contention. However, since read-write coordination has to be carried out regardless of contention, unlike the sharding approach, our implementation does not benefit from low contention.

5. FUTURE WORK

In this section, we discuss several future research directions for this project. Although the current lattice library contains useful lattices that are sufficient to build certain system components, it is far from complete. For example, our set lattice does not support deletion, which could be inconvenient for applications that allow information revocation. To solve this problem, we plan to expand our lattice library to include a tombstone set lattice, which supports both insertions and deletions. The tombstone set lattice can be implemented with two GrowOnlySetLattices: all elements inserted are kept in one set, and all elements deleted are kept in the other set. The reveal operation returns the set difference of the two sets.
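Since this lattice is future work, the following is purely an illustrative sketch of the two-set construction just described.

    #include <cstdint>
    #include <unordered_set>

    // Tombstone set lattice sketch: one grow-only set of inserted elements and
    // one grow-only set of deleted elements (tombstones); reveal() returns
    // their set difference.
    class TombstoneSetLattice {
     public:
      void insert(int32_t e) { inserted_.insert(e); }
      void remove(int32_t e) { deleted_.insert(e); }  // delete = add a tombstone

      std::unordered_set<int32_t> reveal() const {
        std::unordered_set<int32_t> result;
        for (int32_t e : inserted_)
          if (deleted_.count(e) == 0) result.insert(e);
        return result;
      }

     private:
      std::unordered_set<int32_t> inserted_;  // grow-only
      std::unordered_set<int32_t> deleted_;   // grow-only tombstones
    };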

Furthermore, we notice that implementing the coordination phase of distributed lattice MVCC is painstaking and error-prone. Requiring developers to manually invent the coordination code for every system affects productivity. To make things worse, since distributed programs are difficult to debug in general, verifying that the coordination process is implemented correctly requires effort. Fortunately, we notice that during coordination, only the shared lattice coordinator needs to be customized. The logic to notify other threads and wait for all of them to respond is the same across systems. Therefore, we plan to build our coordination code with customizable coordination lattices into the library so that developers do not have to worry about reproducing the coordination logic.

Finally, it is worth noting that fetching and modifying data under MVCC is just one step within the query-processing pipeline. In the future, we plan to build a high-performance query engine with lattices that are connected by monotone functions. A monotone function maps one lattice to another while preserving the partial ordering. For example, the function that computes the size of a GrowOnlySetLattice is a monotone function that maps the set lattice to a MaxIntLattice. The partial order is preserved because as more elements are inserted into the set, the result of the size function grows larger. For queries that can be expressed with monotone lattice mappings, we know that the partial result output by these queries can only grow over time, which opens up space for optimization. The challenge is that not all queries can be expressed in this way, and we plan to also research how these queries could potentially benefit from the lattice query engine.
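For instance, a minimal sketch of this size mapping, assuming the lock-free MaxIntLattice sketch from Section 2.2 and a plain set standing in for the set lattice's revealed value:

    #include <cstdint>
    #include <unordered_set>

    // Monotone mapping sketch: the size of a grow-only set can only increase as
    // elements are merged in, so merging it into a MaxIntLattice preserves the
    // partial order.
    void merge_set_size(const std::unordered_set<int32_t>& grow_only_set,
                        AtomicMaxIntLattice& size_lattice) {
      size_lattice.merge(static_cast<int32_t>(grow_only_set.size()));
    }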

6. CONCLUSIONS

Using the idea of lattices from recent distributed systems research, we developed a high-performance lattice library that takes advantage of multicore hardware. The library includes multiple implementations of different lattices, and we investigated workloads in which these lattices perform well. Our results showed that lattices do not provide an efficient abstraction for a shared integer value. However, a grow-only set lattice with a set shared among multiple threads can vastly outperform the single-threaded equivalent. Composing lattices, we developed a multi-threaded key-value store that uses timestamp ordered multi-version concurrency control to ensure serializability. We learned that a shared memory implementation performs well in a low-contention setting, while an implementation that emulates a distributed setting with minimal coordination scales well in a high-contention setting. We are confident that lattices provide a natural and efficient way to develop strongly consistent systems that run on multiple cores while keeping coordination at a minimum.

Acknowledgements

We would like to thank Professor John D. Kubiatowicz for suggesting benchmarks for the lattice library. We also want to thank Professor Joseph M. Hellerstein for providing the BloomL context and proposing multi-version concurrency control as a reasonable protocol to implement.

7. REFERENCES

[1] M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. Proceedings of the VLDB Endowment, 5(10):1064–1075, 2012.

[2] O. Amble and D. E. Knuth. Ordered hash tables. The Computer Journal, 17(2):135–142, 1974.

[3] P. Bailis, A. Fekete, M. J. Franklin, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Coordination avoidance in database systems. PVLDB, 8(3):185–196, 2014.

[4] P. Bailis and A. Ghodsi. Eventual consistency today: Limitations, extensions, and beyond. Commun. ACM, 56(5):55–63, May 2013.

[5] P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys (CSUR), 13(2):185–221, 1981.

[6] E. A. Brewer. Towards robust distributed systems. In PODC, volume 7, 2000.

[7] N. Conway, W. R. Marczak, P. Alvaro, J. M. Hellerstein, and D. Maier. Logic and lattices for distributed programming. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 1:1–1:14, New York, NY, USA, 2012. ACM.

[8] C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1243–1254. ACM, 2013.

[9] J. M. Faleiro and D. J. Abadi. Rethinking serializable multiversion concurrency control. Proceedings of the VLDB Endowment, 8(11):1190–1201, 2015.

[10] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Elsevier, 1992.

[11] A. Kemper and T. Neumann. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 195–206. IEEE, 2011.

[12] L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, 2001.

[13] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The Bw-tree: A B-tree for new hardware platforms. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 302–313. IEEE, 2013.

[14] C. Pheatt. Intel Threading Building Blocks. Journal of Computing Sciences in Colleges, 23(4):298–298, 2008.

[15] O. Shalev and N. Shavit. Split-ordered lists: Lock-free extensible hash tables. Journal of the ACM (JACM), 53(3):379–405, 2006.

[16] M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of convergent and commutative replicated data types. Research report, Inria–Centre Paris-Rocquencourt, 2011.