


Appears in the Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC ’97), Santa Barbara, CA, August 1997

Lazy Consistency Using Loosely Synchronized Clocks

Atul Adya Barbara Liskov

Laboratory for Computer Science, Massachusetts Institute of Technology,
545 Technology Square, Cambridge, MA 02139
{adya,liskov}@lcs.mit.edu

Abstract

This paper describes a new scheme for guaranteeing that transactions in a client/server system observe consistent state while they are running. The scheme is presented in conjunction with an optimistic concurrency control algorithm, but could also be used to prevent read-only transactions from conflicting with read/write transactions in a multi-version system. The scheme is lazy about the consistency it provides for running transactions and also in the way it generates the consistency information. The paper presents results of simulation experiments showing that the cost of the scheme is negligible.

The scheme uses multipart timestamps to inform nodes about information they need to know. Today the utility of such schemes is limited because timestamp size is proportional to system size and therefore the schemes don't scale to very large systems. We show how to solve this problem. Our multipart timestamps are based on real rather than logical clocks; we assume clocks in the system are loosely synchronized. Clocks allow us to keep multipart timestamps small with minimal impact on performance: we remove old information that is likely to be known while retaining recent information. Only performance and not correctness is affected if clocks get out of synch.

1 Introduction

This paper describes a new algorithm that ensures transactions running at clients in a client/server system always view a consistent state as they run. The algorithm is useful in conjunction with optimistic concurrency control mechanisms (e.g., [1]), and also for ensuring that read-only transactions can commit without interfering with read/write transactions in a multi-version system (e.g., [2, 7, 30]).

The algorithm is intended for use in a distributed environment in which servers provide reliable persistent storage for online information (e.g., a database or objects). Applications run at client machines that are distinct from the servers. Clients maintain caches that store copies of persistent objects; clients run application transactions locally on cached copies of persistent objects and send modifications to the servers when transactions commit. The scheme is designed to work in a very large system, e.g., tens of thousands of servers and hundreds of thousands of clients.

This research was supported in part by the Advanced Research Projects Agency of the Department of Defense, monitored by the Office of Naval Research under contract N00014-91-J-4136.

The new algorithm propagates information about consistency using multipart timestamps or multistamps. Today the utility of multistamps is limited because their size is proportional to system size and therefore they don't scale to very large systems. We show how to reduce the size of multistamps. Unlike other multistamp (or vector clock) schemes, e.g., [4, 17, 25, 8, 15], our scheme is based on time rather than on logical clocks: each entry in a multistamp contains a timestamp representing the clock time at some server in the system. Using time rather than logical clocks allows us to keep multistamps small by removing old information. As a result, multistamps need little storage at nodes and little space in messages. Furthermore, because we prune based on time, the discarded information is likely to already be known to interested parties; therefore discarding it has little impact on system performance. We assume that clocks are loosely synchronized; such an assumption is realistic in today's environment [23]. The correctness of the scheme is not affected if information of interest is pruned too early or clocks get out of synch (although performance may be).

This paper describes the new scheme in conjunction with an optimistic concurrency control algorithm, AOCC, which was developed for use in a distributed client/server environment. AOCC has been shown to outperform other concurrency control mechanisms for all common workloads and realistic system assumptions [1, 13]. In particular, AOCC outperforms the best locking approach, adaptive callback locking [6]. One reason for AOCC's good performance is that it allows transactions to use cached information without communication with servers; locking mechanisms require such communication at least when objects are modified. However, the reduced communication comes at a cost: unlike with locking, with AOCC transactions can view an inconsistent state while they run. Such transactions cannot harm the persistent state since they will abort, but it is nevertheless desirable to provide transactions with a consistent view, which allows application programmers to depend on invariants. This means code need not check whether invariants hold (as it would need to do if it could not depend on invariants) and it can display consistent information to users.

The paper describes how to augment AOCC to provide running transactions with consistent views. The consistency provided by our scheme is weaker than serializability; we discuss this point further in Section 3. We call it "lazy" consistency because it is what we can provide with very little work. We believe lazy consistency is appropriate for our system since we commit in the standard way (by communicating with the server); a transaction may still abort even though it viewed a consistent state.

The consistency mechanism uses multistamps to warn clients of potential violations of consistency. Multistamps are sent to clients on messages that are already flowing in the system. We guarantee they arrive at clients before a transaction might observe an inconsistency. Clients are also lazy; they act on the multistamp information only if it might affect the current transaction. Being lazy buys time so that the needed consistency information is highly likely to be present by the time it is needed.

The paper presents results of simulation experiments to evaluate the cost of the lazy scheme. Our results show that this cost is very small. The cost is manifested by "fetch stalls" in running transactions; these are events where a fetch done by a client because of a cache miss causes it to communicate with another server before continuing to run the transaction, where the communication would not have been required in the basic AOCC scheme. Our results show that when contention is low, fewer than one in a thousand fetches lead to stalls; even in stressful workloads, fewer than one in one hundred fetches lead to stalls. Therefore we believe the cost of the scheme is negligible. The studies also evaluate the effectiveness of pruning multistamps and show that small multistamps are as effective as larger ones in preventing stalls.

Thus the paper makes three contributions:

1. It presents a new efficient scheme that provides consistency for running transactions.

2. It presents a new way of implementing multistamps that allows them to be pruned safely and effectively. This keeps multistamps small so that they can be used even in very large systems where otherwise they would not be practical.

3. It presents the results of simulation studies that show the scheme provides consistency with almost no impact on system performance.

The remainder of the paper is organized as follows. Section 2 describes related work. Section 3 defines lazy consistency and discusses how it relates to serializability and causality. Section 4 describes our system and how AOCC works; Section 5 describes our implementation of lazy consistency; and Section 6 presents our performance results. We conclude with a summary of our results.

2 Related Work

Lazy consistency is similar to "degree-2 isolation"; using the terminology of Gray and Reuter [12], our system provides degree-3 isolation for all committed transactions and degree-2 isolation for running transactions.

Earlier research has investigated ways of providing consistency for read-only transactions [3, 30] using serializability or weaker notions of consistency that are different from lazy consistency. For example, some schemes ensure that a read-only transaction is serializable with all read-write transactions but the whole system may not be serializable when all read-only transactions are taken into account.

Chan and Gray [7] propose lazy consistency as a correctness criterion but their work is concerned with committing read-only transactions more cheaply as opposed to providing consistency for transactions that abort. They also propose an implementation technique for a distributed client/server system using a kind of multistamp scheme but ignore all details that would make it a practical scheme.

Our work is related to orphan detection schemes [20]. Such schemes abort transactions before they can observe an inconsistency; they guarantee a stronger property than lazy consistency since they also detect anti-dependencies (discussed in Section 3).

3 Lazy Consistency

This section defines the kind of consistency our new mechanism provides.

We refer to the i-th version of object x as x_i. If a transaction creates a version x_i, we assume that it read x_{i-1}. Transaction T is said to directly depend on transaction U if it read x_i and U created version x_i; we also say that T has read from U. We say that T depends on U if there exist transactions V1, V2, ..., Vn, where U is V1 and T is Vn, and V2 directly depends on V1, V3 directly depends on V2, and so on.

A running transaction Q will view a consistent state if the following condition is satisfied:

If Q depends on T, it must observe the complete effects of T.

For example, suppose that T creates x_i and y_j. If Q reads x_i and also reads y, it must read y_j or a version of y that is later than y_j.

The same condition is used for providing consistent views for read-only transactions by Chan and Gray [7]; that paper proves that if the condition is satisfied, Q will observe a consistent snapshot of the database. Here we present a brief synopsis of the proof. Assume T1, T2, ..., Tn is a prefix of the serialization history of the system, where Tn is the latest transaction whose updates have been observed by Q. Suppose Q only depends on transactions Ti1, Ti2, ..., Tik and has missed the updates of transaction Tj. Since we are ensuring that the above condition is satisfied for Q, Tj's updates have no impact on Ti1, Ti2, ..., Tik (they do not depend on Tj). If we consider a history in which we remove (all such) Tj from the set of committed transactions, there will be no change in the database state observed by Q. Since a transaction transforms the database from one consistent state to another and the output of each transaction in Ti1, Ti2, ..., Tik is unaffected by the presence or absence of Tj, their combination must yield a consistent database state.

A transaction Q that views a consistent state is not necessarily serializable with all the committed transactions in the system. For example, suppose there are two objects x and y stored at servers X and Y respectively, and suppose that transactions U and T run in parallel and commit, and then Q reads x0 and y2:

U: Read(x0) Read(y1) Write(x1)
T: Read(y1) Write(y2)
Q: Read(x0) Read(y2)

In this scenario, Q has observed the effects of T but missed the effects of U. To commit Q, we would need to serialize it both before U and after T, but since U is already serialized before T, this is impossible. The intuition why Q does not observe inconsistencies is that since T does not depend on U, its modifications must preserve system invariants regardless of what U does. A possible invariant that might be preserved is x <= y, where U stores the value of y in x and T increases y.

This problem of transaction Q not being serializable can occur only when there is an anti-dependency: a transaction A anti-depends on B if A overwrites an object version that B has read. In this case, T anti-depends on U. Figure 1 shows a cycle that is formed in the dependency graph [3] for this schedule due to anti-dependencies.

[Figure 1: A non-serializable transaction that has observed a consistent database state. The figure shows the dependency graph for this schedule: dependency and anti-dependency edges (labeled x and y) among U, T, and Q form a cycle.]

Lazy consistency is independent of causality [18]: a transaction Q might observe the effects of T2 while missing the effects of T1, where both T1 and T2 ran at the same client in order T1; T2. Our implementation guarantees causality for individual clients, which ensures that a client always observes the effects of all its previous committed transactions and the transactions they depended on. We believe such local causality is good since it means code within a transaction can depend on invariants holding between objects observed directly and also results computed by its earlier transactions and stored in local variables. Causality is discussed further in Section 6.3, where we show the additional cost incurred over lazy consistency to support local causality. We also show what it would cost to support global causality, in which a client that observes the effects of transaction T of some other client also observes effects of all earlier transactions of that client (and the transactions they depended on). Note that locking ensures global causality for active transactions.

4 Base Algorithm

This section provides an overview of our system and AOCC. More information can be found in [1, 10, 13, 19].

Servers store the database objects in pages on disk; objects are typically smaller than pages. Clients maintain a page cache; they fetch missing pages from servers and use approximate LRU for cache management. Clients and servers rely on ordered communication, e.g., TCP.

Transactions run entirely at clients; clients communicate with servers only when there is a miss in the cache, and to commit transactions. The system serializes transactions using AOCC [1, 13] and two-phase commit [11]. (Two-phase commit is avoided for transactions that use objects at only one server.)

AOCC works as follows: While a transaction T is running, client C keeps track of the objects it used and objects it modified. When T is ready to commit, C sends this information together with the new states of modified objects to a server at which some objects used by this transaction reside. This server acts as the coordinator for the commit; other servers where objects used by the transaction reside act as participants.

The participants validate T to ensure that it used only up-to-date versions of all objects and that it has not modified any objects used by other prepared or committed transactions, i.e., we use backward validation [14]. Validation uses a validation queue or VQ to track read/write sets of prepared and committed transactions; details can be found in [1].

Each participant sends its validation results to the coordinator, which commits the transaction iff all participants voted positively. The coordinator sends its decision to client C and the participants. The delay observed by the client is due to phase 1 only; phase 2 happens in the background. In phase two, a participant installs T's updates (makes them available to other transactions).

When updates of a transaction run by client C1 modify objects in the cache of client C2, this causes C2's cache to contain out-of-date information. Furthermore, if C2 then runs a transaction that observes this out-of-date information, that transaction will be forced to abort. To avoid such aborts and keep client caches relatively up to date, servers send invalidation messages. The server maintains a per-client directory that lists pages cached at the client (the directory may list pages no longer at the client since the server has not yet been informed that they were dropped). The server uses the directories to determine what invalidations to generate; each invalidation identifies an object that may be out of date at a particular client.

An invalidation message to a client contains all invalidations for that client. When a client receives an invalidation message it discards invalid objects (but not the containing page) and aborts the current transaction if it used them. In addition to keeping caches almost up to date, which reduces the number of aborts, invalidation messages also cause transactions that must abort to abort sooner.

Invalidation messages are piggybacked on other messages that the server sends to the client; the client acks these messages (acks are also piggybacked), which allows the server to discard pending invalidations for the client. However, invalidations are not allowed to remain pending indefinitely. Instead there is a timeout period (half a second in our current implementation). If a server has pending invalidations for the client, and if some of these invalidations are "old", i.e., they were generated nearly a timeout period ago, the server sends invalidations to the client on its own account (piggybacked on "I'm alive" messages). Thus invalidations are highly likely to arrive within half a second of their generation.
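As an illustration of this timeout rule, the following sketch shows one way a server might push old pending invalidations. The data-structure layout and helper names (ILIST, send_im_alive, invalidation_message, and the "old" threshold) are our assumptions, not the paper's code:

```python
# Illustrative sketch: push pending invalidations that are nearly a timeout
# period old, piggybacked on an "I'm alive" message (names are assumptions).
TIMEOUT = 0.5  # seconds, as in the paper's implementation

def maybe_push_invalidations(server, client, now):
    pending = server.ILIST.get(client, [])        # [(ts, [object ids]), ...] in timestamp order
    if pending and now - pending[0][0] >= 0.9 * TIMEOUT:
        # Some pending invalidations are "old": send them now rather than
        # waiting for another message to piggyback on.
        server.send_im_alive(client, server.invalidation_message(client))
```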

Although our scheme keeps caches almost up to date, it does not guarantee that they are up to date. When transactions run they can observe old information, and in fact if a transaction T modified objects at two different servers, a transaction running at some other client might observe the new state of one object and the old state of the other. Thus, it is possible for a transaction in our system to observe an inconsistent state. Our new algorithm prevents this situation from happening.

5 The Lazy Scheme

We now present our lazy scheme for providing consistent views for running transactions. Our scheme assumes that server clocks never run backwards, and advance rapidly enough that each transaction can be assigned a distinct timestamp; these assumptions are easy to guarantee (see, for example, the discussion in [21]).

The basis of the scheme is the invalidations generated when transactions commit. The fundamental idea is this: if a client running transaction U observes a modification made by transaction T, then it must already have received all the invalidations of T and any transactions T depended on.

The information about invalidations is conveyed to clients using multistamps. Each committed transaction has a multistamp that indicates its invalidations and those of all transactions it depends on. A multistamp is a set of tuples <client, server, timestamp>; each tuple <C, S, ts> means that an invalidation was generated for client C at server S at time ts. The timestamp ts is the value of S's clock at the time it prepared a transaction that caused invalidations for C.

We assume the obvious merge operation on multistamps: if the two input multistamps contain a tuple for the same client/server pair, the merge retains the larger timestamp value for that pair.
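To make this concrete, a multistamp can be modeled as a map from (client, server) pairs to timestamps, with merge taking the pointwise maximum. The sketch below is illustrative only; the type alias and function name are ours, not the paper's:

```python
# Illustrative sketch: a multistamp as a map from (client, server) pairs to the
# timestamp of the latest known invalidation for that pair.
from typing import Dict, Tuple

Multistamp = Dict[Tuple[str, str], float]   # (client, server) -> timestamp

def merge(m1: Multistamp, m2: Multistamp) -> Multistamp:
    """Merge two multistamps, keeping the larger timestamp for each pair."""
    result = dict(m1)
    for pair, ts in m2.items():
        if pair not in result or ts > result[pair]:
            result[pair] = ts
    return result
```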

The next two subsections describe the processing at the server and the client, ignoring size issues: multistamps are allowed to grow without bound and so are local tables at the server. Section 5.3 describes how we solve the size problems. Section 5.4 gives an informal argument that the scheme is correct.

5.1 Processing at the Server

Servers have two responsibilities: they must compute multistamps, and they must send them to clients in fetch responses. A fetch response sends a page containing modifications of transactions; at that point we also send the merge of the multistamps of those transactions.

The server maintains the following data structures. The PSTAMP table maps pages to multistamps: the multistamp of a page is the merge of the multistamps of all transactions that modified that page. The ILIST maps clients to invalidation information. Each element of ILIST[C] is a timestamp ts and a list of object ids, indicating that these objects were invalidated for C at time ts. The VQ records multistamps of committed transactions (along with information about reads and writes of prepared and committed transactions).

Commit Processing. In the prepare phase, if validation of transaction T succeeds, participant S computes multistamp m as follows:

1. S initializes m to be empty.

2. If the commit of T would cause invalidations for any other clients, S sets ts to the current time of its clock. Then for each potentially invalidated client C:

(a) S adds tuple <C, S, ts> to m.

(b) S adds <ts, olist> to the ILIST for C, where olist contains ids of all objects modified by T that are in pages listed in S's directory for C.

3. For each transaction U that T depends on, S merges VQ[U].mstamp with m. The dependencies are determined using S's VQ.

Then S sends m in the vote message to the coordinator. If the coordinator decides to commit T, it merges the multistamps received from participants to obtain T's multistamp. This multistamp is sent to participants in the commit messages. The participants store it in VQ[T].mstamp. Furthermore, for each page P modified by T, the participant merges this multistamp into PSTAMP[P].
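A sketch of the participant's prepare-phase computation appears below. It follows steps 1-3 above and reuses the merge function sketched earlier, but the concrete representations of the ILIST, VQ, and per-client directory, and helper names such as clients_with_cached_copies, are assumptions made for illustration:

```python
# Illustrative sketch of steps 1-3 of commit processing at participant S.
#   S.ILIST[c]     : list of (ts, [object ids]) pending-invalidation entries for client c
#   S.VQ[u].mstamp : multistamp of a prepared/committed transaction u
def prepare_multistamp(S, T):
    m = {}                                               # step 1: start with an empty multistamp
    affected = S.clients_with_cached_copies(T.writes)    # derived from per-client directories
    if affected:
        ts = S.clock.now()                               # step 2: one timestamp for this prepare
        for c in affected:
            m[(c, S.id)] = ts                            # 2(a): add <C, S, ts>
            olist = [x for x in T.writes
                     if S.directory[c].contains_page_of(x)]
            S.ILIST[c].append((ts, olist))               # 2(b): record the pending invalidations
    for u in S.VQ.transactions_depended_on(T):           # step 3: inherit dependencies' multistamps
        m = merge(m, S.VQ[u].mstamp)
    return m                                             # sent to the coordinator in S's vote
```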

If the coordinator decides to abort, it sends this information to the participants. The participant then removes information about T from the ILIST.

Fetch Processing. When a server receives a fetch message for object x on page P, if there is a prepared transaction that modified x, it waits for it to complete. Then it sends the fetch reply, which contains P and also PSTAMP[P]. (Our base system never delays a fetch reply. However, the probability of a delay is very small since the prepare window is very small.)

Invalidations. To produce an invalidation message, the server goes through the ILIST in timestamp order from smallest to largest, stopping when it has processed the entire list, or it reaches an entry for a prepared (but not yet committed) transaction. The ids of all objects in the processed entries are sent in the message along with the largest timestamp contained in the processed entries.

As mentioned, invalidation messages are piggybacked on every message sent to a client and clients acknowledge invalidations. The acknowledgement contains the timestamp ts of the associated invalidation message. The server then removes all entries whose timestamps are less than or equal to ts from the ILIST.

A client may also request invalidation information. Such a request contains a timestamp ts. The server responds by sending back an invalidation message as above except that the timestamp in the message must be greater than or equal to ts. It is possible that some entry in the table with a timestamp less than or equal to ts exists for a transaction that has not yet committed (it is still prepared); in this case, the server delays the response until the outcome for that transaction is known. Such delays are unlikely because the coordinator sends the phase-two message to participants promptly.
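The following sketch shows one way the server side of this protocol could be organized: producing an invalidation message from the ILIST and trimming the ILIST when an ack arrives. The entry layout (with an explicit committed flag) and the function names are our assumptions:

```python
# Illustrative sketch: invalidation-message production and ack processing at server S.
# ILIST entries here are (ts, [object ids], committed) triples, an assumed layout.
def invalidation_message(S, c):
    oids, max_ts = [], None
    for ts, olist, committed in sorted(S.ILIST[c], key=lambda e: e[0]):
        if not committed:
            break                      # stop at an entry for a prepared transaction
        oids.extend(olist)
        max_ts = ts
    return oids, max_ts                # the client later acks max_ts

def process_ack(S, c, acked_ts):
    # Drop every entry the client has acknowledged receiving.
    S.ILIST[c] = [e for e in S.ILIST[c] if e[0] > acked_ts]
```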

5.2 Processing at the Client

A client C is responsible for using multistamps to ensure that it receives invalidations before their absence could lead to the current transaction viewing an inconsistency.

C maintains two tables that store information about servers it is connected to. LATEST[S] stores the timestamp of the latest invalidation message it has received from server S and REQ[S] is the largest timestamp for S that C is required to hear about. If REQ[S] > LATEST[S], this means S has invalidations for C that C has not yet heard about.

The client also maintains a set CURR that identifies all servers used by the currently running transaction. For each such server S in CURR, it guarantees that LATEST[S] >= REQ[S]. In other words, for all servers used by the current transaction, the invalidation information is as recent as is required.

When the client receives an invalidation message from server S, it stores the timestamp in the message in LATEST[S].

Client C does the following when a transaction first uses object x:

1. Adds x’s server S to CURR.

2. Fetches x if necessary. When the fetch reply arrives it processes the invalidations as described above. Then it updates the information in REQ to reflect the multistamp in the fetch response: for each multistamp entry <C, R, ts> such that ts is larger than REQ[R], it stores ts in REQ[R].

3. If LATEST[R] < REQ[R] for some server R in CURR, it sends an invalidation request to R (requesting R to reply with a message timestamped the same as or later than REQ[R]), waits for the response, and then processes it.

Note that invalidation processing in steps 2 and 3 can cause the transaction to abort (if the transaction had already used an invalidated object).
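Putting the three steps together, the client's first-use check might look like the sketch below; the helper names (server_of, page_of, process_invalidations, effective) and the exact cache interface are assumptions of ours, not the paper's:

```python
# Illustrative sketch of client C's processing when transaction tx first uses object x.
#   C.LATEST[S]: timestamp of the latest invalidation message received from S
#   C.REQ[S]   : largest timestamp from S that C is required to hear about
def first_use(C, tx, x):
    S = C.server_of(x)
    tx.CURR.add(S)                                    # step 1
    if not C.cached(x):                               # step 2: fetch if necessary
        page, m = S.fetch(C.id, C.page_of(x))         # reply carries the page's multistamp
        C.process_invalidations(S, page.piggybacked)  # updates C.LATEST[S]; may abort tx
        for (client, R), ts in C.effective(m).items():
            if client == C.id and ts > C.REQ.get(R, 0):
                C.REQ[R] = ts                         # must hear from R up to time ts
    for R in tx.CURR:                                 # step 3: stall if required info is missing
        if C.LATEST.get(R, 0) < C.REQ.get(R, 0):
            reply = R.invalidation_request(C.id, C.REQ[R])
            C.process_invalidations(R, reply)         # may abort tx if it used a stale object
```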

The main point to notice here is that the processing overheads associated with multistamps are minimal (assuming multistamps are small). The main impact on system performance is in step 3 when an invalidation request is sent. When this happens there is an extra message compared to AOCC, and the current transaction is delayed, i.e., there is a stall. A stall can occur when a fetch completes or when a server is accessed for the first time in a transaction; note that the latter type of stall can be attributed to a fetch that occurred previously at that client. In Section 6, we evaluate the cost in terms of the stall rate, i.e., the percentage of fetches that resulted in stalls.

5.3 Truncation

Now we discuss how to keep multistamps small and also how to keep the VQ and PSTAMP tables small. All of our optimizations rely on time: we remove "old" tuples from multistamps. To account for removed tuples, a multistamp m also contains a timestamp m.threshold; m.threshold is greater than or equal to the timestamps of all tuples that have been removed from m. The threshold allows us to compute an effective multistamp EFF(m) containing a tuple <C, S, ts> for every client/server pair, where ts is the timestamp in the tuple for C/S in m if one exists and otherwise ts is m.threshold.
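A client only ever consults the entries of EFF(m) for its own client id, so a sketch of the computation can be restricted accordingly. The representation (explicit tuples plus a threshold) and the names below are assumptions of ours:

```python
# Illustrative sketch: the effective multistamp EFF(m), restricted to one client.
# m_tuples maps (client, server) pairs to timestamps; threshold covers removed tuples.
def effective_for_client(m_tuples, threshold, client, servers):
    """Return server -> timestamp the given client must be up to date with."""
    eff = {}
    for S in servers:
        ts = m_tuples.get((client, S))   # explicit tuple, if one survived truncation
        eff[S] = ts if ts is not None else threshold
    return eff
```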

As mentioned in Section 4, clients receive invalidation information from servers within the timeout period of when it was generated. This communication implies that tuples containing a timestamp ts that is older than a server's current time by more than this period are not useful, since by now those old invalidations will almost certainly have been propagated. Therefore, such old entries are aged, i.e., automatically removed from multistamps.

Simple aging of tuples may not be enough to keep multistamps small, so we also prune the multistamp by removing tuples that are not old. The system bounds the size of multistamps: whenever a multistamp m exceeding this size is generated, it is immediately pruned by removing some tuples. Pruning occurs in two steps: first, if there are more than some number of tuples concerning a particular server, these tuples are removed and replaced by a server stamp. Then, if the multistamp is still too large, the oldest entries are removed and the threshold is updated. A server stamp is a pair <server, timestamp>; EFF(m) expands a server stamp into a tuple for each client for that server and timestamp.
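The two-step pruning rule can be sketched as follows. Here a multistamp is modeled as an object with explicit tuples, server stamps, and a threshold; the size bound and per-server limit are parameters (the simulator in Section 6 replaces tuples with a server stamp once there are at least 10 entries for one server), and the code itself is our illustration, not the paper's:

```python
# Illustrative sketch of two-step pruning of a multistamp m.
#   m.tuples        : {(client, server): ts}
#   m.server_stamps : {server: ts}
#   m.threshold     : covers everything removed from m
def prune(m, max_entries=5, per_server_limit=10):
    # Step 1: collapse many tuples for one server into a single server stamp.
    by_server = {}
    for (c, s), ts in m.tuples.items():
        by_server.setdefault(s, []).append(((c, s), ts))
    for s, entries in by_server.items():
        if len(entries) >= per_server_limit:
            m.server_stamps[s] = max(ts for _, ts in entries)
            for pair, _ in entries:
                del m.tuples[pair]
    # Step 2: if still too large, drop the oldest tuples and raise the threshold.
    while len(m.tuples) + len(m.server_stamps) > max_entries and m.tuples:
        oldest = min(m.tuples, key=m.tuples.get)
        m.threshold = max(m.threshold, m.tuples.pop(oldest))
```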

The VQ and PSTAMP are also truncated. Retaining information in the VQ about transactions that committed longer ago than the timeout period is also not useful. Whenever the multistamp of a transaction contains no tuples (i.e., it consists only of a threshold), it is dropped from the VQ. The VQ has an associated multistamp VQ.m that is greater than or equal to the (effective) multistamps of all transactions dropped from the VQ. Information is dropped from PSTAMP in the same way, with information about multistamps of dropped entries merged into PSTAMP.m, a multistamp associated with PSTAMP. Note that for both the VQ and PSTAMP, we can remove entries earlier if we want; all that is needed is to update the associated multistamp properly. Thus, a server only maintains entries for recently committed transactions in the VQ and only information about recently modified pages in PSTAMP.

A server initializes the multistamp m for a transaction to VQ.m. When a fetch request for page P arrives and PSTAMP contains no entry for P, the fetch response contains PSTAMP.m.

When a client receives a multistamp m, it computes EFF(m) and proceeds as described in Section 5.2.

The result of these optimizations is a loss of precision in multistamps, which may in turn lead to more fetch stalls. For example, in pruning the multistamp, the server may have removed an entry <D, S, ts> concerning some client D different from C. When that multistamp arrives at C, C may block while waiting for invalidation information from S even though S has no invalidations for it.

When an invalidation request arrives at a server it might contain a timestamp ts that is larger than what is stored in any entry in the ILIST for that client. To handle such a situation, the timestamp in the invalidation response will be the value of the server's current clock. If the current value of the clock is less than ts, the server will wait for it to advance. This situation is extremely unlikely, since it occurs only if clocks are out of synch.

Every message sent to the client contains a piggybacked invalidation message. This message is as described earlier except that if the whole ILIST is being sent, or if the ILIST is empty, the current clock value is sent as the timestamp. This timestamp allows the clients to avoid stalls.

5.4 Correctness

The correctness of our scheme depends on two properties: (i) multistamps reflect dependencies properly; (ii) multistamps flow and are acted upon in a timely way.

Step (2a) of commit processing in Section 5.1 ensures that the multistamp for a transaction T contains tuples for all invalidations caused by it, and step (3) ensures that T's multistamp contains all tuples in multistamps of transactions T depends on. The fact that we delay a client that fetches an object modified by T while T is prepared and the fact that the coordinator sends the merged multistamp in its commit decision ensure that all of T's read-write participants have the complete invalidation information for T and the transactions it depends on. By induction on the dependency relation we know that T's multistamp contains tuples reflecting all invalidations of T or any transaction it depends on. Truncation preserves dependencies because, for m', the truncated form of m, the timestamp for any client/server pair in EFF(m') is greater than or equal to the timestamp for the pair in EFF(m).

Multistamps flow in a timely way. The only way a transaction U comes to depend on some other transaction T is by reading a modification of T. This can occur only if its client fetches a page modified by T. But when such a fetch occurs, the multistamp of T is sent to the client (in the multistamp of the returned page); the client records this information (in REQ). The REQ and LATEST data structures are used by a client to maintain the invariant that its cache is up to date with respect to all servers in the current transaction. Whenever the cache gets new information (i.e., a page is fetched) or a server is accessed for the first time in the current transaction, the client ensures that REQ[S] <= LATEST[S] for all servers S used by the current transaction.

6 Performance Evaluation

It would be nice to determine stall rates and the effects of truncation analytically, but these effects are dependent on workloads and also on connectivity among clients and servers, and are not amenable to analysis. Instead, we study the effects via simulation. This section describes the simulation experiments and results. The results show that the stall rate is low even in workloads that stress the system, that truncation keeps multistamps small with little impact on stall rate, and that aging alone is not sufficient for truncation in these workloads. The section also discusses the cost of supporting causality and presents an optimization that further reduces stalls.

6.1 Simulator Model

The simulator allows us to control the number of clients and servers, their connectivity, and also workload issues such as how the database is distributed among the servers, how clients share, and what transactions are like. The details of the algorithm are simulated precisely; we replaced client-server tuples in multistamps with server stamps when there were at least 10 entries for the same server. We constructed the simulator and workloads starting from earlier concurrency control studies [13, 6]. These studies were performed for a single-server, multi-client system; we extended them to a distributed database with multiple servers. We ran experiments with varying values of simulator parameters; we present results for a particular setup and mention results of other experiments along the way. Figure 2 describes the parameters of the study.


Connectivity Parameters
  Servers in cluster: 2
  Clients in cluster: 20
  Total clusters in system: 10
  Client connections with servers: 4
  Preferred servers for each client: 2

Network Parameters
  Network bandwidth: 155 Mbps
  Fixed network cost: 6000 instr.
  Variable network cost: 7168 instr./KB

Client/Server Parameters
  Client CPU speed: 200 MIPS
  Server CPU speed: 300 MIPS
  Client cache size: 50% of IUP
  Server cache hit ratio: 50%
  Disk read time: 16 msec

Database and Transaction Parameters
  Object size: 64 bytes
  Page size: 4 KB
  Objects per page: 64
  In-use pages or IUP: 1250
  Single-server transactions: 80%
  Read think time: 64 μsec/object
  Write think time: 128 μsec/object
  Accesses per transaction: 200 objects
  Write accesses: 20%
  Timeout period: 0.5 sec

Figure 2: Summary of Parameter Settings

6.1.1 Connectivity Model

We divided clients and servers into clusters each containing 2 servers and 20 clients. There are 10 clusters in the system. Each client is connected to 4 servers of which 2 are "preferred": most accesses by this client go to these servers. A client's preferred servers are in its cluster; the non-preferred servers are chosen randomly from all the clusters. Thus, each server has 20 clients that prefer it and 20 (on average) other clients. In our experiments each server receives about 1 transaction from a non-preferred client for every 5 transactions received from preferred clients; thus non-preferred clients are not very idle from a server's point of view. Figure 3 shows a typical setup.

[Figure 3: Client connections in a simulator run — clusters 1 through 10, with example clients A and B and their preferred and non-preferred connections.]

This setup models a realistic situation where clients mostly access their local server (e.g., on their LAN). To stress our scheme, we used 2 preferred servers per client instead of one: as a client switches between the 2 preferred servers, the likelihood of stalls increases due to inter-transaction dependencies created for the client by local causality. Furthermore, the connectivity of the system is high; each client has two random connections to non-preferred servers in other clusters. Thus, a client can "spread" the multistamp from its clusters to other clusters and vice-versa. Increase in multistamp propagation can lead to eager truncation and unnecessary stalls. We chose 10 clusters to model a system in which multistamps must be truncated; otherwise, they would contain 800 entries (since there are 800 client-server connections in the setup).

6.1.2 Network Model

We used an abstract model of the network as in [13]. Apart from time spent on the wire, each message has a fixed and variable processor cost for sending/receiving the message. Our network parameters were obtained from the experimental results reported for U-Net [29] running over a 155 Mbps ATM.

6.1.3 Client/Server Parameters

Client processors run at 200 MIPS and the server CPU at 300 MIPS; these values reflect typical processor speeds expected in the near future. Disks are not modeled explicitly but we assume a server cache hit ratio of 50%. This value is much higher than hit ratios observed in real systems [5, 24]. A lower server cache hit ratio would reduce the stall rate since transactions would take longer to execute and thus dependencies would spread more slowly (we ran an experiment to confirm this). Furthermore, a lower hit rate would reduce the relative impact of stalls on overall execution time. The costs for reading and writing an object are about 64 μsec and 128 μsec respectively. These times are based on the observation that a transaction operates on an object for some time when it accesses it, e.g., the time spent per object in the OO7 benchmark by Thor [19] is around 35 μsec (and OO7 methods do not perform any significant amount of work). Other costs such as those for validation, cache lookup, etc., are not shown due to lack of space; they are negligible compared to the access and fetch costs. The timeout period is chosen to be 0.5 second. Clock skews are not modeled since they are insignificant compared to the timeout period; the Network Time Protocol [22] maintains clock skews that are less than 10 milliseconds on LANs and WANs [23].

6.1.4 Database and Transaction Parameters

The database is divided into pages of equal size. Each simulator run accesses a small subset of the database; this set is called the in-use pages of the run. As shown in Figure 2, there are 1250 in-use database pages per server.

In all workloads, 80% of the transactions are single-server and the rest are multi-server (with a bias towards a smaller number of servers in a transaction). To generate a transaction's accesses, we first generate the number of servers for the transaction and then choose the servers. A preferred server is chosen 90% of the time for transactions with one or two servers. For a higher number of servers, we choose both preferred servers and then choose randomly from the remaining non-preferred servers. Each transaction accesses 200 objects on average; 10 objects are chosen on a page, resulting in 20 pages being accessed in a transaction. In a multi-server transaction, the accesses are divided equally among the servers. 20% of the accesses are writes.

The client cache size is 875 pages; cache management is done using LRU. The client can potentially access 5000 pages (1250 from each server). In single-server studies [13, 6] the client cache was 1/4 of the in-use pages, so it might seem that our cache is too small. However, we observed that more than 85% of accesses are to preferred servers, i.e., to 2500 pages. Thus, a client cache size between 625 and 1250 pages, with a bias towards 625, is in line with earlier work.

The total in-use database in our system is more than twice the size used in [13, 6]. This means we have lower contention (less than half) than what was observed in those studies. We therefore also ran experiments with a smaller database (to make contention levels similar) and observed a stall rate increase of about 50-75%.

As before, our parameters are designed to stress the lazy scheme. Although real systems are dominated by read-only transactions [9, 26], we don't have any, since otherwise invalidations would be generated with very low frequency, making stalls highly unlikely. Also, we have a relatively high percentage of multi-server transactions (20%); 11.5% of our transactions use two servers and 8.5% use more than two servers. Benchmarks such as TPC-A and TPC-C [28] have fewer than 10-15% multi-server transactions and these transactions involve two servers; other researchers have also reported two-server transactions to be common for distributed transactions [26].

6.1.5 Workloads

We now discuss the different workloads used in our study; these workloads have been used in the past to compare the performance of various concurrency control mechanisms [13, 6]. LOWCON is intended to be a realistic low-contention workload; the others are intended to stress the system.

LOWCON. This workload models a realistic system with low contention (like those observed in [16, 27]). Each client has a private region of 50 pages at each preferred server. Each server has a shared region of 1200 pages. 80% of a client's accesses go to its private region; the rest go to the shared regions at its connected servers. Thus, in this workload, a client can access 1250 pages at a preferred server and 1200 pages at a non-preferred server (note that private regions of other clients are not accessed by a client).

SKEWED. As in LOWCON, each client has a private region of 50 pages at each preferred server; together these regions consume 1000 pages. Each server has a shared region containing 250 pages. 80% of a client's accesses go to its private region; 20% of the accesses go to the rest of the database (or RDB) region consisting of all other pages at its connected servers, including private regions of other clients. This workload models the case when a client has affinity for its own objects and is used for checking contention with skewed sharing patterns.

HOTSPOT. This is the same as SKEWED except there is a small region of 50 pages added to the database at each server (the shared region is thus 200 pages at each server). 10% of accesses go to the small region, 80% to the private region and 10% to the RDB. The small region might be the top of a naming hierarchy that is accessed often by clients; to capture the fact that such a region will not be modified often, only one in every ten transactions modifies the small region. This workload has the richest access patterns; it includes both uniform sharing (all clients access the small region uniformly) and skewed sharing (one client accesses its private region more frequently than other clients). The workload is intended to stress the lazy scheme because of high contention in the small region.

We could have added a small region to LOWCON, but in practice modifications to such a region are extremely rare and therefore its impact on lazy consistency would be insignificant.

HICON. This is a very unrealistic workload used in concurrency control studies to model high contention. There are two regions at each server: a "hot" 250-page region that is accessed 80% of the time and a 1000-page "cold" region that is accessed 20% of the time. Both hot and cold regions are shared uniformly among clients.

6.2 Basic Experimental Results

Figure 4 shows results of experiments as the maximum size for multistamps increases from zero entries (just the threshold timestamp) to the size reached when just aging is used. (A system that uses just the threshold timestamp is using Lamport clocks [18].) The X axis represents increasing multistamp size; the Y axis shows the stall rate. The number in parentheses after the workload name gives the number of fetches per transaction. The results show that even for small multistamps, the percentage of stalls is low. Low stalls translate into extremely low cost, as discussed in Section 6.2.2. Furthermore, we can decrease stall rates by a factor of two, as shown in Section 6.4.

[Figure 4: Percentage of fetch stalls as the maximum multistamp size is increased. X axis: maximum multistamp entries (0 to infinity); Y axis: percentage of fetch stalls (0-10%). Curves: HOTSPOT (5.2), SKEWED (6.0), HICON (19.0), LOWCON (4.2).]

LOWCON has relatively few misses (all due to the shared region) and few stalls. Since contention is low on the shared region, the stall rate is extremely low. Most misses in SKEWED and HOTSPOT are due to accesses to the RDB region; HOTSPOT also has coherency misses due to the small region. The stall rate in HOTSPOT is higher than in SKEWED because there is more contention on the small region. Furthermore, since the client can cache a higher percentage of this region than the RDB region in SKEWED, more invalidation entries are generated in HOTSPOT. As a result, there is more truncation, which leads to a higher stall rate.

HICON has very high contention and a high number of stalls. However, the number of fetches is also high because there is a large number of coherency misses in the hot region and capacity misses in the cold region, and therefore the stall rates are low. If we had a large enough cache that there were no capacity misses in HICON, the stall rate would be about 2.6% (it will be lower than this since some of the capacity misses will be replaced by coherency misses).

Figure 5 shows multistamp size when only aging is used. These multistamps are large and therefore pruning is needed. As shown in Figure 4, pruning to a reasonable size does not increase the stall rate significantly.

Workload    Multistamp entries
LOWCON      27.5
SKEWED      117.9
HOTSPOT     135.8
HICON       201.7

Figure 5: Number of multistamp entries with aging

6.2.1 Additional Experiments

This section discusses results of some other experiments we ran to determine what happens to the stall rate as parameters vary. These experiments used HOTSPOT since it has uniform and skewed sharing patterns, and because it stresses the system.

We changed the topology so that one of the 2 preferred servers was in a different cluster. The stall rate decreased since the likelihood of a client depending on a multi-server transaction that involved the same servers decreased (the probability of two clients sharing multiple preferred servers is lower). In another experiment, we decreased the percentage of accesses to the preferred servers; the stall rate decreased as the percentage of accesses to the preferred servers was reduced, for similar reasons.

We varied client cache sizes and observed that the stall rate did not change. As cache sizes were increased, the number of fetches remained the same; capacity misses were replaced by coherency misses.

We ran an experiment where a client had more than 2 non-preferred servers; we varied the number of non-preferred servers from 3 to 9, and allowed transactions to use all connected servers. The stall rate for the 5-entry multistamp increased steadily from 2% to 6.3%. Additional experiments showed that this effect was due to transactions that used large numbers of servers rather than the number of connections per client. Of course, as discussed in Section 6.1.4, transactions that use 8 or 9 servers are highly unlikely in practice.

We ran an experiment where there were 15 clusters instead of 10 (giving 300 clients and 30 servers). The stall rates were the same. In another experiment, we changed the number of clients per cluster from 20 to 30, i.e., each server had an average of 60 client connections. The stall rates for this setup were also similar.

6.2.2 Impact on Performance

We ran experiments with lazy consistency enabled and disabled and observed that the transaction latency/throughput were the same. Very small stall rates have little impact on transaction cost since other costs dominate. Each fetch involves a roundtrip with the reply being a big message (4 KB); in contrast, a stall results in a small message roundtrip. On an ATM network, a stall takes 300 μsec, a fetch from the server cache is approximately 650 μsec (using U-Net numbers), and a disk access is around 16 msec. For example, consider HOTSPOT with a 5-entry multistamp and assume 50% of fetches hit in the cache. Using the stall rate of 2% shown in Figure 4, we have an increase in the cost of running transactions of less than 0.08%; with the simple optimization discussed in Section 6.4, we get a stall rate of only 0.8% (see Figure 8), and the increase is less than 0.03%.
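A rough back-of-the-envelope check of the 0.08% figure, using only the numbers above and assuming a server-cache miss costs roughly the 650 μsec roundtrip plus the 16 msec disk read (our calculation, not the paper's):

$$\text{extra delay per fetch} \approx 0.02 \times 300\,\mu s = 6\,\mu s$$
$$\text{average fetch cost} \approx 0.5 \times 650\,\mu s + 0.5 \times (650\,\mu s + 16\,\text{ms}) \approx 8.7\,\text{ms}$$
$$\text{relative overhead} \approx 6\,\mu s \,/\, 8.7\,\text{ms} \approx 0.07\%$$

This is below the stated 0.08% bound; measured against total transaction time, which also includes per-object processing and the commit, the overhead is smaller still.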

Lazy consistency also has little cost in space or message size. A multistamp entry requires approximately 12 bytes of storage, and multistamps typically contain fewer entries than a given bound (e.g., in HOTSPOT, a 20-entry multistamp actually contains approximately 10 entries). Therefore, multistamps are cheap to send in messages and the memory requirements for storing them are low. The memory cost for other data structures is also low since we add only a small amount of information over what is needed for concurrency control purposes.

We compared the abort rate with and without lazy consistency and discovered that it did not change. This indicates that in these workloads invalidation requests are almost always due to false sharing rather than true sharing, and that transactions would rarely observe inconsistencies in our base system. Nevertheless, lazy consistency is worth supporting since it provides a much cleaner semantics to programmers: they can count on never seeing inconsistencies.

6.3 The Cost of Causality

This section evaluates the cost of supporting local causality by comparing its stall rate with what happens in a system that supports lazy consistency without any causality. It also evaluates the additional cost of supporting global causality. The results are shown in Figure 6; the results are for a 5-entry multistamp. We can see that the stall rate approximately doubles when support for local causality is added to lazy consistency, and it increases further when support for global causality is added.

[Figure 6: Breakdown of stall rate for local and global causality. Y axis: percentage of stalls (0-3%); bars for HOTSPOT, SKEWED, HICON, and LOWCON.]

Lazy consistency without local causality is implemented by having the client maintain a table SSTAMP that maps servers to multistamps; every time the client receives a fetch reply from server S, it merges the multistamp in the message into SSTAMP[S]. Furthermore, rather than a global REQ table, REQ stores information only for the current transaction. REQ is empty when the transaction starts. The first time the transaction accesses an object from server S, it uses SSTAMP[S] to update entries for other servers used by the transaction in REQ, and then sends invalidation requests to these other servers if necessary. After commit, REQ is merged into SSTAMP[S] for all servers S accessed by the current transaction T (to capture T's dependencies).

In a system with just lazy consistency, single-server transactions cannot stall, and even in a multi-server transaction, a client's requirements are due only to the servers used in that transaction. Thus, the following situation must occur for a client to stall: client C fetches page P at server S in a multi-server transaction that also involves server R. Page P must have been modified recently by a multi-server transaction that also involved server R and that caused an invalidation for C. The likelihood of a multi-server transaction (fetching P) depending on another "temporally close" multi-server transaction is low. The fact that we keep track of exact dependencies at validation ensures that a preparing transaction depends on very few recent transactions (in HOTSPOT, this was less than 1.5 and in HICON it was around 4). Also, 80% of these transactions are single-server so that the multistamp of a preparing transaction will mostly contain entries for this server. As a result, when P is fetched, P's multistamp contains entries for other servers mostly due to the last transaction that modified it.

When local causality is supported, there are more stalls since a transaction can stall due to requirements generated in an earlier transaction. For example, without causality single-server transactions never stall but now they do; we observed that with local causality about half the total stalls in HICON and HOTSPOT were stalls in a single-server transaction.

Global causality is supported by adding the following to the implementation described in Section 5: the coordinator sends a transaction's multistamp to the client in the commit reply; the client sends its current multistamp to the server at the next commit request; and the server merges it into the multistamp generated for that transaction. Global causality further increases the stall rate because clients now act as "propagators" of multistamps. Without global causality, multistamp information was propagated across servers only through multi-server transactions. Now when a client "switches" servers, it propagates the multistamps (generated due to its previous commits) to the new servers. Also, faster propagation causes more pruning to occur (since the multistamp has more entries); the threshold is raised, resulting in more stalls.

6.4 Optimizations

6.4.1 Reducing Switch Stalls

With local causality, we observed that 70-80% of stalls were switching stalls that occurred the first time a client uses a server it has not used in the recent past. Switches are frequent in our system since we select servers randomly for every transaction. In a real system, a client will use a server or a set of servers for some time before switching to a different server; thus switching will be relatively infrequent and stall rates are likely to be smaller than what we observed.

            NORMAL              ALL                 PREF
Workload    Stall Rate   Msgs   Stall Rate   Msgs   Stall Rate   Msgs
HOTSPOT     1.96 %       0.10   0.23 %       0.39   0.82 %       0.15
HICON       1.73 %       0.33   0.10 %       0.95   0.37 %       0.54
SKEWED      1.56 %       0.10   0.16 %       0.34   0.57 %       0.14
LOWCON      0.26 %       0.01   0.02 %       0.05   0.11 %       0.02

Figure 7: Tradeoff involved in sending background invalidation requests to reduce the stall rates

A good way to reduce switching stalls is to send invalidation requests in the background, for example, in parallel with a commit request. That way, the client will be up to date for all servers when the next transaction starts. However, since a client does not access non-preferred servers frequently, it may not be worthwhile sending invalidation requests to them. Figure 7 shows the results of this optimization (for 5-entry multistamps). Here ALL is the case in which invalidation requests are sent to all out-of-date servers at commit, and PREF is the case in which these requests are sent only to out-of-date preferred servers; NORMAL is the case where no background invalidation requests are sent. The "Msgs" column gives the total number of invalidation messages (both background and foreground) sent per transaction for each scheme. The main point to note here is that PREF works very well; only slightly more messages are sent (over NORMAL), yet the stall rate drops significantly.

The results for all multistamp sizes for PREF are given in Figure 8; we can see that the optimization reduces the stall rate by approximately a factor of two (compared with Figure 4). This figure shows the true performance of our scheme. For all the stressful workloads, the stall rate is less than 1%; for the realistic workload, the stall rate is around 0.1%.

Note that a client can use heuristics based on its history of previous transactions to determine whether a server is preferred or non-preferred.
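As a rough illustration, the sketch below shows how a client might combine such a history-based heuristic with background invalidation requests sent at commit time (the PREF scheme). The ClientSession class, the 60-second recency window, and the network send/send_async helpers are assumptions made for this sketch, not part of the implementation described above.

    import time

    class ClientSession:
        PREFERRED_WINDOW = 60.0  # seconds; a server used within this window counts as preferred

        def __init__(self, network):
            self.network = network
            self.last_used = {}       # server -> wall-clock time of last access
            self.out_of_date = set()  # servers whose invalidations we may be missing

        def is_preferred(self, server):
            # Heuristic based on the history of previous transactions.
            return time.time() - self.last_used.get(server, 0.0) < self.PREFERRED_WINDOW

        def on_commit(self, coordinator, commit_request):
            # Send the commit request in the foreground ...
            self.network.send(coordinator, commit_request)
            # ... and, in parallel, background invalidation requests to the
            # out-of-date preferred servers, so the cache is brought up to date
            # before the next transaction starts fetching.
            for server in list(self.out_of_date):
                if self.is_preferred(server):
                    self.network.send_async(server, ("invalidation-request",))
                    self.out_of_date.discard(server)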

[Figure 8 is a plot: percentage of fetch stalls (y-axis, 0 to 5%) versus maximum multistamp entries (x-axis, 0 to infinity), with one curve each for HOTSPOT, SKEWED, HICON, and LOWCON.]

Figure 8: Percentage of fetch stalls when messages are sent to preferred servers at end of transaction

6.4.2 Using Server Stamps

The current system uses both client-server stamps and server stamps in multistamps; it replaces client-server stamps with server stamps when the number of stamps for a server exceeds a bound (10 in our system). We also ran experiments where only server stamps were used. We found that the two systems ran identically when multistamp size (in number of entries) was held constant. However, multistamps containing only server stamps are smaller than those containing client-server stamps, and therefore we can get the same low stall rates with less space and smaller messages using server stamps.

We believe, however, that our workloads are biased towards server stamps: all clients that share a server also share pages at that server, so page modifications are likely to cause invalidations for many clients, and little information is lost when client-server stamps are replaced by server stamps. In contrast, consider two clients C and D that share pages from server S and also both use R but do not share any pages at R. If client C fetches a page from S that was modified in a multi-server transaction by client D, C can stall unnecessarily due to invalidations generated by D for some other client. Depending on the sharing patterns in the system, the client-server stamp system will dynamically replace only some client-server stamps with a server stamp; thus, if the above sharing pattern occurs, it may be able to handle it better than the server-stamp system. This pattern may be quite likely in practice, and therefore client-server stamps seem like a good idea since they are not very expensive.
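The sketch below illustrates the coarsening step, assuming a multistamp represented as a map from (server, client) keys to real-time stamps, with a server stamp represented by a (server, None) key. The representation is illustrative; the bound of 10 matches the one used in our system.

    BOUND = 10  # maximum number of client-server stamps kept per server

    def coarsen(multistamp):
        # multistamp: {(server, client_or_None): stamp}; returns a coarsened copy.
        per_server = {}
        for (server, client), ts in multistamp.items():
            per_server.setdefault(server, []).append(((server, client), ts))

        result = {}
        for server, entries in per_server.items():
            if len(entries) > BOUND:
                # Replace all stamps for this server by a single server stamp carrying
                # the maximum timestamp; this conservatively covers every replaced entry
                # but loses the per-client precision discussed above.
                result[(server, None)] = max(ts for _, ts in entries)
            else:
                result.update(dict(entries))
        return result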

6.5 Eager Scheme

Lazy consistency can also be provided by an eager scheme that delays installing new object versions until all old versions have been removed from client caches. Participants send the list of invalidations to the coordinator in their vote. The coordinator sends the commit decision to the client and initiates an invalidation phase in which it sends invalidations to clients on behalf of participants. After it has received replies from these clients, it sends the commit decision to the participants. (The invalidation phase is similar to communication in other concurrency control schemes, e.g., callbacks in [6], except that in those schemes, communication occurs in the foreground.)

The eager scheme avoids fetch stalls entirely, but at the cost of extra messages and an extra commit phase. For example, in HOTSPOT an average of 20 clients are invalidated in every transaction (the number of invalid objects per client is low, e.g., 10), resulting in 20 extra roundtrip messages per transaction with the eager scheme. Since our results show that the stall rate is low using the lazy approach, we believe the extra cost of the eager approach is not justified.
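A sketch of this eager commit flow is given below; the participant and client method names (prepare, send_invalidations, wait_for_ack, and so on) are hypothetical, and logging, failure handling, and the remaining details of two-phase commit are elided.

    def eager_commit(transaction, participants, committing_client):
        # Phase 1: collect votes; each vote carries the participant's list of
        # invalidations (clients caching old versions of modified objects).
        votes = [p.prepare(transaction) for p in participants]
        if not all(v.ok for v in votes):
            for p in participants:
                p.abort(transaction)
            return "aborted"

        # Reply to the committing client right away.
        committing_client.send_commit_reply(transaction)

        # Invalidation phase: contact every client holding old versions, on behalf
        # of the participants, and wait for their acknowledgements.
        affected = {c for v in votes for c in v.invalidated_clients}
        for c in affected:
            c.send_invalidations(transaction)
        for c in affected:
            c.wait_for_ack(transaction)

        # Phase 2: only now tell the participants to install the new versions.
        for p in participants:
            p.commit(transaction)
        return "committed"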

7 Conclusions

This paper has presented a new scheme for guaranteeing that transactions in a client/server system view consistent state while they are running. The scheme is presented in conjunction with an optimistic concurrency control algorithm, but could also be used to prevent read-only transactions from conflicting with read/write transactions in a multi-version system [7].

The scheme is lazy about providing consistency. It simply gathers information as transactions commit, and propagates information to clients in fetch replies. Clients use information only when necessary — to ensure that the current transaction observes a consistent view.



The scheme is based on multipart timestamps. Today the utility of multistamps is limited because their size is proportional to system size and therefore they don’t scale to very large systems. Our scheme solves this problem by using real times in multistamps. Using time allows us to keep multistamps small by truncating them without loss of information and with minimal impact on performance: we remove older information that is likely to already be known while retaining recent information. We assume clocks are loosely synchronized; loss of synchronization affects only performance and not correctness.
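As an illustration, the sketch below shows one plausible truncation rule consistent with this description: when a multistamp grows beyond a size bound, the oldest entries are dropped and a threshold timestamp is raised to cover them conservatively. The entry representation and the bound of 5 are assumptions for the sketch (5 matches the multistamp size used in the PREF experiments of Section 6.4); the exact policy in the implementation may differ.

    MAX_ENTRIES = 5  # bound on multistamp size, as in the experiments above

    def truncate(entries, threshold):
        # entries: {key: real-time stamp}; returns (pruned entries, new threshold).
        # Any dependency older than the returned threshold is covered implicitly:
        # a reader must treat it conservatively, which is why pruning can only
        # cause extra stalls, never missed dependencies.
        if len(entries) <= MAX_ENTRIES:
            return dict(entries), threshold
        ordered = sorted(entries.items(), key=lambda kv: kv[1], reverse=True)
        kept = dict(ordered[:MAX_ENTRIES])
        # Raise the threshold to the newest stamp among the removed entries.
        new_threshold = max(threshold, ordered[MAX_ENTRIES][1])
        return kept, new_threshold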

The paper shows that our approach has extremely low costs. Cost takes the form of stalls — times when a client transaction must be delayed because some dependency information is needed to ensure consistency. Our experiments show that small multistamps are sufficient to make stalls rare in all workloads and extremely rare in realistic low-contention workloads. Furthermore, low stall rates translate into extremely low impact on the cost to run transactions.

Acknowledgements

This research was supported in part by the Advanced Research Projects Agency of the Department of Defense, monitored by the Office of Naval Research under contract N00014-91-J-4136.

We are grateful to Chandrasekhar Boyapati for carefully reading this paper and discussing various aspects of the scheme with us. We would also like to thank Andrew Myers and the referees for their helpful comments.

References

[1] A. Adya, R. Gruber, B. Liskov, and U. Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In SIGMOD, May 1995.

[2] D. Agrawal, A. J. Bernstein, P. Gupta, and S. Sengupta. Distributed Multi-version Optimistic Concurrency Control with Reduced Rollback. Distributed Computing, 2(1), 1987.

[3] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[4] K. Birman, A. Schiper, and P. Stephenson. Lightweight Causal and Atomic Group Multicast. ACM Transactions on Computer Systems, 9(3), August 1991.

[5] M. Blaze. Caching in Large-Scale Distributed File Systems. Technical Report TR-397-92, Princeton University, January 1993.

[6] M. Carey, M. Franklin, and M. Zaharioudakis. Fine-Grained Sharing in a Page Server OODBMS. In Proc. SIGMOD Int’l Conf. on Management of Data, pages 359–370. ACM Press, 1994.

[7] A. Chan and R. Gray. Implementing Distributed Read-Only Transactions. IEEE Transactions on Software Engineering, 11(2), 1985.

[8] A. Demers, K. Petersen, M. Spreitzer, D. Terry, M. Theimer, and B. Welch. Session Guarantees for Weakly Consistent Replicated Data. Technical Report CSL-94-9, Xerox PARC, Palo Alto, CA, 1994.

[9] H. Garcia-Molina and G. Wiederhold. Read-Only Transactions in a Distributed Database. ACM Transactions on Database Systems, 7(2), June 1982.

[10] S. Ghemawat. The Modified Object Buffer: A Storage Management Technique for Object-Oriented Databases. Technical Report MIT/LCS/TR-666, MIT Laboratory for Computer Science, September 1995.

[11] J. N. Gray. Notes on Database Operating Systems. In R. Bayer, R. Graham, and G. Seegmuller, editors, Operating Systems: An Advanced Course, number 60 in Lecture Notes in Computer Science, pages 393–481. Springer-Verlag, 1978.

[12] J. N. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc., 1993.

[13] R. Gruber. Optimism vs. Locking: A Study of Concurrency Control for Client-Server Object-Oriented Databases. PhD thesis, M.I.T., Cambridge, MA, 1997.

[14] T. Haerder. Observations on Optimistic Concurrency Control Schemes. Information Systems, 9(2):111–120, June 1984.

[15] P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. In Proc. of 19th Int’l Symp. on Computer Architecture, pages 13–21, May 1992.

[16] J. J. Kistler and M. Satyanarayanan. Disconnected Operation in the Coda File System. In SOSP, 1991.

[17] R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat. Providing High Availability Using Lazy Replication. ACM Transactions on Computer Systems, 10(4), November 1992.

[18] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Comm. of the ACM, 21(7), July 1978.

[19] B. Liskov et al. Safe and Efficient Sharing of Persistent Objects in Thor. In Proc. SIGMOD Int’l Conf. on Management of Data, 1996.

[20] B. Liskov, R. Scheifler, E. Walker, and W. Weihl. Orphan Detection. Programming Methodology Group Memo 53, Laboratory for Computer Science, MIT, February 1987.

[21] B. Liskov. Practical Uses of Synchronized Clocks in Distributed Systems. Distributed Computing, 6, August 1993.

[22] D. L. Mills. Network Time Protocol (Version 3) Specification, Implementation and Analysis. Network Working Group Report RFC 1305, March 1992.

[23] D. L. Mills. The Network Computer as Precision Timekeeper. In Proc. Precision Time and Time Interval (PTTI) Applications and Planning Meeting, December 1996.

[24] D. Muntz and P. Honeyman. Multi-level caching in distributed file systems - or - your cache ain’t nuthin’ but trash. In USENIX Winter Conference, January 1992.

[25] D. S. Parker et al. Detection of Mutual Inconsistency in Distributed Systems. IEEE Transactions on Software Engineering, SE-9(3), May 1983.

[26] G. Samaras, K. Britton, A. Citron, and C. Mohan. Two-Phase Commit Optimizations in a Commercial Distributed Environment. Distributed and Parallel Databases, 3, October 1995.

[27] M. Spasojevic and M. Satyanarayanan. An Empirical Study of a Wide-Area Distributed File System. ACM Transactions on Computer Systems, 14(2), May 1996.

[28] Transaction Processing Performance Council (TPC). TPC-A Standard Specification, Revision 1.1; TPC-C Standard Specification, Revision 1.0. Shanley Public Relations, 1992.

[29] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In SOSP, 1995.

[30] W. E. Weihl. Distributed Version Management for Read-only Actions. IEEE Transactions on Software Engineering, SE-13(1), January 1987.
