

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-4, NO. 3, MAY 1978

The Concurrency Control Mechanism of SDD-1: A System for Distributed Databases

(The Fully Redundant Case)

PHILIP A. BERNSTEIN, JAMES B. ROTHNIE, JR., NATHAN GOODMAN, AND CHRISTOS A. PAPADIMITRIOU

Abstract-SDD-1, A System for Distributed Databases, is a distributed database system being developed by Computer Corporation of America (CCA), Cambridge, MA. SDD-1 permits data to be stored redundantly at several database sites in order to enhance the reliability and responsiveness of the system and to facilitate upward scaling of system capacity. This paper describes the method used by SDD-1 for updating data that are stored redundantly.

Redundant updating can be costly because it may potentially involve extensive intercomputer communication overhead in order to lock all copies of data being updated. The method described here avoids this overhead by identifying cases in which it is not necessary to perform this global database locking.

The identification of transactions that do not require global locking is based on a predefinition of transaction classes performed by the database administrator using an analysis technique described herein. The classes defined are used at run time to decide what level of synchronization is needed for a given transaction. It is important to note that this predefinition activity in no way limits the transactions that the system can accept; it merely permits more efficient execution of those types of transactions that were anticipated.

Index Terms-Concurrency, data base, database, data base management, database management, distributed data base, distributed database, distributed data base management, distributed database management, distributed processing, distributed systems, redundant data, synchronization.

Fig. 1. General system architecture of SDD-1.

Manuscript received January 3, 1978; revised February 3, 1978. This research was supported by the Advanced Research Projects Agency of the Department of Defense under Contract N00039-77-C-0074, ARPA Order 3175-6. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government.

P. A. Bernstein is with the Department of Computer Science, Harvard University, Cambridge, MA, and the Computer Corporation of America, Cambridge, MA 02139.

J. B. Rothnie, Jr., and N. Goodman are with Computer Corporation of America, Cambridge, MA 02139.

C. A. Papadimitriou is with the Department of Computer Science, Harvard University, Cambridge, MA.

I. INTRODUCTION

This paper describes a technique for updating data stored redundantly in a network of database management systems (DBMS). The technique is being implemented in a system called SDD-1 (A System for Distributed Databases) under development at Computer Corporation of America. This technique guarantees that updates preserve the consistency of the database while trying to minimize intercomputer synchronization costs.

SDD-1 consists of a collection of database sites interconnected through a communications network (Fig. 1). Each of the SDD-1 database sites contains a portion of the overall database. Some parts of the database may be stored redundantly at several sites. This improves the reliability and responsiveness of the system [7], [19], [20]. Users enter transactions at any database site and need not be concerned with the location of data. SDD-1 manages the retrieval of data that are dispersed through the network and the updating of all copies of data that are stored redundantly. Thus the users of SDD-1 are able to regard the database conceptually as a single centralized resource. SDD-1 is described further in [24], [30].

For expository convenience the remainder of this paper assumes a simpler system architecture in which each database site contains a copy of the entire database. The approach presented for this so-called "fully redundant" case extends to general configurations.

The SDD-1 architecture offers three major advantages over physically centralized database systems:

0098-5589/78/0500-0154$00.75 © 1978 IEEE


BERNSTEIN et al.: CONCURRENCY CONTROL MECHANISM OF SDD-1

1) Reliability: Since multiple copies of data are stored, it is possible for critical portions of a database to remain accessible even if a site fails or becomes inaccessible.

2) Responsiveness: Data may be stored near to where it is used frequently, providing faster access and lower communication costs.

3) Incremental upwards scaling: SDD-1 supports a large database on a set of moderate size computers instead of on a single large site. As the database grows in size or usage, new sites may be added. By comparison, centralized systems are often difficult to upgrade without major service disruption.

One of the key problems in a distributed DBMS is to perform redundant updates efficiently. This is the problem that we address in this paper. The simplest method for controlling redundant updates is to lock those portions of the database being read or written by active transactions. In a distributed DBMS this often introduces an intolerable delay as locking information is propagated in the database network.

More efficient methods for locking in a distributed DBMS have been proposed [1], [8], [18], [22], [28], [29]. While the solutions differ in numerous ways, they all share the premise that every transaction in a distributed DBMS requires synchronization that is as strong as database locking. Even with the improvements suggested by these authors, uniform locking is quite time consuming. For further discussion of these approaches, see [25].

The method presented here differs qualitatively from these previous solutions because it does not assume that every transaction requires synchronization as strong as global locking. The method is based on a formal analysis of how distributed transactions interfere with each other, and how this interference can be avoided. The central results of this analysis are the following.

1) Global database locking is a much stronger mechanism than is needed for correct distributed database operation.

2) Different types of transactions need different levels of synchronization. Some transactions only need local locking on a site by site basis, while others require synchronization mechanisms almost as strong as global locking.

3) The "levels of synchronization" needed by different types of transactions can be expressed as simple algorithms, "protocols," that are executed when a transaction of a given type is entered into the system. Four synchronization protocols are presented in this paper which suffice to handle all possible transactions.

4) The decision as to which synchronization protocol must be used to execute each type of transaction can be made off-line, for example during database design. The decision process is driven by simple rules presented in this paper.

5) The results of the decision process can be compiled into tables that are used at run time to select the correct protocol for a given transaction. The run time function that uses these tables is simple, can execute rapidly, and does not require any intercomputer communication.

The effectiveness of the redundant update methodology presented here depends on the percentage of transactions for a given application that may run under each of the synchronization protocols. Preliminary studies suggest that for typical database applications the redundant update methodology of SDD-1 will prove to be quite efficient.

The remainder of this paper is organized as follows: Section II explains the redundant update problem further, stating the criteria for database consistency that the redundant update algorithm must uphold. Section III describes the transaction processing model that we employ for the fully redundant case of SDD-1. Section IV explains how SDD-1 preserves "mutual consistency" of redundant database copies. Section V explains how SDD-1 preserves "internal database consistency." Sections VI and VII present the analysis supporting the technique described in Section V and outline a proof of its correctness.

II. THE REDUNDANT UPDATE PROBLEM

The redundant update problem is to develop techniques for updating redundantly stored data that 1) preserve database consistency, and 2) minimize intercomputer synchronization. In a redundant DBMS, the notion of database consistency has two aspects [22], [29]:

1) mutual consistency of the redundant copies; and
2) internal consistency of each copy.

Mutual consistency requires that all copies of the database be identical. Though it is not possible for the copies to be identical at all times, they must converge to the same final state if all user activity were to cease.

Internal consistency requires that each copy of the database remain consistent within itself just as a nonredundant database must. Internal database consistency involves two subsidiary concepts:

1) semantic integrity, and
2) serializability.

The "semantic integrity" concept dictates that the database accurately reflect the enterprise being modeled. Verifying that transactions preserve integrity has been studied quite widely for many years and remains a very active research area [13]. Though of great importance, semantic integrity is not strictly a distributed database issue, and we shall merely assume that all transactions entered into the system preserve semantic integrity.

The issue then is to ensure that integrity is preserved even if many transactions execute concurrently. In SDD-1 it will usually be the case that many transactions are in progress at the same time, both because there are multiple sites and because individual sites are timeshared. Without protection mechanisms, it is possible that the effect of the concurrent transactions could be erroneous even though each transaction worked correctly (see Fig. 2). To avoid this possibility, the SDD-1 redundant update methodology ensures that the total set of concurrent transactions is serializable [3], [4], [9], [10], [14]. This property requires that concurrent execution of a set of transactions be equivalent to executing those same transactions one at a time, i.e., serially. By "equivalent" we mean that each transaction in the nonserial execution produces the same output as it would have, had the transactions been run one at a time. Serializability requires only that there exist some serial order equivalent to the interleaved operation. There may, in fact, be several such orderings.

Given two transactions:

    t1: x := x - 1
    t2: x := x + 1

and suppose x = 0. Suppose t1 executes at site s1 at the same time as t2 executes at s2. Then t1 will set x := -1, and t2 will set x := 1. However, the correct result of executing t1 and t2 should have been x = 0.

Fig. 2. Sample of incorrect database operation.
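The interference in Fig. 2 can be replayed in a few lines. The following is only an illustrative sketch: the dictionary TRANSACTIONS and the helper names are hypothetical, and serializability is simplified here to "the final value of x matches some serial order", whereas the paper's definition compares the output of every transaction.

```python
from itertools import permutations

# Hypothetical encoding of the two transactions of Fig. 2.
TRANSACTIONS = {
    "t1": lambda x: x - 1,
    "t2": lambda x: x + 1,
}


def serial_results(initial):
    """Final value of x for every serial (one at a time) order."""
    results = set()
    for order in permutations(TRANSACTIONS):
        x = initial
        for name in order:
            x = TRANSACTIONS[name](x)
        results.add(x)
    return results


def unsynchronized_results(initial):
    """Each site runs its transaction against its own stale copy of x."""
    return {name: fn(initial) for name, fn in TRANSACTIONS.items()}


def matches_some_serial_order(final_value, initial):
    """Simplified serializability test on the final database state only."""
    return final_value in serial_results(initial)
```

With x = 0 every serial order ends at 0, while the unsynchronized run leaves -1 at one site and 1 at the other; neither value is reachable by any serial order.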

Earlier we stated that each transaction is assumed to preserve semantic integrity when run individually. Thus, each transaction when run alone maps one consistent database state into another consistent state. It follows by induction that a serial sequence of noninterleaved transactions will likewise preserve consistency. Since a serializable history of operation is equivalent to some serial ordering, the serializable history results in a consistent database state as well.

The concept of serializability is central to the redundant update problem. Virtually all of the complexity in the problem is motivated by the need to preserve this property.

It is important to note before moving on that SDD-1 maintains serializability with respect to both database state and database output. Our treatment unifies both aspects by treating output devices as write-only memories that are part of the database state [23]. Retrieval transactions are thus modeled as special "updates" for which the "data items" "updated" are really output devices that can only be read by external (human) users.

III. TRANSACTION PROCESSING MODEL

For the fully redundant case of SDD-1, transactions are executed in two stages. First each transaction is executed locally at the site where it was initiated. This execution is local in that it only accesses the copy of the database stored at that site. During local execution intersite synchronization may be required, but no transfer of database entries occurs between sites. The system keeps a list of changes to the database during local execution. When execution completes, the local site broadcasts the list of database changes to the rest of the database network.

L_t^m denotes the local execution of transaction t at site m. U_t^n indicates the processing of the update message associated with transaction t at receiving site n. A single transaction t generates one L action followed by as many U actions as there are other sites in the system.

It is important to note that between the "time" at which an L_t executes and the "time" at which each of the corresponding U_t's executes, the model permits other L's and U's to execute also.¹
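The two-stage model can be pictured with a small simulation. This is a sketch under stated assumptions, not the SDD-1 implementation: the Site class, the helper functions, and the "price" item are hypothetical, a transaction is modeled as a function from a snapshot of the local copy to a change list, and timestamps (Section IV) are left out.

```python
class Site:
    """One database site holding a full copy of the database."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)

    def local_execute(self, transaction):
        """L action: read only the local copy and record the changes made."""
        changes = transaction(dict(self.data))
        self.data.update(changes)
        return changes

    def process_update(self, changes):
        """U action: install a change list received from another site."""
        self.data.update(changes)


def run_locally_and_broadcast(transaction, origin, sites):
    """Run L at the origin and return one pending U message per other site."""
    changes = origin.local_execute(transaction)
    return [(site, changes) for site in sites if site is not origin]


def deliver(messages):
    """Process each pending U message at its receiving site."""
    for site, changes in messages:
        site.process_update(changes)


s1, s2 = Site("s1", {"price": 5}), Site("s2", {"price": 5})
pending = run_locally_and_broadcast(
    lambda db: {"price": db["price"] + 1}, s1, [s1, s2]
)
# Between L at s1 and U at s2 the copies differ; other L's and U's may run here.
stale = s2.data["price"]
deliver(pending)
```

Keeping the U messages in a list that is delivered later makes the gap between an L action and its U actions explicit, which is where interleaving with other transactions can occur.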

¹ Intuitively, the argument being made here can be understood by pretending that all clocks in the system are synchronized. A precise statement of this point can be made in terms of "global logs," presented in Section VII.

Fig. 3 shows an example of this type of interleaved execution history. Many possible interleaved execution histories do not denote serializable system operation. The mechanisms used to prevent such histories are presented in Section V. Although L's and U's for different transactions may be interleaved, the model does assume that each L and U action is itself atomic. To enforce this assumption, each L action and each U action is required to follow a local locking procedure within the database site at which it executes.

Fig. 3. Interleaved execution history.

Our methodology groups together similar transactions into transaction classes for the purpose of specifying the "level of synchronization" for each type of transaction. A transaction class C_s is associated with a given site s, and is characterized by two sets of data items:

1) R(C_s), a so-called "read-set"; and
2) W(C_s), a so-called "write-set."

C_s is the set of transactions introduced at site s that read exclusively from the read-set R(C_s) and write exclusively into the write-set W(C_s). To facilitate computation of class membership, read-sets and write-sets are defined by "simple" predicates [9]. Table I illustrates some classes defined in this way. Choosing classes is, from the point of view of theory, arbitrary. From a practical point of view, however, specification of transaction classes strongly impacts system performance. A "good" specification of classes will permit most transactions to run using efficient synchronization. However, a general method for selecting good or optimal transaction classes is beyond the scope of this paper.

TABLE I
CLASSES DEFINED BY SIMPLE PREDICATES

Read set                      Write set              Class   (Comments)

Inventory Relation            Inventory Relation     5       Transactions that
projected on description      projected on price             modify prices
and price

Inventory Relation            Inventory Relation     27      Transactions that
projected on price, and       projected on price             affect high priced
restricted to price > $5.00                                  items only

Inventory Relation            User's terminal        63      Retrieval
projected on description,                                    transaction
size, and price
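Class membership as defined above reduces to two containment tests. The sketch below is illustrative only: it assumes read-sets and write-sets can be approximated by sets of attribute names (so the restriction predicate of class 27 in Table I is not modeled), and the names TransactionClass, classes_of, and the "inventory.*" labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransactionClass:
    class_id: int
    site: str
    read_set: frozenset
    write_set: frozenset

    def contains(self, site, reads, writes):
        """True if a transaction entered at `site` reads only from the
        read-set and writes only into the write-set of this class."""
        return (
            site == self.site
            and set(reads) <= self.read_set
            and set(writes) <= self.write_set
        )


def classes_of(site, reads, writes, classes):
    """Ids of all classes the transaction belongs to (possibly none)."""
    return [c.class_id for c in classes if c.contains(site, reads, writes)]


# Classes 5 and 63 of Table I, with columns named by hypothetical labels.
CLASSES = [
    TransactionClass(
        5, "s1",
        frozenset({"inventory.description", "inventory.price"}),
        frozenset({"inventory.price"}),
    ),
    TransactionClass(
        63, "s1",
        frozenset({"inventory.description", "inventory.size", "inventory.price"}),
        frozenset({"terminal"}),
    ),
]
```

A transaction that matches no class is what the paper later calls "unanticipated".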

IV. ENSURING MUTUAL CONSISTENCY

The mutual consistency of database copies in SDD-1 is achieved by the use of timestamps. Whenever a data item is modified it is stamped with the timestamp of the updating transaction. The timestamp of a transaction t is denoted TS(t) and is defined to be the time at which L_t^m ran according to the clock in site m.

None of the SDD-1 synchronization mechanisms require that clocks running in different sites be synchronized. However, for reasons of efficiency, it is desirable that clocks at different sites be kept reasonably close to each other. In [17] a method of synchronizing clocks in a network is described that involves pushing ahead a local clock if a message with a future timestamp is received. This simple method will work well enough for the purposes of SDD-1. The SDD-1 mechanism for mutual consistency requires all timestamps in the network to be unique; it will not work if for any two transactions t1 and t2, TS(t1) = TS(t2).² Uniqueness of timestamps is easy to provide at a single site: once the clock has been used to assign a timestamp, it cannot be read again until the next tick. To guarantee uniqueness across all sites of the network, one need only append a unique (and unchanging) site-number to the low order end of each timestamp. Thus, timestamps will always differ in the low order digits at least.³

The timestamp of a transaction t, TS(t), is propagated to each site as part of t's update messages U_t^n. For each data item designated by U_t^n, site n changes its local copy if and only if the timestamp of that local copy is older than TS(t). If the modification does occur, the timestamp of the local copy of the data item is set to TS(t). This test is performed on a data item by data item basis; some data items in the update message may result in write operations while others may not. The effect of the timestamping mechanism is to ensure that U operations appear to occur at each site in the same order as the corresponding L actions occurred. If L_t2 occurs after L_t1, then this mechanism ensures that no update of t2 (i.e., no U_t2^n) can be overwritten by any update of t1.

To prove that mutual consistency is preserved, we must prove that all copies of each data item would converge to the same value were all transaction processing to cease. Suppose that all transaction processing did cease. For each data item X we construct a set T_X of transactions that have updated X in the history of the database system. If T_X is empty then X was never updated. Otherwise, there must be a transaction t_X in T_X such that TS(t_X) is maximum in T_X. Hence, once the U for t_X has been processed at all sites, the value of X will be identical at all sites, and will never subsequently change.
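The timestamp rules of this section can be sketched as follows. The code is an assumption-laden illustration: the classes Clock and Copy are hypothetical, a timestamp is modeled as a (time, site_number) pair so that the site number acts as the low order part, and the clock is a simple counter rather than a real clock.

```python
class Clock:
    """Per-site clock producing network-wide unique timestamps."""

    def __init__(self, site_number):
        self.site_number = site_number
        self.time = 0

    def next_timestamp(self):
        self.time += 1  # never hand out the same tick twice at one site
        return (self.time, self.site_number)  # site number is the low order part

    def observe(self, timestamp):
        """Push the local clock ahead if a message carries a later time."""
        self.time = max(self.time, timestamp[0])


class Copy:
    """One site's copy of the database: item -> (timestamp, value)."""

    def __init__(self):
        self.values = {}

    def process_update(self, timestamp, changes):
        """U action: overwrite an item only if the local copy is older."""
        for item, value in changes.items():
            current = self.values.get(item)
            if current is None or current[0] < timestamp:
                self.values[item] = (timestamp, value)


clock_1, clock_2 = Clock(1), Clock(2)
ts_a = clock_1.next_timestamp()
ts_b = clock_2.next_timestamp()

# Two copies receive the same two updates in opposite orders.
copy_m, copy_n = Copy(), Copy()
copy_m.process_update(ts_a, {"x": "a"})
copy_m.process_update(ts_b, {"x": "b"})
copy_n.process_update(ts_b, {"x": "b"})
copy_n.process_update(ts_a, {"x": "a"})
```

Both copies end with the value written under the larger timestamp, regardless of arrival order, which is the convergence argument given in the proof sketch above.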

² The internal consistency mechanism described in Section V could also fail if timestamps were not unique; the mechanism would be susceptible to deadlock if that were allowed.

³ This scheme may appear to give priority to sites on the basis of their site numbers; e.g., site 000 timestamps tend to be smaller than those generated by site 999, say. However, due to the way timestamps are used by SDD-1, this implicit priority induces very little selective advantage.

V. ENSURING INTERNAL CONSISTENCY

Maintaining the internal consistency of each database copy requires synchronization mechanisms beyond timestamping. As stated in Section II, this paper limits its attention to one aspect of internal consistency: ensuring serializability. The mechanism used by SDD-1 for this purpose is unique and constitutes the principal advantage of SDD-1's redundant update methodology over all previous solutions. There are two central components in the SDD-1 method for ensuring serializability:

1) synchronization protocols, and
2) a protocol selector function.

The synchronization protocols are algorithms that specify messages a site s must send and/or receive before it may process a given L_t, and messages that s must send upon completion of L_t. Four protocols are defined for use in SDD-1; these protocols are numbered P1 through P4 and provide varying levels of synchronization and control over L and U actions. (P1 offers the least control and protocol P4 the most.) These protocols also vary significantly in cost, ranging from P1, which is quite efficient and introduces essentially no delay in processing transactions, to P4, which is as costly as global database locking. The protocol selector function PS(s,t) tells which protocol to use in executing transaction t at site s; i.e., it tells site s what intersite coordination it must carry out in processing L_t. The operation of the protocol selector function is described later. These protocols are based on a formal analysis of transaction processing in a distributed system. Each protocol eliminates some interference that can exist between types of transactions. The protocol selector function makes its choice on the basis of a formal characterization of the potential interference between transaction types. While a more formal analysis supporting the protocol definitions appears in Sections VI and VII, it is possible to give some intuition here as to what kinds of transactions use each protocol.

The most efficient protocol P1 is used for two kinds of common transactions: ones that reference data of "local interest," such as checking out merchandise through a point-of-sale terminal; and ones whose output is not a function of the database state, such as a supermarket ordering 200 cases of peas. The transactions that can use P1 are clearly quite simple and limited; it is our belief though that most transactions, and in particular most update transactions, in many practical applications are of this nature.

Protocols P2 and P3 are used for more complicated transactions. Protocol P3 is used principally for update transactions that do not fall into the P1 mold; protocol P2 is a weaker version of P3 and is used for retrieval transactions that cannot be handled by P1. Protocol P3 is a strong enough protocol to handle any kind of transaction; but by using P1 and P2 more efficient operation is achieved.

The final protocol P4 is a variation of P3 used to handle "unanticipated" transactions. PS(s,t) operates by consulting a table, constructed ahead of time, indexed by "transaction class" (defined in Section III). An "unanticipated" transaction is one which is not included in this table. Protocol P4 ensures that an unanticipated transaction cannot interfere with any other transactions no matter what protocol they run.

All protocols make three assumptions. The first assumption is that communication between pairs of sites is pipelined, i.e.,

1) for every pair of sites m and n, m transmits messages to n in timestamp order;
2) the communication medium or subsystem always delivers messages in the order they were sent.

These two pipelining mechanisms ensure that for each [m, n] pair, n will receive messages from m in timestamp order. However, we make no assumptions about the order in which messages from two sites are received by a third one. The second underlying assumption of all four protocols is that L and U are atomic actions, as stated in Section III. Local locking at each database site enforces this assumption. The third assumption is that when an L_t completes, the corresponding U_t messages are transmitted to all sites immediately. Given these three mechanisms, we can describe simplified versions of the SDD-1 protocols.
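The table-driven behavior of PS(s,t) described above can be sketched as a dictionary lookup with P4 as the fallback. The table contents are purely hypothetical (this part of the paper does not say which class gets which protocol), and the sketch assumes the class of the transaction has already been determined, with None standing for "matches no predefined class".

```python
# Hypothetical table built off-line: (site, class id) -> protocol name.
PROTOCOL_TABLE = {
    ("s1", 5): "P1",
    ("s1", 27): "P3",
    ("s1", 63): "P2",
}


def protocol_selector(site, class_id, table=PROTOCOL_TABLE):
    """PS(s, t) for a transaction t already mapped to `class_id` at site s.

    A transaction whose class is not in the table is "unanticipated"
    and falls back to protocol P4.
    """
    return table.get((site, class_id), "P4")
```

The lookup touches only local data, which matches the statement in Section I that the run time selection needs no intercomputer communication.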

A. Protocol P1

Protocol P1 introduces no synchronization beyond that implicitly contained in the pipelining and local locking mechanisms described above. Specifically, site m performs the following steps in executing transaction t.

1) Site m follows a local locking procedure to ensure that L_t is atomic. The locking procedure entails the setting of sharable locks on t's read-set and exclusive locks on t's write-set. The locks have no effect on actions at sites other than site m.

2) Site m executes t locally, i.e., L_t occurs.

3) When L_t completes, site m broadcasts the resulting U_t messages to all other sites in the system.⁴

4) Site m then releases the locks from step 1) and informs the user that t has been executed.

Note that this protocol entails no intersite communication prior to the execution of L_t, and that the communication following L_t consists solely of broadcasting the U_t's to the rest of the network. Consequently the execution of transactions under this protocol is quite fast.
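The four steps of P1 can be lined up against a short sketch. Several simplifications are assumptions of the sketch, not of the paper: a single per-site mutex stands in for the sharable and exclusive locks on t's read-set and write-set, broadcasting is modeled by appending to an outbox list, and the names P1Site and execute_p1 are hypothetical.

```python
import threading


class P1Site:
    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.lock = threading.Lock()  # simplification: one lock per site
        self.outbox = []              # U messages handed to the network

    def execute_p1(self, transaction, timestamp, other_sites):
        with self.lock:                                # step 1: local locking
            changes = transaction(dict(self.data))     # step 2: L_t occurs
            self.data.update(changes)
            for site in other_sites:                   # step 3: broadcast U_t
                self.outbox.append((site, timestamp, changes))
        # step 4: the lock is released here and the user is informed
        return changes


site_m = P1Site("m", {"stock": 10})
result = site_m.execute_p1(
    lambda db: {"stock": db["stock"] - 1}, (1, 1), ["n1", "n2"]
)
```

Nothing in execute_p1 waits on another site, which is why the text calls this protocol quite fast.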

B. Protocols P2 and P3Protocols P2 and P3 are virtually identical to each other,

differing only in one parameter. To execute transaction t atsite m using protocol P2 or P3, the following steps areexecuted.

1) Site m picks a timestamp called TStget. For P2, TStargetis the most recent timestamp on any data item in m's localcopy of t's read-set. For P3, TStrget is any time, providedonly that m has not yet processed an L or a U with timestampgreater than the selected time. This time must also be thetimestamp assigned to t. Note that by using the clock syn-chronization algorithm in Section IV the current time at sitem can always be picked as TStarget for P3.Having picked TStget, protocols P2 and P3 execute precisely

the same algorithm, steps 2)-6) below.2) Site m sends a P2/3-REQUEST message to all sites indicat-

ing TStarget.3) When a site n receives the P2/3-REQUEST message it

operates as follows.a) If site n is idle it replies with an ACCEPT message with

4Elsewhere [41 we show that this procedure does not really need toconsider all sites. The same comment applies to every step in eachprotocol that refers to "all sites."

timestamp equal the maximum of TStarget or the current timeat site n. If TSwget is the greater, n immediately sets itsclock to TSwget also.

b) If site n is processing tl with TS(tl) < TStarget, it holds the P2/3-REQUEST until tl completes and its U messages are sent out. At this point, site n is idle and case a) applies.

c) If site n is processing a transaction tl with TS(tl) > TStarget, it replies with an ACCEPT message whose timestamp is TS(tl).

4) In parallel with step 3), site m waits until it receives either an ACCEPT or a U whose timestamp exceeds TStarget from each site n. U messages received during this wait are executed by m, except those for which

a) TS(U) > TStarget; and
b) the items updated by U are in t's read-set.

These U's are held pending the execution of t.

5) When all necessary messages have been received, site m executes Lt exactly as in protocol P1; i.e., m executes steps 1)-4) of protocol P1 at this point.

6) Finally, site m resumes executing U messages, including any U messages which were "held" during the wait state of step 4).

The key part of the P2/P3 algorithm is the waiting activity in step 4). Site m waits for U or ACCEPT messages from each other site n, such that the timestamp of each message is bigger than TStarget. Because messages are pipelined, this ensures that m has received and processed all messages whose timestamps are less than TStarget from all other sites, i.e., it ensures that all data read by t are equally up-to-date as of TStarget. In P2, this wait only ensures that t's read-set is up-to-date as of some time in the past; in P3, though, step 4) ensures that the data read by t are up-to-date as of the current time, i.e., the time at which t executes. Thus if t's read-set includes a datum X, and if X is updated by a transaction t0, and if t0 occurs before t, then protocol P3 guarantees that site m will not execute t until the update to X is received there. This is the maximum degree of synchronization that can ever be needed by sets of transactions; hence P3 is sufficient by itself to ensure the serializability of all transactions in a distributed database system.

Importantly, the P2/P3 algorithm is free of intersite deadlocks. This is because

1) P2/3-REQUESTS cause all nonpreemptive resources (i.e., portions of the database) needed by each transaction to be claimed in advance; and

2) the use of timestamps ensures that no cyclic dependencies can be created among waiting transactions during the resource claiming phase. This property holds because timestamps are globally unique and form the identical total order at all sites.

The sufficiency of these two conditions for avoiding deadlock follows from the work of [12], [16] (re: prior claiming of resources) and [5], [27] (re: absence of cyclic dependencies in deadlock graphs).
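The waiting activity of step 4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Msg` record, the `wait_step` function, and the idea of replaying an arrival-ordered list of messages are hypothetical names introduced here, not part of SDD-1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Msg:
    kind: str                        # "ACCEPT" or "U" (hypothetical encoding)
    site: str                        # sending site n
    ts: int                          # timestamp carried by the message
    items: frozenset = frozenset()   # items updated (meaningful for U only)

def wait_step(messages, other_sites, ts_target, read_set):
    """Replay step 4) at site m over messages in arrival order.

    Returns (ready, executed, held):
      ready    - every other site has sent an ACCEPT or U with ts > ts_target
      executed - U messages applied during the wait
      held     - U messages postponed until after Lt has run
    """
    pending = set(other_sites)
    executed, held = [], []
    for msg in messages:
        if not pending:              # wait is over; m would now execute Lt
            break
        if msg.kind == "U":
            # Hold only U's that are newer than the target AND touch t's read-set.
            if msg.ts > ts_target and msg.items & read_set:
                held.append(msg)
            else:
                executed.append(msg)
        if msg.ts > ts_target:
            pending.discard(msg.site)
    return not pending, executed, held
```

For example, with `ts_target = 10` and read-set `{"x"}`, a U from s2 stamped 5 is executed, a U from s2 stamped 12 that updates `x` is held, and an ACCEPT from s3 stamped 11 completes the wait.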

C. Protocol P4

Protocol P4 involves extensive synchronization and is expensive. Its purpose is to handle transactions that were not anticipated ahead of time. The protocol operates as follows to execute transaction t at site m.

1) Site m chooses a time in the future, TSfuture, that will be used for transaction t. TSfuture should be large enough so that no site has yet processed a transaction with a timestamp greater than TSfuture.

2) Site m sends a P4-REQUEST message to all other sites indicating TSfuture.

3) When a site n receives the P4-REQUEST message it operates as follows:

a) if it has finished or is currently processing tl such that TS(tl) > TSfuture, it sends a REJECT message to site m;

b) otherwise it sends an ACCEPT to site m and immediately sets its clock to TSfuture. (Thus all of n's subsequent transactions will have timestamps larger than TSfuture.) Also, site n agrees to execute its next transaction using protocol P3 (if it is an anticipated transaction) or using protocol P4 (if it is an unanticipated transaction).

4) Meanwhile, site m waits to receive ACCEPT messages from every other site. If a REJECT is received, then m restarts t from step 1), choosing a new TSfuture.

5) Site m executes Lt using the P2/P3 algorithm with TStarget = TSfuture.

The problem posed by an unanticipated transaction tu is that some anticipated transaction ta may be able to run either P1 or P2 when tu is ignored, but must run P3 when tu is taken into account. Protocol P4 guards against this by ensuring that all anticipated transactions that run concurrently with tu run P3. This protects every ta against tu since P3 is strong enough to correctly synchronize arbitrary sets of transactions.
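A minimal sketch of steps 1)-4) of protocol P4 follows. It assumes a synchronous, in-memory stand-in for message exchange; the `Site` record, its field names, and the fixed `bump` used to pick a later TSfuture after a REJECT are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Site:
    clock: int = 0               # local timestamp source
    max_ts_seen: int = 0         # largest timestamp finished or in progress
    force_strong: bool = False   # next transaction must run P3 (or P4)

def handle_p4_request(site, ts_future):
    """Step 3) at a receiving site n."""
    if site.max_ts_seen > ts_future:
        return "REJECT"          # case a): a later transaction already ran
    site.clock = ts_future       # case b): later local timestamps exceed TSfuture
    site.force_strong = True
    return "ACCEPT"

def run_p4(other_sites, ts_future, bump=10, max_rounds=100):
    """Steps 1), 2), 4) at site m: retry with a later TSfuture after a REJECT."""
    for _ in range(max_rounds):
        replies = [handle_p4_request(s, ts_future) for s in other_sites]
        if all(r == "ACCEPT" for r in replies):
            return ts_future     # step 5) would now run P2/P3 with this target
        ts_future += bump
    raise RuntimeError("no acceptable TSfuture found")
```

With two other sites whose largest processed timestamps are 5 and 25, an initial TSfuture of 20 is rejected by the second site, and the retry at 30 is accepted by both.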

VI. ANALYSIS OF SERIALIZABILITY

The correctness of the SDD-1 protocols follows from a detailed analysis of serializability. The central component of this analysis is an abstraction called a global log, which is in turn constructed from a set of local logs.

The execution history of L's and U's at a single site can be represented as a serial sequence of events similar to the "behaviors" introduced in [14], [1 ] and the "schedules" of [9]. Each execution history for a single site is called a local log. A global log is the integration of all the local logs in the system into a single entity. The global log contains precisely the same information as the set of local logs, but is more easily analyzed. Global logs are constructed in two steps. First, local logs are constructed for each site by listing a possible sequence of actions that could be performed at the site. Then the local logs are combined into a global log in such a way that the effect of the global log on each site is the same as that of the original local logs. A single global log thus represents one possible execution history for all the sites in the system.

This integration is accomplished by a merge of the local logs obeying the following constraints.

1) The ordering of actions from each local log is preserved. That is, the combination is a merge.

2) L actions from different local logs are placed into the global log in timestamp order.

3) All update messages Ut from a transaction t appear in the log after Lt.
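The three merge constraints can be checked mechanically. The sketch below assumes a hypothetical encoding in which every action is a `(kind, transaction, site)` tuple and `ts` maps each transaction to its timestamp; neither the encoding nor the function name comes from the paper.

```python
def is_valid_global_log(global_log, local_logs, ts):
    """Check the three merge constraints on a candidate global log.

    global_log: list of (kind, txn, site) tuples, kind is "L" or "U"
    local_logs: dict site -> list of the same tuples, in local order
    ts:         dict txn -> timestamp
    """
    # The global log must contain exactly the actions of the local logs.
    if len(global_log) != sum(len(log) for log in local_logs.values()):
        return False
    # 1) Restricting the global log to one site reproduces that local log.
    for site, local in local_logs.items():
        if [a for a in global_log if a[2] == site] != list(local):
            return False
    # 2) L actions appear in timestamp order.
    l_stamps = [ts[txn] for kind, txn, _ in global_log if kind == "L"]
    if l_stamps != sorted(l_stamps):
        return False
    # 3) Every Ut appears after its Lt.
    seen = set()
    for kind, txn, _ in global_log:
        if kind == "L":
            seen.add(txn)
        elif txn not in seen:
            return False
    return True
```

A two-site history in which each site runs one transaction and then applies the other's update passes the check; swapping the two L actions against their timestamps fails constraint 2).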

A. Serially Reproducible Global Logs

Serially reproducible (SR) global logs represent serializable sequences of transactions in the distributed system. SR logs are equivalent to serial logs, as we define below.

Definitions:

1) Database state: Informally, a database state, DS, is the collection of all data values at all sites. Formally, it is the collection of pairs (s, state_s) where s is a site and state_s is the collection of relations stored at s. For the purpose of this definition, each copy of each relation is viewed as a distinct relation.

2) Equivalent global logs: Since each L or U action may change the database state, a global log, which is a sequence of these actions, may likewise change the database state. A global log Gi defines a function gi, which maps one database state DS1 into another state DS2. That is

gi(DS1) = DS2.

Two global logs G1 and G2 are said to be equivalent if

a) they include the same L's and U's, and
b) for all DSi, g1(DSi) = g2(DSi).

(As stated earlier, we treat output devices as part of the database state. This definition therefore requires that equivalent logs produce the same outputs as well as producing identical data values within the database proper.)

3) Serially reproducible (SR) global logs: A global log G in which each Lt is immediately followed by all its Ut messages is called a serial global log. A global log which is equivalent to a serial global log is called serially reproducible (SR).

A serial global log has no interleaving of L and U actions for different transactions; therefore, such an execution history is a serializable history. SR global logs may have some interleaving of L's and U's for different transactions; however, since SR logs are equivalent to serial ones, it follows that every SR log also depicts serializable behavior.
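The definition of a serial global log translates into a short test. The sketch assumes the same hypothetical `(kind, transaction, site)` tuple encoding of actions; it checks only the "no interleaving" shape of the log and does not decide SR-ness, which requires equivalence to some serial log.

```python
def is_serial(global_log):
    """True if each Lt is immediately followed by all of its Ut actions."""
    finished = set()   # transactions whose L-plus-U block has ended
    current = None     # transaction whose block is being read
    for kind, txn, _site in global_log:
        if kind == "L":
            if txn == current or txn in finished:
                return False          # a second L for the same transaction
            if current is not None:
                finished.add(current)
            current = txn
        elif txn != current:
            return False              # a U separated from its own L
    return True
```

Under this check, `L(ta) U(ta) L(tb) U(tb)` is serial, while `L(ta) L(tb) U(ta) U(tb)` is not, because the U of `ta` is separated from its L.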

B. Global Log Transformation Rules

In this section we present rules for transforming any global log into another equivalent one. Then we describe how these rules may be used in an algorithm for determining SR-ness. The transformation rules state those cases in which adjacent actions in a global log may be interchanged, such that the resulting log is equivalent to the original. The permissibility of a switch depends on the type of actions involved (L or U); on whether the actions are taking place at the same site; and on the intersections of the actions' read- and write-sets. The global log transformation rules are summarized in Table II.

TABLE II
GLOBAL LOG TRANSFORMATION RULES

Type of adjacent actions | Actions are in same site | Actions are in different sites
L-U or U-L | Cannot switch if read-set of L intersects write-set of U | Cannot switch if the L and U are from the same transaction, i.e., L is Lt and U is Ut
U-U | Can always switch | Can always switch
L-L | Cannot switch if a) read-set of one intersects write-set of other, or b) write-set of one intersects write-set of other | Cannot switch if write-set of one intersects write-set of other

L and U actions executing at the same site are not switchable if the L action reads data items modified by the U, since this could change the effect of the L. But L and U actions whose write-sets intersect can be switched because the timestamp mechanism ensures that the effects of the latter of the two actions will persist in the database. L's and U's running at different sites can freely be switched in most cases since each action refers to data at one site only. But if the L and U are from the same transaction, they cannot be switched since that would imply that the U was processed before the transaction was executed.

Adjacent U actions may always be switched. If the U's are at the same site, the timestamp mechanism will ensure that the effects of the later U will persist. If the U's execute at different sites, they affect different physical data items.

L actions that run at the same site read and write the same physical data items; therefore, if either L reads data that are written by the other one, the outcome may be sensitive to their ordering. Thus in this case the two L's cannot be switched. L's whose write-sets intersect are also nonswitchable because the ordering of L's in the global log encodes their timestamp ordering. If L's with write-set intersections were switched, all effects that depend on their timestamp ordering would likewise be reversed.
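The rules of Table II can be written as a single predicate on two adjacent actions. In this sketch the `Action` record and the `can_switch` name are hypothetical; read-sets and write-sets are modeled as plain sets of item names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str                         # "L" or "U"
    txn: str                          # transaction the action belongs to
    site: str                         # site where the action executes
    reads: frozenset = frozenset()    # read-set (empty for U actions)
    writes: frozenset = frozenset()   # write-set

def can_switch(a, b):
    """Table II: may two adjacent actions be interchanged?"""
    kinds = {a.kind, b.kind}
    same_site = a.site == b.site
    if kinds == {"U"}:
        return True                               # U-U: always switchable
    if kinds == {"L", "U"}:
        l, u = (a, b) if a.kind == "L" else (b, a)
        if same_site:
            return not (l.reads & u.writes)       # L must not read what U writes
        return l.txn != u.txn                     # Ut can never precede its own Lt
    # L-L
    if a.writes & b.writes:
        return False                              # blocked at same or different sites
    if same_site and (a.reads & b.writes or b.reads & a.writes):
        return False
    return True
```

Note that the predicate is symmetric in its arguments, which matches the table treating L-U and U-L as one row.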

C. The ES Algorithm for Computing SR-ness

The global log transformation rules described above are incorporated in an algorithm called ES used to determine whether a given log is SR. The strategy used by ES is to transform the log G into successive equivalent logs Gi, until either the algorithm fails or it obtains a Gi which is serial. ES scans through G from left to right until it finds the first Ut which is not part of a sequence of the form

Lt Ut Utb ... Ut Ut.

A sequence such as this, consisting of an L action followed by some or all of its U's, is called a serialized L log; the notation Llt is used to indicate a serialized L log for transaction t.

When ES finds a Ut which is not part of its Llt, it attempts to bring the Ut and the Llt together by removing all symbols that separate them in the log. In doing this transformation, ES will not break up any Llt' which are present in the sublog separating Llt and Ut, this provision being necessary to ensure termination of the algorithm. If ES succeeds in bringing the Ut together with its Llt, it resumes scanning G from the point it was interrupted. If ES reaches the end of G, then G is SR. The key portion of this algorithm is the method of moving a particular Ut to a position adjacent to its Llt. This algorithm, called ES1, is defined in Fig. 4.

We note that while the success of ES implies that the log is SR, the converse is not true in general. This is because the global log transformation rules are sound but not complete. The basic facts about a sublog that ES1 cannot serialize are summarized in Fig. 5. It is important to note that these properties need only hold for a sublog at the time when the algorithm ES1 fails.

(Fig. 4 panels; the sub-log diagrams for panels b-f are not reproduced.)

a. The sub-log has the form Llt S1 S2 ... Si Si+1 ... Sn-1 Sn Ut. ES1 starts by trying to switch Llt's right neighbor with Llt. If it succeeds it will continue with S2, etc. A serialized L-log, Llt, can be switched with a symbol S if and only if all constituent symbols of Llt can be switched with S.

b. At some point ES1 could encounter an Si that cannot be switched with Llt. If so it skips Si and continues with Si+1.

c. After a while there may be many symbols that have been skipped by the algorithm. The algorithm tries to move each new symbol Sk that it examines as far to the left in the sub-log as it can, hopefully moving Sk past Llt and out of the sub-log altogether. ES1 continues to do this until it reaches Ut.

d. Then ES1 reverses direction, and scans from right to left, starting with Ut's left neighbor.

e. If ES1 is able to move each symbol past Ut it has succeeded!

f. If any Sk cannot be moved past Ut, ES1 has failed.

Fig. 4. Operation of ES1.

D. Global Log Graph

The ES algorithm described in Section VI-C can be expressed in a graphic form called a global log graph which is quite useful in reasoning about the SR-ness of logs.

Given a global log G we construct the graph by drawing an arc between every pair of actions which, if adjacent, could not be interchanged. These symbols are said to block each other. Table II can be viewed as a set of rules for drawing these arcs. Fig. 6 shows a log with these arcs added. Global log graphs are closely related to the diagrams used in Fig. 5. The paths which appear in the global log graph are the key to the SR question. For example, in the log graph of Fig. 6, L2 and one of its U's, U2, lie on a cycle consisting of the arcs:

L2 L3 L4 U2.

Moreover, algorithm ES cannot serialize L2 and U2; there is simply no way for ES1 to move L3 and L4 either to the left of L2 or the right of U2.


(Fig. 5 panels; the sub-log diagrams are not reproduced.)

Basic facts about the sub-log when ES1 fails:

a. 1. Sk cannot be switched with Ut.
   2. Sk was unremovable during steps (a)-(c) of ES1.

b. Moreover, there must be some specific symbol Sj that Sk couldn't get past. (Note: Sj might equal Llt.)

c. 1. Sj cannot be switched with Sk (because Sk cannot be switched with Sj; the switching rules are symmetric).
   2. Sj was also unremovable during steps (a)-(c) of ES1.

d. As before, Sj is blocked by a specific symbol Si, which itself was left in the sub-log during steps (a)-(c) of ES1.

e. Thus, there is a sequence of symbols between Llt and Ut, which cannot be switched with some symbol to the right and some symbol to the left.

Fig. 5. Unserializable sublogs in ES1.

(Fig. 6; the log diagram is not reproduced.)

Arcs are drawn to indicate the following intersections:

1) W(t1) intersects R(t2)
2) W(t2) intersects W(t3)
3) W(t3) intersects W(t4)
4) R(t4) intersects W(t2)
5) W(t4) intersects R(t5)

All arcs underneath the log are drawn between L's and U's for the same transactions. In this graph, there is a "blocking cycle" containing L2 and U2 (see text). Therefore this log is not SR.

Fig. 6. Global log graph.

Given any global log G, if any Lt and Ut lie on a cycle whose nodes are a subset of the actions between Lt and Ut, then G cannot be serialized by ES. Such a cycle is called a blocking cycle. The converse of this statement is not true: for example, the log graph

G= L u

cannot be serialized by ES although it has no blocking cycles.⁵ What is true is that whenever ES fails, the current log that it is working on must include a blocking cycle.

⁵The graph is cyclic but the cycle is not a blocking cycle. This is because the cycle is not a subset of the actions between L1 and U1, nor between L2 and U2.

E. L-U Graphs

L-U graphs are obtained by generalizing global log graphs in two directions:

1) The ordering information that is inherent in a particular global log graph is removed in the L-U graph. This permits the L-U graph to represent all possible global log graphs for a single set of transactions.

2) L-U graphs depict transaction classes as described in Section III rather than individual transactions.

We introduce L-U graphs in two steps. First, L-U transaction graphs are described which only achieve the first generalization. L-U transaction graphs are then generalized to depict transaction classes.

(Fig. 7; graph with an L node and a U node for each of the transactions t1, t4, t2, t3, t5, grouped by sites s1, s2, s3.)

This graph depicts the transactions in the previous figure. The arcs here are numbered in the same manner as in that figure.

Fig. 7. L-U transaction graph.

Fig. 7 shows an L-U transaction graph. The graph contains a pair of nodes, labeled L and U, for each depicted transaction. The L node represents the L action of the transaction and the U node represents all the U actions of the transaction. By convention we always draw the U node for a given transaction immediately below its L node, and we always group together transactions introduced at a given site.

Arcs are drawn between nodes as in global log graphs: two L's are connected if the global log transformation rules (see Table II) prohibit them from being switched; an arc is drawn between an L and a U if the rules prevent the L from being switched with any U actions represented by that node. (This implies that the L and U for the same transaction are always directly connected.)

L-U (class) graphs are constructed in a similar manner, except that the L and U nodes now represent the actions of defined transaction classes. Arcs are drawn between pairs of nodes, N1 and N2, if the global log transformation rules would prevent any action represented by N1 from being switched with any action represented by N2. Arcs thus represent the worst case possibility that some action in one class could block an action in the other class. Fig. 8 shows an L-U graph and summarizes the rules for drawing arcs in these graphs.

(Fig. 8; graph with an L node and a U node for each of the classes C1, C2, C3, C4 at sites s1, s2, s3, s3.)

Classes:

C1 defined to include t1, t4
C2 defined to include t2
C3 defined to include t3
C4 defined to include t5

(Note: the arcs drawn in this figure are the minimum arcs that could result from the class "definitions" above. Depending on how much bigger each class is than the indicated constituent transactions, many more arcs could result.)

Arcs are drawn between the following pairs of nodes:

1. L and U for the same class (called vertical arcs);
2. L and L for classes at the same site if there is an R-W, W-W, or W-R intersect between the classes (called horizontal arcs);
3. L and L for classes at different sites if there is a W-W intersect (called horizontal arcs); and
4. L and U for classes at different sites if there is an R-W intersect (called diagonal arcs).

Fig. 8. L-U (class) graph.

From the topology of L-U graphs, it is possible to infer ordering constraints that L and U actions must obey in order to ensure that all possible logs containing those actions are SR. The specification of these ordering constraints is the key step in specifying the synchronization protocols needed by the system and in specifying which transaction classes must use each protocol. This topic is addressed in Section VII below.
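The four arc-drawing rules listed with Fig. 8 can be sketched as a small graph builder. The class-table encoding (class name mapped to site, read-set, and write-set) and the function name are assumptions made for this illustration; the direction chosen for diagonal arcs (from the reading class's L node to the writing class's U node) follows the same-site L-U rule of Table II.

```python
from itertools import combinations

def lu_class_graph(classes):
    """Build the arcs of an L-U class graph.

    classes: dict name -> (site, read_set, write_set)
    Returns a set of (node, node, arc_type) triples, a node being
    ("L", name) or ("U", name).
    """
    arcs = set()
    for name in classes:
        arcs.add((("L", name), ("U", name), "vertical"))          # rule 1
    for a, b in combinations(sorted(classes), 2):
        site_a, reads_a, writes_a = classes[a]
        site_b, reads_b, writes_b = classes[b]
        if site_a == site_b:
            # rule 2: any R-W, W-R, or W-W intersect at the same site
            if reads_a & writes_b or writes_a & reads_b or writes_a & writes_b:
                arcs.add((("L", a), ("L", b), "horizontal"))
        else:
            if writes_a & writes_b:                               # rule 3
                arcs.add((("L", a), ("L", b), "horizontal"))
            if reads_a & writes_b:                                # rule 4
                arcs.add((("L", a), ("U", b), "diagonal"))
            if reads_b & writes_a:
                arcs.add((("L", b), ("U", a), "diagonal"))
    return arcs
```

For three single-site classes where the first writes `x`, the second reads `x` and writes `y`, and the third reads `x` and `y`, the builder yields three vertical arcs and three diagonal arcs and no horizontal ones.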

VII. SAFE PROTOCOL CONFIGURATIONS

A protocol configuration consists of a set of protocols that synchronize L and U actions, and a protocol selector function PS(s,t) that tells what protocol to use when executing transaction t at site s. A protocol configuration is safe if and only if all possible global logs produced by it are SR.

L-U graphs can be used to analyze the safety of protocol configurations. In terms of L-U graphs, the purpose of the protocols is to constrain the ordering of L and U actions, i.e., to restrict the global log graphs that a given L-U graph can produce. The constraints enforced by the protocols are called protocol schemas.

A. Protocol Schemas

Each protocol schema is defined as a relationship between one transaction, tprotocol, and a set of other transactions Ttarget. We say that tprotocol obeys the protocol schema with respect to the set Ttarget if and only if the ordering constraints stated in the schema definition are true. Transactions need not obey the same protocol schema with respect to all transaction classes. Many transactions may need to satisfy a strong protocol schema with respect to some classes, but may obey weaker protocol schemas with respect to others. The protocol algorithms described in Section V do not implement such flexibility. Those algorithms are strictly stronger than the protocol schemas described here, and so are a sufficient set of mechanisms to guarantee serializability. However, they are not the most efficient algorithms that could be developed for this purpose. More efficient algorithms that precisely implement the protocol schemas are described in [4].

As stated earlier, all of the protocol schemas are based on three assumptions regarding system operation. These assumptions are as follows.

1) Each primitive action, Lt and Ut, is atomic.

2) When an Lt completes, the corresponding Ut messages are transmitted to all sites immediately.

3) The stream of U's sent between each pair of sites is pipelined.

Protocol Schema p1

Protocol schema p1 is the null constraint. All transactions obey p1 with respect to all other transactions, given that the assumptions above are obeyed.

Protocol Schema p2

A transaction tp2, at site s1, obeys protocol schema p2 with respect to a set of transactions Ttarget iff:

for any distinct pair of transactions tx and ty in Ttarget such that neither tx nor ty was introduced at site s1:

if Ltx precedes Lty in the global log, and
if Uty precedes Ltp2 in the log,
then Utx precedes Ltp2 also.

(Note that if tx and ty are introduced at the same site, then p2 is satisfied by virtue of pipelining.)
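The p2 condition can be tested directly on a global log. The sketch below makes several assumptions that the definition leaves implicit: actions are hypothetical `(kind, transaction, site)` tuples, `home` maps each transaction to the site where it was introduced, the U actions compared are the ones delivered to tp2's own site, and every such U appears somewhere in the log.

```python
def obeys_p2(log, tp2, home, targets):
    """Check schema p2 for transaction tp2 against the set `targets`."""
    site = home[tp2]
    l_p2 = log.index(("L", tp2, site))
    # Only target transactions introduced at other sites are constrained.
    remote = [t for t in targets if t != tp2 and home[t] != site]
    for tx in remote:
        for ty in remote:
            if tx == ty:
                continue
            l_tx = log.index(("L", tx, home[tx]))
            l_ty = log.index(("L", ty, home[ty]))
            u_tx = log.index(("U", tx, site))   # update from tx applied at tp2's site
            u_ty = log.index(("U", ty, site))
            if l_tx < l_ty and u_ty < l_p2 and not u_tx < l_p2:
                return False                    # saw ty's update but missed tx's
    return True
```

In a three-site history where `tc` at s3 applies the update of the later transaction `tb` before running but receives the update of the earlier transaction `ta` only afterwards, the check fails; moving `ta`'s update ahead of `tc`'s L action makes it pass.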

To motivate the p2 rule, consider a system with three transactions, {t1, t2, t3}, defined on data items x, y, and z as follows:

t1: x ← 5
t2: y ← x²
t3: z ← x + y.

Assume further the following initial values for x, y, z:

x = 0
y = 1
z = 2.

And suppose that t1, t2, and t3 run at sites s1, s2, and s3 respectively. The L-U transaction graph of these transactions is shown in Fig. 9(a).

(Fig. 9; the diagrams are not reproduced. Given transactions t1: x ← 5, t2: y ← x², t3: z ← x + y, run at sites s1, s2, s3 respectively.)

(a) L-U transaction graph.
(b) A non-SR log (blocking cycle marked by x's). Ut1 violates p2 with respect to Lt3.
(c) An SR log that obeys the p2 constraint. Ut1 now obeys p2 with respect to Lt3.

Fig. 9. Motivation for p2.

In the absence of protocol schema p2, one possible global log that could result from an execution of these transactions is

$$G = L^{t_1}_{s_1}\; U^{t_1}_{s_2}\; L^{t_2}_{s_2}\; U^{t_2}_{s_3}\; L^{t_3}_{s_3}\; U^{t_1}_{s_3}\; U^{t_2}_{s_1}\; U^{t_3}_{s_1}\; U^{t_3}_{s_2}$$

This log denotes inconsistent database behavior in that Lt3 reads an inconsistent database state, i.e., it reads the new value of y resulting from Lt2 (i.e., 25), but it reads the value of x before the execution of Lt1 (i.e., 0). In terms of our theory, this inconsistency is indicated by the fact that Ut1 cannot be serialized with Lt1 [see Fig. 9(b)].

Note that the p2 constraint is not obeyed by t3 with respect to {t1, t2} in this non-SR log:

Lt1 precedes Lt2,
Ut2 precedes Lt3,

but

Ut1 follows Lt3.

Moreover, if the p2 constraint had been in effect, the log would have been SR [see Fig. 9(c)].

Protocol Schema p3

Protocol schema p3 places constraints on logs that are slightly stronger than those required by p2. Given transaction tp3 introduced at site s1, tp3 obeys protocol schema p3 with respect to a set of transactions Ttarget iff:

for any transaction tx in Ttarget such that tx was not introduced at site s1:

if Ltx precedes Ltp3
then Utx precedes Ltp3 also.

This condition, in effect, requires that Ltp3 "see" all updates to the database caused by transactions in Ttarget that ran before it.
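The p3 condition is simpler to test than p2 because it constrains single transactions rather than pairs. As a sketch under the same assumptions as before (hypothetical `(kind, transaction, site)` tuples, a `home` map from transactions to introducing sites, and U actions taken to be those applied at tp3's own site):

```python
def obeys_p3(log, tp3, home, targets):
    """Check schema p3 for transaction tp3 against the set `targets`."""
    site = home[tp3]
    l_p3 = log.index(("L", tp3, site))
    for tx in targets:
        if tx == tp3 or home[tx] == site:
            continue                              # only remote transactions count
        l_tx = log.index(("L", tx, home[tx]))
        u_tx = log.index(("U", tx, site))         # tx's update applied at tp3's site
        if l_tx < l_p3 and not u_tx < l_p3:
            return False                          # tx ran earlier but was not seen
    return True
```

A log in which two transactions at different sites both execute before either receives the other's update violates p3 for the later one; delivering the earlier transaction's update first satisfies it.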

(Fig. 10; the diagrams are not reproduced.)

Given transactions:

t4: x ← y + 1
t5: y ← x + 1

run at sites s1 and s2 respectively.

(a) L-U transaction graph.
(b) A non-SR log: Lt4 at s1, Lt5 at s2, Ut4 at s2, Ut5 at s1. The marked symbol violates p3 with respect to Lt5. Once ES serializes one transaction (t4 say), it cannot serialize the other.
(c) An SR log (obeys p3 constraint). The marked symbol now obeys p3 with respect to Lt5.

Fig. 10. Motivation for p3.

To motivate p3, consider two transactions {t4, t5} defined on data items x and y as follows:

t4: x ← y + 1
t5: y ← x + 1.

Suppose t4 and t5 run at sites s1 and s2, respectively, and that their execution results in the following log:

$$G_2 = L^{t_4}\; L^{t_5}\; U^{t_4}\; U^{t_5}.$$

Fig. 10 shows the L-U transaction graph for these transactions and the log graph for the above log G2. As Fig. 10(b) indicates, the log is not SR because when ES serializes one of the transactions it must necessarily block the other one. (We saw a similar example in Section VI-D.) If, on the other hand, protocol schema p3 had been observed here, Ut4 would have to have preceded Lt5 and the log would have been SR [see Fig. 10(c)].

Protocol Schema p4

Protocol schema p4 places constraints on the log that are even stronger than p3. Given transaction tp4 introduced at site s1, tp4 obeys p4 with respect to a set of transactions Ttarget iff:

1) transaction tp4 obeys p3 with respect to Ttarget, and

2) the first transaction introduced at each site that is in Ttarget and follows Ltp4 in the log obeys p3 with respect to Ttarget and {tp4}.

Protocol schema p4 is used to handle "unanticipated" transactions, i.e., ones which are not members of any predefined transaction classes. Since unanticipated transactions cannot be accounted for in the construction of the protocol tables, it is possible that these transactions could interfere with transactions that run under the weaker p1 and p2 protocol schemas. To avoid this possibility, a very strong protocol schema is required for their execution.

The first condition required by p4 causes the log prior to Ltp4 to be "completed" so that there are no outstanding U's at the time the unanticipated transaction is executed. The sublog preceding the unanticipated transaction satisfies all the protocol schemas and therefore (by the proof to follow) is SR.

The second condition achieved by p4 simply causes t to run serially, after which the rest of the log resumes running according to p1, p2, and p3.

Final Note on Protocol Schemas

One important fact to note is that each protocol schema provides a strictly stronger condition on logs than the preceding one.

B. The Protocol Selector Function

In this section we describe the protocol selector function PS(s,t) which tells the correct protocol to use in processing transaction t at site s.

The operation of PS(s,t) is driven by two tables, established at database design and stored at site s. The class table maps any transaction t into one or more transaction classes. The class table is a listing of the transaction classes defined for the system (see Section III), stated in terms of read-sets and write-sets. The protocol table maps each transaction class into the protocol to be used for that class. The protocol table can be algorithmically computed from the class table; it is fully specified by the L-U graph for the defined classes. If t is a member of several classes, PS(s,t) selects the quickest protocol associated with them. If t is not a member of any class defined at s, then PS(s,t) selects the most severe protocol, protocol P4. Note that the operation of PS(s,t) involves no intercomputer synchronization whatsoever. There only remains to specify each of the two tables used by PS(s,t). The class table was defined in Section III; the protocol table is defined below.

C. Establishment of Protocol Table

The protocol tables are constructed from the class tables by first constructing the L-U graph for the classes (see Section VI-E). Then the protocol tables are obtained by considering each class Ci in turn and applying certain rules regarding the graph topology: first, if the L node of Ci is not on a cycle containing a vertical arc, transactions in Ci may use protocol P1. Fig. 11 shows examples of graphs of this type.

Otherwise the L node is on a cycle that contains a vertical arc and a more detailed test is required. This test is based on the local topology of the cycle in the vicinity of Ci, and in particular on the nature of the arcs of the cycle that impinge on Ci.

For the L node of Ci to be on a cycle, the cycle must include at least two arcs involving the L [see Fig. 12(a)]. One of these must be a horizontal or diagonal arc, but the other can be any kind of arc: horizontal, diagonal, or vertical [see Fig. 12(b)]. Moreover, if the second arc is vertical, then clearly it is the L-U arc of Ci itself, and we may conclude the existence of a diagonal arc in the cycle from the U of Ci to some other L [see Fig. 12(c)]. From this we can construct all the ways in which two arcs of a cycle can impinge on Ci and include the L node. There are eight ways in which this can be done; Fig. 13 shows these eight topologies and indicates the correct protocol associated with each one. No rule specifies the use of P4 since this protocol is for "unanticipated" transactions. It is possible for an L node to participate in a composition of two or more of these topologies. For example, in Fig. 14 Ci is on two cycles, one which mandates P3 and one which only indicates P2. Transactions in Ci must satisfy protocol schema p3 with respect to transactions in Cj, but need only satisfy protocol schema p2 with respect to transactions in Ck and Cl.

As stated earlier, the protocols presented in Section V are not flexible enough to handle two different protocol schemas when running a single transaction. However, the algorithms actually used by SDD-1 do have this ability [4].

(Fig. 11; the diagram is not reproduced.)

Fig. 11. Acyclic graph.

(Fig. 12; the diagrams are not reproduced.)

a. For any node N to be on a cycle, at least two arcs of the cycle must impinge on it.

b. For L to be on a cycle, one arc must be horizontal or diagonal; the other may be horizontal, diagonal, or vertical. Any of the 5 arcs shown can be on the cycle, but each arc can only be used once.

c. If L-U is on the cycle then the cycle includes one of the two arc patterns shown, or some reflection thereof.

Fig. 12. Facts about L nodes on cycles.

(Fig. 13; the topology diagrams are not reproduced.)

Rule | Topology | Protocol
#1 | Horizontal-Horizontal | P1
#2 | Horizontal-Diagonal | P3
#3 | Horizontal-Vertical-Diagonal | P1
#4 | Diagonal-Horizontal | P3
#5 | Diagonal-Diagonal | P2
#6 | Diagonal-Vertical-Diagonal | P3
#7 | Diagonal-Vertical-Horizontal | P1
#8 | Diagonal-Vertical-Diagonal | P3

Fig. 13. Protocol selection rules for L nodes on cycles.

(Fig. 14; the diagram is not reproduced. It shows a transaction class Ci on two cycles, with dotted lines standing for paths that don't include any of the expressly drawn edges.)

Fig. 14. L node on multiple cycles.

D. An Outline of the Proof of Serial Reproducibility

To verify that the redundant updating technique we propose will work correctly, we need to show that if for a given log all transactions obey the protocol selection rules, then the log is serially reproducible (SR). A detailed proof of this theorem is quite lengthy and is presented in [4]; here we only present a sketch of how the proof proceeds. The proof technique is apparently quite general; we have applied it to several SDD-1-type designs to produce proofs of serial reproducibility [2], [21].

To prove that a global log that satisfies the protocol selection rules is SR, we assume the contrary and show a contradiction. That is, we assume that there is some log, say LOGgiven, that resulted from operation of the system according to these rules and that the log is not SR. The general approach we take proceeds in four steps:

1) try to serialize LOGgiven;
2) show a cycle in the L-U transaction graph;
3) show a cycle in the L-U class graph;
4) show LOGgiven must have violated the protocol selection rules.

To try to serialize LOGgiven we apply algorithm ES to it. Since LOGgiven is not SR (by assumption), we must get "stuck" before a serial log is obtained; i.e., ES must fail. Suppose, for example, that transaction t cannot be serialized. This means that at least one of t's U actions, say Ut, cannot be put together by ES with the corresponding Llt. When this happens the log that ES is transforming must include a sublog

Llt ... Ut

with the following two important properties:

1) there is a blocking cycle involving Llt and Ut, and
2) every Us in the sublog other than Ut is already part of its corresponding Lls.

Fig. 15 illustrates these two properties. The first property follows from the process of serialization in ES and in particular the operation of the subalgorithm ES1 (see Section VI-D). The second follows from the order in which U's are serialized by ES, i.e., left-to-right. Finding the blocked sublog with these properties completes the first step of the proof.

In the next step, we examine the blocked sublog in more detail, focusing on the blocking cycle in it:

Llt S1 S2 ... Sn Ut

(Fig. 15; the sub-log diagrams are not reproduced.)

a. There must be a sub-log that is blocked: for some reason ES cannot move the two symbols Llt and Ut together.

b. The blocked sub-log has two properties:

1. There is a blocking cycle involving Llt and Ut.
2. All U's in the sub-log are already adjacent to their L's. (Thus all Si's must be L's or Ll's.)

Fig. 15. First step of proof.

(Fig. 16; the diagrams are not reproduced.)

(a) The Sn-Ut arc must be an R-W intersection, going between an Lv action and Ut.

(b) Each arc in the blocking path maps into an arc in the L-U transaction graph.

Fig. 16. Second step of proof.

The first observation we make is that each $S_i$ in the cycle must be an LU or L action. We know this since, from point 2) above, every $U^s$ in the sublog (other than $U^t$) is already part of its $LU^s$. The next point is that $S_n$, the symbol blocking $U^t$, must be an $LU^v$ or $L^v$ whose read-set intersects $U^t$'s write-set [see Fig. 16(a)]. We know this to be the case since those are the only kinds of actions that can block $U^t$ except for $L^t$ (or $LU^t$) itself. Every arc in the blocking cycle maps into an arc in the L-U transaction graph. Therefore, we can be certain that a path exists in the L-U transaction graph from either the L or U node of $LU^t$ to the L node of $LU^v$ [see Fig. 16(b)]. Furthermore, since the L node of $LU^v$ blocks $U^t$, we know that there is also an arc from that L node (i.e., from $L^v$) to the U node of $LU^t$ [refer to Fig. 16(b)]. The L-U transaction graph in Fig. 16(b) is surely cyclic.

The third step is to show that the corresponding L-U class graph is also cyclic. Note that the L-U class graph must contain at least two classes since $L^t_m$ and $L^v_n$ run at different sites. Call $L^t_m$'s class $C^t_m$ and $L^v_n$'s class $C^v_n$. The other $S_i$'s in the blocking path may be members of $C^t_m$, $C^v_n$, or some other classes. Suppose that all of them were members of $C^t_m$. Then all of $S_1, S_2, \ldots, S_{n-1}$ run at site m, and by the pipelining assumption that underlies the protocol schemas, it follows that none of $S_1, S_2, \ldots, S_{n-1}$ may include $U_n$ actions. In other words, since each $S_i$ is an LU or an L, and since each L runs at site m, and since each L follows $L^t$, we are guaranteed that each corresponding $U_n$ will also follow $U^t$ [see Fig. 17(a)]. Consequently, we can be certain the arc between $S_{n-1}$ and $S_n$ is an L-L arc. Therefore in this case the L-U class graph is cyclic [see Fig. 17(b)]. Similar arguments apply in all other cases, leading to the conclusion that the blocked sublog implies a cycle in the L-U graph, and that the node for $L^v$ is in the cycle. This implies that part 2) of the protocol selection rules (the part that applies to nodes on cycles) applies to $L^v$.

Fig. 17. Third step of proof. (a) L-U class graph over $C^t_m$, zero or more other classes, and $C^v_n$. If $S_1, S_2, \ldots, S_{n-1}$ are all members of $C^t_m$, then in the log their L actions follow $L^t$ and the corresponding U actions follow $U^t$. (b) In this case the L-U class graph is cyclic: one arc is implied by the pair $L^{n-1}$, $L^v$ and the other by the pair $L^v$, $U^t$.

The final step of the proof is to show that $L^v$ must violate the protocol selection rules. As an example of this step, suppose the symbol to the left of $L^v_n$ that blocks $L^v_n$ is $L^{n-1}_m$ [see Fig. 18(a)]. That is, the blocked sublog is:

$$L^t_m \; \cdots \; L^{n-1}_m \; L^v_n \; U^t_n$$

Then the L-U graph must be of the form indicated in Fig. 18(b). Consulting the protocol selection rules in Section VII-A, we see that transactions in $C^v_n$ must obey protocol schema p3 with respect to transactions in $C^t_m$. But consider the following:

1) $L^{n-1}$ precedes $L^v$;
2) but $U^{n-1}$ follows $L^v$ (by the pipelining assumption) [see Fig. 18(c)].

This constitutes a violation of p3. Contradiction!

Similar arguments can be applied for other cases, each case being a different kind of symbol that blocks $L^v$ on the left.

The conclusion is that if LOGgiven is not SR as hypothesized, then it must have violated the protocol selection rules. Hence, so long as LOGgiven has not violated the protocol selection rules, it must be SR.
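The last steps of the argument can be illustrated with a small executable sketch. It is not part of SDD-1 and rests on two loudly stated assumptions: (i) an L-U graph is treated here as an undirected graph that includes an edge joining the L and U nodes of the same transaction, and (ii) the p3 condition is reduced to the single ordering test spelled out in Fig. 18(c), with one L and one U per transaction. The names `has_cycle` and `violates_p3_ordering`, and the tuple encoding of log symbols, are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Node = Tuple[str, str]  # (kind, transaction), e.g. ("L", "t") or ("U", "t")


def has_cycle(edges: Iterable[Tuple[Node, Node]]) -> bool:
    """Return True if the undirected graph given by `edges` contains a cycle.

    Assumption: arcs of the L-U graph are treated as undirected.
    Union-find: an edge whose endpoints are already connected closes a cycle.
    """
    parent: Dict[Node, Node] = {}

    def find(x: Node) -> Node:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    seen: Set[FrozenSet[Node]] = set()
    for a, b in edges:
        key = frozenset((a, b))
        if a == b or key in seen:
            continue  # ignore self-loops and repeated edges
        seen.add(key)
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False


def violates_p3_ordering(log: List[Node], earlier: str, later: str) -> bool:
    """Ordering test from Fig. 18(c): if L^earlier precedes L^later,
    then U^earlier must also precede L^later.

    Simplification (assumption): each transaction has one L and one U in `log`.
    """
    position = {symbol: index for index, symbol in enumerate(log)}
    l_earlier = position[("L", earlier)]
    l_later = position[("L", later)]
    u_earlier = position[("U", earlier)]
    return l_earlier < l_later < u_earlier


# Graph in the spirit of Fig. 16(b): the assumed edge joining L^t and U^t,
# a (collapsed) blocking path from L^t to L^v, and the R-W arc from L^v to U^t.
FIG16_EDGES = [
    (("L", "t"), ("U", "t")),
    (("L", "t"), ("L", "v")),
    (("L", "v"), ("U", "t")),
]

# Log in the spirit of Fig. 18(c): L^t, L^(n-1), L^v, U^t, U^(n-1).
FIG18_LOG = [("L", "t"), ("L", "n-1"), ("L", "v"), ("U", "t"), ("U", "n-1")]

if __name__ == "__main__":
    print(has_cycle(FIG16_EDGES))
    print(violates_p3_ordering(FIG18_LOG, "n-1", "v"))
```

Under these assumptions both calls report `True`: the three edges close a cycle, and $U^{n-1}$ appears after $L^v$ although $L^{n-1}$ appears before it. Dropping the R-W edge from `FIG16_EDGES` leaves an acyclic graph, which mirrors the role of the blocking arc in the proof.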

Fig. 18. Fourth step of proof. (a) Suppose the symbol that blocks $L^v$ is $L^{n-1}$. (b) The corresponding L-U class graph, over classes $C^t_m$, $C^{n-1}_m$, $C^v_n$ or over classes $C^t_m$, $C^v_n$. (c) Note that $U^{n-1}$ must be after $U^t$ (by the pipelining assumption). Therefore this log violates p3 with respect to $L^v_n$: 1) $L^{n-1}$ precedes $L^v$; 2) therefore p3 dictates that $U^{n-1}$ must also precede $L^v$. Since it does not, this log violates p3.

VIII. CONCLUSION

This paper has described a methodology for synchronizing concurrent transactions in a distributed database management system. The methodology is a simplified version of the technique used by SDD-1 (a system for distributed databases), which is being developed by Computer Corporation of America. The methodology is superior to techniques proposed elsewhere because in many instances it is able to avoid global synchronization of the distributed copies of data being accessed by the transactions.

The system employs a series of four synchronization protocols which vary in cost and provide varying levels of synchronization control. The most efficient of these protocols, protocol P1, introduces essentially no intersite synchronization overhead, while the more costly ones, protocols P2-P4, introduce increasingly higher synchronization costs. Different transactions must be executed using different protocols, and a clear goal is to permit most transactions entered into the system to use the more efficient protocols.

The decision as to which transactions must use each protocol, as well as the protocols themselves, is based on a formal mathematical analysis of the ways in which transactions in a distributed database system can interfere with each other. The analytic technique employed in this paper is both formal and quite general and has been used to analyze update algorithms in a variety of database configurations.

An important characteristic of the SDD-1 technique is that the procedure for selecting the correct protocol takes advantage of extensive application knowledge and precomputation of possible transaction interference. The protocol selection procedure is driven by two tables established at database design time. One of these tables, called the transaction class table, is defined by the database administrator based on his knowledge of the kinds of transactions that are expected for each database application. The other table, called the protocol table, is then algorithmically constructed from the class table using results presented in this paper. The protocol table tells the system what protocol it must use when a transaction of a given type (or class) is entered into the system. The protocol table embodies a detailed analysis of all the possible ways transactions in the defined classes can interfere with each other. At run time, SDD-1 consults this table to determine very quickly and concisely what level of synchronization is needed to run any given transaction. It is important to note that this predefinition and preanalysis activity in no way limits the types of transactions the system can accept; it merely permits more efficient execution of the transaction types that were anticipated. Transactions that were not anticipated ahead of time and thus not included in the preanalysis are simply handled via the strongest protocol, protocol P4.
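The run-time use of the two tables can be sketched as follows. This is a minimal illustration, not the SDD-1 implementation. It assumes that a transaction class is described by a read-set and a write-set, that a transaction belongs to the first class whose sets contain its own, and that the protocol table maps a class name to a single protocol. The class names, data item names, and protocol assignments are hypothetical and are not derived from the analysis in this paper; only the fallback to P4 for unanticipated transactions is taken from the text.

```python
from enum import Enum
from typing import AbstractSet, Dict, NamedTuple, Optional


class Protocol(Enum):
    P1 = 1  # cheapest: essentially no intersite synchronization overhead
    P2 = 2
    P3 = 3
    P4 = 4  # strongest; also the fallback for unanticipated transactions


class TransactionClass(NamedTuple):
    read_set: AbstractSet[str]
    write_set: AbstractSet[str]


def classify(
    class_table: Dict[str, TransactionClass],
    read_set: AbstractSet[str],
    write_set: AbstractSet[str],
) -> Optional[str]:
    """Return the first class whose read- and write-sets contain the
    transaction's own sets (assumed membership rule), or None."""
    for name, cls in class_table.items():
        if read_set <= cls.read_set and write_set <= cls.write_set:
            return name
    return None


def select_protocol(
    class_table: Dict[str, TransactionClass],
    protocol_table: Dict[str, Protocol],
    read_set: AbstractSet[str],
    write_set: AbstractSet[str],
) -> Protocol:
    """Look up the protocol for a transaction; unanticipated ones get P4."""
    name = classify(class_table, read_set, write_set)
    if name is None:
        return Protocol.P4
    return protocol_table.get(name, Protocol.P4)


# Hypothetical tables, for illustration only.
CLASS_TABLE = {
    "stock_query": TransactionClass(frozenset({"stock"}), frozenset()),
    "order_entry": TransactionClass(
        frozenset({"stock", "orders"}), frozenset({"orders"})
    ),
}
PROTOCOL_TABLE = {
    "stock_query": Protocol.P1,
    "order_entry": Protocol.P3,
}

if __name__ == "__main__":
    print(select_protocol(CLASS_TABLE, PROTOCOL_TABLE, {"stock"}, set()))
    print(select_protocol(CLASS_TABLE, PROTOCOL_TABLE, {"stock"}, {"orders"}))
    print(select_protocol(CLASS_TABLE, PROTOCOL_TABLE, {"payroll"}, {"payroll"}))
```

The lookup is a constant amount of work per class at run time, because the interference analysis was done when the protocol table was built. The third call matches no class and therefore falls back to `Protocol.P4`.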

The effectiveness of the methodology we have presented is certainly application dependent: it depends on the relative number of transactions for the application that may run under each of the protocols. We believe that in many practical applications most transactions will be able to run under the most efficient protocol, and that the methodology presented here will prove to be an effective and practical means for synchronizing transactions in a distributed database system.

ACKNOWLEDGMENT

The authors gratefully acknowledge technical criticisms of U. Dayal and the editorial assistance of T. Lozano in preparing this paper. We are grateful also for refinements developed by our colleague D. Shipman.

REFERENCES

[1] P. A. Alsberg and J. D. Day, "A principle for resilient sharing of distributed resources," Center for Advanced Computation, Univ. Illinois at Urbana-Champaign, Urbana, IL, Rep., 1976; also accepted for publication in Proc. 2nd Int. Conf. Software Eng.

[2] P. A. Bernstein and D. W. Shipman, "A formal model of concurrency control mechanisms for database systems," submitted to 1978 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., Univ. California, Berkeley, Aug. 1978.

[3] P. A. Bernstein, N. Goodman, J. B. Rothnie, and C. A. Papadimitriou, "Analysis of serializability in SDD-1: A system for distributed databases (the fully redundant case)," in Proc. 1st Int. Conf. Computer Software and Applications (COMPSAC '77), IEEE Comput. Soc., Chicago, IL, Nov. 1977; also available from Computer Corp. America, Cambridge, MA, Tech. Rep. CCA-77-05.

[4] P. A. Bernstein, D. W. Shipman, J. B. Rothnie, and N. Goodman, "The SDD-1 redundant update algorithm (the general case)," Computer Corp. America, Cambridge, MA, Tech. Rep. CCA-77-09, Dec. 15, 1977.

[5] D. D. Chamberlin, R. F. Boyce, and I. L. Traiger, "A deadlock-free scheme for resource locking in a database environment," in 1974 IFIPS Conf. Proc. Amsterdam, The Netherlands: North-Holland, 1974.

[6] D. D. Chamberlin, J. N. Gray, and I. L. Traiger, "Views, authorization, and locking in a relational database system," in AFIPS Nat. Comput. Conf. Proc., vol. 44. Montvale, NJ: AFIPS Press, 1975.

[7] W. W. Chu, "Optimal file allocation in a computer network," in Computer Communication Networks, F. F. Kuo, Ed. Englewood Cliffs, NJ: Prentice-Hall, Computer Applications in Electrical Engineering Series, 1973.

[8] C. A. Ellis, "A robust algorithm for updating duplicate databases," in Proc. 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., Univ. California, Berkeley, May 1977.

[9] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger, "The notions of consistency and predicate locks in a database system," Commun. Ass. Comput. Mach., vol. 19, Nov. 1976.

[10] J. N. Gray, R. A. Lorie, G. R. Putzolu, and I. L. Traiger, "Granularity of locks and degrees of consistency in a shared database," IBM Res. Lab., San Jose, CA, Rep., 1975.

[11] I. Greif, "Semantics of communicating parallel processes," Lab. Comput. Sci., Mass. Inst. Tech., Cambridge, Tech. Rep. TR-154, Sept. 1975.

[12] A. N. Haberman, "Prevention of system deadlock," Commun. Ass. Comput. Mach., vol. 12, pp. 373-377, July 1969.

[13] M. M. Hammer and D. J. McLeod, "Semantic integrity in a relational database system," in Proc. Int. Conf. Very Large Data Bases, Sept. 1975.

[14] C. E. Hewitt, "Protection and synchronization in actor systems," Artificial Intelligence Lab., Mass. Inst. Tech., Cambridge, Working Paper 83, Nov. 1974.

[15] C. Hewitt, P. Bishop, and R. Steiger, "A universal actor formulation for artificial intelligence," in Proc. Int. Joint Conf. Artificial Intelligence (IJCAI), Stanford Artificial Intelligence Lab., Stanford Univ., Stanford, CA, Aug. 1973.

[16] R. C. Holt, "On deadlocks in computer systems," Univ. Toronto, Comput. Syst. Res. Group, Toronto, Ont., Canada, Tech. Rep. CSRG-6, Apr. 1971.

[17] L. Lamport, "Time, clocks and ordering of events in a distributed system," Mass. Comput. Assoc. Rep. CA-7603-2911, Mar. 1976; also submitted to Commun. Ass. Comput. Mach.

[18] B. Lampson and H. Sturgis, "Crash recovery in a distributed data storage system," Comput. Sci. Lab., Xerox Palo Alto Res. Center, Palo Alto, CA, unpubl. paper, 1976.

[19] K. D. Levin and H. L. Morgan, "Dynamic file assignment in computer networks under varying access request patterns," Dep. Decision Sci., The Wharton School, Univ. Pennsylvania, Tech. Rep. 750401, Apr. 1975.

[20] S. Mahmoud and J. S. Riordan, "Optimal allocation of resources in distributed information networks," ACM Trans. Database Systems, vol. 1, pp. 66-78, Mar. 1976.

[21] C. A. Papadimitriou, P. A. Bernstein, and J. B. Rothnie, "Some computational problems related to database concurrency control," in Proc. Conf. Theoretical Comput. Sci., Univ. Waterloo, Waterloo, Ont., Canada, Aug. 1977.

[22] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis, "A system level concurrency control for distributed database systems," in Proc. 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., Univ. California, Berkeley, May 1977.

[23] J. B. Rothnie and N. Goodman, "A study of updating in a redundant distributed database environment," Computer Corp. America, Cambridge, MA, Tech. Rep. CCA-77-01, Feb. 15, 1977.

[24] J. B. Rothnie and N. Goodman, "An overview of the preliminary design of SDD-1: A System for Distributed Databases," in Proc. 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., Univ. California, Berkeley, May 1977; also available from Computer Corp. America, Cambridge, MA, Tech. Rep. CCA-77-04.

[25] J. B. Rothnie and N. Goodman, "A survey of research and development in distributed database systems," in Proc. 3rd Int. Conf. Very Large Data Bases, Tokyo, Japan, Oct. 1977.

[26] J. B. Rothnie, N. Goodman, and P. A. Bernstein, "The redundant update algorithm of SDD-1: A System for Distributed Databases (the fully redundant case)," in Proc. 1st Int. Conf. Comput. Software and Applications (COMPSAC '77), IEEE Comput. Soc., Chicago, IL, Nov. 1977; also available from Computer Corp. America, Cambridge, MA, Tech. Rep. CCA-77-02.

[27] R. E. Stearns, P. M. Lewis, II, and D. J. Rosenkrantz, "Concurrency controls for database systems," in IEEE Proc. 17th Annu. Symp. Foundations Comput. Sci., 1976, pp. 19-32.

[28] M. Stonebraker and E. Neuhold, "A distributed database version of INGRES," in Proc. 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., Univ. California, Berkeley, May 1977.

[29] R. H. Thomas, "A solution to the concurrency control problem for multiple copy data bases," in Proc. 16th IEEE Comput. Soc. Int. Conf. (COMPCON), Spring 1978.

[30] E. Wong, "Retrieving dispersed data from SDD-1: A System for Distributed Databases," in Proc. 1977 Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., Univ. California, Berkeley, May 1977; also available from Computer Corp. America, Cambridge, MA, Tech. Rep. CCA-77-03.

Philip A. Bernstein received the B.S. degree from Cornell University, Ithaca, NY, in 1971, and the Ph.D. degree from the University of Toronto, Toronto, Ont., Canada, in 1975, both in computer science.

He is presently Assistant Professor of Computer Science at Harvard University, Cambridge, MA, and Senior Computer Scientist at Computer Corporation of America, Cambridge, MA. His primary research interests are database management and operating systems.


James B. Rothnie, Jr., received the B.S. degree in 1970 and the Ph.D. degree in 1972, both from the Massachusetts Institute of Technology, Cambridge, MA.

He is presently Vice President and Manager of the Sponsored Research Division of Computer Corporation of America, Cambridge, MA. His primary research interests are database management and computer networks.

Nathan Goodman received the B.S. degree in mathematics in 1972, and the M.S. degree in computer science in 1976, both from the Massachusetts Institute of Technology, Cambridge, MA. Currently, he is working towards the Ph.D. degree in computer science at Harvard University, Cambridge, MA.

He is presently a Computer Scientist at Computer Corporation of America, Cambridge, MA. His principal research interests are in distributed database management and database semantics.

Christos A. Papadimitriou was born in Athens, Greece, in 1949. He received the B.S. degree in electrical engineering from the National Technical University, Athens, in 1972, and the Ph.D. degree in computer science from Princeton University, Princeton, NJ, in 1976.

Since 1976 he has been an Assistant Professor of Computer Science at Harvard University, Cambridge, MA. His main research interests are computational complexity, analysis of algorithms, combinatorics, and certain aspects of operations research and database theory.
