
    Lock-Free Reference Counting

    David L. Detlefs Paul A. Martin Mark Moir Guy L. Steele Jr.

Sun Microsystems Laboratories, 1 Network Drive

    Burlington, MA 01803

    Abstract

Assuming the existence of garbage collection makes it easier to design implementations of concurrent data structures. However, this assumption limits their applicability. We present a methodology that, for a significant class of data structures, allows designers to first tackle the easier problem of designing a garbage-collection-dependent implementation, and then apply our methodology to achieve a garbage-collection-independent one. Our methodology is based on the well-known reference counting technique, and employs the double compare-and-swap operation.

    1 Introduction

It is well known that garbage collection (GC) can simplify the design of sequential implementations of data structures. Furthermore, in addition to providing storage management benefits, GC can also significantly simplify various synchronization issues in the design of concurrent data structures [3, 10, 16]. Unfortunately, however, simply designing implementations for garbage-collected environments is not the whole solution. First, not all programming environments support GC. Second, almost all of those that do employ excessive synchronization, such as locking and/or stop-the-world mechanisms, which brings into question their scalability. Finally, an obvious restriction of particular interest to our group is that GC-dependent implementations cannot be used in the implementation of the garbage collector itself!

The goal of this work, therefore, is to allow programmers to exploit the advantages of GC in designing their data structure implementations, while avoiding its drawbacks. To this end, we provide a methodology that allows programmers to first solve the easier problem of designing a GC-dependent implementation, and to then apply our methodology in order to achieve a GC-independent one.

Contact email: [email protected]

We have designed our methodology to preserve lock-freedom.¹ (That is, if the GC-dependent implementation is lock-free,² so too will be the GC-independent one derived using our methodology.) Lock-free programming is increasingly important for overcoming the problems associated with locking, including performance bottlenecks, susceptibility to delays and failures, design complications, and, in real-time systems, priority inversion.

Another important feature of our methodology, and of implementations based on it, is that it allows the memory consumption of the implementation to grow and shrink over time, without imposing any restrictions on the underlying memory allocation mechanisms. In contrast, lock-free implementations of dynamic data structures often either require maintenance of a special freelist, whose storage cannot in general be reused for other purposes (e.g., [19, 13]), or require special system support such as type-stable memory [6].

Our methodology is based on the well-known garbage collection technique of reference counting. We refer to this methodology as LFRC (Lock-Free Reference Counting). In this approach, each object contains a count of the number of pointers that point to it, and can be freed if and only if this count is zero. The reason that garbage collectors commonly stop the world is that some of these pointers are in threads' registers and/or stacks; discovering these requires operating system support, and is difficult to do concurrently with the executing thread. Our goal is to enable programmers to take advantage of the simplicity afforded by garbage collection, without having to use locks or stop-the-world techniques.

¹ We do not specify how objects are created and destroyed; usually malloc and free are not lock-free, so implementations based on them are also not lock-free. However, most production-quality malloc/free implementations do attempt to avoid contention for locks, for example by maintaining separate allocation buffers for each thread, and therefore avoid most of the problems associated with locks.

² An implementation of a concurrent data structure is lock-free if it guarantees that in any execution, after a finite number of steps of one of its operations, some operation on the data structure completes. Of course, it is possible that the operating system or garbage collector might prevent threads from executing any instructions, in which case no operations will be completed on the data structure. However, this does not mean that the concurrent data structure implementation is not lock-free. Thus, it is not contradictory to talk about a lock-free GC-dependent concurrent data structure implementation, even in environments with blocking garbage collectors.

In order to maintain accurate reference counts, we would like to be able to atomically create a pointer to an object and increment that object's reference count, and to atomically destroy a pointer to an object and decrement its reference count. By freeing an object when and only when its reference count becomes zero, we can ensure that objects are not freed prematurely, but that they are eventually freed when no pointers to the object remain.

The main difficulty that arises in the above-described approach is the need to atomically modify two separate memory locations: a pointer and the reference count of the object to which it points. This can be achieved either by using hardware synchronization primitives to enforce this atomicity, or by using lock-free software mechanisms to achieve the appearance of atomicity. Work on the latter approach has yielded various lock-free and wait-free implementations of multi-variable synchronization operations [14, 17]. However, all of these results are complicated, introduce significant overhead, and/or work only on locations in statically-allocated memory, rendering them inapplicable for the current work. Therefore, in this paper, we explore the former approach. In particular, we assume the availability of a double compare-and-swap (DCAS) instruction that can atomically access two independently-chosen memory locations. (DCAS is precisely defined in [3].) While such an instruction is not widely available, it has been implemented in hardware in the past (e.g., [15]). Furthermore, it is increasingly clear that current hardware synchronization mechanisms are inadequate for scalable and non-blocking synchronization. Thus, we are motivated to study what can be achieved with stronger (in some sense) mechanisms.
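For concreteness, the following sketch states the DCAS semantics assumed throughout this paper. It is a specification only: the whole body is taken to execute as a single atomic step, which is what the hardware primitive provides; the template form is ours for illustration and is not part of any particular instruction set.

template <class T0, class T1>
bool DCAS(T0 *addr0, T1 *addr1, T0 old0, T1 old1, T0 new0, T1 new1) {
    // The body below is assumed to execute atomically.
    if (*addr0 == old0 && *addr1 == old1) {
        *addr0 = new0;
        *addr1 = new1;
        return true;   // both locations matched and were updated
    }
    return false;      // neither location was changed
}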

Even DCAS is not strong enough to allow us to maintain reference counts that are accurate at all times. For example, if a pointer in shared memory points to an object v, and we change the pointer to point to another object w, then we have to atomically modify the pointer, increment w's reference count, and decrement v's. However, it turns out that a weaker requirement on the reference counts suffices, and that this requirement can be achieved using DCAS. This weakening is based on the observation that reference counts do not always need to be accurate. The important requirements are that if the number of pointers to an object is non-zero, then so too is its reference count, and that if the number of pointers is zero, then the reference count eventually becomes zero. These two requirements respectively guarantee that an object is never freed prematurely, and that the reference count of an object that has no pointers to it eventually becomes zero, so that it can be freed.³

These observations imply that it is safe for a thread to increment an object's reference count before creating a new pointer to it, provided that the thread eventually either creates the pointer, or decrements the reference count to compensate for the previous increment. This is the approach used by our methodology. It is tempting to think that the weaker requirements can be achieved without resorting to using DCAS. However, when we load a pointer from a shared memory location, we need to increment the reference count of the object to which the loaded value points. If we can access this reference count only with a single-variable compare-and-swap (CAS), then there is a risk that the object will be freed before we increment the reference count, and that the subsequent attempt to increment the reference count will corrupt memory that has been freed, and potentially reallocated for another purpose. By using DCAS, we can increment the reference count while atomically ensuring that a pointer still exists to this object. We do not see a way to overcome this problem using CAS without introducing significant overhead.
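The key step, then, is a DCAS that increments an object's reference count only while a shared pointer to it still exists. The sketch below previews this step; the helper name countedRead is ours, while SNode, its rc field, Null, and DCAS are as used in the Snark example and the LFRC code of Sections 4 and 5 (the full LFRCLoad operation given there additionally destroys the previous contents of the destination variable).

// Increment a's reference count only if the shared location *A still points to a.
// If a was freed and *A changed, the DCAS fails and the freed object is never written.
SNode *countedRead(SNode **A) {
    SNode *a;
    long r;
    do {
        a = *A;                  // read the shared pointer
        if (a == Null) return Null;
        r = a->rc;               // sample the current reference count
    } while (!DCAS(A, &(a->rc), a, r, a, r + 1));
    return a;                    // a's count now covers this local pointer
}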

To demonstrate the utility of our methodology, we show how it can be used to convert a GC-dependent algorithm published recently in [3] into a GC-independent one. The algorithm presented in [3] is a lock-free implementation of a double-ended queue (hereafter deque) [9].

The design of the implementation in [3] was simplified by the assumption that it would operate in an environment that provides GC, because GC gives us a free solution to the so-called ABA problem (free in the sense that we do not have to solve this problem as part of the implementation). This problem arises when we fail to detect that a value changes and then changes back to its original value. For example, if a CAS or DCAS operation is about to operate on a pointer, and the object to which it points is freed and then reallocated, then it is possible for the CAS or DCAS to succeed even though it should fail. Thus, this situation must be prevented by ensuring that objects are not freed while threads have pointers to them. Garbage collectors can determine this, for example by stopping a thread and inspecting its registers and stack. However, in the absence of GC, the implementation itself must deal with this tricky problem. It was while working on this specific problem that we realised that we could achieve a more generally-applicable methodology.

³ To be precise, we should point out that it is possible for garbage to exist and to never be freed in the case where a thread fails permanently.

While we believe that our methodology will be useful for a variety of concurrent data structures, it is not completely automatic, nor is it universally applicable, primarily for the following two reasons: our methodology is applicable only to implementations that manipulate pointers using a predefined set of operations, and only to implementations that ensure there are no cycles in garbage. We define the class of implementations to which our methodology is applicable more precisely in the following section.

The remainder of this paper is organized as follows. In Section 2, we describe the operations supported by LFRC, and in Section 3, we describe in general how they can be applied. Then, in Section 4, we demonstrate a specific example by using LFRC to convert the GC-dependent lock-free deque implementation from [3] into a GC-independent one. In Section 5, we present lock-free implementations for the LFRC operations. Related work is briefly discussed in Section 6, and concluding remarks appear in Section 7.

    2 LFRC

In this section, we describe the class of algorithms to which LFRC is applicable and the LFRC operations.

    2.1 Applicability

The LFRC operations support loading, storing, copying, and destroying pointers, as well as modifying them using CAS and DCAS. We present a methodology for transforming any garbage-collection-dependent concurrent data structure implementation that satisfies the two criteria below into an equivalent implementation that does not depend on garbage collection.

LFRC Compliance The implementation does not access or manipulate pointers other than by loading, storing, copying, or destroying them, or by modifying them using CAS and/or DCAS. (For example, this precludes the use of pointer arithmetic.)

Cycle-Free Garbage There are no pointer cycles in garbage. (Cycles may exist in the data structure, but not amongst objects that have been removed and should be freed.)

The set of operations allowed by the LFRC Compliance criterion above seems to be sufficient to support a wide range of concurrent data structure implementations. Given the general principles demonstrated in this paper, it should be straightforward to extend our methodology to support other operations such as load-linked and store-conditional.

The Cycle-Free Garbage criterion will clearly disqualify some implementations, but we believe that it is easily achievable for a wide range of concurrent data structures. In many cases, the criterion will be satisfied by the natural implementation; in others, some effort may be required to eliminate garbage cycles. In the example presented in Section 4, the original implementation did allow the possibility of cycles within garbage, but we overcame this problem with a straightforward modification. We have several other candidate implementations in the pipeline to which we believe this technique will be applicable.

    2.2 LFRC Operations

LFRC supports the following operations for accessing and manipulating pointers. As stated above, we assume that pointers are accessed only by means of these operations.

LFRCLoad(A,p) A is a pointer to a shared memory location that contains a pointer, and p is a pointer to a local pointer variable. The effect is to load the value from the location pointed to by A into the variable pointed to by p.

LFRCStore(A,v) A is a pointer to a shared memory location that contains a pointer, and v is a pointer value to be stored in this location.

LFRCCopy(p,v) p is a pointer to a local pointer variable and v is a pointer value to be copied to the variable pointed to by p.

LFRCDestroy(v) v is the value of a local pointer variable that is about to be destroyed.

LFRCCAS(A0,old0,new0) is the obvious simplification of LFRCDCAS, described next.

LFRCDCAS(A0,A1,old0,old1,new0,new1) A0 and A1 are pointers to shared memory locations that contain pointers, and old0, old1, new0, and new1 are pointer values. The effect is to atomically compare the contents of the location pointed to by A0 with old0 and the contents of the location pointed to by A1 with old1, and, if both comparisons succeed, to change the contents of the locations pointed to by A0 and A1 to new0 and new1, respectively, and return true; if either comparison fails, then the contents of the locations pointed to by A0 and A1 are left unchanged, and LFRCDCAS returns false.
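To make the intended usage concrete, the following short example is a sketch only: the type ObjectType, the shared location Shared, and the helper publishNewObject are hypothetical names (LFRC itself defines only the operations listed above), and we assume the operations have been typed for ObjectType as discussed in Section 3. The sketch shows a thread loading a shared pointer, installing a newly allocated object with LFRCCAS, and destroying its local pointers before they go out of scope.

// Hypothetical usage sketch; ObjectType, Shared, and publishNewObject are
// illustrative names, and NULL denotes the null pointer.
void publishNewObject(ObjectType **Shared) {
    ObjectType *nd = new ObjectType;   // constructor sets nd's reference count to 1
    ObjectType *old = NULL;            // local pointers start out NULL
    LFRCLoad(Shared, &old);            // counted load of the current shared value
    while (!LFRCCAS(Shared, old, nd))  // try to swing *Shared from old to nd
        LFRCLoad(Shared, &old);        // on failure, re-read and retry
    LFRCDestroy(old);                  // local pointers are about to be destroyed
    LFRCDestroy(nd);
}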

    3 LFRC Methodology

In this section we describe the steps of our methodology for transforming a GC-dependent implementation into a GC-independent one. All of these steps except Step 3 can be completely automated in an object-oriented language such as C++ using smart pointer techniques like those described in [2]; a sketch of this idea appears after the steps below. However, presenting these steps explicitly is clearer, and allows our methodology to be applied in non-object-oriented languages too. The steps are as follows:

1. Add reference counts: Add a field rc to each object type to be used by the implementation. This field should be set to 1 in a newly-created object (in our C++ implementation, this is achieved through object constructors).

2. Provide LFRCDestroy(v) function: This function should accept an object pointer v. If v is null, then the function should simply return; otherwise it should atomically decrement v->rc. If v->rc becomes zero as a result, LFRCDestroy should recursively call itself with each pointer in the object, and then free the object. An example is provided below, and we provide a function for the atomic decrement of the rc field, so writing this function is trivial. (We require this function only because it is the most convenient and language-independent way to iterate over all pointers in an object.)

3. Ensure no garbage cycles: Observe that the reference counts of nodes in a garbage cycle will remain non-zero forever. Therefore, to ensure that all garbage is collected, we must ensure that the implementation does not result in cycles among garbage objects. (Failing to achieve this will result in the memory on and reachable from the cycle being lost, but will not affect the correctness of the implemented data structure.) This step may be non-trivial or even impossible for some concurrent data structure implementations. If this property cannot be achieved for a particular data structure, then it is not a candidate for applying our methodology.

4. Produce correctly-typed LFRC operations: We have provided code for the LFRC operations to be used in the example implementation presented in the next section. In this implementation, there is only one type of pointer; for simplicity and clarity, we have explicitly designed the LFRC operations for this type. For similarly simple implementations, this step can be achieved by a simple search and replace. In algorithms that use multiple pointer types, either the operations will need to be duplicated for different pointer types, or the code for the operations will need to be unified to accept different pointer types. For example, this step could be eliminated completely by arranging for the rc field to be at the same offset in all objects, and by using void pointers instead of specifically-typed ones.

5. Replace pointer operations: Replace each pointer operation with its LFRC counterpart. For example, if A0 and A1 are pointers to shared pointer variables, and x0, x1, old0, old1, new0, new1 are pointer variables, then replacements should be done according to Table 1. Observe that the table does not contain an entry for replacing an assignment of one shared pointer to another, for example *A0 = *A1. Such assignments are not atomic; instead, the location pointed to by A1 is read into a register in one instruction, and the contents of the register are stored into the location pointed to by A0 in a separate instruction; this should be reflected explicitly, for example with the following code (this will be clearer after reading step 6 below):

{
  ObjectType *x = NULL;
  LFRCLoad(A1, &x);
  LFRCStore(A0, x);
  LFRCDestroy(x);
}

6. Management of local pointer variables: Whenever a thread loses a pointer (for example when a function that has local pointer variables returns, so its local variables go out of scope), it first calls LFRCDestroy() with this pointer. Also, all pointer variables must be initialized to NULL before being used with any of the LFRC operations. Thus, all pointers in a newly-allocated object should be initialized to NULL before the object is made visible to other threads. It is also important to explicitly remove pointers contained in a statically allocated object before destroying that object. (See Section 4 for an example.)
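As noted at the start of this section, Steps 5 and 6 can be automated in C++ with smart pointer techniques [2]. The class below is a sketch only, assuming the SNode type, Null, and the LFRC operations of the following sections; the name LocalRef and its interface are ours, not part of LFRC. It wraps a local pointer variable so that initialization to Null, copying via LFRCCopy, and the LFRCDestroy call at end of scope happen automatically; shared locations would still be accessed through LFRCLoad, LFRCStore, LFRCCAS, and LFRCDCAS.

// Hypothetical smart-pointer sketch automating Steps 5 and 6 for local pointers.
class LocalRef {
    SNode *p;                                  // the wrapped local pointer
public:
    LocalRef() : p(Null) {}                    // Step 6: locals start out Null
    LocalRef(const LocalRef &other) : p(Null) {
        LFRCCopy(&p, other.p);                 // Step 5: copying counts the new pointer
    }
    LocalRef &operator=(const LocalRef &other) {
        LFRCCopy(&p, other.p);
        return *this;
    }
    ~LocalRef() { LFRCDestroy(p); }            // Step 6: destroy the pointer at end of scope
    SNode **addr() { return &p; }              // e.g., LFRCLoad(&SharedVar, ref.addr())
    SNode *get() const { return p; }
};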

    4 An Example: Snark

In this section we show how to use our methodology to construct a GC-independent implementation of a concurrent deque, based on the GC-dependent one presented in [3] and affectionately known as Snark. Below we describe some features of this algorithm that are relevant to the application of our methodology, in preparation for explaining the application of the LFRC methodology to this algorithm; readers interested in the details are referred to [3].


Original Code                           Replacement Code
x0 = *A0;                               LFRCLoad(A0, &x0);
*A0 = x0;                               LFRCStore(A0, x0);
x0 = x1;                                LFRCCopy(&x0, x1);
CAS(A0, old0, new0)                     LFRCCAS(A0, old0, new0)
DCAS(A0, A1, old0, old1, new0, new1)    LFRCDCAS(A0, A1, old0, old1, new0, new1)

Table 1: Step 5: Replacing pointer accesses in original code with LFRC counterparts.


The C++ class declarations⁴ and code for one operation (pushRight) of the original Snark algorithm are shown on the left side of Figure 1. Snark [3] represents a deque as a doubly-linked list. The nodes of the doubly-linked list are called SNodes (lines 1..2); they contain L and R pointers, which point to their left and right neighbours, respectively. A Snark deque consists of two hat pointers (LeftHat and RightHat), which respectively point to the leftmost and rightmost nodes in a non-empty deque, and a special Dummy node, which is used as a sentinel node at one or both ends of the deque (line 3). The Snark constructor (lines 4..9) allocates an SNode for Dummy, makes the L and R pointers of Dummy point to Dummy itself, and makes LeftHat and RightHat both point to Dummy. This is one representation of the empty deque in this implementation. The self-pointers in the L and R pointers of the Dummy node are used to distinguish this node from nodes that are currently part of the deque. Some pop operations leave a previous deque node as a sentinel node; in this case, a pointer in this node is also directed toward the node itself to identify the node as a sentinel. Pointers are accessed only by load, store, and DCAS operations.

The right side of Figure 1 shows the GC-independent code we derived from the original Snark algorithm by using the LFRC methodology. Below we briefly explain the differences between the two implementations, and explain how we applied the steps of the LFRC methodology to the first in order to acquire the second.

Step 1 We added a reference count field (rc) to the SNode object (line 31); the SNode constructor sets it to 1 (line 32) to indicate that one pointer to the SNode exists (i.e., the one returned by new).

    Step 2 Our LFRCDestroy function is in Figure 2.

Step 3 The Snark algorithm required a slight modification in order to ensure that there are never any cycles in garbage. The reason is that sentinel nodes have self pointers. The correctness of the Snark algorithm is not affected by changing these self pointers to null pointers, and interpreting null pointers the same way the original algorithm interprets self pointers. This is because null pointers were not used at all in the original algorithm, so every occurrence of a null pointer in the revised algorithm can be interpreted as a self pointer. The result is that there are no cycles in garbage. To effect these changes, we changed the Snark constructor to set the Dummy node up with null pointers instead of self pointers (lines 36..37); we changed the checks for self pointers to checks for null pointers (line 59); and we changed the code in the left and right pop operations (not shown) to install null pointers instead of self pointers.

⁴ For clarity and brevity, we have omitted uninteresting details such as access modifiers.

Step 4 In this case, the LFRC operations (shown in the next section) are already typed for Snark.

Step 5 We replaced all pointer accesses with LFRC operations, in accordance with the transformations shown in Table 1. (See lines 35..39, 54, 57..58, 60..62, and 65..66.)

Step 6 We inserted calls to LFRCDestroy for each time a pointer variable is destroyed (for example, before returns from functions that contain local pointer variables). (See lines 52, 63, and 67.) We also added initialization to null for all pointers. (See lines 32, 34, and 50.) These initializations were not needed in the original algorithm, as the pointers are always modified before they are read. However, because LFRCStore must reduce the reference count of the object previously pointed to by the pointer it modifies, it is important to ensure that pointers are initialized before accessing them with any LFRC operation. Finally, we added a destructor operation for Snark objects, which writes null to the LeftHat, RightHat, and Dummy pointers before destroying the Snark object itself (lines 42..44). This ensures that objects remaining in the deque are freed, which was not necessary in the GC-dependent version. (Note: this destructor is not intended to be invoked concurrently with other operations; it should be invoked only after all threads have finished accessing the object.)


GC-dependent Snark (left side of Figure 1):

class SNode {
1   class SNode *L, *R; valtype V;
2   SNode() {};
};

class Snark {
3   SNode *Dummy, *LeftHat, *RightHat;
4   Snark() {
5     Dummy = new SNode;
6     Dummy->L = Dummy;
7     Dummy->R = Dummy;
8     LeftHat = Dummy;
9     RightHat = Dummy;
    };
10  valtype pushRight(valtype v);
11  valtype pushLeft(valtype v);
12  valtype popRight();
13  valtype popLeft();
};

valtype Snark::pushRight(valtype v) {
14  SNode *nd = new SNode;
15  SNode *rh, *rhR, *lh;
16  if (nd == Null)
17    return FULLval;
18  nd->R = Dummy;
19  nd->V = v;
20  while (true) {
21    rh = RightHat;
22    rhR = rh->R;
23    if (rhR == rh) {
24      nd->L = Dummy;
25      lh = LeftHat;
26      if (DCAS(&RightHat, &LeftHat, rh, lh, nd, nd))
27        return OKval;
      } else {
28      nd->L = rh;
29      if (DCAS(&RightHat, &(rh->R), rh, rhR, nd, nd))
30        return OKval;
      }
    }
}

Transformed GC-independent Snark using LFRC (right side of Figure 1):

class SNode {
31  class SNode *L, *R; valtype V; long rc;
32  SNode() : L(Null), R(Null), rc(1) {};
};

class Snark {
33  SNode *Dummy, *LeftHat, *RightHat;
34  Snark() : Dummy(Null), LeftHat(Null), RightHat(Null) {
35    LFRCStoreAlloc(&Dummy, new SNode);
36    LFRCStore(&(Dummy->L), Null);
37    LFRCStore(&(Dummy->R), Null);
38    LFRCStore(&LeftHat, Dummy);
39    LFRCStore(&RightHat, Dummy);
    };
40  ~Snark() {
41    while (popLeft() != EMPTYval);
42    LFRCStore(&Dummy, Null);
43    LFRCStore(&LeftHat, Null);
44    LFRCStore(&RightHat, Null);
    };
45  valtype pushRight(valtype v);
46  valtype pushLeft(valtype v);
47  valtype popRight();
48  valtype popLeft();
};

valtype Snark::pushRight(valtype v) {
49  SNode *nd = new SNode;
50  SNode *rh = Null, *rhR = Null, *lh = Null;
51  if (nd == Null) {
52    LFRCDestroy(rhR, nd, rh, lh);
53    return FULLval;
    }
54  LFRCStore(&(nd->R), Dummy);
55  nd->V = v;
56  while (true) {
57    LFRCLoad(&RightHat, &rh);
58    LFRCLoad(&(rh->R), &rhR);
59    if (rhR == Null) {
60      LFRCStore(&(nd->L), Dummy);
61      LFRCLoad(&LeftHat, &lh);
62      if (LFRCDCAS(&RightHat, &LeftHat, rh, lh, nd, nd)) {
63        LFRCDestroy(rhR, nd, rh, lh);
64        return OKval;
        }
      } else {
65      LFRCStore(&(nd->L), rh);
66      if (LFRCDCAS(&RightHat, &(rh->R), rh, rhR, nd, nd)) {
67        LFRCDestroy(rhR, nd, rh, lh);
68        return OKval;
        }
      }
    }
}

Figure 1: Class definitions and pushRight code for GC-dependent Snark (left) and transformed class definitions and pushRight code for GC-independent Snark using LFRC (right). A call to LFRCDestroy with multiple arguments is shorthand for calling LFRCDestroy once with each argument. LFRCStoreAlloc (line 35) is like LFRCStore except that it does not increment the reference count of the object. This is more convenient than explicitly saving the pointer returned by new so that it can be immediately LFRCDestroyed.


This concludes our discussion of the application of the LFRC methodology to Snark. All steps except Step 3 were quite mechanical, and did not require any creativity or reasoning about the algorithm. In this case, Step 3 was also quite easy. However, this will clearly be more difficult for some algorithms.

    5 LFRC Implementation

In this section we describe our simple implementation of the LFRC operations, and explain why they ensure that there are no memory leaks, and that memory does not get freed prematurely. The basic idea is to implement a reference count in each object that reflects the number of pointers to the object. When this count reaches zero, there are no more pointers to it, so no more pointers to it can be created, and it can be freed.

The main difficulty is that we cannot atomically change a pointer variable from pointing to one object to pointing to another and update the reference counts of both objects. We overcome this problem with the observations that:

  • provided an object's reference count is always at least the number of pointers to the object, it will never be freed prematurely, and

  • provided the count becomes zero when there are no longer any pointers to the object, there are no memory leaks.

Thus, we conservatively increment an object's reference count before creating a new pointer to it. If we subsequently fail to create that pointer, then we can decrement the reference count again afterwards to compensate. The key mechanism in our implementation is the use of DCAS to increment an object's reference count while simultaneously checking that some pointer to the object exists. This avoids the possibility of updating an object after it has been freed.

Below we describe our lock-free implementations of the LFRC operations, which are shown in Figure 2. We begin with the implementation of LFRCLoad, which accepts two parameters: a pointer A to a shared pointer, and a pointer dest to a local pointer variable of the calling thread. This operation loads the value in the location pointed to by A into the variable pointed to by dest. This potentially has the effect of destroying a pointer to one object O1 (the one previously pointed to by *dest) and creating a pointer to another O2 (the one pointed to by *A). It is important to increment the reference count of object O2 before decrementing the reference count of O1. To see why, suppose we are using LFRCLoad to load the next pointer of a linked list node into a local variable that points to that node. We must ensure that the node is not freed (and potentially reused for another purpose) before we load the next pointer from that node. Therefore, LFRCLoad begins by recording the previous value of *dest (line 1) for destruction later (line 12). It then loads a new value from *A, and increments the reference count of the object pointed to by *A in order to reflect the creation of a new pointer to it. Because the calling thread does not (necessarily) already have a pointer to this object, it is not safe to update the reference count using a simple CAS, because the object might be freed before the CAS executes. (Valois [19] used this approach, and as a result was forced to maintain unused nodes explicitly in a freelist, thereby preventing the space consumption of a list from shrinking over time.) Therefore, LFRCLoad uses DCAS to attempt to atomically increment the reference count, while ensuring that the pointer to the object still exists. (A similar trick was used by Greenwald [6] in his universal constructions.) This is achieved as follows. First, LFRCLoad reads the contents of *A (line 4). If it sees a null pointer, there is no reference count to be incremented, so LFRCLoad simply sets *dest to null (lines 5..7). Otherwise, it reads the current reference count of the object pointed to by the pointer it read in line 4, and then attempts to increment this count using DCAS (line 9). (Note that there is no risk that the object containing the pointer being read by LFRCLoad is freed during the execution of LFRCLoad, because the calling thread has a pointer to this object that is not destroyed during the execution of LFRCLoad, so its reference count cannot fall to zero.) If the DCAS succeeds, then the value read is stored in *dest (line 10). Otherwise, LFRCLoad retries.

Having successfully loaded a pointer from *A into *dest, and incremented the reference count of the object to which it points (if it is not null), it remains to call LFRCDestroy(*dest) (line 12) in order to decrement the reference count of the object previously pointed to by *dest. As explained previously, if LFRCDestroy's argument is non-null, then it decrements the reference count of the object pointed to by its argument (line 13). This is done using the add_to_rc function, which is implemented using CAS. add_to_rc is safe (in the sense that there is no risk that it will modify a freed object) because it is called only in situations in which we know that the calling thread has a pointer to this object, which has previously been included in the reference count. Thus, there is no risk that the reference count will become


void LFRCLoad(SNode **A, SNode **dest) {
1   SNode *a, *olddest = *dest;
2   long r;
3   while (true) {
4     a = *A;
5     if (a == Null) {
6       *dest = Null;
7       break;
      }
8     r = a->rc;
9     if (DCAS(A, &(a->rc), a, r, a, r+1)) {
10      *dest = a;
11      break;
      }
    }
12  LFRCDestroy(olddest);
}

void LFRCDestroy(SNode *p) {
13  if (p != Null && add_to_rc(p, -1) == 1) {
14    LFRCDestroy(p->L, p->R);
15    delete p;
    }
}

long add_to_rc(SNode *p, int v) {
16  long oldrc;
17  while (true) {
18    oldrc = p->rc;
19    if (CAS(&(p->rc), oldrc, oldrc+v))
20      return oldrc;
    }
}

void LFRCStore(SNode **A, SNode *v) {
21  SNode *oldval;
22  if (v != Null)
23    add_to_rc(v, 1);
24  while (true) {
25    oldval = *A;
26    if (CAS(A, oldval, v)) {
27      LFRCDestroy(oldval);
28      return;
      }
    }
}

void LFRCCopy(SNode **v, SNode *w) {
29  if (w != Null)
30    add_to_rc(w, 1);
31  LFRCDestroy(*v);
32  *v = w;
}

bool LFRCDCAS(SNode **A0, SNode **A1,
              SNode *old0, SNode *old1,
              SNode *new0, SNode *new1) {
33  if (new0 != Null) add_to_rc(new0, 1);
34  if (new1 != Null) add_to_rc(new1, 1);
35  if (DCAS(A0, A1, old0, old1, new0, new1)) {
36    LFRCDestroy(old0, old1);
37    return true;
    } else {
38    LFRCDestroy(new0, new1);
39    return false;
    }
}

    Figure 2: Code for LFRC operations. LFRCCAS is the same as LFRCDCAS, but without the second location.

zero, thereby causing the object to be freed, before the add_to_rc function completes. If this add_to_rc causes the reference count to become zero, then we are destroying the last pointer to this object, so it can be freed (line 15). First, however, LFRCDestroy calls itself recursively with each pointer in the object in order to update the reference counts of objects to which the soon-to-be-freed object has pointers (line 14).

LFRCStore accepts two parameters, a pointer A to a location that contains a pointer, and a pointer value v to be stored in this location. If the new value to be stored is not null, then LFRCStore increments the reference count of the object to which it points (lines 22..23). Note that at this point, the new pointer to this object has not been created, so the reference count is greater than the number of pointers to the object. However, this situation will not persist past the end of the execution of LFRCStore, because LFRCStore will not return until that pointer has been created. The pointer is created by repeatedly reading the current value of the pointer, and using CAS to attempt to change it to v (lines 24..28). When the CAS succeeds, we have created the pointer previously counted, and we have also destroyed a pointer, namely the previous contents of *A. Therefore, LFRCStore calls LFRCDestroy with the previous value of the pointer (line 27).

LFRCCopy accepts two parameters, a pointer v to a local pointer variable, and a value w of a local pointer variable. This operation assigns the value w to the variable pointed to by v. This creates a new pointer to the object pointed to by w (if w is not null), so LFRCCopy increments the reference count of that object (lines 29..30). It also destroys a pointer, namely the previous contents of *v, so LFRCCopy calls LFRCDestroy(*v) (line 31). Finally, LFRCCopy assigns the value w to the pointer variable pointed to by v, and returns (line 32).

LFRCDCAS accepts six parameters, corresponding to the DCAS parameters described in Section 1. LFRCDCAS is similar to LFRCStore in that it increments the reference counts of objects before creating new pointers to them, thereby temporarily setting these counts artificially high (lines 33..34). LFRCDCAS differs from LFRCStore, however, in that it does not insist on eventually creating those new pointers. If the DCAS at line 35 fails, then LFRCDCAS calls LFRCDestroy for each of the objects whose reference counts were previously incremented, in order to compensate for those previous increments, and then returns false (lines 38..39). On the other hand, if the DCAS succeeds, then the previous increments were justified, but we have destroyed two pointers, namely the previous values of the two locations accessed by the DCAS. Therefore, LFRCDCAS calls LFRCDestroy in order to decrement the reference counts of these objects, and then returns true (lines 36..37). The implementation of LFRCCAS (not shown in Figure 2) is just the obvious simplification of that of LFRCDCAS; a sketch appears below.
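For completeness, here is one way to spell out LFRCCAS following the description above and the Figure 2 caption; this concrete code is our sketch rather than code taken from the paper.

bool LFRCCAS(SNode **A0, SNode *old0, SNode *new0) {
    if (new0 != Null) add_to_rc(new0, 1);   // count the pointer we may create
    if (CAS(A0, old0, new0)) {
        LFRCDestroy(old0);                  // the previous shared pointer was destroyed
        return true;
    } else {
        LFRCDestroy(new0);                  // compensate for the increment above
        return false;
    }
}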

The LFRC operations are lock-free because each loop terminates unless some value changed during the loop iteration, and each time a value changes, the process that changes it terminates a loop.

    6 Related Work

In this section we briefly discuss related work not already mentioned in the paper.

In [18], Steele initiated work on concurrent garbage collection, in which the garbage collector can execute concurrently with the mutator. Much work has been done on concurrent garbage collection since. However, in almost all such work, mutual exclusion is used between mutator threads and the garbage collector (e.g., [4]) and/or stop-the-world techniques are used. Therefore, these collectors are not lock-free.

Nonetheless, there are several pieces of prior work in which researchers have attempted to eliminate the excessive synchronization of mutual exclusion and stop-the-world techniques. The first is due to Herlihy and Moss [7]. They present a lock-free copying garbage collector, which employs techniques similar to early lock-free implementations of concurrent data structures. The basic idea in these algorithms is to copy an object to private space, update it sequentially, and later attempt to make the private copy current. This must be done each time an object is updated, a significant disadvantage. (Herlihy and Moss also outline a possible approach for overcoming this limitation, but this approach is limited to certain circumstances, and is not generally applicable.) We therefore believe that our algorithm will perform better in most cases. Nonetheless, their approach has two advantages over ours: it does not require a DCAS operation, and it can collect cyclic garbage.

Hesselink and Groote [8] designed a wait-free garbage collector (wait-freedom is stronger than lock-freedom). However, this collector applies only to a restricted programming model, in which a thread can modify only its roots; objects are not modified between creation and deletion. This collector is therefore not generally applicable.

Levanoni and Petrank [12] have taken another approach to attacking the excessive synchronization of most other garbage collectors. Their approach is designed to avoid synchronization as much as possible: they even eschew the use of CAS. However, their desire to avoid such synchronization has resulted in a rather complicated algorithm. We suspect that algorithms that make judicious use of strong synchronization mechanisms (such as DCAS) will perform better because the algorithms will be much simpler. However, this depends on many factors, including the implementation of the hypothetical strong primitives.

The on-the-fly garbage collection algorithm of Dijkstra et al. [5] (with extensions by Kung and Song [11], Ben-Ari [1], and van de Snepscheut [20]) is an example of a concurrent collector: GC work is moved to a separate processor. A GC-dependent lock-free data structure implementation would gain many of the benefits of lock-free GC by using such a concurrent GC implementation, as long as the concurrent GC kept up with the need for new storage. But the overall system is not lock-free, since delaying the GC processor can delay all storage allocation requests.

    7 Concluding Remarks

Our methodology for transforming GC-dependent lock-free algorithms into GC-independent ones allows existing and future lock-free algorithms that depend on GC to be used in environments without GC, and even within GC implementations themselves. It also allows researchers to concentrate on the important features of their algorithms, rather than being distracted by often-tricky problems such as memory management and the ABA problem.

Our methodology is based on reference counting, and uses DCAS to update the reference count of an object atomically with a check that the object still has a pointer to it. By weakening the requirement that reference counts record the exact number of pointers to an object, we are able to separate the updates of reference counts from the updates of the pointers themselves. This allows us to support strong synchronization operations, including CAS and DCAS, on pointers.

The simplicity of our approach is largely due to the use of DCAS. This adds to the mounting evidence that stronger synchronization primitives are needed to support efficient and scalable synchronization.

Our methodology does have some shortcomings, and further work is needed to achieve practical and generally-applicable lock-free garbage collection. We believe that many of the techniques that have been used in previous work in garbage collection can be adapted to exploit DCAS or other strong synchronization mechanisms. One obvious example is to apply techniques that allow large structures to be collected incrementally. This would avoid long delays when a thread destroys the last pointer to a large structure. Another example is to integrate a tracing collector that can be invoked occasionally in order to identify and collect cyclic garbage.

    References

[1] M. Ben-Ari. On-the-fly garbage collection: New algorithms inspired by program proofs. In M. Nielsen and E. Meineche Schmidt, editors, Automata, Languages and Programming, 9th Colloquium, volume 140 of Lecture Notes in Computer Science, pages 14–22, Aarhus, Denmark, 12–16 July 1982. Springer-Verlag.

[2] D. Detlefs. Garbage collection and run-time typing as a C++ library. In Proceedings of the 1992 Usenix C++ Conference, pages 37–56, 1992.

[3] D. Detlefs, C. Flood, A. Garthwaite, P. Martin, N. Shavit, and G. Steele. Even better DCAS-based concurrent deques. In Proceedings of the 14th International Conference on Distributed Computing, pages 59–73, 2000.

[4] J. DeTreville. Experiences with concurrent garbage collectors for Modula-2+. Technical Report 64, Digital Equipment Corporation Systems Research Center, 1990.

[5] E. W. Dijkstra, L. Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. On-the-fly garbage collection: An exercise in cooperation. CACM, 21(11):966–975, November 1978.

[6] M. Greenwald. Non-Blocking Synchronization and System Design. PhD thesis, Stanford University Technical Report STAN-CS-TR-99-1624, Palo Alto, CA, August 1999.

[7] M. Herlihy and E. Moss. Lock-free garbage collection for multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 3(3), 1992.

[8] W. H. Hesselink and J. F. Groote. Waitfree distributed memory management by create, and read until deletion (CRUD). In 113, page 17, Centrum voor Wiskunde en Informatica (CWI), ISSN 1386-369X, December 31, 1998. SEN (Software Engineering).

[9] D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms. Addison-Wesley, 1968.

[10] H. Kung and L. Lehman. Concurrent manipulation of binary search trees. ACM Transactions on Database Systems, 5(3):354–382, 1980.

[11] H. T. Kung and S. Song. An efficient parallel garbage collector and its correctness proof. Technical report, Carnegie Mellon University, September 1977.

[12] Y. Levanoni and E. Petrank. A scalable reference counting garbage collector. Technical Report CS-0967, Computer Science Department, Technion, Israel, 1999.

[13] M. Michael and M. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 267–276, 1996.

[14] M. Moir. Transparent support for wait-free transactions. In Proceedings of the 11th International Workshop on Distributed Algorithms, pages 305–319, 1997.

[15] Motorola. MC68020 32-Bit Microprocessor User's Manual. Prentice-Hall, 1986.

[16] W. Pugh. Concurrent maintenance of skip lists. Technical Report CS-TR-2222.1, Department of Computer Science, University of Maryland, 1989.

[17] N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, Special Issue (10):99–116, 1997.

[18] G. L. Steele. Multiprocessing compactifying garbage collection. Communications of the ACM, 18(9):495–508, 1975.

[19] J. Valois. Lock-free linked lists using compare-and-swap. In Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, pages 214–222, 1995. See http://www.cs.sunysb.edu/valois for errata.

[20] J. L. A. van de Snepscheut. Algorithms for on-the-fly garbage collection revisited. Information Processing Letters, 24(4):211–216, March 1987.