Technical Report no. 2003-01

Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems¹

Håkan Sundell, Philippas Tsigas
Department of Computing Science, Chalmers University of Technology and Göteborg University, SE-412 96 Göteborg, Sweden

Göteborg, 2003

¹ This work is partially funded by: i) the national Swedish Real-Time Systems research initiative ARTES (www.artes.uu.se), supported by the Swedish Foundation for Strategic Research, and ii) the Swedish Research Council for Engineering Sciences.




Technical Report in Computing Science at Chalmers University of Technology and Göteborg University

Technical Report no. 2003-01
ISSN: 1650-3023

Department of Computing Science, Chalmers University of Technology and Göteborg University, SE-412 96 Göteborg, Sweden

Göteborg, Sweden, 2003


Abstract

We present an efficient and practical lock-free implementation of a concurrent priority queue that is suitable both for fully concurrent (large multi-processor) systems and for pre-emptive (multi-process) systems. Many algorithms for concurrent priority queues are based on mutual exclusion. However, mutual exclusion causes blocking, which has several drawbacks and degrades the system's overall performance. Non-blocking algorithms avoid blocking and are either lock-free or wait-free. Previously known non-blocking algorithms for priority queues did not perform well in practice because of their complexity, and they are often based on atomic synchronization primitives that are not available on existing platforms. Our algorithm is based on the randomized sequential list structure called a Skiplist, and a real-time extension of our algorithm is also described. In our performance evaluation we compare our algorithm with some of the most efficient known implementations of priority queues. The experimental results clearly show that our lock-free implementation outperforms the other, lock-based implementations in all cases for 3 or more threads, both on fully concurrent and on pre-emptive systems.

1 Introduction

Priority queues are fundamental data structures. From the operating system level to the user application level, they are frequently used as basic components. For example, the ready-queue that is used in the scheduling of tasks in many real-time systems can usually be implemented using a concurrent priority queue. Consequently, the design of efficient implementations of priority queues is an area that has been extensively researched. A priority queue supports two operations, Insert and DeleteMin. The abstract definition of a priority queue is a set of key-value pairs, where the key represents a priority. The Insert

operation inserts a new key-value pair into the set, and the DeleteMin operation removes and returns the value of the key-value pair with the lowest key (i.e. the highest priority) in the set.
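For concreteness, this abstract semantics can be sketched as a minimal single-threaded reference model in C (all names and the fixed capacity are ours, purely for illustration; the paper's concurrent algorithm follows in Section 3):

```c
#include <stddef.h>

/* Hypothetical single-threaded model: a set of key-value pairs with
 * unique keys, where a lower key means a higher priority. */
#define PQ_CAP 64

typedef struct { int key; void *value; } pq_pair;
typedef struct { pq_pair items[PQ_CAP]; int size; } pq;

/* Insert: add a new pair, or update the value if the key already exists
 * (returning 2 mirrors the paper's "true2" update result). */
static int pq_insert(pq *q, int key, void *value) {
    for (int i = 0; i < q->size; i++)
        if (q->items[i].key == key) {
            q->items[i].value = value;
            return 2; /* updated existing pair */
        }
    q->items[q->size].key = key;
    q->items[q->size].value = value;
    q->size++;
    return 1; /* inserted new pair */
}

/* DeleteMin: remove and return the value of the pair with the lowest
 * key; NULL if the set is empty. */
static void *pq_delete_min(pq *q) {
    if (q->size == 0) return NULL;
    int min = 0;
    for (int i = 1; i < q->size; i++)
        if (q->items[i].key < q->items[min].key) min = i;
    void *v = q->items[min].value;
    q->items[min] = q->items[--q->size]; /* fill the hole */
    return v;
}
```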

To ensure the consistency of a shared data object in a concurrent environment, the most common method is mutual exclusion, i.e. some form of locking. Mutual exclusion degrades the system's overall performance [14] as it causes blocking: other concurrent operations cannot make any progress while access to the shared resource is blocked by the lock. Using mutual exclusion can also cause deadlocks and priority inversion (which can be solved efficiently on uni-processors [13] at the cost of a more difficult analysis, though not as efficiently on multiprocessor systems [12]), and even starvation.

To address these problems, researchers have proposed non-blocking algorithms for shared data objects. Non-blocking methods do not involve mutual exclusion and therefore do not suffer from the problems that blocking can cause. Non-blocking algorithms are either lock-free or wait-free. Lock-free implementations guarantee that, regardless of the contention caused by concurrent operations and the interleaving of their sub-operations, at least one operation will always make progress. However, there is a risk of starvation, as the progress of other operations could cause one specific operation to never finish. This is, however, different from the type of starvation that can be caused by blocking, where a single operation could block every other operation forever and thus starve the whole system. Wait-free [6] algorithms are lock-free and moreover avoid starvation as well: in a wait-free algorithm every operation is guaranteed to finish in a limited number of steps, regardless of the actions of concurrent operations. Non-blocking algorithms have been shown to be of great importance in practical applications [17, 18], and recently NOBLE, a non-blocking inter-process communication library, has been introduced [16].

There exist several algorithms and implementations of concurrent priority queues. The majority of the algorithms are lock-based, either with a single lock on top of a sequential algorithm, or specially constructed algorithms using multiple locks, where each lock protects a small part of the shared data structure. Several different representations of the shared data structure are used; for example, Hunt et al. [7] present an implementation based on heap structures, Grammatikakis et al. [3] compare different structures including cyclic arrays and heaps, and most recently Lotan and Shavit [9] presented an implementation based on the Skiplist structure [11]. The algorithm by Hunt et al. locks each node separately and uses a technique to scatter the accesses to the heap, thus reducing contention. Its implementation is publicly available and its performance has been documented on multi-processor systems. Lotan and Shavit extend the functionality of the concurrent priority queue and assume the availability of a global high-accuracy clock. They apply a lock on each pointer, and as the multi-pointer Skiplist structure is used, the number of locks is significantly higher than the number of nodes. Its performance has previously only been documented by simulation, with very promising results.

Israeli and Rappoport have presented a wait-free algorithm for a concurrent priority queue [8]. This algorithm makes use of strong atomic synchronization primitives that have not been implemented on any existing platform. However, there exists an attempt at a wait-free algorithm by Barnes [2] that uses existing atomic primitives, though this algorithm does not comply with the generally accepted definition of the wait-free property. The algo-


rithm is not yet implemented, and the theoretical analysis predicts worse behavior than the corresponding sequential algorithm, which makes it of no practical interest.

One common problem with many algorithms for concurrent priority queues is the lack of precisely defined semantics of the operations. It is also seldom that correctness with respect to concurrency is proved using a strong property like linearizability [5].

In this paper we present a lock-free algorithm for a concurrent priority queue that is designed for efficient use in both pre-emptive and fully concurrent environments. Inspired by Lotan and Shavit [9], the algorithm is based on the randomized Skiplist [11] data structure, but in contrast to [9] it is lock-free. It is also implemented using common synchronization primitives that are available in modern systems. The algorithm is described in detail later in this paper, and the aspects concerning the underlying lock-free memory management are also presented. The precise semantics of the operations are defined and a proof is given that our implementation is lock-free and linearizable. We have performed experiments that compare the performance of our algorithm with some of the most efficient known implementations of concurrent priority queues, i.e. the implementation by Lotan and Shavit [9] and the implementation by Hunt et al. [7]. Experiments were performed on three different platforms, consisting of multiprocessor systems using different operating systems and equipped with either 2, 4 or 64 processors. Our results show that our algorithm outperforms the other, lock-based implementations for 3 or more threads, in both highly pre-emptive and fully concurrent environments. We also present an extended version of our algorithm that addresses certain real-time aspects of the priority queue, as introduced by Lotan and Shavit [9].

The rest of the paper is organized as follows. In Section 2 we define the properties of the systems that our implementation is aimed for. The actual algorithm is described in Section 3. In Section 4 we define the precise semantics of the operations on our implementation, and show correctness by proving the lock-free and linearizability properties. The experimental evaluation that shows the performance of our implementation is presented in Section 5. In Section 6 we extend our algorithm with functionality that can be needed for specific real-time applications. We conclude the paper in Section 7.

2 System Description

A typical abstraction of a shared memory multi-processor system configuration is depicted in Figure 1. Each node of the system contains a processor together with its local memory. All nodes are connected to the shared memory via an interconnection network. A set of co-

Figure 1. Shared Memory Multiprocessor System Structure.

Figure 2. The Skiplist data structure with 5 nodes inserted.

operating tasks is running on the system, each performing its respective operations. Each task is sequentially executed on one of the processors, while each processor can serve (run) many tasks at a time. The co-operating tasks, possibly running on different processors, use shared data objects built in the shared memory to co-ordinate and communicate. Tasks synchronize their operations on the shared data objects through sub-operations on top of a cache-coherent shared memory. The shared memory may, though, not be uniformly accessible for all nodes in the system; processors can have different access times to different parts of the memory.

3 Algorithm

The algorithm is based on the sequential Skiplist data structure invented by Pugh [11]. This structure uses randomization and has a probabilistic time complexity of O(log N), where N is the maximum number of elements in the list. The data structure is basically an ordered list with randomly distributed short-cuts in order to improve search times, see Figure 2. The maximum height (i.e. the maxi-

structure Node
    key, level, validLevel ⟨, timeInsert⟩ : integer
    value : pointer to word
    next[level], prev : pointer to Node

Figure 3. The Node structure.


Figure 4. Concurrent insert and delete operations can delete both nodes.

function TAS(value: pointer to word): boolean
atomic do
    if *value = 0 then
        *value := 1;
        return true;
    else return false;

procedure FAA(address: pointer to word, number: integer)
atomic do
    *address := *address + number;

function CAS(address: pointer to word, oldvalue: word, newvalue: word): boolean
atomic do
    if *address = oldvalue then
        *address := newvalue;
        return true;
    else return false;

Figure 5. The Test-And-Set (TAS), Fetch-And-Add (FAA) and Compare-And-Swap (CAS) atomic primitives.

mum number of next pointers) of the data structure is log N. The height of each inserted node is randomized geometrically so that 50% of the nodes have height 1, 25% have height 2, and so on. To use the data structure as a priority queue, the nodes are ordered with respect to priority (which has to be unique for each node), with the nodes of highest priority located first in the list. The fields of each node item, as used in this implementation, are described in Figure 3. For all code examples in this paper, code between the "⟨" and "⟩" symbols is only used for the special implementation that involves timestamps (see Section 6), and is thus not included in the standard version of the implementation.
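The geometric height randomization described above can be sketched as follows (a hypothetical helper in C; the function name and the level cap are our assumptions, not taken from the paper):

```c
#include <stdlib.h>

#define MAX_LEVEL 16  /* our illustrative cap on node height */

/* Draw a node height with geometric distribution: about 50% of the
 * nodes get height 1, 25% get height 2, and so on, as in Pugh's
 * sequential Skiplist. */
static int random_level(void) {
    int level = 1;
    while (level < MAX_LEVEL && (rand() & 1))
        level++;  /* flip a coin; on heads, grow one more level */
    return level;
}
```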

In order to make the Skiplist construction concurrent and non-blocking, we use three of the standard atomic synchronization primitives: Test-And-Set (TAS), Fetch-And-Add (FAA) and Compare-And-Swap (CAS). Figure 5 gives the specification of these primitives, which are available on most modern platforms.
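On platforms with C11 atomics, the three primitives of Figure 5 can be approximated as shown below (a sketch using <stdatomic.h> on int-sized words; the paper itself targets the hardware primitives directly):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-And-Set: atomically set *value to 1, return true iff it was 0. */
static bool tas(atomic_int *value) {
    int expected = 0;
    return atomic_compare_exchange_strong(value, &expected, 1);
}

/* Fetch-And-Add: atomically add number to *address. */
static void faa(atomic_int *address, int number) {
    atomic_fetch_add(address, number);
}

/* Compare-And-Swap: atomically replace *address with newvalue
 * iff it currently equals oldvalue. */
static bool cas(atomic_int *address, int oldvalue, int newvalue) {
    return atomic_compare_exchange_strong(address, &oldvalue, newvalue);
}
```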

As we are concurrently (with possible preemptions) traversing nodes that are continuously allocated and reclaimed, we have to consider several aspects of memory management. No node should be reclaimed and then later re-allocated while some other process is traversing it. This can be solved, for example, by careful reference counting. We have chosen to use the lock-free memory management scheme invented by Valois [19] and corrected by Michael and Scott [10], which makes use of the FAA and CAS atomic synchronization primitives.
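The core reference-count manipulation of such a scheme can be sketched as follows (a greatly simplified illustration with names of our own; the actual scheme of Valois, as corrected by Michael and Scott, additionally guards against races between de-reference and reclamation, which this sketch does not):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Simplified reference-counted node (illustrative only). */
typedef struct rc_node {
    atomic_int refcount;
    /* ... node payload would go here ... */
} rc_node;

/* Take one more reference with FAA, as in COPY_NODE. */
static rc_node *rc_copy(rc_node *n) {
    atomic_fetch_add(&n->refcount, 1);
    return n;
}

/* Drop a reference, as in RELEASE_NODE; atomic_fetch_sub returns the
 * previous value, so the last holder (previous count 1) reclaims. */
static void rc_release(rc_node *n) {
    if (atomic_fetch_sub(&n->refcount, 1) == 1)
        free(n);
}
```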

To insert or delete a node from the list we have to change the respective set of next pointers. These have to be changed consistently, but not necessarily all at once. Our solution is to keep additional information in each node about its deletion (or insertion) status. This additional information guides the concurrent processes that might traverse into a partially deleted or inserted node. When all necessary next pointers have been changed, the node is fully deleted or inserted.

One problem, general for non-blocking implementations based on the linked-list structure, arises when inserting a new node into the list. Because of the linked-list structure, one has to make sure that the previous node is not about to be deleted. If we atomically change the next pointer of this previous node with CAS to point to the new node, and immediately afterwards the previous node is deleted, then the new node will be deleted as well, as illustrated in Figure 4. There are several solutions to this problem. One solution is to use the CAS2 operation, as it can change two pointers atomically, but this operation is not available on any existing multiprocessor system. A second solution is to insert auxiliary nodes [19] between every two normal nodes, and the latest method, introduced by Harris [4], is to use one bit of the pointer values as a deletion mark. On most modern 32-bit systems, 32-bit values can only be located at addresses that are evenly divisible by 4, and therefore bits 0 and 1 of the address are always zero. The method is then to use the previously unused bit 0 of the next pointer to mark, using CAS, that this node is about to be deleted. Any concurrent Insert operation will then be notified about the deletion when its CAS operation fails.
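The marking technique can be sketched with helpers corresponding to the IS_MARKED, GET_MARKED and GET_UNMARKED operations used in the pseudocode (a C sketch of ours; the paper applies these bit manipulations to next pointers together with CAS):

```c
#include <stdint.h>
#include <stdbool.h>

/* On systems where word-sized values are at least 4-byte aligned,
 * bit 0 of a valid pointer is always zero and can carry the
 * "this node is about to be deleted" flag. */
static inline bool is_marked(void *p) {
    return ((uintptr_t)p & 1) != 0;
}

static inline void *get_marked(void *p) {
    return (void *)((uintptr_t)p | 1);   /* set the deletion mark */
}

static inline void *get_unmarked(void *p) {
    return (void *)((uintptr_t)p & ~(uintptr_t)1);  /* recover the pointer */
}
```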

One memory management issue is how to de-reference pointers safely. If we simply de-reference a pointer, the corresponding node might have been reclaimed before we could access it. It can also be that bit 0 of the pointer is set, marking that the node is deleted and that the pointer is therefore not valid. The following functions are defined for safe handling of the memory management:

function READ_NODE(address: pointer to pointer to Node): pointer to Node
/* De-reference the pointer and increase the reference counter for the corresponding node. In case the pointer is marked, NULL is returned */


// Global variables
head, tail : pointer to Node
// Local variables
node2 : pointer to Node

function ReadNext(node1: pointer to pointer to Node, level: integer): pointer to Node
R1  if IS_MARKED((*node1).value) then
R2      *node1 := HelpDelete(*node1, level);
R3  node2 := READ_NODE((*node1).next[level]);
R4  while node2 = NULL do
R5      *node1 := HelpDelete(*node1, level);
R6      node2 := READ_NODE((*node1).next[level]);
R7  return node2;

function ScanKey(node1: pointer to pointer to Node, level: integer, key: integer): pointer to Node
S1  node2 := ReadNext(node1, level);
S2  while node2.key < key do
S3      RELEASE_NODE(*node1);
S4      *node1 := node2;
S5      node2 := ReadNext(node1, level);
S6  return node2;

Figure 6. Functions for traversing the nodes in the Skiplist data structure.

function COPY_NODE(node: pointer to Node): pointer to Node
/* Increase the reference counter for the given node */

procedure RELEASE_NODE(node: pointer to Node)
/* Decrement the reference counter on the given node. If the reference count reaches zero, then call RELEASE_NODE on the nodes that this node owns pointers to, then reclaim the node */

While traversing the nodes, processes will eventually reach nodes that are marked to be deleted. As the process that invoked the corresponding Delete operation might be pre-empted, this Delete operation has to be helped to finish before the traversing process can continue. However, it is only necessary to help the part of the Delete operation on the current level in order to be able to traverse to the next node. The function ReadNext, see Figure 6, traverses to the next node of node1 on the given level while helping any marked nodes in between to finish their deletion (and then sets node1 to the previous node of the helped one). The function ScanKey, see Figure 6, traverses in several steps through the next pointers (starting from node1) at the current level until it finds a node that has the same or a higher key (priority) value than the given key. It also sets node1 to be the previous node of the returned node.

The implementation of the Insert operation, see Fig-

// Local variables
node1, node2, newNode, savedNodes[maxlevel] : pointer to Node

function Insert(key: integer, value: pointer to word): boolean
I1   ⟨TraverseTimeStamps();⟩
I2   level := randomLevel();
I3   newNode := CreateNode(level, key, value);
I4   COPY_NODE(newNode);
I5   node1 := COPY_NODE(head);
I6   for i := maxLevel-1 to 1 step -1 do
I7       node2 := ScanKey(&node1, i, key);
I8       RELEASE_NODE(node2);
I9       if i < level then savedNodes[i] := COPY_NODE(node1);
I10  while true do
I11      node2 := ScanKey(&node1, 0, key);
I12      value2 := node2.value;
I13      if not IS_MARKED(value2) and node2.key = key then
I14          if CAS(&node2.value, value2, value) then
I15              RELEASE_NODE(node1);
I16              RELEASE_NODE(node2);
I17              for i := 1 to level-1 do
I18                  RELEASE_NODE(savedNodes[i]);
I19              RELEASE_NODE(newNode);
I20              RELEASE_NODE(newNode);
I21              return true2;
I22          else
I23              RELEASE_NODE(node2);
I24              continue;
I25      newNode.next[0] := node2;
I26      RELEASE_NODE(node2);
I27      if CAS(&node1.next[0], node2, newNode) then
I28          RELEASE_NODE(node1);
I29          break;
I30      Back-Off
I31  for i := 1 to level-1 do
I32      newNode.validLevel := i;
I33      node1 := savedNodes[i];
I34      while true do
I35          node2 := ScanKey(&node1, i, key);
I36          newNode.next[i] := node2;
I37          RELEASE_NODE(node2);
I38          if IS_MARKED(newNode.value) or CAS(&node1.next[i], node2, newNode) then
I39              RELEASE_NODE(node1);
I40              break;
I41          Back-Off
I42  newNode.validLevel := level;
I43  ⟨newNode.timeInsert := getNextTimeStamp();⟩
I44  if IS_MARKED(newNode.value) then newNode := HelpDelete(newNode, 0);
I45  RELEASE_NODE(newNode);
I46  return true;

Figure 7. Insert


// Local variables
prev, last, node1, node2 : pointer to Node

function DeleteMin(): pointer to Node
D1   ⟨TraverseTimeStamps();⟩
D2   ⟨time := getNextTimeStamp();⟩
D3   prev := COPY_NODE(head);
D4   while true do
D5       node1 := ReadNext(&prev, 0);
D6       if node1 = tail then
D7           RELEASE_NODE(prev);
D8           RELEASE_NODE(node1);
D9           return NULL;
     retry:
D10      value := node1.value;
D11      if not IS_MARKED(value) ⟨and compareTimeStamp(time, node1.timeInsert) > 0⟩ then
D12          if CAS(&node1.value, value, GET_MARKED(value)) then
D13              node1.prev := prev;
D14              break;
D15          else goto retry;
D16      else if IS_MARKED(value) then
D17          node1 := HelpDelete(node1, 0);
D18      RELEASE_NODE(prev);
D19      prev := node1;
D20  for i := 0 to node1.level-1 do
D21      repeat
D22          node2 := node1.next[i];
D23      until IS_MARKED(node2) or CAS(&node1.next[i], node2, GET_MARKED(node2));
D24  prev := COPY_NODE(head);
D25  for i := node1.level-1 to 0 step -1 do
D26      while true do
D27          if node1.next[i] = 1 then break;
D28          last := ScanKey(&prev, i, node1.key);
D29          RELEASE_NODE(last);
D30          if last ≠ node1 or node1.next[i] = 1 then break;
D31          if CAS(&prev.next[i], node1, GET_UNMARKED(node1.next[i])) then
D32              node1.next[i] := 1;
D33              break;
D34          if node1.next[i] = 1 then break;
D35          Back-Off
D36  RELEASE_NODE(prev);
D37  RELEASE_NODE(node1);
D38  RELEASE_NODE(node1); /* Delete the node */
D39  return value;

Figure 8. DeleteMin

// Local variables
prev, last, node2 : pointer to Node

function HelpDelete(node: pointer to Node, level: integer): pointer to Node
H1   for i := level to node.level-1 do
H2       repeat
H3           node2 := node.next[i];
H4       until IS_MARKED(node2) or CAS(&node.next[i], node2, GET_MARKED(node2));
H5   prev := node.prev;
H6   if not prev or level ≥ prev.validLevel then
H7       prev := COPY_NODE(head);
H8       for i := maxLevel-1 to level step -1 do
H9           node2 := ScanKey(&prev, i, node.key);
H10          RELEASE_NODE(node2);
H11  else COPY_NODE(prev);
H12  while true do
H13      if node.next[level] = 1 then break;
H14      last := ScanKey(&prev, level, node.key);
H15      RELEASE_NODE(last);
H16      if last ≠ node or node.next[level] = 1 then break;
H17      if CAS(&prev.next[level], node, GET_UNMARKED(node.next[level])) then
H18          node.next[level] := 1;
H19          break;
H20      if node.next[level] = 1 then break;
H21      Back-Off
H22  RELEASE_NODE(node);
H23  return prev;

Figure 9. HelpDelete

ure 7, starts in lines I5-I11 with a search phase to find the node after which the new node (newNode) should be inserted. This search phase starts from the head node at the highest level and traverses down to the lowest level until the correct node is found (node1). When going down one level, the last node traversed on that level is remembered (savedNodes) for later use (this is where we should insert the new node at that level). It is possible that there already exists a node with the same priority as the new node; this is checked in lines I12-I24, and the value of the old node (node2) is then changed atomically with a CAS. Otherwise, in lines I25-I42 it starts trying to insert the new node, starting with the lowest level and going up to the level of the new node. The next pointers of the (to become previous) nodes are changed atomically with a CAS. After the new node has been inserted at the lowest level, it is possible that it is deleted by a concurrent DeleteMin operation before it has been inserted at all levels; this is checked in lines I38 and I44.

The DeleteMin operation, see Figure 8, starts from the head node and finds the first node (node1) in the list that


does not have its deletion mark set on the value, see lines D3-D12. It tries to set this deletion mark in line D12 using the CAS primitive, and if it succeeds it also writes a valid pointer to the prev field of the node. This prev field is necessary in order to increase the performance of concurrent HelpDelete operations, which otherwise would have to search for the previous node in order to complete the deletion. The next step is to mark the deletion bits of the next pointers in the node, starting with the lowest level and going upwards, using the CAS primitive in each step, see lines D20-D23. Afterwards, in lines D24-D35, it starts the actual deletion by changing the next pointers of the previous node (prev), starting at the highest level and continuing downwards. The reason for doing the deletion in decreasing order of levels is that concurrent search operations also start at the highest level and proceed downwards; in this way the concurrent search operations will sooner avoid traversing this node. The procedure performed by the DeleteMin

operation in order to change each next pointer of the previous node is to first search for the previous node and then perform the CAS primitive until it succeeds.

The algorithm has been designed for pre-emptive as well as fully concurrent systems. In order to achieve the lock-free property (that at least one thread is making progress) on pre-emptive systems, whenever a search operation finds a node that is about to be deleted, it calls the HelpDelete operation and then proceeds searching from the previous node of the deleted one. The HelpDelete operation, see Figure 9, tries to fulfill the deletion on the current level and returns when it is completed. It starts in lines H1-H4 by setting the deletion mark on all next pointers in case they have not already been set. In lines H5-H6 it checks if the node given in the prev field is valid for deletion on the current level; otherwise it searches for the correct node (prev) in lines H7-H10. The actual deletion of this node on the current level takes place in lines H12-H21. This operation might execute concurrently with the corresponding DeleteMin operation, and therefore both operations synchronize with each other in lines D27, D30, D32, D34, H13, H16, H18 and H20 in order to avoid executing sub-operations that have already been performed.

In fully concurrent systems, though, the helping strategy can degrade performance significantly. Therefore, after a number of consecutive failed attempts to help concurrent DeleteMin operations that hinder the progress of the current operation, the algorithm puts the operation into back-off mode. When in back-off mode, the thread does nothing for a while, and in this way avoids disturbing the concurrent operations that might otherwise progress more slowly. The duration of the back-off is proportional to the number of threads, and for each consecutive entry into back-off mode during one operation invocation, the duration is increased exponentially.
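This back-off policy can be sketched as follows (constants, names, and the spin-loop delay mechanism are our assumptions; the paper does not specify the exact delay implementation):

```c
/* One back-off step: on first entry the delay is proportional to the
 * number of threads; each consecutive entry doubles it. */
static void back_off(int num_threads, int *backoff_units) {
    if (*backoff_units == 0)
        *backoff_units = 1000 * num_threads;  /* initial delay ~ #threads */
    for (volatile int i = 0; i < *backoff_units; i++)
        ;                     /* spin: do nothing for a while */
    *backoff_units *= 2;      /* exponential increase for next entry */
}
```

A caller would reset its back-off counter to 0 at the start of each operation invocation, so the exponential growth applies only within one invocation.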

4 Correctness

In this section we present the proof of our algorithm. We first prove that our algorithm is linearizable [5] and then we prove that it is lock-free. A set of definitions that will help us structure and shorten the proof is explained first. We start by defining the sequential semantics of our operations and then introduce two definitions concerning concurrency aspects in general.

Definition 1 We denote with L_t the abstract internal state of a priority queue at time t. L_t is viewed as a set of pairs ⟨p, v⟩ consisting of a unique priority p and a corresponding value v. The operations that can be performed on the priority queue are Insert (I) and DeleteMin (DM). The time t1 is defined as the time just before the atomic execution of the operation that we are looking at, and the time t2 is defined as the time just after the atomic execution of the same operation. The return value true2 is returned by an Insert operation that has succeeded in updating an existing node; the return value true is returned by an Insert operation that succeeds in inserting a new node. In the following expressions that define the sequential semantics of our operations, the syntax is S1 : O1, S2, where S1 is the conditional state before the operation O1, and S2 is the resulting state after performing the corresponding operation:

⟨p1, _⟩ ∉ L_t1 : I1(⟨p1, v1⟩) = true,
L_t2 = L_t1 ∪ {⟨p1, v1⟩}   (1)

⟨p1, v11⟩ ∈ L_t1 : I1(⟨p1, v12⟩) = true2,
L_t2 = L_t1 \ {⟨p1, v11⟩} ∪ {⟨p1, v12⟩}   (2)

⟨p1, v1⟩ = {⟨min p, v⟩ | ⟨p, v⟩ ∈ L_t1}
: DM1() = ⟨p1, v1⟩, L_t2 = L_t1 \ {⟨p1, v1⟩}   (3)

L_t1 = ∅ : DM1() = ⊥   (4)

Definition 2 In a global time model each concurrent operation Op "occupies" a time interval [b_Op, f_Op] on the linear time axis (b_Op < f_Op). The precedence relation (denoted by '→') is a relation that relates operations of a possible execution; Op1 → Op2 means that Op1 ends before Op2 starts. The precedence relation is a strict partial order. Operations incomparable under → are called overlapping. The overlapping relation is denoted by ∥ and is commutative, i.e. Op1 ∥ Op2 and Op2 ∥ Op1. The precedence relation is extended to relate sub-operations of operations. Consequently, if Op1 → Op2, then for any sub-operations op1 and op2 of Op1 and Op2, respectively, it holds that op1 → op2. We also define the direct precedence relation →_d, such that if Op1 →_d Op2, then Op1 → Op2 and moreover there exists no operation Op3 such that Op1 → Op3 → Op2.


Definition 3 In order for an implementation of a shared concurrent data object to be linearizable [5], for every concurrent execution there should exist an equal (in the sense of the effect) and valid (i.e. it should respect the semantics of the shared data object) sequential execution that respects the partial order of the operations in the concurrent execution.

Next we are going to study the possible concurrent executions of our implementation. First we need to define the interpretation of the abstract internal state of our implementation.

Definition 4 The pair ⟨p, v⟩ is present (⟨p, v⟩ ∈ L) in the abstract internal state L of our implementation, when there is a next pointer from a present node on the lowest level of the Skiplist that points to a node that contains the pair ⟨p, v⟩, and this node is not marked as deleted with the mark on the value.

Lemma 1 The definition of the abstract internal state forour implementation is consistent with all concurrent opera-tions examining the state of the priority queue.

Proof: As the next and value pointers are changed using the CAS operation, we are sure that all threads see the same state of the Skiplist, and therefore all changes of the abstract internal state appear to be atomic. □

Definition 5 The decision point of an operation is defined as the atomic statement where the result of the operation is finally decided, i.e. independent of the result of any sub-operations following the decision point, the operation will have the same result. We define the state-read point of an operation to be the atomic statement where a sub-state of the priority queue is read, and this sub-state is the state on which the decision point depends. We also define the state-change point as the atomic statement where the operation changes the abstract internal state of the priority queue after it has passed the corresponding decision point.

We will now show that all of these points conform to thevery same statement, i.e. the linearizability point.

Lemma 2 An Insert operation which succeeds (I(⟨p, v⟩) = true) takes effect atomically at one statement.

Proof: The decision point of an Insert operation which succeeds (I(⟨p, v⟩) = true) is when the CAS sub-operation in line I27 (see Figure 7) succeeds; all following CAS sub-operations will eventually succeed, and the Insert operation will finally return true. The state of the list (L_t1) directly before the passing of the decision point must have been ⟨p, _⟩ ∉ L_t1, otherwise the CAS would

have failed. The state of the list directly after passing thedecision point will behp; vi 2 Lt2 . 2

Lemma 3 An Insert operation which updates (I(⟨p, v⟩) = true_2), takes effect atomically at one statement.

Proof: The decision point for an Insert operation which updates (I(⟨p, v⟩) = true_2) is when the CAS succeeds in line I14. The state of the list (L_t1) directly before passing the decision point must have been ⟨p, _⟩ ∈ L_t1, otherwise the CAS would have failed. The state of the list directly after passing the decision point will be ⟨p, v⟩ ∈ L_t3. □

Lemma 4 A DeleteMin operation which succeeds (D() = ⟨p, v⟩), takes effect atomically at one statement.

Proof: The decision point for a DeleteMin operation which succeeds (D() = ⟨p, v⟩) is when the CAS sub-operation in line D12 (see Figure 8) succeeds. The state of the list (L_t) directly before the passing of the decision point must have been ⟨p, v⟩ ∈ L_t, otherwise the CAS would have failed. The state of the list directly after passing the decision point will be ⟨p, _⟩ ∉ L_t. □

Lemma 5 A DeleteMin operation which fails (D() = ⊥), takes effect atomically at one statement.

Proof: The decision point, and also the state-read point, for a DeleteMin operation which fails (D() = ⊥) is when the hidden read sub-operation of the ReadNext sub-operation in line D5 successfully reads the next pointer on the lowest level that equals the tail node. The state of the list (L_t) directly before the passing of the state-read point must have been L_t = ∅. □

Definition 6 We define the relation ⇒ as the total order and the relation ⇒_d as the direct total order between all operations in the concurrent execution. In the following formulas, E_1 ⟹ E_2 means that if E_1 holds then E_2 holds as well, and ⊕ stands for exclusive or (i.e. a ⊕ b means (a ∨ b) ∧ ¬(a ∧ b)):

Op_1 →_d Op_2, ∄Op_3. Op_1 ⇒_d Op_3, ∄Op_4. Op_4 ⇒_d Op_2 ⟹ Op_1 ⇒_d Op_2    (5)

Op_1 ∥ Op_2 ⟹ Op_1 ⇒_d Op_2 ⊕ Op_2 ⇒_d Op_1    (6)

Op_1 ⇒_d Op_2 ⟹ Op_1 ⇒ Op_2    (7)

Op_1 ⇒ Op_2, Op_2 ⇒ Op_3 ⟹ Op_1 ⇒ Op_3    (8)

Lemma 6 The operations that are directly totally ordered using formula 5, form an equivalent valid sequential execution.


Proof: If the operations are assigned their direct total order (Op_1 ⇒_d Op_2) by formula 5, then the decision, state-read and state-change points of Op_1 are executed before the respective points of Op_2. In this case the operation semantics behave the same as in the sequential case, and therefore all possible executions will then be equivalent to one of the possible sequential executions. □

Lemma 7 The operations that are directly totally ordered using formula 6 can be ordered uniquely and consistently, and form an equivalent valid sequential execution.

Proof: Assume we order the overlapping operations according to their decision points. As the state before as well as after the decision points is identical to the corresponding state defined in the semantics of the respective sequential operations in formulas 1 to 4, we can view the operations as occurring at the decision point. As the decision points consist of atomic operations and are therefore ordered in time, no decision point can occur at the very same time as any other decision point, therefore giving a unique and consistent ordering of the overlapping operations. □

Lemma 8 With respect to the retries caused by synchronization, one operation will always make progress regardless of the actions of the other concurrent operations.

Proof: We now examine the possible execution paths of our implementation. There are several potentially unbounded loops that can delay the termination of the operations; we call these loops retry-loops. If we omit the conditions that are due to the operation semantics (i.e. searching for the correct position etc.), the retry-loops take place when sub-operations detect that a shared variable has changed value. This is detected either by a subsequent read sub-operation or a failed CAS. These shared variables are only changed concurrently by other CAS sub-operations. According to the definition of CAS, for any number of concurrent CAS sub-operations, exactly one will succeed. This means that for any subsequent retry, there must be one CAS that succeeded. As this succeeding CAS will cause its retry-loop to exit, and our implementation does not contain any cyclic dependencies between retry-loops that exit with CAS, the corresponding Insert or DeleteMin operation will progress. Consequently, independent of any number of concurrent operations, one operation will always progress. □

Theorem 1 The algorithm implements a lock-free and linearizable priority queue.

Proof: Following from Lemmas 6 and 7 and using the direct total order, we can create an identical (with the same semantics) sequential execution that preserves the partial order of the operations in a concurrent execution. Following from Definition 3, the implementation is therefore linearizable. As the semantics of the operations are basically the same as in the Skiplist [11], we could use the corresponding proof of termination. This together with Lemma 8, and the fact that the state is only changed at one atomic statement (Lemmas 1, 2, 3, 4, 5), gives that our implementation is lock-free. □

5 Experiments

In our experiments each concurrent thread performs 10000 sequential operations, whereof the first 100 or 1000 operations are Insert operations, and the remaining operations are randomly chosen with a distribution of 50% Insert operations versus 50% DeleteMin operations. The key values of the inserted nodes are randomly chosen between 0 and 1000000 · n, where n is the number of threads. Each experiment is repeated 50 times, and an average execution time for each experiment is estimated. Exactly the same sequential operations are performed for all different implementations compared. Besides our implementation, we also performed the same experiments with two lock-based implementations: 1) the implementation using multiple locks and Skiplists by Lotan et al. [9], which was recently claimed to be one of the most efficient concurrent priority queues existing, and 2) the heap-based implementation using multiple locks by Hunt et al. [7]. All lock-based implementations are based on simple spin-locks using the TAS atomic primitive. A clean-cache operation is performed just before each sub-experiment. All implementations are written in C and compiled with the highest optimization level, except for the atomic primitives, which are written in assembler.

The experiments were performed using different numbers of threads, varying from 1 to 30. To get a highly pre-emptive environment, we performed our experiments on a Compaq dual-processor Pentium II 450 MHz PC running Linux. A set of experiments was also performed on a Sun Solaris system with 4 processors. In order to evaluate our algorithm with full concurrency we also used a SGI Origin 2000 195 MHz system running Irix with 64 processors. The results from these experiments are shown in Figure 10 together with a close-up view of the Sun experiment. The average execution time is drawn as a function of the number of threads.

From the results we can conclude that all of the implementations scale similarly with respect to the average size of the queue. The implementation by Lotan and Shavit [9] scales linearly with respect to increasing number of threads when having full concurrency, although when exposed to pre-emption its performance decreases very rapidly; already with 4 threads the performance decreased by over 20


[Figure 10 consists of eight plots of average execution time (ms) as a function of the number of threads, comparing the LOCK-FREE, LOCK-BASED LOTAN-SHAVIT and LOCK-BASED HUNT ET AL implementations for priority queues with 100 and 1000 nodes, on Linux PII (2 processors), Sun Solaris (4 processors, including close-up views) and SGI Mips (64 processors).]

Figure 10. Experiments with priority queues and high contention, initialized with 100 and 1000 nodes, respectively.

Page 12: Fast and Lock-Free Concurrent Priority Queues for · 2003-02-21 · concurrent priority queue and assume the availability of a global high-accuracy clock. They apply a lock on each

times. We must point out here that the implementation by Lotan and Shavit is designed for a large (i.e. 256) number of processors. The implementation by Hunt et al. [7] shows better but similar behavior. Because of this behavior we decided to run the experiments for these two implementations only up to a certain number of threads, to avoid getting timeouts. Our lock-free implementation scales best compared to all other involved implementations, having the best performance already with 3 threads, independently of whether the system is fully concurrent or involves pre-emptions.

6 Extended Algorithm

When we have concurrent Insert and DeleteMin operations we might want certain real-time properties in the semantics of the DeleteMin operation, as expressed in [9]. The DeleteMin operation should only return items that have been inserted by an Insert operation that finished before the DeleteMin operation started. To ensure this we add timestamps to each node. When the node is fully inserted, its timestamp is set to the current time. Whenever the DeleteMin operation is invoked, it first checks the current time, and then discards all nodes that have a timestamp that is after this time. In the code of the implementation (see Figures 6, 7, 8 and 9), the additional statements that involve timestamps are marked within the "⟨" and "⟩" symbols. The function getNextTimeStamp, see Figure 14, creates a new timestamp. The function compareTimeStamp, see Figure 14, compares if the first timestamp is less than, equal to or higher than the second one, and returns the values -1, 0 or 1, respectively.

As we are only using the timestamps for relative comparisons, we do not need real absolute time, only that the timestamps are monotonically increasing. Therefore we can implement the time functionality with a shared counter; the synchronization of the counter is handled using CAS. However, the shared counter usually has a limited size (i.e. 32 bits) and will eventually overflow. Therefore the values of the timestamps have to be recycled. We will do this by exploiting information that is available in real-time systems, with a similar approach as in [15].

We assume that we have n periodic tasks in the system, indexed τ_1 ... τ_n. For each task τ_i we will use the standard notations T_i, C_i, R_i and D_i to denote the period (i.e. minimum inter-arrival period for sporadic tasks), worst-case execution time, worst-case response time and deadline, respectively. The deadline of a task is less than or equal to its period.

For a system to be safe, no task should miss its deadlines, i.e. ∀i : R_i ≤ D_i.

For a system scheduled with fixed priority, the response time for a task in the initial system can be calculated using the standard response time analysis techniques [1]. If we with B_i denote the blocking time (the time the task can be

[Figure 11 illustrates a task with period T_i and response time R_i incrementing the highest known timestamp value by 1 once per invocation, and the resulting life-time LT_v of the timestamp with value v.]

Figure 11. Maximum timestamp increase estimation, worst-case scenario.

delayed by lower priority tasks) and with hp(i) denote the set of tasks with higher priority than task τ_i, the response time R_i for task τ_i can be formulated as:

R_i = C_i + B_i + Σ_{j ∈ hp(i)} ⌈R_i / T_j⌉ · C_j    (9)

The summand in the above formula gives the time that task τ_i may be delayed by higher priority tasks. For systems scheduled with dynamic priorities, there are other ways to calculate the response times [1].

Now we examine some properties of the timestamps that can exist in the system. Assume that all tasks call either the Insert or DeleteMin operation only once per iteration. As each call to getNextTimeStamp will introduce a new timestamp in the system, we can assume that every task invocation will introduce one new timestamp. This new timestamp has a value that is the previously highest known value plus one. We assume that the tasks always execute within their response times R with arbitrarily many interruptions, and that the execution time C is comparably small. This means that the increment of the highest timestamp, respectively the write to a node with the current timestamp, can occur anytime within the interval of the response time. The maximum time for an Insert operation to finish is the same as the response time R_i for its task τ_i. The minimum time between two index increments is when the first increment is executed at the end of the first interval and the next increment is executed at the very beginning of the second interval, i.e. T_i − R_i. The minimum time between the subsequent increments will then be the period T_i. If we denote with LT_v the maximum life-time that the timestamp with value v exists in the system, the worst-case scenario with respect to growth of timestamps is shown in Figure 11.

The formula for estimating the maximum difference in value between two existing timestamps in any execution becomes as follows:


MaxTag = Σ_{i=0}^{n} ( ⌈ max_{v ∈ {0..∞}} LT_v / T_i ⌉ + 1 )    (10)

Now we have to bound the value of max_{v ∈ {0..∞}} LT_v. When comparing timestamps, the absolute values are not important, only the relative values. Our method is to continuously traverse the nodes and replace outdated timestamps with a newer timestamp that has the same comparison result. We traverse and check the nodes at the rate of one step to the right for every invocation of an Insert or DeleteMin operation. By outdated timestamps we mean timestamps that are older (i.e. lower) than any timestamp value that is in use by any running DeleteMin operation. We denote with AncientVal the maximum difference that we allow between the highest known timestamp value and the timestamp value of a node, before we call this timestamp outdated.

AncientVal = Σ_{i=0}^{n} ⌈ (max_j R_j) / T_i ⌉    (11)

If we denote with t_ancient the maximum time it takes for a timestamp value to become outdated, counted from its first occurrence in the system, we get the following relation:

AncientVal = Σ_{i=0}^{n} ⌊ t_ancient / T_i ⌋ > Σ_{i=0}^{n} ( t_ancient / T_i ) − n    (12)

t_ancient < ( AncientVal + n ) / ( Σ_{i=0}^{n} 1 / T_i )    (13)

Now we denote with t_traverse the maximum time it takes to traverse through the whole list from one position and get back, assuming the list has the maximum size N.

N = Σ_{i=0}^{n} ⌊ t_traverse / T_i ⌋ > Σ_{i=0}^{n} ( t_traverse / T_i ) − n    (14)

t_traverse < ( N + n ) / ( Σ_{i=0}^{n} 1 / T_i )    (15)

The worst-case scenario is that directly after the timestamp of one node gets traversed, it becomes outdated. Therefore we get:

max_{v ∈ {0..∞}} LT_v = t_ancient + t_traverse    (16)

Putting it all together we get:

[Figure 12 illustrates timestamp values 0, 1, 2, ..., X being reused cyclically.]

Figure 12. Timestamp value recycling.

[Figure 13 shows timestamp values v_1, v_2, v_3, v_4 on the cyclic axis 0..X, with distances d_1, d_2, d_3 and observation times t_1, t_2.]

Figure 13. Deciding the relative order between reused timestamps.

MaxTag < Σ_{i=0}^{n} ( ⌈ ( N + 2n + Σ_{k=0}^{n} ⌈ (max_j R_j) / T_k ⌉ ) / ( T_i · Σ_{l=0}^{n} 1 / T_l ) ⌉ + 1 )    (17)

The above equation gives us a bound on the length of the

"window" of active timestamps for any task in any possible execution. In the unbounded construction the tasks, by producing larger timestamps every time, slide this window on the [0, ..., ∞] axis, always to the right. The approach now is, instead of sliding this window on the set [0, ..., ∞] from left to right, to cyclically slide it on a set [0, ..., X] of consecutive natural numbers, see Figure 12. At the same time we have to give the tasks a way to identify the order of the different timestamps, because the order of the physical numbers is not enough since we are re-using timestamps. The idea is to use the bound that we have calculated for the span of different active timestamps. Let us then take a task that has observed v_i as the lowest timestamp at some invocation τ. When this task runs again as τ′, it can conclude that the active timestamps are going to be between v_i and (v_i + MaxTag) mod X. On the other hand we should make sure that in this interval [v_i, ..., (v_i + MaxTag) mod X] there are no old timestamps. By looking closer at equation 10 we can conclude that all the other tasks have written values to their registers with timestamps that are at most MaxTag less than v_i at the time that τ wrote the value v_i. Consequently, if we use an interval that has double the


// Global variables
timeCurrent: integer
checked: pointer to Node
// Local variables
time, newtime, safeTime: integer
current, node, next: pointer to Node

function compareTimeStamp(time1: integer, time2: integer): integer
C1   if time1 = time2 then return 0;
C2   if time2 = MAX_TIME then return -1;
C3   if (time1 > time2 and (time1 - time2) ≤ MAX_TAG) or
        (time1 < time2 and (time1 - time2 + MAX_TIME) ≤ MAX_TAG) then return 1;
C4   else return -1;

function getNextTimeStamp(): integer
G1   repeat
G2     time := timeCurrent;
G3     if (time + 1) ≠ MAX_TIME then newtime := time + 1;
G4     else newtime := 0;
G5   until CAS(&timeCurrent, time, newtime);
G6   return newtime;

procedure TraverseTimeStamps()
T1   safeTime := timeCurrent;
T2   if safeTime ≥ ANCIENT_VAL then
T3     safeTime := safeTime - ANCIENT_VAL;
T4   else safeTime := safeTime + MAX_TIME - ANCIENT_VAL;
T5   while true do
T6     node := READ_NODE(checked);
T7     current := node;
T8     next := ReadNext(&node, 0);
T9     RELEASE_NODE(node);
T10    if compareTimeStamp(safeTime, next.timeInsert) > 0 then
T11      next.timeInsert := safeTime;
T12    if CAS(&checked, current, next) then
T13      RELEASE_NODE(current);
T14      break;
T15    RELEASE_NODE(next);

Figure 14. Creation, comparison, traversing and updating of bounded timestamps.

size of MaxTag, τ′ can conclude that old timestamps are all in the interval [(v_i − MaxTag) mod X, ..., v_i].

Therefore we can use a timestamp field with double the size of the maximum possible value of the timestamp.

TagFieldSize = MaxTag · 2
TagFieldBits = ⌈log_2 TagFieldSize⌉

In this way τ′ will be able to identify that v_1, v_2, v_3, v_4 (see Figure 13) are all new values if d_2 + d_3 < MaxTag,

and can also conclude that:

v_3 < v_4 < v_1 < v_2

The mechanism that will generate new timestamps in a cyclic order and also compare timestamps is presented in Figure 14, together with the code for traversing the nodes. Note that the extra properties of the priority queue that are achieved by using timestamps are not complete with respect to the Insert operations that finish with an update. These update operations will behave the same as in the standard version of the implementation.

Besides real-time systems, the presented technique can be useful in non real-time systems as well. For example, consider a system of n = 10 threads, where the minimum time between two invocations would be T = 10 ns, and the maximum response time R = 1000000000 ns (i.e. after 1 s we would expect the thread to have crashed). Assuming a maximum size of the list N = 10000, we will have a maximum timestamp difference MaxTag < 1000010030, thus needing 31 bits. Given that most systems have 32-bit integers and that many modern systems handle 64 bits as well, this implies that the technique is practical also for non real-time systems.

7 Conclusions

We have presented a lock-free algorithmic implementation of a concurrent priority queue. The implementation is based on the sequential Skiplist data structure and builds on top of it to support concurrency and lock-freedom in an efficient and practical way. Compared to the previous attempts to use Skiplists for building concurrent priority queues, our algorithm is lock-free and avoids the performance penalties that come with the use of locks. Compared to the previous lock-free/wait-free concurrent priority queue algorithms, our algorithm inherits and carefully retains the basic design characteristic that makes Skiplists practical: simplicity. Previous lock-free/wait-free algorithms did not perform well because of their complexity; furthermore, they were often based on atomic primitives that are not available in today's systems.


We compared our algorithm with some of the most efficient implementations of priority queues known. Experiments show that our implementation scales well, and with 3 threads or more our implementation outperforms the corresponding lock-based implementations in all cases, on fully concurrent systems as well as with pre-emption.

We believe that our implementation is of high practical interest for multi-threaded applications. We are currently incorporating it into the NOBLE [16] library.

References

[1] N. C. Audsley, A. Burns, R. I. Davis, K. W. Tindell, A. J. Wellings. Fixed Priority Pre-emptive Scheduling: An Historical Perspective. Real-Time Systems, Vol. 8, Num. 2/3, pp. 129-154, 1995.

[2] G. Barnes. Wait-Free Algorithms for Heaps. Technical Report, Computer Science and Engineering, University of Washington, Feb. 1992.

[3] M. Grammatikakis, S. Liesche. Priority queues and sorting for parallel simulation. IEEE Transactions on Software Engineering, SE-26 (5), pp. 401-422, 2000.

[4] T. L. Harris. A Pragmatic Implementation of Non-Blocking Linked Lists. Proceedings of the 15th International Symposium on Distributed Computing, Oct. 2001.

[5] M. Herlihy, J. Wing. Linearizability: a Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems, 12(3):463-492, 1990.

[6] M. Herlihy. Wait-Free Synchronization. ACM TOPLAS, Vol. 11, No. 1, pp. 124-149, Jan. 1991.

[7] G. Hunt, M. Michael, S. Parthasarathy, M. Scott. An Efficient Algorithm for Concurrent Priority Queue Heaps. Information Processing Letters, 60(3), pp. 151-157, Nov. 1996.

[8] A. Israeli, L. Rappaport. Efficient wait-free implementation of a concurrent priority queue. 7th Intl Workshop on Distributed Algorithms '93, Lecture Notes in Computer Science 725, Springer Verlag, pp. 1-17, Sept. 1993.

[9] I. Lotan, N. Shavit. Skiplist-Based Concurrent Priority Queues. International Parallel and Distributed Processing Symposium, 2000.

[10] M. Michael, M. Scott. Correction of a Memory Management Method for Lock-Free Data Structures. Computer Science Dept., University of Rochester, 1995.

[11] W. Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM, vol. 33, no. 6, pp. 668-676, June 1990.

[12] R. Rajkumar. Real-Time Synchronization Protocols for Shared Memory Multiprocessors. 10th International Conference on Distributed Computing Systems, pp. 116-123, 1990.

[13] L. Sha, R. Rajkumar, J. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, Vol. 39, No. 9, pp. 1175-1185, Sep. 1990.

[14] A. Silberschatz, P. Galvin. Operating System Concepts. Addison Wesley, 1994.

[15] H. Sundell, P. Tsigas. Space Efficient Wait-Free Buffer Sharing in Multiprocessor Real-Time Systems Based on Timing Information. Proceedings of the 7th International Conference on Real-Time Computing Systems and Applications (RTCSA 2000), pp. 433-440, IEEE Press, 2000.

[16] H. Sundell, P. Tsigas. NOBLE: A Non-Blocking Inter-Process Communication Library. Proceedings of the 6th Workshop on Languages, Compilers and Run-time Systems for Scalable Computers (LCR'02), Lecture Notes in Computer Science, Springer Verlag, 2002.

[17] P. Tsigas, Y. Zhang. Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors. Proceedings of the International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2001), pp. 320-321, ACM Press, 2001.

[18] P. Tsigas, Y. Zhang. Integrating Non-blocking Synchronisation in Parallel Applications: Performance Advantages and Methodologies. Proceedings of the 3rd ACM Workshop on Software and Performance (WOSP '02), ACM Press, 2002.

[19] J. D. Valois. Lock-Free Data Structures. PhD Thesis, Rensselaer Polytechnic Institute, Troy, New York, 1995.