Theoretical Computer Science 220 (1999) 211-245 www.elsevier.com/locate/tcs
Wait-free implementations in message-passing systems
Soma Chaudhuri a, Maurice Herlihy b, Mark R. Tuttle c,*
a Department of Computer Science, Iowa State University, Ames, IA 50011, USA
b Computer Science Department, Brown University, Providence, RI 02912, USA
c Compaq Cambridge Research Lab, One Kendall Square, Bldg 700, Cambridge, MA 02139, USA
Abstract
We study the round complexity of problems in a synchronous, message-passing system with crash failures. We show that if processors start in order-equivalent states, then a logarithmic number of rounds is both necessary and sufficient for them to reach order-inequivalent states. These upper and lower bounds are significant because they establish a complexity threshold below which no nontrivial problem can be solved, but at which certain nontrivial problems do have solutions. This logarithmic lower bound implies a matching lower bound for a variety of decision tasks and concurrent object implementations. In particular, we examine two nontrivial problems for which this lower bound is tight: the strong renaming task, and a wait-free increment register implementation. For each problem, we present a nontrivial algorithm that halts in O(log c) rounds, where c is the number of participating processors. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Distributed algorithms; Data structures; Wait-free algorithms; Message-passing model; Increment register; Strong renaming
1. Introduction
In a synchronous, message-passing system with crash failures, a computation pro-
ceeds in a sequence of rounds: each processor sends messages to the others, receives
all the messages sent to it, and performs an internal computation. At any point in the
computation, a processor may crash: it stops and sends no more messages. This model
is one of the most thoroughly-studied models in the distributed computing literature,
partly because it is so easy to describe, and partly because the behaviors exhibited by
this model are a subset of the behaviors exhibited by almost any other model of com-
putation, which means that lower bounds in this model usually extend to other models.
* Corresponding author. E-mail: [email protected].
We investigate the time needed to solve nontrivial problems in this model. Loosely
speaking, a nontrivial problem is one in which at least two processors must perform
different actions, which is a kind of "symmetry breaking." In the well-known consensus
problem [6, 17, 19], each processor starts with an input value, and all nonfaulty proces-
sors must halt after agreeing on the input value of one of the processors. Consensus
breaks symmetry by requiring one processor to choose its input value, and the rest to
discard theirs. Consensus requires a linear number of rounds to solve [5], but it can
be used to solve almost any other nontrivial problem [10, 15, 16, 20], so solving these
nontrivial problems never takes longer than consensus. We want to know exactly how
quickly these nontrivial problems can be solved.
Solving a nontrivial problem requires causing two processors to perform different
actions. Speaking informally, if processors start in equivalent states and follow the
same deterministic protocol, then as long as they remain in equivalent states, they will
continue to perform the same actions. We can therefore equate the number of rounds
needed to reach inequivalent states with a threshold below which no nontrivial problem
has a solution. Surprisingly, we can show that there do exist nontrivial problems that
become solvable at exactly this threshold.
What does it mean for two processors to be in equivalent states? Each processor
begins execution with a unique identifier (called its id) taken from a totally-ordered
set. We assume that processors can test ids for equality and order: given ids p and
q, a processor can test whether p = q, p ≤ q, and p ≥ q. We say that two processor
states are order-equivalent [7] if they cannot be distinguished by the order of the ids
appearing within them. More specifically, the two states must have the same structure,
and the order of any two ids appearing in one state must be the same as the order of
the ids appearing in the corresponding positions of the other state. For example, if each
processor’s initial state contains its id and nothing else, then all initial states are trivially
order-equivalent because there are no pairs of ids to compare within an initial state.
A protocol is comparison-based if the comparison operations are the only operations
applied to process ids. Clearly, comparison-based protocols cannot distinguish between
order-equivalent states.
We restrict our attention to comparison-based protocols. This restriction to protocols
that can only compare processor ids for order is reasonable in systems where there
are many more processor ids than there are processors, or in systems where there
is a very large pool of potential participants, of which only an unpredictable subset
actually participates in the protocol. In such systems, there may be no effective way
to enumerate all possible processor ids, and no way to tell whether there exists a
processor with an id between two other ids. Most significant, since there are so many
possible ids, it is not feasible to use processor ids as indices into data structures as is
frequently done in implementations of objects like atomic registers (see [21]).
This paper’s first principal contribution is a proof that any comparison-based pro-
tocol for c processors has an execution in which all processors remain in order-
equivalent states for Ω(log c) rounds. As a result, any problem that requires breaking
order-equivalence also requires Ω(log c) rounds. Although the proof is elementary, this
logarithmic lower bound is “universal” for this model in the sense that it is diffi-
cult to conceive of a nontrivial problem that can be solved without breaking order-
equivalence. This bound is tight: we give a protocol that forces any two processors
into order-inequivalent states within O(log c) rounds.
This paper’s second principal contribution is to show that there exist nontrivial prob-
lems that do have solutions with O(log c) round complexity. The existence of these
problems implies that the synchronous message-passing model undergoes a kind of
“phase transition” at a logarithmic number of rounds. Below this threshold, it is im-
possible to solve any problem that requires different processors to take different actions.
Beyond this threshold, however, solutions do exist to nontrivial problems.
We consider two classes of problems: decision tasks, and concurrent objects.
A decision task is a problem in which each processor begins execution with a pri-
vate input value, runs for some number of rounds, and then halts with a private output
value satisfying problem-specific constraints. Consensus is an example of a decision
task. By contrast, a concurrent object is a long-lived data structure that can be si-
multaneously accessed by multiple processors. A concurrent object implementation is
wait-free if any nonfaulty processor can complete any operation on the object in a finite
number of steps, even if other processors crash at arbitrary points in their protocols.
A shared FIFO queue is an example of a concurrent object.
Strong renaming is a decision task in which processors start with input bits indicating
whether to participate in the protocol, and participating processors must choose unique
names in the range 1..c, where c is the number of participating processors. A weaker
form of this task has been extensively studied in asynchronous models [1, 2, 12]. Any
protocol for strong renaming can be used to break order-equivalence, so a logarithmic
lower bound is immediate. This bound is tight: we give a nontrivial protocol that solves
strong renaming in O(log c) rounds.
Our lower bound on order-equivalence also translates into a lower bound on a variety
of wait-free concurrent object implementations. For example, an increment register is a
concurrent object consisting of an integer-valued register with an increment operation
that atomically increments the register and returns its previous value. Because we
can use an increment register to break order-equivalence, the Ω(log c) lower bound
for order-inequivalence translates directly into an Ω(log c) lower bound on any wait-
free implementation of the increment operation. This bound is also tight: we give a
nontrivial wait-free increment register implementation in which each increment halts in
O(log c) rounds, where c is the number of concurrently executing increment operations.
Our increment register construction is interesting in its own right, since it is based
on our optimal solution to the strong renaming problem. In general, implementing long-
lived objects is inherently more difficult than solving decision tasks. A decision task
is solved once, while operations can be invoked on an object repeatedly. Even worse,
processors solving a decision task start together, while processors invoking operations
on an object can arrive at different and unpredictable times. The major technical diffi-
culty in our register construction is how to guarantee that increment operations starting
at different times do not interfere.
The rest of this paper is organized as follows. In Section 2 we define our model of computation. In Section 3, we define formally what we mean by order-equivalence of processors and present the matching log c lower and upper bounds for the problem. In Section 4, we define the strong renaming problem. We then reduce the problem of eliminating order-equivalence to the strong renaming problem, thus obtaining the log c lower bound for strong renaming. We then show that this bound is actually tight with an efficient strong renaming algorithm. In Section 5 we define concurrent objects and their implementation. We then reduce the problem of eliminating order-equivalence to the problem of implementing several concurrent objects, thus obtaining the same log c lower bound on the complexity of these concurrent objects. Finally we give our optimal wait-free implementation of an increment register, based on the strong renaming algorithm. We close with a discussion of some open problems in Section 6.
2. Model
Our model of computation is a synchronous, message-passing model with crash failures. It is similar to the models used in a number of other papers [8, 9, 18].
A system consists of n unreliable processors p1, ..., pn and an external environment
e. We use n to denote the total number of processors in the system, and c to denote the number of these processors that actually participate in a protocol like strong renaming or access a concurrent object like an increment register. Each processor has a unique processor id taken from some totally-ordered set of processor ids. Each processor can send a message to any other processor and to the environment. The environment can send to any processor a message taken from some set of messages (including ⊥ to denote "no input"). Each processor has access to a global clock that starts at 0 and advances in increments of 1. Computation proceeds in a sequence of rounds, with round k lasting from time k - 1 to time k on the global clock. In each round, each processor receives some message (possibly ⊥) from the environment, then it sends messages to other processors (including itself) and the environment, then it receives the messages sent to it by processors in that round, and then it changes state and enters the next round. Communication is reliable: a message sent in a round is guaranteed to be delivered in that round. Processors are not reliable: a processor can crash or fail at any time by just halting in the middle of a round after sending some subset of its messages for that round.
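The round structure is easy to render directly in code. The following Python sketch is our illustration (none of its names come from the paper): it executes one round, with a crashing processor delivering its broadcast to only a subset of the others and remaining silent thereafter.

def run_round(states, alive, crashing, step):
    # states: id -> local state; alive: ids that have not yet crashed;
    # crashing: id -> set of recipients reached before the crash this round;
    # step: (old state, messages received) -> new state.
    inbox = {p: [] for p in states}
    for p in alive:
        recipients = crashing.get(p, states.keys())    # partial send if crashing
        for q in recipients:
            inbox[q].append((p, states[p]))            # full-information broadcast
    alive = alive - set(crashing)                      # crashed processors stay silent
    for p in alive:
        states[p] = step(states[p], sorted(inbox[p]))  # messages sorted by sender id
    return states, alive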
A global state is a tuple (s1, ..., sn, se) of local states, one local state si for each processor pi and one local state se for the environment. The local state for processor pi includes the time on the global clock, its processor id, the history of messages it has received from the environment, and the history of messages it has received from other processors. The local state for the environment may contain similar information, but it certainly contains the sequence of messages it has sent to processors in the system, the pattern of failures exhibited by processors, and any other relevant information that
cannot be deduced from processors' local states. An execution e is an infinite sequence
g0 g1 ... of global states, where gi is the global state at time i. Processors follow a deterministic protocol that determines what messages to send
to processors and the environment during a round as a function of the processor's local state.
A processor follows its protocol in every round, except that a processor may crash
or fail in the middle of a round. If pi fails in round k, then it sends all messages in
rounds j < k as required by the protocol, it sends a proper subset of its messages in
round k, and it sends no messages in rounds j > k. A processor is considered faulty in an execution if it fails in some round of that execution, and nonfaulty otherwise.
Without loss of generality, we can assume that processors follow a full-information
protocol in which processors broadcast their entire local state to every processor (in-
cluding itself) in every round, and apply a message function to their local states to
determine what message to send to the environment in every round. Given the state of
a processor in the full-information protocol, this state contains enough information for
us to compute the processor’s state at the corresponding point of any other protocol
[7, 18]. Also without loss of generality, since processors broadcast their entire local state in
every round, we can assume that the local state of a processor is a LISP S-expression
defined as follows. Let us fix some totally-ordered set P of processor ids, some set
S of initial states, some set M of messages from the environment (including ⊥), and
some set N of messages from the processors to the environment. The initial state for
a processor with processor id p starting in initial state s with initial input m from the
environment is written (p m s). Later states are written (p m m0 m1 ... mk), where p
is the processor id, m is the input received from the environment at the start of the
current round, and m0 ... mk is the set of messages received from processors during the
last round (including p itself) sorted by processor id. The messages mi are themselves
S-expressions representing the states of the sending processors at the start of the round,
including p’s local state. Notice that while processors send S-expressions to each other,
they still send messages from a fixed set to the environment: the message function
maps a processor's local state to a message in N that the processor sends to the
environment in that round. Again, we can assume processor states have such a special
representation since from such a description of a processor’s state we can reconstruct
the value of every variable v appearing in the actual state [7, 18]. All of the protocols
in this paper are stated using an Algol-like notation for the sake of convenience, but
their translation into this model is straightforward.
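In Python, for example, these S-expressions are naturally rendered as nested tuples; the sketch below is our illustration of how a full-information state is built, not code from the paper.

def initial_state(pid, m, s):
    return (pid, m, s)                         # the S-expression (p m s)

def next_state(pid, m, received):
    # received: the senders' states from the last round; sort by processor id
    return (pid, m) + tuple(sorted(received))  # the S-expression (p m m0 m1 ... mk)

s_p = initial_state('p', 1, 's')
s_q = initial_state('q', 1, 's')
print(next_state('p', None, [s_p, s_q]))
# ('p', None, ('p', 1, 's'), ('q', 1, 's'))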
For any given protocol, an execution of the protocol is completely determined by
the processor ids, the initial states, the inputs received from the environment, and the
pattern of processor failures during the execution. We define an environment graph to be an infinite graph that records this information [18]. We define an environment
graph to be a grid with n vertices in the vertical axis labeled with processor names
p1, ..., pn (denoting physical processors and not their processor ids) and with a
countable number of vertices in the horizontal axis labeled with times 0, 1, 2, ... . The
node representing processor p at time i is denoted by the pair (p, i). Given any pair
of processors p and q and any round i, there is an edge from (p, i - 1) to (q, i) if p
successfully sends a message to q in round i, and the edge is missing otherwise. Each
node (p, i) is labeled with the input received from the environment by processor p at
time i. In addition, each node (p,O) is labeled with p’s processor id and initial state.
Since processors fail by crashing, an environment graph must satisfy the following
property: if k is the least integer such that an edge is missing from (p, k), then no
edges are missing from (p,j) for all j <k and all edges are missing from (p,j) for
all j> k. An environment graph must also satisfy the property that every processor
starts with a unique processor id. Given a set S of processor initial states, a set I of
processor ids, and a set M of environment messages, we define G(S, I, M) to be the
set of all such environment graphs labeled with initial states in S, processor ids in I,
and messages from the environment in M.
We define the local state of the environment at time k to be the finite prefix of an
environment graph describing the processor inputs and failures at times 0 through k, and we require that the local states of the environment in an execution be prefixes of the
same environment graph. We define an environment to be a set of environment graphs.
We will typically consider environments of the form G(S, I, M), or simple restrictions
of such environments. For example, in the context of decision problems like consensus,
we might consider an environment in which each processor receives a message from
the environment (the processor’s input bit) at time 0 and at no later time. Given a
protocol P and an environment E, we define P(E) to be the set of all executions of
P in the context of the environment graphs in E.
3. Order-equivalence
In this section, we show that a logarithmic number of rounds is a necessary and suffi-
cient amount of time for comparison-based protocols to reach order-inequivalent states.
We begin with the definitions of comparison-based protocols and order-equivalent
states. Both definitions are based on the assumption that the set of processor ids is
a totally-ordered set, meaning that it is possible to test processor ids for relative order,
but that it is not possible to examine the structure of ids in any more detail.
Informally, two states are order-equivalent if they cannot be distinguished by com-
paring the processor ids that appear within them. Remember that a processor state is
represented by an S-expression in our model. A processor’s initial state is written as
(p m s), where p is a processor id, m is a message from the environment, and s is an
initial state. Later states are written (p m m0 m1 ... mk), where p is a processor id, m is a message from the environment, and m0 ... mk is some set of messages (S-expressions rep-
resenting local states) received from processors during the last round and all sorted by
processor id. Loosely speaking, two processor states s and s’ are equivalent if (i) they
are structurally equivalent S-expressions, (ii) initial states from corresponding positions
in s and s’ are identical, (iii) messages from the environment from corresponding po-
sitions in s and s' are identical, and (iv) if two ids p and q taken from s satisfy p < q,
then the two ids p' and q' taken from corresponding positions of s' satisfy p' < q'. Intuitively, a processor cannot distinguish equivalent states simply by comparing pro-
cessors ids: if a processor tries to learn something about its local state by sequentially
testing pairs of ids appearing within that state for relative order, then this sequence of
tests will yield the same results when applied to any other equivalent state.
Formally, a one-to-one function φ from one totally ordered set to another is order-preserving if p < q implies φ(p) < φ(q). Any such φ can be extended to S-expressions
representing processor states by defining φ(p m s) = (φ(p) m s) and φ(p m m0 m1 ... mk) = (φ(p) m φ(m0) φ(m1) ... φ(mk)).
Two processor states are order-equivalent if there exists an order-preserving func-
tion φ mapping one to the other, and order-inequivalent otherwise. A protocol is a
comparison-based protocol if the message function choosing the message in N that
a processor is to send to the environment maps order-equivalent states to the same
message.
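Order-equivalence can be tested mechanically: list the ids of each state in left-to-right order, check that the two states agree everywhere else, and check that the induced correspondence between ids is a well-defined, order-preserving map. The following Python sketch is ours, and assumes the nested-tuple encoding of states sketched in Section 2.

def ids_and_shape(state):
    if isinstance(state, tuple):              # a state (pid, m, children...)
        pid, m, rest = state[0], state[1], state[2:]
        parts = [ids_and_shape(c) for c in rest]
        ids = [pid] + [i for ids_, _ in parts for i in ids_]
        return ids, (m, tuple(sh for _, sh in parts))
    return [], state                          # initial-state payload, compared verbatim

def order_equivalent(s, t):
    ids_s, shape_s = ids_and_shape(s)
    ids_t, shape_t = ids_and_shape(t)
    if shape_s != shape_t or len(ids_s) != len(ids_t):
        return False
    phi = {}
    for p, q in zip(ids_s, ids_t):
        if phi.setdefault(p, q) != q:         # phi must be a well-defined function
            return False
    pairs = sorted(phi.items())
    return all(a[1] < b[1] for a, b in zip(pairs, pairs[1:]))  # order-preserving

print(order_equivalent(('p', 1, 's'), ('q', 1, 's')))   # True: initial states
print(order_equivalent(('p', 1, 's'), ('q', 2, 's')))   # False: inputs differ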
3.1. Lower bound
First let us prove that every comparison-based protocol has an execution in which
processors remain in order-equivalent states for a logarithmic number of rounds.
Given a protocol P, the larger the environment E (the more environment graphs
E contains), the larger the set P(E) of executions of P in this environment, and the
more likely the set P(E) contains a long execution. To make our lower bound as
widely applicable as possible, we now define the smallest environment F for which
we can prove the existence of the long execution. A processor is active in round r in
an environment graph (or an execution) if it sends at least one message in round r,
and a processor is active if it is active in any round (and, in particular, if it is active
in round 1). Given a set S of processor initial states, a set I of processor ids, and a
set M of environment messages, define F(S, I, M) to be the subset of all environment
graphs in G(S, I, M) satisfying the condition that (i) each active processor starts with
the same initial state from S and the same environment message from M, and (ii) each
active processor receives ⊥ (representing no input) from the environment at every time
after time 0.
In such an environment, the long executions of a comparison-based protocol are the
ones in which the processors fail according to a sandwich failure pattern in every
round. This failure pattern is defined as follows. Suppose processors with ids q_1, ..., q_u have not failed at the start of a round, and suppose u = 3v+1 for some integer v.
(The sandwich failure pattern can always fail the one or two processors with low-
est ids at the beginning of the round and pretend they do not exist.) The sandwich
failure pattern causes the v processors q_1, ..., q_v with the lowest ids and the v proces-
sors q_{2v+2}, ..., q_{3v+1} with the highest ids to fail in the following way: each processor
q_{v+j} ∈ {q_{v+1}, ..., q_{2v+1}} receives messages only from processors q_j, ..., q_{2v+j}. Notice
that each such processor q_{v+j} sees 2v+1 active processors, and sees its rank in this set
of active processors as v+1. Notice also that the set of active processors after a round of the
sandwich failure pattern is always a consecutive subsequence of processors from the
middle of the sequence of active processors at the beginning of the round.
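Rendered in code, one round of the pattern looks as follows; this Python sketch is our illustration (all names in it are ours), and the assertions check the rank property just described.

def sandwich_round(active):
    # active: sorted list of ids with len(active) == 3*v + 1; returns, for each
    # surviving middle processor, the window of 2v+1 senders it hears from.
    v = (len(active) - 1) // 3
    views = {}
    for j in range(1, v + 2):                       # survivors q_{v+1}, ..., q_{2v+1}
        views[active[v + j - 1]] = active[j - 1 : j + 2 * v]   # q_j, ..., q_{2v+j}
    return views

views = sandwich_round(list(range(1, 14)))          # u = 13 = 3*4 + 1, so v = 4
for q, window in views.items():
    assert len(window) == 9 and window.index(q) == 4    # rank v+1 = 5 of 2v+1 = 9
print(sorted(views))                                # the surviving middle processors 5, ..., 9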
Using this failure pattern, we can prove our lower bound:
Proposition 1. Let P be a comparison-based protocol, and let E be an environment containing an environment of the form F(S, I, M). For every c ≤ n, there is an ex-
ecution in P(E) with c active processors in which the nonfaulty processors remain order-equivalent for Ω(log_3 c) rounds.
Proof. Let G be an environment graph in F(S, I, M) ⊆ E with the sandwich failure
pattern and with c active processors with ids q_1, ..., q_c. Each active processor starts
with the same initial state s ∈ S and the same environment input m ∈ M, and receives
⊥ from the environment at all times after time 0. Notice that the sandwich failure
pattern fails roughly 2/3 of the active processors, leaving roughly 1/3 remaining active
in the next round. A simple analysis shows that if ℓ < log_3 c, then the number 3v+1 of
processors remaining at the end of round ℓ is at least four. We claim that if ℓ < log_3 c,
then after ℓ rounds of the sandwich failure pattern the states s_i and s_j of processors
q_i and q_j are related by an order-preserving function φ_{j-i} defined as follows. For
all integers d, the function φ_d(q_i) = q_{i+d} is defined for 1 ≤ i ≤ c-d when d ≥ 0 and
for -d+1 ≤ i ≤ c when d < 0. We claim that φ_{j-i}(s_i) = s_j. We proceed by induction
on ℓ.
The result is immediate for ℓ = 0 since each q_i's initial state is (q_i m s), and φ_{j-i}(q_i m s) = (q_j m s).
For ℓ > 0, suppose the hypothesis is true for ℓ - 1. Consider the 3v+1 processors that
are active in round ℓ. Since the active processors at the beginning of round ℓ are a
consecutive subsequence of q_1, ..., q_c, suppose they are q_{a+1}, ..., q_b. By the induction
hypothesis for ℓ - 1, the states s_{a+1}, ..., s_b these processors have at the beginning of
the round are related by φ_{j-i}(s_i) = s_j. Notice that at the end of the round, due to the
sandwich failure pattern in that round, the active processors are q_{a+v+1}, ..., q_{b-v}, and
that the state of processor q_{a+v+i} is
(q_{a+v+i} ⊥ s_{a+i} ... s_{a+2v+i})
and the state of processor q_{a+v+j} is
(q_{a+v+j} ⊥ s_{a+j} ... s_{a+2v+j}).
It is easy to see that φ_{j-i} maps the state of q_{a+v+i} to the state of q_{a+v+j}, as described. □
3.2. Order-equivalence elimination algorithm
Now let us prove that this lower bound is tight. There is a simple algorithm
that forces all processors into order-inequivalent states in a logarithmic number of
rounds:
Proposition 2. There exists a protocol that leaves all nonfaulty processors in order-
inequivalent states after O(log c) rounds, where c is the number of active processors
in the execution.
Proof. Here is a comparison-based algorithm that causes nonfaulty processors to choose
distinct sequences of integers after a logarithmic number of rounds. In each round,
each processor broadcasts its id and the sequence of integers constructed so far. Two
processors collide in a round if they broadcast identical sequences in that round. In
round 1, a processor with id p broadcasts (p, ε), and hence all processors collide
initially with the empty sequence ε. Let (p, i_1 ... i_{k-1}) be the message p broadcast at
round k, and let i_k be the number of processors with ids less than p that p hears from
and that collide with p in round k. In round k + 1, p broadcasts (p, i_1 ... i_k). Each
processor halts when it does not hear from any colliding processor. As an example,
suppose that p1 fails in round 1 by sending a message to p3 but not to p4. Then p3
hears from p1, p2, p3, p4 and p4 hears from p2, p3, p4, so each of p3 and p4 counts
exactly two colliding processors with smaller ids in round 1, both broadcast the
sequence 2 in round 2, and both collide again at the end of round 2.
We claim that the size of maximal sets of colliding processors must shrink by
approximately half with each round, yielding an O(log c) running time. Two processors
that broadcast different sequences continue to do so, so the set of processors that collide
with p at round k is a subset of the processors that collided with p at earlier rounds.
Consider a maximal set S of processors that collide in round k; that is, a set of ℓ
processors that do not fail in round k - 1 and broadcast the same sequence i_1 ... i_{k-1}
in round k. Let p be the lowest processor in that set, and let q be the highest. Since
processors in S do not fail in round k - 1, processor q must hear from each of the ℓ
processors in S in round k - 1. Since these processors collide with q in round k, they
must collide with q in round k - 1 as well, so q must count at least ℓ - 1 colliding
processors with lower ids that broadcast the sequence i_1 ... i_{k-2} in round k - 1. It
follows that i_{k-1} ≥ ℓ - 1 for processor q. Since p and q collide at round k, they
broadcast the same value for i_{k-1} in round k, so i_{k-1} ≥ ℓ - 1 for processor p as well.
Therefore, processor p must see at least ℓ - 1 colliding processors with lower ids
broadcasting the sequence i_1 ... i_{k-2} in round k - 1, none of which are in S (since p
is the processor with smallest id in S). Hence at least 2ℓ - 1 processors broadcast the
sequence i_1 ... i_{k-2} in round k - 1 and collide with p and q in round k - 1, implying
that the number of processors colliding with p and q has shrunk from at least 2ℓ - 1
to ℓ in one round, which is a reduction by roughly a factor of 2. □
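To make the mechanics concrete, the following Python sketch (ours, with a hand-coded delivery pattern rather than a general failure model) replays the example from the proof: p1 crashes in round 1, and we additionally assume its partial broadcast reaches p2 and p3 but not p4.

def run(deliveries, seqs):
    # deliveries[k][p] = the senders whose round-(k+1) broadcast p receives;
    # processors absent from a round's dictionary have crashed and drop out.
    for heard in deliveries:
        seqs = {p: seqs[p] +
                   (sum(1 for q in heard[p] if q < p and seqs[q] == seqs[p]),)
                for p in heard}
    return seqs

start = {p: () for p in (1, 2, 3, 4)}       # everyone collides on the empty sequence
round1 = {2: {1, 2, 3, 4}, 3: {1, 2, 3, 4}, 4: {2, 3, 4}}   # p1 crashes mid-round
round2 = {p: {2, 3, 4} for p in (2, 3, 4)}
print(run([round1, round2], start))
# {2: (1, 0), 3: (2, 0), 4: (2, 1)}: p3 and p4 collide after round 1, separate after round 2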
Since the logarithmic bounds are tight, these results show that the logarithmic bounds
are the best possible bounds that can be obtained in this model using the notion of
order-equivalence. In the remainder of this paper, we will show that this logarithmic
lower bound can be used to prove logarithmic lower bounds for decision problems like
strong renaming and concurrent objects like increment registers. Since this logarithmic
bound is tight for order-equivalence, these results show, for example, that if operations
on objects such as stacks or queues require more than a logarithmic number of rounds,
then this additional complexity cannot be an artifact of the comparison model, but must
somehow be inherent in the semantics of the objects themselves.
4. Decision problems
We can use the lower bound on order equivalence to prove lower bounds for decision
problems. For example, consider the problem of strong renaming defined as follows.
Each processor has a unique id taken from a totally-ordered set of ids. At the start
of the protocol, each processor is in a distinguished initial state and receives a single
bit from the environment, either 0 or 1 meaning "do not participate" or "participate,"
respectively. At the end of the protocol, each participating processor sends an integer
to the environment. A protocol solves the strong renaming problem if each nonfaulty
participating processor sends a distinct integer from {1, ..., c} to the environment at
the end of every execution in which at most n - 1 processors fail, where c ≤ n is the
number of participating processors.
4.1. Lower bound
The logarithmic lower bound for strong renaming follows quickly from the lower
bound for order-inequivalence:
Proposition 3. Any comparison-based protocol for strong renaming requires Ω(log_3 c) rounds of communication, where c is the number of participating processors.
Proof. Let P be a comparison-based protocol for strong renaming. According to the
definition of strong renaming, each processor has a unique id from a totally-ordered
set I, each processor starts in the same initial state s, and each processor receives a
bit in {0, 1} from the environment. Consequently, the environment E for this problem
includes the environment F({s}, I, {1}) consisting of environment graphs in which all
active processors start with the same state s and all active processors start with the
same participation bit 1. In this environment, the active processors are precisely the
participating processors.
Since each processor terminates by sending a different integer to the environment,
and since the message function of a comparison-based protocol - the function com-
puting the messages processors send to the environment - must be the same in order-
equivalent states, the processors must end the protocol in order-inequivalent states. By
Proposition 1, for every c ≤ n, there must be some execution of P in P(E) in which the
c active (and hence participating) processors are still order-equivalent after Ω(log_3 c)
rounds. Consequently, for every c ≤ n, there must be some execution of P that requires
Ω(log_3 c) rounds. □
define rank(s, S) = |{s' ∈ S : s' ≤ s}|
define bot(S) = {s ∈ S : rank(s, S) ≤ |S|/2}
define top(S) = S - bot(S)
define bot(S, k) = {s' ∈ S : rank(s', S) ≤ k/2}

broadcast p
P ← {p' : p' received}
b ← ⌈log |P|⌉; I ← [1, 2^b]
repeat
  broadcast (p, I)
  J ← {I' : (p', I') received and I ∩ I' ≠ ∅}
  P ← {p' : (p', I') received and I ∩ I' ≠ ∅}
  if I' ⊆ I for every I' ∈ J then
    if p ∈ bot(P, |I|)
      then I ← bot(I)
      else I ← top(I)
until |I'| = 1 for all I' ∈ {I' : (p', I') received}
return a, where I = [a, a]

Fig. 1. A log c renaming protocol A for processor p.
Lower bounds for other decision problems like order-preserving renaming can also
be proven using the same technique.
4.2. Strong renaming algorithm
The logarithmic lower bound for strong renaming is tight, because there is a simple
algorithm solving strong renaming in a logarithmic number of rounds. The algorithm
is given in Fig. 1.
The basic idea is that if a processor p hears of 2^b other participating processors, then
it chooses a b-bit name for itself one bit at a time, starting with the high-order bit and
working down to the low-order bit. Every round, p sends an interval I containing the
names it is willing to choose from. On the first round, when the processor has not yet
chosen any of the leading bits in its name, it sends the entire interval [1, 2^b]. It sets its
high-order bit to 0 if it finds it is in the bottom half of the set of processors it hears from
interested in names from the interval [1, 2^b], and to 1 if it finds itself in the top half.
In the first case it sends the interval [1, 2^{b-1}], and in the second it sends [2^{b-1}+1, 2^b].
In order to make an accurate prediction of the behavior of processors interested in
names in its interval I, however, it must wait until every processor interested in names
in I is interested only in names in I before choosing its bit and splitting its interval in
half; that is, it must wait until its interval I is maximal among the intervals intersecting
I. Continuing in this way, the processor chooses each bit in its name, and continues
broadcasting its name until all processors have chosen their name.
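To illustrate, here is a Python simulation of the protocol of Fig. 1 in the crash-free special case (our sketch, not code from the paper): with no failures, every processor hears from all c participants, intervals within a group always coincide (so every interval is maximal and the explicit maximality test can be dropped), and the processor of rank r ends up with the name r.

import math

def strong_rename(ids):
    c = len(ids)
    b = math.ceil(math.log2(c)) if c > 1 else 0
    iv = {p: (1, 2 ** b) for p in ids}           # everyone starts with I = [1, 2^b]
    while any(lo != hi for (lo, hi) in iv.values()):
        new = {}
        for p, (lo, hi) in iv.items():
            if lo == hi:                         # name already fixed
                new[p] = (lo, hi)
                continue
            # competitors: processors whose intervals intersect I (here: equal I)
            comp = sorted(q for q in ids if iv[q][0] <= hi and lo <= iv[q][1])
            mid = (lo + hi) // 2
            if comp.index(p) + 1 <= (hi - lo + 1) // 2:   # p in bot(P, |I|)
                new[p] = (lo, mid)
            else:
                new[p] = (mid + 1, hi)
        iv = new
    return {p: lo for p, (lo, _) in iv.items()}

print(strong_rename(['w', 'q', 'z', 'k', 't']))  # 5 distinct names from {1, ..., 5}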
There are a few useful observations about the intervals processors send during this
algorithm. The first is that if processor p sends the interval I_k during round k, then
I_k ⊇ I_{k'} for all later rounds k' ≥ k. The second is that each interval I_k is of a very
particular form; namely, every interval sent during an execution of A is of the form
[d2^j+1, d2^j+2^j] for some constant d. This is easy to see since the first interval I_2
a processor sends (in round 2) is of the form [1, 2^b], and every succeeding interval
I_k is either I_{k-1} or of the form top(I_{k-1}) or bot(I_{k-1}) (as defined in Fig. 1). We
say that an interval I is a well-formed interval if it is of the form [d2^j+1, d2^j+2^j] for some constant d. It is easy to see that any two well-formed intervals are either
disjoint, or one is contained in the other. Notice that every well-formed interval I is
properly contained in a unique minimal, well-formed interval Î ⊋ I. Furthermore, either
I = top(Î) or I = bot(Î), and it is the low-order bit of the constant d that tells us which
is the case. We define the operator ^ that maps a well-formed interval I to the unique
minimal, well-formed interval Î properly containing I.
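The well-formed intervals and the operators top, bot, and ^ are a few lines of Python; the sketch below is ours, with intervals represented as (lo, hi) pairs.

def bot(iv):
    lo, hi = iv
    return (lo, lo + (hi - lo + 1) // 2 - 1)     # bottom half

def top(iv):
    lo, hi = iv
    return (lo + (hi - lo + 1) // 2, hi)         # top half

def hat(iv):
    # the unique minimal well-formed interval properly containing iv
    lo, hi = iv
    size = hi - lo + 1
    d = (lo - 1) // size                         # iv = [d*2^j + 1, d*2^j + 2^j]
    if d % 2 == 0:                               # low-order bit of d: which half?
        return (lo, hi + size)                   # iv = bot(hat(iv))
    return (lo - size, hi)                       # iv = top(hat(iv))

assert hat((5, 6)) == (5, 8) and bot((5, 8)) == (5, 6)   # [5,6] = bot([5,8])
assert hat((7, 8)) == (5, 8) and top((5, 8)) == (7, 8)   # [7,8] = top([5,8])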
In every round of the algorithm, a processor p computes the set P of processors with
intervals I' intersecting its current interval I. These processors in P are the processors
p is competing with for names in I. When p sees that its interval I is a maximal
interval (that is, all intervals I' received by p that intersect I are actually contained in
I), processor p adjusts its interval I to either bot(I) or top(I). Our first lemma essentially
says that when p replaces I with bot(I) or top(I), there are enough names in I to
assign a name from I to every competing processor. Furthermore, this lemma shows
that when a processor's interval reduces to a singleton set, then this processor no longer
has any competitors for that name.
Lemma 4. Suppose p sends interval I during round k ≥ 2. If P is the set of processors sending an interval I' ⊆ I during round k, then |P| ≤ |I|.
Proof. We consider two cases: I = bot(Î) and I = top(Î).
First, suppose I = bot(Î). Consider the greatest processor q (possibly p itself) in P. Processor q sent an interval J_k ⊆ I to p in round k, so consider the first round ℓ ≤ k in which q sent some interval J ⊆ I to any processor (and hence to p).
If ℓ = 2, then J is of the form [1, 2^b], where 2^b is an upper bound on the number of
processors q heard from in round 1, and hence on the number of active processors in
round k ≥ 2, and therefore on |P|, the number of processors sending intervals contained
in I in round k. It follows that |P| ≤ 2^b = |J| ≤ |I|.
If ℓ > 2, then q sent J ⊆ I in round ℓ, and q sent a larger interval Ĵ ⊋ J in round
ℓ - 1, since ℓ is the first round that q sent an interval contained in I. In fact, we must
have J = I and Ĵ = Î, for if J ⊊ I then Ĵ ⊆ I and ℓ is not the first round that q sent
an interval contained in I, a contradiction. Let Q be the set of processors sending an
interval intersecting Î to q in round ℓ - 1. Since every processor p' ∈ P sending an
interval I' ⊆ I to p in round k must also send an interval intersecting Î to q in round
ℓ - 1, each of these processors must be contained in Q, and hence P ⊆ Q. Since q sent Î in round ℓ - 1 and I = bot(Î) in round ℓ, it must be the case that q ∈ bot(Q, |Î|) at the end of round ℓ - 1. Since P ⊆ Q and since q is the greatest processor in P, it
must be the case that P ⊆ bot(Q, |Î|). It follows that |P| ≤ |Î|/2 = |I|, as desired.
Now, suppose I = top(Î). The proof in this case is similar to the proof when
I = bot(Î), except that q is now taken to be the least processor in P. □
Our second lemma shows that when a processor p selects an interval I = [a, b], there
are enough participating processors to use up the names 1, ..., a. In particular, when
p's interval becomes the singleton set [a, a], then there are at least a participating
processors, and hence a is a valid name for p to choose. We say that a proces-
sor holds an interval [a, b] during a round if [a, b] is its interval at the beginning of
the round, and hence the interval it sends during that round (if it sends any interval
at all).
Lemma 5. If I = [a, b] is a maximal interval received by some processor p during round k, then there are at least a - 1 processors that either hold an interval [a', b'] with b' < a during round k or fail to send to p in round k.
Proof. We proceed by induction on k.
Suppose k = 2. This is the first round that any processor sends any interval, so I
must be of the form [1, 2^b], and it is vacuously true that at least 0 processors fail to
send to p in round 2. Now suppose k > 2, and suppose the induction hypothesis holds
for k' < k.
Suppose I = bot(Î). Processor p received I in round k, so consider any processor
q that sent I during round k, and consider the first round k' ≤ k in which q sent I.
If k' = 2, then I is of the form [1, 2^b], and it is vacuously true that 0 processors fail
to send to p, so suppose k' > 2. In this case, q must send Î during round k' - 1 and
I during round k', and the fact that q splits down at the end of round k' - 1 implies
that Î is a maximal interval received by q during round k' - 1.
Since intervals I and Î have the same lower bound a, the induction hypothesis for
k' - 1 ≤ k - 1 implies that there are at least a - 1 processors that either hold an interval
[a', b'] with b' < a during round k' - 1 or fail to send to q in round k' - 1. Since each
processor that holds an interval [a', b'] in round k' - 1 must hold an interval [a'', b''] contained in [a', b'] in round k or fail to send to p in round k, and since each processor
that fails to send to q in round k' - 1 must fail to send to p in round k, it follows that
there are at least a - 1 processors that either hold an interval [a'', b''] with b'' ≤ b' < a in round k or fail to send to p in round k.
Suppose I = top(Î). Let q be the smallest processor ever sending I during the ex-
ecution. The interval I is not the interval that q sent in round 2 (the first round in
which any processor sends any interval) because in that round q sent an interval of
the form [1, 2^b], which is not of the form top(Î). Since I is a maximal interval received
by p in round k, it follows that q must have sent the interval I for the first time in
some round k' ≤ k and the interval Î = [â, b] in round k' - 1. Since q changed intervals
between round k' - 1 and k', the interval Î must be a maximal interval received by q in
round k' - 1. By the induction hypothesis there are at least â - 1 processors that either
hold an interval [a', b'] with b' < â during round k' - 1 or fail to send to q in round
k' - 1, and all of these processors must either hold an interval [a', b'] with b' < â < a in
round k or fail to send to p in round k. Since q changed its interval from Î = [â, b] to
I = top(Î) = [a, b] at the end of round k' - 1, there are at least a - â = |Î|/2 processors
q' < q sending an interval contained in Î to q in round k' - 1. None of these processors
q' sending an interval in Î = [â, b] to q in round k' - 1 could have been one of the
â - 1 processors that either held an interval [a', b'] with b' < â in round k' - 1 or failed
to send to q in round k' - 1. All of these processors q' sending an interval in Î = [â, b]
to q in round k' - 1 must either send an interval in bot(Î) to p in round k or fail to
send to p in round k, since I = top(Î) is a maximal interval received by p in round k,
and since q' < q and we chose q to be the smallest processor ever sending I during the
execution. It follows that at least (â - 1) + (a - â) = a - 1 processors hold an interval
[a', b'] with b' < a in round k or fail to send to p in round k, and we are done. □
Finally, since the algorithm terminates when every processor’s interval is a singleton
set, and since the size of the maximal interval sent during a round decreases by a
factor of 2 in every round, it is easy to prove that the algorithm A terminates in O(log c)
rounds.
Lemma 6. The algorithm A terminates in log c + 2 rounds, where c is the number
of participating processors.
Proof. Consider an arbitrary execution of A. For each round k, let ℓ_k be the size of
the largest interval sent during round k.
Consider round k = 2. In this round, each processor p sends an interval of the form
[1, 2^b] where 2^b is the least power of two greater than or equal to the number of
processors that processor p heard from in round 1. It follows that ℓ_2 = 2^b < 2c,
where c is the number of participating processors.
Consider any round k > 2 with ℓ_{k-1} > 1. We claim that ℓ_k ≤ ℓ_{k-1}/2. Consider any
processor that holds an interval of size ℓ_{k-1} > 1 at the start of round k - 1, and hence
sends this interval in round k - 1. Since no interval sent in round k - 1 is larger than
ℓ_{k-1}, this processor must see that its interval is maximal at the end of round k - 1
and split its interval in half for the next round. Since this is true for every processor
sending an interval of size ℓ_{k-1} during round k - 1, and every processor sending a
smaller interval during round k - 1 sends an interval of size at most ℓ_{k-1}/2, it follows
that all processors send an interval of size at most ℓ_{k-1}/2 in round k, so ℓ_k ≤ ℓ_{k-1}/2.
Since ℓ_2 < 2c and ℓ_k ≤ ℓ_{k-1}/2, we have ℓ_k ≤ ℓ_2/2^{k-2} < 2c/2^{k-2}. It follows that
ℓ_k = 1 within at most k = log c + 2 rounds, at which time all intervals are of size
1 and the processors can halt. □
With these results, we are done:
Theorem 7. The algorithm A solves the strong renaming problem, and terminates in log c + 2 rounds, where c is the number of participating processors.
Proof. First, by Lemma 6, all processors choose a name and terminate in log c + 2
rounds, where c is the number of participating processors.
Second, the names chosen by processors are distinct. Suppose two processors p and
p' chose the name a at the end of rounds k and k' ≥ k, respectively. Processors p and
p' must have sent the singleton set I = [a, a] to all processors in rounds k and k', and
intervals containing I in all preceding rounds. Since p could not have terminated in
round k unless all intervals it received were singletons, both processors must have sent
I = [a, a] in round k. It follows by Lemma 4 that 2 ≤ |I| = 1, which is impossible.
Finally, names chosen are in the interval [1, c], where c is the number of participating
processors. Consider the processor p choosing the highest name a chosen by any
processor, and consider the last round k in which p sends the singleton set I = [a, a] and terminates, returning a. All intervals p receives in round k must therefore be
singleton sets. This implies that I is a maximal interval received by p in round k. It follows by Lemma 5 that at least a - 1 processors hold intervals [a', b'] with b' < a in round k or have failed, and hence that there are at least a participating processors.
This implies that c ≥ a and all names are chosen in the interval [1, c], as required. □
5. Wait-free objects
We can use the lower bound on order equivalence to prove lower bounds on the
complexity of wait-free implementations of concurrent objects. We can also prove that
this bound is tight in the case of a simple object called an increment register. The
implementation that we describe is very similar to the strong renaming algorithm in
the previous section. We start with some formal definitions.
An object is a data structure that can be accessed concurrently by all processors.
It has a type, which defines the set of possible values the object can assume, and
a set of operations that provide the only means to access or modify the object. A
processor invokes an operation by sending an invoke message to the object, and the
operation returns with a matching response message from the object. A history is a
sequence of invoke/response messages. A sequential history is a history in which every
invoke message is followed immediately by a matching response message, meaning
that the operations are invoked sequentially one after another. In addition to a type, an
object has a sequential specification which is a set of all possible sequential histories
describing the sequential behavior of the object. For example, an increment register is
just a register with an increment operation. The value of the register is a nonnegative
integer. The increment operation atomically increments the value of the register and
returns the previous value. The sequential behaviors for an increment register are the
sequential histories of increment operations returning integer values in order, such as
0, 1, 2, ... .
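As a sequential specification, the object is tiny; the Python sketch below is our illustration, with increment returning the value the register held before the bump.

class IncrementRegister:
    def __init__(self, value=0):
        self.value = value

    def increment(self):            # atomically bump, return the previous value
        self.value += 1
        return self.value - 1

reg = IncrementRegister()
print([reg.increment() for _ in range(4)])   # [0, 1, 2, 3]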
We are interested in concurrent implementations of such objects. To us, given an
object O intended to be used by n processors P1, ..., Pn, an implementation of O
will be a collection of n processors F1, ..., Fn called front ends [11] that process the
invocations from P1, ..., Pn and return the responses from O. Intuitively, the front end
Fi is just the procedure that processor Pi calls to invoke an operation on O. The front
end Fi receives the invocations from Pi and sends the responses from O. In our model,
since we are only concerned with the implementation of objects (and not their use), we
assume that the front ends F1, ..., Fn are really the system processors p1, ..., pn. We
assume that the invoking processors P1, ..., Pn are part of the external environment e,
and we ignore them completely. With this in mind, we define a history of a system to
be the history h obtained by projecting an execution of the system onto the subsequence
of invoke/response messages appearing in the execution.
The specification of an object’s concurrent behavior is defined in terms of its se-
quential specification. An object is linearizable [14] if each operation appears to take
effect instantaneously at some point between the operation’s invocation and response.
Linearizability implies that operations on the object appear to be interleaved at the
granularity of complete operations, and that the order of nonoverlapping operations is
preserved. An implementation is said to be wait-free if no front end is blocked by the
failure of other front ends. The precise definition of wait-free linearizable objects is
well-known [14], so we will not repeat it here.
We assume that any wait-free, linearizable implementation of an object can be
initialized to any value defined by the type of the object. Specifically, we assume
that for every value v in the type of an object, there is an initial processor state s_v
with the following property: if every processor begins in state s_v (with the possible
exception of the processors failing immediately at time 0) then every execution from
this initial state is linearizable to a sequential history in which the operations in the
history are invoked sequentially on a copy of the object initialized to the value v. This
assumption is valid, for example, for all concurrent objects implemented using the tech-
nique of state machine replication [15, 16, 20] which is the technique most commonly
used in message-passing models like ours.
5.1. Lower bounds
Our lower bound on order-equivalence can be used to prove lower bounds for a
number of concurrent objects. For example, an ordered set S is an object whose value
is some subset of a totally ordered set T, with an insert(a) operation that adds a ∈ T to
S and a remove operation that removes the minimum element a ∈ S from S and returns a.
As the next result shows, we can use the remove operation from any implementation
of an ordered set to solve the order-equivalence problem, so the logarithmic lower
bound on order-equivalence implies the same lower bound for the remove operation
of an ordered set. Many interesting concurrent objects are special cases of an ordered
set. For example, an ordered set’s remove operation is just a special case of a stack’s
pop, a queue's dequeue, and a heap's min. Consequently, the next result implies a
logarithmic lower bound for each of these operations as well.
Proposition 8. Given any comparison-based, wait-free, linearizable implementation of an ordered set, the remove operation requires Ω(log_3 c) rounds of communication, where c is the number of concurrent invocations of the remove operation.
Proof. Consider any such implementation of the ordered set. Consider any value S
of the ordered set containing at least n distinct values, where n is the number of
processors in the system, and let s_S be the initial processor state with the property that
if all processors start in state s_S, then the object is initialized to S. The environment E
for an ordered set certainly includes the environment F({s_S}, I, {remove}) consisting
of environment graphs in which all active processors start with the same state s_S
and all active processors start with an invocation of the remove operation from the
environment. In such an environment, the invoking processors are precisely the active
processors.
Each processor terminates by removing a distinct value from the set and returning it
to the environment by sending a distinct message to the environment. Since the message
function of a comparison-based protocol (the function choosing the message from N
that processors send to the environment) must be the same in order-equivalent states,
the processors must end the protocol in order-inequivalent states. By Proposition 1, for
every c ≤ n, there must be some execution of P in P(E) in which the c active (and
hence invoking) processors are still order-equivalent (and hence cannot terminate) after
Ω(log_3 c) rounds. □
As another example, consider the increment register defined earlier in this section.
The value of an increment register is just a nonnegative integer. The increment register
provides an increment operation that atomically increments the value of the register
and returns the previous value. Since we can use the increment operation to solve the
order-equivalence problem, we can prove a logarithmic lower bound for the increment operation:
Proposition 9. Given any comparison-based, wait-free, linearizable implementation of an increment register, the increment operation requires Ω(log_3 c) rounds of communication, where c is the number of concurrent invocations of the increment
operation.
Proof. Consider any such implementation of an increment register. Let s_0 be the
initial processor state with the property that if every processor begins in state s_0, then
the register is initialized to 0. The environment E for an increment register includes
the environment F({s_0}, I, {increment}) consisting of environment graphs in which all
active processors start with the same state s_0 and all active processors start with an
invocation of the increment operation from the environment. In this environment, the
active processors are the incrementing processors.
Each processor terminates by returning a distinct value to the environment. Since
the message function of a comparison-based protocol must send the same message
to the environment in order-equivalent states, the processors must end the protocol
in order-inequivalent states. By Proposition 1, for every c ≤ n, there must be some
execution of P in P(E) in which the c active (and hence incrementing) processors are
still order-equivalent (and hence cannot terminate) after Ω(log_3 c) rounds. □
5.2. Increment register algorithm
In this section, we give the last major result of our paper: an optimal wait-free im-
plementation of an increment register. It closely resembles our optimal strong renaming
algorithm, and proves that the logarithmic lower bound is tight.
A processor p can invoke an increment operation multiple times in a single execu-
tion, and each invocation can take multiple rounds to complete. We refer to the set
of increment operations invoked during round k as generation k increments, and we
refer to the processors invoking these increments as generation k professors. We refer
to the rounds of a generation as phases, and we number the phases of generation k starting with 0 so that phase e of generation k occurs during round k + c!.
Since a processor p can invoke the increment operation more than once, it identifies
itself during generation k with an ordered pair (p, k) called its increment processor id. We assume each processor p maintains a set IncSet of all the increment processor ids
that it knows about, and continues to maintain this set in the background even when
it is not actually performing an increment operation. Every round, it broadcasts this
set to other processors, and merges the sets it receives from other processors into its
own set. For notational simplicity, however, since the generation k will always be clear
from context, we will frequently write p in place of (p, k). This set IncSet can also be
used to initialize the increment register. For the sake of simplicity, the implementation
we give assumes the register is initialized to 0, but it can be initialized to any value
as follows: in the initial processor state s_i in which the increment register has been
initialized to the value i, the set IncSet is initialized to contain phantom increment
ids (p, -i), (p, -i+1), ..., (p, -1) for some processor id p to simulate p's previously
incrementing the register i times.
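The IncSet bookkeeping is straightforward; the following Python sketch is ours (the id 'p0' is an arbitrary placeholder for the processor id p above).

def initial_incset(i, p='p0'):
    # phantom increment ids (p, -i), ..., (p, -1) initialize the register to i
    return {(p, -g) for g in range(1, i + 1)}

def merge(incset, received_sets):
    for s in received_sets:             # fold in the IncSets heard this round
        incset |= s
    return incset

print(sorted(initial_incset(3)))        # [('p0', -3), ('p0', -2), ('p0', -1)]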
Understanding our implementation requires understanding the notions of ranges, intervals, splitting, and chopping, so let us begin with these concepts.
Ranges. Our implementation has the property that increments in one generation are
effectively isolated from increments in other generations, in the sense that increments in
one generation can choose return values by communicating among themselves, ignoring
increments in other generations. This isolation is achieved by partitioning the return
values into ranges.
As illustrated in Fig. 2, each processor p maintains a range R = [R.lb, R.ub] of return
values. Initially, using the set IncSet of increment processor ids known to p, processor
Fig. 2. The range: (a) the initial range; (b) extending the range.
p sets its lower bound lb to the number of increments invoked by previous generations,
and its upper bound ub to the total number of increments invoked by previous and
current generations. Every phase, processor p exchanges ranges with other processors
in its generation, and extends its range by dropping its lower bound to the smallest
lower bound received from any of these processors.
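A sketch of this phase 0 computation in Python (our names; inc_set holds the increment processor ids (q, generation) known to p, and k is p's own generation; cf. phase0 in Fig. 7):

    # Initial range: values below lb are reserved for earlier generations.
    def initial_range(inc_set, k):
        gen_set = {pid for pid in inc_set if pid[1] < k}  # earlier generations
        lb = len(gen_set)             # number of earlier increments
        ub = len(inc_set) - 1         # earlier + current increments, minus one
        return (lb, ub)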
Intuitively, by setting its initial lower bound to lb, processor p is reserving lower
values v < lb as return values for increments in previous generations recorded in IncSet.
Later, if p hears that another processor q in the same generation set its initial lower
bound to lb' < lb, then p knows some of these earlier increments have failed, so p
ceases to reserve return values for them and drops its lower bound to lb'.
Our algorithm guarantees that if a nonfaulty processor sets its upper bound to ub,
then all processors in all later generations set their initial lower bounds to lb > ub,
so their lower bounds remain above ub forever. In this sense, the upper bounds of
the nonfaulty processors partition the return values. Nonfaulty processors in different
generations have disjoint ranges, allowing them to ignore each other once their initial
ranges have been chosen.
Intervals and splitting. Given a range R of acceptable return values, however, p still
has to choose one of them to return. To do so, we modify the fundamental idea in the
optimal algorithm for strong renaming described in Section 4. The basic idea is that if
the values in p's range R are b bits long, then p chooses a b-bit value from R one bit
at a time, starting with the high-order bit and working down to the low-order bit. To
implement this idea, processor p maintains an interval I = [I.lb, I.ub] of return values
that contains its range R (see Fig. 3). The size of the interval is always a power of 2.
Processor p's initial interval is the smallest interval of the form [0, 2^b - 1] that contains
p's initial range. During an increment, processor p repeatedly splits its interval in half
until the interval contains a single value, and this is the value that p returns. It is easy
to see that all of the intervals generated by p are of the form [a2^k, a2^k + (2^k - 1)] for
some (b-k)-bit value a, and such intervals are called well-formed intervals. Intuitively,
this interval represents the fact that p has chosen a as the high-order b-k bits of its
return value, but must still choose the low-order k bits.
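To make these interval manipulations concrete, here is a small Python sketch of well-formed intervals and their halves (the representation of an interval as a pair (lb, ub) and the function names are ours):

    # Smallest well-formed interval [a*2^k, a*2^k + (2^k - 1)] containing [lb, ub].
    def smallest_well_formed(lb, ub):
        size = 1
        while True:
            a = lb // size                     # align the candidate to a*2^k
            if a * size + size - 1 >= ub:      # does it reach past ub?
                return (a * size, a * size + size - 1)
            size *= 2

    def top(I):                                # top half of an interval
        lo, hi = I
        return ((lo + hi + 1) // 2, hi)

    def bot(I):                                # bottom half of an interval
        lo, hi = I
        return (lo, (lo + hi + 1) // 2 - 1)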
Fig. 3. The interval: (a) the initial interval contains the initial range.
The procedure that p uses to split its interval in half is important (see Fig. 3).
Every round, processor p exchanges intervals with other processors, and p maintains
a set C of all processors sending p an interval intersecting its current interval I.
The processors in C are p's competitors, since they include the processors considering
return values in p's range. To avoid returning the same value as one of its competitors,
processor p attempts to predict what values its competitors will choose. To predict
accurately, however, p must wait until I is maximal among the intervals received
from its competitors; this means that p's competitors are considering only values in I.
Once I is maximal, p assigns return values from its range to its competitors, starting
at the bottom of its range and assigning values to competitors in order of increasing
processor id. Eventually, p assigns a value v to itself. Processor p then replaces I with
its top half top(I) or its bottom half bot(I) - whichever half contains v - and then
replaces its range R with the intersection of R and I. Continuing in this way every
round, processor p's interval eventually contains a single value v, at which point p
chooses v but continues exchanging its interval with other processors until all processors
in its generation have chosen a value.
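One split step can be sketched as follows in Python, mirroring the description above and the pseudocode of Fig. 6 (the tuple representation and names are ours; the sketch assumes, as the discussion of Fact 11 below notes, that the range spans the midpoint of the interval when a split occurs):

    # One split step: rank myself among my competitors, take the tentative
    # value R.lb + rank, and keep whichever half of I contains it.
    def split(my_id, R, I, C):
        rank = sorted(C).index(my_id)          # 0 is the lowest rank
        value = R[0] + rank                    # tentative return value
        lo, hi = I
        mid = (lo + hi + 1) // 2
        if value >= mid:                       # value in top(I): split up
            return (mid, R[1]), (mid, hi)      # new range, new interval
        else:                                  # value in bot(I): split down
            return (R[0], mid - 1), (lo, mid - 1)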
Chopping. It is easy to see that the split operation is what gives rise to the algorithm's
logarithmic nature: in any given round, a maximal interval is guaranteed to split in
half, so the size of the maximal intervals decreases by a factor of 2 with every round.
Unfortunately, this is logarithmic in the size of the initial interval, which can be as
large as the total number of increments ever invoked, and we want the algorithm to run
in time logarithmic in the number of concurrently executing increments. Fortunately,
we can speed up the algorithm dramatically by introducing a new operation called a
chop, illustrated in Fig. 4. For example, if p's range R is just the top few values in
its interval I, then it is clear that p is going to split up repeatedly for many rounds.
We accelerate this splitting by allowing p to chop in a single round from I up to
the smallest well-formed interval I' containing R. We say that p chops up in this
case, and chopping down is similar. Since chopping is just an accelerated form of
splitting, processor p must wait until I is maximal among the intervals received from
its competitors before chopping. On the other hand, it is important that we do not allow
p to split and chop in the same round: if p splits down and then immediately chops up
Fig. 4. Chopping.
to a smaller interval containing its new range, then it runs the risk of chopping away
the bottom of its interval before learning that it can extend this range by dropping the
lower bound, so it runs the risk of reaching a state in which its interval and range are
too small to assign distinct values from its range to all of its competitors.
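In code, a chop is a single jump; given the smallest_well_formed helper sketched earlier, it is one line (again our rendering, not the paper's):

    # One chop step: jump straight to the smallest well-formed interval
    # containing R. It is applied only when I is maximal among the competitors'
    # intervals and R lies entirely within top(I) or bot(I), and never in the
    # same round as a split.
    def chop(R):
        return smallest_well_formed(R[0], R[1])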
Algorithm. With this, we have introduced the notions of ranges, intervals, splitting,
and chopping, and we can turn our attention to the increment register implementation
ℛ itself. The main loop of the algorithm is given in Fig. 5, the definitions of splitting
and chopping are given in Fig. 6, and the definitions of some initialization steps are
given in Fig. 7.
During the initial phases of generation k, an incrementing processor p starts by
adding its increment processor id (p, k) to IncSet; it exchanges IncSet with other
processors and uses the result to choose its initial range R as described above; it
exchanges R with other processors, extends R by dropping its lower bound as described
above, and uses the result to choose its initial interval. In all later phases, processor
p exchanges its interval and range with other processors, extends its range if possible,
and splits or chops its interval and range whenever it finds that its interval is maximal
among its competitors. When processor p’s interval contains a single value, it continues
broadcasting its interval and range until all competing intervals contain a single value,
then p chooses its value and halts.
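Before turning to correctness, it may help to see the pieces run together. The following Python sketch simulates a single failure-free generation in lock step (a simplified illustration, not the paper's code: one generation, no crashes, phases 0 and 1 collapsed into the initial state, and smallest_well_formed reused from the earlier sketch):

    def intersects(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    def run_generation(n):
        # With no failures every processor hears all n increments in phase 0,
        # so every initial range is [0, n-1] and every initial interval is
        # the smallest well-formed interval containing it.
        R = {p: (0, n - 1) for p in range(n)}
        I = {p: smallest_well_formed(0, n - 1) for p in range(n)}
        chosen = {}
        while len(chosen) < n:
            msgs = {p: (R[p], I[p], R[p][0]) for p in range(n)}  # bcast <R,I,lb>
            newR, newI = {}, {}
            for p in range(n):
                C = sorted(q for q in range(n) if intersects(msgs[q][1], I[p]))
                N = [msgs[q][1] for q in C]
                lb = min(m[2] for m in msgs.values())            # drop lower bound
                r = (max(lb, I[p][0]), min(R[p][1], I[p][1]))    # R <- R intersect I
                i = I[p]
                mid = (i[0] + i[1] + 1) // 2
                if i[0] < i[1] and all(j[0] >= i[0] and j[1] <= i[1] for j in N):
                    # I is maximal in N and not yet a single value
                    if r[1] < mid or r[0] >= mid:                # R fits in a half
                        i = smallest_well_formed(r[0], r[1])     # chop
                    else:                                        # split
                        value = r[0] + C.index(p)
                        if value >= mid:
                            r, i = (mid, r[1]), (mid, i[1])
                        else:
                            r, i = (r[0], mid - 1), (i[0], mid - 1)
                if all(j[0] == j[1] for j in N):                 # everyone is done
                    chosen[p] = i[0]                             # v <- I.lb
                newR[p], newI[p] = r, i
            R, I = newR, newI
        return chosen

For example, run_generation(4) returns the distinct values {0: 0, 1: 1, 2: 2, 3: 3} after three rounds of exchanges.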
Proving the correctness of this algorithm consists of proving two properties.
The first property we must prove is that given two nonoverlapping increments, the
value returned by the first is less than the value returned by the second. This will
imply that the implementation is linearizable. In fact, this is very easy to prove using
begin /* a generation k increment by processor p */
    initialize();   /* add (p,k) to IncSet */
    phase0();       /* bcast IncSet, choose initial range R */
    phase1();       /* bcast R, extend R, and
                       choose initial interval I */
    repeat
        broadcast <p,R,I,lb>
        receive <p',R',I',lb'> from generation k processors
        /* collect names and intervals of competitors */
        C <- {p' : <p',R',I',lb'> received and I' intersects I}
        N <- {I' : <p',R',I',lb'> received and I' intersects I}
        /* extend range by dropping the lower bound */
        R.lb <- lb <- min {lb' : <p',R',I',lb'> received}
        R <- E <- R intersect I
        /* E is used only in the proof */
        if I is maximal in N then
            if R is contained in either top(I) or bot(I)
                then chop()
                else split()
    until |I'| = 1 for all I' in N
    v <- I.lb   /* I = [v,v] */
    return(v);
end.
Fig. 5. The increment register ℛ.
the observation that the ranges effectively isolate distinct generations, a fact mentioned
earlier in the discussion of ranges:
Lemma 10. Suppose p and q are generation i and j processors returning values v and
w, respectively. If i < j, then v < w.
Proof. Notice that v is no higher than p’s upper bound. Notice also that p set its
upper bound to |IncSet| - 1 at the end of phase 0 of generation i, and then broadcast
this set to all processors in phase 1 of generation i. Since i <j, phase 1 of generation i
is no later than phase 0 of generation j. Consequently, all generation j processors have
received IncSet before they set their lower bound at the end of phase 0 of generation j.
chop()
begin
    I <- smallest well-formed interval containing R
end.

split()
begin
    rank <- rank of p in C   /* 0 is the lowest rank */
    value <- R.lb + rank
    if value in top(I) then
        I.lb <- R.lb <- I.lb + |I|/2
    else
        I.ub <- R.ub <- I.ub - |I|/2
    fi
end.
Fig. 6. Chopping and splitting an interval.
This means that no generation j processor will ever lower its lower bound below p’s
upper bound. Since w is above q's lower bound, we have v < w. □
For the second property, remember that C is the set of competitors, and notice that E
(a history variable used only in the proof) is the extended range (the result of dropping
the lower bound of the real range R) that is used by a processor to assign values to
its competitors (including itself). The second property we must prove is that |C| ≤ |E|
for every processor p in every phase. This invariant says that p can always assign
distinct values from E to its competitors. This will imply that the algorithm terminates:
whenever a processor finds that its interval is maximal, it can assign itself a value
and split or chop to a smaller interval containing this value. This will also imply that
distinct processors choose distinct values: if p and q return the same value v, then at
some point they both have the same extended range E consisting of the single value
v and they both have a set of competitors C including p and q, but |C| = 2 > 1 = |E|.
Proving that |C| ≤ |E| requires reasoning about the interactions between the splits
and chops performed by different processors in different phases, and we prove two
claims (Claims 13 and 14 below) about these interactions. Let us fix a generation k
for the rest of this section. We denote the values of I and R broadcast by p during
phase r of an execution e by I_{e,p,r} and R_{e,p,r}, and we denote the values of E and C
held by p at the end of phase r of execution e by E_{e,p,r} and C_{e,p,r}. We often omit
subscripts like e and p when they are clear from context.
We say that p splits to I in phase i if p sends Î in phase i - 1 and I in phase
i, where p changes from Î to I by splitting. We say that p splits up or splits down
depending on whether I = top(Î) or I = bot(Î). We say that p chops into I in phase i
initialize()
begin
    k <- current round number          /* choose generation */
    p <- <processor id, k>             /* choose id */
    IncSet <- IncSet union {p}         /* set of incrementors */
end.

phase0()
begin
    broadcast <p,IncSet>
    receive <p',IncSet'> from all processors p'
    IncSet <- union of all IncSet' received
    GenSet <- set of all processors p' in IncSet with
              generation k' < k
    R.ub <- |IncSet| - 1
    R.lb <- lb <- |GenSet|
end.

phase1()
begin
    broadcast <p,R,lb>
    receive all <p',R',lb'>
    R.lb <- lb <- min gen k lower bound lb' received
    I <- smallest well-formed interval containing R
end.
Fig. 7. The initialization phases.
if p sends Ĵ ⊇ I in phase i - 1 and J ⊆ I in phase i, where p changes from Ĵ to J by
chopping. We say that p chops up or chops down depending on whether J ⊆ top(Ĵ) or
J ⊆ bot(Ĵ). Two simple properties about splitting and chopping are often useful.

Fact 11. If p splits from I_{i-1} to I_i, then the upper bounds of R_i, E_i and I_i, where
R_i ⊆ E_i ⊆ I_i, are equal if p splits down, and the lower bounds are equal if p splits up.

Fact 12. If p chops from I_{i-1} to I_i, then the upper bounds of E_{i-1}, E_i, R_i, I_{i-1} and
I_i, where E_{i-1} = R_i ⊆ E_i ⊆ I_i ⊆ I_{i-1}, are equal if p chops up, and the lower bounds
are equal if p chops down.
The first property follows from the fact that the range spans the midpoint of the
interval during a split (so the split truncates the range and interval at the same point).
The second property follows from the fact that the initial range always spans the
midpoint of the initial interval, so a split must occur before a chop (and again the split
truncates the range and interval at the same point).
Reasoning about one processor p’s splitting and chopping usually involves reasoning
about another processor q's behavior in earlier phases. The first claim below argues
that whenever a processor p with interval I has to find room for its competitors C in
its extended range E, each of these competitors themselves had to find room for C in
their extended ranges when they split or chopped into the interval I.
Claim 13. If I_{q,j} ⊇ I_{p,i} for some j ≤ i, then C_{p,i} ⊆ C_{q,j}.

Proof. Let r be a processor in C_{p,i}. This means that the interval I_{r,i}, sent by r to p in
phase i, intersects I_{p,i}. Since j ≤ i, the interval I_{r,j}, sent by r to q in phase j, contains
I_{r,i}. Now, since I_{q,j} contains I_{p,i}, and I_{r,j} contains I_{r,i}, the fact that I_{r,i} intersects
I_{p,i} implies that I_{r,j} intersects I_{q,j}. Hence, r ∈ C_{q,j}. It follows that C_{p,i} ⊆ C_{q,j}. □
The second claim we prove concerns the fact that a processor p may split to an
interval I in an orderly sequence of splits while another processor q may chop into
I in a chaotic interleaving of splits and chops. The claim states that the moment this
happens, p’s extended range E spans its entire interval I from that moment on. This
means that if chopping complicates our analysis in one way, it simplifies our analysis
in another since we no longer have to be careful to distinguish between intervals and
ranges.
Claim 14. Suppose p splits to I in phase i, and suppose q chops into I in phase j. If
i ≤ ℓ and j ≤ ℓ, then I_{p,ℓ} = E_{p,ℓ} at the end of phase ℓ.
Proof. Suppose p splits down from Î to I, and let E be p's extended range at the end
of phase i. Since p split down, we know that I.ub = E.ub by Fact 11. Since the upper
bounds of I_{p,ℓ} and E_{p,ℓ} are still equal at the end of phase ℓ, all we have left to show is
that their lower bounds are equal. First, notice that q must have chopped down and not
up: this follows from the fact that p split down from Î to I and the fact that q chopped
from Ĵ ⊇ I to J ⊆ I. Let R be the range q sent together with J in phase j. According to
Fact 12, the fact that q chopped down from Ĵ to J implies that R.lb = J.lb. Furthermore,
the fact that q chopped down from Ĵ ⊇ I to J ⊆ I implies that J.lb = I.lb. It follows
that R.lb = J.lb = I.lb. On the other hand, since R.lb = I.lb ≤ I_{p,ℓ}.lb and since p drops
its lower bound every round, it follows that I_{p,ℓ}.lb = E_{p,ℓ}.lb by the end of phase ℓ ≥ j.
Suppose p split up from Î to I, and let E be p's extended range at the end of
phase i. Since p split up, we know that I.lb = E.lb by Fact 11, and we will now show
that I.ub = E.ub as well. It will follow that I = E at the end of phase i, and hence that
I_{p,ℓ} = E_{p,ℓ} at the end of phase ℓ ≥ i. Suppose on the contrary that E.ub < I.ub. This
means that the upper bound R_{p,2}.ub of p's phase 2 range is also less than I.ub. It
follows that R_{p,2}.lb < I.lb, since otherwise R_{p,2} ⊆ I and p would have chosen I as its
initial interval - the smallest well-formed interval containing R_{p,2} - and not an interval
as large as Î. On the other hand, since q's initial interval I_{q,2} contains q's interval Ĵ,
and since Ĵ.lb is under I.lb, we know that I_{q,2}.lb ≤ Ĵ.lb < I.lb. Thus, R_{p,2}.lb < I.lb
and I_{q,2}.lb < I.lb, so we know that q will drop its lower bound below I.lb at the end
of phase 2, and that q's lower bound will remain below I.lb until q splits up to an
interval with a lower bound at or above I.lb. However, since Ĵ.lb < I.lb, we know that
q's lower bound is still below I.lb at the end of phase j - 1, so it is impossible for
q to have chopped up from an interval Ĵ containing I in phase j - 1 to an interval J
contained in I in phase j, a contradiction. □
These two claims give us the tools we need to prove that |C| ≤ |E| is an invariant.
We prove this invariant by defining the condition

    Y_ℓ: |C_{e,p,r}| ≤ |E_{e,p,r}| in all executions e, for all processors p and all
         generation k phases r = 2, ..., ℓ,

and then proceeding by induction on ℓ ≥ 2 to prove that Y_ℓ holds for all ℓ. Fix
some execution e and processor p, and let I, R, E, and C denote I_{e,p,ℓ}, R_{e,p,ℓ}, E_{e,p,ℓ},
and C_{e,p,ℓ}.
As the basis of our induction, we show that the invariant is true initially. We actually
prove two results. The first concerns the simple case where p's range contains some
other processor's initial range, and the second concerns the more common case where
p's interval (which is bigger than the range) contains some other processor's initial
interval.
Claim 15. If R contains some processor q's initial range R_{q,1}, then |C| ≤ |E|.

Proof. Let r be a processor in C. This means that r sent an interval intersecting I
to p in phase ℓ, and therefore that r sent a message to q in phase 0. Since the size
of R_{q,1} is exactly the number of processors sending to q in phase 0, it follows that
|C| ≤ |R_{q,1}| ≤ |R|. Finally, |R| ≤ |E| since R ⊆ E. □
Claim 16. If I contains some processor q's initial interval I_{q,2}, then |C| ≤ |E|.

Proof. We will prove that either R_{p,1} ⊆ R or R_{q,1} ⊆ R, depending on R_{p,1}'s upper
bound, and in either case we will be done by Claim 15.
Suppose R_{p,1}.ub < I.lb. This case can never arise, since the upper bound of p's range
never increases, and since p never splits or chops to an interval above its upper bound.
Suppose R_{p,1}.ub ∈ I. If we also have R_{p,1}.lb ∈ I, then R_{p,1} ⊆ I which implies R_{p,1} ⊆ R
and we are done, so suppose that R_{p,1}.lb < I.lb. In this case, we know that R_{q,2}.lb ≤
R_{p,1}.lb < I.lb since q lowers its lower bound to R_{p,1}.lb or lower at the end of phase 1
before choosing its new range and interval for phase 2. This means that R_{q,2} ⊈ I, but
this in turn leads to the contradiction I_{q,2} ⊈ I since R_{q,2} ⊆ I_{q,2}, so this case can never
arise.
Suppose R_{p,1}.ub > I.ub. Since the upper bound of p's initial range is above I, we
know that p has split down at least once, and hence that R.ub = I.ub by Fact 11.
Furthermore, since p lowered its lower bound to R_{q,1}.lb or lower at the end of phase
1 before choosing its new range and interval for phase 2, we know that p's lower
bound will remain R_{q,1}.lb or lower until p splits up to an interval with a lower bound
above R_{q,1}.lb. However, since R_{q,1} ⊆ R_{q,2} ⊆ I_{q,2} ⊆ I, we have

    R.ub = I.ub ≥ R_{q,1}.ub ≥ R_{q,1}.lb ≥ R.lb,

so R_{q,1} ⊆ R. □
As for the inductive step itself, if I does not contain the initial interval of any
processor, then all of p’s competitors have split or chopped into I. The next result
concerns the chopping case. It says that if I is p’s interval and if any processor q has chopped into I at any time in the past - regardless of whether p and q are now
competitors - then the invariant is preserved. It is a strong statement that chopping
quickly brings distinct intervals and ranges into synch.
Claim 17. Suppose Y_{ℓ-1} is true. If any processor has chopped into I by phase ℓ,
then |C| ≤ |E|.
Proof. Suppose I is p's initial interval, or contains any other processor's initial interval.
Then we are done by Claim 16.
Suppose p itself chops from Î to I in phase i ≤ ℓ. Since p chopped its interval
from Î to I in phase i, it follows that C ⊆ C_{p,i-1} by Claim 13. Since p does not
split its interval in phases i through ℓ, we know that E_{p,i-1} ⊆ E. Consequently, since
i - 1 ≤ ℓ - 1, it follows from Y_{ℓ-1} that |C| ≤ |C_{p,i-1}| ≤ |E_{p,i-1}| ≤ |E|, as desired.
Suppose p splits from Î to I, and that some other processor q chops from Ĵ ⊇ I to
J ⊆ I in some phase j ≤ ℓ. Since q chopped from Ĵ to J, it follows that C ⊆ C_{q,j-1}
by Claim 13. In addition, since q chopped from Ĵ to J, we know that E_{q,j-1} ⊆ J ⊆ I.
Finally, we know that I = E by Claim 14 since j - 1 ≤ ℓ - 1. It therefore follows from
Y_{ℓ-1} that |C| ≤ |C_{q,j-1}| ≤ |E_{q,j-1}| ≤ |E|, as desired. □
The difficult cases, therefore, are the cases in which p and all its competitors split
from Î to I. The case of splitting down is easy, but the case of splitting up is difficult. In
fact, understanding how to choose and manipulate ranges to make the case of splitting
up go through is the most important way in which the increment register algorithm
differs from the strong renaming algorithm it is based on.
Claim 18. Suppose Y_{ℓ-1} is true. If p and all its competitors have split down to I
by phase ℓ, then |C| ≤ |E|.
Proof. Let q be the greatest competitor in C. This means that q is the greatest processor
to send an interval contained in I to p in phase ℓ. Consider the phase j in which
q split from Î to I, and notice that C ⊆ C_{q,j-1} by Claim 13. Since q is the greatest
processor in C and since q split down from Î to I, processor q found that all processors
in C ⊆ C_{q,j-1} could choose distinct values from the bottom half of its extended range
E_{q,j-1}, where the bottom half of its extended range just happens to be R_{q,j} = I ∩ E_{q,j-1}.
Consequently, |C| ≤ |R_{q,j}|. We will now show that R_{q,j} ⊆ E, and it will follow that
|C| ≤ |E|, as desired. First, notice that R_{q,j} ⊆ I_{q,j} = I. Next, notice that I.ub = E.ub by
Fact 11 since p split down, so R_{q,j}.ub ≤ I.ub = E.ub. Finally, notice that j ≤ ℓ and that
p lowers its lower bound as much as possible every phase, so E.lb ≤ R_{q,j}.lb by the
end of phase ℓ. It follows that R_{q,j} ⊆ E as desired. □
Finally, let us consider the tricky case of splitting up.
Claim 19. Suppose Y_{ℓ-1} is true. If p and all its competitors have split up to I by
phase ℓ, then |C| ≤ |E|.
Proof. Let q be the least competitor in C. This means that q is the least processor to
send an interval contained in I to p in phase ℓ. Consider the phases i ≤ ℓ and j ≤ ℓ in
which p and q split up from Î to I, respectively. Notice that since p and q split their
intervals at the ends of phases i - 1 and j - 1, Claim 13 implies that C ⊆ C_{p,i-1} and
C ⊆ C_{q,j-1}.
Suppose that i < j (the case with j ≤ i is similar, and easier). Let e' be the execution
differing from e only in that in each phase k ≥ i - 1 of e' the processors p and q receive
messages from exactly the same set of processors that p receives messages from in
the corresponding phase of e. Notice that this does not change the set of messages p
receives in phase i - 1, and hence does not change the fact that p splits up to I in
phase i, but it might change the messages and splitting of q.
First, consider the lower bounds of the extended ranges that p and q use when they
decide to split up at the end of phases i - 1 and j - 1 in e. At the end of phase i - 1,
processor p first computes the lower bound lb_{e,p,i-1} and then uses this lower bound to
set the lower bound of its extended range E_{e,p,i-1} to the maximum of lb_{e,p,i-1} and the
lower bound of Î. It then broadcasts lb_{e,p,i-1} to q in phase i ≤ j - 1. Consequently, at the
end of phase j - 1, processor q sets lb_{e,q,j-1} to lb_{e,p,i-1} or lower, and uses this lower
bound to set the lower bound of its extended range E_{e,q,j-1} to the maximum of lb_{e,q,j-1}
and the lower bound of Î. In other words, E_{e,p,i-1}.lb ≥ E_{e,q,j-1}.lb. In fact, the construc-
tion of e' from e guarantees that E_{e',q,i-1}.lb = E_{e',p,i-1}.lb = E_{e,p,i-1}.lb ≥ E_{e,q,j-1}.lb.
Next, consider the set of competitors for p and q in e. It follows from Claim 13
that C_{e,q,j-1} ⊆ C_{e,p,i-1}. In fact, from the construction of e' from e, it follows that
C ⊆ C_{e,q,j-1} ⊆ C_{e,p,i-1} = C_{e',p,i-1} ⊆ C_{e',q,i-1}.
The conclusion of this little exercise is that at the end of phase i - 1 in e' processor
q's lower bound is as high and its set of competitors is as large as at the end of phase
j - 1 in e. Since q splits up at the end of phase j - 1 in e, it will split up at the end
of phase i - 1 in e', assuming its interval Î is maximal among the intervals it receives
in e'. It must be maximal, however, because q receives precisely the same intervals
in phase i - 1 of e' as p does, and p splits up. In fact, p and q assign the same
values to the same processors at the end of phase i - 1 of e'. It follows from Y_{ℓ-1}
that p and q can assign distinct values from E_{e',p,i-1} and E_{e',q,i-1} to all processors in
C_{e',p,i-1} = C_{e',q,i-1}, and we have already argued that they do so in precisely the same
way. Since q is the smallest processor in C ⊆ C_{e',q,i-1} and q splits up, this means that
both p and q can find values for all processors in C in the top halves of their extended
ranges. Since the top half of p's extended range is E - remember that upper bounds
never change - it follows that |C| ≤ |E|, as desired. □
Putting all of this together, we have our invariant:
Lemma 20. |C| ≤ |E|.
Proof. We proceed by induction on ℓ ≥ 2 to prove that Y_ℓ is true for all ℓ.
First suppose ℓ = 2. Since I_{p,2} itself is an initial interval contained in I_{p,2}, it follows
by Claim 16 that |C_{p,2}| ≤ |E_{p,2}|.
Now suppose ℓ > 2 and Y_{ℓ-1} is true. If I contains some processor's initial interval,
then we are done by Claim 16. If some processor chops into I, then we are done
by Claim 17. If all processors split from Î to I, then we are done by Claims 18
and 19. □
Using this invariant and Lemma 10 we can prove that our implementation is correct:
Theorem 21. ℛ is a linearizable, wait-free implementation of an increment register.
Proof. First, notice that all nonfaulty processors choose a value. This follows from the
fact that, given any phase i of any generation k, all generation k intervals of maximal
size in phase i will either split or chop at the end of phase i or i + 1, meaning that the
size of the maximal generation k interval decreases by a factor of at least two with
every two rounds. Thus, eventually all generation k processors will have intervals of
size 1 and choose a value.
Second, notice that two processors always return distinct values. If p and q are of
distinct generations, then the result follows by Lemma 10. If p and q are of the same
generation and both return the same value v, then they both have the same extended
range E consisting of the single value v and they both have the same set of competitors
C consisting of p and q, but |C| = 2 > 1 = |E|, violating the invariant |C| ≤ |E|.
Third, we need to show that if p chooses the value v, then there are at least v - 1
other processors of p's generation or earlier, which, if they decide, decide on values
below v. Consider the highest upper bound U chosen by any processor q in p's
generation. There are at least U processors of p's generation or earlier which, if they
decide, decide on values less than or equal to U. Therefore there are at least v - 1 of
these processors which decide on values w < v if they decide at all.
Finally, it follows from Lemma 10 that the algorithm is linearizable. □
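(As a sanity check, the failure-free simulation sketched after the algorithm description exhibits exactly this behavior: run_generation(8), for instance, returns eight distinct values 0 through 7, consistent with the distinctness argument above in that restricted crash-free setting.)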
5.2.2. Time complexity
We now show that increment operations halt in O(logc) rounds, where c is the
number of concurrent operations. Technically speaking, a failed operation is concurrent
with (or overlaps) every following operation, so c can grow artificially large. Fortu-
nately, we can prove a tighter bound, depending on a set of concurrent operations that
is generally a much smaller set.¹ Our algorithm has the nice property that the invoca-
tion of an increment operation delays at most one generation. If the invoking processor
is nonfaulty, then the increment delays its own generation. If the invoking processor
is faulty, then it may delay a later generation, but it will delay at most one. In fact,
we can identify exactly which generation an operation delays.
For each generation k, we define the active set of processors, namely those processors
or invocations that contribute to the generation’s running time. We show that the largest
range chosen by any generation k processor is bounded in size by the size of the active
set, and we show that a generation halts in time logarithmic in the size of the largest
range. From this it follows that all generation k increment operations halt in time
log c_k, where c_k is the size of the active set for generation k.
Active sets. We begin by defining active_k, the active set of processors for generation k.
Loosely speaking, the active set for generation k consists of all processors that
the “good” processors learn about for the first time in round k. Remember that all
processors choose their initial range at the end of phase 0, exchange their ranges,
and then choose their initial intervals at the end of phase 1 based on the ranges they
receive. The “good” processors for generation k are the generation k processors that
survive these initialization phases and begin broadcasting intervals.
Let gen_k be the set of generation k processors. Formally, we define good_k to be the
set of generation k processors that are nonfaulty in phases 0 and 1 of generation k
(that is, they do not fail in rounds k and k + 1). For any good processor p, the set
of processors that p has learned about in the first k rounds is exactly the value of its
set IncSet at the end of round k, which we denote by IncSet_{p,k}. The set known_k of
all processors the good processors know about at the end of round k is given by

    known_k = union of IncSet_{p,k} over all p ∈ good_k,

and the set active_k of all processors that the good processors learn about for the first
time in round k is

    active_k = known_k - known_{k-1}

(where "-" denotes set difference).
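The bookkeeping behind these definitions is mechanical. A Python sketch (our names: incsets[k][p] stands for IncSet_{p,k}, good[k] for good_k, and nothing is assumed known before round 1):

    # active_k = known_k - known_{k-1}, where known_k is the union of the
    # IncSets held by the good generation k processors at the end of round k.
    def active_sets(incsets, good, last_round):
        known = {0: set()}
        active = {}
        for k in range(1, last_round + 1):
            known[k] = set().union(*(incsets[k][p] for p in good[k]))
            active[k] = known[k] - known[k - 1]
        return active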
It is clear that the set of known processors is nondecreasing:
¹ This does not mean that our algorithm runs faster than the Ω(log c) worst-case lower bound, because
these two sets are equal in that single worst-case execution.
Claim 22. known_{k-1} ⊆ known_k for all k.

Proof. If q ∈ known_{k-1}, then q ∈ IncSet_{p,k-1} for some p ∈ good_{k-1}. This means that p
survived phase 1 of generation k - 1 and successfully broadcast its set IncSet_{p,k-1} to
all processors in that phase. Since phase 1 of generation k - 1 is phase 0 of generation
k, this means that IncSet_{p,k-1} ⊆ IncSet_{r,k} at the end of phase 0 of generation k for
all good processors r ∈ good_k in generation k. Thus, q ∈ IncSet_{p,k-1} ⊆ known_k, and it
follows that known_{k-1} ⊆ known_k. □
Using this observation, we can show that the set active_k has two desirable properties:
every nonfaulty generation k processor belongs to active_k, and every processor belongs
to at most one set active_k.
Claim 23. good_k ⊆ active_k for all k, and active_j ∩ active_k = ∅ for all j ≠ k.

Proof. First, notice that if p ∈ good_k then p is a generation k processor that survives
phase 0 of generation k and adds p to its own set IncSet_{p,k}. Notice also that a generation
k processor cannot appear in any set IncSet_{q,j} for any generation j processor q, where
j < k. It follows that p ∈ good_k implies p ∈ known_k - known_{k-1} = active_k.
Next, the remainder of the claim follows immediately from the fact that known_j ⊆
known_{k-1} for all j ≤ k - 1, which follows from Claim 22. □
Maximal range. For each generation k, we can bound the size of the ranges sent
by good processors with active_k. Since we are trying to bound the execution time of
generation k increments, we need only consider the ranges of the good processors,
since all other processors fail by the end of phase 1.
Consider the largest range a good processor p can send. Every processor p chooses
upper and lower bounds u_p and l_p at the end of phase 0, and then never raises its lower
bound without splitting or chopping up to a smaller interval and range. At any given
time, a processor p's lower bound is the minimum of the lower bounds l_q chosen by
some subset of the generation k processors. In the worst case, a good processor p's
largest range R_{p,i} is contained in max_range_k = [lb_k, ub_k], where

    ub_k = max{u_p : p ∈ good_k},
    lb_k = min{l_p : p ∈ gen_k}.

In other words,

Claim 24. R_{p,i} ⊆ max_range_k for every good processor p ∈ good_k and every phase i.
The next result shows that the size of max_range_k is bounded by the size of active_k,
and hence so is the size of any range used by any good processor in generation k.

Claim 25. |max_range_k| ≤ |active_k|.
Proof. We prove that |known_k| ≥ ub_k + 1 and that |known_{k-1}| ≤ lb_k, and it follows that

    |max_range_k| = ub_k - lb_k + 1
                 ≤ |known_k| - |known_{k-1}| = |known_k - known_{k-1}| = |active_k|

since known_{k-1} ⊆ known_k by Claim 22.
To prove |known_k| ≥ ub_k + 1, consider the good processor p ∈ good_k with the
maximum upper bound u_p = ub_k at the end of phase 0. Processor p chose u_p to be
|IncSet_{p,k}| - 1 and IncSet_{p,k} ⊆ known_k, so |known_k| ≥ ub_k + 1.
To prove |known_{k-1}| ≤ lb_k, consider the processor p ∈ gen_k with the minimum lower
bound l_p = lb_k at the end of phase 0. Since all good processors q ∈ good_{k-1} survive
phases 0 and 1 of generation k - 1, they all send their sets IncSet_{q,k-1} to p during
round k (that is, during phase 1 of generation k - 1), so p has heard of all processors
in known_{k-1} before it sets its lower bound l_p at the end of round k (that is, during
phase 0 of generation k). Consequently, |known_{k-1}| ≤ lb_k. □
Running time analysis. For each generation k, we can bound the size of intervals
sent by good processors with max_range_k. Consider any telescoping chain

    I_1 ⊃ I_2 ⊃ ... ⊃ I_l

of intervals sent during phase 2, where I_i strictly contains I_{i+1}, and suppose the se-
quence is of maximal length. Since I_1 is maximal, we know that it will split in half
immediately at the end of phase 2, leaving I_2 ⊃ ... ⊃ I_l as a maximal chain. We now
prove that the size of I_2 is roughly the size of max_range_k. Since the size of the
maximal interval reduces by half in each round, the running time is clearly logarithmic
in the size of the largest interval, and it will follow that the running time is roughly
logarithmic in |max_range_k| ≤ |active_k|.
Claim 26. Given any sequence of intervals I_1 ⊃ I_2 ⊃ ... ⊃ I_l sent in phase 2 of gener-
ation k, we have |I_2| ≤ 2|max_range_k|.
Proof. Since the intervals I_i in the chain are sent in phase 2, they are sent by good
processors in good_k (processors surviving phases 0 and 1), and their ranges R_i are
contained in max_range_k by Claim 24. The upper and lower bounds R_1.ub and R_1.lb
of R_1 are clearly in the top half and bottom half of I_1, respectively. Since I_2 is strictly
contained in I_1, we know that I_2 is either in the top or bottom half of I_1. We consider
the two cases separately.
Suppose I_2 is in the top half of I_1. Then since the lower bound R_1.lb of R_1 at the
end of phase 1 is in the bottom half of I_1, the lower bound R_2.lb of R_2 will drop to
the bottom of I_2 at the end of phase 2. Since the upper bound R_2.ub of R_2 is in the
top half of I_2, the range R_2 will span the bottom half of I_2 by the end of phase 2.
This means that |I_2| ≤ 2|R_2| ≤ 2|max_range_k|.
Suppose I_2 is in the bottom half of I_1. This means that at the end of phase 1, the
lower bound R_2.lb of R_2 is in the bottom half of I_1. At the end of phase 2, therefore,
the lower bound R_1.lb of R_1 will be in the bottom half of I_2. Since the upper bound
R_1.ub of R_1 is in the top half of I_1, it follows that R_1 will span the top half of I_2, so
that |I_2| ≤ 2|R_1| ≤ 2|max_range_k|. □
Combining these results, we are done:
Theorem 27. Every generation k increment operation completes within
O(log |active_k|) rounds.
Proof. By Claim 26, after phase 2 starts, for every telescoping chain of intervals, the
set of active processors is guaranteed to be at least half of the size of the second
interval in the chain, so as soon as the first interval splits or chops (which will
happen immediately since it is immediately maximal), the chain will disappear in time
logarithmic in the number of active processors. □
6. Conclusion
This paper represents an additional step toward understanding the round complexity
of problems in synchronous message-passing systems. We observed that as long as
processors remain in order-equivalent states, they cannot solve any problem that re-
quires processors to take distinct actions. We then showed that any comparison-based
protocol has an execution in which order-equivalence is preserved for log c rounds
with c participating processors, and that this bound is tight. This lower bound on
order-inequivalence yields the best-known lower bounds for a variety of concurrent
objects, including increment registers, ordered sets, and related data types, as well as
for decision tasks such as strong renaming.
We have also seen that this logarithmic bound separates protocols that can and cannot
solve nontrivial problems: we have seen examples of two nontrivial problems, the strong
renaming task and increment register objects, that have solutions with complexity lying
at exactly this boundary. These implementations are substantially more efficient than
an O(n) general-purpose algorithm using consensus, especially since the degree of
concurrency c itself is typically much less than n, the total number of processors.
A second interesting aspect of our construction is that our optimal increment register
implementation is based on our optimal solution to strong renaming, although these two
problems might seem quite different at first glance. Concurrent object implementations
are usually more difficult than solutions to decision tasks. Unlike decision tasks, where
processors start simultaneously, compute for a while, and halt with their outputs, con-
current objects have unbounded lifetimes during which they must handle an arbitrary
number of operations, these operations can be invoked at any time, and the order in
which operations are invoked is often important.
We can now draw a more complete picture of the complexity hierarchy for this
model. We have shown here that a logarithmic number of rounds is the minimum
necessary to solve nontrivial problems. It is known that a linear number of rounds
is the most needed to solve such problems (by reduction to consensus). In between,
it is known that the k-set agreement task [3] requires ⌊n/k⌋ + 1 rounds, well above
the logarithmic lower bound for strong renaming, but less than the n + 1 bound for
consensus. Little is known about other sublinear problems in this model.
Finally, we observe that our lower bound on order-inequivalence in the synchronous
model translates into a similar bound in the semi-synchronous model as well. In this
model, processors take steps at a rate bounded from above and below by constants, and
message delivery times vary between 0 and d. Our lower bound for order-inequivalence
translates into an immediate Ω(d log c) lower bound in the semi-synchronous model.
It would be interesting to see if that bound could be improved.
Acknowledgements
Early versions of these results were originally published in [4, 13]. We thank three
anonymous referees for their helpful and numerous comments on earlier drafts of this
paper. The first author was supported in part by NSF grant CCR-93-08103.
References
[1] H. Attiya, A. Bar-Noy, D. Dolev, D. Keller, D. Peleg, R. Reischuk, Achievable cases in an asynchronous
environment, in: Proc. 28th IEEE Symp. on Foundations of Computer Science, October 1987,
pp. 337-346.
[2] H. Attiya, A. Bar-Noy, D. Dolev, D. Peleg, R. Reischuk, Renaming in an asynchronous environment,
J. ACM 37 (3) (1990) 524-548.
[3] S. Chaudhuri, M. Herlihy, N. Lynch, M.R. Tuttle, A tight lower bound for k-set agreement, in: Proc.
34th IEEE Symp. on Foundations of Computer Science, 1993, pp. 206-215.
[4] S. Chaudhuri, M.R. Tuttle, Fast increment registers, in: G. Tel, P. Vitanyi (Eds.), Proc. 8th Internat.
Workshop on Distributed Algorithms, Lecture Notes in Computer Science, vol. 857, Springer-Verlag,
Berlin, 1994, pp. 74-88.
[5] M.J. Fischer, N.A. Lynch, A lower bound for the time to assure interactive consistency, Inform. Process.
Lett. 14 (4) (1982) 183-186.
[6] M.J. Fischer, N.A. Lynch, M.S. Paterson, Impossibility of distributed consensus with one faulty
processor, J. ACM 32 (2) (1985) 374-382.
[7] G.N. Frederickson, N.A. Lynch, Electing a leader in a synchronous ring, J. ACM 34 (1) (1987) 98-115.
[8] J.Y. Halpern, R. Fagin, Modelling knowledge and action in distributed systems, Distrib. Comput. 3 (4)
(1989) 159-179.
[9] J.Y. Halpern, Y. Moses, Knowledge and common knowledge in a distributed environment, J. ACM
37 (3) (1990) 549-587.
[10] M.P. Herlihy, Wait-free synchronization, ACM Trans. Programming Languages Systems 13 (1) (1991)
124-149.
[11] M. Herlihy, Randomized wait-free concurrent objects, in: Proc. 10th Annu. ACM Symp. on Principles
of Distributed Computing, 1991, pp. 11-22.
[12] M.P. Herlihy, N. Shavit, The asynchronous computability theorem for t-resilient tasks, in: Proc. 25th
ACM Symp. on Theory of Computing, 1993, pp. 111-120.
[13] M.P. Herlihy, M.R. Tuttle, Wait-free computation in message-passing systems: preliminary report, in:
Proc. 9th Annu. ACM Symp. on Principles of Distributed Computing, 1990, pp. 347-362.
[14] M.P. Herlihy, J.M. Wing, Linearizability: A correctness condition for concurrent objects, ACM Trans.
Programming Languages Systems 12 (3) (1990) 463-492.
[15] L. Lamport, Time, clocks, and the ordering of events in a distributed system, Comm. ACM 21 (7)
(1978) 558-564.
[16] L. Lamport, The part-time parliament, Technical Report 49, DEC Systems Research Center, September
1989.
[17] L. Lamport, R. Shostak, M. Pease, The Byzantine generals problem, ACM Trans. Programming
Languages and Systems 4 (3) (1982) 382-401.
[18] Y. Moses, M.R. Tuttle, Programming simultaneous actions using common knowledge, Algorithmica
3 (1) (1988) 121-169.
[19] M. Pease, R. Shostak, L. Lamport, Reaching agreement in the presence of faults, J. ACM 27 (2) (1980)
228-234.
[20] F.B. Schneider, Implementing fault-tolerant services using the state machine approach: A tutorial,
Technical report, Cornell University, Computer Science Department, November 1987.
[21] E. Styer, G.L. Peterson, Tight bounds for shared memory symmetric mutual exclusion problems, in: Proc.
8th Annu. ACM Symp. on Principles of Distributed Computing, August 1989, pp. 177-192.