comput. complexity 3 (1993), 19-30    1016-3328/93/010019-12 $1.50+0.20    © 1993 Birkhäuser Verlag, Basel

PARALLEL POINTER MACHINES

STEPHEN A. COOK AND PATRICK W. DYMOND

Abstract. The parallel pointer machine is a synchronous collection of finite state transducers, each transducer receiving its inputs via pointers to the other transducers. Each transducer may change its input pointers dynamically by "pointer jumping". These machines provide a simple example of a parallel model with a time-varying processor interconnection structure, and are sufficiently powerful to simulate deterministic space S(n) within time O(S(n)).

Subject classifications. 68Q10.

Existing theoretical models of synchronous parallel computers can be divided into classes based on the power of the communication capabilities provided. One basic class consists of models in which the communication connections between processors are bounded and fixed throughout the computation. Models such as conglomerates [13], uniform circuits and aggregates [9] fall into this first class. A second class contains parallel models in which processors may communicate in one step with any of an unbounded number of other processors (e.g., via a shared global memory) at any step of the computation. SIMDAGs [13] and various other kinds of parallel random access machines (PRAMs) [12], [27], [10] fall into this second class, because individual processors in these PRAM models have the ability to read from and write to global memory locations. Using indirect addressing, any desired communication pattern can be realized and varied as the computation proceeds.

Goldschlager [13], Borodin [2], Chandra and Stockmeyer [4] and others observed close relationships between time or depth on parallel machines and space on sequential machines. For a given model X from the first group there are typically inclusions of the form

DSPACE(T) ⊆ X-TIME(T²) ⊆ DSPACE(T²)    (1)

whereas for models in the second group the corresponding inclusions are

DSPACE(T) ⊆ X-TIME(T) ⊆ DSPACE(T²).    (2)


Throughout this paper we assume T(n) : ℕ → ℕ is Ω(log n). We use the notation X-TIME(T) to denote the class of sets over the alphabet {0, 1} recognized by machines of type X within time O(T(n)), and DSPACE(T) to denote the class of sets recognized by deterministic Turing machines in space T(n) (see [19]). Similarly, UDEPTH(T) denotes the class of sets recognized by bounded fan-in uniform circuit families using and, or and not gates in depth O(T(n)). See [25] for a discussion of uniform circuits.

The two types of inclusions described above indicate an apparent difference in power between the two classes of parallel machines. Time on machines of the second class is able to simulate sequential space without loss, whereas machines in the first class use a quadratic increase in time to simulate space. Another distinguishing feature of the models in the first class is that each processing element is finite state, whereas in the second class, the individual processors are random access machines with an unbounded local memory and wordsize. The models satisfying (2) use powerful instructions to obtain their linear time simulation of deterministic space, instructions which would seem to require a very complex physical realization. For example, in SIMDAGs, the "store" instruction requires that any of an unbounded number of processors be able to store into any of an unbounded number of global memory locations in a single step. Because of this huge implicit fan-in, Goldschlager introduced a variant of the SIMDAG model, in which the time charged for each instruction is proportional to the time which would be needed to simulate its execution on a (fixed structure) conglomerate. The resulting model satisfies (1) rather than (2).

Cook [5] proposed investigation of an intermediate model, which would provide communication access to only a bounded number of processors at any one time, but which would allow varying the processors for which connection is provided as computation proceeds. We define a specific example of this idea in "parallel pointer machines" (PPMs). Some of our results about PPMs appeared in preliminary form in [6] and [8], where they were called "hardware modification machines" (HMMs) in analogy with the sequential storage modification machines earlier described by Schönhage [26]. (Similar sequential pointer machines have been considered by Kolmogorov and Uspenskii [21], as "linking automata" by Knuth [20], and as "reference machines" by Tarjan [28].)

Many shared memory parallel algorithms work mainly by storing and updating pointers to global memory. The parallel pointer machine is well-suited to expressing this aspect of parallel algorithms. The PPM model is intended to examine the complexity classes defined by collections of finite state synchronous machines, each of which can rearrange its communication links by a bounded amount in one step. We show that PPMs are surprisingly powerful, falling more closely into the second class of models defined above by satisfying inclusions (2). They may be in some sense the simplest class of parallel machines which do so. Any technology allowing the realization of any of the above-mentioned models satisfying inclusion (2) would also allow the construction of PPMs. This follows from the observation that there is a straightforward linear time and processor simulation of PPMs by PRAMs, where the PRAM allocates one of its processors and a block of global memory cells for each of the PPM's processing units. Each processor in the PRAM simulates the moves of its unit, using the blocks in global memory to provide access to the processing unit's state and pointer information.

Barzdin and Kalnin'sh [1] have previously considered "parallel action automata" from the standpoint of devising a universal machine of this type, although they do not consider complexity classes related to time and space. Hong [17] describes polynomial simulations between our PPM model and other parallel machine models, in which both time and hardware resources are simultaneously related; this provides further evidence for the "extended parallel computation thesis" in [6]. Hong also considers nondeterministic variants of these machines. Nondeterministic PPMs are also considered in [7]. Lam and Ruzzo [22] have established a very close simultaneous correspondence between time-hardware resources on PPMs and the same resources on a restricted PRAM model. Their PRAM is restricted in two ways: first, global memory satisfies a concurrent-read owner-write or CROW protocol [10] in which every cell is written only by its designated owner; second, the arithmetic capabilities of the PRAM are limited to addition of one and doubling. In [14] Goodrich and Kosaraju used the term "parallel pointer machine" for a PRAM model in which values in global memory are of two types, pointer and arithmetic, and in which numeric operations are performed only on arithmetic values, while only pointer values may be used for indexing. Our PPM differs by eliminating arithmetic values and operations. We use the unqualified term PPM to represent our variant of the model in this paper. Actual machines with capabilities for varying processor interconnection patterns have been proposed and built; see for example the descriptions in [11] and [16].

The essential features of PPMs are:

1. Each piece of equipment, called a processing unit or unit, is a finite state transducer, accepting a k-tuple of input symbols, and computing an output symbol based on these symbols and current state. All active units have the same transition function. Each transducer has an ordered set of k input lines, called links or pointers, which can be thought of as taps on the outputs of other units, or as pointers to those units. (We are allowing unbounded fan-out but fan-in is the constant k.) The input symbol at a given step along a link l for a unit Ui, which has its link l pointing to a unit Uj, is the output symbol of Uj at the end of the previous step.

2. The units operate synchronously, and each unit has the ability to adjust the connections of its links in a restricted way. In particular, a unit Ui may in one step redirect one of its links to point to a unit Uj at distance no more than two from it. (That is, Ui must either already have a pointer to Uj or have a pointer to some Ur such that one of Ur's links points to Uj.)

3. Initially a single unit U0 is the only one active. In addition, a fixed input tree structure described below provides U0 with access to the PPM's input. U0 starts in state q0 and its input links are all pointing to itself except for one pointing to the root of the input tree. As the computation proceeds, each active unit can activate other units by specifying that one of its links should be redirected to a new unit, and specifying the initial state and locations for the links of the new unit. The links must be set to units at distance ≤ 1 from the creator. The PPM accepts its input if U0 ever enters its "accept" state.

4. Access to the input word of the PPM is provided by means of a tree structure composed of special inactive units, each with three links which we refer to using labels {L, R, U}. These units are arranged in a complete binary tree of depth ⌈log n⌉. Active units access the bits of the input by following pointers down this tree from the root. The input word w = w1w2...wn, wi ∈ {0, 1}, is stored in the n successive units comprising the leftmost leaves of the tree by having each of these units output either zero or one. (The remaining leaf units each output the blank symbol "B".) Each leaf unit is pointing to its left and right neighbors (using links L and R) and to its parent in the tree (link U). A non-leaf unit in the tree has link U pointing to its parent and links L and R pointing to its left and right child, respectively. These non-leaf units output "L" or "R" depending on whether they are a left child or a right child. The root of the tree outputs "B" and has its link U pointing to U0, which in turn has its first link pointing to the root. The input-tree units never make any transitions.

The input convention described above has been chosen to allow "random access" to the bits of the input, so that sublinear time bounds can be studied. Given a tap on the root, any active unit can find the first bit of the input by moving a pointer down the "L" path from the root using ⌈log n⌉ transitions. Other input bits can be found quickly as well, provided the description of the path is known or can be generated.
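
As a concrete illustration, the following Python sketch builds such an input tree out of ordinary records and reads bit i+1 by walking down from the root along the binary expansion of i. This is our rendering for exposition only, not part of the model's definition: the dictionary representation, the function names, and the choice to have boundary leaves point to themselves are ours.

    import math

    def build_input_tree(w):
        # Complete binary tree of depth ceil(log2 n); units are dicts with an
        # output symbol 'out' and links 'L', 'R', 'U' as in the model.
        # (We force depth >= 1 to keep the sketch simple.)
        n = len(w)
        depth = max(1, math.ceil(math.log2(n)))
        root = {'out': 'B', 'U': None}
        level = [root]
        for _ in range(depth):
            nxt = []
            for node in level:
                node['L'] = {'out': 'L', 'U': node}   # left children output "L"
                node['R'] = {'out': 'R', 'U': node}   # right children output "R"
                nxt += [node['L'], node['R']]
            level = nxt
        for i, leaf in enumerate(level):              # leftmost leaves hold w
            leaf['out'] = w[i] if i < n else 'B'
            leaf['L'] = level[i - 1] if i > 0 else leaf               # left neighbour
            leaf['R'] = level[i + 1] if i + 1 < len(level) else leaf  # right neighbour
        return root, depth

    def read_bit(root, i, depth):
        # Follow one link per level, as an active unit moving a pointer would;
        # each bit of the (0-indexed) position selects link L (0) or R (1).
        node = root
        for b in format(i, '0%db' % depth):
            node = node['L'] if b == '0' else node['R']
        return node['out']

    root, depth = build_input_tree('10110')
    assert read_bit(root, 0, depth) == '1'   # w1
    assert read_bit(root, 4, depth) == '0'   # w5
    assert read_bit(root, 6, depth) == 'B'   # a blank leaf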

Because the input tree units themselves are not active and make no transitions, they are not counted when we consider hardware on PPMs. It is of course clear that linear hardware is always required to represent inputs, but it is interesting to consider how much additional active hardware is required for particular computations, in the same way that we consider sublinear space bounds for deterministic Turing machines. The convention that the input units make no transitions corresponds to the Tm convention of read-only input. This input convention has no significant effect in cases where at least linear time and linear hardware are used, since the input tree could be created within these resource bounds from any reasonable input convention.

A PPM accepts an input w in time t if, in a computation from the initial state with input w, U0 enters the accepting state before it has made more than t transitions. Similarly, a PPM accepts w using hardware h if, in a computation from the initial state with input w, U0 enters the accepting state with no more than h units active (not counting the input tree units). We say Z accepts a set A ⊆ {0,1}* if for all w, Z accepts w if and only if w ∈ A. A PPM Z accepts a set A in time T(n) (or in hardware H(n)) if Z accepts A and for every w ∈ A, |w| = n, Z accepts w in time T(n) (respectively, accepts w in hardware H(n)). The complexity classes defined by hardware and time bounds for PPMs are PPM-TIME(T) and PPM-HARDWARE(H). We have chosen our definition of acceptance within a time or space bound so that non-accepting computations need not be bounded. If we instead required working within the resource bounds on all inputs, accepted or not, then we would also add constructibility assumptions to the hypotheses of our theorems. Defined in the way chosen here, PPMs accept exactly the recursively enumerable sets.

THEOREM 1. PPM-HARDWARE(H(n)) = DSPACE(H(n) · (log n + log H(n))).

PROOF. A deterministic Tm M simulating a PPM Z with h active units keeps a list of h records, one for each (non-input-tree) unit of Z. Each record consists of the unit's current state and output symbol and, for each of the unit's links, an integer of O(log n + log h) bits indicating the unit to which that link is pointing. (A negative integer can be used to represent that the link is pointing to one of the O(n) input tree units, and a positive integer i between 1 and h indicates the link is pointing to the active unit represented by the i-th record in the list.) M can obtain the output symbols of input tree units by consulting its input tape, and has the states and output symbols of all other units. Thus it can make a new list, representing the state of Z after another parallel step has taken place. The space required for this simulation is O(h(log h + log n)). No assumption of constructibility of the function H(n) is required for this simulation, since instead of knowing H(n), M can just add new records as Z adds new units.

Conversely, a PPM Z simulating an H(n)(log H(n) + log n) space Tm M does so using only h = O(H(n)) units by encoding extra information using the units' pointers. For simplicity of exposition, we will suppose that M has, besides its input tape, a single (binary) worktape of length h log h, where h is a power of 2. We will consider this worktape to be divided into h blocks of length log h. The simulating PPM Z contains a tree of O(h) units with h leaves, organized so that (1) by starting at any of the leaves and following the pointers of the units up to the root a unique sequence of log h bits can be recovered; and (2) given a sequence of log h bits, the corresponding leaf unit can be found by following a path of pointers down from the root. The tree is organized in the same way as that of the PPM input tree described at the beginning of this section. Each leaf l codes a sequence over {L, R} of length log h describing the path from l to the root, and a unit pointing to l can recover the sequence by following the U pointers of the tree units from l up to the root. Now Z can represent the h log h bits in the worktape of M by forming a doubly linked list of h units, where the i-th unit represents a block of log h bits by having a link point to the appropriate leaf. Note that the input tree can also be used to store block information, so that actually a worktape of length h(log h + log n) can be represented. No assumption of constructibility of H(n) is needed, since successively larger values for h can be tried until one works. □
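
The block-to-pointer coding admits a compact illustration. The sketch below is our own, in the same dictionary style as the earlier one (the helper names are ours); it shows that a single link to a leaf encodes log h bits, recovered by walking the U links up to the root.

    import math

    def build_pointer_tree(h):
        # Complete binary tree with h = 2**k leaves; every non-root node
        # outputs 'L' or 'R' according to whether it is a left or right child.
        k = int(math.log2(h))
        root = {'out': 'B', 'U': None}
        level = [root]
        for _ in range(k):
            nxt = []
            for node in level:
                node['L'] = {'out': 'L', 'U': node}
                node['R'] = {'out': 'R', 'U': node}
                nxt += [node['L'], node['R']]
            level = nxt
        return root, level          # 'level' is the h leaves, left to right

    def decode(leaf):
        # Recover the log h bits coded by a single pointer to 'leaf' by
        # following U links to the root, reading each node's 'L'/'R' output.
        bits, node = [], leaf
        while node['U'] is not None:
            bits.append('0' if node['out'] == 'L' else '1')
            node = node['U']
        return ''.join(reversed(bits))

    # A worktape of h blocks of log h bits becomes h pointers into one tree:
    root, leaves = build_pointer_tree(8)
    blocks = [0b110, 0b001, 0b111]            # three 3-bit blocks
    pointers = [leaves[b] for b in blocks]    # one link per block, not log h units
    assert [decode(p) for p in pointers] == ['110', '001', '111']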

COROLLARY 2. If H(n) = Ω(n^c) for some c > 0, then PPM-HARDWARE(H) = DSPACE(H log H).

Theorem 1 for PPMs stands in close analogy to results in [9] for uniform aggregates by exhibiting a precise (up to a constant factor) correspondence between hardware and space complexity classes. (Aggregates are similar to bounded fan-in boolean circuits, but allow feedback in a synchronous fashion: the outputs of all gates at a given time step determine the inputs for all gates at the next time step.) For uniform aggregates and constructible resource bounds, there is a linear correspondence between aggregate hardware and Tm space. In the case of PPMs, however, less hardware is needed to simulate a given space bound. We can conclude by the deterministic Turing machine space hierarchy theorem [15] that hardware on these machines with variable intercommunication structure is strictly more powerful than uniform aggregate hardware, in which the intercommunication paths are fixed.

Turning our attention from hardware to PPM time, we now show that the inclusions (2) corresponding to the more powerful machine models apply to PPMs. Since deterministic space is powerful enough to linearly simulate uniform circuit depth [2] and aggregate time [9], it follows from the theorem below that time on PPMs is at least as powerful as time on the other parallel models in the first group.

THEOREM 3. DSPACE(T) ⊆ PPM-TIME(T) ⊆ DSPACE(T²).

PROOF. The first inclusion will follow from Lemma 4. The second inclusion will be shown by means of Lemma 5 below, which establishes that a PPM working in time T can be simulated by an alternating Turing machine [3], [24], [25] in time O(T²). The inclusion follows by the result in [3] giving a linear simulation of ATM time by deterministic space.

LEMMA 4. Suppose a directed graph G with out-degree one (self-loops are allowed) and N nodes is represented in a PPM H by a collection of N units such that the unit representing node i has its input line pointing to unit j if and only if (i, j) is an edge of G. Also suppose that node N's arc points to itself and that the corresponding unit is outputting a special symbol s, marking the sink of the graph. Then for any t, H can determine if there is a path from 1 to N of length ≤ 2^t in t + O(1) steps.

PROOF. The argument uses the now well-known technique of parallel pointer jumping suggested by Wyllie's proof [29] that DSPACE(T) ⊆ PRAM-TIME(T). Each node originally is receiving an input from its immediate (1-step) successor. In a single transition of H each of the units can adjust its input line to point to its (unique) 2-step successor. After t transitions, each of the units will be pointing to its 2^t-step successor. A path from node 1 to node N of length ≤ 2^t exists if and only if the unit representing node 1 is receiving s after t transitions. □
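
The doubling behind Lemma 4 can be seen in a few lines. The following Python sketch is ours (the function name and array representation are illustrative, nodes are 0-indexed, and the sink self-loops); each round performs one synchronous jump for every unit.

    def path_within(succ, start, sink, t):
        # succ[i] is node i's unique successor (out-degree one); the sink must
        # satisfy succ[sink] == sink, playing the role of the absorbing node N.
        assert succ[sink] == sink
        cur = list(succ)                     # after 0 rounds: 1-step successors
        for _ in range(t):
            # One synchronous PPM step: every unit redirects its pointer to
            # its pointer's pointer, doubling the distance it looks ahead.
            cur = [cur[cur[i]] for i in range(len(cur))]
        return cur[start] == sink            # start now sees its 2**t-step successor

    # A chain 0 -> 1 -> ... -> 7 with 7 as the self-looping sink:
    succ = [1, 2, 3, 4, 5, 6, 7, 7]
    assert path_within(succ, 0, 7, 3)        # path of length 7 <= 2**3
    assert not path_within(succ, 0, 7, 2)    # but 7 > 2**2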

The application of the lemma to the proof of the first inclusion of the theorem is as follows: the N nodes of the graph will represent configurations (including input head positions) of M, the machine to be simulated. An arc (i, j) means the immediate successor configuration of i is j, as determined by M's transition function. Node 1 will represent the starting configuration, and node N the unique accepting configuration. A path of length ≤ N from 1 to N exists if and only if M accepts its input. Since N = O(Tnc^T), log N = O(T). Thus in O(T) transitions H can tell whether M accepts.

It remains to show how H sets up the structure representing the configuration graph in O(T) steps. If T is known, this is done by growing a tree of units of depth O(T) with N leaves (cf. Theorem 1). Each leaf determines a path from the root and this path determines a configuration corresponding to the leaf. By working backwards along this path the leaf unit discovers the bits of the configuration it represents and can create a list of units representing the bits of the successor configuration. This list can then be used by the leaf unit as a guide through the original tree to find its successor leaf. To remove the assumption about knowing T, try T = 1, 2, 4, ..., at most quadrupling the total time taken. □
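
Putting Lemma 4 and this construction together, the simulation amounts to tabulating the successor relation on configurations and pointer-jumping on it. Below is a toy sequential rendering of that idea (ours, not the paper's construction): configs, step, start and accept stand in for a concrete machine's data, and path_within is the routine sketched above.

    def tm_accepts(configs, step, start, accept, t):
        # Build the out-degree-one configuration graph: each configuration
        # points to its successor, except the accepting one, which self-loops
        # (the sink of Lemma 4).
        index = {c: i for i, c in enumerate(configs)}
        succ = [index[c] if c == accept else index[step(c)] for c in configs]
        # With N = O(Tnc^T) configurations, t = O(T) jumping rounds suffice.
        return path_within(succ, index[start], index[accept], t)

    # Toy run: configurations are the integers 0..7, the successor adds one,
    # and 7 is the accepting configuration.
    assert tm_accepts(list(range(8)), lambda c: c + 1, 0, 7, 3)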

LEMMA 5. A PPM which works in time T(n) can be simulated by an indexing alternating Turing machine (see [25] for the definition) in time O(T(n)²) and space O(T(n)).

PROOF. We will need to keep track of the various units to be simulated, so we introduce an identification scheme: the original unit's subscript label is 0; the units in the input tree are given numeric labels from 1 to 4n in a uniform way; and all other units are labelled with a string i#j, where i is the label of the unit's creator, and j is the number of steps which have elapsed from the time that the creator itself was created. If a unit has label c = i#j, where j has no occurrence of #, let parent(c) = i, and let sum(c) = the sum of the numbers in c. With this labelling scheme, each active unit has a unique label of length ≤ 2T + 1 and sum(c) ≤ T (even if i and j are written in unary notation). Observe that a unit Uc is activated at time t if and only if t = sum(c). For convenience, we also assume the output symbol of each unit is just its current state, and that U0 stays in an accepting state once entered.
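
For concreteness, here is a small Python rendering of the labelling scheme (ours; writing labels as '#'-separated strings is our choice of representation, and it applies only to the created units, not the numeric input-tree labels).

    def parent(c):
        # Drop the final "#j" component: the remainder labels the creator.
        return c.rsplit('#', 1)[0]

    def label_sum(c):
        # sum(c) in the proof: the sum of all numbers in the label, which is
        # exactly the time at which unit Uc is activated.
        return sum(int(x) for x in c.split('#'))

    c = '0#3#5'   # created 5 steps after its creator '0#3', itself created at time 3
    assert parent(c) == '0#3'
    assert label_sum(c) == 8   # Uc is activated at time t = sum(c) = 8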

The ATM A which simulates a PPM P first guesses T = kT(n). Then it calls the procedure STATE(0, qa, T) to confirm that at time T, U0 is in the accepting state qa. In general, procedure STATE(c, q, t+1) confirms that unit Uc was in a given state q at time t+1 by guessing and recursively verifying appropriate states of the units pointed to by Uc at time t. Checking that these units were appropriately pointing to Uc at time t is done using a complementary procedure CONNECT. Similarly, CONNECT(c, b, m, t) returns true if and only if the m-th input line of Uc is pointing to Ub at time t.


STATE(c, q, t)
    comment: return true iff Uc is in state q at time t

    if t = 0 then return (q = q0 and c = 0)
    elsif 1 ≤ c ≤ 4n
        then ... comment: answer correctly for input tree unit, depending on input
    elsif t = sum(c) + 1   comment: Uc was created in the previous step
        then return true iff ∃ c', q', i1, ..., ik, q1, ..., qk such that
            a. STATE(c', q', t-1) and c' = parent(c)
            b. ∀j, 1 ≤ j ≤ k, STATE(ij, qj, t-1)
            c. ∀j, 1 ≤ j ≤ k, CONNECT(c', ij, j, t-1), and
            d. the transition function of P is such that the action of a unit in state q' receiving inputs q1, ..., qk is to create a new unit in state q
    else   comment: in the general case, t > sum(c) + 1
        return true iff ∃ q', i1, ..., ik, q1, ..., qk such that
            a. STATE(c, q', t-1)
            b. ∀j, 1 ≤ j ≤ k, STATE(ij, qj, t-1)
            c. ∀j, 1 ≤ j ≤ k, CONNECT(c, ij, j, t-1), and
            d. the transition function of P is such that the action of a unit in state q' receiving inputs q1, ..., qk is to transfer to state q

The function CONNECT is similar. Each time either procedure is called the parameter t is reduced by 1, so the maximum depth of recursion is T. Guessing the labels of other units may require space and time O(T) on the ATM; this bounds the time at each level of recursion. Thus the total time is O(T²) and the space used by the ATM is O(T). □

COROLLARY 6. For space constructible functions T(n), PPM-TIME(T) ⊆ UDEPTH(T²) ⊆ DSPACE(T²).

PROOF. The first inclusion follows from Lemma 5 and a theorem of Ruzzo's [24] relating circuit depth and alternating machine time. The last inclusion follows from a theorem of Borodin [2]. □

Since Theorem 1 relates sequential space to PPM hardware and Theorem 3 relates sequential space to PPM time, we obtain the following two relationships.


COROLLARY 7.

PPM-HARDWARE(H) ⊆ PPM-TIME(H(log n + log H)).

PPM-TIME(T) ⊆ PPM-HARDWARE(T²/ log(T + n)).

Acknowledgements

We thank Larry Ruzzo for useful comments and for suggesting the name parallel pointer machine.

This research was supported by the Natural Sciences and Engineering Research Council of Canada.

References

[1] Y. M. BARZDIN AND Y. Y. KALNIN'SH, A universal automaton with variable structure, Automatic Control and Computing Sciences 8 (1974), 9-17. Russian journal title: Avtomatika i Vychislitel'naya Tekhnika.

[2] A. BORODIN, On relating time and space to size and depth, SIAM Journal on Computing 6 (1977), 733-744.

[3] A. K. CHANDRA, D. C. KOZEN, AND L. J. STOCKMEYER, Alternation, Journal of the ACM 28 (1981), 114-133.

[4] A. K. CHANDRA AND L. J. STOCKMEYER, Alternation, in 17th Annual Symposium on Foundations of Computer Science, Houston, TX, 1976, IEEE, 98-108. Preliminary version.

[5] S. A. COOK, Towards a complexity theory of synchronous parallel computation, L'Enseignement Mathématique XXVII (1981), 99-124. Also in [23, pages 75-100].

[6] P. W. DYMOND, Simultaneous Resource Bounds and Parallel Computation, PhD thesis, University of Toronto, 1980. Technical Report 145/80.

[7] P. W. DYMOND, On nondeterminism in parallel computation, Theoretical Computer Science 47 (1986), 111-120.

[8] P. W. DYMOND AND S. A. COOK, Hardware complexity and parallel computation, in 21st Annual Symposium on Foundations of Computer Science, Syracuse, NY, 1980, IEEE, 360-372.

[9] P. W. DYMOND AND S. A. COOK, Complexity theory of parallel time and hardware, Information and Computation 80 (1989), 205-226.

[10] P. W. DYMOND AND W. L. RUZZO, Parallel random access machines with owned global memory and deterministic context-free language recognition, in Automata, Languages, and Programming: 13th International Colloquium, ed. LAURENT KOTT, vol. 226 of Lecture Notes in Computer Science, Rennes, France, 1986, Springer-Verlag, 95-104.

[11] S. E. FAHLMAN, NETL: A System for Representing and Using Real World Knowledge, MIT Press, 1979.

[12] S. FORTUNE AND J. WYLLIE, Parallelism in random access machines, in Proceedings of the Tenth Annual ACM Symposium on Theory of Computing, San Diego, CA, 1978, 114-118.

[13] L. M. GOLDSCHLAGER, A universal interconnection pattern for parallel computers, Journal of the ACM 29 (1982), 1073-1086.

[14] M. T. GOODRICH AND S. R. KOSARAJU, Sorting on a parallel pointer machine with applications to set expression evaluation, in 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, NC, 1989, IEEE, 190-195. Preliminary version.

[15] J. HARTMANIS, P. M. LEWIS, II, AND R. E. STEARNS, Hierarchies of memory limited computations, in Conference Record on Switching Circuit Theory and Logical Design, Ann Arbor, MI, 1965, 179-190.

[16] W. D. HILLIS, The Connection Machine, MIT Press, 1985.

[17] J. W. HONG, On similarity and duality of computation, in 21st Annual Symposium on Foundations of Computer Science, Syracuse, NY, 1980, IEEE, 348-359. Appeared: [18].

[18] J. W. HONG, Similarity and duality in computation, Information and Control 62 (1984), 109-128.

[19] J. E. HOPCROFT AND J. D. ULLMAN, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, 1979.

[20] D. E. KNUTH, Sorting and Searching, vol. 3 of The Art of Computer Programming, Addison-Wesley, 1973.

[21] A. N. KOLMOGOROV AND V. A. USPENSKII, On the definition of an algorithm, Translations of the A.M.S. 27 (1957).

[22] T. W. LAM AND W. L. RUZZO, The power of parallel pointer manipulation, in Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, Santa Fe, NM, 1989, 92-102.

[23] Logic and Algorithmic, An International Symposium Held in Honor of Ernst Specker, Zürich, February 5-11, 1980. Monographie No. 30 de L'Enseignement Mathématique, Université de Genève, 1982.

[24] W. L. RUZZO, Tree-size bounded alternation, Journal of Computer and System Sciences 21 (1980), 218-235.

[25] W. L. RUZZO, On uniform circuit complexity, Journal of Computer and System Sciences 22 (1981), 365-383.

[26] A. SCHÖNHAGE, Storage modification machines, SIAM Journal on Computing 9 (1980), 490-508.

[27] L. J. STOCKMEYER AND U. VISHKIN, Simulation of parallel random access machines by circuits, SIAM Journal on Computing 13 (1984), 409-422.

[28] R. E. TARJAN, A class of algorithms which require nonlinear time to maintain disjoint sets, Journal of Computer and System Sciences 18 (1979), 110-127.

[29] J. C. WYLLIE, The Complexity of Parallel Computations, PhD thesis, Cornell University, Department of Computer Science, 1979. TR 79-387.

Manuscript received 17 October 1991

STEPHEN A. COOK
Department of Computer Science
University of Toronto
Toronto, Ontario, Canada M5S 1A4
sacook@cs.toronto.edu

PATRICK W. DYMOND
Department of Computer Science
York University
4700 Keele Street
Toronto, Ontario, Canada M3J 1P3
dymond@cs.yorku.ca