
Purdue University
Purdue e-Pubs
Department of Computer Science Technical Reports

1990

An Efficient Parallel Algorithm for the Row Minima of a Totally Monotone Matrix

Mikhail J. Atallah, Purdue University, [email protected]
S. Rao Kosaraju

Report Number: 90-959

Atallah, Mikhail J. and Kosaraju, S. Rao, "An Efficient Parallel Algorithm for the Row Minima of a Totally Monotone Matrix" (1990). Department of Computer Science Technical Reports. Paper 813. https://docs.lib.purdue.edu/cstech/813

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

AN EFFICIENT PARALLEL ALGORITHM FOR THE ROW MINIMA OF A TOTALLY MONOTONE MATRIX

Mikhail J. Atallah
S. Rao Kosaraju

CSD-TR-959
February 1990
(Revised April 1991)

An Efficient Parallel Algorithm for the Row Minima of a Totally Monotone Matrix

Mikhail J. Atallah*
Purdue University

S. Rao Kosaraju†
Johns Hopkins University

Abstract

We give a parallel algorithm for the problem of computing the row minima of a totally monotone two-dimensional matrix. Whereas the previous best CREW-PRAM algorithm for this problem ran in O(log n log log n) time with O(n / log log n) processors, our algorithm runs in O(log n) time with O(n) processors in the (weaker) EREW-PRAM model. Thus we simultaneously improve the time complexity without any deterioration in the time × processors product, even using a weaker model of parallel computation.

*Dept. of Computer Science, Purdue University, West Lafayette, IN 47907. This author's research was supported by the Office of Naval Research under Contracts N00014-84-K-0502 and N00014-86-K-0689, the Air Force Office of Scientific Research under Grant AFOSR-90-0107, the National Science Foundation under Grant DCR-8451393, and the National Library of Medicine under Grant R01-LM05118.

†Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21218. This author's research was supported by the National Science Foundation under Grant CCR-8804284 and NSF/DARPA Grant CCR-8908092.


1 Introduction

First we review the problem, which was introduced and solved in linear sequential time in [2]. It has myriad applications to geometric and combinatorial problems [1, 2].

For any m × n matrix A, the row minima are an m-vector θ_A such that, for every row index r (1 ≤ r ≤ m), θ_A(r) is the smallest column index c that minimizes A(r, c) (that is, among all c's that minimize A(r, c), θ_A(r) is the smallest). If θ_A satisfies the following sorted property:

θ_A(1) ≤ θ_A(2) ≤ ⋯ ≤ θ_A(m),

and if for every submatrix A′ of A, θ_{A′} also satisfies the sorted property, then matrix A is said to be totally monotone [1, 2]. (Note: this definition may appear different from that given in [1, 2], but it is easily seen to be equivalent, by negating all the entries of the matrix.)
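For concreteness, these definitions can be checked directly by brute force. The following Python sketch (ours, purely illustrative, with 0-based indices and a matrix given as a list of lists) computes θ_A and tests total monotonicity via the standard 2 × 2 criterion (a matrix is totally monotone iff every 2 × 2 submatrix has sorted row minima):

    import itertools

    def row_minima(A):
        # theta_A(r): the smallest column index minimizing A[r][c] (0-based here)
        return [min(range(len(row)), key=lambda c: (row[c], c)) for row in A]

    def is_totally_monotone(A):
        # 2x2 criterion: if the top row of a 2x2 submatrix strictly prefers
        # the right column, the bottom row must strictly prefer it as well
        m, n = len(A), len(A[0])
        for r, rp in itertools.combinations(range(m), 2):
            for c, cp in itertools.combinations(range(n), 2):
                if A[r][cp] < A[r][c] and A[rp][cp] >= A[rp][c]:
                    return False
        return True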

Given a totally monotone matrix A, the problem of computing the θ_A array is known as that of "computing the row minima of a totally monotone matrix" [1]. The best previous CREW-PRAM algorithm for this problem ran in O(log n log log n) time with O(n / log log n) processors [1] (where m = n). The main result of our paper is an EREW-PRAM algorithm of time complexity O(log n) and processor complexity O(n). This improves on the time complexity of [1] without any increase in the time × processors product, and using a weaker parallel model. It also implies corresponding improvements on the parallel complexities of the many applications of this problem (which we refrain from listing; they can be found in [1, 2]). More specifically, our bounds are as follows.

Theorem 1.1 The row minima (that is, the array θ_A) of an m × n totally monotone matrix A can be computed in O(log m + log n) time with O(m + n) processors in the EREW-PRAM model.

In fact we prove a somewhat stronger result: an implicit description of θ_A can be computed, within the same time bound as in the above theorem, by O(n) processors. From this implicit description, a single processor can obtain any particular θ_A(r) value in O(log n) time.

Section 3 gives a preliminary result that is a crucial ingredient of the scheme of Section 4: an O(max(n, log m)) time algorithm using min(n, log m) processors in the EREW-PRAM model. Section 4, which contains the heart of our CREW method, uses a new kind of sampling and pipelining, where samples are evenly spaced (and progressively finer) clusters of elements and where the "help" for computing the information at a node does not come only from its children (as it did in [5]) but also from some of its subtree's leaves. Section 5 transforms the CREW algorithm of Section 4 into an EREW algorithm. This is not achieved by using known methods such as [6], but rather relies on (i) storing each leaf solution in a suitable parallel data structure, and (ii) re-defining the nature of the information stored at the internal nodes.

Recall that the EREW-PRAM is the parallel model where the processors operate synchronously and share a common memory, but no two of them are allowed simultaneous access to a memory cell (whether the access is for reading or for writing in that cell). The CREW-PRAM differs from the EREW-PRAM in that simultaneous reading is allowed (but simultaneous writing is still forbidden).

2 Preliminaries

In this section we introduce some notation, terminology, and conventions.

Since the matrix A is understood, we henceforth use θ as a shorthand for θ_A. Throughout, R will denote the set of m row indices of A, and C will denote its n column indices. To avoid cluttering the exposition, we assume that m and n are powers of two (our scheme can easily be modified for the general case).

An interval of rows or columns is a non-empty set of contiguous (row or column) indices [i, j] = {i, i+1, …, j}. We imagine row indices to lie on a horizontal line, so that a row is to the left of another row if and only if it has a smaller index (similarly "left of" is defined for columns). We say that interval I_1 is to the left of interval I_2, and I_2 is to the right of I_1, if the largest index of I_1 is smaller than the smallest index of I_2. (Note: for rows, it might seem more appropriate to say "above" rather than "to the left of", but the latter is in fact quite appropriate because we shall later map the rows to the leaves of a tree.)

Let I be a column interval, and let A_I be the m × |I| submatrix of A consisting of the columns of A in I. We use θ_I as a shorthand for θ_{A_I}. That is, if r is a row index, then θ_I(r) denotes the smallest c ∈ I for which A(r, c) is minimized. Note that θ_I(r) usually differs from θ(r), since we are minimizing only over I rather than C.

Throughout the paper, instead of storing θ_I directly, we shall instead store a function π_I which is an implicit description of θ_I.

Definition 2.1 For any column interval I and any column index c, π_I(c) is the row interval such that, for every row index r in that interval, we have θ_I(r) = c; π_I(c) is empty if no such r exists.

Note that the monotonicity of A implies that, if c_1 < c_2, then π_I(c_1) is to the left of π_I(c_2).

Note that each π_I(c) can be stored in O(1) space, since we need only store the beginning and end of that row interval. Throughout the paper, we shall use π_I as an implicit description of θ_I. The advantage of doing so is that we use O(|I|) storage instead of the O(|R|) that would be needed for explicitly storing θ_I. The disadvantage is that, given a row index r, a processor needs to binary search in the π_I array for the position of row index r in order to determine θ_I(r). Had we stored θ_I directly, θ_I(r) would be readily available in constant time.

we have 1I"C, it is easy to do a postprocessing computation that obtains (explicitly) 9 from

1I"c with m processors: each column c gets assigned 111"c(c) I processors which set 9(r) = c

for every T E 1I"c(c). Therefore Theorem 1.1 would easily follow if we can establish the

following.

Theorem 2.1 π_C can be computed in O(log m + log n) time with O(n) processors in the EREW-PRAM model.

The rest of this paper proves the above theorem.

Definition 2.2 The s-sample of the set R of row indices is obtained by choosing every s-th element of R (i.e., every row index which is a multiple of s). For example, the 4-sample of R is (4, 8, …, m). For k ∈ [0, log m], let R_k denote the (m/2^k)-sample of R.

For example, R_0 = (m),

R_3 = (m/8, m/4, 3m/8, m/2, 5m/8, 3m/4, 7m/8, m),

and R_{log m} = (1, 2, …, m) = R.

Note that |R_k| = 2^k = 2|R_{k-1}|.
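In code (a sketch of ours, 1-based row indices as in the text), a sample and the predecessor/successor operations on it, which later sections use repeatedly, are simple arithmetic on multiples of the step m/2^k:

    def sample(m, k):
        # R_k: every (m >> k)-th row index, i.e. (s, 2s, ..., m) with s = m / 2**k
        s = m >> k
        return list(range(s, m + 1, s))

    def successor(r, m, k):
        # smallest element of R_k that is > r, or None if there is none
        s = m >> k
        nxt = (r // s + 1) * s
        return nxt if nxt <= m else None

    def predecessor(r, m, k):
        # largest element of R_k that is < r, or None if there is none
        s = m >> k
        prv = ((r - 1) // s) * s
        return prv if prv >= s else None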

3 A log m Processor Algorithm

This section gives an algorithm which is needed as an important ingredient in the algorithm

of the next section. It has the feature that its complexity bounds depend on the number of

columns in a stronger way than on the number of rows.


Lemma 3.1 The π_C array can be computed in O(max(n, log m)) time with min(n, log m) processors in the EREW-PRAM model.

The bounds of the above lemma might look unappealing at first sight, but their significance lies in the fact that m can be much larger than n. In fact we shall use this lemma, in the next section, on problems of size m × log n. The rest of this section proves the lemma.

Simple-minded approaches like "use one processor to binary search for π_C(c) in parallel for each c ∈ C" do not work, the difficulty being that we do not know which π_C(c)'s are empty. In fact, if we knew which π_C(c)'s are empty then we could easily achieve O(log m) time with n processors (by using the above-mentioned straightforward binary search: e.g., binary search for the right endpoint of π_C(c) by doing log m comparisons involving the two columns c and c′, where c′ is the nearest column to the right of c having a nonempty π_C(c′)).

We shall compute the π_C array by computing two arrays, LeftExtend and RightExtend, whose significance is as follows.

Definition 3.1 For any column c, let LeftExtend(c) be the left endpoint of row interval π_{[1,c]}(c). That is, LeftExtend(c) is the minimum row index r such that, for any c′ < c, A(r, c) < A(r, c′). Let RightExtend(c) be the right endpoint of row interval π_{[c,n]}(c). That is, RightExtend(c) is the maximum row index r such that, for any c′ > c, A(r, c) ≤ A(r, c′).

Intuitively, LeftExtend(c) measures how far to the left (in R) column c can "extend its influence" if the only competition to it came from columns to its left. The intuition for RightExtend(c) is analogous, with the roles of "left" and "right" interchanged. Note that LeftExtend(c) (resp., RightExtend(c)) might be undefined, which we denote by setting it equal to the nonexistent row m+1 (resp., 0).

Once we have the RightExtend and LeftExtend arrays, it is easy to obtain the π_C array, as follows. If either LeftExtend(c) = m+1 or RightExtend(c) = 0 then obviously π_C(c) is empty. Otherwise we distinguish two cases: (i) if RightExtend(c) < LeftExtend(c) then π_C(c) is empty, and (ii) if LeftExtend(c) ≤ RightExtend(c) then interval π_C(c) is not empty and has LeftExtend(c) and RightExtend(c) as its two endpoints.
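The case analysis translates directly into code; a sketch (ours, with the sentinels m+1 and 0 from Definition 3.1 marking the undefined cases):

    def pi_from_extends(left_extend, right_extend, m):
        # combine the two arrays into pi_C: pi_C(c) is either empty or the
        # interval [LeftExtend(c), RightExtend(c)]
        pi = {}
        for c in left_extend:                      # columns 1..n
            le, re = left_extend[c], right_extend[c]
            if le == m + 1 or re == 0 or re < le:  # the empty cases
                pi[c] = None
            else:                                  # case (ii): le <= re
                pi[c] = (le, re)
        return pi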

Hence it suffices to compute the RightExtend and LeftExtend arrays. The rest of this section explains how to compute the LeftExtend array (the computation of RightExtend is symmetrical and is therefore omitted). Furthermore, for the sake of definiteness, we shall describe our scheme assuming n ≥ log m (it will be easy for the reader to see that it also works if n < log m).


Thus we have min(n, log m) = log m processors and wish to achieve O(max(n, log m)) = O(n) time performance. We shall show how to use the log m processors to compute the LeftExtend array in n + log m (= O(n)) time steps. To simplify the presentation we assume, without loss of generality, that A(m, 1) > A(m, 2) > ⋯ > A(m, n) (one can always add a "dummy" last row to A in order to make this hold; obviously this does not destroy the monotonicity of A). This assumption simplifies the presentation because it causes every LeftExtend(c) to be ≤ m (i.e., it is defined).

We first give a rough overview of our scheme. Imagine that the row indices R are organized as a complete binary search tree T_R: the leaves contain R sorted in increasing order, and each internal node v contains the row index r of the largest leaf in the subtree of v's left child (in which case we can simply refer to v as "internal node r" rather than the more cumbersome "the internal node that contains r"). Note that a row index r < m appears exactly twice in T_R: once at a leaf, and once at an internal node (m appears only once, as the rightmost leaf). Having only log m processors, we clearly cannot afford to build all of T_R. Instead, we shall build a portion of it, starting from the root and expanding downwards along n root-to-leaf paths P_1, …, P_n. Path P_c is in charge of computing LeftExtend(c), and does so by performing a binary search for it as it traces a root-to-leaf path in the binary search tree T_R. If path P_c exits at leaf r then LeftExtend(c) = r. The tracing of all the P_i's is done in a total of n + log m time steps. Path P_c is inactive until time step c, at which time it gets assigned one processor and begins at the root, and at each subsequent step it makes one move down the tree, until it exits at some leaf at step c + log m. Clearly there are at most log m paths that are simultaneously active, so that we use log m processors. At time step n + log m the last path (P_n) exits a leaf and the computation terminates. If, at a certain time step, path P_c wants to go down to a node of T_R not traced earlier by a P_{c′} (c′ < c), then its processor builds that node of T_R (we use a pointer representation for the traced portion of T_R; we must avoid indexing since we cannot afford using m space). During path P_c's root-to-leaf trip, we shall maintain the property that, when P_c is at node r of T_R, LeftExtend(c) is guaranteed to be one of the leaves in r's subtree (this property is obviously true when P_c is at the root, and we shall soon show how to maintain it as P_c goes from a node to one of its children). It is because of this property that the completion of a P_c (when it exits from a leaf of T_R) corresponds to the end of the computation of LeftExtend(c).


The above overview implies that at each time step, the lowermost nodes of the log m active paths are "staggered" along the levels of T_R, in that they are at levels 1, 2, …, log m respectively. Hence at each time step, at most one processor is active at each level of T_R, and the computation of exactly one LeftExtend(c) gets completed (at one of the leaves).

We have omitted a crucial detail in the above overview: what additional information should be stored in the traced portion of T_R in order to aid the downward tracing of each P_c (such information is needed for P_c to know whether to branch left or right on its trip down). The idea is for each path to leave behind it a trail of information that will help subsequent paths. Note that LeftExtend(c) depends upon columns 1, 2, …, c−1 and is independent of columns c+1, …, n. Before specifying the additional information, we need some definitions.

Definition 3.2 Let c and c′ be column indices, and let r be a row index. We say that c is better than c′ for r (denoted by c <_r c′) iff one of the following holds: (i) A(r, c) < A(r, c′), or (ii) A(r, c) = A(r, c′) and c < c′.

Note that for any columns c, c′ and row r, we must have either c <_r c′ or c′ <_r c.
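In code, the relation <_r is a lexicographic comparison (a one-line sketch of ours):

    def better(A, r, c, cp):
        # c <_r cp: strictly smaller entry in row r, or a tie and c < cp
        return (A[r][c], c) < (A[r][cp], cp)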

When the algorithm compares A(r, c) to A(r, c′) in order to determine whether c <_r c′ or c′ <_r c, it is useful if the reader thinks of such a comparison as a competition between c and c′ for r: c wins the competition over c′ if the outcome is c <_r c′, otherwise it loses r to c′. When a path P_c is at an internal node r, it competes for r with the best column for r among the columns in [1, c−1] (that is, it competes with θ_{[1,c-1]}(r)). If c beats θ_{[1,c-1]}(r) in this competition for r, then P_c obviously branches down to the left child of r in T_R (its LeftExtend(c) is certainly not greater than r). Otherwise it branches to the right child of r. However, the above assumes that the θ_{[1,c-1]}(r) values are available when needed.

T. However, the above assumes that the 8[1.c_lj(r) values are available when needed. We

must now make sure that, when Pc enters node r, it can easily (i.e., in constant time) obtain

8[1,c_lj(r). Note that we can neither maintain the needed 8[1,C_I)(r) at r, nor can we carry

it down with Pc on its downward trip (it is not hard to see that either one of these two

approaches runs into trouble). Instead, we shall use a judicious combination of both: some

information is maintained locally in r, some is carried along by Pc. When Pc enters r, Pc

combines the information it is carrying, with the information in T, to obtain in constant

time 8[l,c_l](T). This is made more precise below.

Each internal node r′ that has been already visited by a path contains, in a register label(r′), the best c′ for r′ among the subset of columns whose path went through r′ (hence label(r′) is empty if no path visited r′ so far).

When P_c enters internal node r of T_R, it carries with it, in a register rival(c), the largest c′ such that c′ < c and LeftExtend(c′) is smaller than the leftmost leaf in the subtree of r (hence rival(c) is empty when P_c is at the root).

Note that the current rival(c) and label(r) allow P_c to obtain θ_{[1,c-1]}(r) in constant time, as follows. Recall that θ_{[1,c-1]}(r) is the best for r (i.e., smallest under the <_r relationship) among the columns in [1, c−1]. Now, view [1, c−1] as being partitioned into three subsets: the subset S_1 (resp., S_3) consisting of the columns c′ whose LeftExtend(c′) is smaller (resp., larger) than the leftmost (resp., rightmost) leaf in the subtree of r, and the subset S_2 of the columns c′ whose LeftExtend(c′) is in the subtree of r. Now, no column in S_3 can be θ_{[1,c-1]}(r) (by the definition of S_3). The best for r in S_2 is label(r) (by definition). The best for r in S_1 is rival(c), by the following argument. Suppose to the contrary that there is a c″ ∈ S_1, c″ < rival(c), such that c″ is better for r than rival(c). A contradiction with the monotonicity of A is obtained by observing that we now have: (i) c″ < rival(c), (ii) LeftExtend(rival(c)) < r, and (iii) c″ is better than rival(c) for r but not for LeftExtend(rival(c)). Hence rival(c) must be the best for r in S_1. Hence θ_{[1,c-1]}(r) is one of {rival(c), label(r)}, which can be obtained in constant time from rival(c) and label(r), both of which are available (P_c carried rival(c) down with it when it entered r, and r itself maintained label(r) during the previous time step).
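Putting this together, the branching decision at r is a constant-time computation; a sketch (ours, with `better` as above and empty registers represented by None):

    def branch_direction(A, c, r, rival, label):
        # theta_[1,c-1](r) is whichever of rival(c), label(r) is better for r;
        # P_c moves to the left child iff c beats that column for row r
        cands = [x for x in (rival, label) if x is not None]
        best = min(cands, key=lambda x: (A[r][x], x)) if cands else None
        return "left" if best is None or better(A, r, c, best) else "right"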

The main problem that remains is how to update, in constant time, the rival(c) and the label(r) registers when P_c goes from r down to one of r's two children. We explain below how P_c updates its rival(c) register, and how r updates its label(r) register. (We need not worry about updating the label(r′) of a row r′ not currently being visited by a P_{c′}, since such a label(r′) remains by definition unchanged.)

By its very definition, label(r) depends on all the paths P_{c′} (c′ < c) that previously went through r. Since P_c has just visited r, we need to make c compete, for row r, with the previous value of label(r): if c wins then label(r) becomes c, otherwise it remains unchanged.

The updating of rival(c) depends upon one of the following two cases.

The updating of rival(c) depends upon one of the following two cases.

The first case is when c won the compeHtion at r, i.e., Pe has moved to the left child

of r (call it rl). In that case by its very definition rival(c) remains unchanged (since the

leftmost leaf in the subtree of r' is the same as the leftmost leaf in the subtree of r).

The second case is when c lost the competition at r, i.e., Pc has moved to the right child

of r (call it r"). In that case we claim that it suffices to compare the old label(r) to the

8

Page 11: An Efficient Parallel Algorithm for the Row Minima of a

old rival(c): the one which is better for r" is the new value of rival(c). We now prove the

claim. Let rl (resp., r2) be the leftmost leaf in the subtree of r (resp., r"). Let C' (resp'l Gil)

be the set of paths consisting of the columns (J such that (J < c and LeftExtend(fJ) < rl

(resp., LeftExtend(fJ) < f2). By definition, the old (resp., new) value of rival(c) is the

largest column index in C 1 (resp., Gil). The claim would follow if we can prove that the

largest column index in Gil - G' is the old label(r) (by "old label(r)" we mean its value

before updating, Le., its value when Pc first entered f). We prove this by contradiction:

let c denote the old labe.l(r), and suppose that there is a '/ E Gil - G' such that '/ > c.Since both '/ and care in Gil - C1

, their respective paths went from r to the left child of r.

However, P"f did so laler than Pc (because '/ > c). This in turn implies that '/ is better for

r than c, contradicting the fact that cis the old label(r). This proves the claim.

Concerning the implementation of the above scheme, the assignment of processors to their tasks is trivial: each active P_c carries with it its own processor, and when it exits from a leaf it releases that processor, which gets assigned to P_{c+log m}, which is just beginning at the root of T_R.

4 The Tree-Based Algorithm

This section builds on the algorithm of the previous section and establishes the CREW version of Theorem 2.1 (the next section will extend it to EREW). The sampling and iterative refinement methods used in this section are reminiscent of the cascading divide-and-conquer technique [5, 4], and are similar to those used in [3] for solving in parallel the string editing problem. In fact, although [3] deals with a different problem, it implicitly contains a solution to our problem whose complexity is also O(log n) time but with a disappointing O(n log n) processors. As in [3], it is useful to think of the computation as progressing through the nodes of a tree T which we now proceed to define (it differs substantially from [3]).

Partition the column indices into n/log n adjacent intervals I_1, …, I_{n/log n} of size log n each. Call each such interval I_i a fat column. Imagine a complete binary tree T on top of these fat columns, and associate with each node v of this tree a fat interval I(v) (i.e., an interval of fat columns) in the following way: the fat interval associated with a leaf is simply the fat column corresponding to it, and the fat interval associated with an internal node is the union of the two fat intervals of its children. Thus a node v at height h has a fat interval I(v) consisting of |I(v)| = 2^h fat columns. The storage representation we use for a fat interval I(v) is a list containing the indices of the fat columns in it; we also call that list I(v), in order to avoid introducing extra notation. For example, if v is the left child of the root, then the I(v) array contains (1, 2, …, n/(2 log n)). Observe that Σ_{v∈T} |I(v)| = O(|T| log |T|) = O((n/log n) log n) = O(n).

The ultimate goal is to compute π_C(c) for every c ∈ C.

Let leaf problem I_i be the problem of computing π_{I_i}(c) for all c ∈ I_i. Thus a leaf problem is a subproblem of size m × log n. From Lemma 3.1 it follows that a leaf problem can be solved in O(log n + log m) time by min(log n, log m) (≤ log n) processors. Since there are n/log n leaf problems, they can be solved in O(log n + log m) time by n processors. We assume that this has already been done, i.e., that we know the π_{I_i} array for each leaf problem I_i. The rest of this section shows that an additional O(log n + log m) time with O(n) processors is enough for obtaining π_C.

Definition 4.1 Let J(v) be the interval of original columns that belong to fat intervals in I(v) (hence |J(v)| = |I(v)| · log n). For every v ∈ T, fat column f ∈ I(v), and subset R′ of R, let ψ_v(R′, f) be the interval in R′ such that, for every r in that interval, θ_{J(v)}(r) is a column in fat column f. We use "ψ_v(R′, *)" as a shorthand for "ψ_v(R′, f) for all f ∈ I(v)".

We henceforth focus on the computation of the ψ_{root(T)}(R, *) array, where root(T) is the root node of T. Once we have the array ψ_{root(T)}(R, *), it is easy to compute the required π_C array within the prescribed complexity bounds: for each fat column f, we replace the ψ_{root(T)}(R, f) row interval by its intersection with the row intervals in the π_{I_f} array (which are already available at the leaf I_f of T). The rest of this section proves the following lemma.

Lemma 4.1 ψ_v(R, *) for every v ∈ T can be computed by a CREW-PRAM in time O(height(T) + log m) with O(n) processors, where height(T) is the height of T (= O(log n)).

Proof. Since Σ_{v∈T} (|I(v)| + log n) = O(n), we have enough processors to assign |I(v)| + log n of them to each v ∈ T. The computation proceeds in log m + height(T) − 1 stages, each of which takes constant time. Each v ∈ T will compute ψ_v(R′, *) for progressively larger subsets R′ of R, subsets R′ that double in size from one stage to the next of the computation.


We now state precisely what these subsets are. Recall that R_k denotes the (m/2^k)-sample of R, so that |R_k| = 2^k.

At the t-th stage of the algorithm, a node v of height h in T will use its |I(v)| + log n processors to compute, in constant time, ψ_v(R_{t-h}, *) if h ≤ t ≤ h + log m. It does so with the help of information from ψ_v(R_{t-1-h}, *), ψ_{LeftChild(v)}(R_{t-h}, *), and ψ_{RightChild(v)}(R_{t-h}, *), all of which are available from the previous stage t−1 (note that (t−1) − (h−1) = t − h). If t < h or t > h + log m then node v does nothing during stage t. Thus before stage h the node v lies "dormant", then at stage t = h it first "wakes up" and computes ψ_v(R_0, *), then at the next stage t = h+1 it computes ψ_v(R_1, *), etc. At stage t = h + log m it computes ψ_v(R_{log m}, *), after which it is done.

The details of what information v stores and how it uses its |I(v)| + log n processors to perform stage t in constant time are given below. In the description, tree nodes u and w are the left and right child, respectively, of v in T.

After stage t, node v (of height h) contains ψ_v(R_{t-h}, *) and a quantity Critical_v(R_{t-h}) whose significance is as follows.

Definition 4.2 Let R′ be any subset of R. Critical_v(R′) is the largest r ∈ R′ that is contained in ψ_v(R′, f) for some f ∈ I(u); if there is no such r then Critical_v(R′) = 0. The monotonicity of A implies that for every r′ < Critical_v(R′) (resp., r′ > Critical_v(R′)), r′ is contained in ψ_v(R′, f) for some f ∈ I(u) (resp., f ∈ I(w)).

We now explain how v performs stage t, i.e., how it obtains Critical_v(R_{t-h}) and ψ_v(R_{t-h}, *) using ψ_u(R_{t-h}, *), ψ_w(R_{t-h}, *), and Critical_v(R_{t-1-h}) (all three of which were computed in the previous stage t−1). The fact that the |I(v)| + log n processors can do this in constant time is based on the following observations, whose correctness follows from the definitions.

Observation 4.1

1. Critical_v(R_{t-h}) is either the same as Critical_v(R_{t-1-h}), or the successor of Critical_v(R_{t-1-h}) in R_{t-h}.

2. If f ∈ I(u), then ψ_v(R_{t-h}, f) is the portion of interval ψ_u(R_{t-h}, f) that is ≤ Critical_v(R_{t-h}).

3. If f ∈ I(w), then ψ_v(R_{t-h}, f) is the portion of interval ψ_w(R_{t-h}, f) that is > Critical_v(R_{t-h}).

The algorithmic implications of the above observations are discussed next.


4.1 Computing Critical_v(R_{t-h}).

Relationship (1) of Observation 4.1 implies that, in order to compute Critical_v(R_{t-h}), all v has to do is determine which of Critical_v(R_{t-1-h}) or its successor in R_{t-h} is the correct value of Critical_v(R_{t-h}). This is done as follows. If Critical_v(R_{t-1-h}) has no successor in R_{t-h} then Critical_v(R_{t-1-h}) = m (the last row) and hence Critical_v(R_{t-h}) = Critical_v(R_{t-1-h}). Otherwise the updating is done in the following two steps. For conciseness, let r denote Critical_v(R_{t-1-h}), and let s denote the successor of r in R_{t-h}.

• The first step is to compute θ_{J(u)}(s) and θ_{J(w)}(s) in constant time. This involves a search in I(u) (resp., I(w)) for the fat column f′ ∈ I(u) (resp., f″ ∈ I(w)) whose ψ_u(R_{t-h}, f′) (resp., ψ_w(R_{t-h}, f″)) contains s. These two searches in I(u) and I(w) are done in constant time with the |I(v)| processors available. We explain how the search for f′ in I(u) is done (that for f″ in I(w) is similar and omitted). Node v assigns a processor to each f ∈ I(u), and that processor tests whether s is in ψ_u(R_{t-h}, f); the answer is "yes" for exactly one of those |I(u)| processors and thus can be collected in constant time. Next, v determines θ_{J(u)}(s) and θ_{J(w)}(s) in constant time by using log n processors to search for s in constant time in the leaf solutions π_{I_{f′}} and π_{I_{f″}}, available at leaves f′ and f″, respectively. If the outcome of the search for s in π_{I_{f′}} is that s ∈ π_{I_{f′}}(c′) for c′ ∈ I_{f′}, then θ_{J(u)}(s) = c′. Similarly, θ_{J(w)}(s) is obtained from the outcome of the search for s in π_{I_{f″}}.

• The next step consists of comparing A(s, θ_{J(u)}(s)) to A(s, θ_{J(w)}(s)). If the outcome is A(s, θ_{J(u)}(s)) > A(s, θ_{J(w)}(s)), then Critical_v(R_{t-h}) is the same as Critical_v(R_{t-1-h}). Otherwise Critical_v(R_{t-h}) is s.

We next show how the just computed Critical_v(R_{t-h}) value is used to compute ψ_v(R_{t-h}, *) in constant time.

4.2 Computing ψ_v(R_{t-h}, *).

Relationship (2) of Observation 4.1 implies the following for each f ∈ I(u):

• If ψ_u(R_{t-h}, f) is to the left of Critical_v(R_{t-h}) then ψ_v(R_{t-h}, f) = ψ_u(R_{t-h}, f).

• If ψ_u(R_{t-h}, f) is to the right of Critical_v(R_{t-h}) then ψ_v(R_{t-h}, f) = ∅.

• If ψ_u(R_{t-h}, f) contains Critical_v(R_{t-h}) then ψ_v(R_{t-h}, f) consists of the portion of ψ_u(R_{t-h}, f) up to (and including) Critical_v(R_{t-h}).

The above three facts immediately imply that O(1) time is enough for |I(u)| of the |I(v)| processors assigned to v to compute ψ_v(R_{t-h}, f) for all f ∈ I(u) (recall that the ψ_u(R_{t-h}, *) array is available in u from the previous stage t−1, and Critical_v(R_{t-h}) has already been computed).

A similar argument, using relationship (3) of Observation 4.1, shows that |I(w)| processors are enough for computing ψ_v(R_{t-h}, f) for all f ∈ I(w). Thus ψ_v(R_{t-h}, *) can be computed in constant time with |I(v)| processors.

This completes the proof of Lemma 4.1. □
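Before turning to read conflicts, the whole stage at one node can be summarized by a sequential sketch (ours). Here the ψ's are dictionaries mapping a fat column to the sorted list of its sample rows (rather than interval endpoints), and θ_{J(u)}, θ_{J(w)} are supplied as callables standing in for the leaf-solution searches:

    def stage_at_v(A, rows, psi_u, psi_w, old_critical, theta_Ju, theta_Jw):
        # One stage at node v. `rows` is the sample R_{t-h} (sorted);
        # psi_u / psi_w map each fat column of u / w to its sample rows;
        # old_critical = Critical_v(R_{t-1-h}); theta_Ju(s) / theta_Jw(s)
        # return the best column for row s within J(u) / J(w).
        later = [r for r in rows if r > old_critical]
        if not later:                  # no successor: critical stays put
            critical = old_critical
        else:
            s = later[0]               # successor of old_critical in R_{t-h}
            # does row s prefer the left half J(u) or the right half J(w)?
            critical = old_critical if A[s][theta_Ju(s)] > A[s][theta_Jw(s)] else s
        psi_v = {}
        for f, rs in psi_u.items():    # Observation 4.1(2): keep rows <= critical
            kept = [r for r in rs if r <= critical]
            if kept:
                psi_v[f] = kept
        for f, rs in psi_w.items():    # Observation 4.1(3): keep rows > critical
            kept = [r for r in rs if r > critical]
            if kept:
                psi_v[f] = kept
        return critical, psi_v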

5 Avoiding Read Conflicts

The scheme of the previous section made crucial use of the "concurrent read" capability of the CREW-PRAM. This occurred in the computation of Critical_v(R_{t-h}) and also in the subsequent computation of ψ_v(R_{t-h}, *). In its computation of Critical_v(R_{t-h}), there are two places where the algorithm of the previous section uses the "concurrent read" capability of the CREW (both of them occur during the computation of θ_{J(u)}(s) and θ_{J(w)}(s) in constant time). After that, the CREW part of the computation of ψ_v(R_{t-h}, *) is the common reading of Critical_v(R_{t-h}). We review these three problems next, using the same notation as in the previous section (i.e., u is the left child of v in T, w is the right child of v in T, v has height h in T, s is the successor of Critical_v(R_{t-1-h}) in R_{t-h}, etc.).

• Problem 1: This arises during the search in I(u) (resp., I(w)) for the fat column f′ ∈ I(u) (resp., f″ ∈ I(w)) whose ψ_u(R_{t-h}, f′) (resp., ψ_w(R_{t-h}, f″)) contains s. Specifically, for finding (e.g.) f′, node v assigns a processor to each f ∈ I(u), and that processor tests whether s is in ψ_u(R_{t-h}, f); the answer is "yes" for exactly one of those |I(u)| processors and thus can be collected in constant time.

• Problem 2: Having found f′ and f″, node v determines θ_{J(u)}(s) and θ_{J(w)}(s) in constant time by using log n processors to search for s, in constant time, in the leaf solutions π_{I_{f′}} and π_{I_{f″}} available at leaves f′ and f″, respectively. There are two parts to this problem: (i) many ancestors of a leaf I_f (possibly all log n of them) may simultaneously access the same leaf solution π_{I_f}, and (ii) each of those ancestors uses log n processors to do a constant-time search in the leaf solution π_{I_f} (in the EREW model, it would take Ω(log log n) time just to tell the processors in which leaf to search).

• Problem 3: During the computation of ψ_v(R_{t-h}, *), the common reading of the Critical_v(R_{t-h}) value by the fat columns f ∈ I(v).

Any solution we design for Problems 1-3 should also be such that no concurrent reading of

an entry of matrix A occurs. We begin by discussing how to handle Problem 2.

5.1 Problem 2.

To avoid the "many ancestors" part of Problem 2 (i.e., part (i)), it naturally comes to mind to make log n copies of each leaf I_f and to dedicate each copy to one ancestor of I_f, especially since we can easily create these log n copies of I_f in O(log n) time and log n processors (because the space taken by π_{I_f} is O(log n)). But we are still left with part (ii) of Problem 2, i.e., how an ancestor can search the copy of I_f dedicated to it in constant time by using its log n processors in an EREW fashion. On the one hand, just telling all of those log n processors which I_f to search takes an unacceptable Ω(log log n) time, and on the other hand a single processor seems unable to search π_{I_f} in constant time. We resolve this by organizing the information at (each copy of) I_f in such a way that we can replace the log n processors by a single processor to do the search in constant time. Instead of storing a leaf solution in an array π_{I_f}, we store it in a tree structure (call it Tree(f)) that enables us to exploit the highly structured nature of the searches to be performed on it. The search to be done at any stage t is not arbitrary, and is highly dependent on what happened during the previous stage t−1, which is why a single processor can do it in constant time (as we shall soon see).

We now define the tree Tree(f). Let List = [1, r_1], [r_1+1, r_2], …, [r_p+1, m] be the list of (at most log n) nonempty intervals in π_{I_f}, in sorted order. Each node of Tree(f) contains one of the intervals of List. (It is implicitly assumed that the node of Tree(f) that contains π_{I_f}(c) also stores its associated column c.) Imagine a procedure that builds Tree(f) from the root down, in the following way (this is not how Tree(f) is actually built, but it is a convenient way of defining it). At a typical node x, the procedure has available a contiguous subset of List (call it L(x)), together with an integer d(x), such that no interval of L(x) contains a multiple of m/(2^{d(x)}).

Figure 1: Illustrating Tree(f).

(The procedure starts at the root of Tree(f) with L(root) = List and d(root) = −1.) The procedure determines which interval of L(x) to store in x by finding the smallest integer i > d(x) such that a multiple of m/(2^i) is in an interval of L(x) (we call i the priority of that interval), together with the interval of L(x) for which this happens (say it is interval [r_k+1, r_{k+1}], and note that it is unique). Interval [r_k+1, r_{k+1}] is then stored at x (together with its associated column), and the subtree of x is created recursively, as follows. Let L′ (resp., L″) be the portion of L(x) to the left (resp., right) of interval [r_k+1, r_{k+1}]. If L′ ≠ ∅ then the procedure creates a left child for x (call it y) and recursively goes to y with d(y) = i and with L(y) = L′. If L″ ≠ ∅ then the procedure creates a right child for x (call it z) and recursively goes to z with d(z) = i and with L(z) = L″.

Note that the root of Tree(f) has priority zero and no right child, and that its left child w has d(w) = 0 and L(w) = (List minus the last interval in List).

Figure 1 shows the Tree(f) corresponding to the case where m = 32, log n = 8, and

π_{I_f} = [1,6] [7,8] [9,14] [] [15,15] [16,17] [18,22] [23,32].

In that figure, we have assumed (for convenience) that the columns in I_f are numbered 1, …, 8, and we have shown both the columns c (circled) and their associated intervals π_{I_f}(c). For this example, the priorities of the nonempty intervals of π_{I_f} are (respectively) 3, 2, 3, 5, 1, 3, 0.
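These priorities are easy to verify mechanically (a sketch of ours; `priority` is a hypothetical helper):

    def priority(lo, hi, m):
        # smallest i >= 0 such that a multiple of m / 2**i lies in [lo, hi]
        i = 0
        while True:
            step = m >> i
            if (hi // step) * step >= lo:   # largest multiple of step that is <= hi
                return i
            i += 1

    intervals = [(1, 6), (7, 8), (9, 14), (15, 15), (16, 17), (18, 22), (23, 32)]
    print([priority(lo, hi, 32) for lo, hi in intervals])   # [3, 2, 3, 5, 1, 3, 0]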

The concept of priority will be useful for building Tree(f) and for proving various facts about it. The following proposition is an easy consequence of the above definition of Tree(f).

Proposition 5.1 Let X be an interval in List, of priority i. Let X′ (resp., X″) be the nearest interval that is to the left (resp., right) of X in List and that has priority smaller than i. Let i′ (resp., i″) be the priority of X′ (resp., X″). Then we have the following:

1. If i > 0 then at least one of {X′, X″} exists.

2. If only one of {X′, X″} exists, then X is its child in Tree(f) (right child in case of X′, left child in case of X″).

3. If X′ and X″ both exist, then i′ ≠ i″. Furthermore, if i′ > i″ then X is the right child of X′ in Tree(f), otherwise (i.e., if i′ < i″) X is the left child of X″.

Proof. The proof refers to the hypothetical procedure we used to define Tree(f). For convenience, we denote a node x of Tree(f) by the interval X that it contains; i.e., whereas in the description of the procedure we used to say "the node x of Tree(f) that contains X" and "list L(x)", we now simply say "node X" and "list L(X)" (this is somewhat of an abuse of notation, since when the procedure first entered x it did not yet know X).

That (1) holds is obvious.

That (2) holds follows from the following observation. Let X be a child of X′ in Tree(f). When the procedure we used to define Tree(f) was at X′, it went to node X with the list L(X) set equal to the portion of L(X′) before X′ (if X is left child of X′) or after X′ (if X is right child of X′). This implies that the priorities of the intervals between X′ and X in List are all larger than the priority of X. This implies that, when starting in List at X and moving along List towards X′, X′ is the first interval that we encounter that has a lower priority than that of X. Hence (2) holds.

We now prove (3). Note that the proof we just gave for (2) also implies that the parent of X is in {X′, X″}. Hence to prove (3), it suffices to show that one of {X′, X″} is an ancestor of the other (this would imply that they are both ancestors of X, and that the one with the larger priority is the parent of X). We prove this by contradiction: let Z be the lowest common ancestor of X′ and X″ in Tree(f), with Z ∉ {X′, X″}. Since Z has lower priority than both X′ and X″, it cannot occur between X′ and X″ in List. However, the fact that X′ (resp., X″) is in the subtree of the left (resp., right) child of Z implies that it is in the portion of L(Z) before Z (resp., after Z). This implies that Z occurs between X′ and X″ in List, a contradiction. □

We now show that Tree(f) can be built in O(log m + log n) time with |List| (≤ log n) processors. Assign one processor to each interval of List, and do the following in parallel for each such interval. The processor assigned to an interval X computes the smallest integer i such that a multiple of m/(2^i) is in that interval (recall that this integer i is the priority of interval X). Following this O(log m) time computation of the priorities of the intervals in List, the processor assigned to each interval determines its parent in Tree(f), and whether it is left or right child of that parent, by using the above Proposition 5.1. Read conflicts are easy to avoid during this O(|List|) time computation (the details are trivial and omitted).
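Sequentially, the parent computation just described looks as follows (a sketch of ours, operating on the list of priorities of the nonempty intervals, e.g. [3, 2, 3, 5, 1, 3, 0] for the example of Figure 1):

    def tree_parents(priorities):
        # For each interval index j, find its parent in Tree(f) per
        # Proposition 5.1; the (unique) priority-0 interval is the root.
        n = len(priorities)
        parent = [None] * n
        for j, p in enumerate(priorities):
            left = next((i for i in range(j - 1, -1, -1) if priorities[i] < p), None)
            right = next((i for i in range(j + 1, n) if priorities[i] < p), None)
            if left is None and right is None:
                continue                        # the root
            # when both neighbors exist, the higher-priority one is the parent
            if right is None or (left is not None and priorities[left] > priorities[right]):
                parent[j] = (left, "right")     # j is the right child of `left`
            else:
                parent[j] = (right, "left")     # j is the left child of `right`
        return parent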

Note: Although we do not need to do so, it is in fact possible to build Tree(f) in O(log m + log n) time by using only one processor rather than log n processors, but the construction is somewhat more involved and we refrain from giving it in order not to break the flow of the exposition.

As explained earlier, after Tree(f) is built, we must make log n copies of it (one for each ancestor in T of leaf I_f).

We now explain how a single processor can use a copy of Tree(f) to perform constant-time searching. From now on, if a node of Tree(f) contains interval π_{I_f}(c), we refer to that node as either "node π_{I_f}(c)" or as "node c". We use LeftChild(c) and RightChild(c) to denote the left and right child, respectively, of c in Tree(f).

Proposition 5.2 Let r ∈ R_k, and let r′ be the predecessor of r in R_k. Let r ∈ π_{I_f}(c), and let r′ ∈ π_{I_f}(c′). Then the predecessor of r in R_{k+1} (call it r″) is in π_{I_f}(c″) where c″ ∈ {c′, RightChild(c′), LeftChild(c), c}.

Proof. If c″ ∈ {c′, c} then there is nothing to prove, so suppose that c″ ∉ {c, c′}. This implies that c ≠ c′ (since c = c′ would imply that c″ = c). Then π_{I_f}(c′), π_{I_f}(c″) and π_{I_f}(c) are distinct and occur in that order in List (not necessarily adjacent to one another in List). Note that π_{I_f}(c′) and π_{I_f}(c) each contains a row in R_k, that π_{I_f}(c″) contains a row in R_{k+1} but no row in R_k, and that all the other intervals of List that are between π_{I_f}(c′) and π_{I_f}(c) do not contain any row in R_{k+1}. This, together with the definition of the priority of an interval, implies that π_{I_f}(c′) and π_{I_f}(c) have priorities no larger than k, that π_{I_f}(c″) has a priority equal to k+1, and that all the other intervals of List that are between π_{I_f}(c′) and π_{I_f}(c) have priorities greater than k+1. This, together with Proposition 5.1, implies that c″ ∈ {RightChild(c′), LeftChild(c)}. □

Proposition 5.3 Let r ∈ R_k, and let r′ be the successor of r in R_k. Let r ∈ π_{I_f}(c), and let r′ ∈ π_{I_f}(c′). Then the successor of r in R_{k+1} (call it r″) is in π_{I_f}(c″) where c″ ∈ {c, RightChild(c), LeftChild(c′), c′}.


Proof. Similar to that of Proposition 5.2, and therefore omitted. □

Now recall that, in the previous section, the searches done by a particular v in a leaf solution at I_f had the feature that, if at stage t node v ∈ T asked which column of I_f contains a certain row r ∈ R_k, then at stage t+1 it is asking the same question about r′, where r′ is either the successor or the predecessor of r in R_{k+1}. This, together with Propositions 5.2 and 5.3, implies that a processor can do the search (on Tree(f)) at stage t+1 in constant time, so long as it maintains, in addition to the node of Tree(f) that contains the current r, the two nodes of Tree(f) that contain the predecessor and (respectively) successor of r in R_k. These are clearly easy to maintain. □

We next explain how Problems 1 and 3 are handled.

5.2 Problems 1 and 3.

Right after stage t−1 is completed, v stores the following information (recall that the height of v in T is h). The fat columns f ∈ I(v) for which interval ψ_v(R_{t-1-h}, f) is not empty are stored in a doubly linked list. For each such fat column f, we store the following information: (i) the row interval ψ_v(R_{t-1-h}, f) = [α_1, α_2], and (ii) for row α_1 (resp., α_2), a pointer to the node, in v's copy of Tree(f), whose interval contains row α_1 (resp., α_2). In addition, v stores the following. Let z be the parent of v in T, and let s′ be the successor of Critical_z(R_{t-1-(h+1)}) in R_{t-(h+1)}, with s′ in ψ_v(R_{t-1-h}, g) for some g ∈ I(v). Then v has g marked as being its distinguished fat column, and v also stores a pointer to the node, in v's copy of Tree(g), whose interval contains s′. Of course, information similar to the above for v is stored in every node x of T (including v's children, u and w), with the height of x playing the role of h. In particular, in the formulation of Problem 1, the fat columns we called f′ and f″ are the distinguished fat columns of u and (respectively) w, and thus are available at u and (respectively) w, each of which also stores a pointer to the node containing s in its copy of Tree(f′) (for u) or of Tree(f″) (for w).

Assume for the time being that we are able to maintain the above information in constant time from stage t−1 to stage t. This would enable us to avoid Problem 1 because, instead of searching for the desired fat column f′ (resp., f″), node u (resp., w) already has it available as its distinguished fat column. Problem 3 would also be avoided, because now only the distinguished fat columns of u and w need to read from v the Critical_v(R_{t-h}) value (whereas previously all of the fat columns in I(u) ∪ I(w) read that value from v). It therefore suffices to show how to maintain, from stage t−1 to stage t, the above information (i.e., v's linked list and its associated pointers to the Tree(f)'s of its elements, v's distinguished fat column g, and the pointer to v's copy of Tree(g)). We explain how this is done at v.

First, v computes its Critical_v(R_{t-h}): since we know from stage t−1 the distinguished fat columns f′ and f″ of u and (respectively) w, and their associated pointers to u's copy of Tree(f′) and (respectively) w's copy of Tree(f″), v can compare the two relevant entries of matrix A (i.e., A(s, θ_{J(u)}(s)) and A(s, θ_{J(w)}(s))) and it can decide, based on this comparison, whether Critical_v(R_{t-h}) remains equal to Critical_v(R_{t-1-h}) or becomes equal to s, its successor in R_{t-h}. But since this is done in parallel by all v's, we must show that no two nodes of T (say, v and v′) try to access the same entry of matrix A. The reason this does not happen is as follows. If neither of {v, v′} is an ancestor of the other then no read conflict in A can occur between v and v′ because their associated columns (that is, J(v) and J(v′)) are disjoint. If one of v, v′ is an ancestor of the other, then no read conflict in A can occur between them because the rows they are interested in are disjoint (this is based on the observation that v is interested in rows in R_{t-h} − (R_0 ∪ ⋯ ∪ R_{t-1-h}) and hence will have no conflict with any of its ancestors).

As a side effect of the computation of Critical_v(R_{t-h}), v also knows which fat column j ∈ {f′, f″} is such that ψ_v(R_{t-h}, j) contains Critical_v(R_{t-h}). It uses its knowledge of j to update its linked list of nonempty fat columns (and their associated row intervals) as follows (we distinguish two cases; a sketch of case (1) in code follows the list):

1. j = f′. In that case v's new linked list consists of the portion of u's old (i.e., at stage t−1) linked list whose fat columns are < f′ (the row intervals associated with these fat columns are as in u's list at t−1), followed by f′ with an associated interval ψ_v(R_{t-h}, f′) equal to the portion of ψ_u(R_{t-h}, f′) up to and including Critical_v(R_{t-h}), followed by f″ with an associated interval ψ_v(R_{t-h}, f″) equal to the portion of ψ_w(R_{t-h}, f″) larger than Critical_v(R_{t-h}) if that portion is nonempty (if it is empty then f″ is not included in v's new linked list), followed by the portion of w's old (i.e., at stage t−1) linked list whose fat columns are > f″ (the row intervals associated with these fat columns are as in w's list at t−1).

2. j = f″. Similar to the first case, except that Critical_v(R_{t-h}) is now in the interval associated with f″ rather than f′.

It should be clear that the above computation of v's new linked list and its associated intervals can be implemented in constant time with |I(v)| processors (by copying the needed information from u and w, since these are at height h−1 and hence at stage t−1 already "knew" their information relative to R_{(t-1)-(h-1)} = R_{t-h}).

In either one of the above two cases (1) and (2), for each endpoint α of a ψ_v(R_{t-h}, f) in v's linked list, we must compute the pointer to the node, in v's copy of Tree(f), whose interval contains that α. We do it as follows. If f was not in v's list at stage t−1, then we obtain the pointer from u or w, simply by copying it (more specifically, if in u or w it points to a node of u's or w's copy of Tree(f), then the "copy" we make of that pointer is to the same node but in v's own copy of Tree(f)). On the other hand, if f was in v's list at stage t−1, then we distinguish two cases. In the case where α was also an endpoint of ψ_v(R_{t-1-h}, f), we already have its pointer (to v's copy of Tree(f)) available from stage t−1. If α was not an endpoint of ψ_v(R_{t-1-h}, f) then α is the predecessor or successor of an endpoint of ψ_v(R_{t-1-h}, f) in R_{t-h}, and therefore the pointer for α can be found in constant time (by using Propositions 5.2 and 5.3).

Finally, we must show how v computes its new distinguished fat column. It does so by first obtaining, from its parent z, the Critical_z(R_{t-(h+1)}) that z has just computed. The old distinguished fat column g stored at v had its ψ_v(R_{t-1-h}, g) containing the successor s′ of Critical_z(R_{t-1-(h+1)}) in R_{t-(h+1)}. It must be updated into a ĝ such that ψ_v(R_{t-h}, ĝ) contains the successor s″ of Critical_z(R_{t-(h+1)}) in R_{t+1-(h+1)}. We distinguish two cases.

1. Critical_z(R_{t-(h+1)}) = Critical_z(R_{t-1-(h+1)}). In that case s″ is the predecessor of s′ in R_{t+1-(h+1)}, and the fat column ĝ ∈ I(v) for which ψ_v(R_{t-h}, ĝ) contains s″ is either the same as g or it is the predecessor of g in the linked list for v at stage t (which we already computed). It is therefore easy to identify ĝ in constant time in that case.

2. Critical_z(R_{t-(h+1)}) = s′. In that case s″ is the successor of s′ in R_{t+1-(h+1)}, and the fat column ĝ ∈ I(v) for which ψ_v(R_{t-h}, ĝ) contains s″ is either the same as g or it is the successor of g in the linked list for v at stage t (which we already computed). It is therefore easy to identify ĝ in constant time in that case as well.

In either one of the above two cases, we need to also compute a pointer value to the node, in v's copy of Tree(ĝ), whose interval contains s″. This is easy if ĝ = g, because we know from stage t−1 the pointer value for s′ into v's copy of Tree(g), and the pointer value for s″ can thus be found by using Propositions 5.2 and 5.3 (since s″ is the predecessor or successor of s′ in R_{t-h}). So suppose ĝ ≠ g, i.e., ĝ is the predecessor or successor of g in v's linked list of nonempty fat columns at t. The knowledge of the pointer for s′ to v's copy of Tree(g) at t−1 is of little help in that case, since we now care about v's copy of Tree(ĝ) rather than Tree(g). What saves us is the following observation: s″ must be an endpoint of row interval ψ_v(R_{t-h}, ĝ). Specifically, s″ is the left (i.e., beginning) endpoint of ψ_v(R_{t-h}, ĝ) if ĝ is the successor of g in v's linked list at stage t (Case 2 above), otherwise it is the right endpoint of ψ_v(R_{t-h}, ĝ) (Case 1 above). Since s″ is such an endpoint, we already know the pointer for it to v's copy of Tree(ĝ) (because such pointers are available, in v's linked list, for all the endpoints of the row intervals in that linked list).

References

[1] A. Aggarwal and J. Park, "Parallel searching in multidimensional monotone arrays," to appear in J. of Algorithms. (A preliminary version appeared in Proc. 29th Annual IEEE Symposium on Foundations of Computer Science, 1988, pp. 497-512.)

[2] A. Aggarwal, M. M. Klawe, S. Moran, P. Shor, and R. Wilber, "Geometric Applications of a Matrix Searching Algorithm," Algorithmica, Vol. 2, pp. 209-233, 1987.

[3] A. Apostolico, M. J. Atallah, L. Larmore, and H. S. McFaddin, "Efficient Parallel Algorithms for String Editing and Related Problems," SIAM J. Comput., Vol. 19, pp. 968-988, 1990.

[4] M. J. Atallah, R. Cole, and M. T. Goodrich, "Cascading Divide-and-Conquer: A Technique for Designing Parallel Algorithms," SIAM J. Comput., Vol. 18, pp. 499-532, 1989.

[5] R. Cole, "Parallel merge sort," SIAM J. Comput., Vol. 17, pp. 770-785, 1988.

[6] W. Paul, U. Vishkin, and H. Wagener, "Parallel dictionaries on 2-3 trees," Proc. 10th Coll. on Automata, Languages, and Programming, LNCS 154, Springer, Berlin, 1983, pp. 597-609.
