
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-28, NO. 9, SEPTEMBER 1979

Time and Parallel Processor Bounds for Fortran-Like Loops

UTPAL BANERJEE, SHYH-CHING CHEN, MEMBER, IEEE, DAVID J. KUCK, MEMBER, IEEE, AND ROSS A. TOWLE

Abstract-The main goal of this paper is to show that a large number of processors can be used effectively to speed up simple Fortran-like loops consisting of assignment statements. A practical method is given by which one can check whether or not a statement is dependent upon another. The dependence structure of the whole loop may be of different types. For each type, a set of time and processor upper bounds is given. We also show how a loop can sometimes be transformed to change its dependence structure. Finally, we give a result on the possible splitting up of a given recurrence system into a number of smaller subsystems. These results can be used to modify and sometimes improve the bounds for the loops as demanded by special circumstances.

Index Terms-Analysis of programs, data dependence, Fortran-like loops, parallel computation, processor bounds, program speedup, recurrence relations, time bounds.

I. INTRODUCTION

TWO MAIN forces have led to faster and faster computers over the last 30 years. One has been faster hardware components and circuits. The other has been increased simultaneity of hardware operation by advances in machine organization. Whether or not these two forces continue indefinitely, at any given time machine designers have only a fixed set of circuit types to deal with. So, for the purposes of this paper, we will assume that circuit speeds are fixed, and we will study the problem of designing faster computer organizations subject to this constraint.

Since arithmetic algorithms now operate near their theoretical maximum speeds, we must look beyond arithmetic. For numerical programs, a fruitful option to consider is multioperation central processors (MCPU's) of some kind. Indeed, the fastest computers available for the past 10 years have had some kind of a multioperation CPU. Examples are the Burroughs B5500, B6700, and Illiac IV; the Control Data 6600, 7600, and Star; the Cray-1; the Goodyear Aerospace Staran IV; the IBM 360/91 and 370/195; the Texas Instruments ASC; and the Univac 1108 and 1110. This list includes traditional multiprocessors, multifunction arithmetic units, parallel processors, and pipeline processors with and without vector instruction sets; a very wide range of organizations indeed. And yet, if viewed as black boxes, all of these machines may be regarded as having some kind of

Manuscript received March 1, 1975; revised August 1, 1976 and March 1, 1978. This work was supported in part by the National Science Foundation under Grant MCS73-07980.

U. Banerjee and D. J. Kuck are with the Department of Computer Science, University of Illinois, Urbana, IL 61801.

S. C. Chen was with the Burroughs Corporation, Paoli, PA 19301. He is now with Cray Research Inc., Chippewa Falls, WI 54729.

R. A. Towle is with the Burroughs Corporation, Paoli, PA 19301.

multioperation CPU. One of the goals of this paper is to show that much larger numbers of processors than have been used traditionally may in fact be used efficiently in processing ordinary programs. This leads to much greater program speedups. Once we understand how to analyze and transform programs for such multioperation machines, then we can discuss how the machine should be organized.

Coupled with the emergence of multioperation CPU's have been a wide range of attempts to include some kind of simultaneity in software. These have ranged from explicit language features to attempts to extract simultaneity from programs written in ordinary languages. The latter approach was first taken some 10 years ago [21], [2]. More recently, comprehensive algorithms have been developed [23] and used to measure the parallelism in a number of ordinary programs [17], [14]. Also, compilers for multioperation machines have been implemented which use a limited class of these algorithms [25], [19], [8]. It has recently been possible to prove bounds on the time and number of processors required to evaluate various parts of programs [15], [5], [3]. A second goal of this paper is to generalize and integrate some of these ideas in an effort to outline some compiler algorithms for machines with large numbers of processors.

In this paper we will give time and processor upper bounds at a program level that is more comprehensive than has been considered before. We discuss entire Fortran DO loops consisting of assignment statements. Since loops usually tend to dominate programs, such bounds should be fairly representative of whole program bounds. Some of our results are based on rather long proofs which appear elsewhere. But our point is to show that program analysis algorithms exist which are capable of transforming most ordinary programs into a highly parallel form.

What is the purpose of studying such fast computation? Many answers may be given. First, computer speeds have been increasing for 30 years and will probably not stop until 1) hardware speed increases saturate and 2) organizational speed increases saturate. We are attempting to explain how the latter will occur. In [27] and [28] it was shown that real computer arithmetic operation times are close to their theoretical minimum. If one can show how to transform whole programs into a form which can be computed in a time close to the fastest possible time for these programs, and machine organizations are available to execute the resulting programs, little further speedup will be possible through different machine organizations. While we do not discuss lower bounds here at all, it may be seen that a number of our upper bounds on processing time are quite low in absolute terms.

Another motivation for this work is that technology is taking us into a new era, where 8- or 16-bit processors can presently be used as components. In the future, longer word length floating-point processors may be available as components. How should high-speed machine organization respond to the wide availability of LSI? We believe that if hundreds or thousands of microprocessors could be organized into a single CPU, methods of program transformation like those we discuss should be available to aid in machine design and compiler design.

The feasibility or reasonableness of multioperation CPU's can also be challenged on efficiency grounds, since not all of the arithmetic elements are operating at all times. However, if all of the arithmetic elements are regarded as part of one big processor, this is really a question about gate level efficiency. Some 20 years ago, most computers went from bit serial to bit parallel arithmetic operations at a greatly reduced gate level efficiency. We are suggesting a transition from word serial to word parallel processing of ordinary programs with another concomitant reduction in gate level efficiency. But two important factors should be noted. First, if we have processor level components, then gate level efficiency may not be an important measure. Secondly, in terms of speedup and cost/effectiveness, it may be observed [12] that, on the average, the program transformations we are advocating introduce no more inefficiency at the processor level than the transition from bit serial to bit parallel addition and multiplication algorithms did at the gate level. Thus, even if we want to measure gate level efficiency, in many cases higher overall efficiencies may be obtained by using slower, less parallel arithmetic operations together with faster algorithms that perform a number of operations at once.

Finally, we point out that many practical improvements can be made to the ideas discussed in this paper. We are necessarily discussing upper bounds on time and processors using rather general arguments. In practice, most programs will achieve better results than those stated in our bounds, and we illustrate this in some of the examples.

The paper is organized as follows. Section II prepares the ground by explaining the key terms and assumptions. The theorems on the time and processor bounds for loops are given in Section III. Section IV describes a dependence testing method. An overall algorithm for handling a given loop is presented in Section V, and a brief summary in Section VI.

II. BASIC CONCEPTS

Starting with some fundamental definitions, in this section we build the basic framework within which our results will be stated.

By a variable, we mean either a scalar variable or an element of an array. A constant in a given context is either a universal constant (e.g., 2, log2 5, etc.) or a variable whose value is fixed in that context. An atom is either a variable or a constant. An arithmetic expression is a well-formed string consisting of +, -, *, /, (, ), and atoms. (A precise recursive definition can be easily constructed.) A general arithmetic expression with e atoms will be denoted by E<e>.

An assignment statement is a program statement of the form x = E, where x is a variable and E is an arithmetic expression. For an assignment statement x = E, the length is the number of atoms in E, the output variable is x, and the input variables are all the variables in E. Such a statement is linear if E is of the form c + a1*x1 + a2*x2 + ... + ak*xk, where c and the a's evaluate to constants, and the x's are variables.

Our main object of study is a Fortran-like DO loop L of the form:

DO 2 I1 = 0, u1
DO 2 I2 = 0, u2
...
DO 2 Id = 0, ud
S1(I)
...
SN(I)
2 CONTINUE.

where I (to be called the index vector) stands for (I1, I2, ..., Id), and every Sr represents an assignment statement. We write Sr(I) to denote that Sr is a function of I, and Sr(i) when I takes on a particular value i. (If, for some k, Ik runs through the sequence {ak, ak + bk, ..., ak + uk*bk}, then we can always use a different variable I'k = (Ik - ak)/bk that has the sequence {0, 1, ..., uk}. This particular choice of the form of the ranges of I1, I2, ..., Id keeps our formulas simple without any loss in generality.) The loop L is linear if the assignment statements Sr are all linear.

Consider two statements Sr and St, not necessarily distinct. We say that St is dependent on Sr, and write Sr γ St, if there exist two values i and j of I such that

1) the statement Sr(i) is executed before the statement St(j) in the proper serial execution of the loop, and
2) one of the following holds:
a) the output variable of Sr(i) is an input variable of St(j);
b) the output variables of Sr(i) and St(j) are the same;
c) the output variable of St(j) is an input variable of Sr(i).

Thus, there are three kinds of dependence. The defining conditions, the name, and the notation for each kind are shown below.

1) and 2a): data dependence, Sr δ St;
1) and 2b): data output dependence, Sr δ° St;
1) and 2c): data antidependence, Sr δ̄ St.

In order to have a visual description of the dependence structure of the set {S1, S2, ..., SN}, we define a directed graph, the dependence graph (G, say), of the loop L as follows. The nodes of G correspond to the statements S1, S2, ..., SN, and there is an edge in G directed from Sr to St iff Sr γ St.


A chain in G is a sequence of nodes {Sq1, Sq2, ..., Sqk} such that Sq1 γ Sq2, Sq2 γ Sq3, ..., Sq(k-1) γ Sqk. Such a chain becomes a cycle if, in addition, we have Sqk γ Sq1. The graph G is cyclic if it has at least one cycle. The opposite of cyclic is acyclic. A maximal cycle is a cycle that is not a proper subset of another cycle. An isolated point is a node that does not belong to any cycle. We define a π-block to be either a maximal cycle or the set consisting of an isolated point. The set of all π-blocks is a partition of the set {S1, S2, ..., SN} into pairwise disjoint subsets, and it will be denoted by π.

A partial ordering Γ is defined on the set π as follows. For any two π-blocks πr and πt, we have πr Γ πt if either πr = πt or there is a chain {Sq1, Sq2, ..., Sqk} of statements such that Sq1 ∈ πr and Sqk ∈ πt. If πr Γ πt, we say that πt depends on πr. A sequence of π-blocks {π1, π2, ..., πk} forms a chain if π1 Γ π2, π2 Γ π3, ..., π(k-1) Γ πk. Note, however, that two or more distinct π-blocks can never form a cycle.

A set of π-blocks forms an antichain if no two members πr, πt of the set exist such that πr Γ πt. Let h denote the maximum number of elements in any chain of π-blocks. Then all the π-blocks can be arranged in a sequence of h pairwise disjoint rows (1st, 2nd, ..., hth) such that
1) each row is an antichain,
2) a block in the kth row will never depend on a block in the tth row, if k < t,
3) each block in the kth row depends on at least one block in the (k - 1)st row, if k > 1.
(This result is a slight variation of the dual form of Dilworth's theorem, and follows directly from a proof of the same (see [20]). Simply stated, the idea is as follows. Start with the initial set of π-blocks. The minimal elements of this set constitute the first row. Subtract this row from the original set. The minimal elements of the remainder form the second row, and so on. This procedure lasts through exactly h steps.)

The distributed loop Lr of a π-block πr is defined to be the loop obtained from L by deleting all of the assignment statements not in πr. The fact that we can arrange the π-blocks in the way described above leads naturally to the following principle.

The Principle of Loop Distribution

The same results will be obtained from a computation if, instead of executing a given loop L serially, we execute in parallel all the distributed loops for the π-blocks in the first row in the first step, then execute in parallel all the distributed loops for the π-blocks in the second row in the next step, and so on; the steps can overlap.

This principle was first introduced by Muraoka [23]. It is the basic tool by which we introduce parallelism in the execution of the loop L. We close this section with our first example.

Example 1: Consider the following loop L:

DO 2 I = 0, 100
S1: A(I) = D(I) ** 2
S2: B(I + 1) = A(I) * C(I + 1)
S3: C(I + 4) = B(I) + A(I + 1)
S4: X(I + 2) = X(I) + X(I + 1)
S5: C(I + 3) = D(I) - 5
2 CONTINUE.

It is easily seen that S1 γ S2, S2 γ S3, S3 γ S1, S3 γ S2, S3 γ S5, and S4 γ S4. To be more exact, we have S1 δ S2, S2 δ S3, S3 δ̄ S1, S3 δ S2, S3 δ° S5, and S4 δ S4. The dependence graph G of this loop is shown below.

[Figure: the dependence graph G of the loop in Example 1, with nodes S1-S5 and the edges listed above.]

There is only one isolated point S5; three cycles {S4}, {S2, S3}, {S1, S2, S3}; and two maximal cycles {S4}, {S1, S2, S3}. There are three π-blocks: π1 = {S1, S2, S3}, π2 = {S4}, and π3 = {S5}. And we have π1 Γ π3. These blocks can be arranged in two rows, where the first row consists of π1 and π2 and the second of π3. The distributed loops L1 of π1 and L2 of π2 should be executed simultaneously in the first step, and then the loop L3 of π3.
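The π-block and row construction of Example 1 can be carried out mechanically. The following is a minimal sketch of that construction in Python; it assumes the networkx package (not part of the paper) and uses the fact that π-blocks are exactly the strongly connected components of G, while the row of a block is one more than the longest chain of blocks leading into it.

```python
# A minimal sketch (assuming the networkx package) of pi-blocks and rows for
# Example 1: pi-blocks are strongly connected components of G; the row of a
# block is 1 + the longest chain of blocks leading into it.
import networkx as nx

edges = [("S1", "S2"), ("S2", "S3"), ("S3", "S1"),
         ("S3", "S2"), ("S3", "S5"), ("S4", "S4")]
G = nx.DiGraph(edges)

C = nx.condensation(G)                    # acyclic graph whose nodes are pi-blocks
row = {b: 1 for b in C.nodes}
for b in nx.topological_sort(C):          # longest-path layering over the blocks
    for succ in C.successors(b):
        row[succ] = max(row[succ], row[b] + 1)

for b in C.nodes:
    print("row", row[b], ":", sorted(C.nodes[b]["members"]))
# Expected: {S1, S2, S3} and {S4} in row 1, and {S5} in row 2, as in the text.
```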

III. TIME AND PROCESSOR BOUNDS FOR L

Here we show how to speed up the execution of a given loop L. We assume that any number of processors may be used simultaneously, and that each processor can add, subtract, multiply, and divide, although all active processors perform the same operation on each step. The smallest interval in which any of these operations can be done by any of the processors is taken to be the unit of time. (Different operation times for the arithmetic operators are explicitly considered in [10].) We will ignore memory activity, data alignment, and control unit times.

Let T[L] denote the number of time units and P[L] the number of processors needed to execute L according to some algorithm. Our aim is to minimize T[L] without putting any restrictions on P[L]. A typical theorem in this section shows that under certain conditions, an algorithm exists by which L can be computed in time T[L] less than or equal to a certain time bound, and the number of processors P[L] needed for this does not exceed a certain processor bound. These upper bounds are expressed in terms of T[E<e>], P[E<e>], T[R<n, m>], P[R<n, m>], and some parameters of L.

T[E<e>] is the time taken to evaluate an arithmetic expression E<e> with e atoms, and P[E<e>] is the corresponding number of processors. Efficient high-speedup results for T[E<e>] and P[E<e>] are summarized in Lemma 1.

Lemma 1:
1) T[E<e>] ≤ 4 log e with P[E<e>] ≤ 3(e - 1).
2) If d is the depth of parenthesis nesting in E, then
T[E<e>] ≤ log e + 2d + 1 with P[E<e>] ≤ ⌈(e - 2d)/2⌉.


3) If
$$E = \sum_{k=1}^{e} a_k, \qquad E = \prod_{k=1}^{e} b_k, \qquad \text{or} \qquad E = \sum_{k=1}^{e/2} a_k b_k,$$
then
T[E<e>] = ⌈log e⌉ with P[E<e>] = ⌈e/2⌉.

Proof: See [3] and [16]. (Note: throughout this paper, log x means log2 x.)
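To make part 3 of Lemma 1 concrete, the following is a minimal Python sketch (not from the paper) of pairwise fan-in reduction: summing e atoms takes ⌈log2 e⌉ steps if, on each step, all available pairs are combined at once, so that at most ⌈e/2⌉ processors are ever busy.

```python
# A minimal sketch of part 3 of Lemma 1: a sum of e atoms evaluated by
# pairwise fan-in takes ceil(log2 e) steps with at most ceil(e/2) processors
# busy on any one step (each pair is combined by its own processor).
def parallel_sum(values):
    steps = 0
    while len(values) > 1:
        pairs = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:               # an odd element waits for the next step
            pairs.append(values[-1])
        values, steps = pairs, steps + 1
    return values[0], steps

print(parallel_sum(list(range(1, 9))))    # (36, 3): e = 8 atoms, log2(8) = 3 steps
```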

T[R<n, m>] is the time and P[R<n, m>] the processor count needed to compute a linear recurrence system R<n, m>, 1 ≤ m ≤ n, according to some algorithm. Such a system is simply a set of linear equations of the following type:
$$x_k = 0 \quad (k \le 0)$$
$$x_k = c_k + \sum_{t=k-m}^{k-1} a_{kt} x_t \quad (1 \le k \le n),$$
where the c's and the a's are constants, and the x's are variables. The highest speedup results for T[R<n, m>] and P[R<n, m>] are given in Lemma 2.

Lemma 2:

T[R<n, m>] ≤ (2 + log m) log n - ½(log² m + log m)
with
P[R<n, m>] ≤ nm²/2 + O(mn) for m ≤ n, or
P[R<n, m>] ≤ n³/68 + O(n²) for m ≤ n.

Proof: See [4], [5], and [24].
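To pin down what an R<n, m> system is before using the bounds above, the following is a minimal Python sketch (with made-up coefficients, not from the paper) of the straightforward serial evaluation that any parallel algorithm, such as those behind Lemma 2, must reproduce.

```python
# A minimal sketch (hypothetical coefficients) of serial evaluation of an
# R<n, m> system: x_k = c_k + sum_{t=k-m}^{k-1} a_{k,t} x_t, with x_k = 0 for k <= 0.
def solve_serial(c, a, m):
    """c[k-1] is c_k for k = 1..n; a[(k, t)] is a_{k,t} (missing entries are 0)."""
    n = len(c)
    x = {k: 0.0 for k in range(1 - m, 1)}          # x_k = 0 for k <= 0
    for k in range(1, n + 1):
        x[k] = c[k - 1] + sum(a.get((k, t), 0.0) * x[t] for t in range(k - m, k))
    return [x[k] for k in range(1, n + 1)]

# x_k = 1 + x_{k-1} is an R<5, 1> system; it yields 1, 2, 3, 4, 5.
print(solve_serial([1.0] * 5, {(k, k - 1): 1.0 for k in range(1, 6)}, 1))
```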

Let us now list the parameters of the loop L which will appear in our theorems. The last entry is not needed for Theorems 1-3, and its definition is postponed.

A List of Parameters for the Loop L
1) M, the total number of iterations of L (M = (u1 + 1)(u2 + 1) ... (ud + 1));
2) N, the number of statements in L;
3) e, the maximum length of any (assignment) statement in L;
4) c, the maximum number of nodes (of the graph G) in a π-block;
5) h, the maximum number of elements in any chain of π-blocks;
6) z, the total number of maximal cycles in G;
7) Q, the maximum number of iterations over which a data dependence (δ) holds; and
8) y, to be defined in Theorem 4.

Theorem 1: Let the dependence graph G of loop L be acyclic. Then
T[L] ≤ h · T[E<e>]
with
P[L] ≤ (N - h + 1)M · P[E<e>].

Proof: Since there is no cycle, every π-block consists of an isolated point. There are h rows (antichains) of π-blocks. Since each row must have at least one of the N blocks, there cannot be more than (N - h + 1) blocks in any given row. All the M instances of the single statement in a given distributed loop can be executed in parallel. And, all the distributed loops for the blocks in a given row can be executed in parallel. Thus, for any row, the maximum time and processor count needed are T[E<e>] and (N - h + 1)M · P[E<e>], respectively. The theorem now follows. Q.E.D.

Corollary 1: Let the loop L be such that no two statements Sr, St can be found with Sr γ St. Then
T[L] ≤ T[E<e>] with P[L] ≤ MN · P[E<e>].

Proof: This follows immediately from Theorem 1, if we observe that h = 1.

Theorem 2: Let the loop L be linear and such that its dependence graph G is cyclic. Then
T[L] ≤ T[R<MN, MN>] + T[E<e>]
with
P[L] ≤ max {P[R<MN, MN>], MN · P[E<e>]}.

Proof: We consider the worst case, namely, when there is exactly one π-block. This means that the loop L cannot be broken up at all. If we write down all the instances of the N statements of L corresponding to its M iterations, then the resulting set of equations can be treated as an R<n, m> system with n = MN. To make the situation the worst possible, we also assume that m = MN. The bounds now follow, if we realize that some or all of the MN statements may need to be preprocessed, and that this preprocessing may be done in parallel in time T[E<e>] with MN · P[E<e>] processors. Q.E.D.

If more parameters are known about a linear loop with a cyclic dependence graph, we can prove the following.

Theorem 3: Let the loop L be linear and such that its dependence graph G is cyclic. Then
T[L] ≤ min{h, z} · T[R<n, m>] + h · T[E<e>]
with
P[L] ≤ max{P1, P2},
where
P1 = (N - h + 1)M · P[E<e>]
and
P2 = max{(z - h + 1), 1} · P[R<n, m>],
n = cM
and
m = cQ + c - 1.

Proof: We show first that the distributed loop for the worst maximal cycle can be treated as an R<n, m> system. Such a loop has c statements. Suppose that the instances of these c statements corresponding to all the M iterations have been written down in the proper order. Then we have a system of n = cM linear equations; for each iteration there is a group of c equations, and there are M such groups. Take the output variable x1 of a statement in the kth group. In order to compute x1 we might need to know the value of the output variable x2 of an earlier statement. In the worst possible case, x1 is the output variable of the cth statement of the kth group, and x2 is that of the first statement of the (k - Q)th group. Thus, we need never go back more than cQ + (c - 1) equations. Hence, the worst (i.e., the largest) value of m is cQ + c - 1.

Consider now the distributed loop of any π-block. If the block is a maximal cycle, then the time to execute its loop does not exceed T[R<n, m>] + T[E<e>], where the second term gives the maximum preprocessing time. If the block consists of an isolated point, then this time bound is simply T[E<e>]. There are h rows of π-blocks and z maximal cycles. For each row we first need a time of T[E<e>] units, which may be used to preprocess a maximal cycle, or to execute the distributed loop of an isolated point. The time bound for L now follows from the fact that we need never have to compute serially the loops for more than min {h, z} maximal cycles.

Since there are N statements in the loop L and h rows of π-blocks, we will never get more than (N - h + 1) statements, if all the π-blocks in any row are combined. The total number of iterations being M, the number of processors needed to preprocess all the instances of all the statements in any row never exceeds
P1 = (N - h + 1)M · P[E<e>].
After preprocessing, we need only worry about the distributed loops for those π-blocks that are maximal cycles. In the worst case, all z of these cycles are in one row. However, the time bound for T[L] will remain unchanged, if we compute the z loops as follows. If z ≤ h, then compute the loops serially. If z > h, then compute (z - h + 1) loops in parallel and the other (h - 1) loops serially. This way we will never need more than
P2 = max {(z - h + 1), 1} · P[R<n, m>]
processors in the second stage of the game. The processor bound is thus max {P1, P2}. This completes our proof. Q.E.D.

In the four previous results of this section, we accepted the loop L in its given form and derived certain bounds based on its properties. Now we will show how one can sometimes get better bounds by transforming L into a loop with a simpler dependence structure.

Suppose that the dependence graph G of L is acyclic and that Sr γ St always means Sr δ St. Every atom on the right-hand side of St that causes the dependence Sr δ St is said to define a δ-arc from Sr to St. There may be more than one δ-arc directed from a given Sr to a given St. A fixed input variable of St can define several such δ-arcs, one for each atom corresponding to that variable. Thus, when we talk about δ-arcs, we are splitting the edges of G and looking at the dependence relationship a little more closely. The meaning of a chain of δ-arcs is clear. We say that G is tree-connected if there exists at most one chain of δ-arcs between any two of its nodes.

Let us now define statement substitution for acyclic graphs.

Take any two statements Sr and St such that the π-block {Sr} is in the first row, the π-block {St} is in the second, and Sr δ St.

Let Srk and Stk denote the instances of Sr and St, respectively, corresponding to the kth iteration of the loop (1 ≤ k ≤ M). It will happen at least once that an input variable (x1, say) of some Stq is the same as the output variable (x2, say) of some Srk, where k ≤ q for r < t and k < q otherwise. (Recall the definition of data dependence given in Section II.) Replace x1 by the right-hand side expression of Srk. When all such replacements that can be made have been made, we have obtained statement substitution for the pair Sr, St. By treating all such pairs Sr, St we achieve statement substitution for the first two rows of π-blocks. Now treat all other pairs of rows in a similar way in the following order: (1st, 3rd), (1st, 4th), ..., (1st, hth), (2nd, 3rd), ..., ((h - 1)st, hth). (h is the number of rows.) When this process ends, we have achieved complete statement substitution for the loop L.

We illustrate statement substitution by using two very simple loops.

Example 2: We see that S1 δ S2 in the loop

DO 2 I = 0, 100
S1: A(I) = B(I) * C(I)
S2: D(I) = A(I) + A(I) * A(I - 1)
2 CONTINUE.

In fact, there are three δ-arcs from S1 to S2. Statement substitution transforms this loop into the following:

DO 2 I = 0, 100
S'1: A(I) = B(I) * C(I)
S'2: D(I) = B(I) * C(I) + B(I) * C(I) * B(I - 1) * C(I - 1)
2 CONTINUE.

And now S'1 and S'2 are independent of each other.

Consider next the loop

DO 2 I = 0, 100
S1: A(2 * I) = B(I) * C(I)
S2: D(I) = A(I) + A(I) * A(I - 1)
2 CONTINUE.

As before, we have S1 δ S2 and there are three δ-arcs from S1 to S2. After statement substitution we get a loop that can be broken into the following two independent loops:

DO 2 I = 0, 100, 2
S1: A(2 * I) = B(I) * C(I)
S'21: D(I) = B(I/2) * C(I/2) + B(I/2) * C(I/2) * A(I - 1)
2 CONTINUE

and

DO 2 I = 1, 99, 2
S1: A(2 * I) = B(I) * C(I)
S'22: D(I) = A(I) + A(I) * B((I - 1)/2) * C((I - 1)/2)
2 CONTINUE


The statements S'21 and S'22 together constitute the statement S'2, the transformed version of S2. And, as before, in each loop the two statements are independent of each other.
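The substitution step itself can be carried out symbolically. The following is a minimal Python sketch for the first loop of Example 2; it assumes the SymPy library (not part of the paper) and simply replaces every occurrence of A(...) on the right-hand side of S2 by the right-hand side of S1 at the matching index, as the δ-arcs dictate.

```python
# A minimal sketch (assuming SymPy is available) of statement substitution for
# the first loop of Example 2: S1: A(I) = B(I)*C(I);  S2: D(I) = A(I) + A(I)*A(I-1).
import sympy as sp

I = sp.Symbol('I', integer=True)
A, B, C = sp.Function('A'), sp.Function('B'), sp.Function('C')

s1_rhs = lambda i: B(i) * C(i)          # right-hand side of S1 at index i
s2_rhs = A(I) + A(I) * A(I - 1)         # right-hand side of S2

s2_new = s2_rhs.replace(A, s1_rhs)      # substitute along every delta-arc
print(s2_new)   # B(I)*C(I) + B(I)*C(I)*B(I-1)*C(I-1), i.e., the new S'2
```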

Theorem 4: Let the loop L have an acyclic graph G and be such that the only possible dependence relationship between any pair of statements is that of data dependence (δ). Let y denote the maximum number of δ-arcs from any statement. By statement substitution, L can be transformed into a loop L' that can be computed in time T[L'] with P[L'] processors, where
T[L'] ≤ T[E<N'e>],
P[L'] ≤ MN · P[E<N'e>],
and
N' = N, if L is tree-connected;
N' = (N - h + 2)y^(h-1), otherwise.

Proof: For any given r such that 1 ≤ r ≤ N, let er denote the length of the statement Sr in L, S'r the statement in L' that corresponds to Sr, and e'r the length of S'r. Let e' denote the maximum length of an assignment statement in L'. Since no two statements S'r, S't of L' can be found such that S'r γ S't, this theorem would follow from Corollary 1, if we can show that e' ≤ N'e. The important thing to remember is that if a statement St is at the end of a chain of δ-arcs from a statement Sr, then St has (er - 1) atoms in e't due to Sr and this chain alone.

Suppose first that L is tree-connected. Then there is at most one chain of δ-arcs from any Sr to any St. Hence, we have
$$e'_t \le e_t + \sum_{\substack{r=1 \\ r \ne t}}^{N} (e_r - 1) \le Ne \quad (1 \le t \le N),$$
so that
e' ≤ N'e.

Now assume that L is not tree-connected. Since there is no cycle in G, every π-block consists of an isolated point. There cannot be more than h elements in a chain of π-blocks. Hence, there cannot be more than h statements in a chain of δ-arcs. And, at most y chains of δ-arcs can come out of a statement Sr. Thus, for a given statement St and in the worst possible case, there are (N - h + 1) statements, each generating y^(h-1) chains to St, one statement generating y^(h-2) chains, one y^(h-3) chains, and so on. This means that
$$e'_t \le e_t + [(N-h+1)y^{h-1} + y^{h-2} + y^{h-3} + \cdots + y](e-1) \le e[1 + (N-h+1)y^{h-1} + y^{h-2} + y^{h-3} + \cdots + y] \le e(N-h+2)y^{h-1} \quad (1 \le t \le N),$$
so that
e' ≤ N'e. Q.E.D.

So far, we have assumed that an unlimited number of processors are available. This, of course, is never true in practice. We will now show how one can sometimes reduce the number of processors by a technique called loop freezing.

Given a loop L, let us consider the following sets of loops for k = 2, 3, ..., d:

DO 2 Ik = 0, uk
...
DO 2 Id = 0, ud
S1(I)
...
SN(I)
2 CONTINUE.

For a fixed k, there are [(u1 + 1) ... (u_(k-1) + 1)] loops in the set, which we denote by L(k)(I1, ..., I_(k-1)), each corresponding to a definite set of values of I1, ..., I_(k-1). Let G(k)(I1, ..., I_(k-1)) denote the dependence graph of L(k)(I1, ..., I_(k-1)). And suppose that for a fixed k, each of the loops L(k)(I1, ..., I_(k-1)) can be executed in time T(k)[L] with P(k)[L] processors. The next lemma follows immediately, if we observe that the original loop L can be executed serially with respect to any number of consecutive index variables starting with I1.

Lemma 3: Take any fixed k such that 1 < k ≤ d. Then the loop L can be executed in time
T[L] ≤ (u1 + 1) ... (u_(k-1) + 1) · T(k)[L]
with
P[L] = P(k)[L] processors.

The usefulness of this lemma for some loops lies in the fact that by not doing everything in parallel we are saving on the number of processors, and at the same time we may reduce the total time of execution. The following example illustrates loop freezing and also the application of some of the earlier theorems of this section.

In order that we may effectively compare the time bounds corresponding to different numbers of processors, we introduce a term Sp, called the speedup of the p-processor calculation over a uniprocessor. Sp (Sp[L], to be more exact) is defined by
Sp = T1/Tp,
where, for any p, Tp (= Tp[L]) is the time bound for the execution of L using p processors. Clearly, Sp ≥ 1.

Example 3: Consider the following loop L:

DO 2 I1 = 0, 9
DO 2 I2 = 0, 9
DO 2 I3 = 0, 9
S1: A(I1, I2 + 1, I3) = B(I1, I2, I3 + 2) * C(I1, I2) + U * V
S2: B(I1 + 1, I2, I3) = A(I1, I2, I3 + 6) * D(I2, I3)
2 CONTINUE.

The dependence graphs of all the loops L(3) are the same. The dependence graphs of all the loops L(2) are also the same.


The three graphs G, G(2), and G(3) are shown below:

[Figure: G is a cycle on S1 and S2; G(2) has a single edge from S1 to S2; G(3) has no edges.]

We will execute the original loop L in four different ways.

Case 1: Execute L serially with respect to I1, I2, I3, using only one processor. There are four operations for each of the 1000 iterations, so that the time needed is
T1[L] = 4000.

Case 2: Execute L serially with respect to I1, I2 and in parallel with respect to I3.
Applying Corollary 1 to the loops L(3) with graph G(3), we get
T(3)[L] ≤ T[E<4>] and P(3)[L] ≤ 20 P[E<4>],
since e = 4, M = 10, N = 2 (see the list of parameters in the earlier part of this section for their definitions). Lemma 1 then yields
T(3)[L] ≤ 2 and P(3)[L] ≤ 40.
Hence, by Lemma 3,
T[L] ≤ 10 · 10 · 2 = 200
and
P[L] ≤ 40.
The speedup in this case is given (since T[L] = T40[L]) by
S40 = T1[L]/T40[L] ≥ 20.

Case 3: Execute L serially with respect to I1, and in parallel with respect to I2, I3.
Applying Theorem 1 to the loops L(2) with graph G(2), we get
T(2)[L] ≤ 2 T[E<4>] and P(2)[L] ≤ 100 P[E<4>],
since h = 2, e = 4, N = 2, and M = 100. Lemma 1 then yields
T(2)[L] ≤ 4 and P(2)[L] ≤ 200.
Hence, by Lemma 3,
T[L] ≤ 10 · 4 = 40
and
P[L] ≤ 200.
The speedup in this case is given by
S200 = T1[L]/T200[L] ≥ 100.

Case 4: Execute L in parallel with respect to I1, I2, I3.
Applying Theorem 3 to the loop L with graph G, we get
T[L] ≤ T[R<2000, 197>] + T[E<4>]
and
P[L] ≤ max {2000 P[E<4>], P[R<2000, 197>]},
since h = 1, z = 1, N = 2, M = 1000, c = 2, and Q = 98. (To find Q, consider the relation S2 δ S1 and look at the values (0, 0, 2) and (1, 0, 0) of I.) By Lemmas 1 and 2, we have
T[L] ≤ 76
and
P[L] ≤ P,
where P is approximately 4 · 10^7.
The speedup in this case is given by
Sp = T1[L]/Tp[L] ≥ 52.
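The Case 4 numbers can be checked by plugging the parameters into Lemma 2. The short Python sketch below is one reading of that arithmetic; it assumes the logarithms are rounded up to integers before being combined, which is not stated in the paper but reproduces the quoted figure.

```python
# A minimal sketch of the Case 4 arithmetic, assuming the logarithms in the
# Lemma 2 time bound are rounded up to integers before being combined.
from math import ceil, log2

def t_recurrence(n, m):                      # Lemma 2 time bound for R<n, m>
    lm, ln = ceil(log2(m)), ceil(log2(n))
    return (2 + lm) * ln - (lm * lm + lm) // 2

t_case4 = t_recurrence(2000, 197) + 2        # plus T[E<4>] = 2 for preprocessing
print(t_case4)                               # 76, the bound quoted in the text
```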

Clearly, we should choose Case 3 for our computation.

There are a number of results that provide the time bound for the execution of an R<n, m> system when a limited number of processors are available. We refer to [6] and [13].

IV. DEPENDENCE TESTING

In this section we develop a method by which one can test whether or not there is a dependence relationship (γ) between two statements of the loop L. The dependence graph G of L can be built up by using this method repeatedly. The basic idea is to determine if two variables are identical. When the variables are both scalars, the problem is trivial. So we concentrate on array variables, and consider the most frequently occurring case, namely, where the arrays are one-dimensional and the subscripts are linear functions of the index variables. The following notation is used for convenience:
q+ = max {q, 0}, q- = max {-q, 0},
where q is any integer.

Theorem 5: Consider any two statements Sr and St of the loop L. Let A(f(I)) be a variable of Sr and A(g(I)) a variable of St, such that at least one of the variables is the output variable of the corresponding statement, A is a one-dimensional array, and
$$f(I) = a_0 + \sum_{k=1}^{d} a_k I_k, \qquad g(I) = b_0 + \sum_{k=1}^{d} b_k I_k,$$
where the a's and the b's are integer constants.

If Sr γ St and if the dependence is caused by A(f(I)) and A(g(I)), then the following two conditions hold.

Condition 1: gcd(a1, ..., ad, b1, ..., bd) divides (b0 - a0), if this gcd is not zero, and


Condition 2:
$$-b_l - \sum_{k=1}^{l-1} (a_k - b_k)^- u_k - (a_l^- + b_l)^+ (u_l - 1) - \sum_{k=l+1}^{d} (a_k^- + b_k^+) u_k \;\le\; b_0 - a_0 \;\le\; -b_l + \sum_{k=1}^{l-1} (a_k - b_k)^+ u_k + (a_l^+ - b_l)^+ (u_l - 1) + \sum_{k=l+1}^{d} (a_k^+ + b_k^-) u_k,$$
where l is a positive integer such that

Condition 3:
1 ≤ l ≤ d + 1 if r < t,
1 ≤ l ≤ d if r > t.
(Note that a_(d+1) = b_(d+1) = u_(d+1) = 0, by definition.)

Proof: Assume that this particular dependence holds. Then there must exist two values i = (i_1, ..., i_d) and j = (j_1, ..., j_d) of the index vector I such that

Condition 4: the statement Sr(i) appears before the statement St(j) in the serial execution of the loop, and

Condition 5: f(i) = g(j).

In order that Condition 4 may hold, there must exist an integer l satisfying Condition 3 and such that

Condition 6:
i_k = j_k, k = 1, 2, ..., l - 1 (for l > 1),
i_l ≤ j_l - 1 (for l ≤ d).

Using the expressions for f and g, we get from Condition 5 that

Condition 7:
$$b_0 - a_0 = \sum_{k=1}^{d} a_k i_k - \sum_{k=1}^{d} b_k j_k.$$

This implies Condition 1 of the theorem. In order to derive Condition 2, we rewrite Condition 7 as

Condition 8:
$$b_0 - a_0 = \sum_{k=1}^{l-1} (a_k i_k - b_k j_k) + (a_l i_l - b_l j_l) + \sum_{k=l+1}^{d} (a_k i_k - b_k j_k)$$

and consider the following three cases.

Case 1: Take any k such that 1 ≤ k ≤ l - 1. We have
a_k i_k - b_k j_k = (a_k - b_k) j_k, by Condition 6,
≤ (a_k - b_k)^+ u_k, since 0 ≤ j_k ≤ u_k.
(To see this, just consider separately what happens when (a_k - b_k) ≥ 0 and when it is < 0.)

Case 2: Take k = l. We have
a_l i_l - b_l j_l ≤ a_l^+ (j_l - 1) - b_l j_l = -b_l + (a_l^+ - b_l)(j_l - 1), since 0 ≤ i_l ≤ j_l - 1,
≤ -b_l + (a_l^+ - b_l)^+ (u_l - 1), since 0 ≤ j_l - 1 ≤ u_l - 1.

Case 3: Take any k such that l + 1 ≤ k ≤ d. We have
a_k i_k - b_k j_k ≤ a_k^+ u_k + b_k^- u_k = (a_k^+ + b_k^-) u_k, since 0 ≤ i_k ≤ u_k and 0 ≤ j_k ≤ u_k.

Combining the results from all three cases, we get from Condition 8 that
$$b_0 - a_0 \le -b_l + \sum_{k=1}^{l-1} (a_k - b_k)^+ u_k + (a_l^+ - b_l)^+ (u_l - 1) + \sum_{k=l+1}^{d} (a_k^+ + b_k^-) u_k,$$
which is the second part of Condition 2. The first part can be obtained similarly if we multiply Condition 8 by -1 and proceed as before.

This completes the proof of the theorem. Q.E.D.

We will see in a moment that the conditions of Theorem 5 are not sufficient, in general. However, as the next result indicates, there is a special case in which they are sufficient.

Theorem 6: The Conditions 1 and 2 of Theorem 5 are both necessary and sufficient, when
a_k = q·α_k, b_k = q·β_k (k = 1, 2, ..., d),
where q is a positive integer and
α_k, β_k ∈ {-1, 0, 1} (k = 1, 2, ..., d).

Proof: See Theorem 4.3 in [1].

Example 4: Consider the loop

DO 2 I = 0, 2
S1: A(4 - 3 * I) = B(I)
S2: C(I) = A(2 * I)
2 CONTINUE

Let us see if S2 δ S1. Here we have r = 2, t = 1, a0 = 0, a1 = 2, b0 = 4, and b1 = -3. The only possible value of l is 1. And the Conditions 1 and 2 of Theorem 5 become

Condition 1: gcd(2, -3) divides 4,
and
Condition 2: 3 - (0 - 3)+ · 1 ≤ 4 ≤ 3 + (2 + 3)+ · 1,

both of which clearly hold. However, we do not have S2 δ S1, since two values i and j of I cannot be found, such that
0 ≤ i < j ≤ 2 and 2i = 4 - 3j.
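Example 4 is easy to check mechanically. The following is a minimal Python sketch (not from the paper) that applies Condition 1, the gcd test, and then brute-forces the iteration space to confirm that the test is necessary but not sufficient here.

```python
# A minimal sketch of Condition 1 (the gcd test) together with a brute-force
# dependence check for Example 4: f(I) = 2*I (from S2), g(I) = 4 - 3*I (from S1),
# 0 <= I <= 2.
from functools import reduce
from math import gcd

a0, a = 0, [2]          # f(I) = a0 + a1*I
b0, b = 4, [-3]         # g(I) = b0 + b1*I
u = 2                   # loop bound

g = reduce(gcd, (abs(k) for k in a + b))
print("Condition 1 holds:", g == 0 or (b0 - a0) % g == 0)   # True

# Is there really a pair 0 <= i < j <= u with f(i) == g(j)?
dep = any(a0 + a[0] * i == b0 + b[0] * j
          for i in range(u + 1) for j in range(u + 1) if i < j)
print("Actual dependence:", dep)    # False, so Condition 1 is only necessary
```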




Example 5: Consider the loop

DO 2 I1 = 0, 10
DO 2 I2 = 0, 5
DO 2 I3 = 0, 8
S1: A(-40 + 4 * I1 - 8 * I2) = E1
S2: A(48 - 6 * I1 + 4 * I2 + 16 * I3) = E2
2 CONTINUE.

Suppose that we want to test if S1 δ S2. Condition 1 of Theorem 5 reduces to
gcd(4, -8, 0, -6, 4, 16) divides 88,
which is true. For l = 1, Condition 2 gives the inequality
-182 ≤ 88 ≤ 96,
which also holds. In this situation, we will say that it is probably true that S1 δ S2. [As a matter of fact, it is true; just look at S1(10, 0, 0) and S2(10, 3, 0).]

If the constant term in the subscript of the output variable of S2 is changed from 48 to 49, then Condition 1 would ensure that we do not have S1 δ S2. The same result would follow if it is changed from 48 to 68. Then none of the four inequalities of Condition 2 (corresponding to the four permissible values 1, 2, 3, 4 of l) will be satisfied. We omit the details.

We will end this section with a few remarks.

Remarks

1) Theorem 5 can be extended to the case where the subscript functions f and g are arbitrary polynomials in the index variables. The interested reader is referred to [1].

2) So far, we have worked only with one-dimensional arrays. Suppose now that the array A (in Theorem 5) is multidimensional. Let ND denote the number of dimensions of A, and UBk, LBk the upper and lower bounds of the kth dimension. Then A can be mapped onto a one-dimensional array A' in a one-to-one fashion by
$$A(t_1, t_2, \ldots, t_{ND}) \to A'(\sigma(t_1, t_2, \ldots, t_{ND})),$$
where
$$\sigma(t_1, t_2, \ldots, t_{ND}) = \sum_{k=1}^{ND} (t_k - LB_k)\,\alpha_k$$
with
$$\alpha_k = \begin{cases} (UB_{k+1} - LB_{k+1} + 1) \cdots (UB_{ND} - LB_{ND} + 1) & \text{if } k \le ND - 1 \\ 1 & \text{if } k = ND. \end{cases}$$
After this transformation, we can apply Theorem 5.
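The mapping in Remark 2 is ordinary row-major linearization. The following is a minimal Python sketch of it (the function name sigma and the sample bounds are illustrative, not from the paper).

```python
# A minimal sketch of the mapping in Remark 2: a multidimensional subscript is
# linearized (row-major) into a single subscript, after which the
# one-dimensional test of Theorem 5 applies to the array A'.
def sigma(t, LB, UB):
    ND = len(t)
    alpha = [1] * ND
    for k in range(ND - 2, -1, -1):       # alpha_k for k <= ND - 1
        alpha[k] = alpha[k + 1] * (UB[k + 1] - LB[k + 1] + 1)
    return sum((t[k] - LB[k]) * alpha[k] for k in range(ND))

# A 2 x 3 array with bounds 1..2 and 1..3: its six elements map to 0..5.
print([sigma((i, j), (1, 1), (2, 3)) for i in (1, 2) for j in (1, 2, 3)])
```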

V. AN ALGORITHM FOR THE EXECUTION OF L

In this section, we describe an algorithm by which any given loop L can be executed, exploiting the parallelism present in the loop for high speedup.

The Loop Algorithm
1) Construct the dependence graph G of L.
2) Partition G into its π-blocks.
3) Establish the partial ordering Γ between these blocks.
4) Arrange these blocks in a hierarchy of rows as described in Section II.
5) Execute the distributed loops of all the blocks in the first row simultaneously in the first step. Then execute the distributed loops of all the blocks in the second row simultaneously in the second step, and so on.
6) Any distributed loop with an acyclic dependence graph can be executed with the time and processors of Theorem 1. If the graph is cyclic, Theorem 3 holds in case the loop is linear, and the loop is processed sequentially otherwise.
7) One or more of the following techniques may be applied with profit:
a) loop freezing,
b) the wave front method,
c) the splitting lemma,
d) loop interchanging.

Loop freezing has already been described in Section III. Let us now describe the other three.

Wave Front Method

For loops containing cyclic dependence graphs arising from variables with more than one subscript, the wave front method may be useful. The method sweeps through a multidimensional space operating simultaneously on all variables along some wave front. This is done with high efficiency. The method is very useful, for example, in partial differential equation solving, where R<n, m> systems arise which are inefficient to solve by Lemma 2 or by Lemma 4 (because, e.g., m1 = n/2 and m2 = 1). However, the method is of no value for program variables with single subscripts. The wave front method was introduced in [23] and extended in [26]; for more discussion, see [13].
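To make the sweep order concrete, the following is a minimal Python sketch on a made-up two-dimensional recurrence (not taken from the paper): since each element needs only its left and upper neighbors, every element on one anti-diagonal can be computed at the same time.

```python
# A minimal sketch (hypothetical 2-D recurrence) of the wave front method:
# x[i][j] needs x[i-1][j] and x[i][j-1], so all elements on an anti-diagonal
# i + j = w are independent and could be computed simultaneously.
n = 4
x = [[0.0] * (n + 1) for _ in range(n + 1)]       # zero boundary values

for w in range(2, 2 * n + 1):                     # sweep the wave fronts in order
    for i in range(max(1, w - n), min(n, w - 1) + 1):
        j = w - i                                 # every (i, j) on this front is
        x[i][j] = 1.0 + x[i - 1][j] + x[i][j - 1] # independent of the others

print(x[n][n])   # same value as a serial sweep; only the evaluation order changed
```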

The following lemma shows how an R<n, m> system can sometimes be broken up into a number of smaller subsystems. This usually leads to a considerable improvement in the time and processor bounds.

Lemma 4 (The Splitting Lemma): Consider an R<n, m> system defined by
$$x_k = 0 \quad (k \le 0)$$
$$x_k = c_k + a_{k,k-m_1} x_{k-m_1} + a_{k,k-m_2} x_{k-m_2} + \cdots + a_{k,k-m_s} x_{k-m_s} \quad (1 \le k \le n),$$
where
$$n \ge m = m_1 > m_2 > \cdots > m_s \ge 1.$$
If g = gcd(m_1, m_2, ..., m_s), then this system is equivalent to g independent systems R<n_1, m/g>, R<n_2, m/g>, ..., R<n_g, m/g>, where
$$\lfloor n/g \rfloor \le n_t \le \lceil n/g \rceil \quad (1 \le t \le g).$$

Proof: Define a family of sets
F = {F_0, F_1, ..., F_(g-1)}
by
$$F_t = \{x_\alpha \mid \alpha \equiv t \ (\mathrm{mod}\ g),\ 1 \le \alpha \le n\} \quad (0 \le t \le g - 1).$$
For any given α in 1 ≤ α ≤ n, there exists a unique t such that 0 ≤ t < g and α ≡ t mod g, so that x_α belongs to a unique member of F. Hence, F is a partition of the set {x_1, x_2, ..., x_n}.

Consider now an arbitrary set F_t. It consists of the variables x_α, where α runs through the values t (if t > 0), t + g, t + 2g, ..., the last member of the sequence being either t + (⌊n/g⌋ - 1)g or t + ⌊n/g⌋g. (Find the upper bound on the integer variable q from the inequality t + qg ≤ n.) Considering all cases, we find that the number of variables in F_t is n_t, where
$$\lfloor n/g \rfloor \le n_t \le \lceil n/g \rceil.$$

To see that the equations corresponding to the variables in F_t form an independent recurrence system, take any fixed x_α in F_t. For the computation of x_α, we need to know the values of x_(α-m_1), ..., x_(α-m_s). Again, for the computation of x_(α-m_1), the values of x_(α-2m_1), ..., x_(α-m_1-m_s) are needed, and so on. All the variables whose values must be known (directly or indirectly) before we can compute x_α are of the form x_β, where β = α - b_1 m_1 - ... - b_s m_s, such that the b's are nonnegative integers and 1 ≤ β ≤ n. Since x_α ∈ F_t, we have α ≡ t mod g. But, clearly, we also have β ≡ α mod g. Hence, β ≡ t mod g, so that x_β ∈ F_t. Thus, the variables in F_t can be computed independently of all other variables.

Finally, that the tth independent recurrence system is an R<n_t, m/g> system follows from the second paragraph, and the fact that the subscripts of the successive variables in F_t increase in steps of g. Q.E.D.
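The partition used in the lemma is just a grouping of indices by residue class modulo g. The short Python sketch below illustrates this on a hypothetical R<12, 6> system with back-distances 4 and 6 (the system and the function name split_indices are illustrative, not from the paper).

```python
# A minimal sketch of the splitting lemma: if the only back-distances that
# occur are m1, ..., ms, the indices 1..n split into g = gcd(m1, ..., ms)
# residue classes, and each class is an independent R<n_t, m/g> subsystem.
from functools import reduce
from math import gcd

def split_indices(n, strides):
    g = reduce(gcd, strides)
    classes = [[k for k in range(1, n + 1) if k % g == t] for t in range(g)]
    return g, classes

g, classes = split_indices(12, [4, 6])      # hypothetical R<12, 6> system
print(g, [len(c) for c in classes])         # 2 independent subsystems of 6 equations
```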

Let us illustrate loop interchanging by an example.

Example 6: Consider the following loop L

DO 2 I1 = 0, 9
DO 2 I2 = 0, 9
DO 2 I3 = 0, 9
S1: A(I1, I2) = A(I1, I2) + B(I1, I3) * C(I3, I2)
2 CONTINUE.

This is equivalent to an R<1000, 1> system. By a rearrangement of the index variables, we get the loop L'

DO 2 I3 = 0, 9
DO 2 I1 = 0, 9
DO 2 I2 = 0, 9
S1: A(I1, I2) = A(I1, I2) + B(I1, I3) * C(I3, I2)
2 CONTINUE.

This is equivalent to an R<1000, 100> system of the form:
$$x_k = 0 \quad (k \le 0)$$
$$x_k = c_k + a_{k,k-100}\, x_{k-100} \quad (1 \le k \le 1000).$$
By the splitting lemma, this system can be broken up into 100 independent R<10, 1> systems. Lemma 2 implies that L' can be executed much faster than L. Since these two loops yield the same final results, we will be better off using L' than L.

This technique of loop interchanging cannot, of course, be applied to all loops, since a rearrangement of the index variables usually changes the given loop drastically.

Finally, we would like to mention that a method for handling loops containing IF statements is being developed, the details of which will appear in a subsequent paper.

VI. SUMMARY

We have tried to make two main points in this paper.
1) Algorithms for transforming (i.e., compiling) programs written in standard Fortran-like languages into a highly parallel form are now quite well developed.
2) The algorithms can be used as the basis for proving bounds on the time and processors required to evaluate whole loops.

These bounds are based on a few simple, easy to measure program parameters. Generally, the more parameters we have, the sharper the bounds that can be obtained.

We are using these bounds in our new analyzer to estimate the effectiveness of the algorithms for compiling. With our previous analyzer [14], we studied some programs with loop limits set to ten or less. Our conclusion was that they could be run on a 32 processor machine, achieving an average efficiency of better than 30 percent. We anticipate better results using the improved algorithms reported here.

REFERENCES

[1] U. Banerjee, "Data dependence in ordinary programs," M.S. thesis, Dept. Comput. Sci., Univ. Illinois, Urbana-Champaign, IL, Rep. 76-837, Nov. 1976.
[2] H. W. Bingham, E. W. Riegel, and D. A. Fisher, "Control mechanisms for parallelism in programs," Burroughs Corp., Paoli, PA, Rep. ECOM-02463-7, 1968.
[3] R. P. Brent, "The parallel evaluation of general arithmetic expressions," J. Ass. Comput. Mach., vol. 21, pp. 201-206, Apr. 1974.
[4] S. C. Chen, "Speedup of iterative programs in multiprocessor systems," Ph.D. dissertation, Dept. Comput. Sci., Univ. Illinois, Urbana-Champaign, IL, Rep. 75-694, Jan. 1975.
[5] S. C. Chen and D. J. Kuck, "Time and parallel processor bounds for linear recurrence systems," IEEE Trans. Comput., vol. C-24, pp. 701-717, July 1975.
[6] S. C. Chen, D. J. Kuck, and A. H. Sameh, "Practical parallel band triangular system solvers," Ass. Comput. Mach. Trans. Math. Software, vol. 4, pp. 270-277, Sept. 1978.
[7] S. C. Chen and A. H. Sameh, "On parallel triangular system solvers," in Proc. 1975 Sagamore Comput. Conf. Parallel Processing. New York: IEEE, 1975, pp. 237-238.
[8] W. L. Cohagan, "Vector optimization for the ASC," in Proc. 7th Annu. Princeton Conf. Inform. Sci. Syst., 1973, pp. 169-174.
[9] D. E. Knuth, "An empirical study of Fortran programs," Dept. Comput. Sci., Stanford Univ., Stanford, CA, Rep. CS-186, 1970.
[10] P. W. Kraska, "Parallelism exploitation and scheduling," Ph.D. dissertation, Dept. Comput. Sci., Univ. Illinois, Urbana-Champaign, IL, Rep. 72-518, June 1972.
[11] D. Kuck, "Multioperation machine computational complexity," in Proc. Symp. Complexity Sequent. Parallel Numerical Algorithms. New York: Academic, 1973, pp. 17-47.

[12] , "On the speedup and cost of parallel computation," in TheComplexity ofComputational Problem Solving, R. S. Anderssen and R.P. Brent, Eds. St. Lucia, Queensland, Australia: University of

By the splitting lemma, this system can be broken up into

Proof: Define a family of sets

F= {Fo,F, , Fg_ }

by

Ft = {xa | a = t mod g, 1 < o < n}

669

Queensland'Press, 1976, pp. 63-78.


[13] -, "A survey of parallel machine organization and programming,"Ass. Comput. Mach. Comput. Surveys, vol. 9, pp. 29-59, Mar. 1977.

[14] D. Kuck, P. Budnik, S. C. Chen, E. Davis, Jr., J. Han, P. Kraska, D.Lawrie, Y. Muraoka, R. Strebendt, and R. Towle, "Measurements ofparallelism in ordinary Fortran programs," Computer., vol. 7, pp.37-46. Jan. 1974.

[15] D. Kuck and K. Maruyama, "Time bounds on the parallel evaluationof arithmetic expressions," SIAM J. Comput., vol. 4, pp. 147-162,June 1975.

[16] D. Kuck and Y. Muraoka, "Bounds on the parallel evaluation ofarithmetic expressions using associativity and commutativity," ActaInformatica, vol. 3, pp. 203-216, 1974.

[17] D. Kuck Y. Muraoka, and S. C. Chen, "On the number of operationsstimultaneously executable in Fortran-like programs and their result-ing speedup," IEEE Trans. Comput., vol. C-21, pp. 1293-1310, Dec.1972.

[18] H. T. Kung, "New algorithms and lower bounds for the parallelevaluation of certain rational expressions," presented at the Proc. 6thAnnu. Ass. Comput. Mach. Symp. Theory Comput., Apr. 1974.

[19] L. Lamport, "The parallel execution of Do loops," Commun. Ass.Comput. Mach., vol. 17, pp. 83-93, Feb. 1974.

[20] C. L. Liu, "Topics in combinatorial mathematics," presented at theMAA summer seminar, Williams College, Williamstown, MA, 1972.

[21] D. Martin and G. Estrin, "Experiments on models of computationsand systems," IEEE Trans. Electron. Comput., vol. EC-16, pp. 59-69,Feb. 1967.

[22] L. J. Mordell, Diophantine Equations. New York: Academic, 1969.[23] Y. Muraoka, "Parallelism exposure and exploitation in programs,"

Ph.D. dissertation, Dept. Comput. Sci., Univ. Illinois, Urbana-Champaign, IL, Rep. 71-424, Feb. 1971.

[24] A. H. Sameh and R. P. Brent, "Solving triangular systems on a paral-lel computer," SIAM J. Num. Analysis, vol. 14, pp. 1101-1113, Dec.1977.

[25] J. F. Thorlin, "Code generation for PIE (parallel instruction execu-tion) computers," in Spring Joint Comput. Conf., vol. 30, pp. 641-643,1967.

[26] R. Towle, "Control and data dependence for program transforma-tions," Ph.D. dissertation, Dept. Comput. Sci., Univ. Illinois,Urbana-Champaign, IL, Rep. 76-788, Mar. 1976.

[27] S. Winograd, "On the time required to perform addition," J. Ass.Comput. Mach., vol. 12, pp. 277-285, Apr. 1965.

[28] S. Winograd, "On the time required to perform multiplication," J.Ass. Comput. Mach., vol. 14, pp. 793-802, Oct. 1967.

Utpal Banerjee received the M.Sc. degree in applied mathematics from Calcutta University, Calcutta, India, in 1963, and the Ph.D. degree in mathematics from Carnegie-Mellon University, Pittsburgh, PA, in 1970. He is presently completing the Ph.D. program in computer science at the University of Illinois, Urbana.

From 1969 to 1975 he was an Assistant Professor of Mathematics at the University of Cincinnati, Cincinnati, OH. His present research interests are in the area of computer architecture-software organization and machine design.

Shyh-Ching Chen (S'74-M'74) was born on February 1, 1944. He received the B.S.E.E. degree from the National Taiwan University, Taipei, Taiwan, in 1966, the M.S.E.E. degree from Villanova University, Villanova, PA, in 1972, and the Ph.D. degree from the Department of Computer Science, University of Illinois, Champaign, IL, in 1975.

After participating for several years in research efforts at the Digital Computer Laboratory, University of Illinois, studying theoretical and empirical problems of parallel processing, he joined Burroughs Corporation, Paoli, PA, from 1975 to 1978, in the development of the Burroughs Scientific Processor, a superscale and general-purpose computer with array architecture. In 1979, he joined Cray Research, Inc., Chippewa Falls, WI, in the development of new and enhanced products of the Cray-1 processor, a superscale and general purpose computer with pipelined architecture. His current interest is the coherent design of hardware, software, and applications for advanced high-performance computer systems.

David J. Kuck (S'59-M'69) was born in Muskegon, MI, on October 3, 1937. He received the B.S.E.E. degree from the University of Michigan, Ann Arbor, in 1959, and the M.S. and Ph.D. degrees from Northwestern University, Evanston, IL, in 1960 and 1963, respectively.

From 1963 to 1965 he was an Assistant Professor of Electrical Engineering at the Massachusetts Institute of Technology, Cambridge. In 1965 he joined the Department of Computer Science, University of Illinois, Urbana, where he is now a Professor. Currently, his research interests are in the coherent design of hardware and software systems. His recent computer systems research has included theoretical studies of upper bounds on computation time, empirical analysis of real programs, and the design of high-performance processing, switching, and memory systems for classes of computation ranging from numerical to nonnumerical. The latter work includes the study of interactive information retrieval systems, from the point of view of both the computer and the user.

Dr. Kuck was a Ford Postdoctoral Fellow from 1963 to 1965. He is a member of several professional societies and has served as a consultant to a number of government agencies as well as computer manufacturers and users. He has served as the Associate Editor for Computer Architecture and Systems of the IEEE TRANSACTIONS ON COMPUTERS, as well as an IEEE Computer Society Distinguished Visitor, and an Association for Computing Machinery National Lecturer. Currently, he is an editor of the International Journal of Computer and Information Sciences as well as the Association for Computing Machinery Transactions on Database Systems. He is the author of The Structure of Computers and Computations, vol. I, and an editor of Proceedings of the Conference on High Speed Computer and Algorithm Organization.

Ross A. Towle was born in Hammond, IN, on March 7, 1950. He received the B.S. and M.S. degrees in mathematics and the Ph.D. degree in computer science from the University of Illinois, Urbana, IL, in 1972, 1973, and 1976, respectively.

Since 1976 he has been with the Burroughs Scientific Processor (BSP) Plant, Paoli, PA. He is currently Section Leader for the Compiler Department Section for the AFP and BSP. He is the codesigner of the Burroughs Vector Fortran language.
