
Journal of VLSI Signal Processing, 10, 295-310 (1995). © 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Resource Constrained Scheduling of Uniform Algorithms

LOTHAR THIELE
Swiss Federal Institute of Technology Zurich (ETHZ), Computer Engineering and Communication Networks Lab. (TIK), Gloriastrasse 35, CH-8092 Zurich, Switzerland

Abstract. A method for optimizing the schedule and allocation of uniform algorithms onto processor arrays is derived. The main results described in this paper are: (1) single (integer) linear programs are given for the optimal schedule of regular algorithms with and without resource constraints, (2) the class of algorithms is extended by allowing certain non-convex index domains, (3) efficient branch and bound techniques are used such that problems of relevant size can be solved. Moreover, additional constraints such as cache memory, bus bandwidths and access conflicts can also be considered. The results are applied to an example of relevant size.

1 Introduction

This paper deals with techniques for mapping regular algorithms onto massively parallel architectures. First, some remarks on different mathematical approaches to the design of (piecewise) regular architectures are given. The potential of VLSI for massively parallel computations was first recognized by Kung and Leiserson [1]. The class of architectures they proposed is characterized by massively parallel computation, intensive multilevel pipelining, a local interconnection scheme, distributed memory, and distributed computing power. The term systolic array has been introduced for regular mesh connected synchronous processor arrays consisting of identical processing elements that perform a time invariant processor function, see e.g. [1]-[4].

About the same time, synthesis methods for VLSI processor arrays received much attention. In principle, behavioral descriptions are transformed into functions that distribute operations over time and space. The first pioneering result was obtained by Kuhn [5], who used linear index transformations for relating algorithms and their systolic implementation. This result was later rediscovered by Moldovan [6]. The extensions of Quinton [7], Miranker et al. [8], and Capello et al. [9] to this work resulted in the theory of regular iterative algorithms as proposed by Rao [10], [11]. He linked the results obtained so far to the work of Karp, Miller, and Winograd [12] on uniform recurrence equations and introduced considerable extensions to the synthesis methodology. These methods are used increasingly to parallelize loop programs for massively parallel architectures, see e.g. [13], [14].

Scheduling and allocation are among the most important problems to be solved in mapping algorithms onto either domain specific hardware or programmable massively parallel architectures, since the operations of a program are assigned to control steps and processing elements. In case of unlimited resources, scheduling tries to parallelize the operations in order to minimize the overall execution time of an algorithm. On the other hand, if only limited resources are available, scheduling procedures attempt to serialize some operations to meet resource constraints.

If unlimited computing resources are available, efficient solutions to the scheduling problem are known. In the acyclic case, there is a close relation to the PERT/CPM scheduling problem as described in [15]. Either all activities are scheduled as early as possible (ASAP) or as late as possible (ALAP). Moreover, solutions to certain scheduling problems can be obtained by solving the longest path problem on the given signal flow graph. The weights associated to its edges correspond to execution times of operations. In [16], a unified approach to the scheduling of signal flow graphs without resource constraints is given. The approach is based on a circuit transformation called retiming: registers are added or removed without changing the functional behavior of the circuit. There are many interpretations of this transformation scheme as the transformation of registers is equivalent to changing the schedule of operations. In particular, in [16] polynomial time algorithms are proposed for the solution of the following scheduling problems: clock period minimization and register minimization. Again, there is a close relation to the solution of the longest path problem on a weight transformed signal flow graph.


Rao [10] converted the framework of [16] into a linear programming setting and extended the results to the scheduling of regular processor arrays. The computational complexity of the corresponding optimization problems is independent of the size of the algorithms or arrays. In [17], the optimal scheduling of systolic arrays has been reduced to the solution of a single (integer) linear program. Many other results are available on minimizing the overall computation time, e.g. [18].

If finite resources are taken into account, scheduling and allocation determine the cost/speed trade-off of the architecture. Very often, the required throughput does not necessitate a complete parallelization of the algorithm. Moreover, there may be modules available which are specialized to certain classes of operations. Unfortunately, most of the resource constrained scheduling problems are known to be NP-complete, see e.g. [19], [20]. Numerous heuristic methods have been proposed to find a compromise between the run-time of the scheduling algorithm and the quality of the solution, see [21]. List scheduling [22] and force-directed scheduling [23] improve the results of simple longest-path procedures by using a global criterion for selecting the operation which is scheduled next. Moreover, it is possible to model the scheduling and allocation problem as an integer linear program (ILP), see e.g. [24]-[27]. In [24], ILPs for the following problems have been derived: time constrained scheduling, resource constrained scheduling, scheduling using pipelined arithmetic units (functional pipelining), and scheduling of loop programs or signal flow graphs (loop folding). The solution of the corresponding ILPs yields optimal solutions to the resource and time constrained scheduling problems. This approach has been generalized in [28] to the synthesis of heterogeneous multiprocessor systems including memory and bus design. In [29], hierarchical methods to solve the resource constrained scheduling problem for uniform recurrence equations are proposed. Here, at first a conventional scheduling and allocation for programs with acyclic data dependencies is performed. Then the results are used in an affine-by-statement scheduling of the uniform algorithm.

In this paper, the concepts described in [17], [24], and [30] are combined so as to yield the following results:

• Single (integer) linear programs are given for the optimal schedule of regular algorithms with and without resource constraints.

• The class of algorithms used in [17] is extended by allowing certain non-convex index domains, i.e. linearly bounded lattices [30], [31].

• Efficient branch and bound techniques are used such that problems of relevant size can be solved.

Moreover, additional constraints such as cache memory, bus bandwidths and access conflicts can also be considered. To this end, the approach taken in [28] can be adopted.

2 Informal Introduction of the Problem

In this section, two instances of the problem solved in this paper are described. At first, the following algorithm

x1[i, j] = f1(x2[i − 1, j])
x2[i, j] = f2(x1[i, j], x3[i, j − 1])
x3[i, j] = f3(x1[i, j])    (1)

is considered, where the three equations must be executed for all index points (i, j) ∈ I within the index domain I = {(i, j): 0 ≤ i ≤ N ∧ 0 ≤ j ≤ M}. The computations can be visualized by means of the dependence graph shown in Fig. 1. In particular, for each index point (i, j) ∈ I there are three vertices corresponding to f1, f2 and f3, respectively. The edges represent the data dependencies of the algorithm. The reduced dependence graph of this algorithm as defined in Definition 3 is shown in Fig. 2. To each variable of the algorithm there is associated one node. The edges and their labels correspond to the data dependencies and their index displacements, respectively. For example, the edge from vertex 3 to vertex 2 represents the fact that variable x2 is computed using variable x3. The label (0 1)^t characterizes the index displacement, i.e. x2[i, j] directly depends on x3[i − 0, j − 1].

Now, a linear allocation of the operations to processing elements is carried out, i.e. all operations in the direction of a projection direction v are computed within one processing element of a linear processor array. If v = (1 0)^t is chosen, the processor array shown in Fig. 1 is obtained.

Moreover, linear scheduling functions will be used. The calculation of a variable x_l[I] is finished at time instance

t_l[I] = πI + γ_l ∈ Z

Schedule a: Now let us suppose that each operation takes 1 time unit and that each processing element contains unlimited computing resources.


Fig. 1. Dependence graph for N = M = 2. The labels associated to the nodes correspond to the two schedules a/b.

Fig. 2. Reduced dependence graph.


Then a feasible schedule satisfies

t1[i, j] − t2[i − 1, j] ≥ 1
t2[i, j] − t1[i, j] ≥ 1
t2[i, j] − t3[i, j − 1] ≥ 1
t3[i, j] − t1[i, j] ≥ 1

One such schedule is

π = (2 1),  γ1 = 0,  γ2 = 1,  γ3 = 1

The time potentials t_l[I] are associated to the nodes of the dependence graph in Fig. 1. The total evaluation time is T_a = t3[N, M] − t1[0, 0] = 2N + M + 1. Obviously, only two operations are executed simultaneously within each processing element. Therefore, only two modules are necessary. Their occupation with operations is shown in Fig. 3. The iteration interval, i.e. the number of time instances between two successive iterations, is λ_a = 2. Methods to determine a time-optimal linear schedule are well known, see e.g. [17].
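Schedules of this kind are easy to verify numerically. The following minimal sketch (illustrative only, not from the paper; it assumes the values of schedule a quoted above) checks the four dependence inequalities and recomputes the total evaluation time:

```python
# Numeric check of schedule a: t_l[i,j] = pi . (i,j)^t + gamma_l.
N, M = 10, 20
pi = (2, 1)
gamma = {1: 0, 2: 1, 3: 1}

def t(l, i, j):
    return pi[0] * i + pi[1] * j + gamma[l]

# the four dependence constraints of algorithm (1), guarded at the border
for i in range(N + 1):
    for j in range(M + 1):
        if i > 0:
            assert t(1, i, j) - t(2, i - 1, j) >= 1
        assert t(2, i, j) - t(1, i, j) >= 1
        if j > 0:
            assert t(2, i, j) - t(3, i, j - 1) >= 1
        assert t(3, i, j) - t(1, i, j) >= 1

print(t(3, N, M) - t(1, 0, 0))   # total evaluation time 2N + M + 1 = 41
```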

Usually, there is only a limited number of available modules within each processing element. For example, let us suppose that there is only one module capable of executing functions f1, f2 and f3. A feasible schedule is

Schedule b: π = (3 0),  γ1 = 0,  γ2 = 2,  γ3 = 1

which yields T_b = 3N + 1, λ_b = 3.

The schedule and bar chart are shown in Figs. 1 and 3, respectively. It is interesting to note that schedule b is faster than schedule a if M > N, despite the fact that schedule a uses one more module within the processing elements and that the iteration interval λ_b is larger than λ_a.

It should be obvious that it is not sufficient to consider a time-optimal schedule for one iteration only, i.e. without taking into account the dependencies across iterations and between processing elements. The optimal solution to the resource constrained scheduling problem requires the consideration of the complex relation between intra- and inter-processor communication.


Fig. 3. Bar charts corresponding to schedules a and b (T_a = 2N + M + 1, λ_a = 2; T_b = 3N + 1, λ_b = 3).

Fig. 4. Partially unfolded reduced dependence graph for partitioning.


Another example for finite resources is the partitioning problem, see e.g. [32] and the references therein. The following example is influenced by the results presented in [33]. Let us suppose that for the given algorithm only a limited number of processing elements is available, and each processing element has a limited number of modules only. Then one possibility would be to assign several rows of the dependence graph shown in Fig. 1 to one processing element. If two adjacent rows are assigned to one PE, the reduced dependence graph shown in Fig. 4 must be mapped onto a linear array of half as many processing elements. Again, the problem to be solved consists in assigning operations 1, 2, 3, 1', 2', 3' to modules and time instances such that the overall computation time of the algorithm is minimized. In general, a partial unfolding of the reduced dependence graph must be carried out. The resulting reduced dependence graph represents all operations which are to be assigned to one processing element. This can be done for more general tilings of the dependence graph in a similar way.

It is possible to add additional limited resources such as

• available cache memory, see [28],

• limited communication bandwidth within a PE,

• access conflicts to memory and buses, and

• limited number and bandwidth of inter-processor links.

Some remarks concerning these obvious but important extensions are given in the final section.


Fig. 5. Examples of linearly bounded lattices.

3 Regular Algorithms, Scheduling and Allocation

3.1 Regular Algorithms

The class of algorithms we are dealing with is related to uniform recurrence equations [12] or regular iterative algorithms [10]. But, similar to the class of piecewise regular algorithms, see [30], [31], the index domain can be a linearly bounded lattice. Therefore, more general input specifications for the scheduling procedure are allowed. In particular, this generalization is necessary if the regular algorithm to be scheduled results from previous design transformations such as partitioning. Moreover, there are input specifications which naturally lead to non-convex index domains, see the final example of this paper.

DEFINITION 1 (Regular Algorithm). A regular algorithm consists of a set of |V| quantified indexed equations S_l[I]

S_1[I] ... S_l[I] ... S_|V|[I]    ∀I ∈ I

where I ∈ Z^r is the index vector and the linearly bounded lattice I ⊂ Z^r is the index domain. Each equation S_l[I] is of the form

x_l[I] = F_l(... x_k[I − d_kl] ...)

where x_l[I] are indexed variables, F_l are arbitrary functions and d_kl ∈ Z^r are constant dependence vectors.

The example used in the previous section is a regular algorithm.

Finally, the class of index domains, i.e. linearly bounded lattices, see [30], [31], needs to be defined.

DEFINITION 2 (Linearly Bounded Lattice). A linearly bounded lattice is an index space of the form

I = {I: I = Lκ + p ∧ Aκ ≥ b ∧ κ ∈ Z^l}

where L ∈ Z^{r×l}, p ∈ Z^r, A ∈ Z^{m×l} and b ∈ Z^m.

Obviously, {κ: Aκ ≥ b ∧ κ ∈ Z^l} defines the set of all integer vectors within a polytope. The polytope is characterized by a set of linear inequalities. This set is mapped onto I using an affine function, i.e. I = Lκ + p. Throughout this paper, we require that the matrix L is square and invertible. The index domain of the final example is a linearly bounded lattice of the form given above, see Fig. 10. The following example serves to clarify the above notations.

EXAMPLE 1. Let us suppose that an index set I1 is defined by L = E, where E denotes the unity matrix, and p = 0. Then we have I1 = {I: I = κ ∧ Aκ ≥ b ∧ κ ∈ Z^l}, or equivalently, I1 = {I: AI ≥ b ∧ I ∈ Z^l}. For I = (i j)^t and suitable A ∈ Z^{4×2}, b ∈ Z^4 (i.e. m = 4), the index set shown on the left hand side of Fig. 5 is obtained. Moreover, Fig. 5 also shows the linearly bounded lattice I2 = {i: i = κ1 + 3κ2 ∧ κ2 ≤ κ1 ≤ 4 ∧ κ2 ≥ 1}. It is obvious that I2 is not necessarily a lattice or a convex set.
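The gaps in I2 can be made visible by brute-force enumeration; a minimal illustrative sketch:

```python
# Enumerate I2 = {i : i = k1 + 3*k2, k2 <= k1 <= 4, k2 >= 1} from
# Example 1; the gaps (e.g. 11, 14, 15) show that I2 is neither a
# lattice nor a convex set.
points = sorted({k1 + 3 * k2
                 for k2 in range(1, 5)       # k2 >= 1 and k2 <= k1 <= 4
                 for k1 in range(k2, 5)})
print(points)   # [4, 5, 6, 7, 8, 9, 10, 12, 13, 16]
```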

In order to describe the new scheduling procedures it is useful to define the reduced dependence graph, see e.g. [10], [12].

DEFINITION 3 (Reduced Dependence Graph). The reduced dependence graph DG = (V, E, D) associated to a regular algorithm as defined above is defined as follows: To each variable x_l[I] there is associated a node v_l ∈ V. There is an edge (v_k, v_l) ∈ E with distance vector d_kl if the variable x_l directly depends on x_k via the dependence vector d_kl.

The reduced dependence graph can also be represented algebraically. To this end we will use the node-arc incidence matrix C ∈ {0, 1, −1}^{|V|×|E|} (Note 1) and the distance matrix D ∈ Z^{r×|E|} whose column vector associated to an edge (k, l) is d_kl.

The notations described above will be explained by the example introduced in the previous section.

EXAMPLE 2. The example is concerned with the algorithm

x1[i, j] = f1(x2[i − 1, j])    ∀(i, j) ∈ I
x2[i, j] = f2(x1[i, j], x3[i, j − 1])    ∀(i, j) ∈ I
x3[i, j] = f3(x1[i, j])    ∀(i, j) ∈ I

where I = {(i, j): 0 ≤ i ≤ N ∧ 0 ≤ j ≤ M}. Obviously, the index domain is a linearly bounded lattice with

L = (1 0; 0 1)    p = 0    A = (1 0; 0 1; −1 0; 0 −1)    b = (0; 0; −N; −M)

The incidence and distance matrices corresponding to the reduced dependence graph shown in Fig. 2 are

C = (−1 −1 1 0; 1 0 −1 1; 0 1 0 −1)    D = (0 0 1 0; 0 0 0 1)

where the columns are ordered according to (1, 2), (1, 3), (2, 1), (3, 2).
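Constructing C and D from the edge list is mechanical; a minimal sketch (not from the paper) for the data above:

```python
# Build the node-arc incidence matrix C and the distance matrix D of
# Example 2. The column of C for edge (k, l) is e_l - e_k (Note 1);
# the column of D is the dependence vector d_kl.
edges = [(1, 2), (1, 3), (2, 1), (3, 2)]
dist = {(1, 2): (0, 0), (1, 3): (0, 0), (2, 1): (1, 0), (3, 2): (0, 1)}
V, r = 3, 2

C = [[0] * len(edges) for _ in range(V)]
for col, (k, l) in enumerate(edges):
    C[l - 1][col] += 1
    C[k - 1][col] -= 1

D = [[dist[e][row] for e in edges] for row in range(r)]
print(C)   # [[-1, -1, 1, 0], [1, 0, -1, 1], [0, 1, 0, -1]]
print(D)   # [[0, 0, 1, 0], [0, 0, 0, 1]]
```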

The allocation and scheduling of operations to processing elements and time instances, respectively, closely follow the principles known from systolic array design, see e.g. [5], [6], [34].

3.2 Allocation to Processing Elements

The allocation of operations to processing elements is determined by a projection direction v ∈ Z^r. Then for any given index vector I0 ∈ I the variables x_l[I0 + αv] are computed in one processing element for all 1 ≤ l ≤ |V| and for all α ∈ Q such that I0 + αv ∈ I. In this definition v determines a direction only (Note 2). It will be useful for the derivation of the scheduling procedure to determine a corresponding projection vector u, unique up to its sign, with the following properties:

DEFINITION 4 (Projection Vector). Given a linearly bounded lattice according to Definition 2 and a projection direction v ∈ Z^r. Then the corresponding projection vector u satisfies

1. u has the same direction as v, i.e. u = βv ∈ Z^r for some β ∈ Q.

2. There exists an index vector I0 ∈ I such that I0 + u ∈ I.

3. For all I0 ∈ I the relation I0 + γu ∈ I with γ ∈ Q ∧ |γ| < 1 holds for γ = 0 only.

EXAMPLE 3. Let us suppose that I = {i: i = 2κ ∧ 0 ≤ κ ≤ 10} and v = 1. If u = v = 1 then condition 2 is not satisfied as all i0 are even but all i0 + u are odd. If u = 4v = 4, then condition 3 is not satisfied as e.g. for i0 = 4 we have i0 + (1/2)u = 6 ∈ I. Finally, if u = 2v = 2 is chosen, all conditions are satisfied.

Intuitively, Definition 4 requires that the length of u is not too small, i.e. there must be at least two index vectors in the index domain whose distance is u. On the other hand, u is not too large, i.e. there are no two index vectors whose distance has the direction of v but is smaller than u. It remains to give a procedure which enables the determination of a projection vector given the corresponding direction.

THEOREM 1. Given a projection direction v ∈ Z^r. Then a projection vector according to the above Definition 4 satisfies

u = βv,    β = |det(L)| / gcd{[adj(L)v]_i : 1 ≤ i ≤ r}

where L^{-1} = (1/det(L)) adj(L), det(L) ∈ Z and adj(L) ∈ Z^{r×r}.

PROOF. Condition 1 is satisfied as u = βv. Let us suppose that I0 = Lκ0 + p for some κ0 ∈ Z^r. Then condition 2 is satisfied if ∃κ1 ∈ Z^r such that I0 + u = Lκ1 + p. Therefore, condition 2 is satisfied if ∃κ ∈ Z^r such that Lκ = u. Indeed, κ = L^{-1}u is integral as

L^{-1}u = L^{-1}βv = adj(L)v / gcd{[adj(L)v]_i : 1 ≤ i ≤ r}

Moreover, because of the gcd-operation in the denominator, all other vectors u′ of smaller length, i.e. u′ = γu with |γ| < 1, lead to a non-integral κ′. Therefore, condition 3 is satisfied too. □
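As an illustration, Theorem 1 can be evaluated mechanically. The following minimal sketch (an illustrative reconstruction, not code from the paper) assumes L is small, integral, square and invertible, and uses NumPy only to obtain det(L) and L^{-1}:

```python
# Projection vector according to Theorem 1:
#   u = beta * v,  beta = |det(L)| / gcd{ [adj(L) v]_i : 1 <= i <= r }.
from fractions import Fraction
from math import gcd
import numpy as np

def projection_vector(L, v):
    L = np.array(L, dtype=np.int64)
    det = round(np.linalg.det(L))
    adj = np.round(det * np.linalg.inv(L)).astype(np.int64)   # adjugate
    g = gcd(*[int(x) for x in adj @ np.array(v)]) or 1
    beta = Fraction(abs(det), g)
    return [int(beta * x) for x in v]

print(projection_vector([[2]], [1]))   # Example 3: I = {2k}, v = 1 -> u = [2]
```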


3.3 Scheduling

Corresponding to the design of systolic arrays we will use a linear scheduling function. Therefore, the calculation of a variable x_l[I] for I ∈ I is finished at time instance

t_l(I) = πI + γ_l

where t_l(I) ∈ Z, π ∈ Q^r and γ_l ∈ Z.

The iteration interval λ is an important quantity of the scheduling procedures which will be described next. It is closely related to the previously defined projection vector u and scheduling parameter π.

DEFINITION 5 (Iteration Interval). The iteration interval λ of an allocated and scheduled uniform algorithm is the number of time instances between the evaluation of two successive instances of a variable within one processing element.

Using the definitions given above, the iteration interval can be computed according to the following theorem.

THEOREM 2. The iteration interval λ of an allocated and scheduled uniform algorithm is

λ = |πu|

PROOF. The computations of variables x_l[I0 + αu] assigned to one processing element are finished at times t_l = πI + γ_l = πI0 + απu + γ_l. Two neighbouring instances, i.e. x_l[I0 + αu] and x_l[I0 + (α + 1)u], have a distance in time of Δt = ±πu. As the iteration interval is positive, we have λ = |πu|. □

Now, the problem which we are going to solve can be characterized as follows: Given a regular algorithm according to Definition 1 and an allocation, i.e. a projection direction v, determine under various implementation models an optimal scheduling function which minimizes the time interval between the first and last calculation.

In order to simplify the notation, we will suppose that the index domains are linearly bounded lattices of the form I = {I: I = Lκ ∧ Aκ ≥ b ∧ κ ∈ Z^r}, i.e. in contrast to Definition 2 we suppose that p = 0. It can be seen that this simplification does not restrict the generality. If the given index domain does not satisfy p = 0, then at first a schedule with π̃, γ̃_l and t̃_l(Ĩ) = π̃Ĩ + γ̃_l for the shifted index domain Ĩ = I − p can be determined. If we set π = π̃ and γ_l = γ̃_l − π̃p, we find an equivalent scheduling for the original problem and vice versa, i.e. t_l(I) = t̃_l(Ĩ).

4 Scheduling Procedures

4.1 Unlimited Resources

At first, the model as used for this section is defined.

• Each processing element contains |V| modules. Module l with 1 ≤ l ≤ |V| is capable of executing the function F_l. The module evaluates variables x_l[I] for some I ∈ I. Therefore, there is one module for any node of the reduced dependence graph within each processing element.

• To each input (k, l) of the module l there is associated an evaluation time w_kl ∈ Z. These evaluation times can be combined in a weight vector w ∈ Z^{1×|E|}. If these weights are included in the reduced dependence graph, a weighted reduced dependence graph DG = (V, E, D, w) is obtained.

• If the inputs (k, l) of the module l are available at time instances t_k, then the output variable is available at time instance t_l ≥ max{t_k + w_kl: (k, l) ∈ E}.

• The time difference λ between two successive iterations is called the iteration interval. It has to satisfy the inequality λ ≥ ŵ_max. For example, if module l is not pipelined it can accept new input data every ŵ_l = max{w_kl: (k, l) ∈ E} time instances. In this case we have

ŵ_max = max{ŵ_l: 1 ≤ l ≤ |V|}

There are many other similar models and simple extensions. We restrict ourselves to the above for the sake of brevity.

Now, we are going to determine a time optimal scheduling function under the above implementation model. The solution is similar to [17]. These results are generalized to non-convex iteration spaces, i.e. linearly bounded lattices as defined in Definition 2.

THEOREM 3. Given is a regular algorithm and its weighted reduced dependence graph. Then a schedule which minimizes the maximal value of π(J − I) for all I, J ∈ {Lκ: Aκ ≥ b ∧ κ ∈ Q^r} (Note 3) under the above described implementation model is given by

π = π′ L^{-1}    (2)


where π′ and γ are a solution of the integer linear program

minimize    −(y1 + y2) b    (3)

subject to    y1 A = π′,  y1 ≥ 0,  y1 ∈ Q^{1×m}    (4)
    y2 A = −π′,  y2 ≥ 0,  y2 ∈ Q^{1×m}    (5)
    π′ ∈ Z^{1×r},  γ ∈ Z^{1×|V|}    (6)
    π′ adj(L) D + |det(L)| γ C ≥ |det(L)| w    (7)
    ±π′ adj(L) u ≥ |det(L)| ŵ_max    (8)

Here we used the representation of a linearly bounded lattice according to Definition 2, the incidence and distance matrices C and D and the weight vector w. Moreover, we define L^{-1} = (1/det(L)) adj(L) where det(L) ∈ Z and adj(L) ∈ Z^{r×r}.

PROOF. The execution of operation l at index point I is finished at t_l = πI + γ_l ∈ Z where I = Lκ for some κ ∈ Z^r with Aκ ≥ b. Therefore, t_l = πLκ + γ_l. In order to guarantee that t_l is integer for all κ ∈ Z^r with Aκ ≥ b, we have πL = π′ ∈ Z^{1×r} in (2) (Note 4). Using L^{-1} = (1/det(L)) adj(L) we can also write π det(L) = π′ adj(L).

The total evaluation time is approximated by

max{π(J − I): I, J ∈ {Lκ: Aκ ≥ b ∧ κ ∈ Q^r}}

or the solution to the following linear program

maximize    π′(κ2 − κ1)
subject to  Aκ1 ≥ b,  κ1 ∈ Q^r
            Aκ2 ≥ b,  κ2 ∈ Q^r

The dual linear program has the form

minimize    −(y2 + y1) b
subject to  y1 A = π′,  y1 ≥ 0,  y1 ∈ Q^{1×m}
            y2 A = −π′,  y2 ≥ 0,  y2 ∈ Q^{1×m}

which leads to (3, 4, 5, 6). In addition, π and γ must satisfy the dependence constraints πD + γC ≥ w. As the iteration interval is either λ = πu or λ = −πu, one of the constraints πu ≥ ŵ_max or −πu ≥ ŵ_max must be satisfied. Using the parameter π′ according to (2) leads to (7) and (8). □

The signs ± in (8) denote that the integer linear program must be solved for both signs, i.e. + and −, and the better result must be used. It is possible to combine both problems into one integer linear program. To this end, (8) can be replaced by

π′ adj(L) u ≥ |det(L)| (ŵ_max − σ(λ_max + ŵ_max))
π′ adj(L) u ≤ |det(L)| (λ_max − σ(λ_max + ŵ_max))
σ ∈ {0, 1}

where λ_max is an arbitrary upper bound on the iteration interval, i.e. the final solution satisfies λ ≤ λ_max, see also the remarks in Section 5.

EXAMPLE 4. For the regular algorithm introduced in Example 2 we choose u = (1 0)^t. Moreover, we suppose that w12 = w13 = w21 = w32 = ŵ_max = 1. With N = 10 and M = 20 we obtain as a solution of the integer linear program in Theorem 3

π = π′ = (3 0)    γ = (0 2 1)
y1 = (3 0 0 0)    y2 = (0 0 3 0)
−(y1 + y2) b = 30

In other words, the computation of the variables x1[i, j], x2[i, j] and x3[i, j] is finished at time instances t1 = 3i, t2 = 3i + 2 and t3 = 3i + 1, respectively. This is the schedule b shown in Figs. 1 and 3. If we choose N = 20 and M = 10, the results

π = π′ = (2 1)    γ = (0 1 1)
y1 = (2 1 0 0)    y2 = (0 0 2 1)
−(y1 + y2) b = 50
t1 = 2i + j    t2 = 2i + j + 1    t3 = 2i + j + 1

are obtained, see schedule a in Figs. 1 and 3.

It is also possible to solve the relaxation of (3-8) in Theorem 3 by dropping the integral constraints and using the nonlinear scheduling ⌊πI + γ_l⌋ of the evaluation of x_l[I]. Obviously, better solutions in terms of the total evaluation time can be obtained, but this idea cannot be applied directly to the case of limited resources.
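Stated concretely, the ILP of Theorem 3 is small enough to write down directly. The following illustrative sketch (not part of the original paper) sets it up for the data of Example 2, assuming the PuLP modelling library and L = E so that adj(L) = E and det(L) = 1; only the '+' branch of (8) is shown:

```python
# ILP of Theorem 3 for the data of Example 2 with N = 10, M = 20.
# Expected optimum: pi' = (3, 0), gamma = (0, 2, 1) up to a common
# offset, objective -(y1 + y2) b = 30 (schedule b).
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

r, m = 2, 4
N, M = 10, 20
A = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b = [0, 0, -N, -M]
edges = [(1, 2), (1, 3), (2, 1), (3, 2)]
d = {(1, 2): (0, 0), (1, 3): (0, 0), (2, 1): (1, 0), (3, 2): (0, 1)}
w = {e: 1 for e in edges}
u = (1, 0)
w_max = 1

prob = LpProblem("theorem3", LpMinimize)
pi = [LpVariable(f"pi{i}", cat="Integer") for i in range(r)]         # (6)
gam = {l: LpVariable(f"g{l}", cat="Integer") for l in (1, 2, 3)}     # (6)
y1 = [LpVariable(f"y1_{j}", lowBound=0) for j in range(m)]
y2 = [LpVariable(f"y2_{j}", lowBound=0) for j in range(m)]

prob += -lpSum((y1[j] + y2[j]) * b[j] for j in range(m))             # (3)
for i in range(r):
    prob += lpSum(y1[j] * A[j][i] for j in range(m)) == pi[i]        # (4)
    prob += lpSum(y2[j] * A[j][i] for j in range(m)) == -pi[i]       # (5)
for (k, l) in edges:                                                 # (7)
    prob += (lpSum(pi[i] * d[(k, l)][i] for i in range(r))
             + gam[l] - gam[k] >= w[(k, l)])
prob += lpSum(pi[i] * u[i] for i in range(r)) >= w_max               # (8)

prob.solve()
print([value(p) for p in pi], [value(g) for g in gam.values()])
```

The '−' branch of (8) would negate the last constraint; the better of the two optima is kept, as described above.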

4.2 Limited Resources

Again, it is useful to start with the implementation model for this section. In the above described case of unlimited resources any processing element contained |V| modules. Each module has been responsible for the calculation of one variable, i.e. x_l[I] for all I ∈ I. Now, the number and functionality of the modules within each processing element is restricted. To model this situation, both a weighted reduced dependence graph and a bipartite resource graph are used.


• The weighted reduced dependence graph DG = (V_D, E_D, D, w) is defined as in the previous section. In particular, if the input variables x_k[I − d_kl] necessary to compute a variable x_l[I] for some I are available at times t_k, then the time instance where x_l[I] is available satisfies t_l ≥ max{t_k + w_kl: (k, l) ∈ E_D}.

• The number of clock cycles a module is reserved for the calculation of x_l via F_l is denoted as ŵ_l.

• In the bipartite resource graph RG = (V_R, E_R, M) with V_R = V_D ∪ V_T there is exactly one edge leaving each node l ∈ V_D and there are no edges ending in some l ∈ V_D. To each vertex k ∈ V_T there is associated a non-negative integer M(k) ∈ Z≥0.

• Each processing element consists of modules. There are |V_T| different types of modules, i.e. to each vertex k ∈ V_T there is associated a module type. If there is an edge (l, k) ∈ E_R then the function F_l associated to l ∈ V_D can be executed by a module of type k ∈ V_T. There are at most M(k) modules of type k in each processing element.

The following example serves to explain the notation of a resource graph.

EXAMPLE 5. Again, we use the algorithm described in Example 2. Let us suppose that within the processing elements there is just one module a available. Therefore, we have M(a) = 1. All operations f1, f2 and f3 can be executed on this module. Therefore, the resource graph has three nodes in V_D corresponding to f1, f2 and f3 and one node in V_T corresponding to module type a. There are edges from all nodes in V_D to the node a in V_T, see Fig. 6.

Again, we are going to determine an optimal scheduling function under the above implementation model. The optimal solution is again formulated as a mixed integer linear program. There are many different formulations possible. Only one basic approach is given in this paper. It is based on results described in [24] for high level synthesis.

Fig. 6. Resource graph (M(a) = 1).

4.3 Basic Procedure

There are two main concepts behind the proposed approach: (1) binary variables are used to represent the vector γ and (2) a convolution-like operator is used to consider resource constraints. The main result is described in the following theorem.

THEOREM 4. There is given a regular algorithm, its weighted reduced dependence graph, a resource graph, the iteration interval λ (Note 5) and lower and upper bounds on γ, i.e. r(l) ≤ γ_l ≤ s(l) for l ∈ V_D. Then the parameters π and γ of a scheduling which minimizes the maximal value of π(J − I) for all I, J ∈ {Lκ: Aκ ≥ b ∧ κ ∈ Q^r} can be obtained as

π = π′ L^{-1},    γ_l = Σ_{i=r(l)}^{s(l)} i γ′_{l,i}    ∀l ∈ V_D    (9)

where π′ and γ′_{l,i} are a solution of the following integer linear program

minimize    −(y1 + y2) b    (10)

subject to    y1 A = π′,  y1 ≥ 0,  y1 ∈ Q^{1×m}    (11)
    y2 A = −π′,  y2 ≥ 0,  y2 ∈ Q^{1×m}    (12)
    π′ ∈ Z^{1×r}
    γ′_{l,i} ∈ {0, 1}    ∀l ∈ V_D, r(l) ≤ i ≤ s(l)    (13)
    Σ_{i=r(l)}^{s(l)} γ′_{l,i} = 1    ∀l ∈ V_D    (14)
    π′ adj(L) d_kl + |det(L)| (Σ_{i=r(l)}^{s(l)} i γ′_{l,i} − Σ_{i=r(k)}^{s(k)} i γ′_{k,i} − w_kl) ≥ 0    ∀(k, l) ∈ E_D    (15)
    π′ adj(L) u = ±|det(L)| λ    (16)
    Σ_{l:(l,k)∈E_R} Σ_{p=0}^{ŵ_l−1} Σ_{p′: r(l) ≤ i+p+p′λ ≤ s(l)} γ′_{l,i+p+p′λ} ≤ M(k)    ∀k ∈ V_T, 0 ≤ i ≤ λ − 1    (17)

PROOF. (9) defines the transformed scheduling parameter π′ as in Theorem 3 and the representation of γ as a weighted sum of binary variables γ′_{l,i}. Constraint (14) guarantees a unique representation, i.e. exactly one of the variables γ′_{l,i} for r(l) ≤ i ≤ s(l) equals one. Constraints (11, 12) are identical to those used in Theorem 3. Using the above defined notation, (15) directly corresponds to (7). (16) guarantees that the given iteration interval λ is achieved. Finally, (17) is responsible for the consideration of finite resources. Let us suppose that γ′_{l,k} = 1 for some operation l ∈ V_D. Then Σ_{p=0}^{ŵ_l−1} γ′_{l,i+p} = 1 for all k − ŵ_l < i ≤ k. This does not yet correspond to the time interval where a resource is used. Remember that an operation l is executed repeatedly, i.e. with a distance of λ time instances. Therefore, operations in different iterations may overlap and multiples of the iteration interval λ must be added to i + p, i.e. Σ_{p=0}^{ŵ_l−1} Σ_{p′: r(l) ≤ i+p+p′λ ≤ s(l)} γ′_{l,i+p+p′λ}. It can be seen that (17) guarantees the resource constraints. □

Concerning the ± in (16) the same remark as in Theorem 3 can be made. In order to combine both problems, (16) can be replaced by

π′ adj(L) u ≥ |det(L)| (1 − 2σ) λ
π′ adj(L) u ≤ |det(L)| (1 − 2σ) λ

with σ ∈ {0, 1}.

EXAMPLE 6. We again consider the regular algorithm introduced in Example 2. L, p, A, b, u and D are as in the previous examples. The weighted reduced dependence graph is DG = (V_D, E_D, D, w) with V_D = {1, 2, 3}, E_D = {(1, 2), (1, 3), (2, 1), (3, 2)} and w = (1 1 1 1). The resource graph has the form RG = (V_R, E_R, M) with V_R = V_D ∪ {a}, E_R = {(1, a), (2, a), (3, a)} and M(a) = 1. There is only one resource type a available. On one instance, all operations are executed, see Fig. 6.

We choose the constraints 0 ≤ γ_{1,2,3} ≤ 2, i.e. r(l) = 0 and s(l) = 2 for all l ∈ V_D. Methods to determine these bounds systematically are described in later sections.

With N = 20 and M = 10 the following results are obtained:

• For λ = 2 there is no solution to the set of mixed integer constraints.

• For λ = 3 we find

y1 = (3 0 0 0)    y2 = (0 0 3 0)
π = π′ = (3 0)    −(y1 + y2) b = 60

γ′_{1,0}, γ′_{1,1}, γ′_{1,2} = 1, 0, 0
γ′_{2,0}, γ′_{2,1}, γ′_{2,2} = 0, 0, 1
γ′_{3,0}, γ′_{3,1}, γ′_{3,2} = 0, 1, 0

Therefore, we have t1 = 3i, t2 = 3i + 2 and t3 = 3i + 1. This solution corresponds to schedule b as shown in Figs. 1 and 3.
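The role of constraint (17) can be checked by hand here. A minimal illustrative sketch, using the convention from the proof of Theorem 4 that a module is reserved during the ŵ_l time instances ending at the potential γ_l:

```python
# Occupancy check for Example 6: with lambda = 3, one module of type a
# (M(a) = 1) and reservation times w_hat_l = 1, no residue class
# i mod lambda may be claimed by more than one operation.
lam, M_a = 3, 1
gamma = {1: 0, 2: 2, 3: 1}          # potentials of the solution above
w_hat = {1: 1, 2: 1, 3: 1}

for i in range(lam):
    usage = sum(1 for l in gamma
                  for p in range(w_hat[l])
                  if (gamma[l] - p) % lam == i)
    assert usage <= M_a, f"slot {i} over-subscribed"
print("feasible on one module")
```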

5 Remarks, Variations and Extensions

The following remarks mainly concern the optimal scheduling in Theorem 4. A closer look at the main resource constraint (17) in Theorem 4 reveals that operations are not guaranteed to be assigned to a single module during their evaluation time. Therefore, it may happen that the first half of an operation is assigned to one module and the second half to another one. Usually, this is not acceptable. Figure 7 gives an example. There are three operations 1, 2 and 3 with evaluation time 2 and there are two modules. The figure at the top shows a feasible schedule according to Theorem 4 as there is no time instance with more than two simultaneous operations. On the other hand, there is no assignment to two modules which has a period of three time instances. Two solutions to this problem are given below:

Fig. 7. Allocation of operations.


• Module Selection: For each module (and not for each module type!) one vertex of the resource graph is generated. As a consequence, we have M(k) = 1 for all vertices k ∈ V_T. If an operation can be executed by a certain module type, there are edges from the corresponding node to all vertices corresponding to this type. This extension of the resource graph can also be used to model the following situation: an operation can be executed either on a module of type a or of type b. As a consequence, the formulation of the linear integer program becomes more involved. Details can be found in [28].

• Loop Unfolding: It can be shown that there is a consistent allocation of the operations to modules whose period is a multiple of the iteration interval. An example is given at the bottom of Fig. 7. Here, the final period is twice the iteration interval. The situation simply generalizes to an arbitrary number of modules and operations.

Next, one notes that the iteration interval λ must be given but it is not known beforehand. There are several possibilities to solve this problem:

1. λ is required to be integral. If there is an interval for all admissible λ, then Theorem 4 can be applied for any λ within this interval. Lower bounds on λ can be obtained as follows (a sketch combining them is given after this list):

(a) Determine a systolic schedule according to Theorem 3 without taking into account resource constraints. The term |π′ adj(L) u| is a lower bound for |det(L)| λ.

(b) Another lower bound for λ is ŵ_max.

(c) Let us suppose that there are N_k time instances to be executed on modules of resource type k ∈ V_T, i.e. N_k = Σ_{l∈V_D:(l,k)∈E_R} ŵ_l where ŵ_l is the number of time instances a module is reserved for operation l. Then

⌈max{N_k/M(k): k ∈ V_T}⌉

is another lower bound for λ.

2. The most elegant but also most complex way is to use λ as a linear parameter in the ILP formulation. Here only the main concepts can be described. In Theorem 4, Eqs. (10-16) remain essentially equivalent. But, instead of shifting the time potentials γ′ by multiples of λ in (17) by adding p′λ to their time index, new potentials γ̃(p′) are introduced. The shift in time of p′λ is obtained by additional sets of equations where λ is a linear parameter. The resulting solution time is inferior to the above described possibilities.
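As announced in item 1, the three lower bounds can be combined into a starting value for the search over λ. A minimal sketch for the data of Example 6 (the value 2 for bound (a) is taken from the unconstrained solution of Example 4 and is an assumption of this sketch):

```python
# Lower bounds on the iteration interval lambda for Example 6.
from math import ceil

w_hat = {1: 1, 2: 1, 3: 1}          # reservation times w_hat_l
M = {"a": 1}                        # modules per resource type
ops_on = {"a": [1, 2, 3]}           # operations mapped to each type

bound_a = 2                         # |pi' adj(L) u| / |det(L)| (assumed)
bound_b = max(w_hat.values())       # w_hat_max
bound_c = max(ceil(sum(w_hat[l] for l in ops_on[k]) / M[k]) for k in M)

print(max(bound_a, bound_b, bound_c))   # -> 3, matching Example 6
```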

Other necessary but unknown parameters are the bounds r(l) and s(l) on the time potentials γ_l. Here again, there are several possibilities to determine appropriate values:

1. It is possible to reduce the number of unknowns to one parameter γ_max. A trivial possibility would be to set r(l) = 1 and s(l) = γ_max for all l ∈ V_D.

2. A more involved possibility is adopted from [24]. On the acyclic dependence graph obtained after removing all edges with d_kl ≠ 0, ASAP and ALAP schedules using γ_max are determined. The corresponding schedules of operations yield r(l) from the ASAP and s(l) from the ALAP schedule (see the sketch after this list). It can easily be seen that with the constraint 1 ≤ γ_l ≤ γ_max no other time potentials are feasible. As a result, the number of binary variables can be reduced to a great extent.

3. It is possible to determine an ILP formulation of the scheduling problem without the need of a parameter γ_max. To this end, additional integer variables g_l for all l ∈ V_D are introduced. In particular, a time potential γ_l is now replaced by γ̂_l + g_l λ. Obviously, γ̂_l can now be bounded by 1 ≤ γ̂_l ≤ λ, i.e. r(l) = 1 and s(l) = λ. The terms of the form Σ_{i=r(l)}^{s(l)} i γ′_{l,i} in (15) are replaced by

Σ_{i=1}^{λ} i γ′_{l,i} + g_l λ

and (17) is replaced by

Σ_{l:(l,k)∈E_R} Σ_{p=0}^{ŵ_l−1} γ′_{l, ((i+p−1) mod λ)+1} ≤ M(k)    ∀k ∈ V_T, 0 ≤ i ≤ λ − 1

This transformation can be interpreted as a cutset transformation, see [16], which has been chosen such that all time potentials are within the interval 1 ≤ γ̂_l ≤ λ. The modulo-operation is caused by the fact that the time interval a module is reserved may wrap around from the end of one iteration to the beginning of the next one.
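As an illustration of possibility 2 above, the following sketch computes r(l) and s(l) for the graph of Example 2 after removing the nonzero-distance edges (2, 1) and (3, 2); the base value 0 and the bound γ_max = 2 follow the choices of Example 6 (an assumption of this sketch):

```python
# ASAP/ALAP bounds r(l), s(l) on the time potentials (possibility 2).
gamma_max = 2
acyclic = [(1, 2), (1, 3)]          # edges with d_kl = 0 that remain
w = {e: 1 for e in acyclic}
nodes = [1, 2, 3]

def asap(l):                        # earliest feasible potential
    return max((asap(k) + w[(k, m)] for (k, m) in acyclic if m == l),
               default=0)

def alap(l):                        # latest feasible potential
    return min((alap(m) - w[(k, m)] for (k, m) in acyclic if k == l),
               default=gamma_max)

r = {l: asap(l) for l in nodes}     # {1: 0, 2: 1, 3: 1}
s = {l: alap(l) for l in nodes}     # {1: 1, 2: 2, 3: 2}
print(r, s)
```

The potentials of the solution of Example 6, γ = (0, 2, 1), indeed lie within these bounds.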

6 Experimental Results

The solution of resource constrained scheduling problems of relevant complexity has been made possible by carefully choosing the branch-and-bound strategies in the ILP-solver used. Otherwise, the exponential complexity of the proposed method would make its practical application infeasible. In particular,

• at first the binary variables with low time indices are made integral, and

• the local branching strategy has been adapted to the fact that most of the binary variables are 0.

The execution times are taken on a SUN Sparc Station. As an example, we have chosen a wave digital filter algorithm for the solution of certain partial differential equations, see Fig. 9 in [35]. The algorithm contains 21 add-operations and 6 multiply-operations. The reduced dependence graph is shown in Fig. 8. All displacement vectors are 0 but those whose edges contain the boxes with d^t inside. The add- and multiply-operations need 1 and 2 time instances, respectively. We suppose that the implementation consists of two module types, i.e. adders and multipliers. Moreover, within one processing element of the processor array there are only 2 adders and 1 multiplier available. The corresponding resource graph is shown in Fig. 9.

Fig. 8. Reduced dependence graph of the PDE solver (add- and multiply-operations).

For completeness, the whole specification of the problem in the form of the weighted reduced dependence graph and resource graph is given:

V_D = {a11, a12, a13, a14, a15, m11, m12, a21, a22, a23, a24, a31, a32, a33, a34, a41, a42, a43, a44, m41, m42, a51, a52, a53, a54, m51, m52}

E_D = {(a11, a15), (a12, a15), (a12, a23), (a12, a24), (a13, a15), (a13, a33), (a13, a34), (a14, a15), (a14, m11), (a14, m12), (a15, a14), (m11, a11), (m12, a12), (m12, a13), (a21, a12), (a21, a14), (a22, a41), (a22, a42), (a23, a21), (a23, a22), (a24, a21), (a24, a22), (a31, a13), (a31, a14), (a32, a51), (a32, a52), (a33, a31), (a33, a32), (a34, a31), (a34, a32), (a41, m41), (a41, m42), (a41, a44), (a42, a23), (a42, a24), (a42, a44), (a43, a44), (a44, a41), (m41, a42), (m42, a43), (a51, m51), (a51, m52), (a51, a54), (a52, a33), (a52, a34), (a52, a54), (a53, a54), (a54, a51), (m51, a52), (m52, a53)}

All distances d_kl satisfy d_kl = (0 0 0)^t but the following:

d_{a15,a14} = (0 0 2)^t    d_{a23,a21} = (−1 0 1)^t
d_{a23,a22} = (−1 0 1)^t   d_{a24,a21} = (1 0 1)^t
d_{a24,a22} = (1 0 1)^t    d_{a33,a31} = (0 −1 1)^t
d_{a33,a32} = (0 −1 1)^t   d_{a34,a31} = (0 1 1)^t
d_{a34,a32} = (0 1 1)^t    d_{a44,a41} = (0 0 2)^t
d_{a54,a51} = (0 0 2)^t    (18)


Fig. 9. Resource graph (M(Add) = 2, M(Mul) = 1).

The resource graph is given by

V_T = {Add, Mul}

E_R = {(a11, Add), (a12, Add), (a13, Add), (a14, Add), (a15, Add), (m11, Mul), (m12, Mul), (a21, Add), (a22, Add), (a23, Add), (a24, Add), (a31, Add), (a32, Add), (a33, Add), (a34, Add), (a41, Add), (a42, Add), (a43, Add), (a44, Add), (m41, Mul), (m42, Mul), (a51, Add), (a52, Add), (a53, Add), (a54, Add), (m51, Mul), (m52, Mul)}

The evaluation times satisfy w_kl = 2 ∀(k, l): (l, Mul) ∈ E_R and w_kl = 1 ∀(k, l): (l, Add) ∈ E_R. As we suppose non-pipelined arithmetic units, we also have ŵ_l = 2 ∀l: (l, Mul) ∈ E_R and ŵ_l = 1 ∀l: (l, Add) ∈ E_R. The numbers of available modules are M(Mul) = 1 and M(Add) = 2.

The index vector of the above algorithm is of the form I = (i j t)^t where 0 ≤ i ≤ N − 1, 0 ≤ j ≤ M − 1 and 0 ≤ t ≤ T − 1. A closer examination of the dependence vectors d_kl reveals that the dependence graph of the regular algorithm consists of two connected components, i.e. the algorithm is able to compute two independent sets of problems. This is mentioned also in [35]. Consequently, if the solution of only one problem is of concern, the index space is a linearly bounded lattice. The method which enables the computation of this connected component is closely related to the problems described in [36]. At first, cutset transformations on the given reduced dependence graph must be carried out such that there is a spanning tree whose edges satisfy d_kl = 0. Then the Hermite Normal Form, see e.g. [37], of the transformed distance matrix is the required lattice matrix L.

In the example, the reduced dependence graph already contains the required spanning tree. The distance matrix D contains as nonzero columns those given in (18). Elementary column operations, i.e. adding integer multiples of one column to another column, lead to the required lattice matrix

L = (1 0 0; 0 1 0; 1 1 2)

The final index space of the algorithm has the form

I = {Lκ: 0 ≤ κ1 ≤ N − 1 ∧ 0 ≤ κ2 ≤ M − 1 ∧ 0 ≤ κ1 + κ2 + 2κ3 ≤ T − 1}

where κ = (κ1 κ2 κ3)^t. Figure 10 serves to explain the lattice structure of the index space. Only every second index point is used for computations. The lattice structure is shifted from one layer with constant t to the next one.

As the projection direction we choose v = (0 0 1)^t. According to Theorem 1, the projection vector u satisfies

u = (0 0 2)^t

For the following calculations, N = M = T = 1000 is chosen. The final processor array has the form shown in Fig. 11.
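As a consistency check, the projection-vector sketch given after Theorem 1 reproduces this u (it assumes the hypothetical function projection_vector defined there, including its imports):

```python
# Reusing projection_vector() from the sketch after Theorem 1:
# det(L) = 2, adj(L) v = (0, 0, 1)^t, gcd = 1, so beta = 2.
L = [[1, 0, 0],
     [0, 1, 0],
     [1, 1, 2]]
print(projection_vector(L, [0, 0, 1]))   # -> [0, 0, 2]
```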


Fig. 10. Lattice structure of the index space.

Fig. 11. Processor array for PDE solver.

At first, the optimal systolic schedule according to Theorem 3 is determined. The solution takes less than 0.1 seconds of CPU time. The objective function has the value −(y1 + y2) b = 5994 where

π = (0 0 6)    λ = 12

γ = (5 5 5 2 6 4 4 1 1 6 6 1 1 6 6 2 5 5 6 4 4 2 5 5 6 4 4)^t

and the elements of γ are ordered as the list of nodes V_D given above. A corresponding implementation needs 6 multipliers and 7 adders. The integer linear program has 57 equations and uses 42 variables.

Now, the resource constrained scheduling according to Theorem 4 is evaluated. The solution of the ILP takes about 40 seconds of CPU time. The results are

−(y1 + y2) b = 7992    π = (0 0 8)    λ = 16

γ = (1 4 7 9 4 16 13 6 11 8 8 3 3 10 10 2 6 13 14 4 11 4 9 16 18 8 15)^t

The bar charts of the two available adders and the multiplier are shown in Fig. 12. Here, the numbers of inequalities and variables are 743 and 106, respectively. The branch-and-bound algorithm visited 189 branching nodes.


Fig. 12. Bar charts.

7 Concluding Remarks

Finally, some possible extensions of the methods described in this paper are outlined. In a similar way as in [24], additional degrees of freedom or constraints can easily be taken into account, such as chaining, pipelined arithmetic units and multicycle operations.

It has been shown in [28] that it is possible to explicitly include also the memory design, i.e. optimizing the total amount of local memory needed and avoiding access conflicts to memory banks, and to include the bus design, i.e. taking into account limited bandwidth and avoiding access conflicts on the interconnection structure. Moreover, it is possible to assign more than one resource type to an operation. In this case, the operation will be allocated to one of the resources (module selection). These results can be applied to the class of algorithms considered in this paper.

In the case of parameterized index spaces, i.e. b in the definition of the linearly bounded lattices depends on some constant but unknown parameters, the optimal solution can be determined using methods to solve parameterized integer programs, see [38].

Notes

1. The column vector c_kl of C associated to an edge (k, l) satisfies c_kl = e_l − e_k where e_k and e_l denote the kth and lth unit vector, respectively.

2. In order to exclude degenerate cases, we suppose that I is sufficiently fat with respect to v, i.e. there exist I0 ∈ I, α ∈ Q, α ≠ 0, such that I0 + αv ∈ I.

3. This expression is an approximation of the total evaluation time max{π(J − I) + γ_k − γ_l: I, J ∈ I ∧ k, l ∈ V}.

4. This condition is necessary also if {κ: Aκ ≥ b} is sufficiently fat, i.e. contains a unit cube.

5. The iteration interval of a schedule denotes the number of time instances between two successive iterations.

References

1. H.T. Kung and C.E. Leiserson, "Systolic arrays for VLSI," SIAM Sparse Matrix Proceedings, Philadelphia, 1978, pp. 245-282.

2. H.T. Kung, "Let's design algorithms for VLSI systems," Proc. Caltech Conf. on VLSI, pp. 65-90, 1979.

3. H.T. Kung, "Why systolic architectures?" IEEE Computer, pp. 37-46, 1982.

4. S.Y. Kung, "On supercomputing with systolic/wavefront array processors," Proceedings of the IEEE, 1984, pp. 39-46.

5. R.H. Kuhn, "Transforming algorithms for single-stage and VLSI architectures," Workshop on Interconnection Networks for Parallel and Distributed Processing, 1980.

6. D.I. Moldovan, "On the design of algorithms for VLSI systolic arrays," Proceedings of the IEEE, 1983, pp. 113-120.

7. P. Quinton, "Automatic synthesis of systolic arrays from uniform recurrent equations," The IEEE/ACM 11th Annual Int'l Symp. on Computer Architecture, pp. 208-214, Ann Arbor, MI, USA, 1984.

8. W.L. Miranker and A. Winkler, "Space-time representation of computational structures," Computing, pp. 93-114, 1984.

9. P.R. Capello and K. Steiglitz, "Unifying VLSI array design with linear transformations of space-time," Advances in Computing Research, Vol. 2, pp. 23-65, 1984.

10. S.K. Rao, Regular iterative algorithms and their implementations on processor arrays, Ph.D. Thesis, Stanford University, 1985.

11. S.K. Rao and T. Kailath, "Regular iterative algorithms and their implementations on processor arrays," Proceedings of the IEEE, Vol. 76, pp. 259-282, March 1988.

12. R.M. Karp, R.E. Miller, and S. Winograd, "The organization of computations for uniform recurrence equations," Journal of the ACM, Vol. 14, pp. 563-590, 1967.

13. M. Wolfe, "Massive parallelism through program restructuring," Proc. IEEE Conf. on Frontiers of Massively Parallel Computation, 1990, pp. 407-415.

14. M.E. Wolf and M.S. Lam, "A loop transformation theory and an algorithm to maximize parallelism," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, pp. 452-471, 1991.

15. F.S. Hillier and G.J. Lieberman, An Introduction to Operations Research, San Francisco: Holden Day, 1980.

16. C.E. Leiserson, F.M. Rose, and J.B. Saxe, "Optimizing synchronous circuitry by retiming," Proc. Third Caltech Conf. on VLSI, pp. 87-116, 1983.

17. A. Darte and Y. Robert, "Scheduling uniform loop nests," Technical Report 92-10, ENSL Lyon, France, 1992.

18. Y. Wong and J.M. Delosme, "Optimization of computation time for systolic arrays," 26th Allerton Conference on Communication, Control and Computing, September 1988.

19. E.G. Coffman, Computer and Job-Scheduling Theory, New York: John Wiley and Sons, 1976.

20. M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: Freeman, 1979.

21. M.C. McFarland, A.C. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proceedings of the IEEE, Vol. 78, pp. 301-318, 1990.

22. S. Davidson, D. Landskov, B.D. Shriver, and P.W. Mallett, "Some experiments in local microcode compaction for horizontal machines," IEEE Transactions on Computers, Vol. C-30, pp. 460-477, 1981.

23. P.G. Paulin and J.P. Knight, "Force-directed scheduling for the behavioral synthesis of ASIC's," IEEE Transactions on Computer-Aided Design, Vol. CAD-8, pp. 661-679, 1989.

24. C.T. Hwang, J.H. Lee, and Y.C. Hsu, "A formal approach to the scheduling problem in high level synthesis," IEEE Transactions on Computer-Aided Design, Vol. CAD-10, pp. 464-475, 1991.

25. N. Park and A.C. Parker, "Theory of clocking for maximum execution overlap of high-speed digital systems," IEEE Transactions on Computers, Vol. C-37, pp. 678-690, 1988.

26. L. Hafer and A.C. Parker, "A formal method for the specification, analysis and design of register-transfer digital logic," IEEE Transactions on Computer-Aided Design, Vol. CAD-2, pp. 4-18, 1983.

27. N. Park and A.C. Parker, "Sehwa: a software package for synthesis of pipelines from behavioral specifications," IEEE Transactions on Computer-Aided Design, Vol. CAD-7, pp. 356-370, 1988.

28. A. Bachmann, M. Schöbinger, and L. Thiele, "Synthesis of domain specific multiprocessor systems including memory design," VLSI Signal Processing VI, pp. 417-425, New York: IEEE Press, 1993.

29. M. van Swaaij, J. Rosseel, F. Catthoor, and H. De Man, "Synthesis of ASIC regular arrays for real-time image processing systems," in F. Deprettere and A.-J. van der Veen (Eds.), Algorithms and Parallel VLSI-Architectures, Volume B, pp. 329-341, Amsterdam: Elsevier, 1991.

30. L. Thiele, "On the design of piecewise regular processor arrays," Proc. IEEE Symp. on Circuits and Systems, pp. 2239-2242, Portland, 1989.

31. L. Thiele, "Compiler techniques for massive parallel architectures," in P. Dewilde (Ed.), The State of the Art in Computer Systems and Software Engineering, pp. 101-151, Boston: Kluwer Academic Publishers, 1992.

32. J. Teich and L. Thiele, "Partitioning of processor arrays: A piecewise regular approach," INTEGRATION: The VLSI Journal, Vol. 14, pp. 297-332, 1993.

33. W.H. Chou and S.Y. Kung, "Scheduling partitioned algorithms on processor arrays with limited communication supports," Proc. Conf. on Appl. Spec. Array Processors, pp. 53-64, Venice, Italy, 1993.

34. S.Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ: Prentice Hall, 1988.

35. A. Fettweis and G. Nitsche, "Numerical integration of partial differential equations using principles of multidimensional wave digital filters," Journal of VLSI Signal Processing, Vol. 3, pp. 7-24, 1991.

36. J.-K. Peir and R. Cytron, "Minimum distance: a method for partitioning recurrences for multiprocessors," IEEE Transactions on Computers, Vol. 38, pp. 1203-1211, 1989.

37. A. Schrijver, Theory of Linear and Integer Programming, Wiley Interscience Series in Discrete Mathematics and Optimization, 1986.

38. P. Feautrier, "Parametric integer programming," Operations Research, Vol. 22, pp. 243-268, 1988.

Lothar Thiele was born in Aachen, Germany, on April 7, 1957. He received the Diplom-Ingenieur and Dr.-Ing. degrees in electrical engineering from the Technical University of Munich, West Germany, in 1981 and 1985, respectively. Since 1981, he has been a research associate with Professor R. Saal at the Institute of Network Theory and Circuit Design of the Technical University of Munich. After finishing his Habilitation thesis, he joined the group of Professor T. Kailath at the Information Systems Laboratory, Stanford University, in 1987. In 1988, he took up the chair of microelectronics at the Faculty of Engineering, University of Saarland, Saarbrücken, West Germany. He joined ETH Zurich, Switzerland, as a full professor in Computer Engineering in 1994. His research interests include models, methods and software tools for the design of hardware/software systems and array processors, and the development of parallel algorithms for signal and image processing, combinatorial optimization and cryptography.

Lothar Thiele has authored and co-authored more than eighty papers. In 1986, he received the award of the Technical University of Munich for his Ph.D. Thesis. He received the "1987 Outstanding Young Author Award" of the IEEE Circuits and Systems Society. In 1988, he was the recipient of the "1988 Browder J. Thompson Memorial Prize Award" of the IEEE. thiele@tik.ee.ethz.ch