approximate algorithms for partitioning problems

21
International Journal of Parallel Programmmg, VoI. 20, No. 5, 1991 Approximate Algorithms for 1 Partitioning Problems Mohammad Ashraf Iqbal 2'3 Received August 1990; revised May 1992 We consider the problem of optimally assigning the modules of a parallel/ pipelined program over the processors of a multiple processor system under certain restrictions on the lnterconnection structure of the program as well as the multiple computer system. We show that for a variety of such problems, it is possible to find if a partition of the modular program exists in which the load on any processor is within a certain bound. This method when combined with a binary search over a fixed range, provides an optimal solution to the partitioning problem. The specific problems we consider are partitioning of (1) a chain structured parallel program over a chain-like computer system, (2)multiple chain-like programs over a host-satellite system, and (3) a tree structured parallel program over a host-satellite system. For a problem with N modules and M processors, the complexity of our algorithm is no worse than O(Mlog(N) log(Wr/e)), where WT is the cost of assigning all modules to one processors, and e the desired accuracy. This algorithm provides an improvement over the recently developed best known algorithm that runs in O(MNlog(N)) time. KEY WORDS: Multiprocessing; module assignment; parallel processing; pipelining; bounded processor load; complexity. ~This Research was supported by a grant from the Division of Research Extension and Advisory Services, University of Engineering and Technology Lahore, Pakistan. Further support was provided by NASA Contracts NASI-17070 and NASI-18107 while the author was resident at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, Virginia, USA. 2 Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan. 3 Present address: Department of EE-Systems, University of Southern California, Los Angeles, California 90007. 341 828/20/5-1 0885-7458/91/1000-0314506 50/0 1991 Plenum Pubhshmg Corporation

Upload: mohammad-ashraf-iqbal

Post on 10-Jul-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Approximate algorithms for partitioning problems

International Journal of Parallel Programmmg, VoI. 20, No. 5, 1991

Approximate Algorithms for 1 Partitioning Problems

Mohammad Ashraf Iqbal 2'3

Received August 1990; revised May 1992

We consider the problem of optimally assigning the modules of a parallel/ pipelined program over the processors of a multiple processor system under certain restrictions on the lnterconnection structure of the program as well as the multiple computer system. We show that for a variety of such problems, it is possible to find if a partition of the modular program exists in which the load on any processor is within a certain bound. This method when combined with a binary search over a fixed range, provides an optimal solution to the partitioning problem.

The specific problems we consider are partitioning of (1) a chain structured parallel program over a chain-like computer system, (2)multiple chain-like programs over a host-satellite system, and (3) a tree structured parallel program over a host-satellite system.

For a problem with N modules and M processors, the complexity of our algorithm is no worse than O(Mlog(N) log(Wr/e)), where WT is the cost of assigning all modules to one processors, and e the desired accuracy. This algorithm provides an improvement over the recently developed best known algorithm that runs in O(MNlog(N)) time.

KEY WORDS: Multiprocessing; module assignment; parallel processing; pipelining; bounded processor load; complexity.

~This Research was supported by a grant from the Division of Research Extension and Advisory Services, University of Engineering and Technology Lahore, Pakistan. Further support was provided by NASA Contracts NASI-17070 and NASI-18107 while the author was resident at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, Virginia, USA.

2 Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan.

3 Present address: Department of EE-Systems, University of Southern California, Los Angeles, California 90007.

341

828/20/5-1 0885-7458/91/1000-0314506 50/0 �9 1991 Plenum Pubhshmg Corporation

Page 2: Approximate algorithms for partitioning problems

342 Iqbal

1. I N T R O D U C T I O N

Efficient utilization of parallel computer systems requires that the task being executed be partitioned over the system in an optimal fashion in order to minimize the total execution time of the job. In general, the problem of finding the optimal partition of an arbitrarily connected multiple computer system is very difficult to solve. If, however, the modules of the program communicate in a restricted manner and the multiple com- puter system has a special structure then it is possible to solve some of the partitioning problems.

Bokhari C1) has addressed the problem of optimally partitioning a modular program in a parallel processing environment. He has shown that if the interconnection structure of the program is chain or tree-like and the parallel processor is either connected as a chain or is a host satellite system, it is possible to optimally partition the parallel program in polynomial time. Iqbal et al. ~2) have subsequently developed faster but sub-optimal or approximate methods for this class of problems. Nicol and O'Hallaron ~3) improved Bokhari's exact algorithms and developed new algorithms that were still faster but operated under the assumption of bounded execution and communication costs. Iqbal, ~4) and Choi and Narahari (5) have designed new algorithms that are faster than any of the previously reported exact algorithms. These new algorithms are straightforward to implement and provide the exactness of Ref. l, the speed of Ref. 3 and make no assumption about the magnitudes of costs.

In this paper we describe a fully polynomial time approximation scheme, which provide efficient solutions to most of the partitioning problems discussed in Ref. 1 and already solved by using pure polynomial time algorithms. This scheme is a modified version of the approach described by Iqbal et aL ~2) and IqbalJ 6) In order to appreciate the useful- ness of the approximate solutions, one should bear in mind that data for the problem being solved is often only approximately known. Hence an approximate solution may be as meaningful as an exact solution for many of the practical problems where the extra accuracy of the exact solution is not needed and where the approximate solution can be obtained in a relatively short time (Horowitz and Sahni~7)).

While chain and tree structured computations are simple and seem restrictive, they are found frequently in applications such as image process- ing and scientific computing. ~1'3'5) Many image processing applications involve a sequence of operations on frames of input image data and can thus be represented as a chain structured pipelined computation. One technique for solving partial differential equation is to partition the two- dimensional domain into vertical strips. Due to irregular domains, the

Page 3: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 343

modules and the edges between modules may have different weights. Nicol and O'Hallaron ~3) discuss how a task dependence graph can be trans- formed to a chain structured compt~tation. In addition, a simple solution to many problems is to decompose the problem into independent stages and pipeline the stages, thus resulting in a chain structured pipelined com- putation where the modules with the slowest rate of execution dictates the speed of the pipeline.

Cenkle ~8) has designed a preprocessing algorithm for partitioning an arbitrary task dependence graph to one with simple chain structure. This map is used to collapse a general data flow graph into its longest path; the resulting chain structured computation can then be mapped onto a linear processor array by using previously developed assignment algorithms. Mohindra and Yalamanchili ~9) have devised methods to represent the problem and its structure by its most influential parts--the most computa- tionally demanding constituents, and the most communication intensive pairs of nodes. A representation which is constructed by selecting such pairs of nodes and edges from a problem graph is referred to as a dominant representation. For example the maximum weighted spanning tree may be selected as the dominant representation for the problem graph. The original problem is thus reduced to that of mapping trees onto a processor graph: a problem that is relatively simpler than mapping arbitrary graphs.

There are certain parallel architectures which also compel us to con- sider chain decompositions. ~3) For example, the CMU Warp (1~ is a linear array of high-powered processors, so that pipelining sequential modules is a natural solution approach. It can be advantageous to use chains even if the communication topology is rich. For example the Intel iPSC hypercube has very high communication startup costs which are nearly independent of the message size. Better performance is sometimes seen by minimizing the number of messages, rather than the message volume. The programmer again is encouraged to limit the interconnection structure of the problem decomposition; a chain offers the simplest of useful structures.

In Section 2, we describe an algorithm for finding the optimal parti- tion of a chain structured parallel or pipelined program over a chain of identical processors. The central result of Section 2 is that for a trial weight w, it is possible to find if a partition exists in which the load on each processor is less than or equal to w in time proportional to O(M log N). The optimal partition is then found by making a binary search in a given range to find the partition for which w is minimum. The approach we use here is a fully polynomial time approximation scheme ~6) and is an exten- sion of the method discussed in Ref. 2 for finding the optimal partition of a one dimensional domain over a chain of processors.

In Section 3 and 4, we show that this technique can be used to

Page 4: Approximate algorithms for partitioning problems

344 Iqbal

optimally parti t ion programs composed of multiple chains over a multiple computer system based on a single-host and multiple satellite architecture. The time required by the entire system to complete the processing is deter- mined by the greater of (1) the individual load on the most heavily loaded satellite, and (2) the sum of the collective loads on the host.

In Section 5, we discuss an approach to optimally parti t ion a tree- structured parallel or pipelined program over a single-host multiple- satellite system. This algori thm is also based on a probing function. Tree structured computa t ions arise in signal processing and as well as in industrial control applications. In the latter sensor inputs from the shop floor are processed up the nodes of a tree to a central control node, and control signals travel in reverse direction. Such tree-structured computa- tions can be parti t ioned over the processors of a host-satellite system to improve response time.

The paper concludes with a discussion of our results in Section 6.

2. AN A L G O R I T H M FOR PARTIT IONING CHAINS

We describe in this section how a chain structured parallel or pipelined p rogram can be optimally parti t ioned over a chain of identical processors. We assume that a chain structured program is made up of N modules numbered 1 .. N and has an interconnection pattern such that module i can communica te only with modules i + 1 and i - 1 . Similarly we assume that the multiprocessor of size M < N has also a chain like architecture (Fig. 1).

Q~.I.J. ~+l.|+l.y

-,j J

F Fig. 1. A chain of N modules to be mapped onto a chain of M processors. The number outside each module is its execution cost. The number associated with each edge is the communication cost for the two modules at the ends of that edge. Modules[x..i] are assigned to processor p, while Modules [i + 1.. y] are assigned to processor p + 1.

Page 5: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 345

We work under the constraint that each processor has a contiguous subchain of program modules assigned to it. Thus the partitions of the chains have to be such that modules i and i + I are assigned to the same or adjacent processors (Fig. 1 ). The optimal partitioning would then be the assignment of subchains of program modules to processors that minimize the load on the most heavily loaded processor.

It is convenient for us to assume that should the optimal assignment dictate that fewer than the available M processors be used, we can simply ignore the impact of communicating with the outside world through a subchain of unused processors.

2.1. Def in i t ions and Propert ies of Chains

w, time consumed in the execution of module i.

c, time to communicate between modules i and i + 1 if the two modules are assigned to different processors. We assume that Co(CN), is the time for module 1 (N) to communicate with the outside world.

Wmax maximum value of w, for 1 ~< i ~< N.

Wr load on a processor if all the N modules are assigned to it. Thus WT = CO + CN + t = N Z,=I W,.

s ~ load on processor p if subchain Modulesl-j. . k] are assigned t = k to it. It is given by ck+cj_a + Z , = j w,.

weight the weight of a partition is the weight of its heaviest subchain.

Theorem 1. If a problem with N modules and M processors has a partition of weight w then a partition of equal or less weight exists if we merge modules i and i + 1 provided c, ~> w, + 1 + c,+ 1. A proof of this theorem is given in Appendix A.

Theorem 2. If a problem with N modules and M processors has a partition of weight w then a partition of equal or less weight exists if we merge modules i and i + 1 provided c, ~> w, + c, 1.

Coro l la ry 1. Suppose we initially assign M o d u l e s [ x . . i ] to pro- cessor p and Modules [ i+ 1 . . y ] to processor p + l . In such a case, if module i + 1 is moved from processor p + l to processor p, the load on processor p will always increase while the load on processor p + 1 will always decrease, provided the following inequality is true.

Ct+ 1 "~- Wl+ 1 ~ C t ~ Ct+ 1 - - W i + 1

Page 6: Approximate algorithms for partitioning problems

346 iqbal

Corollary 2. Suppose we initially assign Modules[x .. i] to pro- cessor p and M o d u l e s [ i + l . . y ] to processor p + l . In such a case, if module i is moved from processor p to processor p + 1, the load on processor p will always decrease while the load on processor p + 1 will always increase, provided the following inequality is true

C t -F W~ > C t_ 1 > Ct -- W~

2.2. The Procedure Monotonic

The procedure M O N O T O N I C transforms the chain Modules[1 .. N] into a new chain of Modules[ l ,, N ' ] , where N'~< N.

PROCEDURE MONOTONIC ( in=Modules [ i.. N ], out=Modules [ I.. N ' ] ) BEGIN

For i=N-i down to 1 do begin

if C I ~- wl. I + ct. I then Merge Module /+i with i;

end;

For i--I to N-I do begin

if c, ~ w i + ct_ t then Merge Module i with i+i;

end;

of

END.

Discussion.

t. The procedure M O N O T O N I C has two passes. During the first pass some of the modules may be merged together and the original chain of Modules[1 .. N] is transformed into a modified chain of modules. The modified chain of modules is further transformed into a new chain of Modules[1 .. N ' ] during the second pass of the procedure MONOTONIC.

2. If the Modules[ l .. N ] has a partition of weight w then (according to Theorem 1 and 2) a similar partition or a partition of less weight, exists for the new module chain Modules[ t .. N ' ] .

3. For the new module chain Modules[1 .. N ' ] , the rise of 121,t,~ and the fall of ~22,x+l,N with x, will be monotonic (see Corollary 1 and 2).

4. The procedure M O N O T O N I C performs steps proportional to O(N).

Page 7: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 347

Example 1. Let us examine an example of how the procedure M O N O T O N I C works on a chain of modules which is later mapped onto a chain of processors. Figure 2 (top) shows a 10 module chain. The number above each module is its execution "cost (on any processor) while the number below each edge is the communication cost for the two modules at the ends of that edge. Thus for module 4 the value of w4 = 6 and c4 = 2. The value of W r for this problem is 42 (we have assumed that Co = CN = 0).

Figure 2 (bot tom) shows a plot (grey line) of the load on processor 1, g21,1, x against x. It can be seen from the figure that the load on processor 1 increases from zero when no module is assigned to it, to WT when all 10 modules are assigned to processor 1.

Figure 2 (bot tom) also shows a plot (black) of the total remaining load, A x against x. The remaining load decreases from WT when no module is assigned to processor 1, to zero when all the 10 modules are assigned to it.

It is evident from Fig. 2 that the rise of g21,1,x and the fall of Ax with x are not monotonic. Thus, if we initially assign Modules[1 .. 6] to pro- cessor 1 and then further assign Module 7 to it then the remaining load A~, instead of decreasing, increases from 17 to 19. The reason for this behavior is non-uniform communication costs.

Figure 3 shows how the procedure M O N O T O N I C transforms the 10 module chain of Fig. 2 into a 7 module chain. During the execution of the

~ -- --4~-L - Remaining load

314 & x ~ll [ Load 0o P r o c e s s o r 1

o,.,.,

r176 | t

1o I

o 0 1 2 3 4 5 6 7 8 (3 10

IP X

Modules[ 1 ~J assigned to Processor I

Fig. 2. A 10 module chain (top). The load on Pro- cessor 1, ~l, l ,x (grey line) and the remaining load, As, (black) is plotted against x (bottom).

Page 8: Approximate algorithms for partitioning problems

348 Iqbal

(a)

(b) tO 3 ~J 6 2 4 2 .~

Fig, 3. The 10 module chain of Fig. 2 (top) is transformed into a 7 module chain by the procedure MONOTONIC.

first FOR loop of the procedure MONOTONIC, modules 9 and 10 are merged together as shown in Fig. 3a. The resulting 9 module chain is shown in Fig. 3b. When the second FOR loop is executed then the 9 module chain of Fig. 3c is transformed into the 7 module chain of Fig. 3d. During this process, module 7 is merged with module 8 while module 3 is merged with module 4.

In Fig. 4 (top), we reproduce the 7 module chain of Fig. 3d. Figure 4 (bottom) shows a plot (grey) of the load on processor 1, f21,1, x against x. Figure 4 (bottom) also shows a plot (black) of the remaining load, Ax, against x. Note that the procedure MONOTONIC has transformed the 10 module chain of Fig. 2 into 7 module chain (Fig. 4) in such a manner that the resulting rise of t21,1,x and the fall of 3x, with x, has become monotonic.

2.3. The Function Probe

The partitioning scheme (described in Section 2.4) is based on the function PROBE. The function PROBE(w) returns true if it is possible to partition a chain of Modules[1 .. N ' ] into subchains such that the load on each processor is less than or equal to w, and false otherwise (we assume here that the original chain of Modules[1 .. N] has already been trans- formed into a chain of Modules[1 .. N ' ] by first applying the procedure

Page 9: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 349

Modulo Chain

Remolnlng LO0~ LO~a on p r o c e s s o r I

bx QI,L.t 42

I.. . J

[ , ,

|_....'_t_! I ,o

f. 6 o l 2 3 4 5 7

X

Modules[ I x] ~ d to Proooss~r 1

Fig. 4. The 7 module chain (top) of Fig. 3. The load on Processor 1, g21,1, x (grey line) and the remaining load, A~, (black) is plotted against x (bottom).

M O N O T O N I C ) . The partition that results from a successful attempt to partition the chain so that the weight of each subchain is no more than w, is called the greedy partition (w).

FUNCTION PROBE(w):boolean; BEGIN

i=i; j=l; p=l;

while p ~ M do

END.

b e g i n

r e p e a t j=j+l;

until Q > w or j=N'; p , l , J

I f j = N ' t h e n r e t u r n ( t r u e ) ;

Assign subchain Modules[i..j-l] to processor p;

i=j; p=p+l;

end;

return(false);

Page 10: Approximate algorithms for partitioning problems

350 Iqbal

Example 2. Let us examine an example of how the function PROBE(w) tries to find a greedy partition of weight w. Figure 5 (top) shows the 7 module chain of Example 1. Remember that this chain has been derived from the chain of Fig. 2 (top) by the application of the proce- dure M O N O T O N I C . The 7 module chain is to be mapped onto a 4 pro- cessor chain with a trial weight w = 24. The function PROBE assigns the subchain Modules[1 .. x ] to processor 1 such that ~r'21,1. x ~ W and x is max- imal. The value of x which satisfies these constraints, is 3. The remaining load is 22 which is less than the trial weight w = 24. Thus, it is possible to find a greedy partition of weight w = 24 (as shown in Fig. 5) of the module chain, and the function PROBE(w) returns true. Let us now try to find if a greedy partition of the same module chain over the same processor chain exists with a trial weight w = 14. The function PROBE, while trying to find a greedy partition (14), assigns Modules[1 .. 2] to processor 1. It assigns a single module (module 3) to processor 2. Modules[4 .. 5] are assigned to processor 3. The remaining load left for processor 4, is 17 (see Fig. 6), which is larger than the trial weight. The function PROBE, therefore, returns false.

The function PROBE(w) can provide a true~false answer if it looks at each module only once. Thus under this scheme (looking at each module only once), the function performs steps proportional to O(N).

The function PROBE(w) while assigning Modules[1 . . x ] to pro- cessor 1, can make a binary search for x in the range 1, N ' so that x is maximal and 121,1, x ~< w. Under this new strategy, the function PROBE(w)

Module Chain

/ Processor Chaln

Fig. 5. The function PROBE, while trying to find a greedy partition of weight w = 24, assigns Modules[1 .. 3] to Processor 1 and Modules [4..7] to Processor 2. Note that the remaining load left for the rest of the Processors is zero.

Page 11: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 351

fl, o )

1 1 1 / Fig, 6. The function PROBE, while trying to find a greedy partition (14) assigns Modules[1 .. 2] to Processor 1. It assigns a single module to Processor 2 and Modules[3..4] are assigned to Processor 3. The remaining load left for Processor 4 is 17 which is larger than the trial weight = 14.

will perform steps proportional to O(log N) for each processor. Thus the time complexity will be O(N+ M log N) as the procedure M O N O T O N I C is first applied to transform the chain of Modules[1 .. N] into a new chain of Modules[1 . . N ' ] before the function PROBE(w) can provide a true/false answer.

T h e o r e m 3. If there exists a partition of weight w the function PROBE(w) will find a similar greedy-partition and will return true. In other words any partition of weight w can be converted into greedy- partition(w). Thus the optimal partition can also be converted into a greedy-partition(wopt). A formal proof of this theorem is given in Ref. 2.

2.4. The Approx imate Assignment Scheme

The approximate assignment scheme is based on the function PROBE. The scheme takes advantage of the fact that the weight assigned to the most heavily loaded processor in the optimal assignment lies some where between Wr and Wr/M.

The assignment scheme selects a trial weight w in the above range and then uses the function PROBE. The function returns true if it is possible to partition the chain of modules into subchains such that the weight of each subchain is less than or equal to w, and false otherwise.

The scheme makes a binary search in the range Wr, Wr/M, using the function PROBE(w) to find the partition for which the weight of the heaviest subchain is minimum. If the range Wr, Wr/M is resolved to an accuracy of e then the assignment scheme will find a greedy partition(w) in time proportional to O(N + min(N log(Wr/e), M log(N) log(Wr/e)) ) with

Page 12: Approximate algorithms for partitioning problems

352 Iqbal

the assurance that w is no greater than the weight of the heaviest subchain in the optimal partition by e. The time complexity reduces to O ( M log(N) 1og(WT/e)) provided N < M log(N)log(WT/~). It is important to note that the order of the approximate scheme is proportional to 1og(Wr/e) unlike other fully polynomial time approximation schemes C7) in which the time complexity is polynomial in 1/e.

3. PARTITIONING MULTIPLE CHAINS FOR A HOST-SATELLITE SYSTEM

The algorithms presented in the previous section can be used to solve several other partitioning problems in Host-Satellite systems as shown in Fig. 7. Here we have a large host computer connected to many satellite computers which receive data and process it in a pipelined fashion. Each relatively less powerful satellite computer has the choice of partitioning its workload between itself and the more powerful host to improve its comple- tion time. It should be kept in mind, however, that moving some modules from a satellite to the host will adversely affect the performance of other satellites. ~ 1 )

3.1. Definitions and Properties

Let us assume that each chain has N modules, there are M satellites and that for each module i of satellite s, the time required to run it on the satellite, e .... and on the host, h,. s. For each pair of modules i and i + 1 from satellite s, we have the time required for interprocessor communica- tion, c .... should i be assigned to the satellite and i + 1 to the host. When these M chains are partitioned between the host and the M satellites, the time required by the entire system to complete the processing is determined by the greater of (1) the individual load on the most heavily loaded satellite and (2) the sum of the collective loads on the host.

I1~ t l

0 0 0

O

Q

O

Hos t

Fig. 7. A Host-Satellite system processing real time data.

Page 13: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 353

We represent the load on a satellite s by Os, k, provided Modules[1. . k] are assigned to it. It is given by ~ = l e , , s - l - e k , s . The remaining load, due to the rest of the modules of satellite s, is assigned to the host and is denoted by A~,k which is equal to ~u=k+ 1 hi,~+ek, s.

Note that each of the M satellite chains possesses the properties expressed by Theorems 1 and 2 of Section 2. Thus, we use procedure M O N O T O N I C to transform each chain of Modules[1 .. N] into a new chain of Modules[-1 .. N ' ] where N' ~< N.

3.2. The Function Probe

The probing function, while trying to find a greedy partition of weight w, will assign Modules[-1 .. k] to a satellite s such that (2s, k ~< w, and k is

N maximal. If the total load on the host (which is equal to Zs= 1 zi~,k) is less than or equal to w then the function returns true, and false otherwise.

Discussion.

1. The policy, under which Modules[1 .. k] are assigned to a satellite s, is such that the remaining load, As, k, is minimized for each satellite s. Thus if As, k is minimum for each s, where 1 ~< s ~< M, then ~ s ~ l As, k will also be minimum.

2. If there exists a partition in which the total processing time is less than or equal to w, then for each individual satellite s, this parti- tion can always be transformed into a greedy partition of weight w by increasing k until it becomes maximal and g2~,k~< w. The result of this transformation will be that the load on the host will either decrease or remain the same. Thus if there exists a partition of weight w then this approach will always find that or an equivalent assignment.

3. In order to find k such that t2s. k ~< w and k is maximal, we can make a binary search for k in the range 1, N. Thus the function PROBE performs steps proportional to O(log N) for each satellite and will provide a true~false answer in time O(M log N).

3.3. The Approx imate Assignment Scheme

The scheme makes a binary search in the range Wr, WT/M, where WT- is given by the following equation, using the probing function to find the greedy partition of weight w, for which w is minimum.

N

1 max (i le

Page 14: Approximate algorithms for partitioning problems

354 Iqbal

Note that W T is the smaller of (1) total processing time if all modules are assigned to the host, and (2) total processing time if no module is assigned to the host. Thus if Wmax is the maximum value of h,, s for 1 ~< i ~< N and 1 <~s<~M then WT<.NM(wmax).

If the range WT, Wr/M is resolved to an accuracy of e, then the algorithm will call the function PROBE at the most log(Wr/e) times. The function PROBE performs steps proportional to O(Mlog N) provided each chain of modules has already been transformed into a new chain of modules by the procedure MONOTONIC. Thus the algorithm will find a greedy partition of weight w in time proportional to O(max[MN, M(log N)(log(WT/e))]), with the assurance that w is no greater than the worst load on any satellite and the total load on the host in the optimal assignment by ~.

4. PARTIT IONING DISTRIBUTED P R O G R A M S IN HOST-SATELLITE SYSTEMS

Stone (11) has solved the problem of partitioning a distributed program over a single-host and single satellite system. He has further studied the behavior of the optimal assignment as a function of load on the host and shows that a nesting property holds. (12) As load increases on the host, modules move away from the host and onto the satellite. At no point does an increase in load cause a module to move from the satellite onto the host.

Subsequent work by Bokhari ~1~ shows how this property can be exploited when finding optimal assignments in a single-host multiple- satellite system. We consider the individual programs to have a chain-like structure, regardless of their actual interconnection. The optimal assign- ment can be found using a Sum-Bottleneck path algorithm. The complexity of this approach is dominated by the O(NaM) algorithm that finds the individual chains for a problem with M satellites, each executing a program with N modules. The partitioning of the chains takes far less time than O(N4M) time.

We can find the partitioning of the chains, using an approach similar to the one discussed in the previous section. This takes O(Mlog(N) log(Wr/e)) time, where Wr and e are as defined in the previous section. The overall complexity of the algorithm is still O(N4M).

5. PARTIT IONING TREES IN HOST-SATELLITE SYSTEM

We consider the problem of partitioning a tree structured pipelined or parallel program over a single-host, multiple-satellite system as shown in Fig. 8. We assume that there are N nodes in the program tree and there are as many satellites as the number of leaf nodes of the tree. We work under

Page 15: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 355

Fig. 8. A tree structured parallel program partitioned over a host-satellite system.

the constraints that (1) individual maximal subtrees of the given tree are assigned to each satellite and (2) that the root is always assigned to the host. The total processing time under such a system will be the larger of the load on the host and the worst load on any satellite.

5.1. Def in i t ions

h; time consumed in the execution of module i on the host.

e; time consumed in the execution of module i on any satellite (all satellites are similar).

c, time required for communication if node i is assigned to a satellite and node father(i) to the host.

Wma x maximum value of h, for 1 ~< i ~< N.

load(i) load on a satellite if node i and all its descendant nodes are assigned to the satellite. This is equal to the sum of individual e,'s of the modules assigned to the satellite plus c,.

cost(i) cost of execution of node i and its descendants on the host. This is equal to sum of individual h,.'s of node i and its descendants.

depth(i) distance of node i from the root. The value of depth(root) = O.

Wr load on the host if all N nodes of the program tree are assigned to the host. This is equal to ~ = ~ h~.

dmax the maximum value of depth(i) for 1 ~< i ~< N.

Host_Load total load on the host. If the program tree is partitioned over a host and M satellites then for each satellite p,

Page 16: Approximate algorithms for partitioning problems

356 Iqbal

A,

there will be a node El(p) assigned to the satellite while father(H(p)) will be assigned to the host. The value of Host_Load will then be the sum of individual h,'s of the modules assigned to the host plus Zpg=l Crop ). In terms of Wr this is given by:

Host_Load W r - M = 5Zp = 1 cost(H(p)) + Epg__ 1 Crop) Suppose we have assigned all nodes to the host except the M sons of node i numbered 1 . .M. The value of Host_Load will then be W r - Y,~= 1(cost(k)- c~). If, instead of assigning each son of node i to a separate satellite, node i itself is assigned to a satellite then the new value of Host_Load will be WT-cos t ( i )+ c i. The difference between the two values of Host_Load is denoted by A, and is equal to ( c o s t ( i ) - c , - ~ 1(cost(k)- ek)). Thus if Ai is positive then the value of Host_Load will reduce by A, if we assign a single satellite to node i (and its descendants) instead of assigning a separate satellite to each son of node i. For each leaf node i, the value of Ai is cost(i) - c,.

5.2. The Funct ion Probe Tree

The function PROBE_TREE(w) returns true if it is possible to parti- tion the program tree over the host-satellite system such that the load on any satellite, and the load on the host, is less than or equal to w, and fa~e otherwise. The resulting partition, if any, is called the conservative partition for a trial weight w.

FUNCTION PROBE_TREE(w):boolean

BEGIN for level=d down to 1 do

max begin

for each node i at depth(i)=level do

if load(i)>w then Merge(i, father(i));

end; Host_Load=WT; for level=d down to 1 do

max begin

for each node i at depth(i)= level do

if A <0 then Merge(i, father(i)) i

else Host_Load=Host_Load-A1; end;

if Host_Load~w then return(true) else return(false); END.

Page 17: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 357

Discussion.

1. The function PROBE_ TREE while trying to find a conservative partition of weight w, will assign a node i and its descendants to a satellite if, and only if, load(i)<~ w. Each node j for which load(i) > w is, therefore, merged with father(j) by combining the execution cost ej(hj) with efather(j)(hfather(j)) and removing the edge (j, father(j)).

We assume that initially all N nodes are assigned to the host and thus the initial value of Host_Load is equal to WT. The function, while examining each leaf node i at depth(i) = din, x, assigns it to a satellite if A;~>0. If, on the other hand, A i < 0 then node i is merged with father(i).

3. The function then looks at each node j at a depth one less than dm~x. If Aj/> 0, then the value of Host_Load will further reduce by Aj, provided the previous partition is moved up one level by assigning a satellite to node j and to its descendants, instead of keeping the previous partition in which each son of node j has been assigned to a separate satellite.

4. The resulting value of Host_Load is equal to Wr minus the sum of individual Ak's for each node k of the program tree which is examined by the function and not merged with father(k).

5. For each trial weight w, the function P R O B E _ T R E E has to examine each node only twice to decide whether to assign the node and its descendants to a satellite or to merge it with its father. Thus the function performs O(N) steps to find a conser- vative partition of weight w, if it exists.

.

5.3. Theorem 4

If a tree structured program has a partition over a host satellite system in which the worst load on any satellite and the load on the host is w, then the function PROBE TREE will find that or an equivalent assignment. (6) A proof of this theorem is given in Appendix B.

5.4. The Part i t ioning Scheme

The algorithm makes a binary search in the range Wr, Wr/N, and it uses the function PROBE_ TREE to find a conservative partition of weight w for which w, is minimum. For each trial weight w, the function PROBE_ TREE has to examine each node only once to decide whether to

828/20/5-2

Page 18: Approximate algorithms for partitioning problems

358 Iqbal

assign the node and its descendants to a satellite or to merge it with its father. Thus the function performs O(N) steps to find a conservative partition of weight w, if it exists.

If the range Wr, Wr/N is resolved to an accuracy of e then the algo- rithm will find a conservative partition of weight w in time proportional to O(Nlog(Wr/e)), with the assurance that w is no greater than the larger of the load on the host and the worst load on any satellite in the optimal assignment by e.

6. C O N C L U S I O N S

We have discussed a number of partitioning problems in the field of parallel, pipelined, and distributed computing. We have demonstrated that in a variety of such problems, it is possible to design a probing function which can find out if a partition of a parallel program over a multiple com- puter system exists wherein the load on any processor is less than or equal to a given weight w. It has been shown that for a parallel program with a chain or tree-like interconnection structure, the probing function provides a true/false answer in linear time, provided the processor system is also limited to a chain of processors or is a host satellite system. The optimal partition is then found approximately by making a binary search in a finite range to find the partition for which w is minimum.

In order to extend this approach to other problems, it is essential to find an efficient probing function. The rule on which the probing function is based, is dependent upon the nature of the partitioning problem to be solved. For example, in the partitioning of one-dimensional domain over a chain of processors, the probing function was simply a greedy method~2); while in case of partitioning a chain-like program over a shared memory system, ~13) it was based on a dynamic programming approach described by Kernighan. ~14) Once an efficient probing function is found for a problem, the optimal partitioning can be found by making a binary search in a given range, using the probing function.

Future work in this field requires that this approach be extended to multiple computer systems with richer interconnection structure like a binary tree, hypercube or a mesh. It will be interesting to find if this approach can be efficiently applied to multiple computer systems composed of dissimilar processors. It also remains to be seen, how efficiently this type of approach can be applied to a two dimensional domain, with non- uniform work loads, which is to be partitioned into areas requiring equal computational effort (Berger and Bokhari~15)).

Page 19: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 359

APPENDIX A

T h e o r e m 1. If a problem with N modules and M processors has a partition of weight w then a shnitar partition, or a partJtion of less weight, exists even if we merge modules i and i + 1 provided c~) w, + ~ + c, + 5.

Proof. We assume that in the given partition of weight r~; modules i and i + t are assigned to different processors as otherwise the proof becomes trivial, Under such c~onditions Modules[x .. i] are assigned to processor p while Modules[ i+ 1 ,, y ] are assigned to processor p + 1. Thus the load on processor p and p + 1 are given here (see N g i).

(2)

~Arhen we merge modules i and i + 1 then the two mod~es shmfld be assigned to the same processor. We move module i + t from processor p + 1 to processor p. Under such conditions processor p + t is left with Modu les [ i+2 . . y ] while Modules[x. . i + 1] are assigned to processor p~ The new load on processors p and p + 1 are. therefore, given next.

12p+l,,+2+y = wk +e,+1+c5, \,~+2

(3)

(4)

From Eqs~ (I--4)

If ce ~> w,.+ ~ + c, + 1 then

Thus if we merge modules i and i + t then the load on processor p as welt as toad on processor p + 1 will either decrea~ or remain the same provided cr ~ + c~+ t- E/

Page 20: Approximate algorithms for partitioning problems

360 Iqbal

A P P E N D I X B

Theorem 4. If a tree structured program has a partition over a host satellite system in which the worst load on any satellite and the load on the host is w, then the function PROBE_ TREE will find that or an equivalent assignment.

Def in i t ions .

height dr, ax + 1 minus depth(i) for node i. Thus height of root is dr, ax + 1.

Hw. h a partition which guarantees that the load assigned to the host is minimum, load on each satellite is less than or equal to w, and only those nodes with height ~ h are permitted to reside on satellites.

ProoL In a tree structured program,, where each node i is already merged with father(i), if load(i) > w, the previous claim will be true if the function PROBE_TREE can find //w, draax"

By induction on height.

Consider the case height = 1. The function initially assigns all N nodes to the host and thus the starting value of Host_Load is WT. During the execution of the first iteration of the for loop, the function examines each leaf node i at height= 1, and assigns it to a satellite if A,~>0. If, on the other hand, A, < 0, then node i is merged with the father(i). In the resulting partition, the value of Host_Load will reduce by an amount equal to the sum of individual A;'s for each node i assigned to a satellite. Obviously, the partition found after the first iteration will be Hw,1.

We will now show that if the function PROBE_ TREE can find IIw. k after the kth iteration, then it can also find Hw,~+l after the k + l t h iteration of the for loop.

After finding H~.k, the function looks at each node i at height equal to k + 1. If Ai>~ 0 then for each node i the previous partition Hw, k is moved up one level, by assigning a satellite to node i and its children instead of keeping the previous partition, and thus the value of Host_Load further reduces by A,. If, however, A, < 0, then the previous partition H~, k is main- tained. For each node i examined at height = k + 1, the decision to keep the previous partition or to move the partition up one level, is solely dependent upon A,, and is not influenced by any other node examined during that iteration. Note that the partition Hw,k+l will either be the previous parti- tion Hw.k or the new partition wherein node i at height = k + 1 is assigned to a satellite (along with its children). It can not be any other partition because by definition, Hw, k guarantees that the value of Host_Load is mini-

Page 21: Approximate algorithms for partitioning problems

Approximate Algorithms for Partitioning Problems 361

mum, under the constraint that only those nodes with height<~k are permitted to reside on the satellites.

Thus if the function can find IIw.k after the kth iteration of the for loop, then it will be transformed into //,~,~+1 after the k + lth iteration, and into Hw,,amax, after the last iteration.

REFERENCES

1. S. Bokhari, Partitioning Problems in Parallel, Pipelined and Distributed Computmg, IEEE Tran. Comput., Vol. 37, No. 1 (January 1988).

2. M. A. Iqbal, J. H. Saltz, and S. H. Bokhari, Performance Tradeoffs m Static and Dynamic Load Balancing Strategies, Proc. of the lnt'l. Conf. on Parallel Processing, pp. 1040-1047 (August 1986)

3. D. M. Nicol and D. R. O'Hallaron, Improved Algorithms for Mapping Pipelined and Parallel Computations, IEEE Trans. on Computers (to appear). An earlier version is available as ICASE Report No. 88-2, NASA Contractor Report 181655 (April 1988).

4. M. A. Iqbal, Efficient Algorithms for Partitioning Problems, Proc. of the lnt'l. Conf. on Parallel Processing (August 1990).

5. Hyeong-Ah Chol and B. Narahari, Algorithms for Mapping and Partitioning Chain Structured Parallel Computations, Proc. of the lnt 7. Conf. on Parallel Processing (August 1991).

6. M. A. Iqbal, Approximate Algorithms for Partitioning and Assignment Problems, ICASE Report No. 86-40, NASA Contractor Report 178130 (June 1986).

7. E. Horowitz and S. Sahni, Fundamentals of Computer Algorithms, Computer Science Press, Inc. (1978).

8. Michal Cenkle, Partitioning Algorithm Graphs for Pipelined Computation, The MITRE Corporation Report, Bedford, Massachusetts (September 1991).

9. Ajay Mohindra and Sudhakar Yalamanchili, Dominant Representations: A Paradigm for Mapping Parallel Computations, Georgia InsUtute of Technology, Atlanta, Georgia (September 1991 ).

10. M. Annaratone, E. Arnould, T. Gross, H. Kung, M. Lain, O. Menzdcioglu, and J. Webb, The Warp Computer: Architecture, Implementahon, and Performance, IEEE Trans. on Computers, Vol. C-36, No. 12 (December 1987).

11. H. S. Stone, Multiprocessor Scheduling with the Aid of Network Flow Algorithms, IEEE Trans. Software Engineering, SE-3(1):85-93 (January 1977).

12. H. S. Stone, Critical Load Factors in Distributed Computer Systems, IEEE Trans. Software Engineering, SE-4(3):254-258 (May 1987).

13. M. A. Iqbal, Mapping and Assignment Problems in Multiple Computer Systems, PhD Dissertation, Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan (January 1991).

14. B. Kernighan, Optimal Sequential Partitions of Graphs, JACM. 18(1):34-40 (January 1971).

15. M. Berger and S. Bokhari, A Partitioning Strategy for Non-Uniform Problems Across Multiprocessors, IEEE Trans. Comput., Vol. C-36, No. 5 (May 1987).