

aimed at source-to-source parallelizing compilers that relieve the programmer from the task of communication generation, while the task of data partitioning remains a responsibility of the programmer.

As part of the research performed in the PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project [3] at the University of Illinois, automatic data partitioning techniques have been developed to relieve the programmer of the burden of selecting good data distributions. Currently, the compiler can automatically select a static distribution of data (using a constraint-based algorithm [16]), specifying both the configuration of an abstract multidimensional mesh topology and how program data should be distributed on the mesh. We have previously presented a technique [27] which extends the static partitioning algorithm to select dynamic data distributions that can further improve the performance of the resulting parallel program. In this paper we describe the implementation in more detail, introduce the idea of the hierarchical component affinity graph for performing interprocedural data partitioning, and present experimental results obtained using the current implementation.

The remainder of this paper is organized as follows: Section 2 presents a small example to illustrate the need for dynamic array redistribution; related work in automatic selection of static and dynamic data distribution schemes is discussed in Section 3; the methodology for selection of dynamic data distributions is presented in Section 4; the hierarchical component affinity graph is described in Section 5; experimental analysis of the presented techniques is performed in Section 6; and conclusions are presented in Section 7.

2. MOTIVATION FOR DYNAMIC DISTRIBUTIONS

Figure 1 shows the basic computation performed in a two-dimensional Fast Fourier Transform (FFT). To execute this program in parallel on a machine with distributed memory, the main data array, Image, is partitioned across the available processors. By examining the data accesses that will occur during execution, it can be seen that, for the first half of the program, data is manipulated along the rows of the array. For the rest of the execution, data is manipulated along the columns. Depending on how data

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 38, 158–175 (1996), ARTICLE NO. 0138

Dynamic Data Partitioning for Distributed-Memory Multicomputers1

DANIEL J. PALERMO, EUGENE W. HODGES IV, AND PRITHVIRAJ BANERJEE2

Center for Reliable and High-Performance Computing, University of Illinois at Urbana–Champaign, 1308 West Main Street, Urbana, Illinois 61801


For distributed-memory multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5, the quality of the data partitioning for a given application is crucial to obtaining high performance. This task has traditionally been the user's responsibility, but in recent years much effort has been directed to automating the selection of data partitioning schemes. Several researchers have proposed systems that are able to produce data distributions that remain in effect for the entire execution of an application. For complex programs, however, such static data distributions may be insufficient to obtain acceptable performance. The selection of distributions that dynamically change over the course of a program's execution adds another dimension to the data partitioning problem. In this paper, we present a technique that can be used to automatically determine which partitionings are most beneficial over specific sections of a program while taking into account the added overhead of performing redistribution. This system has been implemented as part of the PARADIGM (PARAllelizing compiler for DIstributed-memory General-purpose Multicomputers) project at the University of Illinois. The complete system strives to provide a fully automated means to parallelize programs written in a serial programming model, obtaining high performance on a wide range of distributed-memory multicomputers. © 1996 Academic Press, Inc.

1. INTRODUCTION

Distributed-memory multicomputers such as the Intel Paragon, the IBM SP-1/SP-2, the NCUBE/2, and the Thinking Machines CM-5 offer significant advantages over shared-memory multiprocessors in terms of cost and scalability. However, lacking a global address space, they present a very difficult programming model in which the user must specify how data and computations are to be partitioned across processors and determine which sections of data need to be communicated among which processors. To overcome this difficulty, significant effort has been

1 This research was supported in part by the National Aeronautics and Space Administration under Contract NASA NAG 1-613, in part by an Office of Naval Research Graduate Fellowship, and in part by the Advanced Research Projects Agency under Contract DAA-H04-94-G-0273 administered by the Army Research Office. We are also grateful to the National Center for Supercomputing Applications and the San Diego Supercomputing Center for providing access to their machines.

2 E-mail: {palermo,ewhodges,banerjee}@crhc.uiuc.edu.


years much research has been focused in the areas of performing multidimensional array alignment [4, 9, 22, 26]; examining cases in which a communication-free partitioning exists [30]; showing how performance estimation is a key in selecting good data distributions [10, 41]; linearizing array accesses and analyzing the resulting one-dimensional accesses [35]; applying iterative techniques that minimize the amount of communication at each step [1]; and examining issues for special-purpose distributed architectures such as systolic arrays [40].

3.2. Dynamic Partitioning

Anderson and Lam [1] have also proven that the dynamic decomposition problem is NP-hard by transforming the colored multiway cut problem (which is known to be NP-hard) into a subproblem of dynamic decomposition. They address the dynamic distribution problem by using a communication graph in which nodes are loop nests and edges represent the time for communication if the two loop nests involved in an edge are given different distributions. A greedy heuristic is used to combine nodes in such a way that the largest potential communication costs are eliminated first while maintaining sufficient parallelism.

Work by Bixby et al. [5], by Kremer [23], and more recently by Garcia et al. [12] formulates the data partitioning problem in the form of a 0-1 integer programming problem. For each phase, a number of candidate partial data layouts are enumerated along with the estimated execution costs. The costs of the possible transitions among the candidate layouts are then computed, forming a data layout graph. A static performance estimator is used to obtain the node and edge weights of the graph. Since each phase only specifies a partial data distribution, redistribution constraints can cross over multiple phases, thereby requiring the use of 0-1 integer programming. The number of possible data layouts for a given phase is exponential in the number of partitioned array dimensions, resulting in potentially large search spaces. To benefit from advanced techniques in the field of integer programming, the resulting formulation is processed by a commercial integer

is distributed among the processors, several different patterns of communication could be generated. The goal of automatic data partitioning is to select the distribution that will result in the highest level of performance.

If the array were distributed by rows, every processor could independently compute the FFTs for each row that involved local data. After the rows had been processed, the processors would have to communicate to perform the column FFTs, as the columns have been partitioned across the processors. Conversely, if a column distribution were selected, communication would be required to compute the row FFTs while the column FFTs could be computed independently. Such static partitionings, as shown in Fig. 1a, suffer in that they cannot reflect changes in a program's data access behavior. When conflicting data requirements are present, static partitionings tend to be compromises between a number of preferred distributions.

Instead of requiring a single data distribution for the entire execution, program data could also be redistributed dynamically for different phases3 of the program. For this example, assume the program is split into two separate phases; a row distribution is selected for the first phase and a column distribution for the second (as shown in Fig. 1b). By redistributing the data between the two phases, none of the one-dimensional FFT operations would require communication. Such dynamic partitionings can yield higher performance than a static partitioning when the redistribution is more efficient than the communication pattern required by the statically partitioned computation.
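To make the two phases of the computation concrete, the following sketch expresses the 2-D FFT as two sweeps of independent 1-D FFTs separated by a transpose; the transpose is exactly the redistribution point of the dynamic scheme. This is an illustrative serial sketch using NumPy, not code generated by PARADIGM.

```python
import numpy as np

# Illustrative sketch only: the two phases of a 2-D FFT expressed as 1-D FFTs
# along rows and then along (original) columns. Under a row (BLOCK, *)
# distribution the first sweep touches only local data; the second would need
# either butterfly communication (static scheme, Fig. 1a) or a transpose that
# redistributes the array so the columns become local (dynamic scheme, Fig. 1b).

def fft2d_phases(image: np.ndarray) -> np.ndarray:
    # Phase 1: independent 1-D FFTs along every row (local under a row distribution).
    image = np.fft.fft(image, axis=1)
    # Redistribution point of the dynamic scheme: transpose so that the next
    # phase again operates on locally stored rows.
    image = image.T.copy()
    # Phase 2: independent 1-D FFTs along what were originally the columns.
    image = np.fft.fft(image, axis=1)
    return image.T
```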

3. RELATED WORK

3.1. Static Partitioning

Some of the ideas used in the static partitioning algorithm currently implemented in the PARADIGM compiler [16] were inspired by earlier work on multidimensional array alignment [26]. In addition to this work, in recent

3 A phase can be described simply as a sequence of statements in a program over which a given distribution is unchanged.


FIG. 1. Two-dimensional Fast Fourier Transform. (a) Static (butterfly communication). (b) Dynamic (redistribution).


programming tool. In some previous work [24], Kennedy and Kremer also examined a dynamic programming technique that could determine dynamic distributions in polynomial time if each candidate layout specified a mapping for every array in the program. The main drawback with both of these techniques is that the selection of phases must be made a priori, before the problem is formulated. This means that the size of the phases should be as small as possible so that the correct solution is found and that as many candidate layouts as possible should be specified in order to find the best dynamic data distribution. These factors contribute to the increase in the size of the search space.

Bixby, Kennedy, and Kremer have also described an operational definition of a phase, which is defined as the outermost loop of a loop nest such that the corresponding iteration variable is used in a subscript expression of an array reference in the loop body [5]. Even though this definition restricts phase boundaries to loop structures and does not allow for overlapping or nesting of phases, it can be seen that for the example in Section 2 this definition is sufficient to describe the two distinct phases of the computation.

Chapman, Fahringer, and Zima have noted that there are many known good data distributions for important numerical problems that are frequently used [6]. They describe the design of a distribution tool that makes use of performance prediction methods when possible, but also heuristically uses empirical performance data when available. They have also developed cost estimates which model the work distribution, communication, and data locality of a program while taking into account the features of the target architecture [10]. The combination of estimation techniques along with profile information will be used to guide their system in potentially selecting lists of distributions for different arrays that appear in the program. Currently, the performance prediction system has been integrated into a compiler based on Vienna Fortran [7], and research is under way on their partitioning system.

Hudak and Abraham have also proposed a method for selecting redistribution points for programs executed on machines with logically shared but physically distributed memory [18, 19]. Their technique is based on locating significant control flow changes in the program and inserting remapping points [19]. Remapping points are inserted between loop nests with different nesting levels, around control loops, and between a loop that requires nearest-neighbor communication (which is assigned a block partitioning) and a loop that is linearly varying (which is given a cyclic partitioning). Once remapping points have been added to the program structure, a merge phase is performed to remove any unnecessary data redistribution. Remapping points are removed between two phases with identical distributions as well as when one phase has an unspecified distribution (in which case it is assigned a distribution from an adjacent phase). This heuristic was shown to improve the performance of several programs as compared to a


strict data-parallel implementation. For a program consisting of a matrix multiplication followed by an LU factorization, they observed a 26% improvement by applying this technique.

More recently, Sheffler et al. [34] have applied graph contraction methods to the dynamic alignment problem [33] to reduce the size of the problem space that must be examined. Several localized graph transformations are employed to reduce the size of the dynamic alignment problem significantly while still preserving the optimal solution. The alignment-distribution graph (ADG) is based on a single static assignment data-flow form of a data-parallel language. Nodes in the ADG represent data-parallel computations while edges represent the flow of data. The ADG itself is formed in a divide-and-conquer approach using heuristics to approximately solve a combinatorial minimization problem at each step, taking into account both redistribution costs as well as all candidate distributions, to determine where to partition the program into subphases [8]. Candidates are formed by first identifying the extents, or iteration space, of all objects in the program, resulting in a number of clusters. These clusters appear to result in groups very similar to those formed by the operational definition of a phase. A set of representative extents is constructed by selecting one vector from each cluster. This set is then used to determine the preferred distributions of each dimension, typically resulting in a few tens of candidate distributions. When the alignments of the candidates for two connected nodes are different, communication costs are modeled by the worst-case data transfer between any two candidate distributions for the nodes involved. After the graph contraction transformations have been applied to the ADG, the final data distribution is selected using several different heuristics. By first reducing the size of the problem, they are able to apply more expensive heuristics to obtain better overall solutions.

4. DYNAMIC DISTRIBUTION SELECTION

The technique we propose to automatically select redistribution points can be broken into two main steps. First, the program is recursively decomposed into a hierarchy of candidate phases. Then, taking into account the cost of redistributing the data between the different phases, the most efficient sequence of phases and phase transitions is selected.

This approach allows us to build upon the static partitioning techniques [14, 16] previously developed in the PARADIGM project. Static cost estimation techniques [15] are used to guide the selection of phases while static partitioning techniques are used to determine the best possible distribution for each phase. The cost models used to estimate communication and computation costs use parameters, empirically measured for each target machine, to separate the partitioning algorithm from a specific architecture.


are potentially too restrictive), but they will be used here for comparison as well as to facilitate the discussion.

4.1. Phase Decomposition

Initially, the entire program is viewed as a single phase for which a static distribution is determined. At this point, the immediate goal is to determine if and where it would be beneficial to split the program into two separate phases such that the sum of the execution times of the resulting phases is less than the original (as illustrated in Fig. 3). Using the selected distribution, a communication graph is constructed to examine the cost of communication in relation to the flow of data within the program.

We define a communication graph as the flow information from the dependence graph (generated by Parafrase-2 [29]) weighted by the cost of communication. The nodes of the communication graph correspond to individual statements while the edges correspond to flow dependencies that exist between the statements. As a heuristic, the cost of communication performed for a given reference in a

To help illustrate the dynamic partitioning technique, an example program will be used. In Fig. 2, a two-dimensional Alternating Direction Implicit iterative method4 (ADI) is shown, which computes the solution of an elliptic partial differential equation known as Poisson's equation [13]. Poisson's equation can be used to describe the dissipation of heat away from a surface with a fixed temperature as well as to compute the free-space potential created by a surface with an electrical charge.

For the program in Fig. 2, a static data distribution will incur a significant amount of communication for over half of the program's execution. For illustrative purposes only, the operational definition of phases previously described in Section 3 identifies twelve different "phases" in the program. These phases exposed by the operational definition need not be known for our technique (and, in general,

4 To simplify later analysis of performance measurements, the program shown performs an arbitrary number of iterations as opposed to periodically checking for convergence of the solution.


FIG. 2. 2-D Alternating Direction Implicit iterative method (ADI2D).


statement is initially assigned to (reflected back along) every incoming dependence edge corresponding to the reference involved. Since flow information is used to construct the communication graph, the weights on the edges serve to expose communication costs that exist between producer/consumer relationships within a program. Also, since we restrict the granularity of phase partitioning to the statement level, single-node cycles in the flow dependence graph are not included in the communication graph.

After the initial costs have been assigned to the communication graph, they are scaled according to the number of outgoing edges to which each reference was assigned. The scaling conserves the total cost of a communication operation for a given reference, ref, at the consumer, j, by assigning portions to each producer, i, proportional to the dynamic execution count of the given producer, i, divided by the dynamic execution counts of all producers. Note that the scaling factors are computed separately for producers which are predecessors or successors of the consumer, as shown in Eq. (2). Once the individual edge costs have been scaled to conserve the total communication cost, they are propagated back toward the start of the program (through all edges to producers which are predecessors) while still conserving the propagated cost, as shown in Eq. (3). Also, to further differentiate between producers at different nesting levels, all scaling factors are also scaled by the ratio of the nesting levels, as shown in Eq. (1).

Scaling initial costs:

\[
\mathrm{lratio}(i, j) = \frac{\mathrm{nestlevel}(i) + 1}{\mathrm{nestlevel}(j) + 1}
\tag{1}
\]

\[
W(i, j, \mathit{ref}) = \frac{\mathrm{dyncount}(i)}{\displaystyle\sum_{\substack{P \,\in\, \mathrm{pred}(j,\,\mathit{ref})\ \text{if}\ i < j \\ P \,\in\, \mathrm{succ}(j,\,\mathit{ref})\ \text{if}\ i > j}} \mathrm{dyncount}(P)} \cdot \mathrm{lratio}(i, j) \cdot \mathrm{comm}(i, \mathit{ref})
\tag{2}
\]

Propagating costs:

\[
W(i, j, *) = W(i, j, *) + \frac{\mathrm{dyncount}(i)}{\displaystyle\sum_{P \,\in\, \mathrm{pred}(j,\,*)} \mathrm{dyncount}(P)} \cdot \mathrm{lratio}(i, j) \cdot \sum_{P \,\in\, \mathrm{out}(j)} W(j, P, *)
\tag{3}
\]

FIG. 3. Phase decomposition.

In the actual implementation, this is accomplished in two passes over the communication graph after assigning the initial communication costs: the first to compute all the scaling factors and the second to propagate costs back toward the start of the program.
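A minimal sketch of these two passes is given below, under simplifying assumptions: statements are numbered in program order, each edge carries its producer, consumer, the array reference involved, and an initial cost comm(i, ref), and dyncount and nestlevel are lookup tables. The Edge type and the data layout are illustrative, not PARADIGM's internal representation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Edge:
    src: int      # producer statement i
    dst: int      # consumer statement j
    ref: str      # array reference causing the communication
    cost: float   # starts as comm(i, ref), becomes W(i, j, ref)

def lratio(i, j, nestlevel):
    return (nestlevel[i] + 1) / (nestlevel[j] + 1)            # Eq. (1)

def weight_graph(edges, dyncount, nestlevel):
    # Pass 1: scale the initial costs so that the total cost of each reference
    # at its consumer is conserved across all producers (Eq. (2)); predecessors
    # and successors of the consumer are normalized separately.
    groups = defaultdict(list)
    for e in edges:
        groups[(e.dst, e.ref, e.src < e.dst)].append(e)
    for group in groups.values():
        total = sum(dyncount[e.src] for e in group)
        if total == 0:
            continue
        for e in group:
            e.cost *= dyncount[e.src] / total * lratio(e.src, e.dst, nestlevel)

    # Pass 2: propagate costs back toward the start of the program through
    # edges whose producer precedes the consumer (Eq. (3)), processing
    # consumers in reverse program order so downstream costs are already final.
    for j in sorted({e.dst for e in edges}, reverse=True):
        out_total = sum(e.cost for e in edges if e.src == j)
        back = [e for e in edges if e.dst == j and e.src < j]
        total = sum(dyncount[e.src] for e in back)
        if not back or out_total == 0 or total == 0:
            continue
        for e in back:
            e.cost += (dyncount[e.src] / total
                       * lratio(e.src, e.dst, nestlevel) * out_total)
    return edges
```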

In Fig. 4, the communication graph is shown for ADI with some of the edges labeled with the expressions automatically generated by the static cost estimator (using a problem size of 512 × 512 and maxiter set to 100). Conditionals appearing in the cost expressions represent costs that will be incurred based on specific distribution decisions (e.g., P2 > 1 is true if the second mesh dimension is assigned more than one processor). For reference, the communication models for an Intel Paragon and a Thinking Machines CM-5, corresponding to the communication primitives used in the cost expressions, are shown in Table I.

Once the communication graph has been constructed, a split point is determined by computing a maximal cut of the communication graph. The maximal cut removes the largest communication constraints from a given phase to potentially allow better individual distributions to be selected for the two resulting split phases. Since we also want to ensure that the cut divides the program at exactly one point, so that only two subphases are generated for the recursion, only cuts between two successive statements will be considered. Since the ordering of the nodes is related to the linear ordering of statements in a program, this guarantees that the nodes on one side of the cut will always all precede or all follow the node most closely involved in the cut. The following algorithm is used to determine which cut to use to split a given phase.

For simplicity of this discussion, assume for now that there is at most one edge between any two nodes. For multiple references to the same array, the edge weight can be considered to be the sum of all communication operations for that array. Also, to better describe the algorithm, view the communication graph $G = (V, E)$ in the form of an adjacency matrix (with source vertices on rows and destination vertices on columns).

TABLE I
Communication Primitives

                   Intel Paragon                 TMC CM-5

Transfer(m)        23 + 0.12m   (m ≤ 16)         50 + 0.018m
                   86 + 0.12m   (m > 16)
Shift(m)           2 × Transfer(m)
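The primitives in Table I can be read as simple startup-plus-bandwidth linear models. The small helpers below transcribe them; they are illustrative functions, not part of the compiler, and the units of the message size m and of the returned cost are whatever Table I assumes.

```python
# Linear communication cost models transcribed from Table I (illustrative).

def transfer_paragon(m):
    # Intel Paragon: small messages pay a lower startup cost than large ones.
    return 23 + 0.12 * m if m <= 16 else 86 + 0.12 * m

def transfer_cm5(m):
    # Thinking Machines CM-5.
    return 50 + 0.018 * m

def shift(m, transfer):
    # A shift is modeled as two point-to-point transfers.
    return 2 * transfer(m)

# Example: estimated cost of shifting an m-sized boundary on the Paragon.
cost = shift(1024, transfer_paragon)
```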


1. For each statement $S_i$, $i \in [1, |V| - 1]$, compute the cut of the graph between statements $S_i$ and $S_{i+1}$ by summing all the edges in the submatrices specified by $[S_1, S_i] \times [S_{i+1}, S_{|V|}]$ and $[S_{i+1}, S_{|V|}] \times [S_1, S_i]$.

2. While computing the cost of each cut, also keep track of the current maximum cut.

3. If there is more than one cut with the same maximum value, choose from this set the cut that separates the statements at the highest nesting level. If there is more than one cut with the same highest nesting level, record the earliest and latest maximum cuts with that nesting level (forming a cut window).

4. Split the phase using the selected cut.

In Fig. 5, the computation of the maximal cut on a smaller example graph with arbitrary weights is shown. The maximal cut is found to be between vertices 3 and 4 with a cost of 41. This is shown both in the form of the


FIG. 4. Communication graph and some example initial edge costs for ADI2D. (Statement numbers correspond to Fig. 2.)

FIG. 5. Example graph illustrating the computation of a cut. (a) Adjacency matrix. (b) Actual representation. (c) Efficient computation.


sum of the two adjacency submatrices in Fig. 5a, and graphically as a cut on the actual representation in Fig. 5b.

In Fig. 5c, the cut is again illustrated using an adjacency matrix, but the computation is shown using a more efficient implementation which only adds and subtracts the differences between two successive cuts, using a running cut total while searching for the maximum cut in sequence. This implementation also provides much better locality than the full submatrix summation when analyzing the actual sparse representation, since the differences between two successive cuts can be easily obtained by traversing the incoming and outgoing edge lists (which correspond to columns and rows in the adjacency matrix, respectively) of the node immediately preceding the cut. This takes O(E) time on the actual representation, only visiting each edge twice: once to add it and once to subtract it.
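A compact way to realize this running-total search is sketched below; it is an illustrative reconstruction, not the compiler's code. Edges are taken as (src, dst, cost) triples over statements numbered 0..n-1, and the tie-breaking nesting level of a cut is approximated (our assumption) as the shallower nesting level of the two statements it separates; ties are otherwise resolved toward the earliest cut, as the implementation described later does.

```python
def maximal_cut(n_stmts, edges, nestlevel):
    """edges: iterable of (src, dst, cost).  A cut at index k separates
    statements 0..k from k+1..n_stmts-1.  Returns (best_k, best_cost)."""
    out_e = [[] for _ in range(n_stmts)]
    in_e = [[] for _ in range(n_stmts)]
    for s, d, c in edges:
        out_e[s].append((d, c))
        in_e[d].append((s, c))

    def cut_level(k):                      # tie-break heuristic (assumption)
        return min(nestlevel[k], nestlevel[k + 1])

    # Cut 0: every edge incident to statement 0 crosses the cut.
    running = sum(c for s, d, c in edges if (s == 0) != (d == 0))
    best_k, best_cost, best_level = 0, running, cut_level(0)

    for k in range(1, n_stmts - 1):
        # Statement k moves to the "before" side of the cut: its edges to
        # later statements start crossing, its edges to earlier ones stop.
        running += sum(c for d, c in out_e[k] if d > k)
        running += sum(c for s, c in in_e[k] if s > k)
        running -= sum(c for d, c in out_e[k] if d < k)
        running -= sum(c for s, c in in_e[k] if s < k)
        if running > best_cost or (running == best_cost
                                   and cut_level(k) > best_level):
            best_k, best_cost, best_level = k, running, cut_level(k)
    return best_k, best_cost
```

Each edge is touched exactly twice over the whole scan (once at its source, once at its destination), which is the O(E) behavior described above.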

A new distribution is selected for each of the resulting phases (and any unspecified distributions for the subphases are inherited from the parent phase) and the process is continued recursively with the costs from the newly selected distributions. As shown in Fig. 3, each level of the recursion is carried out in branch-and-bound fashion such that a phase is split only if the sum of the estimated execution times of the two resulting phases shows an improvement over the original.5 In Fig. 6, the partitioned communication graph is shown for ADI after the phase decomposition is completed.

As mentioned in the cut algorithm, it is also possible to find several cuts which all have the same cost and nesting level, forming a cut window. This can occur since not all statements generate communication, which results in edges in the communication graph with zero cost as well

5 A further optimization can also be applied to bound the size of the smallest phase that can be split by requiring its estimated execution time to be greater than a "minimum cost" of redistribution.


as regions over which the propagated costs conserve edge flow, both of which will maintain a cut value. To handle cut windows, the phase should be split into two subphases such that the lower subphase uses the earliest cut point and the upper subphase uses the latest, resulting in overlapping phases. After new distributions are selected for each overlapping subphase, the total cost of executing the overlapped region in each subphase is examined. The overlap is then assigned to the subphase that resulted in the lowest execution time for this region. If they are equivalent, the overlapping region can be assigned to either subphase. Currently, this technique is not yet implemented for cut windows. We instead always select the earliest cut point in a window for the partitioning.

To be able to bound the depth of the recursion without ignoring important phases and distributions, the static partitioner must also obey the following property. A partitioning technique is said to be monotonic if it selects the best available partition for a segment of code such that (aside from the cost of redistribution) the time to execute a code segment with a selected distribution is less than or equal to the time to execute the same segment with a distribution that is selected after another code segment is appended to the first. In practice, this condition is satisfied by the static partitioning algorithm that we are using. This can be attributed to the fact that conflicts between distribution preferences are not broken arbitrarily, but are resolved based on the costs imposed by the target architecture [16].
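One way to state this property formally (our notation, not the paper's): let $d(S)$ denote the distribution the static partitioner selects for a code segment $S$, and $T(S, d)$ the estimated time to execute $S$ under distribution $d$, excluding redistribution. The partitioner is monotonic if, for any segments $S_1$ and $S_2$,

\[
T\bigl(S_1,\, d(S_1)\bigr) \;\le\; T\bigl(S_1,\, d(S_1 \Vert S_2)\bigr),
\]

where $S_1 \Vert S_2$ denotes the segment obtained by appending $S_2$ to $S_1$. In words, splitting a segment can never make either half slower than it was under the distribution chosen for the combined segment.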

It is also interesting to note that if a cut occurs within a loop body, and loop distribution can be performed, the amount of redistribution can be greatly reduced by lifting it out of the distributed loop body and performing it in between the two sections of the loop. Also, if dependencies allow statements to be reordered, statements may be able to move across a cut boundary without affecting the cost of the cut while possibly reducing the amount of data to

FIG. 6. Partitioned communication graph for ADI2D. (Statement numbers correspond to Fig. 2.)


significantly longer than a single redistribution operation, which is usually the case, especially if the redistribution considered within the loop was actually accepted when computing the shortest path.

Using the cost models for an Intel Paragon and a Thinking Machines CM-5, the distributions and estimated execution times reported by the static partitioner for the resulting phases, described as ranges of operational phases, are shown in Table II. The performance parameters of the two machines are similar enough that the static partitioning actually selects the same distribution at each phase for each machine. The times estimated for the static partitioning are slightly higher than those actually observed, resulting from a conservative assumption regarding pipelines made by the static cost estimator [14], but they still exhibit similar enough performance trends to be used as estimates. For both machines, the cost of performing redistribution is low enough in comparison to the estimated performance gains that a dynamic distribution scheme is selected, as shown by the shaded area in Fig. 7b.

The pseudo-code for the dynamic partitioning algorithm is presented in Fig. 8 to briefly summarize the entire procedure.
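The recursive decomposition step can also be summarized in code form. The sketch below is our own condensed paraphrase of the procedure described in this section, not a reproduction of Fig. 8; the partitioning, cost estimation, and cut routines are passed in as callables because they stand for the PARADIGM components discussed in the text.

```python
# Condensed paraphrase of the phase decomposition (Section 4.1); a phase is
# represented simply as a list of statements and a cut as the index of the
# last statement of the first subphase.

def decompose(phase, select_static, estimate_time, find_cut):
    """Recursively split `phase` while splitting improves the estimated time."""
    dist = select_static(phase)            # static partitioner [16]
    cost = estimate_time(phase, dist)      # static cost estimator [15]
    node = {"phase": phase, "dist": dist, "cost": cost, "children": []}
    cut = find_cut(phase, dist)            # maximal cut of the communication graph
    if cut is None:
        return node
    left, right = phase[:cut + 1], phase[cut + 1:]
    lnode = decompose(left, select_static, estimate_time, find_cut)
    rnode = decompose(right, select_static, estimate_time, find_cut)
    # Branch and bound: keep the split only if it is estimated to be faster.
    if lnode["cost"] + rnode["cost"] < cost:
        node["children"] = [lnode, rnode]
    return node
```

Phase and phase transition selection (Section 4.2) then operates over the hierarchy returned by this routine, as sketched after the shortest-path discussion below.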

Since the selection of the split point during decomposition implicitly maintains the coupling between individual array distributions, and any missing distributions (due to an array not appearing in a subphase) are inherited from the parent phase, redistribution at any stage will only affect the next stage. This can be contrasted to the technique proposed by Bixby et al. [5], which first selects a number of partial candidate distributions for each phase specified by the operational definition. Since their phase boundaries are chosen in the absence of flow information, redistribution can affect stages at any distance from the current stage. This causes the redistribution costs to become binary functions depending on whether or not a specific path is taken, therefore necessitating 0-1 integer programming. In [23] they do agree, however, that 0-1 integer programming is not necessary when all phases specify complete distributions (as is true in our case). In their work, this occurs only as a special case in which they specify complete phases from the innermost to outermost

TABLE II
Detected Phases and Estimated Execution Times (s) for ADI2D

Level     Op. phase(s)    Distribution    Mesh      Intel Paragon    TMC CM-5

Level 0   I–XII           *,BLOCK         1 × 32      22.151461      39.496276

Level 1   I–VIII          *,BLOCK         1 × 32       1.403644       2.345815
          IX–XII          BLOCK,*         32 × 1       0.602592       0.941550

Level 2   I–III           BLOCK,*         32 × 1       0.376036       0.587556
          IV–VIII         *,BLOCK         1 × 32       0.977952       1.528050

Note. Performance estimates correspond to 32 processors.

be redistributed. Both of these optimizations can be used to reduce the cost of redistribution, but neither will be examined in this paper.

4.2. Phase and Phase Transition Selection

After the program has been recursively decomposed into a hierarchy of phases, a Phase Transition Graph (PTG) is constructed. Nodes in the PTG are phases resulting from the decomposition while edges represent possible redistribution between phases, as shown in Fig. 7a. Since it is possible that using lower-level phases may require transitioning through distributions found at higher levels (to keep the overall redistribution costs to a minimum), the phase transition graph is first partitioned horizontally at the granularity of the lowest level of the phase decomposition. Redistribution costs are then estimated [31] for each edge and are weighted by the execution count of the surrounding code.

If a redistribution edge occurs within a loop structure, additional redistribution may be induced due to the control flow of the loop. To account for a potential "reverse" redistribution which can occur on the back edge of the iteration, the phase transition graph is partitioned around the loop body; the first iteration of the loop containing the phase transition is peeled off and the phases of the first iteration of the body are reinserted in the phase transition graph as shown in Fig. 7b. Redistribution within the peeled iteration is only executed once while that within the remaining loop iterations is now executed (N − 1) times, where N is the number of iterations in the loop. The redistribution which may occur between the first peeled iteration and the remaining iterations is also multiplied by (N − 1) in order to model when the back edge causes redistribution (i.e., when the last phase of the peeled iteration has a different distribution from the first phase of the remaining ones).
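In other words, if $c_{\text{peel}}$ is the redistribution cost incurred inside the peeled first iteration, $c_{\text{body}}$ the cost inside one of the remaining iterations, and $c_{\text{back}}$ the cost of the transition between the peeled iteration and the remaining ones (nonzero only when their boundary distributions differ), the loop contributes (in our notation)

\[
C_{\text{loop}} \;=\; c_{\text{peel}} \;+\; (N-1)\,c_{\text{body}} \;+\; (N-1)\,c_{\text{back}}
\]

to the edge weights of the phase transition graph.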

Once costs have been assigned to all redistribution edges, the best sequence of phases and phase transitions is selected by computing the shortest path on the phase transition graph. This is accomplished in O(V^2) time6 (where V is now the number of vertices in the phase transition graph) using Dijkstra's single-source shortest path algorithm [38].
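A minimal sketch of this selection step is shown below, assuming the PTG is given as a start node, a stop node, and an adjacency map of weighted edges whose costs combine the estimated execution time of the next phase with the redistribution cost of the transition; the representation and the assumption that phase identifiers are orderable (e.g., strings) are ours, not the compiler's.

```python
import heapq

def select_phase_sequence(edges, start, stop):
    """edges: dict mapping a phase to a list of (next_phase, cost) pairs.
    Returns the cheapest sequence of phases from `start` to `stop` and its cost.
    Assumes `stop` is reachable from `start`."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == stop:
            break
        if d > dist.get(u, float("inf")):
            continue                        # stale queue entry
        for v, cost in edges.get(u, ()):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # Reconstruct the selected sequence of phases and transitions.
    path, node = [stop], stop
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[stop]
```

With a binary heap this runs in O(E log V); the O(V^2) bound quoted above corresponds to the array-based formulation of Dijkstra's algorithm.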

After the shortest path has been computed, the loop peeling performed on the PTG can be seen to be necessary to obtain the best solution if the peeled iteration has a different transition sequence than the remaining iterations. Even if the peeled iteration does have different transitions, not actually performing loop peeling on the actual code will incur at most one additional redistribution upon entry to the loop nest. This will not overly affect performance if the execution of the entire loop nest takes

6 If desired, it would also be possible to further improve the performance of the shortest path computation by removing phases which appear multiple times through the different levels after the horizontal partitioning has been performed.



levels of a loop nest. For this situation they show how the solution can be obtained using a hierarchy of single-source shortest path problems in a bottom-up fashion (as opposed to solving only one shortest path problem after performing a top-down phase decomposition as in our approach).

Up until now, we have not described how to handle

FIG. 7. Phase transition graph for ADI2D. (a) Initial phase transition graph. (b) After performing loop peeling.


(using profiling or other criteria), its phases are obtained using the phase selection algorithm previously described but ignoring all code off the main trace. Once phases have been selected, all off-trace paths can be optimized separately by first setting their stop and start nodes to the distributions of the phases selected for the points at which they exit and reenter the main trace. Each off-trace path can then be assigned phases by applying the phase selection algorithm to each path individually. Although this specific

control flow other than for loop constructs. More general flow (caused by conditionals or branch operations) can be viewed as separate paths of execution with different frequencies of execution. The same techniques that have been used for scheduling assembly-level instructions by selecting traces of interest [11] or forming larger blocks from sequences of basic blocks [20] in order to optimize the most frequently taken paths can also be applied to the phase transitions. Once a single trace has been selected


FIG. 8. Pseudo-code for the partitioning algorithm.


technique is not currently implemented in the compiler and will instead be addressed in future work, other researchers have also been considering it as a feasible solution for selecting phase transitions in the presence of general control flow [2].

5. HIERARCHICAL COMPONENT AFFINITY GRAPH (HCAG)

To facilitate the computation of static alignments and communication cost estimates within the current phase, we have also developed a hierarchical form of the extended component affinity graph (CAG) [26] as currently used in the static partitioner [16]. Since alignment preferences are only affected by the relationship of the subscript expressions between the left- and right-hand sides of an assignment statement, this information does not change with respect to the current distribution under consideration. Rather than computing an entire affinity graph for each new phase revealed during the partitioning process, the HCAG can be computed once for the entire program and reused throughout the analysis. This is achieved by precomputing intermediate alignment graphs and storing them for later use within the levels created by compound nodes (e.g., "loop" or "if" statements) in the program's abstract syntax tree (AST).

The HCAG is constructed in a single recursive traversal of the AST performing these operations (a sketch follows the list):

1. pre-order: record individual alignment preferences for the statement.

2. in-order: store the current state of the affinity graph and then set it to NULL (for compound nodes).

3. post-order: merge the statement's alignment preferences with the current state of the affinity graph.
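A minimal sketch of this traversal is given below, assuming a toy AST in which nodes expose their own alignment preferences and their children as attributes, and in which an affinity graph is just a mergeable mapping from alignment preferences to costs; the representation is illustrative, not the compiler's.

```python
from collections import Counter

def build_hcag(node):
    """Store on every compound AST node the affinity graph summarizing its
    subtree; the graph stored at the root is the CAG for the whole program."""
    # pre-order: start from this statement's own alignment preferences, e.g.
    # {(("A", 1), ("B", 1)): cost} for an assignment aligning dim 1 of A and B.
    graph = Counter(getattr(node, "prefs", {}))
    # Compound nodes ("loop", "if"): each child subtree is summarized with a
    # fresh graph (the "set to NULL" step) and merged back in post-order;
    # Counter.update adds the costs of preferences that appear in both.
    for child in getattr(node, "children", []):
        graph.update(build_hcag(child))
    node.hcag = graph   # stored for reuse when later phases cover this region
    return graph
```

Obtaining the CAG for a phase then amounts to merging the stored graphs of only the topmost nodes of the AST region covering that phase, as described next.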

The graph that is stored at the root of the AST corresponds to the CAG for the entire program, whereas each compound statement header will contain the CAG summarizing all statements within the corresponding subtree. Now, all that is required to obtain the CAG for a given phase is to simply merge the HCAG components from only the topmost level of the region of the AST containing the given phase. This saves a considerable amount of time which would otherwise be required to obtain the CAG when examining each new phase.

Furthermore, by also recursing on the call graph for function and subroutine calls encountered during the traversal, the HCAG can also be used to perform interprocedural alignment analysis. Upon encountering the first call site for a given function, the HCAG is computed for the function as previously described and stored, retaining all parameters symbolically in the alignment cost expressions. At all following call sites, the stored HCAG is simply retrieved and the current values of the parameters are substituted into the cost expressions. This technique ignores any interprocedural side effects but, unless the access patterns are highly side-effect dependent, it will still accurately model access patterns across function calls.


In Fig. 9, the HCAG is shown for ADI. For clarity, only a few of the alignment costs are labeled and only the statements corresponding to loop headers are shown. If all statements were shown in the AST, statements contained within a loop body would appear as a connected sequence of right children attached as the left child of the loop header. In the AST, the body of a compound statement is attached as the left child while the right child follows statements in sequence. In the figure, intermediate affinity graphs are illustrated using a small inset at each level of the HCAG while the different nesting levels of the HCAG are indicated by shaded outlines. The affinity graphs represent each dimension of an array as a separate node while alignment preferences (with an associated cost which is incurred if the alignment is not obeyed) connect different dimensions of the arrays.

One thing to notice for ADI is how most of the intermediate affinity graphs are identical at the two innermost levels of the HCAG. This is due to the fact that the innermost two loops in ADI are almost always perfectly nested, causing the outer loop to not contribute any further alignment information than that already obtained from within the inner loop. The structures of the top two levels of this HCAG are also identical since the preferences of the two intermediate graphs merged to form the topmost level were already present in both subgraphs. It is important to note that even though the structure is the same for the top two levels, the alignment costs are actually different. Even though there are no alignment conflicts in this example, one last observation to make is how useful the HCAG can be for analyzing code in which an alignment conflict may exist. If a conflict exists when considering a large portion of the program but disappears when smaller regions are examined, appropriate alignments can be efficiently obtained for each of the competing regions by using the HCAG in combination with the techniques described in the previous section.

As there are still other implementation issues within the static partitioner which remain to be resolved before it can perform interprocedural partitioning, the HCAG has not been implemented and will be the focus of future work in this area.

6. EVALUATION

6.1. 2-D Alternating Direction Implicit (ADI) Iterative Method

In order to evaluate the effectiveness of dynamic distributions, the ADI program, with a problem size of 512 × 512,7 is compiled with a fully static distribution (one iteration shown in Fig. 10a) as well as with the selected dynamic

7 To prevent poor serial performance from cache-line aliasing due to the power-of-two problem size, the arrays were also padded with an extra element at the end of each column. This optimization, although here performed by hand, is automated by aggressive serial optimizing compilers such as the KAP preprocessor from KAI.



FIG. 9. Hierarchical Component Affinity Graph (HCAG) for ADI2D. (Statement numbers correspond to Fig. 2.)


distribution8 (one iteration shown in Fig. 10b). These two parallel versions of the code were run on an Intel Paragon and a Thinking Machines CM-5 to examine the performance of each on the different architectures.

The static scheme illustrated in Fig. 10a performs a shift operation to initially obtain some required data and then satisfies two recurrences in the program using software pipelining [17, 28]. Since values are being propagated through the array during the pipelined computation, processors must wait for results to be computed before continuing with their own part of the computation. Depending on the ratio of communication and computation performance for a given machine, exactly how much data is computed before communicating to the next processor will have a great effect on the performance of pipelined computations.

A small experiment is first performed to determine the best pipeline granularity for the static partitioning. A granularity of one (fine-grain) causes values to be communicated to waiting processors as soon as they are produced. By increasing the granularity, more values are computed before communicating, thereby amortizing the cost of establishing communication in exchange for some reduction in parallelism. In addition to the experimental data, compile-time estimates of the pipeline execution [28] are shown in Fig. 11. For the two machines, it can be seen that by selecting the appropriate granularity, the performance of the static partitioning can be improved. Even though the current computational model is only based on high-level instruction counts [15] and ignores processor pipeline and cache effects, the trend is still modeled well enough to select a granularity at compile time that closely approximates the optimal. Both a fine-grain and the optimal coarse-grain static partitioning will be compared with the dynamic partitioning.
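The tradeoff can be seen in a generic first-order model of a pipelined recurrence over $N$ elements per processor, executed on $P$ processors in blocks of granularity $g$ (this is an illustrative model in our notation, not the estimator of [14, 28]). If each block costs $t_s$ in message startup plus $g\,(t_c + t_b)$ in per-element computation and transmission, then

\[
T(g) \;\approx\; \Bigl(P - 1 + \frac{N}{g}\Bigr)\,\bigl(t_s + g\,(t_c + t_b)\bigr),
\]

so a small $g$ keeps the pipeline fill ($P-1$ stages) cheap but pays the startup cost $t_s$ many times, while a large $g$ amortizes $t_s$ at the price of a longer fill; the best granularity is the one minimizing $T(g)$.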

8 In the current implementation, loop peeling is not performed on the actual code. As previously mentioned in Section 4.2, the single additional startup redistribution due to not peeling will not be significant in comparison to the execution of the loop (containing a dynamic count of 600 redistributions).

FIG. 10. Modes of parallel execution for ADI2D. (a) Static (pipelined); (b) dynamic (redistribution).


The redistribution present in the dynamic scheme appears as three transposes9 performed at two points within an outer loop (the exact points in the program can be seen in Fig. 7). Since the sets of transposes occur at the same point in the program, the data to be communicated for each transpose can be aggregated into a single message during the actual transpose. It has been previously shown that aggregating communication improves performance by reducing the overhead of communication [28], so we will also examine aggregating the individual transpose operations here.

In Fig. 12, the performance of both dynamic and static partitionings for ADI is shown for an Intel Paragon and a Thinking Machines CM-5. For the dynamic partitioning, both aggregated and nonaggregated transpose operations were compared. For both machines, it is apparent that aggregating the transpose communication is very effective, especially as the program is executed on larger numbers of processors. This can be attributed to the fact that the startup cost of communication (which can be several orders of magnitude greater than the per-byte transmission cost) is being amortized over multiple messages with the same source and destination.
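In terms of the Transfer model of Table I, the benefit for a given pair of communicating processors can be illustrated as follows (our notation): sending the three arrays' blocks of $m$ elements individually costs roughly $3\,(t_s + m\,t_b)$, whereas one aggregated message costs roughly

\[
t_s + 3\,m\,t_b,
\]

saving about $2\,t_s$ per processor pair, where $t_s$ and $t_b$ denote the startup and per-element transmission costs. As the number of processors grows and the per-processor data shrinks, the startup term dominates, which is why aggregation pays off most on large machine configurations.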

For the static partitioning, fine-grain pipelining was compared to coarse-grain using the granularity selected earlier. The coarse-grain optimization yielded the greatest benefit on the CM-5 while still improving the performance (to a lesser degree) on the Paragon. For the Paragon, the dynamic partitioning with aggregation clearly improved performance (by over 70% compared to the fine-grain and 60% compared to the coarse-grain static distribution). On the CM-5 the dynamic partitioning with aggregation showed performance gains of over a factor of two compared to the fine-grain static partitioning but only outperformed the coarse-grain version for extremely large numbers of processors. For this reason, it would appear that the limiting factor on the CM-5 is the performance of the communication.

As previously mentioned in Section 4.2, the static partitioner currently makes a very conservative estimate for the execution cost of pipelined loops [14]. For this reason a dynamic partitioning was selected for both the Paragon and the CM-5. If a more accurate pipelined cost model [28] were used, a static partitioning would have instead been selected for the CM-5. For the Paragon, the cost of redistribution is still low enough that a dynamic partitioning would still be selected for machine configurations with a large number of processors.

To complete our analysis of ADI, it is also interesting to estimate the cost of performing a single transpose in either direction (P × 1 ↔ 1 × P) from the communication overhead present in the dynamic runs. Ignoring any perfor-

9 This could have been reduced to two transposes at each point if we allowed the cuts to reorder statements and perform loop distribution on the innermost loops (between statements 17, 18 and 41, 42), but these optimizations are not examined here.


finite difference models of the shallow water equations [32] written by Paul Swarztrauber from the National Center for Atmospheric Research.

As the program consists of a number of different functions, the program is first inlined since the data partitioner is not yet fully interprocedural. Also, a loop which is implicitly formed by a GOTO statement is replaced with an explicit loop since the current performance estimation framework does not handle unstructured code. The final input program, ignoring comments and declarations, resulted in 143 lines of executable code.

In Fig. 13, the phase transition graph is shown with the selected path using costs based on a 32-processor Intel Paragon with the original problem size of 257 × 257 limited to 100 iterations. The decomposition resulting in this graph was purposely bounded only by the productivity of the cut, and not by a minimum cost of redistribution, in order to expose all potentially beneficial phases. This graph shows that, by using the decomposition technique presented in Fig. 8, Shallow contains six phases (the length of the path from the start to stop node) with a maximum of four (sometimes redundant) candidates for any given phase.

As there were no alignment conflicts in the program,

mance gains from cache effects, the communication overhead can be computed by subtracting the ideal runtime (serial time divided by the selected number of processors) from the measured runtime. Given that three arrays are transposed 200 times, the resulting overhead divided by 600 yields a rough estimate of how much time is required to redistribute a single array, as shown in Table III.
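In our notation, the entries of Table III are therefore obtained as

\[
t_{\text{transpose}} \;\approx\; \frac{T_{\text{parallel}} - T_{\text{serial}}/P}{3 \times 2 \times \mathit{maxiter}} \;=\; \frac{T_{\text{parallel}} - T_{\text{serial}}/P}{600},
\]

where $T_{\text{parallel}}$ is the measured runtime on $P$ processors, $T_{\text{serial}}$ the serial runtime, and the denominator counts three arrays redistributed at two points in each of the 100 outer iterations.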

From Table III it can be seen that as more processors are involved in the operation, the time taken to perform one transpose levels off until a certain number of processors is reached. After this point, the amount of data being handled by each individual processor is small enough that the startup overhead of the communication becomes the controlling factor. Aggregating the redistribution operations minimizes this effect, thereby achieving higher levels of performance than would be possible otherwise.

6.2. Shallow Water Weather Prediction Benchmark

Since not all programs will necessarily need dynamic distributions, we also examine another program which exhibits several different phases of computation. The Shallow water benchmark is a weather prediction program using


FIG. 11. Coarse-grain pipelining for ADI2D. (a) Intel Paragon. (b) TMC CM-5.

FIG. 12. Performance of ADI2D. (a) Intel Paragon. (b) TMC CM-5.


and only BLOCK distributions were necessary to maintain a balanced load, the distribution of the 14 arrays in the program can be inferred from the selected configuration of the abstract mesh. By tracing the path back from the stop node to the start, the figure shows that the dynamic partitioner selected a two-dimensional (8 × 4) static data distribution. Since there is no redistribution performed along this path, the loop peeling process previously described in Section 4.2 is not performed on this graph, to improve its clarity.10

As the communication and computation estimates are

10 As peeling is only used to model the possibility of implicit redistribution due to the back edge of a loop, it is only necessary when there is actually redistribution present within a loop.


FIG. 13. Phase transition graph and solution for Shallow. (Displayed with the selected phase transition path and cumulative cost.)

TABLE III
Empirically Estimated Time (ms) to Transpose a 1-D Partitioned Matrix

                   Intel Paragon                TMC CM-5

Processors     Individual   Aggregated     Individual   Aggregated

      8           36.7         32.0           138.9        134.7
     16           15.7         15.6            86.8         80.5
     32           14.8         10.5            49.6         45.8
     64           12.7          6.2            40.4         29.7
    128           21.6          8.7            47.5         27.4

Note. 512 × 512 elements; double precision.


Kennedy and Kremer have also examined the Shallow benchmark, but predicted that a one-dimensional (column-wise) block distribution was the best distribution [21] for up to 32 processors of an Intel iPSC/860 hypercube (also showing that the performance of a 1-D column-wise distribution was almost identical to that of a 1-D row-wise distribution). They chose this distribution because they only exhaustively enumerated one-dimensional candidate layouts for the operational phases.11 As their operational definition already results in 28 phases (over four times as many in comparison to our approach), the complexity of their resulting 0-1 integer programming formulation will only increase further by considering multidimensional layouts.

7. CONCLUSIONS

Dynamic data partitionings can provide higher performance than purely static distributions for programs containing competing data access patterns. The distribution selection technique presented in this paper provides a means of automatically determining high-quality data distributions (dynamic as well as static) in an efficient manner, taking into account both the structure of the input program as well as the architectural parameters of the target machine. Heuristics, based on the observation that high communication costs are a result of not being able to statically align every reference in complex programs simultaneously, are used to form the communication graph. By removing constraints between competing sections of the program, better distributions can potentially be obtained for the individual sections, and if the gains in performance are high enough in comparison to the cost of redistribution, dynamic distributions are formed. Communication still occurs, but data movement is now a dynamic reorganization of ownership as opposed to simply obtaining any required remote data based on a (compromise) static assignment of ownership.





A key requirement in automating this selection process is to be able to obtain estimates of communication and computation costs which accurately model the behavior of the program under a given distribution. Furthermore, by building upon existing static partitioning techniques, the number of phases examined as well as the amount of redistribution considered are kept to a minimum.

Further investigation into the application of statement reordering and loop distribution to reduce the amount of required redistribution is currently under way. We are also in the process of applying interprocedural analysis along with the techniques presented in this paper to investigate possible redistribution at procedure boundaries.

ACKNOWLEDGMENTS

We thank the reviewers for their comments, Amber Roy Chowdhury for his assistance with the coding of the serial ADI algorithm, John Chandy, Amber Roy Chowdhury, and Eugene Hodges for discussions on algorithm complexity, as well as Christy Palermo for her suggestions and comments on this paper. The figures used in this paper for the communication graphs were also generated using a software package known as "Dot" developed by Eleftherios Koutsofios and Steven North with the Software and Systems Research Center, AT&T Bell Laboratories [25].

REFERENCES

1. J. M. Anderson and M. S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. Proc. of the ACM SIGPLAN '93 Conf. on Prog. Lang. Design and Implementation. Albuquerque, NM, June 1993, pp. 112–125.

2. E. Ayguade, J. Garcia, M. Girones, M. L. Grande, and J. Labarta. Data redistribution in an automatic data distribution tool. Proc. of the 8th Work. on Langs. and Compilers for Parallel Computing. Columbus, OH, Aug. 1995. Lecture Notes in Computer Science, Vol. 1033, pp. 407–421. Springer-Verlag, Berlin, 1996.

3. P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges IV, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su. The PARADIGM compiler for distributed-memory multicomputers. IEEE Comput. 28(10), 37–47, Oct. 1995.

4. D. Bau, I. Kodukula, V. Kotlyar, K. Pingali, and P. Stodghill. Solving alignment using elementary linear algebra. Proc. of the 7th Work. on Langs. and Compilers for Parallel Computing. Ithaca, NY, 1994. Lecture Notes in Computer Science, Vol. 892, pp. 46–60. Springer-Verlag, Berlin, 1995.

5. R. Bixby, K. Kennedy, and U. Kremer. Automatic data layout using 0-1 integer programming. Proc. of the 1994 Int'l Conf. on Parallel Archs. and Compilation Techniques. Montreal, Canada, Aug. 1994, pp. 111–122.

6. B. Chapman, T. Fahringer, and H. Zima. Automatic support for data distribution on distributed memory multiprocessor systems. Proc. of the 6th Work. on Langs. and Compilers for Parallel Computing. Portland, OR, Aug. 1993. Lecture Notes in Computer Science, Vol. 768, pp. 184–199. Springer-Verlag, Berlin, 1994.

7. B. Chapman, P. Mehrotra, and H. Zima. Programming in Vienna Fortran. Sci. Prog. 1(1), 31–50, Aug. 1992.

8. S. Chatterjee, J. R. Gilbert, R. Schreiber, and T. J. Sheffler. Array distribution in data-parallel programs. Proc. of the 7th Work. on Langs. and Compilers for Parallel Computing. Ithaca, NY, 1994. Lecture Notes in Computer Science, Vol. 892, pp. 76–91. Springer-Verlag, Berlin, 1995.

9. S. Chatterjee, J. R. Gilbert, R. Schreiber, and S. H. Teng. Automatic array alignment in data-parallel programs. Proc. of the 20th ACM SIGPLAN Symp. on Principles of Prog. Langs. Charleston, SC, Jan. 1993, pp. 16–28.

10. T. Fahringer. Automatic Performance Prediction for Parallel Programs on Massively Parallel Computers. Ph.D. thesis, Univ. of Vienna, Austria, Sept. 1993. TR93-3.

11. J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. 30, 478–490, July 1981.

12. J. Garcia, E. Ayguade, and J. Labarta. A novel approach towards automatic data distribution. Proc. of the Work. on Automatic Data Layout and Performance Prediction. Houston, TX, Apr. 1995.

13. G. Golub and J. M. Ortega. Scientific Computing: An Introduction with Parallel Computing. Academic Press, San Diego, CA, 1993.

14. M. Gupta. Automatic Data Partitioning on Distributed Memory Multicomputers. Ph.D. thesis, Dept. of Computer Science, Univ. of Illinois, Urbana, IL, Sept. 1992. CRHC-92-19/UILU-ENG-92-2237.

15. M. Gupta and P. Banerjee. Compile-time estimation of communication costs on multicomputers. Proc. of the 6th Int'l Parallel Processing Symp. Beverly Hills, CA, Mar. 1992, pp. 470–475.

16. M. Gupta and P. Banerjee. PARADIGM: A compiler for automated data partitioning on multicomputers. Proc. of the 7th ACM Int'l Conf. on Supercomputing. Tokyo, Japan, July 1993.

17. S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed memory machines. Comm. ACM 35(8), 66–80, Aug. 1992.

18. D. E. Hudak and S. G. Abraham. Compiler techniques for data partitioning of sequentially iterated parallel loops. Proc. of the 4th ACM Int'l Conf. on Supercomputing. Amsterdam, The Netherlands, June 1990, pp. 187–200.

19. D. E. Hudak and S. G. Abraham. Compiling Parallel Loops for High Performance Computers—Partitioning, Data Assignment and Remapping. Kluwer, Boston, MA, 1993.

20. W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The Superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7(1), 229–248, Jan. 1993.

21. K. Kennedy and U. Kremer. Automatic data layout for high performance Fortran. Proc. of Supercomputing '95. San Diego, CA, Dec. 1995.

22. K. Knobe, J. Lukas, and G. Steele, Jr. Data optimization: Allocation of arrays to reduce communication on SIMD machines. J. Parallel Distrib. Comput. 8(2), 102–118, Feb. 1990.

23. U. Kremer. Automatic Data Layout for High Performance Fortran. Ph.D. thesis, Rice Univ., Houston, TX, Oct. 1995. CRPC-TR95559-S.

24. U. Kremer, J. Mellor-Crummey, K. Kennedy, and A. Carle. Automatic data layout for distributed-memory machines in the D programming environment. In Automatic Parallelization—New Approaches to Code Generation, Data Distribution, and Performance Prediction. 1993, pp. 136–152.

25. B. Krishnamurthy (Ed.). Practical Reusable UNIX Software. Wiley, New York, 1995.

26. J. Li and M. Chen. The data alignment phase in compiling programs for distributed-memory machines. J. Parallel Distrib. Comput. 13(2), 213–221, Oct. 1991.

27. D. J. Palermo and P. Banerjee. Automatic selection of dynamic data partitioning schemes for distributed-memory multicomputers. Proc. of the 8th Work. on Langs. and Compilers for Parallel Computing. Columbus, OH, Aug. 1995. Lecture Notes in Computer Science, Vol. 1033, pp. 392–406. Springer-Verlag, Berlin, 1996.

28. D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee. Compiler optimizations for distributed memory multicomputers used in the PARADIGM compiler. Proc. of the 23rd Int'l Conf. on Parallel Processing. St. Charles, IL, Aug. 1994, pp. II:1–10.

29. C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and D. Schouten. Parafrase-2: An environment for parallelizing, partitioning, synchronizing and scheduling programs on multiprocessors. Proc. of the 18th Int'l Conf. on Parallel Processing. St. Charles, IL, Aug. 1989, pp. II:39–48.

30. J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. Parallel Distrib. Syst. 2(4), 472–481, Oct. 1991.

31. S. Ramaswamy and P. Banerjee. Automatic generation of efficient array redistribution routines for distributed memory multicomputers. Frontiers '95: The 5th Symp. on the Frontiers of Massively Parallel Computation. McLean, VA, Feb. 1995, pp. 342–349.

32. R. Sadourny. The dynamics of finite-difference models of the shallow-water equations. J. Atmospheric Sci. 32(4), Apr. 1975.

33. T. J. Sheffler, J. R. Gilbert, R. Schreiber, and S. Chatterjee. Aligning parallel arrays to reduce communication. Frontiers '95: The 5th Symp. on the Frontiers of Massively Parallel Computation. McLean, VA, 1995, pp. 324–331.

34. T. J. Sheffler, R. Schreiber, J. R. Gilbert, and W. Pugh. Efficient distribution analysis via graph contraction. Proc. of the 8th Work. on Langs. and Compilers for Parallel Computing. Columbus, OH, Aug. 1995. Lecture Notes in Computer Science, Vol. 1033, pp. 377–391. Springer-Verlag, Berlin, 1996.

35. H. Sivaraman and C. S. Raghavendra. Compiling for MIMD Distributed Memory Machines. Tech. Report EECS-94-021, School of Electrical Engineering and Computer Science, Washington State Univ., Pullman, WA, 1994.

36. E. Su, D. J. Palermo, and P. Banerjee. Automating parallelization of regular computations for distributed memory multicomputers in the PARADIGM compiler. Proc. of the 22nd Int'l Conf. on Parallel Processing. St. Charles, IL, Aug. 1993, pp. II:30–38.

37. E. Su, D. J. Palermo, and P. Banerjee. Processor tagged descriptors: A data structure for compiling for distributed-memory multicomputers. Proc. of the 1994 Int'l Conf. on Parallel Archs. and Compilation Techniques. Montreal, Canada, Aug. 1994, pp. 123–132.

38. R. E. Tarjan. Data Structures and Network Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1983.

39. C. W. Tseng. An Optimizing Fortran D Compiler for MIMD Distributed-Memory Machines. Ph.D. thesis, Rice Univ., Houston, TX, Jan. 1993. COMP TR93-199.

40. P. S. Tseng. Compiling programs for a linear systolic array. Proc. of the ACM SIGPLAN '90 Conf. on Prog. Lang. Design and Implementation. White Plains, NY, June 1990, pp. 311–321.

41. S. Wholey. Automatic data mapping for distributed-memory parallel computers. Proc. of the 6th ACM Int'l Conf. on Supercomputing. Washington, DC, July 1992, pp. 25–34.

PRITHVIRAJ BANERJEE is currently the director of computational science and engineering and a professor of electrical and computer engineering at the University of Illinois at Urbana–Champaign. His research interests are in distributed-memory multicomputer architectures and compilers, and parallel algorithms for VLSI CAD. He has authored more than 140 papers in these areas and is the author of a book entitled Parallel Algorithms for VLSI CAD, Prentice-Hall, 1994. Dr. Banerjee received his B.Tech. in electronics and electrical engineering from the Indian Institute of Technology, Kharagpur, India, in 1981, and the M.S. and Ph.D. in electrical and computer engineering from the University of Illinois at Urbana–Champaign in 1982 and 1984, respectively. Dr. Banerjee was the recipient of the President of India Gold Medal from the Indian Institute of Technology, Kharagpur, in 1981, the IBM Young Faculty Development Award in 1986, and the National Science Foundation's Presidential Young Investigators' Award in 1987. He received the Senior Xerox Award for faculty research in 1992, and the University of Illinois Scholar Award for 1992–1993. He is an editor of IEEE Transactions on VLSI Systems and the Journal of Parallel and Distributed Computing. He is an IEEE fellow and a member of the IEEE Computer Society.

EUGENE W. HODGES IV received his B.S. in electrical engineering from North Carolina State University in 1993 and his M.S. in electrical engineering from the University of Illinois in 1995. He is currently employed by SAS in Cary, North Carolina. His research interests include parallelizing compilers and architecture.

DANIEL J. PALERMO received his B.S. in computer and electrical engineering from Purdue University, West Lafayette, in 1990 and an M.S. in computer engineering from the University of Southern California in 1991. In 1996 he received a Ph.D. in electrical engineering from the University of Illinois. He is currently employed by Hewlett–Packard at the Convex Technology Center in Dallas, Texas. His research interests include high-performance computing systems, parallelizing and optimizing compilers, and parallel languages and application programming.

Received October 12, 1995; revised June 13, 1996; accepted July 2, 1996