DPART: An Automatic Data Partitioning System for Distributed Memory Parallel Machines


This article was downloaded by: [University of Glasgow] on: 20 December 2014, at: 09:01. Publisher: Taylor & Francis. Informa Ltd Registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.

Parallel Algorithms and Applications. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gpaa19

DPART: AN AUTOMATIC DATA PARTITIONING SYSTEM FOR DISTRIBUTED MEMORY PARALLEL MACHINES. ZHAOHUI DUAN and ZHAOQING ZHANG, National Research Center for Intelligent Computing Systems, Institute of Computing Technology, Academia Sinica, P.O. Box 2704, Beijing 100080, PR China. Published online: 16 May 2007.

To cite this article: ZHAOHUI DUAN & ZHAOQING ZHANG (1996) DPART: AN AUTOMATIC DATA PARTITIONING SYSTEM FOR DISTRIBUTED MEMORY PARALLEL MACHINES, Parallel Algorithms and Applications, 9:3-4, 205-212, DOI: 10.1080/10637199608915576

To link to this article: http://dx.doi.org/10.1080/10637199608915576


Page 2: DPART: AN AUTOMATIC DATA PARTITIONING SYSTEM FOR DISTRIBUTED MEMORY PARALLEL MACHINES

Parallel Algorithms and Applications, Vol. 9, pp. 205-212. Reprints available directly from the publisher. Photocopying permitted by license only.

© 1996 OPA (Overseas Publishers Association) Amsterdam B.V. Published in The Netherlands under license by Gordon and Breach Science Publishers SA.

Printed in Malaysia

DPART: AN AUTOMATIC DATA PARTITIONING SYSTEM FOR DISTRIBUTED MEMORY PARALLEL MACHINES

ZHAOHUI DUAN and ZHAOQING ZHANG

National Research Center for Intelligent Computing Systems, Institute of Computing Technology, Academia Sinica, P.O. Box 2704,

Beijing 100080, PR China

(Received January 30, 1996)

One of the most intellectually demanding steps in compiling for distributed memory parallel machines is to determine a suitable data partitioning scheme for a particular program. Most parallelizing compilers for these machines provide little or no support to the user in this difficult task. We have developed DPART, an automatic data partitioning system for Fortran 77 procedures. This paper describes the partitioning strategies of alignment, distribution, and processor layout in DPART. Finally, we present experimental results for the TRED2, DGEFA, and JACOBI procedures to demonstrate the effectiveness of this system.

KEY WORDS: Data partitioning, distributed memory parallel machines, parallelizing compiler

1. INTRODUCTION

Distributed-memory architectures have become extremely popular as a cost-effective method of building massively parallel computers. However, scientists find them extremely difficult to program. The reason is that these systems lack a global address space. The programmer has to perform low-level tasks like distributing data across processors and explicitly managing communication among those processors. Languages like Fortran D or HPF [3] free the programmer from the burden of explicit message-passing by allowing him to write sequential or shared-memory parallel programs annotated with directives specifying data partitioning. The parallelizing compilers for these languages are responsible for partitioning the computation and generating the communication necessary to obtain the non-local data referenced by a processor.

Although the data partitioning scheme determines the way in which computation is partitioned and communication is generated, most current parallelizing compilers for distributed memory machines provide little or no support for this difficult and performance-critical problem. In this paper, we present the partitioning strategies of DPART (shown in Figure 1), an automatic data partitioning system for the Fortran D compiler, which accepts a conventional Fortran 77 program and attempts to generate an SPMD (Single Program Multiple Data) program with message passing, based on a data partitioning scheme.


Figure 1 An overview of the DPART system: alignment analysis, distribution analysis, and processor layout produce the partitioning schemes, guided by computation cost estimation and communication cost estimation.

The rest of this paper is organized as follows. Section 2 briefly reviews the Fortran D data partitioning specification. Sections 3, 4 and 5 describe the algorithms of alignment, distribution, and processor layout analysis in the DPART system. Section 6 presents experimental results for the TRED2, JACOBI and DGEFA routines on the Dawning-1000 MPP machine to demonstrate the effectiveness of our system. Finally, we conclude with brief remarks on our system.

2. THE SPECIFICATION OF DATA PARTITIONING SCHEMES

Fortran D is a version of Fortran enhanced with data partitioning specifications [3], which are expressed using DECOMPOSITION, ALIGN, and DISTRIBUTE statements. A decomposition is a template or index domain, each element of which represents a unit of computation.

The ALIGN statement maps arrays onto decompositions. Arrays mapped to the same decomposition are automatically aligned with each other. The DISTRIBUTE statement specifies how a decomposition is divided among processors by assigning an independent attribute to each dimension of the decomposition. Predefined attributes are BLOCK, CYCLIC, and BLOCK_CYCLIC. The symbol ":" marks dimensions that are not distributed. Suppose that there are P processors and N elements in a decomposition. BLOCK divides the decomposition into contiguous blocks of size N/P, assigning one block to each processor. CYCLIC specifies a round-robin division of the decomposition, assigning every P-th element to the same processor. BLOCK_CYCLIC is similar to CYCLIC but takes a parameter M: it first divides the dimension into contiguous blocks of size M, then assigns these blocks in the same fashion as CYCLIC. Choosing the distribution for a decomposition maps all arrays aligned with it to the machine.
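The three distribution attributes imply simple index-to-processor maps. The following sketch (illustrative Python, not part of DPART or Fortran D; the function names are our own) shows which processor owns element i under each attribute:

```python
# Owner maps implied by BLOCK, CYCLIC, and BLOCK_CYCLIC(M) for a
# decomposition of n elements over p processors (illustrative sketch).

def block_owner(i, n, p):
    """BLOCK: contiguous chunks of ceil(n/p) elements per processor."""
    size = -(-n // p)          # ceiling division
    return i // size

def cyclic_owner(i, p):
    """CYCLIC: element i goes to processor i mod p (round robin)."""
    return i % p

def block_cyclic_owner(i, m, p):
    """BLOCK_CYCLIC(M): blocks of size m dealt out round robin."""
    return (i // m) % p

if __name__ == "__main__":
    n, p = 8, 2
    print([block_owner(i, n, p) for i in range(n)])         # [0,0,0,0,1,1,1,1]
    print([cyclic_owner(i, p) for i in range(n)])           # [0,1,0,1,0,1,0,1]
    print([block_cyclic_owner(i, 2, p) for i in range(n)])  # [0,0,1,1,0,0,1,1]
```

For N = 8 and P = 2 the three attributes produce the owner sequences shown in the comments, which makes the trade-off concrete: BLOCK keeps neighbors together, CYCLIC spreads work evenly, and BLOCK_CYCLIC interpolates between the two.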

Fortran D also provides the capability of specifying processor allocations. The processor allocations are specified by adding an additional parameter indicating the number of processors for each dimension of the distribution.


3. ALIGNMENT ANALYSIS

Alignment analysis attempts to determine the relative positions between arrays in a program. Axis and stride alignment play an important role in reducing communica- tion costs because correcting axis and stride misalignment requires general all-to-all communications.

3.1. Axis Alignment

We use the Component Affinity Graph (CAG) framework to derive the axis alignment. The CAG of a program phase is a weighted graph that consists of columns of nodes. The nodes in one column represent the dimensions of the same array. The edges represent axis alignment preferences between array dimensions, with weights representing the importance of honoring those alignment relationships. The weight of each edge is the sum of the execution counts of all the array assignment statements leading to the axis alignment preference.

For axis alignment, we attempt to partition the node set of the CAG into n (n is the maximum dimensionality of arrays) disjoint subsets (that identify classes of mutually aligned array dimensions) with the restriction that no two nodes in the same column are allowed to be in the same subset, such that the total weight of edges across nodes in different subsets is minimized. The following is the axis alignment algorithm performed by DPART:

Algorithm axis-alignment
input: axis alignment CAG G of a program phase
output: n disjoint node subsets V1, V2, ..., Vn of G (n is the maximum dimensionality of arrays in G); c(i,k) denotes the k-th node of column Ci

m <- the number of columns in G
C1 <- the column in G with the maximum number of nodes
for i = 1 to n do
    Vi <- Vi ∪ {c(1,i)}
randomly arrange the left m-1 columns as C2, C3, ..., Cm
for i = 2 to m do
    find the maximum weighted matching of the bipartite graph (C1, Ci)
    for k = 1 to num(Ci) do
        if (c(i,k) is matched with c(1,j)) then Vj <- Vj ∪ {c(i,k)}
        for (each node d in the left m-2 columns) do
            if (d is connected with c(i,k)) then
                if (d is also connected with c(1,j)) then
                    weight(d, c(1,j)) <- weight(d, c(1,j)) + weight(d, c(i,k))
                else
                    weight(d, c(1,j)) <- weight(d, c(i,k))
        enddo
    enddo
    remove column Ci from G
enddo
end.
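To make the heuristic concrete, here is a toy re-implementation sketch in Python (our own reconstruction, not DPART's source): each non-anchor column is matched against the anchor column by a brute-force maximum-weight matching (adequate for the small CAGs of a single phase), matched dimensions are merged into the anchor node's class, and the remaining edge weights are folded onto the anchor nodes.

```python
# Toy axis-alignment sketch (assumed semantics, not DPART's code).
# Columns are lists of array-dimension names; `weight` maps
# (anchor_node, other_node) pairs to alignment-preference weights.
from itertools import permutations

def max_weight_matching(anchor, col, weight):
    """Best assignment of col's nodes to distinct anchor nodes (brute force)."""
    best, best_w = None, -1
    for perm in permutations(anchor, len(col)):
        w = sum(weight.get((a, c), 0) for a, c in zip(perm, col))
        if w > best_w:
            best, best_w = list(zip(perm, col)), w
    return best

def axis_align(columns, weight):
    # anchor = the column with the most nodes (C1 in the paper)
    columns = sorted(columns, key=len, reverse=True)
    anchor = columns[0]
    classes = {a: [a] for a in anchor}          # one class per anchor node
    for col in columns[1:]:
        for a, c in max_weight_matching(anchor, col, weight):
            classes[a].append(c)                # merge matched dimension
            # fold c's edges from not-yet-matched columns onto the anchor node
            for (u, v), wt in list(weight.items()):
                if v == c and u not in anchor:
                    weight[(u, a)] = weight.get((u, a), 0) + wt
    return list(classes.values())

# Two 2-D arrays A and Z whose assignments prefer A1~Z1 and A2~Z2:
w = {("A1", "Z1"): 10, ("A2", "Z2"): 10, ("A1", "Z2"): 1}
print(axis_align([["A1", "A2"], ["Z1", "Z2"]], w))
# -> [['A1', 'Z1'], ['A2', 'Z2']]
```

The brute-force matching is exponential in the column size, which is fine here because column size is bounded by array dimensionality (rarely above three in practice).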


3.2. Stride Alignment

Stride alignment attempts to determine the spacing with which every dimension in one class of aligned dimensions derived by axis alignment is aligned to the template axis. We again use a greedy heuristic, based on a stride alignment graph in which the nodes in one column represent all the stride occurrences relative to a dimension in one class of aligned dimensions. The goal of stride alignment is to find a maximal-weight spanning tree with the restriction that one and only one node in each column can occur in it. The heuristic is described as follows:

Algorithm stride-alignment
input: stride alignment CAG G of a class of aligned dimensions
output: the alignment stride of every dimension in G; c(i,l) denotes the l-th node of column Ci

m <- the number of columns in G
G' <- ∅
select an arbitrary column in G as C1
randomly arrange the left m-1 columns as C2, C3, ..., Cm
for l = 1 to num(C1) do
    G'' <- {c(1,l)}
    for i = 2 to m do
        find the maximal weighted edge among the edges from c(1,l) to Ci
        let the other node incident to this edge be c(i,k); c(i,k) will be the stride
        G'' <- G'' ∪ {c(i,k)}
        for (each node d in the left m-2 columns) do
            if (d is connected with c(i,k)) then
                if (d is also connected with c(1,l)) then
                    weight(d, c(1,l)) <- weight(d, c(1,l)) + weight(d, c(i,k))
                else
                    weight(d, c(1,l)) <- weight(d, c(i,k))
        enddo
        remove Ci from G
    enddo
    if (weight(G'') >= weight(G')) then
        G' <- G''
enddo
end.
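The core of the heuristic can be sketched in a few lines of Python (our own simplified reconstruction: it tries each stride occurrence of the first dimension as a seed and greedily picks the heaviest edge into each remaining column, omitting the weight-folding step of the full algorithm):

```python
# Simplified greedy stride-alignment sketch (assumed semantics).

def stride_align(columns, weight):
    """columns: lists of (dimension, stride) nodes; weight: edge-weight dict."""
    best_tree, best_w = None, -1
    seed_col, rest = columns[0], columns[1:]
    for seed in seed_col:                # try every stride occurrence of C1
        tree, total = [seed], 0
        for col in rest:
            # heaviest edge from the seed into this column
            node = max(col, key=lambda c: weight.get((seed, c), 0))
            tree.append(node)
            total += weight.get((seed, node), 0)
        if total > best_w:
            best_tree, best_w = tree, total
    return best_tree

cols = [[("A1", 1), ("A1", 2)], [("Z1", 1), ("Z1", 2)]]
w = {(("A1", 1), ("Z1", 1)): 5, (("A1", 2), ("Z1", 2)): 3}
print(stride_align(cols, w))   # -> [('A1', 1), ('Z1', 1)]
```

Each node pairs a dimension with one stride at which it occurs; the returned tree fixes one stride per dimension in the class.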

3.3. Offset Alignment

Offset misalignment causes nearest-neighbor communication, which has a relatively lower cost than general communication routines. We use an approach similar to that of stride alignment to derive the alignment offset; it is described elsewhere.

4. DISTRIBUTION ANALYSIS

Distribution analysis determines whether a class of aligned dimensions should be distributed in a blocked or cyclic manner. Each array dimension in the class should


take the same manner in order to keep the alignment relationship. The alignment stride is the minimal region to be distributed, so a distribution with a stride other than one leads to the general BLOCK_CYCLIC manner.

We use a maximum-weight method to choose the manner a class of aligned dimensions will take. We examine the following two situations of array references in loop nests, assign a weight to the corresponding distribution, and select the distribution with the larger weight as the final distribution.

1. If the distribution of an array dimension over more than one processor leads to nearest-neighbor communication, we suggest it take the blocked distribution, because this manner restricts communication to the elements at the boundaries of regions. DPART analyzes pairs of subscript expressions corresponding to aligned dimensions on both sides of each assignment statement. If the pair of subscripts satisfies the boundary-communication test, the weight for the blocked distribution is the difference in the communication costs of the blocked and cyclic distributions.

2. If the lhs (left-hand side) array in an assignment statement is involved in triangular loops, we suggest the dimension whose subscript expression contains an index of such loops take the cyclic distribution, because this manner leads to a more even distribution of computation than the blocked distribution. We use the difference in the computational costs of the blocked and cyclic distributions as the weight for the cyclic distribution.

The computational and communication costs can be evaluated using the static performance estimation method for data partitioning schemes described in [1].
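The second situation can be illustrated with a small numeric experiment (illustrative Python, not from the paper): in a triangular loop such as Gaussian elimination, where iterating over row i touches rows i..n, a blocked row distribution overloads the processor holding the last rows, while a cyclic distribution stays near-balanced.

```python
# Why cyclic wins for triangular loops: compare the worst-case processor
# load against the mean load under blocked and cyclic row distributions.

def load_imbalance(owners, cost):
    """max per-processor load divided by the mean load (1.0 = perfect)."""
    per_proc = {}
    for row, proc in enumerate(owners):
        per_proc[proc] = per_proc.get(proc, 0) + cost(row)
    return max(per_proc.values()) / (sum(per_proc.values()) / len(per_proc))

n, p = 64, 4
cost = lambda i: n - i                          # triangular work per row
blocked = [i // (n // p) for i in range(n)]     # contiguous chunks of rows
cyclic = [i % p for i in range(n)]              # round-robin rows
print(round(load_imbalance(blocked, cost), 2))  # -> 1.74
print(round(load_imbalance(cyclic, cost), 2))   # -> 1.05
```

With 64 rows on 4 processors, the blocked version's most loaded processor does about 74% more work than average, while the cyclic version stays within 5% of perfect balance.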

5. PROCESSOR LAYOUT

In the preceding passes, each dimension of an array is assumed to be distributed on a virtual processor mesh. We need to determine the number of processors in each mesh dimension in order to map all the arrays to the real physical machine. Empirical studies have shown that only 8% of real scientific applications have array references with more than two dimensions, so we believe that restricting the number of physical mesh dimensions to two does not limit the extent to which parallelism can be exploited. DPART applies the following algorithm to determine the processor layout.

Algorithm determine-phy-mesh
input: P - program phase, N - number of processors, D - classes of aligned dimensions
output: C-D1, C-D2 - distributed dimensions; N-D1, N-D2 - numbers of processors assigned to C-D1, C-D2

D' <- D
for i = 1 to num(D) do
    if (Di does not exhibit parallelism) then
        num-procs[Di] <- 1
        D' <- D' - Di
enddo
if (num(D') >= 2) then
    for ((Di, Dj) ∈ set(D')) do
        pmax <- 0
        for k = 1 to N do
            p-estimate <- Performance-Estimate(P, Di, Dj, k, ⌊N/k⌋)
            if (p-estimate > pmax) then
                pmax <- p-estimate
                C-D1 <- Di, C-D2 <- Dj
                N-D1 <- k, N-D2 <- ⌊N/k⌋
            endif
        enddo
    enddo
endif
return (C-D1, C-D2, N-D1, N-D2)
end.

Figure 2 The axis alignment CAG for TRED2.
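The enumeration above can be sketched as follows (illustrative Python; `Performance-Estimate` is replaced by a toy stand-in, `toy_estimate`, and all names are our own):

```python
# Sketch of the 2-D processor-layout search: for each candidate pair of
# distributed dimension classes, try every factorization k x (N // k) of
# the N processors and keep the best-scoring layout.
from itertools import combinations

def best_layout(classes, n_procs, perf_estimate):
    best = None
    for d1, d2 in combinations(classes, 2):
        for k in range(1, n_procs + 1):
            score = perf_estimate(d1, d2, k, n_procs // k)
            if best is None or score > best[0]:
                best = (score, d1, d2, k, n_procs // k)
    return best

# Toy stand-in for the paper's static performance estimator: reward using
# all processors and prefer square-ish meshes.
def toy_estimate(d1, d2, p1, p2):
    return p1 * p2 - abs(p1 - p2)

print(best_layout(["class1", "class2"], 16, toy_estimate))
# -> (16, 'class1', 'class2', 4, 4)
```

In DPART the estimator would be the static computation and communication cost model of [1]; the toy estimator here merely shows that the search selects a 4 x 4 mesh for 16 processors.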

6. EXPERIMENTAL RESULTS

All the algorithms described in this paper have been implemented in the DPART system. To evaluate the effectiveness of our proposed algorithms, we ran DPART over three Fortran 77 procedures, ran the codes generated by the Fortran D compiler, based on the array partitioning schemes derived by DPART, on the Dawning-1000 MPP machine constructed by NCIC, and compared our results to those based on other array partitioning schemes.

TRED2 is a routine that reduces a real symmetric matrix to a symmetric tridiagonal matrix using orthogonal similarity transformations. The axis alignment CAG for TRED2 is shown in Figure 2.

DPART applies the axis alignment algorithm to the CAG and obtains the following two classes of aligned dimensions: class 1 consisting of A1, Z1, D1, E1, and class 2 consisting of D2, E2. The alignment stride of each aligned dimension is derived as 1 by DPART. Thus our system leads to the same result as [2], whose effectiveness has been demonstrated in comparison with other partitioning schemes.


Table 1

Processor   Data Size   Sequential   Column Cyclic   Column Blocked   2-D Blocked
Number      n           Time(s)      Time(s)         Time(s)          Time(s)

4           128         2.38         1.41            1.19             0.82
            256         9.%          4.53            3.88             3.25
            512         48.78        19.27           15.91            12.97
            1024        206.8        71.5            64.4             56.5

Table 2

Processor   Data Size   Sequential   Column Blocked   Column Cyclic
Number      n           Time(s)      Time(s)          Time(s)

4           128         0.39         0.29             0.19
            256         3.18         1.74             1.24
            512         26.28        13.54            9.47
            1024        221.5        109.01           75.54

8           128         0.39         0.15             0.12
            256         3.18         0.90             0.59

JACOBI is a simplified version of a relaxation code that performs Jacobi iterations in a loop. As the array references lead to nearest-neighbor communications, DPART selects 2-D blocked distributions for arrays A and B in JACOBI. We ran three versions of JACOBI with different data partitioning schemes; the experimental results are given in Table 1. We find that the 2-D blocked version takes the least execution time of the three versions.

DGEFA factors a real matrix using Gaussian elimination with partial pivoting. As elimination proceeds, the single array A in DGEFA is traversed more frequently in the lower-right region than in the upper-left region. DPART suggests that array A take the cyclic distribution to ensure load balance. We implemented two versions of DGEFA with column cyclic and column blocked distributions. The results in Table 2 show that the cyclic version takes less execution time than the blocked version across different data sizes and processor counts.

7. CONCLUSIONS

In this paper, we have described the partitioning strategies of DPART, which makes data partitioning decisions on Fortran 77 programs for the Fortran D compiler. The analysis of each pass determines one partitioning parameter of the arrays. These strategies, together with a static computational and communication performance estimating module, constitute the DPART system. We have shown the effectiveness of our system on regular computations through experimental results on some real-life Fortran codes.

We are currently exploring techniques for handling procedure calls and repartitioning data in different phases of a program.

References

[1] Z. Duan and Z. Zhang, The static estimating of data decomposition schemes on distributed memory parallel machines. In Proceedings of the 4th International Conference for Young Computer Scientists, Beijing, July 1995.

[2] M. Gupta and P. Banerjee, Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems 3, 2 (Mar. 1992), 179-193.

[3] S. Hiranandani, K. Kennedy and C. Tseng, Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM 35, 8 (Aug. 1992), 66-80.
