NASA Contractor Report 181982
ICASE Report No. 90-7

SUPPORTING SHARED DATA STRUCTURES ON
DISTRIBUTED MEMORY ARCHITECTURES

Charles Koelbel
Piyush Mehrotra
John Van Rosendale

Contract No. NAS1-18605
January 1990

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, Virginia 23665-5225

Operated by the Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23665-5225
Supporting Shared Data Structures on
Distributed Memory Architectures*
Charles Koelbel
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907.
Piyush Mehrotra†
John Van Rosendale
ICASE, NASA Langley Research Center
Hampton, VA 23665.
Abstract
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece "owned" by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. This paper presents a new programming environment for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. We describe the analysis and program transformations required to implement this environment, and present the efficiency of the resulting code on the NCUBE/7 and iPSC/2 hypercubes.

*Research supported by the Office of Naval Research under contract ONR N00014-88-M-0108, and by the National Aeronautics and Space Administration under NASA contract NAS1-18605 while the authors were in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665.
†On leave from the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907.
1 Introduction
Distributed memory architectures promise to provide very high levels of performance
for scientific applications at modest costs. However, they are extremely awkward to
program. The programming languages currently available for such machines directly
reflect the underlying hardware in the same sense that assembly languages reflect the
registers and instruction set of a microprocessor.
The basic issue is that programmers tend to think in terms of manipulating large
data structures, such as grids, matrices, etc. In contrast, in current message-passing
languages each process can access only the local address space of the processor on
which it is executing. Thus the programmer must decompose each data structure into
a collection of pieces, each piece being "owned" by a single process. All interactions
between different parts of the data structure must then be explicitly specified using
the low-level message-passing constructs supported by the language.
Decomposing all data structures in this way, and specifying communication ex-
plicitly can be extraordinarily complicated and error prone. However, there is also
a more subtle problem here. Since the partitioning of the data structures across the
processors must be done at the highest level of the program, and each operation on
these distributed data structures turns into a sequence of "send" and "receive"
operations intricately embedded in the code, programs become highly inflexible. This
makes the parallel program not only difficult to design and debug, but also "hard
wires" all algorithm choices, inhibiting exploration of alternatives.
In this paper we present a programming environment, called Kali*, which is de-
signed to simplify the problem of programming distributed memory architectures.
Kali provides a software layer supporting a global name space on distributed mem-
ory architectures. The computation is specified via a set of parallel loops using this
global name space exactly as one does on a shared memory architecture. The dan-
ger here is that since true shared memory does not exist, one might easily sacrifice
performance. However, by requiring the user to explicitly control data distribution
and load balancing, we force awareness of those issues critical to performance on non-
shared memory architectures. In effect, we acquire the ease of programmability of the
shared memory model, while retaining the performance characteristics of nonshared
memory architectures.

In Kali, one specifies parallel algorithms in a high-level, distribution-independent
manner. The compiler then analyzes this high-level specification and transforms it
into a system of interacting tasks, which communicate via message-passing. This
approach allows the programmer to focus on high-level algorithm design and perfor-
mance issues, while relegating the minor but complex details of interprocessor com-
munication to the compiler and run-time environment. Preliminary results suggest
that the performance of the resulting message-passing code is in many cases virtually
identical to that which would be achieved had the user programmed directly in a
*Kali is the name of a Hindu goddess of creation and destruction who possesses multiple arms,
embodying the concept of parallel work.
processors Procs : array[ 1..P ] with P in 1..max_procs;

var A : array[ 1..N ] of real dist by [ block ] on Procs;
    B : array[ 1..N, 1..M ] of real dist by [ cyclic, * ] on Procs;

forall i in 1..N-1 on A[i].loc do
    A[i] := A[i+1];
end;

Figure 1: Kali language primitives
message-passing language.

The remainder of this paper is organized as follows. Section 2 describes Kali,
the language in which we have implemented our ideas. Section 3 presents the analysis
needed to map a Kali program onto a nonshared memory architecture. If enough
information is available, the compiler can perform this analysis at compile-time. Otherwise
the compiler produces run-time code to generate the required information. We
close this section with an example illustrating the latter situation. Section 4 shows the
performance achieved by the sample program on the NCUBE/7 and iPSC/2. Finally,
Section 5 compares our work with other groups, and Section 6 gives our conclusions.
2 Kali Language Primitives
The goal of our approach is to allow programmers to treat distributed data structures
as single objects. We assume the user is designing data-parallel algorithms, which can
be specified as parallel loops. Our system then translates this high level specification
into an SPMD-style program which can execute efficiently on a distributed memory
architecture. In our approach, the progro:mmer must specify three things:
a) The processor topology on which the program iS to be executed
b) The distribution of the data structures across these processors =
c) The parallel loops and where they are to be executed
By specifying these items, the user retains control over aspects of the program critical
to performance, such as data distributions and load balancing ......
The following subsections describe each of these specifications. Figure 1 gives an
example of these declarations in Kali, a Pascal-like language we created as a testbed
for these techniques [4, 6]. These primitives can as easily be added to FORTRAN, as
described in [7], or any other sequential language.
2.1 Processor Arrays
The first thing that needs to be specified is a "processor array." This is an array
of physical processors across which the data structures will be distributed, and on
which the algorithm will execute. The processors line in Figure 1 declares this array.
This particular declaration allocates a one-dimensional array Procs of P processors,
where P is an integer constant between 1 and max_procs dynamically chosen by
the run-time system. (Our current implementation chooses the largest feasible P;
future implementations might use fewer processors to improve granularity or for other
reasons.) Multi-dimensional processor arrays can be declared similarly.
This construct provides a "real estate agent," as suggested by C. Seitz. Allowing
the size of the processor array to be dynamically chosen is important here, since it
provides portability and avoids dead-lock in case fewer processors are available than
expected. The basic assumption is that the underlying architecture can support
multi-dimensional arrays of physical processors, an assumption natural for hypercubes
and mesh-connected architectures.
2.2 Defining a Distribution Pattern
Given a processor array, the programmer must specify the distribution of data struc-
tures across the array. Currently the only distributed data type supported is dis-
tributed arrays. Array distributions are specified by a distribution clause in their
declaration. This clause specifies a sequence of distribution patterns, one for each
dimension of the array. Scalar variables and arrays without a distribution clause are
simply replicated, with one copy assigned to each of the processes.
Mathematically, the distribution pattern of an array can be defined as a function
from processors to sets of array elements. If Proc is the set of processors and Arr
the set of array elements, then we define

    local : Proc → 2^Arr

as the function giving, for each processor p, the subset of Arr which p stores locally.
In this paper we will assume that the sets of local elements are disjoint; that is, if
p ≠ q then local(p) ∩ local(q) = ∅. This reflects the practice of storing only one copy
of each array element. We also make the convention that collections of processors
and array elements are represented by their index sets, which we take to be vectors
of integers.
Kali provides notations for the most common distribution patterns. Once the
processor array Procs is declared, data arrays can be distributed across it using dist
clauses in the array declarations, also shown in Figure 1. Array A is distributed by
blocks, giving it a local function of

    local_A(p) = { i | (p−1)·⌈N/P⌉ < i ≤ p·⌈N/P⌉ }

This assigns a contiguous block of array elements to each processor. Array B has its
rows cyclically distributed; its local is

    local_B(p) = { (i,j) | i ≡ p (mod P) }

Here, if P were 10, processor 1 would store elements in rows 1, 11, 21, and so on,
while processor 10 would store rows which were multiples of 10. Kali also supports
block-cyclic distributions and provides a mechanism for user-defined distributions.
The number of dimensions of an array that are distributed must match the number
of dimensions of the underlying processor array. Asterisks are used to indicate
dimensions of data arrays which are not distributed, as in the case of B in Figure 1.
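For concreteness, the two local functions above can be stated operationally. The following Python sketch is our illustration, not Kali code; the function names and the ceiling-based block convention are our assumptions, with 1-based indices as in the paper.

```python
def local_block(p, N, P):
    """Elements owned by processor p (1..P) under a block distribution:
    a contiguous chunk of ceil(N/P) elements; the last chunk may be short."""
    b = -(-N // P)  # ceiling of N/P
    return set(range((p - 1) * b + 1, min(p * b, N) + 1))

def local_cyclic(p, N, P):
    """Elements owned by processor p under a cyclic distribution:
    indices congruent to p modulo P."""
    return {i for i in range(1, N + 1) if i % P == p % P}

# With N = 10 and P = 2, processor 1 owns the first block / the odd indices:
assert local_block(1, 10, 2) == {1, 2, 3, 4, 5}
assert local_cyclic(1, 10, 2) == {1, 3, 5, 7, 9}
# The local sets are disjoint, as assumed in the text:
assert local_block(1, 10, 2).isdisjoint(local_block(2, 10, 2))
```

Any distribution expressible as such a function could in principle be plugged in, which is what the user-defined distributions mentioned above would amount to.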
2.3 Forall Loops
Operations on distributed data structures are specified by forall loops. The forall
loop here is similar to that in BLAZE [5]. The example in Figure 1 shows a loop
which performs N - 1 loop invocations, shifting the values in the array A one space
to the left. The semantics here are "copy-in copy-out," in the sense that the values
on the right hand side of the assignment are the old values in array A, before being
modified by the loop. Thus the array A is effectively "copied into" each invocation
of the forall loop, and then the changes are "copied out."
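The copy-in copy-out rule is easy to demonstrate outside Kali; the few lines of Python below are our illustration of the semantics, not Kali syntax, and show why the left shift is independent of the order in which iterations run.

```python
# Kali's forall reads old values: every A[i] := A[i+1] sees A as it was
# before the loop, so the array shifts left by one regardless of iteration order.
A = [10, 20, 30, 40]
old = list(A)                 # "copy in": snapshot of A before the loop
for i in range(len(A) - 1):   # the forall's iterations, in any order
    A[i] = old[i + 1]
# A is now [20, 30, 40, 40].
```

Without the snapshot, running the same loop right-to-left in place would instead propagate A[3] into every slot, which is exactly the order-dependence the copy-in copy-out semantics rules out.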
In addition to the range specification in the header of the forall there is an
on clause. This clause specifies the processor on which each loop invocation is to
be executed. In the above program fragment, the on clause causes the ith loop
invocation to be executed on the processor owning the ith element of the array A.
Although this is the most common use of the on clause, it is also possible to name
the processor directly by indexing into the processor array.
2.4 Global Name Space
Given the processors, dist, and forall primitives, a programmer can specify a data
parallel algorithm at a high level, while still retaining control over those details critical
to performance. For example, the code fragment in Figure 4 in Section 3 shows a
typical numerical computation. It is important to note that there are no message
passing statements in either that program or Figure 1; instead, the programmer can
view the program as operating within a global name space. The compiler analyzes
the program and produces the low level details of the message passing code required
to support the sharing of data on the distributed memory machines.
The support of a shared memory model provides a distinct advantage over message
passing languages; in those languages, communications statements often substantially
increase the program size and complexity [2]. The global name space model used here
allows the bodies of the forall loops to be independent of the distribution of the data
and processor arrays used. If only local name spaces were supported, this would not be
forall i ∈ Index_set on A[f(i)].loc do
    ... R1 ...
    ... R2 ...
      ...
    ... Rn ...
end;

Figure 2: Pseudocode loop for subscript analysis
the case, since the communications necessary to implement two distribution patterns
would be quite different. With our primitives a variety of distribution patterns can
easily be tried by trivial modification of this program. Such a modification in a
message passing language would involve extensive rewriting of the communications
statements. Thus, Kali allows programming at a higher level of abstraction, since the
programmer can focus on the general algorithm rather than the machine-dependent
details of its implementation.
3 Analysis of the Program
Given a Kali program written using the distribution patterns and forall loops de-
scribed above, the compiler must generate code that implements the message passing
necessary to run the program on a nonshared memory machine. This entails an anal-
ysis of the subscripts of array references to determine which ones may cause access to
nonlocal elements. We will describe such an analysis in this section and then discuss
how it can be efficiently accomplished.
3.1 General Outline of the Analysis
The type of loop we are considering has the form shown in Figure 2. Iteration i of
the loop is executed on the processor storing A[f(i)]. In many cases, f will be the
identity function, but we allow other functions for generality. Each Rk represents an
array reference of the form

    Rk ≡ A[gk(i)]
For simplicity, we will assume here that only one array A is referenced. The general
case of multiple arrays does not alter the goals of the analysis, although it may
complicate the analysis itself if the arrays have different distribution patterns. The
gk functions may depend on other program variables, so long as those variables are
invariant during the execution of the forall loop.
The set of iterations executed on processor p, denoted by exec(p), is determined
by the on clause associated with the forall loop. For example, in Figure 2, because
of the on clause "A[f(i)].loc", this set consists of the iterations i such that A[f(i)] is
local to processor p. We define this set mathematically as

    exec(p) = f⁻¹(local(p))
where local is the distribution function associated with array A. Each processor p
will execute every iteration in exec(p) which is in the forall's range, that is, the
intersection of the range with exec(p). In the loop of Figure 2, for example, processor p
will execute all iterations in Index_set ∩ f⁻¹(local(p)). This intersection is often equal
to exec(p) except for boundary conditions; the name exec(p) was chosen to reflect
this close association. For simplicity, in this paper we will assume that p executes
exactly the iterations in exec(p), as is generally the case. In cases where this is not
true, it is generally only necessary to intersect Index_set with exec(p) in the following
equations.
We first identify the forall iterations that can cause nonlocal array references.
There are two reasons for doing this: local accesses may be more amenable to opti-
mization than general accesses, and we can overlap communication with computation
in iterations that access only local array elements. For each processor p and
reference R ≡ A[g(i)] we define the set

    ref(p) = g⁻¹(local(p))

This is the subset of the (unbounded) iteration space where R is always a local
reference. Note that iterations in exec(p) ∩ ref(p) are executed on processor p and
access only p's local memory. Thus, if exec(p) ⊆ ref(p) then the reference R can
always be satisfied locally on processor p. Otherwise, any element a such that a ∈
exec(p) but a ∉ ref(p) represents an iteration on p that may reference an array
element not on p; this element must be communicated to p via messages. In other
words, iterations in exec(p) − ref(p) cause nonlocal accesses on p. The first stage of
the analysis therefore finds ref(p) for each reference R and processor p and determines
how they intersect with the loop range sets exec(p).
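For the shift loop of Figure 1, where f(i) = i and g(i) = i+1, these sets can be enumerated directly. The Python sketch below is our illustration (explicit set enumeration rather than the closed forms a compiler would derive), assuming a block distribution of N = 8 elements over two processors.

```python
# f^-1 restricted to the loop's range: the iterations whose image lands in a set.
def inv_image(fn, iters, elems):
    return {i for i in iters if fn(i) in elems}

N = 8
iters = range(1, N)                                   # Index_set = 1..N-1
local = {1: set(range(1, 5)), 2: set(range(5, 9))}    # block distribution, P = 2
f = lambda i: i                                       # on clause: A[i].loc
g = lambda i: i + 1                                   # reference: A[i+1]

exec_1 = inv_image(f, iters, local[1])   # iterations run on processor 1
ref_1 = inv_image(g, iters, local[1])    # iterations where A[i+1] is local to 1

# Processor 1 runs i = 1..4, but i = 4 reads A[5], which processor 2 owns:
assert exec_1 - ref_1 == {4}
# Processor 2's iterations (i = 5..7) read only its own elements A[6..8]:
assert inv_image(f, iters, local[2]) <= inv_image(g, iters, local[2])
```

The single boundary iteration per processor is the typical pattern for nearest-neighbor references under block distributions: exec(p) − ref(p) is small relative to exec(p) ∩ ref(p), which is what makes overlapping communication with the local iterations worthwhile.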
If exec(p) ⊈ ref(p) for some p, then more analysis must be done to generate the
messages received and sent by each processor. For each pair of processors p and q we
must compute the sets in(p, q), the set of elements received by p from q, and out(p, q),
the set of elements sent from p to q. This can be done in two ways. The first uses the
ref(p) sets defined above. Here, we note that those sets cover the iteration space.
Thus, exec(p) can be divided into parts by its intersections exec(p) ∩ ref(q). Any
of these sets which is nonempty represents a region of iteration space executed on
processor p and accessing array elements on processor q. The sets of elements to be
received by p are g(exec(p) ∩ ref(q)) for all q; similarly, the sets of elements that p
must send are g(exec(q) ∩ ref(p)). The communications sets can therefore also be
defined as

    in(p,q) = g(exec(p) ∩ ref(q))
    out(p,q) = g(exec(q) ∩ ref(p))
The second, simpler way is to note that processor p can only access elements in
g(exec(p)). Since every element has a "home" processor, we can identify the sources
of these elements using the local functions. Every nonempty set g(exec(p)) ∩ local(q)
where q ≠ p represents a set of elements which processor p must receive as messages
from processor q. Conversely, every nonempty set g(exec(q)) ∩ local(p) represents a
set of elements that p must send to q. Thus, we can define

    in(p,q) = g(exec(p)) ∩ local(q)
    out(p,q) = g(exec(q)) ∩ local(p)
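This second formulation translates directly into set operations. The Python sketch below is our illustration, again for the block-distributed shift loop of Figure 1 with N = 8 elements on two processors; the exec sets are written out by hand rather than derived.

```python
def image(g, iter_set):
    """g applied pointwise: the elements touched by a set of iterations."""
    return {g(i) for i in iter_set}

local = {1: set(range(1, 5)), 2: set(range(5, 9))}   # block distribution of A[1..8]
exec_ = {1: set(range(1, 5)), 2: set(range(5, 8))}   # iterations 1..N-1, by owner of A[i]
g = lambda i: i + 1                                  # reference: A[i+1]

def in_set(p, q):    # elements p must receive from q
    return image(g, exec_[p]) & local[q]

def out_set(p, q):   # elements p must send to q
    return image(g, exec_[q]) & local[p]

assert in_set(1, 2) == {5}      # processor 1 needs A[5], stored on processor 2
assert out_set(2, 1) == {5}     # symmetrically, processor 2 sends A[5] to 1
assert in_set(2, 1) == set()    # the shift moves data in one direction only
```

Note that out_set(p, q) equals in_set(q, p) by construction, which is the identity the run-time analysis in Section 3.3 exploits to build the send sets from the receive sets.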
We can now describe the organization of the message passing code derived from
simple forall statements. Figure 3 shows this for the program fragment in Figure 1,
assuming only one reference R ≡ A[g(i)]. Only high-level pseudocode for the
computation on processor p is shown. Using the in and out sets, the processor sends
all its messages, performs the iterations which do not require nonlocal data, receives
all its messages, and finally performs the iterations requiring nonlocal data. These
sets can be computed at either compile-time or run-time. In the next subsection,
we characterize these two situations and then provide a detailed example requiring
run-time analysis.
3.2 Run-time Versus Compile-time Analysis
The major issue in applying the above model is the analysis required to compute
exec(p), ref(p), and their derived sets. It is clear that a naive approach to computing
these sets at run-time will lead to unacceptable performance, in terms of both speed
and memory usage. This overhead can be reduced by either doing the analysis at
compile-time or by careful optimization of the run-time code.
In some cases we can analyze the program at compile-time and precompute the
sets symbolically. Such an analysis requires the subscripts and data distribution
patterns to be of a form such that closed form expressions can be obtained for the
communications sets. If such an analysis is possible, no set computations need be done
at run-time. Instead, the expressions for the sets can be used directly. Compile-time
analysis, however, is only possible when the compiler has enough information about
the distribution function, local, and the subscripting functions f and gk to produce
simple formulas for the sets. In this paper we will not pursue this optimization;
interested readers are referred to [3], which gives some flavor of the analysis.
In many programs the exec(p) and ref(p) sets of a forall loop depend on the run-
time values of the variables involved. In such cases, the sets must be computed at
run-time. However, the impact of the overhead from this computation can be lessened
by noting that the variables controlling the communications sets often do not change
their values between repeated executions of the forall loop. Our run-time analysis
takes advantage of this by computing the exec(p) and ref(p) sets only the first time
they are needed and saving them for later loop executions. This amortizes the cost of
the run-time analysis over many repetitions of the forall, lowering the overall cost of
Code executed on processor p:

-- Sets used in message passing code
exec(p) = f⁻¹(local(p)) ∩ Index_set
ref(p) = g⁻¹(local(p))
in(p,q) = g(exec(p) ∩ ref(q))   for each q ∈ Proc − {p}
out(p,q) = g(exec(q) ∩ ref(p))  for each q ∈ Proc − {p}

-- Send messages to other processors
for each q ∈ Proc do
    if out(p,q) ≠ ∅ then send( q, out(p,q) ); end;
end;

-- Do local iterations
for each i ∈ exec(p) ∩ ref(p) do
    ... A[g(i)] ...
end;

-- Receive messages from other processors
for each q ∈ Proc do
    if in(p,q) ≠ ∅ then tmp[ in(p,q) ] := recv( q ); end;
end;

-- Do nonlocal iterations
for each i ∈ exec(p) − ref(p) do
    ... tmp[g(i)] ...
end;

Figure 3: Message passing pseudocode for Figure 1
processors Procs : array[ 1..P ] with P in 1..n;

var a, old_a : array[ 1..n ] of real dist by [ block ] on Procs;
    count : array[ 1..n ] of integer dist by [ block ] on Procs;
    adj : array[ 1..n, 1..4 ] of integer dist by [ block, * ] on Procs;
    coef : array[ 1..n, 1..4 ] of real dist by [ block, * ] on Procs;

-- code to set up arrays 'adj' and 'coef'

while ( not converged ) do
    -- copy mesh values
    forall i in 1..n on old_a[i].loc do
        old_a[i] := a[i];
    end;

    -- perform relaxation (computational core)
    forall i in 1..n on a[i].loc do
        var x : real;
        x := 0.0;
        for j in 1..count[i] do
            x := x + coef[i,j] * old_a[ adj[i,j] ];
        end;
        if ( count[i] > 0 ) then a[i] := x; end;
    end;

    -- code to check convergence
end;

Figure 4: Nearest-neighbor relaxation on an unstructured grid
the computation. This method is generally applicable and, if the forall is executed
frequently, acceptably efficient. The next section shows how this method can be
applied in a simple example.
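The caching idea itself is simple memoization. The following Python sketch is our own generic illustration of the pattern (the names and the version-counter convention are our assumptions, not part of Kali): the expensive set computation runs once, and later executions of the same forall reuse the saved result.

```python
# Hedged sketch: cache the communication schedule across forall executions.
_schedule_cache = {}

def get_schedule(loop_id, version, build):
    """Recompute the inspector sets only when `version` changes (e.g. a counter
    bumped whenever the distribution or the subscript arrays are modified)."""
    key = (loop_id, version)
    if key not in _schedule_cache:
        _schedule_cache[key] = build()   # run the expensive inspector once
    return _schedule_cache[key]

calls = []
sched = lambda: get_schedule("relax", 0, lambda: calls.append(1) or "sets")
assert sched() == "sets" and sched() == "sets"
assert len(calls) == 1                   # the inspector ran only the first time
```

Amortized over many repetitions of the forall, the one-time cost becomes the small "inspector overhead" measured in Section 4.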
3.3 Run-time Analysis
In this section we apply our analysis to the program in Figure 4. This models a simple
partial differential equation solver on a user-defined mesh. Arrays a and old_a store
values at nodes in the mesh, while array adj holds the adjacency list for the mesh
and coef stores algorithm-specific coefficients. This arrangement allows the solution
of PDEs on irregular meshes, and is quite common in practice. We will only consider
the computational core of the program, the second forall statement.
The reference to old_a[adj[i,j]] in this program creates a communications pattern
dependent on data (adj[i, j]) which cannot be fully analyzed by the compiler. Thus,
the ref(p) sets and the communications sets derived from them must be computed at
run-time. We do this by running a modified version of the forall called the inspector
before running the actual forall. The inspector only checks whether references to
distributed arrays are local. If a reference is local, nothing more is done. If the
reference is not local, a record of it and its "home" processor is added to a list of
elements to be received. This approach generates the in(p,q) sets and, as a side
effect, constructs the sets of local iterations (exec(p) ∩ ref(p)) and nonlocal iterations
(exec(p) - ref(p)). To construct the out(p,q) sets, we note that out(p,q) = in(q,p).
Thus, we need only route the sets to the correct processors. To avoid excessive
communications overhead we use a variant of Fox's Crystal router [2] which handles
such communications without creating bottlenecks. Once this is accomplished, we
have all the sets needed to execute the communications and computation of the
original forall, which are performed by the part of the program which we call the
executor. The executor consists of the two for loops shown in Figure 3 which perform
the local and nonlocal computations.
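The inspector's classification step can be sketched generically. The Python fragment below is our simplification, not the Kali implementation: it handles one distributed array and uses plain sets in place of the sorted range lists discussed below.

```python
def inspect(exec_p, subscripts, local, p):
    """Inspector: classify each iteration of the forall on processor p as
    local or nonlocal, recording which remote elements must be received.
    `subscripts(i)` yields the indices the body of iteration i will read."""
    local_list, nonlocal_list = [], []
    recv = {}                                  # in(p,q): owner -> elements needed
    owner = {e: q for q, elems in local.items() for e in elems}
    for i in exec_p:
        remote = [e for e in subscripts(i) if e not in local[p]]
        for e in remote:
            recv.setdefault(owner[e], set()).add(e)
        (nonlocal_list if remote else local_list).append(i)
    return local_list, nonlocal_list, recv

# Data-dependent subscripts, as with old_a[adj[i,j]] in Figure 4:
local = {1: {1, 2, 3}, 2: {4, 5, 6}}           # hypothetical block distribution
adj = {1: [2], 2: [1, 3], 3: [2, 4]}           # hypothetical adjacency lists
loc, nonloc, recv = inspect([1, 2, 3], lambda i: adj[i], local, p=1)
assert (loc, nonloc) == ([1, 2], [3])
assert recv == {2: {4}}                        # element 4 must come from processor 2
```

The out sets would then be built from these recv sets by the global exchange step (out(p,q) = in(q,p)), which this sketch omits.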
The representation of the in(p,q) and out(p,q) sets deserves mention, since this
representation has a large effect on the efficiency of the overall program. We represent
these sets as dynamically-allocated arrays of the record shown in Figure 5. Each
record contains the information needed to access one contiguous block of an array
stored on one processor. The first two fields identify the sending and receiving pro-
cessors. On processor p, the field from_proc will always be p in the out set and the
field to_proc will be p in the in set. The low and high fields give the lower and upper
bounds of the block of the array to be communicated. In the case of multi-dimensional
arrays, these fields are actually the offsets from the base of the array on the home
processor. To fill these fields, we assume that the home processors and element offsets
can be calculated by any processor; this assumption is justified for static distributions
such as we use. The final buffer field is a pointer to the communications buffer where
the range will be stored. This field is only used for the in set when a communicated
element is accessed. When the in set is constructed, it is sorted on the from_proc
field, with the low field serving as a secondary key. Adjacent ranges are combined
where possible to minimize the number of records needed. The global concatenation
process which creates the out sets sorts them on the to_proc field, again using low
record
    from_proc : integer;   -- sending processor
    to_proc   : integer;   -- receiving processor
    low       : integer;   -- lower bound of range
    high      : integer;   -- upper bound of range
    buffer    : ^real;     -- pointer to message buffer
end;

Figure 5: Representation of in and out sets
10
as the secondary key. If there are several arrays to be communicated, we can add a
symbol field identifying the array; this field then becomes the secondary sorting key,
and low becomes the tertiary key.
Our use of dynamically-allocated arrays was motivated by the desire to keep the
implementation simple while providing quick access to communicated array elements.
An individual element can be accessed by binary search in O(log r) time (where r is
the number of ranges), which is optimal in the general case here. Sorting by processor
id also allowed us to combine messages between the same two processors, thus saving
on the number of messages. Finally, the arrays allowed a simple implementation of
the concatenation process. The disadvantage of sorted arrays is the insertion time of
O(r) when the sets are built. In future implementations, we may replace the arrays
by binary trees or other data structures allowing faster insertion while keeping the
same access time.
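The sorted-range scheme and its O(log r) lookup can be sketched in Python using the standard `bisect` module. This is our illustration; the tuple layout mirrors the fields of Figure 5, and the sample data is hypothetical.

```python
import bisect

# Each record describes one contiguous block received from one processor,
# kept sorted by (from_proc, low) so adjacent ranges can be merged and
# messages between the same pair of processors combined.
recv_ranges = [                     # (from_proc, low, high, buffer)
    (2, 10, 12, [1.0, 1.1, 1.2]),
    (2, 40, 41, [4.0, 4.1]),
    (3, 15, 15, [9.9]),
]
lows = [(r[0], r[1]) for r in recv_ranges]   # sorted search keys

def lookup(from_proc, index):
    """Find the communicated copy of element `index` owned by `from_proc`,
    by binary search over the range records: O(log r) for r ranges."""
    k = bisect.bisect_right(lows, (from_proc, index)) - 1
    proc, low, high, buf = recv_ranges[k]
    assert proc == from_proc and low <= index <= high, "element not received"
    return buf[index - low]

assert lookup(2, 41) == 4.1
assert lookup(3, 15) == 9.9
```

Insertion into such a sorted array is O(r), which is the trade-off noted above; a balanced tree keyed the same way would bring insertion to O(log r) at the cost of a more involved implementation.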
The above approach is clearly a brute-force solution to the problem, and it is not
clear that the overhead of this computation will be low enough to justify its use.
As explained above, we can alleviate some of this overhead by observing that the
communications patterns in this forall will be executed repeatedly. The adj array
is not changed in the while loop, and thus the communications dependent on that
array do not change. This implies that we can save the in(p,q) and out(p,q) sets
between executions of the forall to reduce the run-time overhead.
Figure 6 shows a high-level description of the code generated by this run-time
analysis for the relaxation forall. Again, the figure gives pseudocode for processor p
only. In this case the communications sets must be calculated (once) at run-time.
The sets are stored as lists, implemented as explained above. Here, local_list stores
exec(p) ∩ ref(p); nonlocal_list stores exec(p) − ref(p); and recv_list and send_list
store the in(p, q) and out(p, q) sets, respectively. The statements in the first if state-
ment compute these sets by examining every reference made by the forall on proces-
sor p. As discussed above, this conditional is only executed once and the results saved
for future executions of the forall. The other statements are direct implementations
of the code in Figure 3, specialized to this example. The locality test in the nonlocal
computations loop is necessary because even within the same iteration of the forall,
the reference old_a[adj[i,j]] may be sometimes local and sometimes nonlocal. We
discuss the performance of this program in the next section.
4 Performance
To test the methods shown in Section 3, we implemented the run-time analysis in
the Kali compiler. The compiler produces C code for execution on the NCUBE/7
and iPSC/2 multiprocessors. We then compiled the program shown in Figure 4 using
various constants for the sizes of the arrays and ran the resulting programs for several
sizes of the hypercube, measuring the times for various sections of the codes.
Since our primary interest is unstructured grids, our program allows general adj
and coef arrays. However, in the tests here the grids used were simple rectangular
11
Code executed on processor p:

if ( first_time ) then                     -- Compute sets for later use
    local_list := nonlocal_list := send_list := recv_list := NIL;
    for each i ∈ local_a(p) do
        flag := true;
        for each j ∈ {1,2,...,count[i]} do
            if ( adj[i,j] ∉ local_old_a(p) ) then
                Add old_a[ adj[i,j] ] to recv_list;
                flag := false;
            end;
        end;
        if ( flag ) then Add i to local_list
        else Add i to nonlocal_list end;
    end;
    Form send_list using recv_lists from all processors
        (requires global communication)
end;

-- Send messages to other processors
for each msg ∈ send_list do
    send( msg );
end;

-- Do local iterations
for each i ∈ local_list do
    Original loop body
end;

-- Receive messages from other processors
for each msg ∈ recv_list do
    recv( msg ) and add contents to msg_list
end;

-- Do nonlocal iterations
for each i ∈ nonlocal_list do
    x := 0.0;
    for each j ∈ {1,2,...,count[i]} do
        if ( adj[i,j] ∈ local_old_a(p) ) then
            tmp := old_a[ adj[i,j] ];
        else
            tmp := Search msg_list for old_a[ adj[i,j] ]
        end;
        x := x + coef[i,j] * tmp;
    end;
    if ( count[i] > 0 ) then a[i] := x; end;
end;

Figure 6: Message passing pseudocode for Figure 4
Time (in seconds) for 100 sweeps over a 128 x 128 mesh

processors   total time   executor time   inspector time   inspector overhead
         2       246.07          244.04             2.03                 0.8%
         4       127.46          126.12             1.34                 1.1%
         8        68.38           67.28             1.10                 1.6%
        16        38.95           37.88             1.07                 2.7%
        32        24.36           23.21             1.15                 4.7%
        64        17.71           16.42             1.29                 7.3%
       128        12.64           11.19             1.45                11.5%

Figure 7: Performance of run-time analysis for varying number of processors on an
NCUBE/7
Time (in seconds) for 100 sweeps over 128 × 128 mesh
processors total time executor time inspector time inspector overhead
2 60.69 60.34 0.34 0.56%
4 31.20 31.02 0.18 0.57%
8 16.23 16.13 0.10 0.60%
16 8.88 8.82 0.06 0.64%
32 5.27 5.23 0.04 0.70%
Figure 8: Performance of run-time analysis for varying number of processors on an
iPSC/2
Time (in seconds) for 100 sweeps on 128 processors

  mesh size   total time   executor time   inspector time   inspector overhead   speedup
    64 x 64         4.97            3.56             1.38                27.8%      23.9
  128 x 128        12.64           11.19             1.45                11.5%      37.3
  256 x 256        34.13           32.52             1.61                 4.7%      55.2
  512 x 512        93.78           91.68             2.10                 2.2%      80.4
1024 x 1024       305.03          301.31             3.72                 1.2%      98.9

Figure 9: Performance of run-time analysis for varying problem size on an NCUBE/7
Time (in seconds) for 100 sweeps on 32 processors

  mesh size   total time   executor time   inspector time   inspector overhead   speedup
    64 x 64         1.88            1.86             0.02                0.85%      15.7
  128 x 128         5.27            5.23             0.04                0.70%      22.5
  256 x 256        17.65           17.54             0.11                0.62%      26.8
  512 x 512        65.17           64.79             0.38                0.58%      29.1
1024 x 1024       249.75          248.34             1.41                0.56%      30.3

Figure 10: Performance of run-time analysis for varying problem size on an iPSC/2
grids, on which we performed 100 Jacobi iterations with the standard five point
Laplacian. For this test problem, the optimal static domain decomposition is obvious,
so we did not have to cope with the added complication of load balancing strategies.
Except for issues of load balancing and domain decomposition, we ran the program
exactly as we would for an unstructured grid. The only significant difference is that
the node connectivity is higher for unstructured grids; nodes in a two dimensional
unstructured grid have six neighbors, on average, rather than the four assumed here.
Thus all costs, execution, inspection, and communication, would be somewhat higher
for an unstructured grid.
Figures 7 and 8 show how the execution time varies when the problem size remains
constant and the number of processors is increased. The inspector overhead is defined
here as the proportion of time spent in the inspector; that is, the overhead is the
inspector time divided by the total time. The tables show that the overhead from
the inspector is never very high; for the NCUBE it varies from less than 1% to about
12% of the total computation time, while on the iPSC it is always less than 1% of the
total. These numbers obviously depend on the number of iterations performed, since
the inspector is always executed only once. We assumed 100 iterations, since this is
typical of many numerical algorithms.
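Since the inspector runs once while the executor sweep repeats, the overhead fraction is t_insp / (t_insp + n · t_exec). The sketch below is our own illustration of this arithmetic, with per-sweep executor times derived from the 100-sweep figures in the tables:

```python
def inspector_overhead(t_insp, t_exec_per_sweep, n_sweeps):
    """Fraction of total time spent in the one-time inspector when
    the executor sweep is repeated n_sweeps times."""
    return t_insp / (t_insp + n_sweeps * t_exec_per_sweep)

# NCUBE/7, 128 processors, 128 x 128 mesh: inspector 1.45 s,
# executor 11.19 s per 100 sweeps, i.e. ~0.112 s per sweep.
amortized = inspector_overhead(1.45, 11.19 / 100, 100)  # ~0.115 over 100 sweeps
single    = inspector_overhead(1.45, 11.19 / 100, 1)    # ~0.93 for one sweep
```

The same formula reproduces the 45% single-sweep figure on 2 NCUBE processors (inspector 2.03 s against one 2.44 s sweep).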
For some problems, there are numerical algorithms requiring fewer relaxation it-
erations. Such algorithms tend to be much more complex, requiring incomplete LU
factorizations or multigrid techniques, and we suspect our approach would be less
useful in such cases. In the worst case, where one performs only one sweep, the in-
spector overhead on the NCUBE would range from 45% on 2 processors to 93% on
128 processors, while on the iPSC it ranges from 35% to 41%. These numbers illus-
trate the importance of saving inspector information to avoid recomputation. They
also suggest that with this kind of hardware/software environment, algorithm choices
might shift in favor of simpler algorithms with more repetitive inner loops.
Figure 7 also shows how the time taken by the inspector varies. As can be seen,
the time for the inspector starts high, decreases to a minimum at 16 processors, and
then increases slowly. This behavior can be explained by the structure of the inspector
itself. It consists of two phases: the loop identifying nonlocal array references, and
the global communication phase to build the receive lists. The time to execute the
loop is proportional to the number of array references performed, and thus in this case
inversely proportional to the number of processors. The global communications phase,
on the other hand, requires time proportional to the dimension of the hypercube, and
thus is logarithmic in the number of processors. When there are few processors,
the inspector time is dominated by the array reference loop, and is thus inversely
proportional to the number of processors. However, as more processors are added,
the increasing time for the communications phase eventually overtakes the decreasing
loop time and the total time begins to rise.
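The two-phase cost described above can be summarized by a simple model, T(P) ≈ a/P + b·log₂ P, where a reflects the per-reference locality check and b the per-stage cost of the global combine. The constants below are illustrative, not measured values; with a suitable a/b ratio the model reproduces the observed shape, including a minimum near 16 processors:

```python
import math

def inspector_time(p, a, b):
    """Model: the locality-checking loop scales as a/p (work divides among
    processors); the global combine scales as b*log2(p) (hypercube stages)."""
    return a / p + b * math.log2(p)

# Illustrative constants only: time falls with p at first, reaches a
# minimum, then rises slowly as the combine phase takes over.
times = {p: inspector_time(p, a=4.0, b=0.15) for p in (2, 4, 8, 16, 32, 64, 128)}
best = min(times, key=times.get)  # minimum at an intermediate machine size
```

On a machine with cheaper small messages (a larger a/b ratio, as on the iPSC), the minimum moves out beyond the available machine sizes, which is consistent with the monotone inspector times in Figure 8.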
This behavior is not seen in Figure 8 because the locality-checking loop always
dominates the computation on the iPSC. For sufficiently many processors the com-
munication phase would also become significant there. In general, the iPSC inspector
overheads are much less than those of the NCUBE. This appears to be primarily due
to the relatively lower cost of communications for small messages on the iPSC. The
higher small-message cost on the NCUBE increases the cost of the global combining
in the inspector, thus increasing its overhead relative to the iPSC.
Figures 9 and 10 keep the number of processors constant and vary the problem
size. Inspector overhead is defined as above, while speedup is given relative to the
executor time on one processor. This represents the closest measurement we have to
an optimal sequential program, since it does not include any overhead for either the
inspector or for communication. As can be seen, the inspector overhead decreases
and the speedup increases as the problem size is increased. The decrease
in inspector overhead can again be explained by the structure of the inspector. As
the problem size increases, the number of iterations of the locality-checking loop also
increases, making that phase of the inspector more dominant in the total inspector
time. Thus, our inspector-executor code organization can be expected to scale well as
problem size increases. The increases in speedup reflect decreasing overheads in the
executor loop. Our parallel programs have two overheads associated with nonlocal
references: the cost of sending and receiving data in messages, and data structure
overhead from the searches for nonlocal array elements. Any program written for a
distributed memory machine will have the communications overhead; however, the
search overhead is unique to our system. This search overhead is primarily responsible
for suboptimal speedups. Also, this overhead is much less for the iPSC than for the
NCUBE, probably because of the faster procedure calls on the iPSC. We are working
both to analyze these results and to improve the data structure performance on both
machines.
5 Related Work
There are many other projects concerned with compiling programs for nonshared
memory parallel machines. Three in particular break away from the message passing
paradigm and are thus closely related to our work.
Kennedy and his coworkers [1] compile programs for distributed memory by first
creating a version which computes its communications at run-time. They then use
standard compiler transformations such as constant propagation and loop distribution
to optimize this version into a form much like ours. Their optimizations appear to fail
in our run-time analysis cases. If significant compile-time optimizations are possible,
their results appear to be similar to our compile-time analysis in [3]. We extend their
work in our run-time analysis by saving information on repeated communications
patterns. It is not obvious how such information saving could be incorporated into
their method without devising new compiler transformations. We also provide a more
top-down approach to analyzing the communications, while their optimizations can
be characterized as bottom-up.
Rogers and Pingali [8] suggest run-time resolution of communications for the func-
tional language Id Nouveau. They do not attempt to save information between execu-
tions of their parallel constructs, however. Because the information is not saved, they
label run-time resolution as "fairly inefficient" and concentrate on optimizing special
cases. These cases appear to correspond roughly to our compile-time analysis. We ex-
tend their work by saving the communications information between forall executions
and by providing a common framework for run-time and compile-time resolution.
Saltz et al. [9] compute data-dependent communications patterns in a preprocessor,
producing schedules for each processor to execute later. This preprocessing is done
off-line, although they are currently integrating this with the actual computation as
is done in our system. Their execution schedules also take into account inter-
iteration dependencies, something not necessary in our system since we currently
start with completely parallel loops. They do not give any performance figures for
their preprocessor, although they do note that given its "relatively high" complexity,
parallelization will be required in any practical system. Saving the information about
forall communications between executions is very similar between our two works. A
major difference from our work is that they explicitly enumerate all array references
(local and nonlocal) in a "list". This eliminates the overhead of checking and searching
for nonlocal references during the loop execution, but requires more storage than our
implementation. We also differ in that we consider compile-time optimizations, which
they do not attempt.
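The storage/time trade-off just described can be made concrete: enumerating every reference in a schedule replaces a per-reference ownership test and hash search with a direct pre-resolved gather, at the cost of storing one entry per reference. The sketch below is our own illustration of that idea (all names are hypothetical), not Saltz et al.'s implementation:

```python
def build_schedule(my_indices, adj, owner, my_rank):
    """Preprocess: enumerate every reference, local and nonlocal, once,
    resolving each to the table it will be read from at execution time."""
    schedule = []
    for i in my_indices:
        refs = [('local' if owner[j] == my_rank else 'recv', j)
                for j in adj[i]]
        schedule.append((i, refs))
    return schedule

def run_schedule(schedule, coef, old_a, recv_table):
    """Execute with no ownership checks: every reference is pre-resolved,
    so the inner loop is a straight gather plus multiply-accumulate."""
    a = {}
    for i, refs in schedule:
        a[i] = sum(coef[(i, j)] * (old_a[j] if src == 'local' else recv_table[j])
                   for src, j in refs)
    return a
```

The schedule here stores one tuple per array reference, whereas a search-based executor stores only the nonlocal entries, which is the storage difference noted above.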
6 Conclusions
Current programming environments for distributed memory architectures provide lit-
tle support for mapping applications to the machine. In particular, the lack of a
global name space implies that the algorithms have to be specified at a relatively
low level. This greatly increases the complexity of programs, and also hard wires the
algorithm choices, inhibiting experimentation with alternative approaches.
In this paper, we described an environment which allows the user to specify al-
gorithms at a higher level. By providing a global name space, our system allows the
user to specify data parallel algorithms in a more natural manner. The user needs to
make only minimal additions to a high level "shared memory" style specification of
the algorithm for execution in our system; the low level details of message-passing,
local array indexing, and so forth are left to the compiler. Our system performs these
transformations automatically, producing relatively efficient executable programs.
The fundamental problem in mapping a global name space onto a distributed
memory machine is generation of the messages necessary for communication of non-
local values. In this paper, we presented a framework which can systematically and
automatically generate these messages, using either compile time or run time analysis
of communication patterns. Here we concentrated on the more general (but
less efficient) case of run-time analysis. Our run-time analysis generates messages
by performing an inspector loop before the main computation, which records any
nonlocal array references. The executor loop subsequently uses this information to
transmit information efficiently while performing the actual computation.
Report Documentation Page

Report No.: NASA CR-181981 (ICASE Report No. 90-7)
Title and Subtitle: Supporting Shared Data Structures on Distributed Memory Architectures
Authors: Charles Koelbel, Piyush Mehrotra, John Van Rosendale
Performing Organization: Institute for Computer Applications in Science and Engineering,
Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665-5225
Sponsoring Agency: National Aeronautics and Space Administration,
Langley Research Center, Hampton, VA 23665-5225
Report Date: January 1990
Performing Organization Report No.: 90-7
Work Unit No.: 505-90-21-01
Contract No.: NAS1-18605
Type of Report and Period Covered: Contractor Report
Supplementary Notes: Langley Technical Monitor: Richard W. Barnwell. Final Report.
To appear in the Proceedings of the 2nd SIGPLAN Symposium on Principles and
Practice of Parallel Programming, March 1990.
Key Words: distributed memory architectures; language constructs; runtime optimizations
Distribution Statement: Unclassified - Unlimited. Subject Categories: 59 - Mathematical
and Computer Sciences (General); 61 - Computer Programming and Software
Security Classification (of this report and this page): Unclassified. No. of pages: 19. Price: A03
The inspector is clearly an expensive operation. However, if one amortizes the cost
of the inspector over the entire computation, it turns out to be relatively inexpensive
in many cases. This is especially true in cases where the computation is an iterative
loop executed a large number of times.
The other issue affecting the overhead of our system is the extra cost incurred
throughout the computation by the new data structures used. This is a serious issue,
but one on which we have only preliminary results. In future work, we plan to give
increased attention to these overhead issues, refining both our run-time environment
and language constructs. We also plan to look at more complex example programs,
including those requiring dynamic load balancing, to better understand the relative
usability, generality, and efficacy of this approach.
References
[1] D. Callahan and K. Kennedy. Compiling programs for distributed-memory multipro-
cessors. Journal of Supercomputing, 2:151-169, 1988.
[2] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems
on Concurrent Processors, Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1986.
[3] C. Koelbel and P. Mehrotra. Compiler transformations for non-shared memory ma-
chines. In Proceedings of the 4th International Conference on Supercomputing, volume 1,
pages 390-397, May 1989.
[4] P. Mehrotra. Programming parallel architectures: The BLAZE family of languages. In
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Com-
puting, pages 289-299, December 1987.
[5] P. Mehrotra and J. Van Rosendale. The BLAZE language: A parallel language for
scientific programming. Parallel Computing, 5:339-361, 1987.
[6] P. Mehrotra and J. Van Rosendale. Compiling high level constructs to distributed mem-
ory architectures. In Proceedings of the Fourth Conference on Hypercube Concurrent
Computers and Applications, March 1989.
[7] P. Mehrotra and J. Van Rosendale. Parallel language constructs for tensor product
computations on loosely coupled architectures. In Proceedings Supercomputing '89, pages
616-626, November 1989.
[8] A. Rogers and K. Pingali. Process decomposition through locality of reference. In
Conference on Programming Language Design and Implementation, pages 69-80. ACM
SIGPLAN, June 1989.
[9] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and
execution of loops on message passing machines. Journal of Parallel and Distributed
Computing, April 1990. (To appear).