NASA Contractor Report 181982
ICASE Report No. 90-7

SUPPORTING SHARED DATA STRUCTURES ON
DISTRIBUTED MEMORY ARCHITECTURES

Charles Koelbel
Piyush Mehrotra
John Van Rosendale

Contract No. NAS1-18605
January 1990

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, Virginia 23665-5225

Operated by the Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23665-5225
Supporting Shared Data Structures on
Distributed Memory Architectures*
Charles Koelbel
Department of Computer Sciences
Purdue University
West Lafayette, IN 47907.
Piyush Mehrotra†
John Van Rosendale
ICASE, NASA Langley Research Center
Hampton, VA 23665.
Abstract
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece "owned" by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. This paper presents a new programming environment for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. We describe the analysis and program transformations required to implement this environment, and present the efficiency of the resulting code on the NCUBE/7 and iPSC/2 hypercubes.

*Research supported by the Office of Naval Research under contract ONR N00014-88-M-0108, and by the National Aeronautics and Space Administration under NASA contract NAS1-18605 while the authors were in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665.
†On leave from the Department of Computer Sciences, Purdue University, West Lafayette, IN 47907.
1 Introduction
Distributed memory architectures promise to provide very high levels of performance
for scientific applications at modest costs. However, they are extremely awkward to
program. The programming languages currently available for such machines directly
reflect the underlying hardware in the same sense that assembly languages reflect the
registers and instruction set of a microprocessor.
The basic issue is that programmers tend to think in terms of manipulating large
data structures, such as grids, matrices, etc. In contrast, in current message-passing
languages each process can access only the local address space of the processor on
which it is executing. Thus the programmer must decompose each data structure into
a collection of pieces, each piece being "owned" by a single process. All interactions
between different parts of the data structure must then be explicitly specified using
the low-level message-passing constructs supported by the language.
Decomposing all data structures in this way, and specifying communication ex-
plicitly can be extraordinarily complicated and error prone. However, there is also
a more subtle problem here. Since the partitioning of the data structures across the
processors must be done at the highest level of the program, and each operation on
these distributed data structures turns into a sequence of "send" and "receive"
operations intricately embedded in the code, programs become highly inflexible. This
makes the parallel program not only difficult to design and debug, but also "hard
wires" all algorithm choices, inhibiting exploration of alternatives.
In this paper we present a programming environment, called Kali*, which is de-
signed to simplify the problem of programming distributed memory architectures.
Kali provides a software layer supporting a global name space on distributed mem-
ory architectures. The computation is specified via a set of parallel loops using this
global name space exactly as one does on a shared memory architecture. The dan-
ger here is that since true shared memory does not exist, one might easily sacrifice
performance. However, by requiring the user to explicitly control data distribution
and load balancing, we force awareness of those issues critical to performance on non-
shared memory architectures. In effect, we acquire the ease of programmability of the
shared memory model, while retaining the performance characteristics of nonshared
memory architectures.

In Kali, one specifies parallel algorithms in a high-level, distribution-independent
manner. The compiler then analyzes this high-level specification and transforms it
into a system of interacting tasks, which communicate via message-passing. This
approach allows the programmer to focus on high-level algorithm design and perfor-
mance issues, while relegating the minor but complex details of interprocessor com-
munication to the compiler and run-time environment. Preliminary results suggest
that the performance of the resulting message-passing code is in many cases virtually
identical to that which would be achieved had the user programmed directly in a
*Kali is the name of a Hindu goddess of creation and destruction who possesses multiple arms,
embodying the concept of parallel work.
processors Procs : array[ 1..P ] with P in 1..max_procs;

var A : array[ 1..N ] of real dist by [ block ] on Procs;
    B : array[ 1..N, 1..M ] of real dist by [ cyclic, * ] on Procs;

forall i in 1..N-1 on A[i].loc do
    A[i] := A[i+1];
end;

Figure 1: Kali language primitives
message-passing language.

The remainder of this paper is organized as follows. Section 2 describes Kali,
the language in which we have implemented our ideas. Section 3 presents the analysis
needed to map a Kali program onto a nonshared memory architecture. If enough
information is available, the compiler can perform this analysis at compile-time. Otherwise
the compiler produces run-time code to generate the required information. We
close this section with an example illustrating the latter situation. Section 4 shows the
performance achieved by the sample program on the NCUBE/7 and iPSC/2. Finally,
Section 5 compares our work with other groups, and Section 6 gives our conclusions.
2 Kali Language Primitives
The goal of our approach is to allow programmers to treat distributed data structures
as single objects. We assume the user is designing data-parallel algorithms, which can
be specified as parallel loops. Our system then translates this high level specification
into an SPMD-style program which can execute efficiently on a distributed memory
architecture. In our approach, the progro:mmer must specify three things:
a) The processor topology on which the program iS to be executed
b) The distribution of the data structures across these processors =
c) The parallel loops and where they are to be executed
By specifying these items, the user retains control over aspects of the program critical
to performance, such as data distributions and load balancing ......
The following subsections describe each of these specifications. Figure 1 gives an
example of these declarations in Kali, a Pascal-like language we created as a testbed
for these techniques [4, 6]. These primitives can as easily be added to FORTRAN, as
described in [7], or any other sequential language.
2.1 Processor Arrays
The first thing that needs to be specified is a "processor array." This is an array
of physical processors across which the data structures will be distributed, and on
which the algorithm will execute. The processors line in Figure 1 declares this array.
This particular declaration allocates a one-dimensional array Procs of P processors,
where P is an integer constant between 1 and max_procs dynamically chosen by
the run-time system. (Our current implementation chooses the largest feasible P;
future implementations might use fewer processors to improve granularity or for other
reasons.) Multi-dimensional processor arrays can be declared similarly.
This construct provides a "real estate agent," as suggested by C. Seitz. Allowing
the size of the processor array to be dynamically chosen is important here, since it
provides portability and avoids dead-lock in case fewer processors are available than
expected. The basic assumption is that the underlying architecture can support
multi-dimensional arrays of physical processors, an assumption natural for hypercubes
and mesh-connected architectures.
2.2 Defining a Distribution Pattern
Given a processor array, the programmer must specify the distribution of data struc-
tures across the array. Currently the only distributed data type supported is dis-
tributed arrays. Array distributions are specified by a distribution clause in their
declaration. This clause specifies a sequence of distribution patterns, one for each
dimension of the array. Scalar variables and arrays without a distribution clause are
simply replicated, with one copy assigned to each of the processes.
Mathematically, the distribution pattern of an array can be defined as a function
from processors to sets of array elements. If Proc is the set of processors and Arr
the set of array elements, then we define

    local : Proc → 2^Arr

as the function giving, for each processor p, the subset of Arr which p stores locally.
In this paper we will assume that the sets of local elements are disjoint; that is, if
p ≠ q then local(p) ∩ local(q) = ∅. This reflects the practice of storing only one copy
of each array element. We also make the convention that collections of processors
and array elements are represented by their index sets, which we take to be vectors
of integers.
Kali provides notations for the most common distribution patterns. Once the
processor array Procs is declared, data arrays can be distributed across it using dist
clauses in the array declarations, also shown in Figure 1. Array A is distributed by
blocks, giving it a local function of

    local_A(p) = { i | (p−1)·⌈N/P⌉ < i ≤ p·⌈N/P⌉ }

This assigns a contiguous block of array elements to each processor. Array B has its
rows cyclically distributed; its local is

    local_B(p) = { (i,j) | i ≡ p (mod P) }

Here, if P were 10, processor 1 would store elements in rows 1, 11, 21, and so on,
while processor 10 would store rows which were multiples of 10. Kali also supports
block-cyclic distributions and provides a mechanism for user-defined distributions.
The number of dimensions of an array that are distributed must match the number
of dimensions of the underlying processor array. Asterisks are used to indicate
dimensions of data arrays which are not distributed, as in the case of B in Figure 1.
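For concreteness, the two local functions above can be stated operationally. The following Python sketch is our illustration, not Kali code; the function names and the ceiling-based block convention are our assumptions, with 1-based indices as in the paper.

```python
def local_block(p, N, P):
    """Elements owned by processor p (1..P) under a block distribution:
    a contiguous chunk of ceil(N/P) elements; the last chunk may be short."""
    b = -(-N // P)  # ceiling of N/P
    return set(range((p - 1) * b + 1, min(p * b, N) + 1))

def local_cyclic(p, N, P):
    """Elements owned by processor p under a cyclic distribution:
    indices congruent to p modulo P."""
    return {i for i in range(1, N + 1) if i % P == p % P}

# With N = 10 and P = 2, processor 1 owns the first block / the odd indices:
assert local_block(1, 10, 2) == {1, 2, 3, 4, 5}
assert local_cyclic(1, 10, 2) == {1, 3, 5, 7, 9}
# The local sets are disjoint, as assumed in the text:
assert local_block(1, 10, 2).isdisjoint(local_block(2, 10, 2))
```

Any distribution expressible as such a function could in principle be plugged in, which is what the user-defined distributions mentioned above would amount to.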
2.3 Forall Loops
Operations on distributed data structures are specified by forall loops. The forall
loop here is similar to that in BLAZE [5]. The example in Figure 1 shows a loop
which performs N - 1 loop invocations, shifting the values in the array A one space
to the left. The semantics here are "copy-in copy-out," in the sense that the values
on the right hand side of the assignment are the old values in array A, before being
modified by the loop. Thus the array A is effectively "copied into" each invocation
of the forall loop, and then the changes are "copied out."
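The copy-in copy-out rule is easy to demonstrate outside Kali; the few lines of Python below are our illustration of the semantics, not Kali syntax, and show why the left shift is independent of the order in which iterations run.

```python
# Kali's forall reads old values: every A[i] := A[i+1] sees A as it was
# before the loop, so the array shifts left by one regardless of iteration order.
A = [10, 20, 30, 40]
old = list(A)                 # "copy in": snapshot of A before the loop
for i in range(len(A) - 1):   # the forall's iterations, in any order
    A[i] = old[i + 1]
# A is now [20, 30, 40, 40].
```

Without the snapshot, running the same loop right-to-left in place would instead propagate A[3] into every slot, which is exactly the order-dependence the copy-in copy-out semantics rules out.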
In addition to the range specification in the header of the forall there is an
on clause. This clause specifies the processor on which each loop invocation is to
be executed. In the above program fragment, the on clause causes the ith loop
invocation to be executed on the processor owning the ith element of the array A.
Although this is the most common use of the on clause, it is also possible to name
the processor directly by indexing into the processor array.
2.4 Global Name Space
Given the processors, dist, and forall primitives, a programmer can specify a data
parallel algorithm at a high level, while still retaining control over those details critical
to performance. For example, the code fragment in Figure 4 in Section 3 shows a
typical numerical computation. It is important to note that there are no message
passing statements in either that program or Figure 1; instead, the programmer can
view the program as operating within a global name space. The compiler analyzes
the program and produces the low level details of the message passing code required
to support the sharing of data on the distributed memory machines.
The support of a shared memory model provides a distinct advantage over message
passing languages; in those languages, communications statements often substantially
increase the program size and complexity [2]. The global name space model used here
allows the bodies of the forall loops to be independent of the distribution of the data
and processor arrays used. If only local name spaces were supported, this would not be
forall i ∈ Index_set on A[f(i)].loc do
    ... R1 ...
    ... R2 ...
      ...
    ... Rn ...
end;

Figure 2: Pseudocode loop for subscript analysis
the case, since the communications necessary to implement two distribution patterns
would be quite different. With our primitives a variety of distribution patterns can
easily be tried by trivial modification of this program. Such a modification in a
message passing language would involve extensive rewriting of the communications
statements. Thus, Kali allows programming at a higher level of abstraction, since the
programmer can focus on the general algorithm rather than the machine-dependent
details of its implementation.
3 Analysis of the Program
Given a Kali program written using the distribution patterns and forall loops de-
scribed above, the compiler must generate code that implements the message passing
necessary to run the program on a nonshared memory machine. This entails an anal-
ysis of the subscripts of array references to determine which ones may cause access to
nonlocal elements. We will describe such an analysis in this section and then discuss
how it can be efficiently accomplished.
3.1 General Outline of the Analysis
The type of loop we are considering has the form shown in Figure 2. Iteration i of
the loop is executed on the processor storing A[f(i)]. In many cases, f will be the
identity function, but we allow other functions for generality. Each Rk represents an
array reference of the form

    Rk ≡ A[gk(i)]
For simplicity, we will assume here that only one array A is referenced. The general
case of multiple arrays does not alter the goals of the analysis, although it may
complicate the analysis itself if the arrays have different distribution patterns. The
gk functions may depend on other program variables, so long as those variables are
invariant during the execution of the forall loop.
The set of iterations executed on processor p, denoted by exec(p), is determined
by the on clause associated with the forall loop. For example, in Figure 2, because
of the on clause "A[f(i)].loc", this set consists of the iterations i such that A[f(i)] is
local to processor p. We define this set mathematically as

    exec(p) = f⁻¹(local(p))
where local is the distribution function associated with array A. Each processor p
will execute every iteration in exec(p) which is in the forall's range, that is, the
intersection of the range with exec(p). In the loop of Figure 2, for example, processor p
will execute all iterations in Index_set ∩ f⁻¹(local(p)). This intersection is often equal
to exec(p) except for boundary conditions; the name exec(p) was chosen to reflect
this close association. For simplicity, in this paper we will assume that p executes
exactly the iterations in exec(p), as is generally the case. In cases where this is not
true, it is generally only necessary to intersect Index_set with exec(p) in the following
equations.
We first identify the forall iterations that can cause nonlocal array references.
There are two reasons for doing this: local accesses may be more amenable to opti-
mization than general accesses, and we can overlap communication with computation
in iterations that access only local array elements. For each processor p and
reference R ≡ A[g(i)] we define the set

    ref(p) = g⁻¹(local(p))

This is the subset of the (unbounded) iteration space where R is always a local
reference. Note that iterations in exec(p) ∩ ref(p) are executed on processor p and
access only p's local memory. Thus, if exec(p) ⊆ ref(p) then the reference R can
always be satisfied locally on processor p. Otherwise, any element a such that a ∈
exec(p) but a ∉ ref(p) represents an iteration on p that may reference an array
element not on p; this element must be communicated to p via messages. In other
words, iterations in exec(p) − ref(p) cause nonlocal accesses on p. The first stage of
the analysis therefore finds ref(p) for each reference R and processor p and determines
how they intersect with the loop range sets exec(p).
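For the shift loop of Figure 1, where f(i) = i and g(i) = i+1, these sets can be enumerated directly. The Python sketch below is our illustration (explicit set enumeration rather than the closed forms a compiler would derive), assuming a block distribution of N = 8 elements over two processors.

```python
# f^-1 restricted to the loop's range: the iterations whose image lands in a set.
def inv_image(fn, iters, elems):
    return {i for i in iters if fn(i) in elems}

N = 8
iters = range(1, N)                                   # Index_set = 1..N-1
local = {1: set(range(1, 5)), 2: set(range(5, 9))}    # block distribution, P = 2
f = lambda i: i                                       # on clause: A[i].loc
g = lambda i: i + 1                                   # reference: A[i+1]

exec_1 = inv_image(f, iters, local[1])   # iterations run on processor 1
ref_1 = inv_image(g, iters, local[1])    # iterations where A[i+1] is local to 1

# Processor 1 runs i = 1..4, but i = 4 reads A[5], which processor 2 owns:
assert exec_1 - ref_1 == {4}
# Processor 2's iterations (i = 5..7) read only its own elements A[6..8]:
assert inv_image(f, iters, local[2]) <= inv_image(g, iters, local[2])
```

The single boundary iteration per processor is the typical pattern for nearest-neighbor references under block distributions: exec(p) − ref(p) is small relative to exec(p) ∩ ref(p), which is what makes overlapping communication with the local iterations worthwhile.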
If exec(p) ⊈ ref(p) for some p, then more analysis must be done to generate the
messages received and sent by each processor. For each pair of processors p and q we
must compute the sets in(p, q), the set of elements received by p from q, and out(p, q),
the set of elements sent from p to q. This can be done in two ways. The first uses the
ref(p) sets defined above. Here, we note that those sets cover the iteration space.
Thus, exec(p) can be divided into parts by its intersections exec(p) ∩ ref(q). Any
of these sets which is nonempty represents a region of iteration space executed on
processor p and accessing array elements on processor q. The sets of elements to be
received by p are g(exec(p) ∩ ref(q)) for all q; similarly, the sets of elements that p
must send are g(exec(q) ∩ ref(p)). The communications sets can therefore also be
defined as

    in(p,q) = g(exec(p) ∩ ref(q))
    out(p,q) = g(exec(q) ∩ ref(p))
The second, simpler way is to note that processor p can only access elements in
g(exec(p)). Since every element has a "home" processor, we can identify the sources
of these elements using the local functions. Every nonempty set g(exec(p)) ∩ local(q)
where q ≠ p represents a set of elements which processor p must receive as messages
from processor q. Conversely, every nonempty set g(exec(q)) ∩ local(p) represents a
set of elements that p must send to q. Thus, we can define

    in(p,q) = g(exec(p)) ∩ local(q)
    out(p,q) = g(exec(q)) ∩ local(p)
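This second formulation translates directly into set operations. The Python sketch below is our illustration, again for the block-distributed shift loop of Figure 1 with N = 8 elements on two processors; the exec sets are written out by hand rather than derived.

```python
def image(g, iter_set):
    """g applied pointwise: the elements touched by a set of iterations."""
    return {g(i) for i in iter_set}

local = {1: set(range(1, 5)), 2: set(range(5, 9))}   # block distribution of A[1..8]
exec_ = {1: set(range(1, 5)), 2: set(range(5, 8))}   # iterations 1..N-1, by owner of A[i]
g = lambda i: i + 1                                  # reference: A[i+1]

def in_set(p, q):    # elements p must receive from q
    return image(g, exec_[p]) & local[q]

def out_set(p, q):   # elements p must send to q
    return image(g, exec_[q]) & local[p]

assert in_set(1, 2) == {5}      # processor 1 needs A[5], stored on processor 2
assert out_set(2, 1) == {5}     # symmetrically, processor 2 sends A[5] to 1
assert in_set(2, 1) == set()    # the shift moves data in one direction only
```

Note that out_set(p, q) equals in_set(q, p) by construction, which is the identity the run-time analysis in Section 3.3 exploits to build the send sets from the receive sets.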
We can now describe the organization of the message passing code derived from
simple forall statements. Figure 3 shows this for the program fragment in Figure 1,
assuming only one reference R ≡ A[g(i)]. Only high-level pseudocode for the
computation on processor p is shown. Using the in and out sets, the processor sends
all its messages, performs the iterations which do not require nonlocal data, receives
all its messages, and finally performs the iterations requiring nonlocal data. These
sets can be computed at either compile-time or run-time. In the next subsection,
we characterize these two situations and then provide a detailed example requiring
run-time analysis.
3.2 Run-time Versus Compile-time Analysis
The major issue in applying the above model is the analysis required to compute
exec(p), ref(p), and their derived sets. It is clear that a naive approach to computing
these sets at run-time will lead to unacceptable performance, in terms of both speed
and memory usage. This overhead can be reduced by either doing the analysis at
compile-time or by careful optimization of the run-time code.
In some cases we can analyze the program at compile-time and precompute the
sets symbolically. Such an analysis requires the subscripts and data distribution
patterns to be of a form such that closed form expressions can be obtained for the
communications sets. If such an analysis is possible, no set computations need be done
at run-time. Instead, the expressions for the sets can be used directly. Compile-time
analysis, however, is only possible when the compiler has enough information about
the distribution function, local, and the subscripting functions f and gk to produce
simple formulas for the sets. In this paper we will not pursue this optimization;
interested readers are referred to [3], which gives some flavor of the analysis.
In many programs the exec(p) and ref(p) sets of a forall loop depend on the run-
time values of the variables involved. In such cases, the sets must be computed at
run-time. However, the impact of the overhead from this computation can be lessened
by noting that the variables controlling the communications sets often do not change
their values between repeated executions of the forall loop. Our run-time analysis
takes advantage of this by computing the exec(p) and ref(p) sets only the first time
they are needed and saving them for later loop executions. This amortizes the cost of
the run-time analysis over many repetitions of the forall, lowering the overall cost of
Code executed on processor p:

-- Sets used in message passing code
exec(p) = f⁻¹(local(p)) ∩ Index_set
ref(p) = g⁻¹(local(p))
in(p,q) = g(exec(p) ∩ ref(q))   for each q ∈ Proc − {p}
out(p,q) = g(exec(q) ∩ ref(p))  for each q ∈ Proc − {p}

-- Send messages to other processors
for each q ∈ Proc do
    if out(p,q) ≠ ∅ then send( q, out(p,q) ); end;
end;

-- Do local iterations
for each i ∈ exec(p) ∩ ref(p) do
    ... A[g(i)] ...
end;

-- Receive messages from other processors
for each q ∈ Proc do
    if in(p,q) ≠ ∅ then tmp[ in(p,q) ] := recv( q ); end;
end;

-- Do nonlocal iterations
for each i ∈ exec(p) − ref(p) do
    ... tmp[g(i)] ...
end;

Figure 3: Message passing pseudocode for Figure 1
processors Procs : array[ 1..P ] with P in 1..n;

var a, old_a : array[ 1..n ] of real dist by [ block ] on Procs;
    count : array[ 1..n ] of integer dist by [ block ] on Procs;
    adj : array[ 1..n, 1..4 ] of integer dist by [ block, * ] on Procs;
    coef : array[ 1..n, 1..4 ] of real dist by [ block, * ] on Procs;

-- code to set up arrays 'adj' and 'coef'

while ( not converged ) do
    -- copy mesh values
    forall i in 1..n on old_a[i].loc do
        old_a[i] := a[i];
    end;

    -- perform relaxation (computational core)
    forall i in 1..n on a[i].loc do
        var x : real;
        x := 0.0;
        for j in 1..count[i] do
            x := x + coef[i,j] * old_a[ adj[i,j] ];
        end;
        if ( count[i] > 0 ) then a[i] := x; end;
    end;

    -- code to check convergence
end;

Figure 4: Nearest-neighbor relaxation on an unstructured grid
the computation. This method is generally applicable and, if the forall is executed
frequently, acceptably efficient. The next section shows how this method can be
applied in a simple example.
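The caching idea itself is simple memoization. The following Python sketch is our own generic illustration of the pattern (the names and the version-counter convention are our assumptions, not part of Kali): the expensive set computation runs once, and later executions of the same forall reuse the saved result.

```python
# Hedged sketch: cache the communication schedule across forall executions.
_schedule_cache = {}

def get_schedule(loop_id, version, build):
    """Recompute the inspector sets only when `version` changes (e.g. a counter
    bumped whenever the distribution or the subscript arrays are modified)."""
    key = (loop_id, version)
    if key not in _schedule_cache:
        _schedule_cache[key] = build()   # run the expensive inspector once
    return _schedule_cache[key]

calls = []
sched = lambda: get_schedule("relax", 0, lambda: calls.append(1) or "sets")
assert sched() == "sets" and sched() == "sets"
assert len(calls) == 1                   # the inspector ran only the first time
```

Amortized over many repetitions of the forall, the one-time cost becomes the small "inspector overhead" measured in Section 4.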
3.3 Run-time Analysis
In this section we apply our analysis to the program in Figure 4. This models a simple
partial differential equation solver on a user-defined mesh. Arrays a and old_a store
values at nodes in the mesh, while array adj holds the adjacency list for the mesh
and coef stores algorithm-specific coefficients. This arrangement allows the solution
of PDEs on irregular meshes, and is quite common in practice. We will only consider
the computational core of the program, the second forall statement.
The reference to old_a[adj[i,j]] in this program creates a communications pattern
dependent on data (adj[i, j]) which cannot be fully analyzed by the compiler. Thus,
the ref(p) sets and the communications sets derived from them must be computed at
run-time. We do this by running a modified version of the forall called the inspector
before running the actual forall. The inspector only checks whether references to
distributed arrays are local. If a reference is local, nothing more is done. If the
reference is not local, a record of it and its "home" processor is added to a list of
elements to be received. This approach generates the in(p,q) sets and, as a side
effect, constructs the sets of local iterations (exec(p) ∩ ref(p)) and nonlocal iterations
(exec(p) - ref(p)). To construct the out(p,q) sets, we note that out(p,q) = in(q,p).
Thus, we need only route the sets to the correct processors. To avoid excessive
communications overhead we use a variant of Fox's Crystal router [2] which handles
such communications without creating bottlenecks. Once this is accomplished, we
have all the sets needed to execute the communications and computation of the
original forall, which are performed by the part of the program which we call the
executor. The executor consists of the two for loops shown in Figure 3 which perform
the local and nonlocal computations.
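The inspector's classification step can be sketched generically. The Python fragment below is our simplification, not the Kali implementation: it handles one distributed array and uses plain sets in place of the sorted range lists discussed below.

```python
def inspect(exec_p, subscripts, local, p):
    """Inspector: classify each iteration of the forall on processor p as
    local or nonlocal, recording which remote elements must be received.
    `subscripts(i)` yields the indices the body of iteration i will read."""
    local_list, nonlocal_list = [], []
    recv = {}                                  # in(p,q): owner -> elements needed
    owner = {e: q for q, elems in local.items() for e in elems}
    for i in exec_p:
        remote = [e for e in subscripts(i) if e not in local[p]]
        for e in remote:
            recv.setdefault(owner[e], set()).add(e)
        (nonlocal_list if remote else local_list).append(i)
    return local_list, nonlocal_list, recv

# Data-dependent subscripts, as with old_a[adj[i,j]] in Figure 4:
local = {1: {1, 2, 3}, 2: {4, 5, 6}}           # hypothetical block distribution
adj = {1: [2], 2: [1, 3], 3: [2, 4]}           # hypothetical adjacency lists
loc, nonloc, recv = inspect([1, 2, 3], lambda i: adj[i], local, p=1)
assert (loc, nonloc) == ([1, 2], [3])
assert recv == {2: {4}}                        # element 4 must come from processor 2
```

The out sets would then be built from these recv sets by the global exchange step (out(p,q) = in(q,p)), which this sketch omits.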
The representation of the in(p,q) and out(p,q) sets deserves mention, since this
representation has a large effect on the efficiency of the overall program. We represent
these sets as dynamically-allocated arrays of the record shown in Figure 5. Each
record contains the information needed to access one contiguous block of an array
stored on one processor. The first two fields identify the sending and receiving pro-
cessors. On processor p, the field from_proc will always be p in the out set and the
field to_proc will be p in the in set. The low and high fields give the lower and upper
bounds of the block of the array to be communicated. In the case of multi-dimensional
arrays, these fields are actually the offsets from the base of the array on the home
processor. To fill these fields, we assume that the home processors and element offsets
can be calculated by any processor; this assumption is justified for static distributions
such as we use. The final buffer field is a pointer to the communications buffer where
the range will be stored. This field is only used for the in set when a communicated
element is accessed. When the in set is constructed, it is sorted on the from_proc
field, with the low field serving as a secondary key. Adjacent ranges are combined
where possible to minimize the number of records needed. The global concatenation
process which creates the out sets sorts them on the to_proc field, again using low
record
    from_proc : integer;   -- sending processor
    to_proc   : integer;   -- receiving processor
    low       : integer;   -- lower bound of range
    high      : integer;   -- upper bound of range
    buffer    : ^real;     -- pointer to message buffer
end;

Figure 5: Representation of in and out sets
10
as the secondary key. If there are several arrays to be communicated, we can add a
symbol field identifying the array; this field then becomes the secondary sorting key,
and low becomes the tertiary key.
Our use of dynamically-allocated arrays was motivated by the desire to keep the
implementation simple while providing quick access to communicated array elements.
An individual element can be accessed by binary search in O(log r) time (where r is
the number of ranges), which is optimal in the general case here. Sorting by processor
id also allowed us to combine messages between the same two processors, thus saving
on the number of messages. Finally, the arrays allowed a simple implementation of
the concatenation process. The disadvantage of sorted arrays is the insertion time of
O(r) when the sets are built. In future implementations, we may replace the arrays
by binary trees or other data structures allowing faster insertion while keeping the
same access time.
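The sorted-range scheme and its O(log r) lookup can be sketched in Python using the standard `bisect` module. This is our illustration; the tuple layout mirrors the fields of Figure 5, and the sample data is hypothetical.

```python
import bisect

# Each record describes one contiguous block received from one processor,
# kept sorted by (from_proc, low) so adjacent ranges can be merged and
# messages between the same pair of processors combined.
recv_ranges = [                     # (from_proc, low, high, buffer)
    (2, 10, 12, [1.0, 1.1, 1.2]),
    (2, 40, 41, [4.0, 4.1]),
    (3, 15, 15, [9.9]),
]
lows = [(r[0], r[1]) for r in recv_ranges]   # sorted search keys

def lookup(from_proc, index):
    """Find the communicated copy of element `index` owned by `from_proc`,
    by binary search over the range records: O(log r) for r ranges."""
    k = bisect.bisect_right(lows, (from_proc, index)) - 1
    proc, low, high, buf = recv_ranges[k]
    assert proc == from_proc and low <= index <= high, "element not received"
    return buf[index - low]

assert lookup(2, 41) == 4.1
assert lookup(3, 15) == 9.9
```

Insertion into such a sorted array is O(r), which is the trade-off noted above; a balanced tree keyed the same way would bring insertion to O(log r) at the cost of a more involved implementation.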
The above approach is clearly a brute-force solution to the problem, and it is not
clear that the overhead of this computation will be low enough to justify its use.
As explained above, we can alleviate some of this overhead by observing that the
communications patterns in this forall will be executed repeatedly. The adj array
is not changed in the while loop, and thus the communications dependent on that
array do not change. This implies that we can save the in(p,q) and out(p,q) sets
between executions of the forall to reduce the run-time overhead.
Figure 6 shows a high-level description of the code generated by this run-time
analysis for the relaxation forall. Again, the figure gives pseudocode for processor p
only. In this case the communications sets must be calculated (once) at run-time.
The sets are stored as lists, implemented as explained above. Here, local_list stores
exec(p) ∩ ref(p); nonlocal_list stores exec(p) − ref(p); and recv_list and send_list
store the in(p, q) and out(p, q) sets, respectively. The statements in the first if state-
ment compute these sets by examining every reference made by the forall on proces-
sor p. As discussed above, this conditional is only executed once and the results saved
for future executions of the forall. The other statements are direct implementations
of the code in Figure 3, specialized to this example. The locality test in the nonlocal
computations loop is necessary because even within the same iteration of the forall,
the reference old_a[adj[i,j]] may be sometimes local and sometimes nonlocal. We
discuss the performance of this program in the next section.
4 Performance
To test the methods shown in Section 3, we implemented the run-time analysis in
the Kali compiler. The compiler produces C code for execution on the NCUBE/7
and iPSC/2 multiprocessors. We then compiled the program shown in Figure 4 using
various constants for the sizes of the arrays and ran the resulting programs for several
sizes of the hypercube, measuring the times for various sections of the codes.
Since our primary interest is unstructured grids, our program allows general adj
and coef arrays. However, in the tests here the grids used were simple rectangular
11
Code executed on processor p:

if ( first_time ) then                     -- Compute sets for later use
    local_list := nonlocal_list := send_list := recv_list := NIL;
    for each i ∈ local_a(p) do
        flag := true;
        for each j ∈ {1,2,...,count[i]} do
            if ( adj[i,j] ∉ local_old_a(p) ) then
                Add old_a[ adj[i,j] ] to recv_list;
                flag := false;
            end;
        end;
        if ( flag ) then Add i to local_list
        else Add i to nonlocal_list end;
    end;
    Form send_list using recv_lists from all processors
        (requires global communication)
end;

-- Send messages to other processors
for each msg ∈ send_list do
    send( msg );
end;

-- Do local iterations
for each i ∈ local_list do
    Original loop body
end;

-- Receive messages from other processors
for each msg ∈ recv_list do
    recv( msg ) and add contents to msg_list
end;

-- Do nonlocal iterations
for each i ∈ nonlocal_list do
    x := 0.0;
    for each j ∈ {1,2,...,count[i]} do
        if ( adj[i,j] ∈ local_old_a(p) ) then
            tmp := old_a[ adj[i,j] ];
        else
            tmp := Search msg_list for old_a[ adj[i,j] ]
        end;
        x := x + coef[i,j] * tmp;
    end;
    if ( count[i] > 0 ) then a[i] := x; end;
end;

Figure 6: Message passing pseudocode for Figure 4
Time (in seconds) for 100 sweeps over a 128 x 128 mesh

processors   total time   executor time   inspector time   inspector overhead
         2       246.07          244.04             2.03                 0.8%
         4       127.46          126.12             1.34                 1.1%
         8        68.38           67.28             1.10                 1.6%
        16        38.95           37.88             1.07                 2.7%
        32        24.36           23.21             1.15                 4.7%
        64        17.71           16.42             1.29                 7.3%
       128        12.64           11.19             1.45                11.5%

Figure 7: Performance of run-time analysis for varying number of processors on an
NCUBE/7
Time (in seconds) for 100 sweeps over 128 × 128 mesh
processors total time executor time inspector time inspector overhead
2 60.69 60.34 0.34 0.56%
4 31.20 31.02 0.18 0.57%
8 16.23 16.13 0.10 0.60%
16 8.88 8.82 0.06 0.64%
32 5.27 5.23 0.04 0.70%
Figure 8: Performance of run-time analysis for varying number of processors on an
iPSC/2
Time (in seconds) for 100 sweeps on 128 processors

  mesh size   total time   executor time   inspector time   inspector overhead   speedup
    64 x 64         4.97            3.56             1.38                27.8%      23.9
  128 x 128        12.64           11.19             1.45                11.5%      37.3
  256 x 256        34.13           32.52             1.61                 4.7%      55.2
  512 x 512        93.78           91.68             2.10                 2.2%      80.4
1024 x 1024       305.03          301.31             3.72                 1.2%      98.9

Figure 9: Performance of run-time analysis for varying problem size on an NCUBE/7
Time (in seconds) for 100 sweeps on 32 processors

  mesh size   total time   executor time   inspector time   inspector overhead   speedup
    64 x 64         1.88            1.86             0.02                0.85%      15.7
  128 x 128         5.27            5.23             0.04                0.70%      22.5
  256 x 256        17.65           17.54             0.11                0.62%      26.8
  512 x 512        65.17           64.79             0.38                0.58%      29.1
1024 x 1024       249.75          248.34             1.41                0.56%      30.3

Figure 10: Performance of run-time analysis for varying problem size on an iPSC/2
grids, on which we performed 100 Jacobi iterations with the standard five point
Laplacian. For this test problem, the optimal static domain decomposition is obvious,
so we did not have to cope with the added complication of load balancing strategies.
Except for issues of load balancing and domain decomposition, we ran the program
exactly as we would for an unstructured grid. The only significant difference is that
the node connectivity is higher for unstructured grids; nodes in a two dimensional
unstructured grid have six neighbors, on average, rather than the four assumed here.
Thus all costs, execution, inspection, and communication, would be somewhat higher
for an unstructured grid.
Figures 7 and 8 show how the execution time varies when the problem size remains
constant and the number of processors is increased. The inspector overhead is defined
here as the proportion of time spent in the inspector; that is, the overhead is the
inspector time divided by the total time. The tables show that the overhead from
the inspector is never very high; for the NCUBE it varies from less than 1% to about
12% of the total computation time, while on the iPSC it is always less than 1% of the
total. These numbers obviously depend on the number of iterations performed, since
the inspector is always executed only once. We assumed 100 iterations, since this is
typical of many numerical algorithms.
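Since the inspector runs once while the executor sweep repeats, the overhead fraction is t_insp / (t_insp + n · t_exec). The sketch below is our own illustration of this arithmetic, with per-sweep executor times derived from the 100-sweep figures in the tables:

```python
def inspector_overhead(t_insp, t_exec_per_sweep, n_sweeps):
    """Fraction of total time spent in the one-time inspector when
    the executor sweep is repeated n_sweeps times."""
    return t_insp / (t_insp + n_sweeps * t_exec_per_sweep)

# NCUBE/7, 128 processors, 128 x 128 mesh: inspector 1.45 s,
# executor 11.19 s per 100 sweeps, i.e. ~0.112 s per sweep.
amortized = inspector_overhead(1.45, 11.19 / 100, 100)  # ~0.115 over 100 sweeps
single    = inspector_overhead(1.45, 11.19 / 100, 1)    # ~0.93 for one sweep
```

The same formula reproduces the 45% single-sweep figure on 2 NCUBE processors (inspector 2.03 s against one 2.44 s sweep).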
For some problems, there are numerical algorithms requiring fewer relaxation it-
erations. Such algorithms tend to be much more complex, requiring incomplete LU
factorizations or multigrid techniques, and we suspect our approach would be less
useful in such cases. In the worst case, where one performs only one sweep, the in-
spector overhead on the NCUBE would range from 45% on 2 processors to 93% on
128 processors, while on the iPSC it ranges from 35% to 41%. These numbers illus-
trate the importance of saving inspector information to avoid recomputation. They
also suggest that with this kind of hardware/software environment, algorithm choices
might shift in favor of simpler algorithms with more repetitive inner loops.
Figure 7 also shows how the time taken by the inspector varies. As can be seen,
the time for the inspector starts high, decreases to a minimum at 16 processors, and
then increases slowly. This behavior can be explained by the structure of the inspector
itself. It consists of two phases: the loop identifying nonlocal array references, and
the global communication phase to build the receive lists. The time to execute the
loop is proportional to the number of array references performed, and thus in this case
inversely proportional to the number of processors. The global communications phase,
on the other hand, requires time proportional to the dimension of the hypercube, and
thus is logarithmic in the number of processors. When there are few processors,
the inspector time is dominated by the array reference loop, and is thus inversely
proportional to the number of processors. However, as more processors are added,
the increasing time for the communications phase eventually overtakes the decreasing
loop time and the total time begins to rise.
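The two-phase cost described above can be summarized by a simple model, T(P) ≈ a/P + b·log₂ P, where a reflects the per-reference locality check and b the per-stage cost of the global combine. The constants below are illustrative, not measured values; with a suitable a/b ratio the model reproduces the observed shape, including a minimum near 16 processors:

```python
import math

def inspector_time(p, a, b):
    """Model: the locality-checking loop scales as a/p (work divides among
    processors); the global combine scales as b*log2(p) (hypercube stages)."""
    return a / p + b * math.log2(p)

# Illustrative constants only: time falls with p at first, reaches a
# minimum, then rises slowly as the combine phase takes over.
times = {p: inspector_time(p, a=4.0, b=0.15) for p in (2, 4, 8, 16, 32, 64, 128)}
best = min(times, key=times.get)  # minimum at an intermediate machine size
```

On a machine with cheaper small messages (a larger a/b ratio, as on the iPSC), the minimum moves out beyond the available machine sizes, which is consistent with the monotone inspector times in Figure 8.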
This behavior is not seen in Figure 8 because the locality-checking loop always
dominates the computation on the iPSC. For sufficiently many processors the com-
munication phase would also become significant there. In general, the iPSC inspector
overheads are much less than those of the NCUBE. This appears to be primarily due
to the relatively lower cost of communications for small messages on the iPSC. The
higher small-message cost on the NCUBE increases the cost of the global combining
in the inspector, thus increasing its overhead relative to the iPSC.
Figures 9 and 10 keep the number of processors constant and vary the problem
size. Inspector overhead is defined as above, while speedup is given relative to the
executor time on one processor. This represents the closest measurement we have to
an optimal sequential program, since it does not include any overhead for either the
inspector or for communication. As can be seen, the inspector overhead decreases
and the speedup increases as the problem size is increased. The decrease
in inspector overhead can again be explained by the structure of the inspector. As
the problem size increases, the number of iterations of the locality-checking loop also
increases, making that phase of the inspector more dominant in the total inspector
time. Thus, our inspector-executor code organization can be expected to scale well as
problem size increases. The increases in speedup reflect decreasing overheads in the
executor loop. Our parallel programs have two overheads associated with nonlocal
references: the cost of sending and receiving data in messages, and data structure
overhead from the searches for nonlocal array elements. Any program written for a
distributed memory machine will have the communications overhead; however, the
search overhead is unique to our system. This search overhead is primarily responsible
for suboptimal speedups. Also, this overhead is much less for the iPSC than for the
NCUBE, probably because of the faster procedure calls on the iPSC. We are working
both to analyze these results and to improve the data structure performance on both
machines.
5 Related Work
There are many other projects concerned with compiling programs for nonshared
memory parallel machines. Three in particular break away from the message passing
paradigm and are thus closely related to our work.
Kennedy and his coworkers [1] compile programs for distributed memory by first
creating a version which computes its communications at run-time. They then use
standard compiler transformations such as constant propagation and loop distribution
to optimize this version into a form much like ours. Their optimizations appear to fail
in our run-time analysis cases. If significant compile-time optimizations are possible,
their results appear to be similar to our compile-time analysis in [3]. We extend their
work in our run-time analysis by saving information on repeated communications
patterns. It is not obvious how such information saving could be incorporated into
their method without devising new compiler transformations. We also provide a more
top-down approach to analyzing the communications, while their optimizations can
be characterized as bottom-up.
Rogers and Pingali [8] suggest run-time resolution of communications for the func-
tional language Id Nouveau. They do not attempt to save information between execu-
tions of their parallel constructs, however. Because the information is not saved, they
label run-time resolution as "fairly inefficient" and concentrate on optimizing special
cases. These cases appear to correspond roughly to our compile-time analysis. We ex-
tend their work by saving the communications information between forall executions
and by providing a common framework for run-time and compile-time resolution.
Saltz et al. [9] compute data-dependent communications patterns in a preprocessor,
producing schedules for each processor to execute later. This preprocessing is done
off-line, although they are currently integrating this with the actual computation as
is done in our system. Their execution schedules also take into account inter-
iteration dependencies, something not necessary in our system since we currently
start with completely parallel loops. They do not give any performance figures for
their preprocessor, although they do note that given its "relatively high" complexity,
parallelization will be required in any practical system. Saving the information about
forall communications between executions is very similar between our two works. A
major difference from our work is that they explicitly enumerate all array references
(local and nonlocal) in a "list". This eliminates the overhead of checking and searching
for nonlocal references during the loop execution, but requires more storage than our
implementation. We also differ in that we consider compile-time optimizations, which
they do not attempt.
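The storage/time trade-off just described can be made concrete: enumerating every reference in a schedule replaces a per-reference ownership test and hash search with a direct pre-resolved gather, at the cost of storing one entry per reference. The sketch below is our own illustration of that idea (all names are hypothetical), not Saltz et al.'s implementation:

```python
def build_schedule(my_indices, adj, owner, my_rank):
    """Preprocess: enumerate every reference, local and nonlocal, once,
    resolving each to the table it will be read from at execution time."""
    schedule = []
    for i in my_indices:
        refs = [('local' if owner[j] == my_rank else 'recv', j)
                for j in adj[i]]
        schedule.append((i, refs))
    return schedule

def run_schedule(schedule, coef, old_a, recv_table):
    """Execute with no ownership checks: every reference is pre-resolved,
    so the inner loop is a straight gather plus multiply-accumulate."""
    a = {}
    for i, refs in schedule:
        a[i] = sum(coef[(i, j)] * (old_a[j] if src == 'local' else recv_table[j])
                   for src, j in refs)
    return a
```

The schedule here stores one tuple per array reference, whereas a search-based executor stores only the nonlocal entries, which is the storage difference noted above.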
6 Conclusions
Current programming environments for distributed memory architectures provide lit-
tle support for mapping applications to the machine. In particular, the lack of a
global name space implies that the algorithms have to be specified at a relatively
low level. This greatly increases the complexity of programs, and also hard wires the
algorithm choices, inhibiting experimentation with alternative approaches.
In this paper, we described an environment which allows the user to specify al-
gorithms at a higher level. By providing a global name space, our system allows the
user to specify data parallel algorithms in a more natural manner. The user needs to
make only minimal additions to a high level "shared memory" style specification of
the algorithm for execution in our system; the low level details of message-passing,
local array indexing, and so forth are left to the compiler. Our system performs these
transformations automatically, producing relatively efficient executable programs.
The fundamental problem in mapping a global name space onto a distributed
memory machine is generation of the messages necessary for communication of non-
local values. In this paper, we presented a framework which can systematically and
automatically generate these messages, using either compile time or run time analysis
of communication patterns. Here we concentrated on the more general (but
less efficient) case of run-time analysis. Our run-time analysis generates messages
by performing an inspector loop before the main computation, which records any
nonlocal array references. The executor loop subsequently uses this information to
transmit information efficiently while performing the actual computation.
Report Documentation Page

Report No.: NASA CR-181981 (ICASE Report No. 90-7)
Title and Subtitle: Supporting Shared Data Structures on Distributed Memory Architectures
Authors: Charles Koelbel, Piyush Mehrotra, John Van Rosendale
Performing Organization: Institute for Computer Applications in Science and Engineering,
Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665-5225
Sponsoring Agency: National Aeronautics and Space Administration,
Langley Research Center, Hampton, VA 23665-5225
Report Date: January 1990
Performing Organization Report No.: 90-7
Work Unit No.: 505-90-21-01
Contract No.: NAS1-18605
Type of Report and Period Covered: Contractor Report
Supplementary Notes: Langley Technical Monitor: Richard W. Barnwell. Final Report.
To appear in the Proceedings of the 2nd SIGPLAN Symposium on Principles and
Practice of Parallel Programming, March 1990.
Key Words: distributed memory architectures; language constructs; runtime optimizations
Distribution Statement: Unclassified - Unlimited. Subject Categories: 59 - Mathematical
and Computer Sciences (General); 61 - Computer Programming and Software
Security Classification (of this report and this page): Unclassified. No. of pages: 19. Price: A03
The inspector is clearly an expensive operation. However, if one amortizes the cost
of the inspector over the entire computation, it turns out to be relatively inexpensive
in many cases. This is especially true in cases where the computation is an iterative
loop executed a large number of times.
The other issue affecting the overhead of our system is the extra cost incurred
throughout the computation by the new data structures used. This is a serious issue,
but one on which we have only preliminary results. In future work, we plan to give
increased attention to these overhead issues, refining both our run-time environment
and language constructs. We also plan to look at more complex example programs,
including those requiring dynamic load balancing, to better understand the relative
usability, generality, and efficacy of this approach.
References
[1] D. Callahan and K. Kennedy. Compiling programs for distributed-memory multipro-
cessors. Journal of Supercomputing, 2:151-169, 1988.
[2] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems
on Concurrent Processors, Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1986.
[3] C. Koelbel and P. Mehrotra. Compiler transformations for non-shared memory ma-
chines. In Proceedings of the 4th International Conference on Supercomputing, volume 1,
pages 390-397, May 1989.
[4] P. Mehrotra. Programming parallel architectures: The BLAZE family of languages. In
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Com-
puting, pages 289-299, December 1987.
[5] P. Mehrotra and J. Van Rosendale. The BLAZE language: A parallel language for
scientific programming. Parallel Computing, 5:339-361, 1987.
[6] P. Mehrotra and J. Van Rosendale. Compiling high level constructs to distributed mem-
ory architectures. In Proceedings of the Fourth Conference on Hypercube Concurrent
Computers and Applications, March 1989.
[7] P. Mehrotra and J. Van Rosendale. Parallel language constructs for tensor product
computations on loosely coupled architectures. In Proceedings Supercomputing '89, pages
616-626, November 1989.
[8] A. Rogers and K. Pingali. Process decomposition through locality of reference. In
Conference on Programming Language Design and Implementation, pages 69-80. ACM
SIGPLAN, June 1989.
[9] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman. Run-time scheduling and
execution of loops on message passing machines. Journal of Parallel and Distributed
Computing, April 1990. (To appear).