Preprint 0 (2001) ?–? 1
Object Placement Using Performance Surfaces
André Turgeon, Quinn Snell and Mark Clement a
a Computer Science Department, Brigham Young University, Provo, Utah 84602-6576,
E-mail: {andre,snell,[email protected]
Heterogeneous parallel clusters of workstations are being used to solve many important computational problems. Scheduling parallel applications on the best collection of machines in a heterogeneous computing environment is a complex problem. Performance prediction is vital to good application performance in this environment since utilization of an ill-suited machine can slow the computation down significantly. This paper addresses the problem of network performance prediction. A new methodology for characterizing network links and an application's need for network resources is developed which makes use of Performance Surfaces [3]. This Performance Surface abstraction is used to schedule a parallel application on the resources where it will run most efficiently.
Keywords: Performance Surfaces, Prediction, Statistical Model, Parallel, Scheduling
1. Introduction
Clusters of workstations are being used in many production environments
to solve large computational problems. Several metacomputer systems have been
proposed and are currently being developed to utilize clusters and traditional su-
percomputers to solve large problems. A few examples are Legion [11], Globus [6]
and DOGMA [8]. Despite great strides by prior metacomputer developers, there
is still much work to be done in the area of meta-scheduling. The heterogeneity inherent in metacomputers introduces several interesting problems. These problems include security issues, reliability, and performance prediction. This
paper will concentrate on performance prediction. More particularly, this paper
will concentrate on the networking aspects of performance prediction. The intent
is to use network performance predictions to make placement decisions when a
large number of platform alternatives exist.
Methods of characterizing and predicting network performance have been
studied by several authors [13,14,10,4,12]. They all use some form of statistical model. The prediction results are generally given in one of two forms: (1) a fixed point value, which is a single number, or (2) a stochastic value, which is a distribution representing a set of possible values and their probabilities.
Schopf [12] claims that a stochastic value more accurately represents the performance of the network than fixed point values. This approach is the basis for performance surfaces (Section 2 will define the concept of a performance surface in greater detail).
Object placement strategies (or more generally, meta-scheduling strategies)
have been studied by several authors [2,9,5]. Many of the systems use some
form of dynamic performance prediction (such as the Network Weather System
[17]) to help in the object placement decision. This research uses a benchmark application to collect performance information and forms a multi-dimensional stochastic model in order to make object placement decisions.
The goal of this research is to develop a system which automatically places
parallel applications in a heterogeneous environment while minimizing execution
time. Simplistically, this is done by placing pairs of tasks which communicate
heavily on a pair of nodes with good communication characteristics. Conversely,
tasks which seldom communicate can span a poor network link without negatively affecting performance. To accomplish this goal, it is necessary to adequately characterize the performance of network links and the load which a particular application will place on the network (on a link-by-link basis). It is also necessary, given these network and application characterizations, to develop a way to mathematically determine how well suited an application is for a given network – an affinity measure. A novel performance model (performance surfaces) is used to accomplish this.
This paper develops a performance prediction system which can be used as part of a meta-scheduler. More precisely, this system is meant to assist in object placement, a subset of scheduling. Scheduling is the process of deciding where to place a task and when to schedule it. Object placement is simply the where part of scheduling. The system is divided into three components. The first component characterizes the performance of the network. The second component characterizes an application's demand for network resources. The final component uses the data collected by the first two components and calculates affinity measures for various placement configurations. Using these affinity measures and guided by heuristics, the system makes educated placement decisions. Although
the placement heuristics are meant to be part of the meta-scheduler, some are built into our system for testing purposes.
2. Performance Surfaces
Performance surfaces are the basis for the performance prediction model
developed in this paper. A performance surface is a series of probability distri-
butions for discrete elements put back to back, thus forming the surface. Figure
1 is an example of a performance surface. In this example, each discrete element
is a message size. The other two dimensions are used to represent the probability
distribution for messages of that size. More precisely, given a message size, the distribution gives the probability that the latency falls within a given time interval.
Much information is contained in the performance surface shown in Figure 1. In this example, the smaller message sizes (Figure 1(a)) are likely to have low average latency. The variance in the latency distribution for these message sizes is small. This can be seen by the high ridge at the back of the distribution graph. The larger message sizes (Figure 1(b)) are more likely to have larger average latency. The variance in the latency distribution for these messages is larger. This can be seen by the location of peaks closer to the front of the graph. The distribution is also flatter, which indicates that these larger message sizes have less predictable delays.
For this research, two different surfaces are used. The first surface, called the Network Performance Surface (NPS), characterizes the network links. The other, called the Application Performance Surface (APS), characterizes an application's use of network resources. By combining the information contained in these two surfaces, it is possible to make placement decisions.
Let us define what is meant by network links and application links. Network
links, in the context of this paper, are the set of all links connecting the nodes
used by the metacomputer. Application links are the set of all communication
pathways an application uses during the course of its execution. These pathways
are not physical network links; instead, they represent the communication paths
used by the application.
Another difference between the two types of characterizations is that the network surface is tied to a physical network and is therefore "fixed" by nature. The application characterization, however, can be changed as long as link dependencies are kept intact. For example, Figure 2 is a graphical representation of two such mappings. In Figure 2(a) the application graph App is unchanged;
in Figure 2(b), the application graph has been rotated counter-clockwise. Notice the heavier lines in the figure. In the case of the application graph, App, a heavier line between two nodes represents heavier communication between those two nodes. In the case of the network graph, Net, a heavier line represents a faster network link (capable of sustaining heavier traffic). It is clear that the mapping of Figure 2(b) will give superior performance to the mapping of Figure 2(a). In Figure 2(a), the application's heavily communicating link between nodes 1 and 2 is mapped to a poor network link (as indicated by the thin line). Likewise, the application's other heavily communicating link between nodes 3 and 4 is also mapped to a poor network link. This situation is rectified in Figure 2(b), where both of the application's heavily communicating links are mapped to superior network links. This example illustrates the usefulness of characterizing every link of both the network and application. Using the information given by these characterizations, a properly designed system can analyze various mapping scenarios and determine which results in better performance.
2.1. Definitions
General performance surfaces are defined as follows: If the distribution of service time t is measured for various service requests, which can be described by a set of n characteristics r = (r1, r2, ..., rn), the following conditional probability is obtained:

surface(t, r) = P(t | r)    (1)
This two-dimensional probability distribution forms a surface where the x-axis represents the characteristic r, the y-axis represents the service time t, and the z-axis represents the probability that r takes time t to be serviced. Two instances of performance surfaces are used in this research: a network performance surface (NPS) and an application performance surface (APS). The set of characteristics used for these performance surfaces is a set m = (m1, m2, ..., mn) of message sizes.
For the NPS, netsurf, the distribution of service time t for a given message
size m represents the distribution of network delays intrinsic to the underlying
network when sending a message of size m.
netsurf(t, m) = P(t | m)    (2)
For the APS, appsurf, the distribution of service time t represents the distribution of computational periods between communication operations, or grain size.

appsurf(t, m) = P(t | m)    (3)
Let us give a concrete example of these mathematical representations. For this example, Equation 2 is used with constant values for the message size variable m. Figure 3(a) gives three delay probability distributions where m = 32 Kbyte, m = 64 Kbyte, and m = 94 Kbyte respectively. By concatenating all the distributions for the values of m sampled, a surface emerges, as can be seen in Figure 3(b).
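The construction just described can be sketched in code. The sketch below is a simplified illustration, not the authors' implementation; all names (build_surface, delay_bins, bin_width) are hypothetical. It forms a discrete performance surface by bucketing measured latencies into one normalized histogram per sampled message size:

```python
from collections import defaultdict

def build_surface(samples, delay_bins, bin_width):
    """samples: list of (message_size, latency) measurements.
    Returns {message_size: [P(bin_0), P(bin_1), ...]} -- one
    normalized delay distribution per message size; concatenated
    over all sizes, these form the performance surface."""
    counts = defaultdict(lambda: [0] * delay_bins)
    for size, latency in samples:
        b = min(int(latency / bin_width), delay_bins - 1)
        counts[size][b] += 1
    surface = {}
    for size, hist in counts.items():
        total = sum(hist)
        surface[size] = [c / total for c in hist]
    return surface

# Example: two message sizes with a few latency samples each.
surf = build_surface(
    [(32, 1.0), (32, 1.2), (32, 5.0), (64, 4.0), (64, 4.5)],
    delay_bins=4, bin_width=2.0)
# Each row is a probability distribution and sums to 1.
```

Each row of the result plays the role of one slice of the surface in Figure 3(b).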
2.2. Representation of the Network and Application
An NPS and APS, as defined in the last section, refer to individual network and application links respectively. In this section, a notation for referring to the entire network or application is introduced.
For the purpose of this paper, the network connecting a cluster is logically viewed as a fully connected graph. The definition of a graph is augmented to include, in addition to the set of vertices and edges, a set of performance surfaces. Figure 4 shows an example of a network. In this work, the vertices of the graph are called nodes, and the edges are called links. In addition to these components, each link has a performance surface associated with it. Formally, a network graph, NET, is defined as follows:
NET = G(V, E, S),    (4)

where V and E are the sets of vertices and edges respectively, and where S is a set of Network Performance Surfaces (NPS) – one for each edge in the graph. Similarly, an application graph, APP, is defined to be:

APP = G(V, E, S),    (5)
where S is a set of APSs. The arity of a NET or APP, denoted by |NET| or |APP|, is defined to be the number of vertices in that graph. In general, it is assumed that |NET| ≥ |APP|.
2.3. Network Performance Surface Creation
An accurate characterization of network performance involves at least three
dimensions: (1) message size, (2) communication time, and (3) communication
endpoints. A Network Performance Surface is the performance characterization
of one link in the network. For each surface, given a message size, there is a
probability distribution for the delay associated with sending a message of that
size. To generate these distributions, it is necessary to benchmark each link by
sending messages of di�erent size and recording the delay (the time taken for the
reply divided by 2). To get a complete picture of the network, each link must be
characterized.
The problem of generating all the network surfaces can be approached in two ways: (1) exhaustive testing of all possibilities (which is impractical for large configurations), or (2) Monte Carlo [16] approximation. Both approaches select values for partner and message size. The way these values are selected, however, is different. The exhaustive testing approach sequentially tests all message sizes on all possible network links. The Monte Carlo approximation, which we use, randomly selects message sizes and specific network links and converges on the complete surface without testing all possibilities.
The network benchmark algorithm runs on all nodes. Each node commu-
nicates with every other node, in a ping pong fashion as described in [15]. The
partner selection for this ping pong process is random. Likewise, the message size
used is random. Another random variable is the period of time to wait between
communication tests (sample collection). Because of the random waiting time,
this algorithm measures the network's response to congestion which, for shared
media networks, decreases the effective bandwidth.
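One sampling step of this randomized benchmark loop might look like the following sketch. This is illustrative only: the actual NetBench utility and its MPI calls are not shown, and all names (netbench_step, ping_pong) are assumptions.

```python
import random
import time

def netbench_step(my_rank, nodes, max_size_kb, ping_pong):
    """One Monte Carlo sample: pick a random partner, message size,
    and inter-test wait, then time a ping-pong exchange.
    `ping_pong(partner, size_kb)` is a stand-in for the real MPI
    round-trip and returns the round-trip time in seconds."""
    partner = random.choice([n for n in nodes if n != my_rank])
    size_kb = random.randint(1, max_size_kb)
    time.sleep(random.uniform(0.0, 0.05))  # random wait exposes congestion effects
    rtt = ping_pong(partner, size_kb)
    delay = rtt / 2.0                      # one-way delay = round-trip / 2
    return (partner, size_kb, delay)       # one (link, size, delay) sample

# Usage with a fake ping_pong for illustration:
fake = lambda partner, size_kb: 0.001 * size_kb
sample = netbench_step(0, [0, 1, 2, 3], max_size_kb=64, ping_pong=fake)
```

Accumulating many such samples per link and feeding them into a histogram-based surface builder yields the NPS for that link.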
2.4. Application Performance Surface Creation
The characterization of an application is very similar to the characterization
of a network. Like the network characterization, it contains the following three
dimensions: (1) message size, (2) frequency of communication (delay), and (3)
communication endpoints. Although "delay" is normally viewed as a negative characteristic, in the case of an application, larger values of delay (or larger grain size values) will increase the application's affinity to a given machine. These
measurements are gathered for each node pair where communications occur in
the application.
To collect this information, a trace of the running application must be collected. Since the application must be run before it can be characterized, the benefit of this system will only appear over a series of runs. Current experiments perform application characterization on a homogeneous cluster in order to minimize perturbation due to differences in hardware. Future work will investigate gathering trace information every time an application is run and then factoring out machine characteristics in order to arrive at the same surface.
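As a rough illustration of how APS samples could be derived from such a trace (a hypothetical trace format, not the authors' tooling), one can record, for each application link, the compute interval between successive sends together with the message size sent:

```python
from collections import defaultdict

def aps_samples(trace):
    """trace: chronological list of (timestamp, src, dst, msg_size)
    send events. Returns {(src, dst): [(msg_size, grain), ...]},
    where `grain` is the compute time since the previous send on
    that application link -- the delay axis of the APS."""
    last_send = {}
    samples = defaultdict(list)
    for t, src, dst, size in trace:
        link = (src, dst)
        if link in last_send:
            samples[link].append((size, t - last_send[link]))
        last_send[link] = t
    return samples

trace = [(0.0, 0, 1, 32), (1.5, 0, 1, 32), (1.6, 0, 1, 64)]
s = aps_samples(trace)
```

The resulting (size, grain) samples per link can then be histogrammed into an APS exactly as the NPS is built from (size, delay) samples.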
2.5. Affinity Measure Derivation
Once a characterization of both the network and the application is complete,
the next step is to use this information to make placement decisions. This is accomplished by combining the information from these two characterizations in order to predict how a given network will react to the load presented by the application.
As stated earlier, a performance surface is a series of delay probability distributions – one distribution for each message size. The term application probability
distribution or APD will be used to denote an application's delay probability dis-
tribution for a given message size. The term network probability distribution or
NPD will be used to denote a network's delay probability distribution for a given
message size.
2.5.1. Weighted Convolution
A method was devised to translate the overlap found between the APD and NPD into an affinity measure. The method is based on convolution as used in signal
processing. The signal is analogous to the application surface and the impulse
response is analogous to the network surface. The convolution between a signal
and the impulse response results in a "prediction" of the output of the system
given that signal.
Weighted Convolution is similar to traditional convolution except that a
weight is given to each product of the convolution based upon the relative distance
between the network and application surfaces. If the average delay for the APD is less than the average delay for the NPD, the weight is small. As the average delay for the APD increases, the weight also grows. Figures 5, 6 and 7 show three scenarios.
In Figure 5, the average delay for the APD is far less than the average delay for the NPD. In other words, the application's average delay between communications is smaller than the network's average delay for transmitting a message of that size. The convolved distribution's area is smaller than in the other two graphs and denotes a smaller affinity measure.

In Figure 6, the average delay for the APD is still smaller than the average delay for the NPD, but this time, there is greater overlap. The convolved distribution's area is larger than in Figure 5 to reflect the greater affinity.

Finally, in Figure 7, the average delay for the APD is greater than the average delay for the NPD. In other words, the application spends more time computing than communicating. This means the network should be able to deliver messages before the nodes have completed the computation. The resulting convolved distribution's area is the largest, denoting the best affinity measure of the three examples.
Let us now give a more formal definition for affinity. First, the affinity surface, from which the affinity measure is derived, is defined. Given a network surface netsurf and an application surface appsurf, an affinity surface affsurf is defined as:

affsurf(m, d) = Σ_{k=0}^{mDelay} netsurf(m, d) · appsurf(m, k) · k/(k + d),    (6)

where m, d, and mDelay represent a given message size, a given delay, and the maximum delay respectively.
A compelling validation for weighted convolution is to think of the weight factor k/(k + d) of Equation 6 as an efficiency weight. Let us derive the weight factor from the definition of efficiency. The standard definition for efficiency is as follows:
E = S / P,    (7)
where S is the speedup and P is the number of processors used. Speedup, in turn, is defined to be the ratio of the serial execution time over the parallel execution time, or T1/TP. The parallel execution time TP can further be decomposed into the sum of computation time (Tcomp), communication time (Tcomm) and idle time (Tidle). Substituting the expanded definition of speedup into Equation 7 yields the following definition for efficiency:

E = T1 / (P(Tcomp + Tcomm + Tidle)).    (8)
Writing Tcomp in terms of T1 yields Tcomp = T1/P. The sum of communication time and idle time can be thought of as the sum of the delays generated by each processor's communication, D, divided by the number of processors. In other words, D/P = Tcomm + Tidle. The ratio D/P can be thought of as an average of the sum of delays. Substituting these values for Tcomp and Tcomm + Tidle yields the following definition:

E = T1 / (P(T1/P + D/P)).    (9)
This definition can be simplified by canceling P and yields:

E = T1 / (T1 + D).    (10)
This final definition is analogous to the weight factor k/(k + d), where k, the application's grain size, corresponds to T1 and d, the network's delay, corresponds to D. In essence, Equation 6 convolves two distributions but gives each term of the summation a weight that is based on efficiency. This weight factor is only an approximation of efficiency since Equation 10 applies to the entire application whereas weighted convolution deals with individual links. Nevertheless, it appears from the experimental results that it is a good approximation.
Finally, given an affinity surface, affsurf, the affinity measure, affinity, is defined as follows:

affinity = Σ_{m=0}^{mSize} Σ_{d=0}^{mDelay} affsurf(m, d),    (11)

where mSize is the maximum size represented by the finite NPS and APS¹. This is simply the volume under the affinity surface.

¹ The NPS and APS must have matching dimensions for these operations to make sense.
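Equations 6 and 11 translate directly into a short computation. The sketch below is an illustrative rendering, not the authors' code; it represents the NPS and APS as 2-D arrays indexed by message size and delay bin:

```python
def affinity(netsurf, appsurf):
    """netsurf, appsurf: 2-D lists of equal shape, where
    surf[m][d] = P(delay bin d | message size m).
    Implements Eq. 6 (weighted convolution) and Eq. 11
    (volume under the affinity surface)."""
    m_size = len(netsurf)
    m_delay = len(netsurf[0])
    total = 0.0
    for m in range(m_size):
        for d in range(m_delay):
            # Eq. 6: weight each product by the efficiency factor k/(k+d)
            affsurf_md = sum(
                netsurf[m][d] * appsurf[m][k] * (k / (k + d) if k + d > 0 else 0.0)
                for k in range(m_delay))
            total += affsurf_md  # Eq. 11: sum over the whole surface
    return total

# A coarse-grained application (mass at high k) on a fast network
# (mass at low d) should score higher than on a slow network.
fast_net = [[0.9, 0.1]]
slow_net = [[0.1, 0.9]]
coarse_app = [[0.1, 0.9]]
print(affinity(fast_net, coarse_app) > affinity(slow_net, coarse_app))  # True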
2.6. Placement Problem Complexity
One problem with characterizing each link in a fully connected network of
computers is that the number of links grow geometrically with the number of
nodes. If n is the number of nodes then the number of links l can be calculated
by the following function:
l =n(n� 1)
2: (12)
There are many ways to map an application graph APP onto a network graph NET. Given a mapping, the sum of the individual affinity measures (one for each APS/NPS pair formed by the mapping) is the overall affinity measure for that mapping. The higher the measure, the better or more efficient the mapping. Finding the mapping which gives the highest affinity sum is a complex process. The search space S to evaluate all permutations is as follows:

S = |NET|!                        if |NET| = |APP|,
S = |NET|! / (|NET| − |APP|)!     if |NET| > |APP|.    (13)

In fact, this problem is analogous to the traveling salesman problem and is NP-complete, with O(N!) permutations to search. Fortunately, it is not necessary, as we shall see in the next section, to search through all permutations.
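To make Equations 12 and 13 concrete, the following small sketch (with hypothetical helper names) computes the link count and the size of the placement search space:

```python
from math import factorial

def link_count(n):
    """Eq. 12: links in a fully connected graph of n nodes."""
    return n * (n - 1) // 2

def search_space(net_nodes, app_nodes):
    """Eq. 13: number of placement permutations of an
    app_nodes-vertex application graph onto a net_nodes-vertex
    network graph (assumes net_nodes >= app_nodes)."""
    return factorial(net_nodes) // factorial(net_nodes - app_nodes)

print(link_count(8))        # 28 links, as in the two-site example of Section 2.7.1
print(search_space(16, 8))  # 16!/8! = 518918400 placements
```

Even modest configurations make the factorial growth obvious, which motivates the aggregation optimizations that follow.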
2.7. Optimizations
It must be noted that the information given by the set of performance sur-
faces characterizing both the network and application can be quite redundant.
One elegant way to reduce the search space is to reduce the number of elements
to analyze. It is often the case that homogeneous clusters exist inside the net-
work. By aggregating these homogeneous clusters into one node, the number
of placement permutations can be significantly reduced. Likewise, applications
often have homogeneous communication zones which can be aggregated. This
section examines these optimizations in more detail.
2.7.1. Aggregating Networks into Homogeneous Clusters
One way to reduce the search space drastically is to aggregate a network by finding homogeneous clusters. Consider the following metacomputer comprised of two sites, site A and site B. Sites A and B are clusters of computers running 100 Mbit Ethernet LANs. Both sites are linked through a nebulous link (i.e., the Internet). Figure 8(a) is a pictorial representation of our example metacomputer.
In this example, it is likely that all links connecting the nodes of site A will have
similar characteristics. Likewise, all links connecting the nodes of site B would
be very homogeneous. Lastly, both of these sets of links are likely faster than
the set of links spanning the two sites. Figure 8(b) shows the number of links
which would have to be searched if no aggregation was done. In this case, using
equation 12, the number of links is 28. If, however, all the homogeneous nodes at
each site are represented by one node, the number of links is reduced drastically.
Figure 8(c) shows the number of links which would have to be searched once the
network is aggregated. There is the link describing all connections between the
two sites, the link describing all connections between the nodes of site A, and
the link describing all connections between the nodes of site B. The aggregation
yields 3 links – a reduction of 25 links. As can be seen from this example, network
aggregation reduces the search space considerably.
The question then becomes, how does one find homogeneous clusters? The answer lies in finding a method to rate the speed of links relative to one another. A method was developed to do just this, using performance surfaces. To find the relative speed of each network link, or RNS (Relative Network Speed), each network surface is convolved (using weighted convolution) with an "identity surface" or "benchmarking application surface". Figure 9 shows the identity surface (middle surface).
The identity surface is used to discriminate between slow and fast network links. Figure 9 gives an example of the process. In this figure two links are rated: on the left, a high speed link; on the right, a low speed link. The same process is used to rate the two links. First, the link is convolved with the identity surface. This convolution yields the affinity surface. The affinity measure is derived from the affinity surface by calculating the volume under the affinity surface.
The RNS values of all the links are grouped together using a histogram. Figure 10 shows a typical histogram. As can be seen in that figure, RNS values will naturally clump into groups of nodes of similar speed. It is important to note that this grouping is not representative of the clusters. It is merely a grouping of the relative speeds of the links. In the previous example (Figure 8), all the links from the site A cluster and the site B cluster would be clumped together in one group and all the links between the two sites would be clumped into the other group. After this primary separation, it is necessary to analyze the link dependencies to determine the clusters.
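A rough sketch of this rating-and-grouping step follows. It is illustrative only: the RNS helper simply reuses a weighted-convolution affinity function, and the gap-based grouping is a crude, hypothetical stand-in for reading clumps off the histogram of Figure 10.

```python
def rns(netsurf, identity, affinity_fn):
    # Relative Network Speed: affinity of the link's NPS with the
    # identity (benchmarking application) surface.
    return affinity_fn(netsurf, identity)

def group_links(rns_values, gap):
    """rns_values: {link: rns_score}. Sort the scores and split
    wherever consecutive values are more than `gap` apart,
    yielding clumps of links with similar relative speed."""
    ordered = sorted(rns_values.items(), key=lambda kv: kv[1])
    groups, current = [], [ordered[0][0]]
    for (a, va), (b, vb) in zip(ordered, ordered[1:]):
        if vb - va > gap:
            groups.append(current)
            current = []
        current.append(b)
    groups.append(current)
    return groups

# Fast intra-site links clump apart from the slow inter-site link.
values = {("A1", "A2"): 0.94, ("A1", "A3"): 0.91,
          ("B1", "B2"): 0.89, ("A1", "B1"): 0.12}
print(group_links(values, gap=0.3))
```

As the text notes, these clumps group link speeds, not clusters; a subsequent link-dependency analysis is still needed to recover the actual node clusters.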
2.7.2. Aggregating Applications into Homogeneous Clusters
Homogeneity can also be found in the communication patterns of an appli-
cation. Indeed, it was found that many parallel algorithms use symmetrical com-
munication patterns. This means that like network graphs, application graphs
can also be aggregated into clusters.
The same technique which is applied to aggregate network graphs can be
applied to aggregate application graphs. This time, instead of relative network
speed, relative network requirements are measured. In other words, the clusters
represent areas of high/low communication. Again, each application surface is
convolved with an "identity surface". A histogram is formed and the clumps are
analyzed for link dependencies to form the application clusters.
If it is the case that the communication characteristics of the application
are uniform, then aggregating the application graph will yield one cluster. This
simply means that there is no "good" way to split the application to exploit slow
communication links.
In the case where the communication patterns are non-uniform, it is often
the case that clusters of communication can be found. The high-communication
clusters can then easily be mapped to network clusters with high speed network
connections. Application links with low communication requirements are paired
off with low performance network links.
2.8. Placement Heuristics
Once aÆnity measures for the various application/network cluster combi-
nations are calculated, placement decisions can be made. Although placement
heuristics really belong in a meta-scheduler, some were included in this system to
provide a proof of concept. The following example heuristics were implemented
in the system.
• If only one cluster is found which has sufficient nodes to accommodate the parallel application, use it.
• If more than one cluster is found (which is large enough to accommodate the application), use performance surfaces to determine which cluster has better characteristics.
• If more than two clusters exist and the application must span multiple clusters, use the affinity measure to determine which cluster combination is best.
• If the application displays asymmetric communication patterns, and must be split among clusters, split the application along communication cluster lines.
These are just a few examples of possible heuristics. A meta-scheduler might
choose to implement other policies such as maximizing throughput or resource utilization.
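The example heuristics above can be sketched as a simple decision procedure. This is a hypothetical structure (place, combo_affinity and the cluster representation are all assumptions), not the system's actual interface:

```python
def place(app_size, clusters, combo_affinity):
    """clusters: {name: node_count}. combo_affinity(combo) returns
    the overall affinity measure of running the application on that
    tuple of clusters. Returns the chosen cluster combination,
    following the example heuristics in order."""
    fitting = [c for c, n in clusters.items() if n >= app_size]
    if len(fitting) == 1:
        return (fitting[0],)  # only one big-enough cluster: use it
    if len(fitting) > 1:
        # several fit: pick the one with the best affinity measure
        return (max(fitting, key=lambda c: combo_affinity((c,))),)
    # none fits alone: pick the best-scoring pair of clusters
    names = list(clusters)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=combo_affinity)

score = lambda combo: {"A": 3, "B": 5, "C": 1}[combo[0]]  # toy affinity
print(place(8, {"A": 16, "B": 16, "C": 4}, score))  # ('B',)
```

A real meta-scheduler would add the asymmetric-split heuristic and its own policies (throughput, utilization) on top of this skeleton.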
3. Results
This section presents the results of several experiments which were per-
formed to ascertain the usefulness of the performance surface model. First, the
environment under which the experiments were performed and the procedure fol-
lowed is described. Next, each experiment is presented along with the results.
Lastly, the results are discussed.
3.1. Environment and Procedure
To achieve a heterogeneous environment, machines from various labs were
used. Machines from the Computer Science Department at Brigham Young Uni-
versity, the Mathematics department at BYU, and the Swarm cluster at Oregon
State University were used. The MPI communication API was used for all appli-
cations. A secure shell mechanism, SSH, was used for security purposes and also
because there was no other way to communicate to and from the Swarm cluster.
A slight performance hit might be incurred, but it is uniform across all experiments, so it should not affect the results.
Because network performance is only one of the performance factors in parallel program execution, efforts were made to eliminate other differences in the machine configurations. Other parameters which can come into play during parallel execution are: OS used, amount of RAM on the system, and CPU type and speed. The majority of the machines had Pentium II, 300 MHz CPUs with 128 MB of memory running Linux.
The procedure for each experiment is roughly as follows:
• The metacomputer's links are analyzed using the NetBench utility,
• The application is run once on a large cluster to get a signature for that application (i.e., a trace file), and
• The results from NetBench and the application's signature are fed to the Affinity program, which analyzes the data and makes placement decisions.
For each experiment, several parallel applications were used. These applications were selected to represent the behavior of typical parallel programs. Several of the applications used are part of the widely recognized NAS Parallel Benchmark suite (NPB 2.3) [1]. From the NPB package, three benchmarks were selected: the LU, MG and EP benchmarks². The LU benchmark is a simulated CFD application
which uses symmetric successive over-relaxation (SSOR) to solve a block lower
triangular / block upper triangular system of equations. The MG benchmark
uses a multigrid method to compute the solution of the three-dimensional scalar
Poisson equation. The EP benchmark is an Embarrassingly Parallel application
where each node generates a large number of random numbers independent of
the other nodes. These three applications span a wide range of communication
frequency: LU has heavy communication, MG has medium communication, and
EP has light communication (practically no communication). All of the NPB
benchmarks have symmetric communication patterns. To test the ability of the
system to exploit asymmetric communication patterns, a homegrown benchmark
application was developed. This new benchmark will be described and justified in Section 3.2.3.
For comparison purposes, the timing results from our system's placement
were compared with both a random and a round-robin placement strategy. It is
assumed that the system has no knowledge about how each machine is linked to
the other machines in the system. For round-robin placement, a starting machine
is randomly picked from the list of machines in the metacomputer and the rest
of the nodes are picked sequentially { the list is circular so that processor usage
wraps when the end of the list is reached. This process is repeated for each
iteration of the experiment. For random placement, an array of selected nodes is
used to keep track of which node has been selected; the number of nodes required
for the algorithm are selected randomly. Ten iterations of each placement method
are made for each experiment; the results are then averaged out. The sections
below describe each of the experiments in greater detail and give the results.
² A Class A problem size was used for both LU and EP; a Class W problem size was used for MG.
3.2. Experimental Results
Three experiments were developed to validate different functions of the system. The first experiment was designed to verify the ability of the system to identify homogeneous clusters in the heterogeneous network. The second experiment was designed to verify its ability to rate the efficiency of various combinations of clusters and pick the best one. The third experiment was designed to verify the
system's ability to analyze an application's communication pattern and exploit
any asymmetric communication pattern.
Because of the limited resources available, all applications used for these experiments were compiled to use 8 nodes. The procedure outlined in Section 3.1 was
used in all three experiments. That is, for each experiment, the NetBench ap-
plication was used to characterize the metacomputer's network, each application
was traced, and the Affinity application was used to determine the placement
of the parallel application. Using the system's placement, the applications were
run 10 times and the average execution time was computed. For comparison, the
same application was run using random placement and round-robin placement
(again 10 times for each placement strategy).
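The timing procedure amounts to a simple averaging loop; a minimal sketch (the names and structure here are ours, assuming `run_app` launches the application on the given placement and blocks until it completes):

```python
import statistics
import time

def average_runtime(run_app, placement, iterations=10):
    """Run the application on a fixed placement several times and
    return the mean wall-clock execution time in seconds."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_app(placement)           # blocks until the parallel run finishes
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```

The same loop is applied to the surface, round-robin, and random placements, so all three strategies are compared on equal footing.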
The following three subsections describe each experiment in more detail.
Each subsection explains the metacomputer configuration used, the applications
used, and the results of the given experiment. Section 3.3 then offers a more
elaborate discussion of the results.
3.2.1. Experiment 1 – Finding Homogeneous Clusters
The first experiment was designed to verify the ability of the system to
identify homogeneous clusters in a heterogeneous metacomputer network. To
this end, a metacomputer was formed which spanned two clusters: the Computer
Science lab cluster and the Swarm cluster. Eight machines were used from each
cluster. The applications selected were the LU, MG and EP benchmarks.
Figure 11 shows the timing results for the EP, MG and LU benchmarks. In
each case, a smaller average execution time is better. The figure clearly shows
that performance surface placement consistently outperforms the others. Case in
point: for the MG benchmark, the surface placement was 21 times faster than
round-robin.
3.2.2. Experiment 2 – Choosing Clusters
The second experiment was designed to verify the system's ability to rate
the efficiency of various combinations of clusters and pick the best one. Three
clusters were used: the Computer Science cluster, the Swarm cluster (connected
through slow Internet links) and the Math department cluster (connected through
10 Mbps Ethernet). The application suite was the same as for Experiment 1 (the
LU, MG and EP benchmarks), but this time only 4 machines were used on each
cluster. Since the application required 8 machines, the system was forced to span
the application across at least two of the clusters. Thus, this experiment verifies
the system's ability to pick the pair of clusters which works best together.
Figure 12 shows the timing results for the EP, MG, and LU benchmarks.
Again, the random and round-robin placement strategies are compared with our
system's placement, which consistently equals or outperforms the other place-
ments. As an example, for the MG benchmark the surface placement was over
3 times faster than round-robin.
3.2.3. Experiment 3 – Exploiting Asymmetric Communication
The third experiment's goal was to verify the ability of the system to analyze
an application's communication pattern and exploit any asymmetry in it. Since
the communication patterns of the applications in the NPB suite are symmetric,
a new benchmark was devised.
The new benchmark, called BiTalk, mimics a class of parallel applications
which use functional decomposition rather than domain decomposition. For an
in-depth discussion of functional and domain decomposition, see [7]. Typically,
when functional decomposition is used, natural clusters of communication are
formed along the functional decomposition lines. For example, a weather simu-
lation program could be functionally decomposed into an atmospheric model, a
hydrology model, an ocean model, and a land surface model. Each of these models
can perform computations independently of the others. Because of data depen-
dencies between the models, these clusters of nodes (models) also need to com-
municate with each other and exchange information. For scalability, each of these
functional units is further decomposed using domain decomposition; thus each
model spans several nodes. Generally, communication within each model is more
frequent than communication between the models. Thus, by analyzing the com-
munication patterns of the application as a whole, these communication clusters
can be identified and exploited.
The BiTalk application simulates two communication clusters where significant
communication takes place within the clusters and little between them.
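The pattern BiTalk induces can be pictured as a per-pair message-count matrix; in this hypothetical sketch, the actual message counts are not given in the paper, so `intra` and `inter` are placeholder weights chosen only to show heavy intra-cluster and light inter-cluster traffic:

```python
def bitalk_traffic(n_nodes, intra=100, inter=1):
    """Message-count matrix for two equal communication clusters:
    the first half of the node list talks heavily among itself, as
    does the second half, with only light traffic across the halves."""
    half = n_nodes // 2
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            same_half = (i < half) == (j < half)
            matrix[i][j] = intra if same_half else inter
    return matrix
```

A placement system that traces this matrix can see the two dense blocks on the diagonal and map each block to its own network cluster.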
For this experiment two distant clusters were used: the Computer Science
cluster and the Swarm cluster. A total of 8 nodes (four on each cluster)
were used. This new benchmark clearly has an asymmetric communication
pattern which can be exploited by the system. Figure 13 shows the results of
the experiment. Again, the results are compared with random and round-robin
placement. This figure also shows the execution time of BiTalk when run on a
single homogeneous cluster. As the results show, the difference in execution
time between the surface placement on two distant clusters and a manual
placement on a single cluster is negligible. Here again, the surface placement
clearly outperforms both random and round-robin placement. Using the BiTalk
benchmark (Figure 13), the surface placement was 30 times faster than round-
robin.
3.3. Discussion
In the first experiment, the performance surface system was able to identify
both of the clusters (BYU's Computer Science cluster and the Oregon State
University Swarm cluster). Following the heuristic, it placed all of the tasks in
the Swarm cluster. The round-robin and random placements did not fare as well,
since most of the runs included nodes from both clusters. The system's placement,
on average, outperformed the random placement by a factor of 13 and the
round-robin placement by a factor of 10.
Again, in the second experiment, the performance surface system was able
to identify the clusters (BYU's Computer Science, BYU's Math and the ORST
Swarm cluster). This time it had to spread the task across two clusters, since
no single cluster had enough nodes for the entire parallel application. The
system spread the nodes between the Computer Science and Math clusters
(using the affinity measures and the heuristics). This made sense, since network
performance between the two departments at BYU is much better than between
either of these clusters and the Swarm cluster at Oregon State University. As
in the first experiment, the round-robin and random placements did not fare
as well since, on occasion, some nodes from the Swarm cluster were included.
The system's placement, on average, outperformed the random placement by a
factor of 3 and the round-robin placement by a factor of 2.6.
In the third experiment, the system was able to identify the communication
clusters. The task of placement was then simplified to mapping the communica-
tion clusters onto the network clusters. BiTalk was designed to use half of the
node list for the first communication cluster and the other half for the second.
This design actually helped the round-robin placement method, since all that
was required for the placement to be optimal was to select all nodes from one
cluster for the first half of the list and all nodes from the other cluster for the
latter half. Again, the system's placement, on average, outperformed the random
placement by a factor of 42 and the round-robin placement by a factor of 30.
Another interesting fact is the narrow timing difference between the application
running on two clusters using the system's placement and the application running
on a single cluster. This shows that a certain class of applications (those with
asymmetric communication patterns) can potentially span clusters with minimal
performance degradation.
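The mapping step described above can be sketched as a greedy assignment (an illustrative heuristic only, not the paper's affinity-based algorithm; all names here are hypothetical):

```python
def map_comm_to_net(comm_clusters, net_clusters):
    """Assign each communication cluster to a single network cluster,
    largest communication clusters first, always taking the network
    cluster with the most free machines.  Assumes every communication
    cluster fits entirely inside some network cluster."""
    free = {name: list(nodes) for name, nodes in net_clusters.items()}
    placement = {}
    for group in sorted(comm_clusters, key=len, reverse=True):
        # network cluster with the most remaining capacity
        best = max(free, key=lambda name: len(free[name]))
        for app_node in group:
            placement[app_node] = free[best].pop()
    return placement
```

For BiTalk's two four-node communication clusters and two four-node network clusters, such a mapping keeps all heavy traffic inside a cluster and sends only the light inter-cluster traffic over the slow link.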
On average, the round-robin placement strategy performed better than ran-
dom for all three experiments. This can be attributed to the fact that the list of
nodes was arranged such that clusters were grouped together. This made it more
likely to select nodes from the same cluster using the round-robin method.
Applications with medium to heavy communication patterns (LU and MG in
this case) benefited more from our system than applications with little
communication (the EP benchmark). Since the only communication cost involved
with the EP benchmark was the startup cost, there was very little benefit from
using performance surfaces to assist in the placement. In fact, other factors such
as CPU speed outweighed the communication cost and proved to be more important.
This can be seen in Experiment 2 for the EP benchmark: the time for the random
placement is actually better (by a few seconds) than that of the placement
proposed by our system. In that experiment, the system placed the application
such that it spanned the CS and Math department clusters. This would have made
sense had the application communicated frequently, since the link performance
between these two clusters is superior. The problem was that the CS cluster's
machines have slightly slower CPUs (266 MHz, as opposed to 300 MHz for the
rest of the machines). Because the system assumed that all CPUs were equivalent,
it did not predict that EP would run better on the faster CPUs. Future work
will factor CPU speed into performance surface placement.
In general, an application which has very uniform communication charac-
teristics cannot be placed across heterogeneous networks without suffering a
performance loss (compared with the same application running on a single cluster
of equivalent speed). If, however, communication clusters can be found, it is
possible to get results very similar to those of the same application running on
a single cluster. This can be seen in the results of Experiment 3, which compare
the time for running BiTalk on two clusters using performance surfaces against
running the application on one cluster.
Using the performance surface system never had a substantial negative effect
on performance, and in many cases it yielded large performance gains. Across
the three experiments performed, the performance surface placement was, on
average, over 19 times faster than random placement and over 14 times faster
than round-robin placement. Overall, the above experiments show the viability
of performance surfaces as a network performance prediction tool.
4. Conclusion
The results presented in this paper have shown the usefulness of performance
surfaces as a performance prediction model. The tests conducted clearly show
that, on average, the placement produced by performance surface prediction is
superior to round-robin and random placement. Additionally, it has been shown
that the developed system can rapidly detect homogeneous clusters in a hetero-
geneous network; it also allows for easy comparison of these clusters. Finally,
when asymmetric communication patterns are present, the system can detect
them and place the application over several clusters with minimal performance
degradation.
We would like to thank Michael Quinn and Oregon State University for the
use of their Swarm cluster for these experiments.
References
[1] David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995.
[2] Francine D. Berman, Rich Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application-Level Scheduling on Distributed Heterogeneous Networks. In Supercomputing '96 Conference Proceedings, August 1996.
[3] Mark J. Clement, Glenn M. Judd, Joy L. Peterson, Bryan S. Morse, and J. Kelly Flanagan. Performance Surface Prediction for WAN-Based Clusters. In Proceedings of the 31st Hawaii International Conference on System Sciences, HICSS-31, January 1998.
[4] Silvia M. Figueira and Francine Berman. Predicting Slowdown for Networked Workstations. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 92, August 1997.
[5] Steven Fitzgerald, Ian Foster, Carl Kesselman, Gregor von Laszewski, Warren Smith, and Steven Tuecke. A Directory Service for Configuring High-Performance Distributed Computations. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 365, August 1997.
[6] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 11(2):115-128, 1997.
[7] Ian Foster. Designing and Building Parallel Programs. Addison Wesley, New York, 1995.
[8] Glenn M. Judd, Mark J. Clement, and Quinn O. Snell. DOGMA: Distributed Object Group Management Architecture. Technical Report BYU-NCL-97-102, Brigham Young University, 1997.
[9] John F. Karpovich. Support for Object Placement in Wide Area Heterogeneous Distributed Systems. Technical Report CS-96-03, University of Virginia, January 1996.
[10] JunSeong Kim and David J. Lilja. Utilizing Heterogeneous Networks in Distributed Parallel Computing Systems. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 336, August 1997.
[11] Michael J. Lewis and Andrew Grimshaw. The Core Legion Object Model. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, August 1996.
[12] Jennifer M. Schopf. Structural Prediction Models for High-Performance Distributed Applications. In Proceedings of the Cluster Computing Conference, 1997.
[13] Jennifer M. Schopf and Francine Berman. Performance Prediction in Production Environments. In Proceedings of IPPS/SPDP, 1998.
[14] W. Smith, I. Foster, and V. Taylor. Predicting Application Run Times Using Historical Information. In The 4th Workshop on Job Scheduling Strategies for Parallel Processing, March 1998.
[15] Quinn O. Snell and John L. Gustafson. HINT: A New Way to Measure Computer Performance. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS-28, January 1995.
[16] Ilya M. Sobol. A Primer for the Monte Carlo Method. CRC Press, Boca Raton, 1994.
[17] Rich Wolski. Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 316, August 1997.
Figure 1. Example of Performance Surface: a) Small message sizes with low average latency and less variance. b) Large message sizes with higher average latency and more variance.

Figure 2. Example of network node / application node mapping.

Figure 3. The formation of performance surfaces. a) The individual probability distributions, representing the probability of a delay t given a message size m. b) The concatenation of the individual distributions to form the performance surface.

Figure 4. Augmented definition of graph.

Figure 5. Weighted convolution: APD << NPD.

Figure 6. Weighted convolution: APD < NPD.

Figure 7. Weighted convolution: APD > NPD.

Figure 8. Aggregating the network into clusters.

Figure 9. Calculating the RNS value for high/low speed links.

Figure 10. Example histogram of RNS values.

Figure 11. Experiment 1: average execution times (min:sec) for the EP, MG and LU benchmarks under surface, round-robin and random placement.

Figure 12. Experiment 2: average execution times (min:sec) for the EP, MG and LU benchmarks under surface, round-robin and random placement.

Figure 13. Experiment 3: average execution times (min:sec) for the BiTalk benchmark under single-cluster, surface, round-robin and random placement.