Preprint 0 (2001) ?–? 1
Object Placement Using Performance Surfaces
André Turgeon, Quinn Snell and Mark Clement a
a Computer Science Department, Brigham Young University, Provo, Utah 84602-6576,
E-mail: {andre,snell,[email protected]
Heterogeneous parallel clusters of workstations are being used to solve many important computational problems. Scheduling parallel applications on the best collection of machines in a heterogeneous computing environment is a complex problem. Performance prediction is vital to good application performance in this environment since utilization of an ill-suited machine can slow the computation down significantly. This paper addresses the problem of network performance prediction. A new methodology for characterizing network links and an application's need for network resources is developed which makes use of Performance Surfaces [3]. This Performance Surface abstraction is used to schedule a parallel application on the resources where it will run most efficiently.
Keywords: Performance Surfaces, Prediction, Statistical Model, Parallel, Scheduling
1. Introduction
Clusters of workstations are being used in many production environments
to solve large computational problems. Several metacomputer systems have been
proposed and are currently being developed to utilize clusters and traditional su-
percomputers to solve large problems. A few examples are Legion [11], Globus [6]
and DOGMA [8]. Despite great strides by prior metacomputer developers, there
is still much work to be done in the area of meta-scheduling. The heterogeneity inherent in metacomputers introduces several interesting problems. These problems include security issues, reliability, and performance prediction. This
paper will concentrate on performance prediction. More particularly, this paper
will concentrate on the networking aspects of performance prediction. The intent
is to use network performance predictions to make placement decisions when a
large number of platform alternatives exist.
Methods of characterizing and predicting network performance have been
studied by several authors [13,14,10,4,12]. They all use some form of statistical model. The prediction results are generally given in one of two forms: (1) a fixed point value, which is a single number, or (2) a stochastic value, which is a distribution representing a set of possible values and their probabilities.
Schopf [12] claims that a stochastic value more accurately represents the performance of the network than fixed point values. This approach is the basis for performance surfaces (Section 2 will define the concept of a performance surface in greater detail).
Object placement strategies (or more generally, meta-scheduling strategies)
have been studied by several authors [2,9,5]. Many of the systems use some
form of dynamic performance prediction (such as the Network Weather System
[17]) to help in the object placement decision. This research uses a benchmark application to collect performance information and forms a multi-dimensional stochastic model in order to make object placement decisions.
The goal of this research is to develop a system which automatically places
parallel applications in a heterogeneous environment while minimizing execution
time. Simplistically, this is done by placing pairs of tasks which communicate
heavily on a pair of nodes with good communication characteristics. Conversely,
tasks which seldom communicate can span a poor network link without negatively affecting performance. To accomplish this goal, it is necessary to adequately characterize the performance of network links and the load which a particular application will place on the network (on a link-by-link basis). It is also necessary, given these network and application characterizations, to develop a way to mathematically determine how well suited an application is for a given network – an affinity measure. A novel performance model (performance surfaces) is used to accomplish this.
This paper develops a performance prediction system which can be used as part of a meta-scheduler. More precisely, this system is meant to assist in object placement, a subset of scheduling. Scheduling is the process of deciding where to place a task and when to schedule it. Object placement is simply the where part of scheduling. The system is divided into three components. The first component characterizes the performance of the network. The second component characterizes an application's demand for network resources. The final component uses the data collected by the first two components and calculates affinity measures for various placement configurations. Using these affinity measures and guided by heuristics, the system makes educated placement decisions. Although
the placement heuristics are meant to be part of the meta-scheduler, some are built into our system for testing purposes.
2. Performance Surfaces
Performance surfaces are the basis for the performance prediction model
developed in this paper. A performance surface is a series of probability distri-
butions for discrete elements put back to back, thus forming the surface. Figure
1 is an example of a performance surface. In this example, each discrete element
is a message size. The other two dimensions are used to represent the probability
distribution for messages of that size. More precisely, given a message size, the distribution gives the probability that the latency falls within a given time interval.
Much information is contained in the performance surface shown in Figure 1. In this example, the smaller message sizes (Figure 1(a)) are likely to have low average latency. The variance in the latency distribution for these message sizes is small. This can be seen by the high ridge at the back of the distribution graph. The larger message sizes (Figure 1(b)) are more likely to have larger average latency. The variance in the latency distribution for these messages is larger. This can be seen by the location of peaks closer to the front of the graph. The distribution is also flatter, which indicates that these larger message sizes have less predictable delays.
For this research, two different surfaces are used. The first surface, called the Network Performance Surface (NPS), characterizes the network links. The other, called the Application Performance Surface (APS), characterizes an application's use of network resources. By combining the information contained in these two surfaces, it is possible to make placement decisions.
Let us define what is meant by network links and application links. Network
links, in the context of this paper, are the set of all links connecting the nodes
used by the metacomputer. Application links are the set of all communication
pathways an application uses during the course of its execution. These pathways
are not physical network links; instead, they represent the communication paths
used by the application.
Another difference between the two types of characterizations is that the network surface is tied to a physical network and is therefore "fixed" by nature. The application characterization, however, can be changed as long as link dependencies are kept intact. For example, Figure 2 is a graphical representation of two such mappings. In Figure 2(a) the application graph App is unchanged;
in Figure 2(b), the application graph has been rotated counter-clockwise. Notice the heavier lines in the figure. In the case of the application graph, App, a heavier line between two nodes represents heavier communication between those two nodes. In the case of the network graph, Net, a heavier line represents a faster network link (capable of sustaining heavier traffic). It is clear that the mapping of Figure 2(b) will give superior performance to the mapping of Figure 2(a). In Figure 2(a), the application's heavily communicating link between nodes 1 and 2 is mapped to a poor network link (as indicated by the thin line). Likewise, the application's other heavily communicating link between nodes 3 and 4 is also mapped to a poor network link. This situation is rectified in Figure 2(b), where both of the application's heavily communicating links are mapped to superior network links. This example illustrates the usefulness of characterizing every link of both the network and application. Using the information given by these characterizations, a properly designed system can analyze various mapping scenarios and determine which results in better performance.
2.1. Definitions
General performance surfaces are defined as follows: If the distribution of service time t is measured for various service requests, which can be described by a set of n characteristics r = (r1, r2, ..., rn), the following conditional probability is obtained:

surface(t, r) = P(t | r)    (1)
This two-dimensional probability distribution forms a surface where the x-axis represents the characteristic r, the y-axis represents the service time t, and the z-axis represents the probability that r takes time t to be serviced. Two instances of performance surfaces are used in this research: a network performance surface (NPS) and an application performance surface (APS). The set of characteristics used for these performance surfaces is a set m = (m1, m2, ..., mn) of message sizes.
For the NPS, netsurf, the distribution of service time t for a given message
size m represents the distribution of network delays intrinsic to the underlying
network when sending a message of size m.
netsurf(t, m) = P(t | m)    (2)
For the APS, appsurf, the distribution of service time t represents the distribution of computational periods between communication operations, or grain size.

appsurf(t, m) = P(t | m)    (3)
Let us give a concrete example of these mathematical representations. For this example, Equation 2 is used with constant values for the message size variable m. Figure 3(a) gives three delay probability distributions where m = 32 Kbyte, m = 64 Kbyte, and m = 94 Kbyte respectively. By concatenating all the distributions for the values of m sampled, a surface emerges, as can be seen in Figure 3(b).
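The construction just described can be sketched in code. The sketch below is a simplified illustration, not the authors' implementation; all names (build_surface, delay_bins, bin_width) are hypothetical. It forms a discrete performance surface by bucketing measured latencies into one normalized histogram per sampled message size:

```python
from collections import defaultdict

def build_surface(samples, delay_bins, bin_width):
    """samples: list of (message_size, latency) measurements.
    Returns {message_size: [P(bin_0), P(bin_1), ...]} -- one
    normalized delay distribution per message size; concatenated
    over all sizes, these form the performance surface."""
    counts = defaultdict(lambda: [0] * delay_bins)
    for size, latency in samples:
        b = min(int(latency / bin_width), delay_bins - 1)
        counts[size][b] += 1
    surface = {}
    for size, hist in counts.items():
        total = sum(hist)
        surface[size] = [c / total for c in hist]
    return surface

# Example: two message sizes with a few latency samples each.
surf = build_surface(
    [(32, 1.0), (32, 1.2), (32, 5.0), (64, 4.0), (64, 4.5)],
    delay_bins=4, bin_width=2.0)
# Each row is a probability distribution and sums to 1.
```

Each row of the result plays the role of one slice of the surface in Figure 3(b).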
2.2. Representation of the Network and Application
An NPS and APS, as defined in the last section, refer to individual network and application links respectively. In this section, a notation for referring to the entire network or application is introduced.
For the purpose of this paper, the network connecting a cluster is logically viewed as a fully connected graph. The definition of a graph is augmented to include, in addition to the set of vertices and edges, a set of performance surfaces. Figure 4 shows an example of a network. In this work, the vertices of the graph are called nodes, and the edges are called links. In addition to these components, each link has a performance surface associated with it. Formally, a network graph, NET, is defined as follows:
NET = G(V, E, S),    (4)

where V and E are the sets of vertices and edges respectively, and where S is a set of Network Performance Surfaces (NPS) – one for each edge in the graph. Similarly, an application graph, APP, is defined to be:

APP = G(V, E, S),    (5)
where S is a set of APSs. The arity of a NET or APP, denoted by |NET| or |APP|, is defined to be the number of vertices in that graph. In general, it is assumed that |NET| ≥ |APP|.
2.3. Network Performance Surface Creation
An accurate characterization of network performance involves at least three
dimensions: (1) message size, (2) communication time, and (3) communication
endpoints. A Network Performance Surface is the performance characterization
of one link in the network. For each surface, given a message size, there is a
probability distribution for the delay associated with sending a message of that
size. To generate these distributions, it is necessary to benchmark each link by
sending messages of di�erent size and recording the delay (the time taken for the
reply divided by 2). To get a complete picture of the network, each link must be
characterized.
The problem of generating all the network surfaces can be approached in two ways: (1) exhaustive testing of all possibilities (which is impractical for large configurations), or (2) Monte Carlo [16] approximation. Both approaches select values for partner and message size. The way these values are selected, however, is different. The exhaustive testing approach sequentially tests all message sizes on all possible network links. The Monte Carlo approximation, which we use, randomly selects message sizes and specific network links and converges on the complete surface without testing all possibilities.
The network benchmark algorithm runs on all nodes. Each node commu-
nicates with every other node, in a ping pong fashion as described in [15]. The
partner selection for this ping pong process is random. Likewise, the message size
used is random. Another random variable is the period of time to wait between
communication tests (sample collection). Because of the random waiting time,
this algorithm measures the network's response to congestion which, for shared
media networks, decreases the effective bandwidth.
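One sampling step of this randomized benchmark loop might look like the following sketch. This is illustrative only: the actual NetBench utility and its MPI calls are not shown, and all names (netbench_step, ping_pong) are assumptions.

```python
import random
import time

def netbench_step(my_rank, nodes, max_size_kb, ping_pong):
    """One Monte Carlo sample: pick a random partner, message size,
    and inter-test wait, then time a ping-pong exchange.
    `ping_pong(partner, size_kb)` is a stand-in for the real MPI
    round-trip and returns the round-trip time in seconds."""
    partner = random.choice([n for n in nodes if n != my_rank])
    size_kb = random.randint(1, max_size_kb)
    time.sleep(random.uniform(0.0, 0.05))  # random wait exposes congestion effects
    rtt = ping_pong(partner, size_kb)
    delay = rtt / 2.0                      # one-way delay = round-trip / 2
    return (partner, size_kb, delay)       # one (link, size, delay) sample

# Usage with a fake ping_pong for illustration:
fake = lambda partner, size_kb: 0.001 * size_kb
sample = netbench_step(0, [0, 1, 2, 3], max_size_kb=64, ping_pong=fake)
```

Accumulating many such samples per link and feeding them into a histogram-based surface builder yields the NPS for that link.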
2.4. Application Performance Surface Creation
The characterization of an application is very similar to the characterization
of a network. Like the network characterization, it contains the following three
dimensions: (1) message size, (2) frequency of communication (delay), and (3)
communication endpoints. Although "delay" is normally viewed as a negative characteristic, in the case of an application, larger values of delay (or larger grain size values) will increase the application's affinity to a given machine. These
measurements are gathered for each node pair where communications occur in
the application.
To collect this information, a trace of the running application must be collected. Since the application must be run before it can be characterized, the benefit of this system will only appear over a series of runs. Current experiments perform application characterization on a homogeneous cluster in order to minimize perturbation due to differences in hardware. Future work will investigate gathering trace information every time an application is run and then factoring out machine characteristics in order to arrive at the same surface.
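As a rough illustration of how APS samples could be derived from such a trace (a hypothetical trace format, not the authors' tooling), one can record, for each application link, the compute interval between successive sends together with the message size sent:

```python
from collections import defaultdict

def aps_samples(trace):
    """trace: chronological list of (timestamp, src, dst, msg_size)
    send events. Returns {(src, dst): [(msg_size, grain), ...]},
    where `grain` is the compute time since the previous send on
    that application link -- the delay axis of the APS."""
    last_send = {}
    samples = defaultdict(list)
    for t, src, dst, size in trace:
        link = (src, dst)
        if link in last_send:
            samples[link].append((size, t - last_send[link]))
        last_send[link] = t
    return samples

trace = [(0.0, 0, 1, 32), (1.5, 0, 1, 32), (1.6, 0, 1, 64)]
s = aps_samples(trace)
```

The resulting (size, grain) samples per link can then be histogrammed into an APS exactly as the NPS is built from (size, delay) samples.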
2.5. Affinity Measure Derivation
Once a characterization of both the network and the application is complete,
the next step is to use this information to make placement decisions. This is accomplished by combining the information from these two characterizations in order to predict how a given network will react to the load presented by the application.
As stated earlier, a performance surface is a series of delay probability distributions – one distribution for each message size. The term application probability
distribution or APD will be used to denote an application's delay probability dis-
tribution for a given message size. The term network probability distribution or
NPD will be used to denote a network's delay probability distribution for a given
message size.
2.5.1. Weighted Convolution
A method was devised to translate the overlap found between the APD and NPD into an affinity measure. The method is based on convolution as used in signal
processing. The signal is analogous to the application surface and the impulse
response is analogous to the network surface. The convolution between a signal
and the impulse response results in a "prediction" of the output of the system
given that signal.
Weighted Convolution is similar to traditional convolution except that a
weight is given to each product of the convolution based upon the relative distance
between the network and application surfaces. If the average delay for the APD is less than the average delay for the NPD, the weight is small. As the average delay for the APD increases, the weight also grows. Figures 5, 6 and 7 show three scenarios.
In Figure 5, the average delay for the APD is far less than the average delay for the NPD. In other words, the application's average delay between communications is smaller than the network's average delay for transmitting a message of that size. The convolved distribution's area is smaller than in the other two graphs and denotes a smaller affinity measure.

In Figure 6, the average delay for the APD is still smaller than the average delay for the NPD, but this time, there is greater overlap. The convolved distribution's area is larger than in Figure 5 to reflect the greater affinity.

Finally, in Figure 7, the average delay for the APD is greater than the average delay for the NPD. In other words, the application spends more time computing than communicating. This means the network should be able to deliver messages before the nodes have completed the computation. The resulting convolved distribution's area is the largest, denoting the best affinity measure of the three examples.
Let us now give a more formal definition for affinity. First, the affinity surface, from which the affinity measure is derived, is defined. Given a network surface netsurf and an application surface appsurf, an affinity surface affsurf is defined as:

affsurf(m, d) = Σ_{k=0}^{mDelay} netsurf(m, d) · appsurf(m, k) · k/(k + d),    (6)

where m, d, and mDelay represent a given message size, a given delay, and the maximum delay respectively.
A compelling validation for weighted convolution is to think of the weight factor k/(k + d) of Equation 6 as an efficiency weight. Let us derive the weight factor from the definition of efficiency. The standard definition for efficiency is as follows:
E = S / P,    (7)
where S is the speedup and P is the number of processors used. Speedup, in turn, is defined to be the ratio of the serial execution time over the parallel execution time, or T1/TP. The parallel execution time TP can further be decomposed into the sum of computation time (Tcomp), communication time (Tcomm) and idle time (Tidle). Substituting the expanded definition of speedup into Equation 7 yields the following definition for efficiency:

E = T1 / (P(Tcomp + Tcomm + Tidle)).    (8)
Writing Tcomp in terms of T1 yields Tcomp = T1/P. The sum of communication time and idle time can be thought of as the sum of the delays generated by each processor's communication, D, divided by the number of processors. In other words, D/P = Tcomm + Tidle. The ratio D/P can be thought of as an average of the sum of delays. Substituting these values for Tcomp and Tcomm + Tidle yields the following definition:

E = T1 / (P(T1/P + D/P)).    (9)
This definition can be simplified by canceling P and yields:

E = T1 / (T1 + D).    (10)
This final definition is analogous to the weight factor k/(k + d), where k, the application's grain size, corresponds to T1 and d, the network's delay, corresponds to D. In essence, Equation 6 convolves two distributions but gives each term of the summation a weight that is based on efficiency. This weight factor is only an approximation of efficiency since Equation 10 applies to the entire application whereas weighted convolution deals with individual links. Nevertheless, it appears from the experimental results that it is a good approximation.
Finally, given an affinity surface, affsurf, the affinity measure, affinity, is defined as follows:

affinity = Σ_{m=0}^{mSize} Σ_{d=0}^{mDelay} affsurf(m, d),    (11)

where mSize is the maximum size represented by the finite NPS and APS¹. This is simply the volume under the affinity surface.

¹ The NPS and APS must have matching dimensions for these operations to make sense.
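Equations 6 and 11 translate directly into a short computation. The sketch below is an illustrative rendering, not the authors' code; it represents the NPS and APS as 2-D arrays indexed by message size and delay bin:

```python
def affinity(netsurf, appsurf):
    """netsurf, appsurf: 2-D lists of equal shape, where
    surf[m][d] = P(delay bin d | message size m).
    Implements Eq. 6 (weighted convolution) and Eq. 11
    (volume under the affinity surface)."""
    m_size = len(netsurf)
    m_delay = len(netsurf[0])
    total = 0.0
    for m in range(m_size):
        for d in range(m_delay):
            # Eq. 6: weight each product by the efficiency factor k/(k+d)
            affsurf_md = sum(
                netsurf[m][d] * appsurf[m][k] * (k / (k + d) if k + d > 0 else 0.0)
                for k in range(m_delay))
            total += affsurf_md  # Eq. 11: sum over the whole surface
    return total

# A coarse-grained application (mass at high k) on a fast network
# (mass at low d) should score higher than on a slow network.
fast_net = [[0.9, 0.1]]
slow_net = [[0.1, 0.9]]
coarse_app = [[0.1, 0.9]]
print(affinity(fast_net, coarse_app) > affinity(slow_net, coarse_app))  # True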
2.6. Placement Problem Complexity
One problem with characterizing each link in a fully connected network of
computers is that the number of links grow geometrically with the number of
nodes. If n is the number of nodes then the number of links l can be calculated
by the following function:
l =n(n� 1)
2: (12)
There are many ways to map an application graph APP onto a network graph NET. Given a mapping, the sum of the individual affinity measures (one for each APS/NPS pair formed by the mapping) is the overall affinity measure for that mapping. The higher the measure, the better or more efficient the mapping. Finding the mapping which gives the highest affinity sum is a complex process. The search space S to evaluate all permutations is as follows:

S = |NET|!                        if |NET| = |APP|,
S = |NET|! / (|NET| − |APP|)!     if |NET| > |APP|.    (13)

In fact, this problem is analogous to the traveling salesman problem and is NP-complete, with O(N!) permutations to search. Fortunately, it is not necessary, as we shall see in the next section, to search through all permutations.
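To make Equations 12 and 13 concrete, the following small sketch (with hypothetical helper names) computes the link count and the size of the placement search space:

```python
from math import factorial

def link_count(n):
    """Eq. 12: links in a fully connected graph of n nodes."""
    return n * (n - 1) // 2

def search_space(net_nodes, app_nodes):
    """Eq. 13: number of placement permutations of an
    app_nodes-vertex application graph onto a net_nodes-vertex
    network graph (assumes net_nodes >= app_nodes)."""
    return factorial(net_nodes) // factorial(net_nodes - app_nodes)

print(link_count(8))        # 28 links, as in the two-site example of Section 2.7.1
print(search_space(16, 8))  # 16!/8! = 518918400 placements
```

Even modest configurations make the factorial growth obvious, which motivates the aggregation optimizations that follow.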
2.7. Optimizations
It must be noted that the information given by the set of performance sur-
faces characterizing both the network and application can be quite redundant.
One elegant way to reduce the search space is to reduce the number of elements
to analyze. It is often the case that homogeneous clusters exist inside the net-
work. By aggregating these homogeneous clusters into one node, the number
of placement permutations can be significantly reduced. Likewise, applications
often have homogeneous communication zones which can be aggregated. This
section examines these optimizations in more detail.
2.7.1. Aggregating Networks into Homogeneous Clusters
One way to reduce the search space drastically is to aggregate a network by finding homogeneous clusters. Consider the following metacomputer comprised of two sites, site A and site B. Sites A and B are clusters of computers running 100 Mbit Ethernet LANs. Both sites are linked through a nebulous link (i.e., the Internet). Figure 8(a) is a pictorial representation of our example metacomputer.
In this example, it is likely that all links connecting the nodes of site A will have
similar characteristics. Likewise, all links connecting the nodes of site B would
be very homogeneous. Lastly, both of these sets of links are likely faster than
the set of links spanning the two sites. Figure 8(b) shows the number of links
which would have to be searched if no aggregation was done. In this case, using
equation 12, the number of links is 28. If, however, all the homogeneous nodes at
each site are represented by one node, the number of links is reduced drastically.
Figure 8(c) shows the number of links which would have to be searched once the
network is aggregated. There is the link describing all connections between the
two sites, the link describing all connections between the nodes of site A, and
the link describing all connections between the nodes of site B. The aggregation
yields 3 links – a reduction of 25 links. As can be seen from this example, network
aggregation reduces the search space considerably.
The question then becomes, how does one find homogeneous clusters? The answer lies in finding a method to rate the speed of links relative to one another. A method was developed to do just this, using performance surfaces. To find the relative speed of each network link, or RNS (Relative Network Speed), each network surface is convolved (using weighted convolution) with an "identity surface" or "benchmarking application surface". Figure 9 shows the identity surface (middle surface).
The identity surface is used to discriminate between slow and fast network links. Figure 9 gives an example of the process. In this figure two links are rated: on the left, a high speed link; on the right, a low speed link. The same process is used to rate the two links. First, the link is convolved with the identity surface. This convolution yields the affinity surface. The affinity measure is derived from the affinity surface by calculating the volume under the affinity surface.
The RNS values of all the links are grouped together using a histogram. Figure 10 shows a typical histogram. As can be seen in that figure, RNS values will naturally clump into groups of nodes of similar speed. It is important to note that this grouping is not representative of the clusters. It is merely a grouping of the relative speeds of the links. In the previous example (Figure 8), all the links from the site A cluster and the site B cluster would be clumped together in one group and all the links between the two sites would be clumped into the other group. After this primary separation, it is necessary to analyze the link dependencies to determine the clusters.
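A rough sketch of this rating-and-grouping step follows. It is illustrative only: the RNS helper simply reuses a weighted-convolution affinity function, and the gap-based grouping is a crude, hypothetical stand-in for reading clumps off the histogram of Figure 10.

```python
def rns(netsurf, identity, affinity_fn):
    # Relative Network Speed: affinity of the link's NPS with the
    # identity (benchmarking application) surface.
    return affinity_fn(netsurf, identity)

def group_links(rns_values, gap):
    """rns_values: {link: rns_score}. Sort the scores and split
    wherever consecutive values are more than `gap` apart,
    yielding clumps of links with similar relative speed."""
    ordered = sorted(rns_values.items(), key=lambda kv: kv[1])
    groups, current = [], [ordered[0][0]]
    for (a, va), (b, vb) in zip(ordered, ordered[1:]):
        if vb - va > gap:
            groups.append(current)
            current = []
        current.append(b)
    groups.append(current)
    return groups

# Fast intra-site links clump apart from the slow inter-site link.
values = {("A1", "A2"): 0.94, ("A1", "A3"): 0.91,
          ("B1", "B2"): 0.89, ("A1", "B1"): 0.12}
print(group_links(values, gap=0.3))
```

As the text notes, these clumps group link speeds, not clusters; a subsequent link-dependency analysis is still needed to recover the actual node clusters.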
2.7.2. Aggregating Applications into Homogeneous Clusters
Homogeneity can also be found in the communication patterns of an appli-
cation. Indeed, it was found that many parallel algorithms use symmetrical com-
munication patterns. This means that like network graphs, application graphs
can also be aggregated into clusters.
The same technique which is applied to aggregate network graphs can be
applied to aggregate application graphs. This time, instead of relative network
speed, relative network requirements are measured. In other words, the clusters
represent areas of high/low communication. Again, each application surface is
convolved with an "identity surface". A histogram is formed and the clumps are
analyzed for link dependencies to form the application clusters.
If it is the case that the communication characteristics of the application
are uniform, then aggregating the application graph will yield one cluster. This
simply means that there is no "good" way to split the application to exploit slow
communication links.
In the case where the communication patterns are non-uniform, it is often
the case that clusters of communication can be found. The high-communication
clusters can then easily be mapped to network clusters with high speed network
connections. Application links with low communication requirements are paired
off with low performance network links.
2.8. Placement Heuristics
Once aÆnity measures for the various application/network cluster combi-
nations are calculated, placement decisions can be made. Although placement
heuristics really belong in a meta-scheduler, some were included in this system to
provide a proof of concept. The following example heuristics were implemented
in the system.
• If only one cluster is found which has sufficient nodes to accommodate the parallel application, use it.
• If more than one cluster is found (which is large enough to accommodate the application), use performance surfaces to determine which cluster has better characteristics.
• If more than two clusters exist and the application must span multiple clusters, use the affinity measure to determine which cluster combination is best.
• If the application displays asymmetric communication patterns, and must be split among clusters, split the application along communication cluster lines.
These are just a few examples of possible heuristics. A meta-scheduler might
choose to implement other policies such as maximizing throughput or resource utilization.
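The example heuristics above can be sketched as a simple decision procedure. This is a hypothetical structure (place, combo_affinity and the cluster representation are all assumptions), not the system's actual interface:

```python
def place(app_size, clusters, combo_affinity):
    """clusters: {name: node_count}. combo_affinity(combo) returns
    the overall affinity measure of running the application on that
    tuple of clusters. Returns the chosen cluster combination,
    following the example heuristics in order."""
    fitting = [c for c, n in clusters.items() if n >= app_size]
    if len(fitting) == 1:
        return (fitting[0],)  # only one big-enough cluster: use it
    if len(fitting) > 1:
        # several fit: pick the one with the best affinity measure
        return (max(fitting, key=lambda c: combo_affinity((c,))),)
    # none fits alone: pick the best-scoring pair of clusters
    names = list(clusters)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=combo_affinity)

score = lambda combo: {"A": 3, "B": 5, "C": 1}[combo[0]]  # toy affinity
print(place(8, {"A": 16, "B": 16, "C": 4}, score))  # ('B',)
```

A real meta-scheduler would add the asymmetric-split heuristic and its own policies (throughput, utilization) on top of this skeleton.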
3. Results
This section presents the results of several experiments which were per-
formed to ascertain the usefulness of the performance surface model. First, the
environment under which the experiments were performed and the procedure fol-
lowed is described. Next, each experiment is presented along with the results.
Lastly, the results are discussed.
3.1. Environment and Procedure
To achieve a heterogeneous environment, machines from various labs were
used. Machines from the Computer Science Department at Brigham Young Uni-
versity, the Mathematics department at BYU, and the Swarm cluster at Oregon
State University were used. The MPI communication API was used for all appli-
cations. A secure shell mechanism, SSH, was used for security purposes and also
because there was no other way to communicate to and from the Swarm cluster.
A slight performance hit might be incurred, but it is uniform across all experiments, so it should not affect the results.
Because network performance is only one of the performance factors in parallel program execution, efforts were made to eliminate other differences in the machine configurations. Other parameters which can come into play during parallel execution are: OS used, amount of RAM on the system, and CPU type and speed. The majority of the machines had Pentium II, 300 MHz CPUs with 128 MB of memory running Linux.
The procedure for each experiment is roughly as follows:
• The metacomputer's links are analyzed using the NetBench utility,
• The application is run once on a large cluster to get a signature for that application (i.e., a trace file), and
• The results from NetBench and the application's signature are fed to the Affinity program, which analyzes the data and makes placement decisions.
For each experiment, several parallel applications were used. These applications were selected to represent the behavior of typical parallel programs. Several of the applications used are part of the widely recognized NAS Parallel Benchmark suite (NPB 2.3) [1]. From the NPB package, three benchmarks were selected: the LU, MG and EP benchmarks². The LU benchmark is a simulated CFD application
which uses symmetric successive over-relaxation (SSOR) to solve a block lower
triangular / block upper triangular system of equations. The MG benchmark
uses a multigrid method to compute the solution of the three-dimensional scalar
Poisson equation. The EP benchmark is an Embarrassingly Parallel application
where each node generates a large number of random numbers independent of
the other nodes. These three applications span a wide range of communication
frequency: LU has heavy communication, MG has medium communication, and
EP has light communication (practically no communication). All of the NPB
benchmarks have symmetric communication patterns. To test the ability of the
system to exploit asymmetric communication patterns, a homegrown benchmark
application was developed. This new benchmark will be described and justified in Section 3.2.3.
For comparison purposes, the timing results from our system's placement
were compared with both a random and a round-robin placement strategy. It is
assumed that the system has no knowledge about how each machine is linked to
the other machines in the system. For round-robin placement, a starting machine
is randomly picked from the list of machines in the metacomputer and the rest
of the nodes are picked sequentially { the list is circular so that processor usage
wraps when the end of the list is reached. This process is repeated for each
iteration of the experiment. For random placement, an array of selected nodes is
used to keep track of which node has been selected; the number of nodes required
for the algorithm are selected randomly. Ten iterations of each placement method
are made for each experiment; the results are then averaged out. The sections
below describe each of the experiments in greater detail and give the results.
² A Class A problem size was used for both LU and EP; a Class W problem size was used for MG.
3.2. Experimental Results
Three experiments were developed to validate different functions of the system. The first experiment was designed to verify the ability of the system to identify homogeneous clusters in the heterogeneous network. The second experiment was designed to verify its ability to rate the efficiency of various combinations of clusters and pick the best one. The third experiment was designed to verify the
system's ability to analyze an application's communication pattern and exploit
any asymmetric communication pattern.
Because of the limited resources available, all applications used for these experiments were compiled to use 8 nodes. The procedure outlined in Section 3.1 was
used in all three experiments. That is, for each experiment, the NetBench ap-
plication was used to characterize the metacomputer's network, each application
was traced, and the Affinity application was used to determine the placement
of the parallel application. Using the system's placement, the applications were
run 10 times and the average execution time was computed. For comparison, the
same application was run using random placement and round-robin placement
(again 10 times for each placement strategy).
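The timing procedure amounts to a simple averaging loop; a minimal sketch (the names and structure here are ours, assuming `run_app` launches the application on the given placement and blocks until it completes):

```python
import statistics
import time

def average_runtime(run_app, placement, iterations=10):
    """Run the application on a fixed placement several times and
    return the mean wall-clock execution time in seconds."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_app(placement)           # blocks until the parallel run finishes
        times.append(time.perf_counter() - start)
    return statistics.mean(times)
```

The same loop is applied to the surface, round-robin, and random placements, so all three strategies are compared on equal footing.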
The following three subsections describe each experiment in more detail.
Each subsection explains the metacomputer configuration used, the applications
used, and the results of the given experiment. Section 3.3 then offers a more
elaborate discussion of the results.
3.2.1. Experiment 1 – Finding Homogeneous Clusters
The first experiment was designed to verify the ability of the system to
identify homogeneous clusters in a heterogeneous metacomputer network. To
this end, a metacomputer was formed which spanned two clusters: the Computer
Science lab cluster and the Swarm cluster. Eight machines were used from each
cluster. The applications selected were the LU, MG and EP benchmarks.
Figure 11 shows the timing results for the EP, MG and LU benchmarks. In
each case, a smaller average execution time is better. The figure clearly shows
that performance surface placement consistently outperforms the others. Case in
point: for the MG benchmark, the surface placement was 21 times faster than
round-robin.
3.2.2. Experiment 2 – Choosing Clusters
The second experiment was designed to verify the system's ability to rate
the efficiency of various combinations of clusters and pick the best one. Three
clusters were used: the Computer Science cluster, the Swarm cluster (connected
through slow Internet links) and the Math department cluster (connected through
10 Mbps Ethernet). The application suite was the same as for Experiment 1 (the
LU, MG and EP benchmarks), but this time only 4 machines were used on each
cluster. Since the application required 8 machines, the system was forced to span
the application across at least two of the clusters. Thus, this experiment verifies
the system's ability to pick the pair of clusters which works best together.
Figure 12 shows the timing results for the EP, MG, and LU benchmarks.
Again, the random and round-robin placement strategies are compared with our
system's placement, which consistently equals or outperforms the other place-
ments. As an example, for the MG benchmark the surface placement was over
3 times faster than round-robin.
3.2.3. Experiment 3 – Exploiting Asymmetric Communication
The third experiment's goal was to verify the ability of the system to analyze
an application's communication pattern and exploit any asymmetry in it. Since
the communication patterns of the applications in the NPB suite are symmetric,
a new benchmark was devised.
The new benchmark, called BiTalk, mimics a class of parallel applications
which use functional decomposition rather than domain decomposition. For an
in-depth discussion of functional and domain decomposition, see [7]. Typically,
when functional decomposition is used, natural clusters of communication are
formed along the functional decomposition lines. For example, a weather simu-
lation program could be functionally decomposed into an atmospheric model, a
hydrology model, an ocean model, and a land surface model. Each of these models
can perform computations independently of the others. Because of data depen-
dencies between the models, these clusters of nodes (models) also need to com-
municate with each other and exchange information. For scalability, each of these
functional units is further decomposed using domain decomposition; thus each
model spans several nodes. Generally, communication within each model is more
frequent than communication between the models. Thus, by analyzing the com-
munication patterns of the application as a whole, these communication clusters
can be identified and exploited.
The BiTalk application simulates two communication clusters where significant
communication takes place within the clusters and little between them.
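The pattern BiTalk induces can be pictured as a per-pair message-count matrix; in this hypothetical sketch, the actual message counts are not given in the paper, so `intra` and `inter` are placeholder weights chosen only to show heavy intra-cluster and light inter-cluster traffic:

```python
def bitalk_traffic(n_nodes, intra=100, inter=1):
    """Message-count matrix for two equal communication clusters:
    the first half of the node list talks heavily among itself, as
    does the second half, with only light traffic across the halves."""
    half = n_nodes // 2
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue
            same_half = (i < half) == (j < half)
            matrix[i][j] = intra if same_half else inter
    return matrix
```

A placement system that traces this matrix can see the two dense blocks on the diagonal and map each block to its own network cluster.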
For this experiment two distant clusters were used: the Computer Science
cluster and the Swarm cluster. A total of 8 nodes (four on each cluster)
were used. This new benchmark clearly has an asymmetric communication
pattern which can be exploited by the system. Figure 13 shows the results of
the experiment. Again, the results are compared with random and round-robin
placement. This figure also shows the execution time of BiTalk when run on a
single homogeneous cluster. As the results show, the difference in execution
time between the surface placement on two distant clusters and a manual
placement on a single cluster is negligible. Here again, the surface placement
clearly outperforms both random and round-robin placement. Using the BiTalk
benchmark (Figure 13), the surface placement was 30 times faster than round-
robin.
3.3. Discussion
In the first experiment, the performance surface system was able to identify
both of the clusters (BYU's Computer Science cluster and the Oregon State
University Swarm cluster). Following the heuristic, it placed all of the tasks in
the Swarm cluster. The round-robin and random placements did not fare as well,
since most of the runs included nodes from both clusters. The system's placement,
on average, outperformed the random placement by a factor of 13 and the
round-robin placement by a factor of 10.
Again, in the second experiment, the performance surface system was able
to identify the clusters (BYU's Computer Science, BYU's Math and the ORST
Swarm cluster). This time it had to spread the task across two clusters, since
no single cluster had enough nodes for the entire parallel application. The
system spread the nodes between the Computer Science and Math clusters
(using the affinity measures and the heuristics). This made sense, since network
performance between the two departments at BYU is much better than between
either of these clusters and the Swarm cluster at Oregon State University. As
in the first experiment, the round-robin and random placements did not fare
as well since, on occasion, some nodes from the Swarm cluster were included.
The system's placement, on average, outperformed the random placement by a
factor of 3 and the round-robin placement by a factor of 2.6.
In the third experiment, the system was able to identify the communication
clusters. The task of placement was then simplified to mapping the communica-
tion clusters onto the network clusters. BiTalk was designed to use half of the
node list for the first communication cluster and the other half for the second.
This design actually helped the round-robin placement method, since all that
was required for the placement to be optimal was to select all nodes from one
cluster for the first half of the list and all nodes from the other cluster for the
latter half. Again, the system's placement, on average, outperformed the random
placement by a factor of 42 and the round-robin placement by a factor of 30.
Another interesting fact is the narrow timing difference between the application
running on two clusters using the system's placement and the application running
on a single cluster. This shows that a certain class of applications (those with
asymmetric communication patterns) can potentially span clusters with minimal
performance degradation.
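The mapping step described above can be sketched as a greedy assignment (an illustrative heuristic only, not the paper's affinity-based algorithm; all names here are hypothetical):

```python
def map_comm_to_net(comm_clusters, net_clusters):
    """Assign each communication cluster to a single network cluster,
    largest communication clusters first, always taking the network
    cluster with the most free machines.  Assumes every communication
    cluster fits entirely inside some network cluster."""
    free = {name: list(nodes) for name, nodes in net_clusters.items()}
    placement = {}
    for group in sorted(comm_clusters, key=len, reverse=True):
        # network cluster with the most remaining capacity
        best = max(free, key=lambda name: len(free[name]))
        for app_node in group:
            placement[app_node] = free[best].pop()
    return placement
```

For BiTalk's two four-node communication clusters and two four-node network clusters, such a mapping keeps all heavy traffic inside a cluster and sends only the light inter-cluster traffic over the slow link.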
On average, the round-robin placement strategy performed better than ran-
dom for all three experiments. This can be attributed to the fact that the list of
nodes was arranged such that clusters were grouped together. This made it more
likely to select nodes from the same cluster using the round-robin method.
Applications with medium to heavy communication patterns (LU and MG in
this case) benefited more from our system than applications with little
communication (the EP benchmark). Since the only communication cost involved
with the EP benchmark was the startup cost, there was very little benefit from
using performance surfaces to assist in the placement. In fact, other factors such
as CPU speed outweighed the communication cost and proved to be more important.
This can be seen in Experiment 2 for the EP benchmark: the time for the random
placement is actually better (by a few seconds) than that of the placement
proposed by our system. In that experiment, the system placed the application
such that it spanned the CS and Math department clusters. This would have made
sense had the application communicated frequently, since the link performance
between these two clusters is superior. The problem was that the CS cluster's
machines have slightly slower CPUs (266 MHz, as opposed to 300 MHz for the
rest of the machines). Because the system assumed that all CPUs were equivalent,
it did not predict that EP would run better on the faster CPUs. Future work
will factor CPU speed into performance surface placement.
In general, an application which has very uniform communication charac-
teristics cannot be placed across heterogeneous networks without suffering a
performance loss (compared with the same application running on a single cluster
of equivalent speed). If, however, communication clusters can be found, it is
possible to get results very similar to those of the same application running on
a single cluster. This can be seen in the results of Experiment 3, which compare
the time for running BiTalk on two clusters using performance surfaces against
running the application on one cluster.
Using the performance surface system never had a substantial negative effect
on performance, and in many cases it yielded large performance gains. Across
the three experiments performed, the performance surface placement was, on
average, over 19 times faster than random placement and over 14 times faster
than round-robin placement. Overall, the above experiments show the viability
of performance surfaces as a network performance prediction tool.
4. Conclusion
The results presented in this paper have shown the usefulness of performance
surfaces as a performance prediction model. The tests conducted clearly show
that, on average, the placement produced by performance surface prediction is
superior to round-robin and random placement. Additionally, it has been shown
that the developed system can rapidly detect homogeneous clusters in a hetero-
geneous network; it also allows for easy comparison of these clusters. Finally,
when asymmetric communication patterns are present, the system can detect
them and place the application over several clusters with minimal performance
degradation.
We would like to thank Michael Quinn and Oregon State University for the
use of their Swarm cluster for these experiments.
References
[1] David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS Parallel Benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995.
[2] Francine D. Berman, Rich Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application-Level Scheduling on Distributed Heterogeneous Networks. In Supercomputing '96 Conference Proceedings, August 1996.
[3] Mark J. Clement, Glenn M. Judd, Joy L. Peterson, Bryan S. Morse, and J. Kelly Flanagan. Performance Surface Prediction for WAN-Based Clusters. In Proceedings of the 31st Hawaii International Conference on System Sciences, HICSS-31, January 1998.
[4] Silvia M. Figueira and Francine Berman. Predicting Slowdown for Networked Workstations. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 92, August 1997.
[5] Steven Fitzgerald, Ian Foster, Carl Kesselman, Gregor von Laszewski, Warren Smith, and Steven Tuecke. A Directory Service for Configuring High-Performance Distributed Computations. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 365, August 1997.
[6] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 11(2):115-128, 1997.
[7] Ian Foster. Designing and Building Parallel Programs. Addison Wesley, New York, 1995.
[8] Glenn M. Judd, Mark J. Clement, and Quinn O. Snell. DOGMA: Distributed Object Group Management Architecture. Technical Report BYU-NCL-97-102, Brigham Young University, 1997.
[9] John F. Karpovich. Support for Object Placement in Wide Area Heterogeneous Distributed Systems. Technical Report CS-96-03, University of Virginia, January 1996.
[10] JunSeong Kim and David J. Lilja. Utilizing Heterogeneous Networks in Distributed Parallel Computing Systems. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 336, August 1997.
[11] Michael J. Lewis and Andrew Grimshaw. The Core Legion Object Model. In Proceedings of the Fifth IEEE International Symposium on High Performance Distributed Computing, August 1996.
[12] Jennifer M. Schopf. Structural Prediction Models for High-Performance Distributed Applications. In Proceedings of the Cluster Computing Conference, 1997.
[13] Jennifer M. Schopf and Francine Berman. Performance Prediction in Production Environments. In Proceedings of IPPS/SPDP, 1998.
[14] W. Smith, I. Foster, and V. Taylor. Predicting Application Run Times Using Historical Information. In The 4th Workshop on Job Scheduling Strategies for Parallel Processing, March 1998.
[15] Quinn O. Snell and John L. Gustafson. HINT: A New Way to Measure Computer Performance. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS-28, January 1995.
[16] Ilya M. Sobol. A Primer for the Monte Carlo Method. CRC Press, Boca Raton, 1994.
[17] Rich Wolski. Forecasting Network Performance to Support Dynamic Scheduling Using the Network Weather Service. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, HPDC-6, page 316, August 1997.
Figure 1. Example of Performance Surface: a) Small message sizes with low average latency and less variance. b) Large message sizes with higher average latency and more variance.

Figure 2. Example of network node / application node mapping.

Figure 3. The formation of performance surfaces. a) The individual probability distributions, representing the probability of a delay t given a message size m. b) The concatenation of the individual distributions to form the performance surface.

Figure 4. Augmented definition of graph.

Figure 5. Weighted convolution: APD << NPD.

Figure 6. Weighted convolution: APD < NPD.

Figure 7. Weighted convolution: APD > NPD.

Figure 8. Aggregating the network into clusters.

Figure 9. Calculating the RNS value for high/low speed links.

Figure 10. Example histogram of RNS values.

Figure 11. Experiment 1: average execution times (min:sec) for the EP, MG and LU benchmarks under surface, round-robin and random placement.

Figure 12. Experiment 2: average execution times (min:sec) for the EP, MG and LU benchmarks under surface, round-robin and random placement.

Figure 13. Experiment 3: average execution times (min:sec) for the BiTalk benchmark under single-cluster, surface, round-robin and random placement.