
Writing Programs that Run EveryWare on the Computational Grid

Rich Wolski, Member, IEEE, John Brevik, Graziano Obertelli, Member, IEEE Computer Society,

Neil Spring, Student Member, IEEE Computer Society, and

Alan Su, Student Member, IEEE Computer Society

Abstract--The Computational Grid [12] has been proposed for the implementation of high-performance applications using widely dispersed computational resources. The goal of a Computational Grid is to aggregate ensembles of shared, heterogeneous, and distributed resources (potentially controlled by separate organizations) to provide computational "power" to an application program. In this paper, we provide a toolkit for the development of globally deployable Grid applications. The toolkit, called EveryWare, enables an application to draw computational power transparently from the Grid. It consists of a portable set of processes and libraries that can be incorporated into an application so that a wide variety of dynamically changing distributed infrastructures and resources can be used together to achieve supercomputer-like performance. We describe the experiences we gained while building the EveryWare toolkit prototype and explain its use in implementing a large-scale Grid application.

Index Terms--Computational Grid, EveryWare, Ramsey Number search, grid infrastructure, ubiquitous computing, distributed supercomputer.

1 INTRODUCTION

Increasingly, the high-performance computing community is blending parallel and distributed computing technologies to meet its performance needs. A new architecture, known as The Computational Grid [12], has recently been proposed which frames the software infrastructure required to implement high-performance applications using widely dispersed computational resources. The goal of a Computational Grid is to aggregate ensembles of shared, heterogeneous, and distributed resources, potentially controlled by separate organizations, to provide computational "power" to an application program. Applications should be able to draw compute cycles, network bandwidth, and storage capacity seamlessly from the Grid^1, analogously to the way household appliances draw electrical power from a power utility.

The framers of the Computational Grid paradigm identify four qualitative criteria for the concept to be realized. According to [12] (p. 18), a Computational Grid must deliver consistent, dependable, pervasive, and inexpensive cycles to the end user. In this paper, we outline five quantitative requirements which, if met, fulfill the qualitative criteria from [12]. We also describe EveryWare, a toolkit for constructing Computational Grid programs, and quantitatively evaluate how well an example EveryWare program fulfills the Computational Grid vision.

Our evaluation is based on five quantitative metrics:

1. Execution Rate: measures the sustained computational performance of the entire application. Although not mentioned explicitly as a criterion, the Grid must be able to deliver efficient execution performance, which we measure in terms of sustained execution rate.

2. Adaptivity: measures the difference between the performance variability exhibited by the underlying resources and the performance variability exhibited by the application itself. If program execution is stable, independent of fluctuations in resource performance (i.e., the program adapts to varying performance conditions successfully), we suggest that the program is able to sustain consistent execution.

3. Robustness: measures the overall duration of continuous program execution in the presence of resource failures. A program that can continue to execute effectively in the presence of unpredictable resource failure is a dependable program.

4. Ubiquity: measures the degree of heterogeneity a program can exploit in terms of the number of different resource types used by the application. If a program can execute using any and all available resources (both software and hardware), it is a pervasive program.


. R. Wolski is with the Computer Science Department, University of California, Santa Barbara, CA 93106. E-mail: [email protected].

. J. Brevik is with the Department of Mathematics and Computer Science, College of the Holy Cross, Worcester, MA 01610. E-mail: [email protected].

. G. Obertelli and A. Su are with the Computer Science and Engineering Department, University of California, San Diego, La Jolla, CA 92093. E-mail: {graziano, alsu}@cs.ucsd.edu.

. N. Spring is with the Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195. E-mail: [email protected].

Manuscript received 12 Nov. 1999; revised 21 Feb. 2001; accepted 18 Apr. 2001. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 110960.

1. We will capitalize the word "Grid" when referring to "Computational Grid" throughout this paper.



5. Expense: measures the cost of the resources necessary to implement the infrastructure. This metric maps directly to the expense criterion described in [12].

Therefore, a program that achieves a high execution rate, adapts to rapidly changing performance conditions, is robust to resource failures, executes ubiquitously, and requires little added expense over a single-machine program possesses all of the qualities described in [12] that a Grid program must possess.

EveryWare is a software toolkit consisting of three separate components:

. A portable lingua franca that is designed to allow processes using different infrastructures and operating systems to communicate,

. A set of performance forecasting libraries that enable an application to make short-term resource and application performance predictions in near-real time, and

. A distributed state exchange service that allows application components to manage and synchronize the program state in a dynamic environment.

The goal is to allow a user to write Grid programs that combine the best features of different Grid infrastructures such as Globus [11], Legion [19], Condor [36], or NetSolve [6], as well as the native functionality provided by Java [18], Windows NT [27], and Unix, to the performance advantage of the application. EveryWare is implemented as a highly portable set of libraries and processes that can "glue" different locally available infrastructures together so that a program may draw upon these resources seamlessly. If sophisticated systems such as Globus, Legion, or Condor are available, the EveryWare program must be able to use the features provided by those systems effectively. If only basic operating system functionality is present, however, an EveryWare program should be able to extract whatever functionality it can, realizing that these resources may be less effective than those supporting better infrastructure. The ability to execute ubiquitously with respect to all of the resources accessible by the user is key to meeting the pervasiveness criterion. By leveraging the most performance-efficient infrastructure that is present on those resources, an EveryWare application can ensure the best possible execution performance and the greatest degree of robustness possible.

Designed to be quickly and easily portable, EveryWare is intended to be the thinnest middleware layer capable of unifying heterogeneous resources with various software infrastructures to accomplish a computational task. In a Grid environment offering several incompatible software infrastructures, building a single distributed application that runs everywhere has been a long-standing challenge; EveryWare is designed to meet it.

We have implemented a prototype toolkit to test the efficacy of the EveryWare approach. In an experiment entered as a contestant in the High-Performance Computing Challenge [22] at SC98, we were able to use this prototype to leverage the Globus, Legion, Condor, and NetSolve Grid computing infrastructures, the Java language and execution environment, native Windows NT, and native Unix systems simultaneously in a single, globally distributed application. The application, a program that searches for Ramsey Number counter-examples, does not perform an exhaustive search, but instead uses search heuristics, such as simulated annealing, to negotiate the enormous search space. Effectively implementing this approach requires careful dynamic scheduling to avoid substantial communication overheads. Moreover, by focusing on enhancing the interoperability of the resources in our pool, we were able to combine the Tera MTA [37] and the NT Supercluster [30], two unique and powerful resources, with several more commonly available systems, including parallel supercomputers, PC-based workstations, shared-memory multiprocessors, and Java-enabled desktop browsers. With nondedicated access to all resources, under extremely heavy load conditions, the EveryWare application was able to sustain supercomputer performance levels over long periods of time. As such, the Ramsey Number Search application using EveryWare represents an example of a true Grid program: the computational "power" of all resources that were available to the application's user was assessed, managed, and delivered to the application.

In detailing our Computational Grid experiences, this paper makes four important contributions.

. It defines five quantitative metrics that can be used to measure the effectiveness of Grid applications.

. It demonstrates, using these quantitative metrics, the potential power of globally distributed Grid computing.

. It details our experiences using most of the relevant distributed computing technology available to us in the fall of 1998.

. It describes a new programming model and methodology for writing Grid programs.

In the next section, we motivate the design of EveryWare in the context of current Computational Grid research. In Section 3, we detail the functionality of the EveryWare toolkit and describe the programming model it implements. Section 4 discusses the Ramsey Number Search application we used in this experiment and, in Section 5, we detail the performance results we were able to obtain in terms of the five metrics described above. We conclude, in Section 6, with a description of future research directions.

2 COMPUTING WITH COMPUTATIONAL GRIDS

The goal of EveryWare is to enable the construction of true Grid programs: ones which draw computational power seamlessly from a dynamically changing resource pool. Since the field is evolving, a single definition of "Computational Grid" has yet to be universally adopted.^2 In this work, we will use the following definition.

Computational Grid. A heterogeneous, shared, and federated collection of computational resources connected by a network that supports interprocess communication.


2. In [12], the authors define Computational Grids in terms of a set of criteria that must be met. We address these criteria in our work, but prefer the definition provided herein for the purpose of illustration.


By "shared" we mean that it is impractical to dedicate all of the resources in a Computational Grid to a single application for an appreciable amount of time. The term "federated" means that each resource is expected to have local administration, local resource allocation policies, and local resource management software. No single overarching resource management policy can be imposed on all resources.

The resources housed at the National Partnership for Advanced Computational Infrastructure (NPACI) and the National Computational Science Alliance (NCSA) constitute examples of Computational Grids. At these centers, machines and storage devices of various types are internetworked. Each resource is managed by its own resource manager (e.g., batch scheduler, interactive priority mechanism, etc.) and it is not generally possible to dedicate all resources (and the network links that interconnect them) at either site to a single application. Moreover, it is possible to combine NPACI and NCSA resources together to form a larger Computational Grid that has the same characteristics. In this larger case, it is not even possible to mandate that a uniform software infrastructure be present at all potentially useful execution sites.

To maximize application performance on a Computational Grid, a program must be scalable (able to exploit concurrency for performance), adaptive, robust, and ubiquitous.

Other work has met these requirements to different degrees. AppLeS [4] (Application Level Scheduling) agents have enabled applications to meet these requirements in environments where a single infrastructure is present and the scheduling agent does not experience resource failure. An AppLeS agent dynamically evaluates the performance that all available resources can deliver to its application and crafts a schedule that maximizes the application's overall execution performance. EveryWare supports this principle but also extends it to wide-area, lossy environments in which several infrastructures may be available. Note also that the AppLeS agent is a specialized application component that performs a single application management function: scheduling. EveryWare generalizes this notion to other application management functions in the form of application-specific services. In Section 4, we describe application-specific scheduling, persistent-state management, and performance logging for the Ramsey Number search application in EveryWare.

The MPI (Message Passing Interface) [10] and PVM (Parallel Virtual Machine) [17] implementations for networked systems allow distributed clusters of machines to be programmed as a single, "virtual" parallel machine, allowing applications to scale. In addition, portable implementations that do not require privileged (super-user) access for installation or execution [20], [17] are available, promoting their ubiquity. However, they do not manage resource heterogeneity on behalf of the program, nor do they expose it to the programmer so that it may be managed explicitly, and so they are not adaptive.

Grid computing systems such as Globus, Legion, Condor, and HPC-Java [21] include support for resource heterogeneity as well, but they are not yet ubiquitous. As these systems gain in popularity, we anticipate that they will be more widely installed and maintained. However, we note that their level of sophistication makes porting them to new and experimental environments labor intensive.

EveryWare is similar to Globus [11] in that application components communicate via different well-defined protocols to obtain Grid "service." EveryWare extends this "sum of service" approach to provide tools for the Grid programmer to develop application-specific protocols and services so that the application, and not just the underlying infrastructure, can be robust and ubiquitous.

It also supports information hiding and location transparency in the same way object-oriented systems such as Legion [19] and CORBA [31] do. Indeed, it is possible to leverage the salient features of these object-oriented systems via EveryWare where advantageous to the application.

In particular, we were able to build an application-specific process location service using EveryWare that is similar in concept to the functionality provided by JINI [3]. JINI relies on broadcast and multicast facilities, however, making it difficult to use in wide-area environments. Using the EveryWare Gossip protocol, we were able to overcome this limitation, although it is possible that JINI could be used to implement part of the Gossip infrastructure.

EveryWare complements the functionality provided by Condor [36] by providing a robust messaging layer. Adaptive and robust execution facilities permit Condor to kill and restart EveryWare processes at will. However, in order to provide an automatic and seamless checkpointing facility, Condor only provides a way for tasks to be migrated between machines of the same architecture. EveryWare's Gossip protocol enables a programmer to write an explicit state-saving facility which is both application- and platform-neutral. In conjunction with Condor's checkpointing facility, this enables EveryWare programs to span Condor pools based on different architectures.

Dynamically schedulable adaptive programs that are capable of tolerating resource performance fluctuations have been developed by the Autopilot [33], Prophet [38], Winner [2], and MARS [16] groups. Most of these systems rely on a centralized scheduler for each application, sacrificing robustness. If the scheduler fails or becomes disconnected from the rest of the application, the program is disabled. In addition, having a single scheduling agent impedes scalability as communication with the scheduler becomes a performance bottleneck.

EveryWare is designed as a portable "toolkit" for linking together program components running in different environments. Individual program components may use whatever locally available infrastructure is present. In addition, we provide a low-level "bare-bones" implementation that is designed to use only basic operating system functionality. In this way, an EveryWare application does not assume that any single operating system or infrastructure, except its own, will be accessible from every resource. Borrowing from the AppLeS [4] project, EveryWare applications characterize all resources in terms of their quantifiable impact on application performance. In this way, heterogeneity is expressed as the difference in deliverable performance to each application. The EveryWare toolkit includes support for process replication and performance forecasting so that an EveryWare application can adapt to dynamically changing resource conditions. We leverage the Network Weather Service [40], [39] forecasting facilities to provide both heterogeneity management and adaptive resource performance prediction.

3 THE EVERYWARE TOOLKIT

The EveryWare toolkit is composed of three separate software components: a portable lingua franca that allows processes using different infrastructures and operating systems to communicate, a set of performance forecasting services and libraries that enable an application to make short-term resource and application performance predictions in near-real time, and a distributed state exchange service that allows application components to manage and synchronize program state in a dynamic environment. Fig. 1 depicts the relationship between these components. Application components that are written to use different Grid infrastructure features can communicate amongst themselves, with the EveryWare state exchange service, and with other multi-infrastructure services such as the Network Weather Service [40] using the lingua franca. NWS dynamic forecasting libraries (small triangles in the figure) can be loaded with application components directly. These libraries, in conjunction with the performance forecasts provided by the NWS, permit the program to anticipate performance changes and adapt execution accordingly. The distributed state-exchange service provides a mechanism for synchronizing and replicating important program state to ensure robustness and scalability.

The toolkit we have implemented is strictly a prototype designed to expose the relevant programming issues. As such, we do not describe the specific APIs supported by each component (we expect them to change dramatically in our future implementations). Rather, in this section, we motivate and describe the functionality of each EveryWare component and discuss our overall implementation strategy. Our intent is to use the prototype first to implement a variety of applications so that we may determine what functionality is required and then to provide a "user-friendly" implementation of EveryWare for public release.

3.1 Lingua Franca

The lingua franca provides a base set of resource control abstractions that are portable across infrastructures. They are intended to be simple, easy to implement using different Grid technologies, and highly portable. Initially, we have developed simple process, datagram message, and storage buffer abstractions for EveryWare. The process abstraction creates and destroys a single execution thread on a target resource that is capable of communicating via both the EveryWare datagram message abstraction and any local communication facilities that are present. EveryWare datagram messages are sent between processes via nonblocking send and blocking receive calls, and processes can block waiting for messages from multiple sources. Processes can also create and destroy storage buffers on arbitrary resources (e.g., by creating "memory" processes that respond to read and write requests to their own memory).
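Because the prototype's APIs are deliberately left unspecified (and were expected to change), the following C header is only an illustrative sketch of how the process, datagram, and storage-buffer abstractions described above might be expressed; every name and signature here is a hypothetical assumption, not the actual EveryWare interface.

/* ew_lingua.h - hypothetical sketch of the lingua franca abstractions.
 * Names and signatures are illustrative only; the real prototype API is
 * not published and was expected to change. */
#ifndef EW_LINGUA_H
#define EW_LINGUA_H

#include <stddef.h>

typedef struct ew_endpoint ew_endpoint;  /* contact address of a process        */
typedef struct ew_process  ew_process;   /* handle to a remote execution thread */
typedef struct ew_buffer   ew_buffer;    /* handle to a remote storage buffer   */

/* Every call takes an explicit timeout (seconds) and reports success or
 * failure explicitly, matching the failure model described in Section 3.1. */
int ew_process_create(const ew_endpoint *where, const char *image,
                      double timeout_sec, ew_process **proc_out);
int ew_process_destroy(ew_process *proc, double timeout_sec);

/* Datagram messages: nonblocking send, blocking receive with timeout,
 * optionally waiting on several possible sources at once. */
int ew_msg_send(const ew_endpoint *dest, int msg_type,
                const void *payload, size_t len);
int ew_msg_recv(ew_endpoint *sources, int nsources, int msg_type,
                void *payload, size_t maxlen, size_t *len_out,
                double timeout_sec);

/* Storage buffers on arbitrary resources ("memory" processes). */
int ew_buffer_create(const ew_endpoint *where, size_t size,
                     double timeout_sec, ew_buffer **buf_out);
int ew_buffer_read(ew_buffer *buf, size_t offset, void *dst, size_t len,
                   double timeout_sec);
int ew_buffer_write(ew_buffer *buf, size_t offset, const void *src, size_t len,
                    double timeout_sec);
int ew_buffer_destroy(ew_buffer *buf, double timeout_sec);

#endif /* EW_LINGUA_H */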

We implemented the lingua franca using C and TCP/IP sockets. To ensure portability, we tried to limit the implementation to use only the most "vanilla" features of these two technologies. For example, we did not use nonblocking socket I/O, nor did we rely upon keep-alive signals to inform the system about end-to-end communication failure. In our experience, the semantics associated with these two useful features are specific to the vendor and, in some cases, to the operating system release level. We tried to avoid controlling the portability of EveryWare through C preprocessor flags whenever possible so that the system could be ported quickly to new architectures and environments. Similarly, we chose not to rely upon XDR [35] for data type conversion for fear that it would not be readily available in all environments. Another important decision was to strictly limit our use of signals. Unix signal semantics are somewhat detailed, and we did not want to hinder portability to non-Unix environments (e.g., Java and Windows NT). More immediately, many of the currently available Grid communication infrastructures, such as Legion [19] and Nexus [14], take over the user-level signal mechanisms to facilitate message delivery. Lastly, we avoided the use of threads throughout the architecture, as differences in thread semantics and thread implementation quality have been a source of incompatibility in many of our previous Grid computing efforts.

Above the socket level, we implemented rudimentary packet semantics to enable message typing and delineate record boundaries within each stream-oriented TCP communication. Our approach takes its inspiration from the publicly available implementation of netperf [23]. However, the actual implementation of the messaging layer comes directly from the current Network Weather Service (NWS) [40], where it has been stress-tested in a variety of Grid computing environments.
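The paper does not give the on-the-wire format, but a minimal framing layer of the kind described, a fixed header carrying a message type and payload length sent ahead of each record on the TCP stream, might look like the following sketch; the header layout and helper names are assumptions, not the actual NWS/EveryWare wire format.

/* Hypothetical record framing over a TCP stream: a fixed header carrying
 * the message type and payload length delineates record boundaries. */
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <arpa/inet.h>

struct ew_pkt_hdr {
    uint32_t msg_type;     /* application-defined message type           */
    uint32_t payload_len;  /* number of payload bytes that follow header */
};

/* Write exactly len bytes, retrying on short writes; returns 0 or -1. */
static int write_full(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send one typed record: header (in network byte order) then payload. */
int ew_pkt_send(int fd, uint32_t msg_type, const void *payload, uint32_t len) {
    struct ew_pkt_hdr hdr;
    hdr.msg_type = htonl(msg_type);
    hdr.payload_len = htonl(len);
    if (write_full(fd, &hdr, sizeof hdr) != 0) return -1;
    return write_full(fd, payload, len);
}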

Note that the EveryWare lingua franca differs from other message-passing implementations such as PVM [17] or MPI [20] in several important ways. First, these other interfaces are designed to support arbitrary parallel programs in environments where resource failure is rare (i.e., on parallel machines). As such, they include useful primitives (such as global barrier synchronization) that make them attractive programming facilities, but are sometimes difficult to implement in Grid environments. Often, when a resource fails, the entire PVM or MPI program must be restarted. EveryWare assumes that resource availability will be dynamically changing. As such, all primitives obey user-specified timeouts, success and failure status is reported explicitly, and only those primitives that can fail individually (that is, without affecting more than the process calling them) are implemented. We do not intend the lingua franca to replace any of the existing message-passing or remote invocation systems that are available to Grid programmers. Rather, we will provide the minimal functionality required to allow these infrastructures to interoperate efficiently so that programs can span Grid infrastructures. We expect that the portability of the lingua franca will also benefit from this minimalist approach.

Fig. 1. EveryWare components.

3.2 Forecasting Services

We also borrowed and enhanced the NWS forecasting modules for EveryWare. To make performance forecasts, the NWS applies a set of lightweight time-series forecasting methods and dynamically chooses the technique that yields the greatest forecasting accuracy over time [39]. The NWS collects performance measurements from Grid computing resources (CPUs, networks, etc.) and uses these forecasting techniques to predict short-term resource availability. For EveryWare, however, we needed to be able to predict the time required to perform arbitrary but repetitive program events. Our strategy was to manually instrument the various EveryWare components and application modules with timing primitives and then pass the timing information to the forecasting modules to make predictions. We refer to this process as dynamic benchmarking, as it uses benchmark techniques (e.g., timed program events) perturbed by ambient load conditions to make performance predictions.

For example, we use the NWS forecasting modules and NWS dynamic benchmarking to predict the response time of each EveryWare state-exchange server. We first identify instances of request-response interactions in the state-server code. At each of these "events," we instrument the code to record an identifier indicating the address where the request is serviced, the message type, and the time required to get the corresponding response. By forecasting how quickly a server responds to each type of message, we are able to dynamically adjust the message time-out interval to account for ambient network and CPU load conditions. This dynamic time-out discovery is crucial to overall program stability. Using the alternative of statically determined time-outs, the system frequently misjudges the availability (or lack thereof) of the different EveryWare state-management servers, causing needless retries and dynamic reconfigurations.
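As an illustration only, dynamic benchmarking of one request-response event and the adaptive timeout derived from it might be coded along the following lines; the forecaster and packet functions are hypothetical stand-ins (the NWS library's real interface is not reproduced here), and the factor of two in the timeout policy is an assumption.

/* Sketch of dynamic benchmarking for one request-response event type. */
#include <stdint.h>
#include <sys/time.h>

extern void   ew_forecast_update(int event_id, double measurement); /* placeholder */
extern double ew_forecast_predict(int event_id);  /* predicted seconds (placeholder) */
extern int ew_pkt_send(int fd, uint32_t msg_type, const void *payload, uint32_t len);
extern int ew_pkt_recv_timeout(int fd, uint32_t msg_type, void *buf,
                               uint32_t maxlen, double timeout_sec);

static double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* Issue a request, time the response, and feed the measurement back
 * to the forecaster so the next timeout tracks ambient load. */
int timed_request(int server_fd, uint32_t msg_type,
                  const void *req, uint32_t reqlen,
                  void *resp, uint32_t maxresp)
{
    double timeout = 2.0 * ew_forecast_predict(msg_type);  /* assumed policy */
    double start = now_seconds();

    int rc = ew_pkt_send(server_fd, msg_type, req, reqlen);
    if (rc == 0)
        rc = ew_pkt_recv_timeout(server_fd, msg_type, resp, maxresp, timeout);

    if (rc == 0)
        ew_forecast_update(msg_type, now_seconds() - start); /* dynamic benchmark */
    return rc;
}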

In general, the NWS forecasting services and NWS dynamic benchmarking allow both the EveryWare toolkit and the application using it to adapt dynamically to changing load and performance conditions. We use standard timing mechanisms available on each system to generate time stamps and event timings. However, we anticipate that more sophisticated profiling systems, such as Paradyn [28] and Pablo [9], can be incorporated to yield higher-fidelity measurements.

3.3 Distributed State Exchange Service

To function in the current Grid computing environments, a program must be robust with respect to resource performance failure while at the same time able to leverage a variety of different target architectures. EveryWare provides a distributed state exchange service that can be used in conjunction with application-level checkpointing to ensure robustness. EveryWare state-exchange servers (called Gossips) allow application processes to register for state synchronization by providing a contact address, a unique message type, and a function that allows a Gossip to compare the "freshness" of two different messages having the same type. All application components wishing to use the Gossip service must also export a state-update method for each message type they wish to synchronize.

Once registered, an application component periodically receives a request from a Gossip process to send a fresh copy of its current state, identified by message type. Using the previously registered comparator function, the Gossip compares the current state with the latest state message received from other application components. When the Gossip detects that a particular message is out-of-date, it sends a fresh state update to the application component that originated the out-of-date message.
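A registration interface of the kind just described, pairing a contact address and message type with a freshness comparator and a state-update callback, might be sketched as follows; all identifiers are hypothetical, since the prototype's actual API is not specified in the paper.

/* Hypothetical registration interface for the Gossip state-exchange service. */
#include <stddef.h>

typedef struct ew_endpoint ew_endpoint;

/* Return <0 if a is staler than b, 0 if equivalent, >0 if a is fresher. */
typedef int  (*ew_state_compare_fn)(const void *a, size_t alen,
                                    const void *b, size_t blen);

/* Called by the Gossip layer to replace this component's copy of the state. */
typedef void (*ew_state_update_fn)(const void *fresh_state, size_t len);

/* Register one message type for synchronization with a Gossip server. */
int ew_gossip_register(const ew_endpoint *my_contact_address,
                       int msg_type,
                       ew_state_compare_fn compare,
                       ew_state_update_fn update);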

To allow the system to scale, we rely on three assumptions: that the Gossip processes cooperate, that the number of application components wishing to synchronize is small, and that the granularity of synchronization events is relatively coarse. Cooperation between Gossip processes is required so that the workload associated with the synchronization protocol may be evenly distributed. Gossips dynamically partition the responsibility for querying and updating application components amongst themselves. For the SC98 experiment, we stationed several Gossips at well-known addresses around the country. When an application component registered, it was assigned a responsible Gossip within the pool of available Gossips whose job it was to keep that component synchronized.

In addition, we allow the Gossip pool to fluctuate. New Gossip processes register themselves with one of the well-known sites and are announced to all other functioning Gossips. Within the Gossip pool, we use the NWS clique protocol [40] (a token-passing protocol based on leader election [15], [1]) to manage network partitioning and Gossip failure. The clique protocol allows a clique of processes to dynamically partition itself into subcliques (due to network or host failure) and then merge when conditions permit. The EveryWare Gossip pool uses this protocol to reconfigure itself and rebalance the synchronization load dynamically in response to changing conditions.

The assumptions about synchronization count and granularity are more restrictive. Because each Gossip does a pair-wise comparison of application component state, N^2 comparisons are required for N application components. Moreover, if the overhead associated with state synchronization cannot be amortized by useful computation, performance will suffer. We believe that the prototype state-exchange protocol can be substantially optimized (or replaced by a more sophisticated mechanism) and that careful engineering can reduce the cost of state synchronization below what we were able to achieve. However, we hasten to acknowledge that not all applications or application classes will be able to use EveryWare effectively for Grid computation. Indeed, it is an interesting and open research question as to whether large-scale, tightly synchronized application implementations will be able to extract performance from Computational Grids, particularly if Grid resource performance fluctuates as much as we have typically observed [41], [39]. EveryWare does not allow any application to become an effective Grid application. Rather, it facilitates the deployment of applications, enabling them to ubiquitously draw computational power from a set of fluctuating resources.

Similarly, the consistency model required by the application program dramatically affects its suitability as an EveryWare application in particular and as a Grid application in general. The development of high-performance state replication facilities that implement tight bounds on consistency is an active area of research. EveryWare does not attempt to solve the distributed state consistency problem for all consistency models. Rather, it specifies the inclusion of replication and synchronization facilities as a constituent service. For the application described in Section 4, we implemented a loosely consistent service based on the Gossip protocol. Other, more tightly synchronized services can be incorporated, each with its own performance characteristics. We note, however, that applications having tight consistency constraints are, in general, difficult to distribute while maintaining acceptable performance levels. EveryWare is not intended to change the suitability of these programs with respect to Grid computing, but rather enables their implementation and deployment at whatever performance level they can attain.

3.4 The EveryWare Programming Model

An EveryWare application is structured as a set of computational application clients that request runtime management services from a set of application-specific servers. Application clients perform the actual "work" within the application using the features of a native Grid infrastructure. They may themselves be parallel or distributed programs, and they are not constrained to use only the lingua franca for communication, process control, and storage management. For operations that require more global control, such as scheduling, user interaction, etc., the computational application clients appeal to application-specific servers, also written by the application programmer. Like the clients, the application-specific servers are not constrained to use any single communication or process control mechanism; they may be written to use any native Grid infrastructure. However, using the lingua franca enables arbitrary client and server interaction and ensures portability across infrastructures.

Fig. 2 depicts the structure of an application. Application clients (denoted "A" in the figure) can execute in a number of different environments, such as NetSolve, Globus, Legion, Condor, etc. They communicate with application-specific scheduling servers to receive scheduling directives dynamically. Persistent state managers, tuned for the application, control and protect any program state that must survive host or network failure. Application performance logging servers allow arbitrary messages to be logged by the application. Finally, all application components use the EveryWare Gossip service to synchronize state. To anticipate load changes, the various application components consult the Network Weather Service (NWS).

This application architecture offers several advantages. First, the overall program can be constructed incrementally. Most concurrent programs are not structured so that some parts may execute while others are being revised, enhanced, or debugged. By structuring an EveryWare program as a communicating set of application-specific services, however, it is possible to interface new pieces of code with the running application. The adaptive nature of the code allows new processes to join and others to drop out while the code continues to execute. Since we do not have to restart the application every time we wish to add a new program component, we can improve and evolve the running application dynamically. Another advantage is that it allows us to implement infrastructure-specific clients that can get the best possible performance by running in "native" mode. Since the clients need only speak the protocol required by each server, we do not need to put a complete software veneer between the computational code and the native infrastructure.

Note that the EveryWare programming model is fundamentally different from that used by most procedure-oriented Grid infrastructures such as NetSolve [6], NINF [29], CORBA [31], and NEOS [26]. These infrastructures typically support applications structured as a single controlling client that makes remote-procedure calls to remote computational servers. Under the EveryWare programming model, computation is centered at the clients and program control is coordinated by a set of cooperating application-specific servers. Since the roles of client and server are reversed, we term this application architecture an inverted client-server model. This novel application structure offers EveryWare applications greater scalability and robustness than a single-client approach.
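To make the inverted client-server structure concrete, a computational client's main loop might look roughly like the sketch below. The structures and helper functions are hypothetical stand-ins for the application-specific scheduling, logging, and persistent-state protocols described above, not the actual prototype code.

/* Hypothetical skeleton of an EveryWare computational client under the
 * inverted client-server model: the client does the work and periodically
 * appeals to application-specific servers for direction. */
struct work_unit   { int placeholder; /* e.g., candidate graph + search parameters    */ };
struct work_report { int placeholder; /* progress, rate, best score since last check-in */ };

extern int  do_some_work(struct work_unit *w, struct work_report *r);  /* native code  */
extern int  scheduler_checkin(const struct work_report *r, struct work_unit *next);
extern void persistent_state_save(const struct work_unit *best);
extern void log_performance(const struct work_report *r);

int client_main_loop(void)
{
    struct work_unit   work;
    struct work_report report;

    for (;;) {
        /* Compute using whatever native infrastructure is available. */
        if (do_some_work(&work, &report) != 0)
            break;

        log_performance(&report);        /* logging service      */
        persistent_state_save(&work);    /* persistent state mgr */

        /* Check in with a scheduling server; it may migrate work,
         * hand back a better starting point, or adjust directives. */
        if (scheduler_checkin(&report, &work) != 0) {
            /* Server unreachable: consult the circulated server list
             * and retry with an alternate (omitted in this sketch). */
        }
    }
    return 0;
}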


Fig. 2. EveryWare application structure.


4 EXAMPLE APPLICATION: RAMSEY NUMBER SEARCH

The application we chose to implement to test the effectiveness of EveryWare attempts to improve the known bounds of classical Ramsey numbers. The nth classical or symmetric Ramsey number R_n = R(n,n) is the smallest number k such that any complete two-colored graph on k vertices must contain a complete one-colored subgraph on n of its vertices. It can be proven in a few minutes that R_3 = 6; it is already a nontrivial result that R_4 = 18, and the exact values of R_n for n > 4 are unknown.

Observe that, to show that a certain number j is a lower bound for R_n, one might try to produce a particular two-colored complete graph on (j-1) vertices that has no one-colored complete subgraph on any n of its vertices. We will refer to such a graph as a "counter-example" for the nth Ramsey number. Our goal was to find new lower bounds for Ramsey numbers by finding counter-examples.

This application addresses an unsolved problem in combinatorics using new search techniques. It is not a "classic" scientific application, however, since it does not model real-world phenomena, nor does it provide better applied mathematical or computational techniques for such modeling. However, this application was especially attractive as a first test of EveryWare because of its loose synchronization requirements and its resistance to exhaustive search techniques like those employed in cryptographic factoring.

This resistance arises from the combinatorial complexity of the problem. For example, if one wishes to find a new lower bound for R_5, one must search the space of complete two-colored graphs on 43 vertices, since the known lower bound is currently 43 [32]. Since such a graph has (43 choose 2) = 903 edges, there are 2^903 > 10^270 different two-colored graphs on 43 vertices. Even if one could examine 10^12 configurations every second, an exhaustive search would take over 10^250 years.

Therefore, we must use heuristic techniques to control the search process. The process of counter-example identification is related to distributed "branch-and-bound" state-space searching.
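For completeness, the arithmetic behind these estimates can be checked directly; the following is only a verification sketch of the figures quoted above:

\[
\binom{43}{2} = \frac{43 \cdot 42}{2} = 903, \qquad
2^{903} = 10^{\,903\log_{10} 2} \approx 10^{271.8} > 10^{270}.
\]
At $10^{12}$ configurations per second, an exhaustive search would need roughly
\[
\frac{2^{903}}{10^{12}\ \mathrm{s}^{-1}} \approx 10^{259.8}\ \mathrm{s}
\approx 10^{252.3}\ \mathrm{years} > 10^{250}\ \mathrm{years}.
\]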

4.1 Application Clients

Our goal was to create a dynamically changing population of computational processes executing different heuristics. Heuristic design is an active area of research in combinatorics [32]. As such, we designed the application to be able to incorporate different heuristic algorithms concurrently, each of which is implemented as a single application client. The clients would then use the lingua franca to communicate with a set of application servers to receive scheduling directives and state management services.

The heuristics that we used all involved directed search, by which we mean the following: On the search space of two-colored complete graphs of a particular size, there is a numerical "score" which assigns to each graph the degree to which it fails to be a counter-example in some suitable sense. There is also a set of manipulations called "moves" (transformations) that one can perform on a particular graph to produce other graphs. The algorithm, then, is roughly to start with an arbitrary graph and perform a sequence of moves with a view toward lowering the score by each successive move. Note that, in any such heuristic, it is necessary to provide some possibility of making a move that worsens the score. Otherwise, there is the danger that the search will get trapped at a local minimum which is not a global minimum.

In our case, the score assigned to a two-colored graph is simply the number of "violations," or complete one-colored subgraphs on n vertices, that it possesses. Thus, a graph is a counter-example if and only if its score is 0.

The various algorithms we employed used slightly different definitions for their moves. The simplest and most common was to change the color of a single edge. Thus, for a graph on 43 vertices possessing 903 edges, there are 903 possible moves that can be made from any given graph. In other algorithms, a move comprised changing the colors of three edges. Still other algorithms worked in restricted search spaces which partitioned the edges and only considered those graphs for which all the edges in any given partition were the same color. In such a case, a move comprised changing the colors of all the edges within a particular partition.

The two classes of search heuristics employed were those based on tabu search [32] and simulated annealing. In tabu search, the algorithm keeps a list (the tabu list) of a fixed length recording the most recent moves that have been made. From a given configuration, it examines all moves not in the tabu list, finds the one that gives the lowest score, and makes and records this move. The tabu list is in place to avoid loops. In practice, some element of randomness is necessary in order to avoid large loops. We employed two variants of the tabu search, namely one that allowed a particular move to be made no more than twice on the list and another that allowed a particular move onto the list if its last appearance was with a different predecessor.

The simulated annealing heuristic mimics the physical behavior of a mass as it undergoes cooling. In this case, the score of a configuration is analogous to the temperature of the mass. Generally, from a given configuration, the algorithm chooses a move at random and makes the move if it results in a lower score. Otherwise, it accepts the move with a probability that decreases as the score drops; if the move is rejected, it chooses another at random. Here again, this randomness has the effect of keeping the algorithm from getting trapped in a local minimum.
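The directed-search structure common to these heuristics can be summarized in code. The sketch below shows a generic annealing-style acceptance loop; the graph representation, scoring, and move routines are placeholders, and the acceptance-probability formula is only one plausible reading of the description above, not the implementation actually used.

/* Generic directed-search loop in the style described above.
 * score() counts violations (0 means a counter-example has been found);
 * random_move()/undo_move() and the graph type are hypothetical placeholders. */
#include <stdlib.h>
#include <math.h>

typedef struct graph graph;                 /* two-colored complete graph          */
extern long score(const graph *g);          /* number of violations                */
extern void random_move(graph *g, int *m);  /* apply a random move, record its id  */
extern void undo_move(graph *g, int m);     /* reverse a previously applied move   */

/* Returns 1 if a counter-example (score 0) was reached within max_steps. */
int directed_search(graph *g, long max_steps)
{
    long cur = score(g);
    for (long step = 0; step < max_steps && cur > 0; step++) {
        int  move;
        random_move(g, &move);
        long next = score(g);

        if (next <= cur) {
            cur = next;                      /* downhill move: always keep */
        } else {
            /* Uphill move: accept with a probability that shrinks as the
             * score (the "temperature" analog) drops; otherwise undo it. */
            double p_accept = exp(-(double)(next - cur) / (double)(cur + 1));
            if ((double)rand() / RAND_MAX < p_accept)
                cur = next;
            else
                undo_move(g, move);
        }
    }
    return cur == 0;
}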

4.1.1 Scheduling Service

To schedule the EveryWare Ramsey Number application, we use a collection of cooperating, independent scheduling servers to control application execution dynamically. Each computational client periodically contacts a scheduling server and reports its algorithm type, the IP address of the machine on which it is running, the progress it has made since it last made a scheduling decision, and the amount of time that has elapsed since its last contact. Servers are programmed to issue different control directives based on the type of algorithm the client is executing, how much progress the client has made, and the most recent computational rate of the client.

Scheduling servers are also responsible for migrating work. Clients report the number of violations in the graph they are testing when they check in. If the number is low, the server will ask the client for a copy of the graph it is currently considering. If it is high, the server sends the client a better graph and directs it to continue from a different point in the search space. The clients are programmed to randomize their starting points in different ways to prevent the system from dwelling irrevocably in a local minimum. In addition, the thresholds for identifying a "good" graph (one with a low number of violations) and a bad one, and the number of times a good one can be migrated to serve as a new starting point in the search space, are tunable parameters.
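A scheduling server's per-check-in decision, as described above, might be sketched as follows. The threshold values, report layout, and helper functions are all assumptions made for illustration; the actual tunable parameters and protocol are not given in the paper.

/* Sketch of one scheduling-server decision when a client checks in. */
#define GOOD_VIOLATION_THRESHOLD   50   /* "good" graph: few violations (assumed)   */
#define BAD_VIOLATION_THRESHOLD  5000   /* "bad" graph: restart elsewhere (assumed) */

struct checkin {
    char   client_ip[64];    /* where the client is running                   */
    int    algorithm;        /* which heuristic the client executes           */
    long   violations;       /* score of the graph currently being tested     */
    long   ops_since_last;   /* progress since the last scheduling decision   */
    double elapsed_sec;      /* time since last contact                       */
};

enum directive { KEEP_GOING, SEND_ME_YOUR_GRAPH, TAKE_THIS_GRAPH };

extern double forecast_client_rate(const char *ip);  /* NWS-style forecast (assumed) */
extern int    have_better_graph(long violations);    /* is a better stored graph known? */

enum directive schedule_checkin(const struct checkin *c)
{
    /* Forecast the client's rate; a very slow prediction could also trigger
     * migrating its workload to a faster machine (not shown). */
    double predicted_rate = forecast_client_rate(c->client_ip);
    (void)predicted_rate;

    if (c->violations <= GOOD_VIOLATION_THRESHOLD)
        return SEND_ME_YOUR_GRAPH;          /* harvest a promising graph   */
    if (c->violations >= BAD_VIOLATION_THRESHOLD && have_better_graph(c->violations))
        return TAKE_THIS_GRAPH;             /* restart from a better point */
    return KEEP_GOING;
}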

The schedulers also make decisions based on dynamic performance forecasting information. If a scheduler predicts that a client will be slow based on previous performance, it may choose to migrate that client's current workload to a machine that it predicts will be faster. Rather than basing that prediction solely on the last performance measurement for each client, the scheduler uses the NWS lightweight forecasting facilities to make its predictions. Note that this methodology is inspired by some of our previous work in building application-level schedulers (AppLeS) [34], [4]. AppLeS is an agent-based approach in which each application is fitted with a customized application scheduler that dynamically manages its execution. For the Ramsey Number Search application, however, a single scheduling agent would have been insufficient to control the entire application, both because it would limit the scalability of the application and because the agent would constitute a single point of failure. We designed an application-specific scheduling service that forms organized and robust, but dynamically changing, groups of cooperating processes that can make progress if and when the network partitions. As such, we term this type of scheduling Organized Robust AutoNomous Group Scheduling (ORANGS). ORANGS and AppLeS are, indeed, similar in that they use NWS performance forecasts to make application-specific scheduling decisions. However, the distributed and robust nature of the ORANGS service made it a more appropriate choice for the Ramsey Number Search application.

Notice that, for the Ramsey Number search application, the scheduling service considers the use of all available resources. When an application client checks in with a scheduling server, the server evaluates the client in terms of the performance it will be able to deliver to the application (using the forecasting services) and decides on the amount and type of work that client should receive. In all cases, the Ramsey Number search clients receive some amount of work to perform. For other applications, however, the scheduling service may decide that the use of a particular resource will hinder rather than aid performance and, hence, should be excluded. Therefore, while resource selection is not an issue for Ramsey Number search, the EveryWare programming model supports its implementation.

Schedulers within the scheduling service communicate nonpersistent state amongst themselves via the Gossip service. In particular, the IP addresses and port numbers of all servers are circulated so that new server instances can be added dynamically. Clients are furnished with a list of active servers when they make contact so that they can contact alternates in the event of a failed server communication. Similarly, scheduling servers learn of different Gossip servers, persistent state managers, and logging servers via Gossip updates.

4.1.2 Persistent State Management Service

To improve robustness, we identify three classes of program state within the application:

Local: State that can be lost by the application due to machine or network failure (e.g., local variables within each computational client).

Volatile-But-Replicated: State that is passed between processes as a result of Gossip updates, but not written to persistent storage (e.g., the up-to-date list of active servers).

Persistent: State that must survive the loss of all active processes in the application (e.g., the largest counter-example that the application finds).

We use a separate persistent state service for three reasons. First, we want to limit the size of the file system footprint left by the application. Many sites restrict the amount of disk storage a guest user may acquire. By separating the persistent storage functionality, we are able to dynamically schedule the application's disk usage according to available capacities. Second, we want to ensure that persistent state is ultimately stored in "trusted" environments. For example, we maintained a persistent state server at the San Diego Supercomputer Center because we were assured of reliable storage and regular tape backups. Lastly, we are able to implement runtime sanity checks on all persistent state accesses. If a process attempts to store a counter-example, for example, the persistent state manager first checks to make sure the stored object is, indeed, a Ramsey counter-example for the given problem size. This is a significant advantage of application-specific state management.

To implement this functionality, all persistent state objects must be typed. For each persistent type used in the program, the state manager needs a set of sanity checks (performed when an object is accessed) and a comparator operator so that the state may be synchronized by the Gossip service. We acknowledge that developing this functionality for all Grid applications may not be possible. However, we note that many Computational Grid infrastructures currently support mechanisms that can be used to implement the state management functionality we require for Ramsey Number search. For example, the sanity checks performed by the state manager were implemented, primarily, to prevent errant or malicious processes from damaging program state. Instead, Globus authentication mechanisms [13] could be used to provide access control so that only trusted processes may modify persistent state. Similarly, the Legion class management system [25] tracks object instances in a way that could be used to identify stale state. We wanted to ensure that all application components (computational clients and application-specific servers) would be portable to any environment, so we did not choose to rest any of the application's functionality on a particular infrastructure. Future versions of the Ramsey Search application may relax this restriction to further benefit from maturing Computational Grid technologies.
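A typed persistent-state registration of the kind just described, pairing each persistent type with a sanity check and a freshness comparator, might look like the sketch below; all names are hypothetical, and the counter-example check shown is only schematic.

/* Hypothetical registration of a typed persistent-state object with the
 * persistent state manager. Each type supplies a sanity check (run on every
 * access) and a comparator so the Gossip service can synchronize the state. */
#include <stddef.h>

typedef int (*ew_sanity_check_fn)(const void *obj, size_t len);
typedef int (*ew_freshness_cmp_fn)(const void *a, size_t alen,
                                   const void *b, size_t blen);

extern int ew_persistent_register_type(int type_id,
                                       ew_sanity_check_fn check,
                                       ew_freshness_cmp_fn compare);

/* Schematic sanity check for a stored Ramsey counter-example: verify that
 * the graph really has zero violations for the problem size it claims. */
struct stored_graph { int nvertices; int target_n; unsigned char edges[]; };
extern long count_violations(const struct stored_graph *g);  /* placeholder */

static int counterexample_sanity_check(const void *obj, size_t len)
{
    const struct stored_graph *g = obj;
    if (len < sizeof *g) return 0;        /* malformed object                */
    return count_violations(g) == 0;      /* must be a true counter-example  */
}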

4.1.3 Logging Service

To track the performance of the application dynamically, we implemented a distributed logging service. Scheduling servers base their decisions, in part, on performance information they receive from each computational client. Before the information is discarded, it is forwarded to a logging server so that it can be recorded. Having a separate service, again, allows us to limit and control the storage load generated by the application. For example, NPACI loaned our group a pair of file servers so that we could capture a performance log that spanned the time of the conference.

As with the persistent state managers and the scheduling servers, the logging servers register themselves with the Gossip service. Any application process wishing to log performance information learns of a logging server through the server list that is circulated. The logging servers do not register a state synchronization function, however. They use the Gossip service only to join the running application.

5 RESULTS

To test the efficacy of our approach, we deployed the Ramsey Number search application on a globally distributed set of resources during SC98. As part of the test, we entered EveryWare in the High-Performance Computing Challenge [22] (an annual competition held during the conference), as we believed that the fluctuating loads generated by our competitors would test the capabilities of our system vigorously.

We instrumented each application client to maintain a running count of the computational operations it performs so that we could monitor the performance of the Ramsey Number search application. The bulk of the work in each of the heuristics (see Section 4) is integer test and arithmetic instructions. Since each heuristic has an execution profile that depends largely on the point in the search space where it is searching, we were unable to rely on static instruction count estimates. Instead, we inserted counters into each client after every integer test and arithmetic operation. Since the ratio of instrumentation code to computational code is essentially one-to-one (one integer increment for every integer operation), the performance estimates we report are conservative. Moreover, we do not include any instrumentation instructions in the operation counts, nor do we count the instructions in the client interface to EveryWare; only "useful" work delivered to the application is counted. Similarly, we include all communication delays incurred by the clients in the elapsed timings. The computational rates we report include all of the overheads imposed by our software architecture and the ambient loading conditions experienced by the program during SC98. That is, all of the results we report in this section are conservative estimates of the sustained performance delivered to the application during the experiment.
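As an illustration of this instrumentation style, the toy kernel below pairs each counted integer operation with exactly one counter increment; the kernel itself is a stand-in, not the actual Ramsey search code, and loop-control operations are left uncounted here for brevity.

#include <stdio.h>

static unsigned long long op_count = 0;   /* "useful" integer ops only */

/* Toy kernel: count entries that violate a constraint.  Each counted
 * operation is paired with one increment, so the instrumentation-to-work
 * ratio is one-to-one and the increments themselves are never counted. */
static int violations(const int *color, int n)
{
    int bad = 0;
    for (int i = 0; i < n; i++) {
        op_count++;                 /* the comparison color[i] == 0 */
        if (color[i] == 0) {
            op_count++;             /* the addition bad + 1         */
            bad++;
        }
    }
    return bad;
}

int main(void)
{
    int coloring[8] = { 0, 1, 0, 1, 1, 0, 1, 0 };
    int bad = violations(coloring, 8);
    printf("violations=%d useful integer ops=%llu\n", bad, op_count);
    return 0;
}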

5.1 Execution Rate

As a Computational Grid experiment, we wanted to determine if we could obtain high application performance from widely distributed, heavily used, and nondedicated computational resources. In Fig. 3, we show the sustained execution performance of the entire application during the twelve-hour period including and immediately preceding the judging of our High-Performance Computing Challenge entry at SC98 on November 12, 1998 (see footnote 3).

The x-axis shows the time of day, Pacific Standard Time (see footnote 4), and the y-axis shows the average computational rate over a five-minute time period. The highest rate that the application was able to sustain was 2.39 billion integer operations per second, between 9:51 and 9:56, during a test an hour before the competition. The judging for the competition itself (which required a "live" demonstration) began at 11:00. As several competing projects were being judged simultaneously and many of our competitors were using the same resources we were using, the networks interlinking the resources suddenly experienced a sharp load increase. Moreover, many of the competing projects required dedicated access for their demonstration. Since we deliberately did not request dedicated access, our application suddenly lost computational power (as resources were claimed by and dedicated to other applications) and the communication overheads rose (due to increased communication load). The sustained performance dropped to 1.1 billion operations per second as a result. The application was able to adapt to the performance loss and reorganize itself so that by 11:10 (when the demonstration actually took place), the sustained performance had climbed to 2.0 billion operations per second.

This performance profile clearly demonstrates the potential power of Computational Grid computing. With nondedicated access, under extremely heavy load conditions, the EveryWare application was able to sustain supercomputer performance levels.


3. We demonstrated the system for a panel of judges between 11:00 AM and 11:30 AM PST.

4. SC98 was held in Orlando, Florida, which is in the Eastern time zone. Our logging and report facilities, primarily located at stable sites on the west coast, used Pacific Standard Time. As such, we report all time-of-day values in PST.

Fig. 3. Application speed.


In Fig. 4, we show the number of hosts used during the same time period. In this figure, each data point represents the number of hosts checking in during the corresponding five-minute period (see footnote 5). Note that the maximum host count (266) occurs at 23:51, as we ran a large-scale test of the system the night before the competition. However, the maximum host count does not correspond to the maximum sustained rate. While we were able to incorporate many new and powerful resources on the morning of the competition, we lost some of the workstations that were loaned to us by Condor during the night. Also, these host count numbers are based on unique IP addresses (and not process IDs), making them very conservative. Since some systems use the same IP address for all hosts (e.g., the NT Supercluster), the actual host population was much higher. However, for the combined host population, we could not distinguish between multiple processes on different hosts sharing the same IP address and multiple process restarts due to eviction. As a result, we report the more conservative estimates.
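The host-count computation can be sketched as follows; the log format (one "epoch-seconds IP-address" pair per line, in time order) and the fixed limits are assumptions made for illustration, not a description of the actual log files.

/* Sketch of the conservative host count: records are bucketed into
 * five-minute intervals and hosts are identified by unique IP address,
 * so multiple processes behind one IP count only once.  Input is
 * assumed to be in nondecreasing time order. */
#include <stdio.h>
#include <string.h>

#define MAX_HOSTS 4096
#define BUCKET_SECONDS 300

int main(void)
{
    char ips[MAX_HOSTS][16];
    int  nips = 0;
    long bucket = -1, t;
    char ip[16];

    /* Each input line: "<epoch-seconds> <dotted-quad IP>" */
    while (scanf("%ld %15s", &t, ip) == 2) {
        long b = t / BUCKET_SECONDS;
        if (b != bucket) {
            if (bucket >= 0)
                printf("%ld: %d unique hosts\n", bucket * BUCKET_SECONDS, nips);
            bucket = b;
            nips = 0;
        }
        int seen = 0;
        for (int i = 0; i < nips && !seen; i++)
            seen = (strcmp(ips[i], ip) == 0);
        if (!seen && nips < MAX_HOSTS)
            strcpy(ips[nips++], ip);
    }
    if (bucket >= 0)
        printf("%ld: %d unique hosts\n", bucket * BUCKET_SECONDS, nips);
    return 0;
}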

5.2 Adaptivity

We also wanted to measure the smoothness of the performance response the application was able to obtain from the Computational Grid. For the Grid vision to be implemented, an application must be able to draw "power" uniformly from the Computational Grid as a whole despite fluctuations and variability in the performance of the constituent resources. In Fig. 5 and Fig. 6, we compare the overall performance response obtained by the application (graph (c) in both figures) with the performance and resource availability provided by each infrastructure. Fig. 5 makes this comparison on a linear scale and Fig. 6 shows the same data on a log scale so that the wide range of performance variability may be observed. In Fig. 5a and Fig. 6a, we detail the number of cycles we were able to deliver successfully from each Grid infrastructure during the twelve hours leading up to the competition. Similarly, in Fig. 5b and Fig. 6b, we show the host availability from each infrastructure for the same time period. Together, these graphs show the diversity of the resources we used in the SC98 experiment.
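One way to make this comparison concrete, offered only as an illustration and not as the paper's formal definition of adaptivity, is to compare the coefficient of variation of the five-minute rates delivered by each infrastructure with that of the aggregate application rate:

\[
  \mathrm{CV}(X) = \frac{\sigma_X}{\mu_X},
  \qquad
  \mathrm{CV}\!\left(R_{\mathrm{app}}\right) \ll \mathrm{CV}\!\left(R_i\right)
  \quad \text{for each infrastructure } i,
\]

where $R_i$ is the series of five-minute rates delivered by infrastructure $i$ and $R_{\mathrm{app}}$ is the corresponding aggregate series; a much smaller coefficient of variation for the aggregate would indicate that the application smoothed out the variability of its constituent resources.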

Specifically, Condor supports a dynamic loan-and-reclaim resource usage model. Users agree to loan idle workstations to the Condor system for use by other processes. When a user-specified keyboard activity or load threshold is exceeded, the resource is declared busy and any Condor jobs that are running at the time are evicted. Note that Condor processing power and host count fluctuated through the night and then fell off as the day began in Wisconsin and user activity caused their workstations to be reclaimed. For Java, the performance trajectory was the opposite. We fitted the Java applets with the necessary logging features at approximately 4:30 AM, although we had a small number of test hosts running before then. At approximately 8:00 AM, we announced the availability of the Java implementation and solicited participation from "friendly" sites. In addition, we began to execute the Java applet using HotJava [18] on workstations that had been brought to SC98 for general use by conference attendees. At about the same time, Legion (which had been down since approximately midnight) became available again and the application immediately began to take advantage of the newly available resources. Our Globus utilization, however, was low until just after the competition ended at 11:30 AM, when it suddenly spiked. The Globus group entered the High-Performance Computing Challenge with two separate entries. As we did not request dedicated access or special access priority for the demonstration, our application was able to leverage these resources only after higher-priority Globus processes finished. NetSolve gave us access to the student workstation laboratories and several resources in the Innovative Computing Laboratory at the University of Tennessee. We detected a bug in the performance logging portion of the NetSolve implementation at approximately 8:00 AM; hence, we have no reliable performance numbers to report for the period before then. The bulk of the NT hosts we were able to leverage came from the Superclusters [30] located at the National Computational Science Alliance (NCSA) and in the Concurrent Systems Architecture Group [7] (CSAG) located at the University of California, San Diego (UCSD). These systems used batch queues to provide space-shared access to their processors. Unix host count remained relatively constant throughout the experiment, but performance jumped at the end as the Tera MTA (the fastest Unix host) was added to the resource pool.

5. The maximum time between check-ins for any computational client was set to five minutes during the test.

Fig. 4. Application host count.

Fig. 5. (a) Execution rate by infrastructure. (b) Host count by infrastructure. (c) Total sustained execution rate.

In Fig. 5c, we reproduce Fig. 3 for the purpose of comparison. Fig. 6c shows this same data on a log scale. By comparing graphs (a) and (b) to (c) on each scale, we expose the degree to which EveryWare was able to realize the Computational Grid paradigm. Despite fluctuations in the deliverable performance and host availability provided by each infrastructure, the application itself was able to draw power from the overall resource pool relatively uniformly. As such, we believe the EveryWare example constitutes the first application to be written that successfully demonstrates the potential of high-performance Computational Grid computing. It is one of the first examples of a truly adaptive Grid program.

5.3 Aggregate Performance

Fig. 7 shows the total number of integer operations the application was able to obtain during the twelve hours before the competition (on a log scale). With the exception of Java and NetSolve, all infrastructures were within an order of magnitude in terms of the cycles they delivered. Interpreted Java applet performance was typically between three and five times slower than native binary execution, and the NetSolve computational servers were shared by other NetSolve jobs and student projects.

Fig. 6. Log scale. (a) Sustained processing rate. (b) Host count by infrastructure. (c) Total sustained rate.

Fig. 7. Total cycle count by infrastructure.

5.4 Robustness

High-performance computer users often complain about application sensitivity to resource failure in distributed environments. Fig. 8 shows the total number of hosts and processes controlled by each infrastructure that were used by the application during the twelve hours leading up to the competition. Comparing the number of processes to hosts gives an indication of the process failure and restart rate during the experiment. Each computational client was programmed to run indefinitely; therefore, in the absence of process failure, the number of processes would equal the number of hosts. We implemented several "ad hoc" process restart mechanisms for the environments in which they were not automatic. However, most of the process restarts were due either to deliberate termination on our part while debugging or to dynamic resource reclamation by resource owners. On the Condor system, we ran each computational client as a "vanilla" job, which is terminated without notice when the resource on which it is running is reclaimed and subsequently restarted when another suitable resource is free. It is interesting that, despite the midweek daytime usage, process restart due to resource reclamation was relatively infrequent in the Condor environment during the experiment. The Globus comparison illustrates the power of the GRAM interface [11]. Globus allows all processes to be launched and terminated through a single GRAM request. During the time leading up to the competition, we were improving and debugging our Globus implementation. Having a single control point allowed us to restart large batches of processes easily. Under Legion, the concept of process is not defined. Instead, class "instances" move between blocked and running states (and vice versa), so we simply report the number of instances we used during the demonstration. As a result, this level of process restart activity is an estimate. The numbers are accurate for the Globus, Condor, and Unix environments but somewhat ambiguous for the other infrastructures. Despite the level of process failure we were able to detect, we were able to obtain the sustained processing rates, shown in Fig. 3, during the same time period.
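For environments without automatic restart, an "ad hoc" restart mechanism amounts to keeping the client alive from the outside. A minimal sketch of such a wrapper appears below; the client binary name, back-off intervals, and error handling are illustrative assumptions rather than the mechanisms actually deployed.

/* Keep one computational client running indefinitely, restarting it
 * whenever it exits or is evicted by the resource owner. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: run one client until it exits or is killed.
             * The binary path is a placeholder. */
            execl("./ramsey_client", "ramsey_client", (char *)NULL);
            _exit(127);                 /* exec failed */
        } else if (pid > 0) {
            int status;
            waitpid(pid, &status, 0);   /* parent: wait for the client to die */
            fprintf(stderr, "client exited (status %d); restarting\n", status);
            sleep(30);                  /* brief, arbitrary back-off */
        } else {
            perror("fork");
            sleep(60);
        }
    }
}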

Indeed, EveryWare and the application design we used proved to be quite robust. In Fig. 9, we show host counts over five-minute intervals during the 17 days prior to the judging on November 12. Some portion of the application was executing more or less continuously during the entire period. As we concentrated our initial efforts on developing the EveryWare toolkit and new Ramsey search heuristics, we did not add performance logging to the running system until October 26. The program had actually been running continuously since early June of 1998; however, we only have performance data dating from the end of October. Note that we were able to add and then completely revise the performance logging service while the program was in execution.

Fig. 8. Total host count by infrastructure.

Fig. 9. Sixteen-day host counts.

5.5 Ubiquity

For the Computational Grid paradigm to succeed, all useful resources must be accessible by the application. Metaphorically, all profitable methods of power generation must be usable by any power consumer. Fig. 10 compares the delivered performance from the fastest host controlled by each infrastructure. The values not only benchmark our code on various architectures, but also show the wide range of resource options we were able to leverage during the experiment. In each case, we attempted to use the native, vendor-specific C compiler (as opposed to GNU gcc) with full optimization enabled. On the top half of the figure, we compare the best performance from each infrastructure. The fastest Unix machine was the Tera MTA [37]. We report only the single-processor performance; however, the Tera was also able to parallelize the code automatically and achieve an almost linear speed-up on two processors. The fastest NT-based machine was located at the University of Wisconsin, but we are unable to determine its architectural characteristics. An unknown participant downloaded the NT binary from the EveryWare home page when we announced that the system was operational on Wednesday morning. The fastest Condor machine was a Pentium P6 running Solaris, also located at the University of Wisconsin. Single-processor Pentium P6 performance was particularly good (second only to the Tera) for the integer-oriented search heuristics we developed. The fastest Legion host was a Digital Equipment Corporation Alpha processor running Red Hat Linux, located at the University of Virginia, and the fastest Globus machine was an experimental Convex V-Class host located at the Convex development facility in Richardson, Texas. Surprisingly, the fastest Java execution was faster than the fastest NT, Legion, and Globus machines. An unknown participant at Kansas State University loaded the applet using Microsoft's Internet Explorer on a 300 MHz dual-processor Pentium II machine running NT. We speculate that a student used some form of just-in-time compilation technology to achieve the execution performance depicted in the figure, although we are unable to ascertain how this performance level was reached.

Fig. 10. Host speeds.

On the bottom half of the figure, we show the best single-processor performance of other interesting and popular machines. The NT Superclusters at UCSD and NCSA generated almost identical per-node processing rates. A single node of the Cray T3E located at the San Diego Supercomputer Center was able to run only slightly faster than a single node of the Berkeley NOW [8]. This comparison surprised us since the T3E is space shared (meaning that each process had exclusive access to its processor once it made it through the batch queue) and the NOW (which is timeshared) was heavily loaded. The bottom-most entry shows the speed of a publicly accessible Apple iMac workstation located in a coffee shop on the UCSD campus, which is typical of the interpreted Java performance we were able to achieve.

In addition to detailing the relative performance of different architectures and infrastructures, Fig. 10 demonstrates the utility of EveryWare. It would not have been possible to include experimental (and powerful) resources, such as the Tera MTA and the NT Superclusters, without the EveryWare toolkit. At the time of the experiment, none of the existing Grid infrastructures had been ported to either architecture. We were able to port EveryWare to both systems quickly (under 30 minutes for the Tera), allowing us to couple them with other, more conventional hosts that did support some form of Grid infrastructure. By providing execution ubiquity, EveryWare was able to leverage resources that no other Grid computing infrastructure could access. As such, the Ramsey Number Search application is the first program to couple the Tera MTA, both NT Superclusters, and the Berkeley NOW with parallel supercomputers such as the Cray T3E, workstations, and desktop web browsers.

6 CONCLUSIONS AND FUTURE WORK

By leveraging a heterogeneous collection of Grid software and hardware resources, dynamically forecasting future resource performance levels, and employing relatively simple distributed state management techniques, EveryWare has enabled the first application implementation that meets the requirements for Computational Grid computing. In [12], the authors describe the qualitative criteria that a Computational Grid must fulfill: the provision of pervasive, dependable, consistent, and inexpensive computing.

. Pervasive. At SC98, we were able to use EveryWare to execute a globally distributed program on machines ranging from the Tera MTA to a web browser located in a campus coffee shop at UCSD.

. Dependable. The Ramsey Number Search application ran continuously from early June, 1998, until the High-Performance Computing Challenge on November 12.

. Consistent. During the twelve hours leading up to the competition itself, the application was able to draw uniform compute power from resources with widely varying availability and performance profiles.

. Inexpensive. All the resources used by the Ramsey Number Search application were nondedicated and accessed via a nonprivileged user login.

We plan to study how EveryWare can be used to implement other Grid applications as part of our future efforts. In particular, we plan to use it to build Grid versions of a medical imaging code written at the University of Tennessee and a data mining application from the University of Torino. We also plan to extend ORANGS to include storage scheduling directives and memory constraints. Finally, we plan to leverage our experience with EveryWare to build new Network Weather Service sensors for different Grid infrastructures.

ACKNOWLEDGMENTS

The authors would like to express that it is impossible to acknowledge and adequately thank all of the people and organizations that helped make the EveryWare demonstration at SC98 a success. They would like to attempt to express their gratitude to the AppLeS group at UCSD for enduring weeks of maniacal behavior. In particular, they thank Fran Berman for her moral support during the effort and Marcio Faerman, Walfredo Cirne, and Dmitrii Zagorodnov for launching EveryWare on every conceivable public email and Java workstation at SC98 itself. They thank NPACI for supporting our High-Performance Challenge entry in every way and, in particular, Mike Gannis for enthusiastically making the NPACI booth at SC98 ground zero for EveryWare. Rob Pennington at NCSA left no stops unpulled on the NT Supercluster so that the authors could run and run fast, and Charlie Catlett, once again, made it all happen at "The Alliance." They thank Miron Livny (the progenitor of Condor and the University of Wisconsin) for first suggesting and then insisting that EveryWare happen. Henri Casanova, at UCSD, single-handedly ported EveryWare to NetSolve after an off-handed mention of the project was carelessly made by a project member within his range of hearing. Steve Fitzgerald at California State University, Northridge and ISI/USC introduced the authors to the finer and more subtle pleasures of Globus, as did Greg Lindahl for analogously hedonistic experiences with Legion. Brent Gorda and Ken Sedgewick at MetaExchange Corporation donated entirely too much time, space, coffee, good will, more coffee, sound advice, and patience to the effort. Allen Downey and the Colby Supercomputer Center provided the authors with cycles, encouragement, and more encouragement. Cosimo Anglano of the Dipartimento di Informatica, Università di Torino provided them with intercontinental capabilities and tremendously spirited support. Lastly, the authors thank everyone who participated anonymously via their web interface and downloads. They express that they may not know who you are, but they know your IP addresses and thank you for helping them.

REFERENCES

[1] H. Abu-Amara and J. Lokre, "Election in Asynchronous Complete Networks with Intermittent Link Failures," IEEE Trans. Computers, vol. 43, no. 7, pp. 778-788, 1994.

[2] O. Arndt, B. Freisleben, T. Kielmann, and F. Thilo, "Scheduling Parallel Applications in Networks of Mixed Uniprocessor/Multiprocessor Workstations," Proc. Int'l Symp. Computer Architecture 11th Conf. Parallel and Distributed Computing, Sept. 1998.

[3] K. Arnold, B. O'Sullivan, R. Scheifler, J. Waldo, and A. Wollrath, The Jini Specification. Addison-Wesley, 1999.

[4] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, "Application Level Scheduling on Distributed Heterogeneous Networks," Proc. Supercomputing 1996, 1996.

[5] "The Bovine RC5-64 Project," http://distributed.net/rc5/, 1999.

[6] H. Casanova and J. Dongarra, "NetSolve: A Network Server for Solving Computational Science Problems," Int'l J. Supercomputer Applications and High Performance Computing, 1997.

[7] Concurrent Systems Architecture Group, http://www-csag.ucsd.edu/, 1999.

[8] D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong, "Parallel Computing on the Berkeley NOW," Proc. Ninth Joint Symp. Parallel Processing, 1997. Also available at http://now.CS.Berkeley.EDU/Papers2.

[9] L. DeRose, Y. Zhang, and D. Reed, "SvPablo: A Multi-Language Performance Analysis System," Proc. 10th Int'l Conf. Computer Performance Evaluation, Sept. 1998.

[10] MPI Forum, "MPI: A Message-Passing Interface Standard," Technical Report CS-94-230, Univ. of Tennessee, Knoxville, 1994.

[11] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit," Int'l J. Supercomputer Applications, 1997.

[12] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.

[13] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, "Security Architecture for Computational Grids," Proc. Fifth ACM Conf. Computer and Comm. Security, pp. 83-92, 1998.

[14] I. Foster, C. Kesselman, and S. Tuecke, "The Nexus Approach to Integrating Multithreading and Communication," J. Parallel and Distributed Computing, 1996.

[15] H. Garcia-Molina, "Elections in a Distributed Computing System," IEEE Trans. Computers, vol. 31, no. 1, pp. 49-59, Jan. 1982.

[16] J. Gehring and A. Reinefeld, "MARS - A Framework for Minimizing the Job Execution Time in a Metacomputing Environment," Future Generation Computer Systems, 1996.

[17] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[18] J. Gosling and H. McGilton, "The Java Language Environment," Sun White Paper, http://java.sun.com/docs/white/, 1996.

[19] A.S. Grimshaw, W.A. Wulf, J.C. French, A.C. Weaver, and P.F. Reynolds, "Legion: The Next Logical Step Toward a Nationwide Virtual Computer," Technical Report CS-94-21, Univ. of Virginia, 1994.

[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.

[21] T. Haupt, E. Akarsu, G. Fox, and W. Furmanski, "Web Based Metacomputing," Technical Report SCCS-834, Syracuse Univ. Northeast Parallel Architectures Center, 1999. Also available at http://www.npac.syr.edu/techreports/html/0800/abs-0834.html.

[22] "High-Performance Computing Challenge at SC98," http://www.supercomp.org/sc98/hpcc/, Nov. 1998.

[23] R. Jones, "Netperf: A Network Performance Monitoring Tool," http://www.netperf.org/, 2001.

[24] A. Lenstra and M. Manasse, "Factoring by Electronic Mail," Proc. Advances in Cryptology - EUROCRYPT '89, pp. 355-371, 1990.

[25] M.J. Lewis and A.S. Grimshaw, "Using Dynamic Configurability to Support Object-Oriented Programming Languages and Systems in Legion," Technical Report CS-96-19, Univ. of Virginia, 1996.

[26] J.M.M. Ferris and M. Mesnier, "NEOS and Condor: Solving Optimization Problems Over the Internet," Technical Report ANL/MCS-P708-0398, Argonne National Laboratory, Mar. 1998. http://www-fp.mcs.anl.gov/otc/Guide/TechReports/index.html.

[27] Microsoft Windows NT, http://www.microsoft.com/ntserver/nts/techdetails/overview/WpGlobal.asp, 1999.

[28] B. Miller, M. Callaghan, J. Cargille, J. Hollingsworth, R. Irvin, K. Karavanic, K. Kunchithapadam, and T. Newhall, "The Paradyn Parallel Performance Measurement Tools," Computer, vol. 28, no. 11, pp. 37-46, Nov. 1995.

[29] H. Nakada, H. Takagi, S. Matsuoka, U. Nagashima, M. Sato, and S. Sekiguchi, "Utilizing the Metaserver Architecture in the Ninf Global Computing System," Proc. High-Performance Computing and Networking '98, pp. 607-616, 1998.

[30] "NT SuperCluster," http://www.ncsa.uiuc.edu/General/CC/ntcluster/, 1999.

[31] OMG, The Complete Formal/98-07-01: The CORBA/IIOP 2.2 Specification, 1998.

[32] S. Radziszowski, "Small Ramsey Numbers," Dynamic Survey DS1, Electronic J. Combinatorics, vol. 1, p. 28, 1994.

[33] R.L. Ribler, J.S. Vetter, H. Simitci, and D.A. Reed, "Autopilot: Adaptive Control of Distributed Applications," Proc. Seventh IEEE Symp. High Performance Distributed Computing, Aug. 1998.

[34] N. Spring and R. Wolski, "Application Level Scheduling: Gene Sequence Library Comparison," Proc. ACM Int'l Conf. Supercomputing 1998, July 1998.

[35] Sun Microsystems, "XDR: External Data Representation, 1987," ARPA Working Group Requests for Comment, DDN Network Information Center, SRI Int'l, Menlo Park, Calif., RFC-1014, 1987.

[36] T. Tannenbaum and M. Litzkow, "The Condor Distributed Processing System," Dr. Dobb's J., Feb. 1995.

[37] "The Tera MTA," http://www.tera.com/, 1999.

[38] J. Weissman and X. Zhao, "Scheduling Parallel Applications in Distributed Networks," Concurrency: Practice and Experience, vol. 1, no. 1, 1998.

[39] R. Wolski, "Dynamically Forecasting Network Performance Using the Network Weather Service," Cluster Computing, 1998. Also available at http://www.cs.utk.edu/~rich/publications/nws-tr.ps.gz.

[40] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Future Generation Computer Systems, vol. 15, no. 5, pp. 757-768, 1999. Also available at http://www.cs.utk.edu/rich/publications/nws-arch.ps.gz.

[41] R. Wolski, N. Spring, and J. Hayes, "Predicting the CPU Availability of Time-Shared Unix Systems on the Computational Grid," Proc. Eighth IEEE Symp. High Performance Distributed Computing, 1999. Also available at http://www.cs.utk.edu/rich/publications/nws-cpu.ps.gz.

Rich Wolski is an assistant professor of computer science at the University of California, Santa Barbara. His research interests include computational grid computing, distributed computing, scheduling, and resource allocation. In addition to the EveryWare project, he leads the Network Weather Service project, which focuses on on-line prediction of resource performance, and the G-commerce project studying computational economies for the Grid. He is a member of the IEEE.



John Brevik completed his PhD in mathematics in 1996 at the University of California at Berkeley under Robin Hartshorne. His interest in Ramsey numbers arose from a combination of caffeine jitters, a slightly flippant response to a challenge from the lead author to produce a combinatorially challenging mathematics problem, and a gross underestimation of the formidability of the problem.

Graziano Obertelli received his Laurea degree from the Dipartimento di Scienze dell'Informazione at the Università di Milano. Since joining the Department of Computer Science and Engineering at the University of California, San Diego (UCSD), he has been conducting research in distributed and parallel computing. Currently, he is working in the Grid Lab at UCSD. He is a member of the IEEE Computer Society.

Neil Spring is a PhD student at the University of Washington. He received his BS in computer engineering from the University of California, San Diego in 1997 and his MS in computer science from the University of Washington in 2000. His research interests include congestion control, network performance analysis, distributed operating systems, adaptive scheduling of distributed applications, and operating system support for networking. He is a student member of the IEEE Computer Society.

Alan Su is a PhD candidate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests have ranged from database systems to scheduling problems in computational grid environments. Currently, he is focusing on improving the performance of grid applications which exhibit dynamic performance characteristics. He is a student member of the IEEE Computer Society.

