
The Next Frontier: Interactive and Closed Loop Performance Steering*

Daniel A. Reed, Christopher L. Elford, Tara M. Madhyastha, Evgenia Smirni, Stephen E. Lamm
{reed, elford, tara, esmirni, ...}@cs.uiuc.edu
Department of Computer Science
University of Illinois
Urbana, Illinois 61801

* This work was supported in part by the Advanced Research Projects Agency under ARPA contracts DAVT63-91-C-0029 and DABT63-93-C-0040, by the National Science Foundation under grants NSF IRI 92-12976, NSF ASC 92-12369, and NSF CDA 94-01124, and by the National Aeronautics and Space Administration under NASA contracts NGT-51023, NAG-1-613, and USRA 5555-22.

Abstract

Software for a growing number of problem domains has complex, time varying behavior and unpredictable resource demands (e.g., WWW servers and parallel input/output systems). While current performance analysis tools provide insights into application dynamics and the causes of poor performance, with a posteriori analysis one cannot adapt to temporally varying application resource demands and system responses. We believe that the solution to this performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with real-time adaptive control mechanisms that select and configure resource management algorithms automatically, based on observed application behavior, or interactively, through high-modality virtual environments. We motivate this belief by first describing our experiences with performance analysis tools, input/output characterization, and WWW server analysis, and then sketching the design of interactive and closed loop adaptive control systems.

1 Introduction

To attack problems of increasing complexity, application developers are eschewing regular, easily parallelized algorithms in favor of more complicated but flexible approaches that can exploit the multiplicity of temporal and spatial scales inherent in multidisciplinary problems. In consequence, an increasing fraction of important applications have complex, data dependent execution behavior and time varying resource demands. Because the interactions between application and system software change during and across application executions, it is difficult or impossible to determine a globally optimal application configuration or to statically configure runtime systems and resource management policies.

As the scope of scalable parallel computing expands from regular computations and single parallel systems to irregular computations and distributed collections of heterogeneous parallel systems, software complexity and associated optimization problems increase commensurately. This view is buttressed by our experiences characterizing parallel input/output behavior [2] and analyzing WWW server traffic patterns [7]. Applications in both domains have complex, irregular resource access patterns that make static allocation of resources inefficient.

In contrast to rapid changes in application software models, tools for debugging and tuning parallel software have not kept pace. Simply put, current software tools are data limited and enforce a passive view: users and developers must manipulate text and graphics via a small workstation window on the complex world of large-scale, heterogeneous metacomputing, with few options to adapt application behavior or resource management policies during execution.

We believe the solution to the performance optimization conundrum is integration of dynamic performance instrumentation and on-the-fly performance data reduction with configurable, malleable resource management algorithms, and real-time adaptive control mechanisms that either automatically choose and configure resource management algorithms based on application request patterns and observed system performance or allow users to interactively guide application behavior. To capitalize on human instincts and skills, interactive guiding is best done in high-modality virtual environments where users can directly manipulate software representations and behavior.

Building on this thesis, in §2 we describe the design of the Pablo performance analysis environment and the motivations for its current structure. In §3 and §4, we describe our experiences using the Pablo software to capture and analyze the input/output dynamics of large, input/output intensive parallel applications and of large-scale WWW servers.

Based on these experiences, in §5 we describe the lessons learned from this analysis, and in §6 and §7 we sketch a vision of open and closed loop adaptive control systems for resource management policy selection. In §8 we describe related work. Finally, §9 summarizes the current state of our work and outlines plans for continued research.

2 Performance Analysis Software

The critical importance of robust, flexible, and efficient performance tools has long been recognized and has been a major discussion topic at several national meetings [9, 15]. By necessity, such tools must embody knowledge of the execution environment and identify performance bottlenecks in terms of application code interaction with the execution environment. Moreover, because the root causes of poor performance or unexpected program behavior may lie with run-time libraries, compilers, operating system, or hardware, tools must gather and correlate information from many sources.

Ideally, a complete performance analysis environment should include instrumentation at multiple hardware and software levels, real-time and post-mortem data reduction, and an extensible set of data analysis and correlation tools. The environment also should be portable across a range of parallel and distributed systems, allowing users to amortize learning costs across many contexts; scalable with the size of the system being studied, allowing performance optimization on large systems; and extensible, allowing addition of functionality as needed. These beliefs are based on our experiences building three generations of hardware and software performance analysis tools for parallel systems [8, 13], as well as the collective experiences of the performance tool community [15].

For the past five years, we have worked to develop a software performance analysis infrastructure, called Pablo [11], that embodies the design goals of portability, extensibility, and scalability. Below, we describe the components of the Pablo environment, the keys to its adaptability, and examples of its application to selected problem domains.

2.1 Pablo Performance Environment

As we described in [12], the complete Pablo performance environment (Pablo is a registered trademark of the Board of Trustees of the University of Illinois) consists of the following:

- An extensible self-defining data format (SDDF) and associated library that separates the structure of performance data records from their semantics, allowing easy addition of new performance data types.
- An instrumenting parser capable of generating instrumented SPMD source code.
- Extensible instrumentation libraries that can capture event traces, counts, or interval times and reduce the captured performance data on the fly.
- Graphical performance data display and sonification tools, based on coarse-grain graphical programming, that support rapid prototyping of performance analyses.

The Pablo instrumentation software captures dynamic performance data via instrumentation library calls that are inserted in the application source code. Typically, these library calls bracket key software constructs (e.g., procedure, input/output, or communication calls) and invoke the instrumentation library to record timestamps, durations, and parameters of the constructs. During program execution, the performance data can be either directly recorded by the data capture library or processed by one or more data analysis extensions prior to recording (e.g., to generate a statistical profile from a timestamped event stream).

As the performance data is generated and optionally processed, it can be extracted via network sockets for real-time display or, with the requisite interfaces, used to interactively control application or system behavior; see §7. Alternatively, after program execution completes, the data can be analyzed by a toolkit of data transformation modules capable of processing the self-defining data format (SDDF). Just as the dataflow prototyping model proved successful in visualization (e.g., with AVS and SGI Explorer), the Pablo coarse-grain, graphical dataflow support for rapid prototyping of performance analyses makes it possible to quickly construct new data analyses and graphical displays.

2.2 Pablo Software Adaptability

To satisfy the design goals of portability and extensibility, the Pablo environment separates performance data structure and data semantics via the self-defining data format (SDDF) and isolates the implementation details of performance data capture on a particular system behind a set of standard library interfaces. The former has allowed us to easily define and process new types of performance data without changing the underlying Pablo software, and the latter has made possible broad extension of the data capture library.

Although several data meta-formats have been developed (e.g., netCDF and HDF) for grid-based scientific data, SDDF was designed and optimized specifically to represent record-based performance data. SDDF data streams consist of an initial group of record descriptors and a subsequent stream of record instances. The SDDF descriptors define the structure of the record instances in the same way as record declarations in Pascal or C, though SDDF records are limited to scalars and arrays of character strings, integers, and single and double precision floating point values. The data following the SDDF descriptors consists of a stream of descriptor tag and data record pairs. Each tag identifies the record descriptor that defines the immediately following data.

By separating the structure of data from its semantics and allowing one to define new performance data records appropriate to a particular instrumentation context, the SDDF library allows one to construct tools that can extract and process SDDF records and record fields with minimal knowledge of the data's deeper semantics. Moreover, the Pablo coarse-grain dataflow model for rapid prototyping embodies performance analysis semantics in the analysis graph construction, where one interactively specifies what SDDF records and fields should be processed, rather than in the semantics of the individual graph modules.
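To make these two mechanisms concrete, the following C++ sketch (illustrative only; TraceStream, writeDescriptor, and traced_read are hypothetical names, not the Pablo library interface) shows an instrumentation call that brackets a file read to record its timestamp, duration, and size, emitting records into an SDDF-like stream in which a descriptor defines the record structure once and each data record carries only a tag and field values.

    #include <chrono>
    #include <cstdio>
    #include <unistd.h>

    // Hypothetical SDDF-like stream: a record descriptor defines structure once;
    // each subsequent data record carries only a descriptor tag plus field values.
    class TraceStream {
    public:
        explicit TraceStream(const char* path) : out_(std::fopen(path, "w")) {}
        ~TraceStream() { if (out_) std::fclose(out_); }

        void writeDescriptor(int tag, const char* fieldLayout) {
            if (out_) std::fprintf(out_, "#%d %s\n", tag, fieldLayout);  // structure only, no semantics
        }
        void writeReadRecord(int tag, double timestamp, double duration, long bytes) {
            if (out_) std::fprintf(out_, "%d %.6f %.6f %ld\n", tag, timestamp, duration, bytes);
        }
    private:
        std::FILE* out_;
    };

    static const int READ_RECORD_TAG = 1;

    // Bracketing wrapper around a file read: record when it started, how long it
    // took, and how many bytes were requested, then forward to the real call.
    ssize_t traced_read(TraceStream& trace, int fd, void* buf, size_t count) {
        using clock = std::chrono::steady_clock;
        auto start = clock::now();
        ssize_t result = read(fd, buf, count);                 // the bracketed construct
        auto stop = clock::now();
        double timestamp = std::chrono::duration<double>(start.time_since_epoch()).count();
        double duration  = std::chrono::duration<double>(stop - start).count();
        trace.writeReadRecord(READ_RECORD_TAG, timestamp, duration, static_cast<long>(count));
        return result;
    }

In this sketch a descriptor such as trace.writeDescriptor(READ_RECORD_TAG, "timestamp duration bytes") would be written once at startup; analysis tools could then extract and process the tagged records without any deeper knowledge of what the fields mean.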

[Figure 1: File Read Durations During Execution (PRISM on Intel Paragon XP/S). (a) OSF 1.3.3 (16 I/O nodes); (b) OSF 1.4 (80 I/O nodes). Each panel plots read durations (seconds, log scale) against timestamp (seconds).]

The second key to Pablo's adaptability is the design of a set of software extension interfaces for the Pablo data capture library. By separating low-level instrumentation details (e.g., timestamp generation and data extraction) from data processing, one can develop platform-independent data analysis extensions that process SDDF performance data records prior to extraction. Exploiting the Pablo instrumentation library's extension interfaces, we have expanded the instrumentation library beyond its initial target of message passing codes to include support for analysis of codes written in data parallel languages [1], study of World Wide Web behavior [7], analysis of application input/output patterns [2, 16], and study of parallel file system policies [4].

3 Input/Output Characterization

The modest input/output configurations and file system limitations of many current high-performance systems preclude solution of problems with large input/output needs. Moreover, technology trends are increasing the already wide disparity between the performance of secondary and tertiary storage devices and the performance of processors and communication links. In consequence, input/output hardware and file system parallelism is the key to bridging the performance gap. This requires detailed knowledge of the input/output characteristics of scalable parallel applications and exploitation of such knowledge to design and manage parallel input/output systems.

In short, one must characterize extant input/output patterns both quantitatively (i.e., request sizes and durations) and qualitatively (i.e., why applications have particular access patterns).

3.1 Characterization Techniques

As part of the national Scalable I/O (SIO) initiative [10], we have extended the Pablo infrastructure to capture and analyze the temporal and spatial patterns of parallel application input/output. The initial focus on application level characterization was motivated by a desire to isolate and identify file access patterns that were intrinsic to each application and to understand the interactions between application request patterns and the hardware and software of parallel input/output systems. (At present, we are augmenting application data with instrumentation of input/output device drivers to capture physical input/output data.)

Using the Pablo instrumentation library extension mechanisms, the input/output characterization software supports either detailed tracing of individual input/output calls or statistical summarization of input/output activity. The former generates event trace records that include the time, duration, size, and other parameters of each input/output operation. When detailed event trace capture might generate an excessive volume of data, perturbing the input/output system being measured, statistical analysis can reduce the trace data before extraction.
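As a schematic contrast between the two modes (not the actual Pablo extension interfaces; all names here are illustrative), the structure below reduces a stream of input/output events to a small running summary that can be extracted periodically, rather than emitting one trace record per operation.

    #include <algorithm>
    #include <cstddef>

    // Running summary of input/output activity for one file, time window, or
    // file region; updated per operation and extracted periodically instead of
    // writing a full event trace record for every call.
    struct IoSummary {
        std::size_t operations = 0;
        std::size_t bytes      = 0;
        double      totalTime  = 0.0;   // seconds
        double      maxTime    = 0.0;   // seconds

        void record(std::size_t requestBytes, double duration) {
            ++operations;
            bytes     += requestBytes;
            totalTime += duration;
            maxTime    = std::max(maxTime, duration);
        }
        double meanTime() const { return operations ? totalTime / operations : 0.0; }
        void reset() { *this = IoSummary{}; }   // begin a new summarization interval
    };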

[Figure 2: File Read Durations and Request Sizes (PRISM on Intel Paragon XP/S). (a) OSF 1.3.3 (16 I/O nodes); (b) OSF 1.4 (80 I/O nodes). Each panel plots read durations (seconds) against read size (bytes).]

These statistical summaries can take one of three forms: file lifetime, time window, or file region. The first describes the sizes and latencies of requests during the time interval that input/output occurs to a given file. The second and third are temporal and spatial analogs, describing access patterns during an interval of time or within a file region, respectively.

3.2 Input/Output Evolution

During the past three years, we and others have exploited the extended Pablo analysis software to study the behavior of a variety of parallel applications on the Intel Paragon XP/S and IBM SP/2 [2, 16]. These studies have shown that parallel input/output patterns are much more complex and irregular than those expected based on an extrapolation from input/output patterns on either vector supercomputers or high-performance workstations.

As one example, we have tracked the evolving performance of a computational fluid dynamics code (PRISM) across two releases of the Intel Paragon OSF/1 parallel file system (PFS) and two disk hardware configurations (sixteen and eighty I/O nodes). PRISM is a parallel implementation of a three-dimensional numerical simulation of the Navier-Stokes equations. This particular code models a geometry where the flow is periodic in at least one direction (e.g., flow past a cylinder, flow in a channel, or flow over a backward-facing step). An initial velocity field is given by the input data, and the solution is integrated forward in time by solving the equations that describe advection and diffusion of momentum in the fluid.

Figure 1 shows the behavior of one snapshot of the PRISM code that uses Unix input/output primitives on the Intel Paragon XP/S. In Figure 1a, the code executed on Intel OSF version 1.3.3 with all application files striped across sixteen I/O nodes, and in Figure 1b, the code executed on Intel OSF version 1.4 with files striped across eighty I/O nodes. Not only does the total execution time differ by a factor of five between Figures 1a and 1b, but the temporal patterns of file reads and their individual read durations change dramatically. Comparison of Figures 1 and 2 shows the temporal compression of read requests in time and as a function of request size.

Although operating system optimizations are important, additional experiments with OSF 1.4, but using only sixteen I/O nodes, showed that the larger input/output hardware configuration is the dominant cause of higher input/output performance. To understand the relation between file system semantics, input/output hardware configurations, and application request patterns, we have also measured the performance of two other versions of the PRISM code that exploit the Intel PFS primitives. We found that performance differences among the three versions arose only when the number of I/O nodes was small, limiting input/output parallelism. This suggests that input/output optimization strategies are strongly sensitive to the hardware configuration and the access pattern.

[Figure 3: Avatar Scattercube WWW Display]

3.3 Input/Output Observations

The complex behavior shown in Figures 1 and 2 is typical of that observed in parallel applications on both the Intel Paragon XP/S and the IBM SP/2. Not only do parallel input/output requests vary widely in size, with both very small (i.e., tens of bytes) and very large (i.e., tens of megabytes) requests, but input/output concurrency varies widely as well, ranging from single processor input/output to concurrent access by all processors. Likewise, request patterns can be sequential, strided, or random, with fixed or variable sizes.

Overall, our characterization studies [2, 16] have shown that parallel applications exhibit a much wider variety of input/output request patterns than anticipated by parallel file system designers. Thus, extant parallel file systems are optimized for a small subset of the parallel input/output access space. For example, large requests are best managed by streaming data directly to or from storage devices and application buffers. Conversely, small requests are better served by prefetching, caching, and write-behind.

Given the diversity of parallel input/output patterns, we believe parallel file systems that rely on a single, system-imposed policy are unlikely to be successful. Instead, exploitation of input/output access pattern knowledge in caching and prefetching systems is crucial to obtaining a substantial fraction of peak input/output performance. Inherent in such an adaptive approach is the need to identify access patterns and choose policies based on access pattern characteristics. We will return to this notion of dynamic adaptation based on changing application resource demands in §6 and §7.
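The policy-matching argument above can be made concrete with a small decision rule. The sketch below is one plausible mapping under assumed thresholds, not a description of any existing parallel file system interface.

    #include <cstddef>

    enum class Sequentiality { Sequential, Strided, Random };
    enum class Policy { DirectStreaming, SequentialPrefetch, WriteBehindCache, DefaultLRU };

    // Hypothetical decision rule mapping a classified access pattern to a
    // file system policy family, in the spirit of the discussion above.
    Policy choosePolicy(Sequentiality seq, std::size_t typicalRequestBytes, bool mostlyWrites) {
        const std::size_t largeRequest = 1 << 20;    // assumed 1 MB threshold
        if (typicalRequestBytes >= largeRequest)
            return Policy::DirectStreaming;          // stream large requests, bypass the cache
        if (seq == Sequentiality::Sequential)
            return Policy::SequentialPrefetch;       // small sequential reads: prefetch ahead
        if (seq == Sequentiality::Strided && mostlyWrites)
            return Policy::WriteBehindCache;         // small strided writes: cache plus write-behind
        return Policy::DefaultLRU;                   // no clear pattern: conservative default
    }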

4 WWW Characterization

The World Wide Web (WWW), together with rapidly expanding interest in distributed data mining, poses a decidedly different, though equally thorny, set of performance optimization problems than those faced when optimizing input/output behavior on a single parallel system. In particular, WWW servers now place demands on operating systems, input/output systems, and network protocol implementations that lie far outside the system software's original design points. Moreover, as the WWW is increasingly used as a geographically distributed multimedia database, wide area distributed query optimization must exploit both query semantics and time-varying data on the computation and input/output capabilities of heterogeneous WWW servers connected by networks of widely varying bandwidth and latency. Thus, understanding extant and evolving request patterns and the loads placed on both servers and networks is key to optimizing both server design and operating system resource management policies.

4.1 Characterization Techniques

Due to its early development of the Mosaic WWW browser and its role as an NSF supercomputer center, the National Center for Supercomputing Applications (NCSA) processes several hundred thousand WWW requests each day. To understand the access patterns to this server and the server responses, NCSA logs not only the request stream but also samples of the server network statistics, disk input/output, and processor utilization [6]. Given the diversity of request patterns and the size of the NCSA WWW site, these logs exceed 150 MB/day and have been archived for over two years. Although the NCSA WWW logs provide the data needed to understand both short and long term evolution of WWW traffic, their sheer size makes data analysis challenging.

We extended the Pablo instrumentation and analysis software to process the NCSA logs, creating a set of SDDF records that describe access pattern stimuli and server responses. Collectively, these SDDF records define nearly forty access pattern and server metrics that are computed each minute. To understand and correlate these metrics, we exploited Pablo's support for real-time data analysis and extraction to drive an immersive virtual environment, called Avatar [12, 14, 7], that supports real-time, three-dimensional visual and aural display of dynamic performance data.

Figure 3 shows a three-dimensional scatterplot, one of several Avatar virtual environment displays. In Figure 3, the three axes of the cube correspond to three different performance metrics, with the colored ribbons showing the trajectory of the WWW servers within the metric space. By interactively changing the metrics mapped to each axis, one can explore the correlation of access patterns and server responses.
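The per-minute metric computation described above can be sketched as a simple log reduction. The field names and record layout below are assumptions for illustration and do not reproduce the NCSA log format or the Pablo-derived SDDF metrics.

    #include <cstdio>
    #include <map>

    // One summary row per minute: request count and bytes transferred.
    struct MinuteMetrics {
        unsigned long requests = 0;
        unsigned long bytes    = 0;
    };

    // timestampSeconds and responseBytes would come from parsed server log entries.
    void accumulate(std::map<long, MinuteMetrics>& perMinute,
                    long timestampSeconds, unsigned long responseBytes) {
        MinuteMetrics& m = perMinute[timestampSeconds / 60];   // bucket by minute
        ++m.requests;
        m.bytes += responseBytes;
    }

    void dump(const std::map<long, MinuteMetrics>& perMinute) {
        for (const auto& entry : perMinute)
            std::printf("minute %ld: %lu requests, %lu bytes\n",
                        entry.first, entry.second.requests, entry.second.bytes);
    }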

4.2 WWW Observations

As we described in [6], analyzing several months of accesses to the NCSA WWW server showed that commercial and government use of the server is growing rapidly, and request heterogeneity is increasing, lessening server cache hit ratios. In addition, the data clearly show that increasing use of non-text data (i.e., scientific data sets, imagery, audio, and video) will require different approaches to WWW server design. For example, data cache partitioning would ameliorate the pernicious effects of retrieving large images by reserving a portion of the cache for smaller, text items. Likewise, streaming video techniques that guarantee quality of service without preloading an entire video segment would lessen memory requirements. In general, WWW servers must become more adaptive and intelligent, with flexible file caching and prefetching strategies that can adapt to changing request patterns and client types.

5 Performance Analysis Lessons

The common theme underlying the input/output characterization and WWW traffic analysis in §3 and §4 is widely varying access patterns and concomitant resource demand variability. In both cases, experience suggests that optimizing performance will require a judicious match of resource management policies to resource request patterns. Parallel file systems should dynamically configure caching and prefetching policies, as well as cache sizes, based on access pattern attributes observed during program execution. Likewise, sophisticated WWW servers should exploit knowledge of request data types, client capabilities, network connection bandwidths, and user-specific request patterns.

Although the current Pablo software infrastructure's flexibility allowed us to quickly construct the performance analysis and display tools needed to understand the rapidly varying resource demands of input/output intensive applications and WWW servers, it is not sufficient. Because the interactions between dynamic, irregular applications and system software change during application execution, we believe that a flexible infrastructure for performance directed adaptive control must replace a posteriori performance analysis. Such an infrastructure would allow application and system developers to create adaptive software that can change its behavior in response to real-time varying resource demands.

In §6 and §7, we describe two complementary views of such an adaptive control infrastructure. The first, a closed loop design shown in Figure 4, relies on real-time performance data captured using a group of distributed performance sensors, a suite of distributed decision procedures to select resource management policies, and a complementary set of policy actuators to realize policy decisions. The second, an interactive design shown in Figure 7, replaces software-directed resource policy selection with immersive environments for real-time performance display and user-directed resource management.

6 Closed Loop Adaptive Control

Minimally, an infrastructure capable of dynamically adapting to changing resource demands would (a) monitor resource demands and system responses, (b) select resource management policies based on monitored data, and (c) implement policy changes in globally consistent ways. A more general adaptive control system might contain additional components for capture of qualitative as well as quantitative performance data, off-line assessment of policy effectiveness, and flexible configuration of decision procedures.
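Steps (a)-(c) can be summarized as a minimal control loop. The skeleton below is a sketch of that structure under assumed interfaces; Sensor, DecisionProcedure, and Actuator are hypothetical stand-ins for the components discussed in this section, not the PPFS or Pablo software.

    // Minimal closed loop: sense, decide, actuate, repeat.
    struct Sensor {             // captures a performance metric (e.g., cache hit ratio)
        virtual double sample() = 0;
        virtual ~Sensor() = default;
    };
    struct Actuator {           // realizes a policy decision (e.g., resize a cache)
        virtual void apply(int setting) = 0;
        virtual ~Actuator() = default;
    };
    struct DecisionProcedure {  // maps observed metrics to a policy setting
        virtual int decide(double metric, int currentSetting) = 0;
        virtual ~DecisionProcedure() = default;
    };

    void controlStep(Sensor& sensor, DecisionProcedure& policy, Actuator& actuator,
                     int& currentSetting) {
        double metric = sensor.sample();                   // (a) monitor demands and responses
        int next = policy.decide(metric, currentSetting);  // (b) select a policy or parameter
        if (next != currentSetting) {
            actuator.apply(next);                          // (c) implement the change
            currentSetting = next;                         // later samples reveal its effect
        }
    }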

Creating a flexible closed loop control system that could be used on multiple parallel and distributed systems to manage a variety of physical and logical resources is a daunting task that poses a plethora of open research problems. Below, we first sketch one possible design and briefly describe our current research within the context of this design.

6.1 Proposed Closed Loop Design

Figure 4 shows one possible design for a general-purpose, closed loop resource management and control system containing the following components:

- Distributed performance sensors that can capture quantitative application and system performance data and compute performance metrics.
- Software actuators that can enable and configure resource management policies.
- Program assertion interfaces for qualitatively specifying application resource request patterns (e.g., large read requests or short computation intervals) and automatic behavioral classification techniques for use when assertions are missing or inaccurate.
- Decision procedures, both local (e.g., per parallel task) and global (e.g., per parallel program), for selecting resource management policies and enabling actuators based on observed application resource requests and the system responses captured by performance sensors.
- Off-line analysis tools to assess the performance of current decision procedures, to conduct parametric performance studies using captured application behavioral data, and to identify improved decision procedures.

In this design, performance instrumentation sensors capture and compute quantitative application and system performance metrics. Qualitative application behavior is obtained from user assertions or via automatic behavioral classification techniques. Together, the qualitative and quantitative data are used by a hierarchy of decision procedures to choose and configure resource management policies via software actuators. Finally, off-line analysis tools permit meta-performance assessment of the adaptive control system's efficacy.

6.2 Control Prototype Experiences

To test the feasibility of the design just described, we have enhanced our user-level portable parallel file system (PPFS) [4] to include prototype sensors, actuators, decision procedures, and behavioral classification software. PPFS consists of a group of cooperating file servers and has a rich application interface, allowing applications to advertise access patterns and to control caching, prefetching, and data placement policies at multiple levels. The PPFS prototype relies on the Pablo library to capture and compute quantitative performance metrics, includes an artificial neural network that accepts file access statistics and generates qualitative access patterns, and relies on a simple set of decision procedures to select file policies.

[Figure 4: Closed Loop Resource Management Control Infrastructure. Parallel application tasks issue requests to resource management policy components within a malleable runtime system; each component pairs a sensor, local decision procedures, a policy algorithm, and an actuator. Global control and decision procedures, online performance analysis (general and application specific), offline behavioral analysis, and a performance history exchange performance data and control messages with these components; assertions, optional user controls, and sensor, actuator, and policy registration feed the decision, sensor, and actuator software.]

The goal of the PPFS design is to improve parallel input/output performance by adaptively selecting and tuning file system policies based on application request patterns. Below, we describe qualitative and quantitative behavioral classification techniques and our preliminary experiences using these approaches to tune PPFS policies. Ultimately, we believe that combining the qualitative and quantitative approaches to policy selection will create a rich environment for dynamic, closed loop policy steering within PPFS.

6.2.1 Automatic Qualitative Classification

Given a qualitative classification of behavior (e.g., that file accesses are sequential and small), one can select a resource management policy family best matched to the behavior type. Such qualitative data can be obtained by inserting behavioral assertions in source code or by monitoring and automatically classifying behavior. Automatic classification lessens the application programming burden by transparently identifying application access patterns.

To automatically classify file access patterns, we have identified a set of features commonly observed during our characterization of scientific applications. These features, shown in Table 1, partition requests along three axes: read/write mix, sequentiality, and size. Using this partitioning, we trained a feedforward artificial neural network to classify input/output requests during application execution. The output of the trained neural net corresponds directly to the features in Table 1. Not only is such a qualitative classification independent of the underlying system, it may be unique to a particular application execution. By using the classification during execution to drive file policy actuators, one can transparently tune caching and prefetching policies for a given system and set of application inputs without application hints or optimizations.

As a validation of automatic behavioral classification and dynamic adaptation, we used the enhanced PPFS to improve the input/output performance of a single processor application from the NOAA/NASA Pathfinder AVHRR (Advanced Very High Resolution Radiometer) data processing project. Pathfinder processing is typical of low-level satellite data processing applications: fourteen large files of AVHRR orbital data are processed to produce a large output data set.

For the Pathfinder code, the automatic behavioral classification and adaptive file system policies of PPFS offer a significant performance improvement over that possible with UNIX buffered input/output. The neural net access pattern classifier detects that the output file access pattern is initially write only and sequential, with large accesses; later, the access pattern changes to write only and strided, with very small accesses. PPFS chooses an MRU cache block replacement policy for the first phase. In the second phase, it enlarges the cache to retain the working set of file blocks.

Figure 5 shows the file write durations during the first phase of Pathfinder execution, both with and without the PPFS adaptive policies, when executed on a single processor SPARC server.

The first cluster of accesses at the left of Figures 5a and 5b is the sequential write phase. Performance for the first phase is roughly equivalent using either MRU or the default, non-adaptive LRU replacement policy. However, enlarging the cache in the second, strided access phase substantially decreases the average write duration and overall execution time.

In general, we have found that automatic file access pattern classification and policy selection yields major performance benefits with modest overhead. Currently, we are extending this effort with a more general classification infrastructure that supports global access pattern classification across multiple processors.

6.2.2 Sensor Quantitative Classification

Qualitative access pattern classification, used to select and tune resource management policies, is a primary step toward creation of adaptive, closed loop control. However, it must be coupled with quantitative data on resource utilization to provide true closed loop control. In a complementary effort to explore the efficacy of adaptive control, we have coupled performance sensor data with user assertions that describe application input/output patterns.

To automatically determine the input/output performance of the file system, we have instrumented PPFS with the Pablo instrumentation library. This instrumented PPFS produces sensor data as sliding window averages of key performance metrics (e.g., file read byte counts and durations, file cache hit ratios, input/output server timings, queue lengths and delays, and prefetch initiation overheads).

Together with application assertions about qualitative access patterns, the PPFS sensor data can be processed in real time by an actuator library to identify and eliminate the current PPFS bottleneck. For example, if the file access pattern is sequential and the file interaccess time declines, the cache hit ratio sensor may decline as well, indicating that PPFS should increase the prefetch parameter and more aggressively prefetch data to increase the cache hit ratio. Because the effects of such changes are manifest as changes in the sensor data, the control infrastructure can quickly determine the veracity of the control change.

To demonstrate the efficacy of sensor-based adaptive control when coupled with behavioral assertions, we used an input/output benchmark to conduct a set of simple experiments on the IBM SP/2 and on two different hardware configurations of the Intel Paragon XP/S. In this benchmark, all processors read disjoint, interleaved 4 KB blocks of the same file, forming a globally sequential access stream. Using the enhanced PPFS, we measured the average read access time as a function of file interaccess latency and file cache prefetch parameters (i.e., the number of file blocks prefetched ahead of the current access point and the number of blocks prefetched at a time). Figure 6 summarizes the results of these experiments.

With large interaccess delays, moderate prefetching suffices to service all requests from the PPFS file cache. As the interaccess time declines and the aggregate request rate increases, PPFS must prefetch a larger number of blocks to maintain high cache hit ratios.
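The prefetch adjustment rule just described, where a sequential access pattern and a falling cache hit ratio trigger more aggressive prefetching, can be sketched as follows. The thresholds, bounds, and PrefetchSetting structure are assumptions for illustration, not the PPFS actuator library.

    #include <algorithm>

    // Hypothetical actuator step: widen prefetching when a sequential access
    // pattern is asserted and the observed cache hit ratio sensor is falling.
    struct PrefetchSetting {
        int blocksAhead;   // how far ahead of the current access point to prefetch
        int blocksPerOp;   // how many blocks to fetch per prefetch operation
    };

    PrefetchSetting adjustPrefetch(PrefetchSetting current, bool sequentialPattern,
                                   double hitRatio, double previousHitRatio) {
        const double target   = 0.95;   // assumed target cache hit ratio
        const int    maxAhead = 128;    // assumed upper bound on prefetch depth
        if (sequentialPattern && hitRatio < target && hitRatio < previousHitRatio) {
            current.blocksAhead = std::min(current.blocksAhead * 2, maxAhead);
            current.blocksPerOp = std::min(current.blocksPerOp * 2, current.blocksAhead);
        }
        return current;   // subsequent sensor readings confirm or refute the change
    }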

Table 1: Qualitative Input/Output Classification
  Read/Write:    Read Only, Write Only, Read-Update-Write, Read/Write Nonupdate
  Sequentiality: Sequential, 1-D Strided, 2-D Strided, Variably Strided
  Request Sizes: Uniform, Variable

[Figure 5: File Write Durations (Pathfinder on Sun SPARC 670). (a) UNIX (non-adaptive); (b) PPFS (adaptive). Each panel plots write durations (seconds, log scale) against timestamp (seconds).]

Figure 6 shows that an efficient operating point depends on the request size, interaccess time, and disk performance. Thus, the slower disk system on the Paragon XP/S necessitates more aggressive prefetching than on the faster IBM SP/2. Intuitively, one can automatically determine a near optimal prefetch unit, given quantitative sensor data and a qualitative classification of the access pattern. This last point is crucial: the shape of the response time curves in Figure 6 is critically sensitive to the file access pattern.

[Figure 6: PPFS Parallel Read Times (One File Server). (a) Intel Paragon XP/S, with curves for 150 ms and 300 ms interaccess times on fast and slow disks; (b) IBM SP/2, with curves for 75 ms and 175 ms interaccess times. Mean read time (milliseconds) versus prefetch parameters (prefetch ahead, prefetch unit).]

6.3 Status and Directions

Preliminary experience with our prototype adaptive control implementation within PPFS suggests that major performance improvements are possible with performance-directed control. Based on these encouraging results, we are concurrently pursuing several research directions.

First, we are extending the adaptive PPFS prototype to more tightly couple quantitative sensor data with user assertions, automatic access pattern classification, and actuators to realize a fully operational closed loop prototype. The next step is to synthesize global performance metrics from local sensor data and characterize global access patterns using local access pattern classifications. This global data will drive a set of platform-independent decision procedures for selecting file system policies.

Finally, we are beginning the detailed design of the general-purpose adaptive control infrastructure sketched in Figure 4. The control infrastructure will target three resource management domains: parallel input/output policies, thread and task scheduling, and WWW server management.

7 Interactive Adaptive Control

From a sensory perspective, the computer need not be simply a low-bandwidth medium of interaction between users and a passive software structure; instead, via immersive virtual environments users could interact directly with the executing software (e.g., resizing a file cache by "stretching" its three-dimensional representation with one's hands). In this model, the user is no longer an external observer of the system and its behavior, but replaces the closed loop decision procedures described in §6 to become an active participant in the system.

7.1 Proposed Interactive Control Design

Figure 7 shows how one might assemble a virtual environment for direct manipulation and interactive control using the following components:

- Virtual environment views that can represent a hierarchy of parallel software components and their interactions, ranging from computations on geographically distributed parallel systems to module interactions within a single task.
- Virtual environment and attribute controls for interactive modification and adjustment of software attributes and parameters (e.g., cache sizes) that are analogous to controls on physical objects. Examples might include handles for software module movement and resizing, docking ports for software interconnection, and dials and levers for specifying module parameters.
- Direct manipulation tools that are the virtual environment analog of a toolbox. Like physical tools, these virtual tools (e.g., a critical path finder or a dipstick for behavior sampling) augment manipulation abilities.
- Annotation software for marking temporal and spatial points within the virtual environment. Collectively, such annotation mechanisms would aid navigation, ease collaborative identification of important features, and provide a historical record of important events.

- Actualization interfaces that connect the virtual environment with dynamic performance data from executing systems and that pass information based on user manipulations to software configuration and behavior management software.

In this virtual environment, performance sensors and control actuators would permit real-time capture of performance data and modification of application or system behavior based on user manipulations within the virtual environment. Part of the environment would execute on a group of parallel or distributed systems, providing connections to sensors and actuators. The rest of the system would execute on the virtual environment platform, importing and exporting these connections to and from remote systems. Annotation and performance recording software would manage user-generated multimedia data (i.e., voice, video, and text descriptions) and meta-data (e.g., extant conditions and contextual information).

The interactive control software would include a hierarchy of graphical views, ranging from procedure call graphs within a thread to geographically distributed communication among parallel systems. Via an associated suite of attribute controls and direct manipulation tools, users would be able to change software structure (e.g., by replacing software modules) and modify application parameters.
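As a purely illustrative sketch of the actualization interfaces described above (none of these names correspond to existing Avatar or PPFS code), the class below forwards a user manipulation in the virtual environment, such as stretching a cache's three-dimensional representation, to an actuator command on the running system, and routes incoming sensor updates to the displayed performance views.

    #include <functional>
    #include <string>
    #include <utility>

    // Hypothetical actualization interface: couples virtual environment
    // manipulations to actuators and sensor data to displayed views.
    class ActualizationInterface {
    public:
        using ActuatorCommand = std::function<void(const std::string& resource, double value)>;
        using ViewUpdate      = std::function<void(const std::string& metric, double value)>;

        ActualizationInterface(ActuatorCommand send, ViewUpdate display)
            : send_(std::move(send)), display_(std::move(display)) {}

        // Called when the user stretches a resource's 3-D representation.
        void onUserResize(const std::string& resource, double newScale) {
            send_(resource, newScale);            // e.g., resize a file cache
        }
        // Called when new performance data arrives from a remote sensor.
        void onSensorUpdate(const std::string& metric, double value) {
            display_(metric, value);              // refresh the performance metric view
        }
    private:
        ActuatorCommand send_;
        ViewUpdate display_;
    };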

[Figure 7: Interactive Resource Management Control Infrastructure. Virtual environment views (a software structure view and a performance metric view), environment and attribute controls, direct manipulation tools, and annotation and recording software are coupled via actualization interfaces and CAVE-to-CAVE communication to parallel/distributed computing systems, where sensors, actuators, and control procedures attach to the parallel applications.]

7.2 Status and Directions

We just argued that the solution to the software understanding and optimization dilemma is the development of virtual environments that allow software users and developers to directly manipulate software components while immersed in representations of software structure, behavior, and real-time performance. This belief is supported by our experiences with our Avatar virtual environment prototype [14] and its limited support for direct manipulation of software behavior.

We have tested the Avatar prototype's data representations with several users and have found that virtual environment exploration can provide substantive new insights into software behavior and structure. In general, the ability to walk and fly through the data, to examine it from multiple perspectives, and to interactively change real-time display attributes has proven invaluable: we and others gained insights into behavioral and performance metric interactions that were not possible otherwise. Based on this experience, we are enhancing Avatar to better support direct manipulation and to tighten its coupling to the PPFS adaptive control prototype.

8 Related Work

Several systems have been built that support application behavior steering (i.e., guiding a computation toward interesting phenomena) [5]. Typically, application behavioral steering is interactive, with an application scientist studying near real-time scientific visualizations and guiding the application code by changing key application variables. The Supercomputing '95 I-WAY [5], with immersive visualizations of scientific computations, is the most recent and notable example.

In contrast to the design of systems for steering computation behavior, there have been far fewer efforts [3, 14] to interactively steer or adaptively control application performance. Notably, Schwan [3] has developed an adaptive control library for thread-based computations on shared memory parallel systems. This system, called Falcon, allows application developers to insert software sensors in their source code. Performance data from these sensors activates actuators to change program behavior based on current conditions and measured performance. The control library automatically adjusts thread locking policies based on expected synchronization delay.

9 Conclusions

Our performance analysis and characterization data show that an increasing fraction of applications in diverse problem domains (e.g., parallel input/output and WWW access) have highly variable resource demands. This data suggests that optimizing the performance of these dynamic applications will require a judicious match of resource management policies to resource request patterns.

Because the interactions between dynamic, irregular applications and system software change during application execution, we believe that a flexible infrastructure for performance directed adaptive control must replace a posteriori performance analysis. Such an infrastructure would allow application and system developers to create adaptive software that can change its behavior in response to real-time varying resource demands.

In this paper, we sketched the design of both interactive and closed loop adaptive control systems and described our preliminary experiences with an adaptive control prototype for parallel input/output. This prototype, based on our PPFS user-level parallel file system, selects and configures file caching and prefetching policies using both qualitative classifications of access patterns and performance sensor data on file system responses.

In the coming months, we plan to extend the adaptive PPFS prototype to more tightly couple quantitative sensor data with user assertions, automatic access pattern classification, and actuators, to realize a fully operational closed loop prototype. Concurrently, we are enhancing our Avatar virtual environment for direct manipulation and are tightening its coupling to the PPFS adaptive control prototype. Finally, we are beginning the detailed design of the general-purpose adaptive control infrastructure sketched in Figure 4.

Acknowledgments

Ruth Aydt polished the Pablo input/output characterization software, and Phyl Crandall was one of its early users. Will Scullin, Luis Tavera, and Keith Shields designed Avatar, our virtual environment for performance data immersion and control. Finally, our observations on the intellectual and political problems inherent in performance tool development arose from a series of workshops co-hosted with Ann Hayes and Margaret Simmons.

References

[1] Adve, V. S., Mellor-Crummey, J., Anderson, M., Kennedy, K., Wang, J., and Reed, D. A. An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs. In Proceedings of Supercomputing '95 (Dec. 1995).

[2] Crandall, P. E., Aydt, R. A., Chien, A. A., and Reed, D. A. Characterization of a Suite of Input/Output Intensive Applications. In Proceedings of Supercomputing '95 (Dec. 1995).

[3] Gu, W., Eisenhauer, G., Kraemer, E., Schwan, K., Stasko, J., Vetter, J., and Mallavarupu, N. Falcon: On-line Monitoring and Steering of Large-Scale Parallel Programs. In Proceedings of the 5th Symposium on the Frontiers of Massively Parallel Computation (Feb. 1995), pp. 422-429.

[4] Huber, J. V., Elford, C. L., Reed, D. A., Chien, A. A., and Blumenthal, D. S. PPFS: A High-Performance Portable Parallel File System. In Proceedings of the 9th ACM International Conference on Supercomputing (July 1995), pp. 385-394.

[5] Korab, H., and Brown, M. D. Virtual Environments and Distributed Computing at SC'95: GII Testbed and HPC Challenge Applications on the I-WAY. Dec. 1995.

[6] Kwan, T. T., McGrath, R. E., and Reed, D. A. NCSA's World Wide Web Server: Design and Performance. IEEE Computer (Nov. 1995), 68-74.

[7] Lamm, S. E., Scullin, W. H., and Reed, D. A. Real-Time Geographic Visualization of World Wide Web Traffic. In Proceedings of the Fifth International World Wide Web Conference (May 1996).

[8] Malony, A. D., and Reed, D. A. Visualizing Parallel Computer System Performance. In Instrumentation for Future Parallel Computing Systems, M. Simmons, R. Koskela, and I. Bucher, Eds. Addison-Wesley, 1989, pp. 59-90.

[9] Messina, P., and Sterling, T., Eds. Pasadena Workshop on System Software and Tools for High-Performance Computing Environments. SIAM, Jan. 1992.

[10] Poole, J. T. Scalable I/O Initiative. California Institute of Technology, available at http://www.ccsf.caltech.edu/SIO/, 1996.

[11] Reed, D. A. Performance Instrumentation Techniques for Parallel Systems. In Models and Techniques for Performance Evaluation of Computer and Communications Systems, L. Donatiello and R. Nelson, Eds. Springer-Verlag Lecture Notes in Computer Science, 1993, pp. 463-490.

[12] Reed, D. A., Elford, C. L., Madhyastha, T., Scullin, W. H., Aydt, R. A., and Smirni, E. I/O, Performance Analysis, and Performance Data Immersion. In Proceedings of MASCOTS '96 (Feb. 1996), pp. 1-12.

[13] Reed, D. A., and Rudolph, D. C. Experiences with Hypercube Operating System Instrumentation. International Journal of High-Speed Computing (Dec. 1989), 517-542.

[14] Reed, D. A., Shields, K. A., Tavera, L. F., Scullin, W. H., and Elford, C. L. Virtual Reality and Parallel Systems Performance Analysis. IEEE Computer (Nov. 1995), 57-67.

[15] Simmons, M. L., Hayes, A. H., Brown, J. J., and Reed, D. A., Eds. Debugging and Performance Tuning for Parallel Computing Systems. IEEE Computer Society Press, 1996.

[16] Smirni, E., Aydt, R. A., Chien, A. A., and Reed, D. A. I/O Requirements of Scientific Applications: An Evolutionary View. Submitted for publication, 1996.

[5] Korab, H., and Brown, M. D. Virtual Envi-ronments and Distributed Computing at SC'95:GII Testbed and HPC Challenge Applications onthe I-WAY, Dec. 1995.[6] Kwan, T. T., McGrath, R. E., and Reed,D. A. NCSA's World Wide Web Server: Designand Performance. IEEE Computer (Nov. 1995),68{74.[7] Lamm, S. E., Scullin, W. H., and Reed,D. A. Real-time Geographic Visualization ofWorld Wide Web Tra�c. In Proceedings of theFifth International World Wide Web Conference(May 1996).[8] Malony, A. D., and Reed, D. A. Visu-alizing Parallel Computer System Performance.In Instrumentation for Future Parallel Comput-ing System s, M. Simmons, R. Koskela, andI. Bucher, Eds. Addison-Wesley Publishing Com-pany, 1989, pp. 59{90.[9] Messina, P., and Sterling, T., Eds.Pasadena Workshop on System Software andTools for High-Performance Computing Environ-ments. SIAM, Jan. 1992.[10] Poole, J. T. Scalable I/O Initiative. Cal-ifornia Institute of Technology, Available athttp://www.ccsf.caltech.edu/SIO/, 1996.[11] Reed, D. A. Performance Instrumentation Tech-niques for Parallel Systems. In Models and Tech-niques for Performance Evaluation of Computerand Communications Systems, L. Donatiello andR. Nelson, Eds. Springer-Verlag Lecture Notes inComputer Science, 1993, pp. 463{490.[12] Reed, D. A., Elford, C. L., Madhyastha,T., Scullin, W. H., Aydt, R. A., andSmirni, E. I/O, Performance Analysis, andPerformance Data Immersion. In Proceedings ofMASCOTS '96 (Feb. 1996), pp. 1{12.[13] Reed, D. A., and Rudolph, D. C. Experi-ences with Hypercube Operating System Instru-mentation. International Journal of High-SpeedComputing (Dec. 1989), 517{542.[14] Reed, D. A., Shields, K. A., Tavera, L. F.,Scullin, W. H., and Elford, C. L. VirtualReality and Parallel Systems Performance Anal-ysis. IEEE Computer (Nov. 1995), 57{67.[15] Simmons, M. L., Hayes, A. H., Brown, J. J.,and Reed, D. A., Eds. Debugging and Perfor-mance Tuning for Parallel Computing Systems.IEEE Computer Society Press, 1996.[16] Smirni, E., Aydt, R. A., Chien, A. A., andReed, D. A. I/O Requirements of Scienti�c Aap-plications: An Evolutionary View. In submittedfor publication (1996).