

Appeared in Proc. of the Second USENIX Symposium on Networked Systems Design and Implementation (NSDI'05).

Performance Modeling and System Management for Multi-component Online Services∗

Christopher Stewart and Kai Shen
Department of Computer Science, University of Rochester

{stewart, kshen}@cs.rochester.edu

Abstract

Many dynamic-content online services are comprised of multiple interacting components and data partitions distributed across server clusters. Understanding the performance of these services is crucial for efficient system management. This paper presents a profile-driven performance model for cluster-based multi-component online services. Our offline-constructed application profiles characterize component resource needs and inter-component communications. With a given component placement strategy, the application profile can be used to predict system throughput and average response time for the online service. Our model differentiates remote invocations from fast-path calls between co-located components, and we measure the network delay caused by blocking inter-component communications. Validation with two J2EE-based online applications shows that our model can predict application performance with small errors (less than 13% for throughput and less than 14% for the average response time). We also explore how this performance model can be used to assist system management functions for multi-component online services, with case examinations of optimized component placement, capacity planning, and cost-effectiveness analysis.

1 Introduction

Recent years have witnessed significant growth in online services, including Web search engines, digital libraries, and electronic commerce. These services are often deployed on clusters of commodity machines in order to achieve high availability, incremental scalability, and cost effectiveness [14, 16, 29, 32]. Their software architecture usually comprises multiple components, some reflecting intentionally modular design, others developed independently and subsequently assembled into a larger application, e.g., to handle data from independent sources. A typical service might contain components responsible for data management, for business logic, and for presentation of results in HTML or XML.

∗This work was supported in part by the National Science Foundation grants CCR-0306473, ITR/IIS-0312925, and CCF-0448413.

Previous studies have recognized the value of using performance models to guide resource provisioning for on-demand services [1, 12, 27]. Common factors affecting system performance include the application characteristics, workload properties, system management policies, and available resources in the hosting platform.

However, the prior results are inadequate for predicting the performance of multi-component online services, which introduce several additional challenges. First, the various application components often have different resource needs, and components may interact with each other in complex ways. Second, unlike monolithic applications, the performance of multi-component services also depends on the component placement and replication strategy on the hosting cluster. Our contribution is a comprehensive performance model that accurately predicts the throughput and response time of cluster-based multi-component online services.

Our basic approach (illustrated in Figure 1) is to build application profiles characterizing per-component resource consumption and inter-component communication patterns as functions of input workload properties. Specifically, our profiling focuses on application characteristics that may significantly affect the service throughput and response time. These include CPU and memory usage, remote invocation overhead, inter-component network bandwidth, and blocking communication frequency. We perform profiling through operating system instrumentation to achieve transparency to the application and the component middleware. With a given application profile and a component placement strategy, our model predicts system throughput by identifying and quantifying bottleneck resources. We predict the average service response time by modeling the queuing effect at each cluster node and estimating the network delay caused by blocking inter-component communications.

Based on the performance model, we explore ways that it can assist various system management functions for online services. First, we examine component placement and replication strategies that can achieve high performance with given hardware resources. Additionally, our model can be used to estimate resource needs to support a projected future workload or to analyze the cost-effectiveness of hypothetical hardware platforms.


Figure 1: Profile-driven performance modeling for an online service containing a front-end Web server, a back-end database, and several middle-tier components (A, B, and C). Offline profiling produces a component resource consumption profile, an inter-component communication profile, and a profile on remote invocation overhead; together with a component placement strategy, the available resources, and the input workload characteristics, these feed the performance model, which outputs the predicted throughput and average response time.

Although many previous studies have addressed various system management functions for online services [5, 6, 10, 31], they either deal with single application components or they treat a complex application as a non-partitionable unit. Such an approach restricts the management flexibility for multi-component online services and thus limits their performance potential.

The rest of this paper is organized as follows. Section 2 describes our profiling techniques to characterize application resource needs and component interactions. Section 3 presents the throughput and response time models for multi-component online services. Section 4 validates our model using two J2EE-based applications on a heterogeneous Linux cluster. Section 5 provides case examinations on model-based component placement, capacity planning, and cost-effectiveness analysis. Section 6 describes related work. Section 7 concludes the paper and discusses future work.

2 Application Profiling

Due to the complexity of multi-component online services, an accurate performance model requires knowledge of the application characteristics that may affect system performance under any possible component placement strategy. Our application profile captures three such characteristics: per-component resource needs (Section 2.1), the overhead of remote invocations (Section 2.2), and the inter-component communication delay (Section 2.3).

We perform profiling through operating system instrumentation to achieve transparency to the application and the component middleware. Our only requirement on the middleware system is the flexibility to place components in the ways we want. In particular, our profiling does not modify the application or the component middleware. Further, we treat each component as a black box (i.e., we do not require any knowledge of the inner workings of application components). Although instrumentation at the application or the component middleware layer can more easily identify application characteristics, our OS-level approach has wider applicability. For instance, our approach remains effective for closed-source applications and middleware software.

2.1 Profiling Component Resource Consumption

A component profile specifies component resource needs as functions of input workload characteristics. For time-decaying resources such as CPU and disk I/O bandwidth, we use the average rates θ_cpu and θ_disk to capture the resource requirement specification. On the other hand, variation in available memory size often has a severe impact on application performance. In particular, a memory deficit during one time interval cannot be compensated by a memory surplus in the next time interval. A recent study by Doyle et al. models the service response time reduction resulting from increased memory cache size for Web servers serving static content [12]. However, such modeling is only feasible with intimate knowledge of the application memory access pattern. To maintain the general applicability of our approach, we use the peak usage θ_mem for specifying the memory requirement in the component profile.

The input workload specification in component profiles can be parameterized with an average request arrival rate λ_workload and other workload characteristics δ_workload, including the request arrival burstiness and the composition of different request types (or request mix). Putting these together, the resource requirement profile for a distributed application component specifies the following mapping f:

    f: (λ_workload, δ_workload) → (θ_cpu, θ_disk, θ_mem)

Since resource needs usually grow by a fixed amount with each additional input request per second, θ_cpu, θ_disk, and θ_mem are likely to follow linear relationships with λ_workload.


Figure 2: Linear fitting of CPU usage for two RUBiS components (the Bid component and the Web server; CPU usage vs. external request rate in requests/second, with measured points and linear fits).

Component      CPU usage (in percentage)
Web server     1.525 · λ_workload + 0.777
Database       0.012 · λ_workload + 3.875
Bid            0.302 · λ_workload + 0.807
BuyNow         0.056 · λ_workload + 0.441
Category       0.268 · λ_workload + 1.150
Comment        0.079 · λ_workload + 0.568
Item           0.346 · λ_workload + 0.588
Query          0.172 · λ_workload + 0.509
Region         0.096 · λ_workload + 0.556
User           0.403 · λ_workload + 0.726
Transaction    0.041 · λ_workload + 0.528

Table 1: Component resource profile for RUBiS (based on linear fitting of measured resource usage at 11 input request rates). λ_workload is the average request arrival rate (in requests/second).

We use a modified Linux Trace Toolkit (LTT) [38] for our profiling. LTT instruments the Linux kernel with trace points, which record events and forward them to a user-level logging daemon through the relayfs file system. We augmented LTT by adding or modifying trace points at CPU context switch, network, and disk I/O events to report the statistics that we are interested in. We believe our kernel instrumentation-based approach can also be applied to other operating systems. During our profile runs, each component runs on a dedicated server and we measure the component resource consumption at a number of input request rates and request mixes.

Profiling Results on Two Applications   We present profiling results on two applications based on Enterprise Java Beans (EJB): the RUBiS online auction benchmark [9, 28] and the StockOnline stock trading application [34]. RUBiS implements the core functionality of an auction site: selling, browsing, and bidding. It follows the three-tier Web service model, containing a front-end Web server, a back-end database, and nine movable business logic components (Bid, BuyNow, Category, Comment, Item, Query, Region, User, and Transaction). StockOnline contains a front-end Web server, a back-end database, and five EJB components (Account, Item, Holding, StockTX, and Broker).

Component      CPU usage (in percentage)
Web server     0.904 · λ_workload + 0.779
Database       0.008 · λ_workload + 4.832
Account        0.219 · λ_workload + 0.789
Item           0.346 · λ_workload + 0.781
Holding        0.268 · λ_workload + 0.674
StockTX        0.222 · λ_workload + 0.490
Broker         1.829 · λ_workload + 0.533

Table 2: Component resource profile for StockOnline (based on linear fitting of measured resource usage at 11 input request rates). λ_workload is the average request arrival rate (in requests/second).

Profiling measurements were conducted on a Linux cluster connected by a 1 Gbps Ethernet switch. Each profiling server is equipped with a 2.66 GHz Pentium 4 processor and 512 MB memory. For the two applications, the EJB components are hosted on a JBoss 3.2.3 application server with an embedded Tomcat 5.0 servlet container. The database server runs MySQL 4.0. The dataset for each application is sized according to database dumps published on the benchmark Web sites [28, 34].

After acquiring the component resource consumption at the measured input rates, we derive general functional mappings using linear fitting. Figure 2 shows such a derivation for the Bid component and the Web server in RUBiS. The results are for a request mix similar to the one in [9] (10% read-write requests and 90% read-only requests). The complete CPU profiling results for all 11 RUBiS components and 7 StockOnline components are listed in Table 1 and Table 2 respectively. We do not show the memory and disk I/O profiling results for brevity. The memory and disk I/O consumption for these two applications is relatively insignificant and never becomes the bottleneck resource in any of our test settings.
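The linear-fitting step itself is simple to reproduce. The sketch below, in Python, re-derives a mapping of the form θ_cpu = a · λ_workload + b from a set of (request rate, CPU usage) measurements; the sample data points are illustrative placeholders, not our actual profiling output.

```python
import numpy as np

# Measured CPU usage (percent) of one component at the profiled input
# request rates (requests/second). Illustrative numbers only.
request_rates = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
cpu_usage = np.array([1.1, 1.4, 1.7, 2.0, 2.3, 2.6, 2.9, 3.2, 3.5, 3.8, 4.1])

# Least-squares linear fit: theta_cpu(lambda) = slope * lambda + intercept,
# the same functional form as the entries of Tables 1 and 2.
slope, intercept = np.polyfit(request_rates, cpu_usage, deg=1)
print(f"theta_cpu(lambda) = {slope:.3f} * lambda + {intercept:.3f}")
```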

2.2 Profiling Remote Invocation Overhead

Remote component invocations incur CPU overhead on tasks such as message passing, remote lookup, and data marshaling. When the interacting components are co-located on the same server, the component middleware often optimizes away these tasks, and some middleware even implements local component invocations as direct method calls. As a result, the invocation overhead between two components may vary depending on how they are placed on the hosting cluster. Therefore, it is important to identify the remote invocation costs so that we can correctly account for them when required by the component placement strategy.


Figure 3: Profile on remote invocation overhead for RUBiS. The label on each edge indicates the per-request mean of the inter-component remote invocation CPU overhead (in percentage).

It is challenging to measure the remote invocation overhead without assistance from the component middleware. Although kernel instrumentation tools such as LTT can provide accurate OS-level statistics, they do not directly supply middleware-level information. In particular, it is difficult to differentiate the CPU usage of normal component execution from that of passing a message or serving a remote lookup query. We determine the remote invocation cost of component A invoking component B in a three-step process. First, we isolate components A and B on separate machines. Second, we intercept communication rounds initiated by component A. We define a communication round as a sequence of messages between a pair of processes on two different machines in which the inter-message time does not exceed a threshold. Finally, we associate communication rounds with invocations. Thus, the remote invocation cost incurred between components A and B is the sum of resource usage during communication rounds between them. Since the components are isolated during profiling, communication rounds are not likely to be affected by network noise.
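As a rough illustration of the second step, the following sketch groups an intercepted message trace into communication rounds using an inter-message time threshold. The trace format and the 5 ms threshold are assumptions for illustration, not the exact output or parameters of our instrumentation.

```python
ROUND_GAP = 0.005  # assumed inter-message gap (seconds) that ends a round

def communication_rounds(trace, gap=ROUND_GAP):
    """Split a message trace into communication rounds.

    `trace` is a list of (timestamp, direction, size) tuples for one pair of
    processes on two machines, sorted by timestamp. A round is a maximal run
    of messages in which consecutive messages are separated by at most `gap`.
    """
    rounds, current, last_time = [], [], None
    for timestamp, direction, size in trace:
        if last_time is not None and timestamp - last_time > gap:
            rounds.append(current)
            current = []
        current.append((timestamp, direction, size))
        last_time = timestamp
    if current:
        rounds.append(current)
    return rounds
```

The CPU usage recorded by the kernel trace during each such round, summed over all rounds between A and B, would then be charged as the remote invocation cost between the two components.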

Profiling Results   Figure 3 shows the application profile on remote invocation overhead for RUBiS.

2.3 Profiling Inter-Component Communications

We profile component interaction patterns that may affect bandwidth usage and network service delay between distributed application components. We measure inter-component bandwidth consumption by intercepting all network messages between components during offline profile runs. Note that the bandwidth usage also depends on the workload level, particularly the input user request rate. By measuring bandwidth usage at various input request rates and performing linear fitting, we can acquire the per-request communication data volume for each inter-component link.

The processing of a user request may involve multiple blocking round-trip communications (corresponding to request-response rounds) along many inter-component links. We consider the request processing network delay as the sum of the network delays on all inter-component links between distributed components. The delay on each link includes the communication latency and the data transmission time. Inter-component network delay depends on the link round-trip latency and the number of blocking round-trip communications between components. We define a round trip as a synchronous write-read interaction between two components.

Due to the lack of knowledge of application behavior, it is challenging to identify and count blocking round-trip communications at the OS level. Our basic approach is to intercept system calls involving network reads and writes during profile runs. System call interception provides the OS-level information nearest to the application. We then count the number of consecutive write-read pairs in the message trace between two components. We also compare the timestamps of consecutive write-read pairs. Multiple write-read pairs occurring within a single network round-trip latency in the profiling environment are counted as only one. Such a situation can occur when a consecutive write-read pair does not correspond to a blocking interaction. Figure 4 illustrates our approach for identifying blocking round-trip communications. To avoid confusion among messages belonging to multiple concurrent requests, we process one request at a time during profile runs.
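A minimal sketch of this counting rule, assuming the trace for one inter-component link is a time-ordered list of (timestamp, direction) records and that the round-trip latency of the profiling LAN is known (the 150 µs value is taken from our cluster description in Section 4):

```python
RTT = 0.00015  # round-trip latency of the profiling LAN, in seconds

def count_blocking_round_trips(trace, rtt=RTT):
    """Count consecutive write-read pairs, collapsing pairs whose writes fall
    within one round-trip latency of each other into a single interaction
    (the rule illustrated in Figure 4)."""
    count, last_counted_write, prev = 0, None, None
    for timestamp, direction in trace:
        if prev is not None and prev[1] == "SEND" and direction == "RECV":
            write_time = prev[0]
            if last_counted_write is None or write_time - last_counted_write > rtt:
                count += 1
                last_counted_write = write_time
        prev = (timestamp, direction)
    return count
```

Applied to the trace of Figure 4, this yields two blocking round trips: one for #128/#129 and one for the #132 to #135 group.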

Profiling Results   Our profiling results identify the per-request communication data volume and blocking round-trip interaction count for each inter-component link. Figure 5 shows the inter-component communication profiles for RUBiS. The profiling result for each inter-component link indicates the per-request mean of the profiled target.

2.4 Additional Profiling Issues

Non-linear functional mappings.   In our application studies, most of the mappings from workload parameters (e.g., request arrival rate) to resource consumption follow linear functional relationships. We acknowledge that such mappings may exhibit non-linear relationships in some cases, particularly when concurrent request executions affect each other's resource consumption (e.g., extra CPU cost due to contention on a spin-lock). However, the framework of our performance model is equally applicable in these cases, as long as appropriate functional mappings can be extracted and included in the application profile. Non-linear fitting algorithms [4, 30] can be used for such a purpose. Note that a non-linear functional mapping may require more workload samples to produce an accurate fit.

Profiling cost.   To improve measurement accuracy, we place each profiled component on a dedicated machine.


#128 <1.732364sec> NET SEND: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:177
#129 <1.734737sec> NET RECV: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:1619
#130 <1.736060sec> NET RECV: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:684
#131 <1.738076sec> NET RECV: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:1448
#132 <1.738398sec> NET SEND: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:600
#133 <1.738403sec> NET RECV: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:568
#134 <1.738421sec> NET SEND: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:363
#135 <1.752501sec> NET RECV: PID:13661 HOST_ADDR -> 128.151.67.29:41800 REMOTE_ADDR -> 128.151.67.228:8080 SIZE:12

Figure 4: Identifying blocking round-trip communications based on an intercepted network system call trace. #132 through #135 are counted as only one round-trip interaction because #132, #133, and #134 occur very close to each other (less than a single network round-trip latency in the profiling environment).

Figure 5: Inter-component communication profile for RUBiS. In the left figure, the label on each edge indicates the per-request mean of the data communication volume in bytes. In the right figure, the label on each edge indicates the per-request mean of the blocking round-trip communication count.

This eliminates the interference from co-located components without the need for complex noise-reduction techniques [8]. However, the profiling of isolated components imposes a large demand on the profiling infrastructure (i.e., the cluster size equals the number of components). With a smaller profiling infrastructure, multiple profile runs would be required, each measuring some of the components or inter-component links. At a minimum, two cluster nodes are needed for per-component resource consumption profiling (one for the profiled component and the other for the rest) and three machines are required to measure inter-component communications. At these settings, an N-component service would take N profile runs for per-component measurement and N·(N−1)/2 runs for inter-component communication profiling.

3 Performance Modeling

We present our performance model for cluster-based multi-component online services. The input of our model includes the application profile, workload properties, available resources in the hosting platform, as well as the component placement and replication strategy. Below we describe our throughput prediction (Section 3.1) and response time prediction (Section 3.2) schemes. We discuss several remaining issues about our performance model in Section 3.3.

3.1 Throughput Prediction

Our ability to project system throughput under each component placement strategy can be illustrated by the following three-step process:

1. We first derive the mapping between the input request rate λ_workload and the runtime resource demands of each component. The CPU, disk I/O bandwidth, and memory needs θ_cpu, θ_disk, and θ_mem can be obtained with the knowledge of the component resource consumption profile (Section 2.1) and the input workload characteristics. The remote invocation overhead (Section 2.2) is then added when the component in question interacts with other components that are placed on remote servers. The component network resource demand θ_network can be derived from the inter-component communication profile (Section 2.3) and the placement strategy. More specifically, it is the sum of the communication data volume on all non-local inter-component links adjacent to the component in question.

2. Given a component placement strategy, we can determine the maximum input request rate that can saturate the CPU, disk I/O bandwidth, network I/O bandwidth, or memory resources at each server. Let τ_CPU(s), τ_disk(s), τ_network(s), and τ_memory(s) denote these saturation rates at server s. When a component is replicated, its load is distributed to all server replicas. The exact load distribution depends on the policy employed by the component middleware. Many load distribution strategies have been proposed in previous studies [13, 25, 26, 40]. However, most of these have not been incorporated into component middleware systems in practice. In particular, we are only able to employ round-robin load distribution with the JBoss application server, which evenly distributes the load among replicas. Our validation results in Section 4 are therefore based on the round-robin load distribution model.

3. The system reaches its maximum throughput as soon as one of the servers cannot handle any more load. Therefore, the system throughput can be estimated as the lowest saturation rate over all resource types at all servers (a small computational sketch of this procedure is given after the formula):

       min over all servers s of { τ_CPU(s), τ_disk(s), τ_network(s), τ_memory(s) }

3.2 Response Time Prediction

The service response time for a cluster-based online service includes two elements: 1) the request execution time and the queueing time caused by resource competition; and 2) the network delay due to blocking inter-component communications.

Request Execution and Queuing   The system-wide request execution and queueing time is the sum of such delays at each cluster server. We use an M/G/1 queue to simplify our model of the average response time at each server. The M/G/1 queue assumes independent request arrivals whose inter-arrival times follow an exponential distribution. Previous studies [7, 11] found that Web request arrivals may not be independent because Web workloads contain many automatically triggered requests (e.g., embedded Web objects) and requests within each user session may follow particular user behavioral patterns. We believe our simplification is justified because our model targets dynamic-content service requests that do not include automatically triggered embedded requests. Additionally, busy online services observe superimposed workloads from many independent user sessions, and thus the request inter-dependencies within individual user sessions are not pronounced, particularly at high concurrencies.

The average response time at each server can be estimated as follows under the M/G/1 queuing model:

    E[r] = E[e] + ρ · E[e] · (1 + C_e²) / (2 · (1 − ρ))

where E[e] is the average request execution time; ρ is the workload intensity (i.e., the ratio of the input request rate to the rate at which the server resource is completely exhausted); and C_e is the coefficient of variation (i.e., the ratio of the standard deviation to the mean) of the request execution time.

The average request execution time of each component is its per-request resource needs at a very low request rate (when there is no resource competition or queuing in the system). The average request execution time at each server, E[e], is the sum of such times for all hosted components. The workload intensity ρ at each server can be derived from the component resource needs and the set of components that are placed at the server in question. The coefficient of variation of the request execution time C_e can be determined with knowledge of its distribution. For instance, C_e = 1 if the request execution time follows an exponential distribution. Without knowledge of such a distribution, our application profile maintains a histogram of execution time samples for each component, and we use these histograms to determine C_e for each server under a particular placement strategy.
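Under these definitions, the per-server estimate is a one-line formula. The sketch below also derives C_e from a histogram of execution time samples, as our profiles do; the sample values and the 70% workload intensity are illustrative.

```python
import statistics

def mg1_response_time(exec_time_samples, utilization):
    """M/G/1 estimate of the average per-server response time:
    E[r] = E[e] + rho * E[e] * (1 + Ce^2) / (2 * (1 - rho))."""
    e = statistics.mean(exec_time_samples)          # E[e]
    ce = statistics.pstdev(exec_time_samples) / e   # Ce
    rho = utilization                               # workload intensity
    return e + rho * e * (1 + ce ** 2) / (2 * (1 - rho))

# Execution time samples (seconds) for the components hosted on one server.
samples = [0.010, 0.012, 0.009, 0.030, 0.011, 0.014]
print(f"E[r] ~ {mg1_response_time(samples, utilization=0.7) * 1000:.1f} ms")
```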

Network Delay   Our network delay prediction is based on the inter-component communication profile described in Section 2.3. From the profile, we can acquire the per-request communication data volume (denoted by ν_l) and blocking round-trip interaction count (denoted by κ_l) for each inter-component link l. Under a particular cluster environment, let γ_l and ω_l be the round-trip latency and bandwidth, respectively, of link l. Then we can model the per-request total network delay as:

    Γ = Σ over each non-local inter-component link l of [ κ_l · γ_l + ν_l / ω_l ]
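A corresponding sketch for the network delay term sums κ_l · γ_l + ν_l / ω_l over the non-local links of a placement. The per-link profile values, and the assumption of a single latency and bandwidth for every link, are illustrative simplifications.

```python
# Per-request profile of each inter-component link:
# kappa = blocking round trips, nu = bytes transferred (illustrative values).
link_profile = {
    ("WS", "Broker"): {"kappa": 7.7, "nu": 6000.0},
    ("Broker", "DB"): {"kappa": 5.3, "nu": 4400.0},
}

def network_delay(placement, rtt=150e-6, bandwidth=125e6):
    """Per-request network delay (seconds):
    Gamma = sum over non-local links l of kappa_l * gamma_l + nu_l / omega_l."""
    def colocated(c1, c2):
        return any(c1 in comps and c2 in comps for comps in placement.values())
    return sum(p["kappa"] * rtt + p["nu"] / bandwidth
               for (c1, c2), p in link_profile.items()
               if not colocated(c1, c2))

placement = {"node1": ["WS"], "node2": ["Broker", "DB"]}
print(f"per-request network delay ~ {network_delay(placement) * 1000:.2f} ms")
```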

3.3 Additional Modeling Issues

Cache pollution due to component co-location.   When components are co-located, interleaved or concurrent execution of multiple components may cause processor cache pollution on the server and thus affect system performance. Since modern operating systems employ affinity-based scheduling and large CPU quanta (compared with the cache warm-up time), we do not find cache pollution to be a significant performance factor. However, processor-level multi-threading technologies such as Intel Hyper-Threading [19] allow concurrent threads to execute on a single processor and share the level-1 processor cache. Cache pollution is more pronounced on these architectures, and it might be necessary to model its cost and impact on overall system performance.

Replication consistency management.   If replicated application states can be updated by user requests, mechanisms such as logging, undo, and redo may be required to maintain consistency among replicated states. Previous studies have investigated replication consistency management for scalable online services [15, 29, 32, 39]. Consistency management consumes additional system resources, which our current model does not consider. Since many component middleware systems in practice do not support replication consistency management, we believe this limitation of our current model should not severely restrict its applicability.

Cross-architecture performance modeling.   A service hosting cluster may comprise servers with multiple types of processor architectures or memory sizes. Our current approach requires application profiling on each of the architectures present in the cluster. Such profiling is time consuming, and it would be desirable to separate the performance impact of application characteristics from that of server properties. This would allow independent application profiling and server calibration, and thus significantly reduce the profiling overhead for server clusters containing heterogeneous architectures. Several recent studies [22, 33] have explored this issue in the context of scientific computing applications, and their results may be leveraged for our purpose.

4 Model Validation

We perform measurements to validate the accuracy of our throughput and response time prediction models. An additional objective of our measurements is to identify the contributions of various factors to our model accuracy. We are particularly interested in the effects of the remote invocation overhead modeling and the network delay modeling.

Our validation measurements are based on the RUBiS and StockOnline applications described in Section 2.1. The application EJB components are hosted on JBoss 3.2.3 application servers with embedded Tomcat 5.0 servlet containers. The database servers run MySQL 4.0. Although MySQL 4.0 supports master/slave replication, the JBoss application server cannot be configured to access replicated databases. Therefore we do not replicate the database server in our experiments.

All measurements are conducted on a 20-node heterogeneous Linux cluster connected by a 1 Gbps Ethernet switch. The round-trip latency (UDP or TCP without connection setup) between two cluster nodes is around 150 µs. The cluster nodes have three types of hardware configurations. Each type-1 server is equipped with a 2.66 GHz Pentium 4 processor and 512 MB memory. Each type-2 server is equipped with a single 2.00 GHz Xeon processor and 512 MB memory. Each type-3 server is equipped with two 1.26 GHz Pentium III processors and 1 GB memory. All application data is hosted on two local 10K RPM SCSI drives at each server.

Figure 6: Validation results on system throughput for (A) StockOnline and (B) RUBiS. Each panel plots throughput (requests/second) against the input workload (requests/second) for the base model, the RI model, the full model, and the measured results.

The performance of a cluster-based online service is affected by many factors, including the cluster size, the mix of input request types, the heterogeneity of the hosting servers, and the placement and replication strategy. Our approach is to first provide detailed results on the model accuracy in a typical setting (Section 4.1) and then explicitly evaluate the impact of various factors (Section 4.2). We summarize our validation results in Section 4.3.

4.1 Model Accuracy at a Typical Setting

The measurement results in this section are based on a 12-machine service cluster (four machines are type-1 and the other eight are type-2). For each application, we employ an input request mix with 10% read-write requests and 90% read-only requests. The placement strategy we use for each application (shown in Table 3) is the one with the highest modeled throughput out of 100 randomly chosen candidate strategies. We compare the measured performance with three variations of our performance model:

#1. The base model: The performance model that does not consider the overhead of remote component invocation or network delays.


Table 3: Component placement strategies (on a 12-node cluster) used for measurements in Section 4.1. For each of the twelve servers (four Pentium 4 and eight Xeon machines), the table lists the StockOnline and RUBiS components placed on that server.

Figure 7: Validation results on per-node CPU resource usage (at the input workload rate of 230 requests/second). Each panel plots the CPU usage of the twelve cluster nodes for (A) StockOnline and (B) RUBiS under the base model, the full model, and the measured results.

#2. The RI model: The performance model that considers remote invocation overhead but not network delays.

#3. The full model: The performance model that considers both.

Figure 6 shows validation results on the overall system throughput for StockOnline and RUBiS. We measure the rate of successfully completed requests at different input request rates. In our experiments, a request is counted as successful only if it returns within 10 seconds. Results show that our performance model can accurately predict system throughput. The error for RUBiS is negligible, while the error for StockOnline is less than 13%. This error is mostly attributed to unstable results (due to timeouts) when the system approaches the saturation point.

Figure 8: Validation results on service response time for (A) StockOnline and (B) RUBiS. Each panel plots the average response time (milliseconds) against the input workload (as a proportion of the saturation throughput, 50% to 90%) for the base model, the RI model, the full model, and the measured results.

There is no difference between the prediction of the RI model and that of the full model. This is expected, since the network delay modeling does not affect the component resource needs and consequently does not affect the system throughput. Comparing the full model with the base model, we find that the modeling of remote invocation overhead has a large impact on the prediction accuracy. It improves the accuracy by 36% and 14% for StockOnline and RUBiS respectively.

Since the system throughput in our model is derived from the resource usage at each server, we further examine the accuracy of per-node resource usage prediction. Figure 7 shows validation results on the CPU resource usage at an input workload rate of 230 requests/second (around 90% workload intensity for both applications). We do not show the RI model since its resource usage prediction is the same as the full model's.


Figure 9: Validation results on system saturation throughput at various cluster sizes (StockOnline; throughput in requests/second vs. cluster size in nodes, for the base model, the full model, and the measured results).

Figure 10: Validation results on service response time at various cluster sizes (StockOnline; average response time in milliseconds vs. cluster size, for the base, RI, and full models and the measured results). The response time is measured at an input workload that is around 85% of the saturation throughput.

Comparing the full model with the base model, we find that the remote invocation overhead can be very significant on some of the servers. Failing to account for it results in the poor performance prediction of the base model.

Figure 8 shows validation results on the average service response time for StockOnline and RUBiS. For each application, we show the average response time when the input workload is between 50% and 90% of the saturation throughput, defined as the highest successful request completion rate achieved at any input request rate. Results show that our performance model can predict the average response time with less than 14% error for the two applications. The base model prediction is very poor due to its low resource usage estimation. Comparing the full model with the RI model, we find that the network delay modeling accounts for an accuracy improvement of 9% and 18% for StockOnline and RUBiS respectively.

4.2 Impact of Factors

We examine the impact of various factors on the accuracy of our performance model. When we vary one factor, we keep the other factors unchanged from the settings in Section 4.1.

Figure 11: Validation results on system saturation throughput at various service request mixes (StockOnline; read-only, 10%-write, and 20%-write mixes; base model, full model, and measured results).

Figure 12: Validation results on service response time at various service request mixes (StockOnline; read-only, 10%-write, and 20%-write mixes; base, RI, and full models and measured results). The response time is measured at an input workload that is around 85% of the saturation throughput.


Impact of cluster sizes   We study the model accuracy at different cluster sizes. Figure 9 shows the throughput prediction for the StockOnline application on service clusters of 4, 8, 12, 16, and 18 machines. At each cluster size, we pick a high-throughput placement strategy out of 100 randomly chosen candidate placements. Figure 10 shows the validation results on the average response time. The response time is measured at an input workload that is around 85% of the saturation throughput. Results demonstrate that the accuracy of our model is not affected by the cluster size. The relative accuracies among the different models are also consistent across all cluster sizes.

Impact of request mixes   Figure 11 shows the throughput prediction for StockOnline at input request mixes with no writes, 10% read-write request sequences, and 20% read-write request sequences. Figure 12 shows the validation results on the average response time. Results demonstrate that the accuracy of our model is not affected by different types of input requests.


Figure 13: Validation results on system saturation throughput at various cluster settings (StockOnline; one, two, and three machine types in the cluster; base model, full model, and measured results).

Figure 14: Validation results on service response time at various cluster settings (StockOnline; one, two, and three machine types; base, RI, and full models and measured results). The response time is measured at an input workload that is around 85% of the saturation throughput.


Impact of heterogeneous machine architectures   Figure 13 shows the throughput prediction for StockOnline on service clusters with one, two, and three types of machines. All configurations have twelve machines in total. Figure 14 illustrates the validation results on the average response time. Results show that the accuracy of our model is not affected by heterogeneous machine architectures.

Impact of placement and replication strategies   Finally, we study the model accuracy under different component placement and replication strategies. Figure 15 shows the throughput prediction for StockOnline with the throughput-optimized placement and three randomly chosen placement strategies. Methods for finding high-performance placement strategies will be discussed in Section 5.1. Figure 16 shows the validation results on the average response time. Small prediction errors are observed in all validation cases.

Figure 15: Validation results on system saturation throughput at various placement strategies (StockOnline; the throughput-optimized placement OPT and three randomly chosen placements RAND#1 to RAND#3; base model, full model, and measured results).

Figure 16: Validation results on service response time at various placement strategies (StockOnline; OPT and RAND#1 to RAND#3; base, RI, and full models and measured results). The response time is measured at an input workload that is around 85% of the saturation throughput.


4.3 Summary of Validation Results

• Our model can predict the performance of the two J2EE applications with high accuracy (less than 13% error for throughput and less than 14% error for the average response time).

• The remote invocation overhead is very important for the accuracy of our performance model. Without considering it, the throughput prediction error can be up to 36%, while the prediction for the average response time can be much worse depending on the workload intensity.

• Network delay can affect the prediction accuracy for the average response time by up to 18%. It does not affect the throughput prediction. However, the impact of network delay may increase on networks of lower bandwidth or longer latency.

• The validation results are not significantly affected by factors such as the cluster size, the mix of input request types, the heterogeneity of the hosting servers, and the placement strategy.


Table 4: Component placement strategies (on a 12-node cluster) used for measurements in Section 5.1. For each of the twelve servers (four Pentium 4 and eight Xeon machines), the table lists the StockOnline components placed by the simulated annealing OPT, random sampling OPT, and low replication strategies. The "all replication" strategy is not shown.


5 Model-based System Management

We explore how our performance model can be used to assist system management functions for multi-component online services. A key advantage of model-based management is its ability to quickly explore the performance tradeoffs among a large number of system configuration alternatives without high-overhead measurements. Additionally, it can project system performance at hypothetical settings.

5.1 High-performance Component Placement

Our objective is to discover a component placement and replication strategy that achieves high performance in both throughput and service response time. More specifically, our target strategy should be able to support a large input request rate while still maintaining an average response time below a specified threshold. The model proposed in this paper can predict the performance of any given component placement strategy. However, the search space of all possible placement strategies is too large for an exhaustive check. In this context, we employ optimization by simulated annealing [21, 24]. Simulated annealing is a random sampling-based optimization algorithm that gradually reduces the sampling scope following an "annealing schedule".
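The sketch below shows the shape of such a search. The stand-in model functions, the neighbor move, the linear cooling schedule, and the penalty for violating the response time threshold are all illustrative choices, not the exact algorithm or parameters we used.

```python
import math
import random

COMPONENTS = ["WS", "Account", "Item", "Holding", "StockTX", "Broker", "DB"]
SERVERS = [f"node{i}" for i in range(1, 13)]
RT_THRESHOLD = 1.0  # seconds

def predict_throughput(placement):
    """Stand-in for the Section 3.1 model (toy proxy used here)."""
    return float(sum(len(replicas) for replicas in placement.values()))

def predict_response_time(placement):
    """Stand-in for the Section 3.2 model."""
    return 0.5

def random_placement():
    # Each component gets a random non-empty subset of servers.
    return {c: random.sample(SERVERS, random.randint(1, len(SERVERS)))
            for c in COMPONENTS}

def neighbor(placement):
    # Move: toggle one component's presence on one server (keep >= 1 replica).
    p = {c: list(s) for c, s in placement.items()}
    c, s = random.choice(COMPONENTS), random.choice(SERVERS)
    if s in p[c] and len(p[c]) > 1:
        p[c].remove(s)
    elif s not in p[c]:
        p[c].append(s)
    return p

def score(placement):
    # Higher is better: modeled throughput, zeroed if response time is too high.
    if predict_response_time(placement) > RT_THRESHOLD:
        return 0.0
    return predict_throughput(placement)

def anneal(steps=10000, start_temp=50.0):
    current = best = random_placement()
    for i in range(steps):
        temp = start_temp * (1.0 - i / steps) + 1e-6  # linear cooling schedule
        candidate = neighbor(current)
        delta = score(candidate) - score(current)
        if delta >= 0 or random.random() < math.exp(delta / temp):
            current = candidate
            if score(current) > score(best):
                best = current
    return best
```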

We evaluate the effectiveness of our approach on a 12-node cluster (the same as in Section 4.1) using the StockOnline application. We set the response time threshold of our optimization at 1 second. The number of samples examined by our simulated annealing algorithm is on the order of 10,000, and the algorithm takes about 12 seconds to complete on a 2.00 GHz Xeon processor. For comparison, we also consider a random sampling optimization which selects the best placement out of 10,000 randomly chosen placement strategies. Note that both of these approaches rely on our performance model.

For additional comparison, we introduce two placement strategies based on "common sense", i.e., without the guidance of our performance model.

Figure 17: System throughput under various placement strategies (StockOnline; throughput in requests/second vs. input workload, for the simulated annealing OPT, random sampling OPT, all replication, and low replication strategies).

In the first strategy, we replicate all components (except the database) on all nodes. This is the placement strategy suggested in the JBoss application server documentation. High replication may introduce some overhead, such as the component maintenance overhead at each replica that is not directly associated with serving user requests. In the other strategy, we attempt to minimize the amount of replication while still maintaining balanced load. We call this strategy low replication. Table 4 lists the three placement strategies other than all replication.

Figure 17 illustrates the measured system throughput under the above four placement strategies. We find that the simulated annealing optimization is slightly better than the random sampling approach. It outperforms all replication and low replication by 7% and 31% respectively. Figure 18 shows the measured average response time at different input workload rates. The response time for low replication rises dramatically when approaching 220 requests/second because it has a much lower saturation throughput than the other strategies. Compared with random sampling and all replication, the simulated annealing optimization achieves 22% and 26% lower response time, respectively, at an input workload rate of 250 requests/second.


Figure 18: Average service response time under various placement strategies (StockOnline; response time in milliseconds vs. input workload in requests/second, for the simulated annealing OPT, random sampling OPT, all replication, and low replication strategies).

Figure 19: Capacity planning using various models (StockOnline; required cluster size vs. projected workload rate in requests/second, for the simulated annealing OPT placement, all replication, and linear projection of all replication).

5.2 Capacity Planning

The ability to predict future resource needs at forecast workload levels allows an online service provider to acquire resources in an efficient fashion. Figure 19 presents our capacity planning results for StockOnline with the simulated annealing-optimized placement and the all replication placement. The base platform for capacity planning is a 12-node cluster (four type-1 nodes and eight type-2 nodes). Our performance model is used to project resource needs at workload levels that could not be supported on the base platform. We assume that only type-2 nodes will be added in the future. Previous work [23] has suggested linear projection-based capacity planning, where the future resource requirement scales linearly with the forecast workload level. For comparison, we also show the result of linear projection. The base performance for linear projection is that of all replication on the 12-node cluster.
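Model-based capacity planning then amounts to growing the hypothetical cluster, one node at a time, until the modeled saturation throughput covers the forecast workload. A schematic sketch, treating the Section 3 model as a black-box function (the stand-in below is a toy proxy, not the real model):

```python
BASE_CLUSTER = ["type1"] * 4 + ["type2"] * 8  # the 12-node base platform

def modeled_throughput(cluster):
    """Stand-in for the Section 3 model evaluated on an optimized placement;
    here a toy proxy in which type-1 nodes are slightly faster than type-2."""
    weight = {"type1": 1.2, "type2": 1.0}
    return 30.0 * sum(weight[node] for node in cluster)

def required_cluster_size(target_rate, max_nodes=80):
    """Smallest cluster, adding only type-2 nodes to the base platform, whose
    modeled saturation throughput covers the forecast workload."""
    cluster = list(BASE_CLUSTER)
    while modeled_throughput(cluster) < target_rate and len(cluster) < max_nodes:
        cluster.append("type2")
    return len(cluster)

for rate in (400, 800, 1200):
    print(rate, "requests/second ->", required_cluster_size(rate), "nodes")
```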

Results in Figure 19 illustrate that our optimized strategy consistently saves resources compared with all replication. The saving is at least 11% for a projected workload of 1000 requests/second or higher.

Figure 20: Cost-effectiveness of various machine types (StockOnline; cost in thousands of US dollars vs. projected workload rate in requests/second, when expanding with type-1, type-2, or type-3 machines).

Comparing the modeled all replication against the linearly projected all replication, we find that the linear projection significantly underestimates resource needs (by about 28%) at high workload levels. This is partially due to heterogeneity in the machines being employed. More specifically, four of the original nodes are type-1, which are slightly more powerful than the type-2 machines expected to be added. Additionally, linear projection fails to consider the increased likelihood of remote invocations in larger clusters. In comparison, our performance model addresses these issues and provides more accurate capacity planning.

5.3 Cost-effectiveness Analysis

We provide a model-based cost-effectiveness analysis. In this evaluation, we plan to choose one of the three types of machines (described at the beginning of Section 4) to expand the cluster for future workloads. We examine the StockOnline application in this evaluation. Figure 20 shows the estimated cost when each of the three types of machines is employed for expansion. We acquired the pricing for the three machine types from www.epinions.com: $1563, $1030, and $1700 respectively. Although type-2 nodes are less powerful than the other types, they are the most cost-effective choice for supporting the StockOnline application. The saving is at least 20% for a projected workload of 1000 requests/second or higher. Such a cost-effectiveness analysis would not be possible without an accurate prediction of application performance at hypothetical settings.

6 Related Work

Previous studies have addressed application resource consumption profiling. Urgaonkar et al. use resource usage profiling to guide application placement in shared hosting platforms [36]. Amza et al. provide bottleneck resource analysis for several dynamic-content online service benchmarks [3].


Doyle et al. model the service response time reduction with increased memory cache size for static-content Web servers [12]. The Magpie tool chain extracts per-request execution control flow through online profiling [8]. The main contribution of our profiling work is that we identify a comprehensive set of application characteristics that can be employed to predict the performance of multi-component online services with high accuracy.

A very recent work by Urgaonkar et al. [35] models a multi-tier Internet service as a network of queues. Their view of service tiers is not as fine-grained as application components in our model. Additionally, service tiers are organized in a chain-like architecture while application components can interact with each other in more complex fashions. As a result, their model cannot be used to directly guide component-level system management such as distributed component placement. On the other hand, our approach uses a simple M/G/1 queue to model service delay at each server while their model more accurately captures the dependencies of the request arrival processes at different service tiers.
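For reference, the per-node delay model mentioned above corresponds to the textbook M/G/1 result (the Pollaczek-Khinchine formula); the paper's exact per-node formulation may differ in detail. With request arrival rate $\lambda$, service demand $S$, and utilization $\rho = \lambda E[S] < 1$ at a node, the mean response time is

$$ E[T] \;=\; E[S] \;+\; \frac{\lambda\, E[S^2]}{2\,(1-\rho)} . $$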

Recent studies have proposed the concept of component-oriented performance modeling [17, 37]. They mainly focus on the design of performance characterization languages for software components and the way to assemble component-level models into a whole-application performance model. They do not describe how component performance characteristics can be acquired in practice. In particular, our study finds that failing to account for the remote invocation overhead can significantly affect model accuracy.

Distributed component placement has been examined in a number of prior studies. Coign [18] examines the optimization problem of minimizing communication time for two-machine client-server applications. ABACUS [2] focuses on the placement of I/O-specific functions for cluster-based data-intensive applications. Ivan et al. examine the automatic deployment of component-based software over the Internet subject to throughput requirements [20]. Most of these studies heuristically optimize the component placement toward a performance objective. In comparison, our model-based approach allows the flexibility to optimize for complex objectives (e.g., a combination of throughput and service response time) and it also provides an estimate of the maximum achievable performance.

7 Conclusion and Future Work

This paper presents a profile-driven performance model for cluster-based multi-component online services. We construct application profiles characterizing component resource needs and inter-component communication patterns using transparent operating system instrumentation. Given a component placement and replication strategy, our model can predict system throughput and the average service response time with high accuracy. We demonstrate how this performance model can be employed to assist optimized component placement, capacity planning, and cost-effectiveness analysis.

In addition to supporting static component placement, our model may also be used to guide dynamic runtime component migration for achieving better performance. Component migration requires up-to-date knowledge of runtime workload characteristics. It also requires a migration mechanism that does not significantly disrupt ongoing service processing. Additionally, runtime component migration must consider system stability, especially when migration decisions are made in a decentralized fashion. We plan to investigate these issues in the future.

Project Website More information about this work, including publications and releases of related tools and documentation, can be found on our project website: www.cs.rochester.edu/u/stewart/component.html.

Acknowledgments We would like to thank Peter DeRosa, Sandhya Dwarkadas, Michael L. Scott, Jian Yin, Ming Zhong, the anonymous referees, and our shepherd Dina Katabi for helpful discussions and valuable comments. We are also indebted to Liudvikas Bukys and the rest of the URCS lab staff for maintaining the experimental platform used in this study.

References

[1] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. Becker-Szendy, R. Golding, A. Merchant, M. Spasojevic, A. Veitch, and J. Wilkes. Minerva: An Automated Resource Provisioning Tool for Large-Scale Storage Systems. ACM Trans. on Computer Systems, 19(4):483-518, November 2001.
[2] K. Amiri, D. Petrou, G. Ganger, and G. Gibson. Dynamic Function Placement for Data-Intensive Cluster Computing. In Proc. of the USENIX Annual Technical Conf., San Diego, CA, June 2000.
[3] C. Amza, A. Chanda, A. L. Cox, S. Elnikety, R. Gil, K. Rajamani, W. Zwaenepoel, E. Cecchet, and J. Marguerite. Specification and Implementation of Dynamic Web Site Benchmarks. In Proc. of the 5th IEEE Workshop on Workload Characterization, Austin, TX, November 2002.
[4] E. Anderson. Simple Table-based Modeling of Storage Devices. Technical Report HPL-SSP-2001-04, HP Laboratories, July 2001.
[5] K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D. Pazel, J. Pershing, and B. Rochwerger. Oceano - SLA Based Management of A Computing Utility. In Proc. of the 7th Int'l Symp. on Integrated Network Management, Seattle, WA, May 2001.


[6] M. Aron, P. Druschel, and W. Zwaenepoel. Cluster Reserves: A Mechanism for Resource Management in Cluster-based Network Servers. In Proc. of the ACM SIGMETRICS, pages 90-101, Santa Clara, CA, June 2000.
[7] P. Barford and M. Crovella. Generating Representative Web Workloads for Network and Server Performance Evaluation. In Proc. of the ACM SIGMETRICS, pages 151-160, Madison, WI, June 1998.
[8] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modeling. In Proc. of the 6th USENIX OSDI, pages 259-272, San Francisco, CA, December 2004.
[9] E. Cecchet, J. Marguerite, and W. Zwaenepoel. Performance and Scalability of EJB Applications. In Proc. of the 17th ACM Conf. on Object-oriented Programming, Systems, Languages, and Applications, pages 246-261, Seattle, WA, November 2002.
[10] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle. Managing Energy and Server Resources in Hosting Centers. In Proc. of the 18th ACM SOSP, pages 103-116, Banff, Canada, October 2001.
[11] S. Deng. Empirical Model of WWW Document Arrivals at Access Link. In Proc. of the IEEE Conf. on Communications, pages 1797-1802, Dallas, TX, June 1996.
[12] R. P. Doyle, J. S. Chase, O. M. Asad, W. Jin, and A. M. Vahdat. Model-based Resource Provisioning in a Web Service Utility. In Proc. of the 4th USENIX Symp. on Internet Technologies and Systems, Seattle, WA, March 2003.
[13] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Trans. on Software Engineering, 12(5):662-675, May 1986.
[14] A. Fox, S. D. Gribble, Y. Chawathe, E. A. Brewer, and P. Gauthier. Cluster-Based Scalable Network Services. In Proc. of the 16th ACM SOSP, pages 78-91, Saint Malo, France, October 1997.
[15] S. D. Gribble, E. A. Brewer, J. M. Hellerstein, and D. Culler. Scalable, Distributed Data Structures for Internet Service Construction. In Proc. of the 4th USENIX OSDI, San Diego, CA, October 2000.
[16] S. D. Gribble, M. Welsh, E. A. Brewer, and D. Culler. The MultiSpace: An Evolutionary Platform for Infrastructural Services. In Proc. of the USENIX Annual Technical Conf., Monterey, CA, June 1999.
[17] S. A. Hissam, G. A. Moreno, J. A. Stafford, and K. C. Wallnau. Packaging Predictable Assembly. In Proc. of the First IFIP/ACM Conf. on Component Deployment, Berlin, Germany, June 2002.
[18] G. C. Hunt and M. L. Scott. The Coign Automatic Distributed Partitioning System. In Proc. of the Third USENIX OSDI, New Orleans, LA, February 1999.
[19] Hyper-Threading Technology. http://www.intel.com/technology/hyperthread.
[20] A.-A. Ivan, J. Harman, M. Allen, and V. Karamcheti. Partitionable Services: A Framework for Seamlessly Adapting Distributed Applications to Heterogeneous Environments. In Proc. of the 11th IEEE Symp. on High Performance Distributed Computing, pages 103-112, Edinburgh, Scotland, July 2002.
[21] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671-680, May 1983.
[22] G. Marin and J. Mellor-Crummey. Cross-Architecture Performance Predictions for Scientific Applications Using Parameterized Models. In Proc. of the ACM SIGMETRICS, pages 2-13, New York, NY, June 2004.
[23] D. A. Menasce and V. A. F. Almeida. Capacity Planning for Web Performance: Metrics, Models, and Methods. Prentice Hall, 1998.
[24] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21(6):1087-1092, 1953.
[25] M. Mitzenmacher. On the Analysis of Randomized Load Balancing Schemes. In Proc. of the 9th ACM SPAA, pages 292-301, Newport, RI, June 1997.
[26] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum. Locality-Aware Request Distribution in Cluster-based Network Servers. In Proc. of the 8th ASPLOS, pages 205-216, San Jose, CA, October 1998.
[27] D. Petriu and M. Woodside. Analysing Software Requirements Specifications for Performance. In Proc. of the Third Int'l Workshop on Software and Performance, Rome, Italy, July 2002.
[28] RUBiS: Rice University Bidding System. http://rubis.objectweb.org.
[29] Y. Saito, B. N. Bershad, and H. M. Levy. Manageability, Availability, and Performance in Porcupine: a Highly Scalable, Cluster-based Mail Service. ACM Trans. on Computer Systems, 18(3):298-332, August 2001.
[30] G. A. Seber. Nonlinear Regression Analysis. Wiley, New York, 1989.
[31] K. Shen, H. Tang, T. Yang, and L. Chu. Integrated Resource Management for Cluster-based Internet Services. In Proc. of the 5th USENIX OSDI, pages 225-238, Boston, MA, December 2002.
[32] K. Shen, T. Yang, and L. Chu. Clustering Support and Replication Management for Scalable Network Services. IEEE Trans. on Parallel and Distributed Systems - Special Issue on Middleware, 14(11):1168-1179, November 2003.
[33] A. Snavely, L. Carrington, and N. Wolter. Modeling Application Performance by Convolving Machine Signatures with Application Profiles. In Proc. of the 4th IEEE Workshop on Workload Characterization, Austin, TX, December 2001.
[34] The StockOnline Benchmark. http://forge.objectweb.org/projects/stock-online.
[35] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An Analytical Model for Multi-tier Internet Services and Its Applications. In Proc. of the ACM SIGMETRICS (to appear), Banff, Canada, June 2005.
[36] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource Overbooking and Application Profiling in Shared Hosting Platforms. In Proc. of the 5th USENIX OSDI, pages 239-254, Boston, MA, December 2002.
[37] X. Wu and M. Woodside. Performance Modeling from Software Components. In Proc. of the 4th Int'l Workshop on Software and Performance, pages 290-301, Redwood City, CA, January 2004.
[38] K. Yaghmour and M. R. Dagenais. Measuring and Characterizing System Behavior Using Kernel-Level Event Logging. In Proc. of the USENIX Annual Technical Conf., San Diego, CA, June 2000.
[39] H. Yu and A. Vahdat. Design and Evaluation of a Continuous Consistency Model for Replicated Services. In Proc. of the 4th USENIX OSDI, San Diego, CA, October 2000.
[40] S. Zhou. A Trace-Driven Simulation Study of Dynamic Load Balancing. IEEE Trans. on Software Engineering, 14(9):1327-1341, September 1988.