
Statistical Machine Learning Makes Automatic Control Practical for Internet Datacenters

Peter Bodík, Rean Griffith, Charles Sutton, Armando Fox, Michael Jordan, David Patterson
RAD Lab, EECS Department, UC Berkeley

Abstract

Horizontally-scalable Internet services on clusters of commodity computers appear to be a great fit for automatic control: there is a target output (service-level agreement), observed output (actual latency), and gain controller (adjusting the number of servers). Yet few datacenters are automated this way in practice, due in part to well-founded skepticism about whether the simple models often used in the research literature can capture complex real-life workload/performance relationships and keep up with changing conditions that might invalidate the models. We argue that these shortcomings can be fixed by importing modeling, control, and analysis techniques from statistics and machine learning. In particular, we apply rich statistical models of the application's performance, simulation-based methods for finding an optimal control policy, and change-point methods to find abrupt changes in performance. Preliminary results running a Web 2.0 benchmark application driven by real workload traces on Amazon's EC2 cloud show that our method can effectively control the number of servers, even in the face of performance anomalies.

1 Introduction

Most Internet applications have strict performance requirements, such as service level agreements (SLAs) on the 95th percentile of response time. To meet these requirements, applications are designed as much as possible to be horizontally scalable, meaning that they can add more servers in the face of larger demand. These additional resources come at a cost, however, and especially given the increasing popularity of utility computing such as Amazon's EC2, applications have an incentive to minimize resource usage because they pay for each incremental resource they use. A natural way to minimize usage while meeting performance requirements is to automatically allocate resources based on the current demand. However, despite a growing body of research on automatic control of Internet applications [2, 11, 6, 5, 4, 10], application operators remain skeptical of such methods, and provisioning is typically performed manually.

In this paper, we argue that the skepticism of datacenter operators is well founded, and is based on two key limitations of previous attempts at automatic provisioning. First, the performance models usually employed, such as linear models and simple queueing models, tend to be so unrealistic that they are not up to the task of controlling complex Internet applications without jeopardizing SLAs. Second, previous attempts at automatic control have failed to demonstrate robustness to changes in the application and its environment over time, including changes in usage patterns, hardware failures, changes in the application, and sharing resources with other applications in a cloud environment.

We claim that both problems can be solved using a suite of modeling, control, and analysis techniques rooted in statistical machine learning (SML). We propose a control framework with three components. First, at the base of our framework are rich statistical performance models, which allow predicting system performance for future configurations and workloads. Second, to find a control policy that minimizes resource usage while maintaining performance, we employ a control policy simulator that simulates from the performance model to compare different policies for adding and removing resources. Finally, for robustness we employ model management techniques from the SML literature, such as online training and change point detection, to adjust the models when changes are observed in application performance. Importantly, control and model management are model-agnostic: we use generic statistical procedures that allow us to "plug and play" a wide variety of performance models.

Our new contributions are, first, to demonstrate the power of applying established SML techniques for rich models to this problem space, and second, to show how to use these techniques to augment the conventional closed-loop control framework, making automatic control robust and practical to use in real, highly variable, rapidly-changing Internet applications.


Figure 1: Architectural diagram of the control framework.


2 Modeling and Model Management

Controllers must adapt to multiple types of variation in the expected performance of a datacenter application. "Expected" variations, such as diurnal or seasonal workload patterns and their effects on performance, can be measured accurately and captured in a historical model. However, unexpected sustained surges ("Slashdot effect") cannot be planned for, and even "planned" events such as bug fixes may have unpredictable effects on the relationship between workload and performance, or may influence the workload mix (e.g., new features). Furthermore, the controller must also handle performance anomalies and resource bottlenecks.

We propose a resource controller (see Figure 1) that uses an accurate performance model of the application to dynamically adjust the resource allocation in response to changes in the workload. We find the optimal control policy in a control policy simulator and adapt the performance model to changes via change-point detection techniques. The control loop is executed every 20 seconds as follows:

Step 1. First, predict the next 5 minutes of workload using a simple linear regression on the most recent 15 minutes. (More sophisticated historical workload models could easily be incorporated here; a short code sketch combining Steps 1 and 3 appears after Step 3 below.)

Step 2. Next, the predicted workload is used as input to a performance model that estimates the number of servers required to handle the predicted workload (Section 2.1). However, many complex factors affect application performance, such as workload mix, size of the database, or changes to application code. Rather than trying to capture all of these in a single model, we detect when the current performance model no longer accurately models actual performance, and respond by estimating a new model from production data collected through exploration of the configuration space. We explain this model management process in Section 2.2.

Step 3. Servers are added or removed according to the recommendation of the performance model, which we call s_target. To prevent wild oscillations in the controller, we use hysteresis with gains α and β. More formally, we maintain s, a continuous version of the desired number of servers, which tracks s_target by

$$
s_{\text{new}} \leftarrow s_{\text{old}} +
\begin{cases}
\alpha \, (s_{\text{target}} - s_{\text{old}}) & \text{if } s_{\text{target}} > s_{\text{old}} \\
\beta \, (s_{\text{target}} - s_{\text{old}}) & \text{otherwise.}
\end{cases}
\qquad (1)
$$

Here α and β are hysteresis parameters that determine how quickly the controller adds and removes servers. Section 2.3 explains how optimal values of α and β are found using the simulator.
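As a concrete illustration of Steps 1 and 3, the sketch below fits a least-squares line to the most recent 15 minutes of workload, extrapolates it 5 minutes ahead, and then applies the hysteresis update of Equation (1). This is a minimal sketch under our own assumptions: the function names and the servers_needed stub standing in for the performance model of Section 2.1 are illustrative choices, not the authors' implementation.

```python
import numpy as np

CONTROL_PERIOD_S = 20   # control loop period from the paper
HISTORY_S = 15 * 60     # fit workload over the last 15 minutes
HORIZON_S = 5 * 60      # predict 5 minutes ahead

def predict_workload(timestamps, workloads):
    """Step 1: linear regression over recent workload, extrapolated ahead.

    timestamps -- numpy array of seconds, most recent last;
    workloads  -- numpy array of observed requests/sec.
    """
    recent = timestamps >= timestamps[-1] - HISTORY_S
    slope, intercept = np.polyfit(timestamps[recent], workloads[recent], deg=1)
    return slope * (timestamps[-1] + HORIZON_S) + intercept

def hysteresis_update(s_old, s_target, alpha=0.9, beta=0.01):
    """Step 3: Equation (1); react quickly when adding, slowly when removing."""
    gain = alpha if s_target > s_old else beta
    return s_old + gain * (s_target - s_old)

def control_step(timestamps, workloads, s_old, servers_needed):
    """One 20-second iteration; servers_needed() is a hypothetical stand-in
    for the performance model of Section 2.1 (Step 2)."""
    predicted = predict_workload(timestamps, workloads)   # Step 1
    s_target = servers_needed(predicted)                  # Step 2
    s_new = hysteresis_update(s_old, s_target)            # Step 3
    return int(np.ceil(s_new)), s_new
```

In practice the controller would invoke control_step once per 20-second period and launch or terminate instances to match the returned integer.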

The proposed change-point and simulation techniques are model-agnostic, meaning that they work with most existing choices of statistical performance model; any model that predicts the mean and variance of performance could be used. This makes the proposed framework very flexible; progress in statistical machine learning can be directly applied to modeling, control, or model management without affecting the other components.

2.1 Statistical Performance Models

The performance model estimates the fraction of requests slower than the SLA threshold, given input of the form {workload, # servers}. Each point in the training data represents the observed workload, number of servers, and observed performance of the system over a twenty-second interval. We use a performance model based on smoothing splines [3], an established technique for nonlinear regression that does not require specifying the shape of the curve in advance. Using this method, we estimate a curve that directly maps workload and number of servers to mean performance (for an example, see Figure 3). In addition to predicting the mean performance, it is just as important to predict the variance, because this represents our expectation of "typical" performance spikes. After fitting the mean performance, we estimate the variance by computing, for each training point, the squared difference from the model's predicted mean performance, yielding one training measurement of the variance. Finally, we fit a nonlinear regression model (in particular, a LOESS regression [3]) that maps the mean performance to variance. This method allows us to capture the important fact that higher workload increases not only the mean of the performance metric, but also its variance.
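The two-stage fit described above can be sketched as follows, assuming synthetic training data, one smoothing spline per server count (as plotted in Figure 3), and a LOESS fit of squared residuals against the predicted mean. The smoothing parameters, helper names, and data generation are our own assumptions; the paper does not specify them.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from statsmodels.nonparametric.smoothers_lowess import lowess

def fit_performance_model(workload, servers, frac_slow):
    """Fit mean performance per server count with smoothing splines, then
    model the variance as a LOESS function of the predicted mean."""
    mean_models, preds, sq_resid = {}, [], []
    for n in np.unique(servers):
        mask = servers == n
        order = np.argsort(workload[mask])
        x, y = workload[mask][order], frac_slow[mask][order]
        spline = UnivariateSpline(x, y, k=3, s=len(x) * y.var() * 0.5)
        mean_models[n] = spline
        mu = spline(x)
        preds.append(mu)
        sq_resid.append((y - mu) ** 2)
    preds, sq_resid = np.concatenate(preds), np.concatenate(sq_resid)
    # LOESS: variance as a smooth function of predicted mean performance.
    var_curve = lowess(sq_resid, preds, frac=0.6, return_sorted=True)

    def predict(wl, n):
        mu = float(mean_models[n](wl))
        var = float(np.interp(mu, var_curve[:, 0], var_curve[:, 1]))
        return mu, max(var, 0.0)
    return predict

# Synthetic example: fraction of slow requests grows with per-server load,
# with noise that grows with the workload (heteroscedastic).
rng = np.random.default_rng(0)
workload = rng.uniform(0, 100, 600)
servers = rng.choice([2, 5, 8], 600)
frac_slow = np.clip(0.001 * (workload / servers) ** 1.5
                    + rng.normal(0, 0.003 * (1 + workload / 50), 600), 0, 1)
predict = fit_performance_model(workload, servers, frac_slow)
print(predict(80.0, 5))   # (predicted mean, predicted variance)
```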

2.2 Detecting Changes in Application Performance

Our performance model should be discarded or modified when it no longer accurately captures the relationship among workload, number of servers, and performance. In practice, this relationship could be altered by software upgrades, transient hardware failures, or other changes in the environment. Note that this is different from detecting changes in performance alone: if the workload increases by 10%, this may cause a performance decrease, but we do not want to flag it as a change point.

The accuracy of a model is usually estimated from the residuals, i.e., the difference between the measured performance of the application and the prediction of the model. In steady state, the residuals should follow a stationary distribution, so a shift of the mean or an increase in the variance of this distribution indicates that the model is no longer accurate and should be updated. Online change-point detection techniques [1] use statistical hypothesis testing to compare the distribution of the residuals in two subsequent time intervals of possibly different lengths, e.g., 9 AM to 9 PM and 9 PM to 11 PM. If the difference between the distributions is statistically significant, we start training a new model. The magnitude of the change influences the detection time; an abrupt change should be detected within minutes, while it might take days to detect a slow, gradual change.
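As a simple illustration of this residual test (the paper defers to the change-point literature [1] for the exact procedure), the sketch below compares the residuals of two adjacent windows with a two-sample Kolmogorov-Smirnov test and flags a change when the p-value falls below a threshold; the window boundary and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def residual_change_point(residuals, split, alpha=0.001):
    """Compare the residual distribution before and after `split`.

    residuals -- observed minus predicted performance, one value per
                 20-second control interval;
    split     -- index separating the reference window from the recent window.
    Returns (changed, p_value); a small p-value means the two windows are
    unlikely to come from the same distribution, so the model is stale.
    """
    reference, recent = residuals[:split], residuals[split:]
    stat, p_value = ks_2samp(reference, recent)
    return p_value < alpha, p_value

# Example: residuals are centered noise until the application changes and
# the model starts under-predicting the fraction of slow requests.
rng = np.random.default_rng(1)
before = rng.normal(0.0, 0.01, 2000)
after = rng.normal(0.03, 0.02, 400)    # sustained shift after a change
changed, p = residual_change_point(np.concatenate([before, after]), split=2000)
print(changed, p)
```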

Because we train the model online from production data, rather than from a small-scale test deployment, our approach is under pressure to quickly collect the training data needed for the new model. To address this constraint, we use an active exploration policy until the performance model settles. In exploration mode, the controller is very conservative about the number of machines required to handle the current workload; it starts with a large number of machines to guarantee good performance of the application and then slowly removes machines to find the minimum required for the current workload level. As the accuracy of the performance model improves, the controller switches from exploration to optimal control.
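One way to read the exploration policy is as a step-down loop that over-provisions and then removes one machine at a time while measured performance stays safely below the SLA threshold. The sketch below is our own paraphrase of that idea; the safety margin, step size, and the measure_frac_slow callback are assumptions, not details given in the paper.

```python
def explore_minimum_servers(measure_frac_slow, start_servers,
                            sla_threshold=0.05, margin=0.5, min_servers=1):
    """Conservatively search for the fewest servers that still meet the SLA.

    measure_frac_slow(n) -- run with n servers for a while and return the
    observed fraction of slow requests; hypothetical callback.
    Returns the (servers, performance) pairs observed, which also serve as
    training data for the new performance model.
    """
    n = start_servers
    samples = [(n, measure_frac_slow(n))]
    while n > min_servers:
        if samples[-1][1] > margin * sla_threshold:  # too close to the SLA
            break
        n -= 1                                       # remove one machine
        samples.append((n, measure_frac_slow(n)))
    return samples
```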

2.3 Control Policy Simulator

An accurate performance model alone doesn't guarantee good performance of the control policy in the production environment, because the various parameters in the control loop, such as the hysteresis gains α and β, significantly affect the control. This is a standard problem in control theory called gain scheduling; however, it is very difficult to find the optimal values of the parameters in complex control domains such as ours because of delays in actions and the different time scales used in the controller and the cost functions. To solve this problem, we use Pegasus [8], a policy search algorithm that compares different control policies using simulation. We use a coarse-grained simulator of the application to evaluate various values of the α and β parameters using the workload and performance models. The simulator uses real, production workloads as observed in the past several days, or any other historic workloads such as spikes. Given particular values of the α and β parameters, the simulator executes the control policy on the supplied workloads, estimates performance of the application using the performance model, and returns the amount of money spent on machines and the number of SLA violations. By using a local search heuristic such as hill-climbing, we can find the optimal values of α and β that minimize the total cost of running the application.
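A coarse sketch of this simulate-and-search loop is shown below: it replays a recorded workload through the hysteresis controller, prices machine time and SLA penalties using the constants of Section 3.1, and picks the cheapest β from a small candidate grid (which a hill-climbing step could refine). For brevity the performance model is treated as a deterministic callback and the workload-prediction step is omitted, whereas Pegasus additionally fixes the random draws across policy evaluations; the code is our own illustration, not the authors' simulator.

```python
import numpy as np

VM_COST = 0.10           # dollars per server per 10-minute interval (Sec. 3.1)
SLA_PENALTY = 10.0       # dollars per violated 10-minute interval
SLA_FRACTION = 0.05      # at most 5% of requests slower than 1 second
STEPS_PER_INTERVAL = 30  # 10 minutes of 20-second control periods

def simulate_policy(alpha, beta, workload_trace, servers_needed, frac_slow):
    """Replay a workload trace through the controller and return its cost.

    servers_needed(w) and frac_slow(w, n) stand in for the performance
    model of Section 2.1; both are hypothetical callbacks here.
    """
    s = float(servers_needed(workload_trace[0]))
    cost, violations, frac_window = 0.0, 0, []
    for t, w in enumerate(workload_trace):
        target = servers_needed(w)
        gain = alpha if target > s else beta
        s += gain * (target - s)                   # hysteresis, Equation (1)
        n = int(np.ceil(s))
        frac_window.append(frac_slow(w, n))
        if (t + 1) % STEPS_PER_INTERVAL == 0:      # end of a 10-minute interval
            cost += n * VM_COST
            if np.mean(frac_window) > SLA_FRACTION:
                cost += SLA_PENALTY
                violations += 1
            frac_window = []
    return cost, violations

def best_beta(workload_trace, servers_needed, frac_slow, alpha=0.9):
    """Coarse grid over beta; a hill-climbing step could refine the result."""
    candidates = [0.001, 0.003, 0.01, 0.03, 0.1]
    costs = {b: simulate_policy(alpha, b, workload_trace,
                                servers_needed, frac_slow)[0]
             for b in candidates}
    return min(costs, key=costs.get), costs
```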

Figure 3: The smoothing spline performance model estimated from observed data; the circles and crosses represent observed data for two and five servers, respectively. Each curve represents the mean performance estimated by the model. The error bars on the eight-server line represent our estimate of the standard deviation of performance.


3 Preliminary Results

In this section we present early results demonstrating our approach to automatic resource allocation. We first present a resource management experiment using a smoothing-splines-based performance model as described in Section 2.1. Next, we demonstrate that the simulator is effective for selecting the optimal value of the β hysteresis parameter. Finally, we use change-point detection techniques to identify a change in performance during a performance anomaly.

In all of the experiments, we used the Cloudstone Web 2.0 benchmark [9], written in Ruby on Rails and deployed on Amazon EC2. Cloudstone includes a workload generator called Faban [7], which we use to replay three days of real workload data obtained from Ebates.com. We compress the trace playback time to 12 hours for the sake of efficiency.

3.1 Automatic Resource Allocation

First we demonstrate that an SML performance model can be used for automatic resource allocation.¹ We model the business cost of violating an SLA as a $10 penalty for each 10-minute interval in which the 95th percentile of latency exceeds one second. We trained an initial performance model using data from an offline benchmark and used it to derive the relationship between the workload and the optimal number of servers. (We leave online training of the performance model to future work.)

¹ Although Amazon EC2 costs 10 cents per server per hour, we adjust this to 10 cents per 10 minutes to account for our faster replay of the workload.



Figure 2: Results of a 12-hour experiment replaying a three-day workload. The bottom graph shows the workload and the number of servers requested by the controller (thick step curve). The top graph shows the fraction of requests slower than one second during 20-second intervals (thin gray line) and during the ten-minute SLA evaluation intervals (thick step curve). Because the fraction of slow requests during the ten-minute intervals is always below 5%, the SLA was not violated. During the whole experiment, 0.52% of the requests were slower than one second. The controller performed 55 actions.

The controller attempts to minimize the total cost in dollars, which combines a cost for hardware with a penalty for violating the SLA. We set α = 0.9 to respond quickly to workload spikes and β = 0.01 to be very conservative when removing machines. Figure 2 shows that these choices led to a successful run, with no SLA violations and few controller actions to modify the number of servers. However, as Figure 4 shows, the controller is very sensitive to the values of α and β, so we next describe how we use a simulator to automatically find optimal values for these parameters.

3.2 Control Policy Simulator

In this section, we demonstrate using the control policy simulator (Section 2.3) to find the optimal values of the controller parameters. In this experiment, we use the simulator to find the optimal value of β, the hysteresis gain used when removing machines, while keeping α = 0.9. For each value of β, we executed ten simulations and computed the average total cost of that control policy (dashed line in Figure 4). The minimum average total cost of $78.05 was achieved with β = 0.01. To validate this result, we measured the actual performance of the application on EC2 with the same workloads and the same values of β (solid line in Figure 4). The results align almost perfectly, confirming that the simulator finds the optimal value of β.

3.3 Detecting Changes in Performance

In this section we demonstrate the use of change-point detection on a performance anomaly that we observed while running Cloudstone on Amazon's EC2.

Figure 4: Comparison of the total cost for different values of the β parameter. The solid line represents actual measurements; the dashed line represents the average simulated values (Section 2.3). The simulation runs for all values of β were completed in less than an hour.

Although the workload and the control parameters were identical to those in the experiment described in Figure 2, we observed a three-hour-long performance anomaly during which the percentage of slow requests significantly increased (hours 6 to 8 in the top graph of Figure 5). The result of the change-point test at each time t is a p-value; the lower the p-value, the higher the probability of a significant change in the mean of the normalized performance signal. The bottom graph of Figure 5 shows the computed p-values on a logarithmic scale; the dips do indeed correspond to the beginning and the end of the performance anomaly period. Furthermore, outside of the performance anomaly, the p-values remain flat. This result shows promise that the dips in p-value could indeed be used in practice to direct model management.

4 Related Work

Most previous work in dynamic provisioning for Web applications uses analytic performance models, such as queueing models, and does not consider adaptation to changes in the environment. In contrast, our statistical models are numerical in character, which allows us to more naturally employ statistical techniques for optimizing control parameters and for model management.


Figure 5: Top graph: performance of the application during a 12-hour experiment with a performance anomaly during hours six to eight. Center graph: observed performance normalized using the predictions of the model. Bottom graph: p-values reported by the hypothesis test.


Muse [2] uses a control strategy that adds, removes, powers down, or reassigns servers to maximize energy efficiency subject to SLA constraints on quality of service for each application in a co-hosted services scenario. Muse assumes that each application already has a utility function that expresses the monetary value of additional resources, which includes the monetary value of improved performance. In our work, we learn the performance impact of additional resources.

In [6] the authors devise an adaptive admission-control strategy for a 3-tier web application using a simple queueing model (a single M/GI/1/Processor Sharing queue) and a proportional-integral (PI) controller. However, a single queue cannot model bottleneck effects, e.g., when additional application servers no longer help, which a statistical model can incorporate naturally.

[11] uses a more complex analytic performance model of the system (a network of G/G/1 queues) for resource allocation. [5] presents a controller for virtual machine consolidation based on a simple performance model and lookahead control, similar to our simulator. [10] applies reinforcement learning to train a resource allocation controller from traces of another controller and improves its performance. That system learns a direct relationship between observations and actions; however, because we model the application performance explicitly, our approach is more modular and interpretable, and allows us to simulate hypothetical future workloads.

[4] provides an example of using change-point detection in the design of a thread-pool controller to adapt to changes in concurrency levels and workloads. In our control strategy, change-point detection is used to indicate when we need to modify our performance model.

5 Conclusion

We have demonstrated that the perceived shortcomings of automating datacenters using closed-loop control can be addressed by replacing simple techniques of modeling and model management with more sophisticated techniques imported from statistical machine learning. A key goal of our framework and methodology is enabling the rapid uptake of further SML advances into this domain. We are encouraged by the possibility of increased interaction among the research communities of control theory, machine learning, and systems.

References

[1] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes. Prentice Hall, 1993.

[2] J. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle. Managing energy and server resources in hosting centers. In Symposium on Operating Systems Principles (SOSP), 2001.

[3] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, August 2001.

[4] J. L. Hellerstein, V. Morrison, and E. Eilebrecht. Optimizing concurrency levels in the .NET ThreadPool: A case study of controller design and implementation. In Feedback Control Implementation and Design in Computing Systems and Networks, 2008.

[5] D. Kusic, J. O. Kephart, J. E. Hanson, N. Kandasamy, and G. Jiang. Power and performance management of virtualized computing environments via lookahead control. In ICAC '08: Proceedings of the 2008 International Conference on Autonomic Computing, pages 3–12, Washington, DC, USA, 2008. IEEE Computer Society.

[6] X. Liu, J. Heo, L. Sha, and X. Zhu. Adaptive control of multi-tiered web applications using queueing predictor. In Network Operations and Management Symposium (NOMS 2006), 10th IEEE/IFIP, pages 106–114, April 2006.

[7] Sun Microsystems. Next generation benchmark development/runtime infrastructure. http://faban.sunsource.net/, 2008.

[8] A. Y. Ng and M. I. Jordan. Pegasus: A policy search method for large MDPs and POMDPs. In UAI '00: Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 406–415, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[9] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, S. Patil, A. Fox, and D. Patterson. Cloudstone: Multi-platform, multi-language benchmark and measurement tools for Web 2.0, 2008.

[10] G. Tesauro, N. Jong, R. Das, and M. Bennani. A hybrid reinforcement learning approach to autonomic resource allocation. In International Conference on Autonomic Computing (ICAC), 2006.

[11] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal. Dynamic provisioning of multi-tier Internet applications. In ICAC, 2005.