[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - Scalable Exploration of Physical Database Design


Post on 24-Mar-2017






<ul><li><p>Scalable Exploration of Physical Database Design</p><p>Arnd Christian König, Microsoft Research</p><p>chrisko@microsoft.com</p><p>Shubha U. Nabar, Stanford University</p><p>sunabar@stanford.edu</p><p>Abstract</p><p>Physical database design is critical to the performance of a large-scale DBMS. The corresponding automated design tuning tools need to select the best physical design from a large set of candidate designs quickly. However, for large workloads, evaluating the cost of each query in the workload for every candidate does not scale. To overcome this, we present a novel comparison primitive that only evaluates a fraction of the workload and provides an accurate estimate of the likelihood of selecting correctly. We show how to use this primitive to construct accurate and scalable selection procedures. Furthermore, we address the issue of ensuring that the estimates are conservative, even for highly skewed cost distributions. The proposed techniques are evaluated through a prototype implementation inside a commercial physical design tool.</p><p>1 Introduction</p><p>The performance of applications running against enterprise database systems depends crucially on the physical database design chosen. To enable the exploration of potential designs, today's commercial database systems have incorporated APIs that allow What-if analysis [8]. These take as input a query Q and a database configuration C, and return the optimizer-estimated cost of executing Q if configuration C were present. This interface is the key to building tools for exploratory analysis as well as automated recommendation of physical database designs [13, 20, 1, 10].</p><p>The problem of physical design tuning is then defined as follows: a physical design tool receives a representative query workload WL and constraints on the configuration space to explore as input, and outputs a configuration in which executing WL has the least possible cost (as measured by the optimizer cost model)<sup>1</sup>. 
To determine the best configuration, a number of candidate configurations are enumerated and then evaluated using What-if analysis.</p><p>The representative workload is typically obtained by tracing the queries that execute against a production system (using tools such as IBM Query Patroller, SQL Server Profiler, or ADDM [10]) over a representative period of time.</p><p><sup>1</sup>Note that queries include both update and select statements. Thus this formulation models the trade-off between the improved performance of select-queries and the maintenance costs of additional indexes and views.</p><p>Since a large number of SQL statements may execute in this time, the straightforward approach of comparing configurations by repeatedly invoking the query optimizer for each query in WL with every configuration is often not tractable [5, 20]. Our experience with a commercial physical design tool shows that a large part of the tool's overhead arises from repeated optimizer calls to evaluate large numbers of configuration/query combinations. Most of the research on such tools [13, 20, 1, 10, 7, 8] has focused on minimizing this overhead by evaluating only a few carefully chosen configurations. Our approach is orthogonal to this work and focuses on reducing the number of queries for which the optimizer calls are issued. Because the topic of which configurations to enumerate has been studied extensively, we will not comment further on the configuration space.</p><p>Current commercial tools address the issue of large workloads by compressing them up-front, i.e., initially selecting a subset of queries and then tuning only this smaller set [5, 20]. These approaches do not offer any guarantees on how the compression affects the likelihood of choosing the right configuration. 
However, such guarantees are important due to the significant overhead of changing the physical design and the performance impact of bad database design.</p><p>The problem we study in this paper is that of efficiently comparing two (or more) configurations for large workloads. Solving this problem is crucial for scalability of automated physical design tuning tools [20, 1, 10]. We provide probabilistic guarantees on the likelihood of correctly choosing the best configuration for a large workload from a given set of candidate configurations. Our approach is to sample from the workload and use statistical inference techniques to compute the probability of selecting correctly.</p><p>The resulting probabilistic comparison primitive can be used for (a) fast interactive exploratory analysis of the configuration space, allowing the DB administrator to quickly find promising candidates for full evaluation, or (b) as the core comparison primitive inside an automated physical design tool, providing both scalability and locally good decisions with probabilistic guarantees on the accuracy of each comparison. Depending on the search strategy used, the latter can be extended to guarantees on the quality of the final result.</p><p>The key challenges to this probabilistic approach are twofold.</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 © 2006 IEEE</p></li><li><p>First, the accuracy of the estimation depends critically on the variance of the estimator we use. The challenge is thus to pick an estimator with as little variance as possible. Second, sampling techniques rely on (a) the applicability of the Central Limit Theorem (CLT) [9] to derive confidence statements for the estimates and (b) the sample variance being a good estimate of the true variance of the underlying distribution. Unfortunately, neither assumption is guaranteed to hold in this scenario. 
Thus, there is a need for techniques to determine the applicability of the CLT for the given workload and set of configurations.</p><p>1.1 Our Contributions</p><p>We propose a new probabilistic comparison primitive that, given as input a workload WL, a set of configurations C and a target probability α, outputs the configuration with the lowest optimizer-estimated cost of executing WL with probability at or above the target probability α. It works by incrementally sampling queries from the original workload and computing the probability of selecting the best configuration with each new sample, stopping once the target probability is reached. Our work makes the following salient contributions:</p><p>• We derive probabilistic guarantees on the likelihood of selecting the best configuration (Section 4).</p><p>• We propose a modified sampling scheme that significantly reduces estimator variance by leveraging the fact that query costs exhibit some stability across configurations (Section 4.2).</p><p>• We show how to reduce the estimator variance further through a stratified sampling scheme that leverages commonality between queries (Section 5).</p><p>• Finally, we describe a novel technique to address the problem of highly skewed distributions in which the sample may not be representative of the overall distribution and/or the CLT may not apply for a given sample size (Section 6).</p><p>The remainder of the paper is organized as follows: In Section 2 we review related work. In Section 3 we give a formal description of the problem and introduce the necessary notation. In Section 4 we describe two sampling schemes that can be used to estimate the probability of selecting the correct configuration. In Section 5 we then show how to reduce variances through the use of stratification and combine all parts into an efficient algorithm. In Section 6 we describe how to validate the assumptions on which the probabilistic guarantees described earlier are based. 
We evaluate our techniques experimentally in Section 7.</p><p>2 Related Work</p><p>The techniques developed in this paper are related to the field of statistical selection and ranking [15], which is concerned with the probabilistic ranking of systems in experimental setups based on a series of measurements from each system. However, statistical ranking techniques are typically aimed at comparing systems for which the individual measurements are distributed according to a normal distribution. This is clearly not the case in our scenario. To incorporate non-normal distributions into statistical selection, techniques like batching (e.g. [17]) have been suggested, which initially generate a large number of measurements and then transform this raw data into batch means. The batch sizes are chosen to be large enough so that the individual batch means are approximately independent and normally distributed. However, because procedures of this type need to produce a number of normally distributed estimates per configuration, they require a large number of initial measurements (according to [15], batch sizes of over 1000 measurements are common), thereby nullifying the efficiency gain due to sampling for our scenario.</p><p>Workload compression techniques such as [5, 20] compute a compact representation of a large workload before submitting it to a physical design tool. Both of these approaches are heuristics in the sense that they have no means of assessing the impact of the compression on configuration selection or consequently on physical design tuning itself. [5] poses workload compression as a clustering problem, using a distance function that models the maximum difference in cost between two queries for arbitrary configurations. This distance function does not use the optimizer's cost estimate, so it is not clear how well this approximation holds for complex queries. 
[20] compresses a workload by selecting queries in order of their current costs, until a user-specifiable percentage X of the total workload cost has been reached. While computationally simple, this approach may lead to a significant reduction in tuning quality for workloads in which queries of some templates are generally more expensive than the remaining queries, as then only some templates are considered for tuning. Making X user-specifiable does not alleviate this problem, for the user has no way of assessing the impact of X on tuning quality.</p><p>The most serious drawback of both approaches is that they do not adapt the number of queries retained to the space of configurations considered. Consequently, they may result in compression that is either too conservative, resulting in excessive numbers of optimizer calls, or too coarse, resulting in inferior physical designs.</p><p>3 Problem Statement</p><p>In this section we give a formal definition of the problem addressed in this paper. This definition is identical to the informal one given in Section 1.1, but in addition to the parameters C, WL and the target probability α, we introduce an additional parameter δ, which describes the minimum difference in cost between configurations that we care to detect.</p></li><li><p>Specifying a value of δ &gt; 0 helps to avoid scenarios in which the algorithm samples a large fraction of the workload when comparing configurations of (nearly) identical costs. While accuracy for such configurations is necessary in some cases, detecting large differences is all that matters in others. 
For instance, when comparing a new candidate configuration to the current one, the overhead of changing the physical database design is justified only when the new configuration is significantly better.</p><p>3.1 Notation</p><p>In this paper, we use \(\mathrm{Cost}(q, C)\) to denote the optimizer-estimated cost of executing a query q in a configuration C. Similarly, the total estimated cost of executing a set of queries \(\{q_1, \ldots, q_N\}\) in configuration C is \(\mathrm{Cost}(\{q_1, \ldots, q_N\}, C) := \sum_{i=1}^{N} \mathrm{Cost}(q_i, C)\); when the set of queries includes the entire workload, this will be referred to as the cost of configuration C. In the interest of simple notation, we will use the simplifying assumption that the overhead of making a single optimizer call is constant across queries<sup>2</sup>. In the remainder of the paper we use the term cost to describe optimizer-estimated query costs.</p><p>We use the phrase "to sample a query" to denote the process of obtaining the query's text from a workload table or file, and evaluating its cost using the query optimizer. The configuration being used to evaluate the query will be clear from context.</p><p>The probability of an event A will be denoted by Pr(A), and the conditional probability of event A given event B by Pr(A|B).</p><p>3.2 The Configuration Selection Problem</p><p>Problem Formulation: Given a set of physical database configurations \(C = \{C_1, \ldots, C_k\}\) and a workload \(WL = \{q_1, \ldots, q_N\}\), a target probability α and a sensitivity parameter δ, select a configuration \(C_i\) such that the probability of a correct selection, Pr(CS), is larger than α, where</p><p>\[ \Pr(CS) := \Pr\bigl(\mathrm{Cost}(WL, C_i) &lt; \mathrm{Cost}(WL, C_j) + \delta,\ \forall j \neq i\bigr), \tag{1} \]</p><p>while making a minimal number of optimizer calls.</p><p>Our approach to solving the configuration selection problem is to repeatedly sample queries from WL, evaluate Pr(CS), and terminate once the target probability α has been reached. 
Next, we describe how to estimate Pr(CS) (Section 4) and how to use these estimates and stratification of the workload to construct an algorithm for the configuration selection problem (Section 5).</p><p><sup>2</sup>We discuss how to incorporate differences in optimization costs between queries in Section 5.2.</p><p>4 Sampling Techniques</p><p>In this section we present two sampling schemes to derive Pr(CS) estimates: first we describe a straightforward approach called Independent Sampling (Section 4.1); then we describe Delta Sampling, which exploits properties of query costs to come up with a more accurate estimator (Section 4.2).</p><p>4.1 Independent Sampling</p><p>Independent Sampling is the base sampling scheme that we use to estimate the differences in costs of pairs of configurations. First, we define an unbiased estimator \(X_i\) of \(\mathrm{Cost}(WL, C_i)\). The estimator is obtained by sampling a set of queries \(SL_i\) from WL and is calculated as</p><p>\[ X_i = \frac{N}{|SL_i|} \sum_{q \in SL_i} \mathrm{Cost}(q, C_i), \]</p><p>i.e., \(X_i\) is the mean of \(|SL_i|\) random variables, scaled up by the total number of queries in the workload. The variance of the underlying cost distribution is</p><p>\[ \sigma_i^2 = \frac{\sum_{l=1}^{N} \left(\mathrm{Cost}(q_l, C_i) - \frac{\mathrm{Cost}(WL, C_i)}{N}\right)^2}{N}. \]</p><p>Now, when using a simple random sample of size n to compute \(X_i\), the variance of this estimate is \(\mathrm{Var}(X_i) := \frac{N^2}{n} \cdot \sigma_i^2 \cdot \frac{N}{N-1} \left(1 - \frac{n}{N}\right)\) [9]. In the following, we will use the shorthand \(S_i^2 := \sigma_i^2 \cdot N/(N-1)\).</p><p>To choose the better of two configurations \(C_l\) and \(C_j\) we now use the random variable \(X_l - X_j\), which is an unbiased estimator of the true difference in costs \(\mu_{l,j} := \mathrm{Cost}(WL, C_l) - \mathrm{Cost}(WL, C_j)\). Under large-sample assumptions, the standardized random variable</p><p>\[ \Delta_{l,j} := \frac{(X_l - X_j) - \mu_{l,j}}{N \sqrt{\frac{S_l^2}{|SL_l|}\left(1 - \frac{|SL_l|}{N}\right) + \frac{S_j^2}{|SL_j|}\left(1 - \frac{|SL_j|}{N}\right)}} \tag{2} \]</p><p>is normally distributed with mean 0 and variance 1 (according to the CLT). 
Based on this we can assess the probability of making a correct selection when choosing between \(C_l\) and \(C_j\), which we write as \(\Pr(CS_{l,j})\).</p><p>The decision procedure to choose between the two configurations is to pick the configuration \(C_l\) for which the estimated cost \(X_l\) is the smallest. In this case the probability of making an incorrect selection corresponds to the event \(X_l - X_j &lt; 0\) and \(\mu_{l,j} &gt; \delta\). Thus,</p><p>\(\Pr(CS_{l,j}) = 1 - \Pr(X_l - X_j &lt; 0 \mid \mu_{l,j} &gt; \delta)\) = Pr(X...</p></li></ul>
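<p>The Independent Sampling estimator of Section 4.1 can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names are hypothetical, the unknown \(S_i^2\) is replaced by the sample variance (exactly the substitution that the paper warns may be unreliable for skewed costs), and <code>pairwise_confidence</code> is an illustrative normal-approximation reading of the standardized variable in Equation (2), not the closed form for \(\Pr(CS_{l,j})\), which is cut off in the text above.</p>

```python
import math
from statistics import NormalDist, variance


def estimate_total_cost(sample_costs, workload_size):
    """X_i = N / |SL_i| * (sum of sampled query costs)."""
    return workload_size * sum(sample_costs) / len(sample_costs)


def estimator_variance(sample_costs, workload_size):
    """Var(X_i) = N^2 / n * S_i^2 * (1 - n / N).

    Assumption: S_i^2 is approximated by the unbiased sample variance,
    so at least two sampled costs are required.
    """
    n = len(sample_costs)
    s2 = variance(sample_costs)  # divides by n - 1
    return workload_size ** 2 / n * s2 * (1 - n / workload_size)


def pairwise_confidence(costs_l, costs_j, workload_size, delta=0.0):
    """Approximate confidence that C_l is not worse than C_j by more than delta.

    Illustrative only: treats (X_l - X_j) as normal around the true cost
    difference, with the standard deviation from the denominator of Eq. (2).
    """
    diff = (estimate_total_cost(costs_l, workload_size)
            - estimate_total_cost(costs_j, workload_size))
    std = math.sqrt(estimator_variance(costs_l, workload_size)
                    + estimator_variance(costs_j, workload_size))
    if std == 0.0:
        # No sampling uncertainty left (e.g. constant costs or n == N).
        return 1.0 if diff < delta else 0.0
    return NormalDist().cdf((delta - diff) / std)
```

<p>A caller would draw the two samples independently from WL, evaluate each sampled query under its configuration, and keep sampling until the returned value reaches the target probability α. The stopping rule itself is left out because its exact form is not given in this excerpt.</p>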

