[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - ISOMER: Consistent Histogram Construction Using Query Feedback

Download [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - ISOMER: Consistent Histogram Construction Using Query Feedback

Post on 27-Mar-2017




0 download

Embed Size (px)


<ul><li><p>ISOMER: Consistent Histogram Construction Using Query Feedback</p><p>U. Srivastava1 P. J. Haas2 V. Markl2 N. Megiddo2 M. Kutsch3 T. M. Tran4</p><p>1 Stanford University 2 IBM Almaden Research Center 3 IBM Germany 4 IBM Silicon Valley Lab</p><p>usriv@stanford.edu {phaas,marklv,megiddo}@almaden.ibm.com kutschm@de.ibm.com minhtran@us.ibm.com</p><p>Abstract</p><p>Database columns are often correlated, so that cardi-nality estimates computed by assuming independence oftenlead to a poor choice of query plan by the optimizer. Mul-tidimensional histograms can help solve this problem, butthe traditional approach of building such histograms usinga data scan often scales poorly and does not always yieldthe best histogram for a given workload. An attractive al-ternative is to gather feedback from the query execution en-gine about the observed cardinality of predicates and usethis feedback as the basis for a histogram. In this paper wedescribe ISOMER, a new feedback-based algorithm for col-lecting optimizer statistics by constructing and maintainingmultidimensional histograms. ISOMER uses the maximum-entropy principle to approximate the true data distributionby a histogram distribution that is as simple as possiblewhile being consistent with the observed predicate cardi-nalities. ISOMER adapts readily to changes in the under-lying data, automatically detecting and eliminating incon-sistent feedback information in an efficient manner. Thealgorithm controls the size of the histogram by retainingonly the most important feedback. Our experiments indi-cate that, unlike previous methods for feedback-driven his-togram maintenance, ISOMER imposes little overhead, isextremely scalable, and yields highly accurate cardinalityestimates while using only a modest amount of storage.</p><p>1. Introduction</p><p>Query optimization relies heavily on accurate cardinal-ity estimates for predicates involving multiple attributes. Inthe presence of correlated attributes, cardinality estimatesbased on independence assumptions can be grossly inaccu-rate [15], leading to a poor choice of query plan. Multidi-mensional histograms [14, 15] have been proposed for ac-curate cardinality estimation of predicates involving corre-lated attributes. However, the traditional method of buildingthese histograms through a data scan (henceforth called theproactive method) does not scale well to large tables. Sucha histogram needs to be periodically rebuilt in order to in-</p><p>corporate database updates, thus exacerbating the overheadof this method. Moreover, since proactive methods inspectonly data and not queries, they generally do not yield thebest histogram for a given query workload [4].</p><p>In this paper, we consider an alternative method [1, 4, 6,11] of building histograms through query feedback (hence-forth called the reactive method). For example, considera query having a predicate make = Honda, and sup-pose that the execution engine finds at runtime that 80 tu-ples from the Car table satisfy this predicate. Such a pieceof information about the observed cardinality of a predicateis called a query-feedback record (QFR). As the DBMS ex-ecutes queries, QFRs can be collected with relatively littleoverhead [17] and used to build and progressively refine ahistogram over time. For example, the above QFR may beused to refine a histogram on make by creating a bucket forHonda, and setting its count to 80.</p><p>The reactive method is attractive since it scales ex-tremely well with the table size, does not contend fordatabase access, and requires no periodic rebuilding of thehistogram (updates are automatically incorporated by newQFRs). The histogram buckets may be chosen to best suitthe current workload, thereby efficiently focusing resourceson precisely those queries that matter most to the user. Ofcourse, the reactive method may lead to inaccuracies forparts of the data that have never been queried. However,such errors can be ameliorated, for example, by startingwith a coarse histogram built proactively [4].</p><p>Previous proposals for histogram construction usingquery feedback have lacked either accuracy or efficiency.Some proposals, e.g., STGrid [1], use heuristics to refinethe histogram based on a new QFR, thereby leading to inac-curacies in the constructed histogram. Other proposals, e.g.,STHoles [4], require extremely detailed feedback from thequery execution engine that can be very expensive to gatherat runtime. In this paper, we describe ISOMER (ImprovedStatistics and Optimization by Maximum-Entropy Refine-ment), a new algorithm for feedback-driven histogram con-struction. In contrast to previous approaches, ISOMERis both accurate as well as efficient. ISOMER uses the</p><p>1</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE </p></li><li><p>information-theoretic principle of maximum entropy [10] toapproximate the true data distribution by a histogram distri-bution that is as simple as possible while being consistentwith the observed cardinalities as specified in the QFRs. Inthis manner, ISOMER avoids incorporating extraneousand potentially erroneousassumptions into the histogram.</p><p>The reactive approach to histogram maintenance entailsa number of interesting challenges:</p><p>1. Enforcing consistency: To ensure histogram accu-racy, the histogram distribution must always be con-sistent with all currently valid QFRs and not incorpo-rate any ad hoc assumptions; see Section 3.2. Previ-ous proposals for feedback-driven histogram construc-tion [1, 11] have lacked this crucial consistency prop-erty, even when the data is static.</p><p>2. Dealing with data changes: In the presence of up-dates, deletes, and inserts, some of the QFRs collectedin the past may no longer be valid. Such old, invalidQFRs must be efficiently identified and discarded, andtheir effect on the histogram must be undone.</p><p>3. Meeting a limited space budget: Database systemsusually limit the size of a histogram to a few diskpages in order to ensure efficiency when the histogramis read at the time of query optimization. Thus, weassume that there is a limited space budget for his-togram storage. In general, adding more QFRs to thehistogram while maintaining consistency leads to anincrease in the histogram size. To keep the histogramsize within the space budget, the relatively importantQFRs (those that refine parts of the histogram that arenot refined by other QFRs) must be identified and re-tained, and the less important QFRs discarded.</p><p>ISOMER addresses each of the above challenges usingnovel, efficient techniques:</p><p>1. ISOMER uses the maximum-entropy principle (Sec-tion 3.3) to approximate the true data distribution bythe simplest distribution that is consistent with all ofthe currently valid QFRs. This approach amounts toimposing uniformity assumptions (as made by tradi-tional optimizers) when, and only when, no other sta-tistical information is available. ISOMER efficientlyupdates the maximum-entropy approximation in an in-cremental manner as new QFRs arrive.</p><p>2. ISOMER employs linear programming (LP) to quicklydetect and discard old, invalid QFRs.</p><p>3. An elegant feature of ISOMER is that the maximum-entropy solution yields, for free, an importance mea-sure for each QFR. Thus, to meet a limited space bud-get, ISOMER simply needs to discard QFRs in in-creasing order of this importance measure.</p><p>Our experiments indicate that for reactive histogram main-tenance, ISOMER is highly efficient, imposes very littleoverhead during query execution, and can provide muchmore accurate cardinality estimates than previous tech-niques while using only a modest amount of storage.</p><p>The rest of the paper is organized as follows. After sur-veying related work in the remainder of this section, we layout the basic architecture of ISOMER in Section 2 and in-troduce its various components. In Section 3, we describehow ISOMER uses the maximum-entropy principle to buildhistograms consistent with query feedback. Section 4 de-scribes how ISOMER deals with changing data, and Sec-tion 5 details how ISOMER keeps the histogram within alimited space budget by discarding relatively unimportantQFRs. We describe our experimental results in Section 6,and conclude in Section 7.</p><p>1.1. Related Work</p><p>Existing work on multidimensional statistics can bebroadly classified as addressing either the problem of de-ciding which statistics to build or that of actually buildingthem. This paper addresses only the latter problem. Manytypes of statistics have been proposed, e.g., histograms [15]and wavelet-based synopses [13]; we restrict attention tohistograms.</p><p>For building multidimensional histograms, proactive ap-proaches that involve a data scan have been proposed, e.g.,MHist [15], GenHist [9], and others [7, 14, 19]. As men-tioned before, data scans may not effectively focus systemresources on the users workload and do not scale well tolarge tables. In principle, histograms can be constructedfaster using a page-level sample of the data [5], but largesample sizesand correspondingly high sampling costscan be required to achieve sufficient accuracy when datavalues are clustered on pages and/or highly skewed.</p><p>The idea of using query feedback to collect the statis-tics needed for estimating cardinality was first proposed in[6]. The specific approach relied on fitting a combinationof model functions to the data distribution; the choice offunctions is ad hoc and can lead to poor estimates when thedata distribution is irregular. Query feedback is also usedin [17], but only to compute adjustment factors to cardinal-ity estimates for specific predicates, and not to build his-tograms. STGrid [1] and SASH [11] both use query feed-back to build histograms, but they often have low accuracybecause their heuristic methods for adding new QFRs to thehistogram do not maintain consistency. ISOMERs use ofthe well founded [16] maximum-entropy principle avoidsthis problem.</p><p>STHoles [4] is another approach that uses query feed-back to build histograms. The histogram structure ofSTHoles is superior to other bucketing schemes such as</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE </p></li><li><p>MHist [15], and for this reason is used by ISOMER. Un-fortunately, the original STHoles maintenance algorithm re-quires, for each query and each histogram bucket, the com-putation of the number of rows in the intersection of thequery and bucket regions. These detailed row counts, whichare used to decide when and where to split and merge buck-ets, are usually not obtainable from the original query pred-icates alone. The query engine must therefore insert artifi-cial predicates that specify the (possibly recursive) bucketboundaries. As the number of histograms and the numberof buckets per histogram grows, the overhead of evaluatingthis artificial feedback becomes so high as to make theSTHoles maintenance approach impractical. (ISOMER, incontrast, needs only the actual feedback that naturally oc-curs during query executionnamely, the number of rowsprocessed at each step during the query planwhich canbe monitored with low overhead [17].) Finally, STHoles,unlike ISOMER, does not provide principled methods foraddressing issues of inconsistent feedback and limits on theavailable memory for storing the histogram.</p><p>The principle of maximum entropy has been used in [12],but for the significantly different problem of consistentlyestimating the selectivity of conjuncts of predicates suchas sel(p1 p2 p3 p4), given partial selectivities such assel(p1), sel(p2 p3), sel(p2 p3 p4), and so forth. I.e.,the methods in [12] permit the exploitation of existing mul-tidimensional statistics (not necessarily from histograms),whereas the current paper is concerned with the collectionof a specific type of statistic.</p><p>2. ISOMER architecture</p><p>Figure 1 shows the architecture of ISOMER and whereit would fit in a modern commercial DBMS. The left por-tion of the figure shows an ordinary DBMS. Here, a query isfirst fed to a cost-based query optimizer, which produces aplan for the query based on the available database statistics.Presently, all commercial systems apply the proactive ap-proach for maintaining statistics that requires database ac-cess (as shown by the dotted arrow). The chosen query planis then executed by the runtime engine. We prototyped ISO-MER on top of DB2 UDB, which can measure and storeactual predicate cardinalities or QFRs at run time.</p><p>Due to efficiency concerns, a QFR is not immediatelyused to refine the histogram as soon as it is obtained. In-stead, the QFRs are collected in a query feedback store andused in batches to refine the histogram. During periods oflight load, or during a maintenance window, the databasesystems invokes the ISOMER component. On invocation,ISOMER uses the QFRs accumulated in the query feedbackstore to refine the histogram. The right portion of the figureshows the various components of ISOMER.</p><p>ISOMER begins by reading the currently maintained his-</p><p>Store(invoked at times of light load)</p><p>Store</p><p>Offline</p><p>Statistics</p><p>Environment</p><p>ISOMER</p><p>Present commercial DBMS</p><p>(Proactive)</p><p>Database</p><p>Query Feedback</p><p>Runtime</p><p>OptimizerQuery</p><p>Query</p><p>Compute finalhistogram</p><p>Eliminateunimportant</p><p>feedback</p><p>feedbackinconsistent</p><p>DetectAdd newfeedback</p><p>Figure 1. ISOMER Architecture</p><p>togram from the database statistics. It then accesses the newQFRs and adds them to the current histogram. This processmainly involves forming buckets corresponding to the newQFRs, and is explained in Section 3.4.1.</p><p>As discussed in Section 3.2 below, a histogram must beconsistent with respect to both new and previous QFRs.However, because of updates to the data these QFRs maybe contradictory, in which case there does not exist a his-togram consistent with all QFRs. Thus ISOMERs first taskis to detect and discard old, inconsistent QFRs. For this pur-pose, ISOMER keeps a list of QFRs previously added to thehistogram in an offline store as depicted in Figure 1. ISO-MER reads the previous QFRs from the offline store anduses linear programming to find and eliminate inconsistentQFRs. This process is described in Section 4. Note that theoffline store does not have to be read at query-optimizationtime and hence does not encroach on the space budget al-lowed for the histogram. In any case, the size of the offlinestore is not a concern since it cannot grow bigger than thesize of the maintained histogram. Indeed, if there are moreQFRs than buckets, then some of these QFRs will be eitherinconsistent or redundant and hence will be eliminated; seeSection 3.4.2.</p><p>Once ISOMER obtains a consistent set of QFRs,the algorithm computes the histogram according to themaximum-entropy principle. We describe this computationin Sections 3.4.2 and 3.4.3. If the histogram is too large tofit within its allotted space, ISOMER selectively discardsthe relatively unimportant QFRs in order to reduce thehistogram size. Intuitively, a particular QFR is unimportantif the information provided by that QFR is already providedby other QFRs, i.e., if the QFR refines a...</p></li></ul>


View more >