
ISOMER: Consistent Histogram Construction Using Query Feedback

U. Srivastava (1), P. J. Haas (2), V. Markl (2), N. Megiddo (2), M. Kutsch (3), T. M. Tran (4)

(1) Stanford University  (2) IBM Almaden Research Center  (3) IBM Germany  (4) IBM Silicon Valley Lab

[email protected]  {phaas,marklv,megiddo}@almaden.ibm.com  [email protected]  [email protected]

Abstract

Database columns are often correlated, so that cardinality estimates computed by assuming independence often lead to a poor choice of query plan by the optimizer. Multidimensional histograms can help solve this problem, but the traditional approach of building such histograms using a data scan often scales poorly and does not always yield the best histogram for a given workload. An attractive alternative is to gather feedback from the query execution engine about the observed cardinality of predicates and use this feedback as the basis for a histogram. In this paper we describe ISOMER, a new feedback-based algorithm for collecting optimizer statistics by constructing and maintaining multidimensional histograms. ISOMER uses the maximum-entropy principle to approximate the true data distribution by a histogram distribution that is as "simple" as possible while being consistent with the observed predicate cardinalities. ISOMER adapts readily to changes in the underlying data, automatically detecting and eliminating inconsistent feedback information in an efficient manner. The algorithm controls the size of the histogram by retaining only the most "important" feedback. Our experiments indicate that, unlike previous methods for feedback-driven histogram maintenance, ISOMER imposes little overhead, is extremely scalable, and yields highly accurate cardinality estimates while using only a modest amount of storage.

1. Introduction

Query optimization relies heavily on accurate cardinality estimates for predicates involving multiple attributes. In the presence of correlated attributes, cardinality estimates based on independence assumptions can be grossly inaccurate [15], leading to a poor choice of query plan. Multidimensional histograms [14, 15] have been proposed for accurate cardinality estimation of predicates involving correlated attributes. However, the traditional method of building these histograms through a data scan (henceforth called the proactive method) does not scale well to large tables. Such a histogram needs to be periodically rebuilt in order to incorporate database updates, thus exacerbating the overhead of this method. Moreover, since proactive methods inspect only data and not queries, they generally do not yield the best histogram for a given query workload [4].

In this paper, we consider an alternative method [1, 4, 6, 11] of building histograms through query feedback (henceforth called the reactive method). For example, consider a query having a predicate make = 'Honda', and suppose that the execution engine finds at runtime that 80 tuples from the Car table satisfy this predicate. Such a piece of information about the observed cardinality of a predicate is called a query-feedback record (QFR). As the DBMS executes queries, QFRs can be collected with relatively little overhead [17] and used to build and progressively refine a histogram over time. For example, the above QFR may be used to refine a histogram on make by creating a bucket for 'Honda' and setting its count to 80.

The reactive method is attractive since it scales extremely well with the table size, does not contend for database access, and requires no periodic rebuilding of the histogram (updates are automatically incorporated by new QFRs). The histogram buckets may be chosen to best suit the current workload, thereby efficiently focusing resources on precisely those queries that matter most to the user. Of course, the reactive method may lead to inaccuracies for parts of the data that have never been queried. However, such errors can be ameliorated, for example, by starting with a coarse histogram built proactively [4].

Previous proposals for histogram construction using query feedback have lacked either accuracy or efficiency. Some proposals, e.g., STGrid [1], use heuristics to refine the histogram based on a new QFR, thereby leading to inaccuracies in the constructed histogram. Other proposals, e.g., STHoles [4], require extremely detailed feedback from the query execution engine that can be very expensive to gather at runtime. In this paper, we describe ISOMER (Improved Statistics and Optimization by Maximum-Entropy Refinement), a new algorithm for feedback-driven histogram construction. In contrast to previous approaches, ISOMER is both accurate and efficient. ISOMER uses the information-theoretic principle of maximum entropy [10] to approximate the true data distribution by a histogram distribution that is as "simple" as possible while being consistent with the observed cardinalities as specified in the QFRs. In this manner, ISOMER avoids incorporating extraneous (and potentially erroneous) assumptions into the histogram.

The reactive approach to histogram maintenance entails a number of interesting challenges:

1. Enforcing consistency: To ensure histogram accuracy, the histogram distribution must always be consistent with all currently valid QFRs and not incorporate any ad hoc assumptions; see Section 3.2. Previous proposals for feedback-driven histogram construction [1, 11] have lacked this crucial consistency property, even when the data is static.

2. Dealing with data changes: In the presence of updates, deletes, and inserts, some of the QFRs collected in the past may no longer be valid. Such old, invalid QFRs must be efficiently identified and discarded, and their effect on the histogram must be undone.

3. Meeting a limited space budget: Database systems usually limit the size of a histogram to a few disk pages in order to ensure efficiency when the histogram is read at the time of query optimization. Thus, we assume that there is a limited space budget for histogram storage. In general, adding more QFRs to the histogram while maintaining consistency leads to an increase in the histogram size. To keep the histogram size within the space budget, the relatively "important" QFRs (those that refine parts of the histogram that are not refined by other QFRs) must be identified and retained, and the less important QFRs discarded.

ISOMER addresses each of the above challenges using novel, efficient techniques:

1. ISOMER uses the maximum-entropy principle (Section 3.3) to approximate the true data distribution by the "simplest" distribution that is consistent with all of the currently valid QFRs. This approach amounts to imposing uniformity assumptions (as made by traditional optimizers) when, and only when, no other statistical information is available. ISOMER efficiently updates the maximum-entropy approximation in an incremental manner as new QFRs arrive.

2. ISOMER employs linear programming (LP) to quickly detect and discard old, invalid QFRs.

3. An elegant feature of ISOMER is that the maximum-entropy solution yields, for free, an "importance" measure for each QFR. Thus, to meet a limited space budget, ISOMER simply needs to discard QFRs in increasing order of this importance measure.

Our experiments indicate that for reactive histogram maintenance, ISOMER is highly efficient, imposes very little overhead during query execution, and can provide much more accurate cardinality estimates than previous techniques while using only a modest amount of storage.

The rest of the paper is organized as follows. After surveying related work in the remainder of this section, we lay out the basic architecture of ISOMER in Section 2 and introduce its various components. In Section 3, we describe how ISOMER uses the maximum-entropy principle to build histograms consistent with query feedback. Section 4 describes how ISOMER deals with changing data, and Section 5 details how ISOMER keeps the histogram within a limited space budget by discarding relatively unimportant QFRs. We describe our experimental results in Section 6, and conclude in Section 7.

1.1. Related Work

Existing work on multidimensional statistics can be broadly classified as addressing either the problem of deciding which statistics to build or that of actually building them. This paper addresses only the latter problem. Many types of statistics have been proposed, e.g., histograms [15] and wavelet-based synopses [13]; we restrict attention to histograms.

For building multidimensional histograms, proactive approaches that involve a data scan have been proposed, e.g., MHist [15], GenHist [9], and others [7, 14, 19]. As mentioned before, data scans may not effectively focus system resources on the user's workload and do not scale well to large tables. In principle, histograms can be constructed faster using a page-level sample of the data [5], but large sample sizes (and correspondingly high sampling costs) can be required to achieve sufficient accuracy when data values are clustered on pages and/or highly skewed.

The idea of using query feedback to collect the statistics needed for estimating cardinality was first proposed in [6]. The specific approach relied on fitting a combination of model functions to the data distribution; the choice of functions is ad hoc and can lead to poor estimates when the data distribution is irregular. Query feedback is also used in [17], but only to compute adjustment factors to cardinality estimates for specific predicates, and not to build histograms. STGrid [1] and SASH [11] both use query feedback to build histograms, but they often have low accuracy because their heuristic methods for adding new QFRs to the histogram do not maintain consistency. ISOMER's use of the well founded [16] maximum-entropy principle avoids this problem.

STHoles [4] is another approach that uses query feedback to build histograms. The histogram structure of STHoles is superior to other bucketing schemes such as MHist [15], and for this reason is used by ISOMER. Unfortunately, the original STHoles maintenance algorithm requires, for each query and each histogram bucket, the computation of the number of rows in the intersection of the query and bucket regions. These detailed row counts, which are used to decide when and where to split and merge buckets, are usually not obtainable from the original query predicates alone. The query engine must therefore insert artificial predicates that specify the (possibly recursive) bucket boundaries. As the number of histograms and the number of buckets per histogram grow, the overhead of evaluating this "artificial feedback" becomes so high as to make the STHoles maintenance approach impractical. (ISOMER, in contrast, needs only the actual feedback that naturally occurs during query execution, namely the number of rows processed at each step of the query plan, which can be monitored with low overhead [17].) Finally, STHoles, unlike ISOMER, does not provide principled methods for addressing issues of inconsistent feedback and limits on the available memory for storing the histogram.

The principle of maximum entropy has been used in [12], but for the significantly different problem of consistently estimating the selectivity of conjuncts of predicates, such as sel(p1 ∧ p2 ∧ p3 ∧ p4), given partial selectivities such as sel(p1), sel(p2 ∧ p3), sel(p2 ∧ p3 ∧ p4), and so forth. That is, the methods in [12] permit the exploitation of existing multidimensional statistics (not necessarily from histograms), whereas the current paper is concerned with the collection of a specific type of statistic.

2. ISOMER Architecture

Figure 1 shows the architecture of ISOMER and where it would fit in a modern commercial DBMS. The left portion of the figure shows an ordinary DBMS. Here, a query is first fed to a cost-based query optimizer, which produces a plan for the query based on the available database statistics. Presently, all commercial systems apply the proactive approach for maintaining statistics, which requires database access (as shown by the dotted arrow). The chosen query plan is then executed by the runtime engine. We prototyped ISOMER on top of DB2 UDB, which can measure and store actual predicate cardinalities, or QFRs, at run time.

Due to efficiency concerns, a QFR is not used to refine the histogram immediately upon being obtained. Instead, the QFRs are collected in a query feedback store and used in batches to refine the histogram. During periods of light load, or during a maintenance window, the database system invokes the ISOMER component. On invocation, ISOMER uses the QFRs accumulated in the query feedback store to refine the histogram. The right portion of the figure shows the various components of ISOMER.

Figure 1. ISOMER Architecture. (The diagram shows a present-day commercial DBMS, with its query optimizer, runtime environment, database, and proactively gathered statistics, feeding a query feedback store; the ISOMER component, invoked at times of light load, adds new feedback, detects inconsistent feedback, eliminates unimportant feedback, and computes the final histogram, using an offline store.)

ISOMER begins by reading the currently maintained histogram from the database statistics. It then accesses the new QFRs and adds them to the current histogram. This process mainly involves forming buckets corresponding to the new QFRs, and is explained in Section 3.4.1.

As discussed in Section 3.2 below, a histogram must be consistent with respect to both new and previous QFRs. However, because of updates to the data, these QFRs may be contradictory, in which case there does not exist a histogram consistent with all QFRs. Thus ISOMER's first task is to detect and discard old, inconsistent QFRs. For this purpose, ISOMER keeps a list of QFRs previously added to the histogram in an offline store, as depicted in Figure 1. ISOMER reads the previous QFRs from the offline store and uses linear programming to find and eliminate inconsistent QFRs. This process is described in Section 4. Note that the offline store does not have to be read at query-optimization time and hence does not encroach on the space budget allowed for the histogram. In any case, the size of the offline store is not a concern, since it cannot grow bigger than the size of the maintained histogram. Indeed, if there are more QFRs than buckets, then some of these QFRs will be either inconsistent or redundant and hence will be eliminated; see Section 3.4.2.

Once ISOMER obtains a consistent set of QFRs, the algorithm computes the histogram according to the maximum-entropy principle. We describe this computation in Sections 3.4.2 and 3.4.3. If the histogram is too large to fit within its allotted space, ISOMER selectively discards the relatively "unimportant" QFRs in order to reduce the histogram size. Intuitively, a particular QFR is unimportant if the information provided by that QFR is already provided by other QFRs, i.e., if the QFR refines a portion of the histogram that is already sufficiently refined by other QFRs. The process for detecting and discarding unimportant QFRs is described in Section 5.

Any inconsistent or unimportant QFR that ISOMER discards is also removed from the list of QFRs that are currently incorporated into the histogram, and the revised list is written back to the offline store. The final histogram is computed according to the maximum-entropy principle and written back into the database statistics.


3. Incorporating Feedback

After defining the notions of histogram and query feedback that we use throughout the paper (Section 3.1), we introduce a running example (Section 3.2) that illustrates the use of query feedback to build and refine a histogram, the importance of maintaining consistency, and the difficulties that arise therein. We then introduce the maximum-entropy principle to address these issues (Section 3.3), and explain its use in ISOMER (Section 3.4).

3.1. Histograms and Feedback

Given a table T comprising N tuples, we wish to build a d-dimensional histogram over attributes A1, ..., Ad of table T. For each numerical attribute Ai, denote by li and ui the minimum and maximum values of Ai in the table. We assume that these values are available from one-dimensional database statistics on Ai. A categorical attribute Ai having Di distinct values can be treated as numerical by mapping each distinct value to a unique integer in [1, Di], so that li = 1 and ui = Di. The space in which the tuple values lie is S = [l1, u1] × [l2, u2] × ··· × [ld, ud].

A multidimensional histogram is a lossy compressed representation of the true distribution of tuples in S that is obtained by partitioning S into k ≥ 1 mutually disjoint regions called buckets, and recording the number of tuples in T that fall in each bucket. ISOMER actually maintains an approximate histogram that records an estimate of the number of tuples that fall in each bucket. Denote by b1, b2, ..., bk the k histogram buckets, by C(bi) the region of S that is covered by bi, and by n(bi) the number of tuples estimated to lie in bi. The tuples in each bucket bi are assumed to be uniformly distributed throughout C(bi).

ISOMER maintains a histogram based on query feedback. Specifically, at each time point, the system maintains a list of QFRs of the form (q1, N(q1)), (q2, N(q2)), ..., (qm, N(qm)) for some m ≥ 1, where each q is a predicate of the form (see Footnote 1)

(x1 ≤ C1 ≤ y1) ∧ ··· ∧ (xd′ ≤ Cd′ ≤ yd′)

and N(q) is the number of tuples that satisfy q. Here d′ ≤ d and C1, ..., Cd′ are distinct attributes from among A1, ..., Ad. We assume that if Cj is categorical, then xj = yj, so that the conjunct xj ≤ Cj ≤ yj is actually an equality predicate. In the following sections, we denote by R(q) the subset of the region S for which q is true.
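To make these definitions concrete, here is a minimal sketch (ours, not code from the ISOMER prototype; all class and field names are illustrative) of an in-memory representation of STHoles-style buckets and QFRs, reused by the later sketches in this paper:

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# An axis-aligned box: one (lo, hi) interval per dimension.
Box = Tuple[Tuple[float, float], ...]

def box_volume(box: Box) -> float:
    # Euclidean volume of a box; for discrete data one would count integer points instead.
    v = 1.0
    for lo, hi in box:
        v *= hi - lo
    return v

@dataclass
class Bucket:
    """An STHoles-style bucket: a bounding box minus its children's boxes (the holes)."""
    box: Box
    count: float                                   # n(b): estimated tuples in C(b)
    children: List["Bucket"] = field(default_factory=list)
    parent: Optional["Bucket"] = None

    def volume(self) -> float:
        """V(b) = vol(C(b)) = vol(box(b)) minus the volume of the child boxes."""
        return box_volume(self.box) - sum(box_volume(c.box) for c in self.children)

@dataclass
class QFR:
    """A query-feedback record: a conjunctive range predicate and its observed cardinality."""
    ranges: Box          # (x_j, y_j) per referenced attribute; x_j = y_j for categorical
    observed: float      # N(q)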

3.2. A Running Example

Consider a Car relation with attributes make and color, and suppose that a user executes the query

SELECT * FROM Car WHERE make = 'Honda' AND color = 'White'

Footnote 1: The maximum-entropy principle can be applied to general predicates, but we have initially focused on conjunctive predicates in ISOMER.

Figure 2. Gathering Query Feedback (QFRs). (The diagram shows a query plan that scans Car (100 tuples), applies σ make='Honda' (80 tuples), and then applies σ color='White' (25 tuples), yielding the QFRs N(make = 'Honda') = 80 and N(make = 'Honda', color = 'White') = 25.)

Figure 2 shows a possible plan for executing this query; the actual number of tuples at each stage is shown in parentheses. The figure also shows the various predicates q for which N(q) can be collected during query execution; the cardinality information for these predicates comprises the QFRs. Gathering of QFRs is already supported by several commercial database systems and has been shown [17] to have very low runtime overhead (typically less than 5%).

Suppose that we wish to build a two-dimensional histogram on the attributes make and color by using query feedback. Also suppose for simplicity that there are only two distinct makes (say 'Honda' and 'BMW') and only two distinct colors (say 'Black' and 'White') in the Car relation. Then any multidimensional histogram on these attributes has at most four buckets (Figure 3). We assume that our histogram stores all four of these buckets.

The first QFR that we add to the histogram is the total number of tuples in the table. This number can be obtained from system catalog statistics. Suppose there are 100 tuples in the Car table. This QFR can be expressed as:

n(b1)+n(b2)+n(b3)+n(b4) = 100 (1)

At this point, there are various possible assignments of values to the n(bi)'s that will make the histogram consistent with this single QFR. In the absence of any additional knowledge, we assume, as do traditional optimizers, that values are distributed uniformly. Hence, the histogram obtained after adding this QFR is one in which each n(bi) equals 25, as shown in Figure 3(b).

We now add the QFR N(make = 'Honda') = 80 to the histogram. This QFR can be expressed as:

n(b2)+n(b4) = 80 (2)

To make the histogram consistent with (2) while preserving uniformity between n(b2) and n(b4), we set n(b2) = n(b4) = 40. The resulting histogram is shown in Figure 3(c). At this point, the STGrid [1] approach would consider the process of adding this QFR to be finished, with Figure 3(c) as the final histogram.

Notice, however, that the histogram in Figure 3(c) is no longer consistent with (1). The expressions in (1) and (2) together imply that n(b1) + n(b3) = 20. To enforce consistency with this equation as well as uniformity between n(b1) and n(b3), we set n(b1) = n(b3) = 10. The final consistent histogram is shown in Figure 3(d).

Figure 3. Running Example. (Panels (a)–(e) show a 2×2 histogram over make ∈ {BMW, Honda} and color ∈ {Black, White} with bucket counts (n(b1), n(b2), n(b3), n(b4)): (a) the four buckets; (b) each count 25 after adding #Cars = 100; (c) counts (25, 40, 25, 40) after adding #Hondas = 80; (d) counts (10, 40, 10, 40) after maintaining consistency; (e) counts (5, 65, 15, 15) after adding #White cars = 30 in an ad hoc manner.)

Observe that even though we added information only about Hondas, the histogram now gives a more accurate estimate of the frequency of BMWs. This improvement is a direct result of our final histogram adjustment, in which we enforced consistency with both QFRs.

In summary, it is reasonable to impose the following requirements when adding a new QFR to the histogram:

1. Consistency: After adding a new QFR, the histogram should be consistent with all QFRs added so far.

2. Uniformity: If there are multiple histograms consistent with all of the QFRs, then the choice of the final histogram should not be arbitrary, but must be based on the traditional assumption of uniformity.

So far in this example, it has been easy and intuitive to apply the uniformity assumption. However, it is not always clear how to impose uniformity. To illustrate, suppose that the next QFR we add to the histogram is N(color = 'White') = 30. This QFR can be written as:

n(b3)+n(b4) = 30. (3)

If we employ the naïve solution of making the histogram consistent with (3) while imposing uniformity on n(b3) and n(b4), we get n(b3) = n(b4) = 15. Further enforcing consistency with (1) and (2), we get the final histogram shown in Figure 3(e). This solution clearly does not enforce uniformity (see Footnote 2): although white Hondas and white BMWs are in equal proportion, there are far more black Hondas than black BMWs. This non-uniformity among black cars has not been indicated by any added QFR, and is only due to our ad hoc method of enforcing consistency. To enforce consistency while avoiding unsatisfactory results as in Figure 3(e), we turn to the information-theoretic principle of maximum entropy, described next.

Footnote 2: STHoles [4] does not face this problem of enforcing uniformity because, for a predicate q, it explicitly gathers a count of the intersection of R(q) with every histogram bucket. Thus, STHoles would gather n(b3) and n(b4) individually, making it a high-overhead approach in general.

3.3. The Maximum-Entropy Principle

Consider a discrete probability distribution (p1, p2, ..., pn) on n distinct outcomes, i.e., each pi is nonnegative and ∑i=1..n pi = 1. In many applications, only partial information about such a probability distribution is available (e.g., p1 + p2 = 0.5). Define a "candidate" distribution as one that is consistent with all available information about the distribution. In general, there may be multiple candidate distributions. The maximum-entropy principle (see, e.g., [16]) provides a well grounded criterion for selecting a unique distribution from among the candidates. Specifically, the principle prescribes selection of the candidate P = (p1, p2, ..., pn) that has the maximum entropy value H(P), where H(P) = −∑i=1..n pi ln(pi). The maximum-entropy principle can be justified informally as follows. From information theory, we know that entropy measures the uncertainty or uninformativeness in a distribution. For example, the value of the entropy ranges from 0, when a specified outcome occurs with certainty, to a maximum of ln(n) when no information is available and all outcomes are equally likely (p1 = ··· = pn = 1/n). Thus the maximum-entropy principle leads to a choice of the simplest, i.e., most uninformative, distribution possible that is consistent with the available information. To choose a distribution with lower entropy would amount to assuming information that we do not have; to choose one with higher entropy would ignore information that we do have. The maximum-entropy distribution is therefore the only reasonable choice. A more formal justification of the principle can be found in [16]. For a continuous probability distribution with probability density function (pdf) P(u), the entropy is defined as H(P) = −∫S P(u) ln(P(u)) du, and the foregoing discussion extends to this setting in an obvious way.

For ISOMER, the maximum-entropy principle can also be justified as a means of ensuring uniformity. As mentioned above, entropy is maximized for a uniform distribution. Thus, choosing the distribution according to the maximum-entropy principle facilitates our goal of maintaining consistency and uniformity at the same time.
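As a small numeric illustration (our own, not from the paper): suppose all we know about a three-outcome distribution is that p1 + p2 = 0.5. Among the candidate distributions, entropy is maximized by the one that spreads probability as evenly as the constraint allows:

import math

def entropy(p):
    # H(P) = -sum of p_i ln(p_i), with the convention 0 ln 0 = 0.
    return -sum(x * math.log(x) for x in p if x > 0)

# Candidate distributions consistent with the constraint p1 + p2 = 0.5:
for p in [(0.5, 0.0, 0.5), (0.4, 0.1, 0.5), (0.25, 0.25, 0.5)]:
    print(p, round(entropy(p), 4))
# Prints entropies 0.6931, 0.9434, 1.0397: the maximum-entropy candidate
# is (0.25, 0.25, 0.5), which assumes nothing beyond the stated constraint.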

3.4. Maximum Entropy in ISOMER

We first describe the histogram structure used in ISOMER (Section 3.4.1). We then show in Section 3.4.2 how application of the maximum-entropy principle leads to an optimization problem. Section 3.4.3 describes how this optimization problem is solved in ISOMER.

3.4.1 Histogram Structure in ISOMER

ISOMER uses the STHoles data structure [4] to represent and store multidimensional histograms. In the STHoles histogram, each bucket bi has a hyperrectangular bounding box denoted by box(bi) (⊆ S), i.e., bucket bi is bounded between two constant values in each dimension. Bucket bi, however, does not cover the entire region box(bi). There may be some "holes" inside box(bi) that are not covered by bi. These regions are themselves histogram buckets, and are referred to as children of bi. The bounding boxes of these children are mutually disjoint hyperrectangles and completely enclosed within box(bi). The region covered by bi is formally given by:

C(bi) = box(bi) − ⋃ {box(bj) : bj ∈ children(bi)}

Figure 4. Drilling Holes for a New QFR. (The diagram shows a new feedback region R(q) being carved out of an existing bucket, splitting partially overlapping child buckets.)

Intuitively, in the absence of holes, bi would represent a region of uniform tuple density. However, the STHoles histogram identifies regions within bi that have a different tuple density, and represents them as separate histogram buckets.

For a multidimensional predicate q that selects all tuples lying in a specified region R(q) ⊆ S, an STHoles histogram comprising buckets b1, b2, ..., bk estimates the number of tuples that satisfy this predicate as

N̂(q) = ∑i=1..k n(bi) · vol(R(q) ∩ C(bi)) / vol(C(bi)).   (4)

Here vol(R) denotes the usual Euclidean volume of the region R when the data is real-valued; for discrete (i.e., integer or integer-coded categorical) data, vol(R) denotes the number of integer points that lie in R.
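Equation (4) translates directly into code. The following sketch builds on the Bucket representation above (real-valued data); vol(R(q) ∩ C(b)) is computed exactly as the overlap of R(q) with box(b) minus its overlaps with the child holes, which are disjoint and contained in box(b):

def box_intersect(a: Box, b: Box):
    # Intersection of two axis-aligned boxes, or None if it is empty.
    out = []
    for (alo, ahi), (blo, bhi) in zip(a, b):
        lo, hi = max(alo, blo), min(ahi, bhi)
        if lo >= hi:
            return None
        out.append((lo, hi))
    return tuple(out)

def overlap_with_region(bucket: Bucket, qbox: Box) -> float:
    # vol(R(q) ∩ C(b)): overlap with box(b) minus overlaps with the holes.
    inter = box_intersect(bucket.box, qbox)
    if inter is None:
        return 0.0
    v = box_volume(inter)
    for child in bucket.children:
        child_inter = box_intersect(child.box, qbox)
        if child_inter is not None:
            v -= box_volume(child_inter)
    return v

def estimate(buckets: List[Bucket], qbox: Box) -> float:
    # N̂(q) per equation (4), summing over all buckets of the histogram.
    total = 0.0
    for b in buckets:
        vb = b.volume()
        if vb > 0:
            total += b.count * overlap_with_region(b, qbox) / vb
    return total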

ISOMER initializes the STHoles histogram to contain a single bucket b1 such that box(b1) = C(b1) = S and n(b1) = N. At this point, the histogram embodies the simplest possible uniformity assumption. As more QFRs are added over time, ISOMER learns more about the distribution of tuples in S and incorporates this information into the histogram by "drilling" holes in b1.

ISOMER's technique for drilling holes is a simpler version of the method given in [4]. Suppose that ISOMER obtains a QFR about a specified multidimensional predicate q. To make the histogram consistent with this QFR, ISOMER must first ensure that the histogram contains a set of buckets that exactly cover R(q), so that the sum of the tuple counts in these buckets can then be equated to N(q) as in Section 3.2. If such a set of buckets already exists, no holes need to be drilled. Otherwise, the process of drilling holes for q proceeds as shown in Figure 4 (a code sketch follows the case list below). Specifically, ISOMER descends the bucket tree until it finds a bucket b such that R(q) ⊂ C(b) but R(q) ⊈ C(b′) for any b′ ∈ children(b). ISOMER forms a new bucket bnew such that box(bnew) = R(q) and processes each bucket b′ ∈ children(b) as follows.

• If box(b′) ∩ R(q) = ∅, then nothing needs to be done.

• If box(b′) ⊂ R(q), then b′ is removed from children(b) and added to children(bnew).

• If box(b′) partially overlaps R(q), then bucket b′ (and recursively its children) are split as shown in Figure 4 to preserve disjointness and a hyperrectangular shape. The splitting is done one dimension at a time, using an arbitrary ordering among the dimensions.
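The following is a much-simplified sketch of this drilling procedure (again building on the Bucket sketch above; function names are ours). It descends using box containment rather than region containment, drops the children of a split bucket instead of splitting them recursively, and prorates counts by volume only as placeholders, since ISOMER recomputes all bucket counts from the maximum-entropy solution afterwards:

def box_contains(outer: Box, inner: Box) -> bool:
    return all(olo <= ilo and ihi <= ohi
               for (olo, ohi), (ilo, ihi) in zip(outer, inner))

def split_box(box: Box, qbox: Box):
    # Split `box` into its core inside `qbox` plus hyperrectangular slabs
    # outside, carved one dimension at a time.
    outside, core = [], list(box)
    for dim, ((lo, hi), (qlo, qhi)) in enumerate(zip(box, qbox)):
        if lo < qlo:                       # slab below the query range in this dimension
            slab = list(core); slab[dim] = (lo, qlo); outside.append(tuple(slab))
        if qhi < hi:                       # slab above the query range in this dimension
            slab = list(core); slab[dim] = (qhi, hi); outside.append(tuple(slab))
        core[dim] = (max(lo, qlo), min(hi, qhi))
    return tuple(core), outside

def drill(bucket: Bucket, qbox: Box, nq: float) -> Bucket:
    # Drill a hole for a QFR with R(q) = qbox and N(q) = nq under `bucket`.
    for child in bucket.children:          # descend while a child box contains R(q)
        if box_contains(child.box, qbox):
            return drill(child, qbox, nq)
    new = Bucket(box=qbox, count=nq, parent=bucket)
    for child in list(bucket.children):
        inter = box_intersect(child.box, qbox)
        if inter is None:
            continue                       # case 1: disjoint, nothing to do
        elif box_contains(qbox, child.box):
            bucket.children.remove(child)  # case 2: child lies fully inside R(q)
            child.parent = new
            new.children.append(child)
        else:                              # case 3: partial overlap, split the child
            bucket.children.remove(child)
            core, outside = split_box(child.box, qbox)
            share = child.count / box_volume(child.box)   # uniformity assumption
            new.children.append(Bucket(core, share * box_volume(core), parent=new))
            for slab in outside:
                bucket.children.append(Bucket(slab, share * box_volume(slab), parent=bucket))
    bucket.children.append(new)
    return new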

3.4.2 Formulation of the Optimization Problem

To apply the maximum-entropy principle in ISOMER, we associate a probability distribution P, and hence an entropy value H(P), with every possible histogram. In accordance with the maximum-entropy principle, ISOMER then maximizes H(P) over the set of all histograms that are consistent with the current set of QFRs. To define P and H(P), consider an STHoles histogram with buckets b1, ..., bk having bucket counts n(b1), ..., n(bk). If the data is discrete, then P is a probability distribution over the integer points of S given by pu = n(b*u)/[N · V(b*u)] for u ∈ S, where V(b) is abbreviated notation for vol(C(b)), and b*u is the unique bucket b such that u ∈ C(b). This definition follows from (4) after dividing both sides by the total number of tuples N and taking q to be the point query "(A1, A2, ..., Ad) = u". The entropy H(P) = −∑u∈S pu ln(pu) corresponding to the distribution P is thus given by

H(P) = −∑u∈S [n(b*u)/(N · V(b*u))] ln(n(b*u)/(N · V(b*u)))
     = −∑i=1..k ∑u∈C(bi) [n(bi)/(N · V(bi))] ln(n(bi)/(N · V(bi))).

Since the inner sum comprises V(bi) identical terms that are independent of u,

H(P) = −∑i=1..k [n(bi)/N] ln(n(bi)/(N · V(bi)))
     = −(1/N) ∑i=1..k n(bi) ln(n(bi)/V(bi)) + ln(N),   (5)

where the last equality uses the identity ∑i=1..k n(bi) = N.
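Equation (5) translates into a one-line update per bucket. A small helper, assuming the Bucket sketch above, with the convention that empty buckets contribute zero to the entropy:

import math

def histogram_entropy(buckets: List[Bucket], N: float) -> float:
    # H(P) per equation (5): -(1/N) * sum_i n(b_i) ln(n(b_i)/V(b_i)) + ln(N).
    h = 0.0
    for b in buckets:
        if b.count > 0:
            h -= (b.count / N) * math.log(b.count / b.volume())
    return h + math.log(N)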

For real-valued data, we take P to be the pdf defined by P(u) = pu for each real-valued point u ∈ S, where pu is defined as above; note that this density function is constant within each region C(bi). A straightforward calculation shows that the entropy H(P) = −∫S P(u) ln(P(u)) du is given by (5).

We now express the QFRs as constraints on the histogram. Suppose that ISOMER has obtained QFRs for m predicates q1, ..., qm. First, ISOMER drills holes for these QFRs in the histogram as described in Section 3.4.1. For each qi, the drilling procedure ensures that the set of histogram buckets lying within R(qi) exactly covers R(qi). Hence the QFR for qi can be written as the constraint

∑{b : C(b) ⊆ R(qi)} n(b) = N(qi)   (6)

Figure 5. Iterative Scaling Example. (A 2×2 histogram over make ∈ {BMW, Honda} and color ∈ {Black, White} with QFRs N(q1) = 100, N(q2) = 80, and N(q3) = 30; iterative scaling computes multipliers λ1, λ2, λ3 and yields bucket counts 14 and 56 for black BMWs and Hondas, and 6 and 24 for white BMWs and Hondas.)

The application of the maximum-entropy principle thus leads to a well posed optimization problem (see Footnote 3): for an STHoles histogram with buckets b1, ..., bk, select nonnegative bucket counts n(b1), ..., n(bk) so as to maximize the expression H(P) in (5), while satisfying (6) for 1 ≤ i ≤ m.

3.4.3 Solution of the Optimization Problem

To solve the above optimization problem, associate a Lagrange multiplier yi (1 ≤ i ≤ m) with the ith constraint given by (6). After removing the constants from the objective function in (5), the Lagrangian of the optimization problem is given by:

∑i=1..m yi (∑{b : C(b) ⊆ R(qi)} n(b) − N(qi)) − ∑i=1..k n(bi) ln(n(bi)/V(bi))

Differentiate the above expression with respect to n(b) and equate to 0 to get

∑{i : C(b) ⊆ R(qi)} yi − ln(n(b)/V(b)) − 1 = 0,

so that, setting λi = e^yi,

n(b) = (V(b)/e) ∏{i : C(b) ⊆ R(qi)} λi.   (7)

Combining (7) with (6), we get a system of m equations in the m unknowns λ1, ..., λm. This system of equations can be solved by a procedure known as iterative scaling. The details of iterative scaling are omitted due to lack of space and can be found in [12] (see Footnote 4).

Example 1. For the example of Section 3.2, V(bi) = 1 for each i. As before, we add three QFRs given by (1), (2), and (3). Denote these QFRs as q1, q2, and q3, respectively; see Figure 5. Since bucket b1 is referred to only by q1, it follows from (7) that n(b1) = λ1/e. Similarly, n(b2) = λ1λ2/e, n(b3) = λ1λ3/e, and n(b4) = λ1λ2λ3/e. We can therefore rewrite (1), (2), and (3) as

λ1 + λ1λ2 + λ1λ3 + λ1λ2λ3 = 100e
λ1λ2 + λ1λ2λ3 = 80e
λ1λ3 + λ1λ2λ3 = 30e

Solving the above equations, we obtain λ1 = 14e, λ2 = 4, and λ3 = 3/7, yielding the final histogram (Figure 5).

Footnote 3: There may be some buckets b for which n(b) = 0 in any consistent solution. We detect such buckets by solving an appropriate linear program, and exclude them from the entropy calculation, since their contribution to the entropy is 0 (lim_{n(b)→0} n(b) ln(n(b)) = 0).

Footnote 4: We actually modify the basic algorithm in [12] to permit efficient incremental updating of the maximum-entropy solution. The idea is to persist the multipliers corresponding to each QFR in the offline store, so that we only need to recompute the solution for a few specified regions. Details are omitted for brevity.

This histogram is consistent with all of the added QFRs. It is also "the most uniform" in the following sense. It maintains the 80-20 ratio between Hondas and BMWs for both colors. Similarly, it maintains the 30-70 ratio between white and black cars for both makes. Such uniformity is not obtained by adding QFRs in an ad hoc manner, e.g., as in the histogram of Figure 3(e).
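The iterative-scaling computation behind Example 1 can be reproduced in a few lines. This sketch (ours; variable names illustrative) repeatedly solves each constraint for its own multiplier while holding the others fixed, and converges to the solution of Example 1:

import math

# Buckets 0..3 = (Black BMW, Black Honda, White BMW, White Honda), V(b) = 1.
members = [{0, 1, 2, 3}, {1, 3}, {2, 3}]   # buckets referred to by q1, q2, q3
targets = [100.0, 80.0, 30.0]              # N(q1), N(q2), N(q3)
vol = [1.0, 1.0, 1.0, 1.0]
lam = [1.0] * len(targets)                 # multipliers, one per QFR

def count(b):
    # n(b) = (V(b)/e) * product of lam[i] over constraints i containing b; cf. (7).
    prod = 1.0
    for i, mem in enumerate(members):
        if b in mem:
            prod *= lam[i]
    return vol[b] / math.e * prod

for _ in range(100):                       # iterative-scaling sweeps
    for i, mem in enumerate(members):
        # Solve constraint i for lam[i], holding the other multipliers fixed.
        rest = sum(count(b) / lam[i] for b in mem)
        lam[i] = targets[i] / rest

print([round(count(b), 2) for b in range(4)])               # [14.0, 56.0, 6.0, 24.0]
print(round(lam[0] / math.e, 2), round(lam[1], 2), lam[2])  # 14.0, 4.0, ~3/7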

4. Dealing with Database Updates

As long as no tuples are updated, inserted, or deleted, all of the QFRs obtained by ISOMER are consistent with each other, and there exists at least one valid histogram solution that satisfies the set of constraints in (6). However, in the presence of data changes, the set of QFRs might evolve to a point at which no histogram can be simultaneously consistent with the entire set.

Example 2. Suppose that ISOMER obtains the two QFRs N(make = 'Honda') = 80 and N(make = 'Honda', color = 'White') = 30, and then some updates to the data occur. After these updates, ISOMER might obtain the QFR N(make = 'Honda', color = 'Black') = 60. Clearly, there exists no histogram solution consistent with all three QFRs.

Given inconsistent QFRs, there exists no solution to the optimization problem of Section 3.4.2. Thus, ISOMER must first discard the QFRs that are no longer valid due to data changes, leaving a set of consistent QFRs having a valid histogram solution. However, deciding which QFR is invalid is not always straightforward and depends on the type of data change. In Example 2, if some new black-Honda tuples have been inserted, the first QFR is invalid. However, if the color of some Hondas has been updated from white to black, then the second QFR is invalid. In general, both the first and second QFRs may be invalid.

Since no information is available about the type of data change, ISOMER uses the notion of the age of a QFR to decide which QFRs to discard. The intuition is that the older a QFR, the more likely it is to have been invalidated. Thus ISOMER discards those QFRs that are relatively old and whose removal leaves behind a consistent set. To quickly detect such QFRs, the following LP approach is employed.


4.1. A Linear-Programming Solution

ISOMER associates two "slack" variables with each constraint corresponding to the QFRs. The constraints in (6) are rewritten as (for 1 ≤ i ≤ m)

∑{b : C(b) ⊆ R(qi)} n(b) − N(qi) = s⁺i − s⁻i   (8)

ISOMER also adds the nonnegativity constraints

n(b) ≥ 0 for all b;  s⁺i, s⁻i ≥ 0 for i = 1, ..., m.   (9)

If there is a solution to the set of constraints (8) and (9) such that s⁺i = s⁻i = 0, then the solution satisfies the ith constraint from (6). Otherwise, if s⁺i or s⁻i is positive, the ith constraint is not satisfied. Ideally, we would like a solution that satisfies the maximum number of constraints from (6), i.e., a solution that minimizes the number of nonzero slack variables. Unfortunately, determining such a solution is known to be NP-complete [2]. ISOMER instead settles for minimizing the sum of the slack variables, because this problem can be solved by linear programming. ISOMER then discards all QFRs that correspond to constraints having nonzero slack. Note that if all the original constraints from (6) are satisfiable, then there exists a solution in which all of the slacks equal 0, and hence no QFRs are discarded.

As noted earlier, we want to preferentially discard older QFRs. Instead of minimizing the sum of slack variables, ISOMER therefore minimizes a weighted sum of the slack variables, where the slack corresponding to a QFR is weighted inversely by the age of the QFR. Thus, a QFR that is not satisfied and has nonzero slack incurs a smaller objective-function penalty if it is old than if it is new. An optimal solution is therefore more likely to permit slack in older QFRs, so that such QFRs are preferentially discarded. The age of the ith QFR is given by m − i + 1. Thus ISOMER solves the following linear program to detect inconsistent constraints: Minimize

∑i=1..m (1/(m − i + 1)) (s⁺i + s⁻i)

subject to (8) and (9). If s⁺i or s⁻i is nonzero in the resulting solution, then the ith QFR is discarded. In our implementation, we have used the highly optimized open-source Coin LP solver [18]. Discarding QFRs enables ISOMER to merge buckets as described in the next subsection.
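The weighted-slack program is an ordinary LP, so any LP library can express it; the paper's implementation uses the Coin LP solver, while the sketch below (ours) uses scipy.optimize.linprog for illustration. Variables are ordered as [n(b_1..k), s⁺_1..m, s⁻_1..m]:

import numpy as np
from scipy.optimize import linprog

def find_inconsistent_qfrs(member, targets):
    # member[i][j] = 1 if C(b_j) ⊆ R(q_i); targets[i] = N(q_i).
    # Returns the indices of QFRs whose slack is nonzero (to be discarded).
    member = np.asarray(member, dtype=float)
    m, k = member.shape
    # Weight 1/age = 1/(m - i + 1) with 1-based i; here i is 0-based.
    weights = np.array([1.0 / (m - i) for i in range(m)])
    c = np.concatenate([np.zeros(k), weights, weights])   # n(b) terms cost nothing
    # Equality constraints (8): sum_b n(b) - s+_i + s-_i = N(q_i).
    A_eq = np.hstack([member, -np.eye(m), np.eye(m)])
    res = linprog(c, A_eq=A_eq, b_eq=np.asarray(targets, dtype=float),
                  bounds=(0, None), method="highs")       # (9): all variables >= 0
    slack = res.x[k:k + m] + res.x[k + m:]
    return [i for i in range(m) if slack[i] > 1e-9]

# Example 2 over the four buckets of Section 3.2, oldest QFR first:
# N(Honda) = 80, N(Honda, White) = 30, N(Honda, Black) = 60.
M = [[0, 1, 0, 1],
     [0, 0, 0, 1],
     [0, 1, 0, 0]]
print(find_inconsistent_qfrs(M, [80, 30, 60]))   # -> [0]: the oldest QFR is dropped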

Our overall method for eliminating invalidated QFRs is clearly a heuristic. However, as shown in Section 6, it works effectively and efficiently in practice. In future work, we hope to enhance ISOMER with a more principled method for eliminating inconsistent QFRs.

4.2. Merging Histogram Buckets

Once ISOMER has decided to discard a particular QFR, the total histogram size can potentially be reduced by merging two or more histogram buckets. The process of merging buckets is essentially the inverse of the process for drilling holes described in Section 3.4.1 and is similar to that described in [4]. Due to space constraints, we give only a brief description here.

After the ith QFR is discarded, ISOMER examines all "top-level" buckets that cover R(qi), i.e., all buckets b such that C(b) ⊆ R(qi) but C(parent(b)) ⊈ R(qi). Bucket b can be merged with another bucket in the following cases:

• If bucket b has exactly the same referring set as its parent, i.e., if {j : C(b) ⊆ R(qj)} = {j : C(parent(b)) ⊆ R(qj)}, then b can be merged with its parent. This is because both b and parent(b) have the same number of tuples per unit volume in the maximum-entropy solution; cf. (7). The children of b now become the children of parent(b).

• If there is a sibling b′ of b such that (i) b′ and b have the same referring set and (ii) box(b′) ∪ box(b) is hyperrectangular, then b and b′ can be merged. The children of b are recursively examined to see if they can be merged with their new siblings, i.e., the children of b′. The new merged bucket is also examined to see if it can be merged with any of its siblings.

5. Meeting a Space Budget

Database systems usually limit the size of a histogram in order to ensure efficiency when the histogram is read at optimization time. Thus, we assume that there is limited space available for histogram storage. The addition of new QFRs causes the size of ISOMER's STHoles histogram to grow as new holes are drilled. Whenever the histogram size exceeds the space budget, ISOMER reduces the histogram size by discarding "unimportant" QFRs. Intuitively, a QFR is unimportant if it provides little information over and above what is already provided by other QFRs. Discarding unimportant QFRs reduces the total histogram size by triggering the merging of buckets (Section 4.2).

Example 3. In the example of Section 3.2, suppose that we have space to store only two histogram buckets. After adding the QFRs N = 100 and N(make = 'Honda') = 80, the resulting histogram has two buckets, and is shown in Figure 6(a). Each bucket has volume equal to 2, and bucket b2 is a hole in the top-level bucket b1. Suppose that we now add the third QFR, N(make = 'BMW', color = 'White') = 10. ISOMER drills a hole corresponding to this QFR, and the resulting histogram, shown in Figure 6(b), has three buckets, violating the space budget. Notice, however, that the addition of this third QFR yields no extra information: because tuples are assumed to be uniformly distributed within a bucket, the histogram in Figure 6(b) is already implied by the histogram in Figure 6(a). Thus the third QFR is unimportant, and can be discarded.

Figure 6. Unimportant QFR Example. (Panel (a): a two-bucket histogram over make ∈ {BMW, Honda} and color ∈ {Black, White} with n(b1) = 20, n(b2) = 80 and multipliers λ1 = 10e, λ2 = 4. Panel (b): after drilling a hole for #white BMWs = 10, a three-bucket histogram with n(b1) = 10, n(b2) = 80, n(b3) = 10 and λ3 = 1.)

How can the unimportant QFRs be efficiently determined? Note that the age of a QFR, which ISOMER uses as a criterion for deciding which QFRs are invalid, is not relevant for deciding importance. For example, in Example 3, ISOMER can receive many instances of the third QFR in succession, thus making the second QFR very old. However, the second QFR is still more important than the third QFR.

An elegant aspect of ISOMER is that the maximum-entropy solution yields, for free, an importance measure for each QFR. This leads to a very efficient procedure for detecting and discarding unimportant QFRs. Specifically, ISOMER uses the quantity |ln(λi)| as the importance measure for the ith QFR, where λi is the multiplier defined in (7). To justify this choice intuitively, we note that if λi = 1, then removal of the ith QFR does not affect the bucket counts, so that the final maximum-entropy solution is unaltered. For instance, λ3 = 1 in the final solution in Example 3 (see Figure 6(b)), and we concluded that the third QFR was unimportant. By the same token, if the multiplier corresponding to a QFR is close to 1, then removal of that QFR will affect the final solution less than if we remove a QFR whose multiplier differs greatly from 1. Thus, the ith QFR is unimportant if λi ≈ 1, i.e., if |ln(λi)| ≈ 0.

An alternative justification for our definition follows from the fact that |ln(λi)| = |yi|, where yi is the Lagrange multiplier corresponding to the ith constraint in the maximum-entropy optimization problem (Section 3.4.3). It is well known from optimization theory [3] that the Lagrange multiplier for a constraint measures the degree to which the constraint affects the optimum value of the objective function. Thus |yi| measures how much the ith constraint affects the entropy, i.e., the amount of information in the distribution. In other words, |yi| = |ln(λi)| is a measure of the amount of information carried by the ith constraint, and hence a measure of the importance of the ith QFR.

Thus, whenever the histogram exceeds the space budget, ISOMER examines the current maximum-entropy solution and, for each i, computes the importance measure for the ith QFR as |ln(λi)|. ISOMER then discards the QFR with the least importance according to this measure and merges buckets as described in Section 4.2. ISOMER then incrementally computes the new maximum-entropy solution and repeats the above procedure until the histogram is sufficiently small.
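Schematically, the pruning loop is just a few lines; in the sketch below (ours), merge_buckets and solve_max_entropy stand in for the procedures of Sections 4.2 and 3.4.3, and num_buckets for a size check against the budget:

import math

def prune_to_budget(qfrs, lambdas, histogram, budget,
                    num_buckets, merge_buckets, solve_max_entropy):
    # Discard the least-important QFRs (smallest |ln λ_i|) until the histogram fits.
    while num_buckets(histogram) > budget:
        worst = min(range(len(qfrs)), key=lambda i: abs(math.log(lambdas[i])))
        qfrs.pop(worst)                    # drop the least informative QFR
        merge_buckets(histogram)           # Section 4.2: merge now-redundant buckets
        lambdas = solve_max_entropy(histogram, qfrs)   # incremental re-solve
    return qfrs, lambdas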

6. Experiments

In this section, we provide an experimental validation of ISOMER. We prototyped ISOMER on top of DB2 UDB, which supports gathering of query feedback. The experiments were conducted on an Intel 2.3GHz processor with 1GB RAM. We used ISOMER to maintain a multidimensional histogram using query feedback and compared ISOMER's cardinality estimates against estimates obtained by a state-of-the-art commercial optimizer. The only form of multidimensional statistics presently supported by the optimizer is a count of the number of distinct values in a group of columns. We also compared the ISOMER estimates with those obtained by STGrid, since both algorithms use the same type of feedback. We did not compare ISOMER with STHoles, because the feedback collection strategy used by STHoles has a very high overhead that makes the approach impractical, particularly when index scans are used during query processing; see Section 1.1.

Our experiments demonstrate the following:

1. ISOMER provides significantly better cardinality estimates than either the commercial optimizer technique or STGrid.

2. For static data, the cardinality estimates provided by ISOMER consistently improve and then stabilize as QFRs are added to the histogram, i.e., there is no oscillation in accuracy when more QFRs are added (and others removed to keep within the space budget).

3. For changing data, ISOMER learns the new data distribution much faster (i.e., with the addition of many fewer QFRs) than STGrid and provides significantly better cardinality estimates.

4. The overall overhead of ISOMER is low.

For our experiments we used a DMV (Department of Motor Vehicles) database that was derived from a real-world application. We focused on the Cars table, which has several correlated columns such as make, model, color, and year. The maximum number of distinct values in any column was approximately 150.

We generated a collection of queries referred to as training queries, issued them to the query execution engine, and collected QFRs for the predicates in these queries. We used ISOMER (or STGrid, for comparison) to maintain a multidimensional histogram on the set of referenced attributes. The histogram was initialized using one-dimensional database statistics on the attributes and by assuming independence between the attributes. QFRs collected during execution of the training queries were then used to refine this histogram while allowing a maximum of k buckets, where k was an experimental parameter.

To test the accuracy of the maintained histogram, we generated a collection of 200 test queries from the same distribution as that of the training queries. We periodically tested the histogram's accuracy by comparing the actual and estimated cardinalities for each predicate in the test queries. We measured the relative error in the estimation as

Relative Error = |N(q) − N̂(q)| / max(100, N(q)),

where N(q) and N̂(q) are the actual and estimated numbers of tuples, respectively, that satisfy the predicate q. In our formula, the quantity 100 is a sanity constant [8] that prevents the relative error from blowing up in the case of highly selective predicates, i.e., predicates for which N(q) is either very small or equal to 0. We measured the overall accuracy of the histogram by the average relative error across all predicates in the test queries.
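The error metric is a direct transcription:

def relative_error(actual, estimated, sanity=100.0):
    # |N(q) - N̂(q)| / max(sanity, N(q)); the sanity constant [8] keeps the
    # error bounded for highly selective predicates (N(q) near or equal to 0).
    return abs(actual - estimated) / max(sanity, actual)

def average_relative_error(pairs):
    # Overall accuracy: mean relative error over (actual, estimated) pairs.
    return sum(relative_error(a, e) for a, e in pairs) / len(pairs)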

We discuss the accuracy of ISOMER on static data in Section 6.1 and then consider updates in Section 6.2. We study the running time of ISOMER in Section 6.3.

6.1. Accuracy on Static Data

6.1.1 2D Histograms

To demonstrate the ability of ISOMER to handle both numerical and categorical attributes, we used ISOMER to build a two-dimensional histogram on the attributes model (categorical) and year (numerical) of the Car table. Both test and training queries were of the form

SELECT * FROM Car WHERE model = x AND year BETWEEN y1 AND y2

where x, y1, and y2 are variables. The queries were generated according to the data distribution. First, x was chosen randomly from all possible models; the probability of choosing a given model was proportional to the model's frequency in the database. Then y1 and y2 (y1 ≤ y2) were chosen randomly from between the minimum and maximum years that occur with model x. Note that the execution of such queries gave us QFRs not only for two-dimensional predicates, but also for one-dimensional predicates, since predicates are applied one by one. Thus, we obtained QFRs such as N(model = x) and N(model = x, y1 ≤ year ≤ y2).

Figure 7. Error of 2D Histogram on model and year

Figure 7 plots the error of the various approaches as a function of the number of training queries. Each histogram was allowed to have a maximum of k = 175 buckets. It can be seen that ISOMER significantly outperforms STGrid by providing much more accurate cardinality estimates. Also note that the error of ISOMER consistently decreases and then stabilizes, but never increases with the addition of more QFRs. Although STGrid also seems to possess this desirable property in the case of static data, we show in the sequel that, for dynamic data, the error of STGrid may actually increase with the addition of more QFRs. As expected, the optimizer performs poorly, mainly because of the limited multidimensional statistics it supports. That is, the optimizer keeps only a count of the number of distinct values for a group of columns, which is not useful for predicting the cardinality of range predicates on the numerical attribute year. Thus, the optimizer estimates are simply based on the independence assumption.

6.1.2 3D Histograms

In this experiment, we built a three-dimensional histogram on the highly correlated attributes make, model, and color. Both the training and test queries were of the form

SELECT * FROM Car WHERE make = x AND model = y AND color = z

where x, y, and z are variables. The query-generation process was similar to that described in Section 6.1.1. First, a make x was chosen from the various makes according to the data distribution. Then, from the various models corresponding to make x, a particular model y was chosen, again according to the data distribution. Finally, a color z was chosen from the various colors corresponding to the (make, model) combination (x, y). ISOMER obtained QFRs for predicates of dimension 1, 2, and 3.

Figure 8. Error of 3D histogram on make, model, and color

Figure 8 plots the error of the various approaches against the number of training queries. Each histogram was allowed to have a maximum of k = 250 buckets. ISOMER again outperforms STGrid, by an even greater margin than in the 2D case, for two reasons. First, there are no partially overlapping buckets, because all attributes are categorical and hence the only predicates are equality predicates. Thus no bucket splits take place in ISOMER, so a larger number of QFRs can be added to the histogram before the space budget is reached, thereby boosting accuracy. Second, STGrid performs poorly because it merely adjusts the tuple counts in the buckets based on QFRs and does not use the QFRs to restructure the buckets. This lack of restructuring is a problem because the buckets are determined from the initial one-dimensional statistics, which ignore the correlation between attributes. Since make, model, and color are much more strongly correlated than model and year, the performance of STGrid for this 3D histogram is worse than for the 2D histogram in Section 6.1.1.

6.1.3 Effect of Space Budget

To study the effect of the space budget on histogram accuracy, we used a 2D histogram on model and year as in Section 6.1.1.

Figure 9. Error vs. allowed number of buckets

Figure 9 plots the error of the ISOMER and STGrid histograms against the maximum number of buckets allowed, along with the optimizer error (the optimizer presently maintains a hardcoded, fixed number of buckets). The error was measured after 400 training queries had been executed and the corresponding QFRs used to refine the histogram. As expected, the error of both ISOMER and STGrid decreases as more buckets are allowed to be stored. However, ISOMER improves much more rapidly than STGrid as the space budget increases, and it outperforms STGrid at every value of the space budget.

6.2. Accuracy on Changing Data

To study the ability of ISOMER to deal with changing data, we worked with a 2D histogram on model and year as in Section 6.1.1. We interspersed the execution of training queries with the issuing of updates to the database. To specify the type of updates issued, we first define the notions of a correlated tuple and a uniform tuple. A correlated tuple is a tuple drawn from the real data distribution; it has all the correlations that the real database has. For example, since make and model are correlated, a correlated tuple can have the (make, model) value (Honda, Civic) but never (Toyota, Civic). In contrast, a uniform tuple is generated by choosing the value of each attribute randomly and independently from the attribute's domain. For example, unlike a correlated tuple, a uniform tuple can take any (make, model) combination.
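A sketch of how the two kinds of tuples could be generated (real_rows stands for a sample of the true joint distribution; both helpers are hypothetical):

```python
import random

def correlated_tuple(real_rows):
    # Drawing a whole row from the true distribution preserves
    # correlations: (Honda, Civic) is possible, (Toyota, Civic) is not.
    return random.choice(real_rows)

def uniform_tuple(domains):
    # Each attribute is drawn independently from its own domain,
    # so any (make, model) combination can appear.
    return tuple(random.choice(domains[a]) for a in sorted(domains))

domains = {"make": ["Honda", "Toyota"], "model": ["Civic", "Corolla"]}
real_rows = [("Honda", "Civic"), ("Toyota", "Corolla")]
print(correlated_tuple(real_rows), uniform_tuple(domains))
```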

We report the results of an experiment in which we changed the data distribution from uniform to correlated.⁵ Specifically, we started out with a database consisting completely of uniform tuples. The optimizer gathered statistics over this database; these initial statistics remained fixed throughout the experiment. The training queries were executed and the gathered QFRs used to refine the histogram over time. After every 300 training queries, 20% of the tuples were updated from uniform to correlated tuples. This process was repeated 5 times, after which the entire database consisted only of correlated tuples.
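The schedule, as a sketch (the driver and its arguments are hypothetical; on_round stands for running 300 training queries and refining the histogram):

```python
import random

def apply_update_rounds(tuples, real_rows, on_round):
    """Five rounds, each converting a disjoint 20% of the (initially
    uniform) table to correlated tuples; afterwards it is fully correlated."""
    k = len(tuples) // 5  # 20% of the table per round
    for rnd in range(5):
        on_round(rnd)  # e.g. execute 300 queries, feed QFRs to the histogram
        for i in range(rnd * k, (rnd + 1) * k):
            tuples[i] = random.choice(real_rows)  # now a correlated tuple
```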

Figure 10. Updating: uniform to correlated

Figure 10 plots the error of the various approaches against the number of queries executed. The spikes in the error curve that occur roughly every 300 queries correspond to the times at which the database is updated. All the histograms start off very accurately because the data is uniform with independent attributes. For all approaches, the error increases when the data is updated. For ISOMER, however, the histogram evolves as queries are issued against the new data distribution, and the error decreases again. The STGrid error also decreases to some extent as the histogram is refined after updates. However, the improvement diminishes as the data becomes correlated. In fact, when more than 80% of the tuples have been updated, the addition of new QFRs tends to increase the overall error in STGrid. This is mainly due to the heuristic nature of STGrid, which does not preserve consistency when using new QFRs to refine the histogram. As expected, the optimizer error increases at each database update and never decreases, since the optimizer statistics are never refined.

ISOMER is not able to attain its original accuracy after updates have occurred. However, this phenomenon is expected, because ISOMER imposes the uniformity assumption on any regions of the data distribution about which it lacks information. Hence ISOMER is bound to be more accurate on uniform than on correlated data.

6.3. Running Time

The results in Sections 6.1 and 6.2 show that, in contrast to STGrid, ISOMER is robust and highly accurate across various data distributions and update scenarios. Nevertheless, the STGrid approach is very efficient because it uses a simple heuristic to refine the histogram. Although ISOMER is not as efficient as STGrid, we show in this section that the robustness of ISOMER comes at an affordable cost.

⁵ We obtained similar results, not reported here, when changing the distribution from correlated to uniform.

ISOMER's cost has two components. The first component is the cost of collecting QFRs, which has been shown [17] to be low (less than 5% overhead). Because this cost is incurred by any reactive approach, we do not consider it further. The second component is the cost of refining the histogram using QFRs; this cost is incurred whenever ISOMER is invoked. To study this latter cost component, we work with a 2D histogram on model and year as in Section 6.2 and use changing data.

Figure 11. ISOMER running time vs. number of buckets

Figure 11 plots the total running time of ISOMER to add 4000 QFRs against the maximum number of buckets allowed in the histogram. (A comparison with STGrid is not shown, since its running time is essentially zero.) Note that although the worst-case complexity is quadratic in the number of buckets (since all sibling pairs may be compared for bucket merging; Section 4.2), the actual running time is only slightly superlinear and remains well under a minute even for a moderately sized histogram with 350 buckets. In general, the number of buckets depends in a complicated manner on the number of distinct data values and the smoothness of the data distribution. We found that most of this time is spent in reducing the histogram size to within the space budget, because the maximum-entropy solution has to be recomputed every time a QFR is discarded. We are exploring the use of the faster Newton-Raphson method for solving the maximum-entropy optimization problem in order to improve the overall running time of ISOMER.
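As a generic illustration of that direction (a sketch under our own formulation, not ISOMER's actual solver): writing the maximum-entropy bucket counts as n_b = v_b · exp(A_b · θ), where A is a 0/1 matrix recording which buckets each QFR predicate covers and N holds the observed cardinalities (including the total-count QFR), the multipliers θ minimize the convex dual F(θ) = Σ_b v_b exp(A_b · θ) − θ · N, which damped Newton-Raphson can solve:

```python
import numpy as np

def maxent_counts(v, A, N, iters=50):
    """Damped Newton-Raphson on the dual of a maximum-entropy problem.
    v: (B,) bucket volumes; A: (B, C) 0/1 coverage matrix;
    N: (C,) observed cardinalities. Returns bucket counts n."""
    theta = np.zeros(A.shape[1])
    F = lambda t: np.sum(v * np.exp(A @ t)) - t @ N  # convex dual objective
    for _ in range(iters):
        n = v * np.exp(A @ theta)
        grad = A.T @ n - N                # constraint violations
        hess = A.T @ (A * n[:, None])     # = sum_b n_b A_b A_b^T
        step = np.linalg.solve(hess, grad)
        t = 1.0                           # backtracking keeps exp() tame
        while t > 1e-12 and F(theta - t * step) > F(theta):
            t /= 2
        theta -= t * step
    return v * np.exp(A @ theta)

# Toy example: two buckets; QFRs say |R| = 100 and N(bucket 0) = 80.
v = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0],     # bucket 0: covered by both constraints
              [1.0, 0.0]])    # bucket 1: only the total-count constraint
N = np.array([100.0, 80.0])
print(maxent_counts(v, A, N).round(2))   # -> [80. 20.]
```

At the optimum the gradient vanishes, i.e., every QFR constraint is met exactly, and Newton's quadratic convergence near the solution is what makes it attractive compared to iterative scaling.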

7. Conclusions

We have described ISOMER, an algorithm for maintaining multidimensional histograms using query feedback. ISOMER uses the information-theoretic principle of maximum entropy to refine the histogram based on query feedback gathered over time. Unlike previous proposals for feedback-driven histogram maintenance, which lack either robustness (e.g., STGrid [1]) or efficiency (e.g., STHoles [4]), ISOMER is both reasonably efficient and robust to changes in the underlying data.

ISOMER can be extended in several ways to increase its utility in a database system. First, to handle a large number of attributes, ISOMER can be combined with techniques based on graphical models [7, 11]; such techniques divide the set of attributes into correlated subsets and maintain multidimensional statistics only for these subsets. Second, ISOMER can be extended to build histograms even on attributes in different tables, by using statistical views. Finally, ISOMER can be combined with a proactive approach in order to increase its robustness for queries that refer to data that has not been previously queried.

References

[1] A. Aboulnaga and S. Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In SIGMOD, 1999.

[2] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, pages 181–210, 1995.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A multidimensional workload-aware histogram. In SIGMOD, 2001.

[5] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In SIGMOD, 1998.

[6] C. Chen and N. Roussopoulos. Adaptive selectivity estimation using query feedback. In SIGMOD, 1994.

[7] A. Deshpande, M. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD, 2001.

[8] S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. In VLDB, 2004.

[9] D. Gunopulos, G. Kollios, V. Tsotras, and C. Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. In SIGMOD, 2000.

[10] E. Jaynes. Information theory and statistical mechanics. Physical Review, 1957.

[11] L. Lim, M. Wang, and J. Vitter. SASH: A self-adaptive histogram set for dynamically changing workloads. In VLDB, 2003.

[12] V. Markl et al. Consistently estimating the selectivity of conjuncts of predicates. In VLDB, 2005.

[13] Y. Matias, J. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In SIGMOD, 1998.

[14] M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multidimensional queries. In SIGMOD, 1988.

[15] V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, 1997.

[16] J. Shore and R. Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Information Theory, 26(1):26–37, Jan. 1980.

[17] M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO – DB2's learning optimizer. In VLDB, 2001.

[18] Computational Infrastructure for Operations Research. http://www.coin-or.org/.

[19] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD, 2002.
