Page 1: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - New Sampling-Based

New Sampling-Based Estimators for OLAP Queries

Ruoming Jin (Kent State University), Leo Glimcher (Ohio State University), Chris Jermaine (University of Florida), Gagan Agrawal (Ohio State University)

[email protected], {glimcher,agrawal}@cse.ohio-state.edu, [email protected]

ABSTRACT

One important way in which sampling for approximate query processing in a database environment differs from traditional applications of sampling is that in a database, it is feasible to collect accurate summary statistics from the data in addition to the sample. This paper describes a set of sampling-based estimators for approximate query processing that make use of simple summary statistics to greatly increase the accuracy of sampling-based estimators. Our estimators are able to give tight probabilistic guarantees on estimation accuracy. They are suitable for low or high dimensional data, and work with categorical or numerical attributes. Furthermore, the information used by our estimators can easily be gathered in a single pass, making them suitable for use in a streaming environment.

1. INTRODUCTION

Interactive-speed evaluation of ad-hoc, OLAP-style queries is quite often an impossible task over the largest databases. Even using modern query processing techniques in conjunction with the latest hardware, OLAP-style queries can still require hours or days to complete execution. A quick perusal of the latest TPC-H benchmark supports this assertion [25].

One obvious way to address this problem is to rely on approximation. If an approximate answer can be given at sub-second, interactive speeds, it may encourage the use of large databases for ad-hoc data exploration and analysis. One of the most often-cited possibilities for supporting approximation in a database environment is to rely on random sampling [12, 11, 1, 2, 5, 6, 9, 13, 18, 21, 20, 17]. Unlike other estimation methods such as wavelets [26, 7, 8] and histograms [10], sampling has the advantage that most estimates over samples are unaffected by data dimensionality, and sampling is equally applicable to categorical and numerical data. Another primary advantage of random sampling is that sampling techniques are well-understood and sampling as a field of scientific study is very mature. Fundamental results from statistics can be used as a guide when applying sampling to data management tasks [24, 19, 23, 22].

However, there is one important way in which sampling in a database environment differs from sampling as it has been studied in statistics and related fields. Specifically, most work from statistics makes the assumption that it is impossible to gather accurate summary statistics from the population that is to be sampled from. After all, in traditional applications like sociology and biology, sampling is used when it is impossible to directly study the entire population.

In a database, this assumption is far too restrictive. There is no reason that accurate and extensive summary statistics over the underlying data set cannot be maintained. In databases, unlike in traditional statistical inference, the entire data set is available and can be pre-processed or processed in an online fashion as it is loaded. The reason that approximation may be required in a database environment is that there are too many possible queries to pre-compute the answer to every one, and it may be too expensive to evaluate every single query to get an exact answer. This does not imply that it is impractical to maintain a very comprehensive set of summary statistics over the data.

A recent paper [13] proposed a new technique called Approximate Pre-Aggregation (APA) that makes use of such summary statistics to greatly increase the accuracy of sampling-based estimation in a database environment. APA is useful for providing extremely accurate answers to aggregate queries (SUM, COUNT, AVG) over high-dimensional database tables, with a complex relational selection predicate attached to the aggregate query. For example, imagine we have a database table COMPLAINTS (PROF, SEMESTER, NUM_COMPLAINTS)

and the SQL query:

SELECT SUM (NUM_COMPLAINTS)

FROM COMPLAINTS

WHERE PROF = ’Smith’

AND SEMESTER = ’Fa03’;

This query asks: “How many complaints did Prof. Smith receive in the Fall 2003 semester?” APA was shown experimentally to produce estimates for the answer to this type of query that have only a small fraction of the error associated with other techniques such as stratified random sampling and wavelets, especially over categorical data.

However, there is one glaring weakness with APA: it is difficult or impossible to formally reason about the accuracy of the approach. In particular, while APA was shown experimentally to produce excellent results, there is no obvious way to associate statistically meaningful confidence bounds with the estimates produced by APA. A confidence bound is an assertion of the form, “With a probability of .95, Prof. Smith received 27 to 29 complaints in the Fall of 2003.” The lack of confidence bounds poses a severe limitation on the applicability of APA to real-life estimation problems.

This Paper’s Contributions: In this paper, we propose a new method for combining summary statistics with sampling to improve estimation accuracy, called APA+. APA+ is in many ways simpler than APA, and yet it has the overwhelming advantage of giving statistically meaningful confidence bounds on its estimates. As we will show experimen-

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 0-7695-2570-9/06 $20.00 © 2006 IEEE


S  Prof.  Sem.   Cmpl.      S  Prof.  Sem.   Cmpl.
   Adams  Fa 02    3        √  Smith  Su 01    7
   Jones  Fa 02    2        √  Smith  Sp 01    8
   Adams  Sp 02    9        √  Adams  Fa 00    4
√  Jones  Sp 02    2           Smith  Su 01    7
√  Smith  Sp 02   21           Smith  Fa 00   33
   Smith  Fa 01   36        √  Adams  Su 00    3
√  Jones  Su 01    1        √  Jones  Su 00    0
   Adams  Su 01    2           Jones  Sp 99    1

Table 1: Number of complaints over three years

tally in this paper, APA+ produces confidence bounds that are tighter and more accurate than those that are produced using more traditional, sampling-based estimators. Furthermore, computing an APA+ estimate is computationally very efficient, which renders APA+ an excellent candidate for use in applications like online aggregation [12], where an estimate must be re-computed every second or so in order to inform a user as to the current accuracy of the estimate.

Given the accuracy and wide applicability of the techniques presented in the paper, we assert that APA+ should be used in place of traditional, sampling-based estimation in any environment where it is possible to augment a sample with simple summary statistics over a database.

Paper Organization: Section 2 of the paper gives an overview of APA and our new method, APA+. The simplest version of APA+ is formally described in Section 3. Section 4 generalizes APA+ to work with more complicated and comprehensive summary statistics, and Section 5 discusses the issue of correlation in the individual estimates that make up the final APA+ estimator. Section 6 presents our experimental evaluation. Finally, the paper is concluded in Section 7.

2. OVERVIEW OF APA AND APA+

In this Section, we will describe both APA and APA+ at a high level in the context of the example database table depicted in Table 1 [13].

This table models the situation where a number of students take a course each semester, and the number of student grading complaints is recorded.

2.1 Approximate Pre-Aggregation (APA)

In our example, we are interested in answering a query of the form:

SELECT SUM (NUM_COMPLAINTS)

FROM COMPLAINTS

WHERE PROF = ’Smith’

Imagine that we decided to answer this query using a 50% sample of the database, where the sampled tuples are indicated by a check mark in Table 1. Estimating the answer to the query using the sample is very straightforward. The number of complaints attributed to Professor Smith in the sample is 36, and since the sample constitutes 50% of the database tuples, we can estimate that Professor Smith received 72 complaints.
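To make the scale-up concrete, the naive estimate can be sketched as follows (a minimal sketch in Python; the variable names are our own, and the sampled values are read off Table 1):

```python
# Naive scale-up ("positive") estimate from Section 2.1.
# The 50% sample below lists (professor, complaints) for the checked tuples.
N = 16                                    # tuples in COMPLAINTS
sample = [("Smith", 7), ("Smith", 8), ("Smith", 21),
          ("Adams", 3), ("Adams", 4), ("Jones", 2),
          ("Jones", 1), ("Jones", 0)]     # n = 8 sampled tuples
n = len(sample)

smith_in_sample = sum(c for prof, c in sample if prof == "Smith")  # 36
estimate = (N / n) * smith_in_sample      # scale up by the inverse sampling rate
print(estimate)                           # 72.0
```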

However, as we see from Table 1, the actual number of students who came to see Professor Smith was 121 (yielding 40.5% relative error). The problem is the variance in the number of students who complained to Professor Smith each semester. This ranges from a low of 7 to a high of 36, and it happens that our sample missed the two semesters when the greatest number of students complained to Professor Smith.

APA makes use of additional summary statistics to reduce sampling’s vulnerability to variance. The APA process begins by using the relational selection predicate of the query to divide the data space into 2ⁿ quadrants, where n is the number of clauses in the predicate. APA then associates a probability density function with each quadrant, and uses the additional summary information that is available to create a set of constraints on the means of these distributions. Maximum Likelihood Estimation (MLE) is then used to adjust the means in order to correct violations of the constraints. Depending on the dimensionality of the summary information used, different variations of APA are possible. For example, APA0 makes use of zero-dimensional facts about the data, or assertions of the form (SUM (COMPLAINTS)) = 148. APA1 makes use of one-dimensional facts about the data, or assertions of the form (SUM (COMPLAINTS)) WHERE (SEMESTER = ’Fa02’) = 5, which have one clause in their relational selection predicate.

2.2 APA+

The fundamental weakness of APA is the lack of a formal analysis of the accuracy of the approach, which in turn precludes analytically-derived confidence bounds on APA estimates. Such an analysis is difficult for several reasons:

1. First, it can be difficult in general to reason about the accuracy of maximum likelihood estimators. While MLEs are usually Minimum Variance Unbiased Estimators (MVUEs) and are typically normally distributed [15], this only provides a lower bound on the variance of the MLE via the Cramér-Rao inequality; it does not say whether or not the MLE actually achieves this lower bound.

2. Second, things are even more difficult in the case of APA, which uses a constrained MLE where the space of possible solutions is restricted by the available summary statistics. It is far from clear how to formally reason about the effect of this on the accuracy of the approach.

3. Finally, the various estimators that are combined in APA are correlated, and this correlation is ignored by APA. If one sample falls in the quadrant associated with Professor Smith, then it cannot fall in the quadrant holding tuples not associated with Professor Smith. Considering this correlation is absolutely necessary in order to develop confidence bounds for APA, and yet it makes the analysis even more difficult.

In this paper, we provide a simple alternative to APA, called APA+. APA+ is similar to APA, except that it does not rely on MLE and is far more amenable to a rigorous statistical analysis of its accuracy.

At the highest level, just like APA, APA+ also makes use of additional summary statistics, but in a fundamentally different way. APA+ relies on the idea of a negative estimator to make use of summary statistics. In APA+, the original, straightforward estimate of 72 is called a positive estimator, in that it is derived by directly considering the tuples that match our relational selection predicate. However, it is also possible to produce a negative estimator by first estimating the number of complaints not attributed to Professor Smith (which is 2 × (2 + 1 + 0 + 3 + 4) = 20), and subtracting this quantity from 148, which is the total number of complaints. Thus, 148 − 20 = 128 is our negative estimator for the number of complaints attributed to Professor Smith.
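The negative estimate above can be sketched the same way (variable names are our own; T is the zero-dimensional fact SUM(NUM_COMPLAINTS) = 148):

```python
# Negative estimate from Section 2.2: estimate the complaints NOT attributed
# to Smith from the sample, then subtract from the known total T.
T = 148                                   # SUM(NUM_COMPLAINTS) over the whole table
N, n = 16, 8
non_smith_sample = [2, 1, 0, 3, 4]        # sampled tuples not matching the predicate

not_smith_estimate = (N / n) * sum(non_smith_sample)  # 2 x 10 = 20.0
negative_estimate = T - not_smith_estimate            # 148 - 20 = 128.0
print(negative_estimate)
```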



If we denote the positive and negative estimators by t̂p and t̂′p respectively, then

t̂ = αt̂p + (1 − α)t̂′p

is itself an unbiased estimator of the number of complaints for Professor Smith for any value of α where 0 ≤ α ≤ 1. By choosing α so as to minimize the variance of t̂, it is possible to develop an extremely accurate estimator.

This alternative method for incorporating summary statistics into our estimate has several obvious advantages:

1. Intuitively, the accuracy of APA+ should be very good, just as for APA. The reason is that the ensemble approach leverages more information into the estimation. By using the information from the negative domain, we can reduce the variance of our estimate. Since our ensemble estimator is unbiased for the answer of the query, reducing its variance directly reduces its expected error, leading to very accurate estimates.

2. Unlike APA, APA+ relies on simple arithmetic in order to make use of summary information, and is thus very amenable to statistical analysis, including the derivation of confidence bounds.

3. Just like for APA, it is possible to extend APA+ to work with more complicated summary information involving specific subsets of the data and more complicated queries. In the general case, APA+ makes use of many negative estimators, all of which are combined to form a single, unbiased estimator for the answer to the query.

An additional advantage of APA+ is its capability to deal with both categorical and numerical attributes in the WHERE clause. In our technical report [14], we have shown how APA+ can be extended to handle numerical attributes.

Similar to the different versions of APA, i.e., APA0, APA1, etc., we also have different versions of APA+, which are referred to as APA0+, APA1+, and so on. The version that uses i-dimensional aggregates is referred to as APAi+. In the remainder of the paper, we formally describe APA+ and the derivation of its associated confidence bounds with a focus on categorical attributes in the WHERE clause. In the next section, we first study APA0+.

3. NEGATIVE ESTIMATORS AND APA0+

To facilitate our discussion, we first introduce some definitions. Let N be the total size of the dataset, and T be the zero-dimensional fact that describes the total value of the attribute that is to be aggregated over all tuples in the database (in our example, T is SUM (COMPLAINTS) = 148). Let S denote a sample of size n, extracted from the complete data set using sampling without replacement [24].

Given a WHERE clause (or relational selection predicate) p, the domain of p (denoted Dp) is the set of data points in the complete dataset which satisfy the clause. The domain sample is the set of samples from S that satisfy the predicate p. The size of the domain sample is denoted as nd. The i-th unit in the domain sample is represented by yi. In our example, the selection clause WHERE PROF = ’Smith’ specifies the domain sample {21, 7, 8}. The domain sample can be extended to the domain data space by assigning 0 to all the samples from S that do not satisfy the predicate. In our example, the domain data space is {0, 21, 0, 7, 8, 0, 0, 0}. We will use ui to represent the i-th unit in the domain data space, and ū to represent the mean of the domain data space, computed as ū = (u1 + · · · + un)/n.

Given these definitions, let t̂p be the naive estimator for the value of a SUM query over Dp. This can be formally expressed as

t̂p = (N/n)(y1 + · · · + ynd) = (N/n)(u1 + · · · + un)

Note that in this paper, we also call this estimator a positive estimator, since its estimate is based only on the samples satisfying the predicate p. Also, t̂p is unbiased:

Lemma 1. The estimator t̂p is unbiased, i.e. the expectation of this estimator equals the true value: E(t̂p) = tp [24].

Furthermore, the variance V(t̂p) can be estimated as:

V̂(t̂p) = N² ((N − n)/(Nn)) s²u

where s²u = ((u1 − ū)² + · · · + (un − ū)²)/(n − 1) is the sample variance in the domain data space [24].

In our example, t̂p estimates the number of Prof. Smith’s complaints to be 2 × (21 + 7 + 8) = 72. However, the correct number of Prof. Smith’s complaints is 121. Clearly, the naive estimator significantly under-estimates the target variable. For this example, the standard error (SE), which is the square root of the variance (SE(t̂p) = √V̂(t̂p)), is 68.2. In practice, such large relative standard errors can be common.
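The estimator and variance formula above can be evaluated directly on the running example. This is a sketch with our own variable names; note that the standard error of 68.2 quoted in the text is the paper's own figure, which this simplified evaluation does not necessarily reproduce.

```python
from math import sqrt

# Evaluate the positive estimator and its estimated variance on the
# domain data space for WHERE PROF = 'Smith'.
N, n = 16, 8
u = [0, 21, 0, 7, 8, 0, 0, 0]             # domain data space

u_bar = sum(u) / n                        # 4.5
s2_u = sum((ui - u_bar) ** 2 for ui in u) / (n - 1)   # sample variance = 56.0

t_hat_p = (N / n) * sum(u)                # positive estimator = 72.0
v_hat = N ** 2 * ((N - n) / (N * n)) * s2_u           # estimated variance
se = sqrt(v_hat)                          # standard error
print(t_hat_p, s2_u, se)
```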

In the following subsections, we will describe how to utilize the zero-dimensional fact T to improve the naive estimator of the total of the domain Dp. Section 4 will show how these results can be generalized to make use of facts that provide more detailed information.

3.1 The Negative Estimator

Let p̄ be the negation of the clause p. Thus, p̄ specifies those tuples from the data set that are not in the domain Dp. Clearly, t̂p̄ is also an unbiased estimator, for tp̄, the total of the domain Dp̄. In our running example, the negation of the original clause produces the following query:

SELECT SUM (NUM_COMPLAINTS)

FROM COMPLAINTS

WHERE PROF <> ’Smith’

Just as t̂p is a naive estimator for the number of complaints received by Professor Smith, t̂p̄ is a naive estimator for the total number of complaints for all the professors except Smith: t̂p̄ = 16/8 × (2 + 1 + 0 + 3 + 4) = 20.

A simple but important fact connects tp, the total of the domain Dp, and tp̄, the total of the domain Dp̄:

T = tp + tp̄

Given this, we introduce a new estimator based on the negative clause:

Definition 1. The negative estimator of the domain total tp is t̂′p = T − t̂p̄.



Just as the positive estimator is unbiased, so is the negative estimator:

Lemma 2. The negative estimator t̂′p is an unbiased estimator for tp, the total of domain Dp, and its variance is the same as the variance of t̂p̄, i.e. V(t̂′p) = V(t̂p̄).

Proof: Please see our technical report [14]. □

Also, directly from the above discussion, the variance of the negative estimator can be estimated as:

V̂(t̂′p) = V̂(t̂p̄)

Applying this new estimator to our example, the negative estimate for the number of complaints for Professor Smith is 148 − 20 = 128, significantly reducing the standard error to 13.4. The improved accuracy can be attributed to the fact that the sample variance of the domain sample for the negative predicate is much smaller than the one for the original predicate in this example.

3.2 An Optimized Unbiased Estimator

So far, we have introduced two unbiased estimators, t̂p and t̂′p, for the total of domain Dp. In this subsection, we study how to combine these two estimators to obtain an optimized unbiased estimator whose variance is lower than the variance of either of the existing estimators.

Let the new estimator be the linear combination of twoexisting estimators, t̂p and t̂′p:

t̂APA0 = αt̂p + (1 − α)t̂′p

where 0 ≤ α ≤ 1. By the linearity of expectation, this new estimator is clearly unbiased as well:

Lemma 3. The new estimator t̂APA0 is an unbiased esti-mator for the domain total tp.

Now the question is how to find an optimized α to minimize the variance of t̂APA0. For simplicity, in this section we will assume the positive estimator and negative estimator are independent. In this case, Lemma 4 states the existence of an optimal unbiased estimator. Note that the two estimators are actually correlated. In Section 5, we will discuss how to incorporate this correlation into our estimation approach.

Lemma 4. Assume V(t̂p) + V(t̂p̄) ≠ 0, and let α = V(t̂p̄)/(V(t̂p) + V(t̂p̄)). Then the estimator t̂optAPA0 = αt̂p + (1 − α)t̂′p, 0 ≤ α ≤ 1, is an unbiased estimator of the domain total tp, and minimizes the variance of t̂APA0.

Proof: Please see our technical report [14]. □

We call the above estimator the optimal estimator. Since the variances of t̂p and t̂p̄ are usually unknown, it is not possible to compute α exactly; rather, the natural estimator α̂ = V̂(t̂p̄)/(V̂(t̂p) + V̂(t̂p̄)) is used.

Applying this new estimator to our running example, the optimized α̂ is 0.0373, and the estimated total is 125.95, with a reduced standard error of 13.1.
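Plugging the standard errors quoted in the text (68.2 for the positive estimator, 13.4 for the negative one) into Lemma 4 reproduces these figures to within rounding (a sketch; variable names are our own):

```python
from math import sqrt

# Variance-minimizing combination of the positive and negative estimators
# (Lemma 4), using the standard errors quoted for the running example.
t_pos, t_neg = 72.0, 128.0
se_pos, se_neg = 68.2, 13.4
v_pos, v_neg = se_pos ** 2, se_neg ** 2

alpha = v_neg / (v_pos + v_neg)           # small: the negative estimator dominates
t_opt = alpha * t_pos + (1 - alpha) * t_neg
se_opt = sqrt(alpha ** 2 * v_pos + (1 - alpha) ** 2 * v_neg)  # independence assumed
print(round(alpha, 4), round(t_opt, 2), round(se_opt, 1))
```

The small discrepancies from the text's 0.0373 and 125.95 come from rounding the quoted standard errors before combining them.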

3.3 Constructing a Confidence Interval

The standard method for describing the accuracy of an estimate is to associate a confidence interval with the estimate. For example, to associate a 95% confidence interval with our estimate, we choose a range of values so that with probability 95%, the true answer to the query will fall within this range. In practice, in order to provide a tight confidence interval the distribution of the estimator must be known.

As a result of the central limit theorem (CLT), the distribution of the positive estimator t̂p is approximately normal (Gaussian) with mean tp and variance V̂(t̂p) for large n [19]. Thus, a large-sample 100(1 − α)% confidence interval (CI) for the domain total tp is:

[t̂p − zα/2SE(t̂p), t̂p + zα/2SE(t̂p)]

where zα/2 is the (1 − α/2)th percentile of the standard normal distribution. The size (or width) of the confidence interval is 2 × zα/2SE(t̂p). For example, for a 95% CI, α is 5% and zα/2 is 1.96. Therefore, an approximate 95% CI for the estimator t̂p is given by:

[t̂p − 1.96SE(t̂p), t̂p + 1.96SE(t̂p)]

The following Lemma establishes an important property of the distributions of the negative and optimal estimators.

Lemma 5. If the distribution of the naive estimator on the negative domain is normal with mean tp̄, then the distribution of the negative estimator t̂′p is normal with mean tp and variance V(t̂p̄). Furthermore, if the distribution of the naive estimator on the positive domain is also normal, with mean tp, then the distribution of the optimal estimator t̂optAPA0 is normal with mean tp and variance V(t̂optAPA0).

As a result, the confidence intervals for t̂′p and t̂optAPA0 can be derived as described above.
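A 95% interval for the combined estimate of the running example can then be sketched as (our own names; the point estimate and standard error are those quoted above):

```python
# Large-sample 95% confidence interval per the CLT argument above.
z = 1.96                                  # (1 - 0.05/2) percentile of N(0, 1)
estimate, se = 125.95, 13.1
lo, hi = estimate - z * se, estimate + z * se
print((round(lo, 1), round(hi, 1)))       # roughly (100.3, 151.6)
```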

4. GENERALIZATION

In practice, it is easy to compute and store more complicated pre-aggregated results than those used in APA0+. This information can be used to achieve much greater estimation accuracy. In this Section, we will study how to build optimal estimators when more complicated pre-aggregation results are available. We refer to a fact with i predicates in its relational selection clause as an i-dimensional fact, and we refer to the version of APA+ that makes use of i-dimensional facts as APAi+. We begin this section with a few definitions; specifically, we define the idea of a negative clause with respect to more complicated pre-aggregation results.

We assume a relational selection clause p that can be expressed as a conjunction of m boolean predicates, b1, · · · , bm. For example, consider the following query [13]:

SELECT SUM (SALARY)

FROM EMPLOYEE

WHERE SEX = ’M’

AND DEPARTMENT = ’ACCOUNT’

AND JOB_TYPE = ’SUPERVISOR’

In this case, we have b1 = (SEX = ’M’), b2 = (DEPARTMENT = ’ACCOUNT’), and b3 = (JOB_TYPE = ’SUPERVISOR’). We will also consider the negation of each of these predicates: b̄1, b̄2, b̄3.



Figure 1: Spatial representation of query predicates

Note that if we take all possible meaningful combinations of these predicates, the entire dataset can be partitioned into 2ᵐ non-overlapping domains, specified by 2ᵐ clauses. We call such clauses domain clauses. In our example query, the 2³ = 8 domain clauses are b1 ∧ b2 ∧ b3, b̄1 ∧ b2 ∧ b3, b1 ∧ b̄2 ∧ b3, b1 ∧ b2 ∧ b̄3, b̄1 ∧ b̄2 ∧ b3, b̄1 ∧ b2 ∧ b̄3, b1 ∧ b̄2 ∧ b̄3, b̄1 ∧ b̄2 ∧ b̄3. Note that each of these clauses corresponds to a single cell in the multidimensional data cube defined by b1, b2, and b3 (see Figure 1) [13].

Any conjunction over a subset of {b1, b2, · · · , bm} can be expressed as a disjunction of the various domain clauses. Let the conjunctive clause bj1 ∧ · · · ∧ bji be denoted by pj1,··· ,ji. For APAi+, the fact involving such a conjunctive clause with i predicates is referred to as an i-dimensional fact, and is assumed to be available. This fact is an aggregate over the domain specified by this clause. In order to utilize i-dimensional facts, we introduce the following definitions.

Definition 2. The set of 2ᵐ⁻ⁱ domain clauses whose disjunction is equivalent to pj1,··· ,ji is called the set of domain clauses with respect to pj1,··· ,ji.

In our example, the domain clauses with respect to the conjunctive clause p1 = b1 (equivalent to SEX = ’M’) are b1 ∧ b2 ∧ b3, b1 ∧ b2 ∧ b̄3, b1 ∧ b̄2 ∧ b3, and b1 ∧ b̄2 ∧ b̄3; the domain clauses with respect to the conjunctive clause p2,3 = b2 ∧ b3 are b1 ∧ b2 ∧ b3 and b̄1 ∧ b2 ∧ b3.

This definition is important, because it will allow us to define a negative clause that will be used to form the negative estimators in higher-order versions of APA+:

Definition 3. The set of domain clauses with respect to pj1,··· ,ji, except the predicate p itself, is called the set of negative clauses with respect to pj1,··· ,ji, or p̄ w.r.t. pj1,··· ,ji for short.

In our example, p̄ w.r.t. p1 is (b1 ∧ b2 ∧ b̄3) ∨ (b1 ∧ b̄2 ∧ b3) ∨ (b1 ∧ b̄2 ∧ b̄3), and p̄ w.r.t. p2,3 is b̄1 ∧ b2 ∧ b3.
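Definitions 2 and 3 are mechanical enough to sketch in code (our own representation, not the paper's: a clause is a tuple of booleans, True standing for bi and False for its negation b̄i):

```python
from itertools import product

# Enumerate domain clauses and negative clauses for m = 3 predicates,
# following Definitions 2 and 3.
m = 3
domain_clauses = list(product([True, False], repeat=m))   # 2^m = 8 clauses
p = (True, True, True)                                    # the query predicate b1 ∧ b2 ∧ b3

def domain_clauses_wrt(fixed):
    """Domain clauses whose disjunction is equivalent to the conjunction
    of the predicates listed in `fixed` (e.g. {0: True} for p1 = b1)."""
    return [c for c in domain_clauses
            if all(c[i] == v for i, v in fixed.items())]

def negative_clauses_wrt(fixed):
    """The same set, minus the query predicate p itself (Definition 3)."""
    return [c for c in domain_clauses_wrt(fixed) if c != p]

print(len(negative_clauses_wrt({0: True})))          # 3 clauses: p̄ w.r.t. p1
print(len(negative_clauses_wrt({1: True, 2: True}))) # 1 clause:  p̄ w.r.t. p2,3
```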

Finally, we use tj1,··· ,ji to denote the sum of the target attribute over the domain specified by pj1,··· ,ji, and use tp̄j1,··· ,ji to denote the total of the domain defined by p̄ w.r.t. pj1,··· ,ji. For example, tb1∧b2∧b3 refers to the sum of the target attribute over the domain specified by the domain clause b1 ∧ b2 ∧ b3, and tp refers to the answer to our query. Note that tj1,··· ,ji is an i-dimensional fact.

The next four Subsections describe how to extend the techniques of Section 3 to develop APA1+. A generalization to APAi+ for arbitrary values of i is given in Section 4.5.

4.1 Negative Estimators in APA1+

Just like APA0+, APA1+ relies on the idea of a negative estimator. However, unlike APA0+, APA1+ will make use of many negative estimators. In APA1+, we assume that we have access to all one-dimensional facts over the data set. In our running example, this means that the totals t1, t2, t3 are available. Using the notation given above, the positive estimator for the sum of the target attribute over the domain Dp is denoted by t̂p = t̂b1∧b2∧b3. An estimator for the sum of the target attribute over the domain of a negative clause can be defined similarly, for example:

t̂p̄1 = t̂b1∧b̄2∧b3 + t̂b1∧b2∧b̄3 + t̂b1∧b̄2∧b̄3

Since in APA1+ it is assumed that t1 is available, t1 and t̂p̄1 can easily be combined to form a new estimator for the answer to our query:

t̂′p1 = t1 − t̂p̄1

This is an example of a negative estimator in APA1+. In general:

Definition 4. t̂′pi = ti − t̂p̄i is the APA1+ negative estimator for tp, where p = b1 ∧ b2 ∧ · · · ∧ bm, and p̄i is the disjunction of all negative clauses with respect to the clause bi, 1 ≤ i ≤ m.

Thus, there are a total of m negative estimators available in APA1+. Based on the same argument as in Lemma 2, we have the following lemma:

Lemma 6. Each APA1+ negative estimator t̂′pi is an unbiased estimator for tp, the sum of the target attribute over Dp. The variance of t̂′pi is the same as the variance of t̂p̄i, i.e. V(t̂′pi) = V(t̂p̄i).

4.2 Combining Positive and Negative Estimators in APA1+

Just as in APA0+, the positive estimator and the negative estimators in APA1+ can be combined together to produce a new estimator with smaller variance.

The basic approach is similar to the method described in Subsection 3.2. We first introduce a new family of estimators based on linear combinations of the existing estimators. Formally, for a selection clause p = b1 ∧ b2 ∧ · · · ∧ bm, the new estimators are of the form:

t̂APA1 = α0 t̂p + α1 t̂′p1 + · · · + αm t̂′pm

where t̂p is the positive estimator, t̂′pi is the negative estimator for i > 0, and α0 + α1 + · · · + αm = 1. Clearly, t̂APA1 is unbiased as well:

Lemma 7. The new estimator t̂APA1 is an unbiased esti-mator for the domain total tp.

4.3 Minimizing the Variance of t̂APA1

Because several negative estimators can contain the estimate associated with each cell in the cube illustrated in Figure 1, they will be strongly correlated. In order to derive the appropriate parameters to minimize the variance of t̂APA1, we first decompose each negative estimator based on



the negative clause so that we can manage this correlation. For example, in our running example, the negative estimator t̂′p1 can be decomposed as:

t̂′p1 = t1 − (t̂b1∧b̄2∧b3 + t̂b1∧b2∧b̄3 + t̂b1∧b̄2∧b̄3)

By decomposing each APA1+ negative estimator in this fashion, we can group like terms in order to reduce the pair-wise correlation. The process is demonstrated in the following steps over the APA1+ estimator for our example query:

t̂APA1 = α0 t̂b1∧b2∧b3
+ α1(t1 − (t̂b1∧b̄2∧b3 + t̂b1∧b2∧b̄3 + t̂b1∧b̄2∧b̄3))
+ α2(t2 − (t̂b̄1∧b2∧b3 + t̂b1∧b2∧b̄3 + t̂b̄1∧b2∧b̄3))
+ α3(t3 − (t̂b̄1∧b2∧b3 + t̂b1∧b̄2∧b3 + t̂b̄1∧b̄2∧b3))
= α0 t̂b1∧b2∧b3 − (α1 + α2)t̂b1∧b2∧b̄3 − (α1 + α3)t̂b1∧b̄2∧b3 − (α2 + α3)t̂b̄1∧b2∧b3
− α1 t̂b1∧b̄2∧b̄3 − α2 t̂b̄1∧b2∧b̄3 − α3 t̂b̄1∧b̄2∧b3
+ α1t1 + α2t2 + α3t3

In this way, the APA1+ estimator is transformed into a linear combination of estimators for the totals on a set of non-overlapping domains. Since the domains are non-overlapping, for the moment we will assume that all of the estimators used in t̂APA1 are pair-wise independent. In reality, the estimators of the non-overlapping domain totals are correlated, and we will study correlation in Section 5. Using the pair-wise independence assumption, we can derive the variance for the new estimator:

V(t̂APA1) = α0² V(t̂b1∧b2∧b3) + (α1 + α2)² V(t̂b1∧b2∧b̄3) + (α1 + α3)² V(t̂b1∧b̄2∧b3) + (α2 + α3)² V(t̂b̄1∧b2∧b3) + α1² V(t̂b1∧b̄2∧b̄3) + α2² V(t̂b̄1∧b2∧b̄3) + α3² V(t̂b̄1∧b̄2∧b3)

To minimize the variance, in most cases we can simply use Lagrange multipliers to optimize the values of the various α parameters. This procedure involves using a linear solver to solve an m × m linear system, with cost O(m³). If the resulting parameters (αi) are invalid, i.e., some are negative, we can apply quadratic programming [13] to find the minimal variance. We refer to the estimator with the minimal variance as the optimal estimator, and denote it as t̂optAPA1.
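The minimization step can be sketched end-to-end: build the quadratic form above, then solve the Lagrange (KKT) linear system for the α values. This is a sketch with our own names; the cell variances below are illustrative placeholders, not values from the paper.

```python
# Minimize V(t_APA1) for m = 3 subject to a0 + a1 + a2 + a3 = 1, under the
# pairwise-independence assumption, via Lagrange multipliers.
cell_vars = [40.0, 10.0, 12.0, 15.0, 6.0, 7.0, 9.0]   # placeholder V's per cell

# Coefficient of (a0, a1, a2, a3) in each cell's term of the estimator
# (signs square away in the variance, so only magnitudes matter).
coeffs = [
    [1, 0, 0, 0],   # cell b1∧b2∧b3:     a0
    [0, 1, 1, 0],   # cell b1∧b2∧~b3:   -(a1 + a2)
    [0, 1, 0, 1],   # cell b1∧~b2∧b3:   -(a1 + a3)
    [0, 0, 1, 1],   # cell ~b1∧b2∧b3:   -(a2 + a3)
    [0, 1, 0, 0],   # cell b1∧~b2∧~b3:  -a1
    [0, 0, 1, 0],   # cell ~b1∧b2∧~b3:  -a2
    [0, 0, 0, 1],   # cell ~b1∧~b2∧b3:  -a3
]

# KKT system [2Q 1; 1^T 0][alpha; lam] = [0; 1], with Q = sum_j V_j c_j c_j^T.
dim = 4
A = [[0.0] * (dim + 1) for _ in range(dim + 1)]
b = [0.0] * dim + [1.0]
for v, c in zip(cell_vars, coeffs):
    for i in range(dim):
        for j in range(dim):
            A[i][j] += 2.0 * v * c[i] * c[j]
for i in range(dim):
    A[i][dim] = A[dim][i] = 1.0

def solve(A, b):
    """Tiny Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

alpha = solve(A, b)[:dim]                 # variance-minimizing weights

def variance(a):
    return sum(v * sum(ci * ai for ci, ai in zip(c, a)) ** 2
               for v, c in zip(cell_vars, coeffs))

print(alpha, variance(alpha))
```

If any αi came out negative, the quadratic-programming fallback mentioned above would be needed instead of the plain linear solve.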

4.4 APA1+ Confidence Intervals

As discussed before, in order to provide a confidence interval, we have to know the distribution of the estimator. Using the same argument as in Lemma 5, we have the following important property of the optimal estimator t̂optAPA1:

Lemma 8. The distribution of the optimal estimator t̂optAPA1 is normal if the distribution of each positive estimator for the total of each non-overlapping domain (the domain total specified by each conjunctive clause) is normal.

Thus, just as in APA0+, for sufficiently large n each positive estimator of the domain total is approximately normal. Therefore, the distribution of the new optimal estimator is also approximately normal, with variance V̂(t̂optAPA1).

The confidence interval of the new estimator can be derived in the same way as for the positive estimator; i.e., a 100(1 − α)% confidence interval (CI) for the domain total tp is:

[t̂opt_APA1 − z_{α/2} SE(t̂opt_APA1), t̂opt_APA1 + z_{α/2} SE(t̂opt_APA1)]

where z_{α/2} is the (1 − α/2)th percentile of the standard normal distribution.
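As a small illustration of this interval formula (the estimate and variance below are made-up numbers, not values from the paper):

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(estimate, variance, level=0.95):
    """Normal-approximation CI [est - z*SE, est + z*SE] per the formula above;
    the inputs would come from an APA-style estimator."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{alpha/2}
    se = sqrt(variance)
    return estimate - z * se, estimate + z * se

lo, hi = confidence_interval(2.1e6, 1.0e10, level=0.95)  # made-up inputs
```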

4.5 Generalizing the Method

The methods developed in this section can be generalized to utilize higher-dimensional facts. For example, we can define APA2+ to be the version of APA+ that makes use of facts having two clauses in their relational selection predicate: SUM(SALARY) WHERE (SEX = 'M' AND DEPARTMENT = 'ACCOUNT') = $2.1M. For APA2+ we have:

t̂_APA2 = α0 t̂_p + α12 t̂_{p12} + α13 t̂_{p13} + α23 t̂_{p23}

where α0 + α12 + α13 + α23 = 1 and 0 ≤ α12, α13, α23 ≤ 1. In a manner similar to APA1+, the optimal estimator for APA2+ is:

t̂opt_APA2 = α0 t̂_{b1∧b2∧b3} + α12 (t12 − t̂_{b1∧b2∧¬b3}) +
            α13 (t13 − t̂_{b1∧¬b2∧b3}) + α23 (t23 − t̂_{¬b1∧b2∧b3})
          = α0 t̂_{b1∧b2∧b3} − α12 t̂_{b1∧b2∧¬b3} −
            α13 t̂_{b1∧¬b2∧b3} − α23 t̂_{¬b1∧b2∧b3} +
            α12 t12 + α13 t13 + α23 t23

with each α chosen so as to minimize the variance.

It is also possible to extend the method to develop a hierarchical version of APA+ in which APA0+, APA1+, and perhaps higher-order pre-aggregation are used together. For example, to use APA0+, APA1+, and APA2+ together, we make use of the following linear combination:

t̂ = α t̂_p + α0 t̂′_p + α1 t̂_{p1} + α2 t̂_{p2} + α3 t̂_{p3} + α12 t̂_{p12} + α13 t̂_{p13} + α23 t̂_{p23}

where t̂_p is the positive estimator of the domain total tp; t̂′_p is the negative estimator using APA0+; t̂_{p1}, t̂_{p2}, and t̂_{p3} are the negative estimators of APA1+; and t̂_{p12}, t̂_{p13}, and t̂_{p23} are the negative estimators of APA2+. As in APA0+ and APA1+, we have the constraint that α + α0 + α1 + α2 + α3 + α12 + α13 + α23 = 1.

5. DEALING WITH CORRELATION

In Sections 3 and 4, we assumed that the estimators on the individual domains are pair-wise independent. However, correlation does exist among these estimators. In this section, we first derive a closed formula for the correlation (covariance) among them (Subsection 5.1). We then study how to incorporate the correlation factor into our estimators to improve their accuracy (Subsection 5.2).

5.1 Correlations among Estimators

In this subsection, we first study the correlation between the estimators of the domain totals specified by a clause and its negation. This corresponds to the situation in APA0+, where the combined estimator is based on such a pair of clauses. We then study the general case, in which we compute the correlation between two estimators of non-overlapping domain totals. This corresponds to the situation where higher-order pre-aggregation information is used by the APA+ methodology.

We capture the correlation by computing the covariance. The covariance (Cov) between two estimators t1 and t2 is defined as follows [3]:

Cov(t1, t2) = E(t1t2) − E(t1)E(t2)

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 0-7695-2570-9/06 $20.00 © 2006 IEEE


We begin by considering the case of APA0+, where we have a domain clause and its negation. Let t̂_p and t̂_{¬p} be the estimators of the totals for the domains specified by a selection clause p and its negation ¬p; tp and t_{¬p} are the true totals of these domains, respectively. Lemma 9 states the covariance of these two estimators:

Lemma 9. The covariance between the two estimators, t̂_p and t̂_{¬p}, is:

Cov(t̂_p, t̂_{¬p}) = −(tp t_{¬p}) (N − n) / (n(N − 1))

Proof: Please see our technical report [14]. �

Since N is typically very large, the covariance can be approximated as:

Cov(t̂_p, t̂_{¬p}) ≈ −(tp t_{¬p}) / n

Note that the covariance depends on the totals of the domains of p and its negation ¬p, which are themselves the estimation targets. We use the naive estimators in their place, giving:

Cov(t̂_p, t̂_{¬p}) ≈ −(t̂_p t̂_{¬p}) / n
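A quick simulation sketch of this covariance (the table, predicate, and sample sizes below are all synthetic): repeatedly sampling without replacement and forming the scale-up estimators for p and ¬p gives an empirical covariance close to the closed form of Lemma 9:

```python
import random

# Empirical check of Lemma 9 on a small synthetic table: draw many samples
# without replacement, form the naive scale-up estimators for p and not-p,
# and compare their empirical covariance with the closed form.
random.seed(7)
N, n, trials = 1000, 100, 5000
vals = [random.uniform(0, 10) for _ in range(N)]
in_p = [i % 3 == 0 for i in range(N)]                 # arbitrary clause p
tp = sum(v for v, b in zip(vals, in_p) if b)          # true total over p
t_notp = sum(vals) - tp                               # true total over not-p

est_p, est_notp = [], []
for _ in range(trials):
    idx = random.sample(range(N), n)
    sp = sum(vals[i] for i in idx if in_p[i])
    est_p.append(N / n * sp)                          # naive unbiased estimator
    est_notp.append(N / n * (sum(vals[i] for i in idx) - sp))

mp = sum(est_p) / trials
mn = sum(est_notp) / trials
emp_cov = sum((a - mp) * (b - mn) for a, b in zip(est_p, est_notp)) / trials
closed_form = -(tp * t_notp) * (N - n) / (n * (N - 1))
```

The two quantities agree to within simulation noise; note the covariance is always negative, since a sample that over-represents p must under-represent ¬p.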

In the general case, we have two estimators t̂_{p1} and t̂_{p2}, which estimate the totals for the domains specified by two non-overlapping clauses p1 and p2. Let t_{p1} and t_{p2} be the true sums of the target attribute over these domains, respectively. Lemma 10 shows that the covariance of these two estimators is actually the same as in the case where the clauses are negations of one another, i.e., p2 = ¬p1.

Lemma 10. The covariance between the two estimators, t̂_{p1} and t̂_{p2}, is:

Cov(t̂_{p1}, t̂_{p2}) = −(t_{p1} t_{p2}) (N − n) / (n(N − 1))

Proof: Please see our technical report [14]. �

Using the same technique as in the case of APA0+, we can approximate the covariance as follows:

Cov(t̂_{p1}, t̂_{p2}) ≈ −(t_{p1} t_{p2}) / n ≈ −(t̂_{p1} t̂_{p2}) / n

5.2 Using the Covariance

In this subsection, we study how to improve the new estimators developed in Sections 3 and 4 by taking into account the correlation among the individual estimators. Note that we only discuss the variance of these new estimators, since their expectation is not affected by correlation.

APA0+: For the estimator t̂_APA0 = α t̂_p + (1 − α) t̂′_p, 0 ≤ α ≤ 1, taking the correlation between t̂_p and t̂′_p into account, we have:

V(t̂_APA0) = V(α t̂_p + (1 − α)(T − t̂_{¬p}))
          = V(α t̂_p − (1 − α) t̂_{¬p})
          = α² V(t̂_p) + (1 − α)² V(t̂_{¬p}) − 2α(1 − α) Cov(t̂_p, t̂_{¬p})

To minimize the variance, we can use standard calculus, i.e., examine the first- and second-order derivatives with respect to α, to find the desired parameter.
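For this one-parameter case the minimization has a closed form: setting dV/dα = 0 gives α* = (V(t̂_{¬p}) + Cov) / (V(t̂_p) + V(t̂_{¬p}) + 2 Cov), a minimum whenever the denominator is positive. A hedged sketch with made-up numbers:

```python
# Sketch of the derivative-based minimization above. With
# V(a) = a^2*V1 + (1-a)^2*V2 - 2a(1-a)*Cov (Cov = Cov(t_p, t_notp)),
# dV/da = 0 gives a* = (V2 + Cov) / (V1 + V2 + 2*Cov); the second
# derivative 2*(V1 + V2 + 2*Cov) confirms a minimum when it is positive.
def optimal_alpha(v1, v2, cov):
    return (v2 + cov) / (v1 + v2 + 2 * cov)

def variance(a, v1, v2, cov):
    return a * a * v1 + (1 - a) ** 2 * v2 - 2 * a * (1 - a) * cov

v1, v2, cov = 4.0, 9.0, -1.5       # hypothetical V(t_p), V(t_notp), Cov
a_star = optimal_alpha(v1, v2, cov)
```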

APAi+: Recall that in Section 4, we assumed that the estimators on the domain clauses are pair-wise independent. As shown in Subsection 5.1, they are in fact correlated. The covariance between any two estimators on non-overlapping domains, t̂_{p1} and t̂_{p2}, can be approximated as:

Cov(t̂_{p1}, t̂_{p2}) ≈ −(t̂_{p1} t̂_{p2}) / n

Given this, we can incorporate the correlation factor into our new estimators as follows. Consider the new estimator based on APA2+ as given in Subsection 4.5. Taking correlation into account, its variance is:

V(t̂_APA2) = α0² V(t̂_{b1∧b2∧b3}) + α12² V(t̂_{b1∧b2∧¬b3}) +
            α13² V(t̂_{b1∧¬b2∧b3}) + α23² V(t̂_{¬b1∧b2∧b3}) −
            2α0α12 Cov(t̂_{b1∧b2∧b3}, t̂_{b1∧b2∧¬b3}) −
            2α0α13 Cov(t̂_{b1∧b2∧b3}, t̂_{b1∧¬b2∧b3}) −
            2α0α23 Cov(t̂_{b1∧b2∧b3}, t̂_{¬b1∧b2∧b3}) +
            2α12α13 Cov(t̂_{b1∧b2∧¬b3}, t̂_{b1∧¬b2∧b3}) +
            2α12α23 Cov(t̂_{b1∧b2∧¬b3}, t̂_{¬b1∧b2∧b3}) +
            2α13α23 Cov(t̂_{b1∧¬b2∧b3}, t̂_{¬b1∧b2∧b3})

Given this formula, an approach similar to the one given in Section 4 can be used to find the desired parameters to minimize the variance.

One issue that still needs to be explored is the effect of the covariance on the distribution of the various APA+ estimators. So far, we have relied on the CLT as justification for the normality of the APA+ estimators. However, after incorporating the correlation factor, the distribution of the combined estimator may not necessarily be normal, even if the CLT holds for each of the component estimators. It turns out that such distributions are very close to the class of spherically symmetric distributions [16], of which the normal distribution is a special case. An analytical formula for the distribution of the combined estimator is available in a technical report [14]. Though exact confidence intervals for such distributions are still hard to derive, our analysis and experimental evaluation of the empirical distribution [14] give strong evidence that normal distributions serve as a good approximation. Therefore, in the experimental results presented in Section 6, we simply use normal distributions to derive confidence intervals. Our experimental results validate this approach.

6. EXPERIMENTAL RESULTS

This section presents an experimental benchmark of the methods proposed in the paper. Traditionally, benchmarks of approximate query processing techniques have focused on testing the absolute or relative error of the estimates provided by an approximation methodology. While we are guilty of publishing a few benchmarks of this type ourselves [13], we assert that this style of benchmarking can be of limited practical value. In practice, any useful approximation methodology must be able to differentiate between the cases where estimation accuracy is likely to be good and those where it is likely to be poor. In other words, an accurate estimate is of little use unless the methodology can recognize that the estimate is very accurate and the user can



be notified accordingly.

As a result, this section focuses on studying the ability of APA+ to provide accurate estimates, and to recognize when those estimates are accurate. This is done by studying the confidence bounds provided by APA+. Specifically, our study focuses on two important questions:

1. How accurate are the confidence bounds provided by APA+? In other words, if APA+ guarantees a certain accuracy on an estimate, what are the real chances that the estimate meets those accuracy guarantees?

2. How tight are the bounds of our confidence intervals, as compared to traditional methods that could be used as an alternative to APA+? In other words, how much accuracy is APA+ able to guarantee, compared to the obvious alternatives?

6.1 Approximation Techniques Tested

In our experiments, we compare APA+ with two sampling-based alternatives. We concentrate on sampling for several reasons. Samples are widely used in the data management literature, and there is an extensive statistical literature relevant to associating confidence bounds with samples [24, 19]. Furthermore, sampling is arguably the most widely applicable approximation technique. Unlike other methods such as wavelets [26, 7, 8], samples are unaffected by data dimensionality, and work equally well with categorical and numerical data. Thus, the following approximation techniques were used in our comparison:

Simple Random Sampling: The sample is extracted from the original data set using sampling without replacement. The estimator is the naive unbiased estimator, and confidence bounds are derived using the central limit theorem (CLT).

Stratified Sampling: Stratification is a standard statistical technique that can be used to boost sampling-based estimation accuracy, and has previously been applied to problems in data management [4]. Stratified sampling works by partitioning the data set into a set of non-overlapping strata. In order to increase estimation accuracy, the partitions can be chosen so as to minimize subsequent estimation variance. In our experiments, we use a uniform allocation of samples to strata. Experiments were conducted with two different stratum sizes: 10 samples per stratum and 100 samples per stratum (in general, for a fixed number of total samples, using more strata and fewer samples per stratum tends to increase estimation accuracy). Given a uniform allocation, the partitions are chosen so as to minimize the variance of the estimate of an aggregate query over the entire data space.
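The uniform-allocation stratified estimator described above can be sketched as follows (the data, strata, and sizes are synthetic; this is an illustration of the technique, not the paper's implementation):

```python
import random

def stratified_total(strata, per_stratum):
    """Uniform-allocation stratified estimator of a total: sample the same
    number of rows from every stratum and scale each stratum's sample sum
    by N_h / n_h."""
    total = 0.0
    for rows in strata:
        n_h = min(per_stratum, len(rows))
        sample = random.sample(rows, n_h)
        total += len(rows) / n_h * sum(sample)
    return total

random.seed(1)
data = [random.uniform(0, 100) for _ in range(3000)]
data.sort()   # sorted strata are homogeneous, which lowers the variance
strata = [data[i:i + 300] for i in range(0, 3000, 300)]   # 10 equal strata
est = stratified_total(strata, per_stratum=100)
```

Because each sorted stratum spans only a narrow value range, the stratified estimate lands much closer to the true total than a single simple random sample of the same overall size typically would.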

APA0+: We also implemented and tested APA0+, as described in this paper. APA0+ uses zero-dimensional facts to produce confidence bounds for each query. Recall that zero-dimensional facts are facts of the form SUM(COMPLAINTS) = 123. The additional memory cost of this method is minimal compared to simple random sampling; e.g., in the data sets that we tested with, it was only a few bytes. APA0+ takes into account the covariance of its various sub-estimators while performing the estimation, as described in Sections 3 and 5.

APA1+: In this method, one-dimensional facts are used. Each one-dimensional fact has exactly one clause in

exp. %DB match   1 pred  2 pred  3 pred  4 pred  5 pred  6 pred  7 pred  8 pred
10%              4%      4%      4%      4%      4%      4%
1%               4%      4%      4%      4%      4%      4%
0.1%             4%      4%      4%      4%      4%      4%      4%
0.01%            4%      4%      4%      4%      4%      4%

Table 2: Distribution of Testing Queries

the WHERE predicate associated with the fact. For example, a one-dimensional fact is SUM(SALARY) WHERE (SEX='F') = $1.9M. The additional memory requirements of APA1+ compared to simple random sampling are still small, in practice only on the order of a few kilobytes. Just like APA0+, APA1+ accounts for the covariance that may exist between the estimator's sub-cubes (Figure 4).

6.2 Experimental Setup

The four data sets used in our benchmark are derived from four real high-dimensional data sets, previously used in [13]. The four data sets are: Forest Cover data (from the UCI KDD archive), River Flow data, William Shakespeare data (word proximity information), and Image Feature Vector data. The transformation of these data sets for our experiments was previously described in [13].

Some basic features of our experimental data sets are as follows. The first data set has 8 categorical attributes, and the last three have 30 categorical attributes each. These attributes are the ones that we use for selection. Each of the categorical attributes has five different categories. Each data set also has 2 numerical attributes, which are designated as our measure attributes. Selection attributes only appear in the relational selection predicates (WHERE clauses), whereas measure attributes appear in the aggregation functions of the SELECT clauses. Each SELECT clause is of the form SELECT SUM(ATT). We create a total of 2000 queries for each data set, 1000 for each measure attribute. The WHERE clause in each query is a conjunction of boolean equality predicates on various categorical values. The number of predicates in each such conjunction varies from one to eight, generated to vary the expected selectivity as shown in Table 2. The sampling rate used for our experiments was fixed at 10%.

6.3 Confidence Interval Accuracy

This subsection reports the results of a set of experiments designed to test the accuracy of the guarantees provided by each method. For each method and each query, a confidence interval is computed using a 95% user-specified confidence level. This confidence interval is considered to be correct if the exact answer to the query falls within the bounds of the interval. Since a user-specified confidence level of 95% is used, if each method works exactly as theoretically expected, we should find that 95% of the confidence intervals are correct. The actual percentage of correct intervals for each of the data sets and approximation methodologies is reported in Table 3.
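The correctness check described here can be sketched in a few lines (the intervals and answers below are toy values):

```python
def observed_confidence(intervals, exact_answers):
    """Fraction of confidence intervals that contain the exact query answer,
    i.e. the observed confidence reported for each method."""
    correct = sum(1 for (lo, hi), ans in zip(intervals, exact_answers)
                  if lo <= ans <= hi)
    return correct / len(intervals)

intervals = [(90, 110), (40, 60), (10, 20), (95, 105)]   # made-up CIs
answers = [100, 70, 15, 104]                             # made-up exact answers
rate = observed_confidence(intervals, answers)           # 3 of 4 are correct
```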

Discussion

The specified confidence for all of these methods was 95%, but it is clear from the results that only APA1+ comes close to achieving this confidence level experimentally. The confidence level achieved by APA1+ averaged 89.27% over all tests, which was 23.2% higher than simple random sampling (the next best alternative in terms of accuracy). The difference between APA1+ and the stratified-sampling-based techniques is even greater. Not only did APA1+ achieve the highest experimental confidence level, but it also comes very close to achieving the theoretically-expected 95% confidence based on the parameter settings (90+% confidence was observed for 5 out of the 8 cases tested).

Data Set      APA0+    APA1+    Sampling  ss = 10  ss = 100
Rivers Att 1  72.34%   96.80%   79.30%    74.80%   76.10%
Rivers Att 2  72.34%   92.57%   78.47%    74.85%   75.05%
Forest Att 1  71.73%   92.33%   64.75%    59.64%   59.53%
Forest Att 2  66.71%   92.57%   68.05%    57.82%   57.83%
Image Att 1   51.37%   77.37%   57.83%    51.59%   49.37%
Image Att 2   57.77%   82.97%   59.31%    54.17%   51.31%
Shakes Att 1  44.55%   90.14%   60.36%    56.14%   54.02%
Shakes Att 2  39.24%   89.40%   60.46%    55.03%   54.93%

Table 3: Observed Confidence over All Tests When 95% Confidence Is Specified (ss stands for strata size used in conjunction with the stratified sampling approach).

A few other points worth mentioning are:

• One possible explanation for the poor accuracy of the non-APA+ estimates is that the CLT-based intervals used turned out to be overly aggressive. The CLT is a limiting theorem, which is only guaranteed to hold given an infinite number of samples from a distribution. In practice, the CLT often holds after only a few dozen samples. However, under adverse circumstances (particularly with very skewed or bi-modal distributions) many more samples can be required. Given that several of our data sets were rather poorly behaved (particularly the Image Feature Vector data set), this may be a cause of the problems that are evident in Table 3. One possible solution would be to use less aggressive bounds (such as Chebyshev or Hoeffding bounds). However, while this would improve the accuracy of the bounds, it would also increase the width of the associated confidence intervals, which already do not compare favorably with APA1+ (see Section 6.4).

• A factor that seems to cause problems with all five alternatives is that the variance used in computing the accuracy of any sampling-based estimate is always itself an estimate, which may or may not be accurate. This is a difficult problem without an easy solution, and it is always present in practical statistical inference. The fact that stratified sampling demonstrates a confidence level lower than that of simple random sampling (with an average drop-off of 5.93% over all data sets) seems to support this observation. Since the size of each stratum is small, the variance estimate for each stratum is likely to be less accurate than for simple random sampling, which effectively uses only a single stratum. However, APA1+ seems to be the least affected by this problem.

• Finally, it is interesting to note that APA1+ performed by far the best, and APA0+ was on par with (or slightly worse than) stratified sampling in terms of the accuracy of the confidence level given. The key reason is that APA0+ uses only one negative estimator, whereas APA1+ uses 2m − 2. While one negative estimator can prove to be inaccurate, the degree of inaccuracy of multiple (APA1+) negative estimators is likely to be far less. Therefore, we see an improved accuracy when a linear combination of them is taken in t̂opt_APA1.

Data Set      Sampling  ss = 100  ss = 10
Rivers Att 1  3.72%     -7.43%    -15.15%
Rivers Att 2  18.34%    15.30%    0.67%
Forest Att 1  15.08%    12.12%    4.46%
Forest Att 2  29.10%    9.79%     6.87%
Image Att 1   21.70%    -4.26%    8.21%
Image Att 2   29.88%    -8.30%    -14.14%
Shakes Att 1  16.39%    8.84%     -17.01%
Shakes Att 2  31.18%    12.62%    6.79%

Table 4: Confidence Interval Width Compared to APA0+ (ss stands for strata size used in conjunction with the stratified sampling approach).

Data Set      Sampling  ss = 100  ss = 10
Rivers Att 1  50.93%    33.60%    21.60%
Rivers Att 2  58.76%    50.25%    30.13%
Forest Att 1  34.82%    30.97%    20.99%
Forest Att 2  43.45%    23.43%    20.50%
Image Att 1   49.26%    23.43%    15.04%
Image Att 2   56.00%    35.81%    6.19%
Shakes Att 1  62.36%    23.53%    13.58%
Shakes Att 2  55.97%    32.00%    23.09%

Table 5: Confidence Interval Width Compared to APA1+ (ss stands for strata size used in conjunction with the stratified sampling approach).

6.4 Confidence Interval Width

Not only should the confidence bounds provided by an estimation methodology be accurate, they should also be tight, in the sense that the actual answer to the query should be constrained to fall in a very small interval. In this subsection we analyze the tightness of the confidence interval bounds produced by the various methods. We perform pair-wise comparisons between each of APA0+ and APA1+ and each of the other three estimators, in order to determine which one produces the smallest confidence intervals. For both APA0+ and APA1+, we compute the median change in confidence interval width that would have been obtained had another estimator been used instead. Specifically, we compute the percentage change for those queries that both estimators being compared answered correctly (that is, both estimators produced a correct confidence interval). The reason we focus on cases where both estimators were correct is that we do not want to reward an estimator for producing a very tight bound on an incorrect estimate.

Table 4 illustrates the median change in width of the confidence intervals produced by each of the three traditional methods compared to APA0+. For example, if, for a given query, both estimators produce correct confidence intervals, with the size of the APA0+ interval denoted by CI0 and the size of the other method's confidence interval denoted by CI, then the change (decrease) in confidence interval width for that pair is (CI − CI0)/CI.
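As a toy illustration of this width-change metric (the interval widths below are made up): a positive value means the APA0+ interval is narrower.

```python
def width_decrease(ci_other, ci_apa0):
    """Relative decrease (CI - CI0)/CI in interval width when APA0+ is used
    in place of another method, for a query both methods answered correctly."""
    return (ci_other - ci_apa0) / ci_other

d = width_decrease(ci_other=200.0, ci_apa0=170.0)   # APA0+ interval 15% narrower
```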

Similarly, Table 5 shows the median change in width of the confidence intervals produced by each of the three traditional methods compared to APA1+.

Discussion

For the data sets tested, APA0+ always produces tighter confidence intervals than simple random sampling. Across all data sets, APA0+ produced confidence intervals that were on average 14.81% smaller than for simple random sampling. The average improvement with APA1+ is even better. APA1+ provides confidence intervals that are on average 51% smaller than for simple random sampling, which means that when both estimators are predicting correctly, APA1+ will produce a confidence interval half the size of that produced by random sampling. APA1+ is also clearly better than stratified sampling. The intervals provided by APA1+ were on average 31.63% smaller than those obtained using 100 samples per stratum, which translates into APA1+ intervals being about 2/3 the size of those produced by this particular stratified sampling approach. Comparing APA1+ to the stratified sampling approach using 10 samples per stratum, the average of the median decrease (across data sets) is 18.89%, which translates into APA1+ intervals being about 4/5 the size of their counterparts produced by stratified sampling. Furthermore, as demonstrated in the previous subsection, the APA1+ intervals are also far more accurate than the intervals produced using stratification.

The curves in both figures are normal distributions (on the histogram scale) with mean and variance derived from the APA0+ and APA1+ estimators, respectively. In other words, each point on a curve is the number of observations that would appear in a bucket-sized area under the corresponding normal distribution. These plots show that our estimators have a strong tendency towards normality.

7. CONCLUSIONS AND FUTURE WORK

The sampling-based estimators described in this paper are very useful in a database environment where it is possible to collect summary information in addition to the sample. The estimators have the advantage of being unaffected by data dimensionality, and are suitable for use with categorical and numerical data. Furthermore, since the information used by the estimators can be collected in a single pass over a data set, the estimators are suitable for use in a streaming environment.

One important issue that we have not considered in this paper is the use of the estimators for queries other than SUM and COUNT queries. In particular, extending our methods to work with AVERAGE queries is an important problem for future work. An AVERAGE query can be treated as the ratio of a SUM and a COUNT query, but since the two estimators will be correlated if they are computed from the same sample, it becomes necessary to use relatively pessimistic confidence bounds in the absence of a rigorous study of the correlation between the two estimators.

8. REFERENCES

[1] Swarup Acharya, Phillip B. Gibbons, and Viswanath Poosala. Congressional samples for approximate answering of group-by queries. SIGMOD Conference Proceedings, pages 487–498, May 2000.

[2] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approximate query processing. SIGMOD Conference Proceedings, pages 275–286, June 1999.

[3] George Casella and Roger L. Berger. Statistical Inference, 2nd Edition. Duxbury, 2001.

[4] Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. SIGMOD Conference Proceedings, 2001.

[5] Sumit Ganguly, Phillip B. Gibbons, Yossi Matias, and Abraham Silberschatz. Bifocal sampling for skew-resistant join size estimation. SIGMOD Conference Proceedings, pages 271–281, June 1996.

[6] Venkatesh Ganti, Mong-Li Lee, and Raghu Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. VLDB Conference Proceedings, pages 176–187, 2000.

[7] Minos Garofalakis and Phillip B. Gibbons. Wavelet synopses with error guarantees. SIGMOD Conference Proceedings, 2002.

[8] Minos N. Garofalakis and Amit Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS Conference Proceedings, pages 166–176, 2004.

[9] Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. SIGMOD Conference Proceedings, pages 331–342, June 1998.

[10] Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, and Carlotta Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. SIGMOD Conference Proceedings, pages 463–474, May 2000.

[11] Peter J. Haas and Joseph M. Hellerstein. Ripple joins for online aggregation. SIGMOD Conference Proceedings, pages 287–298, June 1999.

[12] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation. SIGMOD Conference Proceedings, 1997.

[13] Chris Jermaine. Robust estimation with sampling and approximate pre-aggregation. VLDB Conference Proceedings, pages 886–897, August 2003.

[14] Ruoming Jin, Leo Glimcher, Chris Jermaine, and Gagan Agrawal. New sampling-based estimators for OLAP queries. Technical report, CSE, The Ohio State University, June 2005.

[15] Norman L. Johnson, Samuel Kotz, and Adrienne W. Kemp. Univariate Discrete Distributions. Wiley Interscience, 1993.

[16] E. L. Lehmann and George Casella. Theory of Point Estimation, 2nd Edition. Springer-Verlag, 1998.

[17] Richard J. Lipton and Jeffrey F. Naughton. Query size estimation by adaptive sampling. PODS Conference Proceedings, pages 40–46, April 1990.

[18] Richard J. Lipton, Jeffrey F. Naughton, and Donovan A. Schneider. Practical selectivity estimation through adaptive sampling. SIGMOD Conference Proceedings, pages 1–11, May 1990.

[19] Sharon L. Lohr. Sampling: Design and Analysis. Duxbury Press, 1999.

[20] James F. Lynch. Analysis and application of adaptive sampling. PODS Conference Proceedings, pages 260–267, May 2000.

[21] Frank Olken and Doron Rotem. Simple random sampling from relational databases. VLDB Conference Proceedings, pages 160–169, August 1986.

[22] Carl-Erik Sarndal, Bengt Swensson, and Jan Wretman. Model Assisted Survey Sampling. Springer-Verlag, 1992.

[23] Jun Shao. Mathematical Statistics, 2nd Edition. Springer-Verlag, 2003.

[24] Steven K. Thompson. Sampling, 2nd Edition. Wiley, 2002.

[25] TPC. TPC-H benchmark. http://www.tpc.org/tpch/, 2004.

[26] Jeffrey Scott Vitter and Min Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Conference Proceedings, pages 193–204, June 1999.


