22nd IEEE International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, April 3-7, 2006

  • New Sampling-Based Estimators for OLAP Queries

    Ruoming Jin (Kent State University), Leo Glimcher (Ohio State University), Chris Jermaine (University of Florida), Gagan Agrawal (Ohio State University)

    jin@cs.kent.edu, {glimcher,agrawal}@cse.ohio-state.edu, cjermain@cise.ufl.edu

    ABSTRACT

    One important way in which sampling for approximate query processing in a database environment differs from traditional applications of sampling is that in a database, it is feasible to collect accurate summary statistics from the data in addition to the sample. This paper describes a set of sampling-based estimators for approximate query processing that make use of simple summary statistics to greatly increase the accuracy of sampling-based estimators. Our estimators are able to give tight probabilistic guarantees on estimation accuracy. They are suitable for low- or high-dimensional data, and work with categorical or numerical attributes. Furthermore, the information used by our estimators can easily be gathered in a single pass, making them suitable for use in a streaming environment.

    1. INTRODUCTION

    Interactive-speed evaluation of ad-hoc, OLAP-style queries is quite often an impossible task over the largest databases. Even using modern query processing techniques in conjunction with the latest hardware, OLAP-style queries can still require hours or days to complete execution. A quick perusal of the latest TPC-H benchmark supports this assertion [25].

    One obvious way to address this problem is to rely on approximation. If an approximate answer can be given at sub-second, interactive speeds, it may encourage the use of large databases for ad-hoc data exploration and analysis. One of the most often-cited possibilities for supporting approximation in a database environment is to rely on random sampling [12, 11, 1, 2, 5, 6, 9, 13, 18, 21, 20, 17]. Unlike other estimation methods such as wavelets [26, 7, 8] and histograms [10], sampling has the advantage that most estimates over samples are unaffected by data dimensionality, and sampling is equally applicable to categorical and numerical data. Another primary advantage of random sampling is that sampling techniques are well-understood, and sampling as a field of scientific study is very mature. Fundamental results from statistics can be used as a guide when applying sampling to data management tasks [24, 19, 23, 22].

    However, there is one important way in which sampling in a database environment differs from sampling as it has been studied in statistics and related fields. Specifically, most work from statistics makes the assumption that it is impossible to gather accurate summary statistics from the population that is to be sampled from. After all, in traditional applications like sociology and biology, sampling is used when it is impossible to directly study the entire population.

    In a database, this assumption is far too restrictive. There is no reason that accurate and extensive summary statistics over the underlying data set cannot be maintained. In databases, unlike in traditional statistical inference, the entire data set is available and can be pre-processed or processed in an online fashion as it is loaded. The reason that approximation may be required in a database environment is that there are too many possible queries to pre-compute the answer to every one, and it may be too expensive to evaluate every single query to get an exact answer. This does not imply that it is impractical to maintain a very comprehensive set of summary statistics over the data.

    A recent paper [13] proposed a new technique called Approximate Pre-Aggregation (APA) that makes use of such summary statistics to greatly increase the accuracy of sampling-based estimation in a database environment. APA is useful for providing extremely accurate answers to aggregate queries (SUM, COUNT, AVG) over high-dimensional database tables, with a complex relational selection predicate attached to the aggregate query. For example, imagine we have a database table COMPLAINTS (PROF, SEMESTER, NUM_COMPLAINTS) and the SQL query:

    SELECT SUM (NUM_COMPLAINTS)
    FROM COMPLAINTS
    WHERE PROF = 'Smith'
      AND SEMESTER = 'Fa03';

    This query asks: "How many complaints did Prof. Smith receive in the Fall 2003 semester?" APA was shown experimentally to produce estimates for the answer to this type of query that have only a small fraction of the error associated with other techniques such as stratified random sampling and wavelets, especially over categorical data.

    However, there is one glaring weakness with APA: it is difficult or impossible to formally reason about the accuracy of the approach. In particular, while APA was shown experimentally to produce excellent results, there is no obvious way to associate statistically meaningful confidence bounds with the estimates produced by APA. A confidence bound is an assertion of the form, "With a probability of .95, Prof. Smith received 27 to 29 complaints in the Fall of 2003." The lack of confidence bounds poses a severe limitation on the applicability of APA to real-life estimation problems.

    This Paper's Contributions: In this paper, we propose a new method for combining summary statistics with sampling to improve estimation accuracy, called APA+. APA+ is in many ways simpler than APA, and yet it has the overwhelming advantage of giving statistically meaningful confidence bounds on its estimates. As we will show experimentally in this paper, APA+ produces confidence bounds that are tighter and more accurate than those that are produced using more traditional, sampling-based estimators. Furthermore, computing an APA+ estimate is computationally very efficient, which renders APA+ an excellent candidate for use in applications like online aggregation [12], where an estimate must be re-computed every second or so in order to inform a user as to the current accuracy of the estimate.

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 0-7695-2570-9/06 $20.00 © 2006 IEEE

  •  S   Prof.   Sem.    Cmpl.
         Adams   Fa 02   3
     ✓   Smith   Su 01   7
         Jones   Fa 02   2
     ✓   Smith   Sp 01   8
         Adams   Sp 02   9
         Adams   Fa 00   4
         Jones   Sp 02   2
         Smith   Su 01   7
     ✓   Smith   Sp 02   21
         Smith   Fa 00   33
         Smith   Fa 01   36
         Adams   Su 00   3
         Jones   Su 01   1
         Jones   Su 00   0
         Adams   Su 01   2
         Jones   Sp 99   1

    Table 1: Number of complaints over three years

    Given the accuracy and wide applicability of the techniques presented in the paper, we assert that APA+ should be used in place of traditional, sampling-based estimation in any environment where it is possible to augment a sample with simple summary statistics over a database.

    Paper Organization: Section 2 of the paper gives an overview of APA and our new method, APA+. The simplest version of APA+ is formally described in Section 3. Section 4 generalizes APA+ to work with more complicated and comprehensive summary statistics, and Section 5 discusses the issue of correlation in the individual estimates that make up the final APA+ estimator. Section 6 presents our experimental evaluation. Finally, the paper is concluded in Section 7.

    2. OVERVIEW OF APA AND APA+

    In this section, we will describe both APA and APA+ at a high level in the context of the example database table depicted in Table 1 [13].

    This table models the situation where a number of students take a course each semester, and the number of student grading complaints is recorded.

    2.1 Approximate Pre-Aggregation (APA)

    In our example, we are interested in answering a query of the form:

    SELECT SUM (NUM_COMPLAINTS)
    FROM COMPLAINTS
    WHERE PROF = 'Smith'

    Imagine that we decided to answer this query using a 50% sample of the database, where the sampled tuples are indicated by a check mark in Table 1. Estimating the answer to the query using the sample is very straightforward. The number of complaints in the sample is 36, and since the sample constitutes 50% of the database tuples, we can estimate that Professor Smith received 72 complaints.
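The scale-up arithmetic above can be sketched in a few lines of Python. This is a minimal illustration only; the 50% sampling fraction comes from the text, while the specific sampled tuples listed below (Smith complaint counts of 7, 8, and 21, which sum to the stated 36) are our assumption about which rows carry the check marks.

```python
# Hypothetical sampled tuples (prof, semester, num_complaints) from the
# running example; which specific tuples were sampled is assumed here.
sampled = [
    ("Smith", "Su 01", 7),
    ("Smith", "Sp 01", 8),
    ("Smith", "Sp 02", 21),
    ("Adams", "Fa 02", 3),
    ("Jones", "Fa 02", 2),
]

SAMPLING_FRACTION = 0.5  # the sample holds 50% of the table's tuples

def estimate_sum(sample, fraction, prof):
    # Classical scale-up estimator: sum the matching complaints in the
    # sample, then divide by the sampling fraction.
    return sum(c for p, _, c in sample if p == prof) / fraction

print(estimate_sum(sampled, SAMPLING_FRACTION, "Smith"))  # -> 72.0
```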

    However, as we see from Table 1, the actual number of students who came to see Professor Smith was 121 (yielding 40.5% relative error). The problem is the variance in the number of students who complained to Professor Smith each semester. This ranges from a low of 7 to a high of 36, and it happens that our sample missed the two semesters when the greatest number of students complained to Professor Smith.
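This vulnerability to variance is exactly what widens a traditional, CLT-style confidence bound. The sketch below is illustrative, not any estimator from this paper: the per-tuple sample values are assumed (three sampled Smith tuples contributing 7, 8, and 21, plus five sampled tuples that fail the predicate and contribute 0) in a table of N = 16 tuples.

```python
import math

def clt_bound(per_tuple, N, z=1.96):
    # Scaled-up SUM estimate with a 95% CLT half-width. Each entry of
    # per_tuple is a sampled tuple's contribution to the predicated SUM
    # (0 if the tuple fails the WHERE clause). The finite-population
    # correction is omitted for simplicity.
    n = len(per_tuple)
    mean = sum(per_tuple) / n
    s2 = sum((x - mean) ** 2 for x in per_tuple) / (n - 1)  # sample variance
    return N * mean, z * N * math.sqrt(s2 / n)

# Three sampled Smith tuples (7, 8, 21) plus five sampled non-matching
# tuples, drawn from the N = 16 tuples of COMPLAINTS (assumed split).
est, half = clt_bound([7, 8, 21, 0, 0, 0, 0, 0], N=16)
print(f"{est:.0f} +/- {half:.0f}")  # -> 72 +/- 83
```

With these assumed values the bound is very loose (72 plus or minus 83), mirroring the 40.5% relative error noted above: the large spread in Smith's per-semester counts inflates the sample variance.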

    APA makes use of additional summary statistics to reduce sampling's vulnerability to variance. The APA process begins by using the relational selection predicate of the query to divide the data space into 2^n quadrants, where n is the number of clauses in the predicate. APA then associates a probability density function with each quadrant, and uses the additional summary information that is available to create a set of constraints on the means of these distributions. A Maximum Likelihood Estimation (MLE) is then used to adjust the means in order to correct violations of the constraints. Depending on the dimensionality of the summary information used, different variations of APA are possible. For example, APA0 makes use of zero-dimensional facts about the data, or assertions of the form (SUM (COMPLAINTS)) = 148. APA1 makes use of one-dimensional facts about the data, or assertions of the form (SUM (COMPLAINTS)) WHERE (SEMESTER = 'Fa02') = 5, which have one clause in their relational selection predicate.
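The flavor of such a correction can be shown with a simple "complement" computation that exploits the zero-dimensional fact SUM(COMPLAINTS) = 148. This is a simplified sketch, not APA's MLE machinery; the sampled complaint counts below are assumptions chosen to match the running example.

```python
TOTAL_COMPLAINTS = 148  # known zero-dimensional fact over the whole table
FRACTION = 0.5          # 50% sample

smith_sample = [7, 8, 21]       # assumed sampled Smith complaint counts
other_sample = [3, 2, 4, 1, 2]  # assumed sampled non-Smith complaint counts

# Plain scale-up estimator for Smith's total.
naive = sum(smith_sample) / FRACTION

# Complement: scale up the non-Smith complaints and subtract them from the
# known grand total, so the estimate inherits the (much smaller) variance
# of the non-Smith tuples.
complement = TOTAL_COMPLAINTS - sum(other_sample) / FRACTION

print(naive, complement)  # -> 72.0 124.0
```

With these assumed values, the complement estimate (124) lands far closer to the true total of 121 than the plain scale-up (72), because the non-Smith complaint counts vary much less than Smith's.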

    2.2 APA+

    The fundamental weakness of APA is a lack of a formal analysis of the accuracy of the approach, which in turn precludes analytically-derived confidence bounds on APA estimates. Such an analysis is difficult for several reasons:

    1. First, it can be difficult in general to reason about the accuracy of maximum likelihood estimators. While MLEs are usually Minimum Variance Unbiased Estimators (MVUEs) and are typically normally
