
End-biased Samples for Join Cardinality Estimation

Cristian Estan, Jeffrey F. Naughton
Computer Sciences Department
University of Wisconsin-Madison

Abstract

We present a new technique for using samples to estimate join cardinalities. This technique, which we term "end-biased samples," is inspired by recent work in network traffic measurement. It improves on random samples by using coordinated pseudo-random samples and retaining the sampled values in proportion to their frequency. We show that end-biased samples always provide more accurate estimates than random samples with the same sample size. The comparison with histograms is more interesting: while end-biased histograms are somewhat better than end-biased samples for uncorrelated data sets, end-biased samples dominate by a large margin when the data is correlated. Finally, we compare end-biased samples to the recently proposed "skimmed sketches" and show that neither dominates the other, and that each has different and compelling strengths and weaknesses. These results suggest that end-biased samples may be a useful addition to the repertoire of techniques used for data summarization.

1. Introduction

Often it is desirable to use synopses (for example, histograms) to estimate query result sizes. Synopses work by first processing a data set, computing a synopsis that is smaller than the original data, then using this synopsis at run time to quickly generate an estimate of the cardinality of a query result. In this paper we consider a challenging scenario for estimating join cardinalities, one in which joins are not key-foreign key, and for which building multi-dimensional histograms on the join attributes is not feasible (perhaps because the space of all possible joins is too large, perhaps because the tables being joined are geographically distributed). As one example of a situation in which such joins arise, consider a network monitoring setting in which there are hundreds or thousands of sites storing logs of their traffic. An analyst may decide to join the logs of any pair of sites on non-key attributes, presumably to correlate properties of traffic at the first site with properties of traffic at the second.

When multi-dimensional histograms are not feasible, an obvious approach to this estimation problem is to fall back on single dimensional histograms. Unfortunately, while single dimensional histograms work superbly for range queries, as we show in this paper, for join queries they fail miserably if the data in the join attributes of the input relations is correlated. Because of this, it makes sense to consider the alternative approach of retaining a sample of the values in the join attributes, because such samples are likely to reflect the correlations in the attributes. But the problem with such an approach is that for good estimates we may have to use samples that are impractically large.

All work on synopsis-based approaches to join size estimation must be viewed in the context of Alon et al.'s [2] fundamental result, which implies that for worst-case data sets it is impossible to do much better than simple random sampling. This does not mean that we cannot improve on simple random sampling; not all data sets are worst case! It does mean that future progress in the field will likely be marked by exploring where existing and newly proposed techniques are strong and where they are weak, and by analyzing the tradeoffs between techniques to gain insight as to when each can be profitably employed, not by finding a "silver bullet" technique that is always substantially better than any other.

In this spirit we propose a new way of building a sample summary, inspired by recent work in the network monitoring community, which we term "end-biased samples." End-biased samples use three key ideas to improve upon simple random samples. First, in a way reminiscent of end-biased histograms, they store exact frequencies for the most frequent values in the join attribute. This captures the frequent values in the "head" of the distribution of the values in the join attribute. Second, for the less frequent values in the tail of the distribution, we sample those values, but retain them in the sample with a probability proportional to their frequency. For values that are retained in the sample, we store their exact frequency. Third, when deciding which values in the tail of the distribution to retain in the sample, we use the same universal-2 hash function for both input tables, so that a value that is kept in the sample in one table is likely to also appear in the sample for the other table.

We show that end-biased samples always provide an unbiased estimator of the join cardinality, and give analytical formulas for the variance of this estimator under several assumptions about the input data. We also perform an empirical study of the tradeoffs between end-biased samples and end-biased histograms. This study shows that while end-biased histograms perform slightly better than end-biased samples for uncorrelated data, they perform much worse if the data is correlated (either positively or negatively). These results suggest that end-biased sampling may be a useful technique to add to the statistical summary tools used by query optimizers.

Finally, we compare end-biased samples to the recently proposed "skimmed sketches" [12]. The comparison is interesting. For small memory budgets, end-biased samples dominate; in fact, in such configurations, skimmed sketches appear to have a bias in their estimate. For larger memory budgets, skimmed sketches give lower errors than end-biased samples. Furthermore, unlike skimmed sketches, end-biased samples can estimate the sizes of an important class of select-join queries; conversely, unlike end-biased samples, skimmed sketches are effective in a "read once" streaming environment; finally, end-biased samples are simpler to configure and require less "tuning" than skimmed sketches to return accurate results.

2. Related Work

A great deal of work is relevant to end-biased samples for join cardinality estimation. There is a long and distinguished history of papers dedicated to the design and evaluation of more accurate histograms for various estimation problems, including [18, 22, 25, 27, 28]. While some of these histograms could give somewhat better estimates than the end-biased equi-depth histograms we consider in this paper in some scenarios, the main conclusion, that single dimensional histograms are problematic for join cardinality estimation in the presence of correlated data, still holds. The problem arises whenever a uniform distribution is assumed within a bucket, as all these histograms do. Multi-dimensional histograms [3, 16] fare better than their single dimensional counterparts for join cardinality estimation when they are feasible (that is, when it is feasible to compute and store a join histogram for every pair of tables that might be joined).

With respect to error bounds and histograms, Ioannidis and Christodoulakis [19] considered the problem of which histograms give the best performance for a restricted class of multiway joins. Jagadish et al. [20] give quality-of-estimate guarantees for histograms in the context of range selection queries. Neither paper considers tradeoffs between sample-based synopses and histograms.

Two orthogonal lines of research are sketches [2, 5] and wavelets [13, 14]. It would be an interesting area for future work to explore how wavelets compare to end-biased samples for join cardinality estimation. In this paper we provide a comparison with the current "champion" of the sketch-based approach to join-size estimation, "skimmed sketches" [12].

There is also a large body of literature dedicated to the use of sampling techniques in database systems. Much of that work, including [4, 17, 23, 26], is devoted to issues that arise when using samples at runtime to give approximate answers while avoiding scanning the entire input to a query. Maintaining samples as a way of summarizing a large data set has been considered before as a technique for giving approximate answers to queries. In particular, Acharya et al. [1] consider using join synopses for approximate query answering in the presence of joins. Like multi-dimensional histograms, their approach requires actually computing the join in order to build the summary, hence it is also problematic for large numbers of joins and distributed data. Recently, Kaushik et al. [21] presented a theoretical treatment of how accurate one can expect summaries to be for various classes of queries. While they consider different techniques (join synopses rather than end-biased samples) and queries (key-foreign key joins rather than arbitrary joins), their results are consistent with ours. In fact, in general terms their results prove that there must be distributions for which single-dimensional histograms fail, and we show a specific kind of distribution for which they fail (correlated join attributes).

In work on sample-based synopses, Gibbons and Matias [15] proposed "counting samples." In related work, various rules have been proposed for choosing the sampling probability [10, 24], offering various memory usage or accuracy guarantees. In Section 5.2.2 we show that join size estimates from counting samples can have much higher variance than those from our end-biased samples.

Duffield et al. [6] estimate the traffic of various network traffic aggregates based on a sample of a collection of flow records produced by routers. Our technique is inspired by theirs: they keep all records whose traffic is above a threshold and sample those below the threshold with probability proportional to their traffic, which enables low-variance estimates from small samples. They did not consider the join cardinality estimation problem, hence did not consider using correlated samples.

Flajolet [11] proposed an algorithm for estimating the number of distinct values of an attribute in a database which uses a sample of these values based on a hash function. Estan et al. [8] proposed using these samples to estimate the number of distinct flows in various traffic aggregates. Duffield and Grossglauser [7] used a shared hash function to correlate traffic sampling decisions at routers throughout a network. Our use of a shared hash function to build the end-biased samples is inspired by the use of hash functions in these approaches, although they also did not explore the application of such techniques to join cardinality estimation.

3. End-biased sampling

The usage of the method we propose is very similar to the usage of single dimensional histograms for join size estimation: we build a compact summary for each important attribute of a table; when joining two tables on a certain attribute, we estimate the join size based on the two end-biased samples. We build the end-biased samples independently, in that the choices of which values to sample in one table are not influenced by the frequency of values in the other. As we will see, we use the same hash function to make correlated pseudo-random decisions for both tables, but the construction of the samples does not require the system to even specify the join to be estimated at the time the sample is built. This is an advantage in distributed settings.

3.1. Constructing end-biased samples

To build the end-biased sample for a given attribute of a table, we need the full list of the repeat counts for each value of the attribute. End-biased sampling picks some of the entries in this list based on two core ideas: preferentially sampling values that repeat more often, and correlating sampling decisions at different tables to help find infrequent values that occur in both.

Frequent values can have a major effect on join sizes, so it is important that we keep them in our sample. Thus we bias our sample towards keeping the frequent values, in a manner similar to end-biased histograms. We use the following rule for the sampling probability p_v of a given value v: if the frequency f_v of value v is above a threshold T, we keep (v, f_v) in our sample; otherwise we keep it with probability p_v = f_v/T. The threshold T is a parameter we can use to trade off accuracy and sample size: the higher it is, the smaller the sample; the lower it is, the more accurate our estimates will be. We propose that the size of the sample be decided ahead of time and that, during the construction of the end-biased sample, the threshold be increased until the sample is small enough.

Infrequent values will have a small probability of being sampled. If the decisions for sampling infrequent values in the two tables are statistically independent, the probability of finding values that occur infrequently in both tables is very low. While these values do not contribute large amounts to the join size, if there are many of them, their contributions add up. Within the sampling probabilities given by the rule from the previous paragraph, we want the sampling decisions to be correlated: the same infrequent values should be picked for all tables. We achieve this by basing the sampling decisions on a common hash function h which maps values uniformly to the range [0, 1]: if h(v) ≤ p_v we sample (v, f_v), otherwise we do not. The end-biased sampling processes for different tables only need to share the seed of the hash function to correlate their sampling decisions. We will choose our hash function from a family of strongly 2-universal hash functions, and thus guarantee that the sampling decisions for any pair of values are independent.
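As a concrete illustration of this construction, the following minimal sketch (our own, not the authors' released code; the function names and the salted cryptographic hash standing in for a strongly 2-universal family are illustrative assumptions) builds an end-biased sample from a value-to-frequency map:

import hashlib

def shared_hash(value, seed):
    """Map a value pseudo-randomly to [0, 1). The paper uses a strongly
    2-universal hash family; a salted cryptographic hash is used here
    only as a convenient stand-in."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64

def build_end_biased_sample(freqs, threshold, seed):
    """freqs: dict mapping attribute value -> frequency (repeat count).
    Keep (v, f_v) exactly if f_v >= threshold, otherwise keep it with
    probability f_v / threshold, using the shared hash so that both
    tables make correlated decisions about the same values."""
    sample = {}
    for v, f in freqs.items():
        p = min(1.0, f / threshold)
        if shared_hash(v, seed) <= p:
            sample[v] = f
    return sample

Because both tables call build_end_biased_sample with the same seed, a tail value kept for one table is also kept for the other whenever both of its sampling probabilities exceed its hash value.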

3.2. Estimating join sizes

Let A and B be two tables to be joined. We use a_v and b_v to denote the frequency (repeat count) of value v in the join attribute of tables A and B respectively. Suppose that we have two end-biased samples that were constructed on the common attribute with thresholds T_a and T_b. Then we can compute our estimate Ŝ of the join size S by summing the contributions c_v of the values present in both samples, where the contributions are computed as follows.

$$
c_v =
\begin{cases}
a_v b_v & \text{if } a_v \ge T_a \text{ and } b_v \ge T_b \\
T_a b_v & \text{if } a_v < T_a \text{ and } b_v \ge T_b \\
a_v T_b & \text{if } a_v \ge T_a \text{ and } b_v < T_b \\
a_v b_v \cdot \max\!\left(\dfrac{T_a}{a_v}, \dfrac{T_b}{b_v}\right) & \text{if } a_v < T_a \text{ and } b_v < T_b
\end{cases}
\qquad\qquad
\hat{S} = \sum_v c_v \quad (1)
$$
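A direct transcription of Equation 1 into code might look as follows; this is a hedged sketch building on the hypothetical build_end_biased_sample above, not the authors' implementation:

def estimate_join_size(sample_a, T_a, sample_b, T_b):
    """sample_a, sample_b: dicts mapping value -> exact frequency, as
    produced by build_end_biased_sample with thresholds T_a and T_b."""
    estimate = 0.0
    for v, a_v in sample_a.items():
        if v not in sample_b:
            continue
        b_v = sample_b[v]
        if a_v >= T_a and b_v >= T_b:
            estimate += a_v * b_v            # both frequent: exact contribution
        elif a_v < T_a and b_v >= T_b:
            estimate += T_a * b_v            # v sampled with probability a_v / T_a
        elif a_v >= T_a and b_v < T_b:
            estimate += a_v * T_b            # v sampled with probability b_v / T_b
        else:
            estimate += a_v * b_v * max(T_a / a_v, T_b / b_v)
    return estimate

Only values present in both samples contribute, matching the summation over v in Equation 1.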

Lemma 1. Estimates of join sizes computed through Equation 1 are unbiased.

Note that Equation 1 ensures that if the actual join is empty, the estimate will always be 0. For all cases we can compute the variance given the full histograms on the common attribute for tables A and B and the thresholds T_a and T_b used for computing the end-biased samples:

$$
\mathrm{VAR}[\hat{S}] = \sum_v \mathrm{VAR}[c_v] + \sum_{v_1 \neq v_2} \mathrm{COV}[c_{v_1}, c_{v_2}].
$$

By choosing our hash function h from a family of strongly 2-universal hash functions, we ensure that the sampling decisions for two distinct values v_1 and v_2 are independent, so the covariance of their contributions to the estimate is COV[c_{v_1}, c_{v_2}] = 0. We can compute Δ_v = VAR[c_v] = E[c_v^2] − E[c_v]^2 = (1/p_v − 1)(a_v b_v)^2, where p_v is the probability that the value v is sampled for both tables. From Δ_v we obtain the formula for VAR[Ŝ]:

$$
\Delta_v =
\begin{cases}
0 & \text{if } a_v \ge T_a,\ b_v \ge T_b \\
\left(\dfrac{T_a}{a_v} - 1\right) a_v^2 b_v^2 & \text{if } a_v < T_a,\ b_v \ge T_b \\
\left(\dfrac{T_b}{b_v} - 1\right) a_v^2 b_v^2 & \text{if } a_v \ge T_a,\ b_v < T_b \\
\left(\max\!\left(\dfrac{T_a}{a_v}, \dfrac{T_b}{b_v}\right) - 1\right) a_v^2 b_v^2 & \text{if } a_v < T_a,\ b_v < T_b
\end{cases}
\qquad\qquad
\mathrm{VAR}[\hat{S}] = \sum_v \Delta_v \quad (2)
$$

Note that if we made the sampling decisions independently at random for the two tables, in the fourth case p_v would have been (a_v/T_a) · (b_v/T_b) < min(a_v/T_a, b_v/T_b), and thus the variance of the estimate would have been higher.

3.3. Updating end-biased samples

While Section 3.1 gives a procedure for constructing an end-biased sample from the frequency distribution of an attribute, it does not give a procedure for updating the sample as tuples are added to or removed from the table. Due to space constraints we do not discuss updating end-biased samples in detail here; we just note that in general updates can be processed efficiently if the system maintains some data structure (e.g., an index) that records the full frequency distribution on the join attribute. (Such a structure must have been built, at least implicitly, when the end-biased sample was originally constructed.) The only kind of update that is expensive is one that causes the sampling threshold to be lowered, since then the entire attribute distribution must be re-sampled to determine what gets added to the set of frequent values.

4. Variance Bounds

Given the thresholds used by end-biased sampling, we can compute the variance of the join size estimates for any pair of tables for which we have the full list of the frequencies of all values. In this section we derive bounds on the variance of these estimates that do not depend on such detailed knowledge. To get useful bounds, we need to constrain the distribution of the frequencies of values in the two tables in a meaningful way. In this section we work with progressively stronger constraints that give progressively tighter bounds: first we limit the actual join size, next we cap the frequency of individual values in both tables, and last we compute a bound that relies on the exact distribution of the frequencies that are above the threshold. Our bounds hold for all possible distributions of values in the two tables that fit within the constraints. Note that these are not bounds on the errors, but on the variance of join size estimates.

We will use an example throughout this section to illustrate these bounds numerically: we assume two tables whose join has 1,000,000 tuples, and we use end-biased samples with a threshold of 100 to estimate the size of this join. While the bounds are on the variance of the estimate VAR[Ŝ], in our numerical examples we report the average relative error SD[Ŝ]/S.

4.1. Limit on join size

Lemma 2. For two tables A and B whose end-biased samples on a common attribute are computed with thresholds T_a and T_b respectively, VAR[Ŝ] ≤ (max(T_a, T_b) − 1)S^2.

We note that this variance may be too large to be useful in practice. The standard deviation of the estimate SD[Ŝ] can be as large as sqrt(max(T_a, T_b) − 1) times S, which can amount to orders of magnitude for configurations we consider practical. For our numerical example, which uses thresholds of 100, this lemma limits the average error of the estimate of the join size to no more than 995%, which means that it is common for estimates to be off by a factor of 10. On the other hand, this extreme variance is achieved when the entire join is due to a single value that repeats S times in one of the tables and once in the other. We expect any method that doesn't use knowledge about the frequent values in one table when constructing the summary of the other, and doesn't use summaries with memory requirements comparable to the number of distinct values, to have large errors on such an input.

4.2. Limits on value frequencies

Lemma 3. Let A and B be two tables with a common attribute for which we have an upper bound on the frequencies of values in the two tables: ∀v, a_v ≤ M_a and ∀v, b_v ≤ M_b. The variance of their join size estimate based on end-biased samples with thresholds T_a and T_b respectively is bounded by VAR[Ŝ] ≤ max((T_a − 1)M_b, (T_b − 1)M_a) · S.

Note that the ratio SD[Ŝ]/S is then bounded by sqrt(max((T_a − 1)M_b, (T_b − 1)M_a)/S), so for some settings, especially when the actual join size is large and the thresholds and the maximum frequencies are low, Lemma 3 can guarantee low relative errors. If for the tables in our numerical example we know that no value repeats more than M = 1,000 times in either, we can bound the average relative error to 31.5%. If all values are below the threshold of 100, using Corollary 3.1 we get an even smaller average error: 9.9%.

Corollary 3.1. If the frequencies of all values are below the end-biased sampling thresholds, the join size estimate variance is bounded by VAR[Ŝ] ≤ (T_a − 1)(T_b − 1)S.
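For concreteness, the numbers quoted in the running example follow directly from these bounds; the arithmetic below is ours, using S = 10^6, T_a = T_b = 100, and M_a = M_b = 1,000:

$$
\begin{aligned}
\text{Lemma 2:}\quad & SD[\hat{S}]/S \le \sqrt{\max(T_a, T_b) - 1} = \sqrt{99} \approx 9.95 \approx 995\%,\\
\text{Lemma 3:}\quad & SD[\hat{S}]/S \le \sqrt{\max((T_a - 1)M_b, (T_b - 1)M_a)/S} = \sqrt{99{,}000/10^6} \approx 0.315 \approx 31.5\%,\\
\text{Corollary 3.1:}\quad & SD[\hat{S}]/S \le \sqrt{(T_a - 1)(T_b - 1)/S} = 99/1000 = 9.9\%.
\end{aligned}
$$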


We can further strengthen these bounds by taking into account the actual frequencies of the values whose frequencies are above the threshold. These bounds are presented in the technical report version of this paper [9]. As a numeric example of this stronger bound, if we know not only that the most frequent value in both tables repeats 1,000 times, but also that the second most frequent repeats 500 times, the third 333 times, and the i-th 1,000/i times (a Zipf distribution with α = 1), the bound on the average error is 12.3%.

5. Theoretical comparison

Once we have results on the accuracy of the join size estimates provided by end-biased samples, it is incumbent on us to compare the accuracy of these estimates with that of other solutions to the same problem. We first explore analytical comparisons, but due to the very different nature of some of the solutions, we use experiments in Section 6 to perform comparisons that are infeasible otherwise.

5.1. Worst case comparison

Alon et al. have shown [2] that when estimating the join size of two relations with n tuples whose join size is at least B, there are some input data distributions for which all methods that summarize the relations separately need to use at least (n − sqrt(B))^2/B bits to guarantee accurate join size estimates with high probability. Furthermore, they show that simply taking random samples from both tables achieves this bound within a small multiplicative factor. More exactly, using a sample of cn^2/B tuples can ensure accurate estimates with high probability, where c > 3 is a constant that depends on the desired accuracy and confidence. Ganguly et al. [12] present a streaming sketch algorithm that achieves this worst case bound when used for join size estimation. In Section 5.2.1 we show that when using the same number of elements in the sample, end-biased samples always give more accurate (lower variance) results than random samples, and thus they too are within a small constant factor of the theoretical lower bound.

All three algorithms above basically achieve the worst case bound. Is it valid to conclude that they are equally good? No. The worst case bound only tells us that there is a distribution of inputs (frequent values in one relation being among the many infrequent values in the other) that is very hard for all algorithms, and that on these inputs all three algorithms are within a constant factor of the bound. But as we mentioned in the introduction, this bound tells us nothing about how these algorithms perform on easier inputs. The next section goes beyond this worst case and gives some results that apply to all inputs.

5.2. General results for sampling algorithms

In this section we compare end-biased samples against two other sampling methods used in databases: random samples and counting samples [15]. For both these sampling methods there are unbiased estimators of the join size that rely on samples of values of the join attribute in the two tables. In this section we show that the variance of these estimates is higher than that of the estimates based on end-biased samples of similar sample size.

5.2.1. Random samples By "random samples" we simply mean the most obvious approach: maintaining a random sample of values of the join attribute. We have the following lemma.

Lemma 4. For any data set, the expected size of an end-biased sample using threshold T is no larger than the expected size of a random sample obtained by sampling with probability 1/T.

We use p_a and p_b for the sampling probabilities at the two tables, and for each attribute value v, x_v is the number of samples with value v from the first table, y_v is the number of samples with value v from the second, a_v is the actual frequency of v in the first table, and b_v its frequency in the second. The estimator of the join size,

$$
\hat{S} = \sum_v \frac{x_v}{p_a} \cdot \frac{y_v}{p_b},
$$

sums the contributions of all values v present in both samples. Using the fact that x_v/p_a and y_v/p_b are independent random variables and unbiased estimators of a_v and b_v respectively, we can show that

$$
\mathrm{VAR}[\hat{S}] = \sum_v \left( b_v\left(\frac{1}{p_a} - 1\right) + a_v\left(\frac{1}{p_b} - 1\right) + \left(\frac{1}{p_a} - 1\right)\left(\frac{1}{p_b} - 1\right) \right) a_v b_v,
$$

where v ranges over all values in the join. After substituting T_a = 1/p_a and T_b = 1/p_b, the variance introduced into the estimate by one value becomes

$$
\left( \frac{T_a}{a_v} - \frac{1}{a_v} + \frac{T_b}{b_v} - \frac{1}{b_v} + \frac{T_a - 1}{a_v} \cdot \frac{T_b - 1}{b_v} \right) a_v^2 b_v^2,
$$

and comparing with Equation 2, it is easy to see that the variance of this estimator is strictly larger than the variance of the one based on end-biased samples. The ratio of the two variances is very close to 1 for the hard case of a very small a_v and a very large b_v (or the reverse), but very large for the case where both a_v and b_v are small, and infinite for the case when both are over the threshold.
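To make the gap concrete, here is our own worked data point using the two per-value variance expressions above, with T_a = T_b = 100 and a_v = b_v = 50:

$$
\Delta_v^{\text{end-biased}} = \left(\max\!\left(\tfrac{T_a}{a_v}, \tfrac{T_b}{b_v}\right) - 1\right) a_v^2 b_v^2 = (2 - 1)\cdot 50^2 \cdot 50^2 = 6.25\times 10^6,
$$
$$
\Delta_v^{\text{random}} = \big(b_v(T_a - 1) + a_v(T_b - 1) + (T_a - 1)(T_b - 1)\big)\, a_v b_v = 19{,}701 \cdot 2{,}500 \approx 4.9\times 10^7,
$$

roughly an eightfold higher variance for this one value; the gap becomes infinite once both frequencies reach the thresholds, where Δ_v for end-biased samples is 0.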

5.2.2. Counting samples Counting samples [15] of the attribute of interest are built in a single pass over the table in the following way: for each sampled value, the counting sample stores the exact count of the tuples that have that value, starting with the first one sampled. Note that this does not give us exact counts of the frequencies of the values in the sample; the count only misses the tuples before the first one sampled. If p is the sampling probability used, the probability that a value v with frequency f_v will appear in the sample is 1 − (1 − p)^{f_v} ≈ 1 − e^{−p f_v}. If we set p = 1/T, for every value the probability that it appears in the end-biased sample is close to the probability that it appears in the counting sample, with the largest ratio being achieved when f_v = T, so the size of the end-biased sample is at most e/(e − 1) ≈ 1.58 times larger.
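Spelling out where the 1.58 factor comes from (our arithmetic): the inclusion probabilities of a value v in the two kinds of samples are min(1, f_v/T) and 1 − (1 − 1/T)^{f_v}, and their ratio is largest at f_v = T:

$$
\frac{\min(1, f_v/T)}{1 - (1 - 1/T)^{f_v}}\Bigg|_{f_v = T} = \frac{1}{1 - (1 - 1/T)^T} \approx \frac{1}{1 - e^{-1}} = \frac{e}{e - 1} \approx 1.58.
$$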

We provide the details of our analysis of join size estimates based on counting samples in Appendix B. We conclude that counting samples allow estimates of the join size whose variance is somewhat larger than that of estimates based on end-biased samples for values that are frequent in one table and infrequent in the other. Their variance is significantly higher for values with high frequencies in both tables, for which end-biased samples give the exact contribution to the join size, and higher still for values with low frequencies in both tables, because end-biased samples correlate sampling decisions using a shared hash function.

6. Experimental comparison

The previous section shows that when given similar amounts of memory, end-biased samples dominate ordinary samples and counting samples on all inputs, and the accuracy of the estimates based on end-biased samples is significantly better on some inputs. Sketches and histograms are very different approaches, and analytical comparisons that extend to all data sets are hard. In this section we explore this issue empirically.

Answering this question with an exhaustive experimental evaluation over all real-world data sets for all proposed methods for computing join size estimates clearly would be prohibitively expensive. Accordingly, in this section we present a less ambitious experiment meant to give a rough answer and some intuition. We compare end-biased samples with four other algorithms for estimating join sizes based on summaries of the join attribute computed separately for the two relations. When comparing the different algorithms, their summary data structures use the same number of words of memory.

6.1. Description of experiments

We generate various synthetic data sets and run the five summarization algorithms on them. The code we used to generate the data sets, compute the summaries, estimate join sizes, and process the results is publicly available at http://www.cs.wisc.edu/~estan/ebs.tar.gz (we do not include the code for sketches because it is not our code; we received it from Sumit Ganguly).

6.1.1. Summarization algorithms used The two other sampling algorithms we evaluate are concise samples and counting samples. Concise samples differ from random samples in that for attribute values that are sampled more than once, instead of keeping a copy of each sampled instance of the value, we keep a single copy of the value and a counter storing the number of repetitions of the value in the sample. This can make them slightly more compact. Also, for concise samples we count attribute values that are sampled only once as consuming a single word of memory, whereas for counting samples and end-biased samples all values in the sample take two words because we also store the counter associated with them. For all three algorithms we start with a sampling probability of p = 1 (or equivalently threshold T = 1/p = 1) and decrease it slightly whenever we reach the memory limit, so that we stay within budget.

We also compare end-biased samples against a simple single dimensional histogram. We do not claim that there are no histograms that could provide more accurate estimates, but we tried to choose a powerful member of the class of simple histograms with general applicability: an end-biased equi-depth histogram. The histogram stores explicitly the frequency of values that repeat more times than a certain threshold, and builds bins defined by non-overlapping value intervals that cover the whole range of the attribute. We compute the join size estimate by assuming that, except for values with frequencies over the threshold, the frequency of every value in a bin is the same, f/n, where f is the total frequency of all values covered by the bin and n is the number of values covered by the bin (not the number of values present in the table). Of course, the frequent values are not counted towards the total frequency of the bin, nor towards the number of values covered by it. Since we use equi-depth histograms, the total frequency of elements in each bin is approximately the same.

There are two parameters that affect the memory usage of a histogram and the accuracy of the predictions based on it: the threshold above which a value's frequency is stored exactly, and the number of bins used. To make the comparison with end-biased samples fair, we ran the histograms with parameters that resulted in the same memory usage: for each data set we used the same threshold and a number of bins equal to the number of values with frequency below the threshold present in the end-biased sample.

A simple thought experiment reveals that using single dimensional histograms for join cardinality estimation is prone to errors in some cases. The main realization required is that no matter how carefully histograms choose their buckets, they always assume uniformity within a bucket. This means that they can easily be "tricked" into making a bad estimate. For example, it is possible that one relation has only even values in the join attribute, while the other has only odd values. The histogram, assuming uniformity and independence within buckets, will incorrectly predict a non-zero join size. Similarly, if both join inputs have only even values, the histogram, assuming uniformity and independence, will underestimate the join size.
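To attach numbers to this thought experiment (our own illustration, not an experiment from the paper): let a single bucket cover the values 1..2n, let relation A contain each of the n even values exactly once, and let relation B contain each of the n odd values exactly once. Uniformity within the bucket ascribes frequency 1/2 to every value in both relations, so

$$
\hat{S}_{\text{hist}} = 2n \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{n}{2}, \qquad S_{\text{even/odd}} = 0, \qquad S_{\text{even/even}} = n,
$$

i.e., the histogram predicts the same join size n/2 whether the true join size is 0 or n.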

Ganguly et al. proposed skimmed sketches [12], an algorithm that achieves the worst case bound for estimating join sizes. We contacted them for an implementation of this algorithm and used an improved sketch they are currently working on. This sketch uses a number of hash tables that operate in parallel. Each update is hashed to one of the buckets of each hash table based on the attribute's value, and then added to or subtracted from (based on another hash of the attribute's value) the counter associated with the bucket. The sketch also has a small heap that operates in parallel, storing the most frequent values. When estimating the join size, the frequency of the values in the heap is estimated based on the sketch, and these values are subtracted from the tables. The final join estimate is the sum of the estimate based on the estimated frequencies of the frequent values from the two heaps and the estimate based on the remaining counters in the tables. Due to the implementation of the hash function used by the sketches, the number of buckets per table had to be a power of two in all our experiments.
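The counter-update step described above can be sketched as follows. This is only our reading of the description: the heap of frequent values and the skimming of their estimated contributions are omitted, the hash functions are illustrative stand-ins, and this is not the code we received from the authors.

import hashlib

def _h(value, table_id, salt, modulus):
    # Illustrative hash: map (value, table, salt) to an integer in [0, modulus).
    d = hashlib.blake2b(f"{salt}:{table_id}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % modulus

class CounterSketch:
    def __init__(self, num_tables, buckets_per_table):
        self.counters = [[0] * buckets_per_table for _ in range(num_tables)]
        self.buckets = buckets_per_table

    def update(self, value, count=1):
        # Each update touches one bucket per hash table, added or
        # subtracted according to a second hash of the value.
        for t, table in enumerate(self.counters):
            bucket = _h(value, t, "bucket", self.buckets)
            sign = 1 if _h(value, t, "sign", 2) == 0 else -1
            table[bucket] += sign * count

In sketches of this family, the counter-based part of a join estimate is typically an inner product of corresponding counter arrays built with the same hash functions; the heap-based part handles the frequent values separately, as described above.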

6.1.2. Input data and memory budget As input we used two randomly generated tables with approximately 1 million tuples each, with a Zipf distribution for the frequencies of the values. The values are from a domain with 5 million values, and for each of these values we pick its frequency independently at random from the distribution of frequencies. For each configuration we repeat the experiment 1,000 times to get sufficient data to characterize the distribution of the errors in the join size estimates (for some of the configurations not all 1,000 runs completed, but for most of them the results reported in this paper are on more than 990 runs, and for all of them on more than 925 runs). Since the histograms are deterministic, we generate new random inputs for every run, but keep all configuration parameters the same. We used the Condor high throughput computing environment developed at our department, and the experiments consumed collectively more than one CPU-year of computing power.

Our first experiment looks at the effect of peaks in the distribution of attribute value frequencies. We vary the Zipf exponent from 0.2 to 0.95 to go from a distribution with no peaks to one with high peaks. In all these configurations we keep the number of tuples around 1,000,000, so as we increase the Zipf exponent the number of distinct attribute values decreases. In this experiment, all algorithms use 10,304 words per table to summarize the distribution of the frequencies of the values of the join attribute.

In our next two experiments we vary the amount of memory used by the summaries. We used Zipf distributions with parameters α = 0.8 and α = 0.35 to see the difference between how memory usage affects the accuracy of the results on a peaked distribution versus a non-peaked distribution; the actual distributions of the frequencies are given by ⌊15,250/(1,000,000r + 0.5)^0.8 + 0.5⌋ and ⌊61/(5,000,000r + 0.5)^0.35 + 0.5⌋, where r is a random variable uniformly distributed between 0 and 1 and we pick independent values of r for all 5 million possible values of the attribute. Since the implementation of sketches we used requires that the number of buckets per table be a power of two (hence the seemingly arbitrary decision of using 10,304 words in the previous experiment), we pick the memory sizes based on the constraints imposed by sketches. We use two types of configurations, both based on discussions with the authors of the sketch algorithm: 5 tables with the number of elements in the heap 1/64 times the number of buckets in a table; and 3 tables with a heap size of 1/32 times the table size. Each bucket in the table holds one counter that takes one word of memory, and each element in the heap takes two memory words, one for the attribute value and another for its estimated frequency. We vary the memory budget from 204 to 659,456 words.
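The frequency formulas above translate directly into a small generator; the sketch below is our transcription of those formulas (function names and structure are our own, not the authors' released code):

import random

def zipf_like_frequency(r, scale, shift, alpha):
    """Frequency drawn for one attribute value, with r uniform in [0, 1)."""
    return int(scale / (shift * r + 0.5) ** alpha + 0.5)

def generate_table(num_values, scale, shift, alpha, seed=None):
    """Pick a frequency independently for each of num_values attribute
    values; values with frequency 0 simply do not appear in the table."""
    rng = random.Random(seed)
    freqs = {}
    for v in range(num_values):
        f = zipf_like_frequency(rng.random(), scale, shift, alpha)
        if f > 0:
            freqs[v] = f
    return freqs

# Peaked (alpha = 0.8) and unpeaked (alpha = 0.35) inputs from the text:
#   generate_table(5_000_000, 15_250, 1_000_000, 0.8)
#   generate_table(5_000_000, 61, 5_000_000, 0.35)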

In our last two experiments we vary the degree of correlation between the two tables whose join we estimate. In the experiments so far, when choosing the frequency of any given value, the decisions were independent for the two tables. By positively correlating these choices, we make it more likely that when one of the tables has a high frequency for a value, the other one will too; by correlating them negatively we make it more likely that the frequency of the value in the other table will be low (possibly zero). Negatively correlated data sets will have a smaller join size and positively correlated data sets will have a larger join size (in the extreme of perfect correlation, the two tables have identical frequency distributions and we get the self-join size). For the experiment with unpeaked data (Zipf parameter of 0.2) we have 8 configurations with join sizes from around 6% of the size for uncorrelated inputs to 571%. For the peaked input data we have 12 configurations going from around 5% of the size for the uncorrelated input to almost 100 times larger. As for the first experiment, we use 10,304 words, which corresponds to a 5 table configuration for the sketches.

6.2. Discussion of results

In our discussion of the results of the experiments, we use 4 measures of the distribution of the join size estimates of the various algorithms in the various configurations. The first measure is the average over all runs of the ratio of the estimate to the actual join size (since the data is different for every run, the join sizes differ too). A value close to 1 for this measure means that the estimator is not biased, while larger values indicate a bias towards overestimation and smaller ones towards underestimation. The second measure is the square root of the average of the square of the relative error; we refer to this quantity as the average (relative) error. The third and fourth measures are the 5th and 95th percentiles of the ratio between the join size estimate and the actual join size.


[Figure 1. Average relative errors for the peaked input set: average relative error (log scale) vs. memory budget (log scale), for end-biased samples and concise samples.]

[Figure 2. Average relative errors for the unpeaked input set: average relative error (log scale) vs. memory budget (log scale), for end-biased samples and concise samples.]

Due to lack of space we cannot present the full results for all experiments; we present just the most important results, and the interested reader can access the full results at http://www.cs.wisc.edu/~estan/ebs.tar.gz.

6.2.1. Concise samples As expected, the more memory we used, the more accurate the estimates, and concise samples showed no evidence of bias towards over- or underestimation. Figures 1 and 2 show how the average error is affected by the amount of memory used by the samples, for peaked and unpeaked frequency distributions respectively. It might seem surprising that, despite our proof in Section 5.2.1 that random samples give higher variance results, concise samples do slightly better than end-biased samples for most memory sizes on the peaked data.

[Figure 3. Different degrees of correlation of the peaked frequency distributions of the two inputs affect the join size and cause biases in the estimates based on histograms: join size estimate / join size (log scale) vs. average join size / uncorrelated join size (log scale), showing the 5th percentile, average, and 95th percentile for end-biased samples and for histograms.]

The explanation is that since concise samples use a single word for values that are sampled only once, whereas end-biased samples use two words (to store both value and frequency) for all values sampled, concise samples can afford to sample slightly more aggressively than end-biased samples, and this small increase in sampling probability is enough to make the estimates slightly more accurate. For the last data point the error of the concise samples is 0 because they can afford to store the entire data set. The results are very different for unpeaked data. While for large amounts of memory (high sampling rates) the estimates are comparably accurate, for low memory settings the average error of the concise samples is much larger than that of end-biased samples (1434% versus 26.87% for 204 words). The explanation is that the join is due to many relatively infrequent values occurring in both relations. Correlating the sampling decisions via a hash function with a shared seed, as done by the end-biased samples, makes it easier to estimate how many such values there are and leads to the dramatically lower errors.

6.2.2. Counting samples Counting samples never outperform end-biased samples. Despite being able to provide more accurate estimates of the frequency of frequent values than concise samples, the join size estimates based on counting samples are slightly less accurate (average error below approximately twice the error with concise samples) for all settings tested. The reason is that counting samples need to operate with less aggressive sampling than concise samples, as they store two words for each sampled value.

6.2.3. Histograms In all our experiments with uncorrelated data sets, histograms estimated the join size with lower error than end-biased samples.


[Figure 4. Different degrees of correlation of the unpeaked frequency distributions of the two inputs affect the join size and cause biases in the estimates based on histograms: join size estimate / join size (log scale) vs. average join size / uncorrelated join size (log scale), showing the 5th percentile, average, and 95th percentile for end-biased samples and for histograms.]

Histograms did especially well in settings with little memory. But if the value frequencies in the two tables are correlated positively or negatively (as in the even-even and odd-even examples from Section 6.1.1), the histograms "do not notice" and give estimates much closer to the join size on uncorrelated data than to the actual join size (see Figures 3 and 4), whereas estimates based on end-biased samples are unbiased, as predicted by Lemma 1. In our experiment with the unpeaked distribution of value frequencies, histograms went from overestimation by a factor of 16.8 on average to underestimation by a factor of 0.18 on average, and for peaked data from 20 to 0.14. The increase towards the end of the plot for the peaked distribution breaks the trend for histograms. This is due to the values that are frequent in both relations. Since the histograms we use are end-biased, they correctly compute the contribution of such values to the join; with distributions that have strong positive correlation the contribution of such values to the join increases, and the bias of the histograms diminishes.

6.2.4. Sketches Of the methods we tested, based on the accuracy of the estimates, sketches are the only one that we would recommend over end-biased samples for certain types of inputs. Table 1 shows the results of the experiment that varies the Zipf parameter. This experiment confirms our theoretical results from Section 4, which predict that the variance of the join cardinality estimates based on end-biased samples goes up as the frequency of the most frequent values increases. For large values of α the sketches give more accurate estimates, but for small values, the sketches show a slight bias towards overestimation.

[Figure 5. The degree of correlation between the two inputs with peaked distributions affects the results: average error vs. join size / uncorrelated join size (log scale), for end-biased samples and sketches.]

For high values of α, there are some values with very high frequency that can influence the join size significantly; thus finding a good estimate of the frequency of these values in the other relation is very important. With end-biased samples either the value is sampled or not, and we assume this gives higher variance than the estimate of this frequency using sketches, hence the better performance of sketches for join size estimation on this type of data. We assume that the sketches' bias towards overestimation for unpeaked data is due to interactions between how the heap selects the most frequent values and the actual tables of counters. Other experiments show that the configuration with 3 tables and a larger heap has an even stronger bias.

If we look at different degrees of correlation for the inputs with peaked (α = 0.8) frequency distributions (see Figure 5), we see a more nuanced picture. Even though sketches give better results for the uncorrelated case, there is a significant portion of the input space over which end-biased samples give lower errors.

The results of our experiment varying the memory budget with uncorrelated peaked input distributions, shown in Figure 6, indicate that end-biased samples have an advantage in low memory settings, while sketches do better when there is more memory. Sketches using 3 tables instead of 5 and a larger heap are significantly less accurate than their properly configured counterparts, which highlights the sensitivity of sketches to proper configuration. With unpeaked data, end-biased samples are significantly better than sketches with 3 tables for all configurations because of the sketches' bias towards overestimation.

Finally, we note that skimmed sketches and end-biased samples both work well for overlapping but not identical domains.


α      End-biased samples: 5th pct / average / 95th pct (avg. error)    Sketches: 5th pct / average / 95th pct (avg. error)
0.2    0.953 / 1.001 / 1.052 (3.06%)                                    1.007 / 1.036 / 1.066 (4.05%)
0.35   0.944 / 1.001 / 1.065 (3.67%)                                    1.000 / 1.031 / 1.064 (3.67%)
0.5    0.907 / 1.002 / 1.127 (7.10%)                                    0.950 / 1.000 / 1.052 (2.97%)
0.65   0.790 / 1.003 / 1.353 (22.85%)                                   0.837 / 1.000 / 1.174 (10.67%)
0.8    0.554 / 1.025 / 1.903 (71.00%)                                   0.583 / 0.999 / 1.477 (29.28%)
0.95   0.275 / 1.104 / 2.936 (170.15%)                                  0.591 / 1.000 / 1.424 (29.13%)

Table 1. The larger the Zipf exponent α, the harder it is to estimate the join size accurately. The sketches we used show a slight bias towards overestimation for small values of α, but give more accurate estimates than end-biased samples for large α.

[Figure 6. End-biased samples give lower errors with low memory budgets: average relative error (log scale) vs. memory budget (log scale), for end-biased samples, sketches with 3 tables, and sketches with 5 tables.]

There are qualitative differences too. End-biased samples can be used to estimate the size of a select-join if the selection is on the join attribute, skimmed sketches cannot; skimmed sketches can be used in a streaming "read-once" environment, end-biased samples cannot.

7. Conclusions

When one surveys the synopses proposed to date, it is clear that no single approach dominates all others for all purposes. It is certainly not our claim that end-biased samples should replace all other synopses for join cardinality estimation. However, based on our observations in this paper, we think that end-biased samples are useful for some estimation problems on which other approaches either do not apply or give inaccurate estimates.

To elaborate, many join cardinality estimation techniques, including multi-dimensional histograms and samples of join results, only apply if the join itself (or at least a representative subset) can be precomputed. This is infeasible in some applications; as we have noted, these applications include ones with distributed data and ones for which there are "too many" potential joins.

Compared to end-biased samples, single dimensional end-biased histograms have the disadvantage that they can overestimate or underestimate the join size by orders of magnitude when the distribution of values in the two tables is not independent. Good configurations for sketches produce more accurate estimates than end-biased samples on some of the data sets we tested, and sketches also have the advantage of supporting streaming updates, but the accuracy of their results depends strongly on configuration parameters and they are slightly biased towards overestimation in some settings. For some types of data sets, and for settings where the available memory is low, end-biased samples consistently give more accurate estimates than sketches. Furthermore, unlike sketches and single dimensional histograms, end-biased samples support selection on the join attribute. Compared to concise samples and counting samples, end-biased samples give significantly more accurate results when the join is dominated by values not frequent in either table.

Substantial room for future work remains. One interesting task would be to compute confidence intervals for the estimates produced by end-biased samples. Another interesting and important task would be to better understand how to automatically pick the method best suited to a given scenario of interest.

8. Acknowledgments

We thank Sumit Ganguly for the code implementing sketches that we used in our experiments. This work was supported in part by NSF grant ITR 0086002.

References

[1] S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In SIGMOD, 1999.
[2] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In PODS, 1999.
[3] N. Bruno, S. Chaudhuri, and L. Gravano. A multi-dimensional workload-aware histogram. In SIGMOD, 2001.
[4] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, 1999.
[5] A. Dobra, M. Garofalakis, J. E. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD, 2002.
[6] N. Duffield, C. Lund, and M. Thorup. Charging from sampled network usage. In SIGCOMM Internet Measurement Workshop, Nov. 2001.
[7] N. G. Duffield and M. Grossglauser. Trajectory sampling for direct traffic observation. In Proceedings of the ACM SIGCOMM, pages 271-282, Aug. 2000.
[8] C. Estan, K. Keys, D. Moore, and G. Varghese. Building a better NetFlow. In Proceedings of the ACM SIGCOMM, Aug. 2004.
[9] C. Estan and J. F. Naughton. End-biased samples for join cardinality estimation. U.W.-Madison C.S. Tech. rep. TR1541, Oct. 2005.
[10] C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., Aug. 2003.
[11] P. Flajolet. On adaptive sampling. COMPUTG: Computing (Archive for Informatics and Numerical Computation), Springer-Verlag, 43, 1990.
[12] S. Ganguly, M. Garofalakis, and R. Rastogi. Processing data-stream join aggregates using skimmed sketches. In EDBT, 2004.
[13] M. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In SIGMOD, 2002.
[14] M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. In PODS, 2004.
[15] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD, pages 331-342, June 1998.
[16] D. Gunopulos, G. Kollios, V. Tsotras, and C. Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. In SIGMOD, 2000.
[17] P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In SIGMOD, 1992.
[18] Y. E. Ioannidis. Universality of serial histograms. In VLDB, 1993.
[19] Y. E. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems, volume 18, pages 709-748, 1993.
[20] H. Jagadish, V. Poosala, N. Koudas, K. Sevcik, S. Muthukrishnan, and T. Suel. Optimal histograms with quality guarantees. In VLDB, 1998.
[21] R. Kaushik, R. Ramakrishnan, and V. T. Chakaravarthy. Synopses for query optimization: A space-complexity perspective. In Proceedings of PODS, 2004.
[22] N. Koudas, S. Muthukrishnan, and D. Srivastava. Optimal histograms for hierarchical range queries. In PODS, 2000.
[23] R. Lipton, J. Naughton, and D. Schneider. Practical selectivity estimation through adaptive sampling. In SIGMOD, 1990.
[24] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, Aug. 2002.
[25] M. Muralikrishna and D. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In SIGMOD, 1988.
[26] F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, 1986.
[27] G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In SIGMOD, 1984.
[28] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In SIGMOD, 1996.

A. Proofs of Lemmas

In this appendix we include proofs of the lemmas in the text.

Lemma 1. Estimates of join sizes computed through Equation 1 are unbiased.

Proof. We need to show that E[Ŝ] = S. The actual join size is the sum of the number of repetitions in the join for each value v that occurs in both tables, S = Σ_v a_v b_v. If we consider c_v = 0 for values that do not appear in both samples, we can extend the summation from Equation 1 to all values that appear in both tables. By the linearity of expectation, E[Ŝ] = Σ_v E[c_v], so it suffices to show that the expected contribution to the estimate is E[c_v] = a_v b_v for each value v. Let p_v be the probability that we chose a hash function h for which the value v is sampled for both tables A and B. We show that in all four cases from Equation 1 we have c_v = a_v b_v / p_v, and thus E[c_v] = p_v · a_v b_v / p_v + (1 − p_v) · 0 = a_v b_v. It is straightforward to check the first three cases, where the values of p_v are 1, a_v/T_a, and b_v/T_b respectively. The discussion of the fourth case relies on how correlated sampling decisions are made using the hash function h. For the value v to be selected in both samples, we need to have h(v) ≤ a_v/T_a and h(v) ≤ b_v/T_b. This happens with probability p_v = min(a_v/T_a, b_v/T_b) = 1/max(T_a/a_v, T_b/b_v), which gives the unbiasedness for the fourth case. □

Lemma 2. For two tables A and B whose end-biased samples on a common attribute are computed with thresholds T_a and T_b respectively, VAR[Ŝ] ≤ (max(T_a, T_b) − 1)S^2.

Proof. S = Σ_v a_v b_v, VAR[Ŝ] = Σ_v Δ_v, and by a simple transformation of the cases in Equation 2 we can show that Δ_v is of the form (T_a − a_v)b_v · a_v b_v, a_v(T_b − b_v) · a_v b_v, or 0 · a_v b_v if v's frequency is above the threshold in both tables. Let m = max_v((T_a − a_v)b_v, a_v(T_b − b_v), 0) be the largest Δ_v/(a_v b_v) ratio. Thus Σ_v Δ_v ≤ Σ_v m a_v b_v = mS. But for the values that participate in the join, we know that 1 ≤ a_v ≤ S and 1 ≤ b_v ≤ S because a_v b_v ≤ S, thus (T_a − a_v)b_v ≤ (T_a − 1)S and a_v(T_b − b_v) ≤ (T_b − 1)S. By substituting we get Σ_v Δ_v ≤ mS ≤ max((T_a − 1)S, (T_b − 1)S) · S = (max(T_a, T_b) − 1)S^2. □

Lemma 3. Let A and B be two tables with a common attribute for which we have an upper bound on the frequencies of values in the two tables: ∀v, a_v ≤ M_a and ∀v, b_v ≤ M_b. The variance of their join size estimate based on end-biased samples with thresholds T_a and T_b respectively is bounded by VAR[Ŝ] ≤ max((T_a − 1)M_b, (T_b − 1)M_a) · S.

Proof. Based on exactly the same reasoning as in the proof of Lemma 2, we arrive at VAR[Ŝ] ≤ mS, but since a_v and b_v are bounded by M_a and M_b respectively instead of S, we have m ≤ max((T_a − 1)M_b, (T_b − 1)M_a). □

Lemma 4. For any data set, the expected size of an end-biased sample using threshold T is no larger than the expected size of a random sample obtained by sampling with probability 1/T.

Proof. We first compute, for each value v, the expected contribution to the size of the two samples. For random samples, each tuple has a probability of 1/T of being sampled, and since different tuples with the same attribute value are not combined, the expected contribution of a value that appears in f_v tuples is f_v · 1/T = f_v/T. For end-biased samples, the probability of the sample containing the entry (v, f_v) is min(1, f_v/T), because values with f_v ≥ T appear with probability one and the rest with probability f_v/T. Since end-biased samples have one entry for each value present, v's expected contribution to the sample size is min(1, f_v/T) ≤ f_v/T. As each value is expected to contribute to the end-biased sample no more than to the random sample, the expected size of the end-biased sample is no larger than the expected size of the random sample. □

B. Analysis of counting samples

Let x_v and y_v be the counts associated with value v in the two counting samples. It is easy to show by induction on a_v and b_v that X_v = x_v + 1/p_a − 1 and Y_v = y_v + 1/p_b − 1 are unbiased estimators of a_v and b_v respectively (when a value v is not in the respective sample, X_v and Y_v are 0). Through a similar induction, but with a more involved derivation, we show that VAR[X_v] = (1/p_a)(1/p_a − 1)(1 − (1 − p_a)^{a_v}) and VAR[Y_v] = (1/p_b)(1/p_b − 1)(1 − (1 − p_b)^{b_v}). Using the fact that X_v and Y_v are independent random variables, it is easy to show that Ŝ = Σ_v X_v Y_v = Σ_v (x_v + 1/p_a − 1)(y_v + 1/p_b − 1) is an unbiased estimator of the join size and that its variance is VAR[Ŝ] = Σ_v VAR[X_v]VAR[Y_v] + VAR[X_v]b_v^2 + VAR[Y_v]a_v^2.

We will not compare counting samples against end-biased samples directly, but against a weaker version of end-biased samples which makes sampling decisions about values independently for the two relations, with the same probabilities as end-biased samples but without the shared hash function. Based on these samples we can compute an unbiased estimate X′_v of a_v as follows: X′_v = a_v if a_v ≥ T_a, X′_v = T_a if a_v < T_a and v is in the sample, and X′_v = 0 otherwise. We define Y′_v similarly. We have VAR[X′_v] = (T_a − a_v)a_v for a_v < T_a (and 0 otherwise) and similarly VAR[Y′_v] = (T_b − b_v)b_v. Note that S′ = Σ_v X′_v Y′_v is also an unbiased estimator of the join size. Its variance contribution for a value participating in the join is VAR[X′_v]VAR[Y′_v] + VAR[X′_v]b_v^2 + VAR[Y′_v]a_v^2. For values above the threshold in either table this is exactly the same as Δ_v for end-biased samples; it is only worse for values below the threshold in both tables (because the sampling decisions are not correlated). By showing that the estimator based on counting samples has higher variance than S′, we show that it has higher variance than the one based on end-biased samples. We only need to show that for p_a = 1/T_a, VAR[X_v] ≥ VAR[X′_v] for all a_v ∈ {1, ..., T_a}, and for p_b = 1/T_b, VAR[Y_v] ≥ VAR[Y′_v] for all b_v ∈ {1, ..., T_b}. Writing p for the sampling probability and f for the frequency, we must prove the following inequality; the second line follows by multiplying both sides by p^2 and rearranging:

$$
\frac{1}{p}\left(\frac{1}{p} - 1\right)\left(1 - (1 - p)^f\right) \ge \left(\frac{1}{p} - f\right) f
$$
$$
1 - p - (1 - p)^{f+1} \ge fp - f^2 p^2
$$
$$
1 - (f + 1)p + f^2 p^2 - (1 - p)^{f+1} \ge 0
$$

We treat the left side as a continuous function g(p) defined on [0, 1]. Then g′(p) = −f − 1 + 2f^2 p + (f + 1)(1 − p)^f and g″(p) = 2f^2 − (f^2 + f)(1 − p)^{f−1}. Since g(0) = 0 and g′(0) = 0, it suffices to show that g″(p) ≥ 0 over [0, 1]. The condition g″(p) ≥ 0 is equivalent to (1 − p)^{f−1} ≤ 2f/(f + 1), which holds since for f ≥ 1, 2f/(f + 1) = 2(1 − 1/(f + 1)) ≥ 1 ≥ (1 − p)^{f−1}.

Discussion of results. The analysis in this appendix shows that VAR[X_v] ≥ VAR[X′_v] and VAR[Y_v] ≥ VAR[Y′_v], but the difference is small for small values of a_v and b_v. For large values the difference is larger: even though VAR[X′_v] and VAR[Y′_v] reach zero once a_v and b_v reach the threshold, we always have VAR[X_v] ≤ (1/p_a)(1/p_a − 1) = T_a(T_a − 1) ≈ T_a^2 and VAR[Y_v] ≤ (1/p_b)(1/p_b − 1) = T_b(T_b − 1) ≈ T_b^2. For values frequent in only one table, the variance of estimates based on counting samples is only slightly larger than the variance of those based on end-biased samples. For values that are frequent in both, end-biased samples have a variance of 0, and the relative error of estimates based on counting samples is SD[Ŝ]/S ≈ max(T_a/a_v, T_b/b_v). For values that are infrequent in both tables, because counting samples do not correlate sampling decisions, the variance of their estimates can be larger than for end-biased samples by a multiplicative factor of min(T_a, T_b), just like for random samples.
