
Techniques for Warehousing of Sample Data

Paul G. Brown & Peter J. Haas
IBM Almaden Research Center

San Jose, California, USA
{pbrown1,phaas}@us.ibm.com

Abstract

We consider the problem of maintaining a warehouse of sampled data that “shadows” a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many “data sets,” where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.

1 Introduction

In the setting of large-scale data repositories and warehouses, random sampling has long been recognized as an invaluable tool for obtaining quick approximate answers to analytical queries, for auditing the data, and for exploring the data interactively; see, for example, [9, 10, 19]. Recently, sampling has received attention as a useful tool for data integration tasks such as automated metadata discovery [2, 3, 13, 14, 15, 18].

One means of exploiting random sampling is to sample the data on an as-needed, ad hoc basis; see, for example, [4, 10, 12]. This approach can work well within a single database management system, but can be difficult to implement in more complex warehousing and integration environments.

[Figure 1. Sample-Warehouse Architecture: data sets D_1, ..., D_n in the full-scale data warehouse are divided into partitions D_{i,j}; each partition is sampled into a partition sample S_{i,j}, and the partition samples are stored and merged in the sample data warehouse (e.g., S_{1,1}, S_{1,2}, ..., S_{n,m}, and merged samples such as S_{1-2,3-7}, S_{*,2}, and S_{*,*}).]

Another popular approach is to, in essence, maintain a warehouse of sampled data that “shadows” the full-scale data warehouse. This approach is implicit in the “backing sample” ideas in [8] and in stream sampling methods [11], and appears more explicitly in the various sampling-based data synopses proposed as part of the AQUA project [6, 9]; especially pertinent to our current investigation is the work in [7] on concise and counting samples. Jermaine et al. [16] also discuss techniques for maintaining large disk-based samples.

In this paper, we pursue the latter approach and present novel algorithms for maintaining a warehouse of sampled data. We focus on issues of scalability, parallel processing, and flexibility that have not previously been addressed in this setting. We assume that the full-scale warehouse comprises many “data sets,” where a data set is a bag (multiset) of values; the data sets can vary enormously in size. A data set might correspond, for example, to the values in the column of a relational table or to the instance values corresponding to a leaf node in an XML schema. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created either by dividing the batch or splitting the stream.


As a data-set partition moves through the full-scale warehouse, the corresponding sample of the partition moves through the sample warehouse in a parallel manner.

Our new sampling methods are based on well known Bernoulli and reservoir sampling schemes, and are similar to concise sampling [7] in that (i) the footprint (i.e., the required storage space) both during and after sample collection never exceeds a user-specified a priori bound, and (ii) the sample is stored using a compact representation. For example, the complete data-set partition can be stored, provided that the number of distinct values in the partition is sufficiently small. Unlike concise and counting samples, however, our samples are truly uniform, in a sense that we make precise below. We also provide novel methods for merging partition samples to create a uniform sample from an arbitrary union of partitions; these techniques permit flexible creation of samples from the partition samples that are stored in the warehouse. We empirically compare the performance of the Bernoulli-based and reservoir-based techniques and show that our methods have good speedup and scaleup behavior.

The rest of the paper is organized as follows. In Section 2 we describe the architecture of the sample warehousing system and specify the corresponding set of requirements for our sampling methods. Section 3 describes some existing sampling schemes upon which our new methods rest. As part of our discussion we show, perhaps surprisingly, that the well known concise-sampling scheme in [7] does not produce uniform samples. Section 4 describes our two new methods, called Algorithm HB and Algorithm HR, for creating compact uniform random samples subject to a priori bounds on the sample footprint. Also provided are methods for merging samples created by these algorithms. In Section 5 we describe the results of an empirical study of the new methods, and Section 6 contains our conclusions and directions for future work.

2 System Architecture and Requirements

As mentioned previously, we assume that the full-scale warehouse consists of many data sets that can vary in size from a few hundred to hundreds of millions of values. The values that constitute a data set arrive in two ways: bundled into a large batched collection, or as a streamed sequence of singleton values. To achieve our flexibility and scalability goals, we allow (and typically require) each data set to be partitioned into mutually disjoint sets. The sampling infrastructure needs to support independent sampling of the partitions and subsequent merging of the per-partition samples to create a single sample of the data values in an arbitrary union of partitions. A sample of the concatenation of all partitions corresponds to a sample of the entire data set.

In one data warehousing scenario, for example, an initial batch of data from an operational system would be bulk loaded, followed up periodically by smaller sets of data reflecting additions to the operational system over time, as well as periodic deletions. We would like to be able to parallelize the sampling of the initial batch to minimize ingestion time, and then merge samples acquired from the update stream so as to maintain a sample of the total data set. In another scenario, the bulk-load component of the data set might be small but the ongoing data stream overwhelming for a single computer. Then the incoming stream could be split over a number of machines and samples from the concurrent sampling processes merged on demand. In either scenario, it may be desirable to further partition the incoming data stream temporally, e.g., one partition per day, and then combine daily samples to form weekly, monthly, or yearly samples as needed for purposes of analysis, auditing, or exploration.

Figure 1 shows the data flow for a generic data set D; in general, D might be parallelized across multiple CPUs as D1, D2, ..., and the ith such stream may be partitioned temporally into Di,1, Di,2, ..., say by day. The sampled partitions Si,j are sent to the sample warehouse, where they may be subsequently retrieved and merged in various ways. As new daily samples are rolled in and old daily samples are rolled out, the system would approximate stream sampling algorithms such as those described in [1, 11], but with support for parallel processing. Additional partitioning can also be performed on-the-fly within a stream, in order to robustly deal with fluctuations in the data arrival rate. E.g., suppose that we wish to maintain fixed-size samples and simultaneously ensure that each sample comprises at least a specified minimum fraction of its parent data. Then we wait until the ratio of sampled data to observed parent data hits the specified lower bound, at which point we finalize the current data partition (and corresponding sample), and begin a new partition (and sample).

In light of our discussion, we see that our sampling infrastructure must support the following functionality:

1. Uniform random sampling: Since most methods for sampling-based analytics and metadata discovery assume a uniform random sample (formally defined below), providing such samples is a basic functional requirement, and our focus in this paper.¹

2. Scalable, flexible, robust sampling: As discussed above, we typically partition each data set D into mutually disjoint subsets D1, D2, ..., Dk and require that our system be able to sample independently and in parallel from these data sets to obtain samples S1, S2, ..., Sk, where Si is a uniform random sample of the values in Di for 1 ≤ i ≤ k. We also require that our system be able to subsequently merge the samples: for any subset K ⊆ {1,2,...,k} the system must be able to produce a subset of values SK that is a uniform random sample of the values in DK = ⋃_{i∈K} Di. This sample/merge functionality allows the system to deal flexibly with heterogeneous data sources and data-arrival patterns, and to handle large amounts of data quickly by exploiting parallelism.

¹In Section 4.1, we briefly discuss how stratified samples can also be produced using our methods.

3. Bounded footprint: From a systems perspective, it is highly desirable that the storage required during and after sample creation be bounded a priori, so that there are no unexpected disk or memory shortages. Because the number of data-set partitions can be large, even small fluctuations in sample sizes can have large cumulative effects.

4. Compact samples: Again, because of the large number of partitions, each sample partition should be stored in a compact manner. Perhaps the most basic requirement is that duplicate values be stored in a (value, count) format when possible, as in the concise samples and counting samples described in [7]. Although not considered in the current paper, data compression techniques can be used to further minimize storage requirements for the samples; whether or not such techniques are worthwhile depends on the desired tradeoffs between processing speed and storage requirements.

3 A Survey of Pertinent Sampling Methods

In this section we briefly review some sampling schemes upon which our proposed techniques rest, and show that none of these techniques alone satisfies all of our requirements.

A sampling scheme is formally specified by an associated probability mass function Π( · ;D) on subsets of a population D = {1,2,...,|D|} of distinct data elements. For a subset S ⊆ D, the quantity Π(S;D) is the probability that the sampling scheme, when applied to D, will produce the sample S. A sampling scheme is uniform if, for any population D, the associated probability function satisfies Π(S;D) = Π(S′;D) whenever S,S′ ⊆ D with |S| = |S′|. I.e., all samples of equal size are equally likely. We denote the value of the ith data element by u_i, and allow for the possibility that u_i = u_j for i ≠ j.

3.1 Bernoulli Sampling

A Bernoulli sampling scheme with sampling rate q ∈ [0,1] includes each arriving data element in the sample with probability q and excludes the element with probability 1−q, independently of the other data elements. We denote such a sampling scheme as Bern(q). Formally, the associated probability function is given by Π(S;D) = q^|S| (1−q)^(|D|−|S|), so that Bernoulli sampling is uniform. A key advantage of Bernoulli sampling is that collecting samples is simple and inexpensive [11], and merging Bernoulli samples is relatively straightforward (see Section 4.1).

The key disadvantage of Bernoulli sampling is that the size of the sample is random, and hence cannot be controlled. Indeed, the size of a Bern(q) sample from a population of size N is binomially distributed with parameters N and q, so that the standard deviation of the sample size is √(Nq(1−q)). Hence the variability of the sample size grows without bound as the population size increases.

We sometimes appeal to the following two results, which are easy consequences of the fact that data elements are included or excluded in a mutually independent manner. First, if S is a Bern(p) sample from D and S′ is a Bern(q) sample from S, then S′ is a Bern(pq) sample from D. Second, if S_i is a Bern(q) sample from D_i for i = 1,2, where D_1 and D_2 are disjoint sets, then S_1 ∪ S_2 is a Bern(q) sample from D_1 ∪ D_2.
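To make the Bern(q) mechanism and the two closure properties above concrete, here is a minimal Python sketch; the data sets, rates, and function names are illustrative assumptions, not the paper's implementation:

import random

def bernoulli_sample(values, q, rng=random):
    """Return a Bern(q) sample of `values`: each element is kept
    independently with probability q."""
    return [v for v in values if rng.random() <= q]

# Closure properties used later in the paper (illustrative check):
# 1. A Bern(q) subsample of a Bern(p) sample is a Bern(p*q) sample.
# 2. The union of Bern(q) samples of disjoint partitions is a Bern(q)
#    sample of the union of the partitions.
d1 = list(range(0, 50_000))
d2 = list(range(50_000, 100_000))
s = bernoulli_sample(bernoulli_sample(d1, 0.5), 0.2)             # ~ Bern(0.1) sample of d1
s_union = bernoulli_sample(d1, 0.1) + bernoulli_sample(d2, 0.1)  # ~ Bern(0.1) sample of d1 ∪ d2
print(len(s), len(s_union))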

3.2 Reservoir Sampling

Simple random sampling (without replacement) with sample size k ≥ 1 is defined as the unique uniform sampling scheme that produces a sample of the specified size:

$$\Pi(S;D) = \begin{cases} 1\big/\binom{|D|}{k} & \text{if } |S| = k;\\[2pt] 0 & \text{otherwise.}\end{cases}$$

Reservoir sampling is a well known sequential algorithm for producing a simple random sample without replacement; see [11] for details and references. The idea in reservoir sampling is to maintain the invariant that the current reservoir constitutes a simple random sample of all data elements seen so far. Thus the first k arriving data elements are inserted into the reservoir, so that the invariant property holds trivially. When the nth data element arrives (n > k), this element is included in the sample with probability k/n, replacing a randomly and uniformly selected victim; with probability 1 − (k/n) the arriving data element is not included in the sample. Note that the insertion probability is precisely the probability that the arriving element would appear in a simple random sample of size k drawn from a population of size n. Vitter [20] famously proposed a number of implementation tricks to speed up the basic algorithm, the primary idea being to directly generate the random skips between successive inclusions using acceptance-rejection techniques. In the following we denote by skip(n;k) the function that generates a random skip; that is, if data element n has just been processed (either included in or excluded from the reservoir), then the next data element to be included in the reservoir is element n + skip(n;k). For details of the skip function see the description of Algorithm Z in [20] or the discussion in [11]. We often use the terms “simple random sample” and “reservoir sample” interchangeably.

The key advantage of reservoir sampling is that the sample footprint is bounded a priori. A disadvantage has been the absence of an algorithm for merging reservoir samples; we remedy this situation in the current paper.
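For concreteness, the following Python sketch implements the basic per-element reservoir rule described above; Vitter's skip-based optimization and the skip(n;k) function are omitted, so this is an illustrative simplification rather than the optimized algorithm referenced in the paper:

import random

def reservoir_sample(values, k, rng=random):
    """Maintain a simple random sample (without replacement) of size k
    over a stream of values, using the basic per-element insertion rule."""
    reservoir = []
    for n, v in enumerate(values, start=1):
        if n <= k:
            reservoir.append(v)            # the first k elements fill the reservoir
        elif rng.random() < k / n:
            victim = rng.randrange(k)      # replace a uniformly chosen victim
            reservoir[victim] = v
    return reservoir

print(reservoir_sample(range(1_000_000), 5))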

In the sequel, we occasionally use the following important relationship between Bernoulli and reservoir samples: if S is a Bern(q) sample from a population D, then for k ≥ 0 and s ⊆ D with |s| = k,

$$P\{S = s \mid |S| = k\} = \frac{P\{S = s\}}{P\{|S| = k\}} = \frac{q^{k}(1-q)^{|D|-k}}{\binom{|D|}{k}\,q^{k}(1-q)^{|D|-k}} = \frac{1}{\binom{|D|}{k}}.$$

That is, given that the size of a Bern(q) sample S is equal to k, the sample S is statistically identical to a reservoir sample of size k. Similarly, for m ∈ {k, k+1, ..., |D|},

$$P\{S = s \mid |S| \le m\} = \frac{P\{S = s\}}{P\{|S| \le m\}} = \frac{q^{k}(1-q)^{|D|-k}}{\sum_{j=0}^{m}\binom{|D|}{j}\,q^{j}(1-q)^{|D|-j}}.$$

Thus a sampling scheme that repeatedly takes Bern(q) samples until the sample size does not exceed m produces a uniform sample, but not a Bernoulli sample. Note, however, that if P{|S| ≤ m} ≈ 1, then P{S = s | |S| ≤ m} ≈ q^k (1−q)^(|D|−k), so that, to a good approximation, the sample can be treated as if it were a Bernoulli sample. Finally, we also use the fact that if S is a simple random sample of size k from D and S′ is a simple random sample of size k′ (< k) from S, then S′ is a simple random sample of size k′ from D; the proof of this result is straightforward.
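The first identity above can also be checked empirically. The following Monte Carlo sketch (the population, q, and k are arbitrary illustrative choices) draws Bern(q) samples, conditions on the sample size being k, and verifies that every k-subset appears with roughly equal frequency:

import random
from collections import Counter
from itertools import combinations

D, q, k, trials = list(range(6)), 0.4, 3, 200_000
counts = Counter()
for _ in range(trials):
    s = frozenset(v for v in D if random.random() < q)   # one Bern(q) draw
    if len(s) == k:                                       # condition on |S| = k
        counts[s] += 1
total = sum(counts.values())
for subset in combinations(D, k):
    # each relative frequency should be close to 1 / C(6,3) = 0.05
    print(subset, round(counts[frozenset(subset)] / total, 4))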

3.3 Concise Sampling

The concise sampling scheme was introduced in [7] with the goal of providing a uniform sampling method having both an a priori bounded footprint and, unlike basic Bernoulli or reservoir sampling, a compact representation of the sample. The idea is to store the sample in a compact, bounded histogram representation, i.e., as a set of pairs (v_i, n_i) whose footprint does not exceed F bytes, where v_i is the ith distinct data-element value in the sample and n_i is the number of data elements in the sample that have value v_i. To further save space, pairs of the form (v_i, 1) are represented simply by a single number, namely, the value v_i.²

²For ease of exposition, we often describe algorithms that act on concise samples as if the singletons were represented in the form (v,1) rather than v; the algorithmic modifications required to deal with singletons are always obvious.

// phase = 1, 2, or 3 (static variable initialized to 1)
// F is the maximum allowable sample footprint
//   (corresponding to a sample size of nF data-element values)
// i is the index of the newly arrived data element
// ui is the value of the newly arrived data element
// S is the sample of data elements (static, initialized to ∅)
// S′ is a temporary sample of data elements (static)
// q is the phase-2 sampling rate; see equation (1)
// n is index of next element to insert into reservoir (static)
// expand(S) converts S from a set of (value, count) pairs to a bag of values
// uniform() returns a uniform[0,1] random number
// skip(n,k) is the reservoir-sampling skip function

1   if phase = 1 then                         // insert ui into sample
2       insertValue(ui, S)
3       if footprint(S) = F then
4           S′ ← purgeBernoulli(S, q)         // precompute subsample
5           if |S′| < nF then
6               phase ← 2                     // switch to Bernoulli mode
7           else                              // subsample is too large
8               purgeReservoir(S′, nF)        // take reservoir subsample
9               phase ← 3                     // switch to reservoir mode
10              n ← i + skip(i; nF)
11      exit
12  if phase = 2 then                         // execute Bernoulli step
13      if uniform() ≤ q then                 // insert ui into sample
14          if unexpanded(S) then             // take Bernoulli subsample
15              S ← expand(S′)
16          S ← S ∪ {ui}                      // add ui to bag of values
17          if |S| = nF then
18              phase ← 3                     // switch to reservoir mode
19              n ← i + skip(i; nF)
20      exit
21  if phase = 3 then                         // execute reservoir step
22      if i = n then                         // insert ui into reservoir
23          if unexpanded(S) then S ← expand(S′)
24          removeRandomVictim(S)
25          S ← S ∪ {ui}
26          n ← i + skip(i; nF)
27      exit

Figure 2. Algorithm HB

Thus the sample size, i.e., the number of data elements in the sample, is |S| = L + ∑_i n_i, where L is the number of singleton values. Incoming data values are either included in or excluded from the sample according to a Bernoulli sampling mechanism in which the sampling rate is systematically decreased over time in order to maintain the upper bound on the sample footprint.

In more detail, the initial sampling rate is q = 1, so that all incoming data elements are included in the sample. If the insertion of an arriving data element would cause the sample footprint to exceed the upper bound F, then the sample footprint is reduced by executing a “purge” step. At the beginning of the purge, the current sampling rate q is reduced to a new value q′ < q. Then each of the |S| sample elements is independently removed from the sample with probability 1 − (q′/q) by decrementing the value of the appropriate n_i to n′_i = n_i − 1 and removing the value if its count drops to 0; with probability q′/q the sample element is retained.


// this function purges a sample S by taking a Bern(q) subsample
// S is stored in compact form as (value, count) pairs
// binomial(n, p) returns a binomial random number

1   for (v,n) ∈ S
2       n ← binomial(n, q)                    // take Bernoulli subsample
3       if n = 0 then S ← S − {(v,n)}

Figure 3. Function purgeBernoulli(S,q)

Note that, by luck of the draw, a purge might not result in a decrease in the footprint; in this case, the purge step is repeated until the desired reduction occurs and the footprint no longer exceeds F. The counting-sample scheme introduced in [7] is an extension of concise sampling that handles deletions in the parent warehouse.

A nice feature of concise sampling is that if the parent population contains few enough distinct values so that the sample footprint never exceeds F during processing, then the concise sample contains complete statistical information about the entire population in the form of an exact histogram. Unfortunately, concise sampling does not suffice for our purposes because this sampling scheme is not uniform, as demonstrated by the following simple example. Consider the population of data elements D = {1,2,...,6} with corresponding data-element values u_1 = u_2 = u_3 = a and u_4 = u_5 = u_6 = b, and suppose that the concise-sampling data structure can hold at most one (value, count) pair. Let S_1 = {1,2,3}, S_2 = {4,5,6}, and S_3 = {1,2,4} be three possible samples of size 3 from D. Also let H_1 = {(a,3)}, H_2 = {(b,3)}, and H_3 = {(a,2),b} be the compact histogram representations for the data-element values in S_1, S_2, and S_3. If the concise sampling scheme were uniform, then we would have either Π(S_1;D) = Π(S_2;D) = Π(S_3;D) > 0 or Π(S_1;D) = Π(S_2;D) = Π(S_3;D) = 0, so that either H_1, H_2, and H_3 would each have a positive probability of being produced (with H_3 being nine times as likely as H_1 or H_2), or none of H_1, H_2, or H_3 would ever be produced. As can be seen by inspection, however, H_1 and H_2 each have a positive probability of being produced, whereas H_3 is never produced because there is not enough space in the concise-sampling data structure. The counting-sample scheme in [7] also is not uniform, for similar reasons. The lack of uniformity in these two sampling schemes seems to have gone unnoticed in the literature. Because concise sampling is biased toward samples with fewer distinct values, data-element values that appear infrequently in the population will be underrepresented in a sample.

4 New Sampling Methods

We introduce two sampling methods that are suitable for our sample warehouse environment. The methods are based on Bernoulli and reservoir sampling, respectively. For each method, we also provide an algorithm for merging samples created by the method.

// this function purges a sample S
//   by taking a reservoir subsample of size M
// S is stored in compact form as (value, count) pairs
// Assume that elements of S are accessed sequentially
//   as (v1,n1),(v2,n2),...,(vm,nm)
// j is index of next value to be included in reservoir
// b is the current “upper bucket boundary”
// L is current number of values in the reservoir
// uniformInt(J) returns a random integer uniform in {1,2,...,J}
// skip(n,k) is the reservoir-sampling skip function

1   b = L = 0 and j = skip(0; M)
2   for i = 1 to m
3       b ← b + ni
4       ni ← 0
5       if j ≤ b then                          // insert instance(s) of vi
6           repeat
7               if L = M then                  // reservoir is full
8                   v = uniformInt(M)          // choose a random victim
9                   l = γ such that ∑_{g=1}^{γ−1} n_g < v ≤ ∑_{g=1}^{γ} n_g
10                  nl ← nl − 1 and L ← L − 1
11              ni ← ni + 1 and L ← L + 1
12              j ← j + skip(j; M)
13          until j > b

Figure 4. Function purgeReservoir(S,M)

These methods retain the bounded footprint property and, to a partial degree, the compact-representation property of the concise sampling scheme, while ensuring statistical uniformity.

4.1 Hybrid Bernoulli Sampling

The idea behind the hybrid Bernoulli sampling scheme is to attempt to sample at rate q = 1 and maintain the sample in compact histogram form, exactly as in concise sampling. If the resulting sample footprint stays below a specified upper bound F during processing, then the algorithm will return an exact histogram representation of the parent data-set partition D. If, at any point, the sample footprint exceeds F, then the compact representation is abandoned and the scheme switches over to ordinary Bern(q) sampling with q < 1. The algorithm assumes that the population size is known, and selects q so that, with high probability, the sample size will never exceed nF, where the sample-size bound of nF data-element values corresponds to the maximum allowable footprint size of F bytes. In the unlikely event that the sample size exceeds nF, the algorithm switches to reservoir sampling with reservoir size nF.

Algorithm HB, displayed in Figure 2, is executed for each data element upon arrival. In phase 1 the function insertValue is used to add the value ui of the arriving data element to the sample S, which is initially stored in compact histogram form. Specifically, if there is already a pair (v,n) ∈ S with v = ui, then insertValue updates the pair to (v,n+1); if there is a singleton element v with v = ui, then insertValue replaces v by the pair (v,2); otherwise, insertValue adds the singleton ui to S. As with concise sampling, the final “sample” will consist of the exact frequency histogram for the population if the footprint never exceeds the upper bound F. On the other hand, if the footprint exceeds F, then the algorithm attempts to switch to Bernoulli sampling. In more detail, the algorithm invokes the purgeBernoulli function, shown in Figure 3, which takes a Bern(q) subsample S′ by appropriately decrementing the second component of each (v,n) pair in S according to a binomial distribution. (The required binomially distributed random numbers can be generated quickly using standard methods as in [5].) If the resulting sample size is less than nF (the usual case), then the algorithm enters phase 2 and switches over to Bernoulli sampling. In the unlikely event that the Bernoulli sample is too large, i.e., |S′| ≥ nF, then the algorithm first takes a reservoir sample of size nF from S′ using the function purgeReservoir that is displayed in Figure 4, and then transitions into reservoir-sampling mode (phase 3).

When the algorithm is in phase 2, it attempts to maintain a Bern(q) sample, where q is computed as in (1) below. At the time of the first Bernoulli insertion into the sample after the algorithm enters phase 2, the algorithm invokes the function expand to convert the sample (initially stored in S′) from compact histogram form to a bag of values; e.g., if S′ = {(a,2),b,(c,3)}, then expand(S′) = {a,a,b,c,c,c}. For the remainder of phase 2, arriving data elements are sampled according to a Bern(q) mechanism.
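As an illustration of the compact histogram operations used here, the following Python sketch implements insertValue and expand under the simplifying assumption that the histogram is a dict mapping each value to its count (a count of 1 plays the role of a singleton); footprint accounting is omitted:

def insert_value(u, hist):
    """Add one occurrence of value u to the compact histogram `hist`
    (a dict mapping value -> count)."""
    hist[u] = hist.get(u, 0) + 1

def expand(hist):
    """Convert a compact histogram back into a bag (multiset) of values,
    e.g. {'a': 2, 'b': 1, 'c': 3} -> ['a', 'a', 'b', 'c', 'c', 'c']."""
    return [v for v, n in hist.items() for _ in range(n)]

hist = {}
for u in ['a', 'a', 'b', 'c', 'c', 'c']:
    insert_value(u, hist)
print(hist)                      # {'a': 2, 'b': 1, 'c': 3}
print(sorted(expand(hist)))      # ['a', 'a', 'b', 'c', 'c', 'c']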

If the footprint attains the upper bound F during phase 2 (a low-probability event), then the algorithm enters phase 3, and any further arrivals are processed using the standard reservoir sampling algorithm. This algorithm uses the function removeRandomVictim, which removes a randomly and uniformly selected element from the bag of values. (The expansion step in line 23 is invoked at the beginning of phase 3 if the algorithm has transitioned to phase 3 directly from phase 1.)

In any case, after the last data element has arrived, the sample is finalized by converting S, if necessary, back to compact histogram form. I.e., the algorithm applies the inverse of the expand function. Thus, depending on whether the algorithm terminates in phase 1, 2, or 3, Algorithm HB produces either a histogram representation of the entire parent partition D, a histogram representation of a Bern(q) sample from D,³ or a histogram representation of a reservoir sample from D of size nF.

³Actually, the sample is not quite a true Bern(q) sample, but can be treated as one in practice; see the discussion below.

The Bernoulli sampling rate q in phase 2 is selected so that with high probability the number of data-element values in S will never exceed the upper bound nF.

[Figure 5. Error of the approximation in (1): relative error (%) of the approximate q versus the exceedance probability p, for N = 10^5 and nF = 10^2, 10^3, 10^4; the maximum observed error is 2.765%.]

In more detail, let p be the maximum allowable probability that |S| > nF and N be the population size. Then the required Bernoulli sampling rate q = q(N, p, nF) is computed by solving the equation f(q) = p, where

$$f(q) = \sum_{j=n_F+1}^{N} \binom{N}{j} q^{j}(1-q)^{N-j}.$$

In the usual case where N is large, nF/N is not vanishingly small, and p ≤ 0.5, we have

$$q(N,p,n_F) \;\approx\; \frac{N(2n_F + z_p^2) - z_p\sqrt{N\bigl(Nz_p^2 + 4Nn_F - 4n_F^2\bigr)}}{2N(N + z_p^2)}, \qquad (1)$$

where z_p is the (1−p)-quantile of the standard (mean 0, variance 1) normal distribution; see the appendix. Figure 5 displays the relative error of the approximation formula (1) for N = 10^5 and various values of p and nF. As can be seen, the relative error never exceeds 3%, and is typically much lower.
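A direct transcription of the approximation (1) is straightforward; the sketch below assumes Python's standard-library NormalDist for the quantile z_p, and the parameter values in the example are illustrative only:

from math import sqrt
from statistics import NormalDist

def bernoulli_rate(N, p, n_F):
    """Approximate q(N, p, n_F) from equation (1): the Bern(q) rate for which
    a sample of a size-N population exceeds n_F elements with probability
    at most p (normal approximation to the binomial)."""
    z = NormalDist().inv_cdf(1 - p)      # z_p, the (1-p)-quantile of N(0,1)
    num = N * (2 * n_F + z ** 2) - z * sqrt(N * (N * z ** 2 + 4 * N * n_F - 4 * n_F ** 2))
    return num / (2 * N * (N + z ** 2))

# Illustrative parameters: N = population size, p = exceedance probability,
# n_F = sample-size bound corresponding to the footprint bound F.
print(bernoulli_rate(N=10 ** 5, p=0.001, n_F=8192))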

The description of the Bernoulli and reservoir sampling steps in Figure 2 has been simplified for ease of exposition. Our actual implementation incorporates the various optimizations discussed in [11].

If the algorithm terminates in phase 2, then the resulting sample S is uniform as defined previously. Note that S is not quite a true Bernoulli sample because the sample size never exceeds nF. By design, however, P{|S| ≤ nF} = 1 − p ≈ 1. Therefore, as discussed at the end of Section 3.2, the sample can be treated for practical purposes as if it were a Bernoulli sample. If the algorithm terminates in phase 3, then S is clearly uniform. Thus Algorithm HB always produces a uniform sample, and in the usual case produces (essentially) a Bernoulli sample; as shown below, Bernoulli samples are relatively easy to merge and otherwise manipulate.



// S1 and S2 are the two input HB samples in compact form
// D1 and D2 are the parent partitions
// S is the combined sample
// F is the maximum allowable sample footprint
//   (corresponding to a sample size of nF data-element values)
// hi was the final phase of Algorithm HB when creating Si
// qi is the Bernoulli sampling rate when hi = 2
// p is the max. probability that a Bernoulli sample exceeds nF

1   if hi = 1 for i = 1 or 2 then             // ≥ 1 sample is exhaustive
2       S ← S2−i
3       apply Algorithm HB to add values from Si to S
4       exit
5   if hi = 3 for i = 1 or 2 then             // ≥ 1 reservoir sample
6       combine S1 and S2 using function HRMerge
7       exit
8   if h1 = h2 = 2 then                       // both Bernoulli samples
9       q ← q(|D1| + |D2|, p, nF)             // see equation (1)
10      purgeBernoulli(S1, q/q1)
11      purgeBernoulli(S2, q/q2)
12      if footprint(join(S1,S2)) < F then
13          S ← join(S1,S2)
14      else                                  // low-probability case
15          S ← purgeReservoir(S1, nF)
16          use reservoir sampling to add values in S2 to S
17      exit

Figure 6. Function HBMerge

A variant of the above algorithm eliminates phase 3 and, in phase 2, repeatedly purges the sample via Bernoulli subsampling—possibly with ever smaller values of q—in order to keep the sample size below the upper bound nF, in a manner reminiscent of concise sampling. As with Algorithm HB, the multiple-purge algorithm would not produce true Bernoulli samples. Moreover, the multiple-purge algorithm would be somewhat more expensive than Algorithm HB on average, and the final sample sizes would tend to be smaller and less stable. Thus the multiple-purge algorithm is dominated by Algorithm HB and we do not consider it further.

The function HBMerge, shown in Figure 6, merges two samples S1 and S2 generated by Algorithm HB from respective disjoint data-set partitions D1 and D2. When at least one sample Si is exhaustive, i.e., represents an entire data-set partition (line 1), HBMerge simply initializes the running sample in Algorithm HB to equal S2−i, sequentially extracts data-element values from Si, and feeds the resulting stream of values to Algorithm HB; note that no expansion of Si is required for such extraction. Algorithm HB is appropriately initialized to be in phase 1, 2, or 3, depending upon whether S2−i is an exhaustive, Bernoulli, or reservoir sample.

When neither S1 nor S2 is exhaustive but at least one sample is a reservoir sample (line 5), then the other sample can always be viewed as a simple random sample (at least conditionally—see the remarks at the end of Section 3.2). In this case, we use the HRMerge algorithm, discussed in the following section, to perform the merge.

When both samples are Bernoulli samples (line 8), HBMerge determines the sampling rate q such that a Bern(q) sample from D1 ∪ D2 will, with high probability, not exceed the upper bound nF. In lines 10 and 11, HBMerge takes Bernoulli subsamples from S1 and S2 such that, after the subsampling, Si is a Bern(q) sample of Di for i = 1,2, and hence S1 ∪ S2 is a Bern(q) sample of D1 ∪ D2; see the remarks at the end of Section 3.1. Note that q/qi ≈ |Di|/(|D1| + |D2|) for i = 1,2. If the footprint of the combined samples does not exceed F, then we form the merged sample using the join function. This function computes the compact histogram representation S of expand(S1) ∪ expand(S2) without actually performing the expansions. E.g., for each value v such that (v,n) ∈ (S1 − S2) ∪ (S2 − S1) for some n, join inserts the pair (v,n) into S, and for each value v such that (v,n1) ∈ S1 and (v,n2) ∈ S2 for some n1, n2, join inserts the pair (v, n1 + n2) into S. (Note that the if clause in line 12 can be evaluated without actually invoking join in its entirety.) In the unlikely case that the Bernoulli sample S1 ∪ S2 is too large, reservoir sampling is used (lines 15–16) to create a simple random sample of size nF. The idea is to first apply reservoir sampling to S1 using purgeReservoir. Then an algorithm almost identical to purgeReservoir is used to stream in the values from S2 (without requiring expansion of S2). When processing S2, the only difference from purgeReservoir is that when a pair (u,n) derived from S2 is included in S, the (vi,ni) pairs already in S must be scanned to see if there is some i for which vi = u, so that the new pair (u,n) can be incorporated simply by setting ni ← ni + n.
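Continuing the dict-based histogram sketch from earlier in this section, the join step might be implemented as follows (illustrative only; the real system also tracks footprints):

def join(s1, s2):
    """Merge two compact histograms (dicts mapping value -> count) into the
    compact histogram of the union of the underlying bags, without expanding
    either sample into an explicit bag of values."""
    merged = dict(s1)
    for v, n in s2.items():
        merged[v] = merged.get(v, 0) + n
    return merged

print(join({'a': 2, 'b': 1}, {'b': 3, 'c': 1}))   # {'a': 2, 'b': 4, 'c': 1}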

The foregoing algorithm may appear complex, but it usually executes very quickly. The typical scenario is one in which S1 and S2 are both Bernoulli samples, and the union of the Bernoulli subsamples of S1 and S2 obtained in lines 10–11 has a footprint that does not exceed F, so that these subsamples can be quickly combined using the join function.

Although non-uniform sampling is not the focus of the current paper, we note that the samples produced by Algorithm HB can also be simply concatenated, yielding a stratified random sample of the concatenation of the parent data-set partitions. A similar observation applies to Algorithm HR below.

We also note that if, for some n > 1 and q ∈ (0,1], Algorithm HB produces a collection of n Bern(q) samples from n parent data-set partitions, simply unioning the samples together yields a Bern(q) sample from the union of the parent partitions. Such unioning is useful when enforcing an upper bound on the sample size is not an issue, e.g., after samples have been removed from the warehouse setting. If the Bernoulli sampling rate differs among the partitions, then the purgeBernoulli function can be applied to equalize the sampling rates prior to unioning, as in HBMerge.


// phase = 1 or 2 (static variable initialized to 1)
// F is the maximum allowable sample footprint
//   (corresponding to a sample size of nF data-element values)
// i is the index of the newly arrived data element
// ui is the value of the newly arrived data element
// S is the sample of data elements (static, initialized to ∅)
// n is index of next element to insert into reservoir (static variable)
// expand(S) converts S from a set of (value, count) pairs to a bag of values

1   if phase = 1 then                         // insert ui into sample
2       insertValue(ui, S)
3       if footprint(S) = F then
4           phase ← 2                         // switch to reservoir mode
5           n ← i + skip(i; nF)
6       exit
7   if phase = 2 then                         // execute reservoir step
8       if i = n then                         // insert ui into sample
9           if unexpanded(S) then
10              purgeReservoir(S, nF)         // get reservoir subsample
11              expand(S)
12          removeRandomVictim(S)
13          S ← S ∪ {ui}                      // add ui to bag of values
14          n ← i + skip(i; nF)
15      exit

Figure 7. Algorithm HR

The reservoir samples produced by the algorithm described in the following section cannot be simply unioned in this manner—as discussed below, this lack of flexibility is balanced by improved stability of the sample size.

4.2 Hybrid Reservoir Sampling

The reservoir-based sampling scheme, denoted Algorithm HR, is similar to Algorithm HB in that it attempts to maintain the sample in compact histogram form until it is forced to abandon that representation. The algorithm either produces a complete histogram representation of the parent data-set partition or a histogram representation of a reservoir sample of size at most nF, where, as before, nF is the number of data-element values corresponding to the maximum allowable sample footprint of F bytes. The algorithm is displayed in Figure 7; as with Algorithm HB, Algorithm HR is invoked upon the arrival of each data element, and S is converted to compact histogram form after the last data element has arrived. Our actual implementation incorporates the various optimizations discussed in [11].

The function HRMerge, shown in Figure 8, merges two samples S1 and S2 generated by Algorithm HR from respective disjoint partitions D1 and D2. When at least one sample is exhaustive (line 1), the algorithm proceeds similarly to HBMerge. The interesting case is when both samples are true reservoir samples (line 5).

// S1 and S2 are the two input HR samples in compact form
// D1 and D2 are the parent partitions
// S is the combined sample
// F is the maximum allowed footprint of S,
//   which corresponds to nF data-element values
// hi was the final phase of Algorithm HR when creating Si
// computeProb computes the probability distribution in (2)
// genProb(P) generates a random integer distributed according to P

1   if hi = 1 for i = 1 or 2 then             // ≥ 1 sample is exhaustive
2       S ← S2−i
3       apply Algorithm HR to add values from Si to S
4       exit
5   if h1 = h2 = 2 then                       // both reservoir samples
6       k ← |S1| ∧ |S2|                       // merged sample size
7       computeProb(P, |D1|, |D2|, |S1|, |S2|, k)
8       L = genProb(P)
9       purgeReservoir(S1, L)
10      purgeReservoir(S2, k − L)
11      S ← join(S1, S2)
12      exit

Figure 8. Function HRMerge

In this case, HRMerge forms a merged simple random sample of size k = |S1| ∧ |S2| from D1 ∪ D2 by selecting L values randomly and uniformly from S1 and k − L values from S2, where L is a random variable with probability mass function P(l) = P{L = l} given by

$$P(l) = \frac{\binom{|D_1|}{l}\binom{|D_2|}{k-l}}{\binom{|D_1|+|D_2|}{k}} \qquad (2)$$

for l = 0,1,...,k. I.e., P is a hypergeometric probability distribution. The function computeProb calculates the probability vector P, and genProb generates a sample from P. The join function is the same as in HBMerge. The following result asserts the correctness of our approach.

Theorem 1 If D1 and D2 are disjoint and h1 = h2 = 2, then HRMerge produces a simple random sample of size k = |S1| ∧ |S2| from D1 ∪ D2.

See the appendix for a proof.⁴ (Our proof actually establishes the correctness of our process for any merged sample size k ∈ {1,2,...,|S1| ∧ |S2|}.) Of course, this result applies to the general problem of merging two simple random samples, even outside the context of Algorithm HR. It can be shown (see the appendix) that

$$P(l+1) = \frac{(k-l)(|D_1|-l)}{(l+1)(|D_2|-k+l+1)} \cdot P(l) \qquad (3)$$

for l = 0,1,...,k−1, so that P(0), P(1), ..., P(k) can be computed relatively quickly in the function computeProb.

There are a variety of ways to generate a random sample from P. Perhaps the most straightforward “inversion” approach generates a random number U uniformly distributed on [0,1] and returns the value L = min{l ≥ 0 : U ≤ C(l)}, where C(l) = ∑_{i=0}^{l} P(i) for 0 ≤ l ≤ k. More elaborate generation methods based on rejection are available [5], but, in our setting, do not seem to offer major performance improvements over a careful implementation of the inversion approach.

⁴The authors recently learned that a result similar to Theorem 1 has been established independently in a different context by R. Gemulla and W. Lehner.
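The following Python sketch puts these pieces together for the case of two true reservoir samples: compute_prob builds P(0),...,P(k) via the recursion (3), gen_prob performs inversion sampling, and the merged sample is assembled as in HRMerge. The samples here are plain lists, and all sizes and names are illustrative assumptions rather than the paper's implementation:

import random
from math import comb

def compute_prob(d1, d2, k):
    """Hypergeometric probabilities P(0),...,P(k) of equation (2), computed
    via the recursion in (3); d1 = |D1|, d2 = |D2|, k = merged sample size."""
    P = [comb(d2, k) / comb(d1 + d2, k)]            # P(0)
    for l in range(k):
        P.append(P[l] * (k - l) * (d1 - l) / ((l + 1) * (d2 - k + l + 1)))
    return P

def gen_prob(P, rng=random):
    """Inversion sampling: return the smallest l with U <= P(0) + ... + P(l)."""
    u, c = rng.random(), 0.0
    for l, p in enumerate(P):
        c += p
        if u <= c:
            return l
    return len(P) - 1                                # guard against round-off

# Merging two reservoir samples (plain lists) of disjoint partitions, as in HRMerge.
d1, d2 = 100_000, 60_000                             # illustrative partition sizes
s1 = random.sample(range(d1), 500)                   # reservoir sample of D1
s2 = random.sample(range(d1, d1 + d2), 400)          # reservoir sample of D2
k = min(len(s1), len(s2))
L = gen_prob(compute_prob(d1, d2, k))
merged = random.sample(s1, L) + random.sample(s2, k - L)
print(k, L, len(merged))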

In some scenarios, the partition sizes and sample sizes are unchanging and merges are performed in a symmetric pairwise fashion, in which case we need to produce many samples from a fixed probability vector P (actually, from a small collection of such probability vectors that correspond to the different levels in the binary tree that represents the merge steps). In this case, the alias method can be used to increase generation efficiency; see [17, p. 474]. The idea is to compute probabilities r_0, r_1, ..., r_k and “aliases” a_0, a_1, ..., a_k ∈ {0,1,...,k} such that

$$r_l + \sum_{j:\,a_j = l} (1 - r_j) = (k+1)\,P(l)$$

for 0 ≤ l ≤ k. It can be shown that such aliases and probabilities always exist, and several algorithms are available for computing these quantities. Once the “alias table” (r_0, a_0, r_1, a_1, ..., r_k, a_k) has been computed, samples from P can be generated rapidly. The sampling algorithm generates a random integer I taking values uniformly in {0,1,...,k} and a random number U uniformly distributed on [0,1]. The algorithm returns L = I if U ≤ r_I and L = a_I if U > r_I.
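For completeness, here is a sketch of one standard way to build and use such an alias table (Vose's construction); the probability vector is an illustrative placeholder and the function names are ours, not the paper's:

import random

def build_alias_table(P):
    """Construct an alias table (r, a) for the probability vector P using
    Vose's method, so that r[l] + sum over {j: a[j]==l} of (1 - r[j])
    equals (k+1) * P(l) for every outcome l."""
    k1 = len(P)                                  # k+1 outcomes
    r, a = [0.0] * k1, list(range(k1))
    scaled = [p * k1 for p in P]
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, g = small.pop(), large.pop()
        r[s], a[s] = scaled[s], g                # keep s with prob r[s], else alias to g
        scaled[g] -= 1.0 - scaled[s]
        (small if scaled[g] < 1.0 else large).append(g)
    for i in small + large:                      # leftovers keep probability 1
        r[i] = 1.0
    return r, a

def alias_sample(r, a, rng=random):
    """Draw one value distributed according to P in O(1) time."""
    i = rng.randrange(len(r))
    return i if rng.random() <= r[i] else a[i]

r, a = build_alias_table([0.1, 0.2, 0.3, 0.4])
print([alias_sample(r, a) for _ in range(10)])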

4.3 Preliminary Algorithm Comparison

Algorithms HB and HR have relative advantages and disadvantages. The samples produced by Algorithm HB are much less expensive to merge than those produced by Algorithm HR. On the other hand, Algorithm HB, unlike Algorithm HR, requires a priori knowledge of the data-set partition size N; see (1). In some situations this may not be a problem, for example, when the data-element arrival rate is constant. Even when the data rate is variable and more data is arriving than expected, it may be possible to form data-set partitions of specified size on the fly, in which case N is known (because it is selected by the system). If much less data arrives than expected, however, then the sample will probably be smaller than desired because q will be set too low. The fluctuations in sample size become magnified even more as samples are merged. Algorithm HR does not require accurate knowledge of N and avoids uncontrollably small sample sizes, both at sample creation and subsequent merging. This improved control over sample size comes at the cost of more expensive merge operations. In the following section, we further explore the differences between the algorithms by means of an empirical study.

5 Experiments

In this section we present the results of several experiments performed on a prototype implementation of our new algorithms. We partition a data set and observe the behavior of the various algorithms as they sample each partition (in parallel) and then execute a sequence of pairwise merges (serially) to create a uniform sample of the entire data set. (We restrict attention to serial merges to keep the experimental analysis simple.) The goal of these experiments is to evaluate the algorithms in terms of their scalability—both speedup and scaleup—and also in terms of the quality of the uniform samples that each algorithm produces.

As a benchmark for speed and scalability, we also consider a very simple “stratified Bernoulli” sampling scheme. This scheme, called Algorithm SB, produces a uniform random sample from a data set by sampling data-set partitions at a fixed rate and then simply unioning the samples. Algorithm SB does not meet our criteria for compactness, sample-size control, or an a priori bounded footprint. A comparison to this algorithm thus indicates the cost of the improved functionality that our new algorithms provide.

The key factors that affect the algorithm performance are the number of distinct values, data skew, and degree of parallelism (i.e., the number of partitions). To explore scaleup behavior and sample quality under a range of conditions, we considered three kinds of data sets: a set of unique integers between 1 and the population size, a set of data values that are uniformly distributed over the range 1 to 1,000,000, and a set of integer values over the range 1 to 4000 having a Zipf distribution. We also considered six population sizes ranging from 2^20 through 2^26 and eleven partitioning schemes ranging from a single partition to 1024 partitions, for a total of 198 test scenarios. All reported numbers represent an average over three independent and identical experiments. Unless otherwise indicated, we used a target exceedance-probability value of p = 0.001 when computing q-values for the purgeBernoulli routine. Due to lack of space, we present only a representative set of experimental results.

Our experiments were performed on a cluster of two identical machines, each with dual 1.1 GHz Intel Pentium processors. Each machine had 1 GB of main memory and used an internal disk for temporary storage of per-partition samples before merging. In all of our experiments, we instrumented the executables to report their total CPU usage in milliseconds.

We first assess the speedup performance of our algorithms as the number of partitions increases and the data-set size stays fixed. As we add more partitions, the time to produce each partition sample decreases, decreasing CPU cost, but more merges are then required, which drives up the CPU cost. The minimum of the resulting U-shaped total cost curve indicates the limits to the possible speedup. In Figures 9–11, we display such cost curves (expressed as overall elapsed time) for a fixed population size of 2^26 unique-valued data elements. We break down the costs into sampling times (light bars) and merge times (dark bars).


[Figure 9. Speedup for Algorithm SB: elapsed time in seconds (sample time and merge time) versus partition count, 1 to 1024.]

[Figure 10. Speedup for Algorithm HB: elapsed time in seconds (sample time and merge time) versus partition count, 1 to 1024.]

As expected, Algorithm SB has the best overall performance for each partitioning scheme, with Algorithm HB second best and Algorithm HR slightly slower. Algorithm SB also scales best, i.e., supports the highest degree of parallelism, with overall elapsed time improving until between 256 and 512 partitions were used. The two hybrid approaches are comparable, supporting between 32 and 64 partitions.

To assess the scaleup performance of our algorithms, we increased both the number of partitions and the population size in such a manner as to keep the number of data elements per partition constant. We then measured total elapsed time as before. Some typical results are shown in Figures 12–14. For these experiments, we held the number of data elements per partition at 32K. Note that the time scales of these three figures are different. The relative performance of the three algorithms is the same as for the speedup experiments: Algorithm SB is clearly the fastest and Algorithm HB is comparable to Algorithm HR. The scaleup is roughly linear for all algorithms.

Our final experiment examines the sample-size behavior of Algorithms HB and HR. For this experiment, we fixed the partition size at 32K data elements. Because our experiments are performed on integer data, the maximum number of data elements in a sample is nF = 8192. Our results are displayed in Figures 15 and 16.

[Figure 11. Speedup for Algorithm HR: elapsed time in seconds (sample time and merge time) versus partition count, 1 to 1024.]

[Figure 12. Scaleup for Algorithm SB: log(seconds) versus scale factor (32 to 512) for the unique, uniform, and Zipfian data sets.]

These figures show the average sample sizes produced by each algorithm for the unique and uniform distributions⁵ and for all partition counts, i.e., for all population sizes. As can be seen, the sample sizes produced by Algorithm HB are smaller and less stable than those produced by Algorithm HR; in the worst scenario (512 partitions, p = 0.001), the average sample size for Algorithm HB was 760 elements (9.25%) smaller than the corresponding average sample size for Algorithm HR. This result is not surprising: the size of an Algorithm HR sample remains constant at each pairwise merge step, whereas the size of an Algorithm HB sample decreases unpredictably, due to fluctuations induced by the Bernoulli subsampling. Figure 15 also shows that the size of the sample produced by Algorithm HB is relatively insensitive to the value of the target exceedance probability p; we can therefore take p to be very small, ensuring high statistical quality for the samples.

⁵We do not display results for the Zipfian population because in this case the number of distinct values is small and hence the samples are always exhaustive.


[Figure 13. Scaleup for Algorithm HB: log(seconds) versus scale factor (32 to 512) for the unique, uniform, and Zipfian data sets.]

[Figure 14. Scaleup for Algorithm HR: log(seconds) versus scale factor (32 to 512) for the unique, uniform, and Zipfian data sets.]

We conclude from our experiments that

1. Each of the new algorithms is within an order of magnitude of Algorithm SB in terms of sampling speed. This is a reasonable price to pay for the greatly enhanced functionality that the new algorithms provide.

2. The absolute performance of the new algorithms was quite acceptable: Algorithm HB can exploit 64-way parallelism to sample 4.6 million data elements per second, and Algorithm HR can exploit 32-way parallelism to sample 3 million data elements per second.

3. Both new algorithms achieve linear scaleup.

4. Algorithm HR yields larger and more stable sample sizes than Algorithm HB, with a concomitant loss of sampling speed.

6 Summary and Conclusion

We have articulated a proposal for a flexible and scalable sample warehouse that shadows a full-scale warehouse. In support of this proposal, we have developed novel methods for obtaining and merging random samples that guarantee statistical uniformity while keeping the size of the samples compact and providing a guaranteed bound on the maximum storage required during and after sample processing. Experimental results indicate that our methods perform acceptably in practice, with linear scaleup.

[Figure 15. Sample Sizes for Algorithm HB: average sample size versus partition count (1 to 1024) for the unique and uniform data sets, with p = 0.001 and p = 0.00001.]

[Figure 16. Sample Sizes for Algorithm HR: average sample size versus partition count (1 to 1024) for the unique and uniform data sets.]

Our results help quantify the tradeoffs between Algorithm HR and Algorithm HB—in general, Algorithm HB is faster but Algorithm HR provides larger and more stable sample sizes. Future work includes incorporation of our sampling methods into the prototype data-integration system under development by the authors [2] and other systems where scalable and flexible sampling is needed. Another area of future research is the extension of our sampling methods to handle other useful sampling designs such as stratified, systematic, and biased sampling.

Appendix: Proofs

We first derive the approximation in (1). By the central limit theorem for independent and identically distributed random variables, |S| is distributed approximately as a normal random variable with mean Nq and variance Nq(1−q), so that

$$P\{|S| > n_F\} = P\left\{\frac{|S| - Nq}{\sqrt{Nq(1-q)}} > \frac{n_F - Nq}{\sqrt{Nq(1-q)}}\right\} \approx 1 - \Phi\left(\frac{n_F - Nq}{\sqrt{Nq(1-q)}}\right),$$

where Φ is the cumulative distribution function of a standard normal random variable. Equating the rightmost term to p and using the fact that, by definition, z_p = Φ⁻¹(1−p), we have n_F − Nq = z_p √(Nq(1−q)). Squaring both sides and solving the resulting quadratic equation for q yields (1).


We now prove Theorem 1. Fix k ∈ {1,2,...,|S1| ∧ |S2|} and l ∈ {0,1,...,k}, and consider a fixed but arbitrary final sample of the form s = A ∪ B, where A = {a_1, a_2, ..., a_l} ⊆ D_1, B = {b_1, b_2, ..., b_m} ⊆ D_2, and m = k − l. The probability of obtaining sample s as a one-step simple random sample from D_1 ∪ D_2 is simply $\Pi_1(s) = 1\big/\binom{|D_1|+|D_2|}{k}$ by the definition of simple random sampling. On the other hand, our proposed two-step sampling procedure obtains first-stage samples S_1 and S_2 with |S_i| ≤ |D_i| for i = 1,2, then obtains subsamples of respective sizes L and k − L (where L has probability mass function P), and finally merges these subsamples to form the final sample. The probability that S_1 ⊇ A is given by $\binom{|D_1|-l}{|S_1|-l}\big/\binom{|D_1|}{|S_1|}$. Moreover, given that L = l and S_1 ⊇ A, the probability that the subsample from S_1 precisely equals A is $1\big/\binom{|S_1|}{l}$. Similar computations hold for the subsample B, so that the probability Π_2(s) of obtaining sample s via the two-step procedure is

$$\Pi_2(s) = \frac{\binom{|D_1|-l}{|S_1|-l}}{\binom{|D_1|}{|S_1|}} \cdot \frac{1}{\binom{|S_1|}{l}} \cdot \frac{\binom{|D_2|-m}{|S_2|-m}}{\binom{|D_2|}{|S_2|}} \cdot \frac{1}{\binom{|S_2|}{m}} \cdot P(l) = \frac{P(l)}{\binom{|D_1|}{l}\binom{|D_2|}{m}}.$$

To obtain the formula in (2), equate Π_1(s) and Π_2(s) and solve for P(l), using the easy-to-prove identities

$$\frac{\binom{|D_1|-l}{|S_1|-l}}{\binom{|D_1|}{|S_1|}\binom{|S_1|}{l}} = \frac{1}{\binom{|D_1|}{l}} \qquad\text{and}\qquad \frac{\binom{|D_2|-m}{|S_2|-m}}{\binom{|D_2|}{|S_2|}\binom{|S_2|}{m}} = \frac{1}{\binom{|D_2|}{m}},$$

which assert that if S is a simple random sample from D and S′ is a simple random sample from S, then S′ can be viewed as a simple random sample from D.

To obtain a recursive relationship between P(l+1) and P(l), use the identity

$$\frac{n-m}{m+1}\binom{n}{m} = \binom{n}{m+1}$$

together with (2) to obtain (with m = k − l)

$$P(l+1) = \frac{\binom{|D_1|}{l+1}\binom{|D_2|}{k-l-1}}{\binom{|D_1|+|D_2|}{k}} = \frac{\binom{|D_1|}{l}\,\frac{|D_1|-l}{l+1}\;\binom{|D_2|}{k-l}\,\frac{m}{|D_2|-m+1}}{\binom{|D_1|+|D_2|}{k}} = \frac{m(|D_1|-l)}{(l+1)(|D_2|-m+1)} \cdot P(l),$$

which is precisely the relation in (3).

Acknowledgement

The authors wish to thank R. Gemulla for several very helpful comments on an earlier draft of this paper.

References

[1] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In SODA 2002, pages 633–634.

[2] P. Brown, P. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery. In Data Management in a Connected World. Springer, 2005.

[3] P. G. Brown and P. J. Haas. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In VLDB 2003, pages 668–679.

[4] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. R. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE 2001, pages 534–542.

[5] L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986.

[6] P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001, pages 541–550.

[7] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD 1998, pages 331–342.

[8] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261–298, 2002.

[9] P. B. Gibbons, V. Poosala, S. Acharya, Y. Bartal, Y. Matias, S. Muthukrishnan, S. Ramaswamy, and T. Suel. AQUA: System and techniques for approximate query answering. Technical report, Bell Laboratories, Feb. 1998.

[10] P. J. Haas. Techniques for online exploration of large object-relational datasets. In SSDBMS 1999, pages 4–12.

[11] P. J. Haas. Data-stream sampling: basic techniques and results. In Data Stream Management: Processing High Speed Data Streams. Springer, 2006. To appear.

[12] P. J. Haas and C. Konig. A bi-level Bernoulli scheme for database sampling. In SIGMOD 2004, pages 275–286.

[13] A. Y. Halevy, O. Etzioni, A. Doan, Z. G. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Join synopses for approximate query answering. In CIDR 2003.

[14] IBM Corporation. WebSphere Profile Stage User's Manual. 2005.

[15] I. F. Ilyas, V. Markl, P. J. Haas, P. G. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD 2004, pages 647–658.

[16] C. Jermaine, A. Pol, and S. Arumugam. Online maintenance of very large random samples. In SIGMOD 2004, pages 299–310.

[17] A. M. Law and W. D. Kelton. Simulation Modeling and Analysis. Third edition, 2000.

[18] U. Leser and F. Naumann. (Almost) hands-off information integration for the life sciences. In CIDR 2005, pages 131–143.

[19] F. Olken. Random Sampling from Databases. Ph.D. Dissertation, University of California, Berkeley, CA, 1993.

[20] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Software, 11(1):37–57, 1985.
