# Techniques for Warehousing of Sample Data (22nd International Conference on Data Engineering, ICDE'06, Atlanta, GA, USA, April 3-7, 2006)


<ul><li><p>Techniques for Warehousing of Sample Data</p><p>Paul G. Brown & Peter J. Haas</p><p>IBM Almaden Research Center</p><p>San Jose, California, USA</p><p>{pbrown1,phaas}@us.ibm.com</p><p>Abstract</p><p>We consider the problem of maintaining a warehouse of sampled data that shadows a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many data sets, where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.</p><p>1 Introduction</p><p>In the setting of large-scale data repositories and warehouses, random sampling has long been recognized as an invaluable tool for obtaining quick approximate answers to analytical queries, for auditing the data, and for exploring the data interactively; see, for example, [9, 10, 19].
Recently, sampling has received attention as a useful tool for data integration tasks such as automated metadata discovery [2, 3, 13, 14, 15, 18].</p><p>One means of exploiting random sampling is to sample the data on an as-needed, ad hoc basis; see, for example, [4, 10, 12]. This approach can work well within a single database management system, but can be difficult to implement in more complex warehousing and integration environments.</p><p>Figure 1. Sample-Warehouse Architecture</p><p>Another popular approach is to, in essence, maintain a warehouse of sampled data that shadows the full-scale data warehouse. This approach is implicit in the backing-sample ideas in [8] and in stream-sampling methods [11], and appears more explicitly in the various sampling-based data synopses proposed as part of the AQUA project [6, 9]; especially pertinent to our current investigation is the work in [7] on concise and counting samples. Jermaine et al. [16] also discuss techniques for maintaining large disk-based samples.</p><p>In this paper, we pursue the latter approach and present novel algorithms for maintaining a warehouse of sampled data. We focus on issues of scalability, parallel processing, and flexibility that have not previously been addressed in this setting. We assume that the full-scale warehouse comprises many data sets, where a data set is a bag (multiset) of values; the data sets can vary enormously in size. A data set might correspond, for example, to the values in the column of a relational table or to the instance values corresponding to a leaf node in an XML schema.
The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created either by dividing the batch or splitting the stream. As a data-set partition moves through the full-scale warehouse, the corresponding sample of the partition moves through the sample warehouse in a parallel manner.</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 0-7695-2570-9/06 $20.00 © 2006 IEEE</p></li><li><p>Our new sampling methods are based on well-known Bernoulli and reservoir sampling schemes, and are similar to concise sampling [7] in that (i) the footprint (i.e., the required storage space) both during and after sample collection never exceeds a user-specified a priori bound, and (ii) the sample is stored using a compact representation. For example, the complete data-set partition can be stored, provided that the number of distinct values in the partition is sufficiently small. Unlike concise and counting samples, however, our samples are truly uniform, in a sense that we make precise below. We also provide novel methods for merging partition samples to create a uniform sample from an arbitrary union of partitions; these techniques permit flexible creation of samples from the partition samples that are stored in the warehouse. We empirically compare the performance of the Bernoulli-based and reservoir-based techniques and show that our methods have good speedup and scaleup behavior.</p><p>The rest of the paper is organized as follows. In Section 2 we describe the architecture of the sample warehousing system and specify the corresponding set of requirements for our sampling methods. Section 3 describes some existing sampling schemes upon which our new methods rest. As part of our discussion we show, perhaps surprisingly, that the well-known concise-sampling scheme in [7] does not produce uniform samples.
Section 4 describes our two new methods, called Algorithm HB and Algorithm HR, for creating compact uniform random samples subject to a priori bounds on the sample footprint. Also provided are methods for merging samples created by these algorithms. In Section 5 we describe the results of an empirical study of the new methods, and Section 6 contains our conclusions and directions for future work.</p><p>2 System Architecture and Requirements</p><p>As mentioned previously, we assume that the full-scale warehouse consists of many data sets that can vary in size from a few hundred to hundreds of millions of values. The values that constitute a data set arrive in two ways: bundled into a large batched collection, or as a streamed sequence of singleton values. To achieve our flexibility and scalability goals, we allow (and typically require) each data set to be partitioned into mutually disjoint sets. The sampling infrastructure needs to support independent sampling of the partitions and subsequent merging of the per-partition samples to create a single sample of the data values in an arbitrary union of partitions. A sample of the concatenation of all partitions corresponds to a sample of the entire data set.</p><p>In one data warehousing scenario, for example, an initial batch of data from an operational system would be bulk loaded, followed up periodically by smaller sets of data reflecting additions to the operational system over time, as well as periodic deletions. We would like to be able to parallelize the sampling of the initial batch to minimize ingestion time, and then merge samples acquired from the update stream so as to maintain a sample of the total data set. In another scenario, the bulk-load component of the data set might be small but the ongoing data stream overwhelming for a single computer. Then the incoming stream could be split over a number of machines and samples from the concurrent sampling processes merged on demand.
In either scenario, it may be desirable to further partition the incoming data stream temporally, e.g., one partition per day, and then combine daily samples to form weekly, monthly, or yearly samples as needed for purposes of analysis, auditing, or exploration.</p><p>Figure 1 shows the data flow for a generic data set D; in general, D might be parallelized across multiple CPUs as D1, D2, ..., and the ith such stream may be partitioned temporally into Di,1, Di,2, ..., say by day. The sampled partitions Si,j are sent to the sample warehouse, where they may be subsequently retrieved and merged in various ways. As new daily samples are rolled in and old daily samples are rolled out, the system would approximate stream-sampling algorithms such as those described in [1, 11], but with support for parallel processing. Additional partitioning can also be performed on the fly within a stream, in order to deal robustly with fluctuations in the data arrival rate. E.g., suppose that we wish to maintain fixed-size samples and simultaneously ensure that each sample comprises at least a specified minimum fraction of its parent data. Then we wait until the ratio of sampled data to observed parent data hits the specified lower bound, at which point we finalize the current data partition (and corresponding sample), and begin a new partition (and sample).</p><p>In light of our discussion, we see that our sampling infrastructure must support the following functionality:</p><p>1. Uniform random sampling: Since most methods for sampling-based analytics and metadata discovery assume a uniform random sample (formally defined below), providing such samples is a basic functional requirement, and our focus in this paper.¹</p><p>2. Scalable, flexible, robust sampling: As discussed above, we typically partition each data set D into mutually disjoint subsets D1, D2, ..., Dk and require that our system be able to sample independently and in parallel from these data sets to obtain samples S1, S2, . . .
, Sk, where Si is a uniform random sample of the values in Di for 1 ≤ i ≤ k. We also require that our system be able to subsequently merge the samples: for any subset K ⊆ {1, 2, . . . , k} the system must be able to produce a subset of values SK that is a uniform random sample of the values in DK = ∪i∈K Di. This sample/merge functionality allows the system to deal flexibly with heterogeneous data sources and data-arrival patterns, and to handle large amounts of data quickly by exploiting parallelism.</p><p>¹In Section 4.1, we briefly discuss how stratified samples can also be produced using our methods.</p></li><li><p>3. Bounded footprint: From a systems perspective, it is highly desirable that the storage required during and after sample creation be bounded a priori, so that there are no unexpected disk or memory shortages. Because the number of data-set partitions can be large, even small fluctuations in sample sizes can have large cumulative effects.</p><p>4. Compact samples: Again, because of the large number of partitions, each sample partition should be stored in a compact manner. Perhaps the most basic requirement is that duplicate values be stored in a (value, count) format when possible, as in the concise samples and counting samples described in [7].
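The (value, count) representation can be sketched in a few lines; the sketch below is illustrative only (the helper names `compact` and `expand` are hypothetical, not from the paper):

```python
from collections import Counter

def compact(sample):
    """Store a multiset sample as (value, count) pairs.

    The footprint is proportional to the number of *distinct* sampled
    values, not the sample size, so samples with heavy duplication
    compress well.
    """
    return Counter(sample)

def expand(compact_sample):
    """Recover the flat multiset from its (value, count) representation."""
    return [v for v, c in compact_sample.items() for _ in range(c)]
```

For a sample containing, say, 1,000 copies of a single value, the compact form stores one (value, count) pair instead of 1,000 entries.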
Although not considered in the current paper, data compression techniques can be used to further minimize storage requirements for the samples; whether or not such techniques are worthwhile depends on the desired tradeoffs between processing speed and storage requirements.</p><p>3 A Survey of Pertinent Sampling Methods</p><p>In this section we briefly review some sampling schemes upon which our proposed techniques rest, and show that none of these techniques alone satisfies all of our requirements.</p><p>A sampling scheme is formally specified by an associated probability mass function P(·;D) on subsets of a population D = {1, 2, . . . , |D|} of distinct data elements. For a subset S ⊆ D, the quantity P(S;D) is the probability that the sampling scheme, when applied to D, will produce the sample S. A sampling scheme is uniform if, for any population D, the associated probability function satisfies P(S;D) = P(S′;D) whenever S, S′ ⊆ D with |S| = |S′|. I.e., all samples of equal size are equally likely. We denote the value of the ith data element by ui, and allow for the possibility that ui = uj for i ≠ j.</p><p>3.1 Bernoulli Sampling</p><p>A Bernoulli sampling scheme with sampling rate q ∈ [0, 1] includes each arriving data element in the sample with probability q and excludes the element with probability 1 − q, independently of the other data elements. We denote such a sampling scheme as Bern(q). Formally, the associated probability function is given by P(S;D) = q^|S| (1 − q)^(|D| − |S|), so that Bernoulli sampling is uniform. A key advantage of Bernoulli sampling is that collecting samples is simple and inexpensive [11], and merging Bernoulli samples is relatively straightforward (see Section 4.1).</p><p>The key disadvantage of Bernoulli sampling is that the size of the sample is random, and hence cannot be controlled.
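A Bern(q) sampler is essentially a one-liner per arriving element. The following sketch (illustrative, not the paper's code) makes the scheme, and its uncontrolled sample size, concrete:

```python
import random

def bernoulli_sample(stream, q, rng=random):
    """Bern(q): include each arriving element independently with
    probability q.

    The resulting sample is uniform, but for a population of size N
    its size is binomially distributed with parameters N and q, so it
    cannot be bounded in advance.
    """
    return [x for x in stream if rng.random() < q]
```

Because inclusions are mutually independent, samplers running in parallel over disjoint partitions need no coordination at all.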
Indeed, the size of a Bern(q) sample from a population of size N is binomially distributed with parameters N and q, so that the standard deviation of the sample size is √(Nq(1 − q)). Hence the variability of the sample size grows without bound as the population size increases.</p><p>We sometimes appeal to the following two results, which are easy consequences of the fact that data elements are included or excluded in a mutually independent manner. First, if S is a Bern(p) sample from D and S′ is a Bern(q) sample from S, then S′ is a Bern(pq) sample from D. Second, if Si is a Bern(q) sample from Di for i = 1, 2, where D1 and D2 are disjoint sets, then S1 ∪ S2 is a Bern(q) sample from D1 ∪ D2.</p><p>3.2 Reservoir Sampling</p><p>Simple random sampling (without replacement) with sample size k ≥ 1 is defined as the unique uniform sampling scheme that produces a sample of the specified size: P(S;D) = 1/C(|D|, k) if |S| = k, and P(S;D) = 0 otherwise, where C(|D|, k) denotes the binomial coefficient.</p><p>Reservoir sampling is a well-known sequential algorithm for producing a simple random sample without replacement; see [11] for details and references. The idea in reservoir sampling is to maintain the invariant that the current reservoir constitutes a simple random sample of all data elements seen so far. Thus the first k arriving data elements are inserted into the reservoir, so that the invariant property holds trivially. When the nth data element arrives (n > k), this element is included in the sample with probability k/n, replacing a randomly and uniformly selected victim; with probability 1 − (k/n) the arriving data element is not included in the sample. Note that the insertion probability is precisely the probability that the arriving element would appear in a simple random sample of size k drawn from a population of size n.
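The basic reservoir scheme just described can be sketched as follows (an illustrative sketch that omits the skip-based optimization discussed next):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain a simple random sample (without replacement) of size k.

    Invariant: after n >= k elements have arrived, the reservoir is a
    uniform size-k sample of the first n elements. Element n (1-based)
    is admitted with probability k/n and evicts a uniformly chosen
    victim.
    """
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)              # fill phase: invariant holds trivially
        elif rng.random() < k / n:           # admit with probability k/n
            reservoir[rng.randrange(k)] = x  # evict a uniformly chosen victim
    return reservoir
```

Unlike Bernoulli sampling, the footprint here is bounded a priori by k, at the cost of a per-element coin flip that the skip-generation technique avoids.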
Vitter [20] famously proposed a number of implementation tricks to speed up the basic algorithm, the primary idea being to directly generate the random skips between successive inclusions using acceptance-rejection techniques. In the following we denote by skip(n;k) the function that generates a random skip; that is, if data element n has just been processed (either included in or excluded from the reservoir), then the next data element to be included in the reservoir is element n + skip(n;k). For details of the skip function see the description of Algorithm Z in [20] or the discussion in [11]. We often use the terms simple random sample and reservoir sample interchangeably.</p><p>The key adva...</p></li></ul>
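The paper's own merge methods for footprint-bounded samples appear in Section 4 (truncated in this transcript), but the two Bernoulli composition results of Section 3.1 already suggest a simple merge rule for plain Bernoulli samples of disjoint partitions: thin each sample down to the smaller rate, then take the union. The sketch below is an extrapolation from those two results, not necessarily the paper's Section 4.1 method:

```python
import random

def merge_bernoulli(s1, q1, s2, q2, rng=random):
    """Merge Bern(q1) and Bern(q2) samples of two disjoint partitions
    into a single Bern(q) sample of their union, with q = min(q1, q2).

    Subsampling a Bern(qi) sample at rate q/qi yields a Bern(q) sample
    of the underlying partition, and the union of two same-rate
    Bernoulli samples of disjoint partitions is a Bernoulli sample of
    their union.
    """
    q = min(q1, q2)
    thinned1 = [x for x in s1 if rng.random() < q / q1]
    thinned2 = [x for x in s2 if rng.random() < q / q2]
    return thinned1 + thinned2, q
```

When the two rates are equal, nothing is thinned and the merge is a plain union, which is what makes parallel Bernoulli sampling with a common rate so convenient.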