
Under consideration for publication in Knowledge and Information Systems

Compression and Aggregation of Bayesian Estimates for Data Intensive Computing

Ruibin Xi1, Nan Lin2, Yixin Chen3 and Youngjin Kim4

1 Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA; 2 Department of Mathematics, Washington University, St. Louis, MO, USA; 3 Department of Computer Science, Washington University, St. Louis, MO, USA; 4 Google Inc., Mountain View, CA, USA.

Abstract. Bayesian estimation is a major and robust estimation approach for many advanced statistical models. Being able to incorporate prior knowledge in statistical inference, Bayesian methods have been successfully applied in many different fields such as business, computer science, economics, epidemiology, genetics, imaging and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud.

In this paper, we propose a novel compression and aggregation scheme (C&A) that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming partitioning of a large dataset, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time since it processes each partition only once, enables fast incremental update, and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme can make OLAP of Bayesian estimates feasible in a data cube. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of the modeling accuracy.

Received xxx; Revised xxx; Accepted xxx


Keywords: Bayesian estimation; data cubes; OLAP; stream data mining; compression; aggregation.

1. Introduction

In the last few years, there has been active research on compression and aggregation (C&A) schemes for advanced statistical analysis on structured and large-scale data [6, 7, 8, 13, 14, 17, 20, 21, 23, 34]. For a given statistical model, a general C&A scheme partitions a large dataset into segments, processes each segment separately to generate a compressed representation, and aggregates the compressed results into a final model. Key benefits of such a scheme include its support for multi-dimensional data cube analysis, online processing, and distributed processing. The C&A scheme is useful in the following scenarios.

– The techniques developed in this paper are useful for data warehousing and the associated on-line analytical processing (OLAP) computing. With our C&A scheme, a Bayesian statistical model for a given data cell can be obtained by aggregating the compressed synopses of relevant lower level cells, without building the model from raw data from scratch. Such a scheme allows for fast interactive analysis of multidimensional data to facilitate effective data mining at multiple levels of abstraction.

– The proposed C&A scheme enables online Bayesian analysis of real-time data streams. It is challenging to build online statistical models for high-speed data streams, since it is typically not practical to rebuild a complex model every time a new segment of data is received, due to high computational costs and the fact that raw data are not stored in many stream data applications. Our C&A scheme solves this problem by retaining only the synopsis, instead of raw data, in the system. For each new data segment, we compress it and use our aggregation scheme to efficiently update the model online.

– Cloud computing is a major trend for data intensive computing, as it enables scalable processing of massive amounts of data. It is a promising next-generation computing paradigm given its many advantages such as scalability, elasticity, reliability, high availability, and low cost. As data localization is important for efficiency, it is desirable that each processing unit is only responsible for its own segment of local data. The proposed C&A scheme is well suited for performing Bayesian analysis on massive datasets in the cloud. For example, the compression and aggregation phases in a C&A scheme match the mapping and reducing phases, respectively, of the well-known MapReduce algorithmic framework for cloud computing. The C&A scheme allows partitioning and parallel processing of data, and thus enables high-performance statistical analysis in the cloud.

Although there are earlier works to support C&A schemes for statistical inference, most of them are based on maximum likelihood estimation (MLE). In this paper, we propose a C&A scheme for Bayesian estimation, another major estimation approach that is considered superior to MLE in many contexts. The premise of Bayesian statistics is to incorporate prior knowledge, along with a given set of current observations, in order to make statistical inferences. The prior information could come from previous comparable experiments, from the experiences of experts, or from existing theories.


However, it is often very expensive to compute Bayesian estimates, as there generally exists no closed-form solution, and Markov chain Monte Carlo (MCMC) methods such as Gibbs samplers and Metropolis algorithms are often employed. Hence, to process large-scale data (possibly in parallel) and online stream data using Bayesian estimation, fast and effective C&A schemes are desired. C&A schemes for Bayesian estimation have not been studied before.

Earlier works on data cubes [13] support aggregation of simple measures such as sum() and average(). However, the fast development of OLAP technology has led to high demand for more sophisticated data analysis capabilities, such as prediction, trend monitoring, and exception detection of multidimensional data. Oftentimes, existing simple measures such as sum() and average() are insufficient, and more sophisticated statistical models are desired to be supported in OLAP. Recently, some researchers developed aggregation schemes for more advanced statistical analyses, including parametric models such as linear regression [8, 14], general multiple linear regression [7, 20], logistic regression analysis [34] and predictive filters [7], as well as nonparametric statistical models such as naive Bayesian classifiers [6] and linear discriminant analysis [23]. Along this line, we develop a C&A scheme to support Bayesian estimation.

Bayesian methods are statistical approaches to parameter estimation and statistical inference which use prior distributions over parameters. Bayes' rule provides the framework for combining prior information with sample data. Suppose that f(D|θ) is the probability model of the data D with parameter (vector) θ ∈ Θ and π(θ) is the prior probability density function (pdf) on the parameter space Θ. The posterior distribution of θ given the data D, using Bayes' rule, is given by

$$f(\theta|D) = \frac{f(D|\theta)\,\pi(\theta)}{\int_{\theta\in\Theta} f(D|\theta)\,\pi(\theta)\,d\theta}.$$

The posterior mean $\theta^* = \int_{\theta\in\Theta}\theta f(\theta|D)\,d\theta$ is then a Bayesian estimate of the parameter θ.

While it is easy to write down the formula of the posterior mean $\theta^*$, a closed form exists only in a few simple cases, such as a normal sample with a normal prior. In practice, MCMC methods are often used to evaluate the posterior mean. However, these algorithms are usually slow, especially for large data sets, making OLAP processing based on them impractical. Furthermore, these MCMC algorithms require using the complete data set. In many data mining applications, such as stream data applications and distributed analysis in the cloud, we often encounter the difficulty of not having the complete set of data in advance. One-scan algorithms are required for such applications.

In this paper, we propose a C&A scheme and its associated theory to support high-quality aggregation of Bayesian estimation for statistical models. In the proposed approach, we compress each data segment by retaining only the model parameters and some auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the Bayesian estimate from partitioned segments with a small and asymptotically diminishing approximation error. We further show that the Bayesian estimates and the aggregated Bayesian estimates are asymptotically equivalent.


This paper is organized as follows. In Section 2, we introduce the research problem in the context of data cubes, noting that the general theory and C&A scheme can be applied in other contexts such as stream data mining as well. In Section 3, we review the basics of Bayesian statistics. We develop the C&A scheme and its theory in Section 4 and report experimental results in Section 5. Then, we discuss related works in Section 6 and give conclusions in Section 7.

2. Concepts and Problem Definition

We develop our theory and algorithms for the C&A scheme in the context of data cubes and OLAP. We understand that the proposed theory and algorithms can also be used in other contexts, such as stream data mining and cloud computing. We present our results in a data cube context since it assumes a clear and simple structure of data, which facilitates our discussion. In our empirical study, we show results in both data cube and data stream contexts. In this section, we introduce the basic concepts related to data cubes and define our research problem.

2.1. Data cubes

Data cubes and OLAP tools are based on a multidimensional data model. The model views data in the form of a data cube. A data cube is defined by dimensions and facts. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Usually each dimension has multiple levels of abstraction formed by conceptual hierarchies. For example, country, state, city, and street are four levels of abstraction in a dimension for location.

To perform multidimensional, multi-level analysis, we need to introduce some basic terms related to data cubes. Let D be a relational table, called the base table, of a given cube. The set of all attributes A in D is partitioned into two subsets, the dimensional attributes DIM and the measure attributes M (so DIM ∪ M = A and DIM ∩ M = ∅). The measure attributes depend on the dimensional attributes in D and are defined in the context of data cubes using some typical aggregate functions, such as count(), sum(), avg(), or some Bayesian estimators to be studied here.

A tuple with schema A in a multi-dimensional data cube space is called a cell. Given two distinct cells c1 and c2, c1 is an ancestor of c2, and c2 a descendant of c1, if on every dimensional attribute either c1 and c2 share the same value, or c1's value is a generalized value of c2's in the dimension's concept hierarchy.

A tuple c ∈ D is called a base cell. A base cell does not have any descendant. A cell c is an aggregated cell if it is an ancestor of some base cells. For each aggregated cell, the values of its measure attributes are derived from the set of its descendant cells.

2.2. Aggregation and classification of data cube measures

A data cube measure is a numerical or categorical quantity that can be evaluated at each cell in the data cube space. A measure value is computed for a given cell by aggregating the data corresponding to the respective dimension-value pairs defining the given cell.


Measures can be classified into several categories based on the difficulty of aggregation.

1) An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data is partitioned into n sets. The computation of the function on each partition derives one aggregate value. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to all the data without partitioning, the function can be computed in a distributive manner. For example, count() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing count() for each subcube, and then summing up the counts obtained for the subcubes. Hence, count() is a distributive aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate functions.

2) An aggregate function is algebraic if it can be computed by an algebraic function with several arguments, each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions. min_N(), max_N() and standard_dev() are algebraic aggregate functions.

3) An aggregate function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank(). A toy sketch after this list illustrates the first two categories.
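The following toy sketch (our own illustration in Python, not from the paper) shows how a distributive measure such as sum() or count() aggregates directly from per-partition values, and how the algebraic measure avg() is reconstructed from those two distributive aggregates:

# Toy illustration (not from the paper): distributive vs. algebraic aggregation.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Distributive: per-partition results combine by applying the same function again.
sums = [sum(p) for p in partitions]        # [6.0, 9.0, 6.0]
counts = [len(p) for p in partitions]      # [3, 2, 1]
total_sum, total_count = sum(sums), sum(counts)

# Algebraic: avg() is reconstructed from the two distributive aggregates.
avg = total_sum / total_count              # 3.5, identical to averaging the raw data

# Holistic measures such as median() admit no such fixed-size per-partition
# summary; in general they need the raw data of the aggregated cell.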

Except for some simple special cases, such as a normal sample with a normal prior, Bayesian estimates seem to be holistic measures since they require the information of all the data points in an aggregated cell for their computation. In this paper, we show that Bayesian estimates are compressible measures [7, 34]. An aggregate function is compressible if it can be computed by a procedure with a number of arguments from lower level cells, and the number of arguments is independent from the number of tuples in the data cell. In other words, for compressible aggregate functions, we can compress each cell, regardless of its size (i.e., the number of tuples), into a constant number of arguments, and aggregate the function based on the compressed representation. The data compression technique should satisfy the following requirements: (1) the compressed data should support efficient lossless or asymptotically lossless aggregation of statistical measures in a multidimensional data cube environment; and (2) the space complexity of the compressed data should be low and independent from the number of tuples in each cell, as the number of tuples in each cell may be huge.

In this paper, we develop a compression and aggregation scheme for Bayesian estimates that can support asymptotically lossless aggregation.

3. Bayesian Statistics

Suppose that x1, · · · , xn are n observations from a probability model f(x|θ), where θ ∈ Θ is the parameter (vector) of the probability model f(x|θ). The prior information in Bayesian statistics is given by a prior distribution π(θ) on the parameter space Θ. Then, under the independence assumption of the observations x1, · · · , xn given the parameter θ, the posterior distribution, f(θ|x1, · · · , xn), of the parameter θ can be calculated using Bayes' rule,


$$f(\theta|x_1,\cdots,x_n) = \frac{f(x_1,\cdots,x_n|\theta)\,\pi(\theta)}{\int_{\theta\in\Theta} f(x_1,\cdots,x_n|\theta)\,\pi(\theta)\,d\theta} = \frac{\prod_{i=1}^{n} f(x_i|\theta)\,\pi(\theta)}{\int_{\theta\in\Theta}\prod_{i=1}^{n} f(x_i|\theta)\,\pi(\theta)\,d\theta},\qquad(1)$$

where $f(x_1,\cdots,x_n|\theta)$ is the joint distribution of $x_1,\cdots,x_n$ given the parameter θ.

Then, we could use the posterior mean $\theta^*_n$ as an estimate of the parameter θ, i.e.

$$\theta^*_n = \int_{\theta\in\Theta}\theta f(\theta|x_1,\cdots,x_n)\,d\theta = \Big(\int_{\theta\in\Theta}\prod_{i=1}^{n} f(x_i|\theta)\,\pi(\theta)\,d\theta\Big)^{-1}\int_{\theta\in\Theta}\theta\prod_{i=1}^{n} f(x_i|\theta)\,\pi(\theta)\,d\theta.\qquad(2)$$

MCMC methods are often employed to evaluate formula (2) due to the difficulty of direct evaluation. These algorithms are based on constructing a Markov chain that has the posterior distribution (1) as its equilibrium distribution. After running the Markov chain for a large number of steps, called burn-in steps, a sample from the Markov chain can be viewed as a sample from the posterior distribution (1). We can then approximate the posterior mean $\theta^*_n$ to any accuracy we wish by taking a large enough sample from the posterior distribution (1).

We consider the following example [10, 25, 32] to illustrate the Gibbs sampler.

Example 1: 197 animals are distributed multinomially into four categories and the observed data are y = (y1, y2, y3, y4) = (125, 18, 20, 34). A genetic model specifies cell probabilities

$$\Big(\frac{1}{2}+\frac{\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\Big).$$

Assume that the prior distribution is Beta(1, 1), which is also the uniform distribution on the interval (0, 1) and therefore is a non-informative prior. The posterior distribution of θ is

$$f(\theta|y) \propto (2+\theta)^{y_1}(1-\theta)^{y_2+y_3}\theta^{y_4}.$$

It is difficult, though not impossible, to calculate the posterior mean. However, a Gibbs sampler can be easily developed by augmenting the data y. Specifically, let x = (x1, x2, x3, x4, x5) such that y1 = x1 + x2, y2 = x3, y3 = x4 and y4 = x5. Assume the cell probabilities for x are

$$\Big(\frac{1}{2},\ \frac{\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\Big).$$

Then, the distribution of y is the marginal distribution of x. The full conditional distribution of θ is $f(\theta|x_2,y)\propto\theta^{x_2+y_4}(1-\theta)^{y_2+y_3}$, which is Beta$(x_2+y_4+1,\ y_2+y_3+1)$. The full conditional distribution of $x_2$ is $f(x_2|y,\theta)\propto(2/(2+\theta))^{y_1-x_2}(\theta/(2+\theta))^{x_2}$, i.e. the binomial distribution Binom$(y_1,\ \theta/(2+\theta))$. The Gibbs sampler starts with any value $\theta^{(0)}\in(0,1)$ and iterates the following two steps.


1. Generate $x_2^{(k)}$ from the full conditional distribution $f(x_2|y,\theta^{(k-1)})$, i.e. from Binom$(125,\ \theta^{(k-1)}/(2+\theta^{(k-1)}))$.
2. Generate $\theta^{(k)}$ from the full conditional distribution $f(\theta|x_2,y)$, i.e. from Beta$(x_2^{(k)}+35,\ 39)$.

Then, we can take the average over $\theta^{(b+s)},\cdots,\theta^{(b+sn)}$ to get an estimate of θ, where b is a large positive integer and s is a positive integer. The first b iterations are burn-in iterations, and b is usually chosen large enough such that the Markov chain converges after b iterations. When n is large enough, this average will be a very good approximation to the posterior mean. The integer s is used to reduce the correlation between two successive samples and is usually chosen to be small. In our experiment, where we set $\theta^{(0)}=0.5$, s = 1, b = 1000 and n = 5,000, the sample average we obtained is 0.622646, which is very close to the true posterior mean 0.622806. □
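For concreteness, the two-step sampler of Example 1 can be written in a few lines. The following Python/NumPy sketch is our own illustration (the function name, seed and defaults are ours, not from the paper); it alternates the draws Binom(125, θ/(2+θ)) and Beta(x2+35, 39) and averages the post-burn-in draws:

import numpy as np

def genetic_linkage_gibbs(burn_in=1000, n_samples=5000, theta0=0.5, seed=0):
    """Gibbs sampler for Example 1 with y = (125, 18, 20, 34).

    Full conditionals (from the text):
      x2 | y, theta  ~ Binom(125, theta / (2 + theta))
      theta | x2, y  ~ Beta(x2 + 35, 39)
    """
    rng = np.random.default_rng(seed)
    theta, draws = theta0, []
    for k in range(burn_in + n_samples):
        x2 = rng.binomial(125, theta / (2.0 + theta))    # step 1
        theta = rng.beta(x2 + 35, 39)                    # step 2
        if k >= burn_in:
            draws.append(theta)
    return sum(draws) / len(draws)    # approximates the posterior mean of theta

print(genetic_linkage_gibbs())        # typically close to 0.6228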

4. Compression and Aggregation of Bayesian Estimates

Since the computation of the Bayesian estimate $\theta^*_n$ often involves MCMC algorithms, the compression and aggregation of Bayesian estimates are more difficult compared to the maximum likelihood estimates (MLE) of coefficients in regression models. In general, it is very difficult to achieve lossless compression for Bayesian estimates, and we have to resort to the asymptotic theory of Bayesian estimation to derive an asymptotically lossless compression scheme.

We first review the notion of asymptotically lossless compression representation (ALCR) introduced in [34].

Definition 4.1. In data cube analysis, a cell function g is a function that takes the data records of any cell with an arbitrary size as inputs and maps them into a fixed-length vector as an output. That is:

g(c) = v, for any data cell c (3)

where the output vector v has a fixed size.

Suppose that we have a probability model f(x|θ), where x are attributes and θ is the parameter of the probability model. Suppose $c_a$ is a cell aggregated from the component cells $c_1,\cdots,c_k$. We define a cell function $g_2$ to obtain $m_i = g_2(c_i)$, $i=1,\ldots,k$, and use an aggregation function $g_1$ to obtain an estimate of the parameter θ for $c_a$ by

$\hat{\theta} = g_1(m_1,\cdots,m_k). \qquad (4)$

We say $\hat{\theta}$, an estimate of θ, is an asymptotically losslessly compressible measure if we can find an aggregation function $g_1$ and a cell function $g_2$ such that
a) the difference between $\hat{\theta} = g_1(m_1,\cdots,m_k)$ and $\hat{\theta}(c_a)$ tends to zero in probability as the number of tuples in $c_a$ goes to infinity, where $m_i = g_2(c_i)$, $i=1,\ldots,k$;
b) $\hat{\theta}(c_a) = g_1(g_2(c_a))$; and
c) the dimension of $m_i$ is independent from the number of tuples in $c_i$.

The measures $m_i$ are called an ALCR of the cell $c_i$, $i = 1,\cdots,k$. In the following, we develop an ALCR for the Bayesian estimate in (2) based on its asymptotic properties. We show that the asymptotic distribution of the estimate obtained from aggregation of the ALCRs of the component cells and that of the Bayesian estimate in the aggregated cell are the same, and further show that the difference between them approaches zero in probability as the number of tuples in $c_a$ goes to infinity.


Further, the space complexity of the ALCR is independent from the number of tuples. Therefore, Bayesian estimates are asymptotically losslessly compressible measures.

4.1. Compression and aggregation scheme

Consider aggregating K cells at a lower level into one aggregated cell at a higher level. Suppose that there are $n_k$ observations in the $k$th component cell $c_k$, and let $\{x_{k,1},\cdots,x_{k,n_k}\}$ be the observations in $c_k$. Note that the observations $x_{k,j}$ ($j=1,\cdots,n_k$) could be multidimensional. Based on the observations in the $k$th component cell $c_k$, we have the Bayesian estimate

$$\theta^*_{k,n_k} = \Big(\int_{\theta\in\Theta}\prod_{j=1}^{n_k} f(x_{k,j}|\theta)\,\pi(\theta)\,d\theta\Big)^{-1}\int_{\theta\in\Theta}\theta\prod_{j=1}^{n_k} f(x_{k,j}|\theta)\,\pi(\theta)\,d\theta.\qquad(5)$$

We propose the following asymptotically lossless compression technique for Bayesian estimation.

– Compression into ALCR. For each base cell $c_k$, $k=1,\cdots,K$, at the lowest level of the data cube, calculate the Bayesian estimate $\theta^*_{k,n_k}$ using (5). Save
ALCR $= (\theta^*_{k,n_k},\ n_k)$
in each component cell $c_k$.

– Aggregation of ALCR. Calculate the aggregated ALCR $(\theta_a, n_a)$ using the following formulas:
$$n_a = \sum_{k=1}^{K} n_k,\qquad \theta_a = n_a^{-1}\sum_{k=1}^{K} n_k\,\theta^*_{k,n_k}.$$
Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels, but for any non-base cell, its aggregated estimate $\theta_a$ is used in place of $\theta^*_{k,n_k}$ in its ALCR. (A code sketch of the compression and aggregation operations follows.)
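A minimal sketch of the two operations (our own illustrative Python code; `posterior_mean` stands for any routine, e.g. an MCMC sampler, that returns the posterior mean of a partition) is given below. Compression keeps only the pair $(\theta^*_{k,n_k}, n_k)$ per cell, and aggregation forms the sample-size weighted average:

import numpy as np

def compress(partition, posterior_mean):
    """Compress one cell into its ALCR = (theta_star, n_k).

    `posterior_mean` is a user-supplied routine (e.g. a Gibbs sampler) that
    returns the posterior mean computed from the raw data of this partition.
    """
    return np.asarray(posterior_mean(partition)), len(partition)

def aggregate(alcrs):
    """Aggregate ALCRs (theta_k, n_k) into (theta_a, n_a) by a weighted average."""
    n_a = sum(n for _, n in alcrs)
    theta_a = sum(n * theta for theta, n in alcrs) / n_a
    return theta_a, n_a

# The same call supports online updates on stream data: keep the running
# aggregated ALCR and aggregate it with the ALCR of each new data segment.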

4.2. Compressibility of Bayesian estimation

We now show that $(\theta_a, n_a)$ is an ALCR. We denote the Bayesian estimate for the aggregated cell by $\theta^*_a$ and the corresponding estimate derived from the ALCR compression and aggregation by $\theta_a$. We will show that the asymptotic distributions of $\theta^*_a$ and $\theta_a$ are the same and that their difference tends to zero in probability.

Suppose that $\Theta\subset\mathbb{R}^p$ is an open subset of $\mathbb{R}^p$. We will only give a detailed proof of the theorem in the case of p = 1 and briefly describe the proof for the multidimensional case. We make the following regularity assumptions on $f_\theta(\cdot)=f(\cdot|\theta)$ before giving the main theorem.


(C1) {x : fθ(x) > 0} is the same for all θ ∈ Θ.

(C2) $L(\theta,x)=\log f_\theta(x)$ is thrice differentiable with respect to θ in a neighborhood $U_{\delta_0}(\theta_0)=(\theta_0-\delta_0,\ \theta_0+\delta_0)$ of $\theta_0\in\Theta$. If $L'$, $L''$ and $L^{(3)}$ stand for the first, second and third derivatives, then $E_{\theta_0}|L'(\theta_0,X)|$ and $E_{\theta_0}|L''(\theta_0,X)|$ are both finite and
$$\sup_{\theta\in U_{\delta_0}(\theta_0)}|L^{(3)}(\theta,x)|\le M(x)\quad\text{and}\quad E_{\theta_0}M(X)<\infty.$$

(C3) Interchange of the order of expectation with respect to $\theta_0$ and differentiation at $\theta_0$ is justified, so that $E_{\theta_0}L'(\theta_0,X)=0$ and $E_{\theta_0}L''(\theta_0,X)=-E_{\theta_0}[L'(\theta_0,X)]^2$.

(C4) $I_{\theta_0}=E_{\theta_0}[L'(\theta_0,X)]^2>0$.

(C5) If $X_1,\cdots,X_n$ are random variables sampled from $f_{\theta_0}$ and $L_n(\theta)=\sum_{i=1}^{n}L(\theta,X_i)$, then for any $\delta>0$ there exists an $\varepsilon>0$ such that $P_{\theta_0}\{\sup_{|\theta-\theta_0|>\delta}[L_n(\theta)-L_n(\theta_0)]\le-\varepsilon\}\to 1$.

(C6) The prior has a density π(θ) with respect to the Lebesgue measure, which is continuous and positive at $\theta_0$. Furthermore, π(θ) satisfies $\int_{\theta\in\Theta}|\theta|\,\pi(\theta)\,d\theta<\infty$.

These conditions guarantee the consistency and asymptotic normality of the posterior mean and are the same as the conditions in [12].

Theorem 1. Suppose $\{f_\theta\,|\,\theta\in\Theta\}$ satisfies Conditions (C1)–(C5) and the prior distribution satisfies Condition (C6). Let $X_{k,1},\cdots,X_{k,n_k}$ ($k=1,\cdots,K$) be random variables from the distribution $f_{\theta_0}$, let $\theta^*_{k,n_k}$ be the posterior mean (2) based on the random variables $X_{k,1},\cdots,X_{k,n_k}$, and let $\theta_a=n_a^{-1}\sum_{k=1}^{K}n_k\theta^*_{k,n_k}$ be the aggregated Bayesian estimate. Then we have
$$\sqrt{n_a}(\theta_a-\theta_0)\xrightarrow{d}N(0,\ I^{-1}_{\theta_0})\quad\text{as }m_K=\min\{n_1,\cdots,n_K\}\to\infty.$$

Proof. Since $\{f_\theta\,|\,\theta\in\Theta\}$ and π(θ) satisfy Conditions (C1)–(C6), from Theorem 1.4.3 in [12] we have
$$\sqrt{n_k}(\theta^*_{k,n_k}-\theta_0)\xrightarrow{d}N(0,\ I^{-1}_{\theta_0})\quad\text{as }n_k\to\infty.$$

Let $Z_{k,n_k}=\sqrt{n_k}(\theta^*_{k,n_k}-\theta_0)$ and let $\phi_{k,n_k}(t)=E[e^{itZ_{k,n_k}}]$ be its characteristic function. Denote $v^2=I^{-1}_{\theta_0}$. Then, by Levy's Continuity Theorem (see, for example, [9] and [31] among others), $\phi_{k,n_k}(t)$ converges to $\exp(-v^2t^2/2)$ uniformly in every finite interval, where $\exp(-v^2t^2/2)$ is the characteristic function of the normal distribution $N(0,v^2)$. On the other hand, the characteristic function of the random variable $Z_{n_a}=\sqrt{n_a}(\theta_a-\theta_0)$ is
$$\phi_{n_a}(t)=E[\exp\{it\sqrt{n_a}(\theta_a-\theta_0)\}]
=E\Big[\exp\Big\{it\sqrt{n_a}\,n_a^{-1}\sum_{k=1}^{K}n_k(\theta^*_{k,n_k}-\theta_0)\Big\}\Big]
=\prod_{k=1}^{K}E\big[\exp\{it\,n_a^{-1/2}n_k(\theta^*_{k,n_k}-\theta_0)\}\big]
=\prod_{k=1}^{K}\phi_{k,n_k}(n_k^{1/2}n_a^{-1/2}t).$$

Then, we have

$$\big|\log[\phi_{n_a}(t)]+v^2t^2/2\big|
=\Big|\sum_{k=1}^{K}\Big\{\log[\phi_{k,n_k}(n_k^{1/2}n_a^{-1/2}t)]+\frac{n_k}{2n_a}v^2t^2\Big\}\Big|
\le\sum_{k=1}^{K}\Big|\log[\phi_{k,n_k}(n_k^{1/2}n_a^{-1/2}t)]+\frac{1}{2}v^2\big(n_k^{1/2}n_a^{-1/2}t\big)^2\Big|.$$

Since $\phi_{k,n_k}(t)$ converges to $\exp(-v^2t^2/2)$ uniformly in every finite interval, $\log[\phi_{k,n_k}(t)]$ will converge to $-v^2t^2/2$ uniformly in every finite interval. Then for any $\varepsilon>0$, there exists an $N_k(\varepsilon)>0$ such that when $n_k>N_k(\varepsilon)$, we have $|\log[\phi_{k,n_k}(\tau)]+v^2\tau^2/2|\le\varepsilon/K$ for all $|\tau|\le|t|$. Take $M_K(\varepsilon)=\max\{N_1(\varepsilon),\cdots,N_K(\varepsilon)\}$. Since $|n_k^{1/2}n_a^{-1/2}t|\le|t|$, we have
$$\big|\log[\phi_{n_a}(t)]+v^2t^2/2\big|\le\sum_{k=1}^{K}\varepsilon/K=\varepsilon$$

for $m_K\ge M_K(\varepsilon)$. Therefore, $\phi_{n_a}(t)$ converges to $\exp(-v^2t^2/2)$ for all $t\in\mathbb{R}$, and the theorem follows by using Levy's Continuity Theorem again.

To prove a similar result for p > 1, we need to replace Conditions (C2)–(C4) with the following conditions.

(C2′) $L(\theta,x)=\log f_\theta(x)$ is thrice differentiable with respect to θ in a neighborhood $U_{\delta_0}(\theta_0)=\{\theta:\ \|\theta-\theta_0\|<\delta_0\}$ of $\theta_0\in\Theta$. If $L'_i$, $L''_{ij}$ and $L^{(3)}_{ijk}$ stand for the first, second and third partial derivatives with respect to the $i$th, $j$th and $k$th components of θ, then $E_{\theta_0}|L'_i(\theta_0,X)|$ and $E_{\theta_0}|L''_{ij}(\theta_0,X)|$ are both finite and
$$\sup_{\theta\in U_{\delta_0}(\theta_0)}|L^{(3)}_{ijk}(\theta,x)|\le M(x)\quad\text{and}\quad E_{\theta_0}M(X)<\infty.$$

(C3′) Interchange of the order of expectation with respect to $\theta_0$ and differentiation at $\theta_0$ is justified, so that $E_{\theta_0}L'(\theta_0,X)=0$ and $E_{\theta_0}L''(\theta_0,X)=-E_{\theta_0}[L'(\theta_0,X)L'^{T}(\theta_0,X)]$, where $L'$ is the gradient vector with $i$th component $L'_i$, $L''$ is the Hessian matrix with $L''_{ij}$ as its $(i,j)$th component, and $L'^{T}(\theta_0,X)$ is the transpose of the column vector $L'(\theta_0,X)$.

(C4′) $I_{\theta_0}=E_{\theta_0}[L'(\theta_0,X)L'^{T}(\theta_0,X)]$ is a positive definite matrix.


Table 1. Success rates for different groups of stone size.

              Treatment A       Treatment B
Small stone   93% (81/87)       87% (234/270)
Large stone   73% (192/263)     69% (55/80)
Both          78% (273/350)     83% (289/350)

Theorem 2. Under Conditions (C1), (C2′)–(C4′), (C5) and (C6), we have
$$\sqrt{n_a}(\theta_a-\theta_0)\xrightarrow{d}N(0,\ I^{-1}_{\theta_0})\quad\text{as }m_K=\min\{n_1,\cdots,n_K\}\to\infty.$$

Proof. Let $\hat\theta_{k,n_k}$ be the MLE of the parameter θ based on the data $X_{k,1},\cdots,X_{k,n_k}$. From Theorem 5.1 in [18], we have
$$\sqrt{n_k}(\hat\theta_{k,n_k}-\theta_0)\xrightarrow{d}N(0,\ I^{-1}_{\theta_0})\quad\text{as }n_k\to\infty.$$
On the other hand, from Theorem 2.1 in [4], the difference between the Bayesian estimator $\theta^*_{k,n_k}$ and the MLE satisfies $n_k^{1/2}(\theta^*_{k,n_k}-\hat\theta_{k,n_k})\to 0$ almost surely. Hence, we have
$$\sqrt{n_k}(\theta^*_{k,n_k}-\theta_0)\xrightarrow{d}N(0,\ I^{-1}_{\theta_0})\quad\text{as }n_k\to\infty.$$
The remaining part of the proof is similar to the proof of Theorem 1 and is omitted.

Corollary 1. Under the conditions of Theorem 1 or 2, the difference between the estimates $\theta^*_{n_a}$ and $\theta_{n_a}$ approaches 0 in probability.

Proof. From Theorem 1, $\theta_{n_a}$ approaches $\theta_0$ in probability as $m_K$ goes to infinity. The Bayesian estimate $\theta^*_{n_a}$ also approaches $\theta_0$ in probability. Therefore, the difference between $\theta^*_{n_a}$ and $\theta_{n_a}$ converges to 0 in probability.

Corollary 1 means that the difference between $\theta^*_{n_a}$ and $\theta_{n_a}$ will become smaller as more data become available. Hence, the estimate $\theta_{n_a}$ is a good approximation to $\theta^*_{n_a}$, with a diminishing error when the dataset is large.

4.3. Detection of non-homogeneous data

Theorem 1 and Corollary 1 rely on the assumption that the data from different subcubes come from the same probability model, i.e. the data are homogeneous. Aggregation of non-homogeneous data can lead to misleading results, and Simpson's paradox [1] may occur. Therefore, it is important to develop tools for testing non-homogeneity. The test of non-homogeneity should be able to support OLAP analysis, and hence it should only depend on the compressed measures, or the ALCRs, of the subcubes. The ALCR defined in Section 4.1 is insufficient for the test of non-homogeneity, and one additional measure is needed. Let $v_{k,n_k}$ be the posterior variance matrix based on the observations in the $k$th component cell, i.e.
$$v_{k,n_k}=C^{-1}\int_{\theta\in\Theta}(\theta-\theta^*_{k,n_k})(\theta-\theta^*_{k,n_k})^{T}\prod_{j=1}^{n_k}f(x_{k,j}|\theta)\,\pi(\theta)\,d\theta,\qquad(6)$$


where $C=\int_{\theta\in\Theta}\prod_{j=1}^{n_k}f(x_{k,j}|\theta)\,\pi(\theta)\,d\theta$ is the normalizing constant. If the parameter θ is p-dimensional, the measure $v_{k,n_k}$ is a $p\times p$ matrix. We propose the following modified compression and aggregation scheme.

– Compression into ALCR. For each base cell $c_k$, $k=1,\cdots,K$, at the lowest level of the data cube, calculate the Bayesian estimate $\theta^*_{k,n_k}$ using (5) and the posterior variance $v_{k,n_k}$ using (6). Save
ALCR $=(\theta^*_{k,n_k},\ v_{k,n_k},\ n_k)$
in each component cell $c_k$.

– Aggregation of ALCR. Calculate the aggregated ALCR $(\theta_a, v_a, n_a)$ using the following formulas:
$$n_a=\sum_{k=1}^{K}n_k,\qquad\theta_a=n_a^{-1}\sum_{k=1}^{K}n_k\,\theta^*_{k,n_k},\qquad v_a=n_a^{-2}\sum_{k=1}^{K}n_k^2\,v_{k,n_k}.$$
For any non-base cell, $\theta_a$ and $v_a$ are used in place of $\theta^*_{k,n_k}$ and $v_{k,n_k}$ in its ALCR. (A code sketch of this aggregation step follows.)
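The modified aggregation step can be sketched in the same style as before (our own illustrative code, not the authors' implementation); each ALCR is now a triple $(\theta^*_{k,n_k}, v_{k,n_k}, n_k)$:

import numpy as np

def aggregate_with_variance(alcrs):
    """Aggregate ALCRs (theta_k, v_k, n_k) into (theta_a, v_a, n_a).

    theta_a = n_a^{-1} * sum_k n_k * theta_k,   v_a = n_a^{-2} * sum_k n_k^2 * v_k
    """
    n_a = sum(n for _, _, n in alcrs)
    theta_a = sum(n * np.asarray(th) for th, _, n in alcrs) / n_a
    v_a = sum(n ** 2 * np.asarray(v) for _, v, n in alcrs) / n_a ** 2
    return theta_a, v_a, n_a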

Suppose that $c_1$ and $c_2$ are two subcubes and $(\theta_1, v_1, n_1)$ and $(\theta_2, v_2, n_2)$ are their ALCRs, respectively. By Theorem 1, $\sqrt{n_k}(\theta_k-\theta_0)$ approximately follows the normal distribution $N(0, I^{-1}_{\theta_0})$, or $\theta_k-\theta_0$ ($k=1,2$) approximately follows the normal distribution $N(0, n_k^{-1}I^{-1}_{\theta_0})$. Using $v_k$ as an estimate of $n_k^{-1}I^{-1}_{\theta_0}$, it follows that $t=(\theta_1-\theta_2)^{T}(v_1+v_2)^{-1}(\theta_1-\theta_2)$ approximately follows a $\chi^2_p$ distribution. Hence, we can use the statistic t to test non-homogeneity.

We use the kidney stone data considered in [34] as an example of the test of non-homogeneity. The data are from a medical study [5, 16] comparing the success rates of two treatments for kidney stones. The two treatments are open surgery (treatment A) and percutaneous nephrolithotomy (treatment B). Table 1 shows the effects of both treatments under different conditions. It reveals that treatment A has a higher success rate than treatment B for both the small stone and large stone groups. However, after aggregating over the two groups, treatment A has a lower success rate than treatment B.

Let S be a binary random variable that indicates whether a treatment succeeds or not, and let T be the type of treatment that a patient receives. We use $p_A$ and $p_B$ to denote the success rates of treatments A and B, respectively, and $\alpha_A$ the probability that a patient receives treatment A. We have the probability model
$$\Pr(S,T\,|\,p_A,p_B,\alpha_A)=\big[p_A^{S}(1-p_A)^{1-S}\alpha_A\big]^{I(T=A)}\big[p_B^{S}(1-p_B)^{1-S}(1-\alpha_A)\big]^{I(T=B)},$$

where I(·) is the indicator function. Set the priors for $p_A$, $p_B$, $\alpha_A$ as the noninformative prior Beta(1, 1). Given observations $D=\{(s_1,t_1),\cdots,(s_n,t_n)\}$, the posterior distribution of $(p_A,p_B,\alpha_A)$ is

$$f(p_A,p_B,\alpha_A\,|\,D)\propto p_A^{n_{As}}(1-p_A)^{n_{Af}}\,p_B^{n_{Bs}}(1-p_B)^{n_{Bf}}\,\alpha_A^{n_A}(1-\alpha_A)^{n_B},$$
where $n_{As}=\sum_{i=1}^{n}s_iI(t_i=A)$, $n_{Af}=\sum_{i=1}^{n}(1-s_i)I(t_i=A)$, $n_{Bs}=\sum_{i=1}^{n}s_iI(t_i=B)$, $n_{Bf}=\sum_{i=1}^{n}(1-s_i)I(t_i=B)$, $n_A=\sum_{i=1}^{n}I(t_i=A)$ and $n_B=\sum_{i=1}^{n}I(t_i=B)$.


Fig. 1. MAD of the aggregated estimates with a varying number of partitions K, where the solid, dashed and dotted lines correspond to the initial probabilities (l)p, the transition matrices (l)P, and the parameter α in the mixture of transition models, respectively.

Therefore, the posterior distribution is the product of three independent Beta distributions.

Denote $\theta=(p_A,p_B,\alpha_A)$. Based on the small stone group data, the Bayesian estimate is $\theta^*_s=(0.92, 0.86, 0.24)$ and the corresponding posterior variance is $v_s=\mathrm{diag}\{0.00080, 0.00043, 0.00051\}$. Based on the large stone group, the Bayesian estimate is $\theta^*_l=(0.73, 0.68, 0.77)$ and the corresponding posterior variance is $v_l=\mathrm{diag}\{0.00075, 0.0026, 0.00052\}$. Using these results, the test statistic is $t=(\theta^*_s-\theta^*_l)^{T}(v_s+v_l)^{-1}(\theta^*_s-\theta^*_l)=296.63$, which is highly significant. Therefore, it is highly likely that the two data sets are inhomogeneous, and we should not aggregate them together.
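Evaluating the test statistic requires only the two ALCRs. A small sketch (our own illustrative code; the numbers are the posterior means and variances quoted above, and scipy is used only for the χ² quantile) is:

import numpy as np
from scipy import stats

theta_s = np.array([0.92, 0.86, 0.24])        # small-stone posterior mean
v_s = np.diag([0.00080, 0.00043, 0.00051])    # small-stone posterior variance
theta_l = np.array([0.73, 0.68, 0.77])        # large-stone posterior mean
v_l = np.diag([0.00075, 0.0026, 0.00052])     # large-stone posterior variance

d = theta_s - theta_l
t = d @ np.linalg.solve(v_s + v_l, d)         # (θs-θl)^T (vs+vl)^{-1} (θs-θl)
print(t)                                      # on the order of 300; the text reports
                                              # 296.63 (the gap comes from rounded inputs)
print(t > stats.chi2.ppf(0.99, df=3))         # True: reject homogeneity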

5. Experimental Evaluation

We perform experimental studies to evaluate the proposed scheme. We first evaluate the accuracy of the proposed C&A scheme. Then, we report the time and quality performance of the C&A scheme in data cube and stream data mining contexts. Finally, we apply the C&A scheme to a real data set and show that the aggregated Bayesian estimates closely approximate the Bayesian estimates.

5.1. Quality of the proposed compression and aggregation scheme

In this subsection, we use the mixture of transition models to evaluate the quality of the proposed C&A scheme. Mixtures of transition models have been used to model user visits to websites [3, 26, 27], unsupervised training of robots [24] and the dynamics of a military scenario [30].


Transition models are useful in describing time series that have only a finite number of states. The observations of a transition model are finite-state Markov chains of finite length. For example, the sequence (A, B, A, C, B, B, C) could be a realization of a 3-state first-order Markov chain, where the transition probability at time t only depends on the state of the Markov chain at time t but not on the previous history. If all the observations are realizations from the same transition model, one can readily get a closed form of the posterior mean of the parameters. However, the set of sequences may be heterogeneous and the sequences may come from several different transition models, in which case the mixture of transition models is useful in estimating the transition matrices and clustering the observed sequences.

Consider a data set of N sequences, $D=\{x_1,\cdots,x_N\}$, that are realizations from some s-state discrete first-order Markov process. The sequences are possibly of different lengths. Assume that each sequence comes from one of m transition models. Let $(l)P_{ij}$ be the element (i, j) of the lth probability transition matrix, or the transition probability from state i to state j for a process in cluster l. Let $(l)p_i$ be the ith element of the initial state distribution of processes from cluster l. Further assume that $\alpha_l$ is the probability that a process is from cluster l. Denote $x^0_k$ as the initial state of the sequence $x_k$ and $n^{(k)}_{ij}$ as the number of times that the process $x_k$ transitioned from state i to state j. The mixture of transition models is

$$f(x_k|\theta)=\sum_{l=1}^{m}\alpha_l\prod_{i=1}^{s}{}^{(l)}p_i^{\,I(x^0_k=i)}\prod_{i=1}^{s}\prod_{j=1}^{s}{}^{(l)}P_{ij}^{\,n^{(k)}_{ij}},$$

where θ is the parameter vector consisting of $(l)P_{ij}$, $(l)p_i$ and $\alpha_l$ as its elements, and I(·) is the indicator function. The prior distributions for the parameter vectors $\alpha=(\alpha_1,\cdots,\alpha_m)$, $(l)p=((l)p_1,\cdots,(l)p_s)$ and $(l)P_{i\cdot}=((l)P_{i1},\cdots,(l)P_{is})$ are Dirichlet priors with all parameters equal to 1. The Dirichlet priors used here are non-informative priors. The posterior mean has no closed form for this Bayesian model. However, by introducing a "missing data" indicator $\delta^{(k)}_l$, a 0/1 unobserved indicator for whether process k belongs to cluster l, one can readily develop a Gibbs sampler [26, 27].

We apply the C&A scheme to a mixture of transition models. In the experiment, the number of clusters is set to 3 and the Markov chains are 2-state chains. We generated 10,000 chains from the mixture of transition models, each of length 30. The underlying true parameters are set as follows (a small data-generation sketch follows the list).

1. initial probabilities:
$(1)p = (0.2, 0.8),\quad (2)p = (0.9, 0.1),\quad (3)p = (0.4, 0.6);$
2. transition matrices:
$(1)P = \begin{pmatrix}0.9 & 0.1\\ 0.8 & 0.2\end{pmatrix},\quad (2)P = \begin{pmatrix}0.3 & 0.7\\ 0.1 & 0.9\end{pmatrix},\quad (3)P = \begin{pmatrix}0.7 & 0.3\\ 0.9 & 0.1\end{pmatrix};$
3. the probability vector $\alpha = (0.2, 0.5, 0.3)$.
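The data-generation step for this setting can be sketched as follows (our own illustrative Python code; the names are not from the paper). Each chain first draws a cluster label from α and then evolves according to that cluster's initial distribution and transition matrix:

import numpy as np

p_init = [np.array([0.2, 0.8]), np.array([0.9, 0.1]), np.array([0.4, 0.6])]
P = [np.array([[0.9, 0.1], [0.8, 0.2]]),
     np.array([[0.3, 0.7], [0.1, 0.9]]),
     np.array([[0.7, 0.3], [0.9, 0.1]])]
alpha = np.array([0.2, 0.5, 0.3])
rng = np.random.default_rng(0)

def simulate_chain(length=30):
    """Draw a cluster label, then a 2-state Markov chain of the given length."""
    l = rng.choice(3, p=alpha)                    # cluster membership
    x = [rng.choice(2, p=p_init[l])]              # initial state
    for _ in range(length - 1):
        x.append(rng.choice(2, p=P[l][x[-1]]))    # next state given current state
    return x

data = [simulate_chain() for _ in range(10_000)]  # 10,000 chains of length 30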

We partition the entire data set into K = 1, 10, 20, · · · , 100 cells with an equal number of observations and then use our C&A scheme to approximately compute the posterior mean for the entire data set. We run the Gibbs sampler for 11,000 iterations and set the number of burn-in iterations to 1,000.


Fig. 2. Comparison of the C&A method (dotted lines) and direct method (solid lines) on data streams. (a) MAD of the aggregated and naive estimates from the true parameter; (b) MAD between the aggregated and naive estimates; (c) update time.

Note that the estimate corresponding to K = 1 is just the posterior mean. Let $(l)\hat{p}$, $(l)\hat{P}$ and $\hat{\alpha}$ be the estimates of $(l)p$, $(l)P$ and α (l = 1, 2, 3), respectively. We define the maximum absolute deviances (MAD) as $D(\hat{p},p)=\max\{|(l)\hat{p}_i-(l)p_i| : l=1,2,3,\ i=1,2\}$, $D(\hat{P},P)=\max\{|(l)\hat{P}_{ij}-(l)P_{ij}| : l=1,2,3,\ i,j=1,2\}$ and $D(\hat{\alpha},\alpha)=\max\{|\hat{\alpha}_l-\alpha_l| : l=1,2,3\}$. Figure 1 shows the MAD of the aggregated estimates for the different partitions. The solid line is for $D(\hat{p},p)$, the dashed line is for $D(\hat{P},P)$, and the dotted line is for $D(\hat{\alpha},\alpha)$. As observed from the low MAD values (all ≤ 0.005), the estimates under the various numbers of partitions all have very small errors. The evaluation shows that the accuracy of the aggregated estimates from our C&A scheme is almost as good as that of the original Bayesian estimates.

5.2. Performance on data streams

In this experiment, we apply our aggregation method to data streams. The Bayes model under consideration is the linear model with 5 predictors, x1, · · · , x5, i.e.

$$y = \beta_0 + \sum_{i=1}^{5}\beta_i x_i + \varepsilon,$$

where ε is the error term. In the experiment, we set the true parameters β = (β0, · · · , β5) = (0, 1, 2, 3, 4, 5) and the total number of observations N to 5 million. We generate the covariates xi (i = 1, · · · , 5) from the standard normal distribution, generate the error term ε from N(0, σ² = 4), and calculate the response y from the above equation. The priors of the parameters in the Bayesian model are flat priors, i.e. π(βi) ∝ 1 (i = 0, · · · , 5) and π(σ²) ∝ 1/σ². The Gibbs sampler can then be easily developed.
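As a sketch of how such a Gibbs sampler can be set up (our own illustrative code, not the authors' implementation), the flat-prior linear model has the standard full conditionals β | σ², y ~ N(β̂_OLS, σ²(XᵀX)⁻¹) and σ² | β, y ~ Inv-Gamma(n/2, ‖y − Xβ‖²/2); here X is assumed to contain an intercept column of ones:

import numpy as np

def linear_gibbs(X, y, n_iter=20_000, burn_in=1000, thin=5, seed=0):
    """Gibbs sampler for y = X beta + eps with flat priors
    pi(beta) ∝ 1 and pi(sigma^2) ∝ 1/sigma^2 (a sketch, not the paper's code)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y
    sigma2, draws = 1.0, []
    for it in range(n_iter):
        beta = rng.multivariate_normal(beta_ols, sigma2 * XtX_inv)   # beta | sigma^2, y
        sse = float(np.sum((y - X @ beta) ** 2))
        sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / sse)                 # sigma^2 | beta, y
        if it >= burn_in and (it - burn_in) % thin == 0:
            draws.append(beta)
    return np.mean(draws, axis=0)     # posterior mean of the regression coefficients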

We update our model for every 1000 new data records. In our method, whenever we receive 1000 new data records, we compute their ALCR, update the Bayes linear model by aggregating it with the previous ALCRs, and discard the raw data. We compare the performance of our method to a naive method, which stores all the stream data and uses the raw data to update the model for every 1000 new data records. We run the Gibbs sampler for 20,000 iterations, set the number of burn-in iterations to 1000, and set s to 5.

Figure 2a shows the MADs between the aggregated estimate (dashed line), the estimate from the naive method (solid line) and the true parameter β. Figure 2b shows the MAD between the two estimates. Figure 2c gives the computational time used for updating the parameter estimates with our C&A scheme (dashed line) and with the naive method (solid line). We see that, compared to the naive method, the C&A method gives almost identical accuracy but saves a tremendous amount of computing time. In fact, from Figure 2c, we see that the C&A method uses a nearly constant time to perform each online update, while the naive method uses more and more time as data accumulate. It is clear that the C&A method is more suitable for stream data mining.

5.3. Performance on data cubes

Experiment 1. In this experiment, we study the efficiency and quality of the compression and aggregation scheme for aggregated cells in data cubes. The Bayesian model under consideration is again the mixture of transition models. The two dimensions are time and location. Since the MCMC algorithm for the mixture of transition models is highly time-consuming even for moderate-sized data, we consider a relatively small data cube in this experiment. We have 20 months' records in the time dimension and 50 states in the location dimension. In practice, the data can be the records of a website that records users' visits to the website, and the location dimension can be the IP address of the user. For each state in each month, we have 500 observations, i.e. we have 500 users' records. Hence, we have 500,000 observations in total. The observations are sequences that record users' visiting paths on the website. As in Section 5.1, the number of clusters is set to 3 and the Markov chains are 2-state chains. The underlying true parameters are also set as in Section 5.1.

We compare our ALCR method to the direct Bayesian estimation method, which directly uses raw data to calculate the Bayesian estimates of the parameters, by comparing their computing time for handling 100 randomly generated queries. To save the computing time of the direct Bayesian estimation method, the aggregated cells that the queries ask for can have at most 200 base cells. More specifically, to generate a query, we first randomly select a number D from {1, · · · , 200}, and then we randomly select D cells from the 1000 base cells $(t_i, l_j)$ ($i=1,\cdots,20$, $j=1,\cdots,50$). The corresponding query asks for parameter estimates of the mixture of transition models based on the data of the selected D base cells. For example, assume that D is randomly selected as 3 and the base cells are chosen as c1 = (t1, l1), c5 = (t5, l5), c30 = (t30, l30). Then, the aggregated estimate of the aggregated cell ca = c1 ∪ c5 ∪ c30 is calculated by aggregating the ALCRs of the base cells c1, c5 and c30; the Bayesian estimate is directly calculated with Gibbs sampling based on the raw data of the aggregated cell ca. We run the Gibbs sampler for 6,000 iterations and set the number of burn-in iterations to 1,000.

Table 2. Comparison of the computational time in Experiment 1.

                     C&A method       Direct method
Compression          1,403 minutes    N/A
Query processing     0.1 minute       19,049 minutes

Table 2 shows the time with and without using compression, respectively.


Fig. 3. Comparison of the C&A method (dotted lines) and direct method (solid lines) in Experiment 1. (a) MADs for (l)p; (b) MADs for (l)P; (c) MADs for α.

The first row shows the computational time for compression and the second row shows the aggregation time for all these 100 queries. Without using ALCR compression, the aggregation time is the time to compute the Bayesian estimates directly from the raw data in the selected cells. It is obvious that our method saves a huge amount of computational time when handling OLAP queries in a data cube.

Figure 3 compares the MADs of the estimates for each query from the ALCR method and the direct method. The dotted lines are for the ALCR method and the solid lines are for the direct method. Figures 3(a), (b) and (c) show the MADs of the estimates for the initial probabilities (l)p, the transition matrices (l)P, and the probabilities α, respectively. The queries are ordered by their sizes, i.e. by the number of base cells in each query. Figure 3 shows that the estimates from the ALCR method tend to have larger MADs than the estimates from the direct method when the size of the query is large, especially for the estimates of the initial probabilities, although in general the MADs for both methods are very small. Figure 4 shows the MAD between the original Bayesian estimates and the ALCR-based estimates. The queries are ordered by their sizes. The differences in the initial probability estimates are generally larger compared to those of the other parameters, but overall the two estimates are close.

Experiment 2. In this experiment, we consider the Bayesian estimator of the linear regression model and compare the computational efficiency and accuracy of the Bayesian estimator and the C&A estimator in data cubes. The model under consideration is the same as in Section 5.2, but the underlying true parameter β was set as (1, 2, 3, 4, 5, 6). The data cube is a 6-dimensional data cube and the dimension sizes are 50, 120, 5, 4, 3 and 2, respectively. Thus, the data cube contains 50 × 120 × 5 × 4 × 3 × 2 = 720,000 base cells. To introduce more variation, the number of observations in each base cell was sampled uniformly from {100, · · · , 1000} and the variance of the error term was sampled from the chi-square distribution with 2 degrees of freedom. In total, the data cube contains around 50 GB of raw data.

We randomly generated 2000 queries and set the maximum number of base cells per query to 2000. The procedure for generating the queries is similar to that in Experiment 1. The total number of iterations of the Gibbs sampler was set to 11,000 and the number of burn-in iterations to 1000. Table 3 shows the computation time for the two methods.


Fig. 4. MAD between the original and the aggregated estimates in Experiment 1. The solid, dashed and dotted lines are MADs for the initial probabilities (l)p, the transition matrices (l)P, and the parameter α in the mixture of transition models, respectively.

Table 3. Comparison of the computational time in Experiment 2.

                     C&A method      Direct method
Compression          645 minutes     N/A
Query processing     1 minute        1,779 minutes

Again, we see that the aggregation method saves a large amount of computational time compared with the direct Bayesian estimation method. The accuracy of the estimates based on the ALCR method is also similar to that of the original Bayesian estimates (Figure 5).

5.4. Application on a real data set

In this section, we apply our compression and aggregation scheme to the Behavioral Risk Factor Surveillance System (BRFSS) survey data (2005 – 2008) [11]. The BRFSS, administered by the Centers for Disease Control and Prevention, is an ongoing data collection program designed to measure behavioral risk factors in the adult population. The BRFSS collects surveillance data on risk behaviors through monthly telephone interviews with people in the 50 states and 5 districts of the United States of America. After filtering records with missing data, this data set has around 1.2 million data points.

We are interested in modeling the variable body mass index (BMI4). The variable BMI4 takes values from 1 to 9998 and we view it as a continuous variable. The explanatory variables are SEX, AGE, EXERANY2, DIABETE2, DRNKANY4, RFSMOK3 and EDUCAG. The variables EXERANY2, DIABETE2, DRNKANY4 and RFSMOK3 describe whether an interviewee has had any kind of exercise during the past month, was told by a doctor that he/she has diabetes, has been drinking alcoholic beverages during the past month, and is a smoker, respectively. The variable EDUCAG indicates the completed education level of an interviewee.


Fig. 5. Comparison of the accuracy of the Bayesian estimates and the C&A estimates in Experiment 2. The queries are ordered by the base cell number. (a) MAD of the Bayesian estimates; (b) MAD of the ALCR estimates.

We stratify this variable into two levels, high school or lower and above high school. The variables SEX and AGE are the sex and age of an interviewee. For notational simplicity, denote Y as the response variable BMI4, and X1, · · · , X7 as the seven explanatory variables. The model under consideration is the following linear model

$$Y = \beta_0 + \sum_{i=1}^{7}\beta_i X_i + \varepsilon,$$

where ε is the error term. We compare the original Bayesian parameter estimate β* with the aggregated estimate $\tilde\beta$. To accommodate the different magnitudes of the estimates $\beta^*_i$, we use the mean relative difference $(1/8)\sum_{i=0}^{7}|\tilde\beta_i-\beta^*_i|/|\beta^*_i|$ as a measure of the accuracy of the aggregated estimate $\tilde\beta$.

The data set in each year can be partitioned into 12 subsets by month, and each subset can be further partitioned by state.

We compute the ALCR for each state in each month, and then aggregate the ALCRs. For the data in each month, we can get the aggregated estimates over the states. Figure 6 shows the mean relative difference of these aggregated estimates. The relative differences are always less than 0.04, which suggests the high accuracy of the ALCR method.
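For reference, the accuracy measure used here is straightforward to compute from the two coefficient vectors (our own illustrative helper, not from the paper):

import numpy as np

def mean_relative_difference(beta_agg, beta_star):
    """(1/8) * sum_i |beta_agg_i - beta_star_i| / |beta_star_i| over the 8 coefficients."""
    beta_agg, beta_star = np.asarray(beta_agg), np.asarray(beta_star)
    return float(np.mean(np.abs(beta_agg - beta_star) / np.abs(beta_star)))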

6. Discussion of Related Work

We discuss some related works and compare them to ours. Statistical models can be put into two categories: parametric models such as linear regression and logistic regression, and nonparametric models such as probability-based ensembles, naive Bayesian classifiers and kernel-density-based classifiers. In parametric models, emphasis is often put on parameter estimation, such as how accurate an estimator is. On the other hand, prediction accuracy is more important in evaluating the performance of a nonparametric model. The framework of regression cubes [7, 8] develops a lossless compression and aggregation scheme for general multiple linear regression.

Fig. 6. The mean relative difference between the Bayesian estimates β* and the aggregated estimates $\tilde\beta$ for the BRFSS data.

Another very related work is that on the prediction cube [6], which supports OLAP of prediction models including probability-based ensembles, naive Bayesian classifiers, and kernel-density classifiers. The prediction cubes bear similar ideas to regression cubes in that both aim at deriving high-level models from lower-level models instead of accessing the raw data and rebuilding the models from scratch. A key difference is that the prediction cube only supports models that are distributively decomposable or algebraically decomposable [6], whereas the Bayesian models in our study are not. Also, the prediction cubes deal with the prediction accuracy of nonparametric statistical models, whereas our compression theory is developed for parameter reconstruction of Bayesian models.

The above developments all focus on lossless computation for data cubes. Alternatively, asymptotically lossless computation that provides good approximations to the desired results is also acceptable in many applications when efficient storage and computation are attainable. Recently, a nearly lossless compression and aggregation scheme has been developed for logistic regression, a nonlinear parametric model [34]. An approximation technique called the quasi-cube uses the loglinear model, a parametric model, to characterize regions of a data cube [2]. Efficient storage and fast computation are achieved by storing the parameters of the loglinear models instead of the original data. In quasi-cubes, the desired computation is done based on approximations to the original data provided by the loglinear model. However, it is difficult to quantify the approximation errors in a quasi-cube.

Our paper considers aggregation operations without accessing the raw data. Palpanas, Koudas, and Mendelzon [22] have considered the reverse problem, which is to derive the original raw data from the aggregates. An approximate estimation algorithm based on maximum information entropy is proposed in [22]. It will be interesting to study the interactions of these two complementary approaches.

Safarinejadian et al. [28] recently proposed a distributed EM algorithm for estimating parameters in finite mixture models. The EM algorithm and the MCMC algorithm are two parallel algorithms for estimating parameters in finite mixture models. EM algorithms are generally faster than MCMC algorithms, but MCMC algorithms are generally easier to develop and implement and can readily provide interval estimation. A comparison of our aggregated Bayesian estimate with their distributed EM algorithm would be interesting.

Dimension hierarchies, cubes, and cube operations were formally introduced by Vassiliadis [33]. Lenz and Thalheim [19] proposed to classify OLAP aggregation functions into distributive, algebraic, and holistic ones. In data warehousing and OLAP, much progress has been made on the efficient support of standard and advanced OLAP queries in data cubes, including selective cube materialization [15] and intelligent roll-up [29]. However, the measures studied in previous OLAP systems are usually single values or simple statistics, not sophisticated statistical models such as the Bayesian models studied in this paper.

Our work is related to database engine architectures such as Netezza (www.netezza.com) and Infobright (www.infobright.com), where the synopses computed for the partitioned data blocks are used for query optimization and execution. Infobright has recently introduced the notion of a rough query, which is an approximate query based only on the synopsis without drilling down to full details. Our method matches this framework. We plan to extend the open-source version of Infobright with the proposed Bayesian synopses in order to enrich querying with the elements of Bayesian modeling.

7. Conclusions

In this paper, we have proposed an asymptotically lossless compression and aggregation technique to support efficient Bayesian estimation of statistical models. We have developed a compression and aggregation scheme that compresses a data segment into a compressed representation whose size is independent of the size of the data segment. Under regularity conditions, we have proved that the aggregated estimator is strongly consistent and asymptotically error-free. We have further proposed a compression and aggregation scheme that enables detection of non-homogeneous data.

Our experimental studies on data cubes and data streams show that our compression and aggregation method can significantly reduce computational time with little loss of accuracy. Moreover, the aggregation error diminishes as the size of the data increases. Therefore, the proposed scheme is widely applicable, as it enables efficient and accurate construction of Bayesian statistical models in a distributed fashion. It can be used in the contexts of data cubes and OLAP, stream data mining, and cloud computing. For data cubes, it allows us to quickly perform OLAP operations and compute Bayesian statistics at any level in a data cube without retrieving or storing the raw data. For stream data mining, it enables efficient one-scan online computation of Bayesian statistics without having to retain the raw data. For cloud computing, it facilitates analysis of large distributed datasets under parallel processing paradigms such as MapReduce.
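As an illustration of how the scheme maps onto such paradigms, the sketch below mimics a MapReduce-style job: the map phase compresses each data block into a synopsis and the reduce phase merges the synopses into an overall estimate without touching the raw data again. The functions compress_block and combine are placeholders for the paper's compression and aggregation operators, and the in-process driver stands in for an actual MapReduce framework; this is a minimal sketch under those assumptions.

```python
from functools import reduce

def map_phase(blocks, compress_block):
    # Map: each worker sees only its own data block and emits a fixed-size synopsis.
    return [compress_block(block) for block in blocks]

def reduce_phase(synopses, combine):
    # Reduce: synopses are merged pairwise; the raw data are never revisited.
    return reduce(combine, synopses)

# Toy example with a trivially decomposable synopsis (sum, count); in the
# paper's setting compress_block would emit the Bayesian synopsis and combine
# would implement the asymptotically lossless aggregation rule.
blocks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
compress_block = lambda b: (sum(b), len(b))
combine = lambda a, b: (a[0] + b[0], a[1] + b[1])

total, count = reduce_phase(map_phase(blocks, compress_block), combine)
print("aggregated mean:", total / count)
```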

The proposed scheme works best for scenarios with homogeneous data. Since stream data are more likely to be inhomogeneous, we would expect the convergence for stream data to be worse than for OLAP. However, as statistical analysis that ignores such inhomogeneity would be misleading, our method also provides a way to detect this inhomogeneity without accessing the raw data.


Acknowledgements. This research is partly supported by NSF grant NeTS 1017701 and a Microsoft Research New Faculty Fellowship to Y. C., and by NSF grant DMS0906023 to N. L.

References

[1] A. Agresti. Categorical Data Analysis. John Wiley and Sons, New Jersey, 2nd edition, 2002.
[2] D. Barbara and X. Wu. Loglinear-based quasi cubes. Journal of Intelligent Information Systems, 16:255–276, 2001.
[3] I. Cadez, D. Heckerman, P. Smyth, C. Meek, and S. White. Visualization of navigation patterns on a web site using model-based clustering. Technical report, Microsoft Research, 2000. MSR-TR-00-18.
[4] M. T. Chao. The asymptotic behavior of Bayes’ estimators. The Annals of Mathematical Statistics, 41(2):601–608, 1970.

[5] C. R. Charig, D. R. Webb, S. R. Payne, and O. E. Wickham. Comparison of treatment of renal calculi by operative surgery, percutaneous nephrolithotomy, and extracorporeal shock wave lithotripsy. British Medical Journal, 292:879–882, 1986.

[6] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. In Proceedings of the 31st VLDB Conference, pages 982–993, 2005.
[7] Y. Chen, G. Dong, J. Han, J. Pei, B. Wah, and J. Wang. Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18:1585–1599, 2006.
[8] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. pages 323–334, 2002.
[9] K. L. Chung. A Course in Probability Theory. Elsevier, San Diego, California, 3rd edition, 2001.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1–38, 1977.
[11] Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Survey Data. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, 2005–2008.
[12] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, New Jersey, 2002.
[13] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29–54, 1997.
[14] J. Han, Y. Chen, G. Dong, J. Pei, B. W. Wah, J. Wang, and Y. Cai. Stream cube: An architecture for multi-dimensional analysis of data streams. Distributed and Parallel Databases, 18(2):173–197, 2005.
[15] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 205–216, 1996.
[16] S. A. Julious and M. A. Mullee. Confounding and Simpson’s paradox. British Medical Journal, 309:1480–1481, 1994.
[17] A. Khoshgozaran, A. Khodaei, M. Sharifzadeh, and C. Shahabi. A hybrid aggregation and compression technique for road network databases. Knowledge and Information Systems, 17(3):265–286, 2008.
[18] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New Jersey, 2nd edition, 1998.
[19] H. Lenz and B. Thalheim. OLAP databases and aggregation functions. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management, pages 91–100, 2001.
[20] C. Liu, M. Zhang, M. Zheng, and Y. Chen. Step-by-step regression: A more efficient alternative for polynomial multiple linear regression in stream cube. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 437–448, 2003.
[21] H. Liu, Y. Lin, and J. Han. Methods for mining frequent items in data streams: an overview. Knowledge and Information Systems, pages 1–30, 2011.


[22] T. Palpanas, N. Koudas, and A. O. Mendelzon. Using datacube aggregates for approximate querying and deviation detection. IEEE Transactions on Knowledge and Data Engineering, 17(11):1465–1477, 2005.
[23] S. Pang, S. Ozawa, and N. Kasabov. Incremental linear discriminant analysis for classification of data streams. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(5):905–914, 2005.
[24] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):99–121, 2002.

[25] C. R. Rao. Linear Statistical Inference and Its Applications. John Wiley, New York, 1973.
[26] G. Ridgeway. Finite discrete Markov process clustering. Technical report, Microsoft Research, 1997. MSR-TR-97-24.
[27] G. Ridgeway and S. Altschuler. Clustering finite discrete Markov chains. In Proceedings of the Section on Physical and Engineering Sciences, pages 228–229, 1998.
[28] B. Safarinejadian, M. B. Menhaj, and M. Karrari. A distributed EM algorithm to estimate the parameters of a finite mixture of components. Knowledge and Information Systems, 23(3):267–292, 2010.

[29] G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional OLAP data. In Proceedings of the 27th VLDB Conference, pages 531–540, 2001.
[30] P. Sebastiani, M. Ramoni, P. Cohen, J. Warwick, and J. Davis. Discovering dynamics using Bayesian clustering. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science, pages 395–406. Springer, 1999.

[31] A. N. Shiryaev. Probability. Springer, New Jersey, 2nd edition, 1995.
[32] M. A. Tanner and W. H. Wong. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82:528–540, 1987.
[33] P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53–62, 1998.

[34] R. Xi, N. Lin, and Y. Chen. Compression and aggregation for logistic regression analysis in data cubes. IEEE Transactions on Knowledge and Data Engineering, 21(4):479–492, 2009.

Author Biographies

Ruibin Xi is currently a Research Associate at Harvard Medical School, Center for Biomedical Informatics. He received his PhD degree from the Department of Mathematics, Washington University in St. Louis. His research interests include statistical analysis of next generation sequencing data, copy number variation and structural variation, statistical computing, massive data analysis, variable selection methods and Bayesian statistics.

Nan Lin is an Associate Professor of Mathematics and Biostatistics at Washington University in St. Louis. He received his Ph.D. in Statistics from the University of Illinois at Urbana-Champaign in 2003. He also worked as a Postdoctoral Associate at Yale University School of Medicine from 2003 to 2004. His research interests include statistical computing, massive data analysis, robust statistics, bioinformatics and psychometrics. He is a member of the American Statistical Association and the International Chinese Statistical Association.


Yixin Chen is an Associate Professor of Computer Science at Washington University in St. Louis. He received his Ph.D. in Computing Science from the University of Illinois at Urbana-Champaign in 2005. His research interests include nonlinear optimization, constrained search, planning and scheduling, data mining, and data warehousing. His work on constraint partitioning and planning has won First Prizes in the optimal and satisficing tracks of the International Planning Competitions (2004 & 2006), the Best Paper Award at the International Conference on Tools for AI (2005), and the Best Paper Award at the AAAI Conference (2010). His work on data clustering has won the Best Paper Award at the International Conference on Machine Learning and Cybernetics (2004) and the Best Paper nomination at the International Conference on Intelligent Agent Technology (2004). He is partially funded by an Early Career Principal Investigator Award (2006) from the Department of Energy and a Microsoft Research New Faculty Fellowship (2007).

Youngjin Kim is a Software Engineer at Google Inc. in Mountain View, California. He received his Master's degree from the Department of Computer Science at Washington University in St. Louis. His research interests include machine learning and data mining from huge and noisy real-world data.

Correspondence and offprint requests to: Yixin Chen, Department of Computer Science, Washington University, St. Louis, MO, USA. Email: [email protected]