
Dirichlet Process Mixtures of Generalized Linear Models

Lauren A. Hannah, Department of Operations Research and Financial Engineering, Princeton University

David M. Blei, Department of Computer Science, Princeton University

Warren B. Powell, Department of Operations Research and Financial Engineering, Princeton University

Abstract

We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLMs), a new method of nonparametric regression that accommodates continuous and categorical inputs and models a response variable locally by a generalized linear model. We give conditions for the existence and asymptotic unbiasedness of the DP-GLM regression mean function estimate; we then give a practical example for when those conditions hold. We evaluate the DP-GLM on several data sets, comparing it to modern methods of nonparametric regression, including regression trees and Gaussian processes.

1 Introduction

In this paper, we examine a Bayesian nonparametric solution to the general regression problem. The general regression problem models a response variable Y as dependent on a set of d-dimensional covariates X,

Y | X ∼ f(m(X)).

Here, m(·) is a deterministic mean function, which specifies the conditional mean of the response, and f is a distribution, which characterizes the deviation of the response from the conditional mean. In a regression problem, we estimate the mean function and deviation parameters from a data set of covariate-response pairs {(xi, yi)}_{i=1}^N. Given a new set of covariates x_new, we predict the response via its conditional expectation, E[Y | x_new]. In Bayesian regression, we compute the posterior expectation of these quantities, conditioned on the data.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

Regression models are a central focus of statistics and machine learning. Our goal is to develop a model that can be used in many settings. Generalized linear models (GLMs) serve this purpose in a parametric setting. They pass a linear transformation of covariates through a possibly non-linear link function to generate a response. Inherent to GLMs, however, is an assumption that the response varies according to a linear transformation of the covariates. The method that we develop relieves this assumption.

We develop Dirichlet process mixtures of generalized linear models (DP-GLMs), a Bayesian nonparametric regression model that combines the advantages of generalized linear models with the flexibility of nonparametric regression. A DP-GLM produces a regression model by modeling the joint distribution of the covariates and the response. This is done using a Dirichlet process (DP) mixture model: for each observation a hidden parameter θ is drawn, covariates are generated from a parametric distribution conditioned on θ, and then the response is drawn from a GLM conditioned on the covariates and θ. The clustering effect of the DP mixture leads to an “infinite mixture” of GLMs, a model which effectively identifies local regions of covariate space in which the covariates exhibit a consistent relationship to the response. In combination, these local GLMs represent a complex global response function. Note that the DP-GLM is flexible in that the number of segments, i.e., the number of mixture components, is determined by the observed data.
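To make this generative process concrete, the following sketch (ours, not the authors' code) draws data from a DP-GLM with Gaussian covariates and an identity-link Gaussian GLM. It approximates the Dirichlet process by truncated stick-breaking, and the base measure G0 and all hyperparameters are illustrative placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_glm(n, d=1, alpha=1.0, K=50):
    """Draw n covariate-response pairs from a truncated DP-GLM."""
    # Stick-breaking weights: V_k ~ Beta(1, alpha), pi_k = V_k * prod_{j<k}(1 - V_j)
    v = rng.beta(1.0, alpha, size=K)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi /= pi.sum()  # renormalize the mass lost to truncation

    # Component parameters theta_k = (theta_x, theta_y) drawn from a toy G0
    mu = rng.normal(0.0, 3.0, size=(K, d))        # covariate means
    sd_x = rng.gamma(2.0, 1.0, size=(K, d))       # covariate scales
    beta = rng.normal(0.0, 1.0, size=(K, d + 1))  # GLM coefficients
    sd_y = rng.gamma(2.0, 0.5, size=K)            # response scales

    z = rng.choice(K, size=n, p=pi)               # hidden parameter per point
    x = rng.normal(mu[z], sd_x[z])                # X_i | theta_{x, z_i}
    eta = beta[z, 0] + np.sum(beta[z, 1:] * x, axis=1)
    y = rng.normal(eta, sd_y[z])                  # Gaussian GLM response
    return x, y, z

x, y, z = sample_dp_glm(200)
```

Each distinct value of z marks a local GLM; together the components assemble the complex global response function described above.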

Like the Dirichlet process regression models of Muller et al. (1996) and Rodriguez et al. (2009), we model the joint distribution of the covariates and response to generate an implicit conditional response distribution. This method is conceptually similar to the double-kernel method of Fan et al. (1996), except that the kernel generated by the Dirichlet process is determined by the imparted distribution over partition structures rather than a traditional distance metric. The DP-GLM is a generalization of several existing DP-based regression models (Muller et al., 1996; Shahbaba and Neal, 2009) to a variety of covariate types and response distributions.


Bayesian nonparametric models have previously been proposed, but they lacked generality and/or theoretical guarantees, such as the existence of a mean function estimator or its asymptotic unbiasedness.

In this paper, we present the DP-GLM and give conditions under which a mean function estimate exists and is asymptotically unbiased. We review the Bayesian nonparametric regression literature in Section 2, review Dirichlet processes and GLMs in Section 3, present the DP-GLM in Section 4, and give theoretical properties in Section 5. In Section 6, we report on a study of the DP-GLM applied to several data sets, illustrating its flexibility with multiple types of covariates and response variables.

2 Bayesian Regression Literature

Nonparametric regression is a field that has received considerable study, but less so in a Bayesian context. Gaussian process (GP) and Dirichlet process mixtures are the most common prior choices for Bayesian nonparametric regression. GP priors assume that the observations arise from a Gaussian process model with a known covariance function form (see Rasmussen and Williams (2006) for a review). Without heavy modification, however, the GP model is only applicable to problems with continuous covariates.

Dirichlet process priors have been used previously for regression. West et al. (1994), Escobar and West (1995) and Muller et al. (1996) used joint Gaussian mixtures for the covariates and response; Rodriguez et al. (2009) generalized this method using dependent DPs for multiple response functionals. This method can be slow if a fully populated covariance matrix is used, and potentially inaccurate if only a diagonal matrix is used. To avoid over-fitting the covariate distribution and under-fitting the response, local weights on the covariates have been used to produce local response DPs, with kernels and basis functions (Griffin and Steel, 2007; Dunson et al., 2007), GPs (Gelfand et al., 2005) or general spatial-based weights (Griffin and Steel, 2006, 2007; Duan et al., 2007). Other methods, such as dependent DPs, have been introduced to capture similarities between clusters, covariates or groups of outcomes (De Iorio et al., 2004; Rodriguez et al., 2009). The DP-GLM tries to balance fitting the covariate and response distributions by introducing local GLMs: the clustering structure is driven heavily by the covariates, but within each cluster the response fit is better because it is represented by a GLM rather than a constant. This method is simple and flexible.

Dirichlet process priors have also been used in conjunction with GLMs. Mukhopadhyay and Gelfand (1997) and Ibrahim and Kleinman (1998) used a DP prior for the random effects portion of the GLM. Likewise, Amewou-Atisso et al. (2003) used a DP prior to model arbitrary symmetric error distributions in a semi-parametric linear regression model. Shahbaba and Neal (2009) proposed a model that mixes over both the covariates and response, which are linked by a multinomial logistic model. The DP-GLM studied here is a generalization of this idea.

The asymptotic properties of Dirichlet process regression models have not been well studied. Most current literature centers on consistency of the posterior density for DP Gaussian mixture models (Barron et al., 1999; Ghosal et al., 1999; Ghosh and Ramamoorthi, 2003; Walker, 2004; Tokdar, 2006) and semi-parametric linear regression models (Amewou-Atisso et al., 2003; Tokdar, 2006). Only recently have the posterior properties of DP regression estimators been studied. Rodriguez et al. (2009) showed pointwise asymptotic unbiasedness for their model, which uses a dependent Dirichlet process prior, assuming continuous covariates under different treatments with a continuous response and a conjugate base measure (normal-inverse-Wishart).

3 Mathematical Background

Dirichlet Process Mixture Models. In a Bayesian mixture model, we assume that the true density of the covariates X and response Y can be written as a mixture of parametric densities, such as Gaussians or multinomials, conditioned on a hidden parameter θ. For example, in a Gaussian mixture, θ includes the mean µ and variance σ². Due to our model formulation, θ is split into two parts: θx, which is associated only with the covariates X, and θy, which is associated only with the response Y. Set θ = (θx, θy). The marginal probability of an observation is given by a continuous mixture,

f0(x, y) = ∫_T f(x, y | θ) P(dθ).

In this equation, T is the set of all possible parameters and the prior P is a measure on that space.

The Dirichlet process models uncertainty about the prior density P (Ferguson, 1973; Antoniak, 1974). If P is drawn from a Dirichlet process, then it can be analytically integrated out of the conditional distribution of θn given θ1:(n−1). Specifically, the random variable Θn has a Polya urn distribution (Blackwell and MacQueen, 1973),

Θn | θ1:(n−1) ∼ (1/(α + n − 1)) Σ_{i=1}^{n−1} δ_{θi} + (α/(α + n − 1)) G0.    (1)
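A minimal Python sketch of this urn scheme, assuming only that a generic base measure is supplied as a sampling function, makes the two cases of equation (1) explicit:

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_urn(n, alpha, g0):
    """Draw theta_1, ..., theta_n from the Polya urn of equation (1).

    g0: callable returning one draw from the base measure G0.
    """
    thetas = []
    for i in range(n):
        # With prob alpha / (alpha + i), draw a fresh value from G0;
        # otherwise copy one of the i existing values uniformly at random.
        if rng.random() < alpha / (alpha + i):
            thetas.append(g0())
        else:
            thetas.append(thetas[rng.integers(i)])
    return thetas

draws = polya_urn(10, alpha=1.0, g0=lambda: rng.normal())
```

Repeated values in the output correspond to shared parameters, which is the clustering property discussed next.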


(Lower case values refer to observed or fixed values, while upper case values refer to random variables.)

Equation (1) reveals the clustering property of the joint distribution of θ1:n: there is a positive probability that each θi will take on the value of another θj, leading some of the parameters to share values. This equation also makes clear the roles of α and G0. The unique values of θ1:n are drawn independently from G0; the parameter α determines how likely Θn+1 is to be a newly drawn value from G0 rather than to take on one of the values from θ1:n. G0 controls the distribution of a new component.

In a DP mixture, θ is a latent parameter to an observed data point x (Antoniak, 1974),

P ∼ DP(αG0),    Θi ∼ P,    xi | θi ∼ f(· | θi).

Examining the posterior distribution of θ1:n given x1:n brings out its interpretation as an “infinite clustering” model. Because of the clustering property, observations are grouped by their shared parameters. Unlike finite clustering models, however, the number of groups is random and unknown. Moreover, a new data point can be assigned to a new cluster that was not previously seen in the data.
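This clustering interpretation can be probed numerically. The sketch below (ours, not from the paper) estimates the posterior probability that two observations share a hidden parameter, given posterior samples of the cluster assignments; this quantity reappears in Section 4 as the DP's implicit “kernel.”

```python
import numpy as np

def shared_parameter_prob(z_samples, i, j):
    """Posterior probability that observations i and j share a hidden
    parameter, estimated from an (M x n) array of cluster-assignment
    samples (one row per posterior draw)."""
    z = np.asarray(z_samples)
    return float(np.mean(z[:, i] == z[:, j]))

# Toy usage with three posterior draws over four observations:
z_samples = [[0, 0, 1, 1], [0, 0, 0, 1], [2, 2, 1, 1]]
print(shared_parameter_prob(z_samples, 0, 1))  # -> 1.0
```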

Generalized Linear Models. Generalized linear models (GLMs) build on linear regression to provide a flexible suite of predictive models. GLMs relate a linear model to a response via a link function; examples include familiar models like logistic regression, Poisson regression, and multinomial regression. (See McCullagh and Nelder (1989) for a full discussion.)

GLMs have three components: the conditional probability model for the response Y, the linear predictor, and the link function. The probability model for Y, dependent on covariates X, is

f(y | η) = exp( (yη − b(η)) / a(φ) + c(y, φ) ).

Here the canonical form of the exponential family is given, where a, b, and c are known functions specific to the exponential family, φ is an arbitrary scale (dispersion) parameter, and η is the canonical parameter. A linear predictor, Xβ, is used to determine the canonical parameter through a set of transformations. It can be shown that b′(η) = µ = E[Y | X]. However, we can choose a link function g such that µ = g⁻¹(Xβ), which defines η in terms of Xβ. The canonical form is useful for discussing GLM properties, but we use the mean form in the rest of this paper. The flexible nature of GLMs allows us to use them as a local approximation for a global response function.
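To illustrate the mean form used in the rest of the paper, here is a small helper (ours, for illustration) that evaluates E[Y | x] = g⁻¹(Xβ) for three standard links; all the DP-GLM needs from the GLM is that this conditional mean be computable.

```python
import numpy as np

def glm_mean(x, beta, link="identity"):
    """Conditional mean E[Y | x] = g^{-1}(x'beta) for common links.

    x: covariate vector; beta: intercept followed by slopes.
    """
    eta = beta[0] + np.dot(beta[1:], x)   # linear predictor
    if link == "identity":                # Gaussian / linear regression
        return eta
    if link == "logit":                   # logistic regression
        return 1.0 / (1.0 + np.exp(-eta))
    if link == "log":                     # Poisson regression
        return np.exp(eta)
    raise ValueError(f"unknown link: {link}")
```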

[Figure 1: three stacked panels ("Smoothed Estimate", "Data Assigned to Clusters", "Clusters") plotting the power spectrum C_ℓ against multipole moment ℓ.]

Figure 1: The top figure shows the smoothed regression estimate for the Gaussian model of equation (2). The center figure shows the training data (blue) fitted into clusters, with the prediction given a single sample from the posterior, θ(i) (red). The bottom figure shows the underlying clusters (blue) with the fitted response (red) for each point in the cluster. The data plot multipole moments against the power spectrum C_ℓ for cosmic microwave background radiation (Bennett et al., 2003).

4 Dirichlet Process Mixtures of Generalized Linear Models

We now develop Dirichlet process mixtures of generalized linear models (DP-GLMs), a flexible Bayesian predictive model that places prior mass on a large class of response densities. Given a data set of covariate-response pairs, we describe Gibbs sampling algorithms for approximate posterior inference and prediction. We present theoretical properties of the DP-GLM in Section 5.

4.1 DP-GLM Formulation

In a DP-GLM, we assume that the covariates X are modeled by a mixture of exponential-family distributions, the response Y is modeled by a GLM conditioned on the inputs, and that these models are connected by associating a set of GLM coefficients with each exponential-family mixture component. Let θ = (θx, θy) denote the bundle of parameters over X and Y | X, and let G0 denote a base measure on the space of both. For example, θx might be a set of d-dimensional multivariate Gaussian location and scale parameters for a vector of continuous covariates; θy might be a (d+2)-vector of reals for their corresponding GLM linear prediction coefficients, along with a GLM dispersion parameter. The full model is

P ∼ DP(αG0),
(θx,i, θy,i) | P ∼ P,
Xi | θx,i ∼ fx(· | θx,i),
Yi | xi, θy,i ∼ GLM(· | xi, θy,i).

The density fx describes the covariate distribution; the GLM for y depends on the form of the response (continuous, count, category, or others) and how the response relates to the covariates (i.e., the link function).

The Dirichlet process clusters the covariate-response pairs (x, y). When both are observed, i.e., in “training,” the posterior distribution of this model will cluster data points according to nearby covariates that exhibit the same kind of relationship to their response. When the response is not observed, its predictive expectation can be understood by clustering the covariates based on the training data, and then predicting the response according to the GLM associated with the covariates' cluster. The DP prior acts as a kernel for the covariates; instead of being a Euclidean metric, the DP measures the distance between two points by the probability that the hidden parameter is shared. See Figure 1 for a demonstration of the DP-GLM. We now give an example of the DP-GLM for continuous covariates/response that will be used throughout the rest of the paper.

Example: Gaussian Model. For continuous covariates/response in R, we model locally with a Gaussian distribution for the covariates and a linear regression model for the response. The covariates have mean µi,j and variance σ²i,j for the jth dimension of the ith observation; the covariance matrix is diagonal for simplicity. The GLM parameters are the linear predictor βi,0, . . . , βi,d and the response variance σ²i,y. Here, θx,i = (µi,1:d, σi,1:d) and θy,i = (βi,0:d, σi,y). This produces a mixture of multivariate Gaussians. The full model is

P ∼ DP(αG0),    (2)
Θi | P ∼ P,
Xi,j | µi,j, σi,j ∼ N(µi,j, σ²i,j),   j = 1, . . . , d,
Yi | xi, βi, σi,y ∼ N(βi,0 + Σ_{j=1}^{d} βi,j xi,j, σ²i,y).

4.2 DP-GLM Regression

The DP-GLM is used in prediction problems. Given a collection of covariate-response pairs (xi, yi)1:n, our goal is to compute the expected response for a new set of covariates x. Conditional on the latent parameters that generated the observed data, θ1:n, the expectation of the response is

E[Y | x, θ1:n] = (α/b) ∫_T E[Y | x, θ] fx(x | θ) G0(dθ) + (1/b) Σ_{i=1}^{n} E[Y | x, θi] fx(x | θi),    (3)

b = α ∫_T fx(x | θ) G0(dθ) + Σ_{i=1}^{n} fx(x | θi).

Algorithm 1: DP-GLM Regression
  Data: observations (Xi, Yi)1:n, functions fx and fy, number of posterior samples M, query x
  Result: mean function estimate at x, m̄(x)
  initialization;
  for m = 1 to M do
      obtain posterior sample θ(m)1:n | (Xj, Yj)1:n;
      compute E[Y | x, θ(m)1:n];
  end
  set m̄(x) = (1/M) Σ_{m=1}^{M} E[Y | x, θ(m)1:n];

Since Y is assumed to be a GLM, the quantity E[Y | x, θ] is analytically available as a function of x and θ.

As θ1:n is not actually known, the unobserved random variables Θ1:n are integrated out of equation (3) using the posterior distribution given the observed data. Let ΠP denote the DP prior on MT, the space of all measures over the hidden parameters. Since ∫_T fy(y | x, θ) fx(x | θ) P(dθ) is a density for (x, y), ΠP induces a prior on F, the set of all densities f on (x, y). Denote this prior by Πf and define the posterior distribution

Πf_n(A) = ( ∫_A ∏_{i=1}^{n} f(Xi, Yi) Πf(df) ) / ( ∫_F ∏_{i=1}^{n} f(Xi, Yi) Πf(df) ),

where A ⊆ F. Define ΠP_n similarly. The regression is then

E[Y | x, (Xi, Yi)1:n] = (1/b) Σ_{i=1}^{n} ∫_{MT} ∫_T E[Y | x, θi] fx(x | θi) P(dθi) ΠP_n(dP) + (α/b) ∫_T E[Y | x, θ] fx(x | θ) G0(dθ),    (4)

where b normalizes the probability of Y being associated with the parameter θi.

Equation (4) is difficult to compute because it requires integration over a hidden random measure. To avoid this problem, we approximate equation (4) by an average of M Monte Carlo samples of the expectation conditioned on θ(m)1:n,

E[Y | x, (Xi, Yi)1:n] ≈ (1/M) Σ_{m=1}^{M} E[Y | x, θ(m)1:n].    (5)

The regression procedure is given in Algorithm 1. We describe how to generate the posterior samples {θ(m)1:n}_{m=1}^{M} in Section 4.3.
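Algorithm 1 is a short loop in code. In the Python sketch below (ours), the posterior sampler and the conditional-expectation routine are assumed to be supplied as callables; a Gaussian-model version of the latter is sketched after the next example.

```python
import numpy as np

def dp_glm_regression(x, posterior_sampler, conditional_mean, M=100):
    """Algorithm 1: Monte Carlo estimate of the DP-GLM mean function.

    posterior_sampler(): one posterior draw of theta_{1:n} given the
        training data (e.g., the Gibbs sampler of Section 4.3).
    conditional_mean(x, theta): E[Y | x, theta_{1:n}], equation (3).
    """
    draws = [conditional_mean(x, posterior_sampler()) for _ in range(M)]
    return np.mean(draws)  # m_bar(x), the average in equation (5)
```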

Example: Gaussian Model. Continuing our example of the Gaussian model, equation (3) becomes

E[Y | x, θ1:n] = (α/b) ∫_T β_{0:d}ᵀ x ∏_{j=1}^{d} φσj(xj − µj) G0(dθ) + (1/b) Σ_{i=1}^{n} β_{i,0:d}ᵀ x ∏_{j=1}^{d} φσi,j(xj − µi,j),

b = α ∫_T ∏_{j=1}^{d} φσj(xj − µj) G0(dθ) + Σ_{i=1}^{n} ∏_{j=1}^{d} φσi,j(xj − µi,j),

where φσ(x) is the Gaussian density at x with variance σ². An example of regression for a single-covariate Gaussian model is shown in Figure 1.
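A possible implementation of this Gaussian-model expectation is sketched below (our code, not the authors'). The G0 integral terms are approximated here by Monte Carlo draws from the base measure; with a conjugate G0 they can instead be computed in closed form.

```python
import numpy as np
from scipy.stats import norm

def gaussian_conditional_mean(x, thetas, alpha, g0_samples):
    """E[Y | x, theta_{1:n}] for the Gaussian model, equation (3).

    thetas: per-observation tuples (mu, sd_x, beta, sd_y);
    g0_samples: draws from G0, used to Monte Carlo the integral terms.
    """
    def f_x(theta):                      # covariate density f_x(x | theta)
        mu, sd_x, _, _ = theta
        return float(np.prod(norm.pdf(x, loc=mu, scale=sd_x)))

    def mean_y(theta):                   # local GLM mean beta' (1, x)
        _, _, beta, _ = theta
        return beta[0] + float(np.dot(beta[1:], x))

    w = np.array([f_x(t) for t in thetas])       # weights for observed theta_i
    m = np.array([mean_y(t) for t in thetas])
    gw = np.array([f_x(t) for t in g0_samples])  # Monte Carlo G0 terms
    gm = np.array([mean_y(t) for t in g0_samples])

    b = alpha * gw.mean() + w.sum()              # normalizer b of equation (3)
    return (alpha * np.mean(gw * gm) + np.dot(w, m)) / b
```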

4.3 Posterior Sampling Methods

The above algorithm relies on samples of θ1:n | (Xi, Yi)1:n. We use Markov chain Monte Carlo (MCMC), specifically Gibbs sampling, to obtain {θ(m)1:n | (Xi, Yi)1:n}_{m=1}^{M}. Gibbs sampling has a long history of use for DP mixture posterior inference (see Neal (2000) for state-of-the-art algorithms). We use Algorithm 8 of Neal (2000), but conjugate base measures allow the use of the more efficient collapsed sampler, Algorithm 3.
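As a compressed illustration of what one reassignment sweep looks like, here is a sketch in the spirit of Neal's Algorithm 8. It is ours and deliberately simplified: auxiliary parameters are always drawn fresh from G0 (Algorithm 8 proper reuses a singleton's old parameter as one auxiliary), and the within-cluster resampling of θ that completes a Gibbs sweep is omitted. The DP-GLM likelihood fx(xi | θx) fy(yi | xi, θy) is supplied as a callable.

```python
import numpy as np

def reassign_clusters(z, params, data, alpha, g0, likelihood, m_aux=3, rng=None):
    """One cluster-reassignment sweep in the spirit of Neal (2000), Alg. 8.

    z: int label per observation; params: dict label -> theta;
    likelihood(datum, theta) = f_x(x | theta_x) * f_y(y | x, theta_y);
    g0(): one draw from the base measure.
    """
    rng = rng or np.random.default_rng()
    z = np.asarray(z)
    for i in range(len(data)):
        z[i] = -1                                  # remove point i
        counts = {k: int(np.sum(z == k)) for k in params}
        for k in [k for k, c in counts.items() if c == 0]:
            del params[k], counts[k]               # drop emptied clusters
        aux = [g0() for _ in range(m_aux)]         # fresh candidates from G0
        weights = [c * likelihood(data[i], params[k]) for k, c in counts.items()]
        weights += [(alpha / m_aux) * likelihood(data[i], th) for th in aux]
        w = np.asarray(weights, dtype=float)
        choice = rng.choice(len(w), p=w / w.sum())
        labels = list(counts)
        if choice < len(labels):                   # join an existing cluster
            z[i] = labels[choice]
        else:                                      # start a new cluster
            new = max(params, default=-1) + 1
            params[new] = aux[choice - len(labels)]
            z[i] = new
    return z, params
```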

5 Theoretical Properties

Two pitfalls might arise when using the DP-GLM: an inability to compute the mean function, and mean function bias that does not tend toward 0. A desirable property of any estimator is that it be unbiased; if this holds in the limit, the property is called asymptotic unbiasedness. Diaconis and Freedman (1986) give an example of a location model with a Dirichlet process prior where the estimated location can be bounded away from the true location, even when the number of observations approaches infinity. We want to ensure that the DP-GLM avoids these problems; we sketch the ideas needed to show asymptotic unbiasedness and existence of the mean function estimate, and then give the corresponding theorems.

5.1 Consistency and Theoretical Properties

Consistency, the notion that as the number of observations goes to infinity the posterior distribution accumulates in neighborhoods arbitrarily “close” to the true distribution, is tightly related to both asymptotic unbiasedness and existence of the mean function estimate. Weak consistency assures that the posterior distribution accumulates in regions of densities where “properly behaved” functions (i.e., bounded and continuous) integrated with respect to the densities in the region are arbitrarily close to the integral with respect to the true density. Note that an expectation is not bounded; in addition to weak consistency, uniform integrability is needed to guarantee that the posterior expectation converges to the true expectation, giving asymptotic unbiasedness. Uniform integrability also ensures that the posterior expectation almost surely exists with every additional observation. Therefore we need to show weak consistency and uniform integrability.

With the inclusion of covariates, the observations of covariate-response pairs are not identically distributed, so we use a modified variant of the Schwartz (1965) theorem for weak posterior consistency, given in Amewou-Atisso et al. (2003). We assume that covariates are observed from the entire distribution; we give conditions for consistency for a predictor x in a compact subset C of the covariate domain. In practice, this is not particularly restrictive, as C can be made arbitrarily large.

5.2 Asymptotic Unbiasedness

We are now ready to state the main theorem.

Theorem 5.1. Let x be in a compact set C and let Πf be a prior on F. If

(i) for every δ > 0, Πf puts positive measure on
{ f : ∫ f0(x, y) log(f0(x, y)/f(x, y)) dx dy < δ,  ∫ f0(x, y) (log(f0(x, y)/f(x, y)))² dx dy < δ },

(ii) ∫ |y|² f0(y | x) dy < ∞ for every x ∈ C, and

(iii) there exists an ε > 0 such that for every x ∈ C, ∫∫ |y|^{1+ε} fy(y | x, θ) G0(dθ) dy < ∞,

then for every n ≥ 0, EΠ[Y | x, (Xi, Yi)1:n] exists and has the limit Ef0[Y | x], almost surely P_{F0}^∞.

The conditions of Theorem 5.1 must be checked for the problem (f0) and prior (Πf) pair, and can be difficult to show. Condition (i) assures weak consistency of the posterior, condition (ii) guarantees that a mean function exists in the limit, and condition (iii) guarantees that positive probability is only placed on densities that yield a finite mean function estimate. See Hannah et al. (2009) for a discussion and proof.

5.3 Example: Gaussian Model with a Conjugate Base Measure

The next theorem gives an example of when Theorem 5.1 holds for the Gaussian model. Discussion, proof and extensions are given in Hannah et al. (2009).

Theorem 5.2. Let (X, Y) have the joint Gaussian model. If, for a compact covariate set C,

(i) ∫ f0(x, y) (log f0(x, y))² dx dy < ∞,

(ii) ∫ |y|² f0(y | x) dy < ∞ for every x ∈ C, and

(iii) G0 is conjugate to the Gaussian model, that is,

β0:d, σy ∼ Normal-Inverse-Gamma(νy, Ξy, ry, λy),
µi, σi ∼ Normal-Inverse-Gamma(νi, ξi, ri, λi),

with ry, r1:d ∈ (1/2, 1),

then the conditions of Theorem 5.1 are satisfied.

6 Empirical Study

We compare the performance of DP-GLM regression to other regression methods. We chose data sets to illustrate the strengths of the DP-GLM, including the ability to model different response/covariate types, and robustness with respect to heteroscedasticity and moderate dimensionality.

We compare to the following algorithms:

Naive Ordinary Least Squares (OLS). A parametric method that often provides a reasonable fit when there are few observations.

Regression Trees (Tree). A nonparametric method generated by the Matlab function classregtree. It accommodates both continuous and categorical inputs and any type of response.

Gaussian Processes (GP). GPs were generated in Matlab by the program gpr of Rasmussen and Williams (2006). It is suitable only for continuous responses and covariates.

Basic DP Regression (DP Base). Similar to the DP-GLM, except that the response is a function only of µy rather than of β0 + Σj βj xj; that is, for the Gaussian model,

Yi | xi, θi ∼ N(µi,y, σ²i,y).

Without the GLM response, the model cannot interpolate well in higher dimensions, leading to poor predictive performance.

[Figure 2: two stacked panels (mean absolute error, mean squared error) plotting error against number of observations for the CMB data set.]

Figure 2: The average mean absolute error (top) and mean squared error (bottom) +/− one standard deviation for ordinary least squares (OLS), tree regression, Gaussian processes and the DP-GLM on the CMB data set. The data were normalized.


Poisson GLM (GLM). A Poisson generalized linear model (count responses), used on the Solar Flare data set.

With these methods, we examined three data sets:

Cosmic Microwave Background (CMB) Results. The CMB dataset (Bennett et al., 2003) consists of 899 observations which map positive integers ℓ = 1, 2, . . . , 899, called ‘multipole moments,’ to the power spectrum Cℓ. Both the covariate and response are considered continuous. The data are highly nonlinear and heteroscedastic. Competitors were OLS, regression trees and Gaussian processes. Mean absolute (L1) error and mean squared (L2) error for 5, 10, 30, 50, 100, 250, and 500 training data points were computed using 10 random subset selections for each amount of data. A conjugate base measure was used. Results are given in Figure 2.
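The error protocol is shared across all three studies, so it can be written once. A sketch (ours) follows; the fit/predict callables stand in for each competitor, and evaluating on the held-out remainder of each random split is our assumption, since the text does not spell out the test partition.

```python
import numpy as np

def error_curves(fit, predict, X, Y, sizes, reps=10, seed=0):
    """For each training-set size, draw `reps` random subsets, fit,
    and record mean absolute (L1) and mean squared (L2) test error."""
    rng = np.random.default_rng(seed)
    out = {}
    for n in sizes:
        l1, l2 = [], []
        for _ in range(reps):
            idx = rng.permutation(len(X))
            train, test = idx[:n], idx[n:]
            model = fit(X[train], Y[train])
            resid = predict(model, X[test]) - Y[test]
            l1.append(np.mean(np.abs(resid)))
            l2.append(np.mean(resid ** 2))
        out[n] = (np.mean(l1), np.std(l1), np.mean(l2), np.std(l2))
    return out
```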

Concrete Compressive Strength (CCS) Results. The CCS dataset (Yeh, 1998) has 8 continuous covariates. The response is the compressive strength of the resulting concrete, also continuous. There are 1,030 observations. The data have relatively little noise. Competitors were OLS, GPs and regression trees.


[Figure 3: two stacked panels (mean absolute error, mean squared error) plotting error against number of observations for the CCS data set.]

Figure 3: The average mean absolute error (top) and mean squared error (bottom) +/− one standard deviation for ordinary least squares (OLS), tree regression, Gaussian processes, the location/scale DP and the DP-GLM on the CCS data set. The data were normalized.

We also included a basic DP regression technique (location/scale DP) on this data set. Mean absolute (L1) error and mean squared (L2) error for 20, 30, 50, 100, 250, and 500 training data points were computed using 10 random subset selections for each amount of data. Gaussian mean and log-Gaussian scale base measures were used. Results are given in Figure 3.

Solar Flare Results. The Solar dataset (Bradshaw, 1989) was chosen to demonstrate the flexibility of the DP-GLM. The response is the number of solar flares in a 24-hour period in a given area. There are 1,389 observations and 11 categorical covariates. Competitors were tree regression and a Poisson GLM; GPs and other methods cannot be used for count or categorical data. Mean absolute (L1) error and mean squared (L2) error for 50, 100, 200, 500, and 800 training data points were computed using 10 random subset selections for each amount of data. A Dirichlet covariate and Gaussian slope base measure was used with a Poisson response distribution. Results are given in Figure 4.

Discussion

The DP-GLM has flexibility that is not offered by most regression methods. It does well on data sets with heteroscedastic errors because it fundamentally incorporates them: error parameters (σi,y) are included in the DP mixture. The DP-GLM is comparatively robust with small amounts of data because in that case it tends to put all (or most) of the observations into one cluster; this effectively produces a linear regression, but eliminates outliers by placing them into their own (low-weighted) clusters.

[Figure 4: two stacked panels (mean absolute error, mean squared error) plotting error against number of observations for the Solar data set.]

Figure 4: The average mean absolute error (top) and mean squared error (bottom) +/− one standard deviation for tree regression, a Poisson GLM (GLM) and the DP-GLM on the Solar data set.

The comparison between DP-GLM regression and basic DP regression is illustrative. We compared basic DP regression only on the CCS data set because it has a large number of covariates. Like kernel smoothing, basic DP regression struggles in high dimensions because it cannot efficiently interpolate values between observations. The GLM component effectively eliminates this problem.

The diversity of the data sets demonstrates the adaptability of the DP-GLM. Only tree regression was able to work on all of the data sets, and the DP-GLM has many desirable properties that tree regression does not, such as a smooth mean function estimate and less sensitivity to bandwidth/pruning level.

References

Amewou-Atisso, M., Ghosal, S., Ghosh, J. and Ramamoorthi, R. (2003), ‘Posterior consistency for semi-parametric regression problems’, Bernoulli 9(2), 291–312.

Antoniak, C. (1974), ‘Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems’, The Annals of Statistics 2(6), 1152–1174.


Barron, A., Schervish, M. and Wasserman, L. (1999), ‘The consistency of posterior distributions in nonparametric problems’, The Annals of Statistics 27(2), 536–561.

Bennett, C., Halpern, M., Hinshaw, G., Jarosik, N., Kogut, A., Limon, M., Meyer, S., Page, L., Spergel, D., Tucker, G. et al. (2003), ‘First-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: Preliminary maps and basic results’, The Astrophysical Journal Supplement Series 148(1), 1–27.

Blackwell, D. and MacQueen, J. (1973), ‘Ferguson distributions via Polya urn schemes’, The Annals of Statistics 1(2), 353–355.

Bradshaw, G. (1989), ‘UCI machine learning repository’.

De Iorio, M., Muller, P., Rosner, G. and MacEachern, S. (2004), ‘An ANOVA model for dependent random measures’, Journal of the American Statistical Association 99(465), 205–215.

Diaconis, P. and Freedman, D. (1986), ‘On the consistency of Bayes estimates’, The Annals of Statistics 14(1), 1–26.

Duan, J., Guindani, M. and Gelfand, A. (2007), ‘Generalized spatial Dirichlet process models’, Biometrika 94(4), 809–825.

Dunson, D., Pillai, N. and Park, J. (2007), ‘Bayesian density regression’, Journal of the Royal Statistical Society, Series B 69(2), 163.

Escobar, M. and West, M. (1995), ‘Bayesian density estimation and inference using mixtures’, Journal of the American Statistical Association 90(430), 577–588.

Fan, J., Yao, Q. and Tong, H. (1996), ‘Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems’, Biometrika 83(1), 189–204.

Ferguson, T. (1973), ‘A Bayesian analysis of some nonparametric problems’, The Annals of Statistics 1(2), 209–230.

Gelfand, A., Kottas, A. and MacEachern, S. (2005), ‘Bayesian nonparametric spatial modeling with Dirichlet process mixing’, Journal of the American Statistical Association 100(471), 1021–1035.

Ghosal, S., Ghosh, J. and Ramamoorthi, R. (1999), ‘Posterior consistency of Dirichlet mixtures in density estimation’, The Annals of Statistics 27(1), 143–158.

Ghosh, J. and Ramamoorthi, R. (2003), Bayesian Nonparametrics, Springer.

Griffin, J. and Steel, M. (2006), ‘Order-based dependent Dirichlet processes’, Journal of the American Statistical Association 101(473), 179–194.

Griffin, J. and Steel, M. (2007), Bayesian nonparametric modelling with the Dirichlet process regression smoother, Technical report, Institute of Mathematics, Statistics and Actuarial Science, University of Kent.

Hannah, L., Blei, D. and Powell, W. (2009), Dirichlet Process Mixtures of Generalized Linear Models. arXiv:0909.5194v1.

Ibrahim, J. and Kleinman, K. (1998), Semiparametric Bayesian methods for random effects models, in ‘Practical Nonparametric and Semiparametric Bayesian Statistics’, chapter 5, pp. 89–114.

McCullagh, P. and Nelder, J. (1989), Generalized Linear Models, Chapman & Hall/CRC.

Mukhopadhyay, S. and Gelfand, A. (1997), ‘Dirichlet process mixed generalized linear models’, Journal of the American Statistical Association 92(438), 633–639.

Muller, P., Erkanli, A. and West, M. (1996), ‘Bayesian curve fitting using multivariate normal mixtures’, Biometrika 83(1), 67–79.

Neal, R. (2000), ‘Markov chain sampling methods for Dirichlet process mixture models’, Journal of Computational and Graphical Statistics 9(2), 249–265.

Rasmussen, C. and Williams, C. (2006), Gaussian Processes for Machine Learning, MIT Press.

Rodriguez, A., Dunson, D. and Gelfand, A. (2009), ‘Bayesian nonparametric functional data analysis through density estimation’, Biometrika 96(1), 149–162.

Schwartz, L. (1965), ‘On Bayes procedures’, Z. Wahrsch. Verw. Gebiete 4(1), 10–26.

Shahbaba, B. and Neal, R. (2009), ‘Nonlinear models using Dirichlet process mixtures’, Journal of Machine Learning Research 10, 1829–1850.

Tokdar, S. (2006), ‘Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression’, Sankhya: The Indian Journal of Statistics 67, 90–110.

Walker, S. (2004), ‘New approaches to Bayesian consistency’, The Annals of Statistics 32(5), 2028–2043.

West, M., Muller, P. and Escobar, M. (1994), Hierarchical priors and mixture models, with application in regression and density estimation, in ‘Aspects of Uncertainty: A Tribute to D. V. Lindley’, pp. 363–386.

Yeh, I. (1998), ‘Modeling of strength of high-performance concrete using artificial neural networks’, Cement and Concrete Research 28(12), 1797–1808.