
Dirichlet Process Mixtures of Generalized Linear Models

Lauren A. Hannah [email protected]
Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08544, USA

David M. Blei [email protected]
Department of Computer Science
Princeton University
Princeton, NJ 08544, USA

Warren B. Powell [email protected]

Department of Operations Research and Financial Engineering

Princeton University

Princeton, NJ 08544, USA

Editor: ????

Abstract

We propose Dirichlet Process-Generalized Linear Models (DP-GLM), a new method of nonparametric regression that accommodates continuous and categorical inputs, and any response that can be modeled by a generalized linear model. We prove conditions for the asymptotic unbiasedness of the DP-GLM regression mean function estimate and give a practical example for when those conditions hold. Additionally, we provide Bayesian bounds on the distance of the estimate from the true mean function based on the number of observations and posterior samples. We evaluate DP-GLM on several data sets, comparing it to modern methods of nonparametric regression like CART and Gaussian processes. We show that the DP-GLM is competitive with the other methods, while accommodating various inputs and outputs and being robust when confronted with heteroscedasticity.

Keywords: Bayesian Nonparametrics, Dirichlet Process, Regression, Asymptotic Properties

1. Introduction

In this paper, we examine a Bayesian nonparametric solution to the general regression problem. The general regression problem models a response variable Y as dependent on a set of d-dimensional covariates X,

Y |X ∼ f(m(X)). (1)

Here, m(·) is a deterministic mean function, which specifies the conditional mean of the response, and f is a distribution, which characterizes the deviation of the response from the conditional mean. In a regression problem, we estimate the mean function and deviation parameters from a data set of covariate-response pairs D = {(xi, yi)}_{i=1}^N. Given a new set of covariates xnew, we predict the response via its conditional expectation, E[Y | X = x, D].


In Bayesian regression, we compute the posterior expectation of these quantities, conditioned on the data.

The distribution function f(m(X)) can be quite complicated globally. However, we propose modeling f(m(X)) with a local mixture of much simpler functions: generalized linear models (GLMs). Generalized linear models generalize linear regression to a diverse set of response types and distributions. In a GLM, the linear combination of coefficients and covariates is passed through a link function, and the underlying distribution of the response is in an exponential family (McCullagh and Nelder, 1989). GLMs are specified by the functional form of the link function and the distribution of the response, and many prediction models, such as linear regression, logistic regression, multinomial regression, and Poisson regression, can be expressed as a GLM. Thus, they can accommodate many types of response data, including continuous, categorical and count. Through independent variable transformations, they can support various forms of independent variables as well. However, GLMs are parametric functions, which limits their application to data where a linear relationship between the covariates and the response holds across all possible covariates.

To model a function as a local mixture of GLMs, we must determine what, exactly, is “local.” A suitable local region depends on both the covariates and the response: the linear relation between the covariates and response must hold within that area. We use a Dirichlet process mixture model to probabilistically choose suitable regions.

Dirichlet process mixture models represent complicated distributions as a countable mixture of simpler distributions in a Bayesian manner. To make the model more flexible, the number of components is allowed to be countably infinite. In this case, we model the joint distribution of X and Y as a mixture of a simple distribution over the covariates, such as a Gaussian or multinomial, and a GLM for the response within each component. The Dirichlet process itself places a prior over the mixing proportions, producing only a few with non-trivial values. Samples from the posterior are used to find a distribution over the “correct” number of components and parameters for each of those components. We call the resulting model a Dirichlet process mixture of generalized linear models (DP-GLM).

The DP-GLM is a very flexible model. By “flexibility,” we mean the ability to model covariates and responses with various forms, including continuous, categorical, count and circular, along with the ability to accommodate various data characteristics, such as overdispersion and heteroscedasticity. Many applications, such as finance and engineering, often encounter mixed data types, but there are few models, such as classification and regression trees (CART), that can support diverse data types. Heteroscedasticity describes non-constant variance of the data. Many regression techniques, such as parametric regression and Gaussian process prior regression, have a homoscedasticity (constant variance) assumption. Violation of this assumption can lead to bias, as it does for logistic regression.

Flexibility in data type comes from two parts of the DP-GLM construction. The Dirichlet process mixture model allows most any type of covariate to be modeled as a mixture of simpler distributions. The GLM portion takes a smaller but still large set of covariate types and uses them to predict a wide range of response types, such as continuous, categorical, count and circular. Flexibility in data features comes from the Dirichlet process mixture model. Overdispersion occurs in single parameter GLMs when the observed variance is larger than that implied by the parameter. A mixture over the intercept parameters of the GLM produces a class of models that can accommodate overdispersion (Mukhopadhyay


and Gelfand, 1997). Heteroscedasticity is modeled implicitly with a mixture over the joint distribution of X and Y when Y is locally modeled with a two parameter GLM; each local component has its own dispersion parameter. This produces a model that is naturally heteroscedastic.

Here, we develop Dirichlet process mixtures of generalized linear models (DP-GLMs), a Bayesian nonparametric regression model that combines the advantages of generalized linear models with the flexibility of nonparametric regression. DP-GLMs are a generalization of several existing DP-based regression models (Muller et al., 1996; Shahbaba and Neal, 2009) to a variety of covariate types and response distributions. We derive Gibbs sampling algorithms for computation with DP-GLMs, and investigate some of the theoretical and empirical properties of these models, such as the form of their posterior and conditions for the asymptotic unbiasedness of their predictions. We examine DP-GLMs with several types of data.

In addition to defining and discussing the DP-GLM, a central contribution of this paper is our theoretical analysis of its response estimator and, specifically, the asymptotic unbiasedness of its predictions. Asymptotic properties help justify the use of certain regression models, but they have largely been ignored for regression models with Dirichlet process priors. We will give general conditions for asymptotic unbiasedness, and examples of when they are satisfied. (These conditions are model-dependent, and can be difficult to check.)

The rest of this paper is organized as follows. In Section 2, we review the Bayesian nonparametric regression literature. In Section 3, we review Dirichlet process mixture models and generalized linear models. In Section 4, we construct the DP-GLM and derive algorithms for posterior computation; in Section 5, we study the empirical properties of the DP-GLM; in Section 6 we give general conditions for unbiasedness and prove it in a specific case with conjugate priors. In Section 7 we compare the DP-GLM and existing methods on three data sets. We illustrate that the DP-GLM provides a powerful nonparametric regression model that can accommodate many data analysis settings.

2. Literature Review

Nonparametric regression is a field that has received considerable study, but less so in a Bayesian context. Gaussian processes (GPs), Bayesian regression trees and Dirichlet process mixtures are the most common prior choices for Bayesian nonparametric regression. GP priors assume that the observations arise from a Gaussian process model with known covariance function form (see Rasmussen and Williams (2006) for a review). Without heavy modification, however, the GP model is only applicable to problems with continuous covariates and constant variance. The assumption of constant variance can be eased by using Dirichlet process mixtures of GPs (Rasmussen and Ghahramani, 2002) or treed GPs (Gramacy and Lee, 2008). Bayesian regression trees place a prior over the size of the tree and can be viewed as an automatic bandwidth selection method for classification and regression trees (CART) (Chipman et al., 1998). Bayesian trees have been expanded to include linear models (Chipman et al., 2002) and GPs (Gramacy and Lee, 2008) in the leaf nodes.

Dirichlet process priors have been used previously for regression. West et al. (1994), Escobar and West (1995) and Muller et al. (1996) used joint Gaussian mixtures for the covariates and response; Rodriguez et al. (2009) generalized this method using dependent


DPs for multiple response functionals. This method can be slow if a fully populated covariance matrix is used and is potentially inaccurate if only a diagonal matrix is used. To avoid over-fitting the covariate distribution and under-fitting the response, local weights on the covariates have been used to produce local response DPs, with kernels and basis functions (Griffin and Steel, 2007; Dunson et al., 2007), GPs (Gelfand et al., 2005) or general spatial-based weights (Griffin and Steel, 2006, 2007; Duan et al., 2007). Other methods, such as dependent DPs, have been introduced to capture similarities between clusters, covariates or groups of outcomes (De Iorio et al., 2004; Rodriguez et al., 2009). The DP-GLM tries to balance fitting the covariate and response distributions by introducing local GLMs: the clustering structure is heavily based on the covariates, but within each cluster the response fit is better because it is represented by a GLM rather than a constant. This method is simple and flexible.

Dirichlet process priors have also been used in conjunction with GLMs. Mukhopadhyay and Gelfand (1997) and Ibrahim and Kleinman (1998) used a DP prior for the random effects portion of the GLM. Likewise, Amewou-Atisso et al. (2003) used a DP prior to model arbitrary symmetric error distributions in a semi-parametric linear regression model. Shahbaba and Neal (2009) proposed a model that mixes over both the covariates and response, which are linked by a multinomial logistic model. The DP-GLM studied here is a generalization of this idea.

The asymptotic properties of Dirichlet process regression models have not been well studied. Most current literature centers on consistency of the posterior density for DP Gaussian mixture models (Barron et al., 1999; Ghosal et al., 1999; Ghosh and Ramamoorthi, 2003; Walker, 2004; Tokdar, 2006) and semi-parametric linear regression models (Amewou-Atisso et al., 2003; Tokdar, 2006). Only recently have the posterior properties of DP regression estimators been studied. Rodriguez et al. (2009) showed point-wise asymptotic unbiasedness for their model, which uses a dependent Dirichlet process prior, assuming continuous covariates under different treatments with a continuous response and a conjugate base measure (normal-inverse Wishart).

3. Mathematical Review

We review Dirichlet process mixture models and generalized linear models.

Dirichlet Process Mixture Models. In a Bayesian mixture model, we assume that the true density of the covariates X and response Y can be written as a mixture of parametric densities, such as Gaussians or multinomials, conditioned on a hidden parameter θ. For example, in a Gaussian mixture, θ includes the mean µ and variance σ2. Due to our model formulation, θ is split into two parts: θx, which is associated only with the covariates X, and θy, which is associated only with the response, Y. Set θ = (θx, θy). The marginal probability of an observation is given by a continuous mixture,

f0(x, y) = ∫T f(x, y | θ) P(dθ).

In this equation, T is the set of all possible parameters and the prior P is a measure on that space.
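As a point of reference (an illustrative special case rather than part of the construction above): if P were a discrete measure placing mass πk on atoms θk, k = 1, . . . , K, the integral would reduce to the familiar finite mixture f0(x, y) = π1 f(x, y | θ1) + · · · + πK f(x, y | θK).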


The Dirichlet process models uncertainty about the prior density P (Ferguson, 1973; Antoniak, 1974). If P is drawn from a Dirichlet process then it can be analytically integrated out of the conditional distribution of θn given θ1:(n−1). Specifically, the random variable Θn

has a Polya urn distribution (Blackwell and MacQueen, 1973),

Θn | θ1:(n−1) ∼ (1/(α + n − 1)) ∑_{i=1}^{n−1} δ_{θi} + (α/(α + n − 1)) G0. (2)

(Lower case values refer to observed or fixed values, while upper case refer to random variables.)

Equation (2) reveals the clustering property of the joint distribution of θ1:n: there is a positive probability that each θi will take on the value of another θj, leading some of the parameters to share values. This equation also makes clear the roles of α and G0. The unique values of θ1:n are drawn independently from G0; the parameter α determines how likely Θn+1 is to be a newly drawn value from G0 rather than take on one of the values from θ1:n. G0 controls the distribution of a new component.
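To make the urn scheme concrete, here is a minimal simulation sketch of Equation (2) (an illustration only; the standard-normal choice of G0 and the function names are assumptions for the example, not part of the model):

```python
import numpy as np

def polya_urn_draws(n, alpha, base_draw, rng=None):
    """Draw theta_1, ..., theta_n sequentially from the Polya urn of Equation (2)."""
    rng = np.random.default_rng() if rng is None else rng
    thetas = []
    for i in range(n):
        # With probability alpha / (alpha + i), draw a fresh value from G0;
        # otherwise copy one of the i previously drawn values uniformly at random.
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_draw(rng))
        else:
            thetas.append(thetas[rng.integers(i)])
    return thetas

# Example with G0 = N(0, 1); repeated values illustrate the clustering property.
draws = polya_urn_draws(n=10, alpha=1.0, base_draw=lambda rng: rng.normal(0.0, 1.0))
```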

In a DP mixture, θ is a latent parameter to an observed data point x (Antoniak, 1974),

P ∼ DP(αG0),

Θi ∼ P,
xi | θi ∼ f(· | θi).

Examining the posterior distribution of θ1:n given x1:n brings out its interpretation as an “infinite clustering” model. Because of the clustering property, observations are grouped by their shared parameters. Unlike finite clustering models, however, the number of groups is random and unknown. Moreover, a new data point can be assigned to a new cluster that was not previously seen in the data.

Generalized Linear Models. Generalized linear models (GLMs) build on linear regression to provide a flexible suite of predictive models. GLMs relate a linear model to a response via a link function; examples include familiar models like logistic regression, Poisson regression, and multinomial regression. (See McCullagh and Nelder (1989) for a full discussion.)

GLMs have three components: the conditional probability model for response Y, the linear predictor and the link function. The probability model for Y, dependent on covariates X, is

f(y | η) = exp( (yη − b(η)) / a(φ) + c(y, φ) ).

Here the canonical form of the exponential family is given, where a, b, and c are known functions specific to the exponential family, φ is an arbitrary scale (dispersion) parameter, and η is the canonical parameter. A linear predictor, Xβ, is used to determine the canonical parameter through a set of transformations. It can be shown that b′(η) = µ = E[Y | X]. However, we can choose a link function g such that µ = g−1(Xβ), which defines η in terms of Xβ. The canonical form is useful for discussion of GLM properties, but we use the mean form in the rest of this paper. The flexible nature of GLMs allows us to use them as a local approximation for a global response function.
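As a concrete instance (a standard textbook example, included only for reference): for Poisson regression, f(y | η) = exp(yη − e^η − log y!), so that b(η) = e^η, a(φ) = 1 and c(y, φ) = −log y!. Then µ = b′(η) = e^η, and the canonical (log) link g(µ) = log µ gives η = Xβ, so the mean form is E[Y | X] = exp(Xβ).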



Figure 1: The top figure shows the smoothed regression estimate for the Gaussian model of equation (3). The center figure shows the training data (blue) fitted into clusters, with the prediction given a single sample from the posterior, θ(i) (red). The bottom figure shows the underlying clusters (blue) with the fitted response (red) for each point in the cluster. Data plot multipole moments against power spectrum Cℓ for cosmic microwave background radiation (Bennett et al., 2003).

4. Dirichlet Process Mixtures of Generalized Linear Models

We now turn to Dirichlet process mixtures of generalized linear models (DP-GLMs), a flexible Bayesian predictive model that places prior mass on a large class of response densities. Given a data set of covariate-response pairs, we describe Gibbs sampling algorithms for approximate posterior inference and prediction. Theoretical properties of the DP-GLM are developed in Section 6.

4.1 DP-GLM Formulation

In a DP-GLM, we assume that the covariates X are modeled by a mixture of exponential-family distributions, the response Y is modeled by a GLM conditioned on the inputs, and that these models are connected by associating a set of GLM coefficients with each exponential-family mixture component. Let θ = (θx, θy) denote the bundle of parameters over X and Y | X, and let G0 denote a base measure on the space of both. For example, θx might be a set of d-dimensional multivariate Gaussian location and scale parameters for a vector of continuous covariates; θy might be a (d + 2)-vector of reals for their corresponding


GLM linear prediction coefficients, along with a GLM dispersion parameter. The full model is

P ∼ DP(αG0),
θi = (θi,x, θi,y) | P ∼ P,
Xi | θi,x ∼ fx(· | θi,x),
Yi | Xi, θi,y ∼ GLM(· | Xi, θi,y).

The density fx describes the covariate distribution; the GLM for y depends on the form of the response (continuous, count, category, or others) and how the response relates to the covariates (i.e., the link function).

The Dirichlet process clusters the covariate-response pairs (x, y). When both are observed, i.e., in “training,” the posterior distribution of this model will cluster data points according to nearby covariates that exhibit the same kind of relationship to their response. When the response is not observed, its predictive expectation can be understood by clustering the covariates based on the training data, and then predicting the response according to the GLM associated with the covariates’ cluster. The DP prior acts as a kernel for the covariates; instead of using a Euclidean metric, the DP measures the distance between two points by the probability that the hidden parameter is shared. See Figure 1 for a demonstration of the DP-GLM. We now give an example of the DP-GLM for continuous covariates/response that will be used throughout the rest of the paper.

Example: Gaussian Model. For continuous covariates/response in R, we model locally with a Gaussian distribution for the covariates and a linear regression model for the response. The covariates have mean µi,j and variance σ2i,j for the jth dimension of the ith observation; the covariance matrix is diagonal for simplicity. The GLM parameters are the linear predictor βi,0, . . . , βi,d and the response variance σ2i,y. Here, θx,i = (µi,1:d, σi,1:d) and θy,i = (βi,0:d, σi,y). This produces a mixture of multivariate Gaussians. The full model is,

P ∼ DP(αG0), (3)
θi | P ∼ P,
Xij | θi,x ∼ N(µij, σ2ij), j = 1, . . . , d,
Yi | Xi, θi,y ∼ N(βi0 + ∑_{j=1}^d βij Xij, σ2iy).
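To make the generative process concrete, here is a minimal simulation sketch for Model (3) (the specific hyperparameter values and the simple draws from G0 below are illustrative assumptions, not the base measures used in the experiments; the cluster assignments follow the urn scheme of Equation (2)):

```python
import numpy as np

def simulate_gaussian_dpglm(n, d, alpha=1.0, rng=None):
    """Simulate n covariate-response pairs from the Gaussian DP-GLM of Model (3)."""
    rng = np.random.default_rng() if rng is None else rng
    clusters = []                       # per-cluster parameters theta_k
    z = np.zeros(n, dtype=int)          # cluster assignment of each observation
    X, Y = np.zeros((n, d)), np.zeros(n)
    for i in range(n):
        counts = np.bincount(z[:i], minlength=len(clusters))
        probs = np.append(counts, alpha) / (alpha + i)
        z[i] = rng.choice(len(clusters) + 1, p=probs)
        if z[i] == len(clusters):       # open a new cluster with a draw from G0
            clusters.append({
                "mu": rng.normal(0.0, 3.0, size=d),        # covariate means
                "sd_x": rng.gamma(2.0, 0.5, size=d),       # covariate std. deviations
                "beta": rng.normal(0.0, 1.0, size=d + 1),  # GLM coefficients beta_0:d
                "sd_y": rng.gamma(2.0, 0.25),              # response std. deviation
            })
        c = clusters[z[i]]
        X[i] = rng.normal(c["mu"], c["sd_x"])
        Y[i] = rng.normal(c["beta"][0] + X[i] @ c["beta"][1:], c["sd_y"])
    return X, Y, z

X, Y, z = simulate_gaussian_dpglm(n=200, d=2)
```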

4.2 DP-GLM Regression

The DP-GLM is used in prediction problems. Given a collection of covariate-response pairs D = {(Xi, Yi)}_{i=1}^n, our goal is to compute the expected response for a new set of covariates x. We now give a step-by-step process for using the DP-GLM for prediction.

Choose Mixture Model and GLM. We begin by choosing fx and the GLM. The covariate mixture distribution is chosen so that the observed distribution is supported and well-modeled. For example, a Gaussian mixture model will support any continuous distribution and model it well under an appropriate choice of G0. However, a count covariate


with a variance smaller than its mean will not be well modeled by a Poisson mixture; in this case, a negative binomial mixture may be more appropriate. GLMs are chosen in much the same manner.

Choose G0 and Other Hyperparameters. The choice of G0 affects how expressive the DP-GLM is, the computational efficiency of the prediction and whether some theoretical properties, such as asymptotic unbiasedness, hold. For example, G0 for the Gaussian model is a distribution over (µi, σi, βi,0:d, σi,y). A conjugate base measure is normal-inverse-gamma for each covariate dimension and multivariate normal inverse-gamma for the response parameters. This G0 allows all continuous, integrable distributions to be well modeled, retains theoretical properties and yields highly efficient posterior sampling. However, it is not particularly expressive for small amounts of data due to quickly declining tails and a mean that is tied to the variance. The base measure is often chosen in accordance with data size, distribution type, distribution features (heterogeneity, etc.) and computational constraints.

Hyperparameters for the DP-GLM include α and parameters for G0. It is often useful to place a gamma prior on α, while the parameters for G0 may have their own prior as well. Each level of priors reduces their influence but adds computational complexity.

Predict E[Y | X = x, D]. To predict E[Y | X = x, D], we use the tower property of expectations, conditioning on the latent parameters,

E[Y | X = x, D] = E[E[Y | X = x, θ1:n] | D]. (4)

The inner expectation is straightforward to compute. Conditional on the latent parameters θ1:n that generated the observed data, the expectation of the response is

E[Y | X = x, θ1:n] = [ α ∫T E[Y | X = x, θ] fx(x | θ) G0(dθ) + ∑_{i=1}^n E[Y | X = x, θi] fx(x | θi) ] / [ α ∫T fx(x | θ) G0(dθ) + ∑_{i=1}^n fx(x | θi) ]. (5)

Since Y is assumed to be a GLM, the quantity E[Y | X = x, θ] is analytically available as a function of x and θ.

The outer expectation of Equation (4) is generally impossible to compute with more than a few observations, so we approximate it by Monte Carlo integration using M posterior samples of θ1:n,

E[Y | X = x, D] ≈ (1/M) ∑_{m=1}^M E[Y | X = x, θ^{(m)}_{1:n}]. (6)

The samples θ^{(m)}_{1:n} are i.i.d. from the posterior distribution of θ1:n | D.

We use Markov chain Monte Carlo (MCMC), specifically Gibbs sampling, to obtain M i.i.d. samples from this distribution. Gibbs sampling has a long history of being used for DP mixture posterior inference (see Escobar (1994), MacEachern (1994), Escobar and West (1995) and MacEachern and Muller (1998) for foundational work; Neal (2000) provides a modern treatment and state of the art algorithms). Gibbs sampling constructs a Markov chain where the hidden parameters θ1:n give the state of the chain; transition probabilities are constructed so that the limiting distribution is the same as that of θ1:n | D. While Gibbs sampling is fairly easy to implement, sampling may require a significant amount of time and it is generally unclear when the chain has converged.
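To show the structure of Equations (5) and (6) in code, here is a minimal sketch for the one-dimensional Gaussian model (the per-observation parameter tuples, the g0_draw helper and the Monte Carlo approximation of the G0 integrals are all illustrative assumptions; in a conjugate setting those integrals can instead be computed analytically):

```python
import numpy as np
from scipy.stats import norm

def predict_dpglm_1d(x, posterior_samples, alpha, g0_draw, n_g0=200, rng=None):
    """Monte Carlo estimate of E[Y | X = x, D] via Equations (5) and (6).

    posterior_samples: list of M posterior samples; each sample is a list of
        per-observation parameters theta_i = (mu, sd_x, beta0, beta1, sd_y).
    g0_draw: function rng -> one such tuple drawn from the base measure G0.
    """
    rng = np.random.default_rng() if rng is None else rng
    per_sample = []
    for thetas in posterior_samples:
        # New-cluster terms of Equation (5): approximate the G0 integrals by
        # Monte Carlo over draws from the base measure.
        g0 = [g0_draw(rng) for _ in range(n_g0)]
        w_new = alpha * np.mean([norm.pdf(x, t[0], t[1]) for t in g0])
        wy_new = alpha * np.mean([(t[2] + t[3] * x) * norm.pdf(x, t[0], t[1]) for t in g0])
        # Existing-parameter terms of Equation (5).
        w = np.array([norm.pdf(x, t[0], t[1]) for t in thetas])
        ey = np.array([t[2] + t[3] * x for t in thetas])
        per_sample.append((wy_new + w @ ey) / (w_new + w.sum()))
    # Equation (6): average the conditional expectations over the M samples.
    return float(np.mean(per_sample))
```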


MCMC is not the only way to sample from the posterior. Blei and Jordan (2005) apply variational inference methods to Dirichlet process mixture models. Variational inference methods use optimization to create a parametric distribution close to the posterior from which samples can be drawn. Variational inference avoids the issue of determining when the stationary distribution has been reached, but lacks some of the convergence properties of MCMC. Fearnhead (2004) uses a particle filter to generate posterior samples for a Dirichlet process mixture. Particle filters also avoid having to reach a stationary distribution, but the large state vectors created by non-conjugate base measures and large data sets can lead to collapse.

5. Empirical Properties of the DP-GLM

In this section we evaluate the empirical properties of the DP-GLM, including data type flexibility and data property flexibility. We then empirically show how the DP-GLM differs from the traditional Dirichlet process mixture model regression (Muller et al., 1996).

5.1 Data Type Flexibility

Flexibility in the data type comes from both the use of Dirichlet process mixture models and GLMs. Dirichlet process mixture models allow nearly any data type to be captured within the covariate mixture. Many of those data types can then be transformed for use as independent variables in the GLM. Certain mixture distributions simply support certain types of covariates; they are not necessarily a good fit. For example, one covariate might consist of count data. This could be modeled with a Poisson mixture distribution,

Xi |λi ∼ Poisson(λi).

However, the Poisson distribution is a single parameter exponential family, so both the mean and variance are controlled by λi. The distribution of Xi will not be modeled well if Var(Xi) < E[Xi]. In this case, a negative binomial distribution would be a more suitable choice for the mixture distribution because it has two controlling parameters. The GLM gives flexibility in the types of response variables supported.
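A quick empirical check along these lines compares the sample variance of a count covariate to its sample mean, since a Poisson component forces the two to be equal; a minimal sketch (the function name and the example data are illustrative assumptions):

```python
import numpy as np

def dispersion_ratio(counts):
    """Sample variance-to-mean ratio of a count covariate.

    A single Poisson component forces Var(X) = E[X], so a ratio far from 1
    signals that a Poisson mixture component may fit the covariate poorly.
    """
    x = np.asarray(counts, dtype=float)
    return x.var(ddof=1) / x.mean()

# Example: a ratio well above 1 flags overdispersion relative to a Poisson component.
print(dispersion_ratio([0, 1, 1, 2, 9, 14, 0, 0, 3, 7]))
```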

We shall now discuss how the DP-GLM can naturally accommodate various data properties, such as heteroscedasticity and over-dispersion.

5.2 Data Property Flexibility

Many models, such as GLMs and Gaussian processes, have assumptions about data dispersion and homoscedasticity. Over-dispersion occurs in single parameter GLMs when the data variance is larger than the variance predicted by the model mean. Mukhopadhyay and Gelfand (1997) have successfully used DP mixtures over GLM intercept parameters to create classes of models that include over-dispersion. The DP-GLM retains this property, but not linearity in the covariates.

Homoscedasticity means that the variance is constant across all covariate regions; heteroscedasticity means that the variance changes with the covariates. Models like GLMs and Gaussian processes assume homoscedasticity and can give poor fits when that assumption is violated. However, the DP-GLM can naturally accommodate heteroscedasticity when multiparameter



Figure 2: Modeling heteroscedasticity with the DP-GLM and other Bayesian nonparametric methods. The estimated mean function is given along with a 90% predicted confidence interval for the estimated underlying distribution. The DP-GLM produces a smooth mean function and confidence interval.


GLMs are used, such as linear, Gamma and negative binomial regression models. The mixture model setting allows the variance parameter to vary between clusters, creating smoothly transitioning heteroscedastic models. A demonstration of this property is shown in Figure 2, where the DP-GLM is compared against a homoscedastic model, Gaussian processes, and heteroscedastic modifications of homoscedastic models, treed Gaussian processes and treed linear models. The DP-GLM is robust to heteroscedastic data and provides a smooth mean function estimate, while the other models are not as robust or provide non-smooth estimates.

5.3 Comparison with Dirichlet Process Mixture Model Regression

In this subsection we compare the DP-GLM to a plain vanilla Dirichlet process mixture model (DPMM) for regression. A generic DPMM has the form,

P ∼ DP(αG0), (7)
θi | P ∼ P,
Xi | θi,x ∼ fx(x | θi,x),
Yi | θi,y ∼ fy(y | θi,y).

The difference between Model (7) and the DP-GLM is that the distribution of Y given θ is conditionally independent of the covariates X. This is a small difference that has large consequences for the posterior distribution and predictive results.

Consider the log-likelihood of the posterior of Model (7). Assume that fy is a single parameter exponential family, where θy = β,

ℓ(θdp | D) ∝ ∑_{i=1}^K [ ℓ(βCi) + ∑_{c∈Ci} ℓ(yc | βCi) + ∑_{j=1}^d ℓ(θCi,xj | D) ], (8)

and the log-likelihood of the DP-GLM posterior for a single parameter GLM, where θy = (β0, . . . , βd),

ℓ(θdpglm | D) ∝ ∑_{i=1}^K [ ∑_{j=0}^d ℓ(βCi,j) + ∑_{c∈Ci} ℓ(yc | β^T_{Ci} xc) + ∑_{j=1}^d ℓ(θCi,xj | D) ]. (9)

As the number of covariates grows, the likelihood associated with the covariates grows in both equations. However, the likelihood associated with the response also grows with the extra response parameters in Equation (9), whereas it is fixed in Equation (8).

These posterior differences lead to two predictive differences: 1) the DP-GLM is much more resistant to dimensionality than the DPMM, and 2) as the dimensionality grows, the DP-GLM produces less stable predictions than the DPMM. Since the number of response-related parameters grows with the number of covariate dimensions in the DP-GLM, the relative posterior weight of the response does not shrink as quickly in the DP-GLM as it does in the DPMM. This keeps the response variable relatively important in the selection of the mixture components and hence makes the DP-GLM a better predictor than the DPMM as the number of dimensions grows.


While the additional GLM parameters help maintain the relevance of the response, they also add noise to the prediction. This can be seen in Figure 3. The GLM parameters in this figure have a Gaussian base measure, effectively creating a local ridge regression. In unpublished numerical work, we have tried other sparsity-inducing base measures, such as a Laplacian distribution with an L1 penalty. They produced less stable results than the Gaussian base measure, likely due to the sample size differences between the clusters. In lower dimensions, the DP-GLM produced more stable results than the DPMM because a smaller number of larger clusters were required to fit the data well. The DPMM, however, consistently produced stable results in higher dimensions as the response became more of a sample average than a local average. The DPMM has the potential to predict well if changes in the mean function coincide with underlying local modes of the covariate density.

6. Asymptotic Unbiasedness of the DP-GLM Regression Model

A desirable property of any estimator is that it should be unbiased, particularly in the limit. Diaconis and Freedman (1986) give an example of a location model with a Dirichlet process prior where the estimated location can be bounded away from the true location, even when the number of observations approaches infinity. We want to ensure that the DP-GLM does not end up in a similar position.

Notation for this section is more complicated than the notation for the model. Let f0(x, y) be the true joint distribution of (x, y); in this case, we will assume that f0 is a density. Let F be the set of all density functions over (x, y). Let Πf be the prior over F induced by the DP-GLM model. Let Ef0[·] denote the expectation under the true distribution and EΠf[·] be the expectation under the prior Πf.

Traditionally, an estimator is called unbiased if the expectation of the estimator over the observations is the true value of the quantity being estimated. In the case of the DP-GLM, that would mean for every x ∈ A and every n > 0,

Ef0[EΠ[Y | x, (Xi, Yi)_{i=1}^n]] = Ef0[Y | x],

where A is some fixed domain, EΠ is the expectation with respect to the prior Π and Ef0 is the expectation with respect to the true distribution.

Since we use Bayesian priors in the DP-GLM, we will have bias in almost all cases (Gelman et al., 2004). The best we can hope for is asymptotic unbiasedness, where, as the number of observations grows to infinity, the mean function estimate converges to the true mean function. That is, for every x ∈ A,

EΠ[Y | x, (Xi, Yi)_{i=1}^n] → E[Y | x] as n → ∞.

Diaconis and Freedman (1986) give an example of a location problem with a DP prior where the posterior estimate was not asymptotically unbiased. Extending that example, it follows that estimators with DP priors are not automatically asymptotically unbiased. We use consistency and uniform integrability to give conditions for asymptotic unbiasedness.

Consistency, the notion that as the number of observations goes to infinity the posterior distribution accumulates in neighborhoods arbitrarily “close” to the true distribution, is tightly related to both asymptotic unbiasedness and the existence of the mean function estimate.



Figure 3: A plain Dirichlet process mixture model regression (left) versus the DP-GLM, plotted against the number of spurious dimensions (vertical plots). The estimated mean function is given along with a 90% predicted confidence interval for the estimated underlying distribution. Data have one predictive covariate and a varying number of spurious covariates. The covariate data were generated by a mixture model. The DP-GLM produces a smoother mean function and is much more resistant to dimensionality.


Weak consistency assures that the posterior distribution accumulates in regions of densities where “properly behaved” functions (i.e., bounded and continuous) integrated with respect to the densities in the region are arbitrarily close to the integral with respect to the true density. Note that an expectation is not bounded; in addition to weak consistency, uniform integrability is needed to guarantee that the posterior expectation converges to the true expectation, giving asymptotic unbiasedness. Uniform integrability also ensures that the posterior expectation almost surely exists with every additional observation. Therefore we need to show weak consistency and uniform integrability.

6.1 Asymptotic Unbiasedness

We approach asymptotic unbiasedness by showing weak consistency for the posterior of the joint distribution and then using uniform integrability to show that the conditional expectation EΠf[Y | X = x, (Xi, Yi)_{i=1}^n] exists for every n and converges to the true expectation almost surely. The proof is a bit involved, so it has been placed in the Appendix.

Theorem 1 Let x be in a compact set C and Πf be a prior on F. If,

(i) for every δ > 0, Πf puts positive measure on
{ f : ∫ f0(x, y) log(f0(x, y)/f(x, y)) dx dy < δ },

(ii) ∫ |y| f0(y | x) dy < ∞ for every x ∈ C, and

(iii) there exists an ε > 0 such that for every x ∈ C, ∫∫ |y|^{1+ε} fy(y | x, θ) dy G0(dθ) < ∞,

then for every n ≥ 0, EΠ[Y | x, (Xi, Yi)_{1:n}] exists and has the limit Ef0[Y | X = x], almost surely P_{f0}^∞.

The conditions of Theorem 1 must be checked for the problem (f0) and prior (Πf) pair, and can be difficult to show. Condition (i) assures weak consistency of the posterior, condition (ii) guarantees that a mean function exists in the limit and condition (iii) guarantees that positive probability is only placed on densities that yield a finite mean function estimate, which is used for uniform integrability. Condition (i) is quite hard to show, but there has been recent work demonstrating weak consistency for a number of Dirichlet process mixture models and priors (Ghosal et al., 1999; Ghosh and Ramamoorthi, 2003; Amewou-Atisso et al., 2003; Tokdar, 2006). See the Appendix for a proof of Theorem 1.

6.2 Example: Gaussian Model with a Conjugate Base Measure

The next theorem gives an example of when Theorem 1 holds for the Gaussian model. Note that in the Gaussian case, slope parameters can be generated by a covariance matrix: using a conjugate prior, a Normal-Inverse-Wishart, will produce an instance of the DP-GLM. Let


G0 be a Normal-Inverse-Wishart distribution and then define the following model, which was used by Muller et al. (1996),

P ∼ DP(αG0), (10)
θi | P ∼ P,
(Xi, Yi) | θi ∼ N(µ, Σ).

The last line of Model (10) can be broken down in the following manner,

Xi | θi ∼ N(µx, Σx),
Yi | θi ∼ N(µy + b^T Σx^{−1}(Xi − µx), σ2y − b^T Σx^{−1} b),

where

µ = (µy, µx)^T,    Σ = [[σ2y, b^T], [b, Σx]].

We can then define β in the following manner,

β0 = µy − b^T Σx^{−1} µx,    β1:d = b^T Σx^{−1}.

Theorem 2 Model the DP-GLM as in Model (10) where G0 is Normal-Inverse-Wishart. Suppose that the true distribution of X is absolutely continuous and has compact support C. If Ef0[Y | X = x] < ∞ for every x ∈ C, then the conditions of Theorem 1 are satisfied.

See the Appendix for a proof. Note that we may not always, or even usually, want to use this prior; when Σx is not sparse, the algorithm is computationally complex and weights the covariates more heavily than a prior where Σx is forced to be diagonal. We do all computations with a diagonal prior on Σx.

7. Numerical Results

We compare the performance of DP-GLM regression to other regression methods. We chose data sets to illustrate the strengths of the DP-GLM, including robustness with respect to data type, heteroscedasticity and higher dimensionality than can be approached with traditional methods. Shahbaba and Neal (2009) used a similar model on data with categorical covariates and count responses; their numerical results were encouraging. We tested the DP-GLM in the following settings.

Data Sets. We selected three data sets. They highlight various data difficulties within regression, such as error heteroscedasticity, moderate dimensionality (10–12 covariates), various input types and response types.

• Cosmic Microwave Background (CMB) Bennett et al. (2003). The dataset consists of 899 observations which map positive integers ℓ = 1, 2, . . . , 899, called ‘multipole moments,’ to the power spectrum Cℓ. Both the covariate and response are considered continuous. The data pose challenges because they are highly nonlinear and heteroscedastic. Since this data set has only two dimensions, it allows us to easily demonstrate how the various methods approach estimating a mean function while dealing with non-linearity and heteroscedasticity.


• Concrete Compressive Strength (CCS) Yeh (1998). The data set has eight covariates: the components cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate and fine aggregate, all measured in kg per m3, and the age of the mixture in days; all are continuous. The response is the compressive strength of the resulting concrete, also continuous. There are 1,030 observations. The data have relatively little noise. Difficulties arise from the moderate dimensionality of the data.

• Solar Flare (Solar) Bradshaw (1989). The response is the number of solar flares in a 24 hour period in a given area; there are 11 categorical covariates. 7 covariates are Bernoulli: time period (1969 or 1978), activity (reduced or unchanged), previous 24 hour activity (nothing as big as M1, one M1), historically complex (Y/N), recent historical complexity (Y/N), area (small or large), area of the largest spot (small or large). 4 covariates are multinomial: class, largest spot size, spot distribution and evolution. The response is the sum of all types of solar flares for the area. There are 1,389 observations. Difficulties are created by the moderately high dimensionality, categorical covariates and count response. Few regression methods can appropriately model this data.

Competitors. The competitors represent a variety of regression methods; some methods are only suitable for certain types of regression problems.

• Ordinary Least Squares (OLS). A parametric method that often provides a reasonable fit when there are few observations. Although OLS can be extended for use with any set of basis functions, finding basis functions that span the true function is a difficult task. We naively choose [1, X1, . . . , Xd]^T as basis functions. OLS can be modified to accommodate both continuous and categorical inputs, but it requires a continuous response function.

• CART. A nonparametric tree regression method generated by the Matlab function classregtree. It accommodates both continuous and categorical inputs and any type of response.

• Bayesian CART. A tree regression model with a prior over tree size (Chipman et al., 1998); it was implemented in R with the tgp package.

• Bayesian Treed Linear Model. A tree regression model with a prior over tree size and a linear model in each of the leaves (Chipman et al., 2002); it was implemented in R with the tgp package.

• Gaussian Processes (GP). A nonparametric method that can accommodate only continuous inputs and continuous responses. GPs were generated in Matlab by the program gpr of Rasmussen and Williams (2006).

• Treed Gaussian Processes. A tree regression model with a prior over tree size and a GP on each leaf node (Gramacy and Lee, 2008); it was implemented in R with the tgp package.


                       Mean Absolute Error              Mean Square Error
Method                30    50    100   250   500      30    50    100   250   500
DP-GLM               0.58  0.51  0.49  0.48  0.45     1.00  0.94  0.91  0.94  0.83
Linear Regression    0.66  0.65  0.63  0.65  0.63     1.08  1.04  1.01  1.04  0.96
CART                 0.62  0.60  0.60  0.56  0.56     1.45  1.34  1.43  1.29  1.41
Bayesian CART        0.66  0.64  0.54  0.50  0.47     1.04  1.01  0.93  0.94  0.84
Treed Linear Model   0.64  0.52  0.49  0.48  0.46     1.10  0.95  0.93  0.95  0.85
Gaussian Process     0.55  0.53  0.50  0.51  0.47     1.06  0.97  0.93  0.96  0.85
Treed GP             0.52  0.49  0.48  0.48  0.46     1.03  0.95  0.95  0.96  0.89

Table 1: Mean absolute and square errors for methods on the CMB data set by training data size. The best results for each size of training data are in bold.

• Basic DP Regression. Similar to the DP-GLM, except the response is a function only of µy, rather than β0 + ∑ βi xi. For the Gaussian model,

P ∼ DP(αG0),
θi | P ∼ P,
Xi | θi ∼ N(µi,x, σ2i,x),
Yi | θi ∼ N(µi,y, σ2i,y).

This model was explored in Section 5.3.

• Poisson GLM (GLM). A Poisson generalized linear model, used on the Solar Flaredata set. It is suitable for count responses.

Cosmic Microwave Background (CMB) Results. For this dataset, we used a Gaussian model and the DP-GLM was given the base measure

µx ∼ N(mx, s2x),    σ2x ∼ exp{N(mx,s, s2x,s)},
β0:d ∼ N(my,0:d, s2y,0:d),    σ2y ∼ exp{N(mx,s, s2x,s)}.

This prior was chosen because the variance tails are heavier than an inverse gamma and the mean is not tied to the variance. It is a good choice for heterogeneous data because of those features.

All non-linear methods except for CART (DP-GLM, Bayesian CART, treed linear models, GPs and treed GPs) did comparably on this dataset; CART had difficulty finding an appropriate bandwidth. Linear regression did poorly due to the non-linearity of the dataset. Fits for heteroscedasticity for the DP-GLM, GPs, treed GPs and treed linear models on 250 training data points can be seen in Figure 2 in Section 5. See Figure 4 and Table 1 for results.

Concrete Compressive Strength (CCS) Results. The CCS dataset was chosen because of its moderately high dimensionality and continuous covariates and response. For this



Figure 4: The average mean absolute error (top) and mean squared error (bottom) for ordinary least squares (OLS), tree regression, Gaussian processes and the DP-GLM on the CMB data set. The data were normalized. Mean +/− one standard deviation are given for each method.


                       Mean Absolute Error              Mean Square Error
Method                30    50    100   250   500      30    50    100   250   500
DP-GLM               0.54  0.50  0.45  0.42  0.40     0.47  0.41  0.33  0.28  0.27
Location/Scale DP    0.66  0.62  0.58  0.56  0.54     0.68  0.59  0.52  0.48  0.45
Linear Regression    0.61  0.56  0.51  0.50  0.50     0.66  0.50  0.43  0.41  0.40
CART                 0.72  0.62  0.52  0.43  0.34     0.87  0.65  0.46  0.33  0.23
Bayesian CART        0.78  0.72  0.63  0.55  0.54     0.95  0.80  0.61  0.49  0.46
Treed Linear Model   1.08  0.95  0.60  0.35  1.10     7.85  9.56  4.28  0.26  1232
Gaussian Process     0.53  0.52  0.38  0.31  0.26     0.49  0.45  0.26  0.18  0.14
Treed GP             0.73  0.40  0.47  0.28  0.22     1.40  0.30  3.40  0.20  0.11

Table 2: Mean absolute and square errors for methods on the CCS data set by training data size. The best results for each size of training data are in bold.

dataset, we used a Gaussian model and the DP-GLM was given a conjugate base measure with conditionally independent covariate and response parameters,

(µx, σ2x) ∼ Normal-Inverse-Gamma(mx, sx, ax, bx),
(β0:d, σ2y) ∼ Multivariate Normal-Inverse-Gamma(My, Sy, ay, by).

This base measure allows the sampler to be fully collapsed but has fewer covariate-associated parameters than a full Normal-Inverse-Wishart base measure, giving it a better fit in a moderate dimensional setting. In testing, it also provided better results for this dataset than the exponentiated Normal base measure used for the CMB dataset; this is likely due to the low noise and variance of the CCS dataset.

Results on this dataset were more varied than those for the CMB dataset. GPs had the best performance overall; on smaller sets of training data, the DP-GLM outperformed frequentist CART. Linear regression, basic DP regression and Bayesian CART all performed comparatively poorly. Treed linear models and treed GPs performed very well most of the time, but had convergence problems leading to overall higher levels of predictive error. Convergence issues were likely caused by the moderate dimensionality (8 covariates) of the dataset. See Figure 5 and Table 2 for results.

Solar Flare Results. The Solar dataset was chosen to demonstrate the flexibility of the DP-GLM. Many regression techniques cannot accommodate categorical covariates and most cannot accommodate a count-type response. For this dataset, the DP-GLM was given the following model,

P ∼ DP(αG0),
θi | P ∼ P,
Xi,j | θi ∼ Multinomial(pi,j,1, . . . , pi,j,K(j)),
Yi | θi ∼ Poisson(βi,0 + ∑_{j=1}^d ∑_{k=1}^{K(j)} βi,j,k 1{Xi,j = k}).



Figure 5: The average mean absolute error (top) and mean squared error (bottom) for ordinary least squares (OLS), tree regression, Gaussian processes, location/scale DP and the DP-GLM on the CCS data set. The data were normalized. Mean +/− one standard deviation are given for each method.



Figure 6: The average mean absolute error (top) and mean squared error (bottom) for tree regression, a Poisson GLM (GLM) and the DP-GLM on the Solar data set. Mean +/− one standard deviation are given for each method.

We used a conjugate base measure,

(pj,1, . . . , pj,K(j)) ∼ Dirichlet(aj,1, . . . , aj,K(j)),    βj,k ∼ N(mj,k, s2j,k).

The sampler was fully collapsed. Because few other methods can handle categorical covariates with a count response, we compared the DP-GLM, CART, Bayesian CART and Poisson regression on this dataset. The DP-GLM had the best performance under both error measures, with Bayesian CART a fairly close second. The high mean squared error values suggest that frequentist CART overfit, while the high mean absolute error for Poisson regression suggests that it did not adequately fit nonlinearities. See Figure 6 and Table 3 for results.

Discussion. The DP-GLM is a relatively strong competitor on all of the datasets, but only CART and Bayesian CART match its flexibility. It was more stable than most of its Bayesian competitors (aside from GPs) on the CCS dataset. Our results suggest that the DP-GLM would be a good choice for small sample sizes when there is significant prior


                       Mean Absolute Error              Mean Square Error
Method                50    100   200   500   800      50    100   200   500   800
DP-GLM               0.52  0.49  0.48  0.45  0.44     0.84  0.76  0.71  0.69  0.63
Poisson Regression   0.65  0.59  0.54  0.52  0.48     0.87  0.84  0.80  0.73  0.64
CART                 0.53  0.48  0.50  0.47  0.47     1.13  0.88  1.03  0.88  0.83
Bayesian CART        0.59  0.52  0.51  0.47  0.45     0.86  0.80  0.78  0.71  0.60

Table 3: Mean absolute and square errors for methods on the Solar data set by training data size. The best results for each size of training data are in bold.

knowledge; in those cases, it acts as an automatic outlier detector and produces a result that is similar to a Bayesian GLM. Results from Section 5 suggest that the DP-GLM is not appropriate for problems with high dimensional covariates; in those cases, the covariate posterior swamps the response posterior, leading to poor numerical results.

8. Conclusions and Future Work

We introduced a flexible Bayesian regression technique; we analyzed its empirical properties; we gave conditions for asymptotic unbiasedness and gave a situation in which they hold; finally, we tested the DP-GLM on a variety of datasets against state of the art Bayesian competitors. The DP-GLM was able to perform reasonably well on all datasets, which had a variety of variable types and data features.

Our empirical analysis of the DP-GLM has implications for regression methods that rely on modeling a joint posterior distribution of the covariates and the response. Our experiments suggest that the covariate posterior can swamp the response posterior, but careful modeling can mitigate the effects for problems with low to moderate dimensionality. This area needs more theoretical exploration, as similar effects can be seen in other graphical models, such as Latent Dirichlet Allocation (Blei et al., 2003). A better understanding would allow us to know when and how such modeling problems can be avoided.

Appendix

A-1 Proof of Theorem 1

The outline for the proof of Theorem 1 is as follows:

1) Show consistency under the set of conditions given in part (i) of Theorem 1.

2) Show that consistency and conditions (ii) and (iii) imply the existence and convergence of the expectation.

a) Show that weak consistency implies pointwise convergence of the conditional density.

b) Note that pointwise convergence of the conditional density implies convergence of the conditional expectation.


The first part of the proof relies on a theorem by Schwartz (1965).

Theorem A-1 (Schwartz (1965)) Let Πf be a prior on F. If Πf places positive probability on all neighborhoods

{ f : ∫ f0(x, y) log(f0(x, y)/f(x, y)) dx dy < δ }

for every δ > 0, then Πf is weakly consistent at f0.

Therefore, consistency of the posterior at f0 follows directly from condition (i).

We now show pointwise convergence of the conditional densities. Let fn(x, y) be the Bayes estimate of the density under Πf after n observations,

fn(x, y) = ∫F f(x, y) Πf(df | (Xi, Yi)_{i=1}^n).

Posterior consistency of Πf at f0 implies that fn(x, y) converges weakly to f0(x, y), pointwise in (x, y).

Lemma A-2 Suppose that Πf is weakly consistent at f0. Then fn(y | x) converges weakly to f0(y | x).

Proof If $\Pi_f$ is weakly consistent at $f_0$, then $f_n(x, y)$ converges weakly to $f_0(x, y)$. We can integrate out $y$ to see that $f_n(x)$ also converges weakly to $f_0(x)$. Then note that
$$\lim_{n \to \infty} f_n(y \mid x) = \lim_{n \to \infty} \frac{f_n(x, y)}{f_n(x)} = \frac{f_0(x, y)}{f_0(x)} = f_0(y \mid x).$$
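Written out, the marginalization step invoked above is
$$f_n(x) = \int f_n(x, y) \, dy \longrightarrow \int f_0(x, y) \, dy = f_0(x),$$
so the ratio in the display is well defined in the limit wherever $f_0(x) > 0$.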

Now we are ready to prove Theorem 1.

Proof [Theorem 1] Existence of expectations: by condition (iii) of Theorem 1,
$$\int |y|^{1+\varepsilon} f_n(y \mid x) \, dy < \infty$$
almost surely. This and condition (ii) ensure uniform integrability and that $E_{\Pi_f}[Y \mid X = x, (X_i, Y_i)_{i=1}^{n}]$ exists for every $n$ almost surely. By Lemma A-2, the conditional density converges weakly to the true conditional density; weak convergence and uniform integrability ensure that the expectations converge: $E_{\Pi_f}[Y \mid X = x, (X_i, Y_i)_{i=1}^{n}]$ converges pointwise in $x$ to $E_{f_0}[Y \mid X = x]$.
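To spell out the uniform integrability step, suppose, as a standing assumption here, that conditions (ii) and (iii) together yield a bound $\sup_n \int |y|^{1+\varepsilon} f_n(y \mid x) \, dy \le C < \infty$. Then for any truncation level $M > 0$,
$$\int_{|y| > M} |y| \, f_n(y \mid x) \, dy \;\le\; \frac{1}{M^{\varepsilon}} \int_{|y| > M} |y|^{1+\varepsilon} f_n(y \mid x) \, dy \;\le\; \frac{C}{M^{\varepsilon}} \longrightarrow 0 \quad \text{as } M \to \infty,$$
uniformly in $n$, which is exactly uniform integrability of the family $\{ f_n(\cdot \mid x) \}_{n \ge 1}$.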



A-2 Proof of Theorem 2

Proof Condition (i) is shown by Ghosal et al. (1999) for basic location-scale mixtures, and by Tokdar (2006) and Rodríguez et al. (2009) for Normal-Inverse Wishart base measures. Condition (ii) is satisfied by problem assumptions, while condition (iii) is satisfied by the Normal-Inverse Wishart.

Remark: Rodríguez (2007) showed that the class of models produced by Model (10) is dense over the set of integrable functions on the compact set $C$.

References

M. Amewou-Atisso, S. Ghosal, J.K. Ghosh, and R.V. Ramamoorthi. Posterior consistency for semi-parametric regression problems. Bernoulli, 9(2):291–312, 2003.

C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.

A. Barron, M.J. Schervish, and L. Wasserman. The consistency of posterior distributions in nonparametric problems. The Annals of Statistics, 27(2):536–561, 1999.

C.L. Bennett, M. Halpern, G. Hinshaw, N. Jarosik, A. Kogut, M. Limon, S.S. Meyer, L. Page, D.N. Spergel, G.S. Tucker, et al. First-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: Preliminary maps and basic results. The Astrophysical Journal Supplement Series, 148(1):1–27, 2003.

D. Blackwell and J.B. MacQueen. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.

D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–144, 2005.

D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

G. Bradshaw. UCI machine learning repository, 1989.

H.A. Chipman, E.I. George, and R.E. McCulloch. Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948, 1998.

H.A. Chipman, E.I. George, and R.E. McCulloch. Bayesian treed models. Machine Learning, 48(1):299–320, 2002.

M. De Iorio, P. Muller, G.L. Rosner, and S.N. MacEachern. An ANOVA model for dependent random measures. Journal of the American Statistical Association, 99(465):205–215, 2004.

P. Diaconis and D. Freedman. On the consistency of Bayes estimates. The Annals of Statistics, 14(1):1–26, 1986.

J.A. Duan, M. Guindani, and A.E. Gelfand. Generalized spatial Dirichlet process models. Biometrika, 94(4):809–825, 2007.

D.B. Dunson, N. Pillai, and J.H. Park. Bayesian density regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):163, 2007.

M.D. Escobar. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association, 89(425):268–277, 1994.

M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.

P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14(1):11–21, 2004.

T.S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.

A.E. Gelfand, A. Kottas, and S.N. MacEachern. Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical Association, 100(471):1021–1035, 2005.

A. Gelman, J.B. Carlin, and H.S. Stern. Bayesian data analysis. CRC Press, 2004.

S. Ghosal, J.K. Ghosh, and R.V. Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics, 27(1):143–158, 1999.

J.K. Ghosh and R.V. Ramamoorthi. Bayesian Nonparametrics. Springer, 2003.

R.B. Gramacy and H.K.H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130, 2008.

J.E. Griffin and M.F.J. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101(473):179–194, 2006.

J.E. Griffin and M.F.J. Steel. Bayesian nonparametric modelling with the Dirichlet process regression smoother. Technical report, Institute of Mathematics, Statistics and Actuarial Science, University of Kent, 2007.

J.G. Ibrahim and K.P. Kleinman. Semiparametric Bayesian methods for random effects models. In Practical Nonparametric and Semiparametric Bayesian Statistics, chapter 5, pages 89–114. 1998.

S.N. MacEachern. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics - Simulation and Computation, 23(3):727–741, 1994.

S.N. MacEachern and P. Muller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, pages 223–238, 1998.

P. McCullagh and J.A. Nelder. Generalized linear models (Monographs on Statistics and Applied Probability 37). Chapman & Hall, London, 1989.

S. Mukhopadhyay and A.E. Gelfand. Dirichlet process mixed generalized linear models. Journal of the American Statistical Association, 92(438):633–639, 1997.

P. Muller, A. Erkanli, and M. West. Bayesian curve fitting using multivariate normal mixtures. Biometrika, 83(1):67–79, 1996.

R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

C.E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference, pages 881–888, 2002.

C.E. Rasmussen and C.K.I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.

A. Rodríguez. Some advances in Bayesian nonparametric modeling. PhD thesis, Duke University, 2007.

A. Rodríguez, D.B. Dunson, and A.E. Gelfand. Bayesian nonparametric functional data analysis through density estimation. Biometrika, 96(1):149–162, 2009.

L. Schwartz. On Bayes procedures. Z. Wahrsch. Verw. Gebiete, 4(1):10–26, 1965.

B. Shahbaba and R.M. Neal. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research, 10:1829–1850, 2009.

S. Tokdar. Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhya: The Indian Journal of Statistics, 67:90–110, 2006.

S. Walker. New approaches to Bayesian consistency. The Annals of Statistics, 32(5):2028–2043, 2004.

M. West, P. Muller, and M.D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of Uncertainty: A Tribute to D.V. Lindley, pages 363–386. 1994.

I.C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.
