
Biometrika (1996), 83, 1, pp. 67-79 Printed in Great Britain

Bayesian curve fitting using multivariate normal mixtures

BY PETER MULLER Institute of Statistics and Decision Sciences, Duke University, Box 90251, Durham,

North Carolina 27708-0251, U.S.A.

ALAATTIN ERKANLI Developmental Epidemiology Program, Duke University Medical Center, Box 3354, Durham,

North Carolina 27710, U.S.A.

AND MIKE WEST

Institute of Statistics and Decision Sciences, Duke University, Box 90251, Durham, North Carolina 27708-0251, U.S.A.

SUMMARY

Problems of regression smoothing and curve fitting are addressed via predictive inference in a flexible class of mixture models. Multidimensional density estimation using Dirichlet mixture models provides the theoretical basis for semi-parametric regression methods in which fitted regression functions may be deduced as means of conditional predictive distributions. These Bayesian regression functions have features similar to generalised kernel regression estimates, but the formal analysis addresses problems of multivariate smoothing, parameter estimation, and the assessment of uncertainties about regression functions naturally. Computations are based on multidimensional versions of existing Markov chain simulation analysis of univariate Dirichlet mixture models.

Some key words: Bayesian regression estimation; Dirichlet mixture model; Markov chain simulation; Multivariate density estimation; Smoothing.

1. INTRODUCTION

A major concern of modern data analysis is the estimation of a smooth regression function g(x) = E(y | x) based on sampled data z_i = (y_i, x_i) (i = 1, 2, ..., n). Here y and x may be scalar or vector. Usually g is chosen with respect to some criterion that measures how close g(x) is to y on the average. For example, one such criterion is the expected squared error, E[{y - g(x)}^2 | x], which is minimised when g(x) = E(y | x). Thus, if we can estimate the joint distribution of (y, x), or more directly the conditional distribution of (y | x), the problem is solved in principle by calculating g(x).
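
The minimisation is the familiar decomposition of the expected squared error, recorded here for completeness:

$$ E[\{y - g(x)\}^2 \mid x] = E[\{y - E(y \mid x)\}^2 \mid x] + \{E(y \mid x) - g(x)\}^2, $$

so the first term does not involve g and the criterion is minimised by taking g(x) = E(y | x).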

Popular approaches to this class of smoothing problems are essentially nonparametric. Given data D = {y_i, x_i; i = 1, ..., n}, g(x) can be estimated by using one of several techniques, including lowess (Cleveland, 1979), spline smoothing (Craven & Wahba, 1979), and kernel smoothing (Silverman, 1985, p. 76; Gasser & Müller, 1984). The common assumption among these methods is that x_1, ..., x_n are the design points of an experiment at which the observations y_1, ..., y_n are obtained, and that there is an unknown regression function g(x) satisfying y = g(x) + ε(x), where g(·) and ε(·) are systematic and residual components, respectively. Sensible models require the introduction of smoothness priors,


or analogous constraints such as smoothness penalties, for the regression function and the error structure (Silverman, 1985, p. 26). Modelling must strike a balance between smoothness and flexibility in reflecting local structure and changes in the apparent regression relationship. Different regions of the (x, y) space may exhibit completely different patterns of relationship.

A simple idea underlying essentially all nonparametric regression approaches, and made explicit in our approach, is that of local linear regression; generally, we might seek models in which regression functions have, at least approximately, the form g(x) = Σ_j s_j(x) m_j(x), where the m_j(x) are distinct linear functions of x and the s_j(x) are probability weights that vary across the design space. The weights s_j(x) should emphasise the appropriate component j by taking larger values for x in the locale corresponding to m_j(x). A mean E(y | x) = g(x) of this form may derive from a joint distribution of mixture form, (x, y) ~ Σ_j p_j(x, y), in which the conditional distributions p_j(y | x) of the components have means m_j(x). This is the basis of our approach. To develop an appropriately flexible class of mixture distributions we use multivariate Dirichlet process mixtures of normals. Several authors, notably Ferguson (1973, 1983), Antoniak (1974) and Escobar & West (1995), have discussed Bayesian methods of predictive inference in univariate Dirichlet mixture models, where nonparametric Bayesian inference relates quite closely, though with important differences, to traditional kernel density estimation. The multivariate framework developed here offers similar analogies with traditional kernel methods of regression smoothing, as is demonstrated below. We assume that the data pairs (x_i, y_i) are sampled from a joint distribution of x and y. This is not a very restrictive assumption for practical applications since the resulting methods can be applied without modification in curve fitting problems with nonrandom design points. We illustrate how these mixture models lead to a natural and smooth estimate of g(x). The Bayesian framework directly involves estimation and elimination of smoothing parameters, and, of course, formal assessment of uncertainties about estimated regression functions.
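
Writing the mixture with explicit proportions π_j makes the step from the joint distribution to the weighted regression form concrete: if p(x, y) = Σ_j π_j p_j(x, y), then

$$ p(y \mid x) = \sum_j s_j(x)\, p_j(y \mid x), \qquad s_j(x) = \frac{\pi_j\, p_j(x)}{\sum_l \pi_l\, p_l(x)}, $$

and hence

$$ E(y \mid x) = \sum_j s_j(x)\, m_j(x), \qquad m_j(x) = E_j(y \mid x), $$

so the weights s_j(x) are proportional to the component marginals of x, which is precisely the locality property described above.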

The underlying class of normal mixture models is introduced and briefly reviewed in § 2, followed by general development of a Gibbs sampler simulation analysis and predictive inference in § 3. Section 4 returns focus to the smoothing problem. In § 5, we apply these ideas and techniques to an ozone data set given by Chambers et al. (1983).

2. DIRICHLET PROCESS MIXTURES OF NORMALS

Let z_i (i = 1, ..., n) be p-vectors of random quantities assumed independently drawn from an uncertain distribution to be estimated. From a Bayesian perspective, density estimation may be viewed as a problem of predicting a further draw z = z_{n+1}; hence any model must provide a means of computing the predictive distribution p(z | D) for the future draw conditional on D = {z_1, ..., z_n}. This requires an appropriate class of models. We use a mixture model z_i ~ Σ_{j≥1} w_j F_{θ_j}, for which we introduce a parsimonious parametrisation and prior probability model by means of a Dirichlet process model.

Write δ_x for a unit point mass at x. A Dirichlet process 𝒟(αG_0) (Antoniak, 1974) defines a probability model on discrete distributions G = Σ_j w_j δ_{θ_j} by successively generating

$$ w_j \Big/ \prod_{i<j} (1 - w_i) \sim \text{Beta}(1, \alpha), \qquad \theta_j \sim G_0 \quad (j \geq 1). $$

Here α is the total mass or precision, and G_0 is the prior expectation E(G) of the Dirichlet process. Escobar & West (1995) follow Ferguson (1983) in using a Dirichlet process model


to define a class of normal mixture models for univariate density estimation. Here we have one class of multivariate generalisations of this previous work, in which we assume the following hierarchical description:

$$ z_i \sim N(\mu_i, \Sigma_i), \qquad \theta_i = (\mu_i, \Sigma_i) \sim G, \qquad G \sim \mathcal{D}(\alpha G_0). $$

In words, the discrete distribution G has a Dirichlet process prior with parameter αG_0; given G, the parameters θ_i are independently drawn from G; then, given θ_i, z_i follows a multivariate normal distribution with moments given by the components of θ_i. An additional level may be added to this hierarchy to specify hyperpriors on the parameters α and G_0; we do this in § 3.1.
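
As a concrete illustration of this constructive definition, the sketch below simulates a truncated draw from 𝒟(αG_0) by the stick-breaking recursion and then generates data from the implied normal mixture. The truncation level, the bivariate base measure and all numerical settings are illustrative placeholders, not the specifications used later in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, p, J = 1.0, 2, 50      # DP precision, dimension, truncation level (illustrative)

# Stick-breaking: v_j ~ Beta(1, alpha); w_j = v_j * prod_{i<j}(1 - v_i)
v = rng.beta(1.0, alpha, size=J)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
w /= w.sum()                  # renormalise the truncated weights

# Atoms theta_j = (mu_j, Sigma_j) from an illustrative base measure G_0
mus = rng.multivariate_normal(np.zeros(p), 4.0 * np.eye(p), size=J)
covs = [stats.invwishart.rvs(df=p + 3, scale=np.eye(p), random_state=rng) for _ in range(J)]

# Given G (approximately), each z_i picks an atom and is normal around it
n = 200
comp = rng.choice(J, size=n, p=w)
z = np.array([rng.multivariate_normal(mus[c], covs[c]) for c in comp])
print("sampled", z.shape, "using", len(set(comp)), "distinct atoms")
```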

Marginalisation over G also implies that G_0 is the marginal prior for each of the θ_i. This model can be extended to a richer class by putting a hyperprior on α and the hyperparameters of G_0; see Escobar & West (1995) and West & Cao (1993) for such extensions in univariate models. One key feature of the model is that, because of the discreteness of G, it assigns positive probability to common values among the mean/variance parameters θ_i. An important way of describing this is via the set of conditional priors for each θ_i given the rest, that is, θ_i | θ^{(i)}, where, for each i = 1, ..., n, θ^{(i)} = (θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_n). We have

$$ (\theta_i \mid \theta^{(i)}) \sim \alpha a_{n-1} G_0(\theta_i) + a_{n-1} \sum_{j=1, j \neq i}^{n} \delta_{\theta_j}(\theta_i), \qquad (1) $$

where δ_{θ_j}(θ) is the unit point mass at θ = θ_j and a_j = 1/(α + j) (j = 1, ..., n). Similarly, considering a further observation z_{n+1} with moments θ_{n+1} = {μ_{n+1}, Σ_{n+1}}, we have

$$ (\theta_{n+1} \mid \theta^{(n+1)}) \sim \alpha a_n G_0(\theta_{n+1}) + a_n \sum_{i=1}^{n} \delta_{\theta_i}(\theta_{n+1}). \qquad (2) $$

In any set of n realised values θ^{(n+1)} = (θ_1, ..., θ_n) there will be some k ≤ n distinct values, denoted by θ* = (θ_1*, ..., θ_k*). Write n_j for the number of occurrences of θ_i = θ_j* (j = 1, ..., k), so that n_1 + ... + n_k = n. Then

$$ (\theta_{n+1} \mid \theta^{(n+1)}) \sim \alpha a_n G_0(\theta_{n+1}) + a_n \sum_{j=1}^{k} n_j \delta_{\theta_j^*}(\theta_{n+1}). \qquad (3) $$

Antoniak (1974) gives the prior for the number of distinct components k, which has the feature that, for n large relative to α,

$$ E(k \mid \alpha, n) \approx \alpha \log(1 + n/\alpha). $$

This indicates that k is typically very small compared to n, so the model essentially implies the data are drawn from a mixture of a small number of normals. As k increases, the model analysis can achieve high fidelity to observed data, indicating its usefulness for data smoothing and interpolation that underlies its utility in density estimation.
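
The Pólya-urn form (2) can also be simulated directly, which gives a quick numerical check on this approximation; the sketch below tracks only component membership, which is all that k depends on, and also reports the exact prior mean Σ_{i=1}^{n} α/(α + i − 1) for comparison. The settings α = 1 and n = 111 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def polya_urn_k(alpha, n):
    """Number of distinct components after n sequential draws from (2)."""
    counts = []                       # n_j for the distinct components seen so far
    for i in range(n):
        # new component with prob alpha/(alpha+i), existing j with prob n_j/(alpha+i)
        probs = np.array(counts + [alpha]) / (alpha + i)
        j = rng.choice(len(counts) + 1, p=probs)
        if j == len(counts):
            counts.append(1)
        else:
            counts[j] += 1
    return len(counts)

alpha, n, reps = 1.0, 111, 2000
ks = [polya_urn_k(alpha, n) for _ in range(reps)]
print("simulated E(k)        =", np.mean(ks))
print("exact prior mean      =", np.sum(alpha / (alpha + np.arange(n))))
print("alpha*log(1 + n/alpha) =", alpha * np.log(1 + n / alpha))
```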

The joint posterior density for θ^{(n+1)} given the data D = {z_1, ..., z_n} is

$$ p(\theta^{(n+1)} \mid D) = c_n \prod_{i=1}^{n} \Big\{ \alpha G_0(\theta_i) + \sum_{j<i} \delta_{\theta_j}(\theta_i) \Big\} f(z_i \mid \theta_i), \qquad (4) $$

where c_n > 0 is a normalising constant and f(z_i | θ_i) is the likelihood component for z_i,


derived from the density of the normal distribution of z_i | θ_i. If α is very large, then (4) reduces to the usual form of Bayes' theorem, with G_0 the prior for θ^{(n+1)}; as α goes to zero, the θ_i's are estimated by pooling the z_i's and some neighbouring z_j's together.

We are interested in the posterior predictive density of z = z_{n+1}, namely

$$ p(z \mid D) = \int p(z \mid \theta^{(n+1)})\, p(\theta^{(n+1)} \mid D)\, d\theta^{(n+1)}. \qquad (5) $$

The second density in the integrand is the posterior in (4). The first is

$$ p(z \mid \theta^{(n+1)}) = \int p(z \mid \theta_{n+1})\, p(\theta_{n+1} \mid \theta^{(n+1)})\, d\theta_{n+1}, $$

which, using (3), reduces to

$$ p(z \mid \theta^{(n+1)}) = \alpha a_n \int f(z \mid \theta)\, dG_0(\theta) + a_n \sum_{j=1}^{k} n_j f(z \mid \theta_j^*), \qquad (6) $$

where f(z | θ) denotes the normal density for z with moments θ = {μ, Σ}. For practically important cases in which α/n is negligible, (6) gives

$$ p(z \mid \theta^{(n+1)}) \approx \sum_{j=1}^{k} \omega_j f(z \mid \theta_j^*), $$

with ω_j = n_j/(α + n) ≈ n_j/n. This has the form of a kernel density estimate with kernel locations and corresponding variance matrices provided by the distinct draws from G_0. This clarifies structure, although application involves averaging this simple mixture with respect to the posterior for the ω_j, or, equivalently, the posterior for the n parameters θ_i given in (4). Precise evaluation of (4) is difficult even for small sample size n. However, Markov chain simulation schemes can be developed to derive methods for approximately simulating the posterior p(θ^{(n+1)} | D). Sampling the posterior in turn leads to Monte Carlo approximations to the predictive density p(z | D) and its features. Fundamental developments in univariate cases are given by Escobar (1994), and these were extended by Escobar & West (1995) to problems of univariate density estimation. These schemes have since been refined and updated by MacEachern (1994), on which our approach here builds. Some brief review and illustration of this approach are given by West, Müller & Escobar (1994).
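
For a single configuration this approximate mixture form is inexpensive to evaluate; the sketch below computes Σ_j ω_j f(z | θ_j*) on a grid for an invented pair of components, ignoring the small mass α/(α + n) attached to the base measure.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical distinct components theta_j* = (mu_j*, Sigma_j*) with counts n_j
mus   = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]
covs  = [np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])]
n_j   = np.array([70.0, 41.0])
alpha = 1.0
omega = n_j / (alpha + n_j.sum())          # omega_j = n_j / (alpha + n)

def mixture_density(z):
    """Approximate p(z | theta) = sum_j omega_j N(z; mu_j*, Sigma_j*), ignoring the
    small alpha/(alpha + n) mass attached to the base measure G_0."""
    return sum(w * multivariate_normal.pdf(z, mean=m, cov=c)
               for w, m, c in zip(omega, mus, covs))

grid = np.stack(np.meshgrid(np.linspace(-3, 6, 50), np.linspace(-3, 5, 50)), axis=-1)
density = mixture_density(grid)            # 50 x 50 grid of density values
print(density.shape, density.max())
```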

3. MODEL COMPLETION AND ANALYSIS

3.1. Hyperparameters

Although other choices can be made, in this paper we specify G_0 such that, independently,

$$ \mu_i \mid m, B \sim N(\mu_i;\, m, B), \qquad (7) $$

$$ \Sigma_i^{-1} \mid s, S \sim W_p\{\Sigma_i^{-1};\, s, (sS)^{-1}\}. \qquad (8) $$

We assume the following priors for the hyperparameters m, S, B and α, assumed mutually independent:

$$ m \sim N(m;\, a, A), \quad S \sim W_p(S;\, q, q^{-1}R), \quad B^{-1} \sim W_p\{B^{-1};\, c, (cC)^{-1}\}, \quad \alpha \sim \Gamma(\alpha;\, a_0, b_0). $$
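
Under the Wishart parametrisation assumed above, in which W_p{Σ^{-1}; s, (sS)^{-1}} has prior mean S^{-1}, prior draws of the hyperparameters might be generated as in the sketch below. The dimension, the hyper-hyperparameter values and the shape/rate reading of the gamma prior are assumptions made here for illustration; Wishart conventions differ between libraries and should be checked.

```python
import numpy as np
from scipy.stats import wishart, gamma, multivariate_normal

rng = np.random.default_rng(2)
p = 4                                   # dimension of z = (Z, R, W, T)

# Placeholder hyper-hyperparameters (scales illustrative, not the paper's values)
a, A = np.zeros(p), 10.0 * np.eye(p)    # prior for m
q, R = 10, np.eye(p)                    # prior for S
c, C = 10, np.eye(p)                    # prior for B^{-1}
a0, b0 = 1.0, 0.2                       # gamma prior for alpha, read as shape/rate

m = multivariate_normal.rvs(mean=a, cov=A, random_state=rng)
S = wishart.rvs(df=q, scale=R / q, random_state=rng)                     # E(S) = R
B_inv = wishart.rvs(df=c, scale=np.linalg.inv(c * C), random_state=rng)  # E(B^{-1}) = C^{-1}
alpha = gamma.rvs(a=a0, scale=1.0 / b0, random_state=rng)

# Given (m, B, s, S), a draw of one component (mu, Sigma^{-1}) from G_0:
s = 10
Sigma_inv = wishart.rvs(df=s, scale=np.linalg.inv(s * S), random_state=rng)
mu = multivariate_normal.rvs(mean=m, cov=np.linalg.inv(B_inv), random_state=rng)
print(mu.shape, Sigma_inv.shape, alpha)
```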

With this model specification, exact analytic evaluation of posterior and predictive distributions is not feasible. The univariate development of Escobar & West (1995) can be extended and modified to produce various conditional distributions described below. This exploits the conditional prior structure for the θ_i identified in equation (1). By iterative resampling from the stated full conditionals we implement a Gibbs sampler. Among the many fine descriptions of the Gibbs sampling scheme are Gelfand & Smith (1990) and Smith & Roberts (1993).

3.2. Augmenting the parameter vector by configuration indicators

First, we note general expressions for the conditional posterior distributions of θ_i given θ^{(i)} = (θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_n). Write k_i for the number of distinct values in θ^{(i)}, denote these distinct values by θ_j^{i*}, and suppose that θ_j^{i*} occurs some n_{ij} times. Then, using Bayes' theorem and the conditional prior (1), we have

$$ (\theta_i \mid D, \theta^{(i)}, m, S, B, \alpha) \sim q_{i0}\, G^{(i)}(\theta_i) + \sum_{j=1}^{k_i} q_{ij}\, \delta_{\theta_j^{i*}}(\theta_i). \qquad (9) $$

Here G^{(i)} is the posterior for θ_i under the prior G_0 updated by the likelihood f(z_i | θ_i). The mixing weights q_{ij} are

$$ q_{i0} \propto \alpha \int f(z_i \mid \theta_i)\, dG_0(\theta_i), \qquad q_{ij} \propto n_{ij}\, f(z_i \mid \theta_j^{i*}) \quad (j = 1, \ldots, k_i). $$

Although early implementations of the Dirichlet mixtures used (9) as the basis for simulation, we prefer and develop an alternative, based on multivariate and nonconjugate extensions of work by MacEachern (1994). MacEachern discusses the theoretical and computational advantages of his modified scheme in simplified versions of the univariate normal mixture context. The key feature, that of superior convergence properties relative to previous approaches, carries over to multivariate and nonconjugate problems, such as ours. Introduce the configuration vector 𝒮 = (𝒮_1, ..., 𝒮_n), where each 𝒮_i takes a value in the set {1, ..., n}. We use these configuration indicators to identify common elements among the θ_i; in particular, 𝒮_i = 𝒮_{i'} = j if observations z_i and z_{i'} share a common parameter θ_i = θ_{i'} = θ_j*. Then (9) implicitly determines conditional posterior probabilities for the configuration indicators 𝒮_i, and these may be used to sample values of 𝒮 sequentially. Given a complete configuration 𝒮, the posterior distributions for the parameters θ_j* and all hyperparameters simplify. Markov chain simulation is carried out by successively sampling configuration indicators followed by parameters and hyperparameters, and iterating.

To elaborate, condition the data on a complete configuration 𝒮 with exactly k distinct parameter values θ* = (θ_1*, ..., θ_k*) and counts n_j = #{i: 𝒮_i = j}. Given 𝒮 and k, the data are organised into k distinct groups; the n_j observations in the jth such group are independent and normally distributed with mean and variance θ_j* = {μ_j*, Σ_j*}, and the data are also conditionally independent across groups. Thus the configuration determines a one-way multivariate normal layout of the data. The following general expression of the resulting posterior for all parameters and hyperparameters leads to the various conditional posteriors detailed below:

$$ p(\theta^*, m, S, B \mid D, \mathcal{S}, k, \alpha) \propto \Big[ \prod_{j=1}^{k} \Big\{ \prod_{i: \mathcal{S}_i = j} N(z_i;\, \mu_j^*, \Sigma_j^*) \Big\} N(\mu_j^*;\, m, B)\, W_p\{\Sigma_j^{*-1};\, s, (sS)^{-1}\} \Big] N(m;\, a, A)\, W_p(S;\, q, q^{-1}R)\, W_p\{B^{-1};\, c, (cC)^{-1}\}. \qquad (10) $$


3.3. Conditional posteriors for primary parameters

The θ_j* are conditionally independent, with posteriors arising from the normal one-way layout induced by the configuration. For each j = 1, ..., k, introduce the group sample means z̄_j = n_j^{-1} Σ_{i: 𝒮_i = j} z_i. Then

$$ (\mu_j^* \mid D, \Sigma_j^*, m, S, B, \alpha, \mathcal{S}, k) \sim N(\mu_j^*;\, m_j, T_j), \qquad (11) $$

$$ (\Sigma_j^{*-1} \mid D, \mu_j^*, m, S, B, \alpha, \mathcal{S}, k) \sim W_p(\Sigma_j^{*-1};\, s + n_j, S_j^{-1}), \qquad (12) $$

where

$$ T_j^{-1} = B^{-1} + n_j \Sigma_j^{*-1}, \qquad m_j = T_j(B^{-1} m + n_j \Sigma_j^{*-1} \bar{z}_j), \qquad S_j = sS + \sum_{i: \mathcal{S}_i = j} (z_i - \mu_j^*)(z_i - \mu_j^*)'. $$
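
A direct transcription of (11) and (12) for a single group, treating the hyperparameters as known inputs, might look like the following sketch; the function name, the synthetic data and the specific hyperparameter values are illustrative.

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(3)

def draw_group_params(z_group, mu_star, m, B, s, S):
    """One Gibbs update of (mu_j*, Sigma_j*) for a single group, following (11)-(12):
    draw Sigma_j^{-1} ~ W_p(s + n_j, S_j^{-1}) with S_j = sS + sum (z_i - mu_j*)(z_i - mu_j*)',
    then mu_j* ~ N(m_j, T_j) given the new Sigma_j."""
    n_j, p = z_group.shape
    resid = z_group - mu_star
    S_j = s * S + resid.T @ resid
    Sigma_inv = wishart.rvs(df=s + n_j, scale=np.linalg.inv(S_j), random_state=rng)

    B_inv = np.linalg.inv(B)
    T_j = np.linalg.inv(B_inv + n_j * Sigma_inv)
    m_j = T_j @ (B_inv @ m + n_j * Sigma_inv @ z_group.mean(axis=0))
    mu_new = rng.multivariate_normal(m_j, T_j)
    return mu_new, np.linalg.inv(Sigma_inv)

# Tiny synthetic example
p = 2
z_group = rng.normal(size=(8, p)) + np.array([2.0, -1.0])
mu, Sigma = draw_group_params(z_group, z_group.mean(axis=0),
                              m=np.zeros(p), B=10 * np.eye(p), s=6, S=np.eye(p))
print(mu, "\n", Sigma)
```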

3.4. Conditional posteriors for hyperparameters m, S, B and α

Given a full configuration 𝒮 and the corresponding k distinct parameters θ*, equation (10) implies

$$ (m \mid D, \theta^*, S, B, \alpha, \mathcal{S}, k) \sim N(m;\, \bar{a}, \bar{A}), \qquad (13) $$

$$ (S \mid D, \theta^*, m, B, \alpha, \mathcal{S}, k) \sim W_p\Big\{S;\, q + sk,\, \Big(qR^{-1} + s \sum_{j=1}^{k} \Sigma_j^{*-1}\Big)^{-1}\Big\}, \qquad (14) $$

$$ (B^{-1} \mid D, \theta^*, m, S, \alpha, \mathcal{S}, k) \sim W_p\Big[B^{-1};\, c + k,\, \Big\{cC + \sum_{j=1}^{k} (\mu_j^* - m)(\mu_j^* - m)'\Big\}^{-1}\Big], \qquad (15) $$

where μ̄* = Σ_j μ_j*/k, Ā^{-1} = A^{-1} + kB^{-1} and ā = Ā(A^{-1}a + kB^{-1}μ̄*). The development of Escobar & West (1995) and West & Cao (1993) provides an

appropriate conditional posterior distribution for α by augmenting the parameters with an additional variable η ∈ (0, 1) such that

$$ (\alpha \mid D, \theta^*, m, S, B, \mathcal{S}, k, \eta) \sim \pi_1 \Gamma\{a_0 + k,\, b_0 - \log(\eta)\} + \pi_2 \Gamma\{a_0 + k - 1,\, b_0 - \log(\eta)\}; \qquad (16) $$

here

$$ \pi_1 = (a_0 + k - 1) \big/ \big[ a_0 + k - 1 + n\{b_0 - \log(\eta)\} \big], \qquad \pi_2 = 1 - \pi_1, $$

and, as shown in those references,

$$ (\eta \mid D, \theta^*, m, S, B, \mathcal{S}, k, \alpha) \sim \text{Beta}(\alpha + 1, n). \qquad (17) $$

Thus α may be sampled by first generating a value of η from (17) based on the previous values of k and α, then drawing a new α from the gamma mixture given by (16).
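
The two-step update in (16)-(17) translates almost line for line into code. The sketch below assumes the gamma distributions are parametrised by shape and rate, which is a reading of the notation rather than something the text states explicitly.

```python
import numpy as np

rng = np.random.default_rng(4)

def update_alpha(alpha, k, n, a0, b0):
    """Escobar & West (1995)-style update: draw eta | alpha, then alpha | eta, k."""
    eta = rng.beta(alpha + 1.0, n)                          # equation (17)
    rate = b0 - np.log(eta)
    odds = (a0 + k - 1.0) / (n * rate)                      # pi_1 / pi_2
    pi1 = odds / (1.0 + odds)
    shape = a0 + k if rng.random() < pi1 else a0 + k - 1.0  # gamma mixture, equation (16)
    return rng.gamma(shape, 1.0 / rate)                     # numpy gamma uses scale = 1/rate

print(update_alpha(alpha=1.0, k=3, n=111, a0=1.0, b0=0.2))
```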

3.5. Conditional posteriors for the configuration 𝒮

Let 𝒮^{(i)} = (𝒮_1, ..., 𝒮_{i−1}, 𝒮_{i+1}, ..., 𝒮_n) denote the configuration vector corresponding to the k_i distinct elements θ_j^{i*} = {μ_j^{i*}, Σ_j^{i*}} in θ^{(i)} = (θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_n). Recall that θ_j^{i*} occurs n_{ij} times. Then the discussion around equation (9) leads to the following probabilities:


$$ q_{ij} = \text{pr}(\mathcal{S}_i = j \mid D, \theta^{(i)}, m, S, B, \alpha, \mathcal{S}^{(i)}, k_i) = c\, \text{pr}(\mathcal{S}_i = j \mid \alpha, \mathcal{S}^{(i)}, k_i)\, N(z_i;\, \mu_j^{i*}, \Sigma_j^{i*}) = c\, n_{ij}\, N(z_i;\, \mu_j^{i*}, \Sigma_j^{i*}), \qquad (18) $$

where c is a common normalisation constant appearing in (18) and (19). Index j = 0 corresponds to θ_i drawn as a new parameter from the G_0 described in (7) and (8); the relevant probability may be written as

$$ q_{i0} = \text{pr}(\mathcal{S}_i = 0 \mid D, \theta^{(i)}, m, S, B, \mathcal{S}^{(i)}, k_i) = \text{pr}(\mathcal{S}_i = 0 \mid \alpha, \mathcal{S}^{(i)}, k_i)\, N(z_i;\, m, \Sigma_{k_i+1}^{i*} + B) = c\, \alpha\, N(z_i;\, m, \Sigma_{k_i+1}^{i*} + B); \qquad (19) $$

here Σ_{k_i+1}^{i*} represents a new value simulated from the Wishart component of the base prior G_0. Note that the integral in q_{i0} ∝ α ∫ f(z_i | θ_i) dG_0(θ_i) is calculated analytically only with respect to μ_i; the average with respect to Σ_i is approximated by substituting a draw Σ_{k_i+1}^{i*} from the base measure. This approximation may be refined by averaging over several or many draws of the covariance matrix. The probabilities q_{ij} are easily evaluated and normalised; each involves the evaluation of a multivariate normal density function. They may then be used to generate a set of configuration indicators 𝒮_i for i = 1, ..., n based on the previous, existing configuration, parameters and hyperparameters. As mentioned, some of the rationale underlying this approach is given by MacEachern (1994); the major differences in our context involve the multivariate extension and the use of a nonconjugate base prior form for G_0. West et al. (1994) provide discussion of the use of nonconjugate prior forms, and of the potential pitfalls in the use of conjugate priors.
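
One sweep over the configuration indicators, following (18) and (19), might be organised as in the sketch below. The data structures (dictionaries keyed by component label) and the handling of newly opened components are illustrative choices; in particular, the provisional parameters given to a new component here are drawn from G_0, whereas the scheme described above would draw the new θ_i from G^{(i)} or refresh it immediately via (11)-(12).

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart

rng = np.random.default_rng(5)

def update_configuration(z, config, mus, covs, m, B, s, S, alpha):
    """Resample each S_i in turn via (18)-(19). `config` is an integer array; `mus`
    and `covs` are dicts keyed by the current component labels. A temporary label
    of -1 marks an observation removed from its component."""
    n, p = z.shape
    for i in range(n):
        old = config[i]
        config[i] = -1                                 # remove z_i from its component
        labels = [j for j in set(config) if j != -1]
        counts = np.array([np.sum(config == j) for j in labels], dtype=float)

        # q_ij proportional to n_ij N(z_i; mu_j*, Sigma_j*)                   (18)
        q = np.array([c * multivariate_normal.pdf(z[i], mus[j], covs[j])
                      for j, c in zip(labels, counts)])
        # q_i0 proportional to alpha N(z_i; m, Sigma_new + B), Sigma_new from G_0  (19)
        Sigma_new = np.linalg.inv(wishart.rvs(df=s, scale=np.linalg.inv(s * S),
                                              random_state=rng))
        q0 = alpha * multivariate_normal.pdf(z[i], m, Sigma_new + B)

        probs = np.append(q, q0)
        probs /= probs.sum()
        pick = rng.choice(len(probs), p=probs)
        if pick == len(labels):                        # open a new component for z_i
            new_label = max(mus) + 1 if mus else 0
            # Provisional draw from G_0; the scheme above would instead use G^{(i)}
            # or refresh the new component immediately via (11)-(12).
            mus[new_label] = rng.multivariate_normal(m, B)
            covs[new_label] = Sigma_new
            config[i] = new_label
        else:
            config[i] = labels[pick]
        if old not in config and old in mus:           # drop emptied components
            del mus[old]; del covs[old]
    return config, mus, covs
```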

3.6. A Gibbs sampling scheme

Write φ = (θ*, m, S, B, α, 𝒮, k) for the complete set of parameters whose posterior has just been described. Simulation analysis proceeds by sequentially drawing values φ_1, ..., φ_N as follows. Assume a current value of φ. The next step in the Markov chain simulation produces a new value of φ via:

(i) drawing a new configuration via (18) and (19), based on the current values of the parameters in φ and hence the current configuration; this implicitly determines a new value for k;

(ii) based on this new configuration, drawing new parameters θ* via (11) and (12); and finally

(iii) drawing new hyperparameters m, S, B and α via (13) to (17), inclusive, based on the latest parameters θ* and corresponding configuration.

The Markov chain so determined produces sequences φ_r that ultimately represent approximate draws from the posterior p(φ | D). We discuss convergence issues in the following section. Assuming convergence, a set of N draws φ_r leads to sampled predictive distributions p(z_{n+1} | φ_r). Predictive inference formally based on p(z_{n+1} | D) requires the averaging of p(z_{n+1} | φ) over p(φ | D) as implied in (5); namely, p(z_{n+1} | D) ≈ N^{-1} Σ_r p(z_{n+1} | φ_r), where the summands are mixtures of few normals, the number of components varying with the sampled configurations. In addition to approximating the predictive density this way, the sampled densities p(z_{n+1} | φ_r) provide information relevant to assessing posterior uncertainty about this 'fitted' or 'estimated' density for future data, formally addressing the issues of uncertainty assessment in density estimation and regression.
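
Given the retained draws, the Monte Carlo approximation to p(z_{n+1} | D) is simply an average of finite normal mixtures. A minimal sketch, assuming each saved draw is stored as a list of (weight, mean, covariance) triples, a storage format assumed here rather than prescribed by the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_density(z, draws):
    """Approximate p(z | D) = N^{-1} sum_r p(z | phi_r), where each element of `draws`
    represents one phi_r as a list of (weight, mean, cov) mixture components."""
    vals = []
    for components in draws:
        vals.append(sum(w * multivariate_normal.pdf(z, mean=mu, cov=cov)
                        for w, mu, cov in components))
    return np.mean(vals, axis=0)

# Toy usage with two fake 'posterior draws', each a two-component mixture
draws = [
    [(0.6, np.zeros(2), np.eye(2)), (0.4, np.array([3.0, 1.0]), 0.5 * np.eye(2))],
    [(0.5, np.array([0.2, -0.1]), np.eye(2)), (0.5, np.array([2.8, 1.2]), 0.7 * np.eye(2))],
]
grid = np.stack(np.meshgrid(np.linspace(-3, 6, 40), np.linspace(-3, 4, 40)), axis=-1)
print(predictive_density(grid, draws).shape)   # 40 x 40 grid of averaged densities
```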


3.7. Convergence

The argument follows closely the discussion in an unpublished manuscript by S. N. MacEachern and P. Müller, 'Estimating mixture of Dirichlet process models', which considers convergence of more general models. Let P(φ, ·) denote the transition probability defined by one iteration of the Gibbs sampler if the current state of the chain is φ, and write π for the posterior distribution.

For a class of Markov chain Monte Carlo methods including the Gibbs sampler, Tierney (1994, Theorem 1, Corollary 2) shows that if P(φ, ·) is π-irreducible then ||P^n(φ, ·) − π|| → 0 for all φ. Here ||·|| denotes total variation distance, and P^n(φ, ·) is the transition probability over n iterations when starting at φ. To show π-irreducibility we need to show that, for each subset A of the parameter space Φ with π(A) > 0, and for each parameter vector φ ∈ Φ, there exists an integer n = n(φ, A) ≥ 1 such that P^n(φ, A) > 0. For each iteration of the Gibbs sampler we consider the two sub-steps of (i) resampling the configuration vector 𝒮, and then (ii) resampling all other parameters conditional on 𝒮. The configuration vector 𝒮 introduces a finite partition of the parameter space into subspaces Φ_s with equal configurations. There exists a configuration s with π_s(A) = π(A ∩ Φ_s) > 0. Since at each iteration the Gibbs sampler allows with positive probability a move to any other configuration, we can, for any initial φ, within sub-step (i) of one iteration of the Gibbs sampler change with positive probability to configuration s. In sub-step (ii) we only generate from distributions which are mutually absolutely continuous with respect to π_s. Both arguments together suffice to show that n(φ, A) = 1.

As a practical diagnostic to indicate termination of the Gibbs sampling iterations we relied on a diagnostic proposed by Geweke (1992).
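
Geweke's diagnostic compares the mean of an early segment of a monitored scalar (for example k or α) with that of a late segment through an asymptotically standard normal score. The sketch below is a simplified version that estimates the long-run variances by batch means rather than by the spectral estimator of the original proposal, so it should be read as indicative only.

```python
import numpy as np

def geweke_score(chain, first=0.1, last=0.5, n_batches=20):
    """Simplified Geweke-style z-score for one scalar functional of the chain.
    Long-run variances are estimated crudely by batch means."""
    x = np.asarray(chain, dtype=float)
    a = x[: int(first * len(x))]
    b = x[-int(last * len(x)):]

    def lrv(seg):                      # batch-means estimate of var of the segment mean
        batches = np.array_split(seg, n_batches)
        means = np.array([m.mean() for m in batches])
        return means.var(ddof=1) / n_batches

    return (a.mean() - b.mean()) / np.sqrt(lrv(a) + lrv(b))

rng = np.random.default_rng(6)
chain = rng.normal(size=20000)          # stand-in for a monitored quantity, e.g. k or alpha
print(geweke_score(chain))              # values well inside (-2, 2) suggest stability
```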

4. REGRESSION FUNCTION ESTIMATION

We now make explicit the application of the discussed model to the regression estimation problem as laid out initially. Suppose z = z_{n+1} is partitioned into vectors x and y, so that z = (y, x), and that one primary goal is the assessment of the regression function E(y | x). The Bayesian approach focuses attention on the evaluation of the predictive expectation E(y | x, D).

Under the assumed structure, p(z | φ) is a locally weighted mixture of a small number of normals, and conditioning on x implies

$$ p(y \mid x, \phi) = s_0(x)\, p_0(y \mid x, \phi) + \sum_{j=1}^{k} s_j(x)\, f_j(y \mid x, \phi), $$

where p_0 is the conditional density of y given x based on the normalised base measure G_0, and f_j is the conditional normal density of y given x under the joint normal f(z | θ_j*). The corresponding k + 1 weights s_j(x) (j = 0, 1, ..., k) are functions of the marginal densities of x under the base prior G_0 and under the joint normals f(z | θ_j*), respectively.

Thus, conditional on a given configuration (𝒮, k), the corresponding regression function is E(y | x, φ) = Σ_{j=0}^{k} s_j(x) m_j(x), where m_j(x) is the mean of the jth component distribution for y given x. This is a weighted sum of the component regressions m_j(x) derived from the component normals. Note that m_j(x) is a linear function of x in each case, as the distributions involved are normal. They may be very different linear functions; the component distributions are, from the earlier development, distinct and may have quite different covariance structure, so that the 'slopes' of the regressions m_j(x) vary across j. The regression weights s_j(x) determine that a component m_j(x) will be more highly weighted in


predicting y when the value of the marginal density f_j(x | φ) is relatively large; thus, for x values 'close' to a particular component represented in φ, the regression function of that component dominates the predictions.

As discussed in § 3.6, the Gibbs sampling analysis leads to sampled parameters φ_r and hence to sampled conditional predictive distributions for y | x, φ_r. The Monte Carlo average of these approximates the required conditional predictive distribution. Similarly, the arithmetic mean of simulated conditional means E(y | x, φ_r) approximates the required regression 'estimate', and uncertainty about the regression can be assessed via similar estimates of conditional probabilities, variances and so forth.
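
For a single sampled φ the conditional mean has the closed form just described: partition each component's moments into y and x blocks, form the component regressions m_j(x), weight them by the component marginals of x, and combine. A minimal sketch with invented component values, again ignoring the G_0 term whose weight is negligible when α/n is small:

```python
import numpy as np
from scipy.stats import multivariate_normal

def conditional_mean(x, weights, mus, covs, y_dim=1):
    """E(y | x, phi) = sum_j s_j(x) m_j(x) for a normal mixture over z = (y, x).
    Each component is partitioned as mu = (mu_y, mu_x), with matching covariance blocks."""
    num, den = 0.0, 0.0
    for w, mu, cov in zip(weights, mus, covs):
        mu_y, mu_x = mu[:y_dim], mu[y_dim:]
        C_yx = cov[:y_dim, y_dim:]
        C_xx = cov[y_dim:, y_dim:]
        # Component regression m_j(x): linear in x because the component is normal
        m_j = mu_y + C_yx @ np.linalg.solve(C_xx, x - mu_x)
        # Weight s_j(x) proportional to the component's marginal density of x
        s_j = w * multivariate_normal.pdf(x, mean=mu_x, cov=C_xx)
        num += s_j * m_j
        den += s_j
    return num / den

# Invented two-component example with z = (y, x), x scalar
weights = [0.6, 0.4]
mus = [np.array([0.0, -1.0]), np.array([3.0, 2.0])]
covs = [np.array([[1.0, 0.8], [0.8, 1.0]]), np.array([[1.0, -0.5], [-0.5, 1.0]])]
xs = np.linspace(-3, 5, 9)
print([conditional_mean(np.array([xv]), weights, mus, covs)[0] for xv in xs])
```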

Fig. 1. Approximate predictive distribution p(z | D): (a) shows the marginal for (R, Z), (b) the marginal for (W, R). Triangles are the data points. Different orientations of the two apparent clusters indicate the importance of the model's allowance for distinct covariance matrices Σ_j.

Fig. 2. Predictive contours: the predictive distribution conditional on the state of the Markov chain simulation after 18 900 iterations. There are k = 3 mixture components.


5. ILLUSTRATION

Some illustration is provided in analyses of a data set of Chambers et al. (1983, pp. 346-7). The data are n = 111 daily observations on ozone concentration (Z), radiation (R), wind speed (W) and temperature (T). We will use z_i = (Z_i, R_i, W_i, T_i) to denote the ith observation (i = 1, ..., n). Observations with missing values were excluded, and ozone was transformed to a cube root scale.

We start by simulating 20 000 passes of the simulation scheme described in § 3 and

Fig. 3. (a) Estimated mean E(Z | R, D): a regression curve defined by conditional means. (b) Estimated mode of p(R | W, D): a modal regression trace; note the bifurcation of the modal trace around W ≈ 7. For comparison, conditional means are shown as a dotted line. The line E(R | W, D) can be misleading for multimodal or skewed conditional distributions.

Fig. 4. (a) Multiple draws from the posterior distribution on the smoothing curve, i.e. the lines show E(Z | R, φ_r), where the φ_r are simulated draws from the posterior. (b) Approximate 66% highest posterior density regions for the conditional distribution p(Z | R, D). Note the slight asymmetries in the bounds. The contours show the underlying density estimate p(R, Z | D) from which the highest posterior density regions were derived.


estimating predictive densities by taking Monte Carlo averages in batches of 200 iterations. The chain was considered to have practically converged after 20 000 passes based on the convergence diagnostic proposed by Geweke (1992).

Two bivariate margins of the estimated predictive distribution p(z | D) are shown in Fig. 1. Recall that the predictive distribution can be written as an average over conditional predictive distributions p(z | φ, D) = p(z | φ), where the average is taken over the posterior distribution on φ, and the conditional predictive distributions are all finite mixtures of multivariate normals. Figure 2 graphs bivariate contours of p(z | D, φ_{1800}), that is, bivariate

Fig. 5. Regression surfaces (a) Z = g(R, T) and (b) Z = h(W, R). The ordinates are the predictive conditional expectations, that is, g(R, T) = E(Z | R, T, D) in (a) and h(W, R) = E(Z | W, R, D) in (b), determined from p(z | D).


margins of the estimated density conditional on the particular parameter vector φ imputed after 1800 iterations of the Markov chain. Integrating over the posterior on φ mixes over the uncertain locations μ_j and covariance matrices Σ_j of the terms, as well as the number k of terms in the mixture. Once the predictive distribution p(z | D) is estimated, regression curves can be readily obtained as conditional expectations. Alternatively, Scott (1992, p. 233) makes an argument for using a trace of the conditional modes to summarise bivariate data. Such modal regression curves are no more difficult to derive than the conditional expectations. Figure 3(b) shows an example. Our model-based approach to density estimation provides appropriate measures of uncertainty for such fitted curves without resorting to asymptotic arguments. The posterior distribution p(φ | D) implies a posterior distribution on the whole regression curve. Some elements of this distribution are illustrated in Fig. 4. Higher-dimensional regression surfaces are no more complicated to compute than the univariate regression curve. The surfaces in Fig. 5 are three-dimensional smoothing surfaces.

The amount of smoothing is closely related to the distribution on k, the number of distinct pairs {μ_j*, Σ_j*}. The prior distribution on k is determined by the α parameter of the model. The particular hyperprior on α applied in this example was a Γ(1.0, 0.2) distribution. Figure 6(a) compares this prior distribution with a Monte Carlo approximation to the posterior p(α | D). To make an argument about the relative robustness with respect to the choice of hyperprior on α, we replicated the whole simulation with α ~ Γ(1.0, 0.05). The resulting posterior on α is shown in Fig. 6(b).

The priors on other hyperparameters were chosen as follows. For the Wishart distributions on Σ_i^{-1}, B^{-1} and S, the degrees of freedom parameters s, c and q were set at 10; the matrix parameters C and R were diagonal matrices with diagonal elements (1, 10000, 100, 10), indicating only the scale of the respective variables. The hyperprior on m was taken as a multivariate normal with the same diagonal covariance matrix as C, again with the elements only indicating the scale of the variables. The hyper-mean a was set to a = (3, 180, 80, 10).

The Gibbs sampling scheme was initialised by setting all hyperparameters equal to their

Fig. 6. Posterior distribution on α under two alternative prior choices. The histograms, on a relative frequency scale, show the posteriors p(α | D); the lines plot the gamma priors in each case. The model seems reasonably robust against the choice of either of these two priors.


expected values under the respective hyperpriors. The configuration vector was initialised by setting k = n, that is, putting each observation into a class by itself. The Gibbs sampler was started by drawing θ_i (i = 1, ..., n); therefore no initial values for the θ_i were required. The algorithm was implemented as a C program on a DEC-station 3000/500x and took 8 minutes of CPU time for the 20 000 Gibbs sampler iterations. Most time was spent on evaluating the predictive distributions and expectations on the 50 × 50 grids for Figs 1 and 5.

ACKNOWLEDGEMENT

This research was supported in part by the National Science Foundation. The authors are grateful to the editor and the referee for their comments which led to an improved exposition.

REFERENCES

ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to non-parametric problems. Ann. Statist. 2, 1152-74.

CHAMBERS, J. M., CLEVELAND, W. S., KLEINER, B. & TUKEY, P. A. (1983). Graphical Methods for Data Analysis. Boston: Duxbury.

CLEVELAND, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. J. Am. Statist. Assoc. 74, 829-36.

CRAVEN, P. & WAHBA, G. (1979). Smoothing noisy data with spline functions. Numer. Math. 24, 375-82.

ESCOBAR, M. D. (1994). Estimating normal means with a Dirichlet process prior. J. Am. Statist. Assoc. 89, 268-77.

ESCOBAR, M. D. & WEST, M. (1995). Bayesian density estimation and inference using mixtures. J. Am. Statist. Assoc. 90, 577-88.

FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-30.

FERGUSON, T. S. (1983). Bayesian density estimation by mixtures of normal distributions. In Recent Advances in Statistics, Ed. H. Rizvi and J. Rustagi, pp. 287-302. New York: Academic Press.

GASSER, T. & MÜLLER, H.-G. (1984). Estimating regression functions and their derivatives by the kernel method. Scand. J. Statist. 11, 171-85.

GELFAND, A. E. & SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.

GEWEKE, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics 4, Ed. J. O. Berger, J. M. Bernardo, A. P. Dawid and A. F. M. Smith, pp. 169-94. London: Oxford University Press.

MACEACHERN, S. N. (1994). Estimating normal means with a conjugate style Dirichlet process prior. Commun. Statist. B 23, 727-41.

SCOTT, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley.

SILVERMAN, B. W. (1985). Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.

SMITH, A. F. M. & ROBERTS, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with Discussion). J. R. Statist. Soc. B 55, 3-23.

TIERNEY, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22, 1701-28.

WEST, M. & CAO, G. (1993). Assessing mechanisms of neural synaptic activity. In Bayesian Statistics in Science and Technology: Case Studies, Ed. C. Gatsonis, J. Hodges, R. Kass and N. Singpurwalla, pp. 416-28. New York: Springer-Verlag.

WEST, M., MÜLLER, P. & ESCOBAR, M. D. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of Uncertainty: A Tribute to D. V. Lindley, Ed. A. F. M. Smith and P. Freeman, pp. 363-86. New York: Wiley.

[Received March 1993. Revised May 1995]