
A Criterion for Optimal Predictive Model Selection

MINH-NGOC TRAN

Department of Statistics and Applied Probability, National University of Singapore, Singapore

AIC is known in the literature as the best criterion for estimation of the regression function, and BIC as the best for identification: when there is no model uncertainty, BIC consistently identifies the true model. Neither, however, necessarily has optimal out-of-sample predictive ability. In the presence of model uncertainty, Bayesian model averaging (BMA) is the gold standard for making out-of-sample predictions and inferences, but it does not select a model because it averages over a given collection of models. BMA therefore lacks interpretability, a property that is often desirable in a statistical procedure. We propose a procedure for model selection that trades off prediction accuracy against interpretability. The procedure seeks a single model (hence interpretable) whose predictive distribution is closest to that of BMA, so that the selected model has better predictive performance than any other single model. The suggested procedure can be easily and efficiently implemented using a Markov chain Monte Carlo algorithm. We present a number of examples in a linear regression framework for both real and simulated data. These examples show that our procedure selects models with optimal predictive performance. In the extreme case where there is no model uncertainty, our procedure selects the same model as BIC.

Keywords Bayesian model averaging; Markov chain Monte Carlo; Model selection; Optimal predictive model.

Mathematics Subject Classification Primary 62J05; Secondary 62F15.

1. Introduction

A primary goal in statistics, as well as in many other scientific fields such as machine learning or data mining, is the performance of statistical algorithms on future data. Prediction accuracy depends on the model used to fit/interpret the data. Thus, model selection is an important step in the natural progression from data collection to estimation and model selection, and finally to prediction. Based on a dataset $D$ (and possibly on some prior information about models and model parameters), the model selection problem is to select a model, from a given collection of models $\mathcal{M} = \{M_1, \ldots, M_K\}$, that has certain good properties. When there is uncertainty about models and model parameters, one often selects the model with the highest posterior probability as the “best” one. Under some mild conditions, this model is identical to the model selected by BIC (Schwarz, 1978).

Received April 20, 2009; Accepted November 14, 2009.
Address correspondence to Minh-Ngoc Tran, Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, 117546, Singapore; E-mail: [email protected]


Hereafter, we refer to this model as the BIC model, or BIC for short. BIC is widely known as the best in terms of identification: if the model collection $\mathcal{M}$ contains the true model, BIC consistently identifies the true model as the sample size goes to infinity (Nishii, 1984; Shao, 1997). However, whether or not the true model exists is a controversial issue. For a real dataset, many people believe either that there is no true model or that the true model has an infinite number of parameters (Burnham and Anderson, 2002). Furthermore, Barbieri and Berger (2004) showed, in a framework of normal linear regression, that the BIC model is not necessarily the optimal predictive one in terms of predictive expected squared loss.

For selection among linear regression models, AIC (Akaike, 1973) and its variants, Cp (Mallows, 1973), GIC (Nishii, 1984; Rao and Wu, 1989), and FPE (Shibata, 1984), are optimal in terms of mean squared error loss (see, e.g., Shibata, 1983; Shao, 1997). This has often been misinterpreted as saying that AIC-type criteria are the best for making predictions. The reason is that the mean squared error loss, by its definition, is associated only with the fixed design points in the training data; therefore, AIC-type criteria do not necessarily have the optimal ability for making out-of-sample predictions and inferences. As we will see in Sec. 3, AIC often has poor out-of-sample predictive performance.

Bayesian Model Averaging

Bayesian model averaging (BMA) is now widely known as providing the best performance in terms of out-of-sample predictions; see Leamer (1978), Draper (1995), Raftery et al. (1997), Hoeting et al. (1999), and references therein. Let $p(M_k)$ be the prior probability of model $M_k \in \mathcal{M}$ and $p(\theta_k \mid M_k)$ be the prior distribution of the model parameter $\theta_k$ under model $M_k$. Then the posterior predictive distribution of a future observation $\Delta$ is given by

$$p(\Delta \mid D) = \sum_{k=1}^{K} p(\Delta \mid M_k, D)\, p(M_k \mid D). \tag{1}$$

In this expression, $p(M_k \mid D)$ is the posterior probability of model $M_k$,

$$p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{l=1}^{K} p(D \mid M_l)\, p(M_l)}, \tag{2}$$

where

$$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k \tag{3}$$

is the marginal likelihood of model $M_k$, and

$$p(\Delta \mid M_k, D) = \int p(\Delta \mid \theta_k, M_k, D)\, p(\theta_k \mid M_k, D)\, d\theta_k \tag{4}$$

is the posterior predictive distribution of $\Delta$ under model $M_k$. Expression (1) is a weighted average of the posterior predictive distributions of $\Delta$ under each model, the weights being the posterior model probabilities.
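To make the averaging in (1)–(2) concrete, here is a minimal R sketch (the function names are illustrative, not part of the article's software): it turns marginal likelihoods and prior model probabilities into posterior weights, and evaluates the BMA predictive density as the corresponding mixture.

```r
# Sketch of the BMA predictive density (1)-(2).
# 'marg_lik' : vector of marginal likelihoods p(D | M_k) from (3)
# 'prior'    : vector of prior model probabilities p(M_k)
# 'pred'     : list of per-model predictive densities p(. | M_k, D) from (4)
# All three are assumed to be supplied by the user.

bma_weights <- function(marg_lik, prior) {
  w <- marg_lik * prior
  w / sum(w)                        # posterior model probabilities, Eq. (2)
}

bma_predictive <- function(delta, pred, weights) {
  # Eq. (1): weighted average of per-model predictive densities at 'delta'
  sum(weights * vapply(pred, function(f) f(delta), numeric(1)))
}
```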


Many results have supported the use of BMA (Madigan and Raftery, 1994; Draper, 1995; Raftery et al., 1997; Hoeting et al., 1999; Clyde and George, 2004). BMA provides better predictive performance than any single model. Several methods to implement BMA have been proposed, among them the Occam's window method of Madigan and Raftery (1994) (see also Hoeting et al., 1999) and Markov chain Monte Carlo model composition (Smith and Roberts, 1993; Madigan and York, 1995; Raftery et al., 1997).

BMA vs Model Selection

Although BMA has optimal predictive performance, its main drawback is non-interpretability. Because BMA averages over all models in a given collection, it does not produce an easily interpretable model. For various reasons, the data analyst often prefers a simple (hence interpretable) model. For example, given a large number of potential covariates and a collection of linear models associated with them, one often would like to select a smaller subset of covariates and the corresponding linear sub-model achieving some kind of optimal predictive performance. This would reduce the time and cost of collecting data, and would also produce a model that makes clear which covariates affect the response and how. This drawback of BMA is somewhat similar to that of ridge regression (Hoerl and Kennard, 1970). Although ridge regression produces a stable estimate of the coefficients and often has optimal mean squared error, it does not give an interpretable model. In contrast to ridge regression, the lasso (Tibshirani, 1996) shrinks some coefficients to exactly 0, so it produces an easily interpretable model. That is why the lasso is often preferred to ridge regression.

The Motivation

BMA is the gold standard for making out-of-sample predictions and inferences, but it does not produce a model (hereafter, by “a model” we mean a single model in the model set $\mathcal{M}$). Our idea is to choose a model that is closest to BMA in some sense. We use an information distance function to measure the distance between the posterior predictive distribution under a candidate model and that of BMA, and seek the model with the smallest distance. The chosen model then has better predictive performance than any other single model; besides, being a single model, it is interpretable.

We would like to make it clear that we are assuming that interpretability constraints preclude the use of BMA. If we can afford to consider all given models and if predictive performance is the primary goal, BMA is ideal for the purpose of making out-of-sample predictions and inferences.

The article is organized as follows. The suggested procedure and its implementation are presented in detail in Sec. 2. Our method will be referred to as the Procedure for Optimal Predictive Model Selection (POPMOS). In Sec. 3, POPMOS is applied to two real datasets (U.S. crime data and percent body fat data) in a linear regression framework. Simulated examples are also conducted to demonstrate further how POPMOS works. In that section we also introduce an indicator, called the model uncertainty indicator (MUI), to measure model uncertainty. Section 4 contains conclusions and outlook. Some notes on the software for implementing POPMOS are relegated to the Appendix.


2. The Procedure for Optimal Predictive Model Selection (POPMOS)

2.1. The POPMOS

Setup of POPMOS

Let $\int d(p, q)\, dx$ be a distance function that measures the distance (or pseudo-distance) between two density functions $p$ and $q$ (for simplicity, we assume that Lebesgue measure is being used; the following procedure can be constructed similarly in the general case). Let $\Delta$ be a future observation. We define the distance between the BMA posterior predictive distribution $p(\Delta \mid D)$ and the posterior predictive distribution $p(\Delta \mid D, M_k)$ under model $M_k$ by

$$d(M_k) \equiv d(M_k; D, d(\cdot, \cdot)) = \int d\bigl(p(\Delta \mid D),\, p(\Delta \mid D, M_k)\bigr)\, d\Delta. \tag{5}$$

As discussed earlier, in statistics the BMA posterior predictive distribution $p(\Delta \mid D)$ has been shown to have optimal predictive ability. In other scientific fields, such as machine learning or information theory, the predictive distribution $p(\Delta \mid D)$ has also been recognized as the standard for making predictions (see, e.g., Barron et al., 1993; Hutter, 2005). Therefore, if a single model is preferred, it is natural to seek a model $M_k$ whose predictive distribution $p(\Delta \mid D, M_k)$ is closest to $p(\Delta \mid D)$. Formally, the optimal predictive (OP) model (among a given collection of models $\mathcal{M}$) is determined as

$$\hat{M}_{\mathrm{OP}} = \operatorname*{arg\,min}_{M_k \in \mathcal{M}} d(M_k). \tag{6}$$

This characterization of POPMOS is general enough to apply in many frameworks where the collection of single models comprises linear regression models, generalized linear models, Cox models, graphical models, etc.

For the distance function, we will consider in this article the Kullback-Leibler (KL) distance, where $d(p, q) = p \log(p/q)$. The KL distance is widely used in statistics and information theory to measure the distance between two density functions; it was used to derive two well-known selection rules, AIC (Akaike, 1973) and MDL (Rissanen, 1978). Many other functions can be used as well, for example the Hellinger distance, where $d(p, q) = (\sqrt{p} - \sqrt{q})^2$, and the $f$-divergence, where $d(p, q) = f(p/q)\, q$ for a convex function $f$ such that $f(1) = 0$. Using the KL distance, (5) becomes

$$d_{\mathrm{KL}}(M_k) = \mathbb{E}\left[\log \frac{p(\Delta \mid D)}{p(\Delta \mid D, M_k)}\right], \tag{7}$$

where the expectation is with respect to $p(\Delta \mid D)$.

Related Work

Hutter (2008) proposed a principle for predictive hypothesis testing that is similar in spirit to POPMOS. Suppose that data $D$ come from a distribution $p(\cdot \mid \theta)$, where $\theta$ is the parameter with prior density $p(\theta)$. Let $H_\theta$ be the hypothesis that a future $\Delta$ comes from the hypothesized predictive distribution $p(\Delta \mid \theta)$. Hutter (2008) suggested identifying the best predictive hypothesis among a class $\mathcal{H}$ of hypotheses $H_\theta$ by minimizing a distance between $p(\Delta \mid D) = \int p(\Delta \mid \theta)\, p(\theta \mid D)\, d\theta$ and $p(\Delta \mid \theta)$. Hutter's principle mainly focuses on hypothesis testing, while our procedure is designed for model selection.



2.2. The Implementation

The implementation of BMA (and thus of POPMOS) is a difficult problem. Fortunately, by virtue of recent computational advances and methodologies such as Markov chain Monte Carlo (MCMC) methods, the computational burden of the integrals in (5) is greatly reduced. We discuss in this subsection approaches to implementing POPMOS.

Occam's Window Principle

The number of competing models under consideration is often huge and precludes the calculation of all distances $d(M_k)$. It is natural that if a model gets very little support from the data (i.e., its posterior model probability $p(M_k \mid D)$ is very small), it should be excluded from consideration. This is the Occam's window idea of Madigan and Raftery (1994). More formally, we only consider models belonging to

$$\mathcal{A} = \left\{ M_k \in \mathcal{M} : \frac{\max_{1 \le l \le K} p(M_l \mid D)}{p(M_k \mid D)} \le C \right\}, \tag{8}$$

where the cutoff parameter $C$ is chosen by the data analyst, $C = 20$ being often used (Madigan and Raftery, 1994; Raftery et al., 1997). Then the posterior predictive distribution $p(\Delta \mid D)$ in (1) and (5) is approximated by

$$p(\Delta \mid D) = \sum_{M_k \in \mathcal{A}} p(\Delta \mid M_k, D)\, p(M_k \mid D), \tag{9}$$

and (6) reduces to

$$\hat{M}_{\mathrm{OP}} = \operatorname*{arg\,min}_{M_k \in \mathcal{A}} d(M_k). \tag{10}$$

In most cases, the number of models in $\mathcal{A}$ is greatly reduced, to fewer than 50 and often fewer than 25. Note that once $\mathcal{A}$ is determined, the posterior model probabilities must be renormalized so that $\sum_{M_k \in \mathcal{A}} p(M_k \mid D) = 1$. Some notes on implementing Occam's window are relegated to the Appendix.
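A minimal R sketch of this filtering step, assuming the posterior model probabilities have already been computed (the names are illustrative):

```r
# Sketch of Occam's window (8) with renormalization.
# 'post' is a named vector of posterior model probabilities p(M_k | D).
occams_window <- function(post, C = 20) {
  keep <- max(post) / post <= C     # Eq. (8): drop poorly supported models
  w <- post[keep]
  w / sum(w)                        # renormalize over the retained set A
}

# Example: with the default cutoff C = 20, models whose posterior
# probability is less than 1/20 of the best model's are discarded.
post <- c(M1 = .40, M2 = .35, M3 = .20, M4 = .04, M5 = .01)
occams_window(post)                 # M5 is dropped (.40/.01 = 40 > 20)
```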

MCMC for Distance Calculation

MCMC methods provide a very efficient way to estimate high-dimensional integrals. A good reference book on MCMC in practice is Gilks et al. (1996). Hereafter, we discuss the computation of the integral (5) when the KL distance is used, i.e., when the distance is given by (7). The Metropolis-Hastings algorithm to estimate $d_{\mathrm{KL}}(M_k)$ is as follows:

1. Initialize a Markov chain at $\Delta_0$ and set $t \leftarrow 0$.
2. Sample a candidate point $\Delta$ from a multivariate normal distribution with mean $\Delta_t$ and covariance matrix $\sigma^2 I_p$, where $p$ is the dimension of $\Delta$ ($\sigma = .5$ is a popular choice).
3. Sample a point $u$ from a uniform distribution $U(0, 1)$.
4. If $u \le \min\left\{1, \frac{p(\Delta \mid D)}{p(\Delta_t \mid D)}\right\}$, set $\Delta_{t+1} \leftarrow \Delta$; otherwise set $\Delta_{t+1} \leftarrow \Delta_t$.
5. Set $t \leftarrow t + 1$ and go back to step 2.



Let $T$ be the length of the chain $\{\Delta_t\}$ and $T_0$ the burn-in number. Then the expectation (7) is approximated by

$$d_{\mathrm{KL}}(M_k) = \mathbb{E}\left[\log \frac{p(\Delta \mid D)}{p(\Delta \mid D, M_k)}\right] \approx \frac{1}{T - T_0} \sum_{t = T_0 + 1}^{T} \log \frac{p(\Delta_t \mid D)}{p(\Delta_t \mid D, M_k)}.$$

In order to get an accurate approximation, our experience shows that several chains with overdispersed starting points should be sampled, so that the chains can traverse the whole support of the target distribution.
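The following R sketch implements steps 1–5 and the averaged log-ratio above for a single chain; the article recommends several overdispersed chains, and the two log-density functions (with illustrative names) are assumed to be supplied:

```r
# Sketch of the Metropolis-Hastings estimate of d_KL(M_k) in (7).
# 'log_p_bma'  : function returning log p(delta | D)      (BMA predictive)
# 'log_p_model': function returning log p(delta | D, M_k) (model's predictive)
estimate_dKL <- function(log_p_bma, log_p_model, delta0, sigma = .5,
                         n_iter = 10000, burn = 2000) {
  delta <- delta0
  p <- length(delta0)
  log_ratio <- numeric(n_iter)
  for (t in 1:n_iter) {
    cand <- delta + rnorm(p, sd = sigma)       # step 2: N(delta_t, sigma^2 I_p)
    if (log(runif(1)) <= log_p_bma(cand) - log_p_bma(delta))
      delta <- cand                            # steps 3-4: accept or reject
    log_ratio[t] <- log_p_bma(delta) - log_p_model(delta)
  }
  mean(log_ratio[(burn + 1):n_iter])           # average after burn-in
}
```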

Calculating Integrals (3) and (4)

What remains in implementing POPMOS is to compute the integrals (3) and (4). In some special cases, such as linear regression with conjugate priors (Raftery et al., 1997) or discrete graphical models (Madigan and York, 1995), the integrals (3) and (4) have closed forms. In general, the Laplace approximation is used to estimate $p(D \mid M_k)$ (Schwarz, 1978; Tierney and Kadane, 1986; Raftery, 1996), and $p(\Delta \mid D, M_k)$ is often approximated by $p(\Delta \mid D, \hat{\theta}_k, M_k)$, where $\hat{\theta}_k$ is the maximum likelihood estimate of $\theta_k$ (Taplin, 1993; Draper, 1995). The relative approximation error is $O(n^{-1})$ (Kass and Vaidyanathan, 1992).

3. Examples

In this section, we demonstrate the suggested procedure for linear regression models. We use the Bayesian framework of Raftery et al. (1997). Each model $M$ under consideration is of the form

$$Y = \beta_0 + \beta_1 X_{i_1} + \cdots + \beta_k X_{i_k} + \epsilon, \quad \epsilon \sim N(0, \sigma^2),$$

where $\{X_{i_1}, \ldots, X_{i_k}\}$ is a subset of the set $\{X_1, \ldots, X_p\}$ of all potential covariates. Let $Y$ and $X$ be the response vector and the corresponding design matrix, respectively. It is reasonable to assign a uniform prior to the possible combinations of covariates, i.e., the prior information is “objective” across models (Berger, 1985, p. 151). For the model parameters, we assume the priors

$$\beta \mid \sigma^2 \sim N_{k+1}(\mu, \sigma^2 V), \qquad \frac{\nu\lambda}{\sigma^2} \sim \chi^2_\nu.$$

The hyperparameters $\mu$, $V$, $\nu$, $\lambda$ are chosen as follows (see Raftery et al., 1997, for the details):

$$\nu = 2.58, \quad \lambda = .28, \quad \mu = (\hat{\beta}_0, 0, \ldots, 0), \quad V = \mathrm{diag}\left(s_Y^2, \frac{\phi^2}{s_{i_1}^2}, \ldots, \frac{\phi^2}{s_{i_k}^2}\right),$$


where $\hat{\beta}_0$ is the OLS estimate of $\beta_0$; $s_Y^2, s_{i_1}^2, \ldots, s_{i_k}^2$ are the sample variances of $Y, X_{i_1}, \ldots, X_{i_k}$, respectively; and $\phi = 2.85$. Typically, in our experience, results are relatively insensitive to changes in the values of the hyperparameters.

Then the likelihood (3) under model $M$ is

$$p(D \mid M) = \frac{\Gamma\!\left(\frac{\nu+n}{2}\right)(\nu\lambda)^{\nu/2}\,\bigl[\nu\lambda + (y - X\mu)^\top (I + XVX^\top)^{-1}(y - X\mu)\bigr]^{-(\nu+n)/2}}{\pi^{n/2}\,\Gamma(\nu/2)\,|I + XVX^\top|^{1/2}}. \tag{11}$$

The predictive distribution of a future observation $\Delta = (x, y)$, where $x = (1, x_{i_1}, \ldots, x_{i_k})$ is a row vector (in the displays below, $y$ denotes the future response in the terms $y^2$ and $x^\top y$, and the training response vector elsewhere), writes as

$$p(\Delta \mid D, M) = \frac{\Gamma\!\left(\frac{\nu+n+1}{2}\right)}{\sqrt{\pi}\,\Gamma\!\left(\frac{\nu+n}{2}\right)} \cdot \frac{1}{\bigl(1 + x(X^\top X + V^{-1})^{-1}x^\top\bigr)^{1/2}} \cdot \frac{A^{(\nu+n)/2}}{B^{(\nu+n+1)/2}}, \tag{12}$$

where

$$A = \nu\lambda + \|y\|^2 + \mu^\top V^{-1}\mu - (X^\top y + V^{-1}\mu)^\top (X^\top X + V^{-1})^{-1} (X^\top y + V^{-1}\mu)$$

and

$$B = \nu\lambda + \|y\|^2 + y^2 + \mu^\top V^{-1}\mu - (x^\top y + X^\top y + V^{-1}\mu)^\top (x^\top x + X^\top X + V^{-1})^{-1} (x^\top y + X^\top y + V^{-1}\mu).$$
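For numerical stability one would evaluate (11) on the log scale. A minimal R sketch, assuming the hyperparameters defined above (this is illustrative, not the article's software):

```r
# Sketch of the log of the closed-form marginal likelihood (11).
# X: n x (k+1) design matrix (first column of 1s); y: response vector;
# mu, V, nu, lambda: the prior hyperparameters defined in the text.
log_marg_lik <- function(y, X, mu, V, nu, lambda) {
  n <- length(y)
  S <- diag(n) + X %*% V %*% t(X)              # I + X V X'
  r <- y - X %*% mu
  quad <- drop(t(r) %*% solve(S, r))           # (y - X mu)' S^{-1} (y - X mu)
  lgamma((nu + n) / 2) - lgamma(nu / 2) +
    (nu / 2) * log(nu * lambda) - (n / 2) * log(pi) -
    0.5 * as.numeric(determinant(S)$modulus) - # 0.5 * log |I + X V X'|
    ((nu + n) / 2) * log(nu * lambda + quad)
}
```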

Measures of Predictive Ability

As mentioned earlier, a primary goal of statistical analysis is to make predictions and inferences about future data. Many authors argue that a model is more impressive/preferable if it assigns higher probabilities to the actual data. Thus, one measure of predictive ability is the partial predictive score (PPS) (Good, 1952; Geisser, 1980; Hoeting et al., 1999). Suppose that the data are split into two parts, the training set $D^T$ and the prediction set $D^P$. Then the partial predictive score of model $M$ is defined as

$$\mathrm{PPS}(M) = -\sum_{\Delta \in D^P} \log p(\Delta \mid M, D^T). \tag{13}$$

The smaller the PPS, the better the predictive performance. To ease comparison, an anonymous referee suggested normalizing the PPS by the cardinality of $D^P$.

Another measure of predictive ability is the predictive coverage (PC). Let $m$ and $s$ be the mean and the standard deviation estimated by MCMC; then a 90% prediction interval for a future observation is approximated by the interval $m \pm 1.645 s$. The PC is defined as the proportion of observations in $D^P$ that fall in the 90% prediction interval. In the following examples, we use these two measures, PPS and PC, to assess the predictive performance of the selected models.
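Both measures are straightforward to compute once the per-observation predictive quantities are available; a minimal R sketch with illustrative names:

```r
# Sketch of the two predictive-ability measures.
# 'log_pred': vector of log p(Delta | M, D^T), one entry per observation
#             in the prediction set D^P.
normalized_pps <- function(log_pred) {
  -mean(log_pred)                              # Eq. (13) divided by |D^P|
}

# 'y_new': observed responses in D^P; 'm', 's': their MCMC-estimated
# predictive means and standard deviations.
predictive_coverage <- function(y_new, m, s) {
  inside <- abs(y_new - m) <= 1.645 * s        # 90% interval m +/- 1.645 s
  100 * mean(inside)                           # PC, in percent
}
```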


Model Uncertainty Indicator

We now introduce an indicator to measure model uncertainty, which we call the model uncertainty indicator (MUI). It is defined as the ratio of the second highest posterior model probability to the highest. More formally, let $M_0 = \operatorname*{arg\,max}_{M \in \mathcal{A}} p(M \mid D)$; then

$$\mathrm{MUI} = \frac{\max_{M \in \mathcal{A} \setminus \{M_0\}} p(M \mid D)}{p(M_0 \mid D)} \le 1. \tag{14}$$

In particular, the larger the MUI, the more model uncertainty there is. Our experience to date shows that when the MUI is small enough (often, MUI $\le$ .5), the OP model coincides with the BIC model; otherwise, the OP model is often different from the BIC model and has better predictive ability.
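Computing the MUI from the (renormalized) posterior probabilities of the models in $\mathcal{A}$ is a one-liner; an R sketch:

```r
# Sketch of the model uncertainty indicator (14).
# 'post' is the vector of posterior probabilities of the models in A.
mui <- function(post) {
  top2 <- sort(post, decreasing = TRUE)[1:2]
  top2[2] / top2[1]                  # second highest over highest, always <= 1
}
```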

Example 3.1: Crime data. Criminal behavior has been argued to be strongly related to the costs and benefits of criminal activity and to other legitimate opportunities. Ehrlich (1973) used data from 47 U.S. states in 1960 to test this argument. The dependent variable was the crime rate. The costs of crime were measured by the probability of imprisonment and the average time served in prison. The benefits were related to wealth and income inequality in the community. The investigation also included other variables such as the sex ratio and the percentage of young males. In summary, 15 potential covariates (Table 1) were considered.

This benchmark dataset has been analyzed by many authors. Previous diagnostic checking (see, e.g., Draper and Smith, 1981) did not show any violation of the linearity assumption. Ehrlich (1973) used the stepwise method to select significant variables.

Table 1
Crime data: overall posterior probabilities and selected models

Number  Covariate                        P(βj ≠ 0 | D)   AIC  BIC  OP  MP
1       % of males age 14–24                  .78          x    x   x   x
2       Indicator for southern state          .18
3       Mean years of schooling               .97          x    x   x   x
4       Police expenditure in 1960            .72          x    x   x   x
5       Police expenditure in 1959            .50                    x   x
6       Labor force participation rate        .08
7       No. males per 1000 females            .08
8       State population                      .24
9       No. nonwhites per 1000 people         .61          x    x   x   x
10      Unemployment rate age 14–24           .11
11      Unemployment rate age 35–39           .45          x    x
12      Wealth                                .31          x
13      Income inequality                    1.00          x    x   x   x
14      Probability of imprisonment           .82          x    x   x   x
15      Ave. time in state prisons            .23          x

MUI = .71 shows that there is high model uncertainty.


However, Raftery et al. (1997) reported evidence against Ehrlich's results and suggested using posterior probabilities for variable selection. We now use this dataset to demonstrate POPMOS and compare it to other model selection rules.

Table 1 summarizes the experimental results using the whole dataset. The models selected by the different methods are listed in the corresponding columns. The third column is the overall posterior probability that the $j$-th covariate is in a model, i.e., $P(\beta_j \ne 0 \mid D)$, calculated by summing the posterior probabilities of the models that contain the $j$-th covariate, $j = 1, 2, \ldots, 15$. POPMOS selected the predictors with the highest posterior probabilities ($\ge$ .5). Raftery et al. recommended (from an empirical analysis) using posterior probabilities rather than p-values for variable selection. The last column presents the so-called median probability (MP) model introduced by Barbieri and Berger (2004). The MP model is defined as the model consisting of those covariates whose overall posterior probability $P(\beta_j \ne 0 \mid D)$ is at least .5. In the framework of normal linear regression and under some conditions, Barbieri and Berger showed that the MP model has optimal predictive performance in terms of predictive expected squared loss (see Barbieri and Berger, 2004, for the full definition). As shown in Table 1, the OP model is the same as the MP model.

Table 1 also shows the models selected by AIC and BIC (which can be found by exhaustive search using the branch-and-bound algorithm; Miller, 2002). The three methods, AIC, BIC, and POPMOS, produced three different models. This is not a surprise, because these criteria have different goals. As we might expect, the AIC model is the “biggest” among the selected models: it contains 9 covariates versus 7 covariates for OP and MP. As we will see next, AIC models often have poor predictive performance.

We now use the crime data to assess the predictive ability of the selection rules. In the experiment, the dataset was randomly split into two parts: one with 24 observations was used as the training set, and the other with 23 observations was used as the prediction set. Other splits can be adopted. Table 2 shows the normalized PPS and PC of the models chosen by the different methods. With $C = 20$, the model set $\mathcal{A}$ contains 29 models. The model uncertainty indicator MUI = .61 shows that there is moderate model uncertainty. As shown, the OP model has better predictive performance than the AIC and BIC models. AIC has poor predictive performance.


Table 2
Crime data: assessment of predictive ability

Method  Model             Normalized PPS   PC (%)
AIC     1 3 4 5 9 13 14        .18          82.61
BIC     3 4 9 13 14            .16          82.61
MP      1 3 4 5 9 13           .12          86.96
OP      1 3 4 5 9 13           .12          86.96
BMA     All                    .06          91.30

MUI = .61 shows that there is moderate model uncertainty.


Note that the models selected using half of the data are slightly different from those selected using the full data (however, both contain the most important covariates). This is not surprising, given the uncertainty and the small size of the dataset. If we had a large enough dataset, using either the full data or half of it would lead to the same results. The selected models summarized in Table 2 are used only to examine the methods; they are not the final chosen models.

Example 3.2: Percent body fat data. Percentage of body fat is an important measure of health, which can be accurately estimated by underwater weighing techniques (Bailey, 1994). These techniques often require specialized equipment and are sometimes inconvenient, so fitting percent body fat to simple body measurements is a convenient way to predict body fat. Johnson (1996) introduced a dataset in which percent body fat and 13 simple body measurements (weight, height, abdomen circumference, etc.) were recorded for 252 men (see Table 3 for a summary of the data). This dataset was also carefully analyzed by Hoeting et al. (1999). Previous diagnostic checking (see, e.g., Hoeting et al., 1999) showed that it is reasonable to assume a linear regression model. Following Hoeting et al. (1999), after omitting an outlier (observation 42), we split the remaining dataset into two parts: the first 142 observations were used as the training set, and the remaining 109 observations were used as the prediction set. Using the cutoff $C = 20$, we obtained a model set $\mathcal{A}$ of 23 candidate models. The MUI = .56 shows the presence of model uncertainty. Table 4 presents the selected models and their normalized PPS and PC. As shown, POPMOS does a good job: it selects a model that has better predictive ability than the others.

As mentioned earlier, the smaller the (normalized) PPS, the more preferable the model. The normalized PPSs summarized in Tables 2 and 4 show that fitting a linear model to the data of Example 3.1 is likely preferable to using a linear model for the dataset of Example 3.2. In general, the normalized PPS can be used as a measure of model preferability across examples.

Table 3
Body fat example: summarized data

Predictor number  Predictor                        Mean     s.d.
1                 Age (years)                      44.89    12.63
2                 Weight (pounds)                 178.82    29.40
3                 Height (inches)                  70.31     2.61
4                 Neck circumference (cm)          37.99     2.43
5                 Chest circumference (cm)        100.80     8.44
6                 Abdomen circumference (cm)       92.51    10.78
7                 Hip circumference (cm)           99.84     7.11
8                 Thigh circumference (cm)         59.36     5.21
9                 Knee circumference (cm)          38.57     2.40
10                Ankle circumference (cm)         23.10     1.70
11                Extended biceps circumference    32.27     3.02
12                Forearm circumference (cm)       28.66     2.02
13                Wrist circumference (cm)         18.23      .93


Example 3.3: Simulation. In this example, we consider a simulation study. A dataset of 50 observations was simulated as follows. We first generated 50 observations of 8 covariates $X_1, \ldots, X_8$, independently and identically distributed as $N(0, 1)$. The response was then generated from the model

$$y = X_1 + X_3 + X_5 + \epsilon,$$

where $\epsilon \sim N(0, \sigma^2)$ with $\sigma = .5$. Note that the noise level $\sigma = .5$ is small, so we may expect little or no model uncertainty. We consider linear models to fit the simulated dataset. Table 5 presents the selected models. All methods select the same model as the true one, {1, 3, 5}. This is not a surprise, because the MUI = .11 shows that there is nearly no model uncertainty.
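For reference, this data-generating mechanism can be reproduced with a few lines of R (the seed is arbitrary):

```r
# Sketch of the data-generating mechanism of Example 3.3.
set.seed(1)                                    # arbitrary seed
n <- 50; p <- 8; sigma <- .5
X <- matrix(rnorm(n * p), n, p)                # X1, ..., X8 iid N(0, 1)
y <- X[, 1] + X[, 3] + X[, 5] + rnorm(n, sd = sigma)   # true model {1, 3, 5}
```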

For the second simulated example, we created a dataset of 50 observations in the same manner as above, but now with $\sigma = 3$, to form the training set; 50 additional observations were generated in the same manner to form the prediction set. Table 6 presents the selected models and their normalized PPS and PC. The MUI = .68 shows that there is model uncertainty. The OP model, which is the same as the MP model, shows better predictive performance than the AIC and BIC models.

We repeated the simulation study many times with various values of $\sigma$ in order to obtain various MUIs. POPMOS often selected the same model as the MP model. (Note that the MP model is only defined in the linear regression framework, while POPMOS extends directly to other frameworks such as generalized linear models.) Very often, when the MUI was smaller than .5, BIC, MP, and POPMOS selected the same model as the true one. AIC often produced overfitted models.

Table 4
Percent body fat data: assessment of predictive ability

Method  Model           Normalized PPS   PC (%)
AIC     1 2 6 10 12 13      2.954        84.40
BIC     2 6 12 13           2.892        86.24
MP      2 6 12 13           2.892        86.24
OP      2 4 6 12 13         2.887        88.07
BMA     All                 2.867        90.83

MUI = .56 shows that there is moderate model uncertainty.

Table 5
Simulated data with a low noise level

Method  Model
AIC     1, 3, 5
BIC     1, 3, 5
MP      1, 3, 5
OP      1, 3, 5

MUI = .11: there is no model uncertainty.

Table 6
Simulated data with a high noise level

Method  Model        Normalized PPS   PC (%)
AIC     1, 2, 4, 5       2.80           82
BIC     1, 5             2.97           78
MP      1, 4, 5          2.51           92
OP      1, 4, 5          2.51           92
BMA     All              2.43           94

MUI = .68: there is model uncertainty.


4. Conclusions and Outlook

We introduced a procedure for optimal predictive model selection (POPMOS). POPMOS is a general-purpose criterion that can be applied directly in many frameworks. Its implementation was discussed, and the procedure was demonstrated by a number of examples in a linear regression framework. Surprisingly, there is very little literature on optimal out-of-sample predictive model selection. The newcomer appears to be a promising criterion for model selection for the purpose of making future predictions and inferences.

There are many open questions for research; we list two of them here. The first is the choice of prior distributions. This open question relates not only to the newcomer but to Bayesian analysis in general. The second is the choice of the cutoff parameter $C$ in Occam's window. In our opinion, a $C$ between 10 and 20 is acceptable, but the question deserves serious investigation.

Appendix

Software for implementing POPMOS is written in R. To search for models in the set $\mathcal{A}$, we use the function MC3.REG in the BMA package as a helper. MC3.REG performs MCMC model composition for linear regression; it was written by J. Hoeting with the assistance of G. Gadbury and is available free of charge. Our software is freely available upon contacting the author.

Acknowledgment

The author would like to thank an anonymous referee for his/her careful reading and helpful comments. The author is very grateful to Marcus Hutter for sharing his ideas and for discussion.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proc. 2nd Int. Symp. Information Theory. Budapest: Akademiai Kiado, pp. 267–281.
Bailey, C. (1994). Smart Exercise: Burning Fat, Getting Fit. Boston: Houghton-Mifflin.
Barbieri, M. M., Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32(3):870–897.



Barron, A. R., Clarke, B. S., Haussler, D. (1993). Information bounds for the risk of Bayesian predictions and the redundancy of universal codes. Proc. IEEE Int. Symp. Inform. Theor. (ISIT), p. 54.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag.
Burnham, K. P., Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer.
Clyde, M., George, E. I. (2004). Model uncertainty. Statist. Sci. 19(1):81–94.
Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). J. Roy. Statist. Soc. B 57(1):45–97.
Draper, N. R., Smith, H. (1981). Applied Regression Analysis. New York: John Wiley.
Ehrlich, I. (1973). Participation in illegitimate activities: a theoretical and empirical investigation. J. Polit. Econ. 81:521–565.
Geisser, S. (1980). Discussion of “Sampling and Bayes' inference in scientific modelling and robustness” by G. E. P. Box. J. Roy. Statist. Soc. A 143:416–417.
Gilks, W. R., Richardson, S., Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. London: Chapman & Hall.
Good, I. J. (1952). Rational decisions. J. Roy. Statist. Soc. B 14:107–114.
Hoerl, A. E., Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1):55–67.
Hoeting, J. A., Madigan, D., Raftery, A. E., Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statist. Sci. 14(4):382–417.
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Berlin: Springer.
Hutter, M. (2008). Predictive hypothesis identification. arXiv:0809.1270v1 [cs.LG].
Johnson, R. W. (1996). Fitting percentage of body fat to simple body measurements. J. Statist. Educat. 4.
Kass, R. E., Vaidyanathan, S. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. J. Roy. Statist. Soc. B 54:129–144.
Leamer, E. E. (1978). Specification Searches. New York: Wiley.
Madigan, D., Raftery, A. E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Amer. Statist. Assoc. 89.
Madigan, D., York, J. (1995). Bayesian graphical models for discrete data. Int. Statist. Rev. 63:215–232.
Mallows, C. L. (1973). Some comments on Cp. Technometrics 15:661–675.
Miller, A. (2002). Subset Selection in Regression. London: Chapman & Hall/CRC.
Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. Ann. Statist. 12:758–765.
Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika 83(2):251–266.
Raftery, A. E., Madigan, D., Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc. 92(437):179–191.
Rao, C. R., Wu, Y. (1989). A strongly consistent procedure for model selection in a regression problem. Biometrika 76:369–374.
Rissanen, J. J. (1978). Modeling by shortest data description. Automatica 14(5):465–471.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6(2):461–464.
Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica 7:221–264.
Shibata, R. (1983). Asymptotic mean efficiency of a selection of regression variables. Ann. Inst. Statist. Math. 35:415–423.
Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71:43–49.


Smith, A. F. M., Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B 55:3–24.
Taplin, R. H. (1993). Robust likelihood calculation for time series. J. Roy. Statist. Soc. B 55:829–836.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. B 58(1):267–288.
Tierney, L., Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. J. Amer. Statist. Assoc. 81:82–86.
