# [Wiley Series in Probability and Statistics] Analysis of Ordinal Categorical Data (Agresti/Analysis) || Bayesian Inference for Ordinal Response Data

Post on 08-Dec-2016

212 views

Embed Size (px)

TRANSCRIPT

C H A P T E R 11

Bayesian Inference for Ordinal Response Data

This book has taken the traditional, so-called frequentist, approach to statistical inference. In this approach, probability statements apply to possible values for the data, given the parameter values, rather than to possible values for the parameters, given the data. Recent years have seen increasing popularity of an alternative approach, the Bayesian approach, which has probability distributions for parameters as well as for data and which assumes a prior distribution for the parameters which may reflect prior beliefs or information about the parameter values. This information is combined with the information that the data provide through the likelihood function to generate a posterior distribution for the parameters.

With the Bayesian approach, the data analyst makes probability statements about parameters, given the observed data, instead of probability statements about hypo-thetical data given particular parameter values. This results in inferential statements such as "Having observed these data, the probability is 0.03 that the population cumulative log odds ratio is negative" instead of "If the population cumulative log odds ratio were 0, then in a large number of randomized studies of this type, the proportion of times that we would expect to observe results more positive than the ones we observed is 0.03." The Bayesian approach seems more natural to many researchers, but it requires adding further structure to the model by the choice of the prior distribution.

In this chapter we apply the Bayesian paradigm to analyses of ordered cate-gorical response data. In Section 11.2 we focus on Bayesian estimation of cell probabilities for a multinomial distribution with ordered categories or for a con-tingency table in which at least one variable is ordinal. In Section 11.3 we focus on Bayesian regression modeling of ordinal response variables, with emphasis on cumulative link models. In Section 11.4 we focus on association modeling and in Section 11.5 cite extensions to multivariate response models. Some general comparisons of Bayesian and frequentist approaches to analyzing ordinal data are

Analysis of Ordinal Categorical Data, Second Edition, By Alan Agresti Copyright 2010 John Wiley & Sons, Inc.

315

316 BAYESIAN INFERENCE FOR ORDINAL RESPONSE DATA

made in Section 11.6. First, though, we summarize the basic ideas that underlie the Bayesian approach.

11.1 BAYESIAN APPROACH TO STATISTICAL INFERENCE

As mentioned above, the Bayesian approach treats a parameter as a random variable having a probability distribution rather than as a fixed (but unknown) number. Statistical inference relies on the posterior distribution of a parameter, given the data, rather than the likelihood function alone.

11.1.1 Role of the Prior Distribution

To obtain the posterior distribution, we must first choose a prior distribution. The approach in doing so may be subjective or objective. In the subjective approach, the prior distribution reflects the researcher's prior information (before seeing the data) about the value of the parameter. The prior information might be based on eliciting beliefs about effects from the scientist who planned the study, those beliefs perhaps being based on research results available in the scientific literature. In the objective approach, the prior distribution might be chosen so that it has little influence on the posterior distribution, and resulting inferences are substantively the same as those obtained with a frequentist analysis. This might be done if the researcher prefers the Bayesian approach mainly because of the interpretive aspect of treating the parameter as random rather than fixed.

Introduction of the prior distribution to the statistical analysis is the key aspect of the Bayesian approach that is not a part of frequentist inference. Different choices for the prior distribution can result in quite different ultimate inferences, espe-cially for small sample sizes, so the choice should be given careful thought. In this chapter we present the families of prior distributions commonly used for multino-mial parameters and for parameters in ordinal models for multinomial responses.

The method of combining a prior distribution with the likelihood function to obtain a posterior distribution is called Bayesian because it is based on applying Bayes' theorem. By that result, the posterior probability density function A of a parameter , given the data y, relates to the probability mass function / of y, given , and the prior density function g of , by

h(fl | y) = , / f /(y)

The denominator / (y) on the right-hand side is the marginal probability mass function of the data. This is a constant with respect to 0, so is irrelevant for inference about . Computational routines must determine it so that the posterior density function h{9 | y) integrates to 1 with respect to , yielding a legitimate probability distribution for . When we plug in the observed data, / ( y | ) is the likelihood function when viewed as a function of . So, in summary, the prior

BAYESIAN APPROACH TO STATISTICAL INFERENCE 317

density function multiplied by the likelihood function determines the posterior distribution.

Unlike the frequentist approach, the Bayesian approach does not differentiate between large- and small-sample analyses. Inference relies on the same posterior distribution regardless of the sample size n. Under standard regularity conditions such as not being on the boundary of the parameter space, as n grows the posterior distribution approximates more closely a normal distribution. Similarly, as n grows, a Bayesian estimator of (usually, the mean of the posterior distribution) behaves in a manner similar to the maximum likelihood estimator in terms of properties such as asymptotic normality, consistency, and efficiency. For example, Freedman (1963) showed consistency of Bayesian estimators under general conditions for sampling from discrete distributions such as the multinomial. He also showed asymptotic normality of the posterior distribution, assuming a smoothness condition for the prior distribution.

11.1.2 Simulating the Posterior Distribution

Except in specialized cases such as those presented in Section 11.2, there is no closed-form expression for the posterior distribution. Simulation methods can then approximate the posterior distribution. The primary method for doing this is Markov chain Monte Carlo (MCMC), with particular versions of it being the Metropolis-Hastings algorithm and its special case of Gibbs sampling. It is beyond our scope to discuss the technical details of how an MCMC algorithm works. In a nutshell, a stochastic process of Markov chain form is designed so that its long-run stationary distribution is the posterior distribution. One or more such Markov chains provide a very large number of simulated values from the posterior distribution, and the distribution of the simulated values approximates the posterior distribution.

The process begins by the data analyst selecting initial estimates for the param-eters and using a "burn-in period" for the Markov chain until its probability distribution is close to the stationary distribution. After the burn-in period, the sim-ulated values are treated as providing information about the posterior distribution. The successive observations from the Markov chain are correlated, but observations that are a sufficient number of lags apart have little correlation. So "thinning" the process by taking lagged values (such as every fourth value) provides essentially uncorrelated observations from the posterior distribution. Enough observations are taken after the burn-in period so that the Monte Carlo error is small in approx-imating the posterior distribution and summary measures of interest, such as the posterior mean and certain quantiles.

Various graphical and numerical summaries provide information about when the stationary condition has been reached and the Monte Carlo error is small. For a given parameter, plotting the simulated values against the iteration number is helpful for showing when burn-in is sufficient. After that, plotting the mean of the simulated values since the burn-in period against the iteration number helps to show when the approximation for the posterior mean has stabilized. A plot of the autocorrelations of lagged simulated values indicates how much the values must

318 BAYESIAN INFERENCE FOR ORDINAL RESPONSE DATA

be thinned to achieve approximately uncorrelated simulations from the posterior distribution. An alternative approach does not use thinning but accounts for the correlations.

Software is now widely available that can perform these computations for the basic analyses presented in this chapter. Textbooks specializing in computational aspects of the Bayesian approach, such as Ntzoufras (2009), provide details.

11.1.3 Inference Using the Posterior Distribution

Having found the posterior distribution, Bayesian methods of inference are avail-able that parallel those for frequentist inference. For example, to summarize the information about a parameter , we can report the mean and standard deviation of the posterior distribution of . We can also construct an interval that contains most of that distribution. Analogous to the frequentist confidence interval, it is often referred to as a credible interval or Bayesian confidence interval.

A common approach for constructing a credible interval uses percentiles of the posterior distribution, with equal probabilities in the two tails. For example, the 95% credible interval for is the region between the 2.5 and 97.5 percentiles of the posterior distribution for . An alternative approach constructs a highest posterior density (HPD) region. This is the region of values such that the pos-terior density function everywhere over that region is higher than the posterior density function everywhere outside that region, and the posterior probability over that region equals 0.95. Usually the HPD region is an interval, the narrowest one possible with the chosen probability.

A caution: For a parameter in the linear predictor of a model, it is not sensible to construct a HPD interval for a nonlinear function of . For a cumulative logit model, for instance, an HPD interval makes sense for but not for exp(j). To illustrate, suppose that is the coefficient of a binary indicator for two groups. Then a HPD interval for the cumulative odds ratio exp() does not consist of the reciprocals of values of the HPD interval for exp(), which is the cumulative odds ratio for the reverse assignment of values for the indicator variable. This is because the HPD region for a random variable 1/Z is not the same as the set of reciprocals of values in the HPD region for the random variable Z. However, the HPD region for Z is the same as the set of negatives of values in the HPD region for Z, so an HPD interval for (a log cumulative odds ratio) is valid.

In lieu of P-values, with the Bayesian approach, posterior tail probabilities are useful. For example, for an effect parameter , the information about the effect direction is contained in the posterior probabilities P( > 0 | y) and P ( < 0 | y). With a flat prior distribution, P( > 0 | y) corresponds to the frequentist P -value for the one-sided test with Ha: < 0.

Instead of a formal hypothesis test comparing models or possible values for a parameter, it is often useful to construct a Bayes factor. Consider data y and two models Mi and Mj, which could also be two possible ranges of parameter values for a given model, two hypotheses, or even two nonnested distinct model types. The Bayes factor is the ratio of the marginal probabilities, P(y | Mi)/P(y \ M\). This ratio gives the amount of evidence favoring one model relative to the other.

ESTIMATING MULTINOMIAL PARAMETERS 319

Many other analyses, including ways of checking models and an analog of AIC (called BIC), are available but are beyond the scope of this brief chapter treatment. For more in-depth presentations of Bayesian methods, particularly as they apply to ordered categorical data, see Johnson and Albert (1999), O'Hagan and Forster (2004, Sec. 12.52), and Congdon (2005, Chap. 7).

11.2 ESTIMATING MULTINOMIAL PARAMETERS

We first present Bayesian methods for estimating parameters of a multinomial distribution. The distribution could refer to counts for categories of an ordinal variable or to cell counts in a contingency table in which at least one variable is ordinal.

11.2.1 Estimation Using Dirichlet Prior Distributions

We begin with the multinomial with c = 2 categories, that is, the binomial. The simplest Bayesian inference for a binomial parameter pi uses a member of the beta distribution as the prior distribution. The beta distribution is called the conjugate prior distribution for inference about a binomial parameter. This means that it is the family of probability distributions such that when combined with the likelihood function, the posterior distribution falls in the same family. When we combine a beta prior distribution with a binomial likelihood function, the posterior distribution also is a beta distribution. The beta probability density function for pi is propor-tional to 7r"1_1(l pi)"2~ for two parameters et\ > 0 and ctj > 0. The family of beta probability density functions has a wide variety of shapes over the interval (0, 1), including uniform (when ct\ = a2 = 1)> unimodal symmetric {a\ ai> 1), uni-modal skewed left (a\ > az > 1), unimodal skewed right (a2 > i > 1), and bimodal U-shaped ( < l,a2 < 1)

For c> 2 categories, the beta distribution generalizes to the Dirichlet distribu-tion. Its probability density function is positive over the simplex of nonnegative values pi = (,. . . , nc) that sum to 1. Expressed in terms of gamma functions and parameters {a, > 0}, it is

g(n) = pi rf ( ""''"' for < *< < J a11 ' >' = L

,() f=\ Y The case {a, = 1} is the uniform distribution over the simplex of possible probabil-ity values. The case {a, = ^} is the Jeffreys prior for the multinomial distribution, which is the prior distribution that is invariant to the scale of measurement for the parameter. Let K = , "(' The Dirichlet distribution has (pi,) = at/K and Var(7r,) = a,(AT oti)/[K2(K + 1)]. For particular relative sizes of {a,}, such as identical values, the distribution is more tightly concentrated around the means as K increases.

320 BAYESIAN INFERENCE FOR ORDINAL RESPONSE DATA

For cell counts n = (n\,..., nc) from n = ], n, independent observations with cell probabilities pi, the multinomial probability mass function is

t c

/(n|ir)= , < ' n i! nc! i l

Given n, multiplying this multinomial likelihood function by the Dirichlet prior density function g(jr) yields a posterior density function h{n | n) for pi that is also a Dirichlet distribution, but with the parameters {a,} replaced by {or- =...

Recommended