
Journal of Statistical Planning and Inference 138 (2008) 873–887
www.elsevier.com/locate/jspi

Statistical evidence in contingency tables analysis

M. Kateri^a, N. Balakrishnan^b,∗

^a Department of Statistics and Insurance Science, University of Piraeus, 80 Karaoli & Dimitriou Str., 185 34 Piraeus, Greece
^b Department of Mathematics and Statistics, McMaster University, 1280 Main Street West, Hamilton, Ont., Canada L8S 4K1

Received 24 July 2006; received in revised form 17 January 2007; accepted 19 February 2007
Available online 12 March 2007

Abstract

The likelihood ratio is used for measuring the strength of statistical evidence. The probability of observing strong misleading evidence, along with that of observing weak evidence, evaluates the performance of this measure. When the corresponding likelihood function is expressed in terms of a parametric statistical model that fails, the likelihood ratio retains its evidential value if the likelihood function is robust [Royall, R., Tsou, T.S., 2003. Interpreting statistical evidence by using imperfect models: robust adjusted likelihood functions. J. Roy. Statist. Soc. Ser. B 65, 391–404]. In this paper, we extend the theory of Royall and Tsou (2003) to the case when the assumed working model is a characteristic model for two-way contingency tables (the model of independence, association and correlation models). We observe that association and correlation models are not equivalent in terms of statistical evidence. The association models are bounded by the maximum of the bump function while the correlation models are not.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Independence; Association models; Correlation models; The Law of Likelihood; Misleading evidence; Robust likelihood function; Adjusted likelihood

1. Introduction

The theory of statistical evidence provides the answer to the basic question “What do the data say?”, which is not fully faced by the frequentist and the Bayesian approaches. In particular, it provides the answer to the more precise question “When is it correct to say that a given data set represents evidence supporting one hypothesis over another?”, where the hypotheses are about a fixed-dimensional parameter θ. The development of these concepts originated from Birnbaum (1962) and Hacking (1965), and they are presented in detail by Royall (1997). Besides defining statistical evidence, it is vital to measure its strength and control the probabilities of weak evidence and strong misleading evidence (Royall, 2000). Statistical evidence is represented and interpreted via likelihood functions, and its strength is measured by the likelihood ratio. For a clear introduction to the theory of statistical evidence and some illustrative examples, we refer to Blume (2002). Recently, Royall and Tsou (2003) addressed the problem of interpreting a likelihood function when the assumed working model is wrong and introduced two criteria under which a likelihood function is robust and is of evidential worth even though the working model is wrong.

∗ Corresponding author. Tel.: +1 905 525 9140; fax: +1 905 522 1676.
E-mail addresses: [email protected] (M. Kateri), [email protected] (N. Balakrishnan).

0378-3758/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.jspi.2007.02.005


Although the meaning of the likelihood function is the same regardless of the dimension of the parameter space, its visualization and understanding become more difficult as the dimension increases. The cases considered so far primarily deal with unidimensional parameter spaces. For example, Royall and Tsou (2003) provide the probability of strong misleading evidence for the mean of the univariate Poisson distribution, the mean of the univariate normal and lognormal distributions with fixed variance, and for the mean of a gamma distribution with fixed shape parameter. Blume (2005) studied evidence-based criteria for choosing a working regression model and recognized that misleading evidence about the object of interest (the regression coefficient in the true model) is more likely to be observed when the working model is chosen according to other criteria.

In this paper, we focus on the probability of strong misleading evidence under a wrong model assumption for multinomial likelihoods in the context of contingency tables. We extend the theory of Royall and Tsou (2003) to the case of contingency tables of certain probability structure (independence, uniform association, RC association and RC correlation) with multidimensional parameter spaces. We also compare association and correlation models in terms of statistical evidence.

The outline of this paper is as follows. Section 2 briefly presents the basic ideas of the theory of statistical evidence as well as the models used throughout this paper. Section 3 contains the theoretical results that support our findings. In particular, it is established that the object of interest (i.e., the parameter vector of interest) remains the object of inference even under the wrong model, provided that the expected loglikelihood under the correct model is maximized at the sufficient statistic. The asymptotic probabilities of strong misleading evidence for the above-mentioned models under contiguous alternatives, along with some examples, are provided next in Section 4. The upper bounds of these probabilities are presented in Section 5. A discussion on various possible alternatives is provided in Section 6. Finally, some comments and conclusions are made in Section 7.

2. Preliminaries

2.1. The concept of statistical evidence

The interpretation of statistical data as evidence is achieved through the Law of Likelihood (Hacking, 1965; Royall, 1997):

If one hypothesis, H1, implies that a random variable X takes the value x with probability f1(x), while another hypothesis, H2, implies that the probability is f2(x), then the observation X = x is evidence supporting H1 over H2 if f1(x) > f2(x), and the likelihood ratio f1(x)/f2(x) measures the strength of that evidence.

In conclusion, the likelihood function is the key quantity for interpreting statistical evidence.

Let x = (x_1, x_2, ..., x_n) be a realization of the random variables X_1, X_2, ..., X_n, which are i.i.d. with probability distribution {f(·, θ), θ ∈ Θ}, Θ being a fixed-dimensional parameter space. The issue is to interpret the data x as evidence about the underlying probability distribution. In this context, the probability distribution is determined by the parameter θ, the likelihood function is L(x, θ) = ∏_{i=1}^{n} f(x_i, θ), and for the pair of hypotheses about the true value of the parameter θ, H1: θ = θ_1 and H2: θ = θ_2, the likelihood ratio L(x, θ_1)/L(x, θ_2) measures the strength of the evidence supporting H1 against H2.

We need to characterize evidence as “weak”, “fairly strong” or “strong”. Likelihood ratios close to one correspond to weak evidence, while likelihood ratios greater than k (or smaller than 1/k), for some “large” k, represent strong evidence. Royall (1997) suggested for k the values 8 (moderate evidence) and 32 (strong evidence), which are related to standard reference values for the Bayes factor (Kass and Raftery, 1995).

It is important to have a control over the probability of observing weak evidence under the true probability distribution as well as the probability of observing strong evidence under the wrong probability distribution (misleading evidence). These probabilities are defined as

W(n) = Pr_{θ_1}( 1/k < L(x, θ_1)/L(x, θ_2) < k )

and

M(n) = Pr_{θ_2}( L(x, θ_1)/L(x, θ_2) ≥ k ),

respectively, for a fixed pre-specified value of k.
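As an informal illustration of these two quantities (added here as a sketch, not part of the original paper), the following Python code estimates W(n) and M(n) by Monte Carlo for two simple hypotheses about a normal mean; the particular values of θ_1, θ_2, σ, n and k are arbitrary choices.

```python
# Sketch: Monte Carlo estimates of W(n) and M(n) for i.i.d. N(theta, sigma^2) data
# and the simple hypotheses H1: theta = theta1, H2: theta = theta2.
import numpy as np

def log_lr_normal(x, theta1, theta2, sigma):
    """log{L(x, theta1)/L(x, theta2)} for i.i.d. normal data with known sigma."""
    return np.sum((x - theta2) ** 2 - (x - theta1) ** 2) / (2.0 * sigma ** 2)

def weak_and_misleading(theta1, theta2, sigma, n, k=8, reps=20000, seed=0):
    rng = np.random.default_rng(seed)
    log_k = np.log(k)
    weak = misleading = 0
    for _ in range(reps):
        # W(n): data generated under H1 (theta1 true), evidence strong for neither hypothesis
        llr1 = log_lr_normal(rng.normal(theta1, sigma, n), theta1, theta2, sigma)
        weak += (-log_k < llr1 < log_k)
        # M(n): data generated under H2 (theta2 true), yet strong evidence for H1
        llr2 = log_lr_normal(rng.normal(theta2, sigma, n), theta1, theta2, sigma)
        misleading += (llr2 >= log_k)
    return weak / reps, misleading / reps

W, M = weak_and_misleading(theta1=0.0, theta2=0.5, sigma=1.0, n=30, k=8)
print(f"W(n) ~ {W:.3f}   M(n) ~ {M:.4f}   universal bound 1/k = {1/8:.4f}")
```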


The probability of observing strong misleading evidence of strength k or greater is always less than or equal to 1/k for any fixed sample size and any pair of probability distributions. Practically, this means that it is difficult to observe strong misleading evidence when the assumed density f(·, θ) is correct, up to θ. This bound is known as the universal bound, because it applies to any probability model (Birnbaum, 1962). However, in many cases the probability of observing strong misleading evidence is much lower than the universal bound. Thus, in case the two distributions under consideration are normal with different means (θ_1 and θ_2) and common variance σ², then

M(n) = Φ( −√n Δ/(2σ) − σ ln k/(√n Δ) ),

where Δ = |θ_2 − θ_1| and Φ denotes the standard normal cumulative distribution function. Further, if Δ = cσ/√n, i.e., the distance between the two means is measured in terms of the standard error, then for fixed c the above probability is independent of the sample size n and is equal to Φ(−c/2 − ln k/c). The graph of this probability is the so-called bump function (Royall, 2000), with its maximum value equalling Φ(−(2 ln k)^{1/2}).

In summary, when the likelihood function represents evidence about the true parameter value θ_0, it has two important performance properties, as listed by Royall and Tsou (2003):

(L1) For any false value θ ≠ θ_0, the evidence will eventually support θ_0 over θ by an arbitrarily large factor: Pr{L(x, θ_0)/L(x, θ) → ∞ as n → ∞} = 1.

(L2) In large samples, the probability of misleading evidence Pr{L(x, θ)/L(x, θ_0) ≥ k}, as a function of θ, is approximated by the bump function Φ(−c/2 − ln k/c), where c is proportional to the distance between θ and θ_0.

The first property holds for all statistical models for any fixed alternative, while the second holds for models for which the likelihood function is smooth.

The evidence represented by the likelihood function when the true underlying distribution is not one of those in the working model (i.e., there exists no θ_0 for which f(·, θ_0) is the true distribution) has been considered by Royall and Tsou (2003). In this case, in general, the probability of misleading evidence can be quite large. Royall and Tsou (2003) introduced criteria under which a likelihood function remains valid for evidential interpretation of data in case the assumed model fails. They argued that the likelihood function has to be robust. If q(·) is the true probability distribution and θ_q is the value of θ that maximizes E_q[ln L(x, θ)], the likelihood function L(x, θ) is robust if it retains the following two key performance properties when the underlying model is wrong:

(R1) The object of inference, θ_q, remains the object of interest.

(R2) In large samples, the probability of misleading evidence is described by the bump function, and so its maximum value is Φ(−(2 ln k)^{1/2}).

Royall and Tsou (2003) observed that, in general, the second property usually fails, indicating that when the model fails we do lose control over the maximum probability of misleading evidence. If this is indeed the case, they make the likelihood function robust by adjusting it.

2.2. Association and correlation models for two-way contingency tables

Let X = (X_{ij}) be an I × J contingency table having a Mult(n, π^q) distribution, where π^q denotes the true probability model of the table. We are interested in the statistical evidence when considering some characteristic models in contingency tables analysis, such as the basic model of independence or the models of association and correlation; see Agresti (2002).

For an I × J probability table π = (π_{ij}), if π_{i•} (i = 1, ..., I) and π_{•j} (j = 1, ..., J) denote the row and column marginals, respectively, then the model of independence is defined by

π_{ij} = π_{i•} π_{•j},   i = 1, ..., I,  j = 1, ..., J.   (2.1)

From now on, we shall denote the probability model in (2.1) by I.


In the context of contingency tables analysis, the association and correlation models are well known; see, for example, Goodman (1985, 1986). The multiplicative row–column association model, denoted by RC, is defined by

π_{ij} = a_i b_j exp(φ μ_i ν_j),   i = 1, ..., I,  j = 1, ..., J.   (2.2)

The parameters a_i (i = 1, ..., I) and b_j (j = 1, ..., J) are the row and column main effects, respectively, while the vectors μ = (μ_1, ..., μ_I)′ and ν = (ν_1, ..., ν_J)′ are the row and column scores. The parameter φ is known as the intrinsic association parameter and is redundant. On the row and column scores are imposed the constraints

∑_{i=1}^{I} w_{1i} μ_i = ∑_{j=1}^{J} w_{2j} ν_j = 0   (2.3)

and

∑_{i=1}^{I} w_{1i} μ_i² = ∑_{j=1}^{J} w_{2j} ν_j² = 1,   (2.4)

where w_{1i} (i = 1, ..., I) and w_{2j} (j = 1, ..., J) are row and column (positive) weights, respectively. In the literature, commonly used weights are the uniform (w_{1i} = w_{2j} = 1 for all i, j) and the marginal (w_{1i} = π_{i•}, w_{2j} = π_{•j}, i = 1, ..., I, j = 1, ..., J). For a detailed justification of these choices for the weights, see Goodman (1985) and Becker and Clogg (1989). In the case of ordinal classification variables, simpler association models with fewer parameters than in (2.2) can be obtained by considering the row and/or column scores as known, usually equidistant for successive classification categories. Hence, if the column (row) scores are known while the μ_i's (ν_j's) are unknown parameters, we obtain the Row (Column) effect model. If both sets of scores are known, we arrive at the Uniform (U) association model, which has just one parameter more than the independence model.

On the other hand, the correlation model, denoted by CA, is defined as

π_{ij} = π_{i•} π_{•j} (1 + λ μ_i ν_j),   i = 1, ..., I,  j = 1, ..., J.   (2.5)

The row and column scores μ and ν of CA also satisfy the constraints in (2.3) and (2.4), but strictly with the marginal weights.

We would like to point out that the μ and ν scores of the RC model are not the same as the corresponding quantities of the CA model. They are denoted the same way, however, in order to display their qualitative similarity. When confusion could possibly arise, we add a superscript for the scores of each model. We will also eliminate the redundant parameters from the association and correlation models. Thus, instead of (2.2) and (2.5), from now on we shall use the models

π_{ij} = a_i b_j exp(μ_i ν_j),   i = 1, ..., I,  j = 1, ..., J   (2.6)

and

π_{ij} = π_{i•} π_{•j} (1 + μ_i ν_j),   i = 1, ..., I,  j = 1, ..., J,   (2.7)

respectively.

Note that the RC and CA models are special cases of more general association and correlation models, viz., the RC(K) and CA(K) models, respectively, when K = 1. The possible values of K are 1 ≤ K ≤ M, where M = min(I, J) − 1. For K = M, the RC(M) and CA(M) models are equivalent and are reparameterizations of the saturated log-linear model. For a detailed discussion on these models, one may refer to Goodman (1986). The main qualitative difference between these two classes of models is that, even though both are models of dependence, the association models are (under certain conditions) the closest to independence in terms of the Kullback–Leibler distance, while the correlation models are the closest to independence in terms of the Pearsonian distance (Gilula et al., 1988).
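To make the two model classes concrete, the following sketch (added here as an illustration, not part of the paper) constructs small probability tables under the association model (2.2) and the correlation model (2.5); the parameter values and the score normalizations are arbitrary assumptions.

```python
# Sketch: constructing I x J probability tables under the RC association model (2.2)
# and the CA correlation model (2.5) from given (hypothetical) parameter values.
import numpy as np

def rc_table(a, b, phi, mu, nu):
    """pi_ij = a_i * b_j * exp(phi * mu_i * nu_j), renormalised to sum to one."""
    table = np.outer(a, b) * np.exp(phi * np.outer(mu, nu))
    return table / table.sum()

def ca_table(row_marg, col_marg, lam, mu, nu):
    """pi_ij = pi_i. * pi_.j * (1 + lam * mu_i * nu_j); requires 1 + lam*mu_i*nu_j > 0."""
    table = np.outer(row_marg, col_marg) * (1.0 + lam * np.outer(mu, nu))
    if (table < 0).any():
        raise ValueError("parameters violate 1 + lam*mu_i*nu_j > 0")
    return table

# Scores centred and unit-normed under uniform weights (for RC) ...
mu_rc = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0)
# ... and under the marginal weights 1/3 (for CA), as the model conventions require.
mu_ca = np.array([-1.0, 0.0, 1.0]) * np.sqrt(1.5)

pi_rc = rc_table(a=np.ones(3), b=np.ones(3), phi=0.5, mu=mu_rc, nu=mu_rc)
pi_ca = ca_table(row_marg=np.full(3, 1/3), col_marg=np.full(3, 1/3),
                 lam=0.3, mu=mu_ca, nu=mu_ca)
print(pi_rc.round(4), pi_ca.round(4), sep="\n")
```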

3. Model misspecification

According to Property (R1), θ_q remains the object of interest even when the observations come from a distribution different from the assumed one.


In the examples of Royall and Tsou (2003), the parameter of interest θ_q was always the mean, and so Property (R1) was satisfied as long as E_q{log L(x, θ)} was maximized at θ_q = E_q(X). In the case of a multinomial likelihood, i.e., if we assume that x follows a Mult(n, π) distribution, in terms of expected cell proportions (i.e., θ_{ij} = π_{ij}), Property (R1) is ensured, since E_q(X_{ij}/n) = π^q_{ij} = θ^q_{ij}. But if a model π = f(θ) is considered for the cell probabilities and the quantity of interest is the parameter θ, then this condition has to be extended in order to ensure Property (R1) of robust likelihood functions in case the assumed model is wrong and the true one is π = q(θ_0). Note that θ_0 is a parameter vector different from θ (possibly also of different dimension).

Consider, for example, that the assumed model for the cell probabilities is the independence model in (2.1). In this case, the parameter vector is

θ_I = (π_{1•}, ..., π_{I•}, π_{•1}, ..., π_{•J})   (3.1)

and the expected loglikelihood (under the correct model) is equal to

E_q{log L(x, θ_I)} = E_q( A + ∑_{i,j} x_{ij} log(π_{i•} π_{•j}) ),   (3.2)

where A = log( n!/(x_{11}! x_{12}! ⋯ x_{IJ}!) ). With the assumption that E_q{log(x_{ij}!)} exists, the expected loglikelihood in (3.2) is seen to be maximized at θ^q_I = E_q[T_I(x)], where T_I(x) = (1/n)(x_{1•}, ..., x_{I•}, x_{•1}, ..., x_{•J}).

In a more general framework, the following lemma can be established.

Lemma 3.1. Let x be a contingency table for which we assume a multinomial distribution Mult(n, π). In the case when the parametric model π = f(θ) is assumed instead of the true π = q(θ_0), Property (R1) of a robust likelihood function is satisfied whenever E_q{log L(x, θ)} is maximized at E_q[T(x)] = θ_q, where T(x) is the sufficient statistic for the parameter θ and L(x, θ) is the likelihood under the assumed model.

Examples. By proceeding in a manner similar to that of the independence model presented above and with the use of Lemma 3.1, results for some other characteristic models can also be obtained, as presented below.

(1) For the U model, the statistic that maximizes the expected loglikelihood under this model is T_U(x) = (T_I(x), (1/n)∑_{i,j} x_{ij} μ_i ν_j), which is the sufficient statistic for the corresponding parameter vector

θ_U = (a_1, a_2, ..., a_I, b_1, b_2, ..., b_J, φ).

(2) For the RC model in (2.2), the parameter vector is

θ_RC = (a_1, ..., a_I, b_1, ..., b_J, μ^RC_1, ..., μ^RC_I, ν^RC_1, ..., ν^RC_J)

and T_RC(x) = (T_I(x), (1/n)∑_j x_{1j} ν_j, ..., (1/n)∑_j x_{Ij} ν_j, (1/n)∑_i x_{i1} μ_i, ..., (1/n)∑_i x_{iJ} μ_i).

(3) For the correlation model in (2.5), the parameter vector is

θ_CA = (θ_I, μ^CA_1, ..., μ^CA_I, ν^CA_1, ..., ν^CA_J)

and the expected loglikelihood under this model is maximized at

T_CA(x) = ( T_I(x), (1/n)∑_j x_{1j} ν_j/(1 + μ_1 ν_j), ..., (1/n)∑_j x_{Ij} ν_j/(1 + μ_I ν_j), (1/n)∑_i x_{i1} μ_i/(1 + μ_i ν_1), ..., (1/n)∑_i x_{iJ} μ_i/(1 + μ_i ν_J) ).

The vectors T_U, T_RC and T_CA derived above are all sufficient statistics for the parameters of the models U, RC and CA, respectively. Thus, all the models considered satisfy Property (R1) of a robust likelihood function. Note that, for the sake of computational convenience in what follows, we have also included the redundant terms (for example, π_{I•}) in these sufficient statistics.
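As a small illustration of these statistics (a sketch added here), the following code computes T_I(x) and T_U(x) from an arbitrary observed table; the known equidistant scores for the U model are an assumption of the example.

```python
# Sketch: the sufficient statistics T_I(x) and T_U(x) computed from an observed table.
import numpy as np

def T_I(x):
    """T_I(x) = (x_1./n, ..., x_I./n, x_.1/n, ..., x_.J/n)."""
    n = x.sum()
    return np.concatenate([x.sum(axis=1), x.sum(axis=0)]) / n

def T_U(x, mu, nu):
    """T_U(x) = (T_I(x), (1/n) * sum_ij x_ij * mu_i * nu_j)."""
    n = x.sum()
    cross = (x * np.outer(mu, nu)).sum() / n
    return np.concatenate([T_I(x), [cross]])

x = np.array([[12, 9, 4], [30, 35, 27], [8, 14, 15]], dtype=float)   # hypothetical data
mu = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0)    # known equidistant row scores
nu = mu.copy()                                     # known equidistant column scores
print(T_I(x).round(4))
print(T_U(x, mu, nu).round(4))
```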

Remark 3.1. The result in Lemma 3.1 will always hold for log-linear models, since the likelihood of a log-linear model is a linear combination of the logarithms of the model parameters multiplied by the corresponding sufficient statistics (Birch, 1963). This result will also continue to hold if we try a simpler but nested model instead of the true model.

4. Probabilities of strong misleading evidence

Next, we shall evaluate the probabilities of strong misleading evidence for the model of Independence, the association models U and RC, and the correlation model CA. The models of Row or Column effect can be handled in a manner analogous to the U and RC models.

4.1. Independence model

Let us assume (wrongly) that independence holds for an I × J contingency table when the true probability model is π^q = (π^q_{ij}). In this context, our parameter vector is as in (3.1). If θ^q_I is the vector with the true values of the marginals, the probability of strong misleading evidence is provided by

Pr_q[ L(x, θ_I)/L(x, θ^q_I) ≥ k ] = Pr_q[ ∑_{i,j} x_{ij} (log π_{ij} − log π^q_{ij}) ≥ log k ],

which, in terms of θ_I and θ^q_I, is equal to

Pr_q[ L(x, θ_I)/L(x, θ^q_I) ≥ k ] = Pr_q( ∑_{i=1}^{I} x_{i•}(log θ_i − log θ^q_i) + ∑_{j=1}^{J} x_{•j}(log θ_{I+j} − log θ^q_{I+j}) − n ∑_{i=1}^{I} (θ_i − θ^q_i) − n ∑_{j=1}^{J} (θ_{I+j} − θ^q_{I+j}) ≥ log k ).   (4.1)

As already mentioned in the last section, in order to simplify computations, we also use the redundant parameters π_{I•} and π_{•J} and then include in the likelihood ratio the corresponding constraints on the parameters.

Considering contiguous alternatives for the row and column marginals, we have

θ_i = θ^q_i + c_{1i}/√n,   i = 1, ..., I   (4.2)

and

θ_{I+j} = θ^q_{I+j} + c_{2j}/√n,   j = 1, ..., J   (4.3)

with c_{1i} and c_{2j} satisfying the constraints

c_{1i} > −√n θ^q_i (i = 1, ..., I),   c_{2j} > −√n θ^q_{I+j} (j = 1, ..., J)

and

∑_{i=1}^{I} c_{1i} = 0,   ∑_{j=1}^{J} c_{2j} = 0.   (4.4)


Further, by considering the Taylor expansions of the log-differences occurring in (4.1) around θ^q_i and θ^q_{I+j}, respectively, and using the conditions in (4.4), (4.1) becomes equivalent to

Pr_q[ L(x, θ_I)/L(x, θ^q_I) ≥ k ] = Pr_q[ ∑_{i=1}^{I} (√n c_{1i}/θ^q_i)(p_{i•} − θ^q_i) + ∑_{j=1}^{J} (√n c_{2j}/θ^q_{I+j})(p_{•j} − θ^q_{I+j}) + O_p(n^{−1/2}) ≥ log k + (1/2)( ∑_{i=1}^{I} (c_{1i}/θ^q_i)² p_{i•} + ∑_{j=1}^{J} (c_{2j}/θ^q_{I+j})² p_{•j} ) ],

where p_{ij} = x_{ij}/n is the sample proportion. Since p_{i•} → θ^q_i and p_{•j} → θ^q_{I+j} as n → ∞, from the central limit theorem, for n → ∞ we finally obtain the following asymptotic expression for the above probability of strong misleading evidence (for fixed k):

M_I(c, k) = Pr_q[ L(x, θ_I)/L(x, θ^q_I) ≥ k ] = Φ[ −( log k + (1/2)( ∑_{i=1}^{I} c²_{1i}/θ^q_i + ∑_{j=1}^{J} c²_{2j}/θ^q_{I+j} ) ) / √(Var_q(Y_I)) ],   (4.5)

where c = (c_{11}, ..., c_{1I}, c_{21}, ..., c_{2J}),

Y_I = ∑_{i,j} ( c_{1i}/θ^q_i + c_{2j}/θ^q_{I+j} ) x_{ij} = ∑_{i=1}^{I} (c_{1i}/θ^q_i) x_{i•} + ∑_{j=1}^{J} (c_{2j}/θ^q_{I+j}) x_{•j}   (4.6)

and Var_q(Y_I) is the variance of Y_I in (4.6) under the true model.

In order to define contiguous alternatives depending on a univariate parameter, which are easier to handle and possess a meaningful interpretation, we shall consider the contiguous alternatives of π^q towards a probability table π*, i.e., our alternatives will be of the form π = π^q + (s/√n)(π* − π^q), with s ∈ R and appropriately restricted so that π is a valid probability table. In the framework of log-linear models, such contiguous alternatives have been used by Cressie et al. (2003). These alternatives, in terms of our parameters (the vectors of the marginal probabilities), are expressed as

θ_i = θ^q_i + s (θ*_i − θ^q_i)/√n,   i = 1, ..., I   (4.7)

and

θ_{I+j} = θ^q_{I+j} + s (θ*_{I+j} − θ^q_{I+j})/√n,   j = 1, ..., J,   (4.8)

with θ* being the vector of row and column marginals of the π* table and |s| < s_max, where

s_max = min{ max( (1 − θ^q_i)√n/(θ*_i − θ^q_i), −θ^q_i√n/(θ*_i − θ^q_i) ) },

with the min and max being considered with respect to all i = 1, ..., I + J.

This constraint on s poses no practical difficulty since we consider large n. In this setup, (4.5) turns out to be

M(s, k) = Pr_q[ L(x, θ_I)/L(x, θ^q_I) ≥ k ] = Φ[ −( log k/|s| + (|s|/2) X²(θ*_I, θ^q_I) ) / √(Var_q(Y_{I*})) ],   (4.9)

where X²(θ*_I, θ^q_I) = ∑_{i=1}^{I} (θ*_i − θ^q_i)²/θ^q_i + ∑_{j=1}^{J} (θ*_{I+j} − θ^q_{I+j})²/θ^q_{I+j} is the Pearsonian distance between θ*_I and θ^q_I, while

Y_{I*} = ∑_{i,j} ( (θ*_i − θ^q_i)/θ^q_i + (θ*_{I+j} − θ^q_{I+j})/θ^q_{I+j} ) x_{ij} = ∑_{i=1}^{I} ((θ*_i − θ^q_i)/θ^q_i) x_{i•} + ∑_{j=1}^{J} ((θ*_{I+j} − θ^q_{I+j})/θ^q_{I+j}) x_{•j}.   (4.10)


Fig. 1. Probability of strong misleading evidence for the likelihood of the Independence model for the probability table π_1 as a function of s [see (4.7) and (4.8)], along with the maximum of the bump function.

In order to illustrate this, let us consider a 3 × 3 contingency table of large total sample size n with the underlying probability table

π_1 = [ 0.068  0.058  0.034
        0.203  0.230  0.177
        0.057  0.086  0.087 ],

which incidentally is of U association structure. By wrongly assuming the independence model, our parameter vector of interest consists of the row and column marginals, and the expected loglikelihood under the correct model is maximized at θ^q_I = (0.1601, 0.6107, 0.2292, 0.3285, 0.3741, 0.2974), i.e., the marginals of the true probability table π^q = π_1. If θ*_I = (0.15, 0.6, 0.25, 0.3, 0.4, 0.3) and we consider the contiguous alternatives in (4.7) and (4.8), we get the probability of strong misleading evidence provided in Fig. 1 for k = 8. In this figure, the maximum of the bump function has also been plotted (the horizontal line).
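The curve in Fig. 1 can be reproduced approximately from (4.9) and (4.10). In the sketch below (added here, not part of the paper), Var_q(Y_{I*}) is read as the per-trial multinomial variance of Y_{I*}, i.e., the variance appearing in the central limit step; that reading of the variance term, and the plotting grid, are our assumptions.

```python
# Sketch: evaluating the asymptotic probability (4.9) for the independence example
# (table pi1, k = 8), with Var_q(Y_I*) taken as the per-trial multinomial variance.
import numpy as np
from math import erfc, log, sqrt

pi1 = np.array([[0.068, 0.058, 0.034],
                [0.203, 0.230, 0.177],
                [0.057, 0.086, 0.087]])
row_q, col_q = pi1.sum(axis=1), pi1.sum(axis=0)                       # theta^q_I
row_s, col_s = np.array([0.15, 0.6, 0.25]), np.array([0.3, 0.4, 0.3])  # theta*_I

def M_indep(s, k=8.0):
    dr, dc = row_s - row_q, col_s - col_q
    X2 = (dr**2 / row_q).sum() + (dc**2 / col_q).sum()    # Pearsonian distance
    w = np.add.outer(dr / row_q, dc / col_q)              # cell weights of Y_I*
    var_Y = (w**2 * pi1).sum() - ((w * pi1).sum())**2     # per-trial variance (assumption)
    z = -(log(k) / abs(s) + abs(s) * X2 / 2.0) / sqrt(var_Y)
    return 0.5 * erfc(-z / sqrt(2.0))                     # Phi(z)

grid = np.linspace(-100, 100, 401)
curve = [M_indep(s) for s in grid if s != 0]
print(max(curve))          # exceeds the bump maximum ~0.0207, as seen in Fig. 1
```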

4.2. Uniform association model

If we assume that the U model holds while the true probability vector is π^q, then, for fixed row and column scores, the probability of strong misleading evidence is

Pr_q[ L(x, θ_U)/L(x, θ^q_U) ≥ k ] = Pr_q[ ∑_{i=1}^{I} x_{i•}(log a_i − log a^q_i) + ∑_{j=1}^{J} x_{•j}(log b_j − log b^q_j) + (φ − φ^q) ∑_{i,j} μ_i ν_j x_{ij} + n ∑_{i,j} (π_{ij} − π^q_{ij}) ≥ log k ],   (4.11)

where θ_U is the parameter vector under the U model and θ^q_U = E_q[T(x)] (see Section 3).

For the general multivariate contiguous alternative θ_U = θ^q_U + c/√n with c = (c_{11}, ..., c_{1I}, c_{21}, ..., c_{2J}, c_φ), i.e.,

a_i = a^q_i + c_{1i}/√n,   i = 1, ..., I,
b_j = b^q_j + c_{2j}/√n,   j = 1, ..., J,
φ = φ^q + c_φ/√n,   (4.12)

by using Taylor expansions, one can verify that (4.11) converges to Φ( −(log k + A_U/2)/√(Var_q Y_U) ), where

Y_U = ∑_{i=1}^{I} (c_{1i}/a^q_i) x_{i•} + ∑_{j=1}^{J} (c_{2j}/b^q_j) x_{•j} + c_φ ∑_{i,j} μ_i ν_j x_{ij}

and

A_U = ∑_{i=1}^{I} (c_{1i}/a^q_i)² π^q_{i•} + ∑_{j=1}^{J} (c_{2j}/b^q_j)² π^q_{•j} + c²_φ ∑_{i,j} (μ_i ν_j)² π^q_{ij} + 2 ∑_{i,j} (c_{1i}/a^q_i)(c_{2j}/b^q_j) π^q_{ij} + 2 c_φ ∑_{i,j} (c_{1i}/a^q_i + c_{2j}/b^q_j) μ_i ν_j π^q_{ij}.   (4.13)

The alternatives defined in (4.12) need additionally to satisfy the following constraints in order for π to be a valid probability table:

−√n a^q_i < c_{1i} < √n [ { ∑_{j=1}^{J} (b^q_j + c_{2j}/√n) exp((φ^q + c_φ/√n) μ_i ν_j) }^{−1} − a^q_i ],

−√n b^q_j < c_{2j} < √n [ { ∑_{i=1}^{I} (a^q_i + c_{1i}/√n) exp((φ^q + c_φ/√n) μ_i ν_j) }^{−1} − b^q_j ],

∑_{i,j} (a^q_i + c_{1i}/√n)(b^q_j + c_{2j}/√n) exp((φ^q + c_φ/√n) μ_i ν_j) = 1.

As in the case of independence, we shall restrict our alternatives in order to control them through a univariate parameter s. Analogous to (4.7) and (4.8), we consider a π* and its associated parameter vector θ*_U. The contiguous alternatives are (4.12) with c replaced by s(θ*_U − θ^q_U). Note that under this setting the parameters a_i or b_j need to be rescaled (divided by ∑_{i,j} π_{ij}) in order to ensure that ∑_{i,j} π_{ij} = 1. Going through the computational details, one can verify that this rescaling does not affect the approximation of the corresponding probability of strong misleading evidence. Thus, (4.11) converges to Φ( −( log k/|s| + (|s|/2) A*_U )/√(Var_q Y*_U) ), where

Y*_U = ∑_{i=1}^{I} ((a*_i − a^q_i)/a^q_i) x_{i•} + ∑_{j=1}^{J} ((b*_j − b^q_j)/b^q_j) x_{•j} + (φ* − φ^q) ∑_{i,j} μ_i ν_j x_{ij}

and A*_U is A_U in (4.13) with c replaced by (θ*_U − θ^q_U).

For s, the restriction

−max{A^(1), B^(1)} < s < min{A^(2), B^(2)}

should hold, where A^(k) = min_{S^A_k} {S_a(i)}, k = 1, 2, with S^A_1 = {i: a*_i − a^q_i > 0}, S^A_2 = {i: a*_i − a^q_i < 0}, S_a(i) = √n a^q_i/|a*_i − a^q_i|, and the corresponding quantities B^(1), B^(2) for the columns defined similarly.

and the corresponding quantities for the column scores defined similarly.Consider a 4 × 4 contingency table of large total sample size n with the underlying probability table

�2 =⎛⎜⎝

0.085 0.067 0.066 0.0280.035 0.028 0.028 0.0120.057 0.042 0.082 0.0580.085 0.062 0.145 0.120

⎞⎟⎠ ,

which incidentally is of RC structure. In this case, we shall provide the probability of strong misleading evidenceunder the consideration of the U model and for the univariate contiguous alternatives, as a function of s. The pa-rameter vector �U that maximizes the expected loglikelihood under the correct probability table �q = �2 is �

qU =

882 M. Kateri, N. Balakrishnan / Journal of Statistical Planning and Inference 138 (2008) 873–887

-40 -20 20 40

0.005

0.01

0.015

0.02

Fig. 2. Probability of strong misleading evidence for the likelihood of the U model for the probability table �2 as a function of s, along with themaximum of the bump function.

(0.2312, 0.1016, 0.2386, 0.4016, 0.2791, 0.2094, 0.3188, 0.1961, 0.8719). The corresponding graph, for�∗

U = (0.25, 0.1, 0.25, 0.4, 0.3, 0.2, 0.3, 0.2, 0.7) and k = 8, is presented in Fig. 2.
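The vector θ^q_U maximizes the expected loglikelihood under the working U model, i.e., it minimizes the Kullback–Leibler divergence from π_2 to the U-model family. The sketch below (an addition, not from the paper) recovers the best-approximating U-model table numerically; the equidistant, unit-normed scores are an assumption, so the fitted parameters are identified only up to that normalization and need not reproduce the quoted θ^q_U values, although the fitted cell probabilities are well defined.

```python
# Sketch: the U-model table closest to pi2 in Kullback-Leibler distance, obtained by
# maximising E_q{log L} per observation (equivalently, minimising KL(pi2 || pi(theta))).
import numpy as np
from scipy.optimize import minimize

pi2 = np.array([[0.085, 0.067, 0.066, 0.028],
                [0.035, 0.028, 0.028, 0.012],
                [0.057, 0.042, 0.082, 0.058],
                [0.085, 0.062, 0.145, 0.120]])
mu = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(20.0)   # assumed centred, unit-normed scores
nu = mu.copy()

def u_model(par):
    la, lb, phi = par[:4], par[4:8], par[8]               # log main effects and phi
    table = np.exp(np.add.outer(la, lb) + phi * np.outer(mu, nu))
    return table / table.sum()                            # renormalised U-model table

def kl(par):
    return float((pi2 * (np.log(pi2) - np.log(u_model(par)))).sum())

# The parameterisation is over-complete (a constant can shift between la and lb),
# but the fitted table itself is unique.
fit = minimize(kl, x0=np.zeros(9), method="BFGS")
print("KL at optimum:", round(fit.fun, 6))
print(u_model(fit.x).round(4))
```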

4.3. Row–column association model

In a similar manner, for the RC model and for the general contiguous alternative with c = (c_{11}, ..., c_{1I}, c_{21}, ..., c_{2J}, c_{μ_1}, ..., c_{μ_I}, c_{ν_1}, ..., c_{ν_J}), we conclude that the corresponding probability of strong misleading evidence M_RC(c, k), when the true probability table is π^q, converges to Φ( −(log k + (1/2) A_RC)/√(Var_q Y_RC) ), where

Y_RC = ∑_{i=1}^{I} (c_{1i}/a^q_i) x_{i•} + ∑_{j=1}^{J} (c_{2j}/b^q_j) x_{•j} + ∑_{i=1}^{I} c_{μ_i} ( ∑_{j=1}^{J} ν^q_j x_{ij} ) + ∑_{j=1}^{J} c_{ν_j} ( ∑_{i=1}^{I} μ^q_i x_{ij} )

and

A_RC = ∑_{i=1}^{I} (c_{1i}/a^q_i)² π^q_{i•} + ∑_{j=1}^{J} (c_{2j}/b^q_j)² π^q_{•j} + ∑_{i,j} (c_{μ_i} ν^q_j + c_{ν_j} μ^q_i)² π^q_{ij} + 2 ∑_{i,j} (c_{1i}/a^q_i)(c_{2j}/b^q_j) π^q_{ij} + 2 ∑_{i,j} (c_{1i}/a^q_i + c_{2j}/b^q_j)(c_{μ_i}/μ^q_i + c_{ν_j}/ν^q_j) μ^q_i ν^q_j π^q_{ij}.   (4.14)

These alternatives also need to satisfy constraints analogous to those given above for the U model.

In the special case of contiguous alternatives controlled by a univariate parameter s, M_RC(s, k) converges to Φ( −( log k/|s| + (|s|/2) A*_RC )/√(Var_q Y*_RC) ), where Y*_RC and A*_RC are the same as Y_RC and A_RC, respectively, with c replaced by (θ*_RC − θ^q_RC).

As in the case of the U model, here also one set of the main effect parameters (the a_i's or the b_j's) needs to be rescaled in order to ensure that the entries of π add to one. We use uniform weights for the constraints on the row and column scores to be common for both tables π^q and π* and all intermediates, thus avoiding rescaling. Under the constraints on the alternative, (2.4) will no longer hold, but we do not rescale the scores since this does not affect the cell estimates.

Consider a 4 × 5 contingency table with true probability table

π_3 = [ 0.042  0.008  0.050  0.031  0.003
        0.091  0.016  0.112  0.067  0.005
        0.082  0.017  0.142  0.082  0.007
        0.056  0.012  0.091  0.078  0.008 ],

which incidentally is of CA(2) structure. Working with the RC as the underlying model, we obtain θ^q_RC = (0.095, 0.206, 0.245, 0.188, 0.370, 0.072, 0.539, 0.346, 0.030, −0.246, −0.274, 0.059, 0.460, −0.339, −0.184, −0.073, 0.198, 0.397). The probability of strong misleading evidence, as a function of s, for a fixed π* with corresponding parameter vector θ*_RC = (0.101, 0.207, 0.259, 0.105, 0.258, 0.118, 0.505, 0.320, 0.130, −0.972, −0.510, 0.331, 0.149, −1.331, −0.037, 0.081, 0.521, 0.764) and k = 8, is provided in Fig. 3.

Fig. 3. Probability of strong misleading evidence for the likelihood of the RC model for the probability table π_3 as a function of s, along with the maximum of the bump function.

4.4. Row–column correlation model

For the correlation model and under the general contiguous alternatives for its parameters, the probability of strong misleading evidence converges to Φ( −(log k + (1/2) A_CA)/√(Var_q Y_CA) ), where

Y_CA = ∑_{i=1}^{I} (c_{1i}/π^q_{i•}) x_{i•} + ∑_{j=1}^{J} (c_{2j}/π^q_{•j}) x_{•j} + ∑_{i} c_{μ_i} ( ∑_{j} ν^q_j x_{ij}/(1 + μ^q_i ν^q_j) ) + ∑_{j} c_{ν_j} ( ∑_{i} μ^q_i x_{ij}/(1 + μ^q_i ν^q_j) )

and

A_CA = ∑_{i=1}^{I} c²_{1i}/π^q_{i•} + ∑_{j=1}^{J} c²_{2j}/π^q_{•j} + 2 ∑_{i,j} (c_{1i}/π^q_{i•})(c_{2j}/π^q_{•j}) π^q_{ij} + ∑_{i,j} ( (c_{μ_i} ν^q_j − c_{ν_j} μ^q_i)/(1 + μ^q_i ν^q_j) )² π^q_{ij} + ( ∑_{j} ν^q_j c_{2j} − ∑_{j} c_{ν_j} π^q_{•j} ) ∑_{i} c_{μ_i} π^q_{i•} + ( ∑_{i} μ^q_i c_{1i} − ∑_{i} c_{μ_i} π^q_{i•} ) ∑_{j} c_{ν_j} π^q_{•j}.   (4.15)

The correlation model is the more restrictive one when considering the contiguous alternatives, since under CA the scores cannot vary freely, as we have to ensure that 1 + μ_i ν_j > 0 for all i, j. Also, by the definition of the model, the constraints in (2.3) are crucial and the weights have to be the marginals. Thus, c has to be restricted accordingly.

For the special class of contiguous alternatives depending on a single parameter s, if θ*_CA and θ^q_CA are the parameter vectors corresponding to the tables π* and π^q, respectively, we define the alternatives for the scores as for the RC model, but for the marginals we include additional scaling parameters (k_r and k_c), which are determined by (2.3). Thus,

π_{i•} = π^q_{i•} + k_r (s/√n)(π*_{i•} − π^q_{i•}),   i = 1, ..., I,
π_{•j} = π^q_{•j} + k_c (s/√n)(π*_{•j} − π^q_{•j}),   j = 1, ..., J,

with

k_r = ∑_{i} π^q_{i•} μ*_i / [ (s/√n)( ∑_{i} π^q_{i•} μ*_i + ∑_{i} π*_{i•} μ^q_i ) − ∑_{i} π*_{i•} μ^q_i ]

and k_c defined similarly.

Fig. 4. Probability of strong misleading evidence for the likelihood of the CA model for the probability table π_3 as a function of s, along with the maximum of the bump function.

and kc defined similarly.For the constraint 1+ij > 0 to hold for every cell of the probability table, we need to define appropriately the domain

of s. We have s ∈ S =∩Sij , where Sij is as defined below. If �ij = (∗i −q

i )(∗j − q

j ), �ij =qi (∗

j − qj )+ q

j (∗i −q

i ),

and ij = (1 + qi q

j ), then

• for �ij = 0, Sij = {s ∈ R: s > − ij

�ij

√n},

• for �ij < 0, Sij = {s ∈ R: smin√

n < s < smax√

n},• for �ij > 0, if �2

ij − 4�ij ij < 0, then Sij =R; else, Sij = {s ∈ R: s < smin√

n or s > smax√

n},

where smin and smax are the roots of �ij s2 + �ij s + ij .
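The cell-wise sets S_{ij} can be computed directly from the three cases above. The sketch below (an addition) does this for a single cell; the β_{ij} < 0 branch of the α_{ij} = 0 case, left implicit above, and the numerical inputs are our assumptions.

```python
# Sketch: the admissible set S_ij for one cell, from the quadratic
# alpha_ij*t^2 + beta_ij*t + gamma_ij > 0 with t = s/sqrt(n).
import numpy as np

def S_ij(mu_q, nu_q, mu_s, nu_s, n):
    """Intervals of s ensuring 1 + mu_i*nu_j > 0 in one cell along the alternative path."""
    alpha = (mu_s - mu_q) * (nu_s - nu_q)
    beta = mu_q * (nu_s - nu_q) + nu_q * (mu_s - mu_q)
    gamma = 1.0 + mu_q * nu_q
    rtn = np.sqrt(n)
    if alpha == 0:
        bound = -gamma * rtn / beta
        return [(bound, np.inf)] if beta > 0 else [(-np.inf, bound)]
    disc = beta**2 - 4.0 * alpha * gamma
    if disc < 0:
        # always positive if alpha > 0; never positive otherwise (cannot occur when gamma > 0)
        return [(-np.inf, np.inf)] if alpha > 0 else []
    smin, smax = sorted(np.roots([alpha, beta, gamma]).real)   # roots in t = s/sqrt(n)
    if alpha < 0:
        return [(smin * rtn, smax * rtn)]
    return [(-np.inf, smin * rtn), (smax * rtn, np.inf)]

print(S_ij(mu_q=0.6, nu_q=-0.3, mu_s=0.1, nu_s=0.4, n=10_000))
```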

For these univariate changing alternatives, the probability of strong misleading evidence now converges to Φ( −( log k/|s| + (|s|/2) A*_CA )/√(Var_q Y*_CA) ), where

Y*_CA = k_{r,∞} ∑_{i=1}^{I} ((π*_{i•} − π^q_{i•})/π^q_{i•}) x_{i•} + k_{c,∞} ∑_{j=1}^{J} ((π*_{•j} − π^q_{•j})/π^q_{•j}) x_{•j} + ∑_{i,j} [ ( (μ*_i − μ^q_i) ν^q_j + (ν*_j − ν^q_j) μ^q_i )/(1 + μ^q_i ν^q_j) ] x_{ij}

and A*_CA is the same as A_CA in (4.15) with c_{1i}, c_{2j}, c_{μ_i} and c_{ν_j} replaced by k_{r,∞}(π*_{i•} − π^q_{i•}), k_{c,∞}(π*_{•j} − π^q_{•j}), (μ*_i − μ^q_i) and (ν*_j − ν^q_j), respectively, while k_{r,∞} = lim_{n→∞} k_r = −∑_{i} π^q_{i•} μ*_i / ∑_{i} π*_{i•} μ^q_i and k_{c,∞} is defined similarly.

For the table π_3, the probability of strong misleading evidence for the CA model, as a function of s (for k = 8), is presented in Fig. 4. The corresponding vectors of parameters are θ^q_CA = (0.134, 0.291, 0.330, 0.245, 0.271, 0.053, 0.395, 0.258, 0.023, −0.335, −0.385, 0.0434, 0.582, −0.325, −0.147, −0.014, 0.340, 0.595) and θ*_CA = (0.18, 0.30, 0.35, 0.17, 0.25, 0.08, 0.34, 0.23, 0.10, −1.179, −0.479, 0.521, 1.021, −0.898, 0.009, 0.109, 0.502, 0.709), which corresponds to the same probability table π* used for the probability of misleading evidence for the RC model. In this case, the domain of s is S = {s: −2.25618√n < s < 1.08353√n}, which for large n poses no practical difficulty.


5. Bounds for the probabilities of strong misleading evidence

For the likelihood function to be robust, we also need to ensure that in large samples the probability of strong misleading evidence is described by the bump function, and hence its maximum value should be Φ(−(2 log k)^{1/2}); see Property (R2) in Section 2.1. For k = 8, the choice in all our examples, the maximum value thus equals Φ(−(2 log 8)^{1/2}) = 0.0207084.

In general, for a multinomial probability table with a true parameter vector θ^q, the loglikelihood ratio l(x, θ) − l(x, θ^q) = log( f(x, θ)/f(x, θ^q) ) will converge in distribution to N( −(1/2) c′H_q c, c′I_q c ), where I_q = E_q( [∇_θ l(x, θ^q)]′[∇_θ l(x, θ^q)] ) and H_q = E_q( −∇²_θ l(x, θ^q) ). Thus, the probability of strong misleading evidence converges to Φ( −{log k + (1/2) c′H_q c}/√(c′I_q c) ), which is the bump function for H_q = I_q, with c = √(c′H_q c). If we set the alternatives so that c = s(θ* − θ^q), then the above probability is maximized at s = { 2 log k / ((θ* − θ^q)′H_q(θ* − θ^q)) }^{1/2} and this maximum probability equals Φ(−{2 log k ρ}^{1/2}), where ρ = (θ* − θ^q)′H_q(θ* − θ^q) / ((θ* − θ^q)′I_q(θ* − θ^q)). This will be the maximum of the bump function if ρ = 1, i.e., I_q = H_q. Otherwise, if (θ* − θ^q)′I_q(θ* − θ^q) > (θ* − θ^q)′H_q(θ* − θ^q), the maximum of the probability of strong misleading evidence will exceed Φ(−(2 log k)^{1/2}). If this is the case, the likelihood can be adjusted; see Royall and Tsou (2003) and Stafford (1996). Straightforward application of the methodology of Royall and Tsou (2003) is not appropriate in our case, since our parameter of interest is not scalar but a vector. Indeed, raising the likelihood to a power leads to an adjustment based also on the alternative hypothesis, since ρ is a function of θ*. One way to overcome this problem is to select a scalar parameter as the parameter of interest and work with the corresponding profile likelihood according to Royall and Tsou (2003). This procedure is straightforward but has the disadvantage that in our context no single parameter can be isolated as the parameter of interest. One option would be to consider and adjust the profile likelihood for every single coordinate of θ. Alternatively, one could consider the related probability of strong misleading evidence as conditional on θ*, assume a density g(θ*) and integrate over θ*.
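The quantities in this argument are straightforward to evaluate once H_q, I_q and the direction θ* − θ^q are available. The sketch below (an addition) uses arbitrary stand-in matrices, not quantities computed from the paper's examples, to show how ρ and the resulting maximum probability compare with the bump maximum.

```python
# Sketch: rho and the maximum probability Phi(-sqrt(2*log(k)*rho)) of strong misleading
# evidence, compared with the bump maximum Phi(-sqrt(2*log(k))).
import numpy as np
from math import erfc, log, sqrt

def max_misleading(H, I, delta, k=8.0):
    qH = float(delta @ H @ delta)
    qI = float(delta @ I @ delta)
    rho = qH / qI
    prob = 0.5 * erfc(sqrt(2.0 * log(k) * rho) / sqrt(2.0))   # Phi(-sqrt(2 log k rho))
    bump_max = 0.5 * erfc(sqrt(log(k)))                        # Phi(-sqrt(2 log k))
    return rho, prob, bump_max

H = np.diag([2.0, 1.0, 0.5])          # hypothetical H_q
I = np.diag([2.0, 1.5, 1.0])          # hypothetical I_q != H_q: misspecified model
delta = np.array([0.1, -0.2, 0.3])    # hypothetical direction theta* - theta^q
rho, prob, bump_max = max_misleading(H, I, delta)
print(rho, prob, bump_max)            # rho < 1 here, so prob exceeds bump_max
```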

We have already verified that the probability of misleading evidence can exceed the maximum of the bump function for the independence model as well as for the correlation model CA.

For the association models, observe that for A_U and A_RC provided in (4.13) and (4.14), respectively, it can be proved that A_U = E_q(Y²_U) and A_RC = E_q(Y²_RC). Hence, A_U ≥ Var_q Y_U (A_RC ≥ Var_q Y_RC), with equality holding if E_q Y_U = 0 (E_q Y_RC = 0). Consequently, the association models are always bounded by the maximum of the bump function and no adjustment is therefore necessary.

6. On contiguous alternatives

The general contiguous alternatives in (4.2) and (4.3) are neither easy to illustrate nor convenient to work with. This was the reason for considering the alternatives in (4.7) and (4.8), which move the rows and columns “simultaneously” towards a pre-specified probability table π*. The choice of π*, however, is subjective. We could allow, for example, the row and column marginals to change at different rates, i.e.,

θ_i = θ^q_i + s_r (θ*_i − θ^q_i)/√n,   i = 1, ..., I

and

θ_{I+j} = θ^q_{I+j} + s_c (θ*_{I+j} − θ^q_{I+j})/√n,   j = 1, ..., J,

and evaluate either M_I(s_r, s_c, k) or M_I(s_r | s_c, k) and M_I(s_c | s_r, k). Asymptotically, we obtain

M_I(s_r, s_c, k) = Φ[ −( log k + (1/2)( s²_r ∑_{i=1}^{I} (θ*_i − θ^q_i)²/θ^q_i + s²_c ∑_{j=1}^{J} (θ*_{I+j} − θ^q_{I+j})²/θ^q_{I+j} ) ) / √(Var_q(Y)) ],

where

Y = s_r ∑_{i=1}^{I} ((θ*_i − θ^q_i)/θ^q_i) x_{i•} + s_c ∑_{j=1}^{J} ((θ*_{I+j} − θ^q_{I+j})/θ^q_{I+j}) x_{•j}.

Fig. 5. Probability of strong misleading evidence M_U(c_φ, k) for the likelihood of the U model for the probability table π_2.

If we consider alternatives only on one marginal (say, the row marginals, i.e., s_c = 0), then Y reduces to Y = s_r ∑_{i=1}^{I} ((θ*_i − θ^q_i)/θ^q_i) x_{i•} and the probability of misleading evidence converges to Φ[ −( log k/|s_r| + (1/2)|s_r| ∑_{i=1}^{I} (θ*_i − θ^q_i)²/θ^q_i ) / √(Var_q(Y)) ].

Similarly, for the association and correlation models, we could also consider different scales for the alternatives on the row and column scores. For the U model, which has just one parameter more than the independence model, it will be of interest to assume that only the association parameter φ changes, i.e., φ = φ^q + c_φ/√n, and the main effects are just rescaled as commented in Section 4.2. Due to this rescaling, the a_i's and b_j's are both multiplied by √n[ ( ∑_{i,j} π^q_{ij} exp(c_φ μ_i ν_j/√n) )^{−1/2} − 1 ]. In this case, Y_U reduces to Y_φ = ∑_{i,j} μ_i ν_j x_{ij}. Furthermore, by using the fact that lim_{n→∞} √n[ ( ∑_{i,j} π^q_{ij} exp(c_φ μ_i ν_j/√n) )^{−1/2} − 1 ] = −(c_φ/2) ∑_{i,j} μ_i ν_j π^q_{ij}, we conclude that M_U(c_φ, k) converges to

Φ( −( log k/|c_φ| + (1/2)|c_φ| A_φ ) / √(Var_q Y_φ) )

with

A_φ = ∑_{i,j} (μ_i ν_j)² π^q_{ij} − ( ∑_{i,j} μ_i ν_j π^q_{ij} )² = Var_q Y_φ,

and thus M_U(c_φ, k) is approximated by the bump function.

and thus MU(c�, k) is approximated by the bump function.Fig. 5 presents the plot of MU(c�, k) for the example considered for the U model, with �q = 0.8719.

7. Discussion

In the context of contingency tables, we have shown that the association and correlation models exhibit different behavior in terms of statistical evidence. Specifically, we have shown that the correlation models are not bounded by the maximum of the bump function while the association models are.

The extension of the results presented in this paper to the more general models RC(K) and CA(K) is straightforward. Rom and Sarkar (1992), Kateri and Papaioannou (1995) and Goodman (1996) have all introduced general classes of dependence models which express the departure from independence in terms of generalized measures and include association and correlation models as special cases. Thus, one could proceed further to determine the probability of strong misleading evidence for such generalized association models which, as one would expect, will be computationally quite involved. These results could also be generalized to log-linear, association and correlation models for multi-way tables. Work in these directions is currently in progress.

Acknowledgments

The authors thank the Natural Sciences and Engineering Research Council of Canada for funding this research. The authors also express their thanks to the referees for some critical comments and suggestions on an earlier version of this paper, which led to a considerable improvement in the presentation and discussion in this revised version.

References

Agresti, A., 2002. Categorical Data Analysis. Wiley, New York.
Becker, M.P., Clogg, C.C., 1989. Analysis of sets of two-way contingency tables using association models. J. Amer. Statist. Assoc. 84, 142–151.
Birch, M.W., 1963. Maximum likelihood in three-way contingency tables. J. Roy. Statist. Soc. Ser. B 25, 220–233.
Birnbaum, A., 1962. On the foundations of statistical inference (with discussion). J. Amer. Statist. Assoc. 57, 259–326.
Blume, J.D., 2002. Tutorial in biostatistics: likelihood methods for measuring statistical evidence. Statist. Med. 21, 2563–2599.
Blume, J.D., 2005. How to choose a working model for measuring the statistical evidence about a regression parameter. Internat. Statist. Rev. 73, 351–363.
Cressie, N., Pardo, L., Pardo, M., 2003. Size and power considerations for testing loglinear models using φ-divergence test statistics. Statist. Sinica 13, 555–570.
Gilula, Z., Krieger, A.M., Ritov, Y., 1988. Ordinal association in contingency tables: some interpretive aspects. J. Amer. Statist. Assoc. 83, 540–545.
Goodman, L.A., 1985. The analysis of cross-classified data having ordered and/or unordered categories: association models, correlation models and asymmetry models for contingency tables with or without missing entries. Ann. Statist. 13, 10–69.
Goodman, L.A., 1986. Some useful extensions of the usual correspondence analysis and the usual log-linear models approach in the analysis of contingency tables with or without missing entries. Internat. Statist. Rev. 54, 243–309.
Goodman, L.A., 1996. A single general method for the analysis of cross-classified data: reconciliation and synthesis of some methods of Pearson, Yule, and Fisher, and also some methods of correspondence analysis and association analysis. J. Amer. Statist. Assoc. 91, 408–428.
Hacking, I., 1965. Logic of Statistical Inference. Cambridge University Press, New York.
Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Amer. Statist. Assoc. 90, 773–795.
Kateri, M., Papaioannou, T., 1995. f-divergence association models. Internat. J. Math. Statist. Sci. 3, 179–203.
Rom, D., Sarkar, S.K., 1992. A generalized model for the analysis of association in ordinal contingency tables. J. Statist. Plann. Inference 33, 205–212.
Royall, R., 1997. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, New York.
Royall, R., 2000. On the probability of observing misleading statistical evidence (with discussion). J. Amer. Statist. Assoc. 95, 760–780.
Royall, R., Tsou, T.S., 2003. Interpreting statistical evidence by using imperfect models: robust adjusted likelihood functions. J. Roy. Statist. Soc. Ser. B 65, 391–404.
Stafford, J.E., 1996. A robust adjustment of the profile likelihood. Ann. Statist. 24, 336–352.