
Diagnostics for Assessing Regression Models

JOAN G. STANISWALIS and THOMAS A. SEVERINI*

* Joan G. Staniswalis is Assistant Professor, Department of Mathematical Sciences, University of Texas, El Paso, TX 79968-0514. Thomas A. Severini is Assistant Professor, Department of Statistics, Northwestern University, Evanston, IL 60208-4070. This research was supported in part by National Science Foundation Grant DMS 8717560. The authors are pleased to acknowledge the valuable comments of the associate editor, in particular those which led to Section 6 of this article. We also thank the referees for a careful reading of the manuscript.

Journal of the American Statistical Association, September 1991, Vol. 86, No. 415, Theory and Methods. © 1991 American Statistical Association.

We concern ourselves with diagnostics for checking the overall and local goodness of fit of a model s(x) used in the regression of Y on x ∈ U = [0, 1]^d. The model for s(x) is a functional form that depends on a finite number of unknown parameters. Two statistics, A and A_w, are proposed that measure the level of agreement between the model fit to the data and the nonparametric kernel estimator on m preselected points in U. Conditions are given under which A and A_w are asymptotically equivalent. Both of these statistics measure overall lack of fit and are related to the deviance. Their asymptotic distribution under the null model and under local alternatives is derived. This work is motivated by the local mean deviance plot of Landwehr, Pregibon, and Shoemaker for assessing overall lack of fit in logistic regression. Their plot is summarized by our test statistics and is extended to other likelihood-based regressions of Y on x.

KEY WORDS: Kernel estimator; Lack of fit; Local likelihood; Nonparametric deviance; Weighted likelihood.

1. INTRODUCTION

Let {(X_i, Y_i); i = 1, ..., n} denote independent random variables, where X ∈ U = [0, 1]^d is a vector of explanatory variables and Y is a real-valued response variable. Here X_1, ..., X_n are either iid random variables or design variables on a lattice. The distribution of Y | X = x is assumed to be a member of a family of distributions depending on the parameter s(x) (x ∈ U). Let f(Y | s(x)) denote the density of Y | X = x, where f is known. We concern ourselves with developing diagnostic procedures for checking the goodness of fit of a parametric function for s(x) in likelihood-based models. Two statistics are proposed for detecting global lack of fit of the parametric model against a smooth nonparametric alternative. Exploratory plots are defined for locating local lack of fit when global lack of fit is detected.

In parametric regression, a functional form for s(x) is postulated that depends upon a vector v of q < n unknown parameters. The data (x_i, y_i) (i = 1, ..., n) are reduced to a vector v̂ of q parameter estimates, which frequently have physical interpretations. If the postulated parametric model does not provide a good approximation to the regression function within the experimental region, however, then the subsequent inferences are unreliable. Most introductory courses on regression analysis emphasize the need to examine the parametric regression equation for lack of fit. Statistics are derived for detecting lack of fit when Y | X follows a normal distribution and repeat observations of Y at fixed levels of the explanatory variables are available. When repeat observations are not available or when the distribution of Y | X is not normal, then a generalized likelihood ratio statistic may be used. The null hypothesis is the postulated parametric regression function, and the alternate hypothesis is an extension of the null model. The extension is such that the null space is strictly contained within the alternate space. The generalized likelihood ratio statistic is the difference between the deviance under the null (reduced) model and the deviance under the alternate (extended) model. As in Landwehr, Pregibon, and Shoemaker (1984), we define the deviance to be negative two times the maximized log-likelihood. Under certain regularity assumptions, the absolute difference between these deviances is asymptotically chi-squared distributed. The corresponding degrees of freedom are the difference between the dimension of the alternate space and the dimension of the null space. This approach and others with nonnested hypotheses (Kent 1986) force the statistician to specify a parametric class of alternatives with interesting deviations from the null model. In practice, the parametric alternative is usually an extension of the null model arrived at by saturating the null model with additional parameters. To our knowledge, this dependence of the class of alternatives on the sample size n is not taken into account.
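To make the deviance comparison concrete, the following is a minimal sketch of a generalized likelihood ratio (deviance) test for a pair of nested Poisson regressions; the simulated data, the statsmodels-based fitting, and all variable names are illustrative assumptions, not part of the original study.

    import numpy as np
    from scipy.stats import chi2
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    x = rng.uniform(size=n)
    y = rng.poisson(np.exp(1.0 + 2.0 * x))          # hypothetical response

    # Null (reduced) model: log s(x) linear in x; alternate (extended) model:
    # the null model saturated with one additional quadratic parameter.
    X0 = sm.add_constant(x)
    X1 = sm.add_constant(np.column_stack([x, x ** 2]))
    fit0 = sm.GLM(y, X0, family=sm.families.Poisson()).fit()
    fit1 = sm.GLM(y, X1, family=sm.families.Poisson()).fit()

    # The difference of deviances (-2 times the maximized log-likelihood, up
    # to a term that cancels) is referred to a chi-squared distribution whose
    # degrees of freedom equal the difference in the dimensions of the spaces.
    lr = fit0.deviance - fit1.deviance
    df = X1.shape[1] - X0.shape[1]
    p_value = chi2.sf(lr, df)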

Tibshirani and Hastie (1987) proposed that their local likelihood theory be used to examine the parametric regression function for lack of fit against a nonparametric alternative. Other approaches for checking the validity of the parametric regression equation against nonparametric alternatives include Cox, Koh, Wahba, and Yandell (1988), Bjerve, Doksum, and Yandell (1985), and Eubank and Spiegelman (1990).

Our approach is most closely related to the approach suggested by Azzalini, Bowman, and Härdle (1989) and was motivated by the work of Landwehr et al. (1984). We propose two statistics A and A_w for detecting global lack of fit of a parametric regression equation using a nonparametric alternative from a likelihood perspective. For an illustration of the ideas presented, the following data are introduced. The goal of this study was to determine the optimum levels of two chemicals, phorbol dibutyrate (PDBu) and ionomycin (I), for enhancing the in vitro proliferation of T lymphocytes (T cells). This was part of an ongoing investigation by Carl McCrady at the Medical College of Virginia for the development of an immunotherapeutic procedure for the treatment of immune disorders such as AIDS. The response variable was mitogenic counts per half second and is related to the number of T cells present. The explanatory variables were the log10 levels of PDBu and I. Eight levels of I were used in combination with thirteen levels of PDBu for a total of n = 104 data points.



The conditional distribution of the mitogenic counts was modeled as a Poisson distribution with parameter s(x). The kernel estimate ŝ(x) of s(x) is given in Figure 1. Initially, we tried to model ln(s(x)) as quadratic in the log10 levels of PDBu and I (Staniswalis and McCrady 1988). Our statistics for detection of lack of fit measure the deviation of the kernel estimator ŝ(·) from E_v̂[ŝ(·)], the expected value of ŝ(·) conditional on ln[s(·)] following a quadratic model with parameters v̂, on m preselected points in U.

Section 2 develops the asymptotic distribution of A under the null model and under local alternatives. Sufficient conditions for the asymptotic equivalence of A and A_w are also provided there. All the results derived in Section 2 are obtained under the assumption that m is fixed and does not depend on n. Section 3 deals with computational aspects. Section 4 defines some diagnostic plots for locating local lack of fit. Section 5 has the result of a small simulation study. Section 6 develops the asymptotic distribution of A and A_w under the assumption that m grows with n. The results of a small simulation study with m = n and t_i = x_i (i = 1, ..., n) are reported.

2. DETECTION OF GLOBAL LACK OF FIT

Of interest is a test of

H_0: s ∈ S   versus   H_1: s ∈ C^k(U).    (1)

C^k(U) is the set of real-valued functions on U with continuous partial derivatives of order k ≥ 2. It is assumed throughout that S is a subset of C^k(U) of the form S = {s(·; v); v ∈ Ω ⊂ R^q}, where s(·; v) is a postulated functional form on U that depends on a vector v of q unknown parameters. Our test of (1) compares an estimate of the regression function under the null model with a nonparametric estimate of the regression function at m preselected points t_1, ..., t_m, which form a lattice in U.

Under H_0, s(x) = s(x; v_0) for some v_0 ∈ Ω. The estimator v̂ used in this development must converge to v_0 at parametric rates under H_0; that is, for α ∈ (0, 1), n^{α/2}(v̂ − v_0) →_p 0, where →_p indicates convergence in probability. An example of such an estimator is the maximum likelihood estimator.

Figure 1. The Kernel Estimate of the T Cell Proliferative Response With a Cross-Validated Choice of the Bandwidth; b = .20, k = 2. [Surface plot over log10(PDBu) and log10(I).]

Let

L_n(v) = Σ_{i=1}^n log f(Y_i | s(x_i; v))

be the log-likelihood under the null model. The maximum likelihood estimate (MLE) v̂ is the maximizer of L_n(v) with respect to v ∈ Ω. For any x ∈ U the maximum likelihood estimate of s(x) under H_0 is simply s(x; v̂).

To obtain a nonparametric estimate of s(x) the nonparametric MLE of Staniswalis (1989) is used. Specifically, the nonparametric MLE ŝ(x) of s(x) maximizes the weighted log-likelihood

L_w[θ; x] = Σ_{i=1}^n W((x − x_i)/b) log f(Y_i | θ)

with respect to the variable θ, where W(·) is a real-valued function on R^d with compact support and b is a positive scalar. W(·) and b are referred to as the kernel and bandwidth, respectively. Note that there is a nonparametric MLE for each b and W.
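For the Poisson family used with the T-cell data, the weighted log-likelihood has a closed-form maximizer: the kernel-weighted mean of the responses. The following is a minimal sketch under that assumption, with the product quartic kernel of Section 3.3 filled in for concreteness; the helper names are ours, not the paper's.

    import numpy as np

    def quartic_kernel(u):
        """Product kernel built from w(v) = (15/16)(1 - v^2)^2 on [-1, 1]."""
        u = np.atleast_2d(u)
        w = np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)
        return w.prod(axis=1)

    def s_hat_poisson(x, X, Y, b):
        """Nonparametric MLE of s(x): maximizes sum_i W((x - x_i)/b) log f(Y_i | theta)
        for Poisson log f(y | theta) = y log(theta) - theta - log(y!)."""
        w = quartic_kernel((x - X) / b)
        return np.sum(w * Y) / np.sum(w)   # kernel-weighted mean of the Y_i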

The following are conditions that are used throughout the development of this article.

1. The marginal and conditional density functions defined here exist with respect to either the Lebesgue measure or the counting measure on the domain of the random variable of interest.

2. The kernel W is a direct product of a kernel (Müller 1980) w: R → R with compact support on [−1, 1] and of order k (Müller 1984).

3. The bandwidth b ∈ (0, .5) satisfies b → 0 and nb^{3d} → ∞ as n → ∞.

4. The density f(y | s(x)) is such that log f(y | s(x)) has three continuous partial derivatives with respect to s(x), and there exist integrable H_i(y) such that E[|H_i(Y)|²] < ∞ and, for all x ∈ U and i = 1, 2, 3, |∂^i log f(y | s(x))/∂s(x)^i| ≤ H_i(y).

5. If X is random with density g, then set k = 2 and let g be bounded away from zero with continuous first- and second-order partial derivatives on U. If X is a design variable, then set k ≥ 2 and g(x) = 1 in the development.

Let the operation of expectation conditional on s(x) = s(x; v) be denoted by E_v. Section 3.5 addresses the estimation of E_v̂(ŝ(x)). The first proposed test statistic is

A_w = 2 Σ_{j=1}^m g(t_j){L_w[ŝ(t_j); t_j] − L_w[E_v̂(ŝ(t_j)); t_j]} w_j^{-1},

where w_x = Σ_{i=1}^n W((x − x_i)/b) and w_j = w_x |_{x=t_j}. An empirical estimate of E_X{E_{Y|X}[log f(Y | X)]} is m^{-1} Σ_{j=1}^m g(t_j) L_w[ŝ(t_j); t_j] w_j^{-1}. The statistic A_w is a measure of the difference between this nonparametric deviance and its limit in probability under the null model. Under no smoothing, if Y | X = x is normally distributed with known variance σ², {t_1, ..., t_m} is the support of x_1, ..., x_n, and an equal number of replicated observations of Y are available at each t_i (i = 1, ..., m), then (n/m)A_w reduces to the usual "lack-of-fit" sum of squares scaled by σ².
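A sketch of how A_w could be assembled on the lattice for the Poisson case, reusing the hypothetical quartic_kernel helper above; the plug-in E_v̂(ŝ(t_j)) is obtained by smoothing the fitted null-model values as in Section 3.5, and g_hat stands for a density estimate or the known g.

    import numpy as np

    def poisson_loglik(y, theta):
        """log f(y | theta) for Poisson, dropping log(y!), which cancels in A_w."""
        return y * np.log(theta) - theta

    def compute_A_w(T, X, Y, s_null_fitted, b, g_hat):
        """A_w = 2 sum_j g(t_j) { L_w[s_hat(t_j); t_j] - L_w[E_vhat(s_hat(t_j)); t_j] } / w_j."""
        total = 0.0
        for t in T:
            w = quartic_kernel((t - X) / b)                 # kernel weights at t_j
            s_hat = np.sum(w * Y) / np.sum(w)               # nonparametric MLE
            e_hat = np.sum(w * s_null_fitted) / np.sum(w)   # plug-in E_vhat(s_hat(t_j))
            diff = np.sum(w * (poisson_loglik(Y, s_hat) - poisson_loglik(Y, e_hat)))
            total += 2.0 * g_hat(t) * diff / np.sum(w)
        return total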


The other statistic proposed for testing (1) is

A = Σ_{j=1}^m Î(t_j) g(t_j) {ŝ(t_j) − E_v̂[ŝ(t_j)]}²,

where

Î(x) = −Ê(∂² log f(Y | θ)/∂θ² | X = x) |_{θ=ŝ(x)},

with Ê denoting the corresponding kernel-weighted sample average, is an estimate of the information

I(x) = −E(∂² log f(Y | θ)/∂θ² | X = x) |_{θ=s(x)}.

The remainder of this section is devoted to specifying the asymptotic distribution of A and A_w and providing a sufficient condition for their asymptotic equivalence when m is fixed. The proofs of the theorems are in the Appendix.

The limiting distribution of A under the null hypothesis is given in our first theorem.

Theorem 2.1. Under H_0, (nb^d/β_2)A →_D χ²_m, where χ²_m represents a random variable having a chi-squared distribution with m degrees of freedom, "→_D" denotes convergence in distribution, and β_2 = [∫_{-1}^{1} w²(u) du]^d.
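In code, the calibration of Theorem 2.1 is short; this fragment assumes the statistic A (or A_w), the sample size n, the bandwidth b, the dimension d, and the lattice size m are available from the steps sketched above.

    import numpy as np
    from scipy.stats import chi2

    # beta_2 = (integral of w(u)^2 du)^d for the product quartic kernel;
    # numerically this is (5/7)^d.
    u = np.linspace(-1.0, 1.0, 10001)
    w = (15.0 / 16.0) * (1.0 - u ** 2) ** 2
    beta2 = np.trapz(w ** 2, u) ** d

    stat = n * b ** d * A / beta2        # (n b^d / beta_2) A from Theorem 2.1
    p_value = chi2.sf(stat, df=m)        # small p_value indicates lack of fit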

Theorem 2.2 provides the distribution of A under local alternatives of the form

(nb^d)^{1/2}[s(x) − s(x; v_0)] → δ(x).    (2)

The power of the global goodness-of-fit statistic (nb^d/β_2)A depends on the lattice t_1, ..., t_m through the density g(·), the local deviations δ(·) from the null model, and the information I(·).

Theorem 2.2. Under assumption (2), (nb^d/β_2)A →_D χ²_m(Λ), where χ²_m(Λ) = Σ_{j=1}^m χ²_1(Λ_j) is a noncentral chi-squared distributed random variable with noncentrality parameter Λ = Λ_1 + ··· + Λ_m and m degrees of freedom. Here χ²_1(Λ_1), ..., χ²_1(Λ_m) represent independent noncentral chi-squared distributed variables with one degree of freedom and noncentrality parameters Λ_j = g(t_j)δ²(t_j)I(t_j)/β_2 (j = 1, ..., m).

Note that parametric local alternatives converging to the null at a rate of n^{-1/2} are not detectable with a test based on A. Neither are alternatives with δ(t_i) = 0 for i = 1, ..., m. The latter situation may be detected with the local chi-squared goodness-of-fit plot introduced in Section 4 or entirely avoided by allowing m to increase with n as in Section 6. Alternatives with δ(t_i) = 0 for i = 1, ..., m are typically not of much interest when m is large.

The last result in this section relates A to A_w. It has the implication that the statistics A and A_w are asymptotically equivalent under the null hypothesis and under certain local alternatives.

Theorem 2.3. If sup_x |s(x; v̂) − s(x)| →_p 0, then (A_w − A)/A →_p 0.

3. COMPUTATIONAL ASPECTS

3.1 Unknown g

In practice the density function g(·) is unknown, in which case A and A_w are calculated by replacing g(t_j) with g_n(t_j) (j = 1, ..., m). The kernel density estimate (Rosenblatt 1971)

g_n(t) = (1/(nb^d)) Σ_{i=1}^n W((t − x_i)/b)

is adequate for these purposes. The asymptotic results in Section 2 still hold because the kernel estimator of g provides a consistent estimate on the lattice of points t_1, ..., t_m under the assumptions (3) and (5) on the bandwidth and the smoothness of g.
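A direct transcription of g_n, again with the hypothetical quartic_kernel helper:

    import numpy as np

    def g_n(t, X, b):
        """Rosenblatt estimate g_n(t) = (1/(n b^d)) sum_i W((t - x_i)/b)."""
        n, d = X.shape
        return np.sum(quartic_kernel((t - X) / b)) / (n * b ** d)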

3.2 Choosing b

We recommend choosing b through cross-validation. That is, choose b to maximize Σ_{j=1}^n log f(Y_j | ŝ^{(j)}(x_j)), where ŝ^{(j)}(x_j) maximizes

Σ_{x_i ∉ N(x_j)} W((x_j − x_i)/b) log f(Y_i | θ)

with respect to θ. Here, N(t) denotes a subset of U centered at t with P(X ∈ N(t)) = 2^d/n. If x_1, ..., x_n form a lattice in U, then 2^d/n is the Lebesgue measure of the largest cube centered at x_j containing no other points from the lattice. The optimality properties of this bandwidth selection procedure in this setting are not known, but it does work well on our simulated and real data sets. In our examples, the "leave-one-out" cross-validated bandwidth repeatedly undersmoothed the data.
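A sketch of this neighborhood-deleted cross-validation for the Poisson case; the candidate grid for b and the cube of side 2n^{-1/2} per coordinate (the choice used in the Section 5 simulations) are assumptions made for the example.

    import numpy as np

    def cv_score(b, X, Y, half_width):
        """Sum over j of log f(Y_j | s_hat^{(j)}(x_j)), where s_hat^{(j)} deletes
        all observations inside the cube N(x_j) before smoothing."""
        score = 0.0
        for j in range(len(Y)):
            keep = np.any(np.abs(X - X[j]) > half_width, axis=1)  # x_i outside N(x_j)
            w = quartic_kernel((X[j] - X[keep]) / b)
            theta = np.sum(w * Y[keep]) / np.sum(w)               # Poisson weighted MLE
            score += Y[j] * np.log(theta) - theta
        return score

    half_width = len(Y) ** -0.5
    b_cv = max(np.linspace(0.1, 0.5, 9), key=lambda b: cv_score(b, X, Y, half_width))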

3.3 The Kernel

We suggest selecting a kernel of order k that satisfies the boundary conditions w^{(j)}(−1) = w^{(j)}(1) = 0 for j = 1, ..., k − 1 to ensure that the kth order partial derivatives of ŝ(t) are consistent estimators of the respective kth order derivatives of s(t) when x is a design variable. Müller (1984) provides such kernels with compact support on [−1, 1]. The kernel used for the data analysis and simulations in this presentation used k = 2 and

w(v) = (15/16)(1 − v²)²,   v ∈ [−1, 1],
w(v) = 0,   otherwise.
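As a check (a direct computation, not part of the original text), this kernel integrates to one, has a vanishing first moment and a nonvanishing second moment, and meets the stated boundary conditions:

∫_{-1}^{1} w(v) dv = (15/16)·(16/15) = 1,   ∫_{-1}^{1} v·w(v) dv = 0,   ∫_{-1}^{1} v²·w(v) dv = 1/7 ≠ 0,

and w(±1) = w′(±1) = 0, so the kernel has order k = 2. Moreover ∫_{-1}^{1} w²(v) dv = 5/7, so that β_2 = (5/7)^d for the product kernel.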

3.4 The Lattice

We need ŝ(t_1), ..., ŝ(t_m) mutually statistically independent for a given n and b. The kernel estimators for the points t_1, ..., t_m in the lattice must have nonoverlapping windows. They should also be chosen to avoid boundary effects. If t is in the boundary, then β_2 is not the correct normalization factor for A(t) and A_w(t), yielding a conservative goodness-of-fit test. Moreover, the fact that alternatives with δ(t_i) = 0 (i = 1, ..., m) cannot be detected must be taken into consideration when selecting the lattice. In practice, one wants to use m = (2b)^{-d} or even m = n as Eubank and Spiegelman (1990) did under a normality assumption. The reader is referred to Section 6 of this article for our development when m is allowed to depend on n.

3.5 Computing E_v̂[ŝ(x)]

If assumption (4) is replaced with

4′. The density f(y | s(x)) satisfies: log f(y | s(x)) has 2k continuous partial derivatives with respect to s(x), and there exist integrable H_i(y) such that E[|H_i(Y)|²] < ∞ and |∂^i log f(y | s(x))/∂s(x)^i| ≤ H_i(y) for all x ∈ U and i = 1, ..., 2k,

then (Staniswalis 1989)

E_v̂[ŝ(x)] = Σ_{i=1}^n W((x − x_i)/b) s(x_i; v̂) w_x^{-1} + O(b^k).

If nb^{d+2k} → 0 as n → ∞, then the expected value of ŝ(·) under the estimated null model may be approximated by smoothing s(·; v̂) and ignoring the smaller order terms. For the mitogenic count data presented here, the preceding approximation is in fact exact.
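In code this plug-in approximation is a single kernel smooth of the fitted null-model values; s_null below is a hypothetical function returning s(x; v̂):

    import numpy as np

    def E_vhat_s_hat(x, X, b, s_null):
        """Approximate E_vhat[s_hat(x)] by smoothing s(x_i; v_hat), ignoring the O(b^k) term."""
        w = quartic_kernel((x - X) / b)
        return np.sum(w * s_null(X)) / np.sum(w)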

4. LOCATING LOCAL LACK OF FIT

In this section some local goodness-of-fit diagnostic plots are defined for locating lack of fit when overall lack of fit is detected with A_w. Note that the same can be done with A as well.

For x ∈ U, set

A_w(x) = 2{L_w[ŝ(x); x] − L_w[E_v̂(ŝ(x)); x]}.

Under the null model, A_w(x)/β_2 →_D χ²_1, where χ²_1 represents a chi-squared distributed random variable with one degree of freedom. Under alternate models satisfying (2), A_w(x)/β_2 →_D χ²_1(λ), where λ = g(x)δ²(x)I(x)/β_2, and χ²_1(λ) represents a noncentral chi-squared distributed variable with 1 df and noncentrality parameter λ.

This suggests plotting A_w(x)/β_2 versus x_i for x ∈ U, where x = (x_1, ..., x_d). Such a plot is constructed for one or all of i ∈ {1, ..., d} by varying the ith variable x_i while fixing the other d − 1 variables. On each plot, imagine a horizontal line at the 100(1 − α)th percentile of a chi-squared distribution with 1 df. Local lack of fit is indicated when a cluster of points falls above the horizontal line. This is illustrated for the data introduced in Section 1.
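A hedged sketch of the plot construction, assuming the helpers from the earlier sketches; the slice through the second coordinate and all names are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import chi2

    def A_w_local(x, X, Y, s_null_fitted, b):
        """A_w(x) = 2 { L_w[s_hat(x); x] - L_w[E_vhat(s_hat(x)); x] } for the Poisson case."""
        w = quartic_kernel((x - X) / b)
        s_hat = np.sum(w * Y) / np.sum(w)
        e_hat = np.sum(w * s_null_fitted) / np.sum(w)
        return 2.0 * np.sum(w * (poisson_loglik(Y, s_hat) - poisson_loglik(Y, e_hat)))

    grid = np.linspace(0.05, 0.95, 40)
    vals = [A_w_local(np.array([u, 0.5]), X, Y, s_null_fitted, b=0.2) / beta2 for u in grid]
    plt.scatter(grid, vals)                            # A_w(x)/beta_2 versus x_1
    plt.axhline(chi2.ppf(0.95, df=1), linestyle="--")  # 100(1 - alpha)th percentile, alpha = .05
    plt.xlabel("x1")
    plt.ylabel("A_w(x)/beta_2")
    plt.show()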

The local goodness-of-fit plot of A_w(x)/β_2 versus the log10 levels of the chemical PDBu (Figure 2) indicates a lack of fit of the quadratic model. This is largely due to the fact that the value of the response variable drops off rapidly and then levels off near zero for low doses of PDBu and I. Consequently, one cannot expect the quadratic model to hold for the entire experimental region, but a quadratic approximation ought to hold in a neighborhood of the optimal dose of PDBu and I. The local goodness-of-fit plot is of no use in selecting this neighborhood.

Figure 2. The Local Chi-Squared Goodness-of-Fit Plot Based on A_w for the Quadratic Model Fit to ln(s(x)). Legend: the levels 1, 2, ..., 8 correspond to the following levels of log10(I): −7.92, −7.43, −6.96, −6.48, −6.00, −5.52, −5.05, −4.57.

Figure 1 was used to select a subset R of the experimental region within which the response variable displayed quadratic behavior. The subset R that was selected by trial and error was a rectangular region with log10 PDBu ∈ [−9.01, −7.20] and log10 I ∈ [−6.96, −6.00]. A quadratic model was again fit to ln(s(x)) using the experimental data corresponding to doses in R. Figure 3 has the local goodness-of-fit plot based on A_w. It indicates that the quadratic model agrees with the kernel estimate within the subset R of the experimental region, but outside of R there is an indication of lack of fit. Note, however, that outside of R the kernel estimate (Figure 1) mostly takes on values close to zero. Therefore, by Theorem 2.2 and because I(x) = s(x)^{-1} for the Poisson model, the statistic A_w(x)/β_2 will lead to the rejection of any null model that deviates even slightly from the kernel estimate outside of R. For the purpose of estimating the location of the optimal dose of PDBu and I, the second quadratic model is adequate.

Figure 3. The Local Chi-Squared Goodness-of-Fit Plot Based on A_w for the Quadratic Model Fit to ln(s(x)) Using the Subset of the Data.

5. SIMULATIONS

A small computer simulation study was performed to explore the pointwise power and the pointwise size of the local goodness-of-fit test based on the chi-squared approximation to the finite-sample distribution of A_w(x)/β_2 and A(x)/β_2 suggested in Section 4. The simulations were conducted in FORTRAN on a VAX 8650 at the Medical College of Virginia, Virginia Commonwealth University, using the nominal level of α = .05. This example uses d = 2 with X_1 and X_2 independent uniform random variables on [0, 1]. The set N(t), for t ∈ U, was taken to be the square centered at t with sides of length 2n^{-1/2}. The Y | X = x were taken to be Bernoulli random variables with parameter s(x),

log(s(x)/[1 − s(x)]) = 4(x_1 + x_2) − 12x_1x_2    (3)

for x = (x_1, x_2). The logistic regression models

log(s(x)/[1 − s(x)]) = b_0 + b_1x_1 + b_2x_2 + b_{12}x_1x_2    (4)

and

log(s(x)/[1 − s(x)]) = b_0 + b_1x_1 + b_2x_2    (5)

were fit to the data generated with the uniform and binomial random number generators in SAS. Five hundred realizations of the parametric MLE (using PROC CATMOD in SAS) and the nonparametric MLE of s(x) were generated.
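A compact Python transcription of this simulation design (the original used FORTRAN and SAS PROC CATMOD; statsmodels stands in here as an assumption):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 300
    X = rng.uniform(size=(n, 2))
    logit_true = 4.0 * (X[:, 0] + X[:, 1]) - 12.0 * X[:, 0] * X[:, 1]   # Equation (3)
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_true)))

    # Null models (4) and (5): with and without the interaction term.
    X4 = sm.add_constant(np.column_stack([X, X[:, 0] * X[:, 1]]))
    X5 = sm.add_constant(X)
    s4 = sm.Logit(Y, X4).fit(disp=0).predict(X4)   # fitted s(x; v_hat) under (4)
    s5 = sm.Logit(Y, X5).fit(disp=0).predict(X5)   # fitted s(x; v_hat) under (5)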

Table 1 lists the number of realizations out of 500, as a function of x, that incorrectly rejected the null model given by Equation (4) based on the local goodness-of-fit plots for n = 300. The average cross-validated bandwidth was .34. The simulated size of the local goodness-of-fit test is listed in Table 1 for a 10 × 10 grid of points x = (x_1, x_2) in the unit square. The test based on A_w(x)/β_2 is more conservative than the test based on A(x)/β_2, and the size of both tests varies with x. Table 1 is symmetric about x_1 = x_2, as is the surface chosen for the simulation. Our asymptotic results predict that the size of the local goodness-of-fit test based on A_w(x)/β_2 or A(x)/β_2 should be independent of x. The variation of the size of the local goodness-of-fit test as a function of x reflects the influence of the data points on the parametric MLE. The pattern is due to the fact that the null model was fit to the data using maximum likelihood techniques. The data points that correspond to values of s(x) near 0 or 1 have smaller variance and therefore more influence on the parametric MLE than those data points that correspond to values of s(x) near 1/2. The parametric MLE was most influenced by the data near the corner x = (1, 1), where the value of s(x) is near zero. The parametric MLE fit the data in this corner well at the expense of the data corresponding to other values of x.

Table 2 lists the number of realizations out of 500, as a function of x, that correctly rejected the null model given by Equation (5) based on the local goodness-of-fit plots for n = 300. The simulated power of the local goodness-of-fit test is listed in Table 2 for a 10 × 10 grid of points x = (x_1, x_2) in the unit square. The power of A(x)/β_2 and the power of A_w(x)/β_2 are comparable in magnitude but vary with x. Given the results in Section 2, we expect the power of the test to vary with x. Figure 4 is a plot of Δ(x) = s(x; v̄) − s(x), where v̄ is the average of the parameter estimates over the 500 realizations; Δ(x) is an estimate of the limit (in probability) of s(x; v̂) − s(x). The power of the statistic A_w(x)/β_2 is governed by both I(x) and Δ(x), whereas the power of the statistic A(x)/β_2 is governed primarily by I(x). Evidence of this is seen in Table 2. There the power of the goodness-of-fit test based on A(x)/β_2 is largest for x in a neighborhood of (1, 0), (0, 1), and (1, 1), where I(x) is the largest. However, the power of the goodness-of-fit test based on A_w(x)/β_2 is largest at x = (0, 0), where Δ(x) takes on its largest value, .46, and I(x) is not so large. Of the four corners in the unit square, the data corresponding to the corner (0, 0) have the least influence on the computed value of the parametric MLE and, therefore, the largest value of Δ(x). At x = (1, 0), (0, 1), and (1, 1), Δ(x) takes on the values −.25, −.25, and .22, respectively. The effect of Δ(x) on the power of A_w(x)/β_2 is large enough to overcome the loss in power, relative to the power of A(x)/β_2, due to the local averaging of I(x) done implicitly through L_w[ŝ(x); x].

Figure 4. An Estimate of the Limit in Probability of s(x; v̂) − s(x), Where s(x; v) Is Given by Equation (5).

Table 1. Number of Realizations Out of 500 Incorrectly Rejecting the Model Given by Equation (4) at the .05 Nominal Level, n = 300

x_1\x_2    .09      .18      .27      .36      .46      .55      .64      .73      .82      .91
 .09     1 (8)    1 (3)    3 (9)    6 (13)  12 (17)  14 (22)   5 (19)   1 (10)   1 (23)   2 (34)
 .18     3 (7)    1 (1)    3 (6)    4 (10)   8 (9)    6 (12)   5 (12)   0 (4)    1 (9)    1 (18)
 .27     4 (11)   4 (6)    5 (9)    9 (13)   7 (10)   6 (6)    6 (13)   1 (5)    0 (5)    1 (7)
 .36    11 (18)   7 (11)  11 (11)  12 (14)  10 (12)  10 (16)   6 (10)   3 (5)    1 (3)    2 (3)
 .46    13 (29)  10 (16)   8 (9)   12 (16)  11 (18)  16 (19)   7 (15)   4 (4)    4 (3)    0 (5)
 .55    14 (25)   8 (12)   7 (12)   9 (9)    7 (13)   8 (9)    8 (7)    6 (8)    4 (6)    5 (5)
 .64     5 (18)   4 (9)    4 (5)    1 (3)    3 (6)    4 (8)    6 (6)    8 (10)   4 (5)    4 (5)
 .73     3 (19)   3 (11)   3 (8)    2 (3)    2 (2)    3 (5)    1 (2)    4 (4)    0 (3)    1 (6)
 .82     0 (19)   2 (11)   3 (7)    4 (5)    2 (3)    3 (3)    1 (2)    2 (2)    4 (6)    2 (7)
 .91     3 (29)   3 (11)   4 (7)    1 (5)    3 (3)    3 (5)    3 (4)    2 (3)    1 (8)    1 (13)

NOTE: The first number listed in each cell corresponds to A_w(x)/β_2; the number in parentheses corresponds to A(x)/β_2.


Table 2. Number of Realizations Out of 500 Correctly Rejecting the Model Given by Equation (5) at the .05 Nominal Level, n = 300

x_1\x_2      .09        .18        .27        .36        .46        .55        .64        .73        .82        .91
 .09    466 (419)  459 (408)  362 (291)  141 (77)    20 (20)    29 (47)   126 (200)  311 (391)  439 (469)  447 (476)
 .18    456 (405)  447 (373)  317 (244)  103 (62)    16 (10)    20 (39)   113 (177)  316 (367)  441 (461)  454 (475)
 .27    357 (263)  323 (242)  193 (125)   63 (39)    10 (11)    18 (34)    83 (112)  211 (270)  334 (388)  383 (415)
 .36    127 (75)   106 (62)    60 (37)    33 (23)    18 (18)    27 (31)    43 (63)    86 (113)  139 (172)  148 (194)
 .46     22 (23)    16 (13)    15 (14)    19 (17)    18 (23)    20 (25)    20 (28)    24 (31)    18 (26)    21 (29)
 .55     27 (57)    24 (45)    16 (33)    19 (24)    14 (20)     8 (13)     6 (8)     10 (10)     9 (13)    10 (12)
 .64    131 (210)  122 (166)   82 (117)   35 (53)     7 (17)     5 (7)     15 (16)    53 (55)   101 (108)  128 (142)
 .73    312 (378)  309 (361)  211 (262)   84 (113)   11 (20)     7 (10)    37 (40)   165 (172)  300 (326)  320 (349)
 .82    417 (452)  416 (448)  321 (366)  148 (189)   16 (21)     7 (8)     74 (85)   293 (311)  433 (443)  440 (460)
 .91    436 (462)  435 (463)  355 (406)  167 (206)   15 (31)     8 (11)   106 (117)  318 (357)  438 (459)  452 (464)

NOTE: The first number listed in each cell corresponds to A_w(x)/β_2; the number in parentheses corresponds to A(x)/β_2.


6. LETTING m GROW WITH n

This section is devoted to specifying the asymptotic distribution of A when m is allowed to grow with n. Following are assumptions, in addition to (1)-(5) of Section 2, that are needed for the theorems in this section. Conditions (3′) and (4′) are needed in addition to conditions (3) and (4). Set

B_n(t) = Σ_{i=1}^n W((t − x_i)/b) ∂² log f(Y_i | θ)/∂θ² |_{θ=s(x_i)} w_t^{-1}

and assume that

3′. b^{-d/2}(ŝ(t) − s(t)), b^{-d/2}(Î(t) − I(t)), b^{-d/2}(g_n(t) − g(t)), and b^{-d/2}(B_n(t) + I(t)) converge in probability to zero uniformly for t ∈ U as n → ∞.

4′. E[|H_i(Y)|⁴] < ∞, i = 1, 2.

6. I(t)^{-1} is bounded above for t ∈ U.

7. For j = 2, 3, 4, 6, and 8 and i = 1, ..., n, let z_i = ∂ log f(Y_i | θ)/∂θ |_{θ=s(x_i)} and u_{ji} = E(z_i^j | X_i = x_i). The u_{ji} are assumed uniformly bounded in i, j, n.

8. The grid points are chosen so that the probability mass function with mass 1/m at each grid point provides a discrete approximation to the density g. If g is unknown, then set t_i = x_i (i = 1, ..., n) with n = m.

Theorems 6.1 and 6.2 provide the asymptotic distributions of A under the null model and under local alternatives. Their proofs use Theorem 2.1 of de Jong (1987). Only the proof of Theorem 6.1 is included in the Appendix. Theorems 6.1 and 6.2 hold for A_w in place of A under condition (3′).

Theorem 6.1. Under the null model,

[(nb^{d/2})/(m(2V)^{1/2})][A − E_{v_0}(A)] →_D N(0, 1),

where V = Σ_{i≠j} E_{X_i,X_j}[ρ²(X_i, X_j)]/(n²b^d),

ρ(u, v) = ∫_{D(u,v)} W(r + (u − v)/(2b)) W(r − (u − v)/(2b)) dr   if max_i |u_i − v_i| < 2b,
ρ(u, v) = 0   otherwise,

and D(u, v) = {r : r = [s − (u + v)/2]/b for some s ∈ U}.

Theorem 6.2. Under local alternatives of the form

[(nb^d)(b^{d/2})]^{1/2}[s(t; v_0) − s(t)] →_p δ(t),

[(nb^{d/2})/(m(2V)^{1/2})][A − E_{v_0}(A)] − Λ_n →_D N(0, 1),

where Λ_n = Σ_{j=1}^m I(t_j)g(t_j)δ*²(t_j)/(2V)^{1/2} and δ*(x) = Σ_{i=1}^n W((x − x_i)/b)δ(x_i) w_x^{-1}.

An approximation to V is 2^d β_2², and an approximation to E_{v_0}(A) is mβ_2/(nb^d), in which case an approximate global goodness-of-fit test statistic is given by

G = (nb^d A/β_2 − m)/[2m²(2b)^d]^{1/2}.    (6)

Let G_w denote the statistic that results when A is replaced with A_w in Equation (6). For the simulated data of Section 5, the power and size of a global goodness-of-fit test based on the standard normal approximation to the right tail of the distribution of G_w (G) under the null model was explored. For m = n = 300, t_i = x_i (i = 1, ..., n), and nominal level α = .05, the size of the test was .008 (.02) and the power was .42 (.43) for G_w (G), respectively. Unfortunately, the histogram of the 500 simulated values of G_w (G) looked more like a χ² distribution than a N(0, 1) distribution. This indicates that with n = 300, d = 2, and our average cross-validated bandwidth of .34, Theorem 6.1 has not begun to take effect. The 500 simulated realizations of G_w ranged between 0 and 2; those of G ranged between 0 and 8. The simulated means of nb^dA_w/β_2 and nb^dA/β_2 were 362.0 and 374.2, respectively, not 300 as was expected.
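For completeness, a two-line transcription of the global statistic as reconstructed in Equation (6); the inputs are the quantities computed in the earlier sketches, and the formula inherits the approximations V ≈ 2^d β_2² and E_{v_0}(A) ≈ mβ_2/(nb^d).

    import numpy as np

    def G_stat(A, n, b, d, m, beta2):
        """G = (n b^d A / beta_2 - m) / sqrt(2 m^2 (2b)^d), per Equation (6)."""
        return (n * b ** d * A / beta2 - m) / np.sqrt(2.0 * m ** 2 * (2.0 * b) ** d)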

For the T-lymphocyte data presented earlier, the test statistics G_w (G) were −.116 (58.02) for the model fit to the whole data set and 4.98 (60.6) for the model fit to the subset of the data. As pointed out in Section 5, the value of A, and therefore of G, is dominated by the poor fit where the Poisson response is near zero, that is, where the variance of the response is near zero. For this reason, we recommend using G_w. The absolute values of G_w and G can be expected to be larger for the model fit to the subset of the data as compared with their absolute values for the model fit to all of the data. The computed value of G_w is most informative when taken together with the local goodness-of-fit plots, Figures 2 and 3.

7. CONCLUSIONS

Our examples and simulations addressed d = 2, where the method can be expected to do well. In higher dimensions, the need for large data sets to achieve adequate power could be alleviated through the use of generalized additive models (Hastie and Tibshirani 1990) or partially linear models (Speckman 1988) in place of our unrestricted alternative. Further research is needed in this direction. On the other hand, given large data sets, the diagnostic procedures described here could be extended to test for the validity of generalized additive models or partially linear models.

In summary, the local goodness-of-fit statistic is the most promising diagnostic tool elaborated on here. The global goodness-of-fit statistic with m = n has a finite-sample distribution that is not approximated well by the limiting normal distribution; better finite-sample approximations to its distribution are in progress.

APPENDIX: PROOFS

Proof of Theorem 2.1

The following lemma is instrumental in the proof.

Lemma 1. Let ŝ = (ŝ(t_1), ..., ŝ(t_m))′. Then (nb^d/β_2)^{1/2}(ŝ − E(ŝ)) →_D N(0, Σ), where Σ^{-1} = diag(I(t_1)g(t_1), ..., I(t_m)g(t_m)).

When X is a design variable, this lemma is an extension of a theorem of Staniswalis (1989). When X is a random covariate, it is an extension of results of Schuster (1972). This lemma also follows from Mack and Müller (1987), assuming that the bandwidth satisfies b = O(n^{-1/(d+4)}).

By the lemma, Î(t_j) − I(t_j) →_p 0 for j = 1, ..., m. Therefore it suffices to prove the theorem for A* = Σ_{j=1}^m I(t_j)g(t_j){ŝ(t_j) − E_v̂[ŝ(t_j)]}² in place of A. The parametric rate of convergence of v̂ together with condition (3) implies that (nb^d)^{1/2}{E_v̂[ŝ(t_j)] − E_{v_0}[ŝ(t_j)]} →_p 0. Therefore, (nb^d/β_2)[A* − Σ_{j=1}^m I(t_j)g(t_j){ŝ(t_j) − E_{v_0}[ŝ(t_j)]}²] →_p 0. The desired result follows from an application of the lemma.

Proof of Theorem 2.2

As in the proof of Theorem 2.1, it suffices to prove the theorem for A* in place of A. The theorem follows by an application of Lemma 1 and the assumption on the local alternative, on replacing ŝ(t_j) − E_v̂[ŝ(t_j)] with {ŝ(t_j) − E_{v_0}[ŝ(t_j)]} − {E_v̂[ŝ(t_j)] − E_{v_0}[ŝ(t_j)]} and taking the limit in probability as n → ∞.

Proof of Theorem 2.3

Let L′_w(θ; t) = ∂L_w(θ; t)/∂θ and L″_w(θ; t) = ∂²L_w(θ; t)/∂θ². Substitute a Taylor series expansion of L_w[E_v̂(ŝ(t_j)); t_j] about ŝ(t_j) into the expression for A_w. Simplify and incorporate the fact that L′_w[ŝ(t_j); t_j] = 0 for j = 1, ..., m. This yields

A_w = −Σ_{j=1}^m g(t_j) L″_w[s̃_j; t_j] {ŝ(t_j) − E_v̂[ŝ(t_j)]}² w_j^{-1},

where s̃_j lies between ŝ(t_j) and E_v̂[ŝ(t_j)] (j = 1, ..., m). Now replace L″_w[s̃_j; t_j] in the latter expression with L″_w[s(t_j); t_j] + L″_w[s̃_j; t_j] − L″_w[s(t_j); t_j], yielding

A_w = −Σ_{j=1}^m g(t_j) L″_w[s(t_j); t_j] {ŝ(t_j) − E_v̂[ŝ(t_j)]}² w_j^{-1}
      − Σ_{j=1}^m g(t_j) (L″_w[s̃_j; t_j] − L″_w[s(t_j); t_j]) {ŝ(t_j) − E_v̂[ŝ(t_j)]}² w_j^{-1}.    (7)

Under conditions (1)-(5), L″_w[s(t_j); t_j] w_j^{-1} →_p −I(t_j) for j = 1, ..., m, by the weak law of large numbers. Therefore, the difference between the first term on the right side of (7) and A converges in probability to zero. Observe that s(x; v̂) − s(x) →_p 0 implies that s̃_j converges in probability to s(t_j), and, therefore, the second term on the right side of Equation (7) is o_p(A).

Proof of Theorem 6.1

It suffices to prove the theorem for A* in place of A because of the uniform convergence in probability of b^{-d/2}|Î(t) − I(t)| to zero as n → ∞.

Recall that the kernel estimator ŝ(t) satisfies

∂L_w(θ; t)/∂θ |_{θ=ŝ(t)} = Σ_{i=1}^n W((t − x_i)/b) ∂ log f(Y_i | θ)/∂θ |_{θ=ŝ(t)} = 0.    (8)

By the regularity condition (4) on f, we have for ŝ(t) in the neighborhood of s(x_i) a Taylor series expansion,

∂ log f(Y_i | θ)/∂θ |_{θ=ŝ(t)} = ∂ log f(Y_i | θ)/∂θ |_{θ=s(x_i)} + ∂² log f(Y_i | θ)/∂θ² |_{θ=θ̃_i(t)} [ŝ(t) − s(x_i)],

where θ̃_i(t) lies between ŝ(t) and s(x_i) (i = 1, ..., n). Substituting this expansion into Equation (8) yields

0 = A_n(t) + (B_n(t) + C_n(t))[ŝ(t) − E_{v_0}(ŝ(t))] + D_n(t),    (9)


where

A_n(t) = Σ_{i=1}^n W((t − x_i)/b) ∂ log f(Y_i | θ)/∂θ |_{θ=s(x_i)} w_t^{-1},

B_n(t) = Σ_{i=1}^n W((t − x_i)/b) ∂² log f(Y_i | θ)/∂θ² |_{θ=s(x_i)} w_t^{-1},

C_n(t) = Σ_{i=1}^n W((t − x_i)/b) [∂² log f(Y_i | θ)/∂θ² |_{θ=θ̃_i(t)} − ∂² log f(Y_i | θ)/∂θ² |_{θ=s(x_i)}] w_t^{-1},

D_n(t) = Σ_{i=1}^n W((t − x_i)/b) ∂² log f(Y_i | θ)/∂θ² |_{θ=θ̃_i(t)} [E_{v_0}(ŝ(t)) − s(x_i)] w_t^{-1}.

Equation (9) may be written in the more useful form,

ŝ(t) − E_{v_0}[ŝ(t)] = −[A_n(t) + D_n(t)]/[B_n(t) + C_n(t)].

Observe that the statistic A* may be rewritten as

A* = Σ_{j=1}^m I(t_j) g(t_j) [(A_n(t_j) + D_n(t_j))/(I(t_j) + G_n(t_j))]²,

where G_n(t) = −(B_n(t) + I(t)) − C_n(t). Using the identity (a + b)/(c + d) = (a + b)/c − (da + db)/(c² + cd), we have

A* = Σ_{j=1}^m I(t_j) g(t_j) [(A_n(t_j) + D_n(t_j))/I(t_j) − (A_n(t_j) + D_n(t_j))G_n(t_j)/(I(t_j)² + I(t_j)G_n(t_j))]².

As a result of our assumptions, b^{-d/2} sup_t |G_n(t)/I(t)| →_p 0,

var[Σ_{j=1}^m D_n²(t_j) I(t_j)^{-1} g(t_j)] = o[var(Q)],

and

var[Σ_{j=1}^m A_n(t_j) D_n(t_j) I(t_j)^{-1} g(t_j)] = o[var(Q)],

where

Q = Σ_{j=1}^m I(t_j)^{-1} g(t_j) A_n²(t_j).

Lemma 2 will establish that E_{v_0}(Q) = O[m/(nb^d)]. These results together imply that

[nb^{d/2}/(m(2V)^{1/2})](A* − E_{v_0}(A*)) = [nb^{d/2}/(m(2V)^{1/2})](Q − E_{v_0}(Q)) + o_p(1),

in which case the proof of the theorem is completed by showing that the right side of the latter equation converges in distribution to a standard normal as n → ∞.

It is convenient to rewrite Q as Q = Q_1 + 2Q_2 with Q_1 = Σ_{i=1}^n a_{ii}z_i² and Q_2 = Σ_{1≤i<k≤n} a_{ik}z_iz_k, where z_i = ∂ log f(Y_i | θ)/∂θ |_{θ=s(x_i)} and a_{ik} = Σ_{j=1}^m [I(t_j)w_j²]^{-1} W((t_j − x_i)/b) W((t_j − x_k)/b) g(t_j). The proof of our theorem follows from an application of Theorem 2.1 of de Jong (1987) to Q_2 and by observing that Q_1 − E_{v_0}(Q_1) = o_p(Q_2). Set W_{ij} = a_{ij}z_iz_j. Observe that E_{v_0}[W_{ij}] = 0 and σ_{ij}² = E_{v_0}[W_{ij}²] = E_{X_i,X_j}(a_{ij}² I(X_i) I(X_j)) for i ≠ j. The following notation is introduced by de Jong (1987):

G_I = Σ_{1≤i<j≤n} E_X(a_{ij}⁴ u_{4i}u_{4j}),

G_II = Σ_{1≤i<j<k≤n} E_X(a_{ij}²a_{ik}² u_{4i}u_{2j}u_{2k} + a_{ij}²a_{jk}² u_{2i}u_{4j}u_{2k} + a_{ik}²a_{jk}² u_{2i}u_{2j}u_{4k}),

G_III = Σ_{1≤i<j<k≤n} E_X(a_{ij}²a_{ik}a_{jk} u_{3i}u_{3j}u_{2k} + a_{ij}a_{ik}²a_{jk} u_{3i}u_{2j}u_{3k} + a_{ij}a_{ik}a_{jk}² u_{2i}u_{3j}u_{3k}),

G_IV = Σ_{1≤i<j<k<l≤n} E_X[(a_{ij}a_{ik}a_{lj}a_{lk} + a_{ij}a_{il}a_{kj}a_{kl} + a_{ik}a_{il}a_{jk}a_{jl}) u_{2i}u_{2j}u_{2k}u_{2l}],

G_V = Σ_{1≤i<j<k<l≤n} E_X[(a_{ij}²a_{kl}² + a_{ik}²a_{jl}² + a_{il}²a_{jk}²) u_{2i}u_{2j}u_{2k}u_{2l}].

We need to show that

lim_{n→∞} max_i I_i = 0, where I_i = Σ_{j: j≠i} σ_{ij}² / Σ_{1≤i<j≤n} σ_{ij}².    (10)

Toward this end, the following lemma is introduced.

Lemma 2. Conditional on X_i = x_i and X_k = x_k,

a_{ik} = [m/(n²b^d)]{ρ(x_i, x_k)[I((x_i + x_k)/2)]^{-1} + o(1)},

where ρ(x_i, x_k) is defined in the statement of Theorem 6.1.

Proof.

(n²b^d/m) a_{ik} = (mb^d)^{-1} Σ_{j=1}^m I(t_j)^{-1} [(nb^d)/w_j]² W((t_j − x_i)/b) W((t_j − x_k)/b) g(t_j)

= b^{-d} ∫ I(u)^{-1} W((u − x_i)/b) W((u − x_k)/b) du + o(1).

Change variables in the latter integral to v = [u − (x_i + x_k)/2]/b and use the smoothness assumption on I(·) to achieve the result we set out to prove. This lemma establishes that E_{v_0}(Q) = O(m/(nb^d)) and that G_I, G_II, G_III, and G_IV are all of lower order than G_V.

Now we argue that condition (10) holds. Observe that there are at most O(n²b^d) nonzero summands in Σ_{1≤i<j≤n} σ_{ij}². Therefore, [(n²b^d)²/(m²n²b^d)] Σ_{1≤i<j≤n} σ_{ij}² is approximately equal to 2 Σ_{1≤i<j≤n} E_X[ρ²(X_i, X_j)]/(n²b^d) = O(1). From this it follows that I_i = o(1) uniformly in i.

Finally, apply de Jong's results to conclude that [(2n²b^d)/(Vm²)]^{1/2} Q_2 →_D N(0, 1) as n → ∞, where V is given in the statement of Theorem 6.1. Therefore [(nb^{d/2})/(m(2V)^{1/2})][Q − E_{v_0}(Q)] →_D N(0, 1).

[Received March 1989. Revised January 1991.]

REFERENCES

Azzalini, A., Bowman, A. W., and Härdle, W. (1989), "On the Use of Nonparametric Regression for Model Checking," Biometrika, 76, 1-11.

Bjerve, S., Doksum, K., and Yandell, B. S. (1985), "Uniform Confidence Bounds for Regression Based on a Simple Moving Average," Scandinavian Journal of Statistics, 12, 159-165.

Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 829-836.

Cox, D., Koh, E., Wahba, G., and Yandell, B. S. (1988), "Testing the Null Model Hypothesis," The Annals of Statistics, 16, 113-119.

de Jong, P. (1987), "A Central Limit Theorem for Generalized Quadratic Forms," Probability Theory and Related Fields, 75, 261-277.

Eubank, R. L., and Spiegelman, C. H. (1990), "Testing the Goodness-of-Fit of a Linear Model Via Nonparametric Regression Techniques," Journal of the American Statistical Association, 85, 387-392.

Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, London: Chapman & Hall.

Kent, J. T. (1986), "The Underlying Structure of Nonnested Hypothesis Tests," Biometrika, 73, 333-343.

Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1984), "Graphical Methods for Assessing Logistic Regression Models," Journal of the American Statistical Association, 79, 61-83.

Mack, Y. P., and Müller, H. G. (1987), "Adaptive Nonparametric Estimation of a Multivariate Regression Function," Journal of Multivariate Analysis, 23, 169-182.

Müller, H. G. (1980), Nonparametric Regression Analysis of Longitudinal Data (Lecture Notes in Statistics, 46), New York: Springer-Verlag.

——— (1984), "Smooth Optimum Kernel Estimators of Densities, Regression Curves, and Modes," The Annals of Statistics, 12, 766-774.

Rosenblatt, M. (1971), "Curve Estimates," The Annals of Mathematical Statistics, 42, 1815-1842.

Schuster, E. F. (1972), "Joint Asymptotic Distribution of the Estimated Regression Function at a Finite Number of Distinct Points," The Annals of Mathematical Statistics, 43, 84-88.

Speckman, P. (1988), "Kernel Smoothing in Partial Linear Models," Journal of the Royal Statistical Society, Ser. B, 50, 413-436.

Staniswalis, J. G. (1989), "On the Kernel Estimate of a Regression Function in Likelihood Based Models," Journal of the American Statistical Association, 84, 276-283.

Staniswalis, J. G., and McCrady, C. (1988), "The Use of Kernel Estimators in Describing Human T-Lymphocyte Proliferation Induced by Phorbol Esters and Ca²⁺ Ionophore," Journal of the American College of Toxicology, 7, 939-951.

Tibshirani, R., and Hastie, T. (1987), "Local Likelihood Estimation," Journal of the American Statistical Association, 82, 559-568.
