
Journal of the Korean Statistical Society 37 (2008) 89–94, www.elsevier.com/locate/jkss

Diagnostics in logistic regression models

Sugata Sen Roy^a,*, Sibnarayan Guria^b

a Department of Statistics, University of Calcutta, 35, Ballygunge Circular Road, Calcutta 700019, India
b Department of Statistics, Bidhannagar College, Block EB, Calcutta 700064, India

Received 1 March 2006; accepted 1 March 2007

Abstract

In this paper we study the diagnostics of a logistic regression model using the deletion of observation technique. The model is fitted using the maximum likelihood method and the changes in the estimates and the deviance are observed when the model is refitted after deleting an observation. Expressions are derived so that it is not necessary to re-run the regression after each deletion, thereby considerably saving the computational time.
© 2008 The Korean Statistical Society. Published by Elsevier Ltd. All rights reserved.

AMS 2000 Subject Classification: primary 62J12; secondary 62J20

Keywords: Deletion of observation; Deviance; Logistic regression; Maximum likelihood estimator

1. Introduction

The use of a regression model, and particularly predictions based on it, requires that the fitted model is compatible with the data. However, the data may very often contain outliers which exercise an inordinate influence on the estimates of the parameters. It is then necessary to detect these outliers and take appropriate measures so as to obtain a good fit. To this effect diagnostics play an important role in regression studies (see Belsley, Kuh, and Welsch (1980)). However, although there has been an extensive study on the methods of detecting outliers for the classical linear model, very few studies have been made for more general types of linear model. One reason for this is that the diagnostics exploit the nature of the relationship between the response and the explanatory variables, thus making it easier to deal with particular forms of relationships rather than a general one. In this respect, Pregibon (1981) studied a binary model through logistic regression, while Williams (1987) and Thomas and Cook (1989) considered the generalized linear model. However, most of the studies have been based on the perturbation technique as suggested by Cook (1986). The reason for this is perhaps that, unlike the least-squares estimation method used in the classical linear model, the generalized linear models use the maximum likelihood method of parameter estimation. The latter requires iterative methods of solving for the estimates, and hence the deletion of observation technique may be deemed too complicated for application.

* Corresponding author. E-mail addresses: [email protected] (S.S. Roy), [email protected] (S. Guria).

1226-3192/$ - see front matter © 2008 The Korean Statistical Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.jkss.2007.03.001


In identifying the outliers, Pregibon (1981) uses the Pearsonian residuals and the deviance residuals corresponding to each individual observation of a logistic model. Large values of these indicate that the observation is an outlier. However, he argues that these quantities cannot adequately measure the effect of the outlier on the many components of the fitted model and hence resorts to the perturbation technique. Williams (1987) employs a mean shift outlier model with the maximum likelihood estimate of the parameters based on the full set of observations as an initial solution. Then, taking a single step of weighted least squares, he obtains an approximate relation between the estimates based on the deleted set and the full set of observations.

In this paper we have considered a logistic regression model. Classically, such models were fitted to data obtained under experimental conditions, for example bioassay and related dose-response applications. But logistic regression models have been increasingly used in observational studies where the data is prone to contain extreme values of both the responses and the design points. Hence diagnostics have become an integral part of the study of these models. Here we have shown that the deletion technique can be much more simply extended to these models. In fact, although the estimation is carried out through the maximum likelihood method, the technique we use is closely similar to the one used in the least squares estimation of the classical linear model.

In Section 2 we study the impact of the deletion of an observation on the maximum likelihood estimators of the regression parameters as obtained after a single iteration. In Section 3 we study the diagnostics related to the model. Section 4 contains a numerical example, while in Section 5 some concluding observations are made.

2. Estimation with the jth observation deleted

Let $y_i$ be a binary response variable and $x_i$ the corresponding vector of $p$ covariates for the $i$th individual, $i = 1, \ldots, n$. For example, $y_i$ may be the choice behaviour of the $i$th individual, having characteristic $x_i$, when faced with two alternatives, one of which must be chosen. Also let $\beta = (\beta_1, \ldots, \beta_p)'$ be the parameter vector associated with the covariates $x_i$. Then the logistic regression model is given by

$$y_i = \begin{cases} 1 & \text{with probability } P_i \\ 0 & \text{with probability } 1 - P_i \end{cases} \tag{2.1}$$

where

$$P_i = P[y_i = 1 \mid x_i] = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}. \tag{2.2}$$

The likelihood function for the estimation of the parameters $\beta$ for this model is

$$l(\beta) = \prod_{i=1}^{n} P_i^{y_i}(1 - P_i)^{1 - y_i},$$

so that the log-likelihood function can be written as

$$L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} \left[ y_i x_i'\beta - \ln\left(1 + e^{x_i'\beta}\right) \right]. \tag{2.3}$$

The likelihood equations $\partial L(\beta)/\partial \beta = 0$ are then obtained as

$$\sum_{i=1}^{n} \left[ x_i y_i - x_i \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} \right] = 0. \tag{2.4}$$

However, (2.4) being non-linear in $\beta$, some iterative technique needs to be applied to find a solution. The usual procedure is to apply Fisher's Method of Scoring, but in the case of the logistic model this is also equivalent to the Newton–Raphson method.

Let $s_i = y_i - P_i$ and $v_i = P_i(1 - P_i)$. Then writing $S = (s_1, \ldots, s_n)'$, $V = \mathrm{diag}(v_i)$ and $Z = V^{1/2}X$,

$$\partial L(\beta)/\partial \beta = \sum_{i=1}^{n} x_i (y_i - P_i) = X'S$$


and

$$-\partial^2 L(\beta)/\partial\beta\,\partial\beta' = \sum_{i=1}^{n} \frac{e^{x_i'\beta}}{\left[1 + e^{x_i'\beta}\right]^2}\, x_i x_i' = X'VX = Z'Z.$$

Hence, starting with an initial solution $\beta^0$ of $\beta$ and using the Newton–Raphson method, the first approximation to $\hat\beta$, the estimator of $\beta$, is

$$\hat\beta^1 = \beta^0 - \left[\left(\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta'}\right)^{-1}\frac{\partial L(\beta)}{\partial\beta}\right]_{\beta=\beta^0} = \beta^0 + (Z'Z)^{-1}(X'S) = \beta^0 + (Z'Z)^{-1}(Z'V^{-1/2}S). \tag{2.5}$$

Here of course $Z$, $S$ and $V$ are evaluated at $\beta^0$, but to keep the notation simple the suffixes are omitted unless there is cause for ambiguity.
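To fix ideas, the one-step update (2.5) takes only a few lines of numpy. The following is a minimal sketch rather than the authors' code; the function name `one_step_beta` and the synthetic data are purely illustrative.

```python
import numpy as np

def one_step_beta(X, y, beta0):
    """One Newton-Raphson step for the logistic log-likelihood, eq. (2.5):
    beta1 = beta0 + (Z'Z)^{-1} X'S, with S = y - P, V = diag(P(1-P)) and
    Z = V^{1/2} X, all evaluated at beta0."""
    P = 1.0 / (1.0 + np.exp(-(X @ beta0)))      # P_i from eq. (2.2)
    S = y - P                                   # s_i = y_i - P_i
    v = P * (1.0 - P)                           # v_i = P_i (1 - P_i)
    ZtZ = X.T @ (v[:, None] * X)                # Z'Z = X'VX
    return beta0 + np.linalg.solve(ZtZ, X.T @ S)

# illustrative synthetic data, only to exercise the step
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = (rng.uniform(size=50) < 0.5).astype(float)
beta0 = np.linalg.solve(X.T @ X, X.T @ y)       # linear probability model start
print(one_step_beta(X, y, beta0))
```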

To see the impact of the deletion of the $j$th observation on the regression, we consider the log-likelihood function

$$L(\beta) = \sum_{i=1,\,i\neq j}^{n} \left[ y_i x_i'\beta - \ln\left(1 + e^{x_i'\beta}\right) \right]. \tag{2.6}$$

Then

$$\partial L(\beta)/\partial\beta = \sum_{i=1,\,i\neq j}^{n} x_i(y_i - P_i) = X'S - x_j s_j$$

and

$$-\partial^2 L(\beta)/\partial\beta\,\partial\beta' = \sum_{i=1,\,i\neq j}^{n} \frac{e^{x_i'\beta}}{\left[1 + e^{x_i'\beta}\right]^2}\, x_i x_i' = X'VX - v_j x_j x_j' = Z'Z - z_j z_j',$$

where $z_j = \sqrt{v_j}\, x_j$. Starting with an initial solution $\beta^{0(j)}$, the first approximation $\hat\beta^{1(j)}$ to the estimate of $\beta$ based on the set of observations excluding the $j$th is obtained as

$$\hat\beta^{1(j)} = \beta^{0(j)} + (Z'Z - z_j z_j')^{-1}(X'S - x_j s_j). \tag{2.7}$$

Of course, it is impractical and computationally time consuming to run the estimation procedure separately for each deletion of an observation. Hence it is imperative to look for a relationship between (2.5) and (2.7) so that $\hat\beta^{1(j)}$ can be expressed in terms of $\hat\beta^1$ and the residuals and leverages obtained from it. This would considerably save on computational time, particularly for large data sets. However, before studying this relationship, we first address the problem of finding the initial solutions $\beta^0$ and $\beta^{0(j)}$.

For a logistic model the initial solutions can be obtained in several ways. The simplest method is, of course, to use the estimator of $\beta$ as obtained from a linear probability model. This leads to $\beta^0 = (X'X)^{-1}(X'y)$. For the deleted set of $(n-1)$ observations we can start from the same initial value. However, since for an outlier the impact of its deletion on the estimators would be significant, a corrected initial value would lead to faster convergence. Hence we use as an initial value the estimator of $\beta$ obtained by omitting the $j$th observation in the linear probability model, i.e. $\beta^{0(j)} = (X'X - x_j x_j')^{-1}(X'y - x_j y_j)$.

From standard results,

$$\beta^{0(j)} = \beta^0 - (1 - h_{jj})^{-1}(X'X)^{-1} x_j e_j, \tag{2.8}$$

where $e_j = y_j - x_j'\beta^0$ and $h_{jj} = x_j'(X'X)^{-1}x_j$.
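Because (2.8) is a standard rank-one downdate, all $n$ deleted-case starting values can be formed in one pass from the full-data quantities. A sketch under the same numpy setting as above, with illustrative names:

```python
import numpy as np

def deleted_initial_values(X, y):
    """Leave-one-out linear-probability starts via eq. (2.8):
    beta0_(j) = beta0 - (1 - h_jj)^{-1} (X'X)^{-1} x_j e_j.
    Returns beta0 and an (n, p) array whose j-th row is beta0_(j)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta0 = XtX_inv @ (X.T @ y)
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # h_jj = x_j'(X'X)^{-1} x_j
    e = y - X @ beta0                             # e_j = y_j - x_j' beta0
    return beta0, beta0 - (X @ XtX_inv) * (e / (1.0 - h))[:, None]
```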

Result 2.1. Under the above set-up,

$$\hat\beta^{1(j)} = \hat\beta^1 - (1 - h_{jj}^*)^{-1}(Z'Z)^{-1} z_j e_j^* - (1 - h_{jj})^{-1}(X'X)^{-1} x_j e_j, \tag{2.9}$$


where

$$e_j^* = \left[\frac{s_j}{\sqrt{v_{jj}}} - z_j'(Z'Z)^{-1}Z'V^{-1/2}S\right]_{\beta=\beta^0} \quad\text{and}\quad h_{jj}^* = z_j'(Z'Z)^{-1}z_j.$$

Proof. Since

$$(Z'Z - z_j z_j')^{-1} = (Z'Z)^{-1} + \left(1 - z_j'(Z'Z)^{-1}z_j\right)^{-1}\left((Z'Z)^{-1} z_j z_j' (Z'Z)^{-1}\right),$$

the result follows from (2.5), (2.7) and (2.8). □
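The identity used in the proof is the Sherman–Morrison formula for a rank-one downdate; it can be checked numerically in a few lines (the simulated $Z$ below is arbitrary):

```python
import numpy as np

# numerical check of the rank-one (Sherman-Morrison) downdate identity
rng = np.random.default_rng(1)
Z = rng.normal(size=(20, 3))
zj = Z[5]                                    # any row plays the role of z_j
A = np.linalg.inv(Z.T @ Z)
lhs = np.linalg.inv(Z.T @ Z - np.outer(zj, zj))
rhs = A + (A @ np.outer(zj, zj) @ A) / (1.0 - zj @ A @ zj)
print(np.allclose(lhs, rhs))                 # True
```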

Remark 1. The $e_j^*$'s can be looked upon as residuals obtained by regressing $v_{jj}^{-1/2} s_j$ on $z_j$. Strictly, the second term on the right-hand side should have been evaluated at $\beta^{0(j)}$, but since it is not necessary to obtain $\beta^{0(j)}$ otherwise, this correction term can be estimated at $\beta^0$ instead.

Remark 2. Eq. (2.9) shows that for the $j$th deleted case we need not carry out the whole estimation afresh, but can obtain the result from the estimator based on the whole data set and two correction terms estimated at $\beta^0$.
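Putting (2.5), (2.8) and (2.9) together, every deleted-case estimate follows from a single pass over the full-data residuals and leverages. The sketch below is illustrative rather than the authors' implementation; it assumes binary `y` and a full-rank design `X`, with the $e^*$, $h^*$ terms evaluated at $\beta^0$ as prescribed by Remark 1.

```python
import numpy as np

def deleted_case_estimates(X, y):
    """One-step full-data estimator (2.5) and deleted-case estimators (2.9),
    with no refitting. Returns beta1 and an (n, p) array whose j-th row
    is beta1_(j)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta0 = XtX_inv @ (X.T @ y)                      # linear-probability start
    e = y - X @ beta0                                # e_j
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)      # h_jj

    P = 1.0 / (1.0 + np.exp(-(X @ beta0)))           # everything below at beta0
    S, v = y - P, P * (1.0 - P)
    Z = np.sqrt(v)[:, None] * X                      # Z = V^{1/2} X
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    delta = ZtZ_inv @ (X.T @ S)                      # (Z'Z)^{-1} X'S
    beta1 = beta0 + delta                            # eq. (2.5)

    hstar = np.einsum('ij,jk,ik->i', Z, ZtZ_inv, Z)  # h*_jj
    estar = S / np.sqrt(v) - Z @ delta               # e*_j
    term1 = (Z @ ZtZ_inv) * (estar / (1.0 - hstar))[:, None]
    term2 = (X @ XtX_inv) * (e / (1.0 - h))[:, None]
    return beta1, beta1 - term1 - term2              # eq. (2.9) row by row
```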

3. Diagnostics

The last two terms on the right-hand side of (2.9) show the amount of change in the regression estimates that would occur if the $j$th observation is deleted. However, these being vectors, the change is difficult to visualize through $\hat\beta^1 - \hat\beta^{1(j)}$. One way out would be to look at the change in the estimated linear predictor,

$$\begin{aligned}
\mathrm{DFFIT}(j) &= \hat\eta_j^1 - \hat\eta_j^{1(j)} = x_j'\hat\beta^1 - x_j'\hat\beta^{1(j)} \\
&= (1 - h_{jj}^*)^{-1} x_j'(Z'Z)^{-1} z_j e_j^* + (1 - h_{jj})^{-1} x_j'(X'X)^{-1} x_j e_j \\
&= (1 - h_{jj}^*)^{-1} v_{jj}^{-1/2} h_{jj}^* e_j^* + (1 - h_{jj})^{-1} h_{jj} e_j.
\end{aligned} \tag{3.1}$$

A large absolute value of $\mathrm{DFFIT}(j)$ would mean that the $j$th observation has a considerable impact on the fit and hence can be considered as an outlier.
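In code, (3.1) likewise needs only one pass over the full data. A minimal sketch with illustrative names, recomputing the residuals and leverages exactly as in the Section 2 sketch:

```python
import numpy as np

def dffit(X, y):
    """DFFIT(j) from eq. (3.1), built solely from full-data residuals
    and leverages evaluated at the linear-probability start beta0."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta0 = XtX_inv @ (X.T @ y)
    e = y - X @ beta0
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
    P = 1.0 / (1.0 + np.exp(-(X @ beta0)))
    S, v = y - P, P * (1.0 - P)
    Z = np.sqrt(v)[:, None] * X
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    hstar = np.einsum('ij,jk,ik->i', Z, ZtZ_inv, Z)
    estar = S / np.sqrt(v) - Z @ (ZtZ_inv @ (X.T @ S))
    # (1-h*)^{-1} v^{-1/2} h* e*  +  (1-h)^{-1} h e
    return hstar * estar / ((1.0 - hstar) * np.sqrt(v)) + h * e / (1.0 - h)
```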

Another possible way to look at this change is through the deviance, which for any fitted value $\hat\beta$ is defined as $D = 2(L(y) - L(\hat\beta))$. Since the deviance is used to measure the goodness of fit of a model, a substantial decrease in the deviance after the deletion of the $j$th observation is indicative of the fact that the observation is a misfit.

Denoting by $p_i$ and $p_i^{(j)}$ the estimators of $P_i$ at $\hat\beta^1$ and $\hat\beta^{1(j)}$ respectively, the deviances of the model with and without the $j$th observation are, respectively,

$$D = 2\sum_{i=1}^{n}\left[ y_i \ln\frac{y_i}{1-y_i} + \ln(1-y_i) - y_i \ln\frac{p_i}{1-p_i} - \ln(1-p_i) \right] \tag{3.2}$$

and

$$D^{(j)} = 2\sum_{i=1,\,i\neq j}^{n}\left[ y_i \ln\frac{y_i}{1-y_i} + \ln(1-y_i) - y_i \ln\frac{p_i^{(j)}}{1-p_i^{(j)}} - \ln\left(1-p_i^{(j)}\right) \right]. \tag{3.3}$$

Hence the difference, on using (2.2), becomes

$$\begin{aligned}
\mathrm{DDEV}(j) &= D - D^{(j)} \\
&= 2\left[ y_j \ln\frac{y_j}{1-y_j} + \ln(1-y_j) - y_j x_j'\hat\beta^1 + \ln\left(1 + e^{x_j'\hat\beta^1}\right) \right] \\
&\quad - 2\sum_{i=1,\,i\neq j}^{n} y_i\left[ x_i'\hat\beta^1 - x_i'\hat\beta^{1(j)} \right] - 2\sum_{i=1,\,i\neq j}^{n} \ln\frac{1 + e^{x_i'\hat\beta^{1(j)}}}{1 + e^{x_i'\hat\beta^1}} \\
&= 2\left[ y_j \ln\frac{y_j}{1-y_j} + \ln(1-y_j) - y_j x_j'\hat\beta^1 + \ln\left(1 + e^{x_j'\hat\beta^1}\right) \right] \\
&\quad - 2\sum_{i=1,\,i\neq j}^{n} y_i\left(\hat\eta_i^1 - \hat\eta_i^{1(j)}\right) - 2\sum_{i=1,\,i\neq j}^{n} \ln\frac{1 + e^{x_i'\hat\beta^1} e^{-(\hat\eta_i^1 - \hat\eta_i^{1(j)})}}{1 + e^{x_i'\hat\beta^1}}.
\end{aligned} \tag{3.4}$$

Using (3.1), $\mathrm{DDEV}(j)$ can be obtained solely in terms of the residuals and leverages of the whole data set. A large value of $\mathrm{DDEV}(j)$ indicates that the $j$th observation is an outlier.
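Rather than coding the rearranged form (3.4) directly, one may evaluate (3.2) and (3.3) as they stand at the one-step estimates, which still avoids any refitting. A sketch under the usual $0\ln 0 = 0$ convention for binary $y$ (so the saturated-model terms vanish), taking the deleted-case estimates from the earlier `deleted_case_estimates` sketch:

```python
import numpy as np

def ddev(X, y, beta1, beta1_del):
    """DDEV(j) = D - D(j) from eqs. (3.2)-(3.3). beta1 is the one-step
    full-data estimate and beta1_del the (n, p) array of deleted-case
    estimates. For binary y, D = 2 * sum[ln(1 + e^eta) - y*eta]."""
    def dev(eta, yy):
        # np.logaddexp(0, eta) = ln(1 + e^eta), computed stably
        return 2.0 * np.sum(np.logaddexp(0.0, eta) - yy * eta)
    D = dev(X @ beta1, y)
    return np.array([D - dev(np.delete(X, j, axis=0) @ beta1_del[j],
                             np.delete(y, j)) for j in range(len(y))])
```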

4. A numerical example

As an illustration of our technique we consider the data used by Collett (1998) for studying 35 patients undergoing surgery with general anesthesia. Of course, with so few patients it will not be too time consuming to run the full iteration for each case, and hence we use this example only to illustrate our method. For practical purposes all such methods are more useful for large data sets.

The response here is whether a patient experiences a sore throat on waking ($Y = 1$) or not ($Y = 0$). The explanatory variables are the duration of the surgery ($D$, in minutes) and the type of device used to secure the airway ($T = 0$ for laryngeal mask airway, $T = 1$ for tracheal tube). The model is of the form (2.1)–(2.2) with $p = 3$ and the linear predictor $\eta = \beta_1 + \beta_2 D + \beta_3 T$.

The steps involved in checking the diagnostics are as follows: first, an initial estimate of $\beta$ is obtained by simple regression of $Y$ on $D$ and $T$. Using this initial estimate and applying the Newton–Raphson method, the one-step approximation to the maximum likelihood estimate of $\beta$ comes out as

$$\hat\beta^1 = (\hat\beta_1^1, \hat\beta_2^1, \hat\beta_3^1)' = (-0.21486,\ 0.02549,\ -1.00471)'. \tag{4.1}$$

This provides the residuals and leverages corresponding to each observation. The $\mathrm{DFBETA}(j) = \hat\beta^1 - \hat\beta^{1(j)}$'s are then obtained using (2.9). Finally, the $\mathrm{DFFIT}(j)$'s and $\mathrm{DDEV}(j)$'s are obtained using (3.1) and (3.4) respectively. The $\mathrm{DFBETA}(j)$'s, $\mathrm{DFFIT}(j)$'s and $\mathrm{DDEV}(j)$'s are shown in Table 4.1.

From Table 4.1 it is observed that the 22nd and 33rd observations have large DFFIT and DDEV values and hence can be considered as outliers. In both cases it is observed that the $D$ value is considerably larger for the corresponding $Y$-values of 0 and 1 respectively, and hence they have been rightly identified as outliers. Observation 6 also has comparatively large values of DFFIT and DDEV, and the data reveal that it has a small value of $D$ compared to its $Y$-value of 1.
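A hypothetical driver for these steps is sketched below. Collett's data are not reproduced here, so `D`, `T` and `y` are simulated stand-ins chosen only to make the code run; `deleted_case_estimates`, `dffit` and `ddev` refer to the sketches in Sections 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 35
D = rng.uniform(15.0, 135.0, size=n)             # duration of surgery (min)
T = rng.integers(0, 2, size=n).astype(float)     # airway device type
X = np.column_stack([np.ones(n), D, T])          # eta = b1 + b2*D + b3*T
prob = 1.0 / (1.0 + np.exp(-(-0.2 + 0.025 * D - 1.0 * T)))
y = (rng.uniform(size=n) < prob).astype(float)   # sore throat indicator

beta1, beta1_del = deleted_case_estimates(X, y)  # eq. (2.9)
dfbeta = beta1 - beta1_del                       # DFBETA(j), row j
print(dffit(X, y))                               # eq. (3.1)
print(ddev(X, y, beta1, beta1_del))              # eqs. (3.2)-(3.3)
```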

5. Concluding remarks

Result 2.1 has been derived for a one-step iteration. As has been noted by several authors, the estimates of the parameters of a logistic regression model converge very rapidly under the Newton–Raphson method, and hence a close approximation can be obtained using one-step iterations only, provided the initial estimate is a good one. If the deleted observation is a non-outlier, then the initial estimate $\beta^{0(j)}$ will not be much different from $\beta^0$. However, for outliers the difference can be large, and hence the correction in (2.8) is generally necessary.

An alternative method could be to use a fully iterated estimate as an initial solution. This is the technique that has generally been used by Pregibon (1981) and Cook and Weisberg (1982). First the usual estimate $\hat\beta$ of $\beta$ based on the full set of observations is obtained by running the iteration till the estimate converges. This $\hat\beta$ is then used as an initial estimate for the deleted cases, with (2.9) replaced by

$$\hat\beta^{1(j)} = \hat\beta + (Z'Z)^{-1}(Z'V^{-1/2}S) - (1 - h_{jj}^*)^{-1}(Z'Z)^{-1} z_j e_j^* = \hat\beta - (1 - h_{jj}^*)^{-1}(Z'Z)^{-1} z_j e_j^*, \tag{5.1}$$

where the third term on the right-hand side of (5.1) is evaluated at $\hat\beta$, while the second term vanishes when evaluated at $\hat\beta$. Of course, the third term is, as before, based on the residuals and leverages of the full set, and it is not necessary to run the regression afresh for each of the deleted cases.
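In code this alternative is even shorter, since the score vanishes at the fully iterated estimate. A sketch assuming `beta_hat` comes from any converged logistic fit (names illustrative):

```python
import numpy as np

def deleted_from_full_mle(X, y, beta_hat):
    """Eq. (5.1): deleted-case one-step estimates started from the fully
    iterated MLE beta_hat. At beta_hat the score X'S is zero, so
    e*_j reduces to s_j / sqrt(v_j)."""
    P = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    S, v = y - P, P * (1.0 - P)
    Z = np.sqrt(v)[:, None] * X                      # Z = V^{1/2} X
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    hstar = np.einsum('ij,jk,ik->i', Z, ZtZ_inv, Z)  # h*_jj
    estar = S / np.sqrt(v)                           # e*_j at the MLE
    # row j: beta_hat - (1 - h*_jj)^{-1} (Z'Z)^{-1} z_j e*_j
    return beta_hat - (Z @ ZtZ_inv) * (estar / (1.0 - hstar))[:, None]
```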


Table 4.1. DFBETA, DFFIT and DDEV values

Sl. No.   DFBETA(j)_1    DFBETA(j)_2    DFBETA(j)_3    DFFIT       DDEV
 1        −0.25441       −0.0001547      0.26116      −0.26137    0.435766569
 2        −0.33057        0.0035636      0.17263      −0.27712    0.145883317
 3         0.098134      −0.00025562    −0.08677       0.087909   0.019847889
 4        −0.070653       0.0015987      0.058193      0.12023    0.009948522
 5        −0.054642       0.0012358      0.038122      0.0947     0.004904132
 6         0.15072       −0.0034108      0.24891       0.31436    0.606455086
 7         0.13198       −0.00066404    −0.10251       0.10874    0.032047913
 8         0.0055538      0.00023346    −0.015883      0.020728   0.002381708
 9         0.034679      −0.0021361      0.060341     −0.16825    0.000120344
10         0.13198       −0.00066404    −0.10251       0.10874    0.032047913
11        −0.00045431    −0.00021625     0.010092     −0.016673   0.000933332
12        −0.0036847      0.000085041    0.17493       0.17508    0.14599365
13         0.031154      −0.0007071     −0.17352      −0.17772    0.270846024
14        −0.078368       0.0017739      0.081388      0.13606    0.020826412
15        −0.31335        0.0021725      0.21704      −0.24818    0.216984042
16         0.21815       −0.0018468     −0.13627       0.17198    0.098165662
17        −0.055455       0.0012544     −0.072786     −0.10315    0.020069438
18        −0.062163       0.0014082      0.12641       0.14874    0.062533516
19        −0.077457       0.0017537      0.096121      0.14142    0.031416655
20         0.17181       −0.0011913     −0.11897       0.13607    0.054728608
21         0.014811       0.00032635    −0.02926       0.034391   0.003698947
22        −0.14701       −0.0036735      0.30962      −0.37109    1.263127104
23         0.0055538      0.00023346    −0.015883      0.020728   0.002381708
24        −0.050894       0.0011509     −0.055474     −0.089104   0.013431804
25        −0.055455       0.0012544     −0.072786     −0.10315    0.020069438
26         0.069836       0.000041707   −0.071655      0.071712   0.012852663
27        −0.050894       0.0011509     −0.055474     −0.089104   0.013431804
28         0.21815       −0.0018468     −0.13627       0.17198    0.098165662
29        −0.050894       0.0011509     −0.055474     −0.089104   0.013431804
30         0.17181       −0.0011913     −0.11897       0.13607    0.054728608
31         0.098134      −0.00025562    −0.08677       0.087909   0.019847889
32        −0.050894       0.0011509     −0.055474     −0.089104   0.013431804
33         0.27715       −0.0062648     −0.096668     −0.66527    3.08014E−05
34        −0.055455       0.0012544     −0.072786     −0.10315    0.020069438
35        −0.018912       0.00042704    −0.13999      −0.14182    0.098584575

This method will ensure that the final estimate for the full set is a closer approximation to the maximum likelihood solution. The estimate would also serve as a good initial solution for those deleted cases where the deleted observation is a non-outlier.

However, this will in most cases be a bad initial solution when the deleted observation is an outlier. This is primarily because the deletion of the outlier will cause a significant change in the regression coefficients, and hence starting with the fully iterated estimate will require several steps for convergence. Here $\hat\beta^{1(j)}$ is more likely to lie closer to $\hat\beta$ than to the actual maximum likelihood solution of the outlier-deleted case. This problem is what makes the method described in Section 2 a better one compared to the fully iterated solution of (5.1), since in the former the initial solution too is corrected for the exclusion of the outlier.

References

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics. New York: John Wiley.
Collett, D. (1998). In P. Armitage, & T. Colton (Eds.), Encyclopaedia of biostatistics (pp. 350–358). New York: Wiley.
Cook, R. D. (1986). Assessment of local influence. Journal of the Royal Statistical Society, Series B, 48, 133–169.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. London: Chapman and Hall.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9(4), 705–724.
Thomas, W., & Cook, R. D. (1989). Assessing influence on regression coefficients in generalized linear models. Biometrika, 76(4), 741–749.
Williams, D. A. (1987). Generalised linear model diagnostics using the deviance and single case deletions. Applied Statistics, 36(2), 181–191.