# LOGDIA—FORTRAN 77 program for logistic regression with diagnostics

Post on 28-Aug-2016


<ul><li><p>Computers & Geosciences Vol. 15, No. 4, pp. 599-614, 1989. 0098-3004/89 $3.00 + 0.00. Printed in Great Britain. All rights reserved. Copyright © 1989 Pergamon Press plc </p><p>LOGDIA--FORTRAN 77 PROGRAM FOR LOGISTIC REGRESSION WITH DIAGNOSTICS* </p><p>F. P. AGTERBERG Geological Survey of Canada, 601 Booth Street, Ottawa, Ontario, Canada K1A 0E8 </p><p>(Received 20 November 1986; received for publication 27 September 1988) </p><p>Abstract--The program LOGDIA allows estimation of frequencies resulting from a binomial response. Two chi-square tests are performed to evaluate the logistic model for goodness-of-fit. The logistic hat matrix, modified logistic hat matrix, and other regression diagnostics are provided upon request. </p><p>Key Words: Logistic regression, Hat matrix, Regression diagnostics, Mineral-resource evaluation. </p><p>INTRODUCTION </p><p>The program LOGDIA was developed by the author in Microsoft FORTRAN on an IBM PC XT for recent applications of regression analysis in regional mineral-resource evaluation (Agterberg, 1987a). </p><p>Univariate qualitative response models for the prediction of discrete events have a long history in biometrics where they are used to estimate, for example, the probability that an insect will survive a specific dose of poison. Cox (1966) has provided a detailed account of the logistic qualitative response model, its multivariate extension employing several explanatory variables, and its relation with discriminant analysis. Pregibon (1981) discusses use of logistic regression to estimate frequencies, which are independent binomial responses, thus extending the approach to deal with a multiple qualitative response. The use of logistic regression to estimate frequencies of mineral deposits in cells from explanatory variables quantified for these cells was suggested originally by Tukey (1972). 
This led to applications of the logistic model (Agterberg, 1974) for estimating probability of occurrence of mineralization based on the nonlinear weighted least-squares estimator of Walker and Duncan (1967). Cox (1966) proposed the use of the maximum likelihood method with scoring in connection with the logistic qualitative response curve. Amemiya (1976) proved that the maximum likelihood and nonlinear weighted least-squares estimators provide identical results. Chung (1978) published the FORTRAN IV computer program LOGIST which is based on maximum likelihood with scoring. The program LOGDIA is a generalization of LOGIST in that frequencies of more than one discrete event in larger cells can be estimated and logistic regression diagnostics are provided. LOGDIA also uses the scoring method and its estimated values become identical to those of LOGIST for a single qualitative response. </p><p>* Geological Survey of Canada Contribution No. 38786. </p><p>During the past 10 yr, regression diagnostics for the general linear model were developed and widely applied. Extensive use is being made of the diagonal elements of the so-called hat matrix. Chi-square tests for evaluating logistic regression results for goodness-of-fit also have been obtained (see review by Wrigley, 1984). A number of computer programs which can be used for logistic regression analysis are discussed in the book by Wrigley (1985, chap. 7, p. 233-238). For example, the Generalized Linear Interactive Modelling (GLIM) package uses weighted least-squares algorithms for the logistic model (Baker and Nelder, 1978). Pregibon (1981) has stated that standard output from a maximum likelihood fit for the logistic model consists of a subset of the following: </p><p>(a) Estimated parameter vector, $\hat{\beta}$;
(b) Individual coefficient standard deviations, $\mathrm{SD}(\hat{\beta}_j)$;
(c) Estimated covariance matrix of $\hat{\beta}$;
(d) Chi-square goodness-of-fit statistic;
(e) Individual components of chi-square;
(f) Deviance $D$. 
The program LOGDIA provides these statistics (a)-(f) plus individual components of deviance. Pregibon (1981) also has pointed out that with a properly designed computing package for fitting the maximum likelihood model, logistic regression diagnostics are essentially "free for the asking". LOGDIA provides the logistic hat matrix and corresponding modified hat matrix for measuring leverage and influence of individual observations on estimated values. A measure of influence of observations on regression coefficients also is provided. </p><p>HAT MATRIX AND MODIFIED HAT MATRIX </p><p>In multiple regression based on the general linear model, the hat matrix is the symmetrical $(n \times n)$ matrix $H$ satisfying </p><p>$$\hat{Y} = HY; \quad H = X(X'X)^{-1}X'. \tag{1}$$ </p><p>Here the $(n \times p)$ matrix $X$ contains observed values of explanatory variables $x_i$ with $i = 1, \ldots, p$; $Y$ is a $(n \times 1)$ column vector for the $n$ observations on the dependent variable $y$; and $\hat{Y}$ is the $(n \times 1)$ column vector of estimated values obtained from $Y$ through the hat matrix. When $Y$ is appended to $X$ to obtain a new matrix $Z = (X; Y)$ with $(p + 1)$ columns, the modified hat matrix $H^*$ follows from </p><p>$$H^* = Z(Z'Z)^{-1}Z'. \tag{2}$$ </p></li><li><p>The elements $h_{ij}$ of $H$ provide a measure of the amount of "leverage" exerted by the observation $Y_j$ on the estimated value $\hat{Y}_i$. The elements $h^*_{ij}$ of the modified hat matrix denote the amount of "influence" exerted by $Y_j$ on $\hat{Y}_i$ (cf. Gray and Ling, 1984). The sum of the diagonal elements of the hat matrix is equal to $p$ (= number of explanatory variables) and that of the modified hat matrix is $p + 1$. </p><p>The hat matrix of the logistic model satisfies: </p><p>$$H = V^{1/2}X(X'VX)^{-1}X'V^{1/2} \tag{3}$$ </p><p>where $V$ is an $(n \times n)$ diagonal matrix with nonzero elements $v_i = N\hat{p}_i(1 - \hat{p}_i)$. The parameters $N$ and $p_i$ represent sample size and probability of a binomial distribution, respectively. 
The estimated probability $\hat{p}_i$ for observation $i$ becomes available after application of the maximum likelihood method. In the logistic qualitative response model, the observed values of $Y$ are either 1 or 0 with $N = 1$. For multiple qualitative response, $Y$ is a vector of integer numbers or zeros representing a sample of $n$ independent binomial responses $B(N, p_i)$. In the maximum likelihood method, a vector of scores $S = Y - \hat{Y}$ is made to converge by iteration until the relation $X'S = 0$ is satisfied. At convergence, we have the coefficients $\hat{\beta}$ with </p><p>$$\hat{\beta} = (X'VX)^{-1}X'V\Lambda; \quad \Lambda = X\hat{\beta} + V^{-1}S. \tag{4}$$ </p><p>The logits $\theta_i$ of the probabilities $p_i$ satisfy </p><p>$$\theta_i = \log_e \frac{p_i}{1 - p_i}. \tag{5}$$ </p><p>The estimated logits become $\hat{\theta} = X\hat{\beta}$. The estimated frequencies $\hat{Y} = N\hat{p}$ follow from $\hat{p}$. Here $\hat{p}$ is the vector of estimated probabilities $\hat{p}_i$. The variance-covariance matrix of $\hat{\beta}$ is $(X'VX)^{-1}$. </p><p>For goodness-of-fit, the following two statistics can be used: (a) Chi-square values $\chi_i^2$ obtained after squaring </p><p>$$\chi_i = (y_i - N\hat{p}_i)\,\{N\hat{p}_i(1 - \hat{p}_i)\}^{-1/2}; \tag{6}$$ </p><p>(b) Components of deviance with </p><p>$$d_i = -2\,\{y_i \log_e \hat{p}_i + (N - y_i)\log_e(1 - \hat{p}_i)\}. \tag{7}$$ </p><p>Addition of the individual chi-square values yields $\chi^2$ which is distributed as theoretical chi-square with $(n - p)$ degrees of freedom when the fit is good. Addition of the $d_i$ values yields the deviance $D$ which also is $\chi^2$-distributed with $(n - p)$ degrees of freedom. In practical applications, the chi-square value and deviance may differ considerably and either one can be greater than the other. </p><p>Suppose that the vector $\Lambda$ is appended to $X$ to create a new matrix $Z = (X; \Lambda)$. Then the modified logistic hat matrix follows from: </p><p>$$H^* = V^{1/2}Z(Z'VZ)^{-1}Z'V^{1/2}. \tag{8}$$ </p><p>The SAS MATRIX procedure (SAS, 1985) can be used to obtain $H$ and $H^*$ in ordinary regression [Eqs. (1) and (2)] as well as logistic regression [Eqs. 
(3) and (8)]. The maximum likelihood estimate of the coefficients $\hat{\beta}$ must be computed first by another computer program before the logistic hat matrix can be obtained. In LOGDIA, $H$ is calculated directly from $X$ and $Y$ by Equation (3). Next $H^*$ is obtained from $H$ by using Equation (6) because the elements of these two matrices are related according to </p><p>$$h^*_{ij} = h_{ij} + \chi_i\chi_j/\chi^2. \tag{9}$$ </p><p>A mathematical proof of Equation (9) is given in Appendix 1. </p><p>Other properties of hat matrices are discussed in Agterberg (198%). One of their potentially important uses in mineral-resource appraisal studies consists of recognition of similar cells in a region. Cluster analysis of the hat matrix can be useful to define groups of similar cells and for ordering the cells in a region according to degree of similarity. In this type of approach both diagonal and off-diagonal elements of the hat matrix (or the modified hat matrix) are used. Examples of this clustering are given in Agterberg and Franklin (1986) for the linear model, and in Agterberg (1987a) for the linear and logistic models. The basic building blocks $\chi_i$, $d_i$, $h_{ij}$, and $h^*_{ij}$ provided by LOGDIA can be used for the calculation of other regression diagnostics. The influence of individual observations on estimated coefficients is measured in LOGDIA according to the following method. Wrigley and Dunn (1986), in a paper on graphical diagnostics for logistic oil-exploration models, provide separate plots of $\mathrm{DBETA}_{ij}/\mathrm{SD}(\hat{\beta}_j)$ using the difference vector </p><p>$$\mathrm{DBETA}_i = \hat{\beta}\,(\text{all observations}) - \hat{\beta}\,(\text{all observations except } i) = (X'VX)^{-1}X_iS_i/(1 - h_{ii}) \tag{10}$$ </p><p>where $X_i$ is a $(p \times 1)$ vector of the values of the explanatory variables for observation $i$; $S_i = Y_i - \hat{Y}_i$ ($\hat{Y}_i = \hat{p}_i$ in Wrigley and Dunn, 1986). The values of $(X'VX)^{-1}$, $\hat{Y}_i$, and $h_{ii}$ can be used to calculate DBETA. In LOGDIA, $\mathrm{DBETA}_{ij}/\mathrm{SD}(\hat{\beta}_j)$ is printed out as a block of values with a separate column for each coefficient $\hat{\beta}_j$. 
</p><p>From Equation (10) it can be seen that a plot of $\mathrm{DBETA}_{ij}/\mathrm{SD}(\hat{\beta}_j)$ vs $i$ for each coefficient shows which observations are causing instability and how great an effect deletion of a particular observation ($i$) will have on an estimated coefficient ($\hat{\beta}_j$). A positive value of $\mathrm{DBETA}_{ij}$ indicates that removing observation $i$ will decrease the value of $\hat{\beta}_j$, a negative value the reverse (cf. Wrigley and Dunn, 1986, p. 367). This technique for the logistic model was developed originally by Pregibon (1981) who generalized methods proposed by Welsch and Kuh (1977) for the normal-theory linear model. </p></li><li><p>EXAMPLE 1--POLYMETALLIC MASSIVE SULPHIDE DEPOSITS IN ABITIBI AREA OF THE CANADIAN SHIELD </p><p>The example is analyzed in more detail in Agterberg (1987a). Table 1 shows complete input for a LOGDIA run. The first part of Table 1 consists of a small file in free format with input parameters (abit.inp). The first value on line 1 represents level of convergence. In absolute value, all scores should be less than this value before a set of coefficients is accepted as final. The second value on the first line is for maximum number of iterations allowed. The second line gives number of explanatory variables except the dummy variable needed to estimate the intercept, number of observations with one or more events, number of observations without events, and binomial sample size $N$, respectively. The third line specifies format of the data file which contains values for the </p><p>Table 1. Two files (abit.inp and abit.dat) forming input for LOGDIA run on data for Abitibi area, Canadian Shield. Parameters of abit.inp are in free format. (Their values should be separated by at least one blank or by a comma.) Third line of abit.inp contains format used for data in abit.dat. See text for complete explanation of input parameters. 
Numbers in abit.dat were derived by cross multiplication of values originally reported in Agterberg (1987a, table ll) </p><p>Input parameters file (abit.inp): </p><p>0.1 30
7 20 20 16.0
(7f8.0,f2.0)
00 </p><p>Data file (abit.dat): </p><p>Abitibi area (Example 1);variables x1 x2 x12 x15 x16 x18 x45 y
2.312 0.263 0.60806 0.10404 .000000 6.06438 .000000 1
2.354 1.074 2.52820 1.44536 .129470 3.58750 .000000 2
2.010 1.565 3.14565 0.15879 .673350 0.14271 .000000 1
2.482 0.918 2.27848 0.26309 .851326 4.63389 .000000 2
1.191 0.000 0.00000 0.03811 .003573 2.31888 .000000 2
3.283 0.127 0.41694 0.95207 .679581 2.78727 .000000 2
4.499 0.469 2.11003 2.06504 .026994 1.37669 .000000 1
1.527 0.258 0.39397 0.26570 .038175 5.21776 .000000 3
1.350 0.283 0.38205 0.27810 .008100 4.76145 .005562 2
3.514 0.902 3.16963 1.97135 .000000 0.59387 .144177 5
0.465 0.274 0.12741 0.05208 .016275 0.06045 .058464 1
2.345 0.066 0.15477 .241535 .021105 6.56131 .000000 1
3.268 0.287 0.93792 .081700 .189544 6.55234 .000000 2
1.969 1.109 2.18362 .675367 .013783 1.18731 .393421 7
1.444 0.098 0.14151 .031768 .229596 1.59418 .033374 1
3.514 1.286 4.51900 .449792 .256522 2.17868 .000000 3
1.486 0.356 0.52902 .285312 .627092 3.49953 .054144 2
0.071 0.057 0.00405 .000000 .000000 0.00540 .000000 1
2.660 0.906 2.40996 .069160 .311220 4.74012 .000000 1
0.866 0.154 0.13336 .046764 .009526 0.17060 .112320 1
0.056 0.000 0.00000 0.00000 .000000 0.00000 .000000 0
1.556 0.508 0.79045 0.76711 .000000 2.22041 .000000 0
1.556 0.257 0.39989 0.18672 .040456 5.20638 .000000 0
1.904 0.270 0.51408 0.00952 .188496 1.28139 .000000 0
0.930 0.443 0.41199 0.3~317 .027900 3.47541 .000000 0
0.051 0.000 0.00000 0.00000 .000000 0.29748 .000000 0
2.060 0.105 0.21630 0.06180 .611820 1.57796 .000000 0
2.147 1.280 2.74816 0.26193 .139555 1.06921 .000000 0
1.474 0.000 0.00000 0.05159 .166562 1.12466 .000980 0
4.072 0.340 1.38448 0.46421 .574152 1.18088 .000000 0
2.589 0.079 0.20453 0.18900 .038835 7.77736 .000000 0
0.635 0.215 0.13652 0.00254 .019050 0.27559 .000000 0
3.021 0.145 0.43804 0.14199 .048336 3.06959 .000000 0
0.000 0.000 0.00000 .000000 .000000 0.00000 .000000 0
4.074 0.022 0.08963 .032592 .020370 7.14172 .000000 0
3.416 0.348 1.18877 .000000 .204960 5.63298 .000000 0
0.000 0.000 0.00000 .000000 .000000 0.00000 .000000 0
1.344 0.177 0.23789 .068544 .000000 4.27661 .000000 0
2.728 0.038 0.10366 .147312 .000000 6.50355 .000000 0
1.909 0.259 0.49443 .045816 .032453 0.68342 .000000 0 </p></li><li><p>Table 2. Partial LOGDIA output for example of Table 1. Probabilities $p_i$ are related to frequencies $f_i$ by $p_i = f_i/N$ where $N$ (= 16) represents sample size (= number of 10-km cells with one or more deposits per 40-km cell) </p><p>Abitibi area (Example 1);variables x1 x2 x12 x15 x16 x18 x45 y </p><p>ESTIMATED FREQUENCIES AND PROBABILITIES FOR OBSERVATIONS WITH Y > 0 </p><p>Columns: CELL NO; FREQUENCIES (OBSERVED, ESTIMATED); PROBABILITIES (OBSERVED, ESTIMATED); COMPONENTS OF (CHISQUARE, DEVIANCE) </p><p>1 1.000 .69186 .06250 0t, 362 . I ~e,?~ ,~ '516 </p><p>2 2.000 2.1169l* .12500 .13606 .f 1665 . 15461 </p><p>3 1.000 .43030 .06250 .02689 .77508 .50310 </p><p>4 2.000 2.88505 .12500 .18032 3 ~ :2~* .' t622 </p><p>5 2.000 .42262 .12500 .02641 6.0~,; I~, .95531 </p><p>6 2.000 .89773 .12500 .05611 1.43386 .82117 </p><p>7 1.000 .95911 .06250 .05998 .001 ~0 .46169 </p><p>8 3.000 .92231 .18750 .05164 4.96613 ] . 16653 </p><p>9 2.000 .85481 .12500 .05343 1.6208~ .828~5 </p><p>10 5.000 5.36055 .31250 .33503 . O ~6~* 2 L. 2~448 </p><p>11 1.000 .36183 .06250 .02261 1. lb 160 .51653 </p><p>12 1.000 .85633 .06250 .053~0 .0262 ~ ~6914 </p><p>13 2.000 .16312 .12500 .04169 2. 10519 .81,626 </p><p>14 1.000 1, 1 [ 313 .t, 3150 ,t*&i*61 .0(i 37 I , ~ 1083 </p><p>15 1.000 .50069 .062...</p></li></ul>
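
The scoring iteration and the diagnostics of Equations (3), (6), (9) and (10) can be sketched in a few lines. The block below is a minimal illustration in Python, not the FORTRAN 77 code of LOGDIA: the function names (`solve`, `fit_scoring`, `logistic_diagnostics`), the convergence test on the elements of X'S, the tolerance, and the toy data are all assumptions made for this example. It handles binomial responses with one common sample size N and returns only the diagonal elements of the hat matrix and the modified hat matrix.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_scoring(X, y, N, tol=1e-8, max_iter=50):
    """Scoring: beta <- beta + (X'VX)^{-1} X'S until every element of X'S is below tol."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        prob = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in X]
        v = [N * q * (1.0 - q) for q in prob]              # diagonal of V
        s = [yi - N * q for yi, q in zip(y, prob)]          # scores S = Y - Yhat
        score = [sum(X[i][j] * s[i] for i in range(n)) for j in range(p)]
        info = [[sum(X[i][j] * v[i] * X[i][k] for i in range(n)) for k in range(p)]
                for j in range(p)]                          # X'VX
        if max(abs(g) for g in score) < tol:                # assumed stopping rule
            break
        beta = [b + d for b, d in zip(beta, solve(info, score))]
    return beta, prob, v, s, info

def logistic_diagnostics(X, y, N):
    beta, prob, v, s, info = fit_scoring(X, y, N)
    n, p = len(X), len(X[0])
    chi = [s[i] / math.sqrt(v[i]) for i in range(n)]        # Eq. (6)
    chi2 = sum(c * c for c in chi)
    w = [solve(info, X[i]) for i in range(n)]               # (X'VX)^{-1} X_i
    h = [v[i] * sum(a * b for a, b in zip(X[i], w[i])) for i in range(n)]  # diagonal of Eq. (3)
    h_star = [h[i] + chi[i] ** 2 / chi2 for i in range(n)]  # diagonal of Eq. (9)
    unit = lambda j: [1.0 if k == j else 0.0 for k in range(p)]
    sd = [math.sqrt(solve(info, unit(j))[j]) for j in range(p)]  # SD of each coefficient
    dbeta = [[w[i][j] * s[i] / (1.0 - h[i]) / sd[j] for j in range(p)]
             for i in range(n)]                             # Eq. (10) divided by SD
    return {"beta": beta, "prob": prob, "chi": chi, "h": h,
            "h_star": h_star, "sd": sd, "dbeta": dbeta}

# Toy data (hypothetical): intercept column plus one explanatory variable, N = 4.
X = [[1.0, float(k)] for k in range(6)]
y = [0, 1, 1, 3, 2, 4]
out = logistic_diagnostics(X, y, 4)
print("coefficients:", out["beta"])
print("trace of H:", sum(out["h"]))
```

A quick sanity check on any fit of this kind is that the diagonal of the logistic hat matrix sums to the number of columns of X and that of the modified hat matrix to one more, mirroring the property stated for the linear model. Solving small linear systems rather than forming an explicit inverse is a design choice of this sketch only.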
