logdia—fortran 77 program for logistic regression with diagnostics



Computers & Geosciences Vol. 15, No. 4, pp. 599-614, 1989. 0098-3004/89 $3.00 + 0.00
Printed in Great Britain. All rights reserved. Copyright © 1989 Pergamon Press plc

LOGDIA--FORTRAN 77 PROGRAM FOR LOGISTIC REGRESSION WITH DIAGNOSTICS*

F. P. AGTERBERG
Geological Survey of Canada, 601 Booth Street, Ottawa, Ontario, Canada K1A 0E8

(Received 20 November 1986; received for publication 27 September 1988)

Abstract--The program LOGDIA allows estimation of frequencies resulting from a binomial response. Two chi-square tests are performed to evaluate the logistic model for goodness-of-fit. The logistic hat matrix, modified logistic hat matrix, and other regression diagnostics are provided upon request.

Key Words: Logistic regression. Hat matrix. Regression diagnostics. Mineral-resource evaluation.

INTRODUCTION

The program LOGDIA was developed by the author in Microsoft FORTRAN on an IBM PC XT for recent applications of regression analysis in regional mineral-resource evaluation (Agterberg, 1987a).

Univariate qualitative response models for the prediction of discrete events have a long history in biometrics where they are used to estimate, for example, the probability that an insect will survive a specific dose of poison. Cox (1966) has provided a detailed account of the logistic qualitative response model, its multivariate extension employing several explanatory variables, and its relation with discriminant analysis. Pregibon (1981) discusses use of logistic regression to estimate frequencies, which are independent binomial responses, thus extending the approach to deal with a multiple qualitative response. The use of logistic regression to estimate frequencies of mineral deposits in cells from explanatory variables quantified for these cells was suggested originally by Tukey (1972). This led to applications of the logistic model (Agterberg, 1974) for estimating probability of occurrence of mineralization based on the nonlinear weighted least-squares estimator of Walker and Duncan (1967). Cox (1966) proposed the use of the maximum likelihood method with scoring in connection with the logistic qualitative response curve. Amemiya (1976) proved that the maximum likelihood and nonlinear weighted least-squares estimators provide identical results. Chung (1978) published the FORTRAN IV computer program LOGIST which is based on maximum likelihood with scoring. The program LOGDIA is a generalization of LOGIST in that frequencies of more than one discrete event in larger cells can be estimated and logistic regression diagnostics are provided. LOGDIA also uses the scoring method and its estimated values become identical to those of LOGIST for a single qualitative response.

* Geological Survey of Canada Contribution No. 38786.

During the past 10 yr, regression diagnostics for the general linear model were developed and widely applied. Extensive use is being made of the diagonal elements of the so-called hat matrix. Chi-square tests for evaluating logistic regression results for goodness-of-fit also have been obtained (see review by Wrigley, 1984). A number of computer programs which can be used for logistic regression analysis are discussed in the book by Wrigley (1985, chap. 7, p. 233-238). For example, the Generalized Linear Interactive Modelling (GLIM) package uses weighted least-squares algorithms for the logistic model (Baker and Nelder, 1978). Pregibon (1981) has stated that standard output from a maximum likelihood fit for the logistic model consists of a subset of the following:

(a) Estimated parameter vector, $\hat{\beta}$;
(b) Individual coefficient standard deviations, SD($\hat{\beta}_j$);
(c) Estimated covariance matrix of $\hat{\beta}$;
(d) Chi-square goodness-of-fit statistic;
(e) Individual components of chi-square;
(f) Deviance D.

The program LOGDIA provides these statistics (a)-(f) plus individual components of deviance. Pregibon (1981) also has pointed out that with a properly designed computing package for fitting the maximum likelihood model, logistic regression diagnostics are essentially "free for the asking". LOGDIA provides the logistic hat matrix and corresponding modified hat matrix for measuring leverage and influence of individual observations on estimated values. A measure of influence of observations on regression coefficients also is provided.

HAT MATRIX AND MODIFIED HAT MATRIX

In multiple regression based on the general linear model, the hat matrix is the symmetrical (n × n) matrix H satisfying

$$\hat{Y} = HY; \qquad H = X(X'X)^{-1}X'. \tag{1}$$

Here the (n × p) matrix X contains observed values of explanatory variables $x_i$ with i = 1, ..., p; Y is an (n × 1) column vector for the n observations on the dependent variable y; and $\hat{Y}$ is the (n × 1) column vector of estimated values obtained from Y through the hat matrix. When Y is appended to X to obtain a new matrix Z = (X; Y) with (p + 1) columns, the modified hat matrix H* follows from

$$H^{*} = Z(Z'Z)^{-1}Z'. \tag{2}$$

The elements $h_{ij}$ of H provide a measure of the amount of "leverage" exerted by the observation $Y_j$ on the estimated value $\hat{Y}_i$. The elements $h^{*}_{ij}$ of the modified hat matrix denote the amount of "influence" exerted by $Y_j$ on $\hat{Y}_i$ (cf. Gray and Ling, 1984). The sum of the diagonal elements of the hat matrix is equal to p (= number of explanatory variables) and that of the modified hat matrix is p + 1.
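The trace properties stated above can be checked numerically. The following is a minimal sketch in Python rather than the FORTRAN 77 of LOGDIA itself; the five observations and the helper names `transpose`, `matmul`, `inverse`, and `hat_matrix` are illustrative assumptions, and the constant term is counted here as one of the p = 2 explanatory variables.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def hat_matrix(x):
    """Return X (X'X)^-1 X', the form used in Equations (1) and (2)."""
    xt = transpose(x)
    return matmul(matmul(x, inverse(matmul(xt, x))), xt)

# Illustrative data (not from the paper): constant term plus one variable.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 7.0]]
Y = [1.0, 2.0, 2.0, 5.0, 6.0]

H = hat_matrix(X)                          # Equation (1)
Z = [row + [y] for row, y in zip(X, Y)]    # Z = (X; Y) with p + 1 columns
H_star = hat_matrix(Z)                     # Equation (2)

trace_H = sum(H[i][i] for i in range(len(X)))
trace_H_star = sum(H_star[i][i] for i in range(len(Z)))
print(round(trace_H, 6), round(trace_H_star, 6))
```

Both traces equal the number of columns of the matrix being projected onto, which is why appending Y raises the sum of the diagonal elements by exactly one.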

The hat matrix of the logistic model satisfies:

$$H = V^{1/2}X(X'VX)^{-1}X'V^{1/2} \tag{3}$$

where V is an (n × n) diagonal matrix with nonzero elements $v_i = N\hat{p}_i(1 - \hat{p}_i)$. The parameters N and $p_i$ represent sample size and probability of a binomial distribution, respectively. The estimated probability $\hat{p}_i$ for observation i becomes available after application of the maximum likelihood method. In the logistic qualitative response model, the observed values of Y are either 1 or 0 with N = 1. For multiple qualitative response, Y is a vector of integer numbers or zeros representing a sample of n independent binomial responses $B(N, p_i)$. In the maximum likelihood method, a vector of scores $S = Y - \hat{Y}$ is made to converge by iteration until the relation X'S = 0 is satisfied. At convergence, we have the coefficients $\hat{\beta}$ with

$$\hat{\beta} = (X'VX)^{-1}X'VA; \qquad A = X\hat{\beta} + V^{-1}S. \tag{4}$$

The logits $\theta_i$ of the probabilities $p_i$ satisfy

$$\theta_i = \log_e \frac{p_i}{1 - p_i}. \tag{5}$$

The estimated logits become $\hat{\theta} = X\hat{\beta}$. The estimated frequencies $\hat{Y} = N\hat{p}$ follow from $\hat{p}$. Here $\hat{p}$ is the vector of estimated probabilities $\hat{p}_i$. The variance-covariance matrix of $\hat{\beta}$ is $(X'VX)^{-1}$.
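The scoring iteration described above can be sketched for the simplest case of a constant term plus one explanatory variable. This is a hedged illustration in Python, not the LOGDIA source: the data, the function name `fit_by_scoring`, and the stopping rule (largest absolute element of X'S below a tolerance) are assumptions made for the example. Each pass adds $(X'VX)^{-1}X'S$ to the current coefficients, which is algebraically the same update as Equation (4).

```python
import math

def fit_by_scoring(x, y, n_trials, tol=1e-9, max_iter=50):
    """Fit logit(p_i) = b0 + b1 * x_i to counts y_i out of n_trials by scoring."""
    b0 = b1 = 0.0                                   # start from zero coefficients
    for iteration in range(1, max_iter + 1):
        prob = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        score = [yi - n_trials * pi for yi, pi in zip(y, prob)]   # S = Y - Y_hat
        v = [n_trials * pi * (1.0 - pi) for pi in prob]            # diagonal of V
        # Elements of X'S for X = (1, x)
        g0 = sum(score)
        g1 = sum(xi * si for xi, si in zip(x, score))
        if max(abs(g0), abs(g1)) < tol:
            return (b0, b1), prob, iteration
        # Elements of the symmetric 2 x 2 matrix X'VX and its determinant
        a = sum(v)
        b = sum(xi * vi for xi, vi in zip(x, v))
        c = sum(xi * xi * vi for xi, vi in zip(x, v))
        det = a * c - b * b
        # beta <- beta + (X'VX)^-1 X'S, using the closed-form 2 x 2 inverse
        b0 += (c * g0 - b * g1) / det
        b1 += (a * g1 - b * g0) / det
    raise RuntimeError("scoring did not converge")

# Illustrative binomial data (not from the paper): y events out of N = 10.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 7.0, 8.0]
N = 10
beta, p_hat, n_iter = fit_by_scoring(x, y, N)
estimated_frequencies = [N * p for p in p_hat]
print(beta, n_iter)
```

Because the first column of X is the constant term, X'S = 0 implies that the estimated frequencies add up to the observed total, which gives a quick check on convergence.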

For goodness-of-fit, the following two statistics can be used:

(a) Chi-square values $\chi_i^2$ obtained after squaring

$$\chi_i = (y_i - N\hat{p}_i)/\{N\hat{p}_i(1 - \hat{p}_i)\}^{1/2}; \tag{6}$$

(b) Components of deviance with

$$d_i = -2\{y_i \log_e \hat{p}_i + (N - y_i)\log_e(1 - \hat{p}_i)\}. \tag{7}$$

Addition of the individual chi-square values yields $\chi^2$ which is distributed as theoretical chi-square with (n - p) degrees of freedom when the fit is good. Addition of the $d_i$ values yields the deviance D which also is $\chi^2$-distributed with (n - p) degrees of freedom. In practical applications, the chi-square value and deviance may differ considerably and either one can be greater than the other.
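As a worked check of Equations (6) and (7), the sketch below evaluates both components for the qualitative response case N = 1, using two estimated probabilities listed in Table 5 of this paper (cell 1 with y = 1 and cell 18 with y = 0). The function names are illustrative assumptions; the code is Python, not LOGDIA.

```python
import math

def chi_component(y, n_trials, p):
    """Equation (6): chi_i = (y - N p) / {N p (1 - p)}^(1/2)."""
    return (y - n_trials * p) / math.sqrt(n_trials * p * (1.0 - p))

def deviance_component(y, n_trials, p):
    """Equation (7): d_i = -2 {y log p + (N - y) log(1 - p)}."""
    return -2.0 * (y * math.log(p) + (n_trials - y) * math.log(1.0 - p))

# Cell 1 of Table 5: deposit present (y = 1), estimated probability .92969
chi_sq_1 = chi_component(1, 1, 0.92969) ** 2
dev_1 = deviance_component(1, 1, 0.92969)

# Cell 18 of Table 5: no deposit (y = 0), estimated probability .09467
chi_sq_18 = chi_component(0, 1, 0.09467) ** 2
dev_18 = deviance_component(0, 1, 0.09467)

print(round(chi_sq_1, 5), round(dev_1, 5), round(chi_sq_18, 5), round(dev_18, 5))
```

Table 5 lists .07563 and .14582 for cell 1 and .10457 and .19892 for cell 18; small differences in the last decimal are to be expected because the tabulated probabilities are rounded to five places.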

Suppose that the vector A is appended to X to create a new matrix Z = (X; A). Then the modified logistic hat matrix follows from:

$$H^{*} = V^{1/2}Z(Z'VZ)^{-1}Z'V^{1/2}. \tag{8}$$

The SAS MATRIX procedure (SAS, 1985) can be used to obtain H and H* in ordinary regression [Eqs. (1) and (2)] as well as logistic regression [Eqs. (3) and (8)]. The maximum likelihood estimate of the coefficients $\hat{\beta}$ must be computed first by another computer program before the logistic hat matrix can be obtained. In LOGDIA, H is calculated directly from X and V by Equation (3). Next H* is obtained from H by using Equation (6) because the elements of these two matrices are related according to

$$h^{*}_{ij} = h_{ij} + \chi_i\chi_j/\chi^2. \tag{9}$$

A mathematical proof of Equation (9) is given in Appendix I.
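A short sketch of why Equation (9) holds may help the reader at this point. It is not a reproduction of Appendix I; the shorthand $\tilde{X}$, $\tilde{A}$, and $r$ is introduced here only for the argument, which relies on the standard result that adding one column to a projection matrix adds the outer product of the normalized residual of that column.

```latex
% Sketch only; \tilde{X}, \tilde{A} and r are shorthand introduced for this argument.
\begin{align*}
\tilde{X} &= V^{1/2}X, \qquad
\tilde{A} = V^{1/2}A = \tilde{X}\hat{\beta} + V^{-1/2}S
  && \text{(from Eq. 4)}\\
H &= \tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}'
  && \text{(Eq. 3)}\\
H^{*} &= H + \frac{rr'}{r'r}, \qquad r = (I - H)\tilde{A}
  && \text{(column } \tilde{A} \text{ appended, Eq. 8)}\\
r &= V^{-1/2}S - V^{1/2}X(X'VX)^{-1}X'S = V^{-1/2}S
  && \text{(since } H\tilde{X} = \tilde{X} \text{ and } X'S = 0)\\
r_i &= \frac{y_i - N\hat{p}_i}{\{N\hat{p}_i(1 - \hat{p}_i)\}^{1/2}} = \chi_i,
  \qquad r'r = \sum_i \chi_i^2 = \chi^2
  && \text{(Eq. 6)}\\
h^{*}_{ij} &= h_{ij} + \frac{\chi_i\chi_j}{\chi^2}
  && \text{(Eq. 9)}
\end{align*}
```

The step that removes the second term of $r$ uses the convergence condition X'S = 0, so the relation is exact only for the final maximum likelihood estimates.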

Other properties of hat matrices are discussed in Agterberg (1987a). One of their potentially important uses in mineral-resource appraisal studies consists of recognition of similar cells in a region. Cluster analysis of the hat matrix can be useful to define groups of similar cells and for ordering the cells in a region according to degree of similarity. In this type of approach both diagonal and off-diagonal elements of the hat matrix (or the modified hat matrix) are used. Examples of this clustering are given in Agterberg and Franklin (1986) for the linear model, and in Agterberg (1987a) for the linear and logistic models. The basic building blocks $\chi_i$, $d_i$, $h_{ij}$, and $h^{*}_{ij}$ provided by LOGDIA can be used for the calculation of other regression diagnostics. The influence of individual observations on estimated coefficients is measured in LOGDIA according to the following method. Wrigley and Dunn (1986) in a paper on graphical diagnostics for logistic oil-exploration models provide separate plots of DBETA$_{ij}$/SD($\hat{\beta}_j$) using the difference vector

$$\mathrm{DBETA}_i = \hat{\beta}\,(\text{all observations}) - \hat{\beta}\,(\text{all observations except } i) = (X'VX)^{-1}X_iS_i/(1 - h_{ii}) \tag{10}$$

where $X_i$ is a (p × 1) vector of the values of the explanatory variables for observation i; $S_i = Y_i - \hat{Y}_i$ ($= y_i - \hat{p}_i$ in Wrigley and Dunn, 1986). The values of $(X'VX)^{-1}$, $S_i$, and $h_{ii}$ can be used to calculate DBETA. In LOGDIA, DBETA$_{ij}$/SD($\hat{\beta}_j$) is printed out as a block of values with a separate column for each coefficient $\hat{\beta}_j$.

From Equation (10) it can be seen that a plot of DBETA$_{ij}$/SD($\hat{\beta}_j$) vs i for each coefficient shows which observations are causing instability and how great an effect deletion of a particular observation (i) will have on an estimated coefficient ($\hat{\beta}_j$). A positive value of DBETA$_{ij}$ indicates that removing observation i will decrease the value of $\hat{\beta}_j$, a negative value the reverse (cf. Wrigley and Dunn, 1986, p. 367). This technique for the logistic model was developed originally by Pregibon (1981) who generalized methods proposed by Welsch and Kuh (1977) for the normal-theory linear model.
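The calculation in Equation (10) can be illustrated with a small sketch. It is written in Python under stated assumptions and is not the LOGDIA code: the data are invented, the model has a constant term and one explanatory variable so that $(X'VX)^{-1}$ has a closed form, and the names `fit_by_scoring` and `standardized_dbeta` are hypothetical.

```python
import math

def fit_by_scoring(x, y, n_trials, tol=1e-9, max_iter=50):
    """Scoring fit of logit(p_i) = b0 + b1 * x_i; returns p_hat and (X'VX)^-1."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        prob = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        score = [yi - n_trials * pi for yi, pi in zip(y, prob)]
        v = [n_trials * pi * (1.0 - pi) for pi in prob]
        a = sum(v)
        b = sum(xi * vi for xi, vi in zip(x, v))
        c = sum(xi * xi * vi for xi, vi in zip(x, v))
        det = a * c - b * b
        cov = (c / det, -b / det, a / det)   # elements (0,0), (0,1), (1,1)
        g0 = sum(score)
        g1 = sum(xi * si for xi, si in zip(x, score))
        if max(abs(g0), abs(g1)) < tol:
            return prob, cov
        b0 += cov[0] * g0 + cov[1] * g1
        b1 += cov[1] * g0 + cov[2] * g1
    raise RuntimeError("scoring did not converge")

# Illustrative binomial data (not from the paper): y events out of N = 10.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 4.0, 7.0, 8.0]
N = 10
p_hat, (c00, c01, c11) = fit_by_scoring(x, y, N)
sd = (math.sqrt(c00), math.sqrt(c11))        # SD of each coefficient

hat_diagonal = []
standardized_dbeta = []
for xi, yi, pi in zip(x, y, p_hat):
    s_i = yi - N * pi                        # score S_i
    v_i = N * pi * (1.0 - pi)
    # h_ii = v_i * X_i' (X'VX)^-1 X_i with X_i = (1, x_i), from Equation (3)
    h_ii = v_i * (c00 + 2.0 * c01 * xi + c11 * xi * xi)
    factor = s_i / (1.0 - h_ii)
    dbeta = ((c00 + c01 * xi) * factor, (c01 + c11 * xi) * factor)  # Eq. (10)
    hat_diagonal.append(h_ii)
    standardized_dbeta.append(tuple(d / s for d, s in zip(dbeta, sd)))

for i, row in enumerate(standardized_dbeta, start=1):
    print(i, round(hat_diagonal[i - 1], 4), [round(v, 4) for v in row])
```

Each printed row corresponds to one observation, with one column per coefficient, which mirrors the block layout described for the LOGDIA output.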

EXAMPLE 1--POLYMETALLIC MASSIVE SULPHIDE DEPOSITS IN ABITIBI AREA OF THE CANADIAN SHIELD

The example is analyzed in more detail in Agterberg (1987a). Table 1 shows complete input for a LOGDIA run. The first part of Table 1 consists of a small file in free format with input parameters (abit.inp). The first value on line 1 represents level of convergence. In absolute value, all scores should be less than this value before a set of coefficients is accepted as final. The second value on the first line is for maximum number of iterations allowed. The second line gives number of explanatory variables except the dummy variable needed to estimate the intercept, number of observations with one or more events, number of observations without events, and binomial sample size N, respectively. The third line specifies format of the data file which contains values for the

Table 1. Two files (abit.inp) and (abit.dat) forming input for LOGDIA run on data for Abitibi area, Canadian Shield. Parameters of abit.inp are in free format. (Their values should be separated by at least one blank or by a comma.) Third line of abit.inp contains format used for data in abit.dat. See text for complete explanation of input parameters. Numbers in abit.dat were derived by cross multiplication of values originally reported in Agterberg (1987a, table II)

Input parameters file (abit.inp):

    0.1 30
    7 20 20 16.0
    (7f8.0,f2.0)
    0 0

Data file (abit.dat):

    Abitibi area (Example 1);variables x1 x2 x12 x15 x16 x18 x45 y
       2.312   0.263 0.60806 0.10404 .000000 6.06438 .000000 1
       2.354   1.074 2.52820 1.44536 .129470 3.58750 .000000 2
       2.010   1.565 3.14565 0.15879 .673350 0.14271 .000000 1
       2.482   0.918 2.27848 0.26309 .851326 4.63389 .000000 2
       1.191   0.000 0.00000 0.03811 .003573 2.31888 .000000 2
       3.283   0.127 0.41694 0.95207 .679581 2.78727 .000000 2
       4.499   0.469 2.11003 2.06504 .026994 1.37669 .000000 1
       1.527   0.258 0.39397 0.26570 .038175 5.21776 .000000 3
       1.350   0.283 0.38205 0.27810 .008100 4.76145 .005562 2
       3.514   0.902 3.16963 1.97135 .000000 0.59387 .144177 5
       0.465   0.274 0.12741 0.05208 .016275 0.06045 .058464 1
       2.345   0.066 0.15477 .241535 .021105 6.56131 .000000 1
       3.268   0.287 0.93792 .081700 .189544 6.55234 .000000 2
       1.969   1.109 2.18362 .675367 .013783 1.18731 .393421 7
       1.444   0.098 0.14151 .031768 .229596 1.59418 .033374 1
       3.514   1.286 4.51900 .449792 .256522 2.17868 .000000 3
       1.486   0.356 0.52902 .285312 .627092 3.49953 .054144 2
       0.071   0.057 0.00405 .000000 .000000 0.00540 .000000 1
       2.660   0.906 2.40996 .069160 .311220 4.74012 .000000 1
       0.866   0.154 0.13336 .046764 .009526 0.17060 .112320 1
       0.056   0.000 0.00000 0.00000 .000000 0.00000 .000000 0
       1.556   0.508 0.79045 0.76711 .000000 2.22041 .000000 0
       1.556   0.257 0.39989 0.18672 .040456 5.20638 .000000 0
       1.904   0.270 0.51408 0.00952 .188496 1.28139 .000000 0
       0.930   0.443 0.41199 0.34317 .027900 3.47541 .000000 0
       0.051   0.000 0.00000 0.00000 .000000 0.29748 .000000 0
       2.060   0.105 0.21630 0.06180 .611820 1.57796 .000000 0
       2.147   1.280 2.74816 0.26193 .139555 1.06921 .000000 0
       1.474   0.000 0.00000 0.05159 .166562 1.12466 .000980 0
       4.072   0.340 1.38448 0.46421 .574152 1.18088 .000000 0
       2.589   0.079 0.20453 0.18900 .038835 7.77736 .000000 0
       0.635   0.215 0.13652 0.00254 .019050 0.27559 .000000 0
       3.021   0.145 0.43804 0.14199 .048336 3.06959 .000000 0
       0.000   0.000 0.00000 .000000 .000000 0.00000 .000000 0
       4.074   0.022 0.08963 .032592 .020370 7.14172 .000000 0
       3.416   0.348 1.18877 .000000 .204960 5.63298 .000000 0
       0.000   0.000 0.00000 .000000 .000000 0.00000 .000000 0
       1.344   0.177 0.23789 .068544 .000000 4.27661 .000000 0
       2.728   0.038 0.10366 .147312 .000000 6.50355 .000000 0
       1.909   0.259 0.49443 .045816 .032453 0.68342 .000000 0


Table 2. Partial LOGDIA output for example of Table 1. Probabilities $p_i$ are related to frequencies $f_i$ by $p_i = f_i/N$ where N (= 16) represents sample size (= number of 10-km cells with one or more deposits per 40-km cell)

    Abitibi area (Example 1);variables x1 x2 x12 x15 x16 x18 x45 y

    ESTIMATED FREQUENCIES AND PROBABILITIES FOR OBSERVATIONS WITH Y(I) > 0

    CELL   ** FREQUENCIES **    ** PROBABILITIES **   ** COMPONENTS OF **
     NO    OBSERVED ESTIMATED   OBSERVED  ESTIMATED   CHISQUARE  DEVIANCE
      1     1.000    .69786      .06250    .04362       .13678    .47516
      2     2.000   2.17694      .12500    .13606       .01665    .75461
      3     1.000    .43030      .06250    .02689       .77508    .50310
      4     2.000   2.88505      .12500    .18032       .33124    .77622
      5     2.000    .42262      .12500    .02641      6.0471?    .95531
      6     2.000    .89773      .12500    .05611      1.43386    .82117
      7     1.000    .95971      .06250    .05998       .00180    .46769
      8     3.000    .92231      .18750    .05764      4.96673   1.16653
      9     2.000    .85481      .12500    .05343      1.6208?    .82845
     10     5.000   5.36055      .31250    .33503       .03647   1.24448
     11     1.000    .36183      .06250    .02261      1.15160    .51653
     12     1.000    .85433      .06250    .05340       .02624    .46914
     13     2.000    .76312      .12500    .04769      2.10519    .84626
     14     7.000   7.11313      .43750    .44457       .00324   1.37083
     15     1.000    .50069      .06250    .03129       .51402    .49266
     16     3.000   2.33683      .18750    .14605       .22039    .97799
     17     2.000   2.12126      .12500    .13258       .00799    .75405
     18     1.000    .40242      .06250    .02515       .91028    .50812
     19     1.000   1.19635      .06250    .07477       .03483    .46988
     20     1.000    .61493      .06250    .03843       .25077    .48084

    ESTIMATED FREQUENCIES AND PROBABILITIES FOR OBSERVATIONS WITH Y(I) = 0

    CELL   ** FREQUENCIES **    ** PROBABILITIES **   ** COMPONENTS OF **
     NO    OBSERVED ESTIMATED   OBSERVED  ESTIMATED   CHISQUARE  DEVIANCE
     21      .000    .47297      .00000    .02956       .48737    .06001
     22      .000    .47547      .00000    .02972       .49003    .06033
     23      .000    .84478      .00000    .05280       .89186    .10849
     24      .000    .22265      .00000    .01392       .22580    .02803
     25      .000    .55971      .00000    .03498       .58000    .07122
     26      .000    .52201      .00000    .03263       .53962    .06634
     27      .000    .47710      .00000    .02982       .49176    .06054
     28      .000    .29719      .00000    .01857       .30281    .03750
     29      .000    .30834      .00000    .01927       .31440    .03892
     30      .000    .32739      .00000    .02046       .33422    .04135
     31      .000   1.04391      .00000    .06524      1.11677    .13494
     32      .000    .22559      .00000    .01410       .22882    .02840
     33      .000    .24969      .00000    .01561       .2536?    .03146
     34      .000    .49432      .00000    .03090       .51008    .06277
     35      .000    .21937      .00000    .01371       .22242    .02761
     36      .000    .57894      .00000    .03618       .60067    .07371
     37      .000    .49432      .00000    .03090       .51008    .06277
     38      .000    .61542      .00000    .03846       .64004    .07845
     39      .000    .55133      .00000    .03446       .57100    .07013
     40      .000    .14613      .00000    .00913       .14747    .01835

    CHISQUARE = 30.05003          DEVIANCE = 16.04032

    NUMBER OF DEGREES OF FREEDOM = 32

explanatory variables and frequencies of events for the dependent variable, respectively. Finally, the fourth line contains two (0, 1) switches. If the first value = 0, the first set of coefficients used to start the iteration process are set equal to zero. If its value is equal to one, initial estimates of $\beta$ (in format 8F10.0) should be provided on one or more separate lines at the end of the input parameters file. If the second value = 1, the last estimates of the coefficients will be stored in unit 97 if no convergence was reached for the specified number of iterations. These estimates then can be used as initial estimates of coefficients in another computer experiment.
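The four-line layout of the input parameters file described above can be read with a few lines of code. The sketch below is in Python and only illustrates the layout; it is not how LOGDIA itself reads the file. The dictionary keys and the function name `parse_parameters` are hypothetical, the sample text is the abit.inp content of Table 1 as read from the scan, and optional initial estimates are split on blanks or commas here although the paper specifies the fixed format 8F10.0 for them.

```python
import re

SAMPLE_ABIT_INP = """0.1 30
7 20 20 16.0
(7f8.0,f2.0)
0 0
"""

def fields(line):
    """Split a free-format line on blanks and/or commas."""
    return [token for token in re.split(r"[,\s]+", line.strip()) if token]

def parse_parameters(text):
    lines = [line for line in text.splitlines() if line.strip()]
    tolerance, max_iterations = fields(lines[0])
    n_vars, n_with, n_without, sample_size = fields(lines[1])
    start_switch, store_switch = fields(lines[3])
    params = {
        "tolerance": float(tolerance),              # level of convergence
        "max_iterations": int(max_iterations),
        "n_variables": int(n_vars),                 # intercept not counted
        "n_with_events": int(n_with),
        "n_without_events": int(n_without),
        "sample_size": float(sample_size),          # binomial N
        "data_format": lines[2].strip(),            # FORTRAN format of data file
        "use_initial_estimates": start_switch == "1",
        "store_last_estimates": store_switch == "1",
    }
    if params["use_initial_estimates"]:
        # Assumption: values on the remaining lines are separated by blanks or commas.
        params["initial_estimates"] = [
            float(token) for line in lines[4:] for token in fields(line)
        ]
    return params

params = parse_parameters(SAMPLE_ABIT_INP)
print(params)
```

For the Abitibi run this yields seven explanatory variables, 20 + 20 = 40 observations, and N = 16, with both switches off.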

The first line of the data file provides a brief description of the problem at hand. In the example of Table 1, it contains names of area and explanatory variables, respectively. Seven explanatory variables were selected by stepwise regression (see Agterberg, 1987a). They represent: $x_1$ = mafic volcanic rocks; $x_2$ = felsic volcanic rocks; $x_{12}$ = mafic volcanic rocks × felsic volcanic rocks; $x_{1?}$ = mafic volcanic rocks × ultramafics; $x_{18}$ = mafic volcanic rocks × felsic intrusive rocks; and $x_{45}$ = metamorphic rocks × mafic intrusive rocks. The final column of the data block represents number of small (10 × 10 km) cells containing one or more polymetallic massive sulphide deposits per larger (40 × 40 km) cell. Because there are 16 small cells in a larger cell, sample size N = 16. This is one of the two parameters of the binomial response $B(N, p_i)$, with the other parameter $p_i$ representing the probability that a small cell within the larger cell labeled i contains one or more deposits. The observations of cells with deposits precede those of cells without deposits in the input file.

When LOGDIA is run, the user is asked successively for the names of the file with input parameters (abit.inp in this example), the data file (abit.dat) and the output file (abit.out) on which the results of LOGDIA will be written. Only selected output is shown on the monitor during a session and the output file should be printed out afterwards to obtain full information. Calculation of the hat matrix, DBETA and modified hat matrix is performed only when the user responds with 'YES' (or 'yes') to the corresponding prompt on the monitor.

Although level of convergence for this example was set rather low (= 0.1), satisfactory convergence was reached (after six iterations), as can be verified from the output by comparing successive sets of estimated coefficients with one another. Table 2 shows estimated frequencies and probabilities with components of chi-square and deviance. These results, as well as the hat matrix and modified hat matrix, have been discussed in Agterberg (1987a) where observations with rows and columns of the hat matrix with one or more elements > 0.150 were clustered to obtain a sequence of cells in order of similarity. Table 3 shows this submatrix of the complete hat matrix after reordering the observations by using single linkage cluster analysis. The numbers of the observations in Table 3 are the same as those in Table 2. Nearly all relatively large elements of the hat matrix have become concentrated along the principal diagonal after reordering. Subsets of similar cells tend to fall in square blocks along the diagonal. In Table 3, six such blocks have been outlined (cell No. 6 would belong to two blocks). This example illustrates that clustering of the hat matrix can be useful for defining sets of similar observations.

EXAMPLE 2--POLYMETALLIC MASSIVE SULPHIDE DEPOSITS IN THE BATHURST AREA, NEW BRUNSWICK

Table 4 contains scores for two rock types ($x_1$ = felsic volcanics and $x_2$ = mixed volcanics) in 64 cells measuring 10 km on a side for an 80 × 80 km area near Bathurst, New Brunswick. The spatial cross correlation of these two rock types in this area was studied by Agterberg (1987b, table X). Division of a rock type score by 4 provides an estimate of the areal percentage of a 10 × 10 km cell which is underlain by the rock type considered. Cells known to contain polymetallic massive sulphide deposits in the Bathurst area are shown by bold numbers in Table 4. Cross multiplication of the scores for $x_1$ and $x_2$ provides numbers for a third explanatory variable $x_3 = x_1x_2$ newly defined for the example in this paper. The qualitative response model was employed with $Y_i$ = 1 and 0 (i = 1, ..., 64) for 10-km cells with and without deposits, respectively. The 17 cells with deposits were numbered 1-17 in the input file counting from left to right along successive rows of Table 4. The remaining 47 cells without deposits were numbered from 18 to 64 counting in the same way.

As a starting point, the four coefficients (constant term and those for $x_1$, $x_2$, and $x_3$) were set equal to zero. Final coefficients were obtained after five iterations when the absolute value of all scores had become < 0.1. The final solution in terms of logits $\theta$ is:

$$\theta = \underset{(0.525047)}{-2.257871} + \underset{(0.004346)}{0.014398}\,x_1 + \underset{(0.006340)}{0.008042}\,x_2 - \underset{(0.000071)}{0.000101}\,x_1x_2.$$
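The fitted logit equation above can be turned into estimated probabilities by inverting Equation (5). The following Python sketch is an illustration only; the coefficient values are those read from the scanned equation, the function name `estimated_probability` is hypothetical, and the pairing of the scores (24, 61) with cell 63 of Table 5 is an inference from the cell numbering rule rather than something stated in the paper.

```python
import math

# Coefficients of the Example 2 logit equation as read from the scan.
B0, B1, B2, B12 = -2.257871, 0.014398, 0.008042, -0.000101

def estimated_probability(x1, x2):
    """Invert theta = log_e{p / (1 - p)} for given felsic (x1) and mixed (x2) scores."""
    theta = B0 + B1 * x1 + B2 * x2 + B12 * x1 * x2
    return 1.0 / (1.0 + math.exp(-theta))

# Cells in which both rock types have zero scores; Table 5 lists .09467.
p_zero = estimated_probability(0, 0)

# A cell with scores 24 (felsic) and 61 (mixed) from the bottom row of Table 4;
# Table 5 lists .17228 for what appears to be this cell (No. 63).
p_cell = estimated_probability(24, 61)

print(round(p_zero, 5), round(p_cell, 5))
```

Agreement is limited by the rounding of the printed coefficients, which matters most for the product term when both scores are large.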

The numbers in parentheses are standard deviations of the coefficients. The estimated probabilities and components of chi-square and deviance are given in Table 5. Note that the final values for chi-square (= 61.4) and deviance (= 53.3) are similar. The logistic model provides a good fit [degrees of freedom (df) = 60], because these values are less than the tabulated theoretical $\chi^2$ value of 79.1 for level of significance $\alpha$ = 0.05. Individual components of chi-square and deviance are distributed approximately as theoretical $\chi^2$ with 1 df (= 3.84 for $\alpha$ = 0.05). Three 10-km cells have components of chi-square and deviance exceeding this threshold value. The largest anomalies apply to two cells with mineral deposits


Table 3. Partial hat matrix for example of Table 1. Numbers were multiplied by 100 and rounded to nearest integer. Only observations with at least one element > 0.150 in hat matrix have been retained. Order of observations (column heading in table) was changed by single linkage cluster analysis in order to concentrate relatively large positive elements of hat matrix along diagonal. Six blocks (shown in bold print) are outlined to show subsets of similar cells (cell No. 14 forms single-element block)

22 2 25 3 28 ~ t7 6 7 10 30 16 t9 tl I2 t~

22 9 16 9 lI 9 -5 -2 1 ~ -0 I -$ : ! -! ~ -O

2 16 56 17 17 13 19 ! -~ 1~ 16 -.12 -7 ~ Z 2 -5

25 9 17 l0 g 7 -t i -2 7+ -5 - ; -~ ! ? ~ 1

3 l i 17 8 35 23 io -3 ~ ~ -L5 9 6 9 ! ~ -19 5

23 9 13 7 23 18 -~ -[0 -3 ~ -~:, 'i 7 :% -;~ -~

-5 lO -1 [0 -3 )6 34 15 - l l -3 2 1~ I~! 2 -+3 -1

17 ~2 I I -3 -10 34 35 19 -6 2 ~ -12 -2 ~ 3 5

6 J -~ -2 ~ -3 15 19 31 19 9 17 -16 -~ '~ i -5

7 ~ l0 O 3 ~ - l l -6 19 34 29 14 -~ -~ ') ~ -10

10 -C 16 -5 -15 -9 -3 2 9 29 69 1 16 . z . ~ i iI

30 i -12 -~ 9 5 2 t 17 la t 17 ~ i ~ ++2 -1

16 -~ -7 -8 6 7 1~ -12 -16 -~ 16 ~ 74 22 2 +2 -~

19 ~ 6 3 9 8 t2 -2 -5 -~ +7 l Z2 l~ 7+ ~+ 3

31 ~ 2 2 - I ~ -7 2 ~ 0 0 -g -~ .2 5 22 17 o

12 0 2 3 -tO -4 -0 3 I 2 ~ -2 -2 +~ 17 13 -0

I~ -O -5 I 5 ~ -1 5 -5 -10 II -1 -~ } G -o 95

Table 4. Scores for felsic (top numbers in pairs of numbers) and mixed (bottom numbers) volcanics in square array of 64 cells each measuring 10 km on a side in Bathurst area, New Brunswick. Each score denotes number of 500 m subcells (per 10-km cell) with rock type occurring at subcell centers (from Agterberg, 1987b, table X). Scores for 10-km cells containing one or more polymetallic massive sulphide deposits are shown in bold print

      0    0    0    0    0    0    0    0
      0    0    0    0    0    3   24  111

      0    0   10  172  294   51    0    0
      0    0   84  114   50  200   38    0

      0   28  211  378  296    1   52    0
      0   24  112   20  101  272  221    0

      0    0  175  350  253   70  229   52
      0    0    0   47   67  162   16   10

      0    0  116  333  382  342  267    6
      0    0    0   50    0   19   41    0

      0   30   12  222  301  198   56    3
      0    4   65  119    6  153   30    0

      0    0    0   85   17   10   12    0
      0    0    2   66   13    0    0    0

      0    0    0    4    1    0   24    0
      0    0    0    0    0   14   61    0

present but without felsic or mixed volcanics. Although the rock types $x_1$ and $x_2$ occur in the vicinity of all polymetallic massive sulphide deposits in the Bathurst area, their scores are zero in some cells with deposits which occur just outside the main outcrop areas for $x_1$ and $x_2$. In such cells, these volcanics may be overlain by other rock types. As many as twenty 10-km cells without deposits have zero scores for $x_1$ and $x_2$, but the corresponding components of chi-square and deviance are relatively small for these cells. The two rock types may not be present at all (not even at depth) in cells without deposits. This example demonstrates that individual components of chi-square and deviance for the residuals can contribute valuable new information.
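The screening of individual components against the 1-df threshold can be written out as a short sketch. The values below are the chi-square and deviance components of the 17 cells with deposits as read from Table 5; the variable names and the rule of flagging a cell only when both components exceed 3.84 are illustrative assumptions. The code is Python, not part of LOGDIA.

```python
THRESHOLD = 3.84   # theoretical chi-square with 1 df at alpha = 0.05

# (cell number, chi-square component, deviance component) from Table 5, cells 1-17
components = [
    (1, 0.07563, 0.14582), (2, 0.40174, 0.67542), (3, 1.19124, 1.56894),
    (4, 9.56271, 4.71466), (5, 0.76988, 1.14183), (6, 0.80851, 1.18503),
    (7, 0.45031, 0.74355), (8, 1.80016, 2.05935), (9, 0.03910, 0.07671),
    (10, 0.11501, 0.21772), (11, 0.44444, 0.73545), (12, 8.77136, 4.55891),
    (13, 2.16193, 2.30236), (14, 0.14349, 0.26818), (15, 3.43586, 2.97944),
    (16, 3.97476, 3.20876), (17, 9.56271, 4.71466),
]

# Flag cells in which both components exceed the threshold.
flagged = [cell for cell, chi_sq, dev in components
           if chi_sq > THRESHOLD and dev > THRESHOLD]
print(flagged)
```

With these values the flagged cells are Nos. 4, 12, and 17; cells 4 and 17 share the estimated probability .09467 that Table 5 gives for cells with zero scores for both rock types.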

The elements of the hat matrix may provide further information which is complementary to that resulting from the residuals. Table 6 shows the diagonal elements of the hat matrix ($h_{ii}$) followed by four columns of DBETA$_{ij}$/SD($\hat{\beta}_j$). The $h_{ii}$-values add up to p (= number of explanatory variables). For this reason, various authors have suggested that observations with $h_{ii}$ > 2p/n or $h_{ii}$ > 3p/n have high leverage. In Table 6, $h_{ii}$ exceeds 2p/n = 0.125 in six cells with numbers 3, 13, 15, 31, 36, and 37. For each of these six cells, one or more DBETA values also are relatively large (positive or negative). Comparison

Table 5. Partial LOGDIA output for example of Table 4

    Bathurst area, 64 cells (Example 2); variables x1 x2 x3=x1*x2 y

    ESTIMATED FREQUENCIES AND PROBABILITIES FOR OBSERVATIONS WITH Y(I) > 0

    CELL   ** PROBABILITIES **   ** COMPONENTS OF **
     NO    OBSERVED  ESTIMATED   CHISQUARE  DEVIANCE
      1     1.000     .92969       .07563    .14582
      2     1.000     .71340       .40174    .67542
      3     1.000     .45636      1.19124   1.56894
      4     1.000     .09467      9.56271   4.71466
      5     1.000     .56501       .76988   1.14183
      6     1.000     .55294       .80851   1.18503
      7     1.000     .68951       .45031    .74355
      8     1.000     .35712      1.80016   2.05935
      9     1.000     .96237       .03910    .07671
     10     1.000     .89685       .11501    .21772
     11     1.000     .69231       .44444    .73545
     12     1.000     .10234      8.77136   4.55891
     13     1.000     .31626      2.16193   2.30236
     14     1.000     .87451       .14349    .26818
     15     1.000     .22544      3.43586   2.97944
     16     1.000     .20101      3.97476   3.20876
     17     1.000     .09467      9.56271   4.71466

ESTIMATED FREQUENCIES AND PROBABILITIES FOR OBSERVATIONS WITH Y(I) i 0

** PROBABILITIES **~ CELL NO OBSERVED ESTIHATED CHISQUARE

18 .000 .09467 .10457 19 .000 .09467 .10457 20 .000 .09467 .10457 21 .000 .09467 .10457 22 .000 .09467 .10457 23 .000 .09676 .I0713 24 .000 .11256 .12683 25 .000 .20339 .25531 26 .000 .09467 .10457 27 .000 .09467 .10457 28 .000 .17899 .21802 29 .000 .30074 .43009 30 .000 .70956 2 .44308 31 .000 .27995 .38879 32 .000 .12430 .14195 33 .000 .09467 .10457 34 .000 .09467 .10457 35 .000 .15064 .17736 36 .000 .33082 .49436 37 .000 .29073 .40990 38 .000 .09467 .10457 39 .000 .09467 .I0457 40 .000 .81738 4 .47599 41 .000 .25128 .33561 42 .000 .18523 .22734 43 .000 .09467 .10457 44 .000 .09467 .I0457 45 .000 .77867 3.51813 46 .000 .09467 .I0457 47 .000 .14113 .16432 48 .000 .16231 .19376 49 .000 .09844 .I0919 50 .000 .09467 .10457 51 .000 .09467 .I0457 52 .000 .09606 .10627 53 .000 .25548 .34315 54 .000 .12665 .14502 55 .000 .I0775 .12077 56 .000 .II055 .12429 57 .000 .09467 .I0457 58 .000 .09467 .10457 59 .000 .09467 .I0457 60 .000 .09972 .II077 61 .000 .09591 .I0609 62 .000 .10477 .I1703 63 .000 .17228 .20814 64 .000 .09467 .I0457

CHISQUARE ~ 61.39899 DEVIANCE =

NUMBER OF DEGREES OF FREEDOM - 60

~ COMPONENTS OF ~* DEVIANCE

.19892

.19892

.19892

.19892

.19892

.20354

.23882

.45477

.19892 ,19892 .39445 .71547

2.47273 .65686 .26547 .19892 .19892 .32654 .80340 .68704 .19892 .19892

3.40075 .57877 .40970 .19892 .19892

3.01619 .19892 .30428 .35422 .20726 .19892 .19892 .20199 .59004 .27084 .22802 .23431 .19892 19892 19892 21011 20166 22135 37816 19892

53.3O188


Table 6. Diagonal elements ($h_{ii}$) of hat matrix for example of Table 4 and corresponding $\mathrm{DBETA}_{ij}/\mathrm{SD}(\hat\beta_j)$ values denoted as db0, db1, db2, and db3

hii db 0 db I db 2 db 3

I 0.0$ -.02 0.05 0.02 -.05

2 0.09 -.02 0.[2 -.02 0.00

3 0.53 -.33 0.0,2 1.61 -.96

0, 0.02 0.0,9 -.20, -.23 0.09

5 0. ll 0.07 0.25 0.03 -.23

6 0.09 0.01 0.06 -.06 0.11

7 0.11 0.01 0.20 0.03 -.15

8 0.06 0.19 0.17 -.03 -.19

9 0.03 -.01 0.06 0.02 -.05

lO O. lO -.02 O. ll 0.02 -.07

II 0.08 -.01 0.15 0.00 -.05

12 0.02 0.~7 -.2l -.22 0.07

13 0.21 0.0~ -.32 -.22 0.70

10 0.12 0.02 0.1~ 0.03 - . l l

l~ 0.26 0.05 -.6l -.26 i.07

16 0.02 0.28 -.02 -.03 -.06

17 0.02 0.~9 -.20, -.23 0.09

18 0.02 - .05 0.02 0.02 -.Ol

19 0.02 -.05 0.02 0.02 -.01

20 0.02 -.05 0.02 0.02 -.01

21 0.02 -.05 0.02 0.02 -.Or

22 0.02 -.0~ 0.02 0.02 -.Of

23 0.02 -.05 0.02 0.02 -.0t

2~ 0.02 -.05 0.02 0.01 -.00

25 0.07 -.0~, - .0 [ - . tO 0.07

26 0.02 -.05 0.02 0.02 -.0l

27 0.02 -.05 0.02 0.02 -.01

23 0.0 #, - . 05 0.01 - . 05 0.0~

29 0.10 -.03 0.03 0.03 -.17

30 0.09 .0~ -.30 0.0#, -.01

31 0.15 .02 0.02 -.22 0.0~

32 0.02 ..rj; 0.02 0.00 O. Oi

hii db 0 db I db 2 db 3

33 0.02 -.05 0.02 0.02 -.01

3 ~, 0.02 -.05 0.02 0.02 -.01

35 0.02 -.06 0.02 0.01 0.01

36 0.16 -.02 0. II O.OS -.26

37 0. I$ 0.0#, 0.02 -.26 0.00,

38 0.02 -.05 0.02 0.02 -.01

39 0.02 -.05 0.02 0.02 -.01

~0 0. I0 0.13 -.~8 0.05 0.03

~I 0.03 -.01 0.0~ -.II -.01

~2 0.03 -.07 0.01 0.02 0.01

~3 0.02 - .05 0.02 0.02 - .or

~ 0.02 - .05 0.02 0.02 -.Of

~5 O. lO O. lO -.39 0.06 -.02

~6 0.02 - .05 0.02 0.02 -.01

~7 0.03 -.07 0.02 0.03 0.00

48 0.03 -.05 0.01 -.02 0.02

~9 0.02 -.05 0.02 0.02 -.01

50 0.02 -.05 0.02 0.02 -.01

51 0.02 -.05 0.02 0.02 -.Of

~2 0.02 -.05 0.02 0.02 -.Ol

53 0.02 -.05 0.00 -.02 0.01

.% 0.02 - .06 0.02 0.02 - .00

55 0.02 -.06 0.02 0.03 -.01

56 0.02 =.06 0.02 0.03 -.01

57 0.02 - .06 0.02 0.02 -.Ol

58 0.02 - .05 0.02 0.02 -.Ol

59 0.02 -.05 0.02 0.02 - .0l

60 0.02 -.05 0.02 0.02 -.01

6l 0.02 -.05 0.02 0.02 -.01

62 0.02 -.05 0.02 0.02 -.00

63 0.03 -.05 0.01 -.02 0.02

6~ 0.02 -.05 0.02 0.02 -.01

with Table 5 shows that these six cells do not have anomalously large components of chi-square or deviance. The three cells in the latter category (numbers 4, 12, and 17) have the largest $\mathrm{DBETA}_{ij}/\mathrm{SD}(\hat\beta_j)$ values for the constant term. Because the signs of these three values are positive, omission of any of the corresponding three observations would decrease the value of $\hat\beta_0$. Simultaneously, the coefficients of $x_1$ and $x_2$ would increase in value. Cell No. 3 has the largest $h_{ii}$-value (= 0.53). Omission of this single cell from the experiment would decrease the coefficient of $x_2$ from 0.008042 to -0.002193. The probability estimated for this cell amounts to 0.456 in Table 5. If all changes of Table 6 for cell No. 3 were implemented, its estimated probability would be reduced to 0.060. Table 5 also shows that omission of cell No. 3 from the database would have less effect on the estimated probabilities for the other 63 cells. Cell No. 3 is the third cell with one or more deposits in bold print on the third row of Table 4. It differs from the other cells with deposits in that it has a large value (= 272) for $x_2$ along with a small value (= 8) of $x_1$. This difference in composition provides it with relatively strong influence on the logistic regression results in comparison with the other 63 cells in the Bathurst area. Its actual omission from the database resulted in the following newly computed regression coefficients:

$$\theta = \underset{(0.518482)}{-2.057490} + \underset{(0.004275)}{0.012363}\,x_1 - \underset{(0.014908)}{0.007702}\,x_2 - \underset{(0.000098)}{0.000008}\,x_1 x_2 .$$

The new coefficient of $x_2$ (= -0.007702) differs not only from the corresponding value (= 0.008042) previously computed from the data for all 64 cells, but also from the value (= -0.002193) resulting from Table 6. This is because the changes of Equation (10) are one-step approximations which may underestimate fully iterated results (cf. Pregibon, 1981). Inspection of the new array of DBETA values for 63 cells (not shown here) showed that the largest db2 value was reduced from 1.61 (for cell No. 3 in Table 6) to 0.24 (for cell No. 40, which has db2 = 0.05 in Table 6). The overall largest value in the new DBETA array was db3 = 0.68 (equal to 1.07 for cell No. 15 in Table 6). These results clearly demonstrate that the omission of cell No. 3 from the database stabilizes the logistic regression results. The preceding example also shows that logistic regression diagnostics provide valuable new information which could not be obtained easily in the past.
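The leverage screening used in this example reduces to a comparison of each hat-matrix diagonal with a multiple of $p/n$. The short Python sketch below illustrates that rule under stated assumptions: the function names are hypothetical, only a few $h_{ii}$ values are copied from Table 6 (rounded to two decimals as printed), and $p = 4$ is the value implied by $2p/n = 0.125$ with $n = 64$.

```python
def leverage_cutoff(p, n, factor=2.0):
    """Cutoff factor * p / n for the diagonal elements of the hat matrix."""
    return factor * p / n


def flag_cells(h, cutoff):
    """Return, in ascending order, the cell numbers with h_ii above cutoff."""
    return sorted(cell for cell, hii in h.items() if hii > cutoff)


# A few hat-matrix diagonals as printed in Table 6 (cell number -> h_ii).
h_sample = {2: 0.09, 3: 0.53, 13: 0.21, 14: 0.12,
            15: 0.26, 18: 0.02, 31: 0.15, 36: 0.16}

cut2 = leverage_cutoff(p=4, n=64)              # 2p/n rule
cut3 = leverage_cutoff(p=4, n=64, factor=3.0)  # stricter 3p/n rule
print(cut2, flag_cells(h_sample, cut2))
print(cut3, flag_cells(h_sample, cut3))
```

The stricter $3p/n$ rule retains fewer cells than the $2p/n$ rule; either way, a flagged cell is a candidate for the kind of omission experiment described above for cell No. 3, not proof that it is influential.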

Acknowledgments: Thanks are due to C. F. Chung for helpful comments and suggestions, and to S. M. Lew for implementing an earlier version of LOGDIA in Microsoft FORTRAN on the IBM XT personal computer.

REFERENCES

Agterberg, F. P., 1974, Automatic contouring of geological maps to detect target areas for mineral exploration: Jour. Math. Geology, v. 6, no. 4, p. 373-395.

Agterberg, F. P., 1987a, Application of recent developments of regression analysis in regional mineral resource evaluation: Proc. NATO Adv. Study Inst. on Statistical Treatment for Estimation of Mineral and Energy Resources (Il Ciocco, Italy, 1986), p. 1-28.

Agterberg, F. P., 1987b, Spatial analysis of patterns of land-based and ocean-floor ore deposits: Proc. NATO Adv. Study Inst. on Statistical Treatment for Estimation of Mineral and Energy Resources (Il Ciocco, Italy, 1986), p. 283-299.

Agterberg, F. P., and Franklin, J. M., 1986, Estimation of the probability of occurrence of polymetallic massive sulfide deposits on the ocean floor, in Teleki, P., ed., Marine minerals; resource assessment strategies: Reidel, Dordrecht, Holland, p. 467-483.

Amemiya, T., 1976, The maximum likelihood, the minimum chi-square and the nonlinear weighted least-squares estimator in the general qualitative response model: Jour. Am. Statist. Assoc., v. 71, p. 347-351.

Baker, R. J., and Nelder, J. A., 1978, The GLIM system; Release 3: Numerical Algorithms Group, Oxford, 300 p.

Chung, C. F., 1978, Computer program for the logistic model to estimate the probability of occurrence of discrete events: Geol. Survey Canada, Paper 78-11, 23 p.

Cox, D. R., 1966, Some procedures connected with the logistic qualitative response curve, in David, F. N., ed., Research papers in statistics: London, p. 55-71.

Gray, J. B., and Ling, R. F., 1984, K-clustering as a detection tool for influential subsets in regression: Technometrics, v. 26, no. 4, p. 305-318.

Pregibon, D., 1981, Logistic regression diagnostics: Annals of Statistics, v. 9, no. 3, p. 705-724.

SAS, 1985, The matrix procedure: language and applications: Tech. Rept. 135, SAS Institute, Cary, North Carolina, 150 p.

Tukey, J. W., 1972, Discussion of paper by F. P. Agterberg and S. C. Robinson: Intern. Stat. Inst. Bull., v. 44, no. 1, p. 596.

Walker, S. H., and Duncan, D. B., 1967, Estimation of the probability of an event as a function of several independent variables: Biometrika, v. 54, nos. 1 and 2, p. 167-179.

Welsch, R. E., and Kuh, E., 1977, Linear regression diagnostics: Nat. Bur. Econ. Res. Inc., Working Paper 173, 55 p.

Wrigley, N., 1984, Quantitative methods: diagnostics revisited: Progress in Human Geography, v. 7, no. 4, p. 525-535.

Wrigley, N., 1985, Categorical data analysis for geographers and environmental scientists: Longman, London, 392 p.

Wrigley, N., and Dunn, R., 1986, Graphical diagnostics for logistic oil exploration models: Jour. Math. Geology, v. 18, no. 4, p. 355-374.

APPENDIX 1

In LOGDIA (cf. Appendix 2), the elements of the modified logistic hat matrix $H^*$, with

$$H^* = V^{1/2} Z (Z'VZ)^{-1} Z' V^{1/2}, \qquad Z = (X \,\vdots\, A),$$

are computed from the elements of the logistic hat matrix $H$, with

$$H = V^{1/2} X (X'VX)^{-1} X' V^{1/2}.$$

Use is made of the relationship

$$H^* = H + \chi\chi'/\chi'\chi \qquad \text{(A1)}$$

where $\chi$ represents an $(n \times 1)$ column vector with elements $\chi_i$ satisfying Equation (6) given in the text. Equation (A1) is equivalent to Equation (9). A mathematical proof that the equations for $H^*$ and $H$ imply Equation (A1) is as follows. The matrix $Z'VZ$ can be partitioned into four parts:

$$Z'VZ = \begin{bmatrix} X'VX & X'VA \\ A'VX & A'VA \end{bmatrix}.$$

Its inverse matrix satisfies:

$$(Z'VZ)^{-1} = \begin{bmatrix} (X'VX)^{-1}\{I + X'VA\,A'VX\,(X'VX)^{-1}F^{-1}\} & -(X'VX)^{-1}X'VA\,F^{-1} \\ -A'VX\,(X'VX)^{-1}F^{-1} & F^{-1} \end{bmatrix}$$

where $I$ is the $(p \times p)$ identity matrix and $F$ is a scalar with

$$F = A'VA - A'VX(X'VX)^{-1}X'VA.$$

By using $X's = 0$ and Equation (4) for $\hat\beta$ and $A$ given in the text, so that $(X'VX)^{-1}X'VA = \hat\beta$, the expression for the inverse can be simplified to

$$(Z'VZ)^{-1} = \begin{bmatrix} (X'VX)^{-1} + \hat\beta\hat\beta'/\chi'\chi & -\hat\beta/\chi'\chi \\ -\hat\beta'/\chi'\chi & 1/\chi'\chi \end{bmatrix}.$$

For example, this result implies $F = \chi'\chi$, which follows from:

$$F = A'V\{A - X(X'VX)^{-1}X'VA\} = A'V(A - X\hat\beta) = A's = s'V^{-1}s = \chi'\chi.$$

The modified logistic hat matrix now becomes:

$$H^* = H + (V^{1/2}X\hat\beta - V^{1/2}A)(\hat\beta'X'V^{1/2} - A'V^{1/2})/\chi'\chi = H - \chi(\hat\beta'X'V^{1/2} - A'V^{1/2})/\chi'\chi.$$

Finally, $H^* = H + \chi\chi'/\chi'\chi$ [Eq. (A1) repeated].
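Equation (A1) can also be checked numerically. The Python sketch below is an illustration only and makes several assumptions that are not taken from the paper: the toy data are hypothetical, the helper names (`matmul`, `inverse`, `hat`, `fitted`) are invented for this example, there is one Bernoulli trial per observation, and the appended column is taken as $A = X\hat\beta + V^{-1}s$, which is the form implied by the step $A'V(A - X\hat\beta) = A's$ in the proof. The identity is expected to hold only at the maximum likelihood solution, where $X's = 0$.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for k in range(n):
        r = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[r] = aug[r], aug[k]
        piv = aug[k][k]
        aug[k] = [v / piv for v in aug[k]]
        for i in range(n):
            if i != k:
                f = aug[i][k]
                aug[i] = [vi - f * vk for vi, vk in zip(aug[i], aug[k])]
    return [row[n:] for row in aug]


def hat(x, v):
    """V^(1/2) X (X'VX)^(-1) X' V^(1/2), with diagonal V passed as a list."""
    xw = [[math.sqrt(vi) * xij for xij in row] for row, vi in zip(x, v)]
    xtvx_inv = inverse(matmul(transpose(xw), xw))
    return matmul(matmul(xw, xtvx_inv), transpose(xw))


def fitted(x, beta):
    """Probabilities p and weights v = p(1 - p) for one trial per row."""
    p = [1.0 / (1.0 + math.exp(-sum(b * xij for b, xij in zip(beta, row))))
         for row in x]
    return p, [pi * (1.0 - pi) for pi in p]


# Hypothetical toy data (not from the paper): constant term plus one variable.
X = [[1.0, float(k)] for k in range(6)]
y = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]

beta = [0.0, 0.0]
for _ in range(30):  # scoring iterations; the scores X's vanish at convergence
    p, v = fitted(X, beta)
    s_col = [[yi - pi] for yi, pi in zip(y, p)]
    xw = [[math.sqrt(vi) * xij for xij in row] for row, vi in zip(X, v)]
    step = matmul(inverse(matmul(transpose(xw), xw)),
                  matmul(transpose(X), s_col))
    beta = [b + d[0] for b, d in zip(beta, step)]

p, v = fitted(X, beta)
s = [yi - pi for yi, pi in zip(y, p)]
chi = [si / math.sqrt(vi) for si, vi in zip(s, v)]
chisq = sum(c * c for c in chi)

# Assumed extra column A = X beta + V^(-1) s, giving Z = (X : A).
A = [sum(b * xij for b, xij in zip(beta, row)) + si / vi
     for row, si, vi in zip(X, s, v)]
Z = [row + [a] for row, a in zip(X, A)]

H, H_star = hat(X, v), hat(Z, v)
n = len(X)
max_diff = max(abs(H_star[i][j] - H[i][j] - chi[i] * chi[j] / chisq)
               for i in range(n) for j in range(n))
print("trace(H) =", sum(H[i][i] for i in range(n)))
print("max |H* - (H + chi chi'/chi'chi)| =", max_diff)
```

The trace of $H$ printed by the sketch should equal the number of columns of $X$ (here 2), which is the property used in the text when the $h_{ii}$-values are compared with $2p/n$.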


APPENDIX 2

Listing of LOGDIA FORTRAN 77 Source Code

      PROGRAM LOGDIA
C
C     FOR GENERALIZED LOGISTIC ANALYSIS WITH DIAGNOSTICS
C
C     BY F. P. AGTERBERG, OCTOBER, 1986
C
C     GEOLOGICAL SURVEY OF CANADA
C     601 BOOTH STREET
C     OTTAWA, ONTARIO
C     K1A 0E8
C
C     BASED ON PROGRAM LOGIST PUBLISHED BY C. F. CHUNG IN GSC PAPER
C     78-11: "COMPUTER PROGRAM FOR THE LOGISTIC MODEL TO ESTIMATE THE
C     PROBABILITY OF OCCURRENCE OF DISCRETE EVENTS". IN ADDITION TO
C     THE ESTIMATED PROBABILITIES, LOGIST COMPUTES SCORES,
C     COEFFICIENTS AND STANDARD DEVIATIONS OF COEFFICIENTS.
C
C     LOGIST WAS REVISED BY S. N. LEW IN SEPTEMBER, 1985, TO COMPLY
C     WITH ANSI FORTRAN 77 STANDARD
C
C     LOGDIA AND LOGIST PROVIDE IDENTICAL COEFFICIENTS IF
C     XCELLS = 1.0, OBSERVED VALUES OF THE DEPENDENT
C     VARIABLE = 1.0 OR 0.0 ONLY.
C
C     THEN THE OBSERVED AND ESTIMATED FREQUENCIES ARE EQUAL TO THE
C     CORRESPONDING PROBABILITIES. XCELLS REPRESENTS NUMBER OF
C     SMALLER CELLS WITH DISCRETE EVENTS PER LARGER UNIT AREA.
C
C     LOGDIA CONTAINS THE FOLLOWING EXTENSIONS:
C     1. DEPENDENT VARIABLE CAN HAVE VALUES GREATER THAN ONE;
C     2. LOGDIA GIVES CHISQUARE, DEVIANCE, PLUS INDIVIDUAL COMPONENTS
C        OF CHISQUARE AND DEVIANCE;
C     3. HAT MATRIX AND MODIFIED HAT MATRIX ARE PROVIDED UPON REQUEST;
C     4. BLOCK OF DBETA VALUES FOLLOWS HAT MATRIX.
C
C     INPUT CONSISTS OF (1) FILE WITH INPUT PARAMETERS, AND (2) DATA
C     FILE; OUTPUT IS WRITTEN IN OUTPUT FILE.
C
C     (1) THE INPUT PARAMETERS FILE CONTAINS THE FOLLOWING INFORMATION
C     IN FREE FORMAT:
C     LINE 1: EVEL = LEVEL OF CONVERGENCE FOR SCORES TO STOP
C             ITERATION; ITER = MAXIMUM NUMBER OF ITERATIONS.
C     LINE 2: NVAR = NUMBER OF EXPLANATORY VARIABLES (EXCLUDING
C             DUMMY VARIABLE); N1 = NUMBER OF OBSERVATIONS WITH ONE
C             OR MORE EVENTS; N2 = NUMBER OF OBSERVATIONS WITHOUT
C             EVENTS; XCELLS = NUMBER OF BERNOULLI TRIALS PER
C             OBSERVATION.
C     LINE 3: FORMAT OF DATA IN DATA FILE.
C     LINE 4: ICOEF AND ICO ARE (0,1) SWITCHES; IF ICOEF=0,
C             ALL COEFFICIENTS ARE SET EQUAL TO ZERO AT BEGINNING OF
C             ITERATION PROCESS; IF ICOEF=1, INITIAL COEFFICIENTS
C             SHOULD BE ENTERED ON NEXT LINE(S) IN FORMAT 8F10.0;
C             IF ICO=1, LAST ESTIMATES OF COEFFICIENTS WILL BE STORED
C             IN UNIT 97 IF NO CONVERGENCE WAS REACHED FOR SPECIFIED
C             VALUE OF ITER.
C
C     (2) THE FIRST LINE OF THE DATA FILE IS TITLE IN FORMAT (8A10);
C     SUBSEQUENT LINES CONTAIN VALUES OF EXPLANATORY VARIABLES
C     FOLLOWED BY OBSERVED FREQUENCIES FOR DEPENDENT VARIABLE USING
C     FORMAT ON LINE 3 OF INPUT PARAMETERS FILE.
C

      COMMON X(20),B(20),NS(20),D(20,20)
      DIMENSION T(20),V(20),F(20),XX(20)
      DIMENSION W(200),CH(200),DH(200),RES(200)
      CHARACTER*10 NAME(8)
      CHARACTER*20 INPUT,TAPE99,OUTPUT
      CHARACTER*60 FORM
C     IREPLY HOLDS THE (Y/N) ANSWER READ WITH FORMAT 59 (A3)
      CHARACTER*3 IREPLY
      DOUBLE PRECISION D
C     * * * * *
C     READ TITLE AND INPUT, OUTPUT CONTROL CARDS
C     * * * * *
      WRITE (*,*)
     1 'ENTER NAME OF FILE CONTAINING INPUT-CONTROL CARDS - '
      READ (*,2) INPUT
    2 FORMAT(A20)
      OPEN (5,FILE=INPUT,STATUS='OLD',ACCESS='SEQUENTIAL')
      WRITE (*,*) 'ENTER FILENAME FOR BLOCK OF DATA - '
      READ (*,2) TAPE99
      OPEN (99,FILE=TAPE99,STATUS='OLD',ACCESS='SEQUENTIAL')
      WRITE (*,*) 'ENTER FILENAME FOR OUTPUT FILE - '
      READ (*,2) OUTPUT
      OPEN (6,FILE=OUTPUT,STATUS='NEW',ACCESS='SEQUENTIAL')
      OPEN (97,FILE='TAPE97',STATUS='NEW',ACCESS='SEQUENTIAL')
      OPEN (10,FILE='TAPE10',STATUS='NEW',FORM='UNFORMATTED')
      OPEN (11,FILE='TAPE11',STATUS='NEW',FORM='UNFORMATTED')
      OPEN (12,FILE='TAPE12',STATUS='NEW',FORM='UNFORMATTED')
      REWIND 99
      READ (99,1) NAME
    1 FORMAT(8A10)
      READ (5,*) EVEL,ITER
      READ (5,*) NVAR,N1,N2,XCELLS
      READ (5,3) FORM
    3 FORMAT(A60)
      N=N1+N2
      READ (5,*) ICOEF,ICO
C     * * * * *
C     PRINT TITLE AND INPUT, OUTPUT CONTROL STATEMENTS
C     * * * * *
      WRITE (6,5) NAME
    5 FORMAT(/1X,8A10)
      WRITE (6,6) EVEL,ITER
    6 FORMAT(/' LEVEL OF CONVERGENCE = ',F15.10,//,1X,
     1 'MAXIMUM PERMISSIBLE NO. OF ITERATIONS =',I3)
      WRITE (6,7) NVAR,N1,N2
    7 FORMAT(/' NUMBER OF VARIABLES = ',I5,//,1X,
     1 'NUMBER OF OBSERVATIONS WITH Y(I) = 1 = ',I5,//,1X,
     2 'NUMBER OF OBSERVATIONS WITH Y(I) = 0 = ',I5//)
      WRITE (6,8) FORM
    8 FORMAT(/' INPUT FORMAT = ',A60)
      WRITE (6,9) ICOEF,ICO
    9 FORMAT(/' ICOEF =',I2,10X,'ICO =',I3)
C     M = NUMBER OF COEFFICIENTS (CONSTANT TERM PLUS NVAR VARIABLES)
      M=NVAR+1

      IF(ICOEF.EQ.0)GOTO 10
      READ (5,11) (B(I),I=1,M)
   11 FORMAT(8F10.0)
      WRITE (6,12)
   12 FORMAT(/' INITIAL COEFFICIENTS TAKEN AS STARTING POINT')
      WRITE (6,13) (B(I),I=1,M)
   13 FORMAT(/8E15.7)
      GOTO 16
   10 WRITE (6,14)
   14 FORMAT(/' AS A STARTING POINT, ALL COEFFICIENTS ARE SET',
     1 ' EQUAL TO ZERO.')
      DO 15 I=1,M
      B(I)=0.0
   15 CONTINUE
   16 DO 17 I=1,M
      T(I)=0.0
   17 CONTINUE
      REWIND 10
      X(1)=1.0
C     * * * * *
C     READ INPUT DATA BLOCK
C     * * * * *
      DO 20 I=1,N
      READ(99,FORM) (V(J),J=1,M)
      W(I)=V(M)
      DO 18 K=1,NVAR
      L=K+1
      X(L)=V(K)
   18 CONTINUE
      WRITE(10) (X(J),J=1,M)
      IF(I.GT.N1)GOTO 20
      DO 19 K=1,M
      T(K)=T(K)+X(K)*V(M)
   19 CONTINUE
   20 CONTINUE
      REWIND 99
C     * * * * *
C     START ITERATION
C     * * * * *
      REWIND 10
      ISTEP=1
   21 CONTINUE
C     * * * * *
C     CALL SUBROUTINE LOGIT FOR COMPUTATION USING SCORING METHOD
C     * * * * *
      CALL LOGIT(M,N,ISTEP,V,T,XCELLS)
      REWIND 10
      WRITE (6,22)
   22 FORMAT(/' SCORES - ITERATION WILL CEASE WHEN ALL SCORES ARE',
     1 ' LESS THAN THE GIVEN LEVEL OF CONVERGENCE.')
      WRITE (6,13) (V(I),I=1,M)
      WRITE (6,23) ISTEP
   23 FORMAT(/I2,' TH ESTIMATE OF COEFFICIENTS:')
      WRITE (6,13) (B(I),I=1,M)
      ILEVEL=0
      DO 24 I=1,M
      C=ABS(V(I))
      IF(C.LT.EVEL)GOTO 24
      ILEVEL=ILEVEL+1
   24 CONTINUE
      WRITE (6,25) ILEVEL
   25 FORMAT(/' NUMBER OF SCORES WHICH ARE GREATER THAN SPECIFIED',
     1 ' LEVEL OF CONVERGENCE = ',I5/)
      IF(ILEVEL.EQ.0)GOTO 30
      ISTEP=ISTEP+1
      IF(ISTEP.GT.ITER)GOTO 26
      GOTO 21
   26 WRITE (6,27)
   27 FORMAT(/' THE SOLUTION CAN NOT BE OBTAINED FOR THE GIVEN LEVEL',
     1 ' OF CONVERGENCE AND THE MAXIMUM PERMISSIBLE NUMBER OF',
     2 ' ITERATIONS',/,1X,
     3 'HOWEVER, THIS PROGRAM WILL PRODUCE PRINTED OUTPUTS OF THE',
     4 ' LAST ESTIMATES OF BETA WHICH CAN BE USED FOR THE',/,1X,
     5 'INITIAL COEFFICIENTS FOR NEXT JOB',/)
      WRITE (6,28) (B(I),I=1,M)
   28 FORMAT(1X,8F10.5)
      GOTO 29

C     * * * * *
C     STOP ITERATION, PRINT ESTIMATES(ML ESTIMATES) AND VARIANCES
C     * * * * *
   30 WRITE (6,5) NAME
      WRITE (*,5) NAME
      WRITE (6,51)
      WRITE (*,51)
      WRITE (6,13) (B(I),I=1,M)
      WRITE (*,13) (B(I),I=1,M)
      IF(ICO.NE.1)GOTO 31
      REWIND 97
      WRITE(97,32)(B(I),I=1,M)
   32 FORMAT(5E20.12)
      REWIND 97
   31 WRITE (6,33)
      WRITE (*,33)
   33 FORMAT(/' STANDARD DEVIATIONS OF COEFFICIENTS')
      DO 34 I=1,M
      F(I)=DSQRT(D(I,I))
   34 CONTINUE
      WRITE (6,13) (F(I),I=1,M)
      WRITE (*,13) (F(I),I=1,M)
      WRITE (6,35)
   35 FORMAT(/,1X,'VARIANCE-COVARIANCE MATRIX:'/)
      DO 36 I=1,M
      WRITE (6,37) (D(I,J),J=1,M)
   36 CONTINUE
   37 FORMAT(1X,5D15.5)
      WRITE (6,49)
C     * * * * *
C     CALCULATE PROBABILITIES, FREQUENCIES, CHI-VALUES AND COMPONENTS
C     OF DEVIANCE; CHISQUARE AND DEVIANCE WITH DEGREES OF FREEDOM
C     * * * * *
      WRITE (6,5) NAME
      M1=1
      M2=N1
      WRITE (6,50)
      WRITE (*,50)
      DEVT=0.0
      CHISQ=0.0
   38 IF(XCELLS.GT.1.0) GOTO 39
      WRITE (6,58)
      WRITE (6,56)
      WRITE (*,58)
      WRITE (*,56)
      GOTO 40
   39 WRITE (6,57)
      WRITE (6,55)
      WRITE (*,57)
      WRITE (*,55)
   40 DO 45 I=M1,M2
      VV=W(I)
      VE=VV/XCELLS
      READ(10)(X(J),J=1,M)
      XA=0.
      DO 41 K=1,M
      XA=XA+X(K)*B(K)
   41 CONTINUE
      PROA=VV/XCELLS
      EX=EXP(XA)
      THE=XCELLS*EX/(1.0+EX)
      PROB=THE/XCELLS
      CHI=(VV-THE)/SQRT(THE*(1.0-PROB))
C     ROWS OF X WEIGHTED BY SQUARE ROOT OF VARIANCE GO TO UNIT 12
      DO 42 K=1,M
      XX(K)=X(K)*SQRT(THE*(1.0-PROB))
   42 CONTINUE
      WRITE (12) (XX(K),K=1,M)
      CHI2=CHI*CHI
      CHISQ=CHISQ+CHI2
      DEV=(1.0-VE)*LOG(1.0-PROB)
      DEV=-2.0*(DEV+VE*LOG(PROB))
      DEVT=DEVT+DEV
      IF(XCELLS.GT.1.0) GOTO 43
      WRITE (6,54) I,VV,THE,CHI2,DEV
      WRITE (*,54) I,VV,THE,CHI2,DEV
      GOTO 44
   43 WRITE (6,53) I,VV,THE,PROA,PROB,CHI2,DEV
      WRITE (*,53) I,VV,THE,PROA,PROB,CHI2,DEV
   44 RES(I)=VV-THE
      CH(I)=CHI

   45 CONTINUE
      IF(M2.EQ.N)GOTO 46
      M1=N1+1
      M2=N
      WRITE (6,52)
      WRITE (*,52)
      GOTO 38
   46 REWIND 10
      WRITE (6,47) CHISQ,DEVT
      WRITE (*,47) CHISQ,DEVT
   47 FORMAT(/' CHISQUARE = ',F12.5,9X,'DEVIANCE = ',F12.5)
      NDF = N - M
      WRITE (6,48) NDF
      WRITE (*,48) NDF
   48 FORMAT(/' NUMBER OF DEGREES OF FREEDOM = ',I4)
   49 FORMAT(/' ****** WARNING ******',//,2X,
     1 'IF AT LEAST ONE OF THE VARIANCES IS PRINTED 0., ESTIMATES',
     2 ' OF COEFFICIENTS ARE NOT MEANINGFUL',/,2X,
     3 'DUE TO USE OF THE GENERALIZED INVERSE MATRIX. HOWEVER, THE',
     4 ' ESTIMATES OF FREQUENCIES ARE MEANINGFUL.')
   50 FORMAT(/' ESTIMATED FREQUENCIES AND PROBABILITIES FOR',
     1 ' OBSERVATIONS WITH Y(I) > 0')
   51 FORMAT(/' FINAL ESTIMATES OF COEFFICIENTS')
   52 FORMAT(/' ESTIMATED FREQUENCIES AND PROBABILITIES FOR',
     1 ' OBSERVATIONS WITH Y(I) = 0')
   53 FORMAT(2X,I5,4X,F8.3,5F12.5)
   54 FORMAT(2X,I5,4X,F8.3,3F12.5)
C     COLUMN SPACING OF HEADINGS 55 TO 58 IS APPROXIMATE
   55 FORMAT(' CELL NO   OBSERVED   ESTIMATED    OBSERVED   ESTIMATED',
     1 '   CHISQUARE    DEVIANCE'/)
   56 FORMAT(' CELL NO   OBSERVED   ESTIMATED',
     1 '   CHISQUARE    DEVIANCE'/)
   57 FORMAT(/11X,'*** FREQUENCIES ****',4X,'** PROBABILITIES ***',
     1 4X,'** COMPONENTS OF ***')
   58 FORMAT(/11X,'** PROBABILITIES ***',4X,'** COMPONENTS OF ***')
      WRITE(*,*) ' '
      WRITE(*,*) 'DO YOU WANT HAT MATRIX? (Y/N) '
      READ(*,59) IREPLY
   59 FORMAT (A3)
      IF(IREPLY.EQ.'y' .OR. IREPLY.EQ.'Y') THEN
C     * * * * *
C     COMPUTE HAT MATRIX, DBETA, AND MODIFIED HAT MATRIX,
C     WRITE THESE ARRAYS ON OUTPUT FILE
C     * * * * *
      CALL HATMAT(M,N,XX,CH,CHISQ,F,RES,DH)
      ENDIF
   29 STOP
      END
      SUBROUTINE HATMAT(M,N,XX,CH,CHISQ,F,RES,DH)
      COMMON X(20),B(20),NS(20),D(20,20)
      DIMENSION XX(20),CH(200),F(20),RES(200),DH(200)
      DIMENSION DD(20),AA(200),DB(20)
      DOUBLE PRECISION D
C     KC=0: HAT MATRIX; KC=1: MODIFIED HAT MATRIX
      KC=0
      REWIND 11
      REWIND 12
      DO 3 K=1,N
      READ(12) (XX(J),J=1,M)
      DO 2 I=1,M
      DDD = 0.0
      DO 1 J=1,M
      DDD = DDD + D(I,J) * XX(J)
    1 CONTINUE
      DD(I) = DDD
    2 CONTINUE
      WRITE(11) (DD(I),I=1,M)
    3 CONTINUE
    4 REWIND 11
      DO 10 K=1,N
      READ(11) (DD(J),J=1,M)
      REWIND 12
      DO 6 I=1,N
      READ (12) (XX(J),J=1,M)
      AAA = 0.0
      DO 5 J=1,M
      AAA = AAA + XX(J) * DD(J)
    5 CONTINUE
      AA(I) = AAA
      IF(KC.EQ.1) AA(I) = AAA + CH(I)*CH(K)/CHISQ
    6 CONTINUE

      IF(KC.EQ.0) WRITE(6,8) K
      IF(KC.EQ.1) WRITE(6,9) K
      WRITE (6,7) (AA(I),I=1,N)
      IF(KC.EQ.0)DH(K)=AA(K)
    7 FORMAT(10F8.5)
    8 FORMAT(//I3,'TH ROW OF HAT MATRIX'/)
    9 FORMAT(//I3,'TH ROW OF MODIFIED HAT MATRIX'/)
   10 CONTINUE
      KC=KC+1
      IF(KC.EQ.2) GOTO 15
      WRITE(6,11)
   11 FORMAT(//'BLOCK OF DBETA VALUES; COLUMNS FOR VARIABLES'//)
C     ONE-STEP DBETA DIVIDED BY STANDARD DEVIATION OF COEFFICIENT
      DO 14 K=1,N
      READ(10) (X(J),J=1,M)
      DO 13 I=1,M
      DB(I)=0.0
      DO 12 J=1,M
      DB(I)=D(I,J)*X(J)/F(I)+DB(I)
   12 CONTINUE
      DB(I)=DB(I)*RES(K)/(1.0-DH(K))
   13 CONTINUE
      WRITE(6,7)(DB(I),I=1,M)
   14 CONTINUE
      GOTO 4
   15 CONTINUE
      RETURN
      END
      SUBROUTINE LOGIT(M,N,IS,V,T,XCELLS)
      COMMON X(20),B(20),NS(20),D(20,20)
      DIMENSION T(20),V(20),F(20)
      DOUBLE PRECISION D
      DO 2 J=1,M
      DO 1 I=1,M
      D(I,J)=0.0
    1 CONTINUE
      F(J)=0.0
    2 CONTINUE
C     ACCUMULATE SCORES F AND INFORMATION MATRIX D OVER ALL CELLS
      DO 6 K=1,N
      READ(10) (X(I),I=1,M)
      C=0.0
      DO 3 J=1,M
      C=C+B(J)*X(J)
    3 CONTINUE
      C=EXP(C)
      DD=1.0/(1.0+C)
      DO 4 J=1,M
      F(J)=F(J)-C*DD*X(J)*XCELLS
    4 CONTINUE
      DO 5 I=1,M
      DO 5 J=1,M
      D(I,J)=D(I,J)+X(I)*X(J)*C*DD*DD*XCELLS
    5 CONTINUE
    6 CONTINUE
      REWIND 10
      DO 7 J=1,M
      F(J)=T(J)+F(J)
    7 CONTINUE
      WRITE (6,8) IS
    8 FORMAT(//I5,2X,'TH',1X,'ITERATION')
      CALL GENINV(M)
      DO 11 J=1,M
      C=0.0
      IF(NS(J).EQ.-1)GOTO 10
      DO 9 I=1,M
      IF(NS(I).EQ.-1)GOTO 9
      C=C+D(I,J)*F(I)
    9 CONTINUE
   10 V(J)=C
   11 CONTINUE
      DO 12 J=1,M
      B(J)=B(J)+V(J)
   12 CONTINUE
      DO 14 J=1,M
      IF(NS(J).EQ.-1)D(J,J)=0.0
      V(J)=F(J)
   14 CONTINUE
      RETURN
      END

      SUBROUTINE GENINV(M)
      COMMON C(20),F(20),NS(20),A(20,20)
      DIMENSION B(20,20)
      DOUBLE PRECISION A,XX
C     NS(I)=-1 MARKS ROWS NOT YET USED AS PIVOT; B KEEPS A COPY OF A
      DO 2 I=1,M
      NS(I)=-1
      DO 1 J=1,M
      B(I,J)=A(I,J)
    1 CONTINUE
    2 CONTINUE
    3 DO 4 I=1,M
      C(I)=0.0
    4 CONTINUE
      DO 5 I=1,M
      XX=DABS(A(I,I))
      IF((XX.LE.1.0D-7).OR.(NS(I).EQ.1))GOTO 5
      K=I
      GOTO 6
    5 CONTINUE
      GOTO 11
    6 XX=A(K,K)
      NS(K)=1
      C(K)=1.0/XX
      DO 7 J=1,M
      A(K,J)=A(K,J)/XX
    7 CONTINUE
      DO 9 I=1,M
      IF(I.EQ.K)GOTO 9
      XX=A(I,K)
      C(I)=-XX*C(K)
      DO 8 J=1,M
      A(I,J)=A(I,J)-A(K,J)*XX
    8 CONTINUE
    9 CONTINUE
      DO 10 I=1,M
      A(I,K)=C(I)
   10 CONTINUE
      GOTO 3
   11 K=0
      DO 12 I=1,M
      IF(NS(I).EQ.-1)GOTO 12
      K=K+1
   12 CONTINUE
      IF(K.NE.M)GOTO 14
      WRITE (6,13) M
   13 FORMAT(/' THE INPUT MATRIX HAS FULL RANK',I10)
      GOTO 16
   14 WRITE (6,15) K
   15 FORMAT(/' THE INPUT MATRIX HAS RANK',I10)
C     CHECK THAT INVERSE TIMES ORIGINAL MATRIX IS CLOSE TO IDENTITY
   16 DO 22 K=1,M
      IF(NS(K).EQ.-1)GOTO 22
      DO 18 I=1,M
      CC=0.0
      IF(NS(I).EQ.-1)GOTO 18
      DO 17 J=1,M
      IF(NS(J).EQ.-1)GOTO 17
      CC=CC+A(K,J)*B(J,I)
   17 CONTINUE
      C(I)=CC
   18 CONTINUE
      DO 19 L=1,M
      IF(L.EQ.K)GOTO 19
      CC=ABS(C(L))
      IF(CC.LE.1.0E-3)GOTO 19
      GOTO 20
   19 CONTINUE
   22 CONTINUE
      GOTO 23
   20 WRITE (6,21)
   21 FORMAT(/' WARNING-- ERRONEOUS INVERSE MATRIX')
   23 RETURN
      END
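
The pivot-skipping sweep in subroutine GENINV can be restated compactly. The Python sketch below is an illustration written for this purpose, not a translation verified against the original program: the function name `geninv`, the boolean list `pivoted` (playing the role of NS(I) = 1 or -1), and the default tolerance are assumptions. It sweeps on each diagonal element larger than the tolerance and leaves the remaining rows unpivoted, so that the number of pivoted rows is the rank reported by the listing.

```python
def geninv(a, tol=1.0e-7):
    """Invert a symmetric matrix by Gauss-Jordan sweeps, skipping tiny pivots.

    Returns (swept matrix, pivoted), where pivoted[i] is False for rows
    that were never used as a pivot (NS(I) = -1 in the listing).
    """
    m = len(a)
    a = [list(map(float, row)) for row in a]
    pivoted = [False] * m
    while True:
        # First unpivoted row whose diagonal element is large enough.
        k = next((i for i in range(m)
                  if not pivoted[i] and abs(a[i][i]) > tol), None)
        if k is None:
            break
        pivoted[k] = True
        piv = a[k][k]
        # New column k of the inverse, saved before the rows are modified.
        col = [-a[i][k] / piv for i in range(m)]
        col[k] = 1.0 / piv
        a[k] = [v / piv for v in a[k]]
        for i in range(m):
            if i != k:
                f = a[i][k]
                a[i] = [vi - f * vk for vi, vk in zip(a[i], a[k])]
        for i in range(m):
            a[i][k] = col[i]
    return a, pivoted


inv, used = geninv([[4.0, 2.0], [2.0, 3.0]])
print(inv, used)

# A singular matrix: the second row is never pivoted, so the rank is 1.
_, used_singular = geninv([[1.0, 1.0], [1.0, 1.0]])
print(sum(used_singular))
```

In the listing, rows and columns that were not pivoted are subsequently skipped in LOGIT and their variances are printed as zero, which is the situation addressed by the warning in FORMAT 49.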