logistic regression


Author: louvain

Post on 22-Feb-2016



Outline

- Basic Concepts of Logistic Regression
- Finding Logistic Regression Coefficients using Excel's Solver
- Significance Testing of the Logistic Regression Coefficients
- Testing the Fit of the Logistic Regression Model
- Finding Logistic Regression Coefficients using Newton's Method
- Comparing Logistic Regression Models
- Hosmer-Lemeshow Test

Basic Concepts of Logistic Regression

The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:

$$\ln \mathrm{Odds}(E) = b_0 + b_1 x_1 + \cdots + b_k x_k$$

where the odds function is as given in the following definition: for an event $E$ with probability $p = P(E)$,

$$\mathrm{Odds}(E) = \frac{P(E)}{1 - P(E)} = \frac{p}{1 - p}, \qquad \mathrm{logit}(p) = \ln \frac{p}{1 - p}$$

Figure: Sigmoid curve for p

Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular:

1. The assumption of the linear regression model that the values of y are normally distributed cannot be met since y only takes the values 0 and 1.
2. The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable. Since the variance is p(1 − p), when 50 percent of the sample consists of 1s the variance is .25, its maximum value. As we move to more extreme values, the variance decreases. When p = .10 or .90, the variance is (.1)(.9) = .09, and so as p approaches 1 or 0, the variance approaches 0.
3. Using the linear regression model, the predicted values will become greater than one and less than zero if you move far enough on the x-axis. Such values are theoretically inadmissible for probabilities.

For the logistic model, the least squares approach to calculating the values of the coefficients $b_i$ cannot be used; instead the maximum likelihood techniques, as described below, are employed to find these values.

Example 1: A sample of 760 people who received doses of radiation between 0 and 1000 rems was made following a recent nuclear accident. Of these 302 died, as shown in the table in Figure 2. Actually each row in the table represents the midpoint of an interval of 100 rems (i.e. 0-100, 100-200, etc.).

Figure 2.

Let $E_i$ = the event that a person in the ith interval survived. The table also shows the probability $P(E_i)$ and odds $\mathrm{Odds}(E_i)$ of survival for a person in each interval. Note that $P(E_i)$ = the percentage of people in interval i who survived and

$$\mathrm{Odds}(E_i) = \frac{P(E_i)}{1 - P(E_i)}$$

In Figure 3 we plot the values of $P(E_i)$ vs. i and $\mathrm{Odds}(E_i)$ vs. i. We see that the second of these plots is reasonably linear.

Given that there is only one independent variable (namely x = # of rems), we can use the following model

$$\ln \mathrm{Odds}(E) = a + bx$$

Here we use coefficients a and b instead of $b_0$ and $b_1$ just to keep the notation simple.
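The link between a probability, its odds and its log-odds can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and not part of the presentation. It also reproduces the variance values p(1 − p) quoted earlier for a binary variable.

```python
import math


def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds of p; this is the quantity modelled as a + b*x."""
    return math.log(odds(p))


def logistic(z):
    """Inverse of logit: maps any real z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bernoulli_variance(p):
    """Variance of a 0/1 variable whose probability of 1 is p."""
    return p * (1.0 - p)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(p, odds(p), logit(p), bernoulli_variance(p))
```

Because `logistic` is the inverse of `logit`, a linear model for the log-odds always maps back to a probability strictly between 0 and 1, which is the property the linear regression model lacks.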

We show two different methods for finding the values of the coefficients a and b. The first uses Excel's Solver tool and the second uses Newton's method. Before proceeding it might be worthwhile to review Goal Seeking and Solver (for how to use Excel's Solver tool) and Newton's Method (for how to apply Newton's method). We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.

Finding Logistic Regression Coefficients using Excel's Solver

We now show how to find the coefficients for the logistic regression model using Excel's Solver capability (see also Goal Seeking and Solver). We start with Example 1 from Basic Concepts of Logistic Regression.

Example 1 (continued): From Definition 1 of Basic Concepts of Logistic Regression, the predicted value $p_i$ for the probability of survival for each interval i is given by the following formula, where $x_i$ represents the number of rems for interval i.

$$p_i = \frac{1}{1 + e^{-(a + b x_i)}}$$

The log-likelihood statistic as defined in Definition 5 of Basic Concepts of Logistic Regression is given by

$$LL = \sum_{i} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]$$

where the sum runs over the individual sample elements and $y_i$ is the observed outcome for the ith element. Since we are aggregating the sample elements into intervals, we use the modified version of the formula, namely

$$LL = \sum_{i=1}^{r} n_i \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]$$

where $y_i$ is the observed probability of survival in the ith of r intervals and $n_i$ is the number of people in the ith interval.
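The grouped log-likelihood can be sketched in a few lines of Python. This assumes the usual grouped form LL = Σ n_i [y_i ln p_i + (1 − y_i) ln(1 − p_i)] with p_i = 1/(1 + e^(−(a + b·x_i))); the data below are hypothetical (they are not the values from the presentation's figures) and the function names are illustrative.

```python
import math


def predicted_probability(a, b, x):
    """Logistic model p = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def grouped_log_likelihood(a, b, groups):
    """Log-likelihood for summary data.

    groups is an iterable of (x, successes, failures) rows.
    """
    total = 0.0
    for x, successes, failures in groups:
        n = successes + failures          # group size n_i
        y = successes / n                 # observed proportion y_i
        p = predicted_probability(a, b, x)
        total += n * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total


# Hypothetical summary rows: (rems, survived, died). Not the presentation's data.
example_groups = [(50, 45, 5), (350, 30, 20), (650, 10, 40)]

# With a = b = 0 every p_i is 0.5, so LL equals n * ln(0.5).
ll_start = grouped_log_likelihood(0.0, 0.0, example_groups)

if __name__ == "__main__":
    print(ll_start)
```

With both coefficients set to zero the result depends only on the total sample size n, since every term reduces to n_i ln(0.5). For a sample of 760 this gives 760 · ln(0.5) ≈ −526.79, which is a convenient sanity check on a worksheet before running any optimizer.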

We capture this information in the worksheet in Figure 1 (based on the data in Figure 2 of Basic Concepts of Logistic Regression).

In Figure 1, column I contains the rem values for each interval (copy of columns A and E). Column J contains the observed probability of survival for each interval (copy of column F). Column K contains the values of each $p_i$. E.g. cell K4 contains the formula =1/(1+EXP(-O5-O6*I4)) and initially has value 0.5 based on the initial guess of the coefficients a and b given in cells O5 and O6 (which we arbitrarily set to zero). Cell L14 contains the value of LL using the formula =SUM(L4:L13), where L4 contains the formula =(B4+C4)*(J4*LN(K4)+(1-J4)*LN(1-K4)), and similarly for the other cells in column L.

We now use Excel's Solver tool by selecting Data > Analysis|Solver and filling in the dialog box that appears as described in Figure 2 (see Goal Seeking and Solver for more details).

Our objective is to maximize the value of LL (in cell L14) by changing the coefficients (in cells O5 and O6). It is important, however, to make sure that the Make Unconstrained Variables Non-Negative checkbox is not checked. When we click on the Solve button we get a message that Solver has successfully found a solution, i.e. it has found values for a and b which maximize LL.

We elect to keep the solution found and Solver automatically updates the worksheet from Figure 1 based on the values it found for a and b. The resulting worksheet is shown in Figure 3.

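We see that a = 4.476711 and b = -0.00721. Thus the logistic regression model is given by the formula

$$p = \frac{1}{1 + e^{-(4.476711 - 0.00721x)}}$$

For example, the predicted probability of survival when exposed to 380 rems of radiation is given by

$$p = \frac{1}{1 + e^{-(4.476711 - 0.00721 \cdot 380)}} \approx 0.85$$

Note that

$$\frac{\mathrm{Odds}(x = 180)}{\mathrm{Odds}(x = 200)} = \frac{e^{a + 180b}}{e^{a + 200b}} = e^{-20b} = e^{0.1442} \approx 1.155$$

Thus, the odds that a person exposed to 180 rems survives are 15.5% greater than the odds for a person exposed to 200 rems.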
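The two calculations above can be reproduced with a short Python sketch that uses the coefficients reported by Solver (a = 4.476711, b = −0.00721). The function names are illustrative assumptions, not part of the presentation.

```python
import math

# Coefficients reported in the text for Example 1.
A = 4.476711
B = -0.00721


def survival_probability(rems):
    """Predicted probability of survival at a given radiation dose."""
    return 1.0 / (1.0 + math.exp(-(A + B * rems)))


def survival_odds(rems):
    """Predicted odds of survival, p / (1 - p), at a given dose."""
    p = survival_probability(rems)
    return p / (1.0 - p)


p_380 = survival_probability(380)
odds_ratio = survival_odds(180) / survival_odds(200)

if __name__ == "__main__":
    print(round(p_380, 4))
    print(round(odds_ratio, 4))
```

Because the model is linear in the log-odds, the odds ratio for any two doses 20 rems apart is the same value, e^(−20b), regardless of where on the dose scale the comparison is made.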

Real Statistics Data Analysis Tool:

The Real Statistics Resource Pack provides the Logistic Regression supplemental data analysis tool. This tool takes as input a range which lists the sample data followed by the number of occurrences of success and failure. E.g. for Example 1 this is the data in range A3:C13 of Figure 1. For this problem there was only one independent variable (number of rems). If additional independent variables are used then the input will contain additional columns, one for each independent variable.

We show how to use this tool to create a spreadsheet similar to the one in Figure 3. First press Ctrl-m to bring up the menu of Real Statistics supplemental data analysis tools and choose the Logistic Regression option. This brings up the dialog box shown in Figure 4.

Now select A3:C13 as the Input Range (see Figure 5) and, since this data is in summary form with column headings, select the Summary data option for the Input Format and check Headings included with data. Next select Solver as the Analysis Type and keep the default Alpha and Classification Cutoff values of .05 and .5 respectively.

Finally press the OK button to obtain the output displayed in Figure 5.

The input layout used here (sample data followed by the number of occurrences of success and failure) is considered to be the summary form. The data from range A3:C13 of Figure 1 is repeated in Figure 5 in the same cells.

Note that the coefficients (range Q7:Q8) are set initially to zero and LL (cell M16) is calculated to be -526.792 (exactly as in Figure 1). The output from the Logistic Regression data analysis tool also contains many fields which will be explained later. As described in Figure 2, we can now use Excel's Solver tool to find the logistic regression coefficients. The result is shown in Figure 6. We obtain the same values for the regression coefficients as we obtained previously in Figure 3, but all the other cells are updated with the correct values as well.

Significance Testing of the Logistic Regression Coefficients

Definition 1: For any coefficient b the Wald statistic is given by the formula

$$\mathrm{Wald} = \frac{b}{s.e.(b)}$$

For ordinary regression we can calculate a statistic t ~ T(dfRes) which can be used to test the hypothesis that a coefficient b = 0. The Wald statistic is approximately normal and so it can be used to test whether the coefficient b = 0 in logistic regression.

Since the Wald statistic is approximately normal, by Theorem 1 of Chi-Square Distribution, Wald² is approximately chi-square, and, in fact, Wald² ~ χ²(df) where df = k − k0 and k = the number of parameters (i.e. the number of coefficients) in the model (the full model) and k0 = the number of parameters in a reduced model (esp. the baseline model which doesn't use any of the variables, only the intercept).

Property 1: The covariance matrix S for the coefficient matrix B is given by the matrix formula

$$S = (X^T V X)^{-1}$$

where X is the r × (k+1) design matrix (as described in Definition 3 of Least Squares Method for Multiple Regression)

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{r1} & \cdots & x_{rk} \end{bmatrix}$$

and V = [$v_{ij}$] is the r × r diagonal matrix whose diagonal elements are $v_{ii} = n_i p_i (1 - p_i)$, where $n_i$ = the number of observations in group i and $p_i$ = the probability of success predicted by the model for elements in group i. Groups correspond to the rows of matrix X and consist of the various combinations of values of the independent variables.

Note that $S = (X^T W)^{-1}$ where W is X with each element in the ith row of X multiplied by $v_{ii}$.

Observation: The standard errors of the logistic regression coefficients consist of the square roots of the entries on the diagonal of the covariance matrix in Property 1.
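As an illustration of Property 1 for a model with a single independent variable, the sketch below builds the 2 × 2 matrix XᵀVX from grouped data and inverts it directly, assuming S = (XᵀVX)⁻¹ with v_i = n_i p_i (1 − p_i). The (x, n) data layout and the function names are assumptions made for this example; the presentation itself performs the calculation with worksheet array formulas.

```python
import math


def covariance_matrix(a, b, groups):
    """Covariance matrix of (a, b) for the model logit(p) = a + b*x.

    groups is an iterable of (x, n) pairs, n being the group size.
    Returns a 2x2 nested list [[var_a, cov_ab], [cov_ab, var_b]].
    """
    s0 = s1 = s2 = 0.0
    for x, n in groups:
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        v = n * p * (1.0 - p)        # diagonal entry v_ii of V
        s0 += v                      # entries of X^T V X
        s1 += v * x
        s2 += v * x * x
    det = s0 * s2 - s1 * s1
    # Closed-form inverse of the symmetric 2x2 matrix [[s0, s1], [s1, s2]].
    return [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]


def standard_errors(cov):
    """Square roots of the diagonal entries of the covariance matrix."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]


def wald_statistic(coefficient, variance):
    """Wald statistic b / s.e.(b); its square is compared with chi-square."""
    return coefficient / math.sqrt(variance)
```

A call such as `covariance_matrix(a, b, [(x, n), ...])` followed by `standard_errors(...)` gives the quantities needed for the Wald test of each coefficient. For more than one independent variable a general matrix inverse would be needed instead of the 2 × 2 closed form.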

Example 1 (Coefficients): We now turn our attention to the coefficient table given in range E18:L20 of Figure 6 of Finding Logistic Regression Coefficients using Solver (repeated in Figure 1 below).

Figure 1. Output from Logistic Regression tool

Using Property 1 we calculate the covariance matrix S (range V6:W7) for the coefficient matrix B via the formula

$$S = (X^T V X)^{-1}$$

Actually, for computational reasons it is better to use the following equivalent array formula:

The formulas used to calculate the values for the Rems coefficient (row 20) are given in Figure 2.

Note that Wald represents the Wald² statistic and that lower and upper represent the 100(1 − α)% confidence interval of exp(b). Since 1 = exp(0) is not in the confidence interval (.991743, .993871), the Rem coefficient b is significantly different from 0 and should therefore be retained in the model.

Observation: The % Correct statistic (cell N16 of Figure 1) is another way to gauge the fit of the model to the observed data. The statistic says that 76.8% of the observed cases are predicted accurately by the model. This statistic is calculated as follows:

For any observed values of the independent variables, when the predicted value of p is greater than or equal to .5 (viewed as predicting success), the % correct is equal to the observed number of successes divided by the total number of observations (for those values of the independent variables). When p < .5 (viewed as predicting failure), the % correct is equal to the observed number of failures divided by the total number of observations. These values are weighted by the number of observations of that type and then summed to provide the % correct statistic for all the data.

For example, for the case where Rem = 450, p-Pred = .774 (cell J10), which predicts success (i.e. survived). Thus the % Correct for Rem = 450 is 85/108 = 78.7% (cell N10). The weighted sum (found in cell N16) of all these cells is then calculated by the formula =SUMPRODUCT(N6:N15,H6:H15)/H16.

Testing the Fit of the Logistic Regression Model

For larger values of b, the standard error becomes inflated, which lowers the Wald statistic and increases the probability that b is viewed as not making a significant contribution to the model even when it does (i.e. a Type II error).

To overcome this problem it is better to test on the basis of the log-likelihood statistic since

$$2(LL_1 - LL_0) \sim \chi^2(df)$$

where df = k − k0 and where $LL_1$ refers to the full log-likelihood model and $LL_0$ refers to a model with fewer coefficients (especially the model with only the intercept $b_0$ and no other coefficients). This is equivalent to

$$-2 \ln \frac{L_0}{L_1} \sim \chi^2(df)$$

where $L_1$ and $L_0$ are the corresponding likelihood values.

Observation: For ordinary regression the coefficient of determination is

$$R^2 = \frac{SS_{Reg}}{SS_T} = 1 - \frac{SS_{Res}}{SS_T}$$

Thus R² measures the percentage of variance explained by the regression model. We need a similar statistic for logistic regression. We define the following three pseudo-R² statistics for logistic regression.

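Definition 1: The log-linear ratio R² is defined as follows:

$$R_L^2 = 1 - \frac{LL_1}{LL_0}$$

where $LL_1$ refers to the full log-likelihood model and $LL_0$ refers to a model with fewer coefficients (especially the model with only the intercept $b_0$ and no other coefficients).

Cox and Snell's R² is defined as

$$R_{CS}^2 = 1 - \left( \frac{L_0}{L_1} \right)^{2/n} = 1 - e^{-2(LL_1 - LL_0)/n}$$

where n = the sample size.

Nagelkerke's R² is defined as

$$R_N^2 = \frac{R_{CS}^2}{1 - L_0^{2/n}} = \frac{R_{CS}^2}{1 - e^{2 LL_0 / n}}$$

Observation I: Since $R_{CS}^2$ cannot achieve a value of 1, Nagelkerke's R² was developed to have properties more similar to the R² statistic used in ordinary regression.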

Observation II: The initial value $LL_0$ of LL, i.e. where we only include the intercept value $b_0$, is given by

$$LL_0 = n_0 \ln \frac{n_0}{n} + n_1 \ln \frac{n_1}{n}$$

where $n_0$ = number of observations with value 0, $n_1$ = number of observations with value 1 and $n = n_0 + n_1$.
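The intercept-only log-likelihood and the three pseudo-R² statistics are simple to compute once LL0 and LL1 are known. The sketch below assumes the standard forms LL0 = n0 ln(n0/n) + n1 ln(n1/n), R²_L = 1 − LL1/LL0, R²_CS = 1 − exp(−2(LL1 − LL0)/n) and R²_N = R²_CS / (1 − exp(2·LL0/n)); the function names are illustrative.

```python
import math


def intercept_only_ll(n0, n1):
    """Log-likelihood of the model that contains only the intercept."""
    n = n0 + n1
    return n0 * math.log(n0 / n) + n1 * math.log(n1 / n)


def pseudo_r2(ll0, ll1, n):
    """Return (log-linear ratio, Cox and Snell, Nagelkerke) pseudo-R2."""
    r2_l = 1.0 - ll1 / ll0
    r2_cs = 1.0 - math.exp(-2.0 * (ll1 - ll0) / n)
    r2_n = r2_cs / (1.0 - math.exp(2.0 * ll0 / n))
    return r2_l, r2_cs, r2_n


# Example 1 has 760 people of whom 302 died (value 0) and 458 survived (value 1).
ll0_example = intercept_only_ll(302, 458)

if __name__ == "__main__":
    print(ll0_example)
```

Nagelkerke's statistic rescales Cox and Snell's by its largest attainable value, so it is never smaller than Cox and Snell's for the same pair of models.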

As described above, the likelihood-ratio test statistic equals:

$$\chi^2 = -2 \ln \frac{L_0}{L_1} = 2(LL_1 - LL_0)$$

where $L_1$ is the maximized value of the likelihood function for the full model, while $L_0$ is the maximized value of the likelihood function for the reduced model. The test statistic has a chi-square distribution with df = k1 − k0, i.e. the number of parameters in the full model minus the number of parameters in the reduced model.

Example 1: Determine whether there is a significant difference in survival rate between the different values of rem in Example 1 of Basic Concepts of Logistic Regression. Also calculate the various pseudo-R² statistics.

We are essentially comparing the logistic regression model with coefficient b to that of the model without coefficient b. We begin by calculating $LL_1$ (the full model with b) and $LL_0$ (the reduced model without b).

Here $LL_1$ is found in cell M16 or T6 of Figure 6 of Finding Logistic Coefficients using Solver.

We now use the following test:

$$\chi^2 = 2(LL_1 - LL_0) = 280.246$$

where df = 1. Since p-value = CHIDIST(280.246,1) = 6.7E-63 < .05 = α, we conclude that differences in rems yield a significant difference in survival.
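The likelihood-ratio test with one degree of freedom can be sketched in Python without any statistics library, using the fact that a chi-square variable with 1 df is the square of a standard normal, so that its upper-tail probability is erfc(√(x/2)). The function names are illustrative, and the sketch covers only the df = 1 case used in this example.

```python
import math


def likelihood_ratio_statistic(ll0, ll1):
    """Test statistic 2 * (LL1 - LL0) for nested models."""
    return 2.0 * (ll1 - ll0)


def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x), valid for x >= 0."""
    return math.erfc(math.sqrt(x / 2.0))


# Statistic quoted in the text for Example 1.
p_value = chi2_upper_tail_df1(280.246)

if __name__ == "__main__":
    print(p_value)
```

The resulting p-value is on the order of 10⁻⁶³, in line with the value quoted above and far below α = .05.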

The pseudo-R² statistics are as follows:

All these values are reported by the Logistic Regression data analysis tool (see range S5:T16 of Figure 6 of Finding Logistic Coefficients using Solver).

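Finding Logistic Regression Coefficients using Newton's Method

Property 1: The maximum of the log-likelihood statistic (from Definition 5 of Basic Concepts of Logistic Regression) occurs when

$$\sum_{i=1}^{n} (y_i - p_i)\, x_{ij} = 0 \quad \text{for } j = 0, 1, \ldots, k$$

where $x_{i0} = 1$ for all i.

Observation: Thus, to find the values of the coefficients $b_j$ we need to solve the equations

$$\sum_{i=1}^{n} \left( y_i - \frac{1}{1 + e^{-(b_0 + b_1 x_{i1} + \cdots + b_k x_{ik})}} \right) x_{ij} = 0 \quad \text{for } j = 0, 1, \ldots, k$$

We can do this iteratively using Newton's method (see Definition 2 of Newton's Method and Property 2 of Newton's Method) as described in Property 2.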

Property 2: Let B = [$b_j$] be the (k+1) × 1 column vector of logistic regression coefficients, let Y = [$y_i$] be the n × 1 column vector of observed outcomes of the dependent variable, let X be the n × (k+1) design matrix (see Definition 3 of Least Squares Method for Multiple Regression), let P = [$p_i$] be the n × 1 column vector of predicted values of success and V = [$v_i$] be the n × n diagonal matrix where $v_i = p_i(1 - p_i)$. Then if $B_0$ is an initial guess of B and for all m we define the following iteration

$$B_{m+1} = B_m + (X^T V X)^{-1} X^T (Y - P)$$

(with P and V evaluated at $B_m$), then for m sufficiently large $B_{m+1} \approx B_m$, and so $B_m$ is a reasonable estimate of the coefficient vector.
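The iteration in Property 2 can be sketched in plain Python for the case of a single independent variable, assuming the update B(m+1) = B(m) + (XᵀVX)⁻¹ Xᵀ(Y − P) with v_i = p_i(1 − p_i). The raw data below are hypothetical (they are not from the presentation), the function name is illustrative, and the 2 × 2 system is solved in closed form rather than with a matrix library.

```python
import math


def newton_logistic(xs, ys, max_iterations=25, tolerance=1e-10):
    """Fit logit(p) = a + b*x to raw 0/1 outcomes by Newton's method."""
    a = b = 0.0                               # initial guess B0 = (0, 0)
    for _ in range(max_iterations):
        g0 = g1 = 0.0                         # X^T (Y - P)
        h00 = h01 = h11 = 0.0                 # X^T V X
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            v = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Step = (X^T V X)^-1 X^T (Y - P), using the 2x2 inverse.
        step_a = (h11 * g0 - h01 * g1) / det
        step_b = (-h01 * g0 + h00 * g1) / det
        a += step_a
        b += step_b
        if max(abs(step_a), abs(step_b)) < tolerance:
            break
    return a, b


# Hypothetical raw data: one x value and one 0/1 outcome per observation.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

if __name__ == "__main__":
    print(newton_logistic(xs, ys))
```

At convergence the two sums Σ(y_i − p_i) and Σ(y_i − p_i)x_i are numerically zero, which is exactly the condition stated in Property 1. The sketch has no safeguard for perfectly separated data, where the maximum likelihood estimate does not exist and the iteration would diverge.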

Observation: If we group the data as we did in Example 1 of Basic Concepts of Logistic Regression (i.e. summary data), then Property 2 holds where Y = [$y_i$] is the r × 1 column vector of summarized observed outcomes of the dependent variable, X is the corresponding r × (k+1) design matrix, P = [$p_i$] is the r × 1 column vector of predicted values of success and V = [$v_i$] is the r × r diagonal matrix where $v_i = n_i p_i (1 - p_i)$.

Example 1 (using Newton's Method): We now return to the problem of finding the coefficients a and b for Example 1 of Basic Concepts of Logistic Regression using Newton's method.

We apply Newton's method to find the coefficients as described in Figure 1. The method converges in only 4 iterations with the values a = 4.47665 and b = -0.0072.

The regression equation is therefore logit(p) = 4.47665 − 0.0072x.

Example 2:A study was made as to whether environmental temperature or immersion in water of the hatching egg had an effect on the gender of a particular type of small reptile. The table in Figure 2 shows the temperature (in degrees Celsius) and immersion in water (0 = no and 1 = yes) of the 49 eggs which resulted in a live birth as well as the sex of the reptile that hatched. Determine the odds that a female will be born if the temperature is 23 degrees with the egg immersed in water vs. not immersed in water.

We use the Logistic Regression supplemental data analysis tool, selecting the Raw data and Newton Method options as shown in Figure 3.

After pressing the OK button we obtain the output displayed in Figure 4.

Here we only show the first 19 elements in the sample, although the full sample is contained in range A4:C52. Note that in the raw data option the Input Range (range A4:C52) consists of one column for each independent variable (Temp and Water for this example) and a final column only containing the values 0 or 1, where 1 indicates success (Male in this case) and 0 indicates failure (Female in this case). Please don't read any gender discrimination into these choices: we would get the same result if we chose Female to be success and Male to be failure.

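The model indicates that to predict the probability that a reptile will be male you can use the following formula, where $b_0$, $b_1$ and $b_2$ are the intercept, Temp and Water coefficients found by the tool:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 \cdot Temp + b_2 \cdot Water)}}$$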

We can now obtain the desired results as shown in Figure 5 by copying any formula for p-Pred from Figure 4 and making a minor modification.

Here we copied the formula from cell K6 into cells G29 and G30.

The formula that now appears in cell G29 will be =1/(1+EXP(-$R$7-MMULT(A29:B29,$R$8:$R$9))). You just need to change the part A29:B29 to E29:F29 (where the values of Temp and Water actually appear). The resulting formula

=1/(1+EXP(-$R$7-MMULT(E29:F29,$R$8:$R$9)))

will give the result shown in Figure 5.

Comparing Logistic Regression Models

Example 1: Repeat the study from Example 2 of Finding Logistic Regression Coefficients using Newton's Method based on the summary data shown in Figure 1.

Using the Logistic Regression supplemental data analysis tool, selecting the Newton Method option, we obtain the output displayed in Figure 2.

Example 2: Do the Temp and Water variables make a significant difference in the model of Example 1?

We first create summary tables for the Temp-only and Water-only models and then use the Logistic Regression data analysis tool (with Newton option) to build the two models. Also see below for a simpler approach for creating the Temp-only summary table.

The summary table for the Temp model is shown in range B28:D34 of Figure 3. The values of the C and D columns can be calculated from the summary table of the base model (as shown in Figure 2) using SUMIF. For example, the number of samples where Temp = 20 and the reptile was born Male (cell C29) is given by the formula

=SUMIF($A$4:$A$15,$B29,C$4:C$15)

By filling right (Ctrl-R) and down (Ctrl-D), you can copy this formula into the other cells in the range C29:D34. You now use the Logistic Regression tool to obtain the output shown in Figure 3.
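The SUMIF step amounts to collapsing the summary table over the variable that is being dropped. A rough Python equivalent is sketched below; the rows are hypothetical (they are not the values from Figure 2) and the function name is illustrative.

```python
from collections import defaultdict


def collapse_by_temp(rows):
    """Sum the Male and Female counts over all rows sharing a Temp value.

    rows is an iterable of (temp, water, male, female) tuples.
    Returns a list of (temp, male, female) tuples sorted by temp.
    """
    totals = defaultdict(lambda: [0, 0])
    for temp, _water, male, female in rows:
        totals[temp][0] += male
        totals[temp][1] += female
    return [(temp, counts[0], counts[1]) for temp, counts in sorted(totals.items())]


# Hypothetical base-model summary rows: (Temp, Water, Male, Female).
base_rows = [
    (20, 0, 1, 3),
    (20, 1, 0, 4),
    (21, 0, 2, 2),
    (21, 1, 1, 3),
]

temp_only = collapse_by_temp(base_rows)

if __name__ == "__main__":
    for row in temp_only:
        print(row)
```

Each output row plays the role of one row of the Temp-only summary table: the Male and Female totals are the sums over both Water settings for that temperature.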

We observe that the Temp variable makes a significant contribution (cell U35) over the constant-only model. Here we are comparing $LL_1$ (Temp model) with $LL_0$ (constant-only model).

We can also compare the Temp model with the base model (Temp + Water), by copying the range T28:U35 to another location in the worksheet and using the $LL_1$ value from the base model and substituting the $LL_1$ value from the Temp model for $LL_0$. Also we need to change df to 1 since the difference between the df of the two models is 2 − 1 = 1. This is shown in Figure 4.

We see that there is not a significant difference between the models (cell X44). This confirms the conclusion that we reached previously that the Water variable is not making a significant contribution, and in fact it can be dropped.

We create the Water-only model in a similar way to obtain the output shown in Figure 5.

This time we see that there is no significant difference between the Water model and the constant model. If we repeat the analysis of Figure 4, we would see that there is a significant difference between the Water model and the base model.

Finally, we can look at further refinements of the model, such as the full interaction model, where we include the interaction between Temp and Water. We show this analysis in Figure 6.

If we compare this model with the base model using the approach described above (as in Figure 4), we get the output shown in Figure 7.

This shows that there is a significant difference between the full interaction model and the base model, with the interaction model providing a better fit.

Observation: As mentioned above, there is a simpler way to create the Temp-only and Water-only summary data tables. To create the Temp-only table, press Ctrl-m and select the Logistic Regression data analysis tool and then enter the following information into the dialog box that appears.

Here we have entered the Water independent variable into the List of variables to exclude field. This produces the output in Figure 3.

Observation: The List of variables to exclude field can be used whenever the Input Format is set to Summary data and the Headings included with data field is checked in order to create a reduced model. The variables to exclude are entered into this field separated by commas.

E.g. if we have a summary data table with Nationality, Age, Education, Gender and Occupation as independent variables and want to create a reduced model with only Nationality, Education and Occupation, we would simply enter Age, Gender into the List of variables to exclude field.

Hosmer-Lemeshow Test

The Hosmer-Lemeshow test is used to determine the goodness of fit of the logistic regression model. Essentially it is a chi-square goodness of fit test (as described in Goodness of Fit) for grouped data, usually where the data is divided into 10 equal subgroups. The version of the test we present here uses the groupings that we have used elsewhere and not subgroups of size ten.

Since this is a chi-square goodness of fit test, we need to calculate the HL statistic

$$HL = \sum_{i=1}^{g} \left[ \frac{(obs_{i1} - exp_{i1})^2}{exp_{i1}} + \frac{(obs_{i0} - exp_{i0})^2}{exp_{i0}} \right]$$

where g = the number of groups, $obs_{i1}$ and $exp_{i1}$ are the observed and expected numbers of successes in group i, and $obs_{i0}$ and $exp_{i0}$ are the observed and expected numbers of failures. The test used is chi-square with g − 2 degrees of freedom. A significant test indicates that the model is not a good fit and a non-significant test indicates a good fit.

Example 1: Use the Hosmer-Lemeshow test to determine whether the logistic regression model is a good fit for the data in Example 1 in Comparing Logistic Regression Models.

In our example the sum is taken over the 12 Male groups and the 12 Female groups. The observed values are given in columns H and I (duplicates of the input data columns C and D), while the expected values are given in columns L and M. E.g. cell L4 contains the formula =K4*J4 and cell M4 contains the formula =J4-L4 or equivalently =(1-K4)*J4.

The HL statistic is calculated in cell N16 via the formula =SUM(N4:N15). E.g. cell N4 contains the formula =(H4-L4)^2/L4+(I4-M4)^2/M4.
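The worksheet calculation can be mirrored in Python. The sketch below computes the HL statistic from (observed successes, observed failures, predicted probability) triples, following the same per-row expression as the cell formula above, and evaluates the chi-square upper-tail probability with the closed form that holds for an even number of degrees of freedom. The function names are illustrative, and the sketch assumes df is even (as it is when g = 12).

```python
import math


def hosmer_lemeshow(groups):
    """HL statistic for rows of (observed_successes, observed_failures, p_pred)."""
    hl = 0.0
    for obs_success, obs_failure, p_pred in groups:
        n = obs_success + obs_failure
        exp_success = n * p_pred              # like =K4*J4
        exp_failure = n - exp_success         # like =J4-L4
        hl += (obs_success - exp_success) ** 2 / exp_success
        hl += (obs_failure - exp_failure) ** 2 / exp_failure
    return hl


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with df degrees of freedom > x) for even df >= 2."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form needs an even df >= 2")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j                      # (x/2)^j / j!
        total += term
    return math.exp(-half) * total


# p-value for the HL statistic and df quoted in the example.
p_value = chi2_upper_tail_even_df(24.40567, 10)

if __name__ == "__main__":
    print(round(p_value, 6))
```

As with any chi-square goodness of fit test, rows whose expected counts are very small can dominate the sum, which is why combining sparse rows can change the outcome of the test.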

The Hosmer-Lemeshow test results are shown in range Q12:Q16. The HL stat is 24.40567 (as calculated in cell N16), df = g − 2 = 12 − 2 = 10 and p-value = CHIDIST(24.40567, 10) = .006593 < .05 = α, and so the test is significant, which indicates that the model is not a good fit.

Observation: The Hosmer-Lemeshow test needs to be used with caution. It tends to be highly dependent on the groupings chosen, i.e. one selection of groups can give a negative result while another will give a positive result. Also when there are too few groups (5 or fewer) the test will usually show a model fit.

As a chi-square goodness of fit test, the expected values used should generally be at least 5. In Example 1 the cells L9, L15, M4 and M10 all have values less than 5, with cells M4 and M10 especially troubling with values less than 1. We now address the problems of cells M4 and M10.

We can eliminate the first of these by combining the first two rows, as shown in Figure 2. Here p-Pred for the first row (cell K23) is calculated as a weighted average of the first two values from Figure 1 using the formula =(J4*K4+J5*K5)/(J4+J5). In a similar manner we combine the 7th and 8th rows from Figure 1.

The revised version shows a non-significant result, indicating that the model is a good fit.

Observation: The Real Statistics Logistic Regression data analysis tool automatically performs the Hosmer-Lemeshow test. For Example 1 of Finding Logistic Regression Coefficients using Solver, we can see from Figure 5 of Finding Logistic Regression Coefficients using Solver that the logistic regression model is a good fit. For Example 1, Figure 2 of Comparing Logistic Regression Models shows that the model is not a good fit, at least until we combine rows as we did above.

END