# Logistic Regression

Post on 22-Feb-2016



Outline:

- Basic Concepts of Logistic Regression
- Finding Logistic Regression Coefficients using Excel's Solver
- Significance Testing of the Logistic Regression Coefficients
- Testing the Fit of the Logistic Regression Model
- Finding Logistic Regression Coefficients using Newton's Method
- Comparing Logistic Regression Models
- Hosmer-Lemeshow Test

## Basic Concepts of Logistic Regression

The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:

$$\ln \mathrm{Odds}(E) = b_0 + b_1 x_1 + \cdots + b_k x_k$$

where the odds function $\mathrm{Odds}(E) = P(E) / (1 - P(E))$ is the ratio of the probability that event $E$ occurs to the probability that it does not.

(Figure: sigmoid curve for p)

Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular:

1. The assumption of the linear regression model that the values of y are normally distributed cannot be met, since y only takes the values 0 and 1.
2. The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable. Since the variance is p(1 − p), when 50 percent of the sample consists of 1s the variance is .25, its maximum value. As we move to more extreme values, the variance decreases. When p = .10 or .90, the variance is (.1)(.9) = .09, and so as p approaches 1 or 0, the variance approaches 0.
3. Using the linear regression model, the predicted values will become greater than one or less than zero if you move far enough along the x-axis. Such values are theoretically inadmissible for probabilities.

For the logistic model, the least squares approach to calculating the values of the coefficients $b_i$ cannot be used; instead, maximum likelihood techniques, as described below, are employed to find these values.

Example 1: A sample of 760 people who received doses of radiation between 0 and 1000 rems was taken following a recent nuclear accident. Of these, 302 died, as shown in the table in Figure 2. Each row in the table represents the midpoint of an interval of 100 rems (i.e. 0-100, 100-200, etc.).

Figure 2.

Let $E_i$ = the event that a person in the $i$th interval survived. The table also shows the probability $P(E_i)$ and odds $\mathrm{Odds}(E_i)$ of survival for a person in each interval. Note that $P(E_i)$ = the percentage of people in interval $i$ who survived and

$$\mathrm{Odds}(E_i) = \frac{P(E_i)}{1 - P(E_i)}$$

In Figure 3 we plot the values of $P(E_i)$ vs. $i$ and $\ln \mathrm{Odds}(E_i)$ vs. $i$. We see that the second of these plots is reasonably linear.
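As a rough illustration of the quantities tabulated for each interval, the following Python sketch computes the probability, odds and log-odds of survival from a pair of counts. The function name and the counts are hypothetical; they are not taken from Figure 2.

```python
import math

def survival_summary(survived, died):
    """Return (probability, odds, log-odds) of survival for one interval."""
    n = survived + died
    p = survived / n              # P(E_i): share of the interval that survived
    odds = p / (1 - p)            # Odds(E_i) = P(E_i) / (1 - P(E_i))
    return p, odds, math.log(odds)

# Hypothetical counts for a single interval (NOT the data of Figure 2).
p, odds, log_odds = survival_summary(survived=30, died=10)
print(p, odds, log_odds)
```

Note that the odds are undefined when nobody in an interval died, so a real worksheet would need to treat that case separately.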

Given that there is only one independent variable (namely $x$ = # of rems), we can use the following model:

$$\ln \mathrm{Odds}(E) = a + bx$$

Here we use coefficients $a$ and $b$ instead of $b_0$ and $b_1$ just to keep the notation simple.

We show two different methods for finding the values of the coefficients a and b. The first uses Excel's Solver tool and the second uses Newton's method. Before proceeding, it might be worthwhile to review Goal Seeking and Solver for how to use Excel's Solver tool, and Newton's Method for how to apply Newton's method. We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.

## Finding Logistic Regression Coefficients using Excel's Solver

We now show how to find the coefficients for the logistic regression model using Excel's Solver capability (see also Goal Seeking and Solver). We start with Example 1 from Basic Concepts of Logistic Regression.

Example 1 (continued): From Definition 1 of Basic Concepts of Logistic Regression, the predicted value $p_i$ for the probability of survival in each interval $i$ is given by the following formula, where $x_i$ represents the number of rems for interval $i$:

$$p_i = \frac{1}{1 + e^{-(a + b x_i)}}$$

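The log-likelihood statistic as defined in Definition 5 of Basic Concepts of Logistic Regression is given by

$$LL = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]$$

where $y_i$ is the observed value of survival (1 or 0) for the $i$th sample element. Since we are aggregating the sample elements into intervals, we use the modified version of the formula, namely

$$LL = \sum_{i=1}^{r} n_i \left[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \right]$$

where $y_i$ is the observed probability of survival in the $i$th of $r$ intervals and $n_i$ is the number of people in the $i$th interval.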

We capture this information in the worksheet in Figure 1 (based on the data in Figure 2 of Basic Concepts of Logistic Regression).

In Figure 1, column I contains the rem values for each interval (a copy of columns A and E). Column J contains the observed probability of survival for each interval (a copy of column F). Column K contains the values of each $p_i$. E.g. cell K4 contains the formula =1/(1+EXP(-O5-O6*I4)) and initially has value 0.5 based on the initial guess of the coefficients a and b given in cells O5 and O6 (which we arbitrarily set to zero). Cell L14 contains the value of LL using the formula =SUM(L4:L13), where L4 contains the formula =(B4+C4)*(J4*LN(K4)+(1-J4)*LN(1-K4)), and similarly for the other cells in column L.
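The worksheet logic can be mirrored outside Excel. The sketch below is a minimal Python version of the column K and column L formulas; the function names are assumptions, and the two-interval counts are made up for illustration (only their total, 760, matches Example 1).

```python
import math

def predicted_p(a, b, x):
    # Same expression as the worksheet formula =1/(1+EXP(-O5-O6*I4))
    return 1.0 / (1.0 + math.exp(-a - b * x))

def grouped_log_likelihood(a, b, xs, successes, failures):
    """Sum over intervals of n_i * [y_i ln p_i + (1 - y_i) ln(1 - p_i)]."""
    ll = 0.0
    for x, s, f in zip(xs, successes, failures):
        n = s + f                       # like B4+C4 in the worksheet
        y = s / n                       # observed probability of survival
        p = predicted_p(a, b, x)
        ll += n * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return ll

# Made-up counts for two intervals (NOT the data of Figure 2); total is 760.
xs = [50, 950]
successes = [400, 58]
failures = [60, 242]

# With a = b = 0 every p_i is 0.5, so LL reduces to N * ln(0.5).
ll0 = grouped_log_likelihood(0.0, 0.0, xs, successes, failures)
print(ll0)
```

With both coefficients at zero the result depends only on the total sample size, so for 760 people it equals 760 · ln(0.5) ≈ -526.79 whatever the per-interval counts are.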

We now use Excel's Solver tool by selecting Data > Analysis|Solver and filling in the dialog box that appears as described in Figure 2 (see Goal Seeking and Solver for more details).

Our objective is to maximize the value of LL (in cell L14) by changing the coefficients (in cells O5 and O6). It is important, however, to make sure that the Make Unconstrained Variables Non-Negative checkbox is not checked. When we click on the Solve button we get a message that Solver has successfully found a solution, i.e. it has found values for a and b which maximize LL. We elect to keep the solution found, and Solver automatically updates the worksheet from Figure 1 based on the values it found for a and b. The resulting worksheet is shown in Figure 3.
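For readers working outside Excel, the same maximization can be sketched in a few lines of Python. This is not a reproduction of what Solver does internally; it uses Newton's method, the second approach named earlier, for a model with one independent variable. The function name is an assumption, and because the table of counts is not reproduced here, the data are synthetic: ten interval midpoints with observed probabilities generated from known coefficients, so the fit should simply recover those coefficients.

```python
import math

def fit_logistic_newton(xs, ns, ys, iterations=25):
    """Maximize the grouped log-likelihood over (a, b) by Newton's method."""
    a, b = 0.0, 0.0                      # same starting guess as the worksheet
    for _ in range(iterations):
        g0 = g1 = 0.0                    # gradient of LL with respect to a, b
        h00 = h01 = h11 = 0.0            # entries of the negative Hessian
        for x, n, y in zip(xs, ns, ys):
            p = 1.0 / (1.0 + math.exp(-a - b * x))
            r = n * (y - p)              # n_i * (y_i - p_i)
            w = n * p * (1.0 - p)        # n_i * p_i * (1 - p_i)
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (negative Hessian)^-1 times the gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Synthetic data (an assumption, not the table of Figure 2): midpoints of ten
# 100-rem intervals, equal group sizes, probabilities from known coefficients.
xs = [50 + 100 * i for i in range(10)]
ns = [76] * 10
ys = [1.0 / (1.0 + math.exp(-(4.476711 - 0.00721 * x))) for x in xs]

a_hat, b_hat = fit_logistic_newton(xs, ns, ys)
print(a_hat, b_hat)
```

A fixed iteration count keeps the sketch short; a more careful version would stop when the step size or the change in LL falls below a tolerance.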

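We see that a = 4.476711 and b = -0.00721. Thus the logistic regression model is given by the formula

$$p = \frac{1}{1 + e^{-(4.476711 - 0.00721x)}}$$

For example, the predicted probability of survival when exposed to 380 rems of radiation is given by

$$p = \frac{1}{1 + e^{-(4.476711 - 0.00721 \cdot 380)}} \approx 0.85$$

Note that

$$\frac{\mathrm{Odds}(x = 180)}{\mathrm{Odds}(x = 200)} = e^{b(180 - 200)} = e^{-20b} = e^{0.1442} \approx 1.155$$

Thus, the odds that a person exposed to 180 rems survives are 15.5% greater than those of a person exposed to 200 rems.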
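These two calculations are easy to check numerically. The short Python sketch below plugs the reported coefficients into the model; the function names are illustrative only.

```python
import math

a, b = 4.476711, -0.00721        # coefficients reported in the text

def predicted_survival(x):
    """Predicted probability of survival at a dose of x rems."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def odds(x):
    p = predicted_survival(x)
    return p / (1.0 - p)

p_380 = predicted_survival(380)
odds_ratio = odds(180) / odds(200)   # equals exp(-20 * b)
print(round(p_380, 2), round(odds_ratio, 3))   # prints 0.85 1.155
```

Because the log-odds are linear in x, the ratio of odds for any two doses 20 rems apart is the same, regardless of where on the scale the comparison is made.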

Real Statistics Data Analysis Tool:

The Real Statistics Resource Pack provides the Logistic Regression supplemental data analysis tool. This tool takes as input a range which lists the sample data followed by the number of occurrences of success and failure. E.g. for Example 1 this is the data in range A3:C13 of Figure 1. For this problem there was only one independent variable (number of rems). If additional independent variables are used, then the input will contain additional columns, one for each independent variable.

We show how to use this tool to create a spreadsheet similar to the one in Figure 3. First press Ctrl-m to bring up the menu of Real Statistics supplemental data analysis tools and choose the Logistic Regression option. This brings up the dialog box shown in Figure 4.

Now select A3:C13 as the Input Range (see Figure 5) and, since this data is in summary form with column headings, select the Summary data option for the Input Format and check Headings included with data. Next select Solver as the Analysis Type and keep the default Alpha and Classification Cutoff values of .05 and .5 respectively. Finally press the OK button to obtain the output displayed in Figure 5.

Recall that the input range lists the sample data followed by the number of occurrences of success and failure; this is considered to be the summary form. For Example 1 this is the data in range A3:C13 of Figure 1, repeated in Figure 5 in the same cells.

Note that the coefficients (range Q7:Q8) are set initially to zero and LL (cell M16) is calculated to be -526.792 (exactly as in Figure 1). The output from the Logistic Regression data analysis tool also contains many fields which will be explained later. As described in Figure 2, we can now use Excel's Solver tool to find the logistic regression coefficients. The result is shown in Figure 6. We obtain the same values for the regression coefficients as we obtained previously in Figure 3, and all the other cells are updated with the correct values as well.

## Significance Testing of the Logistic Regression Coefficients

Definition 1: For any coefficient $b$ the Wald statistic is given by the formula

$$\mathrm{Wald} = \frac{b}{\mathrm{s.e.}(b)}$$

For ordinary regression we can calculate a statistic t ~ T(dfRes) which can be used to test the hypothesis that a coefficient b = 0. The Wald statistic is approximately normal, and so it can be used to test whether the coefficient b = 0 in logistic regression.

Since the Wald statistic is approximately normal, by Theorem 1 of Chi-Square Distribution, $\mathrm{Wald}^2$ is approximately chi-square, and, in fact, $\mathrm{Wald}^2 \sim \chi^2(df)$ where $df = k - k_0$, $k$ = the number of parameters (i.e. the number of coefficients) in the model (the full model) and $k_0$ = the number of parameters in a reduced model (esp. the baseline model, which doesn't use any of the variables, only the intercept).
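A minimal Python sketch of this test for a single coefficient (df = 1) follows, assuming the usual form Wald = b / s.e.(b). The function name is hypothetical, and the inputs in the example call are placeholders chosen to give a familiar result, not values from the worksheet. For one degree of freedom, the chi-square upper tail probability of Wald² equals the two-sided normal tail probability, which the standard library can compute with `math.erfc`.

```python
import math

def wald_test(b, se):
    """Return (Wald, Wald^2, p-value) for testing b = 0 with df = 1."""
    wald = b / se
    # For df = 1: P(chi-square > wald^2) = P(|Z| > |wald|) = erfc(|wald| / sqrt(2))
    p_value = math.erfc(abs(wald) / math.sqrt(2.0))
    return wald, wald ** 2, p_value

# Placeholder inputs (hypothetical): a coefficient 1.96 standard errors from 0.
wald, wald_sq, p_value = wald_test(1.96, 1.0)
print(wald, wald_sq, p_value)
```

A p-value below the chosen alpha (e.g. the default .05) would lead us to reject the hypothesis that the coefficient is zero.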

Property 1: The covariance matrix $S$ for the coefficient matrix $B$ is given by the matrix formula

$$S = (X^T V X)^{-1}$$

where $X$ is the $r \times (k+1)$ design matrix (as described in Definition 3 of Least Squares Method for Multiple Regression)

$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{r1} & \cdots & x_{rk} \end{bmatrix}$$

and $V = [v_{ij}]$ is the $r \times r$ diagonal matrix whose diagonal elements are $v_{ii} = n_i p_i (1 - p_i)$, where $n_i$ = the number of observations in group $i$ and $p_i$ = the probability of success predicted by the model for elements in group $i$. Groups correspond to the rows of matrix $X$ and consist of the various combinations of values of the independent variables.

Note that $S = (X^T W)^{-1}$ where $W$ is $X$ with each element in the $i$th row of $X$ multiplied by $v_{ii}$ (i.e. $W = VX$).

Observation: The standard errors of the logistic regression coefficients are the square roots of the entries on the diagonal of the covariance matrix in Property 1.
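For a model with an intercept and a single independent variable, the matrix formula reduces to inverting a 2 × 2 matrix, which can be sketched in plain Python as follows. The function names and the tiny two-group data set are assumptions chosen so the result is easy to verify by hand.

```python
import math

def coefficient_covariance(xs, ns, ps):
    """S = (X^T V X)^-1 for design rows [1, x_i] and v_ii = n_i p_i (1 - p_i)."""
    s00 = s01 = s11 = 0.0
    for x, n, p in zip(xs, ns, ps):
        v = n * p * (1.0 - p)
        s00 += v                 # sum of v_ii
        s01 += v * x             # sum of v_ii * x_i
        s11 += v * x * x         # sum of v_ii * x_i^2
    det = s00 * s11 - s01 * s01
    # Closed-form inverse of the symmetric 2 x 2 matrix [[s00, s01], [s01, s11]]
    return [[s11 / det, -s01 / det],
            [-s01 / det, s00 / det]]

def standard_errors(cov):
    """Square roots of the diagonal entries of the covariance matrix."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]

# Hypothetical groups: x = 0 and x = 1, four observations each, p_i = 0.5,
# so every v_ii = 1 and X^T V X = [[2, 1], [1, 1]].
cov = coefficient_covariance([0, 1], [4, 4], [0.5, 0.5])
se = standard_errors(cov)
print(cov, se)
```

With more than one independent variable the same computation needs a general matrix inverse, which is why the text states it in matrix form.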

Example 1 (Coefficients): We now turn our attention to the coefficient table given in range E18:L20 of Figure 6 o