Logistic Regression
STAT E-150 Statistical Methods

So far we have considered regression analyses where the response variables are quantitative. What if this is not the case?
If a response variable is categorical, a different regression model applies, called logistic regression.

A categorical variable which has only two possible values is called a binary variable. We can represent these two outcomes as 1 for the presence of some condition (success) and 0 for the absence of the condition (failure).
The logistic regression model describes how the probability of success is related to the values of the explanatory variables, which can be categorical or quantitative.
Logistic regression models work with odds rather than proportions. The odds are just the ratio of the proportions for the two possible outcomes: if π is the proportion for one outcome, then 1 − π is the proportion for the second outcome.

The odds of the first outcome occurring are

odds = π / (1 − π)
Here's an example: Suppose that a coin is weighted so that heads are more likely than tails, with P(heads) = .6.

Then P(tails) = 1 − P(heads) = 1 − .6 = .4

The odds of getting heads in a toss of this coin are .6/.4 = 1.5

The odds of getting tails in a toss of this coin are .4/.6 = 2/3 ≈ .67

The odds ratio is 1.5 / (2/3) = 2.25

This tells us that you are 2.25 times more likely to get "heads" than to get "tails".
You can also convert the odds of an event back to the probability of the event:

For an event A, P(A) = odds(A) / (1 + odds(A))

For example, if the odds of a horse winning are 9 to 1, then the probability of the horse winning is 9/(1 + 9) = .9
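These conversions are easy to check numerically. Here is a short sketch (plain Python, no special libraries; the function names are ours) using the weighted coin and the horse example above:

```python
def odds_from_prob(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def prob_from_odds(odds):
    # P(A) = odds / (1 + odds)
    return odds / (1 + odds)

# the weighted coin: P(heads) = .6
odds_heads = odds_from_prob(0.6)       # .6/.4 = 1.5
odds_tails = odds_from_prob(0.4)       # .4/.6 ≈ .67
odds_ratio = odds_heads / odds_tails   # 2.25

# a horse with 9-to-1 odds of winning
p_win = prob_from_odds(9)              # 9/(1 + 9) = .9
```

Note that converting a probability to odds and back recovers the original probability, which is the point of the next sections.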
The Logistic Regression Model

The relationship between a categorical response variable and a single quantitative predictor variable is an S-shaped curve. Here is a plot of p vs. x for different logistic regression models:

The points on the curve represent P(Y = 1) for each value of x. The associated model is the logistic or logit model:

log(π / (1 − π)) = β0 + β1x
The general logistic regression model is

log(π / (1 − π)) = β0 + β1x1 + β2x2 + … + βkxk

where π = P(Y = 1) and E(Y) = π, the probability of success.

The xi are independent quantitative or qualitative variables.
Odds and log(odds)

Let π = P(Y = 1) be a probability with 0 < π < 1.

Then the odds that Y = 1 is the ratio

odds = π / (1 − π)   and so   log(odds) = log(π / (1 − π))
This transformation from π to log(odds) is called the logistic or logit transformation.

The relationship is one-to-one: for every value of π (except for 0 and 1) there is one and only one value of log(odds).
The log(odds) can have any value from −∞ to ∞, and so we can use a linear predictor.

That is, we can model the log odds as a linear function of the explanatory variable:

log(odds) = β0 + β1x

(To verify this, solve π = e^(β0 + β1x) / (1 + e^(β0 + β1x)) for π/(1 − π) and then take the log of both sides.)
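The verification above can also be done numerically: if π comes from a logistic model, then log(π / (1 − π)) is exactly the linear predictor β0 + β1x. A small sketch with made-up coefficients (not part of the example that follows):

```python
import math

def pi_from_x(b0, b1, x):
    # logistic model: pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

def log_odds(p):
    # logit transformation: log(p / (1 - p))
    return math.log(p / (1 - p))

# illustrative coefficients, chosen arbitrarily for the check
b0, b1 = -2.0, 0.5
# taking log(odds) of the model probabilities recovers b0 + b1*x exactly
recovered = [log_odds(pi_from_x(b0, b1, x)) for x in (0, 2, 4, 6)]
```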
For any fixed value of the predictor x, there are four probabilities:

                 Actual probability                Model probability
True value       p = true P(Yes) for this x        π = true P(Yes) from the model
Fitted value     p̂ = #Yes/(#Yes + #No)             π̂ = fitted P(Yes) from the model

If the model is exactly correct, then p = π and the two fitted values estimate the same number.
To go from log(odds) to odds, use the exponential function e^x:

1. odds = e^log(odds)
2. You can check that if odds = π/(1 − π), then you can solve for π to find that π = odds/(1 + odds).
3. Since log(odds) = β0 + β1x, we have the result

π = e^log(odds) / (1 + e^log(odds))
A study was conducted to analyze behavioral variables and stress in people recently diagnosed with cancer. For our purposes we will look at patients who have been in the study for at least a year, and the dependent variable (Outcome) is coded 1 to indicate that the patient is improved or in complete remission, and 0 if the patient has not improved or has died. The predictor variable is the survival rating assigned by the patient's physician at the time of diagnosis. This is a number between 0 and 100 and represents the estimated probability of survival at five years.
Out of 66 cases there are 48 patients who have improved and 18 who have not.
The scatterplot shows us that a linear regression analysis is not appropriate for these data. The scatterplot clearly has no linear trend, but it does show that the proportion of people who improve is much higher when the survival rating is high, as would be expected.

However, if we transform from whether the patient has improved to the odds of improvement, and then consider the log of the odds, we will have a variable that is a linear function of the survival rating, and we will be able to use linear regression.
Let p = the probability of improvement. Then 1 − p is the probability of no improvement.

We will look for an equation of the form

log(p / (1 − p)) = β0 + β1(SurvRate)

Here β1 will be the amount of increase in the log odds for a one-unit increase in SurvRate.
Here are the results of this analysis:
We can see that the logistic regression equation is

log(odds) = .081 × SurvRate − 2.684

Variables in the Equation
                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower      Upper
Step 1   survrate    .081    .019   17.755    1   .000    1.085     1.044      1.126
         Constant  -2.684    .811   10.941    1   .001     .068
a. Variable(s) entered on step 1: survrate.

Assessing the Model

In linear regression, we used the p-values associated with the test statistic t to assess the contribution of each predictor. In logistic regression, we can use the Wald statistic in the same way.
Note that in this example, the Wald statistic for the predictor is 17.755, which is significant at the .05 level of significance. This is evidence that this predictor is a significant predictor in this model.

H0: β1 = 0
Ha: β1 ≠ 0
Since p is close to zero, the null hypothesis is rejected. This indicates that the predicted survival rate is a useful indicator of the patient's outcome.
The resulting regression equation is

log(odds) = .081 × SurvRate − 2.684

Here are scatterplots of the data and of the values predicted by the model:
Note how well the results fit the data: the suggested curve is quite close to the points in the lower left, rises rapidly across the points in the center, where the values of SurvRate have roughly equal numbers of patients who improve and don't improve, and finally comes close to the cluster of points in the upper right. The predicted values all fall between 0 and 1.
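The fitted curve in the plot can be reproduced directly from the estimated equation. A minimal sketch (the coefficients .081 and −2.684 are taken from the SPSS output above; the function name is ours):

```python
import math

def predicted_prob(survrate):
    # fitted model: log(odds) = .081 * SurvRate - 2.684
    log_odds = 0.081 * survrate - 2.684
    # convert log(odds) back to a probability
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p_low = predicted_prob(0)     # near 0: improvement is unlikely
p_mid = predicted_prob(33)    # log(odds) near 0, so p is near .5
p_high = predicted_prob(100)  # near 1: improvement is very likely
```

Note also that e^.081 ≈ 1.085, the Exp(B) value reported for survrate: each one-point increase in SurvRate multiplies the odds of improvement by about 1.085.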
SPSS takes an iterative approach to this solution: it will begin with some starting values for β0 and β1, see how well the estimated log odds fit the data, adjust the coefficients, and then reexamine the fit. This continues until no further adjustment produces a better fit.
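The iteration SPSS performs is essentially Newton-Raphson maximum likelihood. Here is a minimal sketch of that idea for a single predictor, on a small made-up data set (not the study data, which we do not have in full):

```python
import math

def fit_logistic(xs, ys, tol=1e-6, max_iter=25):
    # Newton-Raphson for log(odds) = b0 + b1*x, starting from b0 = b1 = 0
    b0 = b1 = 0.0
    for _ in range(max_iter):
        # gradient of the log-likelihood and the 2x2 information matrix
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))  # current fitted prob.
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # solve the 2x2 system for the adjustment, then update
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) < tol and abs(d1) < tol:  # no further improvement
            break
    return b0, b1

# made-up (x, y) pairs: y = 1 (improved) becomes more common as x grows
xs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

As in the SPSS run, the estimates stabilize after a handful of iterations, and the slope comes out positive because improvement is more common at high x.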
What do all of the SPSS results tell us?
Starting with Block 0: Beginning Block

The Case Processing Summary tells us that all 66 cases were included:
Case Processing Summary
Unweighted Cases                          N     Percent
Selected Cases   Included in Analysis    66       100.0
                 Missing Cases            0          .0
                 Total                   66       100.0
Unselected Cases                          0          .0
Total                                    66       100.0
a. If weight is in effect, see classification table for the total number of cases.

The Variables in the Equation table shows that in this first iteration only the constant was used. The second table lists the variables that were not included in this model; it indicates that if the second variable were to be included, it would be a significant predictor:
Variables in the Equation
                      B      S.E.     Wald   df   Sig.   Exp(B)
Step 0   Constant    .981    .276   12.594    1   .000    2.667

Variables not in the Equation
                                Score   df   Sig.
Step 0   Variables   survrate  34.538    1   .000
         Overall Statistics    34.538    1   .000

The Iteration History shows what the results would be with only the constant in the model. Since the second variable, SurvRate, is not included, there is little change.
The -2 Log likelihood can be used to assess how well a model fits the data. It is based on summing the logs of the probabilities that the model assigns to the observed outcomes. The lower the -2LL value, the better the fit.
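For the constant-only model this value is easy to reproduce by hand: every case gets the same fitted probability, the overall proportion of successes (48 of 66). A short sketch:

```python
import math

n_yes, n_no = 48, 18             # improved / not improved
p_hat = n_yes / (n_yes + n_no)   # fitted probability for every case

# -2LL = -2 * sum of the log probabilities of the observed outcomes
neg2ll = -2 * (n_yes * math.log(p_hat) + n_no * math.log(1 - p_hat))

# with cut value .50, the constant-only model predicts "improved" for
# everyone (since p_hat = .727 > .50), so the overall percentage correct
# is just the share of patients who actually improved
overall_pct_correct = 100 * n_yes / (n_yes + n_no)
```

This reproduces both the "Initial -2 Log Likelihood: 77.346" in the Iteration History and the 72.7% in the Block 0 Classification Table.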
Iteration History
Iteration        -2 Log likelihood   Coefficients: Constant
Step 0   1             77.414               .909
         2             77.346               .980
         3             77.346               .981
         4             77.346               .981
a. Constant is included in the model.
b. Initial -2 Log Likelihood: 77.346
c. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

You can see from the Classification Table that the values were not classified by a second variable at this point. You can also see that there were 48 patients who improved and 18 who did not.
Classification Table
                                 Predicted
                                 outcome          Percentage
Observed                         0        1       Correct
Step 0   outcome     0           0       18           .0
                     1           0       48        100.0
         Overall Percentage                         72.7
a. Constant is included in the model.
b. The cut value is .500

One way to test the overall model is the Hosmer-Lemeshow goodness-of-fit test, which is a Chi-Square test comparing the observed and expected frequencies of subjects falling in the two categories of the response variable. Large values of χ2 (and the corresponding small p-values) indicate a lack of fit for the model. This table tells us that our model is a good fit, since the p-value is large:
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1        6.887       7   .441

Now consider the next block, Block 1: Method = Enter

The Iteration History table shows the progress as the model is reassessed; the value of the coefficient of SurvRate converges to .081.
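The Sig. value in the Hosmer and Lemeshow table can be checked without SPSS. For odd degrees of freedom, the chi-square tail probability can be built from the complementary error function and a standard recurrence; this is a sketch we wrote for the check, not an SPSS feature:

```python
import math

def chi2_sf(x, df):
    # P(Chi2_df > x) for odd df, using Q(1/2, t) = erfc(sqrt(t)) and the
    # recurrence Q(a + 1, t) = Q(a, t) + t**a * exp(-t) / gamma(a + 1),
    # where Q is the regularized upper incomplete gamma function
    assert df % 2 == 1, "this sketch handles odd df only"
    t = x / 2
    q = math.erfc(math.sqrt(t))
    a = 0.5
    while a + 1 <= df / 2:
        q += t ** a * math.exp(-t) / math.gamma(a + 1)
        a += 1
    return q

# the Hosmer-Lemeshow statistic and df from the table above
p_value = chi2_sf(6.887, 7)
```

The result agrees with the reported Sig. of .441, confirming the large p-value and the conclusion that the model fits.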