
Page 1: Logistic Regression

Logistic Regression

STAT E-150 Statistical Methods

Page 2: Logistic Regression

2

So far we have considered regression analyses where the response variables are quantitative. What if this is not the case?

If a response variable is categorical a different regression model applies, called logistic regression. 

Page 3: Logistic Regression

3

A categorical variable which has only two possible values is called a binary variable. We can represent the two outcomes as 1 for the presence of some condition ("success") and 0 for its absence ("failure"):

y = 1 if success, 0 if failure

The logistic regression model describes how the probability of "success" is related to the values of the explanatory variables, which can be categorical or quantitative.

Page 4: Logistic Regression

4

Logistic regression models work with odds rather than proportions. The odds are just the ratio of the proportions for the two possible outcomes: if π is the proportion for one outcome, then 1 - π is the proportion for the second outcome.

The odds of the first outcome occurring are

odds = π / (1 - π)

Page 5: Logistic Regression

5

Here's an example: Suppose that a coin is weighted so that heads are more likely than tails, with P(heads) = .6.

Then P(tails) = 1 - P(heads) = 1 - .6 = .4

The odds of getting heads in a toss of this coin are .6 / (1 - .6) = .6/.4 = 1.5

The odds of getting tails in a toss of this coin are .4 / (1 - .4) = .4/.6 = .667

The odds ratio is 1.5 / .667 = 2.25

This tells us that the odds of getting "heads" are 2.25 times the odds of getting "tails".

Page 6: Logistic Regression

6

You can also convert the odds of an event back to the probability of the event. For an event A,

P(A) = (odds of event A) / (1 + odds of event A)

For example, if the odds of a horse winning are 9 to 1, then the probability of the horse winning is 9 / (1 + 9) = .9
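To make these conversions concrete, here is a minimal Python sketch (the function names are just for illustration) that reproduces the coin and horse-racing numbers above:

    def odds(p):
        """Odds corresponding to probability p."""
        return p / (1 - p)

    def prob(o):
        """Probability corresponding to odds o."""
        return o / (1 + o)

    print(odds(0.6))              # heads: .6/.4 = 1.5
    print(odds(0.4))              # tails: .4/.6 = .667
    print(odds(0.6) / odds(0.4))  # odds ratio: 2.25
    print(prob(9))                # 9-to-1 odds -> probability .9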

Page 7: Logistic Regression

7

The Logistic Regression Model

The relationship between a categorical response variable and a single quantitative predictor variable is an S-shaped curve. Here is a plot of p vs. x for different logistic regression models:

The points on the curve represent P(Y=1) for each value of x. The associated model is the logistic or logit model, π = e^(β0 + β1x) / (1 + e^(β0 + β1x)).

Page 8: Logistic Regression

8

The general logistic regression model is

Y = e^(β0 + β1x1 + β2x2 + … + βkxk) / (1 + e^(β0 + β1x1 + β2x2 + … + βkxk))

where

Y = 1 if success, 0 if failure

and E(Y) = π, the probability of success.

The xi are independent quantitative or qualitative variables.
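As a quick numeric illustration of this probability form (a sketch; the coefficients and predictor values below are made up, not from any example in these slides):

    import math

    def success_prob(beta0, betas, xs):
        """P(Y = 1) under the general logistic model."""
        z = beta0 + sum(b * x for b, x in zip(betas, xs))
        return math.exp(z) / (1 + math.exp(z))

    # hypothetical coefficients beta0 = -2, beta1 = .5, beta2 = 1.2
    # evaluated at x1 = 3, x2 = 0
    print(success_prob(-2.0, [0.5, 1.2], [3.0, 0.0]))   # about 0.378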

Page 9: Logistic Regression

9

Odds and log(odds)

Let π = P(Y = 1) be a probability with 0 < π < 1

Then the odds that Y = 1 are the ratio

odds = π / (1 - π),  and so  log(odds) = log(π / (1 - π))

Page 10: Logistic Regression

10

This transformation from π to log(odds) is called the logistic or logit transformation.

The relationship is one-to-one:

For every value of π (except for 0 and 1) there is one and only one value of log(π / (1 - π)).

Page 11: Logistic Regression

11

The log(odds) can have any value from -∞ to ∞, and so we can use a linear predictor.

That is, we can model the log odds as a linear function of the explanatory variable:

y = log(π / (1 - π)) = β0 + β1x

for

π = e^(β0 + β1x) / (1 + e^(β0 + β1x))

(To verify this, solve for π and then take the log of both sides.)
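A two-line check that the logit and probability forms agree (a sketch; the numbers are arbitrary):

    import math

    b0, b1, x = -1.5, 0.8, 3.0                                # arbitrary values
    pi = math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))  # probability form
    print(math.log(pi / (1 - pi)), b0 + b1 * x)               # both print 0.9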

Page 12: Logistic Regression

12

For any fixed value of the predictor x, there are four probabilities:

If the model is exactly correct, then p = π, and the two fitted values estimate the same number.

                          True value                      Fitted value
  Actual probability p    true P(Yes) for this x          p̂ = #Yes/(#Yes + #No)
  Model probability π     true P(Yes) from the model      π̂ = fitted P(Yes) from the model

Page 13: Logistic Regression

13

To go from log(odds) to odds, use the exponential function e^x:

1. odds = e^(log(odds))

2. You can check that if odds = π / (1 - π), then you can solve for π to find that π = odds / (1 + odds).

3. Since log(odds) = β0 + β1x, we have the result

π = e^(log(odds)) / (1 + e^(log(odds))) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Page 14: Logistic Regression

14

The Logistic Regression Model

The logistic regression model for the probability of success of a binary response variable based on a single predictor x is:

Logit form:  log(π / (1 - π)) = β0 + β1x

Probability form:  π = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Page 15: Logistic Regression

15

Example:

A study was conducted to analyze behavioral variables and stress in people recently diagnosed with cancer. For our purposes we will look at patients who have been in the study for at least a year, and the dependent variable (Outcome) is coded 1 to indicate that the patient is improved or in complete remission, and 0 if the patient has not improved or has died. The predictor variable is the survival rating assigned by the patient's physician at the time of diagnosis. This is a number between 0 and 100 and represents the estimated probability of survival at five years.

Out of 66 cases there are 48 patients who have improved and 18 who have not.

Page 16: Logistic Regression

16

The scatterplot shows us that a linear regression analysis is not appropriate for this data. This scatterplot clearly has no linear trend, but it does show that the proportion of people who improve is much higher when the survival rate is high, as would be expected.

However, if we transform from whether the patient has improved to the odds of improvement, and then consider the log of the odds, we will have a variable that is a linear function of the survival rate, and we will be able to use linear regression.

Page 17: Logistic Regression

17

Let p = the probability of improvement. Then 1 - p is the probability of no improvement.

We will look for an equation of the form

y = log(odds) = log(p / (1 - p)) = β0 + β1×SurvRate

Here β1 will be the amount of increase in the log odds for a one-unit increase in SurvRate.

Page 18: Logistic Regression

18

Here are the results of this analysis:

We can see that the logistic regression equation is

log(odds) = .081×SurvRate - 2.684

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.
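For readers working outside SPSS, a rough Python equivalent of this fit (a sketch; the file and column names are hypothetical, since the study's data set is not included with these slides) would be:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("cancer_study.csv")    # hypothetical: survrate, outcome (0/1)

    X = sm.add_constant(df["survrate"])     # adds the intercept column
    fit = sm.Logit(df["outcome"], X).fit()  # maximum-likelihood estimation
    print(fit.summary())                    # coefficients, standard errors, p-values
    print(np.exp(fit.params))               # Exp(B): the odds ratios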

 

Page 19: Logistic Regression

19

Assessing the Model

In linear regression, we used the p-values associated with the test statistic t to assess the contribution of each predictor. In logistic regression, we can use the Wald statistic in the same way.

Note that in this example, the Wald statistic for the predictor is 17.755, which is significant at the .05 level. This is evidence that SurvRate is a useful predictor in this model.

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.

 

Page 20: Logistic Regression

20

H0: β1 = 0
Ha: β1 ≠ 0

Since p is close to zero, the null hypothesis is rejected. This indicates that the predicted survival rate is a useful indicator of the patient's outcome.

The resulting regression equation is

log(odds) = .081×SurvRate - 2.684

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.
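The fitted equation converts directly into predicted probabilities of improvement; a small sketch (the SurvRate values chosen are arbitrary):

    import math

    def p_improve(survrate):
        """Fitted P(improved) from log(odds) = .081*SurvRate - 2.684."""
        z = 0.081 * survrate - 2.684
        return math.exp(z) / (1 + math.exp(z))

    for s in (10, 33, 50, 90):            # arbitrary survival ratings
        print(s, round(p_improve(s), 3))  # .133, .497, .797, .990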

 

Page 21: Logistic Regression

21

Here are scatterplots of the data and of the values predicted by the model:

Page 22: Logistic Regression

22

Note how well the results fit the data:

The suggested curve is quite close to the points in the lower left, rises rapidly across the points in the center, where the values of SurvRate have a roughly equal number of patients who improve and who don't, and finally comes close to the cluster of points in the upper right. The fitted values all fall between 0 and 1.

Page 23: Logistic Regression

23

SPSS takes an iterative approach to this solution: it will begin with some starting values for β0 and β1, see how well the estimated log odds fit the data, adjust the coefficients, and then reexamine the fit. This continues until no further adjustment produces a better fit. (A bare-bones sketch of the idea follows.)
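SPSS's exact routine is not shown in the slides; the following Newton-Raphson sketch implements the same iterate-until-stable idea, assuming numpy arrays X (predictor columns) and y (0/1 outcomes):

    import numpy as np

    def fit_logistic(X, y, tol=1e-3, max_iter=25):
        """Refit repeatedly until the coefficient estimates stabilize."""
        X = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
        beta = np.zeros(X.shape[1])                # starting values
        for _ in range(max_iter):
            p = 1 / (1 + np.exp(-X @ beta))        # current fitted P(Y = 1)
            w = p * (1 - p)                        # logistic weights
            # Newton step: solve (X'WX) step = X'(y - p)
            step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
            beta += step
            if np.max(np.abs(step)) < tol:         # stop when changes < .001,
                break                              # as in the SPSS footnotes
        return beta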

What do all of the SPSS results tell us?

Page 24: Logistic Regression

24

Starting with Block 0: Beginning Block

The Case Processing Summary tells us that all 66 cases were included:

Case Processing Summary

Unweighted Cases a                         N     Percent
Selected Cases    Included in Analysis    66       100.0
                  Missing Cases            0          .0
                  Total                   66       100.0
Unselected Cases                           0          .0
Total                                     66       100.0

a. If weight is in effect, see classification table for the total number of cases.

 

Page 25: Logistic Regression

25

The Variables in the Equation table shows that in this first iteration only the constant was used. The second table lists the variables that were not included in this model; it indicates that if the second variable were to be included, it would be a significant predictor:

Variables in the Equation

                     B     S.E.     Wald   df   Sig.   Exp(B)
Step 0   Constant   .981   .276   12.594    1   .000    2.667

Variables not in the Equation

                                Score   df   Sig.
Step 0   Variables   survrate  34.538    1   .000
         Overall Statistics    34.538    1   .000

Page 26: Logistic Regression

26

The Iteration History shows the results when only the constant is in the model. Since the second variable, SurvRate, is not yet included, there is little change from one iteration to the next.

The -2 Log likelihood can be used to assess how well a model fits the data. It is based on summing the logs of the probabilities the model assigns to the observed outcomes. The lower the -2LL value, the better the fit. (A worked check follows the table.)

Iteration History a,b,c

                   -2 Log          Coefficients
Iteration          likelihood      Constant
Step 0    1          77.414          .909
          2          77.346          .980
          3          77.346          .981
          4          77.346          .981

a. Constant is included in the model.
b. Initial -2 Log Likelihood: 77.346
c. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
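As a check, the constant-only -2LL can be reproduced from the 48 improved / 18 not-improved split, since a constant-only model assigns every case the same fitted probability, 48/66:

    import math

    n_yes, n_no = 48, 18        # improved / not improved
    p = n_yes / (n_yes + n_no)  # constant-only fitted P(Y = 1) = .727
    loglik = n_yes * math.log(p) + n_no * math.log(1 - p)
    print(-2 * loglik)          # 77.346, matching the table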

 

Page 27: Logistic Regression

27

You can see from the Classification Table that the values were not classified by a second variable at this point. You can also see that there were 48 patients who improved and 18 who did not.

Classification Table a,b

                                       Predicted
                                       outcome           Percentage
 Observed                              0        1        Correct
Step 0   outcome              0        0       18            .0
                              1        0       48         100.0
         Overall Percentage                                72.7

a. Constant is included in the model.
b. The cut value is .500
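Because every case is classified as "improved" (the more common outcome), the overall percentage correct here is just the base rate; a one-line check:

    print(round(100 * 48 / 66, 1))   # 72.7, the constant-only hit rate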

 

Page 28: Logistic Regression

28

One way to test the overall model is the Hosmer-Lemeshow goodness-of-fit test, which is a chi-square test comparing the observed and expected frequencies of subjects falling in the two categories of the response variable. Large values of χ² (and the corresponding small p-values) indicate a lack of fit for the model. This table tells us that our model is a good fit, since the p-value is large:

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1         6.887      7   .441
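The reported significance is the upper-tail area of the chi-square distribution; a quick check in Python (sketch):

    from scipy.stats import chi2

    print(chi2.sf(6.887, 7))   # about .441, the Sig. value above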

Page 29: Logistic Regression

29

Now consider the next block, Block 1: Method = Enter.

The Iteration History table shows the progress as the model is reassessed; the value of the coefficient of SurvRate converges to .081. To assess whether this larger model provides a significantly better fit than the smaller model, consider the difference between the -2LL values. The value for the smaller model was 77.346; for the model with SurvRate included it drops to 37.323. A drop this large is far more than a chi-square distribution with 1 df would produce by chance, indicating that the larger model is a significantly better fit. (A quick computation follows the table.)

Iteration History a,b,c,d

                   -2 Log          Coefficients
Iteration          likelihood      Constant   survrate
Step 1    1          45.042         -1.547      .042
          2          38.630         -2.184      .063
          3          37.410         -2.552      .076
          4          37.324         -2.673      .081
          5          37.323         -2.684      .081
          6          37.323         -2.684      .081

a. Method: Enter
b. Constant is included in the model.
c. Initial -2 Log Likelihood: 77.346
d. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
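This comparison is a likelihood-ratio test: the drop in -2LL is referred to a chi-square distribution with 1 df (one per added predictor). A quick computation (sketch):

    from scipy.stats import chi2

    drop = 77.346 - 37.323   # constant-only -2LL minus full-model -2LL
    print(drop)              # 40.023
    print(chi2.sf(drop, 1))  # about 2.5e-10: a significantly better fit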

 

Page 30: Logistic Regression

30

This table now shows that SurvRate is a significant predictor (p is close to 0), and we can find the coefficients of the resulting regression equation, y = log(odds) = .081×SurvRate - 2.684

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.

Page 31: Logistic Regression

31

Why is the odds ratio Exp(B)? Suppose that we have the logistic regression equation

y = log(odds) = β1x + β0

Then β1 represents the change in y associated with a unit change in x.

That is, y will increase by β1 when x increases by 1. But y is log(odds). So log(odds) will increase by β1 when x increases by 1.

Page 32: Logistic Regression

32

Exp(B) is an indicator of the change in odds resulting from a unit change in the predictor. Let's see how this happens: Suppose we start with the regression equation y = β1x + β0

Now if x increases by 1, we have y = β1(x +1) + β0

How much has y changed?

New value - old value = [β1(x + 1) + β0] - [β1x + β0] = [β1x + β1 + β0] - [β1x + β0] = β1

So y has increased by β1. That is, β1 is the change in y associated with a unit change in x. But y = log(odds), so now we know that log(odds) will increase by β1 when x increases by 1.

Page 33: Logistic Regression

33

If log(odds) changes by β1, then the odds are multiplied by e^β1.

In other words, the change in odds associated with a unit change in x is a factor of e^β1, which can be denoted as Exp(β1) -- or by Exp(B) in SPSS. In our example, then, with each unit increase in SurvRate, y = log(odds) will increase by .081. That is, the odds of improving will increase by a factor of 1.085 for each unit increase in SurvRate. (A one-line check follows the table.)

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068
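A one-line confirmation that Exp(B) is just e raised to the coefficient (sketch):

    import math

    print(math.exp(0.081))   # 1.0844, matching the reported 1.085 up to rounding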

Page 34: Logistic Regression

34

Another example: The sales director for a chain of appliance stores wants to find out what circumstances encourage customers to purchase extended warranties after a major appliance purchase.  The response variable is an indicator of whether or not a warranty is purchased. The predictor variables are

- Customer gender
- Age of the customer
- Whether a gift is offered with the warranty
- Price of the appliance
- Race of the customer (this is coded with three indicator variables to represent White, African-American, and Hispanic)

Page 35: Logistic Regression

35

Here are the results with all predictors:

Variables in the Equation

                             B       S.E.     Wald   df   Sig.    Exp(B)
Step 1a   Gender           -3.772    2.568    2.158   1   .142      .023
          Gift              2.715    1.567    3.003   1   .083    15.112
          Age                .091     .056    2.638   1   .104     1.096
          Price              .001     .000    3.363   1   .067     1.001
          White             3.773   13.863     .074   1   .785    43.518
          AfricanAmerican   1.163   13.739     .007   1   .933     3.199
          Hispanic          6.347   14.070     .203   1   .652   570.898
          Constant        -12.018   14.921     .649   1   .421      .000

a. Variable(s) entered on step 1: Gender, Gift, Age, Price, White, AfricanAmerican, Hispanic.

The odds ratio is shown in the Exp(B) column. This is the change in odds for each unit change in the predictor. For example, the odds of a customer purchasing a warranty are 1.096 times greater for each additional year in the customer's age. Which of these variables might be removed from this model? Use α = .10


Page 37: Logistic Regression

37

If the analysis is rerun with only three predictors, these are the results:

Variables in the Equation

                      B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift       2.339   1.131   4.273   1   .039   10.368
          Age         .064    .032   4.132   1   .042    1.066
          Price       .000    .000   6.165   1   .013    1.000
          Constant  -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price.

In this model, all three predictors are significant. These results indicate that the odds that a customer who is offered a gift will purchase a warranty are more than ten times the corresponding odds for a customer having the same other characteristics but who is not offered a gift. Also, the odds ratio for Age is greater than 1. This tells us that older buyers are more likely to purchase a warranty.

Page 38: Logistic Regression

38

If the analysis is rerun with only three predictors, these are the results:

Variables in the Equation

                      B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift       2.339   1.131   4.273   1   .039   10.368
          Age         .064    .032   4.132   1   .042    1.066
          Price       .000    .000   6.165   1   .013    1.000
          Constant  -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price.

The resulting regression equation is

log(odds) = 2.339×Gift + .064×Age - 6.096
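To use the equation, plug in a customer's values. A sketch with a made-up customer (and ignoring the negligible Price term, as the slide's equation does):

    import math

    gift, age = 1, 30                                     # hypothetical customer
    log_odds = 2.339 * gift + 0.064 * age - 6.096
    print(math.exp(log_odds) / (1 + math.exp(log_odds)))  # about .137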

Page 39: Logistic Regression

39

If the analysis is rerun with only three predictors, these are the results:

Variables in the Equation

                      B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift       2.339   1.131   4.273   1   .039   10.368
          Age         .064    .032   4.132   1   .042    1.066
          Price       .000    .000   6.165   1   .013    1.000
          Constant  -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price.

Note: in this example, the coefficient of Price is too small to be expressed in three decimal places. This situation can be remedied by dividing the price by 100 and creating a new model.

Page 40: Logistic Regression

40

The resulting equation is now

log(odds) = 2.339×Gift + .064×Age + .040×Price100 - 6.096

Variables in the Equation

                       B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift        2.339   1.131   4.273   1   .039   10.368
          Age          .064    .032   4.132   1   .042    1.066
          Price100     .040    .016   6.165   1   .013    1.041
          Constant   -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price100.

 

Page 41: Logistic Regression

41

To produce the output for your analysis,

1. Choose Analyze > Regression > Binary Logistic.
2. Choose the response and predictor variables.
3. Click on Options and check CI for Exp(B) to create the confidence intervals. Click on Continue.
4. Click on Save and check the Probabilities box. Click on Continue and then on OK.

Page 42: Logistic Regression

42

To produce the graph of the results, create a simple scatterplot using Predicted Probability as the dependent variable, and the predictor as a covariate.