
Page 1: Logistic Regression

Logistic Regression

STAT E-150 Statistical Methods

Page 2: Logistic Regression

2

So far we have considered regression analyses where the response variables are quantitative. What if this is not the case?

If a response variable is categorical a different regression model applies, called logistic regression. 

Page 3: Logistic Regression

3

A categorical variable which has only two possible values is called a binary variable. We can represent the two outcomes as 1 for the presence of some condition ("success") and 0 for its absence ("failure"):

y = 1 if success, 0 if failure

The logistic regression model describes how the probability of "success" is related to the values of the explanatory variables, which can be categorical or quantitative.

Page 4: Logistic Regression

4

Logistic regression models work with odds rather than proportions. The odds are just the ratio of the proportions for the two possible outcomes: if π is the proportion for one outcome, then 1 - π is the proportion for the second outcome.

The odds of the first outcome occurring are

odds = π / (1 - π)

Page 5: Logistic Regression

5

Here's an example: Suppose that a coin is weighted so that heads are more likely than tails, with P(heads) = .6.

Then P(tails) = 1 - P(heads) = 1 - .6 = .4

The odds of getting heads in a toss of this coin are .6 / (1 - .6) = .6/.4 = 1.5

The odds of getting tails in a toss of this coin are .4 / (1 - .4) = .4/.6 = .667

The odds ratio is 1.5 / .667 = 2.25

This tells us that the odds of getting "heads" are 2.25 times the odds of getting "tails".

Page 6: Logistic Regression

6

You can also convert the odds of an event back to the probability of the event. For an event A,

P(A) = (odds of event A) / (1 + odds of event A)

For example, if the odds of a horse winning are 9 to 1, then the probability of the horse winning is 9 / (1 + 9) = .9
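To make these conversions concrete, here is a minimal Python sketch (the function names are just for illustration) that reproduces the coin and horse-racing numbers above:

    def odds(p):
        """Odds corresponding to probability p."""
        return p / (1 - p)

    def prob(o):
        """Probability corresponding to odds o."""
        return o / (1 + o)

    print(odds(0.6))              # heads: .6/.4 = 1.5
    print(odds(0.4))              # tails: .4/.6 = .667
    print(odds(0.6) / odds(0.4))  # odds ratio: 2.25
    print(prob(9))                # 9-to-1 odds -> probability .9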

Page 7: Logistic Regression

7

The Logistic Regression Model

The relationship between a categorical response variable and a single quantitative predictor variable is an S-shaped curve. Here is a plot of p vs. x for different logistic regression models:

The points on the curve represent P(Y=1) for each value of x. The associated model is the logistic or logit model, π = e^(β0 + β1x) / (1 + e^(β0 + β1x)).

Page 8: Logistic Regression

8

The general logistic regression model is

Y = e^(β0 + β1x1 + β2x2 + … + βkxk) / (1 + e^(β0 + β1x1 + β2x2 + … + βkxk))

where

Y = 1 if success, 0 if failure

and E(Y) = π, the probability of success.

The xi are independent quantitative or qualitative variables.
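As a quick numeric illustration of this probability form (a sketch; the coefficients and predictor values below are made up, not from any example in these slides):

    import math

    def success_prob(beta0, betas, xs):
        """P(Y = 1) under the general logistic model."""
        z = beta0 + sum(b * x for b, x in zip(betas, xs))
        return math.exp(z) / (1 + math.exp(z))

    # hypothetical coefficients beta0 = -2, beta1 = .5, beta2 = 1.2
    # evaluated at x1 = 3, x2 = 0
    print(success_prob(-2.0, [0.5, 1.2], [3.0, 0.0]))   # about 0.378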

Page 9: Logistic Regression

9

Odds and log(odds)

Let π = P(Y = 1) be a probability with 0 < π < 1

Then the odds that Y = 1 are the ratio

odds = π / (1 - π),  and so  log(odds) = log(π / (1 - π))

Page 10: Logistic Regression

10

This transformation from π to log(odds) is called the logistic or logit transformation.

The relationship is one-to-one:

For every value of π (except for 0 and 1) there is one and only one value of log(π / (1 - π)).

Page 11: Logistic Regression

11

The log(odds) can have any value from -∞ to ∞, and so we can use a linear predictor.

That is, we can model the log odds as a linear function of the explanatory variable:

y = log(π / (1 - π)) = β0 + β1x

for

π = e^(β0 + β1x) / (1 + e^(β0 + β1x))

(To verify this, solve for π and then take the log of both sides.)
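A two-line check that the logit and probability forms agree (a sketch; the numbers are arbitrary):

    import math

    b0, b1, x = -1.5, 0.8, 3.0                                # arbitrary values
    pi = math.exp(b0 + b1 * x) / (1 + math.exp(b0 + b1 * x))  # probability form
    print(math.log(pi / (1 - pi)), b0 + b1 * x)               # both print 0.9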

Page 12: Logistic Regression

12

For any fixed value of the predictor x, there are four probabilities:

If the model is exactly correct, then p = π, and the two fitted values estimate the same number.

                          True value                      Fitted value
  Actual probability p    true P(Yes) for this x          p̂ = #Yes/(#Yes + #No)
  Model probability π     true P(Yes) from the model      π̂ = fitted P(Yes) from the model

Page 13: Logistic Regression

13

To go from log(odds) to odds, use the exponential function e^x:

1. odds = e^(log(odds))

2. You can check that if odds = π / (1 - π), then you can solve for π to find that π = odds / (1 + odds).

3. Since log(odds) = β0 + β1x, we have the result

π = e^(log(odds)) / (1 + e^(log(odds))) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Page 14: Logistic Regression

14

The Logistic Regression Model

The logistic regression model for the probability of success of a binary response variable based on a single predictor x is:

Logit form:  log(π / (1 - π)) = β0 + β1x

Probability form:  π = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Page 15: Logistic Regression

15

Example:

A study was conducted to analyze behavioral variables and stress in people recently diagnosed with cancer. For our purposes we will look at patients who have been in the study for at least a year, and the dependent variable (Outcome) is coded 1 to indicate that the patient is improved or in complete remission, and 0 if the patient has not improved or has died. The predictor variable is the survival rating assigned by the patient's physician at the time of diagnosis. This is a number between 0 and 100 and represents the estimated probability of survival at five years.

Out of 66 cases there are 48 patients who have improved and 18 who have not.

Page 16: Logistic Regression

16

The scatterplot shows us that a linear regression analysis is not appropriate for this data. This scatterplot clearly has no linear trend, but it does show that the proportion of people who improve is much higher when the survival rate is high, as would be expected.

However, if we transform from whether the patient has improved to the odds of improvement, and then consider the log of the odds, we will have a variable that is a linear function of the survival rate, and we will be able to use linear regression.

Page 17: Logistic Regression

17

Let p = the probability of improvement. Then 1 - p is the probability of no improvement.

We will look for an equation of the form

y = log(odds) = log(p / (1 - p)) = β0 + β1×SurvRate

Here β1 will be the amount of increase in the log odds for a one-unit increase in SurvRate.

Page 18: Logistic Regression

18

Here are the results of this analysis:

We can see that the logistic regression equation is

log(odds) = .081×SurvRate - 2.684

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.
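For readers working outside SPSS, a rough Python equivalent of this fit (a sketch; the file and column names are hypothetical, since the study's data set is not included with these slides) would be:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("cancer_study.csv")    # hypothetical: survrate, outcome (0/1)

    X = sm.add_constant(df["survrate"])     # adds the intercept column
    fit = sm.Logit(df["outcome"], X).fit()  # maximum-likelihood estimation
    print(fit.summary())                    # coefficients, standard errors, p-values
    print(np.exp(fit.params))               # Exp(B): the odds ratios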

 

Page 19: Logistic Regression

19

Assessing the Model

In linear regression, we used the p-values associated with the test statistic t to assess the contribution of each predictor. In logistic regression, we can use the Wald statistic in the same way.

Note that in this example, the Wald statistic for the predictor is 17.755, which is significant at the .05 level. This is evidence that SurvRate is a useful predictor in this model.

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.

 

Page 20: Logistic Regression

20

H0: β1 = 0
Ha: β1 ≠ 0

Since p is close to zero, the null hypothesis is rejected. This indicates that the predicted survival rate is a useful indicator of the patient's outcome.

The resulting regression equation is

log(odds) = .081×SurvRate - 2.684

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.
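The fitted equation converts directly into predicted probabilities of improvement; a small sketch (the SurvRate values chosen are arbitrary):

    import math

    def p_improve(survrate):
        """Fitted P(improved) from log(odds) = .081*SurvRate - 2.684."""
        z = 0.081 * survrate - 2.684
        return math.exp(z) / (1 + math.exp(z))

    for s in (10, 33, 50, 90):            # arbitrary survival ratings
        print(s, round(p_improve(s), 3))  # .133, .497, .797, .990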

 

Page 21: Logistic Regression

21

Here are scatterplots of the data and of the values predicted by the model:

Page 22: Logistic Regression

22

Note how well the results fit the data:

The suggested curve is quite close to the points in the lower left, rises rapidly across the points in the center, where the values of SurvRate have a roughly equal number of patients who improve and who don't, and finally comes close to the cluster of points in the upper right. The fitted values all fall between 0 and 1.

Page 23: Logistic Regression

23

SPSS takes an iterative approach to this solution: it will begin with some starting values for β0 and β1, see how well the estimated log odds fit the data, adjust the coefficients, and then reexamine the fit. This continues until no further adjustment produces a better fit. (A bare-bones sketch of the idea follows.)
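SPSS's exact routine is not shown in the slides; the following Newton-Raphson sketch implements the same iterate-until-stable idea, assuming numpy arrays X (predictor columns) and y (0/1 outcomes):

    import numpy as np

    def fit_logistic(X, y, tol=1e-3, max_iter=25):
        """Refit repeatedly until the coefficient estimates stabilize."""
        X = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
        beta = np.zeros(X.shape[1])                # starting values
        for _ in range(max_iter):
            p = 1 / (1 + np.exp(-X @ beta))        # current fitted P(Y = 1)
            w = p * (1 - p)                        # logistic weights
            # Newton step: solve (X'WX) step = X'(y - p)
            step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
            beta += step
            if np.max(np.abs(step)) < tol:         # stop when changes < .001,
                break                              # as in the SPSS footnotes
        return beta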

What do all of the SPSS results tell us?

Page 24: Logistic Regression

24

Starting with Block 0: Beginning Block

The Case Processing Summary tells us that all 66 cases were included:

Case Processing Summary

Unweighted Cases a                         N     Percent
Selected Cases    Included in Analysis    66       100.0
                  Missing Cases            0          .0
                  Total                   66       100.0
Unselected Cases                           0          .0
Total                                     66       100.0

a. If weight is in effect, see classification table for the total number of cases.

 

Page 25: Logistic Regression

25

The Variables in the Equation table shows that in this first iteration only the constant was used. The second table lists the variables that were not included in this model; it indicates that if the second variable were to be included, it would be a significant predictor:

Variables in the Equation

                     B     S.E.     Wald   df   Sig.   Exp(B)
Step 0   Constant   .981   .276   12.594    1   .000    2.667

Variables not in the Equation

                                Score   df   Sig.
Step 0   Variables   survrate  34.538    1   .000
         Overall Statistics    34.538    1   .000

Page 26: Logistic Regression

26

The Iteration History shows the results when only the constant is in the model. Since the second variable, SurvRate, is not yet included, there is little change from one iteration to the next.

The -2 Log likelihood can be used to assess how well a model fits the data. It is based on summing the logs of the probabilities the model assigns to the observed outcomes. The lower the -2LL value, the better the fit. (A worked check follows the table.)

Iteration History a,b,c

                   -2 Log          Coefficients
Iteration          likelihood      Constant
Step 0    1          77.414          .909
          2          77.346          .980
          3          77.346          .981
          4          77.346          .981

a. Constant is included in the model.
b. Initial -2 Log Likelihood: 77.346
c. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
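As a check, the constant-only -2LL can be reproduced from the 48 improved / 18 not-improved split, since a constant-only model assigns every case the same fitted probability, 48/66:

    import math

    n_yes, n_no = 48, 18        # improved / not improved
    p = n_yes / (n_yes + n_no)  # constant-only fitted P(Y = 1) = .727
    loglik = n_yes * math.log(p) + n_no * math.log(1 - p)
    print(-2 * loglik)          # 77.346, matching the table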

 

Page 27: Logistic Regression

27

You can see from the Classification Table that the values were not classified by a second variable at this point. You can also see that there were 48 patients who improved and 18 who did not.

Classification Table a,b

                                       Predicted
                                       outcome           Percentage
 Observed                              0        1        Correct
Step 0   outcome              0        0       18            .0
                              1        0       48         100.0
         Overall Percentage                                72.7

a. Constant is included in the model.
b. The cut value is .500
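Because every case is classified as "improved" (the more common outcome), the overall percentage correct here is just the base rate; a one-line check:

    print(round(100 * 48 / 66, 1))   # 72.7, the constant-only hit rate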

 

Page 28: Logistic Regression

28

One way to test the overall model is the Hosmer-Lemeshow goodness-of-fit test, which is a chi-square test comparing the observed and expected frequencies of subjects falling in the two categories of the response variable. Large values of χ² (and the corresponding small p-values) indicate a lack of fit for the model. This table tells us that our model is a good fit, since the p-value is large:

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1         6.887      7   .441
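The reported significance is the upper-tail area of the chi-square distribution; a quick check in Python (sketch):

    from scipy.stats import chi2

    print(chi2.sf(6.887, 7))   # about .441, the Sig. value above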

Page 29: Logistic Regression

29

Now consider the next block, Block 1: Method = Enter.

The Iteration History table shows the progress as the model is reassessed; the value of the coefficient of SurvRate converges to .081. To assess whether this larger model provides a significantly better fit than the smaller model, consider the difference between the -2LL values. The value for the smaller model was 77.346; for the model with SurvRate included it drops to 37.323. A drop this large is far more than a chi-square distribution with 1 df would produce by chance, indicating that the larger model is a significantly better fit. (A quick computation follows the table.)

Iteration History a,b,c,d

                   -2 Log          Coefficients
Iteration          likelihood      Constant   survrate
Step 1    1          45.042         -1.547      .042
          2          38.630         -2.184      .063
          3          37.410         -2.552      .076
          4          37.324         -2.673      .081
          5          37.323         -2.684      .081
          6          37.323         -2.684      .081

a. Method: Enter
b. Constant is included in the model.
c. Initial -2 Log Likelihood: 77.346
d. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
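This comparison is a likelihood-ratio test: the drop in -2LL is referred to a chi-square distribution with 1 df (one per added predictor). A quick computation (sketch):

    from scipy.stats import chi2

    drop = 77.346 - 37.323   # constant-only -2LL minus full-model -2LL
    print(drop)              # 40.023
    print(chi2.sf(drop, 1))  # about 2.5e-10: a significantly better fit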

 

Page 30: Logistic Regression

30

This table now shows that SurvRate is a significant predictor (p is close to 0), and we can find the coefficients of the resulting regression equation, y = log(odds) = .081×SurvRate - 2.684

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068

a. Variable(s) entered on step 1: survrate.

Page 31: Logistic Regression

31

Why is the odds ratio Exp(B)? Suppose that we have the logistic regression equation

y = log(odds) = β1x + β0

Then β1 represents the change in y associated with a unit change in x.

That is, y will increase by β1 when x increases by 1. But y is log(odds). So log(odds) will increase by β1 when x increases by 1.

Page 32: Logistic Regression

32

Exp(B) is an indicator of the change in odds resulting from a unit change in the predictor. Let's see how this happens: Suppose we start with the regression equation y = β1x + β0

Now if x increases by 1, we have y = β1(x +1) + β0

How much has y changed?

New value - old value = [β1(x + 1) + β0] - [β1x + β0] = [β1x + β1 + β0] - [β1x + β0] = β1

So y has increased by β1. That is, β1 is the change in y associated with a unit change in x. But y = log(odds), so now we know that log(odds) will increase by β1 when x increases by 1.

Page 33: Logistic Regression

33

If log(odds) changes by β1, then the odds are multiplied by e^β1.

In other words, the change in odds associated with a unit change in x is a factor of e^β1, which can be denoted as Exp(β1) -- or by Exp(B) in SPSS. In our example, then, with each unit increase in SurvRate, y = log(odds) will increase by .081. That is, the odds of improving will increase by a factor of 1.085 for each unit increase in SurvRate. (A one-line check follows the table.)

Variables in the Equation

                      B      S.E.     Wald   df   Sig.   Exp(B)   95% C.I. for Exp(B)
                                                                   Lower     Upper
Step 1a   survrate   .081    .019   17.755    1   .000    1.085    1.044     1.126
          Constant  -2.684   .811   10.941    1   .001     .068
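A one-line confirmation that Exp(B) is just e raised to the coefficient (sketch):

    import math

    print(math.exp(0.081))   # 1.0844, matching the reported 1.085 up to rounding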

Page 34: Logistic Regression

34

Another example: The sales director for a chain of appliance stores wants to find out what circumstances encourage customers to purchase extended warranties after a major appliance purchase.  The response variable is an indicator of whether or not a warranty is purchased. The predictor variables are

- Customer gender
- Age of the customer
- Whether a gift is offered with the warranty
- Price of the appliance
- Race of the customer (this is coded with three indicator variables to represent White, African-American, and Hispanic)

Page 35: Logistic Regression

35

Here are the results with all predictors:

Variables in the Equation

                             B       S.E.     Wald   df   Sig.    Exp(B)
Step 1a   Gender           -3.772    2.568    2.158   1   .142      .023
          Gift              2.715    1.567    3.003   1   .083    15.112
          Age                .091     .056    2.638   1   .104     1.096
          Price              .001     .000    3.363   1   .067     1.001
          White             3.773   13.863     .074   1   .785    43.518
          AfricanAmerican   1.163   13.739     .007   1   .933     3.199
          Hispanic          6.347   14.070     .203   1   .652   570.898
          Constant        -12.018   14.921     .649   1   .421      .000

a. Variable(s) entered on step 1: Gender, Gift, Age, Price, White, AfricanAmerican, Hispanic.

The odds ratio is shown in the Exp(B) column. This is the change in odds for each unit change in the predictor. For example, the odds of a customer purchasing a warranty are 1.096 times greater for each additional year in the customer's age. Which of these variables might be removed from this model? Use α = .10


Page 37: Logistic Regression

37

If the analysis is rerun with only three predictors, these are the results:

Variables in the Equation

                      B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift       2.339   1.131   4.273   1   .039   10.368
          Age         .064    .032   4.132   1   .042    1.066
          Price       .000    .000   6.165   1   .013    1.000
          Constant  -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price.

In this model, all three predictors are significant. These results indicate that the odds that a customer who is offered a gift will purchase a warranty are more than ten times the corresponding odds for a customer having the same other characteristics but who is not offered a gift. Also, the odds ratio for Age is greater than 1. This tells us that older buyers are more likely to purchase a warranty.

Page 38: Logistic Regression

38

If the analysis is rerun with only three predictors, these are the results:

Variables in the Equation

                      B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift       2.339   1.131   4.273   1   .039   10.368
          Age         .064    .032   4.132   1   .042    1.066
          Price       .000    .000   6.165   1   .013    1.000
          Constant  -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price.

The resulting regression equation is

log(odds) = 2.339×Gift + .064×Age - 6.096
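To use the equation, plug in a customer's values. A sketch with a made-up customer (and ignoring the negligible Price term, as the slide's equation does):

    import math

    gift, age = 1, 30                                     # hypothetical customer
    log_odds = 2.339 * gift + 0.064 * age - 6.096
    print(math.exp(log_odds) / (1 + math.exp(log_odds)))  # about .137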

Page 39: Logistic Regression

39

If the analysis is rerun with only three predictors, these are the results:

Variables in the Equation

                      B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift       2.339   1.131   4.273   1   .039   10.368
          Age         .064    .032   4.132   1   .042    1.066
          Price       .000    .000   6.165   1   .013    1.000
          Constant  -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price.

Note: in this example, the coefficient of Price is too small to be expressed in three decimal places. This situation can be remedied by dividing the price by 100 and creating a new model.

Page 40: Logistic Regression

40

The resulting equation is now

log(odds) = 2.339×Gift + .064×Age + .040×Price100 - 6.096

Variables in the Equation

                       B      S.E.    Wald   df   Sig.   Exp(B)
Step 1a   Gift        2.339   1.131   4.273   1   .039   10.368
          Age          .064    .032   4.132   1   .042    1.066
          Price100     .040    .016   6.165   1   .013    1.041
          Constant   -6.096   2.142   8.096   1   .004     .002

a. Variable(s) entered on step 1: Gift, Age, Price100.

 

Page 41: Logistic Regression

41

To produce the output for your analysis,

1. Choose Analyze > Regression > Binary Logistic.
2. Choose the response and predictor variables.
3. Click on Options and check CI for Exp(B) to create the confidence intervals. Click on Continue.
4. Click on Save and check the Probabilities box. Click on Continue and then on OK.

Page 42: Logistic Regression

42

To produce the graph of the results, create a simple scatterplot using Predicted Probability as the dependent variable, and the predictor as a covariate.