1
What you've always wanted to know about logistic regression analysis, but were afraid to ask...
February 1, 2010
Gerrit Rooks, Sociology of Innovation
Innovation Sciences & Industrial Engineering
Phone: 5509
email: [email protected]
This Lecture
• Why logistic regression analysis?
• The logistic regression model
• Estimation
• Goodness of fit
• An example
2
3
What's the difference between 'normal' regression and logistic regression?
Regression analysis: relate one or more independent (predictor) variables to a dependent (outcome) variable.
4
What's the difference between 'normal' regression and logistic regression?
• Often you will be confronted with outcome variables that are dichotomous:
– success vs failure
– employed vs unemployed
– promoted or not
– sick or healthy
– pass or fail an exam
5
Example: relationship between hours studied for an exam and success

Hours | # Failed exam | # Passed exam | Total # students | Prob. pass exam
28    | 4             | 2             | 6                | .33
29    | 3             | 2             | 5                | .40
30    | 2             | 7             | 9                | .78
31    | 2             | 7             | 9                | .78
32    | 4             | 16            | 20               | .80
33    | 1             | 14            | 15               | .93
6
Linear regression analysis: why is this wrong?
A straight line is unbounded: it can predict "probabilities" below 0 or above 1.
7
Logistic regression: the better alternative
8
9
The logistic regression equation: predicting probabilities

$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

P(Y) is the predicted probability (always between 0 and 1); the linear part $b_0 + b_1 X_1$ is similar to regression analysis.
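To make this concrete, here is a minimal Python sketch of the equation (the coefficient and predictor values are arbitrary illustrations, not estimates from these slides):

import math

def predicted_probability(x, b0, b1):
    # P(Y) = 1 / (1 + e^-(b0 + b1*x)): always strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(predicted_probability(30, b0=0.0, b1=0.05))  # ~0.82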
10
The logistic function: sometimes authors rearrange the model

$$P(Y) = \frac{e^{(b_0 + b_1 X_1)}}{1 + e^{(b_0 + b_1 X_1)}} = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$

or also

$$\ln\!\left(\frac{p(y=1)}{1 - p(y=1)}\right) = c_0 + c_1 x_1 + c_2 x_2 + \ldots + c_n x_n$$
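A one-line check, not on the slide itself, that the two forms agree. Writing $z = b_0 + b_1 X_1$:

$$\frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}, \qquad 1 - P(Y) = \frac{1}{1+e^{z}}, \qquad \frac{P(Y)}{1-P(Y)} = e^{z} \;\Rightarrow\; \ln\!\left(\frac{P(Y)}{1-P(Y)}\right) = z$$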
11
How do we estimate coefficients? Maximum-likelihood estimation
• Parameters are estimated by 'fitting' models, based on the available predictors, to the observed data
• The chosen model is the one that fits the data best, i.e. is closest to the data
• Fit is determined by the so-called log-likelihood statistic
12
Maximum likelihood estimation: the log-likelihood statistic

$$LL = \sum_{i=1}^{N} \left[ Y_i \ln(P(Y_i)) + (1 - Y_i)\ln(1 - P(Y_i)) \right]$$

Large (strongly negative) values of LL indicate a poor fit of the model.
However, this statistic cannot be used to evaluate the fit of a single model.
13
An example to illustrate maximum likelihood and the log-likelihood statistic. Suppose we know hours spent studying and the outcome of an exam:

Study hours | Outcome
3           | 0
34          | 1
17          | 0
6           | 0
12          | 0
15          | 1
26          | 1
29          | 1
14
In maximum likelihood estimation, different values for the parameters are 'tried'. Let's look at two possibilities: (1) b0 = 0 and b1 = 0.05; (2) b0 = -6.44 and b1 = 0.39.

$$P(Y) = \frac{1}{1 + e^{-(0 + 0.05 X_1)}} \qquad\qquad P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 X_1)}}$$

Study hours | Outcome | Predicted probability (b0 = 0; b1 = 0.05) | Predicted probability (b0 = -6.44; b1 = 0.39)
3           | 0       | .53                                       | .01
34          | 1       | .85                                       | .99
17          | 0       | .71                                       | .53
6           | 0       | .57                                       | .02
12          | 0       | .65                                       | .14
15          | 1       | .68                                       | .34
26          | 1       | .79                                       | .97
29          | 1       | .81                                       | .99
15
We are now able to calculate the log-likelihood statistic:

$$LL = \sum_{i=1}^{N} \left[ Y_i \ln(P(Y_i)) + (1 - Y_i)\ln(1 - P(Y_i)) \right]$$

Study hours | Outcome | Predicted probability (b0 = 0; b1 = 0.05) | LL (b0 = 0; b1 = 0.05)
3           | 0       | .53                                       | -.75
34          | 1       | .85                                       | -.16
17          | 0       | .71                                       | -1.24
6           | 0       | .57                                       | -.84
12          | 0       | .65                                       | -1.05
15          | 1       | .68                                       | -.39
26          | 1       | .79                                       | -.24
29          | 1       | .81                                       | -.21
16
Two models and their log-likelihood statistic:

Outcome | Pr (b0 = 0; b1 = 0.05) | LL (b0 = 0; b1 = 0.05) | Pr (b0 = -6.44; b1 = 0.39) | LL (b0 = -6.44; b1 = 0.39)
0       | .53                    | -.75                   | .01                        | -.01
1       | .85                    | -.16                   | .99                        | -.01
0       | .71                    | -1.24                  | .53                        | -.75
0       | .57                    | -.84                   | .02                        | -.02
0       | .65                    | -1.05                  | .14                        | -.15
1       | .68                    | -.39                   | .34                        | -1.08
1       | .79                    | -.24                   | .97                        | -.03
1       | .81                    | -.21                   | .99                        | -.01
∑       |                        | -4.88                  |                            | -2.07

Based on a clever algorithm the model with the best fit (LL closest to 0) is chosen.
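The whole worked example fits in a few lines of Python; this sketch reproduces both LL values (small differences from the slide arise because the slide sums rounded probabilities):

import math

hours    = [3, 34, 17, 6, 12, 15, 26, 29]
outcomes = [0, 1, 0, 0, 0, 1, 1, 1]

def log_likelihood(b0, b1):
    ll = 0.0
    for x, y in zip(hours, outcomes):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))         # predicted probability
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)  # this student's contribution
    return ll

print(log_likelihood(0, 0.05))      # ~ -4.87 (slide: -4.88)
print(log_likelihood(-6.44, 0.39))  # ~ -2.04 (slide: -2.07)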
17
After estimation: how do I determine significance?
• Obviously SPSS does all the work for you
• How to interpret the output of SPSS
• Two major issues:
1. Overall model fit
– Between-model comparisons
– Pseudo R-square
– Predictive accuracy / classification test
2. Coefficients
– Wald test
– Likelihood ratio test
– Odds ratios

$$P(Y) = \frac{1}{1 + e^{-(-6.44 + 0.39 \cdot \mathrm{studyhours})}}$$
18
Model fit: between-model comparison

$$\chi^2 = 2\left[ LL(\text{New}) - LL(\text{baseline}) \right]$$

The log-likelihood ratio test statistic can be used to test the fit of a model. The test statistic has a chi-square distribution. LL(New) is the model fit of the full model; LL(baseline) is the model fit of the reduced model.
19
Model fit

$$\chi^2 = 2\left[ LL(\text{New}) - LL(\text{baseline}) \right]$$

The log-likelihood ratio test statistic can be used to test the fit of a model.

Full model: $P(Y) = \dfrac{1}{1 + e^{-(b_0 + b_1 X_1)}}$
Reduced model: $P(Y) = \dfrac{1}{1 + e^{-(b_0)}}$
Y
Between-model comparison
• Estimate a null model: the baseline model
• Estimate an improved model: this model contains more variables
• Assess the difference in -2LL between the models
• This difference follows a chi-square distribution
• Degrees of freedom = # estimated parameters in proposed model - # estimated parameters in null model (a worked sketch follows the equations below)
20
$$\chi^2 = 2\left[ LL(\text{New}) - LL(\text{baseline}) \right]$$

Full model: $P(Y) = \dfrac{1}{1 + e^{-(b_0 + b_1 X_1 + b_2 X_2)}}$
Reduced model: $P(Y) = \dfrac{1}{1 + e^{-(b_0 + b_1 X_1)}}$
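As a sketch of how this test works in practice (the log-likelihood values below are illustrative; they anticipate the penalty-kick example later in this lecture):

from scipy.stats import chi2

def likelihood_ratio_test(ll_baseline, ll_new, df):
    # chi-square = 2 * [LL(new) - LL(baseline)]; df = difference in number of parameters
    chi_square = 2.0 * (ll_new - ll_baseline)
    return chi_square, chi2.sf(chi_square, df)

# Constant-only baseline vs. model with two predictors (df = 2):
print(likelihood_ratio_test(ll_baseline=-51.82, ll_new=-24.33, df=2))  # chi-square ~ 55.0, p < .001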
21
Overall model fit: R and R²

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} \qquad \leftarrow \frac{\text{SS due to regression}}{\text{Total SS}}$$

R² in multiple regression is a measure of the variance explained by the model.
22
Overall model fit: pseudo R²
Just like in multiple regression, logit R² ranges from 0.0 to 1.0:
– Cox and Snell: cannot theoretically reach 1
– Nagelkerke: adjusted so that it can reach 1

$$R^2_{\text{LOGIT}} = \frac{-2LL(\text{Model})}{-2LL(\text{Original})}$$

LL(Original) is the log-likelihood of the model before any predictors were entered; LL(Model) is the log-likelihood of the model that you want to test.

NOTE: R² in logistic regression tends to be (even) smaller than in multiple regression.
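A minimal Python sketch of the two named pseudo-R² measures (these are their standard definitions, not formulas shown on the slide): Cox & Snell compares the null and fitted log-likelihoods, and Nagelkerke rescales the result by its maximum attainable value:

import math

def cox_snell_r2(ll_null, ll_model, n):
    # R^2_CS = 1 - exp((2/n) * (LL_null - LL_model))
    return 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))

def nagelkerke_r2(ll_null, ll_model, n):
    # Cox & Snell divided by its maximum, 1 - exp((2/n) * LL_null)
    return cox_snell_r2(ll_null, ll_model, n) / (1.0 - math.exp((2.0 / n) * ll_null))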
23
What is a small or large R and R²? Strength of correlation:
Small 0.10 to 0.29
Medium 0.30 to 0.49
Large 0.50 to 1.00
24
Overall model fit: classification table
How well does the model predict outcomes? (SPSS output)

Classification Table (a) — Step 1
Observed: Result of Penalty Kick | Predicted: Missed Penalty | Predicted: Scored Penalty | Percentage Correct
Missed Penalty                   | 30                        | 5                         | 85.7
Scored Penalty                   | 7                         | 33                        | 82.5
Overall Percentage               |                           |                           | 84.0
a. The cut value is .500

This means that if our model predicts that a player will score with a probability above .5 (e.g., .51), the prediction is a score; a predicted probability below .50 counts as a miss.
25
Testing significance of coefficients. The Wald statistic: not really good
• In linear regression analysis the t-statistic is used to test significance
• In logistic regression something similar exists
• However, when b is large, the standard error tends to become inflated, hence the Wald statistic is underestimated (Type II errors are more likely)

$$\text{Wald} = \frac{b}{SE_b} \qquad \text{(the estimate divided by the standard error of the estimate; compare the t-statistic)}$$
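For the 'previous' coefficient reported later in this lecture (b = .065, SE = .022), the calculation looks like this; note that SPSS reports the squared ratio, which follows a chi-square distribution with 1 df:

b, se = 0.065, 0.022
z = b / se       # ~ 2.95
wald = z ** 2    # ~ 8.73; SPSS shows 8.609 because it uses the unrounded b and SE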
26
Likelihood ratio test: an alternative way to test the significance of a coefficient

$$\chi^2 = 2\left[ LL(\text{With}) - LL(\text{Without}) \right]$$

To avoid Type II errors, for some variables you had best use the likelihood ratio test. LL(With) is the model with the variable; LL(Without) is the model without the variable.

Model with variable: $P(Y) = \dfrac{1}{1 + e^{-(b_0 + b_1 X_1)}}$
Model without variable: $P(Y) = \dfrac{1}{1 + e^{-(b_0)}}$
27
Before we go to the example: a recap
• Logistic regression
– dichotomous outcome
– logistic function
– log-likelihood / maximum likelihood
• Model fit
– likelihood ratio test (compare LL of models)
– Pseudo R-square
– Classification table
– Wald test
28
Illustration with SPSS
• Penalty kicks data, variables:
– Scored: outcome variable, 0 = penalty missed, and 1 = penalty scored
– Pswq: degree to which a player worries
– Previous: percentage of penalties scored by a particular player in their career
29
SPSS OUTPUT: Logistic Regression

Case Processing Summary
Unweighted Cases (a)                    | N  | Percent
Selected Cases: Included in Analysis    | 75 | 100.0
Selected Cases: Missing Cases           | 0  | .0
Selected Cases: Total                   | 75 | 100.0
Unselected Cases                        | 0  | .0
Total                                   | 75 | 100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value  | Internal Value
Missed Penalty  | 0
Scored Penalty  | 1

This tells you something about the number of observations and missing cases.
30
Block 0: Beginning Block. These tables are based on the empty model, i.e. only the constant in the model:

$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Classification Table (a, b) — Step 0
Observed: Result of Penalty Kick | Predicted: Missed Penalty | Predicted: Scored Penalty | Percentage Correct
Missed Penalty                   | 0                         | 35                        | .0
Scored Penalty                   | 0                         | 40                        | 100.0
Overall Percentage               |                           |                           | 53.3
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
Step 0   | B    | S.E. | Wald | df | Sig. | Exp(B)
Constant | .134 | .231 | .333 | 1  | .564 | 1.143

Variables not in the Equation
Step 0               | Score  | df | Sig.
Variables: previous  | 34.109 | 1  | .000
Variables: pswq      | 34.193 | 1  | .000
Overall Statistics   | 41.558 | 2  | .000

These variables (previous and pswq) will be entered in the model later on.
31
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
Step 1 | Chi-square | df | Sig.
Step   | 54.977     | 2  | .000
Block  | 54.977     | 2  | .000
Model  | 54.977     | 2  | .000

$$\chi^2 = 2\left[ LL(\text{New}) - LL(\text{baseline}) \right]$$

This chi-square is the test statistic for the new model. 'Block' is useful to check the significance of individual coefficients; see Field.

Model Summary
Step 1 | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square
       | 48.662 (a)        | .520                 | .694
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.

The -2 log likelihood is the log-likelihood multiplied by -2. Note: Nagelkerke is larger than Cox & Snell.
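As a sanity check (a sketch, not SPSS's internal computation), both pseudo-R² values in the Model Summary can be reproduced from the -2LL values above:

import math

n = 75
neg2ll_model = 48.662
chi_square   = 54.977                     # from the omnibus test
neg2ll_null  = neg2ll_model + chi_square  # ~ 103.64

r2_cs  = 1 - math.exp(-chi_square / n)    # ~ .520 (Cox & Snell)
r2_max = 1 - math.exp(-neg2ll_null / n)   # maximum attainable value
print(r2_cs, r2_cs / r2_max)              # ~ .520, ~ .694 (Nagelkerke)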
32
Block 1: Method = Enter (continued)

Variables in the Equation (a)
Step 1   | B     | S.E.  | Wald  | df | Sig. | Exp(B)
previous | .065  | .022  | 8.609 | 1  | .003 | 1.067
pswq     | -.230 | .080  | 8.309 | 1  | .004 | .794
Constant | 1.280 | 1.670 | .588  | 1  | .443 | 3.598
a. Variable(s) entered on step 1: previous, pswq.

B holds the estimates and S.E. the standard errors of the estimates; significance is based on the Wald statistic; Exp(B) gives the change in odds.

Classification Table (a) — Step 1
Observed: Result of Penalty Kick | Predicted: Missed Penalty | Predicted: Scored Penalty | Percentage Correct
Missed Penalty                   | 30                        | 5                         | 85.7
Scored Penalty                   | 7                         | 33                        | 82.5
Overall Percentage               |                           |                           | 84.0
a. The cut value is .500

Predictive accuracy has improved (was 53%).
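The Exp(B) column is simply e raised to the coefficient; a quick check against the table above:

import math

print(math.exp(0.065))   # ~ 1.067: each extra percentage point scored in the past multiplies the odds of scoring by ~1.07
print(math.exp(-0.230))  # ~ 0.794: each extra point of worry (pswq) multiplies the odds of scoring by ~0.79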
33
How is the classification table constructed? Predicted probabilities follow from the estimated equation:

$$\hat{P}(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \mathrm{previous} - 0.230 \cdot \mathrm{pswq})}}$$
34
How is the classification table constructed?

$$\hat{P}(Y) = \frac{1}{1 + e^{-(1.28 + 0.065 \cdot \mathrm{previous} - 0.230 \cdot \mathrm{pswq})}}$$

pswq | previous | scored | Predicted prob.
18   | 56       | 1      | .68
17   | 35       | 1      | .41
20   | 45       | 0      | .40
10   | 42       | 0      | .85
35
How is the classification table constructed?

pswq | previous | scored | Predicted prob. | Predicted
18   | 56       | 1      | .68             | 1
17   | 35       | 1      | .41             | 0   <- oops, wrong prediction
20   | 45       | 0      | .40             | 0
10   | 42       | 0      | .85             | 1   <- oops, wrong prediction

Classification Table (a) — Step 1
Observed: Result of Penalty Kick | Predicted: Missed Penalty | Predicted: Scored Penalty | Percentage Correct
Missed Penalty                   | 30                        | 5                         | 85.7
Scored Penalty                   | 7                         | 33                        | 82.5
Overall Percentage               |                           |                           | 84.0
a. The cut value is .500
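Putting the whole construction together, a short Python sketch that reproduces the four example rows, classifying at the .5 cut value (the coefficients are those from the SPSS output above):

import math

def predict(previous, pswq, cut=0.5):
    # predicted probability from the estimated equation, then classify at the cut value
    p = 1.0 / (1.0 + math.exp(-(1.28 + 0.065 * previous - 0.230 * pswq)))
    return p, int(p > cut)

cases = [(56, 18, 1), (35, 17, 1), (45, 20, 0), (42, 10, 0)]  # (previous, pswq, observed)
for previous, pswq, observed in cases:
    p, predicted = predict(previous, pswq)
    note = "" if predicted == observed else "  <- wrong prediction"
    print(f"p={p:.2f} predicted={predicted} observed={observed}{note}")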