Logistic Regression and Odds Ratios
Psych 818 - DeShon
Dichotomous Response
Used when the outcome or DV is a dichotomous random variable
Can only take one of two possible values (1, 0):
Pass/Fail
Disease/No Disease
Agree/Disagree
True/False
Present/Absent
This data structure causes problems for OLS regression
Dichotomous Response
Properties of dichotomous response variables (Y):
POSITIVE RESPONSE (Success = 1): p
NEGATIVE RESPONSE (Failure = 0): q = (1 - p)
Mean of Y = p, the observed proportion of successes
Var(Y) = p*q
Oops! Variance depends on the mean
Dichotomous Response
Let's generate some (0,1) data
Y <- rbinom(n = 1000, size = 1, prob = .3)
mean(Y)   # 0.295 in this sample; population mean μ = .3
var(Y)    # 0.208 in this sample; population variance σ² = .3 * .7 = .21
hist(Y)
[Figure: Histogram of Y; x-axis Y from 0.0 to 1.0, y-axis from 0 to 700]
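The dependence of the variance on the mean can be checked numerically. The following is a minimal sketch in R, assuming a hypothetical grid of success probabilities (`p_grid`) and an arbitrary seed, neither of which comes from the slides:

```r
# Variance of a (0,1) variable is p * (1 - p), so it changes with the mean p.
p_grid <- c(0.1, 0.3, 0.5, 0.7, 0.9)   # hypothetical success probabilities
theory_var <- p_grid * (1 - p_grid)

set.seed(818)                           # arbitrary seed, for reproducibility only
sim_var <- sapply(p_grid, function(p) var(rbinom(n = 10000, size = 1, prob = p)))

# Compare theoretical and simulated variances side by side
print(data.frame(p = p_grid, theory_var = theory_var, sim_var = round(sim_var, 3)))
```

The variance peaks at p = .5 and shrinks toward 0 as p approaches 0 or 1, which is why the constant-variance assumption of OLS cannot hold.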
Describing Dichotomous Data
Proportion of successes (p)
Odds
Odds of an event is the probability it occurs divided by the probability it does not occur
p/(1-p)
if p = .53, odds = .53/.47 = 1.13
Modeling Y (Categorical X)
Odds Ratio
Used to compare two proportions across groups
odds for males = .53/(1-.53) = 1.13
odds for females = .62/(1-.62) = 1.63
Odds-ratio = 1.63/1.13 = 1.44
The odds of getting a 1 are 1.44 times as large for a female as for a male
Or… 1.13/1.63 = 0.69
The odds of getting a 1 are 0.69 times as large for a male as for a female
OR > 1: increased odds for group 1 relative to 2
OR = 1: no difference in odds for group 1 relative to 2
OR < 1: lower odds for group 1 relative to 2
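The odds and odds-ratio arithmetic can be reproduced in a few lines of R. This is a sketch using the two proportions from the example (.53 for males, .62 for females); the helper `odds()` and the variable names are illustrative assumptions. Working from unrounded odds gives an odds ratio of about 1.45, slightly above the 1.44 obtained from the rounded odds.

```r
# Convert a proportion to odds: p / (1 - p)
odds <- function(p) p / (1 - p)

p_male   <- 0.53   # proportion of 1s among males (from the example)
p_female <- 0.62   # proportion of 1s among females (from the example)

odds_male   <- odds(p_male)
odds_female <- odds(p_female)

# The two odds ratios are reciprocals of each other
or_female_vs_male <- odds_female / odds_male
or_male_vs_female <- odds_male / odds_female

print(round(c(odds_male = odds_male, odds_female = odds_female,
              or_female_vs_male = or_female_vs_male,
              or_male_vs_female = or_male_vs_female), 2))
```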
Modeling Y (Categorical X)
Odds-ratio for a 2 x 2 table
                        Heart Disease
Cholesterol in Diet     Y     N     Total
Hi                     11     4     15
Lo                      2     6      8
Total                  13    10     23
Odds(Hi) = 11/4
Odds(Lo) = 2/6
O.R. = (11/4)/(2/6) = 8.25
Odds of HD are 8.25 times larger for high cholesterol
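The same odds ratio can be computed directly from the cell counts. A minimal sketch in R, assuming the counts as read from the table (Hi: 11 with heart disease, 4 without; Lo: 2 with, 6 without); the object name `hd` is an illustrative choice:

```r
# 2 x 2 table of counts: rows = cholesterol in diet, columns = heart disease
hd <- matrix(c(11, 4,
                2, 6),
             nrow = 2, byrow = TRUE,
             dimnames = list(Cholesterol = c("Hi", "Lo"),
                             HeartDisease = c("Y", "N")))

odds_hi <- hd["Hi", "Y"] / hd["Hi", "N"]   # 11/4
odds_lo <- hd["Lo", "Y"] / hd["Lo", "N"]   # 2/6
odds_ratio <- odds_hi / odds_lo

# Equivalent cross-product form: (a * d) / (b * c)
cross_product <- (hd[1, 1] * hd[2, 2]) / (hd[1, 2] * hd[2, 1])

print(c(odds_ratio = odds_ratio, cross_product = cross_product))
```

Both forms give 8.25, since (11/4)/(2/6) = (11 * 6)/(4 * 2).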
Odds-Ratio
Ranges from 0 to infinity, with 1 indicating no difference
Tends to be skewed
Often transform to log-odds to get symmetry
The log-OR comparing females to males = log(1.44) = 0.36
The log-OR comparing males to females = log(1/1.44) = -0.36 (1/1.44 ≈ 0.69)
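The symmetry of the log odds-ratio is easy to verify. A short sketch in R, assuming the rounded odds ratio of 1.44 from the example and illustrative variable names:

```r
or_f_vs_m <- 1.44            # odds ratio, females vs. males (from the example)
or_m_vs_f <- 1 / or_f_vs_m   # same comparison with the groups swapped

# On the log scale the two comparisons differ only in sign
log_or <- log(c(females_vs_males = or_f_vs_m, males_vs_females = or_m_vs_f))
print(round(log_or, 2))
```

On the raw scale the two ratios (1.44 and about 0.69) sit at unequal distances from 1; on the log scale they are equally far from 0.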
Modeling Y (Continuous X)
We need to form a general prediction model
Standard OLS regression won't work
The errors of a dichotomous variable cannot be normally distributed with constant variance
Also, the estimated parameters don't make much sense
Let's look at a scatterplot of dichotomous data…
Dichotomous Scatterplot
What smooth function can we use to model something that looks like this?
Dichotomous Scatterplot
OLS regression? Smooth but…
Dichotomous Scatterplot
Could break X into groups to form a more continuous scale for Y
proportion or percentage scale
Dichotomous Scatterplot
Now, plot the categorized data
Notice the "S" shape? = sigmoid
Notice that we just shifted to a continuous scale?
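The grouping step can be sketched in R as follows. The data here are simulated purely for illustration (the predictor range, the coefficients, and the seed are assumptions, not the data behind the slides):

```r
set.seed(818)                                  # arbitrary seed
n <- 2000
x <- runif(n, min = 20, max = 70)              # hypothetical continuous predictor
y <- rbinom(n, size = 1, prob = plogis(-5 + 0.11 * x))  # hypothetical (0,1) outcome

# Break X into 10 equal-width groups and compute the proportion of 1s in each
x_group <- cut(x, breaks = seq(20, 70, by = 5), include.lowest = TRUE)
prop_success <- tapply(y, x_group, mean)

print(round(prop_success, 2))
# plot(seq(22.5, 67.5, by = 5), prop_success, type = "b")  # optional: view the S shape
```

Each group proportion is an estimate of the probability of a 1 at that level of X, which is the quantity logistic regression models directly.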
Dichotomous Scatterplot
We can fit a smooth function by modeling the probability of success ("1") directly
Model the probability of a '1' rather than the (0,1) data directly
Another Example
Another Example (cont)
Logistic Equation
E(y|x) = π(x) = probability that a person with a given x-score will have a score of '1' on Y
$$\pi(x) = \frac{e^{u}}{1 + e^{u}}, \qquad u = \alpha + \beta_1 x$$
Could just expand u to include more predictors for a multiple logistic regression
Logistic Regression
α - shifts the distribution (value of x where π = .5)
β - reflects the steepness of the transition (slope)
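The roles of the two parameters can be seen by evaluating the logistic function for a few settings. This is a minimal sketch in R; the function name `logistic()` and the parameter values are illustrative assumptions:

```r
# Logistic function: pi(x) = exp(u) / (1 + exp(u)), with u = alpha + beta * x
logistic <- function(x, alpha, beta) {
  u <- alpha + beta * x
  exp(u) / (1 + exp(u))
}

x <- seq(-6, 6, by = 2)

# Changing alpha moves the curve left or right; pi(x) = .5 where x = -alpha / beta
print(round(logistic(x, alpha = 0, beta = 1), 3))
print(round(logistic(x, alpha = 2, beta = 1), 3))   # midpoint now at x = -2

# Changing beta changes how sharply the curve rises
print(round(logistic(x, alpha = 0, beta = 3), 3))   # steeper transition

midpoint <- -2 / 1   # -alpha / beta for alpha = 2, beta = 1
```

Base R's `plogis()` computes the same function of u and is numerically safer for large |u|.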
Features of Logistic Regression
Change in probability is not constant (linear) with constant changes in X
probability of a success (Y = 1) given the predictor variable (X) is a non-linear function
Can rewrite the logistic equation as an Odds
$$\frac{P(Y = 1 \mid X_i)}{1 - P(Y = 1 \mid X_i)} = \frac{\hat{\pi}}{1 - \hat{\pi}} = e^{b_0 + b_1 X_{1i}}$$
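Both features can be illustrated numerically. The sketch below, in R, uses hypothetical coefficients `b0` and `b1` (not estimates from the slides) to show that equal steps in X give unequal changes in probability, and that the odds computed from those probabilities equal exp(b0 + b1 X):

```r
b0 <- -5     # hypothetical intercept
b1 <- 0.1    # hypothetical slope
x  <- seq(20, 80, by = 10)

p <- plogis(b0 + b1 * x)      # P(Y = 1 | X) from the logistic equation

# Change in probability for each 10-unit step in X: not constant
print(round(diff(p), 3))

# Rewriting as odds: p / (1 - p) equals exp(b0 + b1 * x)
odds_from_p     <- p / (1 - p)
odds_from_model <- exp(b0 + b1 * x)
print(round(cbind(x, p, odds_from_p, odds_from_model), 3))
```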
Logit Transform
Can linearize the logistic equation by using the "logit" transformation
apply the natural log to both sides of the equation
Yields the logit or log-odds:
$$\ln\left[\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)}\right] = \ln\left[\frac{\hat{\pi}}{1 - \hat{\pi}}\right] = b_0 + b_1 X_1$$
Logit Transformation
The logit transformation puts the interpretation of the regression estimates back on familiar footing
α = expected value of the logit (log-odds) when X = 0
β = 'logit difference' = the amount the logit (log-odds) changes with a one unit change in X
Logit
Logit: the natural log of the odds
often called a log odds
logit scale is continuous, linear, and functions much like a z-score scale
p = 0.50, then logit = 0
p = 0.70, then logit = 0.85
p = 0.30, then logit = -0.85
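These logit values follow directly from the definition. A short sketch in R (variable names are illustrative); note that ln(.7/.3) is about 0.847:

```r
p <- c(0.50, 0.70, 0.30)

# Logit = natural log of the odds; base R's qlogis(p) computes the same thing
logit <- log(p / (1 - p))
print(round(logit, 3))
```

The logits for p and 1 - p are equal in size and opposite in sign, which is the symmetry around 0 that makes the scale behave like a z-score scale.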
Odds-Ratios and Logistic Regression
The slope may also be interpreted as the log odds-ratio associated with a unit increase in x
exp(β) = odds-ratio
Compare the log odds (logit) of a person with a score of x to a person with a score of x+1
$$\operatorname{logit}(\pi(x)) = \alpha + \beta x$$
$$\operatorname{logit}(\pi(x+1)) = \alpha + \beta(x+1) = \alpha + \beta x + \beta$$
The difference between the two logits is β, so the ratio of the two odds is exp(β)
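The comparison of x with x+1 can be checked numerically. A minimal sketch in R, assuming hypothetical values for α and β and an illustrative helper `logit_at()`:

```r
alpha <- -5.3   # hypothetical intercept
beta  <- 0.11   # hypothetical slope

# Logit (log-odds) predicted at a given x
logit_at <- function(x) alpha + beta * x

x <- c(30, 45, 60)
logit_diff <- logit_at(x + 1) - logit_at(x)             # equals beta at every x
odds_ratio <- exp(logit_at(x + 1)) / exp(logit_at(x))   # equals exp(beta) at every x

print(rbind(logit_diff, odds_ratio))
```

The odds ratio for a one-unit increase is the same wherever on the X scale the comparison is made, even though the change in probability is not.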
There and back again…
If the data are consistent with a logistic function, then the relationship between the model and the logit is linear
The logit scale is somewhat difficult to understand
Could interpret as odds but people seem to prefer probability as the natural scale, so…
$$\log\frac{p}{1 - p} = \operatorname{logit}(p) = \alpha + \beta x$$
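There and back again…
Logit:
$$\log\frac{p}{1 - p} = \operatorname{logit}(p) = \alpha + \beta x$$
Odds:
$$\frac{p}{1 - p} = e^{\alpha + \beta x}$$
Probability:
$$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$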
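The three scales can be traversed in either direction. The following R sketch uses hypothetical values of α, β, and x to go from logit to odds to probability and back:

```r
alpha <- -1; beta <- 0.5; x <- 3        # hypothetical values

logit <- alpha + beta * x               # log-odds scale
odds  <- exp(logit)                     # odds scale
p     <- odds / (1 + odds)              # probability scale

# And back again: probability -> odds -> logit
odds_back  <- p / (1 - p)
logit_back <- log(odds_back)

print(c(logit = logit, odds = odds, p = p, logit_back = logit_back))
```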
Estimation
Don't meet OLS assumptions so some variant of MLE is used
Let's develop the likelihood
Assuming observations are independent…
$$p(y_i = 1) = \pi_i$$
$$p(y_i = 0) = 1 - \pi_i$$
$$\text{pdf: } f_i(y_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}; \quad y_i = 0, 1; \quad i = 1, 2, \ldots, n$$
$$\text{joint pdf: } \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$
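Estimation
Likelihood
recall..
$$\text{joint pdf: } \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$
$$\text{log transform} = \sum_{i=1}^{n}\left[y_i \log\left(\frac{\pi_i}{1 - \pi_i}\right)\right] + \sum_{i=1}^{n}\log(1 - \pi_i)$$
$$\log\frac{\pi_i}{1 - \pi_i} = \alpha + \beta x_i$$
$$1 - \pi_i = \frac{1}{1 + \exp(\alpha + \beta x_i)}$$
Estimation
Upon substitution…
$$\log l = l(\alpha, \beta) = \sum_{i=1}^{n} y_i(\alpha + \beta x_i) - \sum_{i=1}^{n}\log[1 + \exp(\alpha + \beta x_i)]$$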
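The log-likelihood above can be maximized numerically. The sketch below, in R, is one way to do it under stated assumptions: the data are simulated (seed, sample size, and true coefficients are hypothetical), and a general-purpose optimizer (`optim()` with BFGS) is used in place of the Fisher scoring that `glm()` applies. The two sets of estimates should agree closely.

```r
set.seed(818)                          # arbitrary seed
n <- 500
x <- rnorm(n)                          # hypothetical predictor
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * x))  # hypothetical outcome

# Negative log-likelihood: -( sum y_i (a + b x_i) - sum log[1 + exp(a + b x_i)] )
neg_loglik <- function(par) {
  eta <- par[1] + par[2] * x
  -(sum(y * eta) - sum(log1p(exp(eta))))
}

# Minimize the negative log-likelihood from a (0, 0) starting point
mle <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")

# glm() fits the same model by Fisher scoring
fit <- glm(y ~ x, family = binomial)

print(rbind(optim = mle$par, glm = unname(coef(fit))))
```

`log1p(exp(eta))` is used for log[1 + exp(eta)] because it is more accurate when exp(eta) is small.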
Example
Heart Disease & Age
100 participants
DV = presence of heart disease
IV = Age
Heart Disease Example
[Figure: plot for the heart disease example; vertical axis from 0.0 to 1.0]
Heart Disease Example
library(MASS)
glm(formula = y ~ x, family = binomial, data = mydata)
(output below is from summary() of the fitted model; the predictor x is age)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
age          0.11092    0.02406   4.610 4.02e-06 ***
Null deviance: 136.66 on 99 degrees of freedom
Residual deviance: 107.35 on 98 degrees of freedom
AIC: 111.35
Number of Fisher Scoring iterations: 4
Heart Disease Example
Logistic regression
$$\hat{\pi}(x) = \frac{e^{-5.31 + .111(x)}}{1 + e^{-5.31 + .111(x)}}$$
Odds-Ratio
exp(.111) = 1.117
[Figure: plot accompanying the fitted model; vertical axis from 0.1 to 0.7]
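The fitted equation can be used to compute predicted probabilities and odds ratios. A minimal sketch in R, using the two coefficients reported in the output; the ages 30, 50, and 70 are hypothetical values chosen only for illustration:

```r
b0 <- -5.30945   # intercept from the reported output
b1 <-  0.11092   # age coefficient from the reported output

age <- c(30, 50, 70)                    # hypothetical ages chosen for illustration
p_hat <- plogis(b0 + b1 * age)          # exp(u) / (1 + exp(u)), u = b0 + b1 * age

or_1yr  <- exp(b1)                      # odds ratio for a 1-year increase in age
or_10yr <- exp(10 * b1)                 # odds ratio for a 10-year increase in age

print(round(data.frame(age = age, p_hat = p_hat), 3))
print(round(c(or_1yr = or_1yr, or_10yr = or_10yr), 3))
```

The 10-year odds ratio is the 1-year odds ratio raised to the 10th power, because odds ratios multiply across unit steps.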
Heart Disease Example
In terms of logits…
[Figure: the heart disease example on the logit scale; vertical axis from -3 to 0]