Post on 28-Dec-2015
Logistic Regression for binary outcomes
In Linear Regression, Y is continuous.
In Logistic, Y is binary (0,1). Average Y is P.
Can't use linear regression since:
Y can't be linearly related to the Xs.
Y does NOT have a Gaussian (normal) distribution around the mean P.
We need a linearizing transformation and a non-Gaussian error model.
- Since 0
Since $P = \text{odds}/(1 + \text{odds})$ and $\text{odds} = e^{\text{logit}}$,
$P = e^{\text{logit}}/(1 + e^{\text{logit}}) = 1/(1 + e^{-\text{logit}})$
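The conversions between probability, odds, and logit can be sketched in a few lines of Python. The function names here are illustrative choices, not taken from the slides.

```python
import math

def logit(p):
    """Log odds of a probability p (requires 0 < p < 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Probability for a given logit x: P = 1/(1 + e^(-logit))."""
    return 1.0 / (1.0 + math.exp(-x))

def odds_to_prob(odds):
    """P = odds/(1 + odds)."""
    return odds / (1.0 + odds)

p = 0.8
x = logit(p)                      # log(0.8/0.2) = log(4)
print(x, inv_logit(x), odds_to_prob(math.exp(x)))
```

Going from P to the logit and back returns the original P, which is what makes the logit usable as a linearizing transformation.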
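If $\ln(\text{odds}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
then $\text{odds} = (e^{\beta_0})(e^{\beta_1 X_1})(e^{\beta_2 X_2}) \cdots (e^{\beta_k X_k})$
or odds = (base odds) × OR1 × OR2 × ... × ORk
Model is multiplicative on the odds scale.
(Base odds are the odds when all Xs = 0.) ORi = odds ratio for the ith X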
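Interpreting coefficients
Example: Dichotomous X
X = 0 for males, X = 1 for females
$\text{logit}(P) = \beta_0 + \beta_1 X$
M: X = 0, $\text{logit}(P_m) = \beta_0$
F: X = 1, $\text{logit}(P_f) = \beta_0 + \beta_1$
$\text{logit}(P_f) - \text{logit}(P_m) = \beta_1$
$\log(\text{OR}) = \beta_1$, $e^{\beta_1} = \text{OR}$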
Example: P is proportion with disease
$\text{logit}(P) = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex}$
sex is coded 0 for M, 1 for F
OR for F vs M for disease is $e^{\beta_2}$ if both are the same age.
$e^{\beta_1}$ is the factor by which the odds of disease are multiplied for a one-year increase in age (the OR per year of age).
$(e^{\beta_1})^k = e^{k\beta_1}$ is the OR for a k-year change in age in two groups with the same gender.
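Example: P is proportion with an MI
Predictors:
age in years
htn = hypertension (1=yes, 0=no)
smoke = smoking (1=yes, 0=no)
$\text{logit}(P) = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{htn} + \beta_3\,\text{smoke}$
Q: Want OR for a 40 year old with hypertension vs an otherwise identical 30 year old without hypertension.
A: $(\beta_0 + 40\beta_1 + \beta_2 + \beta_3\,\text{smoke}) - (\beta_0 + 30\beta_1 + \beta_3\,\text{smoke}) = 10\beta_1 + \beta_2 = \log \text{OR}$.
$\text{OR} = e^{10\beta_1 + \beta_2}$.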
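A small numerical check of this answer, as a sketch only: the slides give no coefficient values for the MI model, so the four numbers below are hypothetical and chosen purely for illustration. The check shows that the intercept and the smoking term cancel out of the comparison.

```python
import math

# Hypothetical coefficients, for illustration only (not from the slides).
b0, b_age, b_htn, b_smoke = -6.0, 0.05, 0.7, 0.9

def mi_logit(age, htn, smoke):
    """logit(P) = b0 + b_age*age + b_htn*htn + b_smoke*smoke."""
    return b0 + b_age * age + b_htn * htn + b_smoke * smoke

def or_40htn_vs_30(smoke):
    """OR for age 40 with hypertension vs age 30 without, same smoking status."""
    return math.exp(mi_logit(40, 1, smoke) - mi_logit(30, 0, smoke))

# Closed form from the derivation: exp(10*b_age + b_htn)
closed_form = math.exp(10 * b_age + b_htn)

print(or_40htn_vs_30(0), or_40htn_vs_30(1), closed_form)
```

All three printed values agree, whatever the smoking status of the pair being compared.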
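Interactions
P is proportion with CHD
S: 1 = smoking, 0 = non. D: 1 = drinking, 0 = non
$\text{logit}(P) = \beta_0 + \beta_1 S + \beta_2 D + \beta_3 SD$
Referent category is S=0, D=0

S  D  odds                                        OR
0  0  $e^{\beta_0}$                               $OR_{00} = 1 = e^{\beta_0}/e^{\beta_0}$
1  0  $e^{\beta_0+\beta_1}$                       $OR_{10} = e^{\beta_1}$
0  1  $e^{\beta_0+\beta_2}$                       $OR_{01} = e^{\beta_2}$
1  1  $e^{\beta_0+\beta_1+\beta_2+\beta_3}$       $OR_{11} = e^{\beta_1+\beta_2+\beta_3}$

When will $OR_{11} = OR_{10} \times OR_{01}$? IFF $\beta_3 = 0$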
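The role of the interaction coefficient can be illustrated numerically. This is a minimal sketch: the coefficient values are hypothetical (the slides give none for the CHD model), and the function name is an illustrative choice.

```python
import math

def joint_ors(b1, b2, b3):
    """Odds ratios vs the referent (S=0, D=0) for the model
    logit(P) = b0 + b1*S + b2*D + b3*S*D."""
    or10 = math.exp(b1)             # smoking only
    or01 = math.exp(b2)             # drinking only
    or11 = math.exp(b1 + b2 + b3)   # both
    return or10, or01, or11

# Hypothetical coefficients, for illustration only.
no_inter = joint_ors(0.6, 0.4, 0.0)
with_inter = joint_ors(0.6, 0.4, 0.3)

# Ratio OR11 / (OR10 * OR01) equals exp(b3): 1 when b3 = 0.
print(no_inter[2] / (no_inter[0] * no_inter[1]))
print(with_inter[2] / (with_inter[0] * with_inter[1]))
```

The ratio OR11/(OR10 × OR01) is exactly e^(b3), so the joint OR is the product of the separate ORs only when the interaction coefficient is zero.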
Potential predictors (13) of in-hospital infection mortality (yes or no)
Crabtree et al., JAMA, 8 Dec 1999, No 22, 2143-2148
- Gender (female or male)
- Age in years
- APACHE score (0-129)
- Diabetes (y/n)
- Renal insufficiency / Hemodialysis (y/n)
- Intubation / mechanical ventilation (y/n)
- Malignancy (y/n)
- Steroid therapy (y/n)
- Transfusions (y/n)
- Organ transplant (y/n)
- WBC count
- Max temperature, degrees
- Days from admission to treatment (> 7 days)
- Factors Associated With Mortality for All Infections
Characteristic       Odds Ratio (95% CI)   p value
Incr APACHE score    1.15 (1.11-1.18)
Diabetes complications - Descriptive stats
Table of obese by diabetes complication

             diabetes complication
obese        no (0)   yes (1)   Total   % yes
no  (0)        56       28        84    28/84 = 33%
yes (1)        20       41        61    41/61 = 67%
Total          76       69       145
% obese       26%      59%

RR = 2.0, OR = 4.1, p < 0.001
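The RR and OR quoted for this table can be reproduced directly from the four cell counts. A short sketch, with variable names chosen for illustration:

```python
# Cell counts from the obese-by-complication table.
a, b = 41, 20   # obese:     complication yes, complication no
c, d = 28, 56   # not obese: complication yes, complication no

risk_obese = a / (a + b)         # 41/61
risk_not_obese = c / (c + d)     # 28/84

rr = risk_obese / risk_not_obese       # relative risk
odds_ratio = (a * d) / (b * c)         # cross-product odds ratio

print(round(risk_obese, 2), round(risk_not_obese, 2))
print(round(rr, 1), round(odds_ratio, 1))
```

The OR is roughly twice the RR here because the complication is common in this sample; the two measures are close only when the outcome is rare.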
Fasting glucose (fast glu), mg/dl
                   n    min    median   mean    max
No complication    76   70.0    90.0     91.2   112.0
Complication       69   75.0   114.0    155.9   353.0
p=

Steady state glucose (steady glu), mg/dl
                   n    min    median   mean    max
No complication    76   29.0   105.0    114.0   273.0
Complication       69   60.0   257.0    261.5   480.0
p=
- Diabetes complication
Parameter   DF   beta     SE(b)   Chi-Square   p
Intercept   1    -14.70   3.231   20.706
Statistical significance of the βs
Linear regression: t = b/SE -> p value
Logistic regression: $\chi^2 = (b/SE)^2$ -> p value
Must first form the (95%) CI for β on the log scale:
b - 1.96 SE, b + 1.96 SE
Then take antilogs of each end:
$e^{b - 1.96\,SE}$, $e^{b + 1.96\,SE}$
Diabetes complications: Odds Ratio Estimates

             Point                    95% Wald
Effect       Estimate                 Confidence Limits
obese        $e^{0.328} = 1.388$      0.416   4.631
Fast glu     $e^{0.108} = 1.114$      1.049   1.182
Steady glu   $e^{0.023} = 1.023$      1.012   1.033
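The Wald chi-square and the CI-then-antilog steps can be sketched as follows. Because the standard errors for the diabetes predictors are not listed in these slides, the sketch uses the "female" coefficient and SE from the depression example later in the deck (b = 0.8332, SE = 0.3882). The function name and the use of `math.erfc` for the 1 df chi-square p value are choices made here for illustration.

```python
import math

def wald_summary(b, se, z=1.96):
    """Wald chi-square, p value, and odds ratio with 95% CI
    for one logistic regression coefficient b with standard error se."""
    chi_sq = (b / se) ** 2
    # Upper tail of chi-square with 1 df = two-sided normal tail of |b/se|.
    p_value = math.erfc(abs(b / se) / math.sqrt(2.0))
    lo, hi = b - z * se, b + z * se        # CI on the log (beta) scale
    return {
        "chi_sq": chi_sq,
        "p": p_value,
        "or": math.exp(b),                 # antilog of the point estimate
        "or_lo": math.exp(lo),             # antilog of each end
        "or_hi": math.exp(hi),
    }

female = wald_summary(0.8332, 0.3882)
print({k: round(v, 4) for k, v in female.items()})
```

The resulting interval is not symmetric around the OR, because it is symmetric on the log scale and then exponentiated.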
- Model fit - Linear vs Logistic regression
k variables, n observations

Variation   df       sum of squares or deviance
Model       k        G
Error       n-k-1    D
Total       n-1      T

Good regression models have large G and small D. For logistic regression, D/(n-k-1), the mean deviance, should be near 1.0. There are two versions of the R2 for logistic regression.
Goodness of fit: Deviance
Deviance in logistic is like SS in linear regression.

             df    -2 log L   p value
Model (G)      3    117.21    < 0.001
Error (D)    141     83.46
Total (T)    144    200.67

mean deviance = 83.46/141 = 0.59 (want mean deviance to be near 1)
R2pseudo = G/total = 117/201 = 0.58, R2cs = 0.554
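The summary numbers on this slide follow from the deviance table with a few lines of arithmetic. The sketch below assumes that "R2cs" is the Cox-Snell R2, computed as 1 - exp(-G/n); that reading is an assumption, though it does reproduce the 0.554 shown.

```python
import math

n = 145                             # observations (total df 144 + 1)
G, D, T = 117.21, 83.46, 200.67     # model, error, total deviance (-2 log L)
df_error = 141

mean_deviance = D / df_error
r2_pseudo = G / T
# Assumed definition of R2cs (Cox-Snell): 1 - exp(-G/n)
r2_cs = 1.0 - math.exp(-G / n)

print(round(mean_deviance, 2), round(r2_pseudo, 2), round(r2_cs, 3))
```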
Goodness of fit: H-L chi-square
Compare observed vs model-predicted (expected) frequencies by predicted decile.

decile   total   obs y   exp y   obs no   exp no
1         16       0      0.23     16      15.8
2         15       0      0.61     15      14.4
3         15       0      1.31     15      13.7
8         16      15     15.6       1       0.40
9         23      23     23.0       0       0.00

chi-square = 9.89, df = 7, p = 0.1946
Goodness of fit vs R2
Interpretation when goodness of fit is acceptable and R2 is poor.
Need to include interactions or make transformation on X variables in model?
Need to obtain more X variables?
Sensitivity & Specificity
Sensitivity = a/(a+c), false neg = c/(a+c)
Specificity = d/(b+d), false pos = b/(b+d)
Accuracy = W × sensitivity + (1-W) × specificity

               True pos   True neg
Classify pos   a          b
Classify neg   c          d
total          a+c        b+d
Any good classification rule, including a logistic model, should have high sensitivity & specificity. In logistic, we choose a cutpoint, Pc:
Predict positive if P > Pc
Predict negative if P < Pc
Diabetes complication logit(Pi) = -14.7+0.328 obese+0.108 fast glu +0.023 steady glu
Pi = 1/(1+ exp(-logit))
Compute Pi for all observations, then find the value of Pi (call it P0) that maximizes accuracy = 0.5 sensitivity + 0.5 specificity.
This is an ROC analysis using the logit (or Pi).
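The cutpoint search described above can be sketched as a brute-force scan over the observed predicted probabilities. The data below are hypothetical and made up for illustration; the function name is also an illustrative choice, and observations exactly at the cutpoint are classified positive here, which is an assumption since the slides do not say how ties are handled.

```python
def best_cutpoint(probs, labels, w=0.5):
    """Return (cutpoint, accuracy) maximizing
    w*sensitivity + (1-w)*specificity over the observed probabilities."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for pc in sorted(set(probs)):
        # Assumption: P >= pc is classified positive.
        tp = sum(1 for p, y in zip(probs, labels) if p >= pc and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < pc and y == 0)
        acc = w * tp / n_pos + (1 - w) * tn / n_neg
        if best is None or acc > best[1]:
            best = (pc, acc)
    return best

# Hypothetical predicted probabilities and true outcomes (1 = positive).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

p0, acc = best_cutpoint(probs, labels)
print(p0, acc)
```

Each candidate cutpoint corresponds to one point on the ROC curve; the scan picks the point with the largest weighted sum of sensitivity and specificity.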
ROC for logistic model
Diabetes model accuracy
Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%
Accuracy = (79.7% + 85.5%)/2 = 82.6%
Logit = 0.447, $P_0 = e^{0.447}/(1 + e^{0.447}) = 0.61$
            True comp   True no comp
Pred yes    55          11
Pred no     14          65
total       69          76
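Sensitivity, specificity, and the weighted accuracy can be computed directly from the four counts in this table. A short sketch, with an illustrative function name:

```python
def classification_summary(tp, fp, fn, tn, w=0.5):
    """Sensitivity, specificity, and W*sens + (1-W)*spec from a 2x2 table."""
    sens = tp / (tp + fn)     # a/(a+c)
    spec = tn / (tn + fp)     # d/(b+d)
    return sens, spec, w * sens + (1 - w) * spec

# Counts from the diabetes complication table:
# predicted yes: 55 true comp, 11 true no comp; predicted no: 14 and 65.
sens, spec, acc = classification_summary(tp=55, fp=11, fn=14, tn=65)
print(round(100 * sens, 1), round(100 * spec, 1), round(100 * acc, 1))
```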
C statistic (report this)
n0 = number negative, n1 = number positive
Make all n0 × n1 pairs (1,0)
Concordant if predicted P for Y=1 > predicted P for Y=0
Discordant if predicted P for Y=1 < predicted P for Y=0
C = (num concordant + 0.5 × num ties) / (n0 × n1)
C = 0.949 for the diabetes complication model
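The pairwise definition translates almost word for word into code. This is a minimal sketch on hypothetical data (the individual predictions for the diabetes model are not in the slides); the function name is illustrative.

```python
def c_statistic(probs, labels):
    """C = (concordant + 0.5*ties) / (n0*n1) over all (positive, negative) pairs."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    concordant = sum(1 for pp in pos for pn in neg if pp > pn)
    ties = sum(1 for pp in pos for pn in neg if pp == pn)
    return (concordant + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true outcomes (1 = positive).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

c = c_statistic(probs, labels)
print(c)
```

The double loop makes n0 × n1 comparisons, which is fine for samples of a few hundred; for large data sets a rank-based calculation would be a better choice.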
Logistic model is also a discriminant model (LDA)
Histograms of logit scores for each group
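Poisson Regression
Y is a small non-negative integer count: 0, 1, 2, ...
Model: $\ln(\text{mean } Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$
so $\text{mean } Y = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$
$d(\text{mean } Y)/dX_i = \beta_i \times \text{mean } Y$, so $\beta_i = (d(\text{mean } Y)/dX_i)/\text{mean } Y$
$100\,\beta_i$ is approximately the percent change in mean Y per unit change in $X_i$ (for small $\beta_i$; the exact figure is $100\,(e^{\beta_i} - 1)$).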
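The percent-change reading of a Poisson coefficient can be compared against the exact change in the mean. A brief sketch; the coefficient values are hypothetical and the function name is illustrative.

```python
import math

def pct_change_per_unit(beta):
    """Approximate (100*beta) and exact (100*(e^beta - 1)) percent change
    in mean Y for a one-unit increase in X under a log-linear model."""
    approx = 100.0 * beta
    exact = 100.0 * (math.exp(beta) - 1.0)
    return approx, exact

# Hypothetical coefficients, for illustration only.
for beta in (0.01, 0.05, 0.5):
    approx, exact = pct_change_per_unit(beta)
    print(beta, round(approx, 2), round(exact, 2))
```

The two figures agree closely for small coefficients and drift apart as the coefficient grows, so the 100 × coefficient shortcut is best kept for small effects.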
Equation for logit = log odds=depr score
logit = -1.8259 + 0.8332 female + 0.3578 chron ill -0.0299 income
odds depr = $e^{\text{logit}}$, risk = odds/(1+odds)
coding:
Female: 0 for M, 1 for F
Chron ill: 0 for no, 1 for yes
Income in 1000s
Example: Depression (y/n)
Model for depression

term        coeff = β    SE       p value
Intercept   -1.8259      0.4495   0.0001
female       0.8332      0.3882   0.0319
chron ill    0.3578      0.3300   0.2782
income      -0.0299      0.0135   0.0268

Female, chron ill are binary, income in 1000s
ORs

term        coeff = β    OR = $e^{\beta}$
Intercept   -1.8259      ---
female       0.8332      2.301
chron ill    0.3578      1.430
income      -0.0299      0.971
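Putting the depression model to work, the sketch below turns the fitted coefficients into a predicted risk for one person, following logit -> odds -> risk. The coefficients are the ones in the table above; the function name and the example covariate values are illustrative assumptions.

```python
import math

# Coefficients from the depression model table.
COEF = {"intercept": -1.8259, "female": 0.8332,
        "chron_ill": 0.3578, "income": -0.0299}

def depression_risk(female, chron_ill, income_thousands):
    """Predicted probability of depression.
    female, chron_ill: 0 or 1; income_thousands: income in 1000s."""
    logit = (COEF["intercept"]
             + COEF["female"] * female
             + COEF["chron_ill"] * chron_ill
             + COEF["income"] * income_thousands)
    odds = math.exp(logit)
    return odds / (1.0 + odds)

# Illustrative case: a woman without chronic illness, income of 10 (thousands).
risk = depression_risk(female=1, chron_ill=0, income_thousands=10)
print(round(risk, 3))

# Odds ratios are the antilogs of the coefficients.
print({k: round(math.exp(v), 3) for k, v in COEF.items() if k != "intercept"})
```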