Logistic Regression for Binary Outcomes


  • Logistic Regression for binary outcomes

  • In Linear Regression, Y is continuous. In Logistic Regression, Y is binary (0,1), and the average of Y is P.

    Can't use linear regression since: Y can't be linearly related to the Xs, and Y does NOT have a Gaussian (normal) distribution around its mean P. We need a linearizing transformation and a non-Gaussian error model.

  • Since 0 < P < 1, P cannot be modeled directly as a linear function of the Xs; instead model the log odds: logit(P) = ln(odds) = ln(P/(1 - P)).
  • Since P = odds/(1 + odds) and odds = e^logit,

    P = e^logit/(1 + e^logit) = 1/(1 + e^-logit)
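
    A minimal numpy sketch of this logit/inverse-logit round trip (function names are illustrative):

      import numpy as np

      def logit(p):
          # log odds: ln(P / (1 - P))
          return np.log(p / (1 - p))

      def inv_logit(x):
          # back-transform: P = e^logit / (1 + e^logit) = 1 / (1 + e^-logit)
          return 1.0 / (1.0 + np.exp(-x))

      p = 0.25
      print(logit(p))             # -1.0986...
      print(inv_logit(logit(p)))  # 0.25 (round trip)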

  • If ln(odds) = β0 + β1X1 + β2X2 + ... + βkXk, then odds = (e^β0)(e^β1X1)(e^β2X2)...(e^βkXk), or odds = (base odds)(OR1^X1)(OR2^X2)...(ORk^Xk). The model is multiplicative on the odds scale.

    (Base odds are the odds when all Xs = 0.) ORi = e^βi = odds ratio for the ith X.
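
    A quick numeric check of this multiplicative form, with made-up coefficients (not from any slide):

      import numpy as np

      b = np.array([-1.0, 0.4, 0.7])   # β0, β1, β2 (illustrative values)
      x = np.array([1.0, 2.0, 1.0])    # 1 for the intercept, then X1, X2

      odds_from_sum = np.exp(b @ x)                # exp(linear predictor)
      odds_from_product = np.prod(np.exp(b * x))   # (base odds)(OR1^X1)(OR2^X2)
      print(np.isclose(odds_from_sum, odds_from_product))   # True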

  • Interpreting coefficients. Example: dichotomous X, with X = 0 for males and X = 1 for females.
    logit(P) = β0 + β1X
    M: X=0, logit(Pm) = β0
    F: X=1, logit(Pf) = β0 + β1
    logit(Pf) - logit(Pm) = β1, so log(OR) = β1 and e^β1 = OR

  • Example: P is proportion with disease. logit(P) = β0 + β1 age + β2 sex, with sex coded 0 for M, 1 for F. The OR for F vs M for disease is e^β2 if both are the same age.

    e^β1 is the multiplicative increase in the odds of disease for a one-year increase in age.

    (e^β1)^k = e^(kβ1) is the OR for a k-year change in age between two groups with the same gender.

  • Example: P is proportion with an MI. Predictors: age in years, htn = hypertension (1=yes, 0=no), smoke = smoking (1=yes, 0=no). Logit(P) = β0 + β1 age + β2 htn + β3 smoke.
    Q: Want the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension.
    A: (β0 + 40β1 + β2 + β3 smoke) - (β0 + 30β1 + β3 smoke) = 10β1 + β2 = log OR, so OR = e^(10β1 + β2).
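
    The same arithmetic in Python, with hypothetical values for β1 (age) and β2 (htn), since the slide gives only the algebra:

      import numpy as np

      b1, b2 = 0.05, 0.9     # assumed age and hypertension coefficients

      log_or = 10 * b1 + b2  # (40 - 30) * β1 + β2; the intercept and smoking terms cancel
      print(np.exp(log_or))  # OR ≈ 4.06 for the 40-year-old with htn vs the 30-year-old without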

  • Interactions. P is proportion with CHD. S: 1 = smoking, 0 = non; D: 1 = drinking, 0 = non.
    Logit(P) = β0 + β1 S + β2 D + β3 SD. Referent category is S=0, D=0.

      S  D  odds              OR
      0  0  e^β0              OR00 = 1 = e^β0 / e^β0
      1  0  e^(β0+β1)         OR10 = e^β1
      0  1  e^(β0+β2)         OR01 = e^β2
      1  1  e^(β0+β1+β2+β3)   OR11 = e^(β1+β2+β3)

    When will OR11 = OR10 × OR01? If and only if β3 = 0.
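
    A small numeric check that OR11 = OR10 × OR01 only when β3 = 0 (coefficients below are made up):

      import numpy as np

      b1, b2, b3 = 0.6, 0.4, 0.3     # smoking, drinking, interaction (illustrative)

      or10 = np.exp(b1)              # smoker, non-drinker vs referent
      or01 = np.exp(b2)              # drinker, non-smoker vs referent
      or11 = np.exp(b1 + b2 + b3)    # smoker and drinker vs referent

      print(np.isclose(or11, or10 * or01))               # False while b3 != 0
      print(np.isclose(or11, or10 * or01 * np.exp(b3)))  # True: e^β3 is the extra factor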

  • Interpretation example

    Potential predictors (13) of in-hospital infection mortality (yes or no). Crabtree et al., JAMA, 8 Dec 1999, No 22, 2143-2148.
    Gender (female or male); Age in years; APACHE score (0-129); Diabetes (y/n); Renal insufficiency / hemodialysis (y/n); Intubation / mechanical ventilation (y/n); Malignancy (y/n); Steroid therapy (y/n); Transfusions (y/n); Organ transplant (y/n); WBC count; Max temperature (degrees); Days from admission to treatment (> 7 days)

  • Factors Associated With Mortality for All Infections

      Characteristic       Odds Ratio (95% CI)   p value
      Incr APACHE score    1.15 (1.11-1.18)
  • Diabetes complications - Descriptive stats

    Table of obese by diabetes complication (see the pandas sketch after the glucose tables below):

                  diabetes complication
      obese       no (0)   yes (1)   Total   % yes
      no  (0)       56       28        84    28/84 = 33%
      yes (1)       20       41        61    41/61 = 67%
      Total         76       69       145
      % obese       26%      59%

      RR = 2.0, OR = 4.1, p < 0.001

    Fasting glucose (fast glu), mg/dl
                         n    min    median   mean    max
      No complication    76   70.0    90.0     91.2   112.0
      Complication       69   75.0   114.0    155.9   353.0     p =

    Steady state glucose (steady glu), mg/dl
                         n    min    median   mean    max
      No complication    76   29.0   105.0    114.0   273.0
      Complication       69   60.0   257.0    261.5   480.0     p =
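
    A pandas sketch reproducing the RR and OR from the 2x2 counts above:

      import pandas as pd

      # obese (rows) by diabetes complication (columns), counts from the slide
      tab = pd.DataFrame([[56, 28], [20, 41]],
                         index=["not obese", "obese"], columns=["no comp", "comp"])

      risk = tab["comp"] / tab.sum(axis=1)      # 28/84 = 0.33, 41/61 = 0.67
      rr = risk["obese"] / risk["not obese"]    # ≈ 2.0
      odds = tab["comp"] / tab["no comp"]
      or_ = odds["obese"] / odds["not obese"]   # ≈ 4.1
      print(round(rr, 1), round(or_, 1))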

  • Diabetes complication

      Parameter   DF   beta     SE(b)   Chi-Square   p
      Intercept    1   -14.70   3.231   20.706
  • Statistical significance of the βs
    Linear regression: t = b/SE -> p value
    Logistic regression: chi-square = (b/SE)² -> p value

    For a (95%) CI for the OR, first form the CI for β on the log scale: b - 1.96 SE, b + 1.96 SE

    Then take antilogs of each end: e^(b - 1.96 SE), e^(b + 1.96 SE)
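
    A sketch of this for the fasting-glucose coefficient; the SE is not shown on the OR slide, so 0.031 is assumed here because it roughly reproduces the published 1.05-1.18 limits:

      import numpy as np

      b, se = 0.108, 0.031                       # coefficient from the slide; SE assumed

      lo, hi = b - 1.96 * se, b + 1.96 * se      # 95% CI for β on the log-odds scale
      print(np.exp(b), np.exp(lo), np.exp(hi))   # OR and its Wald limits (antilog each end)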

  • Diabetes complications: Odds Ratio Estimates

      Effect        Point Estimate       95% Wald Confidence Limits
      obese         e^0.328 = 1.388      0.416    4.631
      fast glu      e^0.108 = 1.114      1.049    1.182
      steady glu    e^0.023 = 1.023      1.012    1.033
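
    A statsmodels sketch of how such an OR table is produced. The real data are not in the slides, so the outcome below is simulated from the fitted coefficients just to make the example run:

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      n = 145
      df = pd.DataFrame({
          "obese": rng.integers(0, 2, n),
          "fast_glu": rng.normal(120, 40, n),
          "steady_glu": rng.normal(180, 80, n),
      })
      lp = -14.7 + 0.328 * df["obese"] + 0.108 * df["fast_glu"] + 0.023 * df["steady_glu"]
      df["complication"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

      X = sm.add_constant(df[["obese", "fast_glu", "steady_glu"]])
      fit = sm.Logit(df["complication"], X).fit(disp=0)

      # point estimates and Wald limits on the OR scale, as in the table above
      or_table = np.exp(pd.concat([fit.params, fit.conf_int()], axis=1))
      or_table.columns = ["OR", "2.5%", "97.5%"]
      print(or_table)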

  • Model fit: Linear vs Logistic regression. k variables, n observations.

      Variation   df       sum of squares or deviance
      Model       k        G
      Error       n-k-1    D
      Total       n-1      T

  • Good regression models have a large G and a small D. For logistic regression, D/(n-k-1), the mean deviance, should be near 1.0. There are two versions of R² for logistic regression.

  • Goodness of fit: Deviance. Deviance in logistic regression is like SS in linear regression.

                    df    -2 log L   p value
      Model (G)       3    117.21    < 0.001
      Error (D)     141     83.46
      Total (T)     144    200.67

    Mean deviance = 83.46/141 = 0.59 (want mean deviance near 1).
    R² pseudo = G/Total = 117/201 = 0.58; R² CS = 0.554.
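
    Continuing from the statsmodels fit sketched above, these quantities can be read off the model and null log-likelihoods:

      G = 2 * (fit.llf - fit.llnull)    # model deviance (likelihood-ratio chi-square), df = k
      D = -2 * fit.llf                  # error (residual) deviance, df = n - k - 1
      T = -2 * fit.llnull               # total (null) deviance, df = n - 1; T = G + D
      print(D / fit.df_resid)           # mean deviance, want near 1.0
      print(G / T)                      # the G/Total pseudo R-squared
      print(fit.prsquared)              # McFadden pseudo R-squared, for comparison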

  • Goodness of fit: Hosmer-Lemeshow chi-square. Compare observed vs model-predicted (expected) frequencies by decile of predicted probability.

      decile   total   obs yes   exp yes   obs no   exp no
        1        16       0        0.23      16      15.8
        2        15       0        0.61      15      14.4
        3        15       0        1.31      15      13.7
        8        16      15       15.6        1       0.40
        9        23      23       23.0        0       0.00

    Chi-square = 9.89, df = 7, p = 0.1946
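
    A minimal sketch of the Hosmer-Lemeshow computation (pandas and scipy assumed; y is the 0/1 outcome, p the predicted probabilities):

      import pandas as pd
      from scipy.stats import chi2

      def hosmer_lemeshow(y, p, groups=10):
          # bin observations by decile of predicted probability
          d = pd.DataFrame({"y": y, "p": p})
          d["bin"] = pd.qcut(d["p"], groups, labels=False, duplicates="drop")
          g = d.groupby("bin").agg(total=("y", "size"), obs_yes=("y", "sum"), exp_yes=("p", "sum"))
          obs_no = g["total"] - g["obs_yes"]
          exp_no = g["total"] - g["exp_yes"]
          # chi-square over the "yes" and "no" cells of every decile
          stat = (((g["obs_yes"] - g["exp_yes"]) ** 2 / g["exp_yes"]).sum()
                  + ((obs_no - exp_no) ** 2 / exp_no).sum())
          df = len(g) - 2
          return stat, df, chi2.sf(stat, df)

    For example, hosmer_lemeshow(df["complication"], fit.predict(X)) with the fit sketched earlier.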

  • Goodness of fit vs R2

    Interpretation when goodness of fit is acceptable but R² is poor:

    Do we need to include interactions or transform the X variables in the model?

    Do we need to obtain more X variables?

  • Sensitivity & Specificity
    Sensitivity = a/(a+c), false negative rate = c/(a+c)
    Specificity = d/(b+d), false positive rate = b/(b+d)
    Accuracy = W × sensitivity + (1-W) × specificity

                      True pos   True neg
      Classify pos       a          b
      Classify neg       c          d
      Total             a+c        b+d

  • Any good classification rule, including a logistic model, should have high sensitivity & specificity. In logistic regression, we choose a cutpoint, Pc:

    Predict positive if P > Pc; predict negative if P < Pc

  • Diabetes complication: logit(Pi) = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

    Pi = 1/(1 + exp(-logit))

    Compute Pi for all observations and find the value of Pi (call it P0) that maximizes accuracy = 0.5 sensitivity + 0.5 specificity. This is an ROC analysis using the logit (or Pi).
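
    A sketch of that cutpoint search using scikit-learn's ROC utilities (sensitivity = TPR, specificity = 1 - FPR):

      import numpy as np
      from sklearn.metrics import roc_curve

      def best_cutpoint(y, p, w=0.5):
          # scan every candidate threshold on the ROC curve
          fpr, tpr, thresholds = roc_curve(y, p)
          acc = w * tpr + (1 - w) * (1 - fpr)    # weighted accuracy at each threshold
          i = int(np.argmax(acc))
          return thresholds[i], tpr[i], 1 - fpr[i], acc[i]

      # e.g. best_cutpoint(df["complication"], fit.predict(X)) with the fit sketched earlier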

  • ROC for logistic model

  • Diabetes model accuracy
    Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%
    Accuracy = (81.2% + 85.5%)/2 = 83.4%
    Logit cutoff = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61

                   True comp   True no comp
      Pred yes         55            11
      Pred no          14            65
      Total            69            76

  • C statistic (report this). n0 = number of negatives, n1 = number of positives. Form all n0 × n1 (1,0) pairs. A pair is concordant if the predicted P for the Y=1 observation > the predicted P for the Y=0 observation, and discordant if it is smaller.
    C = (num concordant + 0.5 × num ties) / (n0 × n1)
    C = 0.949 for the diabetes complication model
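
    A direct implementation of this pairwise definition (it equals the area under the ROC curve, e.g. sklearn's roc_auc_score):

      import numpy as np

      def c_statistic(y, p):
          y, p = np.asarray(y), np.asarray(p)
          p1, p0 = p[y == 1], p[y == 0]                 # predicted P for events and non-events
          conc = (p1[:, None] > p0[None, :]).sum()      # concordant pairs
          ties = (p1[:, None] == p0[None, :]).sum()     # tied pairs count half
          return (conc + 0.5 * ties) / (len(p1) * len(p0))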

  • The logistic model is also a discriminant model (cf. LDA). Histograms of logit scores for each group.

  • Poisson Regression. Y is a low non-negative integer count: 0, 1, 2, ...
    Model: ln(mean Y) = β0 + β1X1 + β2X2 + ... + βkXk, so mean Y = exp(β0 + β1X1 + β2X2 + ... + βkXk)

    dY/dXi = βi × mean Y, so βi = (dY/dXi)/mean Y and 100βi is the percent change in mean Y per unit change in Xi.
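
    A statsmodels sketch of a Poisson fit on simulated counts (the slides include no Poisson dataset):

      import numpy as np
      import pandas as pd
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      x = rng.normal(size=200)
      y = rng.poisson(np.exp(0.5 + 0.3 * x))     # counts with a log-linear mean

      X = sm.add_constant(pd.DataFrame({"x": x}))
      fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
      print(fit.params)                # βs on the log-mean scale
      print(100 * fit.params["x"])     # ≈ percent change in mean Y per unit change in x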

  • End

  • Equation for the logit = log odds = depression score

    logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income

    odds of depression = e^logit, risk = odds/(1 + odds)

    Coding: Female: 0 for M, 1 for F. Chron ill: 0 for no, 1 for yes. Income in $1000s.
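
    A sketch that plugs values into this equation; the covariate values chosen below are illustrative, not from the slides:

      import numpy as np

      def depression_risk(female, chron_ill, income_k):
          # fitted equation from the slide; income_k is income in $1000s
          logit = -1.8259 + 0.8332 * female + 0.3578 * chron_ill - 0.0299 * income_k
          odds = np.exp(logit)
          return odds / (1 + odds)

      # e.g. a woman with a chronic illness and $20,000 income
      print(depression_risk(female=1, chron_ill=1, income_k=20))   # ≈ 0.23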

  • Example: Depression (y/n). Model for depression:

      term        coeff (β)   SE       p value
      Intercept   -1.8259     0.4495   0.0001
      female       0.8332     0.3882   0.0319
      chron ill    0.3578     0.3300   0.2782
      income      -0.0299     0.0135   0.0268

    Female and chron ill are binary; income is in $1000s.

  • ORs

      term        coeff (β)   OR = e^β
      Intercept   -1.8259     ---
      female       0.8332     2.301
      chron ill    0.3578     1.430
      income      -0.0299     0.971