
Post on 10-Mar-2018


  • Binary and Multinomial Logistic Regression

    Stat 557
    Heike Hofmann

  • Outline

    Logistic Regression: model checking by grouping

    Model selection scores

    Intro to Multinomial Regression

  • Example: Happiness Data

    > summary(happy)
               happy            year           age             sex
     not too happy: 5629   Min.   :1972   Min.   : 18.00   female:28581
     pretty happy :25874   1st Qu.:1982   1st Qu.: 31.00   male  :22439
     very happy   :14800   Median :1990   Median : 43.00
     NA's         : 4717   Mean   :1990   Mean   : 45.43
                           3rd Qu.:2000   3rd Qu.: 58.00
                           Max.   :2006   Max.   : 89.00
                                          NA's   :184.00
              marital              degree                  finrela            health
     divorced     : 6131   bachelor      : 6918   above average    : 8536   excellent:11951
     married      :27998   graduate      : 3253   average          :23363   fair     : 7149
     never married:10064   high school   :26307   below average    :10909   good     :17227
     separated    : 1781   junior college: 2601   far above average:  898   poor     : 2164
     widowed      : 5032   lt high school:11777   far below average: 2438   NA's     :12529
     NA's         :   14   NA's          :  164   NA's             : 4876

    Only consider the extremes: very happy and not too happy individuals.

  • [Figure: product plot of happy by sex; categories: female, male]

    prodplot(data=happy, ~ happy+sex, c("vspine", "hspine"), na.rm=T, subset=level==2)
    # almost perfect independence
    # try a model

    happy.sex |z|) (Intercept) 0.96613 0.02075 46.551

  • Deviance difference is asymptotically $\chi^2$ distributed

    Null hypothesis of independence cannot be rejected

    > anova(happy.sex)
    Analysis of Deviance Table

    Model: binomial, link: logit

    Response: happy

    Terms added sequentially (first to last)

         Df  Deviance Resid. Df Resid. Dev
    NULL                  20428      24053
    sex   1 0.0016906     20427      24053

    > confint(happy.sex)
    Waiting for profiling to be done...
                      2.5 %     97.5 %
    (Intercept)  0.92557962 1.00693875
    sexmale     -0.06064378 0.06332427
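The deviance drop for sex can be referred to a $\chi^2$ distribution with 1 degree of freedom by hand. The following is a minimal sketch that uses only the two numbers printed in the analysis of deviance table; the variable names are chosen here for illustration.

```r
# Deviance drop from adding sex, and its degrees of freedom
# (both copied from the analysis of deviance table).
dev_diff <- 0.0016906
df_diff  <- 1

# Upper-tail chi-squared probability: asymptotic p-value for H0 (independence).
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
p_value
```

The p-value is close to 1, in line with the confidence interval for sexmale, which covers 0.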

  • Age and Happiness

    [Figure: histogram of age (20 to 80) filled by happy (not too happy, very happy), stacked to proportions; y-axis 0.0 to 1.0]

    [Figure: histogram of age (20 to 80) filled by happy (not too happy, very happy); y-axis count, 0 to 400]

    qplot(age, geom="histogram", fill=happy, binwidth=1, data=happy)

    qplot(age, geom="histogram", fill=happy, binwidth=1, position="fill", data=happy)

    # research paper claims that happiness is u-shaped
    happy.age

  • > summary(happy.age)

    Call:
    glm(formula = happy ~ poly(age, 2), family = binomial(), data = na.omit(happy[, c("age", "happy")]))

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.6400  -1.5480   0.7841   0.8061   0.8707

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)     0.96850    0.01571  61.660  < 2e-16 ***
    poly(age, 2)1   6.41183    2.22171   2.886  0.00390 **
    poly(age, 2)2  -7.81568    2.21981  -3.521  0.00043 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 23957  on 20351  degrees of freedom
    Residual deviance: 23936  on 20349  degrees of freedom
    AIC: 23942

    Number of Fisher Scoring iterations: 4

    [Figure: histogram of age (20 to 80) filled by happy (not too happy, very happy), stacked to proportions; y-axis 0.0 to 1.0]

  • # effect of age
    X

    [Figure: pred3 (0.0 to 1.0) against age (20 to 80), by sex (female, male)]
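The code for this slide is cut off in the transcript. As a hedged sketch of how predicted probabilities by age and sex could be computed for a plot of this kind, the block below fits a quadratic age effect plus sex to simulated data. The data frame `sim`, its coefficients, and the names `fit`, `grid` and `pred` are assumptions made for illustration; they are not the lecture's data or objects.

```r
set.seed(557)

# Simulated stand-in for the happiness data (assumption, not the real survey).
n   <- 5000
sim <- data.frame(
  age = sample(18:89, n, replace = TRUE),
  sex = factor(sample(c("female", "male"), n, replace = TRUE))
)
z <- (sim$age - 50) / 20
sim$happy <- rbinom(n, 1, plogis(1 + 0.2 * z - 0.3 * z^2))

# Binary logistic regression with a quadratic age effect and a sex effect.
fit <- glm(happy ~ poly(age, 2) + sex, family = binomial(), data = sim)

# Predicted probabilities on a regular grid of age by sex.
grid <- expand.grid(age = 18:89, sex = levels(sim$sex))
grid$pred <- predict(fit, newdata = grid, type = "response")
head(grid)
```

Plotting `pred` against `age`, coloured by `sex`, gives one fitted curve per sex on the probability scale.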

  • Problems with Deviance

    If X is continuous, the deviance no longer has a $\chi^2$ distribution. Two-fold violations:

    Regarding X as categorical (with lots of categories): we might end up with a contingency table that has lots of small cells, which means that the $\chi^2$ approximation does not hold.

    Increases in sample size most likely increase the number of different values of X. The corresponding contingency table changes size (the asymptotic distribution for the smaller contingency table doesn't exist).

  • ... but

    Differences in deviance between nested models that are only a few degrees of freedom apart are still asymptotically $\chi^2$ distributed.
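As a worked example, the summary of happy.age reports a null deviance of 23957 on 20351 degrees of freedom and a residual deviance of 23936 on 20349. A minimal sketch of the corresponding deviance-difference test follows; the printed deviances are rounded, so the result is approximate.

```r
# Values copied from the summary(happy.age) output (rounded as printed).
null_dev  <- 23957; null_df  <- 20351
resid_dev <- 23936; resid_df <- 20349

dev_diff <- null_dev - resid_dev   # drop in deviance from adding poly(age, 2)
df_diff  <- null_df - resid_df     # number of added parameters

# Asymptotic chi-squared p-value for the quadratic age effect.
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
c(dev_diff = dev_diff, df = df_diff, p_value = p_value)
```

A drop of 21 on 2 degrees of freedom gives a very small p-value, so the quadratic age term improves on the null model.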


  • Model Checking by Grouping

    Group data along estimates, e.g. such that groups are approximately equal in size.

    Partition the smallest $n_1$ estimates into group 1, the second smallest batch of $n_2$ estimates into group 2, ... If we assume $g$ groups, we get the Hosmer-Lemeshow test statistic:

    Problem with deviance: if X is continuous, the deviance no longer has a $\chi^2$ distribution. The approximation assumptions are violated two-fold: even if we regard X as categorical (with lots of categories), we end up with a contingency table that has lots of small cells, which means that the $\chi^2$ approximation does not hold. Secondly, if we increase the sample size, most likely the number of different values of X increases, too, which makes the corresponding contingency table change size (so we cannot even talk about an asymptotic distribution for the smaller contingency table, as it doesn't exist anymore once the sample size is larger).

    Model Checking by Grouping: To get around the problems with the distributional assumption of $G^2$, we can group the data along estimates, e.g. by partitioning on estimates such that groups are approximately equal in size. Partitioning the estimates is done by size: we group the smallest $n_1$ estimates into group 1, the second smallest batch of $n_2$ estimates into group 2, ... If we assume $g$ groups, we get the Hosmer-Lemeshow test statistic

    $$\sum_{i=1}^{g} \frac{\left(\sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} \hat\pi_{ij}\right)^2}{\left(\sum_{j=1}^{n_i} \hat\pi_{ij}\right)\left[1 - \sum_{j} \hat\pi_{ij}/n_i\right]} \sim \chi^2_{g-2},$$

    where $y_{ij}$ is the binary response and $\hat\pi_{ij}$ the fitted probability of observation $j$ in group $i$.
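The grouping recipe and the Hosmer-Lemeshow statistic can be sketched in a few lines of R. The function `hosmer_lemeshow()` below is a hypothetical helper written for this illustration (it is not part of base R and not the lecture's code), and the data are simulated. It groups by rank of the fitted probabilities, so groups are of approximately equal size.

```r
# Hypothetical helper: Hosmer-Lemeshow statistic with g groups of equal size.
hosmer_lemeshow <- function(y, pi_hat, g = 10) {
  # Smallest fitted values go to group 1, the next batch to group 2, ...
  grp <- ceiling(rank(pi_hat, ties.method = "first") * g / length(pi_hat))
  observed <- tapply(y, grp, sum)        # sum_j y_ij
  expected <- tapply(pi_hat, grp, sum)   # sum_j pi_hat_ij
  n_i      <- tapply(y, grp, length)     # group sizes
  stat <- sum((observed - expected)^2 / (expected * (1 - expected / n_i)))
  c(statistic = stat, df = g - 2,
    p.value = pchisq(stat, df = g - 2, lower.tail = FALSE))
}

# Simulated example (assumption): a correctly specified logistic model.
set.seed(557)
n <- 2000
x <- runif(n, 18, 89)
y <- rbinom(n, 1, plogis(-1 + 0.03 * x))
fit <- glm(y ~ x, family = binomial())

hl <- hosmer_lemeshow(y, fitted(fit), g = 10)
hl
```

Changing `g` changes the grouping and can change the test decision, which is why the statistic should not be read in isolation.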

    4.4 Effects of Coding

    Let X be a nominal variable with I categories. An appropriate model would then be:

    $$\log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_i,$$

    where $\beta_i$ is the effect of the $i$th category of X on the log odds, i.e. for each category one effect is estimated. This means that the above model is overparameterized (the last category can be explained in terms of the others). To make the solution unique again, we have to use an additional constraint. In R, $\beta_1 = 0$ by default. Whenever one of the effects is fixed to be zero, this is called contrast coding, as it allows a comparison of all the other effects to the baseline effect. For effect coding the constraint is on the sum of all effects of a variable: $\sum_i \beta_i = 0$. In a binary variable the effects are then the negatives of each other.

    Predictions and inference are independent of the specific coding used and are not affected by changes made in the coding.

    Example: Alcohol and Malformation

    Alcohol during pregnancy is believed to be associated with congenital malformation. The following numbers are from an observational study: after three months of pregnancy, questions on the average number of daily alcoholic beverages were asked; at birth the infant was checked for malformations:

        Alcohol  malformed  absent  P(malformed)
    1   0               48   17066        0.0028
    2   < 1             38   14464        0.0026
    3   1-2              5     788        0.0063
    4   3-5              1     126        0.0079
    5   ≥ 6              1      37        0.0263

    Models m1 and m2 are the same in terms of statistical behavior: deviance, predictions and inference will yield the same numbers. The variable Alcohol is recoded for the second model, giving different estimates for the levels.
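The original code for m1 and m2 is not included in the transcript. As a hedged sketch of the same idea, the block below fits the alcohol counts once with baseline (treatment) coding and once with effect (sum-to-zero) coding. The object names `alc`, `fit_treatment` and `fit_sum` are assumptions, and the level labels are only labels; they do not affect the fit.

```r
# Counts from the Alcohol and Malformation table.
alc <- data.frame(
  alcohol   = factor(c("0", "<1", "1-2", "3-5", ">=6"),
                     levels = c("0", "<1", "1-2", "3-5", ">=6")),
  malformed = c(48, 38, 5, 1, 1),
  absent    = c(17066, 14464, 788, 126, 37)
)

# Baseline (treatment) coding: the effect of the first level is fixed at 0.
fit_treatment <- glm(cbind(malformed, absent) ~ alcohol, family = binomial(),
                     data = alc, contrasts = list(alcohol = "contr.treatment"))

# Effect (sum-to-zero) coding: the effects of all levels sum to 0.
fit_sum <- glm(cbind(malformed, absent) ~ alcohol, family = binomial(),
               data = alc, contrasts = list(alcohol = "contr.sum"))

# The coefficients differ between the two codings ...
coef(fit_treatment)
coef(fit_sum)

# ... but fitted probabilities and deviance agree.
all.equal(unname(fitted(fit_treatment)), unname(fitted(fit_sum)))
c(deviance(fit_treatment), deviance(fit_sum))
```

Under treatment coding the intercept is the log odds of malformation in the first alcohol category; under effect coding it is the average of the category log odds.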

    Alcohol

  • Problems with Grouping

    Different groupings might (and will) lead to different decisions about model fit.

    Hosmer et al. (1997): A COMPARISON OF GOODNESS-OF-FIT TESTS FOR THE LOGISTIC REGRESSION MODEL (on Blackboard)

  • Model Selection

    Ideal Situation:

    Theory for the relationship between response and covariates is well developed; the model is fitted because we want to fine-tune the dependency structure.

  • Model Selection

    Exploratory Modelling

    After an initial data check, visually inspect the relationship between the response and potential covariates. Include the strongest covariates first, build up from there, and check whether additions are significant improvements.

  • Model Selection

    Stepwise Modelling (not recommended by itself)

    Include/exclude variables based on goodness-of-fit criteria such as AIC, adjusted $R^2$, ...

    In Practice: combination of all three methods
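Stepwise selection by AIC can be sketched with `step()` from R's stats package. The example below runs a forward search on simulated data in which only one of three candidate covariates has a true effect; the data frame `sim` and the variable names `x1`, `x2`, `x3` are assumptions made for illustration.

```r
set.seed(557)

# Simulated data (assumption): only x1 influences the response.
n   <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.0 * sim$x1))

# Start from the intercept-only model and add terms while AIC decreases.
null_fit <- glm(y ~ 1, family = binomial(), data = sim)
selected <- step(null_fit,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
                 direction = "forward", trace = 0)

formula(selected)
c(null = AIC(null_fit), selected = AIC(selected))
```

The search is automatic and purely score driven, which is one reason it is best combined with theory and exploratory checks rather than used by itself.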

  • (Forward) Selection

    Results are often not easy to interpret - questionable value?

    Step:  AIC=18176
    cbind(happy, not) ~ sex + poly(age, 4) + marital + degree + finrela + degree:finrela + poly(age, 4):degree + poly(age, 4):finrela + sex:finrela + sex:degree

                           Df Deviance   AIC
    <none>                       16714 18176
    + sex:marital           4    16707 18177
    + marital:degree       16    16688 18182
    + poly(age, 4):marital 16    16688 18182
    + sex:poly(age, 4)      4    16714 18184
    + marital:finrela      16    16693 18187
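The AIC column of the step table can be checked by hand: adding a term changes the AIC by the change in deviance plus twice the number of added parameters. A minimal sketch using only the numbers from the table (the data frame `cand` is a name chosen here):

```r
# Current model, as printed in the step output.
base_dev <- 16714
base_aic <- 18176

# Candidate additions: degrees of freedom and resulting deviance from the table.
cand <- data.frame(
  term     = c("sex:marital", "marital:degree", "poly(age, 4):marital",
               "sex:poly(age, 4)", "marital:finrela"),
  df       = c(4, 16, 16, 4, 16),
  deviance = c(16707, 16688, 16688, 16714, 16693)
)

# AIC after adding a term = current AIC + change in deviance + 2 * added df.
cand$aic <- base_aic + (cand$deviance - base_dev) + 2 * cand$df
cand
```

Every candidate ends up with a higher AIC than the current 18176, so no further term is added at this step.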

  • (Forward) Selection

    Step:  AIC=18176
    cbind(happy, not) ~ sex + poly(age, 4) + marital + degree + finrela + degree:finrela + poly(age, 4):degree + poly(age, 4):finrela + sex:finrela + sex:degree

                 Df Deviance   AIC
    <none>             16714 18176
    - sex:degree