BUS41100 Applied Regression Analysis, Week 8: Generalized Linear Models


  • BUS41100 Applied Regression Analysis

    Week 8: Generalized Linear Models

    Logistic regression, classification, generalized linear models

    Max H. Farrell
    The University of Chicago Booth School of Business

  • Discrete Responses

    Today we will look at discrete responses

    I Binary: Y = 0 or 1
      I Buy or don't buy

    I More categories: Y = 0, 1, 2, 3, 4
      I Unordered: buy product A, B, C, D, or nothing
      I Ordered: rate 1-5 stars

    I Count: Y = 0, 1, 2, 3, 4, . . .
      I How many products bought in a month?

    The goal varies depending on the type of response.

    Our tool will be the Generalized Linear Model (GLM).


  • Binary response data

    We'll spend the most time on binary data: Y = 0 or 1.

    I By far the most common application

    I Illustrate all the ideas

    The goal is generally to predict the probability that Y = 1.

    You can then do classification based on this estimate.

    I Buy or not buy
    I Win or lose
    I Sick or healthy
    I Pay or default
    I Thumbs up or down

    Relationship type questions are interesting too

    I How does an ad affect sales?
    I What type of customer is more likely to buy?

  • Y is an indicator: Y = 0 or 1.

    The conditional mean is thus

    E[Y |X] = p(Y = 1|X) × 1 + p(Y = 0|X) × 0 = p(Y = 1|X).

    So the mean function is a probability:

    I We need a model that gives mean/probability values

    between 0 and 1.

    We'll use a transform function that takes the right-hand side of

    the model (xβ) and gives back a value between zero and one.


  • The binary choice model is

    p(Y = 1|X1, . . . , Xd) = S(β0 + β1X1 + · · · + βdXd)

    where S is a function that increases in value from zero to one.

    [Figure: an S-shaped curve of S(xβ), rising from 0 to 1 as xβ runs from -6 to 6.]

    S−1(p(Y = 1|X1, . . . , Xd)) = β0 + β1X1 + · · · + βdXd

    Linear!

  • There are two main functions that are used for this:

    I Logistic Regression: S(z) = e^z / (1 + e^z).

    I Probit Regression: S(z) = pnorm(z) = Φ(z).

    Both are S-shaped and take values in (0, 1).

    Logit is usually preferred, but

    they result in practically the same fit.
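That "practically the same fit" claim is easy to check with a small simulation; everything below (sample size, coefficients, variable names) is made up for illustration:

```r
# Simulate binary data from a logistic model, then fit both links.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))  # true probabilities are logistic

logit.fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit.fit <- glm(y ~ x, family = binomial(link = "probit"))

# The fitted probabilities track each other almost perfectly:
cor(fitted(logit.fit), fitted(probit.fit))
```

The coefficients themselves differ (the two S curves have different scales), but the fitted probabilities are nearly identical.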


  • Logistic regression

    We'll use logistic regression, such that

    p(Y = 1|X1 . . . Xd) = S(xβ) = exp[β0 + β1X1 + . . . + βdXd] / (1 + exp[β0 + β1X1 + . . . + βdXd]).

    These models are easy to fit in R:

    glm(Y ~ X1 + X2, family=binomial)

    I g is for generalized; binomial indicates Y = 0 or 1.

    I Otherwise, glm uses the same syntax as lm.

    I The logit link is more common, and is the default in R.


  • Interpretation

    Model the probability:

    p(Y = 1|X1 . . . Xd) = S(xβ) = exp[β0 + β1X1 + . . . + βdXd] / (1 + exp[β0 + β1X1 + . . . + βdXd]).

    Invert to get the linear log-odds ratio:

    log( p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) ) = β0 + β1X1 + . . . + βdXd.

    Therefore:

    e^βj = [ p(Y = 1|Xj = x + 1) / p(Y = 0|Xj = x + 1) ] / [ p(Y = 1|Xj = x) / p(Y = 0|Xj = x) ]


  • Example:

    I Suppose p(Y = 1|X1 . . . Xd) = 0.8 and p(Y = 0|X1 . . . Xd) = 0.2.

    I Then p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) = 0.8/0.2 = 4

    Win is 4 times more likely!

    Therefore:

    I e^βj = change in the odds for a one-unit increase in Xj.

    I . . . holding everything else constant, as always!

    I Always e^βj > 0, and e^0 = 1. Why?

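The odds arithmetic can be checked directly in R, using the probabilities from the example above (the comment about `exp(coef(fit))` refers to a generic fitted logistic `glm`, not a specific object from these slides):

```r
# Odds from the example probabilities, and what exp(coef) means.
p <- 0.8
odds <- p / (1 - p)   # 0.8/0.2 = 4: a win is 4 times as likely as a loss
odds

# For any fitted logistic glm, exp(coef(fit)) is the multiplicative
# change in such odds per one-unit increase in each X.
exp(0)                # a zero coefficient leaves the odds unchanged (ratio 1)
```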

  • Example: we have Las Vegas betting point spreads for 553

    NBA games and the resulting scores.

    We can use logistic regression of scores onto spread to predict

    the probability of the favored team winning.

    I Response: favwin=1 if favored team wins.

    I Covariate: spread is the Vegas point spread.

    [Figures: a histogram of spread (0 to 40), and favwin (0/1) plotted against spread, split by favwin=1 vs. favwin=0.]

  • This is a weird situation where we assume no intercept.

    I Most likely the Vegas betting odds are efficient.

    I A spread of zero implies p(win) = 0.5 for each team.

    We get this out of our model when β0 = 0:

    p(win) = exp[0]/(1 + exp[0]) = 1/2.

    The model we want to fit is thus

    p(favwin|spread) = exp[β1 × spread] / (1 + exp[β1 × spread]).


  • R output from glm:

    > nbareg <- glm(favwin ~ spread - 1, family=binomial)
    > summary(nbareg) ## abbreviated output

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    spread 0.15600 0.01377 11.33 <2e-16 ***

  • Interpretation

    The fitted model is

    p(favwin|spread) = exp[0.156 × spread] / (1 + exp[0.156 × spread]).

    [Figure: fitted P(favwin) rising from 0.5 toward 1.0 as spread runs from 0 to 30.]

  • Convert to odds-ratio:

    > exp(coef(nbareg))

    spread

    1.168821

    I A 1 point increase in the spread means the favorite is

    1.17 times more likely to win

    I What about a 10-point increase:

      exp(10*coef(nbareg)) ≈ 4.75 times more likely

    Uncertainty:

    > exp(confint(nbareg))

    Waiting for profiling to be done...

    2.5 % 97.5 %

    1.139107 1.202371

    Code: exp(cbind(coef(logit.reg), confint(logit.reg)))


  • Many things are basically the same as before:

    I extractAIC(reg, k=log(n)) will give your BICs.

    I The predict function works as before, but add

    type = "response" to get p = exp[xb]/(1 + exp[xb])

    (otherwise it just returns the linear function xb).

    But some things are different:

    I Inference based on z not t

    I Without sums of squares there are no R2, anova, or

    F -tests!

    I Techniques for residual diagnostics and model checking

    are different (but we'll not worry about that today).

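The `type = "response"` point can be seen by computing the probabilities by hand from the reported coefficient; the `predict` call is shown commented because it assumes the fitted `nbareg` object from the earlier slides:

```r
# Win probability at a few spreads, using the slides' coefficient 0.156.
# With the fitted model this would be:
#   predict(nbareg, newdata = data.frame(spread = c(1, 5, 10)),
#           type = "response")
b <- 0.156
spread <- c(1, 5, 10)
exp(b * spread) / (1 + exp(b * spread))  # probabilities, not the linear xb
```

Without `type = "response"`, `predict` returns b × spread on the log-odds scale instead.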

  • We could consider other models ... and compare with BIC!

    Our efficient Vegas model:

    > extractAIC(nbareg, k=log(553))

    [1] 1.000 534.287

    A model that includes non-zero intercept:

    > extractAIC(glm(favwin ~ spread, family=binomial), k=log(553))

    [1] 2.0000 540.4333

    What if we throw in home-court advantage?

    > extractAIC(glm(favwin ~ spread + favhome, family=binomial), k=log(553))

    [1] 3.0000 545.637

    The simplest model is best. (The model probabilities are 0.95, 0.05, and zero.)


  • New predictions

    Chicago vs. Sacramento: the spread is SK by 1.

    p(CHI win) = 1 / (1 + exp[0.156 × 1]) = 0.47

    I Atlanta at NY (-1.5): P (favwin) = 0.56

    I Orlando (-7.5) at Washington: P (favwin) = 0.76

    I Memphis at Cleveland (-1): P (favwin) = 0.53

    I New Jersey at Philadelphia (-2): P (favwin) = 0.58

    I Golden State at Minnesota (-2.5): P (favwin) = 0.60

    I Miami at Dallas (-2.5): P (favwin) = 0.60

    I Charlotte at Milwaukee (-3.5): P (favwin) = 0.63


  • Example: a common business application of logistic regression

    is in evaluating the credit quality of (potential) debtors.

    I Take a list of borrower characteristics.

    I Build a prediction rule for their credit.

    I Use this rule to automatically evaluate applicants

    (and track your risk profile).

    You can do all this with logistic regression, and then use the

    predicted probabilities to build a classification rule.


  • We have data on 1000 loan applicants at German community

    banks, and judgment of the loan outcomes (good or bad).

    The data has 20 borrower characteristics, including

    I credit history (5 categories),

    I housing (rent, own, or free),

    I the loan purpose and duration,

    I and installment rate as a percent of income.

    Unfortunately, many of the columns in the data file are coded

    categorically in a very opaque way. (Most are factors in R.)


  • A word of caution

    Watch out for perfect prediction!

    > too.good

  • We can use forward stepwise regression to build a model.

    > credit <- ...
    > train <- ...
    > null <- ...
    > full <- ...
    > fwdBIC <- ...

  • Or we can use LASSO to find a model

    > cvfit <- ...
    > betas <- ...
    > model.1se <- ...
    > colnames(X[,model.1se])

    [1] "checkingstatus1A13" "checkingstatus1A14" "duration2"

    [4] "history3A31" "history3A34" "purpose4A41"

    [7] "purpose4A43" "purpose4A46" "amount5"

    [10] "savings6A64" "savings6A65" "employ7A74"

    [13] "installment8" "status9A93" "others10A103"

    [16] "property12A124" "age13" "otherplans14A143"

    [19] "housing15A152" "foreign20A202"

    Trees are very popular in classification. . .

    > tree.model

  • Classification

    A common goal with logistic regression is to classify the inputs

    depending on their predicted response probabilities.

    In our credit scoring example,

    I we might want to classify the German borrowers as having

    good or bad credit (i.e., do we loan to them?).

    A simple classification rule is to say that anyone with

    I p(good|x) > 0.5 can get a loan,
    I and the rest cannot.

    (As usual: classification is a huge field, and we're only scratching

    the surface here.)
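The cut-off rule is one line of R. A minimal sketch with made-up predicted probabilities and outcomes (in practice `phat` would come from `predict(fit, newdata, type="response")` on a fitted logistic model):

```r
# Classify with the 0.5 cut-off and compute the misclassification rate.
phat   <- c(0.81, 0.42, 0.67, 0.12)   # made-up predicted p(good | x)
yvalid <- c(1, 1, 0, 0)               # made-up observed outcomes

yhat <- as.numeric(phat >= 0.5)       # loan if p(good|x) >= 0.5
mean(abs(yhat - yvalid))              # misclassification rate, not MSE
```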

  • Comparing with the validation set:

    > predBIC <- ...
    > errorBIC <- ... (predBIC >= .5)

    (LASSO takes a few extra lines)

    Misclassification rates (not MSE!)

    > c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),

    + lasso=mean(abs(errorLasso.1se)),

    + tree=mean(abs(errorTree)))

    full BIC lasso tree

    0.268 0.240 0.248 0.252

    I Our BIC model classifies 76% of borrowers correctly

    I LASSO and tree slightly worse . . . both better than full


  • Classification risk & ROC curves

    You can also do classification with cut-offs other than 1/2.

    I Suppose the risk associated with one action is higher than

    for the other.

    I You'll want to have p > 0.5 of a positive outcome before

    taking the risky action.

    We want to know:

    I What happens as the cut-off changes?

    I Is there a best cut-off?

    One way is to look at ROC curves.


  • ROC: Receiver Operating Characteristic

    > library("pROC")

    > roc.full <- ...
    > coords(roc.full, x=0.5)

    threshold specificity sensitivity

    0.5000000 0.8662791 0.4358974

    [Figure: ROC curve plotting Sensitivity against Specificity, with an X marking the cutoff = 0.5 point.]

    Sensitivity: true positive rate.

    Specificity: true negative rate.

    Many related names: recall, precision,

    positive predictive value, . . .
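Since the `roc.full` line above lost its assignment in extraction, here is a self-contained pROC sketch on made-up scores (all numbers illustrative, not from the credit data):

```r
# ROC basics with the pROC package.
library(pROC)
y    <- c(0, 0, 1, 1, 1, 0, 1, 0)               # made-up outcomes
phat <- c(0.2, 0.4, 0.7, 0.8, 0.55, 0.3, 0.9, 0.6)  # made-up fitted probs

r <- roc(y, phat)     # response first, predictor second
auc(r)                # area under the curve
coords(r, "best")     # cut-off balancing sensitivity and specificity
```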

  • What else can we do?

    > rbind( coords(roc.full, "best"),

    + coords(roc.BIC, "best"), coords(roc.lasso, "best") )

    threshold specificity sensitivity

    [1,] 0.3210536 0.7558140 0.6538462

    [2,] 0.4065372 0.8313953 0.6153846

    [3,] 0.2030214 0.6279070 0.7820513

    Is misclassification improved?

    > errorBIC <- ...        ## reclassify at cutoff xy.BIC.best[1]
    > errorfull <- ...       ## cutoff xy.full.best[1]
    > errorLasso.1se <- ...  ## cutoff xy.lasso.best[1]

    > c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),

    + lasso=mean(abs(errorLasso.1se)))

    full BIC lasso

    0.276 0.236 0.324

    Different goals, different results.


  • > A bunch of code... see week8-Rcode.R

    [Figure: ROC curves (Sensitivity vs. Specificity) for the Full, BIC, and Lasso models; X marks each curve at cutoff = 0.5, Y at its best cutoff.]

  • Testing ROC curves

    I Not really necessary.

    > roc.test(roc.full,roc.BIC)

    DeLong's test for two correlated ROC curves

    data: roc.full and roc.BIC

    Z = -0.72882, p-value = 0.4661

    alternative hypothesis: true difference in AUC is not equal to 0

    sample estimates:

    AUC of roc1 AUC of roc2

    0.7492546 0.7700507

    > roc.test(roc.lasso,roc.BIC)

    Z = -0.49662, p-value = 0.6195

    AUC of roc1 AUC of roc2

    0.7573047 0.7700507

    Not surprising, given the picture.

  • Extensions to Logistic Regression

    The obvious step beyond 0/1 outcomes is more categories

    Y = 0, 1, 2, . . . ,M.

    These categories can be:

    I Unordered: Buy product A, B, C, or nothing

    I Ordinal: Low, medium, high risk

    I Cardinal: Rate 1-5 stars

    Predict choices/actions based on characteristics

    . . . then do classification into multiple categories.

    Our main tool will be multinomial logistic regression, but there

    are others. For count data, Poisson regression may be better.


  • Discrete Choice

    Units (firms, customers, etc) face a set of choices, labeled

    {0, 1, 2, . . . ,M}. Or we want to classify units into one of these categories.

    Choose based on utility (happiness/profit/etc)

    I Utility unit i gets from option m: Ui,m = Xi,m βm + εi,m.

    I Chooses m if Ui,m is the biggest.

    I So we want to model probabilities like P[Ui,m > Ui,m′]

    Gives rise to a set of log-odds, one for each category:

    log( P[Yi = m | Xi] / P[Yi = 0 | Xi] ) = β0,m + β1,m Xi,

    just like regular (binary) logistic regression.
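One common way to fit this set of log-odds in R is `multinom` from the nnet package (the slides' own fitting call is not shown, so this is a generic sketch on made-up data):

```r
# Multinomial logit: one row of log-odds coefficients per
# non-baseline category, just as in the equation above.
library(nnet)
choice <- factor(c("A", "B", "nothing", "A", "B", "nothing", "A", "A"))
x      <- c(1.2, 0.5, -0.3, 1.0, 0.4, -0.5, 1.1, 0.9)

fit <- multinom(choice ~ x, trace = FALSE)
coef(fit)       # log-odds of each category vs. the baseline level
exp(coef(fit))  # odds ratios vs. the baseline, as in binary logit
```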

  • Words of warning!

    As usual, we're skipping a lot of statistical details. But now,

    we're skipping economic/substantive stuff too:

    I What are we assuming about the agents' behavior?

    I What are the random shocks εi,m? Are the X variables

    specific to the product (e.g. price) or the person (e.g.

    income)? Are the choices structured (e.g. ordered)?

    Depending on the situation, other models may be appropriate.

    You'll hear about ordinal, conditional, nested . . . Even harder

    to interpret!

    I Multinomial logit is a good all-purpose choice, often just

    as good as more specialized models.


  • Remember: The more complex the model, the more the

    assumptions matter!

    It's worth talking about the big one here:

    The Independence of Irrelevant Alternatives

    I If m is chosen from {0, 1, 2, . . . ,M}, then m is chosen
      from any subset of {0, 1, 2, . . . ,M}.

    I The log-odds of m vs. 0 doesn't depend on other options.

    I So if we change/remove options from the choice set, the

    odds of choosing m vs. 0 do not change.

    Does that seem realistic?

    I Does it matter? Maybe not for classification.

    What to do?

    I For now, we'll ignore the problem, but you can test if IIA

    holds, then run a different model if need be.

  • Example: Choice of cracker brand

    > names(cracker)
    > table(cracker$choice)

    keebler nabisco private sunshine

    226 1792 1035 239

    > cracker$choice

  • Descriptives and Visualizations

    Not as simple as linear regression (no lines to graph) but still

    an important step.

    I Look at the price of one option when others are chosen

    > colMeans(cracker[cracker$choice!="sunshine",2:5])

    pr.private pr.keebler pr.nabisco pr.sunshine

    68.01 112.53 107.74 96.44

    > colMeans(cracker[cracker$choice=="sunshine",2:5])

    pr.private pr.keebler pr.nabisco pr.sunshine

    68.90 113.40 110.22 86.26

    [Figures: boxplots of Sunshine Price and Private Price (roughly 40 to 120) by Choice: private, keebler, nabisco, sunshine.]

  • I The price of all the options when Sunshine is chosen:

    [Figures: histograms of the Private, Keebler, Nabisco, and Sunshine prices (40 to 140) for purchases where Sunshine was chosen.]

  • On to regression analysis

    Deceptively easy in R:

    > mlogit.fit <- ...
    > exp(coef(mlogit.fit))

    (Intercept) pr.private pr.keebler pr.nabisco pr.sunshine

    keebler 20.82 1.02 0.942 1.005 1.002

    nabisco 13.22 1.02 1.007 0.969 0.994

    sunshine 1.12 1.01 1.023 1.007 0.940

    Interpretation is just as easy (hard?) as regular logistic:

    I Why is b > 1 for pr.private always?

    I Own price coefficient < 1?

    How do we feel about these results and the IIA assumption?


  • > exp(coef(mlogit.fit))

    (Intercept) pr.private pr.keebler pr.nabisco pr.sunshine

    keebler 20.82 1.02 0.942 1.005 1.002

    nabisco 13.22 1.02 1.007 0.969 0.994

    sunshine 1.12 1.01 1.023 1.007 0.940

    What if Keebler raises their price by $0.05?

    I Lowers the chances of buying Keebler (vs store brand):
      Multiply the odds-ratio by 0.942^5 = 0.74. (why to the fifth?)

      P[ keebler | $1.05] / P[ private | $1.05]
        = 0.74 × ( P[ keebler | $1.00] / P[ private | $1.00] )

    I Or the other way: 1/0.74 = 1.35, so people are 35% more

    likely to buy keebler after a 5 cent price drop.

    I But what do the 1.007 and 0.994 represent?
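The cross-price arithmetic above is worth verifying in R, using the reported odds-ratio 0.942:

```r
# Why to the fifth? Prices move in 1-unit steps, so a 5-unit change
# multiplies the per-unit odds-ratio five times.
or.keebler <- 0.942   # reported odds-ratio per unit increase in pr.keebler
or.keebler^5          # 5-unit price increase: odds multiplied by ~0.74
1 / or.keebler^5      # 5-unit price drop: ~1.35, i.e. 35% more likely
```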


  • Do advertisements increase sales?

    > mlogit.fit2 <- ...
    > exp(coef(mlogit.fit2))

    (Intercept) pr.private pr.keebler pr.nabisco pr.sunshine

    keebler 12.719 1.02 0.949 1.000 1.002

    nabisco 11.416 1.02 1.010 0.967 0.992

    sunshine 0.981 1.01 1.016 1.012 0.940

    ad.privateTRUE ad.keeblerTRUE ad.nabiscoTRUE ad.sunshineTRUE

    keebler 1.06 2.34 0.925 0.932

    nabisco 1.07 2.03 1.285 0.885

    sunshine 1.06 1.27 2.086 1.183

    How much more likely are people to buy keebler (vs. store

    brand) if they run an ad?

    I More than twice as likely? Odds ratio multiplied by 2.34.

    What about the other odds-ratios? What is the 2.03?

    I Keebler running an ad increases Nabisco sales?

    I Ads raise awareness of crackers? Try a model with any ad.

  • We can do all the usual things

    I Predictions

    > mlogit.probs <- ...
    > dim(mlogit.probs)

    [1] 3292 4

    I Classification

    I Cut-off rule problem:
      what if two categories have P[Y = m|X] > c?
    I One simple idea: highest predicted probability:

    > colnames(mlogit.probs)[apply(mlogit.probs,1,which.max)]

    I Model selection

    I AIC, BIC, LASSO, etc

    These may or may not make sense in discrete choice models.

  • Count Data

    Poisson regression is a GLM designed for count data. How?

    Remember the Poisson distribution:

    [Figure: Poisson probability mass functions over counts 0 to 10, for lambda = 1, 2, 3, 7.]

    I Integers only

    I One parameter: λ > 0

    I λ = mean and variance

    Extensions: Negative Binomial Model, Zero-Inflated Models
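The Poisson facts above are easy to see numerically with R's built-in `dpois`:

```r
# P(Y = k) = exp(-lambda) * lambda^k / k!
dpois(0:3, lambda = 2)            # probabilities of counts 0, 1, 2, 3

# The one parameter lambda is both the mean and the variance:
k <- 0:100
sum(k * dpois(k, 2))              # mean: ~2
sum((k - 2)^2 * dpois(k, 2))      # variance: ~2
```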

  • Example: Number of take-over bids (NUMBIDS) received by

    126 US firms during 1978-85.

    [Figure: histogram of takeover$NUMBIDS, counts 0 to 10.]

    Looks like Poisson!

    Covariates include

    I Continuous variables:

    BIDPREM, INSTHOLD,

    SIZE

    I Indicators: LEGLREST,

    FINREST, REGULATN,

    WHTKNGHT


  • Browsing the data

    I More bids on average for the indicators being Yes:

    LEGLREST REALREST FINREST WHTKNGHT REGULATN

    No 1.46 1.69 1.70 1.18 1.67

    Yes 2.11 1.96 2.08 2.12 1.91

    I Pattern in the continuous variables?

    [Figures: the continuous covariates plotted against Number of Bids (0 to 10): bid price / stock price before bid, % held by institutions, and total assets.]

  • Poisson regression results:

    > poisson.fit <- ...
    > summary(poisson.fit)

    Coefficients:

    Estimate Std. Error z value Pr(>|z|)

    (Intercept) 1.10489 0.53443 2.067 0.03870 *

    BIDPREM -0.76982 0.37676 -2.043 0.04103 *

    INSTHOLD -0.13180 0.40027 -0.329 0.74195

    SIZE 0.03428 0.01865 1.839 0.06598 .

    LEGLREST 0.25191 0.15005 1.679 0.09318 .

    REALREST -0.02024 0.17726 -0.114 0.90909

    FINREST 0.09190 0.21688 0.424 0.67176

    WHTKNGHT 0.49821 0.15822 3.149 0.00164 **

    REGULATN 0.01178 0.16297 0.072 0.94237

    The expected log(NUMBIDS) increases by 0.498 ≈ 1/2 if a
    White Knight is involved. What?

  • Just like logistic regression, exponentiating aids interpretation.

    > exp(coef(poisson.fit))

    (Intercept) BIDPREM INSTHOLD SIZE LEGLREST

    3.0188810 0.4630954 0.8765198 1.0348739 1.2864796

    REALREST FINREST WHTKNGHT REGULATN

    0.9799633 1.0962533 1.6457787 1.0118518

    The incidence rate is 1.65 times higher if a White Knight is involved.

    The incidence rate ratio (IRR, or risk ratio) is almost like the

    odds ratio from logistic regression: βj > 0 ⟺ IRR > 1.

    I The number of bids (the count we're modeling) is

    expected to be 65% higher if a White Knight is involved.

    I For a one-unit increase in the % held by institutions,

    number of bids falls by 1 − 0.88 = 12%.
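Both IRR statements come straight from exponentiating the reported coefficients:

```r
# Incidence rate ratios from the Poisson coefficients in the summary.
exp(0.49821)    # WHTKNGHT: rate ~1.65 times higher, i.e. ~65% more bids
exp(-0.13180)   # INSTHOLD: rate multiplied by ~0.88, i.e. ~12% fewer bids
```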

  • Final thoughts

    What did we accomplish today?

    I Showed how to use glm to handle different types of

    outcomes.

    I Explored logistic regression in some detail

    I Introduced other important models

    What didn't we do?

    I Explore the models in detail, or understand how they work

    I Discuss assumptions, diagnostics, and fixes

    All of which exist for glm.

    Be careful out there!

    I As usual, refer to another class/reference.