BUS41100 Applied Regression Analysis
Week 8: Generalized Linear Models
Logistic regression, classification, generalized linear models
Max H. Farrell
The University of Chicago Booth School of Business
Discrete Responses
Today we will look at discrete responses
I Binary: Y = 0 or 1
  I Buy or don’t buy
I More categories: Y = 0, 1, 2, 3, 4
  I Unordered: buy product A, B, C, D, or nothing
  I Ordered: rate 1–5 stars
I Count: Y = 0, 1, 2, 3, 4, . . .
  I How many products bought in a month?
The goal varies depending on the type of response.
Our tool will be the Generalized Linear Model (GLM).
Binary response data
We’ll spend the most time on binary data: Y = 0 or 1.
I By far the most common application
I Illustrate all the ideas
The goal is generally to predict the probability that Y = 1.
You can then do classification based on this estimate.
I Buy or not buy
I Win or lose
I Sick or healthy
I Pay or default
I Thumbs up or down
Relationship-type questions are interesting too:
I How does an ad affect sales?
I What type of customer is more likely to buy?
Y is an indicator: Y = 0 or 1.
The conditional mean is thus
E[Y |X] = p(Y = 1|X)× 1 + p(Y = 0|X)× 0
= p(Y = 1|X).
So the mean function is a probability:
I We need a model that gives mean/probability values
between 0 and 1.
We’ll use a transform function that takes the right-hand side of
the model (x′β) and gives back a value between zero and one.
The binary choice model is
p(Y = 1|X1, . . . , Xd) = S(β0 + β1X1 + · · ·+ βdXd)
where S is a function that increases in value from zero to one.
[Figure: the S-shaped curve S(x′β), rising from 0 to 1 as x′β increases]

S−1(p(Y = 1|X1, . . . , Xd)) = β0 + β1X1 + · · ·+ βdXd ← Linear!
There are two main functions that are used for this:
I Logistic Regression: S(z) = e^z / (1 + e^z).
I Probit Regression: S(z) = pnorm(z) = Φ(z).
Both are S-shaped and take values in (0, 1).
Logit is usually preferred, but
they result in practically the same fit.
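As a quick numeric sketch of these two links (in Python rather than R, purely for illustration), both are S-shaped, equal 1/2 at z = 0, and stay strictly inside (0, 1):

```python
from math import erf, exp, sqrt

def logistic(z):
    """Logit link: S(z) = e^z / (1 + e^z)."""
    return exp(z) / (1 + exp(z))

def probit(z):
    """Probit link: S(z) = Phi(z), the standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Both links are increasing, hit 0.5 at z = 0, and never leave (0, 1).
for z in (-4, -1, 0, 1, 4):
    print(f"z={z:+d}  logit={logistic(z):.3f}  probit={probit(z):.3f}")
```

The fits differ only in how fast the tails approach 0 and 1, which is why the two models give practically the same answers in most applications.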
Logistic regression
We’ll use logistic regression, such that
p(Y = 1|X1 . . . Xd) = S(X′β) = exp[β0 + β1X1 + . . .+ βdXd] / (1 + exp[β0 + β1X1 + . . .+ βdXd]).
These models are easy to fit in R:
glm(Y ~ X1 + X2, family=binomial)
I “g” is for generalized; binomial indicates Y = 0 or 1.
I Otherwise, glm uses the same syntax as lm.
I The “logit” link is more common, and is the default in R.
Interpretation
Model the probability:
p(Y = 1|X1 . . . Xd) = S(X′β) = exp[β0 + β1X1 + . . .+ βdXd] / (1 + exp[β0 + β1X1 + . . .+ βdXd]).
Invert to get linear log odds ratio:
log( p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) ) = β0 + β1X1 + . . .+ βdXd.
Therefore:
e^βj = [ p(Y = 1|Xj = x+ 1) / p(Y = 0|Xj = x+ 1) ] / [ p(Y = 1|Xj = x) / p(Y = 0|Xj = x) ]
Example:
I Suppose p(Y = 1|X1 . . . Xd) = 0.8 and p(Y = 0|X1 . . . Xd) = 0.2
I Then the odds are p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) = 0.8/0.2 = 4
⇒ “Win” is 4 times more likely!
Therefore:
I e^βj = change in the odds for a one-unit increase in Xj.
I . . . holding everything else constant, as always!
I Always e^βj > 0, and e^0 = 1. Why?
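A quick numeric check of this odds arithmetic (Python for illustration; the coefficient value here is hypothetical, not from any fitted model):

```python
from math import exp

# If p(Y=1|X) = 0.8 then p(Y=0|X) = 0.2, so the odds of Y=1 are 0.8/0.2 = 4:
# a "win" is 4 times as likely as a "loss".
p = 0.8
odds = p / (1 - p)

# A one-unit increase in Xj multiplies these odds by exp(beta_j).
beta_j = 0.5                         # hypothetical coefficient
new_odds = odds * exp(beta_j)
new_p = new_odds / (1 + new_odds)    # convert odds back to a probability
print(odds, new_odds, new_p)
```

Note exp(beta_j) is always positive, so a coefficient can shrink the odds (beta_j < 0) or leave them unchanged (beta_j = 0), but the odds never go negative.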
Example: we have Las Vegas betting point spreads for 553
NBA games and the resulting scores.
We can use logistic regression of scores onto spread to predict
the probability of the favored team winning.
I Response: favwin=1 if favored team wins.
I Covariate: spread is the Vegas point spread.
[Figure: histogram of spread (0 to 40) split by favwin = 1 vs. favwin = 0, and a plot of favwin (0/1) against spread]
This is a weird situation where we assume no intercept.
I Most likely the Vegas betting odds are efficient.
I A spread of zero implies p(win) = 0.5 for each team.
We get this out of our model when β0 = 0
p(win) = exp[β0]/(1 + exp[β0]) = 1/2.
The model we want to fit is thus
p(favwin|spread) = exp[β1 × spread] / (1 + exp[β1 × spread]).
R output from glm:
> nbareg <- glm(favwin~spread-1, family=binomial)
> summary(nbareg) ## abbreviated output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
spread 0.15600 0.01377 11.33 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Null deviance: 766.62 on 553 degrees of freedom
Residual deviance: 527.97 on 552 degrees of freedom
AIC: 529.97
Interpretation
The fitted model is
p(favwin|spread) = exp[0.156 × spread] / (1 + exp[0.156 × spread]).
[Figure: fitted P(favwin) rising from 0.5 toward 1.0 as spread goes from 0 to 30]
Convert to odds-ratio:
> exp(coef(nbareg))
spread
1.168821
I A 1 point increase in the spread means the favorite is
1.17 times more likely to win
I What about a 10-point increase:
exp(10*coef(nbareg)) ≈ 4.75 times more likely
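These numbers are easy to verify by hand from the fitted coefficient (a Python sketch purely for illustration; b1 = 0.156 is taken from the glm output above):

```python
from math import exp

b1 = 0.156  # fitted spread coefficient from summary(nbareg)

def p_favwin(spread):
    """Fitted model: exp(b1*spread) / (1 + exp(b1*spread))."""
    z = b1 * spread
    return exp(z) / (1 + exp(z))

print(round(p_favwin(0), 3))    # 0.5: a zero spread is a coin flip
print(round(p_favwin(10), 3))   # ~0.826 for a 10-point favorite

# Odds-ratio view: each point of spread multiplies the win odds by exp(b1).
print(round(exp(b1), 3))        # ~1.169, matching exp(coef(nbareg))
print(round(exp(10 * b1), 2))   # ~4.76 for a 10-point increase
```

The no-intercept restriction is visible in the first line: spread = 0 forces the probability to exactly 1/2, which is the “efficient Vegas” assumption.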
Uncertainty:
> exp(confint(nbareg))
Waiting for profiling to be done...
2.5 % 97.5 %
1.139107 1.202371
Code: exp(cbind(coef(logit.reg), confint(logit.reg)))
Many things are basically the same as before:
I extractAIC(reg, k=log(n)) will give your BICs.
I The predict function works as before, but add
type = "response" to get p̂ = exp[x′b]/(1 + exp[x′b])
(otherwise it just returns the linear function x′b).
But some things are different:
I Inference based on z not t
I Without sums of squares there are no R2, anova, or
F -tests!
I Techniques for residual diagnostics and model checking
are different (but we’ll not worry about that today).
We could consider other models ... and compare with BIC!
Our “efficient Vegas” model:
> extractAIC(nbareg, k=log(553))
[1] 1.000 534.287
A model that includes non-zero intercept:
> extractAIC(glm(favwin ~ spread, family=binomial), k=log(553))
[1] 2.0000 540.4333
What if we throw in home-court advantage?
> extractAIC(glm(favwin ~ spread + favhome, family=binomial), k=log(553))
[1] 3.0000 545.637
The simplest model is best.
(The model probabilities are 0.95, 0.05, and zero.)
New predictions
Chicago vs Sacramento spread is SK by 1
p(CHI win) = 1 / (1 + exp[0.156× 1]) = 0.46
I Atlanta at NY (-1.5): P (favwin) = 0.56
I Orlando (-7.5) at Washington: P (favwin) = 0.76
I Memphis at Cleveland (-1): P (favwin) = 0.54
I New Jersey at Philadelphia (-2): P (favwin) = 0.58
I Golden State at Minnesota (-2.5): P (favwin) = 0.60
I Miami at Dallas (-2.5): P (favwin) = 0.60
I Charlotte at Milwaukee (-3.5): P (favwin) = 0.63
Example: a common business application of logistic regression
is in evaluating the credit quality of (potential) debtors.
I Take a list of borrower characteristics.
I Build a prediction rule for their credit.
I Use this rule to automatically evaluate applicants
(and track your risk profile).
You can do all this with logistic regression, and then use the
predicted probabilities to build a classification rule.
We have data on 1000 loan applicants at German community
banks, and judgment of the loan outcomes (good or bad).
The data has 20 borrower characteristics, including
I credit history (5 categories),
I housing (rent, own, or free),
I the loan purpose and duration,
I and installment rate as a percent of income.
Unfortunately, many of the columns in the data file are coded
categorically in a very opaque way. (Most are factors in R.)
A word of caution
Watch out for perfect prediction!
> too.good <- glm(GoodCredit~. + .^2, family=binomial,
+ data=credit[train,])
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
This warning means you have the logistic version of our
“connect the dots” model.
I Just as useless as before!
We can use forward stepwise regression to build a model.
> credit <- read.csv("germancredit.csv")
> train <- sample.int(nrow(credit), 0.75*nrow(credit))
> null <- glm(GoodCredit~history3, family=binomial,
+ data=credit[train,])
> full <- glm(GoodCredit~., family=binomial,
+ data=credit[train,])
> fwdBIC <- step(null, scope=formula(full),
+ direction="forward", k=log(length(train)))
...
Step: AIC=882.94
GoodCredit ~ history3 + checkingstatus1 + duration2
+ installment8
The null model has credit history as a variable, since I’d include
this regardless, and we’ve left out 250 points for validation.
Or we can use LASSO to find a model
> cvfit <- cv.glmnet(x=X[train,], y=credit$GoodCredit[train],
+ family="binomial", alpha=1, standardize=TRUE)
> betas <- coef(cvfit, s = "lambda.1se")
> model.1se <- which(betas[2:length(betas)]!=0)
Which variables were selected?
> colnames(X[,model.1se])
[1] "checkingstatus1A13" "checkingstatus1A14" "duration2"
[4] "history3A31" "history3A34" "purpose4A41"
[7] "purpose4A43" "purpose4A46" "amount5"
[10] "savings6A64" "savings6A65" "employ7A74"
[13] "installment8" "status9A93" "others10A103"
[16] "property12A124" "age13" "otherplans14A143"
[19] "housing15A152" "foreign20A202"
Trees are very popular in classification. . .
> tree.model <- rpart(GoodCredit ~ .,
+ data=credit[train,], method="class")
Classification
A common goal with logistic regression is to classify the inputs
depending on their predicted response probabilities.
In our credit scoring example,
I we might want to classify the German borrowers as having
“good” or “bad” credit (i.e., do we loan to them?).
A simple classification rule is to say that anyone with
I p(good|x) > 0.5 can get a loan,
I and the rest cannot.
——————
(As usual: classification is a huge field, and we’re only scratching
the surface here.)
Comparing with the validation set:
> predBIC <- predict(fwdBIC, newdata=credit[-train,],
+ type="response")
> errorBIC <- credit[-train,1] - (predBIC >= .5)
(LASSO takes a few extra lines)
Misclassification rates (not MSE!)
> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),
+ lasso=mean(abs(errorLasso.1se)),
+ tree=mean(abs(errorTree)))
full BIC lasso tree
0.268 0.240 0.248 0.252
I Our BIC model classifies 76% of borrowers correctly
I LASSO and tree slightly worse . . . both better than full
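The misclassification-rate calculation itself is simple; here is a toy version of it (Python for illustration, with made-up held-out labels and predicted probabilities, not the credit data):

```python
# Classify with a 0.5 cut-off and compute the misclassification rate.
y = [1, 0, 1, 1, 0, 0, 1, 0]                       # held-out outcomes (made up)
p_hat = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]   # predicted P(Y=1) (made up)

y_hat = [1 if p >= 0.5 else 0 for p in p_hat]      # the classification rule
errors = [yi != yhi for yi, yhi in zip(y, y_hat)]
print(sum(errors) / len(errors))                   # 2 of 8 misclassified: 0.25
```

This is exactly what `mean(abs(errorBIC))` computes in the R code above: the absolute difference between a 0/1 outcome and a 0/1 prediction is 1 on a mistake and 0 otherwise.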
Classification risk & ROC curves
You can also do classification with cut-offs other than 1/2.
I Suppose the risk associated with one action is higher than
for the other.
I You’ll want to have p > 0.5 of a positive outcome before
taking the risky action.
We want to know:
I What happens as the cut-off changes?
I Is there a “best” cut-off?
One way is to look at ROC curves.
ROC: Receiver Operating Characteristic
> library("pROC")
> roc.full <- roc(credit[-train,1] ~ predfull)
> coords(roc.full, x=0.5)
threshold specificity sensitivity
0.5000000 0.8662791 0.4358974
[Figure: ROC curve, sensitivity vs. specificity, with X marking the 0.5 cut-off]
Sensitivity: true positive rate
Specificity: true negative rate
——————
Many related names: recall, precision, positive predictive value, . . .
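Sensitivity and specificity at a given cut-off are just counts from the confusion matrix. A toy check (Python for illustration, with made-up data rather than the pROC output above):

```python
# Made-up validation outcomes and predicted probabilities.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.6, 0.4, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.2]
cut = 0.5

tp = sum(1 for yi, p in zip(y, p_hat) if yi == 1 and p >= cut)  # true positives
fn = sum(1 for yi, p in zip(y, p_hat) if yi == 1 and p < cut)   # false negatives
tn = sum(1 for yi, p in zip(y, p_hat) if yi == 0 and p < cut)   # true negatives
fp = sum(1 for yi, p in zip(y, p_hat) if yi == 0 and p >= cut)  # false positives

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)
```

Sweeping `cut` from 0 to 1 and plotting the resulting (specificity, sensitivity) pairs traces out exactly the ROC curve that `roc()` draws.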
What else can we do?
> rbind( coords(roc.full, "best"),
+ coords(roc.BIC, "best"), coords(roc.lasso, "best") )
threshold specificity sensitivity
[1,] 0.3210536 0.7558140 0.6538462
[2,] 0.4065372 0.8313953 0.6153846
[3,] 0.2030214 0.6279070 0.7820513
Is misclassification improved?
> errorBIC <- credit[-train,1] - (predBIC >= xy.BIC.best[1])
> errorfull <- credit[-train,1] - (predfull >= xy.full.best[1])
> errorLasso.1se <- credit[-train,1] - (predLasso.1se >= xy.lasso.best[1])
> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),
+ lasso=mean(abs(errorLasso.1se)))
full BIC lasso
0.276 0.236 0.324
Different goals, different results.
> A bunch of code...
> see week8-Rcode.R
[Figure: ROC curves for the full, BIC, and LASSO models; X marks the 0.5 cut-off, Y marks each model’s best cut-off]
Testing ROC curves
I Not really necessary.
> roc.test(roc.full,roc.BIC)
DeLong’s test for two correlated ROC curves
data: roc.full and roc.BIC
Z = -0.72882, p-value = 0.4661
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7492546 0.7700507
> roc.test(roc.lasso,roc.BIC)
Z = -0.49662, p-value = 0.6195
AUC of roc1 AUC of roc2
0.7573047 0.7700507
Not surprising, given the picture.
Extensions to Logistic Regression
The obvious step beyond 0/1 outcomes is more categories
Y = 0, 1, 2, . . . ,M.
These categories can be:
I Unordered: Buy product A, B, C, or nothing
I Ordinal: Low, medium, high risk
I Cardinal: Rate 1–5 stars
Predict choices/actions based on characteristics
. . . then do classification into multiple categories.
Our main tool will be multinomial logistic regression, but there
are others. For count data, Poisson regression may be better.
Discrete Choice
Units (firms, customers, etc) face a set of choices, labeled
{0, 1, 2, . . . ,M}. Or we want to classify units into one of
these categories.
Choose based on utility (happiness/profit/etc)
I Utility unit i gets from option m: Ui,m = Xi,mβm + εi,m.
I Chooses m∗ if Ui,m∗ is the biggest.
I So we want to model probabilities like P[Ui,m∗ > Ui,m]
Gives rise to a set of log-odds, one for each category:
log( P[Yi = m | Xi] / P[Yi = 0 | Xi] ) = β0,m + β1,mXi,
just like regular (binary) logistic regression.
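The implied category probabilities come from the softmax of the per-category linear indices, with category 0 as the baseline (its coefficients fixed at zero). A sketch (Python for illustration; the coefficient values are hypothetical):

```python
from math import exp

# One (intercept, slope) pair per category; category 0 is the baseline,
# so its coefficients are pinned at zero. Numbers here are made up.
betas = {0: (0.0, 0.0), 1: (0.5, -0.2), 2: (-1.0, 0.3)}

def category_probs(x):
    """Multinomial-logit probabilities: exp(score_m) normalized to sum to 1."""
    scores = {m: exp(b0 + b1 * x) for m, (b0, b1) in betas.items()}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}

probs = category_probs(x=2.0)
print(probs)                       # probabilities over {0, 1, 2}, summing to 1
best = max(probs, key=probs.get)   # classify by highest probability
```

Dividing any two of these probabilities makes the normalizing total cancel, which is why each log-odds against category 0 is linear in x, exactly as in the display above.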
Words of warning!
As usual, we’re skipping a lot of statistical details. But now,
we’re skipping economic/substantive stuff too:
I What are we assuming about the agents’ behavior?
I What are the random shocks εi,m? Are the X variables
specific to the product (e.g. price) or the person (e.g.
income)? Are the choices structured (e.g. ordered)?
Depending on the situation, other models may be appropriate.
You’ll hear about ordinal, conditional, nested . . . . Even harder
to interpret!
I Multinomial logit is a good all-purpose choice, often just
as good as more specialized models.
Remember: The more complex the model, the more the
assumptions matter!
It’s worth talking about the big one here:
The Independence of Irrelevant Alternatives
I If m∗ is chosen from {0, 1, 2, . . . ,M}, then m∗ is chosen
from any subset of {0, 1, 2, . . . ,M}.
I The log-odds of m vs. 0 doesn’t depend on other options.
I So if we change/remove options from the choice set, the
odds of choosing m vs. 0 do not change.
Does that seem realistic?
I Does it matter? Maybe not for classification.
What to do?
I For now, we’ll ignore the problem, but you can test if IIA
holds, then run a different model if need be.
Example: Choice of cracker brand
> names(cracker <- read.csv("cracker_choice.csv"))
> table(cracker$choice)
kleebler nabisco private sunshine
226 1792 1035 239
> cracker$choice <- relevel(cracker$choice, "private")
Perfect data for relationship or inference questions:
I How does my price relate to choice?
I How do others’ prices relate to choice?
I Does an ad help? Anyone’s ad?
But the X’s are about the product, not the person:
I Can’t figure out what customer type buys which brand.
I What would prediction/classification mean here?
Descriptives and Visualizations
Not as simple as linear regression (no lines to graph) but still
an important step.
I Look at the price of one option when others are chosen
> colMeans(cracker[cracker$choice!="sunshine",2:5])
pr.private pr.keebler pr.nabisco pr.sunshine
68.01 112.53 107.74 96.44
> colMeans(cracker[cracker$choice=="sunshine",2:5])
pr.private pr.keebler pr.nabisco pr.sunshine
68.90 113.40 110.22 86.26
[Figure: boxplots of Sunshine price and Private price by chosen brand]
I The price of all the options when Sunshine is chosen:
[Figure: histograms of the Private, Keebler, Nabisco, and Sunshine prices for purchases where Sunshine was chosen]
On to regression analysis
Deceptively easy in R (multinom comes from the nnet package):
> library(nnet)
> mlogit.fit <- multinom(choice ~ pr.private + pr.keebler +
+ pr.nabisco + pr.sunshine, data=cracker)
> exp(coef(mlogit.fit))
(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine
keebler 20.82 1.02 0.942 1.005 1.002
nabisco 13.22 1.02 1.007 0.969 0.994
sunshine 1.12 1.01 1.023 1.007 0.940
Interpretation is just as easy (hard?) as regular logistic:
I Why is b > 1 for pr.private always?
I Own price coefficient < 1?
How do we feel about these results and the IIA assumption?
> exp(coef(mlogit.fit))
(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine
keebler 20.82 1.02 0.942 1.005 1.002
nabisco 13.22 1.02 1.007 0.969 0.994
sunshine 1.12 1.01 1.023 1.007 0.940
What if Keebler raises their price by $0.05?
I Lowers the chances of buying Keebler (vs store brand):
multiply the odds-ratio by 0.942^5 = 0.74. (Why to the fifth?)
P[ keebler | $1.00] / P[ private | $1.00] × 0.74
= P[ keebler | $1.05] / P[ private | $1.05]
I Or the other way: 1/0.74 = 1.35, so people are 35% more
likely to buy keebler after a 5 cent price drop.
I But what do the 1.007 and 0.994 represent?
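A quick check of the 5-cent arithmetic (Python for illustration; the prices in this data appear to be in cents, so a $0.05 change is five one-unit steps):

```python
# Each one-cent increase in the Keebler price multiplies the
# Keebler-vs-private odds by 0.942 (the fitted odds-ratio above).
per_cent = 0.942
mult = per_cent ** 5          # five one-cent steps = a $0.05 increase
print(round(mult, 2))         # 0.74: odds fall to 74% of their old value
print(round(1 / mult, 2))     # 1.35: the reverse, a 5-cent price *drop*
```

The same compounding logic applies to any odds-ratio in the table: a k-unit change in a covariate multiplies the odds by the ratio raised to the k-th power.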
Do advertisements increase sales?
> mlogit.fit2 <- multinom(choice ~ ., data=cracker)
> exp(coef(mlogit.fit2))
(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine
keebler 12.719 1.02 0.949 1.000 1.002
nabisco 11.416 1.02 1.010 0.967 0.992
sunshine 0.981 1.01 1.016 1.012 0.940
ad.privateTRUE ad.keeblerTRUE ad.nabiscoTRUE ad.sunshineTRUE
keebler 1.06 2.34 0.925 0.932
nabisco 1.07 2.03 1.285 0.885
sunshine 1.06 1.27 2.086 1.183
How much more likely are people to buy keebler (vs. store
brand) if they run an ad?
I More than twice as likely? Odds ratio multiplied by 2.34.
What about the other odds-ratios? What is the 2.03?
I Keebler running an ad increases Nabisco sales?
I Ads raise awareness of crackers? Try a model with any ad.
We can do all the usual things
I Predictions
> mlogit.probs <- predict(mlogit.fit, type="probs")
> dim(mlogit.probs)
[1] 3292 4
I Classification
I Cut-off rule problem:
what if two categories have P [Y = m|X] > c?
I One simple idea: highest predicted probability:
> colnames(mlogit.probs)[apply(mlogit.probs,1,which.max)]
I Model selection
I AIC, BIC, LASSO, etc
These may or may not make sense in discrete choice models.
Count Data
Poisson regression is a GLM designed for count data. How?
Remember the Poisson distribution:
[Figure: Poisson pmf over counts 0–10 for λ = 1, 2, 3, 7]
I Integers only
I One parameter: λ > 0
I λ = Mean and Variance
——————
Extensions: Negative Binomial Model, Zero-Inflated Models
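A quick numeric check that the Poisson mean and variance both equal λ (Python for illustration, summing the pmf over a support wide enough that the tail mass is negligible):

```python
from math import exp, factorial

def pois_pmf(k, lam):
    """Poisson pmf: exp(-lam) * lam^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0
ks = range(60)  # for lam = 3 the mass beyond 60 is negligible

mean = sum(k * pois_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * pois_pmf(k, lam) for k in ks)
print(round(mean, 6), round(var, 6))  # both ~3.0
```

This mean = variance restriction is exactly what the Negative Binomial and zero-inflated extensions relax when real count data are over-dispersed.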
Example: Number of take-over bids (NUMBIDS) received by
126 US firms during 1978–85.
[Figure: histogram of takeover$NUMBIDS, counts 0–10]
Looks like Poisson!
Covariates include
I Continuous variables:
BIDPREM, INSTHOLD,
SIZE
I Indicators: LEGLREST,
FINREST, REGULATN,
WHTKNGHT
Browsing the data
I More bids on average for the indicators being “Yes”:
LEGLREST REALREST FINREST WHTKNGHT REGULATN
No 1.46 1.69 1.70 1.18 1.67
Yes 2.11 1.96 2.08 2.12 1.91
I Pattern in the continuous variables?
[Figure: bid premium (bid price / stock price before bid), % held by institutions, and total assets, each plotted against number of bids]
Poisson regression results:
> poisson.fit <- glm(NUMBIDS ~ BIDPREM+INSTHOLD+SIZE+LEGLREST
+ +REALREST+FINREST+ WHTKNGHT+REGULATN, family="poisson")
> summary(poisson.fit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.10489 0.53443 2.067 0.03870 *
BIDPREM -0.76982 0.37676 -2.043 0.04103 *
INSTHOLD -0.13180 0.40027 -0.329 0.74195
SIZE 0.03428 0.01865 1.839 0.06598 .
LEGLREST 0.25191 0.15005 1.679 0.09318 .
REALREST -0.02024 0.17726 -0.114 0.90909
FINREST 0.09190 0.21688 0.424 0.67176
WHTKNGHT 0.49821 0.15822 3.149 0.00164 **
REGULATN 0.01178 0.16297 0.072 0.94237
⇒ The expected log(NUMBIDS) increases by 0.498 ≈ 1/2 if a
White Knight is involved. What?
Just like logistic regression, exponentiating aids interpretation.
> exp(coef(poisson.fit))
(Intercept) BIDPREM INSTHOLD SIZE LEGLREST
3.0188810 0.4630954 0.8765198 1.0348739 1.2864796
REALREST FINREST WHTKNGHT REGULATN
0.9799633 1.0962533 1.6457787 1.0118518
⇒ The incidence rate is 1.65 times higher if a White Knight is
involved.
The incidence rate ratio (IRR, or risk ratio) is almost like the
odds ratio from logistic regression: βj > 0⇔ IRR > 1.
I The number of bids (the count we’re modeling) is
expected to be 65% higher if a White Knight is involved.
I For a one-unit increase in the % held by institutions, the
number of bids falls by 1 − 0.88 = 0.12, i.e. about 12%.
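These numbers follow directly from the coefficient table (a Python check for illustration, using the estimates reported in the glm output above):

```python
from math import exp

b_whtknght = 0.49821   # WHTKNGHT coefficient from summary(poisson.fit)
b_insthold = -0.13180  # INSTHOLD coefficient

print(round(exp(b_whtknght), 3))      # ~1.646: about 65% more bids
print(round(1 - exp(b_insthold), 3))  # ~0.123: about 12% fewer bids
```

So on the incidence-rate scale the interpretation is multiplicative, just as exponentiated coefficients are on the odds scale in logistic regression.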
Final thoughts
What did we accomplish today?
I Showed how to use glm to handle different types of
outcomes.
I Explored logistic regression in some detail
I Introduced other important models
What didn’t we do?
I Explore the models in detail, or understand how they work
I Discuss assumptions, diagnostics, and fixes
All of which exist for glm.
Be careful out there!
I As usual, refer to another class/reference.