
Post on 10-Apr-2018


BUS41100 Applied Regression Analysis

Week 8: Generalized Linear Models

Logistic regression, classification, generalized linear models

Max H. Farrell
The University of Chicago Booth School of Business

Discrete Responses

Today we will look at discrete responses

I Binary: Y = 0 or 1 (e.g., buy or don't buy)

I More categories: Y = 0, 1, 2, 3, 4
  I Unordered: buy product A, B, C, D, or nothing
  I Ordered: rate 1-5 stars

I Count: Y = 0, 1, 2, 3, 4, . . .
  I How many products bought in a month?

The goal varies depending on the type of response.

Our tool will be the Generalized Linear Model (GLM).


Binary response data

We'll spend the most time on binary data: Y = 0 or 1.

I By far the most common application

I Illustrate all the ideas

The goal is generally to predict the probability that Y = 1.

You can then do classification based on this estimate.

I Buy or not buy
I Win or lose
I Sick or healthy
I Pay or default
I Thumbs up or down

Relationship-type questions are interesting too:

I How does an ad affect sales?
I What type of customer is more likely to buy?

Y is an indicator: Y = 0 or 1.

The conditional mean is thus

E[Y |X] = p(Y = 1|X) × 1 + p(Y = 0|X) × 0 = p(Y = 1|X).

So the mean function is a probability:

I We need a model that gives mean/probability values between 0 and 1.

We'll use a transform function that takes the right-hand side of
the model (x'β) and gives back a value between zero and one.


The binary choice model is

p(Y = 1|X1, . . . , Xd) = S(β0 + β1X1 + . . . + βdXd)

where S is a function that increases in value from zero to one.

[Figure: an S-shaped transform S(x'β), rising from 0 to 1 as x'β goes from -6 to 6]

S^{-1}( p(Y = 1|X1, . . . , Xd) ) = β0 + β1X1 + . . . + βdXd

Linear!

There are two main functions that are used for this:

I Logistic Regression: S(z) = e^z / (1 + e^z).

I Probit Regression: S(z) = pnorm(z) = Φ(z).

Both are S-shaped and take values in (0, 1).

Logit is usually preferred, but they result in practically the same fit.
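Both link functions are built into base R; a quick sketch (my own illustration, not from the slides) shows the shared S-shape:

```r
# Logit vs. probit S-curves in base R (illustration only)
z <- seq(-6, 6, by = 1)
logit  <- plogis(z)   # plogis(z) = exp(z) / (1 + exp(z))
probit <- pnorm(z)    # standard normal CDF
round(cbind(z, logit, probit), 3)
# Both start near 0, pass through 0.5 at z = 0, and approach 1.
```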


Logistic regression

We'll use logistic regression, such that

p(Y = 1|X1 . . . Xd) = S(x'β) = exp[β0 + β1X1 + . . . + βdXd] / (1 + exp[β0 + β1X1 + . . . + βdXd]).

These models are easy to fit in R:

glm(Y ~ X1 + X2, family=binomial)

I g is for generalized; binomial indicates Y = 0 or 1.

I Otherwise, glm uses the same syntax as lm.

I The logit link is more common, and is the default in R.
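A minimal end-to-end sketch of that call, with simulated X1, X2, and Y (the names mirror the syntax above; none of this is a real dataset):

```r
# Simulate binary data and fit it with glm (logit link by default)
set.seed(1)
n  <- 500
X1 <- rnorm(n)
X2 <- rnorm(n)
p  <- plogis(-0.5 + 1.0 * X1 - 0.5 * X2)  # true success probabilities
Y  <- rbinom(n, size = 1, prob = p)
fit <- glm(Y ~ X1 + X2, family = binomial)
round(coef(fit), 2)  # estimates should land near (-0.5, 1.0, -0.5)
```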


Interpretation

Model the probability:

p(Y = 1|X1 . . . Xd) = S(x'β) = exp[β0 + β1X1 + . . . + βdXd] / (1 + exp[β0 + β1X1 + . . . + βdXd]).

Invert to get linear log odds ratio:

log( p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) ) = β0 + β1X1 + . . . + βdXd.

Therefore:

exp[βj] = ( p(Y = 1|Xj = x + 1) / p(Y = 0|Xj = x + 1) ) ÷ ( p(Y = 1|Xj = x) / p(Y = 0|Xj = x) )


Example:

I Suppose p(Y = 1|X1 . . . Xd) = 0.8 and p(Y = 0|X1 . . . Xd) = 0.2

I Then p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) = 0.8 / 0.2 = 4

Winning is 4 times more likely than losing!

Therefore:

I exp[βj] = change in the odds for a one-unit increase in Xj.

I . . . holding everything else constant, as always!

I Always exp[βj] > 0, and exp[0] = 1. Why?
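Those last facts are one-liners in R (the 0.7 below is an arbitrary illustrative coefficient, not an estimate from the slides):

```r
# exp(beta) as an odds multiplier
exp(0.7)    # > 1: a positive coefficient scales the odds up
exp(-0.7)   # between 0 and 1: a negative coefficient scales them down
exp(0)      # = 1: a zero coefficient leaves the odds unchanged
# exp() is always positive, so odds can shrink but never go negative.
```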


Example: we have Las Vegas betting point spreads for 553

NBA games and the resulting scores.

We can use logistic regression of scores onto spread to predict

the probability of the favored team winning.

I Response: favwin=1 if favored team wins.

I Covariate: spread is the Vegas point spread.

[Figure: histogram of spread colored by favwin (left); favwin (0/1) plotted against spread (right)]

This is a weird situation where we assume no intercept.

I Most likely the Vegas betting odds are efficient.

I A spread of zero implies p(win) = 0.5 for each team.

We get this out of our model when β0 = 0:

p(win) = exp[0]/(1 + exp[0]) = 1/2.

The model we want to fit is thus

p(favwin|spread) = exp[β1 × spread] / (1 + exp[β1 × spread]).
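In an R formula, `- 1` drops the intercept. A sketch with simulated spreads (the real NBA data isn't reproduced here; the true coefficient is set to the slide's 0.156 by construction):

```r
# No-intercept logistic regression: favwin ~ spread - 1
set.seed(42)
spread <- runif(553, 0, 30)
favwin <- rbinom(553, 1, plogis(0.156 * spread))  # beta0 = 0 by construction
nbareg <- glm(favwin ~ spread - 1, family = binomial)
coef(nbareg)  # should land near 0.156
```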


R output from glm:

> nbareg <- glm(favwin ~ spread - 1, family=binomial)
> summary(nbareg) ## abbreviated output

Coefficients:

Estimate Std. Error z value Pr(>|z|)

spread 0.15600 0.01377 11.33

Interpretation

The fitted model is

p(favwin|spread) = exp[0.156 × spread] / (1 + exp[0.156 × spread]).

[Figure: fitted P(favwin) as a function of spread, rising from 0.5 toward 1.0 over spreads 0 to 30]

Convert to odds-ratio:
> exp(coef(nbareg))

spread

1.168821

I A 1 point increase in the spread means the favorite is

1.17 times more likely to win

I What about a 10-point increase:
  > exp(10*coef(nbareg)) = 4.75 times more likely

Uncertainty:
> exp(confint(nbareg))

Waiting for profiling to be done...

2.5 % 97.5 %

1.139107 1.202371

Code: exp(cbind(coef(logit.reg), confint(logit.reg)))


Many things are basically the same as before:

I extractAIC(reg, k=log(n)) will give your BICs.

I The predict function works as before, but add

type = "response" to get p = exp[xb]/(1 + exp[xb])

(otherwise it just returns the linear function xb).

But some things are different:

I Inference based on z not t

I Without sums of squares there are no R2, anova, or

F -tests!

I Techniques for residual diagnostics and model checking

are different (but we'll not worry about that today).
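The `type = "response"` point is worth seeing once. A sketch on a simulated fit (variable names invented), showing that the response-scale prediction is just the inverse link applied to x'b:

```r
# predict() on a glm: link scale by default, probabilities with type="response"
set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))
fit  <- glm(y ~ x, family = binomial)
xb   <- predict(fit, newdata = data.frame(x = 1))                     # linear predictor x'b
phat <- predict(fit, newdata = data.frame(x = 1), type = "response")  # exp(x'b)/(1+exp(x'b))
all.equal(unname(phat), unname(plogis(xb)))  # TRUE
```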


We could consider other models ... and compare with BIC!

Our efficient Vegas model:

> extractAIC(nbareg, k=log(553))

[1] 1.000 534.287

A model that includes non-zero intercept:

> extractAIC(glm(favwin ~ spread, family=binomial), k=log(553))

[1] 2.0000 540.4333

What if we throw in home-court advantage?

> extractAIC(glm(favwin ~ spread + favhome, family=binomial), k=log(553))

[1] 3.0000 545.637

The simplest model is best. (The model probabilities are 0.95, 0.05, and zero.)


New predictions

Chicago vs Sacramento: spread is Sacramento by 1

p(CHI win) = 1 / (1 + exp[0.156 × 1]) = 0.47

I Atlanta at NY (-1.5): P (favwin) = 0.56

I Orlando (-7.5) at Washington: P (favwin) = 0.76

I Memphis at Cleveland (-1): P (favwin) = 0.53

I New Jersey at Philadelphia (-2): P (favwin) = 0.58

I Golden State at Minnesota (-2.5): P (favwin) = 0.60

I Miami at Dallas (-2.5): P (favwin) = 0.60

I Charlotte at Milwaukee (-3.5): P (favwin) = 0.63


Example: a common business application of logistic regression

is in evaluating the credit quality of (potential) debtors.

I Take a list of borrower characteristics.

I Build a prediction rule for their credit.

I Use this rule to automatically evaluate applicants

(and track your risk profile).

You can do all this with logistic regression, and then use the

predicted probabilities to build a classification rule.


We have data on 1000 loan applicants at German community

banks, and judgment of the loan outcomes (good or bad).

The data has 20 borrower characteristics, including

I credit history (5 categories),

I housing (rent, own, or free),

I the loan purpose and duration,

I and installment rate as a percent of income.

Unfortunately, many of the columns in the data file are coded

categorically in a very opaque way. (Most are factors in R.)


A word of caution

Watch out for perfect prediction!

> too.good

We can use forward stepwise regression to build a model.

> credit train null full fwdBIC

Or we can use LASSO to find a model

> cvfit betas model.1se colnames(X[,model.1se])

[1] "checkingstatus1A13" "checkingstatus1A14" "duration2"

[4] "history3A31" "history3A34" "purpose4A41"

[7] "purpose4A43" "purpose4A46" "amount5"

[10] "savings6A64" "savings6A65" "employ7A74"

[13] "installment8" "status9A93" "others10A103"

[16] "property12A124" "age13" "otherplans14A143"

[19] "housing15A152" "foreign20A202"

Trees are very popular in classification. . .

> tree.model

Classification

A common goal with logistic regression is to classify the inputs

depending on their predicted response probabilities.

In our credit scoring example,

I we might want to classify the German borrowers as having

good or bad credit (i.e., do we loan to them?).

A simple classification rule is to say that anyone with

I p(good|x) > 0.5 can get a loan,
I and the rest cannot.

(As usual: classification is a huge field, and we're only scratching the surface here.)
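The 0.5 rule is one line of R once you have predicted probabilities (the phat values below are made up for illustration, not the credit data):

```r
# Classify by thresholding predicted probabilities at 0.5
phat <- c(0.91, 0.55, 0.48, 0.10)  # hypothetical p(good|x) for four applicants
decision <- ifelse(phat > 0.5, "loan", "no loan")
decision  # "loan" "loan" "no loan" "no loan"
```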

Comparing with the validation set:

> predBIC errorBIC = .5)

(LASSO takes a few extra lines)

Misclassification rates (not MSE!)

> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),

+ lasso=mean(abs(errorLasso.1se)),

+ tree=mean(abs(errorTree)))

full BIC lasso tree

0.268 0.240 0.248 0.252

I Our BIC model classifies 76% of borrowers correctly

I LASSO and tree slightly worse . . . both better than full


Classification risk & ROC curves

You can also do classification with cut-offs other than 1/2.

I Suppose the risk associated with one action is higher than

for the other.

I You'll want to have p > 0.5 of a positive outcome before

taking the risky action.

We want to know:

I What happens as the cut-off changes?

I Is there a best cut-off?

One way is to look at ROC curves.


ROC: Receiver Operating Characteristic

> library("pROC")

> roc.full <- ...
> coords(roc.full, x=0.5)

threshold specificity sensitivity

0.5000000 0.8662791 0.4358974

[Figure: ROC curve (Sensitivity vs. Specificity) with the cutoff = 0.5 point marked]

Sensitivity = true positive rate
Specificity = true negative rate

Many related names: recall, precision, positive predictive value, . . .
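What coords() reports can also be computed by hand; a toy illustration (y and phat invented, not the credit data):

```r
# Sensitivity and specificity at a 0.5 cutoff, computed directly
y    <- c(1, 1, 1, 0, 0, 0, 0, 1)                  # true outcomes
phat <- c(0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.8)  # predicted probabilities
yhat <- as.numeric(phat >= 0.5)                    # classifications
sensitivity <- mean(yhat[y == 1])                  # true positive rate
specificity <- mean(1 - yhat[y == 0])              # true negative rate
c(sensitivity = sensitivity, specificity = specificity)  # 0.75 0.75
```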


What else can we do?

> rbind( coords(roc.full, "best"),

+ coords(roc.BIC, "best"), coords(roc.lasso, "best") )

threshold specificity sensitivity

[1,] 0.3210536 0.7558140 0.6538462

[2,] 0.4065372 0.8313953 0.6153846

[3,] 0.2030214 0.6279070 0.7820513

Is misclassification improved?

> errorBIC = xy.BIC.best[1])

> errorfull = xy.full.best[1])

> errorLasso.1se = xy.lasso.best[1])

> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),

+ lasso=mean(abs(errorLasso.1se)))

full BIC lasso

0.276 0.236 0.324

Different goals, different results.


> A bunch of code...

> see week8-Rcode.R

[Figure: ROC curves for the Full, BIC, and LASSO models, marking the 0.5 and "best" cutoffs]

Testing ROC curves

I Not really necessary.

> roc.test(roc.full,roc.BIC)

DeLong's test for two correlated ROC curves

data: roc.full and roc.BIC

Z = -0.72882, p-value = 0.4661

alternative hypothesis: true difference in AUC is not equal to 0

sample estimates:

AUC of roc1 AUC of roc2

0.7492546 0.7700507

> roc.test(roc.lasso,roc.BIC)

Z = -0.49662, p-value = 0.6195

AUC of roc1 AUC of roc2

0.7573047 0.7700507

Not surprising, given the picture.

Extensions to Logistic Regression

The obvious step beyond 0/1 outcomes is more categories

Y = 0, 1, 2, . . . ,M.

These categories can be:

I Unordered: Buy product A, B, C, or nothing

I Ordinal: Low, medium, high risk

I Cardinal: Rate 1-5 stars

Predict choices/actions based on characteristics

. . . then do classification into multiple categories.

Our main tool will be multinomial logistic regression, but there

are others. For count data, Poisson regression may be better.


Discrete Choice

Units (firms, customers, etc.) face a set of choices, labeled
{0, 1, 2, . . . ,M}. Or we want to classify units into one of these categories.

Choose based on utility (happiness/profit/etc)

I Utility unit i gets from option m: Ui,m = Xi,m βm + εi,m.

I Chooses m if Ui,m is the biggest.

I So we want to model probabilities like P[Ui,m > Ui,m′]

Gives rise to a set of log-odds, one for each category:

log( P[Yi = m | Xi] / P[Yi = 0 | Xi] ) = β0,m + β1,m Xi,

just like regular (binary) logistic regression.
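One standard way to fit this system of log-odds is multinom() from the recommended nnet package. A sketch on simulated 3-category choices (all numbers and names invented):

```r
# Multinomial logit: one log-odds equation per category vs. the baseline
library(nnet)
set.seed(3)
x <- rnorm(300)
# unnormalized choice weights; category 1 is the baseline
w <- cbind(1, exp(0.5 + 1.0 * x), exp(-0.5 + 2.0 * x))
y <- apply(w, 1, function(row) sample(1:3, 1, prob = row))
fit <- multinom(factor(y) ~ x, trace = FALSE)
coef(fit)  # one (intercept, slope) row for each non-baseline category
```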

Words of warning!

As usual, we're skipping a lot of statistical details. But now,
we're skipping economic/substantive stuff too:

I What are we assuming about the agents' behavior?

I What are the random shocks εi,m? Are the X variables

specific to the product (e.g. price) or the person (e.g.

income)? Are the choices structured (e.g. ordered)?

Depending on the situation, other models may be appropriate.

You'll hear about ordinal, conditional, nested, . . . Even harder

to interpret!

I Multinomial logit is a good all-purpose choice, often just

as good as more specialized models.


Remember: The more complex the model, the more the

assumptions matter!

It's worth talking about the big one here:

The Independence of Irrelevant Alternatives

I If m is chosen from {0, 1, 2, . . . ,M}, then m is chosen from any subset of {0, 1, 2, . . . ,M}.

I The log-odds of m vs. 0 doesn't depend on other options.

I So if we change/remove options from the choice set, the

odds of choosing m vs. 0 do not change.

Does that seem realistic?

I Does it matter? Maybe not for classification.

What to do?

I For now, we'll ignore the problem, but you can test if IIA
holds, then run a different model if need be.

Example: Choice of cracker brand

> names(cracker)
> table(cracker$choice)

keebler  nabisco  private sunshine

226 1792 1035 239

> cracker$choice

Descriptives and Visualizations

Not as simple as linear regression (no lines to graph) but still

an important step.

I Look at the price of one option when others are chosen:
> colMeans(cracker[cracker$choice!="sunshine",2:5])

pr.private pr.keebler pr.nabisco pr.sunshine

68.01 112.53 107.74 96.44

> colMeans(cracker[cracker$choice=="sunshine",2:5])

pr.private pr.keebler pr.nabisco pr.sunshine

68.90 113.40 110.22 86.26

[Figure: boxplots of Sunshine price and Private price by chosen brand (private, keebler, nabisco, sunshine)]

I The price of all the options when Sunshine is chosen:

[Figure: histograms of Private, Keebler, Nabisco, and Sunshine prices on occasions when Sunshine is chosen]

On to regression analysis

Deceptively easy in R:

> mlogit.fit <- ...
> exp(coef(mlogit.fit))

(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine

keebler 20.82 1.02 0.942 1.005 1.002

nabisco 13.22 1.02 1.007 0.969 0.994

sunshine 1.12 1.01 1.023 1.007 0.940

Interpretation is just as easy (hard?) as regular logistic:

I Why is b > 1 for pr.private always?

I Own price coefficient < 1?

How do we feel about these results and the IIA assumption?


> exp(coef(mlogit.fit))

(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine

keebler 20.82 1.02 0.942 1.005 1.002

nabisco 13.22 1.02 1.007 0.969 0.994

sunshine 1.12 1.01 1.023 1.007 0.940

What if Keebler raises their price by $0.05?

I Lowers the chances of buying Keebler (vs store brand):
  Multiply the odds-ratio by 0.942^5 = 0.74. (Why to the fifth?)

( P[keebler | $1.00] / P[private | $1.00] ) × 0.74 = P[keebler | $1.05] / P[private | $1.05]

I Or the other way: 1/0.74 = 1.35, so people are 35% more

likely to buy keebler after a 5 cent price drop.

I But what do the 1.007 and 0.994 represent?
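The 5-cent arithmetic above, spelled out (prices enter the model in cents, so a 5-cent change compounds the 1-cent odds multiplier five times):

```r
# Odds multipliers for a 5-cent Keebler price change
or_1cent <- 0.942        # estimated odds multiplier per 1-cent increase
or_5cent <- or_1cent^5   # five 1-cent steps compound multiplicatively
round(or_5cent, 2)       # 0.74: odds multiplier after a 5-cent increase
round(1 / or_5cent, 2)   # 1.35: odds multiplier after a 5-cent decrease
```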


Do advertisements increase sales?

> mlogit.fit2 <- ...
> exp(coef(mlogit.fit2))

(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine

keebler 12.719 1.02 0.949 1.000 1.002

nabisco 11.416 1.02 1.010 0.967 0.992

sunshine 0.981 1.01 1.016 1.012 0.940

ad.privateTRUE ad.keeblerTRUE ad.nabiscoTRUE ad.sunshineTRUE

keebler 1.06 2.34 0.925 0.932

nabisco 1.07 2.03 1.285 0.885

sunshine 1.06 1.27 2.086 1.183

How much more likely are people to buy keebler (vs. store

brand) if they run an ad?

I More than twice as likely? Odds ratio multiplied by 2.34.

What about the other odds-ratios? What is the 2.03?

I Keebler running an ad increases Nabisco sales?

I Ads raise awareness of crackers? Try a model with any ad.

We can do all the usual things

I Predictions

> mlogit.probs <- ...
> dim(mlogit.probs)

[1] 3292 4

I Classification

I Cut-off rule problem:
  what if two categories have P[Y = m|X] > c?
I One simple idea: use the highest predicted probability:

> colnames(mlogit.probs)[apply(mlogit.probs,1,which.max)]

I Model selection

I AIC, BIC, LASSO, etc

These may or may not make sense in discrete choice models.
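The highest-probability classification rule from the bullets above, on a toy probability matrix (numbers invented; column names mimic the cracker example):

```r
# Classify each row by its largest predicted probability
probs <- rbind(c(0.10, 0.60, 0.20, 0.10),
               c(0.50, 0.20, 0.20, 0.10))
colnames(probs) <- c("keebler", "nabisco", "private", "sunshine")
colnames(probs)[apply(probs, 1, which.max)]  # "nabisco" "keebler"
```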

Count Data

Poisson regression is a GLM designed for count data. How?

Remember the Poisson distribution:

[Figure: Poisson probability mass functions for lambda = 1, 2, 3, 7]

I Integers only

I One parameter: λ > 0

I λ = Mean and Variance

Extensions: Negative Binomial Model, Zero-Inflated Models
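The mean = variance property is easy to check by simulation (my own illustration, not from the slides):

```r
# Poisson(lambda): one parameter, mean and variance both equal lambda
lambda <- 3
set.seed(4)
draws <- rpois(1e5, lambda)
c(mean = mean(draws), var = var(draws))  # both close to 3
```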

Example: Number of take-over bids (NUMBIDS) received by

126 US firms during 1978-85.

[Figure: histogram of takeover$NUMBIDS, counts 0 through 10]

Looks like Poisson!

Covariates include

I Continuous variables:

BIDPREM, INSTHOLD,

SIZE

I Indicators: LEGLREST,

FINREST, REGULATN,

WHTKNGHT


Browsing the data

I More bids on average for the indicators being Yes:

LEGLREST REALREST FINREST WHTKNGHT REGULATN

No 1.46 1.69 1.70 1.18 1.67

Yes 2.11 1.96 2.08 2.12 1.91

I Pattern in the continuous variables?

[Figure: Number of Bids plotted against bid price / stock price before bid, % held by institutions, and total assets]

Poisson regression results:

> poisson.fit <- ...
> summary(poisson.fit)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.10489 0.53443 2.067 0.03870 *

BIDPREM -0.76982 0.37676 -2.043 0.04103 *

INSTHOLD -0.13180 0.40027 -0.329 0.74195

SIZE 0.03428 0.01865 1.839 0.06598 .

LEGLREST 0.25191 0.15005 1.679 0.09318 .

REALREST -0.02024 0.17726 -0.114 0.90909

FINREST 0.09190 0.21688 0.424 0.67176

WHTKNGHT 0.49821 0.15822 3.149 0.00164 **

REGULATN 0.01178 0.16297 0.072 0.94237

The expected log(NUMBIDS) increases by 0.498 ≈ 1/2 if a
White Knight is involved. What?

Just like logistic regression, exponentiating aids interpretation.

> exp(coef(poisson.fit))

(Intercept) BIDPREM INSTHOLD SIZE LEGLREST

3.0188810 0.4630954 0.8765198 1.0348739 1.2864796

REALREST FINREST WHTKNGHT REGULATN

0.9799633 1.0962533 1.6457787 1.0118518

The incidence rate is 1.65 times higher if a White Knight is involved.

The incidence rate ratio (IRR, or risk ratio) is almost like the
odds ratio from logistic regression: βj > 0 ⟺ IRR > 1.

I The number of bids (the count we're modeling) is
expected to be 65% higher if a White Knight is involved.

I For a one-unit increase in the % held by institutions, the
number of bids falls by 1 − 0.88 = 12%.
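Both IRRs come straight from exponentiating the summary() estimates shown earlier:

```r
# From Poisson coefficients to incidence rate ratios (slide estimates)
exp(0.49821)   # WHTKNGHT: about 1.65, i.e. ~65% more expected bids
exp(-0.13180)  # INSTHOLD: about 0.88, i.e. roughly 12% fewer bids per unit
```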

Final thoughts

What did we accomplish today?

I Showed how to use glm to handle different types of

outcomes.

I Explored logistic regression in some detail

I Introduced other important models

What didn't we do?

I Explore the models in detail, or understand how they work

I Discuss assumptions, diagnostics, and fixes

All of which exist for glm.

Be careful out there!

I As usual, refer to another class/reference