BUS41100 Applied Regression Analysis
Week 8: Generalized Linear Models
Logistic regression, classification, generalized linear models
Max H. Farrell
The University of Chicago Booth School of Business
Discrete Responses
Today we will look at discrete responses
I Binary: Y = 0 or 1
  I Buy or don’t buy
I More categories: Y = 0, 1, 2, 3, 4
  I Unordered: buy product A, B, C, D, or nothing
  I Ordered: rate 1–5 stars
I Count: Y = 0, 1, 2, 3, 4, . . .
  I How many products bought in a month?
The goal varies depending on the type of response.
Our tool will be the Generalized Linear Model (GLM).
Binary response data
We’ll spend the most time on binary data: Y = 0 or 1.
I By far the most common application
I Illustrate all the ideas
The goal is generally to predict the probability that Y = 1.
You can then do classification based on this estimate.
I Buy or not buy
I Win or lose
I Sick or healthy
I Pay or default
I Thumbs up or down
Relationship-type questions are interesting too:
I How does an ad affect sales?
I What type of customer is more likely to buy?
Y is an indicator: Y = 0 or 1.
The conditional mean is thus
E[Y |X] = p(Y = 1|X)× 1 + p(Y = 0|X)× 0
= p(Y = 1|X).
So the mean function is a probability:
I We need a model that gives mean/probability values
between 0 and 1.
We’ll use a transform function that takes the right-hand side of
the model (x′β) and gives back a value between zero and one.
The binary choice model is
p(Y = 1|X1, . . . , Xd) = S(β0 + β1X1 + · · ·+ βdXd)
where S is a function that increases in value from zero to one.
[Figure: the S-shaped curve S(x′β), rising from 0 to 1 as x′β increases]

S−1(p(Y = 1|X1, . . . , Xd)) = β0 + β1X1 + · · ·+ βdXd ← Linear!
There are two main functions that are used for this:
I Logistic Regression: S(z) = e^z / (1 + e^z).
I Probit Regression: S(z) = pnorm(z) = Φ(z).
Both are S-shaped and take values in (0, 1).
Logit is usually preferred, but
they result in practically the same fit.
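As a quick numeric sketch of these two links (in Python rather than R, purely for illustration), both are S-shaped, equal 1/2 at z = 0, and stay strictly inside (0, 1):

```python
from math import erf, exp, sqrt

def logistic(z):
    """Logit link: S(z) = e^z / (1 + e^z)."""
    return exp(z) / (1 + exp(z))

def probit(z):
    """Probit link: S(z) = Phi(z), the standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Both links are increasing, hit 0.5 at z = 0, and never leave (0, 1).
for z in (-4, -1, 0, 1, 4):
    print(f"z={z:+d}  logit={logistic(z):.3f}  probit={probit(z):.3f}")
```

The fits differ only in how fast the tails approach 0 and 1, which is why the two models give practically the same answers in most applications.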
Logistic regression
We’ll use logistic regression, such that
p(Y = 1|X1 . . . Xd) = S(X′β) = exp[β0 + β1X1 + . . .+ βdXd] / (1 + exp[β0 + β1X1 + . . .+ βdXd]).
These models are easy to fit in R:
glm(Y ~ X1 + X2, family=binomial)
I “g” is for generalized; binomial indicates Y = 0 or 1.
I Otherwise, glm uses the same syntax as lm.
I The “logit” link is more common, and is the default in R.
Interpretation
Model the probability:
p(Y = 1|X1 . . . Xd) = S(X′β) = exp[β0 + β1X1 + . . .+ βdXd] / (1 + exp[β0 + β1X1 + . . .+ βdXd]).
Invert to get linear log odds ratio:
log( p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) ) = β0 + β1X1 + . . .+ βdXd.
Therefore:
e^βj = [ p(Y = 1|Xj = x+ 1) / p(Y = 0|Xj = x+ 1) ] / [ p(Y = 1|Xj = x) / p(Y = 0|Xj = x) ]
Example:
I Suppose p(Y = 1|X1 . . . Xd) = 0.8 and p(Y = 0|X1 . . . Xd) = 0.2
I Then the odds are p(Y = 1|X1 . . . Xd) / p(Y = 0|X1 . . . Xd) = 0.8/0.2 = 4
⇒ “Win” is 4 times more likely!
Therefore:
I e^βj = change in the odds for a one-unit increase in Xj.
I . . . holding everything else constant, as always!
I Always e^βj > 0, and e^0 = 1. Why?
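A quick numeric check of this odds arithmetic (Python for illustration; the coefficient value here is hypothetical, not from any fitted model):

```python
from math import exp

# If p(Y=1|X) = 0.8 then p(Y=0|X) = 0.2, so the odds of Y=1 are 0.8/0.2 = 4:
# a "win" is 4 times as likely as a "loss".
p = 0.8
odds = p / (1 - p)

# A one-unit increase in Xj multiplies these odds by exp(beta_j).
beta_j = 0.5                         # hypothetical coefficient
new_odds = odds * exp(beta_j)
new_p = new_odds / (1 + new_odds)    # convert odds back to a probability
print(odds, new_odds, new_p)
```

Note exp(beta_j) is always positive, so a coefficient can shrink the odds (beta_j < 0) or leave them unchanged (beta_j = 0), but the odds never go negative.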
Example: we have Las Vegas betting point spreads for 553
NBA games and the resulting scores.
We can use logistic regression of scores onto spread to predict
the probability of the favored team winning.
I Response: favwin=1 if favored team wins.
I Covariate: spread is the Vegas point spread.
[Figure: histogram of spread (0 to 40) split by favwin = 1 vs. favwin = 0, and a plot of favwin (0/1) against spread]
This is a weird situation where we assume no intercept.
I Most likely the Vegas betting odds are efficient.
I A spread of zero implies p(win) = 0.5 for each team.
We get this out of our model when β0 = 0
p(win) = exp[β0]/(1 + exp[β0]) = 1/2.
The model we want to fit is thus
p(favwin|spread) = exp[β1 × spread] / (1 + exp[β1 × spread]).
R output from glm:
> nbareg <- glm(favwin~spread-1, family=binomial)
> summary(nbareg) ## abbreviated output
Coefficients:
Estimate Std. Error z value Pr(>|z|)
spread 0.15600 0.01377 11.33 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Null deviance: 766.62 on 553 degrees of freedom
Residual deviance: 527.97 on 552 degrees of freedom
AIC: 529.97
Interpretation
The fitted model is
p(favwin|spread) = exp[0.156 × spread] / (1 + exp[0.156 × spread]).
[Figure: fitted P(favwin) rising from 0.5 toward 1.0 as spread goes from 0 to 30]
Convert to odds-ratio:
> exp(coef(nbareg))
spread
1.168821
I A 1 point increase in the spread means the favorite is
1.17 times more likely to win
I What about a 10-point increase:
exp(10*coef(nbareg)) ≈ 4.75 times more likely
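These numbers are easy to verify by hand from the fitted coefficient (a Python sketch purely for illustration; b1 = 0.156 is taken from the glm output above):

```python
from math import exp

b1 = 0.156  # fitted spread coefficient from summary(nbareg)

def p_favwin(spread):
    """Fitted model: exp(b1*spread) / (1 + exp(b1*spread))."""
    z = b1 * spread
    return exp(z) / (1 + exp(z))

print(round(p_favwin(0), 3))    # 0.5: a zero spread is a coin flip
print(round(p_favwin(10), 3))   # ~0.826 for a 10-point favorite

# Odds-ratio view: each point of spread multiplies the win odds by exp(b1).
print(round(exp(b1), 3))        # ~1.169, matching exp(coef(nbareg))
print(round(exp(10 * b1), 2))   # ~4.76 for a 10-point increase
```

The no-intercept restriction is visible in the first line: spread = 0 forces the probability to exactly 1/2, which is the “efficient Vegas” assumption.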
Uncertainty:
> exp(confint(nbareg))
Waiting for profiling to be done...
2.5 % 97.5 %
1.139107 1.202371
Code: exp(cbind(coef(logit.reg), confint(logit.reg)))
Many things are basically the same as before:
I extractAIC(reg, k=log(n)) will give your BICs.
I The predict function works as before, but add
type = "response" to get p̂ = exp[x′b]/(1 + exp[x′b])
(otherwise it just returns the linear function x′b).
But some things are different:
I Inference based on z not t
I Without sums of squares there are no R2, anova, or
F -tests!
I Techniques for residual diagnostics and model checking
are different (but we’ll not worry about that today).
We could consider other models ... and compare with BIC!
Our “efficient Vegas” model:
> extractAIC(nbareg, k=log(553))
[1] 1.000 534.287
A model that includes non-zero intercept:
> extractAIC(glm(favwin ~ spread, family=binomial), k=log(553))
[1] 2.0000 540.4333
What if we throw in home-court advantage?
> extractAIC(glm(favwin ~ spread + favhome, family=binomial), k=log(553))
[1] 3.0000 545.637
The simplest model is best.
(The model probabilities are 0.95, 0.05, and zero.)
New predictions
Chicago vs Sacramento spread is SK by 1
p(CHI win) = 1 / (1 + exp[0.156× 1]) = 0.46
I Atlanta at NY (-1.5): P (favwin) = 0.56
I Orlando (-7.5) at Washington: P (favwin) = 0.76
I Memphis at Cleveland (-1): P (favwin) = 0.54
I New Jersey at Philadelphia (-2): P (favwin) = 0.58
I Golden State at Minnesota (-2.5): P (favwin) = 0.60
I Miami at Dallas (-2.5): P (favwin) = 0.60
I Charlotte at Milwaukee (-3.5): P (favwin) = 0.63
Example: a common business application of logistic regression
is in evaluating the credit quality of (potential) debtors.
I Take a list of borrower characteristics.
I Build a prediction rule for their credit.
I Use this rule to automatically evaluate applicants
(and track your risk profile).
You can do all this with logistic regression, and then use the
predicted probabilities to build a classification rule.
We have data on 1000 loan applicants at German community
banks, and judgment of the loan outcomes (good or bad).
The data has 20 borrower characteristics, including
I credit history (5 categories),
I housing (rent, own, or free),
I the loan purpose and duration,
I and installment rate as a percent of income.
Unfortunately, many of the columns in the data file are coded
categorically in a very opaque way. (Most are factors in R.)
A word of caution
Watch out for perfect prediction!
> too.good <- glm(GoodCredit~. + .^2, family=binomial,
+ data=credit[train,])
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
This warning means you have the logistic version of our
“connect the dots” model.
I Just as useless as before!
We can use forward stepwise regression to build a model.
> credit <- read.csv("germancredit.csv")
> train <- sample.int(nrow(credit), 0.75*nrow(credit))
> null <- glm(GoodCredit~history3, family=binomial,
+ data=credit[train,])
> full <- glm(GoodCredit~., family=binomial,
+ data=credit[train,])
> fwdBIC <- step(null, scope=formula(full),
+ direction="forward", k=log(length(train)))
...
Step: AIC=882.94
GoodCredit ~ history3 + checkingstatus1 + duration2
+ installment8
The null model has credit history as a variable, since I’d include
this regardless, and we’ve left out 250 points for validation.
Or we can use LASSO to find a model
> cvfit <- cv.glmnet(x=X[train,], y=credit$GoodCredit[train],
+ family="binomial", alpha=1, standardize=TRUE)
> betas <- coef(cvfit, s = "lambda.1se")
> model.1se <- which(betas[2:length(betas)]!=0)
Which variables were selected?
> colnames(X[,model.1se])
[1] "checkingstatus1A13" "checkingstatus1A14" "duration2"
[4] "history3A31" "history3A34" "purpose4A41"
[7] "purpose4A43" "purpose4A46" "amount5"
[10] "savings6A64" "savings6A65" "employ7A74"
[13] "installment8" "status9A93" "others10A103"
[16] "property12A124" "age13" "otherplans14A143"
[19] "housing15A152" "foreign20A202"
Trees are very popular in classification. . .
> tree.model <- rpart(GoodCredit ~ .,
+ data=credit[train,], method="class")
Classification
A common goal with logistic regression is to classify the inputs
depending on their predicted response probabilities.
In our credit scoring example,
I we might want to classify the German borrowers as having
“good” or “bad” credit (i.e., do we loan to them?).
A simple classification rule is to say that anyone with
I p(good|x) > 0.5 can get a loan,
I and the rest cannot.
——————
(As usual: classification is a huge field, and we’re only scratching
the surface here.)
Comparing with the validation set:
> predBIC <- predict(fwdBIC, newdata=credit[-train,],
+ type="response")
> errorBIC <- credit[-train,1] - (predBIC >= .5)
(LASSO takes a few extra lines)
Misclassification rates (not MSE!)
> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),
+ lasso=mean(abs(errorLasso.1se)),
+ tree=mean(abs(errorTree)))
full BIC lasso tree
0.268 0.240 0.248 0.252
I Our BIC model classifies 76% of borrowers correctly
I LASSO and tree slightly worse . . . both better than full
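The misclassification-rate calculation itself is simple; here is a toy version of it (Python for illustration, with made-up held-out labels and predicted probabilities, not the credit data):

```python
# Classify with a 0.5 cut-off and compute the misclassification rate.
y = [1, 0, 1, 1, 0, 0, 1, 0]                       # held-out outcomes (made up)
p_hat = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]   # predicted P(Y=1) (made up)

y_hat = [1 if p >= 0.5 else 0 for p in p_hat]      # the classification rule
errors = [yi != yhi for yi, yhi in zip(y, y_hat)]
print(sum(errors) / len(errors))                   # 2 of 8 misclassified: 0.25
```

This is exactly what `mean(abs(errorBIC))` computes in the R code above: the absolute difference between a 0/1 outcome and a 0/1 prediction is 1 on a mistake and 0 otherwise.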
Classification risk & ROC curves
You can also do classification with cut-offs other than 1/2.
I Suppose the risk associated with one action is higher than
for the other.
I You’ll want to have p > 0.5 of a positive outcome before
taking the risky action.
We want to know:
I What happens as the cut-off changes?
I Is there a “best” cut-off?
One way is to look at ROC curves.
ROC: Receiver Operating Characteristic
> library("pROC")
> roc.full <- roc(credit[-train,1] ~ predfull)
> coords(roc.full, x=0.5)
threshold specificity sensitivity
0.5000000 0.8662791 0.4358974
[Figure: ROC curve, sensitivity vs. specificity, with X marking the 0.5 cut-off]
Sensitivity: true positive rate
Specificity: true negative rate
——————
Many related names: recall, precision, positive predictive value, . . .
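Sensitivity and specificity at a given cut-off are just counts from the confusion matrix. A toy check (Python for illustration, with made-up data rather than the pROC output above):

```python
# Made-up validation outcomes and predicted probabilities.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_hat = [0.9, 0.6, 0.4, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.2]
cut = 0.5

tp = sum(1 for yi, p in zip(y, p_hat) if yi == 1 and p >= cut)  # true positives
fn = sum(1 for yi, p in zip(y, p_hat) if yi == 1 and p < cut)   # false negatives
tn = sum(1 for yi, p in zip(y, p_hat) if yi == 0 and p < cut)   # true negatives
fp = sum(1 for yi, p in zip(y, p_hat) if yi == 0 and p >= cut)  # false positives

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(sensitivity, specificity)
```

Sweeping `cut` from 0 to 1 and plotting the resulting (specificity, sensitivity) pairs traces out exactly the ROC curve that `roc()` draws.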
What else can we do?
> rbind( coords(roc.full, "best"),
+ coords(roc.BIC, "best"), coords(roc.lasso, "best") )
threshold specificity sensitivity
[1,] 0.3210536 0.7558140 0.6538462
[2,] 0.4065372 0.8313953 0.6153846
[3,] 0.2030214 0.6279070 0.7820513
Is misclassification improved?
> errorBIC <- credit[-train,1] - (predBIC >= xy.BIC.best[1])
> errorfull <- credit[-train,1] - (predfull >= xy.full.best[1])
> errorLasso.1se <- credit[-train,1] - (predLasso.1se >= xy.lasso.best[1])
> c(full=mean(abs(errorfull)), BIC=mean(abs(errorBIC)),
+ lasso=mean(abs(errorLasso.1se)))
full BIC lasso
0.276 0.236 0.324
Different goals, different results.
> A bunch of code...
> see week8-Rcode.R
[Figure: ROC curves for the full, BIC, and LASSO models; X marks the 0.5 cut-off, Y marks each model’s best cut-off]
Testing ROC curves
I Not really necessary.
> roc.test(roc.full,roc.BIC)
DeLong’s test for two correlated ROC curves
data: roc.full and roc.BIC
Z = -0.72882, p-value = 0.4661
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7492546 0.7700507
> roc.test(roc.lasso,roc.BIC)
Z = -0.49662, p-value = 0.6195
AUC of roc1 AUC of roc2
0.7573047 0.7700507
Not surprising, given the picture.
Extensions to Logistic Regression
The obvious step beyond 0/1 outcomes is more categories
Y = 0, 1, 2, . . . ,M.
These categories can be:
I Unordered: Buy product A, B, C, or nothing
I Ordinal: Low, medium, high risk
I Cardinal: Rate 1–5 stars
Predict choices/actions based on characteristics
. . . then do classification into multiple categories.
Our main tool will be multinomial logistic regression, but there
are others. For count data, Poisson regression may be better.
Discrete Choice
Units (firms, customers, etc) face a set of choices, labeled
{0, 1, 2, . . . ,M}. Or we want to classify units into one of
these categories.
Choose based on utility (happiness/profit/etc)
I Utility unit i gets from option m: Ui,m = Xi,mβm + εi,m.
I Chooses m∗ if Ui,m∗ is the biggest.
I So we want to model probabilities like P[Ui,m∗ > Ui,m]
Gives rise to a set of log-odds, one for each category:
log( P[Yi = m | Xi] / P[Yi = 0 | Xi] ) = β0,m + β1,mXi,
just like regular (binary) logistic regression.
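The implied category probabilities come from the softmax of the per-category linear indices, with category 0 as the baseline (its coefficients fixed at zero). A sketch (Python for illustration; the coefficient values are hypothetical):

```python
from math import exp

# One (intercept, slope) pair per category; category 0 is the baseline,
# so its coefficients are pinned at zero. Numbers here are made up.
betas = {0: (0.0, 0.0), 1: (0.5, -0.2), 2: (-1.0, 0.3)}

def category_probs(x):
    """Multinomial-logit probabilities: exp(score_m) normalized to sum to 1."""
    scores = {m: exp(b0 + b1 * x) for m, (b0, b1) in betas.items()}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}

probs = category_probs(x=2.0)
print(probs)                       # probabilities over {0, 1, 2}, summing to 1
best = max(probs, key=probs.get)   # classify by highest probability
```

Dividing any two of these probabilities makes the normalizing total cancel, which is why each log-odds against category 0 is linear in x, exactly as in the display above.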
Words of warning!
As usual, we’re skipping a lot of statistical details. But now,
we’re skipping economic/substantive stuff too:
I What are we assuming about the agents’ behavior?
I What are the random shocks εi,m? Are the X variables
specific to the product (e.g. price) or the person (e.g.
income)? Are the choices structured (e.g. ordered)?
Depending on the situation, other models may be appropriate.
You’ll hear about ordinal, conditional, nested . . . . Even harder
to interpret!
I Multinomial logit is a good all-purpose choice, often just
as good as more specialized models.
Remember: The more complex the model, the more the
assumptions matter!
It’s worth talking about the big one here:
The Independence of Irrelevant Alternatives
I If m∗ is chosen from {0, 1, 2, . . . ,M}, then m∗ is chosen
from any subset of {0, 1, 2, . . . ,M}.
I The log-odds of m vs. 0 doesn’t depend on other options.
I So if we change/remove options from the choice set, the
odds of choosing m vs. 0 do not change.
Does that seem realistic?
I Does it matter? Maybe not for classification.
What to do?
I For now, we’ll ignore the problem, but you can test if IIA
holds, then run a different model if need be.
Example: Choice of cracker brand
> names(cracker <- read.csv("cracker_choice.csv"))
> table(cracker$choice)
kleebler nabisco private sunshine
226 1792 1035 239
> cracker$choice <- relevel(cracker$choice, "private")
Perfect data for relationship or inference questions:
I How does my price relate to choice?
I How do others’ prices relate to choice?
I Does an ad help? Anyone’s ad?
But the X’s are about the product, not the person:
I Can’t figure out what customer type buys which brand.
I What would prediction/classification mean here?
Descriptives and Visualizations
Not as simple as linear regression (no lines to graph) but still
an important step.
I Look at the price of one option when others are chosen
> colMeans(cracker[cracker$choice!="sunshine",2:5])
pr.private pr.keebler pr.nabisco pr.sunshine
68.01 112.53 107.74 96.44
> colMeans(cracker[cracker$choice=="sunshine",2:5])
pr.private pr.keebler pr.nabisco pr.sunshine
68.90 113.40 110.22 86.26
[Figure: boxplots of Sunshine price and Private price by chosen brand]
I The price of all the options when Sunshine is chosen:
[Figure: histograms of the Private, Keebler, Nabisco, and Sunshine prices for purchases where Sunshine was chosen]
On to regression analysis
Deceptively easy in R (multinom comes from the nnet package):
> library(nnet)
> mlogit.fit <- multinom(choice ~ pr.private + pr.keebler +
+ pr.nabisco + pr.sunshine, data=cracker)
> exp(coef(mlogit.fit))
(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine
keebler 20.82 1.02 0.942 1.005 1.002
nabisco 13.22 1.02 1.007 0.969 0.994
sunshine 1.12 1.01 1.023 1.007 0.940
Interpretation is just as easy (hard?) as regular logistic:
I Why is b > 1 for pr.private always?
I Own price coefficient < 1?
How do we feel about these results and the IIA assumption?
> exp(coef(mlogit.fit))
(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine
keebler 20.82 1.02 0.942 1.005 1.002
nabisco 13.22 1.02 1.007 0.969 0.994
sunshine 1.12 1.01 1.023 1.007 0.940
What if Keebler raises their price by $0.05?
I Lowers the chances of buying Keebler (vs store brand):
multiply the odds-ratio by 0.942^5 = 0.74. (Why to the fifth?)
P[ keebler | $1.00] / P[ private | $1.00] × 0.74
= P[ keebler | $1.05] / P[ private | $1.05]
I Or the other way: 1/0.74 = 1.35, so people are 35% more
likely to buy keebler after a 5 cent price drop.
I But what do the 1.007 and 0.994 represent?
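A quick check of the 5-cent arithmetic (Python for illustration; the prices in this data appear to be in cents, so a $0.05 change is five one-unit steps):

```python
# Each one-cent increase in the Keebler price multiplies the
# Keebler-vs-private odds by 0.942 (the fitted odds-ratio above).
per_cent = 0.942
mult = per_cent ** 5          # five one-cent steps = a $0.05 increase
print(round(mult, 2))         # 0.74: odds fall to 74% of their old value
print(round(1 / mult, 2))     # 1.35: the reverse, a 5-cent price *drop*
```

The same compounding logic applies to any odds-ratio in the table: a k-unit change in a covariate multiplies the odds by the ratio raised to the k-th power.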
Do advertisements increase sales?
> mlogit.fit2 <- multinom(choice ~ ., data=cracker)
> exp(coef(mlogit.fit2))
(Intercept) pr.private pr.keebler pr.nabisco pr.sunshine
keebler 12.719 1.02 0.949 1.000 1.002
nabisco 11.416 1.02 1.010 0.967 0.992
sunshine 0.981 1.01 1.016 1.012 0.940
ad.privateTRUE ad.keeblerTRUE ad.nabiscoTRUE ad.sunshineTRUE
keebler 1.06 2.34 0.925 0.932
nabisco 1.07 2.03 1.285 0.885
sunshine 1.06 1.27 2.086 1.183
How much more likely are people to buy keebler (vs. store
brand) if they run an ad?
I More than twice as likely? Odds ratio multiplied by 2.34.
What about the other odds-ratios? What is the 2.03?
I Keebler running an ad increases Nabisco sales?
I Ads raise awareness of crackers? Try a model with any ad.
We can do all the usual things
I Predictions
> mlogit.probs <- predict(mlogit.fit, type="probs")
> dim(mlogit.probs)
[1] 3292 4
I Classification
I Cut-off rule problem:
what if two categories have P [Y = m|X] > c?
I One simple idea: highest predicted probability:
> colnames(mlogit.probs)[apply(mlogit.probs,1,which.max)]
I Model selection
I AIC, BIC, LASSO, etc
These may or may not make sense in discrete choice models.
Count Data
Poisson regression is a GLM designed for count data. How?
Remember the Poisson distribution:
[Figure: Poisson pmf over counts 0–10 for λ = 1, 2, 3, 7]
I Integers only
I One parameter: λ > 0
I λ = Mean and Variance
——————
Extensions: Negative Binomial Model, Zero-Inflated Models
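A quick numeric check that the Poisson mean and variance both equal λ (Python for illustration, summing the pmf over a support wide enough that the tail mass is negligible):

```python
from math import exp, factorial

def pois_pmf(k, lam):
    """Poisson pmf: exp(-lam) * lam^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0
ks = range(60)  # for lam = 3 the mass beyond 60 is negligible

mean = sum(k * pois_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * pois_pmf(k, lam) for k in ks)
print(round(mean, 6), round(var, 6))  # both ~3.0
```

This mean = variance restriction is exactly what the Negative Binomial and zero-inflated extensions relax when real count data are over-dispersed.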
Example: Number of take-over bids (NUMBIDS) received by
126 US firms during 1978–85.
[Figure: histogram of takeover$NUMBIDS, counts 0–10]
Looks like Poisson!
Covariates include
I Continuous variables:
BIDPREM, INSTHOLD,
SIZE
I Indicators: LEGLREST,
FINREST, REGULATN,
WHTKNGHT
Browsing the data
I More bids on average for the indicators being “Yes”:
LEGLREST REALREST FINREST WHTKNGHT REGULATN
No 1.46 1.69 1.70 1.18 1.67
Yes 2.11 1.96 2.08 2.12 1.91
I Pattern in the continuous variables?
[Figure: bid premium (bid price / stock price before bid), % held by institutions, and total assets, each plotted against number of bids]
Poisson regression results:
> poisson.fit <- glm(NUMBIDS ~ BIDPREM+INSTHOLD+SIZE+LEGLREST
+ +REALREST+FINREST+ WHTKNGHT+REGULATN, family="poisson")
> summary(poisson.fit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.10489 0.53443 2.067 0.03870 *
BIDPREM -0.76982 0.37676 -2.043 0.04103 *
INSTHOLD -0.13180 0.40027 -0.329 0.74195
SIZE 0.03428 0.01865 1.839 0.06598 .
LEGLREST 0.25191 0.15005 1.679 0.09318 .
REALREST -0.02024 0.17726 -0.114 0.90909
FINREST 0.09190 0.21688 0.424 0.67176
WHTKNGHT 0.49821 0.15822 3.149 0.00164 **
REGULATN 0.01178 0.16297 0.072 0.94237
⇒ The expected log(NUMBIDS) increases by 0.498 ≈ 1/2 if a
White Knight is involved. What?
Just like logistic regression, exponentiating aids interpretation.
> exp(coef(poisson.fit))
(Intercept) BIDPREM INSTHOLD SIZE LEGLREST
3.0188810 0.4630954 0.8765198 1.0348739 1.2864796
REALREST FINREST WHTKNGHT REGULATN
0.9799633 1.0962533 1.6457787 1.0118518
⇒ The incidence rate is 1.65 times higher if a White Knight is
involved.
The incidence rate ratio (IRR, or risk ratio) is almost like the
odds ratio from logistic regression: βj > 0⇔ IRR > 1.
I The number of bids (the count we’re modeling) is
expected to be 65% higher if a White Knight is involved.
I For a one-unit increase in the % held by institutions, the
number of bids falls by 1 − 0.88 = 0.12, i.e. about 12%.
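These numbers follow directly from the coefficient table (a Python check for illustration, using the estimates reported in the glm output above):

```python
from math import exp

b_whtknght = 0.49821   # WHTKNGHT coefficient from summary(poisson.fit)
b_insthold = -0.13180  # INSTHOLD coefficient

print(round(exp(b_whtknght), 3))      # ~1.646: about 65% more bids
print(round(1 - exp(b_insthold), 3))  # ~0.123: about 12% fewer bids
```

So on the incidence-rate scale the interpretation is multiplicative, just as exponentiated coefficients are on the odds scale in logistic regression.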
Final thoughts
What did we accomplish today?
I Showed how to use glm to handle different types of
outcomes.
I Explored logistic regression in some detail
I Introduced other important models
What didn’t we do?
I Explore the models in detail, or understand how they work
I Discuss assumptions, diagnostics, and fixes
All of which exist for glm.
Be careful out there!
I As usual, refer to another class/reference.