Classification for the Car Crashes Analysis
Car Crashes Data
What is our goal with the Car Crashes data?
1. Infer the relationship between variables and the probability of a serious injury occurring.
Why is this goal important?
• Enact policy to save lives.
Car Crashes Data
How do we even explore the data? Scatterplots have issues.
Car Crashes Data
How do we even explore the data?
1. Side-by-side Boxplots
Car Crashes Data
How do we even explore the data?
2. Jittered Scatterplots
Car Crashes Data
How do we even explore the data?
3. Smooth fit to scatterplot
Car Crashes Data
How do we even explore the data?
4. Cross Tabulations
              Not Severe   Severe
Yes Alcohol      351         724
No Alcohol      4198        3330
Car Crashes Data
What are the challenges with the Car Crashes data?
1. 0/1 categorical response.
2. Lots of variables (fine line b/n quantitative and categorical).

This unit is all about dealing with binary categorical responses. This is referred to as classification.

Possible Approaches:
1. Logistic Regression
2. Probit Regression
3. Various Machine Learning Algorithms (next analysis)
Why Not Linear Regression?
Why can't we just define:
\[
Y_i = \begin{cases} 1 & \text{if Severe} \\ 0 & \text{otherwise} \end{cases}
\]
then use \(\hat{\beta} = (X'X)^{-1}X'y\)?

Because:
1. Predictions: \(x_0'\hat{\beta} \notin \{0,1\}\), and need not even lie in \([0,1]\).
2. It's certainly not linear.
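A quick numerical illustration of point 1, using a tiny made-up dataset (Python with numpy, not the course's R code): ordinary least squares on a 0/1 response happily predicts values outside \([0,1]\).

```python
import numpy as np

# Tiny made-up dataset: 0/1 response with a single predictor.
X = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])  # intercept + x
y = np.array([0.0, 0.0, 1.0, 1.0])

# Least-squares fit: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The prediction at x = 5 escapes [0, 1], so it cannot be a probability.
pred_at_5 = np.array([1.0, 5.0]) @ beta_hat  # approx 1.9
```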
Logistic Regression
Thinking about linear regression, we have:
\[
Y \sim \mathcal{N}(X\beta, \sigma^2 I)
\]
so, when data are continuous we use a Gaussian distribution and are interested in regressing the mean.

1. What's an appropriate distribution when \(Y_i \in \{0,1\}\)? The Bernoulli distribution: \(f(y_i) = p^{y_i}(1-p)^{1-y_i}\)
2. When we use a Bernoulli distribution, what are we interested in? \(p = \text{Prob}(Y_i = 1)\)
Logistic Regression
So, a regression model for categorical data would be:
\[
p = \text{Prob}(Y = 1) = x_i'\beta
\]
Problem: \(p\) has to be between 0 and 1. How do we constrain our regression to be between 0 and 1?

Logistic Regression Model (a Generalized Linear Model):
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad
\underbrace{\log\left(\frac{p_i}{1-p_i}\right)}_{\text{Logit Transform}} = x_i'\beta
\;\Rightarrow\;
p_i = \underbrace{\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}}_{\text{Logistic Function}}
\]
where \(p_i/(1-p_i)\) is the Odds Ratio.
Logistic Regression
Logistic Regression Model:
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad \log\left(\frac{p_i}{1-p_i}\right) = x_i'\beta
\]
What are the assumptions?
1. Independence
2. Monotone in x's.
Logistic Regression
Logistic Regression Model:
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad \log\left(\frac{p_i}{1-p_i}\right) = x_i'\beta
\]
How do we interpret \(\beta_j\)?
1. For every unit increase in \(x_j\), the log-odds ratio increases by \(\beta_j\).
2. Just interpret the sign: if \(\beta_j > 0\), then \(p_i\) increases as \(x_j\) increases.
3. \(100\times(\exp\{\beta_j\}-1)\): percent increase in the odds ratio.
4. For every unit increase in \(x_j\), the outcome is \(\exp\{\beta_j\}\) times more likely to occur (on the odds scale).
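To make interpretation 3 concrete, a quick arithmetic check with a hypothetical coefficient (Python, for the arithmetic only; the value 0.5 is made up, not from the car crashes fit):

```python
import math

beta_j = 0.5  # hypothetical coefficient, not from any fitted model

# Interpretation 3: each unit increase in x_j multiplies the odds by
# e^{beta_j}, i.e. a 100*(e^{beta_j} - 1) percent increase in the odds.
odds_multiplier = math.exp(beta_j)
pct_increase = 100 * (odds_multiplier - 1)
print(round(pct_increase, 1))  # 64.9
```

So a coefficient of 0.5 corresponds to roughly a 65% increase in the odds per unit increase in \(x_j\).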
Logistic Regression
How do we estimate \(\beta\)?
• Maximize the likelihood:
\[
L(\beta) = \prod_{i=1}^{n} f(y_i \mid \beta)
= \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}
= \prod_{i=1}^{n} \left(\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}\right)^{y_i}\left(1-\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}\right)^{1-y_i}
= \prod_{i=1}^{n} \left(\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}\right)^{y_i}\left(\frac{1}{1+\exp\{x_i'\beta\}}\right)^{1-y_i}
\]
• In R Code: glm(y~X, family=binomial)
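The slides use R's glm to do this maximization. As a sketch of what that involves, here is a minimal Newton-Raphson (Fisher scoring) fit in Python with numpy on synthetic data; this is an illustration only, not the course's code, and the "true" coefficients are made up for the demo.

```python
import numpy as np

# Sketch only: Fisher scoring / Newton-Raphson for the logistic MLE.
def fit_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # logistic function
        W = p * (1.0 - p)                          # Bernoulli variances
        score = X.T @ (y - p)                      # gradient of log-likelihood
        info = X.T @ (X * W[:, None])              # Fisher information X'WX
        beta = beta + np.linalg.solve(info, score) # Newton step
    return beta

# Synthetic data; true_beta is hypothetical.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))  # fitted probabilities
```

At the MLE the score \(X'(y - \hat{p})\) is (numerically) zero, which is the convergence check below.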
Logistic Regression
How do we get standard errors (uncertainties) associated with the MLE \(\hat{\beta}\) so we can do CIs and tests?
• Asymptotics:
\[
\hat{\beta} \overset{d}{\to} \mathcal{N}\left(\beta, I^{-1}(\beta)\right)
\]
glm will calculate these for you. Confidence intervals can be computed by:

the.glm <- summary(glm(y~X, family=binomial))
upper <- the.glm$coef[,1] + qnorm(0.975)*the.glm$coef[,2]
lower <- the.glm$coef[,1] - qnorm(0.975)*the.glm$coef[,2]

• Bootstrap it!
• Be a Bayesian!
Logistic Regression
How do we get standard errors (uncertainties) associated with the MLE so we can do CIs and tests?
• Testing multiple parameters is based on the Likelihood Ratio Test:
\[
-2\log\left(\frac{\text{like(MLE of reduced model)}}{\text{like(MLE of full model)}}\right) \overset{d}{\to} \chi^2_{P_{full}-P_{reduced}}
\]
\(H_0\): Smaller Model is True
\(H_A\): Larger Model is True
• glm will calculate these for you:

full.glm <- glm(y~Xfull, family=binomial)
red.glm <- glm(y~Xreduced, family=binomial)
anova(red.glm, full.glm, test="Chisq")
Logistic Regression
How do we get predictions?
• We predict the probabilities:
\[
\hat{p}_0 = \frac{\exp\{x_0'\hat{\beta}\}}{1+\exp\{x_0'\hat{\beta}\}}
\]
How do we classify from predicted probabilities?
\[
\hat{y} = \begin{cases} 1 & \text{if } \hat{p} > c \\ 0 & \text{if } \hat{p} \le c \end{cases}
\]
\(c\) = Cutoff Probability
Logistic Regression
How do we assess accuracy of our classifications?
• Cross-Validation

Confusion Matrix:

                Predicted Yes's   Predicted No's
True Yes's            10                 2
True No's              4                12

Error Rate:
\[
\frac{1}{n}\sum_{i=1}^{n} I(y_i \ne \hat{y}_i) = \text{Percent Misclassified}
\]
Logistic Regression
How do we assess accuracy of our classifications?
• Sensitivity (Recall): Percent of True Positives (10/12)
• Specificity: Percent of True Negatives (12/16)
• Positive Predictive Value (Precision): % Correctly Predicted Yes's (10/14)
• Negative Predictive Value: % Correctly Predicted No's (12/14)

                Predicted Yes's   Predicted No's
True Yes's            10                 2
True No's              4                12
Logistic Regression
How do we assess accuracy of our classifications?
• F-Score (Harmonic Mean):
\[
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]

                Predicted Yes's   Predicted No's
True Yes's            10                 2
True No's              4                12
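Plugging the confusion matrix above into these definitions (a Python arithmetic check, not the course's R workflow):

```python
# Confusion matrix from the slide.
tp, fn = 10, 2    # true yes's: predicted yes / predicted no
fp, tn = 4, 12    # true no's:  predicted yes / predicted no

error_rate  = (fn + fp) / (tp + fn + fp + tn)  # 6/28
sensitivity = tp / (tp + fn)                   # recall, 10/12
specificity = tn / (tn + fp)                   # 12/16
precision   = tp / (tp + fp)                   # PPV, 10/14
npv         = tn / (tn + fn)                   # 12/14
f_score     = 2 * precision * sensitivity / (precision + sensitivity)
print(round(f_score, 3))  # 0.769
```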
Logistic Regression
So, what do we choose as our cutoff value?
1. Classify based on highest probability (c = 0.5 is called the Bayes Classifier).
2. Probably want to consider cost/benefits to set a cut-off probability.
Logistic Regression
[Figure: overall error rate, false negative rate, and false positive rate plotted against the classification threshold (0.0 to 0.5).]
Logistic Regression
How do we assess accuracy of our classifications?
ROC (Receiver Operating Characteristic) Curves: Compare sensitivity to false positive rates for many thresholds.

[Figure: ROC curve, true positive rate vs. false positive rate, with the diagonal "coin flip" line for reference.]
Summarize an ROC curve by the area under the curve (AUC).
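One way to see what the AUC measures: it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (the Mann-Whitney statistic). A small sketch in Python with hypothetical scores, not real predictions:

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: fraction of positive/negative
    pairs where the positive case scores higher (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true labels.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # 1.0  (perfect ranking)
print(auc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]))  # 0.75
```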
Logistic Regression
How do we know if our model fits well?
\[
\text{Deviance} = -2\left(\log(\text{likelihood at MLE}) - \log(\text{likelihood of "saturated" model})\right)
\]
If the likelihood is Gaussian then:
\[
\text{Deviance} = \sum_{i=1}^{n} (y_i - x_i'\hat{\beta})^2
\]
Pseudo-\(R^2\):
\[
R^2_{pseudo} = 1 - \frac{\text{What's Left Over After Model}}{\text{Null Model Likelihood}} = 1 - \frac{\text{Residual Deviance}}{\text{Null Deviance}}
\]
Interpretation: Percent of best likelihood captured by your model. Warning: the practical upper bound is not 1, so low \(R^2\) values are common.
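A small numerical sketch of the pseudo-R^2 computation (Python; the responses and fitted probabilities are made up, not from the car crashes model). For 0/1 data the saturated model fits each observation exactly, so its log-likelihood is 0 and the residual deviance is just -2 times the fitted model's log-likelihood:

```python
import math

y     = [1, 1, 0, 0, 1]             # made-up 0/1 responses
p_hat = [0.8, 0.7, 0.3, 0.2, 0.6]   # hypothetical fitted probabilities

def bern_loglik(y, p):
    # Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

resid_dev = -2 * bern_loglik(y, p_hat)       # saturated log-lik is 0
p_null    = sum(y) / len(y)                  # null model: one overall proportion
null_dev  = -2 * bern_loglik(y, [p_null] * len(y))
pseudo_r2 = 1 - resid_dev / null_dev
print(round(pseudo_r2, 2))  # 0.5
```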
Logistic Regression
How do we know if our model fits well?
1. Report the AUC of the in-sample predictions
2. Report the sensitivity, specificity, positive predictive value, and/or negative predictive value for a given threshold
3. \(R^2\) if you really want to (but you have been warned that good classification models may still have low \(R^2\) values)
Logistic Regression
How do we know if our model predicts well?
1. Report the AUC of the out-of-sample predictions
2. Report the sensitivity, specificity, positive predictive value, and/or negative predictive value for a given threshold in out-of-sample classifications
Logistic Regression
How do we do variable selection in logistic regression?
The same way we do in linear regression!
\[
AIC = -2\log(\text{Like}) + 2P
\]
\[
BIC = -2\log(\text{Like}) + P\log(n)
\]
Training Set Misclassification
Cross-Validation Misclassification
Logistic Regression
Do we have to worry about collinearity in logistic regression? Yes!

What can we do if we have multicollinearity?
1. Orthogonalize the variables.
2. Principal components (this is portable to GLMs).
3. LASSO or Ridge - glmnet(…, family=binomial).
4. Strong prior correlations.
Logistic Regression
What do we do if there is non-linearity (non-monotonicity)?
1. Splines
2. Smoothing splines - gam(…, family=binomial)

[Figure: smooth, non-monotone fitted probability curve of y vs. x.vals.]
Probit Regression
Probit Regression Model:
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad p_i = \Phi(x_i'\beta), \qquad \Phi = \text{Normal CDF}
\]
In R: glm(y~X, family=binomial(link="probit"))
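The two link functions can be compared directly; a sketch in Python (standard library only). The ~1.6 rescaling below is an assumption on my part, the common rule of thumb for why logistic coefficients come out larger than probit ones fit to the same data:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both links give probability 1/2 at z = 0 and produce S-shaped curves;
# logistic(1.6 * z) tracks probit(z) closely (gap roughly 0.02 at most),
# which is about the ratio seen between fitted logistic and probit betas.
grid = [i / 10.0 for i in range(-30, 31)]
max_gap = max(abs(logistic(1.6 * z) - probit(z)) for z in grid)
```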
Probit Regression
Logistic vs. Probit Regression: Is there a difference?

Coefficients will be different, but the probability curve will be about the same.

[Figure: probability of cancer vs. Average TK1, with the fitted logistic and probit curves nearly overlapping.]

\(\hat{\beta}_1\): Logistic 1.556, Probit 0.976
Probit Regression
Logistic vs. Probit Regression :
          Advantages                     Disadvantages
Logistic  Interpretability               No conjugacy
Probit    Conjugacy in Bayesian Model    Can only interpret sign of coefficients.
Expectation for Car Crashes Analysis
1. Clean the data – you'll have to make decisions on
   • Should some categories be combined?
2. Determine a method for the car crashes dataset.
   • Justify which variables are in the model.
   • Justify linear vs. non-linear effects.
3. Fit your chosen model and report:
   • Effects and confidence intervals – be sure to interpret these appropriately.
   • Compare probabilities of different groups.