classification for the car crashes analysis · car crashes data what are the challenges with the...

33
Classification for the Car Crashes Analysis

Upload: others

Post on 03-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Classification for the Car Crashes Analysis

Page 2: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

What is our goal with the Car Crashes data?

1. Infer the relationship between variables and the probability of a serious injury occurring.

Why is this goal important?• Enact policy to save lives.

Page 3: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

How do we even explore the data? Scatterplots have issues.

Page 4: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

How do we even explore the data?

1. Side-by-side Boxplots

Page 5: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

How do we even explore the data?

2. Jittered Scatterplots

Page 6: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

How do we even explore the data?

3. Smooth fit to scatterplot

Page 7: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

How do we even explore the data?

4. Cross Tabulations

Not Severe SevereYes Alcohol 351 724No Alcohol 4198 3330

Page 8: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Car Crashes Data

What are the challenges with the Car Crashes data?

1. 0/1 Categorical Response.2. Lots of variables (fine line b/n quantitative and

categorical).This unit is all about dealing with binary categorical responses. This is referred to as classification.

Possible Approaches:1. Logistic Regression2. Probit Regression3. Various Machine Learning Algorithms (next

analysis)

Page 9: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Why Not Linear Regression?

Why can’t we just define:

then use:

� = (X0X)�1X0y?

Because:1. Predictions .2. Its certainly not-linear.

x00� 62 {0, 1}

Yi =

(1 if Severe

0 otherwise

Page 10: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

Thinking about linear regression, we have:

Y ⇠ N (X�,�2I)

so, when data are continuous we use a Gaussian distribution and are interested in regressing the mean.

2. When we use a Bernoulli distribution, what are we interested in?

p = Prob(Yi = 1)

1. What’s an appropriate distribution when Yi 2 {0, 1}?Bernoulli Distribution: f(yi) = pyi(1� p)1�yi

Page 11: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

So, a regression model for categorical data would be:

p = Prob(Y = 1) = x0i�

Problem: has to be between 0 and 1. How do we constrain our regression to be between 0 and 1?

p

Logistic Regression Model: (Generalized Linear Model)

Yiind⇠ Bern(pi)

log

✓pi

1� pi

◆= x0

i�

) pi =exp{x0

i�}1 + exp{x0

i�}LogitTransform

Logistic Function

Odds Ratio

Page 12: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

Logistic Regression Model:

1. Independence2. Monotone in x’s.

What are the assumptions?

log

✓pi

1� pi

◆= x0

i�Yiind⇠ Bern(pi)

Page 13: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

Logistic Regression Model:

1. For every unit increase in , the log-odds ratio increases by

xj

�j .

2. Just interpret the sign: If , then increases as increases.

�j > 0 pixj

3. percent increase in the odds ratio.100⇥ (exp{�j}� 1) :

How do we interpret �j?

log

✓pi

1� pi

◆= x0

i�Yiind⇠ Bern(pi)

4. For every unit increase in , the outcome ismore likely to occur.

xj exp{�j}

Page 14: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we estimate �?• Maximum the likelihood:

L(�) =nY

i=1

f(yi | �) =nY

i=1

pyii (1� pi)

yi

=nY

i=1

✓exp{x0

i�}1 + exp{x0

i�}

◆yi✓1� exp{x0

i�}1 + exp{x0

i�}

◆1�yi

=nY

i=1

✓exp{x0

i�}1 + exp{x0

i�}

◆yi✓

1

1 + exp{x0i�}

◆1�yi

• In R Code: glm(y~X,family=binomial)

Page 15: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we get standard errors (uncertainties) associated with the MLE so we can do CIs and tests?

• Asymptotics:

�d! N (�, I�1(�))

glm will calculate these for you. Confidence intervals can be computed by:

the.glm <- summary(glm(y~X,family=binomial))upper <- the.glm$coef[,1] + qnorm(0.975)*the.glm$coef[,2]lower <- the.glm$coef[,1] - qnorm(0.975)*the.glm$coef[,2]

• Bootstrap it!• Be a Bayesian!

Page 16: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we get standard errors (uncertainties) associated with the MLE so we can do CIs and tests?

• Testing multiple parameters is based on the Likelihood Ratio Tests

• glm will calculate these for you:full.glm <- glm(y~Xfull,family=binomial)red.glm <- glm(y~Xreduced,family=binomial)anova.glm(full.glm,red.glm)

�2 log

✓like(MLE of reduced model)

like(MLE of full model)

◆d! �2

Pfull�Preduced

H0 : Smaller Model is True

HA : Larger Model is True

Page 17: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we get predictions?

• We predict the probabilities:

How do we classify from predicted probabilities?

y =

(1 if p > c

0 if p c

c = Cuto↵ Probability

p0 =exp{x0

0�}1 + exp{x0

0�}

Page 18: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we assess accuracy of our classifications?

• Cross-Validation

Confusion Matrix

ErrorRate

1

n

nX

i=1

I(yi 6= yi) = Percent Misclassified

Predicted Yes’s Predicted No’sTrue Yes’s 10 2True No’s 4 12

Page 19: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we assess accuracy of our classifications?

• Sensitivity (Recall): Percent of True Positives (10/12)

• Specificity: Percent of True Negatives (12/16)• Positive Predictive Value (Precision): % Correctly

Predicted Yes’s (10/14)

Predicted Yes’s Predicted No’sTrue Yes’s 10 2True No’s 4 12

• Negative Predictive Value: % Correctly Predicted No’s (12/14)

Page 20: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we assess accuracy of our classifications?

• F-Score (Harmonic Mean):

2 (precision*recall)/(precision+recall)

Predicted Yes’s Predicted No’sTrue Yes’s 10 2True No’s 4 12

Page 21: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

So, what do we choose as our cutoff value?

1. Classify based on highest probability (c = 0.5 is called the Bayes Classifier).

2. Probably want to consider cost/benefits to set a cut-off probability.

Page 22: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

False Negative Rate

False Positive Rate

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.2

0.4

0.6

Threshold

Err

or

Rate

Overall Error

Page 23: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we assess accuracy of our classifications?

ROC (Receiver Operating Characteristic) Curves: Compares sensitivity to false positive rates for many thresholds.ROC Curve

False positive rate

True p

osi

tive r

ate

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

“Coin Flip” Rate

Summarize an ROC curve by the area under the curve (AUC).

Page 24: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we know if our model fits well?

Deviance = �2 (log(likelihood at MLE)� log(likelihood of “saturated” model))

If the likelihood is Gaussian then:

Deviance =nX

i=1

(yi � x0i�)

2

Pseudo-R2

Interpretation: Percent of best likelihood captured by your model. Warning: practical upper bound is not 1 so low R2 values are common.

R2pseudo = 1� Whats Left Over After Model

Null Model Likelihood= 1� Residual Deviance

Null Deviance

Page 25: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we know if our model fits well?

1. Report the AUC of the in-sample predictions2. Report the sensitivity, specificity, positive predicted

value and/or negative predicted value for a given threshold

3. R2 if you really want to (but you have been warned that good classification models may still have low R2 values)

Page 26: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we know if our predicts well?

1. Report the AUC of the out-of-sample predictions2. Report the sensitivity, specificity, positive predicted

value and/or negative predicted value for a given threshold in out-of-sample classifications

Page 27: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

How do we do variable selection in logistic regression?

The same way we do in linear regression!

AIC = �2 log(Like) + 2P

BIC = �2 log(Like) + P log(n)

Training Set Misclassification

Cross Validation Misclassification

Page 28: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

Do we have to worry about collinearity in logistic regression?Yes!

What can we do if we have multicollinearity?1. Orthogonalize the variables.2. Principal components (this is portable to GLMs).3. LASSO or Ridge - glmnet(…,family=binomial).4. Strong prior correlations.

Page 29: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Logistic Regression

What do we do if there is non-linearity (non-monotonicity)?

1. Splines2. Smoothing splines - gam(…,family=binomial)

0 1 2 3 4 5 6

0.0

0.2

0.4

0.6

0.8

1.0

x.vals

y

Page 30: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Probit Regression

Probit Regression Model:

In R: glm(y~X,family=binomial(link=“probit”))

Yiind⇠ Bern(pi)

pi = �(x0i�)

� = Normal CDF

Page 31: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Probit Regression

Logistic vs. Probit Regression : Is there a difference?

Coefficients will be different, but probability curve will be about the same.

1.0 1.5 2.0 2.5 3.0

0.0

0.2

0.4

0.6

0.8

1.0

Average TK1

Pro

babi

lity

of C

ance

r

LogisticProbit

�1

Logistic 1.556Probit 0.976

Page 32: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Probit Regression

Logistic vs. Probit Regression :

Advantages Disadvantages

Logistic Interpretability No conjugacy

Probit Conjugacy in Bayesian Model

Can only interpret sign of coefficients.

Page 33: Classification for the Car Crashes Analysis · Car Crashes Data What are the challenges with the Car Crashes data? 1. 0/1 Categorical Response. 2. Lots of variables (fine line b/n

Expectation for Car Crashes Analysis

1. Clean the data – you’ll have to make decisions on• Should some categories be combined?

2. Determine a method for the car crashes dataset.• Justify which variables are in the model.• Justify linear vs. non-linear effects.

3. Fit your chosen model and report:• Effects and confidence intervals – be sure to interpret these

appropriately.• Compare probabilities of different groups.