Classification for the Car Crashes Analysis
Car Crashes Data
What is our goal with the Car Crashes data?
1. Infer the relationship between variables and the probability of a serious injury occurring.
Why is this goal important?
• Enact policy to save lives.
Car Crashes Data
How do we even explore the data? Scatterplots have issues.
Car Crashes Data
How do we even explore the data?
1. Side-by-side Boxplots
Car Crashes Data
How do we even explore the data?
2. Jittered Scatterplots
Car Crashes Data
How do we even explore the data?
3. Smooth fit to scatterplot
Car Crashes Data
How do we even explore the data?
4. Cross Tabulations
              Not Severe   Severe
Yes Alcohol      351         724
No Alcohol      4198        3330
Car Crashes Data
What are the challenges with the Car Crashes data?
1. 0/1 categorical response.
2. Lots of variables (fine line b/n quantitative and categorical).

This unit is all about dealing with binary categorical responses. This is referred to as classification.

Possible Approaches:
1. Logistic Regression
2. Probit Regression
3. Various Machine Learning Algorithms (next analysis)
Why Not Linear Regression?
Why can't we just define:
\[
Y_i = \begin{cases} 1 & \text{if Severe} \\ 0 & \text{otherwise} \end{cases}
\]
then use \(\hat{\beta} = (X'X)^{-1}X'y\)?

Because:
1. Predictions: \(x_0'\hat{\beta} \notin \{0,1\}\), and need not even lie in \([0,1]\).
2. It's certainly not linear.
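A quick numerical illustration of point 1, using a tiny made-up dataset (Python with numpy, not the course's R code): ordinary least squares on a 0/1 response happily predicts values outside \([0,1]\).

```python
import numpy as np

# Tiny made-up dataset: 0/1 response with a single predictor.
X = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])  # intercept + x
y = np.array([0.0, 0.0, 1.0, 1.0])

# Least-squares fit: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The prediction at x = 5 escapes [0, 1], so it cannot be a probability.
pred_at_5 = np.array([1.0, 5.0]) @ beta_hat  # approx 1.9
```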
Logistic Regression
Thinking about linear regression, we have:
\[
Y \sim \mathcal{N}(X\beta, \sigma^2 I)
\]
so, when data are continuous we use a Gaussian distribution and are interested in regressing the mean.

1. What's an appropriate distribution when \(Y_i \in \{0,1\}\)? The Bernoulli distribution: \(f(y_i) = p^{y_i}(1-p)^{1-y_i}\)
2. When we use a Bernoulli distribution, what are we interested in? \(p = \text{Prob}(Y_i = 1)\)
Logistic Regression
So, a regression model for categorical data would be:
\[
p = \text{Prob}(Y = 1) = x_i'\beta
\]
Problem: \(p\) has to be between 0 and 1. How do we constrain our regression to be between 0 and 1?

Logistic Regression Model (a Generalized Linear Model):
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad
\underbrace{\log\left(\frac{p_i}{1-p_i}\right)}_{\text{Logit Transform}} = x_i'\beta
\;\Rightarrow\;
p_i = \underbrace{\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}}_{\text{Logistic Function}}
\]
where \(p_i/(1-p_i)\) is the Odds Ratio.
Logistic Regression
Logistic Regression Model:
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad \log\left(\frac{p_i}{1-p_i}\right) = x_i'\beta
\]
What are the assumptions?
1. Independence
2. Monotone in x's.
Logistic Regression
Logistic Regression Model:
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad \log\left(\frac{p_i}{1-p_i}\right) = x_i'\beta
\]
How do we interpret \(\beta_j\)?
1. For every unit increase in \(x_j\), the log-odds ratio increases by \(\beta_j\).
2. Just interpret the sign: if \(\beta_j > 0\), then \(p_i\) increases as \(x_j\) increases.
3. \(100\times(\exp\{\beta_j\}-1)\): percent increase in the odds ratio.
4. For every unit increase in \(x_j\), the outcome is \(\exp\{\beta_j\}\) times more likely to occur (on the odds scale).
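To make interpretation 3 concrete, a quick arithmetic check with a hypothetical coefficient (Python, for the arithmetic only; the value 0.5 is made up, not from the car crashes fit):

```python
import math

beta_j = 0.5  # hypothetical coefficient, not from any fitted model

# Interpretation 3: each unit increase in x_j multiplies the odds by
# e^{beta_j}, i.e. a 100*(e^{beta_j} - 1) percent increase in the odds.
odds_multiplier = math.exp(beta_j)
pct_increase = 100 * (odds_multiplier - 1)
print(round(pct_increase, 1))  # 64.9
```

So a coefficient of 0.5 corresponds to roughly a 65% increase in the odds per unit increase in \(x_j\).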
Logistic Regression
How do we estimate \(\beta\)?
• Maximize the likelihood:
\[
L(\beta) = \prod_{i=1}^{n} f(y_i \mid \beta)
= \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i}
= \prod_{i=1}^{n} \left(\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}\right)^{y_i}\left(1-\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}\right)^{1-y_i}
= \prod_{i=1}^{n} \left(\frac{\exp\{x_i'\beta\}}{1+\exp\{x_i'\beta\}}\right)^{y_i}\left(\frac{1}{1+\exp\{x_i'\beta\}}\right)^{1-y_i}
\]
• In R Code: glm(y~X, family=binomial)
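The slides use R's glm to do this maximization. As a sketch of what that involves, here is a minimal Newton-Raphson (Fisher scoring) fit in Python with numpy on synthetic data; this is an illustration only, not the course's code, and the "true" coefficients are made up for the demo.

```python
import numpy as np

# Sketch only: Fisher scoring / Newton-Raphson for the logistic MLE.
def fit_logistic(X, y, n_iter=25):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # logistic function
        W = p * (1.0 - p)                          # Bernoulli variances
        score = X.T @ (y - p)                      # gradient of log-likelihood
        info = X.T @ (X * W[:, None])              # Fisher information X'WX
        beta = beta + np.linalg.solve(info, score) # Newton step
    return beta

# Synthetic data; true_beta is hypothetical.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))  # fitted probabilities
```

At the MLE the score \(X'(y - \hat{p})\) is (numerically) zero, which is the convergence check below.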
Logistic Regression
How do we get standard errors (uncertainties) associated with the MLE \(\hat{\beta}\) so we can do CIs and tests?
• Asymptotics:
\[
\hat{\beta} \overset{d}{\to} \mathcal{N}\left(\beta, I^{-1}(\beta)\right)
\]
glm will calculate these for you. Confidence intervals can be computed by:

the.glm <- summary(glm(y~X, family=binomial))
upper <- the.glm$coef[,1] + qnorm(0.975)*the.glm$coef[,2]
lower <- the.glm$coef[,1] - qnorm(0.975)*the.glm$coef[,2]

• Bootstrap it!
• Be a Bayesian!
Logistic Regression
How do we get standard errors (uncertainties) associated with the MLE so we can do CIs and tests?
• Testing multiple parameters is based on the Likelihood Ratio Test:
\[
-2\log\left(\frac{\text{like(MLE of reduced model)}}{\text{like(MLE of full model)}}\right) \overset{d}{\to} \chi^2_{P_{full}-P_{reduced}}
\]
\(H_0\): Smaller Model is True
\(H_A\): Larger Model is True
• glm will calculate these for you:

full.glm <- glm(y~Xfull, family=binomial)
red.glm <- glm(y~Xreduced, family=binomial)
anova(red.glm, full.glm, test="Chisq")
Logistic Regression
How do we get predictions?
• We predict the probabilities:
\[
\hat{p}_0 = \frac{\exp\{x_0'\hat{\beta}\}}{1+\exp\{x_0'\hat{\beta}\}}
\]
How do we classify from predicted probabilities?
\[
\hat{y} = \begin{cases} 1 & \text{if } \hat{p} > c \\ 0 & \text{if } \hat{p} \le c \end{cases}
\]
\(c\) = Cutoff Probability
Logistic Regression
How do we assess accuracy of our classifications?
• Cross-Validation

Confusion Matrix:

                Predicted Yes's   Predicted No's
True Yes's            10                 2
True No's              4                12

Error Rate:
\[
\frac{1}{n}\sum_{i=1}^{n} I(y_i \ne \hat{y}_i) = \text{Percent Misclassified}
\]
Logistic Regression
How do we assess accuracy of our classifications?
• Sensitivity (Recall): Percent of True Positives (10/12)
• Specificity: Percent of True Negatives (12/16)
• Positive Predictive Value (Precision): % Correctly Predicted Yes's (10/14)
• Negative Predictive Value: % Correctly Predicted No's (12/14)

                Predicted Yes's   Predicted No's
True Yes's            10                 2
True No's              4                12
Logistic Regression
How do we assess accuracy of our classifications?
• F-Score (Harmonic Mean):
\[
F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]

                Predicted Yes's   Predicted No's
True Yes's            10                 2
True No's              4                12
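Plugging the confusion matrix above into these definitions (a Python arithmetic check, not the course's R workflow):

```python
# Confusion matrix from the slide.
tp, fn = 10, 2    # true yes's: predicted yes / predicted no
fp, tn = 4, 12    # true no's:  predicted yes / predicted no

error_rate  = (fn + fp) / (tp + fn + fp + tn)  # 6/28
sensitivity = tp / (tp + fn)                   # recall, 10/12
specificity = tn / (tn + fp)                   # 12/16
precision   = tp / (tp + fp)                   # PPV, 10/14
npv         = tn / (tn + fn)                   # 12/14
f_score     = 2 * precision * sensitivity / (precision + sensitivity)
print(round(f_score, 3))  # 0.769
```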
Logistic Regression
So, what do we choose as our cutoff value?
1. Classify based on highest probability (c = 0.5 is called the Bayes Classifier).
2. Probably want to consider cost/benefits to set a cut-off probability.
Logistic Regression
[Figure: overall error rate, false negative rate, and false positive rate plotted against the classification threshold (0.0 to 0.5).]
Logistic Regression
How do we assess accuracy of our classifications?
ROC (Receiver Operating Characteristic) Curves: Compare sensitivity to false positive rates for many thresholds.

[Figure: ROC curve, true positive rate vs. false positive rate, with the diagonal "coin flip" line for reference.]
Summarize an ROC curve by the area under the curve (AUC).
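One way to see what the AUC measures: it equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (the Mann-Whitney statistic). A small sketch in Python with hypothetical scores, not real predictions:

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: fraction of positive/negative
    pairs where the positive case scores higher (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 * (p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true labels.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # 1.0  (perfect ranking)
print(auc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]))  # 0.75
```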
Logistic Regression
How do we know if our model fits well?
\[
\text{Deviance} = -2\left(\log(\text{likelihood at MLE}) - \log(\text{likelihood of "saturated" model})\right)
\]
If the likelihood is Gaussian then:
\[
\text{Deviance} = \sum_{i=1}^{n} (y_i - x_i'\hat{\beta})^2
\]
Pseudo-\(R^2\):
\[
R^2_{pseudo} = 1 - \frac{\text{What's Left Over After Model}}{\text{Null Model Likelihood}} = 1 - \frac{\text{Residual Deviance}}{\text{Null Deviance}}
\]
Interpretation: Percent of best likelihood captured by your model. Warning: the practical upper bound is not 1, so low \(R^2\) values are common.
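A small numerical sketch of the pseudo-R^2 computation (Python; the responses and fitted probabilities are made up, not from the car crashes model). For 0/1 data the saturated model fits each observation exactly, so its log-likelihood is 0 and the residual deviance is just -2 times the fitted model's log-likelihood:

```python
import math

y     = [1, 1, 0, 0, 1]             # made-up 0/1 responses
p_hat = [0.8, 0.7, 0.3, 0.2, 0.6]   # hypothetical fitted probabilities

def bern_loglik(y, p):
    # Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p)
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

resid_dev = -2 * bern_loglik(y, p_hat)       # saturated log-lik is 0
p_null    = sum(y) / len(y)                  # null model: one overall proportion
null_dev  = -2 * bern_loglik(y, [p_null] * len(y))
pseudo_r2 = 1 - resid_dev / null_dev
print(round(pseudo_r2, 2))  # 0.5
```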
Logistic Regression
How do we know if our model fits well?
1. Report the AUC of the in-sample predictions
2. Report the sensitivity, specificity, positive predictive value, and/or negative predictive value for a given threshold
3. \(R^2\) if you really want to (but you have been warned that good classification models may still have low \(R^2\) values)
Logistic Regression
How do we know if our model predicts well?
1. Report the AUC of the out-of-sample predictions
2. Report the sensitivity, specificity, positive predictive value, and/or negative predictive value for a given threshold in out-of-sample classifications
Logistic Regression
How do we do variable selection in logistic regression?
The same way we do in linear regression!
\[
AIC = -2\log(\text{Like}) + 2P
\]
\[
BIC = -2\log(\text{Like}) + P\log(n)
\]
Training Set Misclassification
Cross-Validation Misclassification
Logistic Regression
Do we have to worry about collinearity in logistic regression? Yes!

What can we do if we have multicollinearity?
1. Orthogonalize the variables.
2. Principal components (this is portable to GLMs).
3. LASSO or Ridge - glmnet(…, family=binomial).
4. Strong prior correlations.
Logistic Regression
What do we do if there is non-linearity (non-monotonicity)?
1. Splines
2. Smoothing splines - gam(…, family=binomial)

[Figure: smooth, non-monotone fitted probability curve of y vs. x.vals.]
Probit Regression
Probit Regression Model:
\[
Y_i \overset{ind}{\sim} \text{Bern}(p_i), \qquad p_i = \Phi(x_i'\beta), \qquad \Phi = \text{Normal CDF}
\]
In R: glm(y~X, family=binomial(link="probit"))
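The two link functions can be compared directly; a sketch in Python (standard library only). The ~1.6 rescaling below is an assumption on my part, the common rule of thumb for why logistic coefficients come out larger than probit ones fit to the same data:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Both links give probability 1/2 at z = 0 and produce S-shaped curves;
# logistic(1.6 * z) tracks probit(z) closely (gap roughly 0.02 at most),
# which is about the ratio seen between fitted logistic and probit betas.
grid = [i / 10.0 for i in range(-30, 31)]
max_gap = max(abs(logistic(1.6 * z) - probit(z)) for z in grid)
```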
Probit Regression
Logistic vs. Probit Regression: Is there a difference?

Coefficients will be different, but the probability curve will be about the same.

[Figure: probability of cancer vs. Average TK1, with the fitted logistic and probit curves nearly overlapping.]

\(\hat{\beta}_1\): Logistic 1.556, Probit 0.976
Probit Regression
Logistic vs. Probit Regression :
          Advantages                     Disadvantages
Logistic  Interpretability               No conjugacy
Probit    Conjugacy in Bayesian Model    Can only interpret sign of coefficients.
Expectation for Car Crashes Analysis
1. Clean the data – you'll have to make decisions on
   • Should some categories be combined?
2. Determine a method for the car crashes dataset.
   • Justify which variables are in the model.
   • Justify linear vs. non-linear effects.
3. Fit your chosen model and report:
   • Effects and confidence intervals – be sure to interpret these appropriately.
   • Compare probabilities of different groups.