
Page 1:

Predicting Good Probabilities With Supervised Learning

Alexandru Niculescu-Mizil

Rich Caruana

Cornell University

Page 2:

What are good probabilities?

Ideally, if the model predicts 0.75 for an example, then the conditional probability that the example is positive, given the available attributes, is 0.75.

In practice:
- Good calibration: of all the cases for which the model predicts 0.75, 75% are positive.
- Low Brier score (squared error).
- Low cross-entropy (log-loss).
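Both scores are easy to compute directly from predictions and labels. A minimal NumPy sketch (the function names are ours, not from the talk):

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Brier score: mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return np.mean((p_pred - y_true) ** 2)

def log_loss(y_true, p_pred, eps=1e-15):
    """Cross-entropy (log-loss); predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```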

Page 3:

Why good probabilities?

Intelligibility

If the classifier is part of a larger system:
- Speech recognition
- Handwriting recognition

If the classifier is used for decision making:
- Cost-sensitive decisions
- Medical applications
- Meteorology
- Risk analysis

Page 4:

What did we do?

We analyzed the predictions made by ten supervised learning algorithms.

For the analysis we used eight binary classification problems.

Limitations:
- Only binary problems; no multiclass.
- No high-dimensional problems (all under 200 attributes).
- Only moderately sized training sets.

Page 5:

Questions addressed in this talk

Which models are well calibrated and which are not?

Can we fix the models that are not well calibrated?

Which learning algorithm makes the best probabilistic predictions?

Page 6:

Reliability diagrams

Put the cases with predicted values between 0 and 0.1 in the first bin, between 0.1 and 0.2 in the second, etc.

For each bin, plot the mean predicted value against the true fraction of positives.
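A minimal sketch of this binning procedure in Python (the helper name and the equal-width ten-bin scheme follow the slide's description):

```python
import numpy as np

def reliability_diagram(y_true, p_pred, n_bins=10):
    """For each bin, return (mean predicted value, true fraction of positives)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # Bin i covers [i/n_bins, (i+1)/n_bins); a prediction of exactly 1.0 goes in the last bin.
    bins = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():  # skip empty bins
            points.append((p_pred[mask].mean(), y_true[mask].mean()))
    return points  # plot mean prediction (x) against fraction positive (y)
```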

Page 7:

Which models are well calibrated?

[Reliability diagrams for the ten models: ANN, BAG-DT, LOGREG, SVM, BST-DT, BST-STMP, RF, DT, KNN, NB]

Page 8:

Questions addressed in this talk

Which models are well calibrated and which are not?

Can we fix the models that are not well calibrated?

Which learning algorithm makes the best probabilistic predictions?

Page 9:

Can we fix the models that are not well calibrated?

Platt Scaling [Platt '99]:
- Method used by Platt to obtain calibrated probabilities from SVMs.
- Converts the outputs by passing them through a sigmoid.
- The sigmoid is fitted using an independent calibration set.
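A minimal sketch of Platt Scaling, using scikit-learn's logistic regression on the raw scores as the sigmoid fitter. This is an illustrative stand-in, not the original procedure: Platt's method fits the sigmoid parameters with a Newton-style optimizer and smooths the 0/1 targets to (N+ + 1)/(N+ + 2) and 1/(N- + 2) to reduce overfitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Fit the sigmoid p = 1 / (1 + exp(A*f + B)) on a held-out calibration set."""
    lr = LogisticRegression(C=1e10)  # effectively unregularized one-feature logistic fit
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    # Return a callable mapping raw scores to calibrated probabilities.
    return lambda f: lr.predict_proba(np.asarray(f).reshape(-1, 1))[:, 1]
```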

Page 10:

Can we fix the models that are not well calibrated?

Isotonic Regression [Robertson et al. '88]:
- More general calibration method, used by Zadrozny and Elkan [Zadrozny & Elkan '01, '02].
- Converts the outputs by passing them through a general isotonic (monotonically increasing) function.
- The isotonic function is fitted using an independent calibration set.
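A minimal sketch using scikit-learn's IsotonicRegression, which implements the pair-adjacent-violators (PAV) algorithm (the wrapper name is ours):

```python
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(scores_cal, y_cal):
    """Fit a monotonically increasing map from raw scores to probabilities."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores_cal, y_cal)
    return iso.predict  # callable: raw scores -> calibrated probabilities
```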

Page 11:

Max-margin methods

(BST-DT, BST-STMP, SVM)

[Histograms of predicted values (HIST) and reliability plots, uncalibrated and after Platt Scaling (PLATT) and Isotonic Regression (ISO)]

Predictions are pushed away from 0 and 1.

Reliability plots have a sigmoidal shape.

Calibration undoes the shift in predictions: more cases have predicted values closer to 0 and 1.

Page 12:

Boosted decision trees

[Histograms (HIST) and reliability plots after Platt Scaling (PLATT) and Isotonic Regression (ISO) for boosted decision trees on the eight problems: P1 COVT, P2 ADULT, P3 LET1, P4 LET2, P5 MEDIS, P6 SLAC, P7 HS, P8 MG]

Page 13:

Naive Bayes

[Histograms and reliability plots for Naive Bayes, uncalibrated and after Platt Scaling (PLATT) and Isotonic Regression (ISO)]

Naive Bayes pushes predictions toward 0 and 1 because of its unrealistic independence assumptions: correlated attributes are treated as independent evidence, so their contributions multiply and the posteriors become overconfident.

This generates reliability plots that have an inverted sigmoid shape.

Even though Platt Scaling helps improve the calibration, it is clear that a sigmoid is not the right function for Naive Bayes models.

Isotonic Regression provides a better fit.

Page 14:

Platt Scaling vs. Isotonic Regression

[Brier score (y-axis, .28 to .38) vs. calibration set size (x-axis, 10 to 10,000, log scale) for ANN, BST-DT, RF, and NB; curves shown: UNCAL, PLATT, ISO]

Page 15:

Questions addressed in this talk

Which models are well calibrated and which are not?

Can we fix the models that are not well calibrated?

Which learning algorithm makes the best probabilistic predictions?

Page 16:

Empirical Comparison

[Brier scores of the ten learning algorithms: BST-DT, SVM, RF, ANN, BAG, KNN, STMP, DT, LR, NB]

Page 17:

Summary and Conclusions

We examined the quality of the probabilities predicted by ten supervised learning algorithms.

Neural nets, bagged trees and logistic regression have well calibrated predictions.

Max-margin methods such as boosting and SVMs push the predicted values away from 0 and 1. This yields a sigmoid-shaped reliability diagram.

Learning algorithms such as Naive Bayes distort the probabilities in the opposite way, pushing them closer to 0 and 1.

Page 18:

Summary and Conclusions

We examined two methods to calibrate the predictions.

Max-margin methods and Naive Bayes benefit greatly from calibration, while well-calibrated methods do not.

Platt Scaling is more effective when the calibration set is small, but Isotonic Regression is more powerful when there is enough data to prevent overfitting.

The methods that predict the best probabilities are calibrated boosted trees, calibrated random forests, calibrated SVMs, uncalibrated bagged trees and uncalibrated neural nets.
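To illustrate the point above about calibration-set size, here is a hypothetical experiment sketch reusing the brier_score, fit_platt, and fit_isotonic helpers sketched earlier (for the smallest subsamples the calibration sample must still contain both classes):

```python
import numpy as np

def compare_calibrators(scores_cal, y_cal, scores_test, y_test,
                        sizes=(10, 100, 1000, 10000)):
    """Test-set Brier score for Platt vs. Isotonic as the calibration set grows."""
    scores_cal, y_cal = np.asarray(scores_cal), np.asarray(y_cal)
    rng = np.random.default_rng(0)
    for n in sizes:
        n = min(n, len(y_cal))
        idx = rng.choice(len(y_cal), size=n, replace=False)  # subsample calibration data
        platt = fit_platt(scores_cal[idx], y_cal[idx])
        iso = fit_isotonic(scores_cal[idx], y_cal[idx])
        print(f"n={n}: Platt Brier={brier_score(y_test, platt(scores_test)):.4f}, "
              f"Isotonic Brier={brier_score(y_test, iso(scores_test)):.4f}")
```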

Page 19:

Thank you!

Questions?