Machine Learning Workshop


DESCRIPTION

Presentation on Decision Trees and Random Forests given for the Boston Predictive Analytics Machine Learning Workshop on December 2, 2012. Code to accompany the slides is available at www.github.com/dgerlanc/mlclass or http://www.enplusadvisors.com/wp-content/uploads/2012/12/mlclass_1.0.tar.gz

TRANSCRIPT

Page 1: Machine Learning Workshop

Hands-on Classification: Decision Trees and Random Forests

Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
[email protected]

Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012

Page 2: Machine Learning Workshop

© Daniel Gerlanc, 2012. All rights reserved.

If you’d like to use this material for any purpose, please contact [email protected]

Page 3: Machine Learning Workshop

What You’ll Learn

• Intuition behind decision trees and random forests

• Implementation in R

• Assessing the results

Page 4: Machine Learning Workshop

Dataset

• Chemical Analysis of Italian Wines

• http://www.parvus.unige.it/

• 178 records, 14 attributes

Page 5: Machine Learning Workshop

Follow along

> library(mlclass)
> data(wine)
> str(wine)
'data.frame':   178 obs. of  14 variables:
 $ Type      : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alcohol   : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic     : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash       : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity: num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...

Page 6: Machine Learning Workshop

What are Decision Trees?

• A model that partitions an input space into regions, one class prediction per region

Page 7: Machine Learning Workshop

What’s partitioning?

See rf-1.R
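rf-1.R itself is not reproduced in these slides. The sketch below is a minimal stand-in using the wine data from the mlclass package; the choice of the Alcohol and Malic attributes and the cutoff of 13 are purely illustrative assumptions:

library(mlclass)
data(wine)

# Plot two attributes, colored by class. Partitioning means carving this
# plane into rectangular regions, with one class prediction per region.
plot(wine$Alcohol, wine$Malic,
     col=ifelse(wine$Type == "Grig", "darkgreen", "red"),
     xlab="Alcohol", ylab="Malic", pch=19)
abline(v=13, lty=2)  # a single split on Alcohol creates two regions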

Page 8: Machine Learning Workshop

Create the 1st split.

[Figure: scatterplot of the wine data partitioned into a "G" and a "Not G" region by the first split. See rf-1.R]

Page 9: Machine Learning Workshop

Create the 2nd split.

[Figure: partition after the second split, with regions labeled "G", "Not G", and "G". See rf-1.R]

Page 10: Machine Learning Workshop

Create more splits…

[Figure: partition after further splits, with regions labeled "G", "Not G", "G", and "Not G". (I drew this last one in.)]

Page 11: Machine Learning Workshop

Another view of partitioning

See rf-2.R

Page 12: Machine Learning Workshop

Use R to do the partitioning.

library(rpart)       # recursive partitioning trees
library(rpart.plot)  # prp() plotting function

tree.1 <- rpart(Type ~ ., data=wine)
prp(tree.1, type=4, extra=2)

• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters are available to control the fit.

See rf-2.R

Page 13: Machine Learning Workshop

Make predictions on a test dataset

predict(tree.1, newdata=wine, type="vector")

Page 14: Machine Learning Workshop

How’d it do?

Guessing the majority class ("No"): 60.11% (107 / 178)

CART: 94.38% accuracy
• Precision: 92.95% (66 / 71)
• Sensitivity/Recall: 92.95% (66 / 71)

                Actual
Predicted    Grig     No
Grig           66      5
No              5    102
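These numbers can be reproduced by tabulating predictions against the actual labels. A minimal sketch, assuming tree.1 from the earlier slide (type="class" returns the predicted factor labels):

pred <- predict(tree.1, newdata=wine, type="class")
cm <- table(Predicted=pred, Actual=wine$Type)

sum(diag(cm)) / sum(cm)                 # accuracy:  (66 + 102) / 178 = 94.38%
cm["Grig", "Grig"] / sum(cm["Grig", ])  # precision: 66 / 71 = 92.95%
cm["Grig", "Grig"] / sum(cm[, "Grig"])  # recall:    66 / 71 = 92.95%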

Page 15: Machine Learning Workshop

Decision Tree Problems

• Overfitting the data

• May not use all relevant features

• Perpendicular decision boundaries

Page 16: Machine Learning Workshop

Random Forests

One Decision Tree

Many Decision Trees (Ensemble)

Page 17: Machine Learning Workshop

Random Forest Fixes

• Overfitting the data

• May not use all relevant features

• Perpendicular decision boundaries

Page 18: Machine Learning Workshop

Building RF

For each tree:

• Sample from the data

• At each split, sample from the available variables (see the R sketch below)
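The two sampling steps can be sketched by hand to build intuition. Note that randomForest performs both steps internally, so the code below is illustrative only:

n <- nrow(wine)
boot <- wine[sample(n, size=n, replace=TRUE), ]  # bootstrap sample for one tree

predictors <- setdiff(names(wine), "Type")
mtry <- floor(sqrt(length(predictors)))  # default mtry for classification
sample(predictors, mtry)                 # candidate variables at a single split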

Page 19: Machine Learning Workshop

Bootstrap Sampling
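A bootstrap sample draws n records with replacement, so each record appears at least once with probability 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 63.2%. A quick simulation confirms this for the wine data:

n <- nrow(wine)  # 178
frac.unique <- replicate(10000, length(unique(sample(n, n, replace=TRUE))) / n)
mean(frac.unique)  # ~0.632: fraction of distinct records per bootstrap sample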

Page 20: Machine Learning Workshop

Sample Attributes at each split

Page 21: Machine Learning Workshop

Motivations for RF

•Create uncorrelated trees

•Variance reduction

•Subspace exploration

Page 22: Machine Learning Workshop

Random Forests

library(randomForest)
rffit.1 <- randomForest(Type ~ ., data=wine)

See rf-3.R
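Printing the fitted object reports the out-of-bag (OOB) error estimate and a confusion matrix, which makes a quick first check of the fit:

print(rffit.1)       # OOB estimate of error rate and confusion matrix
importance(rffit.1)  # per-variable importance (mean decrease in Gini)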

Page 23: Machine Learning Workshop

RF Parameters in R

The most important parameters are:

Variable   Description                                     Default
---------  ----------------------------------------------  ---------------------------------------
ntree      Number of trees                                 500
mtry       Number of variables randomly sampled as         sqrt(# predictors) for classification;
           candidates at each split                        (# predictors) / 3 for regression
nodesize   Minimum number of records in a terminal node    1 for classification; 5 for regression
sampsize   Number of records drawn for each bootstrap      n, drawn with replacement (so ~63.2%
           sample                                          of records are distinct in each sample)
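Written out explicitly for this dataset (13 predictors, classification), the defaults in the table correspond to a call like the one below. The values are spelled out only for illustration; omitting them gives the same model:

rffit.defaults <- randomForest(
  Type ~ ., data=wine,
  ntree=500,             # number of trees
  mtry=floor(sqrt(13)),  # sqrt(# predictors) = 3 for classification
  nodesize=1             # minimum records in a terminal node
)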

Page 24: Machine Learning Workshop

How’d it do?

Guessing the majority class ("No"): 60.11% (107 / 178)

Random Forest: 98.31% accuracy
• Precision: 95.77% (68 / 71)
• Sensitivity/Recall: 100% (68 / 68)

                Actual
Predicted    Grig     No
Grig           68      3
No              0    107

Page 25: Machine Learning Workshop

Tuning RF: Grid Search

See rf-4.R

[Figure: grid search results from rf-4.R; annotation: "This is the default."]
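rf-4.R is not reproduced here; the sketch below is one plausible version of the grid search, scoring each (mtry, nodesize) pair by its OOB error (the grid values are illustrative assumptions, and the randomForest setup from the earlier slides is assumed). Because the OOB error comes free with every fit, no separate cross-validation loop is needed:

grid <- expand.grid(mtry=c(2, 3, 5, 7), nodesize=c(1, 5, 10))
grid$oob.err <- apply(grid, 1, function(p) {
  fit <- randomForest(Type ~ ., data=wine,
                      mtry=p["mtry"], nodesize=p["nodesize"])
  fit$err.rate[fit$ntree, "OOB"]  # OOB error after the final tree
})
grid[which.min(grid$oob.err), ]   # best parameter combination found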

Page 26: Machine Learning Workshop

Tuning is Expensive

• The grid grows multiplicatively with each tuning parameter: m candidate values per parameter means m^k model fits for k parameters

• Plus repeated model fitting for each cross-validation fold

Page 27: Machine Learning Workshop

Benefits of RF

• Good performance with default settings

• Relatively easy to parallelize

• Many implementations: R, Weka, RapidMiner, Mahout

Page 28: Machine Learning Workshop

References

• Liaw, A. and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

• Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.

• Breiman, L. and A. Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm