Machine Learning Workshop


DESCRIPTION

Presentation on Decision Trees and Random Forests given for the Boston Predictive Analytics Machine Learning Workshop on December 2, 2012. Code to accompany the slides is available at www.github.com/dgerlanc/mlclass or http://www.enplusadvisors.com/wp-content/uploads/2012/12/mlclass_1.0.tar.gz

TRANSCRIPT

Page 1: Machine Learning Workshop

Hands-on Classification: Decision Trees and Random Forests

Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
[email protected]

Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012

Page 2: Machine Learning Workshop

© Daniel Gerlanc, 2012. All rights reserved.

If you’d like to use this material for any purpose, please contact [email protected]

Page 3: Machine Learning Workshop

What You’ll Learn

• Intuition behind decision trees and random forests

• Implementation in R

• Assessing the results

Page 4: Machine Learning Workshop

Dataset

• Chemical Analysis of Italian Wines

• http://www.parvus.unige.it/

• 178 records, 14 attributes

Page 5: Machine Learning Workshop

Follow along

> library(mlclass)
> data(wine)
> str(wine)
'data.frame':   178 obs. of  14 variables:
 $ Type      : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alcohol   : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic     : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash       : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity: num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...

Page 6: Machine Learning Workshop

What are Decision Trees?

• A model that partitions an input space into regions, one class prediction per region

Page 7: Machine Learning Workshop

What’s partitioning?

See rf-1.R
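rf-1.R itself is not reproduced in these slides. The sketch below is a minimal stand-in using the wine data from the mlclass package; the choice of the Alcohol and Malic attributes and the cutoff of 13 are purely illustrative assumptions:

library(mlclass)
data(wine)

# Plot two attributes, colored by class. Partitioning means carving this
# plane into rectangular regions, with one class prediction per region.
plot(wine$Alcohol, wine$Malic,
     col=ifelse(wine$Type == "Grig", "darkgreen", "red"),
     xlab="Alcohol", ylab="Malic", pch=19)
abline(v=13, lty=2)  # a single split on Alcohol creates two regions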

Page 8: Machine Learning Workshop

Create the 1st split.

[Figure: scatterplot of the wine data partitioned into a "G" and a "Not G" region by the first split. See rf-1.R]

Page 9: Machine Learning Workshop

Create the 2nd split.

[Figure: partition after the second split, with regions labeled "G", "Not G", and "G". See rf-1.R]

Page 10: Machine Learning Workshop

Create more splits…

[Figure: partition after further splits, with regions labeled "G", "Not G", "G", and "Not G". (I drew this last one in.)]

Page 11: Machine Learning Workshop

Another view of partitioning

See rf-2.R

Page 12: Machine Learning Workshop

Use R to do the partitioning.

library(rpart)       # recursive partitioning trees
library(rpart.plot)  # prp() plotting function

tree.1 <- rpart(Type ~ ., data=wine)
prp(tree.1, type=4, extra=2)

• See the 'rpart' and 'rpart.plot' R packages.
• Many parameters are available to control the fit.

See rf-2.R

Page 13: Machine Learning Workshop

Make predictions on a test dataset

predict(tree.1, newdata=wine, type="vector")

Page 14: Machine Learning Workshop

How’d it do?

Guessing the majority class ("No"): 60.11% (107 / 178)

CART: 94.38% accuracy
• Precision: 92.95% (66 / 71)
• Sensitivity/Recall: 92.95% (66 / 71)

                Actual
Predicted    Grig     No
Grig           66      5
No              5    102
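These numbers can be reproduced by tabulating predictions against the actual labels. A minimal sketch, assuming tree.1 from the earlier slide (type="class" returns the predicted factor labels):

pred <- predict(tree.1, newdata=wine, type="class")
cm <- table(Predicted=pred, Actual=wine$Type)

sum(diag(cm)) / sum(cm)                 # accuracy:  (66 + 102) / 178 = 94.38%
cm["Grig", "Grig"] / sum(cm["Grig", ])  # precision: 66 / 71 = 92.95%
cm["Grig", "Grig"] / sum(cm[, "Grig"])  # recall:    66 / 71 = 92.95%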

Page 15: Machine Learning Workshop

Decision Tree Problems

• Overfitting the data

• May not use all relevant features

• Perpendicular decision boundaries

Page 16: Machine Learning Workshop

Random Forests

One Decision Tree

Many Decision Trees (Ensemble)

Page 17: Machine Learning Workshop

Random Forest Fixes

• Overfitting the data

• May not use all relevant features

• Perpendicular decision boundaries

Page 18: Machine Learning Workshop

Building RF

For each tree:

• Sample from the data

• At each split, sample from the available variables (see the R sketch below)
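The two sampling steps can be sketched by hand to build intuition. Note that randomForest performs both steps internally, so the code below is illustrative only:

n <- nrow(wine)
boot <- wine[sample(n, size=n, replace=TRUE), ]  # bootstrap sample for one tree

predictors <- setdiff(names(wine), "Type")
mtry <- floor(sqrt(length(predictors)))  # default mtry for classification
sample(predictors, mtry)                 # candidate variables at a single split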

Page 19: Machine Learning Workshop

Bootstrap Sampling
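A bootstrap sample draws n records with replacement, so each record appears at least once with probability 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 63.2%. A quick simulation confirms this for the wine data:

n <- nrow(wine)  # 178
frac.unique <- replicate(10000, length(unique(sample(n, n, replace=TRUE))) / n)
mean(frac.unique)  # ~0.632: fraction of distinct records per bootstrap sample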

Page 20: Machine Learning Workshop

Sample Attributes at each split

Page 21: Machine Learning Workshop

Motivations for RF

•Create uncorrelated trees

•Variance reduction

•Subspace exploration

Page 22: Machine Learning Workshop

Random Forests

library(randomForest)
rffit.1 <- randomForest(Type ~ ., data=wine)

See rf-3.R
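Printing the fitted object reports the out-of-bag (OOB) error estimate and a confusion matrix, which makes a quick first check of the fit:

print(rffit.1)       # OOB estimate of error rate and confusion matrix
importance(rffit.1)  # per-variable importance (mean decrease in Gini)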

Page 23: Machine Learning Workshop

RF Parameters in R

The most important parameters are:

Variable   Description                                     Default
---------  ----------------------------------------------  ---------------------------------------
ntree      Number of trees                                 500
mtry       Number of variables randomly sampled as         sqrt(# predictors) for classification;
           candidates at each split                        (# predictors) / 3 for regression
nodesize   Minimum number of records in a terminal node    1 for classification; 5 for regression
sampsize   Number of records drawn for each bootstrap      n, drawn with replacement (so ~63.2%
           sample                                          of records are distinct in each sample)
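Written out explicitly for this dataset (13 predictors, classification), the defaults in the table correspond to a call like the one below. The values are spelled out only for illustration; omitting them gives the same model:

rffit.defaults <- randomForest(
  Type ~ ., data=wine,
  ntree=500,             # number of trees
  mtry=floor(sqrt(13)),  # sqrt(# predictors) = 3 for classification
  nodesize=1             # minimum records in a terminal node
)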

Page 24: Machine Learning Workshop

How’d it do?

Guessing the majority class ("No"): 60.11% (107 / 178)

Random Forest: 98.31% accuracy
• Precision: 95.77% (68 / 71)
• Sensitivity/Recall: 100% (68 / 68)

                Actual
Predicted    Grig     No
Grig           68      3
No              0    107

Page 25: Machine Learning Workshop

Tuning RF: Grid Search

See rf-4.R

[Figure: grid search results from rf-4.R; annotation: "This is the default."]
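rf-4.R is not reproduced here; the sketch below is one plausible version of the grid search, scoring each (mtry, nodesize) pair by its OOB error (the grid values are illustrative assumptions, and the randomForest setup from the earlier slides is assumed). Because the OOB error comes free with every fit, no separate cross-validation loop is needed:

grid <- expand.grid(mtry=c(2, 3, 5, 7), nodesize=c(1, 5, 10))
grid$oob.err <- apply(grid, 1, function(p) {
  fit <- randomForest(Type ~ ., data=wine,
                      mtry=p["mtry"], nodesize=p["nodesize"])
  fit$err.rate[fit$ntree, "OOB"]  # OOB error after the final tree
})
grid[which.min(grid$oob.err), ]   # best parameter combination found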

Page 26: Machine Learning Workshop

Tuning is Expensive

• The grid grows multiplicatively with each tuning parameter: m candidate values per parameter means m^k model fits for k parameters

• Plus repeated model fitting for each cross-validation fold

Page 27: Machine Learning Workshop

Benefits of RF

• Good performance with default settings

• Relatively easy to parallelize

• Many implementations: R, Weka, RapidMiner, Mahout

Page 28: Machine Learning Workshop

References

• Liaw, A. and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

• Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.

• Breiman, L. and A. Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm