Machine Learning Workshop


DESCRIPTION

Presentation on Decision Trees and Random Forests given for the Boston Predictive Analytics Machine Learning Workshop on December 2, 2012. Code to accompany the slides is available at www.github.com/dgerlanc/mlclass or http://www.enplusadvisors.com/wp-content/uploads/2012/12/mlclass_1.0.tar.gz

TRANSCRIPT

Hands-on Classification: Decision Trees and Random Forests

Daniel Gerlanc, Managing Director
Enplus Advisors, Inc.
www.enplusadvisors.com
dgerlanc@enplusadvisors.com

Predictive Analytics Meetup Group
Machine Learning Workshop
December 2, 2012

© Daniel Gerlanc, 2012. All rights reserved.

If you’d like to use this material for any purpose, please contact dgerlanc@enplusadvisors.com

What You’ll Learn

•Intuition behind decision trees and random forests

•Implementation in R

•Assessing the results

Dataset

•Chemical Analysis of Italian Wines

•http://www.parvus.unige.it/

•178 records, 14 attributes

Follow along

> library(mlclass)
> data(wine)
> str(wine)
'data.frame': 178 obs. of 14 variables:
 $ Type       : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
 $ Alcohol    : num 14.2 13.2 13.2 14.4 13.2 ...
 $ Malic      : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash        : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...

What are Decision Trees?

•Model for partitioning an input space

What’s partitioning?

See rf-1.R

[Slides: scatterplots of the input space being partitioned step by step into "G" and "Not G" regions: the 1st split, then the 2nd split, then more splits, with the final region drawn in by hand.]

Another view of partitioning

See rf-2.R

Use R to do the partitioning.

tree.1 <- rpart(Type ~ ., data=wine)
prp(tree.1, type=4, extra=2)

• See the 'rpart' and 'rpart.plot' R packages.

• Many parameters available to control the fit.
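As one illustration of those fit-control parameters, here is a hedged sketch (the names `ctrl` and `tree.2` are mine, not from the workshop code) of adjusting `rpart.control` and pruning on the cross-validated error:

```r
library(rpart)

# cp: complexity parameter; minsplit: minimum records needed to attempt a split
ctrl <- rpart.control(cp = 0.01, minsplit = 20, maxdepth = 5)
tree.2 <- rpart(Type ~ ., data = wine, control = ctrl)

# printcp() shows the cross-validated error at each complexity level,
# useful for choosing a cp to prune with.
printcp(tree.2)
best.cp <- tree.2$cptable[which.min(tree.2$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree.2, cp = best.cp)
```

Tightening `cp` or `minsplit` trades a deeper, more flexible tree against the overfitting risk discussed later.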

See rf-2.R

Make predictions on a test dataset

predict(tree.1, newdata=wine, type="vector")

How’d it do?

Guessing: 60.11%

CART: 94.38% Accuracy

• Precision: 92.96% (66 / 71)

• Sensitivity/Recall: 92.96% (66 / 71)

              Actual
Predicted   Grig    No
Grig          66     5
No             5   102
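The figures above can be reproduced from a confusion matrix in a few lines. A minimal sketch, assuming `wine` and `tree.1` as defined earlier (variable names here are mine):

```r
# Predicted class labels from the fitted rpart tree.
pred <- predict(tree.1, newdata = wine, type = "class")

# Rows: predicted class; columns: actual class.
cm <- table(Predicted = pred, Actual = wine$Type)

accuracy  <- sum(diag(cm)) / sum(cm)                 # (66 + 102) / 178
precision <- cm["Grig", "Grig"] / sum(cm["Grig", ])  # 66 / 71
recall    <- cm["Grig", "Grig"] / sum(cm[, "Grig"])  # 66 / 71
```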

Decision Tree Problems

•Overfitting the data

•May not use all relevant features

•Perpendicular decision boundaries

Random Forests

One Decision Tree

Many Decision Trees (Ensemble)

Random Forest Fixes

•Overfitting the data

•May not use all relevant features

•Perpendicular decision boundaries

Building RF

For each tree:

Sample from the data

At each split, sample from the available variables

Bootstrap Sampling

Sample Attributes at each split
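The two sampling steps above can be sketched in base R (illustrative only, not the actual randomForest internals):

```r
set.seed(42)
n <- 178   # records in wine
p <- 13    # predictors (14 attributes minus the Type label)

# 1. Bootstrap sampling: for each tree, draw n rows with replacement.
boot.idx <- sample(n, size = n, replace = TRUE)

# On average ~63.2% of the distinct rows appear in each bootstrap sample.
frac.in.bag <- length(unique(boot.idx)) / n

# 2. Attribute sampling: at each split, consider only a random
#    subset of the predictors.
mtry <- floor(sqrt(p))   # default subset size for classification
split.vars <- sample(p, size = mtry)
```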

Motivations for RF

•Create uncorrelated trees

•Variance reduction

•Subspace exploration

Random Forests

rffit.1 <- randomForest(Type ~ ., data=wine)

See rf-3.R

RF Parameters in R

Most important parameters are:

• ntree: Number of trees. Default: 500.

• mtry: Number of variables to randomly select at each node. Default: square root of # predictors for classification; # predictors / 3 for regression.

• nodesize: Minimum number of records in a terminal node. Default: 1 for classification; 5 for regression.

• sampsize: Number of records to select in each bootstrap sample. Default: 63.2% (the expected fraction of distinct records in a with-replacement bootstrap sample).
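For reference, here is a sketch of setting those parameters explicitly rather than relying on the defaults (the name `rffit.2` and the literal values are mine, chosen to match the classification defaults for the 13-predictor wine data):

```r
library(randomForest)

rffit.2 <- randomForest(
  Type ~ ., data = wine,
  ntree    = 500,               # number of trees
  mtry     = floor(sqrt(13)),   # sqrt of # predictors (classification default)
  nodesize = 1                  # minimum terminal-node size (classification default)
)
```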

How’d it do?

Guessing Accuracy: 60.11%

Random Forest: 98.31% Accuracy

• Precision: 95.77% (68 / 71)

• Sensitivity/Recall: 100% (68 / 68)

              Actual
Predicted   Grig    No
Grig          68     3
No             0   107

Tuning RF: Grid Search

See rf-4.R
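The rf-4.R code itself is not reproduced here, but a grid search over randomForest's tuning parameters typically has this shape (a hedged sketch; the grid values, `oob.err`, and the use of out-of-bag error as the score are my assumptions):

```r
library(randomForest)

# Every combination of the candidate parameter values.
grid <- expand.grid(mtry = c(2, 4, 6), nodesize = c(1, 5, 10))

oob.err <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- randomForest(Type ~ ., data = wine,
                      mtry = grid$mtry[i], nodesize = grid$nodesize[i])
  # Out-of-bag error rate of the full forest (last row of err.rate).
  fit$err.rate[fit$ntree, "OOB"]
})

grid[which.min(oob.err), ]   # best parameter combination found
```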

[Plot annotation: "This is the default."]

Tuning is Expensive

• Cost grows polynomially in the number of tuning parameters

• Plus repeated model fitting for cross-validation

Benefits of RF

•Good performance with default settings

•Relatively easy to make parallel

•Many implementations

•R, Weka, RapidMiner, Mahout
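The "easy to make parallel" point follows from the trees being grown independently: separate forests can be built on separate cores and merged. A sketch using base R's parallel package and randomForest's combine() (the worker count and `ntree` split are arbitrary choices of mine):

```r
library(parallel)
library(randomForest)

# Grow four forests of 125 trees each in parallel.
# (mclapply forks; on Windows use mc.cores = 1 or parLapply instead.)
forests <- mclapply(1:4, function(i) {
  randomForest(Type ~ ., data = wine, ntree = 125)
}, mc.cores = 4)

# Merge into a single forest of 125 * 4 = 500 trees.
rf.par <- do.call(combine, forests)
```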

References

• A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

• Breiman, Leo, et al. Classification and Regression Trees. Belmont, Calif.: Wadsworth International Group, 1984. Print.

• Breiman, Leo and Adele Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm
