Random Forests: The Vanilla of Machine Learning - Anna Quach

Welcome to my talk! I’m currently a PhD student at Utah State University working under Dr. Adele Cutler.

Upload: withthebest

Post on 16-Apr-2017






Page 1: Random Forests: The Vanilla of Machine Learning - Anna Quach



Do we need hundreds of classifiers to solve real world classification problems?

Fernández-Delgado et al. used 121 data sets from the University of California, Irvine (UCI) database (excluding large-scale problems) and their own data to evaluate 179 classifiers. Overall, Random Forests performed the best in terms of accuracy!

See the paper here: http://www.jmlr.org/papers/v15/delgado14a.html


Random Forests: A seminal paper!



Random Forests

The (very theoretical) paper can be found here:



The inventors of Classification and Regression Trees (CART)

(a) Leo Breiman

(b) Jerome Friedman

(c) Charles J. Stone

(d) Richard A. Olshen


CART is actually published as a book.



Building a Classification Tree - Predicting Fake Likers

[Figure: "Predict Fake Facebook Likes" scatter plot; axes: Average Verified Page Likes and Category Entropy; legend: Facebook Like]


Download the data here: http://digital.cs.usu.edu/~kyumin/data/likers.html


First split

Category Entropy: $-\sum_{i=1}^{k} \frac{n_i}{N} \log \frac{n_i}{N}$, where $n_i$ is the number of liked pages under category $i$, and $N$ is the total number of pages liked by user $u$.

Average Verified Page Likes: the average proportion of verified pages liked out of the total number of pages liked by a user.
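
As a small worked example of the Category Entropy formula, the sketch below computes it from a vector of per-category like counts for one user. The function name `category_entropy`, the example counts and the use of the natural logarithm are assumptions for illustration; the slides do not state the log base or how the data set stores these counts.

```r
# Category Entropy for one user: -sum_i (n_i / N) * log(n_i / N)
# counts: number of liked pages in each category (hypothetical input format)
category_entropy = function(counts) {
  counts = counts[counts > 0]   # empty categories contribute 0 (0 * log 0 is taken as 0)
  p = counts / sum(counts)      # n_i / N
  -sum(p * log(p))              # natural log; the log base is an assumption
}

# A user whose likes are spread evenly over 4 categories
even_user = category_entropy(c(5, 5, 5, 5))

# A user whose likes all fall in one category
single_user = category_entropy(c(20, 0, 0, 0))

print(c(even = even_user, single = single_user))
```

Likes spread evenly over many categories give a high entropy, while likes concentrated in a single category give an entropy of zero.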


Second split


Third split


Code to build a classification tree


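library(rpart)       # recursive partitioning (CART)
library(rpart.plot)  # provides prp()

data = read.csv("FakeLiker-dataset.csv")
colnames(data)[9] = "Entropy"
colnames(data)[10] = "Average"

# A classification tree needs a factor response
data$Class = as.factor(data$Class)
levels(data$Class) = c("Real", "Fake")
cols = c("lightblue", "orange")[data$Class]

# Grow the tree on the two features and plot it
ctree = rpart(Class ~ Entropy + Average, data)
prp(ctree,
    extra = 1,
    box.palette = c("lightblue", "orange"))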


References on recursive partitioning (rpart) and rpart.plot

A good reference for understanding how CART works, with example code for the rpart R package, can be found here:


and a guide with plenty of examples on plotting nice trees can be found here:




Learn how a classification tree is built interactively here: http://www.popsci.com/how-machine-learning-works-interactive


Bagging (Bootstrap Aggregating)

Fit each tree to a bootstrap sample (a random sample drawn with replacement) from the data, and combine the trees by voting (classification) or averaging (regression).
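
The bagging recipe above can be sketched by hand with rpart. This is a minimal illustration, not the talk's code: the built-in iris data stands in for the Fake Liker data, and the seed, the number of trees and the variable names (`trees`, `votes`, `bagged`) are arbitrary assumptions.

```r
library(rpart)

set.seed(1)                      # arbitrary seed so the sketch is reproducible
n = nrow(iris)
n_trees = 25                     # arbitrary choice for illustration

# Fit one classification tree per bootstrap sample
trees = lapply(seq_len(n_trees), function(b) {
  idx = sample(n, n, replace = TRUE)          # random sample with replacement
  rpart(Species ~ ., data = iris[idx, ])
})

# Each tree votes for a class; the result is a matrix with one column per tree
votes = sapply(trees, function(tr) {
  as.character(predict(tr, newdata = iris, type = "class"))
})

# The bagged prediction for each case is the majority vote across trees
bagged = apply(votes, 1, function(v) names(which.max(table(v))))

# Agreement with the true labels on the training data (optimistic, not a test error)
bag_accuracy = mean(bagged == as.character(iris$Species))
print(bag_accuracy)
```

For regression the voting step would be replaced by averaging the trees' numeric predictions.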



The powers of Random Forests!

Random Forests is applicable to a wide variety of problems. Here are some of the features of Random Forests:

- Classification and regression
- Ranks important features (most widely used)
- Imputes missing values
- Local variable importance (underused)
- Handles unbalanced classes
- Naturally fits interactions
- Does not overfit as you add more trees
- Detects patterns using proximities (underused)
- Requires little tuning! Has two possible parameters to tune: mtry and depth for regression


Original Implementation of Random Forests is in Fortran

(a) Leo Breiman (b) Adele Cutler

Good documentation of the capabilities of Random Forests and the Fortran code can be found here: https://www.stat.berkeley.edu/~breiman/RandomForests/


Random Forests is a trademark

The commercial version of Random Forests, as well as videos about Random Forests, can be found here: https://www.salford-systems.com/products/randomforests

Salford Systems provides a user guide on how to use Random Forests in their software, Salford Predictive Modeler (SPM). Find the user guide here: http://media.salford-systems.com/pdf/spm7/RandomForestsModelBasics.pdf


randomForest - the first Random Forests package in R



Variable Importance


[Figure: "Rank of Important Features" variable importance plot; panels: MeanDecreaseAccuracy and MeanDecreaseGini]


Variable Importance Definition

Random Forests computes two measures of variable importance:

1. Permutation Importance (Mean Decrease in Accuracy) is permutation based.

For each tree, randomly permute the values of a variable among the out-of-bag cases and pass the permuted data down the tree. The permutation importance of each variable is the average, over all the trees, of (error rate with the variable permuted) - (error rate with no permutation).

2. Gini Importance (Mean Decrease in Gini) is Gini based, for classification.

For each variable, the average over all trees in the forest of (Gini impurity of parent node) - (Gini impurity of child nodes) for the splits made on that variable.
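
To make the Gini-based measure concrete, here is a small worked example for a single split. It is a sketch under stated assumptions: the helper names `gini` and `gini_decrease`, the toy labels and the split point are all hypothetical, and the child nodes are weighted by their share of the cases, which is the usual CART convention rather than something stated on the slide. The randomForest package reports its own aggregate of such decreases, so its numbers are on a different scale.

```r
# Gini impurity of a node: 1 - sum_k p_k^2, where p_k is the proportion of class k
gini = function(y) {
  p = table(y) / length(y)
  1 - sum(p^2)
}

# Decrease in Gini impurity when a node is split into left/right by a logical vector
gini_decrease = function(y, go_left) {
  left = y[go_left]
  right = y[!go_left]
  # child impurities weighted by the fraction of cases sent to each child
  children = (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  gini(y) - children
}

# Hypothetical node with 10 cases: 5 Real and 5 Fake
y = c(rep("Real", 5), rep("Fake", 5))
x = c(0.1, 0.2, 0.3, 0.4, 0.9, 0.5, 0.6, 0.7, 0.8, 0.95)

parent = gini(y)                       # impurity of the 50/50 parent node
decrease = gini_decrease(y, x < 0.45)  # split sends 4 Real cases left, the rest right
print(c(parent = parent, decrease = decrease))
```

Summing such decreases over every split that uses a given variable, and averaging over trees, gives a Gini-style ranking of the variables.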


randomForest code

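A Random Forest can be built, and its important variables displayed, using the following code in R:

library(randomForest)

# Fit a forest of 500 trees; importance = TRUE also computes permutation importance
rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  ntree = 500)

# Plot both importance measures, unscaled
varImpPlot(rf,
           scale = FALSE,
           main = "Rank of Important Features")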


Determining how many trees to use

# Out-of-bag and per-class error rates as a function of the number of trees
plot(rf,
     main = "",
     ylim = c(0.05, 0.25),
     col = c("black", "lightblue", "orange"))
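
The same error curves can also be read numerically from the fitted object. The sketch below is an illustration only: it uses the built-in iris data as a stand-in for the Fake Liker data, and the object name `rf_iris`, the seed and the forest sizes printed are arbitrary assumptions.

```r
library(randomForest)

set.seed(1)  # arbitrary seed so the sketch is reproducible
rf_iris = randomForest(Species ~ ., data = iris, ntree = 500)

# err.rate has one row per forest size; the "OOB" column is the overall out-of-bag error
oob = rf_iris$err.rate[, "OOB"]

# Compare the OOB error at a few forest sizes to see where it levels off
print(round(oob[c(10, 50, 100, 250, 500)], 3))
```

If the out-of-bag error has stopped changing well before the last tree, the forest is probably large enough; adding more trees costs time but does not cause overfitting.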


Local Variable Importance

For each tree, randomly permute the values of a variable among the out-of-bag cases and pass the permuted data down the tree. The local variable importance for each case i and variable j is the average, over all the trees, of (error for case i with variable j permuted) - (error for case i with no permutation).
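
The per-case bookkeeping can be sketched by hand with bagged rpart trees. This is a rough illustration of the idea, not the randomForest implementation: iris stands in for the Fake Liker data, the trees are plain bagged trees (no mtry sampling), and the seed, number of trees, chosen variable and all object names are assumptions.

```r
library(rpart)

set.seed(1)                       # arbitrary seed
n = nrow(iris)
n_trees = 50                      # arbitrary choice for illustration
var_j = "Petal.Width"             # variable j whose importance is measured

diff_sum = numeric(n)             # per-case sum of (permuted error - original error)
oob_count = numeric(n)            # number of trees for which each case was out-of-bag

for (b in seq_len(n_trees)) {
  in_bag = sample(n, n, replace = TRUE)
  oob = setdiff(seq_len(n), in_bag)
  tree = rpart(Species ~ ., data = iris[in_bag, ])

  oob_data = iris[oob, ]
  permuted = oob_data
  permuted[[var_j]] = sample(permuted[[var_j]])   # permute variable j among OOB cases

  truth = as.character(oob_data$Species)
  err_orig = as.character(predict(tree, oob_data, type = "class")) != truth
  err_perm = as.character(predict(tree, permuted, type = "class")) != truth

  # accumulate the 0/1 error difference for each out-of-bag case
  diff_sum[oob] = diff_sum[oob] + (err_perm - err_orig)
  oob_count[oob] = oob_count[oob] + 1
}

local_imp = diff_sum / oob_count             # one value per case i for variable j
overall_imp = mean(local_imp, na.rm = TRUE)  # a rough overall summary across cases
print(overall_imp)
```

Cases with a large positive `local_imp` are the ones whose prediction depends most on the chosen variable.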


Local Variable Importance example on detecting fake Facebook likes

[Figure: parallel coordinate plot of local variable importance; axes: Entropy, Years_Active, About_Count, Average, Post_Frequency_per_day]












Code to extract local variable importance


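library(randomForest)
library(MASS)  # provides parcoord()

rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  localImp = TRUE,
                  proximity = TRUE,
                  ntree = 500,
                  scale = FALSE)

# Names of the 5 variables with the largest Mean Decrease in Accuracy (column 3)
impv = names(sort(rf$importance[, 3],
                  decreasing = TRUE))[1:5]

# Parallel coordinate plot: one line per case, one axis per variable
parcoord(t(rf$localImportance)[, impv],
         col = cols,
         var.label = TRUE)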



Proximity in Random Forests is defined as the proportion of the time two observations (both in the out-of-bag sample) end up in the same terminal node. The proximity measures can be visualized using Multidimensional Scaling (MDS) plots. Using the MDS plot we can learn more about our data:

- identify characteristics of unusual points
- find clusters within classes
- see which classes are overlapping
- see which classes differ
- see which variables are locally important
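
The proximity definition can be illustrated with a tiny hand-made example. The terminal node assignments below are hypothetical, and for simplicity the sketch counts every tree for every pair of cases rather than restricting to trees where both cases are out-of-bag.

```r
# nodes[i, t] = terminal node that case i falls into in tree t (hypothetical values)
nodes = matrix(c(1, 1, 2,
                 1, 1, 2,
                 2, 1, 3,
                 3, 2, 3),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("case", 1:4), paste0("tree", 1:3)))

n = nrow(nodes)
prox = matrix(0, n, n, dimnames = list(rownames(nodes), rownames(nodes)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # proportion of trees in which cases i and j share a terminal node
    prox[i, j] = mean(nodes[i, ] == nodes[j, ])
  }
}
print(prox)

# Classical MDS on the dissimilarity 1 - proximity gives coordinates for plotting
coords = cmdscale(1 - prox, k = 2)
print(coords)
```

Cases that always land in the same terminal node have proximity 1 and are placed on top of each other by MDS; cases that never share a node have proximity 0 and end up far apart.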


Visualizing the Proximities



[Figure: MDS plots of the proximities; axes: MDS 1, MDS 2, MDS 3]


Code to extract the proximities

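# Proximity matrix with row indices attached (not used by the plot below)
loc = row(rf$proximity)
aprox = rbind(loc, rf$proximity)
prox = matrix(aprox, nrow = nrow(rf$proximity))

# Classical MDS on 1 - proximity, keeping 3 dimensions
scalerf = cmdscale(1 - rf$proximity, eig = TRUE, k = 3)$points

# Plot the first two MDS dimensions, colored by class
plot(scalerf[, 1], scalerf[, 2],
     col = cols,
     xlab = "MDS 1", ylab = "MDS 2",
     xlim = c(-0.5, 0.5),
     ylim = c(-0.5, 0.5),
     xaxt = "n",
     yaxt = "n")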


Local Variable Importance in interactive plots

We can find interesting patterns using an interactive plot.

Read more about irfplot (interactive random forests plots) here: http://digitalcommons.usu.edu/gradreports/134/


Brushing in interactive plots



A short paper on the randomForest package can be found here: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf


Random Forests presentations by Dr. Adele Cutler

A more comprehensive set of notes on Random Forests by Dr. Adele Cutler can be found here:
http://www.math.usu.edu/adele/RandomForests/UofU2013.pdf
http://www.math.usu.edu/adele/RandomForests/Ovronnaz.pdf


Current Research - Improving the interpretation of Random Forests



Remembering Leo Breiman

1928 – 2005

Read more about Leo Breiman’s life’s work in the article written by Dr. Adele Cutler: https://arxiv.org/pdf/1101.0917.pdf


Contact Information

Additional questions regarding Random Forests can be emailed to me at [email protected].