Random Forests: The Vanilla of Machine Learning - Anna Quach

Welcome to my talk!

I’m currently a PhD student at Utah State University working under Dr. Adele Cutler.

Do we need hundreds of classifiers to solve real world classification problems?

The authors used 121 data sets from the University of California, Irvine (UCI) database (excluding large-scale problems) and their own data to evaluate 179 classifiers. Overall, Random Forests performed the best in terms of accuracy!

See the paper here: http://www.jmlr.org/papers/v15/delgado14a.html

Random Forests wins Kaggle competitions

http://blog.kaggle.com/2012/05/01/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon/




Random Forests: A seminal paper!

https://scholar.google.com/citations?user=mXSv_1UAAAAJ&hl=en&oi=ao



Random Forests

The (very theoretical) paper can be found here:

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf



The inventors of Classification and Regression Trees (CART)

(a) Leo Breiman

(b) Jerome Friedman

(c) Charles J. Stone

(d) Richard A. Olshen

CART is actually published as a book.

https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418



Building a Classification Tree - Predicting Fake Likers

[Figure: scatter plot titled "Predict Fake Facebook Likes". x-axis: Average Verified Page Likes (0.0 to 1.0); y-axis: Category Entropy (5 to 25); points colored by Facebook Like class (Real, Fake).]

Download the data here: http://digital.cs.usu.edu/~kyumin/data/likers.html

First split

Category Entropy: $-\sum_{i=1}^{k} \frac{n_i}{N} \log \frac{n_i}{N}$, where $n_i$ is the number of liked pages under category $i$, and $N$ is the total number of pages liked by user $u$.

Average Verified Page Likes: the average proportion of verified pages liked out of the total number of pages liked by a user.
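As a minimal sketch of the Category Entropy formula above, the following base R function computes it from a vector of per-category page counts. The function name `category_entropy` and the example counts are hypothetical, and the natural logarithm is an assumption, since the slide does not state the base.

```r
# Category entropy for one user: -sum_i (n_i / N) * log(n_i / N).
# `counts` holds n_i, the number of liked pages in each category (hypothetical input).
category_entropy <- function(counts) {
  counts <- counts[counts > 0]   # categories with no likes contribute nothing
  p <- counts / sum(counts)      # n_i / N
  -sum(p * log(p))               # natural log is an assumption
}

# Likes concentrated in one category give entropy 0;
# likes spread evenly over k categories give the maximum, log(k).
one_category <- category_entropy(c(10, 0, 0))
even_spread  <- category_entropy(c(5, 5, 5, 5))
```

A user whose likes are spread across many categories therefore gets a high value, and a user who likes pages in a single category gets zero.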

Second split

Third split

Code to build a classification tree

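library(rpart)
library(rpart.plot)

# Read the data; stringsAsFactors = TRUE keeps Class a factor
# (no longer the default in R >= 4.0)
data = read.csv("FakeLiker-dataset.csv", stringsAsFactors = TRUE)
colnames(data)[9] = "Entropy"
colnames(data)[10] = "Average"

# Relabel the classes and pick a plotting color for each observation
levels(data$Class) = c("Real", "Fake")
cols = c("lightblue", "orange")[data$Class]

# Fit a classification tree on the two features and plot it
ctree = rpart(Class ~ Entropy + Average, data)
prp(ctree,
    extra = 1,
    box.palette = c("lightblue", "orange"))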

References on recursive partitioning (rpart) and rpart.plot

A good reference for understanding how CART works, with example code for the rpart R package, can be found here:

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

A guide with plenty of examples on plotting nice trees can be found here:

http://www.milbo.org/rpart-plot/prp.pdf

A visual introduction to a decision tree

One of the 10 award-winning science visualizations from the 2016 Vizzies

Learn how a classification tree is built interactively here: http://www.popsci.com/how-machine-learning-works-interactive

Bagging (Bootstrap Aggregating)

Fit each tree to a bootstrap sample (a random sample drawn with replacement) from the data, and combine the trees by voting (classification) or averaging (regression).

http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
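The two ingredients of bagging, bootstrap sampling and voting, can be sketched in base R as follows. This is only an illustration under made-up settings: the sample size, the number of trees, the seed, and the vote vector are all hypothetical, and no trees are actually fitted.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 1000                        # hypothetical number of training cases
n_trees <- 200                   # hypothetical number of trees

# One bootstrap sample per tree: n row indices drawn with replacement.
# The result is an n x n_trees matrix, one column per tree.
boot_idx <- replicate(n_trees, sample(n, n, replace = TRUE))

# Cases never drawn for a tree are "out-of-bag" for that tree.
oob_frac <- apply(boot_idx, 2, function(idx) 1 - length(unique(idx)) / n)
mean_oob <- mean(oob_frac)       # close to exp(-1), roughly 0.37

# Combining by voting: given one predicted label per tree for a case,
# the bagged prediction is the most frequent label.
votes <- c("Fake", "Real", "Fake", "Fake", "Real")   # made-up tree predictions
bagged <- names(which.max(table(votes)))
```

Roughly a third of the cases are left out of each bootstrap sample, which is what makes out-of-bag error estimates possible without a separate test set.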



The powers of Random Forests!

Random Forests is applicable to a wide variety of problems. Here are some of the features of Random Forests:

- Classification and regression
- Ranks important features (most widely used)
- Imputes missing values
- Local variable importance (underused)
- Handles unbalanced classes
- Naturally fits interactions
- Does not overfit as you add more trees
- Detects patterns using proximities (underused)
- Requires little tuning! Has two possible parameters to tune: mtry and depth for regression

Original Implementation of Random Forests is in Fortran

(a) Leo Breiman (b) Adele Cutler

Good documentation of the capabilities of Random Forests, along with the Fortran code, can be found here: https://www.stat.berkeley.edu/~breiman/RandomForests/

Random Forests is a trademark

The commercial version of Random Forests, as well as videos about Random Forests, can be found here: https://www.salford-systems.com/products/randomforests

Salford Systems provides a user guide on how to use Random Forests in their software, Salford Predictive Modeler (SPM). Find the user guide here: http://media.salford-systems.com/pdf/spm7/RandomForestsModelBasics.pdf

randomForest - the first Random Forests package in R

https://cran.r-project.org/web/packages/randomForest/


Variable Importance

[Figure: "Rank of Important Features", two variable importance dot charts.]

Left panel, MeanDecreaseAccuracy (axis 0.00 to 0.12). Axis labels in extracted order: Mean_posts, Mean_pages, Mean_photos, Skewness_of_posts, Maximum_posts_in_a_day, STD_Cat_Temporal, Shares_per_posts, Self_post_updates, Friends_Count, Links_per_posts, Comments_per_posts, Likes_per_posts, STD_Temporal, Post_Frequency_per_day, Average, About_Count, Years_Active, Entropy.

Right panel, MeanDecreaseGini (axis 0 to 150). Axis labels in extracted order: Mean_pages, Mean_posts, Mean_photos, Shares_per_posts, Maximum_posts_in_a_day, Self_post_updates, Skewness_of_posts, Friends_Count, Links_per_posts, STD_Cat_Temporal, Comments_per_posts, Likes_per_posts, Post_Frequency_per_day, STD_Temporal, About_Count, Average, Years_Active, Entropy.

Variable Importance Definition

Random Forests computes two measures of variable importance:

1. Permutation Importance (Mean Decrease in Accuracy) is permutation based.

For each tree, randomly permute the values of a variable in the out-of-bag cases. Pass the permuted data down the tree. The permutation importance for each variable is the average of (error rate with the variable permuted) - (error rate with no permutation) over all the trees.

2. Gini Importance (Mean Decrease in Gini) is Gini based, for classification.

For each variable, the decrease (Gini impurity of parent node) - (Gini impurity of child nodes) is summed over the splits made on that variable and averaged over all trees in the forest.
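To make the Gini-based measure concrete, here is a small base R sketch of the impurity decrease for a single split. The helper names `gini` and `gini_decrease` and the toy data are hypothetical, and the child impurities are weighted by node size, which is the usual CART convention; the exact scaling used inside the randomForest package may differ.

```r
# Gini impurity of a node: 1 - sum_k p_k^2, where p_k are the class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Decrease in Gini impurity for one split: parent impurity minus the
# size-weighted impurity of the two child nodes.
gini_decrease <- function(labels, go_left) {
  n <- length(labels)
  left <- labels[go_left]
  right <- labels[!go_left]
  gini(labels) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Made-up node with 4 "Real" and 4 "Fake" cases, split on a made-up feature x.
labels <- c("Real", "Real", "Real", "Real", "Fake", "Fake", "Fake", "Fake")
x      <- c(0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.2, 0.1)
decrease <- gini_decrease(labels, x > 0.5)
```

Here the parent impurity is 0.5, the left child is pure, and the right child (1 Real, 4 Fake) has impurity 0.32, so the decrease is 0.5 - (5/8)(0.32) = 0.3. A variable's Gini importance accumulates such decreases over every split that uses it.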

randomForest code

A Random Forest can be built and its important variables displayed with the following R code:

library(randomForest)

# Fit a forest of 500 trees and keep the importance measures
rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  ntree = 500)

# Dot charts of MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(rf,
           scale = FALSE,
           main = "Rank of Important Features")

Determining how many trees to use

# Plot the out-of-bag error (black) and the per-class error rates
# against the number of trees
plot(rf,
     main = "",
     ylim = c(0.05, 0.25),
     col = c("black", "lightblue", "orange"))

Local Variable Importance

For each tree, randomly permute the values of a variable in the out-of-bag cases. Pass the permuted data down the tree. The local variable importance for each case i and variable j is the average of (error rate with variable j permuted) - (error rate with no permutation) over all the trees for which case i is out-of-bag.

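Local Variable Importance example on detecting fake Facebook likes

[Figure: parallel coordinate plot of local variable importance for the five most important variables. Axis ranges as extracted: Entropy (-0.412 to 0.448), Years_Active (-0.292 to 0.326), About_Count (-0.446 to 0.468), Average (-0.299 to 0.236), Post_Frequency_per_day (-0.174 to 0.173).]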

Code to extract local variable importance

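library(MASS)          # for parcoord()
library(randomForest)

# Refit the forest, this time keeping local importance and proximities
rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  localImp = TRUE,
                  proximity = TRUE,
                  ntree = 500,
                  scale = FALSE)

# Names of the five variables with the largest MeanDecreaseAccuracy (column 3)
impv = names(sort(rf$importance[, 3],
                  decreasing = TRUE))[1:5]

# Parallel coordinate plot: one line per case, colored by class
parcoord(t(rf$localImportance)[, impv],
         col = cols,
         var.label = TRUE)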

Proximities

The proximity between two observations in Random Forests is defined as the proportion of the time the two observations (both in the out-of-bag sample) end up in the same terminal node. The proximity measures can be visualized using Multidimensional Scaling (MDS) plots. Using the MDS plot we can learn more about our data:

- identify characteristics of unusual points
- find clusters within classes
- see which classes are overlapping
- see which classes differ
- see which variables are locally important
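The proximity definition can be sketched in base R from a table of terminal-node memberships. The matrix `nodes` below is made up (4 observations, 3 trees), and for simplicity this sketch counts every tree for every pair rather than restricting to trees where both observations are out-of-bag.

```r
# Hypothetical terminal-node IDs: one row per observation, one column per tree.
nodes <- matrix(c(1, 1, 2,
                  1, 1, 2,
                  1, 3, 2,
                  4, 3, 5),
                nrow = 4, byrow = TRUE)

n <- nrow(nodes)
prox <- matrix(0, n, n)
for (tree in seq_len(ncol(nodes))) {
  # Add 1 for every pair of observations sharing a terminal node in this tree.
  prox <- prox + outer(nodes[, tree], nodes[, tree], "==")
}
prox <- prox / ncol(nodes)       # proportion of trees in which the pair co-occurs
```

Observations 1 and 2 land in the same node in every tree, so their proximity is 1, while observations 1 and 4 never share a node and have proximity 0. The quantity 1 - proximity then behaves like a distance, which is what gets passed to MDS.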

Visualizing the Proximities

[Figure: three MDS scatter plots of the proximities: MDS 2 vs. MDS 1, MDS 3 vs. MDS 1, and MDS 3 vs. MDS 2.]

Code to extract the proximities

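# Interleave row-index columns with proximity columns
# (not used by the plot below)
loc = row(rf$proximity)
aprox = rbind(loc, rf$proximity)
prox = matrix(aprox, nrow = nrow(rf$proximity))

# Classical MDS on 1 - proximity, keeping three coordinates
scalerf = cmdscale(1 - rf$proximity, eig = TRUE, k = 3)$points

# Plot the first two MDS coordinates, colored by class
plot(scalerf[, 1], scalerf[, 2], col = cols,
     xlab = "MDS 1", ylab = "MDS 2",
     xlim = c(-0.5, 0.5),
     ylim = c(-0.5, 0.5),
     xaxt = "n",
     yaxt = "n")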

Local Variable Importance in interactive plots

We can find interesting patterns using an interactive plot.

Read more about irfplot (interactive random forests plots) here: http://digitalcommons.usu.edu/gradreports/134/

Brushing in interactive plots

randomForest

A short paper on the randomForest package can be found here: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf

Random Forests presentations by Dr. Adele Cutler

A more comprehensive set of notes on Random Forests by Dr. Adele Cutler can be found here:

http://www.math.usu.edu/adele/RandomForests/UofU2013.pdf

http://www.math.usu.edu/adele/RandomForests/Ovronnaz.pdf

Current Research

http://www.amstat.org/meetings/wsds/2016/onlineprogram/AbstractDetails.cfm?AbstractID=303499



Current Research - Improving the interpretation of Random Forests

https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314849



Remembering Leo Breiman

1928 – 2005

Read more about Leo Breiman’s life’s work in the article written by Dr. Adele Cutler: https://arxiv.org/pdf/1101.0917.pdf

Contact Information

Additional questions regarding Random Forests can be emailed to me at [email protected].
