Random Forests: The Vanilla of Machine Learning - Anna Quach
TRANSCRIPT
Welcome to my talk!
I’m currently a PhD student at Utah State University working under Dr. Adele Cutler.
Do we need hundreds of classifiers to solve real world classification problems?
The authors used 121 data sets from the University of California, Irvine (UCI) database (excluding large-scale problems) and their own data to evaluate 179 classifiers. Overall, Random Forests performed the best in terms of accuracy!
See the paper here: http://www.jmlr.org/papers/v15/delgado14a.html
Random Forests wins Kaggle competitions
http://blog.kaggle.com/2012/05/01/chucking-everything-into-a-random-forest-ben-hamner-on-winning-the-air-quality-prediction-hackathon/
Random Forests: A seminal paper!
https://scholar.google.com/citations?user=mXSv_1UAAAAJ&hl=en&oi=ao
Random Forests
The (very theoretical) paper can be found here:
https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
The inventors of Classification and Regression Trees (CART)
(a) Leo Breiman
(b) Jerome Friedman
(c) Charles J. Stone
(d) Richard A. Olshen
CART is actually published as a book.
https://www.amazon.com/Classification-Regression-Wadsworth-Statistics-Probability/dp/0412048418
Building a Classification Tree - Predicting Fake Likers
[Figure: "Predict Fake Facebook Likes". Scatter plot of Category Entropy against Average Verified Page Likes, with points colored by Facebook Like (Real or Fake).]
Download the data here: http://digital.cs.usu.edu/~kyumin/data/likers.html
First split
Category Entropy: $-\sum_{i=1}^{k} \frac{n_i}{N} \log \frac{n_i}{N}$, where $n_i$ is the number of liked pages under category $i$, and $N$ is the total number of pages liked by user $u$.
Average Verified Page Likes: average proportion of verified pages liked out of the total number of pages liked by a user.
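As a rough illustration of the category entropy defined above, here is a minimal R sketch. The helper name `category_entropy` and the example counts are made up for illustration, and the natural logarithm is an assumption, since the slide does not state the base.

```r
# Hypothetical helper: category entropy from a vector of per-category like counts.
category_entropy = function(counts) {
  p = counts[counts > 0] / sum(counts)   # n_i / N, skipping empty categories
  -sum(p * log(p))                       # natural log is an assumption
}

category_entropy(c(10, 10, 10, 10))  # likes spread evenly over 4 categories
category_entropy(c(40, 0, 0, 0))     # all likes in a single category
```

The first call returns log(4), the largest possible value for four categories, and the second returns 0, so a user whose likes are spread evenly over many categories has high entropy.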
Second split
Third split
Code to build a classification tree
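library(rpart)
library(rpart.plot)

# Read the data and give columns 9 and 10 shorter names.
data = read.csv("FakeLiker-dataset.csv")
colnames(data)[9] = "Entropy"
colnames(data)[10] = "Average"

# Make Class a factor so rpart fits a classification tree, then label its levels.
data$Class = factor(data$Class)
levels(data$Class) = c("Real", "Fake")
cols = c("lightblue", "orange")[data$Class]   # one plotting color per observation

# Fit the tree on the two predictors and plot it with class counts in each node.
ctree = rpart(Class ~ Entropy + Average, data)
prp(ctree, extra = 1, box.palette = c("lightblue", "orange"))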
References on recursive partitioning (rpart) and rpart.plot
A good reference for understanding how CART works, with example code for the rpart R package, can be found here:
https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
A guide with plenty of examples of plotting nice trees can be found here:
http://www.milbo.org/rpart-plot/prp.pdf
A visual introduction to a decision tree
ONE OF THE 10 AWARD-WINNING SCIENCE VISUALIZATIONS FROM THE 2016 VIZZIES
Learn how a classification tree is built interactively here: http://www.popsci.com/how-machine-learning-works-interactive
Bagging (Bootstrap Aggregating)
Fit each tree to a bootstrap sample (a random sample with replacement) from the data and combine the trees by voting (classification) or averaging (regression).
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
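The bagging recipe above can be sketched in a few lines of R. This is only an illustration under assumptions: it uses the built-in iris data rather than the Facebook data, and the number of trees (25) and the seed are arbitrary choices.

```r
library(rpart)

set.seed(1)
n_trees = 25          # arbitrary choice for the illustration
n = nrow(iris)

# Fit one classification tree per bootstrap sample (rows drawn with replacement).
trees = lapply(seq_len(n_trees), function(b) {
  idx = sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ])
})

# Each tree votes for a class; the bagged prediction is the majority vote.
votes = sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
bagged = apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged == iris$Species)   # accuracy on the training data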
The powers of Random Forests!
Random Forests is applicable to a wide variety of problems. Here are some of the features of Random Forests:
- Classification and Regression
- Rank Important Features (Most widely used)
- Impute Missing Values
- Local Variable Importance (Underused)
- Unbalanced classes
- Naturally fits interactions
- Does not overfit as you add more trees
- Detect patterns using proximities (Underused)
- Requires little tuning! Has two possible parameters to tune: mtry and depth for regression
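As a sketch of what tuning mtry can look like, the randomForest package provides `tuneRF`, which compares out-of-bag error for a few mtry values. The iris data, the seed, and the settings below are assumptions chosen only to keep the example small.

```r
library(randomForest)

set.seed(1)
x = iris[, 1:4]      # predictors
y = iris$Species     # class labels

# Try mtry values around the default, comparing out-of-bag error.
res = tuneRF(x, y, ntreeTry = 200, stepFactor = 2, improve = 0.01,
             trace = FALSE, plot = FALSE)
res   # one row per mtry value tried, with its OOB error
```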
Original Implementation of Random Forests is in Fortran
(a) Leo Breiman (b) Adele Cutler
Good documentation of the capabilities of Random Forests and the Fortran code can be found here: https://www.stat.berkeley.edu/~breiman/RandomForests/
Random Forests is a trademark
The commercial version of Random Forests, as well as videos about Random Forests, can be found here: https://www.salford-systems.com/products/randomforests
Salford Systems provides a user guide on how to use Random Forests in their software, Salford Predictive Modeler (SPM). Find the user guide here: http://media.salford-systems.com/pdf/spm7/RandomForestsModelBasics.pdf
randomForest - the first Random Forests package in R
https://cran.r-project.org/web/packages/randomForest/
Variable Importance
[Figure: "Rank of Important Features". Two dot charts of the 18 predictors, one by MeanDecreaseAccuracy (axis from 0.00 to 0.12) and one by MeanDecreaseGini (axis from 0 to 150).
Labels as extracted, MeanDecreaseAccuracy panel: Mean_posts, Mean_pages, Mean_photos, Skewness_of_posts, Maximum_posts_in_a_day, STD_Cat_Temporal, Shares_per_posts, Self_post_updates, Friends_Count, Links_per_posts, Comments_per_posts, Likes_per_posts, STD_Temporal, Post_Frequency_per_day, Average, About_Count, Years_Active, Entropy.
Labels as extracted, MeanDecreaseGini panel: Mean_pages, Mean_posts, Mean_photos, Shares_per_posts, Maximum_posts_in_a_day, Self_post_updates, Skewness_of_posts, Friends_Count, Links_per_posts, STD_Cat_Temporal, Comments_per_posts, Likes_per_posts, Post_Frequency_per_day, STD_Temporal, About_Count, Average, Years_Active, Entropy.]
Variable Importance Definition
Random Forests computes two measures of variable importance:
1. Permutation Importance (Mean Decrease in Accuracy) is permutation based.
For each tree, randomly permute the values of a variable among the out-of-bag cases and pass the permuted data down the tree. The permutation importance of each variable is the average over all the trees of (error rate with the variable permuted) - (error rate with no permutation).
2. Gini Importance (Mean Decrease in Gini) is Gini based, for classification.
For each variable, the average over all trees in the forest of (Gini impurity of the parent node) - (Gini impurity of the child nodes), taken over the splits made on that variable.
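To make the permutation importance concrete, here is a hand-rolled sketch in R. It is not the randomForest implementation: it uses plain bagged rpart trees (no random selection of variables at each split) on the built-in iris data, and the number of trees and the seed are arbitrary.

```r
library(rpart)

set.seed(1)
n = nrow(iris)
n_trees = 50
vars = setdiff(names(iris), "Species")
imp = matrix(0, n_trees, length(vars), dimnames = list(NULL, vars))

for (b in seq_len(n_trees)) {
  idx = sample(n, n, replace = TRUE)
  oob = iris[-unique(idx), ]                 # cases not drawn are out-of-bag
  tree = rpart(Species ~ ., data = iris[idx, ])
  err = mean(predict(tree, oob, type = "class") != oob$Species)
  for (v in vars) {
    perm = oob
    perm[[v]] = sample(perm[[v]])            # permute one variable among the OOB cases
    err_perm = mean(predict(tree, perm, type = "class") != perm$Species)
    imp[b, v] = err_perm - err               # increase in OOB error for this tree
  }
}

sort(colMeans(imp), decreasing = TRUE)       # average over all the trees
```

A variable the trees rely on shows a large increase in out-of-bag error when it is permuted, while a variable the trees never split on shows an increase of exactly zero.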
randomForest code
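A random forest can be built, and its important variables displayed, using the following R code:
library(randomForest)

# Fit a forest of 500 trees on all predictors and record variable importance.
rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  ntree = 500)

# Plot both importance measures without scaling.
varImpPlot(rf,
           scale = FALSE,
           main = "Rank of Important Features")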
Determining how many trees to use
# Plot the out-of-bag error (black) and the per-class error rates against the number of trees.
plot(rf,
     main = "",
     ylim = c(0.05, 0.25),
     col = c("black", "lightblue", "orange"))
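The same error curves can also be inspected numerically through the `err.rate` component of a classification forest. The sketch below uses the built-in iris data and an arbitrary seed as stand-ins, so the numbers it prints say nothing about the Facebook data.

```r
library(randomForest)

set.seed(1)
rf_iris = randomForest(Species ~ ., iris, ntree = 500)

# err.rate has one row per number of trees; the "OOB" column is the overall OOB error.
oob = rf_iris$err.rate[, "OOB"]
tail(oob, 1)        # OOB error using all 500 trees
which.min(oob)      # smallest number of trees that reaches the minimum OOB error
```

If the curve has flattened well before the last tree, adding more trees is unlikely to change the error much.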
Local Variable Importance
For each tree, randomly permute the values of a variable among the out-of-bag cases and pass the permuted data down the tree. The local variable importance for case i and variable j is the average, over all the trees for which case i is out-of-bag, of (error for case i with variable j permuted) - (error for case i with no permutation).
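Local Variable Importance example on detecting fake Facebook likes
[Figure: parallel coordinates plot of the local variable importance for Entropy, Years_Active, About_Count, Average, and Post_Frequency_per_day. Each axis is labelled with its minimum and maximum; values as extracted: −0.412, 0.448, −0.292, 0.326, −0.446, 0.468, −0.299, 0.236, −0.174, 0.173.]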
Code to extract local variable importance
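library(MASS)           # provides parcoord()
library(randomForest)

# Refit the forest, keeping casewise (local) importance and proximities.
rf = randomForest(Class ~ ., data,
                  importance = TRUE,
                  localImp = TRUE,
                  proximity = TRUE,
                  ntree = 500,
                  scale = FALSE)

# Top five variables by column 3 of the importance matrix
# (MeanDecreaseAccuracy for a two-class forest).
impv = names(sort(rf$importance[, 3], decreasing = TRUE))[1:5]

# Parallel coordinates plot: one line per observation, colored by class.
parcoord(t(rf$localImportance)[, impv],
         col = cols,
         var.label = TRUE)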
Proximities
The proximity of two observations in Random Forests is defined as the proportion of the time that the two observations (both in the out-of-bag sample) end up in the same terminal node. The proximity measures can be visualized using Multidimensional Scaling (MDS) plots. Using the MDS plot we can learn more about our data:
- identify characteristics of unusual points
- find clusters within classes
- see which classes are overlapping
- see which classes differ
- see which variables are locally important
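The idea behind the proximities can be sketched by hand. Note the simplifying assumption: the sketch counts shared terminal nodes over all trees and all cases, not only out-of-bag cases as in the definition above, so it will not reproduce the package's out-of-bag proximities exactly. It uses the built-in iris data and the `nodes = TRUE` option of `predict` for randomForest objects.

```r
library(randomForest)

set.seed(1)
rf_iris = randomForest(Species ~ ., iris, ntree = 100)

# Terminal node reached by every case in every tree (an n x ntree matrix).
nodes = attr(predict(rf_iris, iris, nodes = TRUE), "nodes")

# Proportion of trees in which two cases land in the same terminal node.
n = nrow(nodes)
prox = matrix(0, n, n)
for (t in seq_len(ncol(nodes))) {
  prox = prox + outer(nodes[, t], nodes[, t], "==")
}
prox = prox / ncol(nodes)

# Classical MDS on 1 - proximity gives coordinates for plotting.
mds = cmdscale(1 - prox, k = 2)
plot(mds, col = as.integer(iris$Species), xlab = "MDS 1", ylab = "MDS 2")
```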
Visualizing the Proximities
[Figure: three MDS scatter plots of the proximities: MDS 1 vs. MDS 2, MDS 1 vs. MDS 3, and MDS 2 vs. MDS 3.]
Code to extract the proximities
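# Intermediate objects from the original slide; the plot below does not use them.
loc = row(rf$proximity)
aprox = rbind(loc, rf$proximity)
prox = matrix(aprox, nrow = nrow(rf$proximity))

# Classical MDS on 1 - proximity, keeping three dimensions.
# Requires a forest fitted with proximity = TRUE.
scalerf = cmdscale(1 - rf$proximity, eig = TRUE, k = 3)$points

# Plot the first two MDS coordinates, colored by class.
plot(scalerf[, 1], scalerf[, 2], col = cols,
     xlab = "MDS 1", ylab = "MDS 2",
     xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.5),
     xaxt = "n", yaxt = "n")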
Local Variable Importance in interactive plots
We can find interesting patterns using an interactive plot.
Read more about irfplot (interactive random forests plots) here: http://digitalcommons.usu.edu/gradreports/134/
Brushing in interactive plots
randomForest
A short paper on the randomForest package can be found here: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf
Random Forests presentations by Dr. Adele Cutler
A more comprehensive set of notes on Random Forests by Dr. Adele Cutler can be found here:
http://www.math.usu.edu/adele/RandomForests/UofU2013.pdf
http://www.math.usu.edu/adele/RandomForests/Ovronnaz.pdf
Current Research
http://www.amstat.org/meetings/wsds/2016/onlineprogram/AbstractDetails.cfm?AbstractID=303499
Current Research - Improving the interpretation of Random Forests
https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314849
Remembering Leo Breiman
1928 – 2005
Read more about Leo Breiman’s life’s work in the article written by Dr. Adele Cutler: https://arxiv.org/pdf/1101.0917.pdf
Contact Information
Additional questions regarding Random Forests can be emailed to me at [email protected].