H2O World - GBM and Random Forest in H2O - Mark Landry
TRANSCRIPT
Presentation Outline
• Algorithm Background
  o Decision Trees
  o Random Forest
  o Gradient Boosted Machines (GBM)
• H2O Implementations
  o Code examples
  o Description of parameters and general usage
Decision Trees: Concept
• Separate the data according to a series of questions
  o Age > 9.5?
• The questions are found automatically to optimize separation of the data points by the “target”
Example decision tree: predicting survival of Titanic passengers (source: Wikimedia, CART tree of Titanic survivors)
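The automatic question search can be sketched in plain Python (an illustrative toy, not H2O's implementation): for one numeric feature, try every candidate cut-point and keep the one that minimizes squared error against the target. The ages and outcomes below are made up for illustration, loosely echoing the "Age > 9.5?" question on the slide.

```python
# Illustrative sketch of how a regression tree finds a question like
# "Age > 9.5?": exhaustively score every candidate cut-point by the
# squared error of predicting each side's mean.

def best_split(xs, ys):
    """Return (cut_point, squared_error) of the best binary split of xs."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = (None, sse(ys))  # baseline: no split, predict the overall mean
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        err = sse(left) + sse(right)
        if err < best[1]:
            best = (cut, err)
    return best

# Hypothetical ages vs. survival (1 = survived), for illustration only
ages = [4, 7, 9, 30, 40, 52]
survived = [1, 1, 1, 0, 0, 0]
cut, err = best_split(ages, survived)
print(cut, err)  # → 9 0.0 (splitting at age 9 separates the data perfectly)
```

Real implementations avoid this exhaustive scan over every distinct value; the "Trees in H2O" section later describes the histogram shortcut.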
Decision Trees: Practical Use
Strengths:
• Non-linear
• Robust to correlated features
• Robust to feature distributions
• Robust to missing values
• Simple to comprehend
• Fast to train
• Fast to score

Weaknesses:
• Poor accuracy
• Cannot extrapolate beyond the training data
• Inefficiently fits linear relationships
Improved Decision Trees: Ensembles
• Bootstrap aggregation (bagging)
  o Fit many trees against different samples of the data and average them together (Random Forest)
• Boosting
  o Fits consecutive trees where each solves for the net error of the prior trees (GBM)
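Both ensemble ideas can be sketched with one-split regression "stumps" standing in for full trees. This is a minimal toy, not H2O's implementation; the helper names are illustrative.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split tree: predict the mean of ys on each side of a cut."""
    best = None
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= cut else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, cut, lm, rm)
    if best is None:  # all xs identical: no valid split, predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, cut, lm, rm = best
    return lambda x: lm if x <= cut else rm

def bagging(xs, ys, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample, average predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # rows, with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_trees=25):
    """Boosting: each stump fits the residual (net error) of prior stumps."""
    stumps = []
    residual = list(ys)
    for _ in range(n_trees):
        s = fit_stump(xs, residual)
        stumps.append(s)
        residual = [r - s(x) for x, r in zip(xs, residual)]
    return lambda x: sum(s(x) for s in stumps)
```

The key contrast: bagging trains its trees independently (averaging reduces variance), while boosting trains them sequentially, each one correcting what the ensemble so far still gets wrong.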
Random Forest
• Combine multiple decision trees, each fit to a random sample of the original data
• Randomly samples
  o Rows
  o Columns
• Reduce variance, with minimal increase in bias
• Strengths
  o Easy to use
    • Few parameters
    • Well-established default values for parameters
  o Robust
  o Competitive accuracy on most data sets
• Weaknesses
  o Slow to score
  o Lack of transparency
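The two kinds of random sampling the slide lists can be sketched as follows. The function and column names are hypothetical, for illustration only, and do not reflect H2O's API.

```python
import random

def sample_for_tree(n_rows, columns, mtry, rng):
    """Draw the per-tree random sample: bootstrap rows, subset of columns."""
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # rows, with replacement
    cols = rng.sample(columns, mtry)                       # columns, without replacement
    return rows, cols

rng = random.Random(42)
rows, cols = sample_for_tree(6, ["age", "sex", "fare", "class"], mtry=2, rng=rng)
```

A common convention for `mtry` (the number of columns considered) is roughly the square root of the total column count for classification; this is one of the well-established defaults the slide refers to.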
Gradient Boosted Machines (GBM)
• Boosting: ensemble of weak learners*
• Fits consecutive trees where each solves for the net loss of the prior trees
• Results of new trees are applied partially to the entire solution
• Strengths
  o Often best possible model
  o Robust
  o Directly optimizes cost function
• Weaknesses
  o Overfits
    • Need to find proper stopping point
  o Sensitive to noise and extreme values
  o Several hyper-parameters
  o Lack of transparency
* the notion of “weak” is being challenged in practice
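"Applied partially" is the learning rate (shrinkage). A minimal sketch for squared-error loss, with a hypothetical `fit_tree` callback standing in for an actual tree fit (the trivial `mean_tree` below is illustration only, not a real tree):

```python
def gbm_fit(xs, ys, fit_tree, n_trees=100, learn_rate=0.1):
    """Gradient boosting for squared error: each tree fits the current
    residuals, and its output is added only partially (scaled by the
    learning rate) to the ensemble's running prediction."""
    pred = [0.0] * len(ys)
    trees = []
    for _ in range(n_trees):
        residual = [y - p for y, p in zip(ys, pred)]  # net loss of prior trees
        tree = fit_tree(xs, residual)
        trees.append(tree)
        pred = [p + learn_rate * tree(x) for p, x in zip(pred, xs)]
    return lambda x: sum(learn_rate * t(x) for t in trees)

def mean_tree(xs, ys):
    """Trivial stand-in for a fitted tree: predicts the mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

model = gbm_fit([0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0], mean_tree)
```

A smaller learning rate shrinks each tree's contribution, so more trees are needed but overfitting sets in later; this is why finding the proper stopping point matters.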
Trees in H2O
• Individual tree fitting is performed in parallel
• Shared histograms calculate cut-points
• Greedy search of histogram bins, optimizing squared error
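The histogram idea can be sketched like this (a generic illustration of the technique; H2O's actual histogram logic differs in detail): bin the feature values, accumulate per-bin count, sum, and sum of squares, then greedily scan bin boundaries for the cut-point minimizing squared error, so only a handful of candidate cuts are scored instead of every distinct value.

```python
def histogram_best_cut(xs, ys, n_bins=20):
    """Find the best cut-point for one feature using shared histogram bins."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0
    count = [0] * n_bins
    total = [0.0] * n_bins
    sq = [0.0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), n_bins - 1)
        count[b] += 1
        total[b] += y
        sq[b] += y * y

    best_err, best_cut = float("inf"), None
    # greedy left-to-right scan of bin boundaries, keeping running left sums
    lc = lt = ls = 0.0
    for b in range(n_bins - 1):
        lc += count[b]; lt += total[b]; ls += sq[b]
        rc, rt, rs = sum(count) - lc, sum(total) - lt, sum(sq) - ls
        if lc == 0 or rc == 0:
            continue
        # SSE of a group = sum(y^2) - (sum(y))^2 / n, for each side of the cut
        err = (ls - lt * lt / lc) + (rs - rt * rt / rc)
        if err < best_err:
            best_err, best_cut = err, lo + (b + 1) * width
    return best_cut, best_err
```

Because the per-bin sums are simple additive aggregates, they can be accumulated in parallel across chunks of rows and merged, which is what makes the shared-histogram approach fit a distributed setting.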