H2O World - GBM and Random Forest in H2O - Mark Landry
TRANSCRIPT
Presentation Outline
• Algorithm Background
  o Decision Trees
  o Random Forest
  o Gradient Boosted Machines (GBM)
• H2O Implementations
  o Code examples
  o Description of parameters and general usage
Decision Trees: Concept
• Separate the data according to a series of questions
  o Age > 9.5?
• The questions are found automatically to optimize separation of the data points by the “target”
Example decision tree: predicting survival of Titanic passengers (source: Wikimedia, CART tree of Titanic survivors)
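The automatic question search can be sketched in plain Python (an illustrative toy, not H2O's implementation): for one numeric feature, try every candidate cut-point and keep the one that minimizes squared error against the target. The ages and outcomes below are made up for illustration, loosely echoing the "Age > 9.5?" question on the slide.

```python
# Illustrative sketch of how a regression tree finds a question like
# "Age > 9.5?": exhaustively score every candidate cut-point by the
# squared error of predicting each side's mean.

def best_split(xs, ys):
    """Return (cut_point, squared_error) of the best binary split of xs."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = (None, sse(ys))  # baseline: no split, predict the overall mean
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        err = sse(left) + sse(right)
        if err < best[1]:
            best = (cut, err)
    return best

# Hypothetical ages vs. survival (1 = survived), for illustration only
ages = [4, 7, 9, 30, 40, 52]
survived = [1, 1, 1, 0, 0, 0]
cut, err = best_split(ages, survived)
print(cut, err)  # → 9 0.0 (splitting at age 9 separates the data perfectly)
```

Real implementations avoid this exhaustive scan over every distinct value; the "Trees in H2O" section later describes the histogram shortcut.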
Decision Trees: Practical Use
Strengths:
• Non-linear
• Robust to correlated features
• Robust to feature distributions
• Robust to missing values
• Simple to comprehend
• Fast to train
• Fast to score

Weaknesses:
• Poor accuracy
• Cannot extrapolate beyond the training data
• Inefficiently fits linear relationships
Improved Decision Trees: Ensembles
• Bootstrap aggregation (bagging)
  o Fit many trees against different samples of the data and average them together (Random Forest)
• Boosting
  o Fits consecutive trees where each solves for the net error of the prior trees (GBM)
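Both ensemble ideas can be sketched with one-split regression "stumps" standing in for full trees. This is a minimal toy, not H2O's implementation; the helper names are illustrative.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split tree: predict the mean of ys on each side of a cut."""
    best = None
    for cut in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= cut else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, cut, lm, rm)
    if best is None:  # all xs identical: no valid split, predict the mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, cut, lm, rm = best
    return lambda x: lm if x <= cut else rm

def bagging(xs, ys, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample, average predictions."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # rows, with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_trees=25):
    """Boosting: each stump fits the residual (net error) of prior stumps."""
    stumps = []
    residual = list(ys)
    for _ in range(n_trees):
        s = fit_stump(xs, residual)
        stumps.append(s)
        residual = [r - s(x) for x, r in zip(xs, residual)]
    return lambda x: sum(s(x) for s in stumps)
```

The key contrast: bagging trains its trees independently (averaging reduces variance), while boosting trains them sequentially, each one correcting what the ensemble so far still gets wrong.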
Random Forest
• Combine multiple decision trees, each fit to a random sample of the original data
• Randomly samples
  o Rows
  o Columns
• Reduce variance, with minimal increase in bias
• Strengths
  o Easy to use
    • Few parameters
    • Well-established default values for parameters
  o Robust
  o Competitive accuracy on most data sets
• Weaknesses
  o Slow to score
  o Lack of transparency
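The two kinds of random sampling the slide lists can be sketched as follows. The function and column names are hypothetical, for illustration only, and do not reflect H2O's API.

```python
import random

def sample_for_tree(n_rows, columns, mtry, rng):
    """Draw the per-tree random sample: bootstrap rows, subset of columns."""
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # rows, with replacement
    cols = rng.sample(columns, mtry)                       # columns, without replacement
    return rows, cols

rng = random.Random(42)
rows, cols = sample_for_tree(6, ["age", "sex", "fare", "class"], mtry=2, rng=rng)
```

A common convention for `mtry` (the number of columns considered) is roughly the square root of the total column count for classification; this is one of the well-established defaults the slide refers to.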
Gradient Boosted Machines (GBM)
• Boosting: ensemble of weak learners*
• Fits consecutive trees where each solves for the net loss of the prior trees
• Results of new trees are applied partially to the entire solution
• Strengths
  o Often best possible model
  o Robust
  o Directly optimizes cost function
• Weaknesses
  o Overfits
    • Need to find proper stopping point
  o Sensitive to noise and extreme values
  o Several hyper-parameters
  o Lack of transparency
* the notion of “weak” is being challenged in practice
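"Applied partially" is the learning rate (shrinkage). A minimal sketch for squared-error loss, with a hypothetical `fit_tree` callback standing in for an actual tree fit (the trivial `mean_tree` below is illustration only, not a real tree):

```python
def gbm_fit(xs, ys, fit_tree, n_trees=100, learn_rate=0.1):
    """Gradient boosting for squared error: each tree fits the current
    residuals, and its output is added only partially (scaled by the
    learning rate) to the ensemble's running prediction."""
    pred = [0.0] * len(ys)
    trees = []
    for _ in range(n_trees):
        residual = [y - p for y, p in zip(ys, pred)]  # net loss of prior trees
        tree = fit_tree(xs, residual)
        trees.append(tree)
        pred = [p + learn_rate * tree(x) for p, x in zip(pred, xs)]
    return lambda x: sum(learn_rate * t(x) for t in trees)

def mean_tree(xs, ys):
    """Trivial stand-in for a fitted tree: predicts the mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m

model = gbm_fit([0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0], mean_tree)
```

A smaller learning rate shrinks each tree's contribution, so more trees are needed but overfitting sets in later; this is why finding the proper stopping point matters.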
Trees in H2O
• Individual tree fitting is performed in parallel
• Shared histograms calculate cut-points
• Greedy search of histogram bins, optimizing squared error
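The histogram idea can be sketched like this (a generic illustration of the technique; H2O's actual histogram logic differs in detail): bin the feature values, accumulate per-bin count, sum, and sum of squares, then greedily scan bin boundaries for the cut-point minimizing squared error, so only a handful of candidate cuts are scored instead of every distinct value.

```python
def histogram_best_cut(xs, ys, n_bins=20):
    """Find the best cut-point for one feature using shared histogram bins."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0
    count = [0] * n_bins
    total = [0.0] * n_bins
    sq = [0.0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int((x - lo) / width), n_bins - 1)
        count[b] += 1
        total[b] += y
        sq[b] += y * y

    best_err, best_cut = float("inf"), None
    # greedy left-to-right scan of bin boundaries, keeping running left sums
    lc = lt = ls = 0.0
    for b in range(n_bins - 1):
        lc += count[b]; lt += total[b]; ls += sq[b]
        rc, rt, rs = sum(count) - lc, sum(total) - lt, sum(sq) - ls
        if lc == 0 or rc == 0:
            continue
        # SSE of a group = sum(y^2) - (sum(y))^2 / n, for each side of the cut
        err = (ls - lt * lt / lc) + (rs - rt * rt / rc)
        if err < best_err:
            best_err, best_cut = err, lo + (b + 1) * width
    return best_cut, best_err
```

Because the per-bin sums are simple additive aggregates, they can be accumulated in parallel across chunks of rows and merged, which is what makes the shared-histogram approach fit a distributed setting.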