
GBM & Random Forest in H2O

Mark Landry

Presentation Outline

•  Algorithm Background
   o  Decision Trees
   o  Random Forest
   o  Gradient Boosted Machines (GBM)

•  H2O Implementations
   o  Code examples
   o  Description of parameters and general usage

Decision Trees: Concept

•  Separate the data according to a series of questions
   o  Age > 9.5?

•  The questions are found automatically to optimize separation of the data points by the "target"

Example decision tree: predicting survival of Titanic passengers (source: Wikimedia, CART tree of Titanic survivors)
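
To make the concept concrete, here is a minimal sketch in R using the rpart package on R's built-in Titanic contingency table (an illustrative assumption; the slide itself only shows the Wikimedia CART figure). The built-in table records age only as Child/Adult, so the splits are categorical rather than the numeric Age > 9.5 above.

    library(rpart)
    # Expand the built-in 4-way contingency table into one row per passenger
    titanic <- as.data.frame(Titanic)
    titanic <- titanic[rep(seq_len(nrow(titanic)), titanic$Freq), 1:4]
    # Fit a decision tree: each node is a question chosen automatically
    # to best separate the passengers by the target (Survived)
    fit <- rpart(Survived ~ Class + Sex + Age, data = titanic)
    print(fit)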

Decision Trees: Practical Use

Strengths
•  Non-linear
•  Robust to correlated features
•  Robust to feature distributions
•  Robust to missing values
•  Simple to comprehend
•  Fast to train
•  Fast to score

Weaknesses
•  Poor accuracy
•  Cannot project (extrapolate beyond the range of the training data)
•  Inefficiently fits linear relationships

Improved Decision Trees: Ensembles

•  Bootstrap aggregation (bagging) → Random Forest
   o  Fit many trees against different samples of the data and average them together (sketched below)
•  Boosting → GBM
   o  Fits consecutive trees where each solves for the net error of the prior trees
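
A minimal sketch of bagging in plain R, using rpart and the built-in mtcars data (both assumptions for illustration; H2O's own implementations appear on the following slides):

    library(rpart)
    set.seed(42)
    n_trees <- 25
    # Fit each tree to a different bootstrap sample of the rows...
    preds <- sapply(seq_len(n_trees), function(i) {
      idx <- sample(nrow(mtcars), replace = TRUE)
      fit <- rpart(mpg ~ ., data = mtcars[idx, ])
      predict(fit, mtcars)
    })
    # ...and average the trees together
    bagged_pred <- rowMeans(preds)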

Random Forest

•  Combine multiple decision trees, each fit to a random sample of the original data
•  Randomly samples
   o  Rows
   o  Columns
•  Reduces variance, with minimal increase in bias
•  Strengths
   o  Easy to use
      •  Few parameters
      •  Well-established default values for parameters
   o  Robust
   o  Competitive accuracy on most data sets
•  Weaknesses
   o  Slow to score
   o  Lack of transparency

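In H2O this is a single call, h2o.randomForest(). The sketch below assumes a running H2O cluster and uses a small public H2O test file; the data set and parameter values are illustrative, not from the slides.

    library(h2o)
    h2o.init()
    train <- h2o.importFile(
      "https://s3.amazonaws.com/h2o-public-test-data/smalldata/iris/iris_wheader.csv")
    rf <- h2o.randomForest(x = 1:4,         # predictor columns
                           y = 5,           # response column
                           training_frame = train,
                           ntrees = 50,     # trees in the forest
                           seed = 1234)
    h2o.performance(rf)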

Gradient Boosted Machines (GBM)

•  Boosting: ensemble of weak learners*
•  Fits consecutive trees where each solves for the net loss of the prior trees
•  Results of new trees are applied partially to the entire solution
•  Strengths
   o  Often the best possible model
   o  Robust
   o  Directly optimizes the cost function
•  Weaknesses
   o  Overfits
      •  Need to find the proper stopping point
   o  Sensitive to noise and extreme values
   o  Several hyperparameters
   o  Lack of transparency


* the notion of “weak” is being challenged in practice
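
The H2O counterpart is h2o.gbm(). This sketch reuses the assumed train frame from the random forest example; learn_rate controls how partially each new tree is applied to the solution, and the stopping_* parameters address the need to find a proper stopping point. Parameter values are illustrative, not from the slides.

    gbm <- h2o.gbm(x = 1:4,
                   y = 5,
                   training_frame = train,
                   ntrees = 100,            # upper bound; rely on early stopping
                   learn_rate = 0.1,        # fraction of each tree applied to the solution
                   max_depth = 5,
                   stopping_rounds = 3,     # stop once the score stops improving
                   stopping_metric = "logloss",
                   nfolds = 5,              # cross-validate to find the stopping point
                   seed = 1234)
    h2o.performance(gbm, xval = TRUE)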

Trees in H2O

•  Individual tree fitting is performed in parallel
•  Shared histograms calculate cut-points
•  Greedy search of histogram bins, optimizing squared error
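
The histogram granularity is exposed through the nbins parameters on both tree algorithms. The slide names no parameters; the values below are H2O's documented defaults, shown only to connect the slide to the API.

    gbm_bins <- h2o.gbm(x = 1:4, y = 5, training_frame = train,
                        nbins = 20,          # histogram bins for numeric columns
                        nbins_cats = 1024,   # histogram bins for categorical columns
                        seed = 1234)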

Explore Further through Examples

•  I have H2O installed
•  I have R installed
•  I have the H2O World data sets
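
A minimal setup sketch covering those three prerequisites (the data-set path is hypothetical; the slides do not give one):

    install.packages("h2o")     # I have H2O installed
    library(h2o)
    h2o.init(nthreads = -1)     # start a local cluster on all available cores
    # I have the H2O World data sets (path is hypothetical)
    # data <- h2o.importFile("path/to/h2o-world-dataset.csv")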