Generalized Linear Models with H2O
H2O.ai Machine Intelligence
Outline
• Introduction to H2O
• GLM Overview
• Quick demo on Airlines Data
• Overview of H2O GLM features
• Common usage patterns
  • finding optimal regularization
  • handling wide datasets
• Kaggle Example
  • Avito Dataset overview
  • basic model
  • feature engineering
  • final model building
H2O - Product Overview
- In-Memory ML
  - Memory-Efficient Data Structures
  - Cutting-Edge Algorithms
- Distributed
  - Use all your Data (No Sampling)
  - Accuracy with Speed and Scale
- Open Source
  - Ownership of Methods - Apache V2
  - Easy to Deploy: Bare, Hadoop, Spark, etc.
- APIs
  - Java, Scala, R, Python, JavaScript, JSON
  - NanoFast Scoring Engine (POJO)
25,000 commits / 3yrs
H2O World Conference 2014
Team Work @ H2O.ai
Join H2O World Nov 9-11 2015!
Scientific Advisory Council
- Stephen Boyd, Professor of Electrical Engineering, Stanford University
- Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University
- Trevor Hastie, Professor of Statistics, Stanford University
Strong Community & Growth
Active Users
             Mar 2014   July 2014   Mar 2015
Companies    103        634         2,789
Users        463        2,887       13,237
150+
5/25/15 @kdnuggets t.co/4xSgleSIdY
Actual Customer Use Cases
- Ad Optimization (200% CPA Lift with H2O)
- P2B Model Factory (60k models, 15x faster with H2O than before)
- Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
- Real-time marketing (H2O is 10x faster than anything else)
- …and many large insurance and financial services companies!
Accuracy with Speed and Scale
[Architecture diagram; labels recovered from the slide]
- Data sources: HDFS, S3, SQL, NoSQL
- Fast Modeling Engine: Distributed In-Memory, Map Reduce/Fork Join, Columnar Compression
- Tasks (Supervised, Unsupervised): Classification, Regression, Feature Engineering, Matrix Factorization, Clustering, Munging
- Algorithms: GLM, Deep Learning; K-Means, PCA, NB, Cox; Random Forest / GBM Ensembles
- Streaming Nano Fast Java Scoring Engines (POJO code generation)
- Most code is written in-house from scratch
Generalized Linear Models
- Well known statistical/machine learning method
- Fits a linear model
  - link(E[y]) = c1*x1 + c2*x2 + … + cn*xn + intercept
- Produces (relatively) simple model
  - easy to fit
  - easy to understand and interpret
  - well known statistical properties
- Regression problems
  - gaussian, poisson, gamma, tweedie
- Classification
  - binomial, multinomial
- Requires good features
  - not as powerful on raw data as some other models (gbm/deep learning)
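The linear-model formula on the Generalized Linear Models slide can be sketched in a few lines. This is an illustration only, not H2O code: the names `INVERSE_LINKS` and `glm_predict` and the example numbers are made up for this sketch. It computes the linear predictor and applies the inverse link to get the predicted mean.

```python
import math

# Inverse link functions map the linear predictor back to the mean of the response.
# The pairing with families in the comments is the usual default, stated here as an assumption.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                        # typical for gaussian
    "log": lambda eta: math.exp(eta),                   # typical for poisson
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # typical for binomial
}


def glm_predict(coefficients, intercept, x, link="identity"):
    """Return the predicted mean for one row x (hypothetical helper)."""
    # Linear predictor: c1*x1 + c2*x2 + ... + cn*xn + intercept
    eta = intercept + sum(c * xi for c, xi in zip(coefficients, x))
    return INVERSE_LINKS[link](eta)


coefs, intercept, row = [0.5, -0.25], 0.1, [2.0, 4.0]
print(glm_predict(coefs, intercept, row, "identity"))
print(glm_predict(coefs, intercept, row, "logit"))
```

The same coefficients give a raw score under the identity link and a probability between 0 and 1 under the logit link.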
Generalized Linear Models 2
- Linear Model
  - defined by vector of coefficients
  - 1 number per predictor
- Parametrized by Family and Link
  - Family
    - our assumption about the distribution of the response
    - e.g. poisson for regression on counts, binomial for two-class classification
  - Link
    - non-linear transform of the (mean) response
    - e.g. logit to generate s-curve for logistic regression
- Fitted by maximum likelihood
  - pick the model with max probability of seeing the data
  - need an iterative solver (e.g. Newton's method, L-BFGS)
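To make "fitted by maximum likelihood with an iterative solver" concrete, here is a minimal sketch of Newton's method for logistic regression with a single predictor. It is not how H2O implements its solvers; the function name `fit_logistic_newton` and the toy data are assumptions for illustration.

```python
import math


def fit_logistic_newton(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton's method on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the negated Hessian (a 2x2 matrix)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data with overlapping classes, so the maximum-likelihood estimate is finite.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
print(b0, b1)
```

At the fitted coefficients the gradient is (numerically) zero, which is the maximum-likelihood condition.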
Generalized Linear Models 3
Simple 2-class classification example
[Figure: Linear Regression fit (family=gaussian, link=identity)]
[Figure: Logistic Regression fit (family=binomial, link=logit)]
Penalized Generalized Linear Models
- Problems
  - can overfit: works great on training, fails on test
  - solution is not unique with correlated variables
- Solution: Add Regularization
  - add penalty to reduce size of the coefficient vector
  - L1 or L2 norm of the coefficient vector
- L1 versus L2
  - L2: dense solution
    - coefficients of correlated variables are pushed to the same value
  - L1: sparse solution
    - picks one correlated variable, others discarded
- Elastic Net
  - combination of L1 and L2
  - sparse solution, correlated variables grouped, enter/leave the model together
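The sparse-versus-dense contrast on the Penalized GLM slide can be seen in a small sketch. The penalty below uses the glmnet-style parametrization lambda * (alpha * L1 + (1 - alpha)/2 * L2^2), which is an assumption of this sketch; all function names are hypothetical. The two shrinkage helpers show what each penalty does to a single coefficient in a one-dimensional least-squares problem.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Penalty added to the objective: alpha=1 is pure L1, alpha=0 is pure L2."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def soft_threshold(z, t):
    """Minimizer of 0.5*(b - z)**2 + t*abs(b): the L1 penalty sets small values exactly to 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ridge_shrink(z, t):
    """Minimizer of 0.5*(b - z)**2 + 0.5*t*b**2: the L2 penalty scales but never zeroes."""
    return z / (1.0 + t)


for z in (2.0, 0.3, -0.1):
    print(z, soft_threshold(z, 0.5), ridge_shrink(z, 0.5))
```

Small unpenalized values come out as exactly zero under L1 (sparse) and as small non-zero numbers under L2 (dense).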
GLM on H2O
- Fully Distributed and Parallel
  - handles datasets with up to 100s of thousands of predictors
  - scales linearly with number of rows
  - processes datasets with 100s of millions of rows in seconds
- All standard GLM features
  - standard families
  - support for observation weights and offset
- Elastic Net Regularization
  - lambda-search: efficient computation of optimal regularization strength
  - applies strong rules to filter out inactive coefficients
- Several solvers for different problems
  - Iterative re-weighted least squares with ADMM solver
  - L-BFGS for wide problems
  - Coordinate Descent (Experimental)
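A rough sketch of what a lambda search iterates over: start at the smallest lambda that zeroes every coefficient and walk down a log-spaced path. The formula for `lambda_max` assumes a gaussian family, standardized predictors, and a glmnet-style objective; these assumptions, the function names, and the defaults are illustrative and not taken from H2O's implementation.

```python
import math


def lambda_max(columns, y, alpha):
    """Smallest lambda for which all coefficients are zero (gaussian, alpha > 0)."""
    n = len(y)
    y_mean = sum(y) / n
    residual = [yi - y_mean for yi in y]  # residual of the intercept-only model
    largest = max(abs(sum(x * r for x, r in zip(col, residual))) for col in columns)
    return largest / (n * alpha)


def lambda_path(lam_max, n_lambdas=10, min_ratio=1e-3):
    """Log-spaced sequence from lam_max down to lam_max * min_ratio."""
    step = math.log(min_ratio) / (n_lambdas - 1)
    return [lam_max * math.exp(step * i) for i in range(n_lambdas)]


columns = [[-1.0, 0.0, 1.0], [1.0, -2.0, 1.0]]  # one list per predictor
y = [1.0, 2.0, 6.0]
path = lambda_path(lambda_max(columns, y, alpha=0.5))
print(path)
```

A model is then fitted at each lambda on the path, typically warm-starting from the previous solution.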
GLM on H2O 2
- Automatic handling of categorical variables
  - automatically expands categoricals into 1-hot encoded binary vectors
  - efficient handling (sparse access, sparse covariance matrix)
  - (unlike R) uses all levels by default if running with regularization
- Missing value handling
  - missing values are not handled and rows with any missing value will be omitted from the training dataset
  - need to impute missing values up front if there are many
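The difference between "all levels" and R-style encoding of a categorical can be sketched as below. This is plain Python for illustration, not H2O's internal (sparse) representation; `one_hot` and its `use_all_levels` flag are hypothetical names.

```python
def one_hot(values, use_all_levels=True):
    """Expand a categorical column into 0/1 indicator columns."""
    levels = sorted(set(values))
    if not use_all_levels:
        # Drop the first level as the reference category, as R's default
        # treatment contrasts do.
        levels = levels[1:]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


origin = ["SFO", "ORD", "SFO", "ATL"]
print(one_hot(origin))                        # one column per level
print(one_hot(origin, use_all_levels=False))  # reference level dropped
```

With regularization the full set of indicator columns is not a problem, because the penalty makes the solution unique even though the columns sum to one.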
EC2 Demo Cluster: 8 nodes, 64 cores
H2O Deep Learning: Expect good cluster utilization :)
Airline Data: Predict Delayed Departure
- Predict dep. delay Y/N
- 20 years of domestic airline flight data
- 116M rows, 31 columns
- 12 GB CSV, 4 GB compressed
Results in Seconds on Big Data
- 8-node EC2 cluster: 64 virtual cores, 1GbE
- All cores maxed out
- Logistic Regression: ~20s
  - elastic net, alpha=0.5, lambda=1.379e-4 (auto)
- Deep Learning: ~70s
  - 4 hidden ReLU layers of 20 neurons, 2 epochs
- +9% AUC
- Year, Month, Sched. Dep. Time have non-linear impact
- Chicago, Atlanta, Dallas: often delayed
Output Fields
- Standard metrics as in other H2O algos, plus:
  - residual deviance
  - null deviance
  - degrees of freedom
- Coefficients / standardized coefficients
  - the actual model
  - one number per predictor
  - model is fitted on standardized data by default (controlled by a parameter)
    - standardized coefficients are the actual coefficients fitted on standardized data
    - (non-standardized) coefficients are the de-scaled version of the standardized coefficients (so that they can be applied to the original dataset)
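The relationship between standardized and de-scaled coefficients can be written out explicitly. The sketch assumes each predictor was standardized as (x - mean) / sd; the function name `descale` and the numbers are made up for illustration.

```python
def descale(std_coefs, std_intercept, means, sds):
    """Convert coefficients fitted on (x - mean) / sd back to the original scale."""
    coefs = [b / s for b, s in zip(std_coefs, sds)]
    intercept = std_intercept - sum(
        b * m / s for b, m, s in zip(std_coefs, means, sds)
    )
    return coefs, intercept


std_coefs, std_intercept = [2.0, -1.0], 0.5
means, sds = [10.0, 3.0], [4.0, 2.0]
coefs, intercept = descale(std_coefs, std_intercept, means, sds)

row = [14.0, 1.0]
standardized_row = [(x - m) / s for x, m, s in zip(row, means, sds)]
# Both linear predictors agree: one on standardized data, one on raw data.
print(std_intercept + sum(b * x for b, x in zip(std_coefs, standardized_row)))
print(intercept + sum(b * x for b, x in zip(coefs, row)))
```

Standardized coefficients are the ones to compare for relative importance; de-scaled coefficients are the ones to apply to raw data.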
Avito Dataset: Predict User Clicks
- 20M rows, 21 columns
- ~3 GB CSV, ~1 GB compressed
- Kaggle competition: https://www.kaggle.com/c/avito-context-ad-clicks/data
- Some munging needed, see
  - https://github.com/h2oai/0xdata.com/blob/master/src/blog/2015_10_DataMunging._md
  - http://www.slideshare.net/0xdata/400-million-search-results-predict-contextual-ad-clicks
Avito Dataset: Predict User Clicks
- Running GLM straight on the data runs fast, but accuracy is not great
- We will try to improve it by:
  - turning variables to categoricals
  - imputing NAs with means
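Mean imputation itself is simple; a minimal sketch on a plain Python list is shown here. The deck does not show which call was used on the H2O frame, so `impute_mean` is a hypothetical stand-in, with `None` playing the role of NA.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


price = [1.0, None, 3.0, None]
print(impute_mean(price))
```

Imputing up front keeps rows that GLM would otherwise drop because of a missing value.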
Avito Dataset: Predict User Clicks
- Further improvements
  - cut numerical columns into intervals to make new categoricals
    - use h2o.hist + h2o.cut
  - add interactions
    - use h2o.interaction
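What the two feature-engineering steps produce can be sketched in plain Python. These helpers only mimic the idea behind h2o.cut and h2o.interaction; they are not those functions, and the names `cut` and `interaction`, the left-closed intervals, and the label format are choices made for this sketch.

```python
from bisect import bisect_right


def cut(values, breaks):
    """Map each number to a left-closed interval label; breaks must be sorted."""
    edges = [float("-inf")] + list(breaks) + [float("inf")]
    labels = []
    for v in values:
        # Index of the interval [edges[i], edges[i + 1]) that contains v.
        i = min(bisect_right(edges, v) - 1, len(edges) - 2)
        labels.append(f"[{edges[i]}, {edges[i + 1]})")
    return labels


def interaction(col_a, col_b):
    """Combine two categorical columns into one column of paired levels."""
    return [f"{a}_{b}" for a, b in zip(col_a, col_b)]


price_bin = cut([5, 10, 25], breaks=[10, 20])
category = ["cars", "phones", "cars"]
print(price_bin)
print(interaction(category, price_bin))
```

Binning lets a linear model fit a step function of a numeric column, and interactions let it fit a separate effect for each combination of levels.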
General Usage Guidelines
- Solver selection
  - IRLSM
    - default choice with L1 penalty
    - works great with small number of predictors
    - efficient L1 solver
    - can handle wide datasets with lambda search and L1 penalty
  - L-BFGS
    - handles wide data well, but can iterate a lot (take a long time), especially with L1 penalty
    - tune the objective epsilon: often many iterations are spent on minor improvements
- Regularization Selection
  - Compare sparse versus dense
    - compare runs with alpha >= .5, alpha == 0
    - generally L1 does slightly better
  - Run lambda search to pick optimal regularization strength
General Usage Guidelines 2
- Do not pre-expand categorical variables
  - H2O expands categorical variables automatically, way more efficient
- Adding features
  - splitting numerical variables into intervals helps
  - adding categorical interactions helps
- Using Lambda-Search
  - always use validation data set
    - otherwise picks lambda.min
  - validation dataset is used to pick the best lambda value -> need separate test set!
  - check that lambda.best > lambda.min
    - otherwise did not start overfitting, smaller lambda may be better
    - re-run with smaller lambda.min
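The lambda.best versus lambda.min check can be expressed as a small helper. It assumes the lambda search results are available as (lambda, validation deviance) pairs; that data layout and the name `pick_lambda` are hypothetical, not an H2O API.

```python
def pick_lambda(path):
    """Return (best lambda, flag) from (lambda, validation_deviance) pairs.

    The flag is True when the best lambda is the smallest one tried, i.e. the
    search never reached the overfitting regime and a smaller lambda.min may help.
    """
    best_lambda, _ = min(path, key=lambda pair: pair[1])
    lambda_min = min(lam for lam, _ in path)
    return best_lambda, best_lambda <= lambda_min


print(pick_lambda([(1.0, 0.9), (0.1, 0.6), (0.01, 0.7)]))  # deviance turns back up
print(pick_lambda([(1.0, 0.9), (0.1, 0.6), (0.01, 0.5)]))  # still improving at the end
```

Because the validation set chose the lambda, the final error estimate should come from a separate test set.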
More Info in H2O Booklets
http://h2o.ai/resources