generalized linear models with h2o

H2O.ai Machine Intelligence

Generalized Linear Models with H2O


Tomas Nykodym


Outline

• Introduction to H2O
• GLM Overview
• Quick demo on Airlines Data
• Overview of H2O GLM features
• Common usage patterns
  • finding optimal regularization
  • handling wide datasets
• Kaggle Example
  • Avito Dataset overview
  • basic model
  • feature engineering
  • final model building


H2O - Product Overview

- In-Memory ML
  - Memory-Efficient Data Structures
  - Cutting-Edge Algorithms
- Distributed
  - Use all your Data (No Sampling)
  - Accuracy with Speed and Scale
- Open Source
  - Ownership of Methods - Apache V2
  - Easy to Deploy: Bare, Hadoop, Spark, etc.
- APIs
  - Java, Scala, R, Python, JavaScript, JSON
  - NanoFast Scoring Engine (POJO)


Team Work @ H2O.ai

25,000 commits / 3yrs

H2O World Conference 2014

Join H2O World Nov 9-11 2015!


Scientific Advisory Council

Stephen Boyd, Professor of Electrical Engineering, Stanford University

Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University

Trevor Hastie, Professor of Statistics, Stanford University


Strong Community & Growth

            Mar 2014   July 2014   Mar 2015
Companies   103        634         2,789
Users       463        2,887       13,237

Active Users

150+

5/25/15 @kdnuggets http://t.co/4xSgleSIdY


Actual Customer Use Cases

- Ad Optimization (200% CPA Lift with H2O)
- P2B Model Factory (60k models, 15x faster with H2O than before)
- Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
- Real-time marketing (H2O is 10x faster than anything else)
- …and many large insurance and financial services companies!


Accuracy with Speed and Scale

- Data: HDFS, S3, SQL, NoSQL
- Fast Modeling Engine: Distributed In-Memory, Map Reduce/Fork Join, Columnar Compression
- Munging: Feature Engineering
- Supervised: Classification, Regression
- Unsupervised: Matrix Factorization, Clustering
- Algorithms: GLM, Deep Learning, K-Means, PCA, NB, Cox, Random Forest / GBM Ensembles
- Streaming Nano Fast Java Scoring Engines (POJO code generation)
- Most code is written in-house from scratch


Generalized Linear Models

- Well known statistical/machine learning method
- Fits a linear model
  - link(y) = c1*x1 + c2*x2 + … + cn*xn + intercept
- Produces (relatively) simple model
  - easy to fit
  - easy to understand and interpret
  - well known statistical properties
- Regression problems
  - gaussian, poisson, gamma, tweedie
- Classification
  - binomial, multinomial
- Requires good features
  - not as powerful on raw data as some other models (gbm/deep learning)
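
The link formula above can be illustrated with a minimal sketch in plain Python (not the H2O API). The coefficient values and the helper name `glm_predict` are hypothetical, chosen only to show how the inverse link maps the linear predictor back to a prediction.

```python
import math

# Inverse links map the linear predictor (eta) back to the scale of the response.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                         # typical for gaussian
    "log": lambda eta: math.exp(eta),                    # typical for poisson
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # typical for binomial
}

def glm_predict(coefs, intercept, x, link="identity"):
    """Predicted mean for one row x: inverse_link(c1*x1 + ... + cn*xn + intercept)."""
    eta = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return INVERSE_LINKS[link](eta)

# Hypothetical coefficients and input row, for illustration only.
coefs, intercept = [0.5, -1.0], 0.5
row = [1.0, 1.0]

p_identity = glm_predict(coefs, intercept, row, "identity")
p_logit = glm_predict(coefs, intercept, row, "logit")
p_log = glm_predict(coefs, intercept, row, "log")
```

The same coefficient vector gives different predictions depending on the link, which is why family and link are part of the model definition.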


Generalized Linear Models 2

- Linear Model
  - defined by vector of coefficients
  - 1 number per predictor
- Parametrized by Family and Link
  - Family
    - our assumption about distribution of the response
    - e.g. poisson for regression on counts, binomial for two class classification
  - Link
    - non-linear transform of the response
    - e.g. logit to generate s-curve for logistic regression
- Fitted by maximum likelihood
  - pick the model with max probability of seeing the data
  - need an iterative solver (e.g. Newton's method, L-BFGS)
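
As a rough illustration of fitting by maximum likelihood with an iterative solver, here is a minimal sketch of Newton's method for logistic regression with a single predictor. It is plain Python, not H2O's distributed implementation; the function name and the toy data are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_newton(xs, ys, iterations=25, tol=1e-10):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by Newton's method on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g) and Hessian (h) of the negative log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += p - y
            g1 += (p - y) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * d = g in closed form.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 -= d0
        b1 -= d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data (hypothetical): the classes overlap, so a finite optimum exists.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
```

At the optimum the gradient of the log-likelihood is zero, which is the condition the loop iterates towards.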


Generalized Linear Models 3

Simple 2-class classification example

Linear Regression fit (family = gaussian, link = identity)

Logistic Regression fit (family = binomial, link = logit)


Penalized Generalized Linear Models

- Problems
  - can overfit: works great on training, fails on test
  - solution is not unique with correlated variables
- Solution: Add Regularization
  - add penalty to reduce size of the coefficient vector
  - L1 or L2 norm of the coefficient vector
- L1 versus L2
  - L2: dense solution
    - coefficients of correlated variables are pushed to the same value
  - L1: sparse solution
    - picks one correlated variable, others discarded
- Elastic Net
  - combination of L1 and L2
  - sparse solution, correlated variables grouped, enter/leave the model together
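
A minimal sketch of the penalty terms, assuming the common glmnet-style parameterization lambda * (alpha * L1 + (1 - alpha)/2 * L2^2). The helper names are hypothetical. The two one-dimensional shrinkage functions show why L1 gives sparse solutions and L2 gives dense ones.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Penalty added to the objective: alpha=1 is pure L1, alpha=0 is pure L2."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

def soft_threshold(z, t):
    """Minimizer of 0.5*(b - z)**2 + t*abs(b): small values become exactly 0 (sparse)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ridge_shrink(z, t):
    """Minimizer of 0.5*(b - z)**2 + 0.5*t*b**2: values shrink but stay non-zero (dense)."""
    return z / (1.0 + t)

beta = [1.0, -2.0]
lasso_pen = elastic_net_penalty(beta, lam=0.5, alpha=1.0)
ridge_pen = elastic_net_penalty(beta, lam=0.5, alpha=0.0)
mixed_pen = elastic_net_penalty(beta, lam=0.5, alpha=0.5)
```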


GLM on H2O

- Fully Distributed and Parallel
  - handles datasets with up to 100s of thousands of predictors
  - scales linearly with number of rows
  - processes datasets with 100s of millions of rows in seconds
- All standard GLM features
  - standard families
  - support for observation weights and offset
- Elastic Net Regularization
  - lambda-search: efficient computation of optimal regularization strength
  - applies strong rules to filter out inactive coefficients
- Several solvers for different problems
  - Iterative re-weighted least squares with ADMM solver
  - L-BFGS for wide problems
  - Coordinate Descent (Experimental)


GLM on H2O 2

- Automatic handling of categorical variables
  - automatically expands categoricals into 1-hot encoded binary vectors
  - efficient handling (sparse access, sparse covariance matrix)
  - (unlike R) uses all levels by default if running with regularization
- Missing value handling
  - missing values are not handled and rows with any missing value will be omitted from the training dataset
  - need to impute missing values up front if there are many
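
To make the 1-hot expansion concrete, here is a small plain-Python sketch (not H2O's sparse implementation). The function name `one_hot` and the sample values are hypothetical. With `drop_first=True` one level is left out as a reference level, which is the convention the slide contrasts with; with `drop_first=False` every level gets its own column.

```python
def one_hot(values, drop_first=False):
    """Expand a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # first level becomes the implicit reference level
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows

origin = ["SFO", "ORD", "SFO", "ATL"]  # hypothetical categorical column

all_levels, all_rows = one_hot(origin)
ref_levels, ref_rows = one_hot(origin, drop_first=True)
```

With all levels kept, each row has exactly one non-zero entry, which is what makes a sparse representation efficient.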



EC2 Demo Cluster: 8 nodes, 64 cores

H2O Deep Learning: Expect good cluster utilization :)



Airline Data: Predict Delayed Departure

Predict dep. delay Y/N

116M rows, 31 columns, 12 GB CSV, 4 GB compressed

20 years of domestic airline flight data



Results in Seconds on Big Data

Logistic Regression: ~20s; elastic net, alpha=0.5, lambda=1.379e-4 (auto)

Deep Learning: ~70s; 4 hidden ReLU layers of 20 neurons, 2 epochs

8-node EC2 cluster: 64 virtual cores, 1GbE

Year, Month, Sched. Dep. Time have non-linear impact

Chicago, Atlanta, Dallas: often delayed

All cores maxed out

+9% AUC



Output Fields

- Standard metrics as in other H2O algos, plus
  - residual deviance
  - null deviance
  - degrees of freedom
- Coefficients / standardized coefficients
  - the actual model
  - one number per predictor
  - model is fitted on standardized data by default (parameter)
    - standardized coefficients are the actual coefficients fitted on standardized data
    - (non-standardized) coefficients are a de-scaled version of the standardized coefficients (so that they can be applied to the original dataset)
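
The relationship between the two coefficient sets can be sketched as follows, assuming standardization means subtracting the column mean and dividing by the column standard deviation. The function name and the numbers are hypothetical; this is plain Python, not H2O output.

```python
def descale(std_coefs, std_intercept, means, sds):
    """Convert coefficients fitted on (x - mean) / sd into coefficients for raw x."""
    coefs = [b / s for b, s in zip(std_coefs, sds)]
    intercept = std_intercept - sum(c * m for c, m in zip(coefs, means))
    return coefs, intercept

# Hypothetical standardized model and column statistics.
std_coefs, std_intercept = [2.0, -1.0], 0.5
means, sds = [10.0, 4.0], [2.0, 0.5]

coefs, intercept = descale(std_coefs, std_intercept, means, sds)

# Both forms give the same linear predictor for the same row.
row = [12.0, 5.0]
row_std = [(x - m) / s for x, m, s in zip(row, means, sds)]
eta_std = std_intercept + sum(b * x for b, x in zip(std_coefs, row_std))
eta_raw = intercept + sum(b * x for b, x in zip(coefs, row))
```

Standardized coefficients are comparable across predictors, while the de-scaled ones are the ones to apply to data in its original units.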


Avito Dataset: Predict User Clicks

20M rows, 21 columns, ~3 GB CSV, ~1 GB compressed

Kaggle competition: https://www.kaggle.com/c/avito-context-ad-clicks/data

Some munging needed, see

https://github.com/h2oai/0xdata.com/blob/master/src/blog/2015_10_DataMunging._md

http://www.slideshare.net/0xdata/400-million-search-results-predict-contextual-ad-clicks


- Running GLM straight on the data is fast, but accuracy is not great
- We will try to improve it by:
  - turning variables to categoricals
  - imputing NAs with means
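
A minimal sketch of mean imputation in plain Python, with `None` standing in for NA. The helper names and the sample rows are hypothetical and do not use the H2O API. It also shows what is lost if rows with missing values are dropped instead.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def drop_incomplete(rows):
    """Keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r)]

price = [1.0, None, 3.0]  # hypothetical numeric column with one NA
imputed = impute_mean(price)

rows = [[1.0, 5.0], [None, 6.0], [3.0, None]]
complete = drop_incomplete(rows)
```

With many NAs spread over different columns, dropping incomplete rows can discard most of the data, which is why imputing up front matters.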


- Further improvements
  - cut numerical columns into intervals to make new categoricals
    - use h2o.hist + h2o.cut
  - add interactions
    - use h2o.interaction
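
The two feature-engineering steps can be sketched in plain Python. This is an illustration of the idea only, not a call to `h2o.cut` or `h2o.interaction`; the helper names, break points and sample values are hypothetical.

```python
import bisect

def cut(values, breaks):
    """Map each number to an interval label; breaks must be sorted ascending."""
    return [f"bin{bisect.bisect_right(breaks, v)}" for v in values]

def interaction(col_a, col_b):
    """Combine two categorical columns into one column of level pairs."""
    return [f"{a}_{b}" for a, b in zip(col_a, col_b)]

price = [5, 12, 30]             # hypothetical numeric column
category = ["cars", "phones", "cars"]  # hypothetical categorical column

price_bin = cut(price, breaks=[10, 20])
category_x_price = interaction(category, price_bin)
```

Binning lets a linear model fit a separate coefficient per interval, so it can capture a non-linear effect of the original numeric column.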


General Usage Guidelines

- Solver selection
  - IRLSM
    - default choice with L1 penalty
    - works great with small number of predictors
    - efficient L1 solver
    - can handle wide datasets with lambda search and L1 penalty
  - L-BFGS
    - handles wide data well, but
    - can iterate a lot (take a long time), especially with L1 penalty
    - tune the objective epsilon: often many iterations are spent on minor improvements
- Regularization Selection
  - compare sparse versus dense
    - compare runs with alpha >= .5, alpha == 0
    - generally L1 does slightly better
  - run lambda search to pick optimal regularization strength


General Usage Guidelines 2

- Do not pre-expand categorical variables
  - H2O expands categorical variables automatically
  - way more efficient
- Adding features
  - splitting numerical variables into intervals helps
  - adding categorical interactions helps
- Using Lambda-Search
  - always use validation data set
    - otherwise picks lambda.min
    - validation dataset is used to pick the best lambda value -> need separate test set!
  - check that lambda.best > lambda.min
    - otherwise did not start overfitting, smaller lambda may be better
    - re-run with smaller lambda.min
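
The lambda-search check above can be sketched as follows. The function name and the (lambda, validation deviance) pairs are hypothetical illustration values, not output from an H2O run.

```python
def pick_lambda(path):
    """path: list of (lambda, validation_deviance) pairs from a lambda search.

    Returns the lambda with the lowest validation deviance, and a flag that is
    True when that lambda is the smallest one tried (search hit its lower bound).
    """
    lambda_best, _ = min(path, key=lambda pair: pair[1])
    lambda_min = min(lam for lam, _ in path)
    return lambda_best, lambda_best == lambda_min

# Hypothetical search where validation deviance starts rising again (overfitting).
path_ok = [(1.0, 0.70), (0.1, 0.62), (0.01, 0.58), (0.001, 0.60)]
# Hypothetical search that stopped before overfitting started.
path_short = [(1.0, 0.70), (0.1, 0.62), (0.01, 0.58)]

best_ok, at_bound_ok = pick_lambda(path_ok)
best_short, at_bound_short = pick_lambda(path_short)
```

When the flag is True, the search never reached the point of overfitting, so re-running with a smaller lambda.min is worth trying. Because the validation set is used for this choice, the final error estimate should come from a separate test set.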


More Info in H2O Booklets

http://h2o.ai/resources

