TRANSCRIPT
Model Selection and Tuning at Scale
March 2016
About us
Owen Zhang
Chief Product Officer @ DataRobot
Former #1 ranked Data Scientist on Kaggle
Former VP, Science @ AIG
Peter Prettenhofer
Software Engineer @ DataRobot
Scikit-learn core developer
Agenda
● Introduction
● Case study: Criteo 1TB
● Conclusion / Discussion
Model Selection
● Estimating the performance of different models in order to choose the best one.
● K-Fold Cross-validation
● The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked models require nested layers
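A minimal sketch of this workflow in scikit-learn (the dataset and model here are illustrative, not from the talk): carve off the holdout partition first so it never influences selection, then run K-fold cross-validation on the remainder.

```python
# Sketch: K-fold cross-validation with a separate holdout partition.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out data first so it never leaks into model selection.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0), X_train, y_train, cv=cv)

# The holdout score is computed only once, for the chosen model.
final = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
holdout_acc = final.score(X_hold, y_hold)
```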
Train Validation Holdout
[Diagram: data divided into partitions 1-5, used as train, validation, and holdout]
Model Complexity & Overfitting
More data to the rescue?
Underfitting or Overfitting?
http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
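The linked example can be reproduced numerically with scikit-learn's `learning_curve` (the model and sample sizes below are illustrative assumptions): a persistent gap between training and validation scores suggests overfitting, while a low training score suggests underfitting.

```python
# Sketch: diagnosing under-/overfitting from a learning curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.2, 0.5, 1.0], cv=5, shuffle=True, random_state=0)

# Large gap at full size -> overfitting (more data may help);
# low training score -> underfitting (more data will not help).
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```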
Model Tuning
● Optimizing the performance of a model
● Example: Gradient Boosted Trees
○ Nr of trees
○ Learning rate
○ Tree depth / nr of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate
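In scikit-learn these six knobs map roughly to the parameters below (the values shown are illustrative, not the talk's recommendations):

```python
# Sketch: the GBT tuning knobs above, as scikit-learn parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,      # nr of trees
    learning_rate=0.1,     # learning rate
    max_depth=3,           # tree depth (max_leaf_nodes limits leaves instead)
    min_samples_leaf=20,   # min leaf size
    subsample=0.8,         # example subsampling rate
    max_features=0.5,      # feature subsampling rate
    random_state=0,
).fit(X, y)

train_acc = gbm.score(X, y)
```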
Search Space
Hyperparameter          GBRT (naive)  GBRT  RandomForest
Nr of trees             5             1     1
Learning rate           5             5     -
Tree depth              5             5     1
Min leaf size           3             3     3
Example subsample rate  3             1     1
Feature subsample rate  2             2     5
Total                   2250          150   15

(Entries are the number of candidate values tried per hyperparameter; Total is their product, i.e. the number of grid points.)
Hyperparameter Optimization
● Grid Search
● Random Search
● Bayesian optimization
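A hedged sketch contrasting the first two options in scikit-learn (the candidate lists are illustrative): random search caps the number of fits at a fixed budget regardless of how large the grid is.

```python
# Sketch: random search samples a fixed budget of candidates
# from the same space a full grid would enumerate (3*3*3 = 27 fits).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)

param_dist = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_samples_leaf": [5, 20, 50],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_distributions=param_dist,
    n_iter=8,           # 8 candidates instead of all 27
    cv=3, random_state=0,
).fit(X, y)
```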
Challenges at Scale
● Why is learning with more data harder?
○ Paradox: more data would let us use more complex models, but computational constraints prevent it*
○ => we need more efficient ways of creating complex models!
● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**

* Pedro Domingos, A few useful things to know about machine learning, 2012.
** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)
A case study -- binary classification on 1TB of data
● Criteo click-through data
● Downsampled ad impression data over 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Day 0 - day 22 data for training, day 23 data for testing
Big Data?
Data size:
● ~46GB/day
● ~180,000,000 rows/day

However, it is very imbalanced (even after downsampling non-events):
● ~3.5% event rate
Further downsampling of non-events to a balanced dataset reduces the data size to ~70GB:
● Will fit into a single node under "optimal" conditions
● Loss of model accuracy is negligible in most situations
Assuming a 0.1% raw event (click-through) rate:
● Raw data: … @ 0.1% event rate
● Downsampled data: … @ 3.5%
● Balanced data: 70GB @ 50%
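A minimal sketch of this negative-downsampling step (labels are synthetic; the probability correction is the standard odds adjustment for sampled negatives, not a formula quoted from the talk):

```python
# Sketch: downsample non-events to a balanced set, and map probabilities
# predicted on the balanced data back to the original distribution.
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.035).astype(int)   # ~3.5% event rate

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
keep = rng.choice(neg, size=len(pos), replace=False)  # 1:1 negatives
idx = np.concatenate([pos, keep])
r = len(keep) / len(neg)                        # negative sampling rate

def uncalibrate(p_balanced, r):
    """Map a probability predicted on the balanced sample back to the
    original distribution: the odds shrink by the sampling rate r."""
    odds = p_balanced / (1.0 - p_balanced) * r
    return odds / (1.0 + odds)

balanced_rate = y[idx].mean()   # exactly 0.5 by construction
```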
Where to start?
● 70GB (~260,000,000 data points) is still a lot of data
● Let's take a tiny slice of that to experiment:
○ Take 0.25%, then 0.5%, then 1%, and do grid search on each
[Chart: model performance vs. time (seconds) for RF, ASVM, Regularized Regression, GBM (with Count), and GBM (without Count)]
GBM is the way to go; let's go up to 10% of the data
[Chart: performance vs. # of trees, for varying sample size / depth of tree / time to finish]
A “Fairer” Way of Comparing Models
A better model when time is the constraint
Can We Extrapolate?
Where We Can Do Better than Generic Bayesian Optimization
Tree Depth vs Data Size
● A natural heuristic -- increment tree depth by 1 every time data size doubles
[Chart: optimal tree depth growing with sample size, at 1%, 2%, 4%, and 10% of the data]
Optimal Depth = a + b * log(DataSize)
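This heuristic can be fit directly from the small-sample runs; the depth/size pairs below are made-up illustrations, not the talk's measurements:

```python
# Sketch: fit optimal_depth = a + b * log(data_size) on small samples,
# then extrapolate to the full dataset (numbers are hypothetical).
import numpy as np

sizes = np.array([0.01, 0.02, 0.04, 0.10])   # fraction of full data
depths = np.array([6, 7, 8, 9])              # best depth found per sample

b, a = np.polyfit(np.log(sizes), depths, 1)  # depth ~= a + b*log(size)
full_depth = a + b * np.log(1.0)             # extrapolated depth at 100%
```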
What about VW?
● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly
● Supports "every k" progressive validation
● The only "tuning" REQUIRED is the specification of interactions
○ Thanks to progressive validation, bad interactions can be detected immediately, so no time is wasted
Data pipeline for VW
[Diagram: the training data is randomly split into chunks T1 … Tm, each chunk is randomly shuffled (T1s … Tms), and the shuffled chunks are concatenated and interleaved; the test set is kept separate]
It takes longer to prep the data than to run the model!
VW Results
[Chart: VW results on 1%, 10%, and 100% of the data, without vs. with Count + Count*Numeric interaction features]
Putting It All Together
[Chart: overall results over run time, with markers at 1 Hour and 1 Day]
Do We Really "Tune/Select Models @ Scale"?
● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance/hyperparameters extrapolate as expected
● If you start the model tuning/selection process with GBs (even 100s of MBs) of data, you are doing it wrong!
Some Interesting Observations
● At least for some datasets, it is very hard for a "pure linear" model to outperform (accuracy-wise) a non-linear model, even with much more data
● There is meaningful structure in the hyperparameter space
● When we have limited time (relative to data size), running "deeper" models on smaller data samples may actually yield better results
● To fully exploit the data, model estimation time is usually at least proportional to n*log(n), and we need models whose # of parameters can scale with the # of data points
○ GBM can have as many parameters as we want
○ So can factorization machines
● For any data and any model, we will run into "diminishing returns" as the data gets bigger and bigger
DataRobot Essentials
April 7-8: London
April 28-29: San Francisco
May 17-18: Atlanta
June 23-24: Boston
datarobot.com/training
© DataRobot, Inc. All rights reserved.
Thanks / Questions?