TRANSCRIPT
Model Selection and Tuning at Scale
March 2016
About us
Owen Zhang
Chief Product Officer @ DataRobot
Former #1 ranked Data Scientist on Kaggle
Former VP, Science @ AIG
Peter Prettenhofer
Software Engineer @ DataRobot
Scikit-learn core developer
Agenda
● Introduction
● Case study: Criteo 1TB
● Conclusion / Discussion
Model Selection
● Estimating the performance of different models in order to choose the best one.
● K-Fold Cross-validation
● The devil is in the detail:
○ Partitioning
○ Leakage
○ Sample size
○ Stacked models require nested layers
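A minimal sketch of this workflow in scikit-learn (the dataset and model here are illustrative, not from the talk): carve off the holdout partition first so it never influences selection, then run K-fold cross-validation on the remainder.

```python
# Sketch: K-fold cross-validation with a separate holdout partition.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out data first so it never leaks into model selection.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    GradientBoostingClassifier(random_state=0), X_train, y_train, cv=cv)

# The holdout score is computed only once, for the chosen model.
final = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
holdout_acc = final.score(X_hold, y_hold)
```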
Train Validation Holdout
[Diagram: data divided into partitions 1-5, used as train, validation, and holdout]
Model Complexity & Overfitting
More data to the rescue?
Underfitting or Overfitting?
http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
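The linked example can be reproduced numerically with scikit-learn's `learning_curve` (the model and sample sizes below are illustrative assumptions): a persistent gap between training and validation scores suggests overfitting, while a low training score suggests underfitting.

```python
# Sketch: diagnosing under-/overfitting from a learning curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=[0.2, 0.5, 1.0], cv=5, shuffle=True, random_state=0)

# Large gap at full size -> overfitting (more data may help);
# low training score -> underfitting (more data will not help).
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
```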
Model Tuning
● Optimizing the performance of a model
● Example: Gradient Boosted Trees
○ Nr of trees
○ Learning rate
○ Tree depth / nr of leaf nodes
○ Min leaf size
○ Example subsampling rate
○ Feature subsampling rate
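In scikit-learn these six knobs map roughly to the parameters below (the values shown are illustrative, not the talk's recommendations):

```python
# Sketch: the GBT tuning knobs above, as scikit-learn parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,      # nr of trees
    learning_rate=0.1,     # learning rate
    max_depth=3,           # tree depth (max_leaf_nodes limits leaves instead)
    min_samples_leaf=20,   # min leaf size
    subsample=0.8,         # example subsampling rate
    max_features=0.5,      # feature subsampling rate
    random_state=0,
).fit(X, y)

train_acc = gbm.score(X, y)
```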
Search Space
Hyperparameter          GBRT (naive)  GBRT  RandomForest
Nr of trees             5             1     1
Learning rate           5             5     -
Tree depth              5             5     1
Min leaf size           3             3     3
Example subsample rate  3             1     1
Feature subsample rate  2             2     5
Total                   2250          150   15

(Entries are the number of candidate values tried per hyperparameter; Total is their product, i.e. the number of grid points.)
Hyperparameter Optimization
● Grid Search
● Random Search
● Bayesian optimization
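A hedged sketch contrasting the first two options in scikit-learn (the candidate lists are illustrative): random search caps the number of fits at a fixed budget regardless of how large the grid is.

```python
# Sketch: random search samples a fixed budget of candidates
# from the same space a full grid would enumerate (3*3*3 = 27 fits).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)

param_dist = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_samples_leaf": [5, 20, 50],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_distributions=param_dist,
    n_iter=8,           # 8 candidates instead of all 27
    cv=3, random_state=0,
).fit(X, y)
```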
Challenges at Scale
● Why is learning with more data harder?
○ Paradox: more data would let us use more complex models, but computational constraints prevent it*
○ => we need more efficient ways of creating complex models!
● Need to account for the combined cost: model fitting + model selection / tuning
○ Smart hyperparameter tuning tries to decrease the # of model fits
○ … we can accomplish this with fewer hyperparameters too**

* Pedro Domingos, A few useful things to know about machine learning, 2012.
** Practitioners often favor algorithms with few hyperparameters, such as RandomForest or AveragedPerceptron (see http://nlpers.blogspot.co.at/2014/10/hyperparameter-search-bayesian.html)
A case study -- binary classification on 1TB of data
● Criteo click-through data
● Downsampled ad impression data over 24 days
● Fully anonymized dataset:
○ 1 target
○ 13 integer features
○ 26 hashed categorical features
● Experiment setup:
○ Day 0 - day 22 data for training, day 23 data for testing
Big Data?
Data size:
● ~46GB/day
● ~180,000,000 rows/day

However, it is very imbalanced (even after downsampling non-events):
● ~3.5% event rate
Further downsampling of non-events to a balanced dataset reduces the data size to ~70GB:
● Will fit into a single node under "optimal" conditions
● Loss of model accuracy is negligible in most situations
Assuming a 0.1% raw event (click-through) rate:
● Raw data: … @ 0.1% event rate
● Downsampled data: … @ 3.5%
● Balanced data: 70GB @ 50%
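A minimal sketch of this negative-downsampling step (labels are synthetic; the probability correction is the standard odds adjustment for sampled negatives, not a formula quoted from the talk):

```python
# Sketch: downsample non-events to a balanced set, and map probabilities
# predicted on the balanced data back to the original distribution.
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.035).astype(int)   # ~3.5% event rate

pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
keep = rng.choice(neg, size=len(pos), replace=False)  # 1:1 negatives
idx = np.concatenate([pos, keep])
r = len(keep) / len(neg)                        # negative sampling rate

def uncalibrate(p_balanced, r):
    """Map a probability predicted on the balanced sample back to the
    original distribution: the odds shrink by the sampling rate r."""
    odds = p_balanced / (1.0 - p_balanced) * r
    return odds / (1.0 + odds)

balanced_rate = y[idx].mean()   # exactly 0.5 by construction
```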
Where to start?
● 70GB (~260,000,000 data points) is still a lot of data
● Let's take a tiny slice of that to experiment:
○ Take 0.25%, then 0.5%, then 1%, and do grid search on each
[Chart: model performance vs. time (seconds) for RF, ASVM, Regularized Regression, GBM (with Count), and GBM (without Count)]
GBM is the way to go; let's go up to 10% of the data
[Chart: performance vs. # of trees, for varying sample size / depth of tree / time to finish]
A “Fairer” Way of Comparing Models
A better model when time is the constraint
Can We Extrapolate?
Where We Can Do Better than Generic Bayesian Optimization
Tree Depth vs Data Size
● A natural heuristic -- increment tree depth by 1 every time data size doubles
[Chart: optimal tree depth growing with sample size, at 1%, 2%, 4%, and 10% of the data]
Optimal Depth = a + b * log(DataSize)
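This heuristic can be fit directly from the small-sample runs; the depth/size pairs below are made-up illustrations, not the talk's measurements:

```python
# Sketch: fit optimal_depth = a + b * log(data_size) on small samples,
# then extrapolate to the full dataset (numbers are hypothetical).
import numpy as np

sizes = np.array([0.01, 0.02, 0.04, 0.10])   # fraction of full data
depths = np.array([6, 7, 8, 9])              # best depth found per sample

b, a = np.polyfit(np.log(sizes), depths, 1)  # depth ~= a + b*log(size)
full_depth = a + b * np.log(1.0)             # extrapolated depth at 100%
```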
What about VW?
● Highly efficient online learning algorithm
● Supports adaptive learning rates
● Inherently linear; the user needs to specify non-linear features or interactions explicitly
● 2-way and 3-way interactions can be generated on the fly
● Supports "every k" progressive validation
● The only "tuning" REQUIRED is the specification of interactions
○ Thanks to progressive validation, bad interactions can be detected immediately, so no time is wasted
Data pipeline for VW
[Diagram: the training data is randomly split into chunks T1 … Tm, each chunk is randomly shuffled (T1s … Tms), and the shuffled chunks are concatenated and interleaved; the test set is kept separate]
It takes longer to prep the data than to run the model!
VW Results
[Chart: VW results on 1%, 10%, and 100% of the data, without vs. with Count + Count*Numeric interaction features]
Putting It All Together
[Chart: overall results over run time, with markers at 1 Hour and 1 Day]
Do We Really "Tune/Select Models @ Scale"?
● What we claim we do:
○ Model tuning and selection on big data
● What we actually do:
○ Model tuning and selection on small data
○ Re-run the model and expect/hope that performance/hyperparameters extrapolate as expected
● If you start the model tuning/selection process with GBs (even 100s of MBs) of data, you are doing it wrong!
Some Interesting Observations
● At least for some datasets, it is very hard for a "pure linear" model to outperform (accuracy-wise) a non-linear model, even with much more data
● There is meaningful structure in the hyperparameter space
● When we have limited time (relative to data size), running "deeper" models on smaller data samples may actually yield better results
● To fully exploit the data, model estimation time is usually at least proportional to n*log(n), and we need models whose # of parameters can scale with the # of data points
○ GBM can have as many parameters as we want
○ So can factorization machines
● For any data and any model, we will run into "diminishing returns" as the data gets bigger and bigger
DataRobot Essentials
April 7-8: London
April 28-29: San Francisco
May 17-18: Atlanta
June 23-24: Boston
datarobot.com/training
© DataRobot, Inc. All rights reserved.
Thanks / Questions?