Using Bayesian Optimization to Tune Machine Learning Models


Page 1: Using Bayesian Optimization to Tune Machine Learning Models

USING BAYESIAN OPTIMIZATION TO TUNE MACHINE LEARNING MODELS

Scott Clark, Co-founder and CEO of SigOpt

[email protected] @DrScottClark

Page 2: Using Bayesian Optimization to Tune Machine Learning Models

TRIAL AND ERROR WASTES EXPERT TIME

Machine Learning is extremely powerful

Tuning Machine Learning systems is extremely non-intuitive

Page 3: Using Bayesian Optimization to Tune Machine Learning Models

UNRESOLVED PROBLEM IN ML

https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3

What is the most important unresolved problem in machine learning?

“...we still don't really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”

Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix)

Page 4: Using Bayesian Optimization to Tune Machine Learning Models

LOTS OF TUNABLE PARAMETERS

Page 5: Using Bayesian Optimization to Tune Machine Learning Models

COMMON APPROACH

Random Search for Hyper-Parameter Optimization, James Bergstra and Yoshua Bengio, 2012

1. Random search or grid search
2. Expert-defined grid search near “good” points
3. Refine domain and repeat steps - “grad student descent”

Page 6: Using Bayesian Optimization to Tune Machine Learning Models

COMMON APPROACH

● Expert intensive
● Computationally intensive
● Finds potentially local optima
● Does not fully exploit useful information

Random Search for Hyper-Parameter Optimization, James Bergstra and Yoshua Bengio, 2012

1. Random search or grid search
2. Expert-defined grid search near “good” points
3. Refine domain and repeat steps - “grad student descent”
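For concreteness, here is a minimal scikit-learn sketch of steps 1 and 2; the model, parameter ranges, and synthetic data are illustrative assumptions, not the talk's actual experiments:

```python
# A minimal sketch of the "common approach"; model, ranges, and data are
# illustrative assumptions.
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# 1a. Grid search: tries every combination, so cost multiplies per parameter.
grid = GridSearchCV(model, param_grid={
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 4, 6],
}, cv=3).fit(X, y)

# 1b. Random search: samples a fixed budget of configurations instead.
rand = RandomizedSearchCV(model, param_distributions={
    "learning_rate": uniform(0.01, 0.3),   # range [0.01, 0.31]
    "max_depth": randint(2, 8),
}, n_iter=9, cv=3, random_state=0).fit(X, y)

# 2. An expert would now define a finer grid near the best points and repeat.
print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```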

Page 7: Using Bayesian Optimization to Tune Machine Learning Models

… the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.

Prof. Warren Powell - Princeton

What is the most efficient way to collect information?
Prof. Peter Frazier - Cornell

How do we make the most money, as fast as possible?
Me - @DrScottClark

OPTIMAL LEARNING

Page 8: Using Bayesian Optimization to Tune Machine Learning Models

● Optimize some Overall Evaluation Criterion (OEC)
  ○ Loss, Accuracy, Likelihood, Revenue

● Given tunable parameters
  ○ Hyperparameters, feature parameters

● In an efficient way
  ○ Sample the function as few times as possible
  ○ Training on big data is expensive

BAYESIAN GLOBAL OPTIMIZATION

Details at https://sigopt.com/research
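Concretely, this setup amounts to maximizing an expensive black-box function of the tunable parameters. A minimal sketch, with a scikit-learn model and dataset standing in purely for illustration:

```python
# The OEC as a black-box function of tunable parameters: every call trains a
# model, so calls are expensive and should be as few as possible. The SVC and
# digits dataset are illustrative stand-ins.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(params):
    """OEC (here: cross-validated accuracy) for one parameter setting."""
    model = SVC(C=params["C"], gamma=params["gamma"])
    return cross_val_score(model, X, y, cv=3).mean()

# One expensive sample of the function:
print(objective({"C": 1.0, "gamma": 0.001}))
```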

Page 9: Using Bayesian Optimization to Tune Machine Learning Models
Page 10: Using Bayesian Optimization to Tune Machine Learning Models

[Animated figure: side-by-side comparison of Grid Search and Random Search sampling patterns]

Page 11: Using Bayesian Optimization to Tune Machine Learning Models

GRID SEARCH SCALES EXPONENTIALLY

[Figure: a 4D grid of sample points; a grid with k values per dimension requires k^d evaluations, so the cost grows exponentially with the number of parameters d]

Page 12: Using Bayesian Optimization to Tune Machine Learning Models

BAYESIAN OPT SCALES LINEARLY

[Figure: sampled points in a 6D parameter space; the number of evaluations Bayesian optimization needs grows roughly linearly with the number of parameters]

Page 13: Using Bayesian Optimization to Tune Machine Learning Models

HOW DOES IT FIT IN THE STACK?

[Diagram: Big Data feeds Machine Learning Models with tunable parameters]

Page 14: Using Bayesian Optimization to Tune Machine Learning Models

HOW DOES IT FIT IN THE STACK?

[Diagram: Big Data → Machine Learning Models (with tunable parameters) → Objective Metric; the optimizer optimally suggests new parameters back into the models]

Page 15: Using Bayesian Optimization to Tune Machine Learning Models

HOW DOES IT FIT IN THE STACK?

[Diagram: the full loop: Big Data → Machine Learning Models (with tunable parameters) → Objective Metric → optimizer optimally suggests new parameters → Better Models]

Page 16: Using Bayesian Optimization to Tune Machine Learning Models

QUICK EXAMPLES

Page 17: Using Bayesian Optimization to Tune Machine Learning Models

Ex: LOAN CLASSIFICATION (xgboost)

[Diagram: Loan Applications (Income, Credit Score, Loan Amount) → Default Prediction model with tunable ML parameters → Prediction Accuracy → optimizer optimally suggests new parameters → Better Accuracy]
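A sketch of the model side of this loop; the synthetic data and the xgboost parameters below are illustrative assumptions, not SigOpt's actual loan model:

```python
# Loan-default setup sketch: features stand in for income, credit score, and
# loan amount; evaluate() is the objective metric the optimizer sees.
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(5000, 3)  # columns: income, credit score, loan amount (synthetic)
y = (X[:, 1] < 0.3 * X[:, 2] + 0.1 * rng.rand(5000)).astype(int)  # default labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def evaluate(params):
    """Train with the suggested parameters and report prediction accuracy."""
    model = xgb.XGBClassifier(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        n_estimators=int(params["n_estimators"]),
    )
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

print(evaluate({"max_depth": 4, "learning_rate": 0.1, "n_estimators": 100}))
```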

Page 18: Using Bayesian Optimization to Tune Machine Learning Models

COMPARATIVE PERFORMANCE

● 8.2% Better Accuracy than baseline

● 100x faster than standard tuning methods

[Chart: accuracy (AUC .675 to .698) vs. number of iterations (1,000 to 100,000) and cost, comparing SigOpt with Grid Search and Random Search]

Page 19: Using Bayesian Optimization to Tune Machine Learning Models

EXAMPLE: ALGORITHMIC TRADING

[Diagram: Market Data (Closing Prices, Day of Week, Market Volatility) → Trading Strategy with tunable weights and thresholds → Expected Revenue → optimizer optimally suggests new parameters → Higher Returns]

Page 20: Using Bayesian Optimization to Tune Machine Learning Models

COMPARATIVE PERFORMANCE

● 200% higher model returns than the expert

● 10x faster than standard methods

[Chart: model returns compared against the Standard Method and Expert baselines]

Page 21: Using Bayesian Optimization to Tune Machine Learning Models

HOW BAYESIAN OPTIMIZATION WORKS

Page 22: Using Bayesian Optimization to Tune Machine Learning Models

HOW DOES IT WORK?

1. Build a Gaussian Process (GP) from the points sampled so far

2. Optimize the fit of the GP (covariance hyperparameters)

3. Find the point(s) of highest Expected Improvement within the parameter domain

4. Return the optimal next point(s) to sample
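A compact sketch of these four steps for a 1-D parameter domain, using scikit-learn's GaussianProcessRegressor for illustration; the Matérn kernel and the dense candidate grid are assumptions, not SigOpt's internals:

```python
# Steps 1-4 above, sketched for a 1-D domain with scikit-learn's GP.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def suggest_next(X_sampled, y_sampled, domain):
    # 1-2. Build the GP and fit its covariance hyperparameters
    #      (fit() maximizes the marginal likelihood internally).
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_sampled, y_sampled)

    # 3. Expected Improvement over the best observed value, evaluated on a
    #    dense candidate grid:
    #    EI(x) = (mu - f*) Phi(z) + sigma phi(z),  z = (mu - f*) / sigma.
    candidates = np.linspace(domain[0], domain[1], 1000).reshape(-1, 1)
    mu, sigma = gp.predict(candidates, return_std=True)
    f_best = y_sampled.max()
    z = (mu - f_best) / np.maximum(sigma, 1e-12)
    ei = (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

    # 4. Return the candidate with the highest Expected Improvement.
    return candidates[np.argmax(ei)]

X = np.array([[0.1], [0.5], [0.9]])
y = np.array([0.2, 0.7, 0.4])           # observed objective values
print(suggest_next(X, y, domain=(0.0, 1.0)))
```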

Page 23: Using Bayesian Optimization to Tune Machine Learning Models

HOW DOES IT WORK?

1. User reports data

2. SigOpt builds statistical model (Gaussian Process)

3. SigOpt finds the points of highest Expected Improvement

4. SigOpt suggests best parameters to test next

5. User tests those parameters and reports results to SigOpt

6. Repeat
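In code, this loop follows the suggest/observe pattern of the sigopt Python client, as in SigOpt's public examples; evaluate_model is a hypothetical stand-in for the user's training-and-scoring code:

```python
# Sketch of the suggest/observe loop with the sigopt Python client of this
# era; evaluate_model is a hypothetical stand-in for user code.
from sigopt import Connection

def evaluate_model(assignments):
    # Hypothetical: train the model with these parameters, return the metric.
    raise NotImplementedError

conn = Connection(client_token="SIGOPT_API_TOKEN")  # placeholder token

experiment = conn.experiments().create(
    name="Example experiment",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-4, max=1.0)),
        dict(name="max_depth", type="int", bounds=dict(min=2, max=10)),
    ],
)

for _ in range(30):
    # SigOpt suggests the best parameters to test next...
    suggestion = conn.experiments(experiment.id).suggestions().create()
    # ...the user tests those parameters and reports the result back.
    value = evaluate_model(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
```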

Page 29: Using Bayesian Optimization to Tune Machine Learning Models

EXTENDED EXAMPLE: EFFICIENTLY BUILDING CONVNETS

Page 30: Using Bayesian Optimization to Tune Machine Learning Models

● Classify house numbers with more training data and a more sophisticated model

PROBLEM

Page 31: Using Bayesian Optimization to Tune Machine Learning Models

● TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset?

CONVNET STRUCTURE

Page 32: Using Bayesian Optimization to Tune Machine Learning Models

● Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best

● They still require careful selection of the learning rate (α), momentum (β), and decay (γ) terms

STOCHASTIC GRADIENT DESCENT
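A minimal sketch of those three terms in the TensorFlow 1.x-era RMSProp optimizer; the toy quadratic loss and the particular values are assumptions:

```python
# RMSProp's tunable terms in TensorFlow 1.x; the loss is a stand-in.
import tensorflow as tf

# Hyperparameters the slide calls out: learning rate, momentum, decay.
alpha, beta, gamma = 0.001, 0.9, 0.9

w = tf.Variable([1.0, -1.0])
loss = tf.reduce_sum(tf.square(w))      # toy objective

optimizer = tf.train.RMSPropOptimizer(
    learning_rate=alpha, momentum=beta, decay=gamma)
train_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    print(sess.run(loss))
```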

Page 33: Using Bayesian Optimization to Tune Machine Learning Models

● Comparison of several RMSProp SGD parametrizations

● Not obvious which configurations will work best on a given dataset without experimentation

STOCHASTIC GRADIENT DESCENT

Page 34: Using Bayesian Optimization to Tune Machine Learning Models

RESULTS

Page 35: Using Bayesian Optimization to Tune Machine Learning Models

● Average hold-out accuracy over 5 optimization runs of 80 objective evaluations each

● Optimized a single 80/20 CV fold on the training set; accuracy (ACC) reported on the held-out test set

PERFORMANCE

Method                           Hold-Out ACC
SigOpt (TensorFlow CNN)          0.8130 (+315.2% over the untuned CNN)
Random Search (TensorFlow CNN)   0.5690
No Tuning (sklearn RF)           0.5278
No Tuning (TensorFlow CNN)       0.1958

Page 36: Using Bayesian Optimization to Tune Machine Learning Models

COST ANALYSIS

Model Performance (CV Acc. threshold) | Random Search Cost | SigOpt Cost | SigOpt Cost Savings | Potential Savings in Production (50 GPUs)
87% | $275 | $42 | 84% | $12,530
85% | $195 | $23 | 88% | $8,750
80% | $46  | $21 | 55% | $1,340
70% | $29  | $21 | 27% | $400

Page 37: Using Bayesian Optimization to Tune Machine Learning Models

EXAMPLE: TUNING DNN CLASSIFIERS

CIFAR10 Dataset
● Photos of objects

● 10 classes

● Metric: Accuracy
  ○ [0.1, 1.0]

Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009.

Page 38: Using Bayesian Optimization to Tune Machine Learning Models

● All convolutional neural network
● Multiple convolutional and dropout layers
● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)

USE CASE: ALL CONVOLUTIONAL

http://arxiv.org/pdf/1412.6806.pdf
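A minimal tf.keras sketch of the all-convolutional pattern from the paper (strided convolutions in place of pooling, global average pooling in place of dense layers); the filter counts and dropout rates are illustrative, not the paper's exact All-CNN configuration:

```python
# All-convolutional pattern sketch: no pooling or dense layers.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(96, 3, padding="same", activation="relu",
                  input_shape=(32, 32, 3)),
    layers.Conv2D(96, 3, strides=2, padding="same", activation="relu"),  # "pooling" by stride
    layers.Dropout(0.5),
    layers.Conv2D(192, 3, padding="same", activation="relu"),
    layers.Conv2D(192, 3, strides=2, padding="same", activation="relu"),
    layers.Dropout(0.5),
    layers.Conv2D(10, 1, activation="relu"),   # 1x1 conv down to class count
    layers.GlobalAveragePooling2D(),           # replaces dense layers
    layers.Activation("softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```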

Page 39: Using Bayesian Optimization to Tune Machine Learning Models

MANY TUNABLE PARAMETERS...

● epochs: “number of epochs to run fit” - int [1, ∞]
● learning rate: influence on the current value of the weights at each step - double (0, 1]
● momentum coefficient: “the coefficient of momentum” - double (0, 1]
● weight decay: parameter affecting how quickly the weights decay - double (0, 1]
● depth: parameter affecting the number of layers in the net - int [1, 20(?)]
● gaussian scale: standard deviation of the initialization normal dist. - double (0, ∞]
● momentum step change: multiplicative amount to decrease momentum - double (0, 1]
● momentum step schedule start: epoch to start decreasing momentum - int [1, ∞]
● momentum schedule width: epoch stride for decreasing momentum - int [1, ∞]

...optimal values non-intuitive
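As a sketch, several of these could be declared as a SigOpt experiment along the following lines (API shape as in the sigopt Python client of this era); finite stand-in bounds replace the open-ended ranges above:

```python
# Declaring a subset of the tunable parameters above; the bounds chosen here
# for the open-ended ranges are stand-in assumptions.
from sigopt import Connection

conn = Connection(client_token="SIGOPT_API_TOKEN")  # placeholder token
experiment = conn.experiments().create(
    name="All Convolutional Net on CIFAR10",
    parameters=[
        dict(name="epochs", type="int", bounds=dict(min=1, max=200)),
        dict(name="learning_rate", type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="momentum_coef", type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="weight_decay", type="double", bounds=dict(min=1e-6, max=1.0)),
        dict(name="depth", type="int", bounds=dict(min=1, max=20)),
        dict(name="gaussian_scale", type="double", bounds=dict(min=1e-6, max=10.0)),
    ],
)
```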

Page 40: Using Bayesian Optimization to Tune Machine Learning Models

COMPARATIVE PERFORMANCE

● Expert baseline: 0.8995
  ○ (using neon)

● SigOpt best: 0.9011
  ○ 1.6% reduction in error rate
  ○ No expert time wasted in tuning

Page 41: Using Bayesian Optimization to Tune Machine Learning Models

USE CASE: DEEP RESIDUAL

http://arxiv.org/pdf/1512.03385v1.pdf

● Explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions

● Variable depth

● Hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
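A minimal tf.keras sketch of the residual reformulation, where the block learns F(x) and outputs F(x) + x; the filter count and input shape are illustrative assumptions:

```python
# Residual block sketch: learn the residual F(x), output F(x) + x.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    return layers.Activation("relu")(layers.Add()([x, f]))  # y = F(x) + x

inputs = tf.keras.Input(shape=(32, 32, 64))
model = tf.keras.Model(inputs, residual_block(inputs))
model.summary()
```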

Page 42: Using Bayesian Optimization to Tune Machine Learning Models

COMPARATIVE PERFORMANCE

● Expert baseline: 0.9339
  ○ (from the paper)

● SigOpt best: 0.9436
  ○ 15% relative error rate reduction
  ○ No expert time wasted in tuning

[Chart: accuracy compared against the Standard Method]

Page 44: Using Bayesian Optimization to Tune Machine Learning Models

TRY OUT SIGOPT FOR FREE

https://sigopt.com/getstarted

● Quick example and intro to SigOpt
● No signup required
● Visual and code examples

Page 45: Using Bayesian Optimization to Tune Machine Learning Models

MORE EXAMPLES

https://github.com/sigopt/sigopt-examples
Examples of using SigOpt in a variety of languages and contexts.

Tuning Machine Learning Models (with code)
A comparison of different hyperparameter optimization methods.

Using Model Tuning to Beat Vegas (with code)
Using SigOpt to tune a model for predicting basketball scores.

Learn more about the technology behind SigOpt at https://sigopt.com/research

Page 46: Using Bayesian Optimization to Tune Machine Learning Models

GPs: FUNCTIONAL VIEW

Page 47: Using Bayesian Optimization to Tune Machine Learning Models

GPs: FITTING THE GP

[Figure: three GP fits, left to right: overfit, good fit, underfit]
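A small scikit-learn sketch of those three regimes: pinning the RBF length scale too small overfits, too large underfits, and letting fit() maximize the marginal likelihood lands in between (the toy data is an assumption):

```python
# Over/good/under-fit GP regimes via the kernel length scale.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(12, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(12)

for name, kernel, optimize in [
    ("overfit",  RBF(length_scale=0.05), False),   # too wiggly
    ("underfit", RBF(length_scale=50.0), False),   # nearly flat
    ("good fit", RBF(length_scale=1.0),  True),    # re-fit by marginal likelihood
]:
    gp = GaussianProcessRegressor(
        kernel=kernel,
        optimizer="fmin_l_bfgs_b" if optimize else None,
        alpha=1e-2,
    )
    gp.fit(X, y)
    print(name, gp.kernel_,
          "log-marginal-likelihood:", gp.log_marginal_likelihood_value_)
```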

Page 48: Using Bayesian Optimization to Tune Machine Learning Models

USE CASE: CLASSIFICATION MODELS

Problem: Machine Learning models have many non-intuitive tunable hyperparameters.

Before: Standard methods use high resources for low performance.

After: SigOpt finds better parameters with 10x fewer evaluations than standard methods.

Page 49: Using Bayesian Optimization to Tune Machine Learning Models

USE CASE: SIMULATIONS

BETTER RESULTS, +450% FASTER

Problem: Expensive simulations require high resources for every run.

Before: A brute-force tuning approach is prohibitively expensive.

After: SigOpt finds better results with fewer required simulations.