
Page 1:

Scalable Bayesian Optimization using Deep Neural Networks

Jasper Snoek

with

Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary, Prabhat, Ryan P. Adams

Page 2:

Motivation

Bayesian optimization:

• Global optimization of expensive, multi-modal and noisy functions

• E.g. the hyperparameters of machine learning algorithms

• Robots, chemistry, cooking recipes, etc.

Page 3:

Bayesian Optimization for Hyperparameters

Instead of relying on intuition or brute-force strategies:

Perform a regression from the high-level model parameters to the error metric (e.g. classification error)

• Build a statistical model of the function, with a suitable prior – e.g. a Gaussian process

• Use stats to tell us:

• Where is the expected minimum of the function?

• Expected improvement of trying other parameters (see the sketch below)
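When the model's predictive distribution at a candidate x is Gaussian, expected improvement has a closed form. A minimal sketch for a minimization problem (function and variable names are illustrative, not from the talk):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best):
    """EI(x) = E[max(best - f(x), 0)] when f(x) ~ N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: a candidate whose predicted mean is slightly below the current best
print(expected_improvement(np.array([0.20]), np.array([0.10]), 0.25))
```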

Page 4:

[Figure: true function with three observations]

Page 5:

[Figure: Bayesian nonlinear regression predictive distributions, with 80%, 90%, and 95% predictive intervals]

Page 6:

[Figure: predictive distributions with 80/90/95% intervals]

How do the predictions compare to the current best?

Page 7:

How do the predictions compare to the current best?

[Figure: predictive distributions with 80/90/95% intervals, and the resulting Expected Improvement]

Page 8:

GPs as Distributions over Functions

[Figure: samples from the GP prior and posterior]

But the computational cost grows cubically in N!
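Where the cubic cost comes from: the GP posterior needs a solve against the N×N data covariance matrix, typically via a Cholesky factorization costing O(N³). A minimal sketch (squared-exponential kernel chosen for illustration):

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between inputs A (n, d) and B (m, d)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, noise=1e-2):
    """Predictive mean/variance of GP regression (Rasmussen & Williams, Alg. 2.1)."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)             # O(N^3): the bottleneck as N grows
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = sq_exp_kernel(X, X_star)
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(sq_exp_kernel(X_star, X_star)) - np.sum(v * v, axis=0)
    return mu, var
```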

Page 9:

Having a Statistical Framework Helps

• Reason about constraints: Gramacy et al., 2010; Gardner et al., 2014; Gelbart, Snoek & Adams, 2014; …

• Think about multi-task & transfer across related problems: Krause & Ong, 2011; Hutter et al., 2011; Bardenet et al., 2013; Swersky, Snoek & Adams, 2013; …

• Run experiments in parallel: Ginsbourger & Riche, 2010; Hutter et al., 2011; Snoek, Larochelle & Adams, 2012; Frazier et al., 2014; …

• Determine when to stop experiments early: Swersky, Snoek & Adams, 2014; Domhan et al., 2014

Page 10:

GP-Based Bayesian Optimization

• Gaussian processes scale poorly: O(N³), due to inverting the data covariance matrix

• This prevents us from:
  • Running hundreds/thousands of experiments in parallel
  • Sharing information across many optimizations
  • Modeling every epoch of learning (early stopping)
  • Having very complex constraint spaces
  • Tackling high-dimensional problems

• In order to address more interesting problems, we have to scale it up

Page 11:

Need a Different Model

• Random forests
  • Empirical estimate of uncertainty
  • Generally outperformed by neural nets

• Sparse GPs
  • Scale better, but not actually used in practice
  • Hard to get to work well; the uncertainty estimates are not great

• Bayesian neural nets
  • Very flexible, powerful models
  • Marginalizing all the parameters is prohibitively expensive

Page 12:

Deep Nets for Global Optimization

• A pragmatic Bayesian deep neural net

[Figure: network architecture, with Bayesian linear regression on the last hidden layer]

Page 13:

How does this work?

Expected Improvement depends on the predictive mean and variance of the model

Page 14:

How does this work?

[Figure: predictive distributions with 80/90/95% intervals, and the resulting Expected Improvement]

Expected Improvement depends on the predictive mean and variance of the model

Page 15:

How does this work?

$m = \beta K^{-1} \Phi^\top y \in \mathbb{R}^D$

$K = \beta \Phi^\top \Phi + I \alpha^2 \in \mathbb{R}^{D \times D}$

where $\Phi$ is the last hidden layer of the neural net evaluated at the training data (and likewise at test data for predictions), and $D \ll N$!
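In code, once the net is trained, everything Bayesian happens in D×D operations on the last-layer features, so the cost is essentially independent of N. A sketch, assuming training features Phi (N×D) and test features Phi_star (M×D) are given; alpha2 plays the role of the slide's α² and the default values are illustrative:

```python
import numpy as np

def blr_predict(Phi, y, Phi_star, alpha2=1.0, beta=100.0):
    """Bayesian linear regression on last-hidden-layer features.
    Phi: (N, D) training features; Phi_star: (M, D) test features;
    alpha2: prior precision on the weights; beta: noise precision."""
    D = Phi.shape[1]
    K = beta * Phi.T @ Phi + alpha2 * np.eye(D)   # (D, D), cheap since D << N
    K_inv = np.linalg.inv(K)
    m = beta * K_inv @ Phi.T @ y                  # posterior mean weights in R^D
    mu = Phi_star @ m                             # predictive mean
    var = 1.0 / beta + np.sum(Phi_star @ K_inv * Phi_star, axis=1)  # predictive variance
    return mu, var
```

These moments are exactly what Expected Improvement needs, so the earlier EI sketch applies unchanged.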

Page 16:

How does this work?

$m = \beta K^{-1} \Phi^\top y \in \mathbb{R}^D$

$K = \beta \Phi^\top \Phi + I \alpha^2 \in \mathbb{R}^{D \times D}$

[Figure: predictive distributions with 80/90/95% intervals, and the resulting Expected Improvement]

where $\Phi$ is the last hidden layer of the neural net evaluated at the training data (and likewise at test data for predictions), and $D \ll N$!

Page 17:

How does this work?

$\eta(x) = \lambda + (x - c)^\top \Lambda \,(x - c)$

We set a quadratic prior: a bowl centered in the middle of the search region.
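A sketch of that prior mean. Here Λ is simplified to a scalar curvature times the identity and the search box is assumed normalized to the unit cube; both simplifications are mine, for illustration:

```python
import numpy as np

def quadratic_prior_mean(x, lam=0.0, curvature=1.0):
    """eta(x) = lambda + (x - c)^T Lambda (x - c): a bowl centered at c,
    encoding the belief that good settings rarely sit on the boundary.
    Assumes inputs normalized to [0, 1]^d, so c is the box center."""
    x = np.atleast_2d(x)
    c = np.full(x.shape[1], 0.5)
    diff = x - c
    return lam + curvature * np.sum(diff * diff, axis=1)
```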

Page 18:

How does this work?

[Figure: predictive distributions with 80/90/95% intervals, and the resulting Expected Improvement]

$\eta(x) = \lambda + (x - c)^\top \Lambda \,(x - c)$

We set a quadratic prior: a bowl centered in the middle of the search region.

Page 19:

Constraints

Almost every real problem has complex constraints

• Often unknown a priori
  • E.g. the training of a model diverging and producing NaNs

• We developed a principled approach to dealing with constraints
  • Gelbart, Snoek & Adams. Bayesian Optimization with Unknown Constraints. UAI 2014.

• Need to scale that up as well

Page 20:

Constraints

Use a classification neural net and integrate out the last layer (Laplace approximation)
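A sketch of that idea: Bayesian logistic regression on the constraint net's last-layer features, with a Gaussian (Laplace) approximation at the MAP weights and the standard probit approximation for the predictive probability of feasibility (Bishop §4.5). The structure is textbook, not necessarily the authors' exact implementation:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_constraint_model(Phi, t, alpha=1.0):
    """Phi: (N, D) last-layer features; t: (N,) binary labels (1 = feasible)."""
    D = Phi.shape[1]

    def neg_log_post(w):                  # logistic NLL plus Gaussian prior
        a = Phi @ w
        return (np.sum(t * np.logaddexp(0, -a) + (1 - t) * np.logaddexp(0, a))
                + 0.5 * alpha * w @ w)

    w_map = minimize(neg_log_post, np.zeros(D)).x
    p = sigmoid(Phi @ w_map)
    H = (Phi * (p * (1 - p))[:, None]).T @ Phi + alpha * np.eye(D)  # Hessian at MAP
    return w_map, np.linalg.inv(H)        # Laplace posterior: N(w_map, H^-1)

def prob_feasible(Phi_star, w_map, S):
    """Predictive P(feasible) via the probit approximation."""
    mu = Phi_star @ w_map
    var = np.sum(Phi_star @ S * Phi_star, axis=1)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return sigmoid(kappa * mu)
```

Weighting EI by this feasibility probability gives a constrained acquisition, in the spirit of the UAI 2014 paper above.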

Page 21:

Parallelism

[Figure: posterior over the objective with three completed and two pending experiments, and the resulting Expected Improvement]

With 3 complete and 2 pending, what to do next?

Page 22:

Parallelism

[Figure: posterior with fantasized outcomes at the two pending points, and the resulting Expected Improvement]

With 3 complete and 2 pending, what to do next?

Use posterior predictive to "fantasize" outcomes.

Page 23:

Parallelism

[Figure: Expected Improvement computed under each fantasized outcome]

With 3 complete and 2 pending, what to do next?

Use posterior predictive to "fantasize" outcomes.

Compute the acquisition function (EI) for each predictive fantasy.

Page 24:

Parallelism

[Figure: fantasy-averaged Expected Improvement]

With 3 complete and 2 pending, what to do next?

Use posterior predictive to "fantasize" outcomes.

Compute the acquisition function (EI) for each predictive fantasy.

Monte Carlo estimate of overall acquisition function.
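A sketch of the fantasy trick. The `model` interface (fit returning the model, predict returning Gaussian mean and standard deviation) and `ei_fn` are illustrative assumptions, not an API from the talk:

```python
import numpy as np

def parallel_acquisition(model, X, y, X_pending, X_cand, ei_fn,
                         n_fantasies=10, seed=0):
    """Average EI over Monte Carlo 'fantasy' outcomes at the pending points."""
    rng = np.random.default_rng(seed)
    acq = np.zeros(len(X_cand))
    for _ in range(n_fantasies):
        mu_p, sd_p = model.fit(X, y).predict(X_pending)      # posterior at pending
        y_fant = mu_p + sd_p * rng.standard_normal(len(X_pending))  # sample outcomes
        X_aug = np.vstack([X, X_pending])                    # condition on fantasies
        y_aug = np.concatenate([y, y_fant])
        mu, sd = model.fit(X_aug, y_aug).predict(X_cand)
        acq += ei_fn(mu, sd, np.min(y_aug)) / n_fantasies    # MC average of EI
    return acq
```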

Page 25:

Parallelism

[Figure: sampled objective and constraint outcomes, and the resulting constrained Expected Improvement]

Sample outputs for both objective and constraint

Monte Carlo Constrained EI

Page 26:

What about all the hyperparameters of this model?

Integrate out hyperparameters of Bayesian layers

Page 27:

What about all the hyperparameters of this model?

Integrate out hyperparameters of Bayesian layers

Use GP Bayesian optimization for the neural net hyperparameters
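One concrete way to integrate out the regression hyperparameters: an evidence-weighted grid over (α², β) using the closed-form marginal likelihood of Bayesian linear regression (Bishop eq. 3.86). This is a sketch of the idea, not necessarily the scheme used in the talk:

```python
import numpy as np
from scipy.special import logsumexp

def evidence_and_prediction(Phi, y, Phi_star, alpha2, beta):
    """Log evidence and predictive moments of BLR for one (alpha2, beta)."""
    N, D = Phi.shape
    A = alpha2 * np.eye(D) + beta * Phi.T @ Phi
    A_inv = np.linalg.inv(A)
    m = beta * A_inv @ Phi.T @ y
    E = 0.5 * beta * np.sum((y - Phi @ m) ** 2) + 0.5 * alpha2 * m @ m
    log_ev = (0.5 * D * np.log(alpha2) + 0.5 * N * np.log(beta) - E
              - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
    mu = Phi_star @ m
    var = 1.0 / beta + np.sum(Phi_star @ A_inv * Phi_star, axis=1)
    return log_ev, mu, var

def marginalized_prediction(Phi, y, Phi_star, grid):
    """Evidence-weighted mixture of predictions over (alpha2, beta) pairs."""
    out = [evidence_and_prediction(Phi, y, Phi_star, a, b) for a, b in grid]
    logw = np.array([o[0] for o in out])
    w = np.exp(logw - logsumexp(logw))    # normalized posterior weights
    mu = sum(wi * o[1] for wi, o in zip(w, out))
    second = sum(wi * (o[2] + o[1] ** 2) for wi, o in zip(w, out))
    return mu, second - mu ** 2           # mixture mean and variance
```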

Page 28:

Putting it all together

Backprop down to the inputs to optimize for the most promising next experiment
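A sketch of that last step over a continuous search box. The talk backpropagates through the network for exact input gradients; for brevity this sketch uses scipy's default numerical gradients on the negative acquisition, with random restarts (helper names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def propose_next(acquisition, bounds, n_restarts=10, seed=0):
    """Maximize acquisition(x) over a (D, 2) array of box bounds."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    best_x, best_val = None, np.inf
    for _ in range(n_restarts):                       # multi-start local search
        x0 = lo + (hi - lo) * rng.random(len(lo))
        res = minimize(lambda x: -acquisition(x), x0,
                       method="L-BFGS-B", bounds=bounds)
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x
```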

Page 29:

How does it scale?

Page 30:

A collection of Bayesian optimization benchmarks (Eggensperger et al.)

How well does it optimize?

Page 31:

Convolutional Networks

• Notoriously hard to tune

• 14 hyperparameters with broad support
  • e.g. learning rate, momentum, input dropout, dropout, weight decay, weight initialization, parameters on input transformations, etc.

• Very generic architecture

• Evaluate 40 in parallel on Intel® Xeon Phi™ coprocessors

Page 32:

Convolutional Networks

Achieved “state-of-the-art” within a few sequential steps

Page 33:

Image Caption Generation

Tune the hyperparameters of this model

• MS COCO benchmark dataset
• Each experiment takes ~26 hours
• 11 hyperparameters (including categorical)
• Approx. half of the space is invalid
• 500-800 in parallel

Zaremba, Sutskever & Vinyals, 2015

Page 34:

Image Caption Generation

Tune the hyperparameters of this model

Zaremba, Sutskever & Vinyals, 2015

[Plot: validation BLEU-4 score vs. iteration (500-2500)]

Page 35:

Image Caption Generation

Tune the hyperparameters of this model

Zaremba, Sutskever & Vinyals, 2015

[Plot: validation BLEU-4 score vs. iteration (500-2500)]

“A person riding a wave in the ocean” “A bird sitting on top of a field”

Page 36:

Image Caption Generation

Tune the hyperparameters of this model

Zaremba, Sutskever & Vinyals, 2015

[Plot: validation BLEU-4 score vs. iteration (500-2500)]

“A person riding a wave in the ocean” “A bird sitting on top of a field”

“A horse riding a horse”

Page 37:

Other Interesting Decisions: Neural Net Basis Functions

[Figure: learned basis functions for tanh, ReLU, and tanh + ReLU activations]

Page 38:

Thanks

Oren Rippel (MIT, Harvard)

Kevin Swersky (Toronto)

Ryan P. Adams (Harvard)

Ryan Kiros (Toronto)

Nadathur Satish, Narayanan Sundaram, Mostofa Ali Patwary (Intel Parallel Labs)

Prabhat (Lawrence Berkeley National Laboratory)