Scalable Bayesian Optimization Using Deep Neural Networks (lcarin/Ikenna6.27.2016.pdf)


Page 1

Scalable Bayesian Optimization Using Deep Neural Networks

Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat and Ryan P. Adams

June 27, 2016

Discussion by Ikenna Odinaka, Duke University

Page 2

Outline

1 Background

2 Adaptive Basis Regression with Deep Neural Networks
   Model Details
   Incorporating Input Space Constraints
   Parallel Bayesian Optimization

3 Experiments

Page 3

Bayesian Optimization in a nutshell

Global optimization aims to solve the minimization problem

    x^* = \arg\min_{x \in \chi} f(x)    (1)

where χ is a compact subset of R^K.

When f(x) is noisy and expensive (and x is intrinsically low-dimensional), Bayesian optimization is a natural fit
Goal: find the global optimum in as few steps as possible, since f(x) is expensive
Principled modeling of uncertainty -> balancing of exploration and exploitation during the search
Bayesian optimization uses a surrogate probabilistic model
Acquisition functions are used to search for the next x
Acquisition functions balance exploration and exploitation. Their optima are located
   where the model prediction is low (exploitation)
   where the uncertainty of the surrogate model is large (exploration)

Page 4

Authors’ Contribution in a nutshell

The state-of-the-art approach (Spearmint) uses a Gaussian process as the surrogate model
The authors' contribution is to improve scalability while maintaining principled modeling of uncertainty:
   Replace the surrogate model with adaptive basis regression
   Adaptive basis provided by a deep neural network
   Bayesian linear regression performed on the last hidden layer

Page 5

Example of Global Optimization: Hyperparameter Tuning

Era of Big Data, more computational power, and ambitious applications
This means more sophisticated machine learning models
Complex models mean more hyperparameters to tune
For example,
   Design decisions, e.g. shape of the neural network architecture: # of hidden layers, # of neurons in each hidden layer, choice of activation functions
   Regularization parameters, e.g. dropout rate, weight decay (ℓ2 regularizer) coefficient
   Optimization parameters, e.g. learning rate, size of mini-batch, momentum coefficient
Hyperparameters need to be set properly for good performance
Complex models also mean more function evaluations are needed to get a good enough solution
Need a scalable surrogate model

Page 6

Approaches to Global Optimization

Non-model-based approaches (aka cross-validation): grid or random search
Model-based approaches:
   Random forests (SMAC)
   Tree Parzen estimator (TPE)
   Gaussian process (GP)
   Authors' method: Deep Networks for Global Optimization (DNGO)
GPs are widely used because they are simple and flexible w.r.t. conditioning and inference, and have well-calibrated uncertainty
DNGO maintains these simplicity, flexibility, and uncertainty properties

Page 7

Details of Bayesian Optimization

Choose a prior over the functional form of f(x)
Given a set of N observations of input-target pairs

    \mathcal{D} := \{(x_n, y_n = f(x_n))\}_{n=1}^{N} \subset \mathbb{R}^K \times \mathbb{R}

Construct a probabilistic regression model (a distribution over objective functions)
Query the surrogate model (cheaper than f(x)) to determine where to find the optimum
Optimize the acquisition function to determine the next x to evaluate
Augment D and repeat
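To make the loop concrete, here is a minimal NumPy sketch of the generic Bayesian optimization procedure outlined above. The names `fit_surrogate` and `propose_next` are hypothetical placeholders (supplied by the caller) for the surrogate model and the acquisition-function search; this is not the authors' implementation.

```python
import numpy as np

def bayesian_optimization(f, bounds, fit_surrogate, propose_next,
                          n_init=5, n_iter=50, rng=None):
    """Generic Bayesian optimization loop for minimizing an expensive f.

    f             : black-box objective, f(x) -> float
    bounds        : (K, 2) array defining the compact search box
    fit_surrogate : callable (X, y) -> probabilistic model (e.g. GP or DNGO)
    propose_next  : callable (model, X, y, bounds) -> next x, found by
                    maximizing an acquisition function such as EI
    """
    rng = np.random.default_rng() if rng is None else rng
    K = bounds.shape[0]

    # Initial design: a few random points inside the box
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, K))
    y = np.array([f(x) for x in X])

    for _ in range(n_iter):
        model = fit_surrogate(X, y)                  # surrogate probabilistic model
        x_next = propose_next(model, X, y, bounds)   # acquisition-function search
        X = np.vstack([X, x_next])                   # augment D with the new observation
        y = np.append(y, f(x_next))

    best = np.argmin(y)
    return X[best], y[best]
```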

Page 8

Acquisition Functions: Expected Improvement (EI)

Let µ(x; D, Θ) and σ²(x; D, Θ) be the predictive mean and variance of the surrogate model.
Define

    \gamma(x) = \frac{f(x_{\mathrm{best}}) - \mu(x; \mathcal{D}, \Theta)}{\sigma(x; \mathcal{D}, \Theta)}    (2)

where f(x_best) = min_n y_n is the lowest observed value.
The expected improvement is given as

    a_{\mathrm{EI}}(x; \mathcal{D}, \Theta) = \sigma(x; \mathcal{D}, \Theta)\,[\gamma(x)\,\Phi(\gamma(x)) + \mathcal{N}(\gamma(x); 0, 1)]    (3)

where Φ(·) and N(·; 0, 1) are the CDF and PDF of a standard normal, respectively
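A direct NumPy/SciPy transcription of Equations 2-3 (minimization convention), usable as the acquisition in the loop sketched earlier; the function name is illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Expected improvement of Equation 3 for minimization.

    mu, sigma : predictive mean and standard deviation of the surrogate at x
    y_best    : lowest observed value f(x_best)
    """
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)
    gamma = (y_best - mu) / sigma                                # Equation 2
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))   # Equation 3
```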

Page 9

Bayesian Neural Networks (BNNs)

BNNs try to uncover the full posterior over the network weights so as to
   Capture uncertainty
   Act as a regularizer
   Provide a framework for comparing different models
The full posterior is intractable for most neural networks -> expensive approximate inference or MCMC
Recent trend:
   Variational approaches, e.g. use another neural network to approximate the posterior over the network weights
   Perform full or approximate inference on a small part of the network, e.g. the last layer

Authors pursue the latter approach

Page 10

Adaptive Basis Regression

GP-based Bayesian optimization is cubic in N
   Limited to applications where f(x) requires a small number of observations to optimize
Need to replace the GP with a regressor that keeps the GP's desirable properties:
   Flexible w.r.t. conditioning and inference
   Well-calibrated uncertainty
The theoretical relationship between GPs and infinite Bayesian neural networks (BNNs) makes BNNs a natural choice
BNNs are computationally expensive
Practical approach -> adaptive basis regression:
   Train a deep neural network with a linear output layer for regression
   All weights estimated via MAP
   After training, replace the output layer with a Bayesian linear regressor
   Marginalize the output weights
Adaptive basis regression is cubic in D and linear in N, where D << N is the # of nodes in the last hidden layer
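A rough PyTorch sketch of the two-stage procedure just described: MAP training of a network with a linear output layer, then keeping only the last-hidden-layer features as the basis for Bayesian linear regression. The architecture, optimizer, and hyperparameters here are illustrative assumptions, not the authors' exact configuration; tanh is used as a bounded activation, consistent with the later slides' point that unbounded activations hurt uncertainty estimates.

```python
import torch
import torch.nn as nn

class AdaptiveBasisNet(nn.Module):
    """Fully connected network whose last hidden layer supplies the basis phi(x)."""
    def __init__(self, in_dim, width=50, depth=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.Tanh()]  # bounded activation (assumption)
            d = width
        self.features = nn.Sequential(*layers)          # phi(x), dimension D = width
        self.out = nn.Linear(width, 1)                  # linear output layer, discarded after training

    def forward(self, x):
        return self.out(self.features(x))

def fit_basis(X, y, epochs=2000, lr=1e-2, weight_decay=1e-4):
    """MAP training (weight decay acts as a Gaussian prior), then return the design matrix Phi."""
    net = AdaptiveBasisNet(X.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=lr, weight_decay=weight_decay)
    X_t = torch.as_tensor(X, dtype=torch.float32)
    y_t = torch.as_tensor(y, dtype=torch.float32).reshape(-1, 1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X_t), y_t)
        loss.backward()
        opt.step()
    with torch.no_grad():
        Phi = net.features(X_t).numpy()                 # N x D basis for Bayesian linear regression
    return net, Phi
```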

Page 11

Basis Functions

Scale the input space to the unit hypercube
A deep neural network is trained on D
The vector of outputs from the last hidden layer is denoted by φ(·) = [φ_1(·), ..., φ_D(·)]^T
The output vectors from each training sample form the set of basis functions
The resulting design matrix is denoted by Φ, with Φ_{nd} = φ_d(x_n), n = 1, ..., N, d = 1, ..., D
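A small sketch of these two preprocessing steps, assuming `phi` is the last-hidden-layer feature map of a trained network (for instance, the `net.features` from the earlier sketch); the helper names are mine.

```python
import numpy as np

def unit_scale(X, bounds):
    """Map each input dimension onto the unit hypercube [0, 1]^K."""
    lo, hi = bounds[:, 0], bounds[:, 1]
    return (X - lo) / (hi - lo)

def design_matrix(X, phi):
    """Stack the basis vectors so that Phi[n, d] = phi_d(x_n)."""
    return np.stack([np.asarray(phi(x)) for x in X])   # shape (N, D)
```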

Page 12

Bayesian Linear Regression

y is the stacked target vector, X is the matrix of concatenated input vectors
The predictive mean µ(x; D, Θ) and variance σ²(x; D, Θ) of Bayesian linear regression are given by

    \mu(x; \mathcal{D}, \Theta) = m^T \phi(x) + \eta(x),    (4)
    \sigma^2(x; \mathcal{D}, \Theta) = \phi(x)^T K^{-1} \phi(x) + \frac{1}{\beta},    (5)

where

    m = \beta K^{-1} \Phi^T \hat{y} \in \mathbb{R}^D,    (6)
    K = \beta \Phi^T \Phi + \mathbf{I}\alpha \in \mathbb{R}^{D \times D},    (7)

\hat{y} = y − η(x) is the target vector with the prior mean subtracted, η(x) is a prior mean function, and α, β ∈ Θ are regression model hyperparameters
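Equations 4-7 written out in NumPy; a sketch under the assumption that `Phi` comes from the trained basis and that the prior mean η(x_n) has already been subtracted from the targets.

```python
import numpy as np

def blr_fit(Phi, y_hat, alpha, beta):
    """Posterior of the output weights (Equations 6-7).

    Phi   : (N, D) design matrix of basis functions
    y_hat : (N,) targets with the prior mean eta(x_n) already subtracted
    alpha : prior precision on the output weights
    beta  : observation noise precision
    """
    D = Phi.shape[1]
    K = beta * Phi.T @ Phi + alpha * np.eye(D)     # Equation 7
    m = beta * np.linalg.solve(K, Phi.T @ y_hat)   # Equation 6
    return m, K

def blr_predict(phi_x, m, K, beta, eta_x=0.0):
    """Predictive mean and variance at a test point (Equations 4-5)."""
    mu = m @ phi_x + eta_x
    var = phi_x @ np.linalg.solve(K, phi_x) + 1.0 / beta
    return mu, var
```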

Page 13

Bayesian Linear Regression Contd

The marginal log-likelihood is given by

    \log p(y \mid X, \alpha, \beta) = \frac{D}{2}\log\alpha + \frac{N}{2}\log\beta - \frac{N}{2}\log(2\pi) - \frac{\beta}{2}\lVert y - \Phi m \rVert^2 - \frac{\alpha}{2} m^T m - \frac{1}{2}\log|K|    (8)

α, β, and the parameters of η(x) are integrated out using slice sampling
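Equation 8 as a NumPy function of (α, β); this is the quantity a slice sampler would evaluate when integrating out the hyperparameters. A sketch, not the authors' code.

```python
import numpy as np

def log_marginal_likelihood(Phi, y_hat, alpha, beta):
    """Marginal log-likelihood of Equation 8 (targets with the prior mean subtracted)."""
    N, D = Phi.shape
    K = beta * Phi.T @ Phi + alpha * np.eye(D)
    m = beta * np.linalg.solve(K, Phi.T @ y_hat)
    _, logdet_K = np.linalg.slogdet(K)
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi)
            - 0.5 * beta * np.sum((y_hat - Phi @ m) ** 2)
            - 0.5 * alpha * m @ m
            - 0.5 * logdet_K)
```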

Page 14

Scalability Comparison Between GP and ABR

GP scales cubically with N
Adaptive Basis Regression (DNGO) scales linearly with N and cubically with D; D is fixed and small

Page 15

Network Architecture

Need an architecture that generalizes across optimization problems
Important to choose the right activation function
   Interestingly, ReLU is a poor choice for the last hidden layer
   Unbounded activation functions lead to poor uncertainty estimates
   Unnecessary exploration (more expensive function evaluations)

Page 17

Network Architecture

Need an architecture that generalizes across optimization problems
Important to choose the right activation function
Minimize the average relative loss on HPOLib benchmark problems
Choice of 1 to 4 hidden layers
GP-based Bayesian optimization (Spearmint) was used to tune the other hyperparameters:
   Learning rate, momentum
   Width of each hidden layer
   Dropout rates, ℓ2 normalization coefficient
The optimal configuration had no dropout and a small ℓ2 normalization coefficient
Spearmint restricted capacity via a small number of hidden units (50 hidden units per layer)

Page 18

Network Architecture

3 hidden layers chosen
Same architecture used in all experiments

Page 19

Marginal Likelihood vs MAP Estimate

The standard approach is to maximize Equation 8 with respect to the basis parameters (the network weights)
Computing the gradient of log p(y|X, α, β) requires inverting K at each iteration; expensive
Authors' approach:
   Optimize the basis using a MAP point estimate
   Apply the Bayesian linear regression layer after the fact

Page 20

Quadratic Prior

The prior mean function was chosen as

    \eta(x) = \lambda + (x - c)^T \Lambda (x - c)    (9)

where λ is the offset, c is the center of the quadratic, and Λ is a diagonal scaling matrix
   c ~ N(0.5·1, I)
   Λ_kk ~ horseshoe sparsifying prior, ∀k ∈ {1, ..., K}
Reasons for the horseshoe sparsifying prior:
   Positive support -> convex functions
   Large spike at 0 with a heavy tail -> strong shrinkage of small values, preserving large ones
   Shrinkage allows the quadratic part of Equation 9 to disappear if the model is misspecified
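A minimal sketch of Equation 9; sampling the diagonal scales Λ_kk from a horseshoe prior is left abstract here, since only the functional form of η(x) is given on the slide.

```python
import numpy as np

def quadratic_prior_mean(x, lam, c, Lambda_diag):
    """Quadratic prior mean of Equation 9: eta(x) = lambda + (x - c)^T Lambda (x - c).

    Lambda_diag : the K diagonal entries of the scaling matrix Lambda (nonnegative)
    """
    d = np.asarray(x) - np.asarray(c)
    return lam + d @ (np.asarray(Lambda_diag) * d)
```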

Page 21

Handling Input Space Constraints in DNGO

Create a constraint classifier
Let c_n ∈ {0, 1} be an indicator of the validity of x_n
Let V = {(x_n, y_n) | c_n = 1} and I = {(x_n, y_n) | c_n = 0} be the sets of valid and invalid inputs, respectively; D := V ∪ I
Let Ψ be the set of hyperparameters for the constraint classifier
The expected improvement function in Equation 3 is modified to give

    a_{\mathrm{CEI}}(x; \mathcal{D}, \Theta, \Psi) = a_{\mathrm{EI}}(x; \mathcal{V}, \Theta)\, P[c = 1 \mid x, \mathcal{D}, \Psi]

where

    P[c = 1 \mid x, \mathcal{D}, \Psi] = \int_w P[c = 1 \mid x, \mathcal{D}, w, \Psi]\, P(w; \Psi)\, dw    (10)

is obtained by integrating out the output-layer weights of the adaptive basis model
For noisy constraints, a logistic likelihood function is used for P[c = 1 | x, D, w, Ψ]
For noiseless constraints, a step function is used instead
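Following Equation 10 and the definition of a_CEI, the constrained acquisition is simply EI computed on the valid observations, weighted by the constraint probability; a sketch with illustrative names.

```python
import numpy as np
from scipy.stats import norm

def constrained_expected_improvement(mu, sigma, y_best_valid, p_valid):
    """a_CEI(x) = a_EI(x; V, Theta) * P[c = 1 | x, D, Psi].

    mu, sigma    : surrogate prediction at x, fit on the valid set V
    y_best_valid : best (lowest) objective value among valid observations
    p_valid      : constraint probability, e.g. Equation 10 approximated by
                   averaging over sampled output-layer weights
    """
    gamma = (y_best_valid - mu) / sigma
    ei = sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
    return ei * p_valid
```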

Page 22

Parallel DNGO

Intractable to create a joint acquisition function across multiple inputs
Acquisitions are in general sequential
However, one can utilize fantasies from experiments that are running in parallel to aid the next choice of x
Idea:
   Use the posterior predictive distribution in Equations 4 and 5 to generate a set of fantasy outcomes y for each running experiment
   Average the fantasy outcomes to get a fantasy outcome for each running experiment
   Augment the dataset D
   Marginalize out the fantasies
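A sketch of the fantasy step: draw outcomes for each pending job from the predictive distribution of Equations 4-5, average them, and treat the averages as temporary observations. `predict` is an assumed callable returning (mu, var), e.g. from the Bayesian linear regression layer above.

```python
import numpy as np

def average_fantasies(pending_X, predict, n_fantasies=10, rng=None):
    """Average fantasy outcomes for currently running jobs.

    pending_X : list/array of inputs whose evaluations are still running
    predict   : callable x -> (mu, var) from the Bayesian linear regression layer
    """
    rng = np.random.default_rng() if rng is None else rng
    fantasies = []
    for x in pending_X:
        mu, var = predict(x)
        draws = rng.normal(mu, np.sqrt(var), size=n_fantasies)  # samples from Eqs. 4-5
        fantasies.append(draws.mean())      # one averaged fantasy per running job
    return np.array(fantasies)              # appended to D before the next acquisition
```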

Page 23

Parallel DNGO Contd

Given J currently running jobs with inputs {x_j}_{j=1}^J, the marginalized acquisition function is

    a_{\mathrm{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J, \Theta, \Psi) = \int a_{\mathrm{CEI}}(x; \mathcal{D} \cup \{(x_j, y_j)\}_{j=1}^J, \Theta, \Psi)\, P[\{c_j, y_j\}_{j=1}^J \mid \mathcal{D}, \{x_j\}_{j=1}^J]\, dy_1 \cdots dy_J\, dc_1 \cdots dc_J

The next input x^* is chosen as

    x^* = \arg\max_x a_{\mathrm{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J),    (11)

where

    a_{\mathrm{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J) = \int a_{\mathrm{MCEI}}(x; \mathcal{D}, \{x_j\}_{j=1}^J, \Theta, \Psi)\, d\Theta\, d\Psi    (12)

is the integrated acquisition function
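A Monte Carlo sketch of Equations 11-12: average the constrained EI over sampled fantasy sets (and, in principle, over hyperparameter samples from the slice sampler). `acq_given_fantasy` is an assumed callable, not part of the authors' code.

```python
import numpy as np

def marginalized_acquisition(x, fantasy_sets, acq_given_fantasy):
    """Monte Carlo estimate of a_MCEI (Equations 11-12).

    fantasy_sets      : list of sampled outcomes {(x_j, y_j)} for the running jobs,
                        one entry per fantasy / hyperparameter sample
    acq_given_fantasy : callable (x, fantasy_set) -> a_CEI value with D augmented by that set
    """
    return np.mean([acq_given_fantasy(x, fs) for fs in fantasy_sets])
```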

Page 24

HPOLib Benchmarks

DNGO was compared to other methods for global optimization on a benchmark set of problems
TPE and SMAC are scalable, but have ad hoc estimates of uncertainty
Spearmint is based on a standard GP, so it is not scalable

Page 25

Image Caption Generation: Description

Using the BLEU-4 metric on the Microsoft COCO 2014 test set
DNGO was used to tune a log-bilinear model (LBL), a simpler model relative to an LSTM
Each evaluation of the LBL model took 26.6 hours
Tuned learning rate, momentum, batch size, dropout rate and weight decay for word and image representations, context size, size of word embeddings, etc.
Between 300 and 800 experiments were run in parallel
A total of 2500 experiments (2700 CPU days) ran in less than 1 week
Distinct local optima in hyperparameter space may explain the dramatic improvement from combining the top 2 and 3 models

Page 26

Image Caption Generation: Results

Page 27

Deep Convolutional Neural Networks: Architecture

DNGO was used to tune a deep CNN for visual object recognition on the CIFAR-10 and CIFAR-100 datasets
The same architecture (from Springenberg et al., 2014) was used for both datasets

Page 28

Deep Convolutional Neural Networks: Results

40 experiments run in parallel
Tuned momentum, learning rate, ℓ2 weight decay coefficients, dropout rates, standard deviations of random i.i.d. Gaussian weight initializations, etc.
