
Page 1:

Optimization of Machine Learning Hyperparameters

Dr. Frank Hutter

Head of Emmy Noether Research Group on Learning, Optimization, and Automated Algorithm Design

Computer Science Institute, University of Freiburg, Germany

July 2014

Page 2:

Motivation

• The machine learning algorithms you have learned about have several degrees of freedom

– E.g., in neural networks: regularization, momentum, learning rate, number of layers, number of units, …

• So far, how have you been setting these in practice?

– Changing one parameter at a time

– Grid search

• Was this tedious? Time-consuming?

– Imagine you have millions of data points and each evaluation takes hours or days…

2

Page 3:

High-level Learning Goals

• After this module, you can …

– Effectively use modern hyperparameter optimization methods

– Explain the concept of over-fitting

– Describe what measures can be taken to avoid over-fitting

– Describe the core mechanisms of several types of hyperparameter optimization methods

– Reason about the pros and cons of using a particular hyperparameter optimization method for a particular problem

– Derive the mechanisms behind Bayesian optimization

3

Page 4:

Outline of Today’s Class

• Generalization to previously unseen data

• Overview of hyperparameter optimization methods

• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes

4

Page 5:

Learning and Generalization

• Much of supervised machine learning is about selecting a model from a given hypothesis space that

– Explains the seen data well

– Is likely to also work well for new data

• Example: Which model will describe new data better? The polynomial or the line?

5

Image source: Wikipedia

Page 6:

Occam’s razor (or Ockham’s razor)

“Numquam ponenda est pluralitas sine necessitate”

[Plurality must never be posited without necessity.]

• General problem solving principle

– In the absence of evidence to the contrary, prefer the simplest explanation.

– Adapted to machine learning: all things being equal, prefer the simplest model.

6

William of Ockham, 1287-1347, philosopher and theologian.

Image source: Wikipedia

Page 7:

Occam’s razor in practice

• We need to trade off model complexity and model fit

• Model fit

– E.g., likelihood of the data under the model: P(data|model)

– In general: some loss of the predictor on the training data

• Model complexity

– E.g., number of free parameters

– E.g., number of effective dimensions

– E.g., VC dimension [Vapnik–Chervonenkis, 1971]

• Use regularization to penalize complex models: minimize training loss + C * regularization cost
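Written out as a formula (a generic sketch in our notation, not the slide's own), with model parameters $w$, training loss $\mathcal{L}_{\text{train}}$, and a complexity penalty $\Omega$:

$$\min_{w}\;\; \mathcal{L}_{\text{train}}(w) \;+\; C \cdot \Omega(w), \qquad \text{e.g. } \Omega(w) = \lVert w \rVert_2^2$$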

7

Page 8:

Parameters vs. Hyperparameters

• Most machine learning algorithms optimize parameters under the hood

– E.g., weights in linear regression and neural networks

– E.g., deep learning: millions of parameters

• Standard approach: minimize training loss + C * regularization cost

– Using standard gradient-based optimizers

• Hyperparameters: decisions left to algorithm designer

– How complex a model to use?

– How to set C?

– How many layers/which structure of deep networks to use?

8

Page 9:

How to set the hyperparameters?

• We wish to achieve good generalization performance

• In practice, we need to try several values and empirically evaluate how well they generalize

– Train the model for a given hyperparameter setting

– Evaluate the model’s generalization performance

• Which data set should we use to evaluate the model’s generalization performance?

1. The same data set that we use all the time: all the data we have

2. We split the data we have available: use one part for training the model, another disjoint part for evaluating generalization

9

Page 10:

Interactive question

• Which data set should we use to evaluate a model’s generalization performance empirically?

– We split the data we have available: use one part for training the model, another disjoint part for evaluating performance

• Why?

– The assumption we make is that future data will come from the same "true" distribution as our current data.

– Then, using an unseen sample of that distribution gives us an unbiased estimate of generalization to future data

– If our assumption is false, then we must control for concept drift … a topic for another lecture ;-)

10

Page 11:

Overfitting & early stopping heuristic

• Too little data / too little regularization:

– The error on the training data keeps on decreasing

– After too much training, the error on separate validation data starts to increase

• Early stopping heuristic: stop training at that point

11

[Figure: training error and validation error as a function of training time. Image source: Wikipedia]

Page 12:

Generalization of performance

• The dark ages

– Student tweaks hyperparameters until it works

– Supervisor may not even know about the tuning

– Results get published without acknowledging the tuning

– Of course, the approach does not generalize

• A step further

– Optimize parameters on a training set

– Evaluate generalization on a test set

• Another step further: avoid “peeking” at the test set

– Put test set into a vault (i.e., never look at it)

– Split training set again into training and validation set

– Only use test set in the end to generate results for publication

12

Page 13:


Cross-validation for model selection

• Problem: single split of training data into training/validation might not be representative

• Standard solution: average performance across k cross-validation folds (here: k=3)

13

[Figure: k=3 cross-validation; each fold splits the data into disjoint training and validation parts]

Page 14:

Cross-validation for model selection

• Standard model selection using cross-validation (CV):

• Let A denote a learning algorithm

• For each fold, we apply A to that fold's training part and evaluate the resulting model on the fold's held-out validation part

• We call the resulting loss the validation loss of that fold

• We average these losses over the k cross-validation folds and pick the best-performing learning algorithm
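In symbols (our notation; the slide's own formulas are not reproduced in this transcript): with folds $(D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)})$, $i = 1, \ldots, k$, and a loss $\mathcal{L}(\text{model}, \text{data})$, cross-validation picks

$$\hat{A} \;=\; \arg\min_{A \in \mathcal{A}} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A\big(D_{\text{train}}^{(i)}\big),\; D_{\text{valid}}^{(i)}\right)$$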

14

Page 15:

Cross-validation for further tasks

• Standard model selection using cross-validation (CV):

• Standard hyperparameter optimization using CV:

• Combination of the two:
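The formulas for these two cases are likewise missing from the transcript; in the same (our) notation they would read roughly: hyperparameter optimization for a fixed algorithm $A$ with hyperparameter space $\Lambda$,

$$\hat{\lambda} \;=\; \arg\min_{\lambda \in \Lambda} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A_{\lambda}\big(D_{\text{train}}^{(i)}\big),\; D_{\text{valid}}^{(i)}\right)$$

and the combined selection of algorithm and hyperparameters,

$$(\hat{A}, \hat{\lambda}) \;=\; \arg\min_{A \in \mathcal{A},\; \lambda \in \Lambda_A} \; \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\!\left(A_{\lambda}\big(D_{\text{train}}^{(i)}\big),\; D_{\text{valid}}^{(i)}\right)$$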

15

Page 16:

Cross-validation Details

• How to choose the number of folds k?

– Too low: noisy approximations of generalization → poor generalization to test instances

– Too high: evaluating a configuration is expensive → the optimization process is slow; also, performance across folds is not independent, so increasing k does not always improve generalization

• Theory is lacking

• In practice, typically choose k=5 or k=10 [Kohavi, 1995]

• Practical speedup trick [Hutter, Hoos & Leyton-Brown, 2011]

– We do not need to evaluate all folds for each configuration

– Example: the best configuration so far has an average cross-validation error of 0.1 over 5 folds; a new configuration has error 0.6 on the first fold. Even if it achieved error 0 on the remaining 4 folds, its average would be 0.12 > 0.1, so it can be discarded after a single fold.
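A minimal sketch of this fold-by-fold discarding idea (our illustration under assumed inputs, not the actual implementation of Hutter, Hoos & Leyton-Brown, 2011; fold_error is a hypothetical function handle returning the validation error of a configuration on one fold):

function [avg_err, keep] = lazy_cv(fold_error, config, k, incumbent_avg)
% Evaluate folds one at a time; stop as soon as the configuration can no
% longer beat the incumbent's average CV error, even with error 0 on all
% remaining folds.
total = 0;
for i = 1:k
    total = total + fold_error(config, i);
    if total / k > incumbent_avg      % lower bound on the final average is already too high
        avg_err = total / k; keep = false; return;
    end
end
avg_err = total / k; keep = true;
end

With the numbers from the example above, the new configuration is rejected after its first fold, since 0.6 / 5 = 0.12 > 0.1.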

16

Page 17:

Outline of Today’s Class

• Generalization to previously unseen data

• Overview of hyperparameter optimization methods

• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes

17

Page 18:

Manual Search

Start with some configuration
repeat
    Modify a single parameter
    if performance on a benchmark set degrades then
        undo modification
until no more improvement possible (or "good enough")

(manually-executed hill climbing)

18

Aka “Optimization by Graduate Student”

Page 19:

Pros and cons of manual search

• Pros

– Student gains some intuition → helps understanding

– Student can notice irregularities, e.g.

• A configuration is worse than expected → find bugs

• E.g., aliasing in filters learned by a convolutional network [Zeiler & Fergus, 2013]

• A run dies because of temporary file system errors → repeat the run

• Cons

– “Blind” search: inefficient use of student’s time

– Sometimes “false intuition”: e.g., based on a different dataset and a different architecture a year ago

19

Page 20:

Simple Search Strategy: Grid Search

20

Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012

• Select D values for each of N hyperparameters, try all D^N combinations

• Direct feedback:

– Which values work/don’t work for each setting

– Which parameters are important? Are there interactions?
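A minimal, self-contained sketch of grid search over two hyperparameters (our illustration, not from the slides; the quadratic loss below is a cheap synthetic stand-in for an expensive train-and-validate run):

% grid search: D values per hyperparameter, D^N evaluations for N hyperparameters
loss = @(C, gamma)((log10(C) - 1).^2 + (log10(gamma) + 2).^2);   % synthetic objective, minimum near C=10, gamma=1e-2
Cs     = logspace(-3, 3, 7);          % D = 7 candidate values for C
gammas = logspace(-5, 1, 7);          % D = 7 candidate values for gamma
best = inf;
for i = 1:numel(Cs)                   % 7^2 = 49 evaluations in total
    for j = 1:numel(gammas)
        err = loss(Cs(i), gammas(j));
        if err < best
            best = err; bestC = Cs(i); bestGamma = gammas(j);
        end
    end
end
fprintf('best C = %g, best gamma = %g, loss = %g\n', bestC, bestGamma, best);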

Page 21:

Simple Search Strategy: Random Search

• Select configurations uniformly at random

– Completely uninformed

– Global search, won’t get stuck in a local region

– Better than grid search for low effective dimensionality:

21

Image source: Bergstra et al, Random Search for Hyperparameter Optimization, JMLR 2012

Page 22:

Further Benefits of Random Search

• Perfect parallelizability

– Simply start K runs in parallel on a compute cluster

• Fault tolerance

– In practice, some runs often die because of some problem

• File system error

• Parameter combination not legal

• Code crashes

– In grid search, you need the entire grid

– In random search, a design with M < K runs is also valid
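A minimal sketch of random search with the same evaluation budget and the same synthetic stand-in objective as the grid-search sketch above (our illustration):

% random search: sample configurations uniformly (here log-uniformly) at random
loss = @(C, gamma)((log10(C) - 1).^2 + (log10(gamma) + 2).^2);   % synthetic objective
K = 49;                                % same budget as the 7 x 7 grid
best = inf;
for t = 1:K                            % iterations are independent: trivially parallel,
    C     = 10^(-3 + 6*rand());        % and any M < K completed runs still form a valid design
    gamma = 10^(-5 + 6*rand());
    err = loss(C, gamma);
    if err < best
        best = err; bestC = C; bestGamma = gamma;
    end
end
fprintf('best C = %g, best gamma = %g, loss = %g\n', bestC, bestGamma, best);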

22

Page 23:

Disadvantages of Random Search

• Entirely uninformed

– Cannot follow an obvious gradient (e.g., bigger is better)

• Curse of dimensionality

– Example: only ½ of the values of each dimension are good

– Probability of randomly drawing a good configuration in N dimensions: 0.5^N

• In 1 dimension: 0.5

• In 2 dimensions: 0.25

• In 10 dimensions: < 0.001

• In 20 dimensions: < 0.000001

• Grid search has the same problems

– Random search is the better search method

– Grid search only gives better intuitions

23

Page 24:

Stochastic Local Search

• Balance intensification and diversification

– Intensification: gradient descent

– Diversification: restarts, random steps, perturbations, …

• Prominent general methods

– Tabu search [Glover, 1986]

– Simulated annealing [Kirkpatrick, Gelatt & Vecchi, 1983]

– Iterated local search [Lourenço, Martin & Stützle, 2003]

24

[e.g., Hoos and Stützle, 2005]

Page 25:

Population-based Methods

• Population of configurations

– Global + local search via population

– Maintain population fitness & diversity

• Examples

– Genetic algorithms [e.g., Barricelli, ’57, Goldberg, ’89]

– Evolution strategies [e.g., Beyer & Schwefel, '02]

– Ant colony optimization [e.g., Dorigo & Stützle, ’04]

– Particle swarm optimization [e.g., Kennedy & Eberhart, ’95]

25

Page 26:

Bayesian Optimization

• Fit a (probabilistic) model of the function

• Use that model to trade off exploitation vs exploration

• Also known as sequential model-based optimization (SMBO)
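A minimal sketch of the resulting loop (our illustration; init_design, fit_model, maximize_acquisition and evaluate_config are hypothetical helpers standing in for the components discussed in the rest of this lecture, and space, n_init and budget are placeholders):

% sequential model-based optimization (SMBO), schematically
[X, y] = init_design(space, n_init);              % a few initial configurations and their losses
for t = 1:budget
    model  = fit_model(X, y);                     % probabilistic model of the loss, e.g. a Gaussian process
    x_next = maximize_acquisition(model, space);  % trades off exploration vs. exploitation
    y_next = evaluate_config(x_next);             % the expensive step: train + cross-validate
    X = [X; x_next];  y = [y; y_next];            % update the observations
end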

26

Page 27:

Bayesian Optimization

• Popular approach in statistics to minimize expensive blackbox functions [Mockus, '78]

– Efficient in the number of function evaluations

– Works when objective is nonconvex, noisy, has unknown derivatives, etc

• Recent progress in the machine learning literature: global convergence rates for continuous optimization [Srinivas et al, ICML 2010] [Bull, JMLR 2011] [Bubeck et al., JMLR 2011] [de Freitas, Smola, Zoghi, ICML 2012]

27

Page 28:

Estimation of Distribution (EDA)

• Also uses a probabilistic model

• Also uses that model to inform where to evaluate next

• But models promising configurations: P(x is “good”)

– In contrast to modeling the function: P(f|x)

28

Image source: Wikipedia

[e.g., Pelikan, Goldberg and Lobo, 2002]

Page 29:

Outline of Today’s Class

• Generalization to previously unseen data

• Overview of hyperparameter optimization methods

• Foundations of Bayesian optimization: Bayesian linear regression & Gaussian processes

29

Page 30:

Reminder: Bayesian Optimization

30

Page 31:

Aside: why is it called “Bayesian”?

• Often you have causal knowledge

– For example

• P(symptom | disease)

• P(observed noisy function values | true function)

– This is the likelihood: P(evidence e | hypothesis h)

• ... and you want to do evidential reasoning

– For example

• P(disease | symptom)

• P(true function | observed noisy function values)

– This is the posterior: P(hypothesis h | evidence e)

• To compute this posterior, you also need

– the prior P(hypothesis h) and Bayes' rule

31

Page 32:

Bayes rule (or Bayes’ rule)

32

Thomas Bayes, 1701-1761, English statistician and philosopher. Image source: Wikipedia
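The rule itself is not reproduced in the transcript; in the hypothesis/evidence notation of the previous slide it reads

$$P(h \mid e) \;=\; \frac{P(e \mid h)\, P(h)}{P(e)}$$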

Page 33:

Bayes rule in Bayesian optimization

• Denote the observed data as

• Denote our prior over functions as

• Then the posterior over functions is:

33

(posterior ∝ likelihood × prior)
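The slide's equations are missing from the transcript; in standard Bayesian-optimization notation (ours), writing the observations as $\mathcal{D}_{1:t} = \{(x_1, y_1), \ldots, (x_t, y_t)\}$ and the prior over functions as $p(f)$, the statement is

$$p(f \mid \mathcal{D}_{1:t}) \;=\; \frac{p(\mathcal{D}_{1:t} \mid f)\; p(f)}{p(\mathcal{D}_{1:t})} \;\propto\; p(\mathcal{D}_{1:t} \mid f)\; p(f)$$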

Page 34:

Two components of Bayesian optimization

• The probabilistic model

– Typically used: Gaussian process

– Today: Bayesian linear regression & Gaussian processes

– Next time: random forests

• The acquisition function

– Trades off exploration vs. exploitation

34

Page 35:

Bayesian linear regression & Gaussian processes

• Acknowledgement: The following slides are taken from Philipp Hennig's tutorial on Gaussian processes at the Machine Learning Summer School 2013

• All of Philipp's slides are online: http://mlss.tuebingen.mpg.de/hennig_slides1.pdf

• Philipp's website also has video lectures and more slides: http://www.is.tuebingen.mpg.de/nc/employee/details/phennig.html

35

Page 36:

Carl Friedrich Gauss (1777–1855): Paying Tolls with a Bell

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Page 37:

The Gaussian Distribution: Multivariate Form

\mathcal{N}(x;\mu,\Sigma) = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\, \exp\!\left[-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right]

[Figure: contour plot of a two-dimensional Gaussian centred at (µ1, µ2)]

▸ x, µ ∈ R^N, Σ ∈ R^{N×N}
▸ Σ is positive semidefinite, i.e.
  ▸ v^⊤ Σ v ≥ 0 for all v ∈ R^N
  ▸ Hermitian, all eigenvalues ≥ 0

Page 38:

Why Gaussian? An experiment

[Figure: histogram of an empirical sample]

▸ nothing in the real world is Gaussian (except sums of i.i.d. variables)
▸ but nothing in the real world is linear either!

Gaussians are for inference what linear maps are for algebra.

Page 39:

Closure Under Multiplication: multiple Gaussian factors form a Gaussian

\mathcal{N}(x;a,A)\,\mathcal{N}(x;b,B) = \mathcal{N}(x;c,C)\,\mathcal{N}(a;b,A+B),\qquad C := (A^{-1}+B^{-1})^{-1},\quad c := C\,(A^{-1}a + B^{-1}b)

[Figure: contours of the two Gaussian factors and of their product]


Page 42:

Closure under Linear Maps: linear maps of Gaussians are Gaussians

p(z) = \mathcal{N}(z;\mu,\Sigma) \;\Rightarrow\; p(Az) = \mathcal{N}(Az;\, A\mu,\, A\Sigma A^\top)\qquad\text{here: } A = [1, -0.5]

[Figure: two-dimensional Gaussian and its one-dimensional projection under A]

Page 43:

Closure under Marginalization: projections of Gaussians are Gaussian

▸ projection with A = (1 0):

\int \mathcal{N}\!\left[\begin{pmatrix} x \\ y \end{pmatrix};\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix},\begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}\right] dy \;=\; \mathcal{N}(x;\mu_x,\Sigma_{xx})

▸ this is the sum rule: \int p(x,y)\,dy = \int p(y \mid x)\,p(x)\,dy = p(x)
▸ so every finite-dimensional Gaussian is a marginal of infinitely many more

[Figure: two-dimensional Gaussian and its marginal along one axis]

Page 44:

Closure under Conditioning: cuts through Gaussians are Gaussians

p(x \mid y) = \frac{p(x,y)}{p(y)} = \mathcal{N}\!\left(x;\; \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)

▸ this is the product rule
▸ so Gaussians are closed under the rules of probability

[Figure: two-dimensional Gaussian and the conditional slice at a fixed value of y]

Page 45:

Bayesian Inference: explaining away

p(x) = \mathcal{N}(x;\mu,\Sigma) = \mathcal{N}\!\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix};\begin{pmatrix} 1 \\ 0.5 \end{pmatrix},\begin{pmatrix} 3^2 & 0 \\ 0 & 3^2 \end{pmatrix}\right]

p(y \mid x, \sigma) = \mathcal{N}(y;\, A^\top x,\, \sigma^2) = \mathcal{N}\!\left[\,6;\; (1\;\;0.6)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},\; \sigma^2\right]

p(x \mid \sigma^2, y) = \frac{p(x)\,p(y \mid x)}{p(y)} = \mathcal{N}\!\left(x;\; \mu + \Sigma A (A^\top \Sigma A + \sigma^2)^{-1}(y - A^\top \mu),\; \Sigma - \Sigma A (A^\top \Sigma A + \sigma^2)^{-1} A^\top \Sigma\right) = \mathcal{N}\!\left[\begin{pmatrix} x_1 \\ x_2 \end{pmatrix};\begin{pmatrix} 3.9 \\ 2.3 \end{pmatrix},\begin{pmatrix} 3.4 & -3.4 \\ -3.4 & 7.0 \end{pmatrix}\right]

[Figure: prior contours, the observation constraint, and the posterior contours in the (x1, x2) plane]


Page 49:

What can we do with this? Linear regression

Given y ∈ R^N and p(y | f), what is f?

[Figure: scatter plot of the dataset (x, y)]

Page 50:

A prior over linear functions

f(x) = w_1 + w_2 x = \phi_x^\top w,\qquad \phi_x = \begin{pmatrix} 1 \\ x \end{pmatrix}

p(w) = \mathcal{N}(w;\mu,\Sigma) \quad\Rightarrow\quad p(f) = \mathcal{N}(f;\; \phi_x^\top \mu,\; \phi_x^\top \Sigma\, \phi_x)


Page 52:

The posterior over linear functions

p(y \mid w, \phi_X) = \mathcal{N}(y;\; \phi_X^\top w,\; \sigma^2 I)

p(w \mid y, \phi_X) = \mathcal{N}\!\left(w;\; \mu + \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\; \Sigma - \Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\right)

Page 53:

The posterior over linear functions

p(y \mid w, \phi_X) = \mathcal{N}(y;\; \phi_X^\top w,\; \sigma^2 I)

p(f_x \mid y, \phi_X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)

Page 54:

% prior on w
F = 2;                                   % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));      % phi(a) = [1; a]
mu = zeros(F,1); Sigma = eye(F);         % p(w) = N(mu, Sigma)

% prior on f(x)
n = 100; x = linspace(-6,6,n)';          % 'test' points
phix = phi(x);                           % features of x
m = phix * mu;
kxx = phix * Sigma * phix';              % p(fx) = N(m, kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3));   % samples from the prior
stdpi = sqrt(diag(kxx));                 % marginal stddev, for plotting

load('data.mat'); N = length(Y);         % gives Y, X, sigma

% prior on Y = fX + eps
phiX = phi(X);                           % features of the data
M = phiX * mu;
kXX = phiX * Sigma * phiX';              % p(fX) = N(M, kXX)
G = kXX + sigma^2 * eye(N);              % p(Y) = N(M, kXX + sigma^2 I)
R = chol(G);                             % most expensive step: O(N^3)
kxX = phix * Sigma * phiX';              % cov(fx, fX) = kxX
A = kxX / R;                             % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));            % posterior mean:  m + kxX (kXX + sigma^2 I)^-1 (Y - M)
vpost = kxx - A * A';                    % posterior cov:   kxx - kxX (kXX + sigma^2 I)^-1 kXx
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3));   % samples from the posterior
stdpo = sqrt(diag(vpost));               % marginal stddev, for plotting

Page 55:

A More Realistic Dataset: General Linear Regression

f(x) = \phi_x^\top w\; ?

[Figure: scatter plot of the dataset (x, y)]

Page 56:

f(x) = w_1 + w_2 x = \phi_x^\top w,\qquad \phi_x := \begin{pmatrix} 1 \\ x \end{pmatrix}

Page 57:

(Same code as on page 54, shown again unchanged.)

Page 58:

Cubic Regression

phi = @(a)(bsxfun(@power,a,[0:3]));

f(x) = \phi(x)^\top w,\qquad \phi(x) = (1\;\; x\;\; x^2\;\; x^3)^\top


Page 60:

Septic Regression?

phi = @(a)(bsxfun(@power,a,[0:7]));

f(x) = \phi(x)^\top w,\qquad \phi(x) = (1\;\; x\;\; x^2\; \cdots\; x^7)^\top


Page 62:

Fourier Regression

phi = @(a)(2 * [cos(bsxfun(@times,a/8,[0:8])), sin(bsxfun(@times,a/8,[1:8]))]);

\phi(x) = (\cos(x)\;\; \cos(2x)\;\; \cos(3x)\; \ldots\; \sin(x)\;\; \sin(2x)\; \ldots)^\top


Page 64:

Step Regression

phi = @(a)(-1 + 2 * bsxfun(@lt,a,linspace(-8,8,16)));

\phi(x) = -1 + 2\,(\theta(x-8)\;\; \theta(8-x)\;\; \theta(x-7)\;\; \theta(7-x)\; \ldots)^\top


Page 66:

V Regression

phi = @(a)(bsxfun(@minus,abs(bsxfun(@minus,a,linspace(-8,8,16))),linspace(-8,8,16)));

\phi(x) = (\,|x-8|+8\;\; |x-7|+7\;\; |x-6|+6\; \ldots)^\top


Page 68:

Eiffel Tower Regression

phi = @(a)(exp(-abs(bsxfun(@minus,a,[-8:1:8]))));

\phi(x) = (e^{-|x-8|}\;\; e^{-|x-7|}\;\; e^{-|x-6|}\; \ldots)^\top


Page 70:

Bell Curve Regression

phi = @(a)(exp(-0.5 * bsxfun(@minus,a,[-8:1:8]).^2));

\phi(x) = \left(e^{-\frac{1}{2}(x-8)^2}\;\; e^{-\frac{1}{2}(x-7)^2}\;\; e^{-\frac{1}{2}(x-6)^2}\; \ldots\right)^\top


Page 72:

Multiple Inputs: all of this works in multiple dimensions, too

\phi:\; \mathbb{R}^N \to \mathbb{R},\qquad f:\; \mathbb{R}^N \to \mathbb{R}

Page 73:

Multiple Inputs: all of this works in multiple dimensions, too

Page 74:

How many features should we use? Let's look at that algebra again

p(f_x \mid y, \phi_X) = \mathcal{N}\!\left(f_x;\; \phi_x^\top\mu + \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}(y - \phi_X^\top\mu),\; \phi_x^\top\Sigma\phi_x - \phi_x^\top\Sigma\phi_X(\phi_X^\top\Sigma\phi_X + \sigma^2 I)^{-1}\phi_X^\top\Sigma\phi_x\right)

▸ there is no lonely φ in there
▸ all objects involving φ are of the form
  ▸ φ^⊤ µ: the mean function
  ▸ φ^⊤ Σ φ: the kernel
▸ once these are known, the cost is independent of the number of features
▸ remember the code:

M   = phiX * mu;
m   = phix * mu;
kXX = phiX * Sigma * phiX';   % p(fX) = N(M, kXX)
kxx = phix * Sigma * phix';   % p(fx) = N(m, kxx)
kxX = phix * Sigma * phiX';   % cov(fx, fX) = kxX

Page 75:

(Same code as on page 54, shown again unchanged.)

Page 76:

% prior
F = 2;                                   % number of features
phi = @(a)(bsxfun(@power,a,0:F-1));      % phi(a) = [1; a]
k   = @(a,b)(phi(a)' * phi(b));          % kernel
mu  = @(a)(zeros(size(a,1),1));          % mean function

% belief on f(x)
n = 100; x = linspace(-6,6,n)';          % 'test' points
m = mu(x);
kxx = k(x,x);                            % p(fx) = N(m, kxx)
s = bsxfun(@plus,m,chol(kxx + 1.0e-8 * eye(n))' * randn(n,3));   % samples from the prior
stdpi = sqrt(diag(kxx));                 % marginal stddev, for plotting

load('data.mat'); N = length(Y);         % gives Y, X, sigma

% prior on Y = fX + eps
M = mu(X);
kXX = k(X,X);                            % p(fX) = N(M, kXX)
G = kXX + sigma^2 * eye(N);              % p(Y) = N(M, kXX + sigma^2 I)
R = chol(G);                             % most expensive step: O(N^3)
kxX = k(x,X);                            % cov(fx, fX) = kxX
A = kxX / R;                             % pre-compute for re-use

mpost = m + A * (R' \ (Y-M));            % posterior mean:  m + kxX (kXX + sigma^2 I)^-1 (Y - M)
vpost = kxx - A * A';                    % posterior cov:   kxx - kxX (kXX + sigma^2 I)^-1 kXx
spost = bsxfun(@plus,mpost,chol(vpost + 1.0e-8 * eye(n))' * randn(n,3));   % samples from the posterior
stdpo = sqrt(diag(vpost));               % marginal stddev, for plotting

Page 77:

Exponentiated Squares

phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,10)).^2 ./ ell.^2));

▸ a.k.a. radial basis function, square(d)-exponential kernel

Page 78:

Exponentiated Squares

phi = @(a)(exp(-0.5 * bsxfun(@minus,a,linspace(-8,8,30)).^2 ./ ell.^2));

▸ a.k.a. radial basis function, square(d)-exponential kernel

Page 79:

Exponentiated Squares

k = @(a,b)(5*exp(-0.25*bsxfun(@minus,a,b').^2));

▸ a.k.a. radial basis function, square(d)-exponential kernel


Page 81:

What just happened? Kernelization to infinitely many features

Definition. A function k: X × X → R is a Mercer kernel if, for any finite collection X = [x_1, ..., x_N], the matrix k_{XX} ∈ R^{N×N} with elements k_{XX,(i,j)} = k(x_i, x_j) is positive semidefinite.

Lemma. Any kernel that can be written as

k(x, x') = \int \phi_{\ell}(x)\, \phi_{\ell}(x')\, d\ell

is a Mercer kernel (assuming the integral is over a positive set).

Proof: for all X ∈ X^N and v ∈ R^N,

v^\top k_{XX}\, v \;=\; \int \sum_{i}^{N} v_i \phi_{\ell}(x_i) \sum_{j}^{N} v_j \phi_{\ell}(x_j)\, d\ell \;=\; \int \Big[\sum_{i} v_i \phi_{\ell}(x_i)\Big]^2 d\ell \;\ge\; 0. \qquad \square

Page 82:

What just happened? Gaussian process priors

Definition. A function k: X × X → R is a Mercer kernel if, for any finite collection X = [x_1, ..., x_N], the matrix k_{XX} ∈ R^{N×N} with elements k_{XX,(i,j)} = k(x_i, x_j) is positive semidefinite.

Definition. Let µ: X → R be any function and k: X × X → R be a Mercer kernel. A Gaussian process p(f) = GP(f; µ, k) is a probability distribution over the function f: X → R such that every finite restriction to function values f_X := [f_{x_1}, ..., f_{x_N}] has a Gaussian distribution p(f_X) = N(f_X; µ_X, k_{XX}).


Page 93:

The predictive posterior distribution

The posterior Gaussian process has a Gaussian predictive distribution at every test input, with mean and variance as given below.
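The slide's equations are not reproduced in the transcript; under standard GP notation (ours), with mean function $\mu$, kernel $k$, training inputs $X$ with function values $f_X$, and a test point $x$, the noise-free predictive distribution is Gaussian with

$$m_{\text{post}}(x) = \mu(x) + k_{xX}\, k_{XX}^{-1}\,\big(f_X - \mu_X\big), \qquad v_{\text{post}}(x) = k_{xx} - k_{xX}\, k_{XX}^{-1}\, k_{Xx}$$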

36

Page 94:

The predictive posterior under noise

The posterior Gaussian process again has a Gaussian predictive distribution at every test input; with observation noise, its mean and variance are as given below.
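Again in our notation (the slide's own equations are missing), with noisy observations $y = f_X + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$:

$$m_{\text{post}}(x) = \mu(x) + k_{xX}\,\big(k_{XX} + \sigma^2 I\big)^{-1}\big(y - \mu_X\big), \qquad v_{\text{post}}(x) = k_{xx} - k_{xX}\,\big(k_{XX} + \sigma^2 I\big)^{-1} k_{Xx}$$

These are exactly the quantities mpost and vpost computed in the code on pages 54 and 76.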

37

Page 95:

Computational complexity of GPs

• Let t denote the number of data points in the GP

• Inverting the kernel matrix: O(t3)

• Predictions of the variance: O(t2)

• Predictions of the mean: O(t)
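To see where these costs come from (our summary of the standard argument): factorizing $k_{XX} + \sigma^2 I$, e.g. by a Cholesky decomposition, costs $O(t^3)$ and is done once; after caching $\alpha = (k_{XX} + \sigma^2 I)^{-1}(y - \mu_X)$, the predictive mean $\mu(x) + k_{xX}\,\alpha$ is an inner product of length $t$, i.e. $O(t)$ per test point, whereas the predictive variance still requires a solve against $k_{Xx}$, i.e. $O(t^2)$ per test point.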

38


Page 96:

Two components of Bayesian optimization

• The probabilistic model

– Typically used: Gaussian process

– Later: other models are possible, e.g., random forests

• The acquisition function

– Trades off exploration vs. exploitation

– We’ll discuss this in detail

39

Page 97:

Probability of Improvement
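The slide's formula is not in the transcript; for minimization, with GP posterior mean $\mu(x)$, posterior standard deviation $\sigma(x)$, best observed value $f_{\min}$, and $\Phi$ the standard normal CDF, the standard form is

$$\mathrm{PI}(x) \;=\; P\big(f(x) < f_{\min}\big) \;=\; \Phi\!\left(\frac{f_{\min} - \mu(x)}{\sigma(x)}\right)$$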

40

Page 98:

Expected Improvement

41

(the derivation of this integral’s closed-form solution will be an exercise)
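For reference (our notation; the slide's own statement is not reproduced), the integral in question is the expected improvement over the best observed value $f_{\min}$ under the GP posterior:

$$\mathrm{EI}(x) \;=\; \mathbb{E}\big[\max\{f_{\min} - f(x),\, 0\}\big] \;=\; \int_{-\infty}^{f_{\min}} \big(f_{\min} - f\big)\; \mathcal{N}\!\big(f;\, \mu(x),\, \sigma^2(x)\big)\, df$$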

Page 99:

Upper Confidence Bound (UCB)
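The slide's formula is missing from the transcript; the usual GP-UCB acquisition, stated here for minimization (i.e. as a lower confidence bound, in our notation), selects

$$x_{t+1} \;=\; \arg\min_{x}\; \mu(x) - \kappa\,\sigma(x)$$

where $\kappa \ge 0$ controls the exploration/exploitation trade-off (for maximization one maximizes $\mu(x) + \kappa\,\sigma(x)$ instead).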

42

Page 100:

Entropy Search

• Compute a probability distribution over which configuration is optimal

• Acquisition function: try to push this probability distribution as close to a delta distribution as possible

• One of the most powerful acquisition functions

– Can choose to actively evaluate in one region of the space to learn something about a different region of the space

43

Page 101:

Putting it all Together

• How to optimize the acquisition function?

– Subsidiary optimization method

– Important: in that subsidiary optimization, function evaluations are cheap (just evaluations of the GP).

44

Page 102:

Summary of Bayesian Optimization

• Bayesian optimization integrates

– prior information and

– the likelihood of the observed data

• It uses quite involved computation to select which function value to evaluate next

– Thus, it’s most useful for expensive blackbox functions

45

Page 103:

Overall summary

• Generalization: we need to safeguard against over-fitting

• Overview of hyperparameter optimization methods

• Bayesian optimization

– Based on linear regression & Gaussian processes

• Next week:

– Bayesian optimization with random forests

– Extensions and applications

46