Gaussian Process Optimization with Mutual Information
Émile Contal¹, Vianney Perchet², Nicolas Vayatis¹
¹CMLA – École Normale Supérieure de Cachan & CNRS, France
²LPMA – Université Paris Diderot & CNRS, France
June 22, 2014
Background Algorithm Theoretical Analysis Experiments
Problem Statement
Sequential Optimization
Let f : X → R, where X ⊂ R^d is compact and convex. We consider the problem of finding the maximum of f, denoted by
f(x⋆) = max_{x ∈ X} f(x),
via sequential queries f(x₁), f(x₂), …
Noisy Observations
At iteration T we choose x_{T+1} using the previous noisy observations Y_T = (y₁, …, y_T), where for all t ≤ T:
y_t = f(x_t) + ε_t, with ε_t iid ∼ N(0, η²).
Contal, Perchet, Vayatis Gaussian Process Optimization with Mutual Information 2
Gaussian Processes
Definition
f ∼ GP(m, k), with mean function m : X → R and kernel function k : X × X → R⁺, when for all x₁, …, xₙ ∈ X we have:
(f(x₁), …, f(xₙ)) ∼ N([m(x_i)]_i, [k(x_i, x_j)]_{i,j}).
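As an illustration of this definition, the sketch below draws one joint sample of (f(x₁), …, f(xₙ)) on a finite grid. The zero mean, the squared-exponential kernel, and its lengthscale are assumptions made for the example; the slides do not fix a kernel.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.2):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)           # finite set of points x_1, ..., x_n
K = rbf_kernel(x, x)                 # [k(x_i, x_j)]_{i,j}
m = np.zeros_like(x)                 # zero mean function (assumption)
# (f(x_1), ..., f(x_n)) ~ N(m, K), per the definition above;
# a small jitter keeps the covariance numerically positive definite
f = rng.multivariate_normal(m, K + 1e-10 * np.eye(len(x)))
```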
Bayesian Inference
Given Y_T, the posterior distribution Pr[f | Y_T] is a GP with mean μ_{T+1} (prediction) and variance σ²_{T+1} (uncertainty), computed by Bayesian inference.
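The slide does not spell out the posterior formulas; a minimal sketch of exact GP inference, assuming a zero-mean prior and a unit-variance RBF kernel (both example choices, not from the slides), is:

```python
import numpy as np

def rbf(X1, X2, l=0.2):
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * l ** 2))

def gp_posterior(X_obs, y_obs, X_star, eta2=0.01):
    """Posterior mean and variance of a zero-mean GP given noisy observations."""
    K = rbf(X_obs, X_obs) + eta2 * np.eye(len(X_obs))  # K_T + eta^2 I
    Ks = rbf(X_obs, X_star)                            # cross-covariances
    alpha = np.linalg.solve(K, y_obs)
    mu = Ks.T @ alpha                                  # prediction mu_{T+1}
    v = np.linalg.solve(K, Ks)
    var = 1.0 - np.sum(Ks * v, axis=0)                 # uncertainty sigma^2_{T+1}
    return mu, var                                     # (k(x, x) = 1 for this RBF)
```

The variance collapses near observed points and returns to the prior value far from them.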
Gaussian Processes
[Figure: illustration of a one-dimensional Gaussian process]
Objective
Cumulative Regret
The efficiency of a policy is measured via the cumulative regret:
R_T = Σ_{t=1}^{T} (f(x⋆) − f(x_t)).
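In synthetic benchmarks where f and x⋆ are known, the regret can be computed directly from this definition; the quadratic test function and its maximizer below are purely illustrative.

```python
import numpy as np

def cumulative_regret(f, x_star, queries):
    """R_T = sum over the queried points of f(x*) - f(x_t)."""
    return sum(f(x_star) - f(x) for x in queries)

# toy example with a known maximizer at x = 0.3 (not from the slides)
f = lambda x: -(x - 0.3) ** 2
R = cumulative_regret(f, 0.3, [0.0, 0.2, 0.3])
```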
Goal
The cumulative regret is unknown in practice (it depends on the unobserved value f(x⋆)). Our aim is to obtain upper bounds on R_T that hold with high probability.
Related Work
Dani et al. 2008
In the general linear optimization problem, we have a lower bound on the cumulative regret of Ω(d√T).
Srinivas et al. 2012
For the GP-UCB algorithm with a linear GP, we have an upper bound on the cumulative regret of O(d√T) with high probability.
Mutual Information – An Important Ingredient
Information Gain
The information gain on f at X_T is the mutual information between f and Y_T. For a GP distribution, with K_T the kernel matrix of X_T:
I_T(X_T) = ½ log det(I + η⁻² K_T).
We define γ_T = max_{|X|=T} I_T(X), the maximum information gain achievable by a sequence of T query points.
Empirical Lower Bound
For GPs with bounded variance, we have [Srinivas et al. 2012]:
γ̂_T = Σ_{t=1}^{T} σ²_t(x_t) ≤ C γ_T, where C = 2 / log(1 + η⁻²).
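The information gain is straightforward to evaluate from the kernel matrix; a small sketch (the noise level passed in is a placeholder):

```python
import numpy as np

def information_gain(K_T, eta2=0.01):
    """I_T(X_T) = 1/2 log det(I + eta^-2 K_T), for kernel matrix K_T."""
    n = K_T.shape[0]
    # slogdet is numerically safer than log(det(...)) for large matrices
    sign, logdet = np.linalg.slogdet(np.eye(n) + K_T / eta2)
    return 0.5 * logdet
```

For T independent unit-variance points (K_T = I), this reduces to (T/2) log(1 + η⁻²), matching the constant C above.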
GP-MI – A Novel Algorithm for Sequential Optimization
γ̂₀ ← 0
for t = 1, 2, … do
    compute μ_t and σ²_t using Bayesian inference
    φ_t(x) ← √α (√(σ²_t(x) + γ̂_{t−1}) − √(γ̂_{t−1}))
    x_t ← argmax_{x ∈ X} μ_t(x) + φ_t(x)
    γ̂_t ← γ̂_{t−1} + σ²_t(x_t)
    query at x_t and observe y_t
end
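The loop above can be sketched over a finite one-dimensional candidate grid. The RBF kernel, its lengthscale, the grid, and the noise level are all assumptions made for the example, and a practical implementation would reuse Cholesky factors across iterations instead of re-solving the linear systems at every step.

```python
import numpy as np

def rbf(X1, X2, l=0.2):
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * l ** 2))

def gp_mi(f, X_grid, T=30, eta2=0.01, delta=1e-6, rng=None):
    """Sketch of GP-MI over a finite candidate set X_grid (1-D here)."""
    if rng is None:
        rng = np.random.default_rng(0)
    alpha = np.log(2 / delta)                 # alpha = log(2/delta), per the theorem
    gamma = 0.0                               # gamma_hat_0 <- 0
    Xq, yq = [], []
    for t in range(T):
        if not Xq:                            # no data yet: prior mean 0, variance 1
            mu, var = np.zeros(len(X_grid)), np.ones(len(X_grid))
        else:
            Xo, yo = np.array(Xq), np.array(yq)
            K = rbf(Xo, Xo) + eta2 * np.eye(len(Xo))
            Ks = rbf(Xo, X_grid)
            mu = Ks.T @ np.linalg.solve(K, yo)
            var = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0), 0, None)
        phi = np.sqrt(alpha) * (np.sqrt(var + gamma) - np.sqrt(gamma))
        i = np.argmax(mu + phi)               # x_t <- argmax mu_t(x) + phi_t(x)
        gamma += var[i]                       # gamma_hat_t <- gamma_hat_{t-1} + var
        Xq.append(X_grid[i])
        yq.append(f(X_grid[i]) + np.sqrt(eta2) * rng.normal())  # noisy observation
    return np.array(Xq), np.array(yq)
```

Early on γ̂ is small, so φ_t rewards uncertain regions; as γ̂ grows, φ_t flattens and the rule increasingly exploits the posterior mean.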
Main Result – Regret bounds for GP-MI
For all δ > 0 and T > 1, set α = log(2/δ). The cumulative regret R_T incurred by the GP-MI algorithm on f, distributed as a GP perturbed by independent Gaussian noise with variance η², satisfies the following bound:
Pr[ R_T ≤ 5√(α C γ_T) + 4√α ] ≥ 1 − δ,
where C = 2 / log(1 + η⁻²).
Application to Specific Kernels
- For the linear kernel: R_T = O(√(d log T))
- For the RBF kernel: R_T = O(√((log T)^{d+1}))
- For the Matérn kernel: R_T = O(√(T^a log T)), where a = d(d+1) / (2ν + d(d+1)) < 1 and ν is the Matérn parameter.
High-Probability Bounds for RT
Gaussian Martingale
The sequence M_T = R_T − Σ_{t=1}^{T} (μ_t(x⋆) − μ_t(x_t)) is a Gaussian martingale with respect to Y_{T−1}.
Concentration Inequalities [Bercu and Touati 2008]
For all δ > 0 and T > 1, with y = 8(C γ_T + 1), we have:
Pr[ M_T ≤ √(2αy) + √(2α/y) Σ_{t=1}^{T} σ²_t(x⋆) ] ≥ 1 − δ ,
Pr[ R_T ≤ √(2αy) + √(2α/y) Σ_{t=1}^{T} σ²_t(x⋆) + Σ_{t=1}^{T} (μ_t(x⋆) − μ_t(x_t)) ] ≥ 1 − δ .
Regret Bounds for the GP-MI Algorithm
Inequality for the GP-MI Algorithm
Using the function φ_t as defined in GP-MI, we have:
Σ_{t=1}^{T} (μ_t(x⋆) − μ_t(x_t)) ≤ Σ_{t=1}^{T} (φ_t(x_t) − φ_t(x⋆)) ≤ √(α C γ_T) − √(2α/y) Σ_{t=1}^{T} σ²_t(x⋆).
Regret bounds
Plugging this inequality into the previous concentration bound for R_T:
Pr[ R_T ≤ 5√(α C γ_T) + 4√α ] ≥ 1 − δ.
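The constants can be traced as follows: the Σ_t σ²_t(x⋆) terms cancel between the concentration bound and the inequality for φ_t, leaving R_T ≤ √(2αy) + √(α C γ_T). A sketch of the remaining arithmetic, substituting y = 8(C γ_T + 1) and using √(a + b) ≤ √a + √b:

```latex
R_T \le \sqrt{2\alpha y} + \sqrt{\alpha C \gamma_T}
    = \sqrt{16\,\alpha (C\gamma_T + 1)} + \sqrt{\alpha C \gamma_T}
  \le 4\sqrt{\alpha C \gamma_T} + 4\sqrt{\alpha} + \sqrt{\alpha C \gamma_T}
    = 5\sqrt{\alpha C \gamma_T} + 4\sqrt{\alpha}.
```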
Empirical Average Regret
[Figure: average regret R_T/T versus iteration for the EI, UCB, and MI algorithms on (a) generated GP (d = 2), (b) generated GP (d = 4), (c) Gaussian mixture, (d) Himmelblau]
Empirical Average Regret
[Figure: average regret R_T/T versus iteration for the EI, UCB, and MI algorithms on (e) Branin, (f) Goldstein, (g) Tsunamis, (h) Mackey-Glass]
Implementation
Exact Inference for Gaussian Likelihood
Numerical complexity is O(T²) per iteration, using sequential Cholesky updates of the covariance matrix.
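A sketch of the sequential update behind the O(T²) claim: when one observation is added, the Cholesky factor of K_T + η²I is extended by a single triangular solve in O(T²), instead of refactoring from scratch in O(T³). The function and argument names here are illustrative.

```python
import numpy as np

def chol_append(L, k_new, kappa, eta2=0.01):
    """Extend the Cholesky factor of K + eta^2 I when one observation is added.

    L:      (t, t) lower-triangular factor of the current covariance matrix
    k_new:  (t,) covariances between the new point and the t old points
    kappa:  prior variance k(x, x) of the new point
    Cost is O(t^2), dominated by the triangular solve below.
    """
    t = L.shape[0]
    l = np.linalg.solve(L, k_new)          # triangular system (plain solve for brevity)
    d = np.sqrt(kappa + eta2 - l @ l)      # new diagonal entry
    L_ext = np.zeros((t + 1, t + 1))
    L_ext[:t, :t] = L                      # old factor is unchanged
    L_ext[t, :t] = l
    L_ext[t, t] = d
    return L_ext
```

Appending points one by one reproduces the factor obtained by a full factorization of the final matrix.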
Algorithms for non-Gaussian likelihood
For other likelihood functions (e.g. Laplace or Student's t), one can use expectation propagation (EP) or Monte Carlo sampling.
Open Questions and Discussion
- Theoretical guarantees for simple regret
- Empirical performance for simple regret
- Kernel learning procedure
- Calibration of δ