Source: proceedings.mlr.press/v97/nayebi19a/nayebi19a-supp.pdf (2021. 1. 14.)
A Framework for Bayesian Optimization in Embedded Subspaces
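A. Proof of Claim 5

Proof of Claim 5. First note that $z \in [0, 1]$ directly implies $z^{1+\delta} \le z \le z^{1-\delta}$, which proves both lower bounds. It remains to prove the upper bounds in terms of $\delta$. We study the mapping $f : [0, 1] \to \mathbb{R}$, $z \mapsto z - z^{1+\delta}$. Clearly $f(0) = f(1) = 0 \le \delta$. Setting the derivative to zero and solving for $z$, we have

\[
f'(z) = 1 - (1+\delta)\, z^{\delta} = 0 \iff z = \left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}}.
\]

Note that $\left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}} \le 1$, thus

\[
f(z) \le \left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}} - \left(\frac{1}{1+\delta}\right)^{\frac{1+\delta}{\delta}}
= \left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}} \left(1 - \frac{1}{1+\delta}\right)
\le \frac{1+\delta-1}{1+\delta} = \frac{\delta}{1+\delta} \le \delta.
\]

Very similar arguments prove the claim for the mapping $f : [0, 1] \to \mathbb{R}$, $z \mapsto z^{1-\delta} - z$. Again, $f(0) = f(1) = 0 \le \delta$. Setting the derivative to zero and solving for $z$ yields

\[
f'(z) = (1-\delta)\, z^{-\delta} - 1 = 0 \iff z = \left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}}.
\]

Note that $\left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}} = (1-\delta)^{\frac{1}{\delta}} \le e^{-1}$, thus

\[
f(z) \le \left(\frac{1}{1-\delta}\right)^{-\frac{1-\delta}{\delta}} - \left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}}
= \left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}} \left(\frac{1}{1-\delta} - 1\right)
\le \frac{1-1+\delta}{1-\delta}\, e^{-1} \le 2\delta e^{-1} \le \delta.
\]

Here the step $\frac{\delta}{1-\delta} \le 2\delta$ holds for $\delta \le 1/2$, and the final inequality uses $2/e < 1$.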
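The two bounds derived in the proof can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: the function names `max_gap_upper` and `max_gap_lower` and the grid size are hypothetical choices, and the check assumes $\delta \in (0, 1/2]$. It evaluates both mappings on a grid over $[0, 1]$ and compares the grid maxima against $\delta/(1+\delta)$ and $2\delta e^{-1}$.

```python
import math


def max_gap_upper(delta, n=10001):
    """Grid maximum of z - z**(1 + delta) over z in [0, 1]."""
    grid = (i / (n - 1) for i in range(n))
    return max(z - z ** (1.0 + delta) for z in grid)


def max_gap_lower(delta, n=10001):
    """Grid maximum of z**(1 - delta) - z over z in [0, 1]."""
    grid = (i / (n - 1) for i in range(n))
    return max(z ** (1.0 - delta) - z for z in grid)


if __name__ == "__main__":
    # Values of delta are illustrative; the second bound needs delta <= 1/2.
    for delta in (0.01, 0.1, 0.25, 0.5):
        first = max_gap_upper(delta)
        second = max_gap_lower(delta)
        print(delta,
              first <= delta / (1.0 + delta),
              second <= 2.0 * delta / math.e)
```

A grid search only approximates the true maximum from below, so this confirms consistency with the bounds rather than proving them.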
B. The Dilation Issue in REMBO

In their seminal work proposing the REMBO variants, Wang et al. (2016b) noted that their random Gaussian embedding $A$ has a dilation that makes it necessary to scale REMBO's search space to $[-\sqrt{d}, +\sqrt{d}]^d$ to ensure that the optimum remains in the search domain. However, REMBO might select a point $y$ such that the random embedding $Ay$ is outside the actual box $\mathcal{X}$. In such a case it projects $Ay$ onto the boundary of $\mathcal{X}$ before evaluating $f$. This may lead to over-exploration of the boundary regions, which results in particularly bad performance when the optimal point lies in the interior. Binois et al. (2015) proposed a new warping kernel $k_\psi$ to address this issue by an improved preservation of distances among such points. This method was further improved in Binois et al. (2019). The analysis in Sect. C below shows that REMBO applies these complex corrections frequently.

In contrast, the count-sketch avoids such complex corrections. By construction of the inverse count-sketch $S$, the columns of the matrix are orthogonal, since each row has only a single non-zero entry. Moreover, for any fixed coordinate (dimension) $d_i$ of the low-dimensional space, the number of coordinates in the high-dimensional space that map to $d_i$ is concentrated at $D/d$, i.e., this number has mean $D/d$ and a standard deviation that shrinks relative to the mean as $D$ increases. The reason is that $h$ picks the target of each dimension uniformly at random. The dilation of $S$ is thus roughly $\sqrt{D/d}$, and the search domain $\mathcal{Y} = [-1, 1]^d$ is mapped to $S\mathcal{Y} = \{Sy \mid y \in \mathcal{Y}\} \subset \mathcal{X} = [-1, 1]^D$ with low distortion $(1 \pm \varepsilon)$ as given in Definition 1. Moreover, note that the back-projection is realized solely by copying the coordinates and multiplying them by a sign in $\{-1, 1\}$, which ensures that no point is projected outside the domain $\mathcal{X}$. Thus, HeSBO avoids complex corrections, in contrast to the REMBO variants described above.
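The back-projection described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `make_count_sketch`, `back_project` and `column_sizes` are hypothetical, and the hash $h$ and the signs are drawn with Python's standard `random` module. Each high-dimensional coordinate copies one low-dimensional coordinate and multiplies it by a sign, so a point in $[-1, 1]^d$ always lands in $[-1, 1]^D$.

```python
import random


def make_count_sketch(D, d, seed=0):
    """Draw a hash h: {0..D-1} -> {0..d-1} and signs in {-1, +1} (assumed setup)."""
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(D)]
    sigma = [rng.choice((-1, 1)) for _ in range(D)]
    return h, sigma


def back_project(y, h, sigma):
    """Map y in [-1, 1]^d to x in [-1, 1]^D by copying coordinates and flipping signs."""
    return [s * y[j] for j, s in zip(h, sigma)]


def column_sizes(h, d):
    """Number of high-dimensional coordinates mapped to each low-dimensional one."""
    counts = [0] * d
    for j in h:
        counts[j] += 1
    return counts


if __name__ == "__main__":
    D, d = 100, 4
    h, sigma = make_count_sketch(D, d)
    y = [1.0, -1.0, 0.3, -0.7]
    x = back_project(y, h, sigma)
    print(all(-1.0 <= v <= 1.0 for v in x))  # no clipping is ever needed
    print(column_sizes(h, d))                # each count has mean D / d
```

Since every row of the implied matrix has a single non-zero entry, the squared norm of the image equals the sum of `count_j * y_j**2`, which is where the dilation of roughly $\sqrt{D/d}$ comes from.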
C. Analysis of Correction Efforts in REMBO

We examine how often the REMBO variants choose sample points whose projection to the high-dimensional space is outside of the search domain $\mathcal{X}$. Note that these points are corrected by projecting them back onto the boundary $\partial\mathcal{X}$. Table 1 shows the results on Branin with input dimension $D = 100$ and three different choices for the target dimension $d$. We see that for larger values of $d$ almost all points chosen by REMBO are projected outside of $\mathcal{X}$. Note that all three algorithms use the same random projection in each iteration; their decisions on which points to sample are affected by the respective kernel. In particular, we see that $k_X$ performs better than $k_y$, as intended by Wang et al. (2016b), whereas $k_\psi$ of Binois et al. (2015) performs best among these. Note that the approach proposed in this paper completely avoids the need for corrections.

Table 1. The fraction of steps of the three REMBO variants that project the sample point outside $\mathcal{X}$. The data was collected by running the algorithms on Branin with input dimension $D = 100$ over 30 replications.

| Kernel | d = 2 | d = 3 | d = 4 |
| $k_y$ | 0.95 ± 0.006 | 0.993 ± 0.0008 | 0.9999 ± 0.0001 |
| $k_X$ | 0.93 ± 0.009 | 0.992 ± 0.001 | 0.9996 ± 0.0002 |
| $k_\psi$ | 0.88 ± 0.02 | 0.992 ± 0.003 | 0.9997 ± 0.0002 |
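To illustrate the kind of quantity reported in Table 1, the sketch below counts how often an embedded point leaves the box and shows the correction by clipping. It is a simplified illustration only and does not reproduce the experiment: it assumes points $y$ drawn uniformly from $[-\sqrt{d}, \sqrt{d}]^d$ rather than points chosen by Bayesian optimization, a standard Gaussian matrix $A$, and hypothetical function names `fraction_outside` and `project_to_box`.

```python
import math
import random


def project_to_box(x):
    """Clip each coordinate to [-1, 1], i.e. project onto the box."""
    return [max(-1.0, min(1.0, v)) for v in x]


def fraction_outside(D, d, n_samples=500, seed=0):
    """Fraction of uniformly drawn y whose embedding Ay leaves [-1, 1]^D.

    Assumption: A has i.i.d. standard Gaussian entries and y is uniform on
    [-sqrt(d), sqrt(d)]^d; the paper's numbers instead use points selected
    by the REMBO variants.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]
    radius = math.sqrt(d)
    outside = 0
    for _ in range(n_samples):
        y = [rng.uniform(-radius, radius) for _ in range(d)]
        x = [sum(a * b for a, b in zip(row, y)) for row in A]
        if any(abs(v) > 1.0 for v in x):
            outside += 1
    return outside / n_samples


if __name__ == "__main__":
    for d in (2, 3, 4):
        print(d, fraction_outside(100, d))
```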
D. Details on Experiments

We provide details needed to replicate the experiments in Table 2. All REMBO variants and HeSBO-EI use identical initial datasets in every replication to initialize the surrogate models, obtained via random Latin hypercube designs. For the other algorithms we used their preferred strategies to obtain the same number of initial data points. We implemented HeSBO-EI and all REMBO variants in Python 2.7 using GPy. They use a GP model with the Matérn-5/2 kernel and constant mean function. The acquisition function was optimized by L-BFGS-B leveraging its gradient (Frean & Boyle, 2008).

For the other algorithms we used the reference implementations provided by the authors, see the hyperlinks in the references. ADD was provided by Gardner in personal communication. For the neural network parameter search benchmark discussed in Sect. 3, we had to omit ADD because of software incompatibilities with this benchmark, and SKL due to its high running time.
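For reference, the Matérn-5/2 covariance with one length scale per input dimension (ARD) can be written out directly. The sketch below uses the standard textbook form of the kernel; it is not GPy's implementation, and the function name `matern52_ard` and its parameters are hypothetical.

```python
import math


def matern52_ard(x, x_prime, lengthscales, variance=1.0):
    """Matern-5/2 covariance with per-dimension length scales (ARD).

    k(x, x') = variance * (1 + sqrt(5) r + 5 r^2 / 3) * exp(-sqrt(5) r),
    where r is the Euclidean distance after scaling each dimension by its
    length scale.
    """
    r2 = sum(((a - b) / l) ** 2 for a, b, l in zip(x, x_prime, lengthscales))
    s = math.sqrt(5.0) * math.sqrt(r2)
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


if __name__ == "__main__":
    ls = [1.0, 0.5]
    print(matern52_ard([0.0, 0.0], [0.0, 0.0], ls))  # equals the variance
    print(matern52_ard([0.0, 0.0], [0.3, 0.1], ls))
```

A short length scale in one dimension makes the covariance decay faster along that dimension, which is the purpose of ARD.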
Table 2. Details needed to replicate the experiments.

| Feature | Choice |
| Number of initial data points | 10 |
| Sampling approach | Latin hypercube design provided by pyDOE |
| Kernel | Matérn 5/2 with ARD |
| Search space | $[-1, 1]^d$ |
| Method of GP fitting | GPy package (1.9.2) |
| Number of replications | 100 |
| Reported statistic | Median |
| Error bars | Standard error of median |
| Processor | Xeon Haswell E5-2695 Dual 2.3 GHz 14-core |
| Number of CPUs | 1 |
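The initial design listed in Table 2 is a Latin hypercube design on the search space $[-1, 1]^d$. The following dependency-free sketch shows the idea; it is an assumption-laden illustration rather than the pyDOE routine used in the experiments, and the name `latin_hypercube` is hypothetical. Each dimension is split into as many equal strata as there are points, and every stratum receives exactly one point.

```python
import random


def latin_hypercube(n_points, dim, seed=0):
    """Random Latin hypercube design with n_points points in [-1, 1]^dim."""
    rng = random.Random(seed)
    design = [[0.0] * dim for _ in range(n_points)]
    for j in range(dim):
        # Assign each point a distinct stratum in this dimension.
        strata = list(range(n_points))
        rng.shuffle(strata)
        for i, s in enumerate(strata):
            u = (s + rng.random()) / n_points  # uniform within stratum s of [0, 1)
            design[i][j] = 2.0 * u - 1.0       # rescale to [-1, 1)
    return design


if __name__ == "__main__":
    X0 = latin_hypercube(10, 4)  # 10 initial points, as in Table 2; d = 4 is illustrative
    for row in X0:
        print(row)
```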
E. Additional Plots for all Conducted Experiments

This section provides the plots for all experiments that we conducted as part of the evaluation. This includes plots already shown in Sect. 3 and additional plots that were omitted due to space constraints. The plots are organized according to the same subsections.
E.1. Performance Evaluation
The performance on Branin as a function of the number of evaluations is given in Fig. 5.

The performance on Hartmann6 as a function of the number of evaluations is given in Fig. 6.

The performance on Rosenbrock as a function of the number of evaluations is given in Fig. 7.

The performance on Styblinski-Tang as a function of the number of evaluations is given in Fig. 8.
E.2. Robustness
The robustness plots on Branin and Hartmann6 are given in Fig. 9.
E.3. Scalability
The performance on Branin, Rosenbrock and Styblinski-Tang as a function of the running time is given in Fig. 10.
E.4. High Dimensional Network Architecture Search
The performance on neural network optimization as a function of the number of evaluations is given in Fig. 11.
[Figure: BRANIN. 3 × 3 grid of panels; columns D = 25, D = 100, D = 1000; x-axis: Number of evaluations; y-axis: Function values. Top row legend: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI. Middle row legend: Rembo-Kx, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: Rembo-Kx, BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM). Bottom row legend: Rembo-Kx, BOCK, EBO, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM).]

Figure 5. Performance plots for Branin as a function of the evaluations for the projection-based algorithms (top), the state-of-the-art algorithms (middle row) and a close-up of the best algorithms (bottom) for different input dimensions D = 25 (left), D = 100 (middle column), and D = 1000 (right).
[Figure: HARTMANN6. 3 × 2 grid of panels; rows D = 25, D = 100, D = 1000; x-axis: Number of evaluations; y-axis: Function values. Left column legend: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI. Right column legend: Rembo-Kx, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: Rembo-Kx, BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM).]

Figure 6. Performance plots for Hartmann-6 as a function of the evaluations for the projection-based algorithms (left) and the state-of-the-art algorithms (right) for different input dimensions D = 25 (top), D = 100 (middle), and D = 1000 (bottom).
[Figure: ROSENBROCK. 3 × 3 grid of panels; columns D = 25, D = 100, D = 1000; x-axis: Number of evaluations; y-axis: Function values. Top row legend: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI. Middle row legend: Rembo-Kx, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: Rembo-Kx, BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM). Bottom row legend: BOCK, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM).]

Figure 7. Performance plots for Rosenbrock as a function of the evaluations for the projection-based algorithms (top), the state-of-the-art algorithms (middle row) and a close-up of the best algorithms (bottom) for different input dimensions D = 25 (left), D = 100 (middle column), and D = 1000 (right).
[Figure: STYBLINSKI-TANG. 3 × 3 grid of panels; columns D = 25, D = 100, D = 1000; x-axis: Number of evaluations; y-axis: Function values. Top row legend: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI. Middle row legend: Rembo-Kx, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: Rembo-Kx, BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM). Bottom row legend: BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM (for D = 1000: BOCK, HeSBO-EI, HeSBO-KG, HeSBO-BM).]

Figure 8. Performance plots for Styblinski-Tang as a function of the evaluations for the projection-based algorithms (top), the state-of-the-art algorithms (middle row) and a close-up of the best algorithms (bottom) for different input dimensions D = 25 (left), D = 100 (middle column), and D = 1000 (right).
[Figure: 3 × 2 grid of panels; left column Branin, right column Hartmann6; rows D = 25, D = 100, D = 1000; x-axis: Number of evaluations; y-axis: Function values. Legend in every panel: Rembo-Kx (d=2), Rembo-Kx (d=4), Rembo-Kx (d=8), HeSBO-EI (d=2), HeSBO-EI (d=4), HeSBO-EI (d=8).]

Figure 9. The robustness analysis for Branin (left) and Hartmann6 (right) for D = 25 (top), D = 100 (middle), D = 1000 (bottom) and different values of d.
[Figure: 3 × 2 grid of panels; columns ALL and BEST; rows 100 dim Branin, 100 dim Rosenbrock, 100 dim StybTang; x-axis: Time (seconds); y-axis: Function values. Legend in the ALL column: Rembo-Ky, Rembo-Kx, Rembo-Kψ, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM; the BEST column shows a subset of these.]

Figure 10. The scalability analysis for all algorithms (left) and a close-up on the best (right), each with input dimension D = 100 and different target dimensions d: Branin d = 4 (top), Rosenbrock d = 4 (middle) and Styblinski-Tang d = 12 (bottom). Note that in the plot on the bottom right, ADD's performance equals that of EBO on Styblinski-Tang with d = 12, which is why it is not visible.
[Figure: two panels titled "100 dim MNIST"; x-axis: Number of evaluations; y-axis: Function values. Top legend: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI. Bottom legend: Rembo-Kx, BOCK, EBO, HeSBO-EI, HeSBO-KG, HeSBO-BM.]

Figure 11. The performance comparison of projection-based algorithms (top) and state-of-the-art algorithms (bottom) on the MNIST neural network benchmark with input dimension D = 100 and target dimension d = 12.