a. proof of claim5proceedings.mlr.press/v97/nayebi19a/nayebi19a-supp.pdf · 2021. 1. 14. ·...

A Framework for Bayesian Optimization in Embedded Subspaces

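A. Proof of Claim 5

Proof of Claim 5. First note that $z \in [0, 1]$ directly implies $z^{1+\delta} \le z \le z^{1-\delta}$, which proves both lower bounds. It remains to prove the upper bounds of $\delta$. We study the mapping $f : [0, 1] \to \mathbb{R}$, $z \mapsto z - z^{1+\delta}$. Clearly $f(0) = f(1) = 0 \le \delta$. Setting the derivative to zero and solving for $z$, we have

\[
f'(z) = 1 - (1+\delta)\, z^{\delta} = 0 \iff z = \left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}}.
\]

Note that $\left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}} \le 1$, thus

\[
f(z) \le \left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}} - \left(\frac{1}{1+\delta}\right)^{\frac{1+\delta}{\delta}}
= \left(\frac{1}{1+\delta}\right)^{\frac{1}{\delta}} \left(1 - \frac{1}{1+\delta}\right)
\le \frac{1+\delta-1}{1+\delta} = \frac{\delta}{1+\delta} \le \delta.
\]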

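Very similar arguments prove the claim for the mapping $f : [0, 1] \to \mathbb{R}$, $z \mapsto z^{1-\delta} - z$. Again, $f(0) = f(1) = 0 \le \delta$. Setting the derivative to zero and solving for $z$ yields

\[
f'(z) = (1-\delta)\, z^{-\delta} - 1 = 0 \iff z = \left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}}.
\]

Note that $\left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}} = (1-\delta)^{\frac{1}{\delta}} \le e^{-1}$, thus

\[
f(z) \le \left(\frac{1}{1-\delta}\right)^{-\frac{1-\delta}{\delta}} - \left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}}
= \left(\frac{1}{1-\delta}\right)^{-\frac{1}{\delta}} \left(\frac{1}{1-\delta} - 1\right)
\le \frac{1 - 1 + \delta}{1-\delta}\, e^{-1} \le 2\delta e^{-1} \le \delta.
\]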
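The two bounds can also be checked numerically. The following is a minimal sketch, not part of the original proof: it evaluates both mappings on a grid over $[0, 1]$ and compares the largest value with $\delta$ and with the value at the closed-form stationary point. The function names and the tested values of $\delta$ are illustrative assumptions; the values are restricted to $\delta \le 1/2$ because the step $\frac{\delta}{1-\delta} \le 2\delta$ in the second chain appears to rely on it.

```python
import math


def upper_gap(z, delta):
    """First mapping from the proof: f(z) = z - z^(1 + delta)."""
    return z - z ** (1.0 + delta)


def lower_gap(z, delta):
    """Second mapping from the proof: f(z) = z^(1 - delta) - z."""
    return z ** (1.0 - delta) - z


def stationary_points(delta):
    """Closed-form solutions of f'(z) = 0 for the two mappings."""
    z_upper = (1.0 / (1.0 + delta)) ** (1.0 / delta)
    z_lower = (1.0 - delta) ** (1.0 / delta)
    return z_upper, z_lower


def max_gaps_on_grid(delta, grid_size=2001):
    """Largest value of each mapping on an evenly spaced grid over [0, 1]."""
    grid = [float(i) / (grid_size - 1) for i in range(grid_size)]
    return (max(upper_gap(z, delta) for z in grid),
            max(lower_gap(z, delta) for z in grid))


if __name__ == "__main__":
    # Illustrative values only; delta <= 1/2 is an assumption of this sketch.
    for delta in (0.05, 0.1, 0.25, 0.5):
        g_up, g_lo = max_gaps_on_grid(delta)
        print("delta={}: max(z - z^(1+delta))={:.4f}, "
              "max(z^(1-delta) - z)={:.4f}".format(delta, g_up, g_lo))
```

Both maxima stay between 0 and $\delta$ for the tested values, in line with the claim.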

B. The Dilation Issue in REMBO

In their seminal work proposing the REMBO variants, Wang et al. (2016b) noted that their random Gaussian embedding $A$ has a dilation that necessitates scaling REMBO's search space to $[-\sqrt{d}, +\sqrt{d}]^d$ to ensure that the optimum remains in the search domain. However, REMBO might select a point $y$ such that the random embedding $Ay$ is outside the actual box $\mathcal{X}$. In such a case it projects $Ay$ onto the boundary of $\mathcal{X}$ before evaluating $f$. This may lead to over-exploration of the boundary regions, which leads to particularly bad performance when the optimal point lies in the interior. Binois et al. (2015) proposed a new warping kernel $k_\psi$ to address this issue by an improved preservation of distances among such points. This method was further improved in (Binois et al., 2019). The analysis in Sect. C below shows that REMBO applies these complex corrections frequently.

In contrast, the count-sketch avoids such complex corrections. By construction of the inverse count-sketch $S$, the columns of the matrix are orthogonal, since each row has only a single non-zero entry. Moreover, for any fixed coordinate (dimension) $d_i$ of the low-dimensional space, the number of coordinates in the high-dimensional space that map to $d_i$ is concentrated at $D/d$, i.e., this number has mean $D/d$ and a variance that drops (quickly) as $D$ increases. The reason is that $h$ picks the target of each dimension uniformly at random. The dilation of $S$ is thus roughly $\sqrt{D/d}$, and the search domain $\mathcal{Y} = [-1, 1]^d$ is mapped to $S\mathcal{Y} = \{Sy \mid y \in \mathcal{Y}\} \subset \mathcal{X} = [-1, 1]^D$ with low distortion $(1 \pm \varepsilon)$ as given in Definition 1. Moreover, note that the back-projection is realized solely by copying the coordinates and multiplying them by a sign in $\{-1, 1\}$, which ensures that no point is projected outside the domain $\mathcal{X}$. Thus, HeSBO avoids complex corrections, in contrast to the REMBO variants described above.
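The back-projection described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the names `make_embedding`, `back_project`, and `sigma` are hypothetical, and the hash function $h$ and the signs are drawn with Python's standard `random` module.

```python
import random
from collections import Counter


def make_embedding(D, d, seed=0):
    """Draw the two ingredients of the inverse count-sketch.

    h[j] is the low-dimensional coordinate that high-dimensional
    coordinate j copies from; sigma[j] is its sign in {-1, +1}.
    """
    rng = random.Random(seed)
    h = [rng.randrange(d) for _ in range(D)]
    sigma = [rng.choice((-1, 1)) for _ in range(D)]
    return h, sigma


def back_project(y, h, sigma):
    """Map y in [-1, 1]^d to x in [-1, 1]^D by copying and sign flipping."""
    return [sign * y[target] for target, sign in zip(h, sigma)]


if __name__ == "__main__":
    D, d = 1000, 4  # illustrative sizes, not taken from the experiments
    h, sigma = make_embedding(D, d)
    rng = random.Random(1)
    y = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    x = back_project(y, h, sigma)
    counts = Counter(h)
    print("largest |x_j|:", max(abs(v) for v in x))
    print("coordinates per low-dimensional dimension:",
          [counts[i] for i in range(d)], "versus D/d =", float(D) / d)
```

Because every entry of the result is a signed copy of an entry of `y`, the point cannot leave $[-1, 1]^D$, so no correction step is needed.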

C. Analysis of Correction Efforts in REMBO

We examine how often the REMBO variants choose points to sample whose projection to the high-dimensional space is outside of the search domain $\mathcal{X}$. Note that these points are corrected by projecting them back onto the boundary $\partial\mathcal{X}$. Table 1 shows the results on Branin with input dimension $D = 100$ and three different choices for the target dimension $d$. We see that for larger values of $d$ almost all points chosen by REMBO are projected outside of $\mathcal{X}$. Note that although all three algorithms use the same random projection in each iteration, their decisions about which points to sample are affected by the respective kernel. In particular, we see that $k_x$ performs better than $k_y$ as intended by Wang et al. (2016b), whereas $k_\psi$ of Binois et al. (2015) performs best among these. Note that the approach proposed in this paper completely avoids the need for corrections.

Table 1. The fraction of steps of the three REMBO variants that project the sample point outside $\mathcal{X}$. The data was collected by running the algorithms on Branin with input dimension $D = 100$ over 30 replications.

Kernel      d = 2           d = 3             d = 4
$k_y$       0.95 ± 0.006    0.993 ± 0.0008    0.9999 ± 0.0001
$k_x$       0.93 ± 0.009    0.992 ± 0.001     0.9996 ± 0.0002
$k_\psi$    0.88 ± 0.02     0.992 ± 0.003     0.9997 ± 0.0002
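To get a feeling for why the embedded point leaves the box so often, the following rough sketch counts how many points $y$ have an image $Ay$ outside $[-1, 1]^D$. It is not the experiment behind Table 1: here $y$ is drawn uniformly from $[-\sqrt{d}, \sqrt{d}]^d$ instead of being chosen by an acquisition function, the entries of $A$ are assumed to be i.i.d. standard normal, and all function names are hypothetical. The resulting fractions are therefore not expected to reproduce the table.

```python
import math
import random


def gaussian_embedding(D, d, rng):
    """D x d matrix with i.i.d. standard normal entries (an assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]


def leaves_box(A, y):
    """True if some coordinate of A y falls outside [-1, 1]."""
    return any(abs(sum(a * v for a, v in zip(row, y))) > 1.0 for row in A)


def fraction_outside(D, d, n_points=2000, seed=0):
    """Fraction of uniformly drawn y whose image A y leaves [-1, 1]^D."""
    rng = random.Random(seed)
    A = gaussian_embedding(D, d, rng)
    bound = math.sqrt(d)
    outside = 0
    for _ in range(n_points):
        y = [rng.uniform(-bound, bound) for _ in range(d)]
        if leaves_box(A, y):
            outside += 1
    return float(outside) / n_points


if __name__ == "__main__":
    for d in (2, 3, 4):
        print("d = {}: fraction outside = {:.3f}".format(
            d, fraction_outside(100, d)))
```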


D. Details on Experiments

We provide details needed to replicate the experiments in Table 2. All REMBO variants and HeSBO-EI use identical initial datasets in every replication to initialize the surrogate models, obtained via random Latin hypercube designs. For the other algorithms we used their preferred strategies to obtain the same number of initial data points. We implemented HeSBO-EI and all REMBO variants in Python 2.7 using GPy. They use a GP model with the 5/2-Matérn kernel and constant mean function. The acquisition function was optimized by L-BFGS-B leveraging its gradient (Frean & Boyle, 2008).
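For reference, the 5/2-Matérn kernel with one lengthscale per input dimension (ARD) has the standard closed form $k(x, x') = \sigma^2 \left(1 + \sqrt{5}\, r + \tfrac{5}{3} r^2\right) e^{-\sqrt{5}\, r}$ with $r^2 = \sum_i (x_i - x'_i)^2 / \ell_i^2$. The sketch below writes this formula out in plain Python. It does not call GPy, and the function name, the default variance, and the example lengthscales are illustrative assumptions; in practice these hyperparameters are fitted to the data.

```python
import math


def matern52_ard(x, x_prime, lengthscales, variance=1.0):
    """Matern 5/2 kernel with a separate lengthscale per dimension (ARD)."""
    # Scaled Euclidean distance r between the two inputs.
    r2 = sum(((a - b) / l) ** 2
             for a, b, l in zip(x, x_prime, lengthscales))
    s = math.sqrt(5.0) * math.sqrt(r2)
    # (1 + sqrt(5) r + 5 r^2 / 3) * exp(-sqrt(5) r), written in terms of s.
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


if __name__ == "__main__":
    lengthscales = [1.0, 0.5]  # hypothetical values
    print(matern52_ard([0.0, 0.0], [0.0, 0.0], lengthscales))
    print(matern52_ard([0.0, 0.0], [0.3, 0.3], lengthscales))
```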

For the other algorithms we used the reference implementations provided by the authors; see the hyperlinks in the references. ADD was provided by Gardner in personal communication. For the neural network parameter search benchmark discussed in Sect. 3, we had to omit ADD because of software incompatibilities with this benchmark and SKL due to its high running time.

Table 2. Details needed to replicate the experiments.

Feature                          Choice
Number of initial data points    10
Sampling approach                Latin hypercube design provided by pyDOE
Kernel                           Matérn 5/2-ARD
Search space                     $[-1, 1]^d$
Method of GP fitting             GPy package (1.9.2)
Number of replications           100
Reported statistic               Median
Error bars                       Standard error of median
Processor                        Xeon Haswell E5-2695 Dual 2.3 GHz 14-core
Number of CPUs                   1
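The initial design can be illustrated with a small, self-contained Latin hypercube sampler. The experiments used pyDOE for this step; the sketch below is a stand-in written with the standard library only and makes no claim to match pyDOE's implementation. The function name and the seed are hypothetical. It draws 10 points in $[-1, 1]^d$ so that, in every dimension, each of the 10 equal-width strata contains exactly one point.

```python
import random


def latin_hypercube(n_points, dim, low=-1.0, high=1.0, seed=0):
    """Random Latin hypercube design with n_points rows in [low, high]^dim."""
    rng = random.Random(seed)
    columns = []
    for _ in range(dim):
        # One uniform draw inside each of the n_points strata of [0, 1) ...
        strata = [(i + rng.random()) / n_points for i in range(n_points)]
        # ... assigned to the points in random order, per dimension.
        rng.shuffle(strata)
        columns.append(strata)
    return [[low + (high - low) * columns[k][i] for k in range(dim)]
            for i in range(n_points)]


if __name__ == "__main__":
    design = latin_hypercube(n_points=10, dim=4)
    for point in design:
        print(["{:+.3f}".format(v) for v in point])
```

Sharing the seed across methods is one simple way to give several algorithms the same initial dataset in a replication.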

E. Additional Plots for all Conducted Experiments

This section provides the plots for all experiments that we conducted as part of the evaluation. This includes plots already shown in Sect. 3 and additional plots that were omitted due to space constraints. The plots are organized according to the same subsections.

E.1. Performance Evaluation

The performances on Branin as a function of the evaluations are given in Fig. 5.

The performances on Hartmann6 as a function of the evaluations are given in Fig. 6.

The performances on Rosenbrock as a function of the evaluations are given in Fig. 7.

The performances on Styblinski-Tang as a function of the evaluations are given in Fig. 8.

E.2. Robustness

The robustness plots on Branin and Hartmann6 are given in Fig. 9.

E.3. Scalability

The performances on Branin, Rosenbrock and Styblinski-Tang as a function of the running time are given in Fig. 10.

E.4. High Dimensional Network Architecture Search

The performances on neural network optimization as a function of the evaluations are given in Fig. 11.


[Figure 5 plots omitted: a 3 × 3 grid of panels titled "25 dim Branin", "100 dim Branin", and "1000 dim Branin" (columns D = 25, D = 100, D = 1000). x-axis: number of evaluations; y-axis: function values. Legend entries: Rembo-Ky, Rembo-Kx, Rembo-Kψ, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM.]

Figure 5. Performance plots for Branin as a function of the evaluations for the projection based algorithms (top), the state-of-the-art algorithms (middle row) and a close-up of the best algorithms (bottom) for different input dimensions D = 25 (left), D = 100 (middle column), and D = 1000 (right).


[Figure 6 plots omitted: a 3 × 2 grid of panels titled "25 dim Hartmann6", "100 dim Hartmann6", and "1000 dim Hartmann6". y-axis: function values. Surviving legend entries: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI.]

Figure 6. Performance plots for Hartmann-6 as a function of the evaluations for the projection based algorithms (left) and the state-of-the-art algorithms (right) for different input dimensions D = 25 (top), D = 100 (middle), and D = 1000 (bottom).


[Figure 7 plots omitted: a 3 × 3 grid of panels titled "25 dim Rosenbrock", "100 dim Rosenbrock", and "1000 dim Rosenbrock" (columns D = 25, D = 100, D = 1000). x-axis: number of evaluations; y-axis: function values. Surviving legend entries: Rembo-Ky, Rembo-Kx, Rembo-Kψ, BOCK, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM.]

Figure 7. Performance plots for Rosenbrock as a function of the evaluations for the projection based algorithms (top), the state-of-the-art algorithms (middle row) and a close-up of the best algorithms (bottom) for different input dimensions D = 25 (left), D = 100 (middle column), and D = 1000 (right).


[Figure 8 plots omitted: a 3 × 3 grid of panels for Styblinski-Tang (columns D = 25, D = 100, D = 1000). y-axis: function values. Surviving legend entries: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI.]

Figure 8. Performance plots for Styblinski-Tang as a function of the evaluations for the projection based algorithms (top), the state-of-the-art algorithms (middle row) and a close-up of the best algorithms (bottom) for different input dimensions D = 25 (left), D = 100 (middle column), and D = 1000 (right).


[Figure 9 plots omitted: a 3 × 2 grid of panels titled "25 dim Branin", "25 dim Hartmann6", "100 dim Branin", "100 dim Hartmann6", "1000 dim Branin", and "1000 dim Hartmann6". x-axis: number of evaluations; y-axis: function values. Legend entries: Rembo-Kx (d=2), Rembo-Kx (d=4), Rembo-Kx (d=8), HeSBO-EI (d=2), HeSBO-EI (d=4), HeSBO-EI (d=8).]

Figure 9. The robustness analysis for Branin (left) and Hartmann6 (right) for D = 25 (top), D = 100 (middle), D = 1000 (bottom) and different values of d.


[Figure 10 plots omitted: a 3 × 2 grid of panels with column headers "All" and "Best", titled "100 dim Branin", "100 dim Rosenbrock", and "100 dim StybTang". x-axis: time (seconds); y-axis: function values. Legend entries: Rembo-Ky, Rembo-Kx, Rembo-Kψ, BOCK, EBO, SKL, ADD, HeSBO-EI, HeSBO-KG, HeSBO-BM.]

Figure 10. The scalability analysis for all algorithms (left) and a close-up on the best (right), each with input dimension D = 100 and different target dimensions d: Branin d = 4 (top), Rosenbrock d = 4 (middle) and Styblinski-Tang d = 12 (bottom). Note that in the plot on the bottom right, ADD's performance equals that of EBO on Styblinski-Tang with d = 12, which is why it is not visible.



[Figure 11 plots omitted: two panels titled "100 dim MNIST". y-axis: function values. Legend entries: Rembo-Ky, Rembo-Kx, Rembo-Kψ, HeSBO-EI (top); Rembo-Kx, BOCK, EBO, HeSBO-EI, HeSBO-KG, HeSBO-BM (bottom).]

Figure 11. The performance comparison of projection based algorithms (top) and state-of-the-art algorithms (bottom) on the MNIST neural network benchmark with input dimension D = 100 and target dimension d = 12.
