
Comparing SVM, Gaussian Process and Random Forest Surrogate Models for the CMA-ES

Zbyněk Pitra 1,2, Lukáš Bajer 3,4, and Martin Holeňa 3

1 National Institute of Mental Health, Topolová 748, 250 67 Klecany, Czech Republic
[email protected]
2 Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Břehová 7, 115 19 Prague 1, Czech Republic
3 Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic
{bajer,holena}@cs.cas.cz
4 Faculty of Mathematics and Physics, Charles University in Prague, Malostranské nám. 25, 118 00 Prague 1, Czech Republic

Abstract: In practical optimization tasks, it is increasingly frequent that the objective function is black-box, which means that it cannot be described mathematically. Such functions can be evaluated only empirically, usually through some costly or time-consuming measurement, numerical simulation, or experimental testing. Therefore, an important direction of research is the approximation of these objective functions with a suitable regression model, also called a surrogate model of the objective function. This paper evaluates two different approaches to continuous black-box optimization, both of which integrate surrogate models with the state-of-the-art optimizer CMA-ES. The first, a Ranking SVM surrogate model, estimates the ordering of the sampled points, since the CMA-ES utilizes only the ranking of the fitness values. However, we show that a continuous Gaussian process model provides comparable results in the early stages of the optimization.

1 Introduction

Optimization of an expensive objective or fitness function plays an important role in many engineering and research tasks. For such functions, it is sometimes difficult to find an exact analytical formula, or to obtain any derivatives or information about smoothness. Instead, values for a given input can be obtained only through expensive and time-consuming measurements and experiments. Such functions are called black-box, and because of the evaluation costs, the primary criterion for assessing black-box optimizers is the number of fitness function evaluations necessary to achieve the optimal value.

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [5] is considered the state of the art in continuous black-box optimization. An important property of the CMA-ES is that it advances through the search space only according to the ordering of the function values in the current population. Hence, the search of the algorithm is rather local, which predisposes it to premature convergence to local optima if it is not used with a sufficiently large population size. This issue resulted in the development of several restart strategies [12], such as IPOP-CMA-ES [1] and BIPOP-CMA-ES [6], which perform restarts with successively increased population size, or aCMA-ES [9], which also uses unsuccessful individuals for covariance matrix adaptation.

Furthermore, the CMA-ES often requires more fitness function evaluations to find the optimum than many real-world experiments can offer. In order to decrease the number of evaluations in evolutionary algorithms, it is convenient to periodically train a surrogate model of the fitness function and use it to evaluate new points instead of the original function. The second option is to use the model to select the most promising points to be evaluated by the original fitness.

Loshchilov's surrogate-model-based algorithm s∗ACM-ES [13] utilizes the former approach: it estimates the ordering of the fitness values required by the CMA-ES using Ranking Support Vector Machines (SVM) as an ordinal regression model. Moreover, it has been shown [13] that the model parameters (hyperparameters) used to construct the Ranking SVM model can be optimized during the search by the pure CMA-ES algorithm. Later proposed s∗ACM-ES extensions, referred to as s∗ACM-ES-k [15] and BIPOP-s∗ACM-ES-k [14], exploit the surrogate model more intensively by increasing the population size in generations evaluated by the model.

More recently, a similar algorithm based on a regression surrogate model, called S-CMA-ES [3], has been presented. As opposed to the former algorithm, S-CMA-ES performs continuous regression by Gaussian processes (GP) [17] and random forests (RF) [4].

This paper compares the two mentioned surrogate CMA-ES algorithms, s∗ACM-ES-k and S-CMA-ES, and the original CMA-ES itself. We benchmark these algorithms on the BBOB/COCO testing set [7, 8] not only in their single-population IPOP-CMA-ES versions, but also in combination with the two-population-size BIPOP-CMA-ES.


The remainder of the paper is organized as follows. The next section briefly describes the tested algorithms: the CMA-ES, the BIPOP-CMA-ES, the s∗ACM-ES-k, and the S-CMA-ES. Section 3 presents the experimental setup and results, and Section 4 concludes the paper and suggests further research directions.

2 Algorithms

2.1 The CMA-ES

In each generation g, the CMA-ES [5] generates λ new candidate solutions x_k ∈ R^D, k = 1, ..., λ, from a multivariate normal distribution N(m(g), σ²(g)C(g)), where m(g) is the mean, interpretable as the current best estimate of the optimum, σ(g) the step size, representing the overall standard deviation, and C(g) the D×D covariance matrix. The algorithm selects the µ points with the lowest function values from the λ generated candidates to adjust the distribution parameters for the next generation.
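The following is a minimal Python sketch of the sampling and selection step described above; it omits the actual CMA-ES update of m, σ, and C as well as the recombination weights, so it illustrates the mechanism rather than the full algorithm.

import numpy as np

def sample_and_select(f, m, sigma, C, lam, mu, rng):
    # Sample lambda candidate solutions from N(m, sigma^2 C).
    X = rng.multivariate_normal(mean=m, cov=sigma**2 * C, size=lam)
    y = np.array([f(x) for x in X])
    # Keep the mu candidates with the lowest function values; the real CMA-ES
    # recombines them with log-decreasing weights and then adapts m, sigma and C.
    selected = X[np.argsort(y)[:mu]]
    return selected, y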

The CMA-ES uses restart strategies to deal with multimodal fitness landscapes and to avoid being trapped in local optima. A multi-start strategy in which the population size is doubled at each restart is referred to as IPOP-CMA-ES [1].

2.2 BIPOP-CMA-ES

The BIPOP-CMA-ES [6], unlike the IPOP-CMA-ES, considers two different restart strategies. In the first one, corresponding to the IPOP-CMA-ES, the population size is doubled in each restart i_restart, using a constant initial step-size σ⁰_large = σ⁰_default:

λ_large = 2^{i_restart} λ_default .    (1)

In the second one, the smaller population size λ_small is computed as

λ_small = ⌊ λ_default ( λ_large / (2 λ_default) )^{U[0,1]²} ⌋ ,    (2)

where U[0,1] denotes the uniform distribution on [0,1]. The initial step-size is also randomly drawn as

σ⁰_small = σ⁰_default × 10^{−2 U[0,1]} .    (3)

The BIPOP-CMA-ES performs the first run using the default population size λ_default and the initial step-size σ⁰_default. In the following restarts, the strategy with fewer function evaluations summed over all algorithm runs is selected.
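Equations (1)–(3) can be evaluated directly; the following Python sketch shows one possible implementation (function and variable names are illustrative, not taken from any of the benchmarked codes).

import numpy as np

def bipop_restart_parameters(i_restart, lam_default, sigma0_default, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Eq. (1): the large population doubles with every restart.
    lam_large = 2**i_restart * lam_default
    # Eq. (2): the small population lies between lam_default and roughly lam_large / 2.
    u = rng.uniform()
    lam_small = int(np.floor(lam_default * (0.5 * lam_large / lam_default) ** (u**2)))
    # Eq. (3): small-population runs start with a randomly shrunk initial step-size.
    sigma0_small = sigma0_default * 10 ** (-2 * rng.uniform())
    return lam_large, lam_small, sigma0_small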

2.3 s∗ACM-ES-k

Loshchilov's version of the CMA-ES that uses ordinal regression by Ranking SVM as a surrogate model in specific generations instead of the original function is referred to as s∗ACM-ES [13], and its extension using a more intensive exploitation of the model is called s∗ACM-ES-k [15].

Before the main loop starts, the s∗ACM-ES-k evaluates g_start generations by the original function; then it repeats the following steps. First, the surrogate model is constructed using hyperparameters θ and the original-function-evaluated points from previous generations. Second, the surrogate model is optimized by the CMA-ES for g_m generations with population size λ = k_λ λ_default and number of best points µ = k_µ µ_default, where k_λ, k_µ ≥ 1. Third, the following generation is evaluated by the original function using λ = λ_default and µ = µ_default. To avoid a potential divergence when g_m fluctuates between 0 and 1, k_λ > 1 is used only in the case g_m ≥ g_{mλ}, where g_{mλ} denotes the number of generations suitable for effective exploitation using the model. Then the model error is calculated by comparing the rankings of the last generation obtained from the original and from the model evaluation. After that, g_m is adjusted in accordance with the model error. As the last step, the s∗ACM-ES-k searches the hyperparameter space by one generation of the CMA-ES minimizing the model error, to find the most suitable hyperparameter settings θ_new for the next model-evaluated generations.
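The adaptive part of this loop, namely how many consecutive generations are evaluated by the model and when the enlarged population is used, could be sketched as follows; the concrete thresholds and names are illustrative assumptions and do not reproduce Loshchilov's actual update rule.

def adjust_model_generations(g_m, rank_error, g_m_max=20, error_threshold=0.45):
    # Fewer model-evaluated generations when the surrogate ranked the last
    # original-evaluated generation poorly, more when it ranked it well.
    if rank_error > error_threshold:
        return max(0, g_m - 1)
    return min(g_m_max, g_m + 1)

def effective_population_size(g_m, lam_default, k_lambda, g_m_lambda):
    # k_lambda > 1 is applied only when g_m >= g_m_lambda, which avoids
    # divergence when g_m fluctuates between 0 and 1.
    return k_lambda * lam_default if g_m >= g_m_lambda else lam_default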

The s∗ACM-ES-k version using the BIPOP-CMA-ES, proposed in [14], is called BIPOP-s∗ACM-ES-k.

2.4 S-CMA-ES

As opposed to the former algorithms, a different approach to surrogate model usage is incorporated in the S-CMA-ES [3]. The algorithm is a modification of the CMA-ES in which the original sampling and evaluation phases are replaced by Algorithm 1 at the beginning of each CMA-ES generation.

In order to avoid false convergence of the algorithm in the BBOB benchmarking toolbox, the model-predicted values are adapted so that they are never lower than the minimum of the original function found so far (see step 17 in the pseudocode).

The main difference between the S-CMA-ES and the s∗ACM-ES-k lies in how the CMA-ES is utilized. In the S-CMA-ES, model training or prediction is performed within each generation of the CMA-ES. In the s∗ACM-ES-k, by contrast, individual CMA-ES generations are launched to optimize either the original fitness, the surrogate fitness, or the model itself.
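The geometric steps of Algorithm 1, selecting training points around the current mean and transforming them into the eigenvector basis (steps 5, 7, 8, and 15), can be illustrated by the following Python sketch; the symmetric choice of C^{-1/2} and the rule for picking the n_MAX closest points are assumptions of this illustration, not details of the original Matlab implementation.

import numpy as np

def inv_sqrt(C):
    # A symmetric C^{-1/2} obtained from the eigendecomposition of C.
    w, V = np.linalg.eigh(C)
    return V @ np.diag(w ** -0.5) @ V.T

def select_and_transform(archive_X, archive_y, m, sigma, C, r, n_max):
    # Step 5: keep archive points whose distance from the mean m, measured
    # in the metric induced by sigma and C, does not exceed r.
    Cis = inv_sqrt(C)
    d = np.array([(m - x) @ (sigma * Cis) @ (m - x) for x in archive_X])
    keep = d <= r
    X_tr, y_tr, d_tr = archive_X[keep], archive_y[keep], d[keep]
    # Step 7: if there are too many points, keep the n_max closest ones
    # (one possible choice; the paper only says "choose n_MAX points").
    if len(X_tr) > n_max:
        order = np.argsort(d_tr)[:n_max]
        X_tr, y_tr = X_tr[order], y_tr[order]
    # Steps 8 and 15: map the points into the eigenvector basis of the
    # search distribution, where the model is trained and queried.
    return X_tr @ (sigma * Cis), y_tr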


3 Experimental Evaluation

The core of this paper lies in a systematic comparison of the two mentioned approaches to using surrogate models with the CMA-ES, and of the original CMA-ES algorithm itself. The first group of surrogate-based algorithms is formed by the S-CMA-ES algorithms using Gaussian process and random forest models, and the other group is formed by the s∗ACM-ES algorithm. These four algorithms (CMA-ES, GP-CMA-ES, RF-CMA-ES, s∗ACM-ES) are tested in their IPOP versions (based on the IPOP-CMA-ES) [1] and in their bi-population restart strategy versions (based on the BIPOP-CMA-ES and its derivatives) [6].

3.1 Experimental Setup

The experimental evaluation is performed through the noiseless part of the COCO/BBOB framework (COmparing Continuous Optimizers / Black-Box Optimization Benchmarking) [7, 8]. It is a collection of 24 benchmark functions with different degrees of smoothness, uni- or multi-modality, separability, conditioning, etc.

Algorithm 1 Surrogate CMA-ES Algorithm [3]
Input: g (generation), g_m (number of model generations),
       σ, λ, m, C (CMA-ES internal variables),
       r (maximal distance between training points and m),
       n_REQ (minimal number of points for model training),
       n_MAX (maximal number of points for model training),
       A (archive), f_M (model), f (original fitness function)
 1: x_k ∼ N(m, σ²C), k = 1, ..., λ   {CMA-ES sampling}
 2: if g is original-evaluated then
 3:   y_k ← f(x_k), k = 1, ..., λ   {fitness evaluation}
 4:   A ← A ∪ {(x_k, y_k)}_{k=1..λ}
 5:   (X_tr, y_tr) ← {(x, y) ∈ A | (m − x)ᵀ σ C^{−1/2} (m − x) ≤ r}
 6:   if |X_tr| ≥ n_REQ then
 7:     (X_tr, y_tr) ← choose n_MAX points if |X_tr| > n_MAX
 8:     X_tr ← {(σ C^{−1/2})ᵀ x_tr for each x_tr ∈ X_tr}   {transformation to the eigenvector basis}
 9:     f_M ← trainModel(X_tr, y_tr)
10:     mark (g + 1) as model-evaluated
11:   else
12:     mark (g + 1) as original-evaluated
13:   end if
14: else
15:   x_k ← (σ C^{−1/2})ᵀ x_k, k = 1, ..., λ
16:   y_k ← f_M(x_k), k = 1, ..., λ   {model evaluation}
17:   y_k ← y_k + max{0, min_A y − min_k y_k}, k = 1, ..., λ   {shift y_k values if min_k y_k < best y from A}
18:   if g_m model generations passed then
19:     mark (g + 1) as original-evaluated
20:   end if
21: end if
Output: f_M, A, (y_k)_{k=1..λ}

Each function is defined for any dimension D ≥ 2; the dimensions used for our tests are 2, 5, 10, and 20. The set of functions comprises, among others, well-known continuous optimization benchmarks like the ellipsoid, Rosenbrock's, Rastrigin's, Schwefel's, or Weierstrass' function.

The framework calls the optimizers on 15 different instances of each function and dimension, meaning that 1440 optimization runs were performed for each of the eight considered algorithms. The graphs at the end of the paper show detailed results in a per-function and per-group-of-functions manner. The following paragraphs summarize the parameters of the algorithms.

The CMA-ES. The original CMA-ES was used in its IPOP-CMA-ES version (Matlab code v. 3.61) with the number of restarts = 4, IncPopSize = 2, σ_start = 8/3, and λ = 4 + ⌊3 log D⌋. The remaining settings were left at their default values.

s∗ACM-ES. We have used Loshchilov's GECCO 2013 Matlab code xacmes.m [14] in its s∗ACM-ES version, setting the parameters CMAactive = 1, newRestartRules = 0, withSurr = 1, modelType = 1, withModelEnsembles = 0, withModelOptimization = 1, hyper_lambda = 20, λMult = 1, µMult = 1, and λminIter = 4.

S-CMA-ES: GP5-CMA-ES and RF5-CMA-ES. The number after GP/RF in the names of the algorithms denotes the number of model-evaluated generations g_m, i.e. how many consecutive generations are evaluated by the model. All considered S-CMA-ES versions use the distance r = 8 (see Algorithm 1). For the GP model, the Matérn covariance function K^{ν=5/2} with starting values (σ²_n, ℓ, σ²_f) = log(0.01, 2, 0.5) has been used (see [3] for details). We have tested RFs comprising 100 regression trees, each with at least two training points in every leaf. The CMA-ES parameters (IPOP version, σ_start, λ, IncPopSize, etc.) were the same as in the pure CMA-ES experiments. All S-CMA-ES parameter values were chosen according to preliminary testing on several functions from the COCO/BBOB framework.
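For orientation, roughly analogous surrogate models could be instantiated with scikit-learn as sketched below. The paper uses its own Matlab implementation [3], so mapping the stated starting values (noise variance 0.01, length-scale 2, signal variance 0.5) onto the scikit-learn kernel parameters is an assumption of this illustration.

from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# GP surrogate with a Matern nu = 5/2 covariance function.
gp_surrogate = GaussianProcessRegressor(
    kernel=ConstantKernel(constant_value=0.5) * Matern(length_scale=2.0, nu=2.5)
    + WhiteKernel(noise_level=0.01),
    normalize_y=True,
)

# RF surrogate: 100 regression trees, at least two training points per leaf.
rf_surrogate = RandomForestRegressor(n_estimators=100, min_samples_leaf=2)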

BIPOP versions of the algorithms. The bi-population versions BIPOP-CMA-ES and BIPOP-s∗ACM-ES use the same Loshchilov's Matlab code xacmes.m with the parameter BIPOP = 1. The BIPOP-GP5-CMA-ES and BIPOP-RF5-CMA-ES algorithms are constructed in the same manner as the S-CMA-ES was derived from the CMA-ES, i.e. by integrating Algorithm 1 into every generation of the BIPOP-s∗ACM-ES.

3.2 Results

The performance of the algorithms is compared in the graphs in Figures 1–3. The graphs in Figure 1 depict the expected running time (ERT), which depends on a given target function value f_t = f_opt + ∆f, i.e. the true optimum f_opt of the respective benchmark function raised by a small value ∆f.


[Figure 1 shows one panel per benchmark function f1–f24 (Sphere; Ellipsoid separable; Rastrigin separable; Skew Rastrigin-Bueche separable; Linear slope; Attractive sector; Step-ellipsoid; Rosenbrock original; Rosenbrock rotated; Ellipsoid; Discus; Bent cigar; Sharp ridge; Sum of different powers; Rastrigin; Weierstrass; Schaffer F7 with condition 10; Schaffer F7 with condition 1000; Griewank-Rosenbrock F8F2; Schwefel x*sin(x); Gallagher 101 peaks; Gallagher 21 peaks; Katsuuras; Lunacek bi-Rastrigin), each plotting log10(ERT)/dimension against dimensions 2–40.]

Figure 1: Expected running time (ERT in number of f-evaluations as log10 value) divided by dimension versus dimension. The target function value is chosen such that the best GECCO 2009 artificial algorithm just failed to achieve an ERT of 10×DIM. Different symbols correspond to the different algorithms (BIPOP-CMAES, BIPOP-GP5, BIPOP-RF5, BIPOP-saACMES, CMA-ES, GP5-CMAES, RF5-CMAES, saACMES) given in the legends of f1 and f24. Light symbols give the maximum number of function evaluations from the longest trial divided by dimension. Black stars indicate a statistically better result compared to all other algorithms, with p < 0.01 and Bonferroni correction for the number of dimensions (six).


[Figure 2 shows six panels of bootstrapped empirical cumulative distributions in 5-D, one per function subgroup: separable fcts (f1–5), moderate fcts (f6–9), ill-conditioned fcts (f10–14), multi-modal fcts (f15–19), weakly structured multi-modal fcts (f20–24), and all functions (f1–24); each panel plots the proportion of function+target pairs against log10 of (# f-evals / dimension) for BIPOP-RF5, RF5-CMAES, BIPOP-GP5, GP5-CMAES, CMA-ES, BIPOP-CMAES, saACMES, BIPOP-saACMES, and the best 2009 reference.]

Figure 2: Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for all functions and subgroups in 5-D. The targets are chosen from 10^[−8..2] such that the best GECCO 2009 artificial algorithm just failed to reach them within a given budget of k × DIM, with k ∈ {0.5, 1.2, 3, 10, 50}. The "best 2009" line corresponds to the best ERT observed during BBOB 2009 for each selected target.


[Figure 3 shows the same six panels of bootstrapped empirical cumulative distributions as Figure 2, but in 20-D: separable fcts (f1–5), moderate fcts (f6–9), ill-conditioned fcts (f10–14), multi-modal fcts (f15–19), weakly structured multi-modal fcts (f20–24), and all functions (f1–24); each panel plots the proportion of function+target pairs against log10 of (# f-evals / dimension) for the same eight algorithms and the best 2009 reference.]

Figure 3: Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for all functions and subgroups in 20-D. The targets are chosen from 10^[−8..2] such that the best GECCO 2009 artificial algorithm just failed to reach them within a given budget of k × DIM, with k ∈ {0.5, 1.2, 3, 10, 50}. The "best 2009" line corresponds to the best ERT observed during BBOB 2009 for each selected target.


The ERT is computed over all relevant trials as the number of the original function evaluations (FEs) executed during each trial until the best function value reached f_t, summed over all trials and divided by the number of trials that actually reached f_t [7].
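As a concrete illustration of this definition, the ERT for one function, dimension, and target could be computed as follows (a minimal sketch; variable names are illustrative):

def expected_running_time(evaluations, reached_target):
    # evaluations[i]: original f-evaluations used by trial i (until f_t was
    #                 reached, or until the trial stopped unsuccessfully)
    # reached_target[i]: True if trial i reached the target f_t
    successes = sum(reached_target)
    if successes == 0:
        return float("inf")
    return sum(evaluations) / successes

# Example: two trials reached f_t after 120 and 200 evaluations, a third one
# stopped unsuccessfully after 500 evaluations:
# expected_running_time([120, 200, 500], [True, True, False]) == 410.0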

As we can see in Figure 1, the 24 functions can be roughly divided into two groups according to the algorithm that performed best (at least in 10D and 20D). The first group, where the CMA-ES performed best, consists of functions 1, 3, 4, 6, and 20, while on functions 2, 5, 7, 10, 11, 13–16, 18, 21, 23, and 24, GP5-CMA-ES is usually better. The usage of the BIPOP versions generally leads to no improvement, or even to a performance decrease.

The graphs in Figures 2 and 3 summarize the performance over subgroups of the benchmark functions and show the proportion of algorithm runs that actually reached the target value f_t ∈ 10^[−8..2] (f_t was different for each respective function, see the figure captions). Roughly speaking, the higher the colored line, the better the performance of the algorithm for the number of original evaluations given on the horizontal axis.

Thus we can see that our GP5-CMA-ES usually outperforms the other algorithms when we consider the evaluation budget FEs ≤ 10^1.5 D, i.e. FEs ≤ 150 for 5D and FEs ≤ 600 for 20D. However, as the number of the considered original evaluations rises, the original CMA-ES or the s∗ACM-ES usually performs better. In summary, GP5-CMA-ES is convenient especially for applications where only a very low number of function evaluations is available, such as in [2].

4 Conclusions & Future Work

In this paper, we have compared the surrogate-assisted S-CMA-ES, which uses GP and RF continuous regression models, with the s∗ACM-ES-k algorithm based on ordinal regression by Ranking SVM, and with the original CMA-ES, all in their IPOP and BIPOP versions. The comparison shows that the Gaussian process S-CMA-ES usually outperforms the ordinal-regression-based s∗ACM-ES-k in the early stages of the search, especially on multimodal functions (BBOB functions 15–24). However, the algorithms and surrogate models should be further analyzed and compared since, for example, the NEWUOA [16] or SMAC [10, 11] algorithms spend a considerably lower number of function evaluations than the CMA-ES in these early optimization phases. The BIPOP versions of the algorithms did not increase the performance of the corresponding IPOP versions, except on BBOB function 5.

A natural perspective for improving the S-CMA-ES is to make the number of model-evaluated generations self-adaptive. We will additionally investigate different properties of continuous and ordinal regression in view of their applicability as regression models, and we will further identify cases and benchmarks where ordinal regression is clearly superior to continuous regression. Furthermore, hybrid surrogate models combining both kinds of regression will be attempted.

Acknowledgements

This work was supported by the Czech Science Foundation (GAČR) grant P103/13-17187S, by the Grant Agency of the Czech Technical University in Prague with its grant No. SGS14/205/OHK4/3T/14, and by the project "National Institute of Mental Health (NIMH-CZ)", grant number CZ.1.05/2.1.00/03.0078 (and the European Regional Development Fund). Further, access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Infrastructure for Research, Development, and Innovations" (LM2010005), is greatly appreciated.

References

[1] Auger, A., Hansen, N.: A restart CMA evolution strategy with increasing population size. In: The 2005 IEEE Congress on Evolutionary Computation, vol. 2, 1769–1776, IEEE, Sept. 2005

[2] Baerns, M., Holeňa, M.: Combinatorial development of solid catalytic materials. Design of high-throughput experiments, data analysis, data mining. Imperial College Press / World Scientific, London, 2009

[3] Bajer, L., Pitra, Z., Holeňa, M.: Benchmarking Gaussian processes and random forests surrogate models on the BBOB noiseless testbed. In: Proceedings of the 17th GECCO Conference Companion, Madrid, July 2015, ACM, New York

[4] Breiman, L.: Classification and regression trees. Chapman & Hall/CRC, 1984

[5] Hansen, N.: The CMA evolution strategy: A comparing review. In: J. A. Lozano, P. Larrañaga, I. Inza, E. Bengoetxea (eds), Towards a New Evolutionary Computation, vol. 192 of Studies in Fuzziness and Soft Computing, 75–102, Springer Berlin Heidelberg, Jan. 2006

[6] Hansen, N.: Benchmarking a BI-population CMA-ES on the BBOB-2009 function testbed. In: Proceedings of the 11th Annual GECCO Conference Companion: Late Breaking Papers, GECCO'09, 2389–2396, ACM, New York, NY, USA, 2009

[7] Hansen, N., Auger, A., Finck, S., Ros, R.: Real-parameter black-box optimization benchmarking 2012: Experimental setup. Technical Report, INRIA, 2012

[8] Hansen, N., Finck, S., Ros, R., Auger, A.: Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. Technical Report RR-6829, INRIA, 2009, updated February 2010

[9] Hansen, N., Ros, R.: Benchmarking a weighted negative covariance matrix update on the BBOB-2010 noiseless testbed. In: Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO'10, 1673–1680, ACM, New York, NY, USA, 2010


[10] Hutter, F., Hoos, H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: C. Coello (ed.), Learning and Intelligent Optimization 2011, vol. 6683 of Lecture Notes in Computer Science, 507–523, Springer Berlin Heidelberg, 2011

[11] Hutter, F., Hoos, H., Leyton-Brown, K.: An evaluation of sequential model-based optimization for expensive black-box functions. In: Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO'13 Companion, 1209–1216, ACM, New York, NY, USA, 2013

[12] Loshchilov, I., Schoenauer, M., Sebag, M.: Alternative restart strategies for CMA-ES. In: C. A. C. Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, M. Pavone (eds), PPSN (1), vol. 7491 of Lecture Notes in Computer Science, 296–305, Springer, 2012

[13] Loshchilov, I., Schoenauer, M., Sebag, M.: Self-adaptive surrogate-assisted covariance matrix adaptation evolution strategy. In: Proceedings of the 14th GECCO, GECCO'12, 321–328, ACM, New York, NY, USA, 2012

[14] Loshchilov, I., Schoenauer, M., Sebag, M.: BI-population CMA-ES algorithms with surrogate models and line searches. In: Genetic and Evolutionary Computation Conference (GECCO Companion), 1177–1184, ACM Press, July 2013

[15] Loshchilov, I., Schoenauer, M., Sebag, M.: Intensive surrogate model exploitation in self-adaptive surrogate-assisted CMA-ES (saACM-ES). In: Genetic and Evolutionary Computation Conference (GECCO), 439–446, ACM Press, July 2013

[16] Powell, M. J. D.: The NEWUOA software for unconstrained optimization without derivatives. In: G. D. Pillo, M. Roma (eds), Large-Scale Nonlinear Optimization, vol. 83 of Nonconvex Optimization and Its Applications, 255–297, Springer US, 2006

[17] Rasmussen, C. E., Williams, C. K. I.: Gaussian processes for machine learning. Adaptive Computation and Machine Learning Series, MIT Press, 2006
