
Page 1: Evaluating the Population Size Adaptation Mechanism for CMA-ES on the BBOB Noiseless Testbed

Kouhei Nishida, Youhei Akimoto
Shinshu University, Japan

Page 2: Introduction: CMA-ES

- The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a stochastic search algorithm that samples from a multivariate normal distribution. It repeats the following procedure:
  1. Generate candidate solutions (x_i^(t))_{i=1,...,λ} from N(m^(t), C^(t)).
  2. Evaluate f(x_i^(t)) and sort them: f(x_{1:λ}) < ... < f(x_{λ:λ}).
  3. Update the distribution parameters θ^(t) = (m^(t), C^(t)) using the ranking of the candidate solutions.
- The CMA-ES provides default values for all strategy parameters (such as the population size λ and the learning rate η_c).
- A population size larger than the default improves performance in the following scenarios:
  1. well-structured multimodal functions,
  2. noisy functions.
- Tuning the population size in advance can easily become very expensive.

Page 3: Introduction: Population Size Adaptation

- As a measure for the adaptation, we consider the randomness of the parameter update.
- To quantify this randomness, we introduce an evolution path in the parameter space.
- The population size is adapted online so that the randomness of the parameter update is kept at a certain level.

Advantages of adapting the population size online:
- It does not require tuning the population size in advance.
- On rugged functions, it may accelerate the search by reducing the population size after the distribution converges into the basin of a local minimum.

Page 4: Rank-μ Update CMA-ES

- The rank-μ update CMA-ES, which is a component of the CMA-ES, repeats the following procedure (a code sketch follows this slide):
  1. Generate candidate solutions (x_i^(t))_{i=1,...,λ} from N(m^(t), C^(t)).
  2. Evaluate f(x_i^(t)) and sort them: f(x_{1:λ}) < ... < f(x_{λ:λ}).
  3. Update the distribution parameters θ^(t) = (m^(t), C^(t)) using the ranking of the candidate solutions:

    θ^(t+1) = θ^(t) + Δθ^(t)
    Δm^(t) = η_m ∑_{i=1}^{λ} w_i (x_{i:λ}^(t) − m^(t))
    ΔC^(t) = η_c ∑_{i=1}^{λ} w_i ((x_{i:λ}^(t) − m^(t)) (x_{i:λ}^(t) − m^(t))^T − C^(t))
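As an illustration of the procedure above, here is a minimal Python sketch of one rank-μ update iteration, assuming NumPy. The weight scheme, the learning rates, and the use of only the μ best candidates are common conventions chosen for the example, not necessarily the exact settings behind the reported results.

    import numpy as np

    # Minimal sketch of one rank-mu update iteration (illustrative settings).
    def rank_mu_update(m, C, f, lam, eta_m=1.0, eta_c=0.1, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        X = rng.multivariate_normal(m, C, size=lam)          # 1. sample lambda candidates
        order = np.argsort([f(x) for x in X])                # 2. rank by objective value
        mu = lam // 2
        w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))  # positive, decreasing weights
        w /= w.sum()
        Y = X[order[:mu]] - m                                # steps of the mu best candidates
        dm = eta_m * (w @ Y)                                 # mean update (Delta m)
        dC = eta_c * sum(wi * (np.outer(y, y) - C) for wi, y in zip(w, Y))  # rank-mu covariance update (Delta C)
        return m + dm, C + dC

    # Illustrative usage on the 5-D sphere function:
    # m, C = np.zeros(5), 4.0 * np.eye(5)
    # for _ in range(100):
    #     m, C = rank_mu_update(m, C, lambda x: np.sum(x**2), lam=10)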

Page 5: Population Size Adaptation: Measurement

To quantify the randomness of the parameter update, we introduce an evolution path in the space Θ of the distribution parameters θ = (m, C):

    p^(t+1) = (1 − β) p^(t) + √(β(2 − β)) Δθ^(t)

The evolution path accumulates the successive steps taken in the parameter space Θ.

Figure: An illustration of the evolution path, (a) less tendency, (b) strong tendency.
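A minimal sketch of this accumulation, assuming the parameter step Δθ (the changes of m and of the entries of C) has been flattened into a single NumPy vector:

    import numpy as np

    # Accumulate the parameter-space evolution path p.
    # Assumption: delta_theta is the flattened parameter step of one iteration.
    def update_evolution_path(p, delta_theta, beta):
        return (1.0 - beta) * p + np.sqrt(beta * (2.0 - beta)) * delta_theta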

Page 6: Population Size Adaptation: Measurement

- We measure the length of the evolution path based on the KL divergence, using the norm induced by the Fisher information matrix:

    ‖p‖²_θ = pᵀ I(θ) p ≈ KL(θ ‖ θ + p)

  The KL divergence measures the difference between two probability distributions.
- We measure the randomness of the parameter update by the ratio between ‖p^(t+1)‖²_θ and its expected value γ^(t+1) ≈ E[‖p^(t+1)‖²_θ] under a random function:

    γ^(t+1) = (1 − β)² γ^(t) + β(2 − β) ∑_{i=1}^{λ} w_i² (d η_m² + (d(d + 1)/2) η_c²)

- Two important cases:
  - On a random function: ‖p‖²_θ / γ ≈ 1.
  - For too large λ: ‖p‖²_θ / γ → ∞.
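A sketch of the normalization term γ and of the resulting randomness measure, assuming w holds the recombination weights, d is the search-space dimension, and the squared Fisher norm ‖p‖²_θ is computed elsewhere:

    import numpy as np

    # Expected squared path length gamma under a random function.
    def update_gamma(gamma, w, d, beta, eta_m, eta_c):
        per_sample = d * eta_m**2 + (d * (d + 1) / 2.0) * eta_c**2
        return (1.0 - beta)**2 * gamma + beta * (2.0 - beta) * np.sum(np.asarray(w)**2) * per_sample

    # Randomness measure: ~1 on a random function, diverges when lambda is too large.
    def normalized_path_length(p_sq_norm, gamma):
        return p_sq_norm / gamma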

Page 7: Population Size Adaptation: Adaptation

- If ‖p^(t+1)‖²_{θ^(t)} / γ^(t+1) < α, the update is regarded as inaccurate and the population size is increased:

    λ^(t+1) = ⌊λ^(t) exp(β(α − ‖p^(t+1)‖²_{θ^(t)} / γ^(t+1)))⌋ ∨ (λ^(t) + 1)

- If ‖p^(t+1)‖²_{θ^(t)} / γ^(t+1) > α, the update is regarded as sufficiently accurate and the population size is decreased:

    λ^(t+1) = ⌊λ^(t) exp(β(α − ‖p^(t+1)‖²_{θ^(t)} / γ^(t+1)))⌋ ∨ λ_min

Here ∨ denotes the maximum of the two values.
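A sketch of this update rule, reading ∨ as the maximum and treating λ_min and the ratio ‖p^(t+1)‖²_{θ^(t)} / γ^(t+1) as inputs computed elsewhere:

    import math

    # Population size update rule from the slide.
    # ratio = squared path length divided by gamma; lam_min is the minimum population size.
    def update_population_size(lam, ratio, alpha, beta, lam_min):
        proposal = math.floor(lam * math.exp(beta * (alpha - ratio)))
        if ratio < alpha:
            # update regarded as inaccurate: increase, by at least one
            return max(proposal, lam + 1)
        # update regarded as sufficiently accurate: decrease, but never below lam_min
        return max(proposal, lam_min)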

Page 8: Algorithm Variants

We use the default setting for most parameters. The modified parameters are the learning rate for the mean vector, c_m, and the threshold α that decides whether the parameter update is considered accurate or not.

    PSAaLmC: α = √2,  c_m = 0.1
    PSAaLmD: α = √2,  c_m = 1/D
    PSAaSmC: α = 1.1, c_m = 0.1
    PSAaSmD: α = 1.1, c_m = 1/D

- The greater α is, the larger the population size tends to be kept.
- Based on our preliminary study, we set c_c = √2/(D + 1) · c_m.
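For reference, the four variant settings written as plain configuration data; the dictionary layout is our own choice, while the values follow the slide (D is the search-space dimension):

    import math

    # The four tested PSA-CMA-ES variants as configuration data.
    def variant_settings(D):
        return {
            "PSAaLmC": {"alpha": math.sqrt(2.0), "c_m": 0.1},
            "PSAaLmD": {"alpha": math.sqrt(2.0), "c_m": 1.0 / D},
            "PSAaSmC": {"alpha": 1.1,            "c_m": 0.1},
            "PSAaSmD": {"alpha": 1.1,            "c_m": 1.0 / D},
        }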

Page 9: Restart Strategy

For each (re-)start of the algorithm, we initialize the mean vector m ~ U[−4, 4]^D and the covariance matrix C = 2² I. The maximum number of f-calls is set to 10^5 · D.

Termination conditions (a sketch of these checks follows this slide):
- tolf: median(fiqr_hist) < 10^-12 · abs(median(fmin_hist))
  The objective function value differences are too small to sort the solutions without being affected by numerical errors.
- tolx: median(xiqr_hist) < 10^-12 · min(abs(xmed_hist))
  The coordinate value differences are too small to update the parameters without being affected by numerical errors.
- maxcond: cond(C) > 10^14
  Matrix operations using C are not reliable due to numerical errors.
- maxeval: #f-calls ≥ 5 × 10^4 · D (noiseless) or 10^5 · D (noisy)
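A sketch of these termination checks, assuming the *_hist arguments hold the recent histories of the per-iteration statistics named above (interquartile ranges and medians of the f-values and of the coordinates), C is the current covariance matrix, and n_evals counts objective function calls so far:

    import numpy as np

    # Termination checks from the slide (tolf, tolx, maxcond, maxeval).
    def should_terminate(fiqr_hist, fmin_hist, xiqr_hist, xmed_hist, C, n_evals, D, noisy=False):
        tolf = np.median(fiqr_hist) < 1e-12 * abs(np.median(fmin_hist))   # f-differences below numerical precision
        tolx = np.median(xiqr_hist) < 1e-12 * np.min(np.abs(xmed_hist))   # x-differences below numerical precision
        maxcond = np.linalg.cond(C) > 1e14                                # covariance matrix too ill-conditioned
        maxeval = n_evals >= (1e5 * D if noisy else 5e4 * D)              # evaluation budget exhausted
        return tolf or tolx or maxcond or maxeval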

Page 10: BIPOP-CMA-ES

The BIPOP restart strategy is a restart strategy with two budgets of function evaluations:
- One budget is spent on restarts with an incrementally increasing population size, to tackle well-structured multimodal functions and noisy functions.
- The other budget is spent on restarts with a relatively small population size and a relatively small step-size, to tackle weakly-structured multimodal functions.

Page 11: Noiseless: Unimodal Functions

Figure: aRT versus dimension (2, 3, 5, 10, 20, 40) for target Δf = 1e-8 on f1 Sphere, f5 Linear slope, f7 Step-ellipsoid, and f8 Rosenbrock original; legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA-ES.

The aRT is higher than that of the best 2009 portfolio for most of the unimodal functions, due to the lack of step-size adaptation. On the step-ellipsoid function, where step-size adaptation is less important, our algorithm performs well.

Page 12: Noiseless: Well-structured Multimodal Functions

Figure: aRT versus dimension (2, 3, 5, 10, 20, 40) for target Δf = 1e-8 on f15 Rastrigin, f17 Schaffer F7 (condition 10), f18 Schaffer F7 (condition 1000), and f19 Griewank-Rosenbrock F8F2.

The tested algorithms perform similarly to the BIPOP-CMA-ES, even though they lack step-size adaptation. On Griewank-Rosenbrock in particular, the tested algorithms are partly better than the best 2009 portfolio.

Page 13: Noiseless: Weakly-structured Multimodal Functions

Figure: ECDFs of runtimes (proportion of function+target pairs versus log10 of #f-evals / dimension) on f20-f24 in 5-D and 20-D; legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA, best 2009.

The BIPOP-CMA-ES performs better than the tested algorithms because the tested algorithms lack a mechanism to tackle weakly-structured multimodal functions.

Page 14: Noiseless: Comparing the Variants

Figure: ECDFs of runtimes (proportion of function+target pairs versus log10 of #f-evals / dimension) on f15-f19 in 10-D; legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA, best 2009.

Variants with α = 1.1 are better than those with α = √2.

Page 15: Noiseless Summary

- On well-structured multimodal functions, the tested algorithm performs well without step-size adaptation.
- Due to the lack of step-size adaptation, the aRT on most of the unimodal functions is higher than that of the best 2009 portfolio.
- When step-size adaptation is less important, the tested algorithm performs well.
- Variants with α = 1.1 tend to be better than those with α = √2.

Page 16: Noisy: Unimodal Functions

Figure: aRT versus dimension (2, 3, 5, 10, 20, 40) for target Δf = 1e-8 on f101-f103 Sphere (moderate Gauss / unif / Cauchy noise) and f104-f106 Rosenbrock (moderate Gauss / unif / Cauchy noise); legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA-ES.

On the sphere functions, the algorithm is slower than the BIPOP-CMA-ES because of the lack of step-size adaptation. The failure on the Rosenbrock functions is mainly due to the same reason.

Page 17: Noisy: Unimodal Functions

Figure: aRT versus dimension (2, 3, 5, 10, 20, 40) for target Δf = 1e-8 on f113 Step-ellipsoid Gauss, f114 Step-ellipsoid unif, and f115 Step-ellipsoid Cauchy.

On the step-ellipsoid functions, where step-size adaptation is less important, the algorithm performs well.

Page 18: Noisy: Well-structured Multimodal Functions

Figure: aRT versus dimension (2, 3, 5, 10, 20, 40) for target Δf = 1e-8 on f122 Schaffer F7 Gauss, f123 Schaffer F7 unif, and f124 Schaffer F7 Cauchy.

On the Schaffer functions, the tested algorithms perform similarly to the best 2009 portfolio, and partly better than it.

Page 19: Noisy: Comparing the Variants

Figure: aRT versus dimension (2, 3, 5, 10, 20, 40) for target Δf = 1e-8 on f113 Step-ellipsoid Gauss, f116 Ellipsoid Gauss, and f119 Sum of different powers Gauss.

The variants using c_m = 1/D sometimes get worse in low dimensions because the learning rate becomes too large.

Page 20: Noisy: Comparing the Variants

Figure: ECDFs of runtimes (proportion of function+target pairs versus log10 of #f-evals / dimension) on f101-f130 in 10-D; legend: PSAaLmC, PSAaLmD, PSAaSmC, PSAaSmD, BIPOP-CMA, best 2009.

Variants with α = 1.1 are better than those with α = √2.

Page 21: Noisy Summary

- On well-structured multimodal functions, the tested algorithms perform similarly to the best 2009 portfolio.
- Due to the lack of step-size adaptation, the convergence speed scales worse on the sphere functions, and the aRT on most of the unimodal functions is higher than that of the best 2009 portfolio.
- Variants with α = 1.1 tend to be better than those with α = √2.
- c_m = 1/D is too large in low dimensions.

Page 22: Conclusion

Summary:
- On well-structured multimodal functions, the tested algorithms perform similarly to the best 2009 portfolio.
- Due to the lack of step-size adaptation, the aRT on most of the unimodal and weakly-structured functions is higher than that of the best 2009 portfolio.
- On noisy functions, c_m = 1/D is too large in low dimensions.

Future Work:
- Incorporate the rank-one update and the step-size adaptation.

Page 23:

Using a small learning rate works like averaging the mean vector over successive iterations.

Figure: (a) with a larger learning rate, (b) with a smaller learning rate.