
Optimization in Electrical Engineering

Christian Magele & Thomas Ebner

Institute for Fundamentals and Theory in Electrical Engineering

Technical University of Graz

Page 2: Optimization in Electrical Engineering - diegm.uniud.it Corsi... · Optimization in Electrical Engineering Christian Magele & Thomas Ebner Institute for Fundamentals and Theory in

Contents

1 Introduction
1.1 Introductory Examples
1.1.1 Design of an Active Filter
1.1.2 Optimization of a Die Mold Press
1.1.3 Optimization of a SMES arrangement
1.2 Optimization environment

2 Mathematical background
2.1 Basics
2.1.1 Characterization of a minimum
2.1.2 Local minima - global minimum
2.1.3 Pareto-optimality in multi-objective optimization
2.1.4 Gradient and Hessian matrix
2.1.5 Slope and curvature
2.1.6 Properties of linear functions
2.1.7 Properties of quadratic functions
2.2 Optimality conditions
2.2.1 Unconstrained optimization
2.2.2 Linearly constrained optimization
2.2.2.1 Linear equality constraints
2.2.2.2 Linear inequality constraints
2.2.3 Nonlinearly constrained optimization
2.3 Line search subproblem
2.3.1 Line search algorithm
2.3.2 Bracketing and sectioning phase
2.4 Derivatives of the objective function f(x)
2.4.1 Numerically derived gradient g(x)
2.4.2 Analytically derived gradient g(x): adjoint variable approach

3 Deterministic optimization methods
3.1 Unconstrained optimization
3.1.1 Non-derivative methods, ad hoc methods, direct search methods
3.1.1.1 Polytope algorithm (simplex method)
3.1.1.2 Pattern search, method of Hooke and Jeeves
3.1.2 Second derivative methods
3.1.2.1 Newton's method
3.1.3 First derivative methods
3.1.3.1 Quasi-Newton method
3.1.4 Sums of squares and nonlinear equations
3.1.4.1 Gauss-Newton method
3.1.4.2 Levenberg-Marquardt method
3.2 Constrained optimization
3.2.1 Penalty or barrier functions
3.2.2 Augmented Lagrangian function
3.2.3 Quadratic programming
3.2.3.1 Quadratic programming with equality constraints

4 Stochastic Optimization Methods
4.1 Introduction
4.1.1 Main Features of Stochastic Algorithms
4.2 Evolutionary Computation
4.2.1 Evolution Strategies
4.2.1.1 (1+1) Evolution Strategy
4.2.1.2 (µ/ρ, λ) Evolution Strategy
4.2.1.3 Adaptation of Stepsize σ
4.2.2 Niching Evolution Strategy
4.2.2.1 Hierarchically Clustering the Population
4.2.2.2 Cluster Sensitive Recombination
4.3 Genetic Algorithm
4.3.1 Binary Genetic Algorithm
4.3.1.1 Genetic Operators
4.3.2 Floating Point Genetic Algorithm
4.3.3 Improved Floating Point Genetic Algorithm
4.3.3.1 Immigration
4.3.3.2 Gradient-like Mutation
4.4 Simulated Annealing
4.4.1 Simulated Annealing Algorithm
4.5 Summary of Strategy Parameters
4.5.1 Meta Optimization of Strategy Parameters
4.6 Stopping Criteria
4.7 Similarities and Differences: Comparison of Stochastic Algorithms
4.7.1 Generation of New Configurations
4.7.2 Selection Mechanism
4.7.3 Deterioration of the Objective Function
4.7.4 Control of Mutation Stepsize
4.7.5 Globally Best Configuration

5 Definition of the Objective Function
5.1 Introduction
5.2 From Vector Optimization to Scalar Optimization
5.3 Constraint Optimization Techniques
5.4 Weighting of Objectives Techniques
5.5 Fuzzy Based Decision Making Scheme
5.5.1 Introduction
5.5.2 Bell Shaped Static Fuzzy Membership Functions
5.5.3 Bell Shaped Self Adaptive Membership Functions

6 Optimization Examples
6.1 Introduction
6.1.1 Optimization of an Active Filter
6.1.2 Optimization of a SMES arrangement

Bibliography


1 Introduction

The lecture notes Optimization in Electrical Engineering intend to give the student an overview of standard optimization methods, which are applied to practical problems taken from several areas of electrical engineering.

First of all, some mathematical basics, which will be used throughout these notes, are summarized. Terms like the gradient of a function or different optimality conditions are explained.

In the following two chapters, several deterministic and stochastic optimization methods are presented, explained and applied to simple analytical examples.

Then different ways for the description of the objective function are introduced. Finally, some solutions of the introductory examples are presented.

1.1 Introductory Examples

1.1.1 Design of an Active Filter

Assume the amplitude of a voltage Uinput as given in Fig. 1.1 (a), using the normalized frequency Ω. The required output voltage Uoutput should have a constant amplitude of 1 V from very low frequencies up to Ω = 1. This can be achieved using an active filter as shown in Fig. 1.1 (b).

Figure 1.1: Active filter example. (a) Amplitude of the input voltage Uinput; (b) active filter.

A general active filter can be described mathematically by a complex function A(P) of second order:

A(P) = (d0 + d1 P + d2 P²) / (c0 + c1 P + c2 P²). (1.1)

P can be set to jΩ = j f/fc, with fc being the cut-off frequency:

A(jΩ) = (d0 + d1 jΩ − d2 Ω²) / (c0 + c1 jΩ − c2 Ω²). (1.2)

The optimization/identification problem is to determine the coefficients d0, d1, d2, c0, c1, c2 of a filter which fulfills the required operation.
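A quick way to see what the coefficients do is to evaluate the magnitude of (1.2) directly with complex arithmetic. The sketch below (Python) uses an assumed Butterworth-type coefficient set purely for illustration; it is not the solution of the identification problem posed here.

```python
import math

def amplitude(omega, d, c):
    """|A(jΩ)| evaluated from (1.2) for coefficients d = (d0, d1, d2), c = (c0, c1, c2)."""
    p = 1j * omega                      # P = jΩ
    num = d[0] + d[1] * p + d[2] * p**2
    den = c[0] + c[1] * p + c[2] * p**2
    return abs(num / den)

# Assumed trial coefficients (a second-order Butterworth low-pass),
# not the coefficients sought in the identification problem above.
d = (1.0, 0.0, 0.0)
c = (1.0, math.sqrt(2.0), 1.0)

print(amplitude(0.0, d, c))   # 1.0 at Ω = 0
print(amplitude(1.0, d, c))   # 1/sqrt(2) at the cut-off Ω = 1
```

Fitting the six coefficients so that |A(jΩ)| matches the required amplitude of Fig. 1.1 then amounts to minimizing the mismatch over a set of sample frequencies, which is exactly the kind of objective function treated in the following chapters.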

1.1.2 Optimization of a Die Mold Press

The goal of this problem is to optimize the shape parameters R1, L2, L3, L4 of a nonlinear die mold press used for producing anisotropic permanent magnets.

Figure 1.2: Model of a Die Mold Press with electromagnet. (a) Full view; (b) enlarged view.

The shape of the die molds is parameterized by a circle for the inner and an ellipse for the outer die (Fig. 1.2). The shape is to be set up in such a way that the magnetic flux density B equals Bx = 0.35 cos(θ) T and By = 0.35 sin(θ) T along the circle line e-f in 10 measurement points, where 0 < θ < 45° and r0 = 0.01175 m. The forward problem is solved using the Finite Element Method.

1.1.3 Optimization of a SMES arrangement

SMES (Superconducting Magnetic Energy Storage) systems consisting of a single superconducting solenoidal coil offer the opportunity to store a significant amount of energy in magnetic fields in a fairly simple and economical way, and can be rather easily scaled up in size. However, such arrangements usually suffer from their remarkable stray field. A reduction of the stray field can be achieved if a second solenoid is placed outside the inner one, with a current flowing in the opposite direction (Fig. 1.3 (a)). A correct design of the system should then couple the correct value of the energy to be stored (180 MJ in our problem) with a minimal stray field. The resulting problem is called a multi-objective optimization problem.

Figure 1.3: SMES optimization problem. (a) Configuration of the SMES device; (b) critical curve of the superconductor.

The optimal choice of the parameters R1, R2, d1, d2, h1, h2, J1, J2 is not an easy task because, besides the usual geometrical constraints, there is a material-related constraint: the given current density and the maximum magnetic flux density value on the coil must not violate the superconducting quench condition, which can be well represented by the linear relationship shown in Fig. 1.3 (b).

The resulting field problem can be solved in a hybrid analytical/numerical way using Biot-Savart's law.

1.2 Optimization environment

To solve one or all of the above-mentioned problems, an optimization environment fulfilling the following requirements is needed:

• Analysis software to solve the forward problem (Test functions, Transfer function,Finite Element code or Biot-Savart formula for electromagnetic field problems, ...)


• Optimization strategy that supplies the analysis software with new values for the optimization parameters and converges stably and quickly towards a solution

• Approximation algorithms to avoid the solution of the forward problem

• Niching methods to detect "good" solutions besides the optimal one

• Preprocessor checking the new values for the optimization parameters, transforming them in a suitable way, ...

• Possibility to assess the quality of the successively calculated forward problems via an objective function (single objective problems, multi objective problems)

• Post-processor to visualize the obtained results


2 Mathematical background

2.1 Basics

2.1.1 Characterization of a minimum

Optimization problems usually involve the minimization of an objective function f(x), where x is a point in the optimization parameter space. The general problem class to be considered is known as a nonlinearly constrained optimization problem (NCP) and can be expressed in mathematical terms as:

min x∈Rⁿ f(x) (2.1)

subject to ci(x) = 0, i = 1, 2, . . . , m′−1 (2.2)

ci(x) ≥ 0, i = m′, . . . , m (2.3)

The objective function f(x) and the (equality or inequality) constraint functions ci(x) are real-valued scalar functions. Any point x satisfying all constraints of the optimization problem is called feasible. The set of all feasible points is called the feasible region.

If N(x∗, δ) represents a set of feasible points in the neighborhood of a strong local minimizer x∗, then (2.4) must hold:

f(x∗) < f(x) for all x ∈ N(x∗, δ) and x ≠ x∗. (2.4)

2.1.2 Local minima - global minimum

In real-world applications the feasible region very often contains one or more local minima besides the global minimum (see Fig. 2.1 a), all of them fulfilling (2.4). In many problems it will be sufficient to discover a local minimum; nevertheless, there exist applications where the global minimum is the only acceptable solution (identification problems).

2.1.3 Pareto-optimality in multi-objective optimization

Most of the real-world optimization problems involve multiple conflicting objectives that must be mutually reconciled. These problems are called multi-objective optimization problems (MOO) or vector optimization problems, in contrast to single objective optimization (SOO) or scalar optimization problems. A major characteristic of them is that in general the individual solutions for each single objective differ from each other, and therefore no solution exists where all objectives reach their individual minimum. The objective function f(x) of a vector optimization problem can be written as follows:

min x∈Rⁿ f(x) = min x∈Rⁿ (f1(x), f2(x), · · · , fk(x)) (2.5)

Figure 2.1: Optimal solutions. (a) Single objective function f(x); (b) multi-objective functions f1(x), f2(x).

Fig. 2.1 shows both a one-dimensional SOO and a MOO problem. In general, there exists a set of Pareto-optimal solutions rather than one unique solution for MOO problems.
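Pareto-optimality can be illustrated with a small sketch: within a finite set of candidate designs, a point is Pareto-optimal if no other point dominates it, i.e. is no worse in every objective and strictly better in at least one. The objective values below are made-up numbers for illustration only.

```python
def dominates(p, q):
    """True if objective vector p dominates q (minimization in all objectives)."""
    return all(pi <= qi for pi, qi in zip(p, q)) and any(pi < qi for pi, qi in zip(p, q))

# (f1, f2) values of a few candidate designs (illustrative, not from the notes)
points = [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0), (3.0, 4.0), (5.0, 5.0)]

# the Pareto set: points not dominated by any other point
pareto = [p for p in points if not any(dominates(q, p) for q in points)]
print(pareto)   # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
```

Here (3, 4) is dominated by (2, 3) and (5, 5) by (1, 5), so the remaining three points form the Pareto-optimal set: improving one objective among them necessarily worsens the other.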

2.1.4 Gradient and Hessian matrix

The n-vector of the partial derivatives of the multivariate function f(x) is called the gradient vector of f(x) ("first derivative"), denoted by ∇f(x) or g(x) and defined by (2.6):

∇f(x) ≡ g(x) ≡ (∂f/∂x1, . . . , ∂f/∂xn)^T. (2.6)

The "second derivative" of an n-variable function is defined by the n² partial derivatives of the n first derivatives with respect to the n variables. These n² "second derivatives" are usually represented by a square matrix, termed the Hessian matrix of f(x) and denoted by ∇(∇^T f(x)) or ∇²f(x) or G(x):


∇²f(x) ≡ G(x) ≡ ∇(∇f)^T ≡
[ ∂²f/∂x1²     · · ·  ∂²f/∂x1∂xn ]
[    ...       · · ·      ...    ]
[ ∂²f/∂xn∂x1  · · ·  ∂²f/∂xn²   ]. (2.7)

Consider Rosenbrock's function.

Figure 2.2: Rosenbrock's function (contour lines over x1, x2 ∈ [−2, 2])

f(x) = 100 (x2 − x1²)² + (1 − x1)². (2.8)

The gradient and the Hessian matrix can be written as

g(x) = ( −400 x1 (x2 − x1²) − 2(1 − x1) ; 200 (x2 − x1²) ) (2.9)

and

G(x) = [ 1200 x1² − 400 x2 + 2   −400 x1 ; −400 x1   200 ] (2.10)

which leads to the following values at x = (0, 0)^T, where f(x) = 1:

g(x) = ( −2 ; 0 ) (2.11)

and

G(x) = [ 2 0 ; 0 200 ] (2.12)
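The values (2.11) and (2.12) can be cross-checked numerically with finite differences. The short Python sketch below is an illustration, not part of the original notes:

```python
def f(x1, x2):
    # Rosenbrock's function (2.8)
    return 100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2

def grad(x1, x2):
    # analytic gradient (2.9)
    return (-400.0 * x1 * (x2 - x1**2) - 2.0 * (1.0 - x1),
            200.0 * (x2 - x1**2))

h = 1e-4
# central differences for the gradient at (0, 0)
g_num = ((f(h, 0.0) - f(-h, 0.0)) / (2 * h),
         (f(0.0, h) - f(0.0, -h)) / (2 * h))
print(grad(0.0, 0.0), g_num)   # analytic (-2.0, 0.0) vs. numeric approximation

# second differences for the diagonal of the Hessian (2.12)
G11 = (f(h, 0.0) - 2.0 * f(0.0, 0.0) + f(-h, 0.0)) / h**2
G22 = (f(0.0, h) - 2.0 * f(0.0, 0.0) + f(0.0, -h)) / h**2
print(G11, G22)                # approximately 2 and 200
```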

2.1.5 Slope and curvature

Many deterministic optimization strategies require to solve a subproblem, where the local minimum of the objective function f(x) along an arbitrary direction s starting from an initial value x0 has to be determined. Any point along this direction can be expressed as

x = x0 + α s, (2.13)

and the minimizer x∗ corresponds to α∗. To determine α∗, it is sometimes necessary to compute the first and second derivative of f(x) along the direction s with respect to α:

df(x)/dα = s^T g, (2.14)

termed the slope or directional derivative, and

d²f(x)/dα² = s^T G s, (2.15)

called the curvature.
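For Rosenbrock's function at x0 = (0, 0) and the direction s = (1, 0), the slope (2.14) and curvature (2.15) can be checked against one-dimensional finite differences of f restricted to the line. A small illustrative Python sketch:

```python
def f(x1, x2):
    return 100.0 * (x2 - x1**2)**2 + (1.0 - x1)**2   # Rosenbrock (2.8)

x0 = (0.0, 0.0)
s = (1.0, 0.0)
g = (-2.0, 0.0)                   # gradient (2.11) at x0
G = ((2.0, 0.0), (0.0, 200.0))    # Hessian (2.12) at x0

slope = s[0] * g[0] + s[1] * g[1]                                     # s^T g, (2.14)
curv = sum(s[i] * G[i][j] * s[j] for i in range(2) for j in range(2)) # s^T G s, (2.15)

def phi(alpha):
    # f restricted to the line x0 + alpha * s
    return f(x0[0] + alpha * s[0], x0[1] + alpha * s[1])

h = 1e-5
print(slope, (phi(h) - phi(-h)) / (2 * h))                # both close to -2
print(curv, (phi(h) - 2.0 * phi(0.0) + phi(-h)) / h**2)   # both close to 2
```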

2.1.6 Properties of linear functions

A special function of many variables is the general linear function l(x), which can be written as

l(x) = a1 x1 + a2 x2 + · · · + an xn + b = Σ (i=1..n) ai xi + b = a^T x + b. (2.16)

The gradient ∇l(x) of a linear function can be derived as

∇l(x) = ∇(a^T x + b) = ∇(a^T x) (2.17)

since ∇b = 0. Using the vector identity ∇(u^T v) = (∇u^T) v + (∇v^T) u and the relation

∇x^T = ∇(x1, x2, · · · , xn) = [∇x1, ∇x2, · · · , ∇xn] = [∂xi/∂xj] = I (2.18)

where I is the unit matrix, (2.17) can be rewritten:

∇l(x) = ∇(x^T a) = (∇x^T) a + (∇a^T) x = I a + 0 = a. (2.19)

The gradient of a linear function is a constant vector and equals the normal vector n of the (hyper)plane l(x), and the Hessian matrix ∇²l is the zero matrix.
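This can be verified numerically: the finite-difference gradient of l(x) = a^T x + b equals a at any point. The particular values of a, b and x below are arbitrary illustrative choices:

```python
a = [3.0, -1.0, 2.0]
b = 5.0

def l(x):
    # l(x) = a^T x + b, Eq. (2.16)
    return sum(ai * xi for ai, xi in zip(a, x)) + b

x = [0.4, -0.7, 1.2]        # any point: the gradient is the same everywhere
h = 1e-6
grad = []
for i in range(3):          # central difference in each coordinate
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    grad.append((l(xp) - l(xm)) / (2 * h))
print(grad)                 # close to a = [3, -1, 2], independent of x
```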

2.1.7 Properties of quadratic functions

In a sufficiently small neighborhood of a given point, f(x) can be closely approximated by a quadratic function. Consider now a quadratic function φ(x) given by

φ(x) = b + c^T x + (1/2) x^T G x, (2.20)

for some constant b, a constant vector c and a constant symmetric matrix G.

Consider a quadratic function φ(x), where G = [2 0; 0 4], c^T = (1, 1) and b = 5. Using (2.20), the quadratic function becomes

φ(x) = 5 + (1 1)(x1; x2) + (1/2)(x1 x2)[2 0; 0 4](x1; x2) = x1² + 2x2² + x1 + x2 + 5 (2.21)

Deriving the gradient of φ(x),

∇φ(x) = ∇b + ∇(c^T x) + ∇((1/2) x^T G x), (2.22)

the first term in (2.22) becomes zero and the second term becomes c. The last term can be evaluated using the vector identity ∇(u^T v) = (∇u^T) v + (∇v^T) u, defining u = x and v = Gx:

∇((1/2) x^T G x) = (1/2)((∇x^T) G x + ∇(Gx)^T x) = (1/2)(I G x + ∇(x^T G^T) x) = (1/2)(G + G^T) x. (2.23)

If the matrix G is symmetric (G^T = G, where G^T is the transpose of G), the gradient of φ(x) becomes

∇φ(x) = g(x) = G x + c, (2.24)


which is a linear function, and the second derivative of φ(x) becomes the constant Hessian matrix G. A consequence of (2.24) is that if x′ and x′′ are two given points, and if g′ = ∇φ(x′) and g′′ = ∇φ(x′′), then

g′ − g′′ = G(x′ − x′′), (2.25)

that is, the Hessian matrix maps differences in position into differences in gradient. Using the Taylor-series expansion

y(x + h) = y(x) + h dy(x)/dx + (1/2!) h² d²y(x)/dx² + · · · (2.26)

the following relationship between φ(x) and φ(x + αp) for any vectors x, p and scalar α is found (dy(x)/dx becomes ∇φ = g and d²y(x)/dx² becomes ∇(∇φ)^T = G):

φ(x + αp) = φ(x) + α p^T g(x) + (1/2)(αp)^T G (αp) (2.27)

Using (2.24), the relation (2.27) can be written as

φ(x + αp) = φ(x) + α p^T (G x + c) + (1/2) α² p^T G p. (2.28)

The function φ(x) has a stationary point only if there exists a point x∗ where the linear term in (2.28) vanishes, i.e. it must hold that ∇φ(x∗) = G x∗ + c = 0. Therefore, a stationary point x∗ must satisfy the following system of linear equations:

G x∗ = −c. (2.29)

For the example (2.21), the necessary condition (2.29) becomes a system of 2 equations,

[2 0; 0 4](x1; x2) = −(1; 1)  ⟹  2x1 = −1, 4x2 = −1, (2.30)

which has its solution at (−1/2, −1/4)^T, where φ(x) becomes 4.625.
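The stationary point of (2.30) can be reproduced by solving G x∗ = −c directly; Cramer's rule is enough for this 2×2 sketch (illustrative code, not from the notes):

```python
# G x* = -c for the example (2.21): G = [[2, 0], [0, 4]], c = (1, 1), b = 5
G = [[2.0, 0.0], [0.0, 4.0]]
c = [1.0, 1.0]
b = 5.0

# Cramer's rule for the 2x2 system G x = -c
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
x1 = (-c[0] * G[1][1] + c[1] * G[0][1]) / det
x2 = (-c[1] * G[0][0] + c[0] * G[1][0]) / det
print(x1, x2)        # -0.5 -0.25, the stationary point found in (2.30)

def phi(x1, x2):
    # the quadratic (2.21)
    return x1**2 + 2.0 * x2**2 + x1 + x2 + b

print(phi(x1, x2))   # 4.625
```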

If x∗ is a stationary point, it follows from (2.28) and (2.29) that

φ(x∗ + αp) = φ(x∗) + (1/2) α² p^T G p. (2.31)

Hence the local behaviour of φ(x) in the neighborhood of a stationary point x∗ is determined by the Hessian matrix G, which is called the curvature of φ(x).


2.2 Optimality conditions

2.2.1 Unconstrained optimization

An unconstrained problem in n dimensions can be defined in the following way:

min x∈Rⁿ f(x) (2.32)

In this case a sufficient condition for a local minimizer x∗ is that the gradient g(x∗) (or its norm) vanishes and the Hessian matrix G becomes positive definite (s denotes any search direction):

‖g(x∗)‖ = 0 (2.33)

G(x∗) is positive definite (2.34)

Condition (2.34) can also be written as

s^T G(x∗) s > 0 for all s ≠ 0. (2.35)

The minimizer of Rosenbrock's function f(x) = 100(x2 − x1²)² + (1 − x1)² equals x∗ = (1, 1)^T, where the objective function becomes 0:

f(x∗) = 100(x2∗ − (x1∗)²)² + (1 − x1∗)² = 0. (2.36)

The gradient and the Hessian matrix can be evaluated at (1, 1)^T as

g(x∗) = (0; 0) (2.37)

and

G(x∗) = [802 −400; −400 200] (2.38)

The gradient (2.37) fulfills the requirement of (2.33). Applying s^T G(x∗) s to (2.38) leads to s^T G(x∗) s = 2s1² + 200(2s1 − s2)² > 0 for all s ≠ 0, proving that x∗ = (1, 1)^T is a minimizer.
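Positive definiteness of (2.38) can also be checked with Sylvester's criterion (all leading principal minors positive), which avoids testing every direction s. A short illustrative sketch:

```python
# Hessian (2.38) at the minimizer x* = (1, 1)
G = [[802.0, -400.0], [-400.0, 200.0]]

# Sylvester's criterion: a symmetric matrix is positive definite
# iff all leading principal minors are positive.
m1 = G[0][0]
m2 = G[0][0] * G[1][1] - G[0][1] * G[1][0]
print(m1, m2)                  # 802.0 400.0
print(m1 > 0 and m2 > 0)       # True: G(x*) is positive definite

# spot check of s^T G s for a few arbitrary directions
for s in [(1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]:
    q = sum(s[i] * G[i][j] * s[j] for i in range(2) for j in range(2))
    print(q > 0.0)             # True in each case
```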


2.2.2 Linearly constrained Optimization

2.2.2.1 Linear equality constraints

A problem that contains m linear equality constraints (LEP) can be defined as

min x∈Rⁿ f(x) (2.39)

subject to Ax = b. (2.40)

The i-th row of the m × n matrix A will be denoted by ai^T and contains the coefficients of the i-th linear constraint:

ai^T x = ai1 x1 + · · · + ain xn = bi. (2.41)

Writing (2.41) in a more general way leads to the following expression for the i-th constraint:

ci(x) = ai^T x − bi = 0. (2.42)

It can be seen easily from (2.41) and (2.42) that ai can be derived as the gradient of the i-th constraint function,

ai(x) = ∇ci(x), (2.43)

as will be done in the case of nonlinear constraints. If a feasible direction between two feasible points x and x̄ is given by p, then (2.44) must hold,

Ap = 0, (2.44)

since Ax = b and Ax̄ = b following (2.40). If the columns of the matrix Z form the basis of the subspace of vectors satisfying (2.44), then every feasible direction can be written as a linear combination of the columns of Z and an arbitrary vector pz:

p = Z pz (2.45)

Using (2.44) and (2.45) leads to AZ = 0, since pz is arbitrary. Using (2.28) and taking into account that f(x) is (in general) not a quadratic function, f(x) can be approximated around the minimizer x∗:

f(x∗ + αZpz) ≈ f(x∗) + α pz^T Z^T g(x∗) + (1/2) α² pz^T Z^T G(x∗) Z pz (2.46)

It can be shown ([15]) that x∗ is a local minimizer only if α pz^T Z^T g(x∗) vanishes. Therefore, a necessary condition for a local minimum x∗ of an LEP can be stated as in (2.47):

Z^T g(x∗) = 0 (2.47)


since pz is arbitrary. Equation (2.47) and AZ = 0 imply that g(x∗) must be a linear combination of the rows of A, i.e.

g(x∗) = Σ (i=1..m) ai λi∗ = A^T λ∗, (2.48)

for some vector λ∗, which is termed the vector of Lagrangian multipliers λi∗:

Z^T g∗ = Z^T A^T λ∗ = (AZ)^T λ∗ = 0 (2.49)

In Fig. 2.3 the minimum of an optimization problem with linear equality constraints is shown.

Figure 2.3: Contour lines of f(x), linear equality constraint

The sufficient conditions for a minimum of an LEP are summarized in (2.50) [15]:

Ax∗ = b
Z^T g(x∗) = 0 or, equivalently, g(x∗) = A^T λ∗ (2.50)
Z^T G(x∗) Z is positive definite

where Z^T g(x∗) is called the projected gradient and Z^T G(x∗) Z is termed the projected Hessian matrix.

Assume a constraint x3 = 0, i.e. 0x1 + 0x2 + 1x3 = 0. This leads to the matrix A = [0 0 1], with the vector a1 = (0, 0, 1)^T. Any feasible perturbation p starting from a feasible point x can be done in the x1, x2 plane only (Fig. 2.4).

Figure 2.4: Linear constraints

The feasible subspace can be spanned by the two basis vectors (1, 0, 0)^T and (0, 1, 0)^T, from which any feasible perturbation can be derived as a linear combination. If

Z = [1 0; 0 1; 0 0],

then any feasible direction p can be written as p = Z pz, where pz is an unconstrained, arbitrary vector. Z can be derived from A by applying an appropriate decomposition algorithm (e.g. RQ or LQ decomposition [15]). Taking into account (2.44) and (2.45) leads to AZpz = 0. Since pz can be chosen arbitrarily, AZ = 0. From the Taylor series expansion (2.46) it can be seen that α pz^T Z^T g∗ = 0. Since α and pz can be chosen arbitrarily, Z^T g∗ = 0. The two main relations are then:

AZ = 0
g∗^T Z = 0

The next step is to express g∗. Assume another matrix

A = [a11 a12 a13; a21 a22 a23]

and a matrix Z = (z1, z2, z3)^T with a single column. Then

AZ = (a11 z1 + a12 z2 + a13 z3; a21 z1 + a22 z2 + a23 z3) = (0; 0).

Let us assume that g∗ = A^T λ∗, where λ∗ = (λ1∗, λ2∗)^T. Then

g∗ = A^T λ∗ = [a11 a21; a12 a22; a13 a23](λ1∗; λ2∗) = (a11 λ1∗ + a21 λ2∗; a12 λ1∗ + a22 λ2∗; a13 λ1∗ + a23 λ2∗) = λ1∗ (a11; a12; a13) + λ2∗ (a21; a22; a23) = Σi ai λi∗.

Then g∗^T Z becomes

[a11 λ1∗ + a21 λ2∗, a12 λ1∗ + a22 λ2∗, a13 λ1∗ + a23 λ2∗](z1; z2; z3) = (a11 z1 + a12 z2 + a13 z3) λ1∗ + (a21 z1 + a22 z2 + a23 z3) λ2∗ = 0 · λ1∗ + 0 · λ2∗ = 0,

leading to g∗^T Z = 0 as required.
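The same mechanics can be traced numerically for the constraint x3 = 0 of Fig. 2.4. The objective used below is an assumed quadratic, chosen only so that its minimizer on the constraint plane is known in closed form; it is not an example from the notes:

```python
A = [0.0, 0.0, 1.0]                           # single constraint row: x3 = 0
Zcols = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]    # basis of the feasible subspace

# AZ = 0: the constraint row is orthogonal to every column of Z
AZ_check = [sum(A[i] * z[i] for i in range(3)) for z in Zcols]
print(AZ_check)        # [0.0, 0.0]

# Assumed objective f(x) = (x1-1)^2 + (x2-2)^2 + (x3-3)^2,
# whose minimizer on the plane x3 = 0 is x* = (1, 2, 0).
def g(x):
    return (2.0 * (x[0] - 1.0), 2.0 * (x[1] - 2.0), 2.0 * (x[2] - 3.0))

gstar = g((1.0, 2.0, 0.0))
# projected gradient Z^T g* vanishes at the constrained minimizer (2.47)
proj = [sum(gstar[i] * z[i] for i in range(3)) for z in Zcols]
print(proj)            # [0.0, 0.0]
print(gstar)           # (0.0, 0.0, -6.0) = A^T lambda* with lambda* = -6, cf. (2.48)
```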

2.2.2.2 Linear inequality constraints

Consider the problem in which the constraints are a set of linear inequalities:

min x∈Rⁿ f(x) (2.51)

subject to Ax ≥ b (2.52)

Deriving the optimality criteria for such a problem, it is important to distinguish between constraints that hold exactly and those that do not. At a feasible point x the constraint ai^T x ≥ bi is said to be active or binding if ai^T x = bi, and inactive if ai^T x > bi. The constraint is said to be satisfied if it is active or inactive. If ai^T x < bi, the constraint is said to be violated at x. If the j-th constraint is inactive at the feasible point x, it is possible to move a non-zero distance from x in any direction without violating that constraint. On the other hand, an active constraint restricts feasible perturbations p in every neighborhood of a feasible point. Firstly, if p satisfies

aj^T p = 0 (2.53)

the direction p is termed a binding perturbation with respect to the j-th constraint ("on" the constraint). Secondly, if p satisfies

aj^T p > 0 (2.54)

the direction p is termed a non-binding perturbation with respect to the j-th constraint ("off" the constraint). In order to determine whether the feasible point x∗ is optimal for an LIP, it is necessary to identify the active constraints. Let the t rows of the matrix At contain the coefficients of the constraints active at x∗. Then a necessary condition for optimality of x∗ is that Zt^T g(x∗) = 0, or

g(x∗) = At^T λ∗. (2.55)

Additionally, one has to guarantee that no non-binding perturbation p at x∗ is a descent direction decreasing the objective function f(x). As can be seen from Fig. 2.5, any direction s(k) decreasing the objective f(x) must satisfy (2.56):

s(k)^T g(k) < 0. (2.56)

To avoid a decrease of f(x) we seek a condition to ensure that for all p satisfying Ap ≥ 0, it holds that g(x∗)^T p ≥ 0 (inverse descent condition). Since g(x∗) is a linear combination of the rows of At, the desired condition is that


Figure 2.5: Descent condition

g(x∗)^T p = λ1∗ a1^T p + · · · + λt∗ at^T p ≥ 0 (2.57)

where ai^T p ≥ 0, i = 1, · · · , t. Condition (2.57) will hold only if λi∗ ≥ 0, i = 1, · · · , t, i.e. x∗ will not be optimal if there are any negative Lagrange multipliers. To see why, assume that x∗ is a local minimum (so that (2.55) must hold), but that λj∗ < 0 for some j. Since the rows of At are linearly independent, there must exist a feasible perturbation p such that

• aj^T p = 1 (non-binding perturbation with respect to the j-th constraint)

• ai^T p = 0, i ≠ j (binding perturbation with respect to all other constraints).

For such a p,

g(x∗)^T p = λj∗ aj^T p = λj∗ < 0, (2.58)

hence p is a feasible descent direction, which contradicts the optimality of x∗ [15].

In Fig. 2.6 the minimum of an optimization problem is shown, where hatching on one side of the inequality constraint indicates the non-feasible side. The sufficient conditions for a minimum of an LIP are summarized in (2.59):

Ax∗ ≥ b
Zt^T g(x∗) = 0 or, equivalently, g(x∗) = At^T λ∗ (2.59)
λi∗ ≥ 0, i = 1, · · · , t
Zt^T G(x∗) Zt is positive definite.
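As a minimal numerical illustration of (2.59), consider an assumed convex objective with a single active inequality constraint; the sign of the Lagrange multiplier decides optimality. The objective and constraint below are illustrative choices, not examples from the notes:

```python
# Assumed example: minimize f(x) = (x1-1)^2 + (x2+1)^2 subject to x2 >= 0.
# Candidate x* = (1, 0): the constraint x2 >= 0 is active there.
a = (0.0, 1.0)            # row of A_t for the active constraint
xstar = (1.0, 0.0)

def g(x):
    return (2.0 * (x[0] - 1.0), 2.0 * (x[1] + 1.0))

gstar = g(xstar)          # (0.0, 2.0)
# g(x*) = A_t^T lambda*  =>  lambda* = 2 here, and lambda* >= 0,
# so the conditions of (2.59) hold and x* is the constrained minimizer.
lam = gstar[1] / a[1]
print(gstar, lam, lam >= 0.0)   # (0.0, 2.0) 2.0 True
```

Had the multiplier come out negative, moving off the constraint (a non-binding perturbation) would decrease f, exactly the situation ruled out by (2.58).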

2.2.3 Nonlinearly constrained optimization

The (equality or inequality) constraints imposed upon the variables do not, in general, involve linear functions only. A problem that contains only nonlinear equality constraints is given in (2.60):


Figure 2.6: Contour lines of f(x), linear inequality constraints

min x∈Rⁿ f(x)

subject to ci(x) = 0, i = 1, . . . , m (2.60)

while a problem in which all the constraints are nonlinear inequalities can be written as in (2.61):

min x∈Rⁿ f(x)

subject to ci(x) ≥ 0, i = 1, . . . , m (2.61)

For such types of problems, optimality conditions similar to (2.50) and (2.59) can be derived [6], [15]. The optimality conditions for constrained optimization mentioned in this chapter are often referred to as Kuhn-Tucker conditions; optimal points are called Kuhn-Tucker points [14].

2.3 Line Search subproblem

Even in early optimization methods the idea occurs of searching along coordinate directions or in more general directions. This is an example of a line search method, which is a widely used approach for solving unconstrained optimization problems. After supplying an initial estimate x(1), the basic structure of the k-th iteration is

(a) determine a direction of search s(k)


(b) find α(k) to minimize f(x(k) + α(k)s(k)) with respect to α(k)

(c) set x(k+1) = x(k) + α(k)s(k)

Different methods, which will be described in more detail later, correspond to different ways of choosing s(k) in step (a), based on the information available in the method.


Figure 2.7: Line search following the descent condition

From Fig. 2.7 it can be seen that any direction s decreasing the objective f(x) must satisfy (2.62):

s(k)^T g(k) < 0. (2.62)

This condition, which in general must be fulfilled by deterministic methods, is termed the descent condition; methods following this condition are termed greedy algorithms.

2.3.1 Line search algorithm

After having defined a search direction, a point along this direction has to be specified where f(x) becomes a minimum. This can be done using two distinct phases:

• bracketing phase : find an interval [a, b] where the minimum is expected

• sectioning phase : divide [a, b] into sub-intervals [a(j), b(j)], which become smaller and smaller, until a point is located as the minimum or, more generally, as an acceptable point.

If a value α(k) (see Fig. 2.8) exists with the same objective function value as for α = 0

f(x(k) + α(k)s(k)) = f(x(k)) (2.63)

then an acceptable point in [0, α(k)] has to obey the following requirements

• α(k) must decrease the objective function significantly

• α(k) must not be too close to α = 0 and α = α(k), respectively.



Figure 2.8: Line search : Interval with acceptable points (Goldstein interval)

If a constant ρ is defined, an interval called the “Goldstein interval” containing acceptable points can be defined using a left limit

f(α) ≤ f(0) + αρf ′(0) (2.64)

and a right limit

f(α) ≥ f(0) + α(1 − ρ)f ′(0) (2.65)

where f(x(k) + α(k)s(k)) is denoted by f(α) [6]. Any acceptable point reduces the objective function at least by ∆f:

∆f = f(k) − f(k+1) = −ρ g(k)^T δ(k), (2.66)

where δ(k) is defined as x(k+1) − x(k). It can be shown that the minimum of a quadratic function always lies within the region of acceptable points if ρ is chosen smaller than 1/2.

Still, as can be seen from Fig. 2.8, the minimum of an arbitrary function does not necessarily lie within the specified region. Therefore, the limit (2.65) is substituted by [6]

f′(α) ≥ σf′(0) with σ > ρ. (2.67)

An alternative way to define the right limit can be found using (2.68):

f′(α) ≤ −σf′(0) with σ > ρ. (2.68)

The width of this interval of acceptable points, called the Wolfe-Powell interval (see Fig. 2.9), can hence be tuned using the parameter σ.



Figure 2.9: Line search : Interval with acceptable points (Wolfe-Powell interval)

2.3.2 Bracketing and sectioning phase

In this phase the iterates αi increase (in positive α direction) until

• f(α) ≤ f (see Fig. 2.10)

• an interval [ai, bi] with acceptable points (2.67 - 2.68) has been detected.

In addition to σ and ρ, a value f has to be specified that will be accepted as a minimum anyway.


Figure 2.10: Line search : Definition of an upper limit

From Fig. 2.10 the value µ can be calculated using (2.69):

µ = (f − f(0)) / (ρf′(0)). (2.69)


Flowchart Fig. 2.11 (a) shows the basic algorithm to find a proper interval. The parameters τ1, τ2 and τ3 have to be initialized. In the beginning α0 is set to 0 and 0 < α1 < µ.

(a) Bracketing phase

(b) Sectioning phase

Figure 2.11: Line search algorithm

To speed up the iteration process, in general a quadratic or a cubic spline interpolation is utilized to find a new iterate inside the interval [ai, αi], if f(ai), f′(ai) and f(αi) are known (Fig. 2.12). First of all the interval [ai, αi], together with the function and slope values, is transformed to [0, 1] (f(ai) → f(0), f′(ai) → f′(0) and f(αi) → f(1)). This leads to a quadratic polynomial q(z) = a + bz + cz²:

q(z) = f(0) + f′(0)z + (f(1) − f(0) − f′(0))z² (2.70)



Figure 2.12: Line search : Quadratic interpolation

If f′(αi) is known as well and denoted by f′(1), a cubic polynomial c(z) = a + bz + cz² + dz³ can be derived:

c(z) = f(0) + f′(0)z + [3(f(1) − f(0)) − 2f′(0) − f′(1)]z² + [−2(f(1) − f(0)) + f′(0) + f′(1)]z³ (2.71)

There exists a linear transformation between α in the interval [a, b] and z in [0, 1],

α = a + z(b − a), (2.72)

and

df/dz = (b − a) df/dα. (2.73)

The coefficients a, b, c, (d) of the polynomial on [0, 1] can be evaluated using the following transformation matrices:

[ a ]   [  1  0  0 ] [ f(0)         ]
[ b ] = [  0  1  0 ] [ f′(0)(b − a) ]   (2.74)
[ c ]   [ −1 −1  1 ] [ f(1)         ]

and

[ a ]   [  1  0  0  0 ] [ f(0)         ]
[ b ] = [  0  1  0  0 ] [ f′(0)(b − a) ]   (2.75)
[ c ]   [ −3 −2  3 −1 ] [ f(1)         ]
[ d ]   [  2  1 −2  1 ] [ f′(1)(b − a) ]

A line search is started at x0 along the direction s = (1, 0)^T. The objective function f(x) and the slope f′(x) can be written in terms of α:


[Fig. 2.13 shows three panels: a zoomed view of Rosenbrock’s function; the bracketing phase with the actual function, the acceptance interval, the interpolation function and the iterates α1 and α2; and the sectioning phase with the actual function (solid line), the interpolating function (dashed line), the sectioning interval [a, b], its minimizer and the new iterate.]

Figure 2.13: Line search using Rosenbrock’s function

f(α) = 100α⁴ + (α − 1)² (2.76)

f′(α) = 400α³ + 2(α − 1) (2.77)

The parameters are set as follows: σ = 0.1, ρ = 0.01, τ1 = 9, τ2 = 0.1 and τ3 = 0.5. The first iterate is fixed with α1 = 0.1. This leads to the following values:

α      | 0  | 0.1
f(α)   | 1  | 0.82
f′(α)  | −2 | −1.4


Next the value f is set to 0.1 and µ is evaluated using (2.69):

µ = (f − f(0)) / (ρf′(0)) = (0.1 − 1) / (−0.02) = 45 (2.78)

The initial guess does not give a bracket, since f(0.1) < f(0), f(0.1) > f and f′(0.1) < 0. The bracketing algorithm (Fig. 2.11 (a)) requires that the new iterate is chosen in

α_{i+1} ∈ [2α_i − α_{i−1}, min(µ, α_i + τ1(α_i − α_{i−1}))] = [0.2, 1]. (2.79)

Mapping the initial interval [0, 0.1] onto [0, 1] in z-space, the resulting cubic fit is c(z) = 1 − 0.2z + 0.02z³, and the minimum value of c(z) in [2, 10] is at z = 2. Thus α2 = 0.2 is the next iterate, giving a new bracket [0.2, 0.1]. The new iterate is not among the acceptable points (the slope |f′(α2)| > σ|f′(α1)|), but satisfies the requirements for the final interval (f′(α2) > 0).

Now the sectioning algorithm (Fig. 2.11 (b)) comes into play and looks for a new iterate in [0.19, 0.15] using

α_j ∈ [a_j + τ2(b_j − a_j), b_j − τ3(b_j − a_j)] = [0.19, 0.15]. (2.80)

Mapping the interval [0.2, 0.1] onto [0, 1] in z-space, the resulting cubic fit is c(z) = 0.8 − 0.16z + 0.24z² − 0.06z³, and the minimum value of c(z) in [0.1, 0.5] can be found at z = 0.390524. Mapping this back into α-space leads to α3 = 0.160948.

α      | 0  | 0.1  | 0.2 | 0.160948
f(α)   | 1  | 0.82 | 0.8 | 0.771111
f′(α)  | −2 | −1.4 | 1.6 | −0.010423

The new iterate is an acceptable point (the slope |f′(α3)| = 0.010423 < σ|f′(α1)| = 0.14).

2.4 Derivatives of the objective function f(x)

In the methods presented in the previous sections, both for unconstrained and constrained optimization, many considerations are only relevant when the method is supplied with the gradient vector g(k)(x) at a given point x(k). The evaluation of the gradient can be done either numerically or using some analytic approach.


2.4.1 Numerically derived gradient g(x)

The simplest way to calculate the gradient of the objective function is some approximation of it using finite differences, where the i-th component of g(x) is either evaluated as a forward difference

gi(x) ≈ (f(x + hei) − f(x))/h (2.81)

or as a central difference

gi(x) ≈ (f(x + hei) − f(x − hei))/(2h) (2.82)

where ei is the unit vector of the i-th coordinate direction. The major drawback of using the differences (2.81) or (2.82) is the high number of function calls needed to obtain the gradient in one point. Commercial software like the routines of the NAG library or the IMSL library usually does (2n + 1) function evaluations to arrive at a reliable approximation of g(x) if there are n optimization parameters. On the other hand, no modification of the analysis software is required at all to couple an existing analysis software like a Finite Element package to an optimization routine using finite differences.
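A minimal sketch of the two difference formulas; the quadratic test function is the one used for the quasi Newton example in section 3.1.3, and the step length h is an illustrative choice.

```python
# Forward (2.81) and central (2.82) difference approximations of the
# gradient; f stands in for an expensive analysis run.

def grad_forward(f, x, h=1e-6):
    fx = f(x)                              # 1 call, reused for every component
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        g.append((f(xp) - fx) / h)         # n further calls in total
    return g

def grad_central(f, x, h=1e-6):
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))  # 2n calls in total
    return g

f = lambda x: 10 * x[0]**2 + x[1]**2     # test function of section 3.1.3
print(grad_central(f, [0.1, 1.0]))       # analytic gradient is (2, 2)
```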

2.4.2 Analytically derived gradient g(x) : Adjoint Variable approach

In this section a method will be discussed briefly which allows a highly accurate calculation of the gradient of the objective function f(x), if the underlying forward problem can be formulated as a linear equation system in the following way:

KA = r, (2.83)

where K denotes the stiffness matrix, A the solution in the nodes of the discretized region of interest and r the vector containing the given sources. In the general case, K, A and r are functions of the optimization parameters x.

The objective function f(x) can depend explicitly or implicitly on x:

f = f(x,A(x)). (2.84)

The i-th component of the gradient of f(x) with respect to the optimization parameter xi can be derived as given in (2.85):

df/dxi = ∂f/∂xi + (∂f/∂A)^T (∂A/∂xi). (2.85)

The expression (∂f/∂A)^T denotes the partial derivatives of f(x) with respect to the elements in the solution vector A, which have to be calculated only for those few components of the vector A contributing to the objective function f(x). The term ∂A/∂xi denotes the partial derivatives of all elements of A with respect to xi. For the evaluation of this second term, first of all the partial derivative of (2.83) is calculated using the product rule:

(∂K(x)/∂xi) A(x) + K(x) (∂A(x)/∂xi) = ∂r(x)/∂xi. (2.86)

Using (2.86), ∂A/∂xi can be obtained as the solution of the linear equation system (2.87):

K(x) (∂A(x)/∂xi) = ∂r(x)/∂xi − (∂K(x)/∂xi) A(x). (2.87)

The solution of (2.87) has to be found for each component xi of the n-dimensional optimization parameter vector x, which means, together with the solution of (2.83), (n + 1) solutions of a linear equation system. This number can be reduced to two solutions by introducing an adjoint variable Υ as the solution of the linear equation system (2.88):

K^T Υ = ∂f/∂A (2.88)

Taking the transpose of the left hand side of (2.88) and taking into account a symmetric matrix K (which is the case for the Finite Element Method, but not for the Boundary Element Method), the inner product of (2.85) can be written as

(∂f/∂A)^T (∂A/∂xi) = (Υ^T K) (∂A/∂xi) = Υ^T (K (∂A/∂xi)). (2.89)

Using the right hand side of (2.87) in (2.89), the i-th component of the gradient can be written as

df/dxi = ∂f/∂xi + Υ^T [ ∂r(x)/∂xi − (∂K(x)/∂xi) A(x) ] (2.90)

Summing up, the advantage of the adjoint variable method is that only the two linear equation systems (2.83) and (2.88) have to be solved, independently of the number of optimization parameters. The main drawback of this approach is that the derivatives of the coefficient matrix K with respect to the optimization parameters xi have to be evaluated. This means that the dependency of the coefficients on the optimization parameters has to be expressed explicitly, which in general means a major modification to an existing software package. Still, if one has to treat a similar problem again and again, or if the optimization or identification process has to be done quickly (real time applications), the adjoint variable method is very powerful.


Assume the following linear dc network:


Figure 2.14: Linear dc network

Given are the values: R1 = 3Ω, R2 = 3Ω, R3 = 5Ω, R4 = 5Ω, R5 = 6Ω, R6 = 3Ω, R8 = 2Ω, R9 = 2Ω, I = 3A. The resistors R7, R10 and the voltage U, summarized in the vector x = (R7, R10, U)^T, should be optimized in such a way that the node-to-datum voltages reach the following values (N4 is the datum):

• Un0,1 = 10V

• Un0,2 = 6V

• Un0,3 = 4V

The forward problem can be solved using the method of nodal analysis.

YUn = r, (2.91)

where Y is the node admittance matrix, Un = (Un,1, Un,2, Un,3)^T the vector of the node-to-datum voltages, and r is the right hand side containing the sources of the problem. Using Fig. 2.14, the matrix Y and the right hand side r can be derived:

Y11 = 1/R1 + 1/R2 + 1/(R3R4/(R3+R4) + R5)
Y12 = Y21 = −1/R2
Y13 = Y31 = −1/(R3R4/(R3+R4) + R5)
Y22 = 1/R2 + 1/R10 + 1/(R6R7/(R6+R7))
Y23 = Y32 = −1/(R6R7/(R6+R7))
Y33 = 1/(R6R7/(R6+R7)) + 1/(R8+R9) + 1/(R3R4/(R3+R4) + R5)


r = (U/R1, I, 0)^T (2.92)

The objective can be set up in a least squares sense:

f(Un(x)) = (Un1 − Un0,1)² + (Un2 − Un0,2)² + (Un3 − Un0,3)² (2.93)

If we need to know the component of the gradient of f(x) with respect to R7, we have to perform the following manipulation, referring to (2.85):

df/dR7 = ∂f/∂R7 + (∂f/∂Un)^T (∂Un/∂R7). (2.94)

The partial derivative ∂f/∂R7 becomes zero, since f(x) does not explicitly depend on R7. The partial derivatives of f(Un(x)) can be derived as shown in (2.95):

∂f/∂Un1 = 2 [Un1(x) − Un0,1]
...
∂f/∂Un = 2 (Un1(x) − Un0,1, Un2(x) − Un0,2, Un3(x) − Un0,3)^T (2.95)

Next the adjoint variable Υ is introduced and obtained by solving the following system of linear equations:

YΥ = ∂f/∂Un (2.96)

Then (2.94) can be rewritten using (2.90):

df/dR7 = Υ^T [ −(∂Y(x)/∂R7) Un(x) ] (2.97)

The terms ∂f/∂xi and ∂r(x)/∂xi of (2.90) are zero, since both f and r do not depend on R7 explicitly.


From (2.97) it can be seen that the coefficients of the matrix Y have to be differentiated with respect to R7:

∂Y/∂R7 =
[ 0    0        0      ]
[ 0   −1/R7²    1/R7²  ]   (2.98)
[ 0    1/R7²   −1/R7²  ]

Finally the component ∇R7f(R7, R10, U) can be calculated using simple matrix manipulations only.

If we assume to have assigned the following values to the fixed resistors and the current source

R1 = 3Ω   R2 = 3Ω   R3 = 5Ω
R4 = 5Ω   R5 = 6Ω   R6 = 3Ω
R8 = 2Ω   R9 = 2Ω   I = 3A     (2.99)

and if we assume to have the following parameter values

R7 = 3Ω   R10 = 5Ω   U = 30V   (2.100)

the matrix system to solve for Un becomes

[  0.784  −0.333  −0.118 ]        [ 10 ]            [ 21.11 ]
[ −0.333   0.917  −0.383 ]  Un =  [  3 ]  =⇒  Un =  [ 15.68 ]   (2.101)
[ −0.118  −0.383   0.751 ]        [  0 ]            [ 11.31 ]

which is far away from the required values. The right hand side and the solution for the adjoint system become

∂f/∂Un = (22.22, 19.36, 14.62)^T =⇒ Υ = (71.01, 75.94, 69.36)^T. (2.102)

The matrix ∂Y/∂R7 becomes


∂Y/∂R7 =
[ 0    0         0       ]
[ 0   −0.1111    0.1111  ]   (2.103)
[ 0    0.1111   −0.1111  ]

The component ∇R7f(R7, R10, U) can be calculated using (2.97) and becomes

df/dR7 = −(71.01, 75.94, 69.36) [ 0 0 0; 0 −0.1111 0.1111; 0 0.1111 −0.1111 ] (21.11, 15.68, 11.31)^T = 3.1968 (2.104)

How does this solution correspond to a numerical gradient using simple forward differences (2.81)? The component ∇R7f(R7, R10, U) can be calculated using an appropriate ∆R7:

df/dR7 ≈ (f(· · · R7 + ∆R7 · · ·) − f(· · · R7 · · ·)) / ∆R7 (2.105)

Table 2.1 summarizes the behavior of the numerical gradient for different ∆R7.

∆R7      df/dR7
1Ω       3.05
0.1Ω     3.1843
0.01Ω    3.1955
0.001Ω   3.1966

Table 2.1: Values of the numerical gradient versus ∆R7

It can be seen that the numerical gradient converges to the analytical solution. Finally, let us check the situation in the optimal point, which has been found to be R7 = 10.3Ω, R10 = 1.48Ω and U = 16.12V. The node-to-datum voltages become Un = (9.99V, 5.98V, 3.99V)^T. The component ∇R7f(R7, R10, U) in this optimal point becomes 4.317 × 10⁻⁶, which corresponds to the first order condition of a minimizer, that ||∇f(x∗)|| = 0. Table 2.2 summarizes the behavior of the numerical gradient for different values of ∆R7.

It can be seen from Table 2.2 that the gradient can be rather sensitive to the correct choice of ∆R7.
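As a cross-check, the adjoint computation (2.96)-(2.97) can be replayed directly from the rounded numbers of (2.101)-(2.103); because the matrix entries are rounded to three digits, the result deviates slightly from the value 3.1968 quoted in (2.104). The helper solve3 is an ad hoc Gaussian elimination written only for this sketch.

```python
# Replaying the adjoint gradient computation with the rounded numeric
# data of the dc-network example.

def solve3(A, b):
    A = [row[:] for row in A]; b = b[:]
    for i in range(3):                     # elimination with partial pivoting
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]; b[i], b[p] = b[p], b[i]
        for r in range(i + 1, 3):
            m = A[r][i] / A[i][i]
            A[r] = [arj - m * aij for arj, aij in zip(A[r], A[i])]
            b[r] -= m * b[i]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                    # back substitution
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, 3))) / A[i][i]
    return x

Y = [[0.784, -0.333, -0.118],
     [-0.333, 0.917, -0.383],
     [-0.118, -0.383, 0.751]]
Un = solve3(Y, [10.0, 3.0, 0.0])                          # forward problem (2.91)
dfdUn = [2 * (u - u0) for u, u0 in zip(Un, [10, 6, 4])]   # (2.95)
ups = solve3(Y, dfdUn)                                    # adjoint system (2.96), Y symmetric
dY = [[0, 0, 0], [0, -1 / 9, 1 / 9], [0, 1 / 9, -1 / 9]]  # dY/dR7 at R7 = 3, cf. (2.98)
dYUn = [sum(dY[i][j] * Un[j] for j in range(3)) for i in range(3)]
df_dR7 = -sum(u * v for u, v in zip(ups, dYUn))           # (2.97)
print(df_dR7)   # near the analytic value 3.1968 (rounding of Y shifts it slightly)
```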


∆R7        df/dR7
1Ω         1070 × 10⁻⁶
0.1Ω       125 × 10⁻⁶
0.01Ω      16.6 × 10⁻⁶
0.001Ω     5.54 × 10⁻⁶
0.0001Ω    4.44 × 10⁻⁶
0.00001Ω   4.329 × 10⁻⁶

Table 2.2: Values of the numerical gradient versus ∆R7, Optimal point


3 Deterministic optimization methods

3.1 Unconstrained optimization

3.1.1 Non-derivative methods, AD-HOC methods, direct search methods

Using these zero-th order methods, no model of the objective function f(x) is established. The main advantage is that neither first nor second derivatives of f(x) are needed. On the other hand, the heuristic nature of these methods is reflected by a more or less large number of strategy parameters to be selected; the success of the different methods often crucially depends on the choice of these parameters.

3.1.1.1 Polytope algorithm (Simplex method)

(a) Polytopes for n=2 and 3

(b) Reflection of a vertex

Figure 3.1: Different types of polytopes and reflection of vertices

Instead of one initial point, an initial polytope with n + 1 vertices (if n is the dimension of the parameter space) is set up. Fig. 3.1 (a) shows two types of polytopes for n = 2 (triangle) and n = 3 (tetrahedron). At each iteration a new polytope will be generated by


producing a new point to replace the “worst” point of the old polytope (Fig. 3.1 (b)). Fig. 3.2 shows a simple example applying the polytope algorithm. The iteration process


Figure 3.2: Polytope method

can be described as follows:

• Replace the vertex with the “worst” objective function value by reflection (points 1 → 4, 2 → 5, 3 → 6, 5 → 7, 7 → 9, 12 → 13)

• If the objective function value of the new vertex is worse than the reflected one, the method would start to oscillate. Therefore, the second “worst” value is reflected (points 4 → 8, 8 → 10)

• If a vertex remains longer than a specified number of iterations M in the polytope (M is set to 3.5 in Fig. 3.2), the polytope is contracted (point 6 has been “living” too long: points 9 → 11, 10 → 12)
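The basic reflection step above can be sketched in a few lines; the oscillation handling and contraction rules of Fig. 3.2 are left out, and the test function and start simplex are illustrative.

```python
# One reflection step of the polytope method: the "worst" vertex is
# replaced by its mirror image through the centroid of the remaining
# vertices.

def reflect_worst(vertices, f):
    vs = sorted(vertices, key=f)           # best ... worst
    worst, others = vs[-1], vs[:-1]
    n = len(worst)
    centroid = [sum(v[i] for v in others) / len(others) for i in range(n)]
    reflected = [2 * centroid[i] - worst[i] for i in range(n)]
    return others + [reflected]

f = lambda x: x[0]**2 + x[1]**2
simplex = [[1.0, 1.0], [1.2, 1.0], [1.0, 1.4]]
for _ in range(10):
    simplex = reflect_worst(simplex, f)
print(min(f(v) for v in simplex))   # the best vertex has improved on f = 2.0
```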

3.1.1.2 Pattern search, Method of Hooke and Jeeves

This method can be characterized by two major moves in the parameter space:

• Exploratory move

• Pattern move

The exploratory move investigates the parameter space starting from some initial point in all the coordinate directions, respectively. Then there comes an extrapolation step in a direction constructed from the initial point and the final point of the exploratory move, called the pattern move. A quite interesting feature of this algorithm is that the pattern move does not necessarily decrease the objective function. Fig. 3.3 shows a simple example, Fig. 3.4 displays a flow-chart of this method. Table 3.1 shows how the method proceeds.
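The two moves can be sketched as follows; step-length control is reduced to simple halving, and all names and parameter values are illustrative rather than taken from the flow-chart.

```python
# Exploratory and pattern moves of the Hooke and Jeeves method.

def exploratory(f, x, step):
    x = list(x)
    for i in range(len(x)):
        for d in (step, -step):            # probe both coordinate directions
            trial = list(x); trial[i] += d
            if f(trial) < f(x):
                x = trial
                break
    return x

def hooke_jeeves(f, x0, step=0.25, iters=40):
    base = list(x0)
    for _ in range(iters):
        x = exploratory(f, base, step)
        if f(x) < f(base):
            pattern = [2 * xi - bi for xi, bi in zip(x, base)]  # pattern move
            y = exploratory(f, pattern, step)
            base = y if f(y) < f(x) else x
        else:
            step *= 0.5                     # decrease the step length
    return base

f = lambda x: 10 * x[0]**2 + x[1]**2
x = hooke_jeeves(f, [1.0, 1.0])
print(x)   # close to the minimum at the origin
```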



Figure 3.3: Example Pattern Search


Figure 3.4: Flow-chart of the Pattern Search Method


point | iteration | direction | point to compare | remark
0     | 0         | 0         | -                | initial point
1     | 0         | 1         | 0                | success
2     | 0         | 2         | 1                | failure
3     | 0         | 2         | 1                | success
4     | 1         | 0         | -                | extrapolation
5     | 1         | 1         | 4,3              | success, success
6     | 1         | 2         | 5                | failure
7     | 1         | 2         | 5                | failure
8     | 2         | 0         | -,5              | extrapolation, success
9     | 2         | 1         | 8                | failure
10    | 2         | 1         | 8                | failure
11    | 2         | 2         | 8                | failure
12    | 2         | 2         | 8                | failure
13    | 3         | 0         | -                | extrapolation
14    | 3         | 1         | 13               | failure
15    | 3         | 1         | 13,8             | success, failure
16    | 3         | 2         | 15               | failure
17    | 3         | 2         | 15               | failure
8     | 4         | 0         | -                | reset
18    | 4         | 1         | 8                | failure
19    | 4         | 1         | 8                | failure
20    | 4         | 2         | 8                | failure
21    | 4         | 2         | 8                | failure
8     | 5         | 0         | -                | decrease step-length
22    | 5         | 1         | 8                | failure
23    | 5         | 1         | 8                | success
24    | 5         | 2         | 23,8             | success
25    | 6         | 0         | -                | extrapolation

Table 3.1: Pattern search

3.1.2 Second derivative methods

3.1.2.1 Newton’s method

Newton’s method [6], [15] is based on a quadratic model of the objective function f(x) as discussed in section 2.1.7. There are two major reasons for choosing a quadratic model: its simplicity and, more importantly, the success and efficiency in practice of methods based on it.


If first and second derivatives of f(x) (i.e. the gradient g(x) and the Hessian matrix G(x)) are available, a quadratic model of the objective function f(x) can be obtained using (2.28):

f(x(k) + p) ≈ f(x(k)) + g(k)^T p + (1/2) p^T G(k) p. (3.1)

The quadratic function in (3.1) is formulated in terms of p (the step to the minimum) rather than in terms of the predicted minimum itself. The minimum of the right hand side of (3.1) can be achieved if p is a minimum of the quadratic function

φ(p) = g(k)^T p + (1/2) p^T G(k) p. (3.2)

From section 2.1.7 it is known that p(k) becomes a stationary point if the linear equation system (3.3) is satisfied:

G(k)p(k) = −g(k). (3.3)

The solution p(k) of (3.3) is called the Newton direction. If G(k) is a positive definite matrix, only one iteration step is required to find the minimum of the quadratic model from any starting point. Therefore, good convergence can be expected from Newton’s method if the model (3.1) is accurate.

Nevertheless, the basic Newton method as it stands is not really suitable for a general purpose algorithm, since G(k) may not be positive definite when x(k) is remote from the solution. Furthermore, even if G(k) is positive definite, convergence may not occur; in fact f(x) may not even decrease. The latter possibility can be overcome by Newton’s method with line search, in which the Newton direction is used to generate a direction of search s(k) (see section 2.3):

s(k) = −G(k)⁻¹ g(k), (3.4)

which fulfills the descent property if G and hence G⁻¹ are positive definite:

g(k)^T s(k) = −g(k)^T G(k)⁻¹ g(k) < 0. (3.5)

The main difficulty in modifying Newton’s method therefore arises when G(k) is not positive definite. One possible approach is to give it a bias towards the steepest descent vector −g(k). This is most conveniently achieved by adding a multiple of the unit matrix I to G(k) and solving the system

(G(k) + νI)s(k) = −g(k). (3.6)

If G(k) is close to being positive definite, then it may only be necessary to add a small multiple of I so as to give a good direction of search. The idea of modifying matrices in the way shown in (3.6) is called the Levenberg-Marquardt method [16], [18].
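The modification (3.6) can be sketched as follows for the 2 × 2 case; positive definiteness is checked via the leading minors, and the growth schedule for ν is an illustrative choice.

```python
# Modified Newton direction following (3.6): increase nu until G + nu*I
# is positive definite, then solve (G + nu*I) s = -g by Cramer's rule.

def modified_newton_direction(G, g):
    nu = 0.0
    while True:
        a, d, b = G[0][0] + nu, G[1][1] + nu, G[0][1]
        det = a * d - b * b
        if a > 0 and det > 0:              # 2x2 positive definiteness test
            s1 = (-g[0] * d + g[1] * b) / det
            s2 = (-g[1] * a + g[0] * b) / det
            return [s1, s2]
        nu = 1.0 if nu == 0.0 else 10.0 * nu

# G below is indefinite (eigenvalues 3 and -1), so a shift is required
s = modified_newton_direction([[1.0, 2.0], [2.0, 1.0]], [1.0, 0.0])
print(s[0] * 1.0 + s[1] * 0.0 < 0)   # descent condition g^T s < 0 holds
```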


Consider the following function

f(x) = x1⁴ + x1x2 + (1 + x2)² (3.7)

and a starting point x(1)^T = (0.75, −1.25). The function value is f(x(1)) = −0.558594.

To calculate the correction p(k), g(x(k)) and G(x(k)) are needed:

g(x) = ( 4x1³ + x2, x1 + 2(1 + x2) )^T (3.8)

G(x) = [ 12x1²  1 ]
       [ 1      2 ]   (3.9)

Starting at x(1) leads to

g(x(1)) = (0.4375, 0.25)^T (3.10)

G(x(1)) = [ 6.75  1 ]
          [ 1     2 ]   (3.11)

Next the correction p(1) can be calculated using (3.3):

p(1) = −G⁻¹ g(x(1)) = (−0.05, −0.1)^T. (3.12)

The new iterate x(2) = x(1) + p(1) becomes (0.7, −1.35)^T with a decreased objective function value of f(x(2)) = −0.5824. Table 3.2 shows the whole iteration process.

k        | 1        | 2       | 3
x1(k)    | 0.75     | 0.7     | 0.69591
x2(k)    | −1.25    | −1.35   | −1.34795
f(x(k))  | −0.55859 | −0.5824 | −0.582455
g1(k)    | 0.4375   | 0.022   | 0.0001402
g2(k)    | 0.25     | 0       | 0

Table 3.2: Newton’s method
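The iteration of Table 3.2 can be reproduced with a few lines; the 2 × 2 Newton system (3.3) is solved by Cramer’s rule.

```python
# Newton iteration for f(x) = x1**4 + x1*x2 + (1 + x2)**2 starting from
# (0.75, -1.25).

def grad(x):
    return [4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])]   # (3.8)

def hess(x):
    return [[12 * x[0]**2, 1.0], [1.0, 2.0]]             # (3.9)

def newton_step(x):
    g, G = grad(x), hess(x)
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    p = [(-g[0] * G[1][1] + g[1] * G[0][1]) / det,       # p = -G^{-1} g
         (-g[1] * G[0][0] + g[0] * G[1][0]) / det]
    return [x[0] + p[0], x[1] + p[1]]

x = [0.75, -1.25]
for _ in range(3):
    x = newton_step(x)
print(x)   # approaches the minimizer near (0.6959, -1.3479)
```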


3.1.3 First derivative methods

3.1.3.1 Quasi Newton method

The key to the success of Newton-type methods as presented in section 3.1.2 is the curvature information provided by the Hessian matrix, which allows a local quadratic model of f(x) to be developed. Quasi Newton methods [6], [15] are based on the idea of building up curvature information as the iterations proceed, using the observed behaviour of f(x) and g(x) without explicitly forming the Hessian matrix.

The inverse Hessian matrix G(k)⁻¹ is approximated by a symmetric positive definite matrix H(k), which is corrected or updated from iteration to iteration. Thus the k-th iteration has the following basic structure:

(a) set s(k) = −H(k)g(k)

(b) line search along s(k) giving x(k+1) = x(k) + α(k)s(k)

(c) update H(k) to become H(k+1).

The initial matrix H(1) is usually taken as the identity matrix I if no additional information is available. Potential advantages of this method are

• only first derivatives of f(x) are required

• positive definite H(k) guarantees the descent property.

Much of the interest lies in the updating formula which enables H(k+1) to be calculated from H(k). If the differences

δ(k) = x(k+1) − x(k) (3.13)

γ(k) = g(k+1) − g(k) (3.14)

are defined, δ(k) can approximately be mapped into γ(k) using a Taylor series expansion[6]:

γ(k) ≈ G(k)δ(k) (3.15)

or

δ(k) ≈ G(k)⁻¹ γ(k) = H(k)γ(k). (3.16)

Since γ(k) and δ(k) can only be calculated after the line search is completed, the matrix H(k) does not relate them correctly in the sense of (3.16). Thus the update H(k+1) is chosen so that it does correctly relate these differences, leading to


H(k+1)γ(k) = δ(k), (3.17)

which is called the Quasi Newton condition. Using (3.17), different update formulas can be derived. Perhaps the simplest possibility is to have

H(k+1) = H(k) + E(k) = H(k) + auuT (3.18)

where a symmetric rank one matrix E(k) = auuT is added to H(k). Using (3.17), it follows

H(k)γ(k) + auuT γ(k) = δ(k), (3.19)

and hence that u is proportional to δ(k) − H(k)γ(k). Since any change of length can be taken into a, u = δ(k) − H(k)γ(k) is set, which requires au^T γ(k) = 1. Therefore a becomes

a = 1/(u^T γ(k)) (3.20)

and the rank one update formula is obtained [6]:

H(k+1) = H(k) + (δ − Hγ)(δ − Hγ)^T / ((δ − Hγ)^T γ), (3.21)

Describing E as a rank two matrix finally leads to

H(k+1) = H(k) + δδ^T/(δ^T γ) − (Hγγ^T H)/(γ^T Hγ), (3.22)

which is generally known as the Davidon, Fletcher and Powell (DFP) formula. Finally, (3.23) is called the BFGS update formula, introduced by Broyden, Fletcher, Goldfarb and Shanno [6]:

H(k+1) = H(k) + (1 + (γ^T Hγ)/(δ^T γ)) δδ^T/(δ^T γ) − (δγ^T H + Hγδ^T)/(δ^T γ). (3.23)

The DFP and the BFGS update formulas are implemented in most of the mathematical libraries available.

Consider the following function

f(x) = 10x1² + x2² (3.24)

and a starting point x(1) = (0.1, 1)^T. The function value is f(x(1)) = 1.1. The minimum of (3.24) has to be found using a rank one Quasi Newton method.

To calculate a search direction s(k), g(x(k)) and G(x(k)) are needed:

39

Page 44: Optimization in Electrical Engineering - diegm.uniud.it Corsi... · Optimization in Electrical Engineering Christian Magele & Thomas Ebner Institute for Fundamentals and Theory in

3 Deterministic optimization methods

g(x) = (20x1, 2x2)^T (3.25)

G(x) = [ 20  0 ]
       [ 0   2 ]   (3.26)

The initial H(1) matrix, which is the approximation of the inverse of the Hessian matrix G, is chosen as the unit matrix I. Starting at x(1) leads to g(x(1)) = (2, 2)^T.

Next the search direction s(1) can be calculated:

s(1) = −H(1)g(1) = (−2, −2)^T. (3.27)

The objective function f(x) can be written in terms of α, if x and s are given.

f(α) = 10[x1 + αs1]2 + [x2 + αs2]2 (3.28)

Since the analytic expression for f(α) is known, an exact line search can be performed by setting the first derivative of f(α) to zero:

αmin = −(10x1s1 + x2s2)/(10s1² + s2²) (3.29)

Using (3.29), α(1) becomes 1/11. The new iterate x(2) becomes (−9/110, 9/11)^T; the new gradient g(2) becomes (−18/11, 18/11)^T.

The matrix H is updated using (3.21). The vector δ(1) (see (3.13)) is (−2/11, −2/11)^T, the vector γ(1) (see (3.14)) is (−40/11, −4/11)^T.

The update formula (3.21) is given here again:

H(k+1) = H(k) + (δ − Hγ)(δ − Hγ)^T / ((δ − Hγ)^T γ).


H^(1)γ^(1) becomes (−40/11, −4/11)^T; δ^(1) − H^(1)γ^(1) becomes (38/11, 2/11)^T; (δ^(1) − H^(1)γ^(1))(δ^(1) − H^(1)γ^(1))^T becomes [ 1444/121 76/121 ; 76/121 4/121 ]. The denominator becomes (δ^(1) − H^(1)γ^(1))^T γ^(1) = −1528/121. The update of the H-matrix therefore is

H^(2) = [ 1 0 ; 0 1 ] − (1/1528) [ 1444 76 ; 76 4 ] = [ 21/382 −19/382 ; −19/382 381/382 ]. (3.30)

The next search direction becomes s^(2) = (360/2101, −3600/2101)^T. Table 3.3 summarizes the whole iteration process.

k  x^(k)            g^(k)             δ^(k−1)          γ^(k−1)           H^(k)                               s^(k)                    α^(k)
1  (1/10, 1)        (2, 2)            −                −                 [1 0; 0 1]                          (−2, −2)                 1/11
2  (−9/110, 9/11)   (−18/11, 18/11)   (−2/11, −2/11)   (−40/11, −4/11)   [21/382 −19/382; −19/382 381/382]   (360/2101, −3600/2101)   2101/4400
3  (0, 0)           (0, 0)            (9/110, −9/11)   (18/11, −18/11)   [1/20 0; 0 1/2]                     −                        −

Table 3.3: Quasi-Newton method

It can be seen that the H-matrix converges towards the inverse Hessian matrix G^−1 = (1/20) [ 1 0 ; 0 10 ].

3.1.4 Sums of squares and nonlinear equations

This section deals with problems in which the objective function is a sum of m squared terms

f(x) = Σ_{i=1}^{m} [r_i(x)]^2 = r^T r (3.31)

where r = r(x). Special methods exist which use the structure of such nonlinear least squares problems advantageously. An alternative way of viewing such problems is that they arise from an attempt to solve the system of m equations

ri(x) = 0 i = 1, 2, . . . , m, (3.32)


where the r_i(x) can be termed the residuals of the equations. One important area in which such problems arise is data fitting, in which m is usually much larger than n, the number of optimization parameters. Such systems are termed over-determined. If the columns of the n × m Jacobian matrix J(x),

J(x) = [∇r_1, ∇r_2, . . . , ∇r_m], (3.33)

are defined as the first derivative vectors ∇r_i of the components of r (J_ij = ∂r_j/∂x_i), the gradient g(x) (3.34) and the Hessian matrix G(x) (3.35) of f(x) can be written as [6]:

g(x) = 2 J r (3.34)

and

G(x) = 2 J J^T + 2 Σ_{i=1}^{m} r_i ∇^2 r_i. (3.35)

3.1.4.1 Gauss-Newton method

Since r(x) of (3.32) is minimized in a least squares sense, in general the components r_i become very small. This suggests that a good approximation of G(x) can be found by neglecting the final term in (3.35) to give

G(x) ≈ 2 J J^T. (3.36)

A method making use of (3.36) is called a Gauss-Newton method. The important feature is that, using only J and r, it is possible to calculate g(x) exactly and to approximate the second derivative matrix G(x) with rather high accuracy. Therefore the obtained convergence is much more rapid than using a Quasi-Newton method. Because the Gauss-Newton method can suffer from the same problems as any second derivative method (see section 3.1.2), a line search approach is desirable. The correction term δ^(k) can be found by solving the following linear system (3.37):

J^(k) J^(k)T δ^(k) = −J^(k) r^(k). (3.37)

The new iterate x(k+1) becomes

x(k+1) = x(k) + δ (k) (3.38)

3.1.4.2 Levenberg-Marquardt method

Nonetheless, because the Gauss-Newton method can fail or converge slowly (the Hessian matrix can become indefinite), it is desirable to apply the Levenberg-Marquardt method [16], [18] to obtain a new search direction (3.39) which fulfills the descent condition:

[J^(k) J^(k)T + νI] s^(k) = −J^(k) r^(k), ν ≥ 0 (3.39)
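As a sketch of the damping effect of ν, the fragment below solves (3.39) for the one-variable residuals r_1 = x + 1 and r_2 = 0.1x^2 + x − 1 used in the example that follows (the function name and the ν values are our own assumptions):

```python
# Levenberg-Marquardt step (3.39) for r1 = x + 1, r2 = 0.1*x^2 + x - 1;
# for n = 1 the linear system reduces to a scalar equation.

def lm_step(x, nu, lam=0.1):
    r = [x + 1.0, lam * x * x + x - 1.0]   # residual vector
    J = [1.0, 2.0 * lam * x + 1.0]         # Jacobian (a row for n = 1)
    JJt = J[0] * J[0] + J[1] * J[1]        # J J^T
    Jr = J[0] * r[0] + J[1] * r[1]
    return -Jr / (JJt + nu)                # solve (J J^T + nu I) s = -J r

print(lm_step(1.0, 0.0))    # nu = 0 recovers the Gauss-Newton step -2.12/2.44
print(lm_step(1.0, 10.0))   # a larger nu damps the step toward zero
```

Setting ν = 0 recovers the Gauss-Newton correction of (3.37); increasing ν shortens the step and turns it toward the (damped) steepest descent direction.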


Consider the following over-determined system of equations (m = 2 equations, n = 1 variable)

r_1(x) = x + 1
r_2(x) = λx^2 + x − 1 (3.40)

and a starting point x^(1) = 1. Setting λ = 0.1, which means a small degree of non-linearity, the procedure (3.37) and (3.38) can be successfully applied. The first thing to do is to calculate the Jacobian matrix:

J = (∂r_1/∂x, ∂r_2/∂x) = (1, 0.2x + 1). (3.41)

Now the gradient g(x) can be derived using (3.34):

g(x) = 2 J r(x) = 2 (1, 0.2x + 1) (x + 1, 0.1x^2 + x − 1)^T = 0.04x^3 + 0.6x^2 + 3.6x, (3.42)

while J J^T becomes 2 + 0.4x + 0.04x^2. Starting at x = 1, one gets J^(1)J^(1)T = 2.44 and −J^(1)r^(1) = −2.12, leading to the simple equation 2.44 δ^(1) = −2.12 and δ^(1) = −0.8689. The new iterate hence becomes x^(2) = x^(1) + δ^(1) = 0.1311. Table 3.4 shows the iteration process.

k 1 2 3 4

x(k) 1 0.1311 0.01364 0.001369

Table 3.4: Gauss-Newton method
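The iteration of Table 3.4 can be reproduced with a short script (a sketch; the function and variable names are our own):

```python
# Gauss-Newton iteration (3.37)-(3.38) for the over-determined system (3.40)
# with lambda = 0.1, reproducing Table 3.4.

def gauss_newton(x, lam=0.1, iters=3):
    history = [x]
    for _ in range(iters):
        r = [x + 1.0, lam * x * x + x - 1.0]   # residuals (3.40)
        J = [1.0, 2.0 * lam * x + 1.0]         # Jacobian (3.41)
        JJt = J[0] * J[0] + J[1] * J[1]        # J J^T, a scalar for n = 1
        Jr = J[0] * r[0] + J[1] * r[1]
        x = x - Jr / JJt                       # delta from (3.37), step (3.38)
        history.append(x)
    return history

xs = gauss_newton(1.0)
print(xs)   # approximately 1, 0.1311, 0.01364, 0.001369, as in Table 3.4
```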


3.2 Constrained optimization

Practical optimization problems in electrical engineering usually involve the determination of optimal parameters subject to one or more linear or nonlinear equality or inequality constraints, as presented in sections 2.2.2 and 2.2.3. There exist many approaches to solve the different problems [6], [15], and a lot of well tested commercial software libraries like the NAG numerical library or the IMSL numerical library are available. In the following sections two methods to solve an optimization problem subject to nonlinear constraints are briefly discussed using an analytical two dimensional example (3.43):

min_{x∈R^2} f(x) = −x_1 − x_2
subject to 1 − x_1^2 − x_2^2 = 0. (3.43)

In Fig. 3.5 the contour lines of the objective function f(x) (solid lines) and the constraint function c_1(x) (dashed line) are plotted. The feasible region of the problem lies on the circle (dashed line). The minimum can easily be obtained as x*_1 = 1/√2 and x*_2 = 1/√2, yielding a minimum objective function value of f(x*) = −√2. The point x* fulfills the necessary conditions to be an optimal point [15], since g(x*) = A(x*)^T λ*, where the Lagrange multiplier λ* equals 1/√2.

Figure 3.5: Contour lines of f(x) (solid lines) and c_1(x) (dashed line) of example (3.43)

3.2.1 Penalty or Barrier functions

When solving a general nonlinear optimization problem in which the constraints cannotbe easily eliminated, it is necessary to balance the aims of reducing the objective function


f(x) and staying inside, or at least very close to, the feasible region. This leads to the idea of a penalty function, which is some combination of f(x) and c_i(x). A simple penalty function, which can be used for solving an equality constrained problem, is given in (3.44):

Φ(x, σ) = f(x) + (1/2) σ Σ_i (c_i(x))^2 = f(x) + (1/2) σ c(x)^T c(x) (3.44)

Minimizing a sequence of (3.44) for increasing σ, one can expect that the approximated solution x → x* as σ → ∞. This is traditionally implemented as follows.

(a) Choose a fixed sequence σ^(k) → ∞ (1, 10, 10^2, . . .)

(b) For each σ^(k) find an unconstrained local minimizer x of Φ(x, σ^(k)) using a suitable method

(c) Terminate when Σ_i (c_i(x))^2 is sufficiently small

Although this method seems very appealing, it suffers from severe numerical difficulties in practice. These are caused by the fact that, as σ → ∞, it becomes increasingly difficult to perform the minimization step (b) in the above algorithm. In Fig. 3.6 the contour lines of the penalty function corresponding to (3.43),

Φ(x, σ) = −x_1 − x_2 + (1/2) σ (1 − x_1^2 − x_2^2)^2, (3.45)

are plotted for different values of σ. As σ takes higher values, the approximated minima x(1), x(10) and x(100) solve the initial problem increasingly better. It can also be seen that the solution x(100) is still well determined in the radial direction, but not in the direction tangential to the constraint boundary. Therefore, it is very difficult to determine the location of x(100) numerically.


Figure 3.6: Increasing ill-conditioning using a penalty function (contours of Φ − Φ_min) for different penalty terms σ = 1, σ = 10 and σ = 100

Choosing a very large σ^(1) or a very rapidly increasing σ gives minimization subproblems which are difficult to solve accurately. The alternative of choosing σ^(1) small or increasing σ slowly makes it easier to achieve an accurate solution, but is very inefficient. The choice of the sequence σ^(k) is usually a trade-off between these two effects.
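Steps (a)-(c) can be sketched for example (3.43) in a few lines; the gradient descent inner solver and the starting point are our own choices, not prescribed by the text:

```python
# Sequential penalty method for example (3.43):
# Phi(x, sigma) = -x1 - x2 + 0.5*sigma*(1 - x1^2 - x2^2)^2, see (3.45).

def phi(x, sigma):
    c = 1.0 - x[0] ** 2 - x[1] ** 2
    return -x[0] - x[1] + 0.5 * sigma * c * c

def grad_phi(x, sigma):
    c = 1.0 - x[0] ** 2 - x[1] ** 2
    return [-1.0 - 2.0 * sigma * x[0] * c,
            -1.0 - 2.0 * sigma * x[1] * c]

def minimize(x, sigma, iters=2000):
    # plain gradient descent with backtracking line search (our inner solver)
    for _ in range(iters):
        g = grad_phi(x, sigma)
        gg = g[0] ** 2 + g[1] ** 2
        if gg < 1e-18:
            break
        t, fx = 1.0, phi(x, sigma)
        while t > 1e-12 and phi([x[0] - t * g[0], x[1] - t * g[1]], sigma) > fx - 0.5 * t * gg:
            t *= 0.5
        x = [x[0] - t * g[0], x[1] - t * g[1]]
    return x

x = [0.5, 0.5]
for sigma in (1.0, 10.0, 100.0):   # step (a): increasing penalty terms
    x = minimize(x, sigma)         # step (b), warm-started
    print(sigma, x)                # x(1), x(10), x(100) approach x* = (1/sqrt(2), 1/sqrt(2))
```

Each subproblem minimizer lies slightly outside the circle and moves toward x* as σ grows, illustrating why only the limit σ → ∞ is exactly feasible.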

3.2.2 Augmented Lagrangian function

The way in which the penalty function (3.44) is used to solve a constrained optimization problem can be seen as an attempt to create a local minimizer at x* in the limit σ^(k) → ∞. However, x* can be made to minimize Φ(x, σ) for finite σ by changing the origin of the penalty term. This suggests using the function


Φ(x, θ, σ) = f(x) + (1/2) Σ_i σ_i (c_i(x) − θ_i)^2 = f(x) + (1/2) (c(x) − θ)^T S (c(x) − θ) (3.46)

where S = diag(σ_i) [?]. The behavior of the shifted penalty function is shown in Fig. 3.7, where the σ dependent shift θ(σ) is set to 1/√2 for σ = 1.

Figure 3.7: Contour lines of Φ − Φ_min with shifted origin (σ = 1, θ = 1/√2)

In fact it is more convenient to introduce a slightly different penalty function [?]:

Φ(x, λ, S) = f(x) − λ^T c(x) + (1/2) c(x)^T S c(x). (3.47)

There exists a corresponding optimum value for λ for which x* minimizes Φ(x, λ, S), which in fact is the vector of Lagrangian multipliers λ* at the solution of (3.43). This result is independent of the σ_i, so it is usually convenient to ignore the dependence on S and write Φ(x, λ). Since the Lagrangian function L(x, λ) = f(x) − λ^T c(x) [6], [15] is augmented by the penalty term (1/2) c(x)^T S c(x), (3.47) is called the augmented Lagrangian function. The Lagrangian multipliers λ*_i are used as the control parameters in a sequential minimization algorithm as follows:

(a) Determine a sequence λ(k) → λ∗

(b) For each λ(k) find an unconstrained local minimizer x of Φ(x, λ(k))

(c) Terminate when all ci(x) are sufficiently small


The main difference between this algorithm and the penalty function method is that λ* is not known in advance and hence no sequence λ^(k) → λ* can be predetermined. However, such a sequence can be constructed by approximating the Lagrange multipliers using (3.48) [6], [15]:

λ^(k+1) = λ^(k) − S c(x^(k)). (3.48)

Fig. 3.8 shows three iteration steps for the solution of the two dimensional example (3.43). The subproblems are solved using a Quasi-Newton method. The Lagrangian multiplier is initially set to λ^(0) = 0 and takes the values λ^(1) = 0.565 and λ^(2) = 0.707 in the course of the optimization process.

Figure 3.8: Contours of the augmented Lagrangian function (Φ − Φ_min) for σ = 1 at different stages of the optimization process (λ = 0, λ = 0.565, λ = λ* = 1/√2).
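The multiplier iteration (a)-(c) can be sketched for example (3.43); a simple gradient descent serves as the inner solver (our choice; the text solves the subproblems with a Quasi-Newton method):

```python
# Augmented Lagrangian method (3.47)-(3.48) for example (3.43), sigma fixed at 1.

def phi(x, lam, sigma):
    c = 1.0 - x[0] ** 2 - x[1] ** 2
    return -x[0] - x[1] - lam * c + 0.5 * sigma * c * c

def grad_phi(x, lam, sigma):
    c = 1.0 - x[0] ** 2 - x[1] ** 2
    return [-1.0 + 2.0 * lam * x[0] - 2.0 * sigma * x[0] * c,
            -1.0 + 2.0 * lam * x[1] - 2.0 * sigma * x[1] * c]

def minimize(x, lam, sigma, iters=2000):
    # gradient descent with backtracking line search (our inner solver)
    for _ in range(iters):
        g = grad_phi(x, lam, sigma)
        gg = g[0] ** 2 + g[1] ** 2
        if gg < 1e-18:
            break
        t, fx = 1.0, phi(x, lam, sigma)
        while t > 1e-12 and phi([x[0] - t * g[0], x[1] - t * g[1]], lam, sigma) > fx - 0.5 * t * gg:
            t *= 0.5
        x = [x[0] - t * g[0], x[1] - t * g[1]]
    return x

x, lam, sigma = [0.5, 0.5], 0.0, 1.0
for _ in range(8):
    x = minimize(x, lam, sigma)                     # step (b): subproblem
    lam -= sigma * (1.0 - x[0] ** 2 - x[1] ** 2)    # multiplier update (3.48)
print(lam, x)   # lam -> 1/sqrt(2), x -> x* = (1/sqrt(2), 1/sqrt(2))
```

The first two multiplier values produced by this sketch agree with the text (λ^(1) = 0.565), and λ converges to λ* = 1/√2 with σ kept at 1, i.e. without driving the penalty term to infinity.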


3.2.3 Quadratic Programming

The general problem is to find a solution x* when the objective function f(x) is quadratic and the constraints are linear:

min_{x∈R^n} φ(x) = (1/2) x^T G x + c^T x (3.49)
subject to A x ≥ b.

3.2.3.1 Quadratic Programming with equality constraints

This section investigates how to find a solution x* to the following problem:

min_{x∈R^n} φ(x) = (1/2) x^T G x + c^T x (3.50)
subject to A x = b.

A common way to treat this problem is to set up auxiliary matrices Y and Z such that [Y : Z] is non-singular and, in addition, AY = I and AZ = 0. The matrix Y can be regarded as a right inverse of A, so that one solution of Ax = b can be expressed as x = Yb. In general, this solution will not be unique. Other feasible points are given by x = Yb + p, where p lies in the null space of A. In this (n − m)-dimensional space all points are feasible. The columns of the matrix Z are made up of the basis vectors of the null space of A. At any feasible point x a feasible correction can be written as

p = Zpz, (3.51)

while any feasible point can be expressed as

x = Yb + Zpz. (3.52)

Substituting (3.52) into (3.50) gives the reduced quadratic function

Ψ(y) = (1/2) y^T Z^T G Z y + (c + G Y b)^T Z y + φ(Yb), (3.53)

where the last term φ(Yb) is a constant.

If Z^T G Z is positive definite, there exists a unique minimizer y* solving the linear system

(Z^T G Z) y = −Z^T (c + G Y b). (3.54)

Z^T G Z is referred to as the reduced Hessian matrix, while Z^T (c + G Y b) is called the reduced gradient vector, because c + G Y b is the gradient vector of φ(x) at x = Yb. The relation

λ* = Y^T (G x* + c) (3.55)


can be used to calculate the Lagrange multipliers at the solution x*. There are different methods to calculate the matrices Y and Z, for instance the QR factorization:

A^T = Q [ R ; 0 ] = [Q_1 Q_2] [ R ; 0 ] = Q_1 R (3.56)

Using (3.56) one gets

Y = Q_1 R^{−T} (3.57)

and

Z = Q_2. (3.58)

As an example consider

min_{x∈R^3} φ(x) = x_1^2 + x_2^2 + x_3^2 (3.59)
subject to x_1 + 2x_2 − x_3 = 4
           x_1 − x_2 + x_3 = −2

where G, c, A and b can be found to be

G = [ 2 0 0 ; 0 2 0 ; 0 0 2 ], c = (0, 0, 0)^T, A = [ 1 2 −1 ; 1 −1 1 ], b = (4, −2)^T. (3.60)

Using the QR factorization, Y and Z can be calculated:

Y = (1/14) [ 5 8 ; 4 −2 ; −1 4 ], Z = (1, −2, −3)^T. (3.61)

The vector Yb becomes (1/7)(2, 10, −6)^T, which eventually turns out to be the optimal solution, since −Z^T(c + G Y b) = 0. The gradient g* = c + G Y b becomes (2/7)(2, 10, −6)^T. The Lagrange multipliers turn out to be λ* = (1/7)(8, −4)^T, which agrees with the first order optimality condition for optimization with linear equality constraints:

g* = (2/7) (2, 10, −6)^T = A^T λ* = [ 1 1 ; 2 −1 ; −1 1 ] (8/7, −4/7)^T. (3.62)


4 Stochastic Optimization Methods

4.1 Introduction

When solving practical problems where no hypothesis like convexity or differentiability concerning the objective function f(x) can be made a priori, deterministic methods as presented in Chapter 3 often end up in one of the function's local minima. Therefore, if there is no or very little knowledge about the behavior of the objective function, about the presence of local minima or about the distribution of feasible and nonfeasible regions in the multi-dimensional parameter space, it seems advisable to start the optimization process with a stochastic strategy. Another field where the robust nature of stochastic strategies can be exploited successfully is the treatment of data fitting problems suffering from noisy measurement data. Stochastic methods, in fact, choose their path through the parameter space from one or more initial configurations to the final one using a “random factor”. This implies that, unlike with a deterministic method, different computation runs solving the same problem lead to more or less different results.

An important feature, which again vitally discriminates them from deterministic strategies, is that stochastic methods accept deteriorations of the objective function during the iteration process. This enables them to escape local minima and find the region of the global optimum with a high probability, no matter where the strategy is started from. Although stochastic strategies are rather simple to implement, stable in convergence and supply the user with reliable results, they usually require a high number of function evaluations. It was therefore the recent incredible increase in computer power that helped them to their current popularity. The availability of parallel computer environments will further promote their applicability to real world problems in electrical engineering.

4.1.1 Main Features of Stochastic Algorithms

The application of stochastic optimization procedures for design problems in electrical engineering has become very popular in the last few years. Evolution Strategies [22], [23] and Genetic Algorithms [11], [12], imitating the evolutionary behaviour of Nature, or Simulated Annealing [19], taking advantage of the analogy between the cooling of fluids and the optimum seeking process, are widely used to localize a region in the parameter space


where the global minimum can be expected. Since these stochastic strategies are zeroth order methods and hence get along with the function value of the objective function f(x) only, one does not have to bother about the continuity of the objective function or its gradient during the optimization process. Constraints like box constraints on the optimization parameters or equality and inequality constraints on parts of the solution of the electromagnetic field problem can be treated easily by replacing non-feasible solutions with newly computed feasible ones, without introducing user defined penalty functions. The general structure of the stochastic algorithms presented here consists of two main levels:

• Basic Level : This level is responsible for the generation of new configurations x,the evaluation of the corresponding objective functions f(x), the adaptation of thestepsize and the selection of configurations for the next iteration step.

• Control Level : This level guides the basic level towards a global optimal solutionand decides when the stochastic strategy is terminated.

In other words, the basic level performs some kind of local optimization that is controlled by the control level to obtain an overall global behaviour. Each of the three strategies has a number of parameters that influence the convergence behavior, called strategy parameters. The choice of optimal values for these parameters depends on the specific problem. However, some of these parameters can be set for a greater class of problems. Their values can be found either by “trial and error” or by applying a higher level optimization (meta-optimization) [22].

All of the mentioned strategies have some termination criterion which can either bea maximum number of control level iterations, a maximum number of objective functioncalls or any epsilon criterion.

4.2 Evolutionary Computation

All learning and hence all scientific processes are adaptive. The most important aspect ofsuch learning processes is the development of implicit or explicit techniques to accuratelyestimate the probabilities of future events. This feature becomes even more vital if theenvironment is changing.

In the Neo-Darwinian sense, there are four main statistical processes in natural evolution acting on and within populations and species, leading to optimal solutions. These processes are reproduction, mutation, competition and selection [10]. Reproduction is an obvious property of all life. But similarly obvious, mutation is guaranteed in any system that continuously reproduces itself in a positively entropic universe. Competition and selection become the inescapable consequences of any expanding population constrained to a finite arena. Evolution is then the result of these fundamental interacting


stochastic processes as they act on populations, generation after generation [7]. Following that, these interacting stochastic processes are used to set up optimization strategies as described in 4.2.1 (Evolution Strategies) and 4.3 (Genetic Algorithms).

4.2.1 Evolution Strategies

One approach to simulating natural evolution to solve optimization problems resulted in the development of Evolution Strategies (ES) in the mid-sixties [21], [25]. In this model, the components of a trial solution, which in general are integer or floating point numbers, are viewed as behavioral traits of an individual, not as genes along a chromosome as done when working with Genetic Algorithms (see 4.3). The most simple version of Evolution Strategies is the (1+1) ES.

4.2.1.1 (1+1) Evolution Strategy

In this two-member model [22] the stochastic processes mutation and selection are utilized as shown in Fig. 4.1.

Figure 4.1: (1+1) Evolution Strategy

An initial parent configuration vector x_parent,ini is selected at random from the feasible region of the multidimensional parameter space. Each component of this vector corresponds to one of the optimization parameters. An objective function value f(x)

53

Page 58: Optimization in Electrical Engineering - diegm.uniud.it Corsi... · Optimization in Electrical Engineering Christian Magele & Thomas Ebner Institute for Fundamentals and Theory in

4 Stochastic Optimization Methods

is associated with this configuration. Mutating this initial configuration means that a vector v, whose elements have a Gaussian distribution with a zero mean value and a standard deviation σ called stepsize (Fig. 4.2), is added to the parent configuration:

x_descendant = x_parent + v(0, σ_parent) (4.1)

It is well known, that about 68% of all randomly chosen values will be found within the

Figure 4.2: Gaussian distribution with 0 mean value and standard deviation σ

[−σ, +σ] interval and 95.5% inside the [−2σ, +2σ] interval. The Gaussian distribution of the components of the vector v corresponds to the observation that smaller changes happen more frequently in natural evolution than bigger ones. The quality of the descendant configuration x_descendant is compared with the parental configuration x_parent, and the configuration yielding the better value of the objective function f(x) is determined to be the parent for the next generation. This process is repeated until some stopping criterion is met (see 4.6). However, after a certain number of generations the standard deviation σ of the mutation vector v is adapted to the progress of the optimization. If the ratio of positive mutations p(pos),

p(pos) = (number of positive mutations) / (number of all mutations), (4.2)

is lower than a prescribed probability, the stepsize σ is decreased by multiplying it with a factor 0.85. Otherwise the stepsize is increased by dividing σ by the same factor 0.85. The value 0.85 and the prescribed probability p(pos) = 1/5 are taken from the classical implementation of the (1+1) Evolution Strategy [22]. This dynamic adaptation of the stepsize σ is one of the crucial characteristics of many stochastic strategies.

One major drawback of the (1+1) ES is that no deterioration of the objective function f(x) is allowed in the course of the optimization. This feature is implemented in the higher order Evolution Strategies only, which will be presented in the following section.
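A minimal sketch of the (1+1) ES with this stepsize rule follows; the sphere objective, the starting point and the adaptation interval of 20 generations are illustrative assumptions, not taken from the text:

```python
import random

# (1+1) Evolution Strategy with the 1/5 success rule: mutation (4.1),
# selection of the better configuration, and stepsize adaptation with
# the classical factor 0.85 and prescribed probability 1/5.

def one_plus_one_es(f, x, sigma=1.0, generations=2000, adapt_every=20):
    fx = f(x)
    successes = 0
    for g in range(1, generations + 1):
        child = [xi + random.gauss(0.0, sigma) for xi in x]   # mutation (4.1)
        fc = f(child)
        if fc < fx:                       # selection: the better one survives
            x, fx = child, fc
            successes += 1
        if g % adapt_every == 0:          # stepsize adaptation
            if successes / adapt_every < 0.2:
                sigma *= 0.85             # too few successes: decrease sigma
            else:
                sigma /= 0.85             # enough successes: increase sigma
            successes = 0
    return x, fx

random.seed(1)
sphere = lambda x: sum(xi * xi for xi in x)
best, fbest = one_plus_one_es(sphere, [5.0, -3.0])
print(fbest)   # converges toward 0
```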


4.2.1.2 (µ/ρ, λ) Evolution Strategy

The (1+1) ES does not take populations into account. This feature of biological evolution can be realized by introducing λ descendants, leading to a (1+λ) Evolution Strategy. In this ES both the parent configuration and all the λ descendant configurations are used in the selection process, which could lead to an unlimited life time. But since a vital characteristic of biological evolution can be found in the limited lifetime, it makes sense to exclude the parent configuration from the selection process. Such a strategy is called a “,”-strategy to distinguish it from the “+”-strategy described in the last section. Therefore, the (1+λ) ES is transformed into a (1,λ) ES. Additionally, the number of parents can be increased from 1 to µ. The consideration of more than 1 parent allows the imitation of sexual reproduction of ρ parents by introducing recombination prior to the mutation step. Such an Evolution Strategy is termed a classical (µ/ρ, λ) ES (Fig. 4.3).

Figure 4.3: (µ/ρ, λ) Evolution Strategy

The initialization process creates at random a population of at least µ configurations which meet the constraints of the optimization parameters. In the next step


ρ out of the µ configurations are chosen to produce a certain number of descendants by recombination. Arithmetic crossover (4.3) is a common way to combine two parent configurations x_parent,1 and x_parent,2 to produce two descendant configurations x_descendant,1 and x_descendant,2:

x_descendant,1 = a x_parent,1 + (1 − a) x_parent,2
x_descendant,2 = a x_parent,2 + (1 − a) x_parent,1 (4.3)

The factor a in (4.3) is chosen in a suitable way. Investigations have shown that a value randomly taken from a Gaussian distribution with a mean value of 0.8 and a standard deviation of 0.5 works quite well. The recombination procedure is repeated until a sufficient number of descendants has been produced.

The process of selecting parent configurations for the recombination step can either be done in a completely random way or can take the values of the objective function f(x) of the respective parent configurations into account (= competition). Configurations with a better quality are chosen for recombination with a higher probability than other ones (implicit selection or mating selection). One common method to use the value of the objective function is the roulette wheel selection [8] (Fig. 4.4). To each parent configuration a range between 0 and 1 is assigned in such a way that

• the sum of all the ranges equals 1

• a configuration with a better objective function f(x) comprises a larger region on the roulette wheel.

Then an equally distributed number between 0 and 1 is randomly computed (turning the wheel) and the parent configuration which includes that number in its area on the roulette wheel is chosen for the recombination process.
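A sketch of roulette wheel selection follows; since the text leaves open exactly how the slice sizes are derived from f(x), rank-based slices (better configurations get larger slices) are an illustrative assumption for this minimization setting:

```python
import random

# Roulette wheel selection: each configuration owns a slice of the wheel,
# better (lower) objective values own larger slices; a uniform random number
# "turns the wheel" and picks the configuration whose slice it falls into.

def roulette_select(population, fitnesses):
    ranks = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = [0.0] * len(population)
    for slot, i in enumerate(ranks):
        weights[i] = float(len(population) - slot)   # best gets the largest slice
    total = sum(weights)
    r = random.random() * total                      # turn the wheel
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(population) - 1

random.seed(0)
pop = [[0.1, 0.2], [1.0, 1.0], [2.0, 2.0]]
fits = [sum(x * x for x in p) for p in pop]
counts = [0, 0, 0]
for _ in range(6000):
    counts[roulette_select(pop, fits)] += 1
print(counts)   # the best configuration is chosen most often
```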

Figure 4.4: Roulette wheel selection


4.2.1.3 Adaptation of Stepsize σ

As already stated in section 4.2.1.1, the dynamic adaptation of the stepsizes σ plays a key role for the convergence stability and the convergence speed of most stochastic optimization strategies. In the beginning of the (µ/ρ, λ) ES an initial stepsize σ is assigned to each optimization parameter. Each descendant configuration, produced by recombination of selected parents, inherits a stepsize for each parameter from its parents (σ^(i)_descendant,inherited). The adaptive stepsize behaviour of the (µ/ρ, λ) ES is realized by introducing a factor α. Using this factor α, which has been found best to take values from 1.05 to 1.2, the inherited stepsizes of each optimization parameter are increased or decreased in a random way (4.4):

σ^(i)_descendant = σ^(i)_descendant,inherited · α or σ^(i)_descendant = σ^(i)_descendant,inherited / α (4.4)

This can be done in a consistent manner (all stepsizes σ^(i) of the descendant configuration are decreased or all stepsizes σ^(i) are increased) or each single stepsize is treated individually. In the latter case one has to take care that the different stepsizes σ^(i) do not diverge from each other too strongly. Once all stepsizes σ^(i) have been adapted, the descendant configurations are mutated in the same manner as described in 4.2.1.1, by adding a vector v with normally distributed numbers to the respective descendant configurations. Once the objective function values of all descendants are available, the µ best descendants together with their current stepsizes are selected to be the parents for the next generation. This procedure is repeated until some stopping criterion (see section 4.6) is met.
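The recombination, stepsize adaptation and comma selection steps can be combined into a minimal (µ, λ) ES sketch; averaging the parental stepsizes and α = 1.1 are illustrative assumptions, and the sphere objective stands in for a real field problem:

```python
import random

# Sketch of a (mu, lambda) ES with arithmetic crossover (4.3), inherited and
# randomly adapted per-parameter stepsizes (4.4), mutation and comma selection.

def mu_lambda_es(f, n, mu=5, lam=20, generations=200, alpha=1.1):
    pop = [([random.uniform(-5.0, 5.0) for _ in range(n)], [1.0] * n)
           for _ in range(mu)]
    for _ in range(generations):
        children = []
        for _ in range(lam):
            (x1, s1), (x2, s2) = random.sample(pop, 2)       # mating selection
            a = random.gauss(0.8, 0.5)                       # crossover factor
            x = [a * u + (1.0 - a) * v for u, v in zip(x1, x2)]
            s = [(u + v) / 2.0 for u, v in zip(s1, s2)]      # inherited stepsizes
            s = [si * alpha if random.random() < 0.5 else si / alpha
                 for si in s]                                # adaptation (4.4)
            x = [xi + random.gauss(0.0, si) for xi, si in zip(x, s)]  # mutation
            children.append((x, s))
        children.sort(key=lambda ind: f(ind[0]))
        pop = children[:mu]              # comma selection: best mu children only
    return pop[0]

random.seed(2)
sphere = lambda x: sum(xi * xi for xi in x)
best, stepsizes = mu_lambda_es(sphere, 2)
print(sphere(best))   # small residual near 0
```

Because the stepsizes are carried along with each individual, selection implicitly favors stepsizes that produced good offspring, which is exactly the self-adaptation described above.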

4.2.2 Niching Evolution Strategy

In most real world optimization problems one tries to determine the global solution among some or even numerous local solutions within the feasible region of parameters. On the other hand, it can be worthwhile to investigate some of the local solutions as well. Therefore, a most desirable behaviour would be if the optimization strategy behaves globally and yields additional information about local minima detected on the way to the global solution. But having a closer look at the behaviour of higher order ES on their way to the final solution, it can be recognized that they pass by regions with local solutions without giving notice. Therefore, these solutions remain unknown to the user. This disadvantage can be overcome by introducing more than one population into the strategy. A control process guides some of these populations temporarily into some of the local minima and informs the user about these solutions. Nevertheless, the overall behaviour of the ES must remain global. This can be achieved by introducing a cluster algorithm based on complete linkage [9] into the recombination process to detect κ individual sub-populations temporarily gathering around local solutions, setting up a hierarchical Evolution Strategy as shown in Fig. 4.5.


Figure 4.5: Hierarchical Evolution Strategy

A [κ(µ(τ)/ρ, λ)] ES sets up κ sub-populations, where µ is the average number of parents per sub-population. Each parent survives at most for τ generations. The total number of individuals therefore is κ · µ · τ. In each generation, ρ out of all mating parents take part in the recombination process to produce κ · λ children. All children then have to undergo the mutation operation [1]. The κ · µ best children then replace the oldest parents and the iteration process continues.

4.2.2.1 Hierarchically Clustering the Population

To apply clustering we treat every parameter configuration as a single object. Grouping these objects together is done by complete linkage (C-link), an agglomerative hierarchical clustering method [2], [9]. This class of methods starts with n single objects (parameter configurations), each seen as its own cluster. Step by step, two specific clusters are merged into a new cluster until, after n − 1 steps, all objects form a single final cluster, the root of the resulting hierarchical cluster tree. For a simple example see Figure 4.6.
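The C-link procedure just described can be sketched in a few lines. This is a naive O(n³) illustration with function names of our own choosing, not the authors' implementation; production code would rather use a library routine such as SciPy's complete-linkage method:

```python
import math

def complete_linkage(points):
    """Naive complete-linkage (C-link) clustering: repeatedly merge the
    pair of clusters whose union has the smallest diameter, until one
    cluster (the root of the cluster tree) remains.  Returns the list of
    merges as (cluster_a, cluster_b, diameter_of_union)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                union = clusters[i] + clusters[j]
                # C-link dissimilarity: diameter of the merged cluster
                d = max(math.dist(a, b) for a in union for b in union)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

The returned merge list is exactly the cluster tree: cutting it at a chosen diameter yields the trade-off between the number of clusters and their maximum diameter discussed below.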

The two clusters to be merged in the next step are chosen in such a way that the dissimilarity of the resulting cluster is minimized among all possible cluster pairs. C-link takes the cluster diameter (the maximum distance of objects in the cluster) as the measure of dissimilarity. Different measures give rise to other popular hierarchical clustering methods like single linkage or average linkage, which, however, turned out to be less suitable for our application.

The cluster tree can be viewed as a family of clusters where any two clusters are disjoint if they lie on different paths (as seen from the root), and one cluster includes the other if they are on the same path. This offers a trade-off between the number of desired clusters and their maximum diameter. For applications like ours, where it is a priori not known how many (i.e. how few) clusters are required, the best trade-off can easily be found once the cluster tree has been computed.


Figure 4.6: Hierarchical clustering and corresponding cluster tree

4.2.2.2 Cluster Sensitive Recombination

Since the optimization parameters are floating point numbers, recombination is performed by arithmetic crossover. Two new descendant configurations are set up from two parental ones (4.5).

xdescendant,1 = a · xparent,1 + (1 − a) · xparent,2

xdescendant,2 = a · xparent,2 + (1 − a) · xparent,1        (4.5)

The factor a is chosen randomly from [0.8, 1.2]. The candidates for recombination are chosen in the following way.

• Choose one of the κ clusters randomly

• Choose one candidate, p1, from this cluster, taking its fitness into account (roulette wheel) [1]

• Choose the second candidate, p2, such that members from the current cluster are more likely than members from remote clusters.

The last rule is implemented using the C-link cluster tree. Consider the path from candidate p1 to the root of the tree. We choose candidate p2 such that, for each cluster C on this path, the probability of C being the lowest common ancestor of p1 and p2 is proportional to

size(C2) · b^(−diam(C))        (4.6)

for some suitable constant b. Thereby, C2 is the cluster not containing p1 that is merged with another cluster (which then has to contain p1) into C; size and diam denote the number of objects and the diameter of a cluster, respectively. Note that, once the lowest common ancestor C has been selected according to the probability above, every object in the subcluster C2 can be chosen for p2. Again, a roulette wheel selection taking the fitness into account has been used to finally determine p2.
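The two operators above, the arithmetic crossover of (4.5) and the ancestor weighting of (4.6), can be sketched as follows. This is a minimal illustration; the function names and the dictionary layout describing the path clusters are our own assumptions, not the authors' implementation:

```python
import random

def arithmetic_crossover(xp1, xp2, rng):
    """Arithmetic crossover (4.5): a is drawn uniformly from [0.8, 1.2]."""
    a = rng.uniform(0.8, 1.2)
    d1 = [a * u + (1.0 - a) * v for u, v in zip(xp1, xp2)]
    d2 = [a * v + (1.0 - a) * u for u, v in zip(xp1, xp2)]
    return d1, d2

def ancestor_weights(path_clusters, b=2.0):
    """Unnormalized probability (4.6) for each cluster C on the path from
    p1 to the root: size(C2) * b**(-diam(C)).  Each entry of path_clusters
    is assumed to carry the size of C2 and the diameter of C."""
    return [c["size_c2"] * b ** (-c["diam"]) for c in path_clusters]
```

Note that for every a the two descendants still sum, component-wise, to the sum of the parents, so the crossover searches along the line through the two parents.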

To illustrate how this approach works, the unimodal Rosenbrock function

frosen(x) = 100 (x2 − x1^2)^2 + (1 − x1)^2        (4.7)

was modified to show three distinct minima (4.8).

frosen,mod(x) = frosen(x) − 50 ((x1 + 1)^2 + (x2 − 1)^2) − 10 ((x1 − 1.5)^2 + (x2 − 2.5)^2) + e^(2 x2 − 5)        (4.8)
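Equations (4.7) and (4.8) translate directly into code; the following is a straightforward transcription for experimentation:

```python
import math

def f_rosen(x1, x2):
    """Unimodal Rosenbrock function (4.7); minimum value 0 at (1, 1)."""
    return 100.0 * (x2 - x1 * x1) ** 2 + (1.0 - x1) ** 2

def f_rosen_mod(x1, x2):
    """Modified Rosenbrock function (4.8): two additional quadratic wells
    and an exponential term create three distinct local minima."""
    return (f_rosen(x1, x2)
            - 50.0 * ((x1 + 1.0) ** 2 + (x2 - 1.0) ** 2)
            - 10.0 * ((x1 - 1.5) ** 2 + (x2 - 2.5) ** 2)
            + math.exp(2.0 * x2 - 5.0))
```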

Fig. 4.8(a) shows the initial population clustered into five sub-populations. From now on the cluster sensitive recombination is applied, which leads to an intermediate state as presented in Fig. 4.8(b). The three minima can be seen very well. If this situation can be detected, all possible solutions of the problem have been found. One way to do so is to assess diameter curves in the course of the optimization. At each iteration step the diameter of the largest cluster is evaluated for all possible numbers of clusters (from 1 to κ).

Fig. 4.7(a) shows a selection of such curves at four interesting time instances. Looking at the sequence with respect to the diameters of the largest cluster, it can be seen that something significant happens after generation 63 (= iteration 63) and generation 78. In fact, this is where the strategy reduces the three clusters to two and the two clusters to one, respectively.

[Figure: (a) the diameter of the largest cluster plotted over the number of clusters (1 to 5) for iterations 63, 64, 78 and 79; (b) the corresponding cluster tree with merge diameters ranging from 0.33 to 200.69.]

Figure 4.7: (a) Cluster diameters, (b) Cluster tree


[Figure: three snapshots of the population on the modified Rosenbrock function (global minimum at [2.38, 5.32]): (a) iteration 0, population spread over three minima; (b) iteration 60, population spread over two minima; (c) iteration 100, final population.]

Figure 4.8: Modified Rosenbrock function: Development of the population during the optimization process

If the Evolution Strategy is continued, a final population as plotted in Fig. 4.8(c) is reached, where the stochastic optimization process terminates. It can be seen that the Evolution Strategy guides its population(s) into the global solution.


4.3 Genetic Algorithm

In the classical, binary Genetic Algorithm (GA) [11] all the optimization parameters xi have to be coded into a binary system. The number of bits used is usually problem dependent. All these bit strings are then combined to form a chromosome, leading to a genotypic representation of the optimization parameters (Fig. 4.9). In the beginning


Figure 4.9: Classical Genetic Algorithm

a certain number of configurations is set up randomly to form the initial, binary coded population. In each of the following iteration steps genetic operators are applied to selected individuals of the current generation to produce a new generation. This process is continued until some stopping criterion is met (see section 4.6).


4.3.1 Binary Genetic Algorithm

The features of the traditional binary coding in Genetic Algorithms are evident:

• similarity to the DNA coding in biological systems

• very elegant genetic operators.

The common binary representation, though, is not very well suited for transforming the problem space into the representation space, because a small distance in the problem space may correspond to a large distance in the representation space. If, for instance, some value changes from 15 to 16, five bits have to be changed in the coded string (01111 to 10000), while for a change from 16 to 17 (the same distance in the problem space) only one bit must be changed (10000 to 10001). Alternatively, an encoding of numbers can be used in which adjacent numbers differ by a single digit only. The term Gray code is often used to refer to such a reflected code, as presented in Table 4.1.
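The binary-reflected Gray code of Table 4.1 can be computed with a single XOR; the sketch below (standard bit-twiddling, not taken from the text) also shows the inverse mapping:

```python
def to_gray(n):
    """Binary-reflected Gray code: adjacent integers differ in one bit."""
    return n ^ (n >> 1)

def from_gray(g):
    """Inverse mapping: XOR-fold the Gray code back to a plain integer."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

With this encoding the step from 15 to 16 changes a single bit (01000 to 11000), removing the Hamming cliff of the plain binary code.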

Table 4.1: Coding of integer numbers: Binary coding versus Gray coding

Binary   Integer   Gray    |  Binary   Integer   Gray
coding   number    coding  |  coding   number    coding
00000       0      00000   |  01001       9      01101
00001       1      00001   |  01010      10      01111
00010       2      00011   |  01011      11      01110
00011       3      00010   |  01100      12      01010
00100       4      00110   |  01101      13      01011
00101       5      00111   |  01110      14      01001
00110       6      00101   |  01111      15      01000
00111       7      00100   |  10000      16      11000
01000       8      01100   |  10001      17      11001
  ...      ...      ...    |    ...      ...      ...

4.3.1.1 Genetic Operators

The standard genetic operators, which are used to produce a new generation from the current one, are:

• recombination (crossover)

• reproduction

• mutation.



Figure 4.10: One point and two point cross over operators

In the classical binary implementation, recombination and reproduction are performed in parallel with a certain probability (p(C) and p(R), respectively), while mutation is subsequently applied to every newly produced configuration with a very low probability p(M).

Recombination is usually performed by some kind of crossover, which can be realized very efficiently with the binary coded strings. Using the one point crossover (Fig. 4.10 (a)), a single point within the chromosomes of two parent configurations is selected randomly, and all the bits following this position until the end of the bit-strings are mutually exchanged to produce two new configurations. Using the two point crossover, the bits of a substring within the chromosomes are mutually exchanged, again resulting in two new configurations (Fig. 4.10 (b)). The most commonly used recombination operator, still, is uniform crossover (Fig. 4.11 (a)). Each gene of the first descendant configuration is set up randomly from either the first or the second parent configuration, while the second descendant configuration gets its genes from the left-over parent.
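On bit strings the three crossover variants, together with the bitwise mutation applied afterwards, can be sketched as follows (a minimal illustration; the function names are our own):

```python
import random

def one_point(c1, c2, rng):
    """One point crossover: exchange all bits after a random position."""
    p = rng.randrange(1, len(c1))
    return c1[:p] + c2[p:], c2[:p] + c1[p:]

def two_point(c1, c2, rng):
    """Two point crossover: exchange the bits of a random substring."""
    p, q = sorted(rng.sample(range(1, len(c1)), 2))
    return c1[:p] + c2[p:q] + c1[q:], c2[:p] + c1[p:q] + c2[q:]

def uniform(c1, c2, rng):
    """Uniform crossover: each gene of the first descendant comes randomly
    from either parent; the second descendant gets the left-over genes."""
    picks = [rng.random() < 0.5 for _ in c1]
    d1 = "".join(a if t else b for a, b, t in zip(c1, c2, picks))
    d2 = "".join(b if t else a for a, b, t in zip(c1, c2, picks))
    return d1, d2

def mutate(c, p_m, rng):
    """Bitwise mutation: invert each bit with the very low probability p(M)."""
    return "".join(("1" if b == "0" else "0") if rng.random() < p_m else b
                   for b in c)
```

All three crossover operators only redistribute the parental genes; no new allele values are created, which is why mutation is still needed.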

Reproduction means that one parent configuration is passed on as a descendant to the next generation without any change.

As can be seen in natural evolution, each member of the population has the chance to contribute to the improvement of the current generation, although members with higher fitness (corresponding to lower values of the objective function f(x)) are more likely to do so. The roulette wheel method as presented in section 4.2.1.2 is an appropriate way to implement this fact when choosing candidates for recombination or reproduction (mating selection).

But before both types of newly produced descendants are allowed to become members of the next generation, their chromosomes have to undergo a bitwise mutation (Fig. 4.11 (b)). This mutation is done with a very low probability p(M), which means



Figure 4.11: Uniform cross over and mutation operator

that only a few bits (if any at all) within the chromosome are inverted. Therefore, in the classical binary Genetic Algorithms, mutation does not play an important role. Still, mutation exists because it can help to overcome local minima during the iteration process.

4.3.2 Floating Point Genetic Algorithm

Besides the classical binary coding of the optimization parameters, a floating point representation of the parameters has become popular recently (Fig. 4.12 (a)). The main advantages of this representation are

• more natural representation of optimization parameters of technical problems

• high numerical precision

• possibility to treat large ranges of optimization parameter values.

The genetic operators must be adapted when a floating point number representation is used. Recombination can be performed by arithmetic crossover, mutation by adding a vector of random numbers (see section 4.2.1). The Floating Point Genetic Algorithm therefore becomes more like Evolution Strategies, still putting much more emphasis on recombination and reproduction than on mutation.

4.3.3 Improved Floating Point Genetic Algorithm

4.3.3.1 Immigration

Two additional genetic operators, immigration and gradient-like mutation, can be defined to improve both the convergence stability and the convergence speed of the floating


[Figure: flowcharts of (a) the Floating Point Genetic Algorithm and (b) the Improved Floating Point Genetic Algorithm.]

Figure 4.12: Flowcharts of the Floating Point Genetic Algorithm

point Genetic Algorithm. Although GAs reach the global solution with a very high probability, it can take a long time to overcome local optimal solutions, which are generally present when dealing with practical applications. One possible way to overcome this problem is to introduce completely new genetic material into the new population; this is termed immigration. It can be done by generating some new configurations randomly, with a very low probability p(I), within the feasible parameter space (Fig. 4.12 (b)). It is quite interesting to note that this feature contributes to the convergence speed, although the fitness of an immigrant, and hence its mating performance, is usually very low.

4.3.3.2 Gradient-like Mutation

The classical form of floating point mutation means that changes in any direction are performed with the same probability [22], [23]. Investigations have shown that mutations in more promising directions increase both the convergence stability and the convergence speed. Prior to the mutation step, a promising direction d(−1) for the current generation



Figure 4.13: Gradient-like mutation

(denoted by (0)) is computed. To do so, the gradient-like vectors g(−2) and g(−1) of the previous two generations have to be evaluated. This is done by setting up a direction between the best (Pbest) and the second best (Pbest−1) configuration of the respective generations (Fig. 4.13).

d(−1) = (1/2) (g(−2) + g(−1))        (4.9)

To preserve the stochastic characteristic of the mutation operator, the new mutation direction vector v(0) is calculated from d(−1) using a factor a. This factor is evaluated from a Gaussian distribution, where a mean value of 0.8 and a standard deviation of 0.3 have turned out to be appropriate values. Starting from an arbitrary descendant configuration p(0) of the current generation (which has been produced by recombination, reproduction or immigration), an intermediate configuration p′(0) is computed using (4.10):

p′(0) = p(0) + v(0) = p(0) + a d(0).        (4.10)

Around this intermediate point, the new descendant d(0) is finally found using undirected mutation, by adding a vector of normally distributed numbers with a (0, σ) distribution. This is indicated by the dashed circle in Fig. 4.13. A value of σ equal to 10% of the length of d(0) has turned out to be useful.
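A possible reading of (4.9) and (4.10) in code is given below. This is our own sketch: the generation-index bookkeeping of the superscripts is simplified to two previously stored gradient-like vectors, and the function name is illustrative:

```python
import math
import random

def gradient_like_mutation(p, g_prev2, g_prev1, rng=None):
    """Gradient-like mutation sketch: the promising direction (4.9) is the
    mean of the gradient-like vectors of the two previous generations; the
    intermediate point (4.10) is p + a*d with a ~ N(0.8, 0.3); undirected
    mutation with sigma = 10% of |d| is then added on top."""
    rng = rng or random.Random()
    d = [(u + v) / 2.0 for u, v in zip(g_prev2, g_prev1)]   # (4.9)
    a = rng.gauss(0.8, 0.3)
    p_int = [pi + a * di for pi, di in zip(p, d)]           # (4.10)
    sigma = 0.1 * math.hypot(*d)
    return [x + rng.gauss(0.0, sigma) for x in p_int]
```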


4.4 Simulated Annealing

Simulated Annealing (SA) algorithms, first introduced in the fifties [19], are optimization methods based on an analogy with physical systems. SA was proposed to approach and solve complicated combinatorial problems, for instance the traveling salesman problem, which could not be addressed with classical optimization algorithms [3], [13], [20], [28].

SA treats the optimization/minimization problem as the annealing of a molten solid: when a material is cooled from the liquid phase, the process must proceed slowly if the material is to reach its minimal energy state (ground state). If the cooling is too fast (quenching), the material will not reach thermal equilibrium during the process, and local irregularities in the crystalline lattice remain frozen in the structure, thus yielding a higher value of the associated energy, called a meta-stable state. The key issue in this process is the Boltzmann probability distribution. The probability that a particle is at any energy level E can be calculated by use of the Boltzmann distribution, also called the Boltzmann probability law (4.11), where kB is the Boltzmann constant:

pB = e^(−E/(kB T)).        (4.11)

If the cooling is sufficiently slow, the system has time to explore many configurations and then settles in the minimum energy one [?].

The same basic principle can be used in an optimization algorithm. The objective function f(x) to be minimized can be considered the energy of the system, while the different combinations of the degrees of freedom of the optimization are its configurations. The probability that a particular configuration, even a worse one, is accepted is ruled by a Boltzmann-like equation. This particular acceptance criterion, the Metropolis criterion, which is the heart of the method, allows some probability of accepting worsening configurations, or, as they are usually called, uphill movements. Accepting only configurations decreasing the cost function (greedy algorithms strictly following the descent property [15]) is much like rapidly quenching a physical system to zero temperature. In this case the system can be trapped in local minima, which are equivalent to meta-stable states of the physical system. The probability of accepting uphill movements is under the control of a strategy parameter called temperature, which is lowered during the optimization procedure.

4.4.1 Simulated Annealing Algorithm

In the initial phase of this optimization algorithm a number of random configurations is set up. The following procedure can then be subdivided into four main nested loops (Fig. 4.14), governed by a few user-defined parameters, which will be described later. The innermost loop is responsible for the generation of a new configuration from an existing one. This feature shows some similarity to the (1+1) Evolution Strategy (see section



Figure 4.14: Simulated Annealing Algorithm

4.2.1.1). The new configuration is obtained by perturbing in turn the i-th component of the parameter vector x, adding a uniformly distributed number, the stepsize σi. The amplitude of the perturbation of each parameter is set by the user at the beginning of the optimization and is then tuned by the algorithm itself, following the objective function behavior. Once a new configuration has been found, it must be decided whether to accept or to reject it. If it yields a better quality f(x)new than f(x)old, the one it was derived from, it is accepted in any case. But in contrast to the (1+1) Evolution Strategy, Simulated Annealing provides a possibility of accepting a deterioration in the objective function. This has been found to play a key role in overcoming local minima. The acceptance criterion is based on the Boltzmann probability law. A value p is calculated using the two objective function values f(x)new and f(x)old, the current temperature T and a normalization factor cnorm, which can automatically be derived from the initial configurations and corresponds to the Boltzmann constant in (4.11).

p = e^(−(fnew − fold)/(cnorm T))        (4.12)

69

Page 74: Optimization in Electrical Engineering - diegm.uniud.it Corsi... · Optimization in Electrical Engineering Christian Magele & Thomas Ebner Institute for Fundamentals and Theory in

4 Stochastic Optimization Methods

The value p is compared to a number pequal, which is randomly chosen from the interval [0, 1]. If p > pequal, the inferior configuration is accepted, otherwise it is rejected. As long as the temperature T is very high, Simulated Annealing accepts almost every new solution, thus yielding a random walk through the search space (p is close to 1 all the time). On the other hand, with a temperature T close to zero, only improvements are accepted (p gets close to zero). The probability of accepting a worse configuration decreases as T decreases. This acceptance rule is called the Metropolis criterion.
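The Metropolis criterion of (4.12) takes only a few lines; a sketch (parameter names are our own):

```python
import math
import random

def metropolis_accept(f_new, f_old, T, c_norm, rng):
    """Metropolis criterion (4.12): improvements are always accepted,
    deteriorations with probability p = exp(-(f_new - f_old)/(c_norm*T)),
    compared against a uniform random number from [0, 1)."""
    if f_new <= f_old:
        return True
    p = math.exp(-(f_new - f_old) / (c_norm * T))
    return p > rng.random()
```

For very large T the exponent is close to zero, so p is close to 1 and nearly every uphill move passes; for T near zero, p underflows towards 0 and uphill moves are rejected.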

After NV cycles, all stepsizes σi are adaptively updated in a way which leads to a 1:1 ratio of accepted and rejected configurations. A great number of successes indicates that the newly generated configurations are too close to the old ones, so that no significant exploration is performed. On the other hand, if the number of successes is too small, it is highly probable that the new points are too far from the current point, leading to a random sampling of the parameter space. In both cases the trials are not useful and cannot give significant information. The new value of the stepsize σi of the i-th parameter is obtained by multiplying the current amplitude of the perturbation by the function F shown in Fig. 4.15. The success ratio is computed on the basis of accepted and rejected configurations obtained in the previous iterations. The adaptation of the


Figure 4.15: Stepsize multiplicator depending on the success rate

stepsizes at a constant temperature T is done NS times.

After the three inner loops have been finished (see Fig. 4.14), the temperature T is lowered according to a user-defined function (outer loop). In the presented implementation the new temperature Tnew is derived from the current one (Told) using (4.13):

Tnew = Told · fT,        (4.13)

where fT is chosen from the interval [0.85, 0.9].
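The geometric cooling of (4.13) can be sketched as follows; the defaults mirror the parameter ranges given in the text (Tmax = 1.0, fT within [0.85, 0.9], NT temperature loops), and the function name is our own:

```python
def cooling_schedule(T_max=1.0, f_T=0.9, n_T=30):
    """Geometric cooling (4.13): T_new = T_old * f_T, applied once per
    temperature loop of the outer loop (n_T loops in total)."""
    temps = [T_max]
    for _ in range(n_T - 1):
        temps.append(temps[-1] * f_T)
    return temps
```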


4.5 Summary of Strategy Parameters

All stochastic algorithms presented rely on a number of strategy parameters, listed in Table 4.2. These parameters must be supplied by the user and influence both the convergence speed and the convergence stability. Reasonable values for these parameters, as stated in Table 4.2, can either be found by trial and error and the experience of the user (as is usually done) or by using some higher order optimization, called meta optimization [23], which will be presented in the next section.

Table 4.2: Strategy Parameters of Stochastic Algorithms

(µ/ρ, λ) Evolution Strategy     Improved Floating Point GA       Simulated Annealing
---------------------------     ---------------------------      ------------------------------
popsize=20                      popsize=21                       Tmax=1.0
(population size)               (population size)                (maximum temperature)
µ=8                             p(C)=12%                         NT=30÷40
(number of parents)             (probability of cross over)      (temperature loops)
ρ=2                             p(M)=43%                         NS=10
(recombination number)          (probability of mutation)        (stepsize loops)
λ=20                            p(R)=42%                         NV=10
(number of descendants)         (probability of reproduction)    (variable cycles)
α=1.1÷1.3                       p(I)=3%                          fT=0.85÷0.9
(stepsize factor)               (probability of immigration)     (temperature reduction factor)
                                                                 F (see Fig. 4.15)
                                                                 (stepsize function)

4.5.1 Meta Optimization of Strategy Parameters

In this section, a method is presented which helps to tune the strategy parameters of stochastic strategies for a wider class of optimization problems. In the case of Genetic Algorithms, these parameters are the population size and the probabilities p of the genetic operators shown in Table 4.3. To achieve this goal, a higher level optimization

Table 4.3: Strategy Parameters of different Genetic Algorithms

                              population size   p(R)   p(C)   p(M)   p(I)
binary GA                            x            x      x     (x)     –
floating point GA                    x            x      x      x      –
improved floating point GA           x            x      x      x      x

71

Page 76: Optimization in Electrical Engineering - diegm.uniud.it Corsi... · Optimization in Electrical Engineering Christian Magele & Thomas Ebner Institute for Fundamentals and Theory in

4 Stochastic Optimization Methods

called meta-optimization, as shown in Fig. 4.16, can be applied [23]. In principle, meta optimization means that some strategy parameters themselves are treated as optimization variables. Any optimization method with a fixed set of strategy parameters can be chosen as the outer loop. The trial variables of this outer loop problem are the strategy parameters of the method to be optimized.

Inside the outer loop a genetic algorithm which should be tuned (Binary Genetic Algorithm, Floating Point Genetic Algorithm or Improved Floating Point Genetic Algorithm) with variable strategy parameters is implemented. This inner optimization strategy, successively supplied with new strategy parameters, is applied repeatedly to a test function (Rosenbrock's function [15]) or to a simple electromagnetic optimization problem for a specified number of times (e.g. n=50 times), and the mean value of the necessary function calls to arrive at an acceptable solution (finner: objective function of the inner loop) is treated as the objective function fouter of the outer loop. After


Figure 4.16: Meta optimization of Genetic Algorithms

an enormous amount of function evaluations, the optimized strategy parameters are obtained. Table 4.4 summarizes these values for the three Genetic Algorithms presented in the previous sections. It should be noticed that once the recombination operator (p(C)) is selected, two descendants are produced, while reproduction, mutation and immigration result in one new configuration only. It can be seen very well in Table 4.4 that the probability of the mutation operator is increased remarkably after introducing the gradient-like mutation.


Table 4.4: Meta Optimized Strategy Parameters of different Genetic Algorithms

                              population size   p(R) [%]   p(C) [%]   p(M) [%]   p(I) [%]
binary GA                            52             70         30       (0.95)       –
floating point GA                    50             61         29         10         –
improved floating point GA           21             43         12         42         3

4.6 Stopping Criteria

This basic scheme of most stochastic algorithms is iterated until some stopping criterion is met. Possible criteria are:

• Objective function has become smaller than εobjective

• Change in the objective function has become smaller than ∆εobjective

• Maximum number of function calls has been exceeded (maximum number of generations, maximum number of temperature loops)

• Maximum number of function calls without improvement of the objective function f(x) has been exceeded

• Norm of the population qn (4.15) has become smaller than a prescribed εnorm

The last item in the above list requires the calculation of some norm of the population. For a population of λ configurations with n optimization parameters, first of all a mean value x̄ is calculated (4.14).

x̄ = (1/λ) Σ_{i=1..λ} xi        (4.14)

Then the quadratic norm qn of the population is evaluated (4.15).

qn = (1/(λ n)) Σ_{j=1..n} Σ_{i=1..λ} ((xj,i − x̄j) / x̄j)^2        (4.15)

The scalar value qn indicates whether the strategy is still searching the parameter space for the global optimum or has already decided to concentrate on a specific domain in the parameter space.
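Equations (4.14) and (4.15) amount to a relative variance of the population; a sketch (note that the division by the mean assumes non-zero parameter means):

```python
def population_norm(pop):
    """Quadratic norm (4.15) of a population of lambda configurations with
    n parameters each; the mean value of (4.14) is computed first."""
    lam, n = len(pop), len(pop[0])
    mean = [sum(x[j] for x in pop) / lam for j in range(n)]      # (4.14)
    return sum(((x[j] - mean[j]) / mean[j]) ** 2
               for x in pop for j in range(n)) / (lam * n)       # (4.15)
```

A qn near zero signals that all configurations have contracted onto one region, which can then be used as a stopping criterion against a prescribed εnorm.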


4.7 Similarities and Differences: Comparison of Stochastic Algorithms

In this section three of the algorithms presented above

• ES: (µ/ρ, λ) Evolution Strategy

• GA: Improved Floating Point Genetic Algorithm

• SA: Simulated Annealing Algorithm

will be compared with respect to some important features of stochastic algorithms, namely

• Generation of new configurations

• Selection mechanism

• Deterioration of the objective function

• Control of mutation stepsize

• Globally best configuration

• Parallelization.

4.7.1 Generation of New Configurations

* ES: The µ best configurations of the current population are used to generate the next population, using genetic operators like recombination and mutation.

* GA: All configurations of the current population are used to generate the configurations of the next population, using genetic operators like recombination, reproduction, immigration and mutation. Their fitness is taken into account.

* SA: A new configuration is set up from the latest accepted configuration.

4.7.2 Selection Mechanism

* ES: The µ best configurations are selected amongst the λ configurations of the population (explicit selection, environmental selection).

* GA: All configurations of a population can participate in the genetic operations. The probability of participating is guided by the value of the objective function (implicit selection, mating selection).


* SA: The objective function value of the newly produced configuration is compared to the one it was made from, considering the Metropolis criterion (1:1 selection, tournament selection).

4.7.3 Deterioration of the Objective Function

* ES: The µ best configurations are accepted even if configurations with better objective function values existed in the previous population.

* GA: Since participation in the genetic operations happens with a certain probability, it is not ensured that the best configuration of the previous generation is taken into account.

* SA: A new configuration yielding a worse objective function value can be accepted according to the Metropolis criterion. The acceptance probability of uphill movements depends on the stage of the iteration process.

4.7.4 Control of Mutation Stepsize

* ES: Prior to the mutation operation, the stepsize of each parameter is either slightly increased or decreased (factor α).

* GA: No stepsize control exists.

* SA: After a certain number of configurations, the stepsize is adjusted in such a way that the success rate is about 50%.

4.7.5 Globally Best Configuration

* ES: The globally best configuration is saved but involved only once in the process of generating new configurations.

* GA: The globally best configuration is transferred unchanged to the next generation via the reproduction mechanism with a certain probability.

* SA: After the temperature is lowered, the procedure is restarted from the globally best configuration.


5 Definition of the Objective Function

5.1 Introduction

A very important task, besides the choice of the appropriate optimization method, is the definition of the objective function f(x), which has to be minimized. In general, two different types of objectives can be distinguished: best fit objectives and minimum objectives.

• Best fit objective: a term Value_best(x) should approach a reference quantity Value_ref as closely as possible in the course of the optimization process.

• Minimum objective: a term Value_min(x) should become as small as possible in the course of the optimization process.

An important criterion to classify optimization problems is the number of objectives which have to be met. If one has to solve a single objective problem like “Determine the shape of a die mold press in such a way that the magnetic field has only a radial component of 0.35 T along a specified line”, the definition of f(x) is quite simple and straightforward. Such problems are termed scalar optimization problems, and most optimization strategies can handle them properly. If, for instance, the magnetic flux density Bcalc can be evaluated in n points of interest, a best fit objective can be set up in a least squares sense as given in (5.1):

f(x) = Σ_{i=1}^{n} [ (Bxi,calc − Bxi,spec)² + (Byi,calc − Byi,spec)² ],   (5.1)

where Bspec are the desired values.

But when solving real world problems, very often two or more objectives have to be treated simultaneously (Multi Objective Optimization). A typical example of such a kind of optimization problem is: “Design a Superconducting Magnetic Energy Storage (SMES) configuration, which can store 180 MJ while producing a stray field Bstray as small as possible” (see Example 6.1.2). The definition of the objective function f(x) of such vector optimization problems is certainly more difficult. It can have a major influence on the convergence of the optimization method and, in the worst case, even mislead the algorithm to incorrect results.
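A best fit objective like (5.1) can be sketched in a few lines; the (n, 2) array layout holding one (Bx, By) pair per point of interest is an assumption for illustration:

```python
import numpy as np

def best_fit_objective(B_calc, B_spec):
    # Least squares sum (5.1): B_calc and B_spec are (n, 2) arrays
    # holding (Bx, By) in the n points of interest.
    d = np.asarray(B_calc, float) - np.asarray(B_spec, float)
    return float(np.sum(d ** 2))
```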


5.2 From Vector Optimization to Scalar Optimization

The Multi Objective Optimisation (MOO) problem is a very difficult task. Often the largest share of the design time is spent in finding the best compromise among different requirements f1(x), f2(x), . . . , fn(x), which, in general, conflict with each other. This means that improving some of the objectives worsens at least some of the others.

Since most of the common optimization algorithms can easily deal with scalar optimization problems, the main idea has always been to merge all objectives fi(x) into one single objective function f(x) and then to optimize f(x) by means of a scalar optimisation algorithm. To this aim, several possible solutions have been addressed [24]; they can roughly be divided into three categories:

• Constrained optimisation techniques

• Weighting of objectives techniques

• Decision making schemes

The constrained optimisation techniques assume one objective as the primary one and use the others as constraints. The primary objective has to be optimized subject to the constraints. This technique has by far the deepest mathematical background, but it is in general used in conjunction with the gradient evaluation of the global objective function and the constraints.

The weighting of objectives techniques are the easiest of all the techniques: they try to optimize a weighted sum of all objectives:

f(x) = w1f1(x) + w2f2(x) + . . . + wnfn(x). (5.2)

However, their main drawback lies in the choice of the weights w1, w2, . . . , wn, which have the twofold purpose of normalizing the objectives in order to obtain a balanced sum, and of implementing a hierarchy of the objectives, giving priority to one instead of another. It has been shown that even slightly modifying the weights may lead the optimization algorithm into a non-converging state (see section 6.1.2). No general algorithm for choosing the weights has been set up yet, and picking the right set of weights can require more time than the optimization itself.
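The weighted sum (5.2) itself is trivial to implement; choosing the weights is the hard part discussed above. A minimal sketch:

```python
def weighted_sum(objectives, weights):
    # Scalarization (5.2): one weight w_i per objective value f_i(x).
    if len(objectives) != len(weights):
        raise ValueError("one weight per objective required")
    return sum(w * f for w, f in zip(weights, objectives))
```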

The decision making schemes could be the most powerful tool to tackle multi objective optimization problems, since they are able to translate into rigorous mathematical terms the way of thinking of the designer. Among the several decision making schemes proposed in the literature [24], the ones based on fuzzy logic seem to be the most promising [4], [5], [17], [29].


5.3 Constrained Optimization Techniques

Assume a general multi objective optimization problem with two objectives, a best fit objective f1(x) and a minimum objective f2(x). Instead of solving

min_x f(f1(x), f2(x))
subject to ci(x) ≥ 0,  i = 1, . . . , m,   (5.3)

where ci(x) are some general constraints, a new optimization problem is set up. All but one objective, the primary objective, are transformed into additional constraints:

min_x f2(x)
subject to f1(x) = 0,   (5.4)
           ci(x) ≥ 0,  i = 1, . . . , m.

Following this approach, any method for solving constrained optimization problems, like for instance the Sequential Quadratic Programming method, can be applied to obtain the optimal solution.
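The transformation (5.4) can be illustrated with a quadratic penalty standing in for a full SQP solver; the toy objectives f1, f2 and the penalty factor mu below are assumptions for illustration only:

```python
def constrained_form(f_primary, f_secondary, mu=1.0e4):
    # Sketch of (5.4): the secondary objective is forced towards zero
    # like an equality constraint, here via a quadratic penalty (an SQP
    # solver would treat the constraint f_secondary(x) = 0 directly).
    def f(x):
        return f_primary(x) + mu * f_secondary(x) ** 2
    return f

# hypothetical toy objectives: minimum objective f2, best fit objective f1
f2 = lambda x: x[0] ** 2 + x[1] ** 2
f1 = lambda x: x[0] + x[1] - 1.0

f = constrained_form(f2, f1)
```

Any point violating f1(x) = 0 is heavily penalized, so an unconstrained minimizer applied to f is driven towards the feasible set.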

5.4 Weighting of Objectives Techniques

A common way of transforming a best fit objective f1(x) and a minimum objective f2(x) into one scalar function f(x) is using (5.2):

f(x) = w1 |Value_best − Value_ref| + w2 |Value_min|.   (5.5)

One problem that arises is that the two terms in (5.5) should be compatible with each other, which means that they should have more or less the same order of magnitude throughout the whole iteration process. This can be achieved by normalizing the respective terms in (5.5):

f(x) = (w1/Value_norm,best) |Value_best − Value_ref| + (w2/Value_norm,min) |Value_min|.   (5.6)

While it is rather easy to find a suitable normalization value for best fit objectives (Value_norm,best = Value_ref), the proper choice of a normalization value for minimum objectives can be crucial to the success of the whole optimization process. Choosing Value_norm,min automatically from an appropriate set of initial configurations does not guarantee a satisfying behaviour in the final phase of the optimization process, while a bad choice of weights in the initial phase may mislead the strategy to completely wrong results. This problem becomes more and more significant as the number of conflicting objectives increases.


5.5 Fuzzy Based Decision Making Scheme

5.5.1 Introduction

Fuzzy logic, as introduced in the mid-sixties by Zadeh, is based on the idea that the truthfulness of a statement is not a discrete function. In fact, beyond the two classical values 1 for truth and 0 for falseness, it can behave as a continuous function where intermediate values express an intermediate degree of truthfulness of the statement [29].

The membership function µ(x) applied to a statement x gives the degree of truth of x. For instance, if the value of one objective is bound between two values xacc and xsat, the degree of satisfaction of this criterion can be plotted as in Fig. 5.1.

Figure 5.1: Membership functions with a satisfaction criterion

As can be seen, the choice of this function is not unique. The membership functions can be defined by analytical functions, for instance by means of the hyperbolic tangent as in Fig. 5.1(a), or by means of piecewise linear functions as in Fig. 5.1(b).

Using a membership function scheme, it is possible to quantify how much a particular configuration, leading to a set of design objectives f1(x), f2(x), . . . , fn(x), satisfies the requirements on each design objective.

After fuzzification has been performed, the logical values can be merged into one by means of logical operators, called inference rules. In classical logic, this operation could be done by means of the AND operator. In fuzzy logic the AND operator can be replaced by several rules, as proposed in [27]; however, the main ones are the min operator and the product operator.

The min operator gives as output the minimum value of all the µi(x) on which it operates:

µ(x) = min_i µi(x).   (5.7)


The product operator gives as output the product of all the µi(x) on which it operates:

µ(x) = Π_{i=1}^{n} µi(x).   (5.8)

By applying one of the operators described above, a scalar indicator µ(x) assessing the global quality of the design configuration x can be obtained. Now any scalar optimization algorithm can be used to find the optimum of the global indicator. It is worth noting that, while the proposed approach is well suited for zeroth order searches, it can lead to trouble if a gradient of the global indicator is required. In fact, when a min strategy is followed, there is no guarantee that the resulting scalar function is differentiable, due to the non-smooth nature of this operator.
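The two inference rules (5.7) and (5.8) can be sketched directly:

```python
def combine_min(mus):
    # min operator (5.7): the least satisfied objective dominates
    return min(mus)

def combine_product(mus):
    # product operator (5.8): all degrees of satisfaction contribute
    out = 1.0
    for mu in mus:
        out *= mu
    return out
```

Note that the product is differentiable wherever the µi are, while the min operator is not, which is the non-smoothness issue mentioned above.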

In addition, the use of the membership function of Fig. 5.1(b) can be inefficient if one of the objectives lies outside the limiting values. To correct this drawback, the membership function has to be modified as shown in Fig. 5.2, which assures nonzero values for all µi(x) throughout the optimization process.

Figure 5.2: Modified piecewise linear membership function

The amplitude and the shape of this additional function have been shown not to be critical for the convergence of the optimization process [4].

5.5.2 Bell Shaped Static Fuzzy Membership Functions

Several types of fuzzy sets have been investigated recently. A possible way to describe both types of objectives, best fit objectives and minimum objectives, is to use simple bell shaped fuzzy sets (Fig. 5.3), where each side of the membership function µbell is described by one function (5.9):

µbell(x) = e^(−l(x−m)²)  for x ≤ m,
µbell(x) = e^(−r(x−m)²)  for x > m,   (5.9)

where m is the center value. The designer has to define the 90% acceptance parameters

xleft90% and xright90%, which define the 90% region and from which the constants l and r can be calculated via (5.10).

Figure 5.3: Bell shaped membership functions (left: best fit objective, with a 90% region between xleft90% and xright90%; right: minimum objective, with a 90% region bounded by xright90% only)

l = −ln(0.9)/(xleft90% − m)²,   r = −ln(0.9)/(xright90% − m)²   (5.10)

The 90% acceptance parameters remain unchanged (static) throughout the whole optimization process [17]. It is important to choose l and r in such a way that the membership functions are wide enough to assign values to the objectives which are not 0 or very near to 0, since such values lead to substantial problems when used together with higher order deterministic optimization strategies. On the other hand, the membership functions should be narrow enough to distinguish different good solutions in the final phase of the iteration process. Therefore, some dynamic behaviour of the 90% acceptance parameters is desirable.
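Equations (5.9) and (5.10) can be combined into a small factory function; the example parameters at the bottom are arbitrary:

```python
import math

def bell_membership(m, x_left90, x_right90):
    # Constants from (5.10), chosen such that
    # mu(x_left90) = mu(x_right90) = 0.9.
    l = -math.log(0.9) / (x_left90 - m) ** 2
    r = -math.log(0.9) / (x_right90 - m) ** 2
    def mu(x):
        # Bell shaped membership function (5.9)
        if x <= m:
            return math.exp(-l * (x - m) ** 2)
        return math.exp(-r * (x - m) ** 2)
    return mu

mu = bell_membership(m=0.0, x_left90=-2.0, x_right90=5.0)
```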

5.5.3 Bell Shaped Self Adaptive Membership Functions

A step towards something like a plug and play optimization algorithm can be performed by introducing self adaptive membership functions µi(x), which adapt themselves to the local properties of the optimization path in the multidimensional parameter space. From the first few parameter configurations (the first generation of the Genetic Algorithm or Evolution Strategy, for instance), initial acceptance parameters xi,90% for each objective fi(x) are evaluated automatically. Then an adaptation algorithm, consisting of a monitoring phase and an update step, can be used to either scale up or scale down the membership functions during the optimization process (Fig. 5.4).

Figure 5.4: Flow chart of the adaptation of the membership functions

The improvements of each single objective fi(x) with respect to the current acceptance parameters xi,90% (i.e. the number of configurations within the 90% region) are monitored throughout a given number of iterations (e.g. one generation of the Genetic Algorithm or the Evolution Strategy) and stored in a success rate value si. Besides that, a mean value fi,mean of each single objective, sampled over the same number of iterations, is evaluated. Then the adaptation step follows. If all mean values fi,mean are inside the 90% region, all acceptance parameters xi,90% are decreased, scaling the membership functions down. On the other hand, if all success rates si are lower than a prescribed value slevel, all acceptance parameters xi,90% are increased, scaling the membership functions up. In all other cases the acceptance parameters xi,90% remain unchanged.
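One update step of this adaptation can be sketched as follows, under the simplifying assumption of minimum objectives whose 90% region is [0, xi,90%]; the scaling factor of 1.3 is an assumption for illustration:

```python
def adapt_acceptance(x90, f_mean, success, s_level=0.5, factor=1.3):
    # Monitoring results: f_mean[i] is the mean value of objective i,
    # success[i] the success rate s_i over the last generation.
    if all(fm <= x for fm, x in zip(f_mean, x90)):
        return [x / factor for x in x90]   # all means inside: scale down
    if all(s < s_level for s in success):
        return [x * factor for x in x90]   # all unsuccessful: scale up
    return list(x90)                       # otherwise: unchanged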


6 Optimization Examples

6.1 Introduction

This chapter presents the results of the three optimization examples presented in chapter 1. The stochastic strategies which will be compared are:

• Floating point Genetic Algorithm

– Population size : 21

– Probability of Mutation: p(M) = 53%

– Probability of Crossover: p(C) = 7%

– Probability of Reproduction: p(R) = 40%

– Stopping criterion: no improvement for more than 500 function calls

• Higher order Evolution Strategy

– Number of parents : µ = 8

– Number of recombining parents : ρ = 2

– Number of children : λ = 20

– Stepwidth factor : α = 1.1 ÷ 1.3

– Stopping criterion : qn < 0.001

• Simulated Annealing Algorithm

– Maximum value of temperature : Tmax = 1.0

– Number of temperature reductions: NT = 30 ÷ 40

– Number of step size adjustments: NS = 10

– Number variable cycles: NV = 10

– Temperature reduction factor: fT = 0.85 ÷ 0.90

– Boltzmann like constant: k = 1.0

– Perturbation amplitude function F : see Fig. 4.15


– Stopping criterion: the procedure is stopped after the maximum number of temperature reductions is reached or if the objective function no longer varies significantly.

6.1.1 Optimization of an Active Filter

Assume the amplitude of an input voltage Uinput as given in Fig. 6.1(a), using the normalized frequency Ω. The required output voltage Uoutput should have a constant amplitude of 1 V from very low frequencies up to Ω = 10. This can be done using an active filter as shown in Fig. 6.1(b).

(a) Amplitude of the input voltage Uinput

(b) Active filter

Figure 6.1: Active filter example

A general active filter can be described mathematically by a complex function A(P) of second order:

A(P) = (d0 + d1 P + d2 P²)/(c0 + c1 P + c2 P²).   (6.1)

The transfer function of a general filter of second order, as shown in Fig. 6.2, can be found to be

A(P) = (k0 − k1 ω0 τ P + k2 ω0² τ² P²)/(l0 + l1 ω0 τ P + l2 ω0² τ² P²),   (6.2)

where τ = RC and ω0 = 2πfc is the normalizing frequency, with fc being the cut off frequency. P can be set to jΩ = j f/fc:

A(jΩ) = (d0 + d1 jΩ − d2 Ω²)/(c0 + c1 jΩ − c2 Ω²) = ((d0 − d2 Ω²) + jΩ d1)/((c0 − c2 Ω²) + jΩ c1).   (6.3)

Substituting a = d0 − d2 Ω², b = Ω d1, c = c0 − c2 Ω² and d = Ω c1 leads to

A(a, b, c, d) = (ac + bd)/(c² + d²) + j (bc − ad)/(c² + d²) = R + jI.   (6.4)

The amplitude of the transfer function becomes |A(jΩ)| = √(R² + I²), while the angle φ can be evaluated using tan φ(jΩ) = I/R.

Figure 6.2: General filter of second order

The objective function is defined in a least squares sense:

f(x) = Σ_{i=1}^{m} (A(jΩi) − A(jΩi)0)²,   (6.5)

where the A(jΩi)0 are the desired values, presented in Table 6.1.

Table 6.1: Desired values of the amplitude of the active filter

Ω      amplitude      Ω     amplitude      Ω    amplitude
0.01   0.9979535      0.1   0.8246211      1    1.8081014
0.02   0.9918559      0.2   0.5099020      2    3.9010027
0.03   0.9818305      0.3   0.3374110      3    5.9336300
0.04   0.9680750      0.4   0.4472136      4    7.9501251
0.05   0.9508510      0.5   0.6695341      5    9.9600640
0.06   0.9304709      0.6   0.9055385      6    11.9667037
0.07   0.9072836      0.7   1.1383537      7    13.9714519
0.08   0.8816599      0.8   1.3659730      8    15.9750156
0.09   0.8539798      0.9   1.5889323      9    17.9777888

The optimization/identification problem is to determine the coefficients d0, d1, d2, c0, c1, c2 of a filter which fulfills the required operation. The coefficients c0, c1 and c2 must be positive (constraints!). Using an evolution strategy, the six coefficients could be determined as given in Tab. 6.2. The number of evaluations of the objective function was 20000; the final value of the objective function was f = 3.654 · 10⁻⁷.


Table 6.2: Coefficients of the optimal active filter

d0   d1   d2   c0   c1   c2
1    −2   10   1    5    0
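The amplitude response of the optimized filter can be checked against Table 6.1 by evaluating (6.3) with complex arithmetic:

```python
def filter_amplitude(omega, d=(1.0, -2.0, 10.0), c=(1.0, 5.0, 0.0)):
    # |A(jOmega)| from (6.3) with the coefficients of Table 6.2
    p = 1j * omega
    num = d[0] + d[1] * p + d[2] * p ** 2
    den = c[0] + c[1] * p + c[2] * p ** 2
    return abs(num / den)
```

For instance, filter_amplitude(0.01) reproduces the first entry of Table 6.1 (0.9979535) and filter_amplitude(9.0) the last one (17.9777888).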

Figure 6.3: Parameters d1 and c1 (normalized), stepwidths σ(d1) and σ(c1) (normalized) and objective function f(x) (logarithmic scaling) versus generations

Fig. 6.3 shows the behavior of d1 and c1 (normalized to their initial values), the corresponding stepwidths σ(d1) and σ(c1) (normalized to their initial values) and the objective function f in the course of the optimization process. It can be seen that the stepwidths are adjusted individually by the optimization strategy.

Setting R = 1 MΩ and C = 1 µF, which gives a τ of 1 s, and setting ω0 to 1, leads to the coefficients k0, k1, k2, l0, l1 and l2 given in Table 6.3 and to the realization of the active filter shown in Fig. 6.4.

Table 6.3: Coefficients of the realization of the optimal filter

k0   k1   k2   l0   l1   l2
1    2    10   1    5    0

Figure 6.4: Realization of the active filter

Fig. 6.5 shows the input signal, the transfer function of the optimized filter and the response signal, which fulfills the requirements.

Figure 6.5: Input signal, transfer function of the filter and output signal


6.1.2 Optimization of a SMES Arrangement

SMES (Superconducting Magnetic Energy Storage) systems consisting of a single superconducting solenoidal coil offer the opportunity to store a significant amount of energy in magnetic fields in a fairly simple and economical way, and can rather easily be scaled up in size. However, such arrangements usually suffer from their remarkable stray field. A reduction of the stray field can be achieved if a second solenoid is placed outside the inner one, with a current flowing in the opposite direction (Fig. 6.6).

Figure 6.6: Configuration of the SMES device

The optimal design (R1, R2, d1, d2, h1, h2, J1, J2) is not an easy task because, besides the usual geometrical constraints, there is a material related constraint: the given current density and the maximum magnetic flux density value on the coil must not violate the superconducting quench condition, which can be well represented by the linear relationship |J| = (−6.4|B| + 54.0) A/mm² shown in Fig. 6.7.

Figure 6.7: Critical Curve of the Superconductor.

A correct design of the system should then couple the right value of energy to be stored (= 180 MJ) with a minimal stray field along the measurement points (line a and line b), which clearly is a type of multi objective optimization. The objective related to the energy can be termed a “best fit objective”, while the stray field objective is a “minimum objective”. A common way to treat both objectives at one time is to set up a scalar function using a weighted sum as given in (6.6):

OF = B²stray/B²norm + |Energy − Eref|/Eref,   (6.6)

where Eref = 180 MJ, Bnorm = 200 µT and B²stray is defined as

B²stray = (1/22) Σ_{i=1}^{22} |Bstray,i|².   (6.7)

Bstray,i is evaluated in 22 equidistant points along line a and line b in Fig. 6.6. B²norm has to be introduced to achieve a similar order of magnitude of the two terms in (6.6). The choice of this normalization factor is extremely crucial not only for the speed of convergence of the optimization process, but also for its stability. If Bnorm is set to 1 mT, which means an increase by a factor of 5 only, the optimization process no longer converges to the minimizer. The problem is transformed from a multi objective optimization problem into a single objective optimization problem, since the stray field requirement is over-valued against the energy requirement. Figures 6.8 and 6.9 show the behavior of both the stray field objective (6.7) and the energy objective using different weighting factors (200 µT and 1 mT, respectively).
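The weighted sum objective (6.6) with the mean square stray field (6.7) can be sketched as follows; units (T and J) are an assumption for illustration:

```python
def smes_objective(b_stray, energy, e_ref=180e6, b_norm=200e-6):
    # b_stray: |B_stray,i| in the 22 measurement points (T),
    # energy: stored energy (J); implements (6.6) and (6.7).
    b2_stray = sum(b * b for b in b_stray) / len(b_stray)
    return b2_stray / b_norm ** 2 + abs(energy - e_ref) / e_ref
```

With Bnorm = 200 µT, a configuration whose mean square stray field corresponds to 200 µT at exactly the correct energy yields OF = 1.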

Figure 6.8: Stray field objective versus generations, weighted sum


Figure 6.9: Energy objective versus generations, weighted sum

A much more intuitive and promising description of the objective function is obtained using fuzzy sets. As already discussed in chapter 5, these functions implicitly normalize the different terms of the objective function. If the 90% parameters are chosen by the user to be 1 mT for the stray field and 20 MJ for the energy (see Fig. 6.10(a),(b)), and if the two objectives are combined using the product rule, then the optimization process converges stably to the minimizer of the problem.

Figure 6.10: Static fuzzy sets for the SMES problem: (a) fuzzy set for the stray field, (b) fuzzy set for the energy

Still it can happen, especially when a deterministic strategy is applied, that the starting parameters are so far away from the solution that the fuzzy functions only yield 0. Therefore, the initial 90% parameters should be evaluated using several initial configurations (e.g. the initial population of an evolution strategy, or the starting configuration of a deterministic strategy) and adapted in the course of the optimization process, as described in section 5.5.3.


Figure 6.11: Energy objective versus generations, fuzzy sets

Figure 6.12: 90% energy value versus generations, fuzzy sets

Fig. 6.11 compares the behavior of the energy using static and self adaptive fuzzy sets. The main advantage of the self adaptive sets is that nothing has to be specified in advance (plug and play) and that the bell shaped function can become very small in the final stage of the optimization. Fig. 6.12 shows that even a very big increase in the 90% value does not stop the convergence process.

The results obtained by the three algorithms are presented in Table 6.4. All of them were able to find qualitatively comparable solutions from the point of view of objective function values. On the other hand, these solutions are quite far from each other in the parameter space. This fact proves that this optimization problem is not easy to solve and that it shows an ill-conditioned behavior. All the strategies took approximately the same number of function calls to get to the final solution.

Table 6.4: Solutions of the SMES problem

                 SA        ES        GA
R1 [m]           1.931     1.918     1.301
R2 [m]           2.559     2.576     1.8
h1/2 [m]         0.5689    0.976     1.132
h2/2 [m]         1.5546    1.796     1.542
d1 [m]           0.5233    0.2633    0.5793
d2 [m]           0.1797    0.1705    0.1959
J1 [A/mm²]       19.704    24.82     16.42
J2 [A/mm²]       −12.04    −11.59    −18.93
Energy [MJ]      179.98    179.91    179.988
Bstray [µT]      27.048    35.64     26.7258
OF-calls         19956     17990     18432

For the sake of comparison of the different stochastic strategies, the minimum had to be identified with a very high accuracy, which explains the extremely high number of function calls. This number can be reduced drastically if a deterministic strategy is started after a stochastic strategy has found the parameter region where the minimum is located. Table 6.5 shows a result using an Evolution Strategy in the initial phase of the optimization and subsequently a 1st order deterministic strategy (Quasi Newton method) to identify the minimum.

Table 6.5: Coupled strategy: Evolution Strategy and Quasi Newton method

Stochastic strategy: Evolution Strategy
R1 [m]   R2 [m]   h1/2 [m]   h2/2 [m]   d1 [m]   d2 [m]   J1 [A/mm²]   J2 [A/mm²]
1.107    2.031    1.181      0.749      0.726    0.206    12.01        −20.9

OF-calls   Bstray     Energy       OF
2290       362 µT     179.69 MJ    3.28

Subsequent deterministic strategy: Quasi Newton method
R1 [m]   R2 [m]   h1/2 [m]   h2/2 [m]   d1 [m]   d2 [m]   J1 [A/mm²]   J2 [A/mm²]
1.432    2.023    0.784      1.411      0.787    0.178    14.29        −18.01

OF-calls   Bstray     Energy       OF
1790       12.36 µT   179.999 MJ   0.00306

Fig. 6.13(a) shows a field plot of one of the initial configurations, while Fig. 6.13(b) shows a field plot of the final solution.


(a) Initial configuration (b) Final configuration

Figure 6.13: Field plots of the SMES problem


Bibliography

[1] Alotto P.G., Brandstätter B., Cela E., Fürntratt G., Magele Ch., Molinari G., Nervi M., Preis K., Repetto M. and Richter K.R.: “Stochastic Algorithms in Electromagnetic Optimization”, IEEE Trans. Magn., vol. 34, no. 5, pp. 3674–3684, 1998.

[2] Bock H.H.: “Automatische Klassifikation”, Studia Mathematica, Vandenhoeck & Ruprecht, Göttingen, 1974.

[3] Cerny V.: “Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm”, Journal of Optimization Theory and Applications, 45(1):41–51, 1985.

[4] Chiampi M., Ragusa C., Repetto M.: “Fuzzy approach for multiobjective optimisation in magnetics”, IEEE Trans. Magn., vol. 32, pp. 1234–1237, 1996.

[5] Chiampi M., Fürntratt G., Magele Ch., Ragusa C., Repetto M.: “Multi Objective Optimisation with Stochastic Algorithms and Fuzzy Definition of Objective Function”, accepted for publication in Int. J. Applied Electromagnetics and Mechanics, August 1998.

[6] Fletcher R.: “Practical Methods of Optimization”, Wiley, 1987.

[7] Fogel D.B.: “Evolutionary Computation: Towards a New Philosophy of Machine Intelligence”, IEEE Press, 1995.

[8] Goldberg D.E.: “Genetic Algorithms in Search, Optimization and Machine Learning”, Addison-Wesley, Reading, MA, 1989.

[9] Hartigan J.A.: “Clustering Algorithms”, Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons, New York, 1975.

[10] Hoffman A.: “Arguments on Evolution: A Paleontologist's Perspective”, Oxford University Press, New York, 1989.

[11] Holland J.H.: “Adaptation in Natural and Artificial Systems”, University of Michigan Press, Ann Arbor, 1975.

[12] Holland J.H.: “Genetic algorithms”, Scientific American, 1992.

[13] Kirkpatrick S., Gelatt Jr. C.D., Vecchi M.P.: “Optimization by Simulated Annealing”, Science, 220(4598):671–680, 1983.

[14] Kuhn H.W. and Tucker A.W.: “Nonlinear Programming”, in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability (ed. J. Neyman), pp. 481–492, University of California Press, Berkeley, 1951.

[15] Gill P.E., Murray W. and Wright M.H.: “Practical Optimization”, Academic Press, 1981.

[16] Levenberg K.: “A method for the solution of certain problems in nonlinear least squares”, Quart. Appl. Math., 2, 164–168, 1944.

[17] Magele Ch., Fuerntratt G., Brandstaetter B., Richter K.R.: “Self Adaptive Fuzzy Sets in Multi Objective Optimization using Genetic Algorithms”, Applied Computational Electromagnetics Society Journal, Vol. 12, No. 2, pp. 26–31, 1997.

[18] Marquardt D.W.: “An algorithm for least squares estimation of nonlinear parameters”, SIAM J., 11, 431–441, 1963.

[19] Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H., Teller E.: “Equation of state calculations by fast computing machines”, The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[20] Otten R.H.J.M., van Ginneken L.P.P.P.: “The Annealing Algorithm”, Kluwer Academic, Boston, MA, 1989.

[21] Rechenberg I.: “Cybernetic Solution Path of an Experimental Problem”, Royal Aircraft Establishment, Library Translation No. 1122, August 1965.

[22] Rechenberg I.: “Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution”, frommann-holzboog, Stuttgart, 1973.

[23] Rechenberg I.: “Evolutionsstrategie '94”, frommann-holzboog, Stuttgart, 1994.

[24] Russenschuck S.: “Synthesis, inverse problems and optimisation in computational electromagnetics”, Int. Journal of Numerical Modelling: Electronic Networks, Devices and Fields, Vol. 9, pp. 45–57, 1996.

[25] Schwefel H.P.: “Kybernetische Evolution als Strategie der experimentellen Forschung in der Strömungstechnik”, Diploma Thesis, Technical University of Berlin, 1965.

[26] Schwefel H.P.: “Numerische Optimierung von Computermodellen mittels der Evolutionsstrategie”, Birkhäuser Verlag, 1977.

[27] Terano T., Asai K., Sugeno M. (eds.): “Fuzzy Systems Theory and its Applications”, Academic Press, San Diego, California, 1991.

[28] Vanderbilt D., Louie S.G.: “A Monte Carlo Simulated Annealing approach to optimization over continuous variables”, J. Comput. Phys., 56, 259–271, 1984.

[29] Bellman R.E., Zadeh L.A.: “Decision-making in a fuzzy environment”, Management Science, Vol. 17, pp. 141–164, 1970.