HANSO and HIFOO: Two New MATLAB Codes

Michael L. Overton
Courant Institute of Mathematical Sciences
New York University

Singapore, Jan 2006

Two New Codes

• HANSO: Hybrid Algorithm for Nonsmooth Optimization
  – Aimed at finding local minimizers of general nonsmooth, nonconvex optimization problems
  – Joint with Jim Burke and Adrian Lewis

• HIFOO: H-Infinity Fixed-Order Optimization
  – Aimed at solving specific nonsmooth, nonconvex optimization problems arising in control
  – Built on HANSO
  – Joint with Didier Henrion

Many optimization objectives are

• Nonconvex
• Nonsmooth
• Continuous
• Differentiable almost everywhere
• With gradients often available at little additional cost beyond that of computing the function
• Subdifferentially regular
• Sometimes, non-Lipschitz

Numerical Optimization of Nonsmooth, Nonconvex Functions

• Steepest descent: jams

• Bundle methods: better, but these are mainly intended for nonsmooth convex functions

• We developed a simple method for nonsmooth, nonconvex minimization based on Gradient Sampling

• Intended for continuous functions that are differentiable almost everywhere, and for which the gradient can be easily computed when it is defined

• User need only write routine to return function value and gradient – and need not worry about nondifferentiable cases, e.g., ties for a max when coding the gradient

• Very recently, we found that BFGS is often very effective when implemented correctly, and much less expensive than gradient sampling
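As a minimal sketch of the kind of routine the user is asked to write (in Python rather than MATLAB, purely for illustration; the name `f_and_grad` is hypothetical and not part of HANSO), here is the value and gradient of the test function f(x) = 10*|x2 - x1^2| + (1 - x1)^2 used in the plots later in these slides. At a tie (x2 = x1^2) it simply returns one of the one-sided gradients.

```python
def f_and_grad(x):
    """Return f(x) = 10*|x2 - x1^2| + (1 - x1)^2 and a gradient at x."""
    x1, x2 = x
    r = x2 - x1 * x1
    # Assumption: at a tie (r == 0) we pick the sign +1 and do not treat
    # the nondifferentiable case specially.
    s = 1.0 if r >= 0 else -1.0
    f = 10.0 * abs(r) + (1.0 - x1) ** 2
    g = [-20.0 * s * x1 - 2.0 * (1.0 - x1), 10.0 * s]
    return f, g
```

The minimizer is x = (1, 1), where f = 0 and the function is not differentiable; the routine still returns a usable gradient there.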

BFGS

• Standard quasi-Newton method for optimizing smooth functions, using gradient differences to update an inverse Hessian matrix H, defining a local quadratic model of the objective function

• Conventional wisdom: jams on nonsmooth functions

• Amazingly, works very well as long as it is implemented right (a weak Wolfe line search is essential)

• It builds an excellent, extremely ill-conditioned quadratic model

• Often runs until cond(H) is 10^16 before breaking down, taking steplengths of around 1

• Often converges linearly (not superlinearly)

• By contrast, steepest descent and Newton's method usually jam

• However, BFGS is not very reliable on nonsmooth functions, and it has no convergence theory there
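The update of the inverse Hessian approximation H from gradient differences can be sketched as follows. This is the textbook BFGS inverse update written in plain Python for illustration, not code taken from HANSO; the function name and the list-of-rows matrix format are assumptions. Here s is the step x_new - x and y is the gradient difference g_new - g.

```python
def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of a symmetric inverse Hessian approximation H.

    Computes (I - rho*s*y')*H*(I - rho*y*s') + rho*s*s' with rho = 1/(y's).
    """
    n = len(s)
    sy = sum(si * yi for si, yi in zip(s, y))
    if sy <= 0.0:
        # A weak Wolfe step with a descent direction guarantees y's > 0.
        raise ValueError("curvature condition y's > 0 is violated")
    rho = 1.0 / sy
    Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
    yHy = sum(y[i] * Hy[i] for i in range(n))
    coef = rho * rho * yHy + rho
    return [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + coef * s[i] * s[j]
             for j in range(n)] for i in range(n)]
```

The updated matrix satisfies the secant equation H_new y = s, which is what ties the quadratic model to the observed gradient differences.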

[Figure: Steepest descent on f(x) = 10*|x2 - x1^2| + (1 - x1)^2, plotted over [-2, 2] x [-2, 2]; iter 25: f = 3.6e+000]

[Figure: BFGS on f(x) = 10*|x2 - x1^2| + (1 - x1)^2, plotted over [-2, 2] x [-2, 2]; iter 250: f = 4.5e-009]

[Figure: BFGS summary for f(x) = 10*|x2 - x1^2| + (1 - x1)^2, log scale over iterations 0 to 250; legend: f, ||g||cos(theta), inverse(cond(H))]

Minimizing a Product of Eigenvalues

• An application due to Anstreicher and Lee

• Minimize the log of the product of the K largest eigenvalues of A o X (Hadamard product), subject to X positive semidefinite and diag(X) = 1 (A, X symmetric); we usually set K = N/2

• Would be equivalent to an SDP if the product were replaced by a sum

• Number of variables is n = N(N-1)/2, where N is the dimension of X

• Following Burer, substitute Y Y' for X, where Y is N by r, so n is reduced to rN, and normalize Y so its rows have norm 1: thus both constraints are eliminated

• Typically the objective is not smooth at the minimizer, because the K-th largest eigenvalue is multiple

• Results for N=10 and rank r = 2 through 10…

[Figure: optimal value of log eigenvalue product f versus rank r, for r = 2 through 10; optimal multiplicities are 2 3 3 5 6 3 3 3 3]

N=10, r=6: eigenvalues of optimal A o X

3.163533558166873e-001
5.406646569314501e-001
5.406646569314647e-001
5.406646569314678e-001
5.406646569314684e-001
5.406646569314689e-001
5.406646569314706e-001
7.149382062194134e-001
7.746621702661097e-001
1.350303124150612e+000

The U and V Spaces

• The condition number of H, the BFGS approximation to the inverse Hessian, typically reaches 10^16!

• Let's look at the eigenvalues of H: H = QDQ* (* denotes transpose). The search direction is d = -Hg = -QDQ*g (where g = gradient)

• Next plot shows
  – diag(D), sorted into ascending order
  – components of |Q*g|, using the same ordering
  – components of |DQ*g|, using the same ordering (search direction expanded in eigenvector basis)

• Tiny eigenvalues of H correspond to “nonsmooth” V-space

• Other eigenvalues of H correspond to “smooth” U-space

• From matrix theory, dim(V-space) = m(m+1)/2 – 1, where m is multiplicity of Kth eigenvalue of A o X
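The dimension count can be checked with a few lines of arithmetic. The sketch below (Python, with hypothetical function names) takes the total number of variables to be n = rN, as in the Burer substitution described earlier, and assigns the remainder to the U-space.

```python
def v_space_dim(m):
    """dim(V-space) = m(m+1)/2 - 1, with m the multiplicity of the K-th eigenvalue."""
    return m * (m + 1) // 2 - 1


def u_space_dim(N, r, m):
    """Remaining dimensions out of n = r*N variables (assumed split: U + V = n)."""
    return r * N - v_space_dim(m)
```

For N=10, r=6, m=6 this gives V and U dimensions of 20 and 40, and for N=63, r=10, m=10 it gives 54 and 576, matching the values reported in the titles of the two plots that follow.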

[Figure: Data=63, N=10, r=6, K=5, mult=6, V space dim=20, U space dim=40, iter=415; log-scale plot of diag(D) where H=QDQ*, |Q*g|, and |DQ*g|]

[Figure: Data=63, N=63, r=10, K=31, mult=10, V space dim=54, U space dim=576, iter=10000; log-scale plot of diag(D) where H=QDQ*, |Q*g|, and |DQ*g|]

More on the U and V Spaces

• H seems to be an excellent approximation to the “limiting inverse Hessian” and evidently gives us bases for the optimal U and V spaces

• Let’s look at plots of f along random directions in the U and V spaces, as defined by eigenvectors of H, passing through minimizer (for the N=10, r=6 example)

[Figure: plots of f in random directions in V (red) and U (green) spaces, scale=1.0e+001]

[Figure: plots of f in random directions in V (red) and U (green) spaces, scale=1.0e+000]

[Figure: plots of f in random directions in V (red) and U (green) spaces, scale=1.0e-001]

Line Search for BFGS

• Sufficient decrease: for 0 < c1 < 1,
  f(x + td) ≤ f(x) + c1 t (grad f)(x)*d

• Weak Wolfe condition on derivative: for c1 < c2 < 1,
  (grad f)(x + td)*d ≥ c2 (grad f)(x)*d

• Not the strong Wolfe condition on derivative:
  |(grad f)(x + td)*d| ≤ -c2 (grad f)(x)*d

• Weak Wolfe is essential when f is nonsmooth

• No reason to use strong Wolfe even if f is smooth

• Weak Wolfe is much simpler to implement

• Why ever use strong Wolfe?
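A weak Wolfe line search can be sketched with a simple bracketing scheme: shrink the step by bisection when sufficient decrease fails, and expand it when the derivative condition fails. This is an illustrative Python sketch, not necessarily the scheme HANSO uses; the names `weak_wolfe_search` and `fg`, and the defaults c1 = 1e-4 and c2 = 0.9, are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weak_wolfe_search(fg, x, d, c1=1e-4, c2=0.9, max_iter=60):
    """Return a step t satisfying both weak Wolfe conditions, or None.

    fg(x) must return (f(x), gradient at x); d must be a descent direction.
    """
    f0, g0 = fg(x)
    slope0 = dot(g0, d)
    if slope0 >= 0.0:
        raise ValueError("d is not a descent direction")
    lo, hi, t = 0.0, float("inf"), 1.0
    for _ in range(max_iter):
        xt = [xi + t * di for xi, di in zip(x, d)]
        ft, gt = fg(xt)
        if ft > f0 + c1 * t * slope0:
            hi = t          # sufficient decrease fails: step too long
        elif dot(gt, d) < c2 * slope0:
            lo = t          # derivative condition fails: step too short
        else:
            return t        # both conditions hold
        # Bisect once an upper bound is known; otherwise keep doubling.
        t = 0.5 * (lo + hi) if hi < float("inf") else 2.0 * lo
    return None
```

On f(x) = |x| from x = 1 with d = -1, the search steps across the kink and accepts t = 1.5, where the directional derivative has changed sign. No bound on the absolute value of the derivative is ever tested, which is why the scheme still makes sense when f is nonsmooth.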

But how to check if final x is locally optimal?

• Extreme ill-conditioning of H is a strong clue, but is expensive to compute and proves nothing

• If x is locally optimal, running a local bundle method from x establishes nonsmooth stationarity: this involves repeated null-step line searches and building a bundle of gradients obtained via the line searches

• When line search returns null step, it must also return gradient at a point lying across the discontinuity

• Then new d = – arg min { ||d||: d ∈ conv G }, where G is the set of gradients in the bundle

• Terminate if ||d|| is small

• If ||d|| is 0 and line search steps were 0, x is Clarke stationary; more realistically it is “close” to Clarke stationary

• I’ll call this Clarke Quay Stationary since it is a Quay Idea

First-order local optimality

• Is this implied by Clarke stationarity of f at x?

• No, for example f(x) = -|x| is Clarke stationary at 0

• Yes, in the sense that f’(x; d) ≥ 0 for all directions d, when f is regular at x (subdifferentially regular, Clarke regular)

• Most of the functions that we have studied are regular everywhere, although this is sometimes hard to prove

• Regularity generalizes smoothness and convexity

Harder Problems

• Eigenvalue product problems are interesting, but the Lipschitz constant L for f is not large

• Harder problems:
  – Chebyshev Exponential Approximation
  – Optimization of Distance to Instability
  – Pseudospectral Abscissa Optimization (arbitrarily large L)
  – Spectral Abscissa Optimization (not even Lipschitz)

• In all cases, BFGS turns out to be substantially less reliable than gradient sampling

Gradient Sampling Algorithm

Initialize ε and x.

Repeat

• Get G, a set of gradients of the function f evaluated at x and at points near x (sampling controlled by ε)

• Let d = – arg min { ||d||: d ∈ conv G }

• Line search: replace x by x + t d, with f(x + t d) < f(x) (if d is not 0)

until d = 0 (or ||d|| ≤ tol)

Then reduce ε and repeat.
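The key subproblem is finding the smallest vector in the convex hull of the sampled gradients. The sketch below uses Gilbert's iterative algorithm as a stand-in because it needs no solver; a quadratic programming routine would normally be used instead. The function name is hypothetical and this is not the author's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_in_hull(G, tol=1e-12, max_iter=10000):
    """Approximate arg min { ||z|| : z in conv G } by Gilbert's algorithm."""
    z = list(G[0])
    for _ in range(max_iter):
        # Vertex of the hull that is most opposed to the current point z.
        g = min(G, key=lambda v: dot(z, v))
        gap = dot(z, z) - dot(z, g)     # optimality gap <z, z - g> >= 0
        if gap <= tol:
            break
        diff = [zi - gi for zi, gi in zip(z, g)]
        # Exact minimization of the norm along the segment from z to g.
        t = min(1.0, gap / dot(diff, diff))
        z = [zi - t * di for zi, di in zip(z, diff)]
    return z
```

For f(x) = 10*|x2 - x1^2| + (1 - x1)^2 at the minimizer (1, 1), the gradients on the two sides of the kink are (-20, 10) and (20, -10). Their convex hull contains the origin, so the routine returns z = 0 and the search direction d = -z vanishes, which is the stationarity test used by the algorithm.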

Convergence Theory for Gradient Sampling Method (SIOPT, 2005)

• Suppose
  – f is locally Lipschitz and coercive
  – f is continuously differentiable on an open dense subset of its domain
  – the number of gradients sampled near each iterate is greater than the problem dimension
  – the line search uses an appropriate sufficient decrease condition

• Then, with probability one and for fixed sampling parameter ε, the algorithm generates a sequence of points with a cluster point x that is ε-Clarke stationary

• If f has a unique Clarke stationary point x, then the set of all cluster points generated by the algorithm converges to x as ε is reduced to zero

• Kiwiel has already improved on this! (For a slightly different, as yet untested, version of the algorithm.)

[Figure: optimal value found for 10 different starting points; pseudospectral abscissa, epsln = .001; legend: BFGS, Grad Sampling, Grad Sampling started at BFGS solution]

[Figure: optimal value found for 10 different starting points; pseudospectral abscissa, epsln = .000001; legend entry recovered: BFGS]

[Figure: optimal value found for 10 different starting points; spectral abscissa; legend entry recovered: BFGS]


HANSO

• User provides routine to compute f and its gradient at any given x (do not worry about nonsmooth cases)

• BFGS is then run from many randomly generated or user-provided starting points. Quit if gradient at best point found is small.

• Otherwise, a local bundle method is run from best point found, attempting to verify local stationarity.

• If this is not successful, gradient sampling is initiated.

• Regardless, return a final x along with a set of nearby points, the corresponding gradients, and the d which is the smallest vector in the convex hull of these gradients

• If the nearby points are near enough and d is small enough, f is Clarke Quay Stationary at x

• This is an approximate first-order local optimality condition if f is regular at x

Optimization in Control

• Often, not many variables, as in low-order controller design

• Reasons:
  – engineers like simple controllers
  – nonsmooth, nonconvex optimization problems in even a few variables can be very difficult

• Sometimes, more variables are added to the problem in an attempt to make it more tractable

• Example: adding a Lyapunov matrix variable P and imposing stability on A by requiring AP + PA* to be negative definite, with P positive definite

• When A depends linearly on the original parameters, this results in a bilinear matrix inequality (BMI), which is typically difficult to solve

• Our approach: tackle the original problem

• Looking for locally optimal solutions

H∞ Norm of a Transfer Function

• Transfer function G(s) = C (sI – A)^-1 B + D

• SS representation:
  [ A  B ]
  [ C  D ]

• H∞ norm is ∞ if A is not stable (stable means all eigenvalues have negative real part) and otherwise, sup ||G(s)||_2 over all imaginary s

• Standard way to compute it is norm(SS,inf) in Matlab control toolbox

• LINORM from SLICOT is much faster
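To make the definition concrete, here is a rough Python sketch for the simplest possible case: one state, one input, one output, so that A, B, C, D are real scalars a, b, c, d and ||G(s)||_2 is just |G(s)|. It evaluates G on a user-supplied frequency grid, which only gives a lower bound on the supremum; it is not how norm(SS,inf) or LINORM work. The function name and the grid are assumptions.

```python
import math


def hinf_norm_scalar(a, b, c, d, freqs):
    """Grid estimate of the H-infinity norm of G(s) = c*b/(s - a) + d."""
    if a >= 0.0:
        # Not stable: the norm is infinite by definition.
        return math.inf
    # |G(jw)| on the grid, plus the limit |d| as w goes to infinity.
    on_grid = max(abs(c * b / (complex(0.0, w) - a) + d) for w in freqs)
    return max(on_grid, abs(d))
```

For G(s) = 1/(s + 1), any grid containing w = 0 returns 1, the exact value. For G(s) = s/(s + 1), that is a = -1, b = 1, c = -1, d = 1, the supremum 1 is only approached as w grows, and it is the |d| term that recovers it.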

H∞ Fixed-Order Controller Design

• Choose matrices a, b, c, d to minimize the H∞ norm of the transfer function with SS representation

  A-block: [ A + B2 d C2    B2 c ]      B-block: [ B1 + B2 d D21 ]
           [ b C2           a    ]               [ b D21         ]

  C-block: [ C1 + D12 d C2  D12 c ]     D-block: [ D11 + D12 d D21 ]

• The dimension of a is k by k (order, between 0 and n=dim(A))

• The dimension of b is k by p (number of system inputs)

• The dimension of c is m (number of system outputs) by k

• The dimension of d is m by p

• Total number of variables is (k+m)(k+p)

• The case k=0 is static output feedback

• When B1, C1 are empty, all I/O channels are in performance measure
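The variable count and the closed-loop A-block can be sketched in a few lines of Python. Matrices are stored as lists of rows and the helper names are hypothetical. The sketch assumes k ≥ 1 and the dimensions stated on the slide: a is k by k, b is k by p, c is m by k, d is m by p, with B2 of size n by m and C2 of size p by n.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def num_variables(k, m, p):
    """Entries of a (k x k), b (k x p), c (m x k), d (m x p) added together."""
    return k * k + k * p + m * k + m * p


def closed_loop_a_block(A, B2, C2, a, b, c, d):
    """Assemble [[A + B2 d C2, B2 c], [b C2, a]] as one list of rows."""
    B2dC2 = matmul(matmul(B2, d), C2)
    top_left = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B2dC2)]
    top_right = matmul(B2, c)
    bottom_left = matmul(b, C2)
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, a)]
    return top + bottom
```

The sum k*k + k*p + m*k + m*p factors as (k+m)(k+p), which is the total quoted on the slide; for k = 0 it reduces to m*p, the size of d alone.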

HIFOO: H∞ Fixed-Order Optimization

• Aims to find a, b, c, d for which the H∞ norm is locally optimal

• Begins by minimizing the spectral abscissa max(real(eig(A-block))) until it finds a, b, c, d for which A-block is stable (and therefore the H∞ norm is finite)

• Then locally minimizes the H∞ norm

• Calls HANSO to carry out both optimizations

• Alternative objectives:
  – Optimize the spectral abscissa instead of quitting when stable
  – Optimize the pseudospectral abscissa
  – Optimize the distance to instability (complex stability radius)

• The output a, b, c, d can then be input to optimize for larger order k, which cannot give a worse result

• Accepts various input formats
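The stabilization phase stops as soon as the spectral abscissa of the A-block is negative. As a small illustration of that quantity, the Python sketch below computes max(real(eig(M))) for a 2 by 2 matrix from the characteristic polynomial. The function name and the example matrices are hypothetical; a general eigenvalue routine would be used for larger blocks.

```python
import cmath


def spectral_abscissa_2x2(M):
    """Largest real part of the eigenvalues of a 2x2 matrix [[p, q], [r, s]]."""
    (p, q), (r, s) = M
    tr = p + s
    det = p * s - q * r
    # Eigenvalues are the roots of lambda^2 - tr*lambda + det = 0.
    root = cmath.sqrt(tr * tr - 4.0 * det)
    return max(((tr + root) / 2.0).real, ((tr - root) / 2.0).real)
```

For example, [[-3, 3], [2, -1]] has spectral abscissa (-4 + sqrt(28))/2 > 0 and is therefore not stable, while [[-3, 1], [-1, -1]] has a double eigenvalue at -2 and is stable. The double eigenvalue also shows why the function is typically nonsmooth at good solutions: ties among eigenvalues are common there.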

HIFOO Provides the Gradients

• For H∞ norm, it combines
  – left and right singular vector info at point on imaginary axis where sup achieved
  – chain rule

• For spectral abscissa max(real(eig)), it combines
  – left and right eigenvector info for eigenvalue achieving max real part
  – chain rule

• Do not have to worry about ties

• Function is continuous, but gradient is not

• However, function is differentiable almost everywhere (and virtually everywhere that it is evaluated)

• Typically, not differentiable at an exact minimizer

Benchmark Examples: AC Suite

• Made extensive runs for the AC suite of 18 aircraft control problems in Leibfritz’ COMPLeIB

• In many cases we found low-order controllers that had smaller H∞ norm than was supposedly possible according to the standard full-order controller MATLAB routine HINFSYN!

• Sometimes this was even the case when the order was set to 0 (static output feedback)

• After I announced this at the SIAM Control meeting in July, HINFSYN was extensively debugged and a new version has been released

Benchmark Examples: Mass-Spring

• A well known simple example with n=4

• Optimizing spectral abscissa:
  – Order 1, we obtain 0, known to be optimal value
  – Order 2, we obtain about 0.73, previously conjectured to be about 0.5

Benchmarks: Belgian Chocolate Challenge

• Simply stated stabilization problem due to Blondel

• From 1994 to 2000, not known if solvable by any order controller

• In 2000, an order 11 controller was discovered

• We found an order 3 controller, and using our analytical results, proved that it is locally optimal

Availability of HANSO and HIFOO

• Version 0.9, documented in a paper submitted to ROCOND 2006, freely available at http://www.cs.nyu.edu/overton/software/hifoo/

• Version 0.91, available online but not fully tested so no public link yet

• Version 0.92 will have further enhancements and we hope to announce it widely in a month or two

Some Relevant Publications

• A Robust Gradient Sampling Algorithm for Nonsmooth, Nonconvex Optimization
  – SIOPT, 2005 (with Burke, Lewis)

• Approximating Subdifferentials by Random Sampling of Gradients
  – MOR, 2002 (with Burke, Lewis)

• Stabilization via Nonsmooth, Nonconvex Optimization
  – submitted to IEEE Trans-AC (with Burke, Henrion, Lewis)

• HANSO: A Hybrid Algorithm for Non-Smooth Optimization Based on BFGS, Bundle and Gradient Sampling
  – in planning stage

http://www.cs.nyu.edu/faculty/overton/

恭喜发财 Gong xi fa cai!

Gung hei fat choy!

恭喜發財
