Zero-Order, Black-Box, Derivative-Free, and Simulation-Based Optimization Stefan Wild Argonne National Laboratory Mathematics and Computer Science Division August 5, 2016


Page 1: Zero-Order, Black-Box, Derivative-Free, and Simulation

Zero-Order, Black-Box, Derivative-Free, and Simulation-Based Optimization

Stefan Wild

Argonne National Laboratory, Mathematics and Computer Science Division

August 5, 2016

Page 2: Zero-Order, Black-Box, Derivative-Free, and Simulation

The Plan

Motivation

Black-Box Optimization: Direct Search Methods, Model-Based Methods, Some Global Optimization

Simulation-Based Optimization and Structure: NLS = Nonlinear Least Squares, CNO = Composite Nonsmooth Optimization, SKP = Some Known Partials, SCO = Simulation-Constrained Optimization

Wild, IMA WMO16 1

Page 3: Zero-Order, Black-Box, Derivative-Free, and Simulation

Simulation-Based Optimization

min_{x ∈ Rn} { f(x) = F[x, S(x)] : cI[x, S(x)] ≤ 0, cE[x, S(x)] = 0 }

⋄ S (numerical) simulation output, often “noisy” (even when deterministic)

⋄ Derivatives ∇xS often unavailable or prohibitively expensive to obtain/approximate directly

⋄ S can contribute to objective and/or constraints

⋄ Single evaluation of S could take seconds/minutes/hours/days; evaluation is a bottleneck for optimization

Functions of complex numerical simulations arise everywhere

Wild, IMA WMO16 2

Page 4: Zero-Order, Black-Box, Derivative-Free, and Simulation

Computing is Responsible for Pervasiveness of Simulations in Sci&Eng

Argonne’s AVIDAC (1953: vacuum tubes)

Argonne’s Blue Gene/Q (2012: 786,432 cores), currently 6th fastest in the world

Sunway TaihuLight (2016: 11M cores), currently fastest in the world

⋄ Parallel/multi-core environments increasingly common: small clusters/multi-core desktops/multi-core laptops pervasive; leadership-class machines increasingly parallel

⋄ Simulations (the “forward problem”) become faster/more realistic/more complex

Wild, IMA WMO16 3

Page 5: Zero-Order, Black-Box, Derivative-Free, and Simulation

Improvements from Algorithms Can Trump Those From Hardware

Martin Grötschel’s production planning benchmark problem (a MIP):

1988 solve time using then-current computers and LP algorithms: 82 years
2003 solve time using then-current computers and LP algorithms: 1 minute

⋄ Speedup of 43,000,000X: ≈10^3X from processor improvements, ≈10^4X additional from algorithmic improvements

Wild, IMA WMO16 4

Page 6: Zero-Order, Black-Box, Derivative-Free, and Simulation

Improvements from Algorithms Can Trump Those From Hardware

1991 (v1.2) to 2007 (v11.0), solving 1,852 MIPs: Moore’s Law transistor speedup ≈ 256X [slide from Bixby (CPLEX/GUROBI)]

Wild, IMA WMO16 4

Page 7: Zero-Order, Black-Box, Derivative-Free, and Simulation

Derivative-Free/Zero-Order Optimization

“Some derivatives are unavailable for optimization purposes”

Wild, IMA WMO16 5

Page 8: Zero-Order, Black-Box, Derivative-Free, and Simulation

Derivative-Free/Zero-Order Optimization

“Some derivatives are unavailable for optimization purposes”

The Challenge: Optimization is tightly coupled with derivatives

Typical optimality conditions (no noise, smooth functions):

∇xf(x*) + λ⊤∇xcE(x*) = 0,  cE(x*) = 0

(Sub)gradients ∇xf, ∇xc enable:

⋄ Faster feasibility
⋄ Faster convergence: guaranteed descent, approximation of nonlinearities
⋄ Better termination: measure of criticality, ‖∇xf‖ or ‖PΩ(∇xf)‖
⋄ Sensitivity analysis: correlations, standard errors, UQ, . . .

Wild, IMA WMO16 5

Page 9: Zero-Order, Black-Box, Derivative-Free, and Simulation

Ways to Get Derivatives (assuming they exist)

Handcoding (HC): “Army of students/programmers”
− Prone to errors/conditioning
− Intractable as the number of operations increases

Algorithmic/Automatic Differentiation (AD): “Exact* derivatives!”
− No black boxes allowed
− Not always automatic/cheap/well-conditioned

Finite Differences (FD): “Nonintrusive”
− Expense grows with n
− Sensitive to stepsize choice/noise
→ [Moré & W.; SISC 2011], [Moré & W.; TOMS 2012]

. . . then apply a derivative-based method (that handles inexact derivatives)
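As a small illustration of the FD column above, a minimal forward-difference gradient sketch in Python (the test function and step h are illustrative assumptions; when f is noisy, too small an h amplifies the noise, cf. [Moré & W.; SISC 2011]):

import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference approximation of grad f at x: costs n extra evaluations."""
    n = x.size
    g = np.zeros(n)
    f0 = f(x)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        g[i] = (f(x + e) - f0) / h
    return g

# Illustrative noisy function: the accuracy of g depends strongly on h
f = lambda x: np.sum(x**2) + 1e-8 * np.random.randn()
print(fd_gradient(f, np.array([1.0, 2.0]), h=1e-3))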

Wild, IMA WMO16 6

Page 10: Zero-Order, Black-Box, Derivative-Free, and Simulation

Algorithmic Differentiation

→ [Coleman & Xu; SIAM 2016], [Griewank & Walther; SIAM 2008]

Computational Graph

⋄ y = sin(a ∗ b) ∗ c

⋄ Forward and reverse modes

⋄ AD tool provides code for your derivatives

Write codes and formulate problems with AD in mind!

Many tools (see www.autodiff.org):

F OpenAD

F/C Tapenade, Rapsodia

C/C++ ADOL-C, ADIC

Matlab ADiMat, INTLAB

Python/R ADOL-C

Also done in AMPL, GAMS, JULIA!

Wild, IMA WMO16 7


Page 12: Zero-Order, Black-Box, Derivative-Free, and Simulation

The Price of Algorithm Choice: Solvers in PETSc/TAO

[Plot: chwirut1 (n = 6), best f value found vs. number of evaluations for lmvm, pounders, and nm]

Toolkit for Advanced Optimization

[Munson et al.; mcs.anl.gov/tao]

Increasing level of user input:

nm: assumes ∇xf unavailable, black box
pounders: assumes ∇xf unavailable, exploits problem structure
← THIS TALK
lmvm: uses available ∇xf

DFO methods should be designed to beat finite-difference-based methods

Observe: constrained by a budget on #evals, the method limits solution accuracy/problem size

Wild, IMA WMO16 8


Page 15: Zero-Order, Black-Box, Derivative-Free, and Simulation

Global Optimization, minx∈Ω f(x)

Careful:

⋄ Global convergence: convergence (to a local solution/stationary point) from anywhere in Ω

⋄ Convergence to a global minimizer: obtain x* with f(x*) ≤ f(x) for all x ∈ Ω

Anyone selling you global solutions when derivatives are unavailable:

either assumes more about your problem (e.g., convex f)

or expects you to wait forever

Törn and Žilinskas: An algorithm converges to the global minimum for any continuous f if and only if the sequence of points visited by the algorithm is dense in Ω.

or cannot be trusted

Instead:

⋄ Rapidly find good local solutions and/or be robust to poor solutions

⋄ Consider multistart approaches and/or structure of multimodality

Wild, IMA WMO16 9

Page 16: Zero-Order, Black-Box, Derivative-Free, and Simulation

(One Reason) Why We Won’t Be Talking About Heuristics

[Plot: fit value (log scale) vs. number of iterations for Serial** PSO, Serial Simplex, Serial POUNDERS, and 1024-Core PSO]

⋄ Heuristics often “embarrassingly/naturally parallel”; PSO = particle swarm method, typically through stochastic sampling/evolution; 1024 function evaluations per iteration

⋄ Simplex is Nelder-Mead; POUNDERS is a model-based trust-region algorithm; one function evaluation per iteration

→ Is this an effective use of resources?
→ How many cores would have sufficed?

Wild, IMA WMO16 10

Page 17: Zero-Order, Black-Box, Derivative-Free, and Simulation

I. Black-box Optimization

Page 18: Zero-Order, Black-Box, Derivative-Free, and Simulation

Black-box Optimization Problems

Only knowledge about f is obtained by sampling

⋄ f = S a black box (running some executable-only code or performing an experiment in the lab)

⋄ Only gives a single output (no derivatives ∇xS(x))

Good solutions guaranteed in the limit, but:

⋄ Usually have a computational budget (due to scheduling, finances, deadlines)

⋄ Limited number of evaluations

Wild, IMA WMO16 12


Page 20: Zero-Order, Black-Box, Derivative-Free, and Simulation

A Black Box: Automating Empirical Performance Tuning

Given semantically equivalent codes x1, x2, . . ., minimize run time subject to energy consumption

min { f(x) : (xC, xI, xB) ∈ ΩC × ΩI × ΩB }

x: multidimensional parameterization (compiler type, compiler flags, unroll/tiling factors, internal tolerances, . . . )

Ω search domain (feasible transformation, no errors)

f quantifiable performance objective (requires a run)

→ [Audet & Orban; SIOPT 2006], [Balaprakash, W., Hovland; ICCS 2011], [Porcelli & Toint; 2016]

Numerical Linear Algebra → [N. Higham; SIMAX 1993], . . .

Wild, IMA WMO16 13

Page 21: Zero-Order, Black-Box, Derivative-Free, and Simulation

Optimization for Automatic Tuning of HPC Codes

Evaluation of f requires: transforming source, compilation, (repeated?) execution, checking for correctness

[Plot: time (CPU ms) vs. unroll factors i and j]

Challenges:

- Evaluating f(Ω) prohibitively expensive (e.g., 10^19 discrete decisions)

- f noisy

- Discrete x unrelaxable

- ∇xf unavailable/nonexistent

- Many distinct/local solutions

Wild, IMA WMO16 14

Page 22: Zero-Order, Black-Box, Derivative-Free, and Simulation

Black-box Algorithms

Solve general problems min { f(x) : x ∈ Rn }:

⋄ Only require function values (no ∇f(x))

⋄ Don’t rely on finite-difference approximations to ∇f(x)

⋄ Seek greedy and rapid decrease of function value

⋄ Have asymptotic convergence guarantees

⋄ Assume parallel resources are used within function evaluation

Main styles of DFO algorithms

⋄ Randomized methods (later?)

⋄ Direct search methods (pattern search, Nelder-Mead, . . . )

⋄ Model-based methods (quadratics, radial basis functions, . . . )

Wild, IMA WMO16 15


Page 24: Zero-Order, Black-Box, Derivative-Free, and Simulation

Black-Box Algorithms: Stochastic Methods

Random search

Repeat:

1. Randomly generate direction dk ∈ Rn

2. Evaluate the “gradient-free oracle” g(xk; hk) = [(f(xk + hk dk) − f(xk)) / hk] dk (≈ directional derivative)

3. Compute xk+1 = xk − δk g(xk; hk), evaluate f(xk+1)

Convergence (for different types of f) tends to be probabilistic
[Kiefer & Wolfowitz; AnnMS 1952], [Polyak; 1987], [Ghadimi & Lan; SIOPT 2013], [Nesterov & Spokoiny; FoCM 2015], . . .
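A minimal sketch of the random-search iteration above in Python (the step sizes h, δ, iteration budget, and test function are illustrative assumptions, not tuned values):

import numpy as np

def random_search(f, x0, n_iters=2000, h=1e-4, delta=1e-2, seed=0):
    """Zeroth-order random search using the gradient-free oracle above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(n_iters):
        d = rng.standard_normal(x.size)          # random direction dk
        d /= np.linalg.norm(d)
        g = (f(x + h * d) - f(x)) / h * d        # oracle g(xk; hk)
        x = x - delta * g                        # step to xk+1
    return x, f(x)

# Illustrative smooth test function
x_best, f_best = random_search(lambda x: np.sum((x - 1.0) ** 2), np.zeros(5))
print(x_best, f_best)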

Stochastic heuristics (nature-inspired methods, etc.)

⋄ Popular in practice, especially in engineering

⋄ Typically global in nature

⋄ Require many f evaluations

Wild, IMA WMO16 16


Page 31: Zero-Order, Black-Box, Derivative-Free, and Simulation

Pattern Search And Its Variants

Choose a set of directions (pattern or mesh) Dk

Ex.- ± coordinate directions (2n directions)

Ex.- any minimal positive spanning set ([e1, · · · , en, −Σi ei])

[Figure: pattern-search iterates on a 2-D contour plot]

Basic iteration (k ≥ 0):

⋄ Evaluate f(xk + ∆k dj), j = 1, . . . , |Dk|

⋄ If [ f(xk + ∆k dj) < f(xk) ], move to xk+1 = xk + ∆k dj; otherwise shrink ∆k

⋄ Update Dk

This is an indicator function; it does not say anything about the magnitude of f values, just the ordering

→ Lends itself well to doing concurrent function evaluations
→ See also mesh-adaptive direct search methods
→ Can establish convergence for nonsmooth f

Survey → [Kolda, Lewis, Torczon; SIREV 2003]

Tools → DFL [Liuzzi et al.], NOMAD [Audet et al.], . . .
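A minimal coordinate-search sketch of the iteration above in Python (initial point, step-size rule, and test function are illustrative assumptions; production codes such as NOMAD add a search phase, mesh management, and constraint handling):

import numpy as np

def coordinate_search(f, x0, delta=1.0, tol=1e-6, max_evals=500):
    """Pattern search with Dk = {±e1, ..., ±en}; shrink Delta on failure."""
    x = np.asarray(x0, float)
    fx, evals = f(x), 1
    n = x.size
    D = np.vstack([np.eye(n), -np.eye(n)])        # 2n coordinate directions
    while delta > tol and evals < max_evals:
        improved = False
        for d in D:                               # poll the pattern
            trial = x + delta * d
            ftrial = f(trial); evals += 1
            if ftrial < fx:                       # simple (ordering-based) decrease
                x, fx, improved = trial, ftrial, True
                break
        if not improved:
            delta *= 0.5                          # shrink the step size
    return x, fx

x_best, f_best = coordinate_search(lambda x: (x[0] - 3) ** 2 + 10 * (x[1] + 1) ** 2,
                                   np.array([0.0, 0.0]))
print(x_best, f_best)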

Wild, IMA WMO16 17


Page 38: Zero-Order, Black-Box, Derivative-Free, and Simulation

The Nelder-Mead Method [1965]

Basic iteration (k ≥ 0):

⋄ Evaluate f on the n + 1 vertices of the simplex xk + ∆k S(k)

⋄ Reflect the worst vertex about the best face

⋄ Shrink, contract, or expand ∆k S(k)

[Figure: simplex iterates on a 2-D contour plot in (θ1, θ2)]

Only the order of the function values matters: values {1, 1.0001} are treated the same as {1, 10000}.

→ A very popular (in Numerical Recipes), robust first choice . . . with nontrivial convergence. Newer NM → [Lagarias, Poonen, Wright; SIOPT 2012]
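In practice Nelder-Mead is readily available; a short usage example with SciPy (assuming SciPy is installed; the test function and options are illustrative):

import numpy as np
from scipy.optimize import minimize

# Illustrative smooth test function of two variables
f = lambda x: (x[0] - 3.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2

res = minimize(f, x0=np.array([0.0, 0.0]), method="Nelder-Mead",
               options={"maxfev": 200, "xatol": 1e-6, "fatol": 1e-8})
print(res.x, res.fun, res.nfev)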

Wild, IMA WMO16 18

Page 39: Zero-Order, Black-Box, Derivative-Free, and Simulation

What Are We Missing?

These methods will (eventually) find a local solution. Overview: → [Kolda, Lewis, Torczon; SIREV 2003]

Each evaluation of f is expensive (valuable)

N-M:

1. Only remembers the last n + 1 evaluations

2. Neglects the magnitudes of the function values (order only)

3. Doesn’t take into account the special (LS) problem structure

→ This is the reason many direct search methods use a search phase on top of the usual poll phase

Wild, IMA WMO16 19

Page 40: Zero-Order, Black-Box, Derivative-Free, and Simulation

Making the Most of Little Information About Smooth f

⋄ f is expensive ⇒ can afford to make better use of points

⋄ Overhead of the optimization routine is minimal (negligible?) relative to the cost of evaluating the simulation

[Plot: sampled points and their f values over a 2-D domain]

Bank of data, {(xi, f(xi))}_{i=1}^k:

= Points (& function values) evaluated so far

= Everything known about f

Idea:

⋄ Make use of the growing Bank as optimization progresses

⋄ Limit unnecessary evaluations (geometry/approximation)

Wild, IMA WMO16 20


Page 42: Zero-Order, Black-Box, Derivative-Free, and Simulation

Trust-Region Methods Use Models Instead of f

To reduce the number of expensive f evaluations,
→ replace the difficult optimization problem min f(x)
with a much simpler one: min { m(x) : x ∈ B }

Classic NLP technique:

f Original function: computationally expensive, no derivatives

m Surrogate model: computationally attractive, analytic derivatives

Wild, IMA WMO16 21

Page 43: Zero-Order, Black-Box, Derivative-Free, and Simulation

Basic Trust-Region Idea

Use a model m(x) in place of the unwieldy f(x)

[Figure: trust-region steps on a 2-D contour plot]

Optimize over m to avoid the expense of f:

⋄ Trust m to approximate f within Bk = {x ∈ Rn : ‖x − xk‖ ≤ ∆k}

⋄ Obtain the next point from min { m(xk + s) : xk + s ∈ Bk }

⋄ Evaluate the function and update (xk, ∆k) based on how good the model’s prediction was:

ρk = [f(xk) − f(xk + sk)] / [m(xk) − m(xk + sk)]

[Conn, Gould, Toint; SIAM, 2000]
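A minimal derivative-free trust-region loop illustrating the ρk test above, in Python (the model here is a crude linear fit to sampled points and the step is a Cauchy-type step; the interpolation-set management and geometry safeguards of the following slides are omitted, so this is only a sketch under those simplifying assumptions):

import numpy as np

def dfo_trust_region(f, x0, delta=0.5, max_iter=50, eta=0.1):
    """Sketch: linear model from sampled points, step to the ball boundary, rho test."""
    x = np.asarray(x0, float)
    fx = f(x)
    n = x.size
    for k in range(max_iter):
        # Build a linear model m(x+s) = fx + g.s from 2n points sampled in B_k
        Y = delta * np.vstack([np.eye(n), -np.eye(n)])
        F = np.array([f(x + y) for y in Y])
        g, *_ = np.linalg.lstsq(Y, F - fx, rcond=None)
        if np.linalg.norm(g) < 1e-8:
            break
        s = -delta * g / np.linalg.norm(g)        # minimize m over the ball B_k
        f_new = f(x + s)
        pred = -g @ s                             # m(xk) - m(xk + s)
        rho = (fx - f_new) / pred                 # agreement ratio rho_k
        if rho >= eta:                            # accept step, possibly expand region
            x, fx = x + s, f_new
            delta = min(2 * delta, 10.0)
        else:                                     # reject step, shrink region
            delta *= 0.5
    return x, fx

print(dfo_trust_region(lambda x: (x[0] - 1) ** 2 + 5 * (x[1] - 2) ** 2, np.zeros(2)))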

Wild, IMA WMO16 22



Page 51: Zero-Order, Black-Box, Derivative-Free, and Simulation

Where Does the Model Come From?

When derivatives are available:

Taylor-based model m(xk + s) = c + (gk)⊤ s + (1/2) s⊤ Hk s

⋄ gk = ∇xf(xk)
⋄ Hk ≈ ∇²xx f(xk)

Without derivatives

⋄ Interpolation-based models

⋄ Regression-based models

⋄ Stochastic/randomized models

Wild, IMA WMO16 23

Page 52: Zero-Order, Black-Box, Derivative-Free, and Simulation

Interpolation-Based Quadratic Models

An interpolating quadratic in R²

m(xk + s) = c + g⊤ s + (1/2) s⊤ H s:

Get the model parameters c, g, H = H⊤ by demanding interpolation:

m(xk + yi) = f(xk + yi)

for all yi ∈ Y = interpolation set

Main difficulty is Y:

⋄ Use prior function evaluations,

⋄ keep m well-defined and approximating f locally.

Wild, IMA WMO16 24

Page 53: Zero-Order, Black-Box, Derivative-Free, and Simulation

Interpolation-Based Trust-Region Methods

Iteration k:

⋄ Build a model mk interpolating f on Y
⋄ Trust mk within region Bk
⋄ Minimize mk within Bk to obtain the next point for evaluation
⋄ Do the expensive evaluation
⋄ Update mk and Bk based on how good the model prediction was

Wild, IMA WMO16 25



Page 59: Zero-Order, Black-Box, Derivative-Free, and Simulation

Quick Diversion: Polynomial Bases

⋄ Let φ denote a basis for some space of polynomials of n variables

Linear: φ(x) = [1, x1, · · · , xn]

Full quadratics: φ(x) = [1, x1, · · · , xn, x1², · · · , xn², x1x2, · · · , xn−1xn]

⋄ Given a collection of p = |Y| points Y = {y1, · · · , yp}, Φ(Y) is the matrix whose ith row is φ(yi):

row i of Φ(Y) = [1, yi1, · · · , yin, (yi1)², · · · , (yin)², yi1yi2, · · · , yin−1 yin],  i = 1, . . . , p

This is a matrix of size p × (n+1)(n+2)/2

Wild, IMA WMO16 26

Page 60: Zero-Order, Black-Box, Derivative-Free, and Simulation

Building Models Without Derivatives

Given data (Yk, f(Yk)) and basis Φ, “solve”

Φ(Yk) z = [Φc  Φg  ΦH] [zc; zg; zH] = f(Yk)

[Figure: n = 2, |Yk| = 4 interpolation points]

Full quadratics, |Yk| = (n+1)(n+2)/2:
⋄ Geometric conditions on the points in Yk

Underdetermined interpolation, |Yk| < (n+1)(n+2)/2:
⋄ Use (Powell) Hessian updates: min_{gk,Hk} ‖Hk − Hk−1‖²_F s.t. qk = f on Yk

Regression, |Yk| > (n+1)(n+2)/2:
⋄ Solve min_z ‖Φz − f‖
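A small sketch of this model-building step in Python/NumPy (quadratic basis, least-squares solve; this handles the interpolation, underdetermined, and regression cases in a minimum-norm sense rather than with Powell’s Hessian updates, and the sample data are illustrative):

import numpy as np
from itertools import combinations

def quad_basis(y):
    """phi(y) = [1, y_1..y_n, y_1^2..y_n^2, y_1*y_2, ..., y_{n-1}*y_n]."""
    cross = [y[i] * y[j] for i, j in combinations(range(len(y)), 2)]
    return np.concatenate(([1.0], y, y ** 2, cross))

def fit_model(Y, fvals):
    """Solve Phi(Y) z = f in the least-squares / minimum-norm sense."""
    Phi = np.array([quad_basis(y) for y in Y])   # p x (n+1)(n+2)/2 matrix
    z, *_ = np.linalg.lstsq(Phi, fvals, rcond=None)
    return z

# Illustrative: interpolate f(y) = 1 + y_1^2 + 3*y_2 at scattered points
rng = np.random.default_rng(1)
Y = rng.standard_normal((6, 2))                  # |Y| = (n+1)(n+2)/2 = 6 for n = 2
z = fit_model(Y, np.array([1 + y[0] ** 2 + 3 * y[1] for y in Y]))
print(z)    # coefficients in the ordering used by quad_basis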

Wild, IMA WMO16 27

Page 61: Zero-Order, Black-Box, Derivative-Free, and Simulation

Multivariate (Scattered Data) Interpolation is a Different Kind of Animal

m(xk + yi) = f(xk + yi) ∀ yi ∈ Y

n = 1: Given p distinct points, can find a unique degree p − 1 polynomial m

n > 1: Not true! (see the Mairhuber-Curtis theorem)

For quadratic models in Rn:

⋄ (n+1)(n+2)/2 coefficients

⋄ Unique interpolant may not exist, even when |Y| = (n+1)(n+2)/2

⋄ Locations of the points in Y must satisfy additional geometric conditions (has nothing to do with f values)

[Figure: contours of an interpolation error measure over (χ1, χ2)]

→ [ Wendland; Cambridge University Press, 2010]

Wild, IMA WMO16 28

Page 62: Zero-Order, Black-Box, Derivative-Free, and Simulation

Notions of Nonlinear Model Quality

“Taylor-like” Error Bounds

1. Assume the underlying f is sufficiently smooth (= derivatives of f exist but are unavailable)

2. A model mk is locally fully linear on Bk = {x ∈ Ω : ‖x − xk‖ ≤ ∆k} if, for all x ∈ Bk,

|mk(x) − f(x)| ≤ κ1 ∆k²
‖∇mk(x) − ∇f(x)‖ ≤ κ2 ∆k

for constants κi independent of x and ∆k.

→[ Conn, Scheinberg, Vicente; SIAM 2009]

Wild, IMA WMO16 29

Page 63: Zero-Order, Black-Box, Derivative-Free, and Simulation

Notions of Nonlinear Model Quality

“Taylor-like” Error Bounds

1. Assume the underlying f is sufficiently smooth

2. A model mk is locally fully quadratic on Bk = {x ∈ Ω : ‖x − xk‖ ≤ ∆k} if, for all x ∈ Bk,

|mk(x) − f(x)| ≤ κ1 ∆k³
‖∇mk(x) − ∇f(x)‖ ≤ κ2 ∆k²
‖∇²mk(x) − ∇²f(x)‖ ≤ κ3 ∆k

for constants κi independent of x and ∆k.

→[ Conn, Scheinberg, Vicente; SIAM 2009]

Wild, IMA WMO16 29

Page 64: Zero-Order, Black-Box, Derivative-Free, and Simulation

Ingredients for Convergence to Stationary Points

lim_{k→∞} ∇f(xk) = 0 provided:

0. f is sufficiently smooth and regular (e.g., bounded level sets)

1. Control Bk based on model quality

2. (Occasional) approximation within Bk. Our quadratics satisfy

|qk(x) − f(x)| ≤ κ1 (γf + ‖Hk‖) ∆k²,  x ∈ Bk
‖gk + Hk(x − xk) − ∇f(x)‖ ≤ κ2 (γf + ‖Hk‖) ∆k,  x ∈ Bk

3. Sufficient decrease

[Photo: Michael J.D. Powell, 1936-2015]

Survey → [Conn, Scheinberg, Vicente; SIAM 2009]
Methods → [Powell: COBYLA, UOBYQA, NEWUOA, BOBYQA, LINCOA], . . .
Line search methods also work → [Kelley et al.; IFFCO]
RBF models also work → [W. & Shoemaker; SIREV 2013]
Probabilistic models → [Bandeira, Scheinberg, Vicente; SIOPT 2014]

Wild, IMA WMO16 30

Page 65: Zero-Order, Black-Box, Derivative-Free, and Simulation

Greed. Alone. Can. Hurt.

Model-improvement may be needed when:

⋄ Nearby points line up

⋄ May not have enough points to ensure model quality in all directions

→ May need n additional evaluations

Wild, IMA WMO16 31

Page 66: Zero-Order, Black-Box, Derivative-Free, and Simulation

Constraints and Model Quality

Constraints complicate matters. . . if one does not allow evaluation of infeasible points

→ May need directions normal to nearby constraints

Wild, IMA WMO16 32

Page 67: Zero-Order, Black-Box, Derivative-Free, and Simulation

Performance Comparisons on Test Functions

[Plot: data profile ds(κ) vs. number of simplex gradients κ (τ = 10^-3), smooth problems; solvers nmsmax, newuoa (2n+1), appsrand]

Smooth problems

⋄ When evaluations are sequential, model-based methods (NEWUOA) regularly outperform direct search methods without a search phase (nmsmax, appsrand)

→ [Moré & W., SIOPT 2009]

Wild, IMA WMO16 33

Page 68: Zero-Order, Black-Box, Derivative-Free, and Simulation

Performance Comparisons on Test Functions

[Plot: data profile ds(κ) vs. number of simplex gradients κ (τ = 10^-3), noisy problems; solvers nmsmax, newuoa (2n+1), appsrand]

Noisy problems

⋄ When evaluations are sequential, model-based methods (NEWUOA) regularly outperform direct search methods without a search phase (nmsmax, appsrand)

→ [Moré & W., SIOPT 2009]

Wild, IMA WMO16 33


Page 70: Zero-Order, Black-Box, Derivative-Free, and Simulation

Many Practical Details In Implementations

⋄ Choice of interpolation points Yk

⋄ Updating of trust region Bk
⋄ Improvement of models

These choices differ across BOBYQA [Powell], DFO [Scheinberg], POUNDer [W.]:

Initialization: p = |Yk| structured evaluations; based on input, ≈ 2n+1; based on input, no more than n+1
Interpolation set: p = |Yk| for all k; bootstrap to |Yk| = (n+1)(n+2)/2, then fixed; varies in {n+1, · · · , (n+1)(n+2)/2} based on available points
Linear algebra: if p = O(n), model formation costs only O(n²); expensive; expensive

Wild, IMA WMO16 34

Page 71: Zero-Order, Black-Box, Derivative-Free, and Simulation

Growing, Recent Body of Tools and Resources for Local DFO

? What to use on problems with characteristics X, Y, and Z ?

Books: Conn, Scheinberg, Vicente; SIAM 2009 and Implicit Filtering, Kelley; SIAM 2011

Many solvers. Sample considered by Rios & Sahinidis, 2010: ASA, BOBYQA, CMA-ES, DFO, DAKOTA/*, TOMLAB/*, FMINSEARCH, GLOBAL, HOPSPACK, IMFIL, MCS, NEWUOA, NOMAD, PSWARM, SID-PSM, SNOBFIT

Wild, IMA WMO16 35

Page 72: Zero-Order, Black-Box, Derivative-Free, and Simulation

Toward Global Optimization

A quick sketch of multistart methods and some practical details

⋄ useful in derivative-based and derivative-free cases

⋄ obtain a list of distinct minimizers (for post-processing, etc.)

⋄ simple to get started

! simple to abuse/misuse (“I found all minimizers”)

Wild, IMA WMO16 36

Page 73: Zero-Order, Black-Box, Derivative-Free, and Simulation

Why Multistart?

Multiple local minima are often of interest in practice:

Design: Multiple objectives (or even constraints) might later be of interest

Simulation errors: Could have spurious local minima from anomalies in the simulator

Uncertainty: Some minima are more sensitive to perturbations than others (gentle valleys versus steep cliffs)

[Plot: a one-dimensional function with two local minima, Min A and Min B]

Wild, IMA WMO16 37

Page 74: Zero-Order, Black-Box, Derivative-Free, and Simulation

Global Optimization Multistart Methods

Two phase iterative method

Global exploration: Sample N points in D. ← Guarantees convergence

Local refinement: Start a local minimization algorithm A from some promising subset of the sample points.

Want to find many (good) local minima while avoiding repeatedly finding the same local minima.

Wild, IMA WMO16 38


Page 77: Zero-Order, Black-Box, Derivative-Free, and Simulation

Multi Level Single Linkage (MLSL) Clustering Procedure

Where to start A in kth iteration [Rinnooy Kan & Timmer, Math. Programming 1987]

[Figure: sampled points, sampled candidate points, descent paths, and start points (iteration 1 exploration)]

Start A at each sample point xi provided:

⋄ A has not been started from xi, and

⋄ no other sample point xj with f(xj) < f(xi) is within a distance

rk = (1/√π) [ 5 Γ(1 + n/2) vol(D) log(kN) / (kN) ]^(1/n)

Thm [RK-T]- Will start finitely many local runs with probability 1.
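A sketch of the MLSL start-point test in Python (uniform sampling on a box D; the local solver A, the budget, and the test function are illustrative assumptions):

import numpy as np
from math import gamma, log, pi, sqrt

def mlsl_radius(k, N, n, vol_D, sigma=5.0):
    """Critical distance r_k from Rinnooy Kan & Timmer (sigma = 5 as on the slide)."""
    return (1.0 / sqrt(pi)) * (sigma * gamma(1 + n / 2) * vol_D
                               * log(k * N) / (k * N)) ** (1.0 / n)

def mlsl_start_points(samples, fvals, k, N, vol_D, already_started):
    """Indices of sample points from which the local solver A should be started."""
    r = mlsl_radius(k, N, samples.shape[1], vol_D)
    starts = []
    for i in range(len(samples)):
        if i in already_started:
            continue
        better_nearby = any(fvals[j] < fvals[i] and
                            np.linalg.norm(samples[j] - samples[i]) <= r
                            for j in range(len(samples)) if j != i)
        if not better_nearby:
            starts.append(i)
    return starts

# Illustrative use on D = [0, 1]^2 with N = 20 samples in iteration k = 1
rng = np.random.default_rng(0)
X = rng.random((20, 2))
fX = np.array([(x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2 for x in X])
print(mlsl_start_points(X, fX, k=1, N=20, vol_D=1.0, already_started=set()))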

Wild, IMA WMO16 39


Page 80: Zero-Order, Black-Box, Derivative-Free, and Simulation

Performance Comparisons on Test Functions

[Plot: mean of the best value in 30 runs vs. number of function evaluations on the Schoen 14.100 test function (n = 14); solvers GORBIT, Global-SRBF, Local-SRBF, Gutmann-RBF, Greedy-RBF, SA-SLHD]

⋄ GORBIT is multistart with an RBF model-based method

⋄ SA-SLHD is a heuristic (simulated annealing with a symmetric Latin hypercube design as initialization)

→ [W., Cornell University, 2009]

Wild, IMA WMO16 40

Page 81: Zero-Order, Black-Box, Derivative-Free, and Simulation

II. Exploiting Structure in Simulation-Based Optimization


Page 84: Zero-Order, Black-Box, Derivative-Free, and Simulation

Beyond the Black Box

min f(x) = F [S(x)]

So far, f = S

Your problems are not black-box problems

You formulated the problem ⇒ You know more than nothing

Wild, IMA WMO16 42

Page 85: Zero-Order, Black-Box, Derivative-Free, and Simulation

Structure in Simulation-Based Optimization, min f(x) = F [x, S(x)]

f is often not a black box S

NLS Nonlinear least squares: f(x) = Σi (Si(x) − di)²

CNO Composite (nonsmooth) optimization: f(x) = h(S(x))

SKP Not all variables enter the simulation: f(x) = g(xI, xJ) + h(S(xJ))

SCO Only some constraints depend on the simulation: min { f(x) : c1(x) = 0, cS(x) = 0 }
+ Slack variables: ΩS = {(xI, xJ) : S(xJ) + xI = 0, xI ≥ 0}

. . .

Model-based methods offer one way to exploit such structure

Wild, IMA WMO16 43


Page 87: Zero-Order, Black-Box, Derivative-Free, and Simulation

General Setting – Modeling Smooth S1(x), S2(x), . . . , Sp(x)

Assume:

⋄ each Si is continuously differentiable, available

⋄ each ∇Si is Lipschitz continuous, unavailable

mSi : Rn → R approximates Si on B(x, ∆), i = 1, . . . , p

Fully Linear Models

mSi is fully linear on B(x, ∆) if there exist constants κi,ef and κi,eg, independent of x and ∆, such that

|Si(x + s) − mSi(x + s)| ≤ κi,ef ∆²  ∀ s ∈ B(0, ∆)
‖∇Si(x + s) − ∇mSi(x + s)‖ ≤ κi,eg ∆  ∀ s ∈ B(0, ∆)

Wild, IMA WMO16 44

Page 88: Zero-Order, Black-Box, Derivative-Free, and Simulation

NLS– Nonlinear Least Squares, f(x) = (1/2) Σi Ri(x)²

Obtain a vector of outputs R1(x), . . . , Rp(x)

⋄ Model each Ri:

Ri(x) ≈ mk^{Ri}(x) = Ri(xk) + (x − xk)⊤ gk^(i) + (1/2)(x − xk)⊤ Hk^(i) (x − xk)

⋄ Approximate:

∇f(x) = Σi ∇Ri(x) Ri(x)  →  Σi ∇mk^{Ri}(x) Ri(x)

∇²f(x) = Σi ∇Ri(x) ∇Ri(x)⊤ + Σi Ri(x) ∇²Ri(x)  →  Σi ∇mk^{Ri}(x) ∇mk^{Ri}(x)⊤ + Σi Ri(x) ∇²mk^{Ri}(x)

⋄ Model f via Gauss-Newton or similar:

regularized Hessians → DFLS [Zhang, Conn, Scheinberg]

full Newton → POUNDERS [W., Moré]
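A small sketch of how the per-residual models combine into the master model’s gradient and Hessian (Python/NumPy; the residual values R and residual-model gradients/Hessians G, H are assumed to come from an interpolation step like the one above, and the sample numbers are illustrative):

import numpy as np

def nls_master_model(R, G, H):
    """Structured gradient/Hessian of f = 0.5 * sum_i R_i^2 at xk.

    R : (p,)      residual values R_i(xk)
    G : (p, n)    rows are model gradients g_k^(i), approximating grad R_i(xk)
    H : (p, n, n) model Hessians H_k^(i), approximating Hess R_i(xk)
    """
    grad_f = G.T @ R                                  # sum_i g^(i) R_i
    hess_f = G.T @ G + np.tensordot(R, H, axes=1)     # Gauss-Newton + full-Newton term
    return grad_f, hess_f

# Illustrative numbers for p = 3 residuals in n = 2 variables
rng = np.random.default_rng(2)
R, G, H = rng.standard_normal(3), rng.standard_normal((3, 2)), rng.standard_normal((3, 2, 2))
g_f, H_f = nls_master_model(R, G, H)
print(g_f, H_f, sep="\n")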

Wild, IMA WMO16 45

Page 89: Zero-Order, Black-Box, Derivative-Free, and Simulation

NLS– Consequences for f(x) = (1/2) Σi Ri(x)²

Pay a (negligible for expensive S) price in terms of p models

⋄ Save linear algebra using an interpolation set Yk common to all models: single system solve, multiple right-hand sides,

Φ(Yk) [z^(1) · · · z^(p)] = [R1 · · · Rp]

mR1 quality ⇒ quality of all mRi

+ (nearly) exact gradients for (nearly) linear Ri

− No longer interpolate the function at data points:

m(xk + δ) = f(xk) + δ⊤ Σi gk^(i) Ri(xk) + (1/2) δ⊤ [ Σi ( gk^(i)(gk^(i))⊤ + Ri(xk) Hk^(i) ) ] δ + missing h.o. terms

Wild, IMA WMO16 46

Page 90: Zero-Order, Black-Box, Derivative-Free, and Simulation

NLS– POUNDERS in Practice: DFT Calibration/MLE

min_x Σ_{i=1}^p wi (Si(x) − di)²

Si(x): simulated (DFT) nucleus property
di: experimental data i
wi: weight for data type i
p parallel simulations (12 wallclock mins each)

[Plot: least f value vs. number of 12-minute evaluations over Days 1-3 for nelder-mead and pounders]

→ [Kortelainen et al., PhysRevC 2010]

Wild, IMA WMO16 47

Page 91: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Composite Nonsmooth Optimization Examples

Ex.- Groundwater remediation (Lockwood Solvent Ground Water Plume Site, LSGPS)

Determine rates x for extraction/injection wells

⋄ Regulator’s simulator returns flow Si(x) in/out of cell i

⋄ Minimize plume fluxes (e.g., regulatory $ penalties): f(x) = Σi |Si(x)|

Ex.- Particle accelerator design

Minimize particle losses: f(x) = max_{ti ∈ T1} S(x; ti) − min_{ti ∈ T2(x)} S(x; ti)

Wild, IMA WMO16 48

Page 92: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Some Generic Ideas for f(x) = Σ_{i=1}^p |Fi(x)|

Model-based approaches:

pounder: ignore structure, model f as usual

pounders-sqrt: f = Σ_{i=1}^p (√|Fi|)², model √|Fi| by Qi, subproblem min Σ_{i=1}^p Qi(x)²

poundera-abs: f = Σ_{i=1}^p |Fi|, model |Fi| by Qi, subproblem min Σ_{i=1}^p Qi(x)

poundera-nsm: f = Σ_{i=1}^p |Fi|, model Fi by Qi, subproblem min Σ_{i=1}^p |Qi(x)|

Wild, IMA WMO16 49

Page 93: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Results for Generic Ideas, min Σ_{i=1}^p |Fi(x)|

pounder: black-box; pounders-sqrt: Σi Qi(x)²; poundera-abs: Σi Qi(x); poundera-nsm: Σi |Qi(x)|

[Plot: least f value found vs. number of f evaluations on the Mancino problem (n = 5, p = 5) for pounder, pounders-sqrt, poundera-abs, poundera-nsm]

Wild, IMA WMO16 50

Page 94: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Results for Generic Ideas, min Σ_{i=1}^p |Fi(x)|

pounder: black-box; pounders-sqrt: Σi Qi(x)²; poundera-abs: Σi Qi(x); poundera-nsm: Σi |Qi(x)|

[Plot: least f value found vs. number of f evaluations on the Bdqrtic problem (n = 12, p = 16) for pounder, pounders-sqrt, poundera-abs, poundera-nsm]

Wild, IMA WMO16 50


Page 96: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Composite Nonsmooth Optimization, f(x) = h(S(x); x):
a nonsmooth (algebraically available) function h : Rp × Rn → R
of a smooth (black-box) mapping S : Rn → Rp

Basic idea: Knowledge of the vector S(xk) & potential nondifferentiability at S(xk) should enhance (theoretical and practical) progress to a stationary point

Ex.- f1(x) = ‖S(x)‖1 = Σ_{i=1}^p |Si(x)|

∂f1(x) = Σ_{i: Si(x) ≠ 0} sgn(Si(x)) ∇Si(x) + Σ_{i: Si(x) = 0} co{−∇Si(x), ∇Si(x)}

⋄ Dc = {x : ∃ i with Si(x) = 0, ∇Si(x) ≠ 0}

+ Compact ∂f(x)

− Dc depends on ∇Si(x)

Wild, IMA WMO16 51


Page 98: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– The Nuisance Set, N

Relaxation N ⊆ Dc using only zero-order information

f1: N = {x : ∃ i with Si(x) = 0}

f∞: N = {x : f∞(x) = 0 or |argmax_i |Si(x)|| > 1}

Observe

When xk ∉ N,

∂f(xk) = {∇f(xk)} = ∇xS(xk)⊤ ∇S h(S(xk)) ≈ ∇xM(xk)⊤ ∇S h(S(xk))

and the smooth approximation is justified

[Figure: f1(x) = |x3| (pieces x3 and −x3) and f∞(x) = max{|S1(x)|, |S2(x)|} = ‖S(x)‖∞, with kinks where a component vanishes or the max switches]

Wild, IMA WMO16 52


Page 100: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Subdifferential Approximation

⋄ For xk ∈ N, we build a set of generators G(xk) based on ∂S h(S(xk)); co{G(xk)} approximates ∂f(xk)

Ex.- f1(x) = ‖S(x)‖1:

G(xk) = ∇M(xk)⊤ { sgn(S(xk)) + ∪_{i: Si(xk)=0} {−ei, 0, ei} }

Nearby data Y ⊂ B(xk, ∆k) informs the models M = mS and the generator set

⋄ Manifold sampling method uses the manifold(s) of Y:

∇M(xk)⊤ ∪_{yi ∈ Y} mani(S(yi))

⋄ Traditional gradient sampling → [Burke, Lewis, Overton; SIOPT 2005]:

∪_{yi ∈ Y} ∇M(yi)⊤ mani(S(yi))
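A tiny sketch of the generator set G(xk) for the ℓ1 example above, in Python (the model Jacobian ∇M(xk) and the values S(xk) are assumed given; a zero tolerance stands in for the exact test Si(xk) = 0, and one natural reading is taken in which all sign combinations over the near-zero components are enumerated):

import numpy as np
from itertools import product

def l1_generators(JM, S, tol=1e-12):
    """Generators JM^T * (sgn(S) with entries in {-1, 0, +1} where S_i ~ 0)."""
    p, n = JM.shape                                # JM: p x n model Jacobian
    base = np.sign(S) * (np.abs(S) > tol)          # sgn(S_i), zeroed where S_i ~ 0
    zero_idx = [i for i in range(p) if abs(S[i]) <= tol]
    gens = []
    for choice in product([-1.0, 0.0, 1.0], repeat=len(zero_idx)):
        w = base.copy()
        for i, c in zip(zero_idx, choice):
            w[i] = c
        gens.append(JM.T @ w)                      # one generator of co{G(xk)}
    return np.array(gens)

# Illustrative: p = 3 components, one of them (nearly) zero at xk
JM = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
print(l1_generators(JM, np.array([0.5, 0.0, -1.2])))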

Wild, IMA WMO16 53


Page 102: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Smooth Trust-Region Subproblem

Smooth master model from the minimum-norm element:

mf(xk + s) = f(xk) + ⟨ s, proj(0, co{G(xk)}) ⟩ + · · ·

⇒ smooth subproblems min { mf(xk + s) : s ∈ B(0, ∆k) } vs. nonsmooth subproblems min { h(M(xk + s)) : s ∈ B(0, ∆k) }

Smooth subproblems:
⋄ Convex h (e.g., ‖S(x)‖1) and Lipschitz ∇Si ⇒ every cluster point of {xk}k is Clarke stationary → [Larson, Menickelly, W.; Preprint 2016]
⋄ OK to sample at xk ∈ DC
⋄ More general (piecewise differentiable f) results → [Larson, Khan, W.; in prog. 2016]

Nonsmooth subproblems:
⋄ Require convex h → [Fletcher; MathProgStudy 1982], [Grapiglia, Yuan, Yuan; C&A Math. 2016]
⋄ Complexity results → [Garmanjani, Judice, Vicente; SIOPT 2016]

[Photo: Roger Fletcher]

Wild, IMA WMO16 54

Page 103: Zero-Order, Black-Box, Derivative-Free, and Simulation

CNO– Example Performance on ℓ1 Test Problems

[Plots: best f value found and best stationarity measure Ψ value found vs. number of function evaluations for several manifold-sampling and DFO trust-region variants]

Smooth black-box methods can fail in practice, even when DC has measure zero

Numerical tests: → [Larson, Menickelly, W.; Preprint 2016]

Wild, IMA WMO16 55

Page 104: Zero-Order, Black-Box, Derivative-Free, and Simulation

SKP– Some Known Partials Example

Ex.- Bi-level model calibration structure

min_x f(x) = Σ_{i=1}^p (Si(x) − di)²

Si(x) is the solution to a lower-level problem depending only on xJ:

Si(x) = gi(x) + min_y { hi(xJ; y) : y ∈ Di } = gi(x) + hi(xJ; y^{i,*}[xJ])

For x = (xI, xJ):

⋄ ∇xI Si(xI, xJ) available

⋄ ∇xJ Si(x) ≈ ∇xJ gi(x) + ∇xJ mSi(xJ)

⋄ Si(x) continuous and smooth in xI

⋄ gi(x) cheap to compute!

⋄ No noise/errors introduced in gi(x)

General bi-level → [Conn & Vicente, OMS 2012]

Wild, IMA WMO16 56

Page 105: Zero-Order, Black-Box, Derivative-Free, and Simulation

SKP– Some Known Partials

x = (xI, xJ); have ∂f/∂xI but not ∂f/∂xJ

“Solve” Φz = f with known zg,I, zH,I:

[Φc  Φg,J  ΦH,J] [zc; zg,J; zH,J] = f − Φg,I zg,I − ΦH,I zH,I

⋄ Still have interpolation where required

⋄ Effectively lowers the dimension to |J| = n − |I| for approximation, model-improving evaluations, and linear algebra

⋄ lim_{k→∞} ∇f(xk) = 0 as before: guaranteed descent in some directions

Wild, IMA WMO16 57

Page 106: Zero-Order, Black-Box, Derivative-Free, and Simulation

SKP– Numerical Results With Some Partials

[Plot: best value vs. number of evaluations for pounder, pounders, and poundersm]

Three approaches:

- (pounder) black box

s (pounders) exploit least squares

m (poundersm) use ∇xI derivatives

⋄ n = 16, |I| = 3

⋄ 5-10 secs/evaluation

Same algorithmic framework; performance advantages from exploiting structure

→[Bertolli, Papenbrock, W., PRC 2012]

Wild, IMA WMO16 58

Page 107: Zero-Order, Black-Box, Derivative-Free, and Simulation

SCO- General Constraints

min { f(x) : c1(x) = 0, cS(x) = 0 }

⋄ Lagrangian (key to optimality conditions):

∇L = ∇f + λ1⊤ ∇c1 + λ2⊤ ∇cS  →  ∇f + λ1⊤ ∇c1 + λ2⊤ ∇m

⋄ Use your favorite method: filters, augmented Lagrangians, . . .

⋄ Slack variables: do not increase the effective dimension; subproblems can treat them separately; their derivatives are known
→ [Lewis & Torczon; 2010]

Modified AL methods → [Diniz-Ehrhardt, Martínez, Pedroso; C&A Math. 2011]

SBO constraints have unique properties → [Le Digabel & W.; ANL/MCS-P5350-0515 2016]
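A rough sketch of a derivative-free augmented-Lagrangian outer loop matching the LA on the next slide (Python; the inner Nelder-Mead solve, penalty/multiplier updates, and test problem are illustrative assumptions, not the methods cited above):

import numpy as np
from scipy.optimize import minimize

def dfo_augmented_lagrangian(f, c, x0, n_outer=10, mu=1.0):
    """Minimize f s.t. c(x) = 0 via L_A(x, lam; mu) = f - lam.c + (1/mu)*||c||^2."""
    x, lam = np.asarray(x0, float), np.zeros(len(c(x0)))
    for _ in range(n_outer):
        L_A = lambda x: f(x) - lam @ c(x) + (1.0 / mu) * np.sum(c(x) ** 2)
        x = minimize(L_A, x, method="Nelder-Mead").x   # derivative-free inner solve
        lam = lam - (2.0 / mu) * c(x)                  # first-order multiplier update
        mu *= 0.5                                      # tighten the penalty
    return x, lam

# Illustrative: min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0   (solution near (-1, -1))
x_opt, lam_opt = dfo_augmented_lagrangian(lambda x: x[0] + x[1],
                                          lambda x: np.array([x[0]**2 + x[1]**2 - 2.0]),
                                          np.array([0.5, 0.5]))
print(x_opt, lam_opt)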

Wild, IMA WMO16 59

Page 108: Zero-Order, Black-Box, Derivative-Free, and Simulation

SCO– What Constraint Derivatives Buy You

Ex.- Augmented Lagrangian methods, LA(x, λ; µ) = f(x) − λ⊤c(x) + (1/µ)‖c(x)‖², for min_x { f(x) : c(x) = 0 }

Four approaches:

1. Penalize constraints

2. Treat c and f both as (separate) black boxes

3. Work with f and ∇x c

4. Have both ∇x f and ∇x c

[Plot: best merit function value vs. number of evaluations for All Ders., Constraint Ders., No Ders., No Structure; n = 15 variables, 11 constraints, no explicit internal variables]

Wild, IMA WMO16 60

Page 109: Zero-Order, Black-Box, Derivative-Free, and Simulation

So You Want To Solve A Hard Optimization Problem?

Mathematically unwrap problems to expose the deepest black boxes!

⋄ It is easy to get started with derivative-free methods

⋄ You should strive to obtain derivatives & apply methods from every other lecture

⋄ Model-based methods can make use of expensive function values

⋄ Structure is everywhere, even in “black-box”/legacy-code-driven optimization problems

⋄ By exploiting structure, optimization can solve grand-challenge problems in 〈insert your field here〉:
Model residuals {ri(x)}_i, not ‖r(x)‖
Model constraints {ci(x)}_i, not a penalty P(c(x))
Explicitly handle nonsmoothness (and noise)

Send me your structured SBO problems! → www.mcs.anl.gov/~wild

Wild, IMA WMO16 61