Zero-Order, Black-Box, Derivative-Free, and Simulation-Based Optimization
Stefan Wild
Argonne National Laboratory, Mathematics and Computer Science Division
August 5, 2016
The Plan
Motivation
Black-Box Optimization
  Direct Search Methods
  Model-Based Methods
  Some Global Optimization
Simulation-Based Optimization and Structure
  NLS = Nonlinear Least Squares
  CNO = Composite Nonsmooth Optimization
  SKP = Some Known Partials
  SCO = Simulation-Constrained Optimization
Wild, IMA WMO16 1
Simulation-Based Optimization
min_{x ∈ R^n} { f(x) = F[x, S(x)] : c_I[x, S(x)] ≤ 0, c_E[x, S(x)] = 0 }
⋄ S (numerical) simulation output, often “noisy” (even when deterministic)
⋄ Derivatives ∇_x S often unavailable or prohibitively expensive to obtain/approximate directly
⋄ S can contribute to objective and/or constraints
⋄ Single evaluation of S could take seconds/minutes/hours/days: evaluation is a bottleneck for optimization
Functions of complex numerical simulations arise everywhere
Wild, IMA WMO16 2
Computing is Responsible for Pervasiveness of Simulations in Sci&Eng
Argonne’s AVIDAC
(1953: vacuum tubes)
Argonne’s Blue Gene/Q (2012: 786,432 cores), currently 6th fastest in the world
Sunway TaihuLight (2016: 11M cores), currently fastest in the world
⋄ Parallel/multi-core environments increasingly common
  Small clusters/multi-core desktops/multi-core laptops pervasive
  Leadership-class machines increasingly parallel
⋄ Simulations (the “forward problem”) become faster/more realistic/more complex
Wild, IMA WMO16 3
Improvements from Algorithms Can Trump Those From Hardware
Martin Grotschel’s production planning benchmark problem (a MIP):
1988 solve time using current computers and LP algorithms: 82 years
2003 solve time using current computers and LP algorithms: 1 minute
⋄ Speedup of 43,000,000X: 10^3X from processor improvements, 10^4X additional from algorithmic improvements
Wild, IMA WMO16 4
Improvements from Algorithms Can Trump Those From Hardware
1991 (v1.2) to 2007 (v11.0): Moore's Law transistor speedup ≈ 256X
[Slide from Bixby (CPLEX/GUROBI)]: solves 1,852 MIPs
Wild, IMA WMO16 4
Derivative-Free/Zero-Order Optimization
“Some derivatives are unavailable for optimization purposes”
The Challenge: Optimization is tightly coupled with derivatives
Typical optimality conditions (no noise, smooth functions):
  ∇_x f(x*) + λ^⊤ ∇_x c_E(x*) = 0,  c_E(x*) = 0
(Sub)gradients ∇_x f, ∇_x c enable:
⋄ Faster feasibility
⋄ Faster convergence: guaranteed descent, approximation of nonlinearities
⋄ Better termination: measure of criticality, ‖∇_x f‖ or ‖P_Ω(∇_x f)‖
⋄ Sensitivity analysis: correlations, standard errors, UQ, . . .
Wild, IMA WMO16 5
Ways to Get Derivatives (assuming they exist)
Handcoding (HC)
“Army of students/programmers”
? Prone to errors/conditioning
? Intractable as number of ops increases
Algorithmic/Automatic Differentiation (AD)
“Exact∗ derivatives!”
? No black boxes allowed
? Not always automatic/cheap/well-conditioned
Finite Differences (FD)
“Nonintrusive”
? Expense grows with n
? Sensitive to stepsize choice/noise
→[More & W.; SISC 2011], [More & W.; TOMS 2012]
. . . then apply derivative-based method (that handles inexact derivatives)
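As an illustration of the finite-difference route (not part of the original slides), here is a minimal forward-difference gradient sketch in Python; the test function and the step size h are placeholder choices, and in practice h should be tuned to the noise level of f (cf. [More & W.; SISC 2011]).

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference estimate of the gradient of f at x.

    Costs n extra evaluations of f; accuracy is sensitive to the choice
    of h, especially when f is noisy.
    """
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    g = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h
        g[i] = (f(xp) - f0) / h
    return g

# Illustrative smooth quadratic
f = lambda x: 0.5 * np.dot(x, x)
print(fd_gradient(f, np.array([1.0, -2.0, 3.0])))  # ~ [1, -2, 3]
```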
Wild, IMA WMO16 6
Algorithmic Differentiation
→ [Coleman & Xu; SIAM 2016], [Griewank & Walther; SIAM 2008]
Computational Graph
⋄ y = sin(a ∗ b) ∗ c
⋄ Forward and reverse modes
⋄ AD tool provides code for your derivatives
Write codes and formulate problems with AD in mind!
Many tools (see www.autodiff.org):
F OpenAD
F/C Tapenade, Rapsodia
C/C++ ADOL-C, ADIC
Matlab ADiMat, INTLAB
Python/R ADOL-C
Also done in AMPL, GAMS, JULIA!
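To make the forward-mode idea concrete, a toy dual-number sketch (illustrative only; the tools listed above are far more capable) differentiating the example y = sin(a*b)*c with respect to a:

```python
import math

class Dual:
    """Minimal dual number (value, derivative) for forward-mode AD."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule carried along with the value
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# y = sin(a*b)*c; seed da/da = 1 to propagate dy/da
a, b, c = Dual(1.0, 1.0), Dual(2.0), Dual(3.0)
y = sin(a * b) * c
print(y.val, y.dot)   # y.dot equals b*c*cos(a*b)
```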
Wild, IMA WMO16 7
The Price of Algorithm Choice: Solvers in PETSc/TAO
[Figure: chwirut1 (n = 6): best f value found vs. number of evaluations for lmvm, pounders, and nm]
Toolkit for Advanced Optimization
[Munson et al.; mcs.anl.gov/tao]
Increasing level of user input:
nm Assumes ∇_x f unavailable, black box
pounders Assumes ∇_x f unavailable, exploits problem structure ← THIS TALK
lmvm Uses available ∇_x f
DFO methods should be designed to beat finite-difference-based methods
Observe: constrained by a budget on #evals, the choice of method limits attainable solution accuracy/problem size
Wild, IMA WMO16 8
Global Optimization, min_{x∈Ω} f(x)
Careful:
⋄ Global convergence: convergence (to a local solution/stationary point) from anywhere in Ω
⋄ Convergence to a global minimizer: obtain x* with f(x*) ≤ f(x) ∀ x ∈ Ω
Anyone selling you global solutions when derivatives are unavailable:
  either assumes more about your problem (e.g., convex f),
  or expects you to wait forever
    (Torn and Zilinskas: an algorithm converges to the global minimum for any continuous f if and only if the sequence of points visited by the algorithm is dense in Ω),
  or cannot be trusted.
Instead:
⋄ Rapidly find good local solutions and/or be robust to poor solutions
⋄ Consider multistart approaches and/or structure of multimodality
Wild, IMA WMO16 9
(One Reason) Why We Won’t Be Talking About Heuristics
[Figure: fit value (log scale) vs. number of iterations for Serial PSO, Serial Simplex, Serial POUNDERS, and 1024-core PSO]
⋄ Heuristics often “embarrassingly/naturally parallel” (PSO = particle swarm method)
  Typically through stochastic sampling/evolution
  1024 function evaluations per iteration
⋄ Simplex is Nelder-Mead; POUNDERS is a model-based trust-region algorithm
  One function evaluation per iteration
→ Is this an effective use of resources?
→ How many cores would have sufficed?
Wild, IMA WMO16 10
I. Black-box Optimization
Black-box Optimization Problems
Only knowledge about f is obtained by sampling
⋄ f = S a black box (running some executable-only code or performing an experiment in the lab)
⋄ Only gives a single output (no derivatives ∇_x S(x))
Good solutions guaranteed in the limit, but:
⋄ Usually have a computational budget (due to scheduling, finances, deadlines)
⋄ Limited number of evaluations
Wild, IMA WMO16 12
A Black Box: Automating Empirical Performance Tuning
Given semantically equivalent codes x_1, x_2, . . ., minimize run time subject to energy consumption
min f(x) : (x_C, x_I, x_B) ∈ Ω_C × Ω_I × Ω_B
x multidimensional parameterization (compiler type, compiler flags, unroll/tiling factors, internal tolerances, . . . )
Ω search domain (feasible transformation, no errors)
f quantifiable performance objective (requires a run)
→ [Audet & Orban; SIOPT 2006], [Balaprakash, W., Hovland; ICCS 2011], [Porcelli & Toint; 2016]
Numerical Linear Algebra → [N. Higham; SIMAX 1993], . . .
Wild, IMA WMO16 13
Optimization for Automatic Tuning of HPC Codes
Evaluation of f requires: transforming source, compilation, (repeated?) execution, checking for correctness
[Figure: CPU time [ms] vs. unroll factors i and j]
Challenges:
- Evaluating f(Ω) prohibitively expensive (e.g., 10^19 discrete decisions)
- f noisy
- Discrete x unrelaxable
- ∇xf unavailable/nonexistent
- Many distinct/local solutions
Wild, IMA WMO16 14
Black-box Algorithms
Solve general problems min{f(x) : x ∈ R^n}:
⋄ Only require function values (no ∇f(x))
⋄ Don’t rely on finite-difference approximations to ∇f(x)
⋄ Seek greedy and rapid decrease of function value
⋄ Have asymptotic convergence guarantees
⋄ Assume parallel resources are used within function evaluation
Main styles of DFO algorithms
⋄ Randomized methods (later?)
⋄ Direct search methods (pattern search, Nelder-Mead, . . . )
⋄ Model-based methods (quadratics, radial basis functions, . . . )
Wild, IMA WMO16 15
Black-Box Algorithms: Stochastic Methods
Random search
Repeat:
1. Randomly generate direction d_k ∈ R^n
2. Evaluate “gradient-free oracle” g(x_k; h_k) = [(f(x_k + h_k d_k) − f(x_k)) / h_k] d_k (≈ directional derivative)
3. Compute x_{k+1} = x_k − δ_k g(x_k; h_k), evaluate f(x_{k+1})
Convergence (for different types of f) tends to be probabilistic
[Kiefer & Wolfowitz; AnnMS 1952], [Polyak; 1987], [Ghadimi & Lan; SIOPT 2013], [Nesterov & Spokoiny; FoCM 2015], . . .
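A minimal sketch of the random-search iteration above, with hand-picked constant h and δ sequences and a smooth illustrative objective; practical implementations choose these schedules and the distribution of d_k much more carefully.

```python
import numpy as np

def random_search(f, x0, iters=500, h=1e-3, delta=1e-2, seed=0):
    """Gradient-free random search using a directional-difference oracle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for k in range(iters):
        d = rng.standard_normal(x.size)          # random direction d_k
        g = (f(x + h * d) - fx) / h * d          # oracle g(x_k; h_k)
        x = x - delta * g                        # step to x_{k+1}
        fx = f(x)
    return x, fx

f = lambda x: np.sum((x - 1.0) ** 2)             # illustrative smooth f
print(random_search(f, np.zeros(5)))
```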
Stochastic heuristics (nature-inspired methods, etc.)
⋄ Popular in practice, especially in engineering
⋄ Typically global in nature
⋄ Require many f evaluations
Wild, IMA WMO16 16
Pattern Search And Its Variants
Choose a set of directions (pattern or mesh) D_k
  Ex.- ± coordinate directions (2n directions)
  Ex.- any minimal positive spanning set ([e_1, · · · , e_n, −Σ_i e_i])
[Figure: pattern-search poll points on contours of f]
Basic iteration (k ≥ 0):
⋄ Evaluate f(x_k + Δ_k d_j), j = 1, . . . , |D_k|
⋄ If f(x_k + Δ_k d_j) < f(x_k), move to x_{k+1} = x_k + Δ_k d_j; otherwise shrink Δ_k
⋄ Update D_k
This is an indicator function: it does not say anything about the magnitude of f values, just the ordering
→ Lends itself well to doing concurrent function evaluations
→ See also mesh-adaptive direct search methods
→ Can establish convergence for nonsmooth f
Survey → [Kolda, Lewis, Torczon; SIREV 2003]
Tools → DFL [Liuzzi et al.], NOMAD [Audet et al.], . . .
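A minimal sketch of the poll step described above, using the ± coordinate directions and a simple halving rule for Δ_k; the objective and constants are illustrative, and production codes (e.g., NOMAD, DFL) add search steps, mesh management, and proper stopping tests.

```python
import numpy as np

def coordinate_search(f, x0, delta=1.0, tol=1e-6, max_evals=2000):
    """Basic pattern search over the 2n coordinate directions."""
    x = np.asarray(x0, dtype=float)
    fx, nevals = f(x), 1
    D = np.vstack([np.eye(x.size), -np.eye(x.size)])   # pattern D_k
    while delta > tol and nevals < max_evals:
        improved = False
        for d in D:
            trial = x + delta * d
            ft, nevals = f(trial), nevals + 1
            if ft < fx:                    # simple (ordering-only) decrease
                x, fx, improved = trial, ft, True
                break
        if not improved:
            delta *= 0.5                   # shrink the step/mesh size
    return x, fx

f = lambda x: (x[0] - 2.0) ** 2 + 10 * (x[1] + 1.0) ** 2
print(coordinate_search(f, np.array([0.0, 0.0])))
```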
Wild, IMA WMO16 17
The Nelder-Mead Method [1965]
Basic iteration (k ≥ 0):
⋄ Evaluate f on the n + 1 vertices of the simplex x_k + Δ_k S^(k)
⋄ Reflect worst vertex about the best face
⋄ Shrink, contract, or expand Δ_k S^(k)
[Figure: simplex iterates in the (θ1, θ2) plane]
Only the order of the function values matters: f(x) = 1, f(x) = 1.0001 is the same as f(x) = 1, f(x) = 10000.
→ A very popular (in Numerical Recipes), robust first choice . . . with nontrivial convergence
Newer NM → [Lagarias, Poonen, Wright; SIOPT 2012]
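In practice one rarely codes Nelder-Mead from scratch; SciPy, for example, exposes it directly (sketch below, with an illustrative objective). Only function values, and only their ordering, drive the reflect/expand/contract/shrink logic.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative smooth objective; only f values (no derivatives) are used.
def f(x):
    return (x[0] - 1.0) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

res = minimize(f, x0=np.array([-1.2, 1.0]), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxfev": 2000})
print(res.x, res.fun, res.nfev)
```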
Wild, IMA WMO16 18
What Are We Missing?
These methods will (eventually) find a local solution
Overview: → [Kolda, Lewis, Torczon, SIREV 2003]
Each evaluation of f is expensive (valuable)
N-M:
1. Only remembers the last n + 1 evaluations
2. Neglects the magnitudes of the function values (order only)
3. Doesn't take into account the special (LS) problem structure
[Figure: Nelder-Mead simplex iterates in the (θ1, θ2) plane]
→ This is the reason many direct search methods use a search phase on top of the usual poll phase
Wild, IMA WMO16 19
Making the Most of Little Information About Smooth f
⋄ f is expensive ⇒ can afford to make better use of points
⋄ Overhead of the optimization routine is minimal (negligible?) relative to cost of evaluating simulation
[Figure: previously evaluated points (x_i, f(x_i)) over the (x, y) domain]
Bank of data, {x_i, f(x_i)}_{i=1}^k:
  = Points (& function values) evaluated so far
  = Everything known about f
Idea:
⋄ Make use of growing Bank as optimization progresses
⋄ Limit unnecessary evaluations (geometry/approximation)
Wild, IMA WMO16 20
Trust-Region Methods Use Models Instead of f
To reduce the number of expensive f evaluations
→ Replace difficult optimization problem min f(x) with a much simpler one min{m(x) : x ∈ B}
Classic NLP technique:
f Original function: computationally expensive, no derivatives
m Surrogate model: computationally attractive, analytic derivatives
Wild, IMA WMO16 21
Basic Trust-Region Idea
Use a model m(x) in place of the unwieldy f(x)
[Figure: trust region B_k and model steps on contours of f]
Optimize over m to avoid expense of f:
⋄ Trust m to approximate f within B_k = {x ∈ R^n : ‖x − x_k‖ ≤ Δ_k}
⋄ Obtain next point from min{m(x_k + s) : x_k + s ∈ B_k}
⋄ Evaluate function and update (x_k, Δ_k) based on how good the model's prediction was:
  ρ_k = [f(x_k) − f(x_k + s_k)] / [m(x_k) − m(x_k + s_k)]
[Conn, Gould, Toint; SIAM, 2000]
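A minimal sketch of the accept/reject and radius-update logic above, assuming a user-supplied build_model callback that returns (g, H) for a quadratic model and using a crude Cauchy-like step for the subproblem; a real implementation solves min{m : s ∈ B_k} properly and, in the derivative-free case, manages interpolation geometry.

```python
import numpy as np

def trust_region(f, x0, build_model, delta=1.0, max_iter=100,
                 eta=0.1, delta_max=10.0):
    """Basic trust-region loop: model, step, ratio test, radius update."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for k in range(max_iter):
        g, H = build_model(x)                  # model m(x+s) = fx + g's + .5 s'Hs
        if np.linalg.norm(g) < 1e-8:
            break
        s = -delta * g / np.linalg.norm(g)     # crude Cauchy-like step in B_k
        pred = -(g @ s + 0.5 * s @ H @ s)      # m(x) - m(x+s)
        fnew = f(x + s)
        rho = (fx - fnew) / pred if pred > 0 else -1.0
        if rho >= eta:                         # accept the step
            x, fx = x + s, fnew
        delta = min(2 * delta, delta_max) if rho > 0.75 else \
                (0.5 * delta if rho < 0.25 else delta)
    return x, fx

# Illustrative: derivative-based model of a quadratic, standing in for an
# interpolation model in the derivative-free setting.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
build_model = lambda x: (A @ x, A)
print(trust_region(f, np.array([3.0, -2.0]), build_model))
```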
Wild, IMA WMO16 22
Where Does the Model Come From?
When derivatives are available:
Taylor-based model m(x_k + s) = c + (g^k)^⊤ s + ½ s^⊤ H^k s
⋄ g^k = ∇_x f(x_k)
⋄ H^k ≈ ∇²_{xx} f(x_k)
Without derivatives
⋄ Interpolation-based models
⋄ Regression-based models
⋄ Stochastic/randomized models
Wild, IMA WMO16 23
Interpolation-Based Quadratic Models
An interpolating quadratic in R²
m(x_k + s) = c + g^⊤ s + ½ s^⊤ H s:
Get the model parameters c, g, H = H^⊤ by demanding interpolation:
  m(x_k + y^i) = f(x_k + y^i) for all y^i ∈ Y = interpolation set
Main difficulty is Y:
⋄ Use prior function evaluations,
⋄ m well-defined and approximates f locally.
Wild, IMA WMO16 24
Interpolation-Based Trust-Region Methods
Iteration k:
⋄ Build a model m_k interpolating f on Y
⋄ Trust m_k within region B_k
⋄ Minimize m_k within B_k to obtain next point for evaluation
⋄ Do expensive evaluation
⋄ Update m_k and B_k based on how good model prediction was
Wild, IMA WMO16 25
Quick Diversion: Polynomial Bases
⋄ Let φ denote a basis for some space of polynomials of n variables
  Linear: φ(x) = [1, x_1, · · · , x_n]
  Full quadratics: φ(x) = [1, x_1, · · · , x_n, x_1², · · · , x_n², x_1 x_2, · · · , x_{n−1} x_n]
⋄ Given a collection of p = |Y| points Y = {y^1, · · · , y^p}, Φ(Y) has rows
  [1, y^i_1, · · · , y^i_n, (y^i_1)², · · · , (y^i_n)², y^i_1 y^i_2, · · · , y^i_{n−1} y^i_n],  i = 1, . . . , p
This is a matrix of size p × (n+1)(n+2)/2
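A small sketch (function name and shapes are mine, not from the slides) assembling Φ(Y) for the full quadratic basis above; the column count (n+1)(n+2)/2 is what makes the quadratic case expensive as n grows.

```python
import numpy as np

def quad_basis_matrix(Y):
    """Rows phi(y^i) = [1, y, squares, cross terms] for each point y^i in Y."""
    Y = np.asarray(Y, dtype=float)
    p, n = Y.shape
    cols = [np.ones((p, 1)), Y, Y ** 2]
    cols += [Y[:, [i]] * Y[:, [j]] for i in range(n) for j in range(i + 1, n)]
    Phi = np.hstack(cols)
    assert Phi.shape == (p, (n + 1) * (n + 2) // 2)
    return Phi

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])  # p = 6 = (n+1)(n+2)/2 for n = 2
print(quad_basis_matrix(Y).shape)                    # (6, 6)
```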
Wild, IMA WMO16 26
Building Models Without Derivatives
Given data (Y_k, f(Y_k)) and basis Φ, "solve"
  Φ(Y_k) z = [Φ_c Φ_g Φ_H] [z_c; z_g; z_H] = f(Y_k)
[Figure: interpolation model with n = 2, |Y_k| = 4]
Full quadratics, |Y_k| = (n+1)(n+2)/2
⋄ Geometric conditions on points in Y_k
Underdetermined interpolation, |Y_k| < (n+1)(n+2)/2
⋄ Use (Powell) Hessian updates: min_{g_k, H_k} ‖H_k − H_{k−1}‖²_F s.t. q_k = f on Y_k
Regression, |Y_k| > (n+1)(n+2)/2
⋄ Solve min_z ‖Φ z − f‖
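For the regression case, the coefficient fit min_z ‖Φz − f‖ can be sketched as below (illustrative stand-in; real codes reuse factorizations and monitor the conditioning of Φ):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2, 10                       # p > (n+1)(n+2)/2 = 6: regression regime
Y = rng.standard_normal((p, n))
f = lambda x: 1.0 + x @ np.array([1.0, -2.0]) + 3.0 * x[0] * x[1]
fvals = np.array([f(y) for y in Y])

# Quadratic basis phi(y) = [1, y_1, y_2, y_1^2, y_2^2, y_1 y_2]
Phi = np.column_stack([np.ones(p), Y[:, 0], Y[:, 1],
                       Y[:, 0] ** 2, Y[:, 1] ** 2, Y[:, 0] * Y[:, 1]])

# min_z ||Phi z - f||: least-squares (regression) fit of model coefficients
z, *_ = np.linalg.lstsq(Phi, fvals, rcond=None)
print(np.round(z, 3))              # recovers [1, 1, -2, 0, 0, 3]
```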
Wild, IMA WMO16 27
Multivariate (Scattered Data) Interpolation is a Different Kind of Animal
m(x_k + y^i) = f(x_k + y^i) ∀ y^i ∈ Y
n = 1: Given p distinct points, can find a unique degree p − 1 polynomial m
n > 1: Not true! (see Mairhuber-Curtis Theorem)
For quadratic models in R^n:
⋄ (n+1)(n+2)/2 coefficients
⋄ Unique interpolant may not exist, even when |Y| = (n+1)(n+2)/2
⋄ Locations of the points in Y must satisfy additional geometric conditions (has nothing to do with f values)
[Figure: contour plot over (χ1, χ2)]
→ [Wendland; Cambridge University Press, 2010]
Wild, IMA WMO16 28
Notions of Nonlinear Model Quality
“Taylor-like” Error Bounds
1. Assuming underlying f is sufficiently smooth (= derivatives of f exist but are unavailable)
2. A model m_k is locally fully linear if:
For all x ∈ B_k = {x ∈ Ω : ‖x − x_k‖ ≤ Δ_k}
  |m_k(x) − f(x)| ≤ κ_1 Δ_k²
  ‖∇m_k(x) − ∇f(x)‖ ≤ κ_2 Δ_k
for constants κi independent of x and ∆k.
→[ Conn, Scheinberg, Vicente; SIAM 2009]
Wild, IMA WMO16 29
Notions of Nonlinear Model Quality
“Taylor-like” Error Bounds
1. Assuming underlying f is sufficiently smooth
2. A model m_k is locally fully quadratic if:
For all x ∈ B_k = {x ∈ Ω : ‖x − x_k‖ ≤ Δ_k}
  |m_k(x) − f(x)| ≤ κ_1 Δ_k³
  ‖∇m_k(x) − ∇f(x)‖ ≤ κ_2 Δ_k²
  ‖∇²m_k(x) − ∇²f(x)‖ ≤ κ_3 Δ_k
for constants κi independent of x and ∆k.
→[ Conn, Scheinberg, Vicente; SIAM 2009]
Wild, IMA WMO16 29
Ingredients for Convergence to Stationary Points
limk→∞ ∇f(xk) = 0 provided:
0. f is sufficiently smooth and regular (e.g., bounded level sets)
1. Control Bk based on model quality
2. (Occasional) approximation within B_k; our quadratics satisfy
  |q_k(x) − f(x)| ≤ κ_1 (γ_f + ‖H_k‖) Δ_k², x ∈ B_k
  ‖g_k + H_k(x − x_k) − ∇f(x)‖ ≤ κ_2 (γ_f + ‖H_k‖) Δ_k, x ∈ B_k
3. Sufficient decrease
Michael J.D. Powell, 1936-2015
Survey →[Conn, Scheinberg, Vicente; SIAM 2009]
Methods →[Powell: COBYLA, UOBYQA, NEWUOA, BOBYQA, LINCOA],
. . .
Line search methods also work →[Kelley et al; IFFCO]
RBF models also work →[W. & Shoemaker; SIREV 2013]
Probabilistic models→[Bandeira, Scheinberg, Vicente; SIOPT 2014]
Wild, IMA WMO16 30
Greed. Alone. Can. Hurt.
Model-improvement may be needed when:
⋄ Nearby points line up
⋄ May not have enough points to ensure model quality in all directions
→ May need n additional evaluations
Wild, IMA WMO16 31
Constraints and Model Quality
Constraints complicate matters. . . if one does not allow evaluation of infeasible points
→ May need directions normal to nearby constraints
Wild, IMA WMO16 32
Performance Comparisons on Test Functions
[Figure: data profile d_s(κ), τ = 10^-3, vs. number of simplex gradients κ on smooth problems: nmsmax, newuoa (2n+1), appsrand]
Smooth problems
⋄ When evaluations are sequential, model-based methods (NEWUOA) regularly outperform direct search methods without a search phase (nmsmax, appsrand)
→ [More & W., SIOPT 2009]
Wild, IMA WMO16 33
Performance Comparisons on Test Functions
[Figure: data profile d_s(κ), τ = 10^-3, vs. number of simplex gradients κ on noisy problems: nmsmax, newuoa (2n+1), appsrand]
Noisy problems
⋄ When evaluations are sequential, model-based methods (NEWUOA) regularly outperform direct search methods without a search phase (nmsmax, appsrand)
→ [More & W., SIOPT 2009]
Wild, IMA WMO16 33
Many Practical Details In Implementations
⋄ Choice of interpolation points Y_k
⋄ Updating of trust region B_k
⋄ Improvement of models
Comparison of BOBYQA [Powell], DFO [Scheinberg], POUNDER [W.]:
  Initialization: BOBYQA: p = |Y_k| structured evaluations; DFO: based on input, ≈ 2n+1; POUNDER: based on input, no more than n+1
  Interpolation set: BOBYQA: p = |Y_k| for all k; DFO: bootstrap to |Y_k| = (n+1)(n+2)/2, then fixed; POUNDER: varies in {n+1, · · · , (n+1)(n+2)/2} based on available points
  Linear algebra: BOBYQA: if p = O(n), model formation costs only O(n²); DFO: expensive; POUNDER: expensive
Wild, IMA WMO16 34
Growing, Recent Body of Tools and Resources for Local DFO
? What to use on problems with characteristics X, Y, and Z ?
Books: [Conn, Scheinberg, Vicente; SIAM 2009], [Kelley, Implicit Filtering; SIAM 2011]
[Figure: example response surface over (Permanent Rights, N_R; Options, N_O)]
Many solvers; a sample considered by Rios & Sahinidis, 2010: ASA, BOBYQA, CMA-ES, DFO, DAKOTA/*, TOMLAB/*, FMINSEARCH, GLOBAL, HOPSPACK, IMFIL, MCS, NEWUOA, NOMAD, PSWARM, SID-PSM, SNOBFIT
Wild, IMA WMO16 35
Toward Global Optimization
A quick sketch of multistart methods and some practical details
⋄ useful in derivative-based and derivative-free cases
⋄ obtain a list of distinct minimizers (for post-processing, etc.)
⋄ simple to get started
! simple to abuse/misuse (“I found all minimizers”)
Wild, IMA WMO16 36
Why Multistart?
Multiple local minima are often of interest in practice:
Design: Multiple objectives (or even constraints) might later be of interest
Simulation Errors: Could have spurious local minima from anomalies in the simulator
Uncertainty: Some minima are more sensitive to perturbations than others (gentle valleys versus steep cliffs)
[Figure: one-dimensional objective with two local minima, Min A and Min B]
Wild, IMA WMO16 37
Global Optimization Multistart Methods
Two phase iterative method
Global Exploration: Sample N points in D. ← Guarantees convergence
Local Refinement: Start a local minimization algorithm A from some promising subset of the sample points.
Want to find many (good) local minima while avoiding repeatedly finding the same local minima.
Wild, IMA WMO16 38
Multi Level Single Linkage (MLSL) Clustering Procedure
Where to start A in kth iteration [Rinnooy Kan & Timmer, Math. Programming 1987]
[Figure: iteration 2 exploration: sampled points, sampled candidate points, descent paths, start points]
Start A at each sample point x^i provided:
⋄ A has not been started from x^i, and
⋄ no other sample point x^j with f(x^j) < f(x^i) is within a distance
  r_k = (1/√π) [ 5 Γ(1 + n/2) vol(D) log(kN) / (kN) ]^{1/n}
Thm [RK-T]- Will start finitely many local runs with probability 1.
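A sketch of the MLSL start-point selection rule above, with the critical distance r_k as on the slide (constant 5, unit-volume domain assumed); the sample points, objective, and helper names are illustrative stand-ins, and the local solver A itself is not shown.

```python
import numpy as np
from math import gamma, log, pi, sqrt

def mlsl_radius(k, N, n, vol_D=1.0, sigma=5.0):
    """Critical distance r_k from the slide (Rinnooy Kan & Timmer)."""
    return (1.0 / sqrt(pi)) * (sigma * gamma(1 + n / 2.0) * vol_D
                               * log(k * N) / (k * N)) ** (1.0 / n)

def mlsl_start_points(X, fX, k, N, started):
    """Indices of sample points from which to start the local solver A."""
    r = mlsl_radius(k, N, X.shape[1])
    starts = []
    for i in range(len(X)):
        if i in started:
            continue
        d = np.linalg.norm(X - X[i], axis=1)
        # start only if no better sample point lies within distance r_k
        if not np.any((fX < fX[i]) & (d < r)):
            starts.append(i)
    return starts

rng = np.random.default_rng(1)
f = lambda x: np.sum((x - 0.3) ** 2, axis=-1)    # illustrative objective on [0,1]^2
X = rng.random((20, 2)); fX = f(X)
print(mlsl_start_points(X, fX, k=1, N=20, started=set()))
```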
Wild, IMA WMO16 39
Performance Comparisons on Test Functions
[Figure: mean of the best value found in 30 runs vs. number of function evaluations on the Schoen 14.100 test function (n = 14): GORBIT, Global-SRBF, Local-SRBF, Gutmann-RBF, Greedy-RBF, SA-SLHD]
⋄ GORBIT is multistart with an RBF model-based method
⋄ SA-SLHD is a heuristic (simulated annealing with a symmetric Latin hypercube design as initialization)
→ [W., Cornell University, 2009]
Wild, IMA WMO16 40
II. Exploiting Structure in Simulation-Based Optimization
Beyond the Black Box
min f(x) = F [S(x)]
So far, f = S
Your problems are not black-box problems
You formulated the problem ⇒ You know more than nothing
Wild, IMA WMO16 42
Structure in Simulation-Based Optimization, min f(x) = F [x, S(x)]
f is often not a black box S
NLS Nonlinear least squares
  f(x) = Σ_i (S_i(x) − d_i)²
CNO Composite (nonsmooth) optimization
f(x) = h(S(x))
SKP Not all variables enter simulation
f(x) = g(xI , xJ) + h(S(xJ ))
SCO Only some constraints depend on simulation
  min{f(x) : c_1(x) = 0, c_S(x) = 0}
  + Slack variables: Ω_S = {(x_I, x_J) : S(x_J) + x_I = 0, x_I ≥ 0}
. . .
Model-based methods offer one way to exploit such structure
Wild, IMA WMO16 43
General Setting – Modeling Smooth S1(x), S2(x), . . . , Sp(x)
Assume:
⋄ each Si is continuously differentiable, available
⋄ each ∇Si is Lipschitz continuous, unavailable
m^{S_i} : R^n → R approximates S_i on B(x, Δ), i = 1, . . . , p
Fully Linear Models
m^{S_i} fully linear on B(x, Δ) if there exist constants κ_{i,ef} and κ_{i,eg} independent of x and Δ so that
  |S_i(x + s) − m^{S_i}(x + s)| ≤ κ_{i,ef} Δ² ∀ s ∈ B(0, Δ)
  ‖∇S_i(x + s) − ∇m^{S_i}(x + s)‖ ≤ κ_{i,eg} Δ ∀ s ∈ B(0, Δ)
Wild, IMA WMO16 44
NLS – Nonlinear Least Squares f(x) = ½ Σ_i R_i(x)²
Obtain a vector of output R1(x), . . . , Rp(x)
⋄ Model each R_i:
  R_i(x) ≈ m^{R_i}_k(x) = R_i(x_k) + (x − x_k)^⊤ g^{(i)}_k + ½ (x − x_k)^⊤ H^{(i)}_k (x − x_k)
⋄ Approximate:
  ∇f(x) = Σ_i ∇R_i(x) R_i(x) −→ Σ_i ∇m^{R_i}_k(x) R_i(x)
  ∇²f(x) = Σ_i ∇R_i(x) ∇R_i(x)^⊤ + Σ_i R_i(x) ∇²R_i(x) −→ Σ_i ∇m^{R_i}_k(x) ∇m^{R_i}_k(x)^⊤ + Σ_i R_i(x) ∇²m^{R_i}_k(x)
⋄ Model f via Gauss-Newton or similar
  regularized Hessians → DFLS [Zhang, Conn, Scheinberg]
  full Newton → POUNDERS [W., More]
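A sketch of how the residual models combine into the gradient and Hessian approximations above; the arrays g and H stand in for the model gradients g^{(i)}_k and Hessians H^{(i)}_k, and the numbers are purely illustrative.

```python
import numpy as np

def nls_master_model(R, g, H):
    """Combine residual values and models into approximate gradient and
    Hessian of f(x) = 0.5 * sum_i R_i(x)^2 at the current point x_k."""
    R = np.asarray(R)                  # residuals R_i(x_k), shape (p,)
    G = np.asarray(g)                  # model gradients, shape (p, n)
    grad = G.T @ R                     # sum_i g_i R_i  ~ grad f
    hess = G.T @ G                     # Gauss-Newton term sum_i g_i g_i^T
    for Ri, Hi in zip(R, H):           # "full Newton" adds sum_i R_i H_i
        hess = hess + Ri * Hi
    return grad, hess

# Tiny illustration with p = 2 residuals in n = 2 variables
R = [0.5, -1.0]
g = [[1.0, 0.0], [0.0, 2.0]]
H = [np.zeros((2, 2)), np.eye(2)]
print(nls_master_model(R, g, H))
```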
Wild, IMA WMO16 45
NLS – Consequences for f(x) = ½ Σ_i R_i(x)²
Pay a (negligible for expensive S) price in terms of p models
⋄ Save linear algebra using interpolation set Y_k common to all models
  Single system solve, multiple right-hand sides: Φ(Y_k) [z^(1) · · · z^(p)] = [R_1 · · · R_p]
  m^{R_1} quality ⇒ quality of all m^{R_i}
+ (nearly) exact gradients for R_i (nearly) linear
− No longer interpolate function at data points:
  m(x_k + δ) = f(x_k) + δ^⊤ Σ_i g^{(i)}_k R_i(x_k) + ½ δ^⊤ [ Σ_i ( g^{(i)}_k (g^{(i)}_k)^⊤ + R_i(x_k) H^{(i)}_k ) ] δ + missing h.o. terms
Wild, IMA WMO16 46
NLS– POUNDERS in Practice: DFT Calibration/MLE
min_x Σ_{i=1}^p w_i (S_i(x) − d_i)²
S_i(x) Simulated (DFT) nucleus property
d_i Experimental data i
w_i Weight for data type i
p Parallel simulations (12 wallclock mins)
[Figure: least f value found vs. number of 12-min. evaluations over Days 1-3, nelder-mead vs. pounders]
→[Kortelainen et al., PhysRevC 2010]
Wild, IMA WMO16 47
CNO– Composite Nonsmooth Optimization Examples
Ex.- Groundwater remediation
Determine rates x for extraction/injection wells
⋄ Regulator's simulator returns flow S_i(x) in/out of cell i
⋄ Minimize plume fluxes (e.g., regulatory $ penalties): f(x) = Σ_i |S_i(x)|
Lockwood Solvent Ground Water Plume Site (LSGPS)
Ex.- Particle accelerator design
Minimize particle losses: f(x) = max_{t_i ∈ T_1} S(x; t_i) − min_{t_i ∈ T_2(x)} S(x; t_i)
Wild, IMA WMO16 48
CNO – Some Generic Ideas for f(x) = Σ_{i=1}^p |F_i(x)|
Model-based approaches:
  pounder: ignore structure, model f as usual
  pounders-sqrt: f = Σ_{i=1}^p (√|F_i|)², model √|F_i| by Q_i; subproblem min Σ_{i=1}^p Q_i(x)²
  poundera-abs: f = Σ_{i=1}^p |F_i|, model |F_i| by Q_i; subproblem min Σ_{i=1}^p Q_i(x)
  poundera-nsm: f = Σ_{i=1}^p |F_i|, model F_i by Q_i; subproblem min Σ_{i=1}^p |Q_i(x)|
Wild, IMA WMO16 49
CNO – Results for Generic Ideas, min Σ_{i=1}^p |F_i(x)|
pounder: black-box; pounders-sqrt: min Σ Q_i(x)²; poundera-abs: min Σ Q_i(x); poundera-nsm: min Σ |Q_i(x)|
[Figure: least f value found vs. number of f evaluations, Mancino problem (n = 5, p = 5): pounder, pounders-sqrt, poundera-abs, poundera-nsm]
Wild, IMA WMO16 50
CNO – Results for Generic Ideas, min Σ_{i=1}^p |F_i(x)|
pounder: black-box; pounders-sqrt: min Σ Q_i(x)²; poundera-abs: min Σ Q_i(x); poundera-nsm: min Σ |Q_i(x)|
[Figure: least f value found vs. number of f evaluations, Bdqrtic problem (n = 12, p = 16): pounder, pounders-sqrt, poundera-abs, poundera-nsm]
Wild, IMA WMO16 50
CNO– Composite Nonsmooth Optimization f(x) = h(S(x);x)
nonsmooth (algebraically available) function h : Rp × Rn → R
of a smooth (blackbox) mapping S : Rn → Rp
Basic Idea: Knowledge of vector S(x_k) & potential nondifferentiability at S(x_k) should enhance (theoretical and practical) progress to a stationary point
Ex.- f_1(x) = ‖S(x)‖_1 = Σ_{i=1}^p |S_i(x)|
  ∂f_1(x) = Σ_{i: S_i(x) ≠ 0} sgn(S_i(x)) ∇S_i(x) + Σ_{i: S_i(x) = 0} co{−∇S_i(x), ∇S_i(x)}
⋄ D^c = {x : ∃ i with S_i(x) = 0, ∇S_i(x) ≠ 0}
+ Compact ∂f(x)
− D^c depends on ∇S_i(x)
Wild, IMA WMO16 51
CNO– The Nuisance Set, N
Relaxation N ⊆ D^c using only zero-order information:
  f_1: N = {x : ∃ i with S_i(x) = 0}
  f_∞: N = {x : f_∞(x) = 0 or |argmax_i |S_i(x)|| > 1}
Observe: when x_k ∉ N,
  ∂f(x_k) = ∇f(x_k) = ∇_x S(x_k)^⊤ ∇_S h(S(x_k)) ≈ ∇_x M(x_k)^⊤ ∇_S h(S(x_k))
and smooth approximation is justified
[Figure: f_1(x) = |x³| and f_∞(x) = max{|S_1(x)|, |S_2(x)|} illustrating nondifferentiable points]
Wild, IMA WMO16 52
CNO– Subdifferential Approximation
⋄ When x_k ∈ N, we build a set of generators G(x_k) based on ∂_S h(S(x_k)); co{G(x_k)} approximates ∂f(x_k)
Ex.- f_1(x) = ‖S(x)‖_1:
  G(x_k) = ∇M(x_k)^⊤ { sgn(S(x_k)) + ∪_{i: S_i(x_k) = 0} {−e_i, 0, e_i} }
Nearby data Y ⊂ B(x_k, Δ_k) informs models M = m^S and the generator set
⋄ Manifold sampling method uses manifold(s) of Y: ∇M(x_k)^⊤ ∪_{y^i ∈ Y} mani(S(y^i))
⋄ Traditional gradient sampling → [Burke, Lewis, Overton; SIOPT 2005]: ∪_{y^i ∈ Y} ∇M(y^i)^⊤ mani(S(y^i))
Wild, IMA WMO16 53
CNO– Smooth Trust-Region Subproblem
Smooth master model from minimum-norm element:
  m^f(x_k + s) = f(x_k) + ⟨s, proj(0, co{G(x_k)})⟩ + · · ·
⇒ Smooth subproblems min{m^f(x_k + s) : s ∈ B(0, Δ_k)}
   vs. nonsmooth subproblems min{h(M(x_k + s)) : s ∈ B(0, Δ_k)}
⋄ Convex h (e.g., ‖S(x)‖_1) and ∇S_i Lipschitz ⇒ every cluster point of {x_k}_k is Clarke stationary → [Larson, Menickelly, W.; Preprint 2016]
⋄ OK to sample at x_k ∈ D^C
⋄ More general (piecewise differentiable f) results → [Larson, Khan, W.; in prog. 2016]
⋄ Nonsmooth subproblems require convex h → [Fletcher; MathProgStudy 1982], [Grapiglia, Yuan, Yuan; C&A Math. 2016]
  Complexity results → [Garmanjani, Judice, Vicente; SIOPT 2016]
(Photo: Roger Fletcher)
Wild, IMA WMO16 54
CNO– Example Performance on L1 Test Problems
[Figure: best f value found (left) and best stationary measure Ψ value found (right) vs. number of function evaluations for several manifold-sampling and DFO trust-region variants]
Smooth black-box methods can fail in practice, even when D^C has measure zero
Numerical tests: → [Larson, Menickelly, W.; Preprint 2016]
Wild, IMA WMO16 55
SKP– Some Known Partials Example
Ex.- Bi-level model calibration structure
min_x f(x) = Σ_{i=1}^p (S_i(x) − d_i)²
S_i(x) solution to lower-level problem depending only on x_J:
  S_i(x) = g_i(x) + min_y {h_i(x_J; y) : y ∈ D_i} = g_i(x) + h_i(x_J; y^{i,*}[x_J])
For x = (x_I, x_J):
⋄ ∇_{x_I} S_i(x_I, x_J) available
⋄ ∇_{x_J} S_i(x) ≈ ∇_{x_J} g_i(x) + ∇_{x_J} m^{S_i}(x_J)
⋄ S_i(x) continuous and smooth in x_I
⋄ g_i(x) cheap to compute!
⋄ No noise/errors introduced in g_i(x)
General bi-level →[Conn & Vicente, OMS 2012]
Wild, IMA WMO16 56
SKP– Some Known Partials
x = (x_I, x_J); have ∂f/∂x_I but not ∂f/∂x_J
"Solve" Φz = f with known z_{g,I}, z_{H,I}:
  [Φ_c Φ_{g,J} Φ_{H,J}] [z_c; z_{g,J}; z_{H,J}] = f − Φ_{g,I} z_{g,I} − Φ_{H,I} z_{H,I}
⋄ Still have interpolation where required
⋄ Effectively lowers dimension to |J| = n − |I| for approximation, model-improving evaluations, and linear algebra
⋄ lim_{k→∞} ∇f(x_k) = 0 as before: guaranteed descent in some directions
Wild, IMA WMO16 57
SKP– Numerical Results With Some Partials
[Figure: best f value vs. number of evaluations for pounder, pounders, poundersm]
Three approaches:
  pounder: black box
  pounders: exploit least squares
  poundersm: use ∇_{x_I} derivatives
⋄ n = 16, |I| = 3
⋄ 5-10 secs/evaluation
Same algorithmic framework, performance advantages from exploiting structure
→[Bertolli, Papenbrock, W., PRC 2012]
Wild, IMA WMO16 58
SCO- General Constraints
min{f(x) : c_1(x) = 0, c_S(x) = 0}
⋄ Lagrangian (key to optimality conditions):
  ∇L = ∇f + λ_1^⊤ ∇c_1 + λ_2^⊤ ∇c_S → ∇f + λ_1^⊤ ∇c_1 + λ_2^⊤ ∇m
⋄ Use favorite method: filters, augmented Lagrangian, . . .
⋄ Slack variables
  Do not increase effective dimension
  Subproblems can treat separately
  Know derivatives
→[Lewis & Torczon; 2010]
Modified AL methods →[Diniz-Ehrhardt, Martınez, Pedroso; C&A Math. 2011]
SBO constraints have unique properties →[Le Digabel & W.; ANL/MCS-P5350-0515 2016]
Wild, IMA WMO16 59
SCO– What Constraint Derivatives Buy You
Ex.- Augmented Lagrangian methods, L_A(x, λ; µ) = f(x) − λ^⊤ c(x) + (1/µ) ‖c(x)‖²
min_x {f(x) : c(x) = 0}
Four approaches:
1. Penalize constraints
2. Treat c and f both as (separate) black boxes
3. Work with f and ∇_x c
4. Have both ∇_x f and ∇_x c
[Figure: best merit function value vs. number of evaluations: All Ders., Constraint Ders., No Ders., No Structure; n = 15 variables, 11 constraints, no explicit internal variables]
Wild, IMA WMO16 60
So You Want To Solve A Hard Optimization Problem?
Mathematically unwrap problems to expose the deepest black boxes!
⋄ It is easy to get started with derivative-free methods
⋄ You should strive to obtain derivatives & apply methods from every other lecture
⋄ Model-based methods can make use of expensive function values
⋄ Structure is everywhere, even in “black-box”/legacy code-driven optimization problems
⋄ By exploiting structure, optimization can solve grand-challenge problems in 〈insert your field here〉:
  Model residuals {r_i(x)}_i, not ‖r(x)‖
  Model constraints {c_i(x)}_i, not a penalty P(c(x))
  Explicitly handle nonsmoothness (and noise)
Send me your structured SBO problems! → www.mcs.anl.gov/~wild
Wild, IMA WMO16 61