
DERIVATIVE-FREE OPTIMIZATION OF EXPENSIVE FUNCTIONS WITH COMPUTATIONAL ERROR USING WEIGHTED REGRESSION

STEPHEN C. BILLUPS∗, JEFFREY LARSON†, AND PETER GRAF‡

Abstract. We propose a derivative-free algorithm for optimizing computationally expensive functions with computational error. The algorithm is based on the trust region regression method by Conn, Scheinberg, and Vicente [4], but uses weighted regression to obtain more accurate model functions at each trust region iteration. A heuristic weighting scheme is proposed which simultaneously handles i) differing levels of uncertainty in function evaluations, and ii) errors induced by poor model fidelity. We also extend the theory of Λ-poisedness and strong Λ-poisedness to weighted regression. We report computational results comparing interpolation, regression, and weighted regression methods on a collection of benchmark problems. Weighted regression appears to outperform interpolation and regression models on nondifferentiable functions and functions with deterministic noise.

Key words. Derivative-Free Optimization, Weighted Regression Models, Noisy Function Evaluations.

AMS subject classifications. 90C56, 49J52, 65K05, 49M30

1. Introduction. The algorithm described in this paper is designed to optimize functions evaluated by large computational codes, taking minutes, hours or even days for a single function call, for which derivative information is unavailable, and for which function evaluations are subject to computational error. Such error may be deterministic (arising, for example, from discretization error), or stochastic (for example, from Monte-Carlo simulation). Because function evaluations are extremely expensive, it is sensible to perform substantial work at each iteration to reduce the number of function evaluations required to obtain an optimum.

In some cases, the magnitude of the uncertainty for each function evaluation can be controlled by the user. For example, accuracy can be improved by using a finer discretization or by increasing the sample size of a Monte-Carlo simulation. But this greater accuracy comes at the cost of increased computational time; so it makes sense to vary the accuracy requirements over the course of the optimization. Such an approach was proposed, for example, in [1]. Our algorithm is designed to take advantage of varying levels of uncertainty.

The algorithm fits into a general framework for derivative-free trust region methods using quadratic models, which was described by Conn, Scheinberg, and Vicente in [6, 5]. We shall refer to this framework as the CSV2-framework. This framework constructs a quadratic model function m_k(·), which approximates the objective function f on a set of sample points Y^k ⊂ R^n at each iteration k. The next iterate is then determined by minimizing m_k over a trust region. In order to guarantee global convergence, the CSV2-framework monitors the distribution of points in the sample set, and occasionally invokes a model-improvement algorithm that modifies the sample set to ensure m_k accurately approximates f. The CSV2-framework is the basis of the well-known DFO algorithm, which is freely available on the COIN-OR website [17].

∗University of Colorado Denver, Department of Mathematical and Statistical Sciences ([email protected]).

†University of Colorado Denver, Department of Mathematical and Statistical Sciences ([email protected]); research partially supported by National Science Foundation Grant GK-12-0742434.

‡National Renewable Energy Laboratory, Computational Science Center ([email protected]).


There have been previous attempts at optimizing noisy functions without derivatives. For functions with stochastic noise, replications of function evaluations can be a simple way to modify existing algorithms. For example, [8] modifies Powell's UOBYQA [21], [24] modifies Nelder-Mead [20], and [9] modifies DIRECT [15]. For deterministic noise, Kelley [16] proposes a technique to detect and restart Nelder-Mead methods. Neumaier's SNOBFIT [14] algorithm accounts for noise by not requiring the surrogate functions to interpolate function values, but rather fit a stochastic model. A least-squares regression method for handling noise was considered in [4].

In our paper, we propose using weighted regression, which can handle differing levels of uncertainty for each function evaluation. We propose a weighting scheme that can use knowledge of the relative levels of uncertainty in each function evaluation. The weighting scheme also handles errors resulting from poor model fidelity (i.e., fitting a non-quadratic function by a quadratic model) by taking into account the distances of each sample point to the center of the trust region.

To fit our algorithm into the CSV2-framework we extend the theory of poisedness, as described in [6], to weighted regression. We show (Proposition 4.11) that a sample set that is strongly Λ-poised in the regression sense is also strongly cΛ-poised in the weighted regression sense for some constant c, provided that no weight is too small relative to the other weights. Thus, any model improvement scheme that ensures strong Λ-poisedness in the regression sense can be used in the weighted regression framework.

The paper is organized as follows. We begin by describing the CSV2 framework in §2. To put our algorithm into this framework, we describe 1) how model functions are constructed (§3), and 2) a model improvement algorithm (§5). Before describing the model improvement algorithm, we first extend the theory of Λ-poisedness to the weighted regression framework (§4). Computational results are presented in §6 using a heuristic weighting scheme, which is described in that section. §7 concludes the paper.

Notation. The following notation will be used: R^n denotes the real Euclidean space of vectors of length n. ‖·‖_p denotes the standard ℓ_p vector norm, and ‖·‖ (without the subscript) denotes the Euclidean norm. ‖·‖_F denotes the Frobenius norm of a matrix. C^k denotes the set of functions on R^n with k continuous derivatives. D^j f denotes the jth derivative of a function f ∈ C^k, j ≤ k. Given an open set Ω ⊂ R^n, LC^k(Ω) denotes the set of C^k functions with Lipschitz continuous kth derivatives. That is, for f ∈ LC^k(Ω), there exists a Lipschitz constant L such that

    ‖D^k f(y) − D^k f(x)‖ ≤ L ‖y − x‖  for all x, y ∈ Ω.

P^d_n denotes the space of polynomials of degree less than or equal to d in R^n; q_1 denotes the dimension of P^2_n (specifically q_1 = (n + 1)(n + 2)/2). We use standard "big-Oh" notation (written O(·)) to state, for example, that for two functions on the same domain, f(x) = O(g(x)) if there exists a constant M such that |f(x)| ≤ M|g(x)| for all x with sufficiently small norm. Given a set Y, |Y| denotes the cardinality and conv(Y) denotes the convex hull. For a real number α, ⌊α⌋ denotes the greatest integer less than or equal to α. For a matrix A, A^+ denotes the Moore-Penrose generalized inverse [11]. e_j denotes the jth column of the identity matrix. The ball of radius ∆ centered at x ∈ R^n is denoted B(x; ∆). Given a vector w, diag(w) denotes the diagonal matrix W with diagonal entries W_ii = w_i. For a square matrix A, cond(A) denotes the condition number, and λ_min(A) denotes the smallest eigenvalue.


2. Background. The algorithm proposed in this paper fits into a general framework for derivative-free trust region methods using quadratic models described by Conn, Scheinberg, and Vicente [6, Algorithm 10.3]. We shall refer to this framework as the CSV2-framework. Algorithms in the framework construct a model function m_k(·) at iteration k, which approximates the objective function f on a set of sample points Y^k = {y^0, . . . , y^{p_k}} ⊂ R^n. The next iterate is then determined by minimizing m_k. Specifically, given the iterate x_k, a putative next iterate is given by x_{k+1} = x_k + s_k, where the step s_k solves the trust region subproblem

    min_s  m_k(x_k + s)   subject to  ‖s‖ ≤ ∆_k,

where the scalar ∆_k > 0 denotes the trust region radius, which may vary from iteration to iteration. If x_{k+1} produces sufficient descent in the model function, then f(x_{k+1}) is evaluated, and the iterate is accepted if f(x_{k+1}) < f(x_k); otherwise, the step is not accepted. In either case, the trust region radius may be adjusted, and a model-improvement algorithm may be called to obtain a more accurate model.

To establish convergence properties, the following smoothness assumption is made on f:

Assumption 2.1. Suppose that a set of points S ⊂ R^n and a radius ∆_max are given. Let Ω be an open domain containing the ∆_max neighborhood ⋃_{x∈S} B(x; ∆_max) of the set S. Assume f ∈ LC^2(Ω) with Lipschitz constant L.

The CSV2-framework does not specify how the quadratic model functions are constructed. However, it does require that the model functions be selected from a fully quadratic class of model functions M:

Definition 2.2. Let f satisfy Assumption 2.1. Let κ = (κ_ef, κ_eg, κ_eh, ν_2^m) be a given vector of constants, and let ∆ > 0. A model function m ∈ C^2 is κ-fully quadratic on B(x; ∆) if m has a Lipschitz continuous Hessian with corresponding Lipschitz constant bounded by ν_2^m and
• the error between the Hessian of the model and the Hessian of the function satisfies

    ‖∇²f(y) − ∇²m(y)‖ ≤ κ_eh ∆  for all y ∈ B(x; ∆),

• the error between the gradient of the model and the gradient of the function satisfies

    ‖∇f(y) − ∇m(y)‖ ≤ κ_eg ∆²  for all y ∈ B(x; ∆),

• the error between the model and the function satisfies

    |f(y) − m(y)| ≤ κ_ef ∆³  for all y ∈ B(x; ∆).

Definition 2.3. Let f satisfy Assumption 2.1. A set of model functions M = {m : R^n → R, m ∈ C^2} is called a fully quadratic class of models if there exist positive constants κ = (κ_ef, κ_eg, κ_eh, ν_2^m), such that the following hold:
1. for any x ∈ S and ∆ ∈ (0, ∆_max], there exists a model function m in M which is κ-fully quadratic on B(x; ∆).
2. For this class M, there exists an algorithm, called a "model-improvement" algorithm, that in a finite, uniformly bounded (with respect to x and ∆) number of steps can


• either certify that a given model m ∈ M is κ-fully quadratic on B(x; ∆),
• or find a model m ∈ M that is κ-fully quadratic on B(x; ∆).

Note that this definition of a fully quadratic class of models is equivalent to [6, Definition 6.2]; but we have given a separate definition of a κ-fully quadratic model (Definition 2.2) that includes the use of κ to stress the fixed nature of the bounding constants. This change simplifies some analysis by allowing us to discuss κ-fully quadratic models independent of the class of models they belong to. It is important to note that κ does not need to be known explicitly. Instead, it can be defined implicitly by the model improvement algorithm. All that is required is for κ to be fixed (that is, independent of x and ∆). We also note that the set M may include non-quadratic functions, but when the model functions are quadratic, the Hessian is fixed, so ν_2^m can be chosen to be zero.

2.1. CSV2-framework. The CSV2-framework can now be specified. A critical distinction between this framework and classical trust region methods lies in the optimality criteria. In classical trust region methods, m_k is the 2nd order Taylor approximation of f at x_k; so if x_k is optimal for m_k, it satisfies the first and second order necessary conditions for an optimum of f. In the CSV2-framework, x_k must be optimal for m_k, but m_k must also be an accurate approximation of f near x_k. This requires that the trust region radius is small and that m_k is κ-fully quadratic on the trust region for some fixed κ.

In the algorithm below, g_k^icb and H_k^icb denote the gradient and Hessian of the incumbent model m_k^icb. (We use the superscript icb to stress that incumbent parameters from the previous iterates may be changed before they are used in the trust region step.) The optimality of x_k with respect to m_k is tested by calculating ς_k^icb = max{‖g_k^icb‖, −λ_min(H_k^icb)}. This quantity is zero if and only if the first and second order optimality conditions for m_k are satisfied. The algorithm enters the criticality step when ς_k^icb is close to zero. This routine builds a (possibly) new κ-fully quadratic model for the current ∆_k^icb, and tests if ς_k^icb for this model is sufficiently large. If so, a descent direction has been determined, and the algorithm can proceed. If not, the criticality step reduces ∆_k^icb and updates the sample set to improve the accuracy of the model function near x_k. The criticality step ends when ς_k^icb is large enough (and the algorithm proceeds) or when both ς_k^icb and ∆_k^icb are smaller than given threshold values ǫ_c and ∆_min (in which case the algorithm has identified a second order stationary point). We refer the reader to [6] for a more detailed discussion of the algorithm, including explanations of the parameters η_0, η_1, γ, γ_inc, β, µ and ω.
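As a concrete illustration (ours, not from the paper), the criticality measure ς = max{‖g‖, −λ_min(H)} can be computed directly from a model gradient and Hessian:

    import numpy as np

    def criticality_measure(g, H):
        """max(||g||, -lambda_min(H)); zero iff the first- and second-order
        optimality conditions for the quadratic model hold."""
        lam_min = np.linalg.eigvalsh(H)[0]      # smallest eigenvalue (H symmetric)
        return max(np.linalg.norm(g), -lam_min)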

Algorithm CSV2. ([6, Algorithm 10.3])

Step 0 (initialization): Choose a fully quadratic class of models M and a corresponding model-improvement algorithm, with associated κ defined by Definition 2.3. Choose an initial point x_0 and maximum trust region radius ∆_max > 0. We assume that the following are given: an initial model m_0^icb(x_0 + s) (with gradient and Hessian at s = 0 given by g_0^icb and H_0^icb, respectively), ς_0^icb = max{‖g_0^icb‖, −λ_min(H_0^icb)}, and a trust region radius ∆_0^icb ∈ (0, ∆_max]. The constants η_0, η_1, γ, γ_inc, ǫ_c, β, µ, ω are given and satisfy the conditions 0 ≤ η_0 ≤ η_1 < 1 (with η_1 ≠ 0), 0 < γ < 1 < γ_inc, ǫ_c > 0, µ > β > 0, ω ∈ (0, 1). Set k = 0.

Step 1 (criticality step): If ς_k^icb > ǫ_c, then m_k = m_k^icb and ∆_k = ∆_k^icb.
If ς_k^icb ≤ ǫ_c, then proceed as follows. Call the model-improvement algorithm to attempt to certify if the model m_k^icb is κ-fully quadratic on B(x_k; ∆_k^icb). If at least one of the following conditions holds,
• the model m_k^icb is not certifiably κ-fully quadratic on B(x_k; ∆_k^icb), or
• ∆_k^icb > µς_k^icb,
then apply Algorithm CriticalityStep (described below) to construct a model m̃_k(x_k + s) (with gradient and Hessian at s = 0 given by g̃_k and H̃_k, respectively), with ς̃_k^m = max{‖g̃_k‖, −λ_min(H̃_k)}, which is κ-fully quadratic on the ball B(x_k; ∆̃_k) for some ∆̃_k ∈ (0, µς̃_k^m] given by [6, Algorithm 10.4]. In such a case set

    m_k = m̃_k  and  ∆_k = min{max{∆̃_k, βς̃_k^m}, ∆_k^icb}.

Otherwise, set m_k = m_k^icb and ∆_k = ∆_k^icb.

Step 2 (step calculation): Compute a step s_k that sufficiently reduces the model m_k (in the sense of [6, (10.13)]) such that x_k + s_k ∈ B(x_k; ∆_k).

Step 3 (acceptance of the trial point): Compute f(x_k + s_k) and define

    ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(x_k) − m_k(x_k + s_k)).

If ρ_k ≥ η_1, or if both ρ_k ≥ η_0 and the model is κ-fully quadratic on B(x_k; ∆_k), then x_{k+1} = x_k + s_k and the model is updated to include the new iterate into the sample set, resulting in a new model m_{k+1}^icb(x_{k+1} + s) (with gradient and Hessian at s = 0 given by g_{k+1}^icb and H_{k+1}^icb, respectively), with ς_{k+1}^icb = max{‖g_{k+1}^icb‖, −λ_min(H_{k+1}^icb)}; otherwise, the model and the iterate remain unchanged (m_{k+1}^icb = m_k and x_{k+1} = x_k).

Step 4 (model improvement): If ρ_k < η_1, use the model-improvement algorithm to
• attempt to certify that m_k is κ-fully quadratic on B(x_k; ∆_k),
• if such a certificate is not obtained, we say that m_k is not certifiably κ-fully quadratic and make one or more suitable improvement steps.
Define m_{k+1}^icb to be the (possibly improved) model.

Step 5 (trust region update): Set

    ∆_{k+1}^icb ∈ {min{γ_inc ∆_k, ∆_max}}          if ρ_k ≥ η_1 and ∆_k < βς_k^m,
    ∆_{k+1}^icb ∈ [∆_k, min{γ_inc ∆_k, ∆_max}]     if ρ_k ≥ η_1 and ∆_k ≥ βς_k^m,
    ∆_{k+1}^icb = γ∆_k                             if ρ_k < η_1 and m_k is κ-fully quadratic,
    ∆_{k+1}^icb = ∆_k                              if ρ_k < η_1 and m_k is not certifiably κ-fully quadratic.

Increment k by 1 and go to Step 1.

Algorithm CriticalityStep. ([6, Algorithm 10.4]) This algorithm is applied only if ς_k^icb ≤ ǫ_c and at least one of the following holds: the model m_k^icb is not certifiably κ-fully quadratic on B(x_k; ∆_k^icb) or ∆_k^icb > µς_k^icb.

Initialization: Set i = 0. Set m_k^(0) = m_k^icb.
Repeat: Increment i by one. Use the model improvement algorithm to improve the previous model m_k^(i−1) until it is κ-fully quadratic on B(x_k; ω^{i−1}∆_k^icb). Denote the new model by m_k^(i). Set ∆̃_k = ω^{i−1}∆_k^icb and m̃_k = m_k^(i).
Until ∆̃_k ≤ µ ς̃_k^{m,(i)}, where ς̃_k^{m,(i)} is the criticality measure of m_k^(i).

2.1.1. Global convergence. Define the set L(x_0) = {x ∈ R^n : f(x) ≤ f(x_0)}.

Assumption 2.4. Assume that f is bounded from below on L(x_0).

Assumption 2.5. There exists a constant κ_bhm > 0 such that, for all x_k generated by the algorithm,

    ‖H_k‖ ≤ κ_bhm.


Theorem 2.6. ([6, Theorem 10.23]) Let Assumptions 2.1, 2.4 and 2.5 hold with S = L(x_0). Then

    lim_{k→+∞} max{‖∇f(x_k)‖, −λ_min(∇²f(x_k))} = 0.

It follows from this theorem that any accumulation point of x_k satisfies the first and second order necessary conditions for a minimum of f.

To specify an algorithm within this framework, three things are required:
1. Define the class of model functions M. This is determined by the method for constructing models from the sample set. In [6], models were constructed using interpolation, least squares regression, and minimum Frobenius norm methods. We describe the general form of our weighted regression models in §3 and propose a specific weighting scheme in §6.
2. Define a model-improvement algorithm. §5 describes our model improvement algorithm, which tests the geometry of the sample set and, if necessary, adds and/or deletes points to ensure that the model function constructed from the sample set satisfies the error bounds in Definition 2.2 (i.e., it is κ-fully quadratic).
3. Demonstrate that the model-improvement algorithm satisfies the requirements for the definition of a class of fully quadratic models. For our algorithm, this is discussed in §5.

3. Model Construction. This section describes how we construct the model function m_k at the kth iteration. For simplicity, we drop the subscript k for the remainder of this section. Let f = (f_0, . . . , f_p)^T, where f_i denotes the computed function value at y^i, and let E_i denote the associated computational error. That is,

    f_i = f(y^i) + E_i.    (3.1)

Let w = (w_0, . . . , w_p)^T be a vector of positive weights. A quadratic polynomial m is said to be a weighted least squares approximation of f (with respect to w) if it minimizes

    ∑_{i=0}^{p} w_i² (m(y^i) − f_i)² = ‖W(m(Y) − f)‖²,

where m(Y) denotes the vector (m(y^0), m(y^1), . . . , m(y^p))^T and W = diag(w). In this case, we write

    W m(Y) =_{ℓ.s.} W f.    (3.2)

Let φ = {φ_0, φ_1, . . . , φ_q} be a basis for the quadratic polynomials in R^n. For example, φ might be the monomial basis

    φ = {1, x_1, x_2, . . . , x_n, x_1²/2, x_1x_2, . . . , x_{n−1}x_n, x_n²/2}.    (3.3)

Define

    M(φ, Y) = [ φ_0(y^0)  φ_1(y^0)  · · ·  φ_q(y^0)
                φ_0(y^1)  φ_1(y^1)  · · ·  φ_q(y^1)
                    ⋮          ⋮                ⋮
                φ_0(y^p)  φ_1(y^p)  · · ·  φ_q(y^p) ].
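As an illustration (ours, not from the paper), the matrix M(φ, Y) for the monomial basis (3.3) can be assembled as follows; rows correspond to sample points and columns to the q_1 = (n + 1)(n + 2)/2 basis polynomials.

    import numpy as np

    def design_matrix(Y):
        """M(phi, Y) for the monomial basis (3.3); Y is a (p+1) x n array
        whose rows are the sample points y^i."""
        p1, n = Y.shape
        cols = [np.ones(p1)]                         # constant term
        cols += [Y[:, i] for i in range(n)]          # linear terms x_i
        for i in range(n):
            for j in range(i, n):
                if i == j:
                    cols.append(0.5 * Y[:, i] ** 2)  # x_i^2 / 2
                else:
                    cols.append(Y[:, i] * Y[:, j])   # x_i x_j
        return np.column_stack(cols)                 # (p+1) x q1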


Since φ is a basis for the quadratic polynomials, the model function can be written m(x) = ∑_{i=0}^{q} α_i φ_i(x). The coefficients α = (α_0, . . . , α_q)^T solve the weighted least squares regression problem

    W M(φ, Y) α =_{ℓ.s.} W f.    (3.4)

If M(φ, Y) has full column rank, the sample set Y is said to be poised for quadratic regression. The following lemma is a straightforward generalization of [6, Lemma 4.3]:

Lemma 3.1. If Y is poised for quadratic regression, then the weighted least squares regression polynomial (with respect to positive weights w = (w_0, . . . , w_p)) exists, is unique, and is given by m(x) = φ(x)^T α, where

    α = (WM)^+ W f = (M^T W² M)^{−1} M^T W² f,    (3.5)

where W = diag(w) and M = M(φ, Y).

Proof. Since W and M both have full column rank, so does WM. Thus, the least squares problem (3.4) has a unique solution given by (WM)^+ W f. Moreover, since WM has full column rank, (WM)^+ = ((WM)^T(WM))^{−1} M^T W.
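A brief numerical sketch of (3.4)–(3.5) (ours, not from the paper; it reuses the design_matrix helper sketched above). Solving the scaled system with a least-squares routine is generally preferable to forming the normal equations of (3.5) explicitly.

    import numpy as np

    def weighted_regression_model(Y, fvals, w):
        """Fit the weighted least-squares quadratic model: solve (3.4) for alpha
        and return alpha together with a callable m(x) = phi(x)^T alpha."""
        M = design_matrix(Y)                              # (p+1) x q1
        W = np.diag(w)
        alpha, *_ = np.linalg.lstsq(W @ M, W @ fvals, rcond=None)

        def m(x):
            phi_x = design_matrix(np.atleast_2d(x))[0]    # phi(x)
            return phi_x @ alpha
        return alpha, m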

4. Error analysis and the geometry of the sample set. Throughout this section, Y = {y^0, · · · , y^p} denotes the sample set, p_1 = p + 1, w ∈ R^{p_1}_+ is a vector of positive weights, W = diag(w), and M = M(φ, Y). f denotes the vector of computed function values at the points in Y as defined by (3.1).

The accuracy of the model function m_k relies critically on the geometry of the sample set. In this section, we generalize the theory of poisedness from [6] to the weighted regression framework. This section also includes error analysis which extends results from [6] to weighted regression, as well as considering the impact of computational error on the error bounds. We start by defining weighted regression Lagrange polynomials.

4.1. Weighted regression Lagrange polynomials.

Definition 4.1. A set of polynomials ℓ_j(x), j = 0, . . . , p, in P^d_n are called weighted regression Lagrange polynomials with respect to the weights w and sample set Y if for each j,

    W ℓ_j^Y =_{ℓ.s.} W e_j,

where ℓ_j^Y = [ℓ_j(y^0), · · · , ℓ_j(y^p)]^T.

The following lemma is a direct application of Lemma 3.1.

Lemma 4.2. Let φ(x) = (φ_0(x), . . . , φ_q(x))^T. If Y is poised, then the set of weighted regression Lagrange polynomials exists and is unique, and is given by ℓ_j(x) = φ(x)^T a^j, j = 0, · · · , p, where a^j is the jth column of the matrix

    A = (WM)^+ W.    (4.1)

Proof. Note that m(·) = ℓ_j(·) satisfies (3.2) with f = e_j. By Lemma 3.1, ℓ_j(x) = φ(x)^T a^j, where a^j = (WM)^+ W e_j, which is the jth column of (WM)^+ W.

The following lemma is based on [6, Lemma 4.6].


Lemma 4.3. If Y is poised, then the model function defined by (3.2) satisfies

    m(x) = ∑_{i=0}^{p} f_i ℓ_i(x),

where ℓ_j(x), j = 0, · · · , p, denote the weighted regression Lagrange polynomials corresponding to Y and W.

Proof. By Lemma 3.1, m(x) = φ(x)^T α, where α = (WM)^+ W f = A f for A defined by (4.1). Let ℓ(x) = [ℓ_0(x), · · · , ℓ_p(x)]^T. Then

    m(x) = φ^T(x) A f = f^T ℓ(x) = ∑_{i=0}^{p} f_i ℓ_i(x).

4.2. Error Analysis. For the remainder of this paper, let Ŷ = {ŷ^0, · · · , ŷ^p} denote the shifted and scaled sample set, where ŷ^i = (y^i − y^0)/R and R = max_i ‖y^i − y^0‖. Note that ŷ^0 = 0 and max_i ‖ŷ^i‖ = 1. Any analysis of Ŷ can be directly related to Y by the following lemma:

Lemma 4.4. Define the basis φ̂ = {φ̂_0(x), · · · , φ̂_q(x)}, where φ̂_i(x) = φ_i(Rx + y^0), i = 0, . . . , q, and φ is the monomial basis. Let ℓ_0(x), · · · , ℓ_p(x) be weighted regression Lagrange polynomials for Y and ℓ̂_0(x), · · · , ℓ̂_p(x) be weighted regression Lagrange polynomials for Ŷ. Then M(φ̂, Ŷ) = M(φ, Y). If Y is poised, then

    ℓ(x) = ℓ̂((x − y^0)/R).

Proof. Observe that the (i, j) entry of M(φ̂, Ŷ) is φ̂_j(ŷ^i) = φ_j(Rŷ^i + y^0) = φ_j(y^i), which is the (i, j) entry of M(φ, Y); hence M(φ̂, Ŷ) = M(φ, Y). By the definition of poisedness, Ŷ is poised if and only if Y is poised. Let φ(x) = (φ_0(x), . . . , φ_q(x))^T and φ̂(x) = (φ̂_0(x), . . . , φ̂_q(x))^T. Then

    φ̂((x − y^0)/R) = (φ_0(x), . . . , φ_q(x))^T = φ(x).

By Lemma 4.2, if Y is poised, then

    ℓ̂((x − y^0)/R) = φ̂((x − y^0)/R)^T (W M(φ̂, Ŷ))^+ W = φ(x)^T (W M(φ, Y))^+ W = ℓ(x).
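The shift and scale used above is straightforward to implement; a minimal sketch (ours, not from the paper):

    import numpy as np

    def shift_and_scale(Y):
        """Return (Y_hat, R) with y_hat^i = (y^i - y^0)/R, R = max_i ||y^i - y^0||,
        so that y_hat^0 = 0 and max_i ||y_hat^i|| = 1."""
        y0 = Y[0]
        R = np.max(np.linalg.norm(Y - y0, axis=1))
        return (Y - y0) / R, R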


Let f_i be defined by (3.1) and let Ω be an open convex set containing Y. If f is C² on Ω, then by Taylor's theorem, for each sample point y^i ∈ Y and a fixed x ∈ conv(Y), there exists a point η_i(x) on the line segment connecting x to y^i such that

    f_i = f(y^i) + E_i
        = f(x) + ∇f(x)^T(y^i − x) + ½(y^i − x)^T ∇²f(η_i(x))(y^i − x) + E_i
        = f(x) + ∇f(x)^T(y^i − x) + ½(y^i − x)^T ∇²f(x)(y^i − x)
          + ½(y^i − x)^T H_i(x)(y^i − x) + E_i,    (4.2)

where H_i(x) = ∇²f(η_i(x)) − ∇²f(x).

Let ℓ_i(x) denote the weighted-regression Lagrange polynomials associated with Y. The following lemma and proof are inspired by [2, Theorem 1]:

Lemma 4.5. Let f be twice continuously differentiable on Ω and let m(x) denote the quadratic function determined by weighted regression. Then, for any x ∈ Ω the following identities hold:
• m(x) = f(x) + ½ ∑_{i=0}^{p} (y^i − x)^T H_i(x)(y^i − x) ℓ_i(x) + ∑_{i=0}^{p} E_i ℓ_i(x),
• ∇m(x) = ∇f(x) + ½ ∑_{i=0}^{p} (y^i − x)^T H_i(x)(y^i − x) ∇ℓ_i(x) + ∑_{i=0}^{p} E_i ∇ℓ_i(x),
• ∇²m(x) = ∇²f(x) + ½ ∑_{i=0}^{p} (y^i − x)^T H_i(x)(y^i − x) ∇²ℓ_i(x) + ∑_{i=0}^{p} E_i ∇²ℓ_i(x),
where H_i(x) = ∇²f(η_i(x)) − ∇²f(x) for some point η_i(x) = θx + (1 − θ)y^i, 0 ≤ θ ≤ 1, on the line segment connecting x to y^i.

Proof. Let D denote the differential operator as defined in [2]. In particular, D⁰f(x) = f(x), D¹f(x)z = ∇f(x)^T z, and D²f(x)z² = z^T ∇²f(x) z. By Lemma 4.3, m(x) = ∑_{i=0}^{p} f_i ℓ_i(x); so for h = 0, 1 or 2,

    D^h m(x) = ∑_{i=0}^{p} f_i D^h ℓ_i(x).

Substituting (4.2) for f_i in the above equation yields

    D^h m(x) = ∑_{j=0}^{2} (1/j!) ∑_{i=0}^{p} D^j f(x)(y^i − x)^j D^h ℓ_i(x)
             + ½ ∑_{i=0}^{p} (y^i − x)^T H_i(x)(y^i − x) D^h ℓ_i(x) + ∑_{i=0}^{p} E_i D^h ℓ_i(x),    (4.3)

where H_i(x) = ∇²f(η_i(x)) − ∇²f(x) for some point η_i(x) on the line segment connecting x to y^i. Consider the first term on the right hand side above. We shall show that

    (1/j!) ∑_{i=0}^{p} D^j f(x)(y^i − x)^j D^h ℓ_i(x) = D^h f(x) for j = h, and 0 for j ≠ h.    (4.4)

Let B_j = D^j f(x), and let g_j : R^n → R be the polynomial defined by g_j(z) = (1/j!) B_j(z − x)^j. Observe that D^j g_j(x) = B_j and D^h g_j(x) = 0 for h ≠ j. Since g_j has degree j ≤ 2, the weighted least squares approximation of g_j by a quadratic polynomial is g_j itself. Thus, by Lemma 4.3 and the definition of g_j,

    g_j(z) = ∑_{i=0}^{p} g_j(y^i) ℓ_i(z) = (1/j!) ∑_{i=0}^{p} B_j(y^i − x)^j ℓ_i(z).    (4.5)

Applying the differential operator D^h with respect to z yields

    D^h g_j(z) = (1/j!) ∑_{i=0}^{p} B_j(y^i − x)^j D^h ℓ_i(z) = (1/j!) ∑_{i=0}^{p} D^j f(x)(y^i − x)^j D^h ℓ_i(z).

Letting z = x, the expression on the right is identical to the left side of (4.4). This proves (4.4), since D^h g_j(x) = 0 for j ≠ h and D^j g_j(x) = B_j for j = h. By (4.4), (4.3) reduces to

    D^h m(x) = D^h f(x) + ½ ∑_{i=0}^{p} H_i(x)(y^i − x)² D^h ℓ_i(x) + ∑_{i=0}^{p} E_i D^h ℓ_i(x).

Applying this with h = 0, 1, 2 proves the lemma.

Since ‖H_i(x)‖ ≤ L ‖y^i − x‖ by the Lipschitz continuity of ∇²f, the following is a direct consequence of Lemma 4.5.

Corollary 4.6. Let f satisfy Assumption 2.1 for some convex set Ω, and let m(x) denote the quadratic function determined by weighted regression. Then, for any x ∈ Ω the following error bounds hold:
• |f(x) − m(x)| ≤ ∑_{i=0}^{p} ((L/2)‖y^i − x‖³ + |E_i|) |ℓ_i(x)|,
• ‖∇f(x) − ∇m(x)‖ ≤ ∑_{i=0}^{p} ((L/2)‖y^i − x‖³ + |E_i|) ‖∇ℓ_i(x)‖,
• ‖∇²f(x) − ∇²m(x)‖ ≤ ∑_{i=0}^{p} ((L/2)‖y^i − x‖³ + |E_i|) ‖∇²ℓ_i(x)‖.

Using this corollary, the following result provides error bounds between the function and the model in terms of the sample set radius.

Corollary 4.7. Let Y be poised, and let R = max_i ‖y^i − y^0‖. Suppose |E_i| ≤ ǫ for i = 0, . . . , p. If f satisfies Assumption 2.1 with Lipschitz constant L, then there exist constants Λ_1, Λ_2, and Λ_3, independent of R, such that for all x ∈ B(y^0; R),
• |f(x) − m(x)| ≤ Λ_1 √p_1 (4LR³ + ǫ),
• ‖∇f(x) − ∇m(x)‖ ≤ Λ_2 √p_1 (4LR² + ǫ/R),
• ‖∇²f(x) − ∇²m(x)‖ ≤ Λ_3 √p_1 (4LR + ǫ/R²).

Proof. Let ℓ̂_0(x), . . . , ℓ̂_p(x) be the Lagrange polynomials generated by the shifted and scaled set Ŷ, and let ℓ_0(x), . . . , ℓ_p(x) be the Lagrange polynomials generated by the set Y. By Lemma 4.4, for each x ∈ B(y^0; R), ℓ_i(x) = ℓ̂_i(x̂) for all i, where x̂ = (x − y^0)/R. Thus, ∇ℓ_i(x) = ∇ℓ̂_i(x̂)/R and ∇²ℓ_i(x) = ∇²ℓ̂_i(x̂)/R². Let ℓ̂(x) = [ℓ̂_0(x), . . . , ℓ̂_p(x)]^T, g(x) = [‖∇ℓ̂_0(x)‖, . . . , ‖∇ℓ̂_p(x)‖]^T and h(x) = [‖∇²ℓ̂_0(x)‖, . . . , ‖∇²ℓ̂_p(x)‖]^T.


By Corollary 4.6,

    |f(x) − m(x)| ≤ ∑_{i=0}^{p} ((L/2)‖y^i − x‖³ + |E_i|) |ℓ_i(x)|
                  ≤ ∑_{i=0}^{p} (4LR³ + ǫ) |ℓ_i(x)|      (since ‖y^i − x‖ ≤ 2R and |E_i| ≤ ǫ)
                  = (4LR³ + ǫ) ‖ℓ(x)‖_1
                  ≤ √p_1 (4LR³ + ǫ) ‖ℓ̂(x̂)‖      (since for x ∈ R^n, ‖x‖_1 ≤ √n ‖x‖_2).

Similarly,

    ‖∇f(x) − ∇m(x)‖ ≤ √p_1 (4LR² + ǫ/R) ‖g(x̂)‖

and

    ‖∇²f(x) − ∇²m(x)‖ ≤ √p_1 (4LR + ǫ/R²) ‖h(x̂)‖.

Setting Λ_1 = max_{x∈B(0;1)} ‖ℓ̂(x)‖, Λ_2 = max_{x∈B(0;1)} ‖g(x)‖, and Λ_3 = max_{x∈B(0;1)} ‖h(x)‖ yields the desired result.

Note the similarity between these error bounds and those in the definition of κ-fully quadratic models. If there is no computational error (or if the error is O(∆³)), κ-fully quadratic models (for some fixed κ) can be obtained by controlling the geometry of the sample set so that Λ_i √p_1, i = 1, 2, 3, are bounded by fixed constants and by controlling the trust region radius ∆ so that ∆/R is bounded. This motivates the definitions of Λ-poised and strongly Λ-poised in the weighted regression sense in the next section.

4.3. Λ-poisedness (in the weighted regression sense). In this section, we restrict our attention to the monomial basis φ defined in (3.3). In order to produce accurate model functions, the points in the sample set need to be distributed in such a way that the matrix M = M(φ, Y) is sufficiently well-conditioned. This is the motivation behind the following definitions of Λ-poised and strongly Λ-poised sets. These definitions are identical to [6, Definitions 4.7, 4.10] except that the Lagrange polynomials in the definitions are weighted regression Lagrange polynomials.

Definition 4.8. Let Λ > 0 and let B be a set in R^n. Let w = (w_0, . . . , w_p) be a vector of positive weights, Y = {y^0, . . . , y^p} be a poised set, and let ℓ_0, . . . , ℓ_p be the associated weighted regression Lagrange polynomials. Let ℓ(x) = (ℓ_0(x), · · · , ℓ_p(x))^T.
• Y is said to be Λ-poised in B (in the weighted regression sense) if and only if

    Λ ≥ max_{x∈B} max_{0≤i≤p} |ℓ_i(x)|.

• Y is said to be strongly Λ-poised in B (in the weighted regression sense) if and only if

    (q_1/√p_1) Λ ≥ max_{x∈B} ‖ℓ(x)‖.

Note that if the weights are all equal, the above definitions are equivalent to those for Λ-poised and strongly Λ-poised given in [6].
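The two conditions in Definition 4.8 can be checked approximately by sampling the ball; the sketch below (ours, not from the paper) estimates both maxima using the lagrange_coefficients and eval_lagrange helpers sketched earlier. Exact certification would require a global solver rather than Monte-Carlo sampling.

    import numpy as np

    def estimate_poisedness(Y, w, n_samples=5000, seed=0):
        """Monte-Carlo estimates of max_x max_i |ell_i(x)| and max_x ||ell(x)||
        over B(0;1); compare the latter with (q1/sqrt(p1)) * Lambda."""
        rng = np.random.default_rng(seed)
        n = Y.shape[1]
        A = lagrange_coefficients(Y, w)
        lam, lam_strong = 0.0, 0.0
        for _ in range(n_samples):
            x = rng.standard_normal(n)
            x *= rng.random() ** (1.0 / n) / np.linalg.norm(x)   # uniform in B(0;1)
            ell = eval_lagrange(A, x)
            lam = max(lam, float(np.abs(ell).max()))
            lam_strong = max(lam_strong, float(np.linalg.norm(ell)))
        return lam, lam_strong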

We are naturally interested in using these weighted regression Lagrange polynomials to form models that are guaranteed to sufficiently approximate f. Let Y^k, ∆_k, and R_k denote the sample set, trust region radius, and sample set radius at iteration k as defined at the beginning of §4.2. Assume that ∆_k/R_k is bounded. If the number of sample points is bounded, it can be shown, using Corollary 4.7, that if Y^k is Λ-poised for all k, then the corresponding model functions are κ-fully quadratic (assuming no computational error, or that the computational error is O(∆³)). When the number of sample points is not bounded, Λ-poisedness is not enough. In the following, we show that if Y^k is strongly Λ-poised for all k, then the corresponding models are κ-fully quadratic, regardless of the number of points in Y^k.

Lemma 4.9. Let M = M(φ, Y). If ‖W(M^T W)^+‖ ≤ √(q_1/p_1) Λ, then Y is strongly Λ-poised in B(0; 1) in the weighted regression sense, with respect to the weights w. Conversely, if Y is strongly Λ-poised in B(0; 1) in the weighted regression sense, then

    ‖W(M^T W)^+‖ ≤ θ (q_1/√p_1) Λ,

where θ > 0 is a fixed constant dependent only on n (but independent of Y and Λ).

Proof. Let A = (WM)^+ W and ℓ(x) = (ℓ_0(x), . . . , ℓ_p(x))^T. By Lemma 4.2, ℓ(x) = A^T φ(x). It follows that for any x ∈ B(0; 1),

    ‖ℓ(x)‖ = ‖A^T φ(x)‖ ≤ ‖A‖ ‖φ(x)‖ ≤ (√(q_1/p_1) Λ)(√q_1 ‖φ(x)‖_∞) ≤ (q_1/√p_1) Λ.

(For the last inequality, we used the fact that max_{x∈B(0;1)} ‖φ(x)‖_∞ ≤ 1.)

To prove the converse, let UΣV^T = A^T be the reduced singular value decomposition of A^T. Note that U and V are p_1 × q_1 and q_1 × q_1 matrices, respectively, with orthonormal columns; Σ is a q_1 × q_1 diagonal matrix whose diagonal entries are the singular values of A^T. Let σ_1 be the largest singular value, with v^1 the corresponding column of V. As shown in the proof of [4, Theorem 2.9], there exists a constant γ > 0 such that for any unit vector v, there exists an x ∈ B(0; 1) such that |v^T φ(x)| ≥ γ. Therefore, since ‖v^1‖ = 1, there is an x ∈ B(0; 1) such that |(v^1)^T φ(x)| ≥ γ. Let v^⊥ be the orthogonal projection of φ(x) onto the subspace orthogonal to v^1, so that φ(x) = ((v^1)^T φ(x)) v^1 + v^⊥. Note that ΣV^T v^1 and ΣV^T v^⊥ are orthogonal vectors. Note also that for any vector z, ‖UΣV^T z‖ = ‖ΣV^T z‖ (since U has orthonormal columns). It follows that

    ‖ℓ(x)‖ = ‖A^T φ(x)‖ = ‖ΣV^T φ(x)‖ = ‖ΣV^T v^⊥ + ((v^1)^T φ(x)) ΣV^T v^1‖
           ≥ |(v^1)^T φ(x)| ‖ΣV^T v^1‖ ≥ γ ‖ΣV^T v^1‖ = γ ‖Σ e_1‖ = γ ‖A‖.

Thus, ‖A‖ ≤ max_{x∈B(0;1)} ‖ℓ(x)‖/γ ≤ (q_1/(γ√p_1)) Λ, which proves the result with θ = 1/γ.

We can now prove that models generated by weighted regression Lagrange polynomials are κ-fully quadratic.

Proposition 4.10. Let f satisfy Assumption 2.1 and let Λ > 0 be fixed. There exists a vector κ = (κ_ef, κ_eg, κ_eh, 0) such that for any y^0 ∈ S and ∆ ≤ ∆_max, if
1. Y = {y^0, . . . , y^p} ⊂ B(y^0; ∆) is strongly Λ-poised in B(y^0; ∆) in the weighted regression sense with respect to positive weights w = (w_0, . . . , w_p), and
2. the computational error |E_i| is bounded by C∆³, where C is a fixed constant,
then the corresponding model function m is κ-fully quadratic.

Proof. Let x̂, ℓ̂(·), g(·), h(·), Λ_1, Λ_2, and Λ_3 be as defined in the proof of Corollary 4.7. Let M = M(φ, Ŷ) and W = diag(w). By Lemma 4.2, ℓ̂(x) = A^T φ(x), where A = (WM)^+ W. By Lemma 4.9, ‖A‖ ≤ θ(q_1/√p_1)Λ, where θ is a fixed constant. It follows that

    Λ_1 = max_{x∈B(0;1)} ‖ℓ̂(x)‖ ≤ max_{x∈B(0;1)} ‖A‖ ‖φ(x)‖ ≤ c_1 θ (q_1/√p_1) Λ,

where c_1 = max_{x∈B(0;1)} ‖φ(x)‖ is a constant independent of Y. Similarly,

    Λ_2 = max_{x∈B(0;1)} ‖g(x)‖ = max_{x∈B(0;1)} ‖(‖∇ℓ̂_0(x)‖, · · · , ‖∇ℓ̂_p(x)‖)‖ = max_{x∈B(0;1)} ‖A^T ∇φ(x)‖_F
        ≤ √q_1 max_{x∈B(0;1)} ‖A^T ∇φ(x)‖ ≤ √q_1 max_{x∈B(0;1)} ‖A‖ ‖∇φ(x)‖ ≤ c_2 θ (q_1^{3/2}/√p_1) Λ,

where c_2 = max_{x∈B(0;1)} ‖∇φ(x)‖ is independent of Y.

To bound Λ_3, let J_{s,t} denote the unique index j such that x_s and x_t both appear in the quadratic monomial φ_j(x). For example, J_{1,1} = n + 2, J_{1,2} = J_{2,1} = n + 3, etc. Observe that

    [∇²φ_j(x)]_{s,t} = 1 if j = J_{s,t}, and 0 otherwise.

It follows that ∇²ℓ̂_i(x) = ∑_{j=0}^{q} A^T_{i,j} ∇²φ_j(x) is the n × n matrix whose (s, t) entry is A^T_{i,J_{s,t}}. We conclude that ‖∇²ℓ̂_i(x)‖ ≤ ‖∇²ℓ̂_i(x)‖_F ≤ √2 ‖A^T_{i,·}‖. Thus,

    Λ_3 = max_{x∈B(0;1)} ‖h(x)‖ = max_{x∈B(0;1)} ‖(‖∇²ℓ̂_0(x)‖, · · · , ‖∇²ℓ̂_p(x)‖)‖
        ≤ √2 (∑_{i=0}^{p} ‖A^T_{i,·}‖²)^{1/2} = √2 ‖A‖_F ≤ √(2q_1) ‖A‖ ≤ √2 θ (q_1^{3/2}/√p_1) Λ.

By assumption, the computational error |E_i| is bounded by ǫ = C∆³. So, by Corollary 4.7, for all x ∈ B(y^0; ∆),
• |f(x) − m(x)| ≤ √p_1 (4L + C)∆³ Λ_1 ≤ c_1 θ q_1 Λ (4L + C)∆³ = κ_ef ∆³,
• ‖∇f(x) − ∇m(x)‖ ≤ √p_1 (4L + C)∆² Λ_2 ≤ c_2 θ q_1^{3/2} Λ (4L + C)∆² = κ_eg ∆²,
• ‖∇²f(x) − ∇²m(x)‖ ≤ √p_1 (4L + C)∆ Λ_3 ≤ √2 θ q_1^{3/2} Λ (4L + C)∆ = κ_eh ∆,
where κ_ef = c_1 θ q_1 Λ(4L + C), κ_eg = c_2 θ q_1^{3/2} Λ(4L + C), and κ_eh = √2 θ q_1^{3/2} Λ(4L + C). Thus, m(x) is (κ_ef, κ_eg, κ_eh, 0)-fully quadratic, and since these constants are independent of y^0 and ∆, the result is proven.

The final step in establishing that we have a fully quadratic class of models is to define an algorithm that produces a strongly Λ-poised sample set in a finite number of steps.


Proposition 4.11. Let Y = {y^0, . . . , y^p} ⊂ R^n be a set of points in the unit ball B(0; 1) such that ‖y^j‖ = 1 for at least one j. Let w = (w_0, . . . , w_p)^T be a vector of positive weights. If Y is strongly Λ-poised in B(0; 1) in the sense of unweighted regression, then there exists a constant θ̂ > 0, independent of Y, Λ and w, such that Y is strongly (cond(W) θ̂ Λ)-poised in the weighted regression sense.

Proof. Let M = M(φ, Y), where φ is the monomial basis. By Lemma 4.9 (applied with unit weights), ‖M^+‖ ≤ θ q_1 Λ/√p_1, where θ is a constant independent of Y and Λ. Thus,

    ‖W(M^T W)^+‖ ≤ cond(W) ‖M^+‖ ≤ cond(W) θ (q_1/√p_1) Λ.

The result follows with θ̂ = θ√q_1.

The significance of this proposition is that any model improvement algorithm for unweighted regression can be used for weighted regression to ensure the same global convergence properties, provided cond(W) is bounded. For the model improvement algorithm described in the following section, this requirement is satisfied by bounding the weights away from zero while keeping the largest weight equal to 1.
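In an implementation, the boundedness of cond(W) can be enforced directly when the weights are formed; a minimal sketch (ours, not from the paper), with the floor w_min being an assumed user-chosen parameter:

    import numpy as np

    def normalize_weights(w, w_min=1e-3):
        """Scale so the largest weight equals 1 and bound the others away from
        zero, giving cond(W) = max(w)/min(w) <= 1/w_min."""
        w = np.asarray(w, dtype=float) / np.max(w)
        return np.maximum(w, w_min)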

In practice, we need not ensure Λ-poisedness of Y^k at every iterate to guarantee that the algorithm converges to a second-order minimum. Rather, Λ-poisedness only needs to be enforced as the algorithm stagnates.

5. Model Improvement Algorithm. This section describes a model improvement algorithm (MIA) for regression which, by the preceding section, can also be used for weighted regression to ensure that the sample sets are strongly Λ-poised for some fixed Λ (which is not necessarily known). The algorithm is based on the following observation, which is a straightforward extension of [6, Theorem 4.11].

The MIA presented in [6] makes assumptions (such as all points must lie within B(y^0; ∆)) to simplify the theory. We resist such assumptions to account for practical concerns (points which lie outside of B(y^0; ∆)) that arise in the algorithm.

Proposition 5.1. If the shifted and scaled sample set Ŷ of p_1 points contains l = ⌊p_1/q_1⌋ disjoint subsets of q_1 points, each of which is Λ-poised in B(0; 1) (in the interpolation sense), then Ŷ is strongly √((l+1)/l) Λ-poised in B(0; 1) (in the regression sense).

Proof. Let Ŷ_j = {ŷ^{0j}, ŷ^{1j}, . . . , ŷ^{qj}}, j = 1, . . . , l, be the disjoint Λ-poised subsets of Ŷ, and let Ŷ_r be the remaining points in Ŷ. Let λ^j_i, i = 0, . . . , q, be the (interpolation) Lagrange polynomials for the set Ŷ_j. As noted in [6], for any x ∈ R^n,

    ∑_{i=0}^{q} λ^j_i(x) φ(ŷ^{ij}) = φ(x),   j = 1, . . . , l.

Dividing each of these equations by l and summing yields

    ∑_{j=1}^{l} ∑_{i=0}^{q} (1/l) λ^j_i(x) φ(ŷ^{ij}) = φ(x).    (5.1)

Let λ^j(x) = (λ^j_0(x), · · · , λ^j_q(x))^T, and let λ̄ ∈ R^{p_1} be formed by concatenating the λ^j(x), j = 1, · · · , l, and a zero vector of length |Ŷ_r|, and then dividing every entry by l. By (5.1), λ̄ is a solution to the equation

    ∑_{i=0}^{p} λ_i φ(ŷ^i) = φ(x).    (5.2)

Since Ŷ_j is Λ-poised in B(0; 1), for any x ∈ B(0; 1),

    ‖λ^j(x)‖ ≤ √q_1 ‖λ^j(x)‖_∞ ≤ √q_1 Λ.

Thus,

    ‖λ̄‖ ≤ (√l/l) max_j ‖λ^j(x)‖ ≤ √(q_1/l) Λ ≤ √((l+1)/l) √(q_1/(p_1/q_1)) Λ = √((l+1)/l) (q_1/√p_1) Λ.

Let ℓ_i(x), i = 0, · · · , p, be the regression Lagrange polynomials for the complete set Ŷ. As observed in [6], ℓ(x) = (ℓ_0(x), · · · , ℓ_p(x))^T is the minimum norm solution to (5.2). Thus,

    ‖ℓ(x)‖ ≤ ‖λ̄‖ ≤ √((l+1)/l) (q_1/√p_1) Λ.

Since this holds for all x ∈ B(0; 1), Ŷ is strongly √((l+1)/l) Λ-poised in B(0; 1).

Based on this observation, and noting that (l+1)/l ≤ 2 for l ≥ 1, we adopt the following strategy for improving a shifted and scaled regression sample set Ŷ ⊂ B(0; 1):
1. If Ŷ contains l ≥ 1 Λ-poised subsets with at most q_1 points left over, Ŷ is strongly √2 Λ-poised.
2. Otherwise, if Ŷ contains at least one Λ-poised subset, save as many Λ-poised subsets as possible, plus at most q_1 additional points from Ŷ, discarding the rest.
3. Otherwise, add additional points to Ŷ in order to create a Λ-poised subset. Keep this subset, plus at most q_1 additional points from Ŷ.

To implement this strategy, we first describe an algorithm that attempts to find a Λ-poised subset of Y. To discuss the algorithm we introduce the following definition:

Definition 5.2. A set Y ⊂ B is said to be Λ-subpoised in a set B if there exists a superset Z ⊇ Y that is Λ-poised in B with |Z| = q_1.

Given a sample set Y ⊂ B(0; 1) (not necessarily shifted and scaled) and a radius ∆, the algorithm below selects a Λ-subpoised subset Y_new ⊂ Y containing as many points as possible. If |Y_new| = q_1, then Y_new is Λ-poised in B(0; ∆) for some fixed Λ. Otherwise, the algorithm determines a new point y_new ∈ B(0; ∆) such that Y_new ∪ {y_new} is Λ-subpoised in B(0; ∆).

Algorithm FindSet. (Finds a Λ-subpoised set)
Input: A sample set Y ⊂ B(0; 1) and a trust region radius ∆ ∈ [√ξ_acc, 1], for fixed parameter ξ_acc > 0.
Output: A set Y_new ⊂ Y that is Λ-poised in B(0; ∆); or a Λ-subpoised set Y_new ⊂ B(0; ∆) and a new point y_new ∈ B(0; ∆) such that Y_new ∪ {y_new} is Λ-subpoised in B(0; ∆).

Step 0 (Initialization): Initialize the pivot polynomial basis to the monomial basis: u_i(x) = φ_i(x), i = 0, . . . , q. Set Y_new = ∅. Set i = 0.
Step 1 (Point Selection): If possible, choose j_i ∈ {i, . . . , |Y| − 1} such that

    |u_i(y^{j_i})| ≥ ξ_acc (threshold test).

If such an index is found, add the corresponding point to Y_new and swap the positions of points y^i and y^{j_i} in Y.
Otherwise, compute y_new = argmax_{x∈B(0;∆)} |u_i(x)|, and exit, returning Y_new and y_new.
Step 2 (Gaussian Elimination): For j = i + 1, . . . , |Y| − 1,

    u_j(x) = u_j(x) − (u_j(y^i)/u_i(y^i)) u_i(x).

If i < q, set i = i + 1 and go to Step 1.
Exit, returning Y_new.
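For concreteness, the sketch below (ours, not from the paper) implements the pivoting idea of Algorithm FindSet with numpy, representing each pivot polynomial u_i by its coefficient vector in the monomial basis and approximating the argmax over B(0; ∆) by random sampling; design_matrix is the helper sketched in §3, and the simple point-selection rule here just takes the first index passing the threshold test.

    import numpy as np

    def find_set(Y, delta, xi_acc=1e-4, n_samples=2000, seed=0):
        """Sketch of Algorithm FindSet: returns (Y_new, y_new), where y_new is
        None if a full set of q1 pivots was found within Y."""
        rng = np.random.default_rng(seed)
        p1, n = Y.shape
        q1 = (n + 1) * (n + 2) // 2
        Y = Y.copy()
        U = np.eye(q1)                      # row i holds coefficients of u_i
        Phi = design_matrix(Y)              # phi_j(y^i) values, (p+1) x q1
        selected = []

        for i in range(q1):
            vals = Phi @ U[i]               # u_i evaluated at all sample points
            candidates = [j for j in range(i, p1) if abs(vals[j]) >= xi_acc]
            if not candidates:
                # approximate argmax_{x in B(0,delta)} |u_i(x)| by sampling
                X = rng.standard_normal((n_samples, n))
                X *= (rng.random((n_samples, 1)) ** (1.0 / n)
                      / np.linalg.norm(X, axis=1, keepdims=True)) * delta
                scores = np.abs(design_matrix(X) @ U[i])
                return Y[selected], X[np.argmax(scores)]
            j = candidates[0]               # simplest rule: first acceptable pivot
            Y[[i, j]] = Y[[j, i]]           # swap points i and j
            Phi[[i, j]] = Phi[[j, i]]
            selected.append(i)
            piv = Phi[i] @ U[i]
            for k in range(i + 1, q1):      # eliminate u_i from later pivots
                U[k] -= (Phi[i] @ U[k]) / piv * U[i]
        return Y[selected], None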

The algorithm, which is modeled after Algorithms 6.5 and 6.6 in [6], applies Gaussian elimination with a threshold test to form a basis of pivot polynomials {u_i(x)}. As discussed in [6], at the completion of the algorithm, the values u_i(y^i), y^i ∈ Y_new, are exactly the diagonal entries of the diagonal matrix D in the LDU factorization of M = M(φ, Y_new). If |Y_new| = q_1, M is a square matrix. In this case, since |u_i(y^i)| ≥ ξ_acc,

    ‖M^{−1}‖ ≤ √q_1 ξ_growth/ξ_acc,    (5.3)

where ξ_growth is the growth factor for the factorization (see [13]).

where ξgrowth is the growth factor for the factorization (see [13]).Point Selection. The point selection rule allows flexibility in how an acceptable

point is chosen. For example, to keep the growth factor down, one could choose theindex ji that maximizes |ui(y

j)| (which corresponds to Gaussian elimination withpartial pivoting). But in practice, it is often better to select points according totheir proximity to the current iterate. In our implementation, we balance these twocriteria by choosing the index that maximizes |ui(y

j)|/d3j , over j ≥ i, where dj =

max1,∥

∥yj∥

∥ /∆. If all sample points are contained in B(0; ∆), then dj = 1 for all j.In this case, the point selection rule is identical to the one used in Algorithm 6.6 of [6](with the addition of the threshold test). When Y contains points outside B(0; ∆),the corresponding values of dj are greater than 1, so the point selection rule gives

preference to points that are within B(0; ∆).The theoretical justification for our revised point selection rule comes from exam-

ining the error bounds in Corollary 4.6. For a given point x in B(0; ∆), each sample

point yi makes a contribution to the error bound that is proportional to∥

∥yi − x∥

3

(assuming the computational error is relatively small). Since x can be anywhere in the

trust region, this suggests modifying the point selection rule to maximize|ui(y

ji)|d3ji

,

where dj = maxx∈B(0;∆)

∥yj − x∥

∥ /∆ =∥

∥yj∥

∥ /∆+ 1. To simplify analysis, we modifythis formula so that all points inside the trust region are treated equally, resulting inthe formula dj = max(1,

∥yj∥

∥ /∆).Lemma 5.3. Suppose Algorithm FindSet returns a set Ynew with |Ynew| = q1.

Lemma 5.3. Suppose Algorithm FindSet returns a set Y_new with |Y_new| = q_1. Then Y_new is Λ-poised in B(0; ∆) for some Λ, which is proportional to (ξ_growth/ξ_acc) max{1, ∆²/2, ∆}, where ξ_growth is the growth factor for the Gaussian elimination.


Proof. Let M = M(φ, Y_new). By (5.3), ‖M^{−1}‖ ≤ √q_1 ξ_growth/ξ_acc. Let ℓ(x) = (ℓ_0(x), . . . , ℓ_q(x))^T be the vector of interpolation Lagrange polynomials for the sample set Y_new. For any x ∈ B(0; ∆),

    ‖ℓ(x)‖_∞ ≤ ‖ℓ(x)‖ = ‖M^{−T} φ(x)‖ ≤ ‖M^{−1}‖ ‖φ(x)‖ ≤ √q_1 ‖M^{−1}‖ ‖φ(x)‖_∞
             ≤ (q_1 ξ_growth/ξ_acc) ‖φ(x)‖_∞ ≤ (q_1 ξ_growth/ξ_acc) max{1, ∆²/2, ∆}.

Since this inequality holds for all x ∈ B(0; ∆), Y_new is Λ-poised for Λ = (q_1 ξ_growth/ξ_acc) max{1, ∆²/2, ∆}, which establishes the result.

In general, the growth factor in the above lemma depends on the matrix M and the threshold ξ_acc. Because of the threshold test, it is possible to establish a bound on the growth factor that is independent of M. So we can claim that the algorithm selects a Λ-poised set for a fixed Λ that is independent of Y. However, the bound is extremely large, so it is not very useful. Nevertheless, in practice ξ_growth is quite reasonable, so Λ tends to be proportional to max{1, ∆²/2, ∆}/ξ_acc.

In the case where the threshold test is not satisfied, Algorithm FindSet determines a new point y_new by maximizing |u_i(x)| over B(0; ∆). In this case, we need to show that the new point would satisfy the threshold test. The following lemma shows that this is possible, provided ξ_acc is small enough. The proof is modeled after the proof of [6, Lemma 6.7].

Lemma 5.4. Let v^T φ(x) be a quadratic polynomial of degree 2, where ‖v‖_∞ = 1. Then

    max_{x∈B(0;∆)} |v^T φ(x)| ≥ min{1, ∆²}/4.

Proof. Since ‖v‖_∞ = 1, at least one of the coefficients of q(x) = v^T φ(x) is 1, −1, 1/2, or −1/2.

Looking at the case where the largest coefficient is 1 or 1/2 (−1 and −1/2 are similarly proven), either this coefficient corresponds to the constant term, a linear term x_i, or a quadratic term x_i²/2 or x_i x_j. Restrict all variables not appearing in the term corresponding to the largest coefficient to zero.
• If q(x) = 1, then the lemma trivially holds.
• If q(x) = x_i²/2 + a x_i + b, let x̄ = ∆e_i ∈ B(0; ∆). Then

    q(x̄) = ∆²/2 + ∆a + b,   q(−x̄) = ∆²/2 − ∆a + b,   and   q(0) = b.

If |q(−x̄)| ≥ ∆²/4 or |q(x̄)| ≥ ∆²/4, the result is shown. Otherwise, −∆²/4 < q(x̄) < ∆²/4 and −∆²/4 < q(−x̄) < ∆²/4. Adding these together yields −∆²/2 < ∆² + 2b < ∆²/2. Therefore b < ∆²/4 − ∆²/2 = −∆²/4, and therefore |q(0)| > ∆²/4.
• If q(x) = a x_i²/2 + x_i + b, then let x̄ = ∆e_i, yielding q(x̄) = ∆ + a∆²/2 + b and q(−x̄) = −∆ + a∆²/2 + b. Then

    max{|q(−x̄)|, |q(x̄)|} = max{|−∆ + α|, |∆ + α|} ≥ ∆ ≥ min{1, ∆²}/4,

where α = a∆²/2 + b.


• If q(x) = a x_i²/2 + b x_j²/2 + x_i x_j + c x_i + d x_j + e, we consider 4 points on B(0; ∆):

    y^1 = (∆/√2)(e_i + e_j),  y^2 = (∆/√2)(e_i − e_j),  y^3 = (∆/√2)(−e_i + e_j),  y^4 = (∆/√2)(−e_i − e_j).

Then

    q(y^1) = a∆²/4 + b∆²/4 + ∆²/2 + c∆/√2 + d∆/√2 + e,
    q(y^2) = a∆²/4 + b∆²/4 − ∆²/2 + c∆/√2 − d∆/√2 + e,
    q(y^3) = a∆²/4 + b∆²/4 − ∆²/2 − c∆/√2 + d∆/√2 + e,
    q(y^4) = a∆²/4 + b∆²/4 + ∆²/2 − c∆/√2 − d∆/√2 + e.

Note that q(y^1) − q(y^2) = ∆² + 2d∆/√2 and q(y^3) − q(y^4) = −∆² + 2d∆/√2. There are two cases:
1. If d ≥ 0, then q(y^1) − q(y^2) ≥ ∆², so either |q(y^1)| ≥ ∆²/2 or |q(y^2)| ≥ ∆²/2 ≥ min{1, ∆²}/4.
2. If d < 0, then a similar study of q(y^3) − q(y^4) proves the result.

Lemma 5.5. Suppose ξ_acc ≤ min{1, ∆²}/4. If Algorithm FindSet exits during the point selection step, then Y_new ∪ {y_new} is Λ-subpoised in B(0; ∆) for some fixed Λ, which is proportional to (ξ_growth/ξ_acc) max{1, ∆²/2, ∆}, where ξ_growth is the growth parameter for the Gaussian elimination.

Proof. Consider a modified version of Algorithm FindSet that does not exit in the point selection step. Instead, y^i is replaced by y_new and y_new is added to Y_new. This modified algorithm will always return a set consisting of q_1 points. Call this set Z. Let Y_new and y_new be the output of the unmodified algorithm, and observe that Y_new ∪ {y_new} ⊂ Z.

To show that Y_new ∪ {y_new} is Λ-subpoised, we show that Z is Λ-poised in B(0; ∆). From the Gaussian elimination, after k iterations of the algorithm, the (k+1)st pivot polynomial u_k(x) can be expressed as (v^k)^T φ(x), where v^k = (v_0, . . . , v_{k−1}, 1, 0, . . . , 0). Observe that ‖v^k‖_∞ ≥ 1, and let v = v^k/‖v^k‖_∞. By Lemma 5.4,

    max_{x∈B(0;∆)} |u_k(x)| = max_{x∈B(0;∆)} |(v^k)^T φ(x)| = ‖v^k‖_∞ (max_{x∈B(0;∆)} |v^T φ(x)|)
                            ≥ (min{1, ∆²}/4) ‖v^k‖_∞ ≥ min{1, ∆²}/4 ≥ ξ_acc.

It follows that each time a new point is chosen in the point selection step, that point will satisfy the threshold test. Thus, the set Z returned by the modified algorithm will include q_1 points, all of which satisfy the threshold test. By Lemma 5.3,


Z is Λ-poised, with Λ proportional to (ξ_growth/ξ_acc) max{1, ∆²/2, ∆}. It follows that Y_new ∪ {y_new} is Λ-subpoised.

We are now ready to state our model improvement algorithm for regression. Prior to calling this algorithm, we discard all points in Y with distance greater than ∆/√ξ_acc from y^0. We then form the shifted and scaled set Ŷ by the transformation ŷ^j = (y^j − y^0)/d, where d = max_{y^j∈Y} ‖y^j − y^0‖, and scale the trust region radius accordingly (i.e., ∆̂ = ∆/d). This ensures that ∆̂ = ∆/d ≥ ∆/(∆/√ξ_acc) = √ξ_acc. After calling the algorithm, we reverse the shift and scale transformation.
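A small sketch (ours, not from the paper) of this preprocessing step: discard far points, then shift and scale both the sample set and the radius.

    import numpy as np

    def prepare_for_mia(Y, y0, delta, xi_acc):
        """Drop points farther than delta/sqrt(xi_acc) from y0, then shift and
        scale so the remaining set lies in B(0;1); the returned scaled radius
        is at least sqrt(xi_acc)."""
        keep = np.linalg.norm(Y - y0, axis=1) <= delta / np.sqrt(xi_acc)
        Y = Y[keep]
        d = np.max(np.linalg.norm(Y - y0, axis=1))
        return (Y - y0) / d, delta / d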

Algorithm MIA. (Model Improvement for Regression)
Input: A shifted and scaled sample set Y ⊂ B(0; 1) and a trust region radius ∆ ≥ √ξ_acc, for fixed ξ_acc ∈ (0, 1/(4r²)), where r ≥ 1 is a fixed parameter.
Output: A modified set Y′ with improved poisedness on B(0; ∆).

Step 0 (Initialization): Remove the point in Y farthest from y^0 = 0 if it is outside B(0; r∆). Set Y′ = ∅.
Step 1 (Find Poised Subset): Use Algorithm FindSet either to identify a Λ-poised subset Y_new ⊂ Y, or to identify a Λ-subpoised subset Y_new ⊂ Y and one additional point y_new ∈ B(0; ∆) such that Y_new ∪ {y_new} is Λ-subpoised on B(0; ∆).
Step 2 (Update Set): If Y_new is Λ-poised, add it to Y′ and remove Y_new from Y. Remove all points from Y that are outside of B(0; r∆).
Otherwise, if Y′ = ∅, set Y′ = Y_new ∪ {y_new} plus q_1 − |Y_new| − 1 other points from Y; otherwise, set Y′ = Y′ ∪ Y_new plus q_1 − |Y_new| additional points from Y. Set Y = ∅.
Step 3: If |Y| ≥ q_1, go to Step 1.
Step 4: Set Y′ = Y′ ∪ Y_new.

In Algorithm MIA, if every call to Algorithm FindSet yields a Λ-poised set Y_new, then eventually all points in Y will be included in Y′. In this case, the algorithm has verified that Y contains ℓ = ⌊p_1/q_1⌋ Λ-poised sets. By Proposition 5.1, Y is strongly √((ℓ+1)/ℓ) Λ-poised in B(0; 1).

If the first call to FindSet fails to identify a Λ-poised subset, the algorithm improves the sample set by adding a new point y_new and by removing points so that the output set Y′ contains at most q_1 points. In this case the output set contains the Λ-subpoised set Y_new ∪ {y_new}. Thus, if the algorithm is called repeatedly, with Y replaced by Y′ after each call, eventually Y′ will contain a Λ-poised subset and will be strongly 2Λ-poised, by Proposition 5.1.

If Y fails to be Λ-poised after the second or later call to FindSet, no new points are added. Instead, the sample set is improved by removing points from Y so that the output set Y′ consists of all the Λ-poised subsets identified by FindSet, plus up to q_1 additional points. The resulting set is then strongly ((ℓ+1)/ℓ)Λ-poised, where ℓ = ⌊|Y_new|/q_1⌋.

Trust region scale factor. The trust region scale factor r was suggested in [6, Section 11.2], although implementation details were omitted. The scale factor determines which points are allowed to remain in the sample set. Each call to Algorithm MIA removes a point from outside B(0; r∆) if such a point exists. Thus, if the algorithm is called repeatedly with Y replaced by Y′ each time, eventually all points in the sample set will lie in the region B(0; r∆). Using a scale factor r > 1 can improve the efficiency of the algorithm. To see this, observe that if r = 1, a slight movement of the trust region center may result in previously “good” points lying just outside of B(y0; ∆). These points would then be unnecessarily removed from Y.

To justify this approach, suppose that Y is strongly Λ-poised in B(0; ∆). By Proposition 4.10, the associated model function m is κ-fully quadratic for some fixed vector κ, which depends on Λ. If instead Y has points outside of B(0; ∆), we can show (by a simple modification of the proof of Proposition 4.10) that the model function is R³κ-fully quadratic, where R = max_i ∥yi − y0∥. Thus, if Y ⊂ B(0; r∆) for some fixed r ≥ 1, then calling the model improvement algorithm will result in a model function m that is κ̄-fully quadratic with respect to a different (but still fixed) κ̄ = r³κ. In this case, however, whenever new points are added during the model improvement algorithm, they are always chosen within the original trust region B(0; ∆).

The discussion above demonstrates that Algorithm MIA satisfies the requirements of a model improvement algorithm specified in Definition 2.2. This algorithm is used in the CSV2 framework described in Section 2 as follows:

• In Step 1 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is certified to be κ-fully quadratic.

• In Step 4 of Algorithm CSV2, Algorithm MIA is called once. If no change is made to the sample set, the model is κ-fully quadratic. Otherwise, the sample set has been modified to improve the model.

• In Algorithm CriticalityStep, Algorithm MIA is called repeatedly to improve the model until it is κ-fully quadratic.

In our implementation, we modified Algorithm CriticalityStep to improve efficiency by introducing an additional exit criterion. Specifically, after each call to the model improvement algorithm, ϑm_k is tested. If ϑm_k > ǫc, then xk is no longer a second-order stationary point of the model function, so we exit the criticality step.

6. Computational Results. As shown in the previous section, the CSV2 framework using weighted quadratic regression converges to a second-order stationary point provided the ratio between the largest and smallest weight is bounded. This leaves much leeway in the choice of the weights. We now describe a heuristic strategy based on the error bounds derived in §4.

6.1. Using Error Bounds to Choose Weights. Intuitively, the models used throughout our algorithm will be most effective if the weights are chosen so that m(x) is as accurate as possible, in the sense that it agrees with the second-order Taylor approximation of f(x) around the current trust region center y0. That is, we want to estimate the quadratic function

    q(x) = f(y0) + ∇f(y0)T (x − y0) + (1/2)(x − y0)T ∇²f(y0)(x − y0).

If f(x) happens to be a quadratic polynomial, then

    fi = q(yi) + Ei.

If the errors Ei are uncorrelated random variables with zero mean and finite variances σi², i = 0, . . . , p, then the best linear unbiased estimator of the polynomial q(x) is given by m(x) = φ(x)Tα, where α solves (3.4) with the ith weight proportional to 1/σi [23, Theorem 4.4]. This is intuitively appealing, since each sample point then has the same expected contribution to the weighted sum of squared errors.
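As an illustration, such a weighted fit can be computed in a few lines of MATLAB. The sketch below is only an assumed rendering of the weighted least-squares problem (3.4), not the paper's implementation: Phi denotes the matrix whose ith row is φ(yi)T, f the vector of computed values fi, sigma the vector of standard deviations σi, and phi a handle returning the quadratic basis vector φ(x).

    % Best linear unbiased fit: weight each residual by 1/sigma_i.
    w     = 1 ./ sigma(:);          % ith weight proportional to 1/sigma_i
    W     = diag(w);
    alpha = (W*Phi) \ (W*f(:));     % weighted least-squares solution
    m     = @(x) phi(x).' * alpha;  % resulting model m(x) = phi(x)'*alpha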


When f(x) is not a quadratic function, the errors depend not just on the computational error, but also on the distances from each point to y0. In the particular case when x = y0, the first three terms of (4.2) are the quadratic function q(yi). Thus, the error between the computed function value and q(yi) is given by

    fi − q(yi) = (1/2)(yi − y0)T Hi(y0)(yi − y0) + Ei,    (6.1)

where Hi(y0) = ∇²f(ηi(y0)) − ∇²f(y0) for some point ηi(y0) on the line segment connecting y0 and yi.

We shall refer to the first term on the right as the Taylor error and the second term on the right as the computational error. By Assumption 2.1, ∥Hi(y0)∥ ≤ L∥yi − y0∥. This leads us to the following heuristic argument for choosing the weights: Suppose that Hi(y0) is a random symmetric matrix such that the standard deviation of ∥Hi(y0)∥ is proportional to L∥yi − y0∥. Then the Taylor error will have standard deviation proportional to L∥yi − y0∥³. Assuming the computational error is independent of the Taylor error, the total error fi − q(yi) will have standard deviation √((ζL∥yi − y0∥³)² + σi²), where ζ is a proportionality constant and σi is the standard deviation of Ei. This leads to the following formula for the weights:

    wi ∝ 1/(ζ²L²∥yi − y0∥⁶ + σi²).

Of course, this formula depends on knowing ζ, L, and σi. If L, σi, and/or ζ are not known, this formula could still be used in conjunction with some strategy for estimating them (for example, based upon the accuracy of the model functions at known points). Alternatively, ζ and L can be combined into a single parameter C that could be chosen using some type of adaptive strategy:

    wi ∝ 1/(C∥yi − y0∥⁶ + σi²).

If the computational errors have equal variances σi² = σ², the formula can be further simplified to

    wi ∝ 1/(C̄∥yi − y0∥⁶ + 1),    (6.2)

where C̄ = C/σ².

An obvious flaw in the above development is that the errors fi − q(yi) are not uncorrelated. Additionally, the assumption that ∥Hi(y0)∥ is proportional to L∥yi − y0∥ is valid only for limited classes of functions. Nevertheless, based on our computational experiments, (6.2) appears to provide a sensible strategy for balancing differing levels of computational uncertainty with the Taylor error.

6.2. Benchmark Performance. To study the impact of weighted regression, we developed MATLAB implementations of three quadratic model-based trust region algorithms using interpolation, regression, and weighted regression, respectively, to construct the quadratic model functions. To the extent possible, the differences between these algorithms were minimized, with code shared whenever possible. Obviously, all three methods use different strategies for constructing the model from the sample set. Beyond that, the only difference is that the two regression methods use the model improvement algorithm described in Section 5, whereas the interpolation algorithm uses the model improvement algorithm described in [6, Algorithm 6.6].

We compared the three algorithms using the suite of test problems for benchmarking derivative-free optimization algorithms made available by Moré and Wild [19]. We ran our tests on three types of problems from this test suite: smooth problems (with no noise), piecewise-smooth problems, and problems with deterministic noise. The suite also includes stochastically noisy functions, but we did not consider these. Doing so would require significantly modifying the algorithm, for example by sampling points multiple times or by removing excessively “old” points that appear locally optimal because of a large-magnitude negative noise realization. We consider such modifications non-trivial and outside the scope of the current work. The problems were run with the following parameter settings: ∆max = 100, ∆icb_0 = 1, η0 = 10⁻⁶, η1 = 0.5, γ = 0.5, γinc = 2, ǫc = 0.01, µ = 2, β = 0.5, ω = 0.5, r = 3, and ξacc = 10⁻⁴. For the interpolation algorithm, we used ξimp = 1.01 for the calls to [6, Algorithm 6.6].
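For reference, these settings can be collected in a single MATLAB struct; the field names below are our own shorthand, not the authors' code:

    params = struct( ...
        'Delta_max', 100, 'Delta0_icb', 1, 'eta0', 1e-6, 'eta1', 0.5, ...
        'gamma', 0.5, 'gamma_inc', 2, 'eps_c', 0.01, 'mu', 2, ...
        'beta', 0.5, 'omega', 0.5, 'r', 3, 'xi_acc', 1e-4, ...
        'xi_imp', 1.01);   % xi_imp is used only by the interpolation variant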

As described in [19], the smooth problems are derived from 22 nonlinear least squares functions defined in the CUTEr [12] collection. For each problem, the objective function f(x) is defined by

    f(x) = Σ_{k=1}^{m} gk(x)²,

where g : Rn → Rm represents one of the CUTEr test functions. The piecewise-smooth problems are defined using the same CUTEr test functions by

    f(x) = Σ_{k=1}^{m} |gk(x)|.

The noisy problems are derived from the smooth problems by multiplying by a noise function as follows:

    f(x) = (1 + εf Γ(x)) Σ_{k=1}^{m} gk(x)²,

where εf defines the relative noise level, and Γ(x) is a function that oscillates between −1 and 1, with both high-frequency and low-frequency oscillations. Using multiple starting points for some of the test functions, a total of 53 different problems are specified in the test suite for each of these three types of problems.
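The three objective types can be generated from a single residual function. The MATLAB sketch below is a hypothetical illustration: g is assumed to be a handle to one of the CUTEr residual functions g : Rn → Rm, eps_f the relative noise level, and mw_noise a stand-in for the oscillatory noise function Γ of Moré and Wild.

    f_smooth = @(x) sum(g(x).^2);                              % smooth problems
    f_pwise  = @(x) sum(abs(g(x)));                            % piecewise-smooth problems
    f_noisy  = @(x) (1 + eps_f*mw_noise(x)) * sum(g(x).^2);    % deterministic noise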

For the weighted regression algorithm, the weights were determined by the weighting scheme (6.2), with α = 1 and C̄ = 100.

The relative performance of the algorithms was compared using performance profiles and data profiles [10, 19]. If S is the set of solvers to be compared on a suite of problems P, let tp,s be the number of iterates required for solver s ∈ S on a problem p ∈ P to find a function value satisfying

    f(x) ≤ fL + τ(f(x0) − fL),


where fL is the best function value achieved by any s ∈ S. Then the performance profile of a solver s ∈ S is the fraction

    ρs(α) = (1/|P|) |{ p ∈ P : tp,s / min{tp,s′ : s′ ∈ S} ≤ α }|.

The data profile of a solver s ∈ S is

    ds(α) = (1/|P|) |{ p ∈ P : tp,s / (np + 1) ≤ α }|,

where np is the dimension of problem p ∈ P. For more information on these profiles, including their relative merits and faults, see [19].
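Both profiles are straightforward to compute once the counts tp,s are tabulated. The MATLAB sketch below uses assumed variable names (it is not the benchmarking code of [19]): T is a |P|-by-|S| matrix with T(p,s) = tp,s, NaN entries mark problems a solver never solved to the required accuracy, and np is the vector of problem dimensions.

    ratio  = T ./ min(T, [], 2);          % performance ratios (min ignores NaN)
    rho    = @(a) mean(ratio <= a, 1);    % rho_s(alpha): one fraction per solver
    budget = T ./ (np(:) + 1);            % cost measured in simplex gradients
    ds     = @(a) mean(budget <= a, 1);   % d_s(alpha): one fraction per solver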

The stopping criterion for each problem is based on the best function value fL achieved by any of the methods. Specifically, the test is satisfied if

    f(x) ≤ fL + τ(f(0) − fL),

where τ > 0 is a specified accuracy.

Performance and data profiles comparing the three algorithms are shown in Figures 6.1–6.3, for an accuracy of τ = 10⁻⁵. We observe that on the smooth problems, the weighted and unweighted regression methods had similar performance, and both performed slightly better than interpolation. For the noisy problems, we see slightly better performance from the weighted regression method. And for the piecewise-differentiable functions, the performance of the weighted regression method is significantly better. This mirrors the findings in [7], where SID-PSM using regression models shows considerable improvement over interpolation models.

[Figure 6.1 here: performance (left, fraction of problems vs. performance ratio) and data (right, fraction of problems vs. number of simplex gradients) profiles for Interpolation, Least Squares, and Weighted Regression on the smooth problems, τ = 10⁻⁵.]

Figure 6.1. Performance (left) and data (right) profiles: Interpolation vs. regression vs. weighted regression (Smooth Problems)

We also compared our weighted regression algorithm with the DFO algorithm [3] as well as NEWUOA [22] (which had the best performance of the three solvers compared in [19]). We obtained the DFO code from the COIN-OR website [17]. This code calls IPOPT, which we also obtained from COIN-OR. We obtained NEWUOA from [18]. We ran both algorithms on the benchmark problems with a stopping criterion of ∆min = 10⁻⁸, where ∆min denotes the minimum trust region radius. For NEWUOA, the number of interpolation conditions was set to NPT = 2n + 1.


[Figure 6.2 here: performance (left, fraction of problems vs. performance ratio) and data (right, fraction of problems vs. number of simplex gradients) profiles for Interpolation, Least Squares, and Weighted Regression on the nondifferentiable problems, τ = 10⁻⁵.]

Figure 6.2. Performance (left) and data (right) profiles: Interpolation vs. regression vs. weighted regression (Nondifferentiable Problems)

[Figure 6.3 here: performance (left, fraction of problems vs. performance ratio) and data (right, fraction of problems vs. number of simplex gradients) profiles for Interpolation, Least Squares, and Weighted Regression on the deterministically noisy problems, τ = 10⁻⁵.]

Figure 6.3. Performance (left) and data (right) profiles: Interpolation vs. regression vs. weighted regression (Problems with Deterministic Noise)

The performance and data profiles are shown in Figures 6.4–6.6, for an accuracy of τ = 10⁻⁵. NEWUOA outperforms both our algorithm and DFO on the smooth problems. This is not surprising, since NEWUOA is a mature code that has been refined over several years, whereas our code is a relatively unsophisticated implementation. In contrast, on the noisy problems and the piecewise-differentiable problems, our weighted regression algorithm achieves superior performance.

7. Summary and Conclusions. Our computational results indicate that using weighted regression to construct more accurate model functions can reduce the number of function evaluations required to reach a stationary point. Encouraged by these results, we believe that further study of weighted regression methods is warranted. This paper provides a theoretical foundation for such study. In particular, we have extended the concepts of Λ-poisedness and strong Λ-poisedness to the weighted regression framework, and we demonstrated that any scheme that maintains strongly Λ-poised sample sets for (unweighted) regression can also be used to maintain strongly Λ-poised sample sets for weighted regression, provided that no weight is too small relative to the other weights. Using these results, we showed that, when the computational error is sufficiently small relative to the trust region radius, the algorithm converges to a stationary point under standard assumptions.


[Figure 6.4 here: performance (left, fraction of problems vs. performance ratio) and data (right, fraction of problems vs. number of simplex gradients) profiles for DFO, NEWUOA (2n+1), and Weighted Regression on the smooth problems, τ = 10⁻⁵.]

Figure 6.4. Performance (left) and data (right) profiles: weighted regression vs. NEWUOA vs. DFO (Smooth Problems)

[Figure 6.5 here: performance (left, fraction of problems vs. performance ratio) and data (right, fraction of problems vs. number of simplex gradients) profiles for DFO, NEWUOA (2n+1), and Weighted Regression on the nondifferentiable problems, τ = 10⁻⁵.]

Figure 6.5. Performance (left) and data (right) profiles: weighted regression vs. NEWUOA vs. DFO (Nondifferentiable Problems)

[Figure 6.6 here: performance (left, fraction of problems vs. performance ratio) and data (right, fraction of problems vs. number of simplex gradients) profiles for DFO, NEWUOA (2n+1), and Weighted Regression on the deterministically noisy problems, τ = 10⁻⁵.]

Figure 6.6. Performance (left) and data (right) profiles: weighted regression vs. NEWUOA vs. DFO (Problems with Deterministic Noise)


This investigation began with the goal of dealing more efficiently with computational error in derivative-free optimization, particularly under varying levels of uncertainty. Surprisingly, we discovered that regression-based methods can be advantageous even in the absence of computational error. Regression methods produce quadratic approximations that better fit the objective function close to the trust region center. This is due partly to the fact that interpolation methods throw out points that are close together in order to maintain a well-poised sample set. In contrast, regression models keep these points in the sample set, thereby putting greater weight on points close to the trust region center.

The question of how to choose the weights needs further study. In this paper, we proposed a heuristic that balances uncertainties arising from computational error with uncertainties arising from poor model fidelity (i.e., the Taylor error described in §6.1). This weighting scheme appears to provide a benefit for noisy or nondifferentiable problems. We believe better schemes can be devised based on more rigorous analysis.

Finally, we note that the advantage of regression-based methods is not without cost in terms of computational efficiency. In the regression methods, quadratic models are constructed from scratch at every iteration, requiring O(n⁶) operations. In contrast, interpolation-based methods are able to use an efficient scheme developed by Powell [22] to update the quadratic models at each iteration. It is not clear whether such a scheme can be devised for regression methods. Nevertheless, when function evaluations are extremely expensive and the number of variables is not too large, this cost is outweighed by the reduction in function evaluations realized by regression-based methods.

8. Acknowledgements. This research was partially supported by National Science Foundation Grant GK-12-0742434. We are grateful to two anonymous referees whose comments led to improvements in an earlier version of the paper.

REFERENCES

[1] E. J. Anderson and M. C. Ferris, A direct search algorithm for optimization with noisy function evaluations, SIAM Journal on Optimization, 11 (2001), pp. 837–857.
[2] P. G. Ciarlet and P. A. Raviart, General Lagrange and Hermite interpolation in Rn with applications to finite element methods, Archive for Rational Mechanics and Analysis, 46 (1972), pp. 177–199.
[3] A. R. Conn, K. Scheinberg, and P. L. Toint, Recent progress in unconstrained nonlinear optimization without derivatives, Mathematical Programming, 79 (1997), pp. 397–414.
[4] A. R. Conn, K. Scheinberg, and L. N. Vicente, Geometry of sample sets in derivative-free optimization: Polynomial regression and underdetermined interpolation, IMA Journal of Numerical Analysis, 28 (2008), pp. 721–748.
[5] A. R. Conn, K. Scheinberg, and L. N. Vicente, Global convergence of general derivative-free trust-region algorithms to first- and second-order critical points, SIAM Journal on Optimization, 20 (2009), pp. 387–415.
[6] A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, MPS-SIAM Series on Optimization, SIAM, Philadelphia, 2009.
[7] A. L. Custódio, H. Rocha, and L. N. Vicente, Incorporating minimum Frobenius norm models in direct search, Computational Optimization and Applications, 46 (2010), pp. 265–278.
[8] G. Deng and M. C. Ferris, Adaptation of the UOBYQA algorithm for noisy functions, in Proceedings of the Winter Simulation Conference, 2006, pp. 312–319.
[9] G. Deng and M. C. Ferris, Extension of the DIRECT optimization algorithm for noisy functions, in Proceedings of the Winter Simulation Conference, 2007, pp. 497–504.
[10] E. D. Dolan and J. J. Moré, Benchmarking optimization software with performance profiles, Mathematical Programming, 91 (2001), pp. 201–213.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, 3rd ed., 1996.
[12] N. I. M. Gould, D. Orban, and P. L. Toint, CUTEr and SifDec: A constrained and unconstrained testing environment, revisited, ACM Transactions on Mathematical Software, 29 (2003), pp. 373–394.
[13] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 2nd ed., 2002.
[14] W. Huyer and A. Neumaier, SNOBFIT: stable noisy optimization by branch and fit, ACM Transactions on Mathematical Software, 35 (2008), pp. 1–25.
[15] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, Lipschitzian optimization without the Lipschitz constant, Journal of Optimization Theory and Applications, 79 (1993), pp. 157–181.
[16] C. T. Kelley, Detection and remediation of stagnation in the Nelder-Mead algorithm using a sufficient decrease condition, SIAM Journal on Optimization, 10 (1999), pp. 43–55.
[17] R. Lougee-Heimer, The Common Optimization INterface for Operations Research: Promoting open-source software in the operations research community, IBM Journal of Research and Development, 47 (2003), pp. 57–66.
[18] H. D. Mittelmann, Decision tree for optimization software, http://plato.asu.edu/guide.html, 2010.
[19] J. J. Moré and S. M. Wild, Benchmarking derivative-free optimization algorithms, SIAM Journal on Optimization, 20 (2009), pp. 172–191.
[20] J. A. Nelder and R. Mead, A simplex method for function minimization, Computer Journal, 7 (1965), pp. 308–313.
[21] M. J. D. Powell, UOBYQA: Unconstrained optimization by quadratic approximation, Mathematical Programming, 92 (2002), pp. 555–582.
[22] M. J. D. Powell, Developments of NEWUOA for minimization without derivatives, IMA Journal of Numerical Analysis, 28 (2008), pp. 649–664.
[23] C. R. Rao and H. Toutenburg, Linear Models: Least Squares and Alternatives, Springer Series in Statistics, Springer-Verlag, 2nd ed., 1999.
[24] J. J. Tomick, S. F. Arnold, and R. R. Barton, Sample size selection for improved Nelder-Mead performance, in Proceedings of the Winter Simulation Conference, 1995, pp. 341–345.