
J Optim Theory Appl
DOI 10.1007/s10957-013-0448-8

Damped Techniques for the Limited Memory BFGS Method for Large-Scale Optimization

Mehiddin Al-Baali · Lucio Grandinetti · Ornella Pisacane

Received: 8 June 2013 / Accepted: 4 October 2013
© Springer Science+Business Media New York 2013

Abstract This paper extends a certain damped technique, suitable for the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method, to the limited memory BFGS method for large-scale unconstrained optimization. It is shown that the proposed technique maintains the global convergence property of the limited memory BFGS method on uniformly convex functions. Some numerical results are described to illustrate the important role of the damped technique. Since this technique safely enforces the positive definiteness of the BFGS update for any value of the steplength, we also consider using only the first Wolfe–Powell condition on the steplength. Then, as in the backtracking framework, only one gradient evaluation is performed on each iteration. It is reported that the proposed damped methods work much better than the limited memory BFGS method in several cases.

Keywords Large-scale optimization · The limited memory BFGS method · Damped technique · Line search framework

Communicated by David Hull.

Some results were presented by M. Al-Baali at the 40th Workshop on Large Scale Nonlinear Optimization, Erice, Italy, June 22–July 1, 2004.

M. Al-Baali
Department of Mathematics and Statistics, Sultan Qaboos University, Muscat, Oman
e-mail: [email protected]

L. Grandinetti (B)
Department of Computer Engineering, Electronics and Systems, University of Calabria, Rende, Italy
e-mail: [email protected]

O. Pisacane
Information Engineering Department, Polytechnic University of Marche, Ancona, Italy
e-mail: [email protected]

1 Introduction

Consider the limited memory quasi-Newton methods for minimizing a smooth function $f(x): \mathbb{R}^n \to \mathbb{R}$, under the hypothesis that $n$ is large, so that a number of $n$-vectors can be stored, but an $n \times n$ symmetric matrix cannot.

To solve this large-scale optimization problem, we focus on certain modified Broyden–Fletcher–Goldfarb–Shanno (BFGS) methods that use values of the steplength satisfying the Wolfe–Powell conditions (see for example [1]). The main goal of this paper is to extend the recent damped (D-)technique [2], suitable for the BFGS method, which modifies the one described in [3] for constrained optimization, to the limited memory BFGS method [4] for large-scale unconstrained optimization.

The resulting damped methods (referred to as L-D-BFGS) are defined in the same way as the limited memory BFGS ones (L-BFGS, for short), except for the BFGS updating formula, which is replaced by the D-BFGS formula [3]; this keeps the Hessian approximations sufficiently positive definite. Indeed, for certain choices of the damped technique, it is shown in [5] that the performance of the D-BFGS method is substantially better than that of the standard BFGS method.

Therefore, it is expected that some L-D-BFGS methods work better than the 'undamped' L-BFGS ones in certain cases, as presented in [6]. This paper studies this conjecture by applying these methods to a set of standard large-scale optimization test problems. Some of these results and other experiments are described in Sect. 4, which shows that L-D-BFGS performs substantially better than L-BFGS in certain cases. The other sections are organized as follows. Section 2 describes both the L-BFGS and L-D-BFGS methods and some of their features. In particular, it extends the global convergence result of [7], which the L-BFGS method has for convex functions, to the L-D-BFGS class of methods. Section 3 considers certain choices for the damped technique which work well in practice. Finally, Sect. 5 concludes the paper.

2 Limited Memory Methods

Any quasi-Newton method generates a sequence of points $\{x_k\}$ iteratively by the line search formula

$x_{k+1} = x_k + \alpha_k d_k$,   (1)

where $x_1$ is given, $\alpha_k$ is a steplength parameter, and

$d_k = -H_k g_k$   (2)

is the search direction. Here, $g_k$ denotes the gradient $g(x_k) = \nabla f(x_k)$ and $H_k$ approximates the inverse Hessian $[\nabla^2 f(x_k)]^{-1}$. Initially, $H_1$ is given positive definite and $H_k$ is updated on every iteration in terms of the differences

$s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k$,   (3)

such that the quasi-Newton condition $H_{k+1} y_k = s_k$ is satisfied. In the BFGS method, the updated Hessian approximation is given by

$H_{k+1} = \mathrm{bfgs}^{-1}(H_k, s_k, y_k)$,   (4)

where, for any vectors $s$ and $y$ and for any symmetric matrix $H$, the function

$\mathrm{bfgs}^{-1}(H, s, y) = \left(I - \dfrac{s y^T}{s^T y}\right) H \left(I - \dfrac{y s^T}{s^T y}\right) + \dfrac{s s^T}{s^T y}$   (5)

defines the BFGS updating formula for the inverse Hessian approximation. Since this formula maintains $H$ positive definite if $s^T y > 0$, the BFGS method preserves the positive definiteness of the matrices $\{H_k\}$ if the curvature condition $s_k^T y_k > 0$ holds. This condition is guaranteed if $\alpha_k$ is chosen such that the following Wolfe–Powell conditions hold:

$f_k - f_{k+1} \ge \sigma_0 |s_k^T g_k|, \qquad s_k^T y_k \ge (1 - \sigma_1) |s_k^T g_k|$,   (6)

where $f_k$ denotes $f(x_k)$, and $\sigma_0 \in\, ]0, 0.5[$ and $\sigma_1 \in\, ]\sigma_0, 1[$ are fixed parameters (for further details, see for example [1]).
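For illustration only (this sketch is ours, not the authors' software, and assumes NumPy), the update (5) and the test (6) can be coded as follows; the function names are illustrative, and the default parameter values $\sigma_0 = 10^{-4}$ and $\sigma_1 = 0.9$ are simply the ones quoted later in Sect. 4.

import numpy as np

def bfgs_inverse_update(H, s, y):
    # BFGS formula (5): update of the inverse Hessian approximation H
    # for a pair (s, y) with positive curvature s^T y > 0.
    rho = 1.0 / (s @ y)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ H @ V + rho * np.outer(s, s)

def wolfe_powell(f_k, f_k1, s, g_k, y, sigma0=1e-4, sigma1=0.9):
    # The two Wolfe-Powell conditions (6) for the step s = alpha*d.
    t = abs(s @ g_k)
    return (f_k - f_k1 >= sigma0 * t) and (s @ y >= (1.0 - sigma1) * t)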

To fulfill the conditions in (6) by testing a few gradient evaluations, $\sigma_1$ is chosen large (usually, $\sigma_1 = 0.9$). However, for a sufficiently large value of $\sigma_1 < 1$, these conditions may hold with $s_k^T y_k$ close to zero. In addition, this scalar might become negative if only the first condition in (6) is used for acceptable values of $\alpha_k$, using the so-called backtracking line search (see for example [8]). Although this difficulty can be avoided by skipping the update, so that $H_{k+1} = H_k$ (see for example [8]), a drawback is that the Hessian approximation then remains unchanged. To circumvent these disadvantages, the following D-BFGS update is considered in [5]:

$H_{k+1} = \mathrm{bfgs}^{-1}(H_k, s_k, \hat{y}_k)$,   (7)

which is given by (4) with $y_k$ replaced by the damped vector

$\hat{y}_k = \varphi_k y_k + (1 - \varphi_k) B_k s_k$,   (8)

where $B_k = H_k^{-1}$ and $\varphi_k \in\, ]0, 1]$ is a parameter. This damped parameter is chosen such that $s_k^T \hat{y}_k$ is sufficiently positive for an appropriate value of $\varphi_k$. Indeed, for certain choices of this parameter, it has been reported in [5] that the performance of the D-BFGS method is substantially better than that of the standard BFGS method.

Both the L-BFGS and the L-D-BFGS methods are defined like the above BFGS and D-BFGS methods, except for the information on the inverse Hessian approximations, given by (4) and (7), respectively. This information is, in fact, reduced sufficiently so that the product in (2) can be computed efficiently without storing $H_k$ explicitly. This computation can be described in the following way for the L-BFGS method; for the L-D-BFGS method, a similar process is used with $y_k$ replaced by the damped vector $\hat{y}_k$.
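As a small sketch (ours, under the same NumPy assumption as above), the damped vector (8) is simply a convex combination of $y_k$ and $B_k s_k$; in the limited memory setting the product $B_k s_k$ would be obtained from the stored pairs rather than from an explicit matrix, so it is supplied by the caller here.

def damped_y(y, Bs, s, phi):
    # Damped vector (8): y_hat = phi*y + (1 - phi)*B_k s_k, where Bs = B_k s_k
    # is supplied by the caller and phi lies in ]0, 1].
    y_hat = phi * y + (1.0 - phi) * Bs
    # With phi chosen as in Sect. 3, s^T y_hat remains sufficiently positive,
    # so the update (7) keeps the approximation positive definite.
    assert s @ y_hat > 0.0
    return y_hat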

Replacing $H_k$ in (4) by another implicitly stored matrix $H_k^0$, such that the product $H_k^0 v$ can be computed for any vector $v$, the updated matrix $H_{k+1}$ would be obtained by (5) in terms of $H_k^0$ and the vector pair $\{s_k, y_k\}$ as in the following:

$H_{k+1} = \mathrm{bfgs}^{-1}\bigl(H_k^0, s_k, y_k\bigr) = V_k^T H_k^0 V_k + \rho_k s_k s_k^T$,   (9)

where

$V_k = I - \rho_k y_k s_k^T, \qquad \rho_k = \dfrac{1}{y_k^T s_k}$.   (10)

Hence, the product $-H_{k+1} g_{k+1}$, which defines the next search direction $d_{k+1}$, can be computed without storing $H_{k+1}$ explicitly, since for any vector $v$ we have $V_k v = v - \rho_k (s_k^T v) y_k$ and $V_k^T v = v - \rho_k (y_k^T v) s_k$. In particular, if $H_k^0$ (referred to as a basic matrix) is given by the identity matrix $I$, which requires zero storage, we obtain the memoryless BFGS method (see for example [9]). Similarly, the choice $H_k^0 = \nu_k I$, for a certain scalar $\nu_k$, yields the L-BFGS method with least storage (see for example [9]). However, it would be useful to increase the storage so that the information about the Hessian approximation is increased, as shown in the following way.

Assuming $H_{k-1}$ is replaced by $H_k^0$, formula (4) yields the 2-BFGS update in terms of $H_k^0$ and the 2 vector pairs $\{s_i, y_i\}_{i=k-1}^{k}$, given by

$H_{k+1} = \mathrm{bfgs}^{-1}\bigl(\mathrm{bfgs}^{-1}(H_k^0, s_{k-1}, y_{k-1}), s_k, y_k\bigr) = V_k^T V_{k-1}^T H_k^0 V_{k-1} V_k + \rho_{k-1} V_k^T s_{k-1} s_{k-1}^T V_k + \rho_k s_k s_k^T$.   (11)

In general, for $m \ge 1$ (which is replaced by $k$ whenever $k < m$), let $H_{k-m+1}$ be replaced by $H_k^0$. Then formula (4) yields the m-BFGS update in terms of $H_k^0$ and the $m$ most recent vector pairs $\{s_i, y_i\}_{i=k-m+1}^{k}$, which require a storage of $2mn$ locations, given by a sum of symmetric matrices of the structure

$\begin{aligned}
H_{k+1} ={}& \bigl[V_k^T \cdots V_{k-m+1}^T\bigr] H_k^0 \bigl[V_{k-m+1} \cdots V_k\bigr] \\
&+ \rho_{k-m+1} \bigl[V_k^T \cdots V_{k-m+2}^T\bigr] s_{k-m+1} s_{k-m+1}^T \bigl[V_{k-m+2} \cdots V_k\bigr] \\
&+ \rho_{k-m+2} \bigl[V_k^T \cdots V_{k-m+3}^T\bigr] s_{k-m+2} s_{k-m+2}^T \bigl[V_{k-m+3} \cdots V_k\bigr] \\
&\;\;\vdots \\
&+ \rho_{k-1} V_k^T s_{k-1} s_{k-1}^T V_k \\
&+ \rho_k s_k s_k^T.
\end{aligned}$   (12)

Hence, the search direction $d_{k+1} = -H_{k+1} g_{k+1}$ (defined by (2) with $k$ replaced by $k+1$) can be computed without storing $H_{k+1}$ explicitly. In [4], a two-loop recursive procedure for computing this direction efficiently is derived (for further details see for example [9]).
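The two-loop recursion of [4] mentioned above can be sketched as follows (an illustration in Python/NumPy under our own naming, not the paper's Fortran code); the basic matrix is taken as the scaled identity recommended below in (13)–(14), and for L-D-BFGS the stored vectors $y_i$ are simply the damped vectors $\hat{y}_i$.

import numpy as np

def lbfgs_direction(g, s_list, y_list):
    # Two-loop recursion of [4]: returns d = -H g for the m-BFGS matrix (12),
    # using the stored pairs (s_i, y_i) (at least one pair is assumed) and the
    # basic matrix H^0 = nu*I with nu = s^T y / y^T y, cf. (13)-(14) below.
    q = g.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    # first loop: newest to oldest pair
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # apply the basic matrix H^0 = nu*I built from the most recent pair
    s_m, y_m = s_list[-1], y_list[-1]
    r = ((s_m @ y_m) / (y_m @ y_m)) * q
    # second loop: oldest to newest pair
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r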

To define the basic matrix $H_k^0$, it is worth noting that in [7] a few choices are taken into account and, in particular, the following choice is recommended:

$H_k^0 = \nu_k I$,   (13)

where

$\nu_k = \dfrac{s_k^T y_k}{y_k^T y_k}$,   (14)

which depends on the most recent vectors in (3).

We now state the global convergence result of the L-D-BFGS method by extending that of [7], since the choice $\varphi_k = 1$ reduces the L-D-BFGS update to that of L-BFGS. To do so, we first state the following standard assumptions on the objective function.

Assumption 2.1

1. The objective function $f$ is twice continuously differentiable;
2. the level set $D = \{x : f(x) \le f_1\}$ is convex;
3. there exist positive constants $M_1$ and $M_2$ such that

$M_1 \|z\|^2 \le z^T G(x) z \le M_2 \|z\|^2$,   (15)

where $G(x) = \nabla^2 f(x)$, for all $z \in \mathbb{R}^n$ and all $x \in D$, and $\|\cdot\|$ denotes the usual Euclidean norm.

Theorem 2.1 Let $x_1$ be any starting point for which $f$ satisfies Assumption 2.1, and assume that the basic matrices $H_k^0$ are chosen so that $\{\|H_k^0\|\}$ and $\{\|(H_k^0)^{-1}\|\}$ are bounded. Suppose $\hat{y}_k$ is defined by (8) such that $\varphi_k \ge \nu$, where $\nu$ is a positive number, and $\alpha_k$ is chosen such that the Wolfe–Powell conditions (6) hold. Then, for any positive definite $H_1$, the L-D-BFGS method (defined by (1), (2), (12), and (10) with $y_k$ replaced by $\hat{y}_k$) generates a sequence of points $\{x_k\}$ which converges to the solution $x_*$. Moreover, there is a constant $0 \le r < 1$ such that

$f_{k+1} - f_* \le r^k (f_1 - f_*)$,

where $f_*$ denotes $f(x_*)$, which implies that $\{x_k\}$ converges R-linearly.

Proof It is a simple extension of the result of [7] for the L-BFGS method. Let the D-BFGS updated Hessian approximation be given by the inverse of matrix (4) with $y_k$ replaced by $\hat{y}_k$, that is, the following BFGS update of the Hessian approximation $B_k$ ($= H_k^{-1}$) in terms of $s_k$ and $\hat{y}_k$:

$\mathrm{bfgs}(B_k, s_k, \hat{y}_k) = B_k - \dfrac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \dfrac{\hat{y}_k \hat{y}_k^T}{\hat{y}_k^T s_k}$.   (16)

Using the result of [2],

$\dfrac{\hat{y}_k^T \hat{y}_k}{s_k^T \hat{y}_k} \le \varphi_k \dfrac{y_k^T y_k}{s_k^T y_k} + (1 - \varphi_k) \dfrac{s_k^T B_k B_k s_k}{s_k^T B_k s_k}$,

it follows that

$\mathrm{trace}\bigl[\mathrm{bfgs}(B_k, s_k, \hat{y}_k)\bigr] \le \mathrm{trace}(B_k) + \varphi_k \dfrac{y_k^T y_k}{s_k^T y_k} \le \mathrm{trace}(B_k) + \dfrac{y_k^T y_k}{s_k^T y_k}$,   (17)

which yields the trace inequality (2.10) of [7]. Noting that $s_k^T \hat{y}_k \ge \varphi_k s_k^T y_k$ (which follows from (8) and the positive definiteness of $B_k$) and using the formula for the determinant of the BFGS update, we obtain

$\det\bigl[\mathrm{bfgs}(B_k, s_k, \hat{y}_k)\bigr] = \dfrac{s_k^T \hat{y}_k}{s_k^T B_k s_k} \det(B_k) \ge \varphi_k \dfrac{s_k^T y_k}{s_k^T B_k s_k} \det(B_k) \ge \nu \det\bigl[\mathrm{bfgs}(B_k, s_k, y_k)\bigr]$.   (18)

This inequality, together with the expressions (6.10) and (2.11) described in [7], yields the determinant inequality (6.12) with $M_4$ replaced by $\nu M_4$ in that paper. Now, following the analysis in that paper, the proof is completed. □

It is worth noting that this result with $\hat{y}_k = y_k$ for all $k$ yields that of [7] for the L-BFGS method, which also states the possibility of proving it for several other line search strategies, including backtracking, on the basis of the result of [10]. It is also possible to prove the following convergence result for the L-D-BFGS method.

Theorem 2.2 Theorem 2.1 remains valid if the steplength $\alpha_k$ and the damped parameter $\varphi_k$ are chosen such that the first Wolfe–Powell condition in (6) and the curvature-like condition $s_k^T \hat{y}_k > 0$ are satisfied, respectively.

3 Choosing Damped Parameters

We now define some formulas for the damped parameter $\varphi_k$, which maintain the above convergence properties of the L-BFGS method. In particular, [3] uses $\varphi_k = 1$ only if $\tau_k \ge 0.2$, where

$\tau_k = \dfrac{s_k^T y_k}{s_k^T B_k s_k}$,   (19)

but otherwise $\varphi_k = 0.8/(1 - \tau_k)$. This choice has been extended by [6] to

$\varphi_k = \begin{cases} \dfrac{\sigma_2}{1 - \tau_k}, & \tau_k < 1 - \sigma_2, \\[1mm] \dfrac{\sigma_3}{\tau_k - 1}, & \tau_k > 1 + \sigma_3, \\[1mm] 1, & \text{otherwise}, \end{cases}$   (20)

where $\sigma_2$ and $\sigma_3$ are positive parameters (values of $\sigma_2 \ge 0.6$ and $\sigma_3 \ge 3$ have been suggested). This formula reduces to that of [3] if $\sigma_2 = 0.8$ (or 0.9) and $\sigma_3 = \infty$.

Since a value of $\tau_k = h_k$, where

$h_k = \dfrac{y_k^T H_k y_k}{s_k^T y_k}$,   (21)

reduces the BFGS update to the symmetric rank-one update (see for example [11]), which has some useful properties, we use $\varphi_k < 1$ only when $h_k/\tau_k$ is sufficiently larger than one. Thus, in particular, we modify formula (20) to

$\varphi_k = \begin{cases} \dfrac{\sigma_2}{1 - \tau_k}, & \tau_k \le 0 \ \text{or}\ 0 < \tau_k < \min\bigl(1 - \sigma_2,\, (1 - \sigma_4) h_k\bigr), \\[1mm] \dfrac{\sigma_3}{\tau_k - 1}, & 1 + \sigma_3 < \tau_k < (1 - \sigma_4) h_k, \\[1mm] 1, & \text{otherwise}, \end{cases}$   (22)

where $\sigma_4$ is another fixed positive parameter. In the D-BFGS method, this formula works better than (20), even for $\sigma_4 = 0$. Note that both formulas are equivalent, except in the case when $h_k/\tau_k$ and $\tau_k$ are sufficiently close to and far away from one, respectively. We also note that the inequality $\tau_k > 0$ is guaranteed by the second Wolfe–Powell condition in (6).
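A direct transcription of formula (22) reads as follows (again our own sketch; the default values $\sigma_2 = 0.6$, $\sigma_3 = 3$ and $\sigma_4 = 0$ are those used in Sect. 4).

def damping_parameter(tau, h, sigma2=0.6, sigma3=3.0, sigma4=0.0):
    # Damped parameter phi_k of formula (22), where
    # tau = s^T y / (s^T B s) as in (19) and h = y^T H y / (s^T y) as in (21).
    # Dropping the h-dependent bounds recovers formula (20).
    if tau <= 0.0 or 0.0 < tau < min(1.0 - sigma2, (1.0 - sigma4) * h):
        return sigma2 / (1.0 - tau)
    if 1.0 + sigma3 < tau < (1.0 - sigma4) * h:
        return sigma3 / (tau - 1.0)
    return 1.0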

We observed that several choices for the damped parameter $\varphi_k$ work well in practice (see for example [5]). In the next section, we test some of these choices.

4 Numerical Results

We now give some idea of the performance of the L-D-BFGS class of methods versus that of the standard L-BFGS method. For this comparison, we applied some methods to a set of 33 general unconstrained test problems; most of them are used by [12] and [13]. Table 1 presents, for every test problem, the name, the abbreviated name, the reference, and the number of variables $n$ used for the runs. We tested other problems with values of $n \le 10^4$. However, we decided not to include them, since we observed that they do not give a reasonable idea of the behavior of the L-BFGS methods compared with the more significant ones described in Table 1.

Our experiments were performed in double precision with machine precision $\epsilon = 2^{-52}$ ($\approx 2.22 \times 10^{-16}$). For given $\sigma_0$, $\sigma_1$, $m$ and initial $H_1 = I$, the identity matrix, we used the Fortran 77 software of [1], which implements the L-BFGS method, with $y_k$ replaced by $\hat{y}_k$ defined by (8) for a given value of $\varphi_k$ based on (22) for some choices of $\sigma_2$, $\sigma_3$ and $\sigma_4$. The modified L-D-BFGS routine reduces to L-BFGS if the value $\varphi_k = 1$ is used for all $k$. The routine ensures finding a value of $\alpha_k$ which satisfies the strong Wolfe–Powell conditions, defined by (6) and

$s_k^T y_k \le (1 + \sigma_1) |s_k^T g_k|$,   (23)

where $\sigma_1$ is defined as in (6). Note that these conditions reduce to the first condition in (6) if a sufficiently large value of $\sigma_1$ is used.

Table 1 List of test problems

Abbreviated name   n            Function's name                   Reference
EX-ROSBRK          10^3         Extended Rosenbrock               [14]
EX-ROSBRKI         10^3         Extended Rosenbrock I             [12]
EX-FRDRTH          10^3         Extended Freudenstein and Roth    [14]
CH-FRDRTH          10^3         Extended Freudenstein and Roth    [15]
EX-POW-SR          10^3         Extended Powell Singular          [14]
EX-POW-SRI         10^3         Extended Powell Singular I        [12]
VAR-DIM            10^3         Variably Dimensioned              [14]
TRGSPD             10^3         Trigonometric of Spedicato        [14]
PENALTYI           10^3         Penalty I                         [14]
TRID1BandL         100, 10^3    Tridiagonal of Broyden            [14]
LINMINSUR          121, 961     Linear Minimum Surface            [15]
BROYTRIDN          10^3         Tridiagonal of Broyden            [14]
EX-ENGVL1          10^3, 10^4   Extended ENGVL1                   [15]
WR-EX-WOOD         10^3         Wrong Extended Wood               [15]
SP-MXSQRT          100, 10^3    Sparse Matrix Square Root         [7]
CH-ROSBRK          100, 200     Chained Rosenbrock                [16]
PE                 100, 200     Potential Energy of [17]          [12]
TRID2TOIN          100, 200     TRIG (trigonometric)              [15]
BVP                100          Boundary Value Problem            [15]
MXSQRT1            100          Matrix Square Root 1              [7]
MXSQRT2            100          Matrix Square Root 2              [7]
VAR(-3.4)          100, 200     VAR(λ) (variational)              [18]
QOR                50           Quadratic Operations Research     [18]
GOR                50           Generalized QOR                   [18]
PSP                50           Pseudo Penalty                    [18]

In this case, the backtracking line search framework can be used, which requires only one gradient evaluation for computing an acceptable value of the steplength $\alpha_k$. Therefore, all the algorithms (with different values of $m$) used in our implementation differ only in the preset choices of $\sigma_i$, $i \ge 0$. We let $\sigma_0 = 10^{-4}$, $\sigma_4 = 0$, and the other values are given below to define the following algorithms:

• LM1: the standard L-BFGS method; $\sigma_1 = 0.9$, $\sigma_2 = 1$ and $\sigma_3 = \infty$,
• LM2: an L-D-BFGS method; $\sigma_1 = 0.9$, $\sigma_2 = 0.6$ and $\sigma_3 = 3$,
• LM3: the same as LM2, except that $\sigma_1 = 0.9999$,
• LM4: the same as LM2, except that $\sigma_1 = \infty$.

The runs were terminated when the following condition was satisfied:

$\|g_k\| \le \bar{\epsilon} \max\bigl(1, |f_k|\bigr)$,   (24)

where $\bar{\epsilon} = 10\sqrt{\epsilon}$ ($\approx 1.47 \times 10^{-7}$). We use this fairly accurate stopping condition as a fair criterion for comparing different algorithms. All the runs achieved essentially the same accuracy.
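In code, the stopping test (24) amounts to the following check (a sketch assuming NumPy; the tolerance is the $10\sqrt{\epsilon}$ value quoted above).

import numpy as np

EPS_MACHINE = 2.0 ** -52
TOL = 10.0 * np.sqrt(EPS_MACHINE)   # approximately 1.47e-7, as in (24)

def converged(g, f):
    # Stopping test (24): ||g_k|| <= tol * max(1, |f_k|).
    return np.linalg.norm(g) <= TOL * max(1.0, abs(f))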

We consider LM2 to examine the usefulness of the damped technique for improving the performance of the standard L-BFGS method. LM3 maintains the theoretical properties of LM2, but considers a weakly accurate line search. Although this choice may worsen the performance of L-BFGS in certain cases, it improves the performance of L-D-BFGS, as shown below. LM4 considers the backtracking line search technique, which is based on enforcing only the first Wolfe–Powell condition in (6) for a steplength estimated by quadratic interpolation and restricted to be larger than or equal to 1/2 (see [1], for instance). Although this case may yield a failure for L-BFGS, it maintains the global convergence of L-D-BFGS. A feature worth noting is that LM3 requires a number of gradient evaluations less than or equal to that required by LM2, while LM4 requires only one gradient evaluation for obtaining an acceptable value of the steplength. In addition, LM4 is simpler to implement than the other algorithms.
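A minimal sketch of such a backtracking strategy, under our reading of the description above (it is not the authors' routine): only the first condition in (6) is enforced, each new trial steplength is the minimizer of a quadratic interpolant, and it is safeguarded to be at least half of the previous trial; the function and variable names are ours, and g, d are NumPy vectors.

def backtracking_quadratic(f, x, fx, g, d, sigma0=1e-4, max_iter=30):
    # Backtracking enforcing only the first Wolfe-Powell condition in (6),
    # with new trials from safeguarded quadratic interpolation.
    slope = g @ d                      # directional derivative, assumed negative
    alpha = 1.0
    for _ in range(max_iter):
        f_new = f(x + alpha * d)
        if fx - f_new >= sigma0 * abs(alpha * slope):   # first condition in (6)
            return alpha, f_new
        # minimizer of the quadratic interpolant of phi(0), phi'(0), phi(alpha)
        denom = 2.0 * (f_new - fx - alpha * slope)
        alpha_q = -slope * alpha ** 2 / denom if denom > 0 else 0.5 * alpha
        alpha = max(alpha_q, 0.5 * alpha)               # safeguard
    return alpha, f_new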

We applied the above algorithms with the practical choices of $m = 3, 5, 7, 10$, and 20 to our set of tests. We observed that there is little to choose among the damped LM2, LM3, and LM4 algorithms, and all of them perform slightly better than LM1 in terms of the number of function and gradient evaluations (referred to as nfe and nge, respectively) required to solve the test problems.

To illustrate this observation for $m = 10$, we display the numerical results as performance profiles (see [19]) in Figs. 1 and 2, based on nfe and nge, respectively.

Nowadays (see for example [20]), the performance profile is a standard tool for presenting such results concisely when comparing two or more algorithms over a set of problems. Considering, for example, nfe, the performance profile plots the fraction of problems (vertical axis) solved within a given factor (horizontal axis) of the best algorithm, i.e., the one requiring the least nfe among the algorithms under comparison. More efficient/robust algorithms are "on top" towards the left/right side of the plot. Thus, an algorithm that is "on top" throughout the plot can be considered more efficient and robust than the other algorithms, a useful outcome.
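For completeness, the fractions plotted in such a profile can be computed as in the following sketch (ours, not the code of [19]); costs is a problems-by-solvers NumPy array of, e.g., nfe counts, with np.inf marking a failure.

import numpy as np

def performance_profile(costs, taus):
    # Performance profile of [19]: for each solver, the fraction of problems
    # solved within each factor in taus of the best solver on that problem.
    best = costs.min(axis=1, keepdims=True)   # best cost per problem
    ratios = costs / best                     # performance ratios
    return np.array([[np.mean(ratios[:, s] <= t) for t in taus]
                     for s in range(costs.shape[1])])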

It now follows from the graphs in both Figs. 1 and 2 that the performance of the LM3 algorithm is slightly better than that of LM4, which in turn performs slightly better than LM2. These damped algorithms clearly perform better than the standard LM1 algorithm.

Based on the experiments of [5], we repeated the runs using $\sigma_3 = \max(\alpha_k - 1, 1)$ rather than $\sigma_3 = 3$. We observed that there is little to choose among the above algorithms, except for $m = 5$. For this choice of $m$, the corresponding performance profiles for nfe and nge are plotted in Figs. 3 and 4, respectively. Both graphs clearly show that the damped LM2, LM3, and LM4 algorithms are preferable to the standard undamped LM1 method. Specifically, LM4 is slightly better than LM3, and both of them are much better than both LM2 and LM1 in terms of nfe and nge. However, Fig. 3 shows that there is little to choose between LM2 and LM1 in terms of nfe, while Fig. 4 shows that in all cases the former algorithm is slightly better than the latter in terms of nge.

Fig. 1 Performance profile based on nfe; m = 10, σ3 = 3

Fig. 2 Performance profile based on nge; m = 10, σ3 = 3

Fig. 3 Performance profile based on nfe; m = 5, σ3 = max(αk − 1, 1)

Fig. 4 Performance profile based on nge; m = 5, σ3 = max(αk − 1, 1)

The above experiments show that the LM4 algorithm outperforms the others under comparison. Finally, it is worth noting that we repeated the run for the damped LM3 algorithm with the exception that an acceptable value of $\alpha_k$ is given by the largest value of $2^{-i}$, $i = 0, 1, 2, \ldots$, which satisfies the first condition in (6), that is, the backtracking Armijo line search (see for example [9]). We observed that the above technique, which is based on quadratic polynomial interpolation, is preferable to that of Armijo.

5 Conclusions

In this paper, it is demonstrated that the proposed class of L-D-BFGS methods has features similar to those of the attractive L-BFGS method, except that it keeps the Hessian approximations sufficiently positive definite and bounded. Indeed, we observed that this class, for certain choices of the damped parameter, performs better than the L-BFGS method, significantly so on certain ill-conditioned problems. In general, however, the former methods perform better than the latter in terms of the number of function and gradient evaluations required to solve the problems. In particular, the computational results also showed that the number of gradient evaluations required by L-D-BFGS is always smaller than that required by L-BFGS. In conclusion, the damped methods, specifically the one defined by LM4, are recommended. However, further numerical experiments are required.

References

1. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, Chichester (1987). Reprinted in 2000

2. Al-Baali, M.: Convergence analysis of a class of damped quasi-Newton methods for nonlinear optimization. Research Report, DOMAS 11/2, Sultan Qaboos University, Oman (2011)

3. Powell, M.J.D.: Algorithms for nonlinear constraints that use Lagrange functions. Math. Program. 14, 224–248 (1978)

4. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35, 773–782 (1980)

5. Al-Baali, M., Grandinetti, L.: On practical modifications of the quasi-Newton BFGS method. Adv. Model. Optim. 11, 63–76 (2009)

6. Al-Baali, M.: Quasi-Wolfe conditions for quasi-Newton methods for large-scale optimization. In: 40th Workshop on Large Scale Nonlinear Optimization, Erice, Italy, June 22–July 1 (2004)

7. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989)

8. Kelley, C.T.: Iterative Methods for Optimization, 2nd edn. SIAM, Philadelphia (1999)

9. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, London (1999)

10. Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 26, 727–739 (1989)

11. Al-Baali, M.: Variational quasi-Newton methods for unconstrained optimization. J. Optim. Theory Appl. 77, 127–143 (1993)

12. Al-Baali, M.: Improved Hessian approximations for the limited memory BFGS method. Numer. Algorithms 22, 99–112 (1999)

13. Nash, S.G., Nocedal, J.: A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. SIAM J. Optim. 1, 358–372 (1991)

14. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software. ACM Trans. Math. Softw. 7, 17–41 (1981)

15. Toint, P.L.: Test problems for partially separable optimization and results for the routine PSPMIN. Technical Report No. 83/4, Department of Mathematics, Faculties University de Namur, Namur, Belgium (1983)

16. Grandinetti, L.: Some investigations in a new algorithm for nonlinear optimization based on conic models of the objective function. J. Optim. Theory Appl. 43, 1–21 (1984)

17. Siegel, D.: Implementing and modifying Broyden class updates for large scale optimization. Technical Report NA12, Department of Applied Mathematics and Theoretical Physics, Cambridge University, England (1992)

18. Toint, Ph.L.: Some numerical results using a sparse matrix updating formula in unconstrained optimization. Math. Comput. 32, 839–851 (1978)

19. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)

20. Malmedy, V., Toint, Ph.L.: Approximating Hessians in unconstrained optimization arising from discretized problems. Comput. Optim. Appl. 50, 1–22 (2011)