
Math. Program., Ser. B (2012) 134:71–99, DOI 10.1007/s10107-012-0569-0

FULL LENGTH PAPER

Smoothing methods for nonsmooth, nonconvex minimization

Xiaojun Chen

Received: 12 September 2011 / Accepted: 6 June 2012 / Published online: 22 June 2012
© Springer and Mathematical Optimization Society 2012

Abstract We consider a class of smoothing methods for minimization problems where the feasible set is convex but the objective function is not convex, not differentiable and perhaps not even locally Lipschitz at the solutions. Such optimization problems arise from wide applications including image restoration, signal reconstruction, variable selection, optimal control, stochastic equilibrium and spherical approximations. In this paper, we focus on smoothing methods for solving such optimization problems, which use the structure of the minimization problems and the composition of smoothing functions for the plus function (x)_+. Many existing optimization algorithms and codes can be used in the inner iteration of the smoothing methods. We present properties of the smoothing functions and the gradient consistency of the subdifferential associated with a smoothing function. Moreover, we describe how to update the smoothing parameter in the outer iteration of the smoothing methods to guarantee convergence of the smoothing methods to a stationary point of the original minimization problem.

Keywords Nonsmooth · Nonconvex minimization · Smoothing methods · Regularized minimization problems · Eigenvalue optimization · Stochastic variational inequality problems

Mathematics Subject Classification 90C30 · 49M37 · 65K10 · 90C26

X. Chen (B)
Department of Applied Mathematics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China
e-mail: [email protected]


1 Introduction

This paper considers the following nonsmooth, nonconvex optimization problem

min_{x∈X} f(x)    (1)

where the feasible set X ⊆ R^n is closed and convex, and the objective function f: R^n → R is continuous and almost everywhere differentiable in X. Of particular interest is the case where f is not smooth, not convex and perhaps not even locally Lipschitz. Here nonsmoothness refers to nondifferentiability.

Recently, problem (1) has attracted significant attention in engineering and economics. An increasing number of practical problems require minimizing a nonsmooth, nonconvex function on a convex set, including image restoration, signal reconstruction, variable selection, optimal control, stochastic equilibrium problems and spherical approximations [1,8,10,11,13,21,29,30,33,42,44,61,76,85,101,105]. In many cases, the objective function f is not differentiable at the minimizers, but the constraints are simple, such as box constraints, X = {x ∈ R^n | ℓ ≤ x ≤ u} for two given vectors ℓ ∈ {R ∪ {−∞}}^n and u ∈ {R ∪ {∞}}^n. For example, in some minimization models of image restoration, signal reconstruction and variable selection [8,33,42,85], the objective function is the sum of a quadratic data-fidelity term and a nonconvex, non-Lipschitz regularization term. Sometimes nonnegativity constraints are used to reflect the fact that the pixels are nonnegative, and box constraints are used to represent thresholds for some natural phenomena or finance decisions. Moreover, a number of constrained optimization problems can be reformulated as problem (1) by using exact penalty methods. However, many well-known optimization algorithms lack effectiveness and efficiency in dealing with nonsmooth, nonconvex objective functions. Furthermore, for non-Lipschitz continuous functions, the Clarke generalized gradients [34] cannot be used directly in the analysis.

Smooth approximations for optimization problems have been studied for decades, including complementarity problems, variational inequalities, second-order cone complementarity problems, semidefinite programming, semi-infinite programming, optimal control, eigenvalue optimization, penalty methods and mathematical programs with equilibrium constraints. Smoothing methods are efficient not only for problems with nonsmooth objective functions, but also for problems with nonsmooth constraints. See [1,2,4–7,14,16–18,20,22,24–31,33,39,40,46–50,54,56–60,63,64,66,70,72,73,81,82,88–92,94,97,103,104].

In this paper, we describe a class of smooth approximations which are constructed based on the special structure of nonsmooth, nonconvex optimization problems and the use of smoothing functions for the plus function (t)_+ := max(0, t). Using the structure of the problems and the composition of smoothing functions, we can develop efficient smoothing algorithms for solving many important optimization problems, including regularized minimization problems, eigenvalue optimization and stochastic complementarity problems.

In particular, we consider a class of smoothing functions with the following definition.


Definition 1 Let f: R^n → R be a continuous function. We call f̃: R^n × R_+ → R a smoothing function of f, if f̃(·, μ) is continuously differentiable in R^n for any fixed μ > 0, and for any x ∈ R^n,

lim_{z→x, μ↓0} f̃(z, μ) = f(x).

Based on this definition, we can construct a smoothing method using f̃ and ∇_x f̃ as follows.

(i) Initial step: Define a parametric smooth function f̃: R^n × R_+ → R to approximate f.

(ii) Inner iteration: Use an algorithm to find an approximate solution (it need not be an exact solution) of the smooth optimization problem

min_{x∈X} f̃(x, μ_k)    (2)

for a fixed μ_k > 0.

(iii) Outer iteration: Update μ_k to guarantee the convergence of the smoothing method to a local minimizer or a stationary point of the nonsmooth problem (1).

The advantage of smoothing methods is that we solve optimization problems with continuously differentiable functions, for which there are rich theory and powerful solution methods [86], and we can guarantee to find a local minimizer or stationary point of the original nonsmooth problem by updating the smoothing parameter. The efficiency of smoothing methods depends on the smooth approximation function, the solution method for the smooth optimization problem (2) and the updating scheme for the smoothing parameter μ_k. For example, if f is level-bounded and the smoothing function satisfies

f(x) ≤ f̃(x, μ)  ∀ x ∈ R^n, ∀ μ > 0,    (3)

then problem (2) has a solution for any fixed μ_k > 0. In Sect. 3, we show that a class of smoothing functions satisfies (3).
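To make steps (i)–(iii) concrete, the following Python sketch outlines the outer loop of such a smoothing method, under the assumption that a smoothing function f̃, its gradient and an inner solver for (2) are supplied by the user; the function names and the simple update rule below are illustrative placeholders, not part of the paper.

import numpy as np

def smoothing_method(f_tilde, grad_f_tilde, smooth_solver, x0,
                     mu0=1.0, sigma=0.5, gamma=1.0, mu_tol=1e-6, max_outer=100):
    """Outer loop of a generic smoothing method (illustrative sketch).

    f_tilde(x, mu), grad_f_tilde(x, mu): a smoothing function of f and its gradient.
    smooth_solver(f_tilde, grad_f_tilde, x, mu): any algorithm returning an
    approximate solution of the smooth problem (2) for fixed mu, started from x;
    for a constrained feasible set X it should return a point in X.
    """
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_outer):
        # Inner iteration: approximately solve min f_tilde(., mu) for fixed mu.
        x = smooth_solver(f_tilde, grad_f_tilde, x, mu)
        # Outer iteration: shrink mu once the smoothed problem is solved to an
        # accuracy proportional to mu (cf. Step 3 of the method in Sect. 5).
        if np.linalg.norm(grad_f_tilde(x, mu)) < gamma * mu:
            mu *= sigma
        if mu < mu_tol:
            break
    return x, mu

Any existing smooth optimization code can play the role of smooth_solver, which is the sense in which available algorithms and codes can be reused in the inner iteration.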

This paper is organized as follows. In Sect. 2, we describe three motivational problems for the study of smoothing methods for (1): regularized minimization problems, eigenvalue optimization problems and stochastic equilibrium problems. In Sect. 3, we demonstrate how to define smoothing functions for the three motivational problems by using the structure of the problems and the composition of smoothing functions for the plus function (t)_+. In Sect. 4, we summarize properties of smoothing functions and the subdifferential associated with a smoothing function. In particular, we consider the relation between the Clarke subdifferential

∂f(x) = con{v | ∇f(z) → v, f is differentiable at z, z → x}

and the subdifferential associated with a smoothing function

G_f̃(x) = con{v | ∇_x f̃(x^k, μ_k) → v, for x^k → x, μ_k ↓ 0},


where “con” denotes the convex hull. According to Theorem 9.61 and (b) of Corollary 8.47 in the book of Rockafellar and Wets [94], if f is locally Lipschitz, then G_f̃(x) is nonempty and bounded, and ∂f(x) ⊆ G_f̃(x). We show that the gradient consistency

∂f(x) = G_f̃(x)    (4)

holds for locally Lipschitz functions in the three motivational problems and for a class of smoothing functions. Moreover, we show the affine-scaled gradient consistency for a class of non-Lipschitz functions. In Sect. 5, we use a simple smoothing gradient method to illustrate how to update the smoothing parameter in the algorithms to guarantee the convergence of smoothing methods. In Sect. 6, we illustrate the numerical implementation of smoothing methods using the smoothing SQP algorithm [6] for solving ℓ_2-ℓ_p (0 < p < 1) minimization.

2 Motivational problems

In this section, we describe three classes of nonsmooth, nonconvex minimization problems which are of the form (1) and can be solved efficiently by the smoothing methods studied in this paper.

2.1 Regularized minimization problems

min Θ(x) + λΦ(x)
s.t. Ax = b, ℓ ≤ x ≤ u,    (5)

where Θ: R^n → R_+ is a convex function, Φ: R^n → R_+ is a continuous function, and A ∈ R^{m×n}, b ∈ R^m, ℓ ∈ {R ∪ {−∞}}^n, u ∈ {R ∪ {∞}}^n (ℓ ≤ u) are given matrices and vectors. Problem (5) is a nonlinear programming problem with linear constraints, which includes the unconstrained optimization problem as a special case. The matrix A often has a simple structure; for instance, A = (1, . . . , 1) ∈ R^{1×n} and b = 1 in the Markowitz mean-variance model for portfolio selection. In the objective, Θ forces closeness to data, Φ pushes the solution x to exhibit some a priori expected features, and λ > 0 is a parameter that controls the trade-off between the data-fitting term and the regularization term. A class of regularization terms is of the form

Φ(x) = Σ_{i=1}^r ϕ(d_i^T x),    (6)

where d_i ∈ R^n, i = 1, . . . , r, and ϕ: R → R_+ is a continuous function. The regularization term Φ is also called a potential function [85] in image sciences and a penalty function [42] in statistics. The following fitting functions and regularization functions are widely used in practice.

– least squares: Θ(x) = ‖Hx − c‖_2^2,


– censored least absolute deviations [83]: Θ(x) = ‖(Hx)_+ − c‖_1,
– bridge penalty [61,67]: ϕ(t) = |t|^p,
– smoothly clipped absolute deviation (SCAD) [42]: ϕ(t) = ∫_0^{|t|} min(1, (α − s/λ)_+/(α − 1)) ds,
– minimax concave penalty (MCP) [106]: ϕ(t) = ∫_0^{|t|} (1 − s/(αλ))_+ ds,
– fraction penalty [52]: ϕ(t) = α|t|/(1 + α|t|),
– logistic penalty [84]: ϕ(t) = log(1 + α|t|),
– hard thresholding penalty function [41]: ϕ(t) = λ − (λ − |t|)_+^2/λ,

where H ∈ R^{m_1×n} and c ∈ R^{m_1} are given data, and α and p are positive constants. For the bridge penalty function, smoothness and convexity depend on the value of p: |t|^p with p > 1 is smooth and convex, |t|^p with p = 1 is nonsmooth and convex, and |t|^p with 0 < p < 1 is non-Lipschitz and nonconvex. The other five regularization functions are nonsmooth and nonconvex.
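For readers who want to evaluate the integral-form penalties numerically, the short Python sketch below computes SCAD and MCP directly from the integrands given above (by a simple trapezoidal rule), together with the closed-form fraction and logistic penalties; the parameter values α and λ are arbitrary illustrative choices, not values used in the paper.

import numpy as np

def _trapezoid(values, s):
    # Simple trapezoidal rule on the grid s.
    return float(np.sum((values[1:] + values[:-1]) * np.diff(s)) / 2.0)

def scad(t, alpha=3.7, lam=1.0, num=2001):
    # SCAD: phi(t) = int_0^{|t|} min(1, (alpha - s/lam)_+ / (alpha - 1)) ds
    s = np.linspace(0.0, abs(t), num)
    integrand = np.minimum(1.0, np.maximum(alpha - s / lam, 0.0) / (alpha - 1.0))
    return _trapezoid(integrand, s)

def mcp(t, alpha=2.0, lam=1.0, num=2001):
    # MCP: phi(t) = int_0^{|t|} (1 - s/(alpha*lam))_+ ds
    s = np.linspace(0.0, abs(t), num)
    return _trapezoid(np.maximum(1.0 - s / (alpha * lam), 0.0), s)

def fraction(t, alpha=2.0):
    return alpha * abs(t) / (1.0 + alpha * abs(t))

def logistic(t, alpha=2.0):
    return float(np.log(1.0 + alpha * abs(t)))

if __name__ == "__main__":
    for t in (0.0, 0.5, 2.0):
        print(t, scad(t), mcp(t), fraction(t), logistic(t))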

There is considerable evidence that using nonsmooth, nonconvex regularization terms can provide better reconstruction results than using smooth convex or nonsmooth convex regularization functions for piecewise constant image and signal reconstruction problems, variable selection and high-dimensional estimation. Moreover, using non-Lipschitz regularization functions can produce desirable sparse approximations. See [13,30,42,61,62,84]. However, finding a global minimizer of (5) with a non-Lipschitz regularization function can be strongly NP-hard [23].

It is worth noting that the regularization term in (6) can be generalized to

Φ(x) = Σ_{i=1}^r ϕ_i(d_i^T x)    (7)

and

Φ(x) = Σ_{i=1}^r ϕ_i(D_i x_i),    (8)

where D_i ∈ R^{ν_i×γ_i}, x_i ∈ R^{γ_i} and, for example, ϕ_i(D_i x_i) = log(1 + α‖D_i x_i‖) or ϕ_i(D_i x_i) = (‖D_i x_i‖_1)^p. In (7), various functions ϕ_i are used instead of a single function ϕ. In (8), the variables are divided into various groups, which can be used for both group variable selection and individual variable selection [62].

2.2 Eigenvalue optimization

min_x g(Λ(C + Y(x)^T Y(x))),    (9)

where g: R^m → R is a continuously differentiable function, C is an m × m symmetric positive semidefinite matrix and Y(x) is an r × m matrix for any given x ∈ R^n. Suppose that each element Y_{ij} of Y is a continuously differentiable function from R^n to R. Let


B(x) = C + Y(x)^T Y(x).

Then B(x) is an m × m symmetric positive semidefinite matrix and each of its elements is a continuously differentiable function of x. We use

Λ(B(x)) = (λ_1(B(x)), . . . , λ_m(B(x)))

to denote the vector of eigenvalues of B(x). We assume that the eigenvalues are ordered in decreasing order:

λ_1(B(x)) ≥ λ_2(B(x)) ≥ · · · ≥ λ_m(B(x)).

Consider the following eigenvalue optimization problems.

(i) Minimizing the condition number:

g(Λ(B(x))) = λ_1(B(x))/λ_m(B(x)).

(ii) Minimizing the product of the k largest eigenvalues:

g(Λ(B(x))) = ∏_{i=1}^k λ_i(B(x)).

(iii) Maximizing the minimal eigenvalue:

g(Λ(B(x))) = −λ_m(B(x)).

Such nonsmooth, nonconvex problems arise from optimal control, design of experiments and distributions of points on the sphere [10,29,71,76].

2.3 Stochastic complementarity problems

Let F: R^n → R^n be a continuously differentiable function. The following nonsmooth and nonconvex minimization problem

min_x θ(x) := ‖min(x, F(x))‖_2^2 = Σ_{i=1}^n (min(x_i, F_i(x)))^2    (10)

is equivalent to the nonlinear complementarity problem, denoted by NCP(F),

0 ≤ x ⊥ F(x) ≥ 0,    (11)

in the sense that the solution sets of (10) and (11) coincide with each other if the NCP(F) has a solution.


The function θ: R^n → R_+ is a residual function of the NCP(F). A function γ: R^n → R is called a residual function (also called a merit function) of the NCP(F) if γ(x) ≥ 0 for all x ∈ R^n and γ(x*) = 0 if and only if x* is a solution of the NCP(F). There are many residual functions of the NCP(F); see [36,40]. The function θ in (10) is called a natural residual function. We use θ here only for simplicity of explanation.

When F involves stochastic parameters ξ ∈ Ω ⊆ R^ℓ, in general we cannot find an x such that

0 ≤ x ⊥ F(ξ, x) ≥ 0,  ∀ ξ ∈ Ω;    (12)

equivalently, we cannot find an x such that

θ(ξ, x) := ‖min(x, F(ξ, x))‖_2^2 = 0,  ∀ ξ ∈ Ω.

We may consider a variety of reformulations. For example, find x ∈ R^n_+ such that

Prob{θ(ξ, x) = 0} ≥ α   or   0 ≤ x ⊥ E[F(ξ, x)] ≥ 0,

where α ∈ (0, 1) and E[·] denotes the expected value over Ω, a set representing future states of knowledge. The first formulation takes on the form of a “chance constraint”. The second one is called the Expected Value (EV) formulation, which is a deterministic NCP(F̄) with the expectation function F̄(x) = E[F(ξ, x)]. See [53,65,95,100]. Using the residual function θ, the EV formulation can be equivalently written as a minimization problem

min_x ‖min(x, E[F(ξ, x)])‖_2^2.    (13)

There are two other ways to reformulate the stochastic NCP by using a residual function θ(ξ, x) as the recourse cost, which depends on both the random event ξ and the decision x. By the definition of a residual function, the value of θ(ξ, x) reflects the cost due to failure in satisfying the equilibrium conditions at x and ξ. Minimizing the expected value of this cost over all possible scenarios gives the expected residual minimization (ERM) formulation of the stochastic NCP

min_{x≥0} E[θ(ξ, x)].    (14)

The ERM formulation (14) was introduced in [21] and applied to pricing American options, local control of discontinuous dynamics and transportation assignment [54,98,105]. Mathematical analysis and practical examples show that the ERM formulation is robust in the sense that its solutions have minimal sensitivity with respect to random parameter variations in the model [32,43,74,105].

On the other hand, taking the form of robust optimization (best worst case) [3] gives

min_x max_{ξ∈Ω} θ(ξ, x).    (15)


In the reformulations above, we need prior commitments to probability distributions. However, in practice, we can only guess the distribution of ξ with limited information. To find a robust solution, we can adapt the ideas of “distributionally robust optimization” [38], which takes into account knowledge about the distribution's support and a confidence region for its mean and its covariance matrix. In particular, we define a set P of possible probability distributions that is assumed to include the true distribution ρ_ξ, and the objective function is reformulated with respect to the worst-case expected loss over the choice of a distribution in P. Distributionally robust complementarity problems can then be reformulated as

min_{x≥0} max_{ρ_ξ∈P} E[θ(ξ, x)].    (16)

As the complementarity problem is a special subclass of the variational inequality problem, the stochastic complementarity problem is also a special subclass of the stochastic variational inequality problem. The EV reformulation and the ERM reformulation for finding a “here and now” solution x of the stochastic variational inequality problem

(y − x)^T F(ξ, x) ≥ 0,  ∀ y ∈ X_ξ, ∀ ξ ∈ Ω,

can be found in [28,95], where Xξ is a stochastic convex set.

3 Smooth approximations

The systematic study of nonsmooth functions has been a rich area of mathematical research for three decades. Clarke [34] introduced the notion of the generalized gradient ∂f(x) for Lipschitz continuous functions. Comprehensive studies of more recent developments can be found in [80,94]. The Clarke gradient and stationarity have been widely used in the construction and analysis of numerical methods for nonsmooth optimization problems. In addition to the study of general nonsmooth optimization, there is a large literature on more specialized problems, including semismooth and semiconvex functions [79], nonconvex polyhedral functions [87], composite nonsmooth functions [9,99,102] and piecewise smooth functions [93]. Furthermore, the study of smooth approximations for various specialized nonsmooth optimization problems and exact penalty functions has a long history [14,17,24,27,31,33,40,59,70,81,82,90,94].

In this section, we consider a class of nonsmooth functions which can be expressed as compositions of the plus function (t)_+ with smooth functions. All motivational problems in Sect. 2 belong to this class.

Chen and Mangasarian construct a class of smooth approximations of the function (t)_+ by convolution [17,90,94] as follows. Let ρ: R → R_+ be a piecewise continuous density function satisfying

ρ(s) = ρ(−s)  and  κ := ∫_{−∞}^{∞} |s|ρ(s)ds < ∞.


Then

φ(t, μ) := ∫_{−∞}^{∞} (t − μs)_+ ρ(s)ds    (17)

from R × R_+ to R_+ is well defined. Moreover, for any fixed μ > 0, φ(·, μ) is continuously differentiable, convex, strictly increasing, and satisfies [14]

0 ≤ φ(t, μ) − (t)_+ ≤ κμ.    (18)

The inequalities in (18) imply that for any t ∈ R,

lim_{t_k→t, μ↓0} φ(t_k, μ) = (t)_+.    (19)

Hence φ is a smoothing function of (t)_+ by Definition 1. Moreover, from 0 ≤ ∇_t φ(t, μ) = ∫_{−∞}^{t/μ} ρ(s)ds ≤ 1, we have

{ lim_{t_k→t, μ_k↓0} ∇_t φ(t_k, μ_k) } ⊆ ∂(t)_+ = { 1 if t > 0;  0 if t < 0;  [0, 1] if t = 0 },    (20)

where ∂(t)_+ is the Clarke subdifferential of (t)_+. The subdifferential associated with the smoothing function φ at a point t is

Gφ(t) = con{τ | ∇_t φ(t_k, μ_k) → τ, t_k → t, μ_k ↓ 0}.

At t = 0, taking two sequences t_k ↓ 0 and t_k ↑ 0, and setting μ_k = t_k^2, from 0 ≤ ∇_t φ(t_k, μ_k) = ∫_{−∞}^{t_k/μ_k} ρ(s)ds ≤ 1, we find Gφ(0) = [0, 1]. Hence the Clarke subdifferential coincides with the subdifferential associated with the smoothing function φ, namely, we have

Gφ(t) = ∂(t)_+.    (21)

Many nonsmooth optimization problems, including all motivational problems in Sect. 2, can be reformulated by using the plus function (t)_+. We list some examples as follows:

|x| = (x)_+ + (−x)_+
max(x, y) = x + (y − x)_+
min(x, y) = x − (x − y)_+
mid(x, ℓ, u) = min(max(ℓ, x), u), for given ℓ, u
x_{r_1} x_{r_2} x_{r_3} = max{x_i x_j x_k : i < j < k, i, j, k ∈ {1, . . . , n}},  where x_{r_1} ≥ · · · ≥ x_{r_n} ≥ 0.


We can choose a smooth approximation of (t)_+ to define smooth approximations for these nonsmooth functions and their compositions.
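As an illustration (not taken from the paper), the following Python sketch builds smooth approximations of |x|, max(x, y), min(x, y) and mid(x, ℓ, u) from a single smoothing function φ of (t)_+, here instantiated with the CHKS function that appears in Example 3 below.

import numpy as np

def phi(t, mu):
    # CHKS smoothing of the plus function (t)_+ (see Example 3).
    return 0.5 * (t + np.sqrt(t * t + 4.0 * mu * mu))

def smooth_abs(x, mu):          # |x| = (x)_+ + (-x)_+
    return phi(x, mu) + phi(-x, mu)

def smooth_max(x, y, mu):       # max(x, y) = x + (y - x)_+
    return x + phi(y - x, mu)

def smooth_min(x, y, mu):       # min(x, y) = x - (x - y)_+
    return x - phi(x - y, mu)

def smooth_mid(x, l, u, mu):    # mid(x, l, u) = min(max(l, x), u)
    return smooth_min(smooth_max(l, x, mu), u, mu)

if __name__ == "__main__":
    mu = 1e-3
    print(smooth_abs(-2.0, mu), smooth_max(1.0, 3.0, mu),
          smooth_min(1.0, 3.0, mu), smooth_mid(5.0, 0.0, 2.0, mu))

For small μ the printed values are close to 2, 3, 1 and 2, the values of the underlying nonsmooth functions.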

Example 1 Consider minimizing the function

f(x) = Σ_{i=1}^r |(h_i^T x)_+ − c_i|^p

arising in censored least absolute deviations [83], where p > 0 and h_i ∈ R^n, c_i ∈ R, i = 1, . . . , r.

Choose ρ(s) = 0 if |s| ≥ 1/2 and ρ(s) = 1 if |s| < 1/2.

Then

φ(t, μ) = ∫_{−∞}^{∞} (t − μs)_+ ρ(s)ds = { (t)_+ if |t| ≥ μ/2;  t^2/(2μ) + t/2 + μ/8 if |t| < μ/2 }

is a smoothing function of (t)_+. This function is called the uniform smoothing function.

From |t| = (t)_+ + (−t)_+, we know that

ψ(t, μ) := φ(t, μ) + φ(−t, μ) = { |t| if |t| ≥ μ/2;  t^2/μ + μ/4 if |t| < μ/2 }

is a smoothing function of |t| and

f̃(x, μ) = Σ_{i=1}^r (ψ(φ(h_i^T x, μ) − c_i, μ))^p

is a smoothing function of f(x) = Σ_{i=1}^r |(h_i^T x)_+ − c_i|^p.
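A small Python sketch of Example 1 is given below; the data h_i, c_i are random and serve only to check numerically that 0 ≤ φ(t, μ) − (t)_+ ≤ κμ (here κ = 1/4 for the uniform density) and that f̃(x, μ) approaches f(x) as μ ↓ 0.

import numpy as np

def phi_uniform(t, mu):
    # Uniform smoothing of (t)_+: (t)_+ if |t| >= mu/2, else t^2/(2mu) + t/2 + mu/8.
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) >= mu / 2, np.maximum(t, 0.0),
                    t * t / (2 * mu) + t / 2 + mu / 8)

def psi_uniform(t, mu):
    # Smoothing of |t| = (t)_+ + (-t)_+.
    return phi_uniform(t, mu) + phi_uniform(-t, mu)

def f(x, H, c, p):
    return np.sum(np.abs(np.maximum(H @ x, 0.0) - c) ** p)

def f_tilde(x, mu, H, c, p):
    return np.sum(psi_uniform(phi_uniform(H @ x, mu) - c, mu) ** p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, c = rng.standard_normal((5, 3)), rng.standard_normal(5)
    x, p = rng.standard_normal(3), 0.5
    t = np.linspace(-1.0, 1.0, 201)
    for mu in (1e-1, 1e-2, 1e-3):
        gap = phi_uniform(t, mu) - np.maximum(t, 0.0)
        print(mu, gap.min() >= 0, gap.max() <= 0.25 * mu + 1e-12,
              abs(f_tilde(x, mu, H, c, p) - f(x, H, c, p)))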

Example 2 [29,76] Consider minimizing the condition number of a symmetric positive definite matrix B(x) in Sect. 2.2,

f(x) = λ_1(B(x))/λ_m(B(x)).

Let h(x) = max(x_1, . . . , x_n). Choose ρ(s) = e^{−s}/(1 + e^{−s})^2. Then

φ(t, μ) = ∫_{−∞}^{∞} (t − μs)_+ ρ(s)ds = { μ ln(1 + e^{t/μ}) if μ > 0;  (t)_+ if μ = 0 }

is a smoothing function of (t)_+. This function is called the neural networks smoothing function.


For n = 2, from max(x_1, x_2) = x_1 + (x_2 − x_1)_+, we know that

x_1 + φ(x_2 − x_1, μ) = x_1 + μ ln(e^{(x_2−x_1)/μ} + 1) = μ ln(e^{x_1/μ} + e^{x_2/μ})

for μ > 0. Hence x_1 + φ(x_2 − x_1, μ) is a smoothing function of max(x_1, x_2).

Suppose that μ ln Σ_{i=1}^{n−1} e^{x_i/μ} for μ > 0 is a smoothing function of max(x_1, . . . , x_{n−1}). Then from max(x_1, . . . , x_{n−1}, x_n) = max(x_n, max(x_1, . . . , x_{n−1})), we find that

h̃(x, μ) = μ ln Σ_{i=1}^n e^{x_i/μ} ≈ μ ln(e^{x_n/μ} + e^{max(x_1,...,x_{n−1})/μ})

for μ > 0 is a smoothing function of h(x) = max(x_1, . . . , x_n).

Moreover, from min(x_1, . . . , x_n) = −max(−x_1, . . . , −x_n), we have that −h̃(−x, μ) = −μ ln Σ_{i=1}^n e^{−x_i/μ} for μ > 0 is a smoothing function of min(x_1, . . . , x_n). Hence, we get that

f̃(x, μ) = { −[ln Σ_{i=1}^m e^{λ_i(B(x))/μ}] / [ln Σ_{i=1}^m e^{−λ_i(B(x))/μ}] if μ > 0;  f(x) if μ = 0 }

is a smoothing function of f(x) = λ_1(B(x))/λ_m(B(x)). In computation, we use the stable form

f̃(x, μ) = [λ_1(B(x)) + μ ln Σ_{i=1}^m e^{(λ_i(B(x))−λ_1(B(x)))/μ}] / [λ_m(B(x)) − μ ln Σ_{i=1}^m e^{(λ_m(B(x))−λ_i(B(x)))/μ}]

for μ > 0.
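The stable form can be evaluated as in the following Python sketch, which works directly with a given symmetric positive definite matrix B rather than with B(x); the random matrix is an illustrative choice and the sketch only checks that f̃ approaches λ_1(B)/λ_m(B) as μ ↓ 0.

import numpy as np

def smoothed_condition_number(B, mu):
    # Stable log-sum-exp smoothing of lambda_1(B)/lambda_m(B).
    lam = np.sort(np.linalg.eigvalsh(B))[::-1]      # lam[0] >= ... >= lam[-1]
    num = lam[0] + mu * np.log(np.sum(np.exp((lam - lam[0]) / mu)))
    den = lam[-1] - mu * np.log(np.sum(np.exp((lam[-1] - lam) / mu)))
    return num / den

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Y = rng.standard_normal((6, 4))
    B = np.eye(4) + Y.T @ Y                          # symmetric positive definite
    lam = np.linalg.eigvalsh(B)                      # ascending order
    exact = lam[-1] / lam[0]                         # lambda_1 / lambda_m
    for mu in (1.0, 1e-1, 1e-2, 1e-3):
        print(mu, smoothed_condition_number(B, mu), exact)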

Example 3 [21,32] Consider the stochastic nonlinear complementarity problem and let

f(x) = E[‖min(x, F(ξ, x))‖_2^2].

Choose ρ(s) = 2/(s^2 + 4)^{3/2}. Then

φ(t, μ) = ∫_{−∞}^{∞} (t − μs)_+ ρ(s)ds = (1/2)(t + √(t^2 + 4μ^2))

is a smoothing function of (t)_+. This smoothing function is called the CHKS (Chen–Harker–Kanzow–Smale) smoothing function [15,66,96].

From min(x, F(ξ, x)) = x − (x − F(ξ, x))_+, we have that

f̃(x, μ) = (1/4) E[ Σ_{i=1}^n (x_i + F_i(ξ, x) − √((x_i − F_i(ξ, x))^2 + 4μ^2))^2 ]

is a smoothing function of f(x) = E[‖min(x, F(ξ, x))‖_2^2].
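A sample-average Python sketch of Example 3 is shown below; the affine model F(ξ, x) = Mx + ξ and the distribution of ξ are hypothetical choices used only to illustrate how the CHKS-smoothed residual is evaluated.

import numpy as np

def smoothed_residual(x, xi_samples, M, mu):
    # Sample-average approximation of the CHKS smoothing of E[ ||min(x, F(xi,x))||_2^2 ]
    # with the hypothetical affine model F(xi, x) = M x + xi.
    total = 0.0
    for xi in xi_samples:
        Fx = M @ x + xi
        smin = 0.5 * (x + Fx - np.sqrt((x - Fx) ** 2 + 4.0 * mu ** 2))  # smoothed min(x_i, F_i)
        total += np.sum(smin ** 2)
    return total / len(xi_samples)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n = 4
    M = np.eye(n) + 0.1 * rng.standard_normal((n, n))
    xi_samples = rng.standard_normal((200, n))
    x = np.abs(rng.standard_normal(n))
    exact = np.mean([np.sum(np.minimum(x, M @ x + xi) ** 2) for xi in xi_samples])
    for mu in (1e-1, 1e-2, 1e-3):
        print(mu, smoothed_residual(x, xi_samples, M, mu), exact)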


Remark 1 The plus function (t)_+ has been widely used for penalty functions, barrier functions, complementarity problems, semi-infinite programming, optimal control, mathematical programs with equilibrium constraints, etc. Examples of the class of smooth approximations φ of (t)_+ defined by (17) can be found in a number of articles; we refer to [2,5,14,16–18,22,24,27–31,33,39,40,64,92].

Remark 2 Some smoothing functions cannot be expressed as Chen–Mangasarian smoothing functions. Examples are the smoothing function (1/2)(t + s − √(t^2 + s^2 + 2μ^2)) of the Fischer–Burmeister function [45] (1/2)(t + s − √(t^2 + s^2)), proposed by Kanzow [66] for complementarity problems, and the smoothing function

φ(t, μ) = { (1/p)|t|^p − (1/p − 1/2)μ^p if |t| > μ;  (1/2)μ^{p−2}t^2 if |t| ≤ μ }

of (1/p)|t|^p proposed by Hintermüller and Wu [58] for ℓ_p-norm regularized minimization problems, where 0 < p < 1.
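For completeness, the Hintermüller–Wu smoothing function quoted above can be coded directly; the short Python sketch below merely checks continuity of the function and of its derivative at |t| = μ, for an arbitrary choice of p.

import numpy as np

def hw_phi(t, mu, p=0.5):
    # Hintermueller-Wu smoothing of (1/p)|t|^p for 0 < p < 1.
    if abs(t) > mu:
        return abs(t) ** p / p - (1.0 / p - 0.5) * mu ** p
    return 0.5 * mu ** (p - 2.0) * t * t

def hw_dphi(t, mu, p=0.5):
    # Derivative in t of hw_phi.
    if abs(t) > mu:
        return np.sign(t) * abs(t) ** (p - 1.0)
    return mu ** (p - 2.0) * t

if __name__ == "__main__":
    mu, p, eps = 0.1, 0.5, 1e-9
    print(hw_phi(mu - eps, mu, p), hw_phi(mu + eps, mu, p))    # both approx 0.5*mu**p
    print(hw_dphi(mu - eps, mu, p), hw_dphi(mu + eps, mu, p))  # both approx mu**(p-1)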

4 Analysis of smoothing functions

The plus function (t)_+ is convex and globally Lipschitz continuous. Any smoothing function φ(t, μ) of (t)_+ defined by (17) is also convex and globally Lipschitz and has the nice properties (18)–(21). In addition, for any fixed t, φ is continuously differentiable, monotonically increasing and convex with respect to μ > 0 and satisfies

0 ≤ φ(t, μ_2) − φ(t, μ_1) ≤ κ(μ_2 − μ_1),  for μ_2 ≥ μ_1.    (22)

See [14]. These properties are important for developing efficient smoothing methods for nonsmooth optimization. In this section, we study properties of the smoothing functions defined by composition with the smoothing function φ(t, μ). In particular, we investigate the subdifferential associated with a smoothing function in this class of smoothing functions.

4.1 Locally Lipschitz continuous functions

In this subsection, we assume that f is locally Lipschitz continuous. According to Rademacher's theorem, f is differentiable almost everywhere. The Clarke subdifferential of f at a point x is defined by

∂f(x) = con ∂_B f(x),

where

∂_B f(x) = {v | ∇f(z) → v, f is differentiable at z, z → x}.


For a locally Lipschitz function f, the gradient consistency

∂f(x) = con{ lim_{x^k→x, μ_k↓0} ∇_x f̃(x^k, μ_k) } = G_f̃(x),  ∀ x ∈ R^n    (23)

between the Clarke subdifferential and the subdifferential associated with a smoothing function of f is important for the convergence of smoothing methods. Rockafellar and Wets [94] show that for any locally Lipschitz function f, we can construct a smoothing function by using the convolution

f̃(x, μ) = ∫_{R^n} f(x − y)ψ_μ(y)dy = ∫_{R^n} f(y)ψ_μ(x − y)dy,    (24)

where ψ: R^n → R is a smooth kernel function (a mollifier); this smoothing function has the gradient consistency (23).

Now we show the gradient consistency of smoothing composite functions using φ in (17) for the plus function. For a vector function h(x) = (h_1(x), . . . , h_m(x))^T with components h_i: R^n → R, we denote (h(x))_+ = ((h_1(x))_+, . . . , (h_m(x))_+)^T and

φ(h(x), μ) = (φ(h_1(x), μ), . . . , φ(h_m(x), μ))^T.

Theorem 1 Let f(x) = g((h(x))_+), where h: R^n → R^m and g: R^m → R are continuously differentiable functions. Then f̃(x, μ) = g(φ(h(x), μ)) is a smoothing function of f with the following properties.

(i) For any x, { lim_{x^k→x, μ_k↓0} ∇_x f̃(x^k, μ_k) } is nonempty and bounded, and ∂f(x) = G_f̃(x).
(ii) If g and the h_i are convex and g is monotonically nondecreasing, then for any fixed μ > 0, f̃(·, μ) is convex.

Proof By the chain rule for continuously differentiable functions, f̃ is a smoothing function of f with

∇_x f̃(x, μ) = ∇h(x) diag(∇_y φ(y, μ)|_{y=h_i(x)}) ∇g(z)|_{z=φ(h(x),μ)}.

(i) Taking two sequences x^k → x and μ_k ↓ 0, we have

{ lim_{x^k→x, μ_k↓0} ∇_x f̃(x^k, μ_k) }
= { lim_{x^k→x, μ_k↓0} ∇h(x^k) diag(∇_y φ(y^k, μ_k)|_{y^k=h_i(x^k)}) ∇g(z^k)|_{z^k=φ(h(x^k),μ_k)} }
⊆ ∇h(x) ∂(u)_+|_{u=h(x)} ∇g(z)|_{z=(h(x))_+}.

Since (t)_+ is monotonically nondecreasing and h is continuously differentiable, by Theorem 2.3.9 in [34], (h(x))_+ is Clarke regular. Let

S = {V | ∇(u^k)_+ → V, u^k = h(x^k), x^k → x, ∇(u^k)_+ exists}.


Note that ∂(u)_+|_{u=h(x)} = con S. We have

∇h(x) ∂(u)_+|_{u=h(x)} ∇g(z)|_{z=(h(x))_+} = ∇h(x) con S ∇g(z)|_{z=(h(x))_+}
= con ∇h(x) S ∇g(z)|_{z=(h(x))_+}
⊆ ∂f(x).

Hence, we obtain

G_f̃(x) = con{ lim_{x^k→x, μ_k↓0} ∇_x f̃(x^k, μ_k) } ⊆ ∂f(x).

By the continuous differentiability of g and h, the function f is locally Lipschitz and ∂f(x) is bounded. On the other hand, according to Theorem 9.61 and (b) of Corollary 8.47 in [94], ∂f(x) ⊆ G_f̃(x). Hence we obtain ∂f(x) = G_f̃(x).

(ii) For any fixed μ, the smoothing function φ(t, μ) of the plus function (t)_+ is convex and monotonically nondecreasing. Hence for any λ ∈ (0, 1), we have

f̃(λx + (1 − λ)y, μ) = g(φ(h(λx + (1 − λ)y), μ))
≤ g(φ(λh(x) + (1 − λ)h(y), μ))
≤ g(λφ(h(x), μ) + (1 − λ)φ(h(y), μ))
≤ λ g(φ(h(x), μ)) + (1 − λ) g(φ(h(y), μ))
= λ f̃(x, μ) + (1 − λ) f̃(y, μ).

Corollary 1 Let f(x) = (g((h(x))_+))_+, where h: R^n → R^m and g: R^m → R are continuously differentiable functions. Then f̃(x, μ) = φ(g(φ(h(x), μ)), μ) is a smoothing function of f with the following properties.

(i) For any x, { lim_{x^k→x, μ_k↓0} ∇_x f̃(x^k, μ_k) } is nonempty and bounded, and ∂f(x) = G_f̃(x).
(ii) If g and the h_i are convex and g is monotonically nondecreasing, then for any fixed μ > 0, f̃(·, μ) is convex.

Proof By Theorem 2.3.9 in [34], (h(x))_+, g((h(x))_+) and f(x) are Clarke regular. From Theorem 2.3.9 in [34] and Proposition 1, we derive this corollary.

Many useful properties of the Clarke subdifferential can be applied to smoothing functions. For example, the following proposition is derived from the chain rule of Theorem 2.3.9 in [34].

Proposition 1 [5] Let g̃ and h̃ be smoothing functions of locally Lipschitz functions g: R^m → R and h: R^n → R^m with ∂g(x) = G_g̃(x) and ∂h(x) = G_h̃(x), respectively. Set g̃ = g or h̃ = h when g or h is differentiable. Then f̃ = g̃(h̃) is a smoothing function of f = g(h) with ∂f(x) = G_f̃(x) under any one of the following conditions, for any x ∈ R^n.

1. g is regular at h(x), each h_i is regular at x and ∇_z g̃(z, μ)|_{z=h(x)} ≥ 0.
2. g is continuously differentiable at h(x) and m = 1.
3. g is regular at h(x) and h is continuously differentiable.


Using the chain rules for smoothing functions, we can quickly verify that the smoothing functions f̃ in Example 1 with p ≥ 1 and in Examples 2 and 3 satisfy ∂f(x) = G_f̃(x).

4.2 Non-Lipschitz continuous functions

Lipschitz continuity is important in nonsmooth optimization. Generalized gradients and Jacobians of Lipschitz continuous functions have been well studied [24,34,80,94]. However, theory and algorithms for non-Lipschitz optimization problems are still lacking. Recently, non-Lipschitz optimization problems have attracted considerable attention from variable selection, sparse approximation, signal reconstruction, image restoration, exact penalty functions, eigenvalue optimization, etc. [42,62,70,75,78]. Reformulations or approximations of non-Lipschitz continuous functions by Lipschitz continuous functions have been proposed in [1,19,70,107]. In this subsection, we show that smoothing functions defined by composition of the smoothing functions (17) of the plus function, for a class of non-Lipschitz continuous functions, have both differentiability and Lipschitz continuity.

We consider the following non-Lipschitz function

f(x) = ‖(g(x))_+‖_p^p + ‖h(x)‖_p^p = Σ_{i=1}^r ((g_i(x))_+)^p + Σ_{i=1}^m |h_i(x)|^p,    (25)

where g: R^n → R^r and h: R^n → R^m are continuously differentiable functions and 0 < p < 1. This function includes the ℓ_p norm regularization and ℓ_p exact penalty functions for 0 < p < 1 as special cases [13,30,42,62,70,75,78]. It is easy to see that f is a continuous function from R^n to R_+. Moreover, f is continuously differentiable at x if all components of g and h at x are nonzero. However, f is possibly non-Lipschitz continuous at x if one of the components of g or h at x is zero.

We use a smoothing function φ(t, μ) in the class of Chen–Mangasarian smoothing functions (17) for (t)_+ and ψ(t, μ) = φ(t, μ) + φ(−t, μ) for |t| = (t)_+ + (−t)_+ to define a smoothing function of f as follows:

f̃(x, μ) = Σ_{i=1}^r (φ(g_i(x), μ))^p + Σ_{i=1}^m (ψ(h_i(x), μ))^p.    (26)

Lemma 1 (φ(t, μ))^p and (ψ(t, μ))^p are smoothing functions of ((t)_+)^p and |t|^p in R_+ and R, respectively. Moreover, the following statements hold.

(i) Let κ_0 = (κ/2)^p, where κ := ∫_{−∞}^{∞} |s|ρ(s)ds. Then for any t ∈ R,

0 ≤ (φ(t, μ))^p − ((t)_+)^p ≤ κ_0 μ^p  and  0 ≤ (ψ(t, μ))^p − |t|^p ≤ 2κ_0 μ^p.

(ii) For any fixed μ > 0, (φ(t, μ))^p and (ψ(t, μ))^p are Lipschitz continuous in R_+ and R, respectively. In particular, their gradients are bounded by p(φ(0, μ))^{p−1},


that is,

0 ≤ p(φ(t, μ))^{p−1} ∇_t φ(t, μ) ≤ p(φ(0, μ))^{p−1},  for t ∈ R_+

and

|p(ψ(t, μ))^{p−1} ∇_t ψ(t, μ)| ≤ 2p(φ(0, μ))^{p−1},  for t ∈ R.

(iii) Assume there is ρ_0 > 0 such that |ρ(s)| ≤ ρ_0. Let

ν_μ = p(1 − p)φ(0, μ)^{p−2} + (1/μ) p φ(0, μ)^{p−1} ρ_0;

then for any fixed μ > 0, the gradients of (φ(t, μ))^p and (ψ(t, μ))^p are Lipschitz continuous in R_+ and R, with Lipschitz constants ν_μ and 2ν_μ, respectively. In particular, if ρ is continuous at t, then |∇_t^2 (φ(t, μ))^p| ≤ ν_μ for t ∈ R_+ and |∇_t^2 (ψ(t, μ))^p| ≤ 2ν_μ for t ∈ R.

In addition, if ρ(s) > 0 for s ∈ (−∞, ∞), then (φ(t, μ))^p is a smoothing function of ((t)_+)^p in R, and for any fixed μ > 0, (φ(t, μ))^p and its gradient are Lipschitz continuous in R.

Proof We first prove the lemma for (φ(t, μ))^p. Since φ(t, μ) is a smoothing function of (t)_+ and ∇_t φ(t, μ) > 0 in R_+, (φ(t, μ))^p is a smoothing function of ((t)_+)^p.

(i) Since t^p is a monotonically increasing function in R_+, from (18), we have (φ(t, μ))^p − ((t)_+)^p ≥ 0. Moreover, from

∇_μ φ(t, μ) = −∫_{−∞}^{t/μ} sρ(s)ds,  for t ≥ 0,

the difference between (φ(t, μ))^p and ((t)_+)^p attains its maximal value at t = 0, that is,

0 ≤ (φ(t, μ))^p − ((t)_+)^p ≤ (φ(0, μ))^p = (κ/2)^p μ^p.

(ii)–(iii) Note that 0 ≤ ∇_t φ(t, μ) = ∫_{−∞}^{t/μ} ρ(s)ds ≤ 1. A straightforward calculation shows that for any t ∈ R_+,

p(φ(t, μ))^{p−1} ∇_t φ(t, μ) ≤ p(φ(0, μ))^{p−1} ∫_{−∞}^{t/μ} ρ(s)ds ≤ p(φ(0, μ))^{p−1}

and, if ρ is continuous at t, then

|∇_t^2 (φ(t, μ))^p| = |p(p − 1)(φ(t, μ))^{p−2} (∇_t φ(t, μ))^2 + p(φ(t, μ))^{p−1} ∇_t^2 φ(t, μ)|
≤ p(1 − p)(φ(0, μ))^{p−2} + p φ(0, μ)^{p−1} ρ(t/μ)(1/μ) ≤ ν_μ.


Since ρ is piecewise continuous, the gradient of (φ(t, μ))^p is locally Lipschitz and the second derivative ∇_t^2 (φ(t, μ))^p exists almost everywhere in R_+. By the mean value theorem in [34], we find that for any t_1, t_2 ∈ R_+,

|∇_t (φ(t_1, μ))^p − ∇_t (φ(t_2, μ))^p| ≤ ν_μ |t_1 − t_2|.

In addition, if ρ(s) > 0 for s ∈ (−∞, ∞), then 0 < ∇_t φ(t, μ) = ∫_{−∞}^{t/μ} ρ(s)ds < 1. Hence, we complete the proof for (φ(t, μ))^p.

By the symmetry of |t| = (t)_+ + (−t)_+ and ψ(t, μ) ≥ 2φ(0, μ) = κμ > 0 for t ∈ R, μ > 0, we can prove the lemma for ψ by the same argument as above.

Remark 3 If the density function ρ(s) = 0 for some s ∈ (−∞, ∞), then (φ(t, μ))^p is not necessarily a smoothing function of ((t)_+)^p in R. For example, the function ρ in Example 1 is a piecewise continuous function with ρ(s) = 0 for |s| ≥ 1/2. Consider the gradient of (φ(t, μ))^p at t = −μ/2. We have

lim_{τ↑0} [(φ(t + τ, μ))^p − (φ(t, μ))^p]/τ = lim_{τ↑0} (0 − 0)/τ = 0

and

lim_{τ↓0} [(φ(t + τ, μ))^p − (φ(t, μ))^p]/τ = lim_{τ↓0} (φ(t + τ, μ))^p/τ = lim_{τ↓0} (1/(2μ))^p τ^{2p−1}.

Hence (φ(t, μ))^p is a smoothing function of ((t)_+)^p in R if and only if p > 1/2.

On the other hand, (ψ(t, μ))^p is a smoothing function of |t|^p in R. Again consider the gradient of (ψ(t, μ))^p at t = −μ/2:

lim_{τ↑0} [(ψ(t + τ, μ))^p − (ψ(t, μ))^p]/τ = p|t|^{p−1} sign(t) = −p(μ/2)^{p−1},
lim_{τ↓0} [(ψ(t + τ, μ))^p − (ψ(t, μ))^p]/τ = p(t^2/μ + μ/4)^{p−1} (2t/μ) = −p(μ/2)^{p−1}.

Proposition 2 Let f and f̃ be defined as in (25) and (26), and let Ω = {x | g(x) ≥ 0}. Then f̃ is a smoothing function of f in Ω. Moreover, the following statements hold.

(i) There is a positive constant α_0 such that

0 ≤ f̃(x, μ) − f(x) ≤ α_0 μ^p.

(ii) For any fixed μ > 0, f̃(·, μ) and ∇_x f̃(·, μ) are locally Lipschitz continuous in Ω.
(iii) If g and h are globally Lipschitz continuous, then f̃(·, μ) is globally Lipschitz continuous in Ω; if ∇g and ∇h are globally Lipschitz continuous, then ∇_x f̃(·, μ) is globally Lipschitz continuous in Ω.


In addition, if ρ(s) > 0 for s ∈ (−∞, ∞), then f̃ is a smoothing function of f in R^n, and (i) and (ii) hold in R^n.

Proposition 2 can be easily proved by the chain rules and Lemma 1. Hence, we omit the proof.

Now we compare the smoothing function f̃ with the robust regularization

f̄_μ(x) = sup{ f(y) : y ∈ X, ‖x − y‖ ≤ μ }

for approximating a non-Lipschitz function f: X ⊂ R^n → R. The robust regularization f̄_μ was proposed by Lewis and Pang [70]. They use the following example

f(t) = { −t if t < 0;  √t if t ≥ 0 }    (27)

to illustrate the robust regularization. The function f is non-Lipschitz, since lim_{ε↓0} (f(ε) − f(0))/(ε − 0) = lim_{ε↓0} 1/√ε = ∞. The minimizer of f is 0. The robust regularization of f is

f̄_μ(t) = { μ − t if t < α(μ);  √(μ + t) if t ≥ α(μ) },

where μ > α(μ) = (1 + 2μ − √(1 + 8μ))/2 > −μ. The robust regularization f̄_μ(t) is Lipschitz continuous with Lipschitz modulus 1/(2√(μ + α(μ))) for any fixed μ > 0, but it is not differentiable. The minimizer of f̄_μ is α(μ), which is different from the minimizer of f.

Now we use the smoothing function ψ(t, μ) of |t| in Example 1 to define a smoothing function of f in (27). To simplify the notation, we replace μ by 4μ in ψ(t, μ). Then we have

f̃(t, μ) = { ψ(t, μ) if t < 0;  √(ψ(t, μ)) + μ − √μ if t ≥ 0 }

        = −t                         if t < −2μ,
          t^2/(4μ) + μ               if −2μ ≤ t < 0,
          √(t^2/(4μ) + μ) + μ − √μ   if 0 ≤ t < 2μ,
          √t + μ − √μ                if t ≥ 2μ.

Compared with the robust regularization, the smoothing function f̃(t, μ) is not only Lipschitz continuous, but also continuously differentiable with |∇_t f̃(t, μ)| ≤ 1/(2√(2μ)) ≤ 1/(2√(μ + α(μ))) for any fixed μ > 0. Moreover, the minimizer of f̃(·, μ) is 0, which is the same as the minimizer of f.
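This comparison can be reproduced numerically; the Python sketch below evaluates f, the robust regularization f̄_μ and the smoothing function f̃(·, μ) of (27) on a grid and reports their minimizers (an illustrative check, not an experiment from the paper).

import numpy as np

def f(t):
    return np.where(t < 0, -t, np.sqrt(np.maximum(t, 0.0)))

def f_bar(t, mu):
    # Robust regularization of (27): mu - t for t < alpha(mu), sqrt(mu + t) otherwise.
    alpha = (1.0 + 2.0 * mu - np.sqrt(1.0 + 8.0 * mu)) / 2.0
    return np.where(t < alpha, mu - t, np.sqrt(mu + t))

def f_tilde(t, mu):
    # Smoothing function of (27) built from psi(t, 4*mu) as in the text.
    psi = np.where(np.abs(t) >= 2.0 * mu, np.abs(t), t * t / (4.0 * mu) + mu)
    return np.where(t < 0, psi, np.sqrt(psi) + mu - np.sqrt(mu))

if __name__ == "__main__":
    t = np.linspace(-1.0, 1.0, 20001)
    for mu in (1e-1, 1e-2):
        print(mu, t[np.argmin(f(t))], t[np.argmin(f_bar(t, mu))], t[np.argmin(f_tilde(t, mu))])

The grid minimizers of f and f̃(·, μ) sit at 0, while the minimizer of f̄_μ sits near α(μ) < 0, consistent with the discussion above.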

For a non-Lipschitz function f, we cannot define the gradient of f. Scaled gradients and scaled first- and second-order necessary conditions for non-Lipschitz optimization have been studied in [4,25,30]. Now we consider the scaled gradient consistency for the following non-Lipschitz function:

f(x) := θ(x) + λ Σ_{i=1}^m ϕ(d_i^T x),    (28)

where θ: R^n → R_+ is locally Lipschitz continuous, λ ∈ R_+, d_i ∈ R^n, i = 1, . . . , m, and ϕ: R → R_+ is continuously differentiable in (−∞, 0) ∪ (0, +∞). Many penalty functions satisfy these conditions, for example, the six penalty functions in Sect. 2.1. For a given vector x ∈ R^n, we set

I_x = {i | d_i^T x = 0, i = 1, . . . , m}  and  J_x = {i | d_i^T x ≠ 0, i = 1, . . . , m}.

Let Z_x be an n × ℓ matrix whose columns form an orthonormal basis for the null space of {d_i | i ∈ I_x} [25].

Let f̃(x, μ) = θ̃(x, μ) + λ Σ_{i=1}^m φ(d_i^T x, μ), where θ̃ is a smoothing function of θ which satisfies the gradient consistency ∂θ(x) = G_θ̃(x), and φ is a smoothing function of ϕ which satisfies the gradient consistency ϕ'(t) = Gφ(t) at t ≠ 0.

Theorem 2 For any x ∈ R^n, we have

Z_x^T G_f̃(x) = Z_x^T ∂(θ(x) + λ Σ_{i∈J_x} ϕ(d_i^T x)).    (29)

Proof If x = 0, then J_x = ∅ and (29) holds. If x ≠ 0, then rank(Z_x) = ℓ > 0. Moreover, we have

Z_x^T ∇_x f̃(x^k, μ) = Z_x^T (∇_x θ̃(x^k, μ) + λ Σ_{i=1}^m d_i ∇_t φ(t^k, μ)|_{t^k=d_i^T x^k})
                    = Z_x^T ∇_x θ̃(x^k, μ) + λ Σ_{i∈J_x} Z_x^T d_i ∇_t φ(t^k, μ)|_{t^k=d_i^T x^k},

where the last equality uses Z_x^T d_i = 0, ∀ i ∈ I_x. Since ϕ is continuously differentiable at d_i^T x for i ∈ J_x, we obtain (29) by the gradient consistency of θ̃ and φ.

5 Smoothing algorithms

Numerical methods for solving nonsmooth, nonconvex optimization problems have been studied extensively [7,10,12,35,37,51,59,68,77,102].

To explain the numerical implementation and convergence analysis of smoothing methods in a clear and simple way, we use a smoothing gradient method to show how to update the smoothing parameter to guarantee global convergence.

Burke et al. [10] proposed a robust gradient sampling algorithm for nonsmooth, nonconvex, Lipschitz continuous unconstrained minimization problems. Basically, it is a stabilized steepest descent algorithm. At each iteration of the algorithm, a descent direction is obtained by evaluating the gradient at the current point and at additional nearby points at which the function is differentiable, and then computing the vector in the convex hull of these gradients with least ℓ_2 norm. A standard line search is then used to obtain a new point. If the function is not differentiable at one of these points, the algorithm terminates. Kiwiel [69] slightly revised this algorithm and showed that any accumulation point generated by the revised algorithm is a Clarke stationary point with probability one. Under the same conditions as in [10,69], we show that any accumulation point generated by the smoothing gradient method is a Clarke stationary point. Moreover, we show that any accumulation point generated by the smoothing gradient method is a scaled Clarke stationary point for non-Lipschitz optimization problems.

We consider (1) with X = R^n. Assume that f: R^n → R is locally Lipschitz, level-bounded, and continuously differentiable on an open dense subset of R^n [10,69]. Let f̃ be a smoothing function with the gradient consistency (4).

Smoothing gradient method

Step 1. Choose constants σ, ρ ∈ (0, 1), γ > 0 and an initial point x^0. Set k = 0.

Step 2. Compute the gradient

g^k = ∇_x f̃(x^k, μ_k)

and the step size ν_k by the Armijo line search, where ν_k = max{ρ^0, ρ^1, . . .} and ρ^i satisfies

f̃(x^k − ρ^i g^k, μ_k) ≤ f̃(x^k, μ_k) − σρ^i (g^k)^T g^k.

Set x^{k+1} = x^k − ν_k g^k.

Step 3. If ‖∇_x f̃(x^{k+1}, μ_k)‖ ≥ γμ_k, then set μ_{k+1} = μ_k; otherwise, choose μ_{k+1} = σμ_k.
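A direct Python transcription of the smoothing gradient method is sketched below; it is applied to the unconstrained toy problem f(x) = ‖x − a‖_1, smoothed componentwise by the uniform smoothing function of Example 1, which is an illustrative choice rather than an experiment from the paper.

import numpy as np

def psi(t, mu):
    # Uniform smoothing function of |t| (Example 1).
    return np.where(np.abs(t) >= mu / 2, np.abs(t), t * t / mu + mu / 4)

def dpsi(t, mu):
    return np.where(np.abs(t) >= mu / 2, np.sign(t), 2 * t / mu)

def smoothing_gradient_method(f_tilde, grad_f_tilde, x0, mu0=1.0, sigma=0.5,
                              rho=0.5, gamma=1.0, mu_tol=1e-8, max_iter=2000):
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_iter):
        g = grad_f_tilde(x, mu)                  # Step 2: gradient of the smoothing function
        nu = 1.0                                 # Armijo line search: nu = max{rho^0, rho^1, ...}
        while f_tilde(x - nu * g, mu) > f_tilde(x, mu) - sigma * nu * (g @ g):
            nu *= rho
        x = x - nu * g
        # Step 3: keep mu while the smoothed gradient is still large, otherwise shrink it.
        if np.linalg.norm(grad_f_tilde(x, mu)) < gamma * mu:
            mu *= sigma
            if mu < mu_tol:
                break
    return x, mu

if __name__ == "__main__":
    # Illustrative problem: f(x) = ||x - a||_1 with minimizer a, smoothed by psi.
    a = np.array([1.0, -2.0, 0.5])
    f_t = lambda x, mu: float(np.sum(psi(x - a, mu)))
    g_t = lambda x, mu: dpsi(x - a, mu)
    x, mu = smoothing_gradient_method(f_t, g_t, np.zeros(3))
    print(np.round(x, 4), mu)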

Theorem 3 Any accumulation point generated by the smoothing gradient method is a Clarke stationary point.

This theorem can be proved in a similar way as [33, Theorem 2.6]. For completeness, we give a simple proof.

Proof Denote K = {k | μ_{k+1} = σμ_k}. If K is finite, then there exists an integer k̄ such that for all k > k̄,

‖∇_x f̃(x^{k+1}, μ_k)‖ ≥ γμ_k,    (30)

and μ_k =: μ̄ for all k ≥ k̄ in Step 2 of the smoothing gradient method. Since f̃(·, μ̄) is a smooth function, the gradient method for solving

min f̃(x, μ̄)

satisfies

lim inf_{k→∞} ‖∇_x f̃(x^k, μ̄)‖ = 0,    (31)

which contradicts (30). This shows that K must be infinite and lim_{k→∞} μ_k = 0.

Since K is infinite, we can assume that K = {k_0, k_1, . . .} with k_0 < k_1 < · · ·. Then we have

lim_{i→∞} ‖∇_x f̃(x^{k_i+1}, μ_{k_i})‖ ≤ γ lim_{i→∞} μ_{k_i} = 0.    (32)

Let x̄ be an accumulation point of {x^{k_i+1}}. Then by the gradient consistency, we have 0 ∈ ∂f(x̄). Hence x̄ is a Clarke stationary point.

Corollary 2 Any accumulation point x̄ generated by the smoothing gradient method for solving (28) is an affine-scaled Clarke stationary point, that is,

0 ∈ Z_x̄^T ∂(θ(x̄) + λ Σ_{i∈J_x̄} ϕ(d_i^T x̄)).    (33)

Proof Following the proof of Theorem 3, we can show that there is a subsequence K = {k_0, k_1, . . .} with k_0 < k_1 < · · · such that (32) holds. Let x̄ be an accumulation point of {x^{k_i+1}}. From Theorem 2, we have

0 = lim_{i→∞} Z_x̄^T ∇_x f̃(x^{k_i+1}, μ_{k_i}) ∈ Z_x̄^T ∂(θ(x̄) + λ Σ_{i∈J_x̄} ϕ(d_i^T x̄)).

Remark 4 Developing a smoothing method for solving a nonsmooth, nonconvex optimization problem (1) involves three main parts. (i) Define a smoothing function f̃ by using the methods and propositions in Sects. 3 and 4. (ii) Choose an algorithm for solving the smooth optimization problem (2) with the objective function f̃; for instance, the smoothing gradient method uses the gradient method in its Step 2. (iii) Update the smoothing parameter μ_k. The updating scheme will depend on the convergence of the algorithm used to solve the smooth optimization problem. As shown above, since the gradient method has the convergence property (31), we set the updating condition ‖∇_x f̃(x^{k+1}, μ_k)‖ < γμ_k in Step 3 of the smoothing gradient method. The efficiency of smoothing methods will depend on the smoothing function f̃, the method for solving the smooth optimization problem (2) and the scheme for updating the smoothing parameter μ_k.

There are rich theory and abundant efficient algorithms for solving smooth optimization problems [86]. With adaptive smoothing functions and updating schemes, these theory and algorithms can be powerful for solving nonsmooth optimization problems.


6 Numerical implementation

Many smoothing methods have been developed to solve nonsmooth, nonconvex optimization [4,5,18,22,24,26–31,33,39,40,46–48,54,64,66,72,73,81,82,89,90]. Recently, Garmanjani and Vicente [50] proposed a smoothing direct search (DS) algorithm based on smooth approximations and derivative-free methods for nonsmooth, nonconvex, Lipschitz continuous unconstrained minimization problems. The smoothing DS algorithm takes at most O(−ε^{−3} log ε) function evaluations to find an x such that ‖∇_x f̃(x, μ)‖_∞ ≤ ε and μ ≤ ε. In [6], Bian and Chen propose a smoothing sequential quadratic programming (SSQP) algorithm for solving the regularized minimization problem (5) where Θ is continuously differentiable but the penalty function may be non-Lipschitz. The worst-case complexity of the SSQP algorithm for finding an ε affine-scaled stationary point is O(ε^{−2}). Moreover, if Φ is locally Lipschitz, the SSQP algorithm with a slightly modified updating scheme can obtain an ε Clarke stationary point in at most O(ε^{−3}) steps.

In the following, we illustrate the numerical implementation of smoothing methods using the SSQP algorithm for solving the ℓ_2-ℓ_p problem

min_{x∈R^n} f(x) := ‖Ax − b‖_2^2 + λ‖x‖_p^p,    (34)

where A ∈ R^{m×n}, b ∈ R^m, ‖x‖_p^p = Σ_{i=1}^n |x_i|^p, p ∈ (0, 1). It is shown in [23] that problem (34) is strongly NP-hard. Let X = diag(x). The affine-scaled first-order necessary condition for local minimizers of (34) is

G(x) := 2XA^T(Ax − b) + λp|x|^p = 0    (35)

and the affine-scaled second-order necessary condition is that

2XA^T AX + λp(p − 1)|X|^p    (36)

is positive semidefinite, where |X|^p = diag(|x|^p) [30]. Obviously, (34) is a special case of (28), and (35) is a special case of (33) with the ith column (Z_x)_i = (X)_i/|x_i| for i ∈ J_x. Moreover, (35) is equivalent to

2(A^T(Ax − b))_i + λp|x_i|^{p−1} sign(x_i) = 0,  i ∈ J_x = {i | x_i ≠ 0}.

Using (35) and (36), several good properties of the ℓ_2-ℓ_p problem (34) have been derived [23,30], including lower bounds for the nonzero entries of its local minimizers and the sparsity of its local minimizers. Moreover, a smoothing trust region Newton method for finding a point x satisfying (35) and (36) is proposed in [25].

Let us use the smoothing function

s(t, μ) = { |t| if |t| > μ;  t^2/(2μ) + μ/2 if |t| ≤ μ }

of |t| to define the smoothing function

f̃(x, μ) = ‖Ax − b‖_2^2 + λ Σ_{i=1}^n s(x_i, μ)^p

of f. Then we can easily find that

lim_{x^k→x, μ↓0} X^k ∇_x f̃(x^k, μ)
= lim_{x^k→x, μ↓0} [ 2X^k A^T(Ax^k − b) + λp Σ_{i=1}^n s(x_i^k, μ)^{p−1} s'(x_i^k, μ) x_i^k e_i ]
= 2XA^T(Ax − b) + λp|x|^p
= lim_{x^k→x, μ↓0} X ∇_x f̃(x^k, μ).

We apply the SSQP method [6] to solve (34).

SSQP algorithm
Choose x^0 ∈ R^n, μ_0 > 0 and σ ∈ (0, 1). Set k = 0 and z^0 = x^0. For k ≥ 0, set

x^{k+1} = argmin_{x∈R^n} f̃(x^k, μ_k) + (x − x^k)^T g^k + (1/2)(x − x^k)^T D^k (x − x^k),

μ_{k+1} = { μ_k if f̃(x^{k+1}, μ_k) − f̃(x^k, μ_k) ≤ −4αpμ_k^p;  σμ_k otherwise },

z^{k+1} = { x^k if μ_{k+1} = σμ_k;  z^k otherwise }.

Here D^k is a diagonal matrix whose diagonal elements d_i^k, i = 1, . . . , n, are defined by

d_i^k = { max{ 2‖A^T A‖ + 8λp|x_i^k/2|^{p−2},  |g_i^k| / (2^{p/2−1} μ^{p/2} |x_i^k|^{1−p/2}) }  if |x_i^k| > 2μ;
          max{ 2‖A^T A‖ + 8λp μ^{p−2},  |g_i^k|/μ }  if |x_i^k| ≤ 2μ },

where g^k = ∇_x f̃(x^k, μ_k). The solution of the QP problem is given by x_i^{k+1} = x_i^k − g_i^k/d_i^k, i = 1, . . . , n.

It is shown in [6] that for any ε > 0, the SSQP algorithm with any starting point z^0 takes at most O(ε^{−2}) steps to find a z^k such that ‖G(z^k)‖_∞ ≤ ε.
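The SSQP iteration can be transcribed into the following Python sketch; the constant α in the μ-update is a parameter from [6] whose value is not specified here, so the value below, the random data and the simplified stopping test at x^k are hypothetical choices for illustration only.

import numpy as np

def s(t, mu):
    # Smoothing function of |t| used for the l2-lp problem (34).
    return np.where(np.abs(t) > mu, np.abs(t), t * t / (2 * mu) + mu / 2)

def ds(t, mu):
    return np.where(np.abs(t) > mu, np.sign(t), t / mu)

def f_tilde(x, mu, A, b, lam, p):
    return np.sum((A @ x - b) ** 2) + lam * np.sum(s(x, mu) ** p)

def grad_f_tilde(x, mu, A, b, lam, p):
    return 2 * A.T @ (A @ x - b) + lam * p * s(x, mu) ** (p - 1) * ds(x, mu)

def ssqp(A, b, lam=0.5, p=0.5, mu0=10.0, sigma=0.99, alpha=1.0,
         tol=1e-4, max_iter=20000):
    n = A.shape[1]
    x, z, mu = np.zeros(n), np.zeros(n), mu0
    nAA = np.linalg.norm(A.T @ A, 2)                    # ||A^T A||
    for _ in range(max_iter):
        g = grad_f_tilde(x, mu, A, b, lam, p)
        d = np.empty(n)                                 # diagonal of D^k
        for i in range(n):
            if abs(x[i]) > 2 * mu:
                d[i] = max(2 * nAA + 8 * lam * p * abs(x[i] / 2) ** (p - 2),
                           abs(g[i]) / (2 ** (p / 2 - 1) * mu ** (p / 2)
                                        * abs(x[i]) ** (1 - p / 2)))
            else:
                d[i] = max(2 * nAA + 8 * lam * p * mu ** (p - 2), abs(g[i]) / mu)
        x_new = x - g / d                               # closed-form solution of the diagonal QP
        if f_tilde(x_new, mu, A, b, lam, p) - f_tilde(x, mu, A, b, lam, p) > -4 * alpha * p * mu ** p:
            z, mu = x.copy(), sigma * mu                # insufficient decrease: z^{k+1} = x^k, shrink mu
        x = x_new
        G = 2 * x * (A.T @ (A @ x - b)) + lam * p * np.abs(x) ** p   # scaled gradient (35)
        if np.max(np.abs(G)) <= tol and mu <= tol:      # simplified test at x^k (the paper uses z^k)
            break
    return x, z, mu

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    A = rng.standard_normal((20, 8))
    x_true = np.zeros(8)
    x_true[[0, 3]] = [1.5, -2.0]
    b = A @ x_true + 0.01 * rng.standard_normal(20)
    x, z, mu = ssqp(A, b)
    print(np.round(x, 3), mu)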

Example 4 We use Example 3.2.1 (prostate cancer) in [55] to show the numerical performance of the SSQP algorithm. The data for this example consist of the medical records of 97 men who were about to receive a radical prostatectomy, divided into a training set with 67 observations and a test set with 30 observations. The variables are eight clinical measures: lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45. The aim is to find a few main factors with small prediction error.


Table 1 Example 4: Variable selection by SSQP, Lasso and Best subset methods

                      SSQP                             Lasso     Best subset
λ                     7.734     7.78      22.1
x*_1 (lcavol)         0.6302    0.6437    0.7004       0.533     0.740
x*_2 (lweight)        0.2418    0.2765    0.1565       0.169     0.316
x*_3 (age)            0         0         0            0         0
x*_4 (lbph)           0.0755    0         0            0.002     0
x*_5 (svi)            0.1674    0.1327    0            0.094     0
x*_6 (lcp)            0         0         0            0         0
x*_7 (gleason)        0         0         0            0         0
x*_8 (pgg45)          0         0         0            0         0
Number of nonzeros    4         3         2            4         2
Error                 0.4294    0.4262    0.488        0.479     0.492

Fig. 1 Convergence of {z^k} generated by the SSQP for (34)

Fig. 2 Convergence of f̃(z^k, μ_k), f(z^k), μ_k and ‖G(z^k)‖_∞ with λ = 7.78


We use the training set to build model (34) with matrix A ∈ R^{67×8}, b ∈ R^{67} and p = 0.5. The test set is used to compute the prediction error and judge the performance of the selected methods. Table 1 reports numerical results of the SSQP algorithm with algorithm parameters σ = 0.99, μ_0 = 10, x^0 = (0, . . . , 0)^T ∈ R^8 and stopping criterion ‖G(z^k)‖_∞ ≤ 10^{−4} and μ_k ≤ 10^{−4}. Moreover, results of the best two methods (Lasso, Best subset) from Table 3.3 in [55] are also listed in Table 1. We use Figs. 1 and 2 to show the convergence of {z^k}, f̃(z^k, μ_k), f(z^k), μ_k and ‖G(z^k)‖_∞ generated by the SSQP algorithm for (34) with λ = 7.734 and λ = 7.78.

Theoretical and numerical results show that smoothing methods are promising for nonsmooth, nonconvex, non-Lipschitz optimization problems.

Acknowledgments We would like to thank Prof. Masao Fukushima for reading this paper carefully and giving many helpful comments. Thanks to Dr. Wei Bian for helpful discussions concerning the analysis of smoothing functions and numerical experience. This work was supported in part by Hong Kong Research Grant Council grant PolyU5003/10p.

References

1. Alefeld, G., Chen, X.: A regularized projection method for complementarity problems with non-Lipschitzian functions. Math. Comput. 77, 379–395 (2008)
2. Auslender, A.: How to deal with the unbounded in optimization: theory and algorithm. Math. Program. 79, 3–18 (1997)
3. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
4. Bian, W., Chen, X.: Smoothing neural network for constrained non-Lipschitz optimization with applications. IEEE Trans. Neural Netw. Learn. Syst. 23, 399–411 (2012)
5. Bian, W., Chen, X.: Neural network for nonsmooth, nonconvex constrained minimization via smooth approximation. Preprint (2011)
6. Bian, W., Chen, X.: Smoothing SQP algorithm for non-Lipschitz optimization with complexity analysis. Preprint (2012)
7. Bian, W., Xue, X.P.: Subgradient-based neural network for nonsmooth nonconvex optimization problem. IEEE Trans. Neural Netw. 20, 1024–1038 (2009)
8. Bruckstein, A.M., Donoho, D.L., Elad, M.: From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51, 34–81 (2009)
9. Burke, J.V.: Descent methods for composite nondifferentiable optimization problems. Math. Program. 33, 260–279 (1985)
10. Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15, 751–779 (2005)
11. Burke, J.V., Henrion, D., Lewis, A.S., Overton, M.L.: Stabilization via nonsmooth, nonconvex optimization. IEEE Trans. Automat. Control 51, 1760–1769 (2006)
12. Cartis, C., Gould, N.I.M., Toint, P.: On the evaluation complexity of composite function minimization with applications to nonconvex nonlinear programming. SIAM J. Optim. 21, 1721–1739 (2011)
13. Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Proc. Lett. 14, 707–710 (2007)
14. Chen, B., Chen, X.: A global and local superlinear continuation-smoothing method for P0 and R0 NCP or monotone NCP. SIAM J. Optim. 9, 624–645 (1999)
15. Chen, B., Harker, P.T.: A non-interior-point continuation method for linear complementarity problems. SIAM J. Matrix Anal. Appl. 14, 1168–1190 (1993)
16. Chen, B., Xiu, N.: A global linear and local quadratic noninterior continuation method for nonlinear complementarity problems based on Chen–Mangasarian smoothing functions. SIAM J. Optim. 9, 605–623 (1999)
17. Chen, C., Mangasarian, O.L.: A class of smoothing functions for nonlinear and mixed complementarity problems. Math. Program. 71, 51–70 (1995)


18. Chen, X.: Smoothing methods for complementarity problems and their applications: a survey. J. Oper. Res. Soc. Japan 43, 32–46 (2000)

19. Chen, X.: A superlinearly and globally convergent method for reaction and diffusion problems with a non-Lipschitzian operator. Computing [Suppl] 15, 79–90 (2001)

20. Chen, X.: First order conditions for nonsmooth discretized constrained optimal control problems. SIAM J. Control Optim. 42, 2004–2015 (2004)

21. Chen, X., Fukushima, M.: Expected residual minimization method for stochastic linear complementarity problems. Math. Oper. Res. 30, 1022–1038 (2005)

22. Chen, X., Fukushima, M.: A smoothing method for a mathematical program with P-matrix linear complementarity constraints. Comput. Optim. Appl. 27, 223–246 (2004)

23. Chen, X., Ge, D., Wang, Z., Ye, Y.: Complexity of unconstrained L2-Lp minimization. Preprint (2011)

24. Chen, X., Nashed, Z., Qi, L.: Smoothing methods and semismooth methods for nondifferentiable operator equations. SIAM J. Numer. Anal. 38, 1200–1216 (2000)

25. Chen, X., Niu, L., Yuan, Y.: Optimality conditions and smoothing trust region Newton method for non-Lipschitz optimization. Preprint (2012)

26. Chen, X., Qi, L.: A parameterized Newton method and a quasi-Newton method for nonsmooth equations. Comput. Optim. Appl. 3, 157–179 (1994)

27. Chen, X., Qi, L., Sun, D.: Global and superlinear convergence of the smoothing Newton method and its application to general box constrained variational inequalities. Math. Comput. 67, 519–540 (1998)

28. Chen, X., Wets, R.J.-B., Zhang, Y.: Stochastic variational inequalities: residual minimization smoothing/sample average approximations. SIAM J. Optim. (to appear)

29. Chen, X., Womersley, R., Ye, J.: Minimizing the condition number of a Gram matrix. SIAM J. Optim. 21, 127–148 (2011)

30. Chen, X., Xu, F., Ye, Y.: Lower bound theory of nonzero entries in solutions of l2-lp minimization. SIAM J. Sci. Comput. 32, 2832–2852 (2012)

31. Chen, X., Ye, Y.: On homotopy-smoothing methods for variational inequalities. SIAM J. Control Optim. 37, 587–616 (1999)

32. Chen, X., Zhang, C., Fukushima, M.: Robust solution of monotone stochastic linear complementarity problems. Math. Program. 117, 51–80 (2009)

33. Chen, X., Zhou, W.: Smoothing nonlinear conjugate gradient method for image restoration using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. 3, 765–790 (2010)

34. Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley, New York (1983)

35. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. MPS-SIAM Book Series on Optimization. SIAM, Philadelphia (2009)

36. Cottle, R.W., Pang, J.S., Stone, R.E.: The Linear Complementarity Problem. Academic Press, Boston (1992)

37. Daniilidis, A., Sagastizábal, C., Solodov, M.: Identifying structure of nonsmooth convex functions by the bundle technique. SIAM J. Optim. 20, 820–840 (2009)

38. Delage, E., Ye, Y.: Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58, 595–612 (2010)

39. Facchinei, F., Jiang, H., Qi, L.: A smoothing method for mathematical programs with equilibrium constraints. Math. Program. 85, 107–134 (1999)

40. Facchinei, F., Pang, J.S.: Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer, New York (2003)

41. Fan, J.: Comments on "Wavelets in Statistics: a review" by A. Antoniadis. Stat. Method. Appl. 6, 131–138 (1997)

42. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)

43. Fang, H., Chen, X., Fukushima, M.: Stochastic R0 matrix linear complementarity problems. SIAM J. Optim. 18, 482–506 (2007)

44. Ferris, M.: Extended mathematical programming: competition and stochasticity. SIAM News 45, 1–2 (2012)

45. Fischer, A.: A special Newton-type optimization method. Optimization 24, 269–284 (1992)

46. Fukushima, M., Luo, Z.-Q., Pang, J.-S.: A globally convergent sequential quadratic programming algorithm for mathematical programs with linear complementarity constraints. Comput. Optim. Appl. 10, 5–34 (1998)


47. Fukushima, M., Luo, Z.-Q., Tseng, P.: Smoothing functions for second-order-cone complementarity problems. SIAM J. Optim. 12, 436–460 (2002)

48. Fukushima, M., Pang, J.-S.: Convergence of a smoothing continuation method for mathematical programs with complementarity constraints. In: Thera, M., Tichatschke, R. (eds.) Lecture Notes in Economics and Mathematical Systems, vol. 477, pp. 99–110. Springer, Berlin (1999)

49. Gabriel, S.A., Moré, J.J.: Smoothing of mixed complementarity problems. In: Ferris, M.C., Pang, J.S. (eds.) Complementarity and Variational Problems: State of the Art, pp. 105–116. SIAM, Philadelphia (1997)

50. Garmanjani, R., Vicente, L.N.: Smoothing and worst case complexity for direct-search methods in non-smooth optimization. Preprint (2011)

51. Ge, D., Jiang, X., Ye, Y.: A note on the complexity of Lp minimization. Math. Program. 21, 1721–1739 (2011)

52. Geman, D., Reynolds, G.: Constrained restoration and the recovery of discontinuities. IEEE Trans. Pattern Anal. Mach. Intell. 14, 357–383 (1992)

53. Gürkan, G., Özge, A.Y., Robinson, S.M.: Sample-path solution of stochastic variational inequalities. Math. Program. 84, 313–333 (1999)

54. Hamatani, K., Fukushima, M.: Pricing American options with uncertain volatility through stochastic linear complementarity models. Comput. Optim. Appl. 50, 263–286 (2011)

55. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)

56. Hayashi, S., Yamashita, N., Fukushima, M.: A combined smoothing and regularization method for monotone second-order cone complementarity problems. SIAM J. Optim. 15, 593–615 (2005)

57. Heinkenschloss, M., Kelley, C.T., Tran, H.T.: Fast algorithms for nonsmooth compact fixed-point problems. SIAM J. Numer. Anal. 29, 1769–1792 (1992)

58. Hintermueller, M., Wu, T.: Nonconvex TVq-models in image restoration: analysis and a trust-region regularization based superlinearly convergent solver. Preprint (2011)

59. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods. Springer, Berlin (1993)

60. Hu, M., Fukushima, M.: Smoothing approach to Nash equilibrium formulations for a class of equilibrium problems with shared complementarity constraints. Comput. Optim. Appl. (to appear)

61. Huang, J., Horowitz, J.L., Ma, S.: Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Stat. 36, 587–613 (2008)

62. Huang, J., Ma, S., Xie, H., Zhang, C.-H.: A group bridge approach for variable selection. Biometrika 96, 339–355 (2009)

63. Huang, Z., Qi, L., Sun, D.: Sub-quadratic convergence of a smoothing Newton algorithm for the P0- and monotone LCP. Math. Program. 99, 423–441 (2004)

64. Jiang, H., Ralph, D.: Smooth SQP methods for mathematical programs with nonlinear complementarity constraints. Comput. Optim. Appl. 25, 123–150 (2002)

65. Jiang, H., Xu, H.: Stochastic approximation approaches to the stochastic variational inequality problem. IEEE Trans. Autom. Control 53, 1462–1475 (2008)

66. Kanzow, C.: Some noninterior continuation methods for linear complementarity problems. SIAM J. Matrix Anal. Appl. 17, 851–868 (1996)

67. Knight, K., Fu, W.J.: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356–1378 (2000)

68. Kiwiel, K.C.: Methods of Descent for Nondifferentiable Optimization. Lecture Notes in Math., vol. 1133. Springer, Berlin (1985)

69. Kiwiel, K.C.: Convergence of the gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J. Optim. 18, 379–388 (2007)

70. Lewis, A.S., Pang, C.H.J.: Lipschitz behavior of the robust regularization. SIAM J. Control Optim. 48, 3080–3104 (2010)

71. Lewis, A.S., Overton, M.L.: Eigenvalue optimization. Acta Numerica 5, 149–190 (1996)

72. Li, D., Fukushima, M.: Smoothing Newton and quasi-Newton methods for mixed complementarity problems. Comput. Optim. Appl. 17, 203–230 (2000)

73. Lin, G.H., Chen, X., Fukushima, M.: Solving stochastic mathematical programs with equilibrium constraints via approximation and smoothing implicit programming with penalization. Math. Program. 116, 343–368 (2009)

74. Lin, G.H., Fukushima, M.: Stochastic equilibrium problems and stochastic mathematical programs with equilibrium constraints: a survey. Pac. J. Optim. 6, 455–482 (2010)


75. Luo, Z.-Q., Pang, J.-S., Ralph, D., Wu, S.: Exact penalization and stationarity conditions of mathematical programs with equilibrium constraints. Math. Program. 75, 19–76 (1996)

76. Maréchal, P., Ye, J.J.: Optimizing condition numbers. SIAM J. Optim. 20, 935–947 (2009)

77. Martínez, J.M., Moretti, A.C.: A trust region method for minimization of nonsmooth functions with linear constraints. Math. Program. 76, 431–449 (1997)

78. Meng, K., Yang, X.: Optimality conditions via exact penalty functions. SIAM J. Optim. 20, 3205–3231 (2010)

79. Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control Optim. 15, 957–972 (1977)

80. Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I and II. Springer, Berlin (2006)

81. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Program. 103, 127–152 (2005)

82. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Math. Program. 110, 245–259 (2007)

83. Newey, W.K., McFadden, D.: Large sample estimation and hypothesis testing. In: Handbook of Econometrics, vol. IV, pp. 2111–2245. North-Holland, Amsterdam (1994)

84. Nikolova, M.: Analysis of the recovery of edges in images and signals by minimizing nonconvex regularized least-squares. SIAM J. Multiscale Model. Simul. 4, 960–991 (2005)

85. Nikolova, M., Ng, M.K., Zhang, S., Ching, W.: Efficient reconstruction of piecewise constant images using nonsmooth nonconvex minimization. SIAM J. Imaging Sci. 1, 2–25 (2008)

86. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006)

87. Osborne, M.R.: Finite Algorithms in Optimization and Data Analysis. Wiley, Chichester (1985)

88. Polyak, R.A.: A nonlinear rescaling vs. smoothing technique in convex optimization. Math. Program. 92, 197–235 (2002)

89. Qi, H.D., Liao, L.Z.: A smoothing Newton method for extended vertical linear complementarity problems. SIAM J. Matrix Anal. Appl. 21, 45–66 (1999)

90. Qi, L., Chen, X.: A globally convergent successive approximation method for severely nonsmooth equations. SIAM J. Control Optim. 33, 402–418 (1995)

91. Qi, L., Ling, C., Tong, X., Zhou, G.: A smoothing projected Newton-type algorithm for semi-infinite programming. Comput. Optim. Appl. 42, 1–30 (2009)

92. Qi, L., Sun, D., Zhou, G.: A new look at smoothing Newton methods for nonlinear complementarity problems and box constrained variational inequalities. Math. Program. 87, 1–35 (2000)

93. Rockafellar, R.T.: A property of piecewise smooth functions. Comput. Optim. Appl. 25, 247–250 (2003)

94. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, New York (1998)

95. Ruszczynski, A., Shapiro, A.: Stochastic Programming. Handbooks in Operations Research and Management Science. Elsevier, Amsterdam (2003)

96. Smale, S.: Algorithms for solving equations. In: Proceedings of the International Congress of Mathematicians, Berkeley, CA, pp. 172–195 (1986)

97. Sun, J., Sun, D., Qi, L.: A squared smoothing Newton method for nonsmooth matrix equations and its applications in semidefinite optimization problems. SIAM J. Optim. 14, 783–806 (2004)

98. Tassa, Y., Todorov, E.: Stochastic complementarity for local control of discontinuous dynamics. In: Proceedings of Robotics: Science and Systems (RSS) (2010)

99. Wright, S.J.: Convergence of an inexact algorithm for composite nonsmooth optimization. IMA J. Numer. Anal. 9, 299–321 (1990)

100. Xu, H.: Sample average approximation methods for a class of stochastic variational inequality problems. Asia Pac. J. Oper. Res. 27, 103–119 (2010)

101. Xu, Z., Zhang, H., Wang, Y., Chang, X.: L1/2 regularizer. Sci. China Ser. F-Inf. Sci. 53, 1159–1169 (2010)

102. Yuan, Y.: Conditions for convergence of a trust-region method for nonsmooth optimization. Math. Program. 31, 220–228 (1985)

103. Zang, I.: A smoothing-out technique for min-max optimization. Math. Program. 19, 61–71 (1980)

104. Zhang, C., Chen, X.: Smoothing projected gradient method and its application to stochastic linear complementarity problems. SIAM J. Optim. 20, 627–649 (2009)


105. Zhang, C., Chen, X., Sumalee, A.: Wardrop's user equilibrium assignment under stochastic environment. Transp. Res. B 45, 534–552 (2011)

106. Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010)

107. Zhou, G.L., Caccetta, L., Teo, K.L.: A superlinearly convergent method for a class of complementarity problems with non-Lipschitzian functions. SIAM J. Optim. 20, 1811–1827 (2010)
