Inexact variable metric proximal gradient methods with line-search for convex and nonconvex optimization
Silvia Bonettini
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Università di Modena e Reggio Emilia
Optimization Algorithms and Software for Inverse problemS, www.oasis.unimore.it
Computational Methods for Inverse Problems in Imaging, Como, 16-18 July 2018
Collaborators and main references
Joint works with:
Marco Prato, Università di Modena e Reggio Emilia
Federica Porta, Simone Rebegoldi, Valeria Ruggiero, Università di Ferrara
Ignace Loris, Université Libre de Bruxelles
Main references:
S. B., I. Loris, F. Porta, M. Prato, 2016, Variable metric inexact line-search based methods for nonsmooth optimization, SIAM J. Optim., 26(2), 891-921.
S. B., F. Porta, V. Ruggiero, 2016, A variable metric forward-backward method with extrapolation, SIAM J. Sci. Comput., 38(4), A2558-A2584.
S. B., I. Loris, F. Porta, M. Prato, S. Rebegoldi, 2017, On the convergence of a line-search based proximal-gradient method for nonconvex optimization, Inverse Probl., 33(5), 055005.
S. B., S. Rebegoldi, V. Ruggiero, 2018, Inertial variable metric techniques for the inexact forward-backward algorithm, submitted.
A general nonsmooth problem
Several optimization problems arising from the Bayesian approach to inverse problems have the following structure:

$$\min_{x\in\mathbb{R}^n} f(x) \equiv f_0(x) + f_1(x),$$

where:
- $f_0(x)$ is continuously differentiable, possibly nonconvex, usually expressing some kind of data discrepancy;
- $f_1(x)$ is convex, possibly nondifferentiable, usually expressing regularization.

Goal: develop a numerical optimization algorithm producing a good approximation of the solution of the minimization problem in a few cheap iterations.
The class of proximal gradient methods
Proximal gradient methods, also known as forward-backward methods, exploit the smoothness of $f_0$ and the convexity of $f_1$ in the problem

$$\min_{x\in\mathbb{R}^n} f(x) \equiv f_0(x) + f_1(x).$$
Definition (Proximal gradient method)
Any first order method based on the following two operations:
- Explicit forward/gradient step: computation of the gradient $\nabla f_0(x)$.
- Implicit backward/proximal step: computation of the proximity (or resolvent) operator

$$\mathrm{prox}_{f_1}(z) = \arg\min_{x\in\mathbb{R}^n} f_1(x) + \frac{1}{2}\|x - z\|^2.$$
Example: If $\Omega \subset \mathbb{R}^n$ is a closed convex set, we can define the indicator function

$$\iota_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega \\ +\infty & \text{otherwise} \end{cases} \quad\Rightarrow\quad \mathrm{prox}_{\iota_\Omega}(z) = \Pi_\Omega(z),$$

the orthogonal projection onto $\Omega$.
NB: gradient projection methods are special instances of proximal gradient methods.
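As a concrete illustration (a minimal sketch, not from the slides; the helper names are ours), here are two classical proximity operators in Python: soft-thresholding for a scaled $\ell_1$ norm, and the prox of a box indicator, which reduces to the orthogonal projection just mentioned.

```python
import numpy as np

def prox_l1(z, t):
    """Prox of t*||.||_1: componentwise soft-thresholding (illustrative sketch)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_box_indicator(z, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n: the orthogonal projection,
    i.e. componentwise clipping."""
    return np.clip(z, lo, hi)
```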
A basic forward-backward scheme
$z^{(k)} = x^{(k)} - \alpha_k \nabla f_0(x^{(k)})$ ← Forward step
$y^{(k)} = \mathrm{prox}_{\alpha_k f_1}(z^{(k)})$ ← Backward step
$d^{(k)} = y^{(k)} - x^{(k)}$
$x^{(k+1)} = x^{(k)} + \lambda_k d^{(k)}$
NB: In the standard convergence analysis, the steplength parameters $\alpha_k, \lambda_k \in \mathbb{R}_{>0}$ are related to the Lipschitz constant $L$ of $\nabla f_0(x)$ [Combettes-Wajs 2006], [Combettes, Vũ, 2014], requiring that $\alpha_k$ and/or $\lambda_k \le C/L$.
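A minimal Python sketch of the scheme above, assuming user-supplied `grad_f0` and prox routines and fixed steplengths (how to choose $\alpha_k, \lambda_k$ adaptively is precisely the subject of the following slides):

```python
import numpy as np

def forward_backward(x0, grad_f0, prox_f1, alpha, lam=1.0, n_iter=100):
    """Basic forward-backward iteration (sketch with fixed steplengths)."""
    x = x0.copy()
    for _ in range(n_iter):
        z = x - alpha * grad_f0(x)   # forward/gradient step
        y = prox_f1(z, alpha)        # backward/proximal step: prox_{alpha*f1}(z)
        x = x + lam * (y - x)        # relaxed update along d = y - x
    return x
```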
A motivating problem: nonnegative image restoration from Poisson data
$$\min_{x\in\mathbb{R}^n} \underbrace{\mathrm{KL}(Hx, g)}_{f_0(x)} + \underbrace{\rho\|\nabla x\| + \iota_{\mathbb{R}^n_{\ge 0}}(x)}_{f_1(x)},$$

where $\mathrm{KL}(t, g) = \sum_{i=1}^n g_i \log\left(\frac{g_i}{t_i}\right) + t_i - g_i$.

- Either $\nabla f_0$ is not Lipschitz or $L$ is very large.
- $\mathrm{prox}_{f_1}$ is not available in closed form.
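For concreteness, a sketch of this KL data term and its gradient $H^T(\mathbf{1} - g/(Hx))$ (our code; the `eps` safeguard is an implementation detail, not from the slides). The gradient entries blow up where $Hx$ approaches zero, which is why $L$ is huge or nonexistent:

```python
import numpy as np

def kl_value_grad(H, x, g, eps=1e-12):
    """Value and gradient of KL(Hx, g) for Poisson data (illustrative sketch)."""
    t = np.maximum(H @ x, eps)                 # guard against division by zero
    val = np.sum(g * np.log(np.maximum(g, eps) / t) + t - g)
    grad = H.T @ (1.0 - g / t)                 # entries explode as (Hx)_i -> 0
    return val, grad
```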
A line–search approach
We propose to compute $\lambda_k$ with a line-search approach, starting from 1 and backtracking until a sufficient decrease of the objective function is obtained.
Generalized Armijo rule [Tseng, Yun, 2009], [Porta, Loris, 2015], [B. et al., 2016]
$$f(x^{(k)} + \lambda_k d^{(k)}) \le f(x^{(k)}) + \beta \lambda_k h^{(k)}(y^{(k)}),$$

where $\beta \in (0,1)$ and

$$h^{(k)}(y) = \nabla f_0(x^{(k)})^T (y - x^{(k)}) + \frac{1}{2\alpha_k}\|y - x^{(k)}\|^2 + f_1(y) - f_1(x^{(k)}).$$
NB1: We have $y^{(k)} = \mathrm{prox}_{\alpha_k f_1}(x^{(k)} - \alpha_k \nabla f_0(x^{(k)})) = \arg\min_{y\in\mathbb{R}^n} h^{(k)}(y)$. Since $h^{(k)}(y^{(k)}) < 0$, we obtain a monotone decrease of the objective function.
NB2: For $f_1 \equiv 0$, dropping the quadratic term we obtain the standard Armijo rule for smooth optimization.
Pros:
- No need of any Lipschitz assumption.
- Adaptive selection of $\lambda_k$ (no user-provided parameter).
- No assumptions on $\alpha_k$: it only has to be bounded above and away from zero.

Cons:
- Needs the evaluation of the function $f$ at each backtracking loop (usually 1-2 per outer iteration).
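A sketch of the backtracking loop (our code; the `beta` and `delta` values are illustrative, not prescribed by the talk), assuming $h^{(k)}(y^{(k)})$ has already been computed, since its value doubles as the sufficient-decrease slope:

```python
def armijo_linesearch(f, x, d, h_val, beta=1e-4, delta=0.5, max_iter=50):
    """Generalized Armijo backtracking along d = y - x, where h_val = h(y) < 0.
    Returns the largest lam in {1, delta, delta^2, ...} giving sufficient decrease."""
    fx = f(x)
    lam = 1.0
    for _ in range(max_iter):
        if f(x + lam * d) <= fx + beta * lam * h_val:
            break
        lam *= delta
    return lam
```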
Inexact computation of the proximity operator (1): Basic idea
Compute an approximation $\tilde y^{(k)}$ of $y^{(k)}$ by applying an iterative optimization method to the minimum problem defining the proximity operator:

$$\tilde y^{(k)} \simeq y^{(k)} = \arg\min_{y\in\mathbb{R}^n} h^{(k)}(y),$$

with increasing accuracy as $k$ increases. This results in a two-loop algorithm, and the question now is:

How to stop the inner iterations to preserve the convergence of the iterates $x^{(k)}$ to a solution?
We need to define a criterion to measure the accuracy of the approximate proximity operator computation. Crucial properties of this criterion:
- It has to preserve the convergence properties of the whole scheme.
- It must be based on computable quantities.
Borrowing the ideas in [Salzo, Villa, 2012], [Villa et al., 2013]:

replace $0 \in \partial h^{(k)}(y^{(k)})$ with $0 \in \partial_{\varepsilon_k} h^{(k)}(\tilde y^{(k)})$
Inexact computation of the proximity operator (2): A well-defined primal-dual procedure
Assume that $f_1(x) = g(Ax)$, $A \in \mathbb{R}^{m\times n}$ (easy generalization to $f_1(x) = \sum_{i=1}^p g_i(A_i x)$).
The dual problem of the proximity operator computation is

$$\min_{x\in\mathbb{R}^n} h^{(k)}(x) = \max_{v\in\mathbb{R}^m} \Psi^{(k)}(v) \equiv -\frac{1}{2\alpha_k}\left\|\alpha_k A^T v - z^{(k)}\right\|^2 - g^*(v) + C_k,$$

where $g^*$ is the Fenchel convex conjugate of $g$. If $v^{(k)} = \arg\max \Psi^{(k)}(v)$, then $y^{(k)} = z^{(k)} - \alpha_k A^T v^{(k)}$.
Compute $\tilde y^{(k)}$ as follows (see the sketch after this list):
- apply a maximization method to the dual problem, generating the dual sequence $\{v^{(k,\ell)}\}_{\ell\in\mathbb{N}}$ converging to $v^{(k)}$;
- compute the corresponding primal sequence $\{y^{(k,\ell)}\}_{\ell\in\mathbb{N}}$ with the formula $y^{(k,\ell)} = z^{(k)} - \alpha_k A^T v^{(k,\ell)}$;
- stop the inner iterations when $h^{(k)}(y^{(k,\ell)}) - \Psi^{(k)}(v^{(k,\ell)}) \le \varepsilon_k$, where either $\varepsilon_k = C/k^q$ with $q > 1$ (prefixed sequence choice) or $\varepsilon_k = \eta\, h^{(k)}(y^{(k,\ell)})$ with $\eta \in (0,1]$ (adaptive choice).
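A schematic inner loop under these assumptions; `dual_step`, `h`, and `Psi` are user-supplied callables (one iteration of a dual maximization method and the primal/dual objectives), which is our framing rather than the talk's notation:

```python
import numpy as np

def inexact_prox_dual(z, A, alpha, dual_step, h, Psi, eps_k, v0, max_inner=500):
    """Inexact prox via the dual: ascend Psi, recover y = z - alpha*A^T v,
    and stop on the computable duality gap h(y) - Psi(v) <= eps_k (sketch)."""
    v = v0.copy()
    for _ in range(max_inner):
        v = dual_step(v)               # one step of a dual maximization method
        y = z - alpha * (A.T @ v)      # corresponding primal iterate
        if h(y) - Psi(v) <= eps_k:     # duality-gap stopping criterion
            break
    return y, v
```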
Introducing Scaling
Add a new parameter, a symmetric positive definite scaling matrix $D_k$, which determines a different metric at each iterate:

replace $\|x\|$ with $\|x\|_{D_k} = \sqrt{x^T D_k x}$
Variable Metric Inexact Line–Search Algorithm (VMILA)
$z^{(k)} = x^{(k)} - \alpha_k D_k^{-1} \nabla f_0(x^{(k)})$ ← Scaled forward step
$\tilde y^{(k)} \approx \mathrm{prox}^{D_k}_{\alpha_k f_1}(z^{(k)}) \equiv y^{(k)}$ ← Scaled inexact backward step
$d^{(k)} = \tilde y^{(k)} - x^{(k)}$
$x^{(k+1)} = x^{(k)} + \lambda_k d^{(k)}$ ← Armijo-like line-search
Summary of convergence results about VMILA
VMILA: $\lambda_k$ computed with line-search + inexact computation of the proximal point with increasing accuracy + $\alpha_k$ bounded.

Convex case. Assumption: $D_k \to I$ as $k \to \infty$, like $C/k^p$ with $p > 1$.
- Convergence to a minimizer (without Lipschitz assumptions on $\nabla f_0(x)$).
- Convergence rate $f(x^{(k)}) - f^* = O(1/k)$ (proof with Lipschitz assumptions on $\nabla f_0(x)$).

Nonconvex case. Assumption: $D_k$ has bounded eigenvalues.
- Every accumulation point of $\{x^{(k)}\}_{k\in\mathbb{N}}$ is a stationary point.
- If $f$ satisfies the Kurdyka-Łojasiewicz property and $\nabla f_0$ is locally Lipschitz, then $\{x^{(k)}\}_{k\in\mathbb{N}}$ converges to a stationary point (with exact proximal point computation).

A block-coordinate version of VMILA is proposed in [B., Prato, Rebegoldi, 2018, to appear].
NB: $\alpha_k$ and $D_k$ are required only to be bounded $\Rightarrow$ use them to implement some acceleration strategy.
Metric selection - A Majorization-Minimization approach
No theoretical gain (same rate and lower complexity bound as nonscaled methods), and no general recipe for selecting $D_k$. The freedom in choosing $D_k, \alpha_k$ allows one to use them to accelerate practical performance; good numerical results are obtained with problem dependent practical strategies for $D_k$.
Majorization-Minimization idea
Define $D_k$ such that

$$x^{(k)} - D_k^{-1}\nabla f_0(x^{(k)}) = \arg\min_{x\in\mathbb{R}^n} F(x, x^{(k)}),$$

where $F(x, x^{(k)})$ is a (not necessarily quadratic) auxiliary function for $f_0$, i.e.

$$F(x^{(k)}, x^{(k)}) = f_0(x^{(k)}) \quad\text{and}\quad F(x, x^{(k)}) \ge f_0(x) \;\; \forall x \in \mathbb{R}^n.$$

[Figure: the auxiliary function $F(\cdot, x^{(k)})$ majorizes $f_0$ and touches it at $x^{(k)}$.]

Hence, at the minimizer $\hat x^{(k)} = x^{(k)} - D_k^{-1}\nabla f_0(x^{(k)})$,

$$f_0(\hat x^{(k)}) \le F(\hat x^{(k)}, x^{(k)}) \le F(x^{(k)}, x^{(k)}) = f_0(x^{(k)}),$$

so the step decreases $f_0$.
This produces a diagonal $D_k$ whose elements are obtained from the components of $\nabla f_0(x^{(k)})$ in several relevant cases (discrepancy functions for Gaussian, Poisson, Cauchy, multiplicative noise, ...) [Yang, Oja, 2011], [Chouzenoux, Pesquet, 2016]. The convergence condition $D_k \to I$ can be fulfilled by squeezing the elements of the diagonal matrix $D_k$ to 1 as $k$ increases.
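As an illustration (the talk gives no code, and the constants below are placeholders), a split-gradient/SGP-type diagonal scaling for the KL term of the earlier Poisson problem, with the entries of $D_k^{-1}$ squeezed toward 1 as $k$ grows:

```python
import numpy as np

def mm_diagonal_scaling(H, x, k, c0=1e10, p=1.1):
    """Entries of D_k^{-1} for the KL data term: x_i / (H^T 1)_i, clipped to
    [1/(1+zeta_k), 1+zeta_k] with zeta_k = c0/k**p -> 0, so D_k -> I
    (sketch; c0 and p are placeholder constants)."""
    s = H.T @ np.ones(H.shape[0])        # column sums H^T 1
    d_inv = x / np.maximum(s, 1e-12)     # candidate diagonal of D_k^{-1}
    zeta = c0 / k**p                     # squeezing sequence
    return np.clip(d_inv, 1.0 / (1.0 + zeta), 1.0 + zeta)
```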
Stepsize selection - Barzilai-Borwein rules
Given $D_k$, we would like to choose $\alpha_k$ such that

$$\frac{1}{\alpha_k} D_k \simeq \nabla^2 f_0(x^{(k)}),$$

simulating the Taylor equality

$$\nabla f_0(x + d) = \nabla f_0(x) + \int_0^1 \nabla^2 f_0(x + td)\, d\, dt$$

$$\underbrace{\nabla f_0(x^{(k)}) - \nabla f_0(x^{(k-1)})}_{w^{(k-1)}} \simeq \frac{1}{\alpha_k} D_k (\underbrace{x^{(k)} - x^{(k-1)}}_{s^{(k-1)}})$$
$$\alpha_k^{BB1} = \arg\min_\alpha \left\|\tfrac{1}{\alpha} D_k s^{(k-1)} - w^{(k-1)}\right\| = \frac{\|D_k s^{(k-1)}\|^2}{s^{(k-1)\,T} D_k w^{(k-1)}}$$

$$\alpha_k^{BB2} = \arg\min_\alpha \left\|s^{(k-1)} - \alpha D_k^{-1} w^{(k-1)}\right\| = \frac{s^{(k-1)\,T} D_k^{-1} w^{(k-1)}}{\|D_k^{-1} w^{(k-1)}\|^2}$$
Good results are obtained when the two values are alternated following an adaptive switching rule and projected onto a given interval $[\alpha_{\min}, \alpha_{\max}]$, with $0 < \alpha_{\min} < \alpha_{\max}$ (a sketch follows). Recent developments in steplength selection rules: Ritz values [Fletcher, 2012].
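A sketch for a diagonal $D_k$ stored as a vector (our code; the parity-based alternation below is only a stand-in for the adaptive switching rule, which the slide does not detail):

```python
import numpy as np

def scaled_bb_steplength(s, w, D_diag, k, a_min=1e-5, a_max=1e5):
    """Scaled BB1/BB2 steplengths with safeguarding onto [a_min, a_max] (sketch)."""
    Ds = D_diag * s
    Dinv_w = w / D_diag
    bb1 = np.dot(Ds, Ds) / np.dot(s, D_diag * w)       # ||D s||^2 / (s^T D w)
    bb2 = np.dot(s, Dinv_w) / np.dot(Dinv_w, Dinv_w)   # (s^T D^{-1} w) / ||D^{-1} w||^2
    alpha = bb1 if k % 2 == 0 else bb2                 # placeholder alternation rule
    if not np.isfinite(alpha) or alpha <= 0.0:
        alpha = a_max                                  # fall back when curvature is negative
    return float(np.clip(alpha, a_min, a_max))
```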
Numerical results
VMILA has been tested on a variety of convex and nonconvex image restoration problems. The numerical comparisons show that its performance is comparable with that of state-of-the-art methods such as the Chambolle-Pock (CP) method, preconditioned CP, ADMM, PIDSplit+, iPiano, VMFB, FISTA, ...

Illustration of the effects of a well tailored choice of $\alpha_k, D_k$: nonnegative image deconvolution in the presence of Poisson noise with smooth TV regularization.
[Figure: two plots of the relative objective error $(f(x^{(k)}) - f^*)/f^*$ versus iteration number (0-300) for GP, GP-BB, SGP, SGP-BB, and FISTA, together with the true images $x^*$ and the SGP reconstructions $x^{(300)}$.]
Two acceleration techniques: Extrapolation
FISTA iteration [Beck, Teboulle, 2009]

$\bar x^{(k)} = x^{(k)} + \gamma_k (x^{(k)} - x^{(k-1)})$ ← Extrapolation step
$z^{(k)} = \bar x^{(k)} - \alpha_k \nabla f_0(\bar x^{(k)})$ ← Forward step
$x^{(k+1)} = \mathrm{prox}_{\alpha_k f_1}(z^{(k)})$ ← Backward step
It applies when $f_0, f_1$ are both convex and $\nabla f_0$ is $L$-Lipschitz continuous. $\alpha_k$ can be chosen as $\alpha_k = \alpha \le 1/L$ or with a backtracking procedure, starting from $\alpha_{k-1}$, guaranteeing that

$$f_0(x^{(k+1)}) \le f_0(\bar x^{(k)}) + \nabla f_0(\bar x^{(k)})^T (x^{(k+1)} - \bar x^{(k)}) + \frac{1}{2\alpha_k}\|x^{(k+1)} - \bar x^{(k)}\|^2.$$
Properly chosen extrapolation parameter $\gamma_k \to 1$ as $k \to \infty$:

$$\gamma_k = \frac{k-1}{k+a}, \quad a \ge 2$$

- Convergence rate $f(x^{(k)}) - f^* = O(1/k^2)$ with $a \ge 2$, and $f(x^{(k)}) - f^* = o(1/k^2)$ with $a > 2$ [Attouch-Peypouquet, 2016].
- Convergence of the iterates to a minimizer with $a > 2$ [Chambolle-Dossal, 2015].
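A compact sketch of this iteration with $\gamma_k = (k-1)/(k+a)$, assuming a known Lipschitz constant so that a fixed $\alpha \le 1/L$ can be used (backtracking omitted; our code):

```python
def fista(x0, grad_f0, prox_f1, alpha, a=3.0, n_iter=300):
    """FISTA-like iteration with gamma_k = (k-1)/(k+a) (sketch, fixed steplength)."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, n_iter + 1):
        gamma = (k - 1.0) / (k + a)          # a > 2 also gives iterate convergence
        x_bar = x + gamma * (x - x_prev)     # extrapolation step
        z = x_bar - alpha * grad_f0(x_bar)   # forward step at the extrapolated point
        x_prev, x = x, prox_f1(z, alpha)     # backward/proximal step
    return x
```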
Combining extrapolation with scaling and inexact computation of the proximity operator [B., Porta, Ruggiero, 2016], [B., Rebegoldi, Ruggiero, 2018, submitted]
Scaled, inexact FISTA-like method
$\bar x^{(k)} = \Pi^{D_k}_{\mathrm{dom}(f)}\big(x^{(k)} + \gamma_k (x^{(k)} - x^{(k-1)})\big)$ ← Extrapolation step with scaled projection
$z^{(k)} = \bar x^{(k)} - \alpha_k D_k^{-1} \nabla f_0(\bar x^{(k)})$ ← Scaled forward step
$x^{(k+1)} \simeq \mathrm{prox}^{D_k}_{\alpha_k f_1}(z^{(k)})$ ← Scaled inexact backward step
Main features: FISTA-like extrapolation + projection + line-search computation of $\alpha_k$ + inexact computation of the proximity operator + $D_k \to I$ as $k \to \infty$.

Convex case ($\nabla f_0(x)$ Lipschitz continuous):
- Convergence rate $f(x^{(k)}) - f^* = O(1/k^2)$ with $a \ge 2$, and $f(x^{(k)}) - f^* = o(1/k^2)$ with $a > 2$.
- Convergence of the iterates to a minimizer with $a > 2$.
Numerical results
Nonnegative image deconvolution with Total Variation regularization from Poisson data.
Left panel: nonscaled (blue) vs. scaled (red, black) inexact FISTA-like method. Right panel: scaled inexact FISTA-like method (blue) vs. state of the art.

[Figure: relative error on the objective function $(f(x^{(k)}) - f^*)/f^*$ versus time (s), comparing the non scaled method with scaled variants for parameters $t_1 \in \{10^7, 10^{10}\}$ and $t_2 \in \{2, 3, 4\}$.]
Conclusions and perspectives
Line-search based algorithms allow the implementation of FB methods without knowledge of the Lipschitz constant, and even their application to non-Lipschitz problems.
To do: study a less expensive line–search strategy for the inexact FISTA, to avoidthe computation of a new approximate proximal point at each trial step
The approximation of the proximity operator with an inner loop can be done in such a way that the basic convergence properties of the FB algorithms are not affected.
To do: investigate other implementable criteria, especially for nonconvexproblems.
Variable metric techniques can be considered as acceleration techniques, improving the behaviour of FB methods especially in the first iterations.
To do: gain better insight into the Majorization-Minimization techniques.
Codes available at www.oasis.unimore.it