Inexact variable metric proximal gradient methods with line-search for convex and nonconvex optimization
Silvia Bonettini
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Università di Modena e Reggio Emilia
Optimization Algorithms and Software for Inverse problemS, www.oasis.unimore.it
Computational Methods for Inverse Problems in Imaging, Como, 16-18 July 2018
Collaborators and main references
Joint works with:
Marco Prato, Università di Modena e Reggio Emilia
Federica Porta, Simone Rebegoldi, Valeria Ruggiero, Università di Ferrara
Ignace Loris, Université Libre de Bruxelles
Main references:
S. B., I. Loris, F. Porta, M. Prato, 2016, Variable metric inexact line-search based methods for nonsmooth optimization, SIAM J. Optim., 26(2), 891-921.
S. B., F. Porta, V. Ruggiero, 2016, A variable metric forward-backward method with extrapolation, SIAM J. Sci. Comput., 38(4), A2558-A2584.
S. B., I. Loris, F. Porta, M. Prato, S. Rebegoldi, 2017, On the convergence of a line-search based proximal-gradient method for nonconvex optimization, Inverse Probl., 33(5), 055005.
S. B., S. Rebegoldi, V. Ruggiero, 2018, Inertial variable metric techniques for the inexact forward-backward algorithm, submitted.
A general nonsmooth problem
Several optimization problems arising from the Bayesian approach to inverse problems have the following structure:

$$\min_{x\in\mathbb{R}^n} f(x) \equiv f_0(x) + f_1(x),$$

where:
- $f_0(x)$ is continuously differentiable, possibly nonconvex, usually expressing some kind of data discrepancy;
- $f_1(x)$ is convex, possibly nondifferentiable, usually expressing regularization.

Goal: develop a numerical optimization algorithm producing a good approximation of the solution of the minimization problem in a few cheap iterations.
The class of proximal gradient methods
Proximal gradient methods, also known as forward-backward methods, exploit the smoothness of $f_0$ and the convexity of $f_1$ in the problem

$$\min_{x\in\mathbb{R}^n} f(x) \equiv f_0(x) + f_1(x).$$
Definition (Proximal gradient method)
Any first order method based on the following two operations:
- Explicit forward/gradient step: computation of the gradient $\nabla f_0(x)$.
- Implicit backward/proximal step: computation of the proximity (or resolvent) operator

$$\mathrm{prox}_{f_1}(z) = \arg\min_{x\in\mathbb{R}^n} f_1(x) + \frac{1}{2}\|x - z\|^2.$$
Example: If $\Omega \subset \mathbb{R}^n$ is a closed convex set, we can define the indicator function

$$\iota_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega \\ +\infty & \text{otherwise} \end{cases} \quad\Rightarrow\quad \mathrm{prox}_{\iota_\Omega}(z) = \Pi_\Omega(z),$$

the orthogonal projection onto $\Omega$.
NB: gradient projection methods are special instances of proximal gradient methods.
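As a concrete illustration (a minimal sketch, not from the slides; the helper names are ours), here are two classical proximity operators in Python: soft-thresholding for a scaled $\ell_1$ norm, and the prox of a box indicator, which reduces to the orthogonal projection just mentioned.

```python
import numpy as np

def prox_l1(z, t):
    """Prox of t*||.||_1: componentwise soft-thresholding (illustrative sketch)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_box_indicator(z, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n: the orthogonal projection,
    i.e. componentwise clipping."""
    return np.clip(z, lo, hi)
```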
A basic forward-backward scheme
$z^{(k)} = x^{(k)} - \alpha_k \nabla f_0(x^{(k)})$ ← Forward step
$y^{(k)} = \mathrm{prox}_{\alpha_k f_1}(z^{(k)})$ ← Backward step
$d^{(k)} = y^{(k)} - x^{(k)}$
$x^{(k+1)} = x^{(k)} + \lambda_k d^{(k)}$
NB: In the standard convergence analysis, the steplength parameters $\alpha_k, \lambda_k \in \mathbb{R}_{>0}$ are related to the Lipschitz constant $L$ of $\nabla f_0(x)$ [Combettes-Wajs 2006], [Combettes, Vũ, 2014], requiring that $\alpha_k$ and/or $\lambda_k \le C/L$.
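A minimal Python sketch of the scheme above, assuming user-supplied `grad_f0` and prox routines and fixed steplengths (how to choose $\alpha_k, \lambda_k$ adaptively is precisely the subject of the following slides):

```python
import numpy as np

def forward_backward(x0, grad_f0, prox_f1, alpha, lam=1.0, n_iter=100):
    """Basic forward-backward iteration (sketch with fixed steplengths)."""
    x = x0.copy()
    for _ in range(n_iter):
        z = x - alpha * grad_f0(x)   # forward/gradient step
        y = prox_f1(z, alpha)        # backward/proximal step: prox_{alpha*f1}(z)
        x = x + lam * (y - x)        # relaxed update along d = y - x
    return x
```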
A motivating problem: nonnegative image restoration from Poisson data
$$\min_{x\in\mathbb{R}^n} \underbrace{\mathrm{KL}(Hx, g)}_{f_0(x)} + \underbrace{\rho\|\nabla x\| + \iota_{\mathbb{R}^n_{\ge 0}}(x)}_{f_1(x)},$$

where $\mathrm{KL}(t, g) = \sum_{i=1}^n g_i \log\left(\frac{g_i}{t_i}\right) + t_i - g_i$.

- Either $\nabla f_0$ is not Lipschitz or $L$ is very large.
- $\mathrm{prox}_{f_1}$ is not available in closed form.
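For concreteness, a sketch of this KL data term and its gradient $H^T(\mathbf{1} - g/(Hx))$ (our code; the `eps` safeguard is an implementation detail, not from the slides). The gradient entries blow up where $Hx$ approaches zero, which is why $L$ is huge or nonexistent:

```python
import numpy as np

def kl_value_grad(H, x, g, eps=1e-12):
    """Value and gradient of KL(Hx, g) for Poisson data (illustrative sketch)."""
    t = np.maximum(H @ x, eps)                 # guard against division by zero
    val = np.sum(g * np.log(np.maximum(g, eps) / t) + t - g)
    grad = H.T @ (1.0 - g / t)                 # entries explode as (Hx)_i -> 0
    return val, grad
```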
A line–search approach
We propose to compute $\lambda_k$ with a line-search approach, starting from 1 and backtracking until a sufficient decrease of the objective function is obtained.
Generalized Armijo rule [Tseng, Yun, 2009], [Porta, Loris, 2015], [B. et al., 2016]
$$f(x^{(k)} + \lambda_k d^{(k)}) \le f(x^{(k)}) + \beta \lambda_k h^{(k)}(y^{(k)}),$$

where $\beta \in (0,1)$ and

$$h^{(k)}(y) = \nabla f_0(x^{(k)})^T (y - x^{(k)}) + \frac{1}{2\alpha_k}\|y - x^{(k)}\|^2 + f_1(y) - f_1(x^{(k)}).$$
NB1: We have $y^{(k)} = \mathrm{prox}_{\alpha_k f_1}(x^{(k)} - \alpha_k \nabla f_0(x^{(k)})) = \arg\min_{y\in\mathbb{R}^n} h^{(k)}(y)$. Since $h^{(k)}(y^{(k)}) < 0$, we obtain a monotone decrease of the objective function.
NB2: For $f_1 \equiv 0$, dropping the quadratic term we obtain the standard Armijo rule for smooth optimization.
Pros:
- No need of any Lipschitz assumption.
- Adaptive selection of $\lambda_k$ (no user-provided parameter).
- No assumptions on $\alpha_k$: it only has to be bounded above and away from zero.

Cons:
- Needs the evaluation of the function $f$ at each backtracking loop (usually 1-2 per outer iteration).
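A sketch of the backtracking loop (our code; the `beta` and `delta` values are illustrative, not prescribed by the talk), assuming $h^{(k)}(y^{(k)})$ has already been computed, since its value doubles as the sufficient-decrease slope:

```python
def armijo_linesearch(f, x, d, h_val, beta=1e-4, delta=0.5, max_iter=50):
    """Generalized Armijo backtracking along d = y - x, where h_val = h(y) < 0.
    Returns the largest lam in {1, delta, delta^2, ...} giving sufficient decrease."""
    fx = f(x)
    lam = 1.0
    for _ in range(max_iter):
        if f(x + lam * d) <= fx + beta * lam * h_val:
            break
        lam *= delta
    return lam
```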
Inexact computation of the proximity operator (1): Basic idea
Compute an approximation $\tilde y^{(k)}$ of $y^{(k)}$ by applying an iterative optimization method to the minimum problem defining the proximity operator:

$$\tilde y^{(k)} \simeq y^{(k)} = \arg\min_{y\in\mathbb{R}^n} h^{(k)}(y),$$

with increasing accuracy as $k$ increases. This results in a two-loop algorithm, and the question now is:

How to stop the inner iterations to preserve the convergence of the iterates $x^{(k)}$ to a solution?
We need to define a criterion to measure the accuracy of the approximate proximity operator computation. Crucial properties of this criterion:
- It has to preserve the convergence properties of the whole scheme.
- It must be based on computable quantities.
Borrowing the ideas in [Salzo, Villa, 2012], [Villa et al., 2013]:

replace $0 \in \partial h^{(k)}(y^{(k)})$ with $0 \in \partial_{\varepsilon_k} h^{(k)}(\tilde y^{(k)})$
Inexact computation of the proximity operator (2): A well-defined primal-dual procedure
Assume that $f_1(x) = g(Ax)$, $A \in \mathbb{R}^{m\times n}$ (easy generalization to $f_1(x) = \sum_{i=1}^p g_i(A_i x)$).
The dual problem of the proximity operator computation is

$$\min_{x\in\mathbb{R}^n} h^{(k)}(x) = \max_{v\in\mathbb{R}^m} \Psi^{(k)}(v) \equiv -\frac{1}{2\alpha_k}\left\|\alpha_k A^T v - z^{(k)}\right\|^2 - g^*(v) + C_k,$$

where $g^*$ is the Fenchel convex conjugate of $g$. If $v^{(k)} = \arg\max \Psi^{(k)}(v)$, then $y^{(k)} = z^{(k)} - \alpha_k A^T v^{(k)}$.
Compute $\tilde y^{(k)}$ as follows (see the sketch after this list):
- apply a maximization method to the dual problem, generating the dual sequence $\{v^{(k,\ell)}\}_{\ell\in\mathbb{N}}$ converging to $v^{(k)}$;
- compute the corresponding primal sequence $\{y^{(k,\ell)}\}_{\ell\in\mathbb{N}}$ with the formula $y^{(k,\ell)} = z^{(k)} - \alpha_k A^T v^{(k,\ell)}$;
- stop the inner iterations when $h^{(k)}(y^{(k,\ell)}) - \Psi^{(k)}(v^{(k,\ell)}) \le \varepsilon_k$, where either $\varepsilon_k = C/k^q$ with $q > 1$ (prefixed sequence choice) or $\varepsilon_k = \eta\, h^{(k)}(y^{(k,\ell)})$ with $\eta \in (0,1]$ (adaptive choice).
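A schematic inner loop under these assumptions; `dual_step`, `h`, and `Psi` are user-supplied callables (one iteration of a dual maximization method and the primal/dual objectives), which is our framing rather than the talk's notation:

```python
import numpy as np

def inexact_prox_dual(z, A, alpha, dual_step, h, Psi, eps_k, v0, max_inner=500):
    """Inexact prox via the dual: ascend Psi, recover y = z - alpha*A^T v,
    and stop on the computable duality gap h(y) - Psi(v) <= eps_k (sketch)."""
    v = v0.copy()
    for _ in range(max_inner):
        v = dual_step(v)               # one step of a dual maximization method
        y = z - alpha * (A.T @ v)      # corresponding primal iterate
        if h(y) - Psi(v) <= eps_k:     # duality-gap stopping criterion
            break
    return y, v
```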
Introducing Scaling
Add a new parameter, a symmetric positive definite scaling matrix $D_k$, which determines a different metric at each iterate:

replace $\|x\|$ with $\|x\|_{D_k} = \sqrt{x^T D_k x}$
Variable Metric Inexact Line–Search Algorithm (VMILA)
$z^{(k)} = x^{(k)} - \alpha_k D_k^{-1} \nabla f_0(x^{(k)})$ ← Scaled forward step
$\tilde y^{(k)} \approx \mathrm{prox}^{D_k}_{\alpha_k f_1}(z^{(k)}) \equiv y^{(k)}$ ← Scaled inexact backward step
$d^{(k)} = \tilde y^{(k)} - x^{(k)}$
$x^{(k+1)} = x^{(k)} + \lambda_k d^{(k)}$ ← Armijo-like line-search
Summary of convergence results about VMILA
VMILA: $\lambda_k$ computed with line-search + inexact computation of the proximal point with increasing accuracy + $\alpha_k$ bounded.

Convex case. Assumption: $D_k \to I$ as $k \to \infty$, like $C/k^p$ with $p > 1$.
- Convergence to a minimizer (without Lipschitz assumptions on $\nabla f_0(x)$).
- Convergence rate $f(x^{(k)}) - f^* = O(1/k)$ (proof with Lipschitz assumptions on $\nabla f_0(x)$).

Nonconvex case. Assumption: $D_k$ has bounded eigenvalues.
- Every accumulation point of $\{x^{(k)}\}_{k\in\mathbb{N}}$ is a stationary point.
- If $f$ satisfies the Kurdyka-Łojasiewicz property and $\nabla f_0$ is locally Lipschitz, then $\{x^{(k)}\}_{k\in\mathbb{N}}$ converges to a stationary point (with exact proximal point computation).

A block-coordinate version of VMILA is proposed in [B., Prato, Rebegoldi, 2018, to appear].
NB: $\alpha_k$ and $D_k$ are required only to be bounded $\Rightarrow$ use them to implement some acceleration strategy.
Metric selection - A Majorization-Minimization approach
No theoretical gain (same rate and lower complexity bound as nonscaled methods), and no general recipe for selecting $D_k$. The freedom in choosing $D_k, \alpha_k$ allows one to use them to accelerate practical performance; good numerical results are obtained with problem dependent practical strategies for $D_k$.
Majorization-Minimization idea
Define $D_k$ such that

$$x^{(k)} - D_k^{-1}\nabla f_0(x^{(k)}) = \arg\min_{x\in\mathbb{R}^n} F(x, x^{(k)}),$$

where $F(x, x^{(k)})$ is a (not necessarily quadratic) auxiliary function for $f_0$, i.e.

$$F(x^{(k)}, x^{(k)}) = f_0(x^{(k)}) \quad\text{and}\quad F(x, x^{(k)}) \ge f_0(x) \;\; \forall x \in \mathbb{R}^n.$$

[Figure: the auxiliary function $F(\cdot, x^{(k)})$ majorizes $f_0$ and touches it at $x^{(k)}$.]

Hence, at the minimizer $\hat x^{(k)} = x^{(k)} - D_k^{-1}\nabla f_0(x^{(k)})$,

$$f_0(\hat x^{(k)}) \le F(\hat x^{(k)}, x^{(k)}) \le F(x^{(k)}, x^{(k)}) = f_0(x^{(k)}),$$

so the step decreases $f_0$.
This produces a diagonal $D_k$ whose elements are obtained from the components of $\nabla f_0(x^{(k)})$ in several relevant cases (discrepancy functions for Gaussian, Poisson, Cauchy, multiplicative noise, ...) [Yang, Oja, 2011], [Chouzenoux, Pesquet, 2016]. The convergence condition $D_k \to I$ can be fulfilled by squeezing the elements of the diagonal matrix $D_k$ to 1 as $k$ increases.
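As an illustration (the talk gives no code, and the constants below are placeholders), a split-gradient/SGP-type diagonal scaling for the KL term of the earlier Poisson problem, with the entries of $D_k^{-1}$ squeezed toward 1 as $k$ grows:

```python
import numpy as np

def mm_diagonal_scaling(H, x, k, c0=1e10, p=1.1):
    """Entries of D_k^{-1} for the KL data term: x_i / (H^T 1)_i, clipped to
    [1/(1+zeta_k), 1+zeta_k] with zeta_k = c0/k**p -> 0, so D_k -> I
    (sketch; c0 and p are placeholder constants)."""
    s = H.T @ np.ones(H.shape[0])        # column sums H^T 1
    d_inv = x / np.maximum(s, 1e-12)     # candidate diagonal of D_k^{-1}
    zeta = c0 / k**p                     # squeezing sequence
    return np.clip(d_inv, 1.0 / (1.0 + zeta), 1.0 + zeta)
```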
Stepsize selection - Barzilai-Borwein rules
Given $D_k$, we would like to choose $\alpha_k$ such that

$$\frac{1}{\alpha_k} D_k \simeq \nabla^2 f_0(x^{(k)}),$$

simulating the Taylor equality

$$\nabla f_0(x + d) = \nabla f_0(x) + \int_0^1 \nabla^2 f_0(x + td)\, d\, dt$$

$$\underbrace{\nabla f_0(x^{(k)}) - \nabla f_0(x^{(k-1)})}_{w^{(k-1)}} \simeq \frac{1}{\alpha_k} D_k (\underbrace{x^{(k)} - x^{(k-1)}}_{s^{(k-1)}})$$
$$\alpha_k^{BB1} = \arg\min_\alpha \left\|\tfrac{1}{\alpha} D_k s^{(k-1)} - w^{(k-1)}\right\| = \frac{\|D_k s^{(k-1)}\|^2}{s^{(k-1)\,T} D_k w^{(k-1)}}$$

$$\alpha_k^{BB2} = \arg\min_\alpha \left\|s^{(k-1)} - \alpha D_k^{-1} w^{(k-1)}\right\| = \frac{s^{(k-1)\,T} D_k^{-1} w^{(k-1)}}{\|D_k^{-1} w^{(k-1)}\|^2}$$
Good results are obtained when the two values are alternated following an adaptive switching rule and projected onto a given interval $[\alpha_{\min}, \alpha_{\max}]$, with $0 < \alpha_{\min} < \alpha_{\max}$ (a sketch follows). Recent developments in steplength selection rules: Ritz values [Fletcher, 2012].
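A sketch for a diagonal $D_k$ stored as a vector (our code; the parity-based alternation below is only a stand-in for the adaptive switching rule, which the slide does not detail):

```python
import numpy as np

def scaled_bb_steplength(s, w, D_diag, k, a_min=1e-5, a_max=1e5):
    """Scaled BB1/BB2 steplengths with safeguarding onto [a_min, a_max] (sketch)."""
    Ds = D_diag * s
    Dinv_w = w / D_diag
    bb1 = np.dot(Ds, Ds) / np.dot(s, D_diag * w)       # ||D s||^2 / (s^T D w)
    bb2 = np.dot(s, Dinv_w) / np.dot(Dinv_w, Dinv_w)   # (s^T D^{-1} w) / ||D^{-1} w||^2
    alpha = bb1 if k % 2 == 0 else bb2                 # placeholder alternation rule
    if not np.isfinite(alpha) or alpha <= 0.0:
        alpha = a_max                                  # fall back when curvature is negative
    return float(np.clip(alpha, a_min, a_max))
```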
Numerical results
VMILA has been tested on a variety of convex and nonconvex image restoration problems. The numerical comparisons show that its performance is comparable with that of state-of-the-art methods such as the Chambolle-Pock (CP) method, preconditioned CP, ADMM, PIDSplit+, iPiano, VMFB, FISTA, ...

Illustration of the effects of a well tailored choice of $\alpha_k, D_k$: nonnegative image deconvolution in the presence of Poisson noise with smooth TV regularization.
[Figure: two plots of the relative objective error $(f(x^{(k)}) - f^*)/f^*$ versus iteration number (0-300) for GP, GP-BB, SGP, SGP-BB, and FISTA, together with the true images $x^*$ and the SGP reconstructions $x^{(300)}$.]
Two acceleration techniques: Extrapolation
FISTA iteration [Beck, Teboulle, 2009]

$\bar x^{(k)} = x^{(k)} + \gamma_k (x^{(k)} - x^{(k-1)})$ ← Extrapolation step
$z^{(k)} = \bar x^{(k)} - \alpha_k \nabla f_0(\bar x^{(k)})$ ← Forward step
$x^{(k+1)} = \mathrm{prox}_{\alpha_k f_1}(z^{(k)})$ ← Backward step
It applies when $f_0, f_1$ are both convex and $\nabla f_0$ is $L$-Lipschitz continuous. $\alpha_k$ can be chosen as $\alpha_k = \alpha \le 1/L$ or with a backtracking procedure, starting from $\alpha_{k-1}$, guaranteeing that

$$f_0(x^{(k+1)}) \le f_0(\bar x^{(k)}) + \nabla f_0(\bar x^{(k)})^T (x^{(k+1)} - \bar x^{(k)}) + \frac{1}{2\alpha_k}\|x^{(k+1)} - \bar x^{(k)}\|^2.$$
Properly chosen extrapolation parameter $\gamma_k \to 1$ as $k \to \infty$:

$$\gamma_k = \frac{k-1}{k+a}, \quad a \ge 2$$

- Convergence rate $f(x^{(k)}) - f^* = O(1/k^2)$ with $a \ge 2$, and $f(x^{(k)}) - f^* = o(1/k^2)$ with $a > 2$ [Attouch-Peypouquet, 2016].
- Convergence of the iterates to a minimizer with $a > 2$ [Chambolle-Dossal, 2015].
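A compact sketch of this iteration with $\gamma_k = (k-1)/(k+a)$, assuming a known Lipschitz constant so that a fixed $\alpha \le 1/L$ can be used (backtracking omitted; our code):

```python
def fista(x0, grad_f0, prox_f1, alpha, a=3.0, n_iter=300):
    """FISTA-like iteration with gamma_k = (k-1)/(k+a) (sketch, fixed steplength)."""
    x_prev = x0.copy()
    x = x0.copy()
    for k in range(1, n_iter + 1):
        gamma = (k - 1.0) / (k + a)          # a > 2 also gives iterate convergence
        x_bar = x + gamma * (x - x_prev)     # extrapolation step
        z = x_bar - alpha * grad_f0(x_bar)   # forward step at the extrapolated point
        x_prev, x = x, prox_f1(z, alpha)     # backward/proximal step
    return x
```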
Combining extrapolation with scaling and inexact computation of the proximity operator [B., Porta, Ruggiero, 2016], [B., Rebegoldi, Ruggiero, 2018, submitted]
Scaled, inexact FISTA-like method
$\bar x^{(k)} = \Pi^{D_k}_{\mathrm{dom}(f)}\big(x^{(k)} + \gamma_k (x^{(k)} - x^{(k-1)})\big)$ ← Extrapolation step with scaled projection
$z^{(k)} = \bar x^{(k)} - \alpha_k D_k^{-1} \nabla f_0(\bar x^{(k)})$ ← Scaled forward step
$x^{(k+1)} \simeq \mathrm{prox}^{D_k}_{\alpha_k f_1}(z^{(k)})$ ← Scaled inexact backward step
Main features: FISTA-like extrapolation + projection + line-search computation of $\alpha_k$ + inexact computation of the proximity operator + $D_k \to I$ as $k \to \infty$.

Convex case ($\nabla f_0(x)$ Lipschitz continuous):
- Convergence rate $f(x^{(k)}) - f^* = O(1/k^2)$ with $a \ge 2$, and $f(x^{(k)}) - f^* = o(1/k^2)$ with $a > 2$.
- Convergence of the iterates to a minimizer with $a > 2$.
Numerical results
Nonnegative image deconvolution with Total Variation regularization from Poisson data.
Left panel: nonscaled (blue) vs. scaled (red, black) inexact FISTA-like method. Right panel: scaled inexact FISTA-like method (blue) vs. state of the art.

[Figure: relative error on the objective function $(f(x^{(k)}) - f^*)/f^*$ versus time (s), comparing the non scaled method with scaled variants for parameters $t_1 \in \{10^7, 10^{10}\}$ and $t_2 \in \{2, 3, 4\}$.]
Conclusions and perspectives
Line-search based algorithms allow the implementation of FB methods without knowledge of the Lipschitz constant, and even their application to non-Lipschitz problems.
To do: study a less expensive line–search strategy for the inexact FISTA, to avoidthe computation of a new approximate proximal point at each trial step
The approximation of the proximity operator with an inner loop can be done in such a way that the basic convergence properties of the FB algorithms are not affected.
To do: investigate other implementable criteria, especially for nonconvexproblems.
Variable metric techniques can be considered as acceleration techniques, improving the behaviour of FB methods especially in the first iterations.
To do: gain better insight into the Majorization-Minimization techniques.
Codes available at www.oasis.unimore.it