
Fast inertial dynamics for convex optimization. Convergence of FISTA algorithms.

Hedy ATTOUCH

Universite Montpellier 2, ACSIOM, I3M UMR CNRS 5149

Joint work with J. Peypouquet, P. Redont, and Z. Chbani

CHALLENGES IN OPTIMIZATION FOR DATA SCIENCE

Laboratoire Jacques-Louis Lions, Universite Pierre et Marie Curie, July 1-2, 2015

H. ATTOUCH (Univ. Montpellier 2). Fast inertial dynamics for convex optimization. Convergence of FISTA algorithms.

1A. General presentation: dynamical system

Fast dynamical methods for convex minimization.

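$\min \{ \Phi(x) : x \in H \}.$

$H$ real Hilbert space; $\|x\|^2 = \langle x, x \rangle$; $\Phi : H \to \mathbb{R}$ convex, continuously differentiable, $\operatorname{argmin}\Phi \neq \emptyset$.

Dissipative inertial system, asymptotic vanishing damping.

$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

(SBC) $\alpha \geq 3$: $\Phi(x(t)) - \min_H \Phi \leq \frac{C}{t^2}$;

(APR) $\alpha > 3$: $x(t) \rightharpoonup x_\infty \in \operatorname{argmin}\Phi$ as $t \to +\infty$.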

Time discretization: fast Nesterov type algorithms, FISTA.


1B. General presentation: history

Heavy Ball with Friction, γ > 0.

(HBF) $\ddot{x}(t) + \gamma\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

• Opti.: Polyak (87), A-Goudou-Redont (00), Ivorra-Mohammadi (06).

• Convergence: Haraux-Jendoubi (98) analytic, Alvarez (2000) convex.

Asymptotic Vanishing Damping, $\lim_{t\to+\infty} a(t) = 0$.

(AVD) $\ddot{x}(t) + a(t)\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

• Cabot-Engler-Gadat (2009): $\int_{t_0}^{+\infty} a(t)\,dt = +\infty \implies \Phi(x(t)) \to \min_H \Phi$.

• Su-Boyd-Candes (2014), A-Peypouquet-Redont (2015): $a(t) = \frac{\alpha}{t}$, $\alpha \geq 3 \implies \Phi(x(t)) - \min_H \Phi \leq C t^{-2}$.


Contents

1 General presentation.

2 Fast convergence of the values.

3 Weak convergence of the orbits.

4 Strong convergence results.

5 The effect of perturbations, errors.

6 Fast related algorithms. Nesterov method, FISTA.

7 Related dynamics. Case $a(t) = \frac{1}{t^\gamma}$.

8 Related dynamics. Hessian driven damping.

9 Related dynamics. Adaptive restart.

10 Related dynamics. Tikhonov regularization.

11 Annex 1. Stochastic gradient descent algorithm.

12 Annex 2. Complexity aspects.

13 Perspective, open questions.


2A. Fast convergence of the values

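(AVD)$_\alpha$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

Theorem 1 (Su-Boyd-Candes, NIPS 2014)

Suppose $\alpha \geq 3$, $t_0 > 0$, and $x : [t_0,+\infty[ \to H$ is an orbit of (AVD)$_\alpha$. Then,

$\Phi(x(t)) - \min_H \Phi \leq \frac{C}{t^2}.$

Proof: for $x^* \in S = \operatorname{argmin}\Phi$, take as a Lyapunov function

$E_\alpha(t) := \frac{2}{\alpha-1}\, t^2 \left(\Phi(x(t)) - \inf_H \Phi\right) + (\alpha-1)\left\|x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t)\right\|^2.$

$\dot{E}_\alpha(t) + 2\,\frac{\alpha-3}{\alpha-1}\, t \left(\Phi(x(t)) - \min_H \Phi\right) \leq 0.$

$C = t_0^2 \left(\Phi(x_0) - \min_H \Phi\right) + (\alpha-1)^2 d^2(x_0, S) + t_0^2 \|\dot{x}_0\|^2.$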
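A numerical sanity check of Theorem 1, given as a minimal sketch and not as part of the original analysis. It assumes $H = \mathbb{R}$, the hypothetical test potential $\Phi(x) = x^4/4$ (convex, minimum 0 at $x = 0$, not strongly convex), $\alpha = 3$, $t_0 = 1$, $x_0 = 1$, $\dot{x}_0 = 0$, and a hand-written classical Runge-Kutta integrator (`simulate_avd` is an illustrative name). It checks that $t^2(\Phi(x(t)) - \min\Phi)$ stays below the constant $C$ given above.

```python
def simulate_avd(grad, x0, v0, alpha, t0, t_end, h=1e-3):
    """Integrate x'' + (alpha/t) x' + grad(x) = 0 on [t0, t_end] with classical RK4 (scalar x)."""
    def f(t, x, v):
        # First-order form: (x, v)' = (v, -(alpha/t) v - grad(x))
        return v, -(alpha / t) * v - grad(x)

    t, x, v = t0, x0, v0
    traj = [(t, x, v)]
    for _ in range(int(round((t_end - t0) / h))):
        k1x, k1v = f(t, x, v)
        k2x, k2v = f(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
        k3x, k3v = f(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
        k4x, k4v = f(t + h, x + h * k3x, v + h * k3v)
        x += h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
        traj.append((t, x, v))
    return traj


def phi(x):
    return 0.25 * x ** 4      # assumed test potential, min = 0 at x* = 0


def grad_phi(x):
    return x ** 3


alpha, t0, x0, v0 = 3.0, 1.0, 1.0, 0.0
# Constant of Theorem 1; here d(x0, S) = |x0| because S = {0}.
C = t0 ** 2 * phi(x0) + (alpha - 1) ** 2 * x0 ** 2 + t0 ** 2 * v0 ** 2

traj = simulate_avd(grad_phi, x0, v0, alpha, t0, 50.0)
worst_ratio = max(t ** 2 * phi(x) / C for t, x, _ in traj)
print("max over the grid of t^2 (Phi - min Phi) / C:", worst_ratio)
```

A ratio below 1 on the whole grid is what the bound predicts; the step size and time horizon are arbitrary choices.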


Lyapunov analysis

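$E_\alpha(t) := \frac{2}{\alpha-1}\, t^2 \left(\Phi(x(t)) - \inf_H \Phi\right) + (\alpha-1)\left\|x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t)\right\|^2.$

Derivation of $E_\alpha(\cdot)$ gives

$\dot{E}_\alpha(t) = \frac{4}{\alpha-1}\, t \left(\Phi(x(t)) - \inf_H \Phi\right) + \frac{2}{\alpha-1}\, t^2 \langle \nabla\Phi(x(t)), \dot{x}(t) \rangle$

$\qquad + 2(\alpha-1)\left\langle x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t),\ \dot{x}(t) + \frac{1}{\alpha-1}\dot{x}(t) + \frac{t}{\alpha-1}\ddot{x}(t) \right\rangle$

$= \frac{4}{\alpha-1}\, t \left(\Phi(x(t)) - \inf_H \Phi\right) + \frac{2}{\alpha-1}\, t^2 \langle \nabla\Phi(x(t)), \dot{x}(t) \rangle$

$\qquad + 2(\alpha-1)\left\langle x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t),\ \frac{t}{\alpha-1}\left(\frac{\alpha}{t}\dot{x}(t) + \ddot{x}(t)\right) \right\rangle.$

Then use (AVD) in this last expression to obtain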

Then use (AVD) in this last expression to obtain


Lyapunov analysis

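$\dot{E}_\alpha(t) = \frac{4}{\alpha-1}\, t \left(\Phi(x(t)) - \inf_H \Phi\right) + \frac{2}{\alpha-1}\, t^2 \langle \nabla\Phi(x(t)), \dot{x}(t) \rangle - 2t \left\langle x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t),\ \nabla\Phi(x(t)) \right\rangle$ (1)

$= \frac{4}{\alpha-1}\, t \left(\Phi(x(t)) - \inf_H \Phi\right) - 2t \langle x(t) - x^*, \nabla\Phi(x(t)) \rangle.$ (2)

By convexity of $\Phi$

$\Phi(x^*) \geq \Phi(x(t)) + \langle x^* - x(t), \nabla\Phi(x(t)) \rangle.$

Replacing in (2), we obtain

$\dot{E}_\alpha(t) + \left(2 - \frac{4}{\alpha-1}\right) t \left(\Phi(x(t)) - \inf_H \Phi\right) \leq 0.$

$\dot{E}_\alpha(t) + 2\,\frac{\alpha-3}{\alpha-1}\, t \left(\Phi(x(t)) - \inf_H \Phi\right) \leq 0.$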
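The monotonicity of $E_\alpha$ can be observed on an explicit orbit. The sketch below is an illustration under assumptions: $H = \mathbb{R}$, $\Phi(x) = x^2/2$, $x^* = 0$, $\alpha = 4$, and the closed-form orbit $x(t) = 3(\sin t - t\cos t)/t^3$ of $\ddot{x} + \frac{4}{t}\dot{x} + x = 0$. It evaluates $E_\alpha$ on a grid and records the largest increase between consecutive grid points, which should be zero up to rounding.

```python
import math

ALPHA = 4.0


def x(t):
    # Closed-form orbit of x'' + (4/t) x' + x = 0 with x(0+) = 1, x'(0+) = 0.
    return 3.0 * (math.sin(t) - t * math.cos(t)) / t ** 3


def xdot(t):
    # Time derivative of x(t), computed by hand.
    return 3.0 * ((t ** 2 - 3.0) * math.sin(t) + 3.0 * t * math.cos(t)) / t ** 4


def energy(t):
    """E_alpha(t) for Phi(x) = x^2/2, x* = 0, inf Phi = 0."""
    phi = 0.5 * x(t) ** 2
    z = x(t) + t / (ALPHA - 1.0) * xdot(t)
    return 2.0 / (ALPHA - 1.0) * t ** 2 * phi + (ALPHA - 1.0) * z ** 2


ts = [0.5 + 0.01 * i for i in range(5000)]
energies = [energy(t) for t in ts]
max_increase = max(b - a for a, b in zip(energies, energies[1:]))
print("E at start:", energies[0], "E at end:", energies[-1])
print("largest increase between grid points:", max_increase)
```

The grid starts at $t = 0.5$ only to avoid cancellation in the closed form near $t = 0$.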


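2B. $O(\frac{1}{t^2})$ as the worst possible case

$H = \mathbb{R}$, $\Phi(x) = c|x|^\gamma$, with parameters $c > 0$, $\gamma > 0$.

$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + c\gamma\, x(t)^{\gamma-1} = 0.$

Nonnegative, completely damped solutions of (AVD)$_\alpha$:

$x(t) = \frac{1}{t^\theta}$, $\theta > 0$.

Replacing $x(\cdot)$ in (AVD)$_\alpha$ gives $\gamma > 2$, $\theta = \frac{2}{\gamma-2}$, $\alpha > \frac{\gamma}{\gamma-2}$, and

$\Phi(x(t)) = \frac{2}{\gamma(\gamma-2)}\left(\alpha - \frac{\gamma}{\gamma-2}\right)\frac{1}{t^{\frac{2\gamma}{\gamma-2}}}.$

As $\gamma \uparrow +\infty$, $\frac{2\gamma}{\gamma-2} \downarrow 2$: $\Phi$ becomes very flat around its minimizer.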
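The substitution above can be checked numerically. A small sketch, with illustrative function names and arbitrarily chosen parameter values $\gamma = 4$, $\alpha = 3$: it evaluates the residual of (AVD)$_\alpha$ along $x(t) = t^{-\theta}$ with the constant $c$ read off from the formula for $\Phi(x(t))$, and lists the decay exponent $\frac{2\gamma}{\gamma-2}$ for growing $\gamma$.

```python
def residual(t, gamma, alpha):
    """Left-hand side of x'' + (alpha/t) x' + c*gamma*x^(gamma-1) along x(t) = t^(-theta)."""
    theta = 2.0 / (gamma - 2.0)
    # c is the prefactor of t^(-2*gamma/(gamma-2)) in the formula for Phi(x(t)).
    c = 2.0 / (gamma * (gamma - 2.0)) * (alpha - gamma / (gamma - 2.0))
    x = t ** (-theta)
    xdot = -theta * t ** (-theta - 1.0)
    xddot = theta * (theta + 1.0) * t ** (-theta - 2.0)
    return xddot + alpha / t * xdot + c * gamma * x ** (gamma - 1.0)


def value_exponent(gamma):
    """Exponent r such that Phi(x(t)) is proportional to t^(-r)."""
    return 2.0 * gamma / (gamma - 2.0)


gamma, alpha = 4.0, 3.0   # requires gamma > 2 and alpha > gamma / (gamma - 2) = 2
worst = max(abs(residual(t, gamma, alpha)) for t in (1.0, 2.0, 5.0, 10.0, 100.0))
exponents = [value_exponent(g) for g in (4.0, 10.0, 100.0, 1000.0)]
print("largest residual:", worst)
print("decay exponents:", exponents)
```

The exponents decrease towards 2, which is the sense in which the rate $1/t^2$ cannot be improved uniformly over convex potentials.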


2C. Strong convexity: faster rate of convergence

Convergence rates increase indefinitely for larger values of α.

Theorem 2 (SBC, APR)

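Suppose that $\Phi : H \to \mathbb{R}$ is strongly convex. Let $x(\cdot)$ be an orbit of (AVD)$_\alpha$, with $\alpha > 3$. Then

$x(t)$ converges strongly to the unique element $x^* \in \operatorname{argmin}\Phi$;

$\Phi(x(t)) - \min_H \Phi = O(t^{-\frac{2}{3}\alpha})$;

$\|\dot{x}(t)\|^2 = O(t^{-\frac{2}{3}\alpha})$;

$\|x(t) - x^*\|^2 = O(t^{-\frac{2}{3}\alpha})$.

Proof: use the Lyapunov function $E^p_\lambda$ with $p = \frac{2}{3}(\alpha-3)$, $\lambda = \frac{2}{3}\alpha$:

$E^p_\lambda(t) := t^p \left( t^2 \left(\Phi(x(t)) - \min_H \Phi\right) + \frac{1}{2}\left\|\lambda(x(t) - x^*) + t\dot{x}(t)\right\|^2 \right).$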


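2D. Example $\Phi(x) = \frac{1}{2}\|x\|^2$. Role of Bessel functions

(AVD)$_\alpha$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + x(t) = 0.$

Solution of (AVD)$_\alpha$ with Cauchy data $x(0) = x_0$, $\dot{x}(0) = 0$:

$x(t) = 2^{\frac{\alpha-1}{2}}\, \Gamma\!\left(\frac{\alpha+1}{2}\right) \frac{J_{\frac{\alpha-1}{2}}(t)}{t^{\frac{\alpha-1}{2}}}\, x_0.$

$J_{\frac{\alpha-1}{2}}(\cdot)$: first kind Bessel function of order $\frac{\alpha-1}{2}$. For large $t$,

$J_\alpha(t) = \sqrt{\frac{2}{\pi t}} \left( \cos\left(t - \frac{\pi\alpha}{2} - \frac{\pi}{4}\right) + O\left(\frac{1}{t}\right) \right).$

Hence $\Phi(x(t)) - \min_H \Phi = O(t^{-\alpha})$.

Compare with $O(t^{-\frac{2}{3}\alpha})$, valid for arbitrary strongly convex functions.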
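As an illustration, the Bessel formula can be made fully explicit for a half-integer order. The worked case below takes $\alpha = 4$ (a value chosen here only for convenience) and uses the standard closed form of $J_{3/2}$ and the value of $\Gamma(5/2)$.

```latex
% Special case \alpha = 4, i.e. order (\alpha-1)/2 = 3/2:
J_{3/2}(t) = \sqrt{\frac{2}{\pi t}}\left(\frac{\sin t}{t} - \cos t\right),
\qquad
\Gamma\!\left(\tfrac{5}{2}\right) = \frac{3\sqrt{\pi}}{4},
\\
x(t) = 2^{3/2}\,\Gamma\!\left(\tfrac{5}{2}\right)\frac{J_{3/2}(t)}{t^{3/2}}\,x_0
     = \frac{3(\sin t - t\cos t)}{t^{3}}\,x_0,
\\
\Phi(x(t)) = \frac{9(\sin t - t\cos t)^2}{2\,t^{6}}\,\|x_0\|^2
           \le \frac{9(1+t)^2}{2\,t^{6}}\,\|x_0\|^2 = O(t^{-4}).
```

Near $t = 0$ the expansion $\sin t - t\cos t = \frac{t^3}{3} - \frac{t^5}{30} + \dots$ gives $x(t) \to x_0$ and $\dot{x}(t) \to 0$, in agreement with the Cauchy data.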


2E. Case argmin Φ = ∅.

Theorem 3 (APR)

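Suppose $\Phi : H \to \mathbb{R}$ convex, $\operatorname{argmin}\Phi$ possibly empty. Let $x(\cdot)$ be an orbit of (AVD)$_\alpha$ with $\alpha > 1$. Then,

$\lim_{t\to+\infty} \Phi(x(t)) = \inf_H \Phi.$

Moreover, if $\inf_H \Phi > -\infty$, then $\lim_{t\to+\infty} \|\dot{x}(t)\| = 0$.

Fast convergence may not be satisfied in this case:

$\Phi(x) = \frac{c}{x^\theta}$, with $c = \frac{2(2\alpha + \theta(\alpha-1))}{\theta(2+\theta)^2}$.

Then $x(t) = t^{\frac{2}{2+\theta}}$ is a solution of (AVD)$_\alpha$. We have $\inf_H \Phi = 0$, and

$\Phi(x(t)) = \frac{c}{t^{\frac{2\theta}{2+\theta}}}.$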
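The value of $c$ can be recovered by direct substitution. The short derivation below is a sketch that assumes $H = \mathbb{R}$, $x > 0$ and $\theta > 0$, and writes $\beta := \frac{2}{2+\theta}$ (a symbol introduced here for readability).

```latex
x(t) = t^{\beta},\quad \beta = \frac{2}{2+\theta}:
\qquad
\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) = \beta(\beta + \alpha - 1)\,t^{\beta-2},
\\
\nabla\Phi(x(t)) = -c\,\theta\,x(t)^{-\theta-1} = -c\,\theta\,t^{-\beta(\theta+1)} = -c\,\theta\,t^{\beta-2}
\quad\text{since } \beta(\theta+2) = 2,
\\
\ddot{x} + \frac{\alpha}{t}\dot{x} + \nabla\Phi(x) = 0
\iff
c\,\theta = \beta(\beta+\alpha-1) = \frac{2\,(2\alpha + \theta(\alpha-1))}{(2+\theta)^2}.
```

Since $\frac{2\theta}{2+\theta} < 2$ for every $\theta > 0$, the values decay strictly slower than $\frac{1}{t^2}$.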


3A. Weak convergence of the orbits

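(AVD)$_\alpha$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

Theorem 4 (APR)

Suppose $\alpha > 3$. Let $x : [t_0,+\infty[ \to H$ be an orbit of (AVD)$_\alpha$. Then,

$x(t) \rightharpoonup x^* \in \operatorname{argmin}\Phi$ weakly as $t \to +\infty$;

$\lim_{t\to+\infty} \|\dot{x}(t)\| = 0$, $\|\dot{x}(t)\| \leq \frac{C}{t}$, $\int_{t_0}^{+\infty} t\|\dot{x}(t)\|^2\,dt < +\infty$;

$\Phi(x(t)) - \min_H \Phi \leq \frac{C}{t^2}$, $\int_{t_0}^{+\infty} t\left(\Phi(x(t)) - \min_H \Phi\right)dt < +\infty$;

$\lim_{t\to+\infty} \frac{1}{t^\alpha} \int_{t_0}^{t} \tau^\alpha \|\ddot{x}(\tau)\|^2\,d\tau = 0.$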


3B. Proof of the convergence results

Lemma (Opial)

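Let $S \subset H$, $S \neq \emptyset$, and $x : [t_0,+\infty[ \to H$ a map. Assume that

(i) for every $z \in S$, $\lim_{t\to+\infty} \|x(t) - z\|$ exists;

(ii) every weak sequential cluster point of $x(\cdot)$ belongs to $S$.

Then, $w\text{-}\lim_{t\to+\infty} x(t) = x_\infty$ exists, for some element $x_\infty \in S$.

Lemma (differential inequality)

Let $t_0 > 0$, $\alpha > 1$, and $w : [t_0,+\infty[ \to \mathbb{R}$ that satisfies

$\ddot{w}(t) + \frac{\alpha}{t}\dot{w}(t) \leq g(t),$

for some $g : [t_0,+\infty[ \to \mathbb{R}_+$ such that $t \mapsto t\,g(t) \in L^1(t_0,+\infty)$. Then

$[\dot{w}]_+ \in L^1(t_0,+\infty).$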


3C. Proof of the convergence results

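Step 1. Given $x^* \in \operatorname{argmin}\Phi$, set $h(t) := \frac{1}{2}\|x(t) - x^*\|^2$.

$\dot{h}(t) = \langle x(t) - x^*, \dot{x}(t) \rangle$, $\ddot{h}(t) = \langle x(t) - x^*, \ddot{x}(t) \rangle + \|\dot{x}(t)\|^2$.

Combining these two equations, and using (AVD)$_\alpha$, we obtain

$\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) = \|\dot{x}(t)\|^2 + \left\langle x(t) - x^*,\ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) \right\rangle$ (3)

$= \|\dot{x}(t)\|^2 + \langle x(t) - x^*, -\nabla\Phi(x(t)) \rangle.$ (4)

By monotonicity of $\nabla\Phi$ and $\nabla\Phi(x^*) = 0$, we infer

$\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) \leq \|\dot{x}(t)\|^2.$ (5)

The next step is to prove that $\int_{t_0}^{+\infty} t\|\dot{x}(t)\|^2\,dt < +\infty$.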


3D. Proof of the convergence results

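Step 2. From

$\dot{E}_\alpha(t) + 2\,\frac{\alpha-3}{\alpha-1}\, t \left(\Phi(x(t)) - \min_H \Phi\right) \leq 0,$

and $\alpha > 3$, we deduce that

$\int_{t_0}^{+\infty} t\left(\Phi(x(t)) - \min_H \Phi\right)dt < +\infty.$ (6)

Then, take the scalar product of (AVD)$_\alpha$ by $t^2\dot{x}(t)$, and integrate

$\frac{1}{2}t^2 \frac{d}{dt}\|\dot{x}(t)\|^2 + \alpha t\|\dot{x}(t)\|^2 + t^2 \frac{d}{dt}\Phi(x(t)) \leq 0.$

$\frac{t^2}{2}\|\dot{x}(t)\|^2 + (\alpha-1)\int_{t_0}^{t} \tau\|\dot{x}(\tau)\|^2\,d\tau \leq C + 2\int_{t_0}^{t} \tau\left(\Phi(x(\tau)) - \min_H \Phi\right)d\tau.$

$\int_{t_0}^{+\infty} t\|\dot{x}(t)\|^2\,dt < +\infty.$ (7)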


3E. Proof of the convergence results

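Step 3. Asymptotic behaviour of $\dot{x}(t)$. From

$\frac{t^2}{2}\|\dot{x}(t)\|^2 + (\alpha-1)\int_{t_0}^{t} \tau\|\dot{x}(\tau)\|^2\,d\tau \leq C + 2\int_{t_0}^{t} \tau\left(\Phi(x(\tau)) - \min_H \Phi\right)d\tau$

we deduce that $\|\dot{x}(t)\| \leq \frac{C}{t}$, $\lim_{t\to+\infty} \|\dot{x}(t)\| = 0$.

Step 4. Let us show that $\lim_{t\to+\infty} \frac{1}{t^\alpha}\int_{t_0}^{t} \tau^\alpha\|\ddot{x}(\tau)\|^2\,d\tau = 0$.

Let us return to (3)

$\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) + \langle x(t) - x^*, \nabla\Phi(x(t)) \rangle = \|\dot{x}(t)\|^2.$

$\nabla\Phi$ is Lipschitz continuous on bounded sets. By the Baillon-Haddad theorem, it is $\frac{1}{L}$-cocoercive on a ball containing the trajectory:

$\langle x(t) - x^*, \nabla\Phi(x(t)) - \nabla\Phi(x^*) \rangle \geq \frac{1}{L}\|\nabla\Phi(x(t)) - \nabla\Phi(x^*)\|^2.$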


3F. Proof of the convergence results

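Combining the two above equations, and using $\nabla\Phi(x^*) = 0$, we obtain

$\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) + \frac{1}{L}\|\nabla\Phi(x(t))\|^2 \leq \|\dot{x}(t)\|^2.$

Replacing $\nabla\Phi(x(t)) = -\ddot{x}(t) - \frac{\alpha}{t}\dot{x}(t)$

$\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) + \frac{1}{L}\left\|\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t)\right\|^2 \leq \|\dot{x}(t)\|^2.$

Developing, and dropping the nonnegative term $\frac{\alpha^2}{Lt^2}\|\dot{x}(t)\|^2$,

$\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) + \frac{1}{L}\|\ddot{x}(t)\|^2 + \frac{\alpha}{Lt}\frac{d}{dt}\|\dot{x}(t)\|^2 \leq \|\dot{x}(t)\|^2.$

Then integrate, and apply Fubini’s theorem.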


4. Strong convergence results

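(AVD)$_\alpha$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

Theorem 5 (APR)

Suppose $\alpha > 3$, and one of the following properties is satisfied by $\Phi$:

$\operatorname{int}(\operatorname{argmin}\Phi) \neq \emptyset$;

$\Phi$ is an even function (i.e., $\Phi(-x) = \Phi(x)$);

$\Phi$ is uniformly convex.

Then, for any orbit $x(\cdot)$ of (AVD)$_\alpha$, there exists $x^* \in \operatorname{argmin}\Phi$ such that

$x(t) \to x^*$ strongly in $H$ as $t \to +\infty$.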


5. The effect of perturbations, errors.

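(AVD)$_{\alpha,g}$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = g(t)$

Theorem 6 (A-Chbani)

Suppose $\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty$.

Let $x : [t_0,+\infty[ \to H$ be an orbit of (AVD)$_{\alpha,g}$. Then

a) $\alpha \geq 3$: $\Phi(x(t)) - \min_H \Phi = O\left(\frac{1}{t^2}\right)$.

b) $\alpha > 3$: There exists some $x^* \in \operatorname{argmin}\Phi$ such that $x(t) \rightharpoonup x^*$ weakly as $t \to +\infty$.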


6A. Fast related algorithms. Nesterov method, FISTA.

Non-smooth structured convex minimization problem:

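$\min \{ \Phi(x) + \Psi(x) : x \in H \}.$

$\Phi : H \to \mathbb{R} \cup \{+\infty\}$ closed, convex, proper;

$\Psi : H \to \mathbb{R}$ convex, differentiable, $\nabla\Psi$ Lipschitz continuous.

Optimal solutions:

$\partial\Phi(x) + \nabla\Psi(x) \ni 0.$

Dynamical approach via the differential inclusion

$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \partial\Phi(x(t)) + \nabla\Psi(x(t)) \ni 0.$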


6B. Fast related algorithms. Nesterov method, FISTA.

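$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \partial\Phi(x(t)) + \nabla\Psi(x(t)) \ni 0.$

$\Theta := \Phi + \Psi$, $\Theta : H \to \mathbb{R} \cup \{+\infty\}$ closed convex, proper.

$\ddot{x}(t) + a(t)\dot{x}(t) + \partial\Theta(x(t)) \ni 0.$

$a(t) \equiv \gamma > 0$, $\dim H < +\infty$: Schatzman, A-Cabot-Redont. $x(\cdot)$ loc. Lipschitz; $\dot{x}(\cdot)$ bounded variation; $\ddot{x}(\cdot)$ bounded measure. Nonuniqueness (shocks).

$a(t) = \frac{\alpha}{t}$, $\alpha \geq 3$. Lyapunov analysis is still valid: convex subdifferential inequalities, generalized derivation chain rule.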


6C. Fast related algorithms. Nesterov method, FISTA.

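$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \partial\Phi(x(t)) + \nabla\Psi(x(t)) \ni 0.$

Implicit discretization / nonsmooth function $\Phi$.

Explicit discretization / smooth function $\Psi$.

Time step $h > 0$, $t_k = kh$, $x_k = x(t_k)$. Finite difference scheme

$\frac{1}{h^2}(x_{k+1} - 2x_k + x_{k-1}) + \frac{\alpha}{kh^2}(x_k - x_{k-1}) + \partial\Phi(x_{k+1}) + \nabla\Psi(y_k) \ni 0.$

$x_{k+1} + h^2\partial\Phi(x_{k+1}) \ni \left( x_k + \left(1 - \frac{\alpha}{k}\right)(x_k - x_{k-1}) \right) - h^2\nabla\Psi(y_k).$

Natural choice (Nesterov): $y_k = x_k + \left(1 - \frac{\alpha}{k}\right)(x_k - x_{k-1})$.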


6D. Fast related algorithms. Nesterov method, FISTA.

Proximal mapping, resolvent

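$\operatorname{prox}_{\gamma\Phi}(x) := \operatorname{argmin}_{\xi\in H} \left\{ \Phi(\xi) + \frac{1}{2\gamma}\|\xi - x\|^2 \right\} = (I + \gamma\partial\Phi)^{-1}(x).$

$y_k = x_k + \left(1 - \frac{\alpha}{k}\right)(x_k - x_{k-1});$

$x_{k+1} = \operatorname{prox}_{h^2\Phi}\left( y_k - h^2\nabla\Psi(y_k) \right).$ (8)

Equivalent formulation ($1 - \frac{\alpha}{k+\alpha-1} = \frac{k-1}{k+\alpha-1}$):

(AVD-algo)$_\alpha$

$y_k = x_k + \frac{k-1}{k+\alpha-1}(x_k - x_{k-1});$

$x_{k+1} = \operatorname{prox}_{h^2\Phi}\left( y_k - h^2\nabla\Psi(y_k) \right).$ (9)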
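To make the proximal mapping concrete, here is a small sketch for the one-dimensional case $\Phi = \lambda|\cdot|$, a choice made here for illustration, for which the prox is the soft-thresholding map. The closed form is compared with a brute-force minimization of $\Phi(\xi) + \frac{1}{2\gamma}|\xi - x|^2$ over a grid; all function names are illustrative.

```python
def prox_abs(x, gamma, lam=1.0):
    """prox of gamma * lam * |.| at x, i.e. soft-thresholding with threshold gamma * lam."""
    tau = gamma * lam
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def prox_objective(xi, x, gamma, lam=1.0):
    # Objective whose minimizer over xi defines the prox.
    return lam * abs(xi) + (xi - x) ** 2 / (2.0 * gamma)


def brute_force_prox(x, gamma, lam=1.0, half_width=5.0, n=100001):
    # Minimize the prox objective over a uniform grid on [-half_width, half_width].
    grid = [-half_width + 2.0 * half_width * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda xi: prox_objective(xi, x, gamma, lam))


gamma = 0.5
checks = [(x, prox_abs(x, gamma), brute_force_prox(x, gamma))
          for x in (-2.0, -0.3, 0.0, 0.4, 1.7)]
for x, closed_form, brute in checks:
    print(x, closed_form, brute)
```

The two columns agree up to the grid resolution, which is consistent with the resolvent identity $\operatorname{prox}_{\gamma\Phi} = (I + \gamma\partial\Phi)^{-1}$.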


6E. Fast related algorithms. Nesterov method, FISTA.

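Set $s = h^2$.

(AVD-algo)$_\alpha$

$y_k = x_k + \frac{k-1}{k+\alpha-1}(x_k - x_{k-1});$

$x_{k+1} = \operatorname{prox}_{s\Phi}\left( y_k - s\nabla\Psi(y_k) \right).$

Proximal inertial algo.: A-Alvarez, Moudafi-Oliny, Lorenz-Pock.

$\alpha = 3$: Nesterov, Guler, Beck-Teboulle (FISTA)

(FISTA)

$y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1});$

$x_{k+1} = \operatorname{prox}_{s\Phi}\left( y_k - s\nabla\Psi(y_k) \right).$

$\alpha \geq 3$. Recent studies Chambolle-Dossal, Su-Boyd-Candes, APR.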
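The iteration (AVD-algo)$_\alpha$ can be sketched in a few lines. The example below is an illustration under assumptions: a small separable test problem with $\Psi(x) = \frac{1}{2}\sum_i a_i(x_i - b_i)^2$ and $\Phi = \lambda\|\cdot\|_1$, chosen because its minimizer is known in closed form, a step $s = 0.9/L$ with $L = \max_i a_i$, and illustrative function names. With $\alpha = 3$ the extrapolation coefficient is the FISTA one.

```python
def soft_threshold(v, tau):
    """Componentwise prox of tau * ||.||_1."""
    out = []
    for vi in v:
        if vi > tau:
            out.append(vi - tau)
        elif vi < -tau:
            out.append(vi + tau)
        else:
            out.append(0.0)
    return out


def avd_algo(grad_psi, prox_phi, x0, s, alpha=3.0, n_iter=500):
    """y_k = x_k + (k-1)/(k+alpha-1) (x_k - x_{k-1}); x_{k+1} = prox_{s Phi}(y_k - s grad Psi(y_k))."""
    x_prev, x = list(x0), list(x0)
    for k in range(1, n_iter + 1):
        beta = (k - 1.0) / (k + alpha - 1.0)
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad_psi(y)
        x_prev, x = x, prox_phi([yi - s * gi for yi, gi in zip(y, g)], s)
    return x


# Assumed test problem (not from the slides).
a = [1.0, 2.0, 4.0]
b = [3.0, -0.1, 1.0]
lam = 0.5
L = max(a)            # Lipschitz constant of grad Psi
s = 0.9 / L           # step size s < 1/L


def grad_psi(y):
    return [ai * (yi - bi) for ai, yi, bi in zip(a, y, b)]


def prox_phi(v, step):
    return soft_threshold(v, step * lam)


x_final = avd_algo(grad_psi, prox_phi, [0.0, 0.0, 0.0], s, alpha=3.0, n_iter=500)
# Coordinatewise minimizer of a_i/2 (x - b_i)^2 + lam |x|.
x_exact = [soft_threshold([bi], lam / ai)[0] for ai, bi in zip(a, b)]
print("iterate:", x_final)
print("exact  :", x_exact)
```

Passing `alpha=4.0` or larger gives the variants with $\alpha > 3$; only the extrapolation coefficient changes.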


6F. Fast related algorithms. Nesterov method, FISTA.

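(AVD-algo)$_\alpha$

$y_k = x_k + \frac{k-1}{k+\alpha-1}(x_k - x_{k-1});$

$x_{k+1} = \operatorname{prox}_{s\Phi}\left( y_k - s\nabla\Psi(y_k) \right).$

Theorem 7 (Chambolle-Dossal, Su-Boyd-Candes)

$\Phi : H \to \mathbb{R} \cup \{+\infty\}$ closed convex proper;

$\Psi : H \to \mathbb{R}$ convex differentiable, $\nabla\Psi$ $L$-Lipschitz continuous;

$S = \operatorname{argmin}(\Phi + \Psi) \neq \emptyset$, $s < \frac{1}{L}$, $\alpha > 3$.

Let $(x_k)$ be a sequence generated by (AVD-algo)$_\alpha$. Then,

$x_k \rightharpoonup x^* \in \operatorname{argmin}(\Phi + \Psi)$ weakly as $k \to +\infty$;

$(\Phi + \Psi)(x_k) - \min_H(\Phi + \Psi) \leq \frac{C}{k^2}$;

$\sum_k k\|x_k - x_{k-1}\|^2 < +\infty$, $\|x_k - x_{k-1}\| \leq \frac{C}{k}$.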


6G. Fast related algorithms. Proof of convergence

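Step one. $k \mapsto E_k$ is the correspondent of $E_\alpha(\cdot)$:

$E_k = \frac{2s}{\alpha-1}(k+\alpha-2)^2\left(\Theta(x_k) - \Theta^*\right) + \frac{1}{\alpha-1}\left\|(k+\alpha-1)y_k - kx_k - (\alpha-1)x^*\right\|^2$

$E_k$ is a strict Lyapunov function: for any $k \in \mathbb{N}$

$E_k + \frac{2s}{\alpha-1}\left((\alpha-3)(k+\alpha-2) + 1\right)\left(\Theta(x_k) - \inf\Theta\right) \leq E_{k-1}.$

Fast convergence properties

$(\Phi + \Psi)(x_k) - \min_H(\Phi + \Psi) \leq \frac{C}{k^2}$

$\sum_k k\left((\Phi + \Psi)(x_k) - \min_H(\Phi + \Psi)\right) < +\infty.$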


6H. Fast related algorithms. Proof of convergence

Step two. The next step consists in obtaining the energy estimate

$\sum_k k\|x_k - x_{k-1}\|^2 < +\infty,$

the discrete version of the continuous energy estimate $\int_{t_0}^{\infty} t\|\dot{x}(t)\|^2\,dt < +\infty$.

Step three. The final step is to apply Opial’s lemma. Using the previous estimates, it is a direct adaptation of the classical proof of the convergence of proximal-like inertial algorithms. It is a parallel argument to that using the differential inequality, with $\|x_k - x^*\|^2$ instead of $\|x(t) - x^*\|^2$, and $x^* \in \operatorname{argmin}(\Phi + \Psi)$.


6I. A perturbed FISTA algorithm.

(AVD)$_{\alpha,g}$-algo

$y_k = x_k + \frac{k-1}{k+\alpha-1}(x_k - x_{k-1});$

$x_{k+1} = \operatorname{prox}_{s\Phi}\left( y_k - s(\nabla\Psi(y_k) - g_k) \right).$

$\Phi : H \to \mathbb{R} \cup \{+\infty\}$ closed convex proper;

$\Psi : H \to \mathbb{R}$ convex differentiable, $\nabla\Psi$ $L$-Lipschitz continuous;

$S = \operatorname{argmin}(\Phi + \Psi) \neq \emptyset$, $s < \frac{1}{L}$, $\alpha > 3$, $\sum_k k\|g_k\| < +\infty$.

Let $(x_k)$ be a sequence generated by (AVD)$_{\alpha,g}$-algo. Then,

$x_k \rightharpoonup x^* \in \operatorname{argmin}(\Phi + \Psi)$ weakly as $k \to +\infty$.

$(\Phi + \Psi)(x_k) - \min_H(\Phi + \Psi) = O(k^{-2})$.

$\sum_k k\|x_k - x_{k-1}\|^2 < +\infty$, $\|x_k - x_{k-1}\| \leq \frac{C}{k}$.

Related results: Schmidt-Le Roux-Bach, NIPS’11.
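A one-dimensional sketch of the perturbed iteration, given under assumptions that are not in the slides: $\Phi = \lambda|\cdot|$, $\Psi(x) = \frac{1}{2}(x - b)^2$ (so $L = 1$), $s = 0.5$, $\alpha = 4$, and errors $g_k = 1/k^3$, which satisfy $\sum_k k|g_k| = \sum_k 1/k^2 < +\infty$. The iterate is compared with the exact minimizer, obtained by soft-thresholding $b$.

```python
def soft(v, tau):
    """Scalar soft-thresholding: prox of tau * |.|."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def perturbed_avd_algo(b, lam, s, alpha, n_iter):
    """x_{k+1} = prox_{s Phi}(y_k - s (grad Psi(y_k) - g_k)) with Psi(y) = (y - b)^2 / 2."""
    x_prev = x = 0.0
    for k in range(1, n_iter + 1):
        y = x + (k - 1.0) / (k + alpha - 1.0) * (x - x_prev)
        g_k = 1.0 / k ** 3          # assumed error sequence, sum_k k |g_k| < +inf
        grad = y - b                # gradient of Psi at y
        x_prev, x = x, soft(y - s * (grad - g_k), s * lam)
    return x


b, lam = 2.0, 0.5
x_star = soft(b, lam)               # minimizer of lam |x| + (x - b)^2 / 2
x_final = perturbed_avd_algo(b, lam, s=0.5, alpha=4.0, n_iter=2000)
print("x_final =", x_final, " x_star =", x_star)
```

Replacing `g_k` by a sequence that violates the summability condition is a quick way to see how the assumption enters.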

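7. Related systems. Case $a(t) = \frac{1}{t^\gamma}$

$\ddot{x}(t) + \frac{1}{t^\gamma}\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

Global energy: $W(t) = \frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \min\Phi.$

$a(t) = \frac{1}{t^\gamma}$ | $\gamma = 0$ | $0 < \gamma < 1$ | $\gamma = 1$, $a(t) = \frac{\alpha}{t}$, $\alpha > 3$
$W(t) \to 0$ | $O(\frac{1}{t})$ | $o(\frac{1}{t^{1+\gamma'}})$, $\forall \gamma' < \gamma$ | $O(\frac{1}{t^2})$

$a(t) = 1$, $\gamma = 0$ (HBF), Alvarez (2000).

$a(t) = \frac{1}{t^\gamma}$, $0 < \gamma < 1$, Cabot-Frankel (2012), R. May (2015).

$a(t) = \frac{\alpha}{t}$, $\alpha \geq 3$, SBC (2014), APR (2015).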


8A. Related dynamics. Hessian driven damping

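$\Phi : H \to \mathbb{R}$ convex, $C^2$, $\operatorname{argmin}\Phi \neq \emptyset$, $\alpha > 0$, $\beta > 0$.

(DIN-AVD) $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\nabla^2\Phi(x(t))\dot{x}(t) + \nabla\Phi(x(t)) = 0.$

Theorem 9 (APR)

• $\alpha > 0$: $\lim_{t\to+\infty} \Phi(x(t)) = \min\Phi$, $\lim_{t\to+\infty} \|\dot{x}(t)\| = 0$.

• $\alpha \geq 3$: i) $\Phi(x(t)) - \min_H \Phi \leq \frac{C}{t^2}$;

ii) $\int_0^{\infty} t^2\|\nabla\Phi(x(t))\|^2\,dt < +\infty$;

iii) $\lim_{t\to+\infty} \|\dot{x}(t)\| = \lim_{t\to+\infty} \|\ddot{x}(t)\| = \lim_{t\to+\infty} \|\nabla\Phi(x(t))\| = 0$.

• $\alpha > 3$: $x(t)$ converges weakly to a minimizer of $\Phi$.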


8B. (DIN-AVD) with two potentials

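$\min \{ \phi(x) + \Psi(x) : x \in H \}$, $\Psi$ smooth, $\phi$ nonsmooth.

(DIN-AVD)

$\dot{x}(t) + \beta\partial\phi(x(t)) - \left(\frac{1}{\beta} - \frac{\alpha}{t}\right)x(t) - y(t) \ni 0;$

$\dot{y}(t) + \nabla\Psi(x(t)) + \frac{1}{\beta}\left(\frac{1}{\beta} - \frac{\alpha}{t} + \frac{\alpha\beta}{t^2}\right)x(t) + \frac{1}{\beta}y(t) = 0.$

Equivalent equation, $\phi = \Phi$ smooth:

$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\nabla^2\Phi(x(t))\dot{x}(t) + \nabla\Phi(x(t)) + \nabla\Psi(x(t)) = 0.$

Theorem 10 (APR)

Let $(x(\cdot), y(\cdot))$ be an orbit of (DIN-AVD), $\alpha > 0$. Then

$\lim_{t\to+\infty} (\phi + \Psi)(x(t)) = \min(\phi + \Psi)$, $\lim_{t\to+\infty} \|\dot{x}(t)\| = 0$.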


8C. (DIN-AVD) algorithm with two potentials

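Time step $h > 0$, $t_k = kh$, $x_k = x(t_k)$, $y_k = y(t_k)$, $(x_0, y_0) \in H \times H$:

$0 \in \frac{x_{k+1} - x_k}{h} + \beta\partial\phi(x_{k+1}) - \left(\frac{1}{\beta} - \frac{\alpha}{kh}\right)x_k - y_k$

$0 = \frac{y_{k+1} - y_k}{h} + \nabla\Psi(x_{k+1}) + \frac{1}{\beta}\left(\frac{1}{\beta} - \frac{\alpha}{kh} + \frac{\alpha\beta}{k^2h^2}\right)x_{k+1} + \frac{1}{\beta}y_{k+1}$

$x_{k+1} = \operatorname{prox}_{\beta h\phi}\left( \left(1 + h\left(\frac{1}{\beta} - \frac{\alpha}{kh}\right)\right)x_k + hy_k \right)$

$y_{k+1} = \frac{\beta}{\beta+h}y_k - \frac{h}{\beta+h}\left(\frac{1}{\beta} - \frac{\alpha}{kh} + \frac{\alpha\beta}{k^2h^2}\right)x_{k+1} - \frac{h\beta}{\beta+h}\nabla\Psi(x_{k+1}).$

First-order dynamic in $(x_k, y_k)$: $(x_k, y_k) \to (x_{k+1}, y_k) \to (x_{k+1}, y_{k+1})$.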


9. Related dynamics. Adaptive restart (SBC)

Strategy: maintain high velocity along the orbit.

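(AVD)$_\alpha$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0$, $x(0) = x_0$, $\dot{x}(0) = 0$.

Restarting time: $T(\Phi, x_0) = \sup\left\{ t > 0 : \forall\tau \in ]0,t[,\ \frac{d}{d\tau}\|\dot{x}(\tau)\|^2 > 0 \right\}$.

Before time $T(\Phi, x_0) > 0$, $t \mapsto \Phi(x(t))$ decreases:

$\frac{d}{dt}\Phi(x(t)) = \langle \nabla\Phi(x(t)), \dot{x}(t) \rangle = -\frac{\alpha}{t}\|\dot{x}(t)\|^2 - \frac{1}{2}\frac{d}{dt}\|\dot{x}(t)\|^2 \leq 0.$

At time $T(\Phi, x_0)$, stop and restart, and so on.

Theorem 11 (SBC), linear convergence

Suppose $\Phi : H \to \mathbb{R}$ strongly convex, $\nabla\Phi$ Lipschitz continuous, $\alpha \geq 3$. Let $x^{sr}(\cdot)$ be an orbit of the speed restarting dynamic. Then

$\Phi(x^{sr}(t)) - \min_H \Phi \leq c_1 e^{-c_2 t}.$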


10. Related dynamics. Tikhonov regularization.

(AVD)$_{\alpha,\varepsilon}$ $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) + \varepsilon(t)x(t) = 0.$

$\Phi : H \to \mathbb{R}$ convex, continuously differentiable, $\operatorname{argmin}\Phi \neq \emptyset$.

$\varepsilon : [t_0,+\infty[ \to \mathbb{R}_+$ $C^1$ decreasing function, $\lim_{t\to+\infty} \varepsilon(t) = 0$.

Let $x(\cdot)$ be a classical global solution of (AVD)$_{\alpha,\varepsilon}$, $\alpha > 1$.

Case 1: $\int_{t_0}^{+\infty} \frac{\varepsilon(t)}{t}\,dt < +\infty$. Then,

$\lim_{t\to+\infty} \Phi(x(t)) = \inf_H \Phi$, $\lim_{t\to+\infty} \|\dot{x}(t)\| = 0$.

Case 2: $\int_{t_0}^{+\infty} \frac{\varepsilon(t)}{t}\,dt = +\infty$. Then,

$\liminf_{t\to+\infty} \|x(t) - p\| = 0$

where $p$ is the element of minimal norm of $\operatorname{argmin}\Phi$.

11. Annex 1. Stochastic gradient descent algorithm

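$\forall x \in \mathbb{R}^d$, $\Phi(x) := \int_\Omega \phi(x, \omega)\,\mu(d\omega).$

$(\omega_n)_{n\geq 1}$ is a sequence of independent identically distributed variables

$X_{n+1} = X_n - \varepsilon_n \nabla\phi(X_n, \omega_{n+1}).$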
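A minimal sketch of this recursion, on a hypothetical example chosen so that the output can be checked exactly: $\phi(x, \omega) = \frac{1}{2}(x - \omega)^2$ in dimension one, Gaussian samples with a fixed seed, and steps $\varepsilon_n = \frac{1}{n+1}$ (so $\sum\varepsilon_n = +\infty$ and $\sum\varepsilon_n^2 < +\infty$). With these choices the iterate is the running mean of the samples.

```python
import random


def sgd(grad_phi, samples, x0, step):
    """X_{n+1} = X_n - eps_n * grad phi(X_n, omega_{n+1}), for n = 0, 1, ..."""
    x = x0
    for n, omega in enumerate(samples):
        x = x - step(n) * grad_phi(x, omega)
    return x


def grad_phi(x, omega):
    # Gradient in x of the assumed integrand phi(x, omega) = (x - omega)^2 / 2.
    return x - omega


rng = random.Random(0)                                   # fixed seed for reproducibility
samples = [rng.gauss(1.0, 2.0) for _ in range(10000)]    # omega_1, omega_2, ...

x_final = sgd(grad_phi, samples, x0=5.0, step=lambda n: 1.0 / (n + 1))
sample_mean = sum(samples) / len(samples)
print("SGD iterate:", x_final, " sample mean:", sample_mean)
```

Here $\Phi(x) = \frac{1}{2}\mathbb{E}(x - \omega)^2$ is minimized at the mean of $\omega$, so the iterate approaches the minimizer at the usual Monte Carlo rate.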

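The stochastic approximation can be numerically improved:

$X_{n+1} = X_n - \varepsilon_{n+1}\, \frac{\sum_{i}^{n} \varepsilon_i \nabla\phi(X_i, \omega_{i+1})}{\sum_{i}^{n} \varepsilon_i}$

Limit ODE ($n \to +\infty$, $\sum \varepsilon_n = +\infty$, $\sum \varepsilon_n^p < +\infty$ for some $p > 1$)

$s\ddot{X}(s) + \dot{X}(s) + \nabla\Phi(X(s)) = 0.$

Time rescaling $t = 2\sqrt{s}$ gives $\ddot{X}(t) + \frac{1}{t}\dot{X}(t) + \nabla\Phi(X(t)) = 0.$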
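The time rescaling can be verified with the chain rule. In the short derivation below, $Y(t) := X(s)$ with $t = 2\sqrt{s}$ is notation introduced here to keep the two time variables apart.

```latex
% t = 2\sqrt{s} \iff s = t^2/4, \qquad \frac{dt}{ds} = \frac{1}{\sqrt{s}} = \frac{2}{t}.
\dot{X}(s) = \frac{2}{t}\,Y'(t),
\qquad
\ddot{X}(s) = \frac{2}{t}\,\frac{d}{dt}\!\left(\frac{2}{t}\,Y'(t)\right)
            = \frac{4}{t^2}\,Y''(t) - \frac{4}{t^3}\,Y'(t),
\\
s\,\ddot{X}(s) + \dot{X}(s)
  = \frac{t^2}{4}\left(\frac{4}{t^2}\,Y''(t) - \frac{4}{t^3}\,Y'(t)\right) + \frac{2}{t}\,Y'(t)
  = Y''(t) + \frac{1}{t}\,Y'(t).
```

The limit ODE is therefore the vanishing damping system with coefficient $\frac{\alpha}{t}$, $\alpha = 1$.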


12. Annex 2. Complexity aspects

In 1983, Nemirovsky & Yudin proved lower bounds on the complexity of first-order methods (number of subgradient calls needed to achieve a given accuracy) for convex optimization under various regularity assumptions for the objective functions. See also Nesterov (2004).

1 They constructed convex, piecewise linear functions in dimensions $n > k$, where no first-order method can have function values more accurate than $O(\frac{1}{\sqrt{k}})$ after $k$ subgradient evaluations.

2 They also constructed convex quadratic functions in dimensions $n \geq 2k$ where no first-order method can have function values more accurate than $O(\frac{1}{k^2})$ after $k$ gradient evaluations.

3 For strongly convex functions with Lipschitz continuous gradients, the known lower bounds on the complexity allow a dimension-independent linear rate of convergence $O(q^k)$ with $0 < q < 1$.


13. Perspective, open questions

Convergence of the orbits for $\alpha = 3$? of Nesterov algorithm?

Convergence of the values: exhibit concrete examples showing that $\alpha = 3$ is critical. Rate of convergence for $a(t) = \frac{\alpha}{t}$, $1 \leq \alpha < 3$?

Find a Lyapunov function in the case $\frac{1}{t^\gamma}$, giving the rate of convergence $\frac{1}{t^{1+\gamma}}$ for $W(t)$.

Extend to the algorithmic part the convergence properties of the continuous dynamic (strong convergence...).

Adaptive restart for (DIN-AVD), without strong convexity.

Compare / combine with other rapid methods: multigrid, Newton-based methods, other types of friction (dry).

Show the $O(\frac{1}{t^2})$ convergence of the values for (DIN-AVD), combining Hessian driven and asymptotic vanishing damping.

Extension to a non-convex setting: for analytic potentials, the convergence theory for HBF (HJ), and DIN (AABR) still works.

Nonsmooth potentials, shock theory, PDE’s hyperbolic equations.


References

B. Abbas, H. Attouch, and B. F. Svaiter, Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces, JOTA, 161 (2014), No. 2, pp. 331–360.

S. Adly, H. Attouch, A. Cabot, Finite time stabilization of nonlinear oscillators subject to dry friction, Nonsmooth Mechanics and Analysis (ed. P. Alart, O. Maisonneuve, R.T. Rockafellar), Adv. in Math. and Mech., Kluwer, 12 (2006), pp. 289–304.

F. Alvarez, On the minimizing property of a second-order dissipative system in Hilbert spaces, SIAM J. Control Optim., 38 (2000), No. 4, pp. 1102–1119.

F. Alvarez, H. Attouch, An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping, Set-Valued Analysis, 9 (2001), pp. 3–11.



F. Alvarez, H. Attouch, Convergence and asymptotic stabilization for some damped hyperbolic equations with non-isolated equilibria, ESAIM Control Optim. Calc. of Var., 6 (2001), pp. 539–552.

F. Alvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl., 81 (2002), No. 8, pp. 747–779.

H. Attouch, G. Buttazzo, G. Michaille, Variational analysis in Sobolev and BV spaces. Applications to PDE’s and optimization, MPS/SIAM Series on Optimization, 6, SIAM, Philadelphia, PA, Second edition, 2014, 793 pages.



H. Attouch, A. Cabot, P. Redont, The dynamics of elastic shocks via epigraphical regularization of a differential inclusion, Adv. Math. Sci. Appl., 12 (2002), No. 1, pp. 273–306.

H. Attouch and M.-O. Czarnecki, Asymptotic control and stabilization of nonlinear oscillators with non-isolated equilibria, J. Differential Equations, 179 (2002), pp. 278–310.

H. Attouch, X. Goudou and P. Redont, The heavy ball with friction method. The continuous dynamical system, global exploration of the local minima of a real-valued function, Commun. Contemp. Math., 2 (2000), No. 1, pp. 1–34.



H. Attouch, P.E. Mainge, P. Redont, A second-order differential system with Hessian-driven damping; Application to non-elastic shock laws, Differential Equations and Applications, 4 (2012), No. 1, pp. 27–65.

H. Attouch, J. Peypouquet, P. Redont, A dynamical approach to an inertial forward-backward algorithm for convex minimization, SIAM J. Optim., 24 (2014), No. 1, pp. 232–256.

H. Attouch and A. Soubeyran, Inertia and reactivity in decision making as cognitive variational inequalities, Journal of Convex Analysis, 13 (2006), pp. 207–224.



J.-B. Baillon, Un exemple concernant le comportement asymptotique de la solution du probleme $\frac{du}{dt} + \partial\phi(u) \ni 0$, Journal of Functional Analysis, 28 (1978), pp. 369–376.

H. Bauschke and P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert spaces, CMS Books in Mathematics, Springer, (2011).

A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (1) (2009), pp. 183–202.

H. Brezis, Operateurs maximaux monotones dans les espaces de Hilbert et equations d’evolution, Lecture Notes 5, North Holland, (1972).



H. Brezis, Asymptotic behavior of some evolution systems: Nonlinear evolution equations, Academic Press, New York, (1978), pp. 141–154.

R.E. Bruck, Asymptotic convergence of nonlinear contraction semigroups in Hilbert spaces, J. Funct. Anal., 18 (1975), pp. 15–26.

A. Cabot, Inertial gradient-like dynamical system controlled by a stabilizing term, J. Optim. Theory Appl., 120 (2004), pp. 275–303.

A. Cabot, H. Engler, S. Gadat, On the long time behavior of second-order differential equations with asymptotically small dissipation, Trans. AMS, 361 (2009), pp. 5983–6017.



A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electronic Journal of Differential Equations, 17 (2009), pp. 33–38.

A. Cabot, P. Frankel, Asymptotics for some semilinear hyperbolic equations with non-autonomous damping, J. Differential Equations, 252 (2012), pp. 294–322.

A. Chambolle, Ch. Dossal, On the convergence of the iterates of Fista, HAL Id: hal-01060130, https://hal.inria.fr/hal-01060130v3, submitted on 20 Oct 2014.

O. Guler, New proximal point algorithms for convex minimization, SIAM J. Optim., 2 (4), (1992), pp. 649–664.



A. Haraux, M. Jendoubi, Convergence of solutions of second-order gradient-like systems with analytic nonlinearities, Journal of Differential Equations, 144 (2), (1998).

A. Moudafi, M. Oliny, Convergence of a splitting inertial proximal method for monotone operators, J. Comput. Appl. Math., 155 (2), (2003), pp. 447–454.

Y. Nesterov, A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Soviet Mathematics Doklady, 27 (1983), pp. 372–376.

Y. Nesterov, Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.



Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (1) (2005), pp. 127–152.

Y. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Papers, 2007.

Z. Opial, Weak convergence of the sequence of successive approximations for nonexpansive mappings, Bull. Amer. Math. Soc., 73 (1967), pp. 591–597.

B. O’Donoghue and E. J. Candes, Adaptive restart for accelerated gradient schemes, Found. Comput. Math., 2013.



J. Peypouquet and S. Sorin, Evolution equations for maximal monotone operators: asymptotic analysis in continuous and discrete time, Journal of Convex Analysis, 17, (2010), pp. 1113–1163.

D. A. Lorenz and Thomas Pock, An inertial forward-backward algorithm for monotone inclusions, J. Math. Imaging Vision, 2014.

A.S. Nemirovsky and D.B. Yudin, Problem complexity and method efficiency in optimization, Wiley, New York, 1983.

M. Schmidt, N. Le Roux, F. Bach, Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization, NIPS’11, Grenada, Spain, 2011. <inria-00618152v3>, HAL.

W. Su, S. Boyd, E. J. Candes, A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method, Advances in Neural Information Processing Systems 27 (NIPS 2014).

