

L. Vandenberghe ECE236C (Spring 2020)

7. Accelerated proximal gradient methods

• Nesterov’s method

• analysis with fixed step size

• line search

7.1


Proximal gradient method

Results from lecture 4

• each proximal gradient iteration is a descent step (page 4.15 and 4.17):

f(x_{k+1}) < f(x_k),    ‖x_{k+1} − x⋆‖₂² ≤ c ‖x_k − x⋆‖₂²

with c = 1 − m/L

• suboptimality after k iterations is O(1/k) (page 4.16):

f(x_k) − f⋆ ≤ (L / 2k) ‖x_0 − x⋆‖₂²

Accelerated proximal gradient methods

• to improve convergence, we add a momentum term

• we relax the descent properties

• originated in work by Nesterov in the 1980s

Accelerated proximal gradient methods 7.2


Assumptions

we consider the same problem and make the same assumptions as in lecture 4:

minimize f (x) = g(x) + h(x)

• h is closed and convex (so that prox_{th} is well defined)

• g is differentiable with dom g = Rn

• there exist constants m ≥ 0 and L > 0 such that the functions

g(x) − (m/2) xᵀx,    (L/2) xᵀx − g(x)

are convex

• the optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

Accelerated proximal gradient methods 7.3


Nesterov’s method

choose x0 = v0 and θ0 ∈ (0,1], and repeat the following steps for k = 0,1, . . .

• if k ≥ 1, define θ_k as the positive root of the quadratic equation

θ_k² / t_k = (1 − θ_k) γ_k + m θ_k    where γ_k = θ_{k−1}² / t_{k−1}

• update x_k and v_k as follows:

y = x_k + (θ_k γ_k / (γ_k + m θ_k)) (v_k − x_k)    (y = x_0 if k = 0)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

v_{k+1} = x_k + (1/θ_k)(x_{k+1} − x_k)

step size t_k is fixed (t_k = 1/L) or obtained from line search

Accelerated proximal gradient methods 7.4
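To make the updates concrete, here is a minimal NumPy sketch of the iteration above, assuming user-supplied callables grad_g and prox_h (hypothetical names) and a fixed step size t; the quadratic equation for θ_k is solved with the usual root formula.

```python
import numpy as np

def nesterov_prox_grad(grad_g, prox_h, x0, t, m=0.0, theta0=1.0, num_iters=100):
    """Nesterov's accelerated proximal gradient method (fixed step size t).

    grad_g : callable, gradient of the smooth term g
    prox_h : callable, prox_h(u, t) evaluates prox_{t h}(u)
    x0     : starting point (x0 = v0)
    m      : strong convexity constant of g (use 0 if unknown)
    theta0 : initial theta in (0, 1]
    """
    x, v, theta = x0.copy(), x0.copy(), theta0
    gamma = theta0 ** 2 / t
    for k in range(num_iters):
        if k == 0:
            y = x                                  # first step: y = x0
        else:
            # theta_k: positive root of theta^2 / t = (1 - theta)*gamma + m*theta
            b = t * (gamma - m)
            theta = (-b + np.sqrt(b ** 2 + 4 * t * gamma)) / 2
            y = x + theta * gamma / (gamma + m * theta) * (v - x)
        x_new = prox_h(y - t * grad_g(y), t)       # proximal gradient step at y
        v = x + (x_new - x) / theta                # v_{k+1}
        gamma = theta ** 2 / t                     # gamma_{k+1}
        x = x_new
    return x
```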


Momentum interpretation

• the first iteration (k = 0) is a proximal gradient step at y = x0

• next iterations are proximal gradient steps at extrapolated points y:

y = x_k + (θ_k γ_k / (γ_k + m θ_k)) (v_k − x_k) = x_k + β_k (x_k − x_{k−1})

where

β_k = (θ_k γ_k / (γ_k + m θ_k)) (1/θ_{k−1} − 1) = t_k θ_{k−1} (1 − θ_{k−1}) / (t_{k−1} θ_k + t_k θ_{k−1}²)

[Figure: the points x_{k−1}, x_k, the extrapolated point y = x_k + β_k(x_k − x_{k−1}), and the update x_{k+1} = prox_{t_k h}(y − t_k ∇g(y)).]

Accelerated proximal gradient methods 7.5


Parameters θ_k and β_k (for fixed step size t_k = 1/L = 1)

[Figure: four plots of θ_k and β_k versus k. Top row: m = 0.1, with θ_0 = √(m/L), θ_0 = √(4m/L), and θ_0 = 1; θ_k converges to √(m/L) and β_k converges to (√L − √m)/(√L + √m). Bottom row: m = 0, with θ_0 = 0.5 and θ_0 = 1.0.]

Accelerated proximal gradient methods 7.6
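The recursions behind these plots are easy to reproduce; a small sketch, assuming a fixed step size t = 1/L, that generates θ_k from the quadratic on page 7.4 and β_k from the expression on page 7.5:

```python
import numpy as np

def theta_beta_sequences(m, t, theta0, num_iters=25):
    """theta_k and beta_k for fixed step size t (pages 7.4-7.6)."""
    thetas, betas = [theta0], []
    theta = theta0
    for k in range(1, num_iters):
        gamma = theta ** 2 / t                    # gamma_k = theta_{k-1}^2 / t
        b = t * (gamma - m)
        theta_new = (-b + np.sqrt(b ** 2 + 4 * t * gamma)) / 2   # positive root
        beta = theta_new * gamma / (gamma + m * theta_new) * (1 / theta - 1)
        thetas.append(theta_new)
        betas.append(beta)
        theta = theta_new
    return thetas, betas

# for m = 0.1, L = 1: theta_k approaches sqrt(m/L) and
# beta_k approaches (sqrt(L) - sqrt(m)) / (sqrt(L) + sqrt(m))
thetas, betas = theta_beta_sequences(m=0.1, t=1.0, theta0=1.0)
```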


Parameter θk

• for k ≥ 1, θ_k is the positive root of the quadratic equation

θ_k² / t_k = (1 − θ_k) θ_{k−1}² / t_{k−1} + m θ_k

• if m > 0 and θ_0 = √(m t_0), then θ_k = √(m t_k) for all k

• θ_k < 1 if m t_k < 1

• for constant t_k, the sequence θ_k is completely determined by θ_0

Accelerated proximal gradient methods 7.7


FISTA

if we take m = 0 on page 7.4, the expression for y simplifies:

y = x_k + θ_k (v_k − x_k)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

v_{k+1} = x_k + (1/θ_k)(x_{k+1} − x_k)

eliminating the variables v_k gives the equivalent iteration

y = x_k + θ_k (1/θ_{k−1} − 1)(x_k − x_{k−1})    (y = x_0 if k = 0)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

this is known as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)

Accelerated proximal gradient methods 7.8
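As a usage sketch (not from the slides), here is FISTA with θ_0 = 1 and a fixed step size applied to a lasso problem, minimize (1/2)‖Ax − b‖₂² + λ‖x‖₁; the prox of t·λ‖·‖₁ is elementwise soft-thresholding, and all names are illustrative.

```python
import numpy as np

def soft_threshold(u, tau):
    """prox of tau*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def fista_lasso(A, b, lam, num_iters=200):
    """FISTA (page 7.8) for minimize (1/2)||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad g
    t = 1.0 / L
    x = x_prev = np.zeros(A.shape[1])
    theta = theta_prev = 1.0
    for k in range(num_iters):
        y = x if k == 0 else x + theta * (1 / theta_prev - 1) * (x - x_prev)
        grad = A.T @ (A @ y - b)
        x_prev, x = x, soft_threshold(y - t * grad, t * lam)
        # theta_{k+1}: positive root of theta^2 = (1 - theta) * theta_k^2
        theta_prev, theta = theta, theta * (np.sqrt(theta ** 2 + 4) - theta) / 2
    return x
```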


Example

minimize   log ∑_{i=1}^p exp(a_iᵀ x + b_i)

• two randomly generated problems with p = 2000, n = 1000

• same fixed step size used for gradient method and FISTA

• figures show (f(x_k) − f⋆)/f⋆

[Figure: two plots of (f(x_k) − f⋆)/f⋆ versus k (0 to 200) on a logarithmic scale from 10⁰ down to 10⁻⁶, comparing the gradient method and FISTA on the two problem instances.]

Accelerated proximal gradient methods 7.9
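The slides do not specify how the random instances were generated, so the following is one plausible setup (an assumption): Gaussian data, h = 0 (the prox step is then just a gradient step), and the conservative Lipschitz bound L = ‖A‖₂², which is valid because the Hessian of the log-sum-exp function is bounded above by the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 2000, 1000
A = rng.standard_normal((p, n))          # assumed data distribution
b = rng.standard_normal(p)

def f(x):
    z = A @ x + b
    return z.max() + np.log(np.exp(z - z.max()).sum())     # stable log-sum-exp

def grad(x):
    z = A @ x + b
    w = np.exp(z - z.max())
    return A.T @ (w / w.sum())                              # A^T softmax(Ax + b)

L = np.linalg.norm(A, 2) ** 2            # conservative Lipschitz bound
t = 1.0 / L

# plain gradient method
x = np.zeros(n)
for k in range(200):
    x -= t * grad(x)

# FISTA with h = 0 (prox = identity) and theta0 = 1
xf = xf_prev = np.zeros(n)
theta = theta_prev = 1.0
for k in range(200):
    y = xf if k == 0 else xf + theta * (1 / theta_prev - 1) * (xf - xf_prev)
    xf_prev, xf = xf, y - t * grad(y)
    theta_prev, theta = theta, theta * (np.sqrt(theta ** 2 + 4) - theta) / 2

print(f(x), f(xf))                        # compare the two objective values
```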


A simplification for strongly convex problems

• if m > 0 and we choose θ_0 = √(m t_0), then

γ_k = m,    θ_k = √(m t_k)    for all k ≥ 1

• the algorithm on page 7.4 and page 7.5 simplifies:

y = x_k + (√t_k / √t_{k−1}) · (1 − √(m t_{k−1})) / (1 + √(m t_k)) · (x_k − x_{k−1})    (y = x_0 if k = 0)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

• with constant step size t_k = 1/L, the expression for y reduces to

y = x_k + (1 − √(m/L)) / (1 + √(m/L)) · (x_k − x_{k−1})    (y = x_0 if k = 0)

Accelerated proximal gradient methods 7.10
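With this constant coefficient the method only needs the previous iterate; a minimal sketch, assuming m and L are known and callables grad_g and prox_h as before (names illustrative):

```python
import numpy as np

def strongly_convex_apg(grad_g, prox_h, x0, m, L, num_iters=100):
    """Accelerated proximal gradient for m-strongly convex g,
       fixed step 1/L and theta0 = sqrt(m/L) (page 7.10)."""
    t = 1.0 / L
    beta = (1 - np.sqrt(m / L)) / (1 + np.sqrt(m / L))   # constant momentum
    x = x_prev = x0.copy()
    for k in range(num_iters):
        y = x if k == 0 else x + beta * (x - x_prev)
        x_prev, x = x, prox_h(y - t * grad_g(y), t)
    return x
```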


Outline

• Nesterov’s method

• analysis with fixed step size

• line search


Overview

• we show that if t_i = 1/L, the following inequality holds at iteration i:

f(x_{i+1}) − f⋆ + (γ_{i+1}/2) ‖v_{i+1} − x⋆‖₂²
    ≤ (1 − θ_i)(f(x_i) − f⋆) + ((γ_{i+1} − m θ_i)/2) ‖v_i − x⋆‖₂²
    = (1 − θ_i) ( f(x_i) − f⋆ + (γ_i/2) ‖v_i − x⋆‖₂² )    if i ≥ 1

• combining the inequalities from i = 0 to i = k − 1 shows that

f(x_k) − f⋆ ≤ λ_k ( (1 − θ_0)(f(x_0) − f⋆) + ((γ_1 − m θ_0)/2) ‖x_0 − x⋆‖₂² )

            ≤ λ_k ( (1 − θ_0)(f(x_0) − f⋆) + (θ_0²/(2 t_0)) ‖x_0 − x⋆‖₂² )

where λ_1 = 1 and λ_k = ∏_{i=1}^{k−1} (1 − θ_i) for k > 1

(here we assume x_0 ∈ dom f)

Accelerated proximal gradient methods 7.11


Notation for one iteration

quantities in iteration i of the algorithm on page 7.4

• define t = t_i, θ = θ_i, γ⁺ = γ_{i+1} = θ²/t

• if i ≥ 1, define γ = γ_i and note that γ⁺ − mθ = (1 − θ)γ

• define x = x_i, x⁺ = x_{i+1}, v = v_i, and v⁺ = v_{i+1}:

y = (1/(γ + mθ)) (γ⁺ x + θγ v)    (y = x = v if i = 0)

x⁺ = y − t G_t(y)

v⁺ = x + (1/θ)(x⁺ − x)

• v+, v, and y are related as

γ⁺ v⁺ = γ⁺ v + mθ(y − v) − θ G_t(y)    (1)

Accelerated proximal gradient methods 7.12


Proof (last identity):

• combine the v and x updates and use γ⁺ = θ²/t:

v⁺ = x + (1/θ)(y − t G_t(y) − x)

   = (1/θ)(y − (1 − θ)x) − (θ/γ⁺) G_t(y)

• for i = 0, the equation (1) follows because y = x = v

• for i ≥ 1, multiply with γ⁺ = γ + mθ − θγ:

γ⁺ v⁺ = (γ⁺/θ)(y − (1 − θ)x) − θ G_t(y)

      = ((1 − θ)/θ)((γ + mθ)y − γ⁺ x) + θm y − θ G_t(y)

      = (1 − θ)γ v + θm y − θ G_t(y)

      = (γ⁺ − mθ) v + θm y − θ G_t(y)

Accelerated proximal gradient methods 7.13


Bound on objective function

recall the results on the proximal gradient update (page 4.13):

• if 0 < t ≤ 1/L then g(x⁺) = g(y − t G_t(y)) is bounded by

g(x⁺) ≤ g(y) − t ∇g(y)ᵀ G_t(y) + (t/2) ‖G_t(y)‖₂²    (2)

• if the inequality (2) holds, then mt ≤ 1 and, for all z,

f(z) ≥ f(x⁺) + (t/2) ‖G_t(y)‖₂² + G_t(y)ᵀ(z − y) + (m/2) ‖z − y‖₂²

• add (1 − θ) times the inequality for z = x and θ times the inequality for z = x⋆:

f(x⁺) − f⋆ ≤ (1 − θ)(f(x) − f⋆) − G_t(y)ᵀ((1 − θ)x + θx⋆ − y) − (t/2) ‖G_t(y)‖₂² − (mθ/2) ‖x⋆ − y‖₂²

Accelerated proximal gradient methods 7.14


Bound on distance to optimum

• it follows from (1) that

(γ⁺/2) ‖v⁺ − x⋆‖₂² = ((γ⁺ − mθ)/2) ‖v − x⋆‖₂² + θ G_t(y)ᵀ( x⋆ − v − (mθ/γ⁺)(y − v) )
                      − (mθ(γ⁺ − mθ)/(2γ⁺)) ‖y − v‖₂² + (t/2) ‖G_t(y)‖₂² + (mθ/2) ‖x⋆ − y‖₂²

                   ≤ ((γ⁺ − mθ)/2) ‖v − x⋆‖₂² + θ G_t(y)ᵀ( x⋆ − v − (mθ/γ⁺)(y − v) )
                      + (t/2) ‖G_t(y)‖₂² + (mθ/2) ‖x⋆ − y‖₂²

• γ⁺ and y are chosen so that θ(γ⁺ − mθ)(y − v) = γ⁺(1 − θ)(x − y); hence

(γ⁺/2) ‖v⁺ − x⋆‖₂² ≤ ((γ⁺ − mθ)/2) ‖v − x⋆‖₂² + G_t(y)ᵀ(θ x⋆ + (1 − θ)x − y)
                      + (t/2) ‖G_t(y)‖₂² + (mθ/2) ‖x⋆ − y‖₂²

Accelerated proximal gradient methods 7.15


Progress in one iteration

• combining the bounds on page 7.14 and 7.15 gives

f(x⁺) − f⋆ + (γ⁺/2) ‖v⁺ − x⋆‖₂² ≤ (1 − θ)(f(x) − f⋆) + ((γ⁺ − mθ)/2) ‖v − x⋆‖₂²

this is the first inequality on page 7.11

• if i ≥ 1, we use γ⁺ − mθ = (1 − θ)γ to write this as

f(x⁺) − f⋆ + (γ⁺/2) ‖v⁺ − x⋆‖₂² ≤ (1 − θ) ( f(x) − f⋆ + (γ/2) ‖v − x⋆‖₂² )

Accelerated proximal gradient methods 7.16


Analysis for fixed step size

the product λ_k = ∏_{i=1}^{k−1} (1 − θ_i) determines the rate of convergence (page 7.11)

• the sequence λ_k satisfies the following bound (proof on next page):

λ_k ≤ 4 / ( 2 + √γ_1 ∑_{i=1}^{k−1} √t_i )² = 4 t_0 / ( 2√t_0 + θ_0 ∑_{i=1}^{k−1} √t_i )²    (3)

• for constant step size and θ_0 = 1, we obtain

λ_k ≤ 4 / (k + 1)²

• with t_0 = 1/L, the inequality on page 7.11 shows a 1/k² convergence rate:

f(x_k) − f⋆ ≤ (2L / (k + 1)²) ‖x_0 − x⋆‖₂²

Accelerated proximal gradient methods 7.17
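As a quick numerical sanity check (not part of the analysis), the bound λ_k ≤ 4/(k + 1)² for θ_0 = 1, m = 0, and constant step size can be verified directly from the recursion:

```python
import numpy as np

theta, lam = 1.0, 1.0            # theta_0 = 1, lambda_1 = 1 (m = 0, constant t)
for k in range(1, 50):
    # theta_k: positive root of theta^2 = (1 - theta) * theta_{k-1}^2
    theta = theta * (np.sqrt(theta ** 2 + 4) - theta) / 2
    lam *= 1 - theta             # lambda_{k+1} = (1 - theta_k) * lambda_k
    assert lam <= 4 / (k + 2) ** 2 + 1e-12      # bound on lambda_{k+1}
```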


Proof.

• recall that for k ≥ 1,

γ_{k+1} = (1 − θ_k)γ_k + θ_k m,    γ_k = θ_{k−1}²/t_{k−1}

• we first note that λ_k ≤ γ_k/γ_1; this follows from

λ_{i+1} = (1 − θ_i)λ_i = ((γ_{i+1} − θ_i m)/γ_i) λ_i ≤ (γ_{i+1}/γ_i) λ_i

• the inequality (3) follows by combining, from i = 1 to i = k − 1, the inequalities

1/√λ_{i+1} − 1/√λ_i ≥ (λ_i − λ_{i+1}) / (2 λ_i √λ_{i+1}) = θ_i / (2 √λ_{i+1}) ≥ θ_i / (2 √(γ_{i+1}/γ_1)) = (1/2) √(γ_1 t_i)

Accelerated proximal gradient methods 7.18


Strongly convex functions

the following bound on λk is useful for strongly convex functions (m > 0)

• if θ_0 ≥ √(m t_0), then θ_k ≥ √(m t_k) for all k and

λ_k ≤ ∏_{i=1}^{k−1} (1 − √(m t_i))

(proof on next page)

• for constant step size t_k = 1/L, we obtain

λ_k ≤ (1 − √(m/L))^{k−1}

• combined with the inequality on page 7.11, this shows linear convergence:

f(x_k) − f⋆ ≤ (1 − √(m/L))^{k−1} ( (1 − θ_0)(f(x_0) − f⋆) + (θ_0²/(2 t_0)) ‖x_0 − x⋆‖₂² )

Accelerated proximal gradient methods 7.19


Proof.

• if θ_{k−1} ≥ √(m t_{k−1}), then θ_k ≥ √(m t_k):

θ_k² / t_k = (1 − θ_k) θ_{k−1}² / t_{k−1} + m θ_k
           ≥ (1 − θ_k) m + m θ_k
           = m

• if θ_0 ≥ √(m t_0), then θ_k ≥ √(m t_k) for all k and

λ_k = ∏_{i=1}^{k−1} (1 − θ_i) ≤ ∏_{i=1}^{k−1} (1 − √(m t_i))

Accelerated proximal gradient methods 7.20


Outline

• Nesterov’s method

• analysis with fixed step size

• line search


Line search

• the analysis for fixed step size starts with the inequality (2):

g(y − t G_t(y)) ≤ g(y) − t ∇g(y)ᵀ G_t(y) + (t/2) ‖G_t(y)‖₂²

this inequality is known to hold for 0 < t ≤ 1/L

• if L is not known, we can satisfy (2) by a backtracking line search:

start at some t := t̂ > 0 and backtrack (t := βt) until (2) holds

• the step size selected by the line search satisfies t ≥ t_min = min{t̂, β/L}

• for each tentative t_k we need to recompute θ_k, y, and x_{k+1} in the algorithm on page 7.4

• requires evaluations of ∇g, prox_{th}, and g (twice) per line search iteration

Accelerated proximal gradient methods 7.21
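A minimal sketch of one backtracking search satisfying (2), assuming callables g, grad_g, prox_h, a candidate step t̂, and a shrink factor β ∈ (0, 1) (all names illustrative); for simplicity it holds y fixed, whereas the full algorithm also recomputes θ_k and y for each tentative t_k.

```python
import numpy as np

def backtracking_step(g, grad_g, prox_h, y, t_hat, beta=0.5):
    """Return (t, x_plus) with x_plus = prox_{t h}(y - t*grad g(y)) satisfying (2):
       g(x_plus) <= g(y) - t*grad_g(y)^T G_t(y) + (t/2)*||G_t(y)||^2."""
    gy, dgy = g(y), grad_g(y)
    t = t_hat
    while True:
        x_plus = prox_h(y - t * dgy, t)
        G = (y - x_plus) / t                     # gradient mapping G_t(y)
        if g(x_plus) <= gy - t * dgy @ G + (t / 2) * (G @ G):
            return t, x_plus
        t *= beta                                # backtrack until (2) holds
```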


Analysis with line search

• from page 7.17, if θ_0 = 1:

λ_k ≤ 4 t_0 / ( 2√t_0 + ∑_{i=1}^{k−1} √t_i )² ≤ (4 t̂/t_min) / (k + 1)²

• from page 7.19, if θ_0 ≥ √(m t_0):

λ_k ≤ ∏_{i=1}^{k−1} (1 − √(m t_i)) ≤ (1 − √(m t_min))^{k−1}

• therefore the results for fixed step size hold with 1/tmin substituted for L

Accelerated proximal gradient methods 7.22


References

Most of the material in the lecture is from §2.2 in Nesterov’s Lectures on Convex Optimization.

FISTA

• A. Beck, First-Order Methods in Optimization (2017), §10.7.
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009).

Accelerated proximal gradient methods

• S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in Machine Learning (2015), §3.7.

• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008).

Line search strategies

• FISTA papers by Beck and Teboulle.
• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with line search (2011).
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992).
• Yu. Nesterov, Gradient methods for minimizing composite functions (2013).

Accelerated proximal gradient methods 7.23


Implementation

• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to sparse signal recovery, Mathematical Programming Computation (2011).

• B. O'Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of Computational Mathematics (2015).

• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA implementation, arXiv:1411.3406 (2016).

Accelerated proximal gradient methods 7.24