

L. Vandenberghe ECE236C (Spring 2020)

7. Accelerated proximal gradient methods

• Nesterov’s method

• analysis with fixed step size

• line search

7.1


Proximal gradient method

Results from lecture 4

• each proximal gradient iteration is a descent step (page 4.15 and 4.17):

f(x_{k+1}) < f(x_k),    ‖x_{k+1} − x⋆‖₂² ≤ c ‖x_k − x⋆‖₂²

with c = 1 − m/L

• suboptimality after k iterations is O(1/k) (page 4.16):

f(x_k) − f⋆ ≤ (L / 2k) ‖x_0 − x⋆‖₂²

Accelerated proximal gradient methods

• to improve convergence, we add a momentum term

• we relax the descent properties

• originated in work by Nesterov in the 1980s

Accelerated proximal gradient methods 7.2


Assumptions

we consider the same problem and make the same assumptions as in lecture 4:

minimize f (x) = g(x) + h(x)

• h is closed and convex (so that prox_{th} is well defined)

• g is differentiable with dom g = Rn

• there exist constants m ≥ 0 and L > 0 such that the functions

g(x) − (m/2) xᵀx,    (L/2) xᵀx − g(x)

are convex

• the optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

Accelerated proximal gradient methods 7.3


Nesterov’s method

choose x0 = v0 and θ0 ∈ (0,1], and repeat the following steps for k = 0,1, . . .

• if k ≥ 1, define θ_k as the positive root of the quadratic equation

θ_k² / t_k = (1 − θ_k) γ_k + m θ_k    where γ_k = θ_{k−1}² / t_{k−1}

• update x_k and v_k as follows:

y = x_k + (θ_k γ_k / (γ_k + m θ_k)) (v_k − x_k)    (y = x_0 if k = 0)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

v_{k+1} = x_k + (1/θ_k)(x_{k+1} − x_k)

step size t_k is fixed (t_k = 1/L) or obtained from line search

Accelerated proximal gradient methods 7.4
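To make the updates concrete, here is a minimal NumPy sketch of the iteration above, assuming user-supplied callables grad_g and prox_h (hypothetical names) and a fixed step size t; the quadratic equation for θ_k is solved with the usual root formula.

```python
import numpy as np

def nesterov_prox_grad(grad_g, prox_h, x0, t, m=0.0, theta0=1.0, num_iters=100):
    """Nesterov's accelerated proximal gradient method (fixed step size t).

    grad_g : callable, gradient of the smooth term g
    prox_h : callable, prox_h(u, t) evaluates prox_{t h}(u)
    x0     : starting point (x0 = v0)
    m      : strong convexity constant of g (use 0 if unknown)
    theta0 : initial theta in (0, 1]
    """
    x, v, theta = x0.copy(), x0.copy(), theta0
    gamma = theta0 ** 2 / t
    for k in range(num_iters):
        if k == 0:
            y = x                                  # first step: y = x0
        else:
            # theta_k: positive root of theta^2 / t = (1 - theta)*gamma + m*theta
            b = t * (gamma - m)
            theta = (-b + np.sqrt(b ** 2 + 4 * t * gamma)) / 2
            y = x + theta * gamma / (gamma + m * theta) * (v - x)
        x_new = prox_h(y - t * grad_g(y), t)       # proximal gradient step at y
        v = x + (x_new - x) / theta                # v_{k+1}
        gamma = theta ** 2 / t                     # gamma_{k+1}
        x = x_new
    return x
```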


Momentum interpretation

• the first iteration (k = 0) is a proximal gradient step at y = x0

• next iterations are proximal gradient steps at extrapolated points y:

y = x_k + (θ_k γ_k / (γ_k + m θ_k)) (v_k − x_k) = x_k + β_k (x_k − x_{k−1})

where

β_k = (θ_k γ_k / (γ_k + m θ_k)) (1/θ_{k−1} − 1) = t_k θ_{k−1} (1 − θ_{k−1}) / (t_{k−1} θ_k + t_k θ_{k−1}²)

[Figure: the points x_{k−1}, x_k, the extrapolated point y = x_k + β_k(x_k − x_{k−1}), and the update x_{k+1} = prox_{t_k h}(y − t_k ∇g(y)).]

Accelerated proximal gradient methods 7.5


Parameters θ_k and β_k (for fixed step size t_k = 1/L = 1)

[Figure: four plots of θ_k and β_k versus k. Top row: m = 0.1, with θ_0 = √(m/L), θ_0 = √(4m/L), and θ_0 = 1; θ_k converges to √(m/L) and β_k converges to (√L − √m)/(√L + √m). Bottom row: m = 0, with θ_0 = 0.5 and θ_0 = 1.0.]

Accelerated proximal gradient methods 7.6
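The recursions behind these plots are easy to reproduce; a small sketch, assuming a fixed step size t = 1/L, that generates θ_k from the quadratic on page 7.4 and β_k from the expression on page 7.5:

```python
import numpy as np

def theta_beta_sequences(m, t, theta0, num_iters=25):
    """theta_k and beta_k for fixed step size t (pages 7.4-7.6)."""
    thetas, betas = [theta0], []
    theta = theta0
    for k in range(1, num_iters):
        gamma = theta ** 2 / t                    # gamma_k = theta_{k-1}^2 / t
        b = t * (gamma - m)
        theta_new = (-b + np.sqrt(b ** 2 + 4 * t * gamma)) / 2   # positive root
        beta = theta_new * gamma / (gamma + m * theta_new) * (1 / theta - 1)
        thetas.append(theta_new)
        betas.append(beta)
        theta = theta_new
    return thetas, betas

# for m = 0.1, L = 1: theta_k approaches sqrt(m/L) and
# beta_k approaches (sqrt(L) - sqrt(m)) / (sqrt(L) + sqrt(m))
thetas, betas = theta_beta_sequences(m=0.1, t=1.0, theta0=1.0)
```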


Parameter θk

• for k ≥ 1, θ_k is the positive root of the quadratic equation

θ_k² / t_k = (1 − θ_k) θ_{k−1}² / t_{k−1} + m θ_k

• if m > 0 and θ_0 = √(m t_0), then θ_k = √(m t_k) for all k

• θ_k < 1 if m t_k < 1

• for constant t_k, the sequence θ_k is completely determined by θ_0

Accelerated proximal gradient methods 7.7


FISTA

if we take m = 0 on page 7.4, the expression for y simplifies:

y = x_k + θ_k (v_k − x_k)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

v_{k+1} = x_k + (1/θ_k)(x_{k+1} − x_k)

eliminating the variables v_k gives the equivalent iteration

y = x_k + θ_k (1/θ_{k−1} − 1)(x_k − x_{k−1})    (y = x_0 if k = 0)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

this is known as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm)

Accelerated proximal gradient methods 7.8
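As a usage sketch (not from the slides), here is FISTA with θ_0 = 1 and a fixed step size applied to a lasso problem, minimize (1/2)‖Ax − b‖₂² + λ‖x‖₁; the prox of t·λ‖·‖₁ is elementwise soft-thresholding, and all names are illustrative.

```python
import numpy as np

def soft_threshold(u, tau):
    """prox of tau*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def fista_lasso(A, b, lam, num_iters=200):
    """FISTA (page 7.8) for minimize (1/2)||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of grad g
    t = 1.0 / L
    x = x_prev = np.zeros(A.shape[1])
    theta = theta_prev = 1.0
    for k in range(num_iters):
        y = x if k == 0 else x + theta * (1 / theta_prev - 1) * (x - x_prev)
        grad = A.T @ (A @ y - b)
        x_prev, x = x, soft_threshold(y - t * grad, t * lam)
        # theta_{k+1}: positive root of theta^2 = (1 - theta) * theta_k^2
        theta_prev, theta = theta, theta * (np.sqrt(theta ** 2 + 4) - theta) / 2
    return x
```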


Example

minimize   log ∑_{i=1}^p exp(a_iᵀ x + b_i)

• two randomly generated problems with p = 2000, n = 1000

• same fixed step size used for gradient method and FISTA

• figures show (f(x_k) − f⋆)/f⋆

[Figure: two plots of (f(x_k) − f⋆)/f⋆ versus k (0 to 200) on a logarithmic scale from 10⁰ down to 10⁻⁶, comparing the gradient method and FISTA on the two problem instances.]

Accelerated proximal gradient methods 7.9
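The slides do not specify how the random instances were generated, so the following is one plausible setup (an assumption): Gaussian data, h = 0 (the prox step is then just a gradient step), and the conservative Lipschitz bound L = ‖A‖₂², which is valid because the Hessian of the log-sum-exp function is bounded above by the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 2000, 1000
A = rng.standard_normal((p, n))          # assumed data distribution
b = rng.standard_normal(p)

def f(x):
    z = A @ x + b
    return z.max() + np.log(np.exp(z - z.max()).sum())     # stable log-sum-exp

def grad(x):
    z = A @ x + b
    w = np.exp(z - z.max())
    return A.T @ (w / w.sum())                              # A^T softmax(Ax + b)

L = np.linalg.norm(A, 2) ** 2            # conservative Lipschitz bound
t = 1.0 / L

# plain gradient method
x = np.zeros(n)
for k in range(200):
    x -= t * grad(x)

# FISTA with h = 0 (prox = identity) and theta0 = 1
xf = xf_prev = np.zeros(n)
theta = theta_prev = 1.0
for k in range(200):
    y = xf if k == 0 else xf + theta * (1 / theta_prev - 1) * (xf - xf_prev)
    xf_prev, xf = xf, y - t * grad(y)
    theta_prev, theta = theta, theta * (np.sqrt(theta ** 2 + 4) - theta) / 2

print(f(x), f(xf))                        # compare the two objective values
```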


A simplification for strongly convex problems

• if m > 0 and we choose θ_0 = √(m t_0), then

γ_k = m,    θ_k = √(m t_k)    for all k ≥ 1

• the algorithm on page 7.4 and page 7.5 simplifies:

y = x_k + (√t_k / √t_{k−1}) · (1 − √(m t_{k−1})) / (1 + √(m t_k)) · (x_k − x_{k−1})    (y = x_0 if k = 0)

x_{k+1} = prox_{t_k h}(y − t_k ∇g(y))

• with constant step size t_k = 1/L, the expression for y reduces to

y = x_k + (1 − √(m/L)) / (1 + √(m/L)) · (x_k − x_{k−1})    (y = x_0 if k = 0)

Accelerated proximal gradient methods 7.10
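With this constant coefficient the method only needs the previous iterate; a minimal sketch, assuming m and L are known and callables grad_g and prox_h as before (names illustrative):

```python
import numpy as np

def strongly_convex_apg(grad_g, prox_h, x0, m, L, num_iters=100):
    """Accelerated proximal gradient for m-strongly convex g,
       fixed step 1/L and theta0 = sqrt(m/L) (page 7.10)."""
    t = 1.0 / L
    beta = (1 - np.sqrt(m / L)) / (1 + np.sqrt(m / L))   # constant momentum
    x = x_prev = x0.copy()
    for k in range(num_iters):
        y = x if k == 0 else x + beta * (x - x_prev)
        x_prev, x = x, prox_h(y - t * grad_g(y), t)
    return x
```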


Outline

• Nesterov’s method

• analysis with fixed step size

• line search


Overview

• we show that if t_i = 1/L, the following inequality holds at iteration i:

f(x_{i+1}) − f⋆ + (γ_{i+1}/2) ‖v_{i+1} − x⋆‖₂²
    ≤ (1 − θ_i)(f(x_i) − f⋆) + ((γ_{i+1} − m θ_i)/2) ‖v_i − x⋆‖₂²
    = (1 − θ_i) ( f(x_i) − f⋆ + (γ_i/2) ‖v_i − x⋆‖₂² )    if i ≥ 1

• combining the inequalities from i = 0 to i = k − 1 shows that

f(x_k) − f⋆ ≤ λ_k ( (1 − θ_0)(f(x_0) − f⋆) + ((γ_1 − m θ_0)/2) ‖x_0 − x⋆‖₂² )

            ≤ λ_k ( (1 − θ_0)(f(x_0) − f⋆) + (θ_0²/(2 t_0)) ‖x_0 − x⋆‖₂² )

where λ_1 = 1 and λ_k = ∏_{i=1}^{k−1} (1 − θ_i) for k > 1

(here we assume x_0 ∈ dom f)

Accelerated proximal gradient methods 7.11


Notation for one iteration

quantities in iteration i of the algorithm on page 7.4

• define t = t_i, θ = θ_i, γ⁺ = γ_{i+1} = θ²/t

• if i ≥ 1, define γ = γ_i and note that γ⁺ − mθ = (1 − θ)γ

• define x = x_i, x⁺ = x_{i+1}, v = v_i, and v⁺ = v_{i+1}:

y = (1/(γ + mθ)) (γ⁺ x + θγ v)    (y = x = v if i = 0)

x⁺ = y − t G_t(y)

v⁺ = x + (1/θ)(x⁺ − x)

• v+, v, and y are related as

γ⁺ v⁺ = γ⁺ v + mθ(y − v) − θ G_t(y)    (1)

Accelerated proximal gradient methods 7.12


Proof (last identity):

• combine the v and x updates and use γ⁺ = θ²/t:

v⁺ = x + (1/θ)(y − t G_t(y) − x)

   = (1/θ)(y − (1 − θ)x) − (θ/γ⁺) G_t(y)

• for i = 0, the equation (1) follows because y = x = v

• for i ≥ 1, multiply with γ⁺ = γ + mθ − θγ:

γ⁺ v⁺ = (γ⁺/θ)(y − (1 − θ)x) − θ G_t(y)

      = ((1 − θ)/θ)((γ + mθ)y − γ⁺ x) + θm y − θ G_t(y)

      = (1 − θ)γ v + θm y − θ G_t(y)

      = (γ⁺ − mθ) v + θm y − θ G_t(y)

Accelerated proximal gradient methods 7.13


Bound on objective function

recall the results on the proximal gradient update (page 4.13):

• if 0 < t ≤ 1/L then g(x⁺) = g(y − t G_t(y)) is bounded by

g(x⁺) ≤ g(y) − t ∇g(y)ᵀ G_t(y) + (t/2) ‖G_t(y)‖₂²    (2)

• if the inequality (2) holds, then mt ≤ 1 and, for all z,

f(z) ≥ f(x⁺) + (t/2) ‖G_t(y)‖₂² + G_t(y)ᵀ(z − y) + (m/2) ‖z − y‖₂²

• add (1 − θ) times the inequality for z = x and θ times the inequality for z = x⋆:

f(x⁺) − f⋆ ≤ (1 − θ)(f(x) − f⋆) − G_t(y)ᵀ((1 − θ)x + θx⋆ − y) − (t/2) ‖G_t(y)‖₂² − (mθ/2) ‖x⋆ − y‖₂²

Accelerated proximal gradient methods 7.14


Bound on distance to optimum

• it follows from (1) that

(γ⁺/2) ‖v⁺ − x⋆‖₂² = ((γ⁺ − mθ)/2) ‖v − x⋆‖₂² + θ G_t(y)ᵀ( x⋆ − v − (mθ/γ⁺)(y − v) )
                      − (mθ(γ⁺ − mθ)/(2γ⁺)) ‖y − v‖₂² + (t/2) ‖G_t(y)‖₂² + (mθ/2) ‖x⋆ − y‖₂²

                   ≤ ((γ⁺ − mθ)/2) ‖v − x⋆‖₂² + θ G_t(y)ᵀ( x⋆ − v − (mθ/γ⁺)(y − v) )
                      + (t/2) ‖G_t(y)‖₂² + (mθ/2) ‖x⋆ − y‖₂²

• γ⁺ and y are chosen so that θ(γ⁺ − mθ)(y − v) = γ⁺(1 − θ)(x − y); hence

(γ⁺/2) ‖v⁺ − x⋆‖₂² ≤ ((γ⁺ − mθ)/2) ‖v − x⋆‖₂² + G_t(y)ᵀ(θ x⋆ + (1 − θ)x − y)
                      + (t/2) ‖G_t(y)‖₂² + (mθ/2) ‖x⋆ − y‖₂²

Accelerated proximal gradient methods 7.15


Progress in one iteration

• combining the bounds on page 7.14 and 7.15 gives

f(x⁺) − f⋆ + (γ⁺/2) ‖v⁺ − x⋆‖₂² ≤ (1 − θ)(f(x) − f⋆) + ((γ⁺ − mθ)/2) ‖v − x⋆‖₂²

this is the first inequality on page 7.11

• if i ≥ 1, we use γ⁺ − mθ = (1 − θ)γ to write this as

f(x⁺) − f⋆ + (γ⁺/2) ‖v⁺ − x⋆‖₂² ≤ (1 − θ) ( f(x) − f⋆ + (γ/2) ‖v − x⋆‖₂² )

Accelerated proximal gradient methods 7.16


Analysis for fixed step size

the product λ_k = ∏_{i=1}^{k−1} (1 − θ_i) determines the rate of convergence (page 7.11)

• the sequence λ_k satisfies the following bound (proof on next page):

λ_k ≤ 4 / ( 2 + √γ_1 ∑_{i=1}^{k−1} √t_i )² = 4 t_0 / ( 2√t_0 + θ_0 ∑_{i=1}^{k−1} √t_i )²    (3)

• for constant step size and θ_0 = 1, we obtain

λ_k ≤ 4 / (k + 1)²

• with t_0 = 1/L, the inequality on page 7.11 shows a 1/k² convergence rate:

f(x_k) − f⋆ ≤ (2L / (k + 1)²) ‖x_0 − x⋆‖₂²

Accelerated proximal gradient methods 7.17
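As a quick numerical sanity check (not part of the analysis), the bound λ_k ≤ 4/(k + 1)² for θ_0 = 1, m = 0, and constant step size can be verified directly from the recursion:

```python
import numpy as np

theta, lam = 1.0, 1.0            # theta_0 = 1, lambda_1 = 1 (m = 0, constant t)
for k in range(1, 50):
    # theta_k: positive root of theta^2 = (1 - theta) * theta_{k-1}^2
    theta = theta * (np.sqrt(theta ** 2 + 4) - theta) / 2
    lam *= 1 - theta             # lambda_{k+1} = (1 - theta_k) * lambda_k
    assert lam <= 4 / (k + 2) ** 2 + 1e-12      # bound on lambda_{k+1}
```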


Proof.

• recall that for k ≥ 1,

γ_{k+1} = (1 − θ_k)γ_k + θ_k m,    γ_k = θ_{k−1}²/t_{k−1}

• we first note that λ_k ≤ γ_k/γ_1; this follows from

λ_{i+1} = (1 − θ_i)λ_i = ((γ_{i+1} − θ_i m)/γ_i) λ_i ≤ (γ_{i+1}/γ_i) λ_i

• the inequality (3) follows by combining, from i = 1 to i = k − 1, the inequalities

1/√λ_{i+1} − 1/√λ_i ≥ (λ_i − λ_{i+1}) / (2 λ_i √λ_{i+1}) = θ_i / (2 √λ_{i+1}) ≥ θ_i / (2 √(γ_{i+1}/γ_1)) = (1/2) √(γ_1 t_i)

Accelerated proximal gradient methods 7.18


Strongly convex functions

the following bound on λk is useful for strongly convex functions (m > 0)

• if θ_0 ≥ √(m t_0), then θ_k ≥ √(m t_k) for all k and

λ_k ≤ ∏_{i=1}^{k−1} (1 − √(m t_i))

(proof on next page)

• for constant step size t_k = 1/L, we obtain

λ_k ≤ (1 − √(m/L))^{k−1}

• combined with the inequality on page 7.11, this shows linear convergence:

f(x_k) − f⋆ ≤ (1 − √(m/L))^{k−1} ( (1 − θ_0)(f(x_0) − f⋆) + (θ_0²/(2 t_0)) ‖x_0 − x⋆‖₂² )

Accelerated proximal gradient methods 7.19


Proof.

• if θ_{k−1} ≥ √(m t_{k−1}), then θ_k ≥ √(m t_k):

θ_k² / t_k = (1 − θ_k) θ_{k−1}² / t_{k−1} + m θ_k
           ≥ (1 − θ_k) m + m θ_k
           = m

• if θ_0 ≥ √(m t_0), then θ_k ≥ √(m t_k) for all k and

λ_k = ∏_{i=1}^{k−1} (1 − θ_i) ≤ ∏_{i=1}^{k−1} (1 − √(m t_i))

Accelerated proximal gradient methods 7.20


Outline

• Nesterov’s method

• analysis with fixed step size

• line search


Line search

• the analysis for fixed step size starts with the inequality (2):

g(y − t G_t(y)) ≤ g(y) − t ∇g(y)ᵀ G_t(y) + (t/2) ‖G_t(y)‖₂²

this inequality is known to hold for 0 < t ≤ 1/L

• if L is not known, we can satisfy (2) by a backtracking line search:

start at some t := t̂ > 0 and backtrack (t := βt) until (2) holds

• the step size selected by the line search satisfies t ≥ t_min = min{t̂, β/L}

• for each tentative t_k we need to recompute θ_k, y, and x_{k+1} in the algorithm on page 7.4

• requires evaluations of ∇g, prox_{th}, and g (twice) per line search iteration

Accelerated proximal gradient methods 7.21
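A minimal sketch of one backtracking search satisfying (2), assuming callables g, grad_g, prox_h, a candidate step t̂, and a shrink factor β ∈ (0, 1) (all names illustrative); for simplicity it holds y fixed, whereas the full algorithm also recomputes θ_k and y for each tentative t_k.

```python
import numpy as np

def backtracking_step(g, grad_g, prox_h, y, t_hat, beta=0.5):
    """Return (t, x_plus) with x_plus = prox_{t h}(y - t*grad g(y)) satisfying (2):
       g(x_plus) <= g(y) - t*grad_g(y)^T G_t(y) + (t/2)*||G_t(y)||^2."""
    gy, dgy = g(y), grad_g(y)
    t = t_hat
    while True:
        x_plus = prox_h(y - t * dgy, t)
        G = (y - x_plus) / t                     # gradient mapping G_t(y)
        if g(x_plus) <= gy - t * dgy @ G + (t / 2) * (G @ G):
            return t, x_plus
        t *= beta                                # backtrack until (2) holds
```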


Analysis with line search

• from page 7.17, if θ_0 = 1:

λ_k ≤ 4 t_0 / ( 2√t_0 + ∑_{i=1}^{k−1} √t_i )² ≤ (4 t̂/t_min) / (k + 1)²

• from page 7.19, if θ_0 ≥ √(m t_0):

λ_k ≤ ∏_{i=1}^{k−1} (1 − √(m t_i)) ≤ (1 − √(m t_min))^{k−1}

• therefore the results for fixed step size hold with 1/tmin substituted for L

Accelerated proximal gradient methods 7.22


References

Most of the material in the lecture is from §2.2 in Nesterov’s Lectures on Convex Optimization.

FISTA

• A. Beck, First-Order Methods in Optimization (2017), §10.7.
• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. on Imaging Sciences (2009).
• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal Processing and Communications (2009).

Accelerated proximal gradient methods

• S. Bubeck, Convex Optimization: Algorithms and Complexity, Foundations and Trends in Machine Learning (2015), §3.7.

• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization (2008).

Line search strategies

• FISTA papers by Beck and Teboulle.
• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex optimization with line search (2011).
• O. Güler, New proximal point algorithms for convex minimization, SIOPT (1992).
• Yu. Nesterov, Gradient methods for minimizing composite functions (2013).

Accelerated proximal gradient methods 7.23


Implementation

• S. Becker, E.J. Candès, M. Grant, Templates for convex cone problems with applications to sparse signal recovery, Mathematical Programming Computation (2011).

• B. O'Donoghue, E. Candès, Adaptive restart for accelerated gradient schemes, Foundations of Computational Mathematics (2015).

• T. Goldstein, C. Studer, R. Baraniuk, A field guide to forward-backward splitting with a FASTA implementation, arXiv:1411.3406 (2016).

Accelerated proximal gradient methods 7.24