fast proximal gradient method (fista) fista with line ...vandenbe/236c/lectures/fista.pdf · fast...

32
L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov’s second method 1

Upload: others

Post on 05-Aug-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

L. Vandenberghe EE236C (Spring 2013-14)

Fast proximal gradient methods

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method

1

Page 2: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Fast (proximal) gradient methods

• Nesterov (1983, 1988, 2005): three gradient projection methods with1/k2 convergence rate

• Beck & Teboulle (2008): FISTA, a proximal gradient version ofNesterov’s 1983 method

• Nesterov (2004 book), Tseng (2008): overview and unified analysis offast gradient methods

• several recent variations and extensions

this lecture:

FISTA and Nesterov’s 2nd method (1988) as presented by Tseng

Fast proximal gradient methods 2

Page 3: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Outline

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method

Page 4: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

FISTA (basic version)

minimize f(x) = g(x) + h(x)

• g convex, differentiable, with dom g = Rn

• h closed, convex, with inexpensive proxth operator

algorithm: choose any x(0) = x(−1); for k ≥ 1, repeat the steps

y = x(k−1) +k − 2

k + 1(x(k−1) − x(k−2))

x(k) = proxtkh (y − tk∇g(y))

• step size tk fixed or determined by line search

• acronym stands for ‘Fast Iterative Shrinkage-Thresholding Algorithm’

Fast proximal gradient methods 3

Page 5: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Interpretation

• first iteration (k = 1) is a proximal gradient step at y = x(0)

• next iterations are proximal gradient steps at extrapolated points y

x(k−2) x(k−1) y

x(k) = proxtkh (y − tk∇g(y))

note: x(k) is feasible (in domh); y may be outside domh

Fast proximal gradient methods 4

Page 6: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Example

minimize logm∑

i=1

exp(aTi x+ bi)

randomly generated data with m = 2000, n = 1000, same fixed step size

0 50 100 150 20010-6

10-5

10-4

10-3

10-2

10-1

100

gradientFISTA

k

f(x

(k) )

−f⋆

|f⋆|

Fast proximal gradient methods 5

Page 7: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

another instance

0 50 100 150 20010-6

10-5

10-4

10-3

10-2

10-1

100

gradientFISTA

k

f(x

(k) )

−f⋆

|f⋆|

FISTA is not a descent method

Fast proximal gradient methods 6

Page 8: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Convergence of FISTA

assumptions

• g convex with dom g = Rn; ∇g Lipschitz continuous with constant L:

‖∇g(x)−∇g(y)‖2 ≤ L‖x− y‖2 ∀x, y

• h is closed and convex (so that proxth(u) is well defined)

• optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

convergence result: f(x(k))− f⋆ decreases at least as fast as 1/k2

• with fixed step size tk = 1/L

• with suitable line search

Fast proximal gradient methods 7

Page 9: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Reformulation of FISTA

define θk = 2/(k + 1) and introduce an intermediate variable v(k)

algorithm: choose x(0) = v(0); for k ≥ 1, repeat the steps

y = (1− θk)x(k−1) + θkv

(k−1)

x(k) = proxtkh (y − tk∇g(y))

v(k) = x(k−1) +1

θk(x(k) − x(k−1))

substituting expression for v(k) in formula for y gives FISTA of page 3

Fast proximal gradient methods 8

Page 10: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Important inequalities

choice of θk: the sequence θk = 2/(k + 1) satisfies θ1 = 1 and

1− θkθ2k

≤ 1

θ2k−1

, k ≥ 2

upper bound on g from Lipschitz property (page 1-12)

g(u) ≤ g(z) +∇g(z)T (u− z) +L

2‖u− z‖22 ∀u, z

upper bound on h from definition of prox-operator (page 6-7)

h(u) ≤ h(z) +1

t(w − u)T (u− z) ∀w, u = proxth(w), z

Fast proximal gradient methods 9

Page 11: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Progress in one iteration

define x = x(i−1), x+ = x(i), v = v(i−1), v+ = v(i), t = ti, θ = θi

• upper bound from Lipschitz property: if 0 < t ≤ 1/L,

g(x+) ≤ g(y) +∇g(y)T (x+ − y) +1

2t‖x+ − y‖22 (1)

• upper bound from definition of prox-operator:

h(x+) ≤ h(z) +∇g(y)T (z − x+) +1

t(x+ − y)T (z − x+) ∀z

• add the upper bounds and use convexity of g

f(x+) ≤ f(z) +1

t(x+ − y)T (z − x+) +

1

2t‖x+ − y‖22 ∀z

Fast proximal gradient methods 10

Page 12: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

• make convex combination of upper bounds for z = x and z = x⋆

f(x+)− f⋆ − (1− θ)(f(x)− f⋆)

= f(x+)− θf⋆ − (1− θ)f(x)

≤ 1

t(x+ − y)T (θx⋆ + (1− θ)x− x+) +

1

2t‖x+ − y‖22

=1

2t

(

‖y − (1− θ)x− θx⋆‖22 −∥

∥x+ − (1− θ)x− θx⋆∥

2

2

)

=θ2

2t

(

‖v − x⋆‖22 −∥

∥v+ − x⋆∥

2

2

)

conclusion: if the inequality (1) holds at iteration i, then

tiθ2i

(

f(x(i))− f⋆)

+1

2‖v(i) − x⋆‖22

≤ (1− θi)tiθ2i

(

f(x(i−1)− f⋆)

+1

2‖v(i−1) − x⋆‖22 (2)

Fast proximal gradient methods 11

Page 13: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Analysis for fixed step size

take ti = t = 1/L and apply (2) recursively, using (1− θi)/θ2i ≤ 1/θ2i−1:

t

θ2k

(

f(x(k))− f⋆)

+1

2‖v(k) − x⋆‖22

≤ (1− θ1)t

θ21

(

f(x(0))− f⋆)

+1

2‖v(0) − x⋆‖22

=1

2‖x(0) − x⋆‖22

therefore,

f(x(k))− f⋆ ≤ θ2k2t

‖x(0) − x⋆‖22 =2L

(k + 1)2‖x(0) − x⋆‖22

conclusion: reaches f(x(k))− f⋆ ≤ ǫ after O(1/√ǫ) iterations

Fast proximal gradient methods 12

Page 14: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Example: quadratic program with box constraints

minimize (1/2)xTAx+ bTxsubject to 0 � x � 1

0 10 20 30 40 5010-7

10-6

10-5

10-4

10-3

10-2

10-1

100

101gradientFISTA

k

(f(x

(k) )

−f⋆)/

|f⋆|

n = 3000; fixed step size t = 1/λmax(A)

Fast proximal gradient methods 13

Page 15: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

1-norm regularized least-squares

minimize1

2‖Ax− b‖22 + ‖x‖1

0 20 40 60 80 10010-7

10-6

10-5

10-4

10-3

10-2

10-1

100gradientFISTA

k

(f(x

(k) )

−f⋆)/

f⋆

randomly generated A ∈ R2000×1000; step tk = 1/L with L = λmax(ATA)

Fast proximal gradient methods 14

Page 16: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Outline

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method

Page 17: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Key steps in the analysis of FISTA

• the starting point (page 10) is the inequality

g(x+) ≤ g(y) +∇g(y)T (x+ − y) +1

2t‖x+ − y‖22 (1)

this inequality is known to hold for 0 < t ≤ 1/L

• if (1) holds, then the progress made in iteration i is bounded by

tiθ2i

(

f(x(i))− f⋆)

+1

2‖v(i) − x⋆‖22

≤ (1− θi)tiθ2i

(

f(x(i−1)− f⋆)

+1

2‖v(i−1) − x⋆‖22 (2)

• to combine these inequalities recursively, we need

(1− θi)tiθ2i

≤ ti−1

θ2i−1

(i ≥ 2) (3)

Fast proximal gradient methods 15

Page 18: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

• if θ1 = 1, combining the inequalities (2) from i = 1 to k gives the bound

f(x(k))− f⋆ ≤ θ2k2tk

‖x(0) − x⋆‖22

conclusion: rate 1/k2 convergence if (1) and (3) hold with

θ2ktk

= O(1

k2)

FISTA with fixed step size

tk =1

L, θk =

2

k + 1

these values satisfy (1) and (3) with

θ2ktk

=4L

(k + 1)2

Fast proximal gradient methods 16

Page 19: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

FISTA with line search (method 1)

replace update of x in iteration k (page 8) with

t := tk−1 (define t0 = t > 0)

x := proxth(y − t∇g(y))

while g(x) > g(y) +∇g(y)T (x− y) + 12t‖x− y‖22

t := βt

x := proxth(y − t∇g(y))

end

• inequality (1) holds trivially, by the backtracking exit condition

• inequality (3) holds with θk = 2/(k + 1) because tk ≤ tk−1

• Lipschitz continuity of ∇g guarantees tk ≥ tmin = min{t, β/L}• preserves 1/k2 convergence rate because θ2k/tk = O(1/k2):

θ2ktk

≤ 4

(k + 1)2tmin

Fast proximal gradient methods 17

Page 20: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

FISTA with line search (method 2)

replace update of y and x in iteration k (page 8) with

t := t > 0

θ := positive root of tk−1θ2 = tθ2k−1(1− θ)

y := (1− θ)x(k−1) + θv(k−1)

x := proxth(y − t∇g(y))

while g(x) > g(y) +∇g(y)T (x− y) + 12t‖x− y‖22

t := βt

θ := positive root of tk−1θ2 = tθ2k−1(1− θ)

y := (1− θ)x(k−1) + θv(k−1)

x := proxth(y − t∇g(y))

end

assume t0 = 0 in the first iteration (k = 1), i.e., take θ1 = 1, y = x(0)

Fast proximal gradient methods 18

Page 21: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

discussion

• inequality (1) holds trivially, by the backtracking exit condition

• inequality (3) holds trivially, by construction of θk

• Lipschitz continuity of ∇g guarantees tk ≥ tmin = min{t, β/L}• θi is defined as the positive root of θ2i /ti = (1− θi)θ

2i−1/ti−1; hence

√ti−1

θi−1=

(1− θi)tiθi

≤√tiθi

−√ti2

combine inequalities from i = 2 to k to get√t1 ≤

√tkθk

− 1

2

k∑

i=2

√ti

• rearranging shows that θ2k/tk = O(1/k2):

θ2ktk

≤ 1(√

t1 +12

k∑

i=2

√ti

)2 ≤ 4

(k + 1)2tmin

Fast proximal gradient methods 19

Page 22: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Comparison of line search methods

method 1

• uses nonincreasing step sizes (enforces tk ≤ tk−1)

• one evaluation of g(x), one proxth evaluation per line search iteration

method 2

• allows non-monotonic step sizes

• one evaluation of g(x), one evaluation of g(y), ∇g(y), one evaluation ofproxth per line search iteration

the two strategies can be combined and extended in various ways

Fast proximal gradient methods 20

Page 23: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Outline

• fast proximal gradient method (FISTA)

• FISTA with line search

• FISTA as descent method

• Nesterov’s second method

Page 24: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Descent version of FISTA

choose x(0) = v(0); for k ≥ 1, repeat the steps

y = (1− θk)x(k−1) + θkv

(k−1)

u = proxtkh (y − tk∇g(y))

x(k) =

{

u f(u) ≤ f(x(k−1))x(k−1) otherwise

v(k) = x(k−1) +1

θk(u− x(k−1))

• step 3 implies f(x(k)) ≤ f(x(k−1))

• use θk = 2/(k + 1) and tk = 1/L, or one of the line search methods

• same iteration complexity as original FISTA

• changes on page 10: replace x+ with u and use f(x+) ≤ f(u)

Fast proximal gradient methods 21

Page 25: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Example

(from page 6)

0 50 100 150 20010-6

10-5

10-4

10-3

10-2

10-1

100

gradientFISTAFISTA-d

k

(f(x

(k) )

−f⋆)/

|f⋆|

Fast proximal gradient methods 22

Page 26: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Outline

• fast proximal gradient method (FISTA)

• line search strategies

• enforcing descent

• Nesterov’s second method

Page 27: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Nesterov’s second method

algorithm: choose x(0) = v(0); for k ≥ 1, repeat the steps

y = (1− θk)x(k−1) + θkv

(k−1)

v(k) = prox(tk/θk)h

(

v(k−1) − tkθk∇g(y)

)

x(k) = (1− θk)x(k−1) + θkv

(k)

• use θk = 2/(k + 1) and tk = 1/L, or one of the line search methods

• identical to FISTA if h(x) = 0

• unlike in FISTA, y is feasible (in domh) if we take x(0) ∈ domh

Fast proximal gradient methods 23

Page 28: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Convergence of Nesterov’s second method

assumptions

• g convex; ∇g is Lipschitz continuous on domh ⊆ dom g

‖∇g(x)−∇g(y)‖2 ≤ L‖x− y‖2 ∀x, y ∈ domh

• h is closed and convex (so that proxth(u) is well defined)

• optimal value f⋆ is finite and attained at x⋆ (not necessarily unique)

convergence result: f(x(k))− f⋆ decreases at least as fast as 1/k2

• with fixed step size tk = 1/L

• with suitable line search

Fast proximal gradient methods 24

Page 29: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Analysis of one iteration

define x = x(i−1), x+ = x(i), v = v(i−1), v+ = v(i), t = ti, θ = θi

• from Lipschitz property if 0 < t ≤ 1/L

g(x+) ≤ g(y) +∇g(y)T (x+ − y) +1

2t‖x+ − y‖22

• plug in x+ = (1− θ)x+ θv+ and x+ − y = θ(v+ − v)

g(x+) ≤ g(y) +∇g(y)T(

(1− θ)x+ θv+ − y)

+θ2

2t‖v+ − v‖22

• from convexity of g, h

g(x+) ≤ (1− θ)g(x) + θ(

g(y) +∇g(y)T (v+ − y))

+θ2

2t‖v+ − v‖22

h(x+) ≤ (1− θ)h(x) + θh(v+)

Fast proximal gradient methods 25

Page 30: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

• upper bound on h from p. 9 (with u = v+, w = v − (t/θ)∇g(y))

h(v+) ≤ h(z) +∇g(y)T (z − v+)− θ

t(v+ − v)T (v+ − z) ∀z

• combine the upper bounds on g(x+), h(x+), h(v+) with z = x⋆

f(x+) ≤ (1− θ)f(x) + θf⋆ − θ2

t(v+ − v)T (v+ − x⋆) +

θ2

2t‖v+ − v‖22

= (1− θ)f(x) + θf⋆ +θ2

2t

(

‖v − x⋆‖22 − ‖v+ − x⋆‖22)

this is identical to the final inequality (2) in the analysis of FISTA on p.11

tiθ2i

(

f(x(i))− f⋆)

+1

2‖v(i) − x⋆‖22

≤ (1− θi)tiθ2i

(

f(x(i−1))− f⋆)

+1

2‖v(i−1) − x⋆‖22

Fast proximal gradient methods 26

Page 31: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

References

surveys of fast gradient methods

• Yu. Nesterov, Introductory Lectures on Convex Optimization. A Basic Course (2004)

• P. Tseng, On accelerated proximal gradient methods for convex-concave optimization

(2008)

FISTA

• A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear

inverse problems, SIAM J. on Imaging Sciences (2009)

• A. Beck and M. Teboulle, Gradient-based algorithms with applications to signal

recovery, in: Y. Eldar and D. Palomar (Eds.), Convex Optimization in Signal

Processing and Communications (2009)

line search strategies

• FISTA papers by Beck and Teboulle

• D. Goldfarb and K. Scheinberg, Fast first-order methods for composite convex

optimization with line search (2011)

• Yu. Nesterov, Gradient methods for minimizing composite objective function (2007)

• O. Guler, New proximal point algorithms for convex minimization, SIOPT (1992)

Fast proximal gradient methods 27

Page 32: fast proximal gradient method (FISTA) FISTA with line ...vandenbe/236C/lectures/fista.pdf · Fast (proximal) gradient methods • Nesterov (1983, 1988, 2005): three gradient projection

Nesterov’s third method (not covered in this lecture)

• Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical

Programming (2005)

• S. Becker, J. Bobin, E.J. Candes, NESTA: a fast and accurate first-order method for

sparse recovery, SIAM J. Imaging Sciences (2011)

Fast proximal gradient methods 28