Accelerated first-order methods (transcript)
Accelerated first-order methods
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725
1
Remember generalized gradient descent
We want to solve
$$\min_{x \in \mathbb{R}^n} \; g(x) + h(x),$$
for g convex and differentiable, h convex
Generalized gradient descent: choose initial $x^{(0)} \in \mathbb{R}^n$, repeat:
$$x^{(k)} = \mathrm{prox}_{t_k}\!\left(x^{(k-1)} - t_k \nabla g(x^{(k-1)})\right), \quad k = 1, 2, 3, \ldots$$
where the prox function is defined as
$$\mathrm{prox}_t(x) = \operatorname*{argmin}_{z \in \mathbb{R}^n} \; \frac{1}{2t}\|x - z\|^2 + h(z)$$
If ∇g is Lipschitz continuous and the prox function can be evaluated, then generalized gradient descent has rate $O(1/k)$ (counting the number of iterations)

We can apply acceleration to achieve the optimal $O(1/k^2)$ rate!
2
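As a concrete sketch of the iteration above (not from the slides): generalized gradient descent for the lasso, where $g(x) = \frac{1}{2}\|y - Ax\|^2$ and $h(x) = \lambda\|x\|_1$, so the prox step is elementwise soft-thresholding. Function names and data here are illustrative.

```python
import numpy as np

def soft_threshold(x, tau):
    # prox of tau * ||.||_1: shrink each entry toward zero by tau
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_gradient_lasso(A, y, lam, t, n_iter=500):
    # Generalized gradient descent for (1/2)||y - Ax||^2 + lam * ||x||_1:
    # grad g(x) = A^T (Ax - y), and prox_t(x) = soft_threshold(x, lam * t)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - t * A.T @ (A @ x - y), lam * t)
    return x

# Illustrative data; fixed step t = 1/L with L = largest eigenvalue of A^T A
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
y = rng.standard_normal(30)
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
x_hat = prox_gradient_lasso(A, y, lam=1.0, t=t)
```

Since each iteration is a prox-gradient step with $t \le 1/L$, the objective is non-increasing along the iterates.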
Acceleration
Four ideas (three acceleration methods) by Nesterov (1983, 1988, 2005, 2007):

• 1983: original acceleration idea for smooth functions
• 1988: another acceleration idea for smooth functions
• 2005: smoothing techniques for nonsmooth functions, coupled with original acceleration idea
• 2007: acceleration idea for composite functions¹

Beck and Teboulle (2008): extension of Nesterov (1983) to composite functions²

Tseng (2008): unified analysis of acceleration techniques (all of these, and more)

¹ Each step uses the entire history of previous steps and makes two prox calls
² Each step uses only information from the last two steps and makes one prox call
3
Outline
Today:
• Acceleration for composite functions (method of Beck and Teboulle (2008); presentation follows Vandenberghe's lecture notes)
• Convergence rate
• FISTA
• Is acceleration always useful?
4
Accelerated generalized gradient method
Our problem:
$$\min_{x \in \mathbb{R}^n} \; g(x) + h(x),$$
for g convex and differentiable, h convex

Accelerated generalized gradient method: choose any initial $x^{(0)} = x^{(-1)} \in \mathbb{R}^n$, repeat for $k = 1, 2, 3, \ldots$
$$y = x^{(k-1)} + \frac{k-2}{k+1}\left(x^{(k-1)} - x^{(k-2)}\right)$$
$$x^{(k)} = \mathrm{prox}_{t_k}\!\left(y - t_k \nabla g(y)\right)$$

• First step k = 1 is just the usual generalized gradient update
• After that, $y = x^{(k-1)} + \frac{k-2}{k+1}(x^{(k-1)} - x^{(k-2)})$ carries some "momentum" from previous iterations
• h = 0 gives the accelerated gradient method
5
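A minimal sketch of the update above, with the prox supplied as a function (the signature `prox(v, t)` is an assumption of this sketch, not from the slides); passing the identity as the prox recovers plain accelerated gradient descent:

```python
import numpy as np

def accel_prox_gradient(grad_g, prox, x0, t, n_iter=100):
    # y      = x^(k-1) + (k-2)/(k+1) * (x^(k-1) - x^(k-2))
    # x^(k)  = prox_t(y - t * grad_g(y))
    x_prev = x0.copy()   # plays the role of x^(k-2); note x^(0) = x^(-1)
    x = x0.copy()        # plays the role of x^(k-1)
    for k in range(1, n_iter + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)
        x_prev, x = x, prox(y - t * grad_g(y), t)
    return x

# h = 0 (identity prox): minimize (1/2)||x - b||^2, whose solution is b
b = np.array([1.0, -2.0, 3.0])
sol = accel_prox_gradient(lambda x: x - b, lambda v, t: v,
                          np.zeros(3), t=1.0, n_iter=50)
```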
[Plot: the momentum weight (k − 2)/(k + 1) versus k = 0, …, 100, increasing from −1/2 (at k = 1) toward 1]
6
Consider minimizing
$$f(x) = \sum_{i=1}^n \left(-y_i a_i^T x + \log\!\left(1 + \exp(a_i^T x)\right)\right)$$
i.e., logistic regression with predictors $a_i \in \mathbb{R}^p$
This is smooth, and
$$\nabla f(x) = -A^T\left(y - p(x)\right), \quad \text{where } p_i(x) = \frac{\exp(a_i^T x)}{1 + \exp(a_i^T x)}, \; i = 1, \ldots, n$$
No nonsmooth part here, so $\mathrm{prox}_t(x) = x$
7
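A sketch of this objective and gradient in code (data is random and illustrative; `np.logaddexp` evaluates log(1 + exp(·)) stably):

```python
import numpy as np

def logistic_objective(x, A, y):
    # f(x) = sum_i [ -y_i a_i^T x + log(1 + exp(a_i^T x)) ]
    z = A @ x
    return np.sum(-y * z + np.logaddexp(0.0, z))

def logistic_grad(x, A, y):
    # grad f(x) = -A^T (y - p(x)),  p_i(x) = exp(a_i^T x) / (1 + exp(a_i^T x))
    p = 1.0 / (1.0 + np.exp(-(A @ x)))
    return -A.T @ (y - p)

# Illustrative data with n = 30 observations, p = 10 predictors
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
y = (rng.random(30) < 0.5).astype(float)
g0 = logistic_grad(np.zeros(10), A, y)
```

Since the prox is the identity here, plugging `logistic_grad` into a (accelerated) gradient loop needs no prox machinery at all.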
Example (with n = 30, p = 10):
[Plot: $f(x^{(k)}) - f^\star$ versus k on a log scale, comparing gradient descent and accelerated gradient]
8
Another example (n = 30, p = 10):
[Plot: $f(x^{(k)}) - f^\star$ versus k on a log scale, comparing gradient descent and accelerated gradient]
Not a descent method!
9
Reformulation
Initialize $x^{(0)} = u^{(0)}$, and repeat for $k = 1, 2, 3, \ldots$
$$y = (1 - \theta_k)\,x^{(k-1)} + \theta_k u^{(k-1)}$$
$$x^{(k)} = \mathrm{prox}_{t_k}\!\left(y - t_k \nabla g(y)\right)$$
$$u^{(k)} = x^{(k-1)} + \frac{1}{\theta_k}\left(x^{(k)} - x^{(k-1)}\right)$$
with $\theta_k = 2/(k+1)$
This is equivalent to the formulation of the accelerated generalized gradient method presented earlier (slide 5), and it makes the convergence analysis easier

(Note: Beck and Teboulle (2008) use a choice θk < 2/(k + 1), but very close)
10
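The equivalence can be checked numerically; here is a sketch with h = 0 on a toy quadratic (function names are illustrative), where the two forms produce identical iterates:

```python
import numpy as np

def momentum_form(grad, x0, t, n_iter):
    # y = x^(k-1) + (k-2)/(k+1) * (x^(k-1) - x^(k-2)); x^(k) = y - t * grad(y)
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)
        x_prev, x = x, y - t * grad(y)
    return x

def theta_form(grad, x0, t, n_iter):
    # Reformulation with theta_k = 2/(k+1) and auxiliary sequence u^(k)
    x, u = x0.copy(), x0.copy()
    for k in range(1, n_iter + 1):
        th = 2.0 / (k + 1)
        y = (1 - th) * x + th * u
        x_new = y - t * grad(y)
        u = x + (x_new - x) / th
        x = x_new
    return x

rng = np.random.default_rng(2)
Q = rng.standard_normal((5, 5)); Q = Q.T @ Q + np.eye(5)
b = rng.standard_normal(5)
grad = lambda x: Q @ x - b
t = 1.0 / np.linalg.eigvalsh(Q).max()
xa = momentum_form(grad, np.zeros(5), t, 50)
xb = theta_form(grad, np.zeros(5), t, 50)
```

The algebra behind the match: $\theta_k(1/\theta_{k-1} - 1) = \frac{2}{k+1}\cdot\frac{k-2}{2} = \frac{k-2}{k+1}$, exactly the momentum coefficient.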
Convergence analysis
As usual, we are minimizing f(x) = g(x) + h(x), assuming:

• g is convex, differentiable, and ∇g is Lipschitz continuous with constant L > 0
• h is convex, and its prox function can be evaluated
Theorem: Accelerated generalized gradient method with fixed step size $t \le 1/L$ satisfies
$$f(x^{(k)}) - f(x^\star) \le \frac{2\|x^{(0)} - x^\star\|^2}{t(k+1)^2}$$
Achieves the optimal $O(1/k^2)$ rate for first-order methods!

I.e., to get $f(x^{(k)}) - f(x^\star) \le \epsilon$, we need $O(1/\sqrt{\epsilon})$ iterations
11
Helpful inequalities
We will use
$$\frac{1 - \theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}, \quad k = 1, 2, 3, \ldots$$
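To see why this holds for the choice $\theta_k = 2/(k+1)$ (so $\theta_{k-1} = 2/k$), a one-line check:

```latex
\frac{1-\theta_k}{\theta_k^2}
= \Bigl(1 - \frac{2}{k+1}\Bigr)\frac{(k+1)^2}{4}
= \frac{(k-1)(k+1)}{4}
= \frac{k^2 - 1}{4}
\;\le\; \frac{k^2}{4}
= \frac{1}{\theta_{k-1}^2}
```

(for k = 1 the left-hand side is 0, so the inequality holds trivially).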
We will also use
$$h(v) \le h(z) + \frac{1}{t}(v - w)^T(z - v), \quad \text{for all } z, w, \text{ with } v = \mathrm{prox}_t(w)$$
Why is this true? By definition of the prox operator,
$$v \ \text{minimizes} \ \frac{1}{2t}\|w - v\|^2 + h(v) \;\Leftrightarrow\; 0 \in \frac{1}{t}(v - w) + \partial h(v) \;\Leftrightarrow\; -\frac{1}{t}(v - w) \in \partial h(v)$$
Now apply the definition of subgradient
12
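A quick numeric sanity check of this inequality (a sketch, taking $h = \|\cdot\|_1$ so that $\mathrm{prox}_t$ is soft-thresholding; the data is random and illustrative):

```python
import numpy as np

def soft_threshold(w, t):
    # prox_t(w) for h(z) = ||z||_1
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

h = lambda z: np.sum(np.abs(z))

rng = np.random.default_rng(3)
t = 0.7
w = rng.standard_normal(8)
v = soft_threshold(w, t)

# h(v) <= h(z) + (1/t)(v - w)^T (z - v) should hold for every z
violations = sum(
    h(v) > h(z) + (v - w) @ (z - v) / t + 1e-10
    for z in rng.standard_normal((1000, 8)) * 3
)
```

No random z should violate the bound, since it holds for all z by the subgradient argument above.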
Convergence proof
We focus first on one iteration, and drop the k notation (so x⁺, u⁺ are updated versions of x, u). Key steps:

• ∇g Lipschitz with constant L > 0 and t ≤ 1/L ⇒
$$g(x^+) \le g(y) + \nabla g(y)^T(x^+ - y) + \frac{1}{2t}\|x^+ - y\|^2$$
• From our bound using the prox operator,
$$h(x^+) \le h(z) + \frac{1}{t}(x^+ - y)^T(z - x^+) + \nabla g(y)^T(z - x^+), \quad \text{all } z$$
• Adding these together and using convexity of g,
$$f(x^+) \le f(z) + \frac{1}{t}(x^+ - y)^T(z - x^+) + \frac{1}{2t}\|x^+ - y\|^2, \quad \text{all } z$$
13
• Using this bound at z = x and z = x⋆:
$$f(x^+) - f(x^\star) - (1-\theta)\left(f(x) - f(x^\star)\right) \le \frac{1}{t}(x^+ - y)^T\left(\theta x^\star + (1-\theta)x - x^+\right) + \frac{1}{2t}\|x^+ - y\|^2 = \frac{\theta^2}{2t}\left(\|u - x^\star\|^2 - \|u^+ - x^\star\|^2\right)$$
• I.e., at iteration k,
$$\frac{t}{\theta_k^2}\left(f(x^{(k)}) - f(x^\star)\right) + \frac{1}{2}\|u^{(k)} - x^\star\|^2 \le \frac{(1-\theta_k)t}{\theta_k^2}\left(f(x^{(k-1)}) - f(x^\star)\right) + \frac{1}{2}\|u^{(k-1)} - x^\star\|^2$$
14
• Using $(1-\theta_i)/\theta_i^2 \le 1/\theta_{i-1}^2$, and iterating this inequality,
$$\frac{t}{\theta_k^2}\left(f(x^{(k)}) - f(x^\star)\right) + \frac{1}{2}\|u^{(k)} - x^\star\|^2 \le \frac{(1-\theta_1)t}{\theta_1^2}\left(f(x^{(0)}) - f(x^\star)\right) + \frac{1}{2}\|u^{(0)} - x^\star\|^2 = \frac{1}{2}\|x^{(0)} - x^\star\|^2$$
(the last equality holds since θ₁ = 1, so the first term on the right vanishes, and u⁽⁰⁾ = x⁽⁰⁾)
• Therefore
$$f(x^{(k)}) - f(x^\star) \le \frac{\theta_k^2}{2t}\|x^{(0)} - x^\star\|^2 = \frac{2}{t(k+1)^2}\|x^{(0)} - x^\star\|^2$$
15
Backtracking line search
A few ways to do this with acceleration ... here's a simple method (more complicated strategies exist)

First think: what do we need t to satisfy? Looking back at the proof with $t_k = t \le 1/L$,

• We used
$$g(x^+) \le g(y) + \nabla g(y)^T(x^+ - y) + \frac{1}{2t}\|x^+ - y\|^2$$
• We also used
$$\frac{(1-\theta_k)t_k}{\theta_k^2} \le \frac{t_{k-1}}{\theta_{k-1}^2},$$
so it suffices to have $t_k \le t_{k-1}$, i.e., decreasing step sizes
16
Backtracking algorithm: fix β < 1, t₀ = 1. At iteration k, replace the x update (i.e., computation of x⁺) with:

• Start with $t_k = t_{k-1}$ and $x^+ = \mathrm{prox}_{t_k}(y - t_k \nabla g(y))$
• While $g(x^+) > g(y) + \nabla g(y)^T(x^+ - y) + \frac{1}{2t_k}\|x^+ - y\|^2$, repeat:
  ◦ set $t_k = \beta t_k$ and $x^+ = \mathrm{prox}_{t_k}(y - t_k \nabla g(y))$
Note this achieves both requirements. So under the same conditions (∇g Lipschitz, prox function evaluable), we get the same rate

Theorem: Accelerated generalized gradient method with backtracking line search satisfies
$$f(x^{(k)}) - f(x^\star) \le \frac{2\|x^{(0)} - x^\star\|^2}{t_{\min}(k+1)^2}$$
where $t_{\min} = \min\{1, \beta/L\}$
17
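The backtracking variant above can be sketched as follows (names are illustrative, and the prox is passed in with an assumed signature `prox(v, t)`; the test run uses h = 0 on a small quadratic):

```python
import numpy as np

def accel_prox_backtracking(g, grad_g, prox, x0, beta=0.5, n_iter=200):
    x_prev, x = x0.copy(), x0.copy()
    t = 1.0                      # t_0 = 1; step sizes only ever shrink
    for k in range(1, n_iter + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)
        gy, grad_y = g(y), grad_g(y)
        while True:
            x_new = prox(y - t * grad_y, t)
            d = x_new - y
            # sufficient-decrease condition from the slide
            if g(x_new) <= gy + grad_y @ d + (d @ d) / (2 * t):
                break
            t *= beta            # t_k = beta * t_k
        x_prev, x = x, x_new
    return x

# h = 0: minimize (1/2) x^T Q x - b^T x, whose solution is Q^{-1} b
Q = np.diag([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])
sol = accel_prox_backtracking(lambda x: 0.5 * x @ Q @ x - b @ x,
                              lambda x: Q @ x - b,
                              lambda v, t: v,
                              np.zeros(3), n_iter=1000)
```

Keeping `t` across outer iterations enforces $t_k \le t_{k-1}$, which is what the convergence analysis needs.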
FISTA
Recall the lasso problem,
$$\min_x \; \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\|_1$$
and ISTA (Iterative Soft-Thresholding Algorithm):
$$x^{(k)} = S_{\lambda t_k}\!\left(x^{(k-1)} + t_k A^T(y - Ax^{(k-1)})\right), \quad k = 1, 2, 3, \ldots$$
with $S_\lambda(\cdot)$ being elementwise (vector) soft-thresholding. Applying acceleration gives us FISTA (F is for Fast):³
$$v = x^{(k-1)} + \frac{k-2}{k+1}\left(x^{(k-1)} - x^{(k-2)}\right)$$
$$x^{(k)} = S_{\lambda t_k}\!\left(v + t_k A^T(y - Av)\right), \quad k = 1, 2, 3, \ldots$$
³ Beck and Teboulle (2008) actually call their general acceleration technique (for general g, h) FISTA, which may be somewhat confusing
18
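ISTA and FISTA side by side, as sketched code (data is random and illustrative; on this small lasso problem both reach essentially the same optimum, with FISTA typically getting close in far fewer iterations):

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, y, lam, t, n_iter):
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + t * A.T @ (y - A @ x), lam * t)
    return x

def fista(A, y, lam, t, n_iter):
    # Same prox-gradient step, applied at the extrapolated point v
    x_prev = x = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        v = x + (k - 2) / (k + 1) * (x - x_prev)
        x_prev, x = x, soft_threshold(v + t * A.T @ (y - A @ v), lam * t)
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((40, 20))
y = rng.standard_normal(40)
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
x_ista = ista(A, y, lam=0.5, t=t, n_iter=3000)
x_fista = fista(A, y, lam=0.5, t=t, n_iter=3000)
```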
Lasso regression: 100 instances (with n = 100, p = 500):
[Plot: $f(x^{(k)}) - f^\star$ versus k on a log scale, comparing ISTA and FISTA]
19
Lasso logistic regression: 100 instances (n = 100, p = 500):
[Plot: $f(x^{(k)}) - f^\star$ versus k on a log scale, comparing ISTA and FISTA]
20
Is acceleration always useful?
Acceleration is generally a very effective speedup tool ... but should it always be used?

In practice, the speedup from acceleration is diminished in the presence of warm starts. I.e., suppose we want to solve the lasso problem over a sequence of tuning parameter values
$$\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r$$
• When solving for λ₁, initialize x⁽⁰⁾ = 0, record the solution x(λ₁)
• When solving for λⱼ, initialize x⁽⁰⁾ = x(λⱼ₋₁), the recorded solution for λⱼ₋₁

Over a fine enough grid of λ values, generalized gradient descent can perform just as well without acceleration
21
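A sketch of this warm-start strategy for a lasso path (data and the helper `prox_grad_lasso` are illustrative, not from the slides); only the first λ is solved from a cold start:

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_grad_lasso(A, y, lam, t, x0, n_iter):
    # (non-accelerated) generalized gradient descent, started from x0
    x = x0.copy()
    for _ in range(n_iter):
        x = soft_threshold(x + t * A.T @ (y - A @ x), lam * t)
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
t = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

lams = np.geomspace(5.0, 0.1, 10)       # lambda_1 >= ... >= lambda_r
solutions = {}
x = np.zeros(20)                        # cold start only at lambda_1
for lam in lams:
    x = prox_grad_lasso(A, y, lam, t, x0=x, n_iter=300)   # warm start
    solutions[lam] = x.copy()
```

Each solve begins close to its optimum because adjacent λ values have nearby solutions, which is why a modest fixed iteration budget per λ suffices here.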
Sometimes acceleration and even backtracking can be harmful!
Recall the matrix completion problem: we observe only some entries of A, (i, j) ∈ Ω, and want to fill in the rest, so we solve
$$\min_X \; \frac{1}{2}\left\|P_\Omega(A) - P_\Omega(X)\right\|_F^2 + \lambda\|X\|_*$$
where $\|X\|_* = \sum_{i=1}^r \sigma_i(X)$ is the nuclear norm, and
$$[P_\Omega(X)]_{ij} = \begin{cases} X_{ij} & (i,j) \in \Omega \\ 0 & (i,j) \notin \Omega \end{cases}$$
Generalized gradient descent with t = 1 (the soft-impute algorithm): updates are
$$X^+ = S_\lambda\!\left(P_\Omega(A) + P_\Omega^\perp(X)\right)$$
where $S_\lambda$ is the matrix soft-thresholding operator ... requires an SVD
22
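A dense-matrix sketch of soft-impute (illustrative only; a practical implementation would exploit the sparse + low-rank structure rather than take a full SVD each step):

```python
import numpy as np

def svd_soft_threshold(X, lam):
    # matrix soft-thresholding S_lam: shrink singular values by lam
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(A, mask, lam, n_iter=100):
    # Generalized gradient descent with t = 1:
    #   X+ = S_lam( P_Omega(A) + P_Omega_perp(X) )
    # mask is True on observed entries (i, j) in Omega
    PA = np.where(mask, A, 0.0)                  # P_Omega(A)
    X = np.zeros_like(A)
    for _ in range(n_iter):
        X = svd_soft_threshold(PA + np.where(mask, 0.0, X), lam)
    return X

rng = np.random.default_rng(6)
B = rng.standard_normal((15, 3)) @ rng.standard_normal((3, 15))  # rank 3
mask = rng.random(B.shape) < 0.7                 # observe ~70% of entries
X_hat = soft_impute(B, mask, lam=0.5, n_iter=200)
```

Since ∇g is 1-Lipschitz here, t = 1 is a valid fixed step, so the objective is non-increasing along the iterates.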
Backtracking line search with generalized gradient:
• Each backtracking loop evaluates the generalized gradient $G_t(x)$ at various values of t
• Hence requires multiple evaluations of $\mathrm{prox}_t(x)$
• For matrix completion, we can't afford this!
Acceleration with generalized gradient:

• Changes the argument we pass to the prox function: $y - t\nabla g(y)$ instead of $x - t\nabla g(x)$
• For matrix completion (and t = 1),
$$X - \nabla g(X) = \underbrace{P_\Omega(A)}_{\text{sparse}} + \underbrace{P_\Omega^\perp(X)}_{\text{low rank}} \;\Rightarrow\; \text{fast SVD}$$
$$Y - \nabla g(Y) = \underbrace{P_\Omega(A)}_{\text{sparse}} + \underbrace{P_\Omega^\perp(Y)}_{\text{not necessarily low rank}} \;\Rightarrow\; \text{slow SVD}$$
23
Soft-impute uses L = 1 and exploits special structure ... so it can outperform fancier methods. E.g., soft-impute (solid blue line) vs accelerated generalized gradient (dashed black line):
[Figure: convergence comparison on a small problem (left) and a big problem (right)]
(From Mazumder et al. (2011), Spectral regularization algorithms for learning large incomplete matrices)
24
Optimization for well-behaved problems
For statistical learning problems, "well-behaved" means:

• signal-to-noise ratio is decently high
• correlations between predictor variables are under control
• number of predictors p can be larger than the number of observations n, but not absurdly so
For well-behaved learning problems, people have observed that gradient or generalized gradient descent can converge extremely quickly (much faster than predicted by the O(1/k) rate)

This is largely unexplained by theory, and a topic of current research. E.g., very recent work⁴ shows that for some well-behaved problems, with high probability:
$$\|x^{(k)} - x^\star\|^2 \le c^k\,\|x^{(0)} - x^\star\|^2 + o\!\left(\|x^\star - x_{\mathrm{true}}\|^2\right)$$
⁴ Agarwal et al. (2012), Fast global convergence of gradient methods for high-dimensional statistical recovery
25
References
Nesterov’s four ideas (three acceleration methods):
• Y. Nesterov (1983), A method for solving a convex programming problem with convergence rate O(1/k²)
• Y. Nesterov (1988), On an approach to the construction of optimal methods of minimization of smooth convex functions
• Y. Nesterov (2005), Smooth minimization of non-smooth functions
• Y. Nesterov (2007), Gradient methods for minimizing composite objective function
26
Extensions and/or analyses:
• A. Beck and M. Teboulle (2008), A fast iterative shrinkage-thresholding algorithm for linear inverse problems
• S. Becker, J. Bobin, and E. Candes (2009), NESTA: A fast and accurate first-order method for sparse recovery
• P. Tseng (2008), On accelerated proximal gradient methods for convex-concave optimization
and there are many more ...
Helpful lecture notes/books:
• E. Candes, Lecture Notes for Math 301, Stanford University, Winter 2010-2011
• Y. Nesterov (2004), Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Chapter 2
• L. Vandenberghe, Lecture Notes for EE 236C, UCLA, Spring 2011-2012
27