L. Vandenberghe ECE236C (Spring 2019)
3. Subgradient method
• subgradient method
• convergence analysis
• optimal step size when f⋆ is known
• alternating projections
• optimality
3.1
Subgradient method
to minimize a nondifferentiable convex function f, choose x_0 and repeat

x_{k+1} = x_k − t_k g_k,   k = 0, 1, . . .

where g_k is any subgradient of f at x_k

Step size rules
• fixed step: t_k constant
• fixed length: t_k ‖g_k‖_2 = ‖x_{k+1} − x_k‖_2 is constant
• diminishing: t_k → 0 and ∑_{k=0}^∞ t_k = ∞
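The update above can be sketched in a few lines. The following is a minimal illustration, not from the slides: the function name `subgrad_method` and the ℓ1 example are our choices.

```python
import numpy as np

def subgrad_method(f, subgrad, x0, step, iters=100):
    """Run x_{k+1} = x_k - t_k g_k and track the best value seen.

    step(k, g) returns the step size t_k; the method is not a descent
    method, so the best iterate is kept explicitly.
    """
    x = np.asarray(x0, dtype=float)
    f_best, x_best = f(x), x.copy()
    for k in range(iters):
        g = subgrad(x)
        x = x - step(k, g) * g
        if f(x) < f_best:
            f_best, x_best = f(x), x.copy()
    return x_best, f_best

# example: f(x) = ||x||_1 (minimum value 0), subgradient sign(x),
# diminishing step t_k = 0.5 / sqrt(k+1)
f = lambda x: np.abs(x).sum()
subgrad = lambda x: np.sign(x)
x_best, f_best = subgrad_method(f, subgrad, [1.0, -2.0],
                                lambda k, g: 0.5 / np.sqrt(k + 1), iters=500)
```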
Subgradient method 3.2
Assumptions
• f has finite optimal value f⋆ and minimizer x⋆
• f is convex with dom f = R^n
• f is Lipschitz continuous with constant G > 0:

|f(x) − f(y)| ≤ G ‖x − y‖_2   for all x, y

this is equivalent to ‖g‖_2 ≤ G for all x and g ∈ ∂f(x) (see next page)
Subgradient method 3.3
Proof.
• assume ‖g‖_2 ≤ G for all subgradients; choose g_y ∈ ∂f(y), g_x ∈ ∂f(x):

g_x^T (x − y) ≥ f(x) − f(y) ≥ g_y^T (x − y)

by the Cauchy–Schwarz inequality,

G ‖x − y‖_2 ≥ f(x) − f(y) ≥ −G ‖x − y‖_2

• assume ‖g‖_2 > G for some g ∈ ∂f(x); take y = x + g/‖g‖_2:

f(y) ≥ f(x) + g^T (y − x) = f(x) + ‖g‖_2 > f(x) + G

since ‖y − x‖_2 = 1, this contradicts Lipschitz continuity
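As a quick numeric illustration of the equivalence (not part of the proof): for f(x) = ‖x‖_1 every subgradient sign(x) satisfies ‖g‖_2 ≤ √n, and the corresponding Lipschitz bound with G = √n holds for random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
G = np.sqrt(n)          # Lipschitz constant of ||.||_1 w.r.t. the 2-norm

f = lambda x: np.abs(x).sum()
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    g = np.sign(x)      # a subgradient of f at x
    assert np.linalg.norm(g) <= G + 1e-12
    assert abs(f(x) - f(y)) <= G * np.linalg.norm(x - y) + 1e-9
ok = True
```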
Subgradient method 3.4
Analysis
• the subgradient method is not a descent method
• therefore we track f_best,k = min_{i=0,...,k} f(x_i), which can be less than f(x_k)
• the key quantity in the analysis is the distance to the optimal set
Progress in one iteration
• distance to x⋆:

‖x_{i+1} − x⋆‖_2^2 = ‖x_i − t_i g_i − x⋆‖_2^2
  = ‖x_i − x⋆‖_2^2 − 2 t_i g_i^T (x_i − x⋆) + t_i^2 ‖g_i‖_2^2
  ≤ ‖x_i − x⋆‖_2^2 − 2 t_i (f(x_i) − f⋆) + t_i^2 ‖g_i‖_2^2

(the last step uses the subgradient inequality g_i^T (x_i − x⋆) ≥ f(x_i) − f⋆)
• best function value: combine the inequalities for i = 0, . . . , k:

2 (∑_{i=0}^k t_i)(f_best,k − f⋆) ≤ ‖x_0 − x⋆‖_2^2 − ‖x_{k+1} − x⋆‖_2^2 + ∑_{i=0}^k t_i^2 ‖g_i‖_2^2
  ≤ ‖x_0 − x⋆‖_2^2 + ∑_{i=0}^k t_i^2 ‖g_i‖_2^2
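The combined inequality can be checked numerically. A small sketch (our example: f(x) = ‖x‖_1, so f⋆ = 0 and x⋆ = 0, with a fixed step):

```python
import numpy as np

# check 2(sum t_i)(f_best - f*) <= ||x0 - x*||^2 + sum t_i^2 ||g_i||^2
# on f(x) = ||x||_1, fixed step t_i = 0.01
f = lambda x: np.abs(x).sum()
x = np.array([1.5, -0.7, 0.3])
x0 = x.copy()
t, iters = 0.01, 200
f_best, num, den = f(x), 0.0, 0.0
for i in range(iters):
    g = np.sign(x)              # subgradient at the current iterate
    num += t**2 * (g @ g)       # accumulate sum t_i^2 ||g_i||^2
    den += 2 * t                # accumulate 2 sum t_i
    x = x - t * g
    f_best = min(f_best, f(x))
bound = (x0 @ x0 + num) / den   # right-hand side divided by 2 sum t_i
```

After the loop, `f_best - 0.0` is below `bound`, as the analysis guarantees.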
Subgradient method 3.5
Fixed step size and fixed step length
Fixed step size: t_i = t with t constant

f_best,k − f⋆ ≤ ‖x_0 − x⋆‖_2^2 / (2(k+1)t) + G^2 t / 2

• does not guarantee convergence of f_best,k
• for large k, f_best,k is approximately G^2 t / 2-suboptimal

Fixed step length: t_i = s/‖g_i‖_2 with s constant

f_best,k − f⋆ ≤ G ‖x_0 − x⋆‖_2^2 / (2(k+1)s) + G s / 2

• does not guarantee convergence of f_best,k
• for large k, f_best,k is approximately G s / 2-suboptimal
Subgradient method 3.6
Diminishing step size
t_i → 0,   ∑_{i=0}^∞ t_i = ∞

• bound on function value:

f_best,k − f⋆ ≤ ‖x_0 − x⋆‖_2^2 / (2 ∑_{i=0}^k t_i) + G^2 (∑_{i=0}^k t_i^2) / (2 ∑_{i=0}^k t_i)

• one can show that (∑_{i=0}^k t_i^2) / (∑_{i=0}^k t_i) → 0; hence f_best,k converges to f⋆
• examples: t_i = τ/(i+1) or t_i = τ/√(i+1)
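The contrast with the previous page can be seen on a toy problem. A sketch (our example: f(x) = ‖x‖_1 in R^2, starting point chosen so the iterates do not land exactly on the minimizer):

```python
import numpy as np

def run(step, iters=3000, x0=(2.03, -1.007)):
    """Subgradient method on f(x) = ||x||_1 (f* = 0); returns f_best."""
    x = np.array(x0, dtype=float)
    f_best = np.abs(x).sum()
    for k in range(iters):
        x = x - step(k) * np.sign(x)
        f_best = min(f_best, np.abs(x).sum())
    return f_best

fixed = run(lambda k: 0.1)                   # stalls at a nonzero level (<= G^2 t / 2)
dimin = run(lambda k: 0.1 / np.sqrt(k + 1))  # keeps improving as k grows
```

With the fixed step the iterates oscillate around the minimizer at a distance proportional to t; with the diminishing rule the oscillation shrinks and f_best,k → f⋆.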
Subgradient method 3.7
Example: 1-norm minimization
minimize ‖Ax − b‖_1

• a subgradient is given by A^T sign(Ax − b)
• example with A ∈ R^{500×100}, b ∈ R^{500}

Fixed step length t_k = s/‖g_k‖_2 for s = 0.1, 0.01, 0.001
[Figure: (f(x_k) − f⋆)/f⋆ versus k (left, 500 iterations) and (f_best,k − f⋆)/f⋆ versus k (right, 3000 iterations), on log scale from 10^0 down to 10^−4, for s = 0.1, 0.01, 0.001]
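A scaled-down version of this experiment fits in a few lines. The sketch below uses A ∈ R^{50×10} and one fixed step length s = 0.01 (our choices; the random data differ from the slide's, so the exact curve does too):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
f = lambda x: np.abs(A @ x - b).sum()

x = np.zeros(10)
s = 0.01                            # fixed step length: t_k = s / ||g_k||_2
f_best = f(x)
for k in range(3000):
    g = A.T @ np.sign(A @ x - b)    # subgradient of ||Ax - b||_1
    x = x - (s / np.linalg.norm(g)) * g
    f_best = min(f_best, f(x))
```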
Subgradient method 3.8
Diminishing step size: t_k = 0.01/√(k+1) and t_k = 0.01/(k+1)

[Figure: (f_best,k − f⋆)/f⋆ versus k over 5000 iterations, on log scale from 10^0 down to 10^−5, for the two step size rules; t_k = 0.01/√(k+1) reaches lower values than t_k = 0.01/(k+1)]
Subgradient method 3.9
Optimal step size for fixed number of iterations
from page 3.5: if s_i = t_i ‖g_i‖_2 and ‖x_0 − x⋆‖_2 ≤ R, then

f_best,k − f⋆ ≤ G (R^2 + ∑_{i=0}^k s_i^2) / (2 ∑_{i=0}^k s_i)

• for given k, the right-hand side is minimized by the fixed step length

s_i = s = R/√(k+1)

• the resulting bound after k steps is

f_best,k − f⋆ ≤ GR/√(k+1)

• this guarantees an accuracy f_best,k − f⋆ ≤ ε in k = O(1/ε^2) iterations
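The minimizing step length follows from a one-line calculation: writing the bound as a function of s,

```latex
\min_{s>0}\;\frac{G\bigl(R^2+(k+1)s^2\bigr)}{2(k+1)s}
  \;=\;\min_{s>0}\;\frac{G}{2}\Bigl(\frac{R^2}{(k+1)s}+s\Bigr),
```

the two terms balance (derivative zero) at s^2 = R^2/(k+1), i.e. s = R/√(k+1), and substituting back gives the value GR/√(k+1).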
Subgradient method 3.10
Optimal step size when f⋆ is known

• the right-hand side in the first inequality of page 3.5 is minimized by

t_i = (f(x_i) − f⋆) / ‖g_i‖_2^2

• the optimized bound is

(f(x_i) − f⋆)^2 / ‖g_i‖_2^2 ≤ ‖x_i − x⋆‖_2^2 − ‖x_{i+1} − x⋆‖_2^2

• applying this recursively from i = 0 to i = k (and using ‖g_i‖_2 ≤ G) gives

f_best,k − f⋆ ≤ G ‖x_0 − x⋆‖_2 / √(k+1)
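This step size (often called the Polyak step) is easy to try when f⋆ is known. A sketch on f(x) = ‖x‖_1, where f⋆ = 0 (our example):

```python
import numpy as np

# Polyak step t_i = (f(x_i) - f*) / ||g_i||^2 on f(x) = ||x||_1, f* = 0
f = lambda x: np.abs(x).sum()
x = np.array([3.0, -1.0, 2.5])
f_star = 0.0
f_best = f(x)
for i in range(200):
    g = np.sign(x)
    if not g.any():          # zero subgradient: we are at the minimizer
        break
    x = x - ((f(x) - f_star) / (g @ g)) * g
    f_best = min(f_best, f(x))
```

On this instance the step adapts automatically: no tuning parameter is needed, and f_best converges to f⋆.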
Subgradient method 3.11
Exercise: find point in intersection of convex sets
find a point in the intersection of m closed convex sets C_1, . . . , C_m:

minimize f(x) = max{f_1(x), . . . , f_m(x)}

where f_j(x) = inf_{y∈C_j} ‖x − y‖_2 is the Euclidean distance of x to C_j

• f⋆ = 0 if the intersection is nonempty
• (from page 2.14) g ∈ ∂f(x̂) if g ∈ ∂f_j(x̂) and C_j is the farthest set from x̂
• (from page 2.20) a subgradient g ∈ ∂f_j(x̂) follows from the projection P_j(x̂) on C_j:

g = 0 if x̂ ∈ C_j,   g = (x̂ − P_j(x̂)) / ‖x̂ − P_j(x̂)‖_2 if x̂ ∉ C_j

note that ‖g‖_2 = 1 if x̂ ∉ C_j
Subgradient method 3.12
Subgradient method
• the optimal step size (page 3.11) for f⋆ = 0 and ‖g_i‖_2 = 1 is t_i = f(x_i)
• at iteration k, find the farthest set C_j (with f(x_k) = f_j(x_k)), and take

x_{k+1} = x_k − (f(x_k)/f_j(x_k)) (x_k − P_j(x_k)) = P_j(x_k)

at each step, we project the current point onto the farthest set
• a version of the alternating projections algorithm
• for m = 2, the projections alternate onto one set, then the other
• later, we will see faster versions of this method that are almost as simple
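For m = 2 this reduces to alternating between the two projections. A toy instance (the two sets, a unit ball and a hyperplane, are our illustrative choices):

```python
import numpy as np

# alternating projections onto C1 = unit ball, C2 = hyperplane {x : a.x = b}
a = np.array([1.0, 0.0]); b = 0.5        # ||a||_2 = 1, so C1 ∩ C2 is nonempty
proj_ball  = lambda x: x / max(1.0, np.linalg.norm(x))
proj_plane = lambda x: x - (a @ x - b) * a

x = np.array([3.0, 4.0])
for _ in range(50):
    x = proj_plane(proj_ball(x))         # project onto each set in turn

dist_ball  = max(0.0, np.linalg.norm(x) - 1.0)   # distance to C1
dist_plane = abs(a @ x - b)                      # distance to C2
```

The final point lies (numerically) in both sets, i.e. both distances are zero.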
Subgradient method 3.13
Optimality of the subgradient method
can the bound f_best,k − f⋆ ≤ GR/√(k+1) on page 3.10 be improved?

Problem class
• f is convex, with a minimizer x⋆
• we know a starting point x_0 with ‖x_0 − x⋆‖_2 ≤ R
• we know the Lipschitz constant G of f on {x | ‖x − x⋆‖_2 ≤ R}
• f is defined by an oracle: given x, the oracle returns f(x) and a g ∈ ∂f(x)

Algorithm class
• the algorithm can choose any x_{i+1} from the set x_0 + span{g_0, g_1, . . . , g_i}
• we stop after a fixed number k of iterations
Subgradient method 3.14
Test problem and oracle
f(x) = max_{i=1,...,k+1} x_i + (1/2)‖x‖_2^2   (with k < n),   x_0 = 0

• subdifferential: ∂f(x) = {x} + conv{e_j | 1 ≤ j ≤ k+1, x_j = max_{i=1,...,k+1} x_i}
• solution and optimal value:

x⋆ = −(1/(k+1), . . . , 1/(k+1), 0, . . . , 0)   (first k+1 entries equal −1/(k+1)),   f⋆ = −1/(2(k+1))

• distance of starting point to solution: R = ‖x_0 − x⋆‖_2 = 1/√(k+1)
• Lipschitz constant on {x | ‖x − x⋆‖_2 ≤ R}:

G = sup_{g∈∂f(x), ‖x−x⋆‖_2≤R} ‖g‖_2 ≤ 2/√(k+1) + 1

• the oracle returns the subgradient e_ĵ + x, where ĵ = min{j | x_j = max_{i=1,...,k+1} x_i}
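A quick numeric sanity check of the stated solution (the instance sizes k = 4, n = 10 are our illustrative choices):

```python
import numpy as np

# the test problem f(x) = max_{i<=k+1} x_i + (1/2)||x||_2^2 for k = 4, n = 10
k, n = 4, 10
f = lambda x: np.max(x[:k + 1]) + 0.5 * (x @ x)

x_star = np.zeros(n)
x_star[:k + 1] = -1.0 / (k + 1)      # first k+1 entries equal -1/(k+1)
f_star = -1.0 / (2 * (k + 1))        # claimed optimal value

# f is convex with minimizer x_star, so random perturbations cannot go lower
rng = np.random.default_rng(2)
mins = min(f(x_star + 1e-3 * rng.standard_normal(n)) for _ in range(1000))
```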
Subgradient method 3.15
Iteration
• after i ≤ k iterations of any algorithm in the algorithm class, only the first i entries of x_i can be nonzero; since entry i+1 ≤ k+1 of x_i is zero, the max term is nonnegative and

f(x_i) ≥ (1/2)‖x_i‖_2^2 ≥ 0

• suboptimality after k iterations (x_0 = 0, so f_best,k = 0):

f_best,k − f⋆ = −f⋆ = 1/(2(k+1)) = GR / (2(2 + √(k+1)))

Conclusion
• the example shows that the O(GR/√k) bound cannot be improved
• the subgradient method is “optimal” (for this problem and algorithm class)
Subgradient method 3.16
Summary: subgradient method
• handles general nondifferentiable convex problems
• often leads to very simple algorithms
• convergence can be very slow
• no good stopping criterion
• theoretical complexity: O(1/ε^2) iterations to find an ε-suboptimal point
• an “optimal” first-order method: O(1/ε2) bound cannot be improved
Subgradient method 3.17
References
• S. Boyd, Lecture slides and notes for EE364b, Convex Optimization II.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 3.2.3. The example on page 3.15 is in §3.2.1.
• B. T. Polyak, Introduction to Optimization (1987), section 5.3.
Subgradient method 3.18