
ORIE 6326: Convex Optimization

Unconstrained minimization

Professor Udell

Operations Research and Information Engineering, Cornell

April 30, 2017

Slides adapted from Stanford EE364a


Outline

◮ terminology and assumptions

◮ gradient descent method

◮ steepest descent method

◮ Newton’s method

◮ self-concordant functions

◮ implementation


Unconstrained minimization

minimize f (x)

◮ f : Rn → R convex, continuously differentiable (hence dom f open)

◮ we assume optimal value p⋆ = infx f (x) is attained (and finite)

◮ we assume a starting point x (0) such that x (0) ∈ dom f is known

unconstrained minimization methods

◮ produce sequence of points x (k) ∈ dom f , k = 0, 1, . . . with

f (x (k))→ p⋆

◮ can be interpreted as iterative methods for solving the optimality condition

∇f (x⋆) = 0


Descent methods

x (k+1) = x (k) + t(k)∆x (k) with f (x (k+1)) < f (x (k))

◮ other notations: x+ = x + t∆x , x := x + t∆x , x ← x + t∆x
◮ ∆x is the step, or search direction
◮ t is the step size, or step length
◮ from convexity, f (x+) ≥ f (x) + t∇f (x)T∆x , so

f (x+) < f (x) implies ∇f (x)T∆x < 0

(i.e., ∆x is a descent direction)

Algorithm 1 General descent method.

given a starting point x ∈ dom f .
repeat
1. Determine a descent direction ∆x .
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x .

until stopping criterion is satisfied.


Step size choices

◮ constant step size t(k) = t

◮ decreasing step size t(k) = 1/k

◮ line search


Line search types

exact line search: t = argmin_{t>0} f (x + t∆x)

◮ how? bisection to find a zero of g ′(t) where g(t) = f (x + t∆x)

◮ takes O(log(1/ǫ)) steps to find t with |g ′(t)| ≤ ǫ

backtracking line search (with parameters a ∈ (0, 1/2], b ∈ (0, 1))
◮ starting at t = 1, repeat t := bt until

f (x + t∆x) < f (x) + at∇f (x)T∆x

◮ graphical interpretation: backtrack until t ≤ t0

[figure: f (x + t∆x) vs. t, with the lines f (x) + t∇f (x)T∆x and f (x) + at∇f (x)T∆x; backtracking accepts any t ≤ t0]
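for concreteness, here is a minimal Python sketch of the bisection idea (not from the slides; grad and the starting bracket are illustrative, and ∆x is assumed to be a descent direction so that g′(0) < 0):

import numpy as np

def exact_line_search(grad, x, dx, eps=1e-8):
    # approximate t = argmin_{t>0} f(x + t*dx) by bisection on g'(t) = grad(x + t*dx) @ dx
    gprime = lambda t: grad(x + t * dx) @ dx
    lo, hi = 0.0, 1.0
    while gprime(hi) < 0:            # grow the bracket until g'(hi) >= 0 (assumes the minimum is attained)
        hi *= 2.0
    while hi - lo > eps:             # each step halves the bracket, so O(log(1/eps)) iterations
        mid = 0.5 * (lo + hi)
        if gprime(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# example: f(x) = (1/2)(x1^2 + 10*x2^2) from x = (10, 1); the exact step is 2/11
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x0 = np.array([10.0, 1.0])
print(exact_line_search(grad, x0, -grad(x0)))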


Gradient descent method

general descent method with ∆x = −∇f (x)

Algorithm 2 Gradient descent method.

given a starting point x ∈ dom f .
repeat
1. ∆x := −∇f (x).
2. Line search. Choose a step size t > 0.
3. Update. x := x + t∆x .

until stopping criterion is satisfied.

◮ stopping criterion usually of the form ‖∇f (x)‖2 ≤ ǫ

◮ very simple, but often very slow
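for concreteness, a minimal Python sketch of the gradient descent method with backtracking line search (not from the slides; f, grad, and the parameter values are illustrative, and dom f = Rn is assumed):

import numpy as np

def gradient_descent(f, grad, x0, a=0.25, b=0.5, eps=1e-6, max_iters=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) <= eps:                       # stopping criterion ||grad f(x)||_2 <= eps
            break
        dx = -g                                            # gradient descent direction
        t = 1.0
        while f(x + t * dx) >= f(x) + a * t * (g @ dx):    # backtracking line search
            t *= b
        x = x + t * dx
    return x

# example: the quadratic f(x) = (1/2)(x1^2 + 10*x2^2)
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
print(gradient_descent(f, grad, [10.0, 1.0]))              # converges to (0, 0), but slowly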


quadratic problem in R2

f (x) = (1/2)(x1^2 + γx2^2) (γ > 0)

with exact line search, starting at x (0) = (γ, 1):

x1 (k) = γ ((γ − 1)/(γ + 1))^k ,    x2 (k) = (−(γ − 1)/(γ + 1))^k

◮ very slow if γ ≫ 1 or γ ≪ 1

◮ example for γ = 10:

[figure: contour lines of f with the iterates x(0), x(1), . . . zig-zagging toward x⋆ = 0; x1 ∈ [−10, 10], x2 ∈ [−4, 4]]
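the closed form is easy to spot-check numerically (a sketch, not from the slides); for a quadratic, the exact line search step is t = ‖∇f (x)‖2 / (∇f (x)T H∇f (x)) with H = diag(1, γ):

import numpy as np

gamma = 10.0
H = np.diag([1.0, gamma])
x = np.array([gamma, 1.0])                          # x(0) = (gamma, 1)
r = (gamma - 1.0) / (gamma + 1.0)
for k in range(1, 6):
    g = H @ x                                       # gradient of f(x) = (1/2) x'Hx
    t = (g @ g) / (g @ H @ g)                       # exact line search step for a quadratic
    x = x - t * g
    closed = np.array([gamma * r ** k, (-r) ** k])  # claimed closed-form iterate
    print(k, x, closed)                             # the two agree to machine precision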


nonquadratic example

f (x1, x2) = exp(x1 + 3x2 − 0.1) + exp(x1 − 3x2 − 0.1) + exp(−x1 − 0.1)

[figure: contour plots with iterates x(0), x(1), x(2) under backtracking line search (left) and x(0), x(1) under exact line search (right)]


a problem in R100

f (x) = cT x − ∑_{i=1}^{500} log(bi − aiT x)

[figure: f (x (k)) − p⋆ vs. iteration k on a semilog plot (10^−4 to 10^4, k from 0 to 200), for exact and backtracking line search]

‘linear’ convergence, i.e., a straight line on a semilog plot
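the slide's data (ai , bi , c) is not reproduced here, but a random instance from the same family behaves similarly; a hedged sketch (illustrative names, randomly generated data) that rejects infeasible trial steps by returning +inf outside dom f :

import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 100
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, m)                        # x = 0 is strictly feasible
c = rng.standard_normal(n)                          # with m >> n the feasible set is bounded w.h.p., so p* is attained

def f(x):
    s = b - A @ x
    return np.inf if np.any(s <= 0) else c @ x - np.sum(np.log(s))

grad = lambda x: c + A.T @ (1.0 / (b - A @ x))

x = np.zeros(n)
for k in range(200):
    g = grad(x)
    t, dx = 1.0, -g
    while f(x + t * dx) >= f(x) + 0.5 * t * (g @ dx):   # backtracking; +inf rejects infeasible steps
        t *= 0.5
    x = x + t * dx
print(np.linalg.norm(grad(x)))                      # gradient norm after 200 iterations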


Quadratic upper and lower bounds

what problem does the gradient descent step x := x − t∇f (x) solve? answer:

minimize (over y)    f (x) + ∇f (x)T (y − x) + (1/(2t))‖x − y‖2

so gradient descent will work best if the Hessian is “almost” a scaled multiple of the identity.

Formally, find α ∈ R and β ∈ R so that for all x , y ∈ dom f ,

f (x) + ∇f (x)T (y − x) + (α/2)‖x − y‖2 ≤ f (y) ≤ f (x) + ∇f (x)T (y − x) + (β/2)‖x − y‖2

◮ lower bound is called strong convexity (with parameter α)

◮ upper bound is called smoothness (with parameter β)

◮ clearly α ≤ β


Example I: quadratic

f (x) + ∇f (x)T (y − x) + (α/2)‖x − y‖2 ≤ f (y) ≤ f (x) + ∇f (x)T (y − x) + (β/2)‖x − y‖2

◮ lower bound is called strong convexity (with parameter α)
◮ upper bound is called smoothness (with parameter β)

suppose A ∈ Sn+, and consider f (x) = (1/2) xTAx . is f smooth and strongly convex?

◮ strongly convex, with parameter α = λmin(A) (if λmin(A) > 0)
◮ smooth, with parameter β = λmax(A)

proof:

(1/2) yTAy = (1/2) (x + (y − x))TA(x + (y − x))
           = (1/2) xTAx + (Ax)T (y − x) + (1/2) (y − x)TA(y − x)

and

(λmin(A)/2)‖y − x‖2 ≤ (1/2) (y − x)TA(y − x) ≤ (λmax(A)/2)‖y − x‖2

since ∇f (x) = Ax , combining these gives the two-sided bound above.
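a quick numeric spot check of these parameters on a random instance (a sketch, not from the slides):

import numpy as np

rng = np.random.default_rng(1)
n = 5
M = rng.standard_normal((n, n))
A = M.T @ M                                         # A in S^n_+
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]        # lambda_min(A), lambda_max(A)

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    base = f(x) + grad(x) @ (y - x)
    d2 = np.sum((y - x) ** 2)
    assert base + 0.5 * alpha * d2 - 1e-9 <= f(y) <= base + 0.5 * beta * d2 + 1e-9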


Example II: smoothed absolute value

f (x) + ∇f (x)T (y − x) + (α/2)‖x − y‖2 ≤ f (y) ≤ f (x) + ∇f (x)T (y − x) + (β/2)‖x − y‖2

◮ lower bound is called strong convexity (with parameter α)

◮ upper bound is called smoothness (with parameter β)

consider

huber(x) = (1/2)x^2 if x^2 ≤ 1, and |x| − 1/2 otherwise

is huber smooth and strongly convex?

◮ not strongly convex

◮ smooth, with parameter β = 1
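a small numeric check (a sketch, not from the slides): the β = 1 upper bound holds everywhere, while on the linear piece of huber the strong convexity lower bound fails for every α > 0:

import numpy as np

huber = lambda x: np.where(np.abs(x) <= 1, 0.5 * x ** 2, np.abs(x) - 0.5)
dhuber = lambda x: np.where(np.abs(x) <= 1, x, np.sign(x))

ys = np.linspace(-5.0, 5.0, 2001)
for x in (-3.0, -0.5, 0.7, 2.0):
    upper = huber(x) + dhuber(x) * (ys - x) + 0.5 * (ys - x) ** 2   # beta = 1 upper bound
    assert np.all(huber(ys) <= upper + 1e-12)
# on the linear piece the strong convexity lower bound fails for every alpha > 0:
x, y = 2.0, 4.0
print(huber(y) - huber(x) - dhuber(x) * (y - x))    # prints 0.0: no quadratic growth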


Example III: regularized absolute value

f (x) + ∇f (x)T (y − x) + (α/2)‖x − y‖2 ≤ f (y) ≤ f (x) + ∇f (x)T (y − x) + (β/2)‖x − y‖2

◮ lower bound is called strong convexity (with parameter α)

◮ upper bound is called smoothness (with parameter β)

consider f (x) = |x| + x^2. is f smooth and strongly convex?

◮ strongly convex, with parameter α = 1

◮ not smooth


Roadmap

we’ll analyze gradient descent for a few cases:

◮ is f α-strongly convex, or not?

◮ is f β-smooth, or not?

◮ is f Lipschitz differentiable, or not?

◮ do we use a fixed step size, or line-search?

a question: are these rates “the best possible”? how could we tell? compared to what?

we’ll do the β-smooth case first, then work up to the others


Monotone gradient

first, another convexity condition:

◮ a differentiable function f : Rn → R is convex iff for all x ,y ∈ dom f ,

(∇f (x)−∇f (y))T (x − y) ≥ 0

◮ ∇f is called a monotone mapping

◮ strict inequality =⇒ strictly monotone mapping


Monotone gradient: proof

◮ if f is differentiable and convex, then

f (y) ≥ f (x) + ∇f (x)T (y − x) ,    f (x) ≥ f (y) + ∇f (y)T (x − y)

add these to get

0 ≥ (∇f (x) − ∇f (y))T (y − x)

◮ if ∇f is monotone, then for any x , y ∈ dom f , let g(t) = f (x + t(y − x)). for t ≥ 0,

g ′(t) = ∇f (x + t(y − x))T (y − x) ≥ g ′(0).

so

f (y) = g(1) = g(0) + ∫_0^1 g ′(t) dt ≥ g(0) + g ′(0) = f (x) + ∇f (x)T (y − x)


Smoothness: equivalent definitions

for convex f : Rn → R, the following properties are all equivalent:

1. (β/2) xTx − f (x) is convex

2. f is β-smooth: for all x , y ∈ dom f ,

   f (y) ≤ f (x) + ∇f (x)T (y − x) + (β/2)‖x − y‖2

3. (if f is twice differentiable) ∇2f (x) ⪯ βI

4. ∇f is Lipschitz continuous with parameter β: for all x , y ∈ dom f ,

   ‖∇f (x) − ∇f (y)‖ ≤ β‖x − y‖

5. ∇f is co-coercive with parameter 1/β: for all x , y ∈ dom f ,

   (∇f (x) − ∇f (y))T (x − y) ≥ (1/β)‖∇f (x) − ∇f (y)‖2

proof: 2) is first order condition for convexity of 1); 3) is second order condition for convexity of 1); 4) =⇒ 2) by integration; 5) =⇒ 4) using Cauchy-Schwarz; 2) =⇒ 5) using first order condition (Lemma 3.5 of Bubeck). only 2) =⇒ 5) requires convexity.


Smoothness: proofs of equivalence

1. f is β-smooth: for all x , y ∈ dom f ,

   f (y) ≤ f (x) + ∇f (x)T (y − x) + (β/2)‖x − y‖2

2. ∇f is Lipschitz continuous with parameter β: for all x , y ∈ dom f ,

   ‖∇f (x) − ∇f (y)‖ ≤ β‖x − y‖

proof. integrate and use Lipschitz ∇f (last line) to prove smoothness:

f (y) − f (x) − ∇f (x)T (y − x)
   = ∫_0^1 ∇f (x + t(y − x))T (y − x) dt − ∇f (x)T (y − x)
   = ∫_0^1 (∇f (x + t(y − x)) − ∇f (x))T (y − x) dt
   ≤ ∫_0^1 ‖∇f (x + t(y − x)) − ∇f (x)‖ ‖y − x‖ dt      (Cauchy-Schwarz)
   ≤ ∫_0^1 βt‖y − x‖2 dt = (β/2)‖y − x‖2


Smoothness: proofs of equivalence

1. ∇f is Lipschitz continuous with parameter β: for all x , y ∈ dom f ,

   ‖∇f (x) − ∇f (y)‖ ≤ β‖x − y‖

2. ∇f is co-coercive with parameter 1/β: for all x , y ∈ dom f ,

   (∇f (x) − ∇f (y))T (x − y) ≥ (1/β)‖∇f (x) − ∇f (y)‖2

proof. use Cauchy-Schwarz (first inequality) and co-coercivity (second):

‖∇f (x) − ∇f (y)‖ ‖x − y‖ ≥ (∇f (x) − ∇f (y))T (x − y) ≥ (1/β)‖∇f (x) − ∇f (y)‖2

divide through by ‖∇f (x) − ∇f (y)‖ to get Lipschitz continuity:

‖x − y‖ ≥ (1/β)‖∇f (x) − ∇f (y)‖


Analysis of gradient descent for smooth functions

three assumptions for analysis:

◮ f : Rn → R convex and differentiable with dom f = Rn

◮ f is smooth with parameter β > 0

◮ optimal value p⋆ = infx f (x) finite and attained at x⋆

Algorithm: GD with constant step size.

pick constant step size 0 < t ≤ 1/β, and repeat

x (k+1) = x (k) − t∇f (x (k))

(nb: not implementable without guess for β)


Analysis of gradient descent for smooth functions

use quadratic upper bound with y = x+ = x − t∇f (x):

f (x+) ≤ f (x) + ∇f (x)T (x+ − x) + (β/2)‖x+ − x‖2
       = f (x) − t‖∇f (x)‖2 + (t^2 β/2)‖∇f (x)‖2

if constant step size 0 < t ≤ 1/β,

f (x+) ≤ f (x) − (t/2)‖∇f (x)‖2
       ≤ f (x⋆) + ∇f (x)T (x − x⋆) − (t/2)‖∇f (x)‖2
       = p⋆ + (1/(2t))(‖x − x⋆‖2 − ‖x − x⋆ − t∇f (x)‖2)
       = p⋆ + (1/(2t))(‖x − x⋆‖2 − ‖x+ − x⋆‖2)

(second line uses the first order convexity condition f (x) ≤ f (x⋆) + ∇f (x)T (x − x⋆))


Analysis of gradient descent for smooth functions

take average over iteration counter i = 1, . . . , k :

(1/k) ∑_{i=1}^{k} (f (x (i)) − p⋆) ≤ (1/k) ∑_{i=1}^{k} (1/(2t))(‖x (i−1) − x⋆‖2 − ‖x (i) − x⋆‖2)

                                  ≤ (1/(2tk))(‖x (0) − x⋆‖2 − ‖x (k) − x⋆‖2)

                                  ≤ (1/(2tk))‖x (0) − x⋆‖2

since f (x (k)) is non-increasing,

f (x (k)) − p⋆ ≤ (1/(2tk))‖x (0) − x⋆‖2

so number of iterations k to reach f (x (k))− p⋆ ≤ ǫ is O(1/ǫ)
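this bound is easy to verify numerically on a smooth test problem (a sketch on a random quadratic, not from the slides):

import numpy as np

rng = np.random.default_rng(2)
n = 20
M = rng.standard_normal((n, n))
A = M.T @ M
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ A @ x - b @ x               # beta-smooth with beta = lambda_max(A)
beta = np.linalg.eigvalsh(A)[-1]
x_star = np.linalg.solve(A, b)
p_star = f(x_star)

t = 1.0 / beta
x = np.zeros(n)
R2 = np.sum((x - x_star) ** 2)                      # ||x(0) - x_star||^2
for k in range(1, 101):
    x = x - t * (A @ x - b)                         # grad f(x) = Ax - b
    assert f(x) - p_star <= R2 / (2 * t * k) + 1e-9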


Analysis of gradient descent for smooth functions

now, with line search!

◮ t chosen by line search with parameters (a, b) = (1/2, 1/2) (to simplify proofs), so x+ = x − t∇f (x) satisfies

   f (x+) < f (x) − (t/2)‖∇f (x)‖2.

◮ from smoothness of f , we know t = 1/β would work

◮ so line search returns t ≥ b/β = 1/(2β)

Algorithm: GD with line search.

pick line search parameters (a, b) = (1/2, 1/2) and x (0) ∈ Rn, and repeat

1. compute ∇f (x (k))

2. find t(k) by line search

3. update x (k+1) = x (k) − t(k)∇f (x (k))


Analysis of gradient descent for smooth functions

using line search condition,

f (x+) ≤ f (x) − (t/2)‖∇f (x)‖2
       ≤ f (x⋆) + ∇f (x)T (x − x⋆) − (t/2)‖∇f (x)‖2
       = p⋆ + (1/(2t))(‖x − x⋆‖2 − ‖x − x⋆ − t∇f (x)‖2)
       = p⋆ + (1/(2t))(‖x − x⋆‖2 − ‖x+ − x⋆‖2)
       ≤ p⋆ + β(‖x − x⋆‖2 − ‖x+ − x⋆‖2)

second line uses the first order convexity condition, last line uses 1/t ≤ β/b = 2β


Analysis of gradient descent for smooth functions

take average over iteration counter i = 1, . . . , k :

(1/k) ∑_{i=1}^{k} (f (x (i)) − p⋆) ≤ (1/k) ∑_{i=1}^{k} β(‖x (i−1) − x⋆‖2 − ‖x (i) − x⋆‖2)

                                  ≤ (β/k)(‖x (0) − x⋆‖2 − ‖x (k) − x⋆‖2)

                                  ≤ (β/k)‖x (0) − x⋆‖2

since f (x (k)) is non-increasing,

f (x (k)) − p⋆ ≤ (β/k)‖x (0) − x⋆‖2

so number of iterations k to reach f (x (k))− p⋆ ≤ ǫ is O(1/ǫ)


Strong convexity: equivalent definitions

for convex f : Rn → R, the following properties are all equivalent:

1. f (x) − (α/2) xTx is convex

2. f is α-strongly convex: for all x , y ∈ dom f ,

   f (y) ≥ f (x) + ∇f (x)T (y − x) + (α/2)‖x − y‖2

3. (if f is twice differentiable) ∇2f (x) ⪰ αI

4. ∇f is coercive with parameter α: for all x , y ∈ dom f ,

   (∇f (x) − ∇f (y))T (x − y) ≥ α‖x − y‖2

proof: 2) is first order condition for convexity of 1); 3) is second order condition for convexity of 1); 4) is monotone gradient condition for 1)


Strong convexity + smoothness

if f is α-strongly convex and β-smooth, then h(x) = f (x) − (α/2)‖x‖2 is convex and (β − α)-smooth:

(∇h(x) − ∇h(y))T (x − y) ≥ (1/(β − α))‖∇h(x) − ∇h(y)‖2

expand h to show

(∇f (x) − ∇f (y))T (x − y) ≥ (αβ/(α + β))‖x − y‖2 + (1/(α + β))‖∇f (x) − ∇f (y)‖2
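the combined inequality can be spot-checked numerically for a quadratic, where α and β are the extreme eigenvalues (a sketch, not from the slides):

import numpy as np

rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M.T @ M + 0.5 * np.eye(n)                       # ensures lambda_min(A) > 0
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]        # f(x) = (1/2) x'Ax is alpha-sc and beta-smooth
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    lhs = (grad(x) - grad(y)) @ (x - y)
    rhs = (alpha * beta / (alpha + beta)) * np.sum((x - y) ** 2) \
        + (1.0 / (alpha + beta)) * np.sum((grad(x) - grad(y)) ** 2)
    assert lhs >= rhs - 1e-9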


Analysis of gradient descent for SSC functions

assumptions for analysis:

◮ f : Rn → R convex and differentiable with dom f = Rn

◮ f is smooth with parameter β > 0

◮ f is strongly convex with parameter α > 0

◮ optimal value p⋆ = infx f (x) finite and attained at x⋆

Algorithm: GD with constant step size.

pick constant step size 0 < t ≤ 2/(α + β), and repeat

x (k+1) = x (k) − t∇f (x (k))

note α < β =⇒ 1/β < 2/(α + β), so SSC allows larger step sizes than smoothness alone


Analysis of gradient descent for SSC functions

‖x+ − x⋆‖2 = ‖x − t∇f (x) − x⋆‖2
           = ‖x − x⋆‖2 + t^2‖∇f (x)‖2 − 2t∇f (x)T (x − x⋆)
           ≤ ‖x − x⋆‖2 + t^2‖∇f (x)‖2 − 2t((αβ/(α + β))‖x − x⋆‖2 + (1/(α + β))‖∇f (x)‖2)
           = (1 − t(2αβ/(α + β)))‖x − x⋆‖2 + t(t − 2/(α + β))‖∇f (x)‖2
           ≤ (1 − t(2αβ/(α + β)))‖x − x⋆‖2

(first inequality uses coercivity + co-coercivity with y = x⋆, last uses t ≤ 2/(α + β))


Analysis of gradient descent for SSC functions

◮ distance to optimum decreases by a factor c = 1 − t(2αβ/(α + β)) every iteration:

   ‖x (k) − x⋆‖2 ≤ c^k‖x (0) − x⋆‖2

   i.e., “linear convergence”

◮ if t = 2/(α + β), c = ((κ − 1)/(κ + 1))^2, where κ = β/α ≥ 1 is the condition number

◮ using quadratic upper bound, get bound on function value

   f (x (k)) − p⋆ ≤ (β/2)‖x (k) − x⋆‖2 ≤ (βc^k/2)‖x (0) − x⋆‖2

◮ so number of iterations k to reach f (x (k))− p⋆ ≤ ǫ is O(log(1/ǫ))
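the contraction factor is easy to verify on a strongly convex quadratic (a sketch, not from the slides):

import numpy as np

rng = np.random.default_rng(4)
n = 10
M = rng.standard_normal((n, n))
A = M.T @ M + np.eye(n)                             # alpha = lambda_min(A) >= 1
b = rng.standard_normal(n)
alpha, beta = np.linalg.eigvalsh(A)[[0, -1]]
kappa = beta / alpha
c = ((kappa - 1.0) / (kappa + 1.0)) ** 2
t = 2.0 / (alpha + beta)

x_star = np.linalg.solve(A, b)                      # minimizer of (1/2) x'Ax - b'x
x = np.zeros(n)
R2 = np.sum((x - x_star) ** 2)
for k in range(1, 51):
    x = x - t * (A @ x - b)                         # grad f(x) = Ax - b
    assert np.sum((x - x_star) ** 2) <= c ** k * R2 * (1.0 + 1e-9)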


Conclusion

we showed that the gradient method with appropriate step sizes converges, and guarantees

◮ for f convex and β-smooth,

   f (x (k)) − p⋆ ≤ (β/(2k))‖x (0) − x⋆‖2

◮ for f convex, β-smooth, and α-strongly convex,

   f (x (k)) − p⋆ ≤ (βc^k/2)‖x (0) − x⋆‖2,

   where c = ((κ − 1)/(κ + 1))^2 and κ = β/α ≥ 1 is the condition number


References

◮ Lieven Vandenberghe, UCLA EE236C: Gradient methods

◮ Sébastien Bubeck, Convex Optimization: Algorithms and Complexity
