
Introduction to Convex Optimization for Machine Learning

John Duchi
University of California, Berkeley

Practical Machine Learning, Fall 2009

Outline

What is Optimization

Convex Sets

Convex Functions

Convex Optimization Problems

Lagrange Duality

Optimization Algorithms

Take Home Messages

What is Optimization (and why do we care?)

What is Optimization?

◮ Finding the minimizer of a function subject to constraints:

    minimize_x  f₀(x)
    s.t.  f_i(x) ≤ 0,  i = 1, . . . , k
          h_j(x) = 0,  j = 1, . . . , l


◮ Example: Stock market. “Minimize variance of return subject to getting at least $50.”
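
As a concrete sketch of this general form in code (the slides name no software; scipy, SLSQP, and the toy instance below are assumptions for illustration):

    import numpy as np
    from scipy.optimize import minimize

    # Toy instance: minimize f0(x) = ||x||^2 subject to
    # f1(x) = 1 - x0 <= 0 and h1(x) = x0 - x1 = 0.
    res = minimize(
        lambda x: x @ x,                  # f0
        x0=np.array([2.0, 0.0]),
        method="SLSQP",
        constraints=[
            {"type": "ineq", "fun": lambda x: -(1.0 - x[0])},  # scipy wants fun(x) >= 0
            {"type": "eq",   "fun": lambda x: x[0] - x[1]},
        ],
    )
    print(res.x)  # approximately [1. 1.]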


Why do we care?

Optimization is at the heart of many (most practical?) machine learning algorithms.

◮ Linear regression:

    minimize_w  ‖Xw − y‖²

◮ Classification (logistic regression or SVM):

    minimize_w  ∑_{i=1}^n log(1 + exp(−y_i x_iᵀ w))

  or

    minimize_w  ‖w‖² + C ∑_{i=1}^n ξ_i   s.t.  ξ_i ≥ 1 − y_i x_iᵀ w,  ξ_i ≥ 0.
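
For the regression objective, a short numpy check (a sketch; the synthetic data here are an assumption) that the minimizer can be computed directly:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

    # minimize_w ||Xw - y||^2: np.linalg.lstsq solves the least-squares
    # problem stably without forming X^T X explicitly.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.sum((X @ w - y) ** 2))  # the minimal objective value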


We still care...

◮ Maximum likelihood estimation:

    maximize_θ  ∑_{i=1}^n log p_θ(x_i)

◮ Collaborative filtering:

    minimize_w  ∑_{i≺j} log(1 + exp(wᵀx_i − wᵀx_j))

◮ k-means:

    minimize_{μ₁,...,μ_k}  J(μ) = ∑_{j=1}^k ∑_{i∈C_j} ‖x_i − μ_j‖²

◮ And more (graphical models, feature selection, active learning, control)

But generally speaking...

We’re screwed.

◮ Local (non-global) minima of f₀

◮ All kinds of constraints (even restricting to continuous functions):

h(x) = sin(2πx) = 0

[Figure: surface plot of a highly non-convex function over [−3, 3]², with many local minima.]


◮ Go for convex problems!

Convex Sets

Definition. A set C ⊆ Rⁿ is convex if for x, y ∈ C and any α ∈ [0, 1],

    αx + (1 − α)y ∈ C.

[Figure: two points x and y in a convex set, with the segment between them contained in the set.]

Examples

◮ All of Rⁿ (obvious)

◮ Non-negative orthant, Rⁿ₊: let x ⪰ 0, y ⪰ 0; clearly

    αx + (1 − α)y ⪰ 0.

◮ Norm balls: let ‖x‖ ≤ 1, ‖y‖ ≤ 1, then

‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α ‖x‖ + (1 − α) ‖y‖ ≤ 1.

◮ Affine subspaces: Ax = b, Ay = b, then

A(αx + (1 − α)y) = αAx + (1 − α)Ay = αb + (1 − α)b = b.

[Figure: an affine subspace (a plane) in R³, axes x1, x2, x3.]

More examples

◮ Arbitrary intersections of convex sets: let C_i be convex for i ∈ I and C = ∩_{i∈I} C_i; then

    x ∈ C, y ∈ C ⇒ αx + (1 − α)y ∈ C_i ∀ i ∈ I, so αx + (1 − α)y ∈ C.


◮ PSD matrices, a.k.a. the positive semidefinite cone Sⁿ₊ ⊂ Rⁿ×ⁿ. A ∈ Sⁿ₊ means xᵀAx ≥ 0 for all x ∈ Rⁿ. For A, B ∈ Sⁿ₊,

    xᵀ(αA + (1 − α)B)x = αxᵀAx + (1 − α)xᵀBx ≥ 0.

◮ In the figure:

    S²₊ = { [x z; z y] ⪰ 0 } = { (x, y, z) : x ≥ 0, y ≥ 0, xy ≥ z² }

[Figure: the boundary of S²₊ plotted over (x, y, z).]

Convex Functions

Definition. A function f : Rⁿ → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

[Figure: the chord αf(x) + (1 − α)f(y) lies above the graph of f between x and y.]

First-order convexity conditions

Theorem. Suppose f : Rⁿ → R is differentiable. Then f is convex if and only if for all x, y ∈ dom f,

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).

[Figure: the tangent f(x) + ∇f(x)ᵀ(y − x) at (x, f(x)) lies below f(y).]
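
A quick numerical sanity check of this inequality (a sketch; the convex quadratic test function is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(3, 3))
    A = M.T @ M                              # Gram matrix, hence PSD

    f = lambda x: 0.5 * x @ A @ x            # a convex quadratic
    grad = lambda x: A @ x

    # First-order condition: f(y) >= f(x) + grad f(x)^T (y - x) for all x, y.
    for _ in range(1000):
        x, y = rng.normal(size=3), rng.normal(size=3)
        assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-9   # small tolerance for rounding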

Actually, more general than that

Definition. The subgradient set, or subdifferential set, ∂f(x) of f at x is

    ∂f(x) = { g : f(y) ≥ f(x) + gᵀ(y − x) for all y }.

Theorem. f : Rⁿ → R is convex if and only if it has a non-empty subdifferential set everywhere.

[Figure: a subgradient g gives the affine lower bound f(x) + gᵀ(y − x) at (x, f(x)).]


Second-order convexity conditions

Theorem. Suppose f : Rⁿ → R is twice differentiable. Then f is convex if and only if for all x ∈ dom f,

    ∇²f(x) ⪰ 0.

[Figure: surface plot of a convex quadratic bowl.]
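
A numerical illustration of the condition (a sketch; the log-sum-exp function foreshadows an example below, and its Hessian formula diag(p) − ppᵀ with p = softmax(x) is a standard fact):

    import numpy as np

    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    rng = np.random.default_rng(0)
    for _ in range(100):
        p = softmax(rng.normal(size=4))
        H = np.diag(p) - np.outer(p, p)               # Hessian of log-sum-exp
        assert np.linalg.eigvalsh(H).min() >= -1e-12  # all eigenvalues (essentially) >= 0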


Convex sets and convex functions

Definition. The epigraph of a function f is the set of points

    epi f = {(x, t) : f(x) ≤ t}.

◮ epi f is convex if and only if f is convex.
◮ Sublevel sets {x : f(x) ≤ a} are convex for convex f.

[Figure: the epigraph epi f, with a sublevel set at level a.]


Examples

◮ Linear/affine functions:

    f(x) = bᵀx + c.


◮ Quadratic functions:

    f(x) = ½ xᵀAx + bᵀx + c

  for A ⪰ 0. For regression:

    ½ ‖Xw − y‖² = ½ wᵀXᵀXw − yᵀXw + ½ yᵀy.
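
The regression identity is easy to confirm numerically (a sketch with assumed random data):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))
    y = rng.normal(size=10)
    w = rng.normal(size=3)

    lhs = 0.5 * np.sum((X @ w - y) ** 2)                    # 1/2 ||Xw - y||^2
    rhs = 0.5 * w @ X.T @ X @ w - y @ X @ w + 0.5 * y @ y   # expanded quadratic
    assert np.isclose(lhs, rhs)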


More examples

◮ Norms (like ℓ1 or ℓ2 for regularization):

‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α ‖x‖ + (1 − α) ‖y‖ .


◮ Composition with an affine function f(Ax + b):

f(A(αx + (1 − α)y) + b) = f(α(Ax + b) + (1 − α)(Ay + b))

≤ αf(Ax + b) + (1 − α)f(Ay + b)


◮ Log-sum-exp (via ∇²f(x) PSD):

    f(x) = log( ∑_{i=1}^n exp(x_i) )
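
In code, log-sum-exp is usually evaluated in a shifted form to avoid overflow (a standard trick, sketched here; scipy.special.logsumexp provides the same thing):

    import numpy as np

    def logsumexp(x):
        # log(sum_i exp(x_i)), shifted by max(x) so the largest
        # exponent is exp(0) = 1 and nothing overflows
        m = np.max(x)
        return m + np.log(np.sum(np.exp(x - m)))

    print(logsumexp(np.array([1000.0, 1000.0])))  # ~1000.69; the naive form overflows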


Important examples in Machine Learning

◮ SVM loss:

    f(w) = [1 − y_i x_iᵀ w]₊

◮ Binary logistic loss:

    f(w) = log(1 + exp(−y_i x_iᵀ w))

[Figure: plots of [1 − x]₊ and log(1 + eˣ) over roughly [−2, 3].]
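
Both losses are one-liners in numpy (a sketch; np.logaddexp(0, z) computes log(1 + eᶻ) stably):

    import numpy as np

    def svm_loss(w, X, y):
        # sum_i [1 - y_i x_i^T w]_+
        return np.maximum(0.0, 1.0 - y * (X @ w)).sum()

    def logistic_loss(w, X, y):
        # sum_i log(1 + exp(-y_i x_i^T w))
        return np.logaddexp(0.0, -y * (X @ w)).sum()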


Convex Optimization Problems

Definition. An optimization problem is convex if its objective is a convex function, the inequality constraints f_i are convex, and the equality constraints h_j are affine:

    minimize_x  f₀(x)      (convex function)
    s.t.  f_i(x) ≤ 0       (convex sets)
          h_j(x) = 0       (affine)


It’s nice to be convex

Theorem. If x̂ is a local minimizer of a convex optimization problem, it is a global minimizer.

[Figure: a convex function with its unique global minimizer x∗.]


Even more reasons to be convex

Theorem. ∇f(x) = 0 if and only if x is a global minimizer of f(x).

Proof.

◮ If ∇f(x) = 0, the first-order condition gives

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) = f(x).

◮ If ∇f(x) ≠ 0, there is a direction of descent (e.g., −∇f(x)), so x is not a minimizer.


LET’S TAKE A BREAK

Lagrange Duality

Goals of Lagrange Duality

◮ Get certificate for optimality of a problem

◮ Remove constraints

◮ Reformulate problem


Constructing the dual

◮ Start with an optimization problem:

    minimize_x  f₀(x)
    s.t.  f_i(x) ≤ 0,  i = 1, . . . , k
          h_j(x) = 0,  j = 1, . . . , l

◮ Form the Lagrangian using Lagrange multipliers λ_i ≥ 0, ν_j ∈ R:

    L(x, λ, ν) = f₀(x) + ∑_{i=1}^k λ_i f_i(x) + ∑_{j=1}^l ν_j h_j(x)

◮ Form the dual function:

    g(λ, ν) = inf_x L(x, λ, ν) = inf_x [ f₀(x) + ∑_{i=1}^k λ_i f_i(x) + ∑_{j=1}^l ν_j h_j(x) ]


Remarks

◮ The original problem is equivalent to

    minimize_x [ sup_{λ⪰0, ν} L(x, λ, ν) ]

◮ The dual problem switches the min and max:

    maximize_{λ⪰0, ν} [ inf_x L(x, λ, ν) ].


One Great Property of the Dual

Lemma (Weak Duality). If λ ⪰ 0, then

    g(λ, ν) ≤ f₀(x∗).

Proof. We have

    g(λ, ν) = inf_x L(x, λ, ν) ≤ L(x∗, λ, ν)
            = f₀(x∗) + ∑_{i=1}^k λ_i f_i(x∗) + ∑_{j=1}^l ν_j h_j(x∗) ≤ f₀(x∗),

since λ_i ≥ 0, f_i(x∗) ≤ 0, and h_j(x∗) = 0.


The Greatest Property of the Dual

Theorem. For reasonable¹ convex problems,

    sup_{λ⪰0, ν} g(λ, ν) = f₀(x∗).

¹ There are conditions, called constraint qualifications, under which this holds.


Geometric Look

Minimize ½(x − c − 1)² subject to x² ≤ c.

[Figure, left: the true function (blue), the constraint (green), and L(x, λ) for several values of λ (dotted). Right: the dual function g(λ) (black) and the primal optimal value (dotted blue).]


Intuition

Can interpret duality as linear approximation.


◮ Let I₋(a) = ∞ if a > 0 and 0 otherwise; I₀(a) = ∞ unless a = 0. Rewrite the problem as

    minimize_x  f₀(x) + ∑_{i=1}^k I₋(f_i(x)) + ∑_{j=1}^l I₀(h_j(x))


◮ Replace I₋(f_i(x)) with the lower bound λ_i f_i(x), a measure of “displeasure” when λ_i ≥ 0 and f_i(x) > 0; likewise ν_j h_j(x) lower bounds I₀(h_j(x)):

    minimize_x  f₀(x) + ∑_{i=1}^k λ_i f_i(x) + ∑_{j=1}^l ν_j h_j(x)


Example: Linearly constrained least squares

    minimize_x  ½ ‖Ax − b‖²   s.t. Bx = d.

Form the Lagrangian:

    L(x, ν) = ½ ‖Ax − b‖² + νᵀ(Bx − d)

Take the infimum:

    ∇ₓL(x, ν) = AᵀAx − Aᵀb + Bᵀν = 0 ⇒ x = (AᵀA)⁻¹(Aᵀb − Bᵀν)

A simple unconstrained quadratic problem! Substituting back,

    inf_x L(x, ν) = ½ ‖A(AᵀA)⁻¹(Aᵀb − Bᵀν) − b‖² + νᵀ(B(AᵀA)⁻¹(Aᵀb − Bᵀν) − d)


Example: Quadratically constrained least squares

    minimize_x  ½ ‖Ax − b‖²   s.t. ½ ‖x‖² ≤ c.

Form the Lagrangian (λ ≥ 0):

    L(x, λ) = ½ ‖Ax − b‖² + ½ λ(‖x‖² − 2c)

Take the infimum:

    ∇ₓL(x, λ) = AᵀAx − Aᵀb + λx = 0 ⇒ x = (AᵀA + λI)⁻¹Aᵀb

    inf_x L(x, λ) = ½ ‖A(AᵀA + λI)⁻¹Aᵀb − b‖² + (λ/2) ‖(AᵀA + λI)⁻¹Aᵀb‖² − λc

A one-variable dual problem!

    g(λ) = −½ bᵀA(AᵀA + λI)⁻¹Aᵀb − λc + ½ ‖b‖².
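
A numerical sketch of this dual (the random data and crude grid search are assumptions for illustration): evaluate g(λ), maximize it over λ ≥ 0, and recover a primal candidate; by weak duality, g(λ) lower-bounds the constrained optimum.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(20, 5))
    b = rng.normal(size=20)
    c = 0.5

    def g(lam):
        # g(λ) = -1/2 b^T A (A^T A + λI)^{-1} A^T b - λc + 1/2 ||b||^2
        v = A.T @ b
        return -0.5 * v @ np.linalg.solve(A.T @ A + lam * np.eye(5), v) - lam * c + 0.5 * b @ b

    lam = max(np.linspace(0.0, 50.0, 2001), key=g)            # crude 1-D grid search
    x = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)   # corresponding primal point
    print(g(lam), 0.5 * np.sum((A @ x - b) ** 2))             # g(λ) lower-bounds the optimum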


Uses of the Dual

◮ Main use: certificate of optimality (a.k.a. the duality gap). If we have a feasible x and know the dual value g(λ, ν), then

    g(λ, ν) ≤ f₀(x∗) ≤ f₀(x) ⇒ f₀(x∗) − f₀(x) ≥ g(λ, ν) − f₀(x)
                              ⇒ f₀(x) − f₀(x∗) ≤ f₀(x) − g(λ, ν).


◮ Also used in more advanced primal-dual algorithms (we won’t talk about these).

Optimization Algorithms

Gradient Methods

Gradient Descent

The simplest algorithm in the world (almost). Goal:

    minimize_x  f(x)

Just iterate

    x_{t+1} = x_t − η_t ∇f(x_t)

where η_t is the stepsize.
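
As code (a sketch; the constant stepsize and toy objective are assumptions, since the slide leaves the schedule open):

    import numpy as np

    def gradient_descent(grad, x0, eta=0.1, iters=100):
        # x_{t+1} = x_t - eta * grad f(x_t), with a constant stepsize
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            x = x - eta * grad(x)
        return x

    # Example: f(x) = 1/2 ||x||^2 has gradient x and minimizer 0.
    print(gradient_descent(lambda x: x, [3.0, -4.0]))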


Single Step Illustration

[Figure: one step from (x_t, f(x_t)) along the linearization f(x_t) − η∇f(x_t)ᵀ(x − x_t) of f.]


Full Gradient Descent

f(x) = log(exp(x₁ + 3x₂ − 0.1) + exp(x₁ − 3x₂ − 0.1) + exp(−x₁ − 0.1))


Stepsize Selection

How do I choose a stepsize?

◮ Idea 1: exact line search:

    η_t = argmin_η f(x − η∇f(x))

  Too expensive to be practical.


◮ Idea 2: backtracking (Armijo) line search. Let α ∈ (0, ½), β ∈ (0, 1). Multiply η ← βη until

    f(x − η∇f(x)) ≤ f(x) − αη ‖∇f(x)‖²

  Works well in practice.
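
A sketch of one gradient step with this rule (the parameter values are assumptions within the stated ranges):

    import numpy as np

    def backtracking_step(f, grad, x, alpha=0.3, beta=0.8, eta=1.0):
        # Shrink eta by beta until the sufficient-decrease condition
        # f(x - eta*g) <= f(x) - alpha*eta*||g||^2 holds, then step.
        g = grad(x)
        while f(x - eta * g) > f(x) - alpha * eta * (g @ g):
            eta *= beta
        return x - eta * g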


Illustration of Armijo/Backtracking Line Search

[Figure: f(x − η∇f(x)) and the line f(x) − αη ‖∇f(x)‖² as functions of the stepsize η, from η = 0 to η₀.]

As a function of the stepsize η, there is clearly a region where f(x − η∇f(x)) lies below the line f(x) − αη ‖∇f(x)‖².


Newton’s method

Idea: use a second-order approximation to the function:

    f(x + Δx) ≈ f(x) + ∇f(x)ᵀΔx + ½ Δxᵀ∇²f(x)Δx

Choose Δx to minimize the approximation:

    Δx = −[∇²f(x)]⁻¹∇f(x)

This is a descent direction:

    ∇f(x)ᵀΔx = −∇f(x)ᵀ[∇²f(x)]⁻¹∇f(x) < 0.
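
A bare-bones sketch of the resulting iteration (no line search or stopping test, which a practical version would add):

    import numpy as np

    def newton(grad, hess, x0, iters=20):
        # x <- x - [grad^2 f(x)]^{-1} grad f(x); solve the linear
        # system rather than forming the inverse explicitly
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            x = x - np.linalg.solve(hess(x), grad(x))
        return x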


Newton step picture

[Figure: the true function f, its second-order approximation f̂, and the Newton step from (x, f(x)) to (x + Δx, f(x + Δx)).]

f̂ is the second-order approximation, f is the true function.


Convergence of gradient descent and Newton’s method

◮ Strongly convex case, ∇²f(x) ⪰ mI: “linear convergence,” i.e. for some γ ∈ (0, 1),

    f(x_t) − f(x∗) ≤ γᵗ,   so t ≥ (1/γ) log(1/ε) ⇒ f(x_t) − f(x∗) ≤ ε.

◮ Smooth case, ‖∇f(x) − ∇f(y)‖ ≤ C ‖x − y‖:

    f(x_t) − f(x∗) ≤ K/t²

◮ Newton’s method is often faster, especially when f has “long valleys”


What about constraints?

◮ Linear constraints Ax = b are easy. For example, in Newton’s method (assuming Ax = b):

    minimize_{Δx}  ∇f(x)ᵀΔx + ½ Δxᵀ∇²f(x)Δx   s.t. AΔx = 0.

  The solution Δx satisfies A(x + Δx) = Ax + AΔx = b.


◮ Inequality constraints are a bit tougher:

    minimize_{Δx}  ∇f(x)ᵀΔx + ½ Δxᵀ∇²f(x)Δx   s.t. f_i(x + Δx) ≤ 0

  is just as hard as the original problem.


Logarithmic Barrier Methods

Goal:

    minimize_x  f₀(x)   s.t. f_i(x) ≤ 0, i = 1, . . . , k

Convert to

    minimize_x  f₀(x) + ∑_{i=1}^k I₋(f_i(x))

Approximate I₋(u) ≈ −t log(−u) for small t:

    minimize_x  f₀(x) − t ∑_{i=1}^k log(−f_i(x))
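
A tiny one-dimensional sketch of the scheme (the example problem, the Newton inner solver, and the schedule for t are all assumptions): minimize x subject to 1 − x ≤ 0, whose optimum is x∗ = 1. The barrier objective x − t log(x − 1) is minimized at x = 1 + t, which slides toward the constrained optimum as t → 0.

    # minimize x  s.t.  1 - x <= 0, via the barrier x - t*log(x - 1)
    x, t = 2.0, 1.0
    for _ in range(40):            # shrink t toward 0
        for _ in range(5):         # a few Newton steps on the barrier objective
            g = 1.0 - t / (x - 1.0)        # first derivative
            h = t / (x - 1.0) ** 2         # second derivative
            x -= g / h
        t *= 0.7
    print(x)                        # close to the constrained optimum 1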


The barrier function

[Figure: I₋(u) (dotted) and the barrier approximations −t log(−u) for several values of t.]


Illustration

Minimizing cᵀx subject to Ax ≤ b.

[Figure: barrier solutions for t = 5 and t = 1 inside the feasible polytope, with cost vector c.]

Subgradient Methods

Subgradient Descent

Really, the simplest algorithm in the world. Goal:

    minimize_x  f(x)

Just iterate

    x_{t+1} = x_t − η_t g_t

where η_t is a stepsize and g_t ∈ ∂f(x_t).
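
A sketch in code (the 1/√(t+1) stepsize is one standard choice; the slides leave η_t open), using f(x) = ‖x‖₁, for which np.sign(x) is a valid subgradient:

    import numpy as np

    def subgradient_descent(subgrad, x0, T=500):
        # x_{t+1} = x_t - eta_t * g_t with eta_t = 1/sqrt(t+1)
        x = np.asarray(x0, dtype=float)
        for t in range(T):
            x = x - subgrad(x) / np.sqrt(t + 1)
        return x

    print(subgradient_descent(np.sign, [2.0, -3.0]))  # oscillates near the minimizer 0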


Why subgradient descent?

◮ Lots of non-differentiable convex functions are used in machine learning:

    f(x) = [1 − aᵀx]₊,   f(x) = ‖x‖₁,   f(X) = ∑_{r=1}^k σ_r(X),

  where σ_r is the rth singular value of X.

◮ Easy to analyze

◮ Do not even need a true subgradient: just need E[g_t] ∈ ∂f(x_t).


Proof of convergence for subgradient descent

Idea: bound ‖x_{t+1} − x∗‖ using the subgradient inequality. Assume that ‖g_t‖ ≤ G.

    ‖x_{t+1} − x∗‖² = ‖x_t − ηg_t − x∗‖²
                    = ‖x_t − x∗‖² − 2η g_tᵀ(x_t − x∗) + η² ‖g_t‖²


Recall that

    f(x∗) ≥ f(x_t) + g_tᵀ(x∗ − x_t) ⇒ −g_tᵀ(x_t − x∗) ≤ f(x∗) − f(x_t),

so

    ‖x_{t+1} − x∗‖² ≤ ‖x_t − x∗‖² + 2η [f(x∗) − f(x_t)] + η²G².

Then

    f(x_t) − f(x∗) ≤ (‖x_t − x∗‖² − ‖x_{t+1} − x∗‖²) / (2η) + (η/2) G².


Almost done...

Sum from t = 1 to T :

T∑

t=1

f(xt) − f(x∗) ≤ 1

T∑

t=1

[

‖xt − x∗‖2 − ‖xt+1 − x∗‖2]

+Tη

2G2

=1

2η‖x1 − x∗‖2 − 1

2η‖xT+1 − x∗‖2 +

2G2


Now let D = ‖x₁ − x∗‖ and keep track of the minimum along the run:

    f(x_best) − f(x∗) ≤ D²/(2ηT) + (η/2) G².

Set η = D/(G√T), and

    f(x_best) − f(x∗) ≤ DG/√T.


Extension: projected subgradient descent

Now we have a convex constraint set X. Goal:

    minimize_{x∈X}  f(x)

Idea: do subgradient steps, projecting x_t back into X at every iteration:

    x_{t+1} = Π_X(x_t − ηg_t)

The convergence proof goes through because projections only shrink distances:

    ‖Π_X(x_t) − x∗‖ ≤ ‖x_t − x∗‖   if x∗ ∈ X.

[Figure: x_t outside X is projected to Π_X(x_t); the projection is no farther from x∗ ∈ X.]


Projected subgradient example

    minimize_x  ½ ‖Ax − b‖   s.t. ‖x‖₁ ≤ 1
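
A sketch of this example in code. The ℓ1-ball projection routine below is the standard sort-based method, an assumption here since the slides do not give one; the 1/√(t+1) stepsize is likewise an assumed choice:

    import numpy as np

    def project_l1(v, radius=1.0):
        # Euclidean projection onto the l1 ball via soft-thresholding
        if np.abs(v).sum() <= radius:
            return v
        u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
        css = np.cumsum(u)
        k = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
        theta = (css[k] - radius) / (k + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(30, 10)), rng.normal(size=30)
    x = np.zeros(10)
    for t in range(2000):
        r = A @ x - b
        g = A.T @ r / (2 * np.linalg.norm(r))         # gradient of 0.5*||Ax - b|| (valid for r != 0)
        x = project_l1(x - g / np.sqrt(t + 1))
    print(np.abs(x).sum())                             # iterates stay within the l1 ball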


Convergence results for (projected) subgradient methods

◮ Any decreasing, non-summable stepsize (η_t → 0, ∑_{t=1}^∞ η_t = ∞) gives

    f(x_avg(t)) − f(x∗) → 0.


◮ A slightly less brain-dead analysis than the earlier one shows that with η_t ∝ 1/√t,

    f(x_avg(t)) − f(x∗) ≤ C/√t


◮ Same convergence when g_t is random, i.e. E[g_t] ∈ ∂f(x_t). Example:

    f(w) = ½ ‖w‖² + C ∑_{i=1}^n [1 − y_i x_iᵀ w]₊

  Just pick a random training example.
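
A sketch of that stochastic variant for the SVM objective above (the stepsize and scaling are assumed choices; sampling i uniformly makes n times the per-example subgradient an unbiased estimate of the sum's subgradient):

    import numpy as np

    def sgd_svm(X, y, C=1.0, T=10000, seed=0):
        # f(w) = 0.5*||w||^2 + C * sum_i [1 - y_i x_i^T w]_+
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(T):
            i = rng.integers(n)                    # one random training example
            g = w.copy()                           # gradient of 0.5*||w||^2
            if 1.0 - y[i] * (X[i] @ w) > 0.0:      # hinge term active?
                g -= C * n * y[i] * X[i]           # unbiased estimate of the sum's subgradient
            w -= g / np.sqrt(t + 1)
        return w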


Recap

◮ Defined convex sets and functions

◮ Saw why we want optimization problems to be convex (solvable)

◮ Sketched some of Lagrange duality

◮ First order methods are easy and (often) work well


Take Home Messages

◮ Many useful problems formulated as convex optimization problems

◮ If it is not convex and not an eigenvalue problem, you are out of luck

◮ If it is convex, you are golden
