Unconstrained Optimization Rong Jin


Page 1:

Unconstrained Optimization

Rong Jin

Page 2:

Recap: Gradient Ascent/Descent

Simple algorithm; only requires the first-order derivative.

Problem: difficulty in determining the step size. A small step size gives slow convergence; a large step size causes oscillation or bubbling.
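To make the step-size issue concrete, here is a minimal sketch (not from the slides; the quadratic f(x) = x^2 is an illustrative choice) of plain gradient descent with three different step sizes:

```python
def gradient_descent(grad, x0, step, iters=50):
    """Plain gradient descent: x <- x - step * grad(x)."""
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Toy objective f(x) = x^2 (illustrative), gradient 2x, minimizer x = 0.
grad = lambda x: 2.0 * x

print(gradient_descent(grad, 1.0, step=0.01))  # small step: still far from 0 after 50 iterations
print(gradient_descent(grad, 1.0, step=0.45))  # well-chosen step: essentially 0
print(gradient_descent(grad, 1.0, step=0.99))  # large step: iterates flip sign and barely shrink
```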

Page 3:

Recap: Newton Method

Univariate Newton method:
$$x_{new} = x_{old} - \frac{f'(x_{old})}{f''(x_{old})}$$

Multivariate Newton method:
$$x_{new} = x_{old} - H^{-1}\nabla f(x_{old}), \qquad \nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_m}\right)^T, \qquad H_{i,j} = \frac{\partial^2 f(x_1, x_2, \ldots, x_m)}{\partial x_i \, \partial x_j}$$

where $H$ is the Hessian matrix.

Guaranteed to converge when the objective function is convex/concave.
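A minimal sketch of both update rules, assuming the derivative, gradient, and Hessian are available as callables (the example objective is an illustrative quadratic, the same one used in the conjugate-gradient example later in these slides); solving H d = ∇f avoids forming H^{-1} explicitly:

```python
import numpy as np

def newton_1d(f_prime, f_double_prime, x0, iters=20):
    """Univariate Newton: x_new = x_old - f'(x_old) / f''(x_old)."""
    x = x0
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)
    return x

def newton_nd(grad, hess, x0, iters=20):
    """Multivariate Newton: x_new = x_old - H^{-1} grad f(x_old),
    implemented by solving H d = grad f instead of inverting H."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Illustrative objective: f(x) = 1/2 x^T A x - b^T x, so grad = A x - b, Hessian = A.
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton_nd(lambda x: A @ x - b, lambda x: A, [0.0, 0.0]))  # -> [1. 1.] (one step)
print(newton_1d(lambda x: 2 * x - 2, lambda x: 2.0, 0.0))       # minimizes x^2 - 2x -> 1.0
```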

Page 4:

Recap

Problem with the standard Newton method:
Computing the inverse of the Hessian matrix H is expensive: O(n^3).
The size of the Hessian matrix H can be very large: O(n^2).

Quasi-Newton method (BFGS): approximate the inverse of the Hessian matrix H with another matrix B. This avoids the difficulty of computing the inverse of H, but still runs into trouble when B is large.

Limited-memory quasi-Newton method (L-BFGS): store a small set of vectors instead of the matrix B. This avoids both the difficulty of computing the inverse of H and the difficulty of storing the large matrix B.
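In practice these methods are available off the shelf; a minimal sketch, assuming SciPy is installed, of calling its L-BFGS-B routine on an illustrative quadratic (`maxcor` is the number of stored vector pairs):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative quadratic objective f(x) = 1/2 x^T A x - b^T x with its gradient.
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b

# L-BFGS keeps only `maxcor` vector pairs instead of a full n-by-n matrix B,
# so memory grows as O(n) rather than O(n^2).
res = minimize(f, x0=np.zeros(2), jac=grad, method="L-BFGS-B", options={"maxcor": 10})
print(res.x)  # -> approximately [1. 1.]
```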

Page 5:

Recap

Number of variables | Method                                             | Convergence rate
Small               | Standard Newton method: O(n^3)                     | V-Fast
Medium              | Quasi-Newton method (BFGS): O(n^2)                 | Fast
Large               | Limited-memory quasi-Newton method (L-BFGS): O(n)  | R-Fast

Page 6:

Empirical Study: Learning Conditional Exponential Model

Dataset | Instances | Features
Rule    | 29,602    | 246
Lex     | 42,509    | 135,182
Summary | 24,044    | 198,467
Shallow | 8,625,782 | 264,142

Dataset | Gradient ascent (Iterations, Time s) | L-BFGS (Iterations, Time s)
Rule    | 350, 4.8                             | 81, 1.13
Lex     | 1545, 114.21                         | 176, 20.02
Summary | 3321, 190.22                         | 69, 8.52
Shallow | 14527, 85962.53                      | 421, 2420.30

Page 7:

Free Software

http://www.ece.northwestern.edu/~nocedal/software.html
L-BFGS
L-BFGS-B

Page 8:

Conjugate Gradient

Another great numerical optimization method!

Page 9:

Linear Conjugate Gradient Method

Consider optimizing the quadratic function
$$x^* = \arg\min_x f(x), \quad \text{where } f(x) = \frac{1}{2}x^T A x - b^T x$$

Conjugate vectors: the set of vectors $\{p_1, p_2, \ldots, p_l\}$ is said to be conjugate with respect to a matrix $A$ if
$$p_i^T A p_j = 0 \quad \text{for any } i \neq j$$

Important property: the quadratic function can be optimized by simply optimizing the function along each individual direction in the conjugate set. Optimal solution:
$$x^* = \alpha_1 p_1 + \alpha_2 p_2 + \ldots + \alpha_l p_l$$
where $\alpha_k$ is the minimizer along the k-th conjugate direction.
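A minimal sketch of the resulting linear conjugate gradient iteration (the standard residual-based form, assuming A is symmetric positive definite; numpy is used for the linear algebra):

```python
import numpy as np

def linear_cg(A, b, x0, tol=1e-10, max_iter=None):
    """Minimize f(x) = 1/2 x^T A x - b^T x, i.e. solve A x = b,
    by successive exact line searches along conjugate directions."""
    x = np.asarray(x0, dtype=float)
    r = A @ x - b              # gradient/residual at the current point
    p = -r                     # first search direction: steepest descent
    n = len(b) if max_iter is None else max_iter
    for _ in range(n):
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (p @ A @ p)      # exact minimizer along p
        x = x + alpha * p
        r_new = r + alpha * (A @ p)
        beta = (r_new @ r_new) / (r @ r)   # makes the next p conjugate to the previous ones
        p = -r_new + beta * p
        r = r_new
    return x

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # the matrix from the example on the next slide
b = np.array([1.0, 1.0])
print(linear_cg(A, b, x0=np.zeros(2)))     # -> [1. 1.] after at most 2 iterations
```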

Page 10:

Example

Minimize the following function:
$$f(x_1, x_2) = x_1^2 + x_2^2 - x_1 x_2 - x_1 - x_2$$

Matrix A (with $b = (1, 1)^T$ in the quadratic form of the previous slide):
$$A = 2\begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}$$

Conjugate directions:
$$p_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad p_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

Optimization:
First direction, $x_1 = x_2 = x$: $f(x, x) = x^2 - 2x$, minimizer $x = 1$.
Second direction, $x_1 = -x_2 = x$: $f(x, -x) = 3x^2$, minimizer $x = 0$.

Solution: $x_1 = x_2 = 1$.

[Figure: surface plot of $f(x_1, x_2)$ over roughly $[-3, 3] \times [-3, 3]$.]
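A quick numeric check of this example (assuming numpy, and the reconstruction A = [[2, -1], [-1, 2]], b = (1, 1)^T above): the two directions are A-conjugate, and adding up the per-direction minimizers recovers x1 = x2 = 1.

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x      # = x1^2 + x2^2 - x1*x2 - x1 - x2

p1 = np.array([1.0, 1.0])
p2 = np.array([1.0, -1.0])
print(p1 @ A @ p2)                         # 0.0: the directions are conjugate

# The exact minimizer of f(alpha * p) along a direction p is alpha = (b . p) / (p^T A p).
alpha1 = (b @ p1) / (p1 @ A @ p1)          # -> 1.0 (minimizer along p1)
alpha2 = (b @ p2) / (p2 @ A @ p2)          # -> 0.0 (minimizer along p2)
print(alpha1 * p1 + alpha2 * p2)           # -> [1. 1.], the solution
```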

Page 11:

How to Efficiently Find a Set of Conjugate Directions

Iterative procedure: given conjugate directions $\{p_1, p_2, \ldots, p_{k-1}\}$, set $p_k$ as follows:
$$p_k = -r_k + \beta_k p_{k-1}, \qquad \beta_k = \frac{r_k^T A p_{k-1}}{p_{k-1}^T A p_{k-1}}, \qquad \text{where } r_k = \nabla f(x)\big|_{x = x_k} = A x_k - b$$

Theorem: the direction generated in the above step is conjugate to all previous directions $\{p_1, p_2, \ldots, p_{k-1}\}$, i.e.,
$$p_k^T A p_i = 0 \quad \text{for any } i \in \{1, 2, \ldots, k-1\}$$

Note: computing the k-th direction $p_k$ requires only the previous direction $p_{k-1}$ (as the sketch below illustrates).
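A small sketch of this construction (assuming numpy, on a random symmetric positive definite matrix): each new direction is built from the current gradient and only the previous direction, yet the assertions confirm it is conjugate to all earlier directions, as the theorem states.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # a random symmetric positive definite matrix
b = rng.standard_normal(5)

x = np.zeros(5)
r = A @ x - b                      # r_k = gradient of 1/2 x^T A x - b^T x at x_k
p = -r                             # first direction: steepest descent
directions = [p]
for k in range(4):
    alpha = -(r @ p) / (p @ A @ p)             # exact line search along p
    x = x + alpha * p
    r = A @ x - b
    beta = (r @ A @ p) / (p @ A @ p)           # coefficient from the formula above
    p = -r + beta * p                          # new direction uses only the previous p
    directions.append(p)

# Every pair of generated directions is (numerically) A-conjugate.
for i in range(len(directions)):
    for j in range(i):
        assert abs(directions[i] @ A @ directions[j]) < 1e-8
print("all directions mutually conjugate")
```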

Page 12:

Nonlinear Conjugate Gradient

Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions. Convergence is guaranteed if the objective is convex/concave.

Variants:
Fletcher-Reeves conjugate gradient (FR-CG)
Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG

Compared to the Newton method: it is a first-order method, usually less efficient than the Newton method, but simple to implement.
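Nonlinear CG is also available off the shelf; a minimal sketch assuming SciPy, whose `method="CG"` is a Polak-Ribiere-type nonlinear conjugate gradient, applied to a tiny regularized logistic loss (all data and the regularization weight are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective: an L2-regularized logistic loss (a tiny conditional
# exponential model), with labels y in {+1, -1}. All values here are made up.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = 0.1

def loss(w):
    return np.sum(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * w @ w

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # = sigmoid(-y * Xw)
    return -(X.T @ (y * s)) + lam * w

w_cg = minimize(loss, np.zeros(2), jac=grad, method="CG").x       # nonlinear CG
w_lbfgs = minimize(loss, np.zeros(2), jac=grad, method="L-BFGS-B").x
print(w_cg, w_lbfgs)   # both converge to essentially the same minimizer
```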

Page 13:

Empirical Study: Learning Conditional Exponential Model

Dataset | Instances | Features
Rule    | 29,602    | 246
Lex     | 42,509    | 135,182
Summary | 24,044    | 198,467
Shallow | 8,625,782 | 264,142

Dataset | Conjugate Gradient (PR) (Iterations, Time s) | L-BFGS (Iterations, Time s)
Rule    | 142, 1.93                                    | 81, 1.13
Lex     | 281, 21.72                                   | 176, 20.02
Summary | 537, 31.66                                   | 69, 8.52
Shallow | 2813, 16251.12                               | 421, 2420.30

Page 14:

Free Software

http://www.ece.northwestern.edu/~nocedal/software.html
CG+

Page 15:

When Should We Use Which Optimization Technique?

Use the Newton method if you can find a package.

Use conjugate gradient if you have to implement it yourself.

Use gradient ascent/descent if you are lazy.

Page 16:

Logarithm Bound Algorithms

To maximize $f(x) = f(x_1, x_2, \ldots, x_m)$:

Start with a guess $x^0 = (x_1^0, x_2^0, \ldots, x_m^0)$. For $t = 1, 2, \ldots, T$:

Compute $Q(x, x^{t-1}) = f(x) - f(x^{t-1})$.

Find a decoupling function
$$g(x) = g_1(x_1) + g_2(x_2) + \ldots + g_m(x_m)$$
such that $g(x) \leq Q(x, x^{t-1})$ and $g(x^{t-1}) = Q(x^{t-1}, x^{t-1}) = 0$ (the touch point).

Find the optimal solution $x_i'$ of each $g_i(x_i)$ and set $x_i^t = x_i'$.

Page 17:

Logarithm Bound Algorithm

[Figure: $f(x)$ with a lower-bound function $\Phi(x)$ touching it at $x^0$; the maximizer of the bound is $x^1$.]

• Start with an initial guess $x^0$
• Come up with a lower-bound function $\Phi(x)$ such that $\Phi(x) + f(x^0) \leq f(x)$
• Touch point: $\Phi(x^0) = 0$
• Optimal solution $x^1$ of $\Phi(x)$

Page 18:

Logarithm Bound Algorithm

[Figure: the same construction repeated from $x^1$, producing $x^2$.]

• Start with an initial guess $x^0$
• Come up with a lower-bound function $\Phi(x)$ such that $\Phi(x) + f(x^0) \leq f(x)$
• Touch point: $\Phi(x^0) = 0$
• Optimal solution $x^1$ of $\Phi(x)$
• Repeat the above procedure

Page 19:

Logarithm Bound Algorithm

[Figure: successive bounds and iterates $x^0, x^1, x^2, \ldots$ approaching the optimal point.]

• Start with an initial guess $x^0$
• Come up with a lower-bound function $\Phi(x)$ such that $\Phi(x) + f(x^0) \leq f(x)$
• Touch point: $\Phi(x^0) = 0$
• Optimal solution $x^1$ of $\Phi(x)$
• Repeat the above procedure
• Converge to the optimal point

Page 20:

Property of Concave Functions

For any concave function $f$,
$$f(p_1 x_1 + p_2 x_2 + \ldots + p_n x_n) \geq p_1 f(x_1) + p_2 f(x_2) + \ldots + p_n f(x_n)$$
for $p_j \geq 0$ and $\sum_{j=1}^{n} p_j = 1$.

[Figure: a concave $f(x)$ with points $x_1, x_2$, illustrating $f\!\left(\tfrac{1}{3}x_1 + \tfrac{2}{3}x_2\right) \geq \tfrac{1}{3}f(x_1) + \tfrac{2}{3}f(x_2)$.]

Page 21:

Important Inequality

log(x) and -exp(x) are concave functions. Therefore (with $p_i \geq 0$, $\sum_i p_i = 1$):
$$\log\Big(\sum_{i=1}^{n} p_i x_i\Big) \geq \sum_{i=1}^{n} p_i \log(x_i), \qquad -\exp\Big(\sum_{i=1}^{n} p_i x_i\Big) \geq -\sum_{i=1}^{n} p_i \exp(x_i)$$
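To see how this inequality drives a logarithm-bound update, here is a minimal sketch on a toy problem not taken from the slides: a mixture with fixed component likelihoods a_i, b_i and a single unknown mixing weight pi. Applying the log inequality with responsibilities computed at the current guess gives a decoupled lower bound that touches the objective there, and maximizing that bound yields the closed-form update pi <- mean(gamma):

```python
import numpy as np

rng = np.random.default_rng(1)
# Fixed, made-up per-example likelihoods under two components.
a = rng.uniform(0.1, 1.0, size=200)   # component-1 likelihoods
b = rng.uniform(0.1, 1.0, size=200)   # component-2 likelihoods

def loglik(pi):
    """Objective to maximize: sum_i log( pi*a_i + (1-pi)*b_i )."""
    return np.sum(np.log(pi * a + (1 - pi) * b))

pi = 0.5                               # initial guess
for t in range(20):
    # Decoupling step: responsibilities gamma_i at the current guess.
    gamma = pi * a / (pi * a + (1 - pi) * b)
    # log(pi*a_i + (1-pi)*b_i) >= gamma_i*log(pi*a_i/gamma_i)
    #                            + (1-gamma_i)*log((1-pi)*b_i/(1-gamma_i)),
    # with equality (touch point) at the current pi. Maximizing the bound in pi
    # has the closed form below, and the true objective never decreases.
    pi_new = np.mean(gamma)
    assert loglik(pi_new) >= loglik(pi) - 1e-9
    pi = pi_new

print(pi, loglik(pi))
```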

Page 22:

Expectation-Maximization Algorithm

Derive the EM algorithm for the hierarchical mixture model:

[Figure: input X feeds a gating model r(x) that routes between experts m1(x) and m2(x), which produce the output y.]

$$p(y \mid x) = \sum_{m} p(y, m \mid x) = r(m = 1 \mid x; \theta_r)\, m_1(y \mid x; \theta_1) + r(m = -1 \mid x; \theta_r)\, m_2(y \mid x; \theta_2)$$

Log-likelihood of training data:
$$l(D_{train}) = \sum_i \log p(y_i \mid x_i) = \sum_i \log\Big[ r(m = 1 \mid x_i; \theta_r)\, m_1(y_i \mid x_i; \theta_1) + r(m = -1 \mid x_i; \theta_r)\, m_2(y_i \mid x_i; \theta_2) \Big]$$
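A minimal sketch of the E-step this derivation leads to, under illustrative assumptions (a logistic gate and two logistic experts with made-up parameters and data; y in {+1, -1}): the responsibilities define a decoupled lower bound that equals the log-likelihood at the current parameters, and an M-step would then maximize the gate and expert terms separately (weighted maximum likelihood).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data and current parameters (purely illustrative).
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
y = np.where(rng.standard_normal(100) > 0, 1.0, -1.0)
w_r, w_1, w_2 = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)

# Model pieces: gate r(m=1|x) and experts m_k(y|x), each a logistic model.
r1 = sigmoid(X @ w_r)                 # P(choose expert 1 | x)
m1 = sigmoid(y * (X @ w_1))           # expert 1: P(y | x)
m2 = sigmoid(y * (X @ w_2))           # expert 2: P(y | x)

loglik = np.sum(np.log(r1 * m1 + (1 - r1) * m2))

# E-step: responsibility of expert 1 for each training example.
gamma = r1 * m1 / (r1 * m1 + (1 - r1) * m2)

# Jensen / log-bound: a decoupled lower bound that touches the log-likelihood
# at the current parameters; the M-step maximizes each bracketed part separately
# (a weighted logistic regression for the gate and for each expert).
bound = np.sum(gamma * np.log(r1 * m1 / gamma)
               + (1 - gamma) * np.log((1 - r1) * m2 / (1 - gamma)))
print(loglik, bound)   # equal up to floating-point error at the touch point
```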