
Page 1

Newton’s method

10-725 Optimization
Geoff Gordon
Ryan Tibshirani

Page 2

Review

• Volume rule

• Infomax ICA
‣ matrix natural gradient

Page 3

Review: Newton

• For solving nonlinear equations:
‣ approx by linear ones, solve, update approx
‣ $d = -J(x)^{-1} f(x)$

• For finding minima/maxima/saddles:
‣ just use Newton on the gradient $g(x) = f'(x) = 0$
‣ $d = -H(x)^{-1} g(x)$

• Line search: Newton is a descent method

• (Often) quadratic convergence
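A minimal sketch of this update in Python (not from the slides; the function names and the toy problem are illustrative assumptions):

```python
import numpy as np

def newton_root(f, J, x, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0: linearize, solve, update.

    f : callable returning the residual vector f(x)
    J : callable returning the Jacobian matrix J(x)
    """
    for _ in range(max_iter):
        d = -np.linalg.solve(J(x), f(x))   # d = -J(x)^{-1} f(x)
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Toy example (illustrative): solve x^2 = 2 in one dimension.
root = newton_root(lambda x: np.array([x[0]**2 - 2.0]),
                   lambda x: np.array([[2.0 * x[0]]]),
                   np.array([1.0]))
print(root)  # ~ [1.41421356]
```

For minimization the same loop applies with f replaced by the gradient g and J by the Hessian H.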

Page 4

Equality constraints

• min f(x) s.t. h(x) = 0

[Figure: 2-D illustration of the equality-constrained problem; axis ticks from −2 to 2]

Page 5

Optimality w/ equality

• min f(x) s.t. h(x) = 0
‣ $f: \mathbb{R}^d \to \mathbb{R}$, $h: \mathbb{R}^d \to \mathbb{R}^k$ ($k \le d$)
‣ $g: \mathbb{R}^d \to \mathbb{R}^d$ (gradient of f)

• Useful special case: min f(x) s.t. Ax = 0
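For the special case min f(x) s.t. Ax = 0, the first-order condition is the standard Lagrange one (a known result, stated here since the slide leaves it for lecture): the gradient must lie in the row space of A,

\[
Ax^\star = 0, \qquad g(x^\star) = A^\top \nu \ \text{ for some } \nu \in \mathbb{R}^k,
\]

i.e. $g(x^\star) \perp \mathrm{null}(A)$, so no feasible direction is a descent direction.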

Page 6

Picture

$$\max_{x,y,z} \; c^\top \begin{bmatrix} x \\ y \\ z \end{bmatrix} \quad \text{s.t.} \quad x^2 + y^2 + z^2 = 1, \qquad a^\top \begin{bmatrix} x \\ y \\ z \end{bmatrix} = b$$

Page 7

Optimality w/ equality

• min f(x) s.t. h(x) = 0
‣ $f: \mathbb{R}^d \to \mathbb{R}$, $h: \mathbb{R}^d \to \mathbb{R}^k$ ($k \le d$)
‣ $g: \mathbb{R}^d \to \mathbb{R}^d$ (gradient of f)

• Now suppose:
‣ dg =
‣ dh =

• Optimality:
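The blanks are filled in during lecture; the standard Lagrange condition for min f(x) s.t. h(x) = 0 (a known result, not copied from the slides) is

\[
h(x^\star) = 0, \qquad g(x^\star) + Dh(x^\star)^\top \nu = 0 \ \text{ for some } \nu \in \mathbb{R}^k.
\]

For the linear case min f(x) s.t. Ax = b used later in the deck, applying Newton to these equations gives the KKT system

\[
\begin{bmatrix} H(x) & A^\top \\ A & 0 \end{bmatrix}
\begin{bmatrix} d \\ w \end{bmatrix}
= - \begin{bmatrix} g(x) \\ Ax - b \end{bmatrix},
\]

whose right-hand side reduces to $[-g(x); 0]$ once the iterates are feasible.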

Page 8

Example: bundle adjustment

• Latent:
‣ Robot positions $x_t$, $\theta_t$
‣ Landmark positions $y_k$

• Observed: odometry, landmark vectors
‣ $v_t = R_{\theta_t}[x_{t+1} - x_t] + \text{noise}$
‣ $w_t = \theta_{t+1} - \theta_t + \text{noise}$
‣ $d_{kt} = R_{\theta_t}[y_k - x_t] + \text{noise}$

[Figure: example robot trajectory with landmarks in the plane]

$O = \{\text{observed } (k, t) \text{ pairs}\}$

Page 9

Bundle adjustment

$$\min_{x_t,\, u_t,\, y_k} \;\; \sum_t \big\| v_t - R(u_t)[x_{t+1} - x_t] \big\|^2 \;+\; \sum_t \big\| R_{w_t} u_t - u_{t+1} \big\|^2 \;+\; \sum_{(t,k) \in O} \big\| d_{k,t} - R(u_t)[y_k - x_t] \big\|^2 \qquad \text{s.t. } u_t^\top u_t = 1$$

‣ latent: Robot positions $x_t$, $\theta_t$
‣ $u_t = [\cos\theta_t;\ \sin\theta_t]$
‣ latent: Landmark positions $y_k$
‣ obs: $v_t = R_{\theta_t}[x_{t+1} - x_t] + \text{noise}$
‣ obs: $w_t = \theta_{t+1} - \theta_t + \text{noise}$
‣ obs: $d_{kt} = R_{\theta_t}[y_k - x_t] + \text{noise}$
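This objective is a sum of squared residuals, so it is a natural target for the Gauss-Newton variation at the end of the deck. A minimal sketch of stacking the residuals in Python (all names and array layouts are illustrative assumptions; the $u_t^\top u_t = 1$ constraint would be enforced separately, e.g. by renormalizing):

```python
import numpy as np

def rot(u):
    """2-D rotation matrix from a heading vector u = [cos θ, sin θ]."""
    c, s = u
    return np.array([[c, -s], [s, c]])

def rot_angle(w):
    """2-D rotation matrix from an angle w."""
    return rot((np.cos(w), np.sin(w)))

def residuals(xs, us, ys, vs, ws, ds, O):
    """Stack the three residual groups from the objective above.

    xs: (T, 2) robot positions    vs: (T-1, 2) odometry observations
    us: (T, 2) heading vectors    ws: (T-1,)  heading-change observations
    ys: (K, 2) landmark positions
    ds: dict {(t, k): (2,) observed landmark vector}
    O : iterable of observed (t, k) pairs
    """
    r = []
    for t in range(len(xs) - 1):
        r.append(vs[t] - rot(us[t]) @ (xs[t + 1] - xs[t]))   # odometry term
        r.append(rot_angle(ws[t]) @ us[t] - us[t + 1])       # heading term
    for (t, k) in O:
        r.append(ds[(t, k)] - rot(us[t]) @ (ys[k] - xs[t]))  # landmark term
    return np.concatenate(r)
```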

Page 10

Ex: MLE in exponential family

$$L = -\ln \prod_k P(x_k \mid \theta)$$

$P(x_k \mid \theta) =$

$g(\theta) =$
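The two blanks are filled in lecture; the standard exponential-family form (a known fact, not copied from the slides) would be

\[
P(x_k \mid \theta) = h(x_k)\, \exp\!\big(\theta^\top T(x_k) - A(\theta)\big),
\qquad
g(\theta) = \nabla_\theta L = \sum_k \big(\nabla A(\theta) - T(x_k)\big),
\]

so the MLE sets the model's expected sufficient statistics $\nabla A(\theta) = \mathbb{E}_\theta[T(x)]$ equal to their empirical average.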

Page 11

MLE Newton interpretation

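The slide body is developed on the board; the standard interpretation (a known fact about exponential families, not copied from the slides) is that the Hessian of the negative log-likelihood is the Fisher information, so Newton's method here coincides with the natural-gradient / Fisher-scoring update from the review slide:

\[
\nabla^2 L(\theta) = \sum_k \nabla^2 A(\theta) = n \,\mathrm{Cov}_\theta[T(x)] = n\, F(\theta),
\qquad
d = -\tfrac{1}{n}\, F(\theta)^{-1} g(\theta).
\]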

Page 12

Convergence behavior

• min f(x) s.t. Ax = b
‣ strictly convex f(x), twice differentiable
‣ some kind of bound on 3rd derivative

• Two phases (see the sketch below)
‣ damped Newton: most of time here
‣ step size < 1
‣ quadratic convergence: a few final iterations to get accuracy very high
‣ step size = 1
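A minimal sketch of damped Newton with backtracking line search in Python (not from the slides; the parameter values are illustrative assumptions). Early iterations take damped steps with t < 1; near the solution the backtracking test accepts t = 1 and convergence becomes quadratic:

```python
import numpy as np

def damped_newton(f, g, H, x, alpha=0.25, beta=0.5, tol=1e-12, max_iter=100):
    """Newton's method with backtracking line search (unconstrained sketch).

    f, g, H : callables for the objective, gradient, and Hessian
    alpha, beta : backtracking parameters (illustrative defaults)
    """
    for _ in range(max_iter):
        grad = g(x)
        d = -np.linalg.solve(H(x), grad)        # Newton direction
        lam2 = -grad @ d                        # Newton decrement squared
        if lam2 / 2 < tol:
            break
        t = 1.0                                 # try the full step first
        while f(x + t * d) > f(x) + alpha * t * (grad @ d):
            t *= beta                           # damped phase: t < 1
        x = x + t * d
    return x
```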

Page 13

Convergence behavior

• Damped Newton
‣ $f(x_{t+1}) \le f(x_t) - \Delta$ for some fixed $\Delta > 0$
‣ limit:

• Quadratic convergence
‣ enter when $\text{error}_t \le \delta \le 0.5$
‣ $\text{error}_{t+1} \le (\text{error}_t)^2$
‣ limit:
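The two "limit:" blanks are filled in lecture; the standard iteration counts for the two phases (as in Boyd and Vandenberghe's analysis, stated here as a known result rather than copied from the slides) are

\[
\#\text{damped steps} \;\le\; \frac{f(x_0) - f^\star}{\Delta},
\qquad
\#\text{quadratic steps to reach error} \le \epsilon \;\le\; \log_2 \log_2 (1/\epsilon),
\]

since squaring an error that starts below $1/2$ gives $\text{error} \le (1/2)^{2^m}$ after $m$ steps.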

Page 14

Comparison of methods for minimizing a convex function (typical entries are sketched below)

|             | Newton | FISTA | (sub)grad. | stoch. (sub)grad. |
| convergence |        |       |            |                   |
| cost/iter   |        |       |            |                   |
| smoothness  |        |       |            |                   |
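The cells are filled in during lecture; typical entries (standard facts about these methods, stated with the usual caveats rather than copied from the slides) look like:

|             | Newton               | FISTA              | (sub)grad.          | stoch. (sub)grad.      |
| convergence | locally quadratic    | O(1/k²)            | O(1/√k)             | O(1/√k) in expectation |
| cost/iter   | O(d³) linear solve   | one gradient       | one subgradient     | one sample's gradient  |
| smoothness  | twice differentiable | Lipschitz gradient | convex, Lipschitz f | convex, Lipschitz f    |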

Page 15

Variations

• Trust region
‣ $[H(x) + tI]\, d_x = -g(x)$
‣ $[H(x) + tD]\, d_x = -g(x)$

• Quasi-Newton (see the sketch below)
‣ use only gradients, but build an estimate of the Hessian
‣ in $\mathbb{R}^d$, d gradient estimates at "nearby" points determine an approx. Hessian (think finite differences)
‣ can often get a "good enough" estimate with fewer, and even forget old info to save memory/time (L-BFGS)
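In practice L-BFGS is usually called from a library; a minimal sketch using SciPy's implementation (the Rosenbrock test function is an illustrative choice, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Classic test function with a narrow curved valley."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

# L-BFGS builds a low-memory Hessian estimate from recent gradient pairs.
res = minimize(rosenbrock, x0=np.zeros(2), jac=rosenbrock_grad,
               method="L-BFGS-B", options={"maxcor": 10})  # keep 10 pairs
print(res.x)  # ~ [1., 1.]
```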

Page 16

Variations: Gauss-Newton

$$L = \min_\theta \; \sum_k \frac{1}{2} \big\| y_k - f(x_k, \theta) \big\|^2$$
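Gauss-Newton replaces the exact Hessian with $J^\top J$, where $J$ is the Jacobian of the residuals with respect to $\theta$, so each step is a linear least-squares solve. A minimal sketch for this objective in Python (the model and all names are illustrative assumptions):

```python
import numpy as np

def gauss_newton(residual, jacobian, theta, max_iter=50, tol=1e-10):
    """Gauss-Newton for L(theta) = (1/2) * ||r(theta)||^2.

    residual : callable returning the stacked residuals r(theta)
    jacobian : callable returning dr/dtheta
    """
    for _ in range(max_iter):
        r, J = residual(theta), jacobian(theta)
        # Solve (J^T J) d = -J^T r, i.e. a linear least-squares step.
        d, *_ = np.linalg.lstsq(J, -r, rcond=None)
        theta = theta + d
        if np.linalg.norm(d) < tol:
            break
    return theta

# Illustrative example: fit y ~ exp(a * x) to noisy data.
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 20)
ys = np.exp(0.7 * xs) + 0.01 * rng.standard_normal(20)
res = lambda th: np.exp(th[0] * xs) - ys
jac = lambda th: (xs * np.exp(th[0] * xs)).reshape(-1, 1)
print(gauss_newton(res, jac, np.array([0.0])))  # ~ [0.7]
```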