
Page 1

Newton’s method

10-725 Optimization
Geoff Gordon
Ryan Tibshirani

Page 2

Review

• Volume rule

• Infomax ICA
‣ matrix natural gradient

Page 3

Review: Newton

• For solving nonlinear equations:
‣ approx by linear ones, solve, update approx
‣ $d = -J(x)^{-1} f(x)$

• For finding minima/maxima/saddles:
‣ just use Newton on the gradient $g(x) = f'(x) = 0$
‣ $d = -H(x)^{-1} g(x)$

• Line search: Newton is a descent method

• (Often) quadratic convergence
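A minimal sketch of this update in Python (not from the slides; the function names and the toy problem are illustrative assumptions):

```python
import numpy as np

def newton_root(f, J, x, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0: linearize, solve, update.

    f : callable returning the residual vector f(x)
    J : callable returning the Jacobian matrix J(x)
    """
    for _ in range(max_iter):
        d = -np.linalg.solve(J(x), f(x))   # d = -J(x)^{-1} f(x)
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Toy example (illustrative): solve x^2 = 2 in one dimension.
root = newton_root(lambda x: np.array([x[0]**2 - 2.0]),
                   lambda x: np.array([[2.0 * x[0]]]),
                   np.array([1.0]))
print(root)  # ~ [1.41421356]
```

For minimization the same loop applies with f replaced by the gradient g and J by the Hessian H.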

Page 4

Equality constraints

• min f(x) s.t. h(x) = 0

[Figure: 2-D illustration of the equality-constrained problem; axis ticks from −2 to 2]

Page 5

Optimality w/ equality

• min f(x) s.t. h(x) = 0
‣ $f: \mathbb{R}^d \to \mathbb{R}$, $h: \mathbb{R}^d \to \mathbb{R}^k$ ($k \le d$)
‣ $g: \mathbb{R}^d \to \mathbb{R}^d$ (gradient of f)

• Useful special case: min f(x) s.t. Ax = 0
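For the special case min f(x) s.t. Ax = 0, the first-order condition is the standard Lagrange one (a known result, stated here since the slide leaves it for lecture): the gradient must lie in the row space of A,

\[
Ax^\star = 0, \qquad g(x^\star) = A^\top \nu \ \text{ for some } \nu \in \mathbb{R}^k,
\]

i.e. $g(x^\star) \perp \mathrm{null}(A)$, so no feasible direction is a descent direction.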

Page 6

Picture

$$\max_{x,y,z} \; c^\top \begin{bmatrix} x \\ y \\ z \end{bmatrix} \quad \text{s.t.} \quad x^2 + y^2 + z^2 = 1, \qquad a^\top \begin{bmatrix} x \\ y \\ z \end{bmatrix} = b$$

Page 7

Optimality w/ equality

• min f(x) s.t. h(x) = 0
‣ $f: \mathbb{R}^d \to \mathbb{R}$, $h: \mathbb{R}^d \to \mathbb{R}^k$ ($k \le d$)
‣ $g: \mathbb{R}^d \to \mathbb{R}^d$ (gradient of f)

• Now suppose:
‣ dg =
‣ dh =

• Optimality:
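The blanks are filled in during lecture; the standard Lagrange condition for min f(x) s.t. h(x) = 0 (a known result, not copied from the slides) is

\[
h(x^\star) = 0, \qquad g(x^\star) + Dh(x^\star)^\top \nu = 0 \ \text{ for some } \nu \in \mathbb{R}^k.
\]

For the linear case min f(x) s.t. Ax = b used later in the deck, applying Newton to these equations gives the KKT system

\[
\begin{bmatrix} H(x) & A^\top \\ A & 0 \end{bmatrix}
\begin{bmatrix} d \\ w \end{bmatrix}
= - \begin{bmatrix} g(x) \\ Ax - b \end{bmatrix},
\]

whose right-hand side reduces to $[-g(x); 0]$ once the iterates are feasible.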

Page 8

Example: bundle adjustment

• Latent:
‣ Robot positions $x_t$, $\theta_t$
‣ Landmark positions $y_k$

• Observed: odometry, landmark vectors
‣ $v_t = R_{\theta_t}[x_{t+1} - x_t] + \text{noise}$
‣ $w_t = \theta_{t+1} - \theta_t + \text{noise}$
‣ $d_{kt} = R_{\theta_t}[y_k - x_t] + \text{noise}$

[Figure: example robot trajectory with landmarks in the plane]

$O = \{\text{observed } (k, t) \text{ pairs}\}$

Page 9

Bundle adjustment

$$\min_{x_t,\, u_t,\, y_k} \;\; \sum_t \big\| v_t - R(u_t)[x_{t+1} - x_t] \big\|^2 \;+\; \sum_t \big\| R_{w_t} u_t - u_{t+1} \big\|^2 \;+\; \sum_{(t,k) \in O} \big\| d_{k,t} - R(u_t)[y_k - x_t] \big\|^2 \qquad \text{s.t. } u_t^\top u_t = 1$$

‣ latent: Robot positions $x_t$, $\theta_t$
‣ $u_t = [\cos\theta_t;\ \sin\theta_t]$
‣ latent: Landmark positions $y_k$
‣ obs: $v_t = R_{\theta_t}[x_{t+1} - x_t] + \text{noise}$
‣ obs: $w_t = \theta_{t+1} - \theta_t + \text{noise}$
‣ obs: $d_{kt} = R_{\theta_t}[y_k - x_t] + \text{noise}$
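This objective is a sum of squared residuals, so it is a natural target for the Gauss-Newton variation at the end of the deck. A minimal sketch of stacking the residuals in Python (all names and array layouts are illustrative assumptions; the $u_t^\top u_t = 1$ constraint would be enforced separately, e.g. by renormalizing):

```python
import numpy as np

def rot(u):
    """2-D rotation matrix from a heading vector u = [cos θ, sin θ]."""
    c, s = u
    return np.array([[c, -s], [s, c]])

def rot_angle(w):
    """2-D rotation matrix from an angle w."""
    return rot((np.cos(w), np.sin(w)))

def residuals(xs, us, ys, vs, ws, ds, O):
    """Stack the three residual groups from the objective above.

    xs: (T, 2) robot positions    vs: (T-1, 2) odometry observations
    us: (T, 2) heading vectors    ws: (T-1,)  heading-change observations
    ys: (K, 2) landmark positions
    ds: dict {(t, k): (2,) observed landmark vector}
    O : iterable of observed (t, k) pairs
    """
    r = []
    for t in range(len(xs) - 1):
        r.append(vs[t] - rot(us[t]) @ (xs[t + 1] - xs[t]))   # odometry term
        r.append(rot_angle(ws[t]) @ us[t] - us[t + 1])       # heading term
    for (t, k) in O:
        r.append(ds[(t, k)] - rot(us[t]) @ (ys[k] - xs[t]))  # landmark term
    return np.concatenate(r)
```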

Page 10

Ex: MLE in exponential family

$$L = -\ln \prod_k P(x_k \mid \theta)$$

$P(x_k \mid \theta) =$

$g(\theta) =$
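The two blanks are filled in lecture; the standard exponential-family form (a known fact, not copied from the slides) would be

\[
P(x_k \mid \theta) = h(x_k)\, \exp\!\big(\theta^\top T(x_k) - A(\theta)\big),
\qquad
g(\theta) = \nabla_\theta L = \sum_k \big(\nabla A(\theta) - T(x_k)\big),
\]

so the MLE sets the model's expected sufficient statistics $\nabla A(\theta) = \mathbb{E}_\theta[T(x)]$ equal to their empirical average.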

Page 11

MLE Newton interpretation

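The slide body is developed on the board; the standard interpretation (a known fact about exponential families, not copied from the slides) is that the Hessian of the negative log-likelihood is the Fisher information, so Newton's method here coincides with the natural-gradient / Fisher-scoring update from the review slide:

\[
\nabla^2 L(\theta) = \sum_k \nabla^2 A(\theta) = n \,\mathrm{Cov}_\theta[T(x)] = n\, F(\theta),
\qquad
d = -\tfrac{1}{n}\, F(\theta)^{-1} g(\theta).
\]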

Page 12

Convergence behavior

• min f(x) s.t. Ax = b
‣ strictly convex f(x), twice differentiable
‣ some kind of bound on 3rd derivative

• Two phases (see the sketch below)
‣ damped Newton: most of time here
‣ step size < 1
‣ quadratic convergence: a few final iterations to get accuracy very high
‣ step size = 1
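A minimal sketch of damped Newton with backtracking line search in Python (not from the slides; the parameter values are illustrative assumptions). Early iterations take damped steps with t < 1; near the solution the backtracking test accepts t = 1 and convergence becomes quadratic:

```python
import numpy as np

def damped_newton(f, g, H, x, alpha=0.25, beta=0.5, tol=1e-12, max_iter=100):
    """Newton's method with backtracking line search (unconstrained sketch).

    f, g, H : callables for the objective, gradient, and Hessian
    alpha, beta : backtracking parameters (illustrative defaults)
    """
    for _ in range(max_iter):
        grad = g(x)
        d = -np.linalg.solve(H(x), grad)        # Newton direction
        lam2 = -grad @ d                        # Newton decrement squared
        if lam2 / 2 < tol:
            break
        t = 1.0                                 # try the full step first
        while f(x + t * d) > f(x) + alpha * t * (grad @ d):
            t *= beta                           # damped phase: t < 1
        x = x + t * d
    return x
```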

Page 13

Convergence behavior

• Damped Newton
‣ $f(x_{t+1}) \le f(x_t) - \Delta$ for some fixed $\Delta > 0$
‣ limit:

• Quadratic convergence
‣ enter when $\text{error}_t \le \delta \le 0.5$
‣ $\text{error}_{t+1} \le (\text{error}_t)^2$
‣ limit:
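The two "limit:" blanks are filled in lecture; the standard iteration counts for the two phases (as in Boyd and Vandenberghe's analysis, stated here as a known result rather than copied from the slides) are

\[
\#\text{damped steps} \;\le\; \frac{f(x_0) - f^\star}{\Delta},
\qquad
\#\text{quadratic steps to reach error} \le \epsilon \;\le\; \log_2 \log_2 (1/\epsilon),
\]

since squaring an error that starts below $1/2$ gives $\text{error} \le (1/2)^{2^m}$ after $m$ steps.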

Page 14

Comparison of methods for minimizing a convex function (typical entries are sketched below)

|             | Newton | FISTA | (sub)grad. | stoch. (sub)grad. |
| convergence |        |       |            |                   |
| cost/iter   |        |       |            |                   |
| smoothness  |        |       |            |                   |
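The cells are filled in during lecture; typical entries (standard facts about these methods, stated with the usual caveats rather than copied from the slides) look like:

|             | Newton               | FISTA              | (sub)grad.          | stoch. (sub)grad.      |
| convergence | locally quadratic    | O(1/k²)            | O(1/√k)             | O(1/√k) in expectation |
| cost/iter   | O(d³) linear solve   | one gradient       | one subgradient     | one sample's gradient  |
| smoothness  | twice differentiable | Lipschitz gradient | convex, Lipschitz f | convex, Lipschitz f    |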

Page 15

Variations

• Trust region
‣ $[H(x) + tI]\, d_x = -g(x)$
‣ $[H(x) + tD]\, d_x = -g(x)$

• Quasi-Newton (see the sketch below)
‣ use only gradients, but build an estimate of the Hessian
‣ in $\mathbb{R}^d$, d gradient estimates at "nearby" points determine an approx. Hessian (think finite differences)
‣ can often get a "good enough" estimate with fewer, and even forget old info to save memory/time (L-BFGS)
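In practice L-BFGS is usually called from a library; a minimal sketch using SciPy's implementation (the Rosenbrock test function is an illustrative choice, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Classic test function with a narrow curved valley."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

# L-BFGS builds a low-memory Hessian estimate from recent gradient pairs.
res = minimize(rosenbrock, x0=np.zeros(2), jac=rosenbrock_grad,
               method="L-BFGS-B", options={"maxcor": 10})  # keep 10 pairs
print(res.x)  # ~ [1., 1.]
```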

Page 16

Variations: Gauss-Newton

$$L = \min_\theta \; \sum_k \frac{1}{2} \big\| y_k - f(x_k, \theta) \big\|^2$$
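Gauss-Newton replaces the exact Hessian with $J^\top J$, where $J$ is the Jacobian of the residuals with respect to $\theta$, so each step is a linear least-squares solve. A minimal sketch for this objective in Python (the model and all names are illustrative assumptions):

```python
import numpy as np

def gauss_newton(residual, jacobian, theta, max_iter=50, tol=1e-10):
    """Gauss-Newton for L(theta) = (1/2) * ||r(theta)||^2.

    residual : callable returning the stacked residuals r(theta)
    jacobian : callable returning dr/dtheta
    """
    for _ in range(max_iter):
        r, J = residual(theta), jacobian(theta)
        # Solve (J^T J) d = -J^T r, i.e. a linear least-squares step.
        d, *_ = np.linalg.lstsq(J, -r, rcond=None)
        theta = theta + d
        if np.linalg.norm(d) < tol:
            break
    return theta

# Illustrative example: fit y ~ exp(a * x) to noisy data.
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 20)
ys = np.exp(0.7 * xs) + 0.01 * rng.standard_normal(20)
res = lambda th: np.exp(th[0] * xs) - ys
jac = lambda th: (xs * np.exp(th[0] * xs)).reshape(-1, 1)
print(gauss_newton(res, jac, np.array([0.0])))  # ~ [0.7]
```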