Newton’s method
10-725 Optimization, Fall 2012
Geoff Gordon
Ryan Tibshirani
Review
• Volume rule
• Infomax ICA
‣ matrix natural gradient
Review: Newton
• For solving nonlinear equations:
‣ approx by linear ones, solve, update approx
‣ d = –J(x)⁻¹ f(x)
• For finding minima/maxima/saddles:
‣ just use Newton on gradient g(x) = f’(x) = 0
‣ d = –H(x)⁻¹ g(x)
• Line search: Newton is a descent method
• (Often) quadratic convergence
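For concreteness, a minimal sketch of the damped Newton update described above, in Python/NumPy. The test function, tolerances, and backtracking constants are illustrative choices, not taken from the slides.

```python
import numpy as np

def newton_minimize(f, grad, hess, x0, tol=1e-8, max_iter=50):
    """Damped Newton: solve H(x) d = -g(x), then backtrack on f (Armijo)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton direction d = -H(x)^{-1} g(x)
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * g @ d:
            t *= 0.5                       # shrink until sufficient decrease
        x = x + t * d
    return x

# illustrative smooth, strictly convex test function
f    = lambda x: x[0] ** 4 + x[1] ** 2 + np.exp(x[0] - x[1])
grad = lambda x: np.array([4 * x[0] ** 3 + np.exp(x[0] - x[1]),
                           2 * x[1] - np.exp(x[0] - x[1])])
hess = lambda x: np.array([[12 * x[0] ** 2 + np.exp(x[0] - x[1]), -np.exp(x[0] - x[1])],
                           [-np.exp(x[0] - x[1]), 2 + np.exp(x[0] - x[1])]])
print(newton_minimize(f, grad, hess, np.array([1.0, 1.0])))
```

With the backtracking line search the Newton direction is always a descent direction here, since the Hessian of this test function is positive definite everywhere.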
Equality constraints
• min f(x) s.t. h(x) = 0
[Figure: illustration of min f(x) s.t. h(x) = 0]
Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: R^d → R, h: R^d → R^k (k ≤ d)
‣ g: R^d → R^d (gradient of f)
• Useful special case: min f(x) s.t. Ax = 0
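For reference, the first-order (Lagrange) optimality conditions for this problem, stated here as a standard fact rather than transcribed from the slide: a feasible x* is a candidate optimum when the gradient of f is a combination of the constraint gradients,

\[
g(x^\star) + J_h(x^\star)^\top \nu = 0, \qquad h(x^\star) = 0 ,
\]

and in the special case h(x) = Ax this becomes g(x*) + Aᵀν = 0 with Ax* = 0, i.e. the gradient must be orthogonal to the null space of A.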
Picture
\[
\max_{x,y,z} \; c^\top \begin{bmatrix} x \\ y \\ z \end{bmatrix}
\quad \text{s.t.} \quad x^2 + y^2 + z^2 = 1, \qquad a^\top x = b
\]
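One way to unpack this picture with Lagrange multipliers (added working; it assumes the linear constraint applies to the whole vector v = (x, y, z)): at an optimum the objective's gradient c must be a combination of the constraint gradients 2v and a,

\[
c = 2\lambda\, v^\star + \mu\, a, \qquad \|v^\star\|^2 = 1, \qquad a^\top v^\star = b,
\]

so the solution sits where a level plane of cᵀv just touches the circle cut out of the sphere by the plane.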
Optimality w/ equality
• min f(x) s.t. h(x) = 0
‣ f: R^d → R, h: R^d → R^k (k ≤ d)
‣ g: R^d → R^d (gradient of f)
• Now suppose:
‣ dg =
‣ dh =
• Optimality:
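A sketch of how these blanks are commonly filled in (the standard derivation, not copied from the slide): to first order, dg = H(x) dx and dh = J_h(x) dx. Asking for a step after which the Lagrangian is stationary and the linearized constraint holds gives the KKT system below; for h(x) = Ax − b it is exact, while for nonlinear h the top-left block would in general be the Hessian of the Lagrangian rather than of f alone.

\[
\begin{bmatrix} H(x) & J_h(x)^\top \\ J_h(x) & 0 \end{bmatrix}
\begin{bmatrix} dx \\ \nu \end{bmatrix}
= - \begin{bmatrix} g(x) \\ h(x) \end{bmatrix}
\]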
Example: bundle adjustment
• Latent:
‣ Robot positions x_t, θ_t
‣ Landmark positions y_k
• Observed: odometry, landmark vectors
‣ v_t = R_{θ_t} [x_{t+1} – x_t] + noise
‣ w_t = θ_{t+1} – θ_t + noise
‣ d_{k,t} = R_{θ_t} [y_k – x_t] + noise
O = {observed (t, k) pairs}
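Here R_θ is presumably the usual planar rotation (an assumption about the notation; it is at least consistent with u_t = [cos θ_t; sin θ_t] on the next slide):

\[
R_\theta = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\]

(or its transpose, depending on whether quantities are expressed in the world frame or the robot frame).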
Bundle adjustment
\[
\min_{x_t,\, u_t,\, y_k} \;\;
\sum_t \big\| v_t - R(u_t)\,[x_{t+1} - x_t] \big\|^2
\;+\; \sum_t \big\| R_{w_t}\, u_t - u_{t+1} \big\|^2
\;+\; \sum_{(t,k) \in O} \big\| d_{k,t} - R(u_t)\,[y_k - x_t] \big\|^2
\qquad \text{s.t. } u_t^\top u_t = 1
\]
‣ latent: Robot positions x_t, θ_t
‣ u_t = [cos θ_t; sin θ_t]
‣ latent: Landmark positions y_k
‣ obs: v_t = R_{θ_t} [x_{t+1} – x_t] + noise
‣ obs: w_t = θ_{t+1} – θ_t + noise
‣ obs: d_{k,t} = R_{θ_t} [y_k – x_t] + noise
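A minimal NumPy sketch of evaluating this objective. The 2-D setup, array layouts, and rotation sign conventions below are assumptions made for illustration; the slide does not pin them down.

```python
import numpy as np

def rot_from_u(u):
    # rotation built from the heading vector u = [cos(theta), sin(theta)];
    # world-to-robot sign convention assumed here
    c, s = u
    return np.array([[c, s], [-s, c]])

def rot_from_angle(w):
    # planar rotation by angle w, used to advance the heading u_t
    c, s = np.cos(w), np.sin(w)
    return np.array([[c, -s], [s, c]])

def ba_objective(x, u, y, v, w, d, obs):
    """Bundle-adjustment cost from the slide (up to sign conventions).
    x: (T,2) robot positions, u: (T,2) unit headings, y: (K,2) landmarks,
    v: (T-1,2) odometry translations, w: (T-1,) odometry angle increments,
    d: dict {(t,k): observed 2-vector}, obs: iterable of (t,k) pairs."""
    T = x.shape[0]
    cost = 0.0
    for t in range(T - 1):
        cost += np.sum((v[t] - rot_from_u(u[t]) @ (x[t + 1] - x[t])) ** 2)
        cost += np.sum((rot_from_angle(w[t]) @ u[t] - u[t + 1]) ** 2)
    for (t, k) in obs:
        cost += np.sum((d[(t, k)] - rot_from_u(u[t]) @ (y[k] - x[t])) ** 2)
    return cost
```

The constraint u_tᵀ u_t = 1 is enforced implicitly when u_t is parametrized as [cos θ_t; sin θ_t].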
Ex: MLE in exponential family
\[
L = -\ln \prod_k P(x_k \mid \theta)
\]
‣ P(x_k | θ) =
‣ g(θ) =
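A sketch of how these blanks are usually filled in for an exponential family (standard facts, not transcribed from the slide): with natural parameter θ and sufficient statistic T(x),

\[
P(x \mid \theta) = h(x)\, \exp\!\big( \theta^\top T(x) - A(\theta) \big),
\qquad
L(\theta) = \sum_{k=1}^{n} \big( A(\theta) - \theta^\top T(x_k) - \ln h(x_k) \big),
\]

so the gradient and Hessian are

\[
g(\theta) = n\,\nabla A(\theta) - \sum_k T(x_k) = n \big( \mathbb{E}_\theta[T(x)] - \bar T \big),
\qquad
H(\theta) = n\,\nabla^2 A(\theta) = n\,\mathrm{Cov}_\theta[T(x)],
\]

where \bar T is the empirical mean of the sufficient statistics.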
MLE Newton interpretation
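A guess at the interpretation developed on this slide, offered with that caveat: since the Hessian of the negative log-likelihood equals n times the covariance of the sufficient statistic, i.e. the Fisher information, the Newton step is Fisher scoring, a natural-gradient step, which connects back to the matrix natural gradient from the Infomax ICA review:

\[
d = -H(\theta)^{-1} g(\theta)
  = \mathrm{Cov}_\theta[T(x)]^{-1} \big( \bar T - \mathbb{E}_\theta[T(x)] \big).
\]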
Convergence behavior
• min_x f(x) s.t. Ax = b
‣ strictly convex f(x), twice differentiable
‣ some kind of bound on 3rd derivative
• Two phases
‣ damped Newton (most of time here)
‣ step size < 1
‣ quadratic convergence (a few final iterations to get accuracy very high)
‣ step size = 1
Convergence behavior
• Damped Newton
‣ f(x_{t+1}) ≤ f(x_t) – Δ, some fixed Δ > 0
‣ limit:
• Quadratic convergence
‣ enter when error_t ≤ δ ≤ 0.5
‣ error_{t+1} ≤ (error_t)²
‣ limit:
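The limits these bounds lead to, as in the classical analysis (e.g. Boyd & Vandenberghe; stated here as a standard fact, not read off the slide): the damped phase can last at most (f(x_0) − f*)/Δ iterations, and once the quadratic phase begins the error squares at every step, so error_{t+m} ≤ (1/2)^{2^m} and roughly log₂ log₂(1/ε) further iterations suffice for accuracy ε:

\[
\#\text{iterations} \;\le\; \frac{f(x_0) - f^\star}{\Delta} \;+\; \log_2 \log_2 (1/\epsilon).
\]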
Comparison of methods for minimizing a convex function
Newton FISTA (sub)grad stoch. (sub)grad.
convergence
cost/iter
smoothness
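One standard way the blank cells get filled in (textbook rates stated for reference, not transcribed from the lecture): Newton converges quadratically near the solution but pays roughly a d×d linear solve per iteration and needs second derivatives; FISTA gets f(x_k) − f* = O(1/k²) with one gradient (plus prox) per iteration and needs a Lipschitz gradient; the subgradient method gets O(1/√k) with one cheap subgradient per iteration and no smoothness; stochastic subgradient gets O(1/√k) in expectation at the lowest per-iteration cost.

\[
\text{Newton: } \|x_{k+1} - x^\star\| \le C \|x_k - x^\star\|^2 \text{ (locally)};
\quad
\text{FISTA: } f(x_k) - f^\star = O(1/k^2);
\quad
\text{(sub)grad: } O(1/\sqrt{k});
\quad
\text{stoch.: } O(1/\sqrt{k}) \text{ in expectation.}
\]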
Variations
• Trust region
‣ [H(x) + tI] dx = –g(x)
‣ [H(x) + tD] dx = –g(x)
• Quasi-Newton
‣ use only gradients, but build estimate of Hessian
‣ in R^d, d gradient estimates at “nearby” points determine approx. Hessian (think finite differences)
‣ can often get “good enough” estimate w/ fewer; can even forget old info to save memory/time (L-BFGS; see the sketch below)
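For context, a minimal example of running a quasi-Newton method in practice through SciPy's L-BFGS implementation; the Rosenbrock test function and starting point are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock test function and its gradient (illustrative choice)
def f(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

# L-BFGS keeps only a few recent gradient pairs to approximate the Hessian
res = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad, method="L-BFGS-B")
print(res.x)  # should be close to [1, 1]
```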
Variations: Gauss-Newton
\[
L = \min_\theta \sum_k \tfrac{1}{2} \big\| y_k - f(x_k, \theta) \big\|^2
\]
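To complete the slide, here is the standard Gauss-Newton sketch (not transcribed from the lecture): with residuals r_k(θ) = y_k − f(x_k, θ) and Jacobians J_k = ∂f(x_k, θ)/∂θ,

\[
\nabla_\theta L = -\sum_k J_k^\top r_k,
\qquad
\nabla^2_\theta L = \sum_k J_k^\top J_k \;-\; \sum_k \sum_i r_{k,i}\, \nabla^2_\theta f_i(x_k, \theta),
\]

and Gauss-Newton drops the residual-weighted curvature term, so the step solves

\[
\Big( \sum_k J_k^\top J_k \Big)\, d = \sum_k J_k^\top r_k ,
\]

which needs only first derivatives of f and coincides with the exact Newton step when the residuals vanish or f is linear in θ.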