Unit 18: Avoiding Derivatives - Stanford...
TRANSCRIPT
Avoiding Derivatives
Part II Roadmap
• Part I – Linear Algebra (units 1-12): Ax = b
• Part II – Optimization (units 13-20)
• (units 13-16) Optimization -> Nonlinear Equations -> 1D roots/minima
• (units 17-18) Computing/Avoiding Derivatives
• (unit 19) Hack 1.0: "I give up" (descent methods: H = I, or Δx mostly 0)
• (unit 20) Hack 2.0: "It's an ODE!?" (adaptive learning rate and momentum)
[Slide diagram: Theory and Methods tracks, with arrows labeled "linearize" and "line search"]
1D Root Finding (Unit 15)
• Newton's method requires f′, as do mixed methods using Newton
• The secant method replaces f′ with a secant line through two prior iterates
• Finite differencing (unit 17) may be used to approximate this derivative as well, although one needs to determine the size of the perturbation h
• Automatic differentiation (unit 17) may be used to find the value of f′ at a particular point, if/when "backprop" code exists, even when f and f′ are not known in closed form
• Convergence is only guaranteed under certain conditions, emphasizing the importance of safe set methods (such as mixed methods with bisection)
• Safe set methods also help to guard against errors in derivative approximations
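The secant replacement described above can be sketched in a few lines of Python (a minimal illustration, not course code; the function, tolerances, and starting points are assumptions):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: Newton's f'(x_k) is replaced by the slope of the
    secant line through the two most recent iterates."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:          # flat secant line; cannot divide
            break
        # secant step: x_{k+1} = x_k - f(x_k) * (x_k - x_{k-1}) / (f(x_k) - f(x_{k-1}))
        x0, x1, f0 = x1, x1 - f1 * (x1 - x0) / (f1 - f0), f1
        f1 = f(x1)
        if abs(f1) < tol:
            break
    return x1

# Example: the root of f(x) = x^2 - 2 is sqrt(2)
root = secant(lambda x: x * x - 2.0, 1.0, 2.0)
```

Note that, unlike bisection, nothing guarantees the iterates stay bracketed, which is why the slide emphasizes safe set methods.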
1D Optimization (Unit 16)
• Root finding approaches search for critical points as the roots of f′
• All root finding methods use the function itself (f′ here)
• Newton (and mixed methods using Newton) require the derivative of the function (f′′ here)
• Can use secant lines for f′ and interpolating parabolas for f′′, using either prior iterates (unit 16) or finite differences (unit 17)
• Automatic differentiation (unit 17) may be leveraged as well, although not (typically) for approaches that require f′′
• Safe set methods (such as mixed methods with bisection or golden section search) help to guard against errors in the approximation of various derivatives
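The interpolating-parabola idea can be sketched as follows (an illustrative, unsafeguarded helper with assumed naming; on an exactly quadratic function the first vertex jump already lands on the minimizer):

```python
def parabolic_min(f, a, b, c, iters=30):
    """Successive parabolic interpolation: fit a parabola through the three
    most recent points and jump to its vertex (an interpolating parabola
    standing in for second-derivative information)."""
    xs = [a, b, c]
    for _ in range(iters):
        x0, x1, x2 = xs[-3], xs[-2], xs[-1]
        f0, f1, f2 = f(x0), f(x1), f(x2)
        # vertex of the parabola through (x0,f0), (x1,f1), (x2,f2)
        num = (x1 - x0) ** 2 * (f1 - f2) - (x1 - x2) ** 2 * (f1 - f0)
        den = (x1 - x0) * (f1 - f2) - (x1 - x2) * (f1 - f0)
        if den == 0:          # degenerate parabola; stop
            break
        xs.append(x1 - 0.5 * num / den)
    return xs[-1]

# Example: f(x) = (x - 0.5)^2 + 1 is exactly quadratic, so the very first
# interpolating parabola has its vertex at the minimizer 0.5
xstar = parabolic_min(lambda x: (x - 0.5) ** 2 + 1.0, 0.0, 0.2, 1.0)
```

In practice this step is safeguarded (e.g., combined with golden section search), exactly as the slide suggests.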
Nonlinear Systems (Unit 14)
• J_F(x_k) Δx_k = −F(x_k) is solved to find the search direction Δx_k
• Then line search utilizes various 1D approaches (units 15/16)
• The Jacobian matrix of first derivatives J_F(x_k) needs to be evaluated (given x_k)
• Each entry ∂F_i/∂x_j (x_k) can be approximated via finite differences (unit 17) or automatic differentiation (unit 17)
• Quasi-Newton approaches get quite cavalier with the idea of a search direction, and as such make various aggressive approximations to the Jacobian J_F(x_k)
• Quasi-Newton can wildly perturb the search direction, so robust/safe set approaches to the 1D line search become quite important to making "progress" towards solutions
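Approximating each Jacobian entry ∂F_i/∂x_j by finite differences (unit 17) can be sketched as follows (a minimal forward-difference version, assuming NumPy; the perturbation size h must still be chosen):

```python
import numpy as np

def fd_jacobian(F, x, h=1e-6):
    """Approximate each Jacobian entry dF_i/dx_j with a forward difference:
    (F(x + h e_j) - F(x)) / h, one perturbed evaluation per column."""
    x = np.asarray(x, dtype=float)
    Fx = np.asarray(F(x))
    J = np.empty((Fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h                     # perturb the j-th coordinate only
        J[:, j] = (np.asarray(F(xp)) - Fx) / h
    return J

# Example: F(x) = (x0^2 + x1, sin(x0)) has Jacobian [[2 x0, 1], [cos(x0), 0]]
J = fd_jacobian(lambda x: np.array([x[0] ** 2 + x[1], np.sin(x[0])]),
                np.array([1.0, 2.0]))
```

The forward difference costs one extra evaluation of F per column and carries O(h) truncation error, which is the size-of-h tradeoff the slides mention.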
Broyden’s Method
• An initial guess for the Jacobian is repeatedly corrected with rank one updates, similar in spirit to a secant approach
• Let J_0 = I
• Solve J_k Δx_k = −F(x_k) to find the search direction Δx_k
• Use line search to find x_{k+1} and thus F(x_{k+1}), and then update Δx_k = x_{k+1} − x_k
• Update J_{k+1} = J_k + (1 / (Δx_k^T Δx_k)) (F(x_{k+1}) − F(x_k) − J_k Δx_k) Δx_k^T
• Note: J_{k+1} (x_{k+1} − x_k) = F(x_{k+1}) − F(x_k)
• That is, J_{k+1} satisfies a secant type equation J Δx = ΔF
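The update above can be sketched directly (a minimal illustration assuming NumPy; a full Newton-style step stands in for the line search, and the true Jacobian at the starting point is used as one reasonable seed for the guess):

```python
import numpy as np

def broyden(F, x0, J0, tol=1e-10, max_iter=50):
    """Sketch of Broyden's method: correct an initial Jacobian guess with a
    rank-one update each step so the secant equation J dx = dF holds."""
    x = np.asarray(x0, dtype=float)
    J = np.asarray(J0, dtype=float)
    Fx = F(x)
    for _ in range(max_iter):
        dx = np.linalg.solve(J, -Fx)        # search direction: J_k dx_k = -F(x_k)
        x_new = x + dx                      # (a real code would line search here)
        F_new = F(x_new)
        # rank-one correction so that J_{k+1} dx_k = F(x_{k+1}) - F(x_k)
        J = J + np.outer(F_new - Fx - J @ dx, dx) / (dx @ dx)
        x, Fx = x_new, F_new
        if np.linalg.norm(Fx) < tol:
            break
    return x

# Example: x0^2 + x1^2 = 4 and x0 = x1, whose solution is (sqrt(2), sqrt(2));
# the guess J0 is the true Jacobian at the starting point (1.5, 1.5)
F = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]])
sol = broyden(F, [1.5, 1.5], [[3.0, 3.0], [1.0, -1.0]])
```

After the first step, no further Jacobian evaluations are needed: every correction is built from function values alone.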
Optimization (Unit 13)
• A scalar cost function f(x) has critical points where ∇f(x) = 0 (unit 13)
• H_f(x_k) Δx_k = −∇f(x_k) is solved to find a search direction Δx_k (unit 14)
• Then line search utilizes various 1D approaches (units 15/16)
• The Hessian matrix of second derivatives H_f(x_k) and the gradient vector of first derivatives ∇f(x_k) both need to be evaluated (given x_k)
• The various entries can be evaluated via finite differences (unit 17) or automatic differentiation (unit 17)
• These approaches can struggle on the Hessian matrix of second partial derivatives
• This makes Quasi-Newton approaches quite popular for optimization
• When the number of unknowns n is large, the O(n²) Hessian H_f is unwieldy/intractable, so some approaches instead approximate the action of H_f^{-1} on a vector (i.e., on the right hand side)
Broyden’s Method (for Optimization)
• Same formulation as for nonlinear systems
• Solve for the search direction, and find x_{k+1} and ∇f(x_{k+1})
• Update Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Then (H_f)_{k+1} = (H_f)_k + (1 / (Δx_k^T Δx_k)) (Δ∇f − (H_f)_k Δx_k) Δx_k^T
• So that (H_f)_{k+1} Δx_k = Δ∇f
Broyden’s Method (for Optimization)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (H_f^{-1})_k + (Δx_k − (H_f^{-1})_k Δ∇f) Δx_k^T (H_f^{-1})_k / (Δx_k^T (H_f^{-1})_k Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
SR1 (Symmetric Rank 1)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (H_f^{-1})_k + (Δx_k − (H_f^{-1})_k Δ∇f) (Δx_k − (H_f^{-1})_k Δ∇f)^T / ((Δx_k − (H_f^{-1})_k Δ∇f)^T Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
DFP (Davidon-Fletcher-Powell)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (H_f^{-1})_k − (H_f^{-1})_k Δ∇f Δ∇f^T (H_f^{-1})_k / (Δ∇f^T (H_f^{-1})_k Δ∇f) + Δx_k Δx_k^T / (Δx_k^T Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (I − Δx_k Δ∇f^T / (Δx_k^T Δ∇f)) (H_f^{-1})_k (I − Δ∇f Δx_k^T / (Δx_k^T Δ∇f)) + Δx_k Δx_k^T / (Δx_k^T Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
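The BFGS inverse update can be coded directly from the formula above (a sketch assuming NumPy; the quadratic test function and the crude halving "line search" are illustrative stand-ins for the safeguarded 1D searches of units 15/16):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, steps=100, tol=1e-10):
    """Quasi-Newton sketch using the BFGS inverse update: keep an estimate
    Hinv of the inverse Hessian, step along p = -Hinv @ g, and update Hinv
    so the secant equation Hinv_{k+1} @ dgrad = dx holds."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    I = np.eye(n)
    Hinv = I.copy()
    g = grad(x)
    for _ in range(steps):
        p = -(Hinv @ g)                     # quasi-Newton search direction
        t = 1.0                             # crude halving "line search"
        while f(x + t * p) > f(x) and t > 1e-12:
            t *= 0.5
        dx = t * p
        g_new = grad(x + dx)
        dg = g_new - g
        denom = dx @ dg
        if abs(denom) > 1e-14:              # skip update on ~zero curvature
            V = I - np.outer(dx, dg) / denom
            Hinv = V @ Hinv @ V.T + np.outer(dx, dx) / denom
        x, g = x + dx, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# Example: minimize f(x) = (x0 - 1)^2 + 10 (x1 + 2)^2
f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
xmin = bfgs_minimize(f, grad, [0.0, 0.0])
```

No Hessian is ever evaluated or inverted: Hinv is built entirely from gradient differences, and each step costs only matrix-vector and outer-product work.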
L-BFGS (Limited Memory BFGS)
• BFGS stores a dense size n×n approximation to the inverse Hessian
• This can become unwieldy for large problems
• Smarter storage can be accomplished leveraging vectors that describe the outer products; however, the number of vectors still grows with the iteration count
• L-BFGS instead estimates the inverse Hessian using only a few vectors
• Often fewer than 10 vectors (vectors, not matrices)
• This makes it quite popular for machine learning
• On optimization methods for deep learning, Andrew Ng et al., ICML 2011: "we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms"
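The core of L-BFGS is the "two-loop recursion", which applies the BFGS inverse-Hessian estimate to a gradient using only the stored vector pairs, never forming the dense matrix (a sketch with assumed naming; the (s^T y)/(y^T y) initial scaling is the common heuristic choice):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """L-BFGS two-loop recursion: compute -Hinv @ g from the stored pairs
    (s, y) = (dx, dgrad), newest last, without building Hinv."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):      # first loop
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    if s_list:                                                # initial Hinv scaling
        s, y = s_list[-1], y_list[-1]
        q = q * ((s @ y) / (y @ y))
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # second loop
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q = q + (a - b) * s
    return -q   # search direction -Hinv @ g (steepest descent if no pairs stored)

# Sanity check of the secant property: for the most recent stored pair,
# Hinv maps dgrad to dx, so the direction for g = y_recent is -s_recent
s1, y1 = np.array([1.0, 0.0, 2.0]), np.array([2.0, 1.0, 0.0])
s2, y2 = np.array([0.0, 1.0, 1.0]), np.array([1.0, 3.0, 2.0])
d = lbfgs_direction(y2, [s1, s2], [y1, y2])
```

A full optimizer wraps this in the usual quasi-Newton loop, pushing each new (Δx, Δ∇f) pair into the short history and discarding the oldest, so storage stays at a handful of length-n vectors.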
Nonlinear Least Squares (ML relevancy)
• Minimize a cost function of the form: f(θ) = ½ F^T(θ) F(θ)
• Recall from unit 13:
• Determine parameters θ that make g(x, y, θ) = 0 best fit the training data, i.e. that make ‖g(x_i, y_i, θ)‖² = g^T(x_i, y_i, θ) g(x_i, y_i, θ) close to zero for all i
• Combining all (x_i, y_i), minimize f(θ) = Σ_i g^T(x_i, y_i, θ) g(x_i, y_i, θ)
• Let N be the number of data points and m be the output size of g(x, y, θ)
• Define F(θ) by stacking the m outputs of g(x, y, θ) consecutively N times, so that the vector valued output of F(θ) is length N·m
• Then, f(θ) = Σ_i g^T(x_i, y_i, θ) g(x_i, y_i, θ) = F^T(θ) F(θ)
• Multiplying by ½ doesn't change the minimum
Nonlinear Least Squares (Critical Points)
• Minimize f(θ) = ½ F^T(θ) F(θ)
• The Jacobian matrix of F is J_F(θ) = [ ∂F/∂θ_1 (θ)   ∂F/∂θ_2 (θ)   ⋯   ∂F/∂θ_n (θ) ]
• Critical points of f(θ) satisfy ∇f(θ) = ( F^T(θ) ∂F/∂θ_1 (θ), F^T(θ) ∂F/∂θ_2 (θ), ⋯, F^T(θ) ∂F/∂θ_n (θ) )^T = J_F^T(θ) F(θ) = 0
Gauss Newton
• J_F^T(θ) F(θ) = 0 becomes J_F^T(θ) (F(θ_k) + J_F(θ_k) Δθ_k + ⋯) = 0
• Using the Taylor series: F(θ) = F(θ_k) + J_F(θ_k) Δθ_k + ⋯
• Eliminating high order terms: J_F^T(θ) (F(θ_k) + J_F(θ_k) Δθ_k) ≈ 0
• Evaluating J_F^T at θ_k gives J_F^T(θ_k) J_F(θ_k) Δθ_k ≈ −J_F^T(θ_k) F(θ_k)
• Compare to H_f(θ_k) Δθ_k = −∇f(θ_k) and note that ∇f(θ) = J_F^T(θ) F(θ)
• This implies an estimate of H_f(θ_k) ≈ J_F^T(θ_k) J_F(θ_k)
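A Gauss-Newton iteration can be sketched as follows (assuming NumPy; the exponential model, data, and solving via the normal equations are illustrative choices, and noiseless data is used so the residual vanishes at the fit):

```python
import numpy as np

def gauss_newton(F, J, theta0, iters=20):
    """Gauss-Newton sketch: at each step solve the normal equations
    J^T J dtheta = -J^T F (i.e., linear least squares for J dtheta = -F)
    and update theta; only first derivatives of F are ever needed."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        Jt = J(theta)
        r = F(theta)
        dtheta = np.linalg.solve(Jt.T @ Jt, -(Jt.T @ r))
        theta = theta + dtheta
    return theta

# Example (hypothetical model): fit y = exp(a x) to noiseless data with a = 0.5
xs = np.linspace(0.0, 2.0, 8)
ys = np.exp(0.5 * xs)
F = lambda th: np.exp(th[0] * xs) - ys                     # stacked residuals F(theta)
J = lambda th: (xs * np.exp(th[0] * xs)).reshape(-1, 1)    # Jacobian column dF/da
a_fit = gauss_newton(F, J, np.array([0.0]))
```

Note the Hessian H_f is never formed: J_F^T J_F stands in for it, which is exact up to the neglected second-derivative terms of F.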
Gauss Newton (L² approach)
• The Gauss Newton equations J_F^T(θ_k) J_F(θ_k) Δθ_k = −J_F^T(θ_k) F(θ_k) are the normal equations for J_F(θ_k) Δθ_k = −F(θ_k)
• Thus, (instead) solve J_F(θ_k) Δθ_k = −F(θ_k) via any least squares (L²) and minimum norm approach
• Note that setting the second factor in J_F^T(θ) (F(θ_k) + J_F(θ_k) Δθ_k) ≈ 0 to zero also leads to J_F(θ_k) Δθ_k = −F(θ_k)
• This is a linearization of the nonlinear system F(θ) = 0, aiming to minimize f(θ) = ½ F^T(θ) F(θ)
Weighted Gauss Newton
• Recall: Row scaling changes the importance of the equations
• Recall: Thus, it also changes the (unique) least squares solution for any overdetermined degrees of freedom
• Given a diagonal matrix W indicating the importance of various equations:
• W J_F(θ_k) Δθ_k = −W F(θ_k)
• J_F^T(θ_k) W² J_F(θ_k) Δθ_k = −J_F^T(θ_k) W² F(θ_k)
Gauss Newton (Regularized)
• When concerned about small singular values in J_F(θ_k) Δθ_k = −F(θ_k), one can add λΔθ = 0 as extra equations (unit 12 regularization)
• This results in (J_F^T(θ_k) J_F(θ_k) + λ²I) Δθ_k = −J_F^T(θ_k) F(θ_k)
• This is often called Levenberg-Marquardt or Damped (Nonlinear) Least Squares
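The damped variant is a one-line change to Gauss-Newton (a sketch assuming NumPy; the fixed damping λ and the exponential model are illustrative, whereas practical Levenberg-Marquardt codes also adapt λ between iterations):

```python
import numpy as np

def levenberg_marquardt(F, J, theta0, lam=1e-3, iters=50):
    """Damped least squares sketch: like Gauss-Newton, but add lam^2 * I to
    J^T J (equivalently, append rows lam * dtheta = 0 to the least squares
    system) to guard against small singular values of J."""
    theta = np.asarray(theta0, dtype=float)
    n = theta.size
    for _ in range(iters):
        Jt = J(theta)
        r = F(theta)
        A = Jt.T @ Jt + lam ** 2 * np.eye(n)   # regularized normal equations
        theta = theta + np.linalg.solve(A, -(Jt.T @ r))
    return theta

# Example (hypothetical model): fit y = exp(a x) to noiseless data with a = 0.5
xs = np.linspace(0.0, 2.0, 8)
ys = np.exp(0.5 * xs)
F = lambda th: np.exp(th[0] * xs) - ys
J = lambda th: (xs * np.exp(th[0] * xs)).reshape(-1, 1)
a_lm = levenberg_marquardt(F, J, np.array([0.0]))
```

Large λ pushes the step toward (scaled) gradient descent, small λ recovers Gauss-Newton, which is why adapting λ interpolates between robustness and fast local convergence.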
Gradient/Steepest Descent
• Approximate H_f(x_k) very crudely with the identity matrix (the first step of Broyden)
• Then H_f(x_k) Δx_k = −∇f(x_k) becomes I Δx_k = −∇f(x_k)
• That is, the search direction is Δx_k = −∇f(x_k) = −∇f(x_k), the steepest descent direction
• No matrix inversion is necessary
• See unit 19
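Steepest descent with a fixed step length can be sketched in a few lines (assuming NumPy; the learning rate lr, which plays the role of the line search, and the quadratic example are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Steepest descent sketch: the Hessian is replaced by the identity, so
    the search direction is just -grad f(x), scaled by a fixed lr."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)     # no linear solve, no matrix storage
    return x

# Example: f(x) = (x0 - 1)^2 + (x1 + 2)^2 has gradient 2 (x - (1, -2))
xmin = gradient_descent(lambda x: 2.0 * (x - np.array([1.0, -2.0])),
                        np.array([0.0, 0.0]))
```

Each iteration costs only one gradient evaluation, which is exactly the appeal when forming or inverting H_f is intractable.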
Coordinate Descent
• Coordinate Descent ignores H_f(x_k) Δx_k = −∇f(x_k) completely
• Instead Δx_k is set to the various coordinate directions ê_i
• No matrix inversion is necessary
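A minimal coordinate descent sketch (assuming NumPy; a 1D line search along each coordinate is the usual refinement, while a fixed trial step is used here for brevity, so the answer is only accurate to about that step size):

```python
import numpy as np

def coordinate_descent(f, x0, step=0.1, sweeps=100):
    """Coordinate descent sketch: cycle through the coordinate directions
    e_i, moving a fixed trial step along whichever sign decreases f.
    Only function values are used; no derivatives and no linear solves."""
    x = np.asarray(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(x.size):
            for s in (step, -step):
                trial = x.copy()
                trial[i] += s          # candidate move along +/- e_i
                if f(trial) < f(x):
                    x = trial
                    break
    return x

# Example: minimize f(x) = (x0 - 1)^2 + (x1 + 2)^2 from the origin
xmin = coordinate_descent(lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2,
                          np.array([0.0, 0.0]))
```

Because each move changes a single coordinate, this needs neither gradients nor a Jacobian/Hessian, at the cost of many function evaluations.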