Unit 18: Avoiding Derivatives - Stanford...
TRANSCRIPT
Avoiding Derivatives
Part II Roadmap
• Part I – Linear Algebra (units 1-12): Ax = b
• Part II – Optimization (units 13-20)
• (units 13-16) Optimization -> Nonlinear Equations -> 1D roots/minima
• (units 17-18) Computing/Avoiding Derivatives
• (unit 19) Hack 1.0: "I give up" (descent methods: H = I, or Δx mostly 0)
• (unit 20) Hack 2.0: "It's an ODE!?" (adaptive learning rate and momentum)
[Slide diagram: Theory and Methods tracks, with arrows labeled "linearize" and "line search"]
1D Root Finding (Unit 15)
• Newton's method requires f′, as do mixed methods using Newton
• The secant method replaces f′ with a secant line through two prior iterates
• Finite differencing (unit 17) may be used to approximate this derivative as well, although one needs to determine the size of the perturbation h
• Automatic differentiation (unit 17) may be used to find the value of f′ at a particular point, if/when "backprop" code exists, even when f and f′ are not known in closed form
• Convergence is only guaranteed under certain conditions, emphasizing the importance of safe set methods (such as mixed methods with bisection)
• Safe set methods also help to guard against errors in derivative approximations
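The secant replacement described above can be sketched in a few lines of Python (a minimal illustration, not course code; the function, tolerances, and starting points are assumptions):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: Newton's f'(x_k) is replaced by the slope of the
    secant line through the two most recent iterates."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:          # flat secant line; cannot divide
            break
        # secant step: x_{k+1} = x_k - f(x_k) * (x_k - x_{k-1}) / (f(x_k) - f(x_{k-1}))
        x0, x1, f0 = x1, x1 - f1 * (x1 - x0) / (f1 - f0), f1
        f1 = f(x1)
        if abs(f1) < tol:
            break
    return x1

# Example: the root of f(x) = x^2 - 2 is sqrt(2)
root = secant(lambda x: x * x - 2.0, 1.0, 2.0)
```

Note that, unlike bisection, nothing guarantees the iterates stay bracketed, which is why the slide emphasizes safe set methods.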
1D Optimization (Unit 16)
• Root finding approaches search for critical points as the roots of f′
• All root finding methods use the function itself (f′ here)
• Newton (and mixed methods using Newton) require the derivative of the function (f′′ here)
• Can use secant lines for f′ and interpolating parabolas for f′′, using either prior iterates (unit 16) or finite differences (unit 17)
• Automatic differentiation (unit 17) may be leveraged as well, although not (typically) for approaches that require f′′
• Safe set methods (such as mixed methods with bisection or golden section search) help to guard against errors in the approximation of various derivatives
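The interpolating-parabola idea can be sketched as follows (an illustrative, unsafeguarded helper with assumed naming; on an exactly quadratic function the first vertex jump already lands on the minimizer):

```python
def parabolic_min(f, a, b, c, iters=30):
    """Successive parabolic interpolation: fit a parabola through the three
    most recent points and jump to its vertex (an interpolating parabola
    standing in for second-derivative information)."""
    xs = [a, b, c]
    for _ in range(iters):
        x0, x1, x2 = xs[-3], xs[-2], xs[-1]
        f0, f1, f2 = f(x0), f(x1), f(x2)
        # vertex of the parabola through (x0,f0), (x1,f1), (x2,f2)
        num = (x1 - x0) ** 2 * (f1 - f2) - (x1 - x2) ** 2 * (f1 - f0)
        den = (x1 - x0) * (f1 - f2) - (x1 - x2) * (f1 - f0)
        if den == 0:          # degenerate parabola; stop
            break
        xs.append(x1 - 0.5 * num / den)
    return xs[-1]

# Example: f(x) = (x - 0.5)^2 + 1 is exactly quadratic, so the very first
# interpolating parabola has its vertex at the minimizer 0.5
xstar = parabolic_min(lambda x: (x - 0.5) ** 2 + 1.0, 0.0, 0.2, 1.0)
```

In practice this step is safeguarded (e.g., combined with golden section search), exactly as the slide suggests.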
Nonlinear Systems (Unit 14)
• J_F(x_k) Δx_k = −F(x_k) is solved to find the search direction Δx_k
• Then line search utilizes various 1D approaches (units 15/16)
• The Jacobian matrix of first derivatives J_F(x_k) needs to be evaluated (given x_k)
• Each entry ∂F_i/∂x_j (x_k) can be approximated via finite differences (unit 17) or automatic differentiation (unit 17)
• Quasi-Newton approaches get quite cavalier with the idea of a search direction, and as such make various aggressive approximations to the Jacobian J_F(x_k)
• Quasi-Newton can wildly perturb the search direction, so robust/safe set approaches to the 1D line search become quite important to making "progress" towards solutions
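Approximating each Jacobian entry ∂F_i/∂x_j by finite differences (unit 17) can be sketched as follows (a minimal forward-difference version, assuming NumPy; the perturbation size h must still be chosen):

```python
import numpy as np

def fd_jacobian(F, x, h=1e-6):
    """Approximate each Jacobian entry dF_i/dx_j with a forward difference:
    (F(x + h e_j) - F(x)) / h, one perturbed evaluation per column."""
    x = np.asarray(x, dtype=float)
    Fx = np.asarray(F(x))
    J = np.empty((Fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += h                     # perturb the j-th coordinate only
        J[:, j] = (np.asarray(F(xp)) - Fx) / h
    return J

# Example: F(x) = (x0^2 + x1, sin(x0)) has Jacobian [[2 x0, 1], [cos(x0), 0]]
J = fd_jacobian(lambda x: np.array([x[0] ** 2 + x[1], np.sin(x[0])]),
                np.array([1.0, 2.0]))
```

The forward difference costs one extra evaluation of F per column and carries O(h) truncation error, which is the size-of-h tradeoff the slides mention.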
Broyden’s Method
• An initial guess for the Jacobian is repeatedly corrected with rank one updates, similar in spirit to a secant approach
• Let J_0 = I
• Solve J_k Δx_k = −F(x_k) to find the search direction Δx_k
• Use line search to find x_{k+1} and thus F(x_{k+1}), and then update Δx_k = x_{k+1} − x_k
• Update J_{k+1} = J_k + (1 / (Δx_k^T Δx_k)) (F(x_{k+1}) − F(x_k) − J_k Δx_k) Δx_k^T
• Note: J_{k+1} (x_{k+1} − x_k) = F(x_{k+1}) − F(x_k)
• That is, J_{k+1} satisfies a secant type equation J Δx = ΔF
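The update above can be sketched directly (a minimal illustration assuming NumPy; a full Newton-style step stands in for the line search, and the true Jacobian at the starting point is used as one reasonable seed for the guess):

```python
import numpy as np

def broyden(F, x0, J0, tol=1e-10, max_iter=50):
    """Sketch of Broyden's method: correct an initial Jacobian guess with a
    rank-one update each step so the secant equation J dx = dF holds."""
    x = np.asarray(x0, dtype=float)
    J = np.asarray(J0, dtype=float)
    Fx = F(x)
    for _ in range(max_iter):
        dx = np.linalg.solve(J, -Fx)        # search direction: J_k dx_k = -F(x_k)
        x_new = x + dx                      # (a real code would line search here)
        F_new = F(x_new)
        # rank-one correction so that J_{k+1} dx_k = F(x_{k+1}) - F(x_k)
        J = J + np.outer(F_new - Fx - J @ dx, dx) / (dx @ dx)
        x, Fx = x_new, F_new
        if np.linalg.norm(Fx) < tol:
            break
    return x

# Example: x0^2 + x1^2 = 4 and x0 = x1, whose solution is (sqrt(2), sqrt(2));
# the guess J0 is the true Jacobian at the starting point (1.5, 1.5)
F = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]])
sol = broyden(F, [1.5, 1.5], [[3.0, 3.0], [1.0, -1.0]])
```

After the first step, no further Jacobian evaluations are needed: every correction is built from function values alone.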
Optimization (Unit 13)
• A scalar cost function f(x) has critical points where ∇f(x) = 0 (unit 13)
• H_f(x_k) Δx_k = −∇f(x_k) is solved to find a search direction Δx_k (unit 14)
• Then line search utilizes various 1D approaches (units 15/16)
• The Hessian matrix of second derivatives H_f(x_k) and the gradient vector of first derivatives ∇f(x_k) both need to be evaluated (given x_k)
• The various entries can be evaluated via finite differences (unit 17) or automatic differentiation (unit 17)
• These approaches can struggle on the Hessian matrix of second partial derivatives
• This makes Quasi-Newton approaches quite popular for optimization
• When the number of unknowns n is large, the O(n²) Hessian H_f is unwieldy/intractable, so some approaches instead approximate the action of H_f^{-1} on a vector (i.e., on the right hand side)
Broyden’s Method (for Optimization)
• Same formulation as for nonlinear systems
• Solve for the search direction, and find x_{k+1} and ∇f(x_{k+1})
• Update Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Then (H_f)_{k+1} = (H_f)_k + (1 / (Δx_k^T Δx_k)) (Δ∇f − (H_f)_k Δx_k) Δx_k^T
• So that (H_f)_{k+1} Δx_k = Δ∇f
Broyden’s Method (for Optimization)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (H_f^{-1})_k + (Δx_k − (H_f^{-1})_k Δ∇f) Δx_k^T (H_f^{-1})_k / (Δx_k^T (H_f^{-1})_k Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
SR1 (Symmetric Rank 1)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (H_f^{-1})_k + (Δx_k − (H_f^{-1})_k Δ∇f) (Δx_k − (H_f^{-1})_k Δ∇f)^T / ((Δx_k − (H_f^{-1})_k Δ∇f)^T Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
DFP (Davidon-Fletcher-Powell)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (H_f^{-1})_k − (H_f^{-1})_k Δ∇f Δ∇f^T (H_f^{-1})_k / (Δ∇f^T (H_f^{-1})_k Δ∇f) + Δx_k Δx_k^T / (Δx_k^T Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
BFGS (Broyden-Fletcher-Goldfarb-Shanno)
• For the inverse, using Δx_k = x_{k+1} − x_k and Δ∇f = ∇f(x_{k+1}) − ∇f(x_k)
• Define (H_f^{-1})_{k+1} = (I − Δx_k Δ∇f^T / (Δx_k^T Δ∇f)) (H_f^{-1})_k (I − Δ∇f Δx_k^T / (Δx_k^T Δ∇f)) + Δx_k Δx_k^T / (Δx_k^T Δ∇f)
• So that (H_f^{-1})_{k+1} Δ∇f = Δx_k
• Then, solving H_f(x_{k+1}) Δx_{k+1} = −∇f(x_{k+1}) is replaced with defining the search direction by Δx_{k+1} = −(H_f^{-1})_{k+1} ∇f(x_{k+1})
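The BFGS inverse update can be coded directly from the formula above (a sketch assuming NumPy; the quadratic test function and the crude halving "line search" are illustrative stand-ins for the safeguarded 1D searches of units 15/16):

```python
import numpy as np

def bfgs_minimize(f, grad, x0, steps=100, tol=1e-10):
    """Quasi-Newton sketch using the BFGS inverse update: keep an estimate
    Hinv of the inverse Hessian, step along p = -Hinv @ g, and update Hinv
    so the secant equation Hinv_{k+1} @ dgrad = dx holds."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    I = np.eye(n)
    Hinv = I.copy()
    g = grad(x)
    for _ in range(steps):
        p = -(Hinv @ g)                     # quasi-Newton search direction
        t = 1.0                             # crude halving "line search"
        while f(x + t * p) > f(x) and t > 1e-12:
            t *= 0.5
        dx = t * p
        g_new = grad(x + dx)
        dg = g_new - g
        denom = dx @ dg
        if abs(denom) > 1e-14:              # skip update on ~zero curvature
            V = I - np.outer(dx, dg) / denom
            Hinv = V @ Hinv @ V.T + np.outer(dx, dx) / denom
        x, g = x + dx, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# Example: minimize f(x) = (x0 - 1)^2 + 10 (x1 + 2)^2
f = lambda x: (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
xmin = bfgs_minimize(f, grad, [0.0, 0.0])
```

No Hessian is ever evaluated or inverted: Hinv is built entirely from gradient differences, and each step costs only matrix-vector and outer-product work.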
L-BFGS (Limited Memory BFGS)
• BFGS stores a dense size n×n approximation to the inverse Hessian
• This can become unwieldy for large problems
• Smarter storage can be accomplished leveraging vectors that describe the outer products; however, the number of vectors still grows with the iteration count
• L-BFGS instead estimates the inverse Hessian using only a few vectors
• Often fewer than 10 vectors (vectors, not matrices)
• This makes it quite popular for machine learning
• On optimization methods for deep learning, Andrew Ng et al., ICML 2011: "we show that more sophisticated off-the-shelf optimization methods such as Limited memory BFGS (L-BFGS) and Conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms"
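The core of L-BFGS is the "two-loop recursion", which applies the BFGS inverse-Hessian estimate to a gradient using only the stored vector pairs, never forming the dense matrix (a sketch with assumed naming; the (s^T y)/(y^T y) initial scaling is the common heuristic choice):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """L-BFGS two-loop recursion: compute -Hinv @ g from the stored pairs
    (s, y) = (dx, dgrad), newest last, without building Hinv."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):      # first loop
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    if s_list:                                                # initial Hinv scaling
        s, y = s_list[-1], y_list[-1]
        q = q * ((s @ y) / (y @ y))
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # second loop
        rho = 1.0 / (y @ s)
        b = rho * (y @ q)
        q = q + (a - b) * s
    return -q   # search direction -Hinv @ g (steepest descent if no pairs stored)

# Sanity check of the secant property: for the most recent stored pair,
# Hinv maps dgrad to dx, so the direction for g = y_recent is -s_recent
s1, y1 = np.array([1.0, 0.0, 2.0]), np.array([2.0, 1.0, 0.0])
s2, y2 = np.array([0.0, 1.0, 1.0]), np.array([1.0, 3.0, 2.0])
d = lbfgs_direction(y2, [s1, s2], [y1, y2])
```

A full optimizer wraps this in the usual quasi-Newton loop, pushing each new (Δx, Δ∇f) pair into the short history and discarding the oldest, so storage stays at a handful of length-n vectors.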
Nonlinear Least Squares (ML relevancy)
• Minimize a cost function of the form: f(θ) = ½ F^T(θ) F(θ)
• Recall from unit 13:
• Determine parameters θ that make g(x, y, θ) = 0 best fit the training data, i.e. that make ‖g(x_i, y_i, θ)‖² = g^T(x_i, y_i, θ) g(x_i, y_i, θ) close to zero for all i
• Combining all (x_i, y_i), minimize f(θ) = Σ_i g^T(x_i, y_i, θ) g(x_i, y_i, θ)
• Let N be the number of data points and m be the output size of g(x, y, θ)
• Define F(θ) by stacking the m outputs of g(x, y, θ) consecutively N times, so that the vector valued output of F(θ) is length N·m
• Then, f(θ) = Σ_i g^T(x_i, y_i, θ) g(x_i, y_i, θ) = F^T(θ) F(θ)
• Multiplying by ½ doesn't change the minimum
Nonlinear Least Squares (Critical Points)
• Minimize f(θ) = ½ F^T(θ) F(θ)
• The Jacobian matrix of F is J_F(θ) = [ ∂F/∂θ_1 (θ)   ∂F/∂θ_2 (θ)   ⋯   ∂F/∂θ_n (θ) ]
• Critical points of f(θ) satisfy ∇f(θ) = ( F^T(θ) ∂F/∂θ_1 (θ), F^T(θ) ∂F/∂θ_2 (θ), ⋯, F^T(θ) ∂F/∂θ_n (θ) )^T = J_F^T(θ) F(θ) = 0
Gauss Newton
• J_F^T(θ) F(θ) = 0 becomes J_F^T(θ) (F(θ_k) + J_F(θ_k) Δθ_k + ⋯) = 0
• Using the Taylor series: F(θ) = F(θ_k) + J_F(θ_k) Δθ_k + ⋯
• Eliminating high order terms: J_F^T(θ) (F(θ_k) + J_F(θ_k) Δθ_k) ≈ 0
• Evaluating J_F^T at θ_k gives J_F^T(θ_k) J_F(θ_k) Δθ_k ≈ −J_F^T(θ_k) F(θ_k)
• Compare to H_f(θ_k) Δθ_k = −∇f(θ_k) and note that ∇f(θ) = J_F^T(θ) F(θ)
• This implies an estimate of H_f(θ_k) ≈ J_F^T(θ_k) J_F(θ_k)
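A Gauss-Newton iteration can be sketched as follows (assuming NumPy; the exponential model, data, and solving via the normal equations are illustrative choices, and noiseless data is used so the residual vanishes at the fit):

```python
import numpy as np

def gauss_newton(F, J, theta0, iters=20):
    """Gauss-Newton sketch: at each step solve the normal equations
    J^T J dtheta = -J^T F (i.e., linear least squares for J dtheta = -F)
    and update theta; only first derivatives of F are ever needed."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        Jt = J(theta)
        r = F(theta)
        dtheta = np.linalg.solve(Jt.T @ Jt, -(Jt.T @ r))
        theta = theta + dtheta
    return theta

# Example (hypothetical model): fit y = exp(a x) to noiseless data with a = 0.5
xs = np.linspace(0.0, 2.0, 8)
ys = np.exp(0.5 * xs)
F = lambda th: np.exp(th[0] * xs) - ys                     # stacked residuals F(theta)
J = lambda th: (xs * np.exp(th[0] * xs)).reshape(-1, 1)    # Jacobian column dF/da
a_fit = gauss_newton(F, J, np.array([0.0]))
```

Note the Hessian H_f is never formed: J_F^T J_F stands in for it, which is exact up to the neglected second-derivative terms of F.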
Gauss Newton (L² approach)
• The Gauss Newton equations J_F^T(θ_k) J_F(θ_k) Δθ_k = −J_F^T(θ_k) F(θ_k) are the normal equations for J_F(θ_k) Δθ_k = −F(θ_k)
• Thus, (instead) solve J_F(θ_k) Δθ_k = −F(θ_k) via any least squares (L²) and minimum norm approach
• Note that setting the second factor in J_F^T(θ) (F(θ_k) + J_F(θ_k) Δθ_k) ≈ 0 to zero also leads to J_F(θ_k) Δθ_k = −F(θ_k)
• This is a linearization of the nonlinear system F(θ) = 0, aiming to minimize f(θ) = ½ F^T(θ) F(θ)
Weighted Gauss Newton
• Recall: Row scaling changes the importance of the equations
• Recall: Thus, it also changes the (unique) least squares solution for any overdetermined degrees of freedom
• Given a diagonal matrix W indicating the importance of various equations:
• W J_F(θ_k) Δθ_k = −W F(θ_k)
• J_F^T(θ_k) W² J_F(θ_k) Δθ_k = −J_F^T(θ_k) W² F(θ_k)
Gauss Newton (Regularized)
• When concerned about small singular values in J_F(θ_k) Δθ_k = −F(θ_k), one can add λΔθ = 0 as extra equations (unit 12 regularization)
• This results in (J_F^T(θ_k) J_F(θ_k) + λ²I) Δθ_k = −J_F^T(θ_k) F(θ_k)
• This is often called Levenberg-Marquardt or Damped (Nonlinear) Least Squares
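The damped variant is a one-line change to Gauss-Newton (a sketch assuming NumPy; the fixed damping λ and the exponential model are illustrative, whereas practical Levenberg-Marquardt codes also adapt λ between iterations):

```python
import numpy as np

def levenberg_marquardt(F, J, theta0, lam=1e-3, iters=50):
    """Damped least squares sketch: like Gauss-Newton, but add lam^2 * I to
    J^T J (equivalently, append rows lam * dtheta = 0 to the least squares
    system) to guard against small singular values of J."""
    theta = np.asarray(theta0, dtype=float)
    n = theta.size
    for _ in range(iters):
        Jt = J(theta)
        r = F(theta)
        A = Jt.T @ Jt + lam ** 2 * np.eye(n)   # regularized normal equations
        theta = theta + np.linalg.solve(A, -(Jt.T @ r))
    return theta

# Example (hypothetical model): fit y = exp(a x) to noiseless data with a = 0.5
xs = np.linspace(0.0, 2.0, 8)
ys = np.exp(0.5 * xs)
F = lambda th: np.exp(th[0] * xs) - ys
J = lambda th: (xs * np.exp(th[0] * xs)).reshape(-1, 1)
a_lm = levenberg_marquardt(F, J, np.array([0.0]))
```

Large λ pushes the step toward (scaled) gradient descent, small λ recovers Gauss-Newton, which is why adapting λ interpolates between robustness and fast local convergence.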
Gradient/Steepest Descent
• Approximate H_f(x_k) very crudely with the identity matrix (the first step of Broyden)
• Then H_f(x_k) Δx_k = −∇f(x_k) becomes I Δx_k = −∇f(x_k)
• That is, the search direction is Δx_k = −∇f(x_k) = −∇f(x_k), the steepest descent direction
• No matrix inversion is necessary
• See unit 19
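Steepest descent with a fixed step length can be sketched in a few lines (assuming NumPy; the learning rate lr, which plays the role of the line search, and the quadratic example are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Steepest descent sketch: the Hessian is replaced by the identity, so
    the search direction is just -grad f(x), scaled by a fixed lr."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)     # no linear solve, no matrix storage
    return x

# Example: f(x) = (x0 - 1)^2 + (x1 + 2)^2 has gradient 2 (x - (1, -2))
xmin = gradient_descent(lambda x: 2.0 * (x - np.array([1.0, -2.0])),
                        np.array([0.0, 0.0]))
```

Each iteration costs only one gradient evaluation, which is exactly the appeal when forming or inverting H_f is intractable.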
Coordinate Descent
• Coordinate Descent ignores H_f(x_k) Δx_k = −∇f(x_k) completely
• Instead Δx_k is set to the various coordinate directions ê_i
• No matrix inversion is necessary
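A minimal coordinate descent sketch (assuming NumPy; a 1D line search along each coordinate is the usual refinement, while a fixed trial step is used here for brevity, so the answer is only accurate to about that step size):

```python
import numpy as np

def coordinate_descent(f, x0, step=0.1, sweeps=100):
    """Coordinate descent sketch: cycle through the coordinate directions
    e_i, moving a fixed trial step along whichever sign decreases f.
    Only function values are used; no derivatives and no linear solves."""
    x = np.asarray(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(x.size):
            for s in (step, -step):
                trial = x.copy()
                trial[i] += s          # candidate move along +/- e_i
                if f(trial) < f(x):
                    x = trial
                    break
    return x

# Example: minimize f(x) = (x0 - 1)^2 + (x1 + 2)^2 from the origin
xmin = coordinate_descent(lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2,
                          np.array([0.0, 0.0]))
```

Because each move changes a single coordinate, this needs neither gradients nor a Jacobian/Hessian, at the cost of many function evaluations.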