Fitting The Unknown
Joshua Lande
Stanford
September 1, 2010
1/28
Motivation: Why Maximize
It is frequently important in physics to find the maximum (or minimum) of a function
Nature will maximize entropy
Economists maximize (minimize?) the cost function
In classical mechanics, the action is minimized
Build experiments to maximize performance
Model parameter estimation
2/28
Parameter Estimation
When analyzing data, it is common to fit a model to the data
χ^2 = Σ_i ((y_i − y(x_i)) / σ_i)^2
log likelihood = log(Prob(data | model))
The model is generally a function of free parameters
It is interesting to find the parameters that maximize the likelihood.
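As a concrete sketch (not from the talk), a χ^2 for a hypothetical straight-line model y(x) = a·x + b could be evaluated like this; all names are illustrative:

```python
import numpy as np

def chi2(params, x, y, sigma):
    """Chi-squared of a hypothetical straight-line model y(x) = a*x + b."""
    a, b = params
    residuals = (y - (a * x + b)) / sigma
    return np.sum(residuals ** 2)

# For Gaussian errors, minimizing chi2 is equivalent to maximizing
# the log likelihood, up to an additive constant.
```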
3/28
Plan
Typically, physicists pull out an off-the-shelf optimizer to fit their function and are done with it
Today, let's dig under the hood and figure out how they work
4/28
Ad Hoc Methods
Given an arbitrary function F(x) of n variables x, how would you go about minimizing it?
Grid Search
Divide the space into an n-dimensional grid
Evaluate the function along the grid
Avoids local minima
Useful to seed other algorithms
Bisection Algorithm
Random points method
These are slow/inefficient: O(2^n)
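A minimal grid-search sketch (my own illustration, not from the slides), which makes the exponential cost explicit:

```python
import itertools
import numpy as np

def grid_search(F, bounds, points_per_dim=10):
    """Evaluate F on a regular grid and return the best point found.
    With m points per axis in n dimensions this takes m**n evaluations,
    which is why grid search is mainly useful to seed other algorithms."""
    axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in bounds]
    best_x, best_f = None, np.inf
    for point in itertools.product(*axes):
        f = F(np.array(point))
        if f < best_f:
            best_x, best_f = np.array(point), f
    return best_x, best_f
```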
5/28
Alternating Variables
Maximizes one parameter at a time
Ignores correlations between variables
The algorithm is inefficient and unreliable
Can cause oscillatory behavior
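A sketch of the idea (assuming scipy for the 1-D minimizations; the function and argument names are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def alternating_minimize(F, x0, n_sweeps=20):
    """Minimize F one coordinate at a time, holding the others fixed.
    Because correlations between variables are ignored, the iterates
    can zig-zag slowly toward the minimum."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_sweeps):
        for i in range(len(x)):
            def along_axis(t, i=i):
                trial = x.copy()
                trial[i] = t
                return F(trial)
            x[i] = minimize_scalar(along_axis).x
    return x
```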
6/28
Gradient Descent
The function decreases in the direction of the negative gradient
Following the negative of the gradient should lead to the minimum
x_{i+1} = x_i − γ ∇F(x_i)
Iterate until |∇F(x_i)| < ε
Well suited when ∇F is easily/analytically calculated
Often, one performs a grid search in the direction of −∇F before the next iteration
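A minimal sketch of the iteration (the names and the fixed step size γ are my own choices):

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, eps=1e-6, max_iter=10000):
    """Step along -grad(F) until the gradient norm falls below eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - gamma * g
    return x
```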
7/28
Simplex Fitting Algorithm (What’s a Simplex???)
A simplex is a generalization of a triangle or tetrahedron to arbitrary dimension
An n-simplex has n + 1 vertices in n dimensions
all equidistant
For example,
a 2-simplex is a triangle
a 3-simplex is a tetrahedron
a 4-simplex is a pentachoron
a 5-simplex is a hexateron
a 6-simplex is a heptapeton
8/28
Simplex (continued)
Define a simplex in the n-dimensional fit space
Evaluate the function at all points
Reflect the highest point through the centroid of the other points
If the reflected point is still the highest, reflect the second highest point
When a certain vertex has remained in the current simplex for many iterations, contract all other vertices towards it by 1/2
9/28
Simplex Example
10/28
Simplex (continued)
Pros:
Ignores the gradient/curvature of the function
Works well for noisy data
Good for functions with local minima
Works well when curvature varies rapidly
Cons:
Requires an initial simplex choice
Slow convergence for smooth functions (compared to gradient descent)
Inflexible to changes in local function structure
e.g., wouldn't work well in a long valley
11/28
Nelder-Mead Algorithm
An improvement of the simplex algorithm
"Adapts itself to the local landscape,
elongating down long inclined planes,
changing direction on encountering a valley at an angle,
and contracting in the neighborhood of a minimum"
"Copies of the routine, written in Extended Mercury Autocode, are available from the authors"¹
Used by Minuit's SIMPLEX algorithm and scipy's fmin function
¹ J.A. Nelder and R. Mead, "A Simplex Method for Function Minimization"
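For example, a minimal call to scipy's simplex minimizer (the Rosenbrock test function below is my own illustration):

```python
from scipy.optimize import fmin

# Rosenbrock "banana" function: a curved valley with its minimum at (1, 1).
rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

x_min = fmin(rosen, [-1.0, 1.0])  # Nelder-Mead simplex under the hood
```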
12/28
Nelder-Mead Algorithm
P̄ is the simplex centroid. P_h is the vertex with the largest function value F_h; P_l has the smallest value F_l
Reflection: evaluate the function F* at the reflected point P* = (1 + α)P̄ − α P_h
Expansion: if F* < F_l (the reflected point is the new minimum), expand the simplex further in that direction by a ratio γ:
P** = γ P* + (1 − γ) P̄
Contraction: if F* > F_i for all i ≠ h, contract by using as the new point
P** = β P_h + (1 − β) P̄
Replace P_h with P**
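A simplified sketch of one such update with the conventional coefficients α = 1, γ = 2, β = 1/2 (my own condensation of the rules above, not the full published algorithm):

```python
import numpy as np

def nelder_mead_step(simplex, F, alpha=1.0, gamma=2.0, beta=0.5):
    """One reflect/expand/contract update of an (n+1)-vertex simplex."""
    values = np.array([F(p) for p in simplex])
    h, l = np.argmax(values), np.argmin(values)
    centroid = (simplex.sum(axis=0) - simplex[h]) / (len(simplex) - 1)

    p_star = (1 + alpha) * centroid - alpha * simplex[h]        # reflection
    f_star = F(p_star)
    if f_star < values[l]:                                      # expansion
        p_2star = gamma * p_star + (1 - gamma) * centroid
        simplex[h] = p_2star if F(p_2star) < f_star else p_star
    elif all(f_star > values[i] for i in range(len(values)) if i != h):
        simplex[h] = beta * simplex[h] + (1 - beta) * centroid  # contraction
    else:
        simplex[h] = p_star
    return simplex
```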
13/28
Quit When...
End when √(Σ_i (F_i − F̄)^2 / n) < ε
This end criterion is well suited to minimizing a χ^2 or log likelihood, where the curvature at the minimum gives information about the parameter uncertainty
The fit error only has to be small compared to the parameter uncertainty!
14/28
Newton-Raphson algorithm
Assume your function is a parabola and calculate the extremum of the estimated parabola
Use curvature information to take a more direct route
Taylor expand the derivative and set it to 0:
f′(x + Δx) = f′(x) + Δx f′′(x) = 0
Δx = −f′(x)/f′′(x)
x_{i+1} = x_i − γ f′(x_i)/f′′(x_i)
Iterate until |f′(x_i)| < ε
Excellent local convergence!
Often, one instead performs a grid search in the direction of steepest descent
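A minimal 1-D sketch of the iteration (names are mine; γ = 1 recovers the pure Newton step):

```python
def newton_1d(fprime, fsecond, x0, gamma=1.0, eps=1e-8, max_iter=100):
    """Newton-Raphson on the derivative: iterate toward f'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        if abs(fprime(x)) < eps:
            break
        x = x - gamma * fprime(x) / fsecond(x)
    return x
```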
15/28
Newton Algorithm (Issues)
May end up converging on a saddle point or local maximum
May overshoot by quite a bit
The formula is undefined for f′′(x) = 0
16/28
Newton-Raphson in Many Dimensions
Perform an n-dimensional Taylor expansion:
∇F(x + Δx) = ∇F(x) + H Δx = 0
where the Hessian matrix is H_ij = ∂^2 F / (∂x_i ∂x_j)
The recursion condition is
x_{n+1} = x_n − γ H_n^{-1} ∇F(x_n)
Iterate until |∇F(x_n)| < δ
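A sketch of one multidimensional step (my own illustration); solving H Δx = −∇F is cheaper and more stable than explicitly inverting H:

```python
import numpy as np

def newton_step(grad_F, hess_F, x, gamma=1.0):
    """One Newton-Raphson step: solve H dx = -grad(F) for dx."""
    dx = np.linalg.solve(hess_F(x), -grad_F(x))
    return x + gamma * dx
```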
Figure: gradient descent (green) and Newton's method (red) for minimizing a function
17/28
Performance
There is no reason that H_n has to be invertible
Newton-Raphson works particularly well near the minimum
Gradient descent (which ignores curvature) works better far from the minimum, where higher-order terms are more significant
Gradient descent converges very slowly near the minimum
18/28
Levenberg-Marquardt
An algorithm devised to naturally interpolate between gradient descent and Newton-Raphson
Replace the equation to solve with (H(x) + µI) Δx = −∇F(x)
µ ≪ 1 reduces to the Newton-Raphson algorithm
µ ≫ 1 reduces to the gradient algorithm with γ = 1/µ
Many different algorithms exist for adaptively changing µ based upon the function
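A sketch of the damped step (the adaptive schedule for µ, e.g. scaling it up after a failed step and down after a successful one, is omitted):

```python
import numpy as np

def lm_step(grad_F, hess_F, x, mu):
    """Damped Newton step: mu -> 0 gives Newton-Raphson,
    large mu gives gradient descent with step size 1/mu."""
    H = hess_F(x) + mu * np.eye(len(x))
    return x + np.linalg.solve(H, -grad_F(x))
```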
19/28
BFGS Method
Often H(x) is very costly to evaluate
It is desirable to find an intelligent approximation of the curvature
BFGS is a modification of Newton's algorithm that approximates the Hessian
It uses the Hessian at previous points and values of the derivative to estimate the new one
20/28
BFGS Method
Same general formula as Newton's method:
x_{n+1} = x_n − H_n^{-1} ∇F(x_n)
Approximate the Hessian using
s_n = x_{n+1} − x_n
y_n = ∇F(x_{n+1}) − ∇F(x_n)
H_{n+1} = H_n + (y_n y_n^T)/(y_n^T s_n) − (H_n s_n (H_n s_n)^T)/(s_n^T H_n s_n)
Invert H_{n+1} using the Sherman-Morrison formula:
H_{n+1}^{-1} = H_n^{-1} + ((s_n^T y_n + y_n^T H_n^{-1} y_n)(s_n s_n^T))/(s_n^T y_n)^2 − (H_n^{-1} y_n s_n^T + s_n y_n^T H_n^{-1})/(s_n^T y_n)
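A sketch of the inverse-Hessian update in the standard product form, which is algebraically equivalent to the Sherman-Morrison expression above (variable names are mine):

```python
import numpy as np

def bfgs_update_inverse(H_inv, s, y):
    """Update the inverse-Hessian approximation from the step
    s = x_{n+1} - x_n and gradient change y = grad F(x_{n+1}) - grad F(x_n)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H_inv @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)
```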
21/28
BFGS Method
Advantages of BFGS:
The invertibility of the Hessian approximation is ensured directly
Well suited for problems where H is costly to compute
Disadvantage: convergence is slower than Newton's Method²
fmin_bfgs in scipy
ROOT::Math::MinimizerOptions::SetDefaultMinimizer("GSLMultiMin", "BFGS")
² http://www.math.mtu.edu/~msgocken/ma5630spring2003/lectures/global2/
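For example, reusing the Rosenbrock test function from the fmin example earlier:

```python
from scipy.optimize import fmin_bfgs

rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
x_min = fmin_bfgs(rosen, [-1.0, 1.0])  # gradient estimated by finite differences
```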
22/28
Physical Constraints
Frequently, parameter values are constrained
E.g., an experiment constrained by an upper limit on cost
Unable to observe negative counts
A common strategy is to change to unconstrained variables
Instead of fitting x, y on a circle, fit θ
When a fit parameter must be positive, it is easy to instead fit the log of the parameter
Remember that you have to correct the fit error
To first order, σ_{log x} = (∂ log(x)/∂x) σ_x = σ_x/x, so σ_x = x σ_{log x}
23/28
Constraints
The Minuit fitter allows two-sided limits on each fit parameter³
It internally fits unconstrained variables but transforms them into constrained variables:
P_int = arcsin(2 (P_ext − a)/(b − a) − 1)
P_ext = a + ((b − a)/2)(sin(P_int) + 1)
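A sketch of this transformation pair (function names are mine):

```python
import numpy as np

def to_external(p_int, a, b):
    """Map an unbounded internal parameter into the allowed range [a, b]."""
    return a + (b - a) / 2 * (np.sin(p_int) + 1)

def to_internal(p_ext, a, b):
    """Inverse map (principal branch) for a value already inside [a, b]."""
    return np.arcsin(2 * (p_ext - a) / (b - a) - 1)
```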
The mapping is non-linear and causes distortions in the errors
³ http://wwwinfo.cern.ch/asdoc/minuit/minmain.html
24/28
Penalty Functions
Another strategy for constraints is penalty functions
Replace the function you are fitting with a function which increases rapidly in forbidden regions
We want to minimize F(x) such that
g_i(x) ≤ 0
h_i(x) = 0
The g_i are inequalities (flux > 0) and the h_i are fixed constraints (cost = 1,000)
Many types of penalty functions have been suggested
25/28
Static Penalty Functions⁴
Constant penalty functions
Replace the function with F_p(x) = F(x) + Σ_i C_i δ_i
where δ_i = 1 if constraint i is violated, 0 if constraint i is satisfied
No obvious way to pick the C_i
"Cost to completion" penalty function
Let the penalty increase the farther you are from the allowed region:
F_p(x) = F(x) + Σ_i C_i d_i^κ
where d_i = δ_i g_i(x) for the inequalities and |h_i(x)| for the fixed constraints
Frequently κ is 1 or 2
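A sketch of a "cost to completion" penalty wrapper (the constant C and the constraint conventions below are my own choices):

```python
def penalized(F, inequalities=(), equalities=(), C=1e3, kappa=2):
    """Wrap F with penalties that grow the deeper x sits in the
    forbidden region.  Inequalities are g(x) <= 0, equalities h(x) = 0."""
    def F_p(x):
        total = F(x)
        for g in inequalities:
            total += C * max(g(x), 0.0) ** kappa  # d_i = g_i(x) when violated
        for h in equalities:
            total += C * abs(h(x)) ** kappa       # d_i = |h_i(x)|
        return total
    return F_p
```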
⁴ http://www.eng.auburn.edu/users/smithae/publications/bookch/chapter.pdf
26/28
Dynamic Penalty Functions
Static penalty functions lack a robust strategy for picking the C_i
Dynamic penalties use the length of time of the search, t:
F_p(x, t) = F(x) + Σ_i s_i(t) d_i^κ
where s_i(t) is an increasing function of time
Often one has to tune s_i(t) to the particular problem
If s_i(t) is too lenient, an infeasible solution may result from the fit
If s_i(t) is too strict, the search may converge to a non-optimal feasible solution
Lots of research into adaptive penalty functions...
27/28
Questions?
28/28