Fitting The Unknown
Joshua Lande
Stanford
September 1, 2010
1/28
Motivation: Why Maximize
It is frequently important in physics to find the maximum (or minimum) of a function
Nature will maximize entropy
Economists maximize (minimize?) the cost function
In classical mechanics, the action is minimized
Build experiments to maximize performance
Model parameter estimation
2/28
Parameter Estimation
When analyzing data, it is common to fit a model to the data
χ^2 = Σ_i ((y_i − y(x_i)) / σ_i)^2
log likelihood = log(Prob(data | model))
The model is generally a function of free parameters
It is interesting to find the parameters that maximize the likelihood.
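As a concrete sketch (not from the talk), a χ^2 for a hypothetical straight-line model y(x) = a·x + b could be evaluated like this; all names are illustrative:

```python
import numpy as np

def chi2(params, x, y, sigma):
    """Chi-squared of a hypothetical straight-line model y(x) = a*x + b."""
    a, b = params
    residuals = (y - (a * x + b)) / sigma
    return np.sum(residuals ** 2)

# For Gaussian errors, minimizing chi2 is equivalent to maximizing
# the log likelihood, up to an additive constant.
```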
3/28
Plan
Typically, physicists pull out an off-the-shelf optimizer to fit their function and are done with it
Today, let's dig under the hood and figure out how they work
4/28
Ad Hoc Methods
Given an arbitrary function F(x) of n variables x, how would you go about minimizing it?
Grid Search
Divide the space into an n-dimensional grid
Evaluate the function along the grid
Avoids local minima
Useful to seed other algorithms
Bisection Algorithm
Random points method
These are slow/inefficient: O(2^n)
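A minimal grid-search sketch (my own illustration, not from the slides), which makes the exponential cost explicit:

```python
import itertools
import numpy as np

def grid_search(F, bounds, points_per_dim=10):
    """Evaluate F on a regular grid and return the best point found.
    With m points per axis in n dimensions this takes m**n evaluations,
    which is why grid search is mainly useful to seed other algorithms."""
    axes = [np.linspace(lo, hi, points_per_dim) for lo, hi in bounds]
    best_x, best_f = None, np.inf
    for point in itertools.product(*axes):
        f = F(np.array(point))
        if f < best_f:
            best_x, best_f = np.array(point), f
    return best_x, best_f
```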
5/28
Alternating Variables
Maximizes one parameter at a time
Ignores correlations between variables
The algorithm is inefficient and unreliable
Can cause oscillatory behavior
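A sketch of the idea (assuming scipy for the 1-D minimizations; the function and argument names are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def alternating_minimize(F, x0, n_sweeps=20):
    """Minimize F one coordinate at a time, holding the others fixed.
    Because correlations between variables are ignored, the iterates
    can zig-zag slowly toward the minimum."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_sweeps):
        for i in range(len(x)):
            def along_axis(t, i=i):
                trial = x.copy()
                trial[i] = t
                return F(trial)
            x[i] = minimize_scalar(along_axis).x
    return x
```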
6/28
Gradient Descent
The function decreases in the direction of the negative gradient
Following the negative of the gradient should lead to the minimum
x_{i+1} = x_i − γ ∇F(x_i)
Iterate until |∇F(x_i)| < ε
Well suited when ∇F is easily/analytically calculated
Often, one performs a grid search in the direction of −∇F before the next iteration
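A minimal sketch of the iteration (the names and the fixed step size γ are my own choices):

```python
import numpy as np

def gradient_descent(grad_F, x0, gamma=0.1, eps=1e-6, max_iter=10000):
    """Step along -grad(F) until the gradient norm falls below eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - gamma * g
    return x
```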
7/28
Simplex Fitting Algorithm (What’s a Simplex???)
A simplex is a generalization of a triangle or tetrahedron to arbitrary dimension
An n-simplex has n + 1 vertices in n dimensions
all equidistant
For example,
a 2-simplex is a triangle
a 3-simplex is a tetrahedron
a 4-simplex is a pentachoron
a 5-simplex is a hexateron
a 6-simplex is a heptapeton
8/28
Simplex (continued)
Define a simplex in the n-dimensional fit space
Evaluate the function at all points
Reflect the highest point through the centroid of the other points
If the reflected point is still the highest, reflect the second highest point
When a certain vertex has remained in the current simplex for many iterations, contract all other vertices towards it by 1/2
9/28
Simplex Example
10/28
Simplex (continued)
Pros:
Ignores the gradient/curvature of the function
Works well for noisy data
Good for functions with local minima
Works well when curvature varies rapidly
Cons:
Requires an initial simplex choice
Slow convergence for smooth functions (compared to gradient descent)
Inflexible to changes in local function structure
e.g., wouldn't work well in a long valley
11/28
Nelder-Mead Algorithm
An improvement of the simplex algorithm
"Adapts itself to the local landscape,
elongating down long inclined planes,
changing direction on encountering a valley at an angle,
and contracting in the neighborhood of a minimum"
"Copies of the routine, written in Extended Mercury Autocode, are available from the authors"¹
Used by Minuit's SIMPLEX algorithm and scipy's fmin function
¹ J.A. Nelder and R. Mead, "A Simplex Method for Function Minimization"
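For example, a minimal call to scipy's simplex minimizer (the Rosenbrock test function below is my own illustration):

```python
from scipy.optimize import fmin

# Rosenbrock "banana" function: a curved valley with its minimum at (1, 1).
rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

x_min = fmin(rosen, [-1.0, 1.0])  # Nelder-Mead simplex under the hood
```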
12/28
Nelder-Mead Algorithm
P̄ is the simplex centroid. P_h is the vertex with the largest function value F_h; P_l has the smallest value F_l
Reflection: evaluate the function F* at the reflected point P* = (1 + α)P̄ − α P_h
Expansion: if F* < F_l (the reflected point is the new minimum), expand the simplex further in that direction by a ratio γ:
P** = γ P* + (1 − γ) P̄
Contraction: if F* > F_i for all i ≠ h, contract by using as the new point
P** = β P_h + (1 − β) P̄
Replace P_h with P**
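A simplified sketch of one such update with the conventional coefficients α = 1, γ = 2, β = 1/2 (my own condensation of the rules above, not the full published algorithm):

```python
import numpy as np

def nelder_mead_step(simplex, F, alpha=1.0, gamma=2.0, beta=0.5):
    """One reflect/expand/contract update of an (n+1)-vertex simplex."""
    values = np.array([F(p) for p in simplex])
    h, l = np.argmax(values), np.argmin(values)
    centroid = (simplex.sum(axis=0) - simplex[h]) / (len(simplex) - 1)

    p_star = (1 + alpha) * centroid - alpha * simplex[h]        # reflection
    f_star = F(p_star)
    if f_star < values[l]:                                      # expansion
        p_2star = gamma * p_star + (1 - gamma) * centroid
        simplex[h] = p_2star if F(p_2star) < f_star else p_star
    elif all(f_star > values[i] for i in range(len(values)) if i != h):
        simplex[h] = beta * simplex[h] + (1 - beta) * centroid  # contraction
    else:
        simplex[h] = p_star
    return simplex
```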
13/28
Quit When...
End when √(Σ_i (F_i − F̄)^2 / n) < ε
This end criterion is well suited to minimizing a χ^2 or log likelihood, where the curvature at the minimum gives information about the parameter uncertainty
The fit error only has to be small compared to the parameter uncertainty!
14/28
Newton-Raphson algorithm
Assume your function is a parabola and calculate the extremum of the estimated parabola
Use curvature information to take a more direct route
Taylor expand the derivative and set it to 0:
f′(x + Δx) = f′(x) + Δx f′′(x) = 0
Δx = −f′(x)/f′′(x)
x_{i+1} = x_i − γ f′(x_i)/f′′(x_i)
Iterate until |f′(x_i)| < ε
Excellent local convergence!
Often, one instead performs a grid search in the direction of steepest descent
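A minimal 1-D sketch of the iteration (names are mine; γ = 1 recovers the pure Newton step):

```python
def newton_1d(fprime, fsecond, x0, gamma=1.0, eps=1e-8, max_iter=100):
    """Newton-Raphson on the derivative: iterate toward f'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        if abs(fprime(x)) < eps:
            break
        x = x - gamma * fprime(x) / fsecond(x)
    return x
```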
15/28
Newton Algorithm (Issues)
May end up converging on a saddle point or local maximum
May overshoot by quite a bit
The formula is undefined for f′′(x) = 0
16/28
Newton-Raphson in Many Dimensions
Perform an n-dimensional Taylor expansion:
∇F(x + Δx) = ∇F(x) + H Δx = 0
where the Hessian matrix is H_ij = ∂^2 F / (∂x_i ∂x_j)
The recursion condition is
x_{n+1} = x_n − γ H_n^{-1} ∇F(x_n)
Iterate until |∇F(x_n)| < δ
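A sketch of one multidimensional step (my own illustration); solving H Δx = −∇F is cheaper and more stable than explicitly inverting H:

```python
import numpy as np

def newton_step(grad_F, hess_F, x, gamma=1.0):
    """One Newton-Raphson step: solve H dx = -grad(F) for dx."""
    dx = np.linalg.solve(hess_F(x), -grad_F(x))
    return x + gamma * dx
```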
Figure: gradient descent (green) and Newton's method (red) for minimizing a function
17/28
Performance
There is no reason that H_n has to be invertible
Newton-Raphson works particularly well near the minimum
Gradient descent (which ignores curvature) works better far from the minimum, where higher-order terms are more significant
Gradient descent converges very slowly near the minimum
18/28
Levenberg-Marquardt
An algorithm devised to naturally interpolate between gradient descent and Newton-Raphson
Replace the equation to solve with (H(x) + µI) Δx = −∇F(x)
µ ≪ 1 reduces to the Newton-Raphson algorithm
µ ≫ 1 reduces to the gradient algorithm with γ = 1/µ
Many different algorithms exist for adaptively changing µ based upon the function
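A sketch of the damped step (the adaptive schedule for µ, e.g. scaling it up after a failed step and down after a successful one, is omitted):

```python
import numpy as np

def lm_step(grad_F, hess_F, x, mu):
    """Damped Newton step: mu -> 0 gives Newton-Raphson,
    large mu gives gradient descent with step size 1/mu."""
    H = hess_F(x) + mu * np.eye(len(x))
    return x + np.linalg.solve(H, -grad_F(x))
```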
19/28
BFGS Method
Often H(x) is very costly to evaluate
It is desirable to find an intelligent approximation of the curvature
BFGS is a modification of Newton's algorithm that approximates the Hessian
It uses the Hessian at previous points and values of the derivative to estimate the new one
20/28
BFGS Method
Same general formula as Newton's method:
x_{n+1} = x_n − H_n^{-1} ∇F(x_n)
Approximate the Hessian using
s_n = x_{n+1} − x_n
y_n = ∇F(x_{n+1}) − ∇F(x_n)
H_{n+1} = H_n + (y_n y_n^T)/(y_n^T s_n) − (H_n s_n (H_n s_n)^T)/(s_n^T H_n s_n)
Invert H_{n+1} using the Sherman-Morrison formula:
H_{n+1}^{-1} = H_n^{-1} + ((s_n^T y_n + y_n^T H_n^{-1} y_n)(s_n s_n^T))/(s_n^T y_n)^2 − (H_n^{-1} y_n s_n^T + s_n y_n^T H_n^{-1})/(s_n^T y_n)
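A sketch of the inverse-Hessian update in the standard product form, which is algebraically equivalent to the Sherman-Morrison expression above (variable names are mine):

```python
import numpy as np

def bfgs_update_inverse(H_inv, s, y):
    """Update the inverse-Hessian approximation from the step
    s = x_{n+1} - x_n and gradient change y = grad F(x_{n+1}) - grad F(x_n)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H_inv @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)
```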
21/28
BFGS Method
Advantages of BFGS:
The invertibility of the Hessian approximation is ensured directly
Well suited for problems where H is costly to compute
Disadvantage: convergence is slower than Newton's Method²
fmin_bfgs in scipy
ROOT::Math::MinimizerOptions::SetDefaultMinimizer("GSLMultiMin", "BFGS")
² http://www.math.mtu.edu/~msgocken/ma5630spring2003/lectures/global2/
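For example, reusing the Rosenbrock test function from the fmin example earlier:

```python
from scipy.optimize import fmin_bfgs

rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
x_min = fmin_bfgs(rosen, [-1.0, 1.0])  # gradient estimated by finite differences
```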
22/28
Physical Constraints
Frequently, parameter values are constrained
E.g., an experiment constrained by an upper limit on cost
Unable to observe negative counts
A common strategy is to change to unconstrained variables
Instead of fitting x, y on a circle, fit θ
When a fit parameter must be positive, it is easy to instead fit the log of the parameter
Remember that you have to correct the fit error
To first order, σ_{log x} = (∂ log(x)/∂x) σ_x = σ_x/x, so σ_x = x σ_{log x}
23/28
Constraints
The Minuit fitter allows two-sided limits on each fit parameter³
It internally fits unconstrained variables but transforms them into constrained variables:
P_int = arcsin(2 (P_ext − a)/(b − a) − 1)
P_ext = a + ((b − a)/2)(sin(P_int) + 1)
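A sketch of this transformation pair (function names are mine):

```python
import numpy as np

def to_external(p_int, a, b):
    """Map an unbounded internal parameter into the allowed range [a, b]."""
    return a + (b - a) / 2 * (np.sin(p_int) + 1)

def to_internal(p_ext, a, b):
    """Inverse map (principal branch) for a value already inside [a, b]."""
    return np.arcsin(2 * (p_ext - a) / (b - a) - 1)
```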
The mapping is non-linear and causes distortions in the errors
³ http://wwwinfo.cern.ch/asdoc/minuit/minmain.html
24/28
Penalty Functions
Another strategy for constraints is penalty functions
Replace the function you are fitting with a function which increases rapidly in forbidden regions
We want to minimize F(x) such that
g_i(x) ≤ 0
h_i(x) = 0
The g_i are inequalities (flux > 0) and the h_i are fixed constraints (cost = 1,000)
Many types of penalty functions have been suggested
25/28
Static Penalty Functions⁴
Constant penalty functions
Replace the function with F_p(x) = F(x) + Σ_i C_i δ_i
where δ_i = 1 if constraint i is violated, 0 if constraint i is satisfied
No obvious way to pick the C_i
"Cost to completion" penalty function
Let the penalty increase the farther you are from the allowed region:
F_p(x) = F(x) + Σ_i C_i d_i^κ
where d_i = δ_i g_i(x) for the inequalities and |h_i(x)| for the fixed constraints
Frequently κ is 1 or 2
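A sketch of a "cost to completion" penalty wrapper (the constant C and the constraint conventions below are my own choices):

```python
def penalized(F, inequalities=(), equalities=(), C=1e3, kappa=2):
    """Wrap F with penalties that grow the deeper x sits in the
    forbidden region.  Inequalities are g(x) <= 0, equalities h(x) = 0."""
    def F_p(x):
        total = F(x)
        for g in inequalities:
            total += C * max(g(x), 0.0) ** kappa  # d_i = g_i(x) when violated
        for h in equalities:
            total += C * abs(h(x)) ** kappa       # d_i = |h_i(x)|
        return total
    return F_p
```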
⁴ http://www.eng.auburn.edu/users/smithae/publications/bookch/chapter.pdf
26/28
Dynamic Penalty Functions
Static penalty functions lack a robust strategy for picking the C_i
Dynamic penalties use the length of time of the search, t:
F_p(x, t) = F(x) + Σ_i s_i(t) d_i^κ
where s_i(t) is an increasing function of time
Often one has to tune s_i(t) to the particular problem
If s_i(t) is too lenient, an infeasible solution may result from the fit
If s_i(t) is too strict, the search may converge to a non-optimal feasible solution
Lots of research into adaptive penalty functions...
27/28
Questions?
28/28