Source: folk.uio.no/erikadl/FYS4550/are/Lectures_Statistics_H16.pdf
LECTURE NOTES
FYS 4550/FYS9550 -
EXPERIMENTAL HIGH ENERGY PHYSICS
AUTUMN 2016
STATISTICS
A. STRANDLIE
NTNU AT GJØVIK
AND
UNIVERSITY OF OSLO
Statistics
• Statistics is about making inference about a statistical model, given
a set of data or measurements
– Parameters of a distribution
– Parameters describing the kinematics of a particle after a collision
• Position and momentum at some reference surface
– Parameters describing an interaction vertex (position, refined estimates
of particle momenta)
• Will consider two issues
– Parameter estimation
– Hypothesis tests and confidence intervals
Statistics
• Parameter estimation
• We want to estimate the unknown value of a parameter θ.
• An estimator is a function of the data which aims to estimate the
value of θ as closely as possible.
• General estimator properties
– Consistency
– Bias
– Efficiency
– Robustness
• A consistent estimator is an estimator whose estimate $\hat{\theta}$ converges to the true
value of θ when the amount of data increases (formally, in the limit
of an infinite amount of data).
Statistics
• The bias b of an estimator is given as

  $b = E[\hat{\theta}] - \theta$

• Since the estimator is a function of the data, it is itself a random
variable with its own distribution.
• The expectation value $E[\hat{\theta}]$ can be interpreted as the mean value of
the estimate over a very large number of hypothetical, identical
experiments.
• Obviously, unbiased (i.e. b = 0) estimators are desirable.
Statistics
• The efficiency of an estimator is the inverse of the ratio of its
variance to the minimum possible value.
• The minimum possible value is given by the Rao-Cramér-Fréchet
lower bound

  $\sigma^2_{\min} = \frac{\left(1 + \frac{\partial b}{\partial \theta}\right)^2}{I(\theta)}$

  where I(θ) is the Fisher information:

  $I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\sum_i \ln f(x_i;\theta)\right)^2\right]$
Statistics
• The sum is over all the data, which are assumed independent and to follow
the pdf f(x; θ).
• The expression of the lower bound is valid for all estimators with the same
bias function b(θ) (for unbiased estimators b(θ) vanishes).
• If the variance of the estimator happens to be equal to the Cramer-Rao-
Frechet lower bound, it is called a minimum variance lower bound estimator
or a (fully) efficient estimator.
• Different estimators of the same parameter can also be compared by
looking at the ratios of the efficiencies. One then talks about relative
efficiencies.
• Robustness is the (qualitative) degree of insensitivity of the estimator to
deviations in the assumed pdf of the data
– e.g. noise in the data not properly taken into account
– wrong data
– etc
Statistics
• Common estimators for the mean and variance (often called the
sample mean and the sample variance) are:

  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$

• The variances of these are:

  $V[\bar{x}] = \frac{\sigma^2}{N}, \qquad V[s^2] = \frac{1}{N}\left(m_4 - \frac{N-3}{N-1}\,\sigma^4\right)$

  where $m_4$ is the fourth central moment of the distribution.
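The sample-mean variance $V[\bar{x}] = \sigma^2/N$ can be checked with a quick toy Monte Carlo; the numbers below (N, σ, number of pseudo-experiments) are made up for illustration and are not from the lecture:

```python
import random

# Toy check of V[x_bar] = sigma^2 / N: draw many samples of size N from a
# Gaussian and compare the spread of the sample means with the prediction.
random.seed(1)
N, sigma, n_exp = 100, 2.0, 5000

means = []
for _ in range(n_exp):
    data = [random.gauss(0.0, sigma) for _ in range(N)]
    x_bar = sum(data) / N            # sample mean of one pseudo-experiment
    means.append(x_bar)

# empirical variance of the sample mean over the pseudo-experiments
mean_of_means = sum(means) / n_exp
emp_var = sum((m - mean_of_means) ** 2 for m in means) / n_exp

print(emp_var, sigma**2 / N)  # the two numbers should be close
```

With these toy values the prediction is σ²/N = 4/100 = 0.04, and the empirical variance agrees within the Monte Carlo uncertainty.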
Statistics
• For variables which obey the Gaussian distribution, this yields for
large N

  $\mathrm{std}(s) \approx \frac{\sigma}{\sqrt{2N}}$

• For Gaussian variables the sample mean is a fully efficient
estimator.
• If the different measurements used in the calculation of the sample
mean have different variances, a better estimator of the mean is the
weighted sample mean:

  $\bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$
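A minimal sketch of the weighted sample mean with weights $w_i = 1/\sigma_i^2$; the measurements and uncertainties below are made up for illustration:

```python
# Weighted sample mean for measurements with different variances,
# with weights w_i = 1 / sigma_i^2 (toy numbers).
xs     = [10.2, 9.8, 10.5]   # measurements
sigmas = [0.1, 0.2, 0.4]     # their standard deviations

weights = [1.0 / s**2 for s in sigmas]
x_bar = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
var_x_bar = 1.0 / sum(weights)   # variance of the weighted mean

print(x_bar, var_x_bar)
```

Note how the most precise measurement (σ = 0.1) dominates the result, as it should.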
Statistics
• The method of maximum likelihood:
• Assume that we have N independent measurements all obeying the
pdf f(x;θ), where θ is a parameter vector consisting of n different
parameters to be estimated.
• The maximum likelihood estimate is the value of the parameter
vector θ which maximizes the likelihood function

  $L(\boldsymbol{\theta}) = \prod_{i=1}^{N} f(x_i;\boldsymbol{\theta})$

• Since the natural logarithm is a monotonically increasing function,
ln L and L attain their maximum for the same value of θ.
Statistics
• Therefore the maximum likelihood estimate can be found by solving the likelihood equations

  $\frac{\partial \ln L}{\partial \theta_i} = 0$

  for all i = 1, …, n.
• ML estimators are asymptotically (i.e. for large amounts of data) unbiased and fully efficient
– Therefore very popular
• An estimate of the inverse of the covariance matrix of an ML estimate is

  $(\widehat{V^{-1}})_{ij} = -\frac{\partial^2 \ln L}{\partial \theta_i \partial \theta_j}$

evaluated at the estimated value of θ.
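As an illustration of the likelihood equations, here is a sketch for an exponential pdf $f(t;\tau) = (1/\tau)e^{-t/\tau}$ (my choice of example, not from the slides), where the ML estimate has a closed form: solving $\partial \ln L/\partial\tau = 0$ gives $\hat{\tau} = \bar{t}$, and the second derivative of ln L gives the variance estimate $\hat{\tau}^2/N$:

```python
import math
import random

# ML estimation sketch for f(t; tau) = (1/tau) exp(-t/tau).
# ln L = -N ln(tau) - sum(t_i)/tau, so the likelihood equation gives
# tau_hat = mean(t), and -d^2(ln L)/d(tau)^2 at tau_hat gives V = tau_hat^2/N.
random.seed(2)
tau_true, N = 2.0, 10000
data = [random.expovariate(1.0 / tau_true) for _ in range(N)]

tau_hat = sum(data) / N        # solves d(ln L)/d(tau) = 0
var_hat = tau_hat**2 / N       # inverse of -d^2(ln L)/d(tau)^2 at tau_hat

print(tau_hat, math.sqrt(var_hat))
```

The estimate comes out close to the true lifetime, with an uncertainty of roughly $\tau/\sqrt{N}$, illustrating the asymptotic efficiency claimed above.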
Statistics
• The method of least squares.
• Simplest possible example: estimating the parameters of a straight
line (intercept and tangent of inclination angle) given a set of
measurements.
[Figure: scatter of measurements with the fitted straight line]
Statistics
• Least-squares approach: minimizing the sum of squared distances S
between the line and the N measurements,

  $S = \sum_{i=1}^{N} \frac{\left(y_i - (a x_i + b)\right)^2}{\sigma_i^2}$

  with respect to the parameters of the line (i.e. a and b); here $\sigma_i^2$ is the variance of the measurement error.
• This cost function or objective function S can be written in a more
compact way by using matrix notation:

  $S = (\mathbf{y} - H\boldsymbol{\theta})^T V^{-1} (\mathbf{y} - H\boldsymbol{\theta})$
Statistics
• Here y is a vector of measurements, θ is a vector of the parameters
a and b, V is the (diagonal) covariance matrix of the measurements
(with the individual variances on the main diagonal), and H
is given by

  $H = \begin{pmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{pmatrix}$

• Taking the derivative of S with respect to θ, setting it to zero and
solving for θ yields the least-squares solution to the problem.
Statistics
• The result is:

  $\hat{\boldsymbol{\theta}} = (H^T V^{-1} H)^{-1} H^T V^{-1} \mathbf{y}$

• The covariance matrix of the estimated parameters is:

  $\mathrm{cov}(\hat{\boldsymbol{\theta}}) = (H^T V^{-1} H)^{-1}$

  and the covariance matrix of the estimated positions $\hat{\mathbf{y}} = H\hat{\boldsymbol{\theta}}$ is

  $\mathrm{cov}(\hat{\mathbf{y}}) = H (H^T V^{-1} H)^{-1} H^T$
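The closed-form solution $\hat{\boldsymbol{\theta}} = (H^T V^{-1} H)^{-1} H^T V^{-1} \mathbf{y}$ can be evaluated directly for a toy line fit (the data points and uncertainties below are made up); with two parameters everything reduces to a 2×2 system, so no matrix library is needed:

```python
# Linear least-squares line fit y = a*x + b in matrix form, written out
# for the 2x2 case: A = H^T V^-1 H, c = H^T V^-1 y, theta_hat = A^-1 c.
xs     = [0.0, 1.0, 2.0, 3.0]
ys     = [1.1, 2.9, 5.2, 7.0]        # roughly y = 2x + 1 (toy data)
sigmas = [0.1, 0.1, 0.1, 0.1]

w = [1.0 / s**2 for s in sigmas]     # diagonal entries of V^-1

# Build A (2x2) and c (2-vector) with H_i = (x_i, 1).
Axx = sum(wi * x * x for wi, x in zip(w, xs))
Ax1 = sum(wi * x for wi, x in zip(w, xs))
A11 = sum(w)
cx  = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
c1  = sum(wi * y for wi, y in zip(w, ys))

det = Axx * A11 - Ax1 * Ax1
a_hat = ( A11 * cx - Ax1 * c1) / det   # slope
b_hat = (-Ax1 * cx + Axx * c1) / det   # intercept
cov_aa, cov_bb = A11 / det, Axx / det  # diagonal of cov(theta) = A^-1

print(a_hat, b_hat, cov_aa, cov_bb)
```

For these toy numbers the fit gives exactly a = 2.0 and b = 1.05, and the parameter variances come straight from the inverse of $H^T V^{-1} H$.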
Statistics
[Figure: simulating 10000 lines; histogram of the estimated intercept. What is the true value of the intercept?]
Statistics
[Figure: simulating 10000 lines; histogram of the estimated tangent of the inclination angle. What is the true value?]
Statistics
Histograms of normalized residuals of the estimated parameters: for each fitted line and each estimated parameter, the quantity (estimated parameter − true parameter)/(standard deviation of the parameter) is entered into the histogram. If everything is OK with the fitting procedure, these histograms should have mean 0 and standard deviation 1.
[Figure: residual histograms with mean = −0.0189, std = 1.0038 and mean = 0.0157, std = 1.0011]
Statistics
• Least-squares estimation is for instance used in track fitting in high-
energy physics experiments.
• Track fitting is basically the same task as the line fit example:
estimating a set of parameters describing a particle track through a
tracking detector, given a set of measurements created by the
particle.
• In the general case the track model is not a straight line but rather a
helix (homogeneous magnetic field) or some other trajectory
obeying the equations of motion in an inhomogeneous magnetic
field.
• The principles of the fitting procedure, however, are largely the
same.
Statistics
• As long as there is a linear relationship between the parameters and the
measurements, the least-squares method is linear.
• If this relationship is a non-linear function F(θ), the problem is said to be of a
non-linear least-squares type:

  $S = (\mathbf{y} - F(\boldsymbol{\theta}))^T V^{-1} (\mathbf{y} - F(\boldsymbol{\theta}))$

• There exists no direct solution to this problem, and one has to resort to an
iterative approach (Gauss-Newton):
– Start out with an initial guess of θ, linearize the function F around the initial guess by
a Taylor expansion and solve the resulting linear least-squares problem
– Use the estimated value of θ as a new expansion point for F and repeat the step
above
– Iterate until convergence (i.e. until θ changes less than a specified value from
one iteration to the next)
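The Gauss-Newton steps above can be sketched for a one-parameter toy model; the exponential model $F_i(\theta) = e^{\theta x_i}$ and the numbers below are assumptions for illustration, not from the slides:

```python
import math

# Gauss-Newton sketch for the non-linear model F_i(theta) = exp(theta * x_i).
# Each iteration linearizes F around the current theta (Jacobian J) and
# solves the resulting one-parameter linear least-squares problem.
xs = [0.1, 0.5, 1.0, 1.5]
ys = [math.exp(0.7 * x) for x in xs]   # noise-free toy data, theta_true = 0.7
sigma2 = 0.01                          # common measurement variance

theta = 0.0                            # initial guess
for _ in range(50):
    F = [math.exp(theta * x) for x in xs]
    J = [x * f for x, f in zip(xs, F)]          # dF_i / dtheta
    num = sum(j * (y - f) / sigma2 for j, y, f in zip(J, ys, F))
    den = sum(j * j / sigma2 for j in J)
    step = num / den                   # linearized least-squares solution
    theta += step                      # new expansion point
    if abs(step) < 1e-12:              # iterate until convergence
        break

print(theta)
```

Because the toy data are noise-free, the iteration converges to the true value 0.7; with real data it would converge to the least-squares estimate instead.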
Statistics
• Relationship between maximum likelihood and least-squares:
• Consider a set of independent measurements y with mean values
F(x;θ).
• If these measurements follow a Gaussian distribution, the log-
likelihood function is basically

  $-2\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \frac{\left(y_i - F(x_i;\boldsymbol{\theta})\right)^2}{\sigma_i^2}$

  plus some terms which do not depend on θ.
• Maximizing the log-likelihood function is in this case equivalent to
minimizing the least-squares objective function.
Statistics
• Confidence intervals and hypothesis tests.
• Confidence intervals:
– Given a set of measurements of a parameter, calculate an interval such that one can be e.g. 95 % sure that the true value of the parameter lies within it
– Such an interval is called a 95 % confidence interval for the parameter
• Example: collect N measurements believed to come from a Gaussian distribution with unknown mean value μ and known standard deviation σ. Use the sample mean to calculate a 100(1−α) % confidence interval for μ.
• From earlier: the sample mean is an unbiased estimator of μ with standard deviation σ/sqrt(N).
• For large enough N, the quantity

  $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}}$

  is distributed according to the standard normal distribution (mean value 0, standard deviation 1).
Statistics
• Therefore:

  $P\left(-z_{\alpha/2} \le \frac{\bar{X}-\mu}{\sigma/\sqrt{N}} \le z_{\alpha/2}\right) = 1-\alpha$

  $P\left(-z_{\alpha/2}\,\sigma/\sqrt{N} \le \bar{X}-\mu \le z_{\alpha/2}\,\sigma/\sqrt{N}\right) = 1-\alpha$

  $P\left(-\bar{X}-z_{\alpha/2}\,\sigma/\sqrt{N} \le -\mu \le -\bar{X}+z_{\alpha/2}\,\sigma/\sqrt{N}\right) = 1-\alpha$

  $P\left(\bar{X}-z_{\alpha/2}\,\sigma/\sqrt{N} \le \mu \le \bar{X}+z_{\alpha/2}\,\sigma/\sqrt{N}\right) = 1-\alpha$

• In words, there is a probability 1−α that the true mean is in the interval

  $\left(\bar{X}-z_{\alpha/2}\,\sigma/\sqrt{N},\; \bar{X}+z_{\alpha/2}\,\sigma/\sqrt{N}\right)$

• This interval is therefore a 100(1−α) % confidence interval for μ.
• Such intervals are highly relevant in physics analysis.
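The interval $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{N}$ can be computed directly; a sketch with toy numbers, using the standard-normal quantile $z_{0.025} \approx 1.96$ for a 95 % interval:

```python
import math

# 95% confidence interval for the mean of Gaussian data with known sigma:
# x_bar +/- z_{alpha/2} * sigma / sqrt(N)  (toy numbers).
x_bar, sigma, N = 10.0, 2.0, 100
z = 1.96                               # standard-normal quantile for alpha = 0.05

half_width = z * sigma / math.sqrt(N)
lo, hi = x_bar - half_width, x_bar + half_width
print(lo, hi)  # (9.608, 10.392)
```

Note that the half-width shrinks like $1/\sqrt{N}$: quadrupling the number of measurements halves the interval.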
Statistics
• Hypothesis tests:
• A hypothesis is a statement about the distribution of a vector x of
data.
• Similar to the previous example:
– given a set of N measurements, test whether the measurements
come from a normal distribution with a certain expectation value μ or
not.
– define a test statistic, i.e. the quantity to be used in the evaluation of the
hypothesis. Here: the sample mean.
– define the significance level of the test, i.e. the probability that the
hypothesis will be discarded even though it is true.
– determine the critical region of the test statistic, i.e. interval(s) of values
of the test statistic which will lead to the rejection of the hypothesis
Statistics
• We then state two competing hypotheses:
– A null hypothesis, stating that the expectation value is equal to a given
value
– An alternative hypothesis, stating that the expectation value is not equal
to the given value
• Mathematically:

  $H_0: \mu = \mu_0, \qquad H_1: \mu \ne \mu_0$

• Test statistic:

  $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{N}}$
Statistics
[Figure: standard normal pdf with the two tails beyond $-z_{\alpha/2}$ and $z_{\alpha/2}$ shaded]
The probability of being in the shaded area is α; the shaded area is therefore the critical region of Z for significance level α.
Obtain a value of the test statistic from the test data by calculating the sample mean and transforming to Z. Use the actual value of Z to determine whether the null hypothesis is rejected or not.
Statistics
• Alternatively: perform the test by calculating the so-called p-value of the test statistic.
• Given the actual value of the test statistic, what is the area below the pdf for the range of values of the test statistic starting from the actual one and extending to all values further away from the value defined by the null hypothesis? This area defines the p-value.
– For the current example this corresponds to adding two integrals of the pdf of the test statistic (because this is a so-called two-sided test):
• one from minus infinity to minus the absolute value of the actual value of the test statistic
• another from the absolute value of the actual value of the test statistic to plus infinity
• For a one-sided test one would stick to one integral of the abovementioned type.
• If the p-value is less than the significance level, discard the null hypothesis; if not, don't discard it.
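The two-sided p-value recipe above can be sketched as follows (toy numbers; the standard-normal cdf is built from `math.erf`, and the two tail integrals collapse to one term by symmetry):

```python
import math

# Two-sided z-test sketch: p-value = P(|Z| >= |z_obs|) under H0,
# i.e. the two tail integrals of the standard normal pdf.
def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

x_bar, mu0, sigma, N = 0.5, 0.0, 1.0, 16   # toy numbers
z = (x_bar - mu0) / (sigma / math.sqrt(N)) # test statistic
p_value = 2.0 * (1.0 - phi(abs(z)))        # sum of the two tail areas

alpha = 0.05
reject = p_value < alpha
print(z, p_value, reject)
```

Here z = 2.0, giving a p-value of about 0.046, so at significance level 0.05 the null hypothesis is (narrowly) discarded.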
Statistics
• p-values can be used in so-called goodness-of-fit tests.
• In such tests one frequently uses a test statistic which is assumed to
be chisquare distributed
– Is a measurement in a tracking detector compatible with belonging to a
particle track defined by a set of other measurements?
– Is a histogram with a set of entries in different bins compatible with an
expected histogram (defined by an underlying assumption of the
distribution)?
– Are the residual distributions of the estimated parameters compatible with the
estimated covariance matrix of the parameters?
• If one can calculate many independent values of the test statistic,
the following procedure is often applied:
– Calculate the p-value of the test statistic each time the test statistic is
calculated
Statistics
– The p-value itself is also a random variable, and it can be shown that it
is distributed according to a uniform distribution if the test statistic
originates from the expected (chisquare) distribution.
– Create a histogram with the various p-values as entries and see
whether it looks reasonably flat
• NB! With only one calculated p-value, the null hypothesis can be
rejected but never confirmed!
• With many calculated p-values (as immediately above) the null
hypothesis can also (to a certain extent) be confirmed!
• Example: line fit (as before)
• For each fitted line, calculate the following chisquare:

  $\chi^2 = (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})^T \left[\mathrm{cov}(\hat{\boldsymbol{\theta}})\right]^{-1} (\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})$
Statistics
• Here θ is the true value of the parameter vector.
• For each value of the chisquare, calculate the corresponding p-value
– Integral of chisquare distribution from the value of the chisquare to
infinity
• Given in tables or in standard computer programs (CERNLIB, CLHEP,
MATLAB,….)
• Fill up a histogram with the p-values and make a plot:
[Figure: histogram of p-values. Reasonably flat, seems OK.]
What we really test here is that the estimated parameters are unbiased estimates of the true parameters, distributed according to a Gaussian with a covariance matrix as obtained in the estimate!
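The flatness check above can be sketched with a toy Monte Carlo. For the two fitted line parameters the chisquare has 2 degrees of freedom, for which the upper-tail p-value happens to have the closed form $e^{-\chi^2/2}$ (this closed form and the toy sampling are my additions, not from the slides):

```python
import math
import random

# Toy version of the p-value flatness check: sample a genuine 2-dof
# chi-square (sum of two squared standard normals), convert each value to
# its upper-tail p-value exp(-chi2/2), and check the p-values look uniform.
random.seed(3)
p_values = []
for _ in range(10000):
    chi2 = random.gauss(0, 1)**2 + random.gauss(0, 1)**2  # 2-dof chi-square
    p_values.append(math.exp(-chi2 / 2.0))                # upper-tail p-value

mean_p = sum(p_values) / len(p_values)
print(mean_p)  # a flat (uniform) p-value histogram has mean near 0.5
```

If the fit were biased or its covariance matrix wrong, the chisquare would no longer follow the 2-dof distribution and the p-value histogram would pile up near 0 or 1 instead of being flat.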