
Page 1

LECTURE NOTES

FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS

AUTUMN 2014

PART I PROBABILITY AND STATISTICS

A. STRANDLIE

GJØVIK UNIVERSITY COLLEGE

AND UNIVERSITY OF OSLO

Page 2

Probability
• Before embarking on the concept of probability, we will first define a set of other concepts.
• A stochastic experiment is characterized by the following:
  – All possible elementary outcomes of the experiment are known
  – Only one of the outcomes can occur in a single experiment
  – The outcome of an experiment is not known a priori
• Example: throwing a die
  – The outcomes are S = {1,2,3,4,5,6}
  – You can only observe one of these each time you throw
  – You don't know beforehand what you will observe
• The set S is called the sample space of the experiment

Page 3

Probability
• An event A is one or more outcomes which satisfy certain specifications
• Example: A = "odd number" when throwing a die
• An event is therefore also a subset of S
• Here: A = {1,3,5}
• If B = "even number", what is the subset of S describing B?
• The probability of occurrence of an event A, P(A), is a number between 0 and 1
• Intuitively, a value of P(A) close to 0 means that A occurs very rarely in an experiment, whereas a value close to 1 means that A occurs very often

Page 4

Probability
• There are three ways of quantifying probability:
1. The classical approach, valid when all outcomes can be assumed equally likely. Probability is defined as the number of favourable outcomes for a given event divided by the total number of outcomes. Example: throwing a die has N = 6 different outcomes. Assume the event A = "observing six spots". Only n = 1 of the outcomes is favourable for A, so P(A) = n/N = 1/6 ≈ 0.167.
2. The approach based on the convergence value of the relative frequency for a very large number of repeated, identical experiments. Example: throwing a die and recording the relative frequency of occurrence of A for various numbers of trials (see the sketch below).
3. The subjective approach, reflecting a "degree of belief" in the occurrence of a certain event A. A possible guideline is the convergence value of a large number of hypothetical experiments.
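As a quick illustration of approach 2, here is a minimal simulation sketch (not part of the original slides; plain Python, assuming a fair die) that estimates P(A) for A = "observing six spots" from the relative frequency of A in N throws:

```python
import random

# Estimate P("six spots") for a fair die by the relative-frequency
# approach; the true probability is 1/6 ≈ 0.167.
random.seed(1)

for n_trials in (10, 100, 10_000, 1_000_000):
    n_six = sum(random.randint(1, 6) == 6 for _ in range(n_trials))
    print(f"N = {n_trials:>9}: relative frequency = {n_six / n_trials:.4f}")
```

The relative frequency fluctuates strongly for small N and settles towards 1/6 as N grows, which is exactly the convergence plotted on the next slide.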

Page 5

Probability
[Figure: convergence of the relative frequency toward the true probability, plotted against the logarithm (base 10) of the number of trials]

Page 6

Probability
• Approach 2 forms the basis of frequentist statistics, whereas approach 3 is the basis of Bayesian statistics
  – Two different schools
• When estimating parameters from a set of data, the two approaches usually give the same numbers for the estimates if there is a large amount of data
• If there is little available data, the estimates might differ
  – There is no easy way of determining which approach is "best"
  – Both approaches are advocated in high-energy physics experiments
• We will not enter any further into such questions in this course

Page 7

Probability
• We will now look at probabilities of combinations of events
• We need some concepts from set theory:
• The union A ∪ B is a new event which occurs if A or B or both events occur
• Two events are disjoint if they cannot occur simultaneously
• The intersection A ∩ B is a new event which occurs if both A and B occur
• The complement Ā is a new event which occurs if A does not occur

Page 8

Probability
[Venn diagram: sample space S with its individual outcomes, events A and B with intersection A∩B and union A∪B, and an event C disjoint with both A and B]

Page 9

Probability
• The mathematical axioms of probability:
1. Probability is never negative, P(A) ≥ 0
2. The probability of the event which corresponds to the entire sample space S (i.e. the probability of observing any of the possible outcomes of the experiment) is equal to the unit value, i.e. P(S) = 1
3. Probability must comply with the addition rule for disjoint events:

   P(A₁ ∪ A₂ ∪ … ∪ Aₙ) = P(A₁) + P(A₂) + … + P(Aₙ)

• A couple of useful formulas which can be derived from the axioms:

   P(Ā) = 1 − P(A)
   P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Page 10

Probability
[Venn diagram: events A and B with intersection A∩B]

Concept of conditional probability: what is the probability of occurrence of A given that we know B will occur, i.e. P(A|B)?

Page 11

Probability
• Recalling the definition of probability as the number of favourable outcomes divided by the total number of outcomes, we get:

   P(A|B) = N_{A∩B} / N_B = (N_{A∩B}/N_tot) / (N_B/N_tot) = P(A∩B) / P(B)

• Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}
  – What is P(A|B)?

   A∩B = {4, 6} ⇒ P(A∩B) = 1/3

   P(A|B) = P(A∩B)/P(B) = (1/3)/(2/3) = 1/2

Page 12

Probability
[Venn diagram: event A split by B into the two pieces A∩B and A∩B̄]

Important observation: A∩B and A∩B̄ are disjoint!

Page 13

Probability
• Therefore:

   P(A) = P((A∩B) ∪ (A∩B̄)) = P(A∩B) + P(A∩B̄)
        = P(A|B)·P(B) + P(A|B̄)·P(B̄)

• Expressing P(A) in terms of a subdivision of S into a set of other, disjoint events is called the law of total probability. The general formulation of this law is

   P(A) = Σᵢ P(A|Bᵢ)·P(Bᵢ)

where all {Bᵢ} are disjoint and span the entire sample space S.

Page 14

Probability
• From the definition of conditional probability it follows that

   P(A∩B) = P(B)·P(A|B) = P(A)·P(B|A)

• A quick manipulation gives

   P(B|A) = P(A|B)·P(B) / P(A)

which is called Bayes' theorem.

Page 15

Probability
• By using the law of total probability, one ends up with the general formulation of Bayes' theorem:

   P(Bⱼ|A) = P(A|Bⱼ)·P(Bⱼ) / Σᵢ P(A|Bᵢ)·P(Bᵢ)

which is an extremely important result in statistics. Particularly in Bayesian statistics this theorem is often used to update or refine the knowledge about a set of unknown parameters by the introduction of information from new data.

Page 16

Probability
• This can be explained by a rewrite of Bayes' theorem:

   P(parameters|data) ∝ P(data|parameters) × P(parameters)

P(data|parameters) is often called the likelihood, P(parameters) denotes the prior knowledge of the parameters, whereas P(parameters|data) is the posterior probability of the parameters given the data.
• If P(parameters) cannot be deduced by any objective means, a subjective belief about its value is used in Bayesian statistics.
• Since there is no fundamental rule describing how to deduce this prior probability, Bayesian statistics is still debated (also in high-energy physics!)
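As a small numerical sketch of this updating (illustrative numbers only, not from the slides), consider two hypotheses B1 and B2 with equal priors and different likelihoods for an observed event A:

```python
# Bayes' theorem: posterior ∝ likelihood × prior, normalized by the
# law of total probability. All numbers are hypothetical.
prior = {"B1": 0.5, "B2": 0.5}        # P(B_j): prior knowledge
likelihood = {"B1": 0.9, "B2": 0.2}   # P(A|B_j): likelihood of the data

p_a = sum(likelihood[b] * prior[b] for b in prior)            # P(A)
posterior = {b: likelihood[b] * prior[b] / p_a for b in prior}
print(posterior)  # {'B1': 0.818..., 'B2': 0.181...}
```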

Page 17

Probability
• Definition of independence of events A and B: P(A|B) = P(A), i.e. any given information about B does not affect the probability of observing A.
• Physically this means that the events A and B are uncorrelated.
• For practical applications such independence cannot be derived but rather has to be assumed, given the nature of the physical problem one intends to model.
• General multiplication rule for independent events A₁, A₂, …, Aₙ:

   P(A₁ ∩ A₂ ∩ … ∩ Aₙ) = P(A₁)·P(A₂)·…·P(Aₙ)

Page 18

Probability
• Stochastic or random variable:
  – A number which can be attached to all outcomes of an experiment
    • Example: throwing two dice, the sum of the number of spots
  – Mathematical terminology: a real-valued function defined over the elements of the sample space S of an experiment
  – A capital letter is often used to denote a random variable, for instance X
• Simulation experiment: throwing two dice N times, recording the sum of spots each time and calculating the relative frequency of occurrence of each of the outcomes (see the sketch below)
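A minimal sketch of this simulation experiment (plain Python; the seed and N are arbitrary choices) could look as follows:

```python
import random
from collections import Counter

# Throw two dice N times and record the relative frequency of each sum.
random.seed(1)
N = 100_000

counts = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(N))

for s in range(2, 13):
    expected = (6 - abs(s - 7)) / 36   # theoretical probability of sum s
    print(f"sum {s:2d}: observed {counts[s] / N:.4f}, expected {expected:.4f}")
```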

Page 19

Probability
[Histogram, N = 10. Blue columns: observed relative frequencies; red columns: theoretically expected relative frequencies.]

Pages 20–26

Probability
[The same histograms for N = 20, 100, 1000, 10000, 100000, 1000000, and 10000000 throws]

Page 27

Probability
• The relative frequencies seem to converge towards the theoretically expected probabilities
• Such a diagram is an expression of a probability distribution:
  – A list of all the different values of a random variable together with the associated probabilities
  – Mathematically: a function f(x) = P(X=x) defined for all possible values x of X (given by the experiment at hand)
  – The values of X can be discrete (as in the previous example) or continuous
  – For continuous x, f(x) is called a probability density function
• Simulation experiment: height of Norwegian men
• Collecting data, calculating relative frequencies of occurrence in intervals of various widths

Pages 28–32

Probability
[Histograms of the height data with interval widths 10 cm, 5 cm, 1 cm and 0.5 cm; as the interval width approaches 0, the histogram approaches a continuous probability distribution]

Page 33

Probability
• Cumulative distribution function: F(a) = P(X ≤ a)
• For discrete random variables:

   F(a) = Σ_{xᵢ ≤ a} P(X = xᵢ) = Σ_{xᵢ ≤ a} f(xᵢ)

• For continuous random variables:

   F(a) = ∫_{−∞}^{a} f(x) dx

Page 34

Probability
• It follows that:

   P(a < X ≤ b) = F(b) − F(a)

• For continuous variables:

   P(a < X ≤ b) = ∫_a^b f(x) dx

Pages 35–37

Probability
[Plots of a probability density function with shaded areas illustrating P(a < X < b), P(X < b), and P(X > a)]

Page 38

Probability
• A function u(X) of a random variable X is also a random variable.
• The expectation value of such a function is

   E[u(X)] = ∫_{−∞}^{∞} u(x)·f(x) dx

• Two very important special cases are the mean

   μ = E(X) = ∫_{−∞}^{∞} x·f(x) dx

and the variance

   σ² = Var(X) = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)²·f(x) dx

Page 39

Probability
• The mean μ is the most important measure of the centre of the distribution of X.
• The variance, or its square root σ (the standard deviation), is the most important measure of the spread of the distribution of X around the mean.
• The mean is the first moment of X, whereas the variance is the second central moment of X.
• In general, the n'th moment of X is

   αₙ = E[Xⁿ] = ∫_{−∞}^{∞} xⁿ·f(x) dx

Page 40

Probability
• The n'th central moment is

   mₙ = E[(X − α₁)ⁿ] = ∫_{−∞}^{∞} (x − α₁)ⁿ·f(x) dx

• Another measure of the centre of the distribution of X is the median, defined by

   F(x_med) = 1/2

or, in words, the value of X above which half of the probability lies and below which the other half lies.

Page 41

Probability
• Assume now that X and Y are two random variables with a joint probability density function (pdf) f(x,y).
• The marginal pdf of X is

   f₁(x) = ∫_{−∞}^{∞} f(x,y) dy

whereas the marginal pdf of Y is

   f₂(y) = ∫_{−∞}^{∞} f(x,y) dx

Page 42

Probability
• The mean values of X and Y are

   μ_X = ∫∫ x·f(x,y) dx dy = ∫_{−∞}^{∞} x·f₁(x) dx
   μ_Y = ∫∫ y·f(x,y) dx dy = ∫_{−∞}^{∞} y·f₂(y) dy

• The covariance of X and Y is

   cov[X,Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X·μ_Y

Page 43

Probability
• If several random variables are considered simultaneously, one frequently arranges the variables in a stochastic or random vector

   X = (X₁, X₂, …, Xₙ)ᵀ

• The covariances are then naturally displayed in a covariance matrix

   cov(X) = | cov(X₁,X₁)  cov(X₁,X₂)  …  cov(X₁,Xₙ) |
            | cov(X₂,X₁)  cov(X₂,X₂)  …  cov(X₂,Xₙ) |
            |     ⋮            ⋮               ⋮     |
            | cov(Xₙ,X₁)  cov(Xₙ,X₂)  …  cov(Xₙ,Xₙ) |

Page 44

Probability
• If two variables X and Y are independent, the joint pdf can be written

   f(x,y) = f₁(x)·f₂(y)

• The covariance of X and Y vanishes in this case (why?), and the variances add: V(X+Y) = V(X) + V(Y).
• If X and Y are not independent, the general formula is V(X+Y) = V(X) + V(Y) + 2·cov(X,Y).
• For n mutually independent random variables the covariance matrix becomes diagonal (i.e. all off-diagonal terms are identically zero).

Page 45

Probability
• If a random vector Y = (Y₁, Y₂, …, Yₙ) is related to a vector X (with pdf f(x)) by a function Y(X), the pdf of Y is

   g(y) = f(x(y))·|J|

where |J| is the absolute value of the determinant of a matrix J.
• This matrix is the so-called Jacobian of the transformation from Y to X:

   J = | ∂x₁/∂y₁  …  ∂x₁/∂yₙ |
       |    ⋮             ⋮   |
       | ∂xₙ/∂y₁  …  ∂xₙ/∂yₙ |

Page 46

Probability
• The transformation of the covariance matrix is

   cov(Y) = J⁻¹·cov(X)·(J⁻¹)ᵀ

where the inverse of J is

   J⁻¹ = | ∂y₁/∂x₁  …  ∂y₁/∂xₙ |
         |    ⋮             ⋮   |
         | ∂yₙ/∂x₁  …  ∂yₙ/∂xₙ |

• The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.

Page 47

Probability
• Obtaining cov(Y) from cov(X) as on the previous slide is a widely used technique in high-energy physics data analysis.
• It is called linear error propagation and is applicable any time one wants to transform from one set of estimated parameters to another:
  – Transformation between different sets of parameters describing a reconstructed particle track
  – Transport of track parameters from one location in a detector to another
  – …
• We will see examples later in the course (a small numerical sketch follows below)
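As a minimal sketch of linear error propagation (hypothetical numbers, not one of the track-parameter transformations from the course), transform a point (x, y) with known covariance to polar coordinates (r, φ); the matrix A below plays the role of J⁻¹ = ∂y/∂x above:

```python
import numpy as np

# Propagate cov(x) through a non-linear map via its Jacobian:
# cov(y) ≈ A cov(x) A^T with A = ∂(r, phi)/∂(x, y).
x, y = 3.0, 4.0
cov_xy = np.diag([0.01, 0.04])             # assumed variances of x and y

r = np.hypot(x, y)                         # r = 5.0
A = np.array([[x / r,      y / r],         # ∂r/∂x,   ∂r/∂y
              [-y / r**2,  x / r**2]])     # ∂phi/∂x, ∂phi/∂y

cov_rphi = A @ cov_xy @ A.T                # propagated covariance of (r, phi)
print(cov_rphi)
```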

Page 48

Probability
• The characteristic function φ(u) associated with the pdf f(x) is the Fourier transform of f(x):

   φ(u) = E[e^{iuX}] = ∫_{−∞}^{∞} e^{iux}·f(x) dx

• Such functions are useful in deriving results about moments of random variables.
• The relation between φ(u) and the moments of X is

   i⁻ⁿ·dⁿφ/duⁿ |_{u=0} = ∫_{−∞}^{∞} xⁿ·f(x) dx = αₙ

• If φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself

Page 49

Probability
• Some common probability distributions:
  – Binomial distribution
  – Poisson distribution
  – Gaussian distribution
  – Chisquare distribution
  – Student's t distribution
  – Gamma distribution
• We will take a closer look at some of them

Page 50

Probability
• Binomial distribution:
• Assume that we make n identical experiments with only two possible outcomes: "success" or "no success"
• The probability of success p is the same for all experiments
• The individual experiments are independent of each other
• The probability of x successes out of n trials is then

   P(X = x) = C(n,x)·pˣ·(1 − p)ⁿ⁻ˣ,   where C(n,x) = n!/(x!(n−x)!)

• Example: throwing a die n times
• Defining the event of success to be the occurrence of six spots in a throw
• Probability p = 1/6 (see the sketch below)

Pages 51–53

Probability
[Probability distributions for the number of successes in 5, 15, and 50 throws. Anything familiar about the shape of the last distribution?]

Page 54

Probability
• Mean value and variance:

   E(X) = np
   Var(X) = np(1 − p)

• Five throws with a die:
  – E(# six spots) = 5/6
  – Var(# six spots) = 25/36
  – Std(# six spots) = 5/6

Page 55

Probability
• Poisson distribution:
  – The number of occurrences of an event A per given time (length, area, volume, …) interval is constant and equal to λ.
  – The probability distribution of observing x occurrences in the interval is

   P(X = x) = (λˣ/x!)·e^{−λ}

  – Both the mean value and the variance of X equal λ.
  – Example: the number of particles in a beam passing through a given area in a given time must be Poisson distributed. If the average number λ is known, the probabilities for all x can be calculated according to the formula above (see the sketch below).

Page 56

Probability
• Gaussian distribution:
  – The most frequently occurring distribution in nature.
  – Most measurement uncertainties, disturbances of the directions of charged particles penetrating through (enough) matter, the number of ionizations created by a charged particle in a slab of material etc. follow a Gaussian distribution.
  – Main reason: the CENTRAL LIMIT THEOREM
  – It states that a sum of n independent random variables converges to a Gaussian distribution when n is "large enough", irrespective of the individual distributions of the variables.
  – The examples above are typically of this type.

Page 57

Probability
• Gaussian probability density function with mean value μ and standard deviation σ:

   f(x; μ, σ²) = 1/(σ·√(2π)) · exp(−(x − μ)²/(2σ²))

• For a random vector X of size n with mean value μ and covariance matrix V the function is (multivariate Gaussian distribution):

   f(x; μ, V) = 1/((2π)^{n/2}·√det(V)) · exp(−½·(x − μ)ᵀ V⁻¹ (x − μ))

Page 58

Probability
• Usual terminology: X ~ N(μ,σ): "X is distributed according to a Gaussian (normal) with mean value μ and standard deviation σ".
• 68 % of the distribution is within plus/minus one σ.
• 95 % of the distribution is within plus/minus two σ.
• 99.7 % of the distribution is within plus/minus three σ.
• Standard normal variable Z ~ N(0,1): Z = (X − μ)/σ
• Quantiles of the standard normal distribution:

   P(Z < z_α) = 1 − α

• The value z_α is denoted the "100·α % quantile of the standard normal distribution"
• Such quantiles can be found in tables or by computer programs

Pages 59–61

Probability
[Standard normal pdf with shaded areas illustrating the 10 % quantile, the 5 % quantile (1.64), and 95 % of the area lying within plus/minus the 2.5 % quantile (1.96)]

Page 62

Probability
• χ² distribution:
• If X₁, …, Xₙ are independent, Gaussian random variables, then

   χ² = Σᵢ₌₁ⁿ (Xᵢ − μᵢ)²/σᵢ²

follows a χ² distribution with n degrees of freedom.
• Often used in evaluating the level of compatibility between observed data and the assumed pdf of the data
• Example: is the position of a measurement in a particle detector compatible with the assumed distribution of the measurement?
• The mean value is n and the variance 2n.

Page 63

Probability
[Figure: chisquare distribution with 10 degrees of freedom]

Page 64

Statistics
• Statistics is about making inference about a statistical model, given a set of data or measurements
  – Parameters of a distribution
  – Parameters describing the kinematics of a particle after a collision
    • Position and momentum at some reference surface
  – Parameters describing an interaction vertex (position, refined estimates of particle momenta)
• We will consider two issues:
  – Parameter estimation
  – Hypothesis tests and confidence intervals

Page 65

Statistics
• Parameter estimation
• We want to estimate the unknown value of a parameter θ.
• An estimator θ̂ is a function of the data which aims to estimate the value of θ as closely as possible.
• General estimator properties:
  – Consistency
  – Bias
  – Efficiency
  – Robustness
• A consistent estimator is an estimator which converges to the true value of θ when the amount of data increases (formally, in the limit of an infinite amount of data).

Page 66

Statistics
• The bias b of an estimator is given as

   b = E[θ̂] − θ

• Since the estimator is a function of the data, it is itself a random variable with its own distribution.
• The expectation value of θ̂ can be interpreted as the mean value of the estimate for a very large number of hypothetical, identical experiments.
• Obviously, unbiased (i.e. b = 0) estimators are desirable.

Page 67

Statistics
• The efficiency of an estimator is the inverse of the ratio of its variance to the minimum possible value.
• The minimum possible value is given by the Rao-Cramér-Fréchet lower bound

   σ²_min = (1 + ∂b/∂θ)² / I(θ)

where I(θ) is the Fisher information:

   I(θ) = E[ (∂/∂θ Σᵢ ln f(xᵢ; θ))² ]

Page 68

Statistics
• The sum is over all the data, which are assumed to be independent and to follow the pdf f(x; θ).
• The expression for the lower bound is valid for all estimators with the same bias function b(θ) (for unbiased estimators b(θ) vanishes).
• If the variance of the estimator happens to be equal to the Rao-Cramér-Fréchet lower bound, it is called a minimum variance lower bound estimator or a (fully) efficient estimator.
• Different estimators of the same parameter can also be compared by looking at the ratios of the efficiencies. One then talks about relative efficiencies.
• Robustness is the (qualitative) degree of insensitivity of the estimator to deviations in the assumed pdf of the data
  – e.g. noise in the data not properly taken into account
  – wrong data
  – etc.

Page 69

Statistics
• Common estimators for the mean and variance are (often called the sample mean and the sample variance):

   x̄ = (1/N)·Σᵢ₌₁ᴺ xᵢ

   s² = 1/(N−1)·Σᵢ₌₁ᴺ (xᵢ − x̄)²

• The variances of these are:

   V(x̄) = σ²/N

   V(s²) = (1/N)·(m₄ − (N−3)/(N−1)·σ⁴)

where m₄ is the fourth central moment.

Page 70

Statistics
• For variables which obey the Gaussian distribution, this yields for large N

   std(s) = σ/√(2N)

• For Gaussian variables the sample mean is a fully efficient estimator.
• If the different measurements used in the calculation of the sample mean have different variances, a better estimator of the mean is a weighted sample mean (a small sketch follows below):

   x̄ = (Σᵢ xᵢ/σᵢ²) / (Σᵢ 1/σᵢ²)
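A minimal sketch of the weighted sample mean (hypothetical measurements; the quoted error on the weighted mean, 1/√Σwᵢ, follows from linear error propagation):

```python
import numpy as np

# Weighted sample mean with weights w_i = 1/sigma_i^2.
x = np.array([10.1, 9.8, 10.4])        # hypothetical measurements
sigma = np.array([0.1, 0.2, 0.4])      # their standard deviations

w = 1.0 / sigma**2
x_mean = np.sum(w * x) / np.sum(w)
sigma_mean = 1.0 / np.sqrt(np.sum(w))  # standard deviation of the mean
print(x_mean, sigma_mean)
```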

Page 71

Statistics
• The method of maximum likelihood:
• Assume that we have N independent measurements, all obeying the pdf f(x; θ), where θ is a parameter vector consisting of n different parameters to be estimated.
• The maximum likelihood estimate is the value of the parameter vector θ which maximizes the likelihood function

   L(θ) = Πᵢ₌₁ᴺ f(xᵢ; θ)

• Since the natural logarithm is a monotonically increasing function, ln(L) and L have their maximum at the same value of θ.

Page 72

Statistics
• Therefore the maximum likelihood estimate can be found by solving the likelihood equations

   ∂ln L/∂θᵢ = 0

for all i = 1, …, n.
• ML estimators are asymptotically (i.e. for large amounts of data) unbiased and fully efficient
  – Therefore very popular
• An estimate of the inverse of the covariance matrix of an ML estimate is

   (V⁻¹)ᵢⱼ = −∂²ln L/∂θᵢ∂θⱼ

evaluated at the estimated value of θ (see the sketch below).

Page 73

Statistics
• The method of least squares.
• Simplest possible example: estimating the parameters of a straight line (intercept and tangent of the inclination angle) given a set of measurements.

[Figure: a set of measurements with error bars and the fitted line]

Page 74

Statistics
• Least-squares approach: minimizing the sum of squared distances S between the line and the N measurements,

   S = Σᵢ₌₁ᴺ (yᵢ − (a·xᵢ + b))²/σᵢ²

where σᵢ² is the variance of the measurement error, with respect to the parameters of the line (i.e. a and b).
• This cost function or objective function S can be written in a more compact way by using matrix notation:

   S = (y − Hθ)ᵀ V⁻¹ (y − Hθ)

Page 75

Statistics
• Here y is a vector of measurements, θ is a vector of the parameters a and b, V is the (diagonal) covariance matrix of the measurements (consisting of the individual variances on the main diagonal), and H is given by

   H = | x₁   1 |
       | ⋮    ⋮ |
       | x_N  1 |

• Taking the derivative of S with respect to θ, setting this to zero and solving for θ yields the least-squares solution to the problem.

Page 76

Statistics
• The result is:

   θ̂ = (Hᵀ V⁻¹ H)⁻¹ Hᵀ V⁻¹ y

• The covariance matrix of the estimated parameters is:

   cov(θ̂) = (Hᵀ V⁻¹ H)⁻¹

and the covariance matrix of the estimated positions ŷ = Hθ̂ is

   cov(ŷ) = H (Hᵀ V⁻¹ H)⁻¹ Hᵀ

(see the sketch below)
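A minimal numpy sketch of this line fit (hypothetical data; equal measurement errors are assumed, so V is a multiple of the identity):

```python
import numpy as np

# Straight-line least squares: theta_hat = (H^T V^-1 H)^-1 H^T V^-1 y.
rng = np.random.default_rng(1)
a_true, b_true, sigma = 2.0, 1.0, 0.5

x = np.linspace(0.0, 10.0, 11)
y = a_true * x + b_true + rng.normal(0.0, sigma, x.size)

H = np.column_stack([x, np.ones_like(x)])    # rows (x_i, 1)
V_inv = np.eye(x.size) / sigma**2            # inverse covariance of y

cov_theta = np.linalg.inv(H.T @ V_inv @ H)   # covariance of (a, b)
theta_hat = cov_theta @ (H.T @ V_inv @ y)
print(theta_hat, np.sqrt(np.diag(cov_theta)))
```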

Pages 77–78

Statistics
[Histograms of the estimated intercept and of the estimated tangent of the inclination angle from a simulation of 10000 fitted lines. What are the true values?]

Page 79

Statistics
Histograms of the normalized residuals of the estimated parameters. This means that for each fitted line and each estimated parameter, the quantity (estimated parameter − true parameter)/(standard deviation of the parameter) is put into the histogram. If everything is OK with the fitting procedure, these histograms should have mean 0 and standard deviation 1.

[Two histograms: mean = −0.0189, std = 1.0038 and mean = 0.0157, std = 1.0011]

Page 80

Statistics
• Least-squares estimation is for instance used in track fitting in high-energy physics experiments.
• Track fitting is basically the same task as the line-fit example: estimating a set of parameters describing a particle track through a tracking detector, given a set of measurements created by the particle.
• In the general case the track model is not a straight line but rather a helix (homogeneous magnetic field) or some other trajectory obeying the equations of motion in an inhomogeneous magnetic field.
• The principles of the fitting procedure, however, are largely the same.

Page 81

Statistics
• As long as there is a linear relationship between the parameters and the measurements, the least-squares method is linear.
• If this relationship is a non-linear function F(θ), the problem is said to be of a non-linear least-squares type:

   S = (y − F(θ))ᵀ V⁻¹ (y − F(θ))

• There exists no direct solution to this problem, and one has to resort to an iterative approach (Gauss-Newton):
  – Start out with an initial guess of θ, linearize the function F around the initial guess by a Taylor expansion and solve the resulting linear least-squares problem
  – Use the estimated value of θ as a new expansion point for F and repeat the step above
  – Iterate until convergence (i.e. until θ changes less than a specified value from one iteration to the next)

Page 82

Statistics
• Relationship between maximum likelihood and least squares:
• Consider a set of independent measurements y with mean values F(x; θ).
• If these measurements follow a Gaussian distribution, the log-likelihood function is basically

   −2·ln L(θ) = Σᵢ₌₁ᴺ (yᵢ − F(xᵢ; θ))²/σᵢ²

plus some terms which do not depend on θ.
• Maximizing the log-likelihood function is in this case equivalent to minimizing the least-squares objective function.

Page 83

Statistics
• Confidence intervals and hypothesis tests.
• Confidence intervals:
  – Given a set of measurements of a parameter, calculate an interval such that one can be e.g. 95 % sure that the true value of the parameter lies within it
  – Such an interval is called a 95 % confidence interval for the parameter
• Example: collect N measurements believed to come from a Gaussian distribution with unknown mean value μ and known standard deviation σ. Use the sample mean to calculate a 100(1−α) % confidence interval for μ.
• From earlier: the sample mean is an unbiased estimator of μ with standard deviation σ/√N.
• For large enough N, the quantity

   Z = (X̄ − μ)/(σ/√N)

is distributed according to a standard normal distribution (mean value 0, standard deviation 1)

Page 84

Statistics
• Therefore:

   P(−z_{α/2} < (X̄ − μ)/(σ/√N) < z_{α/2}) = 1 − α
   P(−z_{α/2}·σ/√N < X̄ − μ < z_{α/2}·σ/√N) = 1 − α
   P(−z_{α/2}·σ/√N < μ − X̄ < z_{α/2}·σ/√N) = 1 − α
   P(X̄ − z_{α/2}·σ/√N < μ < X̄ + z_{α/2}·σ/√N) = 1 − α

• In words, there is a probability 1 − α that the true mean is in the interval

   [X̄ − z_{α/2}·σ/√N, X̄ + z_{α/2}·σ/√N]

• This interval is therefore a 100(1−α) % confidence interval for μ.
• Such intervals are highly relevant in physics analysis (see the sketch below).
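A minimal sketch of such an interval (hypothetical data; z_{0.025} = 1.96 is the 2.5 % quantile quoted on the quantile slides above):

```python
import numpy as np

# 95 % confidence interval for mu with known sigma:
# [x_bar - 1.96*sigma/sqrt(N), x_bar + 1.96*sigma/sqrt(N)].
rng = np.random.default_rng(1)
sigma, N = 2.0, 50
x = rng.normal(10.0, sigma, N)          # hypothetical measurements

z = 1.96                                # 2.5 % quantile of N(0, 1)
half_width = z * sigma / np.sqrt(N)
print(x.mean() - half_width, x.mean() + half_width)
```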

Page 85

Statistics
• Hypothesis tests:
• A hypothesis is a statement about the distribution of a vector x of data.
• Similar to the previous example:
  – given N measurements, test whether the measurements come from a normal distribution with a certain expectation value μ or not
  – define a test statistic, i.e. the quantity to be used in the evaluation of the hypothesis. Here: the sample mean.
  – define the significance level of the test, i.e. the probability that the hypothesis will be discarded even though it is true
  – determine the critical region of the test statistic, i.e. the interval(s) of values of the test statistic which will lead to the rejection of the hypothesis

Page 86

Statistics
• We then state two competing hypotheses:
  – A null hypothesis, stating that the expectation value is equal to a given value
  – An alternative hypothesis, stating that the expectation value is not equal to the given value
• Mathematically:

   H₀: μ = μ₀
   H₁: μ ≠ μ₀

• Test statistic:

   Z = (X̄ − μ₀)/(σ/√N)

Page 87

Statistics
[Figure: standard normal pdf with shaded two-sided critical region below −z_{α/2} and above z_{α/2}]

The probability of being in the shaded area is α. The shaded area is therefore the critical region of Z for significance level α. Obtain a value of the test statistic from the test data by calculating the sample mean and transforming to Z. Use the actual value of Z to determine whether the null hypothesis is rejected or not.

Page 88

Statistics
• Alternatively: perform the test by calculating the so-called p-value of the test statistic.
• Given the actual value of the test statistic, what is the area below the pdf for the range of values of the test statistic starting from the actual one and extending to all values further away from the value defined by the null hypothesis? This area defines the p-value.
  – For the current example this corresponds to adding two integrals of the pdf of the test statistic (because this is a so-called two-sided test):
    • one from minus infinity to minus the absolute value of the actual value of the test statistic
    • another from the absolute value of the actual value of the test statistic to plus infinity
  – For a one-sided test one would stick to one integral of the type above
• If the p-value is less than the significance level, discard the null hypothesis; if not, don't discard it (see the sketch below).
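A minimal sketch of the two-sided p-value for the Z test above, using the standard-library error function instead of normal-distribution tables (the observed value z_obs is hypothetical):

```python
from math import erfc, sqrt

# Two-sided p-value: p = P(|Z| >= |z_obs|) = 2*(1 - Phi(|z_obs|)),
# which equals erfc(|z_obs| / sqrt(2)) for a standard normal Z.
z_obs = 2.1
p_value = erfc(abs(z_obs) / sqrt(2))
print(p_value)   # ≈ 0.036 -> reject H0 at significance level alpha = 0.05
```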

Page 89

Statistics
• p-values can be used in so-called goodness-of-fit tests.
• In such tests one frequently uses a test statistic which is assumed to be chisquare distributed:
  – Is a measurement in a tracking detector compatible with belonging to a particle track defined by a set of other measurements?
  – Is a histogram with a set of entries in different bins compatible with an expected histogram (defined by an underlying assumption about the distribution)?
  – Are the residual distributions of the estimated parameters compatible with the estimated covariance matrix of the parameters?
• If one can calculate many independent values of the test statistic, the following procedure is often applied:
  – Calculate the p-value of the test statistic each time the test statistic is calculated

Page 90

Statistics
  – The p-value itself is also a random variable, and it can be shown that it is distributed according to a uniform distribution if the test statistic originates from the expected (chisquare) distribution.
  – Create a histogram with the various p-values as entries and see whether it looks reasonably flat
• NB! With only one calculated p-value, the null hypothesis can be rejected but never confirmed!
• With many calculated p-values (as immediately above) the null hypothesis can also (to a certain extent) be confirmed!
• Example: line fit (as before)
• For each fitted line, calculate the following chisquare:

   χ² = (θ̂ − θ)ᵀ·cov(θ̂)⁻¹·(θ̂ − θ)

Page 91

Statistics
• Here θ is the true value of the parameter vector.
• For each value of the chisquare, calculate the corresponding p-value
  – The integral of the chisquare distribution from the value of the chisquare to infinity
    • Given in tables or in standard computer programs (CERNLIB, CLHEP, MATLAB, …)
• Fill up a histogram with the p-values and make a plot:

[Histogram of p-values: reasonably flat, seems OK. What we really test here is that the estimated parameters are unbiased estimates of the true parameters, distributed according to a Gaussian with a covariance matrix as obtained in the estimate!]