LECTURE NOTES
FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS
AUTUMN 2014
PART I PROBABILITY AND STATISTICS
A. STRANDLIE
GJØVIK UNIVERSITY COLLEGE
AND UNIVERSITY OF OSLO
Probability
• Before embarking on the concept of probability, we will first define a set of other concepts.
• A stochastic experiment is characterized by:
– All possible elementary outcomes of the experiment are known
– Only one of the outcomes can occur in a single experiment
– The outcome of an experiment is not known a priori
• Example: throwing a die
– Outcomes are: S = {1, 2, 3, 4, 5, 6}
– Can only observe one of these each time you throw
– Don't know beforehand what you will observe
• The set S is called the sample space of the experiment
Probability
• An event A is one or more outcomes which satisfy certain specifications
• Example: A = "odd number" when throwing a die
• An event is therefore also a subset of S
• Here: A = {1, 3, 5}
• If B = "even number", what is the subset of S describing B?
• The probability of occurrence of an event A, P(A), is a number between 0 and 1
• Intuitively, a value of P(A) close to 0 means that A occurs very rarely in an experiment, whereas a value close to 1 means that A occurs very often
Probability
• There are three ways of quantifying probability:
1. Classical approach, valid when all outcomes can be assumed equally likely. Probability is defined as the number of favourable outcomes for a given event divided by the total number of outcomes. Example: throwing a die has N=6 different outcomes. Assume that the event A = "observing 6 spots". Only n=1 of the outcomes is favourable for A. P(A) = n/N = 1/6 ≈ 0.167.
2. Approach based on the convergence value of the relative frequency for a very large number of repeated, identical experiments. Example: throwing a die, recording the relative frequency of occurrence of A for various numbers of trials (see the figure and the simulation sketch below).
3. Subjective approach, reflecting the "degree of belief" in the occurrence of a certain event A. Possible guideline: the convergence value of a large number of hypothetical experiments.
Probability
[Figure: convergence of the relative frequency towards the true probability, plotted against the logarithm (base 10) of the number of trials]
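• A minimal Python sketch of this convergence for the die example (the random seed and the chosen numbers of trials are arbitrary illustration choices):

    import numpy as np

    rng = np.random.default_rng(seed=1)   # fixed seed, arbitrary choice
    p_true = 1.0 / 6.0                    # true probability of the event A = "six spots"

    # Relative frequency of sixes after 10, 100, ..., 10^6 throws of a fair die
    for n_trials in 10 ** np.arange(1, 7):
        throws = rng.integers(1, 7, size=n_trials)   # outcomes 1..6 (upper bound exclusive)
        rel_freq = np.mean(throws == 6)
        print(f"N = {n_trials:8d}   relative frequency = {rel_freq:.4f}   (true p = {p_true:.4f})")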
Probability
• Approach 2) forms the basis of frequentist statistics, whereas approach 3) is the baseline of Bayesian statistics
– Two different schools
• When estimating parameters from a set of data, the two approaches usually give the same numbers for the estimates if there is a large amount of data
• If there is little available data, the estimates might differ
– No easy way of determining which approach is "best"
– Both approaches are advocated in high-energy physics experiments
• We will not enter any further into such questions in this course
Probability
• We will now look at probabilities of combinations of events
• Need some concepts from set theory:
• The union A ∪ B is a new event which occurs if A or B or both events occur
• Two events are disjoint if they cannot occur simultaneously
• The intersection A ∩ B is a new event which occurs if both A and B occur
• The complement Ā is a new event which occurs if A does not occur
Probability
[Venn diagram: sample space S of outcomes containing events A and B with intersection A∩B, and an event C disjoint with both A and B]
Probability
• The mathematical axioms of probability:
1. Probability is never negative, P(A) ≥ 0
2. The probability for the event which corresponds to the entire sample space S (i.e. the probability of observing any of the possible outcomes of the experiment) is equal to unity, i.e. P(S) = 1
3. Probability must comply with the addition rule for disjoint events:

P(A_1 ∪ A_2 ∪ … ∪ A_n) = P(A_1) + P(A_2) + … + P(A_n)

• A couple of useful formulas which can be derived from the axioms:

P(Ā) = 1 − P(A)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Probability
[Venn diagram: events A and B with intersection A∩B]
• Concept of conditional probability: what is the probability of occurrence of A given that we know B will occur, i.e. P(A|B)?
Probability
• Recalling the definition of probability as the number of favourable outcomes divided by the total number of outcomes, we get:

P(A|B) = N_{A∩B} / N_B = (N_{A∩B}/N_tot) / (N_B/N_tot) = P(A∩B) / P(B)

• Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}
– What is P(A|B)?

A∩B = {4, 6} ⇒ P(A∩B) = 1/3

P(A|B) = P(A∩B) / P(B) = (1/3) / (2/3) = 1/2
Probability
[Venn diagram: A split into the two pieces A∩B and A∩B̄]
• Important observation: A∩B and A∩B̄ are disjoint!
Probability
• Therefore:

P(A) = P((A∩B) ∪ (A∩B̄)) = P(A∩B) + P(A∩B̄) = P(A|B)·P(B) + P(A|B̄)·P(B̄)

• Expressing P(A) in terms of a subdivision of S in a set of other, disjoint events is called the law of total probability. The general formulation of this law is:

P(A) = Σ_i P(A|B_i)·P(B_i)

where all {B_i} are disjoint and span the entire sample space S.
Probability
• From the definition of conditional probability it follows:

P(A∩B) = P(B)·P(A|B) = P(A)·P(B|A)

• A quick manipulation gives:

P(B|A) = P(A|B)·P(B) / P(A)

which is called Bayes' theorem.
Probability
• By using the law of total probability, one ends up with the general formulation of Bayes' theorem:

P(B_j|A) = P(A|B_j)·P(B_j) / Σ_i P(A|B_i)·P(B_i)

which is an extremely important result in statistics. Particularly in Bayesian statistics this theorem is often used to update or refine the knowledge about a set of unknown parameters by the introduction of information from new data.
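• A minimal numerical sketch of this formula (the two hypotheses and all probability values below are invented purely for illustration):

    # Two disjoint hypotheses B_1, B_2 spanning the sample space,
    # e.g. "particle is a pion" vs "particle is a kaon" (hypothetical numbers).
    priors = {"pion": 0.9, "kaon": 0.1}            # P(B_i)
    likelihood = {"pion": 0.05, "kaon": 0.60}      # P(A | B_i), A = "signal above threshold"

    # Law of total probability: P(A) = sum_i P(A|B_i) * P(B_i)
    p_A = sum(likelihood[h] * priors[h] for h in priors)

    # Bayes' theorem: P(B_j | A) = P(A|B_j) * P(B_j) / P(A)
    posteriors = {h: likelihood[h] * priors[h] / p_A for h in priors}
    print(posteriors)   # {'pion': 0.43, 'kaon': 0.57} approximately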
Probability
• This can be explained by a rewrite of Bayes' theorem: P(parameters|data) ∝ P(data|parameters) × P(parameters). P(data|parameters) is often called the likelihood, P(parameters) denotes the prior knowledge of the parameters, whereas P(parameters|data) is the posterior probability of the parameters given the data.
• If P(parameters) cannot be deduced by any objective means, a subjective belief of its value is used in Bayesian statistics.
• Since there is no fundamental rule describing how to deduce this prior probability, Bayesian statistics is still debated (also in high-energy physics!)
Probability
• Definition of independence of events A and B: P(A|B) = P(A), i.e. any given information about B does not affect the probability of observing A.
• Physically this means that the events A and B are uncorrelated.
• For practical applications such independence cannot be derived but rather has to be assumed, given the nature of the physical problem one intends to model.
• General multiplication rule for independent events A_1, A_2, …, A_n:

P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1)·P(A_2)·…·P(A_n)
Probability
• Stochastic or random variable:
– Number which can be attached to all outcomes of an experiment
• Example: throwing two dice, sum of the number of spots
– Mathematical terminology: a real-valued function defined over the elements of the sample space S of an experiment
– A capital letter is often used to denote a random variable, for instance X
• Simulation experiment: throwing two dice N times, recording the sum of spots each time and calculating the relative frequency of occurrence for each of the outcomes (results and a small simulation sketch below)
Probability
[Figures: for N = 10, 20, 100, 1000, 10000, 100000, 1000000 and 10000000 throws. Blue columns: observed relative frequencies; red columns: theoretically expected relative frequencies]
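• A minimal sketch of this simulation experiment (assuming fair dice; the seed and the subset of N values are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(seed=2)

    # Theoretically expected probabilities for the sum of two dice (2..12)
    sums = np.arange(2, 13)
    expected = np.array([min(s - 1, 13 - s) for s in sums]) / 36.0

    for N in (10, 100, 10_000, 1_000_000):
        totals = rng.integers(1, 7, size=N) + rng.integers(1, 7, size=N)
        observed = np.array([np.mean(totals == s) for s in sums])
        print(f"N = {N}")
        for s, obs, exp in zip(sums, observed, expected):
            print(f"  sum {s:2d}: observed {obs:.4f}  expected {exp:.4f}")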
Probability
• The relative frequencies seem to converge towards the theoretically expected probabilities
• Such a diagram is an expression of a probability distribution:
– A list of all different values of a random variable together with the associated probabilities
– Mathematically: a function f(x) = P(X=x) defined for all possible values x of X (given by the experiment at hand)
– The values of X can be discrete (as in the previous example) or continuous
– For continuous x, f(x) is called a probability density function
• Simulation experiment: height of Norwegian men
• Collecting data, calculating relative frequencies of occurrence in intervals of various widths
Probability
[Figures: histograms of relative frequencies of heights for interval widths 10 cm, 5 cm, 1 cm and 0.5 cm; in the limit of interval width 0 the histogram becomes a continuous probability distribution]
Probability
• Cumulative distribution function: F(a) = P(X ≤ a)
• For discrete random variables:

F(a) = Σ_{x_i ≤ a} P(X = x_i) = Σ_{x_i ≤ a} f(x_i)

• For continuous random variables:

F(a) = ∫_{−∞}^{a} f(x) dx
Probability
• It follows:

P(a < X ≤ b) = F(b) − F(a)

• For continuous variables:

P(a < X ≤ b) = ∫_{a}^{b} f(x) dx
Probability
[Figures: pdf curves with shaded areas illustrating P(a < X < b), P(X < b) and P(X > a)]
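• A minimal sketch of these probabilities for a continuous distribution, using a standard normal pdf as a stand-in example (scipy is assumed to be available; the interval limits are arbitrary):

    from scipy.stats import norm

    a, b = -1.0, 2.0          # arbitrary interval limits for illustration
    F = norm.cdf              # cumulative distribution function F(x) = P(X <= x)

    p_interval = F(b) - F(a)  # P(a < X <= b) = F(b) - F(a)
    p_below_b  = F(b)         # P(X < b)
    p_above_a  = 1.0 - F(a)   # P(X > a)

    print(p_interval, p_below_b, p_above_a)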
Probability
• A function u(X) of a random variable X is also a random variable.
• The expectation value of such a function is:

E[u(X)] = ∫_{−∞}^{∞} u(x)·f(x) dx

• Two very important special cases are:

μ = E(X) = ∫_{−∞}^{∞} x·f(x) dx   (the mean)

σ² = Var(X) = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)²·f(x) dx   (the variance)
Probability
• The mean μ is the most important measure of the centre of the distribution of X.
• The variance σ², or its square root σ (the standard deviation), is the most important measure of the spread of the distribution of X around the mean.
• The mean is the first moment of X, whereas the variance is the second central moment of X.
• In general, the n'th moment of X is

α_n = E[X^n] = ∫_{−∞}^{∞} x^n·f(x) dx
Probability
• The n'th central moment is

m_n = E[(X − α_1)^n] = ∫_{−∞}^{∞} (x − α_1)^n·f(x) dx

• Another measure of the centre of the distribution of X is the median, defined as

F(x_med) = 1/2

or, in words, the value of X above which half of the probability lies and below which the other half lies.
Probability
• Assume now that X and Y are two random variables with a joint probability density function (pdf) f(x,y).
• The marginal pdf of X is

f_1(x) = ∫_{−∞}^{∞} f(x,y) dy

whereas the marginal pdf of Y is

f_2(y) = ∫_{−∞}^{∞} f(x,y) dx
Probability
• The mean values of X and Y are

μ_X = ∫∫ x·f(x,y) dx dy = ∫_{−∞}^{∞} x·f_1(x) dx

μ_Y = ∫∫ y·f(x,y) dx dy = ∫_{−∞}^{∞} y·f_2(y) dy

• The covariance of X and Y is

cov[X, Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X·μ_Y
Probability
• If several random variables are considered simultaneously, one frequently arranges the variables in a stochastic or random vector

X = (X_1, X_2, …, X_n)^T

• The covariances are then naturally displayed in a covariance matrix

cov(X) = ( cov(X_1,X_1)  cov(X_1,X_2)  …  cov(X_1,X_n)
           cov(X_2,X_1)  cov(X_2,X_2)  …  cov(X_2,X_n)
                ⋮              ⋮        ⋱       ⋮
           cov(X_n,X_1)  cov(X_n,X_2)  …  cov(X_n,X_n) )
Probability
• If two variables X and Y are independent, the joint pdf can be written

f(x, y) = f_1(x)·f_2(y)

• The covariance of X and Y vanishes in this case (why?), and the variances add: V(X+Y) = V(X) + V(Y).
• If X and Y are not independent, the general formula is: V(X+Y) = V(X) + V(Y) + 2 Cov(X,Y).
• For n mutually independent random variables the covariance matrix becomes diagonal (i.e. all off-diagonal terms are identically zero).
Probability
• If a random vector Y = (Y_1, Y_2, …, Y_n) is related to a vector X (with pdf f(x)) by a function Y(X), the pdf of Y is

g(y) = f(x(y))·|J|

where |J| is the absolute value of the determinant of a matrix J.
• This matrix is the so-called Jacobian of the transformation from Y to X:

J = ( ∂x_1/∂y_1  …  ∂x_1/∂y_n
           ⋮      ⋱      ⋮
      ∂x_n/∂y_1  …  ∂x_n/∂y_n )
Probability
• The transformation of the covariance matrix is

cov(Y) = J^{−1}·cov(X)·(J^{−1})^T

where the inverse of J is

J^{−1} = ( ∂y_1/∂x_1  …  ∂y_1/∂x_n
                ⋮      ⋱      ⋮
           ∂y_n/∂x_1  …  ∂y_n/∂x_n )

• The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.
Probability
• Obtaining cov(Y) from cov(X) as in the previous slide is a much used technique in high-energy physics data analysis.
• It is called linear error propagation and is applicable any time one wants to transform from one set of estimated parameters to another
– Transformation between different sets of parameters describing a reconstructed particle track
– Transport of track parameters from one location in a detector to another
– …
• We will see examples later in the course (a small numerical sketch follows below)
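• A minimal numerical sketch of linear error propagation (the transformation from Cartesian (x, y) to polar (r, φ) coordinates and the example covariance values are chosen only for illustration):

    import numpy as np

    # Estimated parameters x = (x1, x2) with an assumed covariance matrix
    x = np.array([3.0, 4.0])
    cov_x = np.array([[0.04, 0.01],
                      [0.01, 0.09]])

    # Transformation y(x): (x1, x2) -> (r, phi)
    r = np.hypot(x[0], x[1])
    phi = np.arctan2(x[1], x[0])

    # Jacobian of the transformation, A_ij = dy_i/dx_j (J^-1 in the notation of the slides)
    A = np.array([[x[0] / r,      x[1] / r],
                  [-x[1] / r**2,  x[0] / r**2]])

    # Linear error propagation: cov(y) = A cov(x) A^T
    cov_y = A @ cov_x @ A.T
    print(cov_y)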
Probability
• The characteristic function φ(u) associated with the pdf f(x) is the Fourier transform of f(x):

φ(u) = E[e^{iuX}] = ∫_{−∞}^{∞} e^{iux}·f(x) dx

• Such functions are useful in deriving results about moments of random variables.
• The relation between φ(u) and the moments of X is

i^{−n} · d^n φ/du^n |_{u=0} = ∫_{−∞}^{∞} x^n·f(x) dx = α_n

• If φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself
Probability
• Some common probability distributions:
– Binomial distribution
– Poisson distribution
– Gaussian distribution
– Chisquare distribution
– Student's t distribution
– Gamma distribution
• We will take a closer look at some of them
Probability
• Binomial distribution:
• Assume that we make n identical experiments with only two possible outcomes: "success" or "no success"
• The probability of success p is the same for all experiments
• The individual experiments are independent of each other
• The probability of x successes out of n trials is then

P(X = x) = (n choose x) · p^x · (1 − p)^{n−x}

• Example: throwing a die n times
• Defining the event of success to be the occurrence of six spots in a throw
• Probability p = 1/6
Probability
[Figures: binomial probability distributions for the number of successes in 5, 15 and 50 throws. Anything familiar with the shape of the last distribution?]
Probability
• Mean value and variance:

E(X) = np
Var(X) = np(1 − p)

• Five throws with a die:
– E(# six spots) = 5/6
– Var(# six spots) = 25/36
– Std(# six spots) = 5/6
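• A minimal check of the binomial formula and the numbers above (scipy is assumed to be available; n = 5 throws and p = 1/6 are taken from the example):

    from scipy.stats import binom

    n, p = 5, 1.0 / 6.0

    # P(X = x) for x successes ("six spots") in n throws
    for x in range(n + 1):
        print(f"P(X = {x}) = {binom.pmf(x, n, p):.4f}")

    print("mean     :", binom.mean(n, p))   # n*p       = 5/6
    print("variance :", binom.var(n, p))    # n*p*(1-p) = 25/36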
Probability
• Poisson distribution:
– The average number of occurrences of event A per given time (length, area, volume, …) interval is constant and equal to λ.
– The probability distribution of observing x occurrences in the interval is

P(X = x) = (λ^x · e^{−λ}) / x!

– Both the mean value and the variance of X are λ.
– Example: the number of particles in a beam passing through a given area in a given time must be Poisson distributed. If the average number λ is known, the probabilities for all x can be calculated according to the formula above.
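• A minimal sketch of the beam example (the value λ = 3.2 particles per interval is invented for illustration; scipy is assumed):

    from scipy.stats import poisson

    lam = 3.2   # hypothetical average number of beam particles per time interval

    # P(X = x) for the first few values of x, using P(X=x) = lam^x exp(-lam) / x!
    for x in range(6):
        print(f"P(X = {x}) = {poisson.pmf(x, lam):.4f}")

    print("mean     :", poisson.mean(lam))   # equals lam
    print("variance :", poisson.var(lam))    # also equals lam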
Probability
• Gaussian distribution:
– The most frequently occurring distribution in nature.
– Most measurement uncertainties, disturbances of the directions of charged particles penetrating through (enough) matter, the number of ionizations created by a charged particle in a slab of material etc. follow a Gaussian distribution.
– Main reason: the CENTRAL LIMIT THEOREM
– It states that the sum of n independent random variables converges to a Gaussian distribution when n is "large enough", irrespective of the individual distributions of the variables.
– The abovementioned examples are typically of this type.
Probability
• Gaussian probability density function with mean value μ and standard deviation σ:

f(x; μ, σ) = 1/(σ·√(2π)) · exp(−(x − μ)² / (2σ²))

• For a random vector X of size n with mean value μ and covariance matrix V the function is (multivariate Gaussian distribution):

f(x; μ, V) = 1/((2π)^{n/2}·√det(V)) · exp(−½ (x − μ)^T V^{−1} (x − μ))
Probability
• Usual terminology: X ~ N(μ,σ): "X is distributed according to a Gaussian (normal) with mean value μ and standard deviation σ".
• 68 % of the distribution within plus/minus one σ.
• 95 % of the distribution within plus/minus two σ.
• 99.7 % of the distribution within plus/minus three σ.
• Standard normal variable Z ~ N(0,1): Z = (X − μ)/σ
• Quantiles of the standard normal distribution:

P(Z < z_α) = 1 − α

• The value z_α is denoted the "100·α % quantile of the standard normal distribution"
• Such quantiles can be found in tables or by computer programs
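• A minimal sketch of obtaining such quantiles by a computer program (scipy is assumed; with the convention above, z_α satisfies P(Z < z_α) = 1 − α):

    from scipy.stats import norm

    # z_alpha such that P(Z < z_alpha) = 1 - alpha (upper-tail quantile)
    for alpha in (0.10, 0.05, 0.025):
        z_alpha = norm.ppf(1.0 - alpha)   # inverse of the standard normal cdf
        print(f"alpha = {alpha:5.3f}  ->  z_alpha = {z_alpha:.3f}")

    # Fraction of the distribution within +/- 1, 2, 3 sigma
    for k in (1, 2, 3):
        print(f"within +/-{k} sigma: {norm.cdf(k) - norm.cdf(-k):.4f}")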
Probability
[Figures: standard normal pdf with the 10 % quantile, the 5 % quantile (1.64), and 95 % of the area within plus/minus the 2.5 % quantile (1.96)]
Probability
• χ² distribution:
• If X_1, …, X_n are independent Gaussian random variables, then

χ² = Σ_{i=1}^{n} (X_i − μ_i)² / σ_i²

follows a χ² distribution with n degrees of freedom.
• Often used in evaluating the level of compatibility between observed data and the assumed pdf of the data
• Example: is the position of a measurement in a particle detector compatible with the assumed distribution of the measurement?
• The mean value is n and the variance 2n.
Probability
[Figure: chisquare distribution with 10 degrees of freedom]
Statistics
• Statistics is about making inference about a statistical model, given a set of data or measurements
– Parameters of a distribution
– Parameters describing the kinematics of a particle after a collision
• Position and momentum at some reference surface
– Parameters describing an interaction vertex (position, refined estimates of particle momenta)
• We will consider two issues
– Parameter estimation
– Hypothesis tests and confidence intervals
Statistics
• Parameter estimation
• We want to estimate the unknown value of a parameter θ.
• An estimator θ̂ is a function of the data which aims to estimate the value of θ as closely as possible.
• General estimator properties:
– Consistency
– Bias
– Efficiency
– Robustness
• A consistent estimator is an estimator which converges to the true value of θ when the amount of data increases (formally, in the limit of an infinite amount of data).
Statistics
• The bias b of an estimator is given as

b = E[θ̂] − θ

• Since the estimator is a function of the data, it is itself a random variable with its own distribution.
• The expectation value of θ̂ can be interpreted as the mean value of the estimate for a very large number of hypothetical, identical experiments.
• Obviously, unbiased (i.e. b = 0) estimators are desirable.
Statistics
• The efficiency of an estimator is the inverse of the ratio of its variance to the minimum possible value.
• The minimum possible value is given by the Rao-Cramer-Frechet lower bound

σ²_min = (1 + ∂b/∂θ)² / I(θ)

where I(θ) is the Fisher information:

I(θ) = E[ ( ∂/∂θ Σ_i ln f(x_i; θ) )² ]
Statistics
• The sum is over all the data, which are assumed independent and to follow the pdf f(x; θ).
• The expression of the lower bound is valid for all estimators with the same bias function b(θ) (for unbiased estimators b(θ) vanishes).
• If the variance of the estimator happens to be equal to the Cramer-Rao-Frechet lower bound, it is called a minimum variance lower bound estimator or a (fully) efficient estimator.
• Different estimators of the same parameter can also be compared by looking at the ratios of the efficiencies. One then talks about relative efficiencies.
• Robustness is the (qualitative) degree of insensitivity of the estimator to deviations from the assumed pdf of the data
– e.g. noise in the data not properly taken into account
– wrong data
– etc.
Statistics
• Common estimators for the mean and variance are (often called the sample mean and the sample variance):

x̄ = (1/N) Σ_{i=1}^{N} x_i

s² = (1/(N−1)) Σ_{i=1}^{N} (x_i − x̄)²

• The variances of these are:

V(x̄) = σ²/N

V(s²) = (1/N)·( m_4 − ((N−3)/(N−1))·σ⁴ )
Statistics
• For variables which obey the Gaussian distribution, this yields for large N

std(s²) = σ²·√(2/N)

• For Gaussian variables the sample mean is a fully efficient estimator.
• If the different measurements used in the calculation of the sample mean have different variances, a better estimator of the mean is the weighted sample mean:

x̄ = (Σ_i w_i·x_i) / (Σ_i w_i),   with weights w_i = 1/σ_i²
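• A minimal sketch of the sample mean and the weighted sample mean (the measurement values and their standard deviations are invented for illustration):

    import numpy as np

    # Hypothetical measurements of the same quantity with different uncertainties
    x     = np.array([10.2, 9.8, 10.5, 10.1])
    sigma = np.array([0.1, 0.3, 0.5, 0.2])

    sample_mean = x.mean()

    # Weighted sample mean with weights w_i = 1/sigma_i^2
    w = 1.0 / sigma**2
    weighted_mean = np.sum(w * x) / np.sum(w)
    weighted_mean_std = 1.0 / np.sqrt(np.sum(w))   # standard deviation of the weighted mean

    print(sample_mean, weighted_mean, weighted_mean_std)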
Statistics
• The method of maximum likelihood:
• Assume that we have N independent measurements all obeying the pdf f(x; θ), where θ is a parameter vector consisting of n different parameters to be estimated.
• The maximum likelihood estimate is the value of the parameter vector θ which maximizes the likelihood function

L(θ) = Π_{i=1}^{N} f(x_i; θ)

• Since the natural logarithm is a monotonically increasing function, ln(L) and L will have their maximum for the same value of θ.
Statistics
• Therefore the maximum likelihood estimate can be found by solving the likelihood equations

∂ln L / ∂θ_i = 0

for all i = 1, …, n.
• ML estimators are asymptotically (i.e. for large amounts of data) unbiased and fully efficient
– Therefore very popular
• An estimate of the inverse of the covariance matrix of an ML estimate is

(V^{−1})_{ij} = −∂²ln L / (∂θ_i ∂θ_j)

evaluated at the estimated value of θ.
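• A minimal sketch of a maximum likelihood fit (assuming a Gaussian pdf f(x; μ, σ) and simulated data; the sample size, true values and optimizer choice are arbitrary):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(seed=3)
    data = rng.normal(loc=2.0, scale=0.5, size=1000)   # simulated measurements

    def neg_log_likelihood(theta):
        """-ln L(mu, sigma) for independent Gaussian measurements."""
        mu, sigma = theta
        if sigma <= 0:
            return np.inf
        return np.sum(0.5 * ((data - mu) / sigma) ** 2 + np.log(sigma)) \
               + 0.5 * len(data) * np.log(2 * np.pi)

    # Maximize ln L by minimizing -ln L, starting from a rough initial guess
    result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
    mu_hat, sigma_hat = result.x
    print(mu_hat, sigma_hat)   # should be close to 2.0 and 0.5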
Statistics
• The method of least squares.
• Simplest possible example: estimating the parameters of a straight line (intercept and tangent of the inclination angle) given a set of measurements.
[Figure: measurements with error bars and the fitted line]
Statistics
• Least-squares approach: minimizing the sum of squared distances S between the line and the N measurements,

S = Σ_{i=1}^{N} (y_i − (a + b·x_i))² / σ_i²

(σ_i² being the variance of the measurement error) with respect to the parameters of the line (i.e. a and b).
• This cost function or objective function S can be written in a more compact way by using matrix notation:

S = (y − Hθ)^T V^{−1} (y − Hθ)
Statistics
• Here y is a vector of measurements, θ is a vector of the parameters a and b, V is the (diagonal) covariance matrix of the measurements (consisting of the individual variances on the main diagonal), and H is given by

H = ( 1   x_1
      1   x_2
      ⋮    ⋮
      1   x_N )

• Taking the derivative of S with respect to θ, setting this to zero and solving for θ yields the least-squares solution to the problem.
Statistics
• The result is:

θ̂ = (H^T V^{−1} H)^{−1} H^T V^{−1} y

• The covariance matrix of the estimated parameters is:

cov(θ̂) = (H^T V^{−1} H)^{−1}

and the covariance matrix of the estimated positions ŷ = Hθ̂ is

cov(ŷ) = H (H^T V^{−1} H)^{−1} H^T
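• A minimal sketch of this linear least-squares line fit (the true line parameters, measurement positions and measurement errors are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=4)

    # Simulate N measurements y_i = a + b*x_i + noise, with known sigma_i
    a_true, b_true = 1.0, 0.5
    x = np.linspace(0.0, 10.0, 11)
    sigma = np.full_like(x, 0.2)
    y = a_true + b_true * x + rng.normal(0.0, sigma)

    H = np.column_stack([np.ones_like(x), x])   # design matrix, rows (1, x_i)
    V_inv = np.diag(1.0 / sigma**2)             # inverse of the (diagonal) covariance matrix

    # Least-squares solution and covariance of the estimated parameters
    cov_theta = np.linalg.inv(H.T @ V_inv @ H)
    theta_hat = cov_theta @ H.T @ V_inv @ y

    print("estimated (a, b):", theta_hat)
    print("cov(theta):\n", cov_theta)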
Statistics
[Figures: simulating 10000 lines; histograms of the value of the estimated intercept and of the estimated tangent of the angle of inclination. What are the true values?]
Statistics
[Figures: histograms of the normalized residuals of the estimated parameters; left: mean = −0.0189, std = 1.0038; right: mean = 0.0157, std = 1.0011]
• This means that for each fitted line and each estimated parameter, the quantity (estimated parameter − true parameter)/(standard deviation of the parameter) is put into the histogram.
• If everything is OK with the fitting procedure, these histograms should have mean 0 and standard deviation 1.
Statistics
• Least-squares estimation is for instance used in track fitting in high-energy physics experiments.
• Track fitting is basically the same task as the line fit example: estimating a set of parameters describing a particle track through a tracking detector, given a set of measurements created by the particle.
• In the general case the track model is not a straight line but rather a helix (homogeneous magnetic field) or some other trajectory obeying the equations of motion in an inhomogeneous magnetic field.
• The principles of the fitting procedure, however, are largely the same.
Statistics
• As long as there is a linear relationship between the parameters and the measurements, the least-squares method is linear.
• If this relationship is a non-linear function F(θ), the problem is said to be of a non-linear least-squares type:

S = (y − F(θ))^T V^{−1} (y − F(θ))

• There exists no direct solution to this problem, and one has to resort to an iterative approach (Gauss-Newton):
– Start out with an initial guess of θ, linearize the function F around the initial guess by a Taylor expansion and solve the resulting linear least-squares problem
– Use the estimated value of θ as a new expansion point for F and repeat the step above
– Iterate until convergence (i.e. until θ changes by less than a specified value from one iteration to the next)
Statistics
• Relationship between maximum likelihood and least squares:
• Consider a set of independent measurements y with mean values F(x; θ).
• If these measurements follow a Gaussian distribution, the log-likelihood function is basically

−2 ln L(θ) = Σ_{i=1}^{N} (y_i − F(x_i; θ))² / σ_i²

plus some terms which do not depend on θ.
• Maximizing the log-likelihood function is in this case equivalent to minimizing the least-squares objective function.
Statistics
• Confidence intervals and hypothesis tests.
• Confidence intervals:
– Given a set of measurements of a parameter, calculate an interval that one can be e.g. 95 % sure that the true value of the parameter lies within
– Such an interval is called a 95 % confidence interval for the parameter
• Example: collect N measurements believed to come from a Gaussian distribution with unknown mean value μ and known standard deviation σ. Use the sample mean value to calculate a 100(1−α) % confidence interval for μ.
• From earlier: the sample mean is an unbiased estimator for μ with standard deviation σ/√N.
• For large enough N, the quantity

Z = (X̄ − μ) / (σ/√N)

is distributed according to a standard normal distribution (mean value 0, standard deviation 1).
Statistics
• Therefore:

P(−z_{α/2} < (X̄ − μ)/(σ/√N) < z_{α/2}) = 1 − α
P(−z_{α/2}·σ/√N < X̄ − μ < z_{α/2}·σ/√N) = 1 − α
P(−z_{α/2}·σ/√N < μ − X̄ < z_{α/2}·σ/√N) = 1 − α
P(X̄ − z_{α/2}·σ/√N < μ < X̄ + z_{α/2}·σ/√N) = 1 − α

• In words, there is a probability 1−α that the true mean is in the interval

[X̄ − z_{α/2}·σ/√N, X̄ + z_{α/2}·σ/√N]

• This interval is therefore a 100(1−α) % confidence interval for μ.
• Such intervals are highly relevant in physics analysis.
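• A minimal sketch of such a confidence interval calculation (the data are simulated; σ = 0.5, N = 100 and the 95 % level are arbitrary choices):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(seed=5)

    sigma, N, alpha = 0.5, 100, 0.05            # known std, sample size, 1 - confidence level
    data = rng.normal(loc=10.0, scale=sigma, size=N)

    x_bar = data.mean()
    z = norm.ppf(1.0 - alpha / 2.0)             # z_{alpha/2}, about 1.96 for alpha = 0.05
    half_width = z * sigma / np.sqrt(N)

    print(f"{100*(1-alpha):.0f} % confidence interval for mu: "
          f"[{x_bar - half_width:.3f}, {x_bar + half_width:.3f}]")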
Statistics
• Hypothesis tests:
• A hypothesis is a statement about the distribution of a vector x of data.
• Similar to the previous example:
– given a number N of measurements, test whether the measurements come from a normal distribution with a certain expectation value μ or not
– define a test statistic, i.e. the quantity to be used in the evaluation of the hypothesis. Here: the sample mean
– define the significance level of the test, i.e. the probability that the hypothesis will be discarded even though it is true
– determine the critical region of the test statistic, i.e. interval(s) of values of the test statistic which will lead to the rejection of the hypothesis
Statistics
• We then state two competing hypotheses:
– A null hypothesis, stating that the expectation value is equal to a given value
– An alternative hypothesis, stating that the expectation value is not equal to the given value
• Mathematically:

H_0: μ = μ_0
H_1: μ ≠ μ_0

• Test statistic:

Z = (X̄ − μ_0) / (σ/√N)
Statistics
[Figure: standard normal pdf with shaded tails below −z_{α/2} and above z_{α/2}; the probability of being in the shaded area is α, so the shaded area is the critical region of Z for significance level α]
• Obtain a value of the test statistic from the test data by calculating the sample mean and transforming to Z.
• Use the actual value of Z to determine whether the null hypothesis is rejected or not.
Statistics
• Alternatively: perform the test by calculating the so-called p-value of the test statistic.
• Given the actual value of the test statistic, what is the area below the pdf for the range of values of the test statistic starting from the actual one and extending to all values further away from the value defined by the null hypothesis? This area defines the p-value.
– For the current example this corresponds to adding two integrals of the pdf of the test statistic (because this is a so-called two-sided test):
• one from minus infinity to minus the absolute value of the actual value of the test statistic
• another from the absolute value of the actual value of the test statistic to plus infinity
• For a one-sided test one would stick to one integral of the abovementioned type
• If the p-value is less than the significance level: discard the null hypothesis. If not, don't discard it. (A small sketch of the two-sided test follows below.)
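• A minimal sketch of this two-sided test (the data, μ_0 = 10.0, σ = 0.5 and the 5 % significance level are invented for illustration; scipy is assumed):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(seed=6)

    mu_0, sigma, N, alpha = 10.0, 0.5, 100, 0.05
    data = rng.normal(loc=10.1, scale=sigma, size=N)   # true mean slightly off mu_0

    # Test statistic Z = (sample mean - mu_0) / (sigma / sqrt(N))
    z_obs = (data.mean() - mu_0) / (sigma / np.sqrt(N))

    # Two-sided p-value: area in both tails beyond |z_obs|
    p_value = 2.0 * (1.0 - norm.cdf(abs(z_obs)))

    print(f"Z = {z_obs:.2f}, p-value = {p_value:.4f}")
    print("reject H0" if p_value < alpha else "do not reject H0")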
Statistics
• p-values can be used in so-called goodness-of-fit tests.
• In such tests one frequently uses a test statistic which is assumed to be chisquare distributed
– Is a measurement in a tracking detector compatible with belonging to a particle track defined by a set of other measurements?
– Is a histogram with a set of entries in different bins compatible with an expected histogram (defined by an underlying assumption of the distribution)?
– Are the residual distributions of the estimated parameters compatible with the estimated covariance matrix of the parameters?
• If one can calculate many independent values of the test statistic, the following procedure is often applied:
– Calculate the p-value of the test statistic each time the test statistic is calculated
Statistics
– The p-value itself is also a random variable, and it can be shown that it is distributed according to a uniform distribution if the test statistic originates from the expected (chisquare) distribution.
– Create a histogram with the various p-values as entries and see whether it looks reasonably flat
• NB! With only one calculated p-value, the null hypothesis can be rejected but never confirmed!
• With many calculated p-values (as immediately above) the null hypothesis can also (to a certain extent) be confirmed!
• Example: line fit (as before)
• For each fitted line, calculate the following chisquare:

χ² = (θ̂ − θ)^T cov(θ̂)^{−1} (θ̂ − θ)
Statistics
• Here θ is the true value of the parameter vector.
• For each value of the chisquare, calculate the corresponding p-value
– Integral of the chisquare distribution from the value of the chisquare to infinity
• Given in tables or in standard computer programs (CERNLIB, CLHEP, MATLAB, …)
• Fill up a histogram with the p-values and make a plot:
[Figure: histogram of the p-values; reasonably flat, seems OK]
• What we really test here is that the estimated parameters are unbiased estimates of the true parameters, distributed according to a Gaussian with a covariance matrix as obtained in the estimate!
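• A minimal sketch of this p-value check for the line-fit example (reusing the least-squares setup sketched earlier; the number of simulated lines and all parameter values are arbitrary):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(seed=7)

    a_true, b_true = 1.0, 0.5
    theta_true = np.array([a_true, b_true])
    x = np.linspace(0.0, 10.0, 11)
    sigma = np.full_like(x, 0.2)
    H = np.column_stack([np.ones_like(x), x])
    V_inv = np.diag(1.0 / sigma**2)
    cov_theta_inv = H.T @ V_inv @ H
    cov_theta = np.linalg.inv(cov_theta_inv)

    p_values = []
    for _ in range(10_000):                      # simulate 10000 line fits
        y = a_true + b_true * x + rng.normal(0.0, sigma)
        theta_hat = cov_theta @ H.T @ V_inv @ y
        d = theta_hat - theta_true
        chisq = d @ cov_theta_inv @ d            # chisquare with 2 degrees of freedom
        p_values.append(chi2.sf(chisq, df=2))    # integral from chisq to infinity

    # A flat histogram of p_values indicates that the fit behaves as expected
    hist, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
    print(hist)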