
Page 1

LECTURE NOTES

FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS

AUTUMN 2014

PART I PROBABILITY AND STATISTICS

A. STRANDLIE

GJØVIK UNIVERSITY COLLEGE

AND UNIVERSITY OF OSLO

Page 2

Probability
• Before embarking on the concept of probability, we will first define a set of other concepts.
• A stochastic experiment is characterized by the following:
  – All possible elementary outcomes of the experiment are known
  – Only one of the outcomes can occur in a single experiment
  – The outcome of an experiment is not known a priori
• Example: throwing a die
  – The outcomes are S = {1,2,3,4,5,6}
  – You can only observe one of these each time you throw
  – You don't know beforehand what you will observe
• The set S is called the sample space of the experiment

Page 3

Probability
• An event A is one or more outcomes which satisfy certain specifications
• Example: A = "odd number" when throwing a die
• An event is therefore also a subset of S
• Here: A = {1,3,5}
• If B = "even number", what is the subset of S describing B?
• The probability of occurrence of an event A, P(A), is a number between 0 and 1
• Intuitively, a value of P(A) close to 0 means that A occurs very rarely in an experiment, whereas a value close to 1 means that A occurs very often

Page 4

Probability
• There are three ways of quantifying probability:
1. The classical approach, valid when all outcomes can be assumed equally likely. Probability is defined as the number of favourable outcomes for a given event divided by the total number of outcomes. Example: throwing a die has N = 6 different outcomes. Assume the event A = "observing six spots". Only n = 1 of the outcomes is favourable for A, so P(A) = n/N = 1/6 ≈ 0.167.
2. The approach based on the convergence value of the relative frequency for a very large number of repeated, identical experiments. Example: throwing a die and recording the relative frequency of occurrence of A for various numbers of trials (see the sketch below).
3. The subjective approach, reflecting a "degree of belief" in the occurrence of a certain event A. A possible guideline is the convergence value of a large number of hypothetical experiments.
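As a quick illustration of approach 2, here is a minimal simulation sketch (not part of the original slides; plain Python, assuming a fair die) that estimates P(A) for A = "observing six spots" from the relative frequency of A in N throws:

```python
import random

# Estimate P("six spots") for a fair die by the relative-frequency
# approach; the true probability is 1/6 ≈ 0.167.
random.seed(1)

for n_trials in (10, 100, 10_000, 1_000_000):
    n_six = sum(random.randint(1, 6) == 6 for _ in range(n_trials))
    print(f"N = {n_trials:>9}: relative frequency = {n_six / n_trials:.4f}")
```

The relative frequency fluctuates strongly for small N and settles towards 1/6 as N grows, which is exactly the convergence plotted on the next slide.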

Page 5

Probability
[Figure: convergence of the relative frequency toward the true probability, plotted against the logarithm (base 10) of the number of trials]

Page 6

Probability
• Approach 2 forms the basis of frequentist statistics, whereas approach 3 is the basis of Bayesian statistics
  – Two different schools
• When estimating parameters from a set of data, the two approaches usually give the same numbers for the estimates if there is a large amount of data
• If there is little available data, the estimates might differ
  – There is no easy way of determining which approach is "best"
  – Both approaches are advocated in high-energy physics experiments
• We will not enter any further into such questions in this course

Page 7

Probability
• We will now look at probabilities of combinations of events
• We need some concepts from set theory:
• The union A ∪ B is a new event which occurs if A or B or both events occur
• Two events are disjoint if they cannot occur simultaneously
• The intersection A ∩ B is a new event which occurs if both A and B occur
• The complement Ā is a new event which occurs if A does not occur

Page 8

Probability
[Venn diagram: sample space S with its individual outcomes, events A and B with intersection A∩B and union A∪B, and an event C disjoint with both A and B]

Page 9

Probability
• The mathematical axioms of probability:
1. Probability is never negative, P(A) ≥ 0
2. The probability of the event which corresponds to the entire sample space S (i.e. the probability of observing any of the possible outcomes of the experiment) is equal to the unit value, i.e. P(S) = 1
3. Probability must comply with the addition rule for disjoint events:

   P(A₁ ∪ A₂ ∪ … ∪ Aₙ) = P(A₁) + P(A₂) + … + P(Aₙ)

• A couple of useful formulas which can be derived from the axioms:

   P(Ā) = 1 − P(A)
   P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Page 10

Probability
[Venn diagram: events A and B with intersection A∩B]

Concept of conditional probability: what is the probability of occurrence of A given that we know B will occur, i.e. P(A|B)?

Page 11

Probability
• Recalling the definition of probability as the number of favourable outcomes divided by the total number of outcomes, we get:

   P(A|B) = N_{A∩B} / N_B = (N_{A∩B}/N_tot) / (N_B/N_tot) = P(A∩B) / P(B)

• Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}
  – What is P(A|B)?

   A∩B = {4, 6} ⇒ P(A∩B) = 1/3

   P(A|B) = P(A∩B)/P(B) = (1/3)/(2/3) = 1/2

Page 12

Probability
[Venn diagram: event A split by B into the two pieces A∩B and A∩B̄]

Important observation: A∩B and A∩B̄ are disjoint!

Page 13

Probability
• Therefore:

   P(A) = P((A∩B) ∪ (A∩B̄)) = P(A∩B) + P(A∩B̄)
        = P(A|B)·P(B) + P(A|B̄)·P(B̄)

• Expressing P(A) in terms of a subdivision of S into a set of other, disjoint events is called the law of total probability. The general formulation of this law is

   P(A) = Σᵢ P(A|Bᵢ)·P(Bᵢ)

where all {Bᵢ} are disjoint and span the entire sample space S.

Page 14

Probability
• From the definition of conditional probability it follows that

   P(A∩B) = P(B)·P(A|B) = P(A)·P(B|A)

• A quick manipulation gives

   P(B|A) = P(A|B)·P(B) / P(A)

which is called Bayes' theorem.

Page 15

Probability
• By using the law of total probability, one ends up with the general formulation of Bayes' theorem:

   P(Bⱼ|A) = P(A|Bⱼ)·P(Bⱼ) / Σᵢ P(A|Bᵢ)·P(Bᵢ)

which is an extremely important result in statistics. Particularly in Bayesian statistics this theorem is often used to update or refine the knowledge about a set of unknown parameters by the introduction of information from new data.

Page 16

Probability
• This can be explained by a rewrite of Bayes' theorem:

   P(parameters|data) ∝ P(data|parameters) × P(parameters)

P(data|parameters) is often called the likelihood, P(parameters) denotes the prior knowledge of the parameters, whereas P(parameters|data) is the posterior probability of the parameters given the data.
• If P(parameters) cannot be deduced by any objective means, a subjective belief about its value is used in Bayesian statistics.
• Since there is no fundamental rule describing how to deduce this prior probability, Bayesian statistics is still debated (also in high-energy physics!)
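As a small numerical sketch of this updating (illustrative numbers only, not from the slides), consider two hypotheses B1 and B2 with equal priors and different likelihoods for an observed event A:

```python
# Bayes' theorem: posterior ∝ likelihood × prior, normalized by the
# law of total probability. All numbers are hypothetical.
prior = {"B1": 0.5, "B2": 0.5}        # P(B_j): prior knowledge
likelihood = {"B1": 0.9, "B2": 0.2}   # P(A|B_j): likelihood of the data

p_a = sum(likelihood[b] * prior[b] for b in prior)            # P(A)
posterior = {b: likelihood[b] * prior[b] / p_a for b in prior}
print(posterior)  # {'B1': 0.818..., 'B2': 0.181...}
```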

Page 17

Probability
• Definition of independence of events A and B: P(A|B) = P(A), i.e. any given information about B does not affect the probability of observing A.
• Physically this means that the events A and B are uncorrelated.
• For practical applications such independence cannot be derived but rather has to be assumed, given the nature of the physical problem one intends to model.
• General multiplication rule for independent events A₁, A₂, …, Aₙ:

   P(A₁ ∩ A₂ ∩ … ∩ Aₙ) = P(A₁)·P(A₂)·…·P(Aₙ)

Page 18

Probability
• Stochastic or random variable:
  – A number which can be attached to all outcomes of an experiment
    • Example: throwing two dice, the sum of the number of spots
  – Mathematical terminology: a real-valued function defined over the elements of the sample space S of an experiment
  – A capital letter is often used to denote a random variable, for instance X
• Simulation experiment: throwing two dice N times, recording the sum of spots each time and calculating the relative frequency of occurrence of each of the outcomes (see the sketch below)
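A minimal sketch of this simulation experiment (plain Python; the seed and N are arbitrary choices) could look as follows:

```python
import random
from collections import Counter

# Throw two dice N times and record the relative frequency of each sum.
random.seed(1)
N = 100_000

counts = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(N))

for s in range(2, 13):
    expected = (6 - abs(s - 7)) / 36   # theoretical probability of sum s
    print(f"sum {s:2d}: observed {counts[s] / N:.4f}, expected {expected:.4f}")
```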

Page 19

Probability
[Histogram, N = 10. Blue columns: observed relative frequencies; red columns: theoretically expected relative frequencies.]

Pages 20–26

Probability
[The same histograms for N = 20, 100, 1000, 10000, 100000, 1000000, and 10000000 throws]

Page 27

Probability
• The relative frequencies seem to converge towards the theoretically expected probabilities
• Such a diagram is an expression of a probability distribution:
  – A list of all the different values of a random variable together with the associated probabilities
  – Mathematically: a function f(x) = P(X=x) defined for all possible values x of X (given by the experiment at hand)
  – The values of X can be discrete (as in the previous example) or continuous
  – For continuous x, f(x) is called a probability density function
• Simulation experiment: height of Norwegian men
• Collecting data, calculating relative frequencies of occurrence in intervals of various widths

Pages 28–32

Probability
[Histograms of the height data with interval widths 10 cm, 5 cm, 1 cm and 0.5 cm; as the interval width approaches 0, the histogram approaches a continuous probability distribution]

Page 33

Probability
• Cumulative distribution function: F(a) = P(X ≤ a)
• For discrete random variables:

   F(a) = Σ_{xᵢ ≤ a} P(X = xᵢ) = Σ_{xᵢ ≤ a} f(xᵢ)

• For continuous random variables:

   F(a) = ∫_{−∞}^{a} f(x) dx

Page 34

Probability
• It follows that:

   P(a < X ≤ b) = F(b) − F(a)

• For continuous variables:

   P(a < X ≤ b) = ∫_a^b f(x) dx

Pages 35–37

Probability
[Plots of a probability density function with shaded areas illustrating P(a < X < b), P(X < b), and P(X > a)]

Page 38

Probability
• A function u(X) of a random variable X is also a random variable.
• The expectation value of such a function is

   E[u(X)] = ∫_{−∞}^{∞} u(x)·f(x) dx

• Two very important special cases are the mean

   μ = E(X) = ∫_{−∞}^{∞} x·f(x) dx

and the variance

   σ² = Var(X) = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)²·f(x) dx

Page 39

Probability
• The mean μ is the most important measure of the centre of the distribution of X.
• The variance, or its square root σ (the standard deviation), is the most important measure of the spread of the distribution of X around the mean.
• The mean is the first moment of X, whereas the variance is the second central moment of X.
• In general, the n'th moment of X is

   αₙ = E[Xⁿ] = ∫_{−∞}^{∞} xⁿ·f(x) dx

Page 40

Probability
• The n'th central moment is

   mₙ = E[(X − α₁)ⁿ] = ∫_{−∞}^{∞} (x − α₁)ⁿ·f(x) dx

• Another measure of the centre of the distribution of X is the median, defined by

   F(x_med) = 1/2

or, in words, the value of X above which half of the probability lies and below which the other half lies.

Page 41

Probability
• Assume now that X and Y are two random variables with a joint probability density function (pdf) f(x,y).
• The marginal pdf of X is

   f₁(x) = ∫_{−∞}^{∞} f(x,y) dy

whereas the marginal pdf of Y is

   f₂(y) = ∫_{−∞}^{∞} f(x,y) dx

Page 42

Probability
• The mean values of X and Y are

   μ_X = ∫∫ x·f(x,y) dx dy = ∫_{−∞}^{∞} x·f₁(x) dx
   μ_Y = ∫∫ y·f(x,y) dx dy = ∫_{−∞}^{∞} y·f₂(y) dy

• The covariance of X and Y is

   cov[X,Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X·μ_Y

Page 43

Probability
• If several random variables are considered simultaneously, one frequently arranges the variables in a stochastic or random vector

   X = (X₁, X₂, …, Xₙ)ᵀ

• The covariances are then naturally displayed in a covariance matrix

   cov(X) = | cov(X₁,X₁)  cov(X₁,X₂)  …  cov(X₁,Xₙ) |
            | cov(X₂,X₁)  cov(X₂,X₂)  …  cov(X₂,Xₙ) |
            |     ⋮            ⋮               ⋮     |
            | cov(Xₙ,X₁)  cov(Xₙ,X₂)  …  cov(Xₙ,Xₙ) |

Page 44

Probability
• If two variables X and Y are independent, the joint pdf can be written

   f(x,y) = f₁(x)·f₂(y)

• The covariance of X and Y vanishes in this case (why?), and the variances add: V(X+Y) = V(X) + V(Y).
• If X and Y are not independent, the general formula is V(X+Y) = V(X) + V(Y) + 2·cov(X,Y).
• For n mutually independent random variables the covariance matrix becomes diagonal (i.e. all off-diagonal terms are identically zero).

Page 45

Probability
• If a random vector Y = (Y₁, Y₂, …, Yₙ) is related to a vector X (with pdf f(x)) by a function Y(X), the pdf of Y is

   g(y) = f(x(y))·|J|

where |J| is the absolute value of the determinant of a matrix J.
• This matrix is the so-called Jacobian of the transformation from Y to X:

   J = | ∂x₁/∂y₁  …  ∂x₁/∂yₙ |
       |    ⋮             ⋮   |
       | ∂xₙ/∂y₁  …  ∂xₙ/∂yₙ |

Page 46

Probability
• The transformation of the covariance matrix is

   cov(Y) = J⁻¹·cov(X)·(J⁻¹)ᵀ

where the inverse of J is

   J⁻¹ = | ∂y₁/∂x₁  …  ∂y₁/∂xₙ |
         |    ⋮             ⋮   |
         | ∂yₙ/∂x₁  …  ∂yₙ/∂xₙ |

• The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.

Page 47

Probability
• Obtaining cov(Y) from cov(X) as on the previous slide is a widely used technique in high-energy physics data analysis.
• It is called linear error propagation and is applicable any time one wants to transform from one set of estimated parameters to another:
  – Transformation between different sets of parameters describing a reconstructed particle track
  – Transport of track parameters from one location in a detector to another
  – …
• We will see examples later in the course (a small numerical sketch follows below)
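As a minimal sketch of linear error propagation (hypothetical numbers, not one of the track-parameter transformations from the course), transform a point (x, y) with known covariance to polar coordinates (r, φ); the matrix A below plays the role of J⁻¹ = ∂y/∂x above:

```python
import numpy as np

# Propagate cov(x) through a non-linear map via its Jacobian:
# cov(y) ≈ A cov(x) A^T with A = ∂(r, phi)/∂(x, y).
x, y = 3.0, 4.0
cov_xy = np.diag([0.01, 0.04])             # assumed variances of x and y

r = np.hypot(x, y)                         # r = 5.0
A = np.array([[x / r,      y / r],         # ∂r/∂x,   ∂r/∂y
              [-y / r**2,  x / r**2]])     # ∂phi/∂x, ∂phi/∂y

cov_rphi = A @ cov_xy @ A.T                # propagated covariance of (r, phi)
print(cov_rphi)
```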

Page 48

Probability
• The characteristic function φ(u) associated with the pdf f(x) is the Fourier transform of f(x):

   φ(u) = E[e^{iuX}] = ∫_{−∞}^{∞} e^{iux}·f(x) dx

• Such functions are useful in deriving results about moments of random variables.
• The relation between φ(u) and the moments of X is

   i⁻ⁿ·dⁿφ/duⁿ |_{u=0} = ∫_{−∞}^{∞} xⁿ·f(x) dx = αₙ

• If φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself

Page 49

Probability
• Some common probability distributions:
  – Binomial distribution
  – Poisson distribution
  – Gaussian distribution
  – Chisquare distribution
  – Student's t distribution
  – Gamma distribution
• We will take a closer look at some of them

Page 50

Probability
• Binomial distribution:
• Assume that we make n identical experiments with only two possible outcomes: "success" or "no success"
• The probability of success p is the same for all experiments
• The individual experiments are independent of each other
• The probability of x successes out of n trials is then

   P(X = x) = C(n,x)·pˣ·(1 − p)ⁿ⁻ˣ,   where C(n,x) = n!/(x!(n−x)!)

• Example: throwing a die n times
• Defining the event of success to be the occurrence of six spots in a throw
• Probability p = 1/6 (see the sketch below)

Pages 51–53

Probability
[Probability distributions for the number of successes in 5, 15, and 50 throws. Anything familiar about the shape of the last distribution?]

Page 54

Probability
• Mean value and variance:

   E(X) = np
   Var(X) = np(1 − p)

• Five throws with a die:
  – E(# six spots) = 5/6
  – Var(# six spots) = 25/36
  – Std(# six spots) = 5/6

Page 55

Probability
• Poisson distribution:
  – The number of occurrences of an event A per given time (length, area, volume, …) interval is constant and equal to λ.
  – The probability distribution of observing x occurrences in the interval is

   P(X = x) = (λˣ/x!)·e^{−λ}

  – Both the mean value and the variance of X equal λ.
  – Example: the number of particles in a beam passing through a given area in a given time must be Poisson distributed. If the average number λ is known, the probabilities for all x can be calculated according to the formula above (see the sketch below).

Page 56

Probability
• Gaussian distribution:
  – The most frequently occurring distribution in nature.
  – Most measurement uncertainties, disturbances of the directions of charged particles penetrating through (enough) matter, the number of ionizations created by a charged particle in a slab of material etc. follow a Gaussian distribution.
  – Main reason: the CENTRAL LIMIT THEOREM
  – It states that a sum of n independent random variables converges to a Gaussian distribution when n is "large enough", irrespective of the individual distributions of the variables.
  – The examples above are typically of this type.

Page 57

Probability
• Gaussian probability density function with mean value μ and standard deviation σ:

   f(x; μ, σ²) = 1/(σ·√(2π)) · exp(−(x − μ)²/(2σ²))

• For a random vector X of size n with mean value μ and covariance matrix V the function is (multivariate Gaussian distribution):

   f(x; μ, V) = 1/((2π)^{n/2}·√det(V)) · exp(−½·(x − μ)ᵀ V⁻¹ (x − μ))

Page 58

Probability
• Usual terminology: X ~ N(μ,σ): "X is distributed according to a Gaussian (normal) with mean value μ and standard deviation σ".
• 68 % of the distribution is within plus/minus one σ.
• 95 % of the distribution is within plus/minus two σ.
• 99.7 % of the distribution is within plus/minus three σ.
• Standard normal variable Z ~ N(0,1): Z = (X − μ)/σ
• Quantiles of the standard normal distribution:

   P(Z < z_α) = 1 − α

• The value z_α is denoted the "100·α % quantile of the standard normal distribution"
• Such quantiles can be found in tables or by computer programs

Pages 59–61

Probability
[Standard normal pdf with shaded areas illustrating the 10 % quantile, the 5 % quantile (1.64), and 95 % of the area lying within plus/minus the 2.5 % quantile (1.96)]

Page 62

Probability
• χ² distribution:
• If X₁, …, Xₙ are independent, Gaussian random variables, then

   χ² = Σᵢ₌₁ⁿ (Xᵢ − μᵢ)²/σᵢ²

follows a χ² distribution with n degrees of freedom.
• Often used in evaluating the level of compatibility between observed data and the assumed pdf of the data
• Example: is the position of a measurement in a particle detector compatible with the assumed distribution of the measurement?
• The mean value is n and the variance 2n.

Page 63

Probability
[Figure: chisquare distribution with 10 degrees of freedom]

Page 64

Statistics
• Statistics is about making inference about a statistical model, given a set of data or measurements
  – Parameters of a distribution
  – Parameters describing the kinematics of a particle after a collision
    • Position and momentum at some reference surface
  – Parameters describing an interaction vertex (position, refined estimates of particle momenta)
• We will consider two issues:
  – Parameter estimation
  – Hypothesis tests and confidence intervals

Page 65

Statistics
• Parameter estimation
• We want to estimate the unknown value of a parameter θ.
• An estimator θ̂ is a function of the data which aims to estimate the value of θ as closely as possible.
• General estimator properties:
  – Consistency
  – Bias
  – Efficiency
  – Robustness
• A consistent estimator is an estimator which converges to the true value of θ when the amount of data increases (formally, in the limit of an infinite amount of data).

Page 66

Statistics
• The bias b of an estimator is given as

   b = E[θ̂] − θ

• Since the estimator is a function of the data, it is itself a random variable with its own distribution.
• The expectation value of θ̂ can be interpreted as the mean value of the estimate for a very large number of hypothetical, identical experiments.
• Obviously, unbiased (i.e. b = 0) estimators are desirable.

Page 67

Statistics
• The efficiency of an estimator is the inverse of the ratio of its variance to the minimum possible value.
• The minimum possible value is given by the Rao-Cramér-Fréchet lower bound

   σ²_min = (1 + ∂b/∂θ)² / I(θ)

where I(θ) is the Fisher information:

   I(θ) = E[ (∂/∂θ Σᵢ ln f(xᵢ; θ))² ]

Page 68

Statistics
• The sum is over all the data, which are assumed to be independent and to follow the pdf f(x; θ).
• The expression for the lower bound is valid for all estimators with the same bias function b(θ) (for unbiased estimators b(θ) vanishes).
• If the variance of the estimator happens to be equal to the Rao-Cramér-Fréchet lower bound, it is called a minimum variance lower bound estimator or a (fully) efficient estimator.
• Different estimators of the same parameter can also be compared by looking at the ratios of the efficiencies. One then talks about relative efficiencies.
• Robustness is the (qualitative) degree of insensitivity of the estimator to deviations in the assumed pdf of the data
  – e.g. noise in the data not properly taken into account
  – wrong data
  – etc.

Page 69

Statistics
• Common estimators for the mean and variance are (often called the sample mean and the sample variance):

   x̄ = (1/N)·Σᵢ₌₁ᴺ xᵢ

   s² = 1/(N−1)·Σᵢ₌₁ᴺ (xᵢ − x̄)²

• The variances of these are:

   V(x̄) = σ²/N

   V(s²) = (1/N)·(m₄ − (N−3)/(N−1)·σ⁴)

where m₄ is the fourth central moment.

Page 70

Statistics
• For variables which obey the Gaussian distribution, this yields for large N

   std(s) = σ/√(2N)

• For Gaussian variables the sample mean is a fully efficient estimator.
• If the different measurements used in the calculation of the sample mean have different variances, a better estimator of the mean is a weighted sample mean (a small sketch follows below):

   x̄ = (Σᵢ xᵢ/σᵢ²) / (Σᵢ 1/σᵢ²)
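A minimal sketch of the weighted sample mean (hypothetical measurements; the quoted error on the weighted mean, 1/√Σwᵢ, follows from linear error propagation):

```python
import numpy as np

# Weighted sample mean with weights w_i = 1/sigma_i^2.
x = np.array([10.1, 9.8, 10.4])        # hypothetical measurements
sigma = np.array([0.1, 0.2, 0.4])      # their standard deviations

w = 1.0 / sigma**2
x_mean = np.sum(w * x) / np.sum(w)
sigma_mean = 1.0 / np.sqrt(np.sum(w))  # standard deviation of the mean
print(x_mean, sigma_mean)
```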

Page 71

Statistics
• The method of maximum likelihood:
• Assume that we have N independent measurements, all obeying the pdf f(x; θ), where θ is a parameter vector consisting of n different parameters to be estimated.
• The maximum likelihood estimate is the value of the parameter vector θ which maximizes the likelihood function

   L(θ) = Πᵢ₌₁ᴺ f(xᵢ; θ)

• Since the natural logarithm is a monotonically increasing function, ln(L) and L have their maximum at the same value of θ.

Page 72

Statistics
• Therefore the maximum likelihood estimate can be found by solving the likelihood equations

   ∂ln L/∂θᵢ = 0

for all i = 1, …, n.
• ML estimators are asymptotically (i.e. for large amounts of data) unbiased and fully efficient
  – Therefore very popular
• An estimate of the inverse of the covariance matrix of an ML estimate is

   (V⁻¹)ᵢⱼ = −∂²ln L/∂θᵢ∂θⱼ

evaluated at the estimated value of θ (see the sketch below).

Page 73

Statistics
• The method of least squares.
• Simplest possible example: estimating the parameters of a straight line (intercept and tangent of the inclination angle) given a set of measurements.

[Figure: a set of measurements with error bars and the fitted line]

Page 74

Statistics
• Least-squares approach: minimizing the sum of squared distances S between the line and the N measurements,

   S = Σᵢ₌₁ᴺ (yᵢ − (a·xᵢ + b))²/σᵢ²

where σᵢ² is the variance of the measurement error, with respect to the parameters of the line (i.e. a and b).
• This cost function or objective function S can be written in a more compact way by using matrix notation:

   S = (y − Hθ)ᵀ V⁻¹ (y − Hθ)

Page 75

Statistics
• Here y is a vector of measurements, θ is a vector of the parameters a and b, V is the (diagonal) covariance matrix of the measurements (consisting of the individual variances on the main diagonal), and H is given by

   H = | x₁   1 |
       | ⋮    ⋮ |
       | x_N  1 |

• Taking the derivative of S with respect to θ, setting this to zero and solving for θ yields the least-squares solution to the problem.

Page 76

Statistics
• The result is:

   θ̂ = (Hᵀ V⁻¹ H)⁻¹ Hᵀ V⁻¹ y

• The covariance matrix of the estimated parameters is:

   cov(θ̂) = (Hᵀ V⁻¹ H)⁻¹

and the covariance matrix of the estimated positions ŷ = Hθ̂ is

   cov(ŷ) = H (Hᵀ V⁻¹ H)⁻¹ Hᵀ

(see the sketch below)
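A minimal numpy sketch of this line fit (hypothetical data; equal measurement errors are assumed, so V is a multiple of the identity):

```python
import numpy as np

# Straight-line least squares: theta_hat = (H^T V^-1 H)^-1 H^T V^-1 y.
rng = np.random.default_rng(1)
a_true, b_true, sigma = 2.0, 1.0, 0.5

x = np.linspace(0.0, 10.0, 11)
y = a_true * x + b_true + rng.normal(0.0, sigma, x.size)

H = np.column_stack([x, np.ones_like(x)])    # rows (x_i, 1)
V_inv = np.eye(x.size) / sigma**2            # inverse covariance of y

cov_theta = np.linalg.inv(H.T @ V_inv @ H)   # covariance of (a, b)
theta_hat = cov_theta @ (H.T @ V_inv @ y)
print(theta_hat, np.sqrt(np.diag(cov_theta)))
```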

Pages 77–78

Statistics
[Histograms of the estimated intercept and of the estimated tangent of the inclination angle from a simulation of 10000 fitted lines. What are the true values?]

Page 79

Statistics
Histograms of the normalized residuals of the estimated parameters. This means that for each fitted line and each estimated parameter, the quantity (estimated parameter − true parameter)/(standard deviation of the parameter) is put into the histogram. If everything is OK with the fitting procedure, these histograms should have mean 0 and standard deviation 1.

[Two histograms: mean = −0.0189, std = 1.0038 and mean = 0.0157, std = 1.0011]

Page 80

Statistics
• Least-squares estimation is for instance used in track fitting in high-energy physics experiments.
• Track fitting is basically the same task as the line-fit example: estimating a set of parameters describing a particle track through a tracking detector, given a set of measurements created by the particle.
• In the general case the track model is not a straight line but rather a helix (homogeneous magnetic field) or some other trajectory obeying the equations of motion in an inhomogeneous magnetic field.
• The principles of the fitting procedure, however, are largely the same.

Page 81

Statistics
• As long as there is a linear relationship between the parameters and the measurements, the least-squares method is linear.
• If this relationship is a non-linear function F(θ), the problem is said to be of a non-linear least-squares type:

   S = (y − F(θ))ᵀ V⁻¹ (y − F(θ))

• There exists no direct solution to this problem, and one has to resort to an iterative approach (Gauss-Newton):
  – Start out with an initial guess of θ, linearize the function F around the initial guess by a Taylor expansion and solve the resulting linear least-squares problem
  – Use the estimated value of θ as a new expansion point for F and repeat the step above
  – Iterate until convergence (i.e. until θ changes less than a specified value from one iteration to the next)

Page 82

Statistics
• Relationship between maximum likelihood and least squares:
• Consider a set of independent measurements y with mean values F(x; θ).
• If these measurements follow a Gaussian distribution, the log-likelihood function is basically

   −2·ln L(θ) = Σᵢ₌₁ᴺ (yᵢ − F(xᵢ; θ))²/σᵢ²

plus some terms which do not depend on θ.
• Maximizing the log-likelihood function is in this case equivalent to minimizing the least-squares objective function.

Page 83

Statistics
• Confidence intervals and hypothesis tests.
• Confidence intervals:
  – Given a set of measurements of a parameter, calculate an interval such that one can be e.g. 95 % sure that the true value of the parameter lies within it
  – Such an interval is called a 95 % confidence interval for the parameter
• Example: collect N measurements believed to come from a Gaussian distribution with unknown mean value μ and known standard deviation σ. Use the sample mean to calculate a 100(1−α) % confidence interval for μ.
• From earlier: the sample mean is an unbiased estimator of μ with standard deviation σ/√N.
• For large enough N, the quantity

   Z = (X̄ − μ)/(σ/√N)

is distributed according to a standard normal distribution (mean value 0, standard deviation 1)

Page 84

Statistics
• Therefore:

   P(−z_{α/2} < (X̄ − μ)/(σ/√N) < z_{α/2}) = 1 − α
   P(−z_{α/2}·σ/√N < X̄ − μ < z_{α/2}·σ/√N) = 1 − α
   P(−z_{α/2}·σ/√N < μ − X̄ < z_{α/2}·σ/√N) = 1 − α
   P(X̄ − z_{α/2}·σ/√N < μ < X̄ + z_{α/2}·σ/√N) = 1 − α

• In words, there is a probability 1 − α that the true mean is in the interval

   [X̄ − z_{α/2}·σ/√N, X̄ + z_{α/2}·σ/√N]

• This interval is therefore a 100(1−α) % confidence interval for μ.
• Such intervals are highly relevant in physics analysis (see the sketch below).
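A minimal sketch of such an interval (hypothetical data; z_{0.025} = 1.96 is the 2.5 % quantile quoted on the quantile slides above):

```python
import numpy as np

# 95 % confidence interval for mu with known sigma:
# [x_bar - 1.96*sigma/sqrt(N), x_bar + 1.96*sigma/sqrt(N)].
rng = np.random.default_rng(1)
sigma, N = 2.0, 50
x = rng.normal(10.0, sigma, N)          # hypothetical measurements

z = 1.96                                # 2.5 % quantile of N(0, 1)
half_width = z * sigma / np.sqrt(N)
print(x.mean() - half_width, x.mean() + half_width)
```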

Page 85

Statistics
• Hypothesis tests:
• A hypothesis is a statement about the distribution of a vector x of data.
• Similar to the previous example:
  – given N measurements, test whether the measurements come from a normal distribution with a certain expectation value μ or not
  – define a test statistic, i.e. the quantity to be used in the evaluation of the hypothesis. Here: the sample mean.
  – define the significance level of the test, i.e. the probability that the hypothesis will be discarded even though it is true
  – determine the critical region of the test statistic, i.e. the interval(s) of values of the test statistic which will lead to the rejection of the hypothesis

Page 86

Statistics
• We then state two competing hypotheses:
  – A null hypothesis, stating that the expectation value is equal to a given value
  – An alternative hypothesis, stating that the expectation value is not equal to the given value
• Mathematically:

   H₀: μ = μ₀
   H₁: μ ≠ μ₀

• Test statistic:

   Z = (X̄ − μ₀)/(σ/√N)

Page 87

Statistics
[Figure: standard normal pdf with shaded two-sided critical region below −z_{α/2} and above z_{α/2}]

The probability of being in the shaded area is α. The shaded area is therefore the critical region of Z for significance level α. Obtain a value of the test statistic from the test data by calculating the sample mean and transforming to Z. Use the actual value of Z to determine whether the null hypothesis is rejected or not.

Page 88

Statistics
• Alternatively: perform the test by calculating the so-called p-value of the test statistic.
• Given the actual value of the test statistic, what is the area below the pdf for the range of values of the test statistic starting from the actual one and extending to all values further away from the value defined by the null hypothesis? This area defines the p-value.
  – For the current example this corresponds to adding two integrals of the pdf of the test statistic (because this is a so-called two-sided test):
    • one from minus infinity to minus the absolute value of the actual value of the test statistic
    • another from the absolute value of the actual value of the test statistic to plus infinity
  – For a one-sided test one would stick to one integral of the type above
• If the p-value is less than the significance level, discard the null hypothesis; if not, don't discard it (see the sketch below).
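A minimal sketch of the two-sided p-value for the Z test above, using the standard-library error function instead of normal-distribution tables (the observed value z_obs is hypothetical):

```python
from math import erfc, sqrt

# Two-sided p-value: p = P(|Z| >= |z_obs|) = 2*(1 - Phi(|z_obs|)),
# which equals erfc(|z_obs| / sqrt(2)) for a standard normal Z.
z_obs = 2.1
p_value = erfc(abs(z_obs) / sqrt(2))
print(p_value)   # ≈ 0.036 -> reject H0 at significance level alpha = 0.05
```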

Page 89

Statistics
• p-values can be used in so-called goodness-of-fit tests.
• In such tests one frequently uses a test statistic which is assumed to be chisquare distributed:
  – Is a measurement in a tracking detector compatible with belonging to a particle track defined by a set of other measurements?
  – Is a histogram with a set of entries in different bins compatible with an expected histogram (defined by an underlying assumption about the distribution)?
  – Are the residual distributions of the estimated parameters compatible with the estimated covariance matrix of the parameters?
• If one can calculate many independent values of the test statistic, the following procedure is often applied:
  – Calculate the p-value of the test statistic each time the test statistic is calculated

Page 90

Statistics
  – The p-value itself is also a random variable, and it can be shown that it is distributed according to a uniform distribution if the test statistic originates from the expected (chisquare) distribution.
  – Create a histogram with the various p-values as entries and see whether it looks reasonably flat
• NB! With only one calculated p-value, the null hypothesis can be rejected but never confirmed!
• With many calculated p-values (as immediately above) the null hypothesis can also (to a certain extent) be confirmed!
• Example: line fit (as before)
• For each fitted line, calculate the following chisquare:

   χ² = (θ̂ − θ)ᵀ·cov(θ̂)⁻¹·(θ̂ − θ)

Page 91

Statistics
• Here θ is the true value of the parameter vector.
• For each value of the chisquare, calculate the corresponding p-value
  – The integral of the chisquare distribution from the value of the chisquare to infinity
    • Given in tables or in standard computer programs (CERNLIB, CLHEP, MATLAB, …)
• Fill up a histogram with the p-values and make a plot:

[Histogram of p-values: reasonably flat, seems OK. What we really test here is that the estimated parameters are unbiased estimates of the true parameters, distributed according to a Gaussian with a covariance matrix as obtained in the estimate!]