Statistical methods in experimental physics Diego Tonelli CBPF Physics School Rio de Janeiro, 13—24 Jul, 2015



A short story

Leon Lederman is a living legend. In the golden age of HEP, the 1960s and ’70s, he carried out several of the key experiments that contributed to the birth of the Standard Model. In 1988, he received the Nobel Prize in Physics for the discovery of the muon neutrino.

In 1976, his group announced the observation of a new particle, produced by a beam of protons on a beryllium target and decaying into e+e- pairs, with a mass of about 6 GeV.

The “Oops-Leon” particle

This was published and provided a very strong candidate for the Upsilon, the bound state of a (not yet observed) fifth quark.

The experiment took more data and could not confirm the finding.

The erroneous first claim was later traced to a mistake in the statistical evaluation of the significance of the signal: the look-elsewhere effect was neglected (we will see this later on).

This, along with other “false discoveries” of that era, helped draw attention to the need for a proper education in basic statistics for HEP physicists.

[Figure: invariant ee mass distribution]

Goal

No matter how good a physicist you are, mistakes in the interpretation of data are always possible.

The goal of these lectures is to stimulate your curiosity about the statistical notions that will most likely be relevant for your future work, to provide a solid conceptual foundation on which you can build your own statistical education, and ultimately to provide a general framework that allows you to make the most of your measurements.

PS

A year later, in 1977, the same group found the real Upsilon meson, at 9.5 GeV, using muon pairs, and nobody cared too much about the 6 GeV fluke, which someone dubbed “Oops-Leon” in a pun on Lederman’s name.

Introduction

What?

The science of learning from the data obtained by counting or measuring the properties of populations of natural phenomena.

• Provides the navigation essentials for controlling the course of scientific and societal advances by measuring, controlling, and communicating uncertainty.

• Statisticians apply statistical thinking and methods to a wide variety of scientific, social, and business endeavors in such areas as astronomy, biology, education, economics, engineering, genetics, marketing, medicine, psychology, public health, and sports, among many others.

A broad subject. We will focus on some applications relevant for the purpose of experimental physics, in which statistics is mainly used to interpret and communicate the results of experiments.

Understanding nature from blurred observations

Measuring nature

We describe natural phenomena with models, expressed through mathematical relationships that depend on unknown free parameters.

When we make a measurement, we try to learn about the unknown parameters m by observing the values of observable quantities x (like the number of particles, their energies, etc.) that are somehow related to m.

The relation between x and m is always of a probabilistic nature: x is a random variable with some distribution p(x;m), given by the theory, that depends on the value of m. The width of the distribution p(x;m) and poor knowledge of its shape introduce uncertainty in our measurements.

Statistics helps in quantifying that uncertainty so that the information contained in data can be properly extracted and communicated.

Outline

• Data description, sample statistics, and standard theoretical distributions

• Probability and statistical inference. Bayesian vs frequentist inference. Theory of estimators. Pdf vs likelihood. Maximum likelihood estimator and its properties.

• Confidence interval estimation and limits. Ordering criteria. Likelihood ratio theorem. Likelihood-ratio ordering and its properties. Systematic uncertainties. Profile-likelihood ratio. The Bayesian approach.

• Hypothesis testing. Properties of tests. Neyman-Pearson lemma. Goodness of fit. Significance.

• Advanced topics.

Caveats

I am not a professional statistician, nor have I made any original contribution to that field. I am just an enthusiastic practitioner, educated by 15+ years of HEP data analysis with my former supervisor (himself an expert who did make original contributions to statistics), who first got me interested in the subject 15 years ago.

Data analysis is often learned through apprenticeship. Different disciplines, different subfields within the same discipline, and even different groups within the same field develop their own analysis subculture. These cultural conventions impact formal aspects, like jargon, but also substantial aspects, like data analysis practices and decision making. This course is necessarily biased toward my collider-physics background. Hopefully it will also be useful for those of you who are not in HEP.

Please feel free to ask me any question during the lectures or at [email protected]

I will put slides and any relevant material on www.pi.infn.it/~dtonelli/StatLectures

These lectures

Won’t be a cookbook. There are already plenty of how-to’s online. They usually cover the vanilla cases, but when something in your work requires you to stop for a second and think about statistics, you are probably no longer in a standard case.

Convey some fundamental concepts in a fairly general framework. Hope to give you the tools to find your own path toward identifying and addressing the statistical issues you will encounter.

Will go slowly — no point if you don’t follow the logic. And interrupt me with questions: we will all learn and occasional interruptions make lectures less boring.

Tried to compose fairly detailed slides. Should allow following the logic offline too and serve as reference (don’t have a write-up). Please let me know of mistakes.

Subject is broad and I won’t touch on many topics that are nevertheless important.

Further readings

G. Cowan — good starting point (rookies)

F. James — very good and complete (intermediates)

Stuart, Ord, and Arnold — the ultimate reference (experts)

Also: R.J. Barlow, A Guide to the use of Statistical Methods in the Physical Sciences, Wiley (1989); L. Lyons, Statistics for Nuclear and Particle Physics, CUP (1986); I. Narsky and F. C. Porter, Statistical Analysis Techniques in Particle Physics, Wiley (2014); S. Brandt, Statistical and computational methods in data analysis, Springer (1998)

Further lectures

I stole a lot of material from several excellent sets of lectures. Some of these can be found online and I encourage you to look at those as well.

K. Cranmer: https://indico.cern.ch/event/243641/material/slides/0.pdf

G. Cowan: http://www.pp.rhul.ac.uk/~cowan/stat_cern.html

H. Prosper: http://indico.cern.ch/event/358542/

L. Lyons: http://indico.cern.ch/event/a063350/

G. Punzi, B. Cousins, J. Rademacker, etc…

Describing data

Random variables and their features

Fundamental notions

A random event is an event that has more than one possible outcome. The outcome cannot be predicted deterministically, but a probability for each outcome is known.

A random event can be associated with one or more variates (or random variables; in HEP typically just “variables”) x, which take different values corresponding to the different possible outcomes. The possible values of x are associated with probabilities p(x), which form a distribution, the probability distribution of x.

A population (or group, or aggregate) is a collection of random events.

We wish to quantify the collective properties of the population, not those of specific individual elements of the sample. The sample of events generates a corresponding population of variable values, which are the object of study.

For samples larger than a few elements, frequency distributions allow for a much easier grasp of their significance and population properties.

Population in physics

Population: a straightforward notion in medicine, veterinary science, industrial quality control, social sciences, and so on: “the population of all adult males between 25 and 35 years of age”, “the population of workers who will retire in the next 2 years”, “the population of cans produced between June 1 and July 1”.

What does population mean in experimental physics? Can we talk about the “population of Higgs boson decays”? or “population of trapped atoms”?

An abstraction is necessary.

In physics, population identifies the hypothetical infinite set of repeated independent and identical experiments.

Distributions observed in data are interpreted as finite-size random samples drawn from an infinitely populated “parent distribution” that represents the true distribution of the observable of interest.

Parent distribution

[Figure: the parent distribution and the finite samples drawn from it by expt #1, expt #2, expt #3, ..., expt #N-1, expt #N]

You do it every day

Most of you are probably familiar with the practice of quoting √N as the uncertainty in a counting experiment that observes N counts. E.g., in a histogram, a bin with N entries has an error bar of length √N.

What exactly does that bar mean?

That I am uncertain whether I counted N events in that bin of my sample?

No. I am pretty sure that my sample has exactly N events falling in that bin.

The bar represents the fluctuations in the counts of that bin that one could expect if the experiment were repeated. That is, the fluctuations between samples of finite size drawn from the same parent distribution.
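This interpretation of the √N bar can be checked with a toy simulation. The sketch below (Python, standard library only) repeats a hypothetical counting experiment many times and compares the spread of the counts in one bin with the square root of their mean. The function name, the number of events, and the bin probability are illustrative assumptions, not taken from the lectures.

```python
import math
import random
import statistics

def repeated_bin_counts(n_events, p_bin, n_experiments, seed=1):
    """Count how many of n_events fall in one bin, for many repeated experiments."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_experiments):
        # Each event independently lands in the bin with probability p_bin.
        counts.append(sum(1 for _ in range(n_events) if rng.random() < p_bin))
    return counts

counts = repeated_bin_counts(n_events=2000, p_bin=0.05, n_experiments=500)
mean_count = statistics.mean(counts)
spread = statistics.pstdev(counts)   # fluctuation of the bin content between experiments
print(f"mean = {mean_count:.1f}, spread = {spread:.2f}, sqrt(mean) = {math.sqrt(mean_count):.2f}")
```

In this toy the bin content is binomial, so the spread is slightly smaller than √N; the two agree when the bin probability is small.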

Sample

Not very informative.

Assuming that all observations are equivalent, the individual sequence does not matter and all relevant information is contained in the frequency of each outcome.

Frequencies

Better. Still not very intuitive.

Frequency distribution

Much better. Offers an immediate visual feel for the “shape”, the “localization”, and the “dispersion” of the data.
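The progression from raw sample to frequencies to frequency distribution can be sketched in a few lines. The sample values below are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter

def frequency_table(sample):
    """Map each distinct outcome to the number of times it occurs."""
    return dict(sorted(Counter(sample).items()))

def text_histogram(freq):
    """Render a frequency table as rows of '#' characters."""
    return "\n".join(f"{value:>3} | {'#' * count}" for value, count in freq.items())

sample = [3, 5, 4, 4, 6, 5, 4, 3, 5, 4, 7, 4, 5, 6, 4]  # made-up outcomes
freq = frequency_table(sample)
print(freq)                  # frequencies: better than the raw list
print(text_histogram(freq))  # frequency distribution: shape, location, dispersion at a glance
```

The order of the entries in the sample is discarded, which is harmless only under the assumption stated earlier that all observations are equivalent.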

Binning: short aside

Some feel strongly about rules for choosing the bin width. Not that relevant as long as event counts remain O(10) or greater in most of the interesting region of the distribution (O(10) so that the Poisson distribution is well approximated by a Gaussian, see later).

More important: binning is a data reduction and as such it induces a loss of information. In continuous data, the values of the variables for each entry are known up to the native precision of the apparatus.

In binned data, all entries with values within the range of a bin are collectively filed in that bin. In all subsequent manipulations they are treated as if they have the same value (corresponding to the center of the bin).

That’s why attributing additional uncertainty due to changes in binning is wrong. Changes in the results are due to the method of data reduction and as such included in the statistical uncertainty. Adding uncertainty leads to double counting.

Sample statistics

Hard to do any serious analysis by just staring at distributions.

Need to get more quantitative. A few simple quantities can be calculated from the available data only (they do not depend on parameters) and encapsulate quantitative information on the “location” or “central value” of a distribution into a few numbers.

Sample mode: value of the variable for which the population is largest.

Sample median: value of the variable such that half of the sample has larger values and half has smaller values.

Sample mean: arithmetic average of the values of the variable across the sample.

Sample mean

Simple and most common quantity if one wants to summarize the distribution information into a single number.

For a sample of N events, each associated with a value x_i of the variable and binned into a histogram with n bins, the sample mean is

Unbinned sample mean

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Binned sample mean (x_j is the center of bin j and n_j its content)

$$\bar{x} = \frac{1}{N}\sum_{j=1}^{n} x_j n_j$$

Linear:

$$\overline{\alpha x + y} = \alpha\bar{x} + \bar{y}$$
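The two definitions of the sample mean can be compared on toy data. The sketch below assumes uniformly distributed values in [0, 10] and 20 equal-width bins; these choices, and the function names, are illustrative only. It also shows the information loss from binning: every entry is treated as if it sat at its bin center.

```python
import random

def unbinned_mean(values):
    return sum(values) / len(values)

def binned_mean(values, low, high, n_bins):
    """Sample mean computed from bin centers x_j and bin counts n_j."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        j = min(int((v - low) / width), n_bins - 1)  # put v == high in the last bin
        counts[j] += 1
    centers = [low + (j + 0.5) * width for j in range(n_bins)]
    return sum(x * n for x, n in zip(centers, counts)) / len(values)

rng = random.Random(7)
values = [rng.uniform(0.0, 10.0) for _ in range(1000)]  # toy data, assumed to lie in [0, 10]
print(unbinned_mean(values), binned_mean(values, 0.0, 10.0, 20))
```

The two results can differ by at most half a bin width, since no entry is farther than that from its bin center.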

Variance

The mean says nothing about the dispersion of the data, which is another key piece of information for grasping the features of a population.

Use the variance: the average of the squared difference from the mean

$$V(x) = \overline{(x - \bar{x})^2} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

Easier to remember: the mean of the squares minus the square of the mean

$$V(x) = \overline{x^2} - \bar{x}^2$$

The square root of the variance is the standard deviation, σ = √V(x), which is typically used as a standard measure of spread.
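A quick numerical check that the two expressions for the variance agree, using the 1/N normalization of the formulas above (not the 1/(N-1) variant). The data values are made up, and the function names are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance_from_definition(values):
    """Average of the squared differences from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def variance_shortcut(values):
    """Mean of the squares minus the square of the mean."""
    return mean([v * v for v in values]) - mean(values) ** 2

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
sigma = math.sqrt(variance_from_definition(data))
print(variance_from_definition(data), variance_shortcut(data), sigma)
```

The shortcut form is convenient on paper, but it subtracts two large numbers when the mean is much larger than the spread, so the definition is usually the safer choice in floating-point code.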

Multiple dimensions

In general, more than one variable is associated with each random event.

Take two variables (easy to generalise further): each of N statistical experiments observes a pair of numbers {(x1, y1), (x2, y2), …, (xN, yN)}

The sample mean and variance are easily generalized to estimate the location and dispersion of the sample along each axis of the multidimensional space.

An additional useful concept quantifies information about the relation between dispersions along different axes.

Covariance and correlation

$$\mathrm{Cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$

Easier to remember: the mean of the product minus the product of the means

$$\mathrm{Cov}(x, y) = \overline{xy} - \bar{x}\,\bar{y}$$

Covariance has units, so its value depends on the choice of units. Better to use a unitless quantity, the Pearson linear correlation

$$\rho(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{V(x)}\sqrt{V(y)}} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$$

In N-dimensional data, define the covariance matrix

$$V_{ij} = \mathrm{Cov}(x^{(i)}, x^{(j)})$$

and its associated correlation matrix

$$\rho_{ij} = \frac{V_{ij}}{\sigma_i \sigma_j}$$
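A minimal sketch of sample covariance and Pearson correlation, with the 1/N normalization. The numbers are made up; the second data set is the first one expressed in different units, to show that the covariance changes with the units while the correlation does not.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Mean of the product minus the product of the means (1/N normalization)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def covariance_matrix(columns):
    """V_ij = Cov(x^(i), x^(j)) for a list of equally long columns."""
    return [[covariance(a, b) for b in columns] for a in columns]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # roughly 2 * x, made-up numbers
ys_cm = [100.0 * y for y in ys]      # same quantity in different units
print(covariance(xs, ys), covariance(xs, ys_cm))    # covariance scales with the units
print(correlation(xs, ys), correlation(xs, ys_cm))  # correlation does not
print(covariance_matrix([xs, ys]))
```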

Aside: correlation and dependence


Correlation and dependence between variables are often confused. Let’s set the record straight.

• Two variables x and y are (linearly) uncorrelated if ρ(x,y) = 0

• Two variables x and y are statistically independent if their two-dimensional distribution f(x,y) can be factorized into the product f(x,y) = g(x) h(y). The shape of the distribution of one variable does not depend on the value of the other variable. In other words, information from one variable does not carry information on the other.

• Variables that are independent are also uncorrelated.

• Variables that are uncorrelated may still be dependent

Aside: correlation strength and sign

Note: correlation says nothing about the “slope”

Aside: dependence

In all of these samples, the correlation is zero. But the two variables are clearly not independent.

Understanding dependences in multivariate samples of data is important. For instance, in likelihood fits, failure to identify dependence between observables leads to wrong results.
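A toy case of variables that are uncorrelated but clearly dependent: x symmetric around zero and y = x². This particular choice is an illustrative assumption, not one of the samples shown in the lectures. The correlation vanishes, yet the mean of y changes strongly between slices of x.

```python
import math

def mean(values):
    return sum(values) / len(values)

def correlation(xs, ys):
    cov = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
    var_x = mean([x * x for x in xs]) - mean(xs) ** 2
    var_y = mean([y * y for y in ys]) - mean(ys) ** 2
    return cov / math.sqrt(var_x * var_y)

# x is symmetric around zero and y is fully determined by x.
xs = [i / 10.0 for i in range(-50, 51)]
ys = [x * x for x in xs]
rho = correlation(xs, ys)

# "Slices": the mean of y depends strongly on which x range is selected.
inner = mean([y for x, y in zip(xs, ys) if abs(x) < 1.0])
outer = mean([y for x, y in zip(xs, ys) if abs(x) > 4.0])
print(f"rho = {rho:.3f}, <y> inner = {inner:.2f}, <y> outer = {outer:.2f}")
```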

Aside: Testing for correlation and dependence

To test for dependence, plot the distribution of one variable “in slices” of the other, and check that the slices overlap.

Testing for correlations is easy: just compute the correlation coefficients and make sure they are consistent with zero.

If you see a correlation, then the variables are certainly dependent. If you don’t see a correlation, you may still need to check for dependence.

Aside: causality

Often correlations are used to motivate causality: causes of phenomena are what is relevant to “understand what’s going on” and build scientific evidence.

This is a sensitive business. Statistics won’t tell you much about causality. Any statement of causality is necessarily associated with some degree of arbitrariness on the part of the analyzers. (Physics relies on established laws that help in evaluating the plausibility of causal connections. In social sciences, speculations on causality based on observed correlations can get much wilder.)

When two phenomena A and B appear to be correlated, it is hard to find out in which of the following cases you are:

• A causes B

• B causes A

• A third phenomenon C causes both A and B

• Correlation is just a coincidence

Aside: A causes B (or B causes A?)

Aside: A third phenomenon C causes both A and B

Warm temperatures push people to buy more ice-creams, and also to spend more time outside and party, increasing chances that members of opposing gangs meet and get violent on turf or drug-dealing issues.

NYC study, late 1980s

Aside: spurious correlations

Data: US Department of Agriculture and Center for Disease Control and Prevention. Plot: tylervigen.com

Aside: spurious correlations

Data: National Vital Statistics Reports and US Department of Agriculture. Plot: tylervigen.com

Describing data

Theory of distributions

Frequency distributions

Most frequency distributions in experimental science are highly regular.

This suggests that frequency distributions can be approximated by smooth curves parametrized by simple mathematical expressions.

(Think of bringing the number of observations to infinity and the bin width to zero, while maintaining unit area.)

These would approximate the “theoretical” probability functions. We have not yet defined what probability is; let’s use the intuitive idea as a working approximation for the moment.

[Figure: histograms becoming smoother as the number of observations increases]

Probability density function

Choose a short range Δx of the variable, centered on x. The local frequency of events is approximated by f(x)Δx.

As Δx→0, the probability that x is contained in the range between x - dx/2 and x + dx/2 is

$$dF = f(x)\,dx$$

f(x) is the probability density function. It is not a probability: it has units of x^{-1}. It is normalized to unity.

$$F(x) = \int_{-\infty}^{x} f(x')\,dx'$$

is the cumulative distribution function, which expresses the probability that the variable takes a value between -∞ and x.

[Figure: f(x) with a strip of width Δx and area f(x)Δx, and the corresponding F(x)]

Characterizing the pdf

The pdf can be used as a weight to obtain the average value of any function g(x) of the random variable.

Expectation value of g

$$\langle g(x)\rangle = E[g(x)] = \int g(x) f(x)\,dx$$

In analogy with what was done for data distributions, theoretical pdfs can be characterized by a few numbers that provide quantitative information on their location and dispersion.

The expectation value of x is the mean of f(x)

$$\langle x\rangle = E[x] = \int x f(x)\,dx$$

The expectation value of (x - E[x])^2 is the variance of f(x)

$$V(x) = \langle x^2\rangle - \langle x\rangle^2 = E[x^2] - E^2[x] = \int (x - \langle x\rangle)^2 f(x)\,dx$$
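These integrals can be evaluated numerically for a concrete pdf. The sketch below assumes an exponential pdf f(x) = exp(-x/τ)/τ with an arbitrary τ = 2, for which the mean is τ and the variance is τ². The midpoint-rule integrator and the truncation of the upper limit are simplifications chosen for illustration.

```python
import math

def integrate(func, low, high, n_steps=100_000):
    """Midpoint-rule approximation of the integral of func over [low, high]."""
    step = (high - low) / n_steps
    return step * sum(func(low + (k + 0.5) * step) for k in range(n_steps))

def expectation(g, pdf, low, high):
    """E[g(x)] = integral of g(x) f(x) dx over the support of the pdf."""
    return integrate(lambda x: g(x) * pdf(x), low, high)

tau = 2.0  # illustrative parameter value

def pdf(x):
    return math.exp(-x / tau) / tau  # exponential pdf on x >= 0

upper = 60.0 * tau  # truncation point; the tail beyond it is negligible

norm = expectation(lambda x: 1.0, pdf, 0.0, upper)   # normalization, should be 1
mean_x = expectation(lambda x: x, pdf, 0.0, upper)   # should be tau
var_x = expectation(lambda x: (x - mean_x) ** 2, pdf, 0.0, upper)  # should be tau**2
print(norm, mean_x, var_x)
```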

Aside: the oddball Cauchy distribution

These properties cannot be defined for every pdf.

E.g., the Cauchy distribution, which HEP physicists know well under the name of Breit-Wigner,

$$f(x; \Gamma, x_0) = \frac{1}{\pi}\,\frac{\Gamma/2}{\Gamma^2/4 + (x - x_0)^2}$$

has undefined mean and infinite variance.
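A toy illustration of this odd behavior, assuming the Breit-Wigner form above with illustrative values Γ = 1 and x0 = 10. Samples are generated by inverting the cumulative distribution. Running it shows that the sample mean does not settle as the sample grows, while the sample median stays close to x0.

```python
import math
import random
import statistics

def breit_wigner_pdf(x, gamma, x0):
    return (1.0 / math.pi) * (gamma / 2.0) / (gamma ** 2 / 4.0 + (x - x0) ** 2)

def sample_breit_wigner(rng, gamma, x0):
    """Inverse-cdf sampling: x = x0 + (gamma/2) tan(pi (u - 1/2)), u uniform in [0, 1)."""
    return x0 + 0.5 * gamma * math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(3)
gamma, x0 = 1.0, 10.0  # illustrative width and peak position
for n in (100, 10_000, 100_000):
    xs = [sample_breit_wigner(rng, gamma, x0) for _ in range(n)]
    # The mean is dominated by rare, very large values; the median is not.
    print(n, sum(xs) / n, statistics.median(xs))
```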

Functions of random variables

Functions of random variables are themselves random variables. Take f(x) as the pdf of the random variable x and y(x) a function of x (e.g., a change of variables).

The pdf g(y) for y(x) is obtained by imposing the conservation of probability in the two metrics. (Suppose I flip a coin several times and define the variable n, the number of heads, and the variable m, the number of heads squared. It is obvious that the probability of getting m=4 should equal that of getting n=2.)

$$P(x_a < x < x_b) = \int_{x_a}^{x_b} f(x)\,dx = \int_{y(x_a)}^{y(x_b)} g(y)\,dy = P(y(x_a) < y < y(x_b))$$

Because

$$\int_{y(x_a)}^{y(x_b)} g(y)\,dy = \int_{x_a}^{x_b} g(y(x))\left|\frac{dy}{dx}\right| dx$$

therefore

$$f(x) = g(y)\left|\frac{dy}{dx}\right|$$

Exercise: derive g(y) for uniform f(x) = 1/(2π) and y(x) = sin(x)
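The conservation of probability can be checked with a toy simulation. To leave the exercise untouched, the sketch below uses a different, simpler case chosen for illustration: x uniform in (0, 1) and y = x², for which f(x) = g(y)|dy/dx| gives g(y) = 1/(2√y).

```python
import math
import random

def g_of_y(y):
    """pdf of y = x**2 for x uniform in (0, 1): g(y) = f(x) / |dy/dx| = 1 / (2 sqrt(y))."""
    return 1.0 / (2.0 * math.sqrt(y))

def prob_y_between(a, b):
    """Integral of g(y) from a to b, done analytically: sqrt(b) - sqrt(a)."""
    return math.sqrt(b) - math.sqrt(a)

rng = random.Random(11)
ys = [rng.random() ** 2 for _ in range(200_000)]  # transform uniform samples
a, b = 0.04, 0.25
observed = sum(1 for y in ys if a < y < b) / len(ys)
# Both numbers should be close to P(0.2 < x < 0.5) for the uniform x.
print(observed, prob_y_between(a, b))
```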

A special case: probability integral transform

If x is continuous and f(x) is its pdf, consider the special change of variables that transforms x into its cumulative

$$y(x) = \int_{-\infty}^{x} f(x')\,dx'$$

Using the relation

$$f(x) = g(y)\left|\frac{dy}{dx}\right|$$

one gets

$$\left|\frac{dy}{dx}\right| = f(x)$$

which results in g(y) = 1.

Any continuous distribution can be transformed into a uniform distribution.
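A toy check of the probability integral transform, assuming an exponential pdf with an arbitrary τ = 2, whose cumulative is F(x) = 1 - exp(-x/τ). Applying F to exponentially distributed samples should give values uniform in (0, 1).

```python
import math
import random

def exponential_cdf(x, tau):
    return 1.0 - math.exp(-x / tau)

rng = random.Random(5)
tau = 2.0  # illustrative parameter value
xs = [rng.expovariate(1.0 / tau) for _ in range(100_000)]  # exponential samples, mean tau
ys = [exponential_cdf(x, tau) for x in xs]                 # y = F(x)

# If y is uniform in (0, 1), ten equal-width bins should be about equally populated.
bin_fractions = [0.0] * 10
for y in ys:
    bin_fractions[min(int(y * 10), 9)] += 1.0 / len(ys)
print([round(f, 3) for f in bin_fractions])
```

The same relation, read backwards, is the basis of inverse-cdf sampling: applying the inverse of F to uniform random numbers produces samples distributed according to f(x).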

Variances of functions of random variables (a.k.a. “propagation of errors”)

Often one is interested in knowing the variance of the pdf of a function of a random variable, given the variance of the random variable.

Take a linear example: y(x) = a x + b

Easy: the standard deviation of y(x) is

$$\sigma_y = \left|\frac{dy}{dx}\right|\sigma_x = |a|\,\sigma_x$$

[Figure: the line y(x) = a x + b; the interval from x0 - σx to x0 + σx on the x axis maps onto the interval from y - σy = y(x0 - σx) to y + σy = y(x0 + σx) on the y axis]

Variances of functions of random variables (cont’d)

Can linearize any non-linear y(x) if you are close enough to the point x0

$$y(x) \approx y(x_0) + \left.\frac{dy}{dx}\right|_{x_0}(x - x_0)$$

[Figure: the curve y(x) and its tangent line at x0; the interval x0 ± σx maps onto y(x0) ± σy]

Variances of functions of random variables (1D)

Definition of variance

$$V(y) = \langle y^2(x)\rangle - \langle y(x)\rangle^2$$

Replace with linearization

$$\approx \left\langle\left(y(x_0) + (x - x_0)\frac{dy}{dx}\right)^2\right\rangle - \left\langle y(x_0) + (x - x_0)\frac{dy}{dx}\right\rangle^2$$

Do the algebra

$$= \left(\frac{dy}{dx}\right)^2\left(\langle x^2\rangle - \langle x\rangle^2\right) = \left(\frac{dy}{dx}\right)^2 V(x)$$
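The linearized result σy = |dy/dx| σx can be compared with a toy simulation. The sketch below assumes a Gaussian x and the function y = x², both chosen only for illustration, and looks at one case where σx is small compared with the scale over which y(x) bends and one where it is not.

```python
import math
import random

def propagate_1d(dydx_at_x0, sigma_x):
    """Linearized propagation: sigma_y = |dy/dx| sigma_x."""
    return abs(dydx_at_x0) * sigma_x

def toy_sigma_y(func, x0, sigma_x, n=100_000, seed=2):
    """Spread of func(x) for Gaussian x, estimated from a toy simulation."""
    rng = random.Random(seed)
    ys = [func(rng.gauss(x0, sigma_x)) for _ in range(n)]
    m = sum(ys) / n
    return math.sqrt(sum((y - m) ** 2 for y in ys) / n)

def square(x):
    return x * x

# Small sigma_x: y is nearly linear over the range probed by x.
linear_small, toy_small = propagate_1d(2 * 10.0, 0.1), toy_sigma_y(square, 10.0, 0.1)
# Large sigma_x: the linearization is visibly off.
linear_large, toy_large = propagate_1d(2 * 1.0, 1.0), toy_sigma_y(square, 1.0, 1.0)
print(linear_small, toy_small)
print(linear_large, toy_large)
```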

Variances of functions of random variables

$$y(x_1, x_2) \approx y(x_{1,0}, x_{2,0}) + \left.\frac{\partial y}{\partial x_1}\right|_{x_{1,0}}(x_1 - x_{1,0}) + \left.\frac{\partial y}{\partial x_2}\right|_{x_{2,0}}(x_2 - x_{2,0})$$

$$V(y) = \langle y^2\rangle - \langle y\rangle^2 \approx \left(\frac{\partial y}{\partial x_1}\right)^2_{x_{1,0}} V(x_1) + \left(\frac{\partial y}{\partial x_2}\right)^2_{x_{2,0}} V(x_2) + 2\,\frac{\partial y}{\partial x_1}\frac{\partial y}{\partial x_2}\,\mathrm{Cov}(x_1, x_2)$$

Can obviously be extended to functions of N variables.

Important reminders

- linearized formulas are exact only if y(x⃗) is linear. They fail if the function is nonlinear over a range comparable in size to σxi

- linearized formulas apply for any pdf of the xi variables.
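A sketch of the two-variable formula, including the covariance term, checked against a toy simulation. The function y = x1 x2, the Gaussian model, and all numerical values are assumptions made for illustration. Note that the derivatives enter the covariance term with their signs.

```python
import math
import random

def propagate_2d(d1, d2, var1, var2, cov12):
    """Linearized variance of y(x1, x2) from the derivatives d1 = dy/dx1, d2 = dy/dx2."""
    return d1 ** 2 * var1 + d2 ** 2 * var2 + 2.0 * d1 * d2 * cov12

def toy_variance_of_product(mu1, mu2, s1, s2, rho, n=200_000, seed=4):
    """Variance of y = x1 * x2 for correlated Gaussian x1, x2, from a toy simulation."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu1 + s1 * z1
        x2 = mu2 + s2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)  # correlation rho with x1
        ys.append(x1 * x2)
    m = sum(ys) / n
    return sum((y - m) ** 2 for y in ys) / n

mu1, mu2, s1, s2, rho = 10.0, 5.0, 0.2, 0.1, 0.5  # illustrative values
# For y = x1 * x2: dy/dx1 = x2 and dy/dx2 = x1, evaluated at the central values.
linear = propagate_2d(mu2, mu1, s1 ** 2, s2 ** 2, rho * s1 * s2)
toy = toy_variance_of_product(mu1, mu2, s1, s2, rho)
print(linear, toy)
```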

Convenient in practice

• Absolute uncertainties add in quadrature for the sum, or difference, of uncorrelated variables.

• Relative uncertainties add in quadrature for the product, or ratio, of uncorrelated variables.

• Correlations do change things dramatically. E.g., the difference of two quantities that have equal uncertainties and 100% correlation has zero uncertainty.
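These rules of thumb follow from the linearized formula and can be written as two small helpers. The function names and the optional correlation argument are illustrative choices; the product and ratio version assumes positive central values.

```python
import math

def sigma_sum_or_difference(sigma_a, sigma_b, rho=0.0, sign=+1):
    """Uncertainty on a + b (sign=+1) or a - b (sign=-1) with correlation rho."""
    variance = sigma_a ** 2 + sigma_b ** 2 + 2.0 * sign * rho * sigma_a * sigma_b
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding errors

def relative_sigma_product_or_ratio(rel_a, rel_b, rho=0.0, sign=+1):
    """Relative uncertainty on a * b (sign=+1) or a / b (sign=-1), for positive a and b."""
    variance = rel_a ** 2 + rel_b ** 2 + 2.0 * sign * rho * rel_a * rel_b
    return math.sqrt(max(variance, 0.0))

print(sigma_sum_or_difference(3.0, 4.0))                    # uncorrelated: quadrature sum
print(sigma_sum_or_difference(2.0, 2.0, rho=1.0, sign=-1))  # fully correlated difference
print(relative_sigma_product_or_ratio(0.03, 0.04))          # 3% and 4% relative uncertainties
```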

Parametrized probability density functions

So far we have only discussed f(x), but most of the time, in experimental measurements, we’ll deal with pdfs that contain some parameter m of the assumed theoretical model:

$$f(x; m)$$

which indicates the “pdf of observable x, which depends on parameter m”.

Generalization to a multidimensional space of observables and of model parameters is natural, and it is the case most frequently encountered in physics.

$$f(\vec{x}; \vec{m}) = f(x_1, x_2, ..., x_n; m_1, m_2, ..., m_k)$$

Joint, conditional, marginal

f(x1, x2; m) is the joint pdf. It contains the whole available information and is related to the probability that x1 and x2 simultaneously take values within certain ranges.

f(x1 | x2; m) is the conditional pdf. It is related to the probability that x1 is in a certain range given that x2 has a well-defined value.

∫ f(x1, x2; m) dx2 is the marginal pdf. It is related to the probability that x1 is in a certain range regardless of the value of x2.

These notions obviously generalize to the n-dimensional pdf f(x1, x2, …, xn)

Conditional and marginal - in practice

f(x1 | x2; m) is the conditional pdf.

“Plot one variable in slices of the other”

∫ f(x1, x2; m ) dx2 is the marginal pdf.

“Project the distribution onto one variable or the other”
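“Slicing” and “projecting” are easy to see on a discrete analogue of the joint pdf. The table below is made up, and replacing the integrals by sums over a grid is a simplification for illustration.

```python
# Toy joint distribution on a 3 x 2 grid: joint[i][j] = P(x1 = i, x2 = j). Made-up numbers.
joint = [
    [0.10, 0.20],
    [0.25, 0.15],
    [0.05, 0.25],
]

def marginal_x1(joint):
    """Project onto x1: sum the joint over all values of x2."""
    return [sum(row) for row in joint]

def marginal_x2(joint):
    """Project onto x2: sum the joint over all values of x1."""
    return [sum(col) for col in zip(*joint)]

def conditional_x1_given_x2(joint, j):
    """Slice at x2 = j and renormalize so that the slice sums to one."""
    column = [row[j] for row in joint]
    total = sum(column)
    return [p / total for p in column]

print(marginal_x1(joint), marginal_x2(joint))
print(conditional_x1_given_x2(joint, 0), conditional_x1_given_x2(joint, 1))
```

Here the two slices have different shapes, so in this toy table x1 and x2 are not independent.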

Remember independence?

If (and only if) a joint multivariate pdf can be factorized into the product of the individual pdfs

$$f(x_1, x_2; m) = f_1(x_1; m)\,f_2(x_2; m)$$

then the variables are independent. Information on one does not add any information on the other.

Aside: statisticians use a more general notion of “correlation” that is more closely linked to the notion of statistical independence, the mutual information

$$I(x, y) = \int_Y \int_X f(x, y)\,\log\left(\frac{f(x, y)}{f_x(x)\,f_y(y)}\right) dx\,dy$$

where f_x(x) and f_y(y) are the distributions of x and y marginalised over the other variable. Not very popular in physics, but it may be useful. It is symmetric under the interchange of x and y, and is zero if and only if x and y are independent.
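A discrete sketch of the mutual information, with sums over a grid in place of the integrals. Both probability tables are made up for illustration: one factorizes exactly into its marginals, the other does not.

```python
import math

def mutual_information(joint):
    """I(x, y) = sum over cells of p(x, y) * log(p(x, y) / (p_x(x) p_y(y))), in nats."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # cells with zero probability contribute nothing
                total += p * math.log(p / (p_x[i] * p_y[j]))
    return total

independent = [[0.12, 0.28], [0.18, 0.42]]  # outer product of (0.4, 0.6) and (0.3, 0.7)
dependent = [[0.45, 0.05], [0.05, 0.45]]    # made-up, strongly dependent table
print(mutual_information(independent), mutual_information(dependent))
```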