____________________________________________________________________________ IUT de Saint-Etienne – Département TC –J.F.Ferraris – Math – S3 – ProbDist – Lessons – Rev2020 – page 1 / 20
SALES AND MARKETING Department
MATHEMATICS
3rd Semester
________ Probability distributions ________
LESSONS
Online document: http://jff-dut-tc.weebly.com, section DUT Maths S3.
TABLE OF CONTENTS
INTRODUCTION AND HISTORY 3
LESSONS 5
1 DISCRETE PROBABILITY DISTRIBUTIONS .................................................................................................... 5
1.1 GENERAL CASE: REMINDERS 5
1.2 HYPERGEOMETRIC DISTRIBUTION 6
1.3 BINOMIAL DISTRIBUTION 7
1.4 POISSON'S DISTRIBUTION 8
2 A CONTINUOUS PROBABILITY DISTRIBUTION: THE NORMAL LAW ............................................................ 9
2.1 CONVERGENCE OF DISCRETE LAWS 9
2.2 CONTINUOUS REAL RANDOM VARIABLE 10
2.3 THE NORMAL LAW (OR LAPLACE'S LAW) 11
3 SAMPLING DISTRIBUTIONS ..................................................................................................................... 14
3.1 INTRODUCTION 14
3.2 RANDOM SAMPLING 14
3.3 SAMPLING DISTRIBUTION OF MEANS 14
3.4 SAMPLING DISTRIBUTION OF PROPORTIONS 15
4 ESTIMATES (STATISTICAL INFERENCE) ..................................................................................................... 16
4.1 POINT ESTIMATE 16
4.2 ESTIMATE BY A CONFIDENCE INTERVAL 16
5 STATISTICAL HYPOTHESIS TESTING ......................................................................................................... 17
5.1 ADEQUACY χ² TEST (PEARSON'S TEST) 17
5.2 CONFORMANCE TEST OF A MEAN, OF A PROPORTION 18
5.3 THE RISKS (NON REQUIRED) 19
____________________________________________________________________________ IUT de Saint-Etienne – Département TC –J.F.Ferraris – Math – S3 – ProbDist – Lessons – Rev2020 – page 3 / 20
INTRODUCTION AND HISTORY
A quick story of the normal law
In the late XVIIth century, Jakob Bernoulli worked out the binomial law, calculating the chances of success when a given experiment is performed several times. His manual calculations became horribly complicated for big numbers, due to the calculation of factorials. In the first half of the XVIIIth century, Abraham de Moivre worked on the calculus of chances and discovered a formula that gives (approximately) the factorial of a natural number.
Stirling-Moivre formula: n! ≈ √(2πn) · (n/e)^n
(with n > 8, deviation < 1 %; as n increases, the percentage of deviation decreases)
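As a quick numerical check (not part of the original lesson), the approximation and its shrinking deviation can be observed in a few lines of Python:

```python
import math

def stirling(n):
    # Stirling-de Moivre approximation: n! ≈ sqrt(2·pi·n) · (n/e)^n
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

# The relative deviation stays below 1 % for n > 8 and shrinks as n grows.
for n in (10, 20, 50):
    exact = math.factorial(n)
    print(n, f"{abs(exact - stirling(n)) / exact:.3%}")
```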
Afterwards, Leonhard Euler improved this formula, proving the following equality: n! = ∫ from 0 to +∞ of x^n·e^(-x) dx. The function within the integral shows a typical "bell" curve, whose vertex is the point (n , (n/e)^n).
Pierre Simon de Laplace gave a new demonstration of this formula, using Euler's works.
With Euler, then with Laplace and Legendre, a new theory was developed: the theory of errors (born to simplify astronomers' work). Among several fluctuating measurements of the same object or phenomenon (fluctuations due to a lack of sharpness, dilatation of materials, variable atmospheric pressure, …), what unique value could be considered the true one? Laws of distribution thus had to be created: distributions of values and of sample means. These distributions of values are infinitely many, one for each possible concrete example. The general case of the theory of errors is still an unsolved problem today.
Between 1790 and 1800, Carl Friedrich Gauss, the "prince of mathematicians", applied the least squares method to the theory of errors, arguing that the best representative value for a data series (xi) is the value x that minimises Σ(xi - x)². In this way, and for simple distributions, x turns out to be the arithmetic mean of the xi; the result also holds for a bell-shaped distribution (which is generally typical of a distribution of sample means, with same-sized samples taken from the same parent population). These works are the only ones in which Gauss mentioned the now famous "bell curve"; he never drew one, and its function already existed, which is why calling it a "Gauss curve" is not really appropriate.
Laplace soon objected, regarding Gauss's works, that if a bell-shaped distribution leads to a bell-shaped sample distribution, nothing is said about the numerous other concrete situations whose populations don't behave this way. According to Laplace, Gauss's works are only theoretical thoughts and, worse, are circular (bell leads to bell… because it's bell!). In the 1810s, he demonstrated that if the values are uniformly distributed on an interval (a constant probability density over an interval whose mean is µ), then the distribution of the means of n-sized samples (n big enough) is bell-shaped, with mean µ and standard deviation about µ/√(3n).
Then, he enunciated a theorem that is the cornerstone of statistical inference:
Laplace's theorem (nowadays central limit theorem):
Whatever the distribution of the values, for n big enough, the sampling distribution of the means (of the n-sized samples) is normal (bell curve); its mean is the arithmetic mean of the values, and its standard deviation can easily be calculated by a formula (which always looks like the one given above).
Thus, Laplace built his Laplace's law (that is: the normal law) and discovered its fundamental properties.
The profession of statistician only appeared in the XIXth century (for many purposes, people needed to know how a population behaves). The most famous and prolific statistician of that time was the Belgian Adolphe Quételet, who published an analysis of Laplace's philosophy and numerous concrete data series showing bell-shaped distributions (for instance, the "chest sizes of 4000 Scottish soldiers", whose distribution fits the theoretical normal curve remarkably well. Indeed, the chest size of a man is the sum of several random and independent factors: genetics, education, feeding, activity, …, and Laplace's theorem asserts that the distribution of a sum, like that of a mean, is normal!). It should also be reported that Quételet was the first to draw one of these famous normal bell curves (neither Gauss nor Laplace felt the need to draw one while working on the theory).
Everything isn't necessarily normal
During the second half of the XIXth century, statisticians showed that many data series are in fact not normally distributed (the symmetry of the normal law isn't always representative of what happens in our complex world). Consequently, other continuous or discrete laws were created to model various concrete situations.
For instance:
* Poisson's law, quite asymmetrical, for rare events,
* Pareto's law for income distributions, asymmetrical as well,
* the exponential law and others based on the same model, for lifetimes, asymmetrical again, …
Other laws had been found before the normal law was created:
* the uniform law, where every value has the same probability (throw of a die; choice of a number between 0 and 1),
* the binomial law (from Bernoulli),
* the geometric law, dealing with the number of attempts until the first success (in binomial situations),
* the hypergeometric law, similar to the binomial law, but in which repetition isn't allowed, …
In the early XXth century, laws of higher order were built, dealing with more than one variable and generally involving degrees of freedom:
* Student's law (sampling distribution of means, built with two variables: mean and standard deviation),
* the χ² ("chi-squared") law, which evaluates the differences between a theoretical law and a real distribution.
At that time, English statisticians such as Pearson, Student (pen name of William Sealy Gosset) or Fisher began to develop a true methodology in statistics, that is to say a well-formalised theory of inference (drawing conclusions about a population while knowing only one or more of its samples), by creating new probability laws to describe phenomena.
Between 1900 and 1950, they established an "objectivist" or "frequentist" interpretation of the concept of probability. Since the 1950s, a school of thought known as "neo-Bayesian" has appeared, arguing that statistical inference shouldn't be based on the collected data alone, but also needs the knowledge and use of underlying probabilistic models: this is the "subjectivist" school.
Calculation tools are increasingly powerful
Data processing helped a new practice take off: "multidimensional data analysis". It consists in describing, sorting and simplifying large sets of collected data (e.g. a survey of 3000 people, from each of whom 80 answers have to be collected). The observed and cross-tabulated results may suggest laws (already existing or not), models or explanations, sparing statisticians from having to compare the data against arbitrary, previously created laws.
PROBABILITY DISTRIBUTIONS - LESSONS
1 Discrete probability distributions
1.1 General case: reminders
Let's consider an object or a set of objects and conceive a random experiment on it, whose outcomes form a sample space partitioned into a certain number of events.
e.g.:
objects: two dice
experiment: roll them and add both numbers
sample space: Ω = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} (non equally likely outcomes)
partition of Ω: E1: "less than 7" ; E2: "from 7 to 10" ; E3: "11 or 12"
Each event Ei can be associated with a value xi, a gain, which is random since the upcoming outcome is unpredictable; the set of the xi values is called a random variable, denoted X.
events: E1 E2 E3
gain X (€): -3 1 5
For each value of the gain, we have to be able to calculate the probability of the associated event.
This is called "getting the probability distribution of X".
gain X (€): -3 1 5
pi = p(X = xi): 15/36 18/36 3/36
Interpretation and purpose of these probabilities:
If you play this game many times, your numbers of losses and wins may be estimated thanks to the
proportions announced by these probabilities.
With our example: every 36 games, you will have on average 15 losses of €3, 18 wins of €1 and 3 wins of
€5; thus, on combining them: a global loss of €12, on average, every 36 games.
This overall result can be expressed on average per game: 12/36 ≈ €0.33.
Playing it long-term, you will approximately have an average loss of 33 cents per attempt.
This value is called expected value of X: E(X).
This expected value is given in all cases by: E(X) = Σ (from i = 1 to n) pi·xi
where n is the number of possible values of X.
These long-term forecasts allow us to regard the former table as a statistical series in which probabilities play the role of frequencies of occurrence of the gains (though they are only "ideal" frequencies). In this context, the table can be interpreted from a statistical angle, leading for instance to the calculation of the standard deviation of X, σ(X).
V(X) = Σ (from i = 1 to n) pi·(xi - E(X))² = E(X²) - (E(X))² ; σ(X) = √V(X)
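These formulas can be checked on the dice-game table above; a short Python sketch (not part of the original lesson):

```python
# Gains and probabilities from the dice-game table above.
values = [-3, 1, 5]
probs = [15/36, 18/36, 3/36]

E = sum(p * x for p, x in zip(probs, values))             # E(X) = sum of pi·xi
V = sum(p * x**2 for p, x in zip(probs, values)) - E**2   # V(X) = E(X²) - E(X)²
sigma = V ** 0.5

print(round(E, 4))      # ≈ -0.3333: the 33-cent average loss per game
print(round(sigma, 4))
```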
1.2 Hypergeometric distribution
Its study will be restricted to a simple partition of the initial set into TWO subsets.
1.2.1 Definition and implementation
The probability distribution of a random variable X is hypergeometric iff:
* an experiment is conducted n times without repetition of any outcome, leading to combinations, falling inside a partition of Ω into an event (success) and its contrary (failure);
* X is the total number of successes got after the n attempts. X ∈ {0; 1; 2; 3; …; n}
Let's consider a sample space Ω, a set of N outcomes, partitioned into two events:
A, containing a outcomes called successes;
Ā, containing the N - a other outcomes, named failures.
An experiment is conducted n times, without any possibility of repeating an outcome (which forces n to be less than or equal to N). In the end, k successes have been met, k being random but satisfying k ≤ n and k ≤ a (the number of available "success" outcomes), with of course n - k failures, n - k ≤ N - a (available failures).
X refers to the random variable number k of successes after n attempts.
Then, the probability distribution of X is hypergeometric, with parameters n, a and N.
Notation: H (n , a , N).
1.2.2 Calculation of probabilities
The total number of different sets of outcomes after n attempts is: C(N, n).
Among them, the number of sets that contain exactly k successes is: C(a, k) × C(N - a, n - k).
Hence, the probability of reaching k successes after n attempts is:
p(X = k) = [C(a, k) × C(N - a, n - k)] / C(N, n)
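This formula translates directly into code; a minimal Python sketch (the numbers n = 5, a = 8, N = 32 are an illustrative assumption, not from the lesson):

```python
from math import comb

def hypergeom_pmf(k, n, a, N):
    # p(X = k) = C(a, k) · C(N - a, n - k) / C(N, n)
    return comb(a, k) * comb(N - a, n - k) / comb(N, n)

# Drawing n = 5 outcomes among N = 32, of which a = 8 are successes:
probs = [hypergeom_pmf(k, 5, 8, 32) for k in range(6)]
print([round(p, 4) for p in probs])
print(round(sum(probs), 10))   # the probabilities sum to 1
```

The mean of this distribution can also be checked against the formula E(X) = n × a/N of section 1.2.3.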
1.2.3 Mean and variance
In that context, both parameters are accessible thanks to the following formulas:
E(X) = n × a/N        V(X) = n × (a/N) × ((N - a)/N) × ((N - n)/(N - 1))
Comment: on naming p the probability of success at the first attempt, and q the complementary probability of failure, we notice that: p = a/N and q = (N - a)/N.
Thus, the formulas above become: E(X) = np and V(X) = npq × (N - n)/(N - 1)
1.3 Binomial distribution
1.3.1 Definition and implementation
The probability distribution of a random variable X is binomial iff:
* an experiment is conducted n times with repetition of outcomes allowed, leading to p-lists, falling inside a partition of Ω into an event (success) and its contrary (failure);
* X is the total number of successes got after the n attempts. X ∈ {0; 1; 2; 3; …; n}
a. Bernoulli's scheme
Let's consider a random experiment leading to a sample space Ω.
The event A, named success, has a probability p(A) to occur, denoted p.
The probability of its contrary, named failure, is q = 1 - p.
b. Binomial law
This experiment is conducted n times in the same conditions, so: p is invariable.
X refers to the random variable number k of successes after n attempts.
Then, the probability distribution of X is binomial with parameters n and p.
Notation: B (n ; p).
1.3.2 Calculation of probabilities
A tree (an n-level Bernoulli scheme) leads to the formula to be used. In this example, the experiment is conducted three times: n = 3, A being the success.
On the right of the tree, the numbers of successes (the values of X) match the probabilities of the corresponding intersections. For instance, the probability that X = 1 is the sum of pq², qpq and q²p; thus p(X = 1) = 3pq². Why are there 3 paths in the tree leading to X = 1? Because there are 3 ways to place one success among 3 attempts.
We can generalise: the probability of reaching k successes after n attempts is: p(X = k) = C(n, k) × p^k × q^(n-k)
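The same can be done for the binomial formula; a minimal Python sketch (p = 0.4 is an illustrative assumption):

```python
from math import comb

def binom_pmf(k, n, p):
    # p(X = k) = C(n, k) · p^k · q^(n-k), with q = 1 - p
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The 3-attempt tree above, with an assumed p = 0.4:
p = 0.4
print(binom_pmf(1, 3, p))       # equals 3pq², as found with the tree
print(3 * p * (1 - p)**2)
```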
1.3.3 Mean and variance
In that context, both parameters are accessible thanks to the following formulas:
( )E X np= ( )V X npq=
1.3.4 Approximation of a hypergeometric distribution by a binomial one
In case N ≥ 20n, the law H (n, a, N) comes close to the law B (n, p) where p = a/N.
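This approximation can be observed numerically; a quick Python sketch using n = 10, a = 500, N = 5000 (parameters that also appear later in the lesson), so that N ≥ 20n and p = a/N = 0.1:

```python
from math import comb

def hyper(k, n, a, N):
    # hypergeometric probability p(X = k)
    return comb(a, k) * comb(N - a, n - k) / comb(N, n)

def binom(k, n, p):
    # binomial probability p(X = k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# N = 5000 ≥ 20n with n = 10, a = 500, hence p = a/N = 0.1:
for k in range(4):
    print(k, round(hyper(k, 10, 500, 5000), 5), round(binom(k, 10, 0.1), 5))
```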
1.4 Poisson's distribution
1.4.1 Why it has been created
In many cases, the number of different values that a variable X can reach is very big. Calculating a probability may then involve very large numbers of combinations (and also large powers if the law is binomial), which even a computer might be unable to handle. Moreover, when a success is a rare event, not every result is useful: there is little point in calculating the extremely low probabilities of many unrealistic situations involving a large number of successes (unrealistic because very far from the low average number of expected successes).
In the context of a binomial law with a low value of p, another formula can be used (instead of the binomial formula), based on a Poisson's law, whose results will appear to be close enough to reality.
Concrete examples of use:
* examining a sample taken from a large quantity of products, or a large harvested production, when the probability p that an element is defective is low:
here, the n elements of the sample are taken among N elements without possible repetition, which gives a hypergeometric law; but n is very small compared to N, so we can simplify the situation as if repetition were allowed. This case can thus be treated by a binomial law, whose results will be reliable. Moreover, the low value of p allows us to use a Poisson's law instead of a binomial law;
* problems of queue length;
* predicting a "maximum" number of accidents or failures, or other rare events concerning a large population (for insurance companies, or the study of rare diseases, for instance).
In the context of a binomial or hypergeometric law, under certain conditions, we can therefore use an approximate model, a Poisson's law, whose results will be fairly close to reality.
1.4.2 Definition, calculation of a probability
This law has been designed for a theoretical random variable X that may take any natural number k as a value (0, 1, 2, 3, 4, … "up to infinity"). This still represents a number of successes.
Probabilities are defined by the following formula: p(X = k) = e^(-λ) × λ^k / k!
where e is the exponential number and λ is the expectation of X: λ = E(X).
The probability distribution of X is the Poisson's law with parameter λ, denoted P (λ).
Using a Poisson's law in an exercise must be justified: either the exercise states that the law is a Poisson's one, or a binomial law logically leads to the corresponding Poisson's law (section 1.4.4).
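The formula in code form; a minimal Python sketch (λ = 2 is an illustrative assumption):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    # p(X = k) = e^(-λ) · λ^k / k!
    return exp(-lam) * lam**k / factorial(k)

# An assumed expectation λ = E(X) = 2 rare events:
print([round(poisson_pmf(k, 2), 4) for k in range(6)])
```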
1.4.3 Mean and variance
In this context, both parameters are very simple: E(X) = λ and V(X) = λ
1.4.4 Approximation of both previous laws by a Poisson's one
Given a random variable X whose distribution is B (n, p),
for n "big enough" (n > 30) and p "small enough" (p ≤ 0.1), such that npq ≤ 10,
B (n, p) comes close to the law P (λ) where λ = E(X) = np.
Given a random variable X whose distribution is H (n, a, N),
for a sample "small enough inside the population" (N ≥ 20n) but "big enough" (n > 30),
and for a proportion of success elements "small enough" (a/N ≤ 0.1),
H (n, a, N) comes close to the law P (λ) where λ = E(X) = n·a/N.
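A numerical check of the first approximation (the values n = 100 and p = 0.03 are illustrative assumptions satisfying n > 30 and p ≤ 0.1):

```python
from math import comb, exp, factorial

n, p = 100, 0.03
lam = n * p                     # λ = E(X) = np = 3

for k in range(6):
    b = comb(n, k) * p**k * (1 - p)**(n - k)    # binomial probability
    po = exp(-lam) * lam**k / factorial(k)      # Poisson probability
    print(k, round(b, 4), round(po, 4))
```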
2 A continuous probability distribution: the Normal law
2.1 Convergence of discrete laws
Let's display a few probability distributions, with given values for n, a, N:
n = 10, a = 500, N = 5000 n = 50, a = 500, N = 5000 n = 200, a = 500, N = 5000
Several comments can be made:
* for the whole set of graphs, p = 0.1. This probability of success isn't very low, hence the reliability criterion "np < 10" for the use of a Poisson's law isn't met everywhere;
* the population's size (N = 5000) is rather big compared to n, which implies that the hypergeometric and binomial distributions are quite similar;
* the higher n is, the more symmetrical the distributions appear, around a central value which is actually the expectation of the variable;
* the higher n is, the more the distributions seem to follow a curve, which might be the same whatever the genuine discrete law, or at least might belong to a single class of functions.
Could we then, under conditions on n and p, define a unique law that would correctly and quickly describe reality?
* As n becomes high, looking for every single-point probability among many others may not be relevant or useful. We had better look for the probability that X is located inside some interval.
Could this unique law be described in terms of intervals (instead of single values), by a continuous random variable?
To conclude, the usefulness of a new and general probability distribution is obvious. Nevertheless, this law would be available only for big populations and big samples taken from them (samples that remain small compared to the population)… but that's actually the setting of many current surveys!
2.2 Continuous real random variable
2.2.1 Statistical introduction to a "continuous" distribution
2.2.2 Continuous random variable
Let's consider the ideal situation where X can take every possible real value, working in an infinite
population. Here, the "frequencies concentration" is renamed "probability density".
A probability density of X is a function f, positive and continuous on ℝ, such that ∫ℝ f(x)·dx = 1,
where a probability is the measure of an area bounded between the curve of f and the x-axis (Ox).
For example (tutorial), the probability that a mass would be less than 3.7 kg is ∫ from -∞ to 3.7 of f(x)·dx.
The distribution function of X is the function F that, to an input x, associates the output F(x) = p(X < x).
F is an increasing function of x.
Comments:
* the graph of a probability density doesn't necessarily have an axis of symmetry, unlike what the graphs above could lead us to conclude; the latter belong to a normal distribution, which actually is symmetrical;
* the expectation of a continuous random variable is: E(X) = ∫ℝ x·f(x)·dx ;
* the variance of a continuous random variable is: V(X) = ∫ℝ (x - E(X))²·f(x)·dx , a definition from which we can rediscover the well-known property: V(X) = E(X²) - (E(X))².
[Graphs: the density curve y = f(x), where F(3.7) and F(3.85) appear as areas under the curve, and the distribution function curve y = F(x), where F(3.7) and F(3.85) are read directly as ordinates.]
2.3 The Normal law (or Laplace's law)
As we glimpsed, with a large number of observations from a big population, a lot of concrete phenomena, as well as discrete probability distributions, can be modelled by probability densities sharing a typical shape.
The general expression of such functions f is: f(x) = k·e^(-a(x - b)²)
Their graphs are named "bell curves".
2.3.1 General definition of the normal law N (µ , σ)
Let X be a random variable whose mean and standard deviation are µ and σ (E(X) = µ ; V(X) = σ²).
Its probability distribution is N (µ , σ) when its probability density expression is:
f(x) = 1/(σ√(2π)) · e^(-(1/2)·((x - µ)/σ)²)
e.g.: probability density of N (25 , 10):
Comment 1: such a curve has two inflexion points, whose abscissas are µ - σ and µ + σ. Hence, the standard deviation can be read graphically.
Comment 2: some typical results have to be known:
p(µ - σ < X < µ + σ) ≈ 68.3 % p(µ - 1.96σ < X < µ + 1.96σ) ≈ 95 %
p(µ - 2σ < X < µ + 2σ) ≈ 95.4 % p(µ - 2.58σ < X < µ + 2.58σ) ≈ 99 %
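These typical results can be recomputed with the error function erf (a quick check, not part of the original lesson), since for any normal law p(µ - kσ < X < µ + kσ) = erf(k/√2):

```python
from math import erf, sqrt

def central_prob(k):
    # p(µ - kσ < X < µ + kσ), identical for every normal law N(µ, σ)
    return erf(k / sqrt(2))

for k in (1, 1.96, 2, 2.58):
    print(k, f"{central_prob(k):.1%}")   # ≈ 68.3 %, 95.0 %, 95.4 %, 99.0 %
```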
Comment 3: the term "normal" can't be defined for one individual. Only a population may show a normal distribution; this adjective is used because these functions are known to fit many concrete situations.
[Graph: density of N (25 , 10); the inflexion points lie at abscissas 15 and 35, around µ = 25.]
2.3.2 The standard normal law N (0 , 1)
We shall sometimes need to use it (whether demanded by an exercise or made necessary by the situation…).
The variable of this special distribution (mean = 0 , standard deviation = 1) is denoted U (its values: u).
The comment 2 above gives here:
p(-1 < U < 1) ≈ 68.3 %
p(-1.96 < U < 1.96) ≈ 95 %
p(-2 < U < 2) ≈ 95.4 %
p(-2.58 < U < 2.58) ≈ 99 %
A lot of values F(u) = p(U < u) are given in a table (formulary), but only for u ≥ 0.
This restriction still lets us find the other probabilities, thanks to the following formulas:
p(a < U < b) = p(U < b) - p(U < a)    p(U > a) = 1 - p(U < a)    p(U < -a) = p(U > a)
2.3.3 Variable change: transition from N (µ , σ) to N (0 , 1)
Sometimes, a problem expressed in a given normal law cannot be solved directly, especially when one parameter is unknown. We then have to make a transition to N (0 , 1).
X is distributed by N (µ , σ) ⇔ U = (X - µ)/σ is distributed by N (0 , 1).
U is distributed by N (0 , 1) ⇔ X = µ + σU is distributed by N (µ , σ).
Hence: p(X < x) = p(U < (x - µ)/σ)
Whatever the parameters of the normal distribution, a probability is the area of a given surface under the curve. On applying the variable change given above, you only modify the labels on the horizontal axis, without modifying the curve! For instance, the abscissa µ + 0.5σ for X matches the abscissa 0.5 for U, and then p(X < µ + 0.5σ) = p(U < 0.5).
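The variable change can be sketched in Python (a check, not part of the original lesson), using erf for the standard normal distribution function:

```python
from math import erf, sqrt

def phi(u):
    # F(u) = p(U < u) for the standard normal law N(0, 1)
    return 0.5 * (1 + erf(u / sqrt(2)))

def normal_cdf(x, mu, sigma):
    # p(X < x) for X ~ N(µ, σ), through the change of variable u = (x - µ)/σ
    return phi((x - mu) / sigma)

# The example above with N(25, 10): x = µ + 0.5σ = 30 matches u = 0.5.
print(normal_cdf(30, 25, 10))
print(phi(0.5))                  # same value ≈ 0.6915
```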
2.3.4 Approximation of discrete laws by a normal one
We have already seen that, as n gets high, the hypergeometric, binomial and Poisson's distributions come close to a normal one. This normal distribution will be an efficient replacement for the former ones if:
Approximation of a binomial distribution by a normal one:
from B (n , p), if n > 30 and npq > 5, then we can use N (µ , σ) with µ = np and σ = √(npq).
Approximation of a Poisson's distribution by a normal one:
from P (λ), if λ > 20, then we can use N (µ , σ) with µ = λ and σ = √λ.
(Starting from a hypergeometric distribution will require a first transformation into a binomial one.)
2.3.5 Calculation of a discrete probability
In a discrete situation, where the variable X can only take integers as values (e.g.: number of successes, but
not only), we're interested in the calculation of p(X = k). However, the normal law only permits us to
calculate probabilities of intervals.
In that case, the best way is to apply the following rule: p(X = k) = p(k – 0.5 < X < k + 0.5)
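This rule can be tested against an exact binomial computation; a Python sketch (the law B(100, 0.5) is an illustrative assumption, chosen so that n > 30 and npq = 25 > 5):

```python
from math import comb, erf, sqrt

def normal_cdf(x, mu, sigma):
    # p(X < x) for X ~ N(µ, σ)
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))     # µ = 50, σ = 5

exact = comb(n, 50) * p**50 * (1 - p)**50    # binomial p(X = 50)
approx = normal_cdf(50.5, mu, sigma) - normal_cdf(49.5, mu, sigma)
print(round(exact, 4), round(approx, 4))     # the two values agree to within about 1e-4
```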
2.3.6 Important consequence
The binomial and Poisson's distributions are discrete, so X can only take natural numbers as values: something like X = 3.8 has no meaning for them. However, point 2.3.5 shows that the effect of using the normal law instead is to transform any integer into an interval of size 1 around it.
In discrete situations, the numbers 3, 0 or -8, for instance, have to be translated into the intervals [2.5 ; 3.5], [-0.5 ; 0.5], [-8.5 ; -7.5].
As for the probability that X be more than or equal to 10, it will be translated into p(X > 9.5); but be careful: the probability that X be strictly more than 10 will be translated into p(X > 10.5)! (the probability that X be equal to 10 being p(9.5 < X < 10.5))
3 Sampling distributions
3.1 Introduction
Do you know an operation in which the entire population is surveyed, so as to collect several pieces of information?
The means deployed are huge. It takes more than a year to collect and analyse the whole data set, and also an impressive number of surveyors to walk through the whole country. Of course, this work can't be carried out for every survey…
By selecting a part of the population, you can get a pretty good representation of reality. This selection, more or less representative of reality, is called a sample. Survey methods do exist, to build a sample as representative of the population as possible.
Our aim in this section is, given a completely known population, to be able to tell how its set of samples will surely behave.
Naming conventions:
The population's parameters will be written with Greek letters:
mean: µ ; standard deviation: σ ; proportion : π
The sample's parameters will be written using our alphabet:
mean: x ; standard deviation: s ; proportion : p
3.2 Random sampling
There are two main types of random sampling:
* the simple random sampling (SRS) allows the repetition of an individual and takes the order into account
(which leads to p-lists in counts and to the binomial law in probabilities),
* the exhaustive sampling doesn't allow the repetition of an individual and doesn't take the order into
account (which leads to combinations in counts and to the hypergeometric law in probabilities).
3.3 Sampling distribution of means
A variable X has to be studied in a population. Once a size n is chosen, we can virtually extract all the samples of this size. Sample no. k gives way to the calculation of its own mean: x̄k.
We denote X̄ the random variable of the means of the n-sized samples, and we name sampling distribution of the means the probability distribution of the whole set of the x̄k, that is to say the probability distribution of the random variable X̄.
Let there be a "big enough" population (N > 30), on which a variable X is known in detail (at least, its mean µ and its standard deviation σ are known). The mean x̄k of each n-sized sample is more or less close to µ. In case n is big enough too (n ≥ 5),
X̄ is distributed by N (µ , σ/√n) on SRS, and by N (µ , σ/√n × √((N - n)/(N - 1))) on exhaustive sampling.
Comment 1: in case N > 20n ("small enough" sample), we can claim that √((N - n)/(N - 1)) is close to 1 and then ignore it. An exhaustive sampling (which is the most used) will in this case be handled as a SRS.
Comment 2: if, in an exercise, no comparison between N and n is possible, we will use the SRS results.
Comment 3: (from the "central limit" theorem) the higher N and n are, the closer the law of X̄ is to a normal law, and that whatever the probability distribution of X.
Comment 4: in case n is small (< 5), the distribution of X̄ is not close to a normal one. However, its mean and standard deviation are still those announced in the frame above.
Activity: let the population be Ω = {0, 1, 2, 3, 4, 5} (N = 6), uniformly distributed.
Its mean is µ = 2.5 and its standard deviation is σ ≈ 1.7078.
Listing all the samples of size 2 (SRS), each with its mean, and analysing the statistical distribution of these sample means:
their mean is 2.5 (!) and their standard deviation is 1.2076… which is exactly σ/√n (!)
Listing all the samples of size 2 (exhaustive), each with its mean, and analysing the statistical distribution of these sample means:
their mean is 2.5 (!) and their standard deviation is 1.0801… which is exactly σ/√n × √((N - n)/(N - 1)) (!)
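The activity above can be reproduced by brute force in Python (a check, not part of the original lesson):

```python
from itertools import combinations, product
from statistics import mean, pstdev

pop = [0, 1, 2, 3, 4, 5]                  # N = 6, µ = 2.5, σ ≈ 1.7078
sigma = pstdev(pop)

srs = [mean(s) for s in product(pop, repeat=2)]   # SRS: 36 ordered samples with repetition
exh = [mean(c) for c in combinations(pop, 2)]     # exhaustive: 15 unordered samples

print(round(mean(srs), 4), round(pstdev(srs), 4))   # 2.5 and σ/√2 ≈ 1.2076
print(round(mean(exh), 4), round(pstdev(exh), 4))   # 2.5 and 1.0801
```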
3.4 Sampling distribution of proportions
Let there be a population of N individuals, in which we know that a number a of individuals share the character A. The proportion of such individuals in the population is then: π = a/N.
Once a size n is chosen, we can virtually extract all the samples of this size. Sample no. k gives way to the calculation of its own proportion: pk.
We denote P the random variable of the proportions, the set of all values pk, and we name sampling distribution of proportions the probability distribution of P.
Let there be a "big enough" population (N > 30), in which a proportion π is known. The proportion pk in each n-sized sample is more or less close to π. In case n is big enough too (n ≥ 5),
P is distributed by N (π , √(π(1 - π)/n)) on SRS, and by N (π , √(π(1 - π)/n) × √((N - n)/(N - 1))) on exhaustive sampling.
Comment: let's explain these results in the SRS case. Let Y be the variable giving, in each n-sized sample, the number of individuals owning the character A. The law of Y is binomial, with parameters n and π. Reminders: E(Y) = nπ and V(Y) = nπ(1 - π).
Moreover, P = Y/n, which leads to: E(P) = π and V(P) = π(1 - π)/n.
Moreover, the four comments made in part 3.3 are still relevant here.
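The identities E(P) = π and V(P) = π(1 − π)/n can be verified exactly from the binomial law of Y, using only the standard library (the values of n and π below are hypothetical):

```python
from math import comb

n, pi = 50, 0.3   # hypothetical sample size and population proportion

# exact distribution of Y ~ Binomial(n, pi), then P = Y/n
probs = [comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(n + 1)]
EP = sum((k / n) * pk for k, pk in enumerate(probs))            # E(P)
VP = sum((k / n - EP) ** 2 * pk for k, pk in enumerate(probs))  # V(P)

print(round(EP, 6))                      # equals pi
print(round(VP, 6), pi * (1 - pi) / n)   # equals pi(1 - pi)/n
```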
4 Estimates (statistical inference)
A large population is partially or totally unknown. A unique n-sized sample being extracted, to what extent does it represent the whole population? Is the information obtained from this sample reliable enough to estimate the reality of the unknown population?
As it's a large population, we will systematically consider SRS samples.
4.1 Point estimate
The sign ^ placed above a parameter expresses its estimate.
The mean of a sample serves as an estimate of the population's mean; same for a proportion:
μ̂ = x̄ ; π̂ = p
(indeed, it has been recorded in sections 3.3 and 3.4 that X̄ is centred on µ and that P is centred on π. We say that the variables X̄ and P are unbiased estimators.)
The variance s² of a sample is not the best estimate of the population's variance σ² (s² is a biased estimator). It has to be corrected:
σ̂² = s² × n/(n − 1)
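As a sketch of this correction (the sample values below are hypothetical), note that Python's `statistics.variance` applies the same n/(n − 1) factor:

```python
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]   # hypothetical observations
n = len(sample)
x_bar = sum(sample) / n

s2 = sum((x - x_bar) ** 2 for x in sample) / n   # sample variance s² (biased)
sigma2_hat = s2 * n / (n - 1)                    # corrected estimate of sigma²

print(round(s2, 4), round(sigma2_hat, 4))
```

The corrected value matches `statistics.variance(sample)`, which divides by n − 1 by design.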
4.2 Estimate by a confidence interval
A point estimate doesn't guarantee any accuracy. Indeed, a sample might represent the population very badly, and the two means (or proportions) might be far from each other.
A confidence interval will enable us to state the probability that a population's parameter lies at a given distance from the one obtained from a sample. For instance, we will build around the mean x̄ of a sample an interval "that has 95 % chances" of containing the population's mean µ.
We name significance level, α, the probability that a confidence interval does not contain the population's parameter. α = 5 % or α = 1 % are the most commonly used.
We name confidence level the probability that a confidence interval contains the population's parameter. Commonly: 1 − α = 95 % or 1 − α = 99 %.
4.2.1 Estimate of a mean
The way to build the interval depends on the knowledge of σ.
if σ is known: Iα = [ x̄ − u σ/√n ; x̄ + u σ/√n ]
using the variable U distributed by N (0 , 1), looking for u such that p(−u < U < u) = 1 − α.
e.g.: with α = 5 %, u = 1.96 and with α = 1 %, u = 2.58.
if σ is unknown: Iα = [ x̄ − t s/√(n − 1) ; x̄ + t s/√(n − 1) ]
Both the population's mean and standard deviation are unknown, which prevents us from using the variable U and obliges us to replace it by the variable T, distributed by Student's law with n − 1 "degrees of freedom" (dof), looking for t such that p(−t < T < t) = 1 − α.
Several values of T are given in the form, in the corresponding table.
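The known-σ case can be sketched with the standard library, which recovers u from N (0 , 1) instead of a table (the sample summary below is hypothetical):

```python
from statistics import NormalDist
from math import sqrt

# hypothetical sample summary: n = 100 observations, mean 52.3, known sigma = 8
n, x_bar, sigma = 100, 52.3, 8.0
alpha = 0.05

# u such that p(-u < U < u) = 1 - alpha, with U ~ N(0, 1)
u = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 5 %

half = u * sigma / sqrt(n)
I = (x_bar - half, x_bar + half)
print(round(u, 2), [round(b, 2) for b in I])   # 1.96 [50.73, 53.87]
```

With α = 1 % the same call gives u ≈ 2.58, matching the lesson's values. The unknown-σ case would need Student's t, which is not in the standard library; the t value then comes from the table in the form.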
4.2.2 Estimate of a proportion
( ) ( );
1 1p p p pI p u p u
n nα
− −= − +
(automatic use of the normal law)
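A minimal sketch of this interval, with hypothetical survey numbers:

```python
from statistics import NormalDist
from math import sqrt

# hypothetical: 120 of 400 surveyed customers buy the product
n, successes = 400, 120
p = successes / n          # observed proportion: 0.30
alpha = 0.05
u = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96

half = u * sqrt(p * (1 - p) / n)
I = (p - half, p + half)
print([round(b, 3) for b in I])   # [0.255, 0.345]
```

So, at the 95 % confidence level, the population's proportion π is estimated to lie between about 25.5 % and 34.5 %.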
5 Statistical hypothesis testing
Knowing (at least) a sample, a hypothesis can be expressed about the unknown population. It is called the null hypothesis and denoted H0. An appropriate statistical test will allow us to reject it (or not).
A rejection of H0 is associated with a risk of being wrong: the significance level α.
Sometimes, it is useful to express its contrary, the alternative hypothesis H1.
5.1 Adequacy χ² test (Pearson's test)
Aim: to compare a distribution of observed values to a given law.
By convention, the null hypothesis H0 is: "the observed distribution fits the chosen law".
This hypothesis will be rejected if the observed distribution differs "very much" from the chosen law.
e.g.: frequency bar diagram of observed values (vertical bars) compared to the normal law N (6 , 2)
H0 : in the population, the variable is distributed by the law N (6 , 2).
(we venture the hypothesis that the observed values – in the sample – are consistent with the idea
that the population would be distributed by this normal law. For this purpose, we have to perform a
χ² adequacy test in order to decide whether H0 can be rejected with a high enough confidence level)
implementation of the test
1. expression of the null hypothesis
2. Calculation of the observed chi-square: χ²calc
n observations are done: n individuals are evaluated and k different values are spotted.
The tested law makes us calculate theoretical frequencies, then: χ²calc = Σi (obsi − thi)²/thi
3. Rejection area
look for the value χ²limit in relation with the significance level α and with the number of dof, which is in
any case k – 1 here.
4. Comparison and decision
If χ²calc > χ²lim, then we can reject H0 with a risk α to be wrong.
If χ²calc < χ²lim, then we cannot reject H0 at the level α (the risk to be wrong would be more than α).
values    observed frequencies    theoretical frequencies    χ² contributions
val 1     obs1                    th1                        χ²1
val 2     obs2                    th2                        χ²2
…         …                       …                          …
val k     obsk                    thk                        χ²k
total     n                       n                          χ²calc
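The four steps can be sketched in a few lines. We use a simpler discrete H0 than the lesson's N (6 , 2) example (a fair die, with hypothetical counts), since the normal law would first require binning the values:

```python
# hypothetical: 120 rolls of a die; H0: "the die is fair" (uniform law)
observed = [25, 17, 15, 23, 24, 16]      # obs_i, k = 6 values, n = 120
n = sum(observed)
theoretical = [n / 6] * 6                # th_i under H0

# step 2: observed chi-square
chi2_calc = sum((o - t) ** 2 / t for o, t in zip(observed, theoretical))

# step 3: rejection area; table value for alpha = 5 %, dof = k - 1 = 5
chi2_lim = 11.07

# step 4: comparison and decision
print(round(chi2_calc, 2),
      "reject H0" if chi2_calc > chi2_lim else "cannot reject H0")
```

Here χ²calc = 5.0 < 11.07, so H0 cannot be rejected at the 5 % level: these counts are compatible with a fair die.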
5.2 Conformance test of a mean, of a proportion
5.2.1 Principle
The aim of these tests is to decide whether the mean µ (or the proportion π) of an unknown population differs from a given value µ0 (or π0) or not.
Null hypothesis: H0 : µ = µ0 (or: H0 : π = π0)
Alternative hypothesis: H1 : µ ≠ µ0 : two-sided test,
or H1 : µ < µ0 : left one-sided test      same for a proportion
or H1 : µ > µ0 : right one-sided test
This alternative hypothesis is essential because:
* in case of a two-sided test, α has to be cut in two, both halves creating two rejection areas, on the left and on the right of the tested value,
* in case of a one-sided test, a unique rejection area corresponds to the whole of α.
There is a strong relationship between confidence intervals around an observed value (as we saw them in the previous section) and performing a test on a given value; hence what follows:
5.2.2 Conformance test of a mean
If the standard deviation of the population, σ, is known
The associated decision variable is: U = (X̄ − µ0) / (σ/√n)
which is distributed, under the null hypothesis, by the standard normal law, provided that X is normally distributed or in case n is big enough (n ≥ 5).
If the standard deviation of the population, σ, is unknown
The associated decision variable is: T = (X̄ − µ0) / (S/√(n − 1))
which is distributed, under the null hypothesis, by Student's law with n − 1 degrees of freedom, provided that X is normally distributed or in case n is big enough (n ≥ 5).
(S is the random variable "standard deviations in the samples")
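A sketch of the known-σ case, with hypothetical numbers, computing the decision variable and the rejection threshold:

```python
from statistics import NormalDist
from math import sqrt

# hypothetical: H0: mu = 50, H1: mu != 50 (two-sided), sigma = 8 known
mu0, sigma = 50.0, 8.0
n, x_bar = 64, 52.4        # sample of size 64 with observed mean 52.4
alpha = 0.05

u_calc = (x_bar - mu0) / (sigma / sqrt(n))
u_lim = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided: alpha split in two

reject = abs(u_calc) > u_lim
print(round(u_calc, 2), round(u_lim, 2))      # 2.4 1.96
print("reject H0" if reject else "cannot reject H0")
```

Here |u| = 2.4 > 1.96, so H0 is rejected at the 5 % level. With σ unknown, u_calc would be replaced by the t statistic above and u_lim read from the Student table with n − 1 dof.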
5.2.3 Conformance test of a proportion
The associated decision variable is: U = (P − π0) / √(π0(1 − π0)/n)
which is distributed, under the null hypothesis, by the standard normal law, in case n is big enough (n ≥ 5).
(P is the random variable "proportions in the samples")
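A matching sketch for a proportion, here as a one-sided test (all numbers hypothetical):

```python
from statistics import NormalDist
from math import sqrt

# hypothetical: H0: pi = 0.5, H1: pi > 0.5 (right one-sided),
# 230 successes observed out of n = 400
pi0, n = 0.5, 400
p = 230 / n                # observed proportion: 0.575
alpha = 0.05

u_calc = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
u_lim = NormalDist().inv_cdf(1 - alpha)   # one-sided: alpha whole on one side

print(round(u_calc, 2), round(u_lim, 2))  # 3.0 1.64
print("reject H0" if u_calc > u_lim else "cannot reject H0")
```

Note the one-sided threshold 1.64 instead of the two-sided 1.96, as explained in 5.2.1.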
5.2.4 Methodology
1. Clearly spell out the null hypothesis and the alternative hypothesis
2. Calculate the value (u or t) of the decision variable in association with the value x or p of the sample
3. Calculate the limit value(s) u or t demarcating the rejection area
4. Compare the results of points 2 and 3, then conclude about the rejection (or not) of H0
5.3 The risks (not required)
5.3.1 Accept a hypothesis?
If we are to decide on the rejection or not of a hypothesis, we have to perform a statistical test: make observations and confront them with an alternative hypothesis (defining a rejection area). The main idea is to try to reject H0 in case the observations appear to be too far from what was expected under this hypothesis. However, making a wrong decision is possible: each decision is associated with a probability of being wrong, which we try to minimize as far as possible.
The conclusion of a test can only be the rejection or the non-rejection of H0, never its acceptance. In statistical inference (as in any observational activity: physics, chemistry, astronomy, economics, …), proving that a theory is true is impossible; on the other hand, it is possible that an observation contradicts the theory, forcing the analyst to modify it.
Let's consider a statistical test at a 5 % significance level. If our observation lies in the rejection area, then we can reject H0 with less than a 5 % risk of being wrong (and thus a 95 % reliability). On the other hand, if our observation isn't located in the rejection area, we only know that our chances of being right in rejecting H0 would be less than 95 % ("we can't reject H0 at a 5 % significance level"), which is surely not a situation that would lead us to accept H0!
5.3.2 Decisions and risks
There are two different kinds of mistakes while making a decision, each one associated with a risk:
We reject H0 whereas H0 is true: associated with the risk α : type 1 error,
α = p(reject H0 | H0 true)
We don't reject H0 whereas H0 is false: associated with the risk β : type 2 error.
β = p(not reject H0 | H0 false)
The four probabilities below are conditional; be careful with their interpretation!
α is the probability of rejecting H0, given that H0 is true;
1 − α is the probability of not rejecting H0, given that H0 is true;
1 − β is the probability of rejecting H0, given that H0 is false;
β is the probability of not rejecting H0, given that H0 is false.
5.3.3 Risks and statistical tests
The probability α to be wrong rejecting H0 is named significance level of the test.
The probability 1-α to be right not rejecting H0 is named confidence level of the test.
The probability 1-β to be right rejecting H0 is named power of the test.
While the risk α is well-known, since we have to decide its value, it's unfortunately impossible to know the
risk β, since the population remains unknown.
e.g.: let's test the hypothesis that the mean of a population is 4.
So, we assume that the sampling distribution of means is the one in the graph opposite. We decide a significance level α = 5 %, which leads us to the conclusion: if our sample's mean is more than 5.3, then we can reject the hypothesis that claims µ = 4.
Moreover, if x̄ > 5.3, then the risk of being wrong (on rejection) is 5 %, since in case µ = 4 is true, 5 % of the samples would show a mean that is more than 5.3!
Now, let's suppose that the real mean of the population is 6 (but the person who performs the test doesn't know it!). The real distribution has been added to the second graph below, dotted. If our sample's mean is less than 5.3, the one who performs the test will not be allowed to reject the hypothesis µ = 4 (too low a confidence level), thereby of course making a mistake.
The real proportion of the samples whose mean is less than 5.3 is β: the risk of being wrong while not rejecting µ = 4, unfortunately unknown (because the real value of µ is unknown) and possibly very high!
Nevertheless, the error risk β decreases as the number of observations (the sample's size) increases.
The third graph below displays what happens when the sample's size is doubled (the standard deviation is divided by √2), with the same value for α:
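This example can be sketched numerically. We assume a standard deviation of the sampling distribution of means of about 0.79, read off the graphs (with α = 5 % one-sided, this reproduces the 5.3 threshold); in practice β stays unknown, this is only an illustration:

```python
from statistics import NormalDist
from math import sqrt

mu0, mu_real, alpha = 4.0, 6.0, 0.05
sd = 0.79   # assumed sd of the sampling distribution of means (from the graph)

# rejection threshold: sample mean above which H0 (mu = 4) is rejected
threshold = mu0 + NormalDist().inv_cdf(1 - alpha) * sd   # about 5.3

# beta: probability a sample mean falls below the threshold when mu = 6
beta = NormalDist(mu_real, sd).cdf(threshold)

# doubling n divides sd by sqrt(2): the threshold moves and beta drops
sd2 = sd / sqrt(2)
beta2 = NormalDist(mu_real, sd2).cdf(mu0 + NormalDist().inv_cdf(1 - alpha) * sd2)

print(round(threshold, 1), round(beta, 3), round(beta2, 3))
```

With these assumed numbers, β is already near 19 %, and doubling the sample's size cuts it to a few percent, which is the point of the conclusion below.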
To conclude: there is a safe method to reduce the risks of errors in surveys: enlarge the sample!