
An overview of Probability, Statistics and

Stochastic Processes

D. Pinheiro∗

CEMAPRE, ISEG

Universidade Técnica de Lisboa

Rua do Quelhas 6, 1200-781 Lisboa

Portugal

May 18, 2012

Abstract

The present manuscript constitutes the lecture notes for the “Probability, Statistics and Stochastic Processes” module of the PhD program on Complexity Science, held at ISCTE-IUL during April and May 2012. The aim of the notes is to provide some auxiliary material for the students following this 10-hour module, devoted to the study of Probability theory and Stochastic Processes, as well as Statistics. A good working knowledge of calculus in several variables and linear algebra is desirable.

Contents

1 Introduction

2 Probability
  2.1 Probability spaces
  2.2 Random variables
  2.3 Discrete distributions
    2.3.1 Discrete uniform
    2.3.2 Bernoulli
    2.3.3 Binomial
    2.3.4 Negative Binomial
    2.3.5 Hypergeometric
    2.3.6 Poisson
  2.4 Continuous distributions
    2.4.1 Uniform
    2.4.2 Normal
    2.4.3 Exponential
    2.4.4 Gamma
    2.4.5 Chi-square
    2.4.6 Student's t
    2.4.7 Snedecor's F distribution
  2.5 The Law of Large Numbers
  2.6 The Central Limit Theorem

3 Stochastic Processes
  3.1 Basic properties
  3.2 Poisson Process
  3.3 Brownian motion
  3.4 Lévy process
  3.5 Markov Processes

4 Statistics
  4.1 Random sample and Statistic
  4.2 Estimators
    4.2.1 Method of Moments
    4.2.2 Maximum Likelihood method
    4.2.3 Some measures to assess estimators quality
  4.3 Confidence intervals
    4.3.1 Mean
    4.3.2 Variance
    4.3.3 Difference of two means
    4.3.4 Ratio of two variances
    4.3.5 Proportion
    4.3.6 Difference of two proportions
  4.4 Hypothesis Testing
    4.4.1 Mean
    4.4.2 Variance
    4.4.3 Difference of two means
    4.4.4 Ratio of two variances
    4.4.5 Proportion
    4.4.6 Difference of two proportions
    4.4.7 Other tests

∗ Email: [email protected]

1 Introduction

In order to pursue a deeper understanding of complex behaviour in science and social science, one should be prepared to accept the fact that most real world phenomena, if not all, have some degree of uncertainty attached to them. The emergence of uncertainty may be due to a number of reasons. One possibility is that the object of study is random in itself. Some simple examples of this kind of phenomena include games of chance, quality control, and particle detection. On the other hand, even deterministic experiments may exhibit some sort of uncertainty which, for instance, may be due to the presence of measurement errors or some fluctuation of environmental factors. For a typical experiment, there may even be several distinct sources of uncertainty affecting the corresponding outcome. It is then relevant to be able to identify and describe the influence of such random phenomena on the outcomes of the experiment under consideration. The mathematical tools to address such problems are part of Probability theory and Statistics.

Probability theory is the branch of mathematics devoted to the study of random phenomena. Its initial development was motivated by the analysis of games of chance, dating back to the sixteenth and seventeenth centuries. Prominent figures responsible for this initial development include Cardano, Fermat and Pascal. The current form of the theory is mainly due to Kolmogorov, who axiomatized it in the early twentieth century. A stochastic process is one of the objects of study in Probability theory, used to describe the evolution in time of some random phenomenon.

Statistics is one of the key applications of Probability theory, providing tools for the quantitative analysis of large sets of data. Its main goal is to study the appropriate procedures for the collection, organization, analysis, and interpretation of data.

An effort has been made to make this short course self-contained. The aim is to provide the student with an overview of the subject, describing some of the main concepts and results of the theory, complemented by some illustrative examples. In order to cover a wider range of subjects, proofs and technical details are avoided in the present text, but can be found in the references provided at the end of these notes.

2 Probability

This section is devoted to the introduction of some of the main concepts in Probability theory. We start by describing its current formulation in terms of probability spaces, and then we move on to discuss concepts such as random variables and some special probability distributions. We finish the section with two key results: the law of large numbers and the central limit theorem.

2.1 Probability spaces

Probability theory is the branch of mathematics concerned with the study of random experiments. Such experiments share the property that their outcome is not predetermined, i.e. if a random experiment is repeated under exactly the same conditions, the resulting outcomes are not necessarily equal. As trivial examples one may think of tossing a coin, rolling a die, or drawing a card from a 52-card deck.

The first object to be introduced in probability theory is the space of elementary outcomes, which we will call the sample space and denote by Ω from now on. The sample space Ω is a non-empty set, whose elements ω ∈ Ω are called elementary outcomes or elementary events.

Example Here are several simple examples of random experiments with the corresponding sample spaces.

(i) Consider an experiment involving a single coin toss. There are two possible outcomes, heads (H) and tails (T). The sample space is Ω = {H, T}.


(ii) Consider another experiment involving two coin tosses. The outcome will be a string with two elements, each representing either heads or tails. The sample space is

Ω = {HH, HT, TH, TT} .

(iii) If we consider the experiment of rolling a single die, the elementary outcomes are “face 1”, “face 2”, up until “face 6”, and the sample space is

Ω = {“face 1”, “face 2”, . . . , “face 6”} .

(iv) For the experiment of rolling a die until we obtain “face 6”, the elementary outcomes are “1” (“face 6” is obtained the first time the die is rolled), “2” (“face 6” is obtained the second time the die is rolled), and so on, up until “∞” (representing the outcome that “face 6” is never obtained). In this case the sample space is given by

Ω = {“1”, “2”, “3”, . . . , “∞”} .

(v) Another experiment may be to measure the height of a randomly chosen 10 year old Portuguese child. The elementary outcomes of such an experiment are positive real numbers corresponding to the child's height. The sample space is Ω = R+, the set of positive real numbers. One could also think of measuring the height and weight of a randomly chosen 10 year old Portuguese child. In this case, the elementary outcomes are pairs of positive real numbers corresponding to the child's height and weight, and the sample space is Ω = R+ × R+.

To complete this set of examples, it should be remarked that examples (i), (ii) and (iii) have finite sample spaces, (iv) has a discrete (countably infinite) sample space, and (v) has a continuous (non-countable) sample space.

The next object that needs to be introduced for the study of probability theory is the notion of event. A naive definition would be to say that an event is a subset of the sample space Ω. However, some additional structure is needed to ensure that the collection of events is closed under set operations such as union, intersection and complement. Such structure is provided by the notion of σ-algebra.

Before moving on to define σ-algebra, we recall the definitions of the set operations mentioned above. Let A and B be subsets of Ω. The intersection of A and B, denoted by A ∩ B, is the subset of Ω whose elements belong simultaneously to A and B, i.e. it is the set

A ∩ B = {ω ∈ Ω : ω ∈ A and ω ∈ B} .

The union of A and B, denoted by A ∪ B, is the subset of Ω whose elements belong to A or B, i.e. it is the set

A ∪ B = {ω ∈ Ω : ω ∈ A or ω ∈ B} .

The complement of A in Ω, denoted by Ac or Ω\A, is the subset of Ω whose elements do not belong to A, i.e. it is the set

Ac = {ω ∈ Ω : ω ∉ A} .


Similarly, one can define the complement of A in B, denoted by B\A, as the subset of Ω whose elements belong to B and do not belong to A, i.e. it is the set

B\A = {ω ∈ B : ω ∉ A} .

We are now ready to define σ-algebra.

Definition 2.1.1 (σ-algebra). A collection F of subsets of Ω is called a σ-algebra if the following three properties hold:

1) Ω ∈ F ;

2) A ∈ F implies that Ac ∈ F ;

3) Ai ∈ F , i ≥ 1, implies that ∪_{i=1}^{∞} Ai ∈ F .

Elements of F are called measurable sets, or events.

A few more comments concerning the properties of a σ-algebra F are in order. It follows easily from the previous definition that:

(i) the empty set ∅ is an element of F ;

(ii) if A1, . . . , An ∈ F , then ∩_{i=1}^{n} Ai ∈ F ;

(iii) if A,B ∈ F , then B\A ∈ F .

The simplest examples of a σ-algebra for a sample space Ω are the trivial σ-algebra F = {∅, Ω} and the σ-algebra F containing all the subsets of Ω. However, depending on the random experiment, other non-trivial σ-algebras may be considered.

Definition 2.1.2 (Measurable space). A measurable space is a pair (Ω, F), where Ω is a space of elementary outcomes and F is a σ-algebra of subsets of Ω.

Having introduced the notion of a measurable space, we are now able to define what is meant by a probability measure. In short, a probability measure P on a measurable space (Ω, F) is a function that assigns real numbers in the interval [0, 1] to events A ∈ F . The rigorous definition is provided below.

Definition 2.1.3 (Probability measure). Let (Ω, F) be a measurable space. A probability measure on (Ω, F) is a function P : F → R for which the following properties hold:

(i) P (Ω) = 1;

(ii) P (A) ≥ 0 for every A ∈ F ;

(iii) for every sequence of events {Ci}i∈N ⊂ F such that Ci ∩ Cj = ∅ for i ≠ j we have

P (∪_{i=1}^{∞} Ci) = ∑_{i=1}^{∞} P (Ci) .

From the properties of a probability measure listed above, one can deduce several other well known properties. Let A, B ∈ F . We have that:

• P (∅) = 0;

• P (Ac) = 1− P (A);


• P (A ∪B) = P (A) + P (B)− P (A ∩B);

• P (B\A) = P (B)− P (A ∩B);

• if A ⊆ B, then P (A) ≤ P (B).
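These derived properties can be verified numerically on a small finite probability space; a minimal sketch, in which the uniform measure on a six-element set and the events A and B are illustrative choices, not taken from the text:

```python
# Numerical check of the probability identities above on a small finite
# probability space: a fair six-sided die with the power-set sigma-algebra.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Uniform probability measure: P(C) = |C| / |Omega|."""
    return Fraction(len(event), len(omega))

A = {1, 2, 3}          # "at most three points"
B = {2, 4, 6}          # "even outcome"

assert P(set()) == 0                               # P(empty) = 0
assert P(omega - A) == 1 - P(A)                    # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)          # inclusion-exclusion
assert P(B - A) == P(B) - P(A & B)                 # difference rule
assert P(A & B) <= P(A | B)                        # monotonicity
```

Exact rational arithmetic (`Fraction`) avoids any floating-point ambiguity in the checks.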

Definition 2.1.4 (Probability space). A probability space is a triplet (Ω, F , P ), where (Ω, F) is a measurable space and P is a probability measure on (Ω, F). If C ∈ F , the number P (C) is called the probability of the event C.

Example Recall the random experiments of the previous example.

(i) Single coin toss. The sample space is

Ω = {H, T} .

We can endow Ω with the σ-algebra F defined by

F = {∅, {H}, {T}, Ω}

and a probability measure P assigning probabilities P ({H}) = p and P ({T}) = q, with p + q = 1, to the elementary events of Ω. If the coin is balanced, then p = q = 1/2. Note that Ω admits only the two trivial choices of σ-algebra.

(ii) Two coin tosses. The sample space Ω given by

Ω = {HH, HT, TH, TT}

can be endowed with the σ-algebra F1 defined by

F1 = {∅, {HH}, {HT}, {TH}, {TT}, {HH, HT}, {HH, TH}, {HH, TT}, {HT, TH}, {HT, TT}, {TH, TT}, {HH, HT, TH}, {HH, HT, TT}, {HH, TH, TT}, {HT, TH, TT}, Ω} .

This is the largest possible σ-algebra for Ω. Other choices are possible and may be useful depending on the situation at hand. An alternative choice is

F2 = {∅, {HH, HT}, {TH, TT}, Ω} .

An easy exercise is to check that F1 and F2 are indeed σ-algebras of Ω and to think of two distinct situations modeled, respectively, by F1 and F2.
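For a finite Ω the σ-algebra axioms can be checked mechanically (closure under finite unions suffices in that case, since only finitely many distinct unions exist); a sketch along the lines of the exercise above, with F1 generated as the full power set rather than listed:

```python
from itertools import chain, combinations

def is_sigma_algebra(F, omega):
    """Check the three defining properties on a finite collection F of
    frozensets of outcomes from a finite sample space omega."""
    if frozenset(omega) not in F:
        return False                       # property 1: Omega in F
    for A in F:
        if frozenset(omega) - A not in F:
            return False                   # property 2: closed under complements
    for A in F:
        for B in F:
            if A | B not in F:
                return False               # property 3: closed under unions
    return True

omega = {"HH", "HT", "TH", "TT"}

# F1: the power set of Omega (all 16 subsets).
F1 = {frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))}

# F2: the coarser sigma-algebra from the text.
F2 = {frozenset(), frozenset({"HH", "HT"}), frozenset({"TH", "TT"}),
      frozenset(omega)}

print(is_sigma_algebra(F1, omega))  # True
print(is_sigma_algebra(F2, omega))  # True

# Dropping one set breaks closure under complements:
print(is_sigma_algebra(F2 - {frozenset({"TH", "TT"})}, omega))  # False
```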

We will now define a probability measure for each one of the σ-algebras above. We assume that the coin is balanced. For F1, define P1 : F1 → [0, 1] by

P1({HH}) = P1({HT}) = P1({TH}) = P1({TT}) = 1/4 .

For F2, define P2 : F2 → [0, 1] through

P2({HH, HT}) = P2({TH, TT}) = 1/2 .

Note that we have defined P1 and P2 by assigning probabilities to the smaller non-empty sets of F1 and F2, respectively. This is enough in this particular example. As an exercise, find the probabilities for the remaining events in F1. This should also provide a clue for the following two questions. Why could we proceed in this way in this example? Can you find an example for which this approach does not work?


(iii) Rolling a die with six faces. This example is very similar to the one in item (ii). As an exercise, find at least two distinct σ-algebras and the corresponding probability measures under the assumption that the die is properly balanced.

(iv) Rolling a die until the outcome “face 6” is realized. We have seen above that the sample space is given by

Ω = {“1”, “2”, “3”, . . . , “∞”} .

For a σ-algebra, one can take the set F of all subsets of Ω. Note that since Ω is a set with an infinite number of elements, the same must be the case for F . To endow the measurable space (Ω, F) with a probability measure, it is again enough to consider elementary events in F . A good exercise is to check that the map given by

P (“i”) = (1/6)(5/6)^{i−1} , i ∈ N

(v) The height of a randomly chosen 10 year old Portuguese child. As seen above, the sample space may be taken to be Ω = R+. The most common choice for a σ-algebra on Ω is the Borel σ-algebra, the σ-algebra generated by the open sets in Ω. Its construction is slightly more subtle than the constructions of the previous examples, being outside the scope of this short course. General probability measures on Ω are no longer obtained by assigning probabilities to elementary events of Ω. However, some examples of probability measures (distributions) suitable to model this kind of random experiment will be discussed below.

A sample space Ω is said to be discrete if it has a finite or countable number of elements. One can always endow Ω with the σ-algebra F consisting of all the subsets of Ω. As in some of the previous examples, to define a probability measure on (Ω, F) it is enough to assign probabilities to its elementary events. However, such probabilities must satisfy a couple of consistency conditions:

(i) P ({ω}) ≥ 0 for every ω ∈ Ω;

(ii) ∑_{ω∈Ω} P ({ω}) = 1.
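These consistency conditions can be checked numerically for the probabilities P (“i”) = (1/6)(5/6)^{i−1} of the “roll until face 6” experiment of example (iv); a minimal sketch, truncating the infinite sum (the truncation point is arbitrary, and the geometric series guarantees the tail is negligible):

```python
# Consistency conditions for a discrete sample space, checked for the
# geometric probabilities of the "roll a die until face 6" experiment.
def p(i):
    """Probability that the first 6 appears on roll number i."""
    return (1 / 6) * (5 / 6) ** (i - 1)

n = 500                                                  # truncation point
assert all(p(i) >= 0 for i in range(1, n + 1))           # condition (i)
total = sum(p(i) for i in range(1, n + 1))
assert abs(total - 1.0) < 1e-12                          # condition (ii)
```

The truncated sum equals 1 − (5/6)^n, so the error after 500 terms is far below machine precision.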

We will now introduce two important concepts describing relations between two events and their probabilities.

Definition 2.1.5 (Independent events). Let (Ω, F , P ) be a probability space and let A, B ∈ F . The events A and B are independent if

P (A ∩B) = P (A)P (B) .

From a heuristic point of view, two events are independent if the occurrence of one does not influence the probability of occurrence of the other. The trivial examples of independent events are the impossible event (empty set) and the certain event (full sample space), which are independent of every other event. For another very simple example, consider the random experiment consisting of two coin tosses. Clearly, the outcome of the first toss does not affect the outcome of the second toss. Thus, the event “the first toss outcome is tails” is independent of the event “the second toss outcome is tails”. As an exercise, think of two or three more examples of independent events in distinct random experiments.

It should now be remarked that independence is a property of the probability measure, not just of the events. For instance, two mutually exclusive (disjoint) events with positive probability are never independent. An easy exercise: why is this last statement true?

Definition 2.1.6 (Conditional probability). Let (Ω, F , P ) be a probability space and let A, B ∈ F be such that P (B) > 0. The conditional probability of A given B is

P (A|B) = P (A ∩ B) / P (B) .

The concept of conditional probability provides the following information: P (A|B) is the probability that the event A occurs given the extra information that the event B occurs. It enables one to reassess the probability of occurrence of an event when some additional information is obtained. Some further remarks are in order:

• Given B ∈ F such that P (B) > 0, the conditional probability P (·|B) is a probability measure on (Ω, F).

• If the events A,B ∈ F are independent and such that P (A) > 0 and P (B) > 0,we have P (A|B) = P (A) and P (B|A) = P (B).

• Whenever well defined, P (A|B) ≠ P (B|A) in general. Indeed, Bayes' theorem gives:

P (B|A) = P (A|B)P (B) / P (A) .

We conclude this section with a trivial example. Consider again the random experiment consisting of tossing a coin twice. Assume the coin is balanced. It is clear that the probability of obtaining two heads is 1/4. Now, assume that we have already tossed the coin once and that the outcome was heads. What can we say about the probability of obtaining two heads? This is a very simple but typical example of conditional probability. With the extra available information, the probability of obtaining two heads is now 1/2. As an exercise, think of some (non-independent) pairs of events and compute the corresponding conditional probabilities, always trying to obtain an interpretation for the results obtained.
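The two-toss computation above can be reproduced by direct enumeration of the sample space; a minimal sketch (the event names are illustrative):

```python
# Conditional probability by enumeration: two tosses of a balanced coin,
# all four outcomes equally likely.
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=2))    # [('H','H'), ('H','T'), ...]

def P(event):
    return Fraction(len(event), len(omega))

two_heads = {w for w in omega if w == ("H", "H")}
first_head = {w for w in omega if w[0] == "H"}

print(P(two_heads))                                   # 1/4
print(P(two_heads & first_head) / P(first_head))      # 1/2
```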

2.2 Random variables

A random variable is a map from the sample space Ω into the set of real numbers R (or RK, for some K ∈ N, in the case of random vectors) satisfying a measurability condition to be detailed below. Its introduction allows one to move the study of a given random experiment from the sample space Ω to the set of real numbers, which is more suitable for the mathematical treatment of the problem at hand.

Before giving a precise definition of a random variable, we need to introduce the concept of measurable function.

Definition 2.2.1 (Measurable function). Let (Ω, F) be a measurable space. A function f : Ω → R is F-measurable (or simply measurable) if for each a, b ∈ R we have

{ω ∈ Ω : a ≤ f(ω) < b} ∈ F .


It should be remarked that the definition of measurable function given above can be restated in a more general way. Start by noting that the definition above can be rewritten as follows: a function f : Ω → R is F-measurable if f−1([a, b)) ∈ F for each a, b ∈ R. This should be interpreted in the following way: the relevant requirement in the definition above is that the preimage of every set in the Borel σ-algebra of R is in F . More generally, we say that a function f : Ω → Ω′ between two measurable spaces (Ω, F) and (Ω′, F ′) is measurable if and only if for every set B ∈ F ′, its preimage f−1(B) is in F .

It is possible to check that linear combinations and products of measurable functions are again measurable functions. Moreover, if the sample space Ω is discrete and equipped with the σ-algebra consisting of all the subsets of Ω, then any real-valued function on Ω is measurable.
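Conversely, with a σ-algebra smaller than the power set, not every real-valued function is measurable. A small finite-space sketch (F2 below is the coarse σ-algebra {∅, {HH, HT}, {TH, TT}, Ω} of the earlier two-toss example; the checker itself is an illustration, not part of the text):

```python
# Measurability on a finite space: every preimage of a set of values
# must belong to the sigma-algebra F.
from itertools import chain, combinations

omega = ["HH", "HT", "TH", "TT"]
F2 = {frozenset(), frozenset({"HH", "HT"}), frozenset({"TH", "TT"}),
      frozenset(omega)}

def is_measurable(f, F):
    values = set(f(w) for w in omega)
    # On a finite space it suffices to check the preimage of every
    # subset of the (finite) range of f.
    for vs in chain.from_iterable(combinations(values, r)
                                  for r in range(len(values) + 1)):
        if frozenset(w for w in omega if f(w) in vs) not in F:
            return False
    return True

X1 = lambda w: 1 if w[0] == "H" else 0     # outcome of the first toss
Y = lambda w: w.count("H")                 # total number of heads

print(is_measurable(X1, F2))   # True:  e.g. preimage of {1} is {HH, HT}
print(is_measurable(Y, F2))    # False: preimage of {2} is {HH}, not in F2
```

Intuitively, F2 only "sees" the first toss, so the head count Y asks for more information than F2 contains.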

We are now ready to define what is meant by random variable.

Definition 2.2.2 (Random variable). A measurable function f : Ω → R defined on a probability space (Ω, F , P ) is called a random variable.

Similarly, a measurable function f : Ω → RK defined on a probability space (Ω, F , P ) is called a random vector. From a heuristic point of view, a random variable X : Ω → R identifies the sample space Ω with a subset of the real numbers X(Ω) ⊆ R. There are several advantages in this approach. First of all, the introduction of a random variable enables one to move the analysis of a given random experiment to the set of real numbers (or some other multidimensional Euclidean space), for which standard analytical techniques are readily available. Secondly, one is free to choose the most suitable random variable to analyse a particular random experiment. This is illustrated in the example below.

Example Recall the random experiments described above.

(i) Single coin toss. One possible random variable to assign to this random experiment would be

X(ω) = 1 if ω = H, and X(ω) = 0 if ω = T.

An easy exercise: argue that X is a random variable.

(ii) Several coin tosses. One can use the random variable of the previous item to examine this random experiment. The sample space for this modified random experiment is Ω = {H, T}^n, i.e. the space of sequences of two symbols (H and T) of length n:

{H, T}^n = {(ω1, ω2, . . . , ωn) : ωi ∈ {H, T}, i = 1, 2, . . . , n} .

Let Xi be the random variable on Ω defined by Xi(ω) = 1 if ωi = H (the i-th coin toss outcome is heads), and Xi(ω) = 0 if ωi = T (the i-th coin toss outcome is tails). We define another random variable Y : Ω → R by

Y (ω) = ∑_{i=1}^{n} Xi(ω) .

The random variable Y maps each sequence ω ∈ {H, T}^n to the number of heads observed in that sequence.


(iii) Rolling a die with six faces. Recall that the sample space for a single die roll is

Ω = {“face 1”, “face 2”, . . . , “face 6”} .

One can easily think about assigning the following random variable to this random experiment:

X(ω) = i if ω = “face i” .

Consider now the experiment of rolling the die twice. The sample space is Ω2 = {(ω1, ω2) : ω1, ω2 ∈ Ω} and one can consider the random vector Y : Ω2 → R2 defined by

Y (ω1, ω2) = (i, j) if ω1 = “face i” and ω2 = “face j” .

However, if one is only interested in the sum of points obtained in the two die rolls, we could also consider the random variable Z : Ω2 → R given by

Z(ω1, ω2) = i + j if ω1 = “face i” and ω2 = “face j” .

(iv) Rolling a die until the outcome “face 6” is realized. Exercise: think of two distinct non-trivial random variables or vectors that may be assigned to this random experiment.

(v) The height of a randomly chosen 10 year old Portuguese child. The most natural random variable to assign to this particular random experiment is the function whose image is the numerical value of the height of the child for some choice of measurement units.
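To make item (iii) concrete, the distribution of the sum Z of two die rolls can be computed by enumerating Ω2 and pushing the uniform measure forward through Z; a minimal sketch, assuming a balanced die:

```python
# Distribution of Z(w1, w2) = w1 + w2, the sum of two die rolls, obtained
# by enumerating the sample space Omega^2 (36 equally likely pairs).
from itertools import product
from fractions import Fraction

omega2 = list(product(range(1, 7), repeat=2))

def Z(w):
    return w[0] + w[1]

dist = {}
for w in omega2:
    dist[Z(w)] = dist.get(Z(w), Fraction(0)) + Fraction(1, len(omega2))

print(dist[7])             # 1/6, the most likely sum
print(dist[2])             # 1/36
print(sum(dist.values()))  # 1
```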

The next definition relates probability with the notion of random variable.

Definition 2.2.3 (Distribution function). Let (Ω, F , P ) be a probability space and let X be a random variable on (Ω, F , P ). The distribution function of the random variable X is the function FX : R → R given by

FX(x) = P ({ω ∈ Ω : X(ω) ≤ x}) , x ∈ R .

Note that there are two functions involved in the definition of the distribution function FX : the random variable X and the probability measure P . Hence, the distribution function carries with it information from both functions. Moreover, it is their joint use that enables the definition of the distribution function as a real function of a real variable. This is one particular feature of the distribution function that makes it so amenable to mathematical treatment. We list below some other properties satisfied by a distribution function of a random variable.

Theorem 2.2.4. Let (Ω, F , P ) be a probability space. If FX is the distribution function of a random variable X on (Ω, F , P ), then

1) FX is non-decreasing, that is FX(x) ≤ FX(y) if x ≤ y.

2) lim_{x→−∞} FX(x) = 0 and lim_{x→+∞} FX(x) = 1.

3) FX(x) is continuous from the right for every x ∈ R, that is:

lim_{y→x+} FX(y) = FX(x) .


Any function F : R → R which satisfies the three properties listed in the previous theorem is called a distribution function. Moreover, any distribution function defines a probability measure on the set of real numbers R endowed with the corresponding Borel σ-algebra.

The concept of distribution function is easily generalized to several dimensions.

Definition 2.2.5 (Distribution function of a random vector). Let (Ω, F , P ) be a probability space and let X : Ω → RK be a random vector on (Ω, F , P ). The distribution function of the random vector X is the function FX : RK → R given by

FX(x1, . . . , xK) = P ({ω ∈ Ω : X1(ω) ≤ x1, . . . , XK(ω) ≤ xK}) ,

where X1(ω), . . . , XK(ω) are the K components of X(ω).

Let x = (x1, . . . , xK) and y = (y1, . . . , yK) be two vectors in RK . We say that x ≤ y if and only if xi ≤ yi for every i ∈ {1, . . . , K}. Similarly to the one-dimensional case, a distribution function of a random vector has the following properties:

1) FX is non-decreasing with respect to the order in RK defined above, i.e.

FX(x) ≤ FX(y) if x ≤ y .

2) lim_{x→(−∞,...,−∞)} FX(x) = 0 and lim_{x→(+∞,...,+∞)} FX(x) = 1.

3) FX(x) is continuous from above for every x ∈ RK , that is:

lim_{y→x+} FX(y) = FX(x) .

As before, any function F : RK → R which satisfies the three properties listed above is called a distribution function.

We will now restrict our attention to two important classes of random variables: discrete and continuous. It is important to remark that these two classes are not exhaustive, i.e. there are random variables which do not fit into either of them. A complete treatment of this subject is based on the study of measure theory and is outside the scope of this course. We discuss the case of discrete random variables first.

Definition 2.2.6 (Discrete random variable and probability function). Let X : Ω → R be a random variable on a probability space (Ω, F , P ). The random variable X is said to be discrete if its distribution function FX : R → R is a step function with a countable number of discontinuities. Denote by DX the set of discontinuity points of FX and define a function fX : R → R through

fX(x) = FX(x) − lim_{y→x−} FX(y) if x ∈ DX , and fX(x) = 0 otherwise.

The function fX defined above is called the probability function of X.

The probability function fX of a discrete random variable X on a probability space (Ω, F , P ) has the following properties:

1) fX(x) > 0 for every x ∈ DX .

2) ∑_{x∈DX} fX(x) = 1.

3) FX(x) = ∑_{y∈DX : y≤x} fX(y).

Example Recall the examples discussed above.

(i) Single coin toss. Assume that the coin is balanced and consider the random variable X : Ω → R given by

X(ω) = 1 if ω = H, and X(ω) = 0 if ω = T.

Its distribution function is given by

FX(x) = 0 if x < 0, FX(x) = 1/2 if 0 ≤ x < 1, and FX(x) = 1 if x ≥ 1.

Thus, X is a discrete random variable with probability function given by

fX(x) = 1/2 if x = 0 or x = 1, and fX(x) = 0 otherwise.

(ii) n coin tosses. Similarly to what was done before, let Xi be the random variable on Ω = {H, T}^n defined by Xi(ω) = 1 if ωi = H and Xi(ω) = 0 if ωi = T , i ∈ {1, . . . , n}. Consider again the random variable Y : Ω → R defined by

Y (ω) = ∑_{i=1}^{n} Xi(ω) .

It should be clear that the random variable Y takes values in the set {0, 1, . . . , n} and is a discrete random variable. Exercise: why is this true?

Indeed, the random variable Y has a well-known distribution, that will be discussed in some more detail below. Assume that in a single coin toss the probability of obtaining heads is p, and the probability of obtaining tails is q = 1 − p. Then, the probability function of Y is known to be given by

fY (x) = C(n, x) p^x (1 − p)^{n−x} if x ∈ {0, 1, . . . , n}, and fY (x) = 0 otherwise,

where C(n, x) = n!/(x!(n − x)!) denotes the binomial coefficient. The corresponding distribution function is

FY (x) = ∑_{k=0}^{[x]} fY (k) ,

where [x] denotes the integer part of x.


(iv) Rolling a die until the outcome “face 6” is realized. We have seen before that the sample space for this random experiment is

Ω = {“1”, “2”, “3”, . . . , “∞”} .

One can define the random variable X : Ω → R given by

X(“i”) = i .

Under the assumption that the die is balanced, we have already computed the probabilities of the elementary events of Ω. Indeed, this determines the probability function

fX(x) = (1/6)(5/6)^{x−1} if x ∈ N, and fX(x) = 0 otherwise,

and the corresponding distribution function

FX(x) = ∑_{k=0}^{[x]} fX(k) .

This particular probability distribution is known as the geometric distribution.
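The geometric probability function above can also be checked against simulation; a sketch, in which the sample size and random seed are arbitrary choices:

```python
# Monte Carlo check of the geometric probability function: roll a fair
# die until the first "face 6" appears and record how many rolls it took.
import random

random.seed(0)

def rolls_until_six():
    n = 0
    while True:
        n += 1
        if random.randint(1, 6) == 6:
            return n

N = 200_000
counts = {}
for _ in range(N):
    k = rolls_until_six()
    counts[k] = counts.get(k, 0) + 1

# Compare empirical frequencies with fX(x) = (1/6)(5/6)^(x-1).
for x in range(1, 5):
    theoretical = (1 / 6) * (5 / 6) ** (x - 1)
    print(x, counts.get(x, 0) / N, round(theoretical, 4))
```

With 200 000 repetitions the empirical frequencies agree with the geometric probabilities to roughly two decimal places.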

We will now move on to the topic of continuous random variables.

Definition 2.2.7 (Continuous random variable and probability density). Let X : Ω → R be a random variable on a probability space (Ω, F , P ). The random variable X is said to be continuous if its distribution function is a continuous function and there exists a non-negative integrable function fX : R → R such that

FX(x) = ∫_{−∞}^{x} fX(t) dt .

The function fX is called the probability density of X.

The probability density fX of a continuous random variable X on a probability space (Ω, F , P ) has the following properties:

1) ∫_{−∞}^{+∞} fX(x) dx = 1.

2) P (X(ω) ≤ a) = ∫_{−∞}^{a} fX(x) dx.

Example Two examples of probability distributions of continuous random variables.

(i) The height of a randomly chosen 10 year old Portuguese child. There is not a unique probability distribution model for this random experiment. One simple choice for the distribution of this random variable is the normal distribution with properly chosen mean and variance. However, even though this may usually be a good approximation, it has the inconvenience of assigning a small positive probability to negative values of height.


(ii) Waiting time for some event to happen. Consider the following random experiment: given some piece of electronic equipment, measure the time it works before any malfunction. It should be clear that the sample space may be taken as Ω = R_0^+, or even Ω = R for simplicity of treatment. Take for σ-algebra the Borel σ-algebra of R. Define a random variable X : Ω → R by

X(it takes x units of time before malfunction) = x .

A standard choice for the distribution of this random variable is the exponential distribution. This is a one-parameter family of distributions with probability density given by

f(x;λ) = λe^{−λx} if x ≥ 0, and f(x;λ) = 0 if x < 0,

where λ > 0. The corresponding distribution function is given by

F(x;λ) = 1 − e^{−λx} if x ≥ 0, and F(x;λ) = 0 if x < 0.

The parameter λ is related to the average waiting time before a malfunction, which is equal to 1/λ.
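Exponential samples can be drawn with the inverse-transform method: if U is uniform on (0, 1), then −ln(1 − U)/λ has exactly the distribution function F(x; λ) above. A minimal sketch (the helper name is ours):

```python
import math
import random

def sample_exponential(lam, rng):
    """Inverse-transform sampling: F^{-1}(u) = -ln(1 - u)/lam."""
    u = rng.random()
    return -math.log(1.0 - u) / lam

lam = 2.0
rng = random.Random(42)
samples = [sample_exponential(lam, rng) for _ in range(100_000)]

# Theory: mean waiting time 1/lam = 0.5 and F(1; 2) = 1 - exp(-2).
mean = sum(samples) / len(samples)
below_one = sum(s <= 1.0 for s in samples) / len(samples)
```

Both the empirical mean and the empirical value of F(1; 2) match the closed-form expressions to within simulation noise.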

Before moving on, we remark that the notion of distribution function of a random variable, as well as those of probability function and density function, are easily extended to the multidimensional setup of random vectors.

Definition 2.2.8 (Discrete random vector and probability function). Let X : Ω → R^K be a random vector on a probability space (Ω,F , P ). The random vector X = (X1, . . . , XK) is said to be discrete if each one of its components Xi, i ∈ {1, . . . ,K}, is a discrete random variable. The function fX : R^K → R given by

fX(x1, . . . , xK) = P ({ω ∈ Ω : X(ω) = (x1, . . . , xK)})

is called the probability function of X.

Note that the set of points DX = {(x1, . . . , xK) ∈ R^K : fX(x1, . . . , xK) > 0} is at most countable. For the continuous case, we have the following definition.

Definition 2.2.9 (Continuous random vector and probability density). Let X : Ω → R^K be a random vector on a probability space (Ω,F , P ). The random vector X is said to be continuous if its distribution function is a continuous function and there exists a non-negative integrable function fX : R^K → R such that

FX(x1, . . . , xK) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xK} fX(t1, . . . , tK) dtK . . . dt1 .

The function fX is called the probability density of X.

We will now introduce the concept of mathematical expectation, followed by a brief discussion of some other parameters useful for the description of probability distributions.


Definition 2.2.10 (Mathematical Expectation). Let X : Ω → R be a random variable on a probability space (Ω,F , P ).

• If X is a discrete random variable, its mathematical expectation E[X] is

E[X] = ∑_{x∈DX} x fX(x) .

• If X is a continuous random variable, its mathematical expectation E[X] is

E[X] = ∫_{−∞}^{+∞} x fX(x) dx .

Note that the mathematical expectation of a given random variable may not be finite, i.e. the sum or integral defining it may be divergent. If finite, the mathematical expectation provides a measure of localization for the distribution of a random variable X. Other terminology includes: expected value, mean, and mean value.

Proposition 2.2.11 (Properties of the Mathematical Expectation). Let X and Y be random variables with finite mathematical expectation on a probability space (Ω,F , P ) and let a, b, c be real numbers. The following properties hold:

1) If X(ω) = c for every ω ∈ Ω, then E[X] = c.

2) The mathematical expectation of the random variable aX + bY is finite and

E[aX + bY ] = aE[X] + bE[Y ] .

3) If X ≥ 0 then E[X] ≥ 0.

4) If a ≤ X ≤ b then a ≤ E[X] ≤ b.

5) Markov Inequality (sometimes called Chebyshev's first inequality): If X ≥ 0, then for each a > 0 we have

P (X ≥ a) ≤ E[X]/a .

Let X be a random variable on a probability space (Ω,F , P ) and let Y = g(X), for some measurable function g : R → R. Then, Y is also a random variable on (Ω,F , P ). If X is a discrete random variable, then Y is also discrete. In the case where X is a continuous random variable, Y is not necessarily continuous. The mathematical expectation of Y = g(X) is given by

E[g(X)] = ∑_{x∈DX} g(x) fX(x)

in the case where X is a discrete random variable and

E[g(X)] = ∫_{−∞}^{+∞} g(x) fX(x) dx

in the case where both X and Y are continuous random variables.

The definition above can be extended to functions of random vectors. Let X : Ω → R^K be a random vector on a probability space (Ω,F , P ) and let Y = g(X) for some measurable function g : R^K → R. Then, if X is discrete,

E[g(X)] = ∑_{(x1,...,xK)∈DX} g(x1, . . . , xK) fX(x1, . . . , xK)


and if X and Y are both continuous,

E[g(X)] = ∫_{−∞}^{+∞} · · · ∫_{−∞}^{+∞} g(x1, . . . , xK) fX(x1, . . . , xK) dx1 . . . dxK .

In the case of measurable functions g : R^K → R^N of a random vector X : Ω → R^K, one defines

E[g(X)] = (E[g1(X)], . . . , E[gN (X)]) ,

where the functions gi : R^K → R, i ∈ {1, . . . , N}, are the N components of g.
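The formulas above suggest a simple Monte Carlo approximation of E[g(X)]: average g over independent samples of X. A sketch under our own choice of example, X uniform on [0, 1] and g(x) = x², for which E[g(X)] = 1/3:

```python
import random

def monte_carlo_expectation(g, sampler, n, rng):
    """Approximate E[g(X)] by the average of g over n independent samples."""
    return sum(g(sampler(rng)) for _ in range(n)) / n

rng = random.Random(7)
# X uniform on [0, 1]; E[g(X)] = integral of x^2 over [0, 1] = 1/3.
estimate = monte_carlo_expectation(lambda x: x * x,
                                   lambda r: r.random(),
                                   200_000, rng)
```

The estimator converges at rate O(1/√n) by the Law of Large Numbers discussed later in these notes.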

Based on the definition of mathematical expectation, we will now introduce ameasure of dispersion – the variance. The variance measures how much a probabilitydistribution spreads around its expected value.

Definition 2.2.12 (Variance). Let X be a random variable (or random vector) on a probability space (Ω,F , P ). The variance of X, denoted by Var[X], is equal to

Var[X] = E[(X − E[X])^2] .

As with the mathematical expectation of a random variable, the variance may also not be finite. However, if Var[X] is finite, then so is E[X], but the reciprocal statement is not true.

Proposition 2.2.13 (Properties of the Variance). Let X and Y be random variables with finite mathematical expectation on a probability space (Ω,F , P ) and let a, b, c be real numbers. The following properties hold:

1) Var[X] is finite if and only if E[X^2] is finite. In this case,

Var[X] = E[X^2] − (E[X])^2 .

2) If X(ω) = c for every ω ∈ Ω, then Var[X] = 0.

3) If Var[X] is finite, the variance of aX + b is finite and

Var[aX + b] = a2Var[X] .

4) If a ≤ X ≤ b then Var[X] ≤ (b − a)^2/4.

5) Chebyshev Inequality: If Var[X] is finite, then for each a > 0 we have

P (|X − E[X]| ≥ a) ≤ Var[X]/a^2 .
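Chebyshev's inequality can be verified exactly for a small discrete distribution. The pmf below is our own toy example (not from the text); using exact rational arithmetic, every tail probability stays below the bound Var[X]/a²:

```python
from fractions import Fraction

# A toy pmf on {0, 1, 2, 3} with E[X] = 1 and Var[X] = 1 (exact).
pmf = {0: Fraction(2, 5), 1: Fraction(3, 10),
       2: Fraction(1, 5), 3: Fraction(1, 10)}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

def tail(a):
    """P(|X - E[X]| >= a), computed exactly from the pmf."""
    return sum(p for x, p in pmf.items() if abs(x - mean) >= a)

# Chebyshev: P(|X - E[X]| >= a) <= Var[X]/a^2 for every a > 0.
ok = all(tail(a) <= var / Fraction(a) ** 2 for a in (1, 2, 3))
```

The bound is typically far from tight (here tail(2) = 1/10 against a bound of 1/4), which is the usual price of its generality.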

There are other interesting parameters describing the shape of a distribution,such as the skewness and kurtosis parameters. More information on this subjectcan be easily found in the references.

We will now introduce two quantities used to measure the linear dependence between two random variables: the covariance and the correlation coefficient, the latter being a dimensionless normalization of the covariance. If larger values of one random variable correspond to larger values of the second one, then the covariance is positive. On the other hand, if larger values of one random variable correspond to smaller values of the second one, then the covariance is negative.


Definition 2.2.14 (Covariance). Let X and Y be random variables on a probability space (Ω,F , P ). The covariance of X and Y , denoted by Cov[X,Y ], is equal to

Cov[X,Y ] = E [(X − E[X]) (Y − E[Y ])] .

Some properties of the covariance are listed below.

Proposition 2.2.15 (Properties of the Covariance). Let X and Y be random variables with finite variance on a probability space (Ω,F , P ) and let a, b, c, d be real numbers. The following properties hold:

1) Cov[aX + b, cY + d] = ac Cov[X,Y ].

2) (Cov[X,Y ])2 ≤ Var[X]Var[Y ].

3) Cov[X,Y ] = E[XY ]− E[X]E[Y ].

Definition 2.2.16 (Correlation coefficient). Let X and Y be random variables with non-zero variance on a probability space (Ω,F , P ). The correlation coefficient of X and Y , denoted by ρ[X,Y ], is equal to

ρ[X,Y ] = Cov[X,Y ] / √(Var[X] Var[Y ]) .

Some properties of the correlation coefficient are listed below.

Proposition 2.2.17 (Properties of the correlation coefficient). Let (Ω,F , P ) be a probability space and let X and Y be random variables on (Ω,F , P ) with non-zero variance. Let a, b, c, d be real numbers. The following properties hold:

1) −1 ≤ ρ[X,Y ] ≤ 1.

2) If ac > 0, then ρ[aX + b, cY + d] = ρ[X,Y ].

3) Var[X + Y ] = Var[X] + 2Cov[X,Y ] + Var[Y ].

To end this section, we introduce and discuss the concept of independence of random variables.

Definition 2.2.18 (Independent random variables). Let X and Y be random variables on a probability space (Ω,F , P ). We say that X and Y are independent random variables if for every a, b ∈ R the events {ω ∈ Ω : X(ω) ≤ a} and {ω ∈ Ω : Y (ω) ≤ b} are independent.

The definition above can be given in terms of σ-algebras, as we now explain. Let FX ⊆ F be the smallest sub-σ-algebra of F that makes X a measurable function. Define FY ⊆ F in a similar fashion. Then, X and Y are independent if for any A ∈ FX and any B ∈ FY , the events A and B are independent. The notion of independence may also be expressed in terms of distribution or probability functions.

Theorem 2.2.19. Let X and Y be random variables on a probability space (Ω,F , P ). Denote by FX and FY the distribution functions of X and Y , respectively, and by fX and fY the corresponding probability functions (densities in the continuous case). The following statements are equivalent:

(i) X and Y are independent.


(ii) the distribution function of the random vector (X,Y ) is such that

F(X,Y )(x, y) = FX(x)FY (y)

for every x, y ∈ R.

(iii) the probability function (density) of the random vector (X,Y ) is such that

f(X,Y )(x, y) = fX(x)fY (y)

for every x, y ∈ R.

The following properties also hold.

Theorem 2.2.20. Let X and Y be independent random variables on a probability space (Ω,F , P ). Then:

(i) E[XY ] = E[X]E[Y ].

(ii) Cov[X,Y ] = 0.

(iii) Var[X + Y ] = Var[X] + Var[Y ].
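Theorem 2.2.20 can be illustrated with two independent balanced dice: enumerating the 36 equally likely ordered pairs exactly, E[XY] factorizes, the covariance vanishes, and the variance of the sum is the sum of the variances. A sketch using exact rational arithmetic:

```python
from itertools import product
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 36)  # each of the 36 ordered pairs is equally likely

E_X = Fraction(sum(faces), 6)                            # 7/2
E_XY = sum(x * y * p for x, y in product(faces, faces))
cov = E_XY - E_X * E_X                                   # zero by independence

E_X2 = Fraction(sum(x * x for x in faces), 6)
var_X = E_X2 - E_X ** 2                                  # 35/12

# Var[X + Y] computed by enumeration must equal Var[X] + Var[Y].
E_sum2 = sum((x + y) ** 2 * p for x, y in product(faces, faces))
var_sum = E_sum2 - (2 * E_X) ** 2
```

Because the enumeration is exact, the identities of the theorem hold with equality, not just approximately.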

2.3 Discrete distributions

This section is devoted to a brief description of some of the most common probability distributions for discrete random variables.

2.3.1 Discrete uniform

The discrete uniform distribution on n points, n > 1, is a probability distribution under which n distinct outcomes have equal probability. This distribution may be used to describe random experiments such as tossing a balanced coin or rolling a balanced die.

Definition 2.3.1 (Discrete uniform distribution). We say that the random variable X follows a discrete uniform distribution on the set of points {x1, . . . , xn}, and denote it by X ∼ U({x1, . . . , xn}), if its probability function is

f(x) = 1/n if x ∈ {x1, . . . , xn}, and f(x) = 0 otherwise.

The expected value and variance of a random variable X ∼ U({x1, . . . , xn}) are

E[X] = (1/n) ∑_{i=1}^{n} xi ,  Var[X] = (1/n) ∑_{i=1}^{n} xi^2 − (E[X])^2 .
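For a balanced die, X ∼ U({1, . . . , 6}), the two formulas give E[X] = 7/2 and Var[X] = 35/12. The computation can be sketched exactly as:

```python
from fractions import Fraction

points = [1, 2, 3, 4, 5, 6]
n = len(points)

# E[X] = (1/n) sum x_i ;  Var[X] = (1/n) sum x_i^2 - E[X]^2
mean = Fraction(sum(points), n)
var = Fraction(sum(x * x for x in points), n) - mean ** 2
```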

2.3.2 Bernoulli

The Bernoulli distribution is used to describe random experiments with two possible complementary outcomes: “success” (with probability p) when one given event A is observed and “failure” (with probability 1 − p) when that same event is not observed. Such random experiments are called Bernoulli trials. One example of a Bernoulli trial: rolling a die to obtain “face 6”. The outcome “success” is observing “face 6”, while the outcome “failure” corresponds to the observation of any of the other five faces.

The sample space for the class of random experiments described above is Ω = {S, F}, where S denotes the outcome “success” and F denotes the outcome “failure”. Define the random variable

X(ω) = 1 if ω = S, and X(ω) = 0 if ω = F.

Definition 2.3.2 (Bernoulli distribution). We say that X follows a Bernoulli distribution, and denote it by X ∼ Bi(1, p), if its probability function is

f(x) = p if x = 1, f(x) = 1 − p if x = 0, and f(x) = 0 otherwise.

Note that the Bernoulli distribution is a one-parameter family of distributions depending on p, the probability of the outcome “success”. The expected value and variance of a random variable X ∼ Bi(1, p) are

E[X] = p , Var[X] = p(1− p) .

2.3.3 Binomial

The Binomial distribution describes the probability distribution of a random variable associated with the following random experiment: count how many outcomes “success” are observed in a repetition of n ≥ 1 independent Bernoulli trials with success probability p ∈ (0, 1). Examples of this kind of random experiment include the number of “heads” observed in several coin tosses, or the number of “face 6” observed in several die rolls.

The sample space for the class of random experiments described above is Ω = {“0”, “1”, . . . , “n”}. Define the random variable

X(“i”) = i .

Definition 2.3.3 (Binomial distribution). We say that X follows a Binomial distribution with parameters n and p, and denote it by X ∼ Bi(n, p), if its probability function is

f(x) = C(n, x) p^x (1 − p)^{n−x} if x ∈ {0, 1, . . . , n}, and f(x) = 0 otherwise,

where C(n, x) = n!/(x! (n − x)!) denotes the binomial coefficient.

Note that the Binomial distribution is a two-parameter family of distributions depending on n, the number of repetitions of a given Bernoulli trial, and p, the probability of the outcome “success” in each Bernoulli trial. The Bernoulli distribution may be seen as a particular case of the Binomial distribution when the number of repetitions is n = 1. We list below some other properties of the Binomial family of random variables.


• The expected value and variance of X ∼ Bi(n, p) are

E[X] = np , Var[X] = np(1− p) .

• If X ∼ Bi(n, p) and Y ∼ Bi(m, p) are independent random variables, then X + Y ∼ Bi(n + m, p).

• If X1, . . . , Xn ∼ Bi(1, p) is a sequence of independent Bernoulli random variables, then

∑_{i=1}^{n} Xi ∼ Bi(n, p) .
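The last property is easy to see in simulation: summing n independent Bernoulli(p) draws produces a Bi(n, p) variable, whose empirical mean and variance should match np and np(1 − p). A minimal sketch (function name ours):

```python
import random
import statistics

def binomial_draw(n, p, rng):
    """Sum of n independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

n, p = 10, 0.3
rng = random.Random(1)
samples = [binomial_draw(n, p, rng) for _ in range(50_000)]

emp_mean = statistics.fmean(samples)       # theory: np = 3.0
emp_var = statistics.pvariance(samples)    # theory: np(1 - p) = 2.1
```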

2.3.4 Negative Binomial

The Negative Binomial distribution describes the probability distribution of a random variable associated with the following random experiment: count how many outcomes “success” are observed in a repetition of independent Bernoulli trials with success probability p ∈ (0, 1) before a specified number n ≥ 1 of “failures” occurs. An example of this kind of random experiment: the number of times that the faces “1”, “2”, “3”, “4” and “5” are observed before “face 6” is observed twice (note that “face 6” corresponds to the outcome “failure” in this example).

The sample space for the class of random experiments described above may be identified with Ω = N0. Define the random variable

X(number i of “successes” before n “failures”) = i .

Definition 2.3.4 (Negative Binomial distribution). We say that X follows a Negative Binomial distribution with parameters n and p, and denote it by X ∼ NB(n, p), if its probability function is

f(x) = C(x + n − 1, x) p^x (1 − p)^n if x ∈ N0, and f(x) = 0 otherwise,

where C(a, b) denotes the binomial coefficient.

The Negative Binomial distribution is a two-parameter family of distributions depending on n, the number of “failures” to be observed in a sequence of Bernoulli trials, and p, the probability of the outcome “success” in each Bernoulli trial. It should be remarked that the definition given above may be extended to positive real values of n. The geometric distribution with parameter p coincides, up to a shift by one unit, with the Negative Binomial distribution NB(1, 1 − p). Some other properties of the Negative Binomial distribution:

• The expected value and variance of X ∼ NB(n, p) are

E[X] = np/(1 − p) ,  Var[X] = np/(1 − p)^2 .

• If X ∼ NB(n, p) and Y ∼ NB(m, p) are independent random variables, then X + Y ∼ NB(n + m, p).

• If X1, . . . , Xn ∼ NB(1, p) is a sequence of independent Negative Binomial random variables, then

∑_{i=1}^{n} Xi ∼ NB(n, p) .


2.3.5 Hypergeometric

The Hypergeometric distribution describes the probability distribution of a random variable associated with the following random experiment: count how many outcomes “success” are observed in n draws from a finite population of size N without replacement. Note that the Binomial distribution corresponds to a similar random experiment with replacement after each trial. An example of this kind of random experiment is the number of red cards observed in several draws from a deck of 52 cards.

Let N > 1 denote the population size, m ∈ {0, 1, . . . , N} denote the number of elements of the population identified with the outcome “success” and n ∈ {0, 1, . . . , N} denote the number of draws without replacement taken from the population. The sample space for this class of random experiments is

Ω = {max{0, n + m − N}, . . . , min{n,m}} .

Define the random variable

X(“i successes observed”) = i .

Definition 2.3.5 (Hypergeometric distribution). We say that X follows a Hypergeometric distribution with parameters N , n and m, and denote it by X ∼ H(N,n,m), if its probability function is

f(x) = C(m, x) C(N − m, n − x) / C(N, n) if x ∈ {max{0, n + m − N}, . . . , min{n,m}}, and f(x) = 0 otherwise,

where C(a, b) denotes the binomial coefficient.

The Hypergeometric distribution is a three-parameter family of distributions. The expected value and variance of X ∼ H(N,n,m) are

E[X] = nm/N ,  Var[X] = nm(N − m)(N − n) / (N^2(N − 1)) .
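The pmf and the mean formula can be evaluated directly with binomial coefficients. The example below is ours: red cards in 5 draws from a standard deck (N = 52, m = 26, n = 5), computed exactly with `math.comb` and rational arithmetic:

```python
from math import comb
from fractions import Fraction

def hypergeom_pmf(N, n, m, x):
    """P(X = x) = C(m, x) C(N - m, n - x) / C(N, n)."""
    if not (max(0, n + m - N) <= x <= min(n, m)):
        return Fraction(0)
    return Fraction(comb(m, x) * comb(N - m, n - x), comb(N, n))

N, n, m = 52, 5, 26
support = range(max(0, n + m - N), min(n, m) + 1)

total = sum(hypergeom_pmf(N, n, m, x) for x in support)   # must equal 1
mean = sum(x * hypergeom_pmf(N, n, m, x) for x in support)
# Formula: E[X] = n m / N = 5 * 26 / 52 = 5/2.
```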

2.3.6 Poisson

The Poisson distribution describes the probability distribution of a random variable associated with the following random experiment: count how many occurrences of a given event are observed during a fixed interval of time or space, under the assumption that these events occur with a known average rate and independently of the time or space elapsed since the last occurrence. Examples of this kind of random experiment: the number of ships entering a port over an hour, and the number of telephone calls going through an exchange during a five-minute interval.

The sample space for this class of random experiments may be identified with Ω = N0. Define the random variable

X(number i of occurrences) = i .


Definition 2.3.6 (Poisson distribution). We say that X follows a Poisson distribution with parameter λ > 0, and denote it by X ∼ Po(λ), if its probability function is

f(x) = λ^x e^{−λ}/x! if x ∈ N0, and f(x) = 0 otherwise.

The Poisson distribution is a one-parameter family of distributions depending on λ – the average rate of occurrence per unit of time or space of the particular event under observation. Further properties of the Poisson distribution:

• The expected value and variance of X ∼ Po(λ) are

E[X] = λ , Var[X] = λ .

• If X ∼ Po(λ1) and Y ∼ Po(λ2) are independent random variables, then X + Y ∼ Po(λ1 + λ2).

• Let X1, . . . , Xn be n independent Bernoulli trials. Assume that the probability of success pn for the n Bernoulli trials depends on n in such a way that lim_{n→∞} n pn = λ > 0. Then

lim_{n→∞} P (∑_{i=1}^{n} Xi = x) = λ^x e^{−λ}/x! ,

that is, the random variable ∑_{i=1}^{n} Xi is asymptotically Poisson distributed.
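The limit in the last bullet can be observed numerically: for large n and pn = λ/n, the Bi(n, pn) probabilities are pointwise close to the Poisson ones. A sketch with λ = 2 and n = 1000 (our own choice of values):

```python
from math import comb, exp, factorial

lam = 2.0
n = 1000
p = lam / n

def binom_pmf(n, p, x):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(lam, x):
    return lam**x * exp(-lam) / factorial(x)

# Maximum pointwise difference over the first few values of x.
max_diff = max(abs(binom_pmf(n, p, x) - poisson_pmf(lam, x))
               for x in range(10))
```

Already for n = 1000 the two pmfs agree to roughly three decimal places.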

2.4 Continuous distributions

This section is devoted to a brief description of some of the most common probability distributions for continuous random variables.

2.4.1 Uniform

The Uniform distribution on an interval [a, b] ⊂ R describes the probability distribution of a random variable with outcomes on a bounded interval of R, assigning equal probability to subintervals of [a, b] of the same length. This distribution is particularly useful for producing pseudo-random sequences of numbers in a computer.

Before proceeding to the definition of the continuous uniform distribution, we define the indicator function of a set A as

χA(x) = 1 if x ∈ A, and χA(x) = 0 if x ∉ A.

Definition 2.4.1 (Uniform distribution). We say that a random variable X follows a Uniform distribution on the interval [a, b], with a, b ∈ R, a < b, and denote it by X ∼ U([a, b]), if its probability density is

f(x) = (1/(b − a)) χ_{[a,b]}(x) .

The expected value and variance of a random variable X ∼ U([a, b]) are

E[X] = (a + b)/2 ,  Var[X] = (b − a)^2/12 .


2.4.2 Normal

The Normal distribution, with its famous bell-shaped probability density, is a central element of probability theory and statistics. Its importance is closely related to the following asymptotic property of the distribution of sums of random variables: under rather mild conditions, the distribution of such sums converges to a Normal distribution as the number of terms in the sum grows. This is the content of the central limit theorem, to be discussed below in more detail.

Definition 2.4.2 (Normal distribution). We say that a random variable X follows a Normal distribution with parameters µ ∈ R and σ^2 > 0, and denote it by X ∼ N(µ, σ^2), if its probability density is

f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)} .

The Normal distribution is a two-parameter family of distributions depending on its expected value µ and variance σ^2. One notable feature of the Normal distribution is that it is rather amenable to analytical treatment. This is mainly due to a number of properties, including:

• The expected value and variance of X ∼ N(µ, σ^2) are

E[X] = µ ,  Var[X] = σ^2 .

• The probability density function of the Normal distribution is symmetric with respect to its expected value.

• If X ∼ N(µ, σ^2), then

(X − µ)/σ ∼ N(0, 1) .

The distribution N(0, 1) is called the standard normal distribution.

• Let X1, . . . , Xn be a sequence of normally distributed random variables such that for each i, j ∈ {1, . . . , n} we have

E[Xi] = µi ,  Var[Xi] = σi^2 ,  Cov[Xi, Xj ] = σij .

Then, for every choice of constants α1, . . . , αn ∈ R, the random variable ∑_{i=1}^{n} αiXi is normally distributed, i.e. we have that

∑_{i=1}^{n} αiXi ∼ N(µ, σ^2) ,

where

µ = ∑_{i=1}^{n} αiµi ,  σ^2 = ∑_{i=1}^{n} αi^2 σi^2 + 2 ∑_{i=1}^{n} ∑_{j>i} αiαjσij .

• A particular consequence of the last statement is that if X1, . . . , Xn ∼ N(µ, σ^2) are independent random variables, then

∑_{i=1}^{n} Xi ∼ N(nµ, nσ^2)

and

(1/n) ∑_{i=1}^{n} Xi ∼ N(µ, σ^2/n) .


2.4.3 Exponential

The Exponential distribution describes the probability distribution of the random variable associated with the following random experiment: measure the interval of time between two consecutive occurrences of a given event, under the assumption that such an event occurs with a known average rate and independently of the time elapsed since the last occurrence.

Definition 2.4.3 (Exponential distribution). We say that a random variable X follows an Exponential distribution with parameter λ > 0, and denote it by X ∼ Ex(λ), if its probability density is

f(x) = λe^{−λx} χ_{R_0^+}(x) .

The Exponential distribution is a one-parameter family of distributions depending on λ – the average rate of occurrence per unit of time or space of the particular event under observation. Clearly, it is closely related to the Poisson distribution. Further properties of the Exponential distribution:

• The expected value and variance of X ∼ Ex(λ) are

E[X] = 1/λ ,  Var[X] = 1/λ^2 .

• If X ∼ Ex(λ), then

P (X > x + y | X > y) = P (X > x)

for every x, y > 0. This is known as the memorylessness property.

2.4.4 Gamma

The Gamma distribution is an important two-parameter family of continuous probability distributions, which includes the exponential distribution, the chi-square distribution, and the Erlang distribution as particular cases.

Before proceeding to the definition of the Gamma distribution, we define the gamma function Γ : R+ → R by

Γ(α) = ∫_{0}^{+∞} e^{−t} t^{α−1} dt .

The gamma function generalizes the factorial to positive real numbers:

• Γ(1) = 1.

• If n ∈ N, then Γ(n) = (n− 1)!.

• Γ(α) = (α− 1)Γ(α− 1), α > 1.

Definition 2.4.4 (Gamma distribution). We say that a random variable X follows a Gamma distribution with parameters λ, α > 0, and denote it by X ∼ Gamma(α, λ), if its probability density is

f(x) = (λ^α e^{−λx} x^{α−1} / Γ(α)) χ_{R_0^+}(x) .


Some properties of the Gamma distribution:

• The expected value and variance of X ∼ Gamma(α, λ) are

E[X] = α/λ ,  Var[X] = α/λ^2 .

• If X ∼ Gamma(α1, λ) and Y ∼ Gamma(α2, λ) are independent random variables, then X + Y ∼ Gamma(α1 + α2, λ).

• The case α = 1 corresponds to the exponential distribution.

• The case α = n ∈ N corresponds to the Erlang distribution (which generalizes the exponential distribution). It describes the following random experiment: measure the interval of time elapsed before the observation of n consecutive occurrences of a given event, under the assumption that such an event occurs with a known average rate and independently of the time elapsed since the last occurrence.
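The Erlang case can be checked by simulation: a sum of n independent Ex(λ) waiting times has mean n/λ and variance n/λ², in line with the Gamma(n, λ) formulas above. A sketch with n = 3 and λ = 2 (values chosen by us):

```python
import random
import statistics

def erlang_draw(n, lam, rng):
    """Sum of n independent Exponential(lam) inter-arrival times."""
    return sum(rng.expovariate(lam) for _ in range(n))

n, lam = 3, 2.0
rng = random.Random(3)
samples = [erlang_draw(n, lam, rng) for _ in range(100_000)]

emp_mean = statistics.fmean(samples)     # theory: alpha/lambda = 3/2
emp_var = statistics.pvariance(samples)  # theory: alpha/lambda^2 = 3/4
```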

2.4.5 Chi-square

The Chi-square distribution is yet another member of the Gamma family of probability distributions. It plays a prominent role in statistical inference.

Definition 2.4.5 (Chi-square distribution). We say that a random variable X follows a Chi-square distribution with n degrees of freedom, and denote it by X ∼ χ2(n), if its probability density is

f(x) = (e^{−x/2} x^{n/2−1} / (2^{n/2} Γ(n/2))) χ_{R_0^+}(x) .

Some properties of the Chi-square distribution:

• A Chi-square distribution with n degrees of freedom is a Gamma distribution of the form Gamma(n/2, 1/2).

• The expected value and variance of X ∼ χ2(n) are

E[X] = n , Var[X] = 2n .

• If X ∼ χ2(n1) and Y ∼ χ2(n2) are independent random variables, then X + Y ∼ χ2(n1 + n2).

• If X ∼ N(0, 1), then X2 ∼ χ2(1).

• If X1, . . . , Xn ∼ N(0, 1) are independent random variables, then

∑_{i=1}^{n} Xi^2 ∼ χ2(n) .

• If X1, . . . , Xn ∼ N(µ, σ^2) are independent random variables, then

∑_{i=1}^{n} ((Xi − µ)/σ)^2 ∼ χ2(n) .
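The last two properties can be checked by simulation: squaring and summing standardized normal draws gives a variable whose empirical mean is close to n and whose empirical variance is close to 2n. A sketch with n = 4 (parameters chosen by us):

```python
import random
import statistics

n = 4
rng = random.Random(5)

def chi2_draw(n, rng, mu=1.0, sigma=2.0):
    """Sum of n squared standardized N(mu, sigma^2) draws."""
    return sum(((rng.gauss(mu, sigma) - mu) / sigma) ** 2 for _ in range(n))

samples = [chi2_draw(n, rng) for _ in range(100_000)]
emp_mean = statistics.fmean(samples)     # theory: E[X] = n = 4
emp_var = statistics.pvariance(samples)  # theory: Var[X] = 2n = 8
```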


2.4.6 Student’s t

The Student's t distribution is related to the Normal and Chi-square distributions. Its important role in statistical inference is due to the following construction. Let U, V be independent random variables with distributions U ∼ N(0, 1) and V ∼ χ2(n). Then, the random variable

U/√(V/n)

follows a Student's t distribution with n degrees of freedom.

Definition 2.4.6 (Student's t distribution). We say that a random variable X follows a Student's t distribution with n degrees of freedom, and denote it by X ∼ t(n), if its probability density is

f(x) = (Γ((n + 1)/2) / (√(nπ) Γ(n/2))) (1 + x^2/n)^{−(n+1)/2} .

Some properties of the Student’s t distribution:

• The expected value and variance of X ∼ t(n) are

E[X] = 0 (for n > 1) ,  Var[X] = n/(n − 2) (for n > 2) .

• The probability density is symmetric with respect to its expected value.

• The particular case n = 1 is known as the Cauchy distribution, which has no finite expected value.

• As n → ∞, the Student's t probability density converges to the standard normal probability density.

2.4.7 Snedecor’s F distribution

Snedecor's F distribution is widely used in statistical inference. Let U, V be independent random variables with distributions U ∼ χ2(m) and V ∼ χ2(n). Then, the random variable

(U/m)/(V/n)

follows a Snedecor's F distribution with m and n degrees of freedom.

Definition 2.4.7 (Snedecor's F distribution). We say that a random variable X follows a Snedecor's F distribution with m and n degrees of freedom, and denote it by X ∼ F (m,n), if its probability density is

f(x) = (1/B(m/2, n/2)) (m/n)^{m/2} x^{m/2−1} (1 + (m/n)x)^{−(m+n)/2} χ_{R_0^+}(x) ,

where B denotes the Beta function:

B(x, y) = ∫_{0}^{1} t^{x−1}(1 − t)^{y−1} dt ,  x, y > 0 .

Some properties of the Snedecor’s F distribution:


• The expected value and variance of X ∼ F (m,n) are

E[X] = n/(n − 2) , for n > 2,

and

Var[X] = 2n^2(m + n − 2) / (m(n − 2)^2(n − 4)) , for n > 4 .

• If X ∼ F (m,n), then 1/X ∼ F (n,m).

2.5 The Law of Large Numbers

Let X1, X2, . . . be a sequence of random variables with finite mathematical expectation on a probability space (Ω,F , P ). Denote the mathematical expectation of each Xi, i ∈ N, by µi = E[Xi]. Moreover, denote by X̄n and Mn the following averages:

X̄n = (1/n) ∑_{i=1}^{n} Xi ,  Mn = (1/n) ∑_{i=1}^{n} µi .

Theorem 2.5.1 (Law of large numbers). A sequence {Xi}_{i∈N} of independent identically distributed random variables with finite mathematical expectation µ = E[Xi] satisfies the Law of Large Numbers, i.e.

P (|X̄n − µ| > ε) → 0 as n → ∞, for any ε > 0.

Indeed, it is possible to drop the assumption that the random variables {Xi}_{i∈N} are identically distributed. If {Xi}_{i∈N} is a sequence of independent random variables with uniformly bounded variance, i.e. there exists σ > 0 such that Var[Xi] < σ^2 for all i ∈ N, then the Law of Large Numbers still holds:

P (|X̄n − Mn| > ε) ≤ σ^2/(ε^2 n) → 0 as n → ∞, for any ε > 0.

In the particular case of a sequence of homogeneous independent trials, the Law of Large Numbers states that typical realizations are such that the frequency with which an event occurs is close to the probability of that event. More precisely, let Ω = {x1, . . . , xk} be a finite set with a probability measure P and let

pj = P (xj) , 1 ≤ j ≤ k .

Denote by νj^n the number of occurrences of the event xj in a sequence of n independent trials. Then, for each j ∈ {1, . . . , k} we have that

P (|νj^n/n − pj| < ε) → 1 as n → ∞.

Finally, we remark that the Law of Large Numbers above concerns the convergence in probability of the sequence of random variables {X̄n}_{n∈N}. Under stronger sets of hypotheses, stronger results hold, such as the Strong Law of Large Numbers, which ensures that the convergence holds pointwise for almost every ω ∈ Ω.
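The frequency interpretation can be watched in simulation: the relative frequency of the event “face 6” in n rolls of a balanced die approaches 1/6 as n grows. A minimal sketch (function name ours):

```python
import random

rng = random.Random(11)

def freq_of_six(n, rng):
    """Relative frequency of the event 'face 6' in n die rolls."""
    return sum(rng.randint(1, 6) == 6 for _ in range(n)) / n

freqs = {n: freq_of_six(n, rng) for n in (100, 10_000, 1_000_000)}
err_large = abs(freqs[1_000_000] - 1 / 6)
```

The typical deviation shrinks like 1/√n, which is precisely the scale at which the Central Limit Theorem of the next section operates.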


2.6 The Central Limit Theorem

The central limit theorem is a key result in probability theory. It partially explains the huge relevance that the Normal distribution has in mathematics. Very roughly, it states that the fluctuations of the average

X̄n = (1/n) ∑_{i=1}^{n} Xi

of a sequence of independent and identically distributed random variables {Xi}_{i∈N} around its expected value are, after suitable rescaling, asymptotically normally distributed, no matter what the distribution of the individual random variables was.

We state below one version of the Central Limit Theorem. The assumptions of the theorem can be relaxed to obtain stronger versions.

Theorem 2.6.1 (Central Limit Theorem). Let X1, X2, . . . be a sequence of independent identically distributed random variables with finite mean µ = E[Xi] and finite variance 0 < σ^2 = Var[Xi], i ∈ N. Then, the distribution functions of the random variables

Zn = (X̄n − µ)/(σ/√n)

converge to the distribution function of the standard normal distribution N(0, 1) as n → ∞.

For a sequence of independent Bernoulli trials with success probability p, we obtain the following result.

Theorem 2.6.2 (de Moivre-Laplace theorem). Let X1, X2, . . . be a sequence of independent Bernoulli distributed random variables with mean p = E[Xi]. Then, the distribution functions of the random variables

Zn = (∑_{i=1}^{n} Xi − np)/√(np(1 − p))

converge to the distribution function of the standard normal distribution N(0, 1) as n → ∞.

A consequence of the de Moivre-Laplace theorem is that for large enough n, the Normal distribution provides a good approximation to the Binomial distribution.
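This approximation can be checked numerically: for Bi(n, p) with large n, P(X ≤ k) is close to Φ((k − np)/√(np(1 − p))), where Φ is the standard normal distribution function. A sketch with n = 1000 and p = 1/2 (values chosen by us), using the error function to evaluate Φ:

```python
from math import comb, erf, sqrt

def binom_cdf(n, p, k):
    """Exact P(X <= k) for X ~ Bi(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def std_normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 1000, 0.5
k = 520
exact = binom_cdf(n, p, k)
approx = std_normal_cdf((k - n * p) / sqrt(n * p * (1 - p)))
gap = abs(exact - approx)
```

The residual gap is mostly due to the missing continuity correction; replacing k by k + 1/2 in the argument of Φ shrinks it further.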


3 Stochastic Processes

A stochastic process is a collection of random variables {Xt}_{t∈T} on a probability space (Ω,F , P ) whose index t ∈ T is usually referred to as “time” when T is a subset of R. In such a case, stochastic processes describe the evolution with “time” of random phenomena. In this section we will provide the key definitions in this subject, and give a brief overview of the main properties of the following families of stochastic processes: the Poisson process, Markov processes and Brownian motion.

3.1 Basic properties

We start by providing the precise definition of a stochastic process.

Definition 3.1.1 (Stochastic process). A stochastic process is a collection of random variables X = {Xt : t ∈ T} on a probability space (Ω,F , P ) taking values in a measurable space (Π,G), indexed by a parameter t in a totally ordered set T. The space (Ω,F , P ) is called the sample space, while (Π,G) is called the state space. For a fixed sample point ω ∈ Ω, the function t → Xt(ω), t ∈ T, is the sample path of the process X associated with ω.

For simplicity of exposition, we will assume that the state space is (Rd,B(Rd)),where B(Rd) is the Borel σ-algebra of Rd. If the set T is a subset of R, usually N,Z, R+ or R, we think of the index t ∈ T as time.

Note that implicit in the definition of a stochastic process X = {Xt : t ∈ T} as a collection of (R^d,B(R^d))-valued random variables on (Ω,F , P ) is the assumption that each random variable Xt is F-measurable (as discussed in the section concerning random variables). However, since X is a function of the pair of variables (t, ω) ∈ T × Ω, it is convenient to have joint measurability properties.

Definition 3.1.2 (Measurable stochastic process). Let X = {Xt : t ∈ T} be a stochastic process on the probability space (Ω,F , P ). The stochastic process X is called measurable if for every A ∈ B(R^d) the set {(t, ω) : Xt(ω) ∈ A} belongs to B(T) ⊗ F, i.e.

(t, ω) → Xt(ω) : (T × Ω, B(T) ⊗ F) → (R^d, B(R^d))

is a measurable function.

The temporal feature of a stochastic process suggests a flow of time, in which, at every moment t ∈ T, we can talk about the past, present and future. In particular, we can ask how much an observer of the process knows about it at the present time, as compared to how much was known at some point in the past or will be known at some point in the future. The notion of σ-algebra is used in the study of stochastic processes to keep track of information as time evolves, through the introduction of a filtration – a nested sequence of σ-algebras.

From now on, we assume that our sample space (Ω,F) is equipped with a filtration.

Definition 3.1.3 (Filtration). A filtration on a measurable space (Ω,F) is a non-decreasing family {Ft : t ∈ T} of sub-σ-algebras of F, i.e. Fs ⊂ Ft ⊂ F for every s, t ∈ T such that s < t.


If T is an infinite set, we define F∞ = σ(⋃_{t∈T} Ft) to be the smallest σ-algebra containing ⋃_{t∈T} Ft.

The concept of measurability for a stochastic process introduced before is still rather weak. The introduction of a filtration {Ft} enables us to use more interesting and useful concepts.

Definition 3.1.4 (Adapted stochastic process). Let X = {Xt : t ∈ T} be a stochastic process on the probability space (Ω, F, P). The stochastic process X is adapted to the filtration Ft if, for every t ∈ T, Xt is an Ft-measurable random variable.

For a given stochastic process X = {Xt : t ∈ T} on a probability space (Ω, F, P), the simplest choice for a filtration is the filtration generated by the process itself:

F^X_t = σ(Xs : s ∈ T, s ≤ t) ,

the smallest σ-algebra with respect to which Xs is measurable for every s ∈ T such that s ≤ t. It should be noted that every stochastic process X is adapted to the filtration F^X_t.

A filtration can be seen as representing the flow of information. The σ-algebra F^X_t contains only the events that "can happen up to time t", i.e. when A ∈ F^X_t, an observer of X during the time period [0, t] knows whether or not the event A has occurred up to time t, but not after time t. Thus, an adapted process is one that "does not look into the future".

We will introduce a general notion of independence that will be useful in the sequel.

Definition 3.1.5 (Independent σ-algebras). Let (Ω, F, P) be a probability space and let F1, F2, ..., Fn be sub-σ-algebras of F. The finite set of sub-σ-algebras F1, F2, ..., Fn is independent if for any set of events Ai ∈ Fi, i = 1, ..., n, we have that

P(A1 ∩ A2 ∩ ... ∩ An) = P(A1)P(A2)...P(An) .

An arbitrary set S of σ-algebras is mutually independent if any finite subset of S is independent.

The above definition is a generalization of the notions of independence for events and random variables:

• Events B1, ..., Bn ∈ F are mutually independent if the sub-σ-algebras σ(Bi) := {∅, Bi, Ω − Bi, Ω} are mutually independent.

• Random variables X1, ..., Xn defined on (Ω, F, P) are mutually independent if the sub-σ-algebras σ(Xi) = {X_i^{-1}(B) : B ∈ B(Rd)} are mutually independent.

• In general, mutual independence among events Bi, random variables Xj and σ-algebras Fk means the mutual independence among σ(Bi), σ(Xj) and Fk.

Similarly to what was done before for events, one can define probability and mathematical expectation conditioned on a σ-algebra of events.

Definition 3.1.6 (Conditional expectation with respect to a σ-algebra). Let X be a random variable on the probability space (Ω, F, P) with values in R^d, and let G be a sub-σ-algebra of F. The conditional expectation of X given G is denoted by E[X|G] and defined as the random variable on the probability space (Ω, G, P) satisfying

E[E[X|G] χA] = E[X χA] , for all A ∈ G.
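On a finite probability space the defining property above pins E[X|G] down concretely: when G is generated by a partition of Ω, the conditional expectation is constant on each cell of the partition, equal to the P-weighted average of X over that cell. The following sketch (in Python with NumPy; the fair-die example is our own illustration, not from the text) verifies the defining identity E[E[X|G]χA] = E[XχA]:

```python
import numpy as np

# Sample space: 6 equally likely outcomes (a fair die), X = face value.
omega = np.arange(6)
p = np.full(6, 1 / 6)
X = omega + 1.0  # X(ω) = face value 1..6

# Sub-σ-algebra G generated by the partition {even faces}, {odd faces}.
partition = [np.array([1, 3, 5]), np.array([0, 2, 4])]  # indices into omega

# E[X|G] is constant on each cell, equal to the P-weighted average of X there.
cond_exp = np.empty_like(X)
for cell in partition:
    cond_exp[cell] = np.sum(X[cell] * p[cell]) / np.sum(p[cell])

# Check the defining property E[ E[X|G] χ_A ] = E[ X χ_A ] for each cell A ∈ G.
for cell in partition:
    lhs = np.sum(cond_exp[cell] * p[cell])
    rhs = np.sum(X[cell] * p[cell])
    assert abs(lhs - rhs) < 1e-12

print(cond_exp)  # constant on each cell: 4 on even faces, 3 on odd faces
```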


Let X, Y be random variables with finite mathematical expectation on the probability space (Ω, F, P) and let α, β ∈ R. The conditional expectation with respect to the sub-σ-algebra G ⊂ F has the following properties:

• E[αX + βY |G] = αE[X|G] + βE[Y |G].

• E[E[X|G]] = E[X].

• E[X|G] = X if X is G-measurable.

• E[X|G] = E[X] if X is independent of G.

• E[Y X|G] = Y E[X|G] if Y is G-measurable.

• If H is a sub-σ-algebra of F such that G ⊂ H ⊂ F, then

E[E[X|H]|G] = E[X|G] .

Building on the concept of conditional expectation, we define conditional probability with respect to a σ-algebra.

Definition 3.1.7 (Conditional probability with respect to a σ-algebra). Let (Ω, F, P) be a probability space, A ∈ F an event and G a sub-σ-algebra of F. The conditional probability of A given G is the conditional expectation of the indicator function of A, χA, given the sub-σ-algebra G, that is

P[A|G] = E[χA|G] .

Similarly, we define the conditional probability of A given a random variable X on (Ω, F) as

P[A|X] = E[χA|σ(X)] ,

where σ(X) is the σ-algebra generated by the random variable X.

We now define a martingale: a stochastic process whose expected future state, given the present, is its current state.

Definition 3.1.8 (Martingale). Let X = {Xt : t ∈ T} be a stochastic process defined on a probability space (Ω, F, P), adapted to a filtration Ft. Furthermore, assume that E|Xt| < ∞ for all t ∈ T. The process X is a martingale if, for every s, t ∈ T such that t ≥ s, we have

E[Xt|Fs] = Xs .
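As a simple sanity check of the martingale property (our own illustration, not from the notes), consider the symmetric simple random walk Sn = X1 + · · · + Xn with P(Xi = ±1) = 1/2, which is a martingale with respect to its own filtration. A Monte Carlo estimate of E[St | Ss = x] should be close to x:

```python
import numpy as np

# Simulate many paths of a symmetric simple random walk, then estimate
# E[S_t | S_s = x] by averaging S_t over the paths with S_s = x.
rng = np.random.default_rng(0)

n_paths, s, t = 200_000, 5, 15
steps = rng.choice([-1, 1], size=(n_paths, t))
paths = np.cumsum(steps, axis=1)
S_s, S_t = paths[:, s - 1], paths[:, t - 1]

# Condition on a few values of S_s; the conditional mean should equal S_s.
cond_means = {x: S_t[S_s == x].mean() for x in (-3, -1, 1, 3)}
for x, m in cond_means.items():
    print(f"E[S_15 | S_5 = {x:+d}] ≈ {m:+.3f}")  # should be close to x
```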

In the next sections we will discuss some special examples of stochastic processes.

3.2 Poisson Process

Fix a probability space (Ω, F, P) and an event A ∈ F and count the number of times A occurs during an interval of time [0, t], t > 0. Think of examples such as the arrival of ships to a port, the arrival of customers to a store, or the arrival of phone calls to a call center, during a certain period of time.

Denote by X1 the time between t = 0 and the first occurrence of the event A, by X2 the time between the first and the second occurrences of the event A, and so on. Assume that the random variables X1, X2, . . . are independent and identically distributed. Additionally, assume that if the event A has not occurred until time t > 0, then the distribution of the time remaining until the next occurrence of A is the same as the distribution of each of the Xi, that is

P (Xi − t ∈ B|Xi > t) = P (Xi ∈ B)

for any Borel set B ∈ B(R). It is possible to show that if a positive, unbounded random variable satisfies this memorylessness property, then it has an exponential distribution.

Define a stochastic process N : [0, ∞) × Ω → R by

Nt(ω) = sup{ n ∈ N : ∑_{i≤n} Xi(ω) ≤ t } .

Note that Nt(ω) is equal to the number of occurrences of the event A until time t > 0. Thus, we obtain that Nt(ω) ∈ {0, 1, 2, . . .} for every t > 0 and every ω ∈ Ω.

Assume that the event A occurs with a known average rate λ > 0 per unit time. We have that:

• for each fixed t > 0, Nt is a random variable with Poisson distribution Po(λt);

• the waiting time for Nt to increase by one unit is a random variable with exponential distribution Ex(λ);

• the waiting time for Nt to increase by n units is a random variable with Gamma distribution Gamma(n, λ).

The formal definition of a Poisson process is the following.

Definition 3.2.1 (Poisson process). A stochastic process N = {Nt : t ∈ [0, ∞)} on a probability space (Ω, F, P) is called a Poisson process with parameter λ > 0 if the following properties hold:

1) N0 = 0 almost surely, i.e. P({ω ∈ Ω : N0(ω) = 0}) = 1.

2) Nt is a process with independent increments, i.e. for 0 ≤ t1 ≤ . . . ≤ tk the random variables N_{t1}, N_{t2} − N_{t1}, . . . , N_{tk} − N_{tk−1} are independent.

3) For any 0 ≤ s < t < ∞, the random variable Nt − Ns has a Poisson distribution with parameter λ(t − s).

We now list some further properties of a Poisson process:

• the sample path t → Nt is piecewise constant and continuous from the right;

• at points of discontinuity, the sample path jumps have unit size, i.e.

Nt − lim_{s→t−} Ns ∈ {0, 1}

for all t > 0;

• the stochastic process Nt − λt is a martingale.
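The construction above translates directly into a simulation: draw i.i.d. Ex(λ) interarrival times and count how many arrival epochs fall in [0, t]. The sketch below (parameter values are illustrative) checks that Nt has mean and variance λt and that the compensated process Nt − λt has mean close to zero:

```python
import numpy as np

# Simulate a Poisson process with rate λ by summing i.i.d. exponential
# interarrival times X_i ~ Ex(λ): N_t = sup{ n : X_1 + ... + X_n ≤ t }.
rng = np.random.default_rng(1)

lam, t, n_paths = 2.0, 10.0, 100_000

# Generate more interarrivals than can plausibly be needed (λt = 20),
# then count the arrival epochs falling in [0, t].
max_jumps = 60
arrivals = np.cumsum(rng.exponential(1 / lam, size=(n_paths, max_jumps)), axis=1)
N_t = (arrivals <= t).sum(axis=1)

print("mean of N_t:", N_t.mean())            # theory: λt = 20
print("variance of N_t:", N_t.var())         # theory: λt = 20 (Poisson)
print("mean of N_t - λt:", (N_t - lam * t).mean())  # compensated process: ≈ 0
```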

An interesting situation, not yet covered by the notion of Poisson process as defined above, is the case of a stochastic process which shares most properties of the Poisson process except for the size of the increments, which are now allowed to be non-unitary. This is the case of a compound Poisson process. This may be used to model the amount of goods sold in a shop over a given period of time, or the number of containers arriving at a port on a given day.


Let Nt be a Poisson process with parameter λ > 0 on a probability space (Ω, F, P) and let {Ji}_{i∈N} be a sequence of independent and identically distributed random variables on the same probability space with distribution function F. Assume that Nt and the random variables Ji are independent. The stochastic process Y on (Ω, F, P) given by

Yt(ω) = ∑_{i=1}^{Nt(ω)} Ji(ω)

is called a compound Poisson process with rate λ and jump distribution F. It satisfies the following properties:

• Y0 = 0 almost surely.

• Yt has independent increments.

• the number of increments of Yt is Poisson distributed.

• the waiting time for the next increment is exponentially distributed.

• the size of the increments follows a probability distribution determined by thedistribution function F .
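A compound Poisson value Yt can be sampled by first drawing the Poisson number of jumps Nt and then summing that many independent draws from F. A sketch (the normal jump distribution is an illustrative choice, not from the text); the printed moments follow Wald's identities E[Yt] = λt·E[J] and Var[Yt] = λt·E[J²]:

```python
import numpy as np

# Compound Poisson process at a fixed time t: a Poisson(λt) number of jumps,
# each drawn independently from the jump distribution F (here N(1, 0.5²)).
rng = np.random.default_rng(2)

lam, t, n_paths = 3.0, 5.0, 100_000
mu_J, sd_J = 1.0, 0.5

N_t = rng.poisson(lam * t, size=n_paths)          # number of jumps by time t
Y_t = np.array([rng.normal(mu_J, sd_J, n).sum() for n in N_t])

print("mean of Y_t:", Y_t.mean())      # theory: λt·μ_J = 15
print("variance of Y_t:", Y_t.var())   # theory: λt·(μ_J² + σ_J²) = 18.75
```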

3.3 Brownian motion

Brownian motion was first observed in 1827 by the botanist Robert Brown when looking through a microscope at pollen grains suspended in water. He observed that the pollen grains seemed to move with a certain degree of randomness. This behaviour may be modeled by a stochastic process that is also known as Brownian motion.

The first person to obtain a mathematical model for Brownian motion seems to have been Louis Bachelier, a student of Henri Poincaré, in 1900. Bachelier's interest in Brownian motion was connected with the time evolution of financial assets. In 1905, Albert Einstein studied Brownian motion from a physical perspective. The rigorous definition and the first mathematical proof of the existence of Brownian motion are due to the American mathematician Norbert Wiener in the early 1920s.

Definition 3.3.1 (Standard, one-dimensional Brownian motion). A stochastic process B = {Bt : 0 ≤ t < ∞} on a probability space (Ω, F, P) adapted to a filtration Ft is called a standard, one-dimensional Brownian motion if the following properties hold:

1) the sample paths of B are continuous functions of t for almost all ω ∈ Ω;

2) B0 = 0 almost surely;

3) for 0 ≤ s < t, the increment Bt −Bs is independent of Fs;

4) for 0 ≤ s < t, the increment Bt − Bs is normally distributed with mean zeroand variance t− s.

Analogously, we can define a Brownian motion B = {Bt : 0 ≤ t ≤ T} on the interval [0, T], for some T > 0.

If B is a Brownian motion and 0 = t0 < t1 < ... < tn < ∞, then the increments B_{tj} − B_{tj−1}, j = 1, . . . , n, are independent and the distribution of B_{tj} − B_{tj−1} depends on tj and tj−1 only through the difference tj − tj−1: it is normal with mean zero and variance tj − tj−1. We say that B has stationary, independent increments.
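Property 4) of the definition gives a direct way to sample Brownian paths on a grid: cumulatively sum independent N(0, Δt) increments. A sketch (illustrative parameters) checking that B_T has mean 0 and variance T:

```python
import numpy as np

# Sample Brownian paths on a uniform grid by cumulatively summing independent
# normal increments B_{t_j} - B_{t_{j-1}} ~ N(0, t_j - t_{j-1}).
rng = np.random.default_rng(3)

T, n_steps, n_paths = 1.0, 500, 10_000
dt = T / n_steps

increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
B = np.cumsum(increments, axis=1)        # values at times dt, 2dt, ..., T (B_0 = 0)

print("mean of B_T:", B[:, -1].mean())      # theory: 0
print("variance of B_T:", B[:, -1].var())   # theory: T = 1
```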


Note that the filtration Ft is a key part in the definition of Brownian motion. However, if we are given {Bt : 0 ≤ t < ∞} but no filtration, and if we know that B has stationary independent increments and that Bt − B0 is normal with mean zero and variance t, we can take F^B_t = σ(Bs : 0 ≤ s ≤ t) as our filtration. If Ft turns out to be "larger" than F^B_t (in the sense that F^B_t ⊂ Ft for all t ≥ 0) and if Bt − Bs is independent of Fs whenever 0 ≤ s < t, then {Bt : 0 ≤ t < ∞} is still a Brownian motion with respect to Ft.

Definition 3.3.2 (d-dimensional Brownian motion with initial distribution µ). Let d be a positive integer, let (Ω, F, P) be a probability space equipped with a filtration Ft, and let µ be a probability measure on (Rd, B(Rd)). A stochastic process B = {Bt : t ≥ 0} on (Ω, F, P) with values in R^d and adapted to Ft is called a d-dimensional Brownian motion with initial distribution µ if the following properties hold:

1) the sample paths of B are continuous functions of t for almost all ω ∈ Ω;

2) P [B0 ∈ Γ] = µ(Γ), for all Γ ∈ B(Rd);

3) for 0 ≤ s < t, the increment Bt −Bs is independent of Fs;

4) for 0 ≤ s < t, the increment Bt − Bs is normally distributed with mean zeroand covariance matrix equal to (t − s)Id, where Id denotes the d × d identitymatrix.

If µ assigns measure one to some singleton {x}, we say that B is a d-dimensional Brownian motion starting at x.

The following properties hold:

• Standard Brownian motion is a martingale.

• Nowhere differentiability: for almost every ω ∈ Ω, the Brownian sample path Bt(ω) is nowhere differentiable with respect to t.

• Strong Law of Large Numbers:

lim_{t→∞} Bt/t = 0 almost surely.

• Equivalence transformations. If B = {Bt : 0 ≤ t < ∞} is a standard Brownian motion, so are the processes obtained from the following equivalence transformations:

– Scaling: X = {Xt : 0 ≤ t < ∞} defined for c > 0 by

Xt = (1/√c) B_{ct} , 0 ≤ t < ∞ .

– Time-inversion: Y = {Yt : 0 ≤ t < ∞} defined by

Yt = t B_{1/t} , 0 < t < ∞ , Y0 = 0 .

– Time-reversal: Z = {Zt : 0 ≤ t ≤ T} defined for T > 0 by

Zt = B_T − B_{T−t} , 0 ≤ t ≤ T .

– Symmetry: −B = {−Bt : 0 ≤ t < ∞}.


3.4 Lévy processes

Lévy processes form a large family of stochastic processes which includes, for instance, Brownian motion and the Poisson process. A Lévy process is an adapted stochastic process with independent and stationary increments which is continuous in probability. The formal definition is given below.

Definition 3.4.1 (Lévy process). A stochastic process X = {Xt : 0 ≤ t < ∞} on a probability space (Ω, F, P) adapted to a filtration Ft is called a Lévy process if the following properties hold:

1) for 0 ≤ s < t, the increment Xt −Xs is independent of Fs;

2) X has stationary increments, that is, Xt − Xs has the same distribution as X_{t−s}, 0 ≤ s < t;

3) X is continuous in probability, that is, for all ε > 0,

lim_{s→t} P({ω ∈ Ω : |Xs(ω) − Xt(ω)| ≥ ε}) = 0 .

Examples of Lévy processes:

• A Poisson process with intensity λ > 0.

• A compound Poisson process with intensity λ > 0 and jump distribution F.

• Let B be a standard Brownian motion, µ ∈ R and σ > 0. The Brownian motion with drift defined by

Xt = µt + σBt

is a Lévy process. Indeed, Brownian motion with linear drift is the only Lévy process with continuous sample paths.

• A jump-diffusion process such as

Xt = µt + σBt + Jt ,

where Bt is a standard Brownian motion and Jt is a compound Poisson process.
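Combining the previous examples, a jump-diffusion can be simulated at a fixed time t by adding an independent Gaussian part and a compound Poisson part. A sketch (all parameter values, including the N(−0.2, 0.3²) jump law, are illustrative):

```python
import numpy as np

# Jump-diffusion X_t = µt + σB_t + J_t at a fixed time t: an independent
# Gaussian diffusion part plus a compound Poisson jump part.
rng = np.random.default_rng(4)

mu, sigma, lam, t, n_paths = 0.5, 1.0, 2.0, 4.0, 100_000
jump_mean, jump_sd = -0.2, 0.3

B_t = rng.normal(0.0, np.sqrt(t), size=n_paths)
n_jumps = rng.poisson(lam * t, size=n_paths)
J_t = np.array([rng.normal(jump_mean, jump_sd, n).sum() for n in n_jumps])
X_t = mu * t + sigma * B_t + J_t

# E[X_t] = µt + λt·E[J];  Var[X_t] = σ²t + λt·E[J²] by independence.
print("mean of X_t:", X_t.mean())     # theory: 0.5·4 + 2·4·(-0.2) = 0.4
print("variance of X_t:", X_t.var())  # theory: 4 + 8·(0.2² + 0.3²) = 5.04
```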

3.5 Markov Processes

A Markov process is a stochastic process with the property that the probability of a future event conditioned on the entire past history of the process is equal to the probability of that same future event conditioned only on the present state of the process, i.e. given the present state of the system, its future and past are independent. Examples of Markov processes include the Poisson process and Brownian motion.

The formal definition of a Markov process is the following.

Definition 3.5.1 (Markov process). Let (Ω, F, P) be a probability space with a filtration Ft and let µ be a probability measure on B(Rd). An adapted stochastic process X = {Xt : t ∈ [0, +∞)} with values in R^d is called a Markov process with initial distribution µ if:

1) P (X0 ∈ A) = µ(A) for any A ∈ B(Rd).


2) If s, t > 0 and A ∈ B(Rd), then

P (Xs+t ∈ A|Fs) = P (Xs+t ∈ A|Xs) almost surely.

It is also useful to introduce the concept of a Markov family.

Definition 3.5.2 (Markov family). Let (Ω, F, P) be a probability space with a filtration Ft, µ be a probability measure on B(Rd) and X^x = {X^x_t : t ∈ [0, +∞)}, x ∈ R^d, be a family of processes with values in R^d which are adapted to the filtration Ft. This family of processes is called a time-homogeneous Markov family if:

1) The function p : R+_0 × R^d × B(Rd) → R defined by

p(t, x, A) = P(X^x_t ∈ A)

is Borel-measurable as a function of x ∈ R^d for any t > 0 and any set A ∈ B(Rd).

2) P(X^x_0 = x) = 1 for any x ∈ R^d.

3) If s, t > 0, x ∈ R^d, and A ∈ B(Rd), then

P(X^x_{s+t} ∈ A|Fs) = p(t, X^x_s, A) almost surely.

The function p(t, x, A) introduced in item 1) of the previous definition is called the transition function for the Markov family X^x_t. It has the following properties:

• For fixed t ≥ 0 and x ∈ Rd, p(t, x, A) is a probability measure on B(Rd).

• For fixed t ≥ 0 and A ∈ B(Rd), p(t, x, A) is a measurable function of x ∈ Rd.

• p(0, x, {x}) = 1 for every x ∈ R^d.

• If s, t ≥ 0, x ∈ R^d, and A ∈ B(Rd), then

p(s + t, x, A) = ∫_{R^d} p(t, y, A) p(s, x, dy)

(the Chapman–Kolmogorov equation).

For an example of a Markov family, consider the process X^x_t = x + Bt, where x ∈ R^d and Bt is a standard Brownian motion in R^d.

A special case of Markov processes arises when the stochastic process takes values in a discrete set of states. This particular class of Markov processes is known as Markov chains. We will discuss the case of continuous-time Markov chains, before considering the case of (discrete-time) Markov chains.

Definition 3.5.3 (Continuous-time Markov chain). Let (Ω, F, P) be a probability space with a filtration Ft. An adapted stochastic process X = {Xt : t ∈ [0, +∞)} with values in a discrete set S = {x1, x2, x3, . . .} is called a continuous-time Markov chain if for every s, t ≥ 0 and states i, j, x(u) ∈ S, with u < s, we have

P (Xs+t = j |Xs = i,Xu = x(u), 0 ≤ u < s) = P (Xs+t = j |Xs = i) .

If, in addition, P(Xs+t = j |Xs = i) is independent of s, then the Markov chain is said to have stationary or homogeneous transition probabilities.

Note that:


• the amount of time a Markov chain spends in a state i ∈ S before making a transition to a different state is exponentially distributed with some parameter λi depending on the state.

• the amount of time a Markov chain spends in a state i, and the next state visited, are independent random variables.

• when the Markov chain leaves a state i ∈ S, it will enter state j ∈ S with some probability pij such that

∑_{j≠i} pij = 1 .

Example A simple example of a continuous-time Markov chain is provided by birth and death processes. Take for state space the set S = N0 and let qij denote the transition rate from state i to state j:

qij = λi pij ,

where λi is the rate at which the process leaves state i ∈ S, and pij is the probability that the process goes to state j ∈ S from state i ∈ S. A birth and death process is a continuous-time Markov chain for which qij = 0 for all i, j ∈ S such that |i − j| > 1. Hence, if the Markov chain is at state i, then it can only go to either state i − 1 or state i + 1. In applications, the state of the process is usually thought of as representing the size of a population. If the state increases by one unit, a birth is said to occur, and if the state decreases by one unit, a death is said to occur. The values

bi = q_{i,i+1} , di = q_{i,i−1}

are called, respectively, the birth and death rates. Since ∑_{j∈N0} qij = λi, we obtain that

λi = bi + di ,  p_{i,i+1} = 1 − p_{i,i−1} = bi/(bi + di) .
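The dynamics just described can be checked step by step in simulation. The sketch below (with illustrative state-independent rates b and d, our own choice) draws holding times and jump directions from a single state and verifies that the holding time is Ex(b + d) on average and that births occur with probability b/(b + d):

```python
import numpy as np

# One step of a birth-death chain from state i: the holding time is Ex(λ_i)
# with λ_i = b_i + d_i, and the next state is i+1 with probability
# b_i/(b_i + d_i), i-1 otherwise.
rng = np.random.default_rng(5)

b, d = 2.0, 3.0          # birth and death rates (constant, for illustration)
lam = b + d
n_samples = 200_000

holding = rng.exponential(1 / lam, size=n_samples)
births = rng.random(n_samples) < b / lam

print("mean holding time:", holding.mean())   # theory: 1/λ = 0.2
print("fraction of births:", births.mean())   # theory: b/(b+d) = 0.4
```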

We consider now the discrete-time case of a Markov chain.

Definition 3.5.4 (Discrete-time Markov chain). A stochastic process X = {Xt : t ∈ N0} on a probability space (Ω, F, P) with values in a discrete set S = {x1, x2, x3, . . .} is called a Markov chain if for every t ∈ N0, every i, j ∈ S and every sequence i0, i1, . . . , it−1 ∈ S, we have

P (Xt+1 = j |Xt = i,Xt−1 = it−1, . . . , X0 = i0) = P (Xt+1 = j |Xt = i) .

If P(Xt+1 = j |Xt = i) is independent of t, the Markov chain is said to have stationary or homogeneous transition probabilities.

Let X be a Markov chain and let pij denote the probability that the process X will make a transition to state j ∈ S from state i ∈ S, i.e.

pij = P (Xt+1 = j |Xt = i) .

Then, we have that

pij ≥ 0 for all i, j ∈ S and ∑_{j∈S} pij = 1 .


Example The random walk provides an example of a Markov chain. Let {Xi}_{i∈N} be a sequence of independent identically distributed random variables satisfying

P (Xi = j) = aj , j ∈ Z .

Define a stochastic process S = {Sn : n ∈ N0} by

S0 = 0 ,  Sn = ∑_{i=1}^{n} Xi .

Then, S is a Markov chain with transition probabilities given by pij = aj−i.
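As a quick empirical check (our own illustration), the transition probabilities pij = a_{j−i} of the random walk can be estimated from simulated paths. With steps taking values −1, 0, 1, the probability of moving one state up should equal a_1 regardless of the current state:

```python
import numpy as np

# Random walk with step distribution a_{-1} = 0.2, a_0 = 0.5, a_1 = 0.3
# (an illustrative choice); its transition probabilities are p_{ij} = a_{j-i}.
rng = np.random.default_rng(6)

values, probs = np.array([-1, 0, 1]), np.array([0.2, 0.5, 0.3])
n_paths, n_steps = 200_000, 10

steps = rng.choice(values, p=probs, size=(n_paths, n_steps))
S = np.cumsum(steps, axis=1)

# Estimate P(S_6 = S_5 + 1): by p_{i,i+1} = a_1 this is 0.3, whatever S_5 is.
jump_up = (S[:, 5] - S[:, 4]) == 1
print("P(S_6 = S_5 + 1):", jump_up.mean())   # theory: a_1 = 0.3
```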


4 Statistics

This last section is devoted to an overview of some topics in Statistics. We will provide a very brief review of some concepts in statistical inference, including random sampling, estimation, confidence intervals and hypothesis testing.

4.1 Random sample and Statistic

The act of collecting data through observation of a variable of interest in some experiment is very often the first step in the statistical treatment of such an experiment. We will refer to the set of all units of a given class (under observation) as a population. That class may be people, buildings, physical quantities or economic data. The term population is also commonly used to refer to the set of all potential measurements or values in a given experiment.

Clearly, a population may be of finite or infinite size. Even when the population under consideration is finite, it is usually not practical to observe every one of its elements. The typical procedure is to extract a subset of the population of appropriate size. This process is called sampling and the resulting subset is called a sample. Depending on the population under observation, different types of sampling may be used. For instance, one may wish to divide the full population into smaller subsets of elements sharing some particular property, and only then sample from each of these sets; this process is known as stratified sampling (a related scheme, cluster sampling, instead selects only some of the groups and observes them entirely).

In what follows, we will only consider random sampling. By random sampling one may either refer to a set of independent and identically distributed random variables corresponding to some given observation or measurement, or to a sample of individuals selected from a population in such a way that each sample of the same size is equally likely. We will use the former nomenclature here.

Definition 4.1.1 (Random sample). The random variables X1, . . . , Xn are called a random sample of size n from the population f(x) if X1, . . . , Xn are mutually independent and identically distributed random variables with probability function f(x) (density in the continuous case).

A random sample corresponds to an experimental situation in which the variable of interest has a probability distribution described by f(x). In most experiments there are n > 1 repeated observations made on the variable, the first being X1, the second X2, and so on. Note that each Xi is an observation on the same variable and each Xi has distribution determined by f(x). Moreover, the observations are made in such a way that the value of one observation is independent of any of the other observations. Thus, we obtain that the joint probability function f(x1, . . . , xn) (density in the continuous case) of the random sample X1, . . . , Xn is such that

f(x1, . . . , xn) = ∏_{i=1}^{n} f(xi) .

If the population under observation is assumed to have a specified parametric family of probability distributions f(x|θ) with unknown true parameter value θ ∈ R^K, then a random sample extracted from this population has a joint probability function (density in the continuous case) f(x1, . . . , xn|θ) satisfying:

f(x1, . . . , xn|θ) = ∏_{i=1}^{n} f(xi|θ) .

By considering different values for the parameter θ, we can study how a random sample behaves for different populations. On the other hand, we can use the random sample to estimate the value of the parameter θ and move forward with the statistical analysis from this point.

When a sample X1, . . . , Xn is extracted from a population, one may try to construct relevant quantities describing the main properties of the observed sample. Such quantities may be expressed in the form of a function T(x1, . . . , xn), which may be real-valued or vector-valued, and whose domain includes the sample space of the random vector (X1, . . . , Xn). Thus, these quantities define a random variable (or vector) Y = T(X1, . . . , Xn) known as a statistic.

Definition 4.1.2 (Statistic). Let X1, . . . , Xn be a random sample of size n from a population and let T : R^n → R^k, k ≥ 1, be a real-valued or vector-valued function whose domain includes the sample space of (X1, . . . , Xn). The random variable or random vector Y = T(X1, . . . , Xn) is called a statistic and its probability distribution is called the sampling distribution of Y.

Note that a statistic is a function of the random sample only, and does not involve any other unknown parameters. Examples of a statistic include the sample mean and the sample variance defined below.

Definition 4.1.3 (Sample mean). Let X1, . . . , Xn be a random sample of size n. The sample mean, denoted by X̄n, is the arithmetic average of the values in the random sample:

X̄n = (1/n) ∑_{i=1}^{n} Xi .

Definition 4.1.4 (Sample variance). Let X1, . . . , Xn be a random sample of size n. The sample variance, denoted by S², is the statistic defined by

S² = (1/n) ∑_{i=1}^{n} (Xi − X̄n)² .

The next result lists some properties of the sample mean and variance.

Proposition 4.1.5 (Properties of the sample mean and variance). Let X1, . . . , Xn be a random sample from a population with finite mean µ and variance σ². Then:

1) E[X̄n] = µ.

2) Var[X̄n] = σ²/n.

3) E[S²] = (n − 1)σ²/n.

Due to the fact that E[S²] ≠ σ², a modified version of the sample variance is more commonly used.


Definition 4.1.6 (Corrected sample variance). Let X1, . . . , Xn be a random sample of size n. The corrected sample variance, denoted by S′², is the statistic defined by

S′² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄n)² .

Let X1, . . . , Xn be a random sample from a population with finite mean µ and variance σ². It is possible to check that the mathematical expectation of the corrected sample variance is equal to:

E[S′²] = σ² .
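The two identities E[S²] = (n − 1)σ²/n and E[S′²] = σ² are easy to see in simulation. A sketch (illustrative parameter values; NumPy's ddof argument switches between the two divisors):

```python
import numpy as np

# Compare the sample variance S² (divide by n) with the corrected sample
# variance S′² (divide by n-1) on repeated samples from N(0, σ²), σ² = 4.
rng = np.random.default_rng(7)

sigma2, n, n_samples = 4.0, 5, 200_000
data = rng.normal(0.0, np.sqrt(sigma2), size=(n_samples, n))

S2 = data.var(axis=1, ddof=0)    # S²  = (1/n)     Σ (X_i - X̄_n)²
S2c = data.var(axis=1, ddof=1)   # S′² = (1/(n-1)) Σ (X_i - X̄_n)²

print("E[S²] ≈", S2.mean())      # theory: (n-1)σ²/n = 3.2
print("E[S′²] ≈", S2c.mean())    # theory: σ² = 4
```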

We now state another very interesting property of S′², widely used in statistical inference. Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²). Then the random variable

Q = (n − 1)S′²/σ² ∼ χ²(n − 1) .

It is very common that an explicit expression for the exact distribution of a statistic Y = T(X1, . . . , Xn),

FY(y) = P({(x1, . . . , xn) ∈ R^n : T(x1, . . . , xn) ≤ y}) ,

is not available. To avoid this sort of difficulty, one may resort to asymptotic distributions (such as the one provided by the Central Limit Theorem) or to the Monte Carlo method when there is no analytical description available for the distribution of the statistic.

One particular class of statistics for which it is possible to compute the exact distribution are order statistics. Let X1, . . . , Xn be a random sample of size n with distribution function F(x) and probability function (density in the continuous case) f(x). The order statistics of X1, . . . , Xn are the values obtained by ordering the random sample:

X(1) ≤ X(2) ≤ . . . ≤ X(n) .

Hence, the value X(1) corresponds to the sample minimum

X(1) = min{X1, . . . , Xn}

and the value X(n) corresponds to the sample maximum

X(n) = max{X1, . . . , Xn} .

The distribution of the order statistic X(i), i ∈ {1, . . . , n}, is determined by its probability function (or density) f_{X(i)} given by

f_{X(i)}(x) = [n! / ((i − 1)!(n − i)!)] (F(x))^{i−1} (1 − F(x))^{n−i} f(x) .
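For a concrete check of the order-statistic density (our own illustration), note that for a U(0, 1) sample it reduces to a Beta(i, n − i + 1) density, so E[X(i)] = i/(n + 1). A simulation sketch with illustrative parameters:

```python
import numpy as np

# For a U(0,1) sample, the i-th order statistic is Beta(i, n-i+1) distributed,
# so its mean is i/(n+1).  We verify this by sorting simulated samples.
rng = np.random.default_rng(8)

n, i, n_samples = 5, 2, 200_000
samples = rng.random(size=(n_samples, n))
X_i = np.sort(samples, axis=1)[:, i - 1]    # the i-th order statistic

print("E[X_(2)] ≈", X_i.mean())             # theory: i/(n+1) = 2/6 ≈ 0.333
```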


4.2 Estimators

Consider the case where a random sample is taken from a population with probability function (or density) f(x|θ) depending on an unknown parameter θ. Any knowledge about the parameter θ leads to knowledge about the entire population. Thus, methods providing a good estimator for the value of θ are of great relevance in Statistics. Very often, the parameter θ also has a meaningful interpretation, as in the case of a population mean.

Definition 4.2.1 (Point estimator). Let X1, . . . , Xn be a random sample of size n taken from a population with probability function (density in the continuous case) f(x|θ), where θ ∈ Θ is an unknown parameter. A point estimator for θ is a statistic θ̂(X1, . . . , Xn) that is used to infer the value of θ.

Note the distinction between an estimator and an estimate. An estimator is a function of the random sample alone, while an estimate is the numerical value of an estimator obtained when a sample is actually taken.

There are several techniques that can be used to construct estimators. We will discuss two of these techniques here – the Method of Moments and the Maximum Likelihood method. We discuss the former method first.

4.2.1 Method of Moments

Let X1, . . . , Xn be a random sample of size n taken from a population with probability function (density in the continuous case) f(x|θ1, . . . , θk), k ≥ 1. Assume that the distribution determined by f(x|θ1, . . . , θk) has as many moments E[X^r] as needed. Let m1, . . . , mk denote the sample moments

mj = (1/n) ∑_{i=1}^{n} Xi^j ,  j ∈ {1, . . . , k} ,

readily computable from the sample, and let µ′1, . . . , µ′k denote the corresponding population moments

µ′j = E[X^j] ,  j ∈ {1, . . . , k} .

Note that the population moments are functions of the parameters θ1, . . . , θk, i.e. µ′j = µ′j(θ1, . . . , θk). The method of moments estimators are then found by solving with respect to θ1, . . . , θk the system of k equations in k unknowns given by

µ′1(θ1, . . . , θk) = m1
...
µ′k(θ1, . . . , θk) = mk .

If this system turns out to be underdetermined, introduce more equations using higher order moments.

Example For an example of application of the method of moments, let X1, . . . , Xn be a random sample of size n taken from an exponential population with unknown parameter λ > 0. Recalling that the expected value of a random variable X ∼ Ex(λ) is E[X] = 1/λ, we obtain that the method of moments estimator is the solution of the following equation (with respect to λ):

1/λ = (1/n) ∑_{i=1}^{n} Xi .

Thus, we obtain the estimator

λ̂(X1, . . . , Xn) = n / ∑_{i=1}^{n} Xi

for the unknown parameter λ.
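The estimator just derived can be tried on simulated data (a sketch with an illustrative true value of λ):

```python
import numpy as np

# Method of moments for an exponential sample: solve 1/λ = X̄_n, giving the
# estimator λ̂ = n / Σ X_i = 1 / X̄_n.
rng = np.random.default_rng(9)

true_lam, n = 2.5, 100_000
sample = rng.exponential(1 / true_lam, size=n)

lam_hat = n / sample.sum()          # equivalently 1 / sample.mean()
print("method of moments estimate:", lam_hat)   # should be close to 2.5
```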

4.2.2 Maximum Likelihood method

We will now introduce one of the most used methods to obtain estimators – the maximum likelihood method. As before, let X1, . . . , Xn be a random sample taken from a population with probability function (or density) f(x|θ), where θ ∈ Θ ⊂ R^K, K ≥ 1, is an unknown parameter. The likelihood function L : Θ → R is defined by

L(θ|x) = L(θ1, . . . , θK |x1, . . . , xn) = ∏_{i=1}^{n} f(xi|θ1, . . . , θK) .

A maximum likelihood estimator θ̂(X1, . . . , Xn) for θ is the value of θ for which the likelihood function L(θ|x) attains its maximum as a function of θ, while x is held fixed.

The maximum likelihood estimator is very often a reasonable choice for an estimator. To find the maximum likelihood estimator one has to deal with the problem of finding the global maximum of a function. If the probability function (or density function) of the random sample is reasonably behaved, this problem reduces to a standard calculus problem, even though the computations may be cumbersome in some cases.

If the likelihood function is differentiable with respect to θ ∈ Θ, possible candidates for the maximum likelihood estimator are the values of θ = (θ1, . . . , θK) satisfying the first order conditions

∂L/∂θi (θ|x) = 0 ,  i = 1, . . . , K .

Note that any solution of the set of equations above is only a candidate for a maximum likelihood estimator, i.e. the first order condition above is only a necessary condition for a maximum, not a sufficient one. Moreover, the zeros of the first derivative of L locate only extreme points in the interior of the set Θ. If the extrema occur on the boundary of Θ, the first derivative may not be zero. Thus, the boundary must be checked separately for extrema.

Example Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, 1) with unknown mean µ. The likelihood function L : R → R is

L(µ|x) = ∏_{i=1}^{n} f(xi|µ)
       = ∏_{i=1}^{n} (2π)^{−1/2} e^{−(xi−µ)²/2}
       = (2π)^{−n/2} e^{−(1/2) ∑_{i=1}^{n} (xi−µ)²} .

Working out the equation

∂L/∂µ (µ|x1, . . . , xn) = 0

we obtain

∑_{i=1}^{n} (xi − µ) = 0 .

Hence, one gets the solution

µ̂ = (1/n) ∑_{i=1}^{n} Xi ,

as expected. To check that µ̂ is indeed a maximum likelihood estimator for µ, it is enough to check that

∂²L/∂µ² (µ̂|x1, . . . , xn) < 0 ,

which turns out to be the case here.
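The closed-form solution above can be cross-checked numerically: maximizing the log-likelihood (easier to handle than L itself) over a fine grid of µ values should recover the sample mean. A sketch with simulated data (the data and grid are illustrative):

```python
import numpy as np

# Numerically maximize the N(µ, 1) log-likelihood over a grid of µ values and
# check that the maximizer agrees with the sample mean derived above.
rng = np.random.default_rng(10)

x = rng.normal(1.7, 1.0, size=500)          # a sample from N(1.7, 1)

def log_likelihood(mu):
    # log L(µ|x) = -(n/2) log(2π) - (1/2) Σ (x_i - µ)²
    return -0.5 * x.size * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

grid = np.linspace(0.0, 3.0, 3001)          # grid step 0.001
mu_hat = grid[np.argmax([log_likelihood(m) for m in grid])]

print("grid maximizer:", mu_hat)
print("sample mean   :", x.mean())          # should agree up to the grid step
```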

Before concluding the discussion of maximum likelihood estimators, one should point out one special property of this class of estimators – the invariance principle. Let θ̂ be the maximum likelihood estimator for the parameter θ ∈ Θ ⊂ R^K. Then, for any function h of the unknown parameter θ, the maximum likelihood estimator of h(θ) is given by h(θ̂).

4.2.3 Some measures to assess estimator quality

We will now discuss some desirable properties for an estimator to have. In what follows, let X1, . . . , Xn be a random sample of size n taken from a population with probability function (density in the continuous case) f(x|θ), where θ ∈ Θ is an unknown parameter.

Definition 4.2.2 (Bias). The bias of an estimator θ̂ of a parameter θ, denoted by Biasθ(θ̂), is

Biasθ(θ̂) = Eθ[θ̂] − θ ,

where Eθ denotes the expected value taken with respect to f(x|θ). An estimator whose bias is identically zero as a function of θ is said to be unbiased and satisfies Eθ[θ̂] = θ for all θ ∈ Θ.


Example Let X1, . . . , Xn be a random sample from a normal population N(µ, σ²). We have already seen above that E(µ,σ²)[X̄n] = µ, i.e. the sample mean X̄n is an unbiased estimator for the population mean µ (this statement holds for non-normal populations also). In what concerns the population variance, we have seen that E(µ,σ²)[S²] = (n − 1)σ²/n, and thus the sample variance has non-zero bias:

Bias(µ,σ²)(S²) = −σ²/n .

This is a disadvantage of using S2 instead of the corrected sample average S′2, sinceS′2 is an unbiased estimator for σ2.
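The bias computed above is easy to verify by simulation. The following sketch (the population parameters, sample size, and seed are illustrative choices, not part of the notes) estimates E[S²] and E[S′²] by Monte Carlo:

```python
import numpy as np

# Monte Carlo check of the bias of the sample variance S^2 (divisor n)
# versus the corrected sample variance S'^2 (divisor n - 1).
rng = np.random.default_rng(1)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=0)        # S^2, biased
s2_corr = samples.var(axis=1, ddof=1)   # S'^2, unbiased

print(s2.mean())       # close to (n - 1) * sigma2 / n = 3.6
print(s2_corr.mean())  # close to sigma2 = 4.0
```

The empirical averages match E[S²] = (n − 1)σ²/n and E[S′²] = σ², so the bias of S² is approximately −σ²/n = −0.4 here.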

Definition 4.2.3 (Mean square error). The mean square error of an estimator θ̂ of a parameter θ is the function of θ defined by E_θ[(θ̂ − θ)²].

The mean square error measures the average squared difference between the estimator θ̂ and the unknown parameter θ. It provides a reasonable measure for the performance of an estimator, even though any increasing function of the distance |θ̂ − θ| would provide the same sort of information. However, in the case of the mean square error it is possible to prove that the following identity is satisfied:

E_θ[(θ̂ − θ)²] = Var_θ[θ̂] + (Bias_θ(θ̂))² ,

where Var_θ denotes the variance taken with respect to the population probability function (or density) f(x|θ). Hence, the mean square error may be decomposed into two components, one measuring the estimator variability or precision, and the other measuring its bias. Thus, an estimator with small mean square error has small combined variance and bias.

Example Returning to the previous example, where X1, . . . , Xn is a random sample from a normal population N(µ, σ²), one can see that

E_{(µ,σ²)}[(X̄n − µ)²] = Var_{(µ,σ²)}(X̄n) = σ²/n .

Thus, the mean square error of the sample mean decreases as the sample size increases. Concerning the sample variance S² and the corrected sample variance S′², it is possible to check that

E_{(µ,σ²)}[(S² − σ²)²] = Var_{(µ,σ²)}[S²] + (Bias_{(µ,σ²)}(S²))² = 2(n − 1)σ⁴/n² + σ⁴/n² = (2n − 1)σ⁴/n² ,

while

E_{(µ,σ²)}[(S′² − σ²)²] = Var_{(µ,σ²)}[S′²] = 2σ⁴/(n − 1) .

Thus, even though S′² is unbiased while S² is biased, we obtain that

E_{(µ,σ²)}[(S² − σ²)²] < E_{(µ,σ²)}[(S′² − σ²)²]

for every n ≥ 2.

This example does not imply that S² should be used as an estimator of σ² instead of S′². The computation above only shows that, on average, S² is closer to σ² than S′² if the distance is measured using the mean square error. However, S² is biased and will underestimate σ² on average.

Note that controlling bias does not guarantee that the mean square error is controlled. Indeed, it may even happen that a trade-off occurs between variance and bias, in such a way that an increase in bias leads to a larger decrease in variance, resulting in an improvement of the estimator with respect to the mean square error. Moreover, a comparison between two different estimators θ̂1, θ̂2 of θ may not even yield one “best” estimator. The typical situation is that one estimator is better than the other in a subset of the parameter space Θ, while the opposite is true in the complement of that subset.

In order to use the mean square error to compare two estimators, one must restrict the analysis to smaller classes of estimators. One such class is the one whose elements are unbiased estimators. In this case, if θ̂1, θ̂2 are unbiased estimators of a parameter θ, then their mean square errors are equal to their variances, and one should choose the estimator with smaller variance.

Definition 4.2.4 (Best unbiased estimator). An estimator θ̂* is a best unbiased estimator of θ if it satisfies E_θ[θ̂*] = θ for all θ ∈ Θ and, for any other estimator θ̂ with E_θ[θ̂] = θ, we have Var_θ[θ̂*] ≤ Var_θ[θ̂] for all θ ∈ Θ.

Finding a best unbiased estimator, if one exists, may not be easy. First, when comparing two estimators, the computations leading to their variances may be lengthy. Moreover, there may exist another estimator with even smaller variance. The Cramer-Rao inequality provides a lower bound for the variance of an estimator. If one is able to find an unbiased estimator with variance equal to such lower bound, then this must be a best unbiased estimator.

Theorem 4.2.5 (Cramer-Rao inequality). Let X1, . . . , Xn be a random sample of size n taken from a population with probability function (density in the continuous case) f(x|θ), where θ ∈ Θ ⊂ R, and let θ̂ be any estimator of θ satisfying

d/dθ E_θ[θ̂] = ∫_{R^n} ∂/∂θ [θ̂(x_1, . . . , x_n) f(x_1, . . . , x_n|θ)] dx

and

Var_θ[θ̂] < ∞ .

Then,

Var_θ[θ̂] ≥ (d/dθ E_θ[θ̂])² / I(θ) ,

where I(θ) is the Fisher information of the sample

I(θ) = E_θ[(∂/∂θ ln f(X_1, . . . , X_n|θ))²] .


Note that although the Cramer-Rao inequality is stated above for continuous random variables, it also applies to discrete random variables with the obvious modifications.

If θ̂ is an unbiased estimator for θ, the Cramer-Rao inequality reduces to

Var_θ[θ̂] ≥ 1/I(θ) .

It should be remarked that the lower bound provided by the Cramer-Rao inequality may not be sharp, i.e. its value may be strictly smaller than the variance of any unbiased estimator. Define the efficiency of an estimator through

e(θ̂) = (d/dθ E_θ[θ̂])² / (I(θ) Var_θ[θ̂]) .

From the Cramer-Rao inequality, one obtains that

e(θ̂) ≤ 1 .

An estimator θ̂ is said to be efficient if e(θ̂) = 1, i.e. if the variance of θ̂ is equal to the lower bound in the Cramer-Rao inequality. However, there may be cases where such lower bound is not attainable, i.e. e(θ̂) < 1 for every estimator θ̂ of θ.

4.3 Confidence intervals

As discussed in the previous section, given a random sample X1, . . . , Xn taken from a population with probability function (density in the continuous case) f(x|θ), where θ ∈ Θ is an unknown parameter, one can construct an estimator for θ. This corresponds to finding a function of the random sample yielding a point value for θ. An alternative approach to estimating θ is to look for an entire interval (or region) of plausible values – a confidence interval.

When constructing a confidence interval, one needs to select a confidence level, measuring the degree of reliability of the interval. Information about the precision of a confidence interval is then given by the width of the interval. If the confidence level is high and the confidence interval is narrow, the information obtained about the value of the unknown parameter may be rather precise.

Definition 4.3.1 (Confidence interval). Let X1, . . . , Xn be a random sample of size n taken from a population with probability function (density in the continuous case) f(x|θ), where θ ∈ Θ is an unknown parameter. A confidence interval for θ with confidence level 1 − α is a random interval

I(X1, . . . , Xn) = (L(X1, . . . , Xn), U(X1, . . . , Xn)) ,

with the following properties:

i) the functions L, U : R^n → R are statistics such that for every (x_1, . . . , x_n) ∈ R^n the following inequality holds:

L(x_1, . . . , x_n) ≤ U(x_1, . . . , x_n) ;

ii) the confidence coefficient of I(X1, . . . , Xn) is 1 − α, i.e.

inf_{θ∈Θ} P_θ[L(X1, . . . , Xn) < θ < U(X1, . . . , Xn)] = 1 − α ,

where P_θ denotes the probability measure determined by f(x|θ).

In a similar way, we define a lower confidence bound for θ with confidence level 1 − α to be an interval of the form

I(X1, . . . , Xn) = (L(X1, . . . , Xn), +∞)

and an upper confidence bound for θ with confidence level 1 − α to be an interval of the form

I(X1, . . . , Xn) = (−∞, U(X1, . . . , Xn)) .

All these notions can be easily extended to define confidence regions for multidimensional parameters θ. However, for simplicity of exposition, we restrict our attention to the one-dimensional case.

We will now describe a general strategy that may be used to construct confidence intervals. Let X1, X2, . . . , Xn be a random sample taken from a population with probability function (density in the continuous case) f(x|θ), where θ ∈ Θ is the unknown parameter to be estimated.

The first step is to find a random variable Q : R^n × Θ → R, depending only on the random sample X1, . . . , Xn and on the parameter θ, such that the probability distribution of Q does not depend on θ or on any other unknown parameter. The random variable Q is called a pivot or a pivotal quantity. The precise form of the pivot Q is usually obtained by examining the distribution of an appropriate estimator θ̂ for θ.

Assume now that a pivot Q is provided. Then, for any α such that 0 < α < 1, there exist constants a, b ∈ R such that

P_θ(a < Q(X1, . . . , Xn, θ) < b) = 1 − α .

If the inequalities a < Q(X1, . . . , Xn, θ) < b can be worked out in such a way that θ becomes isolated, one obtains the equivalent statement

P_θ(L(X1, . . . , Xn) < θ < U(X1, . . . , Xn)) = 1 − α ,

where L(X1, . . . , Xn) and U(X1, . . . , Xn) are, respectively, the lower and upper ends of a confidence interval for θ with confidence level 1 − α.

A final remark concerning the choice of the constants a, b ∈ R in the construction above is in order. It should be clear that there are infinitely many choices for these constants preserving the confidence level 1 − α. Out of all the possible choices, we could ask for the values of a and b minimizing the length b − a of the interval [a, b]. The following result provides the appropriate choice for the case where the pivot Q is a continuous random variable with a unimodal probability density function f(x), i.e. there exists x* ∈ R such that f(x) is nondecreasing for x < x* and f(x) is nonincreasing for x > x*. Among others, this property is shared by the density functions of the following families of probability distributions: normal, chi-square, Student's t, and Snedecor's F.


Proposition 4.3.2. Let f(x) be a unimodal probability density function. If the interval [a, b] satisfies:

i) ∫_a^b f(x) dx = 1 − α,

ii) f(a) = f(b) > 0, and

iii) a ≤ x* ≤ b, where x* is a mode of f(x),

then [a, b] is the shortest among all intervals that satisfy property (i).

We will now discuss several relevant examples of confidence intervals for parameters of normal populations and for large samples.

4.3.1 Mean

We distinguish between the following three cases:

1) Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²) with unknown mean µ and known variance σ². We have seen above that the sample mean X̄n is an estimator for the population mean µ with distribution

X̄n ∼ N(µ, σ²/n) .

Thus, the random variable

Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1)

is a pivot for the population mean µ. The next step is to choose a, b ∈ R such that

P(a < (X̄n − µ)/(σ/√n) < b) = 1 − α .

The choice a = −z_{α/2} and b = z_{α/2}, where z_{α/2} ∈ R is the unique solution of P(Z > z_{α/2}) = α/2, yields the following confidence interval for µ:

(X̄n − z_{α/2} σ/√n , X̄n + z_{α/2} σ/√n) .

2) Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²) with unknown mean µ and unknown variance σ². Since σ² is unknown, the function

(X̄n − µ)/(σ/√n)

is no longer a pivot for the population mean µ. To obtain a pivot we proceed as follows. Recall that

U = (X̄n − µ)/(σ/√n) ∼ N(0, 1)

and that

V = (n − 1)S′²/σ² ∼ χ²(n − 1) .

We obtain that

T = U/√(V/(n − 1)) = (X̄n − µ)/(S′/√n) ∼ t(n − 1)

is a pivot for the population mean µ. Picking the values a = −t_{α/2,n−1} and b = t_{α/2,n−1}, where t_{α/2,n−1} ∈ R is the unique solution of P(T > t_{α/2,n−1}) = α/2, yields

P(a < (X̄n − µ)/(S′/√n) < b) = 1 − α .

We obtain the following confidence interval for µ:

(X̄n − t_{α/2,n−1} S′/√n , X̄n + t_{α/2,n−1} S′/√n) .

3) Let X1, . . . , Xn be a random sample of large size n taken from a population with unknown mean µ and variance σ². In this case, even if an appropriate pivotal quantity is not available, provided the sample size n is sufficiently large, the central limit theorem may be used to obtain an approximate confidence interval for the mean µ of the population. Here, the pivotal quantity is

Z = (X̄n − µ)/(S′/√n) ∼a N(0, 1) ,

where the notation ∼a is used to denote that Z is asymptotically normally distributed as n → ∞. Proceeding similarly to case 1) above, one obtains the following approximate confidence interval for µ:

(X̄n − z_{α/2} S′/√n , X̄n + z_{α/2} S′/√n) .

If n is not very large, one can use the Student's t distribution of case 2), yielding a slightly wider approximate confidence interval.
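Case 1) above can be sketched in a few lines of Python. To keep the snippet dependency-free, the standard normal quantile z_{α/2} is obtained by bisecting the CDF Φ(x) = (1 + erf(x/√2))/2; the sample data and helper names are illustrative assumptions, not part of the notes.

```python
import math
import numpy as np

def norm_ppf(p):
    """Inverse of the standard normal CDF, by bisection on
    Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (1 + math.erf(mid / math.sqrt(2))) / 2 < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_ci_known_sigma(x, sigma, alpha=0.05):
    """Confidence interval (Xbar -+ z_{alpha/2} * sigma / sqrt(n))."""
    x = np.asarray(x)
    z = norm_ppf(1 - alpha / 2)
    half = z * sigma / math.sqrt(len(x))
    return x.mean() - half, x.mean() + half

rng = np.random.default_rng(4)
lo, hi = mean_ci_known_sigma(rng.normal(5.0, 2.0, size=100), sigma=2.0)
print(lo, hi)  # an interval estimate for the true mean 5.0
```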

4.3.2 Variance

Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²) with unknown mean µ and variance σ². The pivotal quantity to be used is

Q = (n − 1)S′²/σ² ∼ χ²(n − 1) .

Let q_{α,n−1} be such that P(Q > q_{α,n−1}) = α. Picking the values a = q_{1−α/2,n−1} and b = q_{α/2,n−1} yields

P(a < (n − 1)S′²/σ² < b) = 1 − α .

We obtain the following confidence interval for σ²:

((n − 1)S′²/q_{α/2,n−1} , (n − 1)S′²/q_{1−α/2,n−1}) .

Note that since the chi-square distribution is not symmetric, the choice made above for the values a, b does not yield the confidence interval with confidence level 1 − α of smallest expected width.
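A minimal numerical sketch for n = 10 and a 95% level follows. The chi-square quantiles q_{0.025,9} ≈ 19.02 and q_{0.975,9} ≈ 2.70 are taken from standard tables (an assumption of this sketch; `scipy.stats.chi2.ppf` could compute them instead), and the sample data are illustrative.

```python
import numpy as np

# 95% confidence interval for sigma^2 of a normal population, n = 10,
# using the pivot (n - 1) S'^2 / sigma^2 ~ chi^2(9).
def variance_ci(x, q_upper=19.02, q_lower=2.70):
    x = np.asarray(x)
    n = len(x)
    s2_corr = x.var(ddof=1)          # corrected sample variance S'^2
    return (n - 1) * s2_corr / q_upper, (n - 1) * s2_corr / q_lower

rng = np.random.default_rng(8)
lo, hi = variance_ci(rng.normal(0.0, 2.0, size=10))
print(lo, hi)  # an interval estimate for the true variance 4.0
```

Note how wide the interval is for such a small sample: the ratio of the endpoints is fixed at q_upper/q_lower ≈ 7, independently of the data.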


4.3.3 Difference of two means

We distinguish between the following four cases:

1) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means µ1 and µ2 and known variances σ1² and σ2². Using the properties of the normal distribution, it is possible to check that

Z = ((X̄1n − X̄2m) − (µ1 − µ2)) / √(σ1²/n + σ2²/m) ∼ N(0, 1) ,

where X̄1n, X̄2m denote the sample means of the two random samples. Thus, Z is a pivotal quantity. Picking the values a = −z_{α/2} and b = z_{α/2}, where z_{α/2} ∈ R is the unique solution of P(Z > z_{α/2}) = α/2, yields

P(a < Z < b) = 1 − α .

We obtain the following confidence interval for µ1 − µ2:

(X̄1n − X̄2m − z_{α/2} σ* , X̄1n − X̄2m + z_{α/2} σ*) ,

where

σ* = √(σ1²/n + σ2²/m) .

2) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means µ1 and µ2 and unknown, but equal, variances σ1² = σ2² = σ². It is possible to check that

T = ((X̄1n − X̄2m) − (µ1 − µ2)) / √( (1/n + 1/m) · ((n − 1)S′1² + (m − 1)S′2²)/(n + m − 2) ) ∼ t(n + m − 2)

is a pivotal quantity for the difference µ1 − µ2. Picking the values a = −t_{α/2,n+m−2} and b = t_{α/2,n+m−2}, where t_{α/2,n+m−2} ∈ R is the unique solution of P(T > t_{α/2,n+m−2}) = α/2, yields

P(a < T < b) = 1 − α .

We obtain the following confidence interval for µ1 − µ2:

(X̄1n − X̄2m − t_{α/2,n+m−2} S* , X̄1n − X̄2m + t_{α/2,n+m−2} S*) ,

where

S* = √( (1/n + 1/m) · ((n − 1)S′1² + (m − 1)S′2²)/(n + m − 2) ) .

3) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means µ1 and µ2 and unknown and unequal variances σ1² and σ2². It is possible to check that the following asymptotic identity holds:

T = ((X̄1n − X̄2m) − (µ1 − µ2)) / √(S′1²/n + S′2²/m) ∼a t(r) ,

where r is the integer part of r*:

r* = (S′1²/n + S′2²/m)² / [ (1/(n − 1)) (S′1²/n)² + (1/(m − 1)) (S′2²/m)² ] .

Thus, T is a pivotal quantity for the difference µ1 − µ2. Picking the values a = −t_{α/2,r} and b = t_{α/2,r}, where t_{α/2,r} ∈ R is the unique solution of P(T > t_{α/2,r}) = α/2, yields

P(a < T < b) = 1 − α .

We obtain the following approximate confidence interval for µ1 − µ2:

(X̄1n − X̄2m − t_{α/2,r} S* , X̄1n − X̄2m + t_{α/2,r} S*) ,

where

S* = √(S′1²/n + S′2²/m) .

4) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken from two populations with unknown means µ1 and µ2 and unknown variances σ1² and σ2². Even if an appropriate pivotal quantity is not available, provided the sample sizes n and m are sufficiently large, the central limit theorem may be used to obtain an approximate confidence interval for the difference µ1 − µ2. It is possible to check that

Z = ((X̄1n − X̄2m) − (µ1 − µ2)) / √(S′1²/n + S′2²/m) ∼a N(0, 1)

may be used as a pivotal quantity. Proceeding similarly to case 1) above, one obtains the following approximate confidence interval for µ1 − µ2:

(X̄1n − X̄2m − z_{α/2} S* , X̄1n − X̄2m + z_{α/2} S*) ,

where

S* = √(S′1²/n + S′2²/m) .

4.3.4 Ratio of two variances

Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means and variances. To construct a confidence interval for the ratio of the two variances, recall that

U = (n − 1)S′1²/σ1² ∼ χ²(n − 1)

and

V = (m − 1)S′2²/σ2² ∼ χ²(m − 1) .

We obtain that

F = (U/(n − 1)) / (V/(m − 1)) = (S′1²/S′2²)(σ2²/σ1²) ∼ F(n − 1, m − 1)

is a pivotal quantity for the ratio of variances σ2²/σ1². Let f_{α,n−1,m−1} be such that P(F > f_{α,n−1,m−1}) = α. Picking the values a = f_{1−α/2,n−1,m−1} and b = f_{α/2,n−1,m−1} yields

P(a < (S′1²/S′2²)(σ2²/σ1²) < b) = 1 − α .

We obtain the following confidence interval for the ratio σ2²/σ1²:

((S′2²/S′1²) f_{1−α/2,n−1,m−1} , (S′2²/S′1²) f_{α/2,n−1,m−1}) .

Similarly to the chi-square distribution, Snedecor's F distribution is not symmetric. Thus, the choice made above for the values a, b does not yield the confidence interval with confidence level 1 − α of smallest expected width.

4.3.5 Proportion

Let X1, . . . , Xn be a random sample of large size n taken from a Bernoulli population Bi(1, p) with unknown proportion p. It is known that the sample mean X̄n is a good estimator for p. If the sample size n is sufficiently large, then the central limit theorem may be used to obtain an approximate confidence interval for the proportion p. Recall that

Z = (X̄n − p) / √(p(1 − p)/n) ∼a N(0, 1) .

Taking a = −z_{α/2} and b = z_{α/2}, where z_{α/2} ∈ R is the unique solution of P(Z > z_{α/2}) = α/2, yields the following approximation:

P(X̄n − z_{α/2} √(p(1 − p)/n) < p < X̄n + z_{α/2} √(p(1 − p)/n)) ≈ 1 − α .

Using X̄n as an estimate for p, we obtain the following approximate confidence interval for p:

(X̄n − z_{α/2} √(X̄n(1 − X̄n)/n) , X̄n + z_{α/2} √(X̄n(1 − X̄n)/n)) .


4.3.6 Difference of two proportions

Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two Bernoulli populations Bi(1, p1) and Bi(1, p2) with unknown proportions p1 and p2. If the sample sizes n and m are sufficiently large, the central limit theorem may be used to obtain an approximate confidence interval for the difference p1 − p2. It is possible to check that

Z = ((X̄1n − X̄2m) − (p1 − p2)) / √(p1(1 − p1)/n + p2(1 − p2)/m) ∼a N(0, 1) .

Using X̄1n as an estimate for p1 and X̄2m as an estimate for p2, and taking a = −z_{α/2} and b = z_{α/2}, where z_{α/2} ∈ R is the unique solution of P(Z > z_{α/2}) = α/2, yields the following approximate confidence interval for p1 − p2:

(X̄1n − X̄2m − z_{α/2} S* , X̄1n − X̄2m + z_{α/2} S*) ,

where

S* = √(X̄1n(1 − X̄1n)/n + X̄2m(1 − X̄2m)/m) .

4.4 Hypothesis Testing

The previous two sections were devoted to the topic of estimation. The estimate may be given by a single value or by an interval with some given confidence level. Very often, however, the goal is not to estimate a parameter, but to decide between two contradictory claims about that parameter. This latter goal is accomplished by the part of statistical inference called hypothesis testing.

A hypothesis is a statement about a population parameter, and the goal of a hypothesis test is to decide, based on a random sample taken from the population, which of two alternative hypotheses is true. These two hypotheses are called the null hypothesis, denoted by H0, and the alternative hypothesis, denoted by H1.

If θ ∈ Θ denotes a population parameter, the general format of the null and alternative hypotheses is H0 : θ ∈ Θ0 and H1 : θ ∈ Θ1, where Θ0, Θ1 are subsets of the parameter space Θ such that Θ0 ∩ Θ1 = ∅. Thus, in a hypothesis testing problem, after observing the sample, one must decide either to reject the null hypothesis H0 as false or to reject the alternative hypothesis H1 as false.

Definition 4.4.1 (Hypothesis test). A hypothesis test is a rule that specifies:

1) for which sample values the null hypothesis H0 is rejected;

2) for which sample values the alternative hypothesis H1 is rejected.

The subset of the sample space for which H0 will be rejected is called the rejection region, while its complement is called the acceptance region.

For simplicity of exposition, we will only consider null hypotheses of the form H0 : θ = θ0, where θ0 ∈ Θ is fixed, i.e. Θ0 = {θ0} is a singleton. Alternatives to a null hypothesis of this form include the following three assertions:

i) H1 : θ ≠ θ0;

ii) H1 : θ > θ0 (in which case the implicit null hypothesis may be seen as θ ≤ θ0);

iii) H1 : θ < θ0 (in which case the implicit null hypothesis may be seen as θ ≥ θ0).

A hypothesis test of H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 may incur one of the following two types of errors. If θ ∈ Θ0 but the outcome of the test is to reject H0, then the test has made a Type I Error. If, on the other hand, θ ∈ Θ1 but the outcome of the test is to accept H0, a Type II Error has been made. Ideally, we would like to have test procedures that make neither of these errors. However, for a fixed sample size, this goal is usually impossible to achieve. The usual strategy for obtaining a good test is to restrict ourselves to the class of tests with a prespecified probability of Type I Error. Within this class of tests, one can then look for tests with a Type II Error probability as small as possible.

Hence, the probabilities of occurrence of the two types of errors described above are key parameters in hypothesis testing. They are related to the significance and the power of a test. The significance level of a test is equal to 1 − α, where α is the probability of a Type I Error:

α = P(Type I Error) = P(reject H0 | θ ∈ Θ0) .

If Θ0 has more than one element, then α is the supremum over θ ∈ Θ0 of the probability of a Type I Error. The power of a test is equal to 1 − β, where β is the probability of a Type II Error, i.e.

β = P(Type II Error) = P(accept H0 | θ /∈ Θ0) .

Therefore, if two distinct statistical tests are available with the same hypotheses H0 and H1 and the same level of significance, one can compare their powers and choose the one with the larger value.

There are several methods that can be used to construct hypothesis tests. Due to lack of time and space, we skip the detailed discussion of these methods, but the interested reader can find that information in the references. We restrict our attention to one particular strategy relating the formulation of hypothesis tests to the construction of confidence intervals. This approach is suitable for testing the parameters of normal populations, or asymptotically normal quantities, for instance. Its main steps and assumptions may be roughly stated as follows:

1) Consider the hypotheses H0 : θ = θ0 and H1 : θ ∈ Θ1(θ0), where for each θ0 ∈ Θ there is a set Θ1(θ0) ⊂ Θ such that θ0 /∈ Θ1(θ0). Note that this is the case for the alternative hypotheses H1 : θ ≠ θ0, H1 : θ > θ0, and H1 : θ < θ0.

2) Depending on the parameter θ and the probability distribution f(x|θ) of the population from which the random sample is to be extracted, look for an estimator θ̂(X1, . . . , Xn) for θ.

3) Fix the level of significance 1 − α at which the test will be performed.

4) Use α and θ̂ to determine the rejection region of the test, i.e. find an appropriate subset R of the real line such that

P(θ̂(X1, . . . , Xn) ∈ R | θ = θ0) = α .

If the distribution of θ̂(X1, . . . , Xn) depends on unknown parameters, one may need to find an appropriate pivotal quantity Q(X1, . . . , Xn, θ0) and then a rejection region R′ such that

P(Q(X1, . . . , Xn, θ0) ∈ R′) = α .

5) Collect the random sample and reject the null hypothesis or the alternative hypothesis depending on the observed value of θ̂ or Q(X1, . . . , Xn, θ0).
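The steps above can be sketched for the simplest case, a one-sided test on the mean of a normal population with known variance. In the sketch below the critical value z_{0.05} ≈ 1.645 is hardcoded, and the function name, data, and seed are illustrative assumptions, not part of the notes.

```python
import math
import numpy as np

# One-sided z-test of H0: mu = mu0 against H1: mu > mu0 for a normal
# population with known sigma. With alpha = 0.05 (significance 0.95 in the
# notes' convention), reject H0 when Xbar > mu0 + z_alpha * sigma / sqrt(n).
def z_test_upper(x, mu0, sigma, z_alpha=1.645):
    x = np.asarray(x)
    threshold = mu0 + z_alpha * sigma / math.sqrt(len(x))
    return x.mean() > threshold  # True means "reject H0"

rng = np.random.default_rng(6)
print(z_test_upper(rng.normal(0.0, 1.0, size=50), mu0=0.0, sigma=1.0))
print(z_test_upper(rng.normal(1.0, 1.0, size=50), mu0=0.0, sigma=1.0))
```

The rejection region here is R = (µ0 + z_α σ/√n, +∞), chosen so that the probability of landing in it under H0 is exactly α.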

There is a missing point in the strategy described above: how do we obtain an appropriate rejection region? The answer is provided by the next theorem, which explores the close relation between the notions of confidence set and hypothesis test.

Theorem 4.4.2. For each θ0 ∈ Θ, consider a statistical test with significance level 1 − α for testing H0 : θ = θ0 against H1 : θ ∈ Θ1(θ0) and denote by Ωα(θ0) its acceptance region, i.e. the subset of the sample space Ω for which H0 is not rejected. For each (x1, . . . , xn) ∈ Ω, define a set Θα(x1, . . . , xn) in parameter space by

Θα(x1, . . . , xn) = {θ ∈ Θ : (x1, . . . , xn) ∈ Ωα(θ)} .

Then Θα(x1, . . . , xn) is a 1 − α confidence set for θ.

Conversely, let Θα(x1, . . . , xn) be a 1 − α confidence set for θ. For any θ0 ∈ Θ, define

Ωα(θ0) = {(x1, . . . , xn) ∈ Ω : θ0 ∈ Θα(x1, . . . , xn)} .

Then, Ωα(θ0) is the acceptance region of a statistical test with significance level 1 − α for testing H0 : θ = θ0 against H1 : θ ∈ Θ1(θ0).

Summarizing, the hypothesis test fixes the parameter and asks which sample values, i.e. which acceptance region, are consistent with that fixed value. The confidence set fixes the sample value and asks which parameter values, i.e. which confidence interval, make this sample value most plausible. In short, in the conditions of the previous theorem:

(x1, . . . , xn) ∈ Ωα(θ0) if and only if θ0 ∈ Θα(x1, . . . , xn) .

A final remark: even though we are using the theorem above to construct hypothesis tests from confidence sets, this result is very often used for the opposite purpose, i.e. to build confidence sets from hypothesis tests.

In the next subsections, we will translate some of the confidence intervals for the parameters of normal populations discussed in the previous section to the setting of hypothesis testing. We conclude with some short comments about tests for the fit of a distribution and for the independence of samples.

4.4.1 Mean

We distinguish between the following two cases:

1) Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²) with unknown mean µ and known variance σ². Let z_α be such that P(Z > z_α) = α, where Z ∼ N(0, 1). Statistical test with significance 1 − α of H0 : µ = µ0 against H1:

• if H1 : µ ≠ µ0: Reject H0 if X̄n < µ0 − z_{α/2} σ/√n or X̄n > µ0 + z_{α/2} σ/√n.

• if H1 : µ > µ0: Reject H0 if X̄n > µ0 + z_α σ/√n.

• if H1 : µ < µ0: Reject H0 if X̄n < µ0 − z_α σ/√n.

2) Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²) with unknown mean µ and unknown variance σ². Let t_{α,n−1} be such that P(T > t_{α,n−1}) = α, where T ∼ t(n − 1). Statistical test with significance 1 − α of H0 : µ = µ0 against H1:

• if H1 : µ ≠ µ0: Reject H0 if X̄n < µ0 − t_{α/2,n−1} S′/√n or X̄n > µ0 + t_{α/2,n−1} S′/√n.

• if H1 : µ > µ0: Reject H0 if X̄n > µ0 + t_{α,n−1} S′/√n.

• if H1 : µ < µ0: Reject H0 if X̄n < µ0 − t_{α,n−1} S′/√n.

4.4.2 Variance

Let X1, . . . , Xn be a random sample of size n taken from a normal population N(µ, σ²) with unknown mean µ and variance σ². Let q_{α,n−1} be such that P(Q > q_{α,n−1}) = α, where Q ∼ χ²(n − 1). Statistical test with significance 1 − α of H0 : σ² = σ0² against H1:

• if H1 : σ² ≠ σ0²: Reject H0 if

S′² < q_{1−α/2,n−1} σ0²/(n − 1) or S′² > q_{α/2,n−1} σ0²/(n − 1) .

• if H1 : σ² > σ0²: Reject H0 if

S′² > q_{α,n−1} σ0²/(n − 1) .

• if H1 : σ² < σ0²: Reject H0 if

S′² < q_{1−α,n−1} σ0²/(n − 1) .

4.4.3 Difference of two means

We distinguish between the following three cases:

1) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means µ1 and µ2 and known variances σ1² and σ2². Let z_α be such that P(Z > z_α) = α, where Z ∼ N(0, 1). Statistical test with significance 1 − α of H0 : µ1 = µ2 against H1:

• if H1 : µ1 ≠ µ2: Reject H0 if

X̄1n − X̄2m < −z_{α/2} √(σ1²/n + σ2²/m) or X̄1n − X̄2m > z_{α/2} √(σ1²/n + σ2²/m) .

• if H1 : µ1 > µ2: Reject H0 if

X̄1n − X̄2m > z_α √(σ1²/n + σ2²/m) .

• if H1 : µ1 < µ2: Reject H0 if

X̄1n − X̄2m < −z_α √(σ1²/n + σ2²/m) .

2) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means µ1 and µ2 and unknown, but equal, variances σ1² = σ2² = σ². Let t_{α,n+m−2} be such that P(T > t_{α,n+m−2}) = α, where T ∼ t(n + m − 2), and let S* be given by

S* = √( (1/n + 1/m) · ((n − 1)S′1² + (m − 1)S′2²)/(n + m − 2) ) .

Statistical test with significance 1 − α of H0 : µ1 = µ2 against H1:

• if H1 : µ1 ≠ µ2: Reject H0 if

X̄1n − X̄2m < −t_{α/2,n+m−2} S* or X̄1n − X̄2m > t_{α/2,n+m−2} S* .

• if H1 : µ1 > µ2: Reject H0 if X̄1n − X̄2m > t_{α,n+m−2} S*.

• if H1 : µ1 < µ2: Reject H0 if X̄1n − X̄2m < −t_{α,n+m−2} S*.

3) Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means µ1 and µ2 and unknown and unequal variances σ1² and σ2². Let S* be given by

S* = √(S′1²/n + S′2²/m) ,

and let t_{α,r} be such that P(T > t_{α,r}) = α, where T ∼ t(r) and r is the integer part of r*:

r* = (S′1²/n + S′2²/m)² / [ (1/(n − 1)) (S′1²/n)² + (1/(m − 1)) (S′2²/m)² ] .

Statistical test with significance 1 − α of H0 : µ1 = µ2 against H1:

• if H1 : µ1 ≠ µ2: Reject H0 if X̄1n − X̄2m < −t_{α/2,r} S* or X̄1n − X̄2m > t_{α/2,r} S*.

• if H1 : µ1 > µ2: Reject H0 if X̄1n − X̄2m > t_{α,r} S*.

• if H1 : µ1 < µ2: Reject H0 if X̄1n − X̄2m < −t_{α,r} S*.


4.4.4 Ratio of two variances

Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two normal populations N(µ1, σ1²) and N(µ2, σ2²) with unknown means and variances. Let f_{α,n−1,m−1} be such that P(F > f_{α,n−1,m−1}) = α, where F ∼ F(n − 1, m − 1). Statistical test with significance 1 − α of H0 : σ1² = σ2² against H1:

• if H1 : σ1² ≠ σ2²: Reject H0 if

S′1²/S′2² < f_{1−α/2,n−1,m−1} or S′1²/S′2² > f_{α/2,n−1,m−1} .

• if H1 : σ1² > σ2²: Reject H0 if

S′1²/S′2² > f_{α,n−1,m−1} .

• if H1 : σ1² < σ2²: Reject H0 if

S′1²/S′2² < f_{1−α,n−1,m−1} .

4.4.5 Proportion

Let X1, . . . , Xn be a random sample of size n taken from a Bernoulli population Bi(1, p) with unknown parameter (proportion) p. Assume that the sample size n is sufficiently large so that the central limit theorem may be used to provide an asymptotic distribution for X̄n. Let z_α be such that P(Z > z_α) = α, where Z ∼ N(0, 1). Statistical test with significance 1 − α of H0 : p = p0 against H1:

• if H1 : p ≠ p0: Reject H0 if

X̄n < p0 − z_{α/2} √(p0(1 − p0)/n) or X̄n > p0 + z_{α/2} √(p0(1 − p0)/n) .

• if H1 : p > p0: Reject H0 if

X̄n > p0 + z_α √(p0(1 − p0)/n) .

• if H1 : p < p0: Reject H0 if

X̄n < p0 − z_α √(p0(1 − p0)/n) .

4.4.6 Difference of two proportions

Let X11, . . . , X1n and X21, . . . , X2m be two independent random samples of sizes n and m taken, respectively, from two Bernoulli populations Bi(1, p1) and Bi(1, p2) with unknown proportions p1 and p2. Assume that the sample sizes n and m are sufficiently large so that the central limit theorem may be used to provide an asymptotic distribution for X̄1n − X̄2m. Let S* be given by

S* = √( p̂(1 − p̂)(1/n + 1/m) ) ,

where

p̂ = (n X̄1n + m X̄2m)/(n + m) ,

and let z_α be such that P(Z > z_α) = α, where Z ∼ N(0, 1). Statistical test with significance 1 − α of H0 : p1 = p2 against H1:

• if H1 : p1 ≠ p2: Reject H0 if X̄1n − X̄2m < −z_{α/2} S* or X̄1n − X̄2m > z_{α/2} S*.

• if H1 : p1 > p2: Reject H0 if X̄1n − X̄2m > z_α S*.

• if H1 : p1 < p2: Reject H0 if X̄1n − X̄2m < −z_α S*.
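The two-sided version of this test with the pooled estimate p̂ takes only a few lines. The sketch below uses z_{0.025} ≈ 1.96 (α = 0.05); the counts are illustrative, not part of the notes.

```python
import math

# Two-sample proportion test: H0: p1 = p2 against H1: p1 != p2, using the
# pooled estimate p_hat and the normal approximation; z = 1.96 for
# alpha = 0.05 (significance 0.95 in the notes' convention).
def two_proportion_test(k1, n, k2, m, z=1.96):
    x1, x2 = k1 / n, k2 / m
    p_hat = (n * x1 + m * x2) / (n + m)              # pooled proportion
    s_star = math.sqrt(p_hat * (1 - p_hat) * (1 / n + 1 / m))
    return abs(x1 - x2) > z * s_star                 # True means "reject H0"

print(two_proportion_test(520, 1000, 480, 1000))  # → False (do not reject)
print(two_proportion_test(600, 1000, 480, 1000))  # → True (reject H0)
```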

4.4.7 Other tests

The tests described above are suitable for testing the parameters of normal populations. Moreover, as illustrated with the tests for the proportion of a Bernoulli population, if the random sample is sufficiently large, it is possible to use the central limit theorem to produce tests for the population mean of non-normal populations.

However, very seldom is the probability distribution of the population under observation exactly known. In such cases, the best one can aim for is to make an “informed” guess about that probability distribution and to validate or reject such a guess using hypothesis tests. Tests with this goal usually have a null hypothesis stating that the population has a given distribution, while the alternative hypothesis asserts that this is not the case. Examples of this class of tests include the chi-square goodness-of-fit test and the non-parametric Kolmogorov-Smirnov test. Further details can be found in the references.
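As a small illustration of the chi-square goodness-of-fit idea, the sketch below tests whether a die is fair by comparing the statistic ∑(O_i − E_i)²/E_i with the chi-square critical value for 5 degrees of freedom at α = 0.05 (≈ 11.07, taken from standard tables — an assumption of this sketch, as are the function name and the simulated data).

```python
import random

# Chi-square goodness-of-fit sketch: is a six-sided die fair?
def chi_square_gof(observed, expected, critical=11.07):
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, stat > critical  # (statistic, reject H0?)

random.seed(7)
rolls = [random.randint(1, 6) for _ in range(600)]
observed = [rolls.count(face) for face in range(1, 7)]
stat, reject = chi_square_gof(observed, [100] * 6)
print(round(stat, 2), reject)
```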

A key assumption often used in probability and statistics is independence. Given two samples, there may be interest in finding out whether the two samples are independent before proceeding further with their statistical treatment. This goal can be achieved through the use of the chi-square test for independence.

Finally, it should be remarked that there is a very large range of statistical tests, each one adapted to a particular case or situation. Check the references for more information on this topic.


References

[1] G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, 2nd edition,2001.

[2] D.R. Cox and V. Isham. Point Processes. Chapman and Hall/CRC, 1980.

[3] J.L. Devore and K.N. Berk. Modern Mathematical Statistics with Applications.Springer, 2nd edition, 2012.

[4] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 3rd edition, 2001.

[5] B. Oksendal. Stochastic Differential Equations: An Introduction with Applica-tions. Springer, 6th edition, 2003.

[6] P. E. Pfeiffer. Concepts of Probability Theory. Dover Publications, 2nd revisededition, 2012.

[7] J. Pitman. Probability. Springer texts in Statistics. Springer, 1993.

[8] J. Jacod and P. Protter. Probability Essentials. Springer, 2nd edition, 2004.

[9] S. Ross. Stochastic Processes. Wiley, 2nd edition, 1995.

[10] S. Ross. A First Course in Probability. Pearson, 8th edition, 2010.

[11] D. Stirzaker. Probability and Random Variables: A Beginner’s Guide. Cam-bridge University Press, 1999.

[12] Y. Suhov and M. Kelbert. Probability and Statistics by Example: Volume 1,Basic Probability and Statistics. Cambridge University Press, 2005.
