Statistical Inference - Lecture 1: Probability Theory
TRANSCRIPT
MING GAO
DaSE @ ECNU
Mar. 2, 2021
Outline
1 Introduction
2 Set Theory
3 Basics of Probability Theory
  The Calculus of Probabilities
  Counting
4 Random Variable and Distributions
  Random Variable
  Distribution Functions
  Density and Mass Functions
5 Take-aways
Introduction
Probability as a mathematical framework for:
reasoning about uncertainty;
deriving approaches to inference problems.
Set Theory
Experiment and sample space
Experiment
An experiment is a procedure that yields one of a given set of possible outcomes.
Sample space
The sample space of the experiment, denoted as Ω, is the set of possible outcomes.
Example
Roll a die one time: Ω = {1, 2, 3, 4, 5, 6}. Toss a coin twice (Head = H, Tail = T): Ω = {HH, HT, TH, TT}.
The "list" (set) of possible outcomes must be:
1 Mutually exclusive
2 Collectively exhaustive
Art: choosing the "right" granularity.
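As a quick sketch (not part of the original slides), these two sample spaces can be enumerated in Python:

```python
# Sketch: enumerating the sample spaces of the two example experiments.
from itertools import product

omega_die = {1, 2, 3, 4, 5, 6}                               # roll a die once
omega_coins = {"".join(p) for p in product("HT", repeat=2)}  # toss a coin twice

print(sorted(omega_die))    # [1, 2, 3, 4, 5, 6]
print(sorted(omega_coins))  # ['HH', 'HT', 'TH', 'TT']
```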
Continuous sample space
For this case, the sample space is Ω = {(x, y) | 0 ≤ x, y ≤ 1}. Note that this sample space is infinite and uncountable.
In this course we consider both countable and uncountable sample spaces; accordingly, we call the corresponding material discrete probability and continuous probability, respectively.
Event and set operators
An event, represented as a set, is a subset of the sample space.
Example
Toss at least one head: B = {HH, HT, TH} ⊂ Ω;
Toss at least three heads: C = ∅ ⊂ Ω.
There are 2^|Ω| events for an experiment;
Events therefore admit all set operations.
Operators
Containment: A ⊂ B ⇔ (x ∈ A ⇒ x ∈ B);
Union: A ∪ B = {x | x ∈ A or x ∈ B};
Intersection: A ∩ B = {x | x ∈ A and x ∈ B};
Difference: A − B = {x | x ∈ A ∧ x ∉ B};
Complement: Aᶜ = {x | x ∉ A}.
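Since events are just sets, these operators map directly onto Python's set type; a minimal sketch using the coin-toss events above:

```python
# Sketch: the event operators above, on the two-coin sample space.
omega = {"HH", "HT", "TH", "TT"}
B = {"HH", "HT", "TH"}         # toss at least one head
C = set()                      # toss at least three heads: the empty event

print(B <= omega)              # containment: B ⊂ Ω -> True
print(B | C)                   # union -> B itself
print(B & {"HH", "TT"})        # intersection -> {'HH'}
print(B - {"HH"})              # difference -> {'HT', 'TH'}
print(omega - B)               # complement Bᶜ within Ω -> {'TT'}
```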
Set identities
Table of set identities
equivalence                                               name
A ∩ U = A;  A ∪ ∅ = A                                     Identity laws
A ∩ A = A;  A ∪ A = A                                     Idempotent laws
(A ∩ B) ∩ C = A ∩ (B ∩ C);  (A ∪ B) ∪ C = A ∪ (B ∪ C)     Associative laws
A ∪ (A ∩ B) = A;  A ∩ (A ∪ B) = A                         Absorption laws
A ∩ Aᶜ = ∅;  A ∪ Aᶜ = U                                   Complement laws
Logical equivalence Cont'd
Table of set identities
equivalence                                               name
A ∪ U = U;  A ∩ ∅ = ∅                                     Domination laws
A ∩ B = B ∩ A;  A ∪ B = B ∪ A                             Commutative laws
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C);  (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)   Distributive laws
(A ∩ B)ᶜ = Aᶜ ∪ Bᶜ;  (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ                   De Morgan's laws
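The identities in both tables can be spot-checked mechanically; a small sketch verifying De Morgan's laws on concrete sets:

```python
# Sketch: spot-checking De Morgan's laws on small concrete sets.
U = set(range(10))             # a small universe
A, B = {1, 2, 3}, {3, 4, 5}

assert U - (A & B) == (U - A) | (U - B)   # (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
assert U - (A | B) == (U - A) & (U - B)   # (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
print("De Morgan's laws hold on this example")
```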
Extensions of set operators
If A1, A2, · · · , An, · · · is a collection of events, all defined on a sample space Ω, then
Union: ⋃_{i=1}^∞ A_i = {x ∈ Ω | x ∈ A_i for some i};
Intersection: ⋂_{i=1}^∞ A_i = {x ∈ Ω | x ∈ A_i for all i}.
Example
Let Ω = (0, 1] and define A_i = [1/i, 1]. Then
⋃_{i=1}^∞ A_i = lim_{n→+∞} ⋃_{i=1}^n [1/i, 1] = lim_{n→+∞} [1/n, 1] = (0, 1];
⋂_{i=1}^∞ A_i = lim_{n→+∞} ⋂_{i=1}^n [1/i, 1] = lim_{n→+∞} [1, 1] = {1}.
Extensions of set operators Cont'd
If Γ is an uncountable index set, with events A_a (a ∈ Γ) all defined on a sample space Ω, then
Union: ⋃_{a∈Γ} A_a = {x ∈ Ω | x ∈ A_a for some a};
Intersection: ⋂_{a∈Γ} A_a = {x ∈ Ω | x ∈ A_a for all a}.
Example
Let Γ = ℝ⁺ and define A_a = (0, a]. Then
⋃_{a∈Γ} A_a = (0, +∞);
⋂_{a∈Γ} A_a = ∅.
Disjoint and partition
Two events A and B are disjoint (or mutually exclusive) if A ∩ B = ∅. The events A1, A2, · · · , An, · · · are pairwise disjoint (or mutually exclusive) if A_i ∩ A_j = ∅ for all i ≠ j.
If A1, A2, · · · , An, · · · are pairwise disjoint and ⋃_{i=1}^∞ A_i = Ω, then the collection A1, A2, · · · , An, · · · forms a partition of Ω.
The collection A_i = [i, i + 1) for i ∈ ℕ consists of pairwise disjoint sets. Furthermore, we have ⋃_{i∈ℕ} A_i = [0, +∞).
Hence the collection of sets A_i = [i, i + 1) for i ∈ ℕ forms a partition of [0, +∞).
Basics of Probability Theory
Sigma Algebra
A collection of subsets of Ω is called a sigma algebra (or Borel field), denoted as B, if it satisfies the following three properties:
∅ ∈ B;
If A ∈ B, then Aᶜ ∈ B;
If A1, A2, · · · ∈ B, then ⋃_{i=1}^∞ A_i ∈ B (B is closed under countable unions).
In addition, from De Morgan's Laws it follows that B is closed under countable intersections, i.e.,
(⋃_{i=1}^∞ A_i)ᶜ = ⋂_{i=1}^∞ A_iᶜ ∈ B.
Example
If Ω is finite or countable, the sigma algebra of Ω can be defined as
B = {A | A ⊆ Ω}.
If Ω has n elements, there are 2^n sets in B;
If Ω is countably infinite, B is uncountable: its cardinality is 2^ℵ0;
Let Ω = (−∞, +∞); then B is chosen to contain all sets of the form
[a, b], (a, b], (a, b) and [a, b)
for all real numbers a and b.
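For the finite case, this power-set sigma algebra can be built explicitly; a sketch:

```python
# Sketch: for finite Ω, the power set B = {A | A ⊆ Ω} is a sigma algebra with 2^n members.
from itertools import combinations

def power_set(omega):
    s = list(omega)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

B = power_set({1, 2, 3})
print(len(B))  # 2^3 = 8 sets, from ∅ up to Ω itself
```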
Probability function
Definition
Given a sample space Ω and an associated sigma algebra B, a probability function is a function P with domain B that satisfies
Nonnegativity: P(A) ≥ 0 for all A ∈ B;
Normalization: P(Ω) = 1 and P(∅) = 0;
Additivity: If A1, A2, · · · ∈ B are pairwise disjoint, then P(⋃_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Note that
The three properties given in the definition are usually referred to as the Axioms of Probability (or the Kolmogorov Axioms).
Any function P that satisfies the Axioms of Probability is called a probability function.
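A sketch (not from the slides) of a probability function on a finite Ω, with the axioms checked directly; the fair-die weights are just an example:

```python
# Sketch: a probability function on a finite Ω as a dict of outcome weights.
import math

weights = {o: 1/6 for o in range(1, 7)}   # a fair die as an example

def prob(event):
    # P(A) is the sum of outcome weights, so additivity holds by construction.
    return sum(weights[o] for o in event)

assert all(w >= 0 for w in weights.values())               # nonnegativity
assert math.isclose(prob(weights.keys()), 1.0)             # normalization: P(Ω) = 1
assert math.isclose(prob({1, 2}), prob({1}) + prob({2}))   # additivity on disjoint events
```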
Example of defining probability
Consider the simple experiment of tossing a fair coin, so Ω = {H, T}; hence a reasonable probability function satisfies
P({H}) = P({T}).
P({H, T}) = P({H}) + P({T}) = 1;
P({H}) = P({T}) = 1/2;
We can also define a probability function for an unfair coin, e.g., P({H}) = 1/3 and P({T}) = 2/3.
The Calculus of Probabilities
Probability operators
Operators
Let Ω be the sample space, and let A and B be two events:
1 If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B);
2 P(Aᶜ) = P(Ω) − P(A) = 1 − P(A);
3 If A and B are independent, then P(A ∩ B) = P(A) · P(B);
4 The conditional probability of A given B, denoted by P(A|B), is computed as P(A|B) = P(A ∩ B)/P(B).
Probability properties I
If P is a probability function and A and B are any two sets in B, then
P(∅) = 0, where ∅ is the empty set;
P(A) ≤ 1;
P(B ∩ Aᶜ) = P(B) − P(A ∩ B);
P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
If A ⊂ B, then P(A) ≤ P(B).
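These properties are easy to sanity-check numerically; a sketch on an equally-likely die model:

```python
# Sketch: checking the listed properties on Ω = {1, ..., 6} with equally likely outcomes.
import math

P = lambda event: len(event) / 6
A, B = {1, 2, 3}, {2, 3, 4, 5}

assert math.isclose(P(B - A), P(B) - P(A & B))         # P(B ∩ Aᶜ) = P(B) − P(A ∩ B)
assert math.isclose(P(A | B), P(A) + P(B) - P(A & B))  # inclusion-exclusion
assert P(A) <= P(A | B) <= 1                           # A ⊂ A ∪ B and P ≤ 1
```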
Probability properties II
If P is a probability function and the collection C1, C2, · · · is a partition of Ω, then
P(A) = Σ_{i=1}^∞ P(A ∩ C_i);
P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i) for any collection A1, A2, · · · .
To prove the second statement, we first construct a disjoint collection A*_1, A*_2, · · · , where
A*_1 = A1 and A*_i = A_i \ (⋃_{j=1}^{i−1} A_j) for i = 2, 3, · · ·
Running example of independence
Tossing coins
We toss a coin twice (Head = H, Tail = T), so Ω = {HH, HT, TH, TT}. We define three events:
1 A: the first toss is H;
2 B: the second toss is H;
3 C: the first and second tosses give the same result.
Hence, we have
P(A) = P(B) = P(C) = 1/2;
P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = |{HH}|/|Ω| = 1/4;
P(A ∩ B ∩ C) = |{HH}|/|Ω| = 1/4.
Thus A, B, and C are pairwise independent, but P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C), so they are not mutually independent.
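A sketch verifying this pairwise-but-not-mutual independence by enumeration:

```python
# Sketch: pairwise independence without mutual independence, checked by enumeration.
omega = {"HH", "HT", "TH", "TT"}
A = {w for w in omega if w[0] == "H"}      # first toss is H
B = {w for w in omega if w[1] == "H"}      # second toss is H
C = {w for w in omega if w[0] == w[1]}     # both tosses give the same result
P = lambda e: len(e) / len(omega)

print(P(A & B) == P(A) * P(B))             # True  (pairwise independent)
print(P(A & C) == P(A) * P(C))             # True
print(P(A & B & C) == P(A) * P(B) * P(C))  # False (1/4 != 1/8)
```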
Independence
Definition
Events A1 and A2 are pairwise independent (statistically independent) if and only if
P(A1 ∩ A2) = P(A1)P(A2);
Events A1, A2, · · · , An are mutually independent if
P(A_{i1} ∩ A_{i2} ∩ · · · ∩ A_{im}) = P(A_{i1})P(A_{i2}) · · · P(A_{im}),
where the i_j, j = 1, 2, · · · , m, are integers with 1 ≤ i1 < i2 < · · · < im ≤ n and m ≥ 2.
Note that mutual independence implies pairwise independence, but pairwise independence need not imply mutual independence (as shown in the previous example).
Independence Cont'd
Theorem
If events A and B are independent, then
A and Bᶜ are independent;
Aᶜ and B are independent;
Aᶜ and Bᶜ are independent.
Proof.
P(A) = P(A ∩ (B ∪ Bᶜ)) = P((A ∩ B) ∪ (A ∩ Bᶜ)) = P(A ∩ B) + P(A ∩ Bᶜ), so
P(A ∩ Bᶜ) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(Bᶜ).
Hence, A and Bᶜ are independent; the other cases are analogous.
Conditional probability
Definition
Let E and F be events with P(F) > 0. The conditional probability of E given F, denoted by P(E|F), is defined as
P(E|F) = P(E ∩ F)/P(F).
P(E|F) is the probability of E, given that F occurred;
F is our new sample space;
P(E|F) is undefined if P(F) = 0.
Example
Question: What is the conditional probability that a family with two children has two boys, given that they have at least one boy? Assume that each of the possibilities BB, BG, GB, and GG is equally likely.
Solution: Let A be the event that a family with two children has two boys, and let B be the event that a family with two children has at least one boy.
Thus, A = {BB}, B = {BB, BG, GB}, and A ∩ B = {BB}.
Because the four possibilities are equally likely, it follows that P(B) = 3/4 and P(A ∩ B) = 1/4.
We conclude that
P(A|B) = P(A ∩ B)/P(B) = (1/4)/(3/4) = 1/3.
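The same answer falls out of direct enumeration; a sketch:

```python
# Sketch: the two-children conditional probability by direct enumeration.
omega = ["BB", "BG", "GB", "GG"]          # equally likely families
B = [f for f in omega if "B" in f]        # at least one boy
A_and_B = [f for f in B if f == "BB"]     # two boys, restricted to B

print(len(A_and_B) / len(B))              # 0.333... = 1/3
```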
Remarks for conditional probability
P(E|F) = P(E) if events E and F are independent;
P(E ∩ F) = P(E) · P(F|E) = P(F) · P(E|F);
P(E) = P(F1) · P(E|F1) + P(F2) · P(E|F2) + P(F3) · P(E|F3) if F1, F2 and F3 form a partition of Ω (total probability theorem);
P(F_i|E) = P(F_i) · P(E|F_i) / P(E) = P(F_i) · P(E|F_i) / Σ_i P(F_i) · P(E|F_i) (Bayes' rule).
Proof of the total probability theorem
Theorem
Let F_i (for i = 1, 2, · · · , n) be a partition of the sample space Ω. Then, for any event E,
P(E) = Σ_{i=1}^n P(F_i) · P(E|F_i).
Proof:
P(E) = P(E ∩ Ω) = P(E ∩ (F1 ∪ F2 ∪ · · · ∪ Fn))
     = P((E ∩ F1) ∪ (E ∩ F2) ∪ · · · ∪ (E ∩ Fn))
     = P(E ∩ F1) + P(E ∩ F2) + · · · + P(E ∩ Fn)
     = Σ_{i=1}^n P(F_i) · P(E|F_i).
Bayes' Theorem
Theorem
Suppose that E is an event from a sample space Ω and F1, F2, · · · , Fn is a partition of the sample space. Let P(E) ≠ 0 and P(F_i) ≠ 0 for all i. Then
P(F_i|E) = P(E|F_i)P(F_i) / Σ_{k=1}^n P(E|F_k)P(F_k).
Proof: By the definition of conditional probability, P(F_i|E) = P(F_i ∩ E)/P(E) = P(E|F_i)P(F_i)/P(E); expanding P(E) by the total probability theorem gives the result.
Diagnostic test for a rare disease
Suppose that one in 100,000 persons has a particular rare disease for which there is a fairly accurate diagnostic test. This test is correct 99.0% of the time when given to a person selected at random who has the disease; it is correct 99.5% of the time when given to a person selected at random who does not have the disease. Please find
(a) the probability that a person who tests positive for the disease has the disease;
(b) the probability that a person who tests negative for the disease does not have the disease.
Should a person who tests positive be very concerned that he or she has the disease?
Solution: Let B be the event that a person selected at random has the disease, and let A be the event that a person selected at random tests positive for the disease.
Diagnostic test for a rare disease Cont'd
Hence, we have P(B) = 1/100,000 = 10⁻⁵. Then we also have P(A|B) = 0.99, P(Aᶜ|B) = 0.01, P(Aᶜ|Bᶜ) = 0.995, and P(A|Bᶜ) = 0.005.
Case a: By Bayes' theorem, we have
P(B|A) = P(A|B)P(B) / (P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ))
       = 0.99 · 10⁻⁵ / (0.99 · 10⁻⁵ + 0.005 · 0.99999) ≈ 0.002
Case b: Similarly, we have
P(Bᶜ|Aᶜ) = P(Aᶜ|Bᶜ)P(Bᶜ) / (P(Aᶜ|Bᶜ)P(Bᶜ) + P(Aᶜ|B)P(B)) ≈ 0.9999999
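A sketch reproducing these numbers with Bayes' theorem:

```python
# Sketch: the rare-disease posterior via Bayes' theorem.
p_disease = 1e-5                  # P(B)
p_pos_given_disease = 0.99        # P(A|B)
p_pos_given_healthy = 0.005       # P(A|Bᶜ) = 1 − 0.995

posterior = (p_pos_given_disease * p_disease) / (
    p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease))
print(posterior)                  # ≈ 0.00198, i.e., about 0.002
```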
Bayesian spam filters
Most electronic mailboxes receive a flood of unwanted and unsolicited messages, known as spam. On the Internet, an Internet Water Army is a group of Internet ghostwriters paid to post online comments with particular content.
Question: How can we detect spam email?
Solution: Bayesian spam filters look for occurrences of particular words in messages. For a particular word w, the probability that w appears in a spam e-mail message is estimated by counting the number of times w appears in a large set of messages known to be spam and the number of times it appears in a large set of messages known not to be spam.
Step 1: Collect ground truth. Suppose we have a set B of messages known to be spam and a set G of messages known not to be spam.
Bayesian spam filters Cont'd
Step 2: Learn parameters. We next identify the words that occur in B and in G. Let n_B(w) and n_G(w) be the number of messages containing word w in sets B and G, respectively.
Let p(w) = n_B(w)/|B| and q(w) = n_G(w)/|G| be the empirical probabilities that a spam message and a non-spam message contain word w, respectively.
Step 3: Make a decision. Now suppose we receive a new e-mail message containing word w. Let B be the event that the message is spam, and let A be the event that the message contains word w.
By Bayes' theorem, the probability that the message is spam, given that it contains word w, is
P(B|A) = P(A|B)P(B) / (P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ)).
Bayesian spam filters Cont'd
To apply the above formula, we first estimate P(B) (probability of spam) and P(Bᶜ) (probability of not spam).
Without prior knowledge, for simplicity we assume that a message is equally likely to be spam as not to be spam, i.e., P(B) = P(Bᶜ) = 1/2.
Using this assumption, we find that the probability that a message is spam, given that it contains word w, is
P(B|A) = P(A|B) / (P(A|B) + P(A|Bᶜ)).
With P(A|B) and P(A|Bᶜ) estimated as above, P(B|A) can be estimated by
r(w) = p(w) / (p(w) + q(w)).
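A sketch of this estimate with hypothetical counts (the numbers are illustrative, not from the lecture):

```python
# Sketch: estimating r(w) from hypothetical ground-truth counts.
n_spam_with_w, n_spam = 250, 1000   # w appears in 250 of 1000 known spam messages
n_good_with_w, n_good = 5, 1000     # w appears in 5 of 1000 known good messages

p_w = n_spam_with_w / n_spam        # p(w)
q_w = n_good_with_w / n_good        # q(w)
print(p_w / (p_w + q_w))            # r(w) ≈ 0.98 under these counts
```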
Extended Bayesian spam filters
The more words we use to estimate the probability that an incoming mail message is spam, the better our chance of correctly determining whether it is spam.
In general, if A_i is the event that the message contains word w_i, and assuming that P(S) = P(Sᶜ) and that the events A_i are conditionally independent given the class, then by Bayes' theorem the probability that a message containing all of the words w1, w2, · · · , wk is spam is
P(S | ⋂_{i=1}^k A_i) = P(⋂_{i=1}^k A_i | S) P(S) / (P(⋂_{i=1}^k A_i | S) P(S) + P(⋂_{i=1}^k A_i | Sᶜ) P(Sᶜ))
                     = Π_{i=1}^k P(A_i|S) / (Π_{i=1}^k P(A_i|S) + Π_{i=1}^k P(A_i|Sᶜ))
                     ≈ Π_{i=1}^k p(w_i) / (Π_{i=1}^k p(w_i) + Π_{i=1}^k q(w_i)) = r(w1, w2, · · · , wk).
Naive Bayes
Why is this called Naive Bayes?
The model repeatedly applies the definition of conditional probability via the chain rule, together with the "naive" assumption above that the words are conditionally independent given the class.
To handle underflow, we calculate Π_{i=1}^n P(X_i|S) = exp(Σ_{i=1}^n log P(X_i|S)).
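A sketch of why the log-space trick is needed: the direct product of many small factors underflows, while the sum of logs stays representable (in practice one then compares log-probabilities across classes rather than exponentiating back):

```python
# Sketch: products of many small probabilities underflow; sums of logs do not.
import math

probs = [0.01] * 300                      # 300 tiny per-word factors

print(math.prod(probs))                   # 0.0 — underflows (true value is 1e-600)
print(sum(math.log(p) for p in probs))    # ≈ -1381.55 — safe to compare across classes
```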
Counting
Counting principle I
Product rule
Suppose that a procedure consists of a sequence of two tasks. If there are n1 ways to do the first task and n2 ways to do the second task, then there are n1·n2 ways to do the procedure.
Extended version: A procedure consists of tasks T1, T2, · · · , Tm performed in sequence. If each task T_i can be done in n_i ways, independently of the others, then there are n1 · n2 · · · · · nm ways to carry out the procedure.
Sum rule
If a task can be done either in one of n1 ways or in one of n2 ways, where none of the n1 ways is the same as any of the n2 ways, then there are n1 + n2 ways to do the task.
Extended version: A procedure can be done in one of m ways, where way W_i has n_i possibilities; then there are Σ_{i=1}^m n_i ways to do the procedure.
Counting principle II
Division rule
There are n/d ways to do a task if it can be done using a procedure that can be carried out in n ways, and for every way w, exactly d of the n ways correspond to way w.
Inclusion-exclusion rule
If a task can be done in either n1 ways or n2 ways, then the number of ways to do the task is n1 + n2 minus the number of ways common to the two. This rule is also called the principle of inclusion-exclusion, i.e.,
|A ∪ B| = |A| + |B| − |A ∩ B|.
Permutations and combinations
Permutations
An ordered arrangement of m elements of a set is called an m-permutation. The number of m-permutations of a set with n distinct elements is
P(n, m) = n(n − 1)(n − 2) · · · · · (n − m + 1) = n!/(n − m)!.
Combinations
An m-combination of elements of a set is an unordered selection of m elements from the set. The number of m-combinations of a set with n elements equals
C(n, m) = n!/(m!(n − m)!),
where C(n, m) is also denoted as (n choose m).
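Both counts are available directly in Python's standard library; a quick sketch:

```python
# Sketch: permutation and combination counts from the standard library.
import math

print(math.perm(5, 3))   # P(5, 3) = 5!/2! = 60
print(math.comb(5, 3))   # C(5, 3) = 5!/(3!·2!) = 10
# An m-permutation is an m-combination followed by an ordering of the m elements:
print(math.perm(5, 3) == math.comb(5, 3) * math.factorial(3))  # True
```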
Finite probability
If S is a finite nonempty sample space of equally likely outcomes, and E is an event, that is, a subset of S, then the probability of E is
p(E) = |E|/|S|.
Let all outcomes be equally likely;
Computing probabilities ≡ two counting problems:
counting the successful ways of the event;
counting the size of the sample space.
Examples
Example I
Question: An urn contains four blue balls and five red balls. What is the probability that a ball chosen at random from the urn is blue?
Solution: Let S be the sample space of the nine (distinguishable) balls, say
S = {b1, b2, b3, b4, r1, r2, r3, r4, r5}.
Let E be the event of choosing a blue ball, i.e.,
E = {b1, b2, b3, b4}.
In terms of the definition, we can compute the probability as
P(E) = |E|/|S| = 4/9.
Examples Cont'd
Example II
Question: What is the probability that when two dice are rolled, the sum of the numbers on the two dice is 7?
Solution:
There are a total of 36 possible outcomes when two dice are rolled.
There are six successful outcomes, namely, (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), and (6, 1).
Hence, the probability that a seven comes up when two fair dice are rolled is 6/36 = 1/6.
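A sketch confirming the count by enumerating all 36 outcomes:

```python
# Sketch: the dice-sum probability by brute-force enumeration.
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
sevens = [o for o in outcomes if sum(o) == 7]
print(len(sevens), len(outcomes), len(sevens) / len(outcomes))  # 6 36 0.1666...
```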
Examples Cont'd
Example III
Question: In a lottery, players win a large prize when they pick four random digits that match, in the correct order. A smaller prize is won if only three digits are matched. What is the probability that a player wins the large prize? What is the probability that a player wins the small prize?
Solution: By the product rule, there are 10^4 = 10,000 ways to choose four digits.
Large prize case: There is only one way to choose all four digits correctly. Thus, the probability is 1/10,000 = 0.0001.
Small prize case: Exactly one digit must be wrong to get three digits correct, but not all four. Hence, there are C(4, 1) × 9 = 36 ways to choose four digits with exactly three of the four digits correct. Thus, the probability that a player wins the smaller prize is 36/10,000 = 9/2500 = 0.0036.
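A sketch confirming the small-prize count by brute force (the winning digits are arbitrary):

```python
# Sketch: counting tickets matching exactly three of four digits.
winning = "1234"                               # any fixed winning string works
tickets = (f"{i:04d}" for i in range(10_000))
exactly_three = [t for t in tickets
                 if sum(a == b for a, b in zip(t, winning)) == 3]
print(len(exactly_three))                      # 36, so probability 36/10000 = 0.0036
```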
Basics of Probability Theory Counting
Examples Cont’d
Example IV
Question: Find the probabilities that a poker hand contains four cards of one kind, or a full house (i.e., three of one kind and two of another kind).
Solution: There are C(52, 5) different hands of five cards.
Case I: # hands of five cards with four cards of one kind is

C(13, 1)C(4, 4)C(48, 1).

Case II: # hands with three of one kind and two of another kind is

P(13, 2)C(4, 3)C(4, 2).

Hence, the probabilities are

C(13, 1)C(4, 4)C(48, 1) / C(52, 5) ≈ 0.00024,
P(13, 2)C(4, 3)C(4, 2) / C(52, 5) ≈ 0.0014.
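These quantities are easy to evaluate with Python's math.comb and math.perm (available since Python 3.8):

```python
from math import comb, perm

hands = comb(52, 5)                                   # C(52, 5) five-card hands
four_kind = comb(13, 1) * comb(4, 4) * comb(48, 1)    # C(13,1)C(4,4)C(48,1)
full_house = perm(13, 2) * comb(4, 3) * comb(4, 2)    # P(13,2)C(4,3)C(4,2)

print(four_kind / hands)    # ≈ 0.00024
print(full_house / hands)   # ≈ 0.0014
```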
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 43 / 61
Random Variable and Distributions Random Variable
Outline
1 Introduction
2 Set Theory
3 Basics of Probability Theory
The Calculus of Probabilities
Counting
4 Random Variable and Distributions
Random Variable
Distribution Functions
Density and Mass Functions
5 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 44 / 61
Random Variable and Distributions Random Variable
Random variables
Definition: A random variable (r.v.) X is a function from the sample space Ω of an experiment to the set of real numbers R, i.e.,

∀ω ∈ Ω, X(ω) = x ∈ R.
Remarks
Note that a random variable is a function. It is not a variable,and it is not random!
We usually use notation X, Y, etc. to represent a r.v., and x, y to represent the numerical values. For example, X = x means that r.v. X has value x.
The domain of the function can be countable or uncountable. If it is countable, the random variable is a discrete r.v.; otherwise it is a continuous r.v.
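A minimal sketch of the definition for the two-toss experiment below (the names omega and X are illustrative):

```python
omega = ["HH", "HT", "TH", "TT"]        # sample space for two coin tosses

def X(outcome: str) -> int:
    """The r.v.: number of heads in the outcome."""
    return outcome.count("H")

print({w: X(w) for w in omega})          # {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
```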
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 45 / 61
Random Variable and Distributions Random Variable
Examples of r.v.
A coin is tossed. If X is the r.v. whose value is the number of heads obtained, then X(H) = 1, X(T) = 0.
And then tossed again. We define sample space Ω = {HH, HT, TH, TT}. If Y is the r.v. whose value is the number of heads obtained, then

Y(HH) = 2, Y(HT) = Y(TH) = 1, Y(TT) = 0.

When a player rolls a die, he will win $1 if the outcome is 1, 2 or 3, and otherwise lose $1. Let Ω = {1, 2, 3, 4, 5, 6} and define X as follows:

X(1) = X(2) = X(3) = 1, X(4) = X(5) = X(6) = −1.
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 46 / 61
Random Variable and Distributions Random Variable
Random variables VS. events
Suppose now that a sample space Ω = {ω1, ω2, · · · , ωn} is given, and the r.v. X on Ω is defined as the number of heads obtained when we toss a coin twice.
Event E1 represents exactly one head obtained. Hence,

E1 = {ω : X(ω) = 1};

Event E2 represents an even number of heads obtained. Hence,

E2 = {ω : X(ω) mod 2 = 0};

Event E3 represents at least one head obtained. Hence,

E3 = {ω : X(ω) > 0}.
These indicate that we can also define probabilities in terms of r.v.s, as in the sketch below.
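Each of these events is just a preimage under X (a self-contained sketch of the two-toss example; names are illustrative):

```python
omega = ["HH", "HT", "TH", "TT"]              # two coin tosses
X = lambda w: w.count("H")                     # r.v.: number of heads

E1 = {w for w in omega if X(w) == 1}           # exactly one head
E2 = {w for w in omega if X(w) % 2 == 0}       # an even number of heads
E3 = {w for w in omega if X(w) > 0}            # at least one head
print(E1, E2, E3)
```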
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 47 / 61
Random Variable and Distributions Distribution Functions
Outline
1 Introduction
2 Set Theory
3 Basics of Probability Theory
The Calculus of Probabilities
Counting
4 Random Variable and Distributions
Random Variable
Distribution Functions
Density and Mass Functions
5 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 48 / 61
Random Variable and Distributions Distribution Functions
Cumulative distribution function
The cumulative distribution function or cdf of a r.v. X, denoted by FX(x), is defined by

FX(x) = PX(X ≤ x), for all x.
Consider the experiment of tossing three fair coins, and let X = # heads observed. The cdf of X is

FX(x) =
  0,   if −∞ < x < 0;
  1/8, if 0 ≤ x < 1;
  1/2, if 1 ≤ x < 2;
  7/8, if 2 ≤ x < 3;
  1,   if 3 ≤ x < +∞.
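A sketch computing these cdf values by enumerating the 8 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=3))     # 8 equally likely outcomes

def F(x):
    """FX(x) = P(X <= x), where X = number of heads."""
    hits = [w for w in omega if w.count("H") <= x]
    return Fraction(len(hits), len(omega))

print([F(x) for x in [-1, 0, 1, 2, 3]])   # [0, 1/8, 1/2, 7/8, 1]
```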
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 49 / 61
Random Variable and Distributions Distribution Functions
Theorem
The function F(x) is a cdf if and only if the following three conditions hold:
lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1;
F(x) is a nondecreasing function of x;
F(x) is right-continuous; that is, for every number x0, lim_{x↓x0} F(x) = F(x0).
Example
Suppose we do an experiment that consists of tossing a coin until a head appears. Let p = probability of a head on any given toss, and define a r.v. X = # tosses required to get a head. Then for any x = 1, 2, · · · ,

P(X ≤ x) = ∑_{i=1}^{x} (1 − p)^{i−1} p = 1 − (1 − p)^x.
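A quick exact check of this closed form (a sketch; the value p = 1/3 is an arbitrary illustration):

```python
from fractions import Fraction

p = Fraction(1, 3)                         # illustrative head probability
for x in range(1, 6):
    lhs = sum((1 - p) ** (i - 1) * p for i in range(1, x + 1))   # the sum
    rhs = 1 - (1 - p) ** x                                        # the closed form
    assert lhs == rhs
print("identity holds for x = 1..5")
```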
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 50 / 61
Random Variable and Distributions Distribution Functions
Example Cont’d
Given 0 < p < 1, we have
lim_{x→−∞} FX(x) = 0 (indeed FX(x) = 0 for x < 1) and lim_{x→+∞} FX(x) = lim_{x→+∞} 1 − (1 − p)^x = 1;
1 − (1 − p)^x is a nondecreasing function of x;
For any x, FX(x + ε) = FX(x) if ε > 0 is sufficiently small, since the cdf jumps only at the integers. Hence,

lim_{ε↓0} FX(x + ε) = FX(x).
FX (x) is the cdf of a distribution called the Geometric distribution.
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 51 / 61
Random Variable and Distributions Distribution Functions
Example Cont’d
An example of a continuous cdf is the function

FX(x) = 1 / (1 + e^{−x}).

lim_{x→−∞} FX(x) = 0 since lim_{x→−∞} e^{−x} = ∞;
lim_{x→+∞} FX(x) = 1 since lim_{x→+∞} e^{−x} = 0;
Differentiating FX(x) gives

(d/dx) FX(x) = e^{−x} / (1 + e^{−x})^2 > 0;
FX (x) is not only right-continuous, but also continuous.
This is a special case of the logistic distribution.
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 52 / 61
Random Variable and Distributions Distribution Functions
Continuous and discrete r.v.s
A r.v. X is continuous if FX(x) is a continuous function of x, and a r.v. X is discrete if FX(x) is a step function of x.
The r.v.s X and Y are identically distributed if, for every set A ∈ B,

P(X ∈ A) = P(Y ∈ A).
Note that two r.v.s that are identically distributed are not necessarilyequal.
Theorem
The following two statements are equivalent:
The r.v.s X and Y are identically distributed;
FX (x) = FY (x) for all x .
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 53 / 61
Random Variable and Distributions Distribution Functions
Example
Note that two r.v.s that are identically distributed are not necessarilyequal.
For example, consider an example of tossing a fair coin three times. Define r.v.s

X = # heads observed and Y = # tails observed.

We have P(X = k) = P(Y = k) for k = 0, 1, 2, 3, i.e., X and Y are identically distributed. However, since X(ω) + Y(ω) = 3 for every outcome, we do not have X(ω) = Y(ω) for any ω ∈ Ω.
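A sketch verifying both claims for three tosses:

```python
from collections import Counter
from itertools import product

omega = list(product("HT", repeat=3))          # three fair-coin tosses
X = lambda w: w.count("H")                      # number of heads
Y = lambda w: w.count("T")                      # number of tails

print(Counter(map(X, omega)) == Counter(map(Y, omega)))  # True: same distribution
print(any(X(w) == Y(w) for w in omega))                   # False: never equal pointwise
```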
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 54 / 61
Random Variable and Distributions Density and Mass Functions
Outline
1 Introduction
2 Set Theory
3 Basics of Probability Theory
The Calculus of Probabilities
Counting
4 Random Variable and Distributions
Random Variable
Distribution Functions
Density and Mass Functions
5 Take-aways
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 55 / 61
Random Variable and Distributions Density and Mass Functions
Probability mass function
For all x, the probability mass function (pmf) of a discrete r.v. X on Ω is given by fX(x) = P(X = x).
For the geometric distribution, we have the pmf

fX(x) = P(X = x) =
  (1 − p)^{x−1} p, for x = 1, 2, · · · ;
  0, otherwise.

Recall that P(X = x) or, equivalently, fX(x) is the size of the jump in the cdf at x.
We can use the pmf to calculate probabilities: for positive integers a and b ≥ a, we have

P(a ≤ X ≤ b) = ∑_{k=a}^{b} fX(k) = ∑_{k=a}^{b} (1 − p)^{k−1} p.
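A sketch evaluating this interval probability two ways: by the pmf sum and by the cdf difference FX(b) − FX(a − 1) (the parameter p = 1/4 and the interval [2, 5] are arbitrary illustrations):

```python
from fractions import Fraction

p = Fraction(1, 4)                              # illustrative parameter
f = lambda k: (1 - p) ** (k - 1) * p            # geometric pmf
F = lambda x: 1 - (1 - p) ** x                  # geometric cdf at integers

a, b = 2, 5
print(sum(f(k) for k in range(a, b + 1)))       # pmf sum over a..b
print(F(b) - F(a - 1))                          # same value via the cdf
```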
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 56 / 61
Random Variable and Distributions Density and Mass Functions
Examples of pmf
Question: Let X be the sum of the numbers that appear when a pair of dice is rolled. What are the values and probabilities of this random variable over the 36 possible outcomes (i, j), when the two dice are rolled?
Solution:

value  prob.    value  prob.
2      1/36     8      5/36
3      1/18     9      1/9
4      1/12     10     1/12
5      1/9      11     1/18
6      5/36     12     1/36
7      1/6
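The whole table can be recovered by counting outcomes (a minimal sketch):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))        # the 36 outcomes (i, j)
counts = Counter(i + j for i, j in rolls)           # how often each sum occurs

pmf = {v: Fraction(c, 36) for v, c in sorted(counts.items())}
print(pmf)   # {2: 1/36, 3: 1/18, ..., 7: 1/6, ..., 12: 1/36}
```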
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 57 / 61
Random Variable and Distributions Density and Mass Functions
Probability density function
If we naively try to calculate P(X = x) for a continuous r.v. and any ε > 0, we have

P(X = x) ≤ P(x − ε < X ≤ x) = FX(x) − FX(x − ε).

Therefore,

0 ≤ P(X = x) ≤ lim_{ε↓0} [FX(x) − FX(x − ε)] = 0.

Definition

The probability density function or pdf, fX(x), of a continuous r.v. X is the function that satisfies

FX(x) = ∫_{−∞}^{x} fX(t) dt for all x.
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 58 / 61
Random Variable and Distributions Density and Mass Functions
Remarks of pdf
“X has a distribution given by FX(x)” is abbreviated symbolically by “X ∼ FX(x)” (or “X ∼ fX(x)”).
Given a continuous r.v. X, since P(X = x) = 0, we have

P(a < X < b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = P(a < X ≤ b).

For the logistic distribution, FX(x) = 1 / (1 + e^{−x}). We have

fX(x) = (d/dx) FX(x) = e^{−x} / (1 + e^{−x})^2.

P(a < X < b) = FX(b) − FX(a) = ∫_{−∞}^{b} fX(t) dt − ∫_{−∞}^{a} fX(t) dt = ∫_{a}^{b} fX(t) dt.
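A numeric sketch of this identity for the logistic distribution, comparing the cdf difference with a crude midpoint Riemann sum of the pdf (the endpoints a, b are arbitrary illustrations):

```python
import math

F = lambda x: 1 / (1 + math.exp(-x))                    # logistic cdf
f = lambda x: math.exp(-x) / (1 + math.exp(-x)) ** 2    # its pdf

a, b = -1.0, 2.0
closed = F(b) - F(a)                                    # P(a < X < b) via the cdf

n = 100_000                                             # midpoint rule over (a, b)
h = (b - a) / n
riemann = sum(f(a + (i + 0.5) * h) for i in range(n)) * h

print(closed, riemann)                                  # agree to many decimals
```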
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 59 / 61
Random Variable and Distributions Density and Mass Functions
Theorem
A function fX(x) is a pdf (or pmf) of a r.v. X if and only if
fX(x) ≥ 0 for all x;
∑_x fX(x) = 1 (pmf) or ∫_{−∞}^{+∞} fX(x) dx = 1 (pdf).

Actually, any nonnegative function with a finite positive integral (or sum) can be turned into a pdf or pmf. For example, if h(x) is any nonnegative function that is positive on a set A, 0 otherwise, and

∫_{x∈A} h(x) dx = K < ∞

for some constant K > 0, then the function fX(x) = h(x)/K is a pdf of a r.v. X taking values in A.
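A sketch of this normalization under the assumed choice h(x) = e^{−2x} on A = [0, ∞), for which K = ∫_0^∞ e^{−2x} dx = 1/2:

```python
import math

h = lambda x: math.exp(-2 * x)     # assumed nonnegative function on A = [0, ∞)
K = 0.5                            # its integral over A
f = lambda x: h(x) / K             # the resulting pdf, 2 e^{-2x}

# numeric check that f integrates to 1 over [0, 50] (the tail beyond is negligible)
n, b = 200_000, 50.0
print(sum(f((i + 0.5) * b / n) for i in range(n)) * b / n)  # ≈ 1.0
```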
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 60 / 61
Take-aways
Take-aways
Conclusions
Introduction
Set Theory
Basics of Probability Theory
The Calculus of Probabilities
Counting
Random Variable and Distributions
Random Variable
Distribution Functions
Density and Mass Functions
MING GAO (DaSE@ECNU) Statistical Inference Mar. 2, 2021 61 / 61