Page 1: Statistical Inference - Lecture 1: Probability Theory

Statistical Inference
Lecture 1: Probability Theory

MING GAO

DaSE @ ECNU (for course-related communications)

[email protected]

Mar. 2, 2021

Page 2: Statistical Inference - Lecture 1: Probability Theory

Outline

1 Introduction

2 Set Theory

3 Basics of Probability Theory
  The Calculus of Probabilities
  Counting

4 Random Variable and Distributions
  Random Variable
  Distribution Functions
  Density and Mass Functions

5 Take-aways


Page 3: Statistical Inference - Lecture 1: Probability Theory


Introduction

Probability as a mathematical framework for:

reasoning about uncertainty

deriving approaches to inference problems


Page 5: Statistical Inference - Lecture 1: Probability Theory

Experiment and sample space

Experiment

An experiment is a procedure that yields one of a given set of possible outcomes.

Sample space

The sample space of an experiment, denoted as Ω, is the set of possible outcomes.

Example

Roll a die once: Ω = {1, 2, 3, 4, 5, 6}. Toss a coin twice (Head = H, Tail = T): Ω = {HH, HT, TH, TT}.

The sample space is the “list” (set) of possible outcomes. The list must be:
1 Mutually exclusive
2 Collectively exhaustive

The art is to choose the “right” granularity.

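As an aside (added to this transcript, not part of the original slides), both finite sample spaces are easy to enumerate in Python; itertools.product builds the ordered outcomes of the two-toss experiment:

```python
from itertools import product

# Sample space for one roll of a die.
die = {1, 2, 3, 4, 5, 6}

# Sample space for two tosses of a coin: ordered pairs over {H, T}.
two_tosses = {"".join(seq) for seq in product("HT", repeat=2)}

print(die)         # {1, 2, 3, 4, 5, 6}
print(two_tosses)  # {'HH', 'HT', 'TH', 'TT'} (set order may vary)
```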

Page 9: Statistical Inference - Lecture 1: Probability Theory

Continuous sample space

For this case, the sample space is Ω = {(x, y) | 0 ≤ x, y ≤ 1}. Note that this sample space is infinite and uncountable.

In this course, we consider both countable and uncountable sample spaces; the corresponding material is called discrete probability and continuous probability, respectively.


Page 10: Statistical Inference - Lecture 1: Probability Theory

Event and set operators

An event, represented as a set, is a subset of the sample space.

Example

Toss at least one head: B = {HH, HT, TH} ⊂ Ω;
Toss at least three heads: C = ∅ ⊂ Ω.

There are 2^|Ω| events for an experiment, and all set operations apply to events.

Operators

Containment: A ⊂ B ⇔ (x ∈ A ⇒ x ∈ B);
Union: A ∪ B = {x | x ∈ A or x ∈ B};
Intersection: A ∩ B = {x | x ∈ A and x ∈ B};
Difference: A − B = {x | x ∈ A and x ∉ B};
Complement: A^c = {x | x ∉ A}.

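To make the operators concrete, here is a minimal sketch (an addition to the transcript) using Python's built-in set type; the event A, "the first toss is H", is an extra event introduced purely for illustration:

```python
# Events from the two-toss experiment, represented as Python sets.
omega = {"HH", "HT", "TH", "TT"}
B = {"HH", "HT", "TH"}    # at least one head
C = set()                 # at least three heads: impossible, the empty set
A = {"HH", "HT"}          # extra event for illustration: first toss is H

print(A | B)              # union:        {'HH', 'HT', 'TH'}
print(A & B)              # intersection: {'HH', 'HT'}
print(B - A)              # difference:   {'TH'}
print(omega - A)          # complement of A relative to the sample space
print(C <= omega)         # containment:  True
```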

Page 13: Statistical Inference - Lecture 1: Probability Theory

Set identities

Table of set identities

equivalence — name
A ∩ U = A; A ∪ ∅ = A — Identity laws
A ∩ A = A; A ∪ A = A — Idempotent laws
(A ∩ B) ∩ C = A ∩ (B ∩ C); (A ∪ B) ∪ C = A ∪ (B ∪ C) — Associative laws
A ∪ (A ∩ B) = A; A ∩ (A ∪ B) = A — Absorption laws
A ∩ A^c = ∅; A ∪ A^c = U — Complement laws


Page 14: Statistical Inference - Lecture 1: Probability Theory

Set identities Cont'd

Table of set identities

equivalence — name
A ∪ U = U; A ∩ ∅ = ∅ — Domination laws
A ∩ B = B ∩ A; A ∪ B = B ∪ A — Commutative laws
(A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C); (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) — Distributive laws
(A ∩ B)^c = A^c ∪ B^c; (A ∪ B)^c = A^c ∩ B^c — De Morgan's laws

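A short check (added here) confirms De Morgan's laws on an arbitrary small example, with the universe U represented explicitly:

```python
# Check De Morgan's laws on a small concrete example (U is the universe).
U = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

assert U - (A & B) == (U - A) | (U - B)   # (A ∩ B)^c = A^c ∪ B^c
assert U - (A | B) == (U - A) & (U - B)   # (A ∪ B)^c = A^c ∩ B^c
print("De Morgan's laws hold on this example")
```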

Page 15: Statistical Inference - Lecture 1: Probability Theory

Extensions of set operators

If A_1, A_2, · · · , A_n, · · · is a collection of events, all defined on a sample space Ω, then

Union: ⋃_{i=1}^∞ A_i = {x ∈ Ω | x ∈ A_i for some i};
Intersection: ⋂_{i=1}^∞ A_i = {x ∈ Ω | x ∈ A_i for all i}.

Example

Let Ω = (0, 1] and define A_i = [1/i, 1]. Then

⋃_{i=1}^∞ A_i = lim_{n→+∞} ⋃_{i=1}^n [1/i, 1] = lim_{n→+∞} [1/n, 1] = (0, 1];
⋂_{i=1}^∞ A_i = lim_{n→+∞} ⋂_{i=1}^n [1/i, 1] = lim_{n→+∞} [1, 1] = {1}.


Page 16: Statistical Inference - Lecture 1: Probability Theory

Extensions of set operators Cont'd

If Γ is an uncountable index set and {A_a | a ∈ Γ} is a collection of events, all defined on a sample space Ω, then

Union: ⋃_{a∈Γ} A_a = {x ∈ Ω | x ∈ A_a for some a};
Intersection: ⋂_{a∈Γ} A_a = {x ∈ Ω | x ∈ A_a for all a}.

Example

Let Γ = R^+ and define A_a = (0, a]. Then

⋃_{a∈Γ} A_a = (0, +∞);
⋂_{a∈Γ} A_a = ∅.


Page 17: Statistical Inference - Lecture 1: Probability Theory

Disjoint and partition

Two events A and B are disjoint (or mutually exclusive) if A ∩ B = ∅. The events A_1, A_2, · · · , A_n, · · · are pairwise disjoint (or mutually exclusive) if A_i ∩ A_j = ∅ for all i ≠ j.

If A_1, A_2, · · · , A_n, · · · are pairwise disjoint and ⋃_{i=1}^∞ A_i = Ω, then the collection A_1, A_2, · · · , A_n, · · · forms a partition of Ω.

The collection A_i = [i, i + 1) for i ∈ N consists of pairwise disjoint sets; furthermore, ⋃_{i∈N} A_i = [0, +∞), so this collection forms a partition of [0, +∞).


Page 21: Statistical Inference - Lecture 1: Probability Theory

Sigma Algebra

A collection of subsets of Ω is called a sigma algebra (or Borel field), denoted as B, if it satisfies the following three properties:

∅ ∈ B;
If A ∈ B, then A^c ∈ B;
If A_1, A_2, · · · ∈ B, then ⋃_{i=1}^∞ A_i ∈ B (B is closed under countable unions).

In addition, from De Morgan's Laws it follows that B is closed under countable intersections, i.e.,

(⋃_{i=1}^∞ A_i)^c = ⋂_{i=1}^∞ A_i^c ∈ B.


Page 25: Statistical Inference - Lecture 1: Probability Theory

Example

If Ω is finite or countable, the sigma algebra of Ω can be defined as

B = {A | A ⊆ Ω}.

If Ω has n elements, there are 2^n sets in B;
If Ω is countably infinite, the cardinality of B is 2^{ℵ_0};
Let Ω = (−∞, +∞); then B is chosen to contain all sets of the form

[a, b], (a, b], (a, b) and [a, b)

for all real numbers a and b.


Page 28: Statistical Inference - Lecture 1: Probability Theory

Probability function

Definition

Given a sample space Ω and an associated sigma algebra B, a probability function is a function P with domain B that satisfies:

Nonnegativity: P(A) ≥ 0 for all A ∈ B;
Normalization: P(Ω) = 1 and P(∅) = 0;
Additivity: if A_1, A_2, · · · ∈ B are pairwise disjoint, then P(⋃_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

Note that

The three properties given in the definition are usually referred to as the Axioms of probability (or the Kolmogorov Axioms).
Any function P that satisfies the Axioms of probability is called a probability function.

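For a finite Ω these axioms can be spot-checked directly in code. The sketch below (an addition to the transcript, using a fair coin for concreteness) represents P by a mass on each outcome and verifies the three axioms over the power set of Ω:

```python
from itertools import chain, combinations

# A probability function on a finite sample space, given by a mass on
# each outcome; P(A) is the sum of the masses of the outcomes in A.
omega = ("H", "T")
mass = {"H": 0.5, "T": 0.5}

def P(event):
    return sum(mass[o] for o in event)

def powerset(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Spot-check the Kolmogorov axioms on the finite sigma algebra 2^Omega.
assert all(P(A) >= 0 for A in powerset(omega))   # nonnegativity
assert P(omega) == 1 and P(()) == 0              # normalization
assert P(("H", "T")) == P(("H",)) + P(("T",))    # additivity for disjoint events
```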

Page 34: Statistical Inference - Lecture 1: Probability Theory

Example of defining probability

Consider the simple experiment of tossing a fair coin, so Ω = {H, T}; hence a reasonable probability function satisfies

P({H}) = P({T}).

P({H, T}) = P({H}) + P({T}) = 1;
P({H}) = P({T}) = 1/2;
We can also define a probability for an unfair coin, e.g., P({H}) = 1/3 and P({T}) = 2/3.


Page 38: Statistical Inference - Lecture 1: Probability Theory

Outline

1 Introduction

2 Set Theory

3 Basics of Probability Theory
  The Calculus of Probabilities
  Counting

4 Random Variable and Distributions
  Random Variable
  Distribution Functions
  Density and Mass Functions

5 Take-aways


Page 39: Statistical Inference - Lecture 1: Probability Theory

Probability operators

Operators

Let Ω be the sample space, and let A and B be two events:

1 If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B);
2 P(A^c) = P(Ω) − P(A) = 1 − P(A);
3 If A and B are independent, then P(A ∩ B) = P(A) · P(B);
4 The conditional probability of A given B, denoted by P(A|B), is computed as P(A|B) = P(A ∩ B)/P(B).


Page 43: Statistical Inference - Lecture 1: Probability Theory

Probability properties I

If P is a probability function and A and B are any two sets in B, then

P(∅) = 0, where ∅ is the empty set;
P(A) ≤ 1;
P(B ∩ A^c) = P(B) − P(A ∩ B);
P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
If A ⊂ B, then P(A) ≤ P(B).


Page 48: Statistical Inference - Lecture 1: Probability Theory

Probability properties II

If P is a probability function and the collection C_1, C_2, · · · is a partition of Ω, then

P(A) = ∑_{i=1}^∞ P(A ∩ C_i);
P(⋃_{i=1}^∞ A_i) ≤ ∑_{i=1}^∞ P(A_i) for any collection A_1, A_2, · · · .

To prove the second statement, we first construct a disjoint collection A*_1, A*_2, · · · , where

A*_1 = A_1, and A*_i = A_i \ (⋃_{j=1}^{i−1} A_j) for i = 2, 3, · · ·


Page 50: Statistical Inference - Lecture 1: Probability Theory

Running example of independence

Tossing coins

We toss a coin twice (Head = H, Tail = T), so Ω = {HH, HT, TH, TT}. We define three events:

1 A: the first toss is H;
2 B: the second toss is H;
3 C: the first and second tosses give the same result.

Hence, we have

P(A) = P(B) = P(C) = 1/2;
P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = |{HH}|/|Ω| = 1/4;
P(A ∩ B ∩ C) = |{HH}|/|Ω| = 1/4.

So A, B, and C are pairwise independent, yet P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C), so they are not mutually independent.

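The sketch below (added to the transcript) verifies this computation by brute-force enumeration, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("HT", repeat=2)]

def P(event):
    # Equally likely outcomes, so P(E) = |E| / |Omega|.
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}     # first toss is H
B = {w for w in omega if w[1] == "H"}     # second toss is H
C = {w for w in omega if w[0] == w[1]}    # both tosses give the same result

# Pairwise independence holds...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ...but mutual independence fails: 1/4 != 1/8.
assert P(A & B & C) != P(A) * P(B) * P(C)
```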

Page 51: Statistical Inference - Lecture 1: Probability Theory

Independence

Definition

Events A_1 and A_2 are pairwise independent (statistically independent) if and only if

P(A_1 ∩ A_2) = P(A_1)P(A_2);

Events A_1, A_2, · · · , A_n are mutually independent if

P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_m}) = P(A_{i_1})P(A_{i_2}) · · · P(A_{i_m}),

where the i_j, j = 1, 2, · · · , m, are integers with 1 ≤ i_1 < i_2 < · · · < i_m ≤ n and m ≥ 2.

Note that mutual independence implies pairwise independence, but pairwise independence does not imply mutual independence (as shown in the previous example).


Page 52: Statistical Inference - Lecture 1: Probability Theory

Independence

Theorem

If events A and B are independent, then

A and B^c are independent;
A^c and B are independent;
A^c and B^c are independent.

Proof.

P(A) = P(A ∩ (B ∪ B^c)) = P((A ∩ B) ∪ (A ∩ B^c)) = P(A ∩ B) + P(A ∩ B^c)

P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A)P(B) = P(A)(1 − P(B)) = P(A)P(B^c)

Hence A and B^c are independent; the other two statements follow by symmetry.


Page 53: Statistical Inference - Lecture 1: Probability Theory

Conditional probability

Definition

Let E and F be events with P(F) > 0. The conditional probability of E given F, denoted by P(E|F), is defined as

P(E|F) = P(E ∩ F)/P(F).

P(E|F) is the probability of E, given that F occurred;
F is our new sample space;
P(E|F) is undefined if P(F) = 0.


Page 56: Statistical Inference - Lecture 1: Probability Theory

Example

Question: What is the conditional probability that a family with two children has two boys, given that they have at least one boy? Assume that each of the possibilities BB, BG, GB, and GG is equally likely.

Solution: Let A be the event that a family with two children has two boys, and let B be the event that a family with two children has at least one boy.
Thus, A = {BB}, B = {BB, BG, GB}, and A ∩ B = {BB}.
Because the four possibilities are equally likely, it follows that P(B) = 3/4 and P(A ∩ B) = 1/4.
We conclude that

P(A|B) = P(A ∩ B)/P(B) = (1/4)/(3/4) = 1/3.

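Enumerating the four equally likely outcomes reproduces the answer; a minimal sketch (added here):

```python
from fractions import Fraction

omega = ["BB", "BG", "GB", "GG"]          # equally likely possibilities
A = {w for w in omega if w == "BB"}       # two boys
B = {w for w in omega if "B" in w}        # at least one boy

def P(event):
    return Fraction(len(event), len(omega))

print(P(A & B) / P(B))                    # 1/3
```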

Page 61: Statistical Inference - Lecture 1: Probability Theory

Remarks for conditional probability

P(E|F) = P(E) if events E and F are independent;
P(E ∩ F) = P(E) · P(F|E) = P(F) · P(E|F);
P(E) = P(F_1) · P(E|F_1) + P(F_2) · P(E|F_2) + P(F_3) · P(E|F_3) if F_1, F_2, and F_3 form a partition of Ω (total probability theorem);
P(F_i|E) = P(F_i) · P(E|F_i)/P(E) = P(F_i) · P(E|F_i)/∑_i P(F_i) · P(E|F_i) (Bayes' rule).


Page 65: Statistical Inference - Lecture 1: Probability Theory

Proof of the total probability theorem

Theorem

Let F_i (for i = 1, 2, · · · , n) be a partition of sample space Ω. For any event E,

P(E) = ∑_{i=1}^n P(F_i) · P(E|F_i).

Proof:

P(E) = P(E ∩ Ω) = P(E ∩ (F_1 ∪ F_2 ∪ · · · ∪ F_n))
     = P((E ∩ F_1) ∪ (E ∩ F_2) ∪ · · · ∪ (E ∩ F_n))
     = P(E ∩ F_1) + P(E ∩ F_2) + · · · + P(E ∩ F_n)
     = ∑_{i=1}^n P(F_i) · P(E|F_i).


Page 67: Statistical Inference - Lecture 1: Probability Theory

Bayes' Theorem

Theorem

Suppose that E is an event from a sample space Ω and F_1, F_2, · · · , F_n is a partition of the sample space. Let P(E) ≠ 0 and P(F_i) ≠ 0 for all i. Then

P(F_i|E) = P(E|F_i)P(F_i)/∑_{k=1}^n P(E|F_k)P(F_k).

Proof: By the definition of conditional probability, P(F_i|E) = P(F_i ∩ E)/P(E) = P(E|F_i)P(F_i)/P(E); expanding P(E) with the total probability theorem gives the result.


Page 69: Statistical Inference - Lecture 1: Probability Theory

Diagnostic test for rare disease

Suppose that one in 100,000 persons has a particular rare disease for which there is a fairly accurate diagnostic test. This test is correct 99.0% of the time when given to a person, selected at random, who has the disease; it is correct 99.5% of the time when given to a person, selected at random, who does not have the disease. Please find

the probability that a person who tests positive for the disease has the disease;
the probability that a person who tests negative for the disease does not have the disease.

Should a person who tests positive be very concerned that he or she has the disease?

Solution: Let B be the event that a person selected at random has the disease, and let A be the event that a person selected at random tests positive for the disease.


Page 71: Statistical Inference - Lecture 1: Probability Theory

Diagnostic test for rare disease Cont'd

Hence, we have P(B) = 1/100,000 = 10^{−5}. Then we also have P(A|B) = 0.99, P(A^c|B) = 0.01, P(A^c|B^c) = 0.995, and P(A|B^c) = 0.005.

Case a: In terms of Bayes' theorem, we have

P(B|A) = P(A|B)P(B)/(P(A|B)P(B) + P(A|B^c)P(B^c))
       = 0.99 · 10^{−5}/(0.99 · 10^{−5} + 0.005 · 0.99999) ≈ 0.002

Case b: Similarly, we have

P(B^c|A^c) = P(A^c|B^c)P(B^c)/(P(A^c|B^c)P(B^c) + P(A^c|B)P(B)) ≈ 0.9999999

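The arithmetic is easy to reproduce; the following sketch (an addition to the transcript) plugs the four conditional probabilities into Bayes' theorem:

```python
# Plug the diagnostic-test numbers into Bayes' theorem.
p_B = 1e-5                        # P(B): prevalence, 1 in 100,000
p_A_given_B = 0.99                # test positive given disease
p_Ac_given_Bc = 0.995             # test negative given no disease
p_A_given_Bc = 1 - p_Ac_given_Bc  # false positive rate, 0.005
p_Ac_given_B = 1 - p_A_given_B    # false negative rate, 0.01
p_Bc = 1 - p_B

# Case a: P(B | A).
p_B_given_A = (p_A_given_B * p_B) / (p_A_given_B * p_B + p_A_given_Bc * p_Bc)
print(p_B_given_A)                # ~0.00198

# Case b: P(B^c | A^c).
p_Bc_given_Ac = (p_Ac_given_Bc * p_Bc) / (
    p_Ac_given_Bc * p_Bc + p_Ac_given_B * p_B
)
print(p_Bc_given_Ac)              # ~0.9999999
```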

Page 75: Statistical Inference - Lecture 1: Probability Theory

Bayesian spam filters

Most electronic mailboxes receive a flood of unwanted and unsolicited messages, known as spam.
On the Internet, an Internet Water Army is a group of Internet ghostwriters paid to post online comments with particular content.

Question: How to detect spam email?

Solution: Bayesian spam filters look for occurrences of particular words in messages. For a particular word w, the probability that w appears in a spam e-mail message is estimated by determining the number of times w appears in a message from a large set of messages known to be spam and the number of times it appears in a large set of messages known not to be spam.

Step 1: Collect ground truth. Suppose we have a set B of messages known to be spam and a set G of messages known not to be spam.


Page 80: Statistical Inference - Lecture 1: Probability Theory

Bayesian spam filters Cont'd

Step 2: Learn parameters. We next identify the words that occur in B and in G. Let n_B(w) and n_G(w) be the numbers of messages containing word w in sets B and G, respectively.
Let p(w) = n_B(w)/|B| and q(w) = n_G(w)/|G| be the empirical probabilities that a spam message and a non-spam message, respectively, contain word w.

Step 3: Make a decision. Now suppose we receive a new e-mail message containing word w. Let B be the event that the message is spam, and let A be the event that the message contains word w.
By Bayes' theorem, the probability that the message is spam, given that it contains word w, is

P(B|A) = P(A|B)P(B)/(P(A|B)P(B) + P(A|B^c)P(B^c)).


Page 83: Statistical Inference - Lecture 1: Probability Theory

Bayesian spam filters Cont'd

To apply the above formula, we first estimate P(B) (the probability of spam) and P(B^c) (the probability of not spam).
Without prior knowledge, for simplicity we assume that a message is as likely to be spam as not, i.e., P(B) = P(B^c) = 1/2.
Using this assumption, we find that the probability that a message is spam, given that it contains word w, is

P(B|A) = P(A|B)/(P(A|B) + P(A|B^c)).

Estimating P(A|B) and P(A|B^c) by p(w) and q(w), P(B|A) can be estimated by

r(w) = p(w)/(p(w) + q(w)).


Page 87: Statistical Inference - Lecture 1: Probability Theory

Extended Bayesian spam filters

The more words we use to estimate the probability that an incoming mail message is spam, the better our chance of correctly determining whether it is spam.
In general, let A_i be the event that the message contains word w_i, and let S be the event that the message is spam. Assuming that P(S) = P(S^c) and that the events A_i are conditionally independent given S (and given S^c), by Bayes' theorem the probability that a message containing all the words w_1, w_2, · · · , w_k is spam is

P(S | ⋂_{i=1}^k A_i) = P(⋂_{i=1}^k A_i | S)P(S)/(P(⋂_{i=1}^k A_i | S)P(S) + P(⋂_{i=1}^k A_i | S^c)P(S^c))
= ∏_{i=1}^k P(A_i|S)/(∏_{i=1}^k P(A_i|S) + ∏_{i=1}^k P(A_i|S^c))
≈ ∏_{i=1}^k p(w_i)/(∏_{i=1}^k p(w_i) + ∏_{i=1}^k q(w_i)) = r(w_1, w_2, · · · , w_k).

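A direct translation of r(w_1, · · · , w_k) into Python (added here; the word statistics are made-up numbers purely for illustration, and math.prod assumes Python 3.8+):

```python
import math

def r(p_values, q_values):
    # Estimated spam probability for a message containing words w_1..w_k,
    # given per-word estimates p(w_i) and q(w_i) and the uniform prior
    # P(S) = P(S^c) = 1/2.
    num = math.prod(p_values)
    return num / (num + math.prod(q_values))

# Hypothetical word statistics, purely for illustration.
p = [0.8, 0.6, 0.9]   # p(w_i): fraction of known spam containing w_i
q = [0.1, 0.3, 0.2]   # q(w_i): fraction of known non-spam containing w_i
print(r(p, q))        # ~0.986: very likely spam
```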

Page 88: Statistical Inference - Lecture 1: Probability Theory

Naive Bayes

Why is this called Naive Bayes?

The "naive" part is the assumption that the words are conditionally independent given the class; with that assumption, the model employs the chain rule for repeated applications of the definition of conditional probability.
To handle underflow, we calculate ∏_{i=1}^n P(X_i|S) = exp(∑_{i=1}^n log P(X_i|S)).

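A brief sketch (not from the slides) of why the log transform matters: multiplying many small conditional probabilities underflows double precision, while the sum of logs stays finite, so in practice one compares log-scores across classes directly:

```python
import math

# Multiplying many small conditional probabilities underflows double
# precision, while the sum of logs stays perfectly representable.
probs = [1e-5] * 100                      # 100 factors of 10^-5

print(math.prod(probs))                   # 0.0 -- the true value 1e-500 underflows
print(sum(math.log(p) for p in probs))    # about -1151.3, finite
# So rather than exponentiating, compare log-scores between classes.
```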

Page 91: Statistical Inference - Lecture 1: Probability Theory

Outline

1 Introduction

2 Set Theory

3 Basics of Probability Theory
  The Calculus of Probabilities
  Counting

4 Random Variable and Distributions
  Random Variable
  Distribution Functions
  Density and Mass Functions

5 Take-aways


Page 92: Statistical Inference - Lecture 1: Probability Theory

Counting principle I

Product rule

Suppose that a procedure consists of a sequence of two tasks. If there are n_1 ways to do the first task and n_2 ways to do the second task, then there are n_1 n_2 ways to do the procedure.

Extended version: A procedure consists of tasks T_1, T_2, · · · , T_m in sequence. If each task T_i can be done in n_i ways independently, then there are n_1 · n_2 · · · · · n_m ways to carry out the procedure.

Sum rule

If a task can be done either in one of n_1 ways or in one of n_2 ways, where none of the set of n_1 ways is the same as any of the set of n_2 ways, then there are n_1 + n_2 ways to do the task.

Extended version: A procedure can be done in one of m ways, where way W_i has n_i possibilities; then there are ∑_{i=1}^m n_i ways to do the procedure.



Counting principle II

Division rule

There are n/d ways to do a task if it can be done using a procedure that can be carried out in n ways, and for every way w, exactly d of the n ways correspond to way w.

Inclusion-exclusion rule

If a task can be done in either n1 ways or n2 ways, then the number of ways to do the task is n1 + n2 minus the number of ways to do the task that are common to the two different ways. The rule is also called the principle of inclusion-exclusion, i.e.,

|A ∪ B| = |A| + |B| − |A ∩ B|.
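For instance, counting the integers from 1 to 100 that are divisible by 2 or by 5 illustrates the rule; a quick check in Python:

```python
# A = multiples of 2, B = multiples of 5, within 1..100.
A = {n for n in range(1, 101) if n % 2 == 0}
B = {n for n in range(1, 101) if n % 5 == 0}

# |A ∪ B| computed directly vs. via inclusion-exclusion.
assert len(A | B) == len(A) + len(B) - len(A & B) == 60
```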



Permutations and combinations

Permutations

An ordered arrangement of m elements of a set is called an m-permutation. # m-permutations of a set with n distinct elements is

P(n, m) = n(n − 1)(n − 2) · · · (n − m + 1) = n! / (n − m)!.

Combinations

An m-combination of elements of a set is an unordered selection of m elements from the set. # m-combinations of a set with n elements equals

C(n, m) = n! / (m!(n − m)!),

where C(n, m) is also denoted as (n choose m).
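These two formulas can be verified directly against exhaustive enumeration; a minimal sketch using the standard library:

```python
import math
from itertools import combinations, permutations

n, m = 5, 3
# P(n, m) = n! / (n - m)! counts ordered arrangements.
assert math.perm(n, m) == len(list(permutations(range(n), m))) == 60
# C(n, m) = n! / (m! (n - m)!) counts unordered selections.
assert math.comb(n, m) == len(list(combinations(range(n), m))) == 10
```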


Finite probability

If S is a finite nonempty sample space of equally likely outcomes, and E is an event, that is, a subset of S, then the probability of E is

p(E) = |E| / |S|.

Let all outcomes be equally likely;

Computing probabilities ≡ two countings:

counting the successful ways of the event;
counting the size of the sample space.


Examples

Example I

Question: An urn contains four blue balls and five red balls. What is the probability that a ball chosen at random from the urn is blue?
Solution: Let S be the sample space consisting of the nine balls, i.e.,

S = {b1, b2, b3, b4, r1, r2, r3, r4, r5},

where b denotes a blue ball and r a red ball. Let E be the event of choosing a blue ball, i.e.,

E = {b1, b2, b3, b4}.

In terms of the definition, we can compute the probability as

P(E) = |E| / |S| = 4/9.


Examples Cont’d

Example II

Question: What is the probability that when two dice are rolled, the sum of the numbers on the two dice is 7?
Solution:

There are a total of 36 possible outcomes when two dice are rolled.

There are six successful outcomes, namely, (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), and (6, 1).

Hence, the probability that a seven comes up when two fair dice are rolled is 6/36 = 1/6.


Examples Cont’d

Example III

Question: In a lottery, players win a large prize when they pick four digits that match, in the correct order, four randomly chosen digits. A smaller prize is won if only three digits are matched. What is the probability that a player wins the large prize? What is the probability that a player wins the small prize?
Solution: By the product rule, there are 10^4 = 10,000 ways to choose four digits.
Large prize case: There is only one way to choose all four digits correctly. Thus, the probability is 1/10,000 = 0.0001.
Small prize case: Exactly one digit must be wrong to get three digits correct, but not all four correct. Hence, there is a total of C(4, 1) × 9 = 36 ways to choose four digits with exactly three of the four digits correct. Thus, the probability that a player wins the smaller prize is 36/10,000 = 9/2500 = 0.0036.
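Both answers can be confirmed by brute force over all 10^4 picks (the winning sequence below is an arbitrary choice; any sequence gives the same probabilities):

```python
from itertools import product

winning = (1, 2, 3, 4)  # arbitrary winning digits, for illustration
guesses = list(product(range(10), repeat=4))  # all 10^4 possible picks

large = sum(g == winning for g in guesses)
small = sum(sum(a == b for a, b in zip(g, winning)) == 3 for g in guesses)

print(large / len(guesses))  # 0.0001
print(small / len(guesses))  # 0.0036
```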


Examples Cont’d

Example IV

Question: Find the probabilities that a poker hand contains four cards of one kind, or a full house (i.e., three of one kind and two of another kind).

Solution: There are C(52, 5) different hands of five cards.
Case I: # hands of five cards with four cards of one kind is

C(13, 1) C(4, 4) C(48, 1).

Case II: # hands with three of one kind and two of another kind is

P(13, 2) C(4, 3) C(4, 2).

Hence, the probabilities are

C(13, 1) C(4, 4) C(48, 1) / C(52, 5) ≈ 0.00024,

P(13, 2) C(4, 3) C(4, 2) / C(52, 5) ≈ 0.0014.
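The same numbers drop out of Python's math module, which can serve as a sanity check on the counting:

```python
from math import comb, perm

total = comb(52, 5)                                 # all five-card hands
four_of_a_kind = comb(13, 1) * comb(4, 4) * comb(48, 1)
full_house = perm(13, 2) * comb(4, 3) * comb(4, 2)  # ordered (triple, pair) ranks

print(four_of_a_kind / total)  # ≈ 0.00024
print(full_house / total)      # ≈ 0.00144
```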


Random Variable and Distributions: Random Variable


Random variables

Definition: A random variable (r.v.) X is a function from the sample space Ω of an experiment to the set of real numbers R, i.e.,

∀ω ∈ Ω, X(ω) = x ∈ R.

Remarks

Note that a random variable is a function. It is not a variable, and it is not random!

We usually use notation X, Y, etc. to represent a r.v., and x, y to represent the numerical values. For example, X = x means that r.v. X has value x.

The set of values the function takes can be countable or uncountable. If it is countable, the random variable is a discrete r.v.; otherwise, it is a continuous r.v.
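Since a r.v. is literally a function on Ω, it can be written as one; a minimal sketch for the two-coin-toss experiment (previewing the example below):

```python
# X maps each outcome of two coin tosses to the number of heads.
Omega = ["HH", "HT", "TH", "TT"]

def X(omega: str) -> int:
    return omega.count("H")

print({omega: X(omega) for omega in Omega})
# {'HH': 2, 'HT': 1, 'TH': 1, 'TT': 0}
```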


Examples of r.v.

A coin is tossed. If X is the r.v. whose value is the number of heads obtained, then X(H) = 1, X(T) = 0.

And then tossed again. We define sample space Ω = {HH, HT, TH, TT}. If Y is the r.v. whose value is the number of heads obtained, then

Y(HH) = 2, Y(HT) = Y(TH) = 1, Y(TT) = 0.

When a player rolls a die, he will win $1 if the outcome is 1, 2, or 3, and otherwise lose $1. Let Ω = {1, 2, 3, 4, 5, 6} and define X as follows:

X(1) = X(2) = X(3) = 1, X(4) = X(5) = X(6) = −1.


Random variables vs. events

Suppose now that a sample space Ω = {ω1, ω2, · · · , ωn} is given, and a r.v. X on Ω is defined as the number of heads obtained when we toss a coin twice.

Event E1 represents exactly one head obtained. Hence,

E1 = {ω : X(ω) = 1};

Event E2 represents an even number of heads obtained. Hence,

E2 = {ω : X(ω) mod 2 = 0};

Event E3 represents at least one head obtained. Hence,

E3 = {ω : X(ω) > 0}.

These examples indicate that we can also define probabilities in terms of r.v.s.
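Each of these events is just the preimage of a set of values under X; a small sketch of the same three events:

```python
# Toss a coin twice; X = number of heads.
Omega = ["HH", "HT", "TH", "TT"]
X = lambda w: w.count("H")

E1 = {w for w in Omega if X(w) == 1}       # exactly one head
E2 = {w for w in Omega if X(w) % 2 == 0}   # an even number of heads
E3 = {w for w in Omega if X(w) > 0}        # at least one head
print(E1, E2, E3, sep="\n")
```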


Random Variable and Distributions: Distribution Functions


Cumulative distribution function

The cumulative distribution function or cdf of a r.v. X, denoted by F_X(x), is defined by

F_X(x) = P_X(X ≤ x), for all x.

Consider the experiment of tossing three fair coins, and let X = # heads observed. The cdf of X is

F_X(x) = 0,   if −∞ < x < 0;
         1/8, if 0 ≤ x < 1;
         1/2, if 1 ≤ x < 2;
         7/8, if 2 ≤ x < 3;
         1,   if 3 ≤ x < +∞.
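The jumps of this step function can be recovered by enumerating the eight equally likely outcomes; a minimal sketch:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))   # 8 equally likely outcomes
pmf = {k: Fraction(sum(o.count("H") == k for o in outcomes), 8)
       for k in range(4)}

def F(x: float) -> Fraction:               # cdf: F_X(x) = P(X <= x)
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

print(*(F(x) for x in (-1, 0, 1, 2, 3)))   # 0 1/8 1/2 7/8 1
```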


Theorem

The function F(x) is a cdf if and only if the following three conditions hold:

lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1;

F(x) is a nondecreasing function of x;

F(x) is right-continuous; that is, for every number x0, lim_{x↓x0} F(x) = F(x0).

Example

Suppose we do an experiment that consists of tossing a coin until a head appears. Let p = probability of a head on any given toss, and define a r.v. X = # tosses required to get a head. Then for any x = 1, 2, · · · ,

P(X ≤ x) = ∑_{i=1}^x (1 − p)^{i−1} p = 1 − (1 − p)^x.


Example Cont’d

Given 0 < p < 1, we have:

lim_{x→−∞} [1 − (1 − p)^x] = 0 and lim_{x→+∞} [1 − (1 − p)^x] = 1;

1 − (1 − p)^x is a nondecreasing function of x;

For any x, F_X(x + ε) = F_X(x) if ε > 0 is sufficiently small. Hence,

lim_{ε↓0} F_X(x + ε) = F_X(x).

F_X(x) is the cdf of a distribution called the geometric distribution.
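A quick simulation supports the closed form (p, x, and the trial count are arbitrary choices for illustration):

```python
import random

random.seed(0)
p, x, trials = 0.3, 4, 100_000

def tosses_until_head() -> int:
    n = 1
    while random.random() >= p:   # a tail, with probability 1 - p
        n += 1
    return n

empirical = sum(tosses_until_head() <= x for _ in range(trials)) / trials
print(empirical, 1 - (1 - p) ** x)   # both ≈ 0.7599
```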


Example Cont’d

An example of a continuous cdf is the function

F_X(x) = 1 / (1 + e^{−x}).

lim_{x→−∞} F_X(x) = 0 since lim_{x→−∞} e^{−x} = ∞;

lim_{x→+∞} F_X(x) = 1 since lim_{x→+∞} e^{−x} = 0;

Differentiating F_X(x) gives

d/dx F_X(x) = e^{−x} / (1 + e^{−x})^2 > 0;

F_X(x) is not only right-continuous, but also continuous.

This is a special case of the logistic distribution.


Continuous and discrete r.v.s

A r.v. X is continuous if F_X(x) is a continuous function of x, and a r.v. X is discrete if F_X(x) is a step function of x.

The r.v.s X and Y are identically distributed if, for every set A ∈ B (the Borel sets),

P(X ∈ A) = P(Y ∈ A).

Note that two r.v.s that are identically distributed are not necessarily equal.

Theorem

The following two statements are equivalent:

The r.v.s X and Y are identically distributed;

F_X(x) = F_Y(x) for all x.


Example

Note that two r.v.s that are identically distributed are not necessarily equal.

For example, consider tossing a fair coin three times. Define r.v.s

X = # heads observed

and

Y = # tails observed.

We have P(X = k) = P(Y = k), i.e., X and Y are identically distributed. However, X(ω) = Y(ω) does not hold for any ω ∈ Ω.
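This can be checked outcome by outcome; a minimal sketch (note X(ω) + Y(ω) = 3, which is odd, so X(ω) = Y(ω) is impossible):

```python
from collections import Counter
from itertools import product

Omega = list(product("HT", repeat=3))
X = lambda w: w.count("H")   # number of heads
Y = lambda w: w.count("T")   # number of tails

# Identically distributed ...
assert Counter(map(X, Omega)) == Counter(map(Y, Omega))
# ... yet never equal on any single outcome.
assert all(X(w) != Y(w) for w in Omega)
```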


Random Variable and Distributions: Density and Mass Functions


Probability mass function

For all x, the probability mass function (pmf) of a discrete r.v. X on Ω is given by f_X(x) = P(X = x).

For the geometric distribution, we have the pmf

f_X(x) = P(X = x) = (1 − p)^{x−1} p, for x = 1, 2, · · · ; 0, otherwise.

Recall that P(X = x) or, equivalently, f_X(x) is the size of the jump in the cdf at x.

We can use the pmf to calculate probabilities: for positive integers a and b with b ≥ a, we have

P(a ≤ X ≤ b) = ∑_{k=a}^{b} f_X(k) = ∑_{k=a}^{b} (1 − p)^{k−1} p.
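The sum also telescopes to F_X(b) − F_X(a − 1), which exact rational arithmetic confirms; a small sketch with arbitrary p, a, b:

```python
from fractions import Fraction

p = Fraction(1, 3)
f = lambda k: (1 - p) ** (k - 1) * p   # geometric pmf
F = lambda x: 1 - (1 - p) ** x         # geometric cdf

a, b = 2, 5
assert sum(f(k) for k in range(a, b + 1)) == F(b) - F(a - 1)
```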


Examples of pmf

Question: Let X be the sum of the numbers that appear when a pair of dice is rolled. What are the values and probabilities of this random variable over the 36 possible outcomes (i, j) when the two dice are rolled?

Solution:

value  prob.    value  prob.
2      1/36     3      1/18
4      1/12     5      1/9
6      5/36     7      1/6
8      5/36     9      1/9
10     1/12     11     1/18
12     1/36
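The whole table falls out of enumerating the 36 equally likely pairs; a minimal sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}

for s, pr in pmf.items():
    print(s, pr)   # 2 1/36, 3 1/18, ..., 7 1/6, ..., 12 1/36
```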


Probability density function

If we naively try to calculate P(X = x) for a continuous r.v. X and any ε > 0, we have

P(X = x) ≤ P(x − ε < X ≤ x) = F_X(x) − F_X(x − ε).

Therefore,

0 ≤ P(X = x) ≤ lim_{ε↓0} [F_X(x) − F_X(x − ε)] = 0.

Definition

The probability density function or pdf, f_X(x), of a continuous r.v. X is the function that satisfies

F_X(x) = ∫_{−∞}^{x} f_X(t) dt, for all x.


Remarks of pdf

"X has a distribution given by F_X(x)" is abbreviated symbolically by "X ∼ F_X(x)" (similarly, "X ∼ f_X(x)").

Given a continuous r.v. X, since P(X = x) = 0, we have

P(a < X < b) = P(a ≤ X < b) = P(a ≤ X ≤ b) = P(a < X ≤ b).

For the logistic distribution, F_X(x) = 1 / (1 + e^{−x}). We have

f_X(x) = d/dx F_X(x) = e^{−x} / (1 + e^{−x})^2,

and

P(a < X < b) = F_X(b) − F_X(a) = ∫_{−∞}^{b} f_X(t) dt − ∫_{−∞}^{a} f_X(t) dt = ∫_{a}^{b} f_X(t) dt.
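A crude numerical integration of f_X reproduces F_X(b) − F_X(a), a useful sanity check on the differentiation (midpoint rule; a and b are arbitrary):

```python
import math

F = lambda x: 1 / (1 + math.exp(-x))                  # logistic cdf
f = lambda x: math.exp(-x) / (1 + math.exp(-x)) ** 2  # its pdf

def integrate(g, a, b, n=100_000):                    # midpoint rule
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

a, b = -1.0, 2.0
print(integrate(f, a, b), F(b) - F(a))                # both ≈ 0.6119
```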


Theorem

A function f_X(x) is a pdf (or pmf) of a r.v. X if and only if:

f_X(x) ≥ 0 for all x;

∑_x f_X(x) = 1 (pmf) or ∫_{−∞}^{+∞} f_X(x) dx = 1 (pdf).

Actually, any nonnegative function with a finite positive integral (or sum) can be turned into a pdf or pmf. For example, if h(x) is any nonnegative function that is positive on a set A and 0 otherwise, and

∫_{x∈A} h(x) dx = K < ∞

for some constant K > 0, then the function f_X(x) = h(x)/K is a pdf of a r.v. X taking values in A.
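As a concrete instance of the normalization recipe (h, A, and K below are an arbitrary choice for illustration):

```python
import math

# h(x) = e^{-x} on A = [0, 2] and 0 elsewhere; K = ∫_A h = 1 - e^{-2}.
h = lambda x: math.exp(-x) if 0 <= x <= 2 else 0.0
K = 1 - math.exp(-2)
f = lambda x: h(x) / K               # f is then a pdf of a r.v. on A

def integrate(g, a, b, n=100_000):   # midpoint rule
    w = (b - a) / n
    return sum(g(a + (i + 0.5) * w) for i in range(n)) * w

print(integrate(f, 0, 2))            # ≈ 1.0
```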


Take-aways

Conclusions

Introduction

Set Theory

Basics of Probability Theory: The Calculus of Probabilities; Counting

Random Variable and Distributions: Random Variable; Distribution Functions; Density and Mass Functions
