02 Machine Learning - Introduction to Probability

Machine Learning for Data Mining: Probability Review
Andres Mendez-Vazquez
May 14, 2015


Outline
1 Basic Theory
   Intuitive Formulation
   Axioms
2 Independence
   Unconditional and Conditional Probability
   Posterior (Conditional) Probability
3 Random Variables
   Types of Random Variables
   Cumulative Distribution Function
   Properties of the PMF/PDF
   Expected Value and Variance
4 Statistical Decision
   Statistical Decision Model
   Hypothesis Testing
   Estimation


Gerolamo Cardano: Gambling out of Darkness

Gambling
Gambling shows our interest in quantifying the ideas of probability for millennia, but exact mathematical descriptions arose much later.

Gerolamo Cardano (16th century)
While gambling, he developed the following rule!!!

Equal conditions
"The most fundamental principle of all in gambling is simply equal conditions, e.g. of opponents, of bystanders, of money, of situation, of the dice box and of the dice itself. To the extent to which you depart from that equity, if it is in your opponent's favour, you are a fool, and if in your own, you are unjust."


Gerolamo Cardano’s Definition

Probability
"If therefore, someone should say, I want an ace, a deuce, or a trey, you know that there are 27 favourable throws, and since the circuit is 36, the rest of the throws in which these points will not turn up will be 9; the odds will therefore be 3 to 1."

Meaning
Probability as a ratio of favorable to all possible outcomes!!! As long as all events are equiprobable...

Thus, we get

P(All favourable throws) = \frac{\text{Number of all favourable throws}}{\text{Number of all throws}}   (1)


Intuitive Formulation

Empiric Definition
Intuitively, the probability of an event A could be defined as:

P(A) = \lim_{n \to \infty} \frac{N(A)}{n}

where N(A) is the number of times that event A happens in n trials.

Example
Imagine you have three dice, then:
   The total number of outcomes is 6^3.
   If we have the event A = "all numbers are equal", then |A| = 6.
   Then, we have that P(A) = 6/6^3 = 1/36.
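The empirical definition can be checked directly by simulation. A minimal Python sketch (the function name and trial count are illustrative, not from the original deck) that estimates P(A) by the relative frequency N(A)/n:

import random

def estimate_all_equal(n_trials=200_000, seed=0):
    """Estimate P(all three dice show the same face) by the relative frequency N(A)/n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        faces = [rng.randint(1, 6) for _ in range(3)]
        if len(set(faces)) == 1:    # event A: all numbers are equal
            hits += 1
    return hits / n_trials

print(estimate_all_equal())         # close to 1/36 ≈ 0.0278 for large n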


Axioms of Probability

Axioms
Given a sample space S of events, we have that:
1. 0 ≤ P(A) ≤ 1
2. P(S) = 1
3. If A_1, A_2, ..., A_n are mutually exclusive events (i.e. P(A_i ∩ A_j) = 0 for i ≠ j), then:

P(A_1 ∪ A_2 ∪ ... ∪ A_n) = \sum_{i=1}^{n} P(A_i)


Set Operations

We are using
Set notation.

Thus
What operations?


Example

Setup
Throw a biased coin twice:

   HH  .36    HT  .24
   TH  .24    TT  .16

We have the following event
At least one head!!! Can you tell me which outcomes are part of it?

What about this one?
Tail on first toss.
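A quick sketch of how these event probabilities follow from the table, by summing the probabilities of the outcomes that belong to each event (outcome labels are the ones in the table):

P = {"HH": 0.36, "HT": 0.24, "TH": 0.24, "TT": 0.16}   # biased coin thrown twice

# Event "at least one head" = {HH, HT, TH}; event "tail on first toss" = {TH, TT}.
at_least_one_head = sum(p for w, p in P.items() if "H" in w)
tail_on_first_toss = sum(p for w, p in P.items() if w[0] == "T")

print(at_least_one_head, tail_on_first_toss)   # ≈ 0.84 and ≈ 0.40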


We need to count!!!

We have four main methods of counting
1. Ordered samples of size r with replacement
2. Ordered samples of size r without replacement
3. Unordered samples of size r without replacement
4. Unordered samples of size r with replacement


Ordered samples of size r with replacement

Definition
The number of possible sequences (a_{i_1}, ..., a_{i_r}) drawn from n different numbers is n × n × ... × n = n^r.

Example
If you throw three dice you have 6 × 6 × 6 = 216 possible sequences.


Ordered samples of size r without replacement

Definition
The number of possible sequences (a_{i_1}, ..., a_{i_r}) drawn from n different numbers is

n × (n − 1) × ... × (n − (r − 1)) = \frac{n!}{(n − r)!}

Example
The number of different numbers that can be formed if no digit can be repeated. For example, if you have 4 digits and you want numbers of size 3, there are 4!/(4 − 3)! = 24 of them.


Unordered samples of size r without replacement

Definition
Actually, we want the number of possible unordered sets.

However
We have \frac{n!}{(n − r)!} collections where we care about the order. Thus

\frac{n!/(n − r)!}{r!} = \frac{n!}{r!(n − r)!} = \binom{n}{r}   (2)


Unordered samples of size r with replacement

Definition
We want the number of unordered sets {a_{i_1}, ..., a_{i_r}} taken with replacement.

Use a digit trick for that
Look at the board.

Thus

\binom{n + r − 1}{r}   (3)


How?
Change the encoding by adding more signs. Imagine all the non-decreasing strings of three numbers over 1, 2, 3.

We have
Old String   New String
111          1+0, 1+1, 1+2 = 123
112          1+0, 1+1, 2+2 = 124
113          1+0, 1+1, 3+2 = 125
122          1+0, 2+1, 2+2 = 134
123          1+0, 2+1, 3+2 = 135
133          1+0, 3+1, 3+2 = 145
222          2+0, 2+1, 2+2 = 234
223          2+0, 2+1, 3+2 = 235
233          2+0, 3+1, 3+2 = 245
333          3+0, 3+1, 3+2 = 345
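A small Python sketch of the trick for n = 3 symbols and r = 3 draws (variable names are illustrative): adding 0, 1, 2 to the sorted positions turns each multiset into a strictly increasing string, so the count is \binom{n + r − 1}{r}.

from itertools import combinations_with_replacement
from math import comb

n, r = 3, 3   # symbols {1, 2, 3}, unordered samples of size 3 with replacement

# Unordered samples with replacement, listed in non-decreasing order.
old_strings = list(combinations_with_replacement(range(1, n + 1), r))

# Digit trick: add 0, 1, 2, ... to the positions -> strictly increasing strings.
new_strings = [tuple(x + i for i, x in enumerate(s)) for s in old_strings]

print(len(old_strings), comb(n + r - 1, r))   # 10 10
print(old_strings[0], "->", new_strings[0])   # (1, 1, 1) -> (1, 2, 3)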


Independence

Definition
Two events A and B are independent if and only if P(A, B) = P(A ∩ B) = P(A)P(B).

Example

We have two dice
Thus, we have all pairs (i, j) such that i, j = 1, 2, ..., 6.

We have the following events
A = first die shows 1, 2 or 3
B = first die shows 3, 4 or 5
C = the sum of the two faces is 9

So, we can do
Look at the board!!! Independence between A, B, C.
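A minimal sketch of the board computation, enumerating the 36 equally likely pairs and checking P(E ∩ F) = P(E)P(F) for each pair of events (function and variable names are illustrative):

from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely pairs (i, j)

def P(event):
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

A = lambda w: w[0] in (1, 2, 3)   # first die shows 1, 2 or 3
B = lambda w: w[0] in (3, 4, 5)   # first die shows 3, 4 or 5
C = lambda w: sum(w) == 9         # the two faces sum to 9

for name, (E, F) in {"A,B": (A, B), "A,C": (A, C), "B,C": (B, C)}.items():
    independent = P(lambda w: E(w) and F(w)) == P(E) * P(F)
    print(name, "independent:", independent)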


We can use this to derive the Binomial Distribution

WHAT?????

First, we use a sequence of n Bernoulli Trials

We have this
"Success" has a probability p.
"Failure" has a probability 1 − p.

Examples
Toss a coin independently n times.
Examine components produced on an assembly line.

Now
We take S = all 2^n ordered sequences of length n, with components 0 (failure) and 1 (success).


Thus, taking a sample ω
ω = 11...10...0, with k 1's followed by n − k 0's.

We have then

P(ω) = P(A_1 ∩ A_2 ∩ ... ∩ A_k ∩ A_{k+1}^c ∩ ... ∩ A_n^c)
     = P(A_1) P(A_2) ... P(A_k) P(A_{k+1}^c) ... P(A_n^c)
     = p^k (1 − p)^{n−k}

Important
The number of such samples is the number of sets with k elements... or... \binom{n}{k}


Did you notice?

We do not care where the 1's and 0's are
Thus all these probabilities are equal to p^k (1 − p)^{n−k}.

Thus, we are looking to sum the probabilities of all those combinations of 1's and 0's

∑_{ω with k 1's} p(ω)

Then

∑_{ω with k 1's} p(ω) = \binom{n}{k} p^k (1 − p)^{n−k}


Proving this is a probability

Sum of these probabilities is equal to 1

\sum_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1

The other is simple

0 ≤ \binom{n}{k} p^k (1 − p)^{n−k} ≤ 1   ∀k

This is known as
The Binomial probability function!!!
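A short sketch of the binomial probability function (parameter values are illustrative, not from the deck), which also checks that the probabilities sum to 1:

from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.6
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(sum(pmf))               # ~1.0, i.e. (p + (1 - p))**n
print(binomial_pmf(6, n, p))  # probability of exactly 6 successes in 10 trials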


Different Probabilities

Unconditional
This is the probability of an event A prior to the arrival of any evidence; it is denoted by P(A). For example:
   P(Cavity) = 0.1 means that "in the absence of any other information, there is a 10% chance that the patient has a cavity".

Conditional
This is the probability of an event A given some evidence B; it is denoted by P(A|B). For example:
   P(Cavity|Toothache) = 0.8 means that "there is an 80% chance that the patient has a cavity given that he has a toothache".


Posterior Probabilities

Relation between conditional and unconditional probabilities
Conditional probabilities can be defined in terms of unconditional probabilities:

P(A|B) = \frac{P(A, B)}{P(B)}

which generalizes to the chain rule P(A, B) = P(B)P(A|B) = P(A)P(B|A).

Law of Total Probability
If B_1, B_2, ..., B_n is a partition of mutually exclusive events and A is an event, then P(A) = \sum_{i=1}^{n} P(A ∩ B_i). A special case is P(A) = P(A, B) + P(A, B^c).

In addition, this can be rewritten as P(A) = \sum_{i=1}^{n} P(A|B_i) P(B_i).


Example

Three cards are drawn from a deck
Find the probability of not obtaining a heart.

We have
52 cards, 39 of them not a heart.

Define
A_i = "card i is not a heart". Then, by the chain rule,
P(A_1 ∩ A_2 ∩ A_3) = P(A_1) P(A_2|A_1) P(A_3|A_1, A_2) = \frac{39}{52} \cdot \frac{38}{51} \cdot \frac{37}{50} ≈ 0.41


Independence and Conditional

From here, we have that...
If A and B are independent, then P(A|B) = P(A) and P(B|A) = P(B).

Conditional independence
A and B are conditionally independent given C if and only if

P(A|B, C) = P(A|C)

Example: P(WetGrass|Season, Rain) = P(WetGrass|Rain).


Bayes Theorem
One Version

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

Where
P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.
P(A|B) is the conditional probability of A given B. It is also called the posterior probability because it is derived from, or depends upon, the specified value of B.
P(B|A) is the conditional probability of B given A. It is also called the likelihood.
P(B) is the prior or marginal probability of B, and acts as a normalizing constant.


General Form of the Bayes Rule

Definition
If A_1, A_2, ..., A_n is a partition of mutually exclusive events and B is any event, then:

P(A_i|B) = \frac{P(B|A_i) P(A_i)}{P(B)} = \frac{P(B|A_i) P(A_i)}{\sum_{j=1}^{n} P(B|A_j) P(A_j)}

where

P(B) = \sum_{i=1}^{n} P(B ∩ A_i) = \sum_{i=1}^{n} P(B|A_i) P(A_i)


Example

Setup
Throw two unbiased dice independently.

Let
1. A = the sum of the faces is 8
2. B = the faces are equal

Then calculate P(B|A)
Look at the board.
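The board computation can be sketched by enumeration, using P(B|A) = P(A, B)/P(A) (names are illustrative):

from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))    # 36 equally likely pairs
A = [w for w in outcomes if sum(w) == 8]           # sum of the faces is 8
A_and_B = [w for w in A if w[0] == w[1]]           # ...and the faces are equal

print(Fraction(len(A_and_B), len(A)))              # P(B|A) = 1/5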


Another Example

We have the following
Two coins are available, one unbiased and the other two-headed.

Assume
That you have a probability of 3/4 of choosing the unbiased one.

Events
A = a head comes up
B_1 = the unbiased coin was chosen
B_2 = the biased coin was chosen

If a head comes up, find the probability that the two-headed coin was chosen.
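A minimal sketch of the Bayes-rule computation for this example (dictionary keys are illustrative):

from fractions import Fraction

prior = {"B1": Fraction(3, 4), "B2": Fraction(1, 4)}         # unbiased vs two-headed
likelihood_head = {"B1": Fraction(1, 2), "B2": Fraction(1)}  # P(A | B_i)

P_A = sum(likelihood_head[b] * prior[b] for b in prior)      # law of total probability
posterior_B2 = likelihood_head["B2"] * prior["B2"] / P_A     # Bayes rule
print(posterior_B2)                                          # 2/5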


Random Variables I

Definition
In many experiments, it is easier to deal with a summary variable than with the original probability structure.

Example
In an opinion poll, we ask 50 people whether they agree or disagree with a certain issue.
   Suppose we record a "1" for agree and a "0" for disagree.
   The sample space for this experiment has 2^50 elements. Why?
   Suppose we are only interested in the number of people who agree.
   Define the variable X = number of "1"s recorded out of 50.
   It is easier to deal with this sample space (it has only 51 elements).


Thus...

It is necessary to define a function, a "random variable", as follows:

X : S → R

Graphically
[Figure: X maps each outcome in the sample space S to a real number.]


Random Variables II

How?
How is the probability function of the random variable defined from the probability function of the original sample space?
   Suppose the sample space is S = {s_1, s_2, ..., s_n}.
   Suppose the range of the random variable X is {x_1, x_2, ..., x_m}.
   Then, we observe X = x_j if and only if the outcome of the random experiment is an s_i ∈ S such that X(s_i) = x_j, or

P(X = x_j) = P({s_i ∈ S | X(s_i) = x_j})


Example

Setup
Throw a coin 10 times, and let R be the number of heads.

Then
S = all sequences of length 10 with components H and T.

We have, for
ω = HHHHTTHTTH ⇒ R(ω) = 6


Example

Setup
Let R be the number of heads in two independent tosses of a coin, where the probability of a head is 0.6.

What are the probabilities?
Ω = {HH, HT, TH, TT}

Thus, we can calculate
P(R = 0) = 0.4^2 = 0.16, P(R = 1) = 2(0.6)(0.4) = 0.48, P(R = 2) = 0.6^2 = 0.36


Types of Random Variables

Discrete
A discrete random variable can assume only a countable number of values.

Continuous
A continuous random variable can assume a continuous range of values.


Properties

Probability Mass Function (PMF) and Probability Density Function (PDF)
The pmf/pdf of a random variable X assigns a probability to each possible value of X.

Properties of the pmf and pdf
Some properties of the pmf:
   \sum_x p(x) = 1 and P(a ≤ X ≤ b) = \sum_{k=a}^{b} p(k).
In a similar way for the pdf:
   \int_{-\infty}^{\infty} p(x) dx = 1 and P(a < X < b) = \int_a^b p(t) dt.


Cumulative Distribution Function I

Cumulative Distribution Function
With every random variable, we associate a function called the Cumulative Distribution Function (CDF), which is defined as follows:

F_X(x) = P(X ≤ x)

With properties:
   F_X(x) ≥ 0
   F_X(x) is a non-decreasing function of x.

Example
If X is discrete, its CDF can be computed as follows:

F_X(x) = P(X ≤ x) = \sum_{x_k ≤ x} P(X = x_k)


Example: Discrete Function

[Figure: bar plot of the PMF with values 0.16, 0.48, 0.36 (the number of heads in two tosses with P(head) = 0.6) and the corresponding step-function CDF.]
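A tiny sketch of how the step-function CDF comes from this PMF (a running sum over the values x_k ≤ x; the data below are the two-toss probabilities above):

from itertools import accumulate

pmf = {0: 0.16, 1: 0.48, 2: 0.36}   # number of heads in two tosses, P(head) = 0.6

cdf = dict(zip(pmf, accumulate(pmf.values())))
print(cdf)   # {0: 0.16, 1: 0.64, 2: 1.0} up to floating-point rounding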

Cumulative Distribution Function II

Continuous Function
If X is continuous, its CDF can be computed as follows:

F(x) = \int_{-\infty}^{x} f(t) dt

Remark
Based on the fundamental theorem of calculus, we have the following equality:

p(x) = \frac{dF}{dx}(x)

Note
This particular p(x) is known as the Probability Mass Function (PMF) in the discrete case or the Probability Density Function (PDF) in the continuous case.


Example: Continuous Function

Setup
A number X is chosen at random between a and b; X has a uniform distribution:
   f_X(x) = \frac{1}{b − a} for a ≤ x ≤ b
   f_X(x) = 0 for x < a and x > b

We have

F_X(x) = P(X ≤ x) = \int_{-\infty}^{x} f_X(t) dt   (4)

P(a < X ≤ b) = \int_{a}^{b} f_X(t) dt   (5)


Graphically
[Figure: the uniform density, constant at 1/(b − a) on [a, b], and its CDF rising from 0 to 1.]
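A small sketch of the uniform density and its CDF in closed form (the argument values are illustrative):

def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # Integral of the density from -infinity to x.
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(uniform_pdf(0.25, 0.0, 1.0), uniform_cdf(0.25, 0.0, 1.0))   # 1.0 0.25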


Properties of the PMF/PDF I

Conditional PMF/PDF
We have the conditional pdf:

p(y|x) = \frac{p(x, y)}{p(x)}

From this, we have the general chain rule:

p(x_1, x_2, ..., x_n) = p(x_1|x_2, ..., x_n) p(x_2|x_3, ..., x_n) ... p(x_n)

Independence
If X and Y are independent, then:

p(x, y) = p(x) p(y)


Properties of the PMF/PDF II

Law of Total Probability

p(y) = \sum_{x} p(y|x) p(x)


Expectation

Something Notable
You have two random variables R_1 and R_2 representing how long a call lasts and how much you pay for an international call:
   if 0 ≤ R_1 ≤ 3 (minutes), then R_2 = 10 (cents)
   if 3 < R_1 ≤ 6 (minutes), then R_2 = 20 (cents)
   if 6 < R_1 ≤ 9 (minutes), then R_2 = 30 (cents)

We have then the probabilities
P(R_2 = 10) = 0.6, P(R_2 = 20) = 0.25, P(R_2 = 30) = 0.15

If we observe N calls and N is very large
We can say that we have N × 0.6 calls at 10 cents, so 10 × N × 0.6 is the total cost of those calls.


Expectation

Similarly
R_2 = 20 ⟹ 0.25N calls and total cost 5N
R_2 = 30 ⟹ 0.15N calls and total cost 4.5N

We have then
The total cost is 6N + 5N + 4.5N = 15.5N, or on average 15.5 cents per call.

The average
\frac{10(0.6N) + 20(0.25N) + 30(0.15N)}{N} = 10(0.6) + 20(0.25) + 30(0.15) = \sum_y y P(R_2 = y)


Expected Value

Definition
Discrete random variable X: E(X) = \sum_x x p(x).
Continuous random variable Y: E(Y) = \int_{-\infty}^{\infty} y p(y) dy.

Extension to a function g(X)
E(g(X)) = \sum_x g(x) p(x) (discrete case).
E(g(X)) = \int_{-\infty}^{\infty} g(x) p(x) dx (continuous case).

Linearity property
E(a f(X) + b g(Y)) = a E(f(X)) + b E(g(Y))
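The call-cost average above is exactly this definition applied to R_2; a one-line sketch (the dictionary name is illustrative):

pmf_R2 = {10: 0.60, 20: 0.25, 30: 0.15}          # cost in cents -> probability

expected_cost = sum(y * p for y, p in pmf_R2.items())
print(expected_cost)                              # 15.5 cents per call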


Example

Imagine the following
We have the following density:
1. f(x) = e^{−x}, x ≥ 0
2. f(x) = 0, x < 0

Find
The expected value.
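For reference, a short worked evaluation of this expectation (standard integration by parts; not shown on the slide):

E(X) = \int_0^{\infty} x e^{-x}\, dx = \Big[ -(x+1)e^{-x} \Big]_0^{\infty} = 1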


Variance

Definition
Var(X) = E((X − μ)^2), where μ = E(X).

Standard Deviation
The standard deviation is simply σ = \sqrt{Var(X)}.
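A quick sketch applying the definition to the earlier two-toss example with P(head) = 0.6:

pmf = {0: 0.16, 1: 0.48, 2: 0.36}   # number of heads in two tosses

mu = sum(x * p for x, p in pmf.items())                 # E(X) = 1.2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())    # Var(X) = 0.48
print(mu, var, var ** 0.5)                              # mean, variance, standard deviation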


Example

Suppose
The number of calls made per day at a given exchange has a Poisson distribution with an unknown parameter θ:

p(x|θ) = θ^x e^{−θ} / x!,  x = 0, 1, 2, ...   (6)

We need to obtain information about θ
For this, we observe that certain information is needed!!!

For example
We could need more of certain equipment if θ > θ0.
We do not need it if θ ≤ θ0.

59 / 87
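A simulation sketch of this situation (all numbers below are assumptions for illustration, not from the slides): draw daily call counts from a Poisson(θ) distribution, estimate θ by the sample mean, and compare the estimate with the threshold θ0.

import numpy as np

rng = np.random.default_rng(0)
theta_true, theta_0 = 12.0, 10.0          # hypothetical true parameter and threshold
x = rng.poisson(theta_true, size=30)      # 30 days of observed call counts
theta_hat = x.mean()                      # natural estimate of theta
print(theta_hat, "need more equipment" if theta_hat > theta_0 else "no extra equipment")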


Page 157: 02 Machine Learning - Introduction probability

Thus, we want to take a decision about θ

To avoid making an incorrect decision
To avoid losing money!!!

60 / 87

Page 158: 02 Machine Learning - Introduction probability

Ingredients of statistical decision models

First
N, the set of states of nature.

Second
A random variable or random vector X, the observable, whose distribution Fθ depends on θ ∈ N.

Third
A, the set of possible actions:

A = N = (0, ∞)

Fourth
A loss (cost) function L(θ, a), θ ∈ N, a ∈ A:

It represents the loss incurred by taking action a when the true state is θ.

61 / 87


Page 162: 02 Machine Learning - Introduction probability

Outline
1 Basic Theory
   Intuitive Formulation
   Axioms
2 Independence
   Unconditional and Conditional Probability
   Posterior (Conditional) Probability
3 Random Variables
   Types of Random Variables
   Cumulative Distribution Function
   Properties of the PMF/PDF
   Expected Value and Variance
4 Statistical Decision
   Statistical Decision Model
   Hypothesis Testing
   Estimation

62 / 87

Page 163: 02 Machine Learning - Introduction probability

Hypothesis Testing

Suppose
H0 and H1 are two subsets such that
H0 ∩ H1 = ∅
H0 ∪ H1 = N

In the telephone example
H0 = {θ | θ ≤ θ0}
H1 = {θ | θ > θ0}

In other words
The hypotheses are "θ ∈ H0" and "θ ∈ H1".

63 / 87


Page 170: 02 Machine Learning - Introduction probability

Simple Hypothesis Vs. Simple Alternative

In this specific case
Each of H0 and H1 contains a single element, θ0 and θ1 respectively.

Thus
Our random variable X has a distribution that depends on θ:
If we are in H0, X ∼ f0.
If we are in H1, X ∼ f1.

Thus, the problem
It is deciding whether X has density f0 or f1.

64 / 87


Page 174: 02 Machine Learning - Introduction probability

What do we do?

We define a function
ϕ : E → [0, 1], interpreted as the probability of rejecting H0 when x is observed.

We have then
If ϕ(x) = 1, we reject H0.
If ϕ(x) = 0, we accept H0.
If 0 < ϕ(x) < 1, we toss a coin with probability a = ϕ(x) of heads:
  if the coin comes up heads, we reject H0;
  if the coin comes up tails, we accept H0.

65 / 87
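A minimal sketch of this randomized decision rule (the coin toss is implemented with a uniform random number; the helper name decide is an assumption, not from the slides):

import numpy as np

rng = np.random.default_rng(1)

def decide(phi_x):
    """Reject H0 with probability phi_x, as the test function prescribes."""
    if phi_x == 1.0:
        return "reject H0"
    if phi_x == 0.0:
        return "accept H0"
    # 0 < phi_x < 1: toss a coin with probability phi_x of heads
    return "reject H0" if rng.random() < phi_x else "accept H0"

print(decide(1.0), decide(0.0), decide(0.5))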


Page 180: 02 Machine Learning - Introduction probability

Thus

The set {x | ϕ(x) = 1}
It is called the rejection region or critical region.

And
ϕ is called a test!!!

Clearly the decision could be erroneous!!!
A type 1 error occurs if we reject H0 when H0 is true!!!
A type 2 error occurs if we accept H0 when H1 is true!!!

66 / 87


Page 184: 02 Machine Learning - Introduction probability

Thus the probability of error when X = x

If H0 is rejected when true
Probability of a type 1 error:

α = ∫_{−∞}^{∞} ϕ(x) f0(x) dx   (7)

If H0 is accepted when false
Probability of a type 2 error:

β = ∫_{−∞}^{∞} (1 − ϕ(x)) f1(x) dx   (8)

67 / 87
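A numerical sketch of equations (7) and (8) for an assumed concrete case (not from the slides): f0 = N(0, 1), f1 = N(1, 1) and the non-randomized test ϕ(x) = 1 when x > c. Since ϕ is an indicator, the two integrals reduce to tail areas.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

c = 0.5                                                   # assumed rejection threshold
alpha, _ = quad(lambda x: norm.pdf(x, 0, 1), c, np.inf)   # integral of phi * f0
beta, _ = quad(lambda x: norm.pdf(x, 1, 1), -np.inf, c)   # integral of (1 - phi) * f1
print(round(alpha, 4), round(1 - norm.cdf(c), 4))         # both ~0.3085
print(round(beta, 4), round(norm.cdf(c - 1), 4))          # both ~0.3085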


Page 186: 02 Machine Learning - Introduction probability

Actually

If the test is non-randomized, ϕ is an indicator function: ϕ(x) = I_{Reject H0}(x) and 1 − ϕ(x) = I_{Accept H0}(x).

The possible outcomes are summarized by the usual 2×2 table:

              Retain H0        Reject H0
H0 true       correct          type 1 error
H1 true       type 2 error     correct

68 / 87

Page 187: 02 Machine Learning - Introduction probability

Problem!!!

There is not a unique answer to the question of what is a good test
Thus, we suppose there is a nonnegative cost ci associated with a type i error.
In addition, we have a prior probability p that H0 is true.

The over-all average cost associated with ϕ is

B(ϕ) = p × c1 × α(ϕ) + (1 − p) × c2 × β(ϕ)   (9)

69 / 87
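Continuing the assumed normal example, a one-line evaluation of the average cost (9) for illustrative values of p, c1 and c2 (none of these numbers come from the slides):

from scipy.stats import norm

p, c1, c2, c = 0.5, 1.0, 1.0, 0.5          # illustrative prior, costs and threshold
alpha = 1 - norm.cdf(c)                    # type 1 error of phi(x) = 1 when x > c
beta = norm.cdf(c - 1)                     # type 2 error
B = p * c1 * alpha + (1 - p) * c2 * beta   # equation (9)
print(round(B, 4))                         # ~0.3085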


Page 190: 02 Machine Learning - Introduction probability

We can do the following
The over-all average cost associated with ϕ is

B(ϕ) = p × c1 × ∫_{−∞}^{∞} ϕ(x) f0(x) dx + (1 − p) × c2 × ∫_{−∞}^{∞} (1 − ϕ(x)) f1(x) dx

Thus

B(ϕ) = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) + (1 − p) c2 (1 − ϕ(x)) f1(x)] dx
     = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) − (1 − p) c2 ϕ(x) f1(x) + (1 − p) c2 f1(x)] dx
     = ∫_{−∞}^{∞} [p c1 ϕ(x) f0(x) − (1 − p) c2 ϕ(x) f1(x)] dx + (1 − p) c2 ∫_{−∞}^{∞} f1(x) dx

We have that (using ∫_{−∞}^{∞} f1(x) dx = 1)

B(ϕ) = ∫_{−∞}^{∞} ϕ(x) [p c1 f0(x) − (1 − p) c2 f1(x)] dx + (1 − p) c2

70 / 87


Page 193: 02 Machine Learning - Introduction probability

Bayes Risk

We have that...
B(ϕ) is called the Bayes risk associated with the test function ϕ.

In addition
A test that minimizes B(ϕ) is called a Bayes test corresponding to the given p, c1, c2, f0 and f1.

71 / 87


Page 195: 02 Machine Learning - Introduction probability

What do we want?

We want
To minimize

∫_S ϕ(x) g(x) dx

We want to find g(x)!!!
This will tell us how to select the correct hypothesis!!!

72 / 87


Page 198: 02 Machine Learning - Introduction probability

What do we want?

Case 1
If g(x) < 0, it is best to take ϕ(x) = 1 for all such x ∈ S.

Case 2
If g(x) > 0, it is best to take ϕ(x) = 0 for all such x ∈ S.

Case 3
If g(x) = 0, ϕ(x) may be chosen arbitrarily.

73 / 87


Page 201: 02 Machine Learning - Introduction probability

Finally

We choose

g(x) = p c1 f0(x) − (1 − p) c2 f1(x)   (10)

We look at the boundary where g(x) = 0

p c1 f0(x) − (1 − p) c2 f1(x) = 0
p c1 f0(x) = (1 − p) c2 f1(x)
p c1 / [(1 − p) c2] = f1(x) / f0(x)

74 / 87


Page 203: 02 Machine Learning - Introduction probability

Bayes Solution

Thus, we have
Let L(x) = f1(x) / f0(x):
If L(x) > p c1 / [(1 − p) c2], then take ϕ(x) = 1, i.e. reject H0.
If L(x) < p c1 / [(1 − p) c2], then take ϕ(x) = 0, i.e. accept H0.
If L(x) = p c1 / [(1 − p) c2], then ϕ(x) may be anything.

75 / 87
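A sketch of the Bayes test as a function (the densities below are the same assumed N(0,1)/N(1,1) pair used earlier; the function name bayes_test is not from the slides):

from scipy.stats import norm

def bayes_test(x, f0, f1, p, c1, c2):
    """Return phi(x) by comparing L(x) = f1(x)/f0(x) with p*c1 / ((1 - p)*c2)."""
    L = f1(x) / f0(x)
    threshold = (p * c1) / ((1 - p) * c2)
    if L > threshold:
        return 1.0   # reject H0
    if L < threshold:
        return 0.0   # accept H0
    return 0.5       # boundary case: any value in [0, 1] is allowed

f0 = lambda x: norm.pdf(x, 0, 1)
f1 = lambda x: norm.pdf(x, 1, 1)
print(bayes_test(0.8, f0, f1, p=0.5, c1=1.0, c2=1.0))   # 1.0: reject H0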


Page 207: 02 Machine Learning - Introduction probability

Likelihood Ratio

We have
L is called the likelihood ratio.

For the test ϕ
There is a constant 0 ≤ λ ≤ ∞ such that
ϕ(x) = 1 when L(x) > λ
ϕ(x) = 0 when L(x) < λ

Remark: This is known as the Likelihood Ratio Test (LRT).

76 / 87


Page 212: 02 Machine Learning - Introduction probability

Example

Let X be a discrete random variable
x = 0, 1, 2, 3

We have then

x        0    1    2    3
p0(x)   .1   .2   .3   .4
p1(x)   .2   .1   .4   .3

We have the following likelihood ratios (ordered by increasing L)

x        1     3     2     0
L(x)    1/2   3/4   4/3    2

77 / 87
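The likelihood ratios in the table can be checked directly from p0 and p1 (a sketch using the numbers on the slide):

# L(x) = p1(x) / p0(x) for the discrete example
p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}
L = {x: p1[x] / p0[x] for x in p0}
print(sorted(L.items(), key=lambda kv: kv[1]))
# [(1, 0.5), (3, 0.75), (2, 1.333...), (0, 2.0)]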


Page 215: 02 Machine Learning - Introduction probability

Example

We have the following situation

LRT               Reject Region   Acceptance Region   α    β
0 ≤ λ < 1/2       All x           Empty               1    0
1/2 < λ < 3/4     x = 0, 2, 3     x = 1               .8   .1
3/4 < λ < 4/3     x = 0, 2        x = 1, 3            .4   .4
4/3 < λ < 2       x = 0           x = 1, 2, 3         .1   .8
2 < λ ≤ ∞         Empty           All x               0    1

78 / 87

Page 216: 02 Machine Learning - Introduction probability

Example

Assume λ = 3/4
Reject H0 if x = 0, 2.
Accept H0 if x = 1.

If x = 3, we randomize
i.e. reject H0 with probability a, 0 ≤ a ≤ 1; thus

α = p0(0) + p0(2) + a p0(3) = 0.4 + 0.4a
β = p1(1) + (1 − a) p1(3) = 0.1 + 0.3(1 − a)

79 / 87
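A small sketch that evaluates α and β of this randomized test for a few values of a, using the p0 and p1 tables from the example:

p0 = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
p1 = {0: 0.2, 1: 0.1, 2: 0.4, 3: 0.3}

def errors(a):
    alpha = p0[0] + p0[2] + a * p0[3]   # reject {0, 2} always, x = 3 with probability a
    beta = p1[1] + (1 - a) * p1[3]      # accept {1} always, x = 3 with probability 1 - a
    return alpha, beta

for a in (0.0, 0.5, 1.0):
    print(a, errors(a))                 # (0.4, 0.4), (0.6, 0.25), (0.8, 0.1)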


Page 221: 02 Machine Learning - Introduction probability

The Graph of B (ϕ)

Thus, we have for each λ value
[Figure omitted in the transcript: the graph of B(ϕ) as λ varies]

80 / 87

Page 222: 02 Machine Learning - Introduction probability

Thus, we have several test

The classic one: the Minimax Test
The test that minimizes max{α, β}.

Which
An admissible test with constant risk (α = β) is minimax.

Then
We have only one test where α = β = 0.4, namely 3/4 < λ < 4/3. Thus
We reject H0 when x = 0 or 2.
We accept H0 when x = 1 or 3.

81 / 87
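A sketch that picks the minimax test among the non-randomized LRTs of the earlier table by minimizing max{α, β}:

# (alpha, beta) for each lambda range from the earlier LRT table
tests = {
    "0 <= lambda < 1/2": (1.0, 0.0),
    "1/2 < lambda < 3/4": (0.8, 0.1),
    "3/4 < lambda < 4/3": (0.4, 0.4),
    "4/3 < lambda < 2": (0.1, 0.8),
    "2 < lambda <= inf": (0.0, 1.0),
}
best = min(tests.items(), key=lambda kv: max(kv[1]))
print(best)   # ('3/4 < lambda < 4/3', (0.4, 0.4))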


Page 226: 02 Machine Learning - Introduction probability

Remark

From these ideas
We can work out the classics of hypothesis testing.

82 / 87

Page 227: 02 Machine Learning - Introduction probability

Outline
1 Basic Theory
   Intuitive Formulation
   Axioms
2 Independence
   Unconditional and Conditional Probability
   Posterior (Conditional) Probability
3 Random Variables
   Types of Random Variables
   Cumulative Distribution Function
   Properties of the PMF/PDF
   Expected Value and Variance
4 Statistical Decision
   Statistical Decision Model
   Hypothesis Testing
   Estimation

83 / 87

Page 228: 02 Machine Learning - Introduction probability

Introduction

Suppose
γ is a real-valued function on the set N of states of nature.
Now, when we observe X = x, we want to produce a number ψ(x) that is close to γ(θ).

There are different ways of doing this
Maximum Likelihood (ML).
Expectation Maximization (EM).
Maximum A Posteriori (MAP).

84 / 87


Page 233: 02 Machine Learning - Introduction probability

Maximum Likelihood Estimation

Suppose the following
Let fθ be a density or probability function corresponding to the state of nature θ.
Assume for simplicity that γ(θ) = θ.

If X = x, the ML estimate of θ (and hence of γ(θ) = θ) is the value of θ that maximizes fθ(x), viewed as a function of θ.

85 / 87


Page 236: 02 Machine Learning - Introduction probability

Example

Let X have a binomial distribution
With parameters n and θ, 0 ≤ θ ≤ 1.

The probability function

pθ(x) = (n choose x) θ^x (1 − θ)^{n−x},  x = 0, 1, 2, ..., n

Differentiate the log-likelihood with respect to θ and set it to zero

∂/∂θ ln pθ(x) = 0

86 / 87
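A symbolic check of this step (a sketch; the binomial coefficient is dropped because it does not depend on θ):

import sympy as sp

theta, x, n = sp.symbols('theta x n', positive=True)
log_lik = x * sp.log(theta) + (n - x) * sp.log(1 - theta)   # log p_theta(x) up to a constant
stationary = sp.solve(sp.Eq(sp.diff(log_lik, theta), 0), theta)
print(stationary)   # [x/n]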


Page 239: 02 Machine Learning - Introduction probability

Example

We get

x/θ − (n − x)/(1 − θ) = 0 =⇒ θ̂ = x/n

Now, we can regard X as a sum of independent variables

X = X1 + X2 + ... + Xn

where Xi is 1 with probability θ and 0 with probability 1 − θ.

We get finally (by the law of large numbers)

θ̂(X) = (∑_{i=1}^{n} Xi) / n ⇒ lim_{n→∞} θ̂(X) = E(Xi) = θ

87 / 87
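A quick simulation sketch of the last statement (the true parameter 0.3 is only an assumed value): the estimate x/n concentrates around θ as n grows.

import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                       # hypothetical true parameter
for n in (10, 100, 10_000, 1_000_000):
    x = rng.binomial(n, theta)    # one draw of X ~ Binomial(n, theta)
    print(n, x / n)               # the ML estimate x/n approaches 0.3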
