
Machine Learning
Lecture 01-1: Basics of Probability Theory

Nevin L. Zhang ([email protected])

Department of Computer Science and Engineering
The Hong Kong University of Science and Technology


Outline

1 Basic Concepts in Probability Theory

2 Interpretation of Probability

3 Univariate Probability Distributions

4 Multivariate Probability; Bayes' Theorem

5 Parameter Estimation


Basic Concepts in Probability Theory

Random Experiments

Probability is associated with a random experiment: a process with uncertain outcomes.

The random experiment is often kept implicit.

In machine learning, we often assume that data are generated by a hypothetical process (or a model), and the task is to determine the structure and parameters of the model from data.


Sample Space

Sample space (aka population) Ω: the set of possible outcomes of a random experiment.

Example: rolling two dice.

The elements of a sample space are called outcomes.


Events

Event: A subset of the sample space.

Example: The two results add up to 4, i.e., the event {(1, 3), (2, 2), (3, 1)}.


Probability Weight Function

A probability weight P(ω) is assigned to each outcome.

In machine learning, we often need to determine the probability weights, or related parameters, from data. This task is called parameter learning.


Probability measure

Probability P(E) of an event E: P(E) = ∑_{ω∈E} P(ω)

A probability measure is a mapping from the set of events to [0, 1],

P : 2^Ω → [0, 1],

that satisfies Kolmogorov's axioms:

1 P(Ω) = 1.
2 P(A) ≥ 0 for all A ⊆ Ω.
3 Additivity: P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅.

In a more advanced treatment of probability theory, we would start with the concept of a probability measure, instead of probability weights.


Random Variables

A random variable is a function over the sample space.

Example: X = sum of the two results. X((2, 5)) = 7; X((3, 1)) = 4.

Why is it random? Because the outcome of the underlying experiment is random.

Domain of a random variable: the set of all its possible values.

Ω_X = {2, 3, . . . , 12}


Random Variables and Events

A random variable X taking a specific value x is an event:

Ω_{X=x} = {ω ∈ Ω | X(ω) = x}

Ω_{X=4} = {(1, 3), (2, 2), (3, 1)}.


Probability Mass Function (Distribution)

Probability mass function P(X): Ω_X → [0, 1]

P(X = x) = P(Ω_{X=x})

P(X = 4) = P({(1, 3), (2, 2), (3, 1)}) = 3/36.

If X is continuous, we have a density function p(X) instead.
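To make the dice example concrete, here is a minimal Python sketch (not part of the original slides) that builds the sample space, treats X as a function over it, and recovers P(X = 4) = 3/36 by summing probability weights over the event Ω_{X=4}:

from itertools import product
from collections import defaultdict
from fractions import Fraction

# Sample space of rolling two fair dice: 36 equally weighted outcomes.
omega = list(product(range(1, 7), repeat=2))
weight = {w: Fraction(1, 36) for w in omega}        # P(w) = 1/36 for every outcome

# Random variable X: a function over the sample space (sum of the two results).
X = lambda w: w[0] + w[1]

# P(X = x) = sum of P(w) over the event {w in Omega : X(w) = x}.
pmf = defaultdict(Fraction)
for w in omega:
    pmf[X(w)] += weight[w]

print(pmf[4])              # 1/12, i.e. 3/36
print(sum(pmf.values()))   # 1, consistent with Kolmogorov's axioms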


Interpretation of Probability


Frequentist interpretation

Probabilities are long-term relative frequencies.

Example:

X is the result of a coin toss. Ω_X = {H, T}. P(X=H) = 1/2 means that the relative frequency of getting heads will almost surely approach 1/2 as the number of tosses goes to infinity.

Justified by the Law of Large Numbers:

X_i: result of the i-th toss; 1 for H, 0 for T.

Law of Large Numbers:

lim_{n→∞} (∑_{i=1}^n X_i) / n = 1/2   with probability 1.

The frequentist interpretation is meaningful only when the experiment can be repeated under the same conditions.
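As an illustration (a small simulation sketch, not from the slides), numpy can be used to watch the running frequency of heads approach 1/2:

import numpy as np

rng = np.random.default_rng(0)

# Simulate fair-coin tosses (1 = heads, 0 = tails) and report the relative
# frequency of heads for increasingly long runs.
for n in [100, 10_000, 1_000_000]:
    tosses = rng.integers(0, 2, size=n)
    print(n, tosses.mean())
# The frequencies drift toward 0.5 as n grows, as the Law of Large Numbers predicts.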


Bayesian interpretation

Probabilities are logically consistent degrees of belief.

Applicable when the experiment is not repeatable.

Depends on a person's state of knowledge.

Example: "the probability that the Suez Canal is longer than the Panama Canal".

This doesn't make sense under the frequentist interpretation.

Subjectivist: degree of belief based on one's state of knowledge.

Primary school student: 0.5
Me: 0.8
Geographer: 1 or 0

Arguments such as the Dutch book argument are used to explain why one's probability beliefs must satisfy Kolmogorov's axioms.


Interpretations of Probability

Nowadays both interpretations are accepted. In practice, subjective beliefs and statistical data complement each other.

We rely on subjective beliefs (prior probabilities) when data are scarce.
As more and more data become available, we rely less and less on subjective beliefs.
Often, we also use prior probabilities to impose some bias on the kind of results we want from a machine learning algorithm.

The subjectivist interpretation makes concepts such as conditional independence easy to understand.


Univariate Probability Distributions


Binomial and Bernoulli Distributions

Suppose we toss a coin n times. Each time, the probability of getting a head is θ.

Let X be the number of heads. Then X follows the binomial distribution, written as X ∼ Bin(n, θ):

Bin(X = k | n, θ) = (n choose k) θ^k (1 − θ)^{n−k}   if 0 ≤ k ≤ n,
                  = 0                                 if k < 0 or k > n.

If n = 1, then X follows the Bernoulli distribution, written as X ∼ Ber(θ):

Ber(X = x | θ) = θ       if x = 1,
               = 1 − θ   if x = 0.
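A quick numerical check with scipy.stats (an illustrative sketch, not part of the slides; the values of n, θ and k below are arbitrary):

from scipy.stats import binom, bernoulli

n, theta = 10, 0.3

# Binomial: probability of k = 4 heads in n = 10 tosses with P(head) = 0.3.
print(binom.pmf(4, n, theta))     # C(10, 4) * 0.3**4 * 0.7**6 ≈ 0.2001

# Bernoulli: the n = 1 special case.
print(bernoulli.pmf(1, theta))    # θ     = 0.3
print(bernoulli.pmf(0, theta))    # 1 − θ = 0.7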


Multinomial Distribution

Suppose we toss a K-sided die n times. Each time, the probability of getting result j is θ_j. Let θ = (θ_1, . . . , θ_K)ᵀ.

Let x = (x_1, . . . , x_K) be a random vector, where x_j is the number of times side j of the die occurs. Then x follows the multinomial distribution, written as x ∼ Multi(n, θ):

Multi(x | n, θ) = (n choose x_1, . . . , x_K) ∏_{j=1}^K θ_j^{x_j},

where

(n choose x_1, . . . , x_K) = n! / (x_1! · · · x_K!)

is the multinomial coefficient.
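An illustrative check with scipy.stats.multinomial (the die probabilities and counts below are arbitrary, not from the slides):

from math import factorial
from scipy.stats import multinomial

n = 10
theta = [0.2, 0.3, 0.5]                 # probabilities of the K = 3 sides
x = [2, 3, 5]                           # counts of each side over the n tosses

print(multinomial.pmf(x, n=n, p=theta))

# Same value computed directly from the formula above.
coef = factorial(n) // (factorial(2) * factorial(3) * factorial(5))
print(coef * 0.2**2 * 0.3**3 * 0.5**5)  # both ≈ 0.0851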


Categorical Distribution

In the previous slide, if n = 1, then x = (x_1, . . . , x_K) has one component equal to 1 and the others equal to 0. In other words, it is a one-hot vector.

In this case, x follows the categorical distribution, written as x ∼ Cat(θ):

Cat(x | θ) = ∏_{j=1}^K θ_j^{1(x_j=1)},

where 1(x_j = 1) is the indicator function, whose value is 1 when x_j = 1 and 0 otherwise.
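A small sketch (not from the slides) that evaluates this pmf directly and checks that it agrees with Multi(1, θ); the probabilities are arbitrary illustrative values:

import numpy as np
from scipy.stats import multinomial

def cat_pmf(x, theta):
    # prod_j theta_j ** 1(x_j == 1), for a one-hot vector x
    x = np.asarray(x)
    theta = np.asarray(theta)
    return float(np.prod(theta ** (x == 1).astype(int)))

theta = [0.2, 0.3, 0.5]
print(cat_pmf([0, 1, 0], theta))                 # 0.3: probability of category 2
print(multinomial.pmf([0, 1, 0], n=1, p=theta))  # Cat(θ) is Multi(1, θ): also 0.3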


Gaussian (Normal) Distribution

The most widely used distribution in statistics and machine learning is the Gaussian, or normal, distribution.

Its probability density is given by

N(x | µ, σ²) = (1 / √(2πσ²)) exp[ −(x − µ)² / (2σ²) ]

Here µ = E[X] is the mean (and mode), and σ² = var[X] is the variance.
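A minimal check (with arbitrary illustrative values, not from the slides) that the formula matches scipy's implementation:

import numpy as np
from scipy.stats import norm

def gauss_pdf(x, mu, sigma2):
    # N(x | mu, sigma^2) written out from the formula above
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 1.0, 4.0
x = 2.5
print(gauss_pdf(x, mu, sigma2))                    # ≈ 0.1506
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))  # same value from scipy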


Multivariate Probability


Joint probability mass function

Recall the probability mass function of a single random variable X:

P(X): Ω_X → [0, 1],   P(X = x) = P(Ω_{X=x}).

Suppose there are n random variables X_1, X_2, . . . , X_n. A joint probability mass function P(X_1, X_2, . . . , X_n) over those random variables is a function defined on the Cartesian product of their state spaces,

∏_{i=1}^n Ω_{X_i} → [0, 1],

with

P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = P(Ω_{X_1=x_1} ∩ Ω_{X_2=x_2} ∩ . . . ∩ Ω_{X_n=x_n}).


Joint probability mass function

Example:

Population: apartments in the Hong Kong rental market.

Random variables (of a randomly selected apartment):

Monthly Rent: low (≤ 1k), medium ((1k, 2k]), upper medium ((2k, 4k]), high (> 4k)
Type: public, private, others

Joint probability distribution P(Rent, Type):

                public   private   others
low               .17       .01      .02
medium            .44       .03      .01
upper medium      .09       .07      .01
high              .00       .14      .01


Multivariate Gaussian Distributions

For continuous variables, the most commonly used joint distribution is the multivariate Gaussian distribution N(µ, Σ):

N(x | µ, Σ) = (1 / √((2π)^D |Σ|)) exp[ −(x − µ)ᵀ Σ⁻¹ (x − µ) / 2 ]

D: dimensionality.
x: vector of D random variables, representing data.
µ: vector of means.
Σ: covariance matrix. |Σ| denotes the determinant of Σ.
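An illustrative evaluation of this density (the mean, covariance and query point below are arbitrary, not from the slides), checked against scipy:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

# Density written out from the formula above.
D = len(mu)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
pdf_manual = np.exp(-quad / 2) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

print(pdf_manual)
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # same value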


Multivariate Gaussian Distributions

A 2-D Gaussian distribution.

µ: center of contours

Σ: orientation and size of contours


Marginal probability

What is the probability of a randomly selected apartment being a public one? (Law of total probability)

P(Type=public) = P(Type=public, Rent=low) + P(Type=public, Rent=medium) + P(Type=public, Rent=upper medium) + P(Type=public, Rent=high) = .7

P(Type=private) = P(Type=private, Rent=low) + P(Type=private, Rent=medium) + P(Type=private, Rent=upper medium) + P(Type=private, Rent=high) = .25

                public   private   others   P(Rent)
low               .17       .01      .02       .20
medium            .44       .03      .01       .48
upper medium      .09       .07      .01       .17
high              .00       .14      .01       .15
P(Type)           .70       .25      .05

These are called marginal probabilities because they are written on the margins of the table.
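The same marginalization in a short numpy sketch (not from the slides), using the joint table above:

import numpy as np

# Joint table P(Rent, Type); rows: low, medium, upper medium, high;
# columns: public, private, others.
joint = np.array([[0.17, 0.01, 0.02],
                  [0.44, 0.03, 0.01],
                  [0.09, 0.07, 0.01],
                  [0.00, 0.14, 0.01]])

p_rent = joint.sum(axis=1)   # marginal over Type -> [0.20, 0.48, 0.17, 0.15]
p_type = joint.sum(axis=0)   # marginal over Rent -> [0.70, 0.25, 0.05]
print(p_rent, p_type, joint.sum())   # the full table sums to 1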


Conditional probability

For events A and B:

P(A|B) = P(A, B) / P(B)   ( = P(A ∩ B) / P(B) )

Meaning:
P(A): my probability of A (without any knowledge about B).
P(A|B): my probability of event A assuming that I know event B is true.

What is the probability of a randomly selected private apartment having "low" rent?

P(Rent=low | Type=private) = P(Rent=low, Type=private) / P(Type=private) = .01/.25 = .04

In contrast:

P(Rent=low) = 0.2.
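Continuing the numpy sketch from the marginal-probability slide (same joint array, with the row/column indices as assumed there):

import numpy as np

joint = np.array([[0.17, 0.01, 0.02],    # rows: low, medium, upper medium, high
                  [0.44, 0.03, 0.01],    # columns: public, private, others
                  [0.09, 0.07, 0.01],
                  [0.00, 0.14, 0.01]])
LOW, PRIVATE = 0, 1

# P(Rent=low | Type=private) = P(Rent=low, Type=private) / P(Type=private)
p_private = joint[:, PRIVATE].sum()      # 0.25
print(joint[LOW, PRIVATE] / p_private)   # 0.01 / 0.25 = 0.04
print(joint[LOW, :].sum())               # unconditional P(Rent=low) = 0.20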


Marginal independence

Two random variables X and Y are marginally independent, written X ⊥ Y, if

for any state x of X and any state y of Y,

P(X=x | Y=y) = P(X=x), whenever P(Y=y) ≠ 0.

Meaning: Learning the value of Y does not give me any information about X, and vice versa. Y contains no information about X, and vice versa.

Equivalent definition:

P(X=x, Y=y) = P(X=x)P(Y=y)

Shorthand for the equations:

P(X|Y) = P(X),   P(X, Y) = P(X)P(Y).


Marginal independence

Examples:

X: result of tossing a fair coin for the first time; Y: result of the second toss of the same coin.
X: result of the US election; Y: your grades in this course.

Counterexample: X: oral presentation grade; Y: project report grade.


Conditional independence

Two random variables X and Y are conditionally independent given a third variable Z, written X ⊥ Y | Z, if

P(X=x | Y=y, Z=z) = P(X=x | Z=z) whenever P(Y=y, Z=z) ≠ 0

Meaning:

If I know the state of Z already, then learning the state of Y does not give me additional information about X.
Y might contain some information about X.
However, all the information about X contained in Y is also contained in Z.

Shorthand for the equation:

P(X|Y, Z) = P(X|Z)

Equivalent definition:

P(X, Y|Z) = P(X|Z)P(Y|Z)


Example of Conditional Independence

There is a bag of 100 coins. 10 coins were made by a malfunctioning machine and are biased toward heads: tossing such a coin results in heads 80% of the time. The other coins are fair.

Randomly draw a coin from the bag and toss it a few times.

X_i: result of the i-th toss; Y: whether the coin was produced by the malfunctioning machine.

The X_i's are not marginally independent of each other:

If I get 9 heads in the first 10 tosses, then the coin is probably a biased coin. Hence the next toss will be more likely to result in a head than a tail.
Learning the value of X_i gives me some information about whether the coin is biased, which in turn gives me some information about X_j.


Example of Conditional Independence

However, they are conditionally independent given Y :

If the coin is not biased, the probability of getting a head in one toss is 1/2, regardless of the results of other tosses.
If the coin is biased, the probability of getting a head in one toss is 80%, regardless of the results of other tosses.
If I already know whether the coin is biased or not, learning the value of X_i does not give me additional information about X_j.

Here is how the variables are related pictorially. We will return to this picture later.
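A quick Monte Carlo sketch (not from the slides) that makes the contrast concrete: marginally, X_1 tells us something about X_2, but given Y it does not:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Draw a coin (10% chance it is biased), then toss it twice.
biased = rng.random(n) < 0.1
p_head = np.where(biased, 0.8, 0.5)
x1 = rng.random(n) < p_head
x2 = rng.random(n) < p_head

# Marginally, X1 carries information about X2:
print(x2[x1].mean(), "vs", x2.mean())                      # ≈ 0.545 vs ≈ 0.53

# Given Y, it does not:
print(x2[x1 & biased].mean(), "vs", x2[biased].mean())     # both ≈ 0.8
print(x2[x1 & ~biased].mean(), "vs", x2[~biased].mean())   # both ≈ 0.5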


Bayes' Theorem

Prior, posterior, and likelihood

Three important concepts in Bayesian inference.

With respect to a piece of evidence E:

Prior probability P(H): belief about a hypothesis before observing the evidence.

Example: Suppose 10% of people suffer from Hepatitis B. A doctor's prior probability that a new patient suffers from Hepatitis B is 0.1.

Posterior probability P(H|E): belief about a hypothesis after obtaining the evidence.

If the doctor finds that the eyes of the patient are yellow, his belief that the patient suffers from Hepatitis B would be > 0.1.


Prior, posterior, and likelihood

Suppose a patient is observed to have yellow eyes (E).

Consider two possible explanations:

1 The patient has Hepatitis B (H1).
2 The patient does not have Hepatitis B (H2).

Obviously, H1 is a better explanation because P(E|H1) > P(E|H2). To state it another way, we say that H1 is more likely than H2 given E.

In general, the likelihood of a hypothesis H given evidence E is a measure of how well H explains E. Mathematically, it is

L(H|E) = P(E|H)

In machine learning, we often talk about the likelihood of a model M given data D. It is a measure of how well the model M explains the data D. Mathematically, it is

L(M|D) = P(D|M)


Bayes’ Theorem/Bayes Rule

Bayes' Theorem relates the prior probability, the likelihood, and the posterior probability:

P(H|E) = P(H)P(E|H) / P(E) ∝ P(H)L(H|E),

where P(E) is a normalization constant ensuring ∑_{h∈Ω_H} P(H = h|E) = 1.

That is: posterior ∝ prior × likelihood.
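A small sketch of this update for the Hepatitis B example. The prior 0.1 comes from the slide; the likelihoods P(yellow eyes | H) are made-up numbers purely for illustration:

import numpy as np

def posterior(prior, likelihood):
    # Bayes' rule: posterior ∝ prior × likelihood, normalized to sum to 1.
    unnorm = np.asarray(prior) * np.asarray(likelihood)
    return unnorm / unnorm.sum()

prior = [0.1, 0.9]                    # [P(H1), P(H2)] from the slide
likelihood = [0.8, 0.05]              # hypothetical P(E | H1), P(E | H2)
print(posterior(prior, likelihood))   # ≈ [0.64, 0.36]: belief in H1 rises above 0.1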


Parameter Estimation


A Simple Problem

Let X be the result of tossing a thumbtack, with Ω_X = {H, T}.

Data instances: D_1 = H, D_2 = T, D_3 = H, . . . , D_m = H

Data set: D = {D_1, D_2, D_3, . . . , D_m}

Task: estimate the parameter θ = P(X=H).


Likelihood

Data: D = {H, T, H, T, T, H, T}

As possible values of θ, which of the following is the most likely? Why?

θ = 0
θ = 0.01
θ = 0.5

θ = 0 contradicts the data because P(D|θ = 0) = 0. It cannot explain the data at all.

θ = 0.01 almost contradicts the data; it does not explain the data well. However, it is more consistent with the data than θ = 0, because P(D|θ = 0.01) > P(D|θ = 0).

θ = 0.5 is more consistent with the data than θ = 0.01, because P(D|θ = 0.5) > P(D|θ = 0.01). It explains the data the best, and is hence the most likely.
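The comparison in a few lines of Python (a sketch, not from the slides), using the i.i.d. likelihood θ^{m_h}(1 − θ)^{m_t} derived on the next slides:

# D = {H, T, H, T, T, H, T}: 3 heads, 4 tails.
mh, mt = 3, 4

def likelihood(theta):
    # P(D | theta) = theta**mh * (1 - theta)**mt for i.i.d. tosses
    return theta ** mh * (1 - theta) ** mt

for theta in (0.0, 0.01, 0.5):
    print(theta, likelihood(theta))
# 0.0  -> 0.0        (contradicts the data)
# 0.01 -> ≈ 9.6e-07  (barely explains it)
# 0.5  -> ≈ 7.8e-03  (explains it best of the three)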


Maximum Likelihood Estimation

In general, the larger P(D|θ) is, the more likely the value θ is.

Likelihood of a parameter θ given a data set:

L(θ|D) = P(D|θ)

The maximum likelihood estimate (MLE) θ* is

θ* = arg max_θ L(θ|D).

The MLE best explains, or best fits, the data.


i.i.d. and Likelihood

Assume the data instances D_1, . . . , D_m are independent given θ:

P(D_1, . . . , D_m|θ) = ∏_{i=1}^m P(D_i|θ)

Assume the data instances are identically distributed:

P(D_i = H) = θ,  P(D_i = T) = 1 − θ   for all i

(Note: i.i.d. means independent and identically distributed.)

Then

L(θ|D) = P(D|θ) = P(D_1, . . . , D_m|θ) = ∏_{i=1}^m P(D_i|θ) = θ^{m_h}(1 − θ)^{m_t}   (1)

where m_h is the number of heads and m_t is the number of tails. This is the binomial likelihood.


Example of Likelihood Function

Example: D = {D_1 = H, D_2 = T, D_3 = H, D_4 = H, D_5 = T}

L(θ|D) = P(D|θ)
       = P(D_1 = H|θ)P(D_2 = T|θ)P(D_3 = H|θ)P(D_4 = H|θ)P(D_5 = T|θ)
       = θ(1 − θ)θθ(1 − θ)
       = θ^3(1 − θ)^2.


Sufficient Statistic

A sufficient statistic is a function s(D) of the data that summarizes the relevant information for computing the likelihood. That is,

s(D) = s(D′) ⇒ L(θ|D) = L(θ|D′)

Sufficient statistics tell us all there is to know about the data.

Since L(θ|D) = θ^{m_h}(1 − θ)^{m_t}, the pair (m_h, m_t) is a sufficient statistic.


Loglikelihood

Loglikelihood:

ℓ(θ|D) = log L(θ|D) = log [θ^{m_h}(1 − θ)^{m_t}] = m_h log θ + m_t log(1 − θ)

Maximizing the likelihood is the same as maximizing the loglikelihood. The latter is easier.

Taking the derivative dℓ(θ|D)/dθ and setting it to zero, we get

θ* = m_h / (m_h + m_t) = m_h / m

The MLE is intuitive.

It also has nice properties:

E.g., consistency: θ* approaches the true value of θ with probability 1 as m goes to infinity.
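A quick numerical confirmation (a sketch, not from the slides) that maximizing the loglikelihood over a grid recovers the closed form m_h/m:

import numpy as np

mh, mt = 3, 4                         # sufficient statistic of D = {H,T,H,T,T,H,T}

def loglik(theta):
    return mh * np.log(theta) + mt * np.log(1 - theta)

grid = np.linspace(0.001, 0.999, 999)
print(grid[np.argmax(loglik(grid))])  # ≈ 0.429, the grid point nearest the MLE
print(mh / (mh + mt))                 # closed form: 3/7 ≈ 0.4286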


Drawback of MLE

Thumbtack tossing:

(m_h, m_t) = (3, 7). MLE: θ = 0.3.
Reasonable. The data suggest that the thumbtack is biased toward tails.

Coin tossing:

Case 1: (m_h, m_t) = (3, 7). MLE: θ = 0.3.

Not reasonable.
Our experience (prior) strongly suggests that coins are fair, hence θ = 1/2.
The size of the data set is too small to convince us that this particular coin is biased.
The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.

Case 2: (m_h, m_t) = (30,000, 70,000). MLE: θ = 0.3.

Reasonable. The data suggest that the coin is biased after all, overshadowing our prior.

MLE does not differentiate between those two cases. It does not take prior information into account.


Two Views on Parameter Estimation

MLE:

Assumes that θ is an unknown but fixed parameter.

Estimates it by θ*, the value that maximizes the likelihood function.

Makes predictions based on the estimate: P(D_{m+1} = H|D) = θ*

Bayesian Estimation:

Treats θ as a random variable.

Assumes a prior probability distribution over θ: p(θ)

Uses the data to get the posterior probability of θ: p(θ|D)


Two Views on Parameter Estimation

Bayesian Estimation:

Predicting D_{m+1}:

P(D_{m+1} = H|D) = ∫ P(D_{m+1} = H, θ|D) dθ
                 = ∫ P(D_{m+1} = H|θ, D) p(θ|D) dθ
                 = ∫ P(D_{m+1} = H|θ) p(θ|D) dθ
                 = ∫ θ p(θ|D) dθ.

Full Bayesian: take the expectation over θ.

Bayesian MAP:

P(D_{m+1} = H|D) = θ*,  where θ* = arg max_θ p(θ|D)


Calculating Bayesian Estimation

Posterior distribution:

p(θ|D) ∝ p(θ)L(θ|D) = θ^{m_h}(1 − θ)^{m_t} p(θ),

where the second step follows from (1).

To facilitate the analysis, assume the prior is a Beta distribution B(α_h, α_t):

p(θ) ∝ θ^{α_h−1}(1 − θ)^{α_t−1}

Then

p(θ|D) ∝ θ^{m_h+α_h−1}(1 − θ)^{m_t+α_t−1}   (2)
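The conjugate update in code (an illustrative sketch, not from the slides; the prior counts α_h = α_t = 2 and the data counts are arbitrary):

from scipy.stats import beta

alpha_h, alpha_t = 2, 2                   # prior Beta(α_h, α_t)
mh, mt = 3, 7                             # observed heads and tails

prior = beta(alpha_h, alpha_t)
post = beta(alpha_h + mh, alpha_t + mt)   # conjugacy: the posterior is again a Beta

print(prior.mean(), post.mean())          # 0.5 -> (3 + 2)/(10 + 4) ≈ 0.357
print(post.pdf(0.3))                      # the posterior density is available in closed form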


Beta Distribution

The normalization constant for the Beta distribution B(α_h, α_t) is

Γ(α_t + α_h) / (Γ(α_t)Γ(α_h)),

where Γ(·) is the Gamma function. For any integer α, Γ(α) = (α − 1)!. It is also defined for non-integers.

Density function of the prior Beta distribution B(α_h, α_t):

p(θ) = [Γ(α_t + α_h) / (Γ(α_t)Γ(α_h))] θ^{α_h−1}(1 − θ)^{α_t−1}   (3)

The hyperparameters α_h and α_t can be thought of as "imaginary" counts from our prior experience.

Their sum α = α_h + α_t is called the equivalent sample size.

The larger the equivalent sample size, the more confident we are in our prior.


Conjugate Families

Binomial likelihood: θ^{m_h}(1 − θ)^{m_t}

Beta prior: θ^{α_h−1}(1 − θ)^{α_t−1}

Beta posterior: θ^{m_h+α_h−1}(1 − θ)^{m_t+α_t−1}.

Beta distributions are hence called a conjugate family for the binomial likelihood.

Conjugate families give a closed form for the posterior distribution of the parameters and a closed-form solution for prediction.


Calculating Prediction

We have

P(D_{m+1} = H|D) = ∫ θ p(θ|D) dθ
                 = c ∫ θ θ^{m_h+α_h−1}(1 − θ)^{m_t+α_t−1} dθ
                 = (m_h + α_h) / (m + α),

where c is the normalization constant, m = m_h + m_t, and α = α_h + α_t.

Consequently,

P(D_{m+1} = T|D) = (m_t + α_t) / (m + α)

After taking the data D into consideration, our updated belief in X=T is (m_t + α_t)/(m + α).
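Numerically, the prediction is just the mean of the Beta posterior (a sketch, not from the slides, with the same illustrative counts as before):

from scipy.stats import beta

alpha_h, alpha_t = 2, 2                  # illustrative prior counts
mh, mt = 3, 7                            # observed heads and tails

post = beta(alpha_h + mh, alpha_t + mt)

# ∫ θ p(θ|D) dθ is the mean of the Beta posterior.
print(post.mean())                                     # ≈ 0.357
print((mh + alpha_h) / (mh + mt + alpha_h + alpha_t))  # (m_h + α_h)/(m + α): same value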


MLE and Bayesian estimation

As m goes to infinity, P(D_{m+1} = H|D) approaches the MLE m_h/(m_h + m_t), which approaches the true value of θ with probability 1.

Coin tossing example revisited:

Suppose α_h = α_t = 100. Equivalent sample size: 200.

In case 1,

P(D_{m+1} = H|D) = (3 + 100) / (10 + 100 + 100) ≈ 0.5

Our prior prevails.

In case 2,

P(D_{m+1} = H|D) = (30,000 + 100) / (100,000 + 100 + 100) ≈ 0.3

The data prevail.
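The two cases, verified in a couple of lines (a sketch, not from the slides):

alpha_h = alpha_t = 100                                      # equivalent sample size 200

# Case 1: (m_h, m_t) = (3, 7) -> the prior prevails.
print((3 + alpha_h) / (10 + alpha_h + alpha_t))              # ≈ 0.49

# Case 2: (m_h, m_t) = (30,000, 70,000) -> the data prevail.
print((30_000 + alpha_h) / (100_000 + alpha_h + alpha_t))    # ≈ 0.30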


MLE vs Bayesian Estimation

Much of Machine Learning is about parameter estimation.

In all cases, both MLE and Bayesian estimation can be used, although the latter is mathematically harder.

In this course, we will focus on MLE.
