© Stanley Chan 2020. All Rights Reserved.

ECE595 / STAT598: Machine Learning I
Lecture 12: Bayesian Parameter Estimation
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering, Purdue University
Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate the model, then define the classifier.
Discriminative approach: Directly define the classifier.
Outline

Generative Approaches
Lecture 9: Bayesian Decision Rules
Lecture 10: Evaluating Performance
Lecture 11: Parameter Estimation
Lecture 12: Bayesian Prior
Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
Basic Principles
  Posterior
  1D Illustration
  Interpretations
Choosing Priors
  Prior for Mean
  Prior for Variance
  Conjugate Prior
MLE and MAP

There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): $\theta$ is deterministic.
Maximum-a-posteriori estimation (MAP): $\theta$ is random and has a prior distribution.
From MLE to MAP

In MLE, the parameter $\theta$ is deterministic. What if we assume $\theta$ has a distribution? This makes $\theta$ probabilistic. So we treat $\Theta$ as a random variable, and $\theta$ as a state (realization) of $\Theta$. The distribution of $\Theta$ is

$$p_\Theta(\theta).$$

$p_\Theta(\theta)$ is the distribution of the parameter $\Theta$. $\Theta$ has its own mean and its own variance.
Maximum-a-Posteriori

By Bayes' Theorem again:

$$p_{\Theta|X}(\theta \mid x_n) = \frac{p_{X|\Theta}(x_n \mid \theta)\, p_\Theta(\theta)}{p_X(x_n)}.$$

To maximize the posterior distribution:

$$
\begin{aligned}
\hat{\theta} &= \arg\max_\theta \; p_{\Theta|X}(\theta \mid \mathcal{D}) \\
&= \arg\max_\theta \; \left[\prod_{n=1}^N p_{X|\Theta}(x_n \mid \theta)\right] p_\Theta(\theta) \\
&= \arg\min_\theta \; \left\{-\sum_{n=1}^N \log p_{X|\Theta}(x_n \mid \theta) - \log p_\Theta(\theta)\right\}
\end{aligned}
$$

(The evidence $p_X(\mathcal{D})$ does not depend on $\theta$, so it drops out of the maximization.)
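The argmin form above can be sanity-checked numerically on a toy Gaussian model. The sketch below evaluates the negative log-posterior on a grid; the data set and the values of $\sigma$, $\theta_0$, $\sigma_0$ are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, theta0, sigma0 = 1.0, 0.0, 2.0     # assumed likelihood/prior parameters
x = rng.normal(0.8, sigma, size=20)       # toy data set D

thetas = np.linspace(-3.0, 3.0, 60001)
# negative log-posterior up to an additive constant:
# sum_n (x_n - theta)^2 / (2 sigma^2) + (theta - theta0)^2 / (2 sigma0^2)
neg_log_post = (np.sum((x[:, None] - thetas[None, :]) ** 2, axis=0) / (2 * sigma**2)
                + (thetas - theta0) ** 2 / (2 * sigma0**2))
theta_map = thetas[np.argmin(neg_log_post)]
```

For a Gaussian likelihood with a Gaussian prior, this grid minimizer should match the closed-form solution derived on the later slides.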
The Role of $p_\Theta(\theta)$

Let's look at the MAP estimate:

$$\hat{\theta} = \arg\min_\theta \; \left\{-\sum_{n=1}^N \log p_{X|\Theta}(x_n \mid \theta) - \log p_\Theta(\theta)\right\}$$

Special case: when

$$p_\Theta(\theta) = \delta(\theta - \theta_0),$$

the delta function gives

$$\log p_\Theta(\theta) = \begin{cases} -\infty, & \text{if } \theta \neq \theta_0, \\ 0, & \text{if } \theta = \theta_0. \end{cases}$$

The objective is therefore $+\infty$ everywhere except at $\theta_0$, so

$$\hat{\theta} = \arg\min_\theta \begin{cases} +\infty, & \text{if } \theta \neq \theta_0, \\ -\sum_{n=1}^N \log p_{X|\Theta}(x_n \mid \theta_0), & \text{if } \theta = \theta_0 \end{cases} \;=\; \theta_0.$$

No uncertainty: we are absolutely sure that $\theta = \theta_0$.
Illustration: 1D Example

Suppose that:

$$p_{X|\Theta}(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\theta)^2}{2\sigma^2}\right\}, \qquad p_\Theta(\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\}.$$

When $N = 1$, the MAP problem is simply

$$
\begin{aligned}
\hat{\theta} &= \arg\max_\theta \; p_{X|\Theta}(x \mid \theta)\, p_\Theta(\theta) \\
&= \arg\max_\theta \; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\theta)^2}{2\sigma^2}\right\} \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\} \\
&= \arg\max_\theta \; \left\{-\frac{(x-\theta)^2}{2\sigma^2} - \frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\}
\end{aligned}
$$
Illustration: 1D Example

Taking the derivative and setting it to zero:

$$
\begin{aligned}
\frac{d}{d\theta}\left[-\frac{(x-\theta)^2}{2\sigma^2} - \frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right] &= 0 \\
\Rightarrow \quad \frac{x-\theta}{\sigma^2} - \frac{\theta-\theta_0}{\sigma_0^2} &= 0 \\
\Rightarrow \quad \sigma_0^2(x-\theta) &= \sigma^2(\theta-\theta_0) \\
\Rightarrow \quad \sigma_0^2 x + \sigma^2\theta_0 &= (\sigma_0^2 + \sigma^2)\,\theta
\end{aligned}
$$

Therefore, the solution is

$$\hat{\theta} = \frac{\sigma_0^2 x + \sigma^2 \theta_0}{\sigma_0^2 + \sigma^2}.$$
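The closed form can be verified against a brute-force search of the log-posterior; the numerical values below are arbitrary assumptions:

```python
import numpy as np

sigma, sigma0 = 0.5, 1.5        # assumed noise and prior standard deviations
theta0, x = 2.0, -1.0           # assumed prior mean and single observation

# closed-form MAP estimate for N = 1
theta_map = (sigma0**2 * x + sigma**2 * theta0) / (sigma0**2 + sigma**2)

# brute-force check on a fine grid of the log-posterior
thetas = np.linspace(-5.0, 5.0, 200001)
log_post = -(x - thetas) ** 2 / (2 * sigma**2) - (thetas - theta0) ** 2 / (2 * sigma0**2)
assert abs(thetas[np.argmax(log_post)] - theta_map) < 1e-3
```

The estimate lands between the observation $x$ and the prior mean $\theta_0$, pulled toward whichever side has the smaller variance.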
Interpreting the Result

Let us interpret the result

$$\hat{\theta} = \frac{\sigma_0^2 x + \sigma^2 \theta_0}{\sigma_0^2 + \sigma^2}.$$

Does it make sense?

If $\sigma_0 = 0$, then $\hat{\theta} = \theta_0$. This means: no uncertainty; we are absolutely sure that $\theta = \theta_0$. This corresponds to the prior $p_\Theta(\theta) = \delta(\theta - \theta_0)$.

The other extreme: if $\sigma_0 = \infty$, then $\hat{\theta} = x$. This means: I don't trust my prior at all; use the data. This corresponds to the flat prior $p_\Theta(\theta) = \frac{1}{|\Omega|}$ for all $\theta \in \Omega$.

Therefore, the MAP solution gives you a trade-off between data and prior.
When N = 2

When $N = 2$, the MAP problem is

$$
\begin{aligned}
\hat{\theta} &= \arg\max_\theta \; \left(\prod_{n=1}^2 p_{X|\Theta}(x_n \mid \theta)\right) p_\Theta(\theta) \\
&= \arg\max_\theta \; \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^2 \exp\left\{-\frac{(x_1-\theta)^2 + (x_2-\theta)^2}{2\sigma^2}\right\} \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\} \\
&= \arg\min_\theta \; -\log(\,\cdot\,) \\
&= \arg\min_\theta \; \left\{\frac{(x_1-\theta)^2}{2\sigma^2} + \frac{(x_2-\theta)^2}{2\sigma^2} + \frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\}
\end{aligned}
$$
When N = 2

Taking the derivative and setting it to zero:

$$\frac{d}{d\theta}\left[\frac{(x_1-\theta)^2}{2\sigma^2} + \frac{(x_2-\theta)^2}{2\sigma^2} + \frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right] = -\frac{x_1-\theta}{\sigma^2} - \frac{x_2-\theta}{\sigma^2} + \frac{\theta-\theta_0}{\sigma_0^2} = 0.$$

Solving for $\theta$ yields

$$\hat{\theta} = \frac{(x_1+x_2)\,\sigma_0^2 + \theta_0\,\sigma^2}{2\sigma_0^2 + \sigma^2}.$$

If $\sigma_0 = 0$ (certain prior), then $\hat{\theta} = \theta_0$.
If $\sigma_0 = \infty$ (useless prior), then $\hat{\theta} = \frac{x_1+x_2}{2}$.
When N is Arbitrary

For general $N$:

$$
\begin{aligned}
\hat{\theta} &= \arg\max_\theta \; \left[\prod_{n=1}^N p_{X|\Theta}(x_n \mid \theta)\right] p_\Theta(\theta) \\
&= \arg\max_\theta \; \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{n=1}^N \frac{(x_n-\theta)^2}{2\sigma^2}\right\} \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\} \\
&= \arg\min_\theta \; \left\{\sum_{n=1}^N \frac{(x_n-\theta)^2}{2\sigma^2} + \frac{(\theta-\theta_0)^2}{2\sigma_0^2}\right\} \\
&= \frac{\sigma_0^2 \sum_{n=1}^N x_n + \theta_0\,\sigma^2}{N\sigma_0^2 + \sigma^2}.
\end{aligned}
$$

What does it mean?
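A small helper makes the limiting behavior of this formula easy to probe. This is a sketch; the function name `map_mean` and the sample values are my own assumptions:

```python
import numpy as np

def map_mean(x, sigma, theta0, sigma0):
    """Closed-form MAP estimate of a Gaussian mean under a Gaussian prior."""
    n = len(x)
    return (sigma0**2 * np.sum(x) + theta0 * sigma**2) / (n * sigma0**2 + sigma**2)

x = np.array([1.0, 2.0, 3.0])
nearly_certain = map_mean(x, sigma=1.0, theta0=10.0, sigma0=1e-9)  # ~ theta0
nearly_flat = map_mean(x, sigma=1.0, theta0=10.0, sigma0=1e9)      # ~ sample mean
```

As with $N = 1$ and $N = 2$: $\sigma_0 \to 0$ recovers the prior mean, and $\sigma_0 \to \infty$ recovers the sample mean.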
Interpreting the MAP Solution: N → ∞

Let's do some algebra:

$$\hat{\theta} = \frac{\left(\sum_{n=1}^N x_n\right)\sigma_0^2 + \theta_0\,\sigma^2}{N\sigma_0^2 + \sigma^2} = \frac{\left(\frac{1}{N}\sum_{n=1}^N x_n\right)\sigma_0^2 + \frac{\sigma^2}{N}\,\theta_0}{\sigma_0^2 + \frac{\sigma^2}{N}}.$$

Fix $\sigma_0$ and $\sigma$. As $N \to \infty$, the $\sigma^2/N$ terms vanish, so

$$\hat{\theta} \;\longrightarrow\; \frac{1}{N}\sum_{n=1}^N x_n,$$

the sample mean. This is the maximum-likelihood estimate.

When I have a lot of samples, the prior does not really matter.
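To see the prior's influence fade, one can track the gap between the MAP estimate and the sample mean as $N$ grows. A sketch with a deliberately wrong prior mean (all numerical values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, theta0, sigma0 = 1.0, 5.0, 1.0    # prior mean 5.0 is deliberately far off

gaps = []
for n in (1, 10, 100, 10000):
    x = rng.normal(0.8, sigma, size=n)
    theta_map = (sigma0**2 * x.sum() + theta0 * sigma**2) / (n * sigma0**2 + sigma**2)
    gaps.append(abs(theta_map - x.mean()))   # distance from the MLE
# analytically, the gap is sigma^2 |theta0 - mean(x)| / (n sigma0^2 + sigma^2),
# so it shrinks like 1/n
```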
Interpreting the MAP Solution: N → 0

$$\hat{\theta} = \frac{\left(\frac{1}{N}\sum_{n=1}^N x_n\right)\sigma_0^2 + \frac{\sigma^2}{N}\,\theta_0}{\sigma_0^2 + \frac{\sigma^2}{N}}$$

Fix $\sigma_0$ and $\sigma$. As $N \to 0$, the $\sigma^2/N$ terms dominate, and

$$\hat{\theta} \;\longrightarrow\; \theta_0.$$

This is just the prior mean.

When I have very few samples, I should rely on the prior.
If the prior is good, then I can do well even if I have very few samples.
Maximum likelihood does not have the same luxury!
What Happens to the Posterior?

Consider

$$p(\mathcal{D} \mid \mu) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{n=1}^N \frac{(x_n-\mu)^2}{2\sigma^2}\right\}, \qquad p(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right\}.$$

We can show that the posterior is

$$p(\mu \mid \mathcal{D}) = \frac{1}{\sqrt{2\pi\sigma_N^2}} \exp\left\{-\frac{(\mu-\mu_N)^2}{2\sigma_N^2}\right\},$$

where

$$\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\mu_{\text{ML}}, \qquad \sigma_N^2 = \frac{\sigma^2\sigma_0^2}{\sigma^2 + N\sigma_0^2}.$$
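The posterior update is only a few lines of code. A sketch, where the function name and the test values are my own assumptions:

```python
import numpy as np

def posterior_params(x, sigma, mu0, sigma0):
    """Parameters of the Gaussian posterior p(mu | D) = N(mu_N, sigma_N^2)."""
    n = len(x)
    mu_ml = np.mean(x)
    denom = n * sigma0**2 + sigma**2
    mu_n = (sigma**2 / denom) * mu0 + (n * sigma0**2 / denom) * mu_ml
    var_n = (sigma**2 * sigma0**2) / denom
    return mu_n, var_n
```

Note that $\sigma_N^2$ decreases monotonically in $N$: more data means a tighter posterior.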
What Happens to the Posterior?

The $x_n$ are generated from $\mu = 0.8$ and $\sigma^2 = 0.1$. As $N$ increases, the posterior concentrates around the true parameter.
Prior for µ

So far we have considered

$$p(\mathcal{D} \mid \mu) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{n=1}^N \frac{(x_n-\mu)^2}{2\sigma^2}\right\}, \qquad p(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right\}.$$

Unknown $\mu$, and known $\sigma^2$.
The likelihood is Gaussian (by problem setup).
The prior for $\mu$ is Gaussian (by our choice).
Good, because the posterior remains Gaussian.

What happens if $\sigma^2$ is unknown but $\mu$ is known?
Prior for σ²

Let us define the precision $\lambda = \frac{1}{\sigma^2}$.

The likelihood is

$$
\begin{aligned}
p(\mathcal{D} \mid \lambda) &= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{n=1}^N \frac{(x_n-\mu)^2}{2\sigma^2}\right\} \\
&= \frac{\lambda^{N/2}}{(2\pi)^{N/2}} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N (x_n-\mu)^2\right\}.
\end{aligned}
$$

We want to choose $p(\lambda)$ in a similar form:

$$p(\lambda) = A\,\lambda^B \exp\{-C\lambda\},$$

so that the posterior $p(\lambda \mid \mathcal{D})$ is easy to compute.
Prior for σ²

We want to choose $p(\lambda)$ in the form $p(\lambda) = A\,\lambda^B \exp\{-C\lambda\}$. The candidate is

$$p(\lambda) = \frac{b^a}{\Gamma(a)}\,\lambda^{a-1} \exp(-b\lambda).$$

This distribution is called the Gamma distribution, $\text{Gam}(\lambda \mid a, b)$. We can show that

$$\mathbb{E}[\lambda] = \frac{a}{b}, \qquad \text{Var}[\lambda] = \frac{a}{b^2}.$$
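Both moments are easy to confirm by Monte Carlo. Note that NumPy's gamma sampler is parameterized by scale $= 1/b$; the shape/rate values below are arbitrary assumptions:

```python
import numpy as np

a, b = 5.0, 6.0                            # assumed shape a and rate b
rng = np.random.default_rng(0)
lam = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)

mean_est, var_est = lam.mean(), lam.var()  # should be near a/b and a/b^2
```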
Prior for σ²

If we consider this pair of likelihood and prior,

$$p(\mathcal{D} \mid \lambda) = \frac{\lambda^{N/2}}{(2\pi)^{N/2}} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N (x_n-\mu)^2\right\}, \qquad p(\lambda) = \frac{b_0^{a_0}}{\Gamma(a_0)}\,\lambda^{a_0-1} \exp(-b_0\lambda),$$

then the posterior is

$$p(\lambda \mid \mathcal{D}) \propto \lambda^{(a_0+N/2)-1} \exp\left\{-\left(b_0 + \frac{1}{2}\sum_{n=1}^N (x_n-\mu)^2\right)\lambda\right\}.$$

This is just another Gamma distribution. You can now do estimation on this Gamma by finding the $\lambda$ which maximizes the posterior. Details: see Appendix.
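The conjugate update itself is one line per parameter. A sketch, with the function name and test values being my own assumptions:

```python
import numpy as np

def gamma_posterior(x, mu, a0, b0):
    """Update a Gam(a0, b0) prior on the precision, given data x with known mean mu."""
    a_n = a0 + len(x) / 2
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

a_n, b_n = gamma_posterior(np.array([1.0, 3.0]), mu=2.0, a0=2.0, b0=1.0)
```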
Prior for Both µ and σ²

Again, let $\lambda = \frac{1}{\sigma^2}$.

The likelihood is

$$p(\mathcal{D} \mid \mu, \lambda) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{n=1}^N \frac{(x_n-\mu)^2}{2\sigma^2}\right\} \propto \left[\lambda^{1/2} \exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^N \exp\left\{\lambda\mu\sum_{n=1}^N x_n - \frac{\lambda}{2}\sum_{n=1}^N x_n^2\right\}.$$

A candidate for the prior is

$$p(\mu, \lambda) \propto \left[\lambda^{1/2} \exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^\beta \exp\{c\lambda\mu - d\lambda\} = \underbrace{\exp\left\{-\frac{\beta\lambda}{2}\left(\mu - \frac{c}{\beta}\right)^2\right\}}_{\mathcal{N}(\mu \,\mid\, \mu_0,\, \sigma_0^2)} \cdot \underbrace{\lambda^{\beta/2} \exp\left\{-\left(d - \frac{c^2}{2\beta}\right)\lambda\right\}}_{\text{Gam}(\lambda \,\mid\, a,\, b)}.$$

This prior distribution is called the Normal-Gamma distribution.
Priors for High-dimensional Gaussians

Let $\Lambda = \Sigma^{-1}$ (the precision matrix).

The likelihood is

$$p(x_n \mid \mu, \Sigma) = \mathcal{N}(x_n \mid \mu, \Lambda^{-1}).$$

Prior for $\mu$: Gaussian,
$$p(\mu) = \mathcal{N}(\mu \mid \mu_0, \Lambda_0^{-1}).$$

Prior for $\Lambda$ (equivalently, for $\Sigma$): Wishart,
$$p(\Lambda) = \mathcal{W}(\Lambda \mid W, \nu).$$

Prior for both $\mu$ and $\Lambda$: Normal-Wishart,
$$p(\mu, \Lambda) = \mathcal{N}\big(\mu \mid \mu_0, (\beta\Lambda)^{-1}\big)\,\mathcal{W}(\Lambda \mid W, \nu).$$
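SciPy exposes the Wishart, so the prior over the precision matrix can be sampled directly. A sketch assuming `scipy` is available; the scale matrix $W$ and degrees of freedom $\nu$ are arbitrary:

```python
import numpy as np
from scipy.stats import wishart

W, nu = np.eye(2), 5                       # assumed scale matrix and degrees of freedom
samples = wishart(df=nu, scale=W).rvs(size=20000, random_state=0)

# under SciPy's parameterization, E[Lambda] = nu * W
mean_est = samples.mean(axis=0)
```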
Conjugate Prior

You have a likelihood $p_{X|\Theta}(x \mid \theta)$.
You want to choose a prior $p_\Theta(\theta)$ so that the posterior $p_{\Theta|X}(\theta \mid x)$ takes the same form as the prior.
Such a prior is called the conjugate prior. It is conjugate with respect to the likelihood.

Finding the conjugate prior may not be easy!
Good news: any likelihood belonging to the exponential family has a conjugate prior that is also in the exponential family.
Exponential family: Gaussian, Exponential, Poisson, Bernoulli, etc.
For more discussion, see Bishop, Chapter 2.4.
Reading List

Bayesian Parameter Estimation

Duda, Hart, and Stork, Pattern Classification, Chapters 3.3–3.5
Bishop, Pattern Recognition and Machine Learning, Chapter 2.4
M. Jordan (Berkeley), https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter9.pdf
CMU note, http://www.cs.cmu.edu/~aarti/Class/10701_Spring14/slides/MLE_MAP_Part1.pdf
A. Kak (Purdue), https://engineering.purdue.edu/kak/Tutorials/Trinity.pdf
Appendix
Prior for σ²: Solution

The posterior is

$$p(\lambda \mid \mathcal{D}) \propto \lambda^{(a_0+N/2)-1} \exp\left\{-\left(b_0 + \frac{1}{2}\sum_{n=1}^N (x_n-\mu)^2\right)\lambda\right\} \propto \lambda^{a_N-1} \exp\{-b_N\lambda\}.$$

The maximum-a-posteriori estimate of $\lambda$ is

$$
\begin{aligned}
\hat{\lambda} &= \arg\max_\lambda \; p(\lambda \mid \mathcal{D}) \\
&= \arg\max_\lambda \; \lambda^{a_N-1} \exp\{-b_N\lambda\} \\
&= \arg\max_\lambda \; (a_N-1)\log\lambda - b_N\lambda.
\end{aligned}
$$

Taking the derivative and setting it to zero:

$$\frac{d}{d\lambda}\Big((a_N-1)\log\lambda - b_N\lambda\Big) = \frac{a_N-1}{\lambda} - b_N = 0.$$
Prior for σ²: Solution

Therefore,

$$\hat{\lambda} = \frac{a_N-1}{b_N},$$

where the parameters are

$$a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{1}{2}\sum_{n=1}^N (x_n-\mu)^2 = b_0 + \frac{N}{2}\,\sigma_{\text{ML}}^2.$$

Hence, the MAP estimate is

$$\hat{\lambda} = \frac{a_0 + \frac{N}{2} - 1}{b_0 + \frac{N}{2}\,\sigma_{\text{ML}}^2}.$$

As $N \to \infty$, $\hat{\lambda} \to \frac{1}{\sigma_{\text{ML}}^2}$.
As $N \to 0$, $\hat{\lambda} \to \frac{a_0 - 1}{b_0}$, the mode of the prior.
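The large-$N$ limit can be checked numerically. A sketch with assumed hyperparameters and simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5
a0, b0 = 2.0, 1.0                 # assumed Gamma hyperparameters
x = rng.normal(mu, sigma, size=100_000)

a_n = a0 + len(x) / 2
b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
lam_map = (a_n - 1) / b_n         # mode of Gam(lambda | a_n, b_n)
# with this much data, lam_map should be close to 1 / sigma^2 = 4
```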
Prior for Both µ and σ²: Detailed Derivation

Again, let $\lambda = \frac{1}{\sigma^2}$. The likelihood is

$$
\begin{aligned}
p(\mathcal{D} \mid \mu, \lambda) &= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{n=1}^N \frac{(x_n-\mu)^2}{2\sigma^2}\right\} \\
&= \left(\frac{\lambda}{2\pi}\right)^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N (x_n-\mu)^2\right\} \\
&= \left(\frac{\lambda}{2\pi}\right)^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N (x_n^2 - 2\mu x_n + \mu^2)\right\} \\
&= \left(\frac{\lambda}{2\pi}\right)^{N/2} \exp\left\{-\frac{\lambda}{2}\sum_{n=1}^N x_n^2 + \lambda\mu\sum_{n=1}^N x_n\right\} \left[\exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^N \\
&= \left(\frac{1}{2\pi}\right)^{N/2} \left[\lambda^{1/2}\exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^N \exp\left\{\lambda\mu\sum_{n=1}^N x_n - \frac{\lambda}{2}\sum_{n=1}^N x_n^2\right\}
\end{aligned}
$$
Prior for Both µ and σ²: Detailed Derivation

The likelihood is

$$p(\mathcal{D} \mid \mu, \lambda) \propto \left[\lambda^{1/2}\exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^N \exp\left\{\lambda\mu\sum_{n=1}^N x_n - \frac{\lambda}{2}\sum_{n=1}^N x_n^2\right\}.$$

A candidate for the prior is

$$
\begin{aligned}
p(\mu, \lambda) &\propto \left[\lambda^{1/2}\exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^\beta \exp\{c\lambda\mu - d\lambda\} \\
&= \left[\exp\left\{-\frac{\lambda\mu^2}{2}\right\}\right]^\beta \left[\lambda^{\beta/2}\exp\{c\lambda\mu - d\lambda\}\right] \\
&= \underbrace{\exp\left\{-\frac{\beta\lambda}{2}\left(\mu - \frac{c}{\beta}\right)^2\right\}}_{\mathcal{N}(\mu \,\mid\, \mu_0,\, \sigma_0^2)} \cdot \underbrace{\lambda^{\beta/2}\exp\left\{-\left(d - \frac{c^2}{2\beta}\right)\lambda\right\}}_{\text{Gam}(\lambda \,\mid\, a,\, b)}
\end{aligned}
$$

with $\mu_0 = c/\beta$, $\sigma_0^2 = (\beta\lambda)^{-1}$, $a = 1 + \beta/2$, $b = d - c^2/(2\beta)$.
Prior for Both µ and σ²: Detailed Derivation

The prior distribution is

$$p(\mu, \lambda) = \mathcal{N}\big(\mu \mid \mu_0, (\beta\lambda)^{-1}\big)\,\text{Gam}(\lambda \mid a, b).$$

This is called the Normal-Gamma distribution.

[Figure: a 2D plot of $p(\mu, \lambda)$ when $\mu_0 = 0$, $\beta = 2$, $a = 5$, $b = 6$.]