Approximations in Bayesian Inference
Václav Šmídl
March 3, 2020
Previous models
[Graphical model: observations xi, i = 1, 2, with parent nodes m and s; hyperparameters µ = 0, α = 2, β = 3.]

p(s) = iG(α0, β0)
p(m|s) = N(µ, s)
p(xi|m, s) = N(m, s)

- Observations xi are sampled from a Gaussian with unknown mean and variance.
- We have some prior information about the mean and the variance.
Previous models
[Two graphical models: observations xi, i = 1, 2, with parent nodes m and s. In the first, µ = 0 is fixed and α, β are free hyperparameters; in the second, µ, φ, α, β are all free hyperparameters.]

Inference:

p(m, s | x1, x2, µ, α, β, φ) ∝ ∏(i=1..2) p(xi|m, s) · p(m|µ, φ) · p(s|α, β),

where p(d) = p(x1, x2) is the normalization constant.
Joint likelihood
p(m, s | x1, x2, µ, α, β, φ)
  ∝ (1/s) · (1/s^(α0+1)) · exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) − β0/s )

p(m | s, x1, x2, µ, α, β, φ)
  ∝ exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) )
  ∝ exp( −(1/2) [ m²(1/φ + 2/s) − 2m(µ/φ + (x1 + x2)/s) ] )
  = N( m; (1/φ + 2/s)⁻¹ (µ/φ + (x1 + x2)/s), (1/φ + 2/s)⁻¹ )

p(s | m, x1, x2, µ, α, β, φ)
  ∝ (1/s^(α0+2)) · exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − β0/s )
  = iG( α0 + 1, 0.5(m − x1)² + 0.5(m − x2)² + β0 )
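These two conditionals can be checked numerically. A minimal Python sketch, with illustrative data and hyperparameter values, assuming SciPy's `invgamma(a, scale=b)` (density ∝ s^(−a−1) exp(−b/s)) matches the iG(a, b) used above:

```python
import numpy as np
from scipy import stats

# Illustrative data and hyperparameters
x1, x2 = 1.0, 2.0
mu, phi = 0.0, 1.0
alpha0, beta0 = 2.0, 3.0

def cond_m(s):
    """p(m | s, x1, x2, ...): Normal with precision 1/phi + 2/s."""
    prec = 1.0 / phi + 2.0 / s
    mean = (mu / phi + (x1 + x2) / s) / prec
    return stats.norm(loc=mean, scale=np.sqrt(1.0 / prec))

def cond_s(m):
    """p(s | m, x1, x2, ...): iG(alpha0 + 1, 0.5(m-x1)^2 + 0.5(m-x2)^2 + beta0)."""
    a = alpha0 + 1.0
    b = 0.5 * (m - x1)**2 + 0.5 * (m - x2)**2 + beta0
    return stats.invgamma(a, scale=b)   # density proportional to s^-(a+1) exp(-b/s)

print(cond_m(1.0).mean(), cond_s(1.5).mean())
```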
Numerical solution:
[Figure: joint posterior p(m, s | x1=1, x2=2, µ=3, α=0.1, β=0.1, φ=1000) evaluated on a grid over s and m, with the marginals p(s|x1, x2) and p(m|x1, x2) shown alongside.]
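A figure like this can be reproduced by brute-force evaluation of the joint on a grid. A minimal sketch, using the caption values as reconstructed above:

```python
import numpy as np

# Caption values (as reconstructed) and data
x1, x2 = 1.0, 2.0
mu, alpha0, beta0, phi = 3.0, 0.1, 0.1, 1000.0

m = np.linspace(0.0, 6.0, 400)[:, None]   # grid over the mean (rows)
s = np.linspace(0.05, 8.0, 400)[None, :]  # grid over the variance (columns)

# Unnormalized log joint from the "Joint likelihood" slide
logp = (-(alpha0 + 2.0) * np.log(s)
        - 0.5 * (m - x1)**2 / s
        - 0.5 * (m - x2)**2 / s
        - 0.5 * (m - mu)**2 / phi
        - beta0 / s)

p = np.exp(logp - logp.max())
p /= p.sum()                 # normalize over the grid

p_m = p.sum(axis=1)          # marginal p(m | x1, x2), up to grid spacing
p_s = p.sum(axis=0)          # marginal p(s | x1, x2), up to grid spacing
```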
Numerical solution:
[Figure: joint posterior p(m, s | x1=1, x2=2, µ=5, α=0.1, β=0.1, φ=2) on the same grid over s and m, with the marginals p(s|x1, x2) and p(m|x1, x2) shown alongside.]
What is the role of prior?
1. Uninformative
   - needed to make a consistent answer
   - Jeffreys
2. Non-committal
   - makes only a minor adjustment
3. Informative
   - incorporates important information
   - do we know the exact shape of the distribution?
4. Structural, weakly informative
Non-informative (Jeffreys)
- The answer should be invariant to a change of coordinates of the parameter θ.
- Solution:

  p(θ) ∝ sqrt( det F(θ) ),

  where F is the Fisher information matrix

  F(θ) = −E[ ∂²/∂θ² log p(X; θ) | θ ]

- For the normal likelihood

  log p(x|m, s) = −(1/2) log s − (m − x)²/(2s) + c = . . .

  p(m, s) ∝ 1/s = iG(0, 0)

- A uniform prior on the scale is informative...
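As a quick numerical sanity check of the Fisher information formula for this likelihood (with s the variance), one can average the hand-derived second derivatives of log p over samples. A minimal sketch:

```python
import numpy as np

# Monte Carlo estimate of F = -E[d^2/dtheta^2 log p(X; theta)] for X ~ N(m, s),
# with log p = -1/2 log s - (x - m)^2 / (2s) + c
rng = np.random.default_rng(0)
m, s = 1.0, 2.0
x = rng.normal(m, np.sqrt(s), size=1_000_000)

d2_dm2 = -np.ones_like(x) / s                    # second derivative w.r.t. m
d2_ds2 = 1.0 / (2 * s**2) - (x - m)**2 / s**3    # second derivative w.r.t. s

F_mm = -d2_dm2.mean()    # approaches 1/s
F_ss = -d2_ds2.mean()    # approaches 1/(2 s^2)
print(F_mm, F_ss)
```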
Conjugate prior
For the normal likelihood with parameters m, s

log p(x|m, s) = −(1/2) log s − (m − x)²/(2s) + c
             = −(1/2) log s − (m² − 2mx + x²)/(2s) + c
             = [−1/2, −1/2, x, −(1/2)x²] · [log s, m²/s, m/s, 1/s]

is a composition of basis functions [log s, m²/s, m/s, 1/s]. It is advantageous to choose a prior with the same basis functions:

log p(m, s) = −(1/2) log s − (m − µ)²/(2s) + c − (α0 + 1) log s − β0/s
           = [−1/2 − α0 − 1, −1/2, µ, −(1/2)µ² − β0] · [log s, m²/s, m/s, 1/s]

The posterior is then of the same form as the prior.
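Conjugacy then amounts to adding coefficient vectors on the shared basis. A minimal sketch in Python, using the coefficients read off above (hyperparameter values illustrative):

```python
import numpy as np

# Coefficients on the shared basis [log s, m^2/s, m/s, 1/s]

def likelihood_coeffs(x):
    # read off from log p(x|m, s) above
    return np.array([-0.5, -0.5, x, -0.5 * x**2])

def prior_coeffs(mu, alpha0, beta0):
    # read off from log p(m, s) above
    return np.array([-0.5 - alpha0 - 1.0, -0.5, mu, -0.5 * mu**2 - beta0])

# Posterior coefficients: prior plus one term per observation
post = prior_coeffs(mu=0.0, alpha0=2.0, beta0=3.0) \
     + likelihood_coeffs(1.0) + likelihood_coeffs(2.0)
print(post)  # same basis, so the posterior keeps the prior's functional form
```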
Exponential family:
Likelihood of the data is of the form

p(x|θ) = h(x) exp( η(θ)ᵀ T(x) − A(θ) )

where η(θ) is the natural parameter (sometimes written η = [η1, η2, . . .]) and T(x) is the sufficient statistic.

Use:

p(x1|θ) = h(x1) exp( η(θ)ᵀ T(x1) − A(θ) ),
p(x2|θ) = h(x2) exp( η(θ)ᵀ T(x2) − A(θ) ),
p(θ) = h0 exp( η(θ)ᵀ T0 − ν0 A(θ) )

p(x1, . . . , xn|θ) = ∏(i=1..n) h(xi) · exp( η(θ)ᵀ ∑(i=1..n) T(xi) − n A(θ) )
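In this notation the conjugate update is just accumulation of sufficient statistics, T0 → T0 + ∑i T(xi) and ν0 → ν0 + n. A minimal sketch for the normal case (the prior statistics T0, ν0 are chosen purely for illustration):

```python
import numpy as np

def T(x):
    # sufficient statistic of the normal likelihood: T(x) = [x, x^2]
    return np.array([x, x**2])

x_data = [1.0, 2.0]
T0 = np.array([0.0, 3.0])   # illustrative prior statistic T0
nu0 = 2.0                   # illustrative prior count nu0

Tn = T0 + sum(T(x) for x in x_data)   # T0 + sum_i T(x_i)
nun = nu0 + len(x_data)               # nu0 + n
print(Tn, nun)
```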
Exponential family of normal distribution
p(x|m, s) = (1/sqrt(2π)) · exp( −(1/2) log s − (m − x)²/(2s) )
          = (1/sqrt(2π)) · exp( −(1/2) log s − (1/2) [ m²/s − (2m/s)x + (1/s)x² ] )
          = h(x) exp( ηᵀ T(x) − A(θ) )   with
            h(x) = 1/sqrt(2π),
            η = [m/s, −1/(2s)],
            T(x) = [x, x²]ᵀ,
            A(θ) = (1/2) log s + m²/(2s)

Or

            η = [m/s, −1/(2s), −m²/(2s)],
            T(x) = [x, x², 1]ᵀ,
            A(θ) = (1/2) log s

Non-committal: any conjugate prior whose statistics have minimum impact on the sufficient statistics.
Prior
The prior is typically in the hands of the modeller:
- for a sufficient number of data, use Jeffreys
- a conjugate prior is beneficial for analytical tractability
- care is needed for tail behaviour

[Figure: two panels of densities p(yt) with matched moments, mean(yt) = 1 and std(yt) = 0.1 (left) or std(yt) = 0.3 (right), for the N, G, iG, and lN distributions.]
Working out the non-conjugate prior example
p(m, s | x1, x2, µ, α, β, φ)
  ∝ (1/s) · (1/s^(α0+1)) · exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) − β0/s )

Conditional distributions are available!

EM algorithm
- provides the maximum of one marginal distribution

Variational Bayes approximation
- provides a conditionally independent posterior

Gibbs sampler
- provides samples from the posterior (efficient)
EM algorithm classical form
Splits the unknown vector into two parts:
1. the quantity of interest θ (e.g. the mean m)
2. the missing data z (e.g. the variance s)

General EM algorithm [Dempster, Laird, Rubin, 1977]:

θ̂ = arg max_θ ∫ p(x|θ, z) p(z) dz,

can be (approximately) found by alternating:

E-step: Q(θ|θ(i)) = ∫ log p(x|θ, z) p(z|θ(i)) dz
M-step: θ(i+1) = arg max_θ Q(θ|θ(i))
E-step: θ = m, z = s

Proxy distribution:

Q(m|m(i)) = ∫ log p(x1, x2, m, s) p(s|m(i)) ds
          = E_p(s|m(i)) ( log p(x1, x2, m, s) )

using (keeping only the terms that depend on m)

p(x1, x2, m | s) ∝ (1/s) · exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) )
log p(x1, x2, m | s) = c − log s − (m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ),

Q(m|m(i)) = c − (1/2) E(1/s) [ (m − x1)² + (m − x2)² ] − (m − µ)²/(2φ)

With E(1/s) = α/β under p(s|m(i)) = iG(α, β), write ŝ = β/α, so that

Q(m|m(i)) = c − (1/2) [ (m − x1)²/ŝ + (m − x2)²/ŝ ] − (m − µ)²/(2φ)
M-step
New point estimate m(i+1) = arg max_m Q(m|m(i)):
1. set the first derivative equal to 0
2. complete the squares

m(i+1) = (1/φ + 2/ŝ)⁻¹ ( µ/φ + (x1 + x2)/ŝ ),

based on the estimate ŝ from the conditional

p(s|m(i)) = iG(α, β),  α = α0 + 1,  β = 0.5(m(i) − x1)² + 0.5(m(i) − x2)² + β0;

then repeat for p(s|m(i+1)).

Note: the meaning of the variables can be swapped: E-step over m and M-step on s.
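Putting the E- and M-steps together for the toy problem gives a short fixed-point loop. A minimal sketch (data and hyperparameter values illustrative):

```python
import numpy as np

x1, x2, mu, phi, alpha0, beta0 = 1.0, 2.0, 0.0, 1.0, 2.0, 3.0

m = 0.0                                   # initial guess m(0)
for _ in range(100):
    # E-step: moments of p(s | m(i)) = iG(alpha, beta)
    alpha = alpha0 + 1.0
    beta = 0.5 * (m - x1)**2 + 0.5 * (m - x2)**2 + beta0
    s_hat = beta / alpha                  # 1 / E(1/s)
    # M-step: maximize Q(m | m(i)) via the completed square
    prec = 1.0 / phi + 2.0 / s_hat
    m = (mu / phi + (x1 + x2) / s_hat) / prec

print(m, s_hat)
```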
Divergence minimization
We seek the best approximation of an intractable distribution p(x) within a chosen class of parametric functions q(x|θ), such that

θ* = arg min_θ D(p, q),

where D(p, q) is a statistical divergence.

Different results arise for different choices of: i) q(θ), and ii) D.

Variational Bayes: i) q(θ1, θ2) = q(θ1)q(θ2), and ii) the Kullback-Leibler divergence.
Variational Bayes
Is a divergence minimization technique with

q* = arg min_q KL(q||p) = arg min_q E_q( log(q/p) ),
q(m, s) = q(m|x1, x2) q(s|x1, x2),

which allows free-form optimization.

Result:

q(m|x1, x2) ∝ exp( E_q(s) [ log p(x1, x2, m, s) ] )
q(s|x1, x2) ∝ exp( E_q(m) [ log p(x1, x2, m, s) ] )

which is a set of implicit functions.
- The proportionality above allows using p(x1, x2, m, s) in place of p(m, s|x1, x2).
- Variational EM algorithm (an E-E algorithm).
E-step: m
Proxy distribution
log q(m) = E_q(s)( log p(x1, x2, m, s) ) + c

using

p(x1, x2, m, s) ∝ (1/s) · (1/s^(α0+1)) · exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) − β0/s )
log p(x1, x2, m, s) = −(α0 + 2) log s − (m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) − β0/s + c,

log q(m) ∝ −(1/2) E_q(s)(1/s) [ (m − x1)² + (m − x2)² ] − (m − µ)²/(2φ)
         ∝ −(1/2) [ (m − x1)²/ŝ + (m − x2)²/ŝ ] − (m − µ)²/(2φ),   with ŝ = 1/E_q(s)(1/s),

q(m) = N( m; (1/φ + 2/ŝ)⁻¹ ( µ/φ + (x1 + x2)/ŝ ), (1/φ + 2/ŝ)⁻¹ )
E-step: s
Proxy distribution
log q(s) = E_q(m)( log p(x1, x2, m, s) ) + c

using

p(x1, x2, m, s) ∝ (1/s) · (1/s^(α0+1)) · exp( −(m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) − β0/s )
log p(x1, x2, m, s) = −(α0 + 2) log s − (m − x1)²/(2s) − (m − x2)²/(2s) − (m − µ)²/(2φ) − β0/s + c,

E_q(m)( log p(x1, x2, m, s) ) = −(α0 + 2) log s − (1/(2s)) E_q(m)[ (m − x1)² + (m − x2)² ] − β0/s + c
                              = −(α0 + 2) log s − (1/(2s)) E_q(m)[ 2m² − 2m(x1 + x2) + x1² + x2² ] − β0/s + c

q(s) = iG(α, β),
  α = α0 + 1,
  β = E(m²) − E(m)(x1 + x2) + 0.5(x1² + x2²) + β0
Toy: Variational Bayes
Factors:
q(m) = N( m; (1/φ + 2/ŝ)⁻¹ ( µ/φ + (x1 + x2)/ŝ ), (1/φ + 2/ŝ)⁻¹ )
q(s) = iG( α0 + 1, E(m²) − E(m)(x1 + x2) + 0.5(x1² + x2²) + β0 )

with

ŝ = ( E(m²) − E(m)(x1 + x2) + 0.5(x1² + x2²) + β0 ) / (α0 + 1),
E(m) = (1/φ + 2/ŝ)⁻¹ ( µ/φ + (x1 + x2)/ŝ ),
E(m²) = E(m)² + (1/φ + 2/ŝ)⁻¹,

which needs to be iterated.
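These fixed-point equations translate directly into a loop. A minimal sketch, with the hyperparameter values from the figure on the next slide (as reconstructed):

```python
import numpy as np

x1, x2, mu, phi, alpha0, beta0 = 1.0, 2.0, 5.0, 2.0, 0.1, 0.1

E_m, E_m2 = 0.0, 1.0                      # initial moments of q(m)
for _ in range(100):
    # update q(s) = iG(alpha, beta), with s_hat = 1 / E(1/s) = beta / alpha
    alpha = alpha0 + 1.0
    beta = E_m2 - E_m * (x1 + x2) + 0.5 * (x1**2 + x2**2) + beta0
    s_hat = beta / alpha
    # update q(m) = N(mean, var)
    var = 1.0 / (1.0 / phi + 2.0 / s_hat)
    E_m = var * (mu / phi + (x1 + x2) / s_hat)
    E_m2 = E_m**2 + var

print(E_m, var, alpha, beta)
```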
Toy: Variational Bayes Iterations
[Figure: joint posterior p(m, s | x1=1, x2=2, µ=5, α=0.1, β=0.1, φ=2) over s and m, with the marginals p(s|x1, x2) and p(m|x1, x2) compared against the VB factors q(s|x1, x2) and q(m|x1, x2).]
Homework assignment
- Working code for EM estimation of the toy problem:
  1. m = arg max p(m|x1, x2) (5 points)
  2. s = arg max p(s|x1, x2) (5 points)
- Working code for Variational Bayes estimation of the toy problem (10 points).