![Page 1: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/1.jpg)
Computational statisticsEM algorithm
Thierry Denœux
February-March 2017
Thierry Denœux Computational statistics February-March 2017 1 / 72
![Page 2: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/2.jpg)
EM Algorithm
An iterative optimization strategy motivated by a notion ofmissingness and by consideration of the conditional distribution ofwhat is missing given what is observed.Can be very simple to implement. Can reliably find an optimumthrough stable, uphill steps.Difficult likelihoods often arise when data are missing. EM simplifiessuch problems. In fact, the ‘missing data’ may not truly be missing:they may be only a conceptual ploy to exploit the EM simplification!
Thierry Denœux Computational statistics February-March 2017 2 / 72
![Page 3: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/3.jpg)
EM algorithm
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 3 / 72
![Page 4: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/4.jpg)
EM algorithm Description
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 4 / 72
![Page 5: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/5.jpg)
EM algorithm Description
Notation
Y : Observed variables.Z : Missing or latent variables.X : Complete data X = (Y,Z).θ : Unknown parameter.
L(θ) : observed-data likelihood, short for L(θ; y) = f (y ;θ)
Lc(θ) : complete-data likelihood, short for L(θ; x) = f (x ;θ)
`(θ), `c(θ) : observed and complete-data log-likelihoods.
Thierry Denœux Computational statistics February-March 2017 5 / 72
![Page 6: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/6.jpg)
EM algorithm Description
Notation
Suppose we seek to maximize L(θ) with respect to θ.Define Q(θ,θ(t)) to be the expectation of the complete-datalog-likelihood, conditional on the observed data Y = y. Namely
Q(θ,θ(t)) =Eθ(t){`c(θ)
∣∣ y}
=Eθ(t){log f (X;θ)
∣∣ y}
=
∫ [log f (x;θ)
]f (z|y;θ(t)) dz
where the last equation emphasizes that Z is the only random part ofX once we are given Y = y.
Thierry Denœux Computational statistics February-March 2017 6 / 72
![Page 7: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/7.jpg)
EM algorithm Description
The EM Algorithm
Start with θ(0). Then1 E step: Compute Q(θ,θ(t)).2 M step: Maximize Q(θ,θ(t)) with respect to θ. Set θ(t+1) equal to
the maximizer of Q.3 Return to the E step unless a stopping criterion has been met; e.g.,
`(θ(t+1))− `(θ(t)) ≤ ε
Thierry Denœux Computational statistics February-March 2017 7 / 72
![Page 8: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/8.jpg)
EM algorithm Description
Convergence of the EM Algorithm
It can be proved that L(θ) increases after each EM iteration, i.e.,L(θ(t+1)) ≥ L(θ(t)) for t = 0, 1, . . ..Consequently, the algorithm converges to a local maximum of L(θ) ifthe likelihood function is bounded above.
Thierry Denœux Computational statistics February-March 2017 8 / 72
![Page 9: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/9.jpg)
EM algorithm Description
Mixture of normal and uniform distributions
Let Y = (Y1, . . . ,Yn) be an i.i.d. sample from a mixture of a normaldistribution N (µ, σ) and a uniform distribution U([−a, a]), with pdf
f (y ; θ) = πφ(y ;µ, σ) + (1− π)c , (1)
where φ(·;µ, σ) is the normal pdf, c = (2a)−1, π is the proportion ofthe normal distribution in the mixture and θ = (µ, σ, π)T is thevector of parameters.Typically, the uniform distribution corresponds to outliers in the data.The proportion of outliers in the population is then 1− π.We want to estimate θ.
Thierry Denœux Computational statistics February-March 2017 9 / 72
![Page 10: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/10.jpg)
EM algorithm Description
Observed and complete-data likelihoods
Let Zi = 1 if observation i is not an outlier, Zi = 0 otherwise. Wehave Zi ∼ B(π).The vector Z = (Z1, . . . ,Zn) is the missing data.Observed-data likelihood:
L(θ) =n∏
i=1
f (yi ;θ) =n∏
i=1
[πφ(yi ;µ, σ) + (1− π)c]
Complete-data likelihood:
Lc(θ) =n∏
i=1
f (yi , zi ;θ) =n∏
i=1
f (yi |zi ;µ, σ)f (zi |π)
=n∏
i=1
[φ(yi ;µ, σ)zi c1−ziπzi (1− π)1−zi
]Thierry Denœux Computational statistics February-March 2017 10 / 72
![Page 11: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/11.jpg)
EM algorithm Description
Derivation of function Q
Complete-data log-likelihood:
`c(θ) =n∑
i=1
zi log φ(yi ;µ, σ) +
(n −
n∑i=1
zi
)log c+
n∑i=1
(zi log π + (1− zi ) log(1− π))
It is linear in the zi . Consequently, the Q function is simply
Q(θ,θ(t)) =n∑
i=1
z(t)i log φ(yi ;µ, σ) +
(n −
n∑i=1
z(t)i
)log c+
n∑i=1
(z(t)i log π + (1− z
(t)i ) log(1− π)
)with z
(t)i = Eθ(t) [Zi |yi ].
Thierry Denœux Computational statistics February-March 2017 11 / 72
![Page 12: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/12.jpg)
EM algorithm Description
EM algorithm
E-step: compute
z(t)i = Eθ(t) [Zi |yi ] = Pθ(t) [Zi = 1|yi ]
=φ(yi ;µ
(t), σ(t))π(t)
φ(yi ;µ(t), σ(t))π(t) + c(1− π(t))
M-step: Maximize Q(θ,θ(t)) We get
π(t+1) =1n
n∑i=1
z(t)i , µ(t+1) =
∑ni=1 z
(t)i yi∑n
i=1 z(t)i
σ(t+1) =
√√√√∑ni=1 z
(t)i (yi − µ(t+1))2∑n
i=1 z(t)i
Thierry Denœux Computational statistics February-March 2017 12 / 72
![Page 13: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/13.jpg)
EM algorithm Description
Remark
As mentioned before, the EM algorithm finds only a local maximumof `(θ).It is easy to find a global maximum: if µ is equal to some yi andσ = 0, then φ(yi ;µ, σ) =∞ and, consequently, `(θ) = +∞.We are not interested in these global maxima, because theycorrespond to degenerate solutions!
Thierry Denœux Computational statistics February-March 2017 13 / 72
![Page 14: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/14.jpg)
EM algorithm Description
Bayesian posterior mode
Consider a Bayesian estimation problem with likelihood L(θ) andpriori f (θ).The posterior density if proportional to L(θ)f (θ). It can also bemaximized by the EM algorithm.The E-step requires
Q(θ,θ(t)) = Eθ(t){`c(θ)
∣∣ y}
+ log f (θ)
The addition of the log-prior often makes it more difficult tomaximize Q during the M-step.Some methods can be used to facilitate the M-step in difficultsituations (see below).
Thierry Denœux Computational statistics February-March 2017 14 / 72
![Page 15: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/15.jpg)
EM algorithm Analysis
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 15 / 72
![Page 16: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/16.jpg)
EM algorithm Analysis
Why does it work?
Ascent: Each M-step increases the log likelihood.Optimization transfer:
`(θ) ≥ Q(θ, θ(t)) + `(θ(t))− Q(θ(t), θ(t)) = G (θ, θ(t)).
The last two terms in G (θ, θ(t)) are constant with respect to θ, so Qand G are maximized at the same θ.Further, G is tangent to ` at θ(t), and lies everywhere below `. Wesay that G is a minorizing function for `.EM transfers optimization from ` to the surrogate function G , whichis more convenient to maximize.
Thierry Denœux Computational statistics February-March 2017 16 / 72
![Page 17: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/17.jpg)
EM algorithm Analysis
The nature of EM
One-dimensional illustration of EM algorithm as a minorization oroptimization transfer strategy. Each E step forms a minorizing functionG , and each M step maximizes it to provide an uphill step.
Thierry Denœux Computational statistics February-March 2017 17 / 72
![Page 18: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/18.jpg)
EM algorithm Analysis
Proof
We havef (z |y ; θ) =
f (x ; θ)
f (y ; θ)⇒ f (y ; θ) =
f (x ; θ)
f (z |y ; θ)
Consequently,
`(θ) = log f (y ; θ) = log f (x ; θ)︸ ︷︷ ︸`c (θ)
− log f (z |y ; θ)
Taking expectations on both sides wrt the conditional distribution ofX given Y = y and using θ(t) for θ:
`(θ) = Q(θ, θ(t))− Eθ(t) [log f (Z |y ; θ)|y ]︸ ︷︷ ︸H(θ,θ(t))
(2)
Thierry Denœux Computational statistics February-March 2017 18 / 72
![Page 19: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/19.jpg)
EM algorithm Analysis
Proof - the minorizing function
Now, for all θ ∈ Θ,
H(θ, θ(t))− H(θ(t), θ(t)) = Eθ(t)[log
f (Z |y ; θ)
f (Z |y ; θ(t))|y]
(3a)
≤ logEθ(t)[
f (Z |y ; θ)
f (Z |y ; θ(t))|y]
(∗) (3b)
= log∫
f (z |y ; θ)dz = 0 (3c)
(*): from the concavity of the log and Jensen’s inequality.Hence, for all θ ∈ Θ,
H(θ, θ(t)) ≤ H(θ(t), θ(t)) = Q(θ(t), θ(t))− `(θ(t)), or
`(θ) ≥ Q(θ, θ(t)) + `(θ(t))− Q(θ(t), θ(t)) = G (θ, θ(t)) (4)
Thierry Denœux Computational statistics February-March 2017 19 / 72
![Page 20: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/20.jpg)
EM algorithm Analysis
Proof - G is tangent to ` at θ(t)
From (4), `(θ(t)) = G (θ(t), θ(t)).Now, we can rewrite (4) as
Q(θ(t), θ(t))− `(θ(t)) ≥ Q(θ, θ(t))− `(θ), ∀θ
Consequently, θ(t) maximizes Q(θ, θ(t))− `(θ), hence
Q ′(θ, θ(t))|θ=θ(t) − `′(θ)|θ=θ(t) = 0
andG ′(θ, θ(t))|θ=θ(t) = Q ′(θ, θ(t))|θ=θ(t) = `′(θ)|θ=θ(t) .
Thierry Denœux Computational statistics February-March 2017 20 / 72
![Page 21: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/21.jpg)
EM algorithm Analysis
Proof - monotonicity
From (2),
`(θ(t+1))− `(θ(t)) = Q(θ(t+1), θ(t))− Q(θ(t), θ(t))︸ ︷︷ ︸A
−
H(θ(t+1), θ(t))− H(θ(t), θ(t))︸ ︷︷ ︸B
A ≥ 0 because θ(t+1) is a maximizer of Q(θ, θ(t)), and B ≤ 0because, from (3), θ(t) is a maximizer of H(θ, θ(t)).Hence,
`(θ(t+1)) ≥ `(θ(t))
Thierry Denœux Computational statistics February-March 2017 21 / 72
![Page 22: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/22.jpg)
Some variants
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 22 / 72
![Page 23: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/23.jpg)
Some variants Facilitating the E-step
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 23 / 72
![Page 24: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/24.jpg)
Some variants Facilitating the E-step
Monte Carlo EM (MCEM)
Replace the tth E step with1 Draw missing datasets Z(t)
1 , . . . ,Z(t)
m(t) i.i.d. from f (z|y;θ(t)). Each
Z(t)j is a vector of all the missing values needed to complete the
observed dataset, so X(t)j = (y,Z(t)
j ) denotes a completed dataset
where the missing values have been replaced by Z(t)j .
2 Calculate Q̂(t+1)(θ,θ(t)) = 1m(t)
∑m(t)
j=1 log f (X(t)j ;θ).
Then Q̂(t+1)(θ,θ(t)) is a Monte Carlo estimate of Q(θ,θ(t)).The M step is modified to maximize Q̂(t+1)(θ,θ(t)).Increase m(t) as iterations progress to reduce the Monte Carlovariability of Q̂. MCEM will not converge in the same sense asordinary EM, rather values of θ(t) will bounce around the truemaximum, with a precision that depends on m(t).
Thierry Denœux Computational statistics February-March 2017 24 / 72
![Page 25: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/25.jpg)
Some variants Facilitating the M-step
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 25 / 72
![Page 26: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/26.jpg)
Some variants Facilitating the M-step
Generalized EM (GEM) algorithm
In the original EM algorithm, θ(t+1) is a maximizer of Q(θ,θ(t)), i.e.,
Q(θ(t+1),θ(t)) ≥ Q(θ,θ(t))
for all θ.However, to ensure convergence, we only need that
Q(θ(t+1),θ(t)) ≥ Q(θ(t),θ(t))
Any algorithm that chooses θ(t+1) at each iteration to guarantee theabove condition (without maximizing Q(θ,θ(t))) is called aGeneralized EM (GEM) algorithm.
Thierry Denœux Computational statistics February-March 2017 26 / 72
![Page 27: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/27.jpg)
Some variants Facilitating the M-step
EM gradient algorithm
Replace the M step with a single step of Newton’s method, therebyapproximating the maximum without actually solving for it exactly.Instead of maximizing, choose:
θ(t+1) = θ(t) − Q′′(θ,θ(t))−1∣∣∣θ=θ(t)
Q′(θ,θ(t))∣∣∣θ=θ(t)
= θ(t) − Q′′(θ,θ(t))−1∣∣∣θ=θ(t)
`′(θ(t))
Ascent is ensured for canonical parameters in exponential families.Backtracking can ensure ascent in other cases; inflating steps canspeed convergence.
Thierry Denœux Computational statistics February-March 2017 27 / 72
![Page 28: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/28.jpg)
Some variants Facilitating the M-step
ECM algorithm
Replaces the M step with a series of computationally simplerconditional maximization (CM) steps.Call the collection of simpler CM steps after the tth E step a CMcycle. Thus, the tth iteration of ECM is comprised of the tth E stepand the tth CM cycle.Let S denote the total number of CM steps in each CM cycle.
Thierry Denœux Computational statistics February-March 2017 28 / 72
![Page 29: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/29.jpg)
Some variants Facilitating the M-step
ECM algorithm (continued)
For s = 1, . . . ,S , the sth CM step in the tth cycle requires themaximization of Q(θ,θ(t)) subject to (or conditional on) aconstraint, say
gs(θ) = gs(θ(t+(s−1)/S))
where θ(t+(s−1)/S) is the maximizer found in the (s − 1)th CM stepof the current cycle.When the entire cycle of S steps of CM has been completed, we setθ(t+1) = θ(t+S/S) and proceed to the E step for the (t + 1)thiteration.ECM is a GEM algorithm, since each CM step increases Q.The art of constructing an effective ECM algorithm lies in choosingthe constraints cleverly.
Thierry Denœux Computational statistics February-March 2017 29 / 72
![Page 30: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/30.jpg)
Some variants Facilitating the M-step
Choice 1: Iterated Conditional Modes / Gauss-Seidel
Partition θ into S subvectors, θ = (θ1, . . . ,θS).In the sth CM step, maximize Q with respect to θs while holding allother components of θ fixed.This amounts to the constraint induced by the function
gs(θ) = (θ1, . . . ,θs−1,θs+1, . . . ,θS).
Thierry Denœux Computational statistics February-March 2017 30 / 72
![Page 31: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/31.jpg)
Some variants Facilitating the M-step
Choice 2
At the sth CM step, maximize Q with respect to all othercomponents of θ while holding θs fixed.Then gs(θ) = θs .Additional systems of constraints can be imagined, depending on theparticular problem context.A variant of ECM inserts an E step between each pair of CM steps,thereby updating Q at every stage of the CM cycle.
Thierry Denœux Computational statistics February-March 2017 31 / 72
![Page 32: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/32.jpg)
Variance estimation
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 32 / 72
![Page 33: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/33.jpg)
Variance estimation
Variance of the MLE
Let θ̂ be the MLE of θ.As n→∞, the limiting distribution of θ̂ is N (θ∗, I (θ∗)−1), whereθ∗ is the true value of θ, and
I (θ) = E[`′(θ)`′(θ)T ] = −E[`′′(θ)]
is the expected Fisher information matrix (the second equality holdsunder some regularity conditions).I (θ∗) can be estimated by I (θ̂), or by −`′′(θ̂) = Iobs(θ̂) (observedinformation matrix).Standard error estimates can be obtained by computing the squareroots of the diagonal elements of Iobs(θ̂)−1.
Thierry Denœux Computational statistics February-March 2017 33 / 72
![Page 34: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/34.jpg)
Variance estimation
Obtaining variance estimates
The EM algorithm allows us to estimate θ̂, but it does not directlyprovide an estimate of I (θ∗).Direct computation of I (θ̂) or Iobs(θ̂) is often difficult.Main methods:
1 Louis’ method2 Supplemented EM (SEM) algorithm3 Bootstrap
Thierry Denœux Computational statistics February-March 2017 34 / 72
![Page 35: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/35.jpg)
Variance estimation Louis’ method
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 35 / 72
![Page 36: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/36.jpg)
Variance estimation Louis’ method
Missing information principle
We have seen thatf (z |y ;θ) =
f (x ;θ)
f (y ;θ),
from which we get
`(θ) = `c(θ)− log f (z |y ;θ).
Differentiating twice and negating both sides, then takingexpectations over the conditional distribution of X given y ,
−`′′(θ)︸ ︷︷ ︸ı̂Y(θ)
= E[−`′′c (θ)|y
]︸ ︷︷ ︸ı̂X(θ)
−E[−∂
2 log f (z |y ;θ)
∂θ∂θT|y]
︸ ︷︷ ︸ı̂Z|Y(θ)
whereı̂Y(θ) is the observed information,ı̂X(θ) is the complete information, andı̂Z|Y(θ) is the missing information.
Thierry Denœux Computational statistics February-March 2017 36 / 72
![Page 37: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/37.jpg)
Variance estimation Louis’ method
Louis’ method
Computing ı̂X(θ) and ı̂Z|Y(θ) is sometimes easier than computing−`′′(θ) directlyWe can show that
ı̂Z|Y(θ) = Var[SZ|Y(θ)],
where the variance is taken w.r.t. Z |y , and
SZ|Y(θ) =∂ log f (z |y ;θ)
∂θ
is the conditional score.As the expected score is zero at θ̂, we have
ı̂Z|Y(θ̂) =
∫SZ|Y(θ̂)SZ|Y(θ̂)T log f (z |y ; θ̂)dz
Thierry Denœux Computational statistics February-March 2017 37 / 72
![Page 38: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/38.jpg)
Variance estimation Louis’ method
Monte Carlo approximation
When they cannot be computed analytically, ı̂X(θ) and ı̂Z|Y(θ) cansometimes be approximated by Monte Carlo simulation.Method: generate simulated datasets x j = (y , z j), j = 1, . . . ,N,where y is the observed dataset, and the z j are imputed missingdatasets drawn from f (z |y ;θ)
Then,
ı̂X(θ) ≈ 1N
N∑j=1
−∂2 log f (x j ;θ)
∂θ∂θT
and ı̂Z|Y(θ) is approximated by the sample variance of the values
∂ log f (z j |y ;θ)
∂θ
Thierry Denœux Computational statistics February-March 2017 38 / 72
![Page 39: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/39.jpg)
Variance estimation SEM algorithm
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 39 / 72
![Page 40: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/40.jpg)
Variance estimation SEM algorithm
EM mapping
Let Ψ denotes the EM mapping, defined by
θ(t+1) = Ψ(θ(t))
From the convergence of EM, θ̂ is a fixed point:
θ̂ = Ψ(θ̂).
The Jacobian matrix of Ψ is the p × p matrix
Ψ′(θ) =
(∂Ψi (θ)
∂θj
).
It can be shown that
Ψ′(θ̂)T = ı̂Z|Y(θ̂)ı̂X(θ̂)−1
Thierry Denœux Computational statistics February-March 2017 40 / 72
![Page 41: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/41.jpg)
Variance estimation SEM algorithm
Using Ψ′(θ) for variance estimation
From the missing information principle,
ı̂Y (θ̂) = ı̂X(θ̂)− ı̂Z|Y(θ̂)
=[I− ı̂Z|Y(θ̂)ı̂X(θ̂)−1
]ı̂X(θ̂)
=[I−Ψ′(θ̂)T
]ı̂X(θ̂).
Hence,
ı̂Y (θ̂)−1 = ı̂X(θ̂)−1[I−Ψ′(θ̂)T
]−1
From the equality
(I− P)−1 = (I− P + P)(I− P)−1 = I + P(I− P)−1,
we get
ı̂Y (θ̂)−1 = ı̂X(θ̂)−1{
I + Ψ′(θ̂)T[I−Ψ′(θ̂)T
]−1}. (5)
Thierry Denœux Computational statistics February-March 2017 41 / 72
![Page 42: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/42.jpg)
Variance estimation SEM algorithm
Estimation of Ψ′(θ̂)
Ler rij be the element (i , j) of Ψ′(θ̂). By definition,
rij =∂Ψi (θ̂)
∂θj
= limθj→θ̂j
Ψi (θ̂1, . . . , θ̂j−1, θj , θ̂j+1, . . . , θ̂p)−Ψi (θ̂)
θj − θ̂j
= limt→∞
Ψi (θ(t)(j))−Ψi (θ̂)
θ(t)j − θ̂j
= limt→∞
r(t)ij
where θ(t)(j) = (θ̂1, . . . , θ̂j−1, θ(t)j , θ̂j+1, . . . , θ̂p), and (θ
(t)j ),
t = 1, 2, . . . is a sequence of values converging to θ̂j .
Method: compute the r(t)ij , t = 1, 2, . . . until they stabilize to some
values. Then compute ı̂Y (θ̂)−1 using (5).
Thierry Denœux Computational statistics February-March 2017 42 / 72
![Page 43: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/43.jpg)
Variance estimation SEM algorithm
SEM algorithm
1 Run the EM algorithm to convergence, finding θ̂.2 Restart the algorithm from some θ(0) near θ̂. For t = 0, 1, 2, . . .
1 Take a standard E step and M step to produce θ(t+1) from θ(t).2 For j = 1, . . . , p:
Define θ(t)(j) = (θ̂1, . . . , θ̂j−1, θ(t)j , θ̂j+1, . . . , θ̂p), and treating it as the
current estimate of θ, run one iteration of EM to obtain Ψ(θ(t)(j)).Obtain the ratio
r(t)ij =
Ψi (θ(t)(j)) − θ̂i
θ(t)j − θ̂j
for i = 1, . . . , p. (Recall that Ψ(θ̂) = θ̂.)
3 Stop when all r (t)ij have converged
3 The (i , j)th element of Ψ′(θ̂) equals limt→∞ r(t)ij . Use the final
estimate of Ψ′(θ̂) to get the variance.
Thierry Denœux Computational statistics February-March 2017 43 / 72
![Page 44: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/44.jpg)
Variance estimation Bootstrap
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 44 / 72
![Page 45: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/45.jpg)
Variance estimation Bootstrap
Principle
Consider the case of iid data y = (w1, . . . ,wn)
If we knew the distribution of the W i , we couldgenerate many samples y1, . . . , yN ,compute the ML estimate θ̂j of θ from each sample y j , andestimate the variance of θ̂ by the sample variance of the estimatesθ̂1, . . . , θ̂N .
Bootstrap principle: use the empirical distribution in place of thetrue distribution of the W i
Thierry Denœux Computational statistics February-March 2017 45 / 72
![Page 46: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/46.jpg)
Variance estimation Bootstrap
Algorithm
1 Calculate θ̂EM using a suitable EM approach applied toy = (w1, . . . ,wn). Let j = 1 and set θ̂
∗j = θ̂EM .
2 Increment j . Sample pseudo-data y∗j = (w∗j1, . . . ,w∗jn) at random
from (w1, . . . ,wn) with replacement.3 Calculate θ̂
∗j by applying the same EM approach to the pseudo-data
y∗j4 Stop if j = B (typically, B ≥ 1000); otherwise return to step 2.
The collection of parameter estimates θ̂∗1, . . . , θ̂
∗B can be used to estimate
the variance of θ̂,
V̂ar(θ̂) =1B
B∑j=1
(θ̂∗j − θ̂
∗)(θ̂∗j − θ̂
∗)T ,
where θ̂∗is the sample mean of θ̂
∗1, . . . , θ̂
∗B .
Thierry Denœux Computational statistics February-March 2017 46 / 72
![Page 47: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/47.jpg)
Variance estimation Bootstrap
Pros and cons of the bootstrap
1 Advantages:The method is very general, complex analytical derivations areavoided.Allows the estimation of other aspects of the sampling distribution ofθ̂, such as expectation (bias), quantiles, etc.
2 Drawback: bootstrap embeds the EM loop in a second loop of Biterations. May be computationally burdensome when the EMalgorithm is slow (because, e.g., of a high proportion of missing data,or high dimensionality).
Thierry Denœux Computational statistics February-March 2017 47 / 72
![Page 48: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/48.jpg)
Application to Regression models
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 48 / 72
![Page 49: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/49.jpg)
Application to Regression models Mixture of regressions
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 49 / 72
![Page 50: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/50.jpg)
Application to Regression models Mixture of regressions
Introductory example
10 20 30 40
510
1520
1996 GNP and Emissions Data
GNP per capita
CO
2 pe
r ca
pita
CAN
MEX
USA
JAPKOR
AUS
NZ OST
BELCZ DNK
FIN
FRA
DEU
GRC
HUN
EIRE
ITL
HOL
NOR
POL
PORESP
SWCH
TUR
UK
RUS
Thierry Denœux Computational statistics February-March 2017 50 / 72
![Page 51: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/51.jpg)
Application to Regression models Mixture of regressions
Introductory example (continued)
The data in the previous slide do not show any clear linear trend.However, there seem to be several groups for which a linear modelwould be a reasonable approximation.How to identify those groups and the corresponding linear models?
Thierry Denœux Computational statistics February-March 2017 51 / 72
![Page 52: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/52.jpg)
Application to Regression models Mixture of regressions
Model
Model: the response variable Y depends on the input variable X indifferent ways, depending on a latent variable Z . (Beware: we haveswitched back to the classical notation for regression models!)This model is called mixture of regressions or switching regressions.It has been widely studied in the econometrics literature.Model:
Y =
βT1 X + ε1, ε1 ∼ N (0, σ1) if Z = 1,...βTKX + εK , εK ∼ N (0, σK ) if Z = K .
with X = (1,X1, . . . ,Xp), so
f (y |X = x) =K∑
k=1
πkφ(y ;βT x , σk)
Thierry Denœux Computational statistics February-March 2017 52 / 72
![Page 53: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/53.jpg)
Application to Regression models Mixture of regressions
Observed and complete-data likelihoods
Observed-data likelihood:
L(θ) =N∏i=1
f (yi |xi ; θ) =N∏i=1
K∑k=1
πkφ(yi ;βTk xi , σk)
Complete-data likelihood:
Lc(θ) =N∏i=1
f (yi , zi |xi ; θ) =N∏i=1
f (yi |xi , zi ; θ)p(zi |π)
=N∏i=1
K∏k=1
φ(yi ;βTk xi , σk)zikπzikk ,
with zik = 1 if zi = k and zik = 0 otherwise.
Thierry Denœux Computational statistics February-March 2017 53 / 72
![Page 54: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/54.jpg)
Application to Regression models Mixture of regressions
Derivation of function Q
Complete-data log-likelihood:
`c(θ) =N∑i=1
K∑k=1
zik log φ(yi ;βTk xi , σk) +
N∑i=1
K∑k=1
zik log πk
It is linear in the zik . Consequently, the Q function is simply
Q(θ, θ(t)) =N∑i=1
K∑k=1
z(t)ik log φ(yi ;β
Tk xi , σk) +
N∑i=1
K∑k=1
z(t)ik log πk
with z(t)ik = Eθ(t) [Zik |yi ] = Pθ(t) [Zi = k |yi ].
Thierry Denœux Computational statistics February-March 2017 54 / 72
![Page 55: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/55.jpg)
Application to Regression models Mixture of regressions
EM algorithm
E-step: compute
z(t)ik = Pθ(t) [Zi = k|yi ]
=φ(yi ;β
(t)Tk xi , σ
(t)k )π
(t)k∑K
`=1 φ(yi ;β(t)T` xi , σ
(t)` )π
(t)`
M-step: Maximize Q(θ, θ(t)). As before, we get
π(t+1)k =
N(t)k
N,
with N(t)k =
∑Ni=1 z
(t)ik .
Thierry Denœux Computational statistics February-March 2017 55 / 72
![Page 56: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/56.jpg)
Application to Regression models Mixture of regressions
M-step: update of the βk and σk
In Q(θ, θ(t)), the term depending on βk is
SSk =N∑i=1
z(t)ik (yi − βTk xi )
2.
Minimizing SSk w.r.t. βk is a weighted least-squares (WLS) problem.In matrix form,
SSk = (y − Xβk)TW k(y − Xβk)
with W k = diag(z(t)i1 , . . . , z
(t)iK ).
Thierry Denœux Computational statistics February-March 2017 56 / 72
![Page 57: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/57.jpg)
Application to Regression models Mixture of regressions
M-step: update of the βk and σk (continued)
The solution is the WLS estimate of βk :
β(t+1)k = (XTW kX )−1XTW ky
The value of σk minimizing Q(θ, θ(t)) is the weighted average of theresiduals,
σ2(t+1)k =
1
N(t)k
N∑i=1
z(t)ik (yi − β
(t+1)Tk xi )
2
=1
N(t)k
(y − Xβ(t+1)k )TW k(y − Xβ(t+1)
k )
Thierry Denœux Computational statistics February-March 2017 57 / 72
![Page 58: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/58.jpg)
Application to Regression models Mixture of regressions
Mixture of regressions using mixtools
library(mixtools)data(CO2data)attach(CO2data)
CO2reg <- regmixEM(CO2, GNP)summary(CO2reg)
ii1<-CO2reg$posterior>0.5ii2<-CO2reg$posterior<=0.5text(GNP[ii1],CO2[ii1],country[ii1],col=’red’)text(GNP[Cii2],CO2[ii2],country[ii2],col=’blue’)abline(CO2reg$beta[,1],col=’red’)abline(CO2reg$beta[,2],col=’blue’)
Thierry Denœux Computational statistics February-March 2017 58 / 72
![Page 59: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/59.jpg)
Application to Regression models Mixture of regressions
Best solution in 10 runs
10 20 30 40
510
1520
GNP per capita
CO
2 pe
r ca
pita
JAPKOR
NZ OST
BELCZ DNK
FIN
FRA
DEU
GRC
HUN
EIRE
ITL
HOLPOL
PORESP
SWCH
UK
RUS
CAN
MEX
USA
AUSNOR
TUR
Thierry Denœux Computational statistics February-March 2017 59 / 72
![Page 60: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/60.jpg)
Application to Regression models Mixture of regressions
Increase of log-likelihood
5 10 15
−80
−75
−70
iterations
log−
likel
ihoo
d
Thierry Denœux Computational statistics February-March 2017 60 / 72
![Page 61: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/61.jpg)
Application to Regression models Mixture of regressions
Another solution (with lower log-likelihood)
10 20 30 40
510
1520
GNP per capita
CO
2 pe
r ca
pita
MEX
JAPKOR
NZ OST
BEL DNKFIN
FRA
DEU
GRC
HUN
EIRE
ITL
HOLPOL
PORESP
SWCH
TUR
UK
CAN
USA
AUS
CZ
NOR
RUS
Thierry Denœux Computational statistics February-March 2017 61 / 72
![Page 62: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/62.jpg)
Application to Regression models Mixture of regressions
Increase of log-likelihood
2 4 6 8 10 12
−76
−75
−74
−73
−72
−71
−70
iterations
log−
likel
ihoo
d
Thierry Denœux Computational statistics February-March 2017 62 / 72
![Page 63: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/63.jpg)
Application to Regression models Mixture of experts
Overview
EM algorithmDescriptionAnalysis
Some variantsFacilitating the E-stepFacilitating the M-step
Variance estimationLouis’ methodSEM algorithmBootstrap
Application to Regression modelsMixture of regressionsMixture of experts
Thierry Denœux Computational statistics February-March 2017 63 / 72
![Page 64: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/64.jpg)
Application to Regression models Mixture of experts
Making the mixing proportions predictor-dependent
An interesting extension of the previous model is to assume theproportions πk to be partially explained by a vector of concomitantvariables W .If W = X , we can approximate the regression function by differentlinear functions in different regions of the predictor space.In ML, this method is referred to as the mixture of experts methods.A useful parametric form for πk that ensures πk ≥ 0 and∑K
k=1 πk = 1 is the multinomial logit model
πk(w , α) =exp(αT
k w)∑K`=1 exp(αT
` w)
with α = (α1, . . . , αK ) and α1 = 0.
Thierry Denœux Computational statistics February-March 2017 64 / 72
![Page 65: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/65.jpg)
Application to Regression models Mixture of experts
EM algorithm
The Q function is the same as before, except that the πk nowdepend on the wi and parameter α:
Q(θ, θ(t)) =N∑i=1
K∑k=1
z(t)ik log φ(yi ;β
Tk xi , σk)+
N∑i=1
K∑k=1
z(t)ik log πk(wi , α)
In the M-step, the update formula for βk and σk are unchanged.The last term of Q(θ, θ(t)) can be maximized w.r.t. α using aniterative algorithm, such as the Newton-Raphson procedure. (Seeremark on next slide)
Thierry Denœux Computational statistics February-March 2017 65 / 72
![Page 66: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/66.jpg)
Application to Regression models Mixture of experts
Generalized EM algorithm
To ensure convergence of EM, we only need to increase (but notnecessarily maximize) Q(θ, θ(t)) at each step.Any algorithm that chooses θ(t+1) at each iteration to guarantee theabove condition (without maximizing Q(θ, θ(t))) is called aGeneralized EM (GEM) algorithm.Here, we can perform a single step of the Newton-Raphson algorithmto maximize
N∑i=1
K∑k=1
z(t)ik log πk(wi , α)
with respect to α.Backtracking can be used to ensure ascent.
Thierry Denœux Computational statistics February-March 2017 66 / 72
![Page 67: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/67.jpg)
Application to Regression models Mixture of experts
Example: motorcycle data
10 20 30 40 50
−10
0−
500
50
Motorcycle data
time
acce
lera
tion
library(’MASS’)x<-mcycle$timesy<-mcycle$accelplot(x,y)
Thierry Denœux Computational statistics February-March 2017 67 / 72
![Page 68: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/68.jpg)
Application to Regression models Mixture of experts
Mixture of experts using flexmix
library(flexmix)
K<-5res<-flexmix(y ˜ x,k=K,model=FLXMRglm(family="gaussian"),concomitant=FLXPmultinom(formula=˜x))
beta<- parameters(res)[1:2,]alpha<-res@concomitant@coef
Thierry Denœux Computational statistics February-March 2017 68 / 72
![Page 69: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/69.jpg)
Application to Regression models Mixture of experts
Plotting the posterior probabilities
xt<-seq(0,60,0.1)Nt<-length(xt)plot(x,y)pit=matrix(0,Nt,K)for(k in 1:K) pit[,k]<-exp(alpha[1,k]+alpha[2,k]*xt)pit<-pit/rowSums(pit)
plot(xt,pit[,1],type="l",col=1)for(k in 2:K) lines(xt,pit[,k],col=k)
Thierry Denœux Computational statistics February-March 2017 69 / 72
![Page 70: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/70.jpg)
Application to Regression models Mixture of experts
Posterior probabilities
0 10 20 30 40 50 60
0.0
0.2
0.4
0.6
0.8
1.0
Motorcycle data − posterior probabilities
x
π k
Thierry Denœux Computational statistics February-March 2017 70 / 72
![Page 71: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/71.jpg)
Application to Regression models Mixture of experts
Plotting the predictions
yhat<-rep(0,Nt)for(k in 1:K) yhat<-yhat+pit[,k]*(beta[1,k]+beta[2,k]*xt)
plot(x,y,main="Motorcycle data",xlab="time",ylab="acceleration")for(k in 1:K) abline(beta[1:2,k],lty=2)lines(xt,yhat,col=’red’,lwd=2)
Thierry Denœux Computational statistics February-March 2017 71 / 72
![Page 72: EMalgorithm ThierryDenœux February-March2017 · Some variants Facilitating the E-step Monte Carlo EM (MCEM) ReplacethetthEstepwith 1 Draw missing datasets Z (t) 1;:::;Z (t) m(t)](https://reader035.vdocuments.mx/reader035/viewer/2022070806/5f048aaf7e708231d40e7bd5/html5/thumbnails/72.jpg)
Application to Regression models Mixture of experts
Regression lines and predictions
10 20 30 40 50
−10
0−
500
50Motorcycle data
time
acce
lera
tion
Thierry Denœux Computational statistics February-March 2017 72 / 72