Computational Statistics, 2nd Edition
Chapter 4: EM Optimization Methods
Presented by: Weiyu Li, Jincheng Pang
2018.03
Focus
1. Introduction: the MM Algorithm
2. The EM Algorithm
Examples, Convergence, Variance Estimation
3. Improvements
MCEM, ECM, EM gradient, Acceleration methods
Introduction: MM
MM: Majorize-Minimization or Minorize-Maximization.
Algorithm: find a surrogate function g(θ|θ(t)) that minorizes the (concave) objective function f(θ), and maximize g.
1. Find g(θ|θ(t)) satisfying
   g(θ|θ(t)) ≤ f(θ) for all θ,  and  g(θ(t)|θ(t)) = f(θ(t)).   (1)
2. Maximization: θ(t+1) = argmax_θ g(θ|θ(t)).
3. Stop or return to 1.
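As a concrete illustration of the majorize-minimization side of MM (an assumed example, not taken from the slides), the sample median minimizes f(x) = Σ_i |x − a_i|, and each term |x − a_i| can be majorized at x(t) by the quadratic (x − a_i)²/(2|x(t) − a_i|) + |x(t) − a_i|/2. Minimizing the quadratic surrogate gives a weighted-average update; a minimal Python sketch:

import numpy as np

def mm_median(a, x0=None, iters=50, eps=1e-8):
    """MM (majorize-minimize) iteration for a 1-D median.

    Each |x - a_i| is majorized at x_t by a quadratic, so the surrogate
    is minimized by a weighted average of the data points.
    """
    a = np.asarray(a, dtype=float)
    x = a.mean() if x0 is None else float(x0)
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(x - a), eps)  # eps guards against division by zero
        x = np.sum(w * a) / np.sum(w)
    return x

data = np.array([1.0, 2.0, 3.5, 7.0, 10.0])
print(mm_median(data), np.median(data))  # both should be close to 3.5

Each surrogate touches f at x(t) and lies above it elsewhere, so every step is guaranteed not to increase f, mirroring the monotonicity property illustrated in Figure 1.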
Figure 1: Illustration of how MM works.
Introduction: What’s E-M for?
EM can be treated as a special case of the MM algorithm.
• Expectation
Missing data Z ← expectation given the observed data X
Complete data Y = (X,Z)
• Maximization (aim)
Maximize L(θ|X)
*Bayesian: estimate the mode of a posterior distribution f (θ|X)
(maximum a posteriori estimation)
• Y|θ and Z|(X,θ) are often easier to work with
• Latent variables: not actually missing, just unobserved
• Bayesian: parameters rather than data
EM Algorithm
Q(θ|θ(t)) := E{ log L(θ|Y) | x, θ(t) }   (2)
           = E{ log fY(Y|θ) | x, θ(t) }   (3)
           = ∫ [ log fY(y|θ) ] fZ|X(z|x,θ(t)) dz   (4)
Initial: θ(0)
Iterations: alternate between the E step and the M step.
1. E: Compute Q(θ|θ(t)).
2. M: θ(t+1) = argmaxθQ(θ|θ(t)).
3. Stop or return to 1.
Stopping criteria: e.g., a small change d(θ(t+1), θ(t)), or a small change d(Q(θ(t+1)|θ(t)), Q(θ(t)|θ(t))), etc.
Example 1: How EM works
Y1, Y2 ∼ i.i.d. Exp(θ) with y1 = 5 observed but y2 missing.
The complete-data log-likelihood is log L(θ|y) = 2 log θ − θ(y1 + y2), and E{Y2 | θ(t)} = 1/θ(t). Thus
Q(θ|θ(t)) = 2 log{θ} − 5θ − θ/θ(t)   (5)
Updating equation: θ(t+1) = 2θ(t)/(5θ(t) + 1), which converges to θ̂ = 0.2.
Easy analytic solution. No need of EM at all!
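A tiny Python check of this iteration (any positive starting value works):

theta = 1.0                                # arbitrary positive starting value theta^(0)
for t in range(25):
    theta = 2 * theta / (5 * theta + 1)    # EM update obtained by maximizing (5)
print(theta)                               # approaches the fixed point 0.2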
Example 2: Peppered moths
Alleles C > I > T (C is dominant to I, and I is dominant to T).
How do we estimate allele frequencies from phenotype counts?
Hardy-Weinberg principle: if the allele frequencies in the population are pC, pI, and pT, then the genotype frequencies should be pC², 2pC pI, 2pC pT, pI², 2pI pT, and pT², for genotypes CC, CI, CT, II, IT, and TT, respectively.
Observations: phenotypes x = (nC, nI, nT), where n = nC + nI + nT.
Complete data: y = (nCC, nCI, nCT, nII, nIT, nTT).
Aim: estimate p = (pC, pI), where pT = 1 − pC − pI.
x = (nC, nI, nT) = M(y) = (nCC + nCI + nCT, nII + nIT, nTT)
Example 2: Peppered moths - computation
log{fY(y|p)} = nCC log{pC²} + nCI log{2pC pI} + · · ·
               + log[ n! / (nCC! nCI! nCT! nII! nIT! nTT!) ].   (6)
E-Step:
Q(p|p(t)) = nCC(t) log{pC²} + nCI(t) log{2pC pI} + · · · + nTT(t) log{pT²} + k(nC, nI, nT, p(t)),   (7)
where, for example,
nCC(t) = E{NCC | nC, nI, nT, p(t)} = nC (pC(t))² / [ (pC(t))² + 2 pC(t) pI(t) + 2 pC(t) pT(t) ],
and so forth.
M-Step:
Setting dQ(p|p(t))/dpC = dQ(p|p(t))/dpI = 0 yields
pC(t+1) = [ 2 nCC(t) + nCI(t) + nCT(t) ] / (2n),
pI(t+1) = [ 2 nII(t) + nIT(t) + nCI(t) ] / (2n), and
pT(t+1) = [ 2 nTT(t) + nCT(t) + nIT(t) ] / (2n).
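A minimal Python sketch of this E/M cycle for the peppered moth data (variable names are ours; it should reproduce the trajectory shown in Table 1 below):

import numpy as np

def em_moths(nC=85, nI=196, nT=341, iters=20):
    n = nC + nI + nT
    pC, pI = 1 / 3, 1 / 3                        # starting values p^(0)
    for _ in range(iters):
        pT = 1 - pC - pI
        # E step: expected genotype counts given the phenotype counts and p^(t)
        denC = pC**2 + 2 * pC * pI + 2 * pC * pT
        nCC = nC * pC**2 / denC
        nCI = nC * 2 * pC * pI / denC
        nCT = nC * 2 * pC * pT / denC
        denI = pI**2 + 2 * pI * pT
        nII = nI * pI**2 / denI
        nIT = nI * 2 * pI * pT / denI
        nTT = nT                                 # TT is the only genotype showing phenotype T
        # M step: allele-counting updates
        pC = (2 * nCC + nCI + nCT) / (2 * n)
        pI = (2 * nII + nIT + nCI) / (2 * n)
    return pC, pI, 1 - pC - pI

print(em_moths())   # roughly (0.070837, 0.188737, 0.740426)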
Example 2: Peppered moths - simulation results
Observed data: nC = 85, nI = 196, and nT = 341.
Table 1: EM results for the peppered moth example. R(t) is the relative convergence criterion; DC(t) and DI(t) are ratios of consecutive errors.

t   pC(t)      pI(t)      R(t)          DC(t)    DI(t)
0   0.333333   0.333333
1   0.081994   0.237406   5.7 × 10^−1   0.0425   0.337
2   0.071249   0.197870   1.6 × 10^−1   0.0369   0.188
3   0.070852   0.190360   3.6 × 10^−2   0.0367   0.178
4   0.070837   0.189023   6.6 × 10^−3   0.0367   0.176
5   0.070837   0.188787   1.2 × 10^−3   0.0367   0.176
6   0.070837   0.188745   2.1 × 10^−4   0.0367   0.176
7   0.070837   0.188738   3.6 × 10^−5   0.0367   0.176
8   0.070837   0.188737   6.4 × 10^−6   0.0367   0.176
Notice that the error ratios DC(t) and DI(t) settle down to constants, which is what one expects for convergence of order β = 1 (linear convergence).
Convergence
Note that log fX(x|θ) = log fY(y|θ) − log fZ|X(z|x,θ); taking expectations with respect to Z | (x, θ(t)) yields
log fX(x|θ) = Q(θ|θ(t)) − H(θ|θ(t)),   (8)
where H(θ|θ(t)) = E{ log fZ|X(Z|x,θ) | x, θ(t) }.
Claim: max_θ H(θ|θ(t)) = H(θ(t)|θ(t)). (Hint: use Jensen's inequality.)
Therefore, increasing Q(θ|θ(t)) leads to increasing log fX(x|θ) (our aim!).
Generalized EM (GEM): merely increase Q(θ|θ(t)), i.e. Q(θ(t+1)|θ(t)) > Q(θ(t)|θ(t)).
Convergence order: Linear (slow!). Rate is inversely related to the
proportion of missing data.
Remarks about EM
• Ease of implementation and stable ascent.
• Optimization transfer: letting G(θ|θ(t)) := Q(θ|θ(t)) + l(θ(t)|x) − Q(θ(t)|θ(t)) yields
the surrogate function g of the MM algorithm!
– Q(θ|θ(t)), G(θ|θ(t)) maximized at the same θ.
– minorizing function: G(θ|θ(t)) ≤ l(θ|x),∀θ.
– G is tangent to l at θ(t).
G is more convenient to maximize.
Each E step forms a minorizing function G, and each M step maximizes
it to provide an uphill step.
Discussion: Exponential families
Derivations:
f (y|θ) = c1(y)c2(θ) exp{θTs(y)}
Q(θ|θ(t)) = k + log c2(θ) + ∫ θT s(y) fZ|X(z|x,θ(t)) dz
Setting Q′(θ|θ(t)) = 0 yields
−c2′(θ)/c2(θ) = ∫ s(y) fZ|X(z|x,θ(t)) dz.
Note that c2′(θ) = −c2(θ) E{s(Y)|θ}, so θ(t+1) is the solution of
E{s(Y)|θ} = ∫ s(y) fZ|X(z|x,θ(t)) dz.   (9)
Algorithm:
1. E step: Compute s(t) := E{s(Y) | x, θ(t)} = ∫ s(y) fZ|X(z|x,θ(t)) dz.
2. M step: θ(t+1) solves E{s(Y)|θ} = s(t).
3. Stop or return to 1.
Variance estimation: Outline
• Aim: estimate Var(θ̂) → compute the observed information −l″(θ̂|x).
(Bayesian: the Hessian of the log posterior density)
• Theoretical derivations: Louis’s method.
• Methods:
– SEM: easy, fast, reliable.
– Bootstrapping: easier, nested looping.
– Others: empirical information, numerical differentiation, ...
Variance estimation: Louis’s method
Taking second derivatives of log fX(x|θ) = Q(θ|θ(t)) − H(θ|θ(t)) with respect to θ
yields
−l′′(θ|x) = −Q′′(θ|ω)|ω=θ + H′′(θ|ω)|ω=θ (10)
Define iX(θ) = −l″(θ|x), iY(θ) = −Q″(θ|ω)|ω=θ (= −E{l″(θ|Y) | x, θ(t)}), and
iZ|X(θ) = −H″(θ|ω)|ω=θ = VarZ|X{ d log fZ|X(Z|x,θ) / dθ }.
Missing information principle:
iX(θ) = iY(θ) − iZ|X(θ),   (11)
i.e., (observed information) = (complete information) − (missing information).   (12)
Variance estimation: Louis’s method - remarks
• Define SZ|X(θ) = d log fZ|X(z|x,θ)/dθ. Then
iZ|X(θ) = ∫ SZ|X(θ) SZ|X(θ)T fZ|X(z|x,θ) dz   (13)
since E{SZ|X(θ)} = 0.
• Avoids calculations of θ|X; sometimes easier to derive and code.
• If these quantities are difficult to compute analytically, use a Monte Carlo method: e.g., estimate iY(θ) by
(1/m) Σ_{i=1}^m { −d² log fY(yi|θ) / dθ² },   (14)
where yi is the completed dataset formed from x and zi, and the zi are drawn i.i.d. from fZ|X.
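To make (14) concrete, here is a hedged Monte Carlo sketch for the peppered moth example (purely illustrative, since iY can be obtained analytically there; the complete-data information for (pC, pI) below comes from differentiating the complete-data log-likelihood (6) twice, and all variable names are ours):

import numpy as np

rng = np.random.default_rng(0)
nC, nI, nT = 85, 196, 341
pC, pI = 0.070837, 0.188737                      # MLE from the EM run
pT = 1 - pC - pI

def complete_data_info(counts, pC, pI):
    """Negative Hessian of the complete-data log-likelihood w.r.t. (pC, pI)."""
    nCC, nCI, nCT, nII, nIT, nTT = counts
    pT = 1 - pC - pI
    a = 2 * nCC + nCI + nCT                      # coefficient of log pC
    b = 2 * nII + nIT + nCI                      # coefficient of log pI
    c = 2 * nTT + nCT + nIT                      # coefficient of log pT
    return np.array([[a / pC**2 + c / pT**2, c / pT**2],
                     [c / pT**2, b / pI**2 + c / pT**2]])

m = 5000
acc = np.zeros((2, 2))
for _ in range(m):
    # draw the missing genotype counts z_i from f_{Z|X}(z | x, p)
    probC = np.array([pC**2, 2 * pC * pI, 2 * pC * pT]); probC /= probC.sum()
    nCC, nCI, nCT = rng.multinomial(nC, probC)
    probI = np.array([pI**2, 2 * pI * pT]); probI /= probI.sum()
    nII, nIT = rng.multinomial(nI, probI)
    acc += complete_data_info((nCC, nCI, nCT, nII, nIT, nT), pC, pI)

print(acc / m)   # Monte Carlo estimate of i_Y(p), cf. equation (14)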
Variance estimation: Example - censored exponential data
Observed data: for Y1, . . . , Yn i.i.d. ∼ Exp(λ),
xi = (ci, 0) if yi > ci (censored), and xi = (yi, 1) if yi ≤ ci (uncensored).   (15)
Complete-data log-likelihood: l(λ|y) = n log λ − λ Σ_{i=1}^n yi.
Q(λ|λ(t)) = n log λ − λ Σ_{i=1}^n E{Yi | xi, λ(t)}   (16)
          = n log λ − λ Σ_{i=1}^n [ yi δi + ci (1 − δi) ] − λC/λ(t),   (17)
where δi = 1{i is uncensored} and C = Σ_{i=1}^n (1 − δi) denotes the number of censored cases.
Therefore iY(λ) = −Q″(λ|λ(t)) = n/λ², and we can also calculate
iZ|X(λ) = Var{ d log fZ|X(z|x,λ) / dλ } = C/λ².
Applying Louis's method, we find iX(λ) = U/λ², where U = Σ_{i=1}^n δi denotes the number of uncensored cases.
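A small numeric sketch of these formulas on assumed data (maximizing (17) gives the EM update λ(t+1) = n/(T + C/λ(t)), where T = Σ[yiδi + ci(1 − δi)]; its fixed point is λ̂ = U/T):

import numpy as np

# assumed data: observed value (y_i if uncensored, c_i if censored) and delta_i
vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])
delta = np.array([1, 1, 0, 1, 0, 1])
n = len(vals)
U = int(delta.sum())            # number of uncensored cases
C = n - U                       # number of censored cases
T = vals.sum()                  # sum of y_i (uncensored) and c_i (censored)

lam = 1.0                       # lambda^(0)
for _ in range(100):
    lam = n / (T + C / lam)     # EM update obtained by maximizing (17)

i_Y  = n / lam**2               # complete information
i_ZX = C / lam**2               # missing information
i_X  = i_Y - i_ZX               # Louis's method: observed information = U / lambda^2
print(lam, U / T)               # EM estimate and its closed-form fixed point
print(1 / i_X, lam**2 / U)      # variance estimate for lambda-hat, two equivalent forms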
Variance estimation: SEM - introduction
Motivation:
Let Ψ denote the EM mapping, with fixed point θ̂ and Jacobian matrix Ψ′(θ) whose (i, j)th element is dΨi(θ)/dθj. It can be shown that
Ψ′(θ̂)T = iZ|X(θ̂) iY(θ̂)−1.   (18)
Further use of the missing information principle leads to
Var{θ̂} = iY(θ̂)−1 ( I + Ψ′(θ̂)T (I − Ψ′(θ̂)T)−1 ).   (19)
SEM works with complete-data quantities and the incremental matrix Ψ′, so the uncertainty due to the missing data does not have to be handled directly.
SEM is more stable than the generic numerical differentiation approach.
Aim: estimate Ψ′(θ̂).
Variance estimation: SEM - algorithm
1. Find θ̂ by standard EM.
2. Restart from some θ(0) closer to θ̂. For t = 0, 1, 2, . . .
(a) Produce θ(t+1) from θ(t) by standard EM.
(b) Define θ(t)(j) = (θ̂1, . . . , θ̂j−1, θj(t), θ̂j+1, . . . , θ̂p) and calculate
rij(t) = [ Ψi(θ(t)(j)) − θ̂i ] / [ θj(t) − θ̂j ].   (20)
(c) Stop when convergence criteria met.
• Plug the final estimate of Ψ′(θ) into (19) to get the variance.
• Asymmetry (slightly).
• No inverse.
• transformation of θ.
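A minimal one-dimensional SEM sketch, reusing the censored exponential example with assumed data (in one dimension Ψ′(θ̂) is a scalar and (19) reduces to Var ≈ iY(θ̂)−1 / (1 − Ψ′(θ̂))):

import numpy as np

vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values (y_i or c_i)
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n, U, C, T = len(vals), int(delta.sum()), int((1 - delta).sum()), vals.sum()

def em_map(lam):                                   # the EM mapping Psi(lambda)
    return n / (T + C / lam)

# 1. find lambda-hat by running standard EM to convergence
lam_hat = 1.0
for _ in range(200):
    lam_hat = em_map(lam_hat)

# 2. restart near lambda-hat; the ratio r^(t) estimates Psi'(lambda-hat)
lam, r = 1.3 * lam_hat, 0.0
while abs(lam - lam_hat) > 1e-8:
    r = (em_map(lam) - lam_hat) / (lam - lam_hat)  # equation (20), one-dimensional case
    lam = em_map(lam)

var_sem   = (lam_hat**2 / n) / (1 - r)             # plug into (19), one-dimensional case
var_louis = lam_hat**2 / U                         # Louis's method result, for comparison
print(r, C / n)        # by (18), Psi'(lambda-hat) = i_Z|X / i_Y = C/n for this model
print(var_sem, var_louis)                          # the two variance estimates should agree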
Variance estimation: SEM - remark
Why restart?
Using
rij(t) = [ Ψi(θ1(t−1), . . . , θj−1(t−1), θj(t), θj+1(t−1), . . . , θp(t−1)) − Ψi(θ(t−1)) ] / [ θj(t) − θj(t−1) ]
(i.e., differencing between successive iterates rather than about θ̂) will not require fewer iterations overall and will be less stable.
Restarting from a point closer to θ̂ offsets the cost of the steps already spent finding θ̂.
Variance estimation: Other methods
Bootstrapping:
1. Initialization: θ̂1 = θ̂EM, computed from the observed data x1, . . . , xn.
2. Pseudo-data: for j = 2, . . . , B, let θ̂j = θ̂EM(j), computed from a pseudo-dataset x1(j), . . . , xn(j) drawn randomly with replacement from the observed data.
3. Estimate E{f(θ̂)} by (1/B) Σ_{j=1}^B f(θ̂j); in particular, the sample variance of the θ̂j estimates Var(θ̂).
The nested looping can be computationally burdensome.
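A brief bootstrap sketch for the same censored exponential setting (assumed data; each resample runs its own EM loop, which is the nested looping referred to above):

import numpy as np

rng = np.random.default_rng(1)
vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n = len(vals)

def em_estimate(v, d, iters=100):
    """Run EM for the censored exponential model on one (pseudo-)dataset."""
    T, C = v.sum(), (1 - d).sum()
    lam = 1.0
    for _ in range(iters):
        lam = len(v) / (T + C / lam)
    return lam

lam_hat = em_estimate(vals, delta)                 # theta-hat_1 from the observed data
B = 1000
boot = np.empty(B)
for j in range(B):                                 # pseudo-data resampled with replacement
    idx = rng.integers(0, n, size=n)
    boot[j] = em_estimate(vals[idx], delta[idx])

print(lam_hat, boot.var(ddof=1))                   # bootstrap estimate of Var(lambda-hat)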
Empirical information: (related to the Fisher information)
(1/n) Σ_{i=1}^n l′(θ|xi) l′(θ|xi)T − (1/n²) l′(θ|x) l′(θ|x)T   (21)
for i.i.d. data.
All terms are by-products of the M step: since H(θ|θ(t)) is maximized at θ = θ(t), we have l′(θ|x)|θ=θ(t) = Q′(θ|θ(t))|θ=θ(t).
Numerical differentiation
Trade-off between inaccuracy from the size of the perturbation and round-off error.
Improvements: Outline
1. E step
• Monte Carlo EM (MCEM): replace the expectation with a Monte Carlo sample mean.
2. M step
• Expectation Conditional Maximization (ECM): replace the M step with a cycle of conditional maximizations.
• EM gradient: a single step of Newton's method.
3. Acceleration methods
• Aitken acceleration: a Newton update with Taylor expansion
approximation.
• Quasi-Newton acceleration: Quasi-Newton update.
Improvements: MCEM
Replace the E step with
1. Draw Z1(t), . . . , Zm(t)(t) i.i.d. from fZ|X(z|x,θ(t)).
2. Calculate Q̂(t+1)(θ|θ(t)) = (1/m(t)) Σ_{j=1}^{m(t)} log fY(Yj(t)|θ), a Monte Carlo estimate of Q(θ|θ(t)), where Yj(t) denotes x completed with the draw Zj(t).
• Choice of m(t): start small and increase as the iterations proceed.
• Convergence: the iterates eventually bounce around the true maximum (the Monte Carlo error prevents exact convergence).
Example: censored exponential data reviewed
• Ordinary EM update: λ(t+1) = n / ( Σ_{i=1}^n xi + C/λ(t) ).
• Using MCEM: Q̂(t+1)(λ|λ(t)) = n log λ − (λ/m(t)) Σ_{j=1}^{m(t)} YjT 1, where Yj consists of the uncensored observations together with imputed values Zj1, . . . , ZjC, and Zjk − ck ∼ i.i.d. Exp(λ(t)), k = 1, . . . , C.
Therefore the MCEM update is λ(t+1) = n / [ (1/m(t)) Σ_{j=1}^{m(t)} YjT 1 ].
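A small Python sketch of this MCEM update on assumed data (by memorylessness, the imputed value for the kth censored case is Zjk = ck plus an Exp(λ(t)) draw):

import numpy as np

rng = np.random.default_rng(2)
vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values (y_i or c_i)
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n = len(vals)
y_obs = vals[delta == 1]                           # uncensored observations
c     = vals[delta == 0]                           # censoring times

lam = 1.0
for t in range(50):
    m = 100 * (t + 1)                              # let m^(t) grow with the iteration
    totals = np.empty(m)
    for j in range(m):
        z = c + rng.exponential(1.0 / lam, size=len(c))  # Z_jk - c_k ~ Exp(lambda^(t))
        totals[j] = y_obs.sum() + z.sum()          # Y_j^T 1 for the j-th completed dataset
    lam = n / totals.mean()                        # MCEM update

print(lam, delta.sum() / vals.sum())               # MCEM iterate vs. the EM fixed point U/T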
Improvements: ECM
Use S conditional maximization (CM) steps to simplify the original maximization problem.
θ(t+s/S) = argmax_θ Q(θ|θ(t)) subject to the constraint gs(θ) = gs(θ(t+(s−1)/S)),   (22)
for s = 1, . . . , S, and set θ(t+1) = θ(t+S/S).
Choice of constraints:
Partition θ into S subvectors, θ = (θ1, . . . ,θS).
1. gs(θ) = (θ1, . . . ,θs−1,θs+1, . . . ,θS), i.e. holding θ(−s) fixed. (Gauss-Seidel)
2. gs(θ) = θs, i.e. holding θs fixed.
Nested iterative loops.
Improvements: ECM - (sketched) example
Multivariate regression
Let U1, . . . ,Un be independent where
Ui ∼ Nd (µi,Σ) (23)
for µi = Viβ, where Vi are known d× p matrices, and β and Σ are unknown.
Algorithm: partition the unknown parameters into (β,Σ).
• E-step: Find the expectation of the complete-data sufficient statistics conditional on the observed data and β(t), Σ(t). (The sufficient statistics are Σ_{i=1}^n Uij for j = 1, . . . , d and Σ_{i=1}^n Uij Uik for j, k = 1, . . . , d.)
• CM 1: β(t+1/2) is estimated given Σ = Σ(t).
• CM 2: Σ(t+2/2) is estimated given β = β(t+1/2).
• Return to the E-step.
Improvements: EM gradient
Since the aim of the M step is to maximize Q(θ|θ(t)), replace the full maximization with a single Newton step:
θ(t+1) = θ(t) − Q″(θ|θ(t))−1|θ=θ(t) Q′(θ|θ(t))|θ=θ(t)   (24)
        = θ(t) − Q″(θ|θ(t))−1|θ=θ(t) l′(θ(t)|x)   (25)
• Avoid the computational burden of nested looping.
• Same rate of convergence to θ̂ as ordinary EM.
• Choice of step length: scaling the step can ensure ascent, while inflating it can speed convergence.
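A one-dimensional sketch of (25), again on assumed data for the censored exponential example: there Q″(λ|λ(t)) = −n/λ² and the observed-data score is l′(λ|x) = U/λ − T with T = Σ[yiδi + ci(1 − δi)], so the EM gradient step is λ(t+1) = λ(t) + (λ(t)²/n) l′(λ(t)|x).

import numpy as np

vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n, U, T = len(vals), delta.sum(), vals.sum()

lam = 1.0
for _ in range(50):
    score = U / lam - T                  # l'(lambda | x), the observed-data score
    lam = lam + (lam**2 / n) * score     # single Newton step on Q, equation (25)
print(lam, U / T)                        # should approach the MLE U/T

(In practice a step-length safeguard, as mentioned above, would be used to protect the ascent property.)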
Improvements: Aitken acceleration
Acceleration methods speed convergence by Newton-like steps.
To maximize l(θ|x), the Newton update would be
θ(t+1) = θ(t) − l′′(θ(t)|x)−1l′(θ(t)|x). (26)
To replace l′(θ(t)|x), note that l′(θ(t)|x) = Q′(θ|θ(t))|θ=θ(t) and the Taylor expansion
0 = Q′(θ|θ(t))|θ=θEM(t+1) ≈ Q′(θ|θ(t))|θ=θ(t) − iY(θ(t)) (θEM(t+1) − θ(t)).
Therefore the update would be
θ(t+1) = θ(t) − l″(θ(t)|x)−1 iY(θ(t)) (θEM(t+1) − θ(t)).   (27)
• Since iY is included, the increment is larger when more information is missing.
• The approximation is precise only when θ(t) is near θ̂, so initially iterate ordinary EM.
• Equivalent to applying Newton's method to find a zero of Ψ(θ) − θ, where Ψ is the EM mapping producing θ(t+1) = Ψ(θ(t)).
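A matching one-dimensional sketch of (27) for the censored exponential example (assumed data): with l″(λ|x) = −U/λ² and iY(λ) = n/λ², the accelerated step reduces to λ(t+1) = λ(t) + (n/U)(λEM(t+1) − λ(t)).

import numpy as np

vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n, U, C, T = len(vals), delta.sum(), (1 - delta).sum(), vals.sum()

lam = 0.3                                  # start after a few ordinary EM iterations
for _ in range(10):
    lam_em = n / (T + C / lam)             # ordinary EM update from lambda^(t)
    lam = lam + (n / U) * (lam_em - lam)   # Aitken-accelerated update, equation (27)
print(lam, U / T)                          # converges quickly to the MLE U/T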
Improvements: Quasi-Newton acceleration
Newton-like method: θ(t+1) = θ(t) − (M(t))−1l′(θ(t)|x).
Take
M(t) = Q″(θ|θ(t))|θ=θ(t) − B(t),
where B(t) approximates H″(θ(t)|θ(t)), so that M(t) approximates l″(θ(t)|x).
Quasi-Newton EM algorithm:
• Start with B(0) = 0.
• Gradually accumulate information about H″ using the secant condition
B(t+1) (θ(t+1) − θ(t)) = H′(θ|θ(t+1))|θ=θ(t+1) − H′(θ|θ(t))|θ=θ(t).
Then quasi-Newton methods like BFGS can be used.
• Quasi-Newton EM vs. EM gradient: keeping B(t) = 0 for all t recovers the EM gradient algorithm.
• Scaling can be used to guarantee that M(t) is negative definite.
• A potentially superior strategy approximates (l′′)−1 instead of l′′.
Figure 2: Steps taken by the EM gradient algorithm (long dashes). Ordinary EM steps are shown with the solid line. Steps from two methods from later sections (Aitken and quasi-Newton acceleration) are also shown, as indicated in the key. The observed-data log likelihood is shown with the grey scale, with light shading corresponding to high likelihood. All algorithms were started from pC = pI = 1/3.
Miscellanea: Gaussian Mixture Model (GMM)
Data from K different normal distributions:
p(x) = Σ_{k=1}^K πk N(x|µk, Σk),   (28)
where the mixture coefficient πk is the weight of the kth component.
We can use EM to estimate the unknown πk, µk, Σk by introducing latent indicator variables zk, with zk = 1 iff the point belongs to the kth normal component.
From a Bayesian point of view, with prior density p(z) and likelihood p(x|z), the posterior density is
γ(zk) = p(zk = 1|x) = πk N(x|µk,Σk) / Σ_{j=1}^K πj N(x|µj,Σj).
Knowing πk(t), µk(t), Σk(t), the updates work out to be
µk(t+1) = (1/Nk(t)) Σ_{n=1}^N γ(t)(znk) xn,   (29)
Σk(t+1) = (1/Nk(t)) Σ_{n=1}^N γ(t)(znk) (xn − µk(t+1)) (xn − µk(t+1))T,   (30)
πk(t+1) = Nk(t) / N,   (31)
where Nk(t) = Σ_{n=1}^N γ(t)(znk).
Because Σk(t+1) uses the freshly updated µk(t+1), the M step looks more like a CM cycle than an ordinary single M step.
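A compact univariate sketch of these updates on synthetic data (two components; variances σk² play the role of Σk, and all names are ours):

import numpy as np

rng = np.random.default_rng(3)
# synthetic data from two univariate normal components
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])
N, K = len(x), 2

pi  = np.full(K, 1.0 / K)              # mixing weights pi_k
mu  = np.array([-1.0, 1.0])            # component means
var = np.array([1.0, 1.0])             # component variances

for _ in range(100):
    # E step: responsibilities gamma(z_nk) = p(z_k = 1 | x_n)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: equations (29)-(31)
    Nk  = gamma.sum(axis=0)
    mu  = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi  = Nk / N

print(pi, mu, np.sqrt(var))            # should roughly recover the generating parameters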