Computational Statistics, 2nd Edition
Chapter 4: EM Optimization Methods
Presented by: Weiyu Li, Jincheng Pang
2018.03
Focus
1. Introduction: the MM Algorithm
2. The EM Algorithm
Examples, Convergence, Variance Estimation
3. Improvements
MCEM, ECM, EM gradient, Acceleration methods
Introduction: MM
MM: Majorize-Minimization or Minorize-Maximization.
Algorithm: find a surrogate function g(θ|θ(t)) that minorizes the (concave) objective function f(θ), and maximize g.
1. Find g(θ|θ(t)) satisfying
   g(θ|θ(t)) ≤ f(θ) for all θ,  and  g(θ(t)|θ(t)) = f(θ(t)).   (1)
2. Maximization: θ(t+1) = argmax_θ g(θ|θ(t)).
3. Stop or return to 1.
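As a concrete illustration of the majorize-minimization side of MM (an assumed example, not taken from the slides), the sample median minimizes f(x) = Σ_i |x − a_i|, and each term |x − a_i| can be majorized at x(t) by the quadratic (x − a_i)²/(2|x(t) − a_i|) + |x(t) − a_i|/2. Minimizing the quadratic surrogate gives a weighted-average update; a minimal Python sketch:

import numpy as np

def mm_median(a, x0=None, iters=50, eps=1e-8):
    """MM (majorize-minimize) iteration for a 1-D median.

    Each |x - a_i| is majorized at x_t by a quadratic, so the surrogate
    is minimized by a weighted average of the data points.
    """
    a = np.asarray(a, dtype=float)
    x = a.mean() if x0 is None else float(x0)
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(x - a), eps)  # eps guards against division by zero
        x = np.sum(w * a) / np.sum(w)
    return x

data = np.array([1.0, 2.0, 3.5, 7.0, 10.0])
print(mm_median(data), np.median(data))  # both should be close to 3.5

Each surrogate touches f at x(t) and lies above it elsewhere, so every step is guaranteed not to increase f, mirroring the monotonicity property illustrated in Figure 1.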
Figure 1: Illustration of how MM works.
Introduction: What’s E-M for?
EM can be treated as a special case of the MM algorithm.
• Expectation
Missing data Z ← expectation given the observed data X
Complete data Y = (X,Z)
• Maximization (aim)
Maximize L(θ|X)
*Bayesian: estimate the mode of a posterior distribution f (θ|X)
(maximum a posteriori estimation)
• Y|θ and Z|(X,θ) are often easier to work with
• Latent variables: not actually missing, just unobserved
• Bayesian: parameters rather than data
EM Algorithm
Q(θ|θ(t)) := E{ log L(θ|Y) | x, θ(t) }   (2)
           = E{ log fY(Y|θ) | x, θ(t) }   (3)
           = ∫ [ log fY(y|θ) ] fZ|X(z|x,θ(t)) dz   (4)
Initial: θ(0)
Iterations: alternate between the E step and the M step.
1. E: Compute Q(θ|θ(t)).
2. M: θ(t+1) = argmaxθQ(θ|θ(t)).
3. Stop or return to 1.
Stopping criteria: e.g., a small change d(θ(t+1), θ(t)), or a small change d(Q(θ(t+1)|θ(t)), Q(θ(t)|θ(t))), etc.
Example 1: How EM works
Y1, Y2 ∼ i.i.d. Exp(θ) with y1 = 5 observed but y2 missing.
The complete-data log-likelihood is log L(θ|y) = 2 log θ − θ(y1 + y2), and E{Y2 | θ(t)} = 1/θ(t). Thus
Q(θ|θ(t)) = 2 log{θ} − 5θ − θ/θ(t)   (5)
Updating equation: θ(t+1) = 2θ(t)/(5θ(t) + 1), which converges to θ̂ = 0.2.
Easy analytic solution. No need of EM at all!
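A tiny Python check of this iteration (any positive starting value works):

theta = 1.0                                # arbitrary positive starting value theta^(0)
for t in range(25):
    theta = 2 * theta / (5 * theta + 1)    # EM update obtained by maximizing (5)
print(theta)                               # approaches the fixed point 0.2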
Example 2: Peppered moths
Alleles C > I > T (C is dominant to I, and I is dominant to T).
How do we estimate allele frequencies from phenotype counts?
Hardy-Weinberg principle: if the allele frequencies in the population are pC, pI, and pT, then the genotype frequencies should be pC², 2pC pI, 2pC pT, pI², 2pI pT, and pT², for genotypes CC, CI, CT, II, IT, and TT, respectively.
Observations: phenotypes x = (nC, nI, nT), where n = nC + nI + nT.
Complete data: y = (nCC, nCI, nCT, nII, nIT, nTT).
Aim: estimate p = (pC, pI), where pT = 1 − pC − pI.
x = (nC, nI, nT) = M(y) = (nCC + nCI + nCT, nII + nIT, nTT)
Example 2: Peppered moths - computation
log{fY(y|p)} = nCC log{pC²} + nCI log{2pC pI} + · · ·
               + log[ n! / (nCC! nCI! nCT! nII! nIT! nTT!) ].   (6)
E-Step:
Q(p|p(t)) = nCC(t) log{pC²} + nCI(t) log{2pC pI} + · · · + nTT(t) log{pT²} + k(nC, nI, nT, p(t)),   (7)
where, for example,
nCC(t) = E{NCC | nC, nI, nT, p(t)} = nC (pC(t))² / [ (pC(t))² + 2 pC(t) pI(t) + 2 pC(t) pT(t) ],
and so forth.
M-Step:
Setting dQ(p|p(t))/dpC = dQ(p|p(t))/dpI = 0 yields
pC(t+1) = [ 2 nCC(t) + nCI(t) + nCT(t) ] / (2n),
pI(t+1) = [ 2 nII(t) + nIT(t) + nCI(t) ] / (2n), and
pT(t+1) = [ 2 nTT(t) + nCT(t) + nIT(t) ] / (2n).
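A minimal Python sketch of this E/M cycle for the peppered moth data (variable names are ours; it should reproduce the trajectory shown in Table 1 below):

import numpy as np

def em_moths(nC=85, nI=196, nT=341, iters=20):
    n = nC + nI + nT
    pC, pI = 1 / 3, 1 / 3                        # starting values p^(0)
    for _ in range(iters):
        pT = 1 - pC - pI
        # E step: expected genotype counts given the phenotype counts and p^(t)
        denC = pC**2 + 2 * pC * pI + 2 * pC * pT
        nCC = nC * pC**2 / denC
        nCI = nC * 2 * pC * pI / denC
        nCT = nC * 2 * pC * pT / denC
        denI = pI**2 + 2 * pI * pT
        nII = nI * pI**2 / denI
        nIT = nI * 2 * pI * pT / denI
        nTT = nT                                 # TT is the only genotype showing phenotype T
        # M step: allele-counting updates
        pC = (2 * nCC + nCI + nCT) / (2 * n)
        pI = (2 * nII + nIT + nCI) / (2 * n)
    return pC, pI, 1 - pC - pI

print(em_moths())   # roughly (0.070837, 0.188737, 0.740426)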
Example 2: Peppered moths - simulation results
Observed data: nC = 85, nI = 196, and nT = 341.
Table 1: EM results for the peppered moth example. R(t) is the relative convergence criterion; DC(t) and DI(t) are ratios of consecutive errors.

t   pC(t)      pI(t)      R(t)          DC(t)    DI(t)
0   0.333333   0.333333
1   0.081994   0.237406   5.7 × 10^−1   0.0425   0.337
2   0.071249   0.197870   1.6 × 10^−1   0.0369   0.188
3   0.070852   0.190360   3.6 × 10^−2   0.0367   0.178
4   0.070837   0.189023   6.6 × 10^−3   0.0367   0.176
5   0.070837   0.188787   1.2 × 10^−3   0.0367   0.176
6   0.070837   0.188745   2.1 × 10^−4   0.0367   0.176
7   0.070837   0.188738   3.6 × 10^−5   0.0367   0.176
8   0.070837   0.188737   6.4 × 10^−6   0.0367   0.176
Notice that the error ratios DC(t) and DI(t) settle down to constants, which is what one expects for convergence of order β = 1 (linear convergence).
Convergence
Note that log fX(x|θ) = log fY(y|θ) − log fZ|X(z|x,θ); taking expectations with respect to Z | (x, θ(t)) yields
log fX(x|θ) = Q(θ|θ(t)) − H(θ|θ(t)),   (8)
where H(θ|θ(t)) = E{ log fZ|X(Z|x,θ) | x, θ(t) }.
Claim: max_θ H(θ|θ(t)) = H(θ(t)|θ(t)). (Hint: use Jensen's inequality.)
Therefore, increasing Q(θ|θ(t)) leads to increasing log fX(x|θ) (our aim!).
Generalized EM (GEM): merely increase Q(θ|θ(t)), i.e. Q(θ(t+1)|θ(t)) > Q(θ(t)|θ(t)).
Convergence order: Linear (slow!). Rate is inversely related to the
proportion of missing data.
Remarks about EM
• Ease of implementation and stable ascent.
• Optimization transfer: letting G(θ|θ(t)) := Q(θ|θ(t)) + l(θ(t)|x) − Q(θ(t)|θ(t)) yields
the surrogate function g of the MM algorithm!
– Q(θ|θ(t)), G(θ|θ(t)) maximized at the same θ.
– minorizing function: G(θ|θ(t)) ≤ l(θ|x),∀θ.
– G is tangent to l at θ(t).
G is more convenient to maximize.
Each E step forms a minorizing function G, and each M step maximizes
it to provide an uphill step.
Discussion: Exponential families
Derivations:
f (y|θ) = c1(y)c2(θ) exp{θTs(y)}
Q(θ|θ(t)) = k + log c2(θ) + ∫ θT s(y) fZ|X(z|x,θ(t)) dz
Setting Q′(θ|θ(t)) = 0 yields
−c2′(θ)/c2(θ) = ∫ s(y) fZ|X(z|x,θ(t)) dz.
Note that c2′(θ) = −c2(θ) E{s(Y)|θ}, so θ(t+1) is the solution of
E{s(Y)|θ} = ∫ s(y) fZ|X(z|x,θ(t)) dz.   (9)
Algorithm:
1. E step: Compute s(t) := E{s(Y) | x, θ(t)} = ∫ s(y) fZ|X(z|x,θ(t)) dz.
2. M step: θ(t+1) solves E{s(Y)|θ} = s(t).
3. Stop or return to 1.
Variance estimation: Outline
• Aim: estimate Var(θ̂) → compute the observed information −l″(θ̂|x).
(Bayesian: the Hessian of the log posterior density)
• Theoretical derivations: Louis’s method.
• Methods:
– SEM: easy, fast, reliable.
– Bootstrapping: easier, nested looping.
– Others: empirical information, numerical differentiation, ...
Variance estimation: Louis’s method
Taking second derivatives of log fX(x|θ) = Q(θ|θ(t)) − H(θ|θ(t)) with respect to θ
yields
−l′′(θ|x) = −Q′′(θ|ω)|ω=θ + H′′(θ|ω)|ω=θ (10)
Define iX(θ) = −l″(θ|x), iY(θ) = −Q″(θ|ω)|ω=θ (= −E{l″(θ|Y) | x, θ(t)}), and
iZ|X(θ) = −H″(θ|ω)|ω=θ = VarZ|X{ d log fZ|X(Z|x,θ) / dθ }.
Missing information principle:
iX(θ) = iY(θ) − iZ|X(θ),   (11)
i.e., (observed information) = (complete information) − (missing information).   (12)
Variance estimation: Louis’s method - remarks
• Define SZ|X(θ) = d log fZ|X(z|x,θ)/dθ. Then
iZ|X(θ) = ∫ SZ|X(θ) SZ|X(θ)T fZ|X(z|x,θ) dz   (13)
since E{SZ|X(θ)} = 0.
• Avoids calculations of θ|X; sometimes easier to derive and code.
• If these quantities are difficult to compute analytically, use a Monte Carlo method: e.g., estimate iY(θ) by
(1/m) Σ_{i=1}^m { −d² log fY(yi|θ) / dθ² },   (14)
where yi is the completed dataset formed from x and zi, and the zi are drawn i.i.d. from fZ|X.
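To make (14) concrete, here is a hedged Monte Carlo sketch for the peppered moth example (purely illustrative, since iY can be obtained analytically there; the complete-data information for (pC, pI) below comes from differentiating the complete-data log-likelihood (6) twice, and all variable names are ours):

import numpy as np

rng = np.random.default_rng(0)
nC, nI, nT = 85, 196, 341
pC, pI = 0.070837, 0.188737                      # MLE from the EM run
pT = 1 - pC - pI

def complete_data_info(counts, pC, pI):
    """Negative Hessian of the complete-data log-likelihood w.r.t. (pC, pI)."""
    nCC, nCI, nCT, nII, nIT, nTT = counts
    pT = 1 - pC - pI
    a = 2 * nCC + nCI + nCT                      # coefficient of log pC
    b = 2 * nII + nIT + nCI                      # coefficient of log pI
    c = 2 * nTT + nCT + nIT                      # coefficient of log pT
    return np.array([[a / pC**2 + c / pT**2, c / pT**2],
                     [c / pT**2, b / pI**2 + c / pT**2]])

m = 5000
acc = np.zeros((2, 2))
for _ in range(m):
    # draw the missing genotype counts z_i from f_{Z|X}(z | x, p)
    probC = np.array([pC**2, 2 * pC * pI, 2 * pC * pT]); probC /= probC.sum()
    nCC, nCI, nCT = rng.multinomial(nC, probC)
    probI = np.array([pI**2, 2 * pI * pT]); probI /= probI.sum()
    nII, nIT = rng.multinomial(nI, probI)
    acc += complete_data_info((nCC, nCI, nCT, nII, nIT, nT), pC, pI)

print(acc / m)   # Monte Carlo estimate of i_Y(p), cf. equation (14)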
Variance estimation: Example - censored exponential data
Observed data: for Y1, . . . , Yn i.i.d. ∼ Exp(λ),
xi = (ci, 0) if yi > ci (censored), and xi = (yi, 1) if yi ≤ ci (uncensored).   (15)
Complete-data log-likelihood: l(λ|y) = n log λ − λ Σ_{i=1}^n yi.
Q(λ|λ(t)) = n log λ − λ Σ_{i=1}^n E{Yi | xi, λ(t)}   (16)
          = n log λ − λ Σ_{i=1}^n [ yi δi + ci (1 − δi) ] − λC/λ(t),   (17)
where δi = 1{i is uncensored} and C = Σ_{i=1}^n (1 − δi) denotes the number of censored cases.
Therefore iY(λ) = −Q″(λ|λ(t)) = n/λ², and we can also calculate
iZ|X(λ) = Var{ d log fZ|X(z|x,λ) / dλ } = C/λ².
Applying Louis's method, we find iX(λ) = U/λ², where U = Σ_{i=1}^n δi denotes the number of uncensored cases.
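A small numeric sketch of these formulas on assumed data (maximizing (17) gives the EM update λ(t+1) = n/(T + C/λ(t)), where T = Σ[yiδi + ci(1 − δi)]; its fixed point is λ̂ = U/T):

import numpy as np

# assumed data: observed value (y_i if uncensored, c_i if censored) and delta_i
vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])
delta = np.array([1, 1, 0, 1, 0, 1])
n = len(vals)
U = int(delta.sum())            # number of uncensored cases
C = n - U                       # number of censored cases
T = vals.sum()                  # sum of y_i (uncensored) and c_i (censored)

lam = 1.0                       # lambda^(0)
for _ in range(100):
    lam = n / (T + C / lam)     # EM update obtained by maximizing (17)

i_Y  = n / lam**2               # complete information
i_ZX = C / lam**2               # missing information
i_X  = i_Y - i_ZX               # Louis's method: observed information = U / lambda^2
print(lam, U / T)               # EM estimate and its closed-form fixed point
print(1 / i_X, lam**2 / U)      # variance estimate for lambda-hat, two equivalent forms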
Variance estimation: SEM - introduction
Motivation:
Let Ψ denote the EM mapping, with fixed point θ̂ and Jacobian matrix Ψ′(θ) whose (i, j)th element is dΨi(θ)/dθj. It can be shown that
Ψ′(θ̂)T = iZ|X(θ̂) iY(θ̂)−1.   (18)
Further use of the missing information principle leads to
Var{θ̂} = iY(θ̂)−1 ( I + Ψ′(θ̂)T (I − Ψ′(θ̂)T)−1 ).   (19)
SEM works with complete-data quantities and the incremental matrix Ψ′, so the uncertainty due to the missing data does not have to be handled directly.
SEM is more stable than the generic numerical differentiation approach.
Aim: estimate Ψ′(θ̂).
Variance estimation: SEM - algorithm
1. Find θ̂ by standard EM.
2. Restart from some θ(0) closer to θ̂. For t = 0, 1, 2, . . .
(a) Produce θ(t+1) from θ(t) by standard EM.
(b) Define θ(t)(j) = (θ̂1, . . . , θ̂j−1, θj(t), θ̂j+1, . . . , θ̂p) and calculate
rij(t) = [ Ψi(θ(t)(j)) − θ̂i ] / [ θj(t) − θ̂j ].   (20)
(c) Stop when convergence criteria met.
• Plug the final estimate of Ψ′(θ) into (19) to get the variance.
• Asymmetry (slightly).
• No inverse.
• transformation of θ.
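A minimal one-dimensional SEM sketch, reusing the censored exponential example with assumed data (in one dimension Ψ′(θ̂) is a scalar and (19) reduces to Var ≈ iY(θ̂)−1 / (1 − Ψ′(θ̂))):

import numpy as np

vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values (y_i or c_i)
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n, U, C, T = len(vals), int(delta.sum()), int((1 - delta).sum()), vals.sum()

def em_map(lam):                                   # the EM mapping Psi(lambda)
    return n / (T + C / lam)

# 1. find lambda-hat by running standard EM to convergence
lam_hat = 1.0
for _ in range(200):
    lam_hat = em_map(lam_hat)

# 2. restart near lambda-hat; the ratio r^(t) estimates Psi'(lambda-hat)
lam, r = 1.3 * lam_hat, 0.0
while abs(lam - lam_hat) > 1e-8:
    r = (em_map(lam) - lam_hat) / (lam - lam_hat)  # equation (20), one-dimensional case
    lam = em_map(lam)

var_sem   = (lam_hat**2 / n) / (1 - r)             # plug into (19), one-dimensional case
var_louis = lam_hat**2 / U                         # Louis's method result, for comparison
print(r, C / n)        # by (18), Psi'(lambda-hat) = i_Z|X / i_Y = C/n for this model
print(var_sem, var_louis)                          # the two variance estimates should agree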
Variance estimation: SEM - remark
Why restart?
Using
rij(t) = [ Ψi(θ1(t−1), . . . , θj−1(t−1), θj(t), θj+1(t−1), . . . , θp(t−1)) − Ψi(θ(t−1)) ] / [ θj(t) − θj(t−1) ]
(i.e., differencing between successive iterates rather than about θ̂) will not require fewer iterations overall and will be less stable.
Restarting from a point closer to θ̂ offsets the cost of the steps already spent finding θ̂.
Variance estimation: Other methods
Bootstrapping:
1. Initialization: θ̂1 = θ̂EM, computed from the observed data x1, . . . , xn.
2. Pseudo-data: for j = 2, . . . , B, let θ̂j = θ̂EM(j), computed from a pseudo-dataset x1(j), . . . , xn(j) drawn randomly with replacement from the observed data.
3. Estimate E{f(θ̂)} by (1/B) Σ_{j=1}^B f(θ̂j); in particular, the sample variance of the θ̂j estimates Var(θ̂).
The nested looping can be computationally burdensome.
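A brief bootstrap sketch for the same censored exponential setting (assumed data; each resample runs its own EM loop, which is the nested looping referred to above):

import numpy as np

rng = np.random.default_rng(1)
vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n = len(vals)

def em_estimate(v, d, iters=100):
    """Run EM for the censored exponential model on one (pseudo-)dataset."""
    T, C = v.sum(), (1 - d).sum()
    lam = 1.0
    for _ in range(iters):
        lam = len(v) / (T + C / lam)
    return lam

lam_hat = em_estimate(vals, delta)                 # theta-hat_1 from the observed data
B = 1000
boot = np.empty(B)
for j in range(B):                                 # pseudo-data resampled with replacement
    idx = rng.integers(0, n, size=n)
    boot[j] = em_estimate(vals[idx], delta[idx])

print(lam_hat, boot.var(ddof=1))                   # bootstrap estimate of Var(lambda-hat)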
Empirical information: (related to the Fisher information)
(1/n) Σ_{i=1}^n l′(θ|xi) l′(θ|xi)T − (1/n²) l′(θ|x) l′(θ|x)T   (21)
for i.i.d. data.
All terms are by-products of the M step: since H(θ|θ(t)) is maximized at θ = θ(t), we have l′(θ|x)|θ=θ(t) = Q′(θ|θ(t))|θ=θ(t).
Numerical differentiation
Trade-off between inaccuracy from the size of the perturbation and round-off error.
Improvements: Outline
1. E step
• Monte Carlo EM (MCEM): replace the expectation with a Monte Carlo sample mean.
2. M step
• Expectation Conditional Maximization (ECM): replace the M step with a cycle of conditional maximizations.
• EM gradient: a single step of Newton's method.
3. Acceleration methods
• Aitken acceleration: a Newton update with Taylor expansion
approximation.
• Quasi-Newton acceleration: Quasi-Newton update.
Improvements: MCEM
Replace the E step with
1. Draw Z1(t), . . . , Zm(t)(t) i.i.d. from fZ|X(z|x,θ(t)).
2. Calculate Q̂(t+1)(θ|θ(t)) = (1/m(t)) Σ_{j=1}^{m(t)} log fY(Yj(t)|θ), a Monte Carlo estimate of Q(θ|θ(t)), where Yj(t) denotes x completed with the draw Zj(t).
• Choice of m(t): start small and increase as the iterations proceed.
• Convergence: the iterates eventually bounce around the true maximum (the Monte Carlo error prevents exact convergence).
Example: censored exponential data reviewed
• Ordinary EM update: λ(t+1) = n / ( Σ_{i=1}^n xi + C/λ(t) ).
• Using MCEM: Q̂(t+1)(λ|λ(t)) = n log λ − (λ/m(t)) Σ_{j=1}^{m(t)} YjT 1, where Yj consists of the uncensored observations together with imputed values Zj1, . . . , ZjC, and Zjk − ck ∼ i.i.d. Exp(λ(t)), k = 1, . . . , C.
Therefore the MCEM update is λ(t+1) = n / [ (1/m(t)) Σ_{j=1}^{m(t)} YjT 1 ].
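A small Python sketch of this MCEM update on assumed data (by memorylessness, the imputed value for the kth censored case is Zjk = ck plus an Exp(λ(t)) draw):

import numpy as np

rng = np.random.default_rng(2)
vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values (y_i or c_i)
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n = len(vals)
y_obs = vals[delta == 1]                           # uncensored observations
c     = vals[delta == 0]                           # censoring times

lam = 1.0
for t in range(50):
    m = 100 * (t + 1)                              # let m^(t) grow with the iteration
    totals = np.empty(m)
    for j in range(m):
        z = c + rng.exponential(1.0 / lam, size=len(c))  # Z_jk - c_k ~ Exp(lambda^(t))
        totals[j] = y_obs.sum() + z.sum()          # Y_j^T 1 for the j-th completed dataset
    lam = n / totals.mean()                        # MCEM update

print(lam, delta.sum() / vals.sum())               # MCEM iterate vs. the EM fixed point U/T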
Improvements: ECM
Use S conditional maximization (CM) steps to simplify the original maximization problem.
θ(t+s/S) = argmax_θ Q(θ|θ(t)) subject to the constraint gs(θ) = gs(θ(t+(s−1)/S)),   (22)
for s = 1, . . . , S, and set θ(t+1) = θ(t+S/S).
Choice of constraints:
Partition θ into S subvectors, θ = (θ1, . . . ,θS).
1. gs(θ) = (θ1, . . . ,θs−1,θs+1, . . . ,θS), i.e. holding θ(−s) fixed. (Gauss-Seidel)
2. gs(θ) = θs, i.e. holding θs fixed.
Nested iterative loops.
Improvements: ECM - (sketched) example
Multivariate regression
Let U1, . . . ,Un be independent where
Ui ∼ Nd (µi,Σ) (23)
for µi = Viβ, where Vi are known d× p matrices, and β and Σ are unknown.
Algorithm: partition the unknown parameters into (β,Σ).
• E-step: Find the expectation of the complete-data sufficient statistics conditional on the observed data and β(t), Σ(t). (The sufficient statistics are Σ_{i=1}^n Uij for j = 1, . . . , d and Σ_{i=1}^n Uij Uik for j, k = 1, . . . , d.)
• CM 1: β(t+1/2) is estimated given Σ = Σ(t).
• CM 2: Σ(t+2/2) is estimated given β = β(t+1/2).
• Return to the E-step.
Improvements: EM gradient
Since the aim of the M step is to maximize Q(θ|θ(t)), replace the full maximization with a single Newton step:
θ(t+1) = θ(t) − Q″(θ|θ(t))−1|θ=θ(t) Q′(θ|θ(t))|θ=θ(t)   (24)
        = θ(t) − Q″(θ|θ(t))−1|θ=θ(t) l′(θ(t)|x)   (25)
• Avoid the computational burden of nested looping.
• Same rate of convergence to θ̂ as ordinary EM.
• Choice of step length: scaling the step can ensure ascent, while inflating it can speed convergence.
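A one-dimensional sketch of (25), again on assumed data for the censored exponential example: there Q″(λ|λ(t)) = −n/λ² and the observed-data score is l′(λ|x) = U/λ − T with T = Σ[yiδi + ci(1 − δi)], so the EM gradient step is λ(t+1) = λ(t) + (λ(t)²/n) l′(λ(t)|x).

import numpy as np

vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n, U, T = len(vals), delta.sum(), vals.sum()

lam = 1.0
for _ in range(50):
    score = U / lam - T                  # l'(lambda | x), the observed-data score
    lam = lam + (lam**2 / n) * score     # single Newton step on Q, equation (25)
print(lam, U / T)                        # should approach the MLE U/T

(In practice a step-length safeguard, as mentioned above, would be used to protect the ascent property.)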
Improvements: Aitken acceleration
Acceleration methods speed convergence by Newton-like steps.
To maximize l(θ|x), the Newton update would be
θ(t+1) = θ(t) − l′′(θ(t)|x)−1l′(θ(t)|x). (26)
To replace l′(θ(t)|x), note that l′(θ(t)|x) = Q′(θ|θ(t))|θ=θ(t) and the Taylor expansion
0 = Q′(θ|θ(t))|θ=θEM(t+1) ≈ Q′(θ|θ(t))|θ=θ(t) − iY(θ(t)) (θEM(t+1) − θ(t)).
Therefore the update would be
θ(t+1) = θ(t) − l″(θ(t)|x)−1 iY(θ(t)) (θEM(t+1) − θ(t)).   (27)
• Since iY is included, the increment is larger when more information is missing.
• The approximation is precise only when θ(t) is near θ̂, so initially iterate ordinary EM.
• Equivalent to applying Newton's method to find a zero of Ψ(θ) − θ, where Ψ is the EM mapping producing θ(t+1) = Ψ(θ(t)).
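A matching one-dimensional sketch of (27) for the censored exponential example (assumed data): with l″(λ|x) = −U/λ² and iY(λ) = n/λ², the accelerated step reduces to λ(t+1) = λ(t) + (n/U)(λEM(t+1) − λ(t)).

import numpy as np

vals  = np.array([0.5, 1.2, 2.0, 0.8, 2.0, 3.1])   # assumed observed values
delta = np.array([1, 1, 0, 1, 0, 1])               # 1 = uncensored, 0 = censored
n, U, C, T = len(vals), delta.sum(), (1 - delta).sum(), vals.sum()

lam = 0.3                                  # start after a few ordinary EM iterations
for _ in range(10):
    lam_em = n / (T + C / lam)             # ordinary EM update from lambda^(t)
    lam = lam + (n / U) * (lam_em - lam)   # Aitken-accelerated update, equation (27)
print(lam, U / T)                          # converges quickly to the MLE U/T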
Improvements: Quasi-Newton acceleration
Newton-like method: θ(t+1) = θ(t) − (M(t))−1l′(θ(t)|x).
Take
M(t) = Q″(θ|θ(t))|θ=θ(t) − B(t),
where B(t) approximates H″(θ(t)|θ(t)), so that M(t) approximates l″(θ(t)|x).
Quasi-Newton EM algorithm:
• Start with B(0) = 0.
• Gradually accumulate information about H″ using the secant condition
B(t+1) (θ(t+1) − θ(t)) = H′(θ|θ(t+1))|θ=θ(t+1) − H′(θ|θ(t))|θ=θ(t).
Then quasi-Newton methods like BFGS can be used.
• Quasi-Newton EM vs. EM gradient: keeping B(t) = 0 for all t recovers the EM gradient algorithm.
• Scaling can be used to guarantee that M(t) is negative definite.
• A potentially superior strategy approximates (l′′)−1 instead of l′′.
Figure 2: Steps taken by the EM gradient algorithm (long dashes). Ordinary EM steps are shown with the solid line. Steps from two methods from later sections (Aitken and quasi-Newton acceleration) are also shown, as indicated in the key. The observed-data log likelihood is shown with the grey scale, with light shading corresponding to high likelihood. All algorithms were started from pC = pI = 1/3.
Miscellanea: Gaussian Mixture Model (GMM)
Data from K different normal distributions:
p(x) = Σ_{k=1}^K πk N(x|µk, Σk),   (28)
where the mixture coefficient πk is the weight of the kth component.
We can use EM to estimate the unknown πk, µk, Σk by introducing latent indicator variables zk, with zk = 1 iff the point belongs to the kth normal component.
From a Bayesian point of view, with prior density p(z) and likelihood p(x|z), the posterior density is
γ(zk) = p(zk = 1|x) = πk N(x|µk,Σk) / Σ_{j=1}^K πj N(x|µj,Σj).
Knowing πk(t), µk(t), Σk(t), the updates work out to be
µk(t+1) = (1/Nk(t)) Σ_{n=1}^N γ(t)(znk) xn,   (29)
Σk(t+1) = (1/Nk(t)) Σ_{n=1}^N γ(t)(znk) (xn − µk(t+1)) (xn − µk(t+1))T,   (30)
πk(t+1) = Nk(t) / N,   (31)
where Nk(t) = Σ_{n=1}^N γ(t)(znk).
Because Σk(t+1) uses the freshly updated µk(t+1), the M step looks more like a CM cycle than an ordinary single M step.
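A compact univariate sketch of these updates on synthetic data (two components; variances σk² play the role of Σk, and all names are ours):

import numpy as np

rng = np.random.default_rng(3)
# synthetic data from two univariate normal components
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 1.5, 200)])
N, K = len(x), 2

pi  = np.full(K, 1.0 / K)              # mixing weights pi_k
mu  = np.array([-1.0, 1.0])            # component means
var = np.array([1.0, 1.0])             # component variances

for _ in range(100):
    # E step: responsibilities gamma(z_nk) = p(z_k = 1 | x_n)
    dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: equations (29)-(31)
    Nk  = gamma.sum(axis=0)
    mu  = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi  = Nk / N

print(pi, mu, np.sqrt(var))            # should roughly recover the generating parameters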