Computer Vision: Models, Learning and Inference –
Mixture Models, Part 2
Oren Freifeld and Ron Shapira-Weber
Computer Science, Ben-Gurion University
April 9, 2019
www.cs.bgu.ac.il/~cv192/ Mixtures, Part 2 (ver. 1.01) Apr 9, 2019 1 / 38
1 GMM
2 GMM Parameter Estimation
3 EM
MM
Connection Between GMM and K-means
Plan
An intuitive algorithm for GMM parameter estimation
That algorithm is an example of EM algorithms.
EM algorithms are iterative and work by maximizing a lower bound of the log-likelihood. They can be shown to increase (or at least not decrease) the log-likelihood and to converge to a local maximum.
EM algorithms can be easily adapted from Maximum-Likelihood (ML) estimation to Maximum-a-Posteriori (MAP) estimation.
Plan
EM algorithms are a particular case of the more general MM algorithms – which are in fact easier to explain than EM algorithms, without having to resort to information-theoretic concepts.
The scope of EM algorithms, let alone that of MM algorithms, goes beyond GMMs – and even beyond mixture models.
We will discuss closely-related GMM inference algorithms:
1 Hard-assignment EM and its connection to K-means
2 Hard-assignment by sampling (maybe today)
3 Gibbs sampling (next time)
GMM
Model:

p(x; θ) = ∑_{k=1}^K π_k N(x; µ_k, Σ_k),   x ∈ R^n

where:

θ_k = (µ_k, Σ_k)

θ = (θ_1, . . . , θ_K, π_1, . . . , π_K)

π_k ≥ 0 (can also insist on π_k > 0), ∀k ∈ {1, . . . , K}, and ∑_{k=1}^K π_k = 1

Data: D = (x_i)_{i=1}^N, (usually) iid samples from p(x; θ)
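As a concrete illustration (not part of the original slides), the mixture density above can be evaluated numerically. A minimal NumPy sketch follows; the function name `gmm_pdf` and the parameter values are our own, for illustration only:

```python
import numpy as np

def gmm_pdf(x, pis, mus, Sigmas):
    """Evaluate p(x; theta) = sum_k pi_k N(x; mu_k, Sigma_k) at a single point x in R^n."""
    total = 0.0
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        n = len(mu_k)
        diff = x - mu_k
        # Gaussian density N(x; mu_k, Sigma_k)
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma_k))
        quad = diff @ np.linalg.solve(Sigma_k, diff)
        total += pi_k * norm_const * np.exp(-0.5 * quad)
    return total

# A 2-component GMM in 1D (illustrative values only)
pis = [0.5, 0.5]
mus = [np.array([-10.0]), np.array([10.0])]
Sigmas = [np.array([[25.0]]), np.array([[25.0]])]
p = gmm_pdf(np.array([0.0]), pis, mus, Sigmas)
```

In practice one would work in log-space for numerical stability; this direct form simply mirrors the formula on the slide.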
Example (a 3-component GMM in 2D)
Figures from Kevin Murphy’s Machine-Learning book, 2012
GMM and Clustering
First fit (we will see how) a GMM to the data, D = (x_i)_{i=1}^N

Next, compute p(z_i = k | x_i; θ):

r_{i,k} ≜ p(z_i = k | x_i; θ)

which, for any mixture model, equals

p(z_i = k; θ) p(x_i | z_i = k; θ) / ∑_{k'=1}^K p(z_i = k'; θ) p(x_i | z_i = k'; θ)

and, for a GMM, equals

π_k N(x_i; µ_k, Σ_k) / ∑_{k'=1}^K π_{k'} N(x_i; µ_{k'}, Σ_{k'})

r_{i,k} is called the responsibility of cluster k for point i.
This is called soft clustering (AKA soft assignments)
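The responsibilities can be computed directly from the formula above. Below is a minimal NumPy sketch (the function name `responsibilities` and the array shapes are our own choices, not from the slides):

```python
import numpy as np

def responsibilities(X, pis, mus, Sigmas):
    """r[i, k] = p(z_i = k | x_i; theta): posterior cluster probabilities under a GMM.
    X has shape (N, n); pis, mus, Sigmas hold the K mixture parameters."""
    N, n = X.shape
    K = len(pis)
    R = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]                                                    # (N, n)
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigmas[k]), diff)
        norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigmas[k]))
        R[:, k] = pis[k] * np.exp(-0.5 * quad) / norm_const                  # pi_k N(x_i; mu_k, Sigma_k)
    R /= R.sum(axis=1, keepdims=True)                                        # normalize over k
    return R
```

Each row of `R` sums to 1; taking `R.argmax(axis=1)` gives the hard assignments discussed next.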
GMM and Clustering
If we view

π_k = p(z = k; θ)

as the prior probability for class k, then

r_{i,k} = p(z_i = k | x_i; θ)

may be viewed as the posterior probability (i.e., after having observed measurement x_i) that z_i = k; i.e., the probability that x_i was drawn from Gaussian k, given that we know the value of x_i.
GMM and Clustering
Despite the term “posterior” in the previous slide, note that it referred to a probability for the label, z_i, not to a posterior probability for θ (this difference might seem like mere semantics: one can argue that the labels are parameters too).
In other words, in this lecture we are mostly interested in Maximum-Likelihood estimates of θ, and do not place a prior, p(θ), on it. That is, we view θ as an unknown deterministic quantity.
Next lecture we will be more Bayesian: we will regard θ as an RV, write the likelihood as p(x|θ), as opposed to p(x; θ), and target the posterior, p(θ|x).
GMM and Hard Clustering
Recall (in mixture models in general, not just GMM):

r_{i,k} ≜ p(z_i = k | x_i; θ) = p(z_i = k; θ) p(x_i | z_i = k; θ) / ∑_{k'=1}^K p(z_i = k'; θ) p(x_i | z_i = k'; θ)

For hard clustering, compute

z*_i = argmax_k r_{i,k} = argmax_k [ log p(x_i | z_i = k; θ) + log p(z_i = k; θ) ]

(where r_{i,k} = p(z_i = k | x_i; θ) and p(z_i = k; θ) = π_k)
GMM Parameter Estimation
The likelihood function, p(D; θ), is not concave (i.e., it is hard to optimize)
Example
Left: data histogram for a 2-component GMM with π_1 = π_2 = 0.5, σ_1 = σ_2 = 5, µ_1 = −10, and µ_2 = 10.
Right: p(D; µ_1, µ_2, σ_1 = σ_2 = 5, π_1 = π_2 = 0.5), with the σ’s and π’s fixed at their true values, as a function of µ_1 and µ_2.
Figures from Kevin Murphy’s Machine-Learning book, 2012
Expectation Maximization (EM) for Mixture Models
For a single Gaussian, it was easy to compute the ML estimators of the mean and covariance. In fact, we even had closed-form expressions for these: the sample mean and sample covariance. Thus, in a GMM, if we know the r_{i,k}’s, we can compute a weighted sample mean and weighted sample covariance for each Gaussian k.
In a GMM, if we know the model parameters, then it is easy to compute the r_{i,k}’s.
So let’s iteratively alternate between the two steps. This is EM for mixture models in a nutshell.
Expectation Maximization (EM) for Mixture Models
Log-likelihood of the observed data (AKA incomplete data):

l(θ) ≜ ∑_{i=1}^N log p(x_i; θ) = ∑_{i=1}^N log [ ∑_{z_i} p(x_i, z_i; θ) ]   (mixture model)

Define the complete-data log-likelihood:

l_c(θ) ≜ ∑_{i=1}^N log p(x_i, z_i; θ)

This can’t be computed – the z_i’s are unknown.

Note that even after observing the data, l_c(θ) is an RV.

Define the expected complete-data log-likelihood:

Q(θ, θ^t) ≜ E(l_c(θ) | D; θ^t)

where t is the iteration number. The expectation is taken w.r.t. the “old” parameter, θ^t.
Expectation Maximization (EM) for Mixture Models
The goal of the E step is to compute Q(θ, θ^t) or, more accurately, the terms inside it on which the Maximum-Likelihood estimator depends; these are known as the expected sufficient statistics.
In the M step, we optimize Q w.r.t. θ: θ^{t+1} = argmax_θ Q(θ, θ^t)
The EM produces a non-decreasing sequence:

l(θ^0) ≤ l(θ^1) ≤ l(θ^2) ≤ . . .

The typical proof, which you are likely to see in other classes and most textbooks, relies on some information-theoretic concept. Here you will see another proof, which is simpler and more general, via the connection between EM algorithms and the so-called MM algorithms.
The sequence above can be shown to converge to a (usually local) maximum.
Expectation Maximization (EM) for Mixture Models
That was for the MLE. If we want Maximum-a-Posteriori (MAP) estimation, then we adjust the M step to θ^{t+1} = argmax_θ Q(θ, θ^t) + log p(θ) (the E step is unchanged).
Here too, the EM produces a non-decreasing sequence:

l(θ^0) + log p(θ^0) ≤ l(θ^1) + log p(θ^1) ≤ l(θ^2) + log p(θ^2) ≤ . . .
EM for GMM
Q(θ, θ^t) ≜ E( ∑_i log p(x_i, z_i; θ) | D; θ^t )
         = E( ∑_i log p(z_i; θ) p(x_i | z_i; θ) | D; θ^t )
         = E( ∑_i log π_{z_i} p(x_i; θ_{z_i}) | D; θ^t )
         = E( ∑_i log ∏_k (π_k p(x_i; θ_k))^{1{z_i = k}} | D; θ^t )
         = E( ∑_i ∑_k 1{z_i = k} log(π_k p(x_i; θ_k)) | D; θ^t )
         = ∑_i ∑_k E(1{z_i = k} | D; θ^t) log(π_k p(x_i; θ_k))
         = ∑_i ∑_k p(z_i = k | x_i; θ^t) log(π_k p(x_i; θ_k))
         = ∑_i ∑_k r_{i,k} log π_k + ∑_i ∑_k r_{i,k} log p(x_i; θ_k)

where now r_{i,k} ≜ p(z_i = k | x_i; θ^t)
EM for GMM
E step:
Given θ^t, the current estimate of θ = (θ_1, . . . , θ_K, π_1, . . . , π_K), compute, ∀i ∈ {1, . . . , N} and ∀k ∈ {1, . . . , K},

r_{i,k} = π_k p(x_i | z_i = k; θ^t_k) / ∑_{k'=1}^K π_{k'} p(x_i | z_i = k'; θ^t_{k'})   (any mixture)
       = π_k N(x_i; µ^t_k, Σ^t_k) / ∑_{k'=1}^K π_{k'} N(x_i; µ^t_{k'}, Σ^t_{k'})   (GMM)

Note: ∑_{k=1}^K r_{i,k} = 1, ∀i ∈ {1, . . . , N}
EM for GMM
M step:
We optimize Q w.r.t. π (subject to ∑_k π_k = 1) and the θ_k’s:

π_k = (1/N) ∑_i r_{i,k} = r_k / N,   where r_k ≜ ∑_i r_{i,k}

µ_k = ( ∑_{i=1}^N r_{i,k} x_i ) / r_k

Σ_k = ( ∑_{i=1}^N r_{i,k} x_i x_i^T ) / r_k − µ_k µ_k^T
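Putting the E and M steps together, one full EM iteration for a GMM can be sketched as follows (a minimal NumPy sketch; the function name `em_step` and the array shapes are our own, and for clarity it skips the log-space tricks and degeneracy safeguards a practical implementation would need):

```python
import numpy as np

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a GMM: E step (responsibilities), then the M-step updates."""
    N, n = X.shape
    K = len(pis)
    # E step: r[i, k] proportional to pi_k N(x_i; mu_k, Sigma_k), normalized over k
    R = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigmas[k]), diff)
        R[:, k] = pis[k] * np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigmas[k]))
    R /= R.sum(axis=1, keepdims=True)
    # M step: weighted counts, weighted means, weighted covariances
    r = R.sum(axis=0)                              # r_k = sum_i r_{i,k}
    pis_new = r / N
    mus_new = (R.T @ X) / r[:, None]               # sum_i r_{i,k} x_i / r_k
    Sigmas_new = []
    for k in range(K):
        S = (R[:, k, None] * X).T @ X / r[k] - np.outer(mus_new[k], mus_new[k])
        Sigmas_new.append(S)
    return pis_new, mus_new, Sigmas_new, R
```

Iterating `em_step` from a reasonable initialization (e.g., K-means centers) drives the log-likelihood up monotonically, as the MM view below explains.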
Example
(Figure omitted: an example EM fit on 2D data.)
Figures from Kevin Murphy’s Machine-Learning book, 2012
EM – Beyond Mixtures
Z, X: jointly distributed according to a parametrized p(z, x; θ)

This is general, regardless of whether X and/or Z are discrete or continuous RVs.

We observe X = x and want to estimate the parameter, θ.

Likelihood (usually not concave):

L(x; θ) ≜ p(x; θ) = ∫ p(z, x; θ) dz

Log-likelihood (usually not concave):

l(θ) = log L(x; θ) = log p(x; θ) = log ∫ p(z, x; θ) dz

⇒ hard to maximize; even if p(x, z; θ) is concave, p(x; θ) is usually not concave.

The EM algorithm produces a sequence, (θ^0, θ^1, θ^2, . . .), that increases log L(x; θ) to a (local) maximum.
MM Algorithms
“One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize.”
We will define the terms “majorize” and “minorize” later.
Quote from [Hunter and Lange, “A Tutorial on MM Algorithms”, 2003]
MM Algorithms
A successful MM algorithm substitutes a difficult optimization problem with an easy one.
“Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a non-differentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.”
Quote from [Hunter and Lange, “A Tutorial on MM Algorithms”, 2003]
Present EM Algorithms as MM Algorithms
Find a function A(θ; θ^t) such that

A(θ^t; θ^t) = log L(x; θ^t)
A(θ; θ^t) ≤ log L(x; θ)   (“minorization”)

and then set θ^{t+1} = argmax_θ A(θ; θ^t)   (“maximization”)
Iterate.
MM Algorithms
More generally (than EM), an MM algorithm for maximizing some function f(θ):

Find a function A(θ; θ^t) such that

A(θ^t; θ^t) = f(θ^t)   ∀θ^t
A(θ; θ^t) ≤ f(θ)   ∀θ, θ^t   (“minorization”)

and then set θ^{t+1} = argmax_θ A(θ; θ^t)   (“maximization”)
Iterate.
MM Algorithms
An MM algorithm for minimizing some function f(θ) (the same as argmax_θ −f(θ)):

Find a function A(θ; θ^t) such that

A(θ^t; θ^t) = f(θ^t)   ∀θ^t
A(θ; θ^t) ≥ f(θ)   ∀θ, θ^t   (“majorization”)

and then set θ^{t+1} = argmin_θ A(θ; θ^t)   (“minimization”)
Iterate.
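As a toy illustration of majorize-minimize (our own example, not from the slides): minimize f(θ) = ∑_i |x_i − θ|, whose minimizer is the sample median. Each |u| can be majorized at u_t by the quadratic u²/(2|u_t|) + |u_t|/2, which touches |u| at u = ±u_t and lies above it elsewhere; minimizing the summed majorizer gives a closed-form weighted-mean update:

```python
import numpy as np

def mm_median(x, theta0, iters=100, eps=1e-12):
    """Majorize-minimize for f(theta) = sum_i |x_i - theta|.
    The quadratic majorizer A(theta; theta_t) = sum_i (x_i - theta)^2 / (2|x_i - theta_t|) + const
    touches f at theta_t and lies above it, so each closed-form minimization cannot increase f."""
    theta = theta0
    for _ in range(iters):
        w = 1.0 / (np.abs(x - theta) + eps)   # majorizer weights (eps avoids division by zero)
        theta = np.sum(w * x) / np.sum(w)     # argmin_theta A(theta; theta_t), closed form
    return theta

x = np.array([1.0, 2.0, 3.0, 10.0, 100.0])
theta = mm_median(x, theta0=np.mean(x))       # moves from the mean toward the median
```

This substitutes an easy weighted least-squares problem for a non-differentiable one, exactly in the spirit of the Hunter–Lange quote above.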
Fact (see next slide for the proof)
Let f be the objective function.
A majorization-minimization algorithm yields a non-increasing sequence:

f(θ^0) ≥ f(θ^1) ≥ f(θ^2) ≥ . . .

A minorization-maximization algorithm yields a non-decreasing sequence:

f(θ^0) ≤ f(θ^1) ≤ f(θ^2) ≤ . . .

Fact (proof is omitted)

The sequence converges to a local optimum, θ*, or, in very rare cases, a saddle point. Typically, the convergence has linear rate:

lim_{t→∞} ‖θ^{t+1} − θ*‖ / ‖θ^t − θ*‖ = c < 1
About Convergence Rates
In comparison, Newton-Raphson algorithms, under certain general conditions, tend to converge and to do so faster, at a quadratic rate:

lim_{t→∞} ‖θ^{t+1} − θ*‖ / ‖θ^t − θ*‖² = c

(it is easy, however, to construct cases where Newton-Raphson does not converge at all)

Hence Newton-Raphson algorithms tend to require fewer iterations. On the other hand, an iteration of a Newton-Raphson algorithm can be far more computationally expensive (and error-prone) than an MM iteration. This is due to the evaluation and inversion of the Hessian matrix, ∇²f(θ), in each Newton-Raphson update:

θ^{t+1} = θ^t − (∇²f(θ^t))^{−1} ∇f(θ^t)
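For concreteness, here is a scalar Newton-Raphson sketch on a toy objective of our own (the gradient and Hessian are hand-derived; in the scalar case the Hessian inversion is just a division):

```python
def newton_minimize(grad, hess, theta0, iters=20):
    """Scalar Newton-Raphson: theta_{t+1} = theta_t - grad(theta_t) / hess(theta_t)."""
    theta = theta0
    for _ in range(iters):
        theta = theta - grad(theta) / hess(theta)
    return theta

# Toy objective f(theta) = (theta - 2)^4 + theta^2, which is strictly convex
grad = lambda t: 4.0 * (t - 2.0) ** 3 + 2.0 * t   # f'(theta)
hess = lambda t: 12.0 * (t - 2.0) ** 2 + 2.0      # f''(theta) > 0 everywhere
theta = newton_minimize(grad, hess, theta0=0.0)
```

On this convex toy problem a handful of iterations reach machine precision, illustrating the quadratic rate; each step, however, requires both the gradient and the (inverted) Hessian, which is exactly the cost MM iterations avoid.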
Proof.
Suppose:

(1) f(θ) ≤ A(θ; θ^t)   ∀θ, θ^t   (“majorization”)
(2) f(θ^t) = A(θ^t; θ^t)
(3) θ^{t+1} = argmin_θ A(θ; θ^t)

We want to show: f(θ^{t+1}) ≤ f(θ^t).

By (3), A(θ^{t+1}; θ^t) ≤ A(θ^t; θ^t), which, by (2), equals f(θ^t).
By (1) with θ = θ^{t+1}, f(θ^{t+1}) − A(θ^{t+1}; θ^t) ≤ 0.

Summing the two inequalities: f(θ^{t+1}) ≤ f(θ^t).
MM
The proof also shows that even if we can’t find argmin_θ A(θ; θ^t), it is enough to find θ^{t+1} such that A(θ^{t+1}; θ^t) ≤ A(θ^t; θ^t).
It turns out that convergence still holds, with a similar convergence rate.
EM as a Special Case of MM
f(θ) ≜ log L(x; θ)

A(θ; θ^t) ≜ ∫ log [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz = E( log [ p(Z, x; θ) / p(Z|x; θ^t) ] | x; θ^t )

Check A(θ^t; θ^t) = f(θ^t):

A(θ^t; θ^t) = ∫ log [ p(z, x; θ^t) / p(z|x; θ^t) ] p(z|x; θ^t) dz
            = ∫ log [ p(z|x; θ^t) p(x; θ^t) / p(z|x; θ^t) ] p(z|x; θ^t) dz
            = ∫ log p(x; θ^t) p(z|x; θ^t) dz = log p(x; θ^t) ∫ p(z|x; θ^t) dz
            = log p(x; θ^t) = log L(x; θ^t) = f(θ^t)
EM as a Special Case of MM
f(θ) ≜ l(θ) = log L(x; θ)

A(θ; θ^t) ≜ ∫ log [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz = E( log [ p(Z, x; θ) / p(Z|x; θ^t) ] | x; θ^t )

Check A(θ; θ^t) ≤ f(θ):

A(θ; θ^t) = E( log [ p(Z, x; θ) / p(Z|x; θ^t) ] | x; θ^t )
          ≤ log E( p(Z, x; θ) / p(Z|x; θ^t) | x; θ^t )   (Jensen’s inequality)
          = log ∫ [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz
          = log ∫ p(z, x; θ) dz = log L(x; θ) = f(θ)   ∀θ, θ^t
EM as a Special Case of MM
But what is the connection between this A and EM?
A(θ; θ^t) = ∫ log [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz
          = ∫ ( log p(z, x; θ) − log p(z|x; θ^t) ) p(z|x; θ^t) dz
          = [ ∫ log p(z, x; θ) p(z|x; θ^t) dz ] − [ ∫ log p(z|x; θ^t) p(z|x; θ^t) dz ]

where the second bracketed term does not involve θ. Hence,

argmax_θ A(θ; θ^t) = argmax_θ ∫ log p(z, x; θ) p(z|x; θ^t) dz = argmax_θ E(l_c(θ) | x; θ^t) = argmax_θ Q(θ, θ^t)

The connection to MM shows that indeed an EM iteration does not decrease l(θ).
GMM and K-means: Version 1
Consider first hard-assignment EM for GMM in the case Σ_k = Σ = σ²I_{n×n}, with weights that are known to be equal: π_k = 1/K.

The hard-assignment rule becomes:

z_i = argmax_k (1/K) N(x_i; µ_k, Σ) = argmax_k exp( −(1/(2σ²)) ‖x_i − µ_k‖² ) = argmin_k ‖x_i − µ_k‖

The M step becomes:

n_k = ∑_{i=1}^N 1{z_i = k} = |{i : z_i = k}|

µ_k = (1/n_k) ∑_{i: z_i = k} x_i

This coincides with the K-means algorithm.
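The resulting iteration can be sketched directly (a minimal NumPy sketch; the function name `kmeans_step` is our own):

```python
import numpy as np

def kmeans_step(X, mus):
    """One hard-assignment EM iteration under Sigma_k = sigma^2 I and pi_k = 1/K,
    i.e., one K-means iteration: nearest-center assignment, then mean update."""
    # Hard E step: z_i = argmin_k ||x_i - mu_k||
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)   # (N, K) squared distances
    z = d2.argmin(axis=1)
    # M step: mu_k = mean of the points currently assigned to cluster k
    mus_new = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                        for k in range(len(mus))])
    return z, mus_new
```

The empty-cluster guard (keeping the old µ_k) is one common convention; the slides do not specify how that case is handled.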
GMM and K-means: Version 2
Now consider standard EM for GMM, in the case Σ_k = Σ = σ²I_{n×n} (with possibly-unknown, non-uniform weights).
It can be shown that if σ → 0, then the E-step and M-step coincide withK-means.
GMM and K-means
So we saw two ways to relate GMM and K-means.
Jensen Inequality
Fact (Jensen Inequality – concave functions)

Z is an RV and h is some concave function ⇒ E(h(Z)) ≤ h(E(Z)).
Example
E(log(Z)) ≤ log(E(Z))
Fact (Jensen Inequality – convex functions)
Z is an RV and h is some convex function ⇒ E(h(Z)) ≥ h(E(Z)).
Example
E(X2) ≥ E2(X)
(since we already know that Var(X) ≥ 0, this is a good way to remember which direction the inequality goes. . . )
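Both directions are easy to check numerically; Jensen’s inequality holds exactly for the empirical distribution of any sample (an illustration of ours, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(1.0, 5.0, size=100_000)   # samples of a positive RV

# Concave h = log: E(log Z) <= log E(Z)
lhs_log, rhs_log = np.mean(np.log(Z)), np.log(np.mean(Z))

# Convex h = square: E(Z^2) >= (E Z)^2
lhs_sq, rhs_sq = np.mean(Z ** 2), np.mean(Z) ** 2
```

Here the expectations are replaced by sample averages, to which Jensen’s inequality applies verbatim.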
Outliers
In previous lectures, we discussed robustness via a robust-cost functionformulation.
In a mixture model, another way to handle outliers is by adding a (K+1)-th component, fixing it to have a low weight (that weight can also be estimated) and a fixed covariance, σ²I_{n×n}, where σ² is very large. One can also use a uniformly-distributed outlier model if the support is compact.
Another approach is to use components that are robust, e.g., a mixture of Student’s t distributions instead of a mixture of Gaussians.
Version Log
18/6/2019, ver 1.01. Slide 18: µ_k^T µ_k → µ_k µ_k^T
9/4/2019, ver 1.00.