
Page 1:

Computer Vision: Models, Learning and Inference –

Mixture Models, Part 2

Oren Freifeld and Ron Shapira-Weber

Computer Science, Ben-Gurion University

April 9, 2019

www.cs.bgu.ac.il/~cv192/

Page 2:

1 GMM

2 GMM Parameter Estimation

3 EM
   MM
   Connection Between GMM and K-means

Page 3:

Plan

An intuitive algorithm for GMM parameter estimation

That algorithm is an example of an EM algorithm.

EM algorithms are iterative and work by maximizing a lower bound of the log-likelihood. They can be shown to increase (or, at least, not decrease) the log-likelihood and to converge to a local maximum.

EM algorithms can be easily adapted from Maximum-Likelihood (ML) estimation to Maximum-a-Posteriori (MAP) estimation.

Page 4:

Plan

EM algorithms are a particular case of the more general MM algorithms – which are in fact easier to explain than EM algorithms, without having to resort to information-theoretic concepts.

The scope of EM algorithms, let alone the scope of MM algorithms, goes beyond GMMs – and even beyond mixture models.

We will discuss closely-related GMM inference algorithms:
1 Hard-assignment EM and its connection to K-means
2 Hard-assignment by sampling (maybe today)
3 Gibbs sampling (next time)

Page 5:

GMM

GMM

Model:

$p(x;\theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x;\mu_k,\Sigma_k), \qquad x \in \mathbb{R}^n$

where:

$\theta_k = (\mu_k, \Sigma_k)$

$\theta = (\theta_1, \ldots, \theta_K, \pi_1, \ldots, \pi_K)$

$\pi_k \ge 0$ (can also insist on $\pi_k > 0$), $\forall k \in \{1, \ldots, K\}$, and $\sum_{k=1}^{K} \pi_k = 1$

Data: $\mathcal{D} = (x_i)_{i=1}^{N}$, (usually) iid samples from $p(x;\theta)$

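The slides contain no code; as an illustration only, here is a minimal Python/NumPy sketch of evaluating the GMM density above. The function name gmm_pdf and the example parameters are ours (hypothetical), and SciPy is assumed to be available.

```python
# Illustrative sketch only (not from the slides): evaluate
# p(x; theta) = sum_k pi_k N(x; mu_k, Sigma_k) at a point x in R^n.
import numpy as np
from scipy.stats import multivariate_normal


def gmm_pdf(x, pis, mus, Sigmas):
    """Density of a K-component GMM at a single point x."""
    return sum(pi_k * multivariate_normal.pdf(x, mean=mu_k, cov=Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))


# Hypothetical 2-component GMM in R^2, just to exercise the function.
pis = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```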

Page 6:

GMM

Example (a 3-component GMM in 2D)


Figures from Kevin Murphy’s Machine-Learning book, 2012

Page 7:

GMM

GMM and Clustering

First fit (we will see how) a GMM to the data, D = (xi)Ni=1

Next, compute $p(z_i = k \mid x_i;\theta)$:

$r_{i,k} \triangleq p(z_i = k \mid x_i;\theta) \overset{\text{any mixture model}}{=} \dfrac{p(z_i = k;\theta)\, p(x_i \mid z_i = k;\theta)}{\sum_{k'=1}^{K} p(z_i = k';\theta)\, p(x_i \mid z_i = k';\theta)} \overset{\text{GMM}}{=} \dfrac{\pi_k\, \mathcal{N}(x_i;\mu_k,\Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'}\, \mathcal{N}(x_i;\mu_{k'},\Sigma_{k'})}$

$r_{i,k}$ is called the responsibility of cluster $k$ for point $i$.

This is called soft clustering (AKA soft assignments)

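As an illustration (ours, not part of the slides), a minimal NumPy/SciPy sketch of this soft-clustering step; the function name responsibilities and its interface are hypothetical.

```python
# Illustrative sketch only (not from the slides): soft assignments
# r_{i,k} = pi_k N(x_i; mu_k, Sigma_k) / sum_{k'} pi_{k'} N(x_i; mu_{k'}, Sigma_{k'}).
import numpy as np
from scipy.stats import multivariate_normal


def responsibilities(X, pis, mus, Sigmas):
    """X: (N, n) data matrix. Returns R of shape (N, K); each row sums to 1."""
    K = len(pis)
    R = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    return R / R.sum(axis=1, keepdims=True)
```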

Page 8:

GMM

GMM and Clustering

If we view $\pi_k = p(z = k;\theta)$

as the prior probability for class $k$, then

$r_{i,k} = p(z_i = k \mid x_i;\theta)$

may be viewed as the posterior probability (i.e., after having observed measurement $x_i$) that $z_i = k$; i.e., the probability that $x_i$ was drawn from Gaussian $k$, given that we know the value of $x_i$.

Page 9:

GMM

GMM and Clustering

Despite the term “posterior” in the previous slide, note that it referred to a probability for the label, $z_i$, not to a posterior probability for $\theta$ (this difference might seem like mere semantics: one can argue that the labels are parameters too).

In other words, in this lecture we are interested mostly in Maximum-Likelihood estimates of $\theta$, and do not place a prior on it, $p(\theta)$. That is, we view $\theta$ as an unknown deterministic quantity.

Next lecture we will be more Bayesian. We will regard $\theta$ as an RV, will write the likelihood as $p(x \mid \theta)$, as opposed to $p(x;\theta)$, and will target the posterior, $p(\theta \mid x)$.

Page 10:

GMM

GMM and Hard Clustering

Recall (in mixture models in general, not just GMM):

$r_{i,k} \triangleq p(z_i = k \mid x_i;\theta) = \dfrac{p(z_i = k;\theta)\, p(x_i \mid z_i = k;\theta)}{\sum_{k'=1}^{K} p(z_i = k';\theta)\, p(x_i \mid z_i = k';\theta)}$

For hard clustering, compute

$z_i^* = \arg\max_k \underbrace{r_{i,k}}_{p(z_i = k \mid x_i;\theta)} = \arg\max_k \Big[ \log p(x_i \mid z_i = k;\theta) + \log \underbrace{p(z_i = k;\theta)}_{\pi_k} \Big]$

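A minimal sketch (ours, not from the slides) of this hard-assignment rule, working in log space as on the slide; the names below are illustrative.

```python
# Illustrative sketch only (not from the slides): hard assignments for a GMM,
# z_i* = argmax_k [ log N(x_i; mu_k, Sigma_k) + log pi_k ].
import numpy as np
from scipy.stats import multivariate_normal


def hard_assignments(X, pis, mus, Sigmas):
    """X: (N, n) data matrix. Returns z of shape (N,) with entries in {0, ..., K-1}."""
    scores = np.column_stack([
        np.log(pis[k]) + multivariate_normal.logpdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pis))
    ])
    return np.argmax(scores, axis=1)
```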

Page 11:

GMM

GMM Parameter Estimation

The likelihood function, p(D; θ), is not concave (i.e., it is hard to optimize)

Example

Left: data histogram for a 2-component GMM with $\pi_1 = \pi_2 = 0.5$, $\sigma_1 = \sigma_2 = 5$, $\mu_1 = -10$ and $\mu_2 = 10$.
Right: $p(\mathcal{D};\mu_1,\mu_2, \underbrace{\sigma_1 = \sigma_2 = 5, \pi_1 = \pi_2 = 0.5}_{\text{true values}})$, plotted over $(\mu_1, \mu_2)$.


Figures from Kevin Murphy’s Machine-Learning book, 2012

Page 12:

EM

Expectation Maximization (EM) for Mixture Models

For a single Gaussian, it was easy to compute the ML estimators of the mean and covariance. In fact, we even had closed-form expressions for these: the sample mean and the sample covariance. Thus, in a GMM, if we know the $r_{i,k}$'s, we can compute a weighted sample mean and a weighted sample covariance for each Gaussian $k$.

In a GMM, if we know the model parameters, then it is easy to compute the $r_{i,k}$'s.

So let's iteratively alternate between the two steps. This is EM for mixture models in a nutshell.

Page 13:

EM

Expectation Maximization (EM) for Mixture Models

Log-likelihood of the observed data (AKA the incomplete data):

$l(\theta) \triangleq \sum_{i=1}^{N} \log p(x_i;\theta) \overset{\text{mixture model}}{=} \sum_{i=1}^{N} \log \Big[ \sum_{z_i} p(x_i, z_i;\theta) \Big]$

Define the complete-data log-likelihood:

$l_c(\theta) \triangleq \sum_{i=1}^{N} \log p(x_i, z_i;\theta)$

This can't be computed – the $z_i$'s are unknown.

Note that even after observing the data, $l_c(\theta)$ is an RV.

Define the expected complete-data log-likelihood:

$Q(\theta, \theta^t) \triangleq E\big(l_c(\theta) \mid \mathcal{D};\theta^t\big)$

where $t$ is the iteration number. The expectation is taken w.r.t. the “old” parameter, $\theta^t$.

Page 14:

EM

Expectation Maximization (EM) for Mixture Models

The goal of the E step is to compute $Q(\theta, \theta^t)$ or, more accurately, the terms inside of it on which the Maximum-Likelihood estimator depends; these are known as the expected sufficient statistics.

In the M step, we optimize $Q$ w.r.t. $\theta$: $\theta^{t+1} = \arg\max_\theta Q(\theta, \theta^t)$.

The EM produces a non-decreasing sequence:

$l(\theta^0) \le l(\theta^1) \le l(\theta^2) \le \ldots$

The typical proof, which you are likely to see in other classes and in most textbooks, relies on information-theoretic concepts. Here you will see another proof, which is simpler and more general, via the connection between EM algorithms and the so-called MM algorithms.

The sequence above can be shown to converge to a (usually local) maximum.

Page 15:

EM

Expectation Maximization (EM) for Mixture Models

That was for the MLE. If we want Maximum-a-Posteriori (MAP) estimation, then adjust the M step to $\theta^{t+1} = \arg\max_\theta \big[ Q(\theta, \theta^t) + \log p(\theta) \big]$ (the E step is unchanged).

Here too, the EM produces a non-decreasing sequence:

$l(\theta^0) + \log p(\theta^0) \le l(\theta^1) + \log p(\theta^1) \le l(\theta^2) + \log p(\theta^2) \le \ldots$

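As a concrete illustration not taken from the slides: if, say, we place a Dirichlet$(\alpha_1, \ldots, \alpha_K)$ prior on the weights (a common choice, assumed here purely for the example), then with $r_k = \sum_i r_{i,k}$ the MAP M step for $\pi$ becomes

$\pi_k^{t+1} = \dfrac{r_k + \alpha_k - 1}{N + \sum_{j=1}^{K} \alpha_j - K}, \qquad k = 1, \ldots, K,$

which reduces to the ML update $\pi_k = r_k / N$ when $\alpha_k = 1$ for all $k$.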

Page 16:

EM

EM for GMM

$Q(\theta, \theta^t) \triangleq E\Big(\sum_i \log p(x_i, z_i;\theta) \,\Big|\, \mathcal{D};\theta^t\Big)$

$= E\Big(\sum_i \log\big[p(z_i;\theta)\, p(x_i \mid z_i;\theta)\big] \,\Big|\, \mathcal{D};\theta^t\Big)$

$= E\Big(\sum_i \log\big[\pi_{z_i}\, p(x_i;\theta_{z_i})\big] \,\Big|\, \mathcal{D};\theta^t\Big)$

$= E\Big(\sum_i \log \prod_k \big(\pi_k\, p(x_i;\theta_k)\big)^{\mathbf{1}_{z_i = k}} \,\Big|\, \mathcal{D};\theta^t\Big)$

$= E\Big(\sum_i \sum_k \mathbf{1}_{z_i = k} \log\big(\pi_k\, p(x_i;\theta_k)\big) \,\Big|\, \mathcal{D};\theta^t\Big)$

$= \sum_i \sum_k E\big(\mathbf{1}_{z_i = k} \mid \mathcal{D};\theta^t\big) \log\big(\pi_k\, p(x_i;\theta_k)\big)$

$= \sum_i \sum_k p(z_i = k \mid x_i;\theta^t) \log\big(\pi_k\, p(x_i;\theta_k)\big)$

$= \sum_i \sum_k r_{i,k} \log \pi_k + \sum_i \sum_k r_{i,k} \log p(x_i;\theta_k)$

where now $r_{i,k} \triangleq p(z_i = k \mid x_i;\theta^t)$

Page 17:

EM

EM for GMM

E step: given $\theta^t$, the current estimate of $\theta = (\theta_1, \ldots, \theta_K, \pi_1, \ldots, \pi_K)$, compute, $\forall i \in \{1, \ldots, N\}$ and $\forall k \in \{1, \ldots, K\}$,

$r_{i,k} \overset{\text{mixture}}{=} \dfrac{\pi_k\, p(x_i \mid z_i = k;\theta_k^t)}{\sum_{k'=1}^{K} \pi_{k'}\, p(x_i \mid z_i = k';\theta_{k'}^t)} \overset{\text{GMM}}{=} \dfrac{\pi_k\, \mathcal{N}(x_i;\mu_k,\Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'}\, \mathcal{N}(x_i;\mu_{k'},\Sigma_{k'})}$

Note: $\sum_{k=1}^{K} r_{i,k} = 1$, $\forall i \in \{1, \ldots, N\}$.

Page 18:

EM

EM for GMM

M step: we optimize $Q$ w.r.t. $\pi$ (subject to $\sum_k \pi_k = 1$) and the $\theta_k$'s:

$\pi_k = \frac{1}{N} \sum_i r_{i,k} = \frac{r_k}{N}, \qquad r_k \triangleq \sum_i r_{i,k}$

$\mu_k = \dfrac{\sum_{i=1}^{N} r_{i,k}\, x_i}{r_k}$

$\Sigma_k = \dfrac{\sum_{i=1}^{N} r_{i,k}\, x_i x_i^T}{r_k} - \mu_k \mu_k^T$

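Putting the E and M steps together, here is a minimal Python sketch (ours, not from the slides) of one full EM iteration for a GMM; the function name em_step and its interface are illustrative.

```python
# Illustrative sketch only (not from the slides): one E step + one M step for a GMM.
import numpy as np
from scipy.stats import multivariate_normal


def em_step(X, pis, mus, Sigmas):
    """X: (N, n). Returns updated (pis, mus, Sigmas) after one EM iteration."""
    N, n = X.shape
    K = len(pis)
    # E step: r_{i,k} proportional to pi_k N(x_i; mu_k, Sigma_k), rows normalized.
    R = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
        for k in range(K)
    ])
    R /= R.sum(axis=1, keepdims=True)
    # M step: weighted counts, means, and covariances (as on the slide).
    r = R.sum(axis=0)                                   # r_k = sum_i r_{i,k}
    new_pis = r / N                                     # pi_k = r_k / N
    new_mus = [R[:, k] @ X / r[k] for k in range(K)]    # mu_k
    new_Sigmas = [(R[:, k, None] * X).T @ X / r[k] - np.outer(new_mus[k], new_mus[k])
                  for k in range(K)]                    # Sigma_k = weighted E[x x^T] - mu mu^T
    return new_pis, new_mus, new_Sigmas
```

Iterating such a step until the log-likelihood stops improving gives the full algorithm; in practice one often also adds a small ridge to each covariance for numerical stability.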

Page 19:

EM

Example


Figures from Kevin Murphy’s Machine-Learning book, 2012

Page 20:

EM

EM – Beyond Mixtures

Z,X: jointly distributed according to a parametrized p(z, x; θ)

This is general, regardless of whether X and/or Z are discrete or continuous RVs.

We observe X = x and want to estimate the parameter, θ.

Likelihood (usually not concave):

$L(x;\theta) \triangleq p(x;\theta) = \int p(z, x;\theta)\, dz$

Log-likelihood (usually not concave):

$l(\theta) = \log L(x;\theta) = \log p(x;\theta) = \log \int p(z, x;\theta)\, dz$

$\Rightarrow$ hard to maximize; even if $p(x, z;\theta)$ is concave, $p(x;\theta)$ is usually not concave.

The EM algorithm produces a sequence, $(\theta^0, \theta^1, \theta^2, \ldots)$, that increases $\log L(x;\theta)$ to a (local) maximum.

Page 21:

EM MM

MM Algorithms

“One of the virtues of the MM acronym is that it does double duty. Inminimization problems, the first M of MM stands for majorize and thesecond M for minimize. In maximization problems, the first M stands forminorize and the second M for maximize.”

We will define the terms “majorize” and “minorize” later.

Quote from [Hunter and Lange, “A Tutorial on MM Algorithms”, 2003]

Page 22:

EM MM

MM Algorithms

A successful MM algorithm substitutes a difficult optimization problem with an easy one.

“Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a non-differentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.”

Quote from [Hunter and Lange, “A Tutorial on MM Algorithms”, 2003]

Page 23:

EM MM

Present EM Algorithms as MM Algorithms

Find a function A(θ; θt) such that

$A(\theta^t;\theta^t) = \log L(x;\theta^t)$

$A(\theta;\theta^t) \le \log L(x;\theta)$ (“minorization”)

and then set $\theta^{t+1} = \arg\max_\theta A(\theta;\theta^t)$ (“maximization”)

Iterate.

Page 24:

EM MM

MM Algorithms

More generally (than EM): an MM algorithm for maximizing some function $f(\theta)$:

Find a function A(θ; θt) such that

$A(\theta^t;\theta^t) = f(\theta^t)\quad \forall \theta^t$

$A(\theta;\theta^t) \le f(\theta)\quad \forall \theta, \theta^t$ (“minorization”)

and then set $\theta^{t+1} = \arg\max_\theta A(\theta;\theta^t)$ (“maximization”)

Iterate.

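As an abstract illustration (ours, not from the slides), the whole minorize-maximize scheme fits in a few lines once the caller supplies the surrogate maximizer; names are hypothetical.

```python
# Illustrative skeleton only (not from the slides) of minorize-maximize:
# argmax_surrogate(theta_t) must return argmax_theta A(theta; theta_t), where A
# touches f at theta_t and lies below f everywhere.
def mm_maximize(argmax_surrogate, theta0, num_iters=100):
    theta = theta0
    for _ in range(num_iters):
        theta = argmax_surrogate(theta)  # theta_{t+1} = argmax_theta A(theta; theta_t)
    return theta
```

For EM, the surrogate maximizer is exactly the M step applied to the Q built in the E step (see the "EM as a Special Case of MM" slides).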

Page 25:

EM MM

MM Algorithms

An MM algorithm for minimizing some function $f(\theta)$ (same as $\arg\max_\theta -f(\theta)$):

Find a function A(θ; θt) such that

$A(\theta^t;\theta^t) = f(\theta^t)\quad \forall \theta^t$

$A(\theta;\theta^t) \ge f(\theta)\quad \forall \theta, \theta^t$ (“majorization”)

and then set $\theta^{t+1} = \arg\min_\theta A(\theta;\theta^t)$ (“minimization”)

Iterate.

Page 26:

EM MM

Fact (see next slide for the proof)

Let f be the objective function.

A majorization-minimization algorithm yields a non-increasing sequence:

$f(\theta^0) \ge f(\theta^1) \ge f(\theta^2) \ge \ldots$

A minorization-maximization algorithm yields a non-decreasing sequence:

$f(\theta^0) \le f(\theta^1) \le f(\theta^2) \le \ldots$

Fact (proof is omitted)

The sequence converges to a local optimum, $\theta^*$, or, in very rare cases, to a saddle point. Typically, the convergence has a linear rate:

$\lim_{t\to\infty} \dfrac{\|\theta^{t+1} - \theta^*\|}{\|\theta^t - \theta^*\|} = c < 1$

Page 27:

EM MM

About Convergence Rates

In comparison, the Newton-Raphson algorithm, under certain general conditions, tends to converge and to do so faster, at a quadratic rate:

$\lim_{t\to\infty} \dfrac{\|\theta^{t+1} - \theta^*\|}{\|\theta^t - \theta^*\|^2} = c$

(it is easy, however, to construct cases where Newton-Raphson does not converge at all)

Hence Newton-Raphson algorithms tend to require fewer iterations. On the other hand, an iteration of a Newton-Raphson algorithm can be far more computationally expensive (and error-prone) than an MM iteration. This is due to the evaluation and inversion of the Hessian matrix, $\nabla^2 f(\theta)$, in each Newton-Raphson update:

$\theta^{t+1} = \theta^t - \big(\nabla^2 f(\theta^t)\big)^{-1} \big(\nabla f(\theta^t)\big)^T$

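For comparison, a short sketch (ours, not from the slides) of a single Newton-Raphson update; in practice one solves the linear system rather than forming the Hessian inverse explicitly.

```python
# Illustrative sketch only (not from the slides): one Newton-Raphson update,
# theta_{t+1} = theta_t - [grad^2 f(theta_t)]^{-1} grad f(theta_t).
import numpy as np


def newton_step(theta, grad_f, hess_f):
    """theta: (d,) vector; grad_f returns (d,), hess_f returns (d, d)."""
    return theta - np.linalg.solve(hess_f(theta), grad_f(theta))
```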

Page 28:

EM MM

Proof.

Suppose:

(1) $f(\theta) \le A(\theta;\theta^t)\ \forall \theta, \theta^t$ (“majorization”)
(2) $f(\theta^t) = A(\theta^t;\theta^t)$
(3) $\theta^{t+1} = \arg\min_\theta A(\theta;\theta^t)$

Want to show: $f(\theta^{t+1}) \le f(\theta^t)$.

$A(\theta^{t+1};\theta^t) \overset{\text{by (3)}}{\le} A(\theta^t;\theta^t) \overset{\text{by (2)}}{=} f(\theta^t)$

$f(\theta^{t+1}) - A(\theta^{t+1};\theta^t) \overset{\text{by (1) with } \theta = \theta^{t+1}}{\le} 0$

Summing the two inequalities: $f(\theta^{t+1}) \le f(\theta^t)$.

Page 29:

EM MM

MM

The proof also shows that even if we can't find $\arg\min_\theta A(\theta;\theta^t)$, it is enough to find $\theta^{t+1}$ such that $A(\theta^{t+1};\theta^t) \le A(\theta^t;\theta^t)$.

It turns out that convergence still holds, with a similar convergence rate.

Page 30:

EM MM

EM as a Special Case of MM

$f(\theta) \triangleq \log L(x;\theta)$

$A(\theta;\theta^t) \triangleq \int \log\dfrac{p(z, x;\theta)}{p(z \mid x;\theta^t)}\, p(z \mid x;\theta^t)\, dz = E\left(\log\dfrac{p(Z, x;\theta)}{p(Z \mid x;\theta^t)} \,\middle|\, x;\theta^t\right)$

Check $A(\theta^t;\theta^t) = f(\theta^t)$:

$A(\theta^t;\theta^t) = \int \log\dfrac{p(z, x;\theta^t)}{p(z \mid x;\theta^t)}\, p(z \mid x;\theta^t)\, dz$

$= \int \log\dfrac{p(z \mid x;\theta^t)\, p(x;\theta^t)}{p(z \mid x;\theta^t)}\, p(z \mid x;\theta^t)\, dz$

$= \int \log p(x;\theta^t)\, p(z \mid x;\theta^t)\, dz = \log p(x;\theta^t) \int p(z \mid x;\theta^t)\, dz$

$= \log p(x;\theta^t) = \log L(x;\theta^t) = f(\theta^t)$

Page 31:

EM MM

EM as a Special Case of MM

$f(\theta) \triangleq l(\theta) = \log L(x;\theta)$

$A(\theta;\theta^t) \triangleq \int \log\dfrac{p(z, x;\theta)}{p(z \mid x;\theta^t)}\, p(z \mid x;\theta^t)\, dz = E\left(\log\dfrac{p(Z, x;\theta)}{p(Z \mid x;\theta^t)} \,\middle|\, x;\theta^t\right)$

Check $A(\theta;\theta^t) \le f(\theta)$:

$A(\theta;\theta^t) = E\left(\log\dfrac{p(Z, x;\theta)}{p(Z \mid x;\theta^t)} \,\middle|\, x;\theta^t\right) \overset{\text{Jensen's inequality}}{\le} \log E\left(\dfrac{p(Z, x;\theta)}{p(Z \mid x;\theta^t)} \,\middle|\, x;\theta^t\right)$

$= \log \int \dfrac{p(z, x;\theta)}{p(z \mid x;\theta^t)}\, p(z \mid x;\theta^t)\, dz$

$= \log \int p(z, x;\theta)\, dz = \log L(x;\theta) = f(\theta) \qquad \forall \theta, \theta^t$

Page 32:

EM MM

EM as a Special Case of MM

But what is the connection between this $A$ and EM?

$A(\theta;\theta^t) = \int \log\dfrac{p(z, x;\theta)}{p(z \mid x;\theta^t)}\, p(z \mid x;\theta^t)\, dz$

$= \int \big(\log p(z, x;\theta) - \log p(z \mid x;\theta^t)\big)\, p(z \mid x;\theta^t)\, dz$

$= \left[\int \log p(z, x;\theta)\, p(z \mid x;\theta^t)\, dz\right] - \underbrace{\left[\int \log p(z \mid x;\theta^t)\, p(z \mid x;\theta^t)\, dz\right]}_{\text{no } \theta \text{ here}}$

$\arg\max_\theta A(\theta;\theta^t) \iff \arg\max_\theta \int \log p(z, x;\theta)\, p(z \mid x;\theta^t)\, dz = \underbrace{E\big(l_c(\theta) \mid x;\theta^t\big)}_{Q(\theta,\theta^t)}$

The connection to MM shows that indeed an EM iteration does not decrease $l(\theta)$.

Page 33:

EM Connection Between GMM and K-means

GMM and K-means: Version 1

Consider first hard-assignment EM for a GMM in the case $\Sigma_k = \Sigma = \sigma^2 I_{n\times n}$, with weights that are known to be equal: $\pi_k = \frac{1}{K}$.

The hard-assignment rule becomes:

$z_i = \arg\max_k \frac{1}{K}\, \mathcal{N}(x_i;\mu_k,\Sigma_k) = \arg\max_k \exp\Big(-\frac{1}{2\sigma^2}\|x_i - \mu_k\|^2\Big) = \arg\min_k \|x_i - \mu_k\|$

The M step becomes:

$n_k = \sum_{i=1}^{N} \mathbf{1}_{z_i = k} = \sum_{i : z_i = k} 1 = |\{i : z_i = k\}|$

$\mu_k = \frac{1}{n_k} \sum_{i : z_i = k} x_i$

This coincides with the K-means algorithm.

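A minimal sketch (ours, not from the slides) of one iteration of this hard-assignment EM, i.e., one K-means iteration; names are illustrative.

```python
# Illustrative sketch only (not from the slides): one K-means iteration,
# i.e., hard-assignment EM with Sigma_k = sigma^2 I and pi_k = 1/K.
import numpy as np


def kmeans_step(X, mus):
    """X: (N, n) data; mus: (K, n) current means. Returns (z, new_mus)."""
    # Hard assignment: z_i = argmin_k ||x_i - mu_k||.
    dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)  # (N, K)
    z = np.argmin(dists, axis=1)
    # Update: mu_k = mean of the points assigned to cluster k (kept if the cluster is empty).
    new_mus = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                        for k in range(len(mus))])
    return z, new_mus
```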

Page 34:

EM Connection Between GMM and K-means

GMM and K-means: Version 2

Now consider standard EM for a GMM in the case $\Sigma_k = \Sigma = \sigma^2 I_{n\times n}$ (with possibly-unknown, non-uniform weights).

It can be shown that if $\sigma \to 0$, then the E step and M step coincide with K-means.

Page 35:

EM Connection Between GMM and K-means

GMM and K-means

So we saw two ways to relate GMM and K-means.

Page 36:

More Slides

Jensen Inequality

Fact (Jensen Inequality – concave functions)

Z is an RV and h is some concave function ⇒ E(h(Z)) ≤ h(E(Z)).

Example

E(log(Z)) ≤ log(E(Z))

Fact (Jensen Inequality – convex functions)

Z is an RV and h is some convex function ⇒ E(h(Z)) ≥ h(E(Z)).

Example

$E(X^2) \ge (E(X))^2$

(since we already know that $\mathrm{Var}(X) \ge 0$, this is a good way to remember which direction the inequality goes. . . )

Page 37:

More Slides

Outliers

In previous lectures, we discussed robustness via a robust-cost-function formulation.

In a mixture model, another way to handle outliers is to add a $(K+1)$-th component, fixing it to have a low weight (that weight can also be estimated) and a fixed covariance, $\sigma^2 I_{n\times n}$, where $\sigma^2$ is very large (see the sketch below). One can also use a uniformly-distributed outlier model if the support is compact.

Another approach is to use components that are themselves robust: for example, a mixture of Student's t distributions instead of a mixture of Gaussians.

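A minimal sketch (ours, not from the slides) of the extra-component idea from the first bullet: append a broad Gaussian with a small fixed weight and compute responsibilities as usual. The choice of centering it at the data mean, and the default values of w_out and sigma2_out, are our illustrative assumptions.

```python
# Illustrative sketch only (not from the slides): responsibilities with an added
# (K+1)-th "outlier" Gaussian of small fixed weight and very large fixed covariance.
import numpy as np
from scipy.stats import multivariate_normal


def responsibilities_with_outlier(X, pis, mus, Sigmas, w_out=0.05, sigma2_out=1e3):
    """Returns R of shape (N, K+1); column K is the outlier responsibility."""
    N, n = X.shape
    K = len(pis)
    # Rescale the inlier weights so that all K+1 weights sum to 1.
    cols = [(1.0 - w_out) * pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)]
    # Outlier component: centered at the data mean (an assumption), huge covariance.
    cols.append(w_out * multivariate_normal.pdf(X, mean=X.mean(axis=0),
                                                cov=sigma2_out * np.eye(n)))
    R = np.column_stack(cols)
    return R / R.sum(axis=1, keepdims=True)
```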

Page 38:

More Slides

Version Log

18/6/2019, ver 1.01. Slide 18: $\mu_k^T \mu_k \to \mu_k \mu_k^T$.

9/4/2019, ver 1.00.
