Computer Vision: Models, Learning and Inference –
Mixture Models, Part 2
Oren Freifeld and Ron Shapira-Weber
Computer Science, Ben-Gurion University
April 9, 2019
www.cs.bgu.ac.il/~cv192/ Mixtures, Part 2 (ver. 1.01) Apr 9, 2019 1 / 38
1 GMM
2 GMM Parameter Estimation
3 EM
MM
Connection Between GMM and K-means
Plan
An intuitive algorithm for GMM parameter estimation
That algorithm is an example of EM algorithms.
EM algorithms are iterative and work by maximizing a lower bound of the log-likelihood. They can be shown to increase (or at least not decrease) the log-likelihood and to converge to a local maximum.
EM algorithms can be easily adapted from Maximum-Likelihood (ML) estimation to Maximum-a-Posteriori (MAP) estimation.
Plan
EM algorithms are a particular case of the more general MM algorithms – which are in fact easier to explain than EM algorithms, without having to resort to information-theoretic concepts.
The scope of EM algorithms, let alone that of MM algorithms, goes beyond GMMs – and even beyond mixture models.
We will discuss closely-related GMM inference algorithms:
1 Hard-assignment EM and its connection to K-means
2 Hard-assignment by sampling (maybe today)
3 Gibbs sampling (next time)
GMM
Model:

p(x; θ) = ∑_{k=1}^K π_k N(x; µ_k, Σ_k),   x ∈ R^n

where:

θ_k = (µ_k, Σ_k)

θ = (θ_1, . . . , θ_K, π_1, . . . , π_K)

π_k ≥ 0 (can also insist on π_k > 0), ∀k ∈ {1, . . . , K}, and ∑_{k=1}^K π_k = 1

Data: D = (x_i)_{i=1}^N, (usually) iid samples from p(x; θ)
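As a concrete illustration (not part of the original slides), the mixture density above can be evaluated numerically. A minimal NumPy sketch follows; the function name `gmm_pdf` and the parameter values are our own, for illustration only:

```python
import numpy as np

def gmm_pdf(x, pis, mus, Sigmas):
    """Evaluate p(x; theta) = sum_k pi_k N(x; mu_k, Sigma_k) at a single point x in R^n."""
    total = 0.0
    for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas):
        n = len(mu_k)
        diff = x - mu_k
        # Gaussian density N(x; mu_k, Sigma_k)
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma_k))
        quad = diff @ np.linalg.solve(Sigma_k, diff)
        total += pi_k * norm_const * np.exp(-0.5 * quad)
    return total

# A 2-component GMM in 1D (illustrative values only)
pis = [0.5, 0.5]
mus = [np.array([-10.0]), np.array([10.0])]
Sigmas = [np.array([[25.0]]), np.array([[25.0]])]
p = gmm_pdf(np.array([0.0]), pis, mus, Sigmas)
```

In practice one would work in log-space for numerical stability; this direct form simply mirrors the formula on the slide.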
Example (a 3-component GMM in 2D)
Figures from Kevin Murphy’s Machine-Learning book, 2012
GMM and Clustering
First fit (we will see how) a GMM to the data, D = (x_i)_{i=1}^N

Next, compute p(z_i = k | x_i; θ):

r_{i,k} ≜ p(z_i = k | x_i; θ)

which, for any mixture model, equals

p(z_i = k; θ) p(x_i | z_i = k; θ) / ∑_{k'=1}^K p(z_i = k'; θ) p(x_i | z_i = k'; θ)

and, for a GMM, equals

π_k N(x_i; µ_k, Σ_k) / ∑_{k'=1}^K π_{k'} N(x_i; µ_{k'}, Σ_{k'})

r_{i,k} is called the responsibility of cluster k for point i.
This is called soft clustering (AKA soft assignments)
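The responsibilities can be computed directly from the formula above. Below is a minimal NumPy sketch (the function name `responsibilities` and the array shapes are our own choices, not from the slides):

```python
import numpy as np

def responsibilities(X, pis, mus, Sigmas):
    """r[i, k] = p(z_i = k | x_i; theta): posterior cluster probabilities under a GMM.
    X has shape (N, n); pis, mus, Sigmas hold the K mixture parameters."""
    N, n = X.shape
    K = len(pis)
    R = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]                                                    # (N, n)
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigmas[k]), diff)
        norm_const = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigmas[k]))
        R[:, k] = pis[k] * np.exp(-0.5 * quad) / norm_const                  # pi_k N(x_i; mu_k, Sigma_k)
    R /= R.sum(axis=1, keepdims=True)                                        # normalize over k
    return R
```

Each row of `R` sums to 1; taking `R.argmax(axis=1)` gives the hard assignments discussed next.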
GMM and Clustering
If we view

π_k = p(z = k; θ)

as the prior probability for class k, then

r_{i,k} = p(z_i = k | x_i; θ)

may be viewed as the posterior probability (i.e., after having observed measurement x_i) that z_i = k; i.e., the probability that x_i was drawn from Gaussian k, given that we know the value of x_i.
GMM and Clustering
Despite the term “posterior” in the previous slide, note that it referred to a probability for the label, z_i, not to a posterior probability for θ (this difference might seem like mere semantics: one can argue that the labels are parameters too).
In other words, in this lecture we are mostly interested in Maximum-Likelihood estimates of θ, and do not place a prior, p(θ), on it. That is, we view θ as an unknown deterministic quantity.
Next lecture we will be more Bayesian: we will regard θ as an RV, write the likelihood as p(x|θ), as opposed to p(x; θ), and target the posterior, p(θ|x).
GMM and Hard Clustering
Recall (in mixture models in general, not just GMM):

r_{i,k} ≜ p(z_i = k | x_i; θ) = p(z_i = k; θ) p(x_i | z_i = k; θ) / ∑_{k'=1}^K p(z_i = k'; θ) p(x_i | z_i = k'; θ)

For hard clustering, compute

z*_i = argmax_k r_{i,k} = argmax_k [ log p(x_i | z_i = k; θ) + log p(z_i = k; θ) ]

(where r_{i,k} = p(z_i = k | x_i; θ) and p(z_i = k; θ) = π_k)
GMM Parameter Estimation
The likelihood function, p(D; θ), is not concave (i.e., it is hard to optimize)
Example
Left: data histogram for a 2-component GMM with π_1 = π_2 = 0.5, σ_1 = σ_2 = 5, µ_1 = −10, and µ_2 = 10.
Right: p(D; µ_1, µ_2, σ_1 = σ_2 = 5, π_1 = π_2 = 0.5), with the σ’s and π’s fixed at their true values, as a function of µ_1 and µ_2.
Figures from Kevin Murphy’s Machine-Learning book, 2012
Expectation Maximization (EM) for Mixture Models
For a single Gaussian, it was easy to compute the ML estimators of the mean and covariance. In fact, we even had closed-form expressions for these: the sample mean and sample covariance. Thus, in a GMM, if we know the r_{i,k}’s, we can compute a weighted sample mean and weighted sample covariance for each Gaussian k.
In a GMM, if we know the model parameters, then it is easy to compute the r_{i,k}’s.
So let’s iteratively alternate between the two steps. This is EM for mixture models in a nutshell.
Expectation Maximization (EM) for Mixture Models
Log-likelihood of the observed data (AKA incomplete data):

l(θ) ≜ ∑_{i=1}^N log p(x_i; θ) = ∑_{i=1}^N log [ ∑_{z_i} p(x_i, z_i; θ) ]   (mixture model)

Define the complete-data log-likelihood:

l_c(θ) ≜ ∑_{i=1}^N log p(x_i, z_i; θ)

This can’t be computed – the z_i’s are unknown.

Note that even after observing the data, l_c(θ) is an RV.

Define the expected complete-data log-likelihood:

Q(θ, θ^t) ≜ E(l_c(θ) | D; θ^t)

where t is the iteration number. The expectation is taken w.r.t. the “old” parameter, θ^t.
Expectation Maximization (EM) for Mixture Models
The goal of the E step is to compute Q(θ, θ^t) or, more accurately, the terms inside it on which the Maximum-Likelihood estimator depends; these are known as the expected sufficient statistics.
In the M step, we optimize Q w.r.t. θ: θ^{t+1} = argmax_θ Q(θ, θ^t)
The EM produces a non-decreasing sequence:

l(θ^0) ≤ l(θ^1) ≤ l(θ^2) ≤ . . .

The typical proof, which you are likely to see in other classes and most textbooks, relies on some information-theoretic concept. Here you will see another proof, which is simpler and more general, via the connection between EM algorithms and the so-called MM algorithms.
The sequence above can be shown to converge to a (usually local) maximum.
Expectation Maximization (EM) for Mixture Models
That was for the MLE. If we want Maximum-a-Posteriori (MAP) estimation, then we adjust the M step to θ^{t+1} = argmax_θ Q(θ, θ^t) + log p(θ) (the E step is unchanged).
Here too, the EM produces a non-decreasing sequence:

l(θ^0) + log p(θ^0) ≤ l(θ^1) + log p(θ^1) ≤ l(θ^2) + log p(θ^2) ≤ . . .
EM for GMM
Q(θ, θ^t) ≜ E( ∑_i log p(x_i, z_i; θ) | D; θ^t )
         = E( ∑_i log p(z_i; θ) p(x_i | z_i; θ) | D; θ^t )
         = E( ∑_i log π_{z_i} p(x_i; θ_{z_i}) | D; θ^t )
         = E( ∑_i log ∏_k (π_k p(x_i; θ_k))^{1{z_i = k}} | D; θ^t )
         = E( ∑_i ∑_k 1{z_i = k} log(π_k p(x_i; θ_k)) | D; θ^t )
         = ∑_i ∑_k E(1{z_i = k} | D; θ^t) log(π_k p(x_i; θ_k))
         = ∑_i ∑_k p(z_i = k | x_i; θ^t) log(π_k p(x_i; θ_k))
         = ∑_i ∑_k r_{i,k} log π_k + ∑_i ∑_k r_{i,k} log p(x_i; θ_k)

where now r_{i,k} ≜ p(z_i = k | x_i; θ^t)
EM for GMM
E step:
Given θ^t, the current estimate of θ = (θ_1, . . . , θ_K, π_1, . . . , π_K), compute, ∀i ∈ {1, . . . , N} and ∀k ∈ {1, . . . , K},

r_{i,k} = π_k p(x_i | z_i = k; θ^t_k) / ∑_{k'=1}^K π_{k'} p(x_i | z_i = k'; θ^t_{k'})   (any mixture)
       = π_k N(x_i; µ^t_k, Σ^t_k) / ∑_{k'=1}^K π_{k'} N(x_i; µ^t_{k'}, Σ^t_{k'})   (GMM)

Note: ∑_{k=1}^K r_{i,k} = 1, ∀i ∈ {1, . . . , N}
EM for GMM
M step:
We optimize Q w.r.t. π (subject to ∑_k π_k = 1) and the θ_k’s:

π_k = (1/N) ∑_i r_{i,k} = r_k / N,   where r_k ≜ ∑_i r_{i,k}

µ_k = ( ∑_{i=1}^N r_{i,k} x_i ) / r_k

Σ_k = ( ∑_{i=1}^N r_{i,k} x_i x_i^T ) / r_k − µ_k µ_k^T
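Putting the E and M steps together, one full EM iteration for a GMM can be sketched as follows (a minimal NumPy sketch; the function name `em_step` and the array shapes are our own, and for clarity it skips the log-space tricks and degeneracy safeguards a practical implementation would need):

```python
import numpy as np

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a GMM: E step (responsibilities), then the M-step updates."""
    N, n = X.shape
    K = len(pis)
    # E step: r[i, k] proportional to pi_k N(x_i; mu_k, Sigma_k), normalized over k
    R = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigmas[k]), diff)
        R[:, k] = pis[k] * np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigmas[k]))
    R /= R.sum(axis=1, keepdims=True)
    # M step: weighted counts, weighted means, weighted covariances
    r = R.sum(axis=0)                              # r_k = sum_i r_{i,k}
    pis_new = r / N
    mus_new = (R.T @ X) / r[:, None]               # sum_i r_{i,k} x_i / r_k
    Sigmas_new = []
    for k in range(K):
        S = (R[:, k, None] * X).T @ X / r[k] - np.outer(mus_new[k], mus_new[k])
        Sigmas_new.append(S)
    return pis_new, mus_new, Sigmas_new, R
```

Iterating `em_step` from a reasonable initialization (e.g., K-means centers) drives the log-likelihood up monotonically, as the MM view below explains.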
Example
(Figure omitted: an example EM fit on 2D data.)
Figures from Kevin Murphy’s Machine-Learning book, 2012
EM – Beyond Mixtures
Z, X: jointly distributed according to a parametrized p(z, x; θ)

This is general, regardless of whether X and/or Z are discrete or continuous RVs.

We observe X = x and want to estimate the parameter, θ.

Likelihood (usually not concave):

L(x; θ) ≜ p(x; θ) = ∫ p(z, x; θ) dz

Log-likelihood (usually not concave):

l(θ) = log L(x; θ) = log p(x; θ) = log ∫ p(z, x; θ) dz

⇒ hard to maximize; even if p(x, z; θ) is concave, p(x; θ) is usually not concave.

The EM algorithm produces a sequence, (θ^0, θ^1, θ^2, . . .), that increases log L(x; θ) to a (local) maximum.
MM Algorithms
“One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize.”
We will define the terms “majorize” and “minorize” later.
Quote from [Hunter and Lange, “A Tutorial on MM Algorithms”, 2003]
MM Algorithms
A successful MM algorithm substitutes a difficult optimization problem with an easy one.
“Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a non-differentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.”
Quote from [Hunter and Lange, “A Tutorial on MM Algorithms”, 2003]
Present EM Algorithms as MM Algorithms
Find a function A(θ; θ^t) such that

A(θ^t; θ^t) = log L(x; θ^t)
A(θ; θ^t) ≤ log L(x; θ)   (“minorization”)

and then set θ^{t+1} = argmax_θ A(θ; θ^t)   (“maximization”)
Iterate.
MM Algorithms
More generally (than EM), an MM algorithm for maximizing some function f(θ):

Find a function A(θ; θ^t) such that

A(θ^t; θ^t) = f(θ^t)   ∀θ^t
A(θ; θ^t) ≤ f(θ)   ∀θ, θ^t   (“minorization”)

and then set θ^{t+1} = argmax_θ A(θ; θ^t)   (“maximization”)
Iterate.
MM Algorithms
An MM algorithm for minimizing some function f(θ) (the same as argmax_θ −f(θ)):

Find a function A(θ; θ^t) such that

A(θ^t; θ^t) = f(θ^t)   ∀θ^t
A(θ; θ^t) ≥ f(θ)   ∀θ, θ^t   (“majorization”)

and then set θ^{t+1} = argmin_θ A(θ; θ^t)   (“minimization”)
Iterate.
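As a toy illustration of majorize-minimize (our own example, not from the slides): minimize f(θ) = ∑_i |x_i − θ|, whose minimizer is the sample median. Each |u| can be majorized at u_t by the quadratic u²/(2|u_t|) + |u_t|/2, which touches |u| at u = ±u_t and lies above it elsewhere; minimizing the summed majorizer gives a closed-form weighted-mean update:

```python
import numpy as np

def mm_median(x, theta0, iters=100, eps=1e-12):
    """Majorize-minimize for f(theta) = sum_i |x_i - theta|.
    The quadratic majorizer A(theta; theta_t) = sum_i (x_i - theta)^2 / (2|x_i - theta_t|) + const
    touches f at theta_t and lies above it, so each closed-form minimization cannot increase f."""
    theta = theta0
    for _ in range(iters):
        w = 1.0 / (np.abs(x - theta) + eps)   # majorizer weights (eps avoids division by zero)
        theta = np.sum(w * x) / np.sum(w)     # argmin_theta A(theta; theta_t), closed form
    return theta

x = np.array([1.0, 2.0, 3.0, 10.0, 100.0])
theta = mm_median(x, theta0=np.mean(x))       # moves from the mean toward the median
```

This substitutes an easy weighted least-squares problem for a non-differentiable one, exactly in the spirit of the Hunter–Lange quote above.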
Fact (see next slide for the proof)
Let f be the objective function.
A majorization-minimization algorithm yields a non-increasing sequence:

f(θ^0) ≥ f(θ^1) ≥ f(θ^2) ≥ . . .

A minorization-maximization algorithm yields a non-decreasing sequence:

f(θ^0) ≤ f(θ^1) ≤ f(θ^2) ≤ . . .

Fact (proof is omitted)

The sequence converges to a local optimum, θ*, or, in very rare cases, a saddle point. Typically, the convergence has linear rate:

lim_{t→∞} ‖θ^{t+1} − θ*‖ / ‖θ^t − θ*‖ = c < 1
About Convergence Rates
In comparison, Newton-Raphson algorithms, under certain general conditions, tend to converge and to do so faster, at a quadratic rate:

lim_{t→∞} ‖θ^{t+1} − θ*‖ / ‖θ^t − θ*‖² = c

(it is easy, however, to construct cases where Newton-Raphson does not converge at all)

Hence Newton-Raphson algorithms tend to require fewer iterations. On the other hand, an iteration of a Newton-Raphson algorithm can be far more computationally expensive (and error-prone) than an MM iteration. This is due to the evaluation and inversion of the Hessian matrix, ∇²f(θ), in each Newton-Raphson update:

θ^{t+1} = θ^t − (∇²f(θ^t))^{−1} ∇f(θ^t)
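For concreteness, here is a scalar Newton-Raphson sketch on a toy objective of our own (the gradient and Hessian are hand-derived; in the scalar case the Hessian inversion is just a division):

```python
def newton_minimize(grad, hess, theta0, iters=20):
    """Scalar Newton-Raphson: theta_{t+1} = theta_t - grad(theta_t) / hess(theta_t)."""
    theta = theta0
    for _ in range(iters):
        theta = theta - grad(theta) / hess(theta)
    return theta

# Toy objective f(theta) = (theta - 2)^4 + theta^2, which is strictly convex
grad = lambda t: 4.0 * (t - 2.0) ** 3 + 2.0 * t   # f'(theta)
hess = lambda t: 12.0 * (t - 2.0) ** 2 + 2.0      # f''(theta) > 0 everywhere
theta = newton_minimize(grad, hess, theta0=0.0)
```

On this convex toy problem a handful of iterations reach machine precision, illustrating the quadratic rate; each step, however, requires both the gradient and the (inverted) Hessian, which is exactly the cost MM iterations avoid.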
Proof.
Suppose:

(1) f(θ) ≤ A(θ; θ^t)   ∀θ, θ^t   (“majorization”)
(2) f(θ^t) = A(θ^t; θ^t)
(3) θ^{t+1} = argmin_θ A(θ; θ^t)

We want to show: f(θ^{t+1}) ≤ f(θ^t).

By (3), A(θ^{t+1}; θ^t) ≤ A(θ^t; θ^t), which, by (2), equals f(θ^t).
By (1) with θ = θ^{t+1}, f(θ^{t+1}) − A(θ^{t+1}; θ^t) ≤ 0.

Summing the two inequalities: f(θ^{t+1}) ≤ f(θ^t).
MM
The proof also shows that even if we can’t find argmin_θ A(θ; θ^t), it is enough to find θ^{t+1} such that A(θ^{t+1}; θ^t) ≤ A(θ^t; θ^t).
It turns out that convergence still holds, with a similar convergence rate.
EM as a Special Case of MM
f(θ) ≜ log L(x; θ)

A(θ; θ^t) ≜ ∫ log [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz = E( log [ p(Z, x; θ) / p(Z|x; θ^t) ] | x; θ^t )

Check A(θ^t; θ^t) = f(θ^t):

A(θ^t; θ^t) = ∫ log [ p(z, x; θ^t) / p(z|x; θ^t) ] p(z|x; θ^t) dz
            = ∫ log [ p(z|x; θ^t) p(x; θ^t) / p(z|x; θ^t) ] p(z|x; θ^t) dz
            = ∫ log p(x; θ^t) p(z|x; θ^t) dz = log p(x; θ^t) ∫ p(z|x; θ^t) dz
            = log p(x; θ^t) = log L(x; θ^t) = f(θ^t)
EM as a Special Case of MM
f(θ) ≜ l(θ) = log L(x; θ)

A(θ; θ^t) ≜ ∫ log [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz = E( log [ p(Z, x; θ) / p(Z|x; θ^t) ] | x; θ^t )

Check A(θ; θ^t) ≤ f(θ):

A(θ; θ^t) = E( log [ p(Z, x; θ) / p(Z|x; θ^t) ] | x; θ^t )
          ≤ log E( p(Z, x; θ) / p(Z|x; θ^t) | x; θ^t )   (Jensen’s inequality)
          = log ∫ [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz
          = log ∫ p(z, x; θ) dz = log L(x; θ) = f(θ)   ∀θ, θ^t
EM as a Special Case of MM
But what is the connection between this A and EM?
A(θ; θ^t) = ∫ log [ p(z, x; θ) / p(z|x; θ^t) ] p(z|x; θ^t) dz
          = ∫ ( log p(z, x; θ) − log p(z|x; θ^t) ) p(z|x; θ^t) dz
          = [ ∫ log p(z, x; θ) p(z|x; θ^t) dz ] − [ ∫ log p(z|x; θ^t) p(z|x; θ^t) dz ]

where the second bracketed term does not involve θ. Hence,

argmax_θ A(θ; θ^t) = argmax_θ ∫ log p(z, x; θ) p(z|x; θ^t) dz = argmax_θ E(l_c(θ) | x; θ^t) = argmax_θ Q(θ, θ^t)

The connection to MM shows that indeed an EM iteration does not decrease l(θ).
GMM and K-means: Version 1
Consider first hard-assignment EM for GMM in the case Σ_k = Σ = σ²I_{n×n}, with weights that are known to be equal: π_k = 1/K.

The hard-assignment rule becomes:

z_i = argmax_k (1/K) N(x_i; µ_k, Σ) = argmax_k exp( −(1/(2σ²)) ‖x_i − µ_k‖² ) = argmin_k ‖x_i − µ_k‖

The M step becomes:

n_k = ∑_{i=1}^N 1{z_i = k} = |{i : z_i = k}|

µ_k = (1/n_k) ∑_{i: z_i = k} x_i

This coincides with the K-means algorithm.
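The resulting iteration can be sketched directly (a minimal NumPy sketch; the function name `kmeans_step` is our own):

```python
import numpy as np

def kmeans_step(X, mus):
    """One hard-assignment EM iteration under Sigma_k = sigma^2 I and pi_k = 1/K,
    i.e., one K-means iteration: nearest-center assignment, then mean update."""
    # Hard E step: z_i = argmin_k ||x_i - mu_k||
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)   # (N, K) squared distances
    z = d2.argmin(axis=1)
    # M step: mu_k = mean of the points currently assigned to cluster k
    mus_new = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                        for k in range(len(mus))])
    return z, mus_new
```

The empty-cluster guard (keeping the old µ_k) is one common convention; the slides do not specify how that case is handled.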
GMM and K-means: Version 2
Now consider standard EM for GMM, in the case Σ_k = Σ = σ²I_{n×n} (with possibly-unknown, non-uniform weights).
It can be shown that if σ → 0, then the E-step and M-step coincide withK-means.
GMM and K-means
So we saw two ways to relate GMM and K-means.
Jensen Inequality
Fact (Jensen Inequality – concave functions)

Z is an RV and h is some concave function ⇒ E(h(Z)) ≤ h(E(Z)).
Example
E(log(Z)) ≤ log(E(Z))
Fact (Jensen Inequality – convex functions)
Z is an RV and h is some convex function ⇒ E(h(Z)) ≥ h(E(Z)).
Example
E(X2) ≥ E2(X)
(since we already know that Var(X) ≥ 0, this is a good way to remember which direction the inequality goes. . . )
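Both directions are easy to check numerically; Jensen’s inequality holds exactly for the empirical distribution of any sample (an illustration of ours, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.uniform(1.0, 5.0, size=100_000)   # samples of a positive RV

# Concave h = log: E(log Z) <= log E(Z)
lhs_log, rhs_log = np.mean(np.log(Z)), np.log(np.mean(Z))

# Convex h = square: E(Z^2) >= (E Z)^2
lhs_sq, rhs_sq = np.mean(Z ** 2), np.mean(Z) ** 2
```

Here the expectations are replaced by sample averages, to which Jensen’s inequality applies verbatim.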
Outliers
In previous lectures, we discussed robustness via a robust-cost functionformulation.
In a mixture model, another way to handle outliers is by adding a (K+1)-th component, fixing it to have a low weight (that weight can also be estimated) and a fixed covariance, σ²I_{n×n}, where σ² is very large. One can also use a uniformly-distributed outlier model if the support is compact.
Another approach is to use components that are robust, e.g., a mixture of Student’s t distributions instead of a mixture of Gaussians.
Version Log
18/6/2019, ver 1.01. Slide 18: µ_k^T µ_k → µ_k µ_k^T
9/4/2019, ver 1.00.