
Page 1: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Tutorial

Advanced Information Theory in CVPR “in a Nutshell”

CVPR, June 13-18 2010, San Francisco, CA

Gaussian Mixtures: Classification & PDF Estimation

Francisco Escolano & Anand Rangarajan

Page 2: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Gaussian Mixtures

Background. Gaussian Mixtures are ubiquitous in CVPR. For instance, in CBIR it is sometimes interesting to model the image as a pdf over the pixel colors and positions (see for instance [Goldberger et al.,03], where a KL-divergence computation method is presented). GMs often provide a model for the pdf associated with the image, and this is useful for segmentation. GMs, as we have seen in the previous lesson, are also useful for modeling shapes. Therefore GM estimation has been a recurrent topic in CVPR. Traditional methods, associated with the EM algorithm, have evolved to incorporate IT elements such as the MDL principle for model-order selection [Figueiredo et al.,02], in parallel with the development of Variational Bayes (VB) [Constantinopoulos and Likas,07].


Page 3: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Uses of Gaussian Mixtures

Figure: Gaussian Mixtures for modeling images (top) and for color-based segmentation (bottom)


Page 4: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Review of Gaussian Mixtures

Definition

A d-dimensional random variable Y follows a finite-mixture distribution when its pdf p(Y|Θ) can be described by a weighted sum of known pdfs, named kernels. When all of these kernels are Gaussian, the mixture is called a Gaussian mixture:

$$p(Y \mid \Theta) = \sum_{i=1}^{K} \pi_i\, p(Y \mid \Theta_i),$$

where 0 ≤ πi ≤ 1, i = 1, ..., K, Σ_{i=1}^{K} πi = 1, K is the number of kernels, π1, ..., πK are the a priori probabilities of each kernel, and Θi are the parameters describing the i-th kernel. In GMs, Θi = {μi, Σi}, that is, the mean vector and covariance matrix.
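To make the definition concrete, here is a minimal Python/SciPy sketch (my own illustration, not part of the original slides) that evaluates such a mixture pdf; the function name gmm_pdf and the toy parameters are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(Y, weights, means, covs):
    """Evaluate p(Y|Theta) = sum_i pi_i N(Y; mu_i, Sigma_i) for each row of Y."""
    Y = np.atleast_2d(Y)
    density = np.zeros(len(Y))
    for pi_i, mu_i, sigma_i in zip(weights, means, covs):
        density += pi_i * multivariate_normal.pdf(Y, mean=mu_i, cov=sigma_i)
    return density

# Toy 2-D mixture with K = 2 kernels (illustrative parameters only).
weights = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(np.array([[0.0, 0.0], [3.0, 3.0]]), weights, means, covs))
```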


Page 5: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Review of Gaussian Mixtures (2)

GMs and Maximum Likelihood

The whole set of parameters of a given K-mixture is denoted by Θ ≡ {Θ1, ..., ΘK, π1, ..., πK}. Obtaining the optimal set of parameters Θ* is usually posed in terms of maximizing the log-likelihood of the pdf to be estimated, based on a set of N i.i.d. samples of the variable Y = {y1, ..., yN}:

$$L(\Theta, Y) = \ell(Y \mid \Theta) = \log p(Y \mid \Theta) = \log \prod_{n=1}^{N} p(y_n \mid \Theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(y_n \mid \Theta_k).$$
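As a practical note (again my own sketch, not from the slides), the inner sum over kernels is usually evaluated with the log-sum-exp trick to avoid numerical underflow:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(Y, weights, means, covs):
    """ell(Y|Theta) = sum_n log sum_k pi_k p(y_n|Theta_k), computed via log-sum-exp."""
    # log_prob[n, k] = log pi_k + log N(y_n; mu_k, Sigma_k)
    log_prob = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(Y, mean=mu_k, cov=sig_k)
        for pi_k, mu_k, sig_k in zip(weights, means, covs)
    ])
    return logsumexp(log_prob, axis=1).sum()
```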


Page 6: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Review of Gaussian Mixtures (3)

GMs and EM

The EM algorithm allows us to find maximum-likelihood solutions to problems with hidden variables. In the case of Gaussian mixtures, these variables are a set of N labels Z = {z1, ..., zN} associated with the samples. Each label is a binary vector z^(n) = [z^(n)_1, ..., z^(n)_K], K being the number of components, with z^(n)_m = 1 and z^(n)_p = 0 for p ≠ m, denoting that yn has been generated by kernel m. Then, given the complete set of data X = {Y, Z}, the log-likelihood of this set is given by

$$\log p(Y, Z \mid \Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \log\left[\pi_k\, p(y_n \mid \Theta_k)\right].$$


Page 7: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Review of Gaussian Mixtures (4)

E-Step

It consists of estimating the expected value of the hidden variables given the visible data Y and the current estimate of the parameters Θ*(t):

$$E[z^{(n)}_k \mid Y, \Theta^*(t)] = P[z^{(n)}_k = 1 \mid y_n, \Theta^*(t)] = \frac{\pi^*_k(t)\, p(y_n \mid \Theta^*_k(t))}{\sum_{j=1}^{K} \pi^*_j(t)\, p(y_n \mid \Theta^*_j(t))}.$$

Thus, the probability of generating yn with the kernel k is given by:

$$p(k \mid y_n) = \frac{\pi_k\, p(y_n \mid k)}{\sum_{j=1}^{K} \pi_j\, p(y_n \mid j)}.$$


Page 8: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Review of Gaussian Mixtures (5)

M-Step

Given the expected Z, the new parameters Θ*(t + 1) are given by:

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k \mid y_n),$$

$$\mu_k = \frac{\sum_{n=1}^{N} p(k \mid y_n)\, y_n}{\sum_{n=1}^{N} p(k \mid y_n)},$$

$$\Sigma_k = \frac{\sum_{n=1}^{N} p(k \mid y_n)\,(y_n - \mu_k)(y_n - \mu_k)^T}{\sum_{n=1}^{N} p(k \mid y_n)}.$$
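And a matching M-step sketch (same caveats: my own illustration), taking the responsibility matrix resp[n, k] = p(k|y_n) produced by an E-step like the one sketched above:

```python
import numpy as np

def m_step(Y, resp):
    """Update (pi_k, mu_k, Sigma_k) from responsibilities resp[n, k] = p(k|y_n)."""
    N, d = Y.shape
    Nk = resp.sum(axis=0)                    # effective number of samples per kernel
    weights = Nk / N                         # pi_k = (1/N) sum_n p(k|y_n)
    means = (resp.T @ Y) / Nk[:, None]       # mu_k = responsibility-weighted sample mean
    covs = []
    for k in range(resp.shape[1]):
        diff = Y - means[k]
        covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])  # weighted covariance
    return weights, means, covs
```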


Page 9: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Order Selection

Two Extreme Approaches

• How many kernels are needed to describe the distribution?

• In [Figueiredo and Jain,02] it is proposed to perform EM for different values of K and take the one optimizing ML and an MDL-like criterion. Starting from a high K, kernel fusions are performed if needed. Local optima arise.

• In EBEM [Penalver et al., 09] we show that it is possible to apply MDL more efficiently and robustly by starting from a single kernel and splitting only if the underlying data is not Gaussian. The main challenge of this approach is how to estimate Gaussianity for multi-dimensional data.


Page 10: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Order Selection (2)

MDL

Minimum Description Length and related principles choose a representation of the data that allows us to express it with the shortest possible message from a postulated set of models. Rissanen's MDL amounts to minimizing

$$C_{MDL}(\Theta(K), K) = -L(\Theta(K), Y) + \frac{N(K)}{2} \log n,$$

where N(K) is the number of parameters required to define a K-component mixture and n is the number of samples:

$$N(K) = (K - 1) + K\left(d + \frac{d(d+1)}{2}\right).$$
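A small sketch of this criterion (my own, hedged illustration), assuming the log-likelihood comes from an EM fit such as the one sketched earlier; mdl_cost is a hypothetical helper name:

```python
import numpy as np

def mdl_cost(log_likelihood, K, d, n):
    """C_MDL(Theta(K), K) = -L(Theta(K), Y) + N(K)/2 * log(n)."""
    # N(K): (K-1) free weights + K means (d each) + K covariances (d(d+1)/2 each)
    num_params = (K - 1) + K * (d + d * (d + 1) / 2)
    return -log_likelihood + 0.5 * num_params * np.log(n)
```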


Page 11: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Gaussian Deficiency

Maximum Entropy of a Mixture

According to the second Gibbs theorem, Gaussian variables have the maximum entropy among all variables with equal variance. This theoretical maximum entropy for a d-dimensional variable Y depends only on the covariance Σ and is given by:

$$H_{max}(Y) = \frac{1}{2} \log\left[(2\pi e)^d\, |\Sigma|\right].$$

Therefore, the maximum entropy of the mixture is given by

$$H_{max}(Y) = \sum_{k=1}^{K} \pi_k\, H_{max}(k).$$


Page 12: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Gaussian Deficiency (2)

Gaussian Deficiency

Instead of using the MDL principle we may compare the estimated entropy of the underlying data with the entropy of a Gaussian. We define the Gaussianity Deficiency (GD) of the whole mixture as the normalized weighted sum of the differences between the maximum and real entropy of each kernel:

$$GD = \sum_{k=1}^{K} \pi_k \left(\frac{H_{max}(k) - H_{real}(k)}{H_{max}(k)}\right) = \sum_{k=1}^{K} \pi_k \left(1 - \frac{H_{real}(k)}{H_{max}(k)}\right),$$

where H_real(k) is the real entropy of the data under the k-th kernel. We have 0 ≤ GD ≤ 1 (0 iff Gaussian). If the GD is low enough we may stop the algorithm.
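A hedged sketch of the GD computation (my own illustration), assuming the per-kernel real entropies H_real(k) are supplied by some non-parametric estimator, as discussed later in the tutorial:

```python
import numpy as np

def gaussian_deficiency(weights, real_entropies, covs):
    """GD = sum_k pi_k (1 - H_real(k) / H_max(k)), with H_max from the 2nd Gibbs theorem."""
    gd = 0.0
    for pi_k, h_real, sigma_k in zip(weights, real_entropies, covs):
        d = sigma_k.shape[0]
        # H_max(k) = 0.5 * log((2*pi*e)^d |Sigma_k|), computed via slogdet for stability.
        h_max = 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(sigma_k)[1])
        gd += pi_k * (1.0 - h_real / h_max)
    return gd
```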


Page 13: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Gaussian Deficiency (3)

Kernel Selection

If the GD ratio is below a given threshold, we consider that all kernels are well fitted. Otherwise, we select the kernel with the highest individual ratio and replace it by two other kernels that are conveniently placed and initialized. Then, a new EM epoch with K + 1 kernels starts. The worst kernel is given by

$$k^* = \arg\max_k \left\{\pi_k\, \frac{H_{max}(k) - H_{real}(k)}{H_{max}(k)}\right\}.$$

Independently of whether MDL or GD is used, in order to decide which kernel should be split into two other kernels (if needed), we compute the latter expression and decide whether to split k* according to MDL or GD.


Page 14: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Split Process

Split Constraints

The k* component must be decomposed into the kernels k1 and k2 with parameters Θ_k1 = (μ_k1, Σ_k1) and Θ_k2 = (μ_k2, Σ_k2). In multivariate settings, the corresponding priors, mean vectors and covariance matrices should satisfy the following split equations:

$$\pi_* = \pi_1 + \pi_2,$$
$$\pi_* \mu_* = \pi_1 \mu_1 + \pi_2 \mu_2,$$
$$\pi_*(\Sigma_* + \mu_* \mu_*^T) = \pi_1(\Sigma_1 + \mu_1 \mu_1^T) + \pi_2(\Sigma_2 + \mu_2 \mu_2^T).$$

Clearly, the split move is an ill-posed problem because the number of equations is smaller than the number of unknowns.


Page 15: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Split Process (2)

Split

Following [Dellaportas,06], let Σ* = V* Λ* V*^T. Let also D be a d × d rotation matrix with orthonormal unit vectors as columns. Then:

$$\pi_1 = u_1 \pi_*, \qquad \pi_2 = (1 - u_1)\pi_*,$$
$$\mu_1 = \mu_* - \left(\sum_{i=1}^{d} u_2^i \sqrt{\lambda_*^i}\, V_*^i\right)\sqrt{\frac{\pi_2}{\pi_1}},$$
$$\mu_2 = \mu_* + \left(\sum_{i=1}^{d} u_2^i \sqrt{\lambda_*^i}\, V_*^i\right)\sqrt{\frac{\pi_1}{\pi_2}},$$
$$\Lambda_1 = \mathrm{diag}(u_3)\,\mathrm{diag}(\iota - u_2)\,\mathrm{diag}(\iota + u_2)\,\Lambda_*\,\frac{\pi_*}{\pi_1},$$
$$\Lambda_2 = \mathrm{diag}(\iota - u_3)\,\mathrm{diag}(\iota - u_2)\,\mathrm{diag}(\iota + u_2)\,\Lambda_*\,\frac{\pi_*}{\pi_2},$$
$$V_1 = D V_*, \qquad V_2 = D^T V_*,$$


Page 16: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Split Process (3)

Split (cont.)

The latter spectral split method has a non-evident random component, because ι is a d × 1 vector of ones, and u1, u2 = (u2^1, u2^2, ..., u2^d)^T and u3 = (u3^1, u3^2, ..., u3^d)^T are 2d + 1 random variables needed to build the priors, means and eigenvalues of the new components in the mixture. They are calculated as:

$$u_1 \sim \beta(2, 2), \quad u_2^1 \sim \beta(1, 2d), \quad u_2^j \sim U(-1, 1), \quad u_3^1 \sim \beta(1, d), \quad u_3^j \sim U(0, 1),$$

with j = 2, ..., d, where β(·,·) and U(·,·) denote the Beta and Uniform distributions, respectively.
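A sketch of one possible implementation of this split, transcribing the equations above; it is my own illustration, and it fixes the rotation D to the identity for simplicity (the method allows an arbitrary or random rotation):

```python
import numpy as np

def split_kernel(pi_star, mu_star, sigma_star, rng=np.random.default_rng()):
    """Split one Gaussian kernel into two, following the spectral split equations above."""
    d = mu_star.shape[0]
    eigval, eigvec = np.linalg.eigh(sigma_star)            # Sigma* = V* Lambda* V*^T
    # Random quantities: u1 ~ Beta(2,2); u2, u3 component-wise as in the text.
    u1 = rng.beta(2, 2)
    u2 = np.concatenate(([rng.beta(1, 2 * d)], rng.uniform(-1, 1, d - 1)))
    u3 = np.concatenate(([rng.beta(1, d)], rng.uniform(0, 1, d - 1)))
    iota = np.ones(d)
    pi1, pi2 = u1 * pi_star, (1 - u1) * pi_star
    shift = (eigvec * (u2 * np.sqrt(eigval))).sum(axis=1)   # sum_i u2^i sqrt(lambda_i) V_i
    mu1 = mu_star - shift * np.sqrt(pi2 / pi1)
    mu2 = mu_star + shift * np.sqrt(pi1 / pi2)
    lam1 = u3 * (iota - u2) * (iota + u2) * eigval * pi_star / pi1
    lam2 = (iota - u3) * (iota - u2) * (iota + u2) * eigval * pi_star / pi2
    # Simplifying assumption: D = identity, so V1 = V2 = V*.
    sigma1 = eigvec @ np.diag(lam1) @ eigvec.T
    sigma2 = eigvec @ np.diag(lam2) @ eigvec.T
    return (pi1, mu1, sigma1), (pi2, mu2, sigma2)
```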


Page 17: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Split Process (4)

Figure: Split of a 2D kernel into two kernels.


Page 18: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBEM Algorithm

Alg. 1: EBEM – Entropy-Based EM Algorithm

Input: convergence_th
K = 1, i = 0, π1 = 1, Θ1 = {μ1, Σ1} with μ1 = (1/N) Σ_{n=1}^N yn and Σ1 = 1/(N−1) Σ_{n=1}^N (yn − μ1)(yn − μ1)^T
Final = false
repeat
    i = i + 1
    repeat
        EM iteration
        Estimate the log-likelihood at iteration i: ℓ(Y|Θ(i))
    until |ℓ(Y|Θ(i)) − ℓ(Y|Θ(i−1))| < convergence_th
    Evaluate H_real(Y) and H_max(Y)
    Select k* with the highest ratio: k* = arg max_k { πk (H_max(k) − H_real(k)) / H_max(k) }
    Estimate C_MDL at iteration i: N(K) = (K − 1) + K(d + d(d+1)/2), C_MDL(Θ(i)) = −ℓ(Y|Θ(i)) + (N(K)/2) log n
    if C_MDL(Θ(i)) ≥ C_MDL(Θ(i−1)) then
        Final = true
        K = K − 1, Θ* = Θ(i−1)
    else
        Decompose k* into k1 and k2
    end
until Final = true
Output: Optimal mixture model: K, Θ*


Page 19: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBEM Algorithm (2)

Figure: Top: MML (Figueiredo & Jain), Bottom: EBEM


Page 20: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBEM Algorithm (3)

Figure: Color Segmentation: EBEM (2nd col.) vs VEM (3rd col.)


Page 21: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBEM Algorithm (4)

Table: EM, VEM and EBEM in Color Image Segmentation (PSNR, dB)

Algorithm      "Forest" (K=5)    "Sunset" (K=7)    "Lighthouse" (K=8)
Classic EM     5.35 ± 0.39       14.76 ± 2.07      12.08 ± 2.49
VEM            10.96 ± 0.59      18.64 ± 0.40      15.88 ± 1.08
EBEM           14.1848 ± 0.35    18.91 ± 0.38      19.4205 ± 2.11


Page 22: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBEM Algorithm (5)

EBEM in Higher Dimensions

• We have also tested the algorithm with the well-known Wine data set, which contains 3 classes of 178 (13-dimensional) instances.

• The number of samples, 178, is not enough to build the pdf using Parzen's window method in a 13-dimensional space. With the MST approach (see below), where no pdf estimation is needed, the algorithm can be applied to this data set.

• After EBEM ends with K = 3, a maximum a posteriori classifier was built. The classification performance was 96.1%, which is similar to or even better than the experiments reported in the literature.


Page 23: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Entropic Graphs

EGs and Renyi Entropy

Entropic Spanning Graphs obtained from data to estimate Rényi's α-entropy [Hero and Michel, 02] belong to the “non plug-in” methods for entropy estimation. Rényi's α-entropy of a probability density function p is defined as:

$$H_\alpha(p) = \frac{1}{1 - \alpha} \ln \int_z p^\alpha(z)\, dz$$

for α ∈ [0, 1). The α-entropy converges to the Shannon one, lim_{α→1} Hα(p) = H(p) ≡ −∫ p(z) ln p(z) dz, so it is possible to obtain the Shannon entropy from Rényi's if the latter limit is either solved or numerically approximated.


Page 24: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Entropic Graphs (2)

EGs and Renyi Entropy (cont.)

Let G be a graph consisting of a set of vertices Xn = {x1, ..., xn}, with xi ∈ R^d, and edges {e} that connect vertices: eij = (xi, xj). If we denote by M(Xn) the possible sets of edges in the class of acyclic graphs spanning Xn (spanning trees), the total edge length functional of the Euclidean power-weighted Minimal Spanning Tree is:

$$L^{MST}_\gamma(X_n) = \min_{M(X_n)} \sum_{e \in M(X_n)} \|e\|^\gamma$$

with γ ∈ [0, d] and ‖·‖ the Euclidean distance. The MST has been used to measure the randomness of a set of points.


Page 25: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Entropic Graphs (3)

EGs and Renyi Entropy (cont.)

It is intuitive that the length of the MST for uniformly distributed points increases at a greater rate than does the MST spanning a more concentrated, nonuniform set of points. For d ≥ 2:

$$H_\alpha(X_n) = \frac{d}{\gamma}\left[\ln \frac{L_\gamma(X_n)}{n^\alpha} - \ln \beta_{L_\gamma, d}\right]$$

is an asymptotically unbiased and almost surely consistent estimator of the α-entropy of p, where α = (d − γ)/d and β_{Lγ,d} is a constant bias correction for which only approximations and bounds are known: (i) Monte Carlo simulation of uniform random samples on the unit cube [0, 1]^d; (ii) large-d approximation: (γ/2) ln(d/(2πe)).
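A minimal sketch of this estimator using SciPy's MST routine (my own illustration); renyi_entropy_mst is a hypothetical helper, and the bias term ln β_{Lγ,d} must be supplied externally, e.g. via the Monte Carlo or large-d approximations mentioned above:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

def renyi_entropy_mst(X, gamma=1.0, log_beta=0.0):
    """Estimate H_alpha, alpha = (d - gamma)/d, from the power-weighted MST length.

    log_beta stands for ln(beta_{L_gamma,d}) and must be supplied by the caller.
    """
    n, d = X.shape
    alpha = (d - gamma) / d
    # Total edge length functional L_gamma(X_n) of the Euclidean power-weighted MST.
    powered_dist = distance_matrix(X, X) ** gamma
    L_gamma = minimum_spanning_tree(powered_dist).sum()
    return (d / gamma) * (np.log(L_gamma / n ** alpha) - log_beta)
```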


Page 26: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Entropic Graphs (4)

Figure: Uniform (left) vs Gaussian (right) distributions' EGs.


Page 27: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Entropic Graphs (5)

Figure: Extrapolation to Shannon: $\alpha^* = 1 - \frac{a + b\, e^{c\cdot d}}{N}$


Page 28: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Variational Bayes

Problem Definition

Given N i.i.d. samples X = {x1, ..., xN} of a d-dimensional random variable X, their associated hidden variables Z = {z1, ..., zN} and the parameters Θ of the model, the Bayesian posterior is given by [Watanabe et al.,09]:

$$p(Z, \Theta \mid X) = \frac{p(\Theta) \prod_{n=1}^{N} p(\vec{x}_n, \vec{z}_n \mid \Theta)}{\int p(\Theta) \prod_{n=1}^{N} p(\vec{x}_n, \vec{z}_n \mid \Theta)\, d\Theta}.$$

Since the integration w.r.t. Θ is analytically intractable, the posterior is approximated by a factorized distribution q(Z, Θ) = q(Z)q(Θ), and the optimal approximation is the one that minimizes the variational free energy.


Page 29: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Variational Bayes (2)

Problem Definition (cont.)

The variational free energy is given by:

$$L(q) = \int q(Z, \Theta) \log \frac{q(Z, \Theta)}{p(Z, \Theta \mid X)}\, d\Theta - \log \int p(\Theta) \prod_{n=1}^{N} p(\vec{x}_n \mid \Theta)\, d\Theta,$$

where the first term is the Kullback-Leibler divergence between the approximation and the true posterior. As the second term is independent of the approximation, the Variational Bayes (VB) approach reduces to minimizing that divergence. Such minimization is addressed in an EM-like process alternating the update of q(Θ) and the update of q(Z).


Page 30: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Variational Bayes (3)

Problem Definition (cont.)

The EM-like process alternating the update of q(Θ) and the update of q(Z) is given by

$$q(\Theta) \propto p(\Theta) \exp\left\{\sum_{n=1}^{N} \left\langle \log p(\vec{x}_n, \vec{z}_n \mid \Theta) \right\rangle_{q(Z)}\right\}$$

$$q(Z) \propto \exp\left\{\sum_{n=1}^{N} \left\langle \log p(\vec{x}_n, \vec{z}_n \mid \Theta) \right\rangle_{q(\Theta)}\right\}$$
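The component-splitting scheme discussed next is not available as an off-the-shelf routine, but a generic variational treatment of a Gaussian mixture can be exercised with scikit-learn's BayesianGaussianMixture. The sketch below is only a practical reference point, not the authors' method, and the toy data and settings are made up:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated 2-D Gaussian blobs.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

# Variational Bayes GMM with Dirichlet priors on the weights;
# unnecessary components are driven toward near-zero weights.
vb = BayesianGaussianMixture(n_components=6, covariance_type="full",
                             weight_concentration_prior_type="dirichlet_distribution",
                             max_iter=500, random_state=0).fit(X)
print(np.round(vb.weights_, 3))
```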


Page 31: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Variational Bayes (4)

Problem Definition (cont.)

In [Constantinopoulos and Likas,07], the optimization of the variational free energy yields (where N(·) and W(·) are the Gaussian and Wishart densities, respectively):

$$q(Z) = \prod_{n=1}^{N} \prod_{k=1}^{s} r_{kn}^{z_{nk}} \prod_{k=s+1}^{K} \rho_{kn}^{z_{nk}}$$

$$q(\mu) = \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid m_k, \Sigma_k)$$

$$q(\Sigma) = \prod_{k=1}^{K} \mathcal{W}(\Sigma_k \mid \nu_k, V_k)$$

$$q(\beta) = \left(1 - \sum_{k=1}^{s} \pi_k\right)^{-K+s} \frac{\Gamma\!\left(\sum_{k=s+1}^{K} \gamma_k\right)}{\prod_{k=s+1}^{K} \Gamma(\gamma_k)} \cdot \prod_{k=s+1}^{K} \left(\frac{\pi_k}{1 - \sum_{k=1}^{s} \pi_k}\right)^{\gamma_k - 1},$$

After the maximization of the free energy w.r.t. q(·), the method proceeds to update the coefficients in α, which denote the free components.


Page 32: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Selection in VB

Fixed and Free Components

• In the latter framework, it is assumed that K − s components fit the data well in their region of influence (fixed components), and model-order selection is then posed in terms of optimizing the parameters of the remaining s free components.

• Let α = {πk}_{k=1}^s be the coefficients of the free components and β = {πk}_{k=s+1}^K the coefficients of the fixed components. Under the i.i.d. sampling assumption, the prior distribution of Z given α and β can be modeled by a product of multinomials:

$$p(Z \mid \alpha, \beta) = \prod_{n=1}^{N} \prod_{k=1}^{s} \pi_k^{z_{nk}} \prod_{k=s+1}^{K} \pi_k^{z_{nk}}.$$


Page 33: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Selection in VB (2)

Fixed and Free Components (cont.)

• Moreover, assuming conjugate Dirichlet priors over the set of mixing coefficients, we have that

$$p(\beta \mid \alpha) = \left(1 - \sum_{k=1}^{s} \pi_k\right)^{-K+s} \frac{\Gamma\!\left(\sum_{k=s+1}^{K} \gamma_k\right)}{\prod_{k=s+1}^{K} \Gamma(\gamma_k)} \cdot \prod_{k=s+1}^{K} \left(\frac{\pi_k}{1 - \sum_{k=1}^{s} \pi_k}\right)^{\gamma_k - 1}.$$

• Then, considering fixed coefficients, Θ is redefined as Θ = {μ, Σ, β} and we have the following factorization:

$$q(Z, \Theta) = q(Z)\, q(\mu)\, q(\Sigma)\, q(\beta).$$


Page 34: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Selection in VB (3)

Kernel Splits

• In [Constantinopoulos and Likas,07], the VBgmm method is used for training an initial K = 2 model. Then, in the so-called VBgmmSplit, they proceed by sorting the obtained kernels and then trying to split them recursively.

• Each splitting consists of:
  • Removing the original component.
  • Replacing it by two kernels with the same covariance matrix as the original, but with means placed in opposite directions along the direction of maximum variability.


Page 35: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Selection in VB (4)

Kernel Splits (cont)

• Independently of the split strategy, the critical point of VBgmmSplit is the number of splits needed until convergence. At each iteration of the latter algorithm, all K currently existing kernels are split. Consider the case in which some split is detected as proper (non-zero π after running the VB update described in the previous section, where each new kernel is considered as free).

• Then the number of components increases, and a new set of splitting tests starts in the next iteration. This means that if the algorithm stops (all splits failed) with K kernels, the number of splits has been 1 + 2 + ... + K = K(K + 1)/2.


Page 36: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Model Selection in VB (5)

EBVS Split

• We split only one kernel per iteration. In order to do so, we implement a selection criterion based on measuring the entropy of the kernels.

• If one uses Leonenko's estimator, there is no need for extrapolation as in EGs, and asymptotic consistency is ensured.

• Then, at each iteration of the algorithm, we select the worst kernel, in terms of low entropy, to be split (a sketch of such an estimator follows this list). If the split is successful, we will have K + 1 kernels to feed the VB optimization in the next iteration. Otherwise, there is no need to add a new kernel and the process converges to K kernels. The key point here is that the overall process is linear (one split per iteration).
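The slides only name Leonenko's estimator; a common k-NN form of this family (Kozachenko-Leonenko style) is sketched below, under the assumption of Euclidean distances and distinct samples. The helper name knn_entropy and the default k are mine:

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def knn_entropy(X, k=3):
    """k-NN (Kozachenko-Leonenko style) estimate of the Shannon entropy, in nats."""
    n, d = X.shape
    tree = cKDTree(X)
    # Distance to the k-th neighbor of each sample (the query includes the point itself).
    r_k = tree.query(X, k=k + 1)[0][:, -1]
    log_unit_ball = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_unit_ball + d * np.mean(np.log(r_k))
```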


Page 37: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBVS: Fast BV

Figure: EBVS Results


Page 38: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBVS: Fast BV (2)

Figure: EBVS Results (more)


Page 39: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

EBVS: Fast BV (3)

MD Experiments

• With this approach, using Leonenko's estimator, the classification performance we obtain on this data set is 86%.

• Although experiments in higher dimensions can be performed, when the number of samples is not high enough the risk of unbounded maxima of the likelihood function is higher, due to singular covariance matrices.

• The entropy estimation method, however, performs very well with thousands of dimensions.


Page 40: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

Conclusions

Summarizing Ideas in GMs

• In the multi-dimensional case, efficient entropy estimators become critical.

• In VB, where model-order selection is implicit, it is possible to reduce the complexity by at least one order of magnitude.

• The same approach can be used for shapes in 2D and 3D.

• Once we have the mixtures, new measures for comparing them are waiting to be discovered and used. Let's do it!


Page 41: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

References

• [Goldberger et al., 03] Goldberger, J., Gordon, S., Greenspan, H. (2003). An Efficient Image Similarity Measure Based on Approximations of KL-Divergence Between Two Gaussian Mixtures. ICCV'03.

• [Figueiredo and Jain, 02] Figueiredo, M. and Jain, A. (2002). Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 381–399.

• [Constantinopoulos and Likas, 07] Constantinopoulos, C. and Likas, A. (2007). Unsupervised Learning of Gaussian Mixtures Based on Variational Component Splitting. IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 745–755.


Page 42: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

References (2)

• [Penalver et al., 09] Penalver, A., Escolano, F., Saez, J.M. (2009). Learning Gaussian Mixture Models with Entropy-Based Criteria. IEEE Trans. on Neural Networks, 20(11), 1756–1771.

• [Dellaportas, 06] Dellaportas, P. and Papageorgiou, I. (2006). Multivariate mixtures of normals with unknown number of components. Statistics and Computing, vol. 16, no. 1, pp. 57–68.

• [Hero and Michel, 02] Hero, A. and Michel, O. (2002). Applications of entropic spanning graphs. IEEE Signal Processing Magazine, vol. 19, no. 5, pp. 85–95.

• [Watanabe et al., 09] Watanabe, K., Akaho, S., Omachi, S. (2009). Variational Bayesian mixture model on a subspace of exponential family distributions. IEEE Transactions on Neural Networks, 20(11), 1783–1796.

Page 43: CVPR2010: Advanced ITinCVPR in a Nutshell: part 6: Mixtures

References (3)

• [Escolano et al., 10] Escolano, F., Penalver, A. and Bonev, B. (2010). Entropy-based Variational Scheme for Fast Bayes Learning of Gaussian Mixtures. SSPR'2010 (accepted).

• [Rajwade et al., 09] Rajwade, A., Banerjee, A., Rangarajan, A. (2009). Probability Density Estimation Using Isocontours and Isosurfaces: Applications to Information-Theoretic Image Registration. IEEE Trans. Pattern Anal. Mach. Intell., 31(3), 475–491.

• [Chen et al., 10] Chen, T., Vemuri, B.C., Rangarajan, A., Eisenschenk, S.J. (2010). Group-wise Point-set Registration Using a Novel CDF-based Havrda-Charvat Divergence. Int. J. Comput. Vis., 86(1), 111–124.
