
Tessellation and Clustering by Mixture Models and Their Parallel Implementations∗

Qiang Du† Xiaoqiang Wang‡

Abstract

Clustering and tessellations are basic tools in data mining. The k-means and EM algorithms are two of the most important algorithms in Mixture Model-based clustering and tessellations. In this paper, we introduce a new clustering strategy which shares common features with both the EM and k-means algorithms. Our methods also lead to more general tessellations of a spatial region with respect to a continuous and possibly anisotropic density distribution. Moreover, we propose some probabilistic methods for the construction of these clusterings and tessellations corresponding to a continuous density distribution. Some numerical examples are presented to demonstrate the effectiveness of our new approach. In addition, we also discuss the parallel implementation and performance of our algorithms on some distributed memory systems.

1 Introduction.

Algorithms for clusterings and tessellations have long been basic tools in data mining and pattern recognition, as well as popular techniques in mesh generation and optimization. Through clustering, we may efficiently represent data for fast retrieval and reduce the complexity of data [15]. Clustering and tessellation are in general interchangeable terms, since a tessellation induces a clustering and a clustering can also form a tessellation. While clustering tends to emphasize the grouping behavior, the term tessellation often implies an emphasis on the shape and the arrangement of the patterns, that is, the geometric aspect of the clustering.

Mixture Model-based clustering forms a set of very important algorithms in clustering [11]. Meanwhile, the Centroidal Voronoi tessellations (CVTs) [16] lead to very important and elegant algorithms for optimal tessellations. Together, they put our study here into context.

∗This research was supported in part by NSF-DMS 0196522 and NSF-ITR 0205232.
†Department of Mathematics, Penn State University, University Park, PA 16802 ([email protected]).
‡Department of Mathematics, Penn State University, University Park, PA 16802 ([email protected]).

Among the Mixture Model-based clustering methods, the so-called k-means and EM (Expectation Maximization) algorithms [3, 21] are two of the most popular techniques. In a typical mixture model, each cluster is represented by a parametric distribution, exemplified by a uniform, Gaussian, or Poisson distribution, which describes the distribution of the elements within the cluster. A natural idea is then to represent the entire data set by the mixture of these distributions. In one of the simplest settings, the k-means clustering [12, 18, 20] aims to represent all clusters by their cluster means and to minimize the in-cluster variance, i.e., the sum of all mean square distances between the elements in the clusters and their corresponding cluster means. On the other hand, the EM algorithm [3, 21] may be used to estimate the parameters associated with the distributions of all the clusters by the maximum likelihood criterion. With some constraints or conditions imposed, the EM algorithm can be changed to a classification EM algorithm (CEM) [24].

CVTs are special Voronoi tessellations for which the generators of the tessellation are also the centers of mass (or means) of the Voronoi cells or clusters. CVT is a point placement strategy that has been applied to the sensor placement problem, point and patch distribution for mesh-less computation, image segmentation and restoration, image binning and stippling, and mesh optimization [5, 7, 9, 14, 17]. The concept of CVT has also been combined with the Proper Orthogonal Decomposition to give a new way of performing model reduction for complex systems [6]. CVT in many ways resembles the k-means method, and it requires the specification of a given density function to tessellate a spatial domain.

In this paper, we provide a new interpretation of the CEM algorithm so that the CEM and the CVT or the k-means can be unified within a common framework. We notice that, in practice, the EM or the CEM algorithm often gives a good clustering but does not necessarily lead to a nice tessellation; the CVT, meanwhile, provides very regular tessellations, but it is not good at capturing anisotropic tessellations. A trade-off between the CEM algorithms and the CVTs is thus proposed here to obtain a new anisotropic tessellation algorithm that can automatically detect anisotropy.

In many applications, the existing clustering algorithms are often used with a fixed set of sample points. Several issues may arise: one concerns the size of the sample set we need; another is the efficient selection of the sample points, especially with respect to the large dimensionality of the data set; and the last is how these sample points may be used efficiently. In adapting the EM algorithm to continuous distributions and sample regions, successfully addressing the above issues becomes even more critical than in the discrete case. In this regard, MacQueen's method [20] can play an important role. We explore this and other probabilistic methods to construct and generalize the tessellation algorithms for both continuous and discrete distributions.

For very large data sets in high dimensional spaces, parallelization of the clustering algorithms is needed in order to efficiently scale up with the dimensionality and the cardinality of the data sets. In today's parallel computing environments, effective parallelization often requires high concurrency, minimal communication and good load balancing. As a huge amount of work has been carried out in the last few decades on the subject of clustering, in particular on parallel and scalable implementations [1, 13], we do not attempt to survey all the relevant studies on parallel mixture model based clustering methods; we refer to [4, 22, 23, 28] and the references cited therein. The parallel algorithm we discuss here is new, and it is coupled with probabilistic sampling to achieve good parallel efficiency and clustering quality.

The rest of the paper is organized as follows: we start with some basic introduction to the classical k-means and EM algorithms and their applications to continuous distributions. We then discuss the generalizations and variants of both algorithms, followed by the presentation of their probabilistic constructions. Their parallel implementations using the message passing interface (MPI) are presented next. Various numerical examples are provided afterwards and some concluding remarks are given at the end.

2 Mixture Model-based Clustering

In order to compare the new algorithms under consideration here with the existing algorithms used for clustering, we first review a number of popular mixture model based approaches. Many of these approaches can be derived from different angles and be presented in equivalent forms. We merely outline some very basic formulations here.

Let the n sample points in R^d be clustered into k components, with the i-th component being represented by a Gaussian distribution parameterized by µ_i (the mean) and σ_i (the variance) for i = 1, 2, ..., k. The density function of component i is

    f_i(x) = Φ(x, µ_i, σ_i) = (1/√(2π|σ_i|)) exp( −(x − µ_i)^T σ_i^{−1} (x − µ_i) / 2 ).    (2.1)

Now, suppose that the prior probability (weight) of component i is a_i. The mixture density is then given by:

    f(x) = Σ_{i=1}^k a_i f_i(x) = Σ_{i=1}^k a_i Φ(x, µ_i, σ_i).    (2.2)

We let θ denote the collection of parameters a_i, µ_i, σ_i, i = 1, 2, ..., k. For a discrete data set x_1, x_2, ..., x_n, denote the likelihood function by

    L(θ) = Σ_{j=1}^n log( Σ_{i=1}^k a_i Φ(x_j, µ_i, σ_i) ),

which is the objective function to be maximized by the EM clustering algorithm. L(θ) is sometimes called the mixture likelihood.
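For concreteness, the mixture density (2.2) and the mixture likelihood L(θ) are straightforward to evaluate numerically. The following is a minimal numpy sketch of such an evaluation; the function names and the use of the standard multivariate normal constant are our own illustrative choices and are not taken from the paper.

    import numpy as np

    def gaussian_density(x, mu, sigma):
        # Phi(x, mu, sigma): a d-dimensional Gaussian density; the paper's (2.1)
        # writes the normalizing constant in a slightly abbreviated form.
        d = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(sigma, diff)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

    def mixture_log_likelihood(points, weights, means, covs):
        # L(theta) = sum_j log( sum_i a_i Phi(x_j, mu_i, sigma_i) )
        total = 0.0
        for x in points:
            mix = sum(a * gaussian_density(x, mu, cov)
                      for a, mu, cov in zip(weights, means, covs))
            total += np.log(mix)
        return total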

Denote the posterior probability of the i-th component at a point x by p_i(x). By the Bayes formula,

    p_i(x) ∝ a_i Φ(x, µ_i, σ_i),

with Σ_{i=1}^k p_i(x) ≡ 1. This leads to the EM algorithm, which is outlined below for completeness.

Algorithm 2.1. (EM)
1. Initialize the set of parameters θ.
2. E-step: for all j = 1, ..., n, i = 1, ..., k, set

    p_i(x_j) = a_i Φ(x_j, µ_i, σ_i) / Σ_{l=1}^k a_l Φ(x_j, µ_l, σ_l).

3. M-step: update θ by calculating

    a_i = (1/n) Σ_{j=1}^n p_i(x_j),

    µ_i = Σ_{j=1}^n p_i(x_j) x_j / Σ_{j=1}^n p_i(x_j),

    σ_i = Σ_{j=1}^n p_i(x_j) (x_j − µ_i)(x_j − µ_i)^T / Σ_{j=1}^n p_i(x_j).

4. Repeat E-step and M-step until convergence.
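As a sketch of how Algorithm 2.1 can be realized in practice, the following numpy routine runs the E- and M-steps for a fixed number of iterations; the function name, the fixed iteration budget used in place of a convergence test, and the small ridge term that keeps the covariance updates invertible are our own additions.

    import numpy as np

    def em_gmm(points, k, n_iter=100, ridge=1e-6, seed=0):
        # EM iterations of Algorithm 2.1 on an (n, d) array of sample points.
        rng = np.random.default_rng(seed)
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        weights = np.full(k, 1.0 / k)
        means = points[rng.choice(n, k, replace=False)]
        covs = np.array([np.eye(d) for _ in range(k)])
        for _ in range(n_iter):
            # E-step: posterior probabilities p_i(x_j)
            resp = np.empty((n, k))
            for i in range(k):
                diff = points - means[i]
                quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(covs[i]), diff)
                norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[i]))
                resp[:, i] = weights[i] * np.exp(-0.5 * quad) / norm
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: update a_i, mu_i, sigma_i from the responsibilities
            nk = resp.sum(axis=0)
            weights = nk / n
            means = (resp.T @ points) / nk[:, None]
            for i in range(k):
                diff = points - means[i]
                covs[i] = (resp[:, i, None] * diff).T @ diff / nk[i] + ridge * np.eye(d)
        return weights, means, covs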

Changing the mixture likelihood L(θ) to the classification likelihood

    L(θ) = Σ_{i=1}^n log( a_{η(i)} Φ(x_i, µ_{η(i)}, σ_{η(i)}) ),

with η(i) denoting the index of the cluster to which the point x_i belongs, we are led to the Classification EM Algorithm (CEM).

Algorithm 2.2. (CEM)
1. Initialize the parameters θ.
2. E-step: for all j = 1, ..., n, i = 1, ..., k, set

    p_i(x_j) = a_i Φ(x_j, µ_i, σ_i) / Σ_{l=1}^k a_l Φ(x_j, µ_l, σ_l).

3. Classification: let η(i) = argmax_j p_j(x_i), and take p_j(x_i) = 1 if j = η(i) and p_j(x_i) = 0 if j ≠ η(i). Let G_j = {x_i | p_j(x_i) = 1} with cardinality I(G_j).
4. M-step: update θ by calculating

    a_i = Σ_{j=1}^n p_i(x_j) / n = I(G_i)/n,

    µ_i = Σ_{j=1}^n p_i(x_j) x_j / Σ_{j=1}^n p_i(x_j) = Σ_{x_j ∈ G_i} x_j / I(G_i),

    σ_i = Σ_{j=1}^n p_i(x_j) (x_j − µ_i)(x_j − µ_i)^T / Σ_{j=1}^n p_i(x_j) = Σ_{x_j ∈ G_i} (x_j − µ_i)(x_j − µ_i)^T / I(G_i).

5. Repeat the E-step, classification, and M-step until the algorithm converges.
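The classification and M-steps of Algorithm 2.2 amount to replacing the soft responsibilities by hard assignments before updating θ. A minimal numpy sketch of just these two steps is given below, assuming the responsibilities of the E-step are already available; the function name, the handling of empty clusters, and the small ridge term are our own choices.

    import numpy as np

    def cem_classify_and_update(points, resp, ridge=1e-6):
        # Classification and M-step of Algorithm 2.2: resp is the (n, k) matrix
        # p_i(x_j) from the E-step; eta(j) = argmax_i p_i(x_j) gives the hard labels.
        points = np.asarray(points, dtype=float)
        n, k = resp.shape
        d = points.shape[1]
        labels = resp.argmax(axis=1)
        weights = np.zeros(k)
        means = np.zeros((k, d))
        covs = np.array([np.eye(d) for _ in range(k)])
        for i in range(k):
            members = points[labels == i]            # the cluster G_i
            if len(members) == 0:
                continue                             # empty cluster: leave defaults in place
            weights[i] = len(members) / n            # a_i = I(G_i) / n
            means[i] = members.mean(axis=0)
            diff = members - means[i]
            covs[i] = diff.T @ diff / len(members) + ridge * np.eye(d)
        return weights, means, covs, labels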

Notice that with σ_i being the identity matrix, the CEM gives exactly the k-means algorithm, which, in the conventional case, attempts to divide a set of sample points x_1, x_2, ..., x_n into k clusters with cluster means (centroids) m_1, m_2, ..., m_k that minimize the objective function

    Σ_{j=1}^n ||x_j − m_{η(j)}||^2.    (2.3)

Here, η(j) = i gives the association rule that assigns the j-th point to the i-th cluster. Note also that the CEM with a more general set of σ_i can provide a more precise anisotropic characterization of the local point distribution.

3 A New Clustering Algorithm

We now present a new clustering algorithm that incorporates features of both the k-means and the EM algorithms. To motivate the new algorithm, let us first give a different interpretation of the CEM.

We begin with the discrete case. Given n points x_i, i = 1, 2, ..., n, in R^d, divided into k groups, an energy can be defined as follows:

    E = −Σ_{i=1}^k Σ_{j=1}^n [ p_i(x_j) log a_i + p_i(x_j) log Φ(x_j, µ_i, M_i) ].

Applying the following constraints on p_i = p_i(x) and a_i:

    Σ_{i=1}^k p_i(x) ≡ 1,    (3.4)

    Σ_{i=1}^k a_i = 1,    (3.5)

with Φ being a probability density function such as the Gaussian defined in (2.1), we may look for the constrained minimizer of the above energy. The classical CEM algorithm may be viewed as a realization of this minimization process.

Substituting the Gaussian distribution into the expression of E, we get

    E′ = Σ_{i=1}^k Σ_{j=1}^n [ −p_i(x_j) log a_i + (1/2) p_i(x_j) ( log |M_i| + (x_j − µ_i)^T M_i^{−1} (x_j − µ_i) ) ] + c,

where c is a constant. Thus, CEM also becomes an algorithm to minimize E′. Moreover, it is easy to verify that E′ decreases at each step of CEM. In the energy expression, the parameters a_i can be regarded as the weights of the groups: the more points a group has, the more likely other points are to be clustered into this group. This is in contrast with the standard k-means algorithm, which assigns each group the same weight. Moreover, the standard k-means algorithm also takes the identity matrix as the variance matrix M for each group. To maintain the individual character of each group while preserving equal weights for all groups, we construct a generalized version of the k-means algorithm: first setting all the a_i's equal, the new energy, after removing a constant, becomes:

    E∗ = Σ_{i=1}^k Σ_{j=1}^n [ p_i(x_j) ( log |M_i| + (x_j − µ_i)^T M_i^{−1} (x_j − µ_i) ) ].

Then, our new algorithm is to minimize E∗ under the constraint (3.4). Adopting the notation in (2.3), we may rewrite E∗ as

    E∗ = Σ_{i=1}^n [ log |M_{η(i)}| + (x_i − µ_{η(i)})^T M_{η(i)}^{−1} (x_i − µ_{η(i)}) ],

where η(i) = j means that the i-th point is assigned to the j-th cluster. For any matrix B, let Tr B denote the trace of B. We can prove the following theorem.

Theorem 3.1. Assume that the function f(B) = log |B| + Tr X^T B^{−1} X has a minimum for a matrix B among all positive definite matrices. Then, we have B = XX^T.


Proof: Let A = B^{−1}; then A = (a_ij) gives the minimum of the function g(A) = −log |A| + Tr X^T A X, and

    ∂g/∂a_ij = −A_ij/|A| + Σ_ℓ X_iℓ X_jℓ = 0,

where A_ij denotes the cofactor of A associated with the entry a_ij. So B = A^{−1} = |A|^{−1} A∗, where A∗ is the matrix whose ij-th element is A_ij, that is, the adjoint matrix of A. The ij-th entry of B is therefore A_ij/|A|, which by the equation above equals Σ_ℓ X_iℓ X_jℓ, the ij-th entry of XX^T. So B = XX^T.
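A quick numerical sanity check of Theorem 3.1 (our own illustration, not part of the paper): for a random data matrix X with more columns than rows, f(XX^T) should not exceed f evaluated at randomly generated positive definite matrices.

    import numpy as np

    rng = np.random.default_rng(1)
    d, m = 3, 50
    X = rng.standard_normal((d, m))           # columns are the data vectors
    B_star = X @ X.T                          # the claimed minimizer of f

    def f(B):
        # f(B) = log|B| + Tr(X^T B^{-1} X)
        sign, logdet = np.linalg.slogdet(B)
        return logdet + np.trace(X.T @ np.linalg.solve(B, X))

    for _ in range(1000):
        A = rng.standard_normal((d, d))
        B_other = A @ A.T + 1e-3 * np.eye(d)  # a random positive definite matrix
        assert f(B_star) <= f(B_other) + 1e-9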

This theorem leads directly to a new anisotropic clustering algorithm that adopts the anisotropic properties of EM or CEM as well as the centroid tessellation properties of CVTs.

Algorithm 3.1. (New Anisotropic Clustering)
1. Randomly select k points µ_1, µ_2, ..., µ_k, and set M_1 = M_2 = ... = M_k = I_{d×d}.
2. Assign a point x_i to the group j that gives the minimum of log |M_j| + (x_i − µ_j)^T M_j^{−1} (x_i − µ_j) over j = 1, ..., k, for i = 1, ..., n.
3. Update µ_i by the centroid and M_i by the covariance matrix of group i.
4. Repeat steps 2 and 3 until the algorithm converges.
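A compact numpy sketch of Algorithm 3.1 is given below; the fixed iteration budget, the ridge term used to keep the covariance matrices invertible, and the returned energy value (the objective E∗ evaluated at the final assignment) are our own illustrative additions.

    import numpy as np

    def anisotropic_clustering(points, k, n_iter=50, ridge=1e-6, seed=0):
        # Sketch of Algorithm 3.1: alternate anisotropic assignments and moment updates.
        rng = np.random.default_rng(seed)
        points = np.asarray(points, dtype=float)
        n, d = points.shape
        means = points[rng.choice(n, k, replace=False)]      # step 1
        covs = np.array([np.eye(d) for _ in range(k)])
        for _ in range(n_iter):
            # step 2: assign x to the group minimizing log|M_j| + (x - mu_j)^T M_j^{-1} (x - mu_j)
            cost = np.empty((n, k))
            for i in range(k):
                diff = points - means[i]
                quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(covs[i]), diff)
                cost[:, i] = np.linalg.slogdet(covs[i])[1] + quad
            labels = cost.argmin(axis=1)
            # step 3: update mu_i by the centroid and M_i by the covariance of group i
            for i in range(k):
                members = points[labels == i]
                if len(members) == 0:
                    continue
                means[i] = members.mean(axis=0)
                diff = members - means[i]
                covs[i] = diff.T @ diff / len(members) + ridge * np.eye(d)
        energy = cost[np.arange(n), labels].sum()            # E* at the final assignment
        return means, covs, labels, energy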

In the numerical experiments of Section 6.2, this new algorithm is shown to give better tessellations than the CEM algorithm and CVTs. By viewing it as a kind of Mixture Model-based clustering method, a continuous version is also naturally available: given a density ρ = ρ(x) defined in a region Ω, Mixture Model-based clustering can be used to find a suitable parameter set θ such that the function f defined in (2.2) best 'approximates' ρ = ρ(x). From these parameters and the clustering rules, we can also get a tessellation of the region Ω. We refer to algorithms for such purposes as continuous algorithms, for instance, the continuous EM or the continuous k-means.

With a proper modification of the original (discrete) anisotropic algorithm, the continuous anisotropic clustering algorithm can be formulated as:

Algorithm 3.2. (Continuous anisotropic clustering)
1. Randomly select k points µ_1, µ_2, ..., µ_k, and set M_1 = M_2 = ... = M_k = I_{d×d}.
2. For j = 1, ..., k, assign a point x to the group G_j if x gives the minimum of log |M_j| + (x − µ_j)^T M_j^{−1} (x − µ_j).
3. Update µ_i by the centroid and M_i by the covariance matrix of group i, i.e.

    µ_i = ∫_{G_i} ρ(x) x dx / ∫_{G_i} ρ(x) dx,

    M_i = ∫_{G_i} ρ(x) (x − µ_i)(x − µ_i)^T dx / ∫_{G_i} ρ(x) dx.

4. Repeat steps 2 and 3 until the algorithm converges.

Similar modifications to the EM, CEM and k-means algorithms may also be used for the above purpose, and such modified algorithms are called the continuous EM, continuous CEM and continuous k-means algorithms.

The new anisotropic algorithms attempt to add more generality to their isotropic counterparts. In some sense, the characterization of the anisotropy may be related to the use of a general Riemannian metric. Thus, the construction outlined here is closely related to our earlier work [8] on anisotropic centroidal Voronoi tessellations, which generalize to the Riemannian metric case the standard centroidal Voronoi tessellation defined with the Euclidean metric. The anisotropic CVTs studied in [8] are similar to, but different from, the weighted k-means algorithms based on weighted Voronoi diagrams [2].

In contrast to the study made in [8], where the Riemannian metric (and thus the anisotropy) is predetermined, the algorithms proposed here give an adaptive characterization of the parameters for the anisotropic distribution.

For practical applications, the above continuous algorithms may be realized in many different ways. One attempt is to uniformly sample the region and apply the discrete EM algorithm. Such a strategy does not work efficiently for nonuniform density distributions or distributions with sharp peaks. A more efficient way is to select more sample points where the density function is larger. The details of an efficient algorithm for the continuous Mixture Model are given in the next section.

4 Probabilistic Methods for the Continuous Mixture Models

For more efficient implementations of the continuous algorithms, we present some other possible constructions; in particular, we emphasize the probabilistic implementations.

First, let us recall MacQueen's method [20], a very elegant random sequential sampling method which can be used for the continuous k-means algorithm.

Algorithm 4.1. (MacQueen's Method) Given a region Ω, a density function ρ(x) defined on Ω, and a positive integer k.
1. Choose an initial set of k points {µ_i}_{i=1}^k in Ω by a Monte Carlo method; set j_i = 1 for i = 1, ..., k.
2. Sample a point x in Ω by a Monte Carlo method, according to the probability density function ρ = ρ(x).
3. Find the µ_r that is closest to x among all {µ_i}_{i=1}^k.
4. Update µ_r by µ_r = (j_r µ_r + x)/(j_r + 1) and j_r = j_r + 1. The other µ_i's remain unchanged.
5. If {µ_i}_{i=1}^k meets a convergence criterion, terminate; otherwise, go to step 2.
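A minimal Python sketch of Algorithm 4.1 is given below, assuming the caller supplies a sample routine that draws one point of Ω according to ρ; the function names and the fixed sampling budget used in place of a convergence test are our own choices.

    import numpy as np

    def macqueen(sample, k, n_samples=100000, seed=0):
        # Sketch of Algorithm 4.1; sample(rng) draws one point of Omega according to rho.
        rng = np.random.default_rng(seed)
        mu = np.array([sample(rng) for _ in range(k)])    # step 1: initial generators
        j = np.ones(k)                                    # per-generator counters j_i
        for _ in range(n_samples):                        # fixed budget in place of a convergence test
            x = sample(rng)                               # step 2: draw x ~ rho
            r = np.argmin(((mu - x) ** 2).sum(axis=1))    # step 3: closest generator
            mu[r] = (j[r] * mu[r] + x) / (j[r] + 1)       # step 4: running-mean update
            j[r] += 1
        return mu

    # example: k-means generators for the uniform density on the unit square
    centers = macqueen(lambda rng: rng.random(2), k=10)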

The almost sure convergence of the energy for the random MacQueen's method has been proved [20]. For other studies on the algorithm, we refer to [5] and [10] for more discussions.

MacQueen's method can be directly generalized to the EM algorithm. A generalization of MacQueen's method given in [16] for the construction of CVT provides motivation to develop more efficient and alternative approaches to the EM methods.

The update formula in MacQueen's method can be rewritten as

    µ_r^{new} = w_1 µ_r^{old} + w_2 x,

where w_1 and w_2 are two positive weight functions used in updating the means by the new sample point. In general, w_1 and w_2 are functions of the iteration count j_r with the constraint that w_1(j_r) + w_2(j_r) = 1. Another rule in choosing the weight functions is that w_1 should be an increasing, or at least non-decreasing, function. These criteria assure that the old data are weighted more and more heavily in calculating the new data. In MacQueen's method, w_1(j_r) = j_r/(j_r + 1) and w_2(j_r) = 1/(j_r + 1). To add more flexibility to MacQueen's method, four parameters α_1, α_2, β_1, β_2 are introduced in the weight functions:

    w_1(j_r) = (α_1 j_r + β_1)/(j_r + 1), and w_2(j_r) = (α_2 j_r + β_2)/(j_r + 1),    (4.6)

where α_1 + α_2 = 1 and β_1 + β_2 = 1. Thus, if α_1 = β_2 = 1 and α_2 = β_1 = 0, it reduces to MacQueen's method. We point out that if the weight function w_1 is too large, the effect of the newer data may be too weak, which could lead to the need for more iterations. On the other hand, if the weight function w_2 is too large, the effect of the newer data may be too strong and the convergence may not be assured.

Instead of adding only one new sample point, taking more sample points in each iteration may speed up the algorithms and improve concurrency [16]. And the more new sample points are taken, the larger the weight function w_2 can be, to enlarge the effect of the new sample points.
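The generalized weights (4.6) are simple to code. The small helper below (our own naming, with illustrative default parameters) returns the pair (w_1(j), w_2(j)) and reproduces MacQueen's choice when α_1 = β_2 = 1 and α_2 = β_1 = 0.

    def weights(j, alpha1=0.7, beta1=0.7):
        # w1(j), w2(j) of (4.6); alpha2 and beta2 follow from the sum-to-one constraints.
        alpha2, beta2 = 1.0 - alpha1, 1.0 - beta1
        w1 = (alpha1 * j + beta1) / (j + 1)
        w2 = (alpha2 * j + beta2) / (j + 1)
        return w1, w2

    assert weights(5, alpha1=1.0, beta1=0.0) == (5 / 6, 1 / 6)   # MacQueen's method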

Using the weight functions defined in (4.6), we present a new probabilistic method that can be viewed as a probabilistic version of the EM algorithm and a generalization of the random MacQueen's method.

Algorithm 4.2. (Probabilistic method for continuous EM) Given a region Ω, a density function ρ = ρ(x) defined on Ω, and a positive integer k.
0. Choose a positive integer q and nonnegative constants α_1, α_2, β_1, β_2 such that α_1 + α_2 = 1 and β_1 + β_2 = 1; choose an initial set of k points {µ_i}_{i=1}^k by using a Monte Carlo method; set a_i = 1/k for i = 1, ..., k; set the d × d matrices σ_i = I_{d×d} for i = 1, ..., k; set j_i = 1 for i = 1, ..., k.
1. Choose q points {x_r}_{r=1}^q in Ω at random by a Monte Carlo method according to the probability density function ρ.
2. E-step: for all j = 1, ..., q, i = 1, ..., k, set

    p_i(x_j) = a_i Φ(x_j, µ_i, σ_i) / Σ_{l=1}^k a_l Φ(x_j, µ_l, σ_l).

3. M-step: compute

    a_i = [ (α_1 j_i + β_1) a_i + (α_2 j_i + β_2) Σ_{j=1}^q p_i(x_j)/q ] / (j_i + 1),

    m_i = Σ_{j=1}^q p_i(x_j) x_j / Σ_{j=1}^q p_i(x_j),

    µ_i = [ (α_1 j_i + β_1) µ_i + (α_2 j_i + β_2) m_i ] / (j_i + 1),

    V_i = Σ_{j=1}^q p_i(x_j) (x_j − µ_i)(x_j − µ_i)^T / Σ_{j=1}^q p_i(x_j),

    σ_i = [ (α_1 j_i + β_1) σ_i + (α_2 j_i + β_2) V_i ] / (j_i + 1).

4. If the new {a_i, µ_i, σ_i}_{i=1}^k meet some convergence criterion, terminate; otherwise update j_i = j_i + 1 for i = 1, ..., k and return to step 1.
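A single pass of steps 1-3 of Algorithm 4.2 can be sketched as follows, assuming the caller maintains the arrays a, mu, sigma and the per-component counters j as numpy arrays and draws each fresh batch from ρ by its own Monte Carlo sampler; the function names and the use of the standard multivariate normal constant are our own illustrative choices.

    import numpy as np

    def gaussian(x, mu, sigma):
        d = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(sigma, diff)
        return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))

    def prob_em_step(samples, a, mu, sigma, j, alpha1=0.7, beta1=0.7):
        # One iteration of Algorithm 4.2 on a fresh batch of q samples drawn from rho.
        alpha2, beta2 = 1 - alpha1, 1 - beta1
        samples = np.asarray(samples, dtype=float)
        q, k = len(samples), len(a)
        # E-step: responsibilities p_i(x_r) for the new batch only
        p = np.array([[a[i] * gaussian(x, mu[i], sigma[i]) for i in range(k)] for x in samples])
        p /= p.sum(axis=1, keepdims=True)
        w_old = (alpha1 * j + beta1) / (j + 1)    # weights (4.6) on the old parameters
        w_new = (alpha2 * j + beta2) / (j + 1)    # and on the batch estimates
        # M-step: blend the old parameters with the batch estimates
        for i in range(k):
            wi = p[:, i]
            a[i] = w_old[i] * a[i] + w_new[i] * wi.sum() / q
            m_i = (wi[:, None] * samples).sum(axis=0) / wi.sum()
            mu[i] = w_old[i] * mu[i] + w_new[i] * m_i
            diff = samples - mu[i]
            V_i = (wi[:, None] * diff).T @ diff / wi.sum()
            sigma[i] = w_old[i] * sigma[i] + w_new[i] * V_i
        return a, mu, sigma, j + 1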

The generalization for the continuous CEM algorithm can be made in a very similar way, as follows:

Algorithm 4.3. (Probabilistic method for continuous CEM) Given a region Ω, a density function ρ(x) defined on Ω, and a positive integer k.
0. Choose a positive integer q and nonnegative constants α_1, α_2, β_1, β_2 such that α_1 + α_2 = 1 and β_1 + β_2 = 1; choose an initial set of k points {µ_i}_{i=1}^k by using a Monte Carlo method; set a_i = 1/k for i = 1, ..., k; set the d × d matrices σ_i = I_{d×d} for i = 1, ..., k; set j_i = 1 for i = 1, ..., k.
1. Choose q points {x_r}_{r=1}^q in Ω at random by a Monte Carlo method according to the probability density function ρ.
2. E-step: for i = 1, ..., k, gather together in the set W_i all sample points x_r for which a_i Φ(x_r, µ_i, σ_i) is the largest among {a_l Φ(x_r, µ_l, σ_l)}_{l=1}^k.
3. M-step: let n_i be the number of points in W_i; compute

    a_i = [ (α_1 j_i + β_1) a_i + (α_2 j_i + β_2) (n_i/q) ] / (j_i + 1);

let m_i be the mean of the points in W_i; compute

    µ_i = [ (α_1 j_i + β_1) µ_i + (α_2 j_i + β_2) m_i ] / (j_i + 1);

let V_i be the covariance matrix of the points in W_i, i.e.

    V_i = (1/n_i) Σ_{x ∈ W_i} (x − µ_i)(x − µ_i)^T,

and compute

    σ_i = [ (α_1 j_i + β_1) σ_i + (α_2 j_i + β_2) V_i ] / (j_i + 1).

4. If the new {a_i, µ_i, σ_i}_{i=1}^k meet some convergence criterion, terminate; otherwise update j_i = j_i + 1 for those i with n_i ≠ 0 and return to step 1.

Our new continuous anisotropic clustering algorithm can also be constructed via probabilistic means, as in the above algorithm for the continuous CEM, by removing the coefficients a_i:

Algorithm 4.4. (Probabilistic method for the continuous anisotropic clustering) Given a region Ω, a density function ρ(x) defined on Ω, and a positive integer k.
0. Choose a positive integer q and nonnegative constants α_1, α_2, β_1, β_2 such that α_1 + α_2 = 1 and β_1 + β_2 = 1; choose an initial set of k points {µ_i}_{i=1}^k by using a Monte Carlo method; set the d × d matrices σ_i = I_{d×d} for i = 1, ..., k; set j_i = 1 for i = 1, ..., k.
1. Choose q points {x_r}_{r=1}^q in Ω at random by a Monte Carlo method according to the probability density function ρ.
2. For i = 1, ..., k, gather together in the set W_i all sample points x_r for which Φ(x_r, µ_i, σ_i) is the largest among {Φ(x_r, µ_l, σ_l)}_{l=1}^k.
3. Let n_i be the number of points in W_i and m_i be the mean of the points in W_i; compute

    µ_i = [ (α_1 j_i + β_1) µ_i + (α_2 j_i + β_2) m_i ] / (j_i + 1);

let V_i be the covariance matrix of the points in W_i, i.e.

    V_i = (1/n_i) Σ_{x ∈ W_i} (x − µ_i)(x − µ_i)^T,

and compute

    σ_i = [ (α_1 j_i + β_1) σ_i + (α_2 j_i + β_2) V_i ] / (j_i + 1).

4. If the new {µ_i, σ_i}_{i=1}^k meet some convergence criterion, terminate; otherwise update j_i = j_i + 1 for those i with n_i ≠ 0 and return to step 1.

5 Parallel Implementations

In this section, we consider parallel implementations, on distributed memory systems, of the algorithms discussed in Section 4. For convenience, given two positive integers k and p and a non-negative integer i such that p ≤ k and i < p, we define

    R(i, p, k) = ⌊k/p⌋ + 1 if i < k mod p, and R(i, p, k) = ⌊k/p⌋ otherwise,

and

    g_0 = G(0, p, k) = 0,  g_i = G(i, p, k) = Σ_{j=0}^{i−1} R(j, p, k).
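R and G simply split the k cluster indices into p nearly equal contiguous blocks, so that processor r owns the indices g_r + 1, ..., g_{r+1}. A direct Python transcription, with ⌊k/p⌋ taken as integer division, would look as follows (our own illustration).

    def R(i, p, k):
        # Number of cluster indices owned by processor i when k clusters are split over p processors.
        return k // p + 1 if i < k % p else k // p

    def G(i, p, k):
        # g_i: number of cluster indices owned by processors 0, ..., i-1.
        return sum(R(j, p, k) for j in range(i))

    assert sum(R(r, 4, 10) for r in range(4)) == 10
    assert [G(r, 4, 10) for r in range(5)] == [0, 3, 6, 8, 10]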

The parallel versions of MacQueen's method can be found in [16]. Here we give the parallel versions of the EM and CEM algorithms discussed in Section 4.

Algorithm 5.1. (Parallel EM algorithm) The region Ω, a density function f on Ω, and a positive integer k are given. We also have p processors with ranks r = 0, 1, ..., p − 1.
0. For 0 ≤ r ≤ p − 1, set m_r = R(r, p, k); let each processor independently choose its own initial set of m_r points {µ_i}_{i=g_r+1}^{g_{r+1}} by using, e.g., a Monte Carlo method; set a_i = 1/k, matrices σ_i = I_{d×d}, and j_i = 1 for i = g_r + 1, ..., g_{r+1}.
1. Communicate among all processors so that all points form a complete initial set {µ_i, a_i, σ_i, j_i}_{i=1}^k.
2. Each processor selects its own set of s points {x_j}_{j=1}^s in Ω at random, e.g., by a Monte Carlo method, according to the probability density function f.
3. E-step:
3.a. On each processor, for i = 1, ..., k and j = 1, ..., s, set

    p_i(x_j) = a_i Φ(x_j, µ_i, σ_i) / Σ_{l=1}^k a_l Φ(x_j, µ_l, σ_l).

3.b. Through communication, each processor sends each of its points x_j and the values p_i(x_j), j = 1, ..., s, to the processor with rank r such that g_r < i ≤ g_{r+1}. Each processor receives the data sent by the other processors and forms the collection of points x_j, j = 1, ..., sp, for its own indices i = g_r + 1, ..., g_{r+1}.
4. M-step: on each processor, for i = g_r + 1, ..., g_{r+1}, compute

    a_i = [ (α_1 j_i + β_1) a_i + (α_2 j_i + β_2) Σ_{j=1}^{sp} p_i(x_j)/(sp) ] / (j_i + 1),

    m_i = Σ_{j=1}^{sp} p_i(x_j) x_j / Σ_{j=1}^{sp} p_i(x_j),

    µ_i = [ (α_1 j_i + β_1) µ_i + (α_2 j_i + β_2) m_i ] / (j_i + 1),

    V_i = Σ_{j=1}^{sp} p_i(x_j) (x_j − µ_i)(x_j − µ_i)^T / Σ_{j=1}^{sp} p_i(x_j),

    σ_i = [ (α_1 j_i + β_1) σ_i + (α_2 j_i + β_2) V_i ] / (j_i + 1),

and update j_i = j_i + 1.
5. If the new {a_i, µ_i, σ_i}_{i=1}^k meet some convergence criterion, terminate; otherwise return to step 1.
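The communication pattern of Algorithm 5.1 can be sketched with mpi4py. The variant below is our own illustration, not the paper's implementation: it reduces per-component sufficient statistics with allreduce instead of shipping the raw samples to the owning processor as in step 3.b; each rank then applies the weighted M-step only to its own block of components, and the full parameter set is re-assembled with allgather. It assumes each rank holds its local batch as an (s, d) numpy array and that a, mu, sigma, j have shapes (k,), (k, d), (k, d, d), (k,).

    import numpy as np
    from mpi4py import MPI

    def owned_block(rank, p, k):
        # [lo, hi): the component indices owned by this rank (the same split as R and G above).
        base, extra = divmod(k, p)
        lo = rank * base + min(rank, extra)
        return lo, lo + base + (1 if rank < extra else 0)

    def parallel_em_step(comm, local_samples, a, mu, sigma, j, alpha1=0.7, beta1=0.7):
        rank, p = comm.Get_rank(), comm.Get_size()
        k, d = mu.shape
        # local E-step on this processor's batch of s sample points
        resp = np.empty((len(local_samples), k))
        for i in range(k):
            diff = local_samples - mu[i]
            quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(sigma[i]), diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[i]))
            resp[:, i] = a[i] * np.exp(-0.5 * quad) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        # global sufficient statistics, summed over all processors
        w = comm.allreduce(resp.sum(axis=0), op=MPI.SUM)                # sum_j p_i(x_j)
        wx = comm.allreduce(resp.T @ local_samples, op=MPI.SUM)         # sum_j p_i(x_j) x_j
        wxx = comm.allreduce(np.einsum('ni,nd,ne->ide', resp, local_samples, local_samples), op=MPI.SUM)
        total = comm.allreduce(len(local_samples), op=MPI.SUM)          # s*p points in all
        # each processor updates only its own block of components ...
        lo, hi = owned_block(rank, p, k)
        alpha2, beta2 = 1 - alpha1, 1 - beta1
        for i in range(lo, hi):
            w_old = (alpha1 * j[i] + beta1) / (j[i] + 1)
            w_new = (alpha2 * j[i] + beta2) / (j[i] + 1)
            a[i] = w_old * a[i] + w_new * w[i] / total
            mu[i] = w_old * mu[i] + w_new * wx[i] / w[i]
            S = wxx[i] - np.outer(mu[i], wx[i]) - np.outer(wx[i], mu[i]) + w[i] * np.outer(mu[i], mu[i])
            sigma[i] = w_old * sigma[i] + w_new * S / w[i]
            j[i] += 1
        # ... and the full parameter set is re-assembled on every processor
        for arr in (a, mu, sigma, j):
            arr[:] = np.concatenate(comm.allgather(arr[lo:hi]))
        return a, mu, sigma, j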

Similarly, we list the parallel version of the CEM algorithm:

Algorithm 5.2. (Parallel CEM algorithm) Given the region Ω, a density function f on Ω, a positive integer k, and p processors with ranks r = 0, 1, ..., p − 1.
0. For 0 ≤ r ≤ p − 1, set m_r = R(r, p, k); then each processor independently chooses its own initial set of m_r points {µ_i}_{i=g_r+1}^{g_{r+1}}, e.g., by using a Monte Carlo method; set a_i = 1/k, matrices σ_i = I_{d×d}, and j_i = 1 for i = g_r + 1, ..., g_{r+1}.
1. Communicate among all processors so that all points form a complete initial set {µ_i, a_i, σ_i, j_i}_{i=1}^k.
2. Each processor selects its own set of s points {x_ℓ}_{ℓ=1}^s in Ω at random, e.g., by a Monte Carlo method, according to the probability density function f.
3. E-step:
3.a. On each processor, for each µ_i, i = 1, ..., k, gather together in the set W_i^r all sample points x_ℓ for which a_i Φ(x_ℓ, µ_i, σ_i) is the largest.
3.b. Through communication, each processor sends the point set W_i^r to the processor with rank r such that g_r < i ≤ g_{r+1}. Each processor receives the point sets sent by the other processors and gathers them into W_i for i = g_r + 1, ..., g_{r+1}.
4. M-step: on each processor, for i = g_r + 1, ..., g_{r+1}, if W_i is not empty, let n_i be the number of points in W_i; compute

    a_i = [ (α_1 j_i + β_1) a_i + (α_2 j_i + β_2) n_i/(sp) ] / (j_i + 1);

let m_i be the mean of the points in W_i; compute

    µ_i = [ (α_1 j_i + β_1) µ_i + (α_2 j_i + β_2) m_i ] / (j_i + 1);

let V_i be the covariance matrix of the points in W_i, i.e.

    V_i = (1/n_i) Σ_{x ∈ W_i} (x − µ_i)(x − µ_i)^T,

and compute

    σ_i = [ (α_1 j_i + β_1) σ_i + (α_2 j_i + β_2) V_i ] / (j_i + 1);

then update j_i = j_i + 1.
5. If the new {a_i, µ_i, σ_i}_{i=1}^k meet some convergence criterion, terminate; otherwise return to step 1.

Parallelizations of the new anisotropic clustering algorithms can be constructed in much the same way as the above algorithms. We omit the details.

Figure 1: The plot of energy vs. the number of samples per iteration.

6 Numerical Experiments.

This section is divided into four subsections. The first subsection discusses the selection of the parameters in our algorithms: α_1, β_1, α_2, β_2, and q, the number of sample points in each iteration. The second subsection discusses applications of the algorithms; in particular, we apply them to the clustering of some continuous and discrete distributions. The third subsection presents some image clustering experiments. Finally, we discuss the efficiency of the parallel implementations.

6.1 Variable selection We first discuss the selection of q, the number of sample points in each iteration. As stated in Section 4, a larger q means more new information can be used in each iteration, so fewer iteration steps are needed. On the other hand, a larger q also means more calculation in each iteration. To determine a good q for a problem, we can fix the total number of sample points over all iterations. In this experiment, we compare the energies obtained with different q while the total number of samples is fixed at 10^6. The anisotropic clustering algorithm is used in this test. The energy of a clustering for a continuous distribution is approximated by sampling another 10^5 points.

In this experiment, the distribution we use is:

    f(x, y) = e^{−40(x−0.5)^2 − 160y^2} + e^{−40(x+0.5)^2 − 160y^2} + e^{−160x^2 − 40(y−0.5)^2} + e^{−160x^2 − 40(y+0.5)^2}.

The total number of sample points is 10^6 and the number of clusters k is 30.

Figure 1 shows the curve of the energy vs. q, the number of samples in each iteration. Because the total number of sample points is fixed at 10^6, the larger q is, the fewer iterations are used, so the total time remains about the same. The quality of the clustering can be seen from the energy. The graph indicates that as q increases to 10^3 the energy decreases rapidly, and after about 2×10^3 the energy has only small fluctuations. Thus, q should be taken neither too small nor too large. Since the more clusters there are, the more sample points are needed, we may postulate a linear relationship between q and k, the number of clusters. Practically, we find that a good choice of q is around 50k, which means the average number of new sample points in each cluster is about 50.

    α_1    β_1    α_2    β_2    energy
    0.5    0.5    0.5    0.5    -4.7996
    0.0    0.0    1.0    1.0    -3.6764
    1.0    1.0    0.0    0.0     0.9212
    1.0    0.0    0.0    1.0    -3.9925
    0.0    1.0    1.0    0.0    -4.7045
    0.6    0.6    0.4    0.4    -4.7676
    0.7    0.7    0.3    0.3    -4.8287
    0.8    0.8    0.2    0.2    -4.7871
    0.9    0.9    0.1    0.1    -4.7692

Table 1: α_1, β_1, α_2, β_2 vs. energy.

The selection of the parameters α_1, β_1, α_2, β_2 is also discussed in Section 4, where several criteria are given. To compare the effect of different α_1, β_1, α_2, β_2, we choose the same distribution and k as before, and set q = 2000. Table 1 gives the results after the same number of iterations. We see that the choices α_1 = β_1 = 0.7 and α_2 = β_2 = 0.3 are about the best, significantly surpassing the performance of MacQueen's method (α_1 = β_2 = 1, α_2 = β_1 = 0). This clearly demonstrates the trade-off between the old and the new data.

6.2 Tessellation and clustering The distribution given earlier as a superposition of four Gaussian distributions can be visualized by the left graph in Figure 2, where the four leaves are clearly visible. Thus we may set k = 4 and try to capture the leaves. Direct calculations by CEM, k-means and the new anisotropic algorithm all result in almost the same graph, given in Figure 2 (right picture).

All three algorithms can find the centroid of each leaf. But except for k-means, the covariance matrices of the other methods can also provide the information on the distribution of each cluster. To retrieve this information, we can add an additional so-called background cluster, defined by {x | ρ_i(x) < a, i = 1, ..., k}, where ρ_i is the distribution for cluster i and a can be set to 0.05 or 0.1. For a = 0.1, CEM and the anisotropic algorithm give the same picture, shown in the left picture of Figure 3. Because k-means cannot provide the distribution of each cluster, at most we can only add a cluster whose points are far away from all the centroids. The resulting picture is shown in the right picture of Figure 3.

Figure 2: Distribution by sample points and the 4-cluster result by CEM, k-means and the anisotropic algorithm. The distribution is given as a superposition of four Gaussian functions.

Figure 3: Adding a new cluster: by CEM and the anisotropic algorithm (left) and by k-means (right).

From the result, one sees that adding a new cluster is reasonable and leads to a better result, so we apply this technique in all of our CEM and anisotropic clustering experiments. Here, we set k = 4 because we know there are four leaves. If such prior information is not available and k = 20 is taken, the results, shown in Figure 4, are significantly different.

Figure 4: From left to right, the results of CEM, the anisotropic method and k-means when k = 20.

From this result, we can see that CEM best keeps the main 4 distributions, and the other clusters either disappear or appear as noise. The anisotropic method can also capture the 4 leaves, and it even divides each leaf into several small parts. k-means does not perform well in this setting.

Figure 5 and Figure 6 show some other applications of those 3 methods.

Figure 5: From left to right, the results of CEM, the anisotropic method and k-means. k = 20, the distribution is exp(−40|x^2 + y^2 − 0.25|).

Figure 6: From left to right, the results of CEM, the anisotropic method and k-means. k = 20, the distribution is exp(−40(x − y)^2) + exp(−40(x + y)^2).

In these three examples, CEM is good at getting the right clustering. In Figure 4, one can even detect how many clusters are actually needed using CEM, while the anisotropic method and k-means cannot carry out such a task. However, in order to get beautiful tessellations or, for some reason, to obtain a further subdivision of the region into many more than four clusters, the anisotropic method works better than CEM. Figures 4 and 5 are two good examples of this scenario.

6.3 Applications to image analysis Image segmentation and clustering are very important applications of Mixture Model-based clustering. Figure 7 gives another good example that illustrates the differences among those three algorithms.

Figure 7 shows the results obtained by those 3 methods. The top picture is the original picture. The second row shows the results for k = 3, and the bottom row shows the results for k = 8. From these pictures, we can see that the results from CEM do not give us as much detail of the picture as the other two methods; for example, CEM cannot recognize the sea behind the island. On the other hand, k-means shows a lot of details, but sometimes the details become over-crowded and affect the results of the clustering, such as the layers of the cloud and the waves in the sea. The anisotropic method provides a good balance between the CEM and k-means.

Figure 7: Picture clustering: original (top); results, from left to right, obtained by CEM, our new method, and k-means for k = 3 (center) and k = 8 (bottom).

Figure 8: Sample handwriting.

To test our algorithm on a much larger data set, we consider the problem of handwriting recognition. A set of normalized handwritten digits, automatically scanned from envelopes by the U.S. Postal Service, is taken. The original scanned digits are binary and of different sizes and orientations; the images here have been deslanted and size normalized, resulting in 16×16 grayscale images, see Figure 8. The data set can be divided into 10 groups, each containing one digit, from 0 to 9 respectively, and each group has 100 samples. Since the dimension of each sample is 256, we first apply the ISOMAP technique [25] to reduce the dimension.

Figure 9 shows the residual variance as a function of the number of dimensions used in the reduction. From this graph, the elbow effect is very obvious, indicating that it is good to choose the first 7 dimensions. The data in the reduced dimension are then stored as the new data set.

We then apply the CEM, our new algorithm, and the traditional k-means to this new 7-dimensional data set. The rates of correct classification are, respectively, 78.7% for CEM, 79.8% for k-means and 85.7% for our new method. More detailed classification results are presented in Tables 2-4. The graph of the first two dimensions of the reduced data set is plotted in Figure 10, along with the classification diagrams based on the CEM and our new algorithm.


Figure 9: Residual variance as a function of the ISOMAP dimensionality.

     15    1    9   71    0    4    0    0    0    0
      0  100    0    0    0    0    0    0    0    0
     19    1   62    4   12    1    0    1    0    0
      8    0    0   85    0    7    0    0    0    0
      6    0    5    0   82    0    0    0    0    7
      6    1    0    0    0   90    2    0    0    0
      2    6    0    0    0    5   87    0    0    0
      3    0    3    0    2    4    0   77    0   11
      4    0    0    0    0    0    0    0   96    0
      1    0    0    0    1    3    0    2    0   93

Table 2: Digit redistribution by the CEM algorithm.

     88    0    1    0    1    0    0    0   10    0
      0  100    0    0    0    0    0    0    0    0
      0    1   77    2   16    2    0    0    2    0
      0    0    0   89    0    7    0    0    4    0
      0    0    9    0   82    0    0    0    2    7
      0    1    4    1    0   90    2    0    0    0
      0    6    0    0    0    5   87    0    2    0
      0    0    5    0    2    4    0   77    1   10
      0    1    8   15    0    0    0    0   72    0
      0    0    0    0    3    1    0    4    1   91

Table 3: Digit redistribution by the new algorithm.

    100    0    0    0    0    0    0    0    0    0
      6   87    0    0    0    1    6    0    0    0
     10    3   59   13    2    1    2    8    2    0
      0    0    1   91    0    2    0    1    4    1
     17    0    0    0   72    0    1    0    0   10
      2    4    2    4    5   77    2    1    0    3
      0    8    1    0    0   15   75    0    1    0
      3    0    0    0    0    0    0   77    0   20
      4    0    0    7    1    5    0    0   78    5
      1    0    0    1   10    0    0    6    0   82

Table 4: Digit redistribution by k-means.

Figure 10: From top to bottom: two-dimensional representations of the data and the classification by CEM, our new method and k-means.


Figure 11: Parallel implementation of the anisotropic algorithm: with 200 clusters, 3000 iterations and 24,000,000 sample points for the distribution exp(−40(x − y)^2) + exp(−40(x + y)^2).

In each of these 10×10 tables, the (i, j) element represents the number of samples which belong to the i-th original group but are distributed to the new j-th cluster (the indices i and j correspond to the digits i − 1 and j − 1, respectively). An examination of the tables shows that the diagonal elements dominate the corresponding rows and columns, indicating a good classification. The lower accuracy rate for the CEM is mainly due to the confusion between the digits 0 and 3. The lower rate of the k-means, however, is affected by the lack of recognition of digit 2, and by the use of even weights and isotropic clusters for all the characters. Clearly, from Figure 10, we can see, even in the reduced dimension, the anisotropic clustering with respect to the different digits. Our new algorithm can take advantage of the good features of both the CEM and the k-means algorithms and handle the anisotropy of the data set to get a better recognition of the characters with a higher accuracy rate.

6.4 Parallel implementations In this subsection, we examine the performance of the parallel implementation, mainly for the anisotropic algorithm. The original distribution is exp(−40(x − y)^2) + exp(−40(x + y)^2), and we try to divide it into k = 200 clusters. There are 3000 iterations in total, and each iteration samples q = ps = 8000 points, so in total 24,000,000 sample points are used for 200 clusters. Figure 11 shows the graph of speed-up vs. the number of processors.

From the graph, we can see that the speed-up is almost linear with respect to the number of processors.

We also consider the speed-up when the number of clusters and the total number of sample points increase with the number of processors. In Figure 12, the performance of the parallel implementation of our new anisotropic algorithm with k = 50p clusters, a fixed 3000 iterations and 6,000,000p total sample points (40 sample points per cluster per iteration) is shown for the distribution exp(−40(x − y)^2) + exp(−40(x + y)^2), where p is the number of processors. The problem clearly has O(k^2) complexity when the number of sample points per cluster per iteration is fixed. Thus, when we linearly increase the number of processors, the time needed should, in theory, increase linearly. The verification of this linear increase in the actual timing again shows the good scalability of our algorithm.

Figure 12: Parallel implementation of the anisotropic algorithm: with 50p clusters, 3000 iterations and 6,000,000p total sample points on p processors for the distribution exp(−40(x − y)^2) + exp(−40(x + y)^2).

7 Conclusion.

In the present work, we are interested in studying mixture model based clusterings and tessellations. The classical algorithms such as EM, CEM and k-means are examined, and probabilistic implementations are considered. An anisotropic algorithm which is, in some sense, a combination of the CEM and k-means, is introduced. It is also based on a suitable generalization of the concept of centroidal Voronoi tessellations that has been extensively studied recently. Based on the results of some computational experiments, we are able to observe the different behaviors of these algorithms. Often, CEM is more effective in identifying the main clusters but it ignores the details. K-means, on the other hand, may provide too many details. The anisotropic algorithm can reveal the details while not losing sight of the main clusters, and it can also automatically identify the anisotropy in the data set. For continuous distributions, an efficient probabilistic implementation is proposed and tested numerically. The experiments also lead to various strategies for the selection of the parameters used in our algorithms. The inclusion and fine tuning of these parameters can significantly speed up our algorithm and make it far superior to, for instance, the classical MacQueen's method. The probabilistic nature of the implementation also allows a nearly perfect parallelization. Our numerical results on a parallel cluster confirmed the desired efficiency and speed-up.

References

[1] R. Agrawal and J. Shafer, Parallel mining of association rules, IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 962-969, 1996.

[2] F. Aurenhammer and H. Edelsbrunner, An optimal algorithm for constructing the weighted Voronoi diagram, Pattern Recognition, 17, pp. 251-257, 1984.

[3] A. Dempster, N. Laird and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. of Royal Stat. Society, 39, pp. 1-38, 1977.

[4] I. Dhillon and D. Modha, A data clustering algorithm on distributed memory multiprocessors, Workshop on Large-Scale Parallel KDD Systems, 1999.

[5] Q. Du, V. Faber and M. Gunzburger, Centroidal Voronoi tessellations: applications and algorithms, SIAM Review, 41, pp. 637-676, 1999.

[6] Q. Du and M. Gunzburger, Centroidal Voronoi tessellation based proper orthogonal decomposition analysis, in Control and Estimation of Distributed Parameter Systems, W. Desch, F. Kappel and K. Kunisch, eds., International Series of Numerical Mathematics, 143, pp. 137-150, Birkhauser, 2003.

[7] Q. Du, M. Gunzburger and L. Ju, Meshfree, probabilistic determination of point sets and support regions for meshless computing, Comput. Methods Appl. Mech. Engrg., 191, pp. 1349-1366, 2002.

[8] Q. Du and D. Wang, Anisotropic centroidal Voronoi tessellations and their applications, submitted to SIAM J. Sci. Comp., 2003.

[9] Q. Du and D. Wang, Tetrahedral mesh generation and optimization based on centroidal Voronoi tessellations, Int. J. Numer. Meth. Eng., 56(9), pp. 1355-1373, 2002.

[10] Q. Du and T. Wong, Numerical studies of MacQueen's k-means algorithm for computing the centroidal Voronoi tessellation, Comp. Math. Appl., 44, pp. 511-523, 2002.

[11] C. Fraley and A. Raftery, Model-based clustering, discriminant analysis, and density estimation, J. Amer. Stat. Assoc., 97, pp. 611-631, 2002.

[12] R. Gray and D. Neuhoff, Quantization, IEEE Trans. Inform. Theory, 44, pp. 2325-2383, 1998.

[13] E. Han, G. Karypis and V. Kumar, Scalable parallel data mining for association rules, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, pp. 277-288, 1997.

[14] S. Hiller, H. Hellwig and O. Deussen, Beyond stippling - methods for distributing objects on the plane, Comput. Graph. Forum, 22, pp. 515-522, 2003.

[15] A. Jain, M. Murty and P. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3), pp. 264-323, 1999.

[16] L. Ju, Q. Du and M. Gunzburger, Probabilistic methods for centroidal Voronoi tessellations and their parallel implementations, Parallel Computing, 28, pp. 1477-1500, 2002.

[17] A. Kshemkalyani, A fine-grained modality classification for global predicates, IEEE Trans. Parall. Distr., 14, pp. 807-816, 2003.

[18] S. Lloyd, Least squares quantization in PCM, IEEE Trans. Inform. Theory, 28, pp. 129-137, 1982.

[19] J. Ma, L. Xu and M. I. Jordan, Asymptotic convergence rate of the EM algorithm for Gaussian mixtures, Neural Computation, 12, pp. 2881-290, 2000.

[20] J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, L. Le Cam and J. Neyman, eds., University of California, pp. 281-297, 1967.

[21] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions, Wiley, 1997.

[22] H. Nagesh, S. Goil and A. Choudhary, A scalable parallel subspace clustering algorithm for massive data sets, ICPP 2000.

[23] C. F. Olson, Parallel algorithms for hierarchical clustering, Parallel Computing, 21, 1995.

[24] R. Redner and H. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Review, 26(2), pp. 195-239, 1984.

[25] J. Tenenbaum, V. de Silva and J. Langford, A global geometric framework for nonlinear dimension reduction, Science, 290, pp. 2319-2323, 2000.

[26] C. Wu, On the convergence properties of the EM algorithm, Ann. Stat., 11(1), pp. 95-103, 1983.

[27] L. Xu and M. I. Jordan, On convergence properties of the EM algorithm for Gaussian mixtures, Neural Computation, 8(1), pp. 129-151, 1996.

[28] X. Xu, J. Jager and H.-P. Kriegel, A fast parallel clustering algorithm for large spatial databases, Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1999.