Cluster Validity For Kernel Fuzzy Clustering

Timothy C. Havens
Department of Computer Science and Engineering
Michigan State University
East Lansing, MI 48824 USA
Email: [email protected]

James C. Bezdek, Marimuthu Palaniswami
Department of Electrical and Electronic Engineering
University of Melbourne
Parkville, Victoria 3010, Australia
Email: [email protected], [email protected]

Abstract—This paper presents cluster validity for kernel fuzzy clustering. First, we describe existing cluster validity indices that can be directly applied to partitions obtained by kernel fuzzy clustering algorithms. Second, we show how validity indices that take dissimilarity (or relational) data D as input can be applied to kernel fuzzy clustering. Third, we present four propositions that allow other existing cluster validity indices to be adapted to kernel fuzzy partitions. As an example of how these propositions are used, five well-known indices are formulated. We demonstrate several indices for kernel fuzzy c-means (kFCM) partitions of both synthetic and real data.

Index Terms—cluster validity, kernel clustering, fuzzy clustering.

I. INTRODUCTION

Clustering or cluster analysis is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share some similarity. Clustering has been used as a pre-processing step to separate data into manageable parts [1], as a knowledge discovery tool [2], for indexing and compression [3], etc., and there are many good books that describe its various uses [4–8]. The most popular use for clustering is to assign labels to unlabeled data—data for which no pre-existing grouping is known. Any field that uses or analyzes data can utilize clustering; the problem domains and applications of clustering are innumerable.

Finding good clusters involves more than just separating objects into groups; three major problems comprise clustering, and all three are equally important for most applications. Tendency: Are there clusters in the data? And, if so, how many? Partitioning: Which objects should be grouped together (and to what degree)? Validity: Which partition is best?

The main contribution of this paper is to provide methods for assessing cluster validity for partitions obtained from kernelized fuzzy clustering algorithms, such as kernel fuzzy c-means (kFCM). We first discuss existing indices that can be directly applied with no modification. Then we show how to apply indices that take dissimilarity data D as input. We then prove four propositions which allow other indices to be adapted to kernel fuzzy partitions and demonstrate the use of these propositions by adapting five well-known indices, including the popular Fukuyama-Sugeno and Xie-Beni indices.

Because we are specifically addressing fuzzy clustering in kernel spaces, we will use the kFCM algorithm for partitioning. Note, however, that our validity methods would work with partitions produced by any kernel fuzzy (and crisp, for that matter) clustering algorithm.

Section II presents the necessary background and describes related work. The kernel validity indices are proposed in Section III. We present empirical results in Section IV, and Section V provides a short summary and discusses future work.

II. BACKGROUND AND RELATED WORK

Consider a set of n objects O = {o_1, ..., o_n}, e.g., patients at a hospital, bass players in afro-Cuban bands, or wireless sensor network nodes. Each object is typically represented by numerical object or feature-vector data of the form X = {x_1, ..., x_n} ⊂ R^d, where the coordinates of x_i provide feature values (e.g., weight, length, insurance payment, etc.) describing object o_i.

A c-partition of the objects is defined as a set of nc values {u_ij}, where each value represents the degree to which object o_i is in the j-th cluster. The c-partition is often represented as an n × c matrix U = [u_ij]. There are three main types of partitions: crisp, fuzzy (or probabilistic), and possibilistic [9, 10]. Crisp partitions of the unlabeled objects are non-empty mutually-disjoint subsets of O such that the union of the subsets equals O. The set of all non-degenerate (no zero columns) crisp c-partition matrices for the object set O is

M_{hcn} = \left\{ U \in \mathbb{R}^{n \times c} \;\middle|\; u_{ij} \in \{0,1\},\ \forall j,i;\ 0 < \sum_{i=1}^{n} u_{ij} < n,\ \forall j;\ \sum_{j=1}^{c} u_{ij} = 1,\ \forall i \right\},   (1)

where u_ij is the membership of object o_i in cluster j; the partition element u_ij = 1 if o_i is labeled j and is 0 otherwise. When the columns of U are considered as vectors in R^n, we denote the j-th column as u_j.

Fuzzy (or probabilistic) partitions are more flexible than crisp partitions in that each object can have membership in more than one cluster. Note, if U is probabilistic, say U = P = [p_ik], then p_ik is interpreted as the posterior probability p(k|o_i) that o_i is in the k-th class. Since this paper focuses on fuzzy partitions, we do not specifically address this difference. However, we stress that most, if not all, of the indices described here can be directly applied to probabilistic c-partitions produced by the Gaussian Mixture Model / Expectation-Maximization (GMM/EM) algorithm, which is


the most popular way of finding probabilistic clusters. The set of all fuzzy c-partitions is

M_{fcn} = \left\{ U \in \mathbb{R}^{n \times c} \;\middle|\; u_{ij} \in [0,1],\ \forall j,i;\ 0 < \sum_{i=1}^{n} u_{ij} < n,\ \forall j;\ \sum_{j=1}^{c} u_{ij} = 1,\ \forall i \right\}.   (2)

Each row of the fuzzy partition U must sum to 1, thus ensuring that every object has unit cluster membership in total (\sum_j u_{ij} = 1).
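To make the constraints in (1) and (2) concrete, the following minimal numpy sketch (our own illustration; the function name is not from the paper) checks whether a candidate matrix U is a valid, non-degenerate fuzzy c-partition in M_fcn.

import numpy as np

def is_fuzzy_partition(U, tol=1e-9):
    """Check the M_fcn constraints of Eq. (2): entries in [0, 1], each row
    sums to 1, and no cluster (column) is empty or contains everything."""
    U = np.asarray(U, dtype=float)
    in_range = np.all((U >= -tol) & (U <= 1.0 + tol))
    rows_sum_to_one = np.allclose(U.sum(axis=1), 1.0, atol=1e-6)
    col_sums = U.sum(axis=0)
    nondegenerate = np.all((col_sums > tol) & (col_sums < U.shape[0] - tol))
    return bool(in_range and rows_sum_to_one and nondegenerate)

# Example: a valid fuzzy 2-partition of 4 objects.
U = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
print(is_fuzzy_partition(U))  # True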

A. Fuzzy c-means (FCM)

One of the most popular methods for finding fuzzy partitions is FCM [9]. The FCM algorithm is generally defined as the constrained optimization of the squared-error distortion

J_m(U, V) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|x_i - v_j\|_A^2,   (3)

where U is an (n × c) partition matrix, V = {v_1, ..., v_c} is a set of c cluster centers in R^d, m > 1 is the fuzzification constant, and \|\cdot\|_A is any inner product A-induced norm, i.e., \|x\|_A = \sqrt{x^T A x}. Typically, the Euclidean norm (A = I) is used, but there are many examples where the use of another norm-inducing matrix has been shown to be effective, e.g., using A = S^{-1}, the inverse of the sample covariance matrix. Only the Euclidean norm will be used in this paper; we ease the notation by dropping the subscript (A = I) in what follows. The FCM/AO algorithm approximates solutions to (3) using alternating optimization (AO) [11]. Other approaches to optimizing the FCM model include genetic algorithms, particle swarm optimization, etc. The FCM/AO approach is by far the most common, and the only algorithm used here; we also ease the notation by dropping the "/AO" suffix. Algorithm 1 outlines the steps of the FCM/AO algorithm. There are many ways to initialize FCM; we choose c objects randomly from the data set itself to serve as the initial cluster centers, which seems to work well in almost all cases. But any initialization method that adequately covers the object space and does not produce any identical initial centers would work.

The alternating steps of FCM in Eqs. (4) and (5) are iterated until the algorithm terminates, where termination is declared when there are only negligible changes in the cluster center locations: more explicitly, \max_{1 \le j \le c}\{\|v_{j,new} - v_{j,old}\|^2\} \le \epsilon, where \epsilon is a pre-determined constant.

Algorithm 1: FCM/AO
Input: X, c, m, \epsilon
Output: U, V
Initialize V
while \max_{1 \le k \le c}\{\|v_{k,new} - v_{k,old}\|^2\} > \epsilon do

  u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\|x_i - v_j\|}{\|x_i - v_k\|} \right)^{\frac{2}{m-1}} \right]^{-1}, \quad \forall i, j   (4)

  v_j = \frac{\sum_{i=1}^{n} (u_{ij})^m x_i}{\sum_{i=1}^{n} (u_{ij})^m}, \quad \forall j   (5)

end
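For readers who want to follow Algorithm 1 in code, here is a compact numpy sketch of FCM/AO with the Euclidean norm. It is our own illustrative implementation (names and defaults are ours), not the authors' code; it initializes V by sampling c objects from X, as described above.

import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=None):
    """Illustrative FCM/AO (Algorithm 1): alternate the membership update (4)
    and the center update (5) until the centers stop moving."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    V = X[rng.choice(n, size=c, replace=False)].copy()   # initial centers from X
    U = np.full((n, c), 1.0 / c)
    for _ in range(max_iter):
        # Squared Euclidean distances d2[i, j] = ||x_i - v_j||^2.
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                           # avoid division by zero
        # Eq. (4): u_ij = [ sum_k (d2_ij / d2_ik)^(1/(m-1)) ]^(-1).
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        # Eq. (5): v_j = sum_i u_ij^m x_i / sum_i u_ij^m.
        Um = U ** m
        V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
        shift = np.max(np.sum((V_new - V) ** 2, axis=1))
        V = V_new
        if shift <= eps:                                  # termination test
            break
    return U, V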

B. Kernel fuzzy c-means

Now consider some non-linear mapping function φ: x → φ(x) ∈ R^{d_K}, where d_K is the dimensionality of the transformed feature vector φ(x). With kernel clustering, we do not explicitly transform x; we simply represent the dot product φ(x_i) · φ(x_l) = κ(x_i, x_l) by a kernel function κ. The kernel function κ can take many forms, with the polynomial κ(x_i, x_l) = (x_i^T x_l + 1)^p and the radial-basis-function (RBF) κ(x_i, x_l) = exp(−σ‖x_i − x_l‖²) being two of the most popularly used. Given a set of n feature vectors X, we construct an n × n kernel matrix K = [K_ij = κ(x_i, x_j)]_{n×n}. The kernel matrix K represents all pairwise dot products of the feature vectors in the transformed high-dimensional space—the Reproducing Kernel Hilbert Space (RKHS).
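A minimal sketch of constructing K for the two kernels mentioned above (our own code; σ is taken as a positive width parameter so the RBF exponent is negative, which is the standard convention):

import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, sigma=1.0):
    """K_ij = exp(-sigma * ||x_i - x_j||^2), sigma > 0."""
    return np.exp(-sigma * cdist(X, X, metric='sqeuclidean'))

def poly_kernel(X, p=2):
    """K_ij = (x_i . x_j + 1)^p."""
    return (X @ X.T + 1.0) ** p

X = np.random.default_rng(0).normal(size=(6, 2))
K = rbf_kernel(X, sigma=0.5)
print(K.shape, bool(np.allclose(K, K.T)))  # (6, 6) True: K is symmetric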

Given a kernel function κ, kernel FCM (kFCM) can be generally defined as the constrained minimization of

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|\phi(x_i) - \phi(v_j)\|^2,   (6)

where, like FCM, U ∈ M_fcn and m is the fuzzification parameter. kFCM approximately solves the optimization problem in (6) by computing iterated updates of

d_\kappa(x_i, v_j) = \|\phi(x_i) - \phi(v_j)\|^2,   (7)

u_{ij} = \left( \sum_{k=1}^{c} \left( \frac{d_\kappa(x_i, v_j)}{d_\kappa(x_i, v_k)} \right)^{\frac{1}{m-1}} \right)^{-1}, \quad \forall i, j,   (8)

where d_κ(x_i, v_j) is the kernel distance between input datum x_i and cluster center v_j.

Like FCM, the cluster centers are linear combinations of the feature vectors,

\phi(v_j) = \frac{\sum_{l=1}^{n} u_{lj}^m \phi(x_l)}{\sum_{l=1}^{n} u_{lj}^m}.   (9)

Equation (7) cannot be computed directly, but by using the identity K_ij = κ(x_i, x_j) = φ(x_i) · φ(x_j), denoting \tilde{u}_j = u_j^m / \sum_i u_{ij}^m where u_j^m = (u_{1j}^m, u_{2j}^m, \ldots, u_{nj}^m)^T, and substituting (9) into (7), we get

d_\kappa(x_i, v_j) = \frac{\sum_{l=1}^{n} \sum_{s=1}^{n} u_{lj}^m u_{sj}^m \, \phi(x_l) \cdot \phi(x_s)}{\left( \sum_{l=1}^{n} u_{lj}^m \right)^2} + \phi(x_i) \cdot \phi(x_i) - 2 \frac{\sum_{l=1}^{n} u_{lj}^m \, \phi(x_l) \cdot \phi(x_i)}{\sum_{l=1}^{n} u_{lj}^m}
= \tilde{u}_j^T K \tilde{u}_j + e_i^T K e_i - 2 \tilde{u}_j^T K e_i
= \tilde{u}_j^T K \tilde{u}_j + K_{ii} - 2 (\tilde{u}_j^T K)_i,   (10)

where e_i is the n-length unit vector with the i-th element equal to 1. This formulation of kFCM is equivalent to that proposed in [12] and is identical to relational FCM if the standard dot-product kernel κ(x_i, x_l) = 〈x_i, x_l〉 is used [13].
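The practical value of Eq. (10) is that one kFCM sweep needs only U and K. A sketch of the two updates (our own code, following (10) and (8); the numerical guards are ours):

import numpy as np

def kernel_distances(K, U, m=2.0):
    """Eq. (10): d_k(x_i, v_j) = u_j^T K u_j + K_ii - 2 (u_j^T K)_i, where the
    columns of Ut are u_j^m normalized to sum to one."""
    Um = U ** m
    Ut = Um / Um.sum(axis=0, keepdims=True)
    quad = np.einsum('ij,ik,kj->j', Ut, K, Ut)      # u_j^T K u_j for each j
    cross = K @ Ut                                  # (u_j^T K)_i for each i, j
    d2 = quad[None, :] + np.diag(K)[:, None] - 2.0 * cross
    return np.fmax(d2, 0.0)                         # clip round-off negatives

def kfcm_membership(d2, m=2.0):
    """Eq. (8): invert the summed ratios of kernel distances."""
    d2 = np.fmax(d2, 1e-12)
    ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)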


C. Some cluster validity indices

Cluster validity indices all attempt to answer the question: which partition among a set of candidate partitions found by clustering algorithm(s) is best for this data set? Choosing the best partition from several candidates implicitly identifies the appropriate c from a set of different c values (say c = 2, 3, ..., 10). In this paper, our experiments focus on choosing the appropriate c, which is arguably the most popular use of validity indices. Validity indices fall into two major categories: internal and external. Internal indices take as input only the data and the results of the clustering algorithm (typically the partition U and/or the cluster centers V). External indices compare the partition against some external criteria, e.g., known class labels or must-link / cannot-link constraints. References [14–19] describe studies of fuzzy validity indices that use both theoretical analysis and empirical results for comparison.

We now outline several validity indices, which are described in detail in [18].¹ Indices with the superscript (+) indicate the preferred partition by their maximum value, while those with the superscript (−) indicate the preferred partition by their minimum. Some of these indices use the minimized value of J_m(U, V) at Eq. (3). Note that J_2(U, V) is J_m(U, V) for m = 2.

1) Internal indices based on only U:
• Partition coefficient (Bezdek):

V_{PC}^{(+)} = \frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^2   (11)

• Partition entropy (Bezdek):

V_{PE}^{(-)} = -\frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij} \log u_{ij}   (12)

• Modified partition coefficient (Dave):

V_{MPC}^{(+)} = 1 - \frac{c}{c-1} \left( 1 - V_{PC}^{(+)} \right)   (13)

• Kim:

V_{KI}^{(-)} = \frac{2}{c(c-1)} \sum_{j=1}^{c-1} \sum_{k=j+1}^{c} \sum_{i=1}^{n} c \cdot h_i \cdot (u_{ij} \wedge u_{ik}),   (14)
h_i = -\sum_{l=1}^{c} u_{il} \log u_{il}
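Because the indices above need only U, they apply to kFCM partitions without any modification. A minimal numpy sketch of the first three (function names are ours):

import numpy as np

def partition_coefficient(U):
    """Eq. (11): V_PC = (1/n) sum_ij u_ij^2 (larger is better)."""
    return np.sum(U ** 2) / U.shape[0]

def partition_entropy(U, eps=1e-12):
    """Eq. (12): V_PE = -(1/n) sum_ij u_ij log u_ij (smaller is better)."""
    return -np.sum(U * np.log(U + eps)) / U.shape[0]

def modified_partition_coefficient(U):
    """Eq. (13): V_MPC = 1 - c/(c-1) (1 - V_PC) (larger is better)."""
    c = U.shape[1]
    return 1.0 - c / (c - 1.0) * (1.0 - partition_coefficient(U))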

2) Indices based on (U, V) and X:
• Fukuyama and Sugeno:

V_{FS}^{(-)} = J_m(U, V) - \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|v_j - \bar{v}\|^2,   (15)
\bar{v} = \frac{1}{c} \sum_{j=1}^{c} v_j

¹Note that, because we are page-limited, we do not include the original references for all the indices here. However, reference [18] is a comprehensive source for finding the original references. We do indicate the inventor of each index in parentheses, where appropriate.

• Generalized Xie-Beni (Xie, Beni, Pal, and Bezdek):

V_{XB}^{(-)} = \frac{J_m(U, V)}{n \min_{j \neq k} \{\|v_j - v_k\|^2\}}   (16)

• Kwon:

V_{K}^{(-)} = \frac{J_2(U, V) + \frac{1}{c} \sum_{j=1}^{c} \|v_j - \bar{x}\|^2}{\min_{j \neq k} \{\|v_j - v_k\|^2\}}   (17)

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i   (18)

• Tan and Sun:

V_{TS}^{(-)} = \frac{J_2(U, V) + \frac{1}{c(c-1)} \sum_{j=1}^{c} \sum_{k=1}^{c} \|v_j - v_k\|^2}{\min_{j \neq k} \{\|v_j - v_k\|^2\} + 1/c}   (19)

• PCAES (Wu and Yang):

V_{PCAES}^{(+)} = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^2 / u_M - \sum_{j=1}^{c} \exp\left( -\min_{j \neq k} \left\{ \|v_j - v_k\|^2 / \beta_T \right\} \right)   (20)

u_M = \min_j \left\{ \sum_{i=1}^{n} u_{ij}^2 \right\}, \quad \beta_T = \frac{1}{c} \sum_{j=1}^{c} \|v_j - \bar{x}\|^2

where \bar{x} is calculated by (18).

3) External indices: External indices compare the resulting partition against "ground-truth" labels or some other external criteria, typically class labels. While these indices can be very useful for measuring the match between partitions and the external criteria, say for the purposes of comparing clustering algorithms or eliciting the behavior of a clustering algorithm, we dissuade their use for cluster validity. External indices are naturally biased towards ground-truth; hence, they can miss the "true" grouping of the data, which may not match the class labels.

In Section IV, we show the values for the generalized Rand index V_Rand proposed in [20], which compares a fuzzy partition U against a reference partition U_ref. For our experiments, we set the reference partition to the crisp partition that represents the known class labels.

4) Other indices: Other indices compare partitions against dissimilarity values, produce visualizations which suggest the number of clusters, etc. In this paper, we present values for the Correlation Cluster Validity (CCV) family of indices, specifically CCVp and CCVs [21].

The CCV indices first induce a partition dissimilarity

D_U = [1]_n - \left( \frac{U U^T}{\max_{ij} \{ (U U^T)_{ij} \}} \right),   (21)

where [1]_n is the n × n matrix where each element is 1.² The partition dissimilarity is then compared against the data dissimilarity matrix D, where D_il = ‖x_i − x_l‖. Two statistical correlation measures are used for this comparison.

²In [21] the authors represent the partition matrix as a c × n matrix, while in this paper U is n × c.


• Pearson (CCVp):

V_{CCVp}^{(+)} = \frac{\langle A, B \rangle_2}{\|A\|_2 \, \|B\|_2},   (22)

where A_ij = D_ij − \bar{D}_ij and B_ij = [D_U]_ij − [\bar{D}_U]_ij. The matrices \bar{D} and \bar{D}_U are n × n matrices where every entry is the average value of D and D_U, respectively.

• Spearman (CCVs):

V_{CCVs}^{(+)} = \frac{\langle r, r^* \rangle_2}{\|r\|_2 \, \|r^*\|_2},   (23)

where r_k is the rank of the k-th element of the n(n−1)/2 off-diagonal upper-triangular values of D and r^*_k is the rank of the k-th element of the corresponding values in D_U.

The CCV indices are unique in that they work not only with vector data X, but also with pure relational data D (where the data we start with is D and we do not have access to X). Hence, CCV falls into a category of indices for relational data.
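As an illustration, here is a sketch of the CCV computation (our own code). It evaluates the Pearson and Spearman correlations over the n(n−1)/2 upper-triangular entries of D and D_U; whether the matrix form of (22) also includes the diagonal entries is not something we rely on here.

import numpy as np
from scipy.stats import spearmanr

def ccv_indices(D, U):
    """CCVp and CCVs (Eqs. 21-23): correlate the partition-induced
    dissimilarity D_U with the data dissimilarity D (larger is better)."""
    n = D.shape[0]
    UUt = U @ U.T
    DU = 1.0 - UUt / UUt.max()                      # Eq. (21)
    iu = np.triu_indices(n, k=1)                    # off-diagonal upper triangle
    a, b = D[iu], DU[iu]
    A, B = a - a.mean(), b - b.mean()               # centered, as in Eq. (22)
    ccvp = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
    ccvs = spearmanr(a, b).correlation              # rank correlation, Eq. (23)
    return ccvp, ccvs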

Another form of validity that we do not focus on here, due to space limitations, is visualization. A popular visual cluster validity method is the VCV method proposed in [22, 23]. The VCV indices could be easily applied to kernel clustering by the method we will show for the CCV indices in Section III-A.

III. KERNEL CLUSTER VALIDITY

Kernel clustering algorithms take a kernel matrix K as an additional input and produce at least a partition U. Hence, any validity index that only takes U as input, such as V_PC and V_PE, can be used directly with kernel-based algorithms (or, for that matter, any clustering algorithm that produces a partition matrix). We will show results of these types of indices in Section IV. The matrix K provides an additional source of information about the input data, since it indirectly represents a transformation of X into a distance matrix D. In view of this, it may be that incorporating K into validity indices improves their utility for assessing cluster validity. However, indices that take U and X (which would be φ(X) for kernel clustering) as input are not directly applicable for kernel clustering because we do not have access to the high-dimensional projection φ(X). In Section III-B, we will prove four propositions that show how these types of indices can be adapted for use with kernel clustering outputs. First, we discuss an existing kernel clustering validity index, and then we adapt the CCV indices with a simple transformation.

To our knowledge, the only cluster validity index specifically designed (to date) for kernel clustering is the PK index proposed in [24], which is very similar to the CCV method. The PK index compares a proximity matrix P to the kernel matrix K with the idea that if P and K are of similar structure then the partition is good. The proximity matrix is computed by

P_{il} = \sum_{j=1}^{c} \min\{u_{ij}, u_{lj}\}, \quad i, l = 1, \ldots, n,   (24)

and the validity index is computed as

V_{PK}^{(+)} = \frac{\langle K, P \rangle_F}{\sqrt{\langle K, K \rangle_F \langle P, P \rangle_F}},   (25)

where 〈·, ·〉_F indicates the Frobenius inner product. Notice the similarity between (25) and the CCV equations, (22) and (23). There is a serious drawback to the PK validity index, however. Namely, the index compares P directly to the kernel matrix K. This operation is okay for kernel matrices that have a constant diagonal, like the RBF kernel. But this validity index fails for kernels that do not have a constant diagonal, like the dot-product and polynomial kernels. For these kernels the Frobenius inner product in the numerator of (25) can be dominated by sub-blocks of K that have large diagonal (and subsequently large off-diagonal) values. For this reason, we do not recommend V_PK as a validity index. Instead, we recommend the adaptation of the CCV indices, described next.

A. Adapting indices that take relational data D as input

It is well known and easily proven that the Euclidean distance between two kernel representations of vectors can be computed by

\|\phi(x_i) - \phi(x_l)\|^2 = K_{ii} + K_{ll} - 2 K_{il}.   (26)

Hence, any validity index that takes D = [D_il = ‖φ(x_i) − φ(x_l)‖] as input, such as CCV, can be adapted to kernel clustering partitions by computing

D = \left[ D_{il} = \sqrt{K_{ii} + K_{ll} - 2 K_{il}} \right]   (27)

and then applying CCV directly to D (and D_U).
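In code, Eq. (27) is a one-liner from the diagonal of K (a sketch, ours):

import numpy as np

def kernel_to_distance(K):
    """Eq. (27): D_il = sqrt(K_ii + K_ll - 2 K_il), the RKHS distance
    between phi(x_i) and phi(x_l)."""
    diag = np.diag(K)
    D2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.fmax(D2, 0.0))   # clip tiny negatives from round-off

The resulting D, together with D_U from Eq. (21), can then be passed to a CCV routine such as the ccv_indices sketch above.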

B. Adapting indices that take (U, V) and X as input

Indices such as Xie-Beni and PCAES are not as easily adapted to kernel clustering, as they take X as input. However, we can transform these indices by noticing a common property among them: all the indices are dependent on weighted Euclidean distances involving X and V (and combinations thereof). Examining equations (15) through (20) reveals that there are four quantities that must be adapted: i) the objective function, J_m(U); ii) ‖φ(v_j) − φ(v_k)‖²; iii) ‖φ(v_j) − φ(\bar{v})‖²; and iv) ‖φ(v_j) − φ(\bar{x})‖². The following four propositions show how each of these can be calculated using the partition U and the kernel matrix K.

Proposition 1. For a given kernel function φ: x → φ(x) ∈ R^{d_K}, the kFCM objective can be formulated as

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \|\phi(x_i) - \phi(v_j)\|^2
= \sum_{i=1}^{n} K_{ii} \sum_{j=1}^{c} u_{ij}^m - \sum_{j=1}^{c} (\hat{u}_j^m)^T K \hat{u}_j^m
= \mathrm{diag}(K)^T \sum_{j=1}^{c} u_j^m - \mathrm{trace}\left[ (\hat{U}^m)^T K \hat{U}^m \right],   (28)

where u_j^m = (u_{1j}^m, \ldots, u_{nj}^m)^T, \hat{U}^m = [\hat{u}_1^m, \ldots, \hat{u}_c^m], and \hat{u}_j^m = u_j^m / \sqrt{\sum_{i=1}^{n} u_{ij}^m}.


Proof: Expanding J_m(U) gives

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \left[ \phi(x_i) \cdot \phi(x_i) + \phi(v_j) \cdot \phi(v_j) - 2 \phi(x_i) \cdot \phi(v_j) \right].   (29)

Using (9) to substitute for φ(v_j) and substituting K_ir = φ(x_i) · φ(x_r) into (29) gives

J_m(U) = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^m \left[ K_{ii} + \frac{\sum_{k=1}^{n} \sum_{l=1}^{n} u_{kj}^m u_{lj}^m \, \phi(x_k) \cdot \phi(x_l)}{\sum_{k=1}^{n} \sum_{l=1}^{n} u_{kj}^m u_{lj}^m} - 2 \frac{\sum_{k=1}^{n} u_{kj}^m \, \phi(x_k) \cdot \phi(x_i)}{\sum_{k=1}^{n} u_{kj}^m} \right]

= \sum_{i=1}^{n} K_{ii} \sum_{j=1}^{c} u_{ij}^m + \sum_{j=1}^{c} \left[ \frac{\sum_{k=1}^{n} \sum_{l=1}^{n} u_{kj}^m u_{lj}^m \, \phi(x_k) \cdot \phi(x_l)}{\sum_{k=1}^{n} u_{kj}^m} - 2 \frac{\sum_{k=1}^{n} \sum_{l=1}^{n} u_{kj}^m u_{lj}^m \, \phi(x_k) \cdot \phi(x_l)}{\sum_{k=1}^{n} u_{kj}^m} \right]

= \sum_{i=1}^{n} K_{ii} \sum_{j=1}^{c} u_{ij}^m - \sum_{j=1}^{c} \frac{(u_j^m)^T K u_j^m}{\sum_{k=1}^{n} u_{kj}^m}.

Substituting \hat{U}^m into the above equation finishes the proof. ∎
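A quick numerical sanity check of Proposition 1 (and of Remark 1 below): with the dot-product kernel K = XX^T, Eq. (28) should equal the vector-data objective of Eq. (3) evaluated at the FCM centers of Eq. (5). The following self-contained snippet (ours) confirms the identity for a random fuzzy partition.

import numpy as np

rng = np.random.default_rng(1)
n, d, c, m = 20, 3, 4, 2.0
X = rng.normal(size=(n, d))
U = rng.random((n, c))
U /= U.sum(axis=1, keepdims=True)                 # a random fuzzy partition

# Vector-data objective: centers from Eq. (5), then J_m(U, V) from Eq. (3).
Um = U ** m
V = (Um.T @ X) / Um.sum(axis=0)[:, None]
d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
J_vector = np.sum(Um * d2)

# Kernel form of Eq. (28) with K = X X^T.
K = X @ X.T
Uhat = Um / np.sqrt(Um.sum(axis=0, keepdims=True))   # columns scaled by 1/sqrt(sum)
J_kernel = np.diag(K) @ Um.sum(axis=1) - np.trace(Uhat.T @ K @ Uhat)

print(bool(np.isclose(J_vector, J_kernel)))       # True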

Proposition 2. The squared Euclidean distance between two cluster centers is

\|\phi(v_j) - \phi(v_k)\|^2 = \tilde{u}_j^T K \tilde{u}_j + \tilde{u}_k^T K \tilde{u}_k - 2 \tilde{u}_j^T K \tilde{u}_k,   (30)

where \tilde{u}_j = (u_{1j}^m, \ldots, u_{nj}^m)^T / \sum_{i=1}^{n} u_{ij}^m, as in (10).

Proof: We follow the same expansion as in the proof of Proposition 1; using (9) to substitute for φ(v_j) and φ(v_k) gives

\|\phi(v_j) - \phi(v_k)\|^2 = \frac{\sum_{i=1}^{n} \sum_{l=1}^{n} u_{ij}^m u_{lj}^m \, \phi(x_i) \cdot \phi(x_l)}{\left( \sum_{i=1}^{n} u_{ij}^m \right)^2} + \frac{\sum_{i=1}^{n} \sum_{l=1}^{n} u_{ik}^m u_{lk}^m \, \phi(x_i) \cdot \phi(x_l)}{\left( \sum_{i=1}^{n} u_{ik}^m \right)^2} - 2 \frac{\sum_{i=1}^{n} \sum_{l=1}^{n} u_{ij}^m u_{lk}^m \, \phi(x_i) \cdot \phi(x_l)}{\sum_{i=1}^{n} u_{ij}^m \sum_{l=1}^{n} u_{lk}^m}

= \frac{(u_j^m)^T K u_j^m}{\left( \sum_{i=1}^{n} u_{ij}^m \right)^2} + \frac{(u_k^m)^T K u_k^m}{\left( \sum_{i=1}^{n} u_{ik}^m \right)^2} - 2 \frac{(u_j^m)^T K u_k^m}{\sum_{i=1}^{n} u_{ij}^m \sum_{l=1}^{n} u_{lk}^m}.

Substituting \tilde{u}_j and \tilde{u}_k into the above equation finishes the proof. ∎

Proposition 3. The squared Euclidean distance between a cluster center φ(v_j) and φ(\bar{v}) = (1/c) \sum_{j=1}^{c} φ(v_j) is

\|\phi(v_j) - \phi(\bar{v})\|^2 = \tilde{u}_j^T K \tilde{u}_j + \bar{u}^T K \bar{u} - 2 \tilde{u}_j^T K \bar{u},   (31)

where \bar{u} = \frac{(1/c) \sum_{j=1}^{c} u_j^m}{\sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m}.

Proof: Using (9) to expand the expression of φ(\bar{v}) gives

\phi(\bar{v}) = \frac{1}{c} \sum_{j=1}^{c} \frac{\sum_{i=1}^{n} u_{ij}^m \phi(x_i)}{\sum_{i=1}^{n} u_{ij}^m} = \frac{\sum_{i=1}^{n} \left( \frac{1}{c} \sum_{j=1}^{c} u_{ij}^m \right) \phi(x_i)}{\sum_{i=1}^{n} \left( \frac{1}{c} \sum_{j=1}^{c} u_{ij}^m \right)}.

Comparing this equation to (9) shows that φ(\bar{v}) can be considered a cluster center defined by the membership vector (1/c) \sum_{j=1}^{c} u_j^m. Thus, the same process used to prove Proposition 2 can be applied here. ∎

Proposition 4. The squared Euclidean distance between a cluster center φ(v_j) and φ(\bar{x}) = (1/n) \sum_{i=1}^{n} φ(x_i) is

\|\phi(v_j) - \phi(\bar{x})\|^2 = \tilde{u}_j^T K \tilde{u}_j + \mathbf{1}_n^T K \mathbf{1}_n - 2 \tilde{u}_j^T K \mathbf{1}_n,   (32)

where \mathbf{1}_n is the n-length vector with each element equal to 1/n.

Proof: We can write φ(\bar{x}) as

\phi(\bar{x}) = \frac{\sum_{i=1}^{n} (\mathbf{1}_n)_i \phi(x_i)}{\sum_{i=1}^{n} (\mathbf{1}_n)_i},

which shows that φ(\bar{x}) can also be considered a cluster center, of sorts, with the membership vector \mathbf{1}_n. Following the same steps as in Proposition 2 finishes this proof. ∎

Let us denote the quantities in Propositions 1-4, respectively, as J_m(U), VV(j, k; m) = ‖φ(v_j) − φ(v_k)‖², VV(j; m) = ‖φ(v_j) − φ(\bar{v})‖², and VX(j; m) = ‖φ(v_j) − φ(\bar{x})‖². We can now reformulate the indices at (15)-(20) in terms of these four quantities.

V_{FS}^{(-)} = J_m(U) - \sum_{j=1}^{c} VV(j; m) \sum_{i=1}^{n} u_{ij}^m   (33)

V_{XB}^{(-)} = \frac{J_m(U)}{n \min_{j \neq k} \{ VV(j, k; m) \}}   (34)

V_{K}^{(-)} = \frac{J_2(U) + (1/c) \sum_{j=1}^{c} VX(j; m)}{\min_{j \neq k} \{ VV(j, k; m) \}}   (35)

V_{TS}^{(-)} = \frac{J_2(U) + \frac{1}{c(c-1)} \sum_{j=1}^{c} \sum_{k=1}^{c} VV(j, k; m)}{\min_{j \neq k} \{ VV(j, k; m) \} + (1/c)}   (36)

V_{PCAES}^{(+)} = \sum_{j=1}^{c} \sum_{i=1}^{n} u_{ij}^2 / u_M - \sum_{j=1}^{c} \exp\left( -\min_{j \neq k} \{ VV(j, k; m) / \beta_T \} \right)   (37)

u_M = \min_j \left\{ \sum_{i=1}^{n} u_{ij}^2 \right\}, \quad \beta_T = (1/c) \sum_{j=1}^{c} VX(j; m)

Remark 1. The kernel reformulations in Eqs. (33)-(37) are equivalent to their vector-data counterparts at (15)-(20). If the standard dot-product kernel K = XX^T is used, then the indices that take U and K as input will return the same value as their respective counterparts that take U and X as input. The advantage of the reformulations is that they work with any kernel matrix K.

TABLE I: Data Sets

Name          n    d   Classes
iris          150  5   3
glass         214  10  6
dermatology   366  34  6
ionosphere    351  35  2
ecoli         336  8   8
sonar         208  61  2
wdbc          569  31  2
wine          178  14  3

IV. EXPERIMENTS

We tested the validity indices on both synthetic and real data. For each data set, we ran kFCM for each integer c, c_min ≤ c ≤ c_max. We then stored the preferred number of clusters (at the respective optimum) for each validity index. We did this for 31 different tests, each with a different initialization. The kFCM iterations were terminated when max_ij{|(u_new)_ij − (u_old)_ij|} < 10^{-3} or when the number of iterations exceeded 10,000 (this maximum-iteration termination criterion was never reached in any test). For all data sets, unless noted, c_min = 2 and c_max = 10.
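The model-selection protocol just described can be sketched as a simple loop over c. The snippet below (our own illustration) relies on the kernel_distances, kfcm_membership, and kernel_xie_beni sketches given earlier, uses a random fuzzy initialization, and applies the 10^{-3} membership-change termination rule; a full reproduction of the experiments would also loop over the 31 initializations and record every index.

import numpy as np

def sweep_c(K, c_min=2, c_max=10, m=2.0, max_iter=10000, seed=0):
    """Run kFCM for each candidate c and score the partition with the
    kernel Xie-Beni index (Eq. 34); other indices can be added the same way."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    scores = {}
    for c in range(c_min, c_max + 1):
        U = rng.random((n, c))
        U /= U.sum(axis=1, keepdims=True)         # random fuzzy start
        for _ in range(max_iter):
            d2 = kernel_distances(K, U, m)
            U_new = kfcm_membership(d2, m)
            done = np.max(np.abs(U_new - U)) < 1e-3
            U = U_new
            if done:
                break
        scores[c] = kernel_xie_beni(K, U, m)
    best_c = min(scores, key=scores.get)          # XB prefers its minimum
    return best_c, scores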

A. Synthetic data

Figure 1 shows plots of 8 synthetic data sets used to study the behavior of the cluster validity indices. Data set 10D-10 in view (d) is a 10-dimensional data set, and view (d) shows the top two PCA components of this data set. View (d) suggests that there are 10 clusters in 10D-10. For 10D-10, c_min = 5 and c_max = 15. For 2D-15 and 2D-17 in views (b,c), c_min = 10 and c_max = 20.

The ‘uniform’ data set is composed of 500 draws from auniform distribution in the 2D unit box. This data set contains1 cluster (or no clusters, depending on your viewpoint). Wetested against this dataset to see how the validity indices wouldbehave for data that do not contain clear cluster structure.

Table II contains the results for the synthetic data tests. The last row of the table shows the total number of times that each validity index voted c (most often) as equal to the number of classes. For example, the entry 6 (21) for the Rand index applied to partitions of the 2 curves data, which has the preferred value c = 2, means that this index chose c = 6 most often, viz., 21 times in 31 trials. As another example, 5 (31) for the Rand index, data set 2D-5, means that the Rand index chose the 5-partition of X in all 31 trials—a perfect score. And, for example, the 4 in the total row in the Rand column means that the Rand index chose the preferred number of clusters more times in 31 trials than for any other value of c in 4 of the 10 data sets.

The 2D-5, 2D-17, and 10D-10 data have separated, somewhat-spherical clusters. For these data sets, most of the validity indices chose c as the number of classes. The exceptions to this were the FS, PCAES, CCVp, and CCVs indices (CCVp did choose 10 clusters for 10D-10). In the synthetic data sets overall, the FS, PCAES, CCVp, and CCVs indices had the weakest performance in terms of choosing c to match the number of classes, with PCAES failing for all data sets and FS, CCVp, and CCVs matching on 1 test. Interestingly, CCVs was the only index to indicate 3 clusters for the 3 lines data set. This is a good example that there is no perfect index; every index will fail (or sometimes succeed) for some data set.

For the 4 rings data, all the indices "fail," suggesting 2, 7, or 10 clusters (albeit with very little variance). The failure in this case lies with kFCM, which fails to partition this data set into the apparent 4 clusters (even with several other kernel choices). Interestingly, when we used hard kernel c-means on the 4 rings data, the apparent c = 4 partition was found and was the clear choice among the validity indices. We will further investigate this phenomenon in the future.

The best performing index overall for the 10 synthetic data sets was PC. However, this index has the serious drawback that V_PC asymptotically increases to 1 as c → n. We alleviated this symptom by limiting the maximum number of clusters c_max; however, in practice, you may not have a good idea of how large (or small) c_max should be. Furthermore, most real data sets do not contain well-separated, compact clusters, and we now turn to testing on some real data.

B. Real data

The 8 data sets used here are available from the UCI Machine Learning Repository [25]. Table I shows the name of the data sets, the number of objects n, the feature dimensions d, and the number of physically labeled classes (note that the number of labeled classes may or may not correspond to the number of clusters defined by any algorithm in any data set). Table III contains the results of testing the validity indices on the real data. The last row of this table shows the number of times that each index voted the number of classes as the preferred number of clusters.

In contrast to the synthetic data, for the real data the CCVs index was the most successful at matching the number of classes with its choice of c. We hesitate to say that this makes it the "best" index, as the cluster structure of these data could be completely different than the class structure. But this does show the utility of using different indices, as each has its own strengths and weaknesses. Again, the PC, MPC, and PE indices are also somewhat successful at matching the number of classes with their choice of c. The Rand index is also moderately successful, but it uses the class labels in its determination; hence, its choice of c will always be biased towards the number of classes.

Like the synthetic data tests, the FS and PCAES indices fail to match c to the number of classes in most tests—they are successful in 1 test each. However, the FS index is the only index to successfully choose 6 clusters for the glass data set—no other index was even close in this regard.


Fig. 1: Synthetic data sets used to test cluster validity indices. (a) 2D-5, n = 5,000, c = 5; (b) 2D-15, n = 7,500, c = 15; (c) 2D-17s, n = 1,700, c = 17; (d) 10D-10, n = 1,000, c = 10; (e) 2D-3, n = 1,000, c = 3; (f) 3 lines, n = 600, c = 3; (g) 2 curves, n = 1,000, c = 2; (h) 4 rings, n = 1,000, c = 4. Panels (a)-(c) and (e)-(h) plot Feature 1 vs. Feature 2; panel (d) plots the first two PCA coefficients of the 10-dimensional data.

TABLE II: Preferred Number of Clusters For Several Validity Indices On Synthetic Data Sets

Data Set  classes  V_Rand   V_XB     V_K      V_FS     V_TS     V_PCAES  V_PC     V_PE     V_KI     V_MPC    V_CCVp   V_CCVs
2D-5      5        5 (31)   5 (31)   5 (31)   6 (10)   5 (31)   10 (9)   5 (31)   5 (31)   5 (31)   5 (31)   4 (31)   4 (31)
2D-15     15       16 (12)  14 (13)  14 (13)  17 (8)   14 (13)  18 (8)   16 (12)  16 (12)  16 (12)  16 (12)  10 (31)  19 (9)
2D-17s    17       17 (11)  17 (11)  17 (11)  19 (11)  17 (11)  19 (9)   17 (11)  17 (11)  17 (11)  17 (11)  10 (19)  14 (6)
10D-10    10       10 (30)  10 (30)  10 (30)  10 (30)  10 (30)  11 (16)  10 (30)  10 (30)  10 (30)  10 (30)  10 (30)  6 (9)
2D-3      3        3 (31)   3 (31)   3 (31)   6 (9)    3 (31)   2 (15)   3 (31)   3 (31)   3 (31)   3 (31)   4 (23)   6 (9)
3 lines   3        5 (16)   6 (13)   6 (13)   10 (23)  6 (13)   10 (10)  6 (11)   2 (31)   10 (17)  10 (12)  4 (16)   3 (15)
2 curves  2        6 (21)   8 (11)   10 (11)  10 (24)  8 (11)   8 (17)   8 (14)   2 (31)   10 (23)  10 (23)  4 (31)   7 (7)
4 rings   4        10 (31)  7 (29)   7 (29)   10 (31)  7 (29)   7 (26)   2 (31)   2 (31)   7 (29)   7 (21)   7 (29)   10 (28)
uniform   ?        –        9 (31)   9 (31)   10 (31)  9 (31)   4 (31)   2 (31)   2 (31)   4 (31)   4 (31)   4 (31)   9 (7)
Total              4        4        4        1        4        0        4        5        4        4        1        1

Bold indicates the number of clusters chosen by an index equals the preferred number of classes more times than any other choice of c. Numbers in parentheses indicate the number of trials out of 31 that an index indicated the preferred number of clusters.

V. CONCLUSIONS

We showed how to adapt several popular cluster validity indices for use with kernel clustering. The four propositions given in this paper allow many well-known cluster validity indices to be directly formulated for use in kernel fuzzy clustering. We demonstrated how to use these propositions to reformulate five popular indices. Furthermore, we showed how validity indices that take dissimilarity data D as input, such as CCV, can be adapted. We showed the application of these indices in choosing the best fuzzy c-partition in several synthetic and real data sets with known cluster or class structure. We used kernel FCM as the clustering algorithm. Not surprisingly, there was no best cluster validity index; some performed better than others for these data sets, but there were also data sets that stymied the "best" indices and were "solved" by an overall less effective index. For this reason, we stress that in practice one should use many indices with the hope of having them come to a consensus.

In the future, we are going to look at three unanswered questions. i) How can cluster validity indices help choose the best kernel? This question can take two forms: choosing between types of kernels, and setting kernel parameters, e.g., RBF width or polynomial degree. ii) How can other cluster validity indices be adapted to kernel fuzzy clustering, such as those that use quantities based on sample-based covariance estimates? iii) Are there other, perhaps more specialized, cluster validity indices that could be useful for kernel clustering? In [26], the authors examine validity for shell-shaped clusters, such as those found by fuzzy c-shells. The main reason for using kernels is their ability to define strangely-shaped, non-linear boundaries between clusters or classes. Hence, we believe that indices like those proposed in [26] could be very useful for kernel clustering.

ACKNOWLEDGEMENTS

Havens is supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CI Fellows Project. This material is based upon work supported by the Australian Research Council.

TABLE III: Preferred Number of Clusters For Several Validity Indices On Real Data Sets

Data Set     classes  V_Rand   V_XB    V_K     V_FS     V_TS    V_PCAES  V_PC    V_PE    V_KI     V_MPC   V_CCVp  V_CCVs
iris         3        3 (25)   2 (31)  2 (31)  4 (7)    2 (31)  2 (30)   2 (31)  2 (31)  2 (31)   2 (31)  2 (31)  3 (20)
glass        6        10 (22)  3 (30)  3 (30)  6 (15)   4 (30)  3 (28)   2 (31)  2 (31)  2 (31)   3 (30)  3 (30)  3 (30)
dermatology  6        10 (31)  2 (26)  2 (23)  10 (31)  2 (31)  10 (7)   2 (31)  2 (31)  2 (31)   2 (8)   5 (6)   6 (6)
ionosphere   2        2 (31)   10 (7)  7 (6)   10 (31)  2 (31)  2 (9)    2 (31)  2 (31)  2 (31)   2 (12)  2 (10)  2 (6)
ecoli        8        4 (15)   3 (31)  3 (31)  10 (20)  3 (31)  3 (22)   2 (31)  2 (31)  2 (31)   3 (31)  3 (16)  2 (31)
sonar        2        2 (31)   2 (24)  2 (22)  10 (31)  2 (30)  4 (7)    2 (31)  2 (31)  2 (31)   2 (11)  5 (9)   9 (5)
wdbc         2        3 (31)   2 (31)  2 (31)  9 (30)   8 (30)  9 (31)   2 (31)  2 (31)  10 (30)  2 (31)  2 (31)  2 (29)
wine         3        5 (28)   2 (31)  2 (31)  7 (8)    8 (12)  2 (26)   2 (31)  2 (31)  9 (16)   2 (31)  2 (31)  2 (14)
Total                 3        2       2       1        2       1        3       3       2        3       2       4

Bold indicates the number of clusters chosen by an index equals the preferred number of classes more times than any other choice of c. Numbers in parentheses indicate the number of trials out of 31 that an index indicated the preferred number of clusters.

REFERENCES

[1] H. Frigui, Advances in Fuzzy Clustering and Feature Discrimination with Applications. John Wiley and Sons, 2007, ch. Simultaneous Clustering and Feature Discrimination with Applications, pp. 285–312.

[2] S. Khan, G. Situ, K. Decker, and C. Schmidt, "GoFigure: Automated Gene Ontology annotation," Bioinf., vol. 19, no. 18, pp. 2484–2485, 2003.
[3] S. Gunnemann, H. Kremer, D. Lenhard, and T. Seidl, "Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations," in Proc. Int. Conf. Extending Database Technology, Uppsala, Sweden, 2011, pp. 237–248.
[4] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[5] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Blackwell, 2005.
[6] R. Xu and D. Wunsch II, Clustering. Piscataway, NJ: IEEE Press, 2009.
[7] D. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Englewood Cliffs, NJ: Prentice Hall, 2007.
[8] J. A. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[9] J. C. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[10] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Trans. Fuzzy Systems, vol. 1, no. 2, May 1993.
[11] J. C. Bezdek and R. J. Hathaway, "Convergence of alternating optimization," Neural, Parallel, and Scientific Computations, vol. 11, no. 4, pp. 351–368, Dec. 2003.
[12] Z. Wu, W. Xie, and J. Yu, "Fuzzy c-means clustering algorithm based on kernel method," in Proc. Int. Conf. Computational Intelligence and Multimedia Applications, September 2003, pp. 49–54.
[13] R. J. Hathaway, J. M. Huband, and J. C. Bezdek, "A kernelized non-Euclidean relational fuzzy c-means algorithm," in Proc. IEEE Int. Conf. Fuzzy Systems, 2005, pp. 414–419.
[14] R. Dubes and A. Jain, "Clustering techniques: the user's dilemma," Pattern Recognition, vol. 8, pp. 247–260, 1977.
[15] J. C. Bezdek, M. Windham, and R. Ehrlich, "Statistical parameters of fuzzy cluster validity functionals," Int. J. Computing and Information Sciences, vol. 9, no. 4, pp. 232–336, 1980.
[16] M. Windham, "Cluster validity for the fuzzy c-means clustering algorithm," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 4, no. 4, pp. 357–363, 1982.
[17] N. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Trans. Fuzzy Systems, vol. 3, no. 3, pp. 370–376, 1995.
[18] W. Wang and Y. Zhang, "On fuzzy cluster validity indices," Fuzzy Sets and Systems, vol. 158, pp. 2095–2117, 2007.
[19] M. Halkidi, Y. Batisakis, and M. Vazirgiannis, "Cluster validity checking methods: part II," ACM SIGMOD Record, vol. 31, no. 3, pp. 19–27, 2002.
[20] D. Anderson, J. C. Bezdek, M. Popescu, and J. M. Keller, "Comparing fuzzy, probabilistic, and possibilistic partitions," IEEE Trans. Fuzzy Systems, vol. 18, no. 5, pp. 906–917, October 2010.
[21] M. Popescu, J. M. Keller, J. C. Bezdek, and T. C. Havens, "Correlation cluster validity," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics, October 2011, pp. 2531–2536.
[22] R. J. Hathaway and J. C. Bezdek, "Visual cluster validity for prototype generator clustering models," Pattern Recognition Letters, vol. 24, pp. 1563–1569, 2003.
[23] J. M. Huband and J. C. Bezdek, Computational Intelligence: Research Frontiers. Berlin / Heidelberg / New York: Springer, June 2008, ch. VCV2 - Visual Cluster Validity, pp. 293–308.
[24] F. Queiroz, A. Braga, and W. Pedrycz, "Sorted kernel matrices as cluster validity indexes," in Proc. IFSA/EUSFLAT Conf., 2009, pp. 1490–1495.
[25] A. Asuncion and D. J. Newman, "UCI machine learning repository," http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
[26] R. Krishnapuram, O. Nasraoui, and H. Frigui, "The subsurface density criterion and its applications to linear/circular boundary detection and planar/spherical surface approximation," in Proc. IEEE Int. Conf. Fuzzy Systems, vol. 2, 1993, pp. 725–730.