[ieee 2010 third international workshop on advanced computational intelligence (iwaci) - suzhou,...

Abstract—Random initialization of the fuzzy k-means algorithm slows the convergence of clustering and leads to a large amount of calculation. We therefore propose an improved fuzzy k-means clustering algorithm based on the k-center algorithm and a binary tree. First, we remove redundant attributes, because too many irrelevant attributes reduce the efficiency of clustering. Second, we eliminate the differences between the units of the dimensions and then use k-center clustering to initialize the k cluster means: the first mean is chosen randomly and the others are then selected according to distance. The k means are organized into a binary tree so that the mean closest to a data point can be found easily. Finally, the proposed algorithm is applied to the Iris dataset, the Pima-Indians-Diabetes dataset and the Segmentation dataset; the results show that it has higher efficiency and greater precision and reduces the amount of calculation.

I. INTRODUCTION

CLUSTERING groups data points into several clusters so that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized [1], [2], [3]. Clustering plays an important role in data mining and is widely applied in pattern recognition, computer vision, and fuzzy control. Various types of clustering methods have been proposed and developed in [4]. Clustering algorithms are mainly divided into five groups: hierarchical clustering, partitioning clustering, density-based methods, grid-based methods and model-based methods. The first two are used most often. Hierarchical algorithms can be further divided into bottom-up and top-down algorithms [1]. Traditional hierarchical clustering algorithms, such as BIRCH [2] and CURE [3], are not suitable for large datasets because of their heavy computation. CLIQUE [4], ENCLUS, and MAFIA [5] are bottom-up algorithms; PROCLUS [6] and ORCLUS [7] are top-down algorithms. Traditional partitioning clustering algorithms include k-means, k-modes, and so on. K-means is the most classical algorithm and is widely used in practice.

Existing k-means algorithms and their extensions have their advantages [9], [10], [11], [12], [13], [14], [15]. However, all of these algorithms choose the initial means randomly at the beginning of clustering, which conflicts with the fact that the k-means algorithm is very sensitive to its initial conditions. The method in [17] chooses the initial means in a particular way, but it needs a large amount of computation and is not suitable for large datasets.

Taoying Li is a PhD student of Transportation Management College, Dalian Maritime University, Dalian 116026, Liaoning, P.R.China (phone: 86-138-4082-0896; fax: 86-411-8472-5286; e-mail: ytaoli@126.com).

Yan Chen is a professor of Transportation Management College, Dalian Maritime University, Dalian 116026, Liaoning, P.R.China (e-mail: [email protected]).

Xiangwei Mu is a PhD student of Transportation Management College, Dalian Maritime University, Dalian 116026, Liaoning, P.R.China (e-mail: [email protected]).

Ming Yang is a PhD student of Transportation Management College, Dalian Maritime University, Dalian 116026, Liaoning, P.R.China (e-mail: [email protected]).

In this paper, we propose an improved fuzzy k-means clustering algorithm based on the k-center algorithm and a binary tree. The algorithm starts by removing redundant attributes and eliminating the differences between the units of the dimensions. We then use the k-center algorithm to initialize the means of the fuzzy k-means algorithm and build a binary tree from the k cluster means to reduce the amount of calculation: we no longer need to compute the distance from a data point to every cluster mean, but only two distances at each level of the tree.

This paper is organized as follows. Section II gives a brief review of the fuzzy k-means clustering algorithm. Section III presents the improved fuzzy k-means algorithm and introduces the k-center algorithm and the binary tree. In Section IV, we apply the proposed algorithm to the Iris dataset, the Pima-Indians-Diabetes dataset and the Segmentation dataset to validate it. Section V gives the conclusions drawn from Section IV.

An Improved Fuzzy K-Means Clustering with K-Center Initialization
Taoying Li, Yan Chen, Xiangwei Mu and Ming Yang

Third International Workshop on Advanced Computational Intelligence, August 25-27, 2010 - Suzhou, Jiangsu, China

978-1-4244-6337-4/10/$26.00 ©2010 IEEE

II. RELATED WORK

The k-means algorithms, like other partitioning clustering algorithms, group n data points into k clusters by minimizing a pre-designed cost function. The traditional cost function [8] has the form (1):

$$CF = \sum_{j=1}^{n} \sum_{i=1}^{m} (x_{ji} - c_{li})^2 \tag{1}$$

where $x_{ji}$ is the value of the ith dimension of the jth object, $c_l$ is the center nearest to the jth object, and $c_{li}$ is the value of the ith dimension of the lth cluster center. Because the dimensions contribute differently to the clustering and each object has a different preference for belonging to a cluster, weighted extensions of the traditional cost function are often used. The methods in [9], [10], [11], [12], [13] are all extensions of (1). H. Frigui and O. Nasraoui [10] and Y. Chan and W. Ching [11] introduce a degree of membership of each object in every cluster and a weight for the contribution of each dimension of a cluster to the clustering. However, their algorithms are not computable if one of the weights happens to be zero. Domeniconi [12], [13] introduces a cost function containing a maximum function, whose minimization proved difficult. Liping Jing, Michael K. Ng, and Joshua Zhexue Huang [1] introduce a cost function that avoids the problems above; they use the entropy of the dimension weights to represent the certainty of dimensions in the identification of a cluster. But the goal of clustering is to satisfy two objectives: first, to make the distances between objects within a cluster as small as possible, and second, to make the distances between objects in different clusters as large as possible [14]. The cost function in [1] does not satisfy the second objective. In [15], we proposed a new method that adds a variable to adjust the cost function so that it satisfies both goals of clustering. However, the efficiency of all the algorithms mentioned above is low, because different initial k means influence the speed of convergence and the number of iterations. Proper initial k means let the cost function reach its minimum quickly, but it is difficult to know before training which ones are good.
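As a minimal illustration of the traditional cost function (1), the following Python sketch sums, for every object, the squared distance to its nearest center. The function name `traditional_cost` and the toy data are assumptions made for this example and do not come from the paper.

```python
def traditional_cost(points, centers):
    """Cost (1): for each object, squared distance to its nearest center, summed."""
    total = 0.0
    for x in points:
        # squared Euclidean distance from x to every center; keep the smallest
        total += min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)
    return total


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good_centers = [(0.0, 0.5), (4.0, 0.5)]   # one center per pair
poor_centers = [(0.0, 0.0), (0.0, 1.0)]   # both centers in the same pair

cost_good = traditional_cost(points, good_centers)
cost_poor = traditional_cost(points, poor_centers)
```

On this toy data the well-placed centers give a much smaller cost than the poorly placed ones, which is the sensitivity to the choice of means discussed above.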

III. FUZZY K-MEANS CLUSTERING BASED ON K-CENTER ALGORITHM AND BINARY TREE

In this section, we use the k-center algorithm to initialize the improved k-means algorithm in [15] and, at the same time, use a binary tree to reduce the calculation for each data point.

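A. Fuzzy K-Means Algorithm with K-Center Initialization

The cost function we use is given as follows:

$$F(T,W,C) = \sum_{l=1}^{k} \left[ \frac{\sum_{j=1}^{n} \sum_{i=1}^{m} \tau_{lj}\,\omega_{li}\,(c_{li} - x_{ji})^2}{\sum_{i=1}^{m} (c_{li} - \bar{x}_i)^2} \right] + \gamma \sum_{l=1}^{k} \sum_{i=1}^{m} \left[ \omega_{li} \log \omega_{li} \right] \tag{2}$$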

Here, $\sum_{l=1}^{k} \tau_{lj} = 1$, $\sum_{i=1}^{m} \omega_{li} = 1$, $\bar{x}$ is the mean of all objects, and $\bar{x}_i$ is the value of the ith dimension of $\bar{x}$, which equals $\frac{1}{n} \sum_{j=1}^{n} x_{ji}$. If $n > 1$, the new algorithm is available; otherwise, $\sum_{i=1}^{m} (c_{li} - \bar{x}_i)^2$ is zero and the cost function F(T,W,C) is not computable. The denominator is a variable and is linear in the sum of squared distances from the mean of all objects to the means of all clusters.
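To make the structure of (2) concrete, here is a small Python sketch that evaluates F(T,W,C) for given memberships, weights and centers. The function name `weighted_cost`, the list-of-lists layout `tau[l][j]` and `omega[l][i]`, and the toy data are assumptions made for illustration only.

```python
import math


def weighted_cost(points, centers, tau, omega, gamma):
    """Evaluate F(T, W, C) of (2) for memberships tau[l][j] and weights omega[l][i]."""
    n, m = len(points), len(points[0])
    mean = [sum(x[i] for x in points) / n for i in range(m)]  # x-bar, mean of all objects
    total = 0.0
    for l, c in enumerate(centers):
        # denominator of (2): squared distance from the cluster center to x-bar
        spread = sum((c[i] - mean[i]) ** 2 for i in range(m))
        if spread == 0.0:
            raise ValueError("center equals the overall mean; F is not computable")
        # numerator of (2): weighted within-cluster scatter
        within = sum(tau[l][j] * omega[l][i] * (c[i] - points[j][i]) ** 2
                     for j in range(n) for i in range(m))
        # entropy term of (2); zero weights are skipped because log(0) is undefined
        entropy = sum(w * math.log(w) for w in omega[l] if w > 0.0)
        total += within / spread + gamma * entropy
    return total


# Hypothetical toy data: two clusters of two points each, uniform weights.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centers = [(0.0, 1.0), (4.0, 1.0)]
tau = [[1, 1, 0, 0], [0, 0, 1, 1]]
omega = [[0.5, 0.5], [0.5, 0.5]]
F = weighted_cost(points, centers, tau, omega, gamma=0.5)
```

The explicit check on `spread` mirrors the remark above that the cost function is not computable when the denominator is zero.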

Next, we present the procedure of the improved entropy weighting k-means algorithm for solving the minimization problem.

Minimizing F in (2) under these constraints is a constrained nonlinear optimization problem whose solution is unknown. The usual way to optimize F(T,W,C) in (2) is partial optimization over T, W and C, following [10], [1]. In this method, we first fix T and C and search for the W that minimizes F(T,W,C). We then fix T and W and search for the C that minimizes F(T,W,C). Finally, we fix C and W and search for the T that minimizes F(T,W,C).

The process is repeated until the value of the objective function cannot be reduced.

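Let T and C be fixed; F is minimized if the weights satisfy

$$\omega_{lt} = \frac{\exp(-\psi_{lt}/\gamma)}{\sum_{i=1}^{m} \exp(-\psi_{li}/\gamma)} \tag{3}$$

where

$$\psi_{lt} = \frac{\sum_{j=1}^{n} \tau_{lj}\,(c_{lt} - x_{jt})^2}{\sum_{i=1}^{m} (c_{li} - \bar{x}_i)^2} \tag{4}$$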

This result was proved in [15] and is simply adopted here.
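The weight update can be sketched in Python as follows. The sketch assumes the weight has the normalized exponential form $\omega_{lt} \propto \exp(-\psi_{lt}/\gamma)$, which is what minimizing the entropy-regularized cost under $\sum_i \omega_{li} = 1$ yields; the function name `update_weights` and the toy data are hypothetical.

```python
import math


def update_weights(points, centers, tau, gamma):
    """Dimension weights per cluster from (3), with psi computed as in (4)."""
    n, m = len(points), len(points[0])
    mean = [sum(x[i] for x in points) / n for i in range(m)]  # x-bar
    omega = []
    for l, c in enumerate(centers):
        # denominator of (4); assumed non-zero (center differs from the overall mean)
        spread = sum((c[i] - mean[i]) ** 2 for i in range(m))
        psi = [sum(tau[l][j] * (c[t] - points[j][t]) ** 2 for j in range(n)) / spread
               for t in range(m)]
        # subtracting min(psi) avoids underflow and leaves the ratio in (3) unchanged
        base = min(psi)
        expo = [math.exp(-(p - base) / gamma) for p in psi]
        z = sum(expo)
        omega.append([e / z for e in expo])
    return omega


# Hypothetical toy data: both clusters vary only along the second dimension.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
centers = [(0.0, 1.0), (4.0, 1.0)]
tau = [[1, 1, 0, 0], [0, 0, 1, 1]]
omega = update_weights(points, centers, tau, gamma=0.5)
```

In this example the first dimension has no within-cluster scatter, so it receives the larger weight, while a larger γ would pull the weights toward the uniform value 1/m.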

Similarly to the derivation of $\omega_{lt}$, let W and C be fixed. The jth object belongs to the lth cluster if the distance from the jth object to the mean of the lth cluster is smaller than its distance to the mean of any other cluster. That is,

$$\tau_{lj} = \begin{cases} 1, & \text{if } \sum_{i=1}^{m} \omega_{li}\,(c_{li} - x_{ji})^2 \le \sum_{i=1}^{m} \omega_{zi}\,(c_{zi} - x_{ji})^2 \text{ for } 1 \le z \le k \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

$\tau_{lj} = 1$ means that the jth object is assigned to the lth cluster, and $\tau_{lj} = 0$ means that it is not.

Let T and W be fixed; each center is then obtained as the average of the objects assigned to it. The result is shown in (6).

$$c_{li} = \frac{\sum_{j=1}^{n} \tau_{lj}\,x_{ji}}{\sum_{j=1}^{n} \tau_{lj}} \tag{6}$$

Here, $1 \le l \le k$ and $1 \le i \le m$.

We introduced the process of the improved weighted fuzzy k-means algorithm in [15] as follows:

Step1. Input the parameters m, n, k, γ and the maximum number of iterations s, initialize the weights $\omega_{li} = 1/m$, and choose k objects randomly as the centers C of the k clusters.

Step2. Obtain T according to (5).

Step3. Compute the value of F(T,W,C) according to (2).

Step4. Update C according to (6).

Step5. Update W according to (3).

Step6. Repeat Step2 to Step5 until F(T,W,C) cannot be improved or the number of iterations exceeds s.

To reduce the number of iterations and the calculation of the weighted fuzzy k-means algorithm, we adjust it here by using the k-center algorithm to choose the initial k points. The process of the weighted k-means algorithm then becomes:

Step1. Input the parameters m, n, k, γ and the maximum number of iterations s, and initialize the weights $\omega_{li} = 1/m$.

Step2. Put all data points into the set H and let the set of centers C be empty. Then choose one point randomly from H as the first center and put it in C.

Step3. Choose the data point in H that is not in C and is farthest from the centers already in C as the next center, and put it in C.

Step4. If the number of centers in C equals k, go to Step5; otherwise go to Step3.

Step5. Obtain T according to (5).

Step6. Compute the value of F(T,W,C) according to (2).

Step7. Update C according to (6).

Step8. Update W according to (3).

Step9. Repeat Step5 to Step8 until F(T,W,C) cannot be improved or the number of iterations exceeds s.

The complexity of traditional distance-based clustering algorithms is $O(mn^2)$, which grows quadratically with the number of objects to be partitioned. Traditional clustering methods therefore need a large amount of calculation when there are many objects. The complexity of the proposed algorithm is O(mnk), similar to that of the EWKM algorithm and the improved weighted fuzzy k-means, and it grows linearly with the number of objects. In addition, the k-center algorithm is used to choose the initial k points, which reduces the number of iterations of the k-means algorithm.
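The steps above can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the authors' implementation: "farthest from the centers in C" is read as the point whose nearest chosen center is farthest away (the usual farthest-first rule), the weights are held at 1/m (the weight update and the evaluation of F are omitted), and the loop stops when the assignments no longer change. All function names are hypothetical.

```python
import random


def sq_dist(a, b, w=None):
    """Squared Euclidean distance, optionally weighted per dimension."""
    w = w or [1.0] * len(a)
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b))


def k_center_init(points, k, rng):
    """Farthest-first choice of k initial centers."""
    centers = [rng.choice(points)]                 # first center chosen at random
    while len(centers) < k:                        # stop once k centers are chosen
        # next center: the point whose nearest chosen center is farthest away
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def assign(points, centers, omega):
    """Hard assignment (5): index of the cluster with the smallest weighted distance."""
    return [min(range(len(centers)), key=lambda l: sq_dist(x, centers[l], omega[l]))
            for x in points]


def update_centers(points, labels, centers):
    """Center update (6): mean of the objects assigned to each cluster."""
    new_centers = []
    for l, old in enumerate(centers):
        members = [x for x, lab in zip(points, labels) if lab == l]
        if not members:                            # assumption: keep an empty cluster's center
            new_centers.append(old)
        else:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centers


def cluster(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    m = len(points[0])
    omega = [[1.0 / m] * m for _ in range(k)]      # uniform initial weights 1/m
    centers = k_center_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers, omega)
        if new_labels == labels:                   # assignments stable: stop
            break
        labels = new_labels
        centers = update_centers(points, labels, centers)
    return centers, labels


# Hypothetical toy data: two compact groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = cluster(points, k=2)
```

Whichever point is drawn first, the farthest-first rule places the second initial center in the other group here, so the loop settles after a single center update.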

B. Improved Fuzzy K-Means Algorithm using Binary Tree of Means

Now, we use a binary tree to decide which cluster each data point belongs to.

Suppose we know the k means of the k clusters. We calculate the Euclidean distance between every two means and obtain a distance matrix $D = (d_{lv})_{k \times k}$:

$$d_{lv} = \left[ \sum_{i=1}^{m} (c_{li} - c_{vi})^2 \right]^{1/2} \tag{7}$$

Suppose $d_{ij}$ is the largest distance; we then split the k means into two groups, $m_i$ and $m_j$. The group $m_i$ can be split again in the same way, as shown in Fig. 1. In this way we obtain the binary tree.

The tree is established as follows:

1. Let the number of layers of the tree be l = 1 and the number of groups in layer l be g = 1. Calculate the mean of all k centers and make it the root, at the first level of the tree; all centers are in one group.

2. Let l = l + 1. Divide each existing group into two smaller groups, so that there are at most $2^{l-1}$ groups. Let the means of the centers of the new groups be the roots of layer l, each mean standing for one group of layer l, and let g be the actual number of groups.

3. If every group has only one center, stop; otherwise go to 2.

From the process of building the tree, we know that all k means of the fuzzy k-means algorithm are leaf nodes of the binary tree.
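The construction can be sketched in Python as follows, together with the descent that later uses the tree. Two points are assumptions of this sketch rather than statements from the paper: each group is split by assigning every center to the nearer of the two farthest-apart centers, and the centers are assumed to be distinct. The names `build_tree` and `nearest_leaf` are hypothetical.

```python
import math
from itertools import combinations


def dist(a, b):
    """Euclidean distance between two means, as in (7)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def build_tree(centers):
    """Recursively split a list of distinct centers; each leaf holds one center."""
    if len(centers) == 1:
        return {"rep": centers[0], "children": []}
    # an inner node is represented by the mean of the centers in its group
    rep = tuple(sum(col) / len(centers) for col in zip(*centers))
    # the two centers that are farthest apart seed the two groups
    a, b = max(combinations(centers, 2), key=lambda pair: dist(*pair))
    left = [c for c in centers if dist(c, a) <= dist(c, b)]
    right = [c for c in centers if dist(c, a) > dist(c, b)]
    return {"rep": rep, "children": [build_tree(left), build_tree(right)]}


def nearest_leaf(tree, x):
    """Descend from the root, comparing x with the two child representatives per level."""
    node, n_dist = tree, 0
    while node["children"]:
        left, right = node["children"]
        n_dist += 2                                # two distances at each level
        node = left if dist(x, left["rep"]) <= dist(x, right["rep"]) else right
    return node["rep"], n_dist


# Hypothetical example: eight means with a clear nested structure.
centers = [(v, 0.0) for v in (0.0, 1.0, 10.0, 11.0, 100.0, 101.0, 110.0, 111.0)]
tree = build_tree(centers)
leaf, n_dist = nearest_leaf(tree, (10.2, 0.0))
```

For these eight means the descent needs six distance computations instead of eight. The descent is greedy, so when groups overlap it may return a mean that is not the globally nearest one; the nested example was chosen so that both searches agree.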

The process of the improved fuzzy k-means algorithm based on k-center and binary tree is then as follows:

Step1. Remove redundant attributes and attributes that have little influence on the clustering, which reduces the amount of calculation without decreasing accuracy. For example, the Segmentation dataset has 19 attributes, but we use only 9 of them for simplicity.

Step2. Eliminate the differences between the units of the dimensions according to (8), which makes all data points dimensionless with values in the range [0,1].

$$x_{ji} = \frac{x_{ji}^{\text{original}} - \min_t x_{ti}}{\max_t x_{ti} - \min_t x_{ti}} \tag{8}$$

Here, $1 \le i \le m$.

Step3. Partition the initial data points into k clusters using the improved weighted fuzzy k-means algorithm described above.

Step4. Arrange the k means into a binary tree using two-centroid clustering, as shown in Fig. 1.

Step5. Calculate the Euclidean distance from the data point to the roots of the tree and its subtrees as follows:

Suppose x* is closer to m_i than to m_j. Then we only need to compute two distances, between x* and m_i1 and between x* and m_i2. If the distance between x* and m_i1 is the smaller, we then calculate the two distances between x* and m_i11 and between x* and m_i12, and so on. If the distance between x* and m_i11 is the smaller and m_i11 has no child nodes, then x* is closest to m_i11.

Step6. Take the cluster mean that is closest to the data point according to Step5 as the winning cluster, represented by its center c_win.

Step7. Once all data points have been assigned to clusters, update C according to (6).

Step8. If C equals that of the previous iteration or the error rate is less than a pre-designed value, stop. Otherwise go to Step4.
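The rescaling in Step2 is ordinary min-max normalization and can be sketched in Python as below. The function name `min_max_normalize`, the sample values, and the handling of a constant attribute (mapped to 0, since (8) is undefined when the maximum equals the minimum) are assumptions of this sketch.

```python
def min_max_normalize(points):
    """Rescale every attribute to [0, 1] as in (8)."""
    columns = list(zip(*points))
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]
    return [tuple((v - l) / (h - l) if h > l else 0.0   # constant attribute -> 0 (assumption)
                  for v, l, h in zip(x, lo, hi))
            for x in points]


# Hypothetical data: two attributes measured on very different scales.
raw = [(5.1, 200.0), (4.9, 300.0), (6.3, 250.0)]
scaled = min_max_normalize(raw)
```

After rescaling, both attributes contribute on the same scale to the distances used in the later steps.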

From the process of the improved fuzzy k-means algorithm based on k-center and binary tree, the complexity with two-centroid clustering is O(mn log k), whereas the complexity of the traditional algorithm is O(mnk). Therefore, when the number of clusters and the number of data points are very large, the proposed approach decreases the amount of calculation.
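As a rough worked comparison of the two complexities, assume the binary tree is balanced (the farthest-pair split does not guarantee this), so that it has about $\lceil \log_2 k \rceil$ levels below the root and two distances are computed per level. The number of distance evaluations per data point is then:

```latex
\begin{aligned}
\text{tree search: } & 2\lceil \log_2 k \rceil, \qquad \text{exhaustive search: } k \\
k = 7:\quad & 2 \cdot 3 = 6 \ \text{versus} \ 7 \\
k = 64:\quad & 2 \cdot 6 = 12 \ \text{versus} \ 64 \\
k = 1024:\quad & 2 \cdot 10 = 20 \ \text{versus} \ 1024
\end{aligned}
```

Under this assumption the saving is small for a handful of clusters and grows with k, in line with the statement above that the gain appears when the number of clusters is very large.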

IV. EXPERIMENT

In this section, we use the Iris dataset, the Pima-Indians-Diabetes dataset and the Segmentation dataset from the UCI website to validate the improved fuzzy k-means algorithm based on k-center and binary tree. The datasets are detailed in Table I.

TABLE I
REAL DATASETS

| Dataset | Number of attributes | Number of clusters | Number of data points |
|---|---|---|---|
| Iris | 4 | 3 | 150 |
| Pima-Indians-Diabetes | 8 | 2 | 768 |
| Segmentation | 19 | 7 | 2310 |

Fig. 1. Two centroids clustering.


A. Iris Dataset

In this section, we use the Iris dataset to validate the proposed algorithm.

We start the algorithm by initializing the parameters: m = 4, n = 75, k = 2, γ = 0.5, ω_i = 0.25. We first use (8) to eliminate the differences between dimensions. In the result, six data points are partitioned into wrong clusters, an error rate of 0.04, which is lower than that in [17]. At the same time, we do not need to compute the distance from each data point to all of the means during the iterations.

In fact, the third and fourth dimensions are critical for the Iris dataset. To show the results clearly, we plot them on the third and fourth dimensions. The results of the fuzzy k-means algorithm based on the k-center algorithm and binary tree are given in Fig. 2, where the error points are those partitioned into wrong clusters.

TABLE II
RESULTS FOR IRIS DATASET

| Algorithm | Number of data points in expected clusters | Accuracy rate |
|---|---|---|
| Traditional k-means | 132 | 88% |
| EWKM | 142 | 94.67% |
| IWEKM | 143 | 95.33% |
| Proposed algorithm | 144 | 96% |

From the results of classifying the Iris dataset in Table II, the accuracy of the traditional k-means algorithm is 88%, that of the EWKM algorithm is 94.67%, that of the IWEKM in [15] is 95.33%, and that of the improved fuzzy k-means algorithm proposed in this paper is 96%. Moreover, the proposed algorithm reduces the amount of calculation in each iteration, so it is effective and practical for classifying the Iris dataset.
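The accuracy rates in Table II follow directly from the counts of correctly placed points out of the 150 Iris data points listed in Table I. A short Python check (the variable names are illustrative):

```python
n_points = 150  # number of Iris data points, from Table I

# number of data points placed in the expected clusters, from Table II
correct = {
    "Traditional k-means": 132,
    "EWKM": 142,
    "IWEKM": 143,
    "Proposed algorithm": 144,
}

# accuracy in percent, rounded to two decimals as in the table
accuracy = {name: round(100 * c / n_points, 2) for name, c in correct.items()}
# number of misplaced points per algorithm
errors = {name: n_points - c for name, c in correct.items()}
```

For the proposed algorithm this gives 6 misplaced points, that is, an error rate of 6/150 = 0.04.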

B. Pima-Indians-Diabetes Dataset

In this section, we use the Pima-Indians-Diabetes dataset to validate the improved fuzzy k-means algorithm based on k-center and binary tree.

We begin by removing the differences between dimensions and then initialize the parameters: m = 8, n = 768, k = 2, γ = 0.5, ω_i = 0.125. The accuracy of the improved fuzzy k-means algorithm proposed in this paper is then 75.65%.

In fact, no dimension of the Pima-Indians-Diabetes dataset has a significant influence on the clustering, so the result is difficult to show clearly, as can be seen in Fig. 3.

C. Segmentation Dataset

In this section, we use the Segmentation dataset and the Segmentation Test dataset together to validate the fuzzy k-means algorithm based on k-center and binary tree.

From the attribute values of the data points, we know that some attributes are redundant because they do not influence the clustering, for example the ninth attribute, or because their influence on the clustering is very small, such as the fourth and fifth attributes. In practice, when there are so many attributes that the computation becomes intensive, such redundant attributes can be removed through experience or with special strategies and algorithms. In addition, we can remove the tenth, fourteenth, fifteenth, sixteenth, seventeenth, eighteenth and nineteenth attributes, which can be obtained from the eleventh, twelfth and thirteenth attributes.
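The attribute selection described above amounts to a simple column filter. The Python sketch below assumes that the attributes named in the paragraph (the fourth, fifth, ninth, tenth and fourteenth to nineteenth) are exactly the ones dropped; the names `REMOVED`, `KEPT` and `select_attributes` and the placeholder row are hypothetical.

```python
# 1-based numbers of the attributes named as redundant in the text
REMOVED = {4, 5, 9, 10, 14, 15, 16, 17, 18, 19}
KEPT = [i for i in range(1, 20) if i not in REMOVED]


def select_attributes(row):
    """Keep only the retained attributes of one 19-value data point."""
    return [row[i - 1] for i in KEPT]


# placeholder values 101..119 standing in for the 19 attributes of one data point
row = list(range(101, 120))
reduced = select_attributes(row)
```

Under this reading, attributes 1, 2, 3, 6, 7, 8, 11, 12 and 13 remain, which gives nine attributes.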

Thus, only nine attributes are left. We start the algorithm by removing the differences between these nine dimensions and initializing the parameters: m = 9, n = 210, k = 2, γ = 0.5, ω_i = 1/9. We use all 2310 data points of the Segmentation dataset and the Segmentation Test dataset. The accuracy of the fuzzy k-means algorithm based on k-center and binary tree is then 65.86%; it is shown in Table III together with the results for the Segmentation dataset in [17].

TABLE III
RESULTS FOR SEGMENTATION DATASET

| Algorithm | Number of attributes used in training | Accuracy rate |
|---|---|---|
| EKM | 18 | 66% |
| MaxMin | 18 | 40.7% |
| KA | 18 | 63.1% |
| Ward | 18 | 59.8% |
| Proposed algorithm | 9 | 65.86% |

Fig. 4 shows the results of the fuzzy k-means algorithm based on k-center and binary tree for partitioning the Segmentation dataset on the first and second attributes.

Fig. 2. The results of improved fuzzy k-means algorithm based on k-center and binary tree for partitioning Iris dataset

Fig. 3. The results of improved fuzzy k-means algorithm based on k-center and binary tree for partitioning Pima-Indians-Diabetes dataset with the fourth and seventh attributes


In Fig. 4, the red round points are the data points that are partitioned into wrong clusters by the fuzzy k-means clustering based on the k-center algorithm and binary tree.

Compared with the results in [17], the proposed algorithm achieves accuracy higher than MaxMin, KA and Ward and close to that of EKM (Table III), while reducing the amount of calculation in training k-means on the data points and hence the iteration time over all data points.

V. CONCLUSIONS

The initial means of the traditional fuzzy k-means algorithm are chosen randomly from the n data points, which requires a large amount of calculation, and different initial means may lead to different results because of local minima of the cost function. We therefore propose an improved fuzzy k-means algorithm that uses the k-center algorithm to produce the initial k means and makes the between-cluster distance as large as possible and the within-cluster distance as small as possible. At the same time, we build a binary tree over the k cluster means to reduce the calculation of distances between data points and means.

Finally, we use the Iris dataset, the Pima-Indians-Diabetes dataset and the Segmentation dataset to validate the proposed algorithm. For the Iris dataset, the results show that the proposed algorithm achieves higher precision when proper values are chosen for the attribute weights, because different attributes influence the clustering differently, and that it is effective compared with the results in [16]. For the Pima-Indians-Diabetes and Segmentation datasets, the results are acceptable compared with those in [17], but changing the weight values hardly affects the clustering results. For all datasets, the proposed algorithm reduces the amount of calculation and the iteration time. Our future work is to adjust the algorithm for application in practice.

REFERENCES

[1] L. P. Jing, M. K. Ng, and J. Z. Huang, “An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data,” IEEE Trans. Knowledge and Data Engineering, vol. 19, no. 8, pp. 1026-1041, 2007.

[2] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large databases,” in Proc. of the ACM SIGMOD Intl. Conference on Management of Data, H. V. Jagadish, Inderpal Singh Mumick, Ed. Canada: ACM Press, pp. 103-114, 1996.

[3] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm for clustering large databases,” in ACM SIGMOD Int’l Conf. Management of Data, Laura M. Haas, Ashutosh Tiwary, Ed. Seattle: ACM Press, pp. 73-84, 1998.

[4] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications,” in ACM SIGMOD Int’l Conf. Management of Data, Laura M. Haas, Ashutosh Tiwary, Ed. Seattle: ACM Press, pp. 94-105, 1998.

[5] C. H. Cheng, A. W. Fu, and Y. Zhang, “Entropy-Based Subspace Clustering for Mining Numerical Data,” in Fifth ACM SIGKDD Int’l Conf. Knowledge and Data Mining, Alex Delis, Christos Faloutsos, Shahram Ghandeharizadeh, Ed. Philadelphia: ACM Press, pp. 84-93, 1999.

[6] C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park, “Fast Algorithms for Projected Clustering,” in ACM SIGMOD Int’l Conf. Management of Data, Alex Delis, Christos Faloutsos, Shahram Ghandeharizadeh, Ed. USA: ACM Press, pp. 61-72, 1999.

[7] C. C. Aggarwal and P. S. Yu, “Finding Generalized Projected Clusters in High Dimensional Spaces,” in ACM SIGMOD Int’l Conf. Management of Data, Weidong Chen, Jeffrey F. Naughton, Philip A. Bernstein, Ed. USA: ACM Press, pp. 70-81, 2000.

[8] A. Ahmad and L. Dey, “A k-mean clustering algorithm for mixed numeric and categorical data,” Data & Knowledge Engineering, vol. 63, no. 2, pp. 503-527, 2007.

[9] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, “Automated Variable Weighting in k-Means Type Clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 1-12, 2005.

[10] H. Frigui and O. Nasraoui, “Unsupervised Learning of Prototypes and Attribute Weights,” Pattern Recognition, vol. 37, no. 3, pp. 567-581, 2004.

[11] Y. Chan, W. Ching, M. K. Ng, and J. Z. Huang, “An Optimization Algorithm for Clustering Using Weighted Dissimilarity Measures,” Pattern Recognition, vol. 37, no. 5, pp. 943-952, 2004.

[12] C. Domeniconi, Locally Adaptive Techniques for Pattern Classification, PhD dissertation, University of California, 2002.

[13] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, “Subspace Clustering of High Dimensional Data,” in SIAM Int’l Conf. Data Mining, Michael W. Berry, Umeshwar Dayal, Chandrika Kamath, David B. Skillicorn, Ed. USA: Kluwer Academic Publishers, pp. 517-521, 2004.

[14] S. L. Yang, Y. S. Li, X. X. Hu, and R. Y. Pan, “Optimization Study on k Value of K-means Algorithm,” Systems Engineering-theory & Practice, Beijing: Institute of China System Engineering Press, no. 2, pp. 97-101, 2006.

[15] T. Y. Li and Y. Chen, “An Improved k-means Algorithm for Clustering Using Entropy Weighting Measures,” in 7th World Congress on Intelligent Control and Automation, IEEE Computer Press, pp. 149-153, 2008.

[16] E. Lughofer, “Extensions of vector quantization for incremental clustering,” Pattern Recognition, vol. 41, no. 3, pp. 995-1011, 2008.

[17] X. Qian, X. J. Huang, and L. D. Wu, “A Spectral Method of K-means Initialization,” Acta Automatica Sinica, vol. 33, no. 4, pp. 342-346, 2007.

Fig. 4. The results of fuzzy k-means algorithm based on k-center and binary tree for Segmentation Test dataset with the first and second attributes.

