An Improved Fuzzy K-Means Clustering with K-Center Initialization

Taoying Li, Yan Chen, Xiangwei Mu and Ming Yang

Third International Workshop on Advanced Computational Intelligence (IWACI), Suzhou, Jiangsu, China, August 25-27, 2010


Abstract—Random initialization of the fuzzy k-means algorithm slows its convergence and leads to a large amount of calculation. We therefore propose an improved fuzzy k-means clustering based on the k-center algorithm and a binary tree. The method first removes redundant attributes, since too many irrelevant attributes hurt the efficiency of clustering. Second, it eliminates the differences between the units of the dimensions and then adopts k-center clustering to initialize the k cluster means: the first mean is chosen randomly and the others are obtained in turn according to distance. The k means are organized into a binary tree so that the mean closest to a data point can be found easily. Finally, the proposed algorithm is applied to the Iris, Pima-Indians-Diabetes, and Segmentation datasets; the results show that it achieves higher efficiency and greater precision and reduces the amount of calculation.

I. INTRODUCTION

CLUSTERING groups data points into clusters so that intra-cluster similarity is maximized and inter-cluster similarity is minimized [1], [2], [3]. Clustering plays an important role in data mining and is widely applied in pattern recognition, computer vision, and fuzzy control. Various types of clustering methods have been proposed and developed [4]. Clustering algorithms fall mainly into five groups: hierarchical clustering, partitioning clustering, density-based methods, grid-based methods, and model-based methods; the former two are the most commonly used. Hierarchical algorithms can be further divided into bottom-up and top-down algorithms [1]. Traditional hierarchical clustering algorithms, such as BIRCH [2] and CURE [3], are not suitable for large datasets because of their heavy computation. CLIQUE [4], ENCLUS, and MAFIA [5] are bottom-up algorithms; PROCLUS [6] and ORCLUS [7] are top-down algorithms. Traditional partitioning algorithms include k-means, k-modes, and so on. K-means is the most classical of these and is widely used in practice.

Existing k-means algorithms and their extensions each have their advantages [9]-[15]; however, all of them choose the initial means randomly at the start of clustering, which conflicts with the fact that k-means is very sensitive to the initial conditions. Although the method in [17] chooses the initial means in a particular way, it needs plenty of computation and is not suitable for large datasets.

Taoying Li is a PhD student at the Transportation Management College, Dalian Maritime University, Dalian 116026, Liaoning, P.R. China (phone: 86-138-4082-0896; fax: 86-411-8472-5286; e-mail: ytaoli@126.com). Yan Chen is a professor at the Transportation Management College, Dalian Maritime University (e-mail: chenyan_dlmu@163.com). Xiangwei Mu is a PhD student at the same college (e-mail: xiangwei.mu@gmail.com). Ming Yang is a PhD student at the same college (e-mail: turner7257@163.com).

In this paper, we propose an improved fuzzy k-means clustering based on the k-center algorithm and a binary tree. The algorithm starts by removing redundant attributes and eliminating the differences between the units of the dimensions. We then use the k-center algorithm to initialize the means of the fuzzy k-means algorithm and build a binary tree from the k cluster means, which reduces the amount of calculation: we no longer need to compute the distance from a data point to all cluster means, but only two distances per level of the tree.
This paper is organized as follows. Section II gives a brief review of the fuzzy k-means clustering algorithm. Section III presents the improved fuzzy k-means algorithm and introduces the k-center algorithm and the binary tree. Section IV applies the proposed algorithm to the Iris, Pima-Indians-Diabetes, and Segmentation datasets to validate it. Section V gives the conclusions.

II. RELATED WORK

The k-means algorithms, like other partitioning clustering algorithms, group n data points into k clusters by minimizing a pre-designed cost function. The traditional cost function [8] is

$$F_1(C)=\sum_{j=1}^{n}\sum_{i=1}^{m}\left(x_{ji}-c_{li}\right)^{2} \qquad (1)$$

where $x_{ji}$ is the value of the ith dimension of the jth object, $c_l$ is the center nearest to the jth object, and $c_{li}$ is the value of the ith dimension of the lth cluster center. Because different dimensions contribute differently to the clustering, and each object has its own preference for which cluster it belongs to, weighted extensions of the traditional cost function are often used; the methods in [9]-[13] are all extensions of (1). H. Frigui and O. Nasraoui [10] and Y. Chan and W. Ching [11] introduce a degree of membership for each object in every cluster and a weight expressing each dimension's contribution to a cluster. However, their algorithms are not computable if one of the weights happens to be zero. Domeniconi [12], [13] introduces a cost function containing a maximum function, which proved difficult to minimize. Liping Jing, Michael K. Ng, and Joshua Zhexue Huang [1] introduce a cost function that avoids the problems above, using the entropy of the dimension weights to represent the certainty of dimensions in the identification of a cluster. But clustering should satisfy two objectives: first, the distance between objects within a cluster should be as small as possible; second, the distance between objects in different clusters should be as large as possible [14]. The cost function in [1] does not satisfy the second objective. In [15] we proposed a new method that adds a variable to adjust the function, satisfying both goals of clustering. However, the efficiency of all the algorithms mentioned above is low, because different initial k means influence the speed of convergence and the number of iterations. Proper initial means let the algorithm reach its goal quickly, but it is difficult to know which ones are good before training.
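As a concrete reference point, objective (1) can be evaluated in a few lines. Below is a minimal NumPy sketch; the array names and shapes are our own choices, not the paper's:

```python
import numpy as np

def kmeans_cost(X, C, labels):
    """Classical k-means cost (1): total squared distance from each of
    the n objects (rows of X) to the center of its nearest cluster.
    X: (n, m) data; C: (k, m) centers; labels[j]: index of the center
    nearest to object j."""
    residuals = X - C[labels]      # x_ji - c_li for every object j
    return float(np.sum(residuals ** 2))
```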
III. FUZZY K-MEANS CLUSTERING BASED ON THE K-CENTER ALGORITHM AND A BINARY TREE

In this section, we use the k-center algorithm to initialize the improved k-means algorithm of [15] and, at the same time, use a binary tree to reduce the calculation for each data point.

A. Fuzzy K-Means Algorithm with K-Center Initialization

The cost function we use is given as follows:

$$F(T,W,C)=\sum_{l=1}^{k}\left[\frac{\sum_{j=1}^{n}\sum_{i=1}^{m}\omega_{lj}\,\lambda_{li}\,(c_{li}-x_{ji})^{2}}{\sum_{i=1}^{m}(c_{li}-\bar{x}_{i})^{2}}+\gamma\sum_{i=1}^{m}\lambda_{li}\log\lambda_{li}\right] \qquad (2)$$

Here $\sum_{l=1}^{k}\omega_{lj}=1$ and $\sum_{i=1}^{m}\lambda_{li}=1$; $\bar{x}$ is the mean of all objects, and $\bar{x}_{i}$, the value of its ith dimension, equals $\frac{1}{n}\sum_{j=1}^{n}x_{ji}$. If $n>1$, the new algorithm is applicable; otherwise $\sum_{i=1}^{m}(c_{li}-\bar{x}_{i})^{2}$ is zero and the cost function F(T,W,C) is not computable. The denominator is a variable proportional to the squared sum of the distances from the mean of all objects to the means of the clusters.

Next, we present the process of the improved entropy-weighting k-means algorithm for solving this minimization problem.

Minimization of F in (2) under the constraints forms a class of constrained nonlinear optimization problems whose closed-form solutions are unknown. The usual way to optimize F(T,W,C) in (2) is partial optimization over T, W, and C, following [10], [1]: first fix T and C and search for the W that minimizes F(T,W,C); then fix T and W and search for the minimizing C; then fix C and W and search for the minimizing T. The process is repeated until the value of the objective function can no longer be reduced.

Let T and C be fixed; F is minimized if the weights satisfy

$$\lambda_{lt}=\frac{\exp(-D_{lt}/\gamma)}{\sum_{i=1}^{m}\exp(-D_{li}/\gamma)} \qquad (3)$$

$$D_{lt}=\frac{\sum_{j=1}^{n}\omega_{lj}\,(c_{lt}-x_{jt})^{2}}{\sum_{i=1}^{m}(c_{li}-\bar{x}_{i})^{2}} \qquad (4)$$

We proved this in [15] and simply adopt it here.

Similarly to the derivation of $\lambda_{lt}$, let W and C be fixed; the jth object belongs to the lth cluster if the distance from the jth object to the mean of the lth cluster is smaller than its distance to the mean of any other cluster. That is,

$$\omega_{lj}=\begin{cases}1, & \text{if } \sum_{i=1}^{m}\lambda_{li}(c_{li}-x_{ji})^{2}\le\sum_{i=1}^{m}\lambda_{zi}(c_{zi}-x_{ji})^{2},\ 1\le z\le k\\[2pt] 0, & \text{otherwise}\end{cases} \qquad (5)$$

$\omega_{lj}=1$ means that the jth object is assigned to the lth cluster, and $\omega_{lj}=0$ the reverse.

Let T and W be fixed; C can then be obtained by simple averaging, which gives

$$c_{li}=\frac{\sum_{j=1}^{n}\omega_{lj}\,x_{ji}}{\sum_{j=1}^{n}\omega_{lj}} \qquad (6)$$

where $1\le l\le k$ and $1\le i\le m$.

The improved weighted fuzzy k-means algorithm introduced in [15] proceeds as follows:

Step 1. Input the parameters m, n, k and the maximum number of iterations s; initialize the weights $\lambda_{li}=1/m$; choose k objects at random as the centers C of the k clusters.
Step 2. Obtain T according to (5).
Step 3. Compute the value of F(T,W,C) according to (2).
Step 4. Update C according to (6).
Step 5. Update W according to (3).
Step 6. Repeat Steps 2 to 5 until F(T,W,C) can no longer be improved or the number of iterations exceeds s.
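To make the alternating updates concrete, here is a compact NumPy sketch of one round of (5), (6), (4), and (3), based on our reconstruction of the equations above; the variable names, the entropy parameter `gamma`, and the numerical guards are our own assumptions rather than the authors' code:

```python
import numpy as np

def weighted_fuzzy_kmeans_step(X, C, W, gamma=1.0):
    """One round of the alternating updates (5), (6), (4), (3).
    X: (n, m) data; C: (k, m) centers; W: (k, m) dimension weights
    with each row summing to 1; gamma: entropy parameter (assumed)."""
    n, m = X.shape
    xbar = X.mean(axis=0)                       # mean of all objects

    # (5): hard membership -- object j joins the cluster l minimizing
    # sum_i lambda_li * (c_li - x_ji)^2
    sq = (X[:, None, :] - C[None, :, :]) ** 2   # (n, k, m)
    d = np.einsum('li,jli->jl', W, sq)          # weighted distances (n, k)
    T = np.zeros((n, C.shape[0]))
    T[np.arange(n), d.argmin(axis=1)] = 1.0

    # (6): each center becomes the mean of its members
    size = np.maximum(T.sum(axis=0), 1.0)       # guard against empty clusters
    C_new = (T.T @ X) / size[:, None]

    # (4): per-dimension dispersion of each cluster, divided by the
    # squared distance of its center from the global mean, as in (2)
    denom = np.maximum(((C_new - xbar) ** 2).sum(axis=1), 1e-12)
    D = np.einsum('jl,jlt->lt', T, (X[:, None, :] - C_new[None, :, :]) ** 2)
    D /= denom[:, None]

    # (3): entropy-regularized weights, i.e. a softmax of -D / gamma
    E = np.exp(-D / gamma)
    W_new = E / E.sum(axis=1, keepdims=True)
    return T, C_new, W_new
```

Repeating this step until F(T,W,C) stops improving, or until s iterations have elapsed, reproduces Steps 2-6 above.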
To reduce the number and cost of the iterations of weighted fuzzy k-means, we adjust the algorithm by using the k-center algorithm to choose the initial k points. The weighted k-means algorithm then proceeds as follows:

Step 1. Input the parameters m, n, k and the maximum number of iterations s; initialize the weights $\lambda_{li}=1/m$.
Step 2. Put all data points into a set H and let the set of centers C be empty. Choose one point at random from H as the first center and put it into C.
Step 3. Take the data point of H that is not in C and is farthest from the centers already in C, make it the next center, and put it into C.
Step 4. If the number of centers in C equals k, go to Step 5; otherwise go to Step 3.
Step 5. Obtain T according to (5).
Step 6. Compute the value of F(T,W,C) according to (2).
Step 7. Update C according to (6).
Step 8. Update W according to (3).
Step 9. Repeat Steps 5 to 8 until F(T,W,C) can no longer be improved or the number of iterations exceeds s.

The complexity of traditional distance-based clustering is O(mn²), which grows quadratically with the number of objects to be partitioned; traditional clustering methods therefore need a great deal of calculation when many objects exist. The complexity of the proposed algorithm is O(mnk), similar to that of the EWKM algorithm and the improved weighted fuzzy k-means, and it grows linearly with the number of objects. At the same time, the k-center algorithm supplies the initial k points, which reduces the number of iterations of the k-means algorithm.

B. Improved Fuzzy K-Means Algorithm Using a Binary Tree of Means

We now use a binary tree to decide which clusters data points belong to.

Suppose we know the k means of the k clusters. We calculate the Euclidean distance between every pair of means and obtain a distance matrix $D=(d_{lv})_{k\times k}$, where

$$d_{lv}=\left(\sum_{i=1}^{m}(c_{li}-c_{vi})^{2}\right)^{1/2} \qquad (7)$$

Given that $d_{ij}$ is the largest entry, we split the k means into two groups, one around $m_i$ and one around $m_j$. The $m_i$ group can then be split again, as shown in Fig. 1, and so on until we obtain the binary tree.

The tree is built as follows:
1. Let the number of layers of the tree be l = 1 and the number of groups in this layer be g = 1. Compute the mean of all k centers and make it the root of the first level of the tree; all centers form one group.
2. Let l = l + 1. Divide each existing group into two smaller groups, so that there are at most 2^(l-1) groups; let the means of the new groups be the nodes of layer l, each mean standing for one group of that layer, and let g be the number of actual groups.
3. If every group contains only one center, stop; otherwise go to 2.

In fact, from this construction, all k means of the fuzzy k-means algorithm are leaf nodes of the binary tree.
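A sketch of the two ingredients just described: farthest-first (k-center) selection of the initial centers, and a binary tree over the k means that is descended with only two distance computations per level. The splitting rule realizes the farthest-pair split described above; all class and function names are our own, and the greedy descent may occasionally return a near-nearest rather than the exact nearest mean:

```python
import numpy as np

def kcenter_init(X, k, seed=0):
    """Farthest-first (k-center) choice of the initial k centers,
    mirroring Steps 2-4 above. X: (n, m) data matrix."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]        # first center: random point
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(dist.argmax())                # farthest from all chosen centers
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

class Node:
    """Node of the binary tree of means; a leaf holds one cluster index."""
    def __init__(self, idx, C):
        self.mean = C[idx].mean(axis=0)         # mean of this group of centers
        self.leaf = int(idx[0]) if len(idx) == 1 else None
        self.left = self.right = None
        if len(idx) > 1:
            # Split around the farthest-apart pair of centers in the group,
            # sending every other center to whichever of the two is nearer.
            d = np.linalg.norm(C[idx][:, None] - C[idx][None, :], axis=2)
            a, b = np.unravel_index(int(d.argmax()), d.shape)
            go_left = d[:, a] <= d[:, b]
            if d[a, b] == 0:                    # degenerate: identical centers
                go_left = np.arange(len(idx)) < len(idx) // 2
            self.left = Node(idx[go_left], C)
            self.right = Node(idx[~go_left], C)

def nearest_mean(x, node):
    """Descend the tree, computing only two distances per level, and
    return the index of the (approximately) nearest cluster mean."""
    while node.leaf is None:
        dl = np.linalg.norm(x - node.left.mean)
        dr = np.linalg.norm(x - node.right.mean)
        node = node.left if dl <= dr else node.right
    return node.leaf
```

With `root = Node(np.arange(k), C)`, a call `nearest_mean(x, root)` replaces the k distance computations of a linear scan with roughly 2·log2(k).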
</p><p>Step2: Eliminate the difference of units of dimensions according to equation (8), which will make all data points be zero dimension and values be in the range of [0,1]. </p><p>tittit</p><p>titji</p><p>ji xx</p><p>xxx</p><p>minmax</p><p>minoriginal</p><p>=</p><p> (8) </p><p>Here, mi 1 . Step3. Partition initial data points into k clusters using </p><p>improved weighted fuzzy k-means algorithm mentioned above. </p><p>Step4. Make the k means into a binary tree structure using two centroids clustering, which can be shown in Figure 1. </p><p>Step5. Calculate the distance of the data point to roots of the tree and its sub trees by using Euclidean distance as following: </p><p>Supposing the x* is more close to mi then to mj, then we just need to compute two distances between the x* and mi1, and x* and mi2, supposing the distance between x* and mi1 is smaller, then we need to calculate two distances between x* and mi11, and x* and mi12, and so on. If the distance between x* and mi11 is smaller and mi11 doesnt have sub points, then x* is closest to mi11. </p><p>Step6. Elicit the mean of cluster which is closest to the data point according Step5 -&gt; winning cluster represented by its center cwin. </p><p>Step7. If all data points have their clusters, update C according to (6). </p><p>Step8. If C equals to that of iteration last time or the error rate is less than a pre-designed value, stop. Otherwise go to Step 4. </p><p>According the process of the improved fuzzy k-means algorithm based on k-center and binary tree, we know that the complexity of using two centroids clustering is O (mnlogk), and the complexity of traditional algorithm is O (mnk). Therefore, if the number of clusters and number of data points are very large, we can decrease the amount of calculation in this paper. </p><p>IV. EXPERIMENT In this section, we use Iris dataset, Indians-diabetes dataset </p><p>and Zoo dataset from UCI website to validate the improved fuzzy k-means algorithm based on k-center and binary tree, and detail datasets can be shown in Table I. </p><p>TABLE I REAL DATASETS </p><p>Dataset Number of attributes Number of clusters </p><p>Number of dat...</p></li></ul>