
    Journal of Advanced Computer Science and Technology Research 1 (2011) 110-125

A Hybrid Algorithm for Data Clustering Using Honey Bee Algorithm, Genetic Algorithm and K-Means Method

Mohammad Ali Shafia 1,a, Mohammad Rahimi Moghaddam 1,b, Rozita Tavakolian 2,c

1 Department of Industrial Engineering, Iran University of Science and Technology, Tehran, Iran
2 Department of Information Technology Engineering, Tarbiat Modares University, Tehran, Iran

a [email protected], b [email protected], c [email protected]

    Article Info

Received: 29th September 2011
Accepted: 10th November 2011
Published online: 1st December 2011

ISSN: 2231-8852 © 2011 Design for Scientific Renaissance. All rights reserved.

ABSTRACT

In this article, a novel population-based hybrid algorithm called the Genetic Bee Tabu K-Means Clustering Algorithm (GBTKC) is developed. It is based on the basic Honey Bee Algorithm, and the benefits of the K-Means method are used to improve its efficiency. The simplicity of K-Means, the diversity of the Genetic Algorithm in finding the global optimum, and the advantages of Tabu Search are also combined in GBTKC. Using the Honey Bee Algorithm, the hybrid algorithm has a greater ability to search for globally optimal solutions and to escape local optima, as well as to generate efficient near-optimal solutions. Moreover, GBTKC is run on three well-known data sets from the UCI Machine Learning Repository, and its clustering results are compared with those of the other algorithms discussed in the literature review. The comparison reveals that GBTKC reliably converges to an optimal solution and that the quality of the answers it provides is more dependable than that of the other algorithms.

Keywords: Clustering, Hybrid Algorithm, K-Means Method, Honey Bee Algorithm (HBA), Genetic Algorithm (GA), Tabu Search Algorithm (TS).

    1. Introduction

With the rapid increase of information on the web, clustering related data and documents to extract useful information has become more important for information retrieval systems. Different methods exist to tackle the clustering problem, the most important of which is K-Means, which classifies the data set into a number of homogeneous groups based on their similarities. The main problem with K-Means is its tendency to converge to a local optimum. Meta-heuristic algorithms are widely used to optimize the result of K-Means, but achieving high clustering quality and an optimal solution is still a challenging task. The Honey Bee Algorithm, a meta-heuristic algorithm, introduces a novel approach to searching the solution space by observing honey bee foraging behavior, which leads to improved solution quality.


Intelligent methods are mostly used in designing large, modern, professional information systems for solving complex problems. To handle complex sets of data or information, an approach inspired by nature is used: the structure of organisms and the team behavior of animals are observed and studied. As a result of these studies, Particle Swarm Optimization (Cura, 2009; Kennedy & Eberhart, 1995) was derived from birds' flocks and fish schools. Meanwhile, the Honey Bee Algorithm (HBA) is a novel method created to solve optimization problems, inspired by honey bee behavior in foraging, collecting nectar and pollen, and producing honey. The Honey Bee Algorithm is a population-based algorithm that was originally proposed by (Pham et al., 2006) and simulates the foraging behavior that a swarm of bees displays.

HBA falls into the category of swarm-based optimization algorithms (SOAs). SOAs use different mechanisms present in nature in order to search the solution space and reach the optimum solution. The key difference between SOAs and direct algorithms such as hill climbing is that SOAs use a population of solutions instead of a single one (Pham et al., 2007). HBA has the potential to serve many functions: this optimization algorithm simulates honey bee foraging, making it a helpful approach to diverse problems, some of which are stated in the review of the literature. In this research, HBA is used as a novel SOA method to solve the clustering problem. HBA uses different mechanisms, such as the waggle dance, to find the best site for a food source as well as to search for the next one.

Clustering is one of the unsupervised, learning-based techniques for partitioning similar data points. Clustering is defined as classifying homogeneous sets of data points into several clusters, mostly on the condition that there is no background knowledge of the subject (Murthy & Chowdhury, 1996). There are different clustering algorithms, each of which uses certain steps in order to categorize a large number of data points into a smaller number of groups, in such a way that the data points of one group have the most similar characteristics and features, while the data points of different groups have the least similarity with regard to characteristics and features (Pham et al., 2007).

One of the most well-known and useful clustering algorithms is K-Means (Jain & Dubes, 1988). The algorithm is efficient for clustering large data sets, because its computational complexity grows only linearly with the number of data points. The important point about K-Means is that it does not guarantee an optimal solution, although the method converges to good solutions (Bottou & Bengio, 1995). Moreover, the framework of the K-Means method is used in the HBA hybridization in order to generate optimal solutions.

This article introduces a new optimization algorithm called the Genetic Bee Tabu K-Means Clustering Algorithm (GBTKC). This is a hybrid algorithm developed on the basis of HBA (Pham et al., 2006). It also utilizes the benefits of the Genetic Algorithm (GA) (Goldberg, 1989), Tabu Search (TS) (Glover, 1989a; Glover, 1989b) and the K-Means method (Selim & Ismail, 1984). The HBA performs a kind of neighborhood search combined with a random search, in a way that is reminiscent of the food-foraging behavior of swarms of honey bees (Pham et al., 2006).

As mentioned before, a major drawback of previous clustering-based approaches is getting stuck at a local optimum. In this research, we utilize different features of four meta-heuristic algorithms and present a novel hybrid algorithm able to improve the results of previous approaches. For this purpose, our approach tries to answer the following questions:

a. How does GBTKC combine the different features of four meta-heuristic algorithms to achieve a more precise solution?

b. How much improvement can be achieved by our approach in comparison with others?

The paper is organized as follows: section two briefly reviews clustering concepts and its different methods. Section three briefly introduces the classical GA. The next section deals with the foraging behavior of bees and the initial ideas of the proposed clustering method. The hybrid algorithm proposed for clustering is presented in section five. Results of the clustering experiments are reported in section six and, finally, section seven concludes the paper and provides a set of suggestions for future studies.

    1.1 Clustering Methods

Data clustering, a well-known NP-complete problem, has been developed in order to find groups in heterogeneous data, with the minimization of a dissimilarity criterion as the goal. Solving this problem is useful in data mining, machine learning and pattern classification (Liu et al., 2008).

Clustering methods identify groups, or rather clusters, in a data set using a step-by-step approach, in the sense that each cluster contains objects similar to each other yet different from those in other clusters (Lee & Yang, 2009; Rokach & Maimon, 2005; Xu & Wunsch II, 2005). Many clustering methods exist, a number of which are surveyed in the literature (Grabmeier & Rudolph, 2002; Lee & Yang, 2009).

They can be broadly classified into four categories (Han & Kamber, 2001): partitioning methods, hierarchical methods, density-based methods and grid-based methods. Another taxonomy of clustering approaches is shown in Fig. 1. Other clustering techniques that do not fit into these categories have also been developed, namely fuzzy clustering, artificial neural networks and GA. A discussion of different clustering algorithms can be found in references published by various authors (Han & Kamber, 2001; Pham & Afify, 2006).

K-Means is among the simplest and most commonly used clustering methods in the partitioning category (McQueen, 1967). Each cluster in K-Means is represented by the mean value of the data points within the cluster. The method tries to divide a data set S into k clusters such that the sum of the squared Euclidean distances between data points and their closest cluster centers is minimized. This sum is called the Total Within-Cluster Variance (TWCV). The criterion is defined formally by Eq. 1, where x_i^{(j)} is the i-th data point belonging to the j-th cluster, c_j is the centre of the j-th cluster, k is the number of clusters and n_j is the number of data points in cluster j.

TWCV = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)
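As a concrete illustration, the TWCV criterion of Eq. 1 can be computed in a few lines once the data points have been assigned to clusters (the function name and data layout below are illustrative, not taken from the paper):

```python
def twcv(clusters, centers):
    """Total Within-Cluster Variance (Eq. 1): the sum over clusters of
    squared Euclidean distances from each member point to its centre."""
    total = 0.0
    for points, c in zip(clusters, centers):
        for x in points:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

# Two 1-D clusters: {1, 3} around centre 2 contributes 1 + 1; {10} around 10 contributes 0.
print(twcv([[(1,), (3,)], [(10,)]], [(2,), (10,)]))  # 2.0
```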

As mentioned above, the implementation of K-Means clustering involves optimization. First, the algorithm takes k randomly selected data points and makes them the initial centers of the k clusters being formed. The algorithm then assigns each data point to the cluster with the closest centre. In the second step, the centers of the k clusters are recomputed and the data points are redistributed. This step is repeated for a specified number of iterations or until there is no change in the membership of the clusters over two successive iterations. It is known that the K-Means algorithm may become trapped in locally optimal solutions, depending on the choice of the initial cluster centers.
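The assign-and-recompute loop described above can be sketched as follows (a minimal illustrative implementation; the toy data set, seed and helper names are assumptions, not from the paper):

```python
import random

def k_means(data, k, iterations=100, seed=0):
    """Basic K-Means: random initial centres, then repeatedly assign each
    point to its closest centre and recompute centres as cluster means,
    stopping when cluster membership no longer changes."""
    rng = random.Random(seed)
    centres = rng.sample(data, k)
    assignment = None
    for _ in range(iterations):
        # Assign each point to the cluster with the closest centre.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centres[j])))
            for x in data
        ]
        if new_assignment == assignment:  # no membership change: converged
            break
        assignment = new_assignment
        # Recompute each centre as the mean of its member points.
        for j in range(k):
            members = [x for x, a in zip(data, assignment) if a == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment

data = [(0.0,), (1.0,), (10.0,), (11.0,)]
centres, labels = k_means(data, 2)
print(sorted(c[0] for c in centres))  # [0.5, 10.5]
```

On this well-separated toy data the loop reaches the same two centres regardless of which initial points are sampled, but in general the result depends on the initialization, which is exactly the local-optimum problem discussed above.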

    Fig.1. A taxonomy of clustering approaches (Lee & Yang, 2009)

HBA and GA have a potentially greater ability to avoid local optima than the localized search employed by most clustering techniques. One study proposed a genetic-algorithm-based clustering technique, called GA-clustering, which has proven effective in providing optimal clusters (Maulik & Bandyopadhyay, 2000). In this algorithm, solutions (typically, cluster centroids) are represented by bit strings. The search for an appropriate solution begins with a population, or collection, of initial solutions. Members of the current population are used to create the next-generation population by applying operations such as random mutation and crossover. At each step, the solutions in the current population are evaluated relative to some measure of fitness (which, typically, is inversely proportional to E), with the fittest solutions selected probabilistically as seeds for producing the next generation. The process performs a generate-and-test beam search of the solution space, in which variants of the best current solutions are most likely to be considered next.

Another study presented a clustering method based on the classic HBA (Pham et al., 2007). The method employs the Bee Algorithm to search for the set of cluster centers that minimizes a given clustering metric. One of the advantages of the proposed method is that it does not become trapped in locally optimal solutions. In the present report, it will be shown that the proposed method performs better than the K-Means and GA-clustering algorithms.

In the next sections, an alternative clustering method to solve the local optimum problem is described. The new method adopts HBA and GA, as this combination has proved to give a more robust performance than other intelligent optimization methods for clustering problems (Pham et al., 2006).

    1.2 The Classical GA

GA uses a stochastic search procedure providing adaptive and robust search over a wide range of search spaces. The procedure is inspired by the Darwinian principle of natural selection and survival of the fittest individuals. The technique was first introduced for use in adaptive systems (Holland, 1975). It was then employed by several researchers to solve various optimization problems effectively and efficiently.

The search procedure starts with the initialization of a few parameters which may or may not be modified in the course of the search. The algorithm passes through three basic phases iteratively, namely the reproduction phase, the crossover phase and the mutation phase. The detailed operation of each phase is lucidly described in the literature (Goldberg, 1989; Xiao et al., 2010). The classical GA can be described as shown in Fig. 2.

Fig. 2. Pseudocode of the basic GA
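The three phases of the classical GA (reproduction, crossover, mutation) can be sketched in code on a toy bit-string problem (all names and parameter values below are illustrative assumptions, not taken from the paper):

```python
import random

def genetic_algorithm(fitness, length, pop_size=20, p_m=0.05, generations=60, seed=1):
    """Classical GA skeleton: fitness-proportional reproduction,
    single-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # 1 + fitness keeps every selection weight strictly positive.
        weights = [1 + fitness(ind) for ind in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)  # reproduction
            cut = rng.randrange(1, length)                   # single-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(length):                          # bit-flip mutation
                if rng.random() < p_m:
                    child[i] ^= 1
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# One-max problem: fitness counts the 1-bits; the optimum is the all-ones string.
best = genetic_algorithm(sum, length=10)
print(sum(best))  # typically reaches, or is very close to, the optimum of 10
```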

1.3 The Classical HBA

1.3.1 Bees in Nature

A colony of honey bees can fly in different directions over rather long distances in order to forage. Honey bee forage sources are flower patches in which there is a lot of nectar or pollen. The easier the nectar or pollen is to gather, the more honey bees visit the patch, and vice versa (Seeley, 1996; Von-Frisch, 1976).

The foraging process begins with scout bees searching randomly from one patch to another. A scout bee (see Fig. 3) is a type of unemployed forager that starts searching spontaneously, without any prior knowledge (Özbakır, Baykasoglu, & Tapkan, 2010). During the harvesting season, a colony continues its exploration, keeping a percentage of the population as scout bees (Seeley, 1996). The bees that return to the hive have evaluated different patches based on their quality, which depends on parameters such as the proportion of sugar in the nectar or pollen of that patch (Camazine et al., 2003). They deposit their nectar or pollen and go to the dance floor to perform a dance known as the waggle dance (Von-Frisch, 1976).

This mysterious dance is essential for colony communication and contains three pieces of information regarding a flower patch (Camazine et al., 2003). This information helps the colony to send bees to the flower patches precisely, without any guide, instruction or map. Moreover, the value of a patch depends on the amount of available food as well as the energy needed to harvest it (Camazine et al., 2003).

After performing the waggle dance on the dance floor, scout bees return to the patch, taking with them flower bees that were waiting in the hive. More flower bees are sent to patches with a higher probability of food existence. Continuing in this manner causes the colony to gather food in the fastest, most efficient way possible. Deciding about the next waggle dance is crucial after returning to the hive (Camazine et al., 2003). Naturally, when there is still enough nectar in the patch for it to be considered a source, the waggle dance is advertised and recruit bees are sent to the source. A recruit (see Fig. 3), another type of unemployed forager, attends a waggle dance performed by other bees and then starts searching using the knowledge gained from the waggle dance (Özbakır et al., 2010).

Fig. 3. Typical behavior of honey bee foraging (Häckel & Dippold, 2009; Özbakır et al., 2010)

    1.3.2 The basic HBA

Many SOAs have been developed from bee behavior. These algorithms are classified into two categories based on the behavior simulated from nature: foraging behavior and mating behavior (Marinakis, Marinaki, & Matsatsinis, 2009). The most important approaches that simulate the foraging behavior of bees are the Artificial Bee Colony (ABC) Algorithm proposed by (Karaboga & Basturk, 2007, 2008), the Bee Colony Optimization Algorithm proposed by (Teodorovic & Dell'Orco, 2005), the BeeHive Algorithm published by (Wedde, Farooq, & Zhang, 2004) and the Virtual Bee Algorithm proposed by (Yang, 2005), which is applied to continuous optimization problems.

Among the algorithms introduced above, we focus on the Honey Bee Algorithm. HBA is an optimization algorithm inspired by the natural foraging behavior of honey bees to find the optimal solution (Pham, Castellani, & Ghanbarzadeh, 2007). The algorithm has been successfully applied to optimization problems, including a well-known data set (Pham et al., 2006).

The algorithm starts with n scout bees placed randomly in the search space. Then, the fitness of the sites visited by the scout bees is evaluated after their return. This fitness evaluation leads to selecting the best m sites from the n visited sites. Of these m selected sites, e sites are designated as good (elite) selected sites and the other (m-e) sites as the remaining selected ones.


A neighborhood of size ngh is then defined around each of the m sites and is used to update the m bees declared in the previous step. n2 bees are selected randomly to be sent to the e best sites, and n1 bees (where n1 is less than n2) are selected randomly to be sent to the (m-e) other sites. In this step, bees are recruited for the selected sites and the fitness of the sites is evaluated.

Finally, the best bee (the one with the highest fitness) from each site is chosen to form the next bee population. The remaining bees for initializing the new population are assigned randomly around the search space. The algorithm is repeated until the stopping criterion is met; usually the stopping criterion is a number of repetitions, imax (Pham, Castellani et al., 2007).

The steps of the basic Bee Algorithm are described in detail in Fig. 4, its flowchart is illustrated in Fig. 5, and the parameters the basic HBA requires to be set are shown in Table 1.

Fig. 4. Pseudocode of the basic Bee Algorithm (Idris, 2009)

Fig. 5. Flowchart of the basic Bee Algorithm

0. Begin
1. Initialize population with random solutions.
2. Evaluate fitness of the population.
3. While (stopping criterion not met)
   3.1. Form new population.
   3.2. Select sites for neighborhood search.
   3.3. Recruit bees for selected sites (more bees for best e sites) and evaluate fitness.
   3.4. Select the fittest bee from each patch.
   3.5. Assign remaining bees to search randomly and evaluate their fitness.
4. End While.
5. End.


Table 1: Parameters of basic HBA

Parameter  Description
n          Number of scout bees
m          Number of sites selected out of n visited sites
e          Number of best sites out of m selected sites
n2         Number of bees recruited for best e sites
n1         Number of bees recruited for the other (m-e) selected sites
ngh        Neighborhood size
imax       Number of algorithm repetitions (stopping criterion)
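Using the parameter names from Table 1, the basic Bee Algorithm loop of Fig. 4 can be sketched for a toy one-dimensional objective (the objective function, search bounds and parameter values below are illustrative assumptions, not from the paper):

```python
import random

rng = random.Random(42)

def fitness(x):
    # Toy objective with a single optimum at x = 3.
    return -(x - 3.0) ** 2

def bees_algorithm(n=20, m=8, e=3, n2=6, n1=2, ngh=0.5, imax=50, lo=-10.0, hi=10.0):
    """Basic Bees Algorithm (parameter names as in Table 1) on a 1-D space."""
    scouts = [rng.uniform(lo, hi) for _ in range(n)]
    for _ in range(imax):
        scouts.sort(key=fitness, reverse=True)           # best m sites first
        new_pop = []
        for i, site in enumerate(scouts[:m]):
            recruits = n2 if i < e else n1               # more bees for the best e sites
            neighbours = [site + rng.uniform(-ngh, ngh) for _ in range(recruits)]
            new_pop.append(max(neighbours + [site], key=fitness))  # fittest bee per patch
        new_pop += [rng.uniform(lo, hi) for _ in range(n - m)]     # remaining bees scout
        scouts = new_pop
    return max(scouts, key=fitness)

print(round(bees_algorithm(), 2))  # close to 3.0
```

Keeping the current site in the `max` means a patch never gets worse, so the elite-site search hill-climbs while the random scouts keep exploring the rest of the space.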

2. Methodology

2.1 The Proposed Hybrid Algorithm for Clustering

GBTKC exploits the search capability of the HBA and GA to overcome the local optimum problem of the K-Means algorithm. More specifically, the task is to search for appropriate cluster centers C_i (1 ≤ i ≤ k) such that TWCV (Eq. 1) is minimized, where k denotes the number of clusters. Pseudocode of the algorithm is shown in Fig. 6 and its flowchart in Fig. 7. The steps of the proposed algorithm are further described below.

The algorithm requires a number of parameters to be set, namely: the number of scout bees (n), the number of sites selected for neighborhood searching out of the n visited sites (m), the number of top-rated or elite sites among the m selected sites (e), the number of bees recruited for the best e sites (ne), the number of bees recruited for the other (m-e) selected sites (no), the stopping criterion, the mutation probability in GA (Pm) and the length of the Tabu list (L).

The algorithm starts with an initial population of n scout bees. Each bee represents a potential clustering solution, i.e. a set of k cluster centers. The initial locations of the centers are assigned randomly. The Euclidean distances between each data object and all centers are calculated to determine the cluster to which the data object belongs (i.e. the cluster whose centre is closest to the object). In this way, initial clusters can be constructed. The most popular metric, especially for continuous features, is the Euclidean distance, illustrated in Eq. 2 below.

d(x, c) = \sqrt{\sum_{d=1}^{D} (x_d - c_d)^2} \qquad (2)

where x is a data object, c a cluster centre and D the number of features.
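A direct translation of Eq. 2 (the function name is illustrative):

```python
def euclidean(x, y):
    """Euclidean distance between a data object and a cluster centre (Eq. 2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```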

After the clusters have been formed, the original cluster centers are replaced by the actual centroids of the clusters to define a particular clustering solution (i.e. a bee). This initialization process is applied each time new bees are to be created.

In step 3.1, the fitness computation process is carried out for each site visited by a bee by calculating TWCV (Eq. 1), which is inversely related to fitness. Then, in step 3.2, the m fittest sites are selected for neighborhood search.

In step 3.3, the m sites with the highest fitness are designated as selected sites and chosen for neighborhood search. In step 3.4, the algorithm conducts searches around the selected sites, assigning more bees to search in the vicinity of the best e sites. Selection of the best sites can be made directly according to the fitness associated with them; alternatively, the fitness values are used to determine the probability of the sites being selected. As already mentioned, this is done by recruiting more bees for the best e sites than for the other selected ones. Together with scouting, this differential recruitment is a key operation of the Bee Algorithm. Searches in the neighborhood of the best e sites can be conducted using different formulas, such as the one below. If X = (x1, x2, ..., xk) is the current set of cluster centers and Y = (y1, y2, ..., yk) is another point in the ngh neighborhood of X, then:

y_i = x_i + r_i \cdot ngh, \quad r_i \in [-1, 1], \quad i = 1, \ldots, k \qquad (3)
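One common way to realize such a neighborhood move is to perturb every coordinate of the current centers by a random amount bounded by ngh; the following sketch assumes that uniform bounded perturbation (the function name and sample data are illustrative):

```python
import random

rng = random.Random(7)

def neighbourhood_point(X, ngh):
    """Draw Y in the ngh-neighbourhood of the current centres X by
    perturbing every coordinate by a random amount in [-ngh, +ngh]."""
    return [tuple(x + rng.uniform(-ngh, ngh) for x in centre) for centre in X]

X = [(0.0, 0.0), (5.0, 5.0)]
Y = neighbourhood_point(X, ngh=0.7)
# Every coordinate of Y stays within ngh of the matching coordinate of X.
print(all(abs(y - x) <= 0.7 for cy, cx in zip(Y, X) for y, x in zip(cy, cx)))  # True
```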

In step 3.5, the fitness of the recruited bees is evaluated. In step 3.6, the bee with the highest fitness in each patch is selected to form part of the next bee population; the fittest bee of each patch becomes part of the next generation. In step 3.7, the remaining bees in the new population are assigned using the GA concept. In this step, two bees from the (n-m) remaining bees are selected randomly and the crossover operator is applied to them to generate two offspring. The mutation operator is then applied to the resulting offspring. If an offspring is not in the Tabu list, its fitness is evaluated, it is selected for the next generation, and it is added to the Tabu list. If the Tabu list is full, the last item leaves the list and the new item is placed at the top. This step uses single-point crossover and a mutation operator with probability Pm, and the Tabu list length is constant (L).

Thus, if an offspring is not in the Tabu list, it is added to the Tabu list, its fitness is evaluated, and it is assigned to the next generation. These steps are repeated until all of the remaining (n-m) bees of the next generation have been generated.
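The crossover-mutation-Tabu cycle for the remaining (n-m) bees can be sketched as follows (the bee representation, mutation step size and all names are illustrative assumptions; a `deque` with `maxlen` behaves like the fixed-length Tabu list described above, evicting the oldest entry when full):

```python
import random
from collections import deque

rng = random.Random(3)

def make_offspring(remaining, tabu, p_m=0.3):
    """One offspring for the (n-m) part of the next population:
    single-point crossover of two randomly chosen bees, per-gene
    mutation with probability p_m, and a fixed-length Tabu list."""
    p1, p2 = rng.sample(remaining, 2)
    cut = rng.randrange(1, len(p1))
    child = list(p1[:cut] + p2[cut:])           # single-point crossover
    for i in range(len(child)):                 # mutation with probability p_m
        if rng.random() < p_m:
            child[i] += rng.uniform(-1.0, 1.0)
    child = tuple(child)
    if child in tabu:                           # reject solutions on the Tabu list
        return None
    tabu.append(child)                          # deque(maxlen=L) drops the oldest entry
    return child

tabu = deque(maxlen=10)                         # Tabu list of length L = 10
bees = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
child = make_offspring(bees, tabu)
print(child in tabu)  # True
```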

Fig. 6. Pseudocode of the GBTKC Algorithm


Fig. 7. Flowchart of the GBTKC Algorithm

At the end of each iteration, the new population of the colony has two parts. One part is acquired from the fittest bees in the selected patches, and the other part is generated using the GA and TS concepts. These steps are repeated until a stopping criterion is met.

The strength of GBTKC is due to the ability of the HBA and GA to perform local and global search simultaneously (Pham, Koc et al., 2007). The power of GA arises from the crossover and mutation operators (Srinivas & Patnaik, 1994). Crossover causes a structured, but randomized, exchange of solutions, with the possibility that good solutions can generate better ones, while mutation maintains the diversity of the population. GBTKC also allows the HBA to utilize the crossover and mutation operators of GA in order to increase the diversity of the population in each iteration and prevent premature convergence.

    3. Results & Discussion

In this section, the results of implementing and testing the GBTKC algorithm are presented, along with a comparison with the results of four other algorithms from the literature. In other words, the GBTKC algorithm is compared with Basic K-Means (McQueen, 1967), the simplest K-Means method for solving the clustering problem; GA K-Means (Krishna & Murty, 1999), a hybrid of GA and the K-Means method; the Basic Bee Algorithm (Karaboga & Ozturk, 2011; Pham, Castellani et al., 2007), the simplest algorithm derived from bee behavior; and SOM K-Means (Vesanto & Alhoniemi, 2000), a combination of the K-Means method and neural network solutions for clustering.

Table 2: Features of Reuters, Wine and IRIS data sets used in the experiment

Data Set Name  Number of Objects (incl. documents and news)  Number of Features  Number of Classes
Reuters21578   300                                           1206                6
Wine           178                                           13                  3
IRIS           150                                           4                   3

Table 3: Parameters used in the clustering experiments

Basic K-Means Algorithm:
  Total number of no exchanges in fitness: 2

Genetic K-Means Algorithm:
  Crossover probability, Pc: 1
  Mutation probability, Pm: 0.5
  Population size, P: 300

Basic HBA:
  Number of scout bees, n: 90
  Number of selected sites, m: 60
  Neighborhood size, ngh: 0.8
  Number of best sites out of m (elite bees), e: 50
  Number of bees recruited for best e sites, ne: 50
  Number of bees recruited for the other (m-e) selected sites, no: 30
  Stopping criterion, imax: 40

SOM K-Means Algorithm:
  Initial neighborhood size, IN: 3
  Topology function, TFCN: 'hextop'
  Distance function, DFCN: 'dist'
  Steps for neighborhood to shrink to 1, STEPS: 100

Genetic Bee K-Means Algorithm (GBTKC):
  Number of scout bees, n: 90
  Number of selected sites, m: 60
  Neighborhood size, ngh: 0.7
  Number of best sites out of m (elite bees), e: 40
  Number of bees recruited for best e sites, ne: 50
  Number of bees recruited for the other (m-e) selected sites, no: 30
  Mutation probability in GA, Pm: 0.3
  Stopping criterion, imax: 50
  Length of Tabu list, L: 10

In order to compare and evaluate these algorithms, well-known real data sets from the UCI Machine Learning Repository (Blake & Merz, 1998) are used: Reuters21578 (Lewis, 1997) as the largest data set, with 300 documents; Wine (Murphy & Aha, 1992) as a smaller data set, with 178 observations; and IRIS (Grabmeier & Rudolph, 2002) as the smallest data set in this article, with 150 observations. In accordance with the data features required by each of these algorithms, all of the Reuters, Wine and IRIS data sets were cleaned using primary and heuristic methods. Table 2 illustrates the features and general aspects of these three data sets.

A noticeable point in running and testing the SOM K-Means algorithm is that it was first run 50 times with the standard SOM Neural Network toolbox in Matlab (R2009), and the outputs were then used as cluster centers so that final results could be produced by running K-Means on these centers. Default parameters were used to run the first phase of this algorithm in Matlab.

The clustering criterion TWCV was used to assess the performance of the tested algorithms: the smaller the value of this criterion, the better the clustering results, and vice versa, i.e. the greater the TWCV criterion, the worse the clustering results. Table 3 displays the parameter values of each of the algorithms in this test. Each algorithm was run 15 times and the average (mean), minimum and maximum TWCV were calculated. Table 4 illustrates the results of running each of the algorithms. As can be seen in Table 4, the proposed clustering method outperforms the other four algorithms on all three data sets.

Table 4: Results for the tested clustering algorithms

Data Set  Algorithm                              Mean TWCV      Min. TWCV      Max. TWCV

Reuters
          Basic K-Means Algorithm                1434.815940    1452.990012    1486.418508
          Genetic K-Means Algorithm              1423.873518    1435.549627    1441.937816
          Basic Bee Algorithm                    1425.135698    1434.035237    1446.418007
          SOM K-Means Algorithm                  1422.523106    1430.892543    1440.308472
          Genetic Bee K-Means Algorithm (GBTKC)  1419.420171    1430.191211    1440.782632

Wine
          Basic K-Means Algorithm                18134.905339   18701.956158   18663.266267
          Genetic K-Means Algorithm              16242.673531   16241.361841   16350.143577
          Basic Bee Algorithm                    16345.956243   16328.918898   16366.847007
          SOM K-Means Algorithm                  16249.969845   16245.641182   16253.840755
          Genetic Bee K-Means Algorithm (GBTKC)  16235.406237   16229.165709   16248.729222

IRIS
          Basic K-Means Algorithm                100.406274     78.940841      145.279322
          Genetic K-Means Algorithm              78.969404      78.940841      78.999457
          Basic Bee Algorithm                    78.979825      78.940841      79.326450
          SOM K-Means Algorithm                  78.940841      78.940841      78.940841
          Genetic Bee K-Means Algorithm (GBTKC)  78.941603      78.940841      78.191357

    4. Conclusions

In this article, one of the applications of HBA, a new member of the swarm-based metaheuristics, has been evaluated. Recent studies on algorithms derived from bee behavior have concerned optimization issues, which is why this article studies the application of HBA to an NP-hard problem called clustering. Clustering is a very important issue both theoretically and practically, and it attracts many researchers. K-Means is one of the simplest and most efficient methods introduced for clustering; naturally, this method has disadvantages as well as remarkable advantages.


The proposed algorithm, which is used to solve the clustering problem, is a modified version of the basic Bee Algorithm first presented by Pham and his colleagues. The algorithm, introduced under the name GBTKC, is a novel hybrid algorithm in which the benefits of the K-Means method are used to improve efficiency.

GBTKC is based on the basic HBA, and a mixture of GA, TS and the K-Means method is used to design it. In this new hybrid algorithm, the advantages of the different algorithms are combined. GBTKC makes use of two algorithms, HBA and GA, to search among cluster centers and minimize the objective function. It also uses two algorithms, GA and TS, to diversify the solution space when generating the new population. One of the key benefits of this algorithm is that it does not get stuck in locally optimal solutions, and the quality of the findings and answers it provides is much better than those of the previously studied algorithms in the subject literature. This improved quality is the result of using both the HBA and GA, which perform local and global search simultaneously.

In order to test the performance of GBTKC, the algorithm was implemented and run on three well-known data sets, and the computational experience is very encouraging. These experiments show that the algorithm converged to the optimal solution in all runs. The experimental findings from running the algorithm on the Reuters-21578, Wine and IRIS data sets also show that GBTKC performed better than any of the other algorithms reviewed in the subject literature of this study.

One of the drawbacks of GBTKC is the number of parameters that must be tuned. Another drawback is its long CPU time. As a result, finding ways to help users of the algorithm choose appropriate parameter values, and finding solutions that dramatically decrease its CPU time, would be of great value. These subjects can be scheduled as future work.

    References

Blake, C. L., & Merz, C. J. (1998). UCI Machine Learning Repository. University of California at Irvine Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html

Bottou, L., & Bengio, Y. (1995). Convergence properties of the k-means algorithm. Advances in Neural Information Processing Systems, 7, 585-592.

Camazine, S., Deneubourg, J., Franks, N. R., Sneyd, J., Theraula, G., & Bonabeau, E. (2003). Self-Organization in Biological Systems. Princeton: Princeton University Press.

Cura, T. (2009). Particle swarm optimization approach to portfolio optimization. Nonlinear Analysis: Real World Applications, 10(4), 2396-2406.

Glover, F. (1989a). Tabu Search, Part I. ORSA Journal on Computing, 1(3), 190-206.

Glover, F. (1989b). Tabu Search, Part II. ORSA Journal on Computing, 2(1), 4-32.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley Longman.

Grabmeier, J., & Rudolph, A. (2002). Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6, 303-360.


Häckel, S., & Dippold, P. (2009). The Bee Colony-inspired Algorithm (BCiA): A Two-Stage Approach for Solving the Vehicle Routing Problem with Time Windows. Paper presented at GECCO'09, Montréal, Québec, Canada.

Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Diego, California, USA: Academic Press.

Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.

Idris, R. M., Khairuddin, A., & Mustafa, M. W. (2009). Optimal Allocation of FACTS Devices for ATC Enhancement Using Bees Algorithm. Journal of Aleppo University Engineering Science Series; World Academy of Science, Engineering and Technology, 54.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Englewood Cliffs, New Jersey, USA: Prentice Hall.

Karaboga, D., & Basturk, B. (2007). A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. Journal of Global Optimization, 39(3), 459-471. doi:10.1007/s10898-007-9149-x

Karaboga, D., & Basturk, B. (2008). On the performance of artificial bee colony (ABC) algorithm. Applied Soft Computing, 8, 687-697.

Karaboga, D., & Ozturk, C. (2011). A novel clustering approach: Artificial Bee Colony (ABC) algorithm. Applied Soft Computing, 11, 652-657.

Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. Paper presented at the Proceedings of the 1995 IEEE International Conference on Neural Networks.

Krishna, K., & Murty, M. N. (1999). Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(3), 433-439.

Lee, I., & Yang, J. (2009). Common Clustering Algorithms. In Comprehensive Chemometrics (Chapter 2.27, pp. 577-618). University of Western Sydney, Campbelltown, NSW.

Lewis, D. (1997). Reuters-21578 text categorization test collection. Available at: http://www.research.att.com/~lewis/reuters21578.html

Liu, Y., Yi, Z., Wu, H., Ye, M., & Chen, K. (2008). A tabu search approach for the minimum sum-of-squares clustering problem. Information Sciences, 178(12), 2680-2704.

Marinakis, Y., Marinaki, M., & Matsatsinis, N. (2009). A Hybrid Discrete Artificial Bee Colony - GRASP Algorithm for Clustering. IEEE, 548-553.

Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455-1465.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Paper presented at the Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.

Murphy, P. M., & Aha, D. W. (1992). UCI machine learning repository. http://www.ics.uci.edu/~mlearn//MLRepository.html

Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithm. Pattern Recognition Letters, 17, 825-832.

Özbakir, L., Baykasoglu, A., & Tapkan, P. (2010). Bees algorithm for Generalized Assignment Problem. Applied Mathematics and Computation, 215, 3782-3795.


Pham, D. T., & Afify, A. A. (2006). Clustering techniques and their applications in engineering. Submitted to Proceedings of the Institution of Mechanical Engineers, Journal of Mechanical Engineering Science.

Pham, D. T., Castellani, M., & Ghanbarzadeh, A. (2007). Preliminary design using the Bees Algorithm. Paper presented at the Eighth International Conference on Laser Metrology, CMM and Machine Tool Performance, LAMDAMAP, Euspen, UK, Cardiff.

Pham, D. T., Ghanbarzadeh, A., Koç, E., & Otri, S. (2006). Application of the Bees Algorithm to the training of radial basis function networks for control chart pattern recognition. Paper presented at the 5th CIRP International Seminar on Intelligent Computation in Manufacturing Engineering (ICME-06), Ischia, Italy.

Pham, D. T., Ghanbarzadeh, A., Koç, E., Otri, S., Rahim, S., & Zaid, M. (2006). The Bees Algorithm - A novel tool for complex optimization problems. Paper presented at IPROMS 2006, Proceedings of the 2nd Virtual International Conference on Intelligent Production Machines and Systems, Cardiff, UK.

Pham, D. T., Koç, E., Lee, J. Y., & Phrueksanant, J. (2007). Using the Bees Algorithm to schedule jobs for a machine. Paper presented at the Eighth International Conference on Laser Metrology, CMM and Machine Tool Performance, LAMDAMAP, Euspen, UK, Cardiff.

Pham, D. T., Otri, S., Afify, A., Mahmuddin, M., & Al-Jabbouli, H. (2007). Data Clustering Using the Bees Algorithm. Paper presented at the 40th CIRP International Manufacturing Systems Conference, Manufacturing Engineering Centre, Cardiff University, Cardiff, UK.

Rokach, L., & Maimon, O. (2005). Clustering Methods. In O. Maimon & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook (pp. 321-352). New York: Springer.

Seeley, T. D. (1996). The Wisdom of the Hive: The Social Physiology of Honey Bee Colonies. Cambridge: Harvard University Press.

Selim, S. Z., & Ismail, M. A. (1984). K-means type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 81-87.

Srinivas, M., & Patnaik, L. M. (1994). Adaptive Probabilities of Crossover and Mutation in Genetic Algorithms. IEEE Transactions on Systems, Man and Cybernetics, 24, 656-667.

Teodorovic, D., & Dell'Orco, M. (2005). Bee colony optimization - A cooperative learning approach to complex transportation problems. Advanced OR and AI Methods in Transportation, 51-60.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the Self-Organizing Map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Von Frisch, K. (1976). Bees: Their Vision, Chemical Senses and Language. Ithaca: Cornell University Press.

Wedde, H. F., Farooq, M., & Zhang, Y. (2004). BeeHive: An efficient fault-tolerant routing algorithm inspired by honey bee behavior. In M. Dorigo (Ed.), Ant colony optimization and swarm intelligence (pp. 83-94). Berlin: Springer, LNCS.

    Xiao, J., Yan, Y. P., Zhang, J., & Tang, Y. (2010). A quantum-inspired genetic algorithm for k-means clustering. Expert Systems with Applications, 37 , 4966-4973.


Xu, R., & Wunsch II, D. (2005). Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.

Yang, X. S. (2005). Engineering optimizations via nature-inspired virtual bee algorithms. In J. M. Yang & J. R. Alvarez (Eds.), IWINAC 2005 (Vol. 3562, pp. 317-323). Berlin Heidelberg: Springer-Verlag, LNCS.