

A New Encoding Scheme for a Bee-Inspired Optimal Data Clustering Algorithm

Dávila Patrícia Ferreira Cruz
Natural Computing Laboratory (LCoN)
Graduate Program in Electrical Engineering
Mackenzie University, São Paulo, Brazil
[email protected]

Renato Dourado Maia
Computer Sciences Department
State University of Montes Claros
Montes Claros, MG, Brazil
[email protected]

Leandro Nunes de Castro
Natural Computing Laboratory (LCoN)
Graduate Program in Electrical Engineering
Mackenzie University, São Paulo, Brazil
[email protected]

Abstract — The amount of data generated in different knowledge areas has made necessary the use of data mining tools capable of automatically analyzing and extracting knowledge from datasets. Clustering is one of the most important tasks in data mining and can be defined as the process of partitioning objects into groups or clusters, such that objects in the same group are more similar to one another than to objects belonging to other groups. In this context, this paper proposes a new encoding scheme for cOptBees, a bee-inspired algorithm to solve data clustering problems. In this new encoding, each bee represents a set of cluster prototypes. The algorithm was run on different datasets and the results showed high-quality clusters and diversity of solutions, whilst a suitable number of clusters was automatically determined.

Keywords — swarm intelligence; optimal data clustering; dynamic-size population; bee-inspired algorithms.

I. INTRODUCTION

Data mining has attracted attention from industry due to the availability of large amounts of data and the need to transform them into knowledge. The goal is to discover patterns in databases that can be used in market analysis, fraud detection, customer retention, and many other applications. Clustering, one of the main data mining tasks, is the organization of a collection of objects into clusters: when the objects are well allocated into a cluster, they are more similar to one another than to those belonging to other clusters [1]. Clustering can be used, for instance, in business to determine groups of customers that have similar behaviors, or in medicine to determine groups of patients that show similar reactions to a specific medicine [2].

This paper proposes a new implementation of the cOptBees clustering algorithm [3], inspired by the collective decision-making process in bee colonies, to solve data clustering problems. cOptBees is an adaptation of the OptBees optimization algorithm [4, 5]. OptBees is able to generate and maintain diversity of solutions by finding multiple suboptimal solutions in a single run, a feature useful for solving multimodal optimization problems. The goal of the present paper is to explore this multimodal capability of OptBees in order to design an optimal clustering algorithm [6].

This paper is organized as follows: Section II introduces cOptBees, an algorithm inspired by bee colonies to solve optimal data clustering problems; Section III presents the experimental results; and Section IV presents the conclusions and points out proposals for future work.

II. COPTBEES: AN ALGORITHM INSPIRED BY BEE COLONIES TO PERFORM OPTIMAL DATA CLUSTERING

This section proposes a new implementation of cOptBees [3], an algorithm that solves data clustering problems inspired by the foraging behavior of bee colonies. This algorithm is an adaptation of OptBees, originally designed to solve continuous optimization problems [5]. The key features of the collective decision-making by bee colonies used in the design of OptBees are [7]:

1. Bees dance to recruit nestmates to a food source.
2. Bees adjust the exploration and recovery of food according to the colony state.
3. Bees, unlike ants, explore multiple food sources simultaneously, but almost invariably converge to the same new construction site for the nest.
4. There is a positive linear relationship between the number of bees dancing and the number of bees recruited to a food source: the linear system of recruitment means that workers are evenly distributed among similar options.
5. The bee dance communicates the distance and direction of new nest sites. Recruitment for the new site continues until a threshold number of bees is reached.
6. The quality of the food source influences the bee dance.
7. All bees retire after some time, which means that, regardless of the quality of the new site, the bees stop recruiting others. This retirement depends on the quality of the site: the higher the quality, the later the retirement.

The OptBees algorithm, summarized in the box below, was tested on the twenty-five minimization problems proposed by the 2005 IEEE Congress on Evolutionary Computation (CEC) [8]. The results showed that OptBees is able to generate and maintain the diversity of candidate solutions, finding multiple local optima without compromising its global search capability. The main features and behavior of OptBees suggest it can be successfully adapted to solve other types of problems, such as clustering, where the generation and maintenance of diversity are important. The first version of cOptBees was presented in [3] and shown to be able to generate and maintain the diversity of solutions by finding multiple suboptimal solutions in a single run, a useful feature for solving multimodal optimization problems such as optimal data clustering. The implementation presented in this paper is based on the ideas presented in [6], but uses the final version of OptBees [5].



The main modifications introduced in OptBees so as to apply it to clustering problems are described in the following sections.

OptBees Algorithm

Input Parameters:
• nmin: initial number of active bees.
• nmax: maximum number of active bees.
• ρ: inhibition radius.
• α: recruitment rate.
• nmean: average foraging effort.
• pmin: minimum probability of a bee being a recruiter.
• prec: percentage of non-recruiter bees that will actually be recruited.

Output Parameters:
• Active bees and the respective values of the objective function.

1. Randomly generate a swarm of N bees.
while (stopping criterion is not attained) do
  2. Evaluate the quality of the sites being explored by the active bees.
  3. Apply local search.
  4. Determine the recruiter bees.
  5. Update the number of active bees.
  6. Determine the recruited and scout bees.
  7. Perform the recruitment process.
  8. Perform the exploration process.
end while
9. Evaluate the quality of the sites being explored by the active bees.
10. Apply local search.
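For readers who prefer code, the sketch below restates the box above as a Python control loop. The callables (`init_swarm`, `evaluate`, `local_search`, and so on) and their signatures are illustrative placeholders for the procedures detailed in the following subsections; they are not the authors' implementation, which, as reported in Section III, was written in Matlab.

```python
def optbees_loop(init_swarm, evaluate, local_search, determine_recruiters,
                 update_active, split_non_recruiters, recruit, explore,
                 iterations=50):
    """Control flow of the OptBees box above; each numbered step is
    injected as a callable, so only the loop structure is fixed here."""
    swarm = init_swarm()                                   # step 1
    for _ in range(iterations):                            # stopping criterion
        quality = evaluate(swarm)                          # step 2
        swarm, quality = local_search(swarm, quality)      # step 3
        recruiters = determine_recruiters(swarm, quality)  # step 4
        swarm, quality, recruiters = update_active(swarm, quality,
                                                   recruiters)       # step 5
        recruited, scouts = split_non_recruiters(swarm, recruiters)  # step 6
        swarm = recruit(swarm, recruiters, recruited)      # step 7
        swarm = explore(swarm, scouts)                     # step 8
    quality = evaluate(swarm)                              # step 9
    swarm, quality = local_search(swarm, quality)          # step 10
    return swarm, quality
```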

In cOptBees, the active bees can belong to one of three groups, according to their task: 1) recruiters, who attract other bees to explore a promising region of the search space; 2) recruited, who are recruited by recruiters to explore a promising region of the search space; or 3) scout bees, who randomly look for new promising regions of the space [3].

A. Encoding Scheme

In this new implementation of cOptBees, each bee is composed of a set of prototypes that encodes a candidate clustering. A bee is defined by a matrix B of dimension i × j, where i = (d + 1), d being the number of attributes of the input data, and j is the maximum number of clusters in a clustering (rMax). Thus, in a given column j, lines 1 to d represent the coordinates of prototype Cj, and the last line holds a threshold value Lj ∈ [0, 1] that defines whether the centroid Cj is active or not: Cj is active when its threshold is greater than or equal to 0.5. Fig. 1 shows the matrix representation of a bee [6].

$$B = \begin{bmatrix} C_{1,1} & \cdots & C_{1,rMax} \\ C_{2,1} & \cdots & C_{2,rMax} \\ \vdots & \ddots & \vdots \\ C_{d,1} & \cdots & C_{d,rMax} \\ L_{1} & \cdots & L_{rMax} \end{bmatrix}$$

Fig. 1. Matrix representation of a bee in cOptBees.

The swarm is composed of N bees and, for each bee, the objects in the database are associated with the nearest prototype. The initial swarm is randomly generated, respecting the maximum number of clusters, rMax (an input parameter introduced in cOptBees).
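The sketch below illustrates one way this encoding could be realized in Python. The helper names (`random_bee`, `active_prototypes`, `assign_objects`) are hypothetical, and the guard that keeps at least one centroid active when every threshold falls below 0.5 is an assumption, since the paper does not state how that case is handled.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_bee(d, r_max, low, high):
    """A bee is a (d+1) x rMax matrix: rows 1..d hold the coordinates of
    the rMax candidate prototypes; the last row holds thresholds in [0,1]."""
    prototypes = rng.uniform(low, high, size=(d, r_max))
    thresholds = rng.uniform(0.0, 1.0, size=(1, r_max))
    return np.vstack([prototypes, thresholds])

def active_prototypes(bee):
    """Centroid C_j is active when its threshold L_j >= 0.5."""
    mask = bee[-1] >= 0.5
    if not mask.any():                 # assumption: keep at least one active
        mask[bee[-1].argmax()] = True
    return bee[:-1, mask]              # shape (d, number of active centroids)

def assign_objects(X, bee):
    """Label each object with the index of its nearest active prototype."""
    C = active_prototypes(bee)
    # n-by-k matrix of Euclidean distances from objects to active centroids
    dist = np.linalg.norm(X[:, None, :] - C.T[None, :, :], axis=2)
    return dist.argmin(axis=1)

# Toy usage: 6 two-dimensional objects, at most rMax = 4 clusters per bee.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 0.], [9., 1.]])
bee = random_bee(d=2, r_max=4, low=X.min(), high=X.max())
print(assign_objects(X, bee))
```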

B. Determination of the Recruiter Bees

The recruiter bees explore promising regions of the search space and recruit the closest bees. The number of recruited bees for each recruiter is proportional to the quality of the food source found. Determining the recruiter bees involves three steps. In the first step, a probability $p_i$ of being a recruiter bee is associated with each active bee [4]:

$$p_i = \frac{1 - p_{min}}{q_{max} - q_{min}} \cdot (q_i - q_{min}) + p_{min}, \qquad (1)$$

where $q_i$ represents the quality of the site being explored by bee i, and $q_{min}$ and $q_{max}$ represent, respectively, the minimum and maximum qualities among the sites being explored by the active bees in the current iteration (these quality values are determined using the objective-function value) [4].

In the second step the bees are processed and, according to the probabilities calculated in the previous step, are classified as recruiters or non-recruiters. In the third step, the recruiter bees are processed in order of the corresponding site qualities, from best to worst, and, for each recruiter bee, the other recruiters with high similarity to it are inhibited, i.e., reclassified as non-recruiters [4]. The similarity between two bees is calculated based on the objects classified in the same cluster, i.e., the greater the number of objects in common, the greater the similarity. The inhibition happens when the similarity between two bees is greater than or equal to the inhibition radius ρ, an input parameter that represents a percentage of the maximum possible similarity. This process prevents many recruiters from exploring the same promising region of the search space.
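A minimal sketch of these three steps follows. The pair-counting similarity used here, which counts object pairs that both partitions place together, is one plausible reading of "objects classified in the same cluster"; the function names, the toy data, and the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def recruiter_probability(q, p_min=0.01):
    """Eq. (1): map site qualities q_i linearly onto [p_min, 1]."""
    span = q.max() - q.min()
    if span == 0:                          # degenerate case: all sites equal
        return np.full_like(q, p_min, dtype=float)
    return (1.0 - p_min) / span * (q - q.min()) + p_min

def partition_similarity(labels_a, labels_b):
    """Number of object pairs grouped together by BOTH partitions (one
    plausible reading of the similarity described in the text)."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    return np.triu(same_a & same_b, k=1).sum()

def determine_recruiters(q, labels, rho=0.9, p_min=0.01):
    """Steps 1-3 of Section II.B: draw recruiters by probability, then,
    from the best site to the worst, inhibit any other recruiter whose
    similarity reaches rho times the maximum possible similarity."""
    n_obj = labels.shape[1]
    max_sim = n_obj * (n_obj - 1) // 2     # every pair grouped together
    is_rec = rng.random(len(q)) < recruiter_probability(q, p_min)
    for i in np.argsort(-q):               # best sites processed first
        if not is_rec[i]:
            continue
        for j in np.where(is_rec)[0]:
            if j != i and partition_similarity(labels[i],
                                               labels[j]) >= rho * max_sim:
                is_rec[j] = False           # inhibited: now a non-recruiter
    return is_rec

# Toy usage: 5 bees, each encoding a partition of 8 objects.
labels = rng.integers(0, 3, size=(5, 8))
q = rng.random(5)
print(determine_recruiters(q, labels))
```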

C. Number of Active Bees Update

Updating the number of active bees aims to adapt the foraging effort in accordance with the number of recruiters and the maximum number of active bees [5]. After the determination of the $n_r$ recruiter bees in a given iteration, $n_d = (n_r + 1) \cdot n_{mean}$ determines the desired number of active bees, where $n_{mean}$, the average foraging effort, determines the desired number of non-recruiter bees for each recruiter bee. If $n_d$ is greater than the current number of active bees, then $n_{change} = n_d - n_{active}$ bees have to become active in order to achieve $n_d$ active bees; if $n_d$ is less than the current number of active bees, then $n_{change} = n_{active} - n_d$ bees have to become inactive. This process is constrained by the maximum ($n_{max}$) and minimum ($n_{min}$) numbers of active bees: if $n_d > n_{max}$, then $n_d$ is set to $n_{max}$; otherwise, if $n_d < n_{min}$, then $n_d$ is set to $n_{min}$. When an inactive bee becomes active, it is inserted at a random position in the search space. For the inactivation process, the bees are selected according to the quality of the site they explore, from the worst to the best [5]. When a bee is inactivated, it is removed from the swarm and, when a bee is activated, it is inserted into the swarm, i.e., the swarm size varies dynamically. Through this procedure, the foraging effort (computational effort) adapts in accordance with the number of recruiter bees and the maximum number of active bees [5].
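The update arithmetic is small enough to state directly in code; the sketch below, with hypothetical function names, only computes the clamped target and the signed change (positive for activations, negative for retirements), leaving the insertion and removal of bees to the caller.

```python
import numpy as np

def desired_active_bees(n_recruiters, n_mean, n_min, n_max):
    """Section II.C: n_d = (n_r + 1) * n_mean, clamped to [n_min, n_max]."""
    return int(np.clip((n_recruiters + 1) * n_mean, n_min, n_max))

def activation_change(n_recruiters, n_active, n_mean=10, n_min=50, n_max=100):
    """Positive result: bees to activate at random positions;
    negative result: bees to retire, worst explored sites first."""
    return desired_active_bees(n_recruiters, n_mean, n_min, n_max) - n_active

# Example: 6 recruiters ask for (6+1)*10 = 70 active bees; with 55 bees
# currently active, 15 inactive bees are re-inserted at random positions.
print(activation_change(n_recruiters=6, n_active=55))   # -> 15
# 12 recruiters would ask for 130, which is clamped to n_max = 100.
print(activation_change(n_recruiters=12, n_active=55))  # -> 45
```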

D. Determination of the Recruited and Scout Bees

After the classification of bees as recruiter or non-recruiter, a percentage of the non-recruiter bees are classified as recruited and exploit promising regions already found. The other non-recruiters are classified as scout bees, which explore the search space to find new promising regions, reinforcing the generation and maintenance of diversity [4].

The number of non-recruiter bees, in Step 6, is determined by $n_{nr} = n_{active} - n_r$. The number of recruited bees is $n_{re} = [p_{rec} \cdot n_{nr}]$, where $p_{rec}$ is the percentage of non-recruiter bees that will be recruited and [.] denotes the nearest-integer function. The number of scout bees is $n_s = n_{nr} - n_{re}$. The process for determining the recruited bees works as follows. First, the number of recruited bees to be associated with each recruiter is determined: each recruiter recruits a number of bees proportional to the relative quality of the site it explores. With these numbers determined, the non-recruiter bees are processed and associated with the most similar recruiter. After these procedures, the remaining $n_s$ non-recruiter bees are considered scout bees [4].
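A sketch of the counting and quota logic follows; it computes $n_{nr}$, $n_{re}$ and $n_s$ and shares the recruited quota among recruiters in proportion to site quality (largest-remainder rounding is an assumption, since the paper does not specify how fractional quotas are resolved). The subsequent association of each recruited bee with its most similar recruiter is omitted here.

```python
import numpy as np

def split_non_recruiters(n_active, recruiter_q, p_rec=0.3):
    """Step 6 (Section II.D): n_nr = n_active - n_r non-recruiters,
    n_re = [p_rec * n_nr] of them recruited, the rest n_s scouts; the
    recruited quota is shared among recruiters proportionally to quality."""
    n_r = len(recruiter_q)
    n_nr = n_active - n_r
    n_re = int(round(p_rec * n_nr))          # [.]: nearest-integer function
    n_s = n_nr - n_re
    # Proportional quota per recruiter (largest-remainder rounding).
    share = np.asarray(recruiter_q, float) / np.sum(recruiter_q) * n_re
    quota = np.floor(share).astype(int)
    remainder = n_re - quota.sum()
    quota[np.argsort(-(share - np.floor(share)))[:remainder]] += 1
    return quota, n_s

# Example: 50 active bees, 4 recruiters with the site qualities below.
quota, n_scouts = split_non_recruiters(50, recruiter_q=[0.8, 0.6, 0.4, 0.2])
print(quota, n_scouts)   # quotas sum to round(0.3 * 46) = 14 recruited bees
```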

E. Recruitment Process

In the recruitment process, the recruiter bees attract the recruited bees to the sites they explore. This process is implemented by Eq. (2) or Eq. (3), each applied with 50% probability, where $\alpha$ is the recruitment rate, an input parameter, $x_i$ is the recruited bee, $y$ is the recruiter bee, $u$ is a random number with uniform distribution in the interval [0, 1], $U$ is a vector whose elements are random numbers with uniform distribution in the interval [0, 1] ($U$ has the same dimension as $x_i$ and $y$), and the symbol $\odot$ denotes the element-wise product [5]:

$$x_i = x_i + \alpha \cdot u \cdot (y - x_i) \qquad (2)$$

$$x_i = x_i + u \cdot U \odot (y - x_i) \qquad (3)$$
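A direct transcription of the two moves, assuming bee matrices flattened into vectors so the update applies entry-wise; the 50% coin flip and the function name are the only additions.

```python
import numpy as np

rng = np.random.default_rng(2)

def recruit(x, y, alpha=0.5):
    """Move recruited bee x toward recruiter y using Eq. (2) or Eq. (3),
    each chosen with 50% probability (Section II.E)."""
    u = rng.random()                         # scalar u ~ U[0, 1]
    if rng.random() < 0.5:
        return x + alpha * u * (y - x)       # Eq. (2)
    U = rng.random(x.shape)                  # vector U ~ U[0, 1]^d
    return x + u * U * (y - x)               # Eq. (3), element-wise product

# Toy usage with flattened bee representations.
x = rng.uniform(0, 1, size=6)
y = rng.uniform(0, 1, size=6)
print(recruit(x, y))
```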

F. Exploration Process

In the exploration process, the scout bees are moved to random positions in the search space, which allows them to explore new regions [3].

III. EXPERIMENTAL RESULTS

The experiments performed to evaluate the new implementation of cOptBees were similar to those presented in [3]. The algorithm was implemented in Matlab and applied to five datasets from the literature [9]. Entropy (E) and Purity (P) were used to evaluate the performance of the algorithms. Entropy measures the homogeneity of a clustering solution, showing how the objects are distributed over the groups [10]: the lower the entropy, the better the quality of the cluster. Purity, by contrast, indicates how pure a group is, i.e., the ratio of the dominant class of a group to the total number of objects in the group: the higher the purity, the better the quality of the group [10]. The cOptBees algorithm uses the modified Silhouette as fitness function [11, 12]. The Silhouette measure of an object $x_i$ is calculated by:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}}, \qquad (4)$$

where $a(x_i)$ represents the dissimilarity between $x_i$ and its centroid, and $b(x_i)$ represents the smallest dissimilarity between $x_i$ and the centroids of the other clusters [12].
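The sketch below implements Eq. (4) with the centroid-based dissimilarities of [11, 12], plus per-cluster entropy and purity. The entropy and purity formulas are common textbook forms and may differ in normalization from the exact definitions used in [10]; the function names and toy data are illustrative.

```python
import numpy as np

def simplified_silhouette(X, centroids, labels):
    """Eq. (4): a(x_i) = distance to the centroid of x_i's own cluster,
    b(x_i) = smallest distance to any other cluster's centroid."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    n = len(X)
    a = dist[np.arange(n), labels]
    d = dist.copy()
    d[np.arange(n), labels] = np.inf          # exclude the own centroid
    b = d.min(axis=1)
    s = (b - a) / np.maximum(a, b)
    return s.mean()                           # fitness of the clustering

def entropy_and_purity(labels, classes):
    """External indices of Section III: weighted per-cluster entropy of
    the true class distribution, and overall purity."""
    total, e, correct = len(labels), 0.0, 0
    for k in np.unique(labels):
        counts = np.bincount(classes[labels == k])
        p = counts[counts > 0] / counts.sum()
        e += counts.sum() / total * -(p * np.log2(p)).sum()
        correct += counts.max()               # dominant class in cluster k
    return e, correct / total

# Toy usage: two well-separated clusters, perfectly labeled.
X = np.array([[0., 0.], [0., 1.], [9., 9.], [9., 8.]])
centroids = np.array([[0., .5], [9., 8.5]])
labels = np.array([0, 0, 1, 1])
classes = np.array([0, 0, 1, 1])
print(simplified_silhouette(X, centroids, labels))
print(entropy_and_purity(labels, classes))    # -> (0.0, 1.0)
```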

A. Bidimensional Datasets

Initially, the algorithm was tested on two bidimensional datasets: one composed of Gaussian artificial classes, and the Ruspini dataset [13]. The algorithm was run fifteen times for each dataset, with the following input parameters: nmin = 50; nmax = 100; nmean = 10; rMax = 8; pmin = 0.01; prec linearly varying between 0.1 and 0.5 with the number of iterations; ρ linearly varying between 0.1 and 0.3 with the number of iterations; number of iterations = 50; and α = 0.5. It is important to highlight that these values were defined based on previous tests and that, in OptBees, α is not an input parameter, but a constant whose value is equal to 2.

Concerning results, the following values are presented: number of objects incorrectly labeled, classification error percentage (CEP), entropy, and purity. The results were compared with those obtained by the k-means clustering algorithm [14, 15]. The Ruspini dataset is composed of 75 objects organized into four classes, each object having two integer attributes. The Gaussian artificial dataset is composed of 150 objects organized into three classes, each object having two floating-point attributes. The graphical representations of these datasets are shown in Fig. 2.

Fig. 2. Graphical representation of the Ruspini and Gaussian artificial datasets.

Figs. 3 and 4 show the solutions encoded by the four best recruiter bees found by cOptBees in the first and last executions, respectively.

In Fig. 3 and Table I, it is possible to observe that the first recruiter, with fitness 0.8158, grouped all objects correctly, and its Entropy and Purity values were the best possible, 0 and 1, respectively. The other recruiters did not find the perfect clustering solution, but presented high-quality solutions, with good values of Entropy and Purity indicating that the clusters are homogeneous and pure.


Fig. 3. Graphical representation of the four best recruiter bees returned by cOptBees in the first execution with the Ruspini dataset.

Fig. 4. Graphical representation of the four best recruiter bees returned by cOptBees in the last execution with the Ruspini dataset.

The values of fitness, number of errors, CEP, Entropy, Purity and number of clusters for these recruiter bees are shown in Tables I and II.

TABLE I. RESULTS OF THE FOUR BEST RECRUITER BEES IN THE FIRST EXECUTION OF COPTBEES FOR THE RUSPINI DATASET.

Recruiter   Fitness   Errors   CEP      Entropy   Purity   Clusters
1st         0.8158    0        0%       0.0000    1.0000   4
2nd         0.6514    32       42.67%   0.1882    0.5733   2
3rd         0.6092    25       33.33%   0.1085    0.7867   4
4th         0.4785    24       32%      0.1589    0.6800   3

TABLE II. RESULTS OF THE FOUR BEST RECRUITER BEES IN THE LAST EXECUTION OF COPTBEES FOR THE RUSPINI DATASET.

Bee   Fitness   Errors   CEP      Entropy   Purity   Clusters
1st   0.8158    0        0%       0.0000    1.0000   4
2nd   0.6514    32       42.67%   0.1882    0.5733   2
3rd   0.5020    24       32%      0.1502    0.6800   3
4th   0.4883    33       44%      0.1447    0.7467   5

This illustrates the ability of the algorithm to generate and maintain diverse solutions by finding several high-quality solutions simultaneously. The other executions presented similar performances. By analyzing Tables I and II it is possible to observe that cOptBees grouped all objects correctly and also found other solutions with similar fitness but representing different partitions. Table III compares the results of cOptBees with those of k-means.

TABLE III. MEAN ± STANDARD DEVIATION OF FITNESS, ERRORS, ENTROPY, PURITY AND NUMBER OF CLUSTERS OF THE BEST SOLUTION OF EACH EXECUTION OF COPTBEES AND K-MEANS FOR THE RUSPINI DATASET.

            cOptBees     k-means
Fitness     0.816±0.00   0.759±0.084
Errors      0±0          7.8±11.5
CEP         0±0          10.4±15.287
Entropy     0±0          0.032±0.047
Purity      1±0          0.926±0.108
Clusters    4±0          4±0

The Gaussian artificial dataset presented results similar to those of the Ruspini dataset. Figs. 5 and 6 show the four best clusterings (recruiter bees) found by cOptBees in the first and last executions, respectively. The values of fitness, number of errors, CEP, Entropy, Purity and number of clusters for these recruiter bees are shown in Tables IV and V.

TABLE IV. RESULTS OF THE FOUR BEST RECRUITER BEES IN THE FIRST EXECUTION OF COPTBEES FOR THE GAUSSIAN ARTIFICIAL DATASET.

Recruiter   Fitness   Errors   CEP      Entropy   Purity   Clusters
1st         0.7926    0        0%       0.0000    1.0000   3
2nd         0.6459    50       33.33%   0.1003    0.6667   2
3rd         0.5682    50       33.33%   0.1079    0.6667   2
4th         0.5507    55       36.67%   0.0144    0.9867   6

TABLE V. RESULTS OF THE FOUR BEST RECRUITER BEES IN THE LAST EXECUTION OF COPTBEES FOR THE GAUSSIAN ARTIFICIAL DATASET.

Bee   Fitness   Errors   CEP      Entropy   Purity   Clusters
1st   0.7926    0        0%       0.0000    1.0000   3
2nd   0.6459    50       33.33%   0.1003    0.6667   2
3rd   0.5711    50       33.33%   0.1003    0.6667   2
4th   0.5263    51       34%      0.1158    0.6600   2


Fig. 5. Graphical representation of the four best bees returned by cOptBees in the first execution with the Gaussian artificial dataset.

Fig. 6. Graphical representation of the four best bees returned by cOptBees in the last execution with the Gaussian artificial dataset.

Table VI shows the cOptBees and k-means performance for the Gaussian artificial dataset.

TABLE VI. MEAN ± STANDARD DEVIATION OF FITNESS, ERRORS, ENTROPY, PURITY AND NUMBER OF CLUSTERS OF THE BEST SOLUTION OF EACH EXECUTION FOR COPTBEES AND THE K-MEANS ALGORITHM FOR THE GAUSSIAN ARTIFICIAL DATASET.

            cOptBees     k-means
Fitness     0.793±0.00   0.774±0.071
Errors      0±0          5±19.1
CEP         0±0          3.333±12.910
Entropy     0±0          0.007±0.026
Purity      1±0          0.978±0.086
Clusters    3±0          3±0

The results show that the performance of cOptBees was better than that of k-means for these simple bidimensional datasets. The cOptBees algorithm found the correct clustering partitions in all tests for both datasets. Also, unlike the proposed algorithm, k-means needs the number of groups to be informed in advance.

B. N-Dimensional Datasets

The algorithm was also tested on five datasets from the UCI Machine Learning Repository, summarized in Table VII.

TABLE VII. MAIN FEATURES OF THE DATASETS USED FOR THE PERFORMANCE COMPARISON.

Dataset     Number of Objects   Number of Attributes   Number of Classes
Iris        150                 4                      3
Wine        178                 13                     3
Balance     625                 4                      3
Haberman    306                 3                      2
Ecoli       327                 7                      5

To evaluate the performance of cOptBees, the results were compared with those obtained by FaiNet (Fuzzy Artificial Immune Network), aiNet (Artificial Immune Network), FCM and FPSC, reported in [10], and by k-means [15]. As done in [10], for the Ecoli dataset only five classes were considered (cp, im, imu, om, pp), and the algorithm was run ten times for each dataset. The following values were used for the input parameters: nmin = 100; nmax = 200; nmean = 10; pmin = 0.01; prec linearly varying between 0.1 and 0.5 with the number of iterations; ρ linearly varying between 0.1 and 0.5 with the number of iterations; number of iterations = 50; and α = 0.5. In each run, the values of Entropy, Purity and number of clusters for each solution found were calculated. For k-means, FPSC and FCM, the initial number of classes was set equal to the correct number of classes (Table VII). For FaiNet and aiNet, the algorithms start with a single cell for each dataset [10]. For cOptBees it is necessary to inform the maximum number of clusters, so rMax = 6 was used for the Iris, Wine and Balance datasets, rMax = 4 for Haberman, and rMax = 10 for Ecoli. Table VIII shows the mean and standard deviation of the fitness of the best bee in cOptBees and the comparison with the results obtained by k-means.

TABLE VIII. MEAN ± STANDARD DEVIATION OF THE FITNESS OF THE BEST SOLUTION OF COPTBEES AND THE COMPARISON WITH K-MEANS.

Dataset     k-means       cOptBees
Iris        0.661±0.010   0.775±0.002
Wine        0.677±0.000   0.745±0.000
Balance     0.277±0.005   0.285±0.020
Haberman    0.534±0.002   0.773±0.078
Ecoli       0.434±0.015   0.554±0.000

The results presented in Table VIII show that cOptBees obtained a better performance than the k-means algorithm. Table IX shows the mean and standard deviation of Entropy, Purity and number of clusters for cOptBees and the comparison with the results presented by the other algorithms. The results in Table IX show that cOptBees, for n-dimensional datasets, was able to find the correct number of groups and obtained good values of purity and entropy, suggesting that the proposed algorithm is competitive with the other tested ones.

TABLE IX. MEAN ± STANDARD DEVIATION OF ENTROPY (E), PURITY (P) AND NUMBER OF CLUSTERS (C) FOR THE FCM, FPSC, FAINET, AINET, K-MEANS AND COPTBEES ALGORITHMS.

Dataset    Metric   FCM         FPSC        FaiNet      aiNet       k-means     cOptBees
Iris       E        0.27±0.01   0.26±0.02   0.24±0.04   0.27±0.06   0.08±0.02   0.11±0.00
           P        0.89±0.0    0.89±0.01   0.89±0.03   0.88±0.04   0.85±0.1    0.67±0.00
           C        3.0±0.0     3.0±0.0     5.70±0.48   4.8±1.03    3.0±0.0     3.2±0.6
Wine       E        0.16±0.0    0.19±0.02   0.24±0.08   0.16±0.07   0.15±0.0    0.15±0.0
           P        0.95±0.0    0.94±0.01   0.88±0.05   0.88±0.05   0.7±0.0     0.66±0.0
           C        3.0±0.0     3.0±0.0     15.0±3.27   29.40±5.5   3.0±0.0     2.0±0.0
Balance    E        0.68±0.0    0.72±0.07   0.50±0.03   0.52±0.04   0.15±0.01   0.15±0.01
           P        0.67±0.06   0.61±0.07   0.74±0.04   0.72±0.04   0.65±0.03   0.62±0.05
           C        3.0±0.0     3.0±0.0     10.5±0.85   10.4±0.71   3.0±0.0     2.2±0.6
Haberman   E        0.68±0.0    0.72±0.07   0.50±0.03   0.52±0.04   0.12±0.0    0.11±0.0
           P        0.67±0.06   0.61±0.07   0.74±0.04   0.72±0.04   0.74±0.0    0.74±0.0
           C        2.0±0.0     2.0±0.0     10.5±0.85   6.5±0.71    2.0±0.0     2.0±0.0
Ecoli      E        0.32±0.0    0.35±0.02   0.33±0.03   0.38±0.07   0.12±0.01   0.18±0.0
           P        0.79±0.0    0.77±0.01   0.77±0.03   0.73±0.05   0.78±0.03   0.67±0.0
           C        5.0±0.0     5.0±0.0     10.4±1.90   11.4±2.46   5.0±0.0     2.2±0.4

IV. CONCLUSIONS AND FUTURE RESEARCH

This paper proposed a new implementation of cOptBees, an adaptation of the OptBees optimization algorithm, inspired by the foraging behavior of bee colonies, for performing optimal data clustering. In this new implementation, each bee represents a set of prototypes and each object of the dataset is associated with the nearest one. To evaluate the performance of cOptBees, it was tested with different datasets. First, the algorithm was tested with bidimensional datasets and compared with the k-means clustering algorithm. Second, cOptBees was tested using five n-dimensional datasets from the literature and its results were compared with those of k-means, FaiNet, aiNet, FCM and FPSC.

The results show that, for the bidimensional datasets and under the conditions tested, the algorithm obtained better performance than k-means. For the n-dimensional datasets, cOptBees proved to be competitive with other clustering algorithms from the literature. The algorithm presented in this paper obtained a satisfactory performance, being able to find high-quality cluster partitions without the need to inform the correct number of clusters in the dataset. cOptBees also proved able to generate and maintain the diversity of solutions by finding multiple suboptimal solutions in a single run, a feature useful for solving multimodal optimization problems such as optimal data clustering. As future research, the following steps are suggested: to perform tests using other fitness functions; to perform a parametric sensitivity analysis to increase the understanding of the effect of the parameters on the performance of the algorithm; to test other similarity measures; to analyze the results with other validity indexes; to test the algorithm with other datasets; and to compare in detail the original implementation of cOptBees [3] with the new implementation presented in this paper.

ACKNOWLEDGMENT

The authors thank grant #2013/12005-7, São Paulo Research Foundation (FAPESP), and Capes, CNPq, MackPesquisa and Fapemig for the financial support.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco: Elsevier, 2006.

[2] R. Elmasri and S. B. Navathe, Sistemas de Banco de Dados, 4th ed. São Paulo, Brazil: Pearson, 2006.

[3] D. P. F. Cruz, R. D. Maia, L. N. Castro and A. Szabo, “A Bee-Inspired Algorithm for Optimal Data Clustering,” in: IEEE Congress on Evolutionary Computation, Cancún, México, 2013, pp. 3140-3147.

[4] R. D. Maia, L. N. Castro and W. M. Caminhas, “Bee Colonies as Model for Multimodal Continuous Optimization: The OptBees Algorithm,” in: IEEE World Congress on Computational Intelligence, Brisbane, Australia, 2012, pp. 3316-3323.

[5] R. D. Maia, “Colônias de Abelhas como Modelo para Otimização Multimodal em Espaços Contínuos: uma Abordagem Baseada em Alocação de Tarefas”, doctoral dissertation, Universidade Federal de Minas Gerais, Engenharia Elétrica, Belo Horizonte, 2012.

[6] R. D. Maia, W. O. Barbosa and L. N. De Castro, “Colônias de Abelhas como Modelo para Agrupamento Ótimo de Dados: Uma proposta e análise paramétrica qualitativa,” in: XIX Congresso Brasileiro de Automática, Campina Grande, 2012.

[7] J. E. Gadau and J. Fewell, Organization of Insect Societies: From Genome to Sociocomplexity. Harvard University Press, Cambridge, 2009.

[8] P. N. Suganthan, N. Hansen, J. J. Liang, K. Deb, Y. P. Chen, A. Auger and S. Tiwari, “Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization”, Technical Report, Nanyang Technol. University, Singapore, 2005.

[9] UCI Machine Learning Repository, 1998. Available: http://archive.ics.uci.edu/ml/ [Accessed: Feb. 2013].

[10] A. Szabo, L. N. De Castro and M. R. Delgado, “FaiNet: An immune algorithm for fuzzy clustering,” in: Proc. IEEE International Conference on Fuzzy Systems, 2012, pp. 1-9.

[11] P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”, Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.

[12] E. R. Hruschka, L. N. De Castro and R. J. G. B. Campello, “Evolutionary algorithms for clustering gene-expression data,” in: Proc. IEEE International Conference on Data Mining, 2004, pp. 403-406.

[13] E. H. Ruspini, “Numerical methods for fuzzy clustering,” Information Sciences, vol. 2, pp. 319-350, 1970.

[14] D. Karaboga and C. Ozturk, “A novel clustering approach: Artificial Bee Colony (ABC) algorithm,” Applied Soft Computing, vol. 11, no. 1, pp. 652-657, Jan. 2011.

[15] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in: Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, California, 1967, pp. 281-297.
