
International Advances in Engineering and Technology (IAET)

ISSN: 2305-8285 Vol.22 2014

www.scholarism.net International Scientific Researchers (ISR)


Presenting a Method for Clustering Using Cuckoo Optimization Algorithm

Samane Asadi1, Vahid Rafe

1- Department of Computer Engineering, Arak Branch, Islamic Azad University, Arak, Iran

Abstract-With the growth of data and of computers' computing and storage power, many researchers have become interested in the patterns and relations hidden in these data. A large data collection can be mined through different techniques, among which clustering is one of the most important. Clustering is an unsupervised learning process whose main objective is to organize data into clusters such that the similarity between data within a cluster is maximized and the similarity between data of different clusters is minimized. As the application fields of clustering expand, the need for an efficient clustering method keeps growing. Many algorithms have been presented to this end. These algorithms face problems and require the number of clusters to be determined before they are applied. The Cuckoo Optimization Algorithm is used in the present study as one of the newest evolutionary algorithms, with a high level of accuracy in solving different problems and reaching the global optimum. Samples from the UCI databases were used to validate the suggested method, and the results of its implementation were compared to those of well-known algorithms such as K-means. The accuracy and convergence speed of the suggested algorithm were found to be significant compared to the results of the other algorithms.

Keywords: Data mining, K-means, Cuckoo Optimization Algorithm

1. Introduction

Today, most organizations collect and store data rapidly, yet despite the massive volume of data they lack the knowledge needed for decision making. Therefore, data mining, the automated analysis of data to overcome deficiencies in decision making and to extract the information and knowledge hidden in the data, is an obvious necessity [1].


Clustering is a data-mining technique and an unsupervised learning process that is applicable to many fields of study, including medical data analysis, biology, detection of abnormal cases, marketing, etc. [2-10]. Generally, clustering algorithms are divided into two types, Hierarchical and Partitional. Based on how the structure is generated, hierarchical clustering approaches are categorized as Divisive and Agglomerative. In the Divisive approach, also known as the top-down approach, all the data are initially placed in one cluster; the data with the least similarity are then split into separate clusters through an iterative process, which continues until a specific number of clusters is attained. The Agglomerative, or bottom-up, approach works in the opposite way: each piece of data is initially considered a separate cluster, and the most similar clusters are combined through an iterative process until one cluster or a certain number of clusters is formed [11-13].

A Partitional clustering algorithm divides the data into clusters at a single level, which lets it work with a wide range of data, and optimizes a predetermined cost function. Most Partitional clustering algorithms are center-based. One of the most important and most widely used Partitional center-based algorithms is K-means. Despite the efforts made to overcome the deficiencies of K-means, the optimal solution is not necessarily achieved [11]. Many algorithms have been presented so far to improve K-means and clustering; they are reviewed in Section 2.

To overcome the deficiencies of K-means, the present study uses a hybrid algorithm called COA-KM, which is based on the Cuckoo Optimization Algorithm (COA) and solves the problem by determining the cuckoos' egg laying and habitats [14]. To take advantage of the convergence speed of K-means, the suggested algorithm uses K-means to determine a cuckoo's egg-laying site. COA-KM is shown to overcome deficiencies of other evolutionary algorithms and to reach a proper level of accuracy. Section 2 reviews the past literature on clustering approaches. The clustering problem is discussed in Section 3 and the basic concepts of COA in Section 4. The suggested algorithm is described in Section 5. Finally, the results of the comparison between COA-KM and other algorithms are presented in Section 6, followed by the conclusion in Section 7.

2. Literature Review

In [15], a Genetic Algorithm with a new hybrid operator that exchanges neighboring centers was used for K-means clustering. In [16], the Genetic Algorithm was combined with K-means into an algorithm called GAK. Clustering based on the Tabu Search Algorithm was done in [17], and [18] used neighborhood functions and operators that improve Tabu Search to present a different clustering approach. The Simulated Annealing Algorithm was used in [19], and in [20] it was combined with K-means into a new hybrid algorithm, SAKM; both studies verified the efficiency of the suggested method on available databases. Ant Colony Optimization (ACO) was applied to clustering in [21]. In [22], the social life of honey bees was simulated and clustering was done according to Honey-Bees Mating Optimization (HBMO). Particle Swarm Optimization (PSO) was applied in [23], and the hybrid algorithm GSA-KM, based on the Gravitational Search Algorithm, was used for clustering in [24].

3. Clustering Problem

Clustering is an NP-hard problem that aims to find clusters such that a defined dissimilarity measure is minimized. Clustering is therefore an ordinary optimization problem and, like other optimization problems, requires a mathematical formulation. Let S = {X1, X2, …, Xn} contain all the points to be clustered, let K be the number of clusters, and let {C1, C2, …, CK} be the clusters, each represented by its center. The following conditions must hold (Al-Sultan, 1995) [17]:

Each cluster should contain at least one data point:

$$C_i \neq \emptyset \quad \text{for } i = 1,\dots,K \qquad (1)$$

Different clusters should not have any data in common:

$$C_i \cap C_j = \emptyset \quad \text{for } i,j = 1,\dots,K,\ i \neq j \qquad (2)$$

Each data point should be assigned to one cluster, i.e. after the assignment the clusters together should contain exactly the initial data:

$$\bigcup_{i=1}^{K} C_i = S \qquad (3)$$
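Conditions (1) to (3) can be checked mechanically. The following is a minimal sketch, assuming each cluster is stored as a Python set of point indices; the function name `is_valid_partition` is hypothetical and does not come from the paper.

```python
def is_valid_partition(clusters, points):
    """Check conditions (1)-(3) for clusters given as sets of point indices."""
    all_points = set(points)
    # (1) every cluster holds at least one point
    if any(len(c) == 0 for c in clusters):
        return False
    # (2) clusters are pairwise disjoint: their sizes add up to the size of their union
    union = set().union(*clusters)
    if sum(len(c) for c in clusters) != len(union):
        return False
    # (3) together the clusters cover exactly the data set S
    return union == all_points
```

For example, `is_valid_partition([{0, 1}, {2}], range(3))` holds, whereas two clusters that share point 1 violate condition (2).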

The main objective of this paper is Partitional center-based clustering, in which the clusters are produced so as to optimize a predetermined cost function. The most widely used cost function for these techniques is the standard squared error, which performs very well on dense and isolated clusters. This criterion is defined in Equation 4 [11,19].

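$$J = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\lVert X_i^{(j)} - C_j \right\rVert^2 \qquad (4)$$

In Equation 4, ‖·‖ stands for the distance measure, Cj is the j-th cluster center, $X_i^{(j)}$ is the i-th data point assigned to cluster j, and $n_j$ is the number of points in that cluster.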

K-means performs Partitional center-based clustering in a simple way: it assigns a set of data to a pre-specified number of clusters, say k. The main idea is to define k centers, one for each cluster. These centers must be chosen carefully because the results depend on them; the more distant the centers are from each other, the better. The next step is to assign each data point to its closest center. Once all data have been assigned, the first step is complete and an initial clustering is available; k new centers are then computed from the clusters of the previous step, and the data are assigned again to the closest of the new centers. These two steps are repeated until the centers no longer change [11].
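The assignment and update steps described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in this study: it assumes Euclidean distance, picks the initial centers as random data points, and the names `kmeans`, `max_iter`, and `seed` are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are random data points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: the centers stopped moving
            break
        centers = new_centers
    return centers, labels
```

Because the result depends on the initial centers, different seeds can end in different local optima on harder data, which is the deficiency the hybrid algorithm is meant to address.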

It is worth noting that the main objective of this paper is to do center-based clustering and to find the cluster centers in such a way that the objective function is minimized.

4. Cuckoo Optimization Algorithm

A brief description of Cuckoo Optimization Algorithm is presented in this

section.







Cuckoo Optimization Algorithm (COA) is an evolutionary technique that was introduced in [14]. This algorithm is inspired by the lifestyle of a bird called the cuckoo. This bird does not make a nest for itself; it uses the nests of other birds for laying eggs. The cuckoo has developed the ability to lay eggs that resemble those of the host bird. If the host bird discovers eggs that are not its own, it either throws them away or leaves the nest and builds a nest elsewhere. Cuckoo eggs are larger than the host bird's eggs, and the cuckoo chick hatches sooner. After hatching, the cuckoo chick either throws the host bird's eggs out of the nest or demands so much food that the other chicks die of hunger. When the cuckoo chick grows and becomes a mature bird, it instinctively continues its mother's way of life.

4.1 Generating initial cuckoo habitat

In order to solve an optimization problem, it is necessary that the values of the problem variables be formed as an array. In GA terminology this array is called a "chromosome", but in COA it is called a "habitat" [14]. To start the optimization algorithm, a candidate habitat matrix is generated. Then a randomly produced number of eggs is assumed for each of these initial cuckoo habitats. In nature, each cuckoo lays from 5 to 20 eggs. These values are used as the lower and upper bounds of the eggs assigned to each cuckoo at different iterations. Another habit of real cuckoos is that they lay eggs within a maximum distance from their habitat. This maximum range is called the "Egg Laying Radius (ELR)". Each cuckoo has an ELR that is proportional to the total number of eggs, the number of the current cuckoo's eggs, and the variable limits varhi and varlow [14]. So ELR is defined as:

$$ELR = \alpha \times \frac{\text{Number of current cuckoo's eggs}}{\text{Total number of eggs}} \times (var_{hi} - var_{low}) \qquad (5)$$

where α is an integer that is supposed to handle the maximum value of ELR.
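As a small worked example of Eq. (5), the sketch below computes the radius for one cuckoo. The function and parameter names are hypothetical, and the numbers are chosen only for illustration.

```python
def egg_laying_radius(alpha, eggs_of_cuckoo, total_eggs, var_hi, var_low):
    """Eq. (5): ELR = alpha * (eggs of this cuckoo / total eggs) * (var_hi - var_low)."""
    if total_eggs <= 0:
        raise ValueError("total_eggs must be positive")
    return alpha * eggs_of_cuckoo / total_eggs * (var_hi - var_low)


# A cuckoo holding 5 of 20 eggs, with variables bounded in [0, 10] and alpha = 1.
radius = egg_laying_radius(alpha=1, eggs_of_cuckoo=5, total_eggs=20, var_hi=10.0, var_low=0.0)
```

Here the cuckoo owns a quarter of all eggs, so its radius is a quarter of the variable range, i.e. 2.5.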

4.2 Immigration of cuckoos

When young cuckoos grow and become mature, they live in their own area and form a society for some time. But when the egg-laying season approaches, they migrate to new habitats where the host eggs are most similar to theirs and where there is more food for the new young birds. After the cuckoo groups are formed in different areas, the society with the highest fitness value is selected as the goal point, and the other cuckoos move toward that point.

When mature cuckoos live throughout the environment, it is difficult to identify which group each cuckoo belongs to. Once the cuckoo groups are identified, their mean benefit value is calculated. The group with the maximum mean benefit is the goal group, and consequently that group's best habitat is the new destination habitat for the migrating cuckoos.

When moving toward the goal point, the cuckoos do not fly all the way to the destination habitat. They only fly a part of the way, and they also deviate from the straight path.
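The partial flight with deviation can be illustrated in two dimensions. The sketch below is a simplification under stated assumptions: the fraction of the distance flown is drawn uniformly from [0, 1), the deviation angle is drawn uniformly from [-π/6, π/6], and the function name `migrate` is hypothetical. None of these values are given in this paper, and real habitats have k × d coordinates rather than two.

```python
import math
import random


def migrate(position, goal, rng, max_deviation=math.pi / 6):
    """Move a 2-D cuckoo part of the way toward goal, with an angular deviation."""
    lam = rng.random()                                  # assumed: fraction of the distance flown
    phi = rng.uniform(-max_deviation, max_deviation)    # assumed: deviation angle in radians
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    # Rotate the direction vector by phi, then scale it by lam.
    step_x = lam * (dx * math.cos(phi) - dy * math.sin(phi))
    step_y = lam * (dx * math.sin(phi) + dy * math.cos(phi))
    return (position[0] + step_x, position[1] + step_y)
```

Since the step is the rotated direction vector scaled by a factor below one, a cuckoo never travels farther than its original distance to the goal.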

Figure 1 shows the pseudocode of the Cuckoo Optimization Algorithm [14].

1. Initialize cuckoo habitats with random points
2. Define ELR for each cuckoo
3. Let cuckoos lay eggs inside their corresponding ELR
4. Kill those eggs that are identified by host birds
5. Eggs hatch and chicks grow
6. Evaluate the habitat of each newly grown cuckoo
7. Limit cuckoos' maximum number in environment and kill those that live in worst habitats
8. Cuckoos find best group and select goal habitat
9. Let new cuckoo population move toward goal habitat
10. If stop condition is satisfied end, if not go to 2

Fig. 1: Pseudocode of the Cuckoo Optimization Algorithm

5. The Suggested COA-KM Algorithm

The Cuckoo Optimization Algorithm plays the main role in the COA-KM algorithm, and K-means is applied for a better and faster search of the solution space.

Step 1: generating cuckoos initial population

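Initial habitats and the number of eggs per cuckoo are randomly initialized based on Eqs. (6) and (7).

$$Habitat = \begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_{N_{pop}} \end{bmatrix}, \qquad H_i = Habitat_i = [Center_1, Center_2, \dots, Center_k], \quad i = 1,2,\dots,N_{pop}$$

$$Center_j = [x_1, x_2, \dots, x_d], \quad j = 1,2,\dots,k \qquad (6)$$

$$x_l = Rand(\cdot) \times (x^{j}_{max} - x^{j}_{min}) + x^{j}_{min}, \qquad x^{j}_{min} \le x_l \le x^{j}_{max}, \quad l = 1,\dots,d$$

Hi stands for the i-th cuckoo's habitat in Eq. (6), and Centerj is the j-th cluster center of the i-th cuckoo. Npop, d, and k are respectively the population size, the dimension of the problem, and the number of clusters. $x^{j}_{max}$ and $x^{j}_{min}$ are the upper and lower bounds of the points belonging to the j-th cluster.

$$Number\ of\ Eggs = \begin{bmatrix} NE_1 \\ NE_2 \\ \vdots \\ NE_{N_{pop}} \end{bmatrix} \qquad (7)$$

$$NE_i = \lceil Rand(\cdot) \times (E_{max} - E_{min}) \rceil + E_{min}, \qquad E_{min} \le NE_i \le E_{max}, \quad i = 1,2,\dots,N_{pop}$$

Emax and Emin indicate the maximum and minimum numbers of each cuckoo's eggs. H1 is equal to the cluster centers generated in Step 1.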
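The initialization of Eqs. (6) and (7) might look as follows in Python. This is a sketch under an explicit assumption: since cluster membership is unknown before clustering, the bounds are taken per feature over the whole data set rather than per cluster. The function name `init_population` and its parameters are hypothetical.

```python
import math
import random


def init_population(data, k, n_pop, e_min, e_max, rng):
    """Create n_pop random habitats (k centers each) and an egg count per cuckoo."""
    d = len(data[0])
    lows = [min(p[t] for p in data) for t in range(d)]    # assumed lower bound per feature
    highs = [max(p[t] for p in data) for t in range(d)]   # assumed upper bound per feature
    # Eq. (6): every coordinate is Rand * (max - min) + min.
    habitats = [
        [tuple(rng.random() * (highs[t] - lows[t]) + lows[t] for t in range(d))
         for _ in range(k)]
        for _ in range(n_pop)
    ]
    # Eq. (7): NE_i = ceil(Rand * (E_max - E_min)) + E_min, so E_min <= NE_i <= E_max.
    eggs = [math.ceil(rng.random() * (e_max - e_min)) + e_min for _ in range(n_pop)]
    return habitats, eggs
```

With the settings reported in Section 6, a call such as `init_population(data, k, 5, 2, 12, random.Random(0))` would produce five habitats with between 2 and 12 eggs each.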

Step 2: evaluation of cuckoos' cost function

Suppose that we have N sample feature vectors. The cost function is evaluated for each habitat as follows:

Step 2-1: set i=1 and Objec=0.

Step 2-2: select the i-th sample Xi.

Step 2-3: calculate the distances between the i-th sample and Centerj (j=1,…,K).

Step 2-4: add the minimum distance calculated in Step 2-3 to Objec:

$$Objec = Objec + \min_{j=1,\dots,K} \lvert X_i - Center_j \rvert \qquad (8)$$

Step 2-5: if all samples have been selected, go to the next step; otherwise set i=i+1 and return to Step 2-2.

Step 2-6: Cost(X) = Objec.

The cost function is calculated mathematically as below:

$$Cost(X) = \sum_{i=1}^{N} \min_{j=1,\dots,K} \lvert X_i - Center_j \rvert, \qquad N = \text{number of input data} \qquad (9)$$
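Steps 2-1 to 2-6 amount to a single loop over the samples. The sketch below assumes Euclidean distance, which the text does not state explicitly, and represents a habitat as a list of center tuples; the function name `cost` is hypothetical.

```python
import math


def cost(habitat, samples):
    """Eq. (9): sum over all samples of the distance to the nearest center."""
    objec = 0.0                                          # Step 2-1
    for x in samples:                                    # Steps 2-2 and 2-5
        objec += min(math.dist(x, c) for c in habitat)   # Steps 2-3 and 2-4
    return objec                                         # Step 2-6


centers = [(0.0, 0.0), (10.0, 0.0)]
samples = [(1.0, 0.0), (9.0, 0.0), (0.0, 2.0)]
total = cost(centers, samples)
```

In this toy example the nearest-center distances are 1, 1, and 2, so the cost of the habitat is 4.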

Step 3: initialization of part of the cuckoos' population using the K-means algorithm

The cuckoo with the maximum number of eggs NE is identified, and its eggs are initialized using K-means. The cost function of the cuckoo and of its eggs is computed. If the cost of an egg is lower than that of the cuckoo, the egg replaces the cuckoo, which is treated as an egg from then on.

Step 4: laying eggs in host birds' nests

The egg-laying radius of each cuckoo is computed based on Equation 5. Eggs are laid randomly within a circular area of that radius. Then the objective function of each egg is calculated; the 10% of the egg population with the worst cost function values are identified and removed by the host birds.
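Egg laying and the removal of the worst 10% can be sketched as two small helpers. The sampling scheme here is an assumption for illustration: a habitat is treated as a flat list of coordinates, and each egg is placed in a random direction at a random distance no larger than the ELR. The names `lay_eggs` and `survivors` are hypothetical.

```python
import math
import random


def lay_eggs(habitat, n_eggs, elr, rng):
    """Return n_eggs candidate habitats, each within distance elr of the flat vector habitat."""
    eggs = []
    for _ in range(n_eggs):
        direction = [rng.gauss(0.0, 1.0) for _ in habitat]
        norm = math.sqrt(sum(v * v for v in direction)) or 1.0  # guard against a zero vector
        radius = elr * rng.random()                              # somewhere inside the ELR
        eggs.append([h + radius * v / norm for h, v in zip(habitat, direction)])
    return eggs


def survivors(eggs, cost_fn, kill_fraction=0.10):
    """Drop the kill_fraction of eggs with the highest cost (those detected by host birds)."""
    n_kill = int(len(eggs) * kill_fraction)
    ranked = sorted(eggs, key=cost_fn)          # lowest cost first
    return ranked[:len(ranked) - n_kill]
```

In COA-KM the `cost_fn` argument would be the clustering cost of Eq. (9) evaluated on the egg's centers.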

Step 5: Cuckoo immigration

After the eggs hatch and grow into adult cuckoos, the best cuckoo, Gbest, is identified. The other cuckoos then migrate toward this cuckoo as explained in Section 4. If α is greater than 0.01, its value is reduced.

Step 6: Elimination of the cuckoos in worst habitats

If the total number of cuckoos exceeds the maximum allowed number, the cuckoos in the worst habitats, i.e. those with undesirable cost function values, are eliminated.
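The elimination step is a simple truncation by cost. A minimal sketch, assuming the habitats and their costs are kept in two parallel lists (the name `limit_population` is hypothetical):

```python
def limit_population(habitats, costs, n_max):
    """Keep at most n_max habitats, discarding those with the highest cost."""
    order = sorted(range(len(habitats)), key=lambda i: costs[i])  # best (lowest cost) first
    keep = order[:n_max]
    return [habitats[i] for i in keep], [costs[i] for i in keep]
```

With the settings of Section 6, `n_max` would be 15, the maximum permitted number of cuckoos.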

Step 7: if the stop condition is satisfied, the algorithm stops. Otherwise, the egg-laying radius is determined again according to Eq. (5) and the algorithm continues from Step 4.







6. Experimental Results

Three real datasets are used to validate the proposed algorithm: Iris, Breast Cancer Wisconsin (Cancer), and Contraceptive Method Choice (CMC). Each dataset has a different number of clusters, data objects, and features [21,24]. These datasets have been used in the literature to compare and evaluate the performance of clustering algorithms and are described as follows:

Iris dataset (n = 150, d = 4, k = 3): This dataset contains three classes of 50 objects each, where each class refers to a type of iris flower. There are 150 random samples of iris flowers with four numeric attributes in this dataset. These attributes are sepal length and width in cm and petal length and width in cm. There are no missing values for the attributes [21,24].

Breast Cancer Wisconsin (n = 683, d = 9, k = 2): This dataset contains 683 objects characterized by nine features: clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. There are two clusters in the data: benign (444 objects) and malignant (239 objects) [21,24].

Contraceptive Method Choice, also denoted as CMC (n = 1473, d = 10, k = 3): This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. The samples are married women who either were not pregnant or did not know if they were at the time of the interview. The problem is to predict a woman's current choice of contraceptive method (no use: 629 objects, long-term methods: 334 objects, short-term methods: 510 objects) based on her demographic and socioeconomic characteristics [21,24].

To implement the COA-KM algorithm, the initial number of cuckoos was set to 5 and the maximum permitted number to 15, with a maximum of 12 and a minimum of 2 eggs per cuckoo. The algorithm was stopped after 500 iterations. The algorithms were implemented in MATLAB on a computer with 2 GB of RAM. The results of COA-KM and of algorithms such as K-means, ACO, and HBMO are compared in Table 1, Table 2, and Table 3. Based on these results, the suggested algorithm reaches a better level of clustering accuracy and converges to the optimal solution more reliably than the other algorithms. For instance, COA-KM on the Iris data always reports a solution of 96.6554 with a standard deviation of zero, while the best values over all runs of K-means, HBMO, and ACO are 97.333, 96.752, and 97.100, respectively. The results of the different algorithms on the Iris data are shown in Table 1. The results on the Cancer data, shown in Table 2, also indicate the superiority of the suggested algorithm over the other algorithms. The difference between the best and worst solutions of the suggested algorithm is insignificant; its worst solution is 2965.88, which is still better than the best solution of most of the other algorithms. The results of COA-KM on the CMC data are shown in Table 3. They indicate the superiority of COA-KM over many evolutionary algorithms.

Table 1: Results obtained by the algorithms on Iris data (cost function value)

Method     Best        Average     Worst       Standard Deviation
COA-KM     96.6554     96.6554     96.6554     0
COA        96.6562     96.6583     96.6636     0.00165
k-means    97.33       106.05      120.45      14.6311
GSA        96.698      96.723      76.764      0.0123
GSA-KM     96.679      96.689      96.705      0.0076
SA         97.4573     99.957      102.01      2.018
TS         97.3659     97.8680     98.5694     0.53
GA         113.9865    125.1970    139.7782    14.563
ACO        97.1007     97.1715     97.8084     0.367
HBMO       96.7520     96.9531     97.7576     0.531

Table 2: Results obtained by the algorithms on Cancer data (cost function value)

Method     Best        Average     Worst       Standard Deviation
COA-KM     2964.51     2964.88     2965.88     0.4575
COA        2965.27     2966.88     2967.45     1.6785
k-means    2999.19     3251.21     3521.59     251.14
GSA        2967.96     2973.58     2990.83     8.1731
GSA-KM     2965.14     2965.21     2965.30     0.0670
SA         2993.45     3239.17     3421.95     230.192
TS         2982.84     3251.37     3434.16     232.217
GA         2999.32     3249.46     3427.43     229.734
ACO        2970.49     3046.06     3242.01     90.500
HBMO       2989.94     3112.42     3210.78     103.471

Table 3: Results obtained by the algorithms on CMC data (cost function value)

Method     Best         Average      Worst        Standard Deviation
COA-KM     5694.1230    5694.9270    5695.7309    0.70586
COA        5696.936     5699.87      5704.5       3.0007
k-means    5842.43      5893.60      5934.50      47.16
GSA        5698.15      5699.84      5702.09      1.724
GSA-KM     5697.03      5697.36      5697.87      0.2717
SA         5894.0380    5893.4823    5966.9470    50.867200
TS         5885.0621    5993.5942    5999.8053    40.84568
GA         5705.6301    5756.5984    5812.6480    50.3694
ACO        5701.9230    5819.1347    5912.4300    45.634700
HBMO       5699.2670    5713.9800    5725.3500    12.6900

7. Conclusion

Comparing the results of COA-KM with those of the original algorithms indicates that COA-KM resolves problems and deficiencies of the original algorithms. K-means, for example, converges quickly, but the convergence takes place at a local optimum. To take advantage of the high convergence speed of K-means, part of the population of the suggested hybrid algorithm is initialized using K-means. As a result, COA-KM starts searching the solution space from a more suitable region, the convergence speed improves, and the standard deviation is reduced. Furthermore, the cuckoos' egg-laying radius is reduced as the algorithm iterates, which leads to fewer random changes in the solution space. Applying these changes to the Cuckoo algorithm and using K-means have given COA-KM a proper level of clustering accuracy.

References

[1] Sh.H. Liao, P.H. Chu, P.Y. Hsiao, Data Mining Techniques And Applications – A Decade Review From 2000 To 2011, Expert Systems with Applications. 39, (2012) 11303–11311.

[2] Y. Xia, D. Feng, T. Wang, R. Zhao, Y. Zhang, Image Segmentation by Clustering of Spatial Patterns, Pattern Recognition Letters. 28, (2007) 1548–1555.

[3] S. Bandyopadhyay, U. Maulik, Genetic Clustering For Automatic Evolution Of Clusters And Application to Image Classification, Pattern Recognition. 35, (2002) 1197–1208.

[4] L. Liao, T. Lin, B. Li, MRI Brain Image Segmentation and Bias Field Correction Based on Fast Spatially Constrained Kernel Clustering Approach, Pattern Recognition Letters. 29, (2008) 1580–1588.

[5] H. Tang, T. Li, T. Qiu, Y. Park, Segmentation of Heart Sounds Based on Dynamic Clustering, Biomedical Signal Processing And Control. 7, (2012) 509–516.

[6] M. Ceccarelli, A. Maratea, Improving Fuzzy Clustering of Biological Data by Metric Learning With Side Information, International Journal of Approximate Reasoning. 47, (2008) 45–57.

[7] N. Wu, J. Zhang, Factor-Analysis Based Anomaly Detection and Clustering, Decision Support Systems. 42, (2006) 375–389.

[8] S. Hyun Oh, W. Suk Lee, An Anomaly Intrusion Detection Method by Clustering Normal User Behavior, Computers & Security. 22, (2003) 596–612.

[9] I. Mahdavi, N. Cho, B. Shirazi, N. Sahebjamnia, Designing Evolving User Profile In E-CRM With Dynamic Clustering of Web Documents, Data & Knowledge Engineering. 65, (2008) 355–372.

[10] D. Vicari, M. Alfó, Model Based Clustering of Customer Choice Data, Computational Statistics & Data Analysis. 71, (2014) 3–13.

[11] A.K. Jain, Data Clustering: 50 Years Beyond K-Means, Pattern Recognition Letters. 31, (2010) 651–666.

[12] S. Paterlini, Th. Krink, Differential Evolution and Particle Swarm Optimisation In Partitional Clustering, Computational Statistics & Data Analysis. 50, (2006) 1220–1247.

[13] J. Wu, H. Xiong, J. Chen, Towards Understanding Hierarchical Clustering: A Data Distribution Perspective, Neurocomputing. 72, (2009) 2319–2330.

[14] R. Rajabioun, Cuckoo Optimization Algorithm, Applied Soft Computing. 11, (2011) 5508–5518.

[15] M. Laszlo, S. Mukherjee, A Genetic Algorithm That Exchanges Neighboring Centers For K-Means Clustering, Pattern Recognition Letters. 28, (2007) 2359–2366.

[16] K. Krishna, M. Murty, Genetic K-Means Algorithm, IEEE Transactions on Systems, Man, And Cybernetics B, Cybernetics. 29, (1999).

[17] Kh. Al-Sultan, A Tabu Search Approach to the Clustering Problem, Pattern Recognition. 28, (1995) 1443–1451.

[18] Y. Liu, Zh. Yi, H. Wu, M. Ye, K. Chen, A Tabu Search Approach For the Minimum Sum-Of-Squares Clustering Problem, Information Sciences. 178, (2008) 2680–2704.

[19] SH. Selim, K. Alsultan, A Simulated Annealing Algorithm For the Clustering Problem, Pattern Recognition. 24, (1991) 1003–1008.

[20] S. Bandyopadhyay, U. Maulik, M.K. Pakhira, Clustering Using Simulated Annealing With Probabilistic Redistribution, International Journal of Pattern Recognition And Artificial Intelligence. 15, (2001) 269–285.

[21] P.S. Shelokar, V.K. Jayaraman, B.D. Kulkarni, An Ant Colony Approach For Clustering, Analytica Chimica Acta. 509, (2004) 187–195.

[22] D. Karaboga, C. Ozturk, A Novel Clustering Approach: Artificial Bee Colony (ABC) Algorithm, Applied Soft Computing. 11, (2011) 652–657.

[23] Y.T. Kao, E. Zahara, I.W. Kao, A Hybridized Approach to Data Clustering, Expert Systems with Applications. 34, (2008) 1754–1762.

[24] A. Hatamlou, A. Salwani, H. Nezamabadi-pour, A Combined Approach for Clustering Based on K-means and Gravitational Search Algorithms, Swarm and Evolutionary Computation. 6, (2011) 47–52.