
Real Coded Genetic Clustering with Inter-cluster Mutation

Gautam Ramdurai

Department of Information Science and Engineering

Sri Jayachamarajendra College of Engineering, Mysore – 570 006, India

Tel: +91 98457 89983, E-mail: [email protected]

Abstract

An improved clustering method using genetic algorithms is proposed in this paper. Real encoding is used to render conceptual closeness to the problem domain. A new view of the chromosomes, which exploits the ability to directly map real solutions, is suggested. A centroid-level crossover technique that uses this new view is defined. Along with this, a novel inter-cluster mutation operator specific to clustering is proposed. The principles of the K-means algorithm, embedded in a Genetic Algorithm and augmented by the new genetic operators, lead to globally optimal solutions to the clustering problem. The superiority of the proposed method over the commonly used K-means method and a current technique based on Genetic Algorithms is demonstrated on real-life datasets.

Keywords: Clustering, K-means, Genetic Algorithms, Real Encoding

Introduction

The need to extract hidden structures and discover groups in a dataset has made clustering an essential machine learning process in fields like data mining and pattern recognition [1]. Clustering [1,2] deals with finding a structure in a collection of unlabeled data. It involves grouping data into clusters or groups such that members of one cluster are similar in some given sense and members of different clusters are dissimilar in the same sense. The primary objectives of clustering can be defined as:

• The distance between the members of a given cluster should be minimized, as they represent similar kinds of data.

• The distance between any two clusters (represented by their centroids) should be maximized, as they represent dissimilar classes.

The challenge here is to devise a computational technique that can separate a dataset into its most natural groups or clusters. Many well-known methods, such as K-means and fuzzy c-means, have been used and improved extensively. An extensive review of these techniques can be found in [1-3]. A brief overview of the standard K-means algorithm is given in the next section.

The failure of conventional hill-climbing methods to reach globally optimal solutions for the clustering problem has given rise to increasing interest in stochastic methods like Genetic Algorithms (GAs) in this domain. GAs [4-6] are nondeterministic stochastic search and optimization methods that utilize the theories of evolution and natural selection to solve a problem within a complex search space. They include a string representation of points in the search space, a set of genetic operators for generating new search points, a fitness function to evaluate the search points, and a stochastic assignment to control the genetic operations.

Several methods based on GAs have been developed to aid clustering. A basic study of the use of GAs in clustering is provided in [7]. Advanced schemes that use sophisticated encoding schemes and modified genetic operators are proposed in [8,9]. Different variants of GA-based clustering with real encoding are proposed in [10-12].

None of the genetic clustering algorithms illustrated in past literature [10-12] taps the ability of direct mapping offered by real encoding schemes; they use the usual operators for recombination and mutation without any customization. In this paper, a variant of these algorithms [10] is proposed with significant improvements. A new way of viewing the chromosomes representing the candidate solutions is suggested. This view is conceptually closer to the real-world solution to the clustering problem. Keeping this view in mind, a new centroid-level crossover operator and an inter-cluster mutation operator are defined. The new operators are specific to the problem of clustering and exploit the ability of real encoding to map real solutions onto the candidate chromosomes. The Davies-Bouldin index [13] is used as the optimization metric. Different aspects of the proposed technique are elaborated in later sections.

The superiority of the proposed method is demonstrated by a series of tests on various real-life datasets. The tests validate this superiority in two phases. The first phase illustrates the contribution of the new clustering-specific mutation and crossover operators. The second phase compares the performance of the proposed method with that of the standard K-means algorithm and the method proposed in [10] on five real-life datasets.

K-means Clustering

The K-means [1,2] algorithm is one of the simplest and most popular clustering algorithms. The algorithm starts with a given set of initial cluster centers. Each point in the data-space is assigned to one of them, and then each cluster center is replaced by the mean point of the respective cluster. These two simple steps are repeated until convergence. The Euclidean distance, the straight-line distance between two points, is used as the similarity metric. It is a simple and effective way to judge the distance between any two points in the search space.


The basic K-means algorithm is shown below:

Step 1: Choose $k$ initial cluster centers $c_1, c_2, \ldots, c_k$ randomly from the $n$ data points $\{x_1, x_2, \ldots, x_n\}$.

Step 2: Assign point $x_i$, $i = 1, 2, \ldots, n$, to cluster $C_j$, $j \in \{1, 2, \ldots, k\}$, if and only if

$$\|x_i - c_j\| < \|x_i - c_p\|, \quad p = 1, 2, \ldots, k \text{ and } j \neq p \qquad (1)$$

Step 3: Compute the new cluster centers $c_1^*, c_2^*, \ldots, c_k^*$ as follows:

$$c_i^* = \frac{1}{n_i} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \ldots, k \qquad (2)$$

where $n_i$ is the number of elements belonging to cluster $C_i$.

Step 4: If $c_i^* = c_i$ for all $i = 1, 2, \ldots, k$, terminate; else go to Step 2.

If normal termination does not occur, the algorithm is run for a predefined maximum number of iterations.
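To make the procedure concrete, the following is a minimal Python sketch of the K-means loop described above; the function name and parameters are illustrative, not from the paper:

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Basic K-means on an (n, N) data array X with k clusters."""
    rng = np.random.default_rng(rng)
    # Step 1: choose k initial centers randomly from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (Equation 1).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: replace each center by the mean of its cluster (Equation 2).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: terminate once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```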

A major disadvantage of this method is that it gets stuck at local optima depending on the choice of initial clusters. It has been shown in [14] that the algorithm may converge to values that are not optimal.

Real Coded Genetic Clustering

One of the new avenues being explored to overcome this drawback of local optimality is the use of techniques like Genetic Algorithms, which are efficient at reaching the global optimum for most problems. One such method is proposed here. A real-coded genetic algorithm [15,16] is used, in which a solution is directly represented as a vector of real-parameter decision variables. In this way the representation of the solutions is very close to the natural formulation of many problems. The choice of a real encoding scheme for clustering seems natural, as it renders conceptual closeness to the problem domain.

Chromosome Encoding

Each chromosome represents a candidate solution representing cluster centroids to be chosen for an ideally clustered dataset. Each string is a sequence of k genes representing the k cluster centers. For an N-dimensional space, each gene again consists of N sub-genes. Each sub-gene value represents an ordinate of a cluster centroid in the N dimensioned space. The effective length of the chromosome is thus k*N. The chromosome initialization is done by randomly assigning points from the given dataset or pattern set to the sub-genes of a chromosome. Figure 1 shows a sample chromosome containing four genes (cluster centroids), each containing three sub-genes (ordinates).

Initial Population

Chromosomes are initialized using the above method and a population of candidate solutions is created. Each of these chromosomes represents a possible solution to the clustering problem. A given number of individual chromosomes initialized in this manner form the initial population.

Figure 1 – Chromosome encoded as genes and sub-genes
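As an illustration, here is a sketch of this encoding in Python, under the assumption that a chromosome is stored as a flat NumPy array of k*N real values (k genes of N sub-genes each); the helper names are hypothetical:

```python
import numpy as np

def init_chromosome(X, k, rng=None):
    """Initialize a chromosome by sampling k distinct points from the
    (n, N) dataset X; each sampled point becomes one gene (centroid)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx].flatten()  # effective chromosome length is k * N

def as_genes(chrom, k, N):
    """View the flat chromosome as its k genes of N sub-genes each."""
    return chrom.reshape(k, N)
```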

Cluster Formation

The clusters are formed according to the cluster centers encoded in the chromosome. Each data point $x_i$, $i = 1, 2, \ldots, n$, is assigned to the cluster $C_j$ with the nearest centroid $c_j$: the Euclidean distance from the point to every centroid is calculated, and the point is assigned according to Equation (1). Once all points have been assigned, the new cluster centers $c_1^*, c_2^*, \ldots, c_k^*$ are calculated according to Equation (2). If the algorithm has not converged, $c_i^*$ replaces $c_i$ in the chromosome.

Fitness Calculation

A Genetic Algorithm requires a criterion whose optimization yields the final clusters. The criterion must be chosen to achieve the objectives of increasing homogeneity within clusters and increasing heterogeneity between clusters. An objective fitness function must be defined such that it measures the quality of a candidate solution. Many clustering metrics are available in the literature, ranging from simple functions to highly complex mathematical ones. One commonly used metric is the Total Within-Cluster Variation (TWCV) [8], or simply the sum of the squared Euclidean distances of all points from their respective cluster centers, over all clusters [10]. This parameter does not take into account the proximity of two different clusters. To overcome this, a metric that considers both objectives of homogeneity and heterogeneity is needed. The Davies-Bouldin index [13] (DB-index) has proved to be an effective metric for determining the quality of the clusters created. The proposed method uses the DB-index to determine the fitness of a given chromosome and hence to validate the clusters formed. The DB-index is a function of the ratio of within-cluster scatter to between-cluster separation. The measure of scatter within the cluster $C_i$ is calculated as:

$$S_{i,q} = \left( \frac{1}{|C_i|} \sum_{x \in C_i} \|x - c_i\|_2^q \right)^{1/q} \qquad (3)$$

where $c_i$ is the centroid of $C_i$ and is defined as

$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x \qquad (4)$$

where $n_i$ is the cardinality of the cluster $C_i$.

$S_{i,q}$ is the $q$th root of the $q$th moment of the points in cluster $i$ with respect to their mean, and is a measure of the dispersion of the points in cluster $i$. In this paper $q = 1$, hence $S_{i,1}$ is the average Euclidean distance of the data points in class $i$ from the centroid of class $i$.

$d_{ij,t}$ denotes the Minkowski distance of order $t$ between the clusters $C_i$ and $C_j$. Here $t = 2$, i.e., the Euclidean distance:

$$d_{ij,t} = d(C_i, C_j) = \|c_i - c_j\|_t \qquad (5)$$

$R_{i,qt}$ denotes the maximal similarity index of $C_i$ with respect to the other clusters:

$$R_{i,qt} = \max_{j,\, j \neq i} \left\{ \frac{S_{i,q} + S_{j,q}}{d_{ij,t}} \right\} \qquad (6)$$

The DB-index is then defined as:

$$DB = \frac{1}{K} \sum_{i=1}^{K} R_{i,qt} \qquad (7)$$

A smaller value of the DB-index indicates a good clustering result. Thus the fitness function can be defined as the inverse of the DB-index, where $DB_r$ is the index of the clustering encoded by chromosome $r$:

$$f(\mathrm{Chrom}_r) = \frac{1}{DB_r} \qquad (8)$$

The fitness value is the parameter to be optimized by the algorithm; maximizing this fitness function ensures minimization of the DB-index. The DB-index has been used effectively for cluster validation in the past [9,11,12].
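For illustration, here is a Python sketch of the fitness computation of Equations (3)-(8), with q = 1 and t = 2 as used in the paper; it assumes non-empty clusters, and the names are illustrative:

```python
import numpy as np

def db_index(X, labels, centers):
    """Davies-Bouldin index with q = 1 (average distance) and t = 2
    (Euclidean); assumes every cluster contains at least one point."""
    k = len(centers)
    # Equation (3): within-cluster scatter, the mean distance to the centroid.
    S = np.array([
        np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
        for i in range(k)
    ])
    total = 0.0
    for i in range(k):
        # Equations (5) and (6): maximal similarity index of cluster i.
        total += max(
            (S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(k) if j != i
        )
    return total / k  # Equation (7)

def fitness(X, labels, centers):
    """Equation (8): fitness is the inverse of the DB-index."""
    return 1.0 / db_index(X, labels, centers)
```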

Selection

The fitness of an individual determines its probability of contributing to the mating pool. This means that an individual with a higher fitness value has a higher probability to survive, reproduce and contribute to the next generation. The probability for a chromosome to survive and mate can be given by the following mathematical function, which is the ratio of the individual fitness to the total fitness of the whole population:

$$\mathrm{Prob}(\mathrm{Chrom}_r) = \frac{f_r}{\sum_{r=1}^{P} f_r} \qquad (9)$$

The well known Roulette Wheel selection scheme is used to select the parents to create the next generation.
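A minimal sketch of roulette-wheel selection implementing Equation (9); the function name is illustrative:

```python
import numpy as np

def roulette_wheel_select(population, fitnesses, rng=None):
    """Pick one chromosome with probability proportional to its fitness."""
    rng = np.random.default_rng(rng)
    probs = np.asarray(fitnesses) / np.sum(fitnesses)  # Equation (9)
    return population[rng.choice(len(population), p=probs)]
```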

Centroid-level Crossover

Crossover [6] is a probabilistic process that exchanges information between two parent chromosomes to generate two child chromosomes. It is the main search operator in GAs. In previous implementations [10-12] the crossover operator used is a single-point crossover. In [10] the chromosome is viewed as a single string of k*N real values; the cross-site is randomly picked from the range {1,2,...,k*N} and the split parts are swapped. This can split the chromosome such that the ordinates of a cluster centroid are separated in between, as shown in Figure 2. The potential solution to the problem is a set of complete cluster centroids, not individual ordinates; hence the commonly used single-point crossover is not satisfactory, and a new centroid-level crossover operator is proposed instead.

Figure 2 – Normal single-point crossover

Real encoding eliminates the mapping from phenotype to genotype; hence the candidate solution can be manipulated directly. This feature is exploited in the proposed technique: the chromosomes are viewed in terms of genes (cluster centroids) and sub-genes (ordinates), which makes more sense than looking at the solution as a mere string of real numbers. Genetic information is exchanged only in terms of whole genes, not single ordinates or sub-genetic alleles. A random crossover point is selected in the range {1,2,...,k*N}, and the chromosome is split only at gene boundaries, so no gene is split in between. This helps preserve the integrity of the genetic information being exchanged. Figure 3 illustrates the working of the centroid-level crossover operator: there are four genes with three sub-genes each, the cross-site is three, and the genes after the third gene are swapped.

Figure 3 – Centroid-level crossover (Parent1, Parent2 → Child1, Child2; cross-site = 3)

Crossover occurs with a probability µc. Two chromosomes are selected from the current population with probability proportional to their fitness. A random number in the range (0,1) is generated; if it is less than µc, crossover occurs and the offspring are copied into the next generation, otherwise the chromosomes are copied directly without crossover.
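A sketch of the centroid-level crossover, under the assumption that drawing the cross-site at a gene boundary directly is equivalent to drawing a site in {1,2,...,k*N} and then splitting only between whole genes:

```python
import numpy as np

def centroid_level_crossover(parent1, parent2, k, N, rng=None):
    """Swap whole genes (centroids) after a random cross-site; no gene is
    ever split in between, preserving the integrity of each centroid."""
    rng = np.random.default_rng(rng)
    site = rng.integers(1, k)   # gene-level cross-site in {1, ..., k-1}
    cut = site * N              # convert to a sub-gene index on a boundary
    child1 = np.concatenate([parent1[:cut], parent2[cut:]])
    child2 = np.concatenate([parent2[:cut], parent1[cut:]])
    return child1, child2
```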

Inter-cluster Mutation

The role of mutation [6] is to restore lost or unexplored genetic material into the population to prevent premature convergence to sub-optimal solutions. It ensures that the probability of reaching any point in the search space is never zero.


The mutation operator must guide the algorithm towards fulfilling the objectives of clustering mentioned earlier; an appropriate clustering-specific mutation technique is proposed to this end. The chromosome is mutated such that the inter-cluster distance increases. Here again the mutation is not allelic, i.e., applied to single sub-genetic alleles, but works in terms of genes. Each chromosome in the population is mutated with a probability $\mu_m$, called the mutation probability. A random number is generated in the range (0,1); if the value is less than $\mu_m$, mutation occurs.

The Inter-cluster Mutation operator is based on inter-cluster distances. The inter-cluster distances among all genes of a chromosome are calculated according to Equation (5), and the pair of centroids (genes) that are closest to each other, say $C_a$ and $C_b$, is selected. The ordinate (sub-gene) values of these two centroids are mutated. A random number $\delta$ in the range [0,1] is generated with uniform distribution. Let $v_{ia}$ and $v_{ib}$ be the $i$-th ordinates (sub-genetic alleles) of the selected centroids $C_a$ and $C_b$ respectively, with $a, b \in \{1,2,\ldots,k\}$, $a \neq b$, and $i \in \{1,2,\ldots,N\}$. They are mutated as follows:

if $v_{ia} \leq v_{ib}$:

$$v_{ia} = v_{ia} - \delta \cdot v_{ia} \qquad (10)$$

$$v_{ib} = v_{ib} + \delta \cdot v_{ib} \qquad (11)$$

else:

$$v_{ia} = v_{ia} + \delta \cdot v_{ia} \qquad (12)$$

$$v_{ib} = v_{ib} - \delta \cdot v_{ib} \qquad (13)$$

If the value of $v_{ia}$ is less than $v_{ib}$, then reducing $v_{ia}$ and increasing $v_{ib}$ pushes the centroids $C_a$ and $C_b$ farther from each other. If $v_{ia}$ is greater than $v_{ib}$, the subtraction and addition operations are reversed and the same distance-enhancing effect is achieved. This increase in the distance between clusters promotes heterogeneity among the clusters.
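The following Python sketch transcribes Equations (10)-(13) directly; the closest pair of centroids is found with the Euclidean distance of Equation (5), and all names are illustrative:

```python
import numpy as np

def inter_cluster_mutation(chrom, k, N, rng=None):
    """Find the two closest genes (centroids) and push them apart."""
    rng = np.random.default_rng(rng)
    genes = chrom.reshape(k, N).copy()
    # Select the closest pair of centroids (Equation 5).
    a, b, best = 0, 1, np.inf
    for i in range(k):
        for j in range(i + 1, k):
            d = np.linalg.norm(genes[i] - genes[j])
            if d < best:
                a, b, best = i, j, d
    delta = rng.uniform(0.0, 1.0)
    for i in range(N):
        if genes[a, i] <= genes[b, i]:
            genes[a, i] -= delta * genes[a, i]   # Equation (10)
            genes[b, i] += delta * genes[b, i]   # Equation (11)
        else:
            genes[a, i] += delta * genes[a, i]   # Equation (12)
            genes[b, i] -= delta * genes[b, i]   # Equation (13)
    return genes.flatten()
```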

Termination Criterion

The methods of fitness computation, selection, crossover, and mutation are executed for a fixed number of iterations. The best string seen up to the last generation provides the solution to the clustering problem. Elitism can be implemented at each generation by preserving the best string seen in that generation in a location outside the population or by copying it onto the next generation. Thus the resulting fittest chromosome represents the centroids of the final clusters.
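Putting the pieces together, here is a minimal sketch of the overall RCGC-IM loop with elitism, composed from the helper sketches above; it omits the K-means-style centroid refinement during cluster formation for brevity, and all names are illustrative:

```python
import numpy as np

def rcgc_im(X, k, pop_size=50, generations=50, mu_c=0.8, mu_m=0.001, rng=None):
    """Evolve a population of chromosomes and return the fittest centroids."""
    rng = np.random.default_rng(rng)
    N = X.shape[1]
    pop = [init_chromosome(X, k, rng) for _ in range(pop_size)]
    best, best_fit = None, -np.inf
    for _ in range(generations):
        # Cluster formation and fitness evaluation for each chromosome.
        fits = []
        for chrom in pop:
            centers = chrom.reshape(k, N)
            labels = np.linalg.norm(
                X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
            fits.append(fitness(X, labels, centers))
        # Elitism: keep the best string seen so far outside the population.
        i = int(np.argmax(fits))
        if fits[i] > best_fit:
            best, best_fit = pop[i].copy(), fits[i]
        # Breed the next generation via selection, crossover, and mutation.
        new_pop = []
        while len(new_pop) < pop_size:
            p1 = roulette_wheel_select(pop, fits, rng)
            p2 = roulette_wheel_select(pop, fits, rng)
            if rng.uniform() < mu_c:
                p1, p2 = centroid_level_crossover(p1, p2, k, N, rng)
            if rng.uniform() < mu_m:
                p1 = inter_cluster_mutation(p1, k, N, rng)
            new_pop.extend([p1, p2])
        pop = new_pop[:pop_size]
    return best.reshape(k, N)  # centroids of the final clusters
```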

Implementation and Results

In order to prove the superiority of the method, it was tested, compared, and analyzed in two phases. The testing was done on datasets of varying sizes and dimensions; five real-life datasets, freely available at [17], are used. In all cases the value of k is known a priori. A brief review of the datasets is given below, and Table 1 summarizes them along with their numbers of instances, dimensions, and classes.

Iris: A set of 150 data points in a four-dimensional space. The data represent different categories of irises, each with four feature values (sepal length, sepal width, petal length, and petal width in centimeters). It has three classes, Setosa, Versicolor and Virginica, with 50 samples each. Versicolor and Virginica are said to overlap, while the class Setosa is linearly separable.

Wine: These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. It contains 178 data points.

New Thyroid: Lab tests are used to try to predict whether a patient's thyroid belongs to the class euthyroidism, hypothyroidism or hyperthyroidism. The diagnosis (class label) was based on a complete medical record, including anamnesis, scan etc. The number of instances is 215 and number of attributes is 5.

Bupa Liver Disorder: It consists of 6 variables in each data point. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each data point constitutes the record of a single male individual. The number of instances is 345 and number of attributes is 6. They can be divided into two classes.

Breast Cancer: The Wisconsin Breast Cancer dataset, containing 683 points, is used. Each pattern has nine features (clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses). There are two classes in the data: Malignant and Benign.

Table 1 – Datasets used for testing

Dataset          Instances   Dimensions   Classes
Iris                150           4           3
Wine                178          13           3
New Thyroid         215           5           3
Liver Disorder      345           6           2
Breast Cancer       683           9           2

Parameter Testing

The values that exhibited characteristics of good optimization in their runs are summarized in Table 2. It was observed that when the number of generations was high, e.g. 100, the algorithm converged well before the last generation; hence the unnecessary computation of redundant generations can be avoided by setting the number of generations to 50. It was also observed that lower mutation probabilities make the proposed technique converge slowly, while higher probabilities lead to oscillating behavior of the algorithm. These parameters were then used for further testing of the algorithm.


Table 2 – Optimum Parameters

Parameter                     Value
Crossover probability (µc)    0.8
Mutation probability (µm)     0.001
Population size (P)           50
No. of generations (G)        50

Phase I: Contribution of New Operators

The proposed method is compared with a previous implementation of the Genetic Algorithms-based Clustering Technique [10] (GACT). GACT uses relatively simple crossover and mutation operators that are not specific to clustering. The parameters discussed in the previous section are kept constant for both methods, and the DB-index is used as the optimization metric for both techniques. Since both are stochastic algorithms, ten runs of both RCGC-IM (Real Coded Genetic Clustering with Inter-cluster Mutation) and GACT on the Iris dataset were considered for comparison. For each generation, the average DB-index value over the ten runs of each technique was recorded. By comparing these values, the contribution of the mutation and crossover operators and their superiority over commonly used operators can be judged. Figure 4 shows the comparison of the average DB-index values for RCGC-IM and GACT over 50 generations.

Figure 4 – Average DB-index values of RCGC-IM and GACT over 50 generations (x-axis: generation number; y-axis: DB-index)

The graph shows the consistent behavior of RCGC-IM and that, on average, the proposed method performs better than GACT. It must be kept in mind that a lower DB-index indicates a better clustering result.

Phase II: Performance of RCGC

The performance of the proposed method is tested over five different real-life datasets. The performance of the well-known K-means algorithm and the previous implementation of real-coded genetic algorithms in clustering [10] are considered for comparison. On each dataset, K-means is run 2500 times and the average Davies-Bouldin index of the resulting clusters is calculated. Since each run of the genetic algorithm samples 50 cluster configurations, it is run 50 times, which is considered equivalent to the 2500 runs of the standard K-means. Again, the average DB-index is calculated for GACT [10] and for the RCGC-IM proposed in this paper. The results for all datasets are summarized in Table 3.

Table 3 – Comparison of DB-index values

Dataset          K-means    GACT      RCGC-IM
Iris             1.0593     0.6982    0.6648
Wine             1.7523     0.4846    0.4558
New Thyroid      1.2670     0.5732    0.5467
Liver Disorder   1.7767     0.5447    0.5305
Breast Cancer    1.2746     0.6337    0.6156

It is clearly seen that RCGC-IM achieves, on average, lower and hence better DB-index values than either the K-means algorithm or GACT.

This is because the performance of the K-means algorithm is highly variable and depends on the initial configuration; this variance is much smaller for RCGC-IM. The operators used in GACT, unlike those in RCGC-IM, are not customized for clustering applications, hence the difference in performance.

Conclusion and Future Work

The method proposed in this work effectively combines the simplicity of the K-means algorithm with the searching capability of Genetic Algorithms. Previous implementations [10-12] do not exploit the conceptual closeness offered by real encoding. This method taps the ability of real encoding to map candidate solutions directly onto chromosomes by viewing the solution as genes representing a real-world solution to the problem, rather than viewing the chromosome as just a string of real numbers. This view is extended and exploited further by the crossover and mutation operators, which work on complete genes and not on single alleles. This incorporates a degree of integrity into the algorithm, helping it compute solutions with a real-world perspective. The superiority of RCGC-IM with the new genetic operators is confirmed by the results of tests conducted on various real-world datasets.

Future work in this field will be directed towards using stochastic techniques like GAs as much more than just black-box optimization techniques. Incorporating valuable real-world insights into the workings of the algorithm itself, in the form of appropriate fitness functions and customized operators, will be an area of focus. Other improvements to this method can be explored by using newer indices [18] as the optimization metrics. The possibility of devising a similar method for clustering when the number of classes is not known a priori can also be explored.

Acknowledgments

The author would like to thank Dr. T.N. Nagabhushan, Professor and Head, Department of Information Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysore, India, for his guidance and support.

References

[1] Duda, R. et al., 2001. Pattern Classification, Wiley-Interscience, USA.

[2] Jain, A. et al., 1988. Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ.

[3] Glenn, F., 2001. A Comprehensive Overview of Basic Clustering Algorithms.

[4] Michalewicz, Z., 1992. Genetic Algorithms + Data Structures = Evolution Programs, Springer, New York.

[5] Davis, L. (Ed.), 1991. Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York.

[6] Mitchell, M., 1996. An Introduction to Genetic Algorithms, Complex Adaptive Systems, MIT Press, Cambridge.

[7] Krovi, R., 1992. Genetic algorithms for clustering: A preliminary investigation. In Proceedings of the 25th Hawaii Intl. Conf. on System Sciences: 540-544.

[8] Krishna, K. et al., 1999. Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29: 433-439.

[9] Lin, H. et al., 2005. An effective GA-based clustering technique. Tamkang Journal of Science and Engineering 8: 113-122.

[10] Maulik, U. et al., 2000. Genetic algorithm-based clustering technique. Pattern Recognition 33: 1455-1465.

[11] Maulik, U. et al., 2001. Nonparametric genetic clustering: Comparison of validity indices. IEEE Trans. on Systems, Man, and Cybernetics, Part C 31: 120-125.

[12] Maulik, U. et al., 2002. Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recognition 35: 1197-1208.

[13] Davies, D. et al., 1979. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intelligence 1: 224-227.

[14] Selim, S. et al., 1984. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intelligence 6: 81-87.

[15] Herrera, F. et al., 1998. Tackling real-coded genetic algorithms: Operators and tools for behavioural analysis. Artificial Intelligence Review 12: 265-319.

[16] Raghuwanshi, M. et al., 2004. Survey on multi-objective evolutionary and real-coded genetic algorithms. In Proceedings of the 8th Asia Pacific Symposium on Intelligent and Evolutionary Systems: 150-161.

[17] The UCI online machine learning database repository. URL http://www.ics.uci.edu/~mlearn/databases.html

[18] Bezdek, J. et al., 1998. Some new indexes of cluster validity. IEEE Trans. on Systems, Man, and Cybernetics, Part B 28: 301-315.
