
Studies in Computational Intelligence, Volume 457: Advanced Methods for Computational Collective Intelligence

Ant Colony Inspired Clustering Based on the Distribution Function of the Similarity of Attributes

Arkadiusz Lewicki1, Krzysztof Pancerz1, and Ryszard Tadeusiewicz2

1 University of Information Technology and Management in Rzeszów, Poland
{alewicki,kpancerz}@wsiz.rzeszow.pl

2 AGH University of Science and Technology, Kraków, Poland
[email protected]

Abstract. The paper presents results of research on the clustering problem on the basis of swarm intelligence, using a new algorithm based on the normalized cumulative distribution function of attributes. In this approach, we assume that analyzing the likelihood of the occurrence of particular types of attributes and their values allows us to measure the similarity of objects within a given category and the dissimilarity of objects between categories. Therefore, on the basis of a complex data set with attributes of any type, we can extract a great deal of interesting information about these attributes without having to consider their real meaning. Our research shows that an algorithm inspired by mechanisms observed in nature may return better results due to the modification of the neighborhood function based on the similarity coefficient.

Keywords: ant colony clustering analysis, ant colony optimization, swarm intelligence, self-organization, unsupervised clustering, data mining, distribution function.

1 Introduction

Clustering has become a very important field of research in data mining. This type of problem concerns the identification of natural groups, where objects similar to each other are placed in one group while objects differing significantly are placed in different groups. It is of interest to researchers in the fields of statistics, machine learning, pattern recognition, knowledge acquisition, and databases [6], [9], [18]. The issue includes aspects of data processing, determining the similarity and dissimilarity of objects, as well as methods for searching for optimal solutions. In mathematical terms, clustering methods are based on searching for the partition minimizing a given criterion function; the various methods differ in how they meet this requirement [3], [4], [14], [15]. Having defined a proximity function specified for each pair of objects, we can use an algorithm to create groups.

N.T. Nguyen et al. (Eds.): Adv. Methods for Comput. Collective Intelligence, SCI 457, pp. 147–156.
DOI: 10.1007/978-3-642-34300-1_14 © Springer-Verlag Berlin Heidelberg 2013


In terms of the overall mechanism, clustering methods can be divided into several categories, such as flat clustering methods, hierarchical clustering methods, density-based clustering methods, graph-based clustering methods, nature-inspired clustering methods, and others which do not correspond to any of the listed categories. Most clustering algorithms require determining the number of clusters (classes, categories) to which objects will be allocated. One of the more promising solutions which does not require a predefined number of clusters is the use of nature-inspired clustering methods. This paper shows that ant colony clustering algorithms are very efficient in this regard. We propose here to extend the idea originally published in [17].

2 The Approach

The main advantage of the base algorithm created previously is data prioritization. This feature eliminates the need to predetermine the number of clusters. However, sometimes the resulting number of clusters is less than the prospective one for data characterized by a large number of small classes. Another disadvantage of this algorithm is a lack of stability of the obtained solutions, especially in the case of infrequent clusters, where there is a high probability of destroying them even if the data have been identified correctly. Therefore, this algorithm was the basis for our modification of the strategy for searching the decision space. The basic idea was to obtain a clearer separation between individual groups of data, because if the distance between groups of objects is insufficient, or there are connections between them, then the hierarchical algorithm interprets them as a single cluster. We can improve this situation by modifying the neighborhood function (similarity function) and the correlation of parameters. The analysis of probabilities of occurrences of individual attribute values, conditional probabilities, and other statistical dependencies can deliver a lot of interesting information about the attributes without involving the real meaning of each attribute.

The similarity of two objects x and y belonging to a finite set X can be estimated via the distribution function g(y):

g(y) = P(y \mid x) = \frac{f_{xy}}{\sum_{y' \in X} f_{xy'}},   (1)

where f_{xy} is the number of co-occurrences of x and y.

In order to achieve the smallest variance within a group of similar objects and the largest variance between groups, we can use the similarity of the distribution of attributes. In this case, we can calculate:
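Formula (1) can be sketched in Python as follows; the pair-count input format and the function name are our own illustration, not part of the paper:

```python
from collections import Counter

def distribution_function(pairs):
    """Estimate g(y) = P(y | x) = f_xy / sum_{y'} f_xy' (Formula 1) from a
    list of observed (x, y) co-occurrences, where f_xy counts how often
    x and y occur together."""
    f = Counter(pairs)                          # f[(x, y)] = co-occurrence count
    row_totals = Counter(x for x, _ in pairs)   # sum over y' of f_xy'
    return {(x, y): f[(x, y)] / row_totals[x] for (x, y) in f}

# Toy example: x = "a" co-occurs with "u" twice and with "v" once,
# so g("u" | "a") = 2/3 and g("v" | "a") = 1/3.
g = distribution_function([("a", "u"), ("a", "u"), ("a", "v"), ("b", "u")])
```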

– The Jaccard coefficient:

d(o_i, o_j) = 1 - \frac{\sum_{k=1}^{n} o_{ik} o_{jk}}{\sum_{k=1}^{n} o_{ik}^2 + \sum_{k=1}^{n} o_{jk}^2 - \sum_{k=1}^{n} o_{ik} o_{jk}}.   (2)


In view of the fact that, in the modified algorithm, the degree of dissimilarity of objects is taken into account, the formula must be converted to the form:

d(o_i, o_j) = 1 - \frac{1}{2} \cdot \frac{\sum_{k=1}^{n} o_{ik} o_{jk}}{\sum_{k=1}^{n} o_{ik}^2 + \sum_{k=1}^{n} o_{jk}^2 - \sum_{k=1}^{n} o_{ik} o_{jk}}.   (3)

– The Dice coefficient:

d(o_i, o_j) = \frac{2 \sum_{k=1}^{n} o_{ik} o_{jk}}{\sum_{k=1}^{n} o_{ik}^2 + \sum_{k=1}^{n} o_{jk}^2}.   (4)

In both coefficients:

• o_i and o_j are objects represented by the corresponding n-element vectors of features,
• o_{ik} and o_{jk} denote the k-th element (feature) of the vectors o_i and o_j, respectively, for k = 1, ..., n.
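A minimal Python sketch of Formulas (2)–(4); the function names and the optional halved flag (which reproduces the 1/2 weighting of Formula (3)) are our own. Note that, as printed, Formula (4) evaluates to 1 for identical vectors, i.e., it behaves as a similarity even though it is denoted d; the code transcribes the formulas verbatim:

```python
def jaccard_distance(oi, oj, halved=False):
    """Formula (2): d = 1 - S / (sum(x^2) + sum(y^2) - S), with
    S = sum(x*y).  With halved=True the similarity term is weighted
    by 1/2, giving Formula (3)."""
    num = sum(x * y for x, y in zip(oi, oj))
    den = sum(x * x for x in oi) + sum(y * y for y in oj) - num
    sim = num / den
    return 1 - 0.5 * sim if halved else 1 - sim

def dice_distance(oi, oj):
    """Formula (4): 2 * sum(x*y) / (sum(x^2) + sum(y^2))."""
    num = 2 * sum(x * y for x, y in zip(oi, oj))
    den = sum(x * x for x in oi) + sum(y * y for y in oj)
    return num / den
```

For identical vectors, Formula (2) gives 0 (no dissimilarity), Formula (3) gives 0.5, and Formula (4) gives 1.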

Referring to the basic ant clustering algorithm published previously [11], [12], [17], we can determine the neighborhood function as:

f(o_i) = \begin{cases} \dfrac{1}{\delta^2} \sum_{o_j \in L} \left(1 - \dfrac{d(o_i, o_j)}{\alpha}\right) & \text{if } f(o_i) > 0 \text{ and } \forall_{o_j \in L} \left(1 - \dfrac{d(o_i, o_j)}{\alpha}\right) > 0, \\ 0 & \text{otherwise,} \end{cases}   (5)

where δ is the factor defining the size of the tested neighborhood, α is a parameter scaling the dissimilarities within the neighborhood function, and L represents the neighborhood area of searching.
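Formula (5) can be sketched as below. The 1/δ² normalization is our reading of the prefactor, following the ATTA-style neighborhood function of Handl and Knowles [10]; the callable d stands for either Formula (3) or Formula (4):

```python
def neighborhood_fn(oi, neighbors, d, delta, alpha):
    """Formula (5): f(oi) = (1/delta^2) * sum_{oj in L} (1 - d(oi, oj)/alpha)
    when every term in the sum is positive, and 0 otherwise."""
    terms = [1 - d(oi, oj) / alpha for oj in neighbors]
    # The "forall" condition: any non-positive term forces f(oi) = 0.
    if not terms or any(t <= 0 for t in terms):
        return 0.0
    val = sum(terms) / delta ** 2
    return val if val > 0 else 0.0

# Toy usage with a 1-D distance (illustrative only):
dist = lambda a, b: abs(a - b)
f_val = neighborhood_fn(0.0, [0.1, 0.2], dist, delta=1.0, alpha=1.0)
```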

Our new strategy has been tested on data sets of different sizes. In this paper, the data sets are named DS1, DS2 and DS3, and they contain 780, 200 and 65 objects, respectively. Their vectors of features are described by values of financial data. These data have been obtained by courtesy of the Laboratory of Adapting Economic Innovations in Information Technology Facilities at the University of Information Technology and Management in Rzeszów, Poland.

In order to determine the influence of control parameters in the implemented algorithm, we have examined three sets of the most promising parameters verified for similar types of problems [1], [2], [3], [5], [7], [10], [13], [16], [17], [21], [22]. They were marked with P1, P2 and P3, respectively, and their values are presented in Table 1.

To tune these parameters, we applied a method of auto-tuning their values, determined individually for each thread on the basis of the number f of failed object-dropping operations during the last steps.
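The exact auto-tuning update rule is not detailed here, so the sketch below is purely illustrative: it widens the similarity scale α when most of the recent drop attempts failed and shrinks it otherwise. The rate, failure threshold, and bounds are hypothetical values, not taken from the paper:

```python
def autotune_alpha(alpha, failed_drops, window, rate=0.01, fail_ratio=0.9,
                   lo=0.05, hi=1.0):
    """Illustrative auto-tuning step for the scale parameter alpha, driven
    by the number of failed drop operations over the last `window` steps.
    All constants here are assumptions for the sake of the example."""
    if failed_drops / window > fail_ratio:
        alpha = min(hi, alpha + rate)   # too many failures: be more tolerant
    else:
        alpha = max(lo, alpha - rate)   # drops succeed: tighten the scale
    return alpha
```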

Table 1. The adopted values of the initialization parameters of the implemented algorithm

Parameter        P1    P2    P3
Number of ants   150   200   500
Memory size      20    30    50
Step size        50    100   120
α                0.45  0.75  0.9
β                2     2     2

In the clustering process, we expect a state enabling us to obtain a new partition that is internally homogeneous but externally heterogeneous. If the partition is characterized by high-quality solutions, it means that the number of clusters is fixed properly. Quality assessment methods for grouped data can be related to the verification of the proposed solution on the basis of internal criteria, external criteria, and end-user criteria. The first type of evaluation is associated with a specially selected measure which examines properties of the solution. The most popular measure in this case is the Dunn index [8]. Evaluation results obtained with the use of an external criterion answer the question: "how well does the proposed solution match a solution created by a human?". Here, we use the Rand index [19] and the F-measure [20]. The above-mentioned factors for the creation of the qualitative characteristics of solutions have also been adopted by us for the proposed solution. The first factor (Formula 6) uses the minimum distance d_min between two objects from different groups as well as the maximum distance d_max between two objects within a given group. It verifies whether clusters are compact and well separated:

DN = \frac{d_{min}}{d_{max}}.   (6)
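A direct Python transcription of Formula (6), for clusters given as lists of objects and a pairwise distance d (the representation is our own illustration):

```python
from itertools import combinations

def dunn_index(clusters, d):
    """Formula (6): DN = d_min / d_max, where d_min is the smallest distance
    between two objects from different clusters and d_max is the largest
    distance between two objects within one cluster."""
    d_min = min(d(x, y)
                for a, b in combinations(clusters, 2)
                for x in a for y in b)
    d_max = max(d(x, y) for c in clusters for x, y in combinations(c, 2))
    return d_min / d_max

# Two compact, well-separated 1-D clusters give DN > 1.
dist = lambda a, b: abs(a - b)
dn = dunn_index([[0.0, 1.0], [5.0, 7.0]], dist)
```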

The Rand index, measuring the agreement of two partitions X and Y by comparing each pair of test objects, can be presented as:

R = \frac{a + b}{a + b + c + d},   (7)

where

– a is the number of pairs of objects belonging to the same set in X and the same set in Y,
– b is the number of pairs of objects belonging to different sets in X and different sets in Y,
– c is the number of pairs of objects belonging to the same set in X and different sets in Y,
– d is the number of pairs of objects belonging to different sets in X and the same set in Y.
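Formula (7) can be computed from two label vectors with a straightforward O(n²) pair count (the label-vector representation is our own choice):

```python
from itertools import combinations

def rand_index(labels_x, labels_y):
    """Formula (7): R = (a + b) / (a + b + c + d), counting object pairs
    that the partitions X and Y treat consistently (a: together in both,
    b: separated in both) or inconsistently (c, d)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1
        elif not same_x and not same_y:
            b += 1
        elif same_x:
            c += 1
        else:
            d += 1
    return (a + b) / (a + b + c + d)
```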

The F-measure is based on two components:

– the precision coefficient p:

p = \frac{tp}{tp + fp},   (8)

where tp is the number of correct results (true positives) in a classification process and fp is the number of unexpected results (false positives) in a classification process,

– the recall coefficient r:

r = \frac{tp}{tp + fn},   (9)

where fn is the number of missing results (false negatives) in a classification process.

This means that, in the considered clustering problem, we can distinguish four types of decisions (on the basis of classification):

– a decision is true positive if a tested pair of objects is together both in a pattern group and in a generated group,
– a decision is true negative if a tested pair of objects is located together neither in a pattern group nor in a generated group,
– a decision is false negative when we divide a pair of objects from the pattern collection into different groups,
– a decision is false positive when we put into one group a pair of objects which are not together in any pattern group.

The F-measure, which is the harmonic mean of precision and recall, is calculated using the formula:

F(r, p) = \frac{2pr}{p + r}.   (10)

A generalization of the F-measure uses a non-negative real value β:

F_\beta(r, p) = \frac{(\beta^2 + 1)pr}{\beta^2 p + r}.   (11)
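Formulas (8)–(11) as a short Python sketch (function names are our own):

```python
def precision(tp, fp):
    """Formula (8): p = tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Formula (9): r = tp / (tp + fn)."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Formula (11): F_beta = (beta^2 + 1) * p * r / (beta^2 * p + r).
    With beta = 1 this reduces to Formula (10), the harmonic mean of
    precision and recall."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# Example: 8 true positives, 2 false positives, 8 false negatives.
p, r = precision(8, 2), recall(8, 8)     # p = 0.8, r = 0.5
f1 = f_measure(p, r)
```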

The proposed algorithm based on the presented concept can be written as pseudo-instructions (see Algorithm 1).

The presented idea should guarantee that the operations of picking up and dropping the objects under consideration will be deterministic for a very low value of the density function in the first case and a very high value in the second case. This accelerates the formation of clusters, but only in areas which contain a large number of objects with similar attribute vectors.
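The pick-up and drop probabilities themselves are not given here; one realization that satisfies the determinism requirement above is the ATTA-style pair used by Handl and Knowles [10] (our choice of this pair is an assumption, not a statement of this paper):

```python
def p_pick(f):
    """ATTA-style pick-up probability: certain (deterministic) for a low
    neighborhood density f, decreasing as 1/f^2 for f > 1."""
    return 1.0 if f <= 1.0 else 1.0 / (f * f)

def p_drop(f):
    """ATTA-style drop probability: certain (deterministic) for a high
    neighborhood density f, growing as f^4 for f < 1."""
    return 1.0 if f >= 1.0 else f ** 4
```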

The proposed approach has been compared to other solutions known in this area, i.e., the k-means algorithm as well as the ATTA algorithm created by Handl and Knowles [10], [11], [12], [16].

3 Results

Test results have been obtained over 50 runs. Multiple empirical verifications of the quality of the results on the sample iris data set showed that the best set of parameters for the implemented algorithm is the set P2 (presented in Table 1). The values are in accordance with those suggested by the authors of the ATTA algorithm.

Algorithm 1. Ant Colony Clustering Algorithm based on the normalized distribution function

for each object o_i do
    place o_i randomly on the grid;
end
for each agent (ant) a_i do
    select randomly an object o_i;
    pick up the object o_i by a_i;
    put a_i in a random place on the grid;
end
for t = 1 to t_max do
    select randomly an agent a_i;
    move the agent a_i;
    o = object carried by the agent a_i;
    calculate the probability of dropping the object o;
    dropped = try to drop the object o;
    if dropped = true then
        raised = false;
        while raised = false do
            o_i = select randomly one of the free objects;
            calculate the probability of picking up the object o_i;
            raised = try to pick up the object o_i;
        end
    end
end
print object positions;

The results of the assessment of the proposed algorithm with the Jaccard coefficient and with the Dice coefficient, as well as the ATTA algorithm and the k-means algorithm, using the three indexes for the examined data sets, are presented in Table 2.

The results show that the new approach with the Jaccard coefficient works better for smaller data sets; for this type of collection, it outperformed all the other tested algorithms. For larger collections, the k-means algorithm works better; this concerned the cases where the data sets consisted of 200 and 780 objects, respectively. For the largest set, the algorithm with the Dice coefficient achieved better results than the ATTA algorithm and the algorithm with the Jaccard coefficient.

Table 2. Evaluation of the quality of the new approach in comparison to the ATTA algorithm and the k-means algorithm

Data set  Index       New approach       New approach     ATTA       k-means
                      (Jaccard coeff.)   (Dice coeff.)    algorithm  algorithm
DS1       Dunn index  0.954              0.965            0.886      0.986
DS1       Rand        0.903              0.963            0.857      0.978
DS1       F-measure   0.846              0.871            0.865      0.892
DS2       Dunn index  0.708              0.598            0.694      0.962
DS2       Rand        0.552              0.463            0.504      0.564
DS2       F-measure   0.574              0.481            0.473      0.678
DS3       Dunn index  1.391              0.917            1.227      0.770
DS3       Rand        0.912              0.743            0.836      0.862
DS3       F-measure   0.782              0.757            0.764      0.747

Studies of the proposed modifications in relation to the standard version of the ant clustering algorithm showed the impact of the proposed solution on the linear correlation of the main parameters of the algorithm. The obtained results are included in Table 3. Analysis of these values indicates that the adopted approach allowed a reduction of the correlation of the factors used in the proposed algorithm in relation to the ATTA algorithm: in this case, the range was from 0.10 to 0.13, while the ATTA algorithm had a range from 0.49 to 0.62. Therefore, we can conclude that the algorithm with the Jaccard coefficient and the algorithm with the Dice coefficient are suitable for use in cases where additional knowledge about the groups of objects contained in the data set is not available.

Table 3. The impact of the proposed solution on the linear correlation of the main parameters of the algorithm

Algorithm                               Data set  Correlation coefficient
New approach with Jaccard coefficient   DS1       0.132473
New approach with Jaccard coefficient   DS2       0.114033
New approach with Jaccard coefficient   DS3       0.124522
New approach with Dice coefficient      DS1       0.132473
New approach with Dice coefficient      DS2       0.106767
New approach with Dice coefficient      DS3       0.104678
ATTA                                    DS1       0.624273
ATTA                                    DS2       0.565561
ATTA                                    DS3       0.498221

The best results for most collections have been achieved by the k-means algorithm. However, in the case of the proposed approach, the number of groups was unknown in advance. This property is therefore an important advantage of the proposed approach.

4 Conclusions

An attempt to find a proper partition of any set of objects described by vectors of features (attributes) is one of the most difficult and complex tasks. In fact, data clustering is often associated not only with a lack of information on the number of output classes, but also with the necessity of interpretation and standardization of input data. Existing and implemented solutions, both deterministic and non-deterministic, are effective for small spaces. Therefore, a new approach to the clustering problem has been proposed in this paper, based on the verified mechanism of swarm intelligence, i.e., the collective activity of agents (ants). The agents carry out a search of the available solutions, with specified heuristic rules determining a function of similarity of adjacent objects; this has a significant impact on the probability of picking up or dropping an object. Previously implemented and verified heuristic algorithms take into account an applicable measure of distance between the examined objects in the space of solutions, and the most common starting point is designing a distance matrix using the Minkowski metric for this purpose. Meanwhile, as our experience demonstrates, in many cases the better approach is the modified hierarchical ant colony clustering algorithm, taking into account a function of similarity of attributes that uses the distribution function based on the Jaccard coefficient or the Dice coefficient. The obtained results confirmed its greater usefulness in the case of vectors of standardized values, because of the dominance of attributes associated with large values over those associated with smaller values. In this case, the proposed modification of the heuristic approach delivers a faster algorithm, which requires fewer iterations to obtain the same result, and a solution that rarely stops at local optima. Such a solution will be appropriate in the case of non-structured data representing a significant problem with mapping images.

The proposed approach can be successfully applied to data sets with quantitative features. However, in practice, there may be situations where the test data include both quantitative and qualitative features. In such a situation, separate grouping of objects for each criterion, using a proper measure of similarity, should be considered. Another issue is the consideration of data containing partial and incomplete descriptions. Such cases will therefore be the next step of our research, which will allow us to gain knowledge about the possibilities of the current direction and its modifications depending on the different types of analyzed data.

Acknowledgments. This paper has been partially supported by grant No. N N519 654540 from the National Science Centre in Poland.

References

1. Abbass, H., Hoai, N., McKay, R.: AntTAG: A new method to compose computer programs using colonies of ants. In: Proceedings of the IEEE Congress on Evolutionary Computation, Honolulu (2002)
2. Azzag, H., Monmarche, N., Slimane, M., Venturini, G.: AntTree: a new model for clustering with artificial ants. In: Proceedings of the 2003 Congress on Evolutionary Computation, Beijing, China, pp. 2642–2647 (2003)
3. Berkhin, P.: Survey of clustering data mining techniques. Tech. rep., Accrue Software, Inc., San Jose, California (2002)
4. Bin, W., Zhongzi, S.: A clustering algorithm based on swarm intelligence. In: Proceedings of 2001 International Conferences on Info-tech and Info-net, Beijing, China, pp. 58–66 (2001)
5. Boryczka, U.: Ant clustering algorithm. In: Proceedings of the Conference on Intelligent Information Systems, Zakopane, Poland, pp. 377–386 (2008)
6. Deneubourg, J., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., Chretien, L.: The dynamics of collective sorting: Robot-like ants and ant-like robots. In: Proceedings of the First International Conference on Simulation of Adaptive Behaviour: From Animals to Animats 1, pp. 356–365. MIT Press, Cambridge (1991)
7. Dorigo, M., Di Caro, G., Gambardella, L.M.: Ant algorithms for discrete optimization. Artificial Life 5(2), 137–172 (1999)
8. Dunn, J.: A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics 3(3), 32–57 (1973)
9. Han, Y., Shi, P.: An improved ant colony algorithm for fuzzy clustering in image segmentation. Neurocomputing 70(4-6), 665–671 (2007)
10. Handl, J., Knowles, J., Dorigo, M.: Ant-based clustering and topographic mapping. Artificial Life 12(1), 35–62 (2006)
11. Handl, J., Knowles, J., Dorigo, M.: Ant-based clustering: a comparative study of its relative performance with respect to k-means, average link and 1D-SOM. Tech. rep., IRIDIA (2003)
12. Handl, J., Knowles, J., Dorigo, M.: Strategies for the increased robustness of ant-based clustering. In: Di Marzo Serugendo, G., Karageorgos, A., Rana, O.F., Zambonelli, F. (eds.) ESOA 2003. LNCS (LNAI), vol. 2977, pp. 90–104. Springer, Heidelberg (2004)
13. Lewicki, A.: Generalized non-extensive thermodynamics to the ant colony system. In: Swiatek, J., Borzemski, L., Grzech, A., Wilimowska, Z. (eds.) Information Systems Architecture and Technology: System Analysis Approach to the Design, Control and Decision Support, Wroclaw (2010)
14. Lewicki, A.: Non-Euclidean metric in multi-objective ant colony optimization algorithms. In: Swiatek, J., Borzemski, L., Grzech, A., Wilimowska, Z. (eds.) Information Systems Architecture and Technology: System Analysis Approach to the Design, Control and Decision Support, Wroclaw (2010)
15. Lewicki, A., Tadeusiewicz, R.: The recruitment and selection of staff problem with an ant colony system. In: Proceedings of the 3rd International Conference on Human System Interaction, Rzeszow, Poland, pp. 770–774 (2010)
16. Lewicki, A., Tadeusiewicz, R.: An autocatalytic emergence swarm algorithm in the decision-making task of managing the process of creation of intellectual capital. In: Hippe, Z.S., Kulikowski, J.L., Mroczek, T. (eds.) Human–Computer Systems Interaction, Part I. AISC, vol. 98, pp. 271–285. Springer, Heidelberg (2012)
17. Lewicki, A., Pancerz, K., Tadeusiewicz, R.: The use of strategies of normalized correlation in the ant-based clustering algorithm. In: Panigrahi, B.K., Suganthan, P.N., Das, S., Satapathy, S.C. (eds.) SEMCCO 2011, Part I. LNCS, vol. 7076, pp. 637–644. Springer, Heidelberg (2011)
18. Ouadfel, S., Batouche, M.: An efficient ant algorithm for swarm-based image clustering. Journal of Computer Science 3(3), 162–167 (2007)
19. Rand, W.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971)
20. van Rijsbergen, C.J.: Information Retrieval. Butterworth, London (1979)
21. Scholes, S., Wilson, M., Sendova-Franks, A.B., Melhuish, C.: Comparisons in evolution and engineering: The collective intelligence of sorting. Adaptive Behavior 12(3-4), 147–159 (2004)
22. Vizine, A., de Castro, L., Hruschka, E., Gudwin, R.: Towards improving clustering ants: An adaptive ant clustering algorithm. Informatica 29(2), 143–154 (2005)