Data Clustering for Heterogeneous Data

Teresa Gonçalves, Pedro Melgueira

Universidade de Évora

[email protected], [email protected]

Filipe Clérigo, Ricardo Raminhos, Rui Estêvão

VIATECLA SA

[email protected], [email protected], [email protected]

Abstract. The continuous growth in the volume of information / data does not mean a proportional increase in the knowledge related to it. Automatic analysis mechanisms based on Artificial Intelligence algorithms (supervised or not) can represent an important added value when they are naturally integrated in repositories specialized in managing large volumes of content (i.e. CMS, Content Management Systems). Since some of these repositories are open, they allow the organisations that use them a high level of flexibility, as it is possible to freely model their business data structures. However, this also means the repositories are not restricted to a specific information domain, which poses a great challenge to the way data is interpreted and analysed by AI algorithms (since the data is not known beforehand), both for detecting contents that share similar group characteristics and for finding the most important attributes for data analysis. This is the main purpose of the SMART Content Provider prototype. The current paper presents the results obtained in the area of AI algorithms for clustering and attribute suggestion analysis, applied to open repositories of data.

    1 The SMART Content Provider (CP) Project

Through the Smart CP [2] project, research on enhancing intelligence in CMS environments was carried out under three main pillars:

• Enhancement of mechanisms for aggregating heterogeneous information (where the structures and objects are not known beforehand);

• Definition of Artificial Intelligence algorithms, in particular in the area of pattern detection on semi-structured information;

• Mechanisms of data presentation applied to results / contents, exploring non-conventional formats and ways of representing information that contribute to a more fluid knowledge exploration.

The knowledge resulting from this research has been materialized in a prototype of a generic platform for data visualization and interaction, referred to as SMART Content Provider (CP), a project developed by VIATECLA [21], supported by Universidade de Évora [3] and GTE Consultores [7], and co-financed by QREN (Quadro de Referência Estratégico Nacional, the National Strategic Reference Framework) [18].

The present paper focuses only on the second element of the project, related to the detection and suggestion of possible patterns present in data through the application of AI algorithms and evaluation heuristics. A general presentation of the project, in terms of its objectives, architecture and results, can be found in the paper SMART Content Provider [14], whilst a detailed presentation of the graphical components for data representation and exploration is available in the paper SMART Data Visualization and Exploration [15].

    1.1 Architecture

Figure 1 shows a global vision of the SMART CP platform architecture. A three-colour scheme is used to characterize the functional blocks that compose the platform and its external interactions:

Orange: Completely external to the platform, with which the SMART CP platform interacts to obtain data / contents;

Green: Functional blocks with which the SMART CP platform is integrated, i.e. the native content management system that supports the platform;

    Purple: Native blocks from the SMART CP platform.

The architecture of the SMART CP platform follows a classic client / server paradigm, as presented in Figure 1. Blocks belonging to the server component are represented at the top of the image, and the client-related blocks at the bottom. Because the SMART CP platform uses data / contents present in content management systems, all client functional groups (i.e. data sorting, data visuals and exploration, accountability and workflows) are integrated in the content management system backoffice itself.

The functional block SMART Analyser is responsible for all AI-related data processing, analysis and suggestion. It is the main focus of the current paper, which presents its internal functioning and the approaches followed during its implementation.

Fig. 1. General diagram of the architecture of the SMART CP platform.

2 State of the Art - Clustering Algorithms and Feature Extraction

    2.1 Clustering

Clustering is the process of finding groups of objects in a dataset. The clustering process creates groups such that the objects in a group are more similar to each other than to the objects in other groups. Clustering is usually applied to data that is not yet classified or divided in any way.

One of the first difficulties in clustering is finding the characteristics that best characterize the objects. A dataset may contain information that is simply not useful, or information which is only useful after some transformation. Regardless, the features of the objects are taken in some numerical representation [24].

Formally, there is a dataset S. An object is a feature vector o ∈ S. As an example, suppose a small dataset that stores bug reports, without much information, comprising 4 fields:

    (Pr) Project; (Re) Relevance; (We) WeekDay; (De) Description.

An object from this dataset would be represented by a vector,

o = [Pr, Re, We, De].
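For illustration only (the field values below are invented, not taken from the paper), such objects can be represented directly as vectors in Python:

# Hypothetical bug-report objects, each a vector [Pr, Re, We, De].
dataset = [
    ["ProjA", "High", "Monday", "Crash when saving"],
    ["ProjA", "Low", "Friday", ""],
    ["ProjB", "High", "Monday", "Wrong sort order"],
]

o = dataset[0]  # one object o from the dataset S
print(o)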

    2.2 Performance Measures

Since clustering is an unsupervised method, any performance evaluation must be done using the clustering model itself. Regardless of being supervised or not, a performance measure for clustering looks at how similar the objects of one cluster are to each other, and how dissimilar they are to the objects of other clusters. As a general rule, the objects of one cluster should be very similar to each other, while objects in different clusters should be very dissimilar.

For performance evaluation in unsupervised learning, a well-known measure called the Silhouette Coefficient [19] is used. This measure takes two variables, a and b, into account, both computed from the clustered data. The value of a is the mean distance between a sample object and all other objects in the same cluster. The value of b is the mean distance between the same object and the objects of the nearest other cluster.

Having these values, the following is calculated:

s = (b − a) / max(a, b).

The result of the Silhouette Coefficient, s, is a real value between −1 and 1. The closer the value is to 1, the better the clustering. A value close to 0 means that the clusters overlap, and values close to −1 indicate that the objects are mostly assigned to the wrong clusters.
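As an aside (not part of the paper's implementation), the Silhouette Coefficient is available off the shelf in scikit-learn [20]; the toy data and parameters below are assumptions used purely for illustration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs, so the score should be close to 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all samples: 1 is ideal, 0 means overlap,
# negative values suggest wrong assignments.
print(silhouette_score(X, labels))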

    2.3 K-Means and Variants

The first approach to many clustering tasks uses K-Means [13,22,1,11,17]. This algorithm has a very general definition and is the starting point for many different practices. The algorithm is parametrized with the number of clusters it should find, so the number of clusters found might not be optimal.

Because the algorithm doesn't find the optimal number of clusters itself, alternative methods must be used to find that number. One such method executes the same algorithm with many different parameters; it is then possible to obtain different clusterings and pick the best one according to the performance measures. Other methods to find a better parametrization are discussed in the following sections.

K-Means uses the concept of centroid. A centroid is an object which has the same features as the objects in S. Given a subset Z ⊆ S, the centroid c(Z) is the object whose features are the average of all the objects in Z. Formally,

c(Z) = [c_0, . . . , c_{m−1}],   c_i(Z) = (1/n) Σ_{z ∈ Z} z_i,

where m is the number of features of the objects and n is the size of Z.

The algorithm progresses by updating the positions of the centroids, and ends when those positions no longer change from iteration to iteration. In each iteration step, the proximity of each element to the centroids is calculated. Proximity may vary from problem to problem; in general, the Euclidean distance is used. Formally, for two arbitrary objects a and b, the distance is defined as

dist(a, b) = √( Σ_i (a_i − b_i)² ).

The distance gives a sense of proximity between two objects: the closer two objects are, the smaller the distance. For a distance of 0, the objects are considered to be equal.

Other distances may be used, for example the Manhattan distance,

dist_1(a, b) = ‖a − b‖_1 = Σ_i |a_i − b_i|.

Another very useful distance is the squared Euclidean distance,

dist_2(a, b) = ( √( Σ_i (a_i − b_i)² ) )² = Σ_i (a_i − b_i)².

This distance is similar to the normal Euclidean distance, but it is computationally simpler because it needs no square root calculation. It is not a real metric, because it does not satisfy the triangle inequality, but it can be used as one.
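A minimal NumPy sketch of the three distances discussed above (illustrative, not taken from the project's code):

import numpy as np

def euclidean(a, b):
    # dist(a, b) = sqrt(sum_i (a_i - b_i)^2)
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # dist_1(a, b) = sum_i |a_i - b_i|
    return np.sum(np.abs(a - b))

def squared_euclidean(a, b):
    # dist_2(a, b) = sum_i (a_i - b_i)^2, no square root needed
    return np.sum((a - b) ** 2)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), squared_euclidean(a, b))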

K-Means has an inconvenience regarding certain data types. The algorithm requires data which is nominal and sortable, and to which arithmetic operations can be applied. Data is nominal when its features can be distinguished in some way, i.e. when two operators, = and ≠, can be defined:

a = b ⟺ ∀i : a_i = b_i,

a ≠ b ⟺ ∃i : a_i ≠ b_i.

The first states that a and b are equal because all of their features are equal; the second states that they are not equal because at least one of their features is not.

Sortable data must also be nominal, and it must be possible to define some order on it, for example if the data is lexicographically sortable or if it represents some kind of rank. Finally, data to which arithmetic operations may be applied is always numeric.

K-Modes [8,9] is a variant of the K-Means algorithm designed to deal with nominal data. The difference between the two algorithms lies in how the distance function is defined, which here is closer to a similarity function:

d(a, b) = Σ_i δ(a_i, b_i),   where δ(a_i, b_i) = 1 if a_i = b_i, and 0 if a_i ≠ b_i.

Because the datasets in this project are almost entirely nominal, it only makes sense to work with K-Modes rather than K-Means.
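The K-Modes implementation cited later in the paper [6] can be used along the following lines; the toy data and parameter values are assumptions for illustration:

import numpy as np
from kmodes.kmodes import KModes  # the library cited as [6]

# Purely nominal toy data: [Project, Relevance, WeekDay].
X = np.array([
    ["A", "High", "Mon"],
    ["A", "High", "Tue"],
    ["B", "Low", "Fri"],
    ["B", "Low", "Thu"],
])

# As with K-Means, the number of clusters is a parameter.
km = KModes(n_clusters=2, n_init=5)
labels = km.fit_predict(X)
print(labels)                 # cluster assignment per object
print(km.cluster_centroids_)  # the mode (most frequent value) per feature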

    2.4 Affinity Propagation

Affinity Propagation [5,23,10,12,4] is a clustering algorithm which doesn't require the number of clusters as an initial parameter. The algorithm needs a way to define distance or similarity between objects: for objects o_i, o_j, o_k ∈ S, if o_i is more similar to o_j than it is to o_k, then the following must hold,

s(o_i, o_j) > s(o_i, o_k).

The algorithm uses two matrices that are updated at each iteration step: the responsibility matrix R and the availability matrix A. A value R_{i,k} shows how well suited element k is to act as the representative of element i, relative to the other candidate representatives. A value A_{i,k} states how appropriate it is for element i to pick k as its representative.

Each iteration of the algorithm updates both matrices until convergence is reached. First, the responsibility matrix is updated using the rule

R_{i,k} = s(o_i, o_k) − max_{k′ ≠ k} { A_{i,k′} + s(o_i, o_{k′}) }.

Then, the following rules update the availability matrix:

A_{i,k} = min( 0, R_{k,k} + Σ_{i′ ∉ {i,k}} max(0, R_{i′,k}) ),   for i ≠ k,

A_{k,k} = Σ_{i′ ≠ k} max(0, R_{i′,k}).
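Affinity Propagation is available in scikit-learn [20], the library the project uses; the toy data below is an assumption for illustration:

import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (15, 2)),
               rng.normal(4, 0.2, (15, 2))])

# No number of clusters is passed in; it emerges from the
# responsibility / availability message passing described above.
ap = AffinityPropagation(damping=0.5, random_state=0).fit(X)
print(len(ap.cluster_centers_indices_))  # number of clusters found
print(ap.labels_)                        # cluster assignment per object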

2.5 Joining Affinity Propagation and K-Modes

In this project, results from the two algorithms above are combined. The clustering process starts with the Affinity Propagation algorithm. Upon its completion, the available results show which objects in the data belong to which cluster, and how many clusters were found by the algorithm itself.

The number of clusters calculated by Affinity Propagation not only gives a good estimate of the number of clusters in the data, but also gives a starting point for clustering with K-Modes. Supposing that Affinity Propagation yielded N clusters, K-Modes is then run once for each number of clusters in {N − I, . . . , N + I}, with I being some positive integer. This yields a set of candidate clusterings. A final analysis is done over all of the results, and the clustering solution that performs best according to the Silhouette Coefficient is picked for further analysis. A minimal sketch of this combined procedure follows.
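The sketch below assumes the dataset has already been pre-processed into integer codes; the Hamming metric used for the silhouette is one reasonable choice for nominal data, not necessarily the project's exact choice:

import numpy as np
from kmodes.kmodes import KModes
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

def combined_clustering(X, I=2):
    # Step 1: Affinity Propagation proposes a number of clusters N.
    ap = AffinityPropagation(random_state=0).fit(X)
    N = len(ap.cluster_centers_indices_)

    # Step 2: run K-Modes for every k in {N - I, ..., N + I}
    # (clamped at 2, since a silhouette needs at least two clusters).
    candidates = [ap.labels_] if N > 1 else []
    for k in range(max(2, N - I), N + I + 1):
        candidates.append(KModes(n_clusters=k, n_init=5).fit_predict(X))

    # Step 3: keep the solution with the best Silhouette Coefficient.
    return max(candidates,
               key=lambda labels: silhouette_score(X, labels, metric="hamming"))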

    2.6 Feature Extraction

The end result of this project aims at displaying only two or three features and how the data is distributed over those features. The final display arranges the data by two features in a table-like fashion, with the two axes of the table assigned to two features. Each cell in the table contains several points which are randomly scattered across it. The points may have a colour, shape, or size associated with them, so the table can display a third, fourth, or fifth feature.

The clustering tasks will find clusters over more features than just two or three, so a process is needed that finds the most interesting groups of features to display to the end user. This process is called Feature Extraction.

The features that are extracted are the ones with the best distribution of data. Having a good distribution of data means that those features alone are able to display distinct clusters in the data. A function is defined that conveys what a good distribution is, based on the notion of entropy.

A conditional probability distribution is defined of the form

P(C | F_1, F_2, . . . , F_m),

where F_k is a feature, m is the total number of features, and C is a cluster. The distribution states the probability of an object with the given features belonging to cluster C. The entropy of such a distribution, for any set of features, will be close to 0 if those features are representative of the clustering, so values closer to 0 are better. Entropy is therefore the heuristic used when searching for a good set of features.

The process that finds these distributions tries different sets of features and keeps the ones that perform best according to the heuristic, as sketched below.
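A sketch of such a heuristic, under the assumption (an illustration, not the paper's exact formulation) that the conditional entropy H(C | F) is estimated empirically from the clustered objects:

import numpy as np
from collections import Counter
from itertools import combinations

def conditional_entropy(features, clusters):
    # Empirical H(C | F): close to 0 when the chosen feature columns
    # almost determine the cluster an object belongs to.
    rows = [tuple(r) for r in features]
    joint = Counter(zip(rows, clusters))
    marginal = Counter(rows)
    n = len(clusters)
    return -sum((cnt / n) * np.log2(cnt / marginal[f])
                for (f, _), cnt in joint.items())

def best_feature_pairs(X, clusters, keep=5):
    # Score every pair of feature columns and keep the `keep` pairs
    # whose conditional entropy is closest to 0.
    scores = {(i, j): conditional_entropy(X[:, [i, j]], clusters)
              for i, j in combinations(range(X.shape[1]), 2)}
    return sorted(scores, key=scores.get)[:keep]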

3 Architecture

The high-level architecture of the SMART CP prototype is presented in Figure 2; it is organized as a flow of four main blocks that work sequentially. With a dataset made available by the CMS in JSON format as input, the pre-processing block starts; it is responsible for filtering and normalizing data from external platforms. The following block, clustering, receives an already normalized output in a numeric CSV format (with all textual information mapped to numeric values, which are easier to process); its objective is to obtain aggregation sets. In the extraction of relevant attributes block, the best attributes, the ones which contribute to a better distribution of content, are determined, taking the previously found clusters into account. Finally, the output block, responsible for formatting the output data, yields the best clustering solution and the best attributes.

Fig. 2. High-level view of the SMART CP prototype's architecture.

Given its complexity, the pre-processing phase can be broken down into 4 steps, as shown in Figure 3. In the first step, fields and data considered useless are removed. In the enumeration mapping step, nominal attributes are substituted by numeric ones; this conversion is necessary to ease the work of the clustering algorithms. The date handling step processes the fields containing dates, splitting them into year, month and day of the week. Finally, the data is normalized into a CSV format to be consumed by the following phases.

Fig. 3. Functional detail of the pre-processing block.

The clustering phase may also be broken down into steps (Figure 4). Initially, the Affinity Propagation algorithm is applied; it yields a clustering solution and the number of clusters N it found. The value of N is later used as the basis for clustering with the K-Modes algorithm, which is run several times with different variants of the proposed N.

    Fig. 4. Functional detail of the clustering block.

The evaluation block then uses the performance measures to determine which are the best clustering solutions, and yields the clusters found.

    4 Implementation

    4.1 Pre-processing

The datasets used in this project are formatted as JSON files. Each JSON file is a list of objects. Along with this file there is a schema that states the data type of each field of the objects, along with other metadata which is not used in this project. The fields are the same for every object.

Nominal data tends to have string-like data types, such as comments and titles. Clustering is about finding similarities in data, and in this project no effort is made to mine information from natural language; as a result, unless a text field is empty, there would be one unique value per object and no similarities would ever be found. To avoid this problem, such fields are replaced by a boolean value which states whether the field is empty or not.

Some fields have small domains that are still text based. This is not exactly a problem for clustering; however, it is computationally expensive for the algorithms to compare strings. To simplify the process, such fields are mapped to integer values. For example, if there is a field called Importance whose domain is Not Important, Important, and Very Important, then the values of the domain are mapped to 0, 1, and 2, respectively.

Dates have a problem similar to that of string-like fields: their values are very diverse and scattered through time, so if the whole date is taken into account they will almost always be dissimilar. To avoid this problem, dates are transformed into more useful information, currently by keeping only the year, month, and day of the week in the dataset. With these alterations it is possible to find similarities in terms of dates.

The output of this phase is a CSV file. This format is used because it is more or less the norm for Machine Learning and Data Mining algorithms, and because most algorithm implementations are ready to consume this kind of data. The file has a header with the name of each field, followed by a line for each object. This file is not the only output: a second file contains the dictionaries used in the mapping process.

This whole process is implemented as a Python 3 script; no third-party libraries were used.
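For illustration, the steps above could look roughly as follows. This sketch uses pandas purely for brevity (the project's own script, as noted, uses no third-party libraries), and the field lists are assumed to come from the schema file:

import pandas as pd

def preprocess(json_path, csv_path, text_fields, date_fields):
    df = pd.read_json(json_path)  # a JSON list of flat objects

    # Free-text fields: keep only whether each value is empty or not.
    for f in text_fields:
        df[f] = df[f].fillna("").str.len().gt(0).astype(int)

    # Date handling: keep only year, month and day of the week.
    for f in date_fields:
        d = pd.to_datetime(df[f])
        df[f + "_year"] = d.dt.year
        df[f + "_month"] = d.dt.month
        df[f + "_weekday"] = d.dt.dayofweek
        df = df.drop(columns=f)

    # Enumeration mapping: remaining string columns become integer codes;
    # the dictionaries are returned so values can be recovered later.
    mappings = {}
    for f in df.select_dtypes(include="object").columns:
        codes, uniques = pd.factorize(df[f])
        df[f] = codes
        mappings[f] = list(uniques)

    df.to_csv(csv_path, index=False)
    return mappings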

    4.2 Clustering and Feature Extraction

Clustering is done using two algorithms: K-Modes, the K-Means variant described earlier, and Affinity Propagation. As described in Section 2.5, the clustering process starts with the Affinity Propagation algorithm, which takes the pre-processed dataset as input and yields a mapping of objects to the calculated clusters, along with the number of clusters it found.

That number of clusters is then used to parametrize the execution of the K-Modes algorithm: if Affinity Propagation reported N clusters, K-Modes is executed for every number of clusters from N − I to N + I.

At the end of the clustering part there is one clustering model fitted using Affinity Propagation and several models fitted using the different K-Modes parametrizations.

Feature extraction is done by testing different feature groups with the heuristic function; the five groups that perform best are kept.

The implementation used for K-Modes is [6], a Python library built on top of NumPy [16] and distributed under the MIT license. The implementation of Affinity Propagation comes from the scikit-learn Python library [20].

    4.3 Output

There are two outputs to take into account: the clustering output and the feature extraction output. The clustering output states which objects belong to which cluster; a cluster is identified by a non-negative integer, while objects are identified by their ids. The output is a JSON document with the format shown in Figure 5.

The output of the feature extraction part is a list of groups, each of which in turn contains a list of fields of the dataset. These groups state that the fields they contain were considered interesting for display, according to the definitions given. Figure 6 shows an example of this JSON.

The JSON format is used for interoperability between the various products and applications used with this project.

    [

    {"Cluster": 0, "Ids": [...]},

    {"Cluster": 1, "Ids": [...]},

    ...

    {"Cluster": n, "Ids": [...]}

    ]

    Fig. 5. JSON for clustering result.

[

{"Group": ["Importance", "WeekDay"]},

{"Group": ["Importance", "Project"]},

...

{"Group": ["Project", "Description"]}

]

    Fig. 6. JSON for feature extraction.
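As an illustration (the function and argument names below are hypothetical, not the project's), the clustering output of Figure 5 can be produced from a label vector as follows:

import json

def clustering_output(ids, labels):
    # Group object ids by their assigned cluster number (Fig. 5 format).
    out = [{"Cluster": int(c),
            "Ids": [i for i, l in zip(ids, labels) if l == c]}
           for c in sorted(set(labels))]
    return json.dumps(out, indent=2)

print(clustering_output(["id1", "id2", "id3"], [0, 1, 0]))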

    5 Future Work

With the clustering implementation complete, further experiments must be carried out to assure its correctness and usefulness. From these experiments it will be possible to observe the performance of both algorithms, as well as the performance of the method discussed here that joins them. Regarding feature extraction, a method to search for the optimal features will be developed, based on state space search with heuristics. Finally, it is fundamental for the project to be tested with larger datasets of a different nature. These experiments will allow observations on the overall performance of the developed system.

    References

1. P. S. Bradley and Usama M. Fayyad. Refining initial points for k-means clustering. Pages 91-99. Morgan Kaufmann, 1998.

2. Microsite SMART CP. http://www.viatecla.com/inovacao/smart content provider. 2015.

3. U. de Évora. http://www.uevora.pt/. 2015.

4. Delbert Dueck and Brendan J. Frey. Non-metric affinity propagation for unsupervised image categorization.

5. Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972-976, 2007.

6. K-Modes GitHub. https://github.com/nicodv/kmodes. 2015.

7. GTE. http://www.gte.pt/. 2015.

8. Zhexue Huang. Clustering large data sets with mixed numeric and categorical values. In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21-34, 1997.

9. Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov., 2(3):283-304, September 1998.

10. Tao Li. A general model for clustering binary data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 188-197, New York, NY, USA, 2005. ACM.

11. Aristidis Likas, Nikos A. Vlassis, and Jakob J. Verbeek. The global k-means clustering algorithm. Pattern Recognition, 36(2):451-461, 2003.

12. Zhengdong Lu and M. A. Carreira-Perpiñán. Constrained spectral clustering through affinity propagation. In Computer Vision and Pattern Recognition (CVPR 2008), IEEE Conference on, pages 1-8, June 2008.

13. J. MacQueen. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, 1967.

14. Filipe Clérigo, Ricardo Raminhos, Rui Estêvão, Teresa Gonçalves, and Pedro Melgueira. SMART Content Provider. 2015.

15. Filipe Clérigo, Ricardo Raminhos, Rui Estêvão, Teresa Gonçalves, and Pedro Melgueira. SMART Data Visualization and Exploration. 2015.

16. NumPy. http://www.numpy.org/. 2015.

17. Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 727-734, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

18. QREN. http://www.qren.pt/np4/home. 2015.

19. Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987.

20. SciKit-Learn. http://scikit-learn.org/. 2015.

21. Site Institucional VIATECLA. http://www.viatecla.com. 2015.

22. Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schrödl. Constrained k-means clustering with background knowledge. In ICML, pages 577-584. Morgan Kaufmann, 2001.

23. Kaijun Wang, Junying Zhang, Dan Li, Xinna Zhang, and Tao Guo. Adaptive affinity propagation clustering. CoRR, abs/0805.1096, 2008.

24. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.