
Survey of Distributed Clustering Techniques

Horatiu Mocian
Imperial College London

1st term ISO report

Supervisor: Dr Moustafa Ghanem



Abstract

Research in distributed clustering has been very active in the last decade. As part of the broader distributed data mining paradigm, clustering algorithms have been employed in a variety of distributed environments, from computer clusters, to P2P networks with thousands of nodes, to wireless sensor networks.

In trying to observe different distributed clustering algorithms, there are many aspects that need to be considered: the type of data on which the algorithm is applied (genes, text, medical records, etc.), the partitioning of data (homogeneous or heterogeneous), the environment in which it has to run (LAN, WAN, P2P network), the privacy requirements, and many others. Accordingly, all this information needs to be known about the algorithms in order to evaluate them properly.

Although there are plenty of data clustering surveys, none focuses exclusively on distributed clustering. The aim of this report is to present the most influential algorithms that have appeared in this field. But the most important contribution is a taxonomy for classifying and comparing the existing algorithms, as well as placing future distributed clustering techniques into perspective. The second major contribution is the proposal of a new parallel clustering algorithm, Parallel QT.


Contents

1 Introduction
  1.1 Data Mining
  1.2 Distributed Clustering
  1.3 Motivation
  1.4 Organization of the Report

2 Clustering
  2.1 Clustering Algorithms
  2.2 Similarity Measures
  2.3 Clustering Evaluation Measures
  2.4 Clustering High Dimensional Data Sets
    2.4.1 Feature Selection
    2.4.2 Coclustering
  2.5 Examples of Clustering Applications
    2.5.1 Document Clustering
    2.5.2 Gene Expression Clustering
  2.6 Summary

3 Distributed Clustering
  3.1 Distributed Data Mining
  3.2 Key Aspects of Distributed Clustering
    3.2.1 Distributed Data Scenarios
    3.2.2 Types of Distributed Environments
    3.2.3 Communication Constraints
    3.2.4 Facilitator/Worker vs P2P
    3.2.5 Types of Data Transmissions
    3.2.6 Scope of Clustering
    3.2.7 Ensemble Clustering
    3.2.8 Privacy Issues
  3.3 Distributed Clustering Performance Analysis
    3.3.1 Measuring Scalability
    3.3.2 Evaluating Quality
    3.3.3 Determining Complexity
  3.4 Summary

4 Survey of Distributed Clustering Techniques
  4.1 Initial Taxonomy
  4.2 Review of Algorithms
  4.3 Refined Taxonomy

5 Parallel QT: A New Distributed Clustering Algorithm
  5.1 Motivation
  5.2 Original QT Clustering
    5.2.1 Advantages of QT Clustering
  5.3 Parallel QT Clustering
    5.3.1 Challenges of the algorithm

6 Conclusion and Future Work
  6.1 Emerging Areas of Distributed Data Mining
    6.1.1 Sensor Networks
    6.1.2 Data Stream Mining
  6.2 Future Work


1 Introduction

Clustering, or unsupervised learning, is one of the most basic, and at the same time most important, fields of machine learning. It splits data into groups of similar objects, helping to extract new information by summarizing the data or by discovering new patterns. Clustering is used in a variety of fields, including statistics, pattern recognition and data mining. This paper focuses on clustering from the data mining perspective.

1.1 Data Mining

Data mining, also called knowledge discovery from databases (KDD), is a field that has gained a lot of attention in the last decade and a half [22]. It tries to discover meaningful information from large datasets that are otherwise unmanageable. Knowledge can be extracted either by aggregating the data (the best example is OLAP, but clustering also achieves that) or by discovering new patterns and connections between data. As the amount of data available in digital format has increased exponentially in recent years, so has the employment of data mining techniques across a varied range of fields: business intelligence, information retrieval, e-commerce, bioinformatics, counter-terrorism, and so on.

Two related subfields of data mining that have developed rapidly, aided by an ever-increasing Internet penetration rate, are text mining and web mining. The former tries to discover new information by analyzing sets of documents. In addition to common techniques, text mining has specific tasks like entity extraction, sentiment analysis, or generation of taxonomies [38]. Web mining derives new knowledge by exploring the Internet. Kosala and Blockeel [55] split this field into three subcategories: web content mining, web structure mining and web usage mining.

The rise of the Internet also brought two major problems to data mining: first, the amounts of data became so large that even high-performance supercomputers couldn't process them. Second, the data was stored at multiple sites, and it became increasingly infeasible to centralize it in one place. Bandwidth limitations and privacy concerns were among the factors that hindered centralization. To solve these problems, distributed data mining has emerged as a hot research area.

Distributed Data Mining [48, 37] makes the assumption that either the computation or the data itself is distributed. It can be used in environments ranging from parallel supercomputers to P2P networks. It can be applied in areas like distributed information retrieval and sensor networks. It caters for communication and privacy constraints, as well as resource constraints like battery power. Consequently, new frameworks and new algorithms needed to be developed to work under these conditions.

1.2 Distributed Clustering

As clustering is an essential technique for data mining, distributed clustering algorithms were developed as part of the distributed data mining research. The first of them were modified versions of existing algorithms: parallel k-means [20] or parallel DBSCAN [72]. Gradually, other algorithms have surfaced, especially for P2P systems. Usually, distributed clustering algorithms work in the following way (a code sketch follows the list):

1. A local model is computed

2. The local models are aggregated by a central node (or a super-peer in P2P clustering systems)

3. Either a global model is computed, or aggregated models are sent back to all the nodes to produce locally optimized clusters.
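To make these three steps concrete, here is a minimal sketch of the pattern, assuming a facilitator/worker setup with k-means as the local algorithm and centroids as the local model; the function names and the merge-by-reclustering strategy are illustrative, not a specific algorithm from the literature.

```python
import numpy as np
from sklearn.cluster import KMeans

def worker_local_model(data, k):
    """Step 1: each node clusters its own partition and keeps only
    a compact model of it (here, the k local cluster centroids)."""
    return KMeans(n_clusters=k, n_init=10).fit(data).cluster_centers_

def facilitator_global_model(local_models, k):
    """Steps 2-3: a central node aggregates the local models by
    clustering the collected centroids into a global model."""
    all_centroids = np.vstack(local_models)
    return KMeans(n_clusters=k, n_init=10).fit(all_centroids).cluster_centers_

# Four nodes, each holding its own data partition (homogeneous scenario).
partitions = [np.random.rand(200, 2) for _ in range(4)]
local_models = [worker_local_model(p, k=3) for p in partitions]
global_centroids = facilitator_global_model(local_models, k=3)
```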

1.3 Motivation

I have decided to do in-depth research in distributed data clustering for several reasons:

• Working in the field of data mining, I had also touched upon distributed clustering, which seemed a very interesting topic.



• Distributed Clustering has been a very active field, but no surveys of this topic exist

• I am proposing a new distributed clustering algorithm, and I needed to know how it stands in comparison to other algorithms. In order to achieve that, I created a taxonomy for classifying distributed clustering algorithms.

The contributions of this paper are: a new taxonomy for distributed clustering algorithms and the proposal of a new algorithm for distributed clustering. This is the first survey that concentrates exclusively on distributed clustering, although PhD theses by Hammouda [31] and Kashef [52] provide good reviews of the field.

1.4 Organization of the Report

The rest of the paper is organized as follows: Section 2 contains a brief overview of clustering, Section 3 presents aspects of distributed clustering, followed by a survey of the most important papers in the field and a new taxonomy in Section 4. A new distributed algorithm is presented in Section 5, while conclusions and future research directions are given in Section 6.

2 Clustering

Clustering, or unsupervised learning, is the task of grouping together related data objects [39]. Unlike supervised learning, there isn't a predefined set of discrete classes to assign the objects to. Instead, new classes, in this case called clusters, have to be found. There are many possible definitions for what a cluster is, but most of them are based on two properties: objects in the same cluster should be related to each other, while objects in different clusters should be different.

Clustering is a very natural way for humans to discover new patterns, and has been studied since ancient times [35]. Regarded as unsupervised learning, clustering is one of the basic tasks of machine learning, but it is used in a whole range of other domains: data mining, statistics, pattern recognition, bioinformatics or image processing. The focus of this survey is the application of clustering algorithms in data mining.

2.1 Clustering Algorithms

There are many methods and algorithms for clustering. Some of them are presented here, but this section is far from being a complete review of the available clustering algorithms. Algorithms that are relevant to the subject of this paper, distributed clustering, but are not presented in this section will be detailed in the following section. The most widespread clustering algorithms fall into two categories: hierarchical and partitional clustering.

Hierarchical clustering builds a tree of clusters, known as a dendrogram. A hierarchical clustering algorithm can either be top-down (divisive) or bottom-up (agglomerative). The top-down approach starts with one cluster containing all the points from a set, and then splits it iteratively until a stopping condition is fulfilled. Agglomerative clustering begins with singleton clusters, and then, at each step, two clusters are merged. The clusters to be joined are chosen using a predefined measure. Again, a stopping criterion has to be specified. The advantage of hierarchical clustering is that the clusters can be easily visualized using the dendrogram, and it is very suitable for data that inherently contains hierarchies. Its disadvantages are the difficulty of deciding upon a termination criterion and its O(n²) complexity for both running time and data storage.

Hierarchical clustering stores the distances between data points in an N × N matrix, where N is the number of data points that need to be processed. Although in theory this matrix may be too large to be stored in memory, in practice its sparsity is increased using different methods, so that it occupies much less space. These methods include omitting entries smaller than a certain threshold, or storing only the distances to the nearest neighbours of each data point [61]. To compute the distance between two clusters, as opposed to individual points, a measure called a linkage metric must be used. The most used linkage metrics, sketched in code after the list, are:

• single link (the minimum distance between any two points from the clusters)

• complete link (the maximum distance between any two points from the clusters)

2

Page 7: Distributed Clustering Survey

• average link (the mean of the distances between any two points from the clusters).
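A minimal sketch of these three linkage metrics, assuming clusters are given as numpy arrays of points and using Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters under the common linkage metrics."""
    pairwise = cdist(cluster_a, cluster_b)  # all point-to-point distances
    if method == "single":    # minimum distance between any two points
        return pairwise.min()
    if method == "complete":  # maximum distance between any two points
        return pairwise.max()
    if method == "average":   # mean distance over all pairs
        return pairwise.mean()
    raise ValueError(f"unknown linkage metric: {method}")
```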

A complete survey of the linkage metrics can be found in [60].

Partitional clustering splits the data into a number of subsets. Because it is infeasible to compute all the possible combinations of subsets, a number of relocation schemes are used to iteratively optimize the clusters. This means that the clusters are revisited at each step and can be refined, which constitutes an advantage over hierarchical clustering.

Another difference between hierarchical and partitional methods is the way the distances between clusters are calculated. The former approach computes the distances between the individual points of a group, while the latter uses a single representative for each cluster. There are two techniques for constructing a representative of a cluster: taking the data point that represents the cluster best (k-medoids) or computing a centroid, usually by averaging all the points in the group (k-means).

K-means [23] is the most widely used clustering algorithm, offering a number of advantages: it is straightforward, has lower complexity than other algorithms, and is easy to parallelize. The algorithm works as follows: first, k points are generated randomly as centres for the k clusters. Then, the data points are assigned to the closest clusters. Afterwards, the centroids of the clusters are recomputed. At each subsequent step, the points are re-assigned to their closest cluster and the centroids are recomputed. The algorithm stops when there are no more movements of the data points between clusters.
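The steps above translate almost directly into code; a minimal from-scratch sketch (random initialization, none of the weaknesses discussed next are handled):

```python
import numpy as np

def kmeans(data, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k initial centres at random from the data points.
    centres = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    while True:
        # 2. Assign every point to its closest centre.
        new_labels = np.argmin(
            ((data[:, None, :] - centres) ** 2).sum(axis=2), axis=1)
        # 4. Stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            return centres, labels
        labels = new_labels
        # 3. Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centres[j] = data[labels == j].mean(axis=0)
```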

Although highly popular, the k-means algorithm has a multitude of issues [9]. First, the choice of k is not a trivial one. Second, the accuracy of the algorithm depends heavily on the initial cluster centres that are chosen. If these are picked randomly, the results are hard to control. Third, the algorithm is sensitive to outliers. Finally, it is not scalable.

2.2 Similarity Measures

The concepts of similarity and distance are fundamental in data clustering, and they are a basic requirement for any algorithm [52]. A similarity measure assesses how close two data points are to each other. The distance measure, or dissimilarity, does exactly the opposite. Usually, either one or the other measure is used. In any case, each measure can easily be derived from the other.

This section reviews the most widely used similarity measures. As these measures are used in the local computation of cluster models or in the aggregation step, there are no measures that are specific to distributed algorithms.

The simplest similarity measure is also the most widely used. If we visualize the data points as points in an N-dimensional space, then the Minkowski distance between them can be computed. Accordingly, objects with the smallest distance have the highest similarity [14]. The particular cases of the Minkowski distance are r = 1 (the Manhattan distance) and r = 2 (the Euclidean distance).

Another widely used measure is the cosine similarity. Again, data points are considered in a coordinate system with N axes (dimensions). Then, for each point a representative vector starting at the origin of the coordinate system can be drawn. The cosine similarity between two points is the cosine of the angle between the representative vectors of the respective points. This measure is heavily employed in document clustering.

The Jaccard measure [14, 56] divides the number of features that are common to both points by the total number of their features (in its binary form). Formulas for these metrics are given in Table 1 below.

2.3 Clustering Evaluation Measures

In order to assess the performance, or quality, of different clustering algorithms, objective measures need to be established. There are 3 types of quality measures: external, when there is a priori knowledge about the clusters, internal, which assumes no knowledge about the clusters, and relative, which evaluates the differences between different clusters.

Among the quality measures used in clustering are [52]:



Metric            Formula

Minkowski         $\|x - y\|_r = \sqrt[r]{\sum_{i=1}^{d} |x_i - y_i|^r}$

Cosine Similarity $\mathrm{cosSim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$

Jaccard Measure   $\mathrm{JaccardSim}(x, y) = \frac{\sum_{i=1}^{d} \min(x_i, y_i)}{\sum_{i=1}^{d} \max(x_i, y_i)}$

Table 1: Similarity Metrics
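The three metrics of Table 1 in code, assuming dense feature vectors (non-negative for the generalized Jaccard measure):

```python
import numpy as np

def minkowski(x, y, r=2):
    """Minkowski distance; r=1 gives Manhattan, r=2 gives Euclidean."""
    return float((np.abs(x - y) ** r).sum() ** (1.0 / r))

def cosine_similarity(x, y):
    """Cosine of the angle between the two representative vectors."""
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_similarity(x, y):
    """Generalized Jaccard measure for non-negative feature vectors."""
    return float(np.minimum(x, y).sum() / np.maximum(x, y).sum())
```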

1. External Quality Measures

• Precision: this measure gives the number of correct assignments out of the total number of assignments made by the system

• Recall: this measure gives the number of correct assignments made by the system, out of the number of all possible assignments

• F-measure: this measure is a combination of the precision and recall measures used in machine learning

• Entropy: indicates the homogeneity of a cluster: a low entropy indicates a high homogeneity, and vice versa

• Purity: the average precision of the clusters relative to their best matching classes (entropy and purity are sketched in code after this list).

2. Internal Quality Measures

• Intracluster Similarity: the most common quality measure, it represents the average of the similarity between any two points of the cluster.

• Partition Index: reflects the ratio of the sum of compactness and separation of clusters.

• Dunn Index: it tries to identify compact and well separated clusters. The Dunn Index has O(n²) complexity, which can make it infeasible for large n. Also, the diameter of the largest cluster can be influenced by outliers.

• Separation Index: measures the inter-cluster dissimilarity and the intra-cluster similarity by using cluster centroids. It is computationally more efficient than the Dunn Index.

3. Relative Quality Measures

• Cluster Distance Index: measures the difference between two clusters. For centroid-based clusters, the distance is just $\|c_i - c_j\|$. If the clusters are not represented by centroids, then the linkage paradigm from hierarchical clustering can be applied: the cluster distance can refer to the minimum, maximum or average distance between any two points of the two clusters.
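As an illustration of the external measures, a minimal sketch computing cluster entropy and purity, assuming hard cluster assignments and integer class labels:

```python
import numpy as np

def entropy_and_purity(labels_true, labels_pred):
    """External quality measures, weighted by cluster size:
    entropy (lower is better) and purity (higher is better)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    entropy, purity = 0.0, 0.0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        # Distribution of the true classes inside cluster c.
        probs = np.bincount(members) / len(members)
        probs = probs[probs > 0]
        entropy += (len(members) / n) * -(probs * np.log2(probs)).sum()
        purity += (len(members) / n) * probs.max()
    return entropy, purity
```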

2.4 Clustering High Dimensional Data Sets

The majority of clustering algorithms work optimally on data sets where each record has no more than a few tens of attributes (features or dimensions). However, there are many data sets for which this number is in the hundreds or thousands, like gene expressions or text corpora. The high number of features poses a problem known as the dimensionality curse: in high dimensional space, there is a lack of data separation. More specifically, the difference in distance between the nearest neighbours of a point and other points in the space becomes very small, and thus insignificant [4].



2.4.1 Feature Selection

To reduce the number of dimensions, techniques called dimensionality reduction, or feature selection, are used. The most representative algorithms for feature selection are Principal Component Analysis (PCA) [44] and Singular Value Decomposition (SVD) [10].

2.4.2 Coclustering

Coclustering is another technique for dimensionality reduction, involving the clustering of the set of attributes as well as the clustering of data points. Coclustering is an old concept [5], and exists under many names: simultaneous clustering, block clustering or biclustering. One of the most common uses of this technique is gene expression analysis, where both samples and genes have to be clustered in order to obtain compelling results.

According to [58], there are 4 types of cluster blocks (biclusters): biclusters with constant values, biclusters with constant values on rows or columns, biclusters with coherent values and biclusters with coherent evolutions. If we imagine the records and the attributes arranged in a matrix, the first three types analyze the numeric values of the matrix, while the last one looks at behaviours, considering elements in the data matrix as symbols.

2.5 Examples of Clustering Applications

Clustering is applied in a variety of fields. In fact, because clustering is an essential component of Data Mining, it is applied wherever DM is used. In this section, we'll take a closer look at two applications of clustering, in text mining and in gene expression analysis.

2.5.1 Document Clustering

Document clustering [67] is a subfield of text mining. It partitions documents into groups based on their textual meaning. It is applied in information retrieval (Vivisimo), automated creation of taxonomies (Yahoo!, DMOZ) and news aggregation systems (Google News) [31].

A feature that sets document clustering apart is its high dimensionality. A corpus with several hundred documents can have thousands of different words, and nowadays a set of 1 million documents is not unusual. If the corpus is a general one, and not a specialized one containing articles from a single field, it can have more than 20,000 words.

There are several ways in which the number of attributes (i.e. words) is reduced when working with document data sets. The first one is text preprocessing: removal of stopwords (common words like 'it', 'has', 'an'), stemming the words (reducing them to their lexical roots; for example, both 'employee' and 'employer' can be stemmed to 'employ') and pruning (deleting the words that appear only a couple of times in the entire corpus). Except for stemming, these techniques can be applied very easily to other types of clustering: stopword removal and pruning basically mean eliminating the attributes that appear too often and too seldom, respectively, in the data set.

After preprocessing, feature selection/extraction can be applied. There are many sophisticated algorithms for doing that, but in [73] Yang proves that TF (term frequency) is as good as any. The third way of reducing dimensionality is to use geometric embedding algorithms for clustering. LSI (Latent Semantic Indexing) is the best example of its kind: applying it on text results in a representation with fewer words, but with the same semantic meaning.

Another aspect of document clustering is represented by the text representation models, which are specific to text documents: vector space models, N-grams, suffix trees and document index graphs.

The most used document data model, the Vector Space Model, was introduced by Salton [63]. Each document is represented by a vector d = (tf_1, tf_2, ..., tf_n), where tf_i is the frequency of term i (TF) in the document. In order to represent the documents in the same term space, all the terms from all the documents have to be extracted first. This results in a term space of thousands of dimensions. Because each document usually contains at most several hundred words, this representation leads to a high degree of sparsity. Usually, another factor is added to the term frequency: the inverse document frequency (IDF). Thus, each component of the vector d becomes tf_i · idf_i. The intuition behind this formula is that the more a term appears in other documents, the less is its relevance for a particular document.
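A minimal TF-IDF sketch; the idf variant used here, log(N / df), is one common choice and an assumption on my part, as the report does not fix a particular formula:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Vector Space Model: map each document (a list of tokens) to a
    sparse dict of term -> tf * idf weights."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    return [{term: tf * math.log(n_docs / df[term])
             for term, tf in Counter(doc).items()}
            for doc in documents]

docs = [["data", "mining", "survey"], ["distributed", "data", "clustering"]]
print(tfidf_vectors(docs))  # "data" gets weight 0: it appears everywhere
```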

N-grams represent each document as a sequence of characters [69]. From this, n-character sub-sequences are obtained. The similarity between two documents is equal to the number of n-grams that they have in common. The major advantage of this approach is its tolerance to spelling errors.
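A sketch of this representation using sets of character trigrams; note that the source counts shared n-grams, and ignoring their multiplicity here is a simplification of mine:

```python
def ngrams(text, n=3):
    """All n-character sub-sequences of a document."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(doc_a, doc_b, n=3):
    """Similarity = number of n-grams the two documents share."""
    return len(ngrams(doc_a, n) & ngrams(doc_b, n))

# Tolerant to spelling errors: the misspelling still shares most trigrams.
print(ngram_similarity("clustering", "clusterring"))  # 7
```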

The last two approaches are phrase-based representation models. The first one builds a suffix tree where each node represents a part of a phrase and is linked to all the documents containing the respective suffix [75]. The Document Index Graph model was proposed by Hammouda and Kamel in [32]. Using this model, each document is represented as a graph, with words represented as nodes and sequences of words (i.e. phrases) as edges. This model can be used for keyphrase extraction and cluster summarization. The advantage of phrase-based models is that they capture the proximity between words.

There have been distributed clustering algorithms developed specifically for document clustering, like the one from Kargupta et al. [49] or Hammouda [31].

2.5.2 Gene Expression Clustering

Gene clustering is another field that is very suitable for employing clustering techniques. The development of DNA microarray technology has made it possible to monitor the activity of thousands of genes simultaneously, across collections of samples. But analyzing the measurements obtained requires large amounts of computing power, as they can easily reach the order of millions of data points. Consequently, DDM algorithms need to be employed for gene expression analysis.

A good survey of clustering techniques for gene data can be found in [42], which goes through all the steps included in gene expression analysis. There are three basic procedures for studying genes with microarrays: first, a small chip is manufactured, onto which tens of thousands of DNA probes are attached in fixed grids. Each cell in the grid corresponds to a DNA sequence. Second, targets are prepared, labeled and hybridized. A target is a sample of cDNA (complementary DNA) obtained from the reverse transcription of two mRNA (messenger RNA) samples, one for control and one for testing. Third, chips are scanned to read the signal intensity that is emitted from the targets. The data obtained from a microarray experiment is expressed using a gene expression matrix: rows represent genes and columns represent samples. Cell (i, j) holds the expression of gene i in sample j. With this matrix representation, we end our foray into the field of genetics, and return to the more familiar clustering.

From the gene expression matrix, one property can be observed: the high dimensionality of the data. As stated before, tens of thousands of genes are observed simultaneously in a microarray experiment. Another characteristic that sets gene expression data apart from other types of data is the fact that clustering both samples and genes is relevant. Samples can be partitioned into homogeneous groups, while coexpressed genes can be grouped together. Accordingly, there are three ways of clustering gene expression data:

• gene-based clustering: samples are considered features; genes, considered objects, are clustered

• sample-based clustering: genes are considered features; samples are clustered

• subspace clustering: genes and samples are treated symmetrically; clusters are formed by a subset of genes across a subset of samples.

Other problems that influence clustering in this area are the high amount of noise induced by the microarray measurements, and the high connectivity between clusters: they can have significant overlap or be included into one another.

Most clustering techniques have been applied on gene expression data: k-means [66], hierarchical clustering [62], self organizing maps. Other clustering algorithms have been specifically developed for gene expression analysis: Quality Threshold [36], or DHC (density-based hierarchical clustering) [41].

Traditional clustering techniques cannot be used for subspace clustering. Thus, new techniques have been developed, like Coupled Two-Way Clustering [27] or [57]. These are coclustering algorithms, also covered in Section 2.4.2.



This section is not meant to be an exhaustive review of gene expression clustering. More information about this field can be found in the survey by Jiang et al. [42]. The clustering survey by Xu et al. [71] covers gene clustering algorithms in its Section I. For a review of coclustering techniques, the reader is referred to [58]. Other information about gene clustering can be found in [6] and [12].

Although they come from totally different areas, document clustering and gene clustering have one thing in common: high dimensionality. Both of them can have well over one thousand features, and the number can easily exceed ten thousand. Moreover, one data row has only a small number of features: a text document cannot contain all the words in the dictionary, while genes do not express in all the samples.

This common feature should encourage researchers working in these fields to draw inspiration from each other. There are many clustering algorithms applied in both fields, like k-means or hierarchical clustering. An example of cooperation between these two fields is found in [16], where document clustering is performed using an algorithm that was originally developed for gene expression analysis.

2.6 Summary

This section gives an overview of data clustering, an essential task in data mining. The most important clustering techniques, hierarchical and partitional, are detailed with an example of each: hierarchical agglomerative clustering and k-means, respectively. Next, similarity measures used for clustering are presented: Minkowski metrics, cosine similarity and the Jaccard measure. The three categories of clustering evaluation measures introduced in this paper are: external, internal and relative. High dimensionality and feature selection are also touched upon, before illustrating popular applications of data clustering: document clustering and gene expression analysis.

3 Distributed Clustering

3.1 Distributed Data Mining

Distributed Data Mining (DDM), sometimes called Distributed Knowledge Discovery (DKD), is a research field that has emerged in the last decade. The reasons for the rapid growth of activity in DDM are not hard to find: the development of the Internet, which made communication over long distances possible; the exponential increase of available digital data (and the necessity to process it), which outstripped the increase in computing power; and the need for companies and organizations to work together and share some of their data in order to address common tasks, such as detecting financial fraud or finding connections between diseases and symptoms.

An example of the quantities of data that need to be processed by a single system can be found in astronomy [37]: the NASA Earth Observing System (EOS) collects data from a number of satellites and holds 1450 data sets that are stored and managed by different EOS Data and Information System (EOSDIS) sites, located at different geographic locations across the USA. A single pair of satellites produces 350 GB of data per day, making it extremely difficult or impossible to process all the EOS data centrally: even if the communication and storage issues were solved, the computing power requirements could not be met. Another scenario in which DDM can be employed is clustering medical records. This could be done in order to find similarities between patients suffering from the same disease, which could be further used to discover new information about the respective disease. In this case, there are two major problems: first, there is no central entity storing all the patient records - they are kept by their GPs or by hospitals, so the data is inherently distributed. Second, privacy is essential in this case, so a central site should receive local models of the data, not the data itself. Privacy has been researched intensively lately and is covered later in this chapter.

The main goal of distributed data mining research is to provide general frameworks on top of which specific data mining algorithms can be built. Examples of such frameworks can be found in papers by [29], Cannataro [13] or Kargupta [15].

From the point of view of the partitioning of data, DDM techniques can be classified as homogeneous and heterogeneous. If we imagine a database table, splitting data into homogeneous blocks is equivalent to horizontal partitioning, where each block contains the same columns (attributes), but different records. Heterogeneous data is obtained when vertical partitioning is applied, where each block contains different attributes of the same records. A unique identifier is used to keep track of a particular record across all sites. The same analogy can be applied to data stored in formats other than a database table.

A related field that has significant overlap with DDM is parallel data mining (PDM). When computers in a network are tightly coupled (e.g. in a LAN, web farm or cluster), parallel computing techniques can be used for DDM. Parallel data mining was developed more than a decade ago in the attempt to run data mining algorithms on supercomputers. This is what is called a fine grain approach [46]. The coarse grain approach in PDM is when several algorithms are run in parallel and the results combined. This approach is heavily used in distributed data mining. In this paper, both techniques will be referred to as DDM.

In [48], the most important types of DDM algorithms are presented: distributed clustering, distributed classifier learning, distributed association rule mining and computing statistical aggregates. The focus of this survey is on distributed clustering.

Some of the new environments in which distributed data mining has been applied are also detailed in [48]. Among them are sensor networks, which consist of numerous small devices connected wirelessly. A main characteristic of the sensors is that the power required for wireless communication is much greater than the power required for local computations. Of course, in this environment severe energy and bandwidth constraints apply. Other recent applications of DDM are in grid mining applications in science and business, in fields like astronomy, weather analysis, customer buying patterns or fraud detection.

There is a wide range of distributed data mining surveys in the literature. Paper [37] has already been quoted before. The survey by Zaki [74] discusses a large variety of aspects of DDM. A range of DDM techniques are discussed in the book [47] by Kargupta and Chan. A more recent overview of the field by Datta et al. can be found in [17].

3.2 Key Aspects of Distributed Clustering

Distributed clustering is the partitioning of data into groups in a distributed environment. The latter means that either the data itself or the clustering process is distributed [citation]. Although relatively new, this field has been researched intensively in recent years, as the need to employ distributed algorithms has grown significantly. The evolution of networking and storage equipment fostered the development of very large databases held in different physical locations. It is infeasible to centralize these data stores in order to analyze them. Additionally, even if they could be centralized, no single computer could meet the storage and processing requirements of such a task.

3.2.1 Distributed Data Scenarios

Distributed clustering is applied when either the data that needs to be processed is distributed, or the computation is distributed, or both. If neither of the two is distributed, we are talking about centralized clustering. The possible distribution scenarios are included in Table 2 below (from [31]).

                         Centralized Data   Distributed Data
Centralized Clustering   CD-CC              DD-CC
Distributed Clustering   CD-DC              DD-DC

Table 2: Scenarios of Data and Computation Distribution

3.2.2 Types of Distributed Environments

The environments in which distributed clustering runs can be classified by their hardware setup and network topology into the following types:

• Computers with parallel processors and shared storage and/or memory


Page 13: Distributed Clustering Survey

• Tightly-coupled computers in a high-speed LAN network (clusters)

• Computers connected through WAN networks

• Peer-to-peer networks over the Internet

As will be shown, the environment has a great influence on the choice of one clustering algorithm over another.

3.2.3 Communication Constraints

Depending on the data distribution, the nature of the data, and the network environment in which it is run, communication in distributed clustering algorithms is constrained by a number of factors:

• Bandwidth limitations: in distributed networks, computers may have different connection speeds, which might not allow them to transfer large quantities of data in an acceptable time. This is an issue for wireless ad-hoc networks, as well as sensor networks.

• Communication costs: these become a problem especially in sensor networks, where the energy consumption of communication is much higher than that required for computation [48].

• Privacy concerns: when clustering is computed on data from multiple sources, each of the sources might want to keep sensitive data hidden, by revealing only a summary of it, or a model.

3.2.4 Facilitator/Worker vs P2P

There are two main architectures for distributed clustering: facilitator/worker and P2P [31]. The former assumes that there is a central process (the facilitator) which coordinates the workers. While this approach benefits from splitting the work across multiple workers, it has a single point of failure and is not suitable in P2P networks, raising privacy and communication concerns. To overcome these issues, P2P distributed clustering algorithms were developed, resembling the structure of P2P networks: a large number of nodes connected in an ad-hoc way (node failures and joins are very frequent), each node communicating only with its neighbours, and the failure of a node handled elegantly. P2P clustering algorithms can in turn be split into structured and unstructured ones. The structured ones use superpeers, each controlling a set of common peers. The nodes can communicate only with other nodes that have the same superpeer, while superpeers communicate among themselves. This is a recurrent theme in research by Hammouda et al.

3.2.5 Types of Data Transmissions

There are several options for what type of data is transmitted between two distributed processes [52] throughout the clustering process:

• Whole data set: the simplest way of communicating is for the peers to exchange the entire data that needs to be clustered. At the same time, it is the most inefficient method: it offers solutions for none of the constraints listed above. Accordingly, the possibility of using it is severely limited.

• Representatives: a way to mitigate the costs of communication and bandwidth constraints is to exchange only the most representative data samples between processes. However, choosing the data sampling algorithm is not trivial. Additionally, the problem of privacy remains unsolved.

• Cluster prototypes: these are computed locally prior to being sent over. The prototypes themselves depend on what algorithm is used locally. They could be centroids, dendrograms or generative models. This approach caters for all the communication constraints, and is currently the most widely used.



Kargupta [15] takes an in-depth look at the number of rounds of communication required by various distributed clustering algorithms. In this regard, there are two types of models: those that require one round of communication, and those requiring multiple rounds and, consequently, more frequent synchronization. Algorithms which use only one round of communication are generally based on ensemble clustering and exchange cluster prototypes. Therefore, a local computation needs to be performed and its results sent to the central node or super-peer, where the global model is computed. Locally optimized P2P systems need at least two rounds of communication, as the information obtained from the global computation is sent back to the other nodes to help refine the local clusters.

3.2.6 Scope of Clustering

Looking at the model that is obtained, distributed clustering can be divided into local clustering and global clustering. Local clustering means that only a clustering of local data, and possibly some data from neighbours, is computed. In this case, the algorithm uses global data, obtained by aggregating local models, to optimize the local clustering results. The computation of local clustering is specific to some peer-to-peer systems, where each node has its own data and wants to use aggregate information in order to improve it, or to exchange documents which are of interest to neighbouring nodes. For example, if a node has all its data grouped in clusters but has some outliers, these could be moved to another node where they fit into the clusters that were already created there.

Global clustering creates a global model of the data. Here, the property of exact or approximate clustering can be introduced. A distributed clustering is said to be exact if its results are equivalent to the results that would be obtained if the data were first centralized and then a clustering algorithm applied to it.

3.2.7 Ensemble Clustering

A technique which is often used in distributed clustering is called ensemble clustering. Ensemble clustering is the application of more than one clustering algorithm on a set of data. By contrast, hybrid clustering uses only one of many algorithms at any given time, while the others are idle. Both ensemble clustering and hybrid clustering are types of cooperative clustering. Many distributed clustering algorithms perform a local computation first, using any of the available clustering algorithms, and then the results are sent to another node, which clusters them, possibly using another clustering algorithm. Some of the distributed clustering frameworks do not care which algorithm is used locally. Instead, they focus on the communication between nodes and how the global clustering is performed.

3.2.8 Privacy Issues

The issue of privacy is as old as the Internet itself. But in the last years, it has become the most important problem in any discussion about the Internet and web systems. This happens for several reasons: first, the amount of digital data stored in distributed networks has increased exponentially. And we are not talking about ordinary data: highly sensitive data, from Government agencies, banks or other companies, can be found on computers. Second, as connections to the Internet have improved, these data stores are more prone to attack from malicious users.

In data mining, privacy concerns may arise from different situations, illustrated here by two examples. In the first one, suppose that a credit rating agency wants to create credit maps, where for each neighbourhood of a certain city a map reflecting credit scores in that area is plotted. Of course, the agency would need to collaborate with banks to obtain credit-related data. In this case, the banks can supply the required information in a summarized way by computing an average over all the customers in each area. Revealing the aggregated data would not compromise any information about individual customers. In the second example, two or more Internet users want to perform a local clustering of their documents and produce a common taxonomy by summarizing them. Of course, none of the users would like others to see their personal files.

Fortunately, data mining has an intrinsic feature that makes privacy preservation easier to implement: it summarizes data, looking for trends or patterns. Consequently, individual data points need not be revealed. This property is transferred to distributed clustering, too. For it to be enforced, only cluster prototypes must be exchanged between participating nodes. More detail about privacy preservation in distributed clustering can be found in [15].

3.3 Distributed Clustering Performance Analysis

The performance of a distributed clustering algorithm can be evaluated in different ways. Clustering performance criteria can be classified into three types, depending on what they measure: scalability, quality and complexity.

Among the most important measurements are the ones that concern scalability: speedup and scaleup. The first one measures the reduction in computation time when the same amount of data is processed by more nodes, while the second one captures the variation of the computation time when the amount of data per process is kept constant but more processes are added. These criteria are arguably the most important when evaluating distributed clustering algorithms.
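In symbols (the notation is mine): writing $T(p, n)$ for the running time on $p$ processes over $n$ data items,

$$\mathrm{speedup}(p) = \frac{T(1, n)}{T(p, n)}, \qquad \mathrm{scaleup}(p) = \frac{T(1, n)}{T(p, p \cdot n)},$$

so an ideal algorithm has speedup close to $p$ and scaleup close to 1.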

For evaluating the quality of a distributed clustering algorithm, several methods are used. If the correct clustering results are known, then an external measure like the F-measure can be applied. Otherwise, internal similarity measures can be used: intracluster similarity, partition index, Dunn index, etc. Finally, where possible, the results of the distributed algorithm can be compared with the results obtained when the same data is processed by a traditional algorithm. When a global distributed clustering is said to be exact, its quality will be equal to that of its central counterpart.

Finally, the complexity of distributed clustering approaches is obtained by analyzing the algorithm. Those algorithms evolved from central approaches can use the knowledge about the complexity of the central algorithm. For some distributed clustering approaches it is impossible to determine a global complexity, because they allow any algorithm to be used at the local processing step. The best examples are the algorithms that fall under the ensemble clustering category [45, 68]. In addition to calculating the complexity of the computation, for some algorithms the complexity of the communication is also evaluated. For distributed algorithms, the cost of communication is essential.

3.3.1 Measuring Scalability

In general, experimental tests of distributed algorithms measure the speedup of the algorithms. This is very important because distribution incurs communication costs. If these are too high, then the employment of a certain algorithm in distributed environments becomes infeasible. Speedup of clustering algorithms can be almost linear: the parallel k-means algorithm by Dhillon and Modha [20] achieves a speedup of 15.62 when using 16 nodes for processing. This was achieved because the parallelization of k-means is inherently natural, so communication is reduced. Other clustering algorithms have more modest results. For example, the cooperative clustering algorithm proposed in [52] achieves a speedup of between 30 and 40 on a setup of 50 nodes. These results are explained by the fact that the algorithm works in P2P networks, where communication is more intensive.

It is difficult to compare speedup results obtained by various algorithms, for a number of reasons:

• The environment in which they run has a significant impact on the costs. There is a big difference in communication speed between a cluster of computers tightly connected by a gigabit LAN, and the same number of computers connected through the Internet.

• Experiments are performed using different data sets. Although there are some standard corpora (Reuters RCV1, 20-newsgroups), not all the algorithms have been tested on them. A number of algorithms are tested on synthetic data, like Gaussian distributions. The nature and amount of data influence speedup significantly, so testing algorithms on the same dataset is the only way to compare them.

• Speedup has to be viewed in the context of quality. There is a trade-off between the two, so some algorithms might reduce speedup in order to improve quality, or vice versa.

Another measure of scalability is called scaleup. This measures whether the running time of a distributed algorithm remains the same when more data is added, in proportion to the addition of new processing nodes. This measure is not as widely used as speedup. Scaleup tests have been performed in [20].



3.3.2 Evaluating Quality

Quality, or accuracy, of clustering algorithms can be measured externally, when the correct assignments are known beforehand, or internally, where measures like intracluster similarity can be used. The reader is referred to Section 2.3 for more about clustering quality evaluation measures for centralized clustering.

There are very few measures that have been developed specifically for distributed clustering. Two of them can be found in [40] (a sketch follows the list):

• Discrete Object Quality: returns a boolean result assessing whether an object X was clustered correctly by the distributed clustering algorithm (as compared to a centralized algorithm)

• Continuous Object Quality: same as the previous one, but returns a real value between 0 and 1 determining how close the distributed algorithm was to the centralized one.
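The exact definitions are given in [40]; the following is only one plausible instantiation of my own, assuming both algorithms output hard assignments whose cluster labels have already been aligned:

```python
import numpy as np

def discrete_object_quality(dist_labels, central_labels):
    """Per-object boolean: did the distributed algorithm put the object
    in the same cluster as the centralized one? (Labels pre-aligned.)"""
    return np.asarray(dist_labels) == np.asarray(central_labels)

def continuous_object_quality(dist_dist, central_dist):
    """Per-object value in [0, 1]: here, 1 minus the relative difference
    of each object's distance to its assigned centroid under the two
    algorithms (1 = the two algorithms agree exactly)."""
    a = np.asarray(dist_dist, dtype=float)
    b = np.asarray(central_dist, dtype=float)
    return 1.0 - np.abs(a - b) / np.maximum(np.maximum(a, b), 1e-12)
```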

Not all distributed algorithms have been tested for their quality. The reason for this is that they have emerged from centralized ones, which had been tested thoroughly before. So, assuming that a distributed clustering algorithm is exact, or almost exact, no large difference in accuracy is expected between it and the centralized algorithm it originated from.

As with speedup, it is impossible to compare the quality of distributed clustering across the board. The reasons remain the same: algorithms have been tested using a wide range of data sets that have nothing in common, and the measures of quality and speedup need to be considered together, because they are tightly coupled: modifying one influences the other.

3.3.3 Determining Complexity

Complexity of distributed clustering algorithms is the only measure that can be used to compare them objectively, because complexity is determined by analyzing an algorithm mathematically. Although determining complexity doesn't depend on any other parameters, algorithms with the same complexity can differ greatly in speed. This difference comes from the usual factors: the nature of the data, the quality of the algorithm, optimizations, etc.

When measuring the complexity of distributed algorithms, two processes need to be taken into consideration: the actual computation on the data, and the communication of results (in one or more rounds). Their corresponding measures are the computation complexity and the communication complexity, respectively. Put together, they form the overall complexity of the algorithm. Communication complexity depends on the environment that a certain algorithm has been designed for.

3.4 Summary

This section aims to place distributed clustering in the context of distributed data mining. After achieving this, the key aspects of distributed clustering are discussed, beginning with the possible scenarios for distributed data and computation. Next, the types of distributed environments are listed: parallel computers, LAN/WAN and P2P networks. Each of these environments imposes different constraints on communication. The two possible architectures of distributed clustering algorithms are facilitator/worker and P2P. Data transmitted by these techniques falls into three categories: whole data, representatives or prototypes. Ensemble clustering and privacy issues are also important aspects of the field. Finally, three attributes that need to be analyzed in order to evaluate the performance of distributed clustering algorithms are presented: scalability, quality and complexity.

4 Survey of Distributed Clustering Techniques

4.1 Initial Taxonomy

In order to find a structured way of surveying distributed clustering algorithms, I tried to establish a taxonomy that includes all the concepts that define the clustering process. Classifying the algorithms according to this taxonomy will give a broad picture of the field and will make it easier to position new algorithms in relation to existing ones. Each step of the distributed clustering process involves design decisions that are usually influenced by what needs to be achieved. For example, if an algorithm has to run in a P2P environment, it cannot use the facilitator/worker architecture. Also, algorithms with multiple rounds of communication would not be suitable for this environment. If privacy preservation is a priority, then algorithms which send the entire data or samples of it certainly cannot be used. These decisions ultimately define a distributed clustering algorithm. All the key aspects that need to be taken into consideration have been grouped by where they appear in the clustering process. First, let's review the steps of a general distributed clustering algorithm:

1. A local model is computed

2. The local models are aggregated by a central node (or a super-peer in P2P clustering systems)

3. Either a global model is computed, or aggregated models are sent back to all the nodes to produce locally optimized clusters.

Next, the elements of the taxonomy that can be found at each step are presented; a compact encoding of the taxonomy is sketched in code after the list.

1. Computation of a local model

If taken out of the distributed process context, this step can be viewed as classical, or local, clustering. Therefore, all the traditional aspects of clustering apply here:

• Type of the data on which the clustering algorithm is applied: general, synthetic, documents, gene expressions

• Local clustering algorithm: partitional, hierarchical, density-based, geometric embedding, none (if the entire data is passed to the central node/other peers)

• Feature selection, used when working with high dimensional data sets

2. Aggregation of the local models

• Underlying distributed environment: parallel computer, cluster, WAN, P2P network

• The number of rounds of communication required by the algorithm

• What is communicated: entire data, representatives, prototypes

3. Optimization of local models

• Is it global or local (if it’s global, this step is unnecessary)

• Does it output exactly the same results as the centralized version of the algorithm, or approximate results?

4. General aspects

• Is the data centralized or distributed prior to the start of the algorithm?

• Is the algorithm incremental or non-incremental?

• Is privacy required?
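One way to make this taxonomy operational is to record each reviewed algorithm as a small structured record; the field names and value vocabularies below are mine, taken from the lists above, and the PADMA instance anticipates the first entry of the review that follows (fields not stated there are placeholders):

```python
from dataclasses import dataclass

@dataclass
class TaxonomyEntry:
    # 1. Computation of a local model
    data_type: str             # general | synthetic | documents | genes
    local_algorithm: str       # partitional | hierarchical | density | none
    feature_selection: bool
    # 2. Aggregation of the local models
    environment: str           # parallel | cluster | WAN | P2P
    rounds_of_communication: int
    transmitted: str           # data | representatives | prototypes
    # 3. Optimization of local models
    scope: str                 # global | local
    exact: bool                # exact vs. approximate results
    # 4. General aspects
    data_distributed: bool
    incremental: bool          # placeholder when not stated in the review
    privacy_preserving: bool   # placeholder when not stated in the review

padma = TaxonomyEntry(
    data_type="documents", local_algorithm="hierarchical",
    feature_selection=False, environment="parallel",
    rounds_of_communication=1, transmitted="prototypes",
    scope="global", exact=True, data_distributed=True,
    incremental=False, privacy_preserving=False)
```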



4.2 Review of Algorithms

[1.PADMA] One of the first efforts in distributed clustering was made by Kargupta et al. in 1997. A parallel hierarchical clustering algorithm is presented in [49], as part of the broader mining framework PADMA (PArallel Data Mining Agents). The framework consists of a user interface, a facilitator and independent agents with their own storage. It is implemented on distributed memory machines using MPI. The clustering component has been tested with documents, represented using N-grams. The parallelization of the hierarchical agglomerative clustering is straightforward: each agent performs a local clustering on its data, until only a few clusters are left. The results are sent to a client, which executes clustering on the very condensed data. PADMA was tested on a 128-node IBM SP2 supercomputer, on a corpus containing 25273 text documents. Because there is no interprocess communication, linear speedup was achieved.

• Computation of local model: documents, hierarchical clustering, N-grams

• Aggregation of local model: distributed memory machine environment, prototypes (dendrograms) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, exact algorithm

[2.RACHET] Another distributed hierarchical clustering algorithm is proposed by Samatova et al. in [64]. It is quite similar to the one contained in PADMA, but with a major difference: after local dendrograms are created at each site, the clusters are not transmitted as a whole to the merger site. Instead, only centroid descriptive statistics are sent for each cluster: a 6-tuple composed of the number of data points in the cluster, the square norm of the centroid, the radius of the cluster, and the sum, minimum and maximum of the components, respectively. This representation accounts for a significant reduction of the communication costs. Dendrograms are merged at the central site using just these statistics. Tests on synthetic data, with up to 16 dimensions, and on real data show that the quality of the algorithm is comparable to that of the centralized hierarchical clustering approach. (A small sketch of the 6-tuple appears after the classification below.)

• Computation of local model: general, hierarchical clustering

• Aggregation of local model: WAN, prototypes (statistics) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm
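To illustrate how compact these cluster summaries are, the following is a minimal numpy sketch of the 6-tuple described above. The function name and the Euclidean radius definition are illustrative assumptions, not code from [64]:

    import numpy as np

    def centroid_stats(points):
        # One RACHET-style descriptive summary for a single cluster.
        points = np.asarray(points, dtype=float)
        centroid = points.mean(axis=0)
        n = len(points)                                  # number of data points
        sq_norm = float(centroid @ centroid)             # square norm of the centroid
        radius = float(np.sqrt(((points - centroid) ** 2).sum(axis=1)).max())
        return (n, sq_norm, radius,
                points.sum(axis=0),                      # sum of the components
                points.min(axis=0),                      # componentwise minimum
                points.max(axis=0))                      # componentwise maximum

Only this tuple crosses the network for each cluster, so the communication cost depends on the dimensionality and the number of clusters rather than on the number of points.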

[3.CHC] In [43], Johnson and Kargupta present a distributed hierarchical algorithm, called CHC (Collective Hierarchical Clustering). It works on data that is heterogeneously distributed, with each site having only a subset of all features. First, a local hierarchical clustering is performed on each site. Afterwards, the obtained dendrograms are sent to a facilitator which computes the global model, using statistical bounds. The aggregated results are similar to centralized clustering results, making CHC an exact algorithm. An implementation of the algorithm for single link clustering is also introduced in the paper. Empirical tests are performed on the Boston Housing Data Set, where CHC is compared to monolithic clustering.

• Computation of local model: general, hierarchical clustering

• Aggregation of local model: WAN, prototypes (dendrograms) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, exact algorithm


[4.CPCA] Collective principal component analysis clustering (CPCA) is proposed by Kargupta et al. in [50]. It is one of the early distributed clustering algorithms and it works on heterogeneous data. It consists of the following steps: first, PCA is performed locally at each site. Next, a sample of the projected data, as well as the eigenvectors, are sent to a central processing node, which combines the projected data from all the sites. Finally, PCA is performed on the global data set to identify the dominant eigenvectors. These are sent back to the sites, which perform local clustering upon receiving them. Models of the obtained clusters are sent once more to the central node, which combines them; in the paper this is achieved using a nearest neighbour technique. CPCA is an approximate algorithm: experiments comparing it with centralized PCA show differences between the two approaches.

• Computation of local model: general, PCA-based clustering

• Aggregation of local model: WAN, representatives are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm

[5.Parallel K-means] Dhillon and Modha [20] developed a version of k-means that runs on a parallel supercomputer, making use of the inherent parallelism of k-means. Each processor, or node, receives only a segment of the data that needs to be clustered. One of the nodes selects the initial cluster centroids before sending them to the others. Distances between centroids and data points are computed independently, but after each iteration of the algorithm the independent results must be aggregated, or reduced; this is done using MPI (the Message Passing Interface). The reduced centroids obtained after the last iteration represent the final result of the clustering process. The paper shows, both analytically and empirically, that the speedup and scaleup of the algorithm approach the optimum as the number of nodes increases, because the communication cost has a smaller impact. On a 16-node machine, a speedup of 15.62 was obtained. (A sketch of the reduction step follows the classification below.)

• Computation of local model: general, k-means clustering

• Aggregation of local model: parallel computer, prototypes (centroids) are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is centralized, exact algorithm
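To make the reduce step concrete, the following is a minimal sketch of one distributed k-means iteration in the spirit of [20], written with mpi4py and numpy rather than the authors' MPI code; the function name and the handling of empty clusters are illustrative assumptions:

    import numpy as np
    from mpi4py import MPI

    def kmeans_iteration(local_points, centroids, comm):
        # Assign each locally stored point to its nearest centroid.
        dists = np.linalg.norm(local_points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Local per-cluster sums and counts.
        k, d = centroids.shape
        sums, counts = np.zeros((k, d)), np.zeros(k)
        for j in range(k):
            members = local_points[labels == j]
            sums[j] = members.sum(axis=0)
            counts[j] = len(members)
        # The reduce step: every node receives the global sums and counts.
        sums = comm.allreduce(sums, op=MPI.SUM)
        counts = comm.allreduce(counts, op=MPI.SUM)
        return sums / np.maximum(counts, 1)[:, None]   # new global centroids

Because only k centroids (plus counts) are exchanged per iteration, the communication cost is independent of the number of data points, which is what drives the near-linear speedup reported in the paper.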

[6.Parallel K-harmonic] Zhang et al. [76] propose a method for parallel clustering that can be applied to iterative center-based clustering algorithms: k-means, k-harmonic means or EM (Expectation-Maximization). The algorithm is similar to the previous one, but more general: first, the data to be clustered is split into a number of partitions equal to the number of computing units. A computing unit can be a CPU in a multiprocessor computer or a workstation in a LAN. Then, at each step, Sufficient Statistics (SS) are computed locally and summed up by an integrator, which broadcasts the result to all the computing units for the next iteration, if required. The implemented algorithm achieved almost linear speedup. In [24], the parallel clustering approach is extended to work with inherently distributed data. (A sketch of the integrator step is given after the classification below.)

• Computation of local model: general data, k-means/k-harmonic means/EM

• Aggregation of local model: multiprocessor computer/LAN environment, prototypes (SS) are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed


• General aspects: data is distributed, exact algorithm
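For k-means, the sufficient statistics of a cluster reduce to a point count and a vector sum, so the integrator's job is a single elementwise addition. A toy sketch under that assumption (the names are illustrative, not from [76]):

    import numpy as np

    def integrate(ss_list):
        # Sum the (counts, sums) sufficient statistics reported by each
        # computing unit, then derive the new centroids to broadcast back.
        counts = sum(ss[0] for ss in ss_list)        # shape (k,)
        sums = sum(ss[1] for ss in ss_list)          # shape (k, d)
        return sums / np.maximum(counts, 1)[:, None]

For k-harmonic means and EM the per-cluster statistics are different (weighted sums), but the pattern of computing them locally and summing them centrally is exactly the same, which is why the recasting in [76] is exact.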

[7.DBDC] The paper by Januzaj et al. [40] introduces the Density Based Distributed Clustering algorithm (DBDC). It can be used when the data to be clustered is distributed and infeasible to centralize. DBDC works by first computing a local clustering at each of the nodes, using the density-based algorithm DBSCAN [21]. Representatives of the local clusters are then created by using k-means to discover the most important data points in each cluster. These are passed to a central node that aggregates them, and sends the results back for local clustering optimization. The algorithm therefore requires two rounds of communication between the facilitator and the workers. The paper also introduces two distributed clustering quality measures: discrete object quality and continuous object quality. The algorithm was tested on up to 200000 2-dimensional data points, run on a single machine. (A toy sketch of the local step appears after the classification below.)

• Computation of local model: general, density-based clustering

• Aggregation of local model: WAN environment, representatives are sent, one round of communication

• Optimization of local model: local clusters are optimized, another round of communication

• General aspects: data is distributed, approximate algorithm
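A toy sketch of DBDC's local step, using scikit-learn's DBSCAN and KMeans as stand-ins for the implementation in [40]; eps, min_samples and the number of representatives per cluster are illustrative parameters:

    import numpy as np
    from sklearn.cluster import DBSCAN, KMeans

    def local_representatives(X, eps=0.5, min_samples=5, reps_per_cluster=4):
        # Local density-based clustering; label -1 marks noise points.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        reps = []
        for label in set(labels) - {-1}:
            members = X[labels == label]
            # k-means picks a handful of representative points per cluster.
            k = min(reps_per_cluster, len(members))
            reps.append(KMeans(n_clusters=k, n_init=10).fit(members).cluster_centers_)
        return np.vstack(reps) if reps else np.empty((0, X.shape[1]))

Only the representatives travel to the central node, which aggregates them and ships the resulting global model back for local refinement.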

[8.PDBSCAN] Another algorithm that uses DBSCAN was proposed by Xu et al. in [72]. PDBSCAN adapts the original algorithm to work in parallel on a cluster of interconnected computers. It introduces a new storage structure, called the dR*-tree, which differs from a normal R*-tree by storing the pages on different computers, while the index is replicated to all the computers in the cluster. The data placement function uses Hilbert curves to distribute spatially related areas to the same computer. After local clustering is performed on each computer, the results are sent to a master, which merges clusters where needed. The merging step ensures that the final results are similar to the results obtained with centralized clustering; thus, PDBSCAN is an exact algorithm. It was tested on 8 interconnected computers and on up to 1 million synthetic 2D data points, achieving near-linear speedup and scaleup.

• Computation of local model: spatial data, DBSCAN clustering

• Aggregation of local model: cluster of computers, prototypes are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, exact algorithm

[9.KDEC] The KDEC scheme [54, 53] proposed by Klusch et al. performs kernel-density-based clustering over homogeneously distributed data. It uses the multi-agent systems paradigm, where each site has an intelligent agent that performs local computation. Additionally, a helper agent sums the data from all the site agents. Density estimates are computed locally at each site, using a global kernel function. The global density estimate is calculated by adding the local estimates, and is sent back to the local sites, which construct the clusters: points that can be connected by a continuous uphill path to a local maximum are considered to be in the same cluster. To address privacy concerns, a sampling technique is used: a cube and a grid are defined globally and each site computes the densities at the corners of the grid. In the end, data is clustered locally at each site; there is no global clustering. (The additivity that the scheme exploits is sketched after the classification below.)

• Computation of local model: general, kernel-density-based clustering

• Aggregation of local model: WAN, prototypes (density estimates) are sent, two rounds of communication

• Optimization of local model: global densities are sent back for local clustering optimization


• General aspects: data is distributed, approximate algorithm, privacy preserving
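The additivity that KDEC exploits is easy to see in code: with a shared kernel and grid, the global density estimate is simply the sum of the local estimates. A minimal one-dimensional numpy sketch with an (unnormalized) Gaussian kernel; the grid, bandwidth and data are illustrative:

    import numpy as np

    def local_density(points, grid, bandwidth=1.0):
        # Local kernel density estimate, evaluated on the globally agreed grid.
        diffs = grid[:, None] - points[None, :]
        return np.exp(-0.5 * (diffs / bandwidth) ** 2).sum(axis=1)

    grid = np.linspace(-5.0, 5.0, 101)
    site_a = local_density(np.random.normal(-2.0, 1.0, 100), grid)
    site_b = local_density(np.random.normal(+2.0, 1.0, 100), grid)
    # The helper agent only needs the per-site estimates, never the raw points.
    global_estimate = site_a + site_b

The sum is identical to a centralized kernel density estimate computed over all points, which is why only grid values, and no raw data, have to leave a site.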

[10.KDEC-S] In [15], da Silva et al. focus on the security of the previous kernel-density-based clustering algorithm and introduce KDEC-S. This algorithm has the advantage of better protection from malicious attacks; specifically, it solves the possible problem of a malicious agent using inference to obtain data from other peers. Testing on synthetic data shows that the accuracy of clustering is not significantly affected. Its characteristics are similar to those of KDEC, but it is more secure.

[11.Strehl Ensemble] Strehl and Ghosh propose a method for ensemble clustering in [68]. Ensemble clustering combines the results of different clustering algorithms to obtain an optimal combined partition. While this approach can also be employed for knowledge reuse and other scenarios, it can obviously be applied to distributed clustering, where it combines the local clustering results obtained at different sites. Three methods are proposed for combining clustering results. The Cluster-based Similarity Partitioning Algorithm (CSPA) reclusters objects using pairwise similarities obtained from the initial clusters. The HyperGraph Partitioning Algorithm (HGPA) approximates the maximum mutual information objective with a minimum cut objective. The Meta-CLustering Algorithm (MCLA) tries to find groups of clusters, or meta-clusters. It is argued that, because of their low complexity, it is feasible to run all three functions on a data set and choose the best result. For large data sets, sending all the clusters to a central site might pose a communication problem that makes scalability difficult to achieve. Because the merging functions need no information about an object except the cluster it belongs to, this approach also provides privacy preservation.

• Computation of local model: general, any algorithm can be used

• Aggregation of local model: although no environment is specified in the paper, it is more suitable for WAN than others; prototypes (clustering results) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm

[12.Fred Ensemble] Another ensemble clustering approach is presented by Fred and Jain in [25]. It uses evidence accumulation from multiple runs of the k-means algorithm to construct an n × n co-association matrix representing a new similarity measure. The clusters are then built using a minimum spanning tree (MST) algorithm to cut weak links that fall below a certain threshold. The results are similar to the ones obtained by the single link algorithm. While this approach was not designed specifically for distributed clustering, and the subject is not covered in the paper, an adaptation is possible: it can be assumed that the different runs of the k-means algorithm come from different sites and a global co-association matrix is built. However, constructing this matrix in a communication-efficient way needs to be addressed in order to implement this ensemble clustering algorithm successfully in a distributed environment. (The construction of the co-association matrix is sketched after the classification below.)

• Computation of local model: k-means

• Aggregation of local model: any (most suitable for WAN), prototypes (clustering results) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm
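A hedged sketch of the evidence-accumulation idea: each run's labels vote into the co-association matrix, whose entries count how often two objects were clustered together. The function name and the toy label vectors are illustrative, not the authors' code:

    import numpy as np

    def co_association(label_runs):
        # Entry (i, j) is the fraction of runs that put objects i and j
        # in the same cluster; this matrix is the new similarity measure.
        n = len(label_runs[0])
        M = np.zeros((n, n))
        for labels in label_runs:
            labels = np.asarray(labels)
            M += (labels[:, None] == labels[None, :]).astype(float)
        return M / len(label_runs)

    # Three k-means runs over five objects (e.g. one run per site):
    runs = [[0, 0, 1, 1, 1], [0, 0, 0, 1, 1], [1, 1, 0, 0, 0]]
    print(co_association(runs))

In the distributed adaptation suggested above, each site would contribute its own label vectors, and only those vectors, never the raw data, would need to be communicated.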

[13.Jouve Ensemble] Jouve and Nicoloyannis take a different approach to ensemble clustering in [45]. They develop a technique for combining partitions that can also be applied to distributed clustering. A data set can be split into multiple partitions, clustering is run in parallel on each of them, and the results are combined. This technique can be readily used for distributed clustering, with each process, site or peer receiving a partition of the data set. An adequacy measure is used for determining the optimal unique combination of the structures, and the paper proposes a new heuristic method to compute this measure. Some tests have been made, but the efficiency of the parallel clustering cannot be determined from them.

• Computation of local model: any algorithm can be used

• Aggregation of local model: WAN, prototypes (clustering results) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data can be distributed, approximate algorithm

[14.Merugu Privacy] Merugu and Ghosh [59] perform distributed clustering using generative models, which has the advantages of privacy preservation and low communication costs. Instead of sending the actual data, generative models are built at each site and then sent to a central location. The paper proves that a certain mean model is able to represent all the data at one site; this model can be approximated using Markov Chain Monte Carlo techniques. The local models obtained this way are fed to a central Expectation-Maximization (EM) algorithm, which computes a global model by trying to minimize the KL (Kullback-Leibler) distance. Special attention is given to privacy, which is defined as the inverse of the probability of generating the data from a specific model. Better clustering can be achieved by decreasing privacy, although the authors argue that high-quality clustering can be obtained with little privacy loss.

• Computation of local model: general data, generative models are determined using Markov Chain Monte Carlo techniques

• Aggregation of local model: WAN environment, prototypes (generative models) are sent, one round of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm, privacy preserving

[15.Vaidya Privacy] Another algorithm that considers privacy essential is the k-means clustering algorithm introduced by Vaidya and Clifton in [70]. It tries to solve the problem of protecting privacy when distributed clustering is performed on vertically partitioned (heterogeneous) data. The algorithm guarantees that each site, or party, knows only the part of each centroid relevant to the attributes that it holds, plus the cluster assignment of each point. Two issues are addressed by the algorithm: first, the assignment of points to their corresponding clusters at each iteration and, second, knowing when to end the iterations. Permutations and combinatorics are used for solving these issues while maintaining the privacy goal. Additionally, functions from the field of Secure Multiparty Computation are used.

• Computation of local model: k-means clustering

• Aggregation of local model: WAN, prototypes (centroids) are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed heterogeneously, approximate algorithm, privacy-preserving


[16.DIB] In [19], Deb and Angryk propose an algorithm that uses word-clusters for distributed document clustering, called DIB (Distributed Information Bottleneck). The algorithm is based on the earlier work of Slonim and Tishby [65], who developed the aIB (Agglomerative Information Bottleneck) algorithm. It is stated that word-clusters are better than the classic co-occurrence matrix of documents for the following reasons: higher accuracy, a structure that is easier to navigate, and reduced dimensionality. The last feature is important for distributed clustering: documents have very high dimensionality, with thousands of attributes, although any one document contains only a small portion, several hundred, of them. Word-clusters reduce the dimensionality and thus help lower the communication costs. Another advantage of this algorithm is that the user does not have to specify the number of clusters to be created at each site. DIB works in the following way: first, each site generates local word-clusters. Second, these are sent to a central site which generates the global word-clusters, which are then sent to all the participating sites. Third, documents are clustered locally using aIB and the word-cluster representation. Finally, the local models are sent to the central site, which computes the global clustering model. Experiments show that the accuracy of DIB is close to that of the centralized approach.

• Computation of local model: agglomerative Information Bottleneck, using word-clusters

• Aggregation of local model: WAN, prototypes are sent; clustering is done in two steps (first word-clusters, then actual clusters), three rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm

[17.HP2PC] HP2PC (Hierarchically-distributed P2P Clustering) is an algorithm recently introduced by Hammouda and Kamel in [34]. It is suitable for large P2P systems. The approach is to first create a logical structure over a P2P network, consisting of layers of neighbourhoods. Each neighbourhood is a group of peers that communicate only among themselves. Additionally, there is a supernode that aggregates all the data from the group and represents the only point of communication between neighbourhoods. In turn, each supernode is a peer of a higher-level neighbourhood, where the same rules apply; this way, multiple layers can be created, with the root supernode at the highest level. In each neighbourhood, a set of centroids is created using an algorithm similar to P2P k-means, but with slight differences. The centroids are passed forward to the supernode, which computes another set of centroids together with its neighbours. Finally, the root supernode contains the centroids for the whole data. Tests were run on several document corpora, such as the 20-newsgroups data set, using the F-measure as the quality criterion.

• Computation of local model: k-means

• Aggregation of local model: P2P network, prototypes (centroids) are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm

[18.P2P K-Means] The P2P K-means algorithm, developed by Datta et al. [18], is one of the first algorithms developed for P2P systems. Each node requires synchronization only with the nodes that it is directly connected to, i.e. its neighbourhood. Only one node initializes the centroids used for k-means, which are then spread through the entire network. The centroids are updated iteratively: before computing them at step i, a node must receive the centroids obtained at step i-1 by all of its neighbours. When the new centroids of a particular node no longer change significantly, the node enters a terminated state, in which it does not request any centroids but can still respond to requests from neighbours. Node or edge failures and additions are also accounted for by P2P k-means, making it suitable for dynamic networks. Experiments were conducted on generated 10-dimensional data points, and high accuracy and good scalability were observed. (A toy rendering of the neighbourhood update appears after the classification below.)


• Computation of local model: k-means

• Aggregation of local model: P2P network, prototypes (centroids) are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm
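A toy, single-peer rendering of one synchronized update, loosely following the description above; the use of per-cluster (sums, counts) messages and the uniform weighting of neighbours are simplifying assumptions rather than the exact protocol of [18]:

    import numpy as np

    def p2p_kmeans_step(local_points, centroids, neighbour_msgs):
        # Assign the locally stored points to the current centroids.
        dists = np.linalg.norm(local_points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        k, d = centroids.shape
        sums, counts = np.zeros((k, d)), np.zeros(k)
        for j in range(k):
            members = local_points[labels == j]
            sums[j], counts[j] = members.sum(axis=0), len(members)
        # Fold in the (sums, counts) received from the immediate neighbours
        # at the previous step; no global synchronization is needed.
        for nb_sums, nb_counts in neighbour_msgs:
            sums += nb_sums
            counts += nb_counts
        new_centroids = sums / np.maximum(counts, 1)[:, None]
        return new_centroids, (sums, counts)   # message to send onwards

A node would enter the terminated state once new_centroids stops moving by more than a small tolerance, while still answering its neighbours' requests.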

[19.Collaborative] The collaborative clustering algorithm proposed by Hammouda and Kamel in [30, 33] performs local document clustering in P2P environments. First, a local model is computed using an incremental clustering algorithm based on similarity histograms. Clusters are summarized and the resulting models are exchanged between peers. The second step allows the refinement of the clusters obtained initially: each node sends relevant documents to its neighbours and, upon receiving new documents, merges them into its existing clusters, which the incremental clustering makes possible. Here, the distinction between collaborative and cooperative clustering is made: while the former implies a local solution, the latter aims for a global solution. The algorithm has been tested in 3-peer, 5-peer and 7-peer environments on the 20-newsgroups dataset. The experiments show that both the entropy and the F-measure improved after refining the clusters obtained in the initial step.

• Computation of local model: incremental clustering using similarity histograms is performed locally

• Aggregation of local model: P2P, prototypes (cluster summaries) are sent, multiple rounds of communication

• Optimization of local model: local clustering is optimized after communicating with peers

• General aspects: data is distributed, exact algorithm

[20.DCCP2P] Another algorithm, related to the previous ones, is proposed by Kashef in a PhD thesis [52]: Distributed Cooperative Clustering in super-peer P2P networks (DCCP2P). A centralized cooperative clustering algorithm is introduced first, which combines results obtained from three clustering algorithms: k-means, bisecting k-means and Partitioning Around Medoids (PAM). The algorithm is then adapted for structured P2P networks: there are two tiers, one containing ordinary nodes grouped into neighbourhoods, and another containing supernodes, one for each neighbourhood. Each node communicates only with others in the same neighbourhood, or with its supernode. The algorithm is composed of three steps: first, local models are computed on each node. Second, each supernode merges the clusters obtained on the local nodes. Third, the root super-peer merges the clusters obtained by all the super-peers, yielding the global model. The algorithm has been tested on gene expressions and documents.

• Computation of local model: tested for genes and documents, local algorithms: k-means/bisecting k-means/PAM

• Aggregation of local model: structured P2P, prototypes (clustering results) are sent, multiple rounds of communication

• Optimization of local model: global clustering, no optimization needed

• General aspects: data is distributed, approximate algorithm

4.3 Refined Taxonomy

While reviewing the papers, I realized that the taxonomy I had devised initially was not the most suitable for all the algorithms I encountered. For example, only one of the papers in the review tackled the topic of feature selection; the others did not consider it an important aspect with regard to distributed clustering. Almost all the algorithms assume that data is distributed prior to running them. The few algorithms that work with data that is initially centralized, like parallel k-means [20], consider the distribution of data a preprocessing step rather than part of the algorithm itself. Thus, it made no sense to keep this element in the taxonomy. The possibility of performing incremental clustering using distributed algorithms was also tackled by just a few papers, so that column was excluded as well.

The structure of the taxonomy was also changed. The steps of the clustering process are not very relevant for the elements of the taxonomy because many of them cannot be assigned to a single step. The new hierarchy groups the concepts under three categories: requirements, design choices and communication aspects. Additionally, a column for the global clustering algorithm has been added, for the situations in which it differs from the local clustering algorithm.

A short description of the refined taxonomy follows. The table on the next page classifies the distributed clustering algorithms according to these concepts.

1. Requirements: attributes that come from the underlying problem that is solved and cannot be modified by the designer of the algorithm

• Data Type: the type of data that needs to be clustered: general, spatial, documents, gene expressions

• Scope: local or global

• Accuracy: exact or approximate

• Privacy Preserving: data from one node cannot be accessed by other nodes

• Environment: the underlying structure on which the algorithm is running; it can be a parallel computer or a cluster of computers, a Wide Area Network, or a Peer-to-Peer network

2. Design: choices that are made regarding the algorithm, so that it can work with the given requirements

• Architecture: P2P or facilitator/worker

• Local Algorithm: the choice for the local clustering, if one is used

• Global Algorithm: the technique for global clustering: it can be the same algorithm as the local one (e.g. PADMA), a different algorithm (e.g. Merugu privacy), or a single algorithm that produces global results without local clustering (Parallel k-means, P2P k-means)

3. Communication: aspects regarding communication between the nodes

• What Is Communicated: data, representatives or prototypes (see Section 3.2.5)

• Rounds of Communication: one (for facilitator/worker architectures), two (when the local model is optimized after global aggregation of the models), multiple (parallel and P2P algorithms)

4. Other: elements that do not fall into one of the given categories

[Table: classification of the reviewed distributed clustering algorithms according to the refined taxonomy]

5 Parallel QT: A New Distributed Clustering Algorithm

In this section, a new distributed clustering algorithm is introduced, called Parallel QT (Quality Threshold) Clustering.

5.1 Motivation

Most of the distributed clustering algorithms developed in the last several years are based on P2P or WAN environments, where a model is computed locally first. These work very well over the Internet and can be applied in situations like clustering for file-sharing applications, or data mining at multiple sites of a corporation or agency. But these algorithms cannot be used by companies storing huge amounts of data in grids formed of thousands of computers. If Google and Amazon were to work on a common project that implied sharing data, they could apply the same principle used throughout distributed data mining: compute a local model and then aggregate it. Their problem, however, would be how to compute the local model: they have thousands of servers which can store and process data, so they need highly parallelized algorithms. It is this need that the new algorithm addresses. Thus, it would be a lower-level algorithm than the distributed algorithms developed today, and it could be used in conjunction with them (for the local clustering step).

Parallel QT is an extension of the QT clustering algorithm that I have used in my previous research. The next section gives an overview of the algorithm and its advantages.

5.2 Original QT Clustering

The original QT algorithm was presented by Heyer et al. in [36] for gene clustering. The idea of the algorithm is to build a possible cluster for each data point, and then choose the best possible cluster. In the paper, it is shown that considering the largest cluster as the best one is as good as other, more sophisticated criteria. These are the steps of the QT clustering algorithm:

1. For a candidate document D, compute the cosine similarity between D and all the other documents.

2. All the documents whose similarity to D is above a certain threshold are added to a possible cluster centered on D.

3. Repeat steps 1 and 2 for all the documents in the dataset. This results in a set of possible clusters, whose number is equal to the number of documents in the dataset. Each document is the center of one and only one of the possible clusters, but one document can be a member (though not the center) of multiple possible clusters.

4. Choose the cluster with the highest number of documents as an actual cluster, store it, and remove its documents from the dataset that the algorithm is working on.

5. Repeat steps 1 - 4 on the remaining dataset until there are no more documents to cluster or until all the remaining possible clusters have a single document.
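To make the procedure concrete, here is a compact Python sketch of QT over cosine similarities. It follows the five steps above, but the dense vector representation and the threshold value are illustrative assumptions, not details fixed by [36]:

    import numpy as np

    def qt_cluster(vectors, threshold=0.8):
        # Normalize once, so dot products give cosine similarities (step 1).
        vectors = np.asarray(vectors, dtype=float)
        unit = vectors / np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
        remaining = list(range(len(vectors)))
        clusters = []
        while remaining:
            idx = np.array(remaining)
            sims = unit[idx] @ unit[idx].T
            # One candidate cluster per remaining document (steps 2-3).
            candidates = [idx[row >= threshold] for row in sims]
            best = max(candidates, key=len)
            clusters.append(best.tolist())               # step 4
            if len(best) == 1:                           # step 5 stop condition
                clusters.extend([i] for i in remaining if i != best[0])
                break
            taken = set(best.tolist())
            remaining = [i for i in remaining if i not in taken]
        return clusters

    docs = np.random.rand(8, 5)
    print(qt_cluster(docs, threshold=0.9))

Note that the O(n^2) similarity matrix is recomputed on every pass; this is exactly the cost that motivates the parallel version in Section 5.3.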

5.2.1 Advantages of QT Clustering

QT clustering has the following advantages when compared to other clustering techniques, in particular k-means:

• It doesn’t need to know the number of clusters in advance

• It doesn’t make any random decision, so the results will be the same for each run of the algorithm

QT is also regarded as a quality-oriented clustering algorithm, and its results bear that out. By contrast, k-means generally produces lower-quality clusters, although it is much faster.


The main problem with QT is its O(n^2) complexity. For Newistic, a system that I have co-built previously [16], heavy optimization reduced the number of similarities computed between documents, and the execution time dropped by a factor of 4. One implementation detail was that we retained the similarity matrix in memory after clustering completed, so for the next round we only needed to compute cosines for the new documents. With these optimizations, running on a quad-core computer with 4 GB of RAM, our clustering implementation could handle up to 200k documents; beyond that point, however, the algorithm does not scale. Hence the need to parallelize it.

5.3 Parallel QT Clustering

This section describes the new Parallel QT Clustering algorithm. Suppose there are n documents and N parallel processes.

Step 1, Computing the similarities:

• there will be n(n+1)/2 similarity computations (counting each unordered pair of documents, including self-pairs, once)

• we want to split these similarity computations equally among the N processes

• we will split the n documents into M blocks. Computing the similarity between two blocks means computing the similarity between each document of the first block and each document of the second block.

• block similarity has to be computed for each pair of blocks: B1−B1, B1−B2, B1−B3, ..., B1−BM, B2−B2, B2−B3, ..., B2−BM, ..., B(M−1)−BM, BM−BM. In total, there will be M(M+1)/2 computations between the blocks

• we want to assign each of the block computations to one process. From this it results that N = M(M+1)/2, which gives M = (sqrt(8N+1) − 1)/2, rounded to an integer

• by using the modulo function we can assign each block pair Bi−Bj to one of the N processes. The respective process will receive the documents included in both blocks Bi and Bj

• each process will compute the similarities for the blocks it receives. If the similarity between two documents is higher than a certain threshold, it is written into a similarity matrix (either as 1 or as the similarity itself)

After step 1 is complete, a global n×n similarity matrix will be available, but split into M(M+1)/2 different blocks. (A sketch of this bookkeeping follows.)
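A small sketch of the step 1 bookkeeping: enumerate the unordered block pairs, map each to a process with the modulo function, and compute one block of the thresholded similarity matrix. The function names are illustrative assumptions consistent with the description above:

    import numpy as np
    from itertools import combinations_with_replacement

    def assign_block_pairs(M, N):
        # All pairs (i, j) with i <= j: exactly M(M+1)/2 of them.
        pairs = list(combinations_with_replacement(range(M), 2))
        return {pair: index % N for index, pair in enumerate(pairs)}

    def block_similarities(unit_docs, blocks, i, j, threshold=0.8):
        # Cosine similarities between blocks i and j (unit_docs holds
        # L2-normalized rows); values below the threshold are dropped.
        sims = unit_docs[blocks[i]] @ unit_docs[blocks[j]].T
        return np.where(sims >= threshold, sims, 0.0)

    # Example wiring: n documents split into M nearly equal blocks.
    n, M, N = 1000, 4, 10          # M(M+1)/2 = 10 block pairs for N = 10 processes
    blocks = np.array_split(np.arange(n), M)
    print(assign_block_pairs(M, N))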

Step 2, Computing the clusters: This step is iterative:

1. Choose the best cluster from the remaining candidates

2. Delete the documents of the best cluster from the similarity matrix, and go to step 1. (This continues until only one-document candidate clusters remain.)

Algorithm (needs a facilitator):

1. Calculate the number of non-zero elements for each row in the similarity matrix;

2. Out of these totals, take the largest. If the largest total > 1, then send the index of the corresponding row (iB) back to each of the processes. Else STOP;

3. When a process receives the row index computed at step 2 (iB) and it is contained in one of its two blocks, it puts all the non-zero elements corresponding to that row into another structure (e.g. a HashMap: iB → {NonZeroColumn1, NonZeroColumn2, ...}) and deletes them from the similarity matrix (both row iB and column iB);


4. Repeat step 1.

5. If STOP command is given, each process sends its cluster lists to the facilitator, which aggregates them.

At the end of step 2, the facilitator should store all the computed clusters.
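A toy, single-machine rendering of the step 2 loop, with a dense numpy matrix standing in for the distributed blocks; in the real algorithm the per-row counts would be computed block by block and reduced by the facilitator. It assumes the diagonal (self-similarity) is stored, so a singleton row has a total of 1, and it removes the rows and columns of every document in a chosen cluster, which keeps the procedure faithful to QT:

    import numpy as np

    def extract_clusters(S):
        S = S.copy()
        clusters = []
        while True:
            counts = (S != 0).sum(axis=1)          # the facilitator's totals
            iB = int(counts.argmax())
            if counts[iB] <= 1:                    # STOP condition from step 2
                break
            members = np.flatnonzero(S[iB])        # non-zero columns of row iB
            clusters.append(members.tolist())
            S[members, :] = 0                      # delete the cluster's rows
            S[:, members] = 0                      # ... and columns
        return clusters

Whether the savings from parallel similarity computation survive the chatter of this selection loop is exactly the question raised in the next subsection.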

5.3.1 Challenges of the algorithm

The biggest challenge regarding this algorithm is estimating the communication cost accurately. It might be that, because of this cost (there is quite a lot of communication between the peers), the scaleup and speedup obtained are not good enough to make the algorithm worth using. We will get an answer to this question once the experiments are set up.

6 Conclusion and Future Work

Although distributed clustering is a young field, it has seen a lot of research activity. The first distributed clustering techniques tried to mimic the exact behaviour of traditional clustering algorithms (hierarchical, k-means) in a parallel environment. Later on, new algorithms were developed using the technique called ensemble clustering, where several sets of clustering results (usually obtained with different algorithms) are combined to provide an optimal model. Afterwards, research in distributed clustering went in two main directions: privacy preservation, and algorithms for P2P environments. Indeed, the most recent papers on this topic focus on distributed clustering in peer-to-peer networks.

Although a lot of issues have been solved by the existing algorithms, new challenges continue to appear. As computing technology and broadband networks have evolved rapidly, there is a huge amount of digital data available that needs to be mined, and clustering, as one of the most important tasks in data mining, is needed everywhere. New clustering algorithms need to be developed for the communication environments of today: on the one hand, algorithms that can run on millions of computing units connected via the Internet and, on the other hand, high-throughput algorithms for grid and cloud computing. While research in P2P clustering addresses the first type of environment, there has been little development in recent years of algorithms that can work on clusters of computers and process massive amounts of data.

The algorithm proposed in this report, Parallel QT Clustering, is best suited for running in grid computing environments, tackling an area that has not been researched recently; evaluating its utility is the subject of the future work described in Section 6.2.

The necessity of new algorithms for distributed clustering also comes from the expansion of distributed data mining into new fields, like data stream mining and wireless sensor networks. These require specific algorithms to deal with the intrinsic constraints that they possess. The following subsections present these two emerging fields in more detail.

6.1 Emerging Areas of Distributed Data Mining

6.1.1 Sensor Networks

Sensor networks are used increasingly often and in many domains, including medicine, warfare and autonomous buildings. Lightweight, sometimes mobile, sensors with reduced power consumption form these networks by establishing wireless connections to each other or to a central node. Often, sensor networks are situated in harsh environments where resources, such as power, are limited and costs are constrained. Examples of wireless sensor networks where DDM is applied can be found in [28] (pollution monitoring) and [51] (stock market watch).

In addition to the limited power supply, information processing algorithms in wireless sensor networks have other aspects to take into account: the reduced computing power of the sensors, the inherently dynamic state of the network, where sensors often fail or move into the range of other nodes, and the asynchronous nature of wireless networks. Data mining systems that have been used successfully in wireless sensor networks rely on multi-agent systems [15] or P2P architectures [11].

The book by Zhao and Guibas [77] contains in-depth information about data processing in sensor networks; the reader is referred to it for more information on this field.

6.1.2 Data Stream Mining

A novel application of DDM is in mining data streams. In this scenario, data flows continuously from a producer, an application that outputs data such as web pages, sensor measurements or voice call records. The flow of data can be rapid, fluctuating, unpredictable and unlimited in size. Thus, the data is viewed as a stream, not as a collection of individual blocks. Because traditional Database Management Systems (DBMS) are not able to cope with this source of data, new algorithms were required for basic functions like viewing, filtering or querying data streams. In recent years, more advanced research in data stream mining has been undertaken, in subjects like distributed clustering [8] and frequent pattern mining.

In [7], a data stream model is introduced, possessing the following characteristics:

• Data elements arrive online

• The system cannot impose an order on the arriving elements

• Data streams can have unlimited size

• Once an element is processed, getting access to it is difficult or even impossible.

A Data Stream Management System [2] has been developed at Stanford University to process data streams based on the previously described model.

Examples of applications over distributed data streams are TraderBot [3], a financial search engine over streams of stock tickers and news feeds, and iPolicy Networks [1], which provides intrusion detection over gigabit network packet streams. More details about data stream mining can be found in the paper by Babcock et al. [7], as well as in the survey by Gaber et al. [26].

6.2 Future Work

The algorithm proposed in this report, Parallel QT Clustering, is best suited for running in grid computing environments, tackling an area that has not been researched recently. Its utility can be evaluated only after thorough testing of the algorithm has been performed. This constitutes the subject of my future work in the field of distributed clustering.

References

[1] iPolicy Networks web page. http://www.ipolicynetworks.com.

[2] Stanford stream data management project. http://www-db.stanford.edu/stream.

[3] Traderbot web page. http://www.traderbot.com.

[4] Charu C. Aggarwal and Philip S. Yu. Finding generalized projected clusters in high dimensional spaces. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 70–81, New York, NY, USA, 2000. ACM.

[5] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973.

[6] Francisco Azuaje. Clustering-based approaches to discovering and visualising microarray data patterns. Brief Bioinform, 4(1):31–42, 2003.

[7] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Phokion G. Kolaitis, editor, Proceedings of the 21st Symposium on Principles of Database Systems, pages 1–16. ACM Press, 2002.

[8] Sanghamitra Bandyopadhyay, Chris Giannella, Ujjwal Maulik, Hillol Kargupta, Kun Liu, and Souptik Datta. Clustering distributed data streams in peer-to-peer environments. Information Sciences, 176(14):1952–1985, 2006.

[9] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, 2002.

[10] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Rev., 37(4):573–595, 1995.

[11] Kanishka Bhaduri. Efficient Local Algorithms for Distributed Data Mining in Large Scale Peer to PeerEnvironments: A Deterministic Approach. PhD thesis, University of Maryland Baltimore County, 2008.

[12] Paul C. Boutros and Allan B. Okey. Unsupervised pattern recognition: An introduction to the whys and wherefores of clustering microarray data. Brief Bioinform, 6(4):331–343, 2005.

[13] M. Cannataro, A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio. Distributed data mining on grids: services, tools, and applications. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 34(6):2451–2465, Dec. 2004.

[14] Krzysztof Cios, Witold Pedrycz, and Roman W. Swiniarski. Data mining methods for knowledge discovery.Kluwer Academic Publishers, Norwell, MA, USA, 1998.

[15] Josenildo C. da Silva, Chris Giannella, Ruchita Bhargava, Hillol Kargupta, and Matthias Klusch. Distributed data mining and agents. Engineering Applications of Artificial Intelligence, 18:791–807, 2005.

[16] Ovidiu Dan and Horatiu Mocian. Scalable web mining with newistic. In Proceedings of PAKDD ’09, 2009.

[17] Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, and Hillol Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, 10(4):18–26, 2006.

[18] Souptik Datta, Chris Giannella, and Hillol Kargupta. K-means clustering over a large, dynamic network. InJoydeep Ghosh, Diane Lambert, David B. Skillicorn, and Jaideep Srivastava, editors, SDM. SIAM, 2006.

[19] D. Deb and R. A. Angryk. Distributed document clustering using word-clusters. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), pages 376–383, 2007.

[20] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 245–260, London, UK, 2000. Springer-Verlag.

[21] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), pages 226–231. AAAI Press, 1996.

[22] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI/MIT Press, 1996.

[23] E. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics,21:768–780, 1965.

[24] George Forman and Bin Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explor. Newsl., 2(2):34–38, 2000.

[25] A. L. N. Fred and A. K. Jain. Data clustering using evidence accumulation. In Proceedings of the 16th International Conference on Pattern Recognition, volume 4, pages 276–280, 2002.

[26] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. Mining data streams: a review.SIGMOD Rec., 34(2):18–26, 2005.

[27] G. Getz, E. Levine, and E. Domany. Coupled two-way clustering analysis of gene microarray data. Proc. Natl Acad. Sci. USA, 97:12079–12084, 2000.

[28] M. Ghanem, Y. Guo, J. Hassard, M. Osmond, and M. Richards. Sensor grids for air pollution monitoring. In Proc. 3rd UK e-Science All Hands Meeting, 2004.

[29] Yike Guo and Janjao Sutiwaraphun. Probing knowledge in distributed data mining. In PAKDD '99: Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, pages 443–452, London, UK, 1999. Springer-Verlag.

[30] Khaled Hammouda and Mohamed Kamel. Distributed collaborative web document clustering using cluster keyphrase summaries. Information Fusion, 9(4):465–480, 2008. Special issue on Web Information Fusion.

[31] Khaled M. Hammouda. Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments. PhD thesis, University of Waterloo, Department of Electrical and Computer Engineering, 2007.

[32] Khaled M. Hammouda and Mohamed S. Kamel. Incremental document clustering using cluster similarity histograms. In WI '03: Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, page 597, Washington, DC, USA, 2003. IEEE Computer Society.

[33] Khaled M. Hammouda and Mohamed S. Kamel. Collaborative document clustering. In Joydeep Ghosh, Diane Lambert, David B. Skillicorn, and Jaideep Srivastava, editors, SDM. SIAM, 2006.

[34] Khaled M. Hammouda and Mohamed S. Kamel. HP2PC: Scalable hierarchically-distributed peer-to-peer clustering. In SDM. SIAM, 2007.

[35] Pierre Hansen and Brigitte Jaumard. Cluster analysis and mathematical programming. Math. Program., 79(1-3):191–215, 1997.

[36] L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9:1106–1115, 1999.

[37] Byung-Hoon Park and Hillol Kargupta. Distributed data mining: Algorithms, systems, and applications. pages 341–358, 2002.

[38] Ah-Hwee Tan. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pages 65–70, 1999.

[39] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.

[40] Eshref Januzaj, Hans-Peter Kriegel, and Martin Pfeifle. DBDC: Density based distributed clustering. In Elisa Bertino, Stavros Christodoulakis, Dimitris Plexousakis, Vassilis Christophides, Manolis Koubarakis, Klemens Böhm, and Elena Ferrari, editors, EDBT, volume 2992 of Lecture Notes in Computer Science, pages 88–105. Springer, 2004.

[41] Daxin Jiang, Jian Pei, and Aidong Zhang. DHC: A density-based hierarchical clustering method for time series gene expression data. In BIBE '03: Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering, page 393, Washington, DC, USA, 2003. IEEE Computer Society.

[42] Daxin Jiang and Aidong Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16:1370–1386, 2004.

[43] Erik L. Johnson and Hillol Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, pages 221–244, London, UK, 2000. Springer-Verlag.

[44] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, Berlin, 1986.

[45] Pierre-Emmanuel Jouve and Nicolas Nicoloyannis. A method for aggregating partitions, applications in k.d.d. In Kyu-Young Whang, Jongwoo Jeon, Kyuseok Shim, and Jaideep Srivastava, editors, PAKDD, volume 2637 of Lecture Notes in Computer Science, pages 411–422. Springer, 2003.

[46] H. Kargupta and P. Chan. Advances in Distributed and Parallel Knowledge Discovery, chapter Distributedand Parallel Data Mining: A Brief Introduction. AAAI/MIT Press, 2000.

[47] H. Kargupta and P. Chan. Advances in Distributed and Parallel Knowledge Discovery. AAAI/MIT Press, Cambridge, MA, USA, 2000.

[48] H. Kargupta and K. Sivakumar. Data Mining: Next Generation Challenges and Future Directions, chapterExistential Pleasures of Distributed Data Mining. AAAI/MIT Press, 2004.

[49] Hillol Kargupta, Ilker Hamzaoglu, and Brian Stafford. Scalable, distributed data mining using an agent-based architecture. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 211–214. AAAI Press, Menlo Park, California, 1997.

[50] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik Johnson. Distributed clustering using collective principal component analysis. Knowl. Inf. Syst., 3(4):422–448, 2001.

[51] Hillol Kargupta, Byung-Hoon Park, Sweta Pittie, Lei Liu, Deepali Kushraj, and Kakali Sarkar. MobiMine: monitoring the stock market from a PDA. SIGKDD Explor. Newsl., 3(2):37–46, 2002.

[52] Rasha Kashef. Cooperative Clustering Model and Its Applications. PhD thesis, University of Waterloo, Department of Electrical and Computer Engineering, 2008.

[53] Matthias Klusch, Stefano Lodi, and Gianluca Moro. Agent-based distributed data mining: The KDEC scheme. In Matthias Klusch, Sonia Bergamaschi, Peter Edwards, and Paolo Petta, editors, Intelligent Information Agents: The AgentLink Perspective, volume 2586 of Lecture Notes in Computer Science. Springer, 2003.

[54] Matthias Klusch, Stefano Lodi, and Gianluca Moro. Distributed clustering based on sampling local density estimates. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, August 2003.

[55] Raymond Kosala and Hendrik Blockeel. Web mining research: A survey. CoRR, cs.LG/0011033, 2000.

[56] R. Krishnapuram, A. Joshi, and Liyu Yi. A fuzzy relative of the k-medoids algorithm with application to web document and snippet clustering. In Proceedings of FUZZ-IEEE '99, volume 3, pages 1281–1286, 1999.

[57] Laura Lazzeroni and Art Owen. Plaid models for gene expression data. Statistica Sinica, 12:61–86, 2002.

[58] Sara C. Madeira and Arlindo L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics, 1(1):24–45, 2004.

[59] Srujana Merugu and Joydeep Ghosh. Privacy-preserving distributed clustering using generative models. In ICDM, pages 211–218. IEEE Computer Society, 2003.

[60] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, November 1983.

[61] Clark F. Olson. Parallel algorithms for hierarchical clustering. Parallel Comput., 21(8):1313–1325, 1995.

[62] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210–2239, Nov 1998.

[63] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.

[64] Nagiza F. Samatova, George Ostrouchov, Al Geist, and Anatoli V. Melechko. RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets. Distrib. Parallel Databases, 11(2):157–180, 2002.

[65] Noam Slonim and Naftali Tishby. Document clustering using word clusters via the information bottleneck method. In SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 208–215, New York, NY, USA, 2000. ACM.

[66] F. De Smet, J. Mathys, K. Marchal, G. Thijs, B. De Moor, and Y. Moreau. Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18(5):735–746, 2002.

[67] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In KDD-2000 Workshop on Text Mining, 2000.

[68] Alexander Strehl and Joydeep Ghosh. Cluster ensembles — a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res., 3:583–617, 2003.

[69] C. Y. Suen. N-gram statistics for natural language understanding and text processing. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 2, pages 164–172, 1979.

[70] Jaideep Vaidya and Chris Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Lise Getoor, Ted E. Senator, Pedro Domingos, and Christos Faloutsos, editors, KDD, pages 206–215. ACM, 2003.

[71] Rui Xu and Donald Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005.

[72] Xiaowei Xu, Jochen Jäger, and Hans-Peter Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Min. Knowl. Discov., 3(3):263–290, 1999.

[73] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning, pages 412–420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[74] Mohammed Javeed Zaki. Parallel and distributed data mining: An introduction. In Mohammed Javeed Zaki and Ching-Tien Ho, editors, Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Computer Science, pages 1–23. Springer, 1999.

[75] Oren Zamir and Oren Etzioni. Web document clustering: a feasibility demonstration. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46–54, New York, NY, USA, 1998. ACM.

[76] Bin Zhang, Meichun Hsu, and George Forman. Accurate recasting of parameter estimation algorithms using sufficient statistics for efficient parallel speed-up: Demonstrated for center-based data clustering algorithms. In PKDD '00: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 243–254, London, UK, 2000. Springer-Verlag.

[77] Feng Zhao and Leonidas Guibas. Wireless Sensor Networks: An Information Processing Approach. Morgan Kaufmann, 2004.