
Clustering Social Networks Using Distance-preserving Subgraphs
Ronald Nussbaum, Abdol-Hossein Esfahanian, and Pang-Ning Tan

Department of Computer Science and Engineering
Michigan State University

East Lansing, MI 48824, U.S.A.
{ronald,esfahanian,ptan}@cse.msu.edu

Abstract—Cluster analysis describes the division of a dataset into subsets of related objects, which are usually disjoint. There is considerable variety among the different types of clustering algorithms. Some of these clustering algorithms represent the dataset as a graph, and use graph-based properties to generate the clusters. However, many graph properties have not been explored as the basis for a clustering algorithm. In graph theory, a subgraph of a graph is distance-preserving if the distances (lengths of shortest paths) between every pair of vertices in the subgraph are the same as the corresponding distances in the original graph. In this paper, we consider the question of finding proper distance-preserving subgraphs, and the problem of partitioning a simple graph into an arbitrary number of distance-preserving subgraphs for clustering purposes. We also present a clustering algorithm called DP-Cluster, based on the notion of distance-preserving subgraphs. One area of research that makes considerable use of graph theory is the analysis of social networks. For this reason we evaluate the performance of DP-Cluster on two real-world social network datasets.

I. INTRODUCTION

Cluster analysis is a technique for partitioning a dataset into groups of similar objects. It is often used as an exploratory data analysis tool to determine the underlying structure of a dataset. A recent area of application is in the realm of social networks, where the goal is to find tightly-knit groups of people, also known as communities, based on their social relations. For such applications, the social network is typically modeled as a graph structure, where the vertices represent the social actors while the edges represent the social ties between them. Communities are detected by partitioning the network into subgraphs in such a way that optimizes certain graph properties (e.g., minimizing the cut between clusters or maximizing an edge-based modularity function). In addition to these criteria, there are other graph invariants that are potentially applicable to clustering social networks. The motivation for our work is to explore the effect of alternative graph invariants on the process of community finding.

One of the difficulties with cluster analysis is that there is no strict definition of what a cluster is. This makes it more difficult to conjecture what sort of concepts might be usable in creating a clustering algorithm. In a social network, a vertex usually represents a person or small group of persons. We might wish to cluster participants according to family ties, peer groups, business relationships, and other personal attributes or common interests. For the graph of a social network, we expect to find higher connectivity between similar vertices than dissimilar ones. More importantly, we expect the centers of social groups to be quite dense, with fewer edges between communities. Since social networks tend to follow a scale-free degree distribution, we will not find large cliques. Still, we expect the shortest path between any two members of the same community in the network to involve other members from the same community rather than members from other communities. It is this last premise that gives us reason to explore the use of distance-preserving subgraphs as the basis of a clustering algorithm.

Fig. 1. A toy example of a social network.

A distance-preserving subgraph is an induced proper subgraph that maintains the distances (lengths of shortest paths) as in the original graph. Consider the toy example of a social network shown in Figure 1, which has 8 vertices and 13 edges. There are two natural communities in the network, one involving vertices (A, B, C, D) and the other (E, F, G, H). Clearly, both subgraphs are distance-preserving, because the shortest path distance between any two vertices within each subgraph is identical to the one computed from the entire network. However, adding the vertex E to the subgraph (A, B, C, D) no longer preserves the distance between the vertex pair (E, B). Similarly, the addition of vertex F to (A, B, C, D) would make the subgraph non-distance-preserving, because the distance between (F, C) is not the same as that in the original graph.

We face a number of challenges along the way to creating a useful clustering algorithm. As previously mentioned, we need to find distance-preserving subgraphs within a graph that represent clusters. While we generally cannot divide a given graph into an arbitrary number of distance-preserving subgraphs, we can always divide it into at least that many distance-preserving subgraphs, since isolated vertices are distance-preserving subgraphs. So we must devise a method of merging extra clusters so that we can always reach the desired number of clusters. We might also encounter graphs that can be divided into fewer than the desired number of distance-preserving clusters, in which case we will have to split them. Once we have addressed these challenges, we can use the heuristics found to develop a clustering algorithm.

In the next section, we review related work in cluster analysis and give further background on distance-preserving subgraphs. In Section III, we give a formal problem statement. In Section IV, we describe our algorithm, DP-Cluster. In Section V, we present the results of applying the DP-Cluster algorithm to two social network datasets. We state our conclusions in Section VI.

II. RELATED WORK

Many clustering algorithms exist. Here we cover some of the more common ones, and those based on the use of graph properties. We also provide some background on the area of community finding. Finally, we give a more thorough explanation of distance-preserving subgraphs.

A. Clustering Algorithms

One of the better known methods of cluster analysis is k-means, whose name dates back to a paper by MacQueen [1]. k-means is a partitional algorithm that iteratively assigns objects to their nearest cluster centers, then recomputes the centers until they converge. Another important class of clustering methods is hierarchical clustering, which can be done in agglomerative or divisive fashion. Hierarchical clustering produces a dendrogram, which represents a series of clustering solutions, from all objects in one cluster to each object in a cluster of its own. Many variations of hierarchical clustering algorithms exist, including single-link, complete-link, group average, and Ward's method. An alternative approach is to define clusters as dense regions of objects separated by low-density regions that are often populated by noisy observations. Two such algorithms are DBSCAN [3] and OPTICS [2].

A number of graph-theoretic clustering algorithms have been developed. These algorithms are designed to optimize certain graph properties. For example, spectral clustering methods were developed to minimize variants of the graph cut property using the eigenvectors of the graph Laplacian matrix. Other heuristics have also been proposed based on minimizing diameter, p-centers, and other graph properties [4], [5], [6].

B. Community Finding

The technique of community finding, also called group detection [7], positional analysis [8], [9], or blockmodelling [9], is the process of placing vertices into groups in such a way that the vertices within a group are similar to each other and dissimilar to vertices in other groups. For example, Newman and Girvan [10] proposed an approach for discovering community structure in networks using a modularity function as a measure of cluster quality. Scripps et al. [11] presented an overlapping clustering algorithm, taking into account the ill effects of bridge vertices in a network. Liu et al. [12] gave an algorithm for building hierarchical communities using client-side web browsing history. More recently, there has been considerable interest in discovering dynamic communities in an evolving social network. For example, Zhou et al. [13] presented an algorithm for finding dynamic communities by combining statically derived communities subject to certain constraints that ensure consistency of the clusters at different time periods. Tantipathananandh et al. [14] developed a community finding algorithm for dynamic networks using a graph coloring method.

C. Distance-preserving Subgraphs

The topic of distance-preserving subgraphs is relatively unexplored. A subgraph of a graph G is called a distance-preserving subgraph if the distances between every pair of vertices in the subgraph are the same as the distances between them in G, and the subgraph is a proper subgraph of G. A distance-preserving subgraph is necessarily induced if all edge weights in G are equal, since the omission of any edge would increase the distance between the endpoints of that edge. One can determine whether a given subgraph is distance-preserving by computing all-pairs shortest paths for the subgraph and for G. If there are no negative cycles in G, then the Floyd-Warshall algorithm may be used to accomplish this, with a time complexity of O(n^3). A related concept is the distance-hereditary graph, first proposed by Howorka [15]. Distance-hereditary graphs are a subset of perfect graphs, and have been studied extensively [16], [17], [18].
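To make this test concrete, the following is a minimal Python sketch for an unweighted graph stored as a dict of neighbor sets. The representation and the names floyd_warshall and is_distance_preserving are ours, not the paper's.

    from itertools import combinations

    def floyd_warshall(nodes, adj):
        # All-pairs shortest-path lengths for an unweighted graph given
        # as a dict mapping each vertex to its set of neighbors; O(n^3).
        INF = float("inf")
        dist = {u: {v: 0 if u == v else (1 if v in adj[u] else INF)
                    for v in nodes} for u in nodes}
        for k in nodes:
            for i in nodes:
                for j in nodes:
                    if dist[i][k] + dist[k][j] < dist[i][j]:
                        dist[i][j] = dist[i][k] + dist[k][j]
        return dist

    def is_distance_preserving(full_dist, adj, subset):
        # The subgraph induced by `subset` is distance-preserving when
        # every pairwise distance within it equals the corresponding
        # distance in G; full_dist is G's distance table, computed once.
        sub_adj = {u: adj[u] & subset for u in subset}
        sub_dist = floyd_warshall(list(subset), sub_adj)
        return all(sub_dist[u][v] == full_dist[u][v]
                   for u, v in combinations(subset, 2))

Since the graphs considered here are unweighted, a breadth-first search from each vertex would serve equally well and is cheaper on sparse graphs; Floyd-Warshall is used only to mirror the text.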

III. PROBLEM STATEMENT

Let G = (V, E) be a graph representing a connected social network of order |V| = n. Order refers to the number of vertices in a graph, and size refers to the number of edges. Our goal is to use the concept of distance-preserving subgraphs to partition V(G) into k pairwise disjoint subsets V_1, ..., V_k, such that V_1 ∪ ... ∪ V_k = V. With n vertices and k clusters, on average a cluster must contain n/k vertices. Of course, the purpose of using cluster analysis is to find good clusters, not just ones of equal order. However, this fact hints at the fundamental question facing us: given a graph G and an integer m, does G contain a distance-preserving subgraph of order m? Answering this question raises several immediate issues.

A. Challenges

Our foremost concern is attaining clusters that are distance-preserving. Obviously, G will not always contain a distance-preserving subgraph of order m. So it will not always be possible to partition G into k distance-preserving clusters. Instead, we will seek to find clusters that are as close to being distance-preserving as possible. In the next section, we offer a way to quantitatively measure such almost distance-preserving subgraphs.

Even if G does have a distance-preserving subgraph of order m, we still have to find it. We suspect that this cannot be done in polynomial time for the general case. It is certainly computationally infeasible to do so in a naive fashion, so we resort to heuristics.

Since any distance-preserving algorithm is based on distances between pairs of vertices, we also have a problem if G is not connected. Our distance-preserving clustering algorithm assumes that G is a connected network. If the graph G is disconnected, we could apply our algorithm to each component separately.

B. Algorithmic Approach

Here we are at a crossroads. In our algorithm presented in the next section, we partition G into as few properly distance-preserving clusters as we can find according to a heuristic, and proceed to merge them until we have k clusters. However, once we define a measure for how distance-preserving a subgraph is, we could instead construct an agglomerative or divisive hierarchical clustering algorithm. We proceed with the former method in the interest of keeping larger parts of the clusters distance-preserving. It is not clear which path might lead to a better performing or less computationally expensive algorithm.

IV. DP-CLUSTER

Here we describe our heuristic to find distance-preserving subgraphs, give a definition for an almost distance-preserving subgraph, and present a simple clustering algorithm, which we call DP-Cluster. In the next section we compare it against the hierarchical clustering algorithm in Matlab's Statistics Toolbox (see linkage and cluster), along with random clustering as a baseline.

A. Finding Distance-preserving Subgraphs

Finding large distance-preserving proper subgraphs in a graph is not an easy task. Clearly we cannot test each of the almost 2^n induced subgraphs of G to see whether or not it is distance-preserving. The best approach we have found so far is to start with a single vertex as a cluster C, which is trivially distance-preserving. At each step, we attempt to add each neighbor not in C to C in turn. If there is some non-empty set of neighbors where the union of a single element with C leaves C distance-preserving, we choose one of them to permanently add to C according to some criterion, and continue. Once we reach a step where no neighbors can be added to C and have it remain distance-preserving, we are done.
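As a rough illustration, here is this growing heuristic as a Python sketch, reusing floyd_warshall and is_distance_preserving from the sketch in Section II-C; the name grow_dp_subgraph and the uniformly random tie-breaking are our choices.

    import random

    def grow_dp_subgraph(full_dist, adj, seed):
        # Start from a single seed vertex (trivially distance-preserving)
        # and repeatedly add one neighbor of the current cluster, chosen
        # among those whose addition keeps the cluster distance-preserving.
        cluster = {seed}
        while True:
            frontier = {v for u in cluster for v in adj[u]} - cluster
            usable = [v for v in sorted(frontier)
                      if is_distance_preserving(full_dist, adj, cluster | {v})]
            if not usable:
                return cluster  # no neighbor keeps the cluster DP; stop
            cluster.add(random.choice(usable))

Other tie-breaking rules, such as preferring vertices by degree or clustering coefficient as in Section V, would replace random.choice here.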

B. Almost Distance-preserving Subgraphs

We define the average distance increase for a subgraph of G as the sum of the distance increases between the subgraph and G, divided by the number of vertices in the subgraph. If the subgraph is distance-preserving, then this value is 0. The less distance-preserving the subgraph is, the higher this number will be.
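In code, the measure might look as follows, a sketch under the same graph representation as before (the function name is ours). Note that the value is infinite when the induced subgraph is disconnected, a case the text does not discuss.

    from itertools import combinations

    def avg_distance_increase(full_dist, adj, subset):
        # Sum of the increases in pairwise distance within the induced
        # subgraph relative to G, divided by the order of the subgraph;
        # 0 exactly when the subgraph is distance-preserving.
        sub_adj = {u: adj[u] & subset for u in subset}
        sub_dist = floyd_warshall(list(subset), sub_adj)
        total = sum(sub_dist[u][v] - full_dist[u][v]
                    for u, v in combinations(subset, 2))
        return total / len(subset)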

C. The Algorithm

Our algorithm begins with each vertex from G in a separate cluster, all of which are trivially distance-preserving subgraphs of G. These clusters must be combined until there are only k clusters left. We pick one of these singleton clusters at random and use it as the start vertex to find a distance-preserving subgraph as described in the first part of this section. Random selection is used when there is more than one choice of vertices to add to the current cluster. Once this has finished, we set these vertices aside, pick another vertex at random, and repeat the process. Eventually, this will leave us with G partitioned into some number of distance-preserving clusters.

If we do not stop cluster growth at n/k vertices, we may end up with fewer than k distance-preserving clusters. In this case, we would have to break distance-preserving clusters apart to achieve k clusters. Since this has not happened with an actual dataset, we do not concern ourselves further with this possibility. Instead, we end up with a number of distance-preserving clusters larger than k. We then take each pair of clusters, and calculate how close to distance-preserving each potential merger is, according to the metric given in the second part of this section. The merge with the lowest value, i.e., the most distance-preserving, is made. This process is repeated until only k clusters remain.

Below is pseudocode for DP-Cluster. clusters is the set of clusters, whose members are sets of vertices. unused is the set of vertices that have not been placed into a cluster. neighbors is the set of available vertices adjacent to some vertex in the current cluster. usable is the set of available vertices which can be added to the current cluster while keeping it distance-preserving. RANDOM returns a random element from a set. NEIGHBORS returns the set of neighbors for a given vertex of a graph. RANDOMIZE returns a random permutation of a set. COUNT returns the number of elements in a set. IS-DP tests whether a cluster is distance-preserving. ALMOST-DP returns how close a cluster is to being distance-preserving, according to the method described in the previous subsection.

    unused ← V
    i ← 0
    while unused ≠ {} do
        i ← i + 1
        v ← RANDOM(unused)
        unused ← unused \ {v}
        clusters[i] ← {v}
        neighbors ← NEIGHBORS(G, v) ∩ unused
        while neighbors ≠ {} do
            usable ← {}
            for all u ∈ RANDOMIZE(neighbors) do
                if IS-DP(G, clusters[i] ∪ {u}) then
                    usable ← usable ∪ {u}
                end if
            end for
            if usable ≠ {} then
                u ← RANDOM(usable)
                unused ← unused \ {u}
                clusters[i] ← clusters[i] ∪ {u}
                neighbors ← (neighbors ∪ NEIGHBORS(G, u)) ∩ unused
            else
                break
            end if
        end while
    end while
    while COUNT(clusters) > k do
        best ← ∞
        for all C1 ∈ clusters do
            for all C2 ∈ clusters \ {C1} do
                if ALMOST-DP(C1 ∪ C2) < best then
                    best ← ALMOST-DP(C1 ∪ C2)
                    (B1, B2) ← (C1, C2)
                end if
            end for
        end for
        clusters ← (clusters \ {B1, B2}) ∪ {B1 ∪ B2}
    end while
    return clusters
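For concreteness, the merge phase can also be sketched in Python on top of avg_distance_increase from Section IV-B; the name merge_to_k is ours, and tie handling and efficiency are ignored.

    def merge_to_k(full_dist, adj, clusters, k):
        # Repeatedly merge the pair of clusters whose union is closest
        # to distance-preserving (lowest average distance increase)
        # until only k clusters remain.
        clusters = [set(c) for c in clusters]
        while len(clusters) > k:
            pairs = [(i, j) for i in range(len(clusters))
                     for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda p: avg_distance_increase(
                full_dist, adj, clusters[p[0]] | clusters[p[1]]))
            clusters[i] |= clusters.pop(j)
        return clusters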

D. Efficiency

Using distance-preserving subgraphs in any manner requires running an all-pairs shortest-paths algorithm on G. Regardless of how we formulate our clustering algorithm, our running time must be at least O(n^3).

V. EXPERIMENTAL EVALUATION

In this section we compare and contrast DP-Cluster against a few existing hierarchical clustering algorithms, as well as a random clustering algorithm, using different social network datasets.

A. Datasets

The two datasets we use for testing are CiteSeer and Cora. They can be downloaded from the University of Maryland website at http://www.cs.umd.edu/~sen/lbc-proj/LBC.html. Vertices in these datasets are scientific publications, and edges represent citations, i.e., if the graph contains an edge (A, B), then either paper A cites paper B or paper B cites paper A. Although the papers represent authors, clustering in these datasets is closer to document classification than community finding in an ordinary social network. CiteSeer, which is composed of general-topic computer science papers, contains 3312 vertices, 4536 edges, and 6 class labels: Agents, Artificial Intelligence, Database, Information Retrieval, Machine Learning, and Human-Computer Interaction. Cora, which consists entirely of machine learning papers, contains 2708 vertices, 5278 edges, and 7 class labels: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. Since we want connected social networks for testing purposes, we extract the largest component from each dataset. For the remainder of this paper, we are referring to the largest component whenever we reference CiteSeer and Cora.

B. Experiments

TABLE I
DATASET PROPERTIES - LARGEST COMPONENT

    Dataset    Order  Size  Diameter  No. of Classes
    CiteSeer   2110   3668  28        6
    Cora       2485   5069  19        7

Before attempting to evaluate the DP-Cluster algorithm, we want to look at results for the intermediate steps. Our first experiment is simply to find a single distance-preserving subgraph, using the incremental approach described in Section IV-A. When multiple vertices can be added to a cluster such that the cluster is still distance-preserving, we must choose between them in some fashion. We consider random selection, as well as using the degree and clustering coefficient (CC) of the vertex as found in the original graph G. The clustering coefficient of a vertex v is the number of edges found between the neighbors of v divided by the total possible number of edges between the neighbors of v. We run 100 trials for each combination of dataset and method of breaking ties, and present statistics for the distance-preserving subgraphs found. For each trial, we calculate the entropy of the final distance-preserving subgraph using the actual class labels. If G has k labels, then the entropy of the subgraph is given by the equation

    entropy = − Σ_{i=1}^{k} (m_i / m) log_k (m_i / m),

where m is the order of the subgraph, m_i is the number of vertices in the subgraph with label i, and m_1 + ... + m_k = m. Over all trials, we calculate the average order, diameter, and entropy of the distance-preserving subgraphs found.
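A minimal Python rendering of this entropy computation (the function name is ours) might be:

    import math

    def cluster_entropy(labels, k):
        # labels: the class labels of the vertices in one subgraph;
        # k: the number of class labels in G. Log base k bounds the
        # entropy by 1, reached when all labels are equally frequent.
        m = len(labels)
        counts = {}
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
        return -sum((mi / m) * math.log(mi / m, k)
                    for mi in counts.values())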

In the second experiment we examine the first part of the DP-Cluster algorithm, using random tiebreaks. That is, we run the algorithm as normal, but discard any clusters in excess of the number of class labels instead of merging them. As a result, each run of the DP-Cluster algorithm gives us a different subset of the vertices in G. The goal of this experiment is to give us some idea of the applicability of distance-preserving subgraphs to cluster analysis, without testing our definition of almost distance-preserving subgraphs as defined in Section IV-B. Here we run 20 trials for each dataset, and calculate the average order and entropy over all trials. In this experiment, the entropy for a given trial is the weighted average of the individual cluster entropies. For comparison, we calculate these same values using hierarchical clustering (HC) with various metrics, and random clustering. The inputs to the hierarchical clustering and random clustering algorithms in each trial are only those vertices found by DP-Cluster in that particular trial.

The third experiment tests the full DP-Cluster algorithm as presented in Section IV-C, again using random tiebreaks. For each trial the graph is fully partitioned into distance-preserving subgraphs, which are then merged into almost distance-preserving subgraphs. As with the previous experiment, we run 20 trials for each dataset, and present the average entropy over all runs. We also consider cluster stability. For each run, we create an n × n incidence matrix, where entry (i, j) is 1 if i and j belong to the same cluster, and 0 if they do not. We then calculate the average correlation between these matrices for each pair of trials. Again, we calculate these same values using hierarchical clustering and random clustering algorithms.
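As a sketch of this stability measure, assuming Pearson correlation (the text does not name the correlation used) and NumPy for convenience:

    import numpy as np

    def co_cluster_matrix(assignment):
        # assignment[i] is the cluster label of vertex i in one trial;
        # returns the n x n 0/1 matrix of same-cluster indicators.
        a = np.asarray(assignment)
        return (a[:, None] == a[None, :]).astype(float)

    def average_trial_correlation(assignments):
        # Mean correlation between the flattened incidence matrices
        # of every pair of trials.
        mats = [co_cluster_matrix(a).ravel() for a in assignments]
        corrs = [np.corrcoef(mats[i], mats[j])[0, 1]
                 for i in range(len(mats))
                 for j in range(i + 1, len(mats))]
        return float(np.mean(corrs))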

For the hierarchical clustering algorithm, we use three different distance metrics. Single linkage (nearest neighbor) uses the minimum of all the distances between pairs of vertices in the two clusters. Complete linkage (furthest neighbor) uses the maximum of all the distances between pairs of vertices in the two clusters. Average linkage uses the unweighted average distance between all pairs of vertices in the two clusters. For all of the distance metrics, the pair of clusters with the minimum distance between them is merged at each step.
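The experiments use Matlab's linkage and cluster; a roughly equivalent SciPy sketch (our stand-in, assuming a precomputed n × n shortest-path distance matrix) would be:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def hierarchical_clusters(dist_matrix, k, method="complete"):
        # method is 'single', 'complete', or 'average', matching the
        # three linkage criteria above; returns labels for k clusters.
        condensed = squareform(np.asarray(dist_matrix), checks=False)
        tree = linkage(condensed, method=method)
        return fcluster(tree, t=k, criterion="maxclust")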

C. Results

In this subsection we give the results for our three experiments.

1) Finding A Single Distance-preserving Subgraph: In the tables below we see that the average distance-preserving subgraph found was reasonably large, with considerable variation depending on the tiebreak method used. As expected, the average order is still much smaller than the average cluster needs to be, so we do not have to consider the problem of splitting excessively large distance-preserving subgraphs. The average diameter found with the CiteSeer dataset appears somewhat high, although this is perhaps not surprising given that the original CiteSeer graph had a diameter of 28 compared to a diameter of 19 for Cora.

TABLE II
FINDING A SINGLE DISTANCE-PRESERVING SUBGRAPH - CITESEER

    Tiebreak Method      Avg. Order  σ(Order)  Avg. Diameter  Avg. Entropy
    Random                  101         58         14            0.636
    Degree, increasing       56         35         12            0.664
    Degree, decreasing      176         87         16            0.645
    CC, increasing           86         51         14            0.657
    CC, decreasing          131         69         15            0.652

It is clear from our results that the method of tiebreaking used has a significant effect on the order of the subgraph found. Specifically, biasing our heuristic towards using higher degree vertices yields much larger subgraphs compared to selecting a vertex at random. Other methods of breaking ties might produce even larger subgraphs, or subgraphs of lower entropy, although for complexity reasons we prefer properties that can be calculated from the original graph.

TABLE III
FINDING A SINGLE DISTANCE-PRESERVING SUBGRAPH - CORA

    Tiebreak Method      Avg. Order  σ(Order)  Avg. Diameter  Avg. Entropy
    Random                  110         86         10            0.618
    Degree, increasing       47         52          8            0.562
    Degree, decreasing      169        109         11            0.634
    CC, increasing           64         57          9            0.590
    CC, decreasing          150        106         10            0.626

2) Partial Clustering With DP-Cluster: In this experiment, the DP-Cluster algorithm used approximately one quarter of the vertices from the largest component of the corresponding dataset in each trial. For each trial, the hierarchical and random clustering algorithms used the same subset of vertices found by DP-Cluster, which is why the average order for these clustering algorithms is the same as for DP-Cluster. We had some concern that the order of successive distance-preserving subgraphs would decrease rapidly after the first one. This was not observed in practice, and the average order for these trials was only slightly less than the average order for the corresponding trials in our first experiment multiplied by the number of class labels for that dataset.

TABLE IV
DP-CLUSTER RESULTS, DISCARDING EXCESS CLUSTERS - CITESEER

    Algorithm      Avg. Entropy
    DP-Cluster        0.656
    HC, single        0.815
    HC, complete      0.627
    HC, average       0.660
    Random            0.888

The average entropy for all the clusters found by DP-Cluster, while not as low as we might desire, did not increase appreciably from the previous experiment. For the CiteSeer dataset, DP-Cluster resulted in an average cluster order of 581, and it outperformed two of the three hierarchical clustering algorithms. With the Cora dataset, DP-Cluster resulted in an average cluster order of 633, and it outperformed the same two hierarchical clustering algorithms. Since the set of vertices used here varies from trial to trial, it is meaningless to consider correlation.

TABLE V
DP-CLUSTER RESULTS, DISCARDING EXCESS CLUSTERS - CORA

    Algorithm      Avg. Entropy
    DP-Cluster        0.618
    HC, single        0.795
    HC, complete      0.589
    HC, average       0.637
    Random            0.811

3) Full Clustering With DP-Cluster: For the CiteSeer dataset, the performance of the hierarchical clustering algorithms ranges from somewhat better to much worse than DP-Cluster, depending on the distance metric used to create the hierarchical cluster tree. Unsurprisingly, random clustering led to much higher average entropies than all other clustering algorithms.

With the Cora dataset, DP-Cluster performs better than all but one of the distance metrics. It is not immediately clear why complete-link hierarchical clustering, which merges clusters according to the furthest distance between them, performs so well here. What is obvious is that cluster entropy has significantly increased from the previous experiment. Also of concern is the relatively low correlation found between trials using DP-Cluster.

TABLE VI
DP-CLUSTER RESULTS, MERGING EXCESS CLUSTERS - CITESEER

    Algorithm      Avg. Entropy  Avg. Correlation
    DP-Cluster        0.778          0.264
    HC, single        0.952          1.000
    HC, complete      0.727          1.000
    HC, average       0.898          1.000
    Random            0.951          0.091

TABLE VII
DP-CLUSTER RESULTS, MERGING EXCESS CLUSTERS - CORA

    Algorithm      Avg. Entropy  Avg. Correlation
    DP-Cluster        0.783          0.205
    HC, single        0.937          1.000
    HC, complete      0.748          1.000
    HC, average       0.883          1.000
    Random            0.937          0.077

D. Discussion

Overall, the results of DP-Cluster compare favorably to hierarchical clustering methods. The algorithm has some obvious weaknesses, some of which may be due to the current implementation. Like many incremental algorithms, DP-Cluster is order dependent. It does find clusters of reasonably low entropy, even after "leftover" clusters are taken into account. However, complete-link hierarchical clustering clearly outperforms DP-Cluster on the CiteSeer and Cora datasets. This advantage might disappear on other datasets, depending on their structure. Also, DP-Cluster found clusters of fairly low stability given their low entropy. A better heuristic for finding distance-preserving subgraphs should reduce this effect.

VI. CONCLUSIONS AND FUTURE WORK

The performance of the fairly simple DP-Cluster algorithm shows promise for the use of distance-preserving subgraphs in community finding. Nonetheless, further exploration is needed to develop a better performing algorithm. Along with finding better clusters, finding more consistent clusters is a concern. The most obvious method of doing this is to improve the heuristic for finding distance-preserving clusters, although changing the manner in which the distance-preserving clusters are merged is a further possibility. Another approach is to create a distance-preserving based hierarchical algorithm, which would combine the "finding" and "merging" processes into a single step. Our experimental results lead us to make the following conjecture.

Conjecture. Almost all graphs are distance-preserving, i.e., there is at least one distance-preserving subgraph for each order m = 1, ..., n.

If true, this underscores the need to develop heuristics for identifying distance-preserving subgraphs that better reflect actual communities in social networks.

REFERENCES

[1] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, 1967.

[2] M. Ankerst, M. Breunig, H. Kriegel, and J. Sander, "OPTICS: Ordering Points To Identify the Clustering Structure," ACM SIGMOD Record, vol. 28, no. 2, pp. 49-60, 1999.

[3] M. Ester, H. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proc. KDD, vol. 96, 1996, pp. 226-231.

[4] M. Charikar, C. Chekuri, T. Feder, and R. Motwani, "Incremental Clustering and Dynamic Information Retrieval," in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. ACM, New York, NY, USA, 1997, pp. 626-635.

[5] G. Flake, R. Tarjan, and K. Tsioutsiouliklis, "Graph Clustering and Minimum Cut Trees," Internet Mathematics, vol. 1, no. 4, pp. 385-408, 2004.

[6] J. Plesnik, "A heuristic for the p-center problem in graphs," Discrete Applied Mathematics, vol. 17, no. 3, pp. 263-268, 1987.

[7] L. Getoor and C. Diehl, "Link Mining: A Survey," ACM SIGKDD Explorations Newsletter, vol. 7, no. 2, p. 12, 2005.

[8] P. Doreian, V. Batagelj, and A. Ferligoj, "Positional analyses of sociometric data," Models and Methods in Social Network Analysis, pp. 77-97, 2005.

[9] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[10] M. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, vol. 69, no. 2, p. 026113, 2004.

[11] J. Scripps and P. Tan, "Constrained Overlapping Clusters: Minimizing the Negative Effects of Bridge-Nodes," Statistical Analysis and Data Mining, vol. 3, no. 1, pp. 20-37, 2010.

[12] K. Liu, K. Bhaduri, K. Das, P. Nguyen, and H. Kargupta, "Client-side web mining for community formation in peer-to-peer environments," ACM SIGKDD Explorations Newsletter, vol. 8, no. 2, p. 20, 2006.

[13] D. Zhou, I. Councill, H. Zha, and C. Giles, "Discovering Temporal Communities from Social Network Documents," in Proceedings of the 2007 Seventh IEEE International Conference on Data Mining. IEEE Computer Society, Washington, DC, USA, 2007, pp. 745-750.

[14] C. Tantipathananandh, T. Berger-Wolf, and D. Kempe, "A Framework For Community Identification in Dynamic Social Networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007, p. 726.

[15] E. Howorka, "A characterization of distance-hereditary graphs," Quarterly Journal of Mathematics, Oxford Series (2), vol. 28, pp. 417-420, 1977.

[16] H. Bandelt and H. Mulder, "Distance-Hereditary Graphs," Journal of Combinatorial Theory, Series B, vol. 41, no. 2, pp. 182-208, 1986.

[17] G. Damiand, M. Habib, and C. Paul, "A simple paradigm for graph recognition: application to cographs and distance hereditary graphs," Theoretical Computer Science, vol. 263, no. 1-2, pp. 99-111, 2001.

[18] P. Hammer and F. Maffray, "Completely separable graphs," Discrete Applied Mathematics, vol. 27, no. 1-2, pp. 85-99, 1990.
