

Page 1: [IEEE 2011 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2011) - Kaohsiung City, Taiwan (2011.07.25-2011.07.27)] 2011 International Conference

Detecting Link Communities in Massive Networks

Qi Ye
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China, 100876
Email: [email protected]

Bin Wu
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China, 100876
Email: [email protected]

Zhixiong Zhao, Bai Wang
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China, 100876
Email: [email protected], [email protected]

Abstract—Most of the existing literature has focused entirely on clustering nodes in large-scale networks. To discover multi-scale overlapping communities quickly, we propose a highly efficient multi-resolution link community detection algorithm that detects link communities in massive networks based on the idea of edge labeling. First, we obtain a node partition of the network using a new multi-resolution node partition algorithm. We then find the link communities in linear time from the labels of the nodes. The time complexity of the algorithm is near linear and its space complexity is linear. The effectiveness of our algorithm is demonstrated by extensive experiments on many computer-generated artificial graphs and real-world networks. The results show that our algorithm is very fast and highly reliable. Tests on real and artificial networks also give excellent results compared with the recently proposed link partition algorithm.

I. INTRODUCTION

To decompose large-scale networks into communities, many node community detection algorithms have been proposed in the last few years. In contrast to most of the existing literature, which has focused entirely on grouping nodes, clustering links is a much more flexible approach than clustering nodes, because it naturally covers the case of overlapping nodes [1]. To detect multi-scale link communities in massive networks, we propose a novel link partition algorithm whose time complexity is near linear and whose space complexity is linear. This algorithm is based on the idea of link labeling provided by the context of a node partition. Furthermore, to investigate different scales of communities, the resolution of its partition can be tuned by a parameter $\gamma$. To evaluate the effectiveness of our algorithm, we report many experiments on real and artificial networks. The results show that the efficiency and accuracy of our algorithm make it feasible to use for the accurate identification of overlapping communities in very large networks.

This paper is organized as follows: Section 2 surveys the related work. Section 3 contains details regarding the implementation of the link community detection algorithm. In Section 4, we discuss the experiments on the link community detection algorithm. In Section 5, we discuss the properties of our link partition algorithm and its relations to other community detection algorithms. Section 6 concludes this paper.

II. RELATED WORK

Most current community detection algorithms focus on partitioning nodes into non-overlapping communities. However, the most widely used node partition approach has the main drawback that each node is attributed to only one community [1], [2], [3], [4]. Many real-world networks have highly overlapping communities in which nodes belong to more than one community. This is especially true for social networks, where it is not uncommon for individuals to belong to more than one community at the same time. Forcing each node into a single community conceals important information and often leads to misclassifications [3]. The clique percolation method (CPM) [4] provides an elegant way to uncover overlapping community structure based on clique percolation. The main drawback of CPM is its rigid definition of communities [1]. When a network is very dense, it can become super-critical in the sense of clique percolation, and there are too many overlapping communities. When the network is too sparse, however, it is sub-critical and there are not enough connected cliques to find any communities. It takes $O(K^2)$ time to find all such communities in the network, where $K$ is the number of k-cliques. To speed up the CPM algorithm, Kumpula et al. [5] developed a fast implementation based on the idea of CPM, called the SCP algorithm, whose computational time scales linearly with the number of k-cliques in the network. Lancichinetti et al. [3] propose a node overlapping algorithm based on the local optimization of a fitness function, in which community structure is revealed by peaks in the fitness histogram. The CONGA algorithm proposed by Gregory [6] is a node overlapping community detection algorithm based on the GN algorithm [7] that splits vertices based on vertex betweenness.

In contrast to the existing node partition approach, by defining communities as a partition of the links rather than of the set of nodes, link communities naturally reveal overlap and hierarchical organization [1], [2]. This link partition approach should be especially efficient in situations where the nodes of a network are connected by different types of links [2]. In two recent papers, Ahn, Bagrow and Lehmann [1] and Evans and Lambiotte [2] independently propose the concept of link community. Ahn, Bagrow and Lehmann [1] propose a hierarchical link clustering method by optimizing

2011 International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4375-8/11 $26.00 © 2011 IEEE

DOI 10.1109/ASONAM.2011.53


the partition density directly. By making a straightforward node partition on the link graph $L(G)$ of the original network $G$, Evans and Lambiotte [2] use the partition of a graph to find its overlapping communities. They also propose three dynamical processes taking place on the links and derive their corresponding modularity definitions. However, each node $i$ with degree $k_i$ of the original graph $G$ corresponds to a clique with $k_i(k_i - 1)/2$ edges in the link graph $L(G)$. Thus there are $O(k_{max}^2 N)$ edges in the link graph $L(G)$, where $k_{max}$ is the maximal degree of the original graph $G$ and $N$ is the number of vertices. The link graph $L(G)$ can therefore be much larger than the original graph $G$. Ahn, Bagrow and Lehmann [1] also argue that the bias caused by the degrees of shared nodes prohibits applying traditional methods directly to the link graph of the original network.
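To make the size argument concrete, the following minimal Python sketch counts the edges of the link graph $L(G)$ directly from the degree sequence, using the fact stated above that a node of degree $k_i$ contributes $k_i(k_i - 1)/2$ edges. The function name and the edge-list input format are assumptions made for illustration; they do not come from the paper.

```python
from collections import Counter

def link_graph_edge_count(edges):
    """Number of edges in the link graph L(G) of a simple undirected graph G.

    `edges` is an iterable of (u, v) pairs (hypothetical input format).
    Two links of G are adjacent in L(G) when they share an endpoint, so a
    node of degree k contributes k * (k - 1) / 2 edges to L(G).
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(k * (k - 1) // 2 for k in degree.values())

# A star with one hub and 4 leaves has only 4 links, but every pair of
# links shares the hub, so its link graph is a clique on 4 vertices.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(link_graph_edge_count(star))
```

A single high-degree hub is enough to make $L(G)$ grow quadratically, which is why the approach in this paper avoids building the link graph explicitly.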

III. LINK PARTITION ALGORITHM

In this section, we will describe the details regarding the implementation, time complexity and space complexity of this link partition algorithm.

A. Symbols and Definitions

Each undirected graph $G = (V, E)$ can be represented mathematically by an adjacency matrix $A$ with elements $A_{i,j} = A_{j,i} = 1$ if there is an edge between node $i$ and node $j$, and $A_{i,j} = A_{j,i} = 0$ otherwise. Let the subgraph $c$ be a community of graph $G$ with $n_c$ nodes and $m_c$ edges, and let $C$ denote the community set in the graph. We define the internal degree $k_v^{int}$ and external degree $k_v^{ext}$ of each node $v \in c$ as the number of edges connecting $v$ to the nodes in $c$ and the number of edges connecting $v$ to the rest of the graph, respectively. In a similar way, we define the internal degree $k_c^{int}$ of a community $c$ as the sum of the internal degrees of its nodes, and the external degree $k_c^{ext}$ of a community $c$ as the sum of the external degrees of its nodes. The total degree $k_c$ of a community $c$ is the sum of the degrees of its nodes. Similarly, in a weighted graph $G_w = (V, E)$, let $w$ be the sum of the weights of all edges, and $w_c$ be the sum of the weights of the internal edges of community $c$. Let $s_c$ be the sum of the weights of all the vertices in $c$. Let $C_v$ be the node community of node $v$ and $C_e$ be the link community of edge $e$.
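As an illustration of these definitions, the sketch below computes the internal, external and total degree of a community in an unweighted graph. The adjacency format (a dict mapping each node to the set of its neighbours) and the function name are assumptions chosen for brevity.

```python
def community_degrees(adj, community):
    """Return (k_int, k_ext, k_total) for a community.

    `adj` maps each node to the set of its neighbours (undirected,
    unweighted); `community` is a set of nodes. Both formats are
    illustrative assumptions.
    """
    k_int = 0
    k_ext = 0
    for v in community:
        inside = len(adj[v] & community)   # internal degree of v
        k_int += inside
        k_ext += len(adj[v]) - inside      # external degree of v
    return k_int, k_ext, k_int + k_ext

# Triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(community_degrees(adj, {0, 1, 2}))
```

Note that the internal degree of a community counts every internal edge twice, once from each endpoint, so it equals $2 m_c$.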

B. Description of Algorithm

In real-world networks, although nodes can belong to different communities, each edge often belongs to just one dominant community [1]. Our main hypothesis is that each link depends on its connected peers, so we can partition the links based on the partition labels of the nodes they connect. To obtain the link partition, every node first receives a community label, and each edge then adopts a community label based on the roles of its peers, taking their modularity gains into account. More specifically, our algorithm is a two-stage procedure: first, we partition the nodes into different node communities and obtain a partition label for each node; second, we divide the edges into different partitions by the node community labels, using the ideas of “boundary link” and “core link” motivated by Evans and Lambiotte [2].

1) Fast Node Partition: Lancichinetti and Fortunato [8] give a detailed empirical study of the performance of several widely used community detection algorithms and show that the BGLL algorithm [9] performs excellently on their benchmark networks. To obtain a fast multi-scale node partition, we propose a new node partition algorithm based on the BGLL algorithm by introducing a new multi-scale modularity gain function. The widely used modularity metric [10] is not a scale-invariant measure, and modularity optimization algorithms may fail to identify communities smaller than a certain scale. Kumpula et al. [11] even show that a single global optimization criterion does not seem capable of detecting all communities if their size distribution is broad. To obtain multi-resolution communities, we choose the spin model modularity function proposed by Reichardt and Bornholdt [12] for a node community $c$. The multi-resolution modularity based on the spin model can be described as:

$$Q(c) = J\left(m_c - \gamma\frac{k_c^2}{4m}\right), \tag{1}$$

where $J$ is a constant expressing the coupling strength and $\gamma$ is a parameter expressing the relative contribution to the energy from existing and missing edges in the spin model. When $\gamma = 1$ and $J = \frac{1}{m}$, the value of the spin coefficient of cohesion equals the value of the GN community modularity [12]. We choose the multi-scale modularity function for community $c$ as:

$$Q(c) = \frac{1}{m}\left(m_c - \gamma\frac{k_c^2}{4m}\right). \tag{2}$$

The resolution parameter $\gamma$ enables us to span several community scales, from very small to very large communities, where $0 \le \gamma < \infty$.
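A small numerical sketch may help show how $\gamma$ changes the score of a fixed community under Eq. 2. The function name and the example graph (two triangles joined by a single edge, so $m = 7$, and each triangle has $m_c = 3$ and $k_c = 7$) are made up for illustration.

```python
def multiscale_modularity(m_c, k_c, m, gamma=1.0):
    """Eq. 2: Q(c) = (1/m) * (m_c - gamma * k_c**2 / (4*m)).

    m_c: number of internal edges of community c
    k_c: total degree of community c
    m:   number of edges in the whole graph
    """
    return (m_c - gamma * k_c ** 2 / (4.0 * m)) / m

# One triangle of the illustrative "two triangles plus a bridge" graph.
for gamma in (0.5, 1.0, 2.0):
    print(gamma, multiscale_modularity(m_c=3, k_c=7, m=7, gamma=gamma))
```

Larger values of $\gamma$ penalise the total degree of a community more heavily, so the same community scores lower and smaller communities become preferable; smaller values have the opposite effect.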

In the BGLL algorithm, Blondel et al. [9] introduce a different approach for the general case of weighted graphs. Initially, all vertices of the graph are put into separate, isolated communities. The node partition then proceeds through a sequence of iterative steps that visit all vertices. In each step, the gain in weighted modularity from moving a vertex $v$ into each of its neighboring communities is computed, and $v$ is merged into the community with the largest increase of weighted modularity $Q_w$ [13]. After a partition is identified in this way, communities are replaced by super-nodes, yielding a smaller weighted network. The procedure is then iterated until the modularity gain is sufficiently small. The weighted modularity in the BGLL algorithm can easily be extended to a multi-resolution one. For a weighted graph $G_w$, we can change Eq. 2 into the following weighted form:

$$Q_w(c) = \frac{1}{w}\left(w_c - \gamma\frac{s_c^2}{4w}\right). \tag{3}$$

In the weighted graph, the modularity gain $\Delta Q_w(c, v)$ by inserting node $v$ into the community $c$ can be described as follows:

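$$\begin{aligned}
\Delta Q_w(c, v) &= Q_w(c \cup \{v\}) - Q_w(c) - Q_w(\{v\}) \\
&= \frac{w_c + w_c^v}{w} - \gamma\left(\frac{s_c + s_v}{2w}\right)^2 - \left(\frac{w_c}{w} - \gamma\left(\frac{s_c}{2w}\right)^2\right) - \left(0 - \gamma\left(\frac{s_v}{2w}\right)^2\right) \\
&= \frac{1}{w}\left(w_c^v - \gamma\frac{s_c s_v}{2w}\right),
\end{aligned} \tag{4}$$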

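where $w_c^v$ is the number of links between node $v$ and community $c$. Because the positive factor $\frac{1}{w}$ is the same for every candidate community, it can be dropped when comparing gains, and Eq. 4 can be simplified as: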

$$\Delta Q_w(c, v) = w_c^v - \gamma\frac{s_c s_v}{2w}. \tag{5}$$
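As a quick sanity check on the derivation of Eq. 4, the sketch below evaluates the gain both from its definition (the difference of three Eq. 3 terms) and from the closed form, and compares the two. All function names and the numeric values are illustrative assumptions; the check assumes node $v$ has no self-loop, as in the derivation.

```python
def q_w(w_c, s_c, w, gamma):
    """Eq. 3: weighted multi-resolution modularity of one community."""
    return (w_c - gamma * s_c ** 2 / (4.0 * w)) / w

def gain_by_definition(w_c, w_vc, s_c, s_v, w, gamma):
    """First line of Eq. 4: Q_w(c + v) - Q_w(c) - Q_w({v})."""
    return (q_w(w_c + w_vc, s_c + s_v, w, gamma)
            - q_w(w_c, s_c, w, gamma)
            - q_w(0.0, s_v, w, gamma))   # a lone node has no internal weight

def gain_closed_form(w_vc, s_c, s_v, w, gamma):
    """Last line of Eq. 4: (1/w) * (w_vc - gamma * s_c * s_v / (2*w))."""
    return (w_vc - gamma * s_c * s_v / (2.0 * w)) / w

# Arbitrary illustrative values.
args = dict(s_c=7.0, s_v=3.0, w=10.0, gamma=0.8)
by_def = gain_by_definition(w_c=3.0, w_vc=2.0, **args)
closed = gain_closed_form(w_vc=2.0, **args)
print(by_def, closed)
```

The two values agree up to floating-point rounding, and multiplying the closed form by $w$ gives the simplified gain of Eq. 5.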

For an unweighted graph, the weight of each edge is initially 1. To find multi-resolution communities, we use the multi-resolution modularity gain shown in Eq. 5 rather than the original gain of the BGLL algorithm. Furthermore, to reduce the time complexity of the algorithm, at each node partition level we bound the maximal number of iterative steps by a constant $I_{max}$.
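The following is a minimal sketch of one level of the node partition, assuming the usual BGLL-style local moving: each node is taken out of its community, the gain of Eq. 5 is evaluated for every neighbouring community, and the node joins the best one. The graph format, the tie-breaking rule (a node stays put unless another community is strictly better) and the function name are assumptions of this sketch, not details confirmed by the text.

```python
def one_level_partition(adj, gamma=1.0, i_max=30):
    """One level of greedy local moving with the gain of Eq. 5.

    `adj[u][v]` is the weight of the undirected edge {u, v}, stored in both
    directions, with no self-loops. Returns a dict node -> community label.
    """
    w = sum(sum(nbrs.values()) for nbrs in adj.values()) / 2.0   # total weight
    strength = {v: sum(nbrs.values()) for v, nbrs in adj.items()}  # s_v
    label = {v: v for v in adj}      # every node starts in its own community
    s_comm = dict(strength)          # s_c for each community label

    for _ in range(i_max):           # at most I_max sweeps over the nodes
        moved = False
        for v in adj:
            old = label[v]
            s_comm[old] -= strength[v]          # take v out of its community
            # weight of the links from v to each neighbouring community
            w_to = {}
            for u, weight in adj[v].items():
                w_to[label[u]] = w_to.get(label[u], 0.0) + weight
            best = old
            best_gain = w_to.get(old, 0.0) - gamma * s_comm[old] * strength[v] / (2.0 * w)
            for c, w_vc in w_to.items():
                g = w_vc - gamma * s_comm[c] * strength[v] / (2.0 * w)  # Eq. 5
                if g > best_gain:               # assumed rule: strict improvement
                    best, best_gain = c, g
            label[v] = best
            s_comm[best] += strength[v]
            moved = moved or best != old
        if not moved:                            # converged before I_max sweeps
            break
    return label

# Two triangles joined by the single edge 2-3 (illustrative graph).
pairs = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: {} for v in range(6)}
for a, b in pairs:
    adj[a][b] = 1.0
    adj[b][a] = 1.0
labels = one_level_partition(adj)
print(labels)
```

On this small graph the sweep converges after a few passes and separates the two triangles. The multi-level version would then collapse the communities into super-nodes and repeat.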

2) Link Partition: We divide the edges into different partitions by the node community labels, using the ideas of “community boundary link” and “community core link” motivated by Evans and Lambiotte [2]. A “boundary link” of a node partition is an edge that connects two nodes from different communities, and a “core link” of a node partition is an edge that connects two nodes from the same community. After the nodes have been partitioned into non-overlapping clusters, we decide the partition of the links $P(E)$ based on the node partition $P(V)$. We use the modularity gain to quantify how strongly an edge belongs to a particular node community label. For a “core link”, whose two nodes are in the same community, the link shares the community label of that node community. For a “boundary link”, whose two nodes are in different communities, we compute the modularity (density) gains of Eq. 5 obtained by moving each node into the community of the other one, and then choose for the link the community label with the larger modularity gain. However, there is one case for “boundary links” that may hinder link community detection: a link between two different node communities may act as a bridge with almost equal modularity gain values. We give such a bridge link a new link community label that is different from the community labels of its peers.

In more detail, for each edge $e_{i,j}$ that connects the nodes $i$ and $j$, we compute the modularity gain of each node, i.e. $\Delta Q(i, C_j)$ and $\Delta Q(j, C_i)$, by Eq. 5, and let $e_{i,j}$ take the community label $C_k$ with the larger modularity gain, where $k = \arg\max_{k \in \{i,j\}} \{\Delta Q(i, C_j), \Delta Q(j, C_i)\}$. If the modularity gain values of the two nodes in different communities are almost the same, then the edge acts as a bridge between the two communities. We use the following condition to detect this case:

$$\left|\frac{\Delta Q(i, C_j) - \Delta Q(j, C_i)}{\Delta Q(i, C_j)}\right| \le \epsilon, \tag{6}$$

Fig. 1. Link communities found by our algorithm in several simple example networks with overlapping communities: (a) Network A, (b) Network B, (c) Network C, (d) Network D.

where $\epsilon$ is an accuracy parameter. If Eq. 6 holds, we give the link a new community label instead of the labels of its peers.
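The link labeling step can be sketched as follows. Several details are assumptions of this sketch rather than statements of the paper: the link takes the label of the community into which the move has the larger gain (that is, $C_j$ when $\Delta Q(i, C_j)$ is the larger value; the argmax notation can be read either way), a zero denominator in Eq. 6 is treated as "equal only if both gains are zero", and the fresh bridge label is a tuple. The gain is recomputed by scanning neighbours for clarity instead of using the $O(1)$ lookups discussed in the complexity analysis.

```python
def nearly_equal(a, b, eps):
    """Eq. 6, with an assumed convention for a zero denominator."""
    if a == 0:
        return b == 0
    return abs((a - b) / a) <= eps

def link_partition(adj, label, gamma=1.0, eps=1e-5):
    """Label every edge of an undirected graph from a node partition.

    `adj[u][v]` is the weight of edge {u, v} (both directions stored) and
    `label[v]` is the node community of v. Returns {(u, v): link label}.
    """
    w = sum(sum(nbrs.values()) for nbrs in adj.values()) / 2.0
    strength = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    s_comm = {}
    for v, c in label.items():
        s_comm[c] = s_comm.get(c, 0.0) + strength[v]

    def gain(v, c):
        # Eq. 5 for moving v into community c (v is not a member of c here)
        w_vc = sum(wt for u, wt in adj[v].items() if label[u] == c)
        return w_vc - gamma * s_comm[c] * strength[v] / (2.0 * w)

    link_label = {}
    for i in adj:
        for j in adj[i]:
            if (j, i) in link_label:        # already handled from the other end
                continue
            ci, cj = label[i], label[j]
            if ci == cj:                    # core link: inherit the node label
                link_label[(i, j)] = ci
                continue
            g_ij = gain(i, cj)              # Delta Q(i, C_j)
            g_ji = gain(j, ci)              # Delta Q(j, C_i)
            if nearly_equal(g_ij, g_ji, eps):
                link_label[(i, j)] = ("bridge", i, j)   # new label for a bridge
            else:
                # assumed reading: take the community with the larger gain
                link_label[(i, j)] = cj if g_ij >= g_ji else ci
    return link_label

# Two triangles joined by the edge 2-3, with the obvious node partition.
pairs = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: {} for v in range(6)}
for a, b in pairs:
    adj[a][b] = 1.0
    adj[b][a] = 1.0
label = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
links = link_partition(adj, label)
print(links)
```

In this symmetric example the two gains for edge 2-3 are identical, so the edge receives its own bridge label and nodes 2 and 3 each end up in two link communities.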

C. Complexity of the Algorithm

In this part, we discuss the data structures and complexity of our algorithm. In the node partition step each node belongs to just one community, therefore we store the node-community relations in a hash table. To speed up the node partition algorithm, we also store each community in a hash table. Under reasonable assumptions, the expected time to search for, add or remove a node in a community is $O(1)$, and we can evaluate Eq. 5 in $O(1)$ time. Each iteration therefore takes $O(|V| + |E|)$ time. The total time complexity of a one-level partition for $G = (V, E)$ is $O(I_{max}(|V| + |E|))$, where $|V|$ is the number of nodes, $|E|$ is the number of edges and $I_{max}$ is the number of iterations in one level of node partition. The space complexity of the first partition is $O(2|V| + 2|E|)$, as we have to store the communities, the graph and the weights of the edges. After a node partition is identified in this way, communities are replaced by super-nodes, yielding a smaller weighted network. This process takes just $O(|E| + |V|)$ time. Suppose there are $l$ levels of node partitions. It is easy to show that in the refinement stage the time complexity is also $O(I_{max}(|V| + |E|))$ and the space complexity is $O(2|V| + 2|E|)$. As the time complexity of Eq. 5 is $O(1)$, the link partition can be done in $O(|E|)$ time, and the time complexity of the whole algorithm is $O(l \times I_{max}(|V| + |E|) + |E|)$. As we only need to maintain two neighboring levels of node partitions, the space complexity of the algorithm is $O(3|V| + 2|E|)$.
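The super-node replacement step mentioned above can be done in a single pass over the adjacency lists, which is where the $O(|E| + |V|)$ bound comes from. The sketch below is one possible way to write it; the self-loop convention (internal weight is stored on the diagonal and counted once per endpoint, so it equals twice the internal edge weight) is an assumption of this sketch.

```python
def aggregate(adj, label):
    """Collapse each node community into a super-node.

    `adj[u][v]` is the weight of edge {u, v}, stored in both directions.
    Returns `super_adj[c][d]`, the total weight between communities c and d.
    `super_adj[c][c]` holds twice the internal weight of c because every
    internal edge is seen once from each endpoint (assumed convention).
    """
    super_adj = {c: {} for c in set(label.values())}
    for u, nbrs in adj.items():              # one pass over nodes ...
        cu = label[u]
        for v, weight in nbrs.items():       # ... and over their edges
            cv = label[v]
            super_adj[cu][cv] = super_adj[cu].get(cv, 0.0) + weight
    return super_adj

# Two triangles joined by the edge 2-3, already partitioned into A and B.
pairs = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {v: {} for v in range(6)}
for a, b in pairs:
    adj[a][b] = 1.0
    adj[b][a] = 1.0
label = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
coarse = aggregate(adj, label)
print(coarse)
```

Each dictionary update is expected $O(1)$ with a hash table, matching the assumption made in the analysis.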

Our algorithm has several advantages. First, its time complexity is near linear and its space complexity is linear, so it is very suitable for detecting communities in very large networks. Second, the resolution parameter $\gamma$ enables us to span several community scales, from very small to very large communities, which avoids the well-known resolution limit problem; we can also use this parameter to change the number of communities in the partitions.

IV. TESTING AND EXPERIMENTS

This section contains many examples of link communities in various networks, all intended to illustrate that our link partition algorithm finds meaningful and relevant link community structure. Using data-driven performance measures, we compare link clustering to existing, popular community detection methods, both overlapping and non-overlapping, such as the ABL algorithm [1], the GN algorithm [7], the CNM algorithm [21], the BGLL algorithm [9], etc. In particular, we compare our link partition algorithm on these networks with the recently proposed and probably most famous link community detection algorithm, the ABL algorithm. All the results show that our algorithm is very accurate and fast. We implement all of these algorithms in the network analysis framework JSNVA [14] on the Java platform: Java 6.0, Java HotSpot Server VM with a 1.3G heap size. The experiments are performed on an ordinary PC (CPU = Intel Core2 Duo 2.66GHz, L2 Cache = 3072kB, RAM = 3G) running the Windows XP operating system. In the following experiments, unless we mention the value of $\gamma$ and the number of node partition levels, we set $\gamma = 1.0$, in accordance with the GN modularity, and use a multi-level node partition in the node partition step. We also set $I_{max} = 30$ and $\epsilon = 0.00001$ in our algorithm.

A. Measures

There is no widely accepted alternative measure for use with overlapping communities. To provide a fair evaluation of all the community detection algorithms we have tested, we study several distinct community partition quality metrics. For a network $G = (V, E)$ with $m = |E|$ edges, $S_E = \{P_1, P_2, \cdots, P_{|C|}\}$ is a partition of the edges into a community set $C$ that contains $|C|$ communities, and $S_N = \{N_1, N_2, \cdots, N_{|C|}\}$ is the set of corresponding overlapping node clusters in these link communities. Suppose a link community $P_c$ has $m_c = |P_c|$ links and $n_c = |N_c| = |\cup_{e_{i,j} \in P_c} \{i, j\}|$ nodes. We use the following widely used community partition quality functions:

∙ Partition Density: to show the quality of a link community $c$, Ahn et al. [1] define its density $D_c$ as $D_c = \frac{m_c - (n_c - 1)}{n_c(n_c - 1)/2 - (n_c - 1)}$. To show the quality of the whole link partition, they also define the partition density $D_w = \sum_c \frac{m_c}{m} D_c = \frac{2}{m} \sum_c m_c \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)}$, which is the average of $D_c$ weighted by the fraction of links present in each community. However, the bigger the size $m_c$ of a link community is, the larger the contribution of its density $D_c$ to the weighted partition density $D_w$.

∙ Mean Partition Density: to show the average quality of all the link communities, we define the unweighted partition density $D = \frac{\sum_c D_c}{|C|}$, which does not consider the size $m_c$ of each link community $P_c$.

∙ Node Overlapping Fraction: the node overlapping fraction $f_{ov}$ is the sum of the community sizes divided by the number of all nodes in these communities [6]. As community overlap can present serious problems for interpreting the structure of a network [15], [16], a small value of $f_{ov}$ suggests that the overlapping clustering is a good one.

∙ Vertex Average Degree: the vertex average degree (vad) of the node community set $S_N$ is defined as $vad(S_N) = \frac{2 \sum_{c \in S_N} m_c}{\sum_{c \in S_N} n_c}$ [6].

∙ Community Fitness: $f = \frac{1}{|C|} \sum_{i=1}^{|C|} f(N_i) = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{k_{in}^{N_i}}{k_{in}^{N_i} + k_{out}^{N_i}}$ is the average value of the fitness of its communities, where $k_{in}^{N_i}$ and $k_{out}^{N_i}$ are the total internal and external degrees of the node community $N_i$ [3].

∙ LAV Modularity: this non-fuzzy modularity metric $M_{ov} = \frac{1}{|C|} \sum_{i=1}^{|C|} M_{ov}^{N_i}$ is proposed by Lazar et al. [17] to measure the quality of an overlapping partition, where $M_{ov}^{N_i} = \frac{1}{|N_i|} \sum_{i \in N_i} \frac{\sum_{j \in N_i, i \ne j} A_{i,j} - \sum_{j \notin N_i} A_{i,j}}{d_i \times s_i} \cdot \frac{2 m_i}{|N_i| \times (|N_i| - 1)}$. The density of a community is straightforwardly interpreted as $\frac{2 m_i}{|N_i| \times (|N_i| - 1)}$.

Intuitively, good and compact communities should have small Node Overlapping Fraction scores and large values of the other metrics.
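To make the first four measures above concrete, here is a minimal sketch that computes $D_w$, $D$, $f_{ov}$ and $vad$ from a link partition given as a list of edge lists. The input format and the function name are assumptions. Because the denominator of $D_c$ vanishes for a single-link community ($n_c = 2$), the sketch sets $D_c = 0$ in that case; this convention is needed to avoid a division by zero and is treated here as an assumption.

```python
def link_community_metrics(link_communities):
    """Compute D_w, D, f_ov and vad for a link partition.

    `link_communities` is a list of edge lists, one list of (u, v) pairs
    per link community (hypothetical input format, no self-loops).
    """
    m = sum(len(edges) for edges in link_communities)   # all links
    densities = []
    weighted = 0.0
    size_sum = 0             # sum of n_c over all communities
    all_nodes = set()
    for edges in link_communities:
        nodes = {x for e in edges for x in e}            # node cluster N_c
        m_c, n_c = len(edges), len(nodes)
        if n_c <= 2:
            d_c = 0.0        # assumed convention for single-link communities
        else:
            d_c = (m_c - (n_c - 1)) / (n_c * (n_c - 1) / 2.0 - (n_c - 1))
        densities.append(d_c)
        weighted += m_c / m * d_c
        size_sum += n_c
        all_nodes |= nodes
    return {
        "D_w": weighted,                             # weighted partition density
        "D": sum(densities) / len(densities),        # mean partition density
        "f_ov": size_sum / len(all_nodes),           # node overlapping fraction
        "vad": 2.0 * m / size_sum,                   # vertex average degree
    }

# Two triangles that share node 2 (illustrative example).
partition = [[(0, 1), (0, 2), (1, 2)], [(2, 3), (2, 4), (3, 4)]]
metrics = link_community_metrics(partition)
print(metrics)
```

In this example both communities are cliques, so both densities are 1, and the single shared node gives an overlapping fraction of 6/5.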

B. Benchmark Networks

Currently, in almost every paper that proposes a new node partition community algorithm, the algorithm is tested on computer-generated networks with an underlying built-in community structure, because the structure of such networks is simple and clearly defined. However, since the community structure in these benchmark graphs reflects the conceptual model of communities held by the graph models, there is no guarantee that the results can be extrapolated to real networks [1]. For example, every existing benchmark graph rests on the principle that a community should have more intra-community links than outgoing links, while highly overlapping communities can have many more external than internal edges [1], as shown in Fig. 1. To avoid requiring hidden “ground truth” link communities, neither Ahn, Bagrow and Lehmann [1] nor Evans and Lambiotte [2] use these well-known benchmark networks to evaluate their link partition algorithms. However, by defining the role of each node in the overlapping communities, we find it is still possible to compare current link community detection algorithms with existing, widely used node community detection algorithms. Belonging coefficients describe how a given vertex $i$ is distributed between overlapping communities $C$, where $\sum_{c \in C} f_{i,c} = \sum_c A_{i,c} = 1$ [18]. For the nodes in link communities, we calculate the belonging factors of each node $i$ and assign the node to its major role community $c$ with the maximal belonging factor $f_{i,c}$, that is, $c = \arg\max_c f_{i,c} = \arg\max_c \sum_{j \in c} A_{i,j}$. Using this method, we can find out whether the link community detection algorithms reliably detect the known structures in different widely used graph-generating benchmark models.

Fig. 2. Performance of community detection algorithms in the benchmark graphs by using the normalized mutual information (NMI) metric. (a) GN benchmarks: normalized mutual information versus external degree $k_{out}$, for GN, CNM, LPA, BGLL, ABL and Ours. (b) LFR benchmarks: normalized mutual information versus mixing parameter $\mu$, for CNM, LPA, BGLL, ABL and Ours.
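The major-role assignment used in this subsection can be sketched as below: every node is assigned to the cluster $c$ that maximises $\sum_{j \in c} A_{i,j}$, i.e. the cluster containing most of its neighbours. Restricting the candidates to clusters that contain the node, and breaking ties in favour of the first cluster encountered, are assumptions of this sketch; the names and input formats are illustrative.

```python
def major_role(adj, node_clusters):
    """Assign each node to one cluster by the maximal belonging factor.

    `adj` maps a node to the set of its neighbours; `node_clusters` maps a
    cluster label to its (possibly overlapping) set of nodes.
    """
    best = {}                                   # node -> (score, cluster)
    for c, members in node_clusters.items():
        for i in members:
            score = len(adj[i] & members)       # sum over j in c of A[i][j]
            if i not in best or score > best[i][0]:
                best[i] = (score, c)
    return {i: c for i, (score, c) in best.items()}

# Triangle {0, 1, 2} plus the edge 2-3; node 2 lies in both clusters.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
clusters = {"A": {0, 1, 2}, "B": {2, 3}}
roles = major_role(adj, clusters)
print(roles)
```

Node 2 has two neighbours in cluster A and one in cluster B, so it is assigned to A, which turns the overlapping cover into a plain node partition that can be compared with the benchmark ground truth.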

In the following part, we use NMI to establish how “similar” the partitions extracted by different algorithms are to the underlying partitions in the benchmark graphs. In the experiments on the artificial networks, we test our algorithm using one-level optimization and multi-level optimization, respectively. In Fig. 2, each point is an average over 40 realizations of the networks. First, we present a number of tests of the community detection algorithms on the widely used GN benchmark graphs. Fig. 2(a) shows the performance of the community detection algorithms on the GN benchmarks. As shown in Fig. 2(a), our algorithm performs perfectly even when the communities are fuzzier ($k_{out} = 8$). The worst algorithm is the LPA algorithm [19], which starts to fail quickly for low values of $k_{out}$. Our link community algorithm works slightly worse than the BGLL algorithm [9] on these homogeneous benchmark networks, while it works much better than the others. Next, we use the heterogeneous LFR benchmarks to test our algorithm. As shown in Fig. 2(b), the algorithm performs nearly perfectly even when the communities are fuzzier ($\mu = 0.5$). The MCL algorithm [20], the LPA algorithm, the CNM algorithm [21] and the GN algorithm [7] do not perform impressively on the LFR benchmark graphs. Interestingly, our link community algorithm works better than the ABL algorithm, and the ABL algorithm is better than the BGLL algorithm in most cases on the heterogeneous benchmark graphs.
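For readers unfamiliar with the NMI score used in these comparisons, a minimal sketch follows. The text does not state which normalisation is used, so this sketch assumes the common form $2I(X;Y)/(H(X)+H(Y))$; the function name and the dict-based input format are also assumptions.

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same nodes.

    Each argument maps node -> community label. Uses the assumed
    normalisation 2 * I(A; B) / (H(A) + H(B)).
    """
    nodes = list(labels_a)
    n = len(nodes)
    count_a = Counter(labels_a[v] for v in nodes)
    count_b = Counter(labels_b[v] for v in nodes)
    joint = Counter((labels_a[v], labels_b[v]) for v in nodes)
    # mutual information from the joint and marginal label frequencies
    mutual = sum(n_ab / n * log(n_ab * n / (count_a[a] * count_b[b]))
                 for (a, b), n_ab in joint.items())
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:       # both partitions put every node in one community
        return 1.0
    return 2.0 * mutual / (h_a + h_b)

truth = {0: "x", 1: "x", 2: "y", 3: "y"}
same_up_to_names = {0: 1, 1: 1, 2: 2, 3: 2}
unrelated = {0: 1, 1: 2, 2: 1, 3: 2}
print(nmi(truth, same_up_to_names), nmi(truth, unrelated))
```

The score depends only on how nodes are grouped, not on the label names, so a detected partition can be compared directly with the planted one.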

C. Artificial Networks

We also test our algorithm on other artificial networks. As a first check, we apply our link community detection algorithm to the toy networks in Fig. 1, which have also been tested by Ahn et al. [1]. The result shows that our algorithm also performs well on these toy networks. As shown in Fig. 3, we also use our algorithm to find communities in the 20 × 20 two-dimensional lattice network. The red nodes are the overlapping nodes that belong to multiple link communities. Using our multi-level algorithm, the 20 × 20 regular lattice is divided into several communities of similar size with modularity $Q = 0.766$, as shown in Fig. 3(a). Interestingly, our algorithm makes it possible to obtain a coarse-grained overview of the lattice network at the community level. When we use the ABL algorithm to find the communities in the same network, it assigns most of the links to single-link communities, as shown in Fig. 3(b). The CNM algorithm divides the 20 × 20 regular lattice into 9 communities, with the largest one containing 62 nodes and the smallest one containing only 8 nodes. Therefore, it is hard to recover the lattice structure of the original network from the communities found by the ABL algorithm and the CNM algorithm.

Fig. 3. Link communities in regular networks found by our algorithm and the ABL algorithm: (a) 20 × 20 lattice (Ours), (b) 20 × 20 lattice (ABL).

D. Small Real Social Networks

Next, we study a couple of social networks for which explicit knowledge about their communities is available. These real-world networks are interesting because of their known communities. These examples of typical real-world networks illustrate the advantage of our algorithm in producing solutions at multiple levels. In the following studies, the link colors indicate the link clustering, and the red square nodes indicate the overlapping nodes.

1) Zachary’s karate club: First, we investigate the classical social network of “Zachary’s karate club” [7]. The two smaller communities with the administrator and the teacher as the central persons are further divided into smaller ones. Zachary’s karate club has become a benchmark for all community detection algorithms. Today it is well accepted that the best partition of the karate club network in terms of modularity consists of 4 communities with a value of $Q = 0.419$. The link partitions obtained by multi-level clustering are shown in Fig. 4(a), and the modularity score $Q$ of the node partition is 0.419, which is the highest modularity score for node partitions found in this network so far.


TABLE I
COMPARISON OF OUR MULTI-LEVEL LINK PARTITION ALGORITHM (l) AND THE ABL ALGORITHM (a) IN REAL-WORLD NETWORKS.

Network    Size |V|  D_w(l)  D(l)   f_ov(l)  vad(l)  f(l)   M_ov(l)  D_w(a)  D(a)   f_ov(a)  vad(a)  f(a)   M_ov(a)
Karate     34        0.204   0.259  1.674    3.192   0.594  0.224    0.226   0.063  2.618    2.044   0.258  -0.0214
Lesmis     77        0.423   0.355  1.545    5.310   0.453  0.318    0.479   0.060  3.030    3.012   0.058  0.079
Football   115       0.183   0.145  2.365    4.346   0.384  0.029    0.542   0.067  4.191    3.021   0.171  -0.146
Dolphin    62        0.151   0.129  1.661    3.436   0.527  0.127    0.298   0.074  3.210    1.869   0.249  -0.027
Pol-books  105       0.176   0.161  1.628    6.000   0.488  0.117    0.287   0.008  3.848    3.049   0.199  -0.102

TABLE II
COMPARISON OF OUR ONE-LEVEL LINK PARTITION ALGORITHM (l1) AND OUR MULTI-LEVEL ONE (lm) IN REAL-WORLD NETWORKS.

Network    D_w(l1)  D(l1)  f_ov(l1)  vad(l1)  f(l1)  M_ov(l1)  D_w(lm)  D(lm)  f_ov(lm)  vad(lm)  f(lm)  M_ov(lm)
Karate     0.204    0.259  1.674     1.596    0.594  0.224     0.193    0.243  1.353     1.869    0.669  0.255
Lesmis     0.423    0.355  1.545     2.655    0.453  0.318     0.297    0.208  1.325     2.853    0.554  0.260
Football   0.183    0.145  2.365     2.173    0.384  0.029     0.149    0.146  2.130     2.980    0.566  0.037
Dolphin    0.151    0.129  1.661     1.718    0.527  0.127     0.102    0.107  1.452     1.955    0.688  0.112
Pol-books  0.176    0.161  1.628     3.000    0.488  0.117     0.136    0.121  1.314     3.398    0.643  0.110

Fig. 4. Link communities in some typical small real-world networks: (a) $\gamma = 1.0$, (b) $\gamma = 0.5$, (c) Dolphin ($\gamma = 1.0$), (d) Dolphin ($\gamma = 0.5$).

To show a larger link partition, as shown in Fig. 4(b), we obtain 2 node partition communities with the modularity score $Q = 0.371$ by setting $\gamma = 0.5$. We convert the link communities into two node communities based on the belonging coefficients of the nodes, and the resulting node partition corresponds exactly to the communities of the administrator and the teacher, respectively.

2) Bottlenose dolphin network: Furthermore, as shown in Fig. 4(c) and Fig. 4(d), we explore the link community structures of the bottlenose dolphin network [10]. It is a social network of a community of 62 bottlenose dolphins living in Doubtful Sound, New Zealand. The dolphin community split into two as a result of the departure of a keystone individual, SN100. The biologist David Lusseau reports that, over a two-year observation period, the dolphins separated into two large communities by age. By setting $\gamma = 1.0$, the multi-level node partition divides the network into 5 clusters with the modularity score $Q = 0.528$, which is the highest modularity score for node partitions found in this network so far [10]. To show a larger node partition, the best cover in two clusters that we found ($\gamma = 0.5$) roughly agrees with the separation observed by Lusseau. After assigning each node to just one community with the maximal belonging factor, we obtain the modularity value $Q = 0.385$ and the normalized mutual information $NMI = 0.888$. The $NMI$ values of the partitions found by the BGLL, CNM and GN algorithms are 0.474, 0.573 and 0.554, respectively. As shown in Fig. 4(d), we divide the links into two communities based on the node partition. The nodes PL, Oscar, SN9 and SN100 are overlapping nodes, and the bridging overlapping node SN100, whose departure caused the separation of the network, has also been found by our link partition algorithm.

We also apply our algorithm to the college football network, which represents the schedule of games between American college football teams in the 2000 season provided by Newman [7], a network of books about politics provided by Newman [22], and the co-appearance network of characters in the novel Les Miserables, which has also been studied by Ahn et al. [1]. As shown in Table I, we compare the partitions found by our algorithm with the results of the ABL algorithm. The result shows that our algorithm performs better on all the partition quality metrics except the partition density $D_w$ proposed by Ahn et al. [1]. As shown in Table II, we also compare the partitions found by our one-level algorithm with those of the multi-level one. In general, our one-level algorithm performs better on the partition density metrics, while our multi-level one performs better on the other metrics.

E. Efficiency Experiments

To study the qualities of communities in real-world networks, we run detailed experiments on a number of real-world network data sets, shown in Table III, provided by Newman¹, Arenas² and Leskovec³. These networks include the network of US political blogs (Pol-blogs), an e-mail network (Email), the co-authorship network of condensed matter collaborations between Jan 1, 1995 and March 31, 2005 (Cond-Mat), the astrophysics co-authorship network (Astro-ph), the Western States Power Grid of the USA (Power), the high-energy co-authorship network (Hep-th), the network science co-authorship network (Net-sci), a snapshot of autonomous systems (As06), the network of users of the Pretty-Good-Privacy algorithm (PGP), the collaboration network of the General Relativity category (CA-GrQc), the Enron email network (Enron), and the Amazon product co-purchasing network of March 02, 2003 (Amazon0302). To compare our algorithm with current fast link community partition algorithms, we run the CNM algorithm, the ABL algorithm and our one-level link partition algorithm on the networks in Table III, and use the letters 𝑐, 𝑎 and 𝑙, respectively, to label their metrics. The table shows the source of each network, its size, and the time each algorithm takes to generate a solution. Note that since these community detection algorithms are heuristic, different runs could in principle yield different partitions. We have performed 10 runs of each algorithm on each network and report the average scores. We find that our community detection algorithm is extremely fast on massive networks, and it is the fastest of the three algorithms: on Amazon0302, for example, its running time is 𝑇 = 65 against 252 for ABL and 16059 for CNM. We also note that the mean partition density (D) and the vertex average degree (vad) of the link communities found by our algorithm are larger than those found by the ABL algorithm in most cases.

¹ http://www-personal.umich.edu/~mejn/netdata/
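For reference, the partition density of a link partition can be computed in one pass over the link communities. The sketch below follows the definition of Ahn et al. [1], D = (2/M) Σ_c m_c (m_c − (n_c − 1)) / ((n_c − 2)(n_c − 1)), where m_c and n_c are the numbers of links and nodes in community c and M is the total number of links; it is an assumption that this matches in every detail the mean partition density D reported in Table III. The function name is illustrative.

```python
def partition_density(link_communities):
    """Partition density of a link partition (definition of Ahn et al. [1]).

    link_communities: iterable of collections of (u, v) links, one
    collection per link community. Communities with two or fewer nodes
    contribute zero, as in the original definition.
    """
    total_links = 0
    weighted_sum = 0.0
    for links in link_communities:
        links = list(links)
        m = len(links)
        n = len({node for link in links for node in link})
        total_links += m
        if n > 2:
            # m * (m - (n - 1)) / ((n - 2) * (n - 1)):
            # 0 for a tree, m / 2 for a clique.
            weighted_sum += m * (m - (n - 1)) / ((n - 2) * (n - 1))
    if total_links == 0:
        return 0.0
    return 2.0 * weighted_sum / total_links
```

A partition made only of cliques has density 1, and one made only of trees has density 0.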

F. Applications

To see how meaningful link communities can be, we apply our method to a word association network built on the University of South Florida Free Association Norms, which has been analyzed with other overlapping community detection algorithms [23], [1]. In the original data, the weight of a directed link from one word to another indicates the frequency with which the people in the survey associated the end point of the link with its start point. Since most community detection algorithms deal with undirected networks, these directed links have been replaced by undirected ones with a weight equal to the sum of the weights of the corresponding two original links. Furthermore, a weight threshold of 𝑤 = 0.025 was applied to the resulting network by deleting links weaker than 𝑤, following the method used in other papers [23], [1]. The remaining word association network contains 31784 links between 7207 nodes. Fig. 5 shows the link communities around the word ‘Newton’ in the network of commonly associated English words. We get 4 large link communities, which capture concepts related to ‘Science, Physics, Math’, ‘Smart, Think’, ‘Law, Laws, Prove’ and ‘Apple, Food’. We also get a single link community containing ‘Newton’ and ‘Fig’, embedded in the community of ‘Apple, Food’. The words ‘Newton’, ‘Einstein’ and ‘Science’ all belong to both the ‘Smart, Think’ and ‘Science, Physics, Math’ communities, illustrating that link communities capture multiple relationships between nodes.

² http://deim.urv.cat/~aarenas/data/welcome.htm
³ http://snap.stanford.edu

Fig. 5. Link communities from the full word association network around the word ‘Newton’.
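The preprocessing of the word association data described above (summing the two directed weights of each word pair, then deleting links weaker than the threshold) can be sketched as follows. The function name, the dictionary-based input format and the toy weights are assumptions made for illustration; they are not taken from the original data set.

```python
from collections import defaultdict


def symmetrize_and_threshold(directed_weights, w_min=0.025):
    """Turn weighted directed links into thresholded undirected links.

    directed_weights: dict mapping (source, target) to a weight.
    Returns a dict mapping (u, v) with u <= v to the summed weight of
    the two directions, keeping only links with weight >= w_min.
    """
    undirected = defaultdict(float)
    for (u, v), weight in directed_weights.items():
        # Store each unordered pair under one canonical key.
        key = (u, v) if u <= v else (v, u)
        undirected[key] += weight
    # Delete links weaker than the threshold.
    return {edge: w for edge, w in undirected.items() if w >= w_min}


# Toy example with made-up weights.
toy = {
    ("newton", "apple"): 0.02,
    ("apple", "newton"): 0.01,
    ("newton", "fig"): 0.01,
}
kept = symmetrize_and_threshold(toy)
```

In the toy example the two directions between the first pair of words sum to about 0.03 and survive the threshold, while the remaining link (0.01) is deleted.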

V. DISCUSSION

For an undirected network $G = (V, E)$, the ABL link partition algorithm [1] first computes the similarity scores between all adjacent link pairs at each node. In this first step, the time complexity of the link similarity computation is $O(|V| \times k_{\max}^{3})$, where $k_{\max}$ is the maximal degree in graph $G$. Since the source code of the ABL algorithm⁴ is optimized for size and speed, it can neither store the full dendrogram of the hierarchical clustering nor compute the theoretical value of the maximum partition density. After that, the ABL algorithm clusters the edges using a given similarity threshold, so it takes $O(|V| \times k_{\max}^{2})$ time to form the link partition for that threshold. However, to estimate the maximum partition density, users have to provide many thresholds and run the algorithm several times. In addition, it takes $O(|V| \times k_{\max}^{2})$ space to store these link pairs, so it is hard to use this algorithm on massive networks.
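To make the space bound concrete: a node of degree k contributes k(k − 1)/2 pairs of adjacent links, so the number of stored pairs is Σ_v k_v(k_v − 1)/2, which is at most |V| × k_max². The sketch below counts these pairs for a simple undirected graph given as an edge list; the function name is illustrative, and the code only counts pairs rather than reproducing the ABL implementation.

```python
from collections import Counter


def adjacent_link_pair_stats(edges):
    """Count adjacent link pairs of a simple undirected graph.

    edges: iterable of (u, v) pairs without duplicates or self-loops.
    Returns (pairs, bound), where pairs is the exact number of link
    pairs sharing a node and bound is |V| * k_max ** 2.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Each node of degree k is shared by k * (k - 1) / 2 link pairs.
    pairs = sum(k * (k - 1) // 2 for k in degree.values())
    k_max = max(degree.values(), default=0)
    bound = len(degree) * k_max ** 2
    return pairs, bound
```

A single hub dominates the count: a star with four leaves already has six adjacent link pairs, all of them at the center.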

Compared with the ABL algorithm, our algorithm has several advantages: first, its time complexity is near linear and its space complexity is linear; second, it is a multi-scale algorithm, so one can probe the communities at different scales; third, it has the potential to convert any non-overlapping node community detection algorithm into an overlapping community detection algorithm. We would like to emphasize that our method provides a general framework that yields a large class of algorithms; e.g., one could choose a different node partition algorithm to divide the nodes and then derive the link partition.
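The framework idea can be illustrated by a single pass over the links that turns any node partition into a link partition. This is only a sketch under stated assumptions: the rule for a link whose end points lie in different node communities is not restated in this section, so it is left to a caller-supplied function `pick_label`, and its default (the smaller of the two labels) is a placeholder rather than the rule used by our algorithm. All names are hypothetical.

```python
from collections import defaultdict


def link_communities_from_node_partition(edges, node_label, pick_label=min):
    """Label every link using the labels of its two end points.

    edges: iterable of (u, v) links.
    node_label: dict mapping each node to its node-community label.
    pick_label: hypothetical rule choosing the label of a link that
        joins two different node communities (placeholder default).
    Returns (link_label, membership); membership maps each node to the
    set of link communities it takes part in.
    """
    link_label = {}
    membership = defaultdict(set)
    for u, v in edges:
        cu, cv = node_label[u], node_label[v]
        # Links inside one node community inherit its label.
        label = cu if cu == cv else pick_label(cu, cv)
        link_label[(u, v)] = label
        membership[u].add(label)
        membership[v].add(label)
    return link_label, dict(membership)


# Toy path a-b-c-d with node communities {a, b} and {c, d}.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
toy_labels = {"a": 0, "b": 0, "c": 1, "d": 1}
toy_links, toy_membership = link_communities_from_node_partition(
    toy_edges, toy_labels
)
```

Given the node labels, the pass visits each link once, and a node whose links carry more than one label becomes an overlapping node; in the toy path this is node c.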

⁴ http://barabasilab.neu.edu/projects/linkcommunities/


TABLE III
EXPERIMENT COMPARISONS OF THE CNM (𝑐), ABL (𝑎) AND OUR ONE-LEVEL LINK PARTITION ALGORITHM (𝑙).

Network      |V|      |E|      D(c)    vad(c)  T(c)    D(a)    vad(a)  T(a)   D(l)    vad(l)  T(l)
Email        1133     5451     0.042   3.686   0       0.010   0.785   2      0.028   2.639   0
Net-sci      1461     2742     0.447   1.855   0       0.404   1.546   0      0.511   1.692   0
Power        4941     6594     0.006   1.288   0       0.047   0.658   1      0.027   0.620   0
CA-GrQc      4158     13422    0.230   2.886   2       0.082   1.240   7      0.183   1.835   0
Hep-th       7610     15751    0.221   1.868   4       0.161   1.064   3      0.211   1.338   1
PGP          10680    24316    0.093   2.132   8       0.063   1.285   11     0.085   1.495   1
Astro-ph     16046    121251   0.424   6.304   139     0.081   2.882   141    0.215   4.372   5
Cond-Mat     39577    175693   0.390   3.789   662     0.064   1.417   122    0.185   2.449   11
As06         22963    48436    0.029   1.655   121     0.001   0.602   880    0.027   1.473   6
Enron        36692    183831   0.321   4.091   680     0.024   0.969   1060   0.347   5.259   20
Amazon0302   262111   899792   0.466   3.207   16059   0.058   1.171   252    0.072   1.410   65

VI. CONCLUSION

The massive size of real-world networks makes the time complexity of community detection algorithms an essential issue. The link partition approach allows communities to overlap at nodes, so that a node may belong to several communities. Based on the edge labeling idea, we propose a novel link community detection algorithm whose time complexity is near linear and whose space complexity is linear. In this way, any algorithm that produces a node partition can be used to obtain a link community partition. We use the edge labeling concept mainly for its simplicity and efficiency, which enable us to apply link clustering to very large-scale networks. The link partition approach enables each node to be included in more than one module, leading to a natural description of overlapping communities. Finally, by tuning the resolution parameter 𝛾, one can probe the network at different scales and explore the possible hierarchical levels of community structure. To evaluate the effectiveness of our algorithm, we conduct extensive experiments. The results show that our algorithm is very efficient and can enhance our ability to explore massive networks interactively. We believe that our algorithm represents a good tradeoff between accuracy and speed for detecting link partitions in massive real-world networks.

ACKNOWLEDGMENT

We thank Mark Newman, Alex Arenas and Jure Leskovec for providing the network data sets. This work is supported by the National Science Foundation of China (Grant No. 90924029, 60905025, 61074128).

REFERENCES

[1] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, “Link communities reveal multiscale complexity in networks,” Nature, vol. 466, no. 7307, pp. 761–764, June 2010.

[2] T. S. Evans and R. Lambiotte, “Line graphs, link partitions, and overlapping communities,” Phys. Rev. E, vol. 80, no. 1, p. 016105, Jul 2009.

[3] A. Lancichinetti, S. Fortunato, and J. Kertesz, “Detecting the overlapping and hierarchical community structure in complex networks,” New Journal of Physics, vol. 11, no. 3, p. 033015, March 2009.

[4] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, pp. 814–817, 2005.

[5] J. M. Kumpula, M. Kivela, K. Kaski, and J. Saramaki, “Sequential algorithm for fast clique percolation,” Phys. Rev. E, vol. 78, no. 2, p. 026109, Aug 2008.

[6] S. Gregory, “An algorithm to find overlapping community structure in networks,” in Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, September 2007, pp. 91–102.

[7] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821–7826, June 2002.

[8] A. Lancichinetti and S. Fortunato, “Community detection algorithms: A comparative analysis,” Phys. Rev. E, vol. 80, no. 5, p. 056117, Nov 2009.

[9] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” J. Stat. Mech., p. 10008, 9 October 2008.

[10] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, p. 026113, 2004.

[11] J. M. Kumpula, J. Saramaki, K. Kaski, and J. Kertesz, “Limited resolution in complex network community detection with Potts model approach,” The European Physical Journal B, vol. 56, no. 1, pp. 41–45, March 2007.

[12] J. Reichardt and S. Bornholdt, “Statistical mechanics of community detection,” Phys. Rev. E, vol. 74, no. 1, p. 016110, Jul 2006.

[13] M. E. J. Newman, “Analysis of weighted networks,” Physical Review E, vol. 70, p. 056131, 2004.

[14] Q. Ye, B. Wu, L. Suo, et al., “Telecomvis: Exploring temporal communities in telecom networks,” in ECML PKDD, Slovenia, Bled, 2009, pp. 755–758.

[15] M. G. Everett and S. P. Borgatti, “Analyzing clique overlap,” Connections, vol. 21, no. 1, pp. 49–61, 1998.

[16] G. Palla, A.-L. Barabasi, and T. Vicsek, “Quantifying social group evolution,” Nature, vol. 446, no. 7136, pp. 664–667, April 2007.

[17] A. Lazar, D. Abel, and T. Vicsek, “Modularity measure of networks with overlapping communities,” Europhysics Letters, vol. 90, no. 1, p. 18001, 2010.

[18] S. Gregory, “Fuzzy overlapping communities in networks,” arXiv e-prints, arXiv:1010.1523, Oct. 2010.

[19] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E, vol. 76, no. 3, p. 036106, 2007.

[20] S. van Dongen, “Graph clustering by flow simulation,” Ph.D. dissertation, University of Utrecht, May 2000.

[21] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, p. 066111, December 2004.

[22] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Phys. Rev. E, vol. 74, p. 036104, 2006.

[23] B. Adamcsek, G. Palla, et al., “Cfinder: locating cliques and overlapping modules in biological networks,” Bioinformatics, vol. 22, no. 8, pp. 1021–1023, April 2006.
