
Community Detection in Large-Scale Social Networks ∗

Nan Du
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts and Telecommunications, China
[email protected]

Bin Wu
Beijing Key Laboratory of
[email protected]

Xin Pei
Beijing Key Laboratory of
[email protected]

Wang
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts and Telecommunications, China
[email protected]

Liutong Xu
Beijing Key Laboratory of
Beijing University of Posts and Telecommunications, China
[email protected]

ABSTRACT

In recent years the WWW has become a flourishing social medium that enables individuals to share opinions, experiences and expertise at the push of a single button. With the pervasive use of instant messaging systems and the fundamental shift in the ease of publishing content, social network researchers and graph theory researchers are now concerned with inferring community structures by analyzing the linkage patterns among individuals and web pages. Although the investigation of community structures has motivated many diverse algorithms, most of them are unsuitable for large-scale social networks because of their computational cost. Moreover, in addition to identifying the possible community structures, how to define and explain the discovered communities is also significant in many practical scenarios.

In this paper, we present the algorithm ComTector (Community DeTector), which is more efficient for community detection in large-scale social networks because it exploits the overlapping nature of communities in the real world. The algorithm does not require any a priori knowledge about the number or the original division of the communities. Because real networks are often large sparse graphs, its running time is $O(C \times Tri^2)$ in the worst case, where $C$ is the number of detected communities and $Tri$ is the number of triangles in the given network. Then we propose a general naming method by combining the topological information

∗This work is supported by the National Science Foundation of China under grant number 60402011, and the National Science and Technology Support Program of China under Grant No. 2006BAH03B05.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Joint 9th WEBKDD and 1st SNA-KDD Workshop '07 (WebKDD/SNA-KDD'07), August 12, 2007, San Jose, California, USA.
Copyright 2007 ACM 978-1-59593-848-0 ...$5.00.

with the entity attributes to define the discovered communities. With respect to practical applications, ComTector is tested on several real-life networks including the Zachary Karate Club, American College Football, Scientific Collaboration, and Telecommunications Call networks. Experimental results show that the algorithm can extract meaningful communities that agree with both the objective facts and our intuitions.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining

General Terms
Algorithms, Theory, Performance

Keywords
Social Network Analysis, Community Detection, Graph-based Data Mining

1. INTRODUCTION

In recent years, easy connections brought about by cheap devices, modular content, and shared computing resources have had a profound impact on our social structures. People now increasingly take the information they need from one another rather than from institutional sources like corporations, media outlets, religions, and political bodies[3]. Perhaps the most outstanding success of this kind of communication is the World Wide Web (WWW). Powered by Web 2.0 applications, the WWW has become the most popular social medium, covering all forms of sharing: from experiences, to photos, to recommendations. As a result, people are implicitly involved in many social networks, which are formed by the friend lists in instant messaging software, by the bloggers who comment on a certain topic in a blogspace, or by the users who write collaboratively on a wiki site.

Most of these networks are globally sparse yet locally dense. Their vertices fall into groups such that the density of edges is higher within groups and lower between groups[21][24]. Such a group is called a community; it is an important network property and can reveal many hidden features of the given network. Individuals belonging to the same community are likely to have properties in common. The communities in the blogspace often correspond to topics of interest. Monitoring the aggregate trends and opinions revealed by these communities provides valuable insight for a number of business applications, such as marketing intelligence and competitive intelligence. Hence, identifying the communities is a fundamental step not only for discovering what makes entities come together, but also for understanding the overall structural and functional properties of a large network[13][23].

A popular quantitative definition called Network Modularity, proposed by Girvan and Newman[9][15], is widely used as a quality metric for assessing the partitioning of a given network into communities. The search for the largest modularity value is an NP-hard problem because the space of possible partitions grows faster than any power of the system size[8]. For this reason, many recent algorithms adopt various heuristic strategies to optimize this metric. However, as mentioned in [16], most actual networks are made of highly overlapping cohesive subgroups of nodes, simply because individuals often belong to numerous different kinds of relationships simultaneously. For example, each of us may participate in many social circles according to our hobbies, educational background, working environment and family relationships. As a result, when the network is large and the overlapping is significant, most of the existing algorithms will in general have a high computational cost due to their heuristic optimization strategies.

Therefore, the main contribution of this paper is first to propose an algorithm, ComTector, which is efficient for community detection in large-scale social networks because it uses this overlapping nature of communities in real-world scenarios. Given a large sparse graph, the running time of our algorithm is $O(C \times Tri^2)$ in the worst case, where $C$ is the number of detected communities and $Tri$ is the number of triangles in the given network. Then we present a method for describing and naming the discovered communities by combining the network topological information with the natural attributes of the vertices.

The rest of the paper is structured as follows. Section 2 reviews related work. Section 3 describes the community detection algorithm in detail. Section 4 discusses the naming method used to define the discovered communities. The experimental results and analysis are presented in Section 5, and we conclude the paper in Section 6.

2. RELATED WORK

In social network analysis (SNA), a community is often regarded as some kind of cohesive sub-structure[21][24], such as cliques[2][7], n-cliques, n-clans, n-plexes[25], and quasi-cliques[1][17][26]. These dense sub-structures always impose extra restrictions on the community definition. For example, the definition of an n-clique requires that the distance between any pair of vertices be no more than $n$, while in a quasi-clique the ratio of the number of each vertex's neighbors to the number of all vertices in the sub-structure is no less than a threshold value. At the same time, these sub-structures are usually small in size, and one may obtain a tremendous number of them, which actually hides the global organization of the given network. Compared with these predefined cohesive sub-structures, hierarchical clustering[12] is another widely used technique in SNA, which groups similar vertices into larger communities. Donetti and Munoz[6] have adopted this method by treating the Laplacian eigenvectors of the graph as a similarity measure among vertices. The complexity is determined by the computation of all the eigenvectors, which takes $O(n^3)$ time for sparse matrices. While it does not require us to specify the size or number of the communities beforehand, this method does not know when to stop the agglomerative process to obtain the best division of the network.

Girvan and Newman have introduced a divisive approach[9][10] (GN) which removes edges according to their betweenness values. By iteratively cutting the edge with the greatest betweenness value, it uses the Network Modularity $Q$ to obtain an optimized division of the network with $O(m^3)$ time complexity[14]. Radicchi has proposed a methodology similar to GN[19] that uses the edge-clustering coefficient as the new metric. Its time complexity is $O(m^2)$, which is lower than that of GN. To improve the computational efficiency, Clauset, Newman and Moore have also proposed a fast clustering algorithm[4] with $O(n \log n)$ time complexity on sparse graphs, which uses a greedy strategy that iteratively merges the pair of nodes with the maximal $\Delta Q$ until $\Delta Q$ becomes negative. Pascal Pons and Matthieu Latapy[18] have designed another clustering algorithm that uses random walks to measure the similarity between vertices. It also uses the Network Modularity $Q$ to determine when to stop the agglomerative process and has $O(n^2 \log n)$ time complexity.

Other interesting algorithms include Jordi Duch and Alex Arenas's extremal optimization method proposed in [8] with $O(n^2 \log n)$ time complexity, Aaron Clauset's method for finding local community structures in [5], the agent-based algorithm proposed by Ismail Gunes and Haluk Bingol in [11], as well as the approach based on the information-theoretic framework in [20].

All these algorithms are successful approaches to community detection, with different backgrounds and scopes of application. However, actual social networks are large sparse graphs with significant overlapping among groups of vertices[16]. As a consequence, the betweenness-based divisive algorithms have very low computational efficiency, while the fast agglomerative method[4] in general cannot give a satisfactory division because of its local optimization strategy. Therefore, we follow a different track by presenting an algorithm which can generate a higher network modularity than the fast algorithm while performing more efficiently than the GN algorithm.

3. COMMUNITY DETECTION

In most social networks, triangle counts are usually higher than they are in nonsocial networks, so our approach to community detection is based on the enumeration of all maximal cliques. Each group of overlapping maximal cliques is regarded as a clustering kernel. We carry out an agglomerative process to assign the remaining vertices to their closest kernels according to a proposed distance measure. In the end, the resulting fractional communities are merged appropriately so as to prevent the network from being divided into pieces that are too small.

3.1 Problem Formulation

Community detection in networks aims to find groups of vertices within which connections are dense, but between which connections are sparser. In this paper, we consider simple graphs only, i.e., graphs without self-loops or multi-edges. Given a graph $G$, $V(G)$ and $E(G)$ denote the sets of its vertices and edges respectively.

Definition 1. Let $S \subseteq V(G)$. If $(u, v) \in E(G)$ for all $u, v \in S$ with $u \neq v$, then $S$ is a clique in $G$. If, for any clique $S'$, $S' \supseteq S$ holds iff $S' = S$, then $S$ is a maximal clique of $G$.

Definition 2. For a given vertex $v$, $N(v) = \{u \mid (v, u) \in E(G)\}$ is the set of all neighbors of $v$. Given a set $S \subseteq V(G)$, $N|_S = \bigcup_{v_i \in S} N(v_i) - S$ is the set of all neighbors of $S$.

Definition 3. Let $Com(G)$ be the set of all components in $G$. The giant component is denoted by $C_G$, and $M(C_G)$ is the set of all the maximal cliques in $C_G$. We use $V_M \subseteq V(G)$ to represent the set of all vertices covered by $M(C_G)$.

Definition 4. Let $P_0, P_1, \ldots, P_{n-1}$ be subgraphs of $G$ such that $V(P_i) \cap V(P_j) = \emptyset$ for all $i \neq j$, and $V(P_0) \cup \cdots \cup V(P_{n-1}) = V(G)$. If $|E(P_i)| > \bigl|\,N|_{P_i} \cap P_j\,\bigr|$ for every $P_j$ with $j \neq i$, then $P_i$ is defined as a community of $G$.

Definition 5. Given a vertex $v_i \in V_M$, define $C_i = \{S \mid S \in M(C_G),\ v_i \in S\}$ to be the set of all maximal cliques containing $v_i$, and $C$ the set of all $C_i$'s. For $C_i, C_j \in C$, if $\frac{|C_i \cap C_j|}{|C_j|} \geq f$, where $f$ is a threshold describing the extent to which $C_i$ overlaps with $C_j$, we say that $C_j$ is contained in $C_i$, denoted by $C_j < C_i$. If $C_i$ is not contained in any other element of $C$, $C_i$ is called a kernel of $G$ and $v_i$ is the center of $C_i$.

Definition 6. Let $K$ be the set of all kernels in $G$. $V_K = \{v_i \mid v_i \in k_j,\ k_j \in K\}$ is the set of all vertices covered by $K$, and $I_K = \bigcup_{i \neq j} (k_i \cap k_j)$, $k_i, k_j \in K$, is the union of all the vertices that any pair of elements in $K$ has in common.

3.2 Algorithm

ComTector first enumerates all maximal cliques in the giant component $C_G$. Because a maximal clique is a complete sub-graph, it is the densest possible community and represents the closest relationship involving a single entity in the given network.
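To make the enumeration step concrete, the following is a minimal sketch that lists all maximal cliques of a small graph. It uses the classic Bron-Kerbosch recursion with pivoting, which is swapped in here for illustration only; it is not the Peamc algorithm used in the paper. The adjacency-dictionary representation and the function name `maximal_cliques` are assumptions of this sketch.

```python
from typing import Dict, FrozenSet, List, Set

Graph = Dict[int, Set[int]]  # vertex -> set of neighbours (simple graph)


def maximal_cliques(adj: Graph) -> List[FrozenSet[int]]:
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques: List[FrozenSet[int]] = []

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> None:
        # r: current clique, p: candidates, x: already explored vertices.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: a triangle 0-1-2 with a pendant edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_cliques = maximal_cliques(toy)
```

On the toy graph the sketch reports the triangle and the pendant edge as the two maximal cliques.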

3.2.1 Kernel Generation

For any $v_i \in V(G)$, $C_i$ is the set of all maximal cliques containing $v_i$. Every maximal clique in $C_i$ corresponds to one kind of relationship involving $v_i$; in other words, $C_i$ reflects the fact that individuals belong to different kinds of relationships simultaneously. Since $C_i$ covers all the densest communities in which $v_i$ participates, the set $C$ reflects the statistics of the overlapping communities in networks. For any $v_i, v_j \in V_M$, if $\frac{|C_i \cap C_j|}{|C_j|} \geq f$ ($f$ is an empirical value), which means all or most of $v_j$'s relationships are covered by those of $v_i$, we say that $v_j$ depends on $v_i$ and $C_j < C_i$. Otherwise, if $C_j \nless C_i$ for all $C_i \in C$ with $i \neq j$, then $C_j$ becomes a kernel. Therefore, the larger $C_i$ is, the more likely it is to become a kernel. We rearrange all the elements of $C$ in descending order of their sizes and delete those elements whose sizes are smaller than 2, which means that if $C_i$ is a kernel, $v_i$ must participate in at least two different relationships.

Figure 1: Overlapping Communities

Let $C_{i_0}$ be the largest element of $C$, $C_{i_1}$ the second largest, and in general $C_{i_n}$ the element whose size ranks $n$-th. $K$ is the set of all kernels. $C_{i_0}$ is picked first, and the elements contained in $C_{i_0}$ are removed from $C$. Next, each maximal clique that includes the center of an element remaining in $C$ is deleted from $C_{i_0}$. If $C_{i_0}$ is not empty, it is put in $K$. Then the largest of the remaining elements of $C$, say $C_{i_n}$, is chosen. It is removed from $C$ together with all the elements it contains, and each maximal clique that includes the center of an element remaining in $C$ is deleted from $C_{i_n}$. Any maximal clique that contains the center of an element of $K$ is also deleted from $C_{i_n}$ to avoid unnecessary duplication. If $C_{i_n}$ is not empty, it is put in $K$. The process continues iteratively until $C$ becomes empty.

To make things more concrete, consider the network shown in Figure 1. $C_0 = \{\{v_0, v_1, v_4, v_5\}, \{v_0, v_1, v_3, v_4\}, \{v_0, v_2, v_3, v_4\}, \{v_0, v_4, v_5, v_6\}\}$ with $v_0$ as its center, and $C_1 = \{\{v_0, v_1, v_4, v_5\}, \{v_0, v_1, v_3, v_4\}\}$ with $v_1$ as its center. Clearly $C_1 < C_0$, so $C_1$ is not a kernel. Similarly, $C_2, C_3, C_4, C_5$ are also contained in $C_0$, and $C_8, C_9, C_{10}, C_{11}$ are contained in $C_7$. Therefore, $C_0$ and $C_7$ are two different kernels. The overall process is depicted in Algorithm 1. Each element of $K$ corresponds to the kernel of a possible community in $G$. In fact, the process of generating the set $K$ is similar to the way the classic k-means algorithm finds clustering centers. One may argue that another very intuitive way to search for the kernels is to rely on the degree of each vertex: all the vertices are sorted in descending order of degree, and each vertex together with its neighbors is regarded as an element of $C$ for generating the kernels. Even though this method seems simple and straightforward, it does not help us find the communities.

The reason is that the vertices contained in communities do not necessarily have large degrees. The fact that a vertex $v_i$ has a large degree only indicates that, as a single entity, $v_i$ has many connections with others; it does not mean that $v_i$ is involved in a large community. In our experiments, we have found that on average about 40 percent of the top 10 elements of $C$ have centers whose degrees also rank in the top 10. Most vertices in communities of average size do not have large degrees. Let $v_k$ be the center of the smallest element of $C$ and $v_d$ be the vertex with the maximum degree. We have found that the vertices $v$ with $|N(v_k)| \leq |N(v)| \leq |N(v_d)|$ make up 75% of $|V(G)|$ on average, which is far more than $|C|$, so a degree-based search would generate the kernels inefficiently. Therefore, whether an individual participates in a community also depends on how closely its neighboring vertices are connected with each other. This is another important motivation for using the overlapping maximal cliques to find the possible kernels.

Algorithm 1 FilterOutKernels(C, f)

    K ⇐ ∅
    sort C by the descending order of |C_i|, C_i ∈ C
    {core stores the centers of the filtered out kernels}
    core ⇐ ∅
    for C_i ∈ C do
        contained ⇐ C_j, j ≠ i, C_j < C_i
        independent ⇐ k, k ≠ i, C_k ≮ C_i
        delete C_i from C
        C ⇐ C − contained
        for s ∈ C_i do
            if s ∩ (independent ∪ core) ≠ ∅ then
                delete s from C_i
            end if
        end for
        if C_i ≠ ∅ then
            K ⇐ K ∪ {C_i}
        end if
        core ⇐ core ∪ {v_i}
    end for
    return K
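The kernel filtering of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: cliques are given as frozensets of integer vertex ids, ties in size are broken by vertex id, and the clique list below is a hypothetical input loosely modeled on the $C_0$ / $C_7$ example (only the four cliques of $C_0$ come from the text).

```python
from typing import Dict, FrozenSet, Iterable, Set

Clique = FrozenSet[int]


def filter_out_kernels(cliques: Iterable[Clique], f: float) -> Dict[int, Set[Clique]]:
    """Return {center: kernel}, where a kernel is a set of maximal cliques."""
    # C[v] is the set of maximal cliques containing v (Definition 5).
    C: Dict[int, Set[Clique]] = {}
    for s in cliques:
        for v in s:
            C.setdefault(v, set()).add(s)
    # Keep elements of size >= 2, largest first (tie-break by id is an assumption).
    order = sorted((v for v in C if len(C[v]) >= 2), key=lambda v: (-len(C[v]), v))
    kernels: Dict[int, Set[Clique]] = {}
    core: Set[int] = set()  # centers already processed
    while order:
        i = order.pop(0)
        # Remove every C_j contained in C_i, i.e. |C_i ∩ C_j| / |C_j| >= f.
        order = [j for j in order if len(C[i] & C[j]) / len(C[j]) < f]
        # Cliques touching an independent center or an earlier center are dropped.
        blocked = set(order) | core
        kept = {s for s in C[i] if not (s & blocked)}
        if kept:
            kernels[i] = kept
        core.add(i)
    return kernels


# Hypothetical clique list: four cliques around vertex 0, three around vertex 7.
example = [frozenset(s) for s in (
    {0, 1, 4, 5}, {0, 1, 3, 4}, {0, 2, 3, 4}, {0, 4, 5, 6},
    {5, 7, 8}, {7, 8, 9}, {7, 9, 10},
)]
example_kernels = filter_out_kernels(example, f=0.5)
```

With `f = 0.5` the sketch keeps two kernels, centered at vertices 0 and 7, that still share vertex 5; resolving such shared vertices is the job of the de-duplication step.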

The discovered communities should form a partition of the given network, which requires that no two elements of $K$ have a vertex in common. Therefore, the pairwise intersections among the elements of $K$ are computed and all the common vertices are put in a set $I_K$. For each $v_i \in I_K$, we use the Freeman Relative Centrality [21] to evaluate its position in the corresponding kernels.

$$C_{RD}(v_i) = \frac{|N(v_i)|}{n - 1} \qquad (1)$$

For a given sub-graph $SG$, the larger $C_{RD}(v_i)$ is, the more important the vertex $v_i$ is in $SG$. This metric characterizes the distance between a given vertex and the corresponding kernels. The maximum $C_{RD}$ value in $SG$ is denoted by $C_{RD}^{max}$. Every vertex $v_i$ in $I_K$ is then assigned to its closest kernel based on the value of $C_{RD}(v_i)$. As an exception, if $v_i$ has the same maximum relative degree in two different kernels $k_m$ and $k_n$, $v_i$ is assigned to the one with the smaller value of $|C_{RD}^{max} - C_{RD}(v_i)|$. Finally, if $k_m$ and $k_n$ also have the same $|C_{RD}^{max} - C_{RD}(v_i)|$ value, which is a rather rare case, $v_i$ is assigned to one of them at random. The whole procedure is given in Algorithm 2. In Figure 1, $C_0$ and $C_7$ share $v_5$. Since the distances of $v_5$ to $C_0$ and $C_7$ are 0.6 and 0.3 respectively, $v_5$ is assigned to $C_0$ and removed from $C_7$.

Algorithm 2 DeDuplication(K)

    I_K ⇐ ∅
    for k_i ∈ K do
        for k_j ∈ K, i < j do
            I_K ⇐ I_K ∪ (k_i ∩ k_j)
        end for
    end for
    for v ∈ I_K do
        remove v from all the kernels except for the one in which C_RD(v) is the largest
    end for
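The de-duplication step can be sketched as follows. This is a hedged illustration rather than the authors' code: kernels are represented by their vertex sets, $n$ in Equation (1) is taken to be the number of vertices of the kernel (an assumption, since the text does not define $n$), and the final random tie-break is replaced by a deterministic choice so that the example is reproducible. The helper names and the toy graph are hypothetical.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def relative_degree(adj: Graph, v: int, members: Set[int]) -> float:
    """Freeman relative degree of v inside the sub-graph induced by members."""
    # Assumption of this sketch: n is the number of vertices in the kernel.
    return len(adj[v] & members) / max(len(members) - 1, 1)


def deduplicate(adj: Graph, kernels: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    kernels = {c: set(vs) for c, vs in kernels.items()}
    centers = sorted(kernels)
    # I_K: vertices shared by at least one pair of kernels.
    shared: Set[int] = set()
    for a in range(len(centers)):
        for b in range(a + 1, len(centers)):
            shared |= kernels[centers[a]] & kernels[centers[b]]
    for v in sorted(shared):
        owners = [c for c in centers if v in kernels[c]]

        def score(c: int):
            crd = relative_degree(adj, v, kernels[c])
            crd_max = max(relative_degree(adj, u, kernels[c]) for u in kernels[c])
            # Larger C_RD wins; ties prefer the smaller gap to the kernel maximum.
            return (crd, -(crd_max - crd))

        best = max(owners, key=score)
        for c in owners:
            if c != best:
                kernels[c].discard(v)
    return kernels


# Toy graph: a 4-clique {0,1,2,3} and a triangle {4,5,6}, linked by edge 3-4.
toy_adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
           4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
toy_kernels = {0: {0, 1, 2, 3}, 4: {3, 4, 5, 6}}  # vertex 3 is shared
disjoint = deduplicate(toy_adj, toy_kernels)
```

In the toy input vertex 3 has relative degree 1.0 in the first kernel and 1/3 in the second, so it stays in the first.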

3.2.2 Kernel-based Clustering

Once no two kernels have a vertex in common, each of them is regarded as a clustering center, and every vertex in $V(C_G) - V_K$ is also assigned to its closest kernel. We adopt a marking strategy to differentiate new vertices from old ones. All vertices in $V_K$ are first marked as old. Next, every new vertex in the set $\bigcup_{k_i \in K} N(k_i) - V_K$ is put in a tentative set $V_E$. In the third step, all vertices in $V_E$ are assigned to their closest kernels and marked as old. As a result, every kernel is now expanded. Again, every new vertex in the set $N|_{V_E}$ is added to a tentative set $V_E'$. Then the vertices in $V_E'$ are also assigned to their closest kernels and marked as old. This process is repeated iteratively until the kernels cannot be expanded any more, as given in Algorithm 3.

Algorithm 3 AssignVertex(K)

    for v_i ∈ V_K do
        v_i is marked as old
    end for
    V_E ⇐ vertices not marked as old in ⋃ N(k_i) − V_K
    while V_E ≠ ∅ do
        for v_i ∈ V_E do
            assign v_i to its closest kernel k_i
            v_i is marked as old
        end for
        V_E' ⇐ ∅, V_E' ⇐ vertices not marked as old in N|V_E
        V_E ⇐ ∅, V_E ⇐ V_E'
    end while
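The layer-by-layer expansion of Algorithm 3 can be sketched as below. The closeness measure used here, the fraction of a community's current members adjacent to the vertex, is an assumption of this sketch that mirrors the relative degree of Equation (1); the text only says that vertices go to their closest kernel. Vertices within a layer are processed in sorted order, which is also an arbitrary choice.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def assign_vertices(adj: Graph, kernels: Dict[int, Set[int]]) -> Dict[int, Set[int]]:
    """Grow disjoint kernels until every reachable vertex belongs to one of them."""
    comms = {c: set(vs) for c, vs in kernels.items()}
    old: Set[int] = set().union(*comms.values())  # vertices marked as old

    def frontier(sources: Set[int]) -> Set[int]:
        # Neighbours of the given vertices that are not yet marked as old.
        return {u for v in sources for u in adj[v]} - old

    v_e = frontier(old)
    while v_e:
        for v in sorted(v_e):
            # Closeness (assumption): share of the community adjacent to v.
            best = max(sorted(comms), key=lambda c: len(adj[v] & comms[c]) / len(comms[c]))
            comms[best].add(v)
            old.add(v)
        v_e = frontier(v_e)
    return comms


# Toy graph: two triangles {0,1,2} and {5,6,7} joined through 3 and 4; 8 hangs off 4.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3, 5, 8},
             5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6}, 8: {4}}
grown = assign_vertices(toy_graph, {0: {0, 1, 2}, 5: {5, 6, 7}})
```

Vertices 3 and 4 are absorbed in the first round and vertex 8 in the second, after which no unmarked neighbor remains.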

3.2.3 Modularity Optimization

Once the clustering process is finished, all the obtained sub-structures constitute the original division of the network. We adopt the Network Modularity $Q$ to evaluate this division. Given the $p \times p$ symmetric matrix $e$ whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in community $i$ to vertices in community $j$, the row sums $a_i = \sum_{j=0}^{p-1} e_{ij}$ represent the fraction of edges that connect to vertices in community $i$. $Q$ is thus defined as

$$Q = \sum_{i=0}^{p-1} \left( e_{ii} - a_i^2 \right) \qquad (2)$$

This quantity measures the fraction of within-community edges minus the expected value of the same quantity in a network with the same community divisions but random connections between the vertices. If the number of within-community edges is no better than random, we will get $Q = 0$. Values approaching $Q = 1$, which is the maximum, indicate strong community structure.
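As a worked illustration of Equation (2), the sketch below computes $Q$ for an undirected edge list. The function name and input format are assumptions; each edge contributes half of its weight to $a_i$ for each of its two endpoints, which is the usual way to obtain the row sums of a symmetric $e$.

```python
from typing import Dict, Iterable, Tuple


def modularity(edges: Iterable[Tuple[int, int]], community_of: Dict[int, int]) -> float:
    """Network Modularity Q = sum_i (e_ii - a_i^2) for an undirected edge list."""
    edges = list(edges)
    m = len(edges)
    e_in: Dict[int, float] = {}  # e_ii: fraction of edges inside community i
    a: Dict[int, float] = {}     # a_i: fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)


# Two triangles joined by the single edge 2-3.
bridge_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {v: 0 for v in range(6)}
q_split = modularity(bridge_edges, two_groups)
q_whole = modularity(bridge_edges, one_group)
```

For the split into two triangles, $e_{ii} = 3/7$ and $a_i = 1/2$ for both communities, so $Q = 2(3/7 - 1/4) = 5/14$; placing all vertices in one community gives $Q = 0$.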

In our experiments, we have found that there exist a number of fractional communities which are derived from kernels that are tiny compared with the others. The underlying reason for such fractional kernels is that, for a specific $C_i$, $v_i$ may not be the "real" center of $C_i$. In other words, although each maximal clique in $C_i$ contains $v_i$, it may also include the centers of many other kernels, which means that $v_i$ is not the expected core figure but just a normal individual participating in the social circles of other dominating figures. Consequently, even though $C_i$ becomes a kernel, its size is tiny, because the maximal cliques involving the centers of other kernels are all deleted from $C_i$. As a result, the communities resulting from these tiny kernels partition the network into pieces that are too small.

To address this problem, we propose two approaches to adjust the original division. For the first one, we borrow the basic idea of Newman's fast algorithm and perform a local greedy optimization. We iteratively compute the changes $\Delta Q$ resulting from the amalgamation of each pair of communities, choose the largest of them, and perform the corresponding amalgamation until $\Delta Q$ becomes negative. The modularity value of the original division is $Q_0$. Suppose we first merge community $i$ with $j$, and denote the new community by $(ij)$. Then we have

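$$\Delta Q = \begin{cases} a_{ij} - a_{(ij)}^2 + a_i^2 + a_j^2 & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise} \end{cases} \qquad a_i = \frac{E_i}{m}, \quad E_i \text{ is the number of edges in community } i \qquad (3)$$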

Here, $a_i$ is the fraction of edges that connect vertices in community $i$, and $a_{ij}$ is the fraction of edges that connect vertices in community $i$ to vertices in community $j$. $m$ is the number of edges in graph $G$. Once the initial values of $\Delta Q$ and $a_i$ are obtained, we use the amalgamation process of the fast algorithm to increment $Q_0$ by the largest $\Delta Q$ until it becomes negative, as shown in Algorithm 4.

Algorithm 4 AdjustDivision(K)

    calculate ∆Q from pairs of connected communities
    while maximal ∆Q > 0 do
        select the maximal ∆Q
        join the pair of communities with the maximal ∆Q
        update the ∆Q matrix
    end while
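The greedy amalgamation can be sketched naively as follows. This is an illustration only: it recomputes all quantities from scratch in every round instead of maintaining the $\Delta Q$ matrix incrementally as the fast algorithm does, and it assumes that $a_{(ij)} = a_i + a_j$ with $a_i$ counted over edge ends and that $a_{ij}$ counts every edge between $i$ and $j$ once. Under those assumptions Equation (3) reduces to $\Delta Q = a_{ij} - 2 a_i a_j$. The function name and toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def adjust_division(edges: List[Tuple[int, int]], community_of: Dict[int, int]) -> Dict[int, int]:
    """Merge the connected pair with the largest positive Delta Q until none is left."""
    comm = dict(community_of)
    m = len(edges)
    while True:
        a: Dict[int, float] = {}
        between: Dict[Tuple[int, int], float] = {}
        for u, v in edges:
            cu, cv = comm[u], comm[v]
            a[cu] = a.get(cu, 0.0) + 0.5 / m
            a[cv] = a.get(cv, 0.0) + 0.5 / m
            if cu != cv:
                key = (min(cu, cv), max(cu, cv))
                between[key] = between.get(key, 0.0) + 1.0 / m
        if not between:
            return comm
        # Delta Q for each connected pair, following Equation (3).
        gains = {
            (i, j): a_ij - (a[i] + a[j]) ** 2 + a[i] ** 2 + a[j] ** 2
            for (i, j), a_ij in between.items()
        }
        (i, j), best = max(gains.items(), key=lambda kv: kv[1])
        if best <= 0:
            return comm
        for v in comm:  # relabel community j as community i
            if comm[v] == j:
                comm[v] = i


# Two triangles joined by edge 2-3; vertex 5 starts as a fractional community.
tri_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
initial = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
adjusted = adjust_division(tri_edges, initial)
```

In the toy input the single-vertex community is merged into its triangle ($\Delta Q \approx 0.18$), while merging the two triangles would lower $Q$, so the process stops with two communities.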

In the second method, the fractional communities of the original division whose sizes are below the average are directly merged with the rest. In our experiments, we have found that the final modularity value obtained by this straightforward method is often close to that of the first method, at an even lower computational cost. Finally, the whole procedure of ComTector is given in Algorithm 5.

3.2.4 Performance Analysis

From the prior discussion, the enumeration of all maximal cliques in the giant component by using Peamc[7] costs $O(\Delta \times M_C \times Tri^2)$ in the worst case on a single processor, where $\Delta$ is the maximal degree of $G$, $M_C$ is the size of the maximum clique and $Tri$ is the number of all triangles in $G$. To find the kernel set $K$, we need to traverse all the elements of $C$ whose size is larger than 2, which costs $O(M_C \times |C|^2)$. The parameter $f$, which decides whether one element of $C$ is contained in another, influences the number of kernels. Based on our experiments, we suggest that it should be larger than 0.3. Since $V_M - V_K \approx V(G)$, assigning the remaining vertices in $V_M - V_K$ costs $O(|K| \times |V(G)| \times I)$, where $I$ is the average number of times the process repeats until $K$ is empty. In sparse graphs, we have $|V(G)| \approx |E(G)|$, $|C| < |V(G)|$, $|K| \approx |C|$, $|V(G)| < Tri^2 \ll |V(G)|^2$, and $\Delta \times M_C < |K|$. Let $C$ denote the number of communities in the original division. The adjustment phase using modularity optimization costs $O(C \times \log C)$. Because $C \approx |K|$ and $I$ has the value of 6 according to the small-world property, the overall cost is $O(C \times Tri^2)$ in the worst case.

Algorithm 5 ComTector(G)

    Read Graph G
    Com(G) ⇐ all components of G
    M(C_G) ⇐ all maximal cliques in C_G
    FilterOutKernels(C, f)
    DeDuplication(K)
    AssignVertex(K)
    AdjustDivision(K)
    return K ∪ (Com(G) − C_G)

4. COMMUNITY NAMING MECHANISM

Once all possible communities have been detected, one may further ask what the discovered communities actually are. Since a community captures some kind of common relationship shared among the involved entities, can we find out what that relationship is? Here, we propose a general method to describe a specific community by combining the topological information with the natural attributes of the contained entities. The process is much like building a profile of the given community. Suppose we are given two data sets. One is the entity relational data, which is modeled by a graph $G = \{V, E\}$ with $V$ being the entity (vertex) set and $E$ the relation (edge) set. The other is the entity attribute data $R = \{a_0, a_1, \ldots, a_{n-1}, r_c\}$, where $a_0, a_1, \ldots, a_{n-1}$ are the natural attributes and $r_c$ is the index number of the community discovered by the previous process. Based on $G$ and $R$, our task is to find some key attribute values to characterize the given community.

Technically, the naming task itself poses two challenges. On the one hand, gathering and synthesizing $G$ and $R$ in real applications is rather complicated, and sometimes the entities do not even have enough attributes. On the other hand, the classic approach to profiling a group of entities is rule induction. However, we are often unable to use this method directly after the communities are discovered, even if the entities have enough attributes, because the important topological information about the internal structure is lost. Therefore, our naming approach consists of two steps. In the first step, we find the central entities of each community. In the second step, several naming mechanisms are proposed according to the type and number of the attribute values held by the central entities.

4.1 Central Entity Resolution

Taking the self-similarity[22] and scale-free properties into account, we assume that most sub-graphs extracted from social networks have central entities, in the same way that the whole social network often has several hub nodes. These central entities usually have an important impact on the overall formation and development of the given community. For example, in a collaboration network, the central entity of a research team could be a common advisor who originally started the research area, while in a telecommunication call network, the central person in a group of frequently contacted people may be the chief officer of a department. Therefore, the characteristics of the core figures greatly influence the properties of the community.

Many methods have been proposed in the literature to quantify the centrality of an entity based on the topology of a given graph, such as degree, relative centrality, betweenness and PageRank. These approaches give a unique value $U(v)$ for each vertex $v$. Generally speaking, the greater the value is, the more important the vertex is in the community. What we actually need to do is to find the central vertices according to these values. After ranking all vertices in descending order of the centrality measure, if the value of some vertex is significantly greater than those of the other vertices, or if there exists a large gap between two neighboring vertices $v$ and $w$ with $U(v) > U(w)$, the vertices ranked at or above $v$ are the central entities of the community. Algorithm 6 describes this procedure.

Algorithm 6 CentralEntityResolution(C, p)

    {C is the given community and p is a threshold value}
    Calculate the centrality of each vertex v_i in C
    v_0, v_1, ..., v_{n−1} is arranged by the descending order of their centrality c_0, c_1, ..., c_{n−1}
    i ⇐ 0
    while i < n − 1 do
        if (c_i − c_{i+1}) / (c_i − c_{n−1}) > p then
            return {v_k | 0 ≤ k ≤ i}
        else
            i ⇐ i + 1
        end if
    end while

In Figure 2a, the relative centrality values of the vertices "PeiDa Ye" and "Ming Zhang" are 0.86 and 0.57, which rank first and second respectively. The minimum relative centrality value in this community is 0.03. Since the threshold $p$ determines the required distance between two neighboring values, and the third-ranking relative centrality value is only 0.2, which is far less than the first two, the vertices "PeiDa Ye" and "Ming Zhang" are the central entities which could be used to define this community. By contrast, in Figure 2b, almost all vertices have similar relative centrality values, so we use them all together to describe the community.
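The gap test of Algorithm 6 can be checked against the Figure 2a numbers with a short sketch. It assumes the gap is measured between rank $i$ and rank $i+1$ and normalized by $c_i - c_{n-1}$, which is the reading that reproduces the result reported in the text for $p = 0.6$. Returning every vertex when no gap exceeds $p$ follows the Figure 2b remark. The names "third" and "lowest" are placeholders, since the text gives only the values 0.2 and 0.03.

```python
from typing import Dict, List


def central_entities(centrality: Dict[str, float], p: float) -> List[str]:
    """Return the vertices ranked above the first sufficiently large centrality gap."""
    ranked = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
    values = [c for _, c in ranked]
    for i in range(len(values) - 1):
        spread = values[i] - values[-1]
        # Relative gap between rank i and rank i + 1 (assumed reading of Algorithm 6).
        if spread > 0 and (values[i] - values[i + 1]) / spread > p:
            return [name for name, _ in ranked[: i + 1]]
    # No clear gap: all vertices describe the community together.
    return [name for name, _ in ranked]


figure_2a = {"PeiDa Ye": 0.86, "Ming Zhang": 0.57, "third": 0.2, "lowest": 0.03}
centers_2a = central_entities(figure_2a, p=0.6)
flat = central_entities({"a": 0.5, "b": 0.5, "c": 0.5}, p=0.6)
```

For the Figure 2a values the first relative gap is $0.29 / 0.83 \approx 0.35$, below the threshold, and the second is $0.37 / 0.54 \approx 0.69$, above it, so the first two vertices are returned.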

4.2 Naming Approach
Based on the data sets G and R, we propose the following approaches to name a specific community. In the first case, set R has only one element; that is, there exists only one attribute that the central entities can have. In this case, the union of the central entities' attribute values is used to define the given community. In the second case, there are many attributes that reflect various aspects of an entity. We use an attribute selection technique to find the most informative characteristics of the central entities. Here, an attribute can be used to name the community if its value for the central entities is not significantly different from that of the non-central entities. A representative value of this attribute among the central entities is calculated according to its type (a discrete attribute such as sex or address, or a continuous attribute such as age or income). If the values of an attribute are absent among all central entities, it cannot be used to name the community. This process is depicted in Algorithm 7. The representative value of attribute a held by the central entities is denoted as C_a, the frequency of C_a among all the entities as F_{C_a}, and the average value of a as A_a.

Figure 2: Central Entity Resolution p = 0.6

Algorithm 7 NamingCommunity(R, Center, p1, p2)
  {R is the attribute set, Center is the set of central entities, p1 and p2 are two threshold values}
  if |Center| > 1 then
    for each attribute a ∈ R do
      if a is discrete then
        C_a ⇐ the most frequent value of a among the central entities
      else
        C_a ⇐ average value of a over the central entities
      end if
    end for
  else
    for each attribute a ∈ R do
      C_a ⇐ value of a
    end for
  end if
  for each attribute a ∈ R do
    if a is discrete and F_{C_a} > p1 then
      a is selected as the key attribute
      return C_a
    else
      if (C_a − A_a) / A_a < p2 then
        a is selected as the key attribute
        return C_a
      end if
    end if
  end for

Figure 3: Zachary Karate Club

To sum up, the above two naming mechanisms should be applied to different practical scenarios. If each entity has only one attribute, the value of this attribute held by the central entity is used to define the community. If a community has a few central entities, each of which has several attributes, we can use the frequent attribute values to name the community. Suppose that there are n vertices with m attributes each, and c central entities in the given community. The time complexity of calculating the centrality, which depends on the centrality measure used, is O(T), so the whole naming procedure costs O(T + cm + nm).
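The attribute selection of Algorithm 7 can be illustrated with a short Python sketch. It is a minimal reading of the pseudocode, not the authors' implementation. The record layout (a dict of attribute dicts), the function name `name_community`, the alphabetical attribute order, and the use of an absolute value in the continuous test are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def name_community(entities, center, discrete, p1, p2):
    """Pick a key attribute and its representative value for a community.

    entities: dict vertex -> dict attribute -> value (hypothetical layout)
    center:   list of central vertices
    discrete: set of attribute names treated as discrete
    Returns (attribute, C_a) or None if no attribute qualifies.
    """
    attributes = sorted({a for rec in entities.values() for a in rec})
    for a in attributes:  # ASSUMPTION: attributes are tried alphabetically
        central_vals = [entities[v][a] for v in center
                        if entities[v].get(a) is not None]
        if not central_vals:
            continue  # absent among all central entities: cannot be used
        all_vals = [rec[a] for rec in entities.values()
                    if rec.get(a) is not None]
        if a in discrete:
            # C_a: most frequent value among the central entities
            c_a = Counter(central_vals).most_common(1)[0][0]
            # F_Ca: frequency of C_a among all entities
            f_ca = all_vals.count(c_a) / len(entities)
            if f_ca > p1:
                return a, c_a
        else:
            c_a = mean(central_vals)   # C_a: average over central entities
            a_a = mean(all_vals)       # A_a: average over all entities
            # ASSUMPTION: absolute relative deviation; the pseudocode
            # writes (C_a - A_a) / A_a without an absolute value.
            if a_a != 0 and abs(c_a - a_a) / abs(a_a) < p2:
                return a, c_a
    return None


# Hypothetical community of four subscribers, two of them central.
people = {
    "a": {"zip": "100876", "age": 30},
    "b": {"zip": "100876", "age": 34},
    "c": {"zip": "100876", "age": 32},
    "d": {"zip": "100080", "age": 60},
}
print(name_community(people, ["a", "b"], {"zip"}, p1=0.6, p2=0.4))
```

Each attribute is scanned once over the c central entities and once over all n entities, which is consistent with the cm + nm terms in the stated cost.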

5. EXPERIMENTAL RESULTS
In this section, we present a number of applications of ComTector. The algorithm is first tested on the Zachary Karate Club[27], American College Football[27], and Scientific Collaboration[28] networks. In each case we find that our algorithm reliably detects the known structures. Based on the experimental results, we discuss in detail the optimization of parameter f.

Finally, ComTector is further tested on the large Telecommunication Call networks (T.C.) to illustrate the global structural properties of large-scale social networks. All experiments are done on a single PC (3.0GHz processor with 2Gbytes of main memory, running Linux AS3). We use an efficient parallel algorithm, Peamc, to enumerate all maximal cliques. For the telecommunication call networks, Peamc runs on the DAWN Cluster (3.2GHz processor with 2Gbytes of main memory on each node, running Linux AS3) using 20 processors, while for the other networks Peamc runs on a single processor.
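For readers unfamiliar with maximal clique enumeration, the task can be sketched with the classical serial Bron-Kerbosch procedure with pivoting [2]. This is not Peamc, whose parallel design is not described in this section; it is a swapped-in textbook technique shown only to make the step concrete. The function name `maximal_cliques` and the adjacency-dict input format are assumptions.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj: dict vertex -> set of neighbouring vertices (symmetric).
    Serial Bron-Kerbosch with pivoting; a stand-in for Peamc.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already processed vertices
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended: maximal
            return
        # Pivot on the vertex with most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# A triangle 1-2-3 with a pendant edge 3-4.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sorted(sorted(c) for c in maximal_cliques(graph)))
```

The recursion depth is bounded by the size of the largest clique, so plain recursion is adequate for sparse graphs of the kind discussed here.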

5.1 Zachary Karate Club
Zachary Karate Club is one of the classic studies in social network analysis. Over the course of two years in the early 1970s, Wayne Zachary observed social interactions between the members of a karate club at an American university. He built a network of connections with 34 vertices and 78 edges among members of the club based on their social interactions. By chance, a dispute arose during the course of his study between the club's administrator and the karate teacher. As a result, the club split into two smaller communities, centered on the administrator and the teacher respectively. Figure 3 shows the two communities detected by ComTector, which exactly match the result of Zachary's study.

5.2 American College Football

Figure 4: American College Football

As another test of ComTector, we turn to the network of American College Football. This network represents the schedule of Division I games for the 2000 season. It consists of 115 vertices and 616 edges, which represent football teams and the regular season games among them respectively. During the 2000 season, the 115 teams are divided into 12 conferences containing around 8 to 12 teams each. Games are more frequent between members of the same conference than between members of different conferences. Naturally, each conference can be considered one community of the network.

Applying ComTector to this network, we have found that it identifies the conference structure with a success percentage of 93.8%, as shown in Figure 4. Almost all teams are correctly grouped with the other teams in their conferences. The few cases where the algorithm fails actually correspond to the scheduling of the games. For example, the LouisianaMonroe, MiddleTennesseeState and LouisianaLafayette teams in Sun Belt are misclassified with LouisianaTech in Western Athletic as a single group. The Independents conference is broken into two groups, with its team UtahState being misclassified into Western Athletic. In all other respects, however, it performs remarkably well.

5.3 Scientific Collaboration Network
The data of the collaboration network is obtained from the 1990 papers published between 1998 and 2005 and indexed by SCI, EI and ISTP at Beijing University of Posts and Telecommunications (BUPT). Each author corresponds to a vertex of the network, and there is an edge between two vertices if the two corresponding authors have collaborated on a paper. A great deal of work has gone into disambiguation of similar names, so co-authorship relationships are relatively free of name resolution problems. We split the title of each published paper of a given author into key phrases by removing unnecessary prepositions and articles, as well as several nouns and adjectives that frequently appear with a very general meaning, such as system, algorithm and new. These key phrases are then used as the value of the research area attribute held by each author. By using our naming approach, we present a map of scholarship in Figure 5.

Figure 5: All Discovered Communities and the Descriptions

Each community in the Periphery area is an independent small component of the network, and the Core area corresponds to the giant component, with each vertex representing the corresponding community. The edge between two adjacent vertices indicates that the two communities overlap. By zooming in on the vertices of the Core area, we come to the detailed description of each specific named community. In this magnified picture, the color of each community is the same as that of its vertex in the Core area. The vertices in every community are the central persons who represent the research group. The solid lines among these vertices show that the central persons of the given communities have collaborated, while the dashed lines mean that persons other than the central ones in the corresponding communities have collaborated with each other. The color of each community, from red to deep blue, indicates the research contribution in descending order of the number of published papers indexed by SCI, EI and ISTP.
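The community-level map described above, where each vertex stands for a community and an edge marks an overlap, can be derived directly from the membership sets. The following is a minimal sketch under the assumption that communities are available as named vertex sets; the function name `community_overlap_map` and the sample community names are hypothetical.

```python
from itertools import combinations


def community_overlap_map(communities):
    """Build the edges of a community-level overlap map.

    communities: dict community name -> set of member vertices.
    Returns dict (name_a, name_b) -> set of shared vertices, with one
    entry per pair of communities that overlap.
    """
    overlaps = {}
    for na, nb in combinations(sorted(communities), 2):
        shared = communities[na] & communities[nb]
        if shared:  # draw an edge only when the two communities overlap
            overlaps[(na, nb)] = shared
    return overlaps


# Hypothetical research groups; author 3 belongs to two of them.
groups = {"optics": {1, 2, 3}, "networks": {3, 4}, "coding": {5}}
print(community_overlap_map(groups))
```

Communities that share no member with any other, such as the third group in the example, end up without edges, which corresponds to the isolated components of the Periphery area.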

Another scientific collaboration network is the network of coauthorships between scientists posting preprints on the Condensed Matter E-Print Archive between Jan 1, 1995 and March 31, 2005. It consists of 39577 vertices and 175693 edges. Table 1 gives the numerical results on both networks.

Table 1: Scientific Collaboration Networks

BUPT (1667 vertices, 4487 edges)
Algorithm      Result   Q      Time
GN             79       0.85   403s
Newman Fast    85       0.43   2.4s
ComTector      81       0.83   1s

E-Print Archive (39577 vertices, 175693 edges)
Algorithm      Result   Q      Time
GN             n/a      n/a    >24h
Newman Fast    1363     0.31   3.7h
ComTector      1056     0.65   2.2h (25s clique time)

In fact, for the GN algorithm, the computation of the maximum edge betweenness value makes it inefficient on large-scale networks. Newman's fast algorithm uses a local optimization policy, searching for the maximal increment of the Network Modularity Q. However, the change in Q is not monotonic. Although the fast algorithm can reach a local maximum of Q, this value may be far from the global maximum. Compared with the fast algorithm, which starts its clustering process from single vertices, ComTector generates all the possible kernels from the overlap of the maximal cliques and thereby directly locates the densest part of each community, which leads to faster convergence. Consequently, in large sparse graphs, ComTector usually performs better than the fast algorithm while being more efficient than the GN algorithm.

Table 2: Erdos Collaboration Network

Erdos 97 (5482 vertices, 8972 edges)
Algorithm      Result   Q      Time
GN             n/a      n/a    >5h
Newman Fast    68       0.43   29s
ComTector      49       0.69   23s (1s clique time)

Erdos 98 (5816 vertices, 9505 edges)
Algorithm      Result   Q      Time
GN             n/a      n/a    >5h

Erdos 99 (6094 vertices, 9939 edges)
Algorithm      Result   Q      Time
GN             n/a      n/a    >5h

Paul Erdos was one of the most prolific mathematicians in history, with more than 1500 papers to his name. He is also known as a promoter of collaboration and as the mathematician with the largest number of different co-authors, which motivated the introduction of the Erdos number collaboration network[29]. Patrick Ion (Mathematical Reviews) and Jerry Grossman (Oakland University) collected the related data (Grossman and Ion, 1995; Grossman, 1996), which were updated annually from 1997 to 1999. Experimental results are given in Table 2.

5.4 Telecommunication Call Network
In the algorithm, the value of parameter f affects the ultimate outcome of the partition. We adopt Newman's Q modularity to evaluate the strength of the detected community structures. f determines the number of kernels in the given network, which in turn influences the Q value. Figure 6 shows this relation in the American College Football and the Scientific Collaboration networks respectively. If f is too large, it will cut the network into smaller pieces. For each community i, e_ii tends to be small and a_i is relatively large, which further causes Q to decrease. As a result, in Figure 6, we see that when f ∈ (0.3, 0.5), Q often reaches its maximum value on average.

Figure 6: Network Modularity Q and f (panels: Scientific Collaboration, American College Football, Enron 97, Enron 98; x-axis: f, y-axis: Modularity Q)

According to the prior analysis, ComTector is further tested on two large telecommunication call networks, built from one month of data for a city and for a province provided by a telecom operator in China. The first one consists of 512024 vertices and 1021861 edges, and the second includes 845750 vertices and 1544834 edges. We regard each subscriber as a single vertex, and two vertices share an edge if the subscribers have contacted each other by mobile phone at least once. Because each individual can have connections with others for various reasons and relationships, the telecommunication call networks often show significant overlap among the detected communities. We have detected 28033 and 2171 communities in these two networks, with Q values of 0.60 and 0.64 respectively, within 2h, while neither GN nor Newman Fast can generate satisfactory results in an acceptable time frame.
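For reference, the modularity values reported in this section can be reproduced in principle with the standard Newman-Girvan form Q = Σ_i (e_ii − a_i²), where e_ii is the fraction of edges inside community i and a_i is the fraction of edge endpoints attached to community i. The sketch below assumes that every vertex is assigned to exactly one community; how Q is evaluated for overlapping communities is not specified in this section. The function name `modularity` and the input format are assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q = sum_i (e_ii - a_i^2).

    edges:        list of undirected edges (u, v), each listed once
    community_of: dict vertex -> community label
    ASSUMPTION: each vertex belongs to exactly one community.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)     # edges with both ends in community i
    endpoints = defaultdict(int) # edge endpoints attached to community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        endpoints[cu] += 1
        endpoints[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # e_ii = intra / m, a_i = endpoints / (2m)
    return sum(intra[c] / m - (endpoints[c] / (2 * m)) ** 2
               for c in endpoints)


# Two triangles joined by a single edge, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(round(modularity(edges, split), 4))
```

The example also shows the effect described for large f: cutting a network into ever smaller pieces lowers e_ii for each piece faster than it lowers a_i², so Q eventually falls.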

Looking at the large communities in the networks, we have found that they often tend to consist of people who have close consumption levels, similar ages or live in the same areas. Figure 7 gives the description results on the telecommunication network. In our studies, p1 = 0.6 and p2 = 0.4, which are empirical values, and the relative degree is used as the centrality measurement. In Community 70, we can see that all subscribers share the same home address (HOME ZIP CODE) and their consumption levels are also similar according to their average bill fee (BILL FEE Mean). Meanwhile, the subscribers in Community 61 correspond to a typical segment of the client market given their close consumption levels and similar age of 32, and Community 64 is apparently a young social circle. To some extent, these community structures and the corresponding common factors are useful clues for the telecom operator to adjust its client market policies.

Figure 7: Some defined communities in the telecommunication call network

6. CONCLUSIONS
In this paper, we have followed a different track by proposing a new method, ComTector, for community discovery in large-scale social networks. Directly based on the overlapping nature of communities in the real world, ComTector can extract meaningful results on networks whose community structures are known beforehand. Our method consists of three critical steps. In the first step, we adopt an efficient algorithm to enumerate all maximal cliques in the given network. The overlap of several maximal cliques corresponds to the various relationships and connections each individual may participate in directly. The gathering of maximal cliques forms the kernels of every potential community. In the second step, we use an agglomerative technique to iteratively add the remaining vertices to their closest kernels based on the relative degree metric. In the third step, the initial clustering results are adjusted by merging pairs of fractional communities to achieve a better Network Modularity. The finally obtained community structures, together with the other components, constitute the ultimate partition of the network.
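The second step can be pictured with a small sketch. The definition of relative degree is given earlier in the paper and is not reproduced in this section, so the code below substitutes a simple stand-in: the fraction of a vertex's neighbors that already belong to a community. That scoring rule, the function name `assign_to_kernels`, and the tie-breaking by kernel order are all assumptions made purely for illustration.

```python
def assign_to_kernels(adj, kernels):
    """Iteratively attach unassigned vertices to their closest kernel.

    adj:     dict vertex -> set of neighbours (symmetric)
    kernels: list of vertex sets forming the initial community kernels
    ASSUMPTION: closeness = share of the vertex's neighbours that are
    already in the community (a stand-in for the relative degree).
    Returns (communities, vertices that could not be attached).
    """
    communities = [set(k) for k in kernels]
    remaining = set(adj) - set().union(*communities)
    progress = True
    while remaining and progress:
        progress = False
        for v in sorted(remaining):  # sorted copy: safe to shrink the set
            best, best_score = None, 0.0
            for i, members in enumerate(communities):
                score = len(adj[v] & members) / len(adj[v]) if adj[v] else 0.0
                if score > best_score:
                    best, best_score = i, score
            if best is not None:
                communities[best].add(v)
                remaining.discard(v)
                progress = True
    return communities, remaining


# Two triangle kernels, a vertex 7 tied to the first, a pendant 8, an isolate 9.
adj = {1: {2, 3, 7}, 2: {1, 3, 7}, 3: {1, 2, 4}, 4: {3, 5, 6},
       5: {4, 6}, 6: {4, 5}, 7: {1, 2, 8}, 8: {7}, 9: set()}
print(assign_to_kernels(adj, [{1, 2, 3}, {4, 5, 6}]))
```

Vertices that have no connection to any community, such as the isolated vertex in the example, are returned separately, in line with the statement that other components are kept alongside the detected communities.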

We have demonstrated the efficiency and utility of ComTector on a number of practical examples. Experimental results on real-life networks show that ComTector can extract meaningful communities that agree with both objective facts and our intuitions. Further, we apply ComTector to analyze networks whose structures are otherwise difficult to understand, such as the scientific collaboration and telecommunication call networks. Moreover, we also propose a general naming method to describe and explain the discovered communities by combining the topological information with the entity attributes. Moving from the detailed descriptive information about each community to the high-level abstracted map of their organization, we are able to better understand the global properties of the whole network.

In future work, we will continue our research by focusing on the evolution of the community as well as the network backbone using time series analysis. We will also search for more refined theoretical models to describe the relationship between the knowledge diffusion mechanism and community membership in social networks, in order to gain a deeper understanding of the network dynamics from both the micro and macro perspectives.

7. ACKNOWLEDGMENTS
We thank Yan Xiong, Qi Ye, Chaoqun Xu, and Yerong Wu for their generous help in gathering the collaboration data, and Yi Wang and Xiangang Zhao for the valuable discussions.

8. REFERENCES
[1] J. Abello, M. Resende, and S. Sudarsky. Massive quasi-clique detection. In 5th Latin American Symposium on Theoretical Informatics, pages 598–612, 2002.
[2] C. Bron and J. Kerbosch. Finding all cliques of an undirected graph. Communications of the ACM, 16:575–577, 1973.
[3] O. V. Burton. Computing in the Social Sciences and Humanities. University of Illinois Press, 2006.
[4] A. Clauset, M. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70(066111), 2004.
[5] A. Clauset, M. Newman, and C. Moore. Finding local community structure in networks. Physical Review E, 72(026132), 2004.
[6] L. Donetti and M. Miguel. Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics, pages 100–102, 2004.
[7] N. Du, B. Wu, and B. Wang. A parallel algorithm for enumerating all maximal cliques in complex networks. In The 6th ICDM2006 Mining Complex Data Workshop, pages 320–324. IEEE CS, December 2006.
[8] J. Duch and A. Arenas. Community detection in complex networks using extremal optimization. Physical Review E, 72(027104), August 2005.
[9] M. Girvan and M. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, June 2002.
[10] M. Girvan and M. Newman. Finding and evaluating community structure in networks. Physical Review E, 69(026113), 2004.
[11] I. Gunes and H. Bingol. Community detection in complex networks using agents. In AAMAS2007, 2007.
[12] J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann Publishers, 2006.
[13] R. Milo and S. Itzkovitz. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
[14] M. Newman. Detecting community structure in networks. The European Physical Journal B-Condensed Matter, 38:321–330, 2004.
[15] M. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, June 2006.
[16] G. Palla, I. Derényi, and I. Farkas. Uncovering the overlapping community structure of complex network in nature and society. Nature, 435:814–818, June 2005.
[17] J. Pei, D. Jiang, and A. Zhang. On mining cross-graph quasi-cliques. In The 12th ACM SIGKDD, pages 228–237, 2006.
[18] P. Pons and M. Latapy. Computing communities in large networks using random walks. In ISCIS2005, pages 284–293, 2005.
[19] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. PNAS, 101(9):2658–2663, March 2004.
[20] M. Rosvall and C. T. Bergstrom. An information-theoretic framework for resolving community structure in complex networks. PNAS, 104(18):7327–7331, May 2007.
[21] J. Scott. Social Network Analysis: A Handbook. Sage Publications, London, 2002.
[22] C. Song, M. Havlin, and H. Makse. A self-similarity of complex networks. Nature, 433(7024):392–395, 2005.
[23] A. Vázquez, R. Dobrin, S. Sergi, J. Eckmann, Z. Oltvai, and A. Barabási. The topological relationship between the large-scale attributes and local interaction patterns of complex networks. PNAS, 101(52):17940–17945, 2004.
[24] S. Wasserman and K. Faust. Social Network Analysis. Cambridge University Press, Cambridge, 1994.
[25] B. Wu and X. Pei. A parallel algorithm for enumerating all the maximal k-plexes. In PAKDD07 Workshops, May 2007.
[26] Z. Zeng, J. Wang, and G. Karypis. Coherent closed quasi-clique discovery from large dense graph databases. In The 12th ACM SIGKDD, pages 797–802, 2006.
[27] http://www-personal.umich.edu/~mejn/
[28] http://vlado.fmf.uni-lj.si/pub/networks/pajek/data/gphs.htm
[29] V. Batagelj and A. Mrvar. Some Analyses of Erdos Collaboration Graph. http://vlado.fmf.uni-lj.si/pub/networks/doc/erdos/erdos.pdf
