
Research Article

Neighbor Similarity Based Agglomerative Method for Community Detection in Networks

Jianjun Cheng,1 Xing Su,1 Haijuan Yang,1,2 Longjie Li,1 Jingming Zhang,1 Shiyan Zhao,1 and Xiaoyun Chen1

1 School of Information Science and Engineering, Lanzhou University, China
2 Department of Electronic Information Engineering, Lanzhou Vocational Technical College, China

Correspondence should be addressed to Jianjun Cheng; chengjianjun@lzu.edu.cn

Received 27 December 2018; Revised 15 March 2019; Accepted 11 April 2019; Published 2 May 2019

Academic Editor: Guang Li

Copyright © 2019 Jianjun Cheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Community structures can reveal organizations and functional properties of complex networks; hence, detecting communities from networks is of great importance. With the surge of large networks in recent years, efficient community detection has become critical, and many local methods have emerged. In this paper, we propose a node similarity based community detection method, which is also a local one, consisting of two phases. In the first phase, we take the node with the largest degree out of the network as the exemplar of the first community and insert its most similar neighbor into that community as well. Then, the node with the largest degree among the remaining nodes is selected; if its most similar neighbor has not been classified into any community yet, we create a new community for the selected node and its most similar neighbor. Otherwise, if its most similar neighbor has already been classified into a community, we insert the selected node into the community to which its most similar neighbor belongs. This procedure is repeated until every node in the network is assigned to a community; at that point, we obtain a series of preliminary communities. However, some of them might be too small or too sparse: the edges connecting them to the outside might outnumber the edges inside them. Keeping them as the final communities would lead to a low-quality community structure. Therefore, in the second phase, we merge some of them efficiently to improve the quality of the resulting community structure. To test the performance of the proposed method, extensive experiments are performed on both artificial and real-world networks. The results show that the proposed method detects high-quality community structures steadily and efficiently and significantly outperforms the comparison algorithms.

1. Introduction

Many real-world systems can be abstracted as complex networks, in which nodes represent entities in the systems and edges correspond to interactions between the entities. One of the most significant characteristics observed in these complex networks is the "community structure," which means that nodes in the network can be divided naturally into groups: nodes in the same group are connected densely, and connections across different groups are relatively sparse. Each of the node groups is a so-called "community."

Communities are always related to functional modules of networks. For instance, communities can be groups of web pages in WWW networks [1] or scientific papers in citation networks [2] sharing the same topics, books with the same political orientation copurchased from the online bookseller Amazon.com [3], or pathways or complexes in metabolic networks or protein-protein interaction networks [4, 5]. In social networks, communities often correspond to real social groupings with the same interests or professional occupations, e.g., scientist groups classified according to the scientists' specialties in coauthorship collaboration networks [6, 7], jazz musician groups divided according to location and race [8], or affiliations of gang members in the policing area of Hollenbeck, Los Angeles [9]. Besides, some research has indicated that networks can present quite different properties when considered at the community level rather than from the perspective of the entire network or the individual node [10].

Hindawi, Complexity, Volume 2019, Article ID 8292485, 16 pages. https://doi.org/10.1155/2019/8292485


Therefore, analyzing the community structures in networks facilitates recognizing the characteristics of networks and, further, predicting the functional properties of the corresponding systems. That is to say, community detection provides an effective means of studying the functional properties of networks by examining their structural characteristics, which is valuable in practical applications. Accordingly, a multitude of methods [11, 12] have been proposed for detecting communities in complex networks; we review some related literature in Section 2.

In this paper, we also propose a community detection method, which is based on node similarity and consists of two phases. The first phase repeatedly selects the node with the largest degree in the remainder of the network and, depending on the community affiliation of its most similar neighbor, either takes it as the exemplar of a new community or inserts it into the community to which its most similar neighbor belongs. At the end of this phase, we get a series of communities. However, they are only preliminary communities; some of them might be too small or too sparse, with the edges connecting them to the outside far outnumbering the edges inside them. Accepting them as the final communities would lead to a low-quality community structure. Therefore, the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.

The main contributions of this work can be summarized as follows:

(i) We propose a node similarity based local algorithm, abbreviated as NSA, for community detection, which is a two-phase method. The first phase obtains the preliminary communities, and the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.

(ii) We propose an index, the community metric, to measure the sparsity or smallness of a community. In the second phase, we use this index as the criterion to determine which preliminary communities need to be merged.

(iii) Extensive experiments on artificial networks and real-world networks are carried out to test the performance of the proposed method. The experimental results show that the proposed method is consistently promising in both performance and time complexity and outperforms its competitors.

The remainder of this paper is organized as follows. Section 2 reviews some literature on community detection. The details of the proposed algorithm are elaborated in Section 3. The experimental results and analysis on both artificial networks and real-world networks are presented in Section 4. In Section 5, we discuss how to set the optimal value of a parameter introduced in the proposed method, and the paper ends with a conclusion in Section 6.

2. Related Work

A great number of community detection methods have been proposed in the last decade; these methods explore communities in networks from various perspectives. The graph theory-based methods treat community detection as the traditional task of graph partitioning and divide the network into subnetworks. Kernighan-Lin [13] is a representative method of this kind, which first partitions the network into two arbitrary subnetworks and then repeatedly swaps nodes between the two subnetworks to maximize a predefined gain function.

The hierarchical clustering methods reveal multilevel community structures in divisive, agglomerative, or hybrid ways. For example, the GN algorithm [6, 7] detects communities by repeatedly removing the edge with the largest betweenness from the network; its output is a dendrogram representing the nested hierarchy of possible community structures of the network, and the level corresponding to the largest value of a measure called modularity [7] is taken as the final result. The FastQ algorithm [23, 24] first takes each node in the network as a community and then repeatedly merges two of them into one. Its output is also a dendrogram, depicting the merge procedure of possible community hierarchies. Zarandi et al. [25] randomly removed some edges with low similarity to obtain disconnected components as the primary communities and then merged some of them to get the resulting community structure.

The modularity optimization-based algorithms detect community structures by exploiting the physical meaning of modularity (the higher the value of modularity, the better the community structure) and taking modularity as the objective to optimize. For instance, in order to maximize the modularity of the community structure, FastQ [23, 24] joins, in each iteration, the pair of communities whose merge leads to the largest modularity increment. The Louvain algorithm [26] uses a node-moving strategy to extract a community structure with optimized modularity. It also begins with an initial partition in which each node is a community; then, for each node, the algorithm evaluates the modularity gain of moving it into the community of each of its neighbors and moves the node into the community with the largest positive modularity gain. The SLM (Smart Local Moving) algorithm [27] searches for possibilities of increasing modularity by both splitting communities and moving sets of nodes from one community to another.
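This section refers to modularity without stating its formula. The sketch below assumes the standard Newman-Girvan definition for an undirected, unweighted graph, Q = sum over communities c of l_c/m - (d_c/(2m))^2, where m is the number of edges, l_c the number of edges inside c, and d_c the total degree of the nodes in c. The function name `modularity` and the edge-list input format are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    communities: list of disjoint node sets covering every node.
    """
    m = len(edges)
    label = {node: idx for idx, comm in enumerate(communities) for node in comm}
    inner = defaultdict(int)    # l_c: edges with both endpoints in community c
    degree = defaultdict(int)   # d_c: sum of node degrees in community c
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inner[label[u]] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by the single edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
q_split = modularity(edges, [{1, 2, 3}, {4, 5, 6}])    # 5/14, about 0.357
q_whole = modularity(edges, [{1, 2, 3, 4, 5, 6}])      # 0.0
```

Splitting the toy graph into its two triangles scores higher than keeping it as one community, which is the sense in which a higher modularity indicates a better community structure.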

LPA (Label Propagation Algorithm) [28] utilizes an information propagation mechanism to detect communities in networks. Every node in the network is initialized with a unique label, and all nodes are first arranged in a random order; then each node, in that order, updates its label to the one occurring most frequently among its neighbors. This label update procedure ends when every node in the network has a label that is the majority one among its neighbors, and nodes with the same label form a community. Owing to its simplicity and high efficiency, several variants have been derived from LPA. Barber et al. [29] proposed a series of algorithms that propagate labels under some constraints; LPAm is the most famous one, which tries to maximize the modularity during the label propagation procedure. Chin et al. [30] first identified the main communities using the number of mutual neighboring nodes; then they attached some independent constraints to the basic LPA and used the constrained LPA to add the remaining nodes into communities; finally, they used a node-moving strategy like the one employed in Louvain to refine the quality of the resulting community structure. Ding et al. [31] developed a modified version of LPA, which exploits the idea of density peak clustering [32] and the Chebyshev inequality to choose community centers from the network and then propagates the labels of the selected centers to the whole network with the proposed multistrategy label propagation.
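The basic label propagation idea described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the reference implementation of [28]: the function name `label_propagation`, the `seed` and `max_sweeps` parameters, and the rule that a node keeps its current label when that label is already among the most frequent ones are all choices made here for the example.

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """adj: {node: set of neighbors} for an undirected graph."""
    rng = random.Random(seed)
    labels = {u: u for u in adj}      # every node starts with a unique label
    order = list(adj)
    rng.shuffle(order)                # nodes are arranged in a random order first
    for _ in range(max_sweeps):
        changed = False
        for u in order:
            counts = Counter(labels[v] for v in adj[u])
            if not counts:            # isolated node: nothing to adopt
                continue
            top = max(counts.values())
            best = [lab for lab, c in counts.items() if c == top]
            if labels[u] not in best:
                labels[u] = rng.choice(best)   # ties are broken at random
                changed = True
        if not changed:               # every label is a majority label: stop
            break
    groups = {}
    for u, lab in labels.items():
        groups.setdefault(lab, set()).add(u)
    return list(groups.values())


# Two disjoint triangles: each one collapses to a single label.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
found = label_propagation(adj)
```

Each update reads only the labels of a node's neighbors, which is why the text later counts LPA among the local methods.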

Density-based methods define and utilize the concept of density for nodes or communities in networks to uncover community structures. SCAN [33] borrows the idea of the classical density-based clustering algorithm DBSCAN [34] to reveal communities, hubs, and outliers in networks. SCAN++ [35] is a derivative of SCAN; it reduces time consumption by introducing a new data structure and reducing the number of density evaluations in the detection procedure. IsoFdp [36] maps the network nodes as data points into a low-dimensional manifold and then exploits the density peak clustering algorithm [32] to extract the final community structure. The LCCD algorithm [37] also follows the approach of the density peak clustering algorithm [32] to locate the structural centers of networks and then expands communities from the identified centers to the borders using a local search procedure.

Network dynamic-based methods explore community structures by simulating dynamic processes in networks. Random walk is a typical dynamic process carried out in networks; random walk-based methods detect communities by utilizing the tendency of a walker, during a short walk, to be trapped in a community rather than to cross the community border into another community. WalkTrap [38] uses random walks to calculate the probability of going from one node to another during a short walk and then calculates a distance to measure the similarities between nodes and between communities. The PPC algorithm [39] initially considers the network as a single community and recursively partitions each community using node similarities computed from random walks until further partitioning cannot achieve a better value of modularity. RWA [40] employs random walks to calculate the probability of a node belonging to a community, and each community is expanded by repeatedly attracting the node that is most likely to belong to it. Besides, Attractor [41] utilizes distance dynamics to explore communities in networks: node interactions change the distances among nodes, and the distance changes in turn affect the interactions. Under such interplay, members of the same community gradually move together, and nodes in different communities steadily stay far away from each other. BiAttractor [42] extends the concept of distance dynamics and the idea of Attractor to bipartite networks to detect two-mode communities.

Spectral methods use the eigenspectra of various network-associated matrices to extract communities. For example, Amini et al. [43] found the initial node partitions using a spectral clustering method based on the normalized Laplacian matrix derived from a regularized adjacency matrix; those partitions were used for fitting a stochastic block model by a pseudolikelihood algorithm to detect the resulting community structure. Siemon C. de Lange et al. [44] identified an integrative community structure in the macroscopic anatomical neural networks of the macaque and cat and in the microscopic network of C. elegans by examining the spectra of their normalized Laplacian matrices. Krzakala et al. [45] produced a class of spectral algorithms that detect communities based on the nonbacktracking matrix, which describes a nonbacktracking walk on the directed edges of the network. Shi et al. [46] proposed a spectral community detection method, LLSA, which employs the Lanczos method to obtain an approximation of the eigenvector of the transition matrix with the largest eigenvalue; the elements of this eigenvector approximately indicate the affiliation probabilities of the corresponding nodes to the communities.

Most of the methods mentioned above are global ones: they often depend on some global information as prior knowledge, such as the number of communities or information about eigenvalues or eigenvectors, which is hard to acquire as the networks involved grow larger and larger. Moreover, most of them are computationally demanding, leading to high time complexity. These limitations prevent them from being applied to large-scale applications. To overcome the deficiencies of the global algorithms, many local methods have been proposed, including some of the aforementioned methods. For example, LPA and most of its variants determine which label a node should adopt according to its neighborhood only. LCCD takes into account both the local density of nodes and the relative distance between nodes to locate the local structural centers and expands communities from the structural centers with a local search procedure. LLSA applies fast heat kernel diffusion to sample a small subnetwork including almost all members of a community, and the eigenvector whose elements indicate the nodes' community memberships is obtained by performing the Lanczos method on the sampled subnetwork.

Besides, the ComSim algorithm [47] identifies the cores of communities in bipartite networks by seeking cycles, which are node chains formed by following outgoing links until reaching a node already visited, and then allocates each remaining node to the community that maximizes the similarity between the node and the community. In the BLI algorithm [48], local clustering information and local structural similarity are employed to establish the primary community structure; then small-scale communities whose sizes are smaller than a given threshold $\lambda$ are absorbed by larger ones. kSIM [49] is also a local method that works in a bottom-up way. At the beginning, each node is taken as a community; then the preliminary communities are formed by identifying, for each node, the neighbor community to which the lowest-degree one of its $k$ most similar neighbors belongs and assigning the node to that community. In this procedure, the common neighbor index is employed as the similarity measure for each pair of nodes.

Compared with the global methods, these local methods show good performance on large-scale networks. Inspired by this, we also propose a local method to extract communities from networks. The proposed method is based on node similarity and is termed NSA (Node Similarity based Algorithm) for short. It comprises two phases: the first phase constructs the preliminary community structure, and the second phase improves the quality of the final result by merging some small or sparse communities. To do so, we also propose a measure, the community metric, to evaluate the sparsity or smallness of communities. The details of the proposed method are elaborated in the next section.

3. The Proposed Method

3.1. The Framework of the Proposed Method. The framework of the proposed method is outlined by the pseudocode listed in Algorithm 1.

Input: $G(V, E)$: the network; $\delta$: the community metric threshold
Output: $CS$: the detected community structure
    /* form the preliminary community structure $CS_{pre}$ */
1   $CS_{pre} \leftarrow$ FPC($G$)
    /* merge small or sparse communities in $CS_{pre}$ */
2   $CS \leftarrow$ PCM($CS_{pre}$, $\delta$)
3   return $CS$

Algorithm 1: The framework of our proposed method NSA.

As mentioned previously, the proposed method consists of two phases, which are implemented by the function calls FPC() and PCM(), respectively. The former establishes the preliminary community structure based on a node selection strategy and the node similarity; the latter merges some small or sparse communities to improve the quality of the resulting community structure. The inputs of this algorithm are the network and a threshold $\delta$. The network considered in this paper is an undirected and unweighted graph, represented as $G(V, E)$ as in Algorithm 1, where $V$ and $E$ are the node set and the edge set, respectively, and $|V| = n$ and $|E| = m$ are the numbers of nodes and edges in the network. The threshold $\delta$ is used in the second phase of the proposed method to identify the communities to be merged: a community whose community metric is smaller than $\delta$ should be merged into another one. The output of this algorithm is the detected community structure.

The next two subsections describe the two procedures in detail.

3.2. Formation of the Preliminary Community Structure. The function FPC() implements the first phase of the proposed method, whose purpose is to construct the preliminary community structure from the network. We first pick the node with the largest degree in the network, take it as the exemplar of the first community, and insert its most similar neighbor into the community as well. (If more than one node has the largest degree, we arbitrarily select one of them as the exemplar; if the exemplar has more than one most similar neighbor, the one with the smallest degree is selected.) Afterwards, the next largest-degree node in the remainder of the network is selected. If its most similar neighbor has not been classified into any community yet, we create a new community for it and its most similar neighbor. Otherwise, if its most similar neighbor has been assigned to a certain community (e.g., the one denoted as $C_k$), we insert the selected node into that community (i.e., $C_k$) as well. We repeat this process until every node is classified into a community. In this procedure, densely connected nodes can quickly gather around the exemplars to form communities. At the end of this procedure, we get a series of communities, which constitute the preliminary community structure of the network. The pseudocode describing the entire procedure is listed in Algorithm 2.

In this algorithm, the degree of node $u$ is the number of $u$'s neighbors and is denoted as $d_u$, i.e.,

$$d_u = |\Gamma(u)|, \tag{1}$$

where

$$\Gamma(u) = \{ v \mid (u, v) \in E,\ v \in V \} \tag{2}$$

is the set of neighbors of node $u$. $\mathrm{sim}(u, v)$ stands for the similarity between nodes $u$ and $v$. There are abundant ways to calculate the similarity between nodes in a network, and any of them can be employed in principle. However, for the sake of efficiency, we calculate it with the following equation, which involves only the neighborhoods of nodes $u$ and $v$ themselves:

$$\mathrm{sim}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}. \tag{3}$$
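Equations (1) through (3) translate directly into set operations. The following Python sketch assumes the graph is stored as a dictionary mapping each node to the set of its neighbors; the names `degree` and `sim` and the convention of returning 0 when both neighborhoods are empty are assumptions made for this example.

```python
def degree(adj, u):
    """Equation (1): d_u = |Gamma(u)|."""
    return len(adj[u])


def sim(adj, u, v):
    """Equation (3): |Gamma(u) & Gamma(v)| / |Gamma(u) | Gamma(v)|."""
    union = adj[u] | adj[v]
    if not union:          # assumption: two isolated nodes have similarity 0
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


# Two triangles {1, 2, 3} and {4, 5, 6} joined by the edge (3, 4).
adj = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
s12 = sim(adj, 1, 2)   # share neighbor 3 out of {1, 2, 3}: 1/3
s13 = sim(adj, 1, 3)   # share neighbor 2 out of {1, 2, 3, 4}: 1/4
s34 = sim(adj, 3, 4)   # no shared neighbor across the bridge: 0
```

Because $\Gamma(u)$ does not contain $u$ itself, two adjacent nodes with no common neighbor, such as 3 and 4 here, get similarity 0 even though they are linked.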

Input: $G(V, E)$: the network
Output: $CS_{pre} = \{C_1, C_2, \cdots, C_k\}$: the identified preliminary community structure
1   Initialize variables $U$ and $CS_{pre}$, which are used to record the unclassified nodes and the preliminary community structure:
        $U \leftarrow V$; $CS_{pre} \leftarrow \phi$
2   Select the node with the largest degree; denote it as $v$:
        $v \leftarrow \arg\max_u \{ d_u \mid u \in U \}$
3   Get the most similar neighbor of $v$; denote it as $w$:
        $w \leftarrow \arg\max_u \{ \mathrm{sim}(v, u) \mid u \in \Gamma(v) \}$
4   if $w$ has not been assigned to any community then
5       Create a new community for nodes $v$ and $w$:
            $K \leftarrow |CS_{pre}|$; $C_{K+1} \leftarrow \{v, w\}$
6       Insert the created community into the community structure:
            $CS_{pre} \leftarrow CS_{pre} \cup \{C_{K+1}\}$
7       Remove nodes $v$ and $w$ from $U$, as they are classified:
            $U \leftarrow U - \{v, w\}$
8   else
9       Find the community to which $w$ belongs; denote it as $C_k$:
            $k \leftarrow$ locate($CS_{pre}$, $w$)
10      Insert node $v$ into $C_k$:
            $C_k \leftarrow C_k \cup \{v\}$
11      Remove node $v$ from $U$, as it is classified:
            $U \leftarrow U - \{v\}$
12  Repeat steps 2 through 11 until $U = \phi$
13  return $CS_{pre}$

Algorithm 2: FPC($G$): forming the preliminary community structure.

The variables $U$ and $CS_{pre}$ are used to record the unclassified nodes and the preliminary community structure; they are naturally initialized to the original node set $V$ of network $G$ and an empty set $\phi$, respectively, in step 1. Steps 2 and 3 select the node with the largest degree from the remainder of the network and its most similar neighbor and denote them as $v$ and $w$, respectively. Step 4 determines whether $w$ has been assigned to a community or not. If it has not been classified into any community yet, steps 5 and 6 create a new community for nodes $v$ and $w$ and insert the newly created community into $CS_{pre}$; then step 7 removes nodes $v$ and $w$ from $U$, as they have just been classified into the new community. If node $w$ has already been assigned to a community, step 9 finds the community $C_k$ to which $w$, the most similar neighbor of node $v$, belongs, and step 10 inserts node $v$ into community $C_k$. Since node $v$ has been assigned to community $C_k$, step 11 removes it from $U$. Step 12 repeats the operations in steps 2 through 11 until $U = \phi$, meaning that all the nodes in the network have been visited. At that time, the preliminary community structure is held in $CS_{pre}$ and is returned as the output of this algorithm in step 13.
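The first phase can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not the authors' code. The names `build_adjacency`, `fpc`, and `community_of` are hypothetical; ties in degree are broken here by the smallest node identifier (the paper selects arbitrarily), ties in similarity by the smallest degree as the text specifies and then by identifier, and an isolated node, which Algorithm 2 does not address, is assumed to form a singleton community.

```python
def build_adjacency(edges):
    """Undirected, unweighted graph as {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def sim(adj, u, v):
    """Equation (3): shared neighbors over all neighbors of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def fpc(adj):
    """Sketch of Algorithm 2: preliminary communities as a list of sets."""
    unclassified = set(adj)   # U
    communities = []          # CS_pre
    community_of = {}         # node -> community index, standing in for locate()
    while unclassified:       # step 12
        # Step 2: largest degree first; smallest id wins a tie (assumption).
        v = min(unclassified, key=lambda u: (-len(adj[u]), u))
        if not adj[v]:
            # Assumption: an isolated node becomes a singleton community.
            community_of[v] = len(communities)
            communities.append({v})
            unclassified.discard(v)
            continue
        # Step 3: most similar neighbor; ties go to smallest degree, then id.
        w = min(adj[v], key=lambda u: (-sim(adj, v, u), len(adj[u]), u))
        if w not in community_of:                  # steps 4-7
            community_of[v] = community_of[w] = len(communities)
            communities.append({v, w})
            unclassified -= {v, w}
        else:                                      # steps 9-11
            k = community_of[w]
            communities[k].add(v)
            community_of[v] = k
            unclassified.discard(v)
    return communities


# Two triangles joined by the edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
preliminary = fpc(build_adjacency(edges))   # [{1, 2, 3}, {4, 5, 6}]
```

On this toy graph, nodes 3 and 4 are selected first and seed the communities {3, 1} and {4, 5}; nodes 2 and 6 then join the communities of their most similar neighbors, so each triangle ends up as one community. Keeping a node-to-community dictionary makes the lookup in step 9 a constant-time operation.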

To make it clearer, we take Zachary's karate club network [14] as an example to illustrate the procedure intuitively. This is a network with 34 nodes and 78 edges, as shown in Figure 1(a), in which the node with the largest degree is node '34' and its most similar neighbor is node '33'. Therefore, node '34' is taken as the exemplar of the first community, and node '33' is also inserted into this community. Then the node with the largest degree among the remaining nodes is node '1'; its most similar neighbor is node '2'. Since node '2' has not been assigned to a community yet, we create a new community, take node '1' as its exemplar, and insert node '2' into the new community as well. The same thing happens to node pairs ('3', '4'), ('32', '29'), and ('9', '31') sequentially. Then the next largest-degree node is '14'; its most similar neighbor, node '4', is already in the third community; therefore, we insert node '14' into the third community. All of the other nodes are processed in the same way; in the subsequent operations, node pairs ('24', '30'), ('6', '7'), ('5', '11'), and ('25', '26') form new communities, and all of the remaining nodes are inserted into the communities to which their most similar neighbors belong. At the end of the process, we obtain the preliminary community structure shown in Figure 1(b), in which each node connects to its most similar neighbor with a directed edge.

3.3. Merge of Small or Sparse Communities. At the end of the first phase of our proposed method, we obtain the preliminary community structure. However, some communities are either too small or too sparse to make sense, like the preliminary communities {'5', '11'}, {'9', '31'}, {'32', '29'}, {'25', '26', '28'}, {'24', '30', '27'}, and {'6', '7', '17'} in Figure 1(b): each of them contains only a few nodes, its inside edges are very sparse, and the number of edges inside it is much smaller than the number of edges connecting it to the outside, violating the characteristic that connections inside one community are much denser than those across different communities. Keeping them in the final community structure would lead to low quality. Therefore, in the second phase, we merge some of the preliminary communities to acquire the final result, which is carried out by the function call PCM() in Algorithm 1.

To this end, two problems need to be solved in PCM(). The first is to identify which communities are small or sparse enough that they need to be merged into other ones; the second is to select the community into which each of the small or sparse communities should be merged.

Figure 1: The procedure of FPC() on the karate club network. Panels (a) and (b).

For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index in Definitions 1 through 3.

Definition 1 (community sparsity). The sparsity of community $C_i$ is defined as follows:

$$\alpha_i = \frac{|E_i^{in}|}{|E_i^{out}|}, \tag{4}$$

where $E_i^{in}$ is the set of edges within community $C_i$, and $E_i^{out}$ is the set of edges connecting nodes in community $C_i$ with other communities.

That is to say, the sparsity of community $C_i$ is defined as the ratio between the number of inner edges of $C_i$ and the number of outer edges of $C_i$. Obviously, the more edges exist within community $C_i$ relative to its outer edges, the larger the value of $\alpha_i$ will be, and vice versa; a small $\alpha_i$ therefore indicates a sparse community.

Definition 2 (community scale). The scale of community $C_i$ is formalized as follows:

$$\beta_i = \frac{|V_i|}{|V|}, \tag{5}$$

where $V_i$ is the set of nodes in community $C_i$.

Obviously, the scale of community $C_i$ is defined as the ratio of the number of nodes in $C_i$ to the total number of nodes in the network. The more nodes there are in community $C_i$, the larger the ratio will be, and vice versa.

Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community $C_i$ as follows:

$$\gamma_i = \alpha_i \times \beta_i. \tag{6}$$

On the basis of these definitions, the first problem can be solved by setting a community metric threshold $\delta$. That is to say, if $\gamma_i < \delta$, community $C_i$ needs to be merged into another community.
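Definitions 1 through 3 translate directly into a few lines of code. The sketch below is illustrative only: the function name `community_metric` and the adjacency-dictionary representation are assumptions, and a community without any outer edge is given an infinite sparsity value, a case the definitions leave open.

```python
def community_metric(adj, community):
    """Return (alpha, beta, gamma) of Definitions 1-3 for one community.

    adj maps each node to the set of its neighbors (undirected graph);
    community is an iterable of nodes.
    """
    members = set(community)
    inner_ends = 0  # every inner edge is seen once from each endpoint
    outer = 0       # every outer edge is seen once, from its inside endpoint
    for u in members:
        for v in adj[u]:
            if v in members:
                inner_ends += 1
            else:
                outer += 1
    inner = inner_ends // 2
    # Assumption: with no outer edges the ratio is undefined; use infinity so
    # that such a community is never selected for merging.
    alpha = inner / outer if outer else float("inf")
    beta = len(members) / len(adj)
    return alpha, beta, alpha * beta


# Toy input: two triangles joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(community_metric(adj, {0, 1, 2}))  # a dense triangle
print(community_metric(adj, {3}))        # a singleton with no inner edges
```

With a threshold such as $\delta = 0.1$, the triangle would be kept, whereas the singleton, whose metric is zero, would be flagged for merging.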

For the second problem, we consider a strategy conforming to the construction of the preliminary communities. The preliminary communities are formed based mainly on node similarity in the first phase; therefore, we also use similarity as the criterion here to merge communities, i.e., each of the small or sparse communities is merged into its most similar adjacent community. Here, the similarity between two communities $C_i$ and $C_j$ is calculated as follows:

$$Sim(C_i, C_j) = \frac{\sum_{u \in C_i,\, v \in C_j} sim(u, v)}{|C_j|}, \tag{7}$$

where $sim(u, v)$ is the similarity between nodes $u \in C_i$ and $v \in C_j$, which is calculated using (3). In the function PCM(), which implements the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator of the right-hand side of (7) is the sum of similarities between nodes in communities $C_i$ and $C_j$. Dividing by the denominator $|C_j|$ limits the priority given to larger communities, which prevents some giant communities from forming.

The logic of the entire procedure of the second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure $CS_{pre}$ in step 1. Step 2 calculates the community metric for each of the preliminary communities; steps 3 and 4 select the community with the smallest community metric and its most similar community; step 5 merges them to yield a new community, and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_t$ and $C_j$ with that new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the minimal community metric, that of the selected community, is larger than the given threshold $\delta$, meaning that all the remaining communities are satisfactory; therefore, the merge procedure is terminated, and the resulting community structure in $CS$ is returned in step 9.


Input: $CS_{pre}$, the preliminary community structure; $\delta$, the community-metric threshold.
Output: $CS$, the final community structure.

1. Initialize $CS$, which is used to record the community structure:
       $CS \leftarrow CS_{pre}$
2. Calculate the community metric for each of the preliminary communities:
       foreach $C_i \in CS$ do
           $\gamma_i \leftarrow \alpha_i \times \beta_i$
3. Select the community with the minimal community metric; denote its index as $t$:
       $t \leftarrow \arg\min_i \{\gamma_i \mid i = 1, 2, \cdots, |CS|\}$
4. Identify the community most similar to $C_t$; denote its index as $j$:
       $j \leftarrow \arg\max_i \{Sim(C_t, C_i) \mid i = 1, 2, \cdots, |CS|,\ i \neq t\}$
5. Merge communities $C_t$ and $C_j$ to form a new community:
       $k \leftarrow |CS|$; $C_{k+1} \leftarrow C_t \cup C_j$
6. Calculate the community metric for the new community:
       $\gamma_{k+1} \leftarrow \alpha_{k+1} \times \beta_{k+1}$
7. Replace the two communities $C_t$ and $C_j$ with the new community to reflect the merging effect:
       $CS = (CS - \{C_t, C_j\}) \cup \{C_{k+1}\}$
8. Repeat steps 3 through 7 until $\gamma_t > \delta$
9. return $CS$

Algorithm 3: PCM($CS_{pre}$, $\delta$): merge small or sparse communities.
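The merge loop of Algorithm 3 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' code: the names `pcm` and `jaccard` are hypothetical, the node similarity of (3) is replaced by a Jaccard index over closed neighborhoods, the candidates in step 4 are restricted to adjacent communities as the prose describes, and the community metrics are recomputed in every iteration instead of being kept in a priority queue, so the sketch does not attain the running time discussed in the paper.

```python
def jaccard(adj, u, v):
    """Stand-in node similarity (an assumption, not equation (3))."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def pcm(adj, cs_pre, delta, sim=jaccard):
    """Merge small or sparse communities until every metric exceeds delta."""
    n = len(adj)
    cs = [set(c) for c in cs_pre]  # step 1: start from the preliminary structure

    def gamma(c):
        # Community metric (6) = sparsity (4) * scale (5).
        inner = sum(1 for u in c for v in adj[u] if v in c) // 2
        outer = sum(1 for u in c for v in adj[u] if v not in c)
        alpha = inner / outer if outer else float("inf")  # assumption
        return alpha * len(c) / n

    def community_sim(ci, cj):
        # Equation (7): summed pairwise similarity, normalized by |C_j|.
        return sum(sim(adj, u, v) for u in ci for v in cj) / len(cj)

    def adjacent(ci, cj):
        return any(v in cj for u in ci for v in adj[u])

    while len(cs) > 1:
        # Steps 2-3: community with the minimal metric.
        t = min(range(len(cs)), key=lambda i: gamma(cs[i]))
        if gamma(cs[t]) > delta:
            break  # step 8: all remaining communities are satisfactory
        # Step 4: most similar adjacent community.
        candidates = [i for i in range(len(cs))
                      if i != t and adjacent(cs[t], cs[i])]
        if not candidates:
            break  # nothing adjacent to merge with
        j = max(candidates, key=lambda i: community_sim(cs[t], cs[i]))
        # Steps 5-7: replace C_t and C_j by their union.
        merged = cs[t] | cs[j]
        cs = [c for i, c in enumerate(cs) if i not in (t, j)] + [merged]
    return cs  # step 9


# Toy input: two triangles joined by the bridge 2-3, with node 5 left alone.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(pcm(adj, [{0, 1, 2}, {3, 4}, {5}], delta=0.1))
```

In the toy run, the singleton containing node 5 has a metric of zero and is absorbed by its only adjacent community, after which both remaining communities exceed the threshold and the loop stops.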

3.4. Time Complexity. The proposed algorithm comprises two phases. The first forms the preliminary communities. The main time consumption in this phase is on the selection of the node with the largest degree (step 2 in Algorithm 2) and of its most similar neighbor (step 3 in Algorithm 2); the former can be accomplished in $O(\log n)$ in each iteration using a max-heap data structure, and the latter can be done in $O(\log \langle d \rangle)$ with a max-heap, where $\langle d \rangle$ is the average degree of nodes in the network. Since $\langle d \rangle \ll n$, the time consumption of the first phase is $O(n \log n)$.

The second phase is used to improve the quality of the resulting community structure by merging some of the small or sparse communities. The major time is spent on determining the community that needs to be merged and its most similar adjacent community in each iteration. Assuming there are $K$ communities in the preliminary community structure, the former operation can be implemented in $O(\log K)$, and the latter can also be carried out with $O(\log K)$ time consumption in the worst case. Hence, the second phase can be implemented with $O(K \log K)$ time consumption.

Since $K \ll n$, the $O(K \log K)$ cost of the second phase is dominated by the $O(n \log n)$ cost of the first phase. Therefore, the proposed method can detect communities from networks with a relatively high efficiency, at an overall time complexity of $O(n \log n)$.

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we have conducted extensive experiments on both several groups of artificial networks and some real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which works with some parameters to control the characteristics of the generated networks. Here, we consider the influences of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities and with big communities, and larger networks with small communities and with big communities. Each of the small networks contains 1000 nodes, and each of the larger networks contains 5000 nodes; a small community contains at least 10 and at most 50 nodes, and the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small communities and big communities are marked using the suffixes 's' and 'b', respectively. The exponents of the power-law distributions that node degree and community size follow are the default values $-2$ and $-1$, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed experiments on 13 real-world networks; the size of these networks ranges from tens to hundreds of thousands of nodes, and the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second one contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results to those of 5 popular community detection algorithms, namely, FastQ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of the evaluation metrics as its resulting metric value for that network. For our proposed method, NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ are discussed in Section 5.

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are


Table 1: The parameters used to generate the LFR networks. In the header row of this table, n is the number of nodes contained in the network; <d> and d_max are the average degree and the max degree, respectively; exp_d and exp_com are the exponents of the power-law distributions that node degree and community size follow; min(C_i) and max(C_i) represent the minimal and maximal numbers of nodes contained in every community, respectively.

Network    n     <d>  d_max  exp_d  exp_com  min(C_i)  max(C_i)
LFR1000s   1000  20   50     -2     -1       10        50
LFR1000b   1000  20   50     -2     -1       20        100
LFR5000s   5000  20   50     -2     -1       10        50
LFR5000b   5000  20   50     -2     -1       20        100

Table 2: The information about the real-world networks. n and m are the numbers of nodes and edges in the network, respectively.

Network                              n       m
Karate club [14]                     34      78
Dolphin social network [15]          62      159
Risk map [16]                        42      83
Scientists collaboration network [6] 118     197
Lesmis [17]                          77      254
Polbooks [3]                         105     441
ColiNeta [18]                        423     519
NetScience [10]                      1589    2742
Email [19]                           1133    5451
YeastL [20]                          2361    7182
PGP [21]                             10680   24316
DBLP [22]                            317080  1049866
Amazon [22]                          334863  925872

adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\frac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}}\right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left(\frac{n_i^{P}}{n}\right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left(\frac{n_j^{P'}}{n}\right)}, \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$, respectively.

The NMI is an information-theory based metric, which measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value is in the range of [0, 1]; larger is better.
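As a concrete reading of (8), the following sketch computes the NMI from two label lists. The function name `nmi` and the label-list input format are assumptions made for illustration, and returning 1.0 when both partitions consist of a single community (where the formula reduces to 0/0) is a convention chosen here, not something stated in the paper.

```python
from collections import Counter
from math import log


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two partitions, following (8).

    labels_true[k] and labels_pred[k] are the community labels of node k in
    the ground-truth partition P and in the detected partition P'.
    """
    n = len(labels_true)
    n_p = Counter(labels_true)                   # n_i^P  = |P_i|
    n_q = Counter(labels_pred)                   # n_j^P' = |P'_j|
    n_pq = Counter(zip(labels_true, labels_pred))  # n_ij = |P_i ∩ P'_j|

    # Only non-empty intersections contribute to the numerator.
    numerator = -2.0 * sum(
        c * log(c * n / (n_p[a] * n_q[b])) for (a, b), c in n_pq.items()
    )
    denominator = (
        sum(c * log(c / n) for c in n_p.values())
        + sum(c * log(c / n) for c in n_q.values())
    )
    if denominator == 0:
        return 1.0  # assumption: both partitions are one single community
    return numerator / denominator


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, different label names
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent groupings
```

The first call compares two partitions that group the nodes identically, and the second compares two partitions that share no information, which correspond to the two ends of the [0, 1] range.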

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left(e_{ii} - a_i^2\right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$, whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ to the total edges in the network; $K$ is the number of communities in the community structure; $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and the degree distribution are the same as in the original network, but edges are connected between nodes randomly. The smaller the difference between the two terms is, the more the network approaches a random graph, and the weaker the community structure is. On the contrary, the larger the difference between them is, the further the network departs from the random graph, and the stronger the community structure is. That is to say, the modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network; its effective value falls in [0, 1], and higher is better.
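Equation (9) can be evaluated with a short routine such as the one below. The function name `modularity` and the adjacency-dictionary input are assumptions for illustration; $a_i$ is computed as the sum of the degrees of the nodes in $C_i$ divided by twice the number of edges, which is the usual reading of "the fraction of edges associated with nodes in community $C_i$".

```python
def modularity(adj, communities):
    """Modularity Q of equation (9) for a partition of an undirected graph.

    adj maps each node to the set of its neighbors; communities is an
    iterable of disjoint node collections covering all nodes.
    """
    two_m = sum(len(neighbors) for neighbors in adj.values())  # 2 * |E|
    q = 0.0
    for community in communities:
        members = set(community)
        # Edge ends that stay inside the community: twice the inner edges,
        # so dividing by 2m yields e_ii.
        inner_ends = sum(1 for u in members for v in adj[u] if v in members)
        # Sum of degrees of the members; dividing by 2m yields a_i.
        degree_sum = sum(len(adj[u]) for u in members)
        q += inner_ends / two_m - (degree_sum / two_m) ** 2
    return q


# Toy input: two triangles joined by the bridge 2-3 (7 edges in total).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(modularity(adj, [{0, 1, 2}, {3, 4, 5}]))   # the natural split
print(modularity(adj, [{0, 1, 2, 3, 4, 5}]))     # everything in one community
```

For the natural split, each triangle contributes $6/14 - (7/14)^2$, giving $Q = 5/14$; placing all nodes in one community gives $Q = 0$, since $e_{11} = a_1 = 1$.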

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter for this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$ is, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point, above which communities in networks tend to be obscure.

Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities. (Horizontal axis: $\mu$ from 0.1 to 0.8; vertical axis: NMI from 0.0 to 1.0; curves: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).)

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 with an increment of 0.1 for each group of LFR networks. To reduce the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the same setting for the other parameters. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as the input one by one to run our proposed method and the comparison algorithms to detect communities and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized communities and with big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results revealed from the larger networks with small-sized communities and with big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), FastQ tends to introduce mistakes in the results no matter whether communities in the networks are well separated or obscure. As mentioned previously, FastQ is a typical modularity-optimization based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities, and furthermore, its performance is the worst among the comparison algorithms for $\mu \leq 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms in those networks for $\mu < 0.5$, but its performance dropped dramatically for $\mu \geq 0.5$; it could not even detect effective communities from networks for $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels to update their own, which often leads to trivial results in which all nodes are labeled as members of one giant community. The proposed method, NSA, acquired NMI = 1 on all networks for $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures slipped for all those three algorithms and for the proposed method. For $0.5 < \mu \leq 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geq 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar overall to those in Figure 2, but they still differ from each other in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; however, in Figure 3, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performances of the comparison algorithms are not stable on different networks, but our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also manifested by the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves across these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of the resolution limit to some extent.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: ones with

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities. (Horizontal axis: $\mu$ from 0.1 to 0.8; vertical axis: NMI from 0.0 to 1.0; curves: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).)

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

the ground-truth community structure known a priori, and the other ones without publicly acknowledged ground truth.

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. The scales of these networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This is a network depicting the friendships among members of a karate club; it contains 34 nodes and 78 edges. This network was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During the period of Zachary's study, the club split into two factions because of a dispute that arose between the administrator and the instructor. Corresponding to the two factions, the partition of the network into two communities is usually taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. It seems that the detected result deviates from the ground truth in some ways, but this result coincides with the conclusion

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method NSA.

Table 3: The experimental results on networks with ground-truth community structures. The largest value of each metric on each network is marked with an asterisk.

Network     Metric  FastQ   WalkTrap  LPA    Attractor  IsoFdp  NSA
Karate      Q       0.381   0.353     0.355  0.371      0.371   0.402*
            NMI     0.693   0.504     0.62   0.924      1.00*   0.699
Dolphin     Q       0.492   0.489     0.464  0.45       0.505   0.513*
            NMI     0.719   0.632     0.719  0.69       0.744   0.887*
Risk map    Q       0.625*  0.624     0.59   0.598      0.519   0.624
            NMI     0.894*  0.848     0.821  0.839      0.714   0.848
Scientists  Q       0.749*  0.733     0.64   0.694      0.668   0.744
            NMI     0.867   0.818     0.743  0.835      0.823   0.878*

found in the experiments on synthetic networks that our proposed method tends to find small communities in networks and thus overcomes the problem of the resolution limit. Moreover, from the perspective of the evaluation metrics, the modularity corresponding to the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with as large a modularity as possible: if its modularity is not the largest, it is the second largest, with a small gap to the largest. These findings are also manifested in the following networks.

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it identified 4 communities as well, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method from this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is clearly better than those of the comparison algorithms.

Risk Map Network. This network is a world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories of 6 continents are involved. Therefore, 42 nodes and 83 edges connecting adjacent countries or territories are organized in 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result to the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are classified into the wrong communities in the detected result. But nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer than, or


Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest value on each network is marked with an asterisk; "-" indicates that no result is available.

Network     FastQ   WalkTrap  LPA    Attractor  IsoFdp  NSA
Lesmis      0.499   0.519     0.515  0.498      0.491   0.54*
Polbooks    0.502   0.507     0.508  0.501      0.518   0.524*
ColiNeta    0.779*  0.746     0.693  0.718      -       0.761
Email       0.499   0.531     0.379  0.464      0.531   0.544*
NetScience  0.955   0.956     0.896  0.937      -       0.957*
YeastL      0.573   0.529     0.372  0.511      -       0.574*
PGP         0.85    0.789     0.765  0.768      0.726   0.867*
DBLP        0.735   -         0.652  0.637      -       0.782*
Amazon      0.869   -         0.743  0.741      -       0.898*

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method NSA.

even more than, those within the communities to which these nodes belong. Therefore, if we ignore what these nodes actually represent and consider the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three nodes are located within their communities than in the ground truth; thus, these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, the values of both metrics obtained by our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge indicates that the two scientists have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the evaluation metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one that most closely approaches the ground-truth community structure; although the modularity of the proposed method is not the largest, it is second only to that of FastQ. These results also show that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using the modularity only, due to the absence of ground-truth community structures. For the proposed method and the comparison algorithms, the obtained values of modularity are recorded in Table 4. To illustrate them

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method NSA. (Horizontal axis: the networks Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon; vertical axis: modularity $Q$ from 0.0 to 1.0; bars: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).)

intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the only other network, ColiNeta, it still obtained the second largest value of modularity. Although FastQ is based on the modularity optimization strategy, it acquired the largest modularity only on the ColiNeta network. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it could not obtain effective results from the Amazon and DBLP networks due to their large scale. LPA and Attractor can extract community structures from all those networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on the ColiNeta, NetScience, and YeastL networks, as these three networks are disconnected. It cannot detect the community structure from the Amazon and DBLP networks effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and that it outperforms the comparison algorithms significantly.

Figure 9: The setting of parameter $\delta$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network. (Each panel plots the modularity $Q$ on the vertical axis against $\delta$ from 0.00 to 0.30 on the horizontal axis.)

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, in the merge procedure we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.

Therefore, $\delta$ works as a parameter in our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conduct a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in steps of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plot only the portion with $\delta \in [0, 0.3]$, because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method gets the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks to perform the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 in the range $[0.05, 0.2]$ according to the size and the sparsity of the networks involved in real-world applications.
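The parameter study described in this section amounts to a grid search over the threshold. The following is only a minimal sketch of such a sweep, not the authors' experimental code: the names `sweep_delta`, `detect`, and `quality` are hypothetical, with `detect(graph, delta)` standing for any community detection routine that takes the threshold and `quality(graph, partition)` standing for a score such as modularity.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]      # adjacency sets (assumed representation)
Partition = List[Set[Hashable]]            # one node set per community


def sweep_delta(
    graph: Graph,
    detect: Callable[[Graph, float], Partition],   # hypothetical detector
    quality: Callable[[Graph, Partition], float],  # hypothetical score, e.g. modularity
    start: float = 0.0,
    stop: float = 1.0,
    step: float = 0.005,
) -> Tuple[float, List[Tuple[float, float]]]:
    """Evaluate `quality` for every threshold on the grid and return the best one."""
    n_steps = int(round((stop - start) / step))
    results = []
    for k in range(n_steps + 1):
        # Build each grid point from the index to avoid accumulating float error.
        delta = round(start + k * step, 10)
        results.append((delta, quality(graph, detect(graph, delta))))
    # max() keeps the first maximum, i.e. the smallest threshold among ties.
    best_delta, _ = max(results, key=lambda pair: pair[1])
    return best_delta, results
```

With the default arguments the grid has 201 points, from 0 to 1.0 in steps of 0.005, matching the sweep described above; restricting `stop` to 0.3 would reproduce only the range plotted in Figure 9.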

6. Conclusion

In this paper, we presented a novel method to detect communities from networks. It is a local method based on node similarity and overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, on the basis of its most similar neighbor's community assignment, either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID: 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Therefore, analyzing the community structures in networks can facilitate the recognition of the characteristics of networks and further support predictions about the functional properties of the corresponding systems. That is to say, community detection provides us with an effective means of studying the functional properties of networks by examining their structural characteristics, which is valuable in practical applications. Consequently, a multitude of methods [11, 12] have been proposed for detecting communities in complex networks; we review some related literature in Section 2.

In this paper, we also propose a community detection method, which is based on node similarity and consists of two phases. The first phase repeatedly selects the node with the largest degree in the remainder of the network and, according to its most similar neighbor's community affiliation, either takes it as the exemplar of a new community or inserts it into the community to which its most similar neighbor belongs. At the end of this phase, we get a series of communities. However, they are only preliminary communities: some of them might be too small or too sparse, and the edges connecting them to the outside might far outnumber the ones inside them. Accepting them as the final ones would lead to a low-quality community structure. Therefore, the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.

The main contributions of this work can be summarized as follows:

(i) We propose a node similarity based local algorithm, abbreviated as NSA, for community detection, which is a two-phase method. The first phase obtains the preliminary communities, and the second phase merges some of the preliminary communities to improve the quality of the resulting community structure.

(ii) We propose an index, the community metric, to measure the sparsity or smallness of a community. In the second phase, we use the index as a criterion to determine which preliminary communities need to be merged.

(iii) Extensive experiments on some artificial networks and real-world networks are carried out to test the performance of the proposed method. The experimental results show that the performance and the time complexity of the proposed method are steadily promising and that the method outperforms its competitors.

The remainder of this paper is organized as follows. Section 2 reviews some literature about community detection. The details of the proposed algorithm are elaborated in Section 3. The experimental results and analysis on both artificial networks and real-world networks are presented in Section 4. In Section 5, we discuss how to set the optimal value for a parameter introduced in our proposed method, and the paper ends with a conclusion in Section 6.

2. Related Work

A great many community detection methods have been proposed in the last decade; these methods try to explore communities in networks from various perspectives. The graph theory-based methods treat community detection as the traditional task of graph partitioning and divide the network into subnetworks. Kernighan-Lin [13] is a representative method of this kind, which first partitions the network into two arbitrary subnetworks and then repeatedly swaps some nodes between the two subnetworks to maximize a predefined gain function.

The hierarchical clustering methods reveal multilevel community structures in divisive, agglomerative, or hybrid ways. For example, the GN algorithm [6, 7] detects communities by repeatedly removing the edge with the largest betweenness from the network; its output is a dendrogram representing the nested hierarchy of possible community structures of the network, and the level corresponding to the largest value of a measure, modularity [7], is taken as the final result. The FastQ algorithm [23, 24] first takes each node in the network as a community and then repeatedly merges two of them into one. Its output is also a dendrogram, depicting the merge procedure of possible community hierarchies. Zarandi et al. [25] randomly removed some edges with low similarity to obtain some disconnected components as the primary communities and then merged some of them to get the resulting community structure.

The modularity optimization-based algorithms detect community structures from networks by utilizing the physical meaning of modularity (the higher the value of modularity, the better the community structure) and taking the modularity as the objective to optimize. For instance, in order to maximize the modularity of the community structure, FastQ [23, 24] joins, in each iteration, the pair of communities whose merge leads to the largest modularity increment. The Louvain algorithm [26] uses a node-moving strategy to extract a community structure with optimized modularity from the network. It also begins with an initial partition in which each node is a community; then, for each node, the algorithm evaluates the modularity gain of moving it into the community to which each of its neighbors belongs and moves that node into the community with the largest positive modularity gain. The SLM (short for Smart Local Moving) algorithm [27] searches for possibilities of increasing modularity by both splitting communities and moving sets of nodes from one community to another.
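For reference, the quantity these algorithms optimize can be computed directly from a partition. The sketch below uses the standard form of Newman-Girvan modularity for an undirected, unweighted graph, $Q = \sum_c \left[ l_c/m - (d_c/2m)^2 \right]$, where $l_c$ is the number of edges inside community $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges. The adjacency-set representation and the function name `modularity` are choices made for this illustration only.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbors (assumed representation)


def modularity(adj: Graph, communities: Iterable[Set[Hashable]]) -> float:
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph."""
    m = sum(len(neighbors) for neighbors in adj.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both endpoints, hence the division by 2.
        internal_edges = sum(1 for u in community for v in adj[u] if v in community) / 2
        degree_sum = sum(len(adj[u]) for u in community)
        q += internal_edges / m - (degree_sum / (2 * m)) ** 2
    return q


# Two triangles joined by the bridge 2-3.
example = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
q_split = modularity(example, [{0, 1, 2}, {3, 4, 5}])     # 5/14
q_whole = modularity(example, [{0, 1, 2, 3, 4, 5}])       # 0.0
```

In this example, splitting at the bridge gives $Q = 2 \times (3/7 - (7/14)^2) = 5/14$, whereas putting all nodes in one community gives $Q = 0$, which is why a modularity-maximizing method prefers the split.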

LPA (Label Propagation Algorithm) [28] uses an information propagation mechanism to detect communities from networks. Every node in the network is first initialized with a unique label, and all nodes in the network are arranged in a random order; then each node, in that order, updates its label to the one occurring most frequently among its neighbors. This label update procedure ends when every node in the network has a label that is the majority one among its neighbors, and nodes with the same label form a community. Owing to its simplicity and high efficiency, several variants have been derived from LPA. Barber et al. [29] proposed a series of algorithms that propagate labels under some constraints; LPAm is the most famous one, which tries to maximize the modularity during the label propagation procedure. Chin et al. [30] first identified the main communities using the number of mutual neighboring nodes; then they attached some independent constraints to the basic LPA and used the constrained LPA to add the remaining nodes into communities; finally, they used a node-moving strategy like the one employed in Louvain to refine the quality of the resulting community structure. Ding et al. [31] produced a modified version of LPA, which exploits the idea of density peak clustering [32] and the Chebyshev inequality to choose community centers from the network and then propagates the labels of the selected centers to the whole network with the proposed multistrategy of label propagation.
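The basic label update loop of LPA described above can be sketched in a few lines. This is a minimal illustration rather than the reference implementation of [28]: the function name, the `seed` and `max_rounds` parameters, the reshuffling of the node order in every round, and the rule that a node keeps its current label when that label is already among the most frequent ones are all assumptions made here.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbors (assumed representation)


def label_propagation(adj: Graph, seed: int = 0, max_rounds: int = 100) -> List[Set[Hashable]]:
    """Asynchronous label propagation; returns the communities as node sets."""
    rng = random.Random(seed)
    labels = {u: u for u in adj}          # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_rounds):
        rng.shuffle(nodes)                # random visiting order for this round
        changed = False
        for u in nodes:
            if not adj[u]:
                continue                  # isolated nodes keep their own label
            counts = Counter(labels[v] for v in adj[u])
            top = max(counts.values())
            majority = [label for label, c in counts.items() if c == top]
            if labels[u] not in majority:
                labels[u] = rng.choice(majority)   # break ties at random
                changed = True
        if not changed:                   # every label is a majority label: stop
            break
    groups: Dict[Hashable, Set[Hashable]] = {}
    for u, label in labels.items():
        groups.setdefault(label, set()).add(u)
    return list(groups.values())


# Two disjoint 4-cliques: labels cannot cross components, so each clique
# ends up as one community.
cliques = {u: {v for v in range(4) if v != u} for u in range(4)}
cliques.update({u: {v for v in range(4, 8) if v != u} for u in range(4, 8)})
lpa_result = sorted(sorted(c) for c in label_propagation(cliques))
```

Because ties are broken at random, results on less clear-cut networks can differ between runs, which is one motivation for the constrained variants mentioned above.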

Density-based methods define and utilize the concept of density in networks for nodes or communities to uncover community structures. SCAN [33] borrows the idea of the classical density-based clustering algorithm DBSCAN [34] to reveal communities, hubs, and outliers from networks. SCAN++ [35] is a derivative of SCAN; it reduces time consumption by introducing a new data structure and reducing the number of density evaluations in the detection procedure. IsoFdp [36] maps the network nodes as data points into a low-dimensional manifold and then exploits the density peak clustering algorithm [32] to extract the final community structure. The LCCD algorithm [37] also follows the approach of the density peak clustering algorithm [32] to locate the structural centers in networks and then expands communities from the identified centers to the borders using a local search procedure.

Network dynamic-based methods explore community structures by simulating the dynamic processes in networks. The random walk is a typical dynamic procedure carried out in networks; random walk-based methods detect communities by exploiting the tendency of a walker to be trapped inside a community during a short walk rather than to cross the community border into another community. WalkTrap [38] makes use of random walks to calculate the probability of going from one node to another during a short walk and then calculates a distance to measure node similarities and community similarities. The PPC algorithm [39] initially considers the network as a single community and recursively partitions each community utilizing node similarities computed using random walks, until further partitioning cannot achieve a better value of modularity. RWA [40] employs random walks to calculate the probability of a node belonging to a community, and each community is expanded by repeatedly attracting the node that is most likely to belong to it. Besides this, Attractor [41] utilizes distance dynamics to explore communities in networks: node interactions might change the distances among nodes, and the distance changes in turn affect the interactions. Members of the same community gradually move together under such interplay, and nodes in different communities steadily stay far away from each other. BiAttractor [42] extends the concept of distance dynamics and the idea of Attractor to bipartite networks, where it is used to detect two-mode communities.
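The short-walk idea behind WalkTrap can be illustrated as follows. The sketch computes the $t$-step transition probabilities $P^t_{ik}$ from a start node and the degree-normalized distance $r_{ij} = \sqrt{\sum_k (P^t_{ik} - P^t_{jk})^2 / d_k}$ used by Pons and Latapy [38]; the function names, the default $t = 3$, and the requirement that the graph has no isolated nodes are assumptions of this example.

```python
import math
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of neighbors (assumed representation)


def walk_distribution(adj: Graph, start: Hashable, t: int) -> Dict[Hashable, float]:
    """Probability of being at each node after a t-step random walk from `start`."""
    probs = {start: 1.0}
    for _ in range(t):
        nxt: Dict[Hashable, float] = {}
        for u, mass in probs.items():
            share = mass / len(adj[u])        # uniform step to a neighbor
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        probs = nxt
    return probs


def walk_distance(adj: Graph, i: Hashable, j: Hashable, t: int = 3) -> float:
    """Degree-normalized distance between the t-step distributions of i and j."""
    p_i = walk_distribution(adj, i, t)
    p_j = walk_distribution(adj, j, t)
    return math.sqrt(
        sum((p_i.get(k, 0.0) - p_j.get(k, 0.0)) ** 2 / len(adj[k]) for k in adj)
    )


# Two triangles joined by the bridge 2-3.
bridge = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
same_side = walk_distance(bridge, 0, 1)     # nodes in the same triangle
across = walk_distance(bridge, 0, 4)        # nodes in different triangles
```

On this toy graph the distance between nodes 0 and 1 works out to 1/8 for $t = 3$, noticeably smaller than the distance between nodes 0 and 4, which reflects the walker's tendency to stay inside a densely connected group.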

Spectral methods use the eigenspectra of various network-associated matrices to extract communities. For example, Amini et al. [43] found the initial node partitions using a spectral clustering method based on the normalized Laplacian matrix derived from a regularized adjacency matrix; those partitions were used for fitting a stochastic block model by a pseudolikelihood algorithm to detect the resulting community structure. de Lange et al. [44] identified an integrative community structure in the macroscopic anatomical neural networks of the macaque and cat and in the microscopic network of C. elegans by examining the spectra of their normalized Laplacian matrices. Krzakala et al. [45] produced a class of spectral algorithms to detect communities based on the nonbacktracking matrix, which depicts a nonbacktracking walk on the directed edges of the network. Shi et al. [46] proposed a spectral community detection method, LLSA, which employs the Lanczos method to obtain the approximated eigenvector of the transition matrix with the largest eigenvalue; the elements of this eigenvector approximately indicate the affiliation probability of the corresponding nodes to the communities.

Most of the methods mentioned above are global ones. They often depend on some global information as prior knowledge, such as the number of communities or information about eigenvalues or eigenvectors, which is hard to acquire as the networks involved grow larger and larger. Moreover, most of them are computationally demanding, leading to high time complexity. These limitations prevent them from being applied to large-scale applications. To overcome the deficiencies of the global algorithms, many local methods have been proposed, including some of the aforementioned methods. For example, LPA and most of its variants determine which label should be adopted by a node according to its neighborhood only. LCCD takes into account both the local density of nodes and the relative distance between nodes to locate the local structural centers and expands communities from the structural centers with a local search procedure. LLSA applies fast heat kernel diffusion to sample a small subnetwork that includes almost all members of a community, and the eigenvector whose elements indicate the community memberships of nodes is obtained by performing the Lanczos method on the sampled subnetwork.

Besides this, the ComSim algorithm [47] identifies the cores of communities in bipartite networks by seeking cycles, which are node chains formed by following outgoing links until reaching a node already visited, and then allocates the remaining nodes to the communities that maximize the similarity between the node and the community. In the BLI algorithm [48], local clustering information and local structural similarity are employed to establish the primary community structure; then some small-scale communities, whose sizes are smaller than a given threshold $\lambda$, are absorbed by larger ones. kSIM [49] is also a local method that works in a bottom-up way. At the beginning, each node is taken as a community; then the preliminary communities are formed by identifying, for each node, the neighbor community to which the one of its $k$ most similar neighbors with the lowest degree belongs and assigning the node to that community. In this procedure, the common neighbor index is employed as the similarity measure for each pair of nodes.

Algorithm 1: The framework of our proposed method NSA.
    Input: G(V, E), the network; δ, the community metric threshold
    Output: CS, the detected community structure
    /* form the preliminary community structure CS_pre */
    1: CS_pre ← FPC(G)
    /* merge small or sparse communities in CS_pre */
    2: CS ← PCM(CS_pre, δ)
    3: return CS

Compared with those global ones, these local methods show good performance in large-scale networks. Inspired by this, we also propose a local method to extract communities from networks. The proposed method is based on node similarity and is termed NSA (Node Similarity based Algorithm) for short. It consists of two phases: the first phase aims at constructing the preliminary community structure, and the second phase tries to improve the quality of the final result by merging some small or sparse communities. To do so, we also propose a measure, the community metric, to evaluate the sparsity or smallness of communities. The details of the proposed method are elaborated in the next section.

3. The Proposed Method

3.1. The Framework of the Proposed Method. The framework of the proposed method is outlined by the pseudocode listed in Algorithm 1.

As mentioned previously, the proposed method consists of two phases. The function calls FPC() and PCM() implement the two phases, respectively. The former establishes the preliminary community structure based on a node selection strategy and the node similarity; the latter merges some small or sparse communities to improve the quality of the resulting community structure. The inputs of this algorithm are the network and a threshold $\delta$. The network considered in this paper is an undirected and unweighted graph, represented as $G(V, E)$ as in Algorithm 1, where $V$ and $E$ are the node set and edge set, respectively, and $|V| = n$ and $|E| = m$ are the numbers of nodes and edges in the network, respectively. The threshold $\delta$ is used in the second phase of the proposed method to identify communities to be merged: a community whose community metric is smaller than $\delta$ should be merged into another one. The output of this algorithm is the detected community structure.
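To make the inputs and the two-phase structure concrete, the following sketch builds an undirected, unweighted graph $G(V, E)$ from an edge list and wires the two phases together in the order given by Algorithm 1. It is an illustration under assumptions: `build_graph` and `nsa` are hypothetical names, the two phases are passed in as callables rather than implemented, and the merge phase is handed the graph in addition to the preliminary communities and $\delta$, on the assumption that it needs edge information.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]   # node -> set of neighbors
Partition = List[Set[Hashable]]         # one node set per community


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Adjacency sets of an undirected, unweighted graph (self-loops ignored)."""
    adj: Graph = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def nsa(
    adj: Graph,
    delta: float,
    fpc: Callable[[Graph], Partition],                    # phase 1 (hypothetical callable)
    pcm: Callable[[Graph, Partition, float], Partition],  # phase 2 (hypothetical callable)
) -> Partition:
    """Two-phase driver mirroring the control flow of Algorithm 1."""
    cs_pre = fpc(adj)                  # form the preliminary community structure
    return pcm(adj, cs_pre, delta)     # merge small or sparse communities


g = build_graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])
n = len(g)                                        # |V|
m = sum(len(nbrs) for nbrs in g.values()) // 2    # |E|
```

Passing the phases as parameters keeps the driver independent of how each phase is realized, so alternative similarity measures or merge criteria can be tried without changing it.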

The next two subsections describe the two procedures in detail.

3.2. Formation of the Preliminary Community Structure. The function FPC() implements the first phase of the proposed method, whose purpose is to construct the preliminary community structure from the network. We first pick out the node with the largest degree from the network, take it as the exemplar of the first community, and insert its most similar neighbor into the community as well. (If more than one node has the largest degree, we arbitrarily select one of them as the exemplar; if the exemplar has more than one most similar neighbor, the one with the smallest degree is selected.) Afterwards, the next largest-degree node in the remainder of the network is selected. If its most similar neighbor has not been classified into any community yet, we create a new community for it and its most similar neighbor. Otherwise, if its most similar neighbor has been assigned to a certain community (e.g., the one denoted as $C_k$), we insert the selected node into that community (i.e., $C_k$) as well. We repeat this process until every node is classified into a community. In this procedure, densely connected nodes can quickly gather around the exemplars to form communities. At the end of this procedure, we get a series of communities, which constitute the preliminary community structure of the network. The pseudocode describing the entire procedure is listed in Algorithm 2.

In this algorithm, the degree of node $u$ is the number of $u$'s neighbors and is denoted as $d_u$, i.e.,

$$d_u = |\Gamma(u)|, \tag{1}$$

where

$$\Gamma(u) = \{\, v \mid (u, v) \in E,\ v \in V \,\} \tag{2}$$

is the set of neighbors of node $u$. $sim(u, v)$ stands for the similarity between nodes $u$ and $v$. There are abundant ways to calculate the similarity between nodes in the network, and any one of them can be employed in principle. However, for the sake of efficiency, we calculate it here as in the following equation, which involves only the neighborhoods of nodes $u$ and $v$ themselves:

$$sim(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}. \tag{3}$$
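Equations (1) through (3) can be illustrated with a minimal Python sketch. The adjacency-set representation and the names `build_graph`, `degree`, and `sim` are illustrative assumptions, not part of the original method description; returning 0 for two isolated nodes is likewise an assumption, since equation (3) is undefined when the union is empty.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected, unweighted graph as a node -> neighbor-set map."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph: Graph, u: Hashable) -> int:
    """Equation (1): d_u = |Gamma(u)|."""
    return len(graph[u])


def sim(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Equation (3): shared neighbors divided by the union of neighbors."""
    union = graph[u] | graph[v]
    if not union:
        # Assumption: two isolated nodes are treated as having similarity 0.
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


# A triangle (1, 2, 3) with a pendant node 4 attached to node 3.
toy = build_graph([(1, 2), (1, 3), (2, 3), (3, 4)])
print(degree(toy, 3), sim(toy, 1, 2), sim(toy, 3, 4))
```

Nodes 1 and 2 share the single neighbor 3 out of the three nodes in the union of their neighborhoods, while nodes 3 and 4 are adjacent but share no neighbor, so their similarity is zero.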

The variables $U$ and $CS_{pre}$ are used to record the unclassified nodes and the preliminary community structure; they are naturally initialized to the original node set $V$ of network $G$ and the empty set $\emptyset$ in step 1. Steps 2 and 3 select the node with the largest degree from the remainder of the network and its most similar neighbor and denote them as $v$ and $w$, respectively. Step 4 determines whether $w$ has been assigned to a community or not. If it has not been classified into any community yet, steps 5 and 6 create a new community for nodes $v$ and $w$ and insert the newly created community into $CS_{pre}$; then step 7 removes nodes $v$ and $w$ from $U$, as they have just been classified into the new community. If node $w$ has already been assigned to a community, step 9 finds the community $C_k$ to which $w$, the most similar neighbor of node $v$, belongs, and step 10 inserts node $v$ into community $C_k$. Since node $v$ has been assigned to community $C_k$, step 11 removes it from $U$. Step 12 repeats the operations in steps 2 through 11 until $U = \emptyset$, meaning that all the nodes in the network have been visited. At that time, the preliminary community structure is obtained in $CS_{pre}$ and is returned as the output of this algorithm in step 13.

Algorithm 2: FPC($G$), forming the preliminary community structure.

Input: $G(V, E)$, the network
Output: $CS_{pre} = \{C_1, C_2, \cdots, C_k\}$, the identified preliminary community structure

1: Initialize variables $U$ and $CS_{pre}$, which are used to record the unclassified nodes and the preliminary community structure:
     $U \leftarrow V$; $CS_{pre} \leftarrow \emptyset$
2: Select the node with the largest degree, denote it as $v$:
     $v \leftarrow \arg\max_u \{d_u \mid u \in U\}$
3: Get the most similar neighbor of $v$, denote it as $w$:
     $w \leftarrow \arg\max_u \{sim(v, u) \mid u \in \Gamma(v)\}$
4: if $w$ has not been assigned to any community then
5:     Create a new community for nodes $v$ and $w$:
         $K \leftarrow |CS_{pre}|$; $C_{K+1} \leftarrow \{v, w\}$
6:     Insert the created community into the community structure:
         $CS_{pre} \leftarrow CS_{pre} \cup \{C_{K+1}\}$
7:     Remove nodes $v$ and $w$ from $U$, as they are classified:
         $U \leftarrow U - \{v, w\}$
8: else
9:     Find the community to which $w$ belongs, denote it as $C_k$:
         $k \leftarrow locate(CS_{pre}, w)$
10:    Insert node $v$ into $C_k$:
         $C_k \leftarrow C_k \cup \{v\}$
11:    Remove node $v$ from $U$, as it is classified:
         $U \leftarrow U - \{v\}$
12: Repeat steps 2 through 11 until $U = \emptyset$
13: return $CS_{pre}$

To make it clearer, we take Zachary's karate club network [14] as an example to illustrate the procedure. This is a network with 34 nodes and 78 edges, as shown in Figure 1(a), in which the node with the largest degree is node '34', and its most similar neighbor is node '33'. Therefore, node '34' is taken as the exemplar of the first community, and node '33' is also inserted into this community. Then the node with the largest degree among the remaining nodes is node '1'; its most similar neighbor is node '2'. Since node '2' has not been assigned to a community yet, we create a new community, take node '1' as its exemplar, and insert node '2' into the new community as well. The same thing happens to node pairs ('3', '4'), ('32', '29'), and ('9', '31') sequentially. Then the next largest-degree node is '14'; its most similar neighbor, node '4', is already in the third community; therefore, we insert node '14' into the third community. All of the other nodes are processed in the same way: in the subsequent operations, node pairs ('24', '30'), ('6', '7'), ('5', '11'), and ('25', '26') form new communities, and all of the remaining nodes are inserted into the communities to which their most similar neighbors belong. At the end of the process, we obtain the preliminary community structure shown in Figure 1(b), in which each node connects to its most similar neighbor with a directed edge.
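The first phase can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the adjacency-set representation and the names `jaccard` and `fpc` are assumptions, ties among largest-degree nodes are broken by dictionary order, and an isolated node (a case the description does not cover) is assumed to form a singleton community.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]


def jaccard(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Neighborhood similarity of equation (3)."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def fpc(graph: Graph) -> List[Set[Hashable]]:
    """Form preliminary communities, following the steps of Algorithm 2."""
    community_of: Dict[Hashable, int] = {}
    communities: List[Set[Hashable]] = []
    # Degrees never change, so one descending sort gives the same visiting
    # order as repeatedly taking the largest-degree node left in U.
    for v in sorted(graph, key=lambda u: len(graph[u]), reverse=True):
        if v in community_of:
            continue  # v was already classified as someone's neighbor w
        if not graph[v]:
            # Assumption: an isolated node becomes a singleton community.
            community_of[v] = len(communities)
            communities.append({v})
            continue
        # Most similar neighbor; ties go to the neighbor of smallest degree.
        w = max(graph[v], key=lambda u: (jaccard(graph, v, u), -len(graph[u])))
        if w not in community_of:
            community_of[v] = community_of[w] = len(communities)
            communities.append({v, w})
        else:
            k = community_of[w]
            communities[k].add(v)
            community_of[v] = k
    return communities


# Two triangles, (0, 1, 2) and (3, 4, 5), joined by the edge (2, 3).
two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
preliminary = fpc(two_triangles)
print(sorted(sorted(c) for c in preliminary))
```

On this toy graph, each triangle ends up as one preliminary community regardless of how the ties are broken, because the bridge endpoints 2 and 3 share no neighbor and therefore have zero similarity.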

3.3. Merge of Small or Sparse Communities. At the end of the first phase of our proposed method, we obtain the preliminary community structure. However, some communities are either too small or too sparse to make sense, such as the preliminary communities {'5', '11'}, {'9', '31'}, {'32', '29'}, {'25', '26', '28'}, {'24', '30', '27'}, and {'6', '7', '17'} in Figure 1(b): each of them contains only a few nodes, the edges inside each of them are very sparse, and the number of edges inside each of them is much smaller than the number of edges connecting to the outside, violating the characteristic that connections inside one community are much denser than those across different communities. Keeping them in the final community structure would lead to low quality. Therefore, we merge some of the preliminary communities to acquire the final result in the second phase, which is carried out by the function call PCM() in Algorithm 1.

To this end, two problems need to be solved in PCM(). The first is to identify which communities are small or sparse enough that they need to be merged into other ones; the second is to select the community into which each of the small or sparse communities should be merged.

Figure 1: The procedure of FPC() on the karate club network. (a) The network. (b) The preliminary community structure.

For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index as Definition 1 through Definition 3.

Definition 1 (community sparsity). The sparsity of community $C_i$ is defined as follows:

$$\alpha_i = \frac{|E_i^{in}|}{|E_i^{out}|}, \tag{4}$$

where $E_i^{in}$ is the set of edges within community $C_i$, and $E_i^{out}$ is the set of edges connecting nodes in community $C_i$ with other communities.

That is to say, the sparsity of community $C_i$ is defined as the ratio of the number of inner edges of $C_i$ to the number of outer edges of $C_i$. Obviously, the more edges exist within community $C_i$, the larger the value of $\alpha_i$ will be, and vice versa.

Definition 2 (community scale). The scale of community $C_i$ is formalized as follows:

$$\beta_i = \frac{|V_i|}{|V|}, \tag{5}$$

where $V_i$ is the set of nodes in community $C_i$.

That is, the scale of community $C_i$ is defined as the ratio of the number of nodes in $C_i$ to the total number of nodes in the network. The more nodes there are in community $C_i$, the larger the ratio will be, and vice versa.

Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community $C_i$ as follows:

$$\gamma_i = \alpha_i \cdot \beta_i. \tag{6}$$

On the basis of these definitions, the first problem can be solved by setting a community metric threshold $\delta$. That is to say, if $\gamma_i < \delta$, community $C_i$ needs to be merged into another community.
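Definitions 1 through 3 translate directly into code. The following is a minimal sketch; the function names and the adjacency-set representation are assumptions, and a community with no outer edges is assumed to have infinite sparsity, a case equation (4) leaves undefined.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def sparsity(graph: Graph, community: Set[Hashable]) -> float:
    """Equation (4): number of inner edges over number of outer edges."""
    inner_ends = sum(1 for u in community for v in graph[u] if v in community)
    outer = sum(1 for u in community for v in graph[u] if v not in community)
    if outer == 0:
        # Assumption: a community with no outer edges is never "sparse".
        return float("inf")
    # Every inner edge is seen from both of its endpoints, hence the halving.
    return (inner_ends / 2) / outer


def scale(graph: Graph, community: Set[Hashable]) -> float:
    """Equation (5): share of the network's nodes that lie in the community."""
    return len(community) / len(graph)


def community_metric(graph: Graph, community: Set[Hashable]) -> float:
    """Equation (6): gamma_i = alpha_i * beta_i."""
    return sparsity(graph, community) * scale(graph, community)


# Two triangles, (0, 1, 2) and (3, 4, 5), joined by the edge (2, 3).
two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(community_metric(two_triangles, {0, 1, 2}))
print(community_metric(two_triangles, {3, 4}))
```

The full triangle {0, 1, 2} has three inner edges, one outer edge, and half of the nodes, so its metric is 3 × 0.5 = 1.5. The fragment {3, 4} has one inner edge, three outer edges, and a third of the nodes, giving 1/9, which would fall below a threshold such as $\delta = 0.13$.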

For the second problem, we consider a strategy conforming to the construction of preliminary communities. The preliminary communities are formed based mainly on node similarity in the first phase; therefore, we also use the similarity as the criterion here to merge communities, i.e., each of the small or sparse communities is merged into its most similar adjacent community. Here, the similarity between two communities $C_i$ and $C_j$ is calculated as follows:

$$Sim(C_i, C_j) = \frac{\sum_{u \in C_i,\, v \in C_j} sim(u, v)}{|C_j|}, \tag{7}$$

where $sim(u, v)$ is the similarity between nodes $u \in C_i$ and $v \in C_j$, which is calculated using (3). In function PCM(), which implements the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator of the right-hand side of (7) is the sum of similarities between nodes in communities $C_i$ and $C_j$. Dividing by the denominator $|C_j|$ limits the priority given to larger communities, which prevents the formation of giant communities.
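Equation (7) can be checked on a small graph with the sketch below. The names `sim` and `community_similarity` and the adjacency-set representation are illustrative assumptions; the sum runs over all node pairs $(u, v)$ with $u \in C_i$ and $v \in C_j$, as the equation is written.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def sim(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Node similarity of equation (3)."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def community_similarity(graph: Graph, c_i: Set[Hashable],
                         c_j: Set[Hashable]) -> float:
    """Equation (7): summed pairwise node similarity, divided by |C_j|."""
    total = sum(sim(graph, u, v) for u in c_i for v in c_j)
    return total / len(c_j)


# Two triangles, (0, 1, 2) and (3, 4, 5), joined by the edge (2, 3).
two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
to_small = community_similarity(two_triangles, {3, 4}, {5})
to_large = community_similarity(two_triangles, {3, 4}, {0, 1, 2})
print(to_small, to_large)
```

For the fragment {3, 4}, the raw similarity sum towards the larger community {0, 1, 2} is 3/4, which exceeds the 7/12 towards {5}; after dividing by the target size, the order reverses (1/4 versus 7/12). This illustrates how the denominator $|C_j|$ damps the pull of larger communities.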

The logic of the entire second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure $CS_{pre}$ in step 1. Step 2 calculates the community metric for each of the preliminary communities; steps 3 and 4 select the community with the smallest community metric and its most similar community; step 5 merges them to yield a new community, and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_t$ and $C_j$ with the new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the community metric of the selected community, i.e., the minimal one, is larger than the given threshold $\delta$, meaning that all the remaining communities are satisfactory; therefore, the merge procedure is terminated, and the resulting community structure in $CS$ is returned in step 9.


Algorithm 3: PCM($CS_{pre}$, $\delta$), merging small or sparse communities.

Input: $CS_{pre}$, the preliminary community structure; $\delta$, the community-metric threshold
Output: $CS$, the final community structure

1: Initialize $CS$, which is used to record the community structure:
     $CS \leftarrow CS_{pre}$
2: Calculate the community metric for each of the preliminary communities:
     foreach $C_i \in CS$ do
         $\gamma_i \leftarrow \alpha_i \times \beta_i$
3: Select the community with the minimal community metric, denote its index as $t$:
     $t \leftarrow \arg\min_i \{\gamma_i \mid i = 1, 2, \cdots, |CS|\}$
4: Identify the most similar community to $C_t$, denote its index as $j$:
     $j \leftarrow \arg\max_i \{Sim(C_t, C_i) \mid i = 1, 2, \cdots, |CS|,\ i \neq t\}$
5: Merge communities $C_t$ and $C_j$ to form a new community:
     $k \leftarrow |CS|$; $C_{k+1} \leftarrow C_t \cup C_j$
6: Calculate the community metric for the new community:
     $\gamma_{k+1} \leftarrow \alpha_{k+1} \times \beta_{k+1}$
7: Replace the two communities $C_t$ and $C_j$ with the new community to reflect the merging effect:
     $CS \leftarrow (CS - \{C_t, C_j\}) \cup \{C_{k+1}\}$
8: Repeat steps 3 through 7 until $\gamma_t > \delta$
9: return $CS$
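The merge phase can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the metrics are recomputed from scratch in every round instead of being updated incrementally, merge candidates are restricted to adjacent communities as described in the text, and a community without outer edges is assumed to have an infinite metric.

```python
from typing import Dict, Hashable, Iterable, List, Set

Graph = Dict[Hashable, Set[Hashable]]


def sim(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Node similarity of equation (3)."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def metric(graph: Graph, c: Set[Hashable]) -> float:
    """Community metric of equation (6); infinite without outer edges."""
    inner_ends = sum(1 for u in c for v in graph[u] if v in c)
    outer = sum(1 for u in c for v in graph[u] if v not in c)
    alpha = (inner_ends / 2) / outer if outer else float("inf")
    return alpha * len(c) / len(graph)


def community_sim(graph: Graph, c_i: Set[Hashable], c_j: Set[Hashable]) -> float:
    """Community similarity of equation (7)."""
    return sum(sim(graph, u, v) for u in c_i for v in c_j) / len(c_j)


def pcm(graph: Graph, preliminary: Iterable[Set[Hashable]],
        delta: float) -> List[Set[Hashable]]:
    """Merge small or sparse communities, following Algorithm 3."""
    cs = [set(c) for c in preliminary]
    while len(cs) > 1:
        metrics = [metric(graph, c) for c in cs]
        t = min(range(len(cs)), key=metrics.__getitem__)
        if metrics[t] > delta:
            break  # step 8: even the weakest community is satisfactory
        c_t = cs[t]
        # Candidates: communities sharing at least one edge with C_t.
        adjacent = [i for i in range(len(cs))
                    if i != t and any(graph[u] & cs[i] for u in c_t)]
        if not adjacent:
            break  # defensive guard; not expected when the metric is finite
        j = max(adjacent, key=lambda i: community_sim(graph, c_t, cs[i]))
        merged = c_t | cs[j]
        cs = [c for i, c in enumerate(cs) if i not in (t, j)] + [merged]
    return cs


two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
final = pcm(two_triangles, [{0, 1, 2}, {3, 4}, {5}], delta=0.13)
print(sorted(sorted(c) for c in final))
```

Starting from the fragments {3, 4} and {5}, the singleton has metric 0 and is merged into its only adjacent community; both resulting triangles then have metric 1.5, which exceeds the threshold, so the loop stops.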

3.4. Time Complexity. The proposed algorithm comprises two phases. The first forms the preliminary communities. The main time consumption in this phase lies in the selection of the node with the largest degree (step 2 in Algorithm 2) and of its most similar neighbor (step 3 in Algorithm 2). The former can be accomplished in $O(\log n)$ in each iteration using a max-heap data structure; the latter can be done in $O(\log \langle d \rangle)$ with the max-heap, where $\langle d \rangle$ is the average degree of nodes in the network. Since $\langle d \rangle \ll n$, the time consumption of the first phase is $O(n \log n)$.

The second phase improves the quality of the resulting community structure by merging some of the small or sparse communities. The major time is spent on determining the community that needs to be merged and its most similar adjacent community in each iteration. Assuming there are $K$ communities in the preliminary community structure, the former operation can be implemented in $O(\log K)$; the latter can also be carried out in $O(\log K)$ time in the worst case. Hence, the second phase can be implemented with $O(K \log K)$ time consumption.

Since $K \ll n$, we have $\log K \ll \log n$. Therefore, the proposed method can detect communities from networks with relatively high efficiency, i.e., with $O(n \log n)$ time complexity.
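The max-heap selection mentioned in the analysis can be sketched with Python's `heapq` module, which provides a min-heap; degrees are therefore negated. The class name `DegreeQueue` and the lazy skipping of already classified nodes are assumptions made for illustration, not details given in the paper.

```python
import heapq
from typing import Dict, Hashable, List, Optional, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


class DegreeQueue:
    """Yield unclassified nodes in order of decreasing degree."""

    def __init__(self, graph: Graph) -> None:
        # heapq is a min-heap, so degrees are negated; the running index
        # breaks ties without ever comparing node labels directly.
        self._heap: List[Tuple[int, int, Hashable]] = [
            (-len(nbrs), order, node)
            for order, (node, nbrs) in enumerate(graph.items())
        ]
        heapq.heapify(self._heap)  # linear-time construction

    def pop_largest(self, classified: Set[Hashable]) -> Optional[Hashable]:
        """Pop the largest-degree node that is not yet classified."""
        while self._heap:
            _, _, node = heapq.heappop(self._heap)  # O(log n) per pop
            if node not in classified:
                return node
        return None  # every node has been classified


two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
queue = DegreeQueue(two_triangles)
first = queue.pop_largest(classified=set())
second = queue.pop_largest(classified={3})
print(first, second)
```

Each node is pushed once and popped at most once, so the selection step costs $O(n \log n)$ over the whole first phase, in line with the bound stated above.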

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we conducted extensive experiments on both groups of artificial networks and real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which takes parameters that control the characteristics of the generated networks. Here, we consider the influences of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities and with big communities, and larger networks with small communities and with big communities. The small networks and the larger networks contain 1000 and 5000 nodes, respectively. A small community contains at least 10 nodes and at most 50 nodes; the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small communities and big communities are marked with the suffixes 's' and 'b', respectively. The exponents of the power-law distributions that the node degree and the community size follow are the default values, $-2$ and $-1$, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed experiments on 13 real-world networks. The sizes of these networks span from tens to hundreds of thousands of nodes; the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second one contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results with those of 5 popular community detection algorithms, namely, FastQ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of each evaluation metric as its resulting metric value on that network. For our proposed method, NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ will be discussed in Section 5.

Table 1: The parameters used to generate the LFR networks. In the header row of this table, $n$ is the number of nodes contained in the network; $\langle d \rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively; $\exp_d$ and $\exp_{com}$ are the exponents of the power-law distributions that the node degree and the community size follow; $\min(|C_i|)$ and $\max(|C_i|)$ represent the minimal and maximal numbers of nodes contained in every community, respectively.

Network     n      ⟨d⟩   d_max   exp_d   exp_com   min(|C_i|)   max(|C_i|)
LFR1000s    1000   20    50      -2      -1        10           50
LFR1000b    1000   20    50      -2      -1        20           100
LFR5000s    5000   20    50      -2      -1        10           50
LFR5000b    5000   20    50      -2      -1        20           100

Table 2: The information about the real-world networks. $n$ and $m$ are the numbers of nodes and edges in the network, respectively.

Network                                 n        m
Karate club [14]                        34       78
Dolphin social network [15]             62       159
Risk map [16]                           42       83
Scientists collaboration network [6]    118      197
Lesmis [17]                             77       254
Polbooks [3]                            105      441
ColiNeta [18]                           423      519
NetScience [10]                         1589     2742
Email [19]                              1133     5451
YeastL [20]                             2361     7182
PGP [21]                                10680    24316
DBLP [22]                               317080   1049866
Amazon [22]                             334863   925872

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left( \dfrac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}} \right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left( \dfrac{n_i^{P}}{n} \right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left( \dfrac{n_j^{P'}}{n} \right)}, \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$. The NMI is an information-theory-based metric, which measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value is in the range $[0, 1]$; larger is better.
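Equation (8) can be computed from two label assignments as in the sketch below. The function name `nmi` and the dictionary-based inputs are assumptions for illustration; pairs with $n_{ij} = 0$ contribute nothing to the numerator and are skipped, and the degenerate case in which both partitions consist of a single community (zero denominator) is assumed to score 1.

```python
import math
from collections import Counter
from typing import Dict, Hashable


def nmi(truth: Dict[Hashable, Hashable],
        detected: Dict[Hashable, Hashable]) -> float:
    """Normalized mutual information of equation (8).

    Both arguments map each node to the label of its community.
    """
    n = len(truth)
    joint = Counter((truth[u], detected[u]) for u in truth)   # n_ij
    size_p = Counter(truth.values())                          # n_i^P
    size_q = Counter(detected[u] for u in truth)              # n_j^P'
    numerator = -2.0 * sum(
        n_ij * math.log(n_ij * n / (size_p[i] * size_q[j]))
        for (i, j), n_ij in joint.items()
    )
    denominator = (
        sum(c * math.log(c / n) for c in size_p.values())
        + sum(c * math.log(c / n) for c in size_q.values())
    )
    if denominator == 0:
        # Assumption: two trivial one-community partitions agree perfectly.
        return 1.0
    return numerator / denominator


ground_truth = {0: "a", 1: "a", 2: "b", 3: "b"}
same_split = {0: "x", 1: "x", 2: "y", 3: "y"}    # identical up to renaming
cross_split = {0: "x", 1: "y", 2: "x", 3: "y"}   # carries no information
print(nmi(ground_truth, same_split), nmi(ground_truth, cross_split))
```

The first comparison matches the ground truth up to a relabeling of communities, while the second splits every ground-truth community evenly, so the two calls sit at the two ends of the $[0, 1]$ range.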

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left( e_{ii} - a_i^2 \right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$, whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ relative to the total edges in the network; $K$ is the number of communities in the community structure; $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and the degree distribution are the same as in the original network, but edges are connected between nodes randomly. The smaller the difference between the two terms is, the more the network approaches a random graph, and the weaker the community structure is. On the contrary, the larger the difference between them is, the further the network departs from the random graph, and the stronger the community structure is. That is to say, the modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network. Its effective value falls in $[0, 1]$; higher is better.
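Equation (9) can be evaluated directly from an adjacency-set graph, as in the following minimal sketch. The function name `modularity` and the data layout are assumptions; $a_i$ is computed as the share of edge endpoints (i.e., of total degree) that belongs to community $C_i$.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]


def modularity(graph: Graph, communities: Iterable[Set[Hashable]]) -> float:
    """Equation (9): Q = sum_i (e_ii - a_i^2) for an undirected graph."""
    two_m = sum(len(nbrs) for nbrs in graph.values())  # 2m edge endpoints
    q = 0.0
    for community in communities:
        # Each inner edge contributes two endpoints, so inner_ends / 2m
        # equals the fraction of edges lying inside the community (e_ii).
        inner_ends = sum(
            1 for u in community for v in graph[u] if v in community
        )
        # a_i: fraction of all edge endpoints attached to the community.
        degree_sum = sum(len(graph[u]) for u in community)
        q += inner_ends / two_m - (degree_sum / two_m) ** 2
    return q


# Two triangles, (0, 1, 2) and (3, 4, 5), joined by the edge (2, 3).
two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
q_split = modularity(two_triangles, [{0, 1, 2}, {3, 4, 5}])
q_whole = modularity(two_triangles, [set(two_triangles)])
print(q_split, q_whole)
```

With 7 edges in total, each triangle holds 3/7 of the edges and half of the endpoints, so the split scores 2 × (3/7 − 1/4) = 5/14, whereas putting all nodes into one community scores 0.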

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter for this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$ is, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point, above which communities in networks tend to be obscure.
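To make the role of $\mu$ concrete, the sketch below measures the realized fraction of inter-community edges per node for a given graph and partition. It is not part of the LFR generator, where $\mu$ is an input; the function name `mixing_ratio` is an assumption, and isolated nodes are skipped because the ratio is undefined for them.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def mixing_ratio(graph: Graph, community_of: Dict[Hashable, Hashable]) -> float:
    """Average, over nodes, of the share of a node's edges leaving its community."""
    ratios = [
        sum(1 for v in nbrs if community_of[v] != community_of[u]) / len(nbrs)
        for u, nbrs in graph.items()
        if nbrs  # isolated nodes have no edges to classify
    ]
    return sum(ratios) / len(ratios)


# Two triangles, (0, 1, 2) and (3, 4, 5), joined by the edge (2, 3).
two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
labels = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
print(mixing_ratio(two_triangles, labels))
```

Only nodes 2 and 3 have an edge leaving their community (one out of three each), so the average is 1/9, far below the 0.5 transition point; communities in such a network are clearly separated.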

Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities. [Plots of NMI (0.0 to 1.0) against values from 0.1 to 0.8 on the horizontal axis; curves: FastQ, WalkTrap, LPA, Attractor, IsoFdp, proposal (NSA).]

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 with an increment of 0.1 for each group of LFR networks. To reduce the effect of randomness, we generated 10 networks for each value of $\mu$ while keeping the same setting for the other parameters. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as the input one by one, ran our proposed method and the comparison algorithms to detect communities, and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized communities and with big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results revealed from the larger networks with small-sized communities and with big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), FastQ tends to introduce mistakes into the results regardless of whether communities in the networks are well separated or obscure. As mentioned previously, FastQ is a typical modularity-optimization-based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it are satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leq 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms in the networks with $\mu < 0.5$, but its performance dropped dramatically for $\mu \geq 0.5$; it could not even detect effective communities from networks with $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels to update their own, leading to trivial results in which even all nodes are labeled as members of one giant community. The proposed method, NSA, acquired $\mathrm{NMI} = 1$ on all networks for $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, there is a drop in the quality of the detected community structures for all of those three algorithms and the proposed method. For $0.5 < \mu \leq 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geq 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar overall to those in Figure 2, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; however, in Figure 3, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performances of the comparison algorithms are not stable on different networks, but our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also manifested by the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: ones with the ground-truth community structure known a priori, and the other ones without publicly acknowledged ground truth.

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities. [Plots of NMI (0.0 to 1.0) against values from 0.1 to 0.8 on the horizontal axis; curves: FastQ, WalkTrap, LPA, Attractor, IsoFdp, proposal (NSA).]

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method, NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. The scales of these networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This is a network depicting the friendships among members of a karate club; it contains 34 nodes and 78 edges. This network was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During the period of Zachary's study, the club split into two factions because of a dispute that arose between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is always taken as the ground truth of this network, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. It seems that the detected result deviates from the ground truth in some ways, but this result coincides with the conclusion found in the experiments on synthetic networks that our proposed method tends to find small communities in networks, overcoming the problem of resolution limit. Moreover, from the perspective of the evaluation metrics, the modularity corresponding to the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire the community structure with as large a modularity as possible: if its modularity is not the largest, it is the second largest, with a small offset from the largest. These findings are also manifested in the following networks.

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method, NSA.

Table 3: The experimental results on networks with ground-truth community structures. The largest value of each metric on each network is marked with an asterisk.

Network     Metric   FastQ    WalkTrap   LPA     Attractor   IsoFdp   NSA
Karate      Q        0.381    0.353      0.355   0.371       0.371    0.402*
            NMI      0.693    0.504      0.62    0.924       1.00*    0.699
Dolphin     Q        0.492    0.489      0.464   0.45        0.505    0.513*
            NMI      0.719    0.632      0.719   0.69        0.744    0.887*
Risk map    Q        0.625*   0.624      0.59    0.598       0.519    0.624
            NMI      0.894*   0.848      0.821   0.839       0.714    0.848
Scientists  Q        0.749*   0.733      0.64    0.694       0.668    0.744
            NMI      0.867    0.818      0.743   0.835       0.823    0.878*

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it identified 4 communities as well, the absolute majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the values of NMI and modularity corresponding to the result detected by the proposed method from this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is clearly better than those of the comparison algorithms.
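As an illustration of how an NMI value such as those reported above can be obtained, the following Python sketch compares a detected partition with a ground-truth partition. It assumes the common normalization NMI = 2I(A;B) / (H(A) + H(B)); the function name `nmi` and the list-of-labels input format are choices made for this sketch and are not taken from the paper.

```python
from collections import Counter
from math import log


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two partitions of the same nodes.

    Both arguments are sequences in which position k holds the community
    label of node k (an input format assumed for this sketch).
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("both partitions must label the same non-empty node set")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information I(A;B) from the confusion counts.
    mutual = sum(
        (c / n) * log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    # Entropies H(A) and H(B) of the two partitions.
    h_true = -sum((c / n) * log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * log(c / n) for c in count_pred.values())

    if h_true + h_pred == 0:
        # Both partitions put every node into a single community.
        return 1.0
    return 2 * mutual / (h_true + h_pred)
```

Under this definition, two partitions that differ only in the names of the labels score 1, and two statistically independent partitions score 0.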

Risk Map Network. This network is a world political map loaded in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories of 6 continents are involved. Therefore, 42 nodes and 83 edges connecting adjacent countries or territories are organized in 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result to the ground-truth community structure, the community containing nodes ‘18’ and ‘23’ in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes ‘26’, ‘33’, and ‘34’ are misclassified into the wrong communities in the detected result. But nodes ‘12’, ‘16’, ‘26’, ‘33’, and ‘34’ are special ones in this network: the outer edges associated with them are no fewer than, or


Table 4: The experimental results of modularity on networks. The largest values of the two measure metrics are typed in bold.

Network     FastQ  WalkTrap  LPA    Attractor  IsoFdp  NSA
Lesmis      0.499  0.519     0.515  0.498      0.491   0.54
Polbooks    0.502  0.507     0.508  0.501      0.518   0.524
ColiNeta    0.779  0.746     0.693  0.718      -       0.761
Email       0.499  0.531     0.379  0.464      0.531   0.544
NetScience  0.955  0.956     0.896  0.937      -       0.957
YeastL      0.573  0.529     0.372  0.511      -       0.574
PGP         0.85   0.789     0.765  0.768      0.726   0.867
DBLP        0.735  -         0.652  0.637      -       0.782
Amazon      0.869  -         0.743  0.741      -       0.898

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method NSA.

even more than, those within the communities to which these nodes belong. Therefore, if we ignore the meaning of the actual representation of these nodes and consider qualitatively, based on the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three nodes are located within the community than in the ground truth; thus these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, both values of the two measure metrics of our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.
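The topological argument above, which compares the edges a node has inside its community with the edges it has outside, can be checked mechanically. The following Python sketch counts both kinds of edges for one node under a given partition; the function name `inner_outer_edges`, the edge-list input, and the toy graph are assumptions made for illustration and do not reproduce the Risk map data.

```python
def inner_outer_edges(edges, membership, node):
    """Return (inner, outer) edge counts of `node` under `membership`.

    edges      : iterable of undirected edges (u, v)
    membership : dict mapping each node to its community label
    """
    inner = outer = 0
    for u, v in edges:
        if node not in (u, v):
            continue
        other = v if u == node else u
        if membership[other] == membership[node]:
            inner += 1
        else:
            outer += 1
    return inner, outer


# Toy example (hypothetical data): node 1 has one neighbor in community 'A'
# and two neighbors in community 'B'.
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3)]
as_labelled = {1: "A", 2: "A", 3: "B", 4: "B"}
reassigned = {1: "B", 2: "A", 3: "B", 4: "B"}
```

In the toy example, node 1 has more outer than inner edges under `as_labelled`, and the relation is reversed under `reassigned`, which mirrors the situation described for the special nodes of the Risk map network.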

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationship among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge connects two scientists who have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node ‘1’, there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than with the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the measure metrics, the value of NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one that most closely approaches the ground-truth community structure; the modularity value of the proposed method is not the largest, though it is second only to that of FastQ. These results also testify that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using the modularity only, due to the absence of ground-truth community structures. For the proposed method and the comparison algorithms, the obtained values of modularity are recorded in Table 4. To illustrate them


Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method NSA. (Horizontal axis: networks; vertical axis: modularity (Q), from 0.0 to 1.0; one bar per algorithm: FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA).)

intuitively, we also plotted them in a bar chart, which is presented in Figure 8.
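For readers who wish to recompute modularity values of this kind, the following Python sketch evaluates the modularity of a partition of an undirected, unweighted network. It assumes the standard Newman-Girvan form, Q = sum over communities c of [l_c / m - (d_c / (2m))^2], where l_c is the number of edges inside c, d_c is the total degree of the nodes in c, and m is the number of edges; the function name and input format are choices made for this sketch.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges      : list of undirected edges (u, v), each listed once
    membership : dict mapping each node to its community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")

    inner = defaultdict(int)   # l_c: edges with both ends in community c
    degree = defaultdict(int)  # d_c: sum of degrees of nodes in community c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inner[cu] += 1

    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy example (hypothetical data): two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
```

On the toy graph, splitting the two triangles gives Q = 5/14, while placing all nodes in one community gives Q = 0, which is why a larger Q is read as a better-separated community structure.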

On these networks, our proposed method achieved the largest modularity on 8 of them. On the only other network, ColiNeta, it still obtained the second largest value of modularity. FastQ is based on the modularity optimization strategy, yet it acquired the largest value of modularity on network ColiNeta only. WalkTrap is an approach based on random walk, so its time complexity is relatively high; it cannot manage to get effective results from networks Amazon and DBLP due to the large scale of these two networks. LPA and Attractor can extract community structures from all those networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on networks ColiNeta, NetScience, and YeastL, as these three networks are disconnected. It cannot detect the community structure from networks Amazon and DBLP effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions for the problem of community detection in networks from a wide range of applications and outperform the comparison algorithms significantly.


Figure 9: The setting of parameter $\delta$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network. (Each panel plots the modularity $Q$ against $\delta$ over $[0, 0.3]$.)

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.
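The thresholding rule can be sketched in a few lines of Python. The sketch below takes the community sparsity and community scale values as precomputed inputs, because their definitions appear elsewhere in the paper and are not restated here; the function name `communities_to_merge`, the dictionary layout, and the numeric values are hypothetical.

```python
def communities_to_merge(alpha, beta, delta):
    """Return the ids of the communities whose metric gamma_i is below delta.

    alpha : dict community id -> community sparsity alpha_i (precomputed)
    beta  : dict community id -> community scale beta_i (precomputed)
    delta : threshold for the community metric gamma_i = alpha_i * beta_i
    """
    return [c for c in alpha if alpha[c] * beta[c] < delta]


# Hypothetical values, chosen only to exercise the rule.
alpha = {"C1": 0.5, "C2": 0.2}
beta = {"C1": 0.6, "C2": 0.3}
flagged = communities_to_merge(alpha, beta, delta=0.1)
```

With these illustrative numbers, only the community whose product of the two indexes falls below the threshold is flagged for merging.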

Therefore, $\delta$ works as a parameter in our proposed method, whose setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conduct a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion with $\delta \in [0, 0.3]$, because the largest modularities are obtained for $\delta \le 0.3$ on all four networks. Our proposed method gets the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks to perform the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 in the range $[0.05, 0.2]$ according to the size and the sparsity of the networks involved in real-world applications.
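A parameter study of this kind can be organized as a simple grid search. The Python sketch below is one possible arrangement, not the authors' experimental code: `detect` stands for any callable that runs the method for a given threshold, and `quality` for any callable that scores the result (for example, by modularity); both names and the default grid are assumptions of the sketch.

```python
def sweep_delta(detect, quality, start=0.0, stop=1.0, step=0.005):
    """Scan the threshold over [start, stop] and return (best_delta, best_quality).

    detect  : callable delta -> detected community structure (hypothetical hook)
    quality : callable community structure -> score, larger is better
    """
    # Index the grid with integers so rounding errors do not accumulate.
    n_steps = round((stop - start) / step)
    best_delta, best_q = None, float("-inf")
    for k in range(n_steps + 1):
        delta = start + k * step
        q = quality(detect(delta))
        if q > best_q:
            best_delta, best_q = delta, q
    return best_delta, best_q
```

Because only a strictly larger score replaces the current best, ties are resolved in favor of the smallest threshold on the grid; this tie rule is a design choice of the sketch.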

6. Conclusion

In this paper, we presented a novel method to detect communities from networks. It is a local method based on node similarity and overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs, on the basis of its most similar neighbor’s community assignment; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and sparsity of communities into account, to determine which communities should be merged.
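The first phase summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the similarity measure is taken to be the Jaccard index on closed neighborhoods, ties are broken by the smaller node id, and an isolated node forms its own community; none of these choices is specified in this section, and all function names are hypothetical.

```python
def jaccard(adj, u, v):
    """Jaccard index of the closed neighborhoods of u and v (assumed measure)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def form_preliminary_communities(adj):
    """Assign every node to a preliminary community.

    adj : dict mapping each node to the set of its neighbors (undirected graph)
    Returns a dict node -> community id.
    """
    community = {}
    next_id = 0
    # Visiting nodes by decreasing degree and skipping assigned ones is the
    # same as repeatedly picking the unassigned node with the largest degree.
    for u in sorted(adj, key=lambda x: (-len(adj[x]), x)):
        if u in community:
            continue
        if not adj[u]:
            # Assumption: an isolated node becomes a singleton community.
            community[u] = next_id
            next_id += 1
            continue
        # Most similar neighbor; ties go to the smaller node id (assumption).
        best = max(sorted(adj[u]), key=lambda v: jaccard(adj, u, v))
        if best not in community:
            # Create a new community for u and its most similar neighbor.
            community[best] = next_id
            next_id += 1
        # Otherwise u simply joins the community of that neighbor.
        community[u] = community[best]
    return community


# Toy example (hypothetical data): two triangles joined by the edge (2, 3).
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
toy_result = form_preliminary_communities(toy_adj)
```

On the toy graph, each triangle ends up as one preliminary community, since every node is more similar to a neighbor in its own triangle than to the node across the bridge.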

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, “Network analysis: The structure of the web,” Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, “Community structure of the physical review citation network,” Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, “Community structure in jazz,” Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., “Community detection using spectral clustering on sparse geosocial data,” SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, “Community detection in networks: a user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, “The emergent properties of a dolphin social network,” in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, “Models of social networks based on social distance attachment,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, “Community detection in complex networks using structural similarity,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, “A smart local moving algorithm for large-scale modularity-based community detection,” The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, “Detecting network communities by propagating labels under constraints,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, “A semi-synchronous label propagation algorithm with constraints for community detection in complex networks,” Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, “Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs,” in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, “Community detection in complex networks using density-based clustering algorithm and manifold learning,” Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, “Locating structural centers: A density-based clustering method for community detection,” PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, “Personalized PageRank clustering: a graph clustering algorithm based on random walks,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, “A seed-expanding method based on random walks for community detection in networks with ambiguous community structures,” Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, “Community detection based on distance dynamics,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch’ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, “A fast community detection method in bipartite networks by distance dynamics,” Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, “The Laplacian spectrum of neural networks,” Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, “ComSim: a bipartite community detection algorithm using cycle and node’s similarity,” in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, “A community detection method based on local similarity and degree clustering information,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, “Maximal neighbor similarity reveals real communities in networks,” Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, “Robust data clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


been derived from LPA. Barber et al. [29] proposed a series of algorithms that propagate labels under some constraints; LPAm is the most famous one, which tries to maximize the modularity during the label propagation procedure. Chin et al. [30] first identified the main communities using the number of mutual neighboring nodes; then they attached some independent constraints to the basic LPA and used the constrained LPA to add the remaining nodes into communities; finally, they used a node-moving strategy like the one employed in Louvain to refine the quality of the resulting community structure. Ding et al. [31] proposed a modified version of LPA, which exploits the idea of density peak clustering [32] and the Chebyshev inequality to choose community centers from the network and then propagates labels of the selected centers to the whole network with the proposed multistrategy of label propagation.
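To make the basic label propagation idea behind these variants concrete, the following Python sketch lets every node repeatedly adopt the label that is most frequent among its neighbors. It is a simplified illustration rather than the algorithm of any cited paper: the original LPA visits nodes in random order and breaks ties at random, whereas this sketch uses a fixed node order, keeps the current label when it is among the most frequent ones, and otherwise takes the smallest such label, so that the result is reproducible.

```python
from collections import Counter


def label_propagation(adj, max_sweeps=100):
    """Simplified, deterministic label propagation.

    adj : dict mapping each node to the set of its neighbors (undirected graph)
    Returns a dict node -> label; nodes sharing a label form a community.
    """
    labels = {u: u for u in adj}  # every node starts with a unique label
    for _ in range(max_sweeps):
        changed = False
        for u in sorted(adj):
            if not adj[u]:
                continue  # an isolated node keeps its own label
            counts = Counter(labels[v] for v in adj[u])
            top = max(counts.values())
            best = sorted(label for label, c in counts.items() if c == top)
            # Deterministic tie rule (a simplification of random tie-breaking).
            if labels[u] not in best:
                labels[u] = best[0]
                changed = True
        if not changed:
            break
    return labels


# Toy example (hypothetical data): two disconnected triangles.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
toy_labels = label_propagation(toy_adj)
```

Each update uses only the labels in the neighborhood of the node being updated, which is why methods of this family are regarded as local ones.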

Density-based methods define and utilize the concept of density in networks for nodes or communities to uncover community structures. SCAN [33] borrows the idea from the classical density-based clustering algorithm DBSCAN [34] to reveal communities, hubs, and outliers from networks. SCAN++ [35] is a derivative of SCAN; it reduces time consumption by introducing a new data structure and reducing the number of density evaluations in the detecting procedure. IsoFdp [36] maps the network nodes as data points into a low-dimensional manifold and then exploits the density peak clustering algorithm [32] to extract the final community structure. The LCCD algorithm [37] also follows the approach proposed in the density peak clustering algorithm [32] to locate the structural centers from networks and then expands communities from the identified centers to the borders using a local search procedure.

Network dynamic-based methods explore community structures by simulating the dynamic processes in networks. Random walk is a typical dynamic procedure carried out in networks; random walk-based methods utilize the tendency of the walker to be trapped in a community during a short walk, rather than walking across the community border into another community, to detect communities from networks. WalkTrap [38] makes use of random walks to calculate the probability of going from one node to another during a short-length walk and then calculates a distance to measure the similarities of nodes and of communities. The PPC algorithm [39] considers the network as a single community initially and recursively partitions each community utilizing node similarities computed using random walks, until further partitioning cannot acquire a better value of modularity. RWA [40] employs random walks to calculate the probability of a node belonging to a community, and each community is expanded by repeatedly attracting the node that is most likely to belong to that community. Besides this, Attractor [41] utilizes distance dynamics to explore communities from networks: node interactions might change the distances among nodes, and the distance changes will in turn affect the interactions. Members of the same community will gradually move together under such interplays, and nodes in different communities will steadily keep far away from each other. BiAttractor [42] extends the concept of distance dynamics and the idea of Attractor to bipartite networks and is used to detect two-mode communities of bipartite networks.
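The trapping tendency of short random walks can be illustrated with a small Python sketch that computes where a walker starting at a given node is after a few steps. It only computes the walk probabilities and does not implement the distance or merging steps of WalkTrap; the function name, the adjacency-dictionary input, and the toy graph are assumptions made for illustration, and every node is assumed to have at least one neighbor.

```python
def walk_distribution(adj, start, steps):
    """Probability of a simple random walk being at each node after `steps` steps.

    adj   : dict mapping each node to the set of its neighbors (no isolated nodes)
    start : node at which the walker starts
    """
    dist = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for u, p in dist.items():
            share = p / len(adj[u])  # move to each neighbor with equal probability
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        dist = nxt
    return dist


# Toy example (hypothetical data): two triangles joined by the edge (2, 3).
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
after_two = walk_distribution(toy_adj, start=0, steps=2)
mass_in_own_triangle = sum(after_two.get(v, 0.0) for v in (0, 1, 2))
```

In the toy graph, a walker that starts in the first triangle is still inside it after two steps with probability 5/6, which is the kind of signal that random walk-based methods exploit.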

Spectral methods employ the eigenspectra of various network-associated matrices to extract communities. For example, Amini et al. [43] found the initial node partitions using the spectral clustering method based on the normalized Laplacian matrix derived from a regularized adjacency matrix; those partitions were used for fitting a stochastic block model by a pseudolikelihood algorithm to detect the resulting community structure. de Lange et al. [44] identified an integrative community structure in the macroscopic anatomical neural networks of the macaque and cat and the microscopic network of C. elegans by examining the spectra of their normalized Laplacian matrices. Krzakala et al. [45] produced a class of spectral algorithms to detect communities based on the nonbacktracking matrix, which depicts a nonbacktracking walk on the directed edges of the network. Shi et al. [46] proposed a spectral community detection method, LLSA, which employs the Lanczos method to obtain the approximated eigenvector of the transition matrix with the largest eigenvalue; the elements of this eigenvector approximately indicate the affiliation probability of the corresponding nodes to the communities.

Most of the methods mentioned above are global ones: they often detect communities depending on some global information as prior knowledge, such as the number of communities or information about eigenvalues or eigenvectors, which is hard to acquire as the networks involved grow larger and larger. Moreover, most of them are computationally demanding, leading to high time complexity. These limitations prevent them from being applied to large-scale applications. To overcome the deficiency of the global algorithms, many local methods have been proposed, including some of the aforementioned methods. For example, LPA and most of its variations determine which label should be adopted by a node according to its neighborhood only. LCCD takes into account both the local density of nodes and the relative distance between nodes to locate the local structural centers and expands communities from the structural centers with a local search procedure. LLSA applies a fast heat kernel diffusion to sample a small subnetwork including almost all members of a community, and the eigenvector whose elements indicate the community memberships of nodes is obtained by performing the Lanczos method on the sampled subnetwork.

Besides this, the ComSim algorithm [47] identifies cores of communities from bipartite networks by seeking cycles, which are node chains formed by following outgoing links and reaching a node already visited, and then allocates the remaining nodes to the communities that maximize the similarity between the node and the community. In the BLI algorithm [48], local clustering information and local structural similarity are employed to establish the primary community structure; then some small-scale communities whose sizes are smaller than a given threshold $\lambda$ are absorbed by some larger ones. kSIM [49] is also a local method that works in a bottom-up way. At the beginning, each node is taken as a community; then the preliminary communities are formed by identifying, for each node, the neighbor community to which the one of its $k$ most similar neighbors with the lowest degree belongs and assigning the node to that community. In this procedure, the common neighbor index is employed as the similarity measure for each pair of nodes.

Input: $G(V, E)$: the network; $\delta$: the community metric threshold
Output: $CS$: the detected community structure
/* form the preliminary community structure $CS_{pre}$ */
1: $CS_{pre} \leftarrow$ FPC($G$)
/* merge small or sparse communities in $CS_{pre}$ */
2: $CS \leftarrow$ PCM($CS_{pre}$, $\delta$)
3: return $CS$

Algorithm 1: The framework of our proposed method NSA.

Compared to the global ones, these local methods show good performance in large-scale networks. Inspired by this, we also propose a local method to extract communities from networks. The proposed method is based on node similarity and is termed NSA (Node Similarity based Algorithm) for short. It comprises two phases: the first phase aims at constructing the preliminary community structure; the second phase tries to improve the quality of the final result by merging some small or sparse communities. To do so, we also propose a measure, the community metric, to evaluate the sparsity or smallness of communities. The details of the proposed method are elaborated in the next section.

3. The Proposed Method

3.1. The Framework of the Proposed Method. The framework of the proposed method is outlined by the pseudocode listed in Algorithm 1.

As mentioned previously, the proposed method consists of two phases. The function calls FPC() and PCM() implement the two phases, respectively. The former establishes the preliminary community structure based on a node selection strategy and the node similarity; the latter merges some small or sparse communities to improve the quality of the resulting community structure. The inputs of this algorithm are the network and a threshold $\delta$. The network involved in this paper is an undirected and unweighted graph, represented as $G(V, E)$ as in Algorithm 1, where $V$ and $E$ are the node set and edge set, respectively; $|V| = n$ and $|E| = m$ are the numbers of nodes and edges in the network, respectively. The threshold $\delta$ is used in the second phase of the proposed method to identify communities to be merged: a community whose community metric is smaller than $\delta$ should be merged into another one. The output of this algorithm is the detected community structure.

The next two subsections describe the two procedures in detail.

3.2. Formation of the Preliminary Community Structure. The function FPC() implements the first phase of the proposed method, whose purpose is to construct the preliminary community structure from the network. We first pick out the node with the largest degree from the network, take it as the exemplar of the first community, and insert its most similar neighbor into the community as well (if more than one node has the largest degree in the network, we arbitrarily select one of them as the exemplar, and if the exemplar has more than one most similar neighbor, the one with the smallest degree is selected). Afterwards, the next largest-degree node in the remainder of the network is selected; if its most similar neighbor has not been classified into any community yet, we create a new community for it and its most similar neighbor. Otherwise, if its most similar neighbor has been assigned to a certain community (e.g., the one denoted as $C_k$), we insert the selected node into that community (i.e., $C_k$) as well. We repeat this process until every node is classified into a community. In this procedure, densely connected nodes can quickly gather around the exemplars to form communities. At the end of this procedure, we get a series of communities, which constitute the preliminary community structure of the network. The pseudocode describing the entire procedure is listed in Algorithm 2.

In this algorithm, the degree of node $u$ is the number of $u$'s neighbors and is denoted as $d_u$, i.e.,

$$d_u = |\Gamma(u)|, \tag{1}$$

where

$$\Gamma(u) = \{v \mid (u, v) \in E,\ v \in V\} \tag{2}$$

is the set of neighbors of node $u$. $\mathrm{sim}(u, v)$ stands for the similarity between nodes $u$ and $v$. There are abundant ways to calculate the similarity between nodes in the network, and any one of them can be employed in principle. However, for the sake of efficiency, we calculate it here as in the following equation, which involves only the neighborhoods of nodes $u$ and $v$ themselves:

$$\mathrm{sim}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}. \tag{3}$$
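Equations (1) through (3) translate almost directly into set operations. The following is a minimal Python sketch, not the authors' implementation; the adjacency representation, the helper names `build_adjacency`, `degree`, and `sim`, and the toy edge list are assumptions made here for illustration.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Return {node: set(neighbors)} for an undirected, unweighted edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def degree(adj, u):
    """Equation (1): d_u = |Gamma(u)|."""
    return len(adj[u])

def sim(adj, u, v):
    """Equation (3): Jaccard index of the two neighbor sets."""
    union = adj[u] | adj[v]
    if not union:  # assumption: two isolated nodes have similarity 0
        return 0.0
    return len(adj[u] & adj[v]) / len(union)

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = build_adjacency(EDGES)
print(degree(adj, 3))   # node 3 has neighbors 1, 2, and 4
print(sim(adj, 1, 2))   # one common neighbor (3) out of the union {1, 2, 3}
```

Because (3) uses open neighborhoods, two adjacent nodes do not count each other as common neighbors, so a pair linked only by a bridge edge, such as (3, 4) in the toy graph, has similarity 0.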

The variables $U$ and $CS_{pre}$ are used to record the unclassified nodes and the preliminary community structure; they are naturally initialized to be the original node set $V$ of network $G$ and an empty set $\phi$ in step 1. Steps 2 and 3 select the node with the largest degree from the remainder of the network and its most similar neighbor and denote them as $v$ and $w$, respectively. Step 4 determines whether $w$ has been assigned to a community or not. If it has not been classified into any community yet, steps 5 and 6 create a new community for nodes $v$ and $w$ and insert the newly created community into $CS_{pre}$; then step 7 removes nodes $v$ and $w$ from $U$, as they have just been classified into the new community. If node $w$ has already been assigned to a community, step 9 finds the community $C_k$ to which node $v$'s most similar neighbor $w$ belongs, and step 10 inserts node $v$ into community $C_k$. Since node $v$ has been assigned to community $C_k$, step 11 removes it from $U$. Step 12 repeats the operations in steps 2 through 11 until $U = \phi$, meaning that all the nodes in the network have been visited. At that time, the preliminary community structure is obtained in $CS_{pre}$ and is returned as the output of this algorithm in step 13.

Algorithm 2: FPC($G$): forming the preliminary community structure.
Input: $G(V, E)$, the network
Output: $CS_{pre} = \{C_1, C_2, \cdots, C_k\}$, the identified preliminary community structure
1: Initialize variables $U$ and $CS_{pre}$, which are used to record the unclassified nodes and the preliminary community structure: $U \leftarrow V$; $CS_{pre} \leftarrow \phi$
2: Select the node with the largest degree; denote it as $v$: $v \leftarrow \arg\max_u \{d_u \mid u \in U\}$
3: Get the most similar neighbor of $v$; denote it as $w$: $w \leftarrow \arg\max_u \{\mathrm{sim}(v, u) \mid u \in \Gamma(v)\}$
4: if $w$ has not been assigned to any community then
5: Create a new community for nodes $v$ and $w$: $K \leftarrow |CS_{pre}|$; $C_{K+1} \leftarrow \{v, w\}$
6: Insert the created community into the community structure: $CS_{pre} \leftarrow CS_{pre} \cup \{C_{K+1}\}$
7: Remove nodes $v$ and $w$ from $U$, as they are classified: $U \leftarrow U - \{v, w\}$
8: else
9: Find the community to which $w$ belongs; denote it as $C_k$: $k \leftarrow \mathrm{locate}(CS_{pre}, w)$
10: Insert node $v$ into $C_k$: $C_k \leftarrow C_k \cup \{v\}$
11: Remove node $v$ from $U$, as it is classified: $U \leftarrow U - \{v\}$
12: Repeat steps 2 through 11 until $U = \phi$
13: return $CS_{pre}$
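The first phase can be sketched in Python as follows. This is an illustrative reading of Algorithm 2 rather than the authors' code: it uses plain linear scans instead of a heap, breaks degree ties by node label (the paper allows an arbitrary choice), applies the smallest-degree tie rule to every selected node, and treats an isolated node as a singleton community. All three choices, and the function names, are assumptions.

```python
def sim(adj, u, v):
    """Equation (3): Jaccard index of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def fpc(adj):
    """Form preliminary communities from {node: set(neighbors)}."""
    community_of = {}            # node -> index into `communities`
    communities = []             # CS_pre as a list of node sets
    unclassified = set(adj)      # U
    while unclassified:
        # Step 2: largest degree first; ties broken by label (assumption).
        v = min(unclassified, key=lambda u: (-len(adj[u]), u))
        if not adj[v]:
            # Assumption: an isolated node becomes a singleton community.
            community_of[v] = len(communities)
            communities.append({v})
            unclassified.discard(v)
            continue
        # Step 3: most similar neighbor; ties go to the smallest degree.
        w = min(adj[v], key=lambda u: (-sim(adj, v, u), len(adj[u]), u))
        if w not in community_of:
            # Steps 5-7: open a new community for v and w.
            community_of[v] = community_of[w] = len(communities)
            communities.append({v, w})
            unclassified -= {v, w}
        else:
            # Steps 9-11: v joins the community of w.
            k = community_of[w]
            communities[k].add(v)
            community_of[v] = k
            unclassified.discard(v)
    return communities

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(fpc(adj))  # two communities: nodes 1-3 and nodes 4-6
```

On the toy graph, node 3 is selected first and pairs with node 1, node 4 then pairs with node 5, and nodes 2 and 6 join the communities of their most similar neighbors.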

To make it clearer, we take Zachary's karate club network [14] as an example to illustrate the procedure intuitively. This is a network with 34 nodes and 78 edges, as shown in Figure 1(a), in which the node with the largest degree is node '34' and its most similar neighbor is node '33'. Therefore, node '34' is taken as the exemplar of the first community, and node '33' is also inserted into this community. Then, the node with the largest degree among the remaining nodes is node '1'; its most similar neighbor is node '2'. Since node '2' has not been assigned to a community yet, we create a new community, take node '1' as its exemplar, and insert node '2' into the new community as well. The same thing happens to node pairs ('3', '4'), ('32', '29'), and ('9', '31') sequentially. Then, the next largest-degree node is '14'; its most similar neighbor, node '4', is already in the third community; therefore, we insert node '14' into the third community. All of the other nodes are processed in the same way, and in the subsequent operations, node pairs ('24', '30'), ('6', '7'), ('5', '11'), and ('25', '26') form new communities; all of the remaining nodes are inserted into the communities to which their most similar neighbors belong. At the end of the process, we obtain the preliminary community structure as shown in Figure 1(b), in which each node connects to its most similar neighbor with a directed edge.

3.3. Merge of Small or Sparse Communities. At the end of the first phase of our proposed method, we obtain the preliminary community structure. However, some communities are either too small or too sparse to make sense, such as the preliminary communities {'5', '11'}, {'9', '31'}, {'32', '29'}, {'25', '26', '28'}, {'24', '30', '27'}, and {'6', '7', '17'} in Figure 1(b): each of them contains only a few nodes, the inside edges of each of them are very sparse, and the number of edges inside each of them is much smaller than that of edges connecting to the outside, violating the characteristic that connections inside one community are much denser than those across different communities. Keeping them in the final community structure will lead to low quality. Therefore, we merge some of the preliminary communities to acquire the final result in the second phase, which is carried out by the function call PCM() in Algorithm 1.

To this end, two problems need to be solved in PCM(). The first is to identify which communities are small or sparse enough that they need to be merged into other ones; the second is to select the community into which each of the small or sparse communities should be merged.

Figure 1: The procedure of FPC() on the karate club network. (a) The karate club network. (b) The preliminary community structure, in which each node connects to its most similar neighbor with a directed edge.

For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index as Definition 1 through Definition 3.

Definition 1 (community sparsity). The sparsity of community $C_i$ is defined as follows:

$$\alpha_i = \frac{|E_i^{in}|}{|E_i^{out}|}, \tag{4}$$

where $E_i^{in}$ is the set of edges within community $C_i$ and $E_i^{out}$ is the set of edges connecting nodes in community $C_i$ with other communities.

That is to say, the sparsity of community $C_i$ is defined as the ratio between the number of inner edges of $C_i$ and the number of outer edges of $C_i$. Obviously, the more edges exist within community $C_i$, the larger the value of $\alpha_i$ will be, and vice versa; a small $\alpha_i$ therefore indicates a sparse community.

Definition 2 (community scale). The scale of community $C_i$ is formalized as follows:

$$\beta_i = \frac{|V_i|}{|V|}, \tag{5}$$

where $V_i$ is the set of nodes in community $C_i$.

Obviously, the scale of community $C_i$ is defined as the ratio of the number of nodes in $C_i$ to the total number of nodes in the network. The more nodes there are in community $C_i$, the larger the ratio will be, and vice versa.

Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community $C_i$ as follows:

$$\gamma_i = \alpha_i \times \beta_i. \tag{6}$$

On the basis of these definitions, the first problem can be solved by setting a community metric threshold $\delta$. That is to say, if $\gamma_i < \delta$, community $C_i$ needs to be merged into another community.
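Definitions 1 through 3 can be checked on a small example. The sketch below is a hedged illustration, not the authors' code; the function name `community_metrics` is hypothetical, and returning infinity for a community with no outgoing edges is an assumption, since (4) is undefined in that case.

```python
import math

def community_metrics(adj, community):
    """Return (alpha, beta, gamma) of equations (4)-(6) for one community.

    adj is {node: set(neighbors)}; community is a non-empty set of nodes.
    """
    members = set(community)
    # Each inner edge is seen from both of its end nodes, hence the halving.
    inner_ends = sum(1 for u in members for v in adj[u] if v in members)
    outer = sum(1 for u in members for v in adj[u] if v not in members)
    inner = inner_ends // 2
    # Assumption: no outgoing edges means the community is never "sparse".
    alpha = math.inf if outer == 0 else inner / outer
    beta = len(members) / len(adj)
    return alpha, beta, alpha * beta

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(community_metrics(adj, {1, 2, 3}))  # a full triangle: 3 inner edges, 1 outer
print(community_metrics(adj, {1, 3}))     # a fragment: 1 inner edge, 3 outer
```

With a threshold such as $\delta = 0.2$, the fragment {1, 3} (metric 1/9) would be flagged for merging, whereas the full triangle (metric 1.5) would be kept.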

For the second problem, we consider a strategy conforming to the construction of preliminary communities. The preliminary communities are formed based mainly on node similarity in the first phase; therefore, we also use the similarity as the criterion here to merge communities, i.e., each of the small or sparse communities is merged into its most similar adjacent community. Here, the similarity between two communities $C_i$ and $C_j$ is calculated as follows:

$$\mathrm{Sim}(C_i, C_j) = \frac{\sum_{u \in C_i,\, v \in C_j} \mathrm{sim}(u, v)}{|C_j|}, \tag{7}$$

where $\mathrm{sim}(u, v)$ is the similarity between nodes $u \in C_i$ and $v \in C_j$, which is calculated using (3). In function PCM(), which implements the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator on the right side of (7) is the sum of similarities between nodes in communities $C_i$ and $C_j$. Dividing by the denominator $|C_j|$ restrains the priority of larger communities, to prevent the formation of giant communities.
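As a small worked instance of (7), the sketch below computes the similarity of a single-node community to two candidate neighbors. The helper names and the toy graph are assumptions for illustration only.

```python
def sim(adj, u, v):
    """Equation (3): Jaccard index of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def community_sim(adj, ci, cj):
    """Equation (7): summed pairwise node similarity, divided by |C_j|."""
    total = sum(sim(adj, u, v) for u in ci for v in cj)
    return total / len(cj)

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(community_sim(adj, {4}, {5, 6}))     # (0.25 + 0.25) / 2
print(community_sim(adj, {4}, {1, 2, 3}))  # (0.25 + 0.25 + 0) / 3
```

The two numerators are equal here, so the division by $|C_j|$ alone decides that node 4 is closer to the smaller community {5, 6}, which is the damping effect on large communities described above.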

The logic of the entire procedure of the second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure $CS_{pre}$ in step 1. Step 2 calculates the community metric for each of the preliminary communities; steps 3 and 4 select the community with the smallest community metric and its most similar community; step 5 merges them to yield a new community; and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_t$ and $C_j$ with the new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the minimal community metric, i.e., that of the selected community, is larger than the given threshold $\delta$, meaning that all the remaining communities are satisfactory; the merge procedure is then terminated, and the resulting community structure in $CS$ is returned in step 9.

Algorithm 3: PCM($CS_{pre}$, $\delta$): merge small or sparse communities.
Input: $CS_{pre}$, the preliminary community structure; $\delta$, the community-metric threshold
Output: $CS$, the final community structure
1: Initialize $CS$, which is used to record the community structure: $CS \leftarrow CS_{pre}$
2: Calculate the community metric for each of the preliminary communities: foreach $C_i \in CS$ do $\gamma_i \leftarrow \alpha_i \times \beta_i$
3: Select the community with the minimal community metric; denote its index as $t$: $t \leftarrow \arg\min_i \{\gamma_i \mid i = 1, 2, \cdots, |CS|\}$
4: Identify the community most similar to $C_t$; denote its index as $j$: $j \leftarrow \arg\max_i \{\mathrm{Sim}(C_t, C_i) \mid i = 1, 2, \cdots, |CS|,\ i \neq t\}$
5: Merge communities $C_t$ and $C_j$ to form a new community: $k \leftarrow |CS|$; $C_{k+1} \leftarrow C_t \cup C_j$
6: Calculate the community metric for the new community: $\gamma_{k+1} \leftarrow \alpha_{k+1} \times \beta_{k+1}$
7: Replace the two communities $C_t$ and $C_j$ with the new community to reflect the merging effect: $CS \leftarrow (CS - \{C_t, C_j\}) \cup \{C_{k+1}\}$
8: Repeat steps 3 through 7 until $\gamma_t > \delta$
9: return $CS$
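The merge phase can be sketched in Python along the following lines. This is one possible reading, not the authors' implementation: it checks the threshold before each merge and stops as soon as the smallest metric is at least $\delta$, it restricts merge targets to adjacent communities as the text describes, and it recomputes every metric in each round instead of maintaining a heap. The function names and the toy input are assumptions.

```python
import math

def sim(adj, u, v):
    """Equation (3): Jaccard index of the neighbor sets of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def community_metric(adj, community):
    """Equation (6): gamma = (|E_in| / |E_out|) * (|V_i| / |V|)."""
    inner_ends = sum(1 for u in community for v in adj[u] if v in community)
    outer = sum(1 for u in community for v in adj[u] if v not in community)
    # Assumption: a community without outgoing edges gets an infinite metric.
    alpha = math.inf if outer == 0 else (inner_ends // 2) / outer
    return alpha * len(community) / len(adj)

def community_sim(adj, ci, cj):
    """Equation (7): summed pairwise node similarity, divided by |C_j|."""
    return sum(sim(adj, u, v) for u in ci for v in cj) / len(cj)

def pcm(adj, cs_pre, delta):
    """Merge small or sparse communities until all metrics reach delta."""
    cs = [set(c) for c in cs_pre]                     # step 1
    while len(cs) > 1:
        # Steps 2-3: community with the minimal community metric.
        t = min(range(len(cs)), key=lambda i: community_metric(adj, cs[i]))
        if community_metric(adj, cs[t]) >= delta:     # step 8 (stop test)
            break
        # Step 4: most similar community among the adjacent ones.
        touching = {v for u in cs[t] for v in adj[u]} - cs[t]
        candidates = [i for i in range(len(cs)) if i != t and cs[i] & touching]
        if not candidates:
            break
        j = max(candidates, key=lambda i: community_sim(adj, cs[t], cs[i]))
        # Steps 5-7: replace C_t and C_j by their union.
        merged = cs[t] | cs[j]
        cs = [c for i, c in enumerate(cs) if i not in (t, j)] + [merged]
    return cs

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(pcm(adj, [{1, 3}, {2}, {4, 5}, {6}], delta=0.2))
```

On this input, the singletons {2} and {6} have metric 0 and are absorbed by their only adjacent communities, after which both remaining communities have metric 1.5 and the loop stops.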

3.4. Time Complexity. The proposed algorithm comprises two phases. The first is to form the preliminary communities. The main time consumption in this phase lies in the selection of the node with the largest degree (step 2 in Algorithm 2) and its most similar neighbor (step 3 in Algorithm 2); the former can be accomplished in $O(\log n)$ in each iteration using a max-heap data structure; the latter can be done in $O(\log \langle d \rangle)$ with the max-heap, where $\langle d \rangle$ is the average degree of nodes in the network. Since $\langle d \rangle \ll n$, the time consumption of the first phase is $O(n \log n)$.

The second phase is used to improve the quality of the resulting community structure by merging some of the small or sparse communities. The major time is spent on determining the community that needs to be merged and its most similar adjacent community in each iteration. Assuming there are $K$ communities in the preliminary community structure, the former operation can be implemented in $O(\log K)$; the latter can also be carried out with $O(\log K)$ time consumption in the worst case. Hence, the second phase can be implemented with $O(K \log K)$ time consumption.

Since $K \ll n$, we have $K \log K \ll n \log n$, so the first phase dominates. Therefore, the proposed method can detect communities from networks with relatively high efficiency, i.e., with $O(n \log n)$ time complexity.
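To make the heap-based selection concrete, the sketch below shows one way to obtain "the unclassified node with the largest degree" in $O(\log n)$ per pop with Python's `heapq`. The paper does not specify its heap layout, so the lazy-deletion scheme and the names `nodes_by_degree` and `is_classified` are assumptions.

```python
import heapq

def nodes_by_degree(adj, is_classified):
    """Yield still-unclassified nodes in order of decreasing degree.

    heapq is a min-heap, so degrees are stored negated. Nodes that were
    classified as someone's most similar neighbor are skipped lazily
    when they reach the top of the heap.
    """
    heap = [(-len(nbrs), u) for u, nbrs in adj.items()]
    heapq.heapify(heap)                 # O(n)
    while heap:
        _, u = heapq.heappop(heap)      # O(log n)
        if not is_classified(u):
            yield u

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
classified = set()
order = []
for u in nodes_by_degree(adj, classified.__contains__):
    order.append(u)
    classified.add(u)
    if u == 3:
        classified.add(1)  # pretend node 1 was absorbed as node 3's partner
print(order)  # node 1 never appears because it was classified early
```

Each node is pushed and popped at most once, which gives the $O(n \log n)$ bound for the selection part of the first phase.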

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we have conducted extensive experiments on both several groups of artificial networks and some real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which works with some parameters to control the characteristics of generated networks. Here, we consider the influences of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities and big communities and larger networks with small communities and big communities, respectively. The small networks and larger networks contain 1000 and 5000 nodes, respectively; a small community contains at least 10 nodes and at most 50 nodes; the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small communities and big communities are marked using the suffixes 's' and 'b', respectively. The exponents of the power-law distributions that node degree and community size follow are the default values, $-2$ and $-1$, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed experiments on 13 real-world networks; the sizes of these networks span from tens to hundreds of thousands of nodes; the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second one contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results to those of 5 popular community detection algorithms, namely, Fast$Q$ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. For LPA, since it is a nondeterministic algorithm, we ran it on each network 10 times and took the average of the evaluation metrics as its resulting metric value obtained from that network. For our proposed method NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ will be discussed in Section 5.

Table 1: The parameters used to generate the LFR networks. In the header row of this table, $n$ is the number of nodes contained in the network; $\langle d \rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively; $\exp_d$ and $\exp_{com}$ are the exponents of the power-law distributions that node degree and community size follow; $\min(C_i)$ and $\max(C_i)$ represent the minimal and maximal numbers of nodes contained in every community, respectively.

| Network | $n$ | $\langle d \rangle$ | $d_{max}$ | $\exp_d$ | $\exp_{com}$ | $\min(C_i)$ | $\max(C_i)$ |
|---|---|---|---|---|---|---|---|
| LFR1000s | 1000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR1000b | 1000 | 20 | 50 | -2 | -1 | 20 | 100 |
| LFR5000s | 5000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR5000b | 5000 | 20 | 50 | -2 | -1 | 20 | 100 |

Table 2: The information about the real-world networks. $n$ and $m$ are the numbers of nodes and edges in the network, respectively.

| Network | $n$ | $m$ |
|---|---|---|
| Karate club [14] | 34 | 78 |
| Dolphin social network [15] | 62 | 159 |
| Risk map [16] | 42 | 83 |
| Scientists collaboration network [6] | 118 | 197 |
| Lesmis [17] | 77 | 254 |
| Polbooks [3] | 105 | 441 |
| ColiNeta [18] | 423 | 519 |
| NetScience [10] | 1589 | 2742 |
| Email [19] | 1133 | 5451 |
| YeastL [20] | 2361 | 7182 |
| PGP [21] | 10680 | 24316 |
| DBLP [22] | 317080 | 1049866 |
| Amazon [22] | 334863 | 925872 |

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\frac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}}\right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left(\frac{n_i^{P}}{n}\right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left(\frac{n_j^{P'}}{n}\right)}, \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$.

NMI is an information-theory-based metric, which measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value is in the range of $[0, 1]$; larger is better.
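Equation (8) can be evaluated directly from two label assignments. The following is a minimal sketch under stated assumptions: partitions are given as hypothetical `{node: label}` dictionaries over the same node set, pairs with $n_{ij} = 0$ are skipped (their contribution is taken as 0), and the value 1.0 is returned when both partitions consist of a single community, a case in which (8) is 0/0.

```python
import math
from collections import Counter

def nmi(truth, found):
    """Equation (8) for two partitions given as {node: label} dictionaries."""
    n = len(truth)
    joint = Counter((truth[u], found[u]) for u in truth)   # n_ij > 0 only
    size_p = Counter(truth.values())                        # n_i^P
    size_q = Counter(found[u] for u in truth)               # n_j^P'
    numerator = -2 * sum(
        n_ij * math.log(n_ij * n / (size_p[a] * size_q[b]))
        for (a, b), n_ij in joint.items()
    )
    denominator = (sum(c * math.log(c / n) for c in size_p.values())
                   + sum(c * math.log(c / n) for c in size_q.values()))
    if denominator == 0:  # assumption: two trivial partitions agree fully
        return 1.0
    return numerator / denominator

truth = {1: "a", 2: "a", 3: "b", 4: "b"}
print(nmi(truth, {1: "x", 2: "x", 3: "y", 4: "y"}))  # same grouping, new names
print(nmi(truth, {1: "x", 2: "y", 3: "x", 4: "y"}))  # grouping carries no information
```

Only the grouping matters, not the label names, so the first call reaches the maximum value and the second the minimum.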

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left(e_{ii} - a_i^2\right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$, whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ relative to the total edges in the network; $K$ is the number of communities in the community structure; $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and the degree distribution are the same as in the original network but edges are connected between nodes randomly. The smaller the difference between the two terms, the more the network approaches a random graph, and the weaker the community structure is. On the contrary, the larger the difference between them, the further the network departs from the random graph, and the stronger the community structure is. That is to say, modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network; its effective value falls in $[0, 1]$; higher is better.
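A short sketch of (9) follows. It assumes the usual reading in which $a_i$ is the fraction of edge ends attached to nodes of $C_i$, i.e., the degree sum of $C_i$ divided by $2m$; the function name and the toy graph are illustrative assumptions.

```python
def modularity(adj, communities):
    """Equation (9): Q = sum_i (e_ii - a_i^2) for an undirected graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())   # every edge counted twice
    q = 0.0
    for community in communities:
        members = set(community)
        # Edge ends that stay inside the community: 2 * |E_in|.
        inner_ends = sum(1 for u in members for v in adj[u] if v in members)
        degree_sum = sum(len(adj[u]) for u in members)
        e_ii = inner_ends / two_m
        a_i = degree_sum / two_m   # assumed reading of "edges associated with C_i"
        q += e_ii - a_i ** 2
    return q

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(modularity(adj, [{1, 2, 3}, {4, 5, 6}]))   # 12/14 - 2 * (1/2)^2 = 5/14
print(modularity(adj, [{1, 2, 3, 4, 5, 6}]))     # a single community gives 0
```

The second call illustrates why the trivial partition is uninformative: all edges are internal, but the random-graph expectation is then also 1.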

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter for this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$ is, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point, above which communities in networks tend to be obscure.
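The role of $\mu$ can be illustrated by measuring, for a given partition, the fraction of each node's edges that leave its community. The sketch below is only an empirical check on a hypothetical toy graph; it does not reproduce the LFR generator, and the function name is an assumption.

```python
def mixing_fractions(adj, community_of):
    """Per-node fraction of edges that lead outside the node's community."""
    fractions = {}
    for u, nbrs in adj.items():
        if nbrs:  # isolated nodes have no defined fraction and are skipped
            outside = sum(1 for v in nbrs if community_of[v] != community_of[u])
            fractions[u] = outside / len(nbrs)
    return fractions

# Toy graph (hypothetical): two triangles joined by the edge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
       4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
community_of = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
fractions = mixing_fractions(adj, community_of)
print(fractions[3])                            # one of node 3's three edges leaves
print(sum(fractions.values()) / len(fractions))  # network-wide average
```

A node whose fraction exceeds 0.5 has more edges to other communities than to its own, which is why communities become obscure beyond that point.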

Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities. Plotted curves: FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA); vertical axis: NMI (0.0 to 1.0); horizontal axis: 0.1 to 0.8.

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 with an increment of 0.1 for each group of LFR networks. To reduce the effect of randomness, we generated 10 networks for each value of $\mu$ while keeping the same setting for the other parameters. Since the community structures have already been embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as the input one by one to run our proposed method and the comparison algorithms to detect communities and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized communities or big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results revealed from the larger networks with small-sized communities and big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), Fast$Q$ tends to introduce mistakes in the results, no matter whether communities in networks are well separated or obscure. As mentioned previously, Fast$Q$ is a typical modularity-optimization-based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it are satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities, and furthermore, its performance is the worst among the comparison algorithms for $\mu \leq 0.5$. For $\mu > 0.5$, the quality of its results is only better than that of LPA. LPA performed as well as the other comparison algorithms in those networks for $\mu < 0.5$, but its performance dropped dramatically for $\mu \geq 0.5$; it could not even detect effective communities from networks for $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels to update their own ones, often leading to trivial results in which all nodes are labeled as members of one giant community. The proposed method NSA acquired NMI = 1 on all networks for $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, there is a drop in the quality of the detected community structures for all three of those algorithms and the proposed method. For $0.5 < \mu \leq 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geq 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar to those in Figure 2 overall, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; however, in Figure 3, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performances of the comparison algorithms are not stable on different networks, but our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also manifested by the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: ones with the ground-truth community structure known a priori and the other ones without publicly acknowledged ground truth.

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities. Plotted curves: FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA); vertical axis: NMI (0.0 to 1.0); horizontal axis: 0.1 to 0.8.

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. The scales of these networks are relatively small, which facilitates visualizing the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This is a network depicting the friendships among members of a karate club; it contains 34 nodes and 78 edges. This network was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During the period of Zachary's study, the club split into two factions because of a dispute between the administrator and the instructor. Corresponding to the two parts, the partition into two communities is always taken as the ground truth of the network, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4 we can see that our proposed methoddetected 3 rather than 2 communities from the network Itseems that the detected result deviates from the ground truthin some ways but this result coincides with the conclusion

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method, NSA.

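Table 3: The experimental results on networks with ground-truth community structures. The largest values of the two measure metrics are typed in bold.

| Network | Metric | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|---|
| Karate | Q | 0.381 | 0.353 | 0.355 | 0.371 | 0.371 | **0.402** |
| | NMI | 0.693 | 0.504 | 0.62 | 0.924 | **1.00** | 0.699 |
| Dolphin | Q | 0.492 | 0.489 | 0.464 | 0.45 | 0.505 | **0.513** |
| | NMI | 0.719 | 0.632 | 0.719 | 0.69 | 0.744 | **0.887** |
| Risk map | Q | **0.625** | 0.624 | 0.59 | 0.598 | 0.519 | 0.624 |
| | NMI | **0.894** | 0.848 | 0.821 | 0.839 | 0.714 | 0.848 |
| Scientists | Q | **0.749** | 0.733 | 0.64 | 0.694 | 0.668 | 0.744 |
| | NMI | 0.867 | 0.818 | 0.743 | 0.835 | 0.823 | **0.878** |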

found in the experiments on synthetic networks that our proposed method tends to find small communities in networks, which helps it overcome the resolution limit problem. Moreover, from the perspective of the measure metrics, the modularity of the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with modularity as large as possible: if its modularity is not the largest, it is the second largest, with a small offset from the largest. These findings are also borne out on the following networks.

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the co-occurrences of pairs of dolphins being observed, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method on this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is better than those of the comparison algorithms on both metrics.
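To make the NMI comparison concrete, the sketch below computes the normalized mutual information between a ground-truth labeling and a detected labeling. It assumes the confusion-matrix definition commonly used in community detection, $\mathrm{NMI} = -2 \sum_{ij} N_{ij} \ln\frac{N_{ij} N}{N_{i\cdot} N_{\cdot j}} \big/ \left( \sum_i N_{i\cdot} \ln\frac{N_{i\cdot}}{N} + \sum_j N_{\cdot j} \ln\frac{N_{\cdot j}}{N} \right)$; this section does not restate the formula, so that choice and the function name `nmi` are assumptions.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def nmi(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Normalized mutual information between two labelings of the same nodes."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)                # N_i. : community sizes in A
    count_b = Counter(labels_b)                # N_.j : community sizes in B
    joint = Counter(zip(labels_a, labels_b))   # N_ij : overlap sizes
    mutual = sum(
        n_ij * log(n_ij * n / (count_a[a] * count_b[b]))
        for (a, b), n_ij in joint.items()
    )
    entropy = sum(c * log(c / n) for c in count_a.values())
    entropy += sum(c * log(c / n) for c in count_b.values())
    if entropy == 0:
        # Both labelings put every node in one community; treat them as identical.
        return 1.0
    return -2.0 * mutual / entropy
```

With this definition, two labelings that describe the same partition score 1 regardless of how the communities are numbered, and two statistically independent labelings score 0.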

Risk Map Network. This network is a world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories of 6 continents are involved. Accordingly, the 42 nodes and the 83 edges connecting adjacent countries or territories are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are misclassified into the wrong communities in the detected result. However, nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer, or


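Table 4: The experimental results of modularity on networks. The largest modularity value for each network is typed in bold.

| Network | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|
| Lesmis | 0.499 | 0.519 | 0.515 | 0.498 | 0.491 | **0.54** |
| Polbooks | 0.502 | 0.507 | 0.508 | 0.501 | 0.518 | **0.524** |
| ColiNeta | **0.779** | 0.746 | 0.693 | 0.718 | - | 0.761 |
| Email | 0.499 | 0.531 | 0.379 | 0.464 | 0.531 | **0.544** |
| NetScience | 0.955 | 0.956 | 0.896 | 0.937 | - | **0.957** |
| YeastL | 0.573 | 0.529 | 0.372 | 0.511 | - | **0.574** |
| PGP | 0.85 | 0.789 | 0.765 | 0.768 | 0.726 | **0.867** |
| DBLP | 0.735 | - | 0.652 | 0.637 | - | **0.782** |
| Amazon | 0.869 | - | 0.743 | 0.741 | - | **0.898** |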

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method, NSA.

even more, than those within the communities to which these nodes belong. Therefore, if we ignore the actual meaning of these nodes and judge qualitatively on the topology alone, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three nodes are located within their communities than in the ground truth, so these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, the values of both measure metrics obtained by our proposed method are second only to those of FastQ and equal to those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.
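The topological argument above rests on comparing, for a given node, the edges that stay inside its community with the edges that leave it. A minimal sketch of that count follows; the adjacency dictionary, the function name `inner_outer_edges`, and the toy graph are hypothetical and serve only to illustrate the comparison, not to reproduce the Risk map network.

```python
from typing import Dict, Hashable, List, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def inner_outer_edges(
    adj: Adjacency, communities: List[Set[Hashable]], node: Hashable
) -> Tuple[int, int]:
    """Return (inner, outer) edge counts of `node` with respect to its community."""
    own = next(comm for comm in communities if node in comm)
    inner = sum(1 for neighbor in adj[node] if neighbor in own)
    return inner, len(adj[node]) - inner


# Hypothetical toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
TOY_ADJ: Adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(inner_outer_edges(TOY_ADJ, [{0, 1, 2}, {3, 4, 5}], 2))  # (2, 1)
```

Running the same count under two candidate partitions shows directly which assignment keeps more of a node's edges inside its community.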

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge connects two scientists who have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; that is, two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the measure metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one closest to the ground-truth community structure; the modularity of the proposed method is not the largest, but it is second only to that of FastQ. These results also testify that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, due to the absence of ground-truth community structures. For the proposed method and the comparison algorithms, the obtained values of modularity are recorded in Table 4. To illustrate them

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method, NSA. (Horizontal axis: the networks Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon; vertical axis: modularity Q, from 0.0 to 1.0; bars: FastQ, WalkTrap, LPA, Attractor, IsoFdp, and NSA.)

intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the only remaining network, ColiNeta, it still obtained the second largest modularity. FastQ is based on the modularity optimization strategy, yet it acquired the largest modularity on network ColiNeta only. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it could not produce effective results on networks Amazon and DBLP due to the large scale of these two networks. LPA and Attractor can extract community structures from all those networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on networks ColiNeta, NetScience, and YeastL, as these three networks are disconnected. It could not detect the community structure of networks Amazon and DBLP effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and that it outperforms the comparison algorithms in modularity on 8 of these 9 networks.
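The counts quoted above (largest modularity on 8 networks, second largest on ColiNeta) can be checked mechanically against Table 4. The sketch below does so; it assumes the table values carry a decimal point after the leading zero (for example, 0.499), encodes the "-" entries as `None`, and uses the illustrative names `TABLE4` and `rank_of`, none of which come from the paper.

```python
from typing import List, Optional

ALGORITHMS = ["FastQ", "WalkTrap", "LPA", "Attractor", "IsoFdp", "NSA"]

# Modularity values transcribed from Table 4; None marks a "-" (no result).
TABLE4 = {
    "Lesmis":     [0.499, 0.519, 0.515, 0.498, 0.491, 0.54],
    "Polbooks":   [0.502, 0.507, 0.508, 0.501, 0.518, 0.524],
    "ColiNeta":   [0.779, 0.746, 0.693, 0.718, None, 0.761],
    "Email":      [0.499, 0.531, 0.379, 0.464, 0.531, 0.544],
    "NetScience": [0.955, 0.956, 0.896, 0.937, None, 0.957],
    "YeastL":     [0.573, 0.529, 0.372, 0.511, None, 0.574],
    "PGP":        [0.85, 0.789, 0.765, 0.768, 0.726, 0.867],
    "DBLP":       [0.735, None, 0.652, 0.637, None, 0.782],
    "Amazon":     [0.869, None, 0.743, 0.741, None, 0.898],
}


def rank_of(algorithm: str, scores: List[Optional[float]]) -> int:
    """Rank of `algorithm` in one table row (1 = largest modularity)."""
    own = scores[ALGORITHMS.index(algorithm)]
    if own is None:
        raise ValueError(f"{algorithm} has no result in this row")
    return 1 + sum(1 for s in scores if s is not None and s > own)


nsa_ranks = {network: rank_of("NSA", row) for network, row in TABLE4.items()}
nsa_wins = [network for network, rank in nsa_ranks.items() if rank == 1]
print(len(nsa_wins), nsa_ranks["ColiNeta"])
```

Ranking by strict comparison means ties share the better rank, which matches how the text treats equal values elsewhere in this section.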

Figure 9: The setting of parameter $\delta$: modularity $Q$ plotted against $\delta \in [0, 0.30]$ on (a) the karate club network, (b) the dolphin social network, (c) the risk map network, and (d) the scientists collaboration network.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As aforementioned, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conduct a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion $\delta \in [0, 0.3]$ because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method obtains the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding values for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 in the range $[0.05, 0.2]$ according to the size and the sparsity of the networks involved in real-world applications.
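The parameter study described above amounts to a one-dimensional grid search. A minimal sketch follows, assuming a hypothetical callable `evaluate(delta)` that runs the method on one network with the given threshold and returns the modularity of the result; that callable, the name `sweep_delta`, and the toy quality curve are illustrative and not part of the paper.

```python
from typing import Callable, Tuple


def sweep_delta(
    evaluate: Callable[[float], float],
    start: float = 0.0,
    stop: float = 1.0,
    step: float = 0.005,
) -> Tuple[float, float]:
    """Try every delta on the grid and return (best_delta, best_modularity)."""
    n_steps = round((stop - start) / step)
    best_delta, best_q = start, float("-inf")
    for i in range(n_steps + 1):
        # Build each grid point from the index to avoid accumulating float error.
        delta = round(start + i * step, 10)
        q = evaluate(delta)
        if q > best_q:  # strict comparison keeps the smallest delta on ties
            best_delta, best_q = delta, q
    return best_delta, best_q


def toy_quality(delta: float) -> float:
    """Stand-in for a real run of the method; it simply peaks at delta = 0.1."""
    return 0.4 - (delta - 0.1) ** 2


print(sweep_delta(toy_quality))
```

Restricting `start` and `stop` to 0.05 and 0.2 gives the narrower search range that the text recommends for other networks.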

6. Conclusion

In this paper, we presented a novel method to detect communities in networks. It is a local method based on node similarity, and it avoids the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, on the basis of the community assignment of its most similar neighbor, either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the resolution limit problem to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849-1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International symposium on computer and information sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Input: $G(V, E)$, the network; $\delta$, the community metric threshold
Output: $CS$, the detected community structure

/* form the preliminary community structure $CS_{pre}$ */
1: $CS_{pre} \leftarrow \mathrm{FPC}(G)$
/* merge small or sparse communities in $CS_{pre}$ */
2: $CS \leftarrow \mathrm{PCM}(CS_{pre}, \delta)$
3: return $CS$

Algorithm 1: The framework of our proposed method, NSA.

which one of its $k$ most similar neighbors with the lowest degree belongs and assigning the node to that community. In this procedure, the common neighbor index is employed as the similarity measure for each pair of nodes.

Compared with global methods, these local methods show good performance on large-scale networks. Inspired by this, we also propose a local method to extract communities from networks. The proposed method is based on node similarity and is termed NSA (Node Similarity based Algorithm) for short. It comprises two phases: the first phase aims at constructing the preliminary community structure; the second phase tries to improve the quality of the final result by merging some small or sparse communities. To do so, we also propose a measure, the community metric, to evaluate the sparsity or smallness of communities. The details of the proposed method are elaborated in the next section.

3. The Proposed Method

3.1. The Framework of the Proposed Method. The framework of the proposed method is outlined by the pseudocode listed in Algorithm 1.

As mentioned previously, the proposed method consists of two phases. The function calls FPC() and PCM() implement the two phases, respectively. The former establishes the preliminary community structure based on a node selection strategy and the node similarity; the latter merges some small or sparse communities to improve the quality of the resulting community structure. The inputs of this algorithm are the network and a threshold $\delta$. The network considered in this paper is an undirected and unweighted graph, which is represented as $G(V, E)$ as in Algorithm 1, where $V$ and $E$ are the node set and the edge set, respectively, and $|V| = n$ and $|E| = m$ are the numbers of nodes and edges in the network, respectively. The threshold $\delta$ is used in the second phase of the proposed method to identify communities to be merged: a community whose community metric is smaller than $\delta$ should be merged into another one. The output of this algorithm is the detected community structure.

The next two subsections describe the two procedures in detail.

3.2. Formation of the Preliminary Community Structure. The function FPC() implements the first phase of the proposed method, whose purpose is to construct the preliminary community structure from the network. We first pick out the node with the largest degree from the network, take it as the exemplar of the first community, and insert its most similar neighbor into the community as well (if there is more than one node with the largest degree in the network, we arbitrarily select one of them as the exemplar, and if the exemplar has more than one most similar neighbor, the one with the smallest degree is selected). Afterwards, the node with the largest degree in the remainder of the network is selected; if its most similar neighbor has not been classified into any community yet, we create a new community for it and its most similar neighbor. Otherwise, if its most similar neighbor has been assigned to a certain community (e.g., the one denoted as $C_k$), we insert the selected node into that community (i.e., $C_k$) as well. We repeat this process until every node is classified into a community. In this procedure, densely connected nodes can quickly gather around the exemplars to form communities. At the end of this procedure, we get a series of communities, which constitute the preliminary community structure of the network. The pseudocode describing the entire procedure is listed in Algorithm 2.

In this algorithm, the degree of node $u$ is the number of $u$'s neighbors and is denoted as $d_u$, i.e.,

$$d_u = |\Gamma(u)|, \tag{1}$$

where

$$\Gamma(u) = \{v \mid (u, v) \in E,\ v \in V\} \tag{2}$$

is the set of neighbors of node $u$. $sim(u, v)$ stands for the similarity between nodes $u$ and $v$. There are abundant ways to calculate the similarity between nodes in a network, and any one of them can be employed in principle. However, for the sake of efficiency, we calculate it as in the following equation, which involves only the neighborhoods of nodes $u$ and $v$ themselves:

$$sim(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}. \tag{3}$$
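Equation (3) is the Jaccard index of the two neighbor sets, so it maps directly onto set operations. The sketch below is one possible rendering; the adjacency dictionary, the function name `neighbor_similarity`, and the convention of returning 0 when both nodes are isolated are assumptions made for illustration.

```python
from typing import Dict, Hashable, Set

Adjacency = Dict[Hashable, Set[Hashable]]


def neighbor_similarity(adj: Adjacency, u: Hashable, v: Hashable) -> float:
    """sim(u, v) of equation (3): shared neighbors over all neighbors of u or v."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0  # assumption: two isolated nodes are treated as dissimilar
    return len(adj[u] & adj[v]) / len(union)


# Hypothetical toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
ADJ: Adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(neighbor_similarity(ADJ, 0, 1))  # 1/3: node 2 is shared, union is {0, 1, 2}
print(neighbor_similarity(ADJ, 2, 3))  # 0.0: the bridge ends share no neighbor
```

Because $\Gamma(u)$ excludes $u$ itself, two adjacent nodes with no common neighbor, such as the two ends of the bridge in the toy graph, have similarity 0.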

The variables $U$ and $CS_{pre}$ are used to record the unclassified nodes and the preliminary community structure; they are naturally initialized to the original node set $V$ of network $G$ and an empty set $\phi$, respectively, in step 1. Steps 2 and 3 select the node with the largest degree from the remainder of the network and its most similar neighbor and denote them as $v$ and $w$, respectively. Step 4 determines whether $w$ has been assigned

Complexity 5

Input 119866(119881 119864) the networkOutput 119862119878 119901119903119890 = 1198621 1198622 sdot sdot sdot 119862119896 the identified preliminary community structure

1 Initialize variables 119880 and 119862119878 119901119903119890 which are used to recordthe unclassified nodes and the preliminary community structure

119880 larr997888 119881 119862119878 119901119903119890 larr997888 1206012 Select the node with the largest degree denote it as V

V larr997888 argmax119906119889119906 | 119906 isin 1198803 Get the most similar neighbor of V denote it as 119908

119908 larr997888 argmax119906119904119894119898(V 119906) | 119906 isin Γ(V)4 if 119908 has not been assigned to any community then5 Create a new community for nodes V and 119908

119870 larr997888 |119862119878 119901119903119890| 119862119870+1 larr997888 V 1199086 Insert the created community into the community structure

119862119878 119901119903119890 larr997888 119862119878 119901119903119890 cup 119862119870+17 Remove nodes V and 119908 from 119880 as they are classified

119880 larr997888 119880 minus V 1199088 else9 Find the community to which 119908 belongs denote it as 119862119896

119896 larr997888 locate(119862119878 119901119903119890 119908)10 Insert node V into 119862119896

119862119896 larr997888 119862119896 cup V11 Remove node V from 119880 as it is classified

119880 larr997888 119880 minus V12 Repeat steps 2 through 11 until 119880 = 12060113 return 119862119878 119901119903119890

Algorithm 2 FPC(G) forming the preliminary community structure

to a community or not if it has not been classified to anycommunity yet steps 5 and 6 create a new community fornodes V and 119908 and insert the newly created community into119862119878 119901119903119890 then step 7 removes nodes V and 119908 from 119880 as theyhave been classified into the new community just now If node119908 has been already assigned to a community step 9 finds thecommunity 119862119896 to which node Vrsquos most similar neighbor 119908belongs and step 10 inserts node V into community 119862119896 Sincenode V has been assigned to community119862119896 step 11 removes itfrom119880 Step 12 repeats operations in steps 2 through 11 until119880 = 120601 meaning that all the nodes in the network have beenvisited At that time the preliminary community structureis obtained in 119862119878 119901119903119890 and is returned as the output of thisalgorithm in step 13
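The steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the function name `fpc`, the use of integer node identifiers, and the tie-breaking by smallest node identifier (where the text allows an arbitrary choice) are assumptions; a simple undirected graph without self-loops or isolated nodes is also assumed.

```python
def fpc(edges):
    """Sketch of Algorithm 2: form the preliminary communities of a simple graph."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)

    def sim(u, v):
        # Equation (3): neighbor-set overlap divided by neighbor-set union.
        union = gamma[u] | gamma[v]
        return len(gamma[u] & gamma[v]) / len(union) if union else 0.0

    unclassified = set(gamma)   # U
    label = {}                  # node -> index of its community in `communities`
    communities = []            # CS_pre
    while unclassified:         # step 12
        # Step 2: largest degree; ties broken by smallest node id (assumption).
        v = min(unclassified, key=lambda x: (-len(gamma[x]), x))
        # Step 3: most similar neighbor; ties go to the smallest degree, as in
        # the text, and then to the smallest node id (assumption).
        w = min(gamma[v], key=lambda x: (-sim(v, x), len(gamma[x]), x))
        if w not in label:
            # Steps 5-7: open a new community for v and w.
            label[v] = label[w] = len(communities)
            communities.append({v, w})
            unclassified -= {v, w}
        else:
            # Steps 9-11: v joins the community of its most similar neighbor.
            label[v] = label[w]
            communities[label[w]].add(v)
            unclassified.discard(v)
    return communities


# Toy network (hypothetical): two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(fpc(toy_edges))
```

On this toy network, node 3 is selected first and paired with node 1, node 4 is then paired with node 5, and nodes 2 and 6 join the communities of their most similar neighbors.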

To make it clearer, we take Zachary's karate club network [14] as an example to illustrate the procedure intuitively. This is a network with 34 nodes and 78 edges, as shown in Figure 1(a), in which the node with the largest degree is node '34' and its most similar neighbor is node '33'. Therefore, node '34' is taken as the exemplar of the first community, and node '33' is also inserted into this community. Then, the node with the largest degree among the remaining nodes is node '1', and its most similar neighbor is node '2'. Since node '2' has not been assigned to a community yet, we create a new community, take node '1' as its exemplar, and insert node '2' into the new community as well. The same thing happens to node pairs ('3', '4'), ('32', '29'), and ('9', '31') in sequence. Then, the next largest-degree node is '14'; its most similar neighbor, node '4', is already in the third community; therefore, we insert node '14' into the third community. All of the other nodes are processed in the same way: in the subsequent operations, node pairs ('24', '30'), ('6', '7'), ('5', '11'), and ('25', '26') form new communities, and all of the remaining nodes are inserted into the communities to which their most similar neighbors belong. At the end of the process, we obtain the preliminary community structure shown in Figure 1(b), in which each node is connected to its most similar neighbor by a directed edge.

3.3. Merge of Small or Sparse Communities. At the end of the first phase of our proposed method, we obtain the preliminary community structure. However, some communities are either too small or too sparse to make sense, such as the preliminary communities {'5', '11'}, {'9', '31'}, {'32', '29'}, {'25', '26', '28'}, {'24', '30', '27'}, and {'6', '7', '17'} in Figure 1(b). Each of them contains only a few nodes and very few inside edges, and the number of edges inside each of them is much smaller than the number of edges connecting it to the outside, which violates the characteristic that connections inside a community are much denser than those across different communities. Keeping them in the final community structure would lead to a low-quality result. Therefore, in the second phase, we merge some of the preliminary communities to acquire the final result; this is carried out by the function call PCM() in Algorithm 1.

To this end, two problems need to be solved in PCM(). The first is to identify which communities are small or sparse enough to be merged into other ones; the second is to select the community into which each of the small or sparse communities should be merged.

Figure 1: The procedure of FPC() on the karate club network, panels (a) and (b).

For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index in Definitions 1 through 3.

Definition 1 (community sparsity). The sparsity of community $C_i$ is defined as follows:

$$\alpha_i = \frac{|E_i^{in}|}{|E_i^{out}|}, \tag{4}$$

where $E_i^{in}$ is the set of edges within community $C_i$, and $E_i^{out}$ is the set of edges connecting nodes in community $C_i$ with other communities.

That is to say, the sparsity of community $C_i$ is defined as the ratio between the number of inner edges of $C_i$ and the number of outer edges of $C_i$. Obviously, the more edges exist within community $C_i$, the larger the value of $\alpha_i$ will be, and vice versa.

Definition 2 (community scale). The scale of community $C_i$ is formalized as follows:

$$\beta_i = \frac{|V_i|}{|V|}, \tag{5}$$

where $V_i$ is the set of nodes in community $C_i$.

That is, the scale of community $C_i$ is defined as the ratio of the number of nodes in $C_i$ to the total number of nodes in the network. The more nodes there are in community $C_i$, the larger the ratio will be, and vice versa.

Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community $C_i$ as follows:

$$\gamma_i = \alpha_i \times \beta_i. \tag{6}$$

On the basis of these definitions, the first problem can be solved by setting a community metric threshold $\delta$. That is to say, if $\gamma_i < \delta$, community $C_i$ needs to be merged into another community.
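Definitions 1 through 3 can be illustrated with a short Python sketch. The function name `community_metric` and the toy network are assumptions made for this example; the treatment of a community without outer edges (infinite $\alpha_i$) is also an assumption, since the definitions do not cover that case.

```python
def community_metric(edges, community, n_nodes):
    """Return (alpha, beta, gamma) of one community, following equations (4)-(6)."""
    inner = sum(1 for u, v in edges if u in community and v in community)
    outer = sum(1 for u, v in edges if (u in community) != (v in community))
    # Assumption: a community with no outer edge is never a merge candidate.
    alpha = inner / outer if outer else float("inf")
    beta = len(community) / n_nodes
    return alpha, beta, alpha * beta


# Toy network (hypothetical): two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
dense = community_metric(toy_edges, {1, 2, 3}, 6)   # 3 inner edges, 1 outer edge
sparse = community_metric(toy_edges, {3, 4}, 6)     # 1 inner edge, 4 outer edges
print(dense, sparse)
```

With a threshold such as $\delta = 0.1$, the triangle {1, 2, 3} ($\gamma = 1.5$) would be kept, whereas the pair {3, 4} ($\gamma = 1/12$) would be flagged for merging.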

For the second problem, we consider a strategy that conforms to the construction of the preliminary communities. The preliminary communities are formed mainly on the basis of node similarity in the first phase; therefore, we also use similarity as the criterion for merging communities here, i.e., each of the small or sparse communities is merged into its most similar adjacent community. The similarity between two communities $C_i$ and $C_j$ is calculated as follows:

$$Sim(C_i, C_j) = \frac{\sum_{u \in C_i,\, v \in C_j} sim(u, v)}{|C_j|}, \tag{7}$$

where $sim(u, v)$ is the similarity between nodes $u \in C_i$ and $v \in C_j$, which is calculated using (3). In the function PCM(), which implements the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator on the right-hand side of (7) is the sum of similarities between nodes in communities $C_i$ and $C_j$. Dividing by the denominator $|C_j|$ restrains the priority given to larger communities, which prevents giant communities from forming.
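To make equation (7) concrete, the following Python sketch computes the similarity between a candidate community and two of its adjacent communities. The names `build_neighbors`, `node_sim`, and `community_sim`, as well as the toy network, are illustrative assumptions.

```python
def build_neighbors(edges):
    """Neighbor sets Gamma(u) of an undirected graph given as an edge list."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    return gamma


def node_sim(gamma, u, v):
    """Equation (3)."""
    union = gamma[u] | gamma[v]
    return len(gamma[u] & gamma[v]) / len(union) if union else 0.0


def community_sim(gamma, ci, cj):
    """Equation (7): summed pairwise node similarity, divided by |C_j|."""
    total = sum(node_sim(gamma, u, v) for u in ci for v in cj)
    return total / len(cj)


# Toy network (hypothetical): two triangles joined by the edge (3, 4).
toy_gamma = build_neighbors([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)])
to_small = community_sim(toy_gamma, {3}, {1, 2})     # (1/4 + 1/4) / 2
to_large = community_sim(toy_gamma, {3}, {4, 5, 6})  # (0 + 1/4 + 1/4) / 3
print(to_small, to_large)
```

Both candidate targets collect the same summed similarity (0.5) in this example, and the division by $|C_j|$ is what tips the choice towards the smaller community {1, 2}.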

The logic of the entire procedure of the second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure $CS_{pre}$ in step 1. Step 2 calculates the community metric for each of the preliminary communities. Steps 3 and 4 select the community with the smallest community metric and its most similar community, step 5 merges them to yield a new community, and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_t$ and $C_j$ with the new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the community metric of the selected community, i.e., the minimal one, is larger than the given threshold $\delta$, meaning that all the remaining communities are satisfactory; the merge procedure is then terminated, and the resulting community structure in $CS$ is returned in step 9.

Input: $CS_{pre}$, the preliminary community structure; $\delta$, the community-metric threshold.
Output: $CS$, the final community structure.

1. Initialize $CS$, which is used to record the community structure.
   $CS \leftarrow CS_{pre}$
2. Calculate the community metric for each of the preliminary communities.
   foreach $C_i \in CS$ do $\gamma_i \leftarrow \alpha_i \times \beta_i$
3. Select the community with the minimal community metric; denote its index as $t$.
   $t \leftarrow \arg\min_i \{\gamma_i \mid i = 1, 2, \cdots, |CS|\}$
4. Identify the community most similar to $C_t$; denote its index as $j$.
   $j \leftarrow \arg\max_i \{Sim(C_t, C_i) \mid i = 1, 2, \cdots, |CS|,\ i \neq t\}$
5. Merge communities $C_t$ and $C_j$ to form a new community.
   $k \leftarrow |CS|$, $C_{k+1} \leftarrow C_t \cup C_j$
6. Calculate the community metric for the new community.
   $\gamma_{k+1} \leftarrow \alpha_{k+1} \times \beta_{k+1}$
7. Replace the two communities $C_t$ and $C_j$ with the new community to reflect the merging effect.
   $CS \leftarrow (CS - \{C_t, C_j\}) \cup \{C_{k+1}\}$
8. Repeat steps 3 through 7 until $\gamma_t > \delta$.
9. return $CS$

Algorithm 3: PCM($CS_{pre}$, $\delta$), merging small or sparse communities.
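The merge loop of Algorithm 3 might be sketched in Python as below. This is an illustrative reading under stated assumptions, not the authors' code: the function name `pcm` and its signature are hypothetical; the threshold is tested before each merge (merging while $\gamma_t < \delta$, in line with Definition 3); candidates are restricted to adjacent communities, as described in the text; and all metrics are recomputed in every round for clarity rather than maintained incrementally.

```python
def pcm(edges, communities, delta):
    """Sketch of Algorithm 3: merge small or sparse communities."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    n = len(gamma)

    def sim(u, v):  # equation (3)
        union = gamma[u] | gamma[v]
        return len(gamma[u] & gamma[v]) / len(union) if union else 0.0

    def metric(c):  # equations (4)-(6)
        inner = sum(1 for u, v in edges if u in c and v in c)
        outer = sum(1 for u, v in edges if (u in c) != (v in c))
        alpha = inner / outer if outer else float("inf")  # assumption
        return alpha * len(c) / n

    def community_sim(ci, cj):  # equation (7)
        return sum(sim(u, v) for u in ci for v in cj) / len(cj)

    cs = [set(c) for c in communities]  # step 1
    while len(cs) > 1:
        # Step 3: community with the minimal community metric.
        t = min(range(len(cs)), key=lambda i: metric(cs[i]))
        if metric(cs[t]) >= delta:
            break  # every remaining community is satisfactory
        ct = cs[t]
        # Step 4, restricted to communities sharing at least one edge with C_t.
        adjacent = [i for i in range(len(cs))
                    if i != t and any(gamma[u] & cs[i] for u in ct)]
        if not adjacent:
            break
        j = max(adjacent, key=lambda i: community_sim(ct, cs[i]))
        # Steps 5-7: replace C_t and C_j by their union.
        merged = ct | cs[j]
        cs = [c for i, c in enumerate(cs) if i not in (t, j)] + [merged]
    return cs


# Toy network (hypothetical): two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
final = pcm(toy_edges, [{1, 2}, {3}, {4, 5, 6}], delta=0.1)
print(final)
```

In this example, the singleton {3} has $\gamma = 0$ and is merged into {1, 2}, its most similar adjacent community; afterwards both remaining communities have $\gamma = 1.5$, so the loop stops.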

3.4. Time Complexity. The proposed algorithm comprises two phases. The first phase forms the preliminary communities. The main time consumption in this phase lies in selecting the node with the largest degree (step 2 in Algorithm 2) and its most similar neighbor (step 3 in Algorithm 2); the former can be accomplished in $O(\log n)$ in each iteration using a max-heap data structure, and the latter can be done in $O(\log \langle d \rangle)$ with the max-heap, where $\langle d \rangle$ is the average degree of nodes in the network. Since $\langle d \rangle \ll n$ and there are at most $n$ iterations, the time consumption of the first phase is $O(n \log n)$.

The second phase improves the quality of the resulting community structure by merging some of the small or sparse communities. Most of the time is spent on determining the community to be merged and its most similar adjacent community in each iteration. Assuming there are $K$ communities in the preliminary community structure, the former operation can be implemented in $O(\log K)$, and the latter can also be carried out in $O(\log K)$ time in the worst case. Hence, the second phase can be implemented with $O(K \log K)$ time consumption.

Since $K \ll n$, the cost of the second phase is dominated by that of the first. Therefore, the proposed method can detect communities from networks with relatively high efficiency, in $O(n \log n)$ time.
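One possible way to realize the $O(\log n)$ selection of the largest-degree unclassified node is a binary heap with lazy deletion, sketched below with Python's standard `heapq` module. The paper only states that a max-heap is used; the generator `degree_order`, the lazy-deletion strategy, and the toy data are assumptions made for illustration.

```python
import heapq


def degree_order(gamma, classified):
    """Yield unclassified nodes in order of decreasing degree.

    heapq is a min-heap, so degrees are negated. Nodes that were classified
    after being pushed are skipped when popped (lazy deletion), so each node
    costs one O(log n) pop.
    """
    heap = [(-len(nbrs), u) for u, nbrs in gamma.items()]
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        if u not in classified:
            yield u


# Toy neighbor sets (hypothetical): a triangle 1-2-3 with node 4 attached to 3.
toy_gamma = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
classified = set()
picked = []
for v in degree_order(toy_gamma, classified):
    picked.append(v)
    classified.add(v)
    if v == 3:
        classified.add(1)  # pretend node 1 joined node 3's community as its partner
print(picked)
```

Because node degrees are taken from the original network and never change, the heap can be built once and never needs re-keying.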

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we conducted extensive experiments on both several groups of artificial networks and several real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which uses a set of parameters to control the characteristics of the generated networks. Here, we consider the influences of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities and with big communities, and larger networks with small communities and with big communities. The small networks and larger networks contain 1000 and 5000 nodes, respectively. The small communities contain at least 10 nodes and at most 50 nodes; the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small communities and big communities are marked with the suffixes 's' and 'b', respectively. The exponents of the power-law distributions that node degree and community size follow are the default values, $-2$ and $-1$, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed the experiments on 13 real-world networks, whose sizes range from tens to hundreds of thousands of nodes; the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results to those of 5 popular community detection algorithms, namely, FastQ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of the evaluation metrics as its resulting metric value for that network. For our proposed method, NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ will be discussed in Section 5.

Table 1: The parameters used to generate the LFR networks. In the header row of this table, $n$ is the number of nodes contained in the network; $\langle d \rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively; $\exp_d$ and $\exp_{com}$ are the exponents of the power-law distributions that node degree and community size follow; $\min(C_i)$ and $\max(C_i)$ represent the minimal and maximal numbers of nodes contained in every community, respectively.

| Network | $n$ | $\langle d \rangle$ | $d_{max}$ | $\exp_d$ | $\exp_{com}$ | $\min(C_i)$ | $\max(C_i)$ |
|---|---|---|---|---|---|---|---|
| LFR1000s | 1000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR1000b | 1000 | 20 | 50 | -2 | -1 | 20 | 100 |
| LFR5000s | 5000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR5000b | 5000 | 20 | 50 | -2 | -1 | 20 | 100 |

Table 2: The information about the real-world networks. $n$ and $m$ are the numbers of nodes and edges in the network, respectively.

| Network | $n$ | $m$ |
|---|---|---|
| Karate club [14] | 34 | 78 |
| Dolphin social network [15] | 62 | 159 |
| Risk map [16] | 42 | 83 |
| Scientists collaboration network [6] | 118 | 197 |
| Lesmis [17] | 77 | 254 |
| Polbooks [3] | 105 | 441 |
| ColiNeta [18] | 423 | 519 |
| NetScience [10] | 1589 | 2742 |
| Email [19] | 1133 | 5451 |
| YeastL [20] | 2361 | 7182 |
| PGP [21] | 10680 | 24316 |
| DBLP [22] | 317080 | 1049866 |
| Amazon [22] | 334863 | 925872 |

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\dfrac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}}\right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left(\dfrac{n_i^{P}}{n}\right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left(\dfrac{n_j^{P'}}{n}\right)}, \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$, respectively.

The NMI is an information-theory based metric, which measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value is in the range $[0, 1]$; larger is better.
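Equation (8) translates directly into a few lines of Python, sketched below. The function name `nmi`, the representation of partitions as lists of node sets, and the convention of returning 1 when both partitions consist of a single community (where the denominator vanishes) are assumptions of this example.

```python
from math import log


def nmi(truth, found):
    """Equation (8) for two partitions given as lists of disjoint node sets."""
    n = sum(len(p) for p in truth)
    numerator = 0.0
    for p in truth:
        for q in found:
            n_ij = len(p & q)
            if n_ij:  # empty intersections contribute nothing
                numerator += n_ij * log(n_ij * n / (len(p) * len(q)))
    denominator = (sum(len(p) * log(len(p) / n) for p in truth)
                   + sum(len(q) * log(len(q) / n) for q in found))
    if denominator == 0:
        return 1.0  # assumption: both partitions are one single community
    return -2.0 * numerator / denominator


same = nmi([{1, 2, 3}, {4, 5, 6}], [{1, 2, 3}, {4, 5, 6}])
unrelated = nmi([{1, 2}, {3, 4}], [{1, 3}, {2, 4}])
print(same, unrelated)
```

Identical partitions give a value of 1, whereas the second pair of partitions, which share no information about each other, gives a value of 0.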

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left(e_{ii} - a_i^2\right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$, whose element $e_{ij}$ is the fraction of edges in the network that connect nodes in community $C_i$ with nodes in community $C_j$; $K$ is the number of communities in the community structure; and $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and the degree distribution are the same as in the original network but the edges are placed between nodes randomly. The smaller the difference between the two terms, the closer the network is to a random graph, and the weaker the community structure is. On the contrary, the larger the difference between them, the further the network departs from the random graph, and the stronger the community structure is. That is to say, modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network; its effective value falls in $[0, 1]$, and higher is better.
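As a worked example of equation (9), the following Python sketch evaluates $Q$ for a partition of an undirected, unweighted network. The function name `modularity` and the toy network are assumptions; $a_i$ is computed as the fraction of edge ends attached to nodes in $C_i$, which is the usual reading of "edges associated with nodes in community $C_i$".

```python
def modularity(edges, communities):
    """Equation (9): Q = sum_i (e_ii - a_i^2) for an undirected edge list."""
    m = len(edges)
    label = {u: i for i, c in enumerate(communities) for u in c}
    inside = [0] * len(communities)  # number of edges inside each community
    ends = [0] * len(communities)    # number of edge ends attached to each community
    for u, v in edges:
        ends[label[u]] += 1
        ends[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inside[i] / m - (ends[i] / (2 * m)) ** 2
               for i in range(len(communities)))


# Toy network (hypothetical): two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
q_split = modularity(toy_edges, [{1, 2, 3}, {4, 5, 6}])
q_whole = modularity(toy_edges, [{1, 2, 3, 4, 5, 6}])
print(q_split, q_whole)
```

For the two-triangle partition, each community has $e_{ii} = 3/7$ and $a_i = 1/2$, so $Q = 2(3/7 - 1/4) = 5/14$; putting all nodes into one community gives $Q = 0$.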

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter for this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point, above which communities in networks tend to be obscure.
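The role of the mixing parameter can be illustrated by measuring it empirically on a network with known communities. The Python sketch below averages, over all nodes, the fraction of each node's edges that lead to other communities; the function name `mixing_parameter` and the toy network are assumptions, and the LFR generator itself is not invoked here.

```python
def mixing_parameter(edges, communities):
    """Average over nodes of (edges to other communities) / (node degree)."""
    label = {u: i for i, c in enumerate(communities) for u in c}
    degree, external = {}, {}
    for u, v in edges:
        for x in (u, v):
            degree[x] = degree.get(x, 0) + 1
            if label[u] != label[v]:
                external[x] = external.get(x, 0) + 1
    return sum(external.get(u, 0) / degree[u] for u in degree) / len(degree)


# Toy network (hypothetical): two triangles joined by the edge (3, 4).
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
mu_hat = mixing_parameter(toy_edges, [{1, 2, 3}, {4, 5, 6}])
print(mu_hat)
```

Only nodes 3 and 4 have an edge leaving their community (one of three edges each), so the empirical value is $(1/3 + 1/3)/6 = 1/9$, well below the transition point of 0.5.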

Figure 2: Comparison of different community-detection algorithms (FastQ, Walktrap, LPA, Attractor, IsoFdp, and the proposal, NSA) on LFR benchmark networks containing 1000 nodes. Vertical axis: NMI; horizontal axis: 0.1 to 0.8. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities.

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 with an increment of 0.1 for each group of LFR networks. To reduce the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the same setting for the other parameters. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as the input, one by one, to run our proposed method and the comparison algorithms, and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized communities and with big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results revealed from the larger networks with small-sized communities and with big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), FastQ tends to introduce mistakes into the results, regardless of whether the communities in the networks are well separated or obscure. As mentioned previously, FastQ is a typical modularity-optimization based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leqslant 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms on the networks with $\mu < 0.5$, but its performance dropped dramatically for $\mu \geqslant 0.5$; it could not even detect effective communities from networks with $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels to update their own, which leads to trivial results in which even all nodes are labeled as members of one giant community. The proposed method, NSA, acquired NMI = 1 on all networks with $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures drops for all those three algorithms and the proposed method. For $0.5 < \mu \leqslant 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geqslant 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results that are similar overall to those in Figure 2, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; in Figure 3, however, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performances of the comparison algorithms are not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also shown by the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

Figure 3: Comparison of different community detection algorithms (FastQ, Walktrap, LPA, Attractor, IsoFdp, and the proposal, NSA) on LFR benchmark networks containing 5000 nodes. Vertical axis: NMI; horizontal axis: 0.1 to 0.8. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those with a ground-truth community structure known a priori and those without a publicly acknowledged ground truth.

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method, NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This is a network depicting the friendships among members of a karate club; it contains 34 nodes and 78 edges. The network was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During the period of Zachary's study, the club split into two factions because of a dispute between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is always taken as the ground truth of this network, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. The detected result seems to deviate from the ground truth in some ways, but it coincides with the conclusion drawn from the experiments on synthetic networks that our proposed method tends to find small communities in networks, which helps to overcome the problem of resolution limit. Moreover, from the perspective of the evaluation metrics, the modularity of the detected result is the largest among those of all the compared algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with as large a modularity as possible: if its modularity is not the largest, it is the second largest, with a small gap to the largest. These findings are also borne out on the following networks.

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method, NSA.

Table 3: The experimental results on networks with ground-truth community structures. The largest values of the two measure metrics are typed in bold.

| Network | Metric | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|---|
| Karate | $Q$ | 0.381 | 0.353 | 0.355 | 0.371 | 0.371 | **0.402** |
| Karate | NMI | 0.693 | 0.504 | 0.62 | 0.924 | **1.00** | 0.699 |
| Dolphin | $Q$ | 0.492 | 0.489 | 0.464 | 0.45 | 0.505 | **0.513** |
| Dolphin | NMI | 0.719 | 0.632 | 0.719 | 0.69 | 0.744 | **0.887** |
| Risk map | $Q$ | **0.625** | 0.624 | 0.59 | 0.598 | 0.519 | 0.624 |
| Risk map | NMI | **0.894** | 0.848 | 0.821 | 0.839 | 0.714 | 0.848 |
| Scientists | $Q$ | **0.749** | 0.733 | 0.64 | 0.694 | 0.668 | 0.744 |
| Scientists | NMI | 0.867 | 0.818 | 0.743 | 0.835 | 0.823 | **0.878** |

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) is the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method from this network are the largest among those of the compared algorithms, which means that the community structure identified by the proposed method is clearly better than those of the comparison algorithms.

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories of 6 continents are involved. Therefore, the 42 nodes and the 83 edges connecting adjacent countries or territories are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are classified into the wrong communities in the detected result. However, nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer, or even more, than those within the communities to which these nodes belong. Therefore, if we ignore what these nodes actually represent and consider the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three nodes are located within their community than in the ground truth; thus, these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, both values of the two measure metrics of our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest value for each network is typed in bold.

| Network | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|
| Lesmis | 0.499 | 0.519 | 0.515 | 0.498 | 0.491 | **0.54** |
| Polbooks | 0.502 | 0.507 | 0.508 | 0.501 | 0.518 | **0.524** |
| ColiNeta | **0.779** | 0.746 | 0.693 | 0.718 | - | 0.761 |
| Email | 0.499 | 0.531 | 0.379 | 0.464 | 0.531 | **0.544** |
| NetScience | 0.955 | 0.956 | 0.896 | 0.937 | - | **0.957** |
| YeastL | 0.573 | 0.529 | 0.372 | 0.511 | - | **0.574** |
| PGP | 0.85 | 0.789 | 0.765 | 0.768 | 0.726 | **0.867** |
| DBLP | 0.735 | - | 0.652 | 0.637 | - | **0.782** |
| Amazon | 0.869 | - | 0.743 | 0.741 | - | **0.898** |

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method, NSA.

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge joins two scientists who have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities in this network; that is, two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the metrics, the NMI value obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one closest to the ground-truth community structure; the modularity value of the proposed method is not the largest, but it is second only to that of FastQ. These results also show that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, owing to the absence of ground-truth community structures. The modularity values obtained by the proposed method and the comparison algorithms are recorded in Table 4. To illustrate them


Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method NSA (horizontal axis: networks; vertical axis: modularity Q; one bar each for FastQ, WalkTrap, LPA, Attractor, IsoFdp, and NSA).

intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the remaining network, ColiNeta, it still obtained the second largest value of modularity. FastQ is based on the modularity optimization strategy, yet it acquired the largest value of modularity on network ColiNeta only. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it could not produce results on networks Amazon and DBLP owing to the large scale of these two networks. LPA and Attractor can extract community structures from all those networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks, so it cannot run on networks ColiNeta, NetScience, and YeastL, as these three networks are disconnected. It could not detect the community structure from networks Amazon and DBLP either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications, and that it outperforms the comparison algorithms on 8 of these 9 networks.
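The counts quoted in this comparison can be cross-checked with a short Python sketch. It re-enters the modularity values of Table 4 (decimal points restored, missing results entered as None) and ranks NSA on every network. The names ALGORITHMS, TABLE4, and rank_of are illustrative only and do not come from the original work.

```python
# Modularity values transcribed from Table 4; None marks a missing result ("-").
ALGORITHMS = ["FastQ", "WalkTrap", "LPA", "Attractor", "IsoFdp", "NSA"]
TABLE4 = {
    "Lesmis":     [0.499, 0.519, 0.515, 0.498, 0.491, 0.540],
    "Polbooks":   [0.502, 0.507, 0.508, 0.501, 0.518, 0.524],
    "ColiNeta":   [0.779, 0.746, 0.693, 0.718, None, 0.761],
    "Email":      [0.499, 0.531, 0.379, 0.464, 0.531, 0.544],
    "NetScience": [0.955, 0.956, 0.896, 0.937, None, 0.957],
    "YeastL":     [0.573, 0.529, 0.372, 0.511, None, 0.574],
    "PGP":        [0.850, 0.789, 0.765, 0.768, 0.726, 0.867],
    "DBLP":       [0.735, None, 0.652, 0.637, None, 0.782],
    "Amazon":     [0.869, None, 0.743, 0.741, None, 0.898],
}


def rank_of(algorithm, row):
    """Return the 1-based rank of `algorithm` among the available results in `row`."""
    scores = {a: q for a, q in zip(ALGORITHMS, row) if q is not None}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(algorithm) + 1


nsa_ranks = {net: rank_of("NSA", row) for net, row in TABLE4.items()}
wins = [net for net, rank in nsa_ranks.items() if rank == 1]
print(len(wins), "of", len(TABLE4), "networks won by NSA")
print("NSA rank on ColiNeta:", nsa_ranks["ColiNeta"])
```

Under this transcription, NSA ranks first on 8 of the 9 networks and second on ColiNeta, which agrees with the statement in the text.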


Figure 9: The setting of parameter $\delta$ (horizontal axis: $\delta$ from 0.00 to 0.30; vertical axis: modularity $Q$). (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conduct a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion $\delta \in [0, 0.3]$ because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method gets the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 within the range $[0.05, 0.2]$, according to the size and the sparsity of the networks involved in real-world applications.
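The parameter sweep described in this section can be sketched in Python as follows. The modularity function uses the standard definition, $Q = \sum_c [l_c/m - (d_c/2m)^2]$, where $l_c$ is the number of edges inside community $c$, $d_c$ is the total degree of its nodes, and $m$ is the number of edges. The callable detect is a hypothetical stand-in for the proposed method (it is not defined in this sketch), and sweep_delta is an illustrative name.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; membership: dict mapping node -> community label.
    """
    m = len(edges)
    inner = defaultdict(int)    # number of edges inside each community
    degree = defaultdict(int)   # total degree of the nodes of each community
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            inner[membership[u]] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def sweep_delta(detect, edges, step=0.005, upper=1.0):
    """Try delta = 0, step, 2*step, ..., upper and return (best_delta, best_Q).

    `detect(edges, delta)` is a hypothetical callable that returns a
    membership dict for the given threshold.
    """
    best_delta, best_q = None, float("-inf")
    for i in range(int(round(upper / step)) + 1):
        delta = i * step
        q = modularity(edges, detect(edges, delta))
        if q > best_q:          # keep the first delta that reaches the maximum
            best_delta, best_q = delta, q
    return best_delta, best_q
```

Because ties are resolved in favor of the smallest threshold, a flat region of the curve reports its left end; another tie rule could be equally reasonable.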

6. Conclusion

In this paper, we presented a novel method to detect communities in networks. It is a local method based on node similarity, and it avoids the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, depending on the community assignment of its most similar neighbor, either taking it as the exemplar of a new community or inserting it into an existing one; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.

To test the performance of the proposed method, we have performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications, and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the resolution limit problem to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Input: $G(V, E)$, the network
Output: $CS_{pre} = \{C_1, C_2, \cdots, C_k\}$, the identified preliminary community structure

1   Initialize variables $U$ and $CS_{pre}$, which are used to record the unclassified nodes and the preliminary community structure
        $U \leftarrow V$; $CS_{pre} \leftarrow \phi$
2   Select the node with the largest degree; denote it as $v$
        $v \leftarrow \arg\max_u \{d_u \mid u \in U\}$
3   Get the most similar neighbor of $v$; denote it as $w$
        $w \leftarrow \arg\max_u \{sim(v, u) \mid u \in \Gamma(v)\}$
4   if $w$ has not been assigned to any community then
5       Create a new community for nodes $v$ and $w$
            $K \leftarrow |CS_{pre}|$; $C_{K+1} \leftarrow \{v, w\}$
6       Insert the created community into the community structure
            $CS_{pre} \leftarrow CS_{pre} \cup \{C_{K+1}\}$
7       Remove nodes $v$ and $w$ from $U$, as they are classified
            $U \leftarrow U - \{v, w\}$
8   else
9       Find the community to which $w$ belongs; denote it as $C_k$
            $k \leftarrow locate(CS_{pre}, w)$
10      Insert node $v$ into $C_k$
            $C_k \leftarrow C_k \cup \{v\}$
11      Remove node $v$ from $U$, as it is classified
            $U \leftarrow U - \{v\}$
12  Repeat steps 2 through 11 until $U = \phi$
13  return $CS_{pre}$

Algorithm 2: FPC(G): forming the preliminary community structure.

to a community or not. If it has not been classified into any community yet, steps 5 and 6 create a new community for nodes $v$ and $w$ and insert the newly created community into $CS_{pre}$; then step 7 removes nodes $v$ and $w$ from $U$, as they have just been classified into the new community. If node $w$ has already been assigned to a community, step 9 finds the community $C_k$ to which $w$, the most similar neighbor of node $v$, belongs, and step 10 inserts node $v$ into community $C_k$. Since node $v$ has been assigned to community $C_k$, step 11 removes it from $U$. Step 12 repeats the operations in steps 2 through 11 until $U = \phi$, meaning that all the nodes in the network have been visited. At that time, the preliminary community structure is obtained in $CS_{pre}$ and is returned as the output of this algorithm in step 13.

To make it clearer, we take Zachary's karate club network [14] as an example to illustrate the procedure. This is a network with 34 nodes and 78 edges, as shown in Figure 1(a), in which the node with the largest degree is node '34' and its most similar neighbor is node '33'. Therefore, node '34' is taken as the exemplar of the first community, and node '33' is also inserted into this community. Then, the node with the largest degree among the remaining nodes is node '1'; its most similar neighbor is node '2'. Since node '2' has not been assigned to a community yet, we create a new community, take node '1' as its exemplar, and insert node '2' into the new community as well. The same thing happens to node pairs ('3', '4'), ('32', '29'), and ('9', '31') sequentially. The next largest-degree node is '14'; its most similar neighbor, node '4', is already in the third community; therefore, we insert node '14' into the third community. All of the other nodes are processed in the same way: in the subsequent operations, node pairs ('24', '30'), ('6', '7'), ('5', '11'), and ('25', '26') form new communities, and all of the remaining nodes are inserted into the communities to which their most similar neighbors belong. At the end of the process, we obtain the preliminary community structure shown in Figure 1(b), in which each node connects to its most similar neighbor with a directed edge.
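The first phase walked through above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the similarity used here (the Jaccard index of closed neighborhoods, function jaccard) is a stand-in, since the node similarity of (3) is defined elsewhere in the paper, and ties in degree or similarity are broken by the smallest node label, which the text does not specify. The graph is assumed to have no isolated nodes.

```python
def jaccard(adj, u, v):
    """Stand-in similarity: Jaccard index of the closed neighborhoods of u and v."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def fpc(adj, sim=jaccard):
    """Form preliminary communities following the steps of Algorithm 2.

    adj: dict mapping node -> set of neighbors (undirected, no isolated nodes).
    Returns a list of sets, one per preliminary community.
    """
    unclassified = set(adj)
    community_of = {}   # node -> index of its community in `communities`
    communities = []
    while unclassified:
        # Step 2: unclassified node with the largest degree (ties: smallest label).
        v = min(unclassified, key=lambda u: (-len(adj[u]), u))
        # Step 3: most similar neighbor of v (ties: smallest label).
        w = min(adj[v], key=lambda u: (-sim(adj, v, u), u))
        if w not in community_of:
            # Steps 5 to 7: open a new community {v, w}.
            communities.append({v, w})
            community_of[v] = community_of[w] = len(communities) - 1
            unclassified -= {v, w}
        else:
            # Steps 9 to 11: v joins the community of w.
            k = community_of[w]
            communities[k].add(v)
            community_of[v] = k
            unclassified.discard(v)
    return communities
```

On a toy graph made of two triangles {1, 2, 3} and {4, 5, 6} joined by the edge (3, 4), this sketch returns the two triangles as the preliminary communities.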

3.3. Merge of Small or Sparse Communities. At the end of the first phase of our proposed method, we obtain the preliminary community structure. However, some communities are either too small or too sparse to make sense, such as the preliminary communities {'5', '11'}, {'9', '31'}, {'32', '29'}, {'25', '26', '28'}, {'24', '30', '27'}, and {'6', '7', '17'} in Figure 1(b): each of them contains only a few nodes and very few inside edges, and the number of edges inside each of them is much smaller than the number of edges connecting it to the outside, which violates the characteristic that connections inside a community are much denser than those across different communities. Keeping them in the final community structure would lead to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final result in the second phase, which is carried out by the function call PCM() in Algorithm 1.

To this end, two problems need to be solved in PCM(). The first is to identify which communities are small or sparse enough to need merging into other ones; the second is to select the community into which each of the small or sparse communities should be merged.

Figure 1: The procedure of FPC() on the karate club network, shown in panels (a) and (b).

For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index in Definitions 1 through 3.

Definition 1 (community sparsity). The sparsity of community $C_i$ is defined as follows:

$$\alpha_i = \frac{|E_i^{in}|}{|E_i^{out}|}, \tag{4}$$

where $E_i^{in}$ is the set of edges within community $C_i$ and $E_i^{out}$ is the set of edges connecting nodes in community $C_i$ with other communities.

That is to say, the sparsity of community $C_i$ is defined as the ratio of the number of inner edges of $C_i$ to the number of outer edges of $C_i$. Obviously, the more edges exist within community $C_i$, the larger the value of $\alpha_i$ will be, and vice versa.

Definition 2 (community scale). The scale of community $C_i$ is formalized as follows:

$$\beta_i = \frac{|V_i|}{|V|}, \tag{5}$$

where $V_i$ is the set of nodes in community $C_i$.

Obviously, the scale of community $C_i$ is defined as the ratio of the number of nodes in $C_i$ to the total number of nodes in the network. The more nodes there are in community $C_i$, the larger the ratio will be, and vice versa.

Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community $C_i$ as follows:

$$\gamma_i = \alpha_i \times \beta_i. \tag{6}$$

On the basis of these definitions, the first problem can be solved by setting a community metric threshold $\delta$. That is to say, if $\gamma_i < \delta$, community $C_i$ needs to be merged into another community.
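Definitions 1 through 3 and the threshold test translate directly into a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the graph is assumed to be stored as a dict of neighbor sets, and a community with no outer edges is given an infinite sparsity value, a case the definitions do not cover.

```python
def community_metric(adj, community):
    """Return (alpha, beta, gamma) for one community, per Definitions 1 to 3.

    adj: dict mapping node -> set of neighbors; community: set of nodes.
    alpha is reported as infinity when the community has no outer edges
    (an assumption; the definitions do not cover that case).
    """
    # Every inner edge is seen from both of its endpoints, hence the division by 2.
    inner = sum(1 for u in community for v in adj[u] if v in community) // 2
    outer = sum(1 for u in community for v in adj[u] if v not in community)
    alpha = inner / outer if outer else float("inf")
    beta = len(community) / len(adj)
    return alpha, beta, alpha * beta


def needs_merge(adj, community, delta=0.1):
    """True when gamma_i < delta, i.e., the community should be merged."""
    return community_metric(adj, community)[2] < delta
```

For example, on two triangles {1, 2, 3} and {4, 5, 6} joined by the edge (3, 4), the community {1, 2, 3} has 3 inner edges and 1 outer edge, so $\alpha = 3$, $\beta = 0.5$, and $\gamma = 1.5$, well above $\delta = 0.1$; the two-node community {3, 4} has 1 inner edge and 4 outer edges, so $\gamma = 0.25 \times 1/3 \approx 0.083$ and it would be flagged for merging.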

For the second problem, we consider a strategy conforming to the construction of the preliminary communities. The preliminary communities are formed based mainly on node similarity in the first phase; therefore, we also use similarity as the criterion for merging communities, i.e., each of the small or sparse communities is merged into its most similar adjacent community. Here, the similarity between two communities $C_i$ and $C_j$ is calculated as follows:

$$Sim(C_i, C_j) = \frac{\sum_{u \in C_i, v \in C_j} sim(u, v)}{|C_j|}, \tag{7}$$

where $sim(u, v)$ is the similarity between nodes $u \in C_i$ and $v \in C_j$, which is calculated using (3). In function PCM(), which implements the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator of the right-hand side of (7) is the sum of the similarities between nodes in communities $C_i$ and $C_j$. Dividing by $|C_j|$ limits the advantage of larger communities, which prevents the formation of giant communities.
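A small Python sketch of (7) shows how the division by $|C_j|$ works. The node similarity is passed in as a callable sim(u, v); it is assumed here to return 0 for pairs of nodes that are unrelated, because the definition in (3) lies outside this section. The function name is hypothetical.

```python
def community_similarity(sim, ci, cj):
    """Sim(C_i, C_j) from (7): summed pairwise node similarity divided by |C_j|.

    sim: callable (u, v) -> float, a stand-in for the node similarity of (3);
    ci: the community to be merged; cj: a candidate target community.
    """
    return sum(sim(u, v) for u in ci for v in cj) / len(cj)
```

As a usage example, suppose node 1 has similarity 0.5 to node 2 and 0.25 to node 3. Then Sim({1}, {2, 3}) = (0.5 + 0.25) / 2 = 0.375, whereas Sim({1}, {2}) = 0.5: adding the weakly related node 3 to the target raises the summed similarity but lowers the normalized score.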

The logic of the entire procedure of the second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure $CS_{pre}$ in step 1. Step 2 calculates the community metric for each of the preliminary communities; steps 3 and 4 select the community with the smallest community metric and its most similar community; step 5 merges them to yield a new community; and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_t$ and $C_j$ with the new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the minimal community metric of the selected community is larger than the given threshold $\delta$, meaning that all the remaining communities are satisfactory; therefore, the merge procedure is terminated, and the resulting community structure in $CS$ is returned in step 9.


Input: $CS_{pre}$, the preliminary community structure; $\delta$, the community-metric threshold
Output: $CS$, the final community structure

1. Initialize $CS$, which is used to record the community structure:
   $CS \leftarrow CS_{pre}$
2. Calculate the community metric for each of the preliminary communities:
   foreach $C_i \in CS$ do
      $\gamma_i \leftarrow \alpha_i \times \beta_i$
3. Select the community with the minimal community metric; denote its index as $t$:
   $t \leftarrow \arg\min_i \{\gamma_i \mid i = 1, 2, \cdots, |CS|\}$
4. Identify the community most similar to $C_t$; denote its index as $j$:
   $j \leftarrow \arg\max_i \{\mathrm{Sim}(C_t, C_i) \mid i = 1, 2, \cdots, |CS|,\ i \neq t\}$
5. Merge communities $C_t$ and $C_j$ to form a new community:
   $k \leftarrow |CS|$; $C_{k+1} \leftarrow C_t \cup C_j$
6. Calculate the community metric for the new community:
   $\gamma_{k+1} \leftarrow \alpha_{k+1} \times \beta_{k+1}$
7. Replace the two communities $C_t$ and $C_j$ with the new community to reflect the merging effect:
   $CS = (CS - \{C_t, C_j\}) \cup \{C_{k+1}\}$
8. Repeat steps 3 through 7 until $\gamma_t > \delta$
9. return $CS$

Algorithm 3: PCM($CS_{pre}$, $\delta$): merge small or sparse communities.
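The merge loop of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the community metric $\gamma = \alpha \times \beta$ and the community similarity of (7) are passed in as the hypothetical callables `metric` and `similarity`, the selection steps use plain linear scans instead of heaps, and the guard `len(cs) > 1` is added here so that the loop cannot run out of merge partners.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List

Community = FrozenSet[Hashable]


def pcm(
    cs_pre: Iterable[Iterable[Hashable]],
    delta: float,
    metric: Callable[[Community], float],
    similarity: Callable[[Community, Community], float],
) -> List[Community]:
    """Merge small or sparse communities, following the steps of Algorithm 3."""
    cs: List[Community] = [frozenset(c) for c in cs_pre]      # step 1
    gamma: List[float] = [metric(c) for c in cs]              # step 2
    while len(cs) > 1:                                        # guard (assumption)
        t = min(range(len(cs)), key=lambda i: gamma[i])       # step 3
        if gamma[t] > delta:                                  # step 8: stop test
            break
        j = max(                                              # step 4
            (i for i in range(len(cs)) if i != t),
            key=lambda i: similarity(cs[t], cs[i]),
        )
        merged = cs[t] | cs[j]                                # step 5
        for idx in sorted((t, j), reverse=True):              # step 7: drop C_t, C_j
            del cs[idx]
            del gamma[idx]
        cs.append(merged)                                     # step 7: add C_{k+1}
        gamma.append(metric(merged))                          # step 6
    return cs


# Toy inputs (hypothetical): metric grows with size, similarity favors close labels.
toy_metric = lambda c: len(c) / 4
toy_similarity = lambda a, b: -min(abs(x - y) for x in a for y in b)
result = pcm([{1}, {2, 3}, {4, 5, 6, 7}], 0.3, toy_metric, toy_similarity)
```

In the toy run, only the singleton community falls below the threshold; it is merged into its most similar community, after which every remaining metric exceeds `delta` and the loop stops.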

3.4. Time Complexity. The proposed algorithm comprises two phases. The first phase forms the preliminary communities. Most of the time in this phase is spent on selecting the node with the largest degree (step 2 in Algorithm 2) and its most similar neighbor (step 3 in Algorithm 2). The former can be accomplished in $O(\log n)$ per iteration using a max-heap; the latter can be done in $O(\log \langle d \rangle)$ with a max-heap, where $\langle d \rangle$ is the average degree of the nodes in the network. Since $\langle d \rangle \ll n$, the time consumption of the first phase is $O(n \log n)$.

The second phase improves the quality of the resulting community structure by merging some of the small or sparse communities. Most of the time is spent on determining, in each iteration, the community that needs to be merged and its most similar adjacent community. Assuming that there are $K$ communities in the preliminary community structure, the former operation can be implemented in $O(\log K)$; the latter can also be carried out in $O(\log K)$ time in the worst case. Hence, the second phase can be implemented in $O(K \log K)$ time.

Since $K \ll n$, $\log K \ll \log n$. Therefore, the proposed method can detect communities from networks with relatively high efficiency, i.e., $O(n \log n)$ time complexity.
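The heap-based selection of the largest-degree node can be illustrated with Python's `heapq` module. This is only a sketch of the selection order under stated assumptions: `heapq` is a min-heap, so degrees are negated to emulate a max-heap, ties are broken by node label, and the function name `nodes_by_degree` is hypothetical. Each pop costs $O(\log n)$, so visiting all $n$ nodes costs $O(n \log n)$.

```python
import heapq
from typing import Dict, Hashable, Iterator, Set, Tuple


def nodes_by_degree(adj: Dict[Hashable, Set[Hashable]]) -> Iterator[Tuple[Hashable, int]]:
    """Yield (node, degree) pairs in order of decreasing degree."""
    # Negate degrees so that the smallest heap entry is the largest degree.
    heap = [(-len(neighbors), node) for node, neighbors in adj.items()]
    heapq.heapify(heap)                      # O(n)
    while heap:
        neg_degree, node = heapq.heappop(heap)   # O(log n) per pop
        yield node, -neg_degree


adj = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a", "d"}, "d": {"a", "c"}}
order = list(nodes_by_degree(adj))
```

A full first phase would additionally skip nodes that were already assigned to a community when they are popped; that bookkeeping is omitted here.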

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we conducted extensive experiments on several groups of artificial networks and on some real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which uses a set of parameters to control the characteristics of the generated networks. Here we consider the influence of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities, small networks with big communities, larger networks with small communities, and larger networks with big communities. The small networks contain 1000 nodes each and the larger networks 5000 nodes each; the small communities contain at least 10 and at most 50 nodes, and the big communities contain at least 20 and at most 100 nodes. The generated networks with small communities and big communities are marked with the suffixes 's' and 'b', respectively. The exponents of the power-law distributions that the node degree and the community size follow are the default values $-2$ and $-1$, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed experiments on 13 real-world networks, whose sizes range from tens to hundreds of thousands of nodes; the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results with those of 5 popular community detection algorithms, namely, Fast$Q$ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of each evaluation metric as its resulting value for that network. For our proposed method, NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ are discussed in Section 5.

Table 1: The parameters used to generate the LFR networks. In the header row, $n$ is the number of nodes contained in the network, $\langle d \rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively, $exp_d$ and $exp_{com}$ are the exponents of the power-law distributions that the node degree and the community size follow, and $\min(C_i)$ and $\max(C_i)$ are the minimal and maximal numbers of nodes contained in every community, respectively.

| Network | $n$ | $\langle d \rangle$ | $d_{max}$ | $exp_d$ | $exp_{com}$ | $\min(C_i)$ | $\max(C_i)$ |
|---|---|---|---|---|---|---|---|
| LFR1000s | 1000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR1000b | 1000 | 20 | 50 | -2 | -1 | 20 | 100 |
| LFR5000s | 5000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR5000b | 5000 | 20 | 50 | -2 | -1 | 20 | 100 |

Table 2: The information about the real-world networks. $n$ and $m$ are the numbers of nodes and edges in the network, respectively.

| Network | $n$ | $m$ |
|---|---|---|
| Karate club [14] | 34 | 78 |
| Dolphin social network [15] | 62 | 159 |
| Risk map [16] | 42 | 83 |
| Scientists collaboration network [6] | 118 | 197 |
| Lesmis [17] | 77 | 254 |
| Polbooks [3] | 105 | 441 |
| ColiNeta [18] | 423 | 519 |
| NetScience [10] | 1589 | 2742 |
| Email [19] | 1133 | 5451 |
| YeastL [20] | 2361 | 7182 |
| PGP [21] | 10680 | 24316 |
| DBLP [22] | 317080 | 1049866 |
| Amazon [22] | 334863 | 925872 |

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\dfrac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}}\right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left(\dfrac{n_i^{P}}{n}\right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left(\dfrac{n_j^{P'}}{n}\right)} \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$. The NMI is an information-theory-based metric that measures how much the detected community structure agrees with the ground truth. Therefore, it can be used only to evaluate the quality of the community structure detected on networks whose ground-truth community structure is already known. Its value lies in the range $[0, 1]$; larger is better.
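Equation (8) can be evaluated directly from two node-to-label maps, as in the following minimal Python sketch. The function name `nmi` and the dictionary-based input format are assumptions made for illustration; terms with $n_{ij} = 0$ are skipped (the usual $0 \log 0 = 0$ convention), and returning 1.0 when both partitions consist of a single community is a convention chosen here, since (8) is undefined in that case.

```python
from collections import Counter
from math import log
from typing import Dict, Hashable


def nmi(truth: Dict[Hashable, Hashable], found: Dict[Hashable, Hashable]) -> float:
    """NMI of Eq. (8); both maps assign a community label to the same nodes."""
    n = len(truth)
    n_ij = Counter((truth[v], found[v]) for v in truth)   # overlap sizes |P_i ∩ P'_j|
    n_p = Counter(truth.values())                         # sizes |P_i|
    n_q = Counter(found.values())                         # sizes |P'_j|
    numerator = -2.0 * sum(
        c * log(c * n / (n_p[i] * n_q[j])) for (i, j), c in n_ij.items()
    )
    denominator = sum(c * log(c / n) for c in n_p.values()) + sum(
        c * log(c / n) for c in n_q.values()
    )
    if denominator == 0.0:
        return 1.0  # both partitions are one single community (convention, assumption)
    return numerator / denominator


truth = {1: "a", 2: "a", 3: "b", 4: "b"}
same = nmi(truth, {1: "x", 2: "x", 3: "y", 4: "y"})         # identical up to relabeling
independent = nmi(truth, {1: "x", 2: "y", 3: "x", 4: "y"})  # statistically independent
```

The two toy calls cover the extremes of the range: a relabeled copy of the ground truth and a partition that carries no information about it.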

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left(e_{ii} - a_i^2\right) \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$ whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ relative to the total number of edges in the network, $K$ is the number of communities in the community structure, and $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and the degree distribution are the same as in the original network but the edges are placed between nodes randomly. The smaller the difference between the two terms is, the more the network approaches a random graph and the weaker the community structure is. Conversely, the larger the difference is, the further the network departs from the random graph and the stronger the community structure is. That is to say, modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network; its effective value falls in $[0, 1]$, and higher is better.
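A minimal Python sketch of (9) for an undirected, unweighted edge list is given below. The function name `modularity` and the input format are assumptions for illustration; $a_i$ is computed in the standard way as the fraction of edge ends attached to nodes of community $C_i$, i.e., the sum of their degrees divided by $2m$.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    membership: Dict[Hashable, Hashable],
) -> float:
    """Q of Eq. (9): sum over communities of e_ii - a_i^2."""
    edge_list = list(edges)
    m = len(edge_list)
    inner = defaultdict(int)   # number of edges with both ends in community i
    ends = defaultdict(int)    # number of edge ends attached to community i
    for u, v in edge_list:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            inner[cu] += 1
        ends[cu] += 1
        ends[cv] += 1
    return sum(inner[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {v: "A" for v in range(1, 7)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

For the toy graph, each triangle holds 3 of the 7 edges and half of the edge ends, so $Q = 2(3/7 - 0.25) \approx 0.357$, whereas placing all nodes in one community gives $Q = 0$.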

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter of this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$ is, the clearer the community structure will be. Evidently, $\mu = 0.5$ is a transition point, above which the communities in the networks tend to be obscure.


[Figure 2 plots, panels (a) and (b): x-axis from 0.1 to 0.8; y-axis NMI from 0.0 to 1.0; legend: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).]

Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities.

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 in increments of 0.1 for each group of LFR networks. To reduce the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the same settings for the other parameters. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as input one by one, ran our proposed method and the comparison algorithms to detect communities, and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized and big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results from the larger networks with small-sized and big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), Fast$Q$ tends to introduce mistakes into the results whether the communities in the networks are well separated or obscure. As mentioned previously, Fast$Q$ is a typical modularity-optimization-based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leqslant 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms on the networks with $\mu < 0.5$, but its performance dropped dramatically for $\mu \geqslant 0.5$; it could not even detect effective communities for $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels when updating their own, which leads to trivial results in which even all nodes are labeled as members of one giant community. The proposed method, NSA, acquired NMI = 1 on all networks for $\mu < 0.5$, meaning that the detected partitions match the ground-truth community structures of these networks perfectly. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures slipped for all three of those algorithms and for the proposed method. For $0.5 < \mu \leqslant 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geqslant 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results that are similar overall to those in Figure 2, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; in Figure 3, however, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performance of the comparison algorithms is not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This can also be seen from the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

[Figure 3 plots, panels (a) and (b): x-axis from 0.1 to 0.8; y-axis NMI from 0.0 to 1.0; legend: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).]

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those whose ground-truth community structure is known a priori and those without a publicly acknowledged ground truth.

[Figure 4 drawings, panels (a) and (b): the 34 nodes of the karate club network, labeled 1 to 34.]

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This network depicts the friendships among the members of a karate club; it contains 34 nodes and 78 edges. It was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During Zachary's period of study, the club split into two factions because of a dispute that arose between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is always taken as the ground truth of this network, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities in the network. The detected result seems to deviate from the ground truth in some ways, but it coincides with the conclusion drawn from the experiments on synthetic networks: our proposed method tends to find small communities in networks, which overcomes the problem of resolution limit. Moreover, from the perspective of the measure metrics, the modularity of the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with as large a modularity as possible; if its modularity is not the largest, it is the second largest, with a small offset from the largest. These findings are also borne out on the following networks.

[Figure 5 drawings, panels (a) and (b): the 62 nodes of the dolphin social network, labeled with the dolphins' names.]

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method NSA.

Table 3: The experimental results on networks with ground-truth community structures. The largest values of the two measure metrics are typed in bold.

| Network | Metric | Fast$Q$ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|---|
| Karate | $Q$ | 0.381 | 0.353 | 0.355 | 0.371 | 0.371 | **0.402** |
| Karate | NMI | 0.693 | 0.504 | 0.62 | 0.924 | **1.00** | 0.699 |
| Dolphin | $Q$ | 0.492 | 0.489 | 0.464 | 0.45 | 0.505 | **0.513** |
| Dolphin | NMI | 0.719 | 0.632 | 0.719 | 0.69 | 0.744 | **0.887** |
| Risk map | $Q$ | **0.625** | 0.624 | 0.59 | 0.598 | 0.519 | 0.624 |
| Risk map | NMI | **0.894** | 0.848 | 0.821 | 0.839 | 0.714 | 0.848 |
| Scientists | $Q$ | **0.749** | 0.733 | 0.64 | 0.694 | 0.668 | 0.744 |
| Scientists | NMI | 0.867 | 0.818 | 0.743 | 0.835 | 0.823 | **0.878** |

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected the communities of this network with a high degree of success: it also identified 4 communities, the great majority of nodes are classified into the correct communities, and the result is close to the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method on this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is clearly better than those of the comparison algorithms.

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories on 6 continents are involved. Accordingly, 42 nodes and 83 edges connecting adjacent countries or territories are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method. Besides this, nodes '26', '33', and '34' are classified into the wrong communities in the detected result. However, nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer than, or even more than, those within the communities to which these nodes belong. Therefore, if we ignore what these nodes actually represent and consider the topology only, the community structure extracted by our proposed method is qualitatively more rational than the ground truth: more edges associated with these three nodes are located within the community than in the ground truth, so these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, the values of both measure metrics of our proposed method are second only to those of Fast$Q$ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

[Figure 6 drawings, panels (a) and (b): the 42 nodes of the Risk map network, labeled 1 to 42.]

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method NSA.

Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest values are typed in bold.

| Network | Fast$Q$ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|
| Lesmis | 0.499 | 0.519 | 0.515 | 0.498 | 0.491 | **0.54** |
| Polbooks | 0.502 | 0.507 | 0.508 | 0.501 | 0.518 | **0.524** |
| ColiNeta | **0.779** | 0.746 | 0.693 | 0.718 | - | 0.761 |
| Email | 0.499 | 0.531 | 0.379 | 0.464 | 0.531 | **0.544** |
| NetScience | 0.955 | 0.956 | 0.896 | 0.937 | - | **0.957** |
| YeastL | 0.573 | 0.529 | 0.372 | 0.511 | - | **0.574** |
| PGP | 0.85 | 0.789 | 0.765 | 0.768 | 0.726 | **0.867** |
| DBLP | 0.735 | - | 0.652 | 0.637 | - | **0.782** |
| Amazon | 0.869 | - | 0.743 | 0.741 | - | **0.898** |

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge indicates that two scientists have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities in this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the measure metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the closest to the ground-truth community structure; the modularity of the proposed method is not the largest, but it is second only to that of Fast$Q$. These results also show that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, owing to the absence of ground-truth community structures. The modularity values obtained by the proposed method and the comparison algorithms are recorded in Table 4. To illustrate them intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

[Figure 7 drawings, panels (a) and (b): the 118 nodes of the scientists collaboration network, labeled 1 to 118.]

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

[Figure 8 bar chart: x-axis Networks (Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, Amazon); y-axis Modularity ($Q$) from 0.0 to 1.0; legend: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).]

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method NSA.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the only other network, ColiNeta, it still obtained the second largest modularity. Fast$Q$ is based on the modularity-optimization strategy, yet it acquired the largest modularity only on ColiNeta. WalkTrap is based on random walks, so its time complexity is relatively high; it could not produce effective results on the Amazon and DBLP networks because of their large scale. LPA and Attractor can extract community structures from all of these networks, but the quality of the detected results is not satisfactory. IsoFdp can be applied only to connected networks and cannot run on ColiNeta, NetScience, and YeastL, as these three networks are disconnected. It could not detect the community structure of the Amazon and DBLP networks effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and that it outperforms the comparison algorithms significantly.


[Figure 9 plots, panels (a) to (d): x-axis from 0.00 to 0.30; y-axis $Q$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network.]

Figure 9: The setting of parameter $\delta$.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conducted a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion $\delta \in [0, 0.3]$ because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method attains the largest modularity at $\delta = 0.13$ on the dolphin social network and at $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding values for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 within the range $[0.05, 0.2]$, according to the size and the sparsity of the networks involved in real-world applications.
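The parameter sweep described above can be sketched as a simple grid search. This is a minimal sketch under stated assumptions: `detect` stands for any function that maps a threshold $\delta$ to a community structure and `quality` for any function that scores that structure (e.g., modularity $Q$); both names, like `sweep_delta` itself, are hypothetical. Grid points are computed as `start + k * step` to avoid accumulating floating-point error.

```python
from typing import Any, Callable, Tuple


def sweep_delta(
    detect: Callable[[float], Any],
    quality: Callable[[Any], float],
    start: float = 0.0,
    stop: float = 1.0,
    step: float = 0.005,
) -> Tuple[float, float]:
    """Return (best_delta, best_quality) over the grid start, start+step, ..., stop."""
    n_steps = round((stop - start) / step)
    best_delta, best_quality = start, float("-inf")
    for k in range(n_steps + 1):
        delta = start + k * step
        score = quality(detect(delta))
        if score > best_quality:          # keep the first maximizer on ties
            best_delta, best_quality = delta, score
    return best_delta, best_quality


# Toy stand-ins (hypothetical): the "quality" peaks at delta = 0.13.
best_delta, best_quality = sweep_delta(
    detect=lambda delta: delta,
    quality=lambda structure: -(structure - 0.13) ** 2,
)
```

In practice, `detect` would run the full two-phase method with the given threshold, and the sweep could be restricted to the suggested interval by passing `start=0.05` and `stop=0.2`.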

6. Conclusion

In this paper, we presented a novel method to detect communities in networks. It is a local method based on node similarity, and it overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, depending on the community assignment of its most similar neighbor, either taking it as the exemplar of a new community or inserting it into an existing one: if its most similar neighbor has not been assigned to any community yet, we create a new community for the node and its most similar neighbor; if its most similar neighbor has already been assigned to a community, we insert the node into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes that take both the size and the sparsity of communities into account to determine which communities should be merged.

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms its competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which are cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, “Network analysis: The structure of the web,” Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, “Community structure of the physical review citation network,” Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, “Community structure in jazz,” Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., “Community detection using spectral clustering on sparse geosocial data,” SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, “Community detection in networks: a user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, “The emergent properties of a dolphin social network,” in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, “Models of social networks based on social distance attachment,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, “Community detection in complex networks using structural similarity,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, “A smart local moving algorithm for large-scale modularity-based community detection,” The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, “Detecting network communities by propagating labels under constraints,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, “A semi-synchronous label propagation algorithm with constraints for community detection in complex networks,” Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, “Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs,” in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, “Community detection in complex networks using density-based clustering algorithm and manifold learning,” Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, “Locating structural centers: A density-based clustering method for community detection,” PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, “Personalized PageRank clustering: a graph clustering algorithm based on random walks,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, “A seed-expanding method based on random walks for community detection in networks with ambiguous community structures,” Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, “Community detection based on distance dynamics,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch’ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, “A fast community detection method in bipartite networks by distance dynamics,” Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, “The laplacian spectrum of neural networks,” Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, “ComSim: a bipartite community detection algorithm using cycle and node’s similarity,” in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, “A community detection method based on local similarity and degree clustering information,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, “Maximal neighbor similarity reveals real communities in networks,” Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, “Robust data clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Figure 1: The procedure of FPC() on the karate club network (panels (a) and (b)).

For the first problem, we propose an index, the community metric, which takes into account two factors, community size and community sparsity, to find the preliminary communities that need to be merged. Here, we formalize the relevant concepts and the index in Definitions 1 through 3.

Definition 1 (community sparsity). The sparsity of community $C_i$ is defined as follows:

$$\alpha_i = \frac{|E_i^{in}|}{|E_i^{out}|}, \tag{4}$$

where $E_i^{in}$ is the set of edges within community $C_i$, and $E_i^{out}$ is the set of edges connecting nodes in community $C_i$ with other communities.

That is to say, the sparsity of community $C_i$ is defined as the ratio between the number of inner edges of $C_i$ and the number of outer edges of $C_i$. Obviously, the more edges exist within community $C_i$, the larger the value of $\alpha_i$ will be, and vice versa.

Definition 2 (community scale). The scale of community $C_i$ is formalized as follows:

$$\beta_i = \frac{|V_i|}{|V|}, \tag{5}$$

where $V_i$ is the set of nodes in community $C_i$ and $V$ is the set of all nodes in the network.

That is, the scale of community $C_i$ is defined as the ratio of the number of nodes in $C_i$ to the total number of nodes in the network. The more nodes there are in community $C_i$, the larger the ratio will be, and vice versa.

Definition 3 (community metric). The community metric is a combination of both the community sparsity and the community scale, which is defined for community $C_i$ as follows:

$$\gamma_i = \alpha_i \times \beta_i. \tag{6}$$

On the basis of these definitions, the first problem can be solved by setting a community metric threshold $\delta$. That is to say, if $\gamma_i < \delta$, community $C_i$ needs to be merged into another community.
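As a minimal sketch of Definitions 1 through 3, the three quantities can be computed from an adjacency structure as follows. The function name `community_metric`, the dictionary-of-sets graph representation, and the choice to return an infinite sparsity for a community with no outgoing edges are assumptions made for illustration; they are not specified in the text.

```python
def community_metric(adj, community):
    """Return (alpha, beta, gamma) for one community.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    community: set of nodes forming the community C_i.
    """
    inner_twice = 0  # every inner edge is seen from both endpoints
    outer = 0        # edges leaving the community
    for u in community:
        for v in adj[u]:
            if v in community:
                inner_twice += 1
            else:
                outer += 1
    inner = inner_twice // 2
    # Assumption: with no outgoing edges the ratio is undefined; use infinity.
    alpha = inner / outer if outer else float("inf")
    beta = len(community) / len(adj)
    return alpha, beta, alpha * beta


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
alpha, beta, gamma = community_metric(adj, {0, 1, 2})
```

For the triangle {0, 1, 2}, there are three inner edges and one outer edge, so the sparsity is 3, the scale is 0.5, and the metric is 1.5; a single-node community such as {0} has no inner edges and therefore a metric of 0, which falls below any positive threshold.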

For the second problem, we adopt a strategy consistent with the construction of the preliminary communities. The preliminary communities are formed mainly on the basis of node similarity in the first phase; therefore, we also use similarity as the criterion for merging communities, i.e., each of the small or sparse communities is merged into its most similar adjacent community. Here, the similarity between two communities $C_i$ and $C_j$ is calculated as follows:

$$Sim(C_i, C_j) = \frac{\sum_{u \in C_i, v \in C_j} sim(u, v)}{|C_j|}, \tag{7}$$

where $sim(u, v)$ is the similarity between nodes $u \in C_i$ and $v \in C_j$, which is calculated using (3). In function PCM(), which implements the merge procedure, $C_i$ is a community that needs to be merged, and $C_j$ is one of its adjacent communities. The numerator on the right-hand side of (7) is the sum of similarities between nodes in communities $C_i$ and $C_j$. Dividing by the denominator $|C_j|$ constrains the priority given to larger communities, which prevents giant communities from forming.
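The community similarity in (7) can be sketched as below. The node similarity of (3) is defined earlier in the paper and is not reproduced here, so the sketch takes it as a parameter; the default `jaccard` function (Jaccard index of closed neighborhoods) is only a stand-in assumption, and all function names are hypothetical.

```python
def jaccard(adj, u, v):
    """Stand-in node similarity (an assumption, not the paper's Eq. (3))."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def community_similarity(adj, c_i, c_j, node_sim=jaccard):
    """Sum of node similarities over all pairs (u in c_i, v in c_j), divided by |c_j|."""
    total = sum(node_sim(adj, u, v) for u in c_i for v in c_j)
    return total / len(c_j)


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
s_small = community_similarity(adj, {2}, {0, 1})
s_large = community_similarity(adj, {2}, {3, 4, 5})
```

With this stand-in similarity, node 2 is more similar to {0, 1} than to {3, 4, 5}; the division by the size of the target community also lowers the score of the larger candidate.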

The logic of the entire procedure of the second phase is listed in Algorithm 3; the operations are almost self-explanatory. The variable $CS$ is used to record the final community structure; it is initialized as the preliminary community structure $CS_{pre}$ in step 1. Step 2 calculates the community metric for each of the preliminary communities; steps 3 and 4 select the community with the smallest community metric and its most similar community; step 5 merges them to yield a new community; and step 6 calculates the community metric for that new community. Step 7 replaces the two communities $C_t$ and $C_j$ with the new community in $CS$ to reflect the effect of the merge operation. Step 8 repeats the operations in steps 3 through 7 until the minimal community metric of the selected community is larger than the given threshold $\delta$, meaning that all the remaining communities are satisfactory; the merge procedure is then terminated, and the resulting community structure in $CS$ is returned in step 9.


Input: $CS_{pre}$, the preliminary community structure; $\delta$, the community-metric threshold
Output: $CS$, the final community structure

1 Initialize $CS$, which is used to record the community structure:
    $CS \leftarrow CS_{pre}$
2 Calculate the community metric for each of the preliminary communities:
    foreach $C_i \in CS$ do
        $\gamma_i \leftarrow \alpha_i \times \beta_i$
3 Select the community with the minimal community metric; denote its index as $t$:
    $t \leftarrow \arg\min_i \{\gamma_i \mid i = 1, 2, \cdots, |CS|\}$
4 Identify the most similar community with $C_t$; denote its index as $j$:
    $j \leftarrow \arg\max_i \{Sim(C_t, C_i) \mid i = 1, 2, \cdots, |CS|, i \neq t\}$
5 Merge communities $C_t$ and $C_j$ to form a new community:
    $k \leftarrow |CS|$, $C_{k+1} \leftarrow C_t \cup C_j$
6 Calculate the community metric for the new community:
    $\gamma_{k+1} \leftarrow \alpha_{k+1} \times \beta_{k+1}$
7 Replace the two communities $C_t$ and $C_j$ with the new community to reflect the merging effect:
    $CS = CS - \{C_t, C_j\} \cup \{C_{k+1}\}$
8 Repeat steps 3 through 7 until $\gamma_t > \delta$
9 return $CS$

Algorithm 3: PCM($CS_{pre}$, $\delta$): merge small or sparse communities.
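The merge loop of Algorithm 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `pcm`, the dictionary-of-sets graph, and the Jaccard stand-in for the node similarity of (3) are all hypothetical, candidates are restricted to adjacent communities as described in the text, and every metric is recomputed in each round instead of being kept in a priority structure.

```python
def jaccard(adj, u, v):
    """Stand-in node similarity (an assumption, not the paper's Eq. (3))."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def pcm(adj, communities, delta, node_sim=jaccard):
    """Merge small or sparse communities until the smallest metric exceeds delta."""
    cs = [set(c) for c in communities]          # step 1
    n = len(adj)

    def gamma(c):                               # community metric, Eqs. (4)-(6)
        inner = sum(1 for u in c for v in adj[u] if v in c) // 2
        outer = sum(1 for u in c for v in adj[u] if v not in c)
        alpha = inner / outer if outer else float("inf")  # assumption for outer == 0
        return alpha * len(c) / n

    def sim(c_t, c_j):                          # community similarity, Eq. (7)
        return sum(node_sim(adj, u, v) for u in c_t for v in c_j) / len(c_j)

    while len(cs) > 1:
        t = min(range(len(cs)), key=lambda i: gamma(cs[i]))   # step 3
        if gamma(cs[t]) > delta:                              # step 8
            break
        # A finite metric implies at least one outgoing edge, hence a neighbor.
        adjacent = [i for i in range(len(cs)) if i != t
                    and any(v in cs[i] for u in cs[t] for v in adj[u])]
        j = max(adjacent, key=lambda i: sim(cs[t], cs[i]))    # step 4
        merged = cs[t] | cs[j]                                # step 5
        cs = [c for i, c in enumerate(cs) if i not in (t, j)] + [merged]  # step 7
    return cs                                                 # step 9


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
result = pcm(adj, [{0, 1}, {2}, {3, 4, 5}], delta=0.1)
```

In this toy run, the single-node community {2} has a metric of 0, so it is merged into {0, 1}, its most similar adjacent community; both remaining communities then have a metric of 1.5 and the loop stops.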

3.4. Time Complexity. The proposed algorithm comprises two phases. The first phase forms the preliminary communities. The main time consumption in this phase is the selection of the node with the largest degree (step 2 in Algorithm 2) and of its most similar neighbor (step 3 in Algorithm 2); the former can be accomplished in $O(\log n)$ in each iteration using a max-heap data structure, and the latter can be done in $O(\log\langle d\rangle)$ with a max-heap, where $\langle d\rangle$ is the average degree of nodes in the network. Since $\langle d\rangle \ll n$, the time consumption of the first phase is $O(n \log n)$.
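The heap-based selection of the largest-degree node can be sketched as below. Python's `heapq` is a min-heap, so degrees are stored negated; skipping already assigned nodes when they are popped (lazy deletion) is an assumption of this sketch, since the text only states that a max-heap is used. The function names are hypothetical, and ties are broken by the smaller node identifier.

```python
import heapq


def build_degree_heap(adj):
    """Build a max-heap of nodes keyed by degree, in O(n) time."""
    heap = [(-len(nbrs), node) for node, nbrs in adj.items()]
    heapq.heapify(heap)
    return heap


def pop_largest_unassigned(heap, assigned):
    """Pop the unassigned node with the largest degree; O(log n) per pop."""
    while heap:
        _, node = heapq.heappop(heap)
        if node not in assigned:   # lazy deletion of nodes assigned earlier
            return node
    return None


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
heap = build_degree_heap(adj)
first = pop_largest_unassigned(heap, assigned=set())
# Suppose node 3 joined the community of node 2 as its most similar neighbor.
second = pop_largest_unassigned(heap, assigned={2, 3})
```

Each node enters and leaves the heap at most once, so the selections over a whole run cost $O(n \log n)$ in total, which matches the bound stated for the first phase.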

The second phase improves the quality of the resulting community structure by merging some of the small or sparse communities. Most of the time is spent on determining the community that needs to be merged and its most similar adjacent community in each iteration. Assuming that there are $K$ communities in the preliminary community structure, the former operation can be implemented in $O(\log K)$, and the latter can also be carried out in $O(\log K)$ time in the worst case. Hence, the second phase can be implemented with $O(K \log K)$ time consumption.

Since $K \ll n$, we have $\log K \ll \log n$. Therefore, the proposed method can detect communities from networks with relatively high efficiency, in $O(n \log n)$ time.

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we conducted extensive experiments on both several groups of artificial networks and some real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which takes some parameters to control the characteristics of the generated networks. Here, we consider the influences of both the network scale and the community size; therefore, four types of networks are generated, namely, small networks with small communities and with big communities and larger networks with small communities and with big communities. The small networks and the larger networks contain 1000 and 5000 nodes, respectively; the small communities contain at least 10 and at most 50 nodes; the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small communities and big communities are marked using the suffixes ‘s’ and ‘b’, respectively. The exponents of the power-law distributions that node degree and community size follow are the default values, $-2$ and $-1$, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed experiments on 13 real-world networks; the size of these networks ranges from tens to hundreds of thousands of nodes, and the information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second one contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results with those of 5 popular community detection algorithms, namely, Fast$Q$ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of the evaluation metrics as its resulting metric value obtained from that network. For our proposed method, NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ will be discussed in Section 5.

Table 1: The parameters used to generate the LFR networks. In the header row of this table, $n$ is the number of nodes contained in the network; $\langle d\rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively; $exp_d$ and $exp_{com}$ are the exponents of the power-law distributions that node degree and community size follow; min($C_i$) and max($C_i$) represent the minimal and maximal number of nodes contained in every community, respectively.

| Network | $n$ | $\langle d\rangle$ | $d_{max}$ | $exp_d$ | $exp_{com}$ | min($C_i$) | max($C_i$) |
|---|---|---|---|---|---|---|---|
| LFR1000s | 1000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR1000b | 1000 | 20 | 50 | -2 | -1 | 20 | 100 |
| LFR5000s | 5000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR5000b | 5000 | 20 | 50 | -2 | -1 | 20 | 100 |

Table 2: The information about the real-world networks. $n$ and $m$ are the number of nodes and edges in the network, respectively.

| Network | $n$ | $m$ |
|---|---|---|
| Karate club [14] | 34 | 78 |
| Dolphin social network [15] | 62 | 159 |
| Risk map [16] | 42 | 83 |
| Scientists collaboration network [6] | 118 | 197 |
| Lesmis [17] | 77 | 254 |
| Polbooks [3] | 105 | 441 |
| ColiNeta [18] | 423 | 519 |
| NetScience [10] | 1589 | 2742 |
| Email [19] | 1133 | 5451 |
| YeastL [20] | 2361 | 7182 |
| PGP [21] | 10680 | 24316 |
| DBLP [22] | 317080 | 1049866 |
| Amazon [22] | 334863 | 925872 |

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\frac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}}\right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left(\frac{n_i^{P}}{n}\right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left(\frac{n_j^{P'}}{n}\right)}, \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, $n_{ij} = |P_i \cap P'_j|$, and $n$ is the number of nodes in the network. The NMI is an information-theory based metric, which measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value is in the range of [0, 1]; larger is better.
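As an illustration of (8), the NMI of two partitions can be computed directly from the overlap counts. This is a minimal sketch: the function name `nmi` and the representation of a partition as a list of node sets are assumptions, and returning 1.0 when both partitions consist of a single community (where the denominator of (8) vanishes) is a convention chosen here, not one stated in the text.

```python
from math import log


def nmi(truth, found):
    """NMI of two partitions of the same node set, following Eq. (8).

    truth, found: lists of disjoint node sets covering the same nodes.
    """
    n = sum(len(p) for p in truth)
    numerator = 0.0
    for p in truth:
        for q in found:
            n_ij = len(p & q)
            if n_ij:  # terms with an empty overlap contribute nothing
                numerator += n_ij * log(n_ij * n / (len(p) * len(q)))
    denominator = (sum(len(p) * log(len(p) / n) for p in truth)
                   + sum(len(q) * log(len(q) / n) for q in found))
    if denominator == 0:
        return 1.0  # assumption: both partitions are a single community
    return -2.0 * numerator / denominator


same = nmi([{0, 1, 2}, {3, 4, 5}], [{3, 4, 5}, {0, 1, 2}])
independent = nmi([{0, 1}, {2, 3}], [{0, 2}, {1, 3}])
```

Identical partitions give a value of 1 regardless of the order in which the communities are listed, and two partitions that share no information give 0.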

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left(e_{ii} - a_i^2\right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$, whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ to the total edges in the network, $K$ is the number of communities in the community structure, and $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which nodes and degree distribution are the same as in the original network but edges are connected between nodes randomly. The smaller the difference between the two terms, the more the network approaches a random graph and the weaker the community structure is. Conversely, the larger the difference between them, the further the network departs from the random graph and the stronger the community structure is. That is to say, the modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network; its effective value falls in [0, 1], and higher is better.
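The modularity in (9) can be sketched as follows for an undirected, unweighted graph. The function name `modularity` and the dictionary-of-sets graph are assumptions; $a_i$ is computed as the sum of the degrees of the nodes in the community divided by twice the number of edges, which is the usual reading of "the fraction of edges associated with nodes in the community".

```python
def modularity(adj, communities):
    """Modularity Q of a partition, following Eq. (9).

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    communities: list of disjoint node sets covering all nodes.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the edge count
    q = 0.0
    for c in communities:
        inner_twice = sum(1 for u in c for v in adj[u] if v in c)
        degree_sum = sum(len(adj[u]) for u in c)
        e_ii = inner_twice / two_m   # fraction of edges inside the community
        a_i = degree_sum / two_m     # fraction of edge ends in the community
        q += e_ii - a_i ** 2
    return q


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_split = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(adj, [set(adj)])
```

For the two-triangle graph, each community holds 3 of the 7 edges and half of the edge ends, so $Q = 2(3/7 - 1/4) = 5/14$; placing all nodes in one community gives $Q = 0$.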

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter for this software is the mixing parameter $\mu$, which regulates, for each node, the fraction of edges connected to nodes in other communities. The smaller the value of $\mu$ is, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point above which communities in networks tend to be obscure.


Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities. (Each panel plots NMI, from 0.0 to 1.0, against values from 0.1 to 0.8 on the horizontal axis for FastQ, Walktrap, LPA, Attractor, IsoFdp, and the proposal (NSA).)

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 with an increment of 0.1 for each group of LFR networks. To reduce the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the same setting for the other parameters. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as the input one by one, ran our proposed method and the comparison algorithms to detect communities, and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized communities and big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results revealed from the larger networks with small-sized communities and big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), Fast$Q$ tends to introduce mistakes in the results regardless of whether communities in the networks are well separated or obscure. As mentioned previously, Fast$Q$ is a typical modularity-optimization based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leqslant 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms in the networks with $\mu < 0.5$, but its performance dropped dramatically for $\mu \geqslant 0.5$; it could not even detect effective communities from networks with $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels when updating their own, which leads to trivial results in which even all nodes are labeled as members of one giant community. The proposed method, NSA, acquired NMI = 1 on all networks with $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures slipped for all those three algorithms and the proposed method. For $0.5 < \mu \leqslant 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geqslant 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar to those in Figure 2 overall, but they still differ from each other in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; however, in Figure 3, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performance of the comparison algorithms is not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also shown by the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those with the ground-truth community structure known a priori and those without publicly acknowledged ground truth.

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities.

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method, NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.
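As an illustration of the modularity metric used throughout this section, the following is a minimal sketch that assumes the standard Newman-Girvan definition of Q for an undirected, unweighted network. The function name `modularity`, the edge-list input, and the `community_of` mapping are illustrative choices and do not come from the paper.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition (assumed definition).

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    community_of: dict mapping every node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    inner = defaultdict(int)       # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of the degrees of the community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            inner[cu] += 1
    # Q = sum over communities of (inner edge fraction - expected fraction)
    return sum(inner[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)
```

For example, two triangles joined by a single edge and split into the two triangles give Q = 5/14, while putting all nodes into one community gives Q = 0.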

The Karate Club Network. This network depicts the friendships among members of a karate club; it contains 34 nodes and 78 edges. It was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During Zachary's period of study, the club split into two factions because of a dispute between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is conventionally taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. The detected result seems to deviate from the ground truth in some ways, but it coincides with the conclusion found in the experiments on synthetic networks that our proposed method tends to find small communities in networks, which helps it overcome the problem of resolution limit. Moreover, from the perspective of the measure metrics, the modularity corresponding to the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with as large a modularity as possible: if its modularity is not the largest, it is the second largest, with a small offset from the largest. These findings are also borne out on the following networks.

Table 3: The experimental results on networks with ground-truth community structures. The largest value of each metric on each network is marked with an asterisk.

Network     Metric  FastQ   WalkTrap  LPA     Attractor  IsoFdp  NSA
Karate      Q       0.381   0.353     0.355   0.371      0.371   0.402*
            NMI     0.693   0.504     0.62    0.924      1.00*   0.699
Dolphin     Q       0.492   0.489     0.464   0.45       0.505   0.513*
            NMI     0.719   0.632     0.719   0.69       0.744   0.887*
Risk map    Q       0.625*  0.624     0.59    0.598      0.519   0.624
            NMI     0.894*  0.848     0.821   0.839      0.714   0.848
Scientists  Q       0.749*  0.733     0.64    0.694      0.668   0.744
            NMI     0.867   0.818     0.743   0.835      0.823   0.878*

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method, NSA.

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method on this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is clearly better than those of the comparison algorithms.
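The NMI values quoted in this section compare a detected partition with the ground truth. A minimal sketch of such a comparison is given below; it assumes the common normalization NMI = 2 I(A;B) / (H(A) + H(B)), which may differ in detail from the exact variant used in the experiments. The function name `nmi` and the label-list inputs are illustrative.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same nodes.

    labels_a[i] and labels_b[i] are the community labels of node i in the
    two partitions. Uses NMI = 2 * I(A;B) / (H(A) + H(B)) (an assumption).
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (h_a + h_b)
```

Identical partitions (up to relabeling) score 1, and statistically independent partitions score 0.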

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories of 6 continents are involved. Accordingly, the network consists of 42 nodes and 83 edges connecting adjacent countries or territories, organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are misclassified into the wrong communities in the detected result. But nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer than, or even more than, those within the communities to which these nodes belong. Therefore, if we ignore the actual meaning of these nodes and consider the topology only, the community structure extracted by our proposed method is qualitatively more rational than the ground truth: more edges associated with these three nodes are located within their communities than in the ground truth, so these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, the values of both measure metrics of our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method, NSA.

Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest value on each network is marked with an asterisk.

Network     FastQ   WalkTrap  LPA     Attractor  IsoFdp  NSA
Lesmis      0.499   0.519     0.515   0.498      0.491   0.54*
Polbooks    0.502   0.507     0.508   0.501      0.518   0.524*
ColiNeta    0.779*  0.746     0.693   0.718      -       0.761
Email       0.499   0.531     0.379   0.464      0.531   0.544*
NetScience  0.955   0.956     0.896   0.937      -       0.957*
YeastL      0.573   0.529     0.372   0.511      -       0.574*
PGP         0.85    0.789     0.765   0.768      0.726   0.867*
DBLP        0.735   -         0.652   0.637      -       0.782*
Amazon      0.869   -         0.743   0.741      -       0.898*

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists, and an edge connects two scientists who have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially in the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the measure metrics, the value of NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one that most closely approaches the ground-truth community structure; the modularity value of the proposed method is not the largest, but it is second only to that of FastQ. These results also confirm that our proposed method can extract high-quality community structures from networks.
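The comparison of inner and outer edges used in the discussion of the Risk map and scientists collaboration networks can be checked mechanically. The sketch below is only an illustration; the function name `inner_outer_edges` and the edge-list representation are assumptions, not part of the original method.

```python
def inner_outer_edges(edges, community):
    """Count edges inside a community and edges crossing its boundary.

    edges: list of (u, v) pairs of an undirected graph.
    community: iterable of the nodes that belong to the community.
    Returns (inner, outer).
    """
    members = set(community)
    inner = outer = 0
    for u, v in edges:
        u_in, v_in = u in members, v in members
        if u_in and v_in:
            inner += 1      # both endpoints inside the community
        elif u_in or v_in:
            outer += 1      # exactly one endpoint inside
    return inner, outer
```

A community whose inner count clearly exceeds its outer count is tightly connected internally, which is the qualitative argument made above for treating the two additional communities as independent ones.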

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, because no ground-truth community structures are available. The modularity values obtained by the proposed method and the comparison algorithms are recorded in Table 4. To illustrate them intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

Figure 8: The bar chart of the modularity (Q) obtained by the comparison algorithms (FastQ, WalkTrap, LPA, Attractor, IsoFdp) and the proposed method, NSA, on the Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon networks.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the remaining network, ColiNeta, it still obtained the second largest value of modularity. FastQ is based on the modularity optimization strategy, yet it acquired the largest value of modularity on the ColiNeta network only. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it could not obtain effective results on the Amazon and DBLP networks because of their large scale. LPA and Attractor can extract community structures from all those networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on the ColiNeta, NetScience, and YeastL networks, as these three networks are disconnected. It cannot detect the community structure from the Amazon and DBLP networks effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and that it outperforms the comparison algorithms significantly.

Figure 9: The setting of parameter δ. Each panel plots the modularity Q against δ ∈ [0, 0.3]: (a) the karate club network; (b) the dolphin social network; (c) the risk map network; (d) the scientists collaboration network.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold δ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric γ_i = α_i × β_i for every preliminary community C_i in the merge procedure; if the value of γ_i is below the threshold δ, the corresponding community C_i is identified as one that needs to be merged.

Therefore, δ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold δ should be, in accordance with the definitions of community sparsity (α_i), community scale (β_i), and community metric (γ_i). To determine the optimal value of δ, we conducted a group of experiments to explore the relationship between the value of δ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity Q. We vary the value of δ from 0 to 1.0 in increments of 0.005; for each value of δ, we run our proposed method on these networks and observe how the modularity changes as δ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion δ ∈ [0, 0.3], because the largest modularities are obtained for δ ⩽ 0.3 on all four networks. Our proposed method attains the largest modularity when δ = 0.13 on the dolphin social network and δ = 0.1 on the other three networks. Therefore, we adopt the corresponding values for those four networks and empirically set δ = 0.1 for the other networks in the experiments. In Figure 9, the largest modularity is obtained around δ = 0.1, and the interval [0.05, 0.2] covers the optimal value of δ. Therefore, we empirically suggest that δ be adjusted adaptively around 0.1 within the range [0.05, 0.2], according to the size and the sparsity of the networks involved in real-world applications.
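The parameter sweep described in this section can be sketched as a simple grid search. In the sketch below, `detect` and `quality` are hypothetical callables standing in for a run of the proposed method with a given δ and for the modularity of the resulting community structure; neither name comes from the paper, and the grid bounds are taken from the text (0 to 1.0 in steps of 0.005).

```python
def sweep_delta(detect, quality, start=0.0, stop=1.0, step=0.005):
    """Grid search over the threshold delta.

    detect(delta) -> community structure (hypothetical stand-in for the method).
    quality(structure) -> float, e.g. modularity Q (hypothetical stand-in).
    Returns (best_delta, best_quality); ties keep the smallest delta.
    """
    n_steps = int(round((stop - start) / step))
    best_delta, best_q = None, float("-inf")
    for i in range(n_steps + 1):
        # Index-based grid plus rounding avoids accumulated float drift.
        delta = round(start + i * step, 10)
        q = quality(detect(delta))
        if q > best_q:
            best_delta, best_q = delta, q
    return best_delta, best_q
```

With the default arguments the sketch evaluates 201 values of δ, so each call to `detect` should be cheap enough to repeat that many times per network.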

6. Conclusion

In this paper, we presented a novel method to detect communities from networks. It is a local method based on node similarity and overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, on the basis of its most similar neighbor's community assignment, either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Algorithm 3: PCM(CS_pre, δ): merge small or sparse communities.

Input: CS_pre, the preliminary community structure; δ, the community-metric threshold
Output: CS, the final community structure

1. Initialize CS, which is used to record the community structure:
       CS ← CS_pre
2. Calculate the community metric for each of the preliminary communities:
       foreach C_i ∈ CS do
           γ_i ← α_i × β_i
3. Select the community with the minimal community metric; denote its index as t:
       t ← argmin_i {γ_i | i = 1, 2, ..., |CS|}
4. Identify the community most similar to C_t; denote its index as j:
       j ← argmax_i {Sim(C_t, C_i) | i = 1, 2, ..., |CS|, i ≠ t}
5. Merge communities C_t and C_j to form a new community:
       k ← |CS|; C_{k+1} ← C_t ∪ C_j
6. Calculate the community metric for the new community:
       γ_{k+1} ← α_{k+1} × β_{k+1}
7. Replace the two communities C_t and C_j with the new community to reflect the merging effect:
       CS = (CS − {C_t, C_j}) ∪ {C_{k+1}}
8. Repeat steps 3 through 7 until γ_t > δ
9. return CS
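The merge procedure of Algorithm 3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `metric` is a hypothetical callable standing in for the community metric γ = α × β (whose ingredients are defined elsewhere in the paper and are not reproduced here), and `similarity` is a hypothetical callable standing in for Sim(C_t, C_i). The sketch tests the stop condition γ_t > δ before each merge and uses plain linear scans instead of heaps.

```python
def merge_communities(communities, metric, similarity, delta):
    """Sketch of the merge phase (PCM).

    communities: list of node sets (the preliminary community structure).
    metric(c) -> float: community metric gamma (hypothetical callable).
    similarity(a, b) -> float: community similarity (hypothetical callable).
    delta: community-metric threshold.
    """
    cs = [set(c) for c in communities]      # step 1: copy CS_pre
    gamma = [metric(c) for c in cs]         # step 2: metric of each community
    while len(cs) > 1:
        # Step 3: community with the minimal metric.
        t = min(range(len(cs)), key=gamma.__getitem__)
        if gamma[t] > delta:                # step 8: stop condition
            break
        # Step 4: most similar other community.
        j = max((i for i in range(len(cs)) if i != t),
                key=lambda i: similarity(cs[t], cs[i]))
        merged = cs[t] | cs[j]              # step 5: merge C_t and C_j
        # Step 7: remove C_t and C_j (larger index first), then add the union.
        for idx in sorted((t, j), reverse=True):
            del cs[idx]
            del gamma[idx]
        cs.append(merged)
        gamma.append(metric(merged))        # step 6: metric of the new community
    return cs
```

Because each round scans all communities, this sketch costs O(K) per merge; reaching the O(log K) per-iteration cost stated in the complexity analysis would require heap-based bookkeeping that is omitted here for brevity.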

3.4. Time Complexity. The proposed algorithm comprises two phases. The first phase forms the preliminary communities. The main time consumption in this phase lies in selecting the node with the largest degree (step 2 in Algorithm 2) and its most similar neighbor (step 3 in Algorithm 2); the former can be accomplished in O(log n) in each iteration using a max-heap data structure, and the latter can be done in O(log⟨d⟩) with a max-heap, where ⟨d⟩ is the average degree of nodes in the network. Since ⟨d⟩ ≪ n, the time consumption of the first phase is O(n log n).
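The heap-based selection in the first phase can be sketched as follows. Several points are assumptions made only for illustration: the Jaccard index of closed neighborhoods is used as a placeholder for the node similarity (the paper's own similarity measure is defined in an earlier section), isolated nodes are given their own community, ties are broken by node label, and the most similar neighbor is found by a linear scan over the neighbors rather than with a per-node heap.

```python
import heapq


def jaccard(adj, u, v):
    """Placeholder similarity: Jaccard index of closed neighborhoods."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(adj, similarity=jaccard):
    """Sketch of the first phase.

    adj: dict mapping each node to the set of its neighbors (undirected).
    Node labels must be mutually comparable, since they break degree ties.
    Returns a dict mapping each node to a community label.
    """
    # heapq is a min-heap, so degrees are negated to pop the largest first.
    heap = [(-len(nbrs), node) for node, nbrs in adj.items()]
    heapq.heapify(heap)
    community_of = {}
    next_label = 0
    while heap:
        _, node = heapq.heappop(heap)
        if node in community_of:
            continue                      # already placed with an earlier node
        if not adj[node]:
            community_of[node] = next_label   # isolated node (assumption)
            next_label += 1
            continue
        # Most similar neighbor; sorting makes tie-breaking deterministic.
        best = max(sorted(adj[node]), key=lambda v: similarity(adj, node, v))
        if best in community_of:
            community_of[node] = community_of[best]
        else:
            community_of[node] = community_of[best] = next_label
            next_label += 1
    return community_of
```

On two triangles joined by a single edge, this sketch places each triangle in its own preliminary community.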

The second phase improves the quality of the resulting community structure by merging some of the small or sparse communities. Most of the time is spent on determining the community that needs to be merged and its most similar adjacent community in each iteration. Assuming there are K communities in the preliminary community structure, the former operation can be implemented in O(log K), and the latter can also be carried out in O(log K) time in the worst case. Hence, the second phase can be implemented in O(K log K) time.

Since K ≪ n, we have log K ≪ log n. Therefore, the proposed method can detect communities from networks with relatively high efficiency, in O(n log n) time overall.

4. Experimental Results and Discussion

4.1. Network Datasets and Comparison System. To test the performance of our proposed method, we conducted extensive experiments on several groups of artificial networks and on real-world networks. The artificial networks are synthesized using the LFR benchmark network generator [50], which takes parameters that control the characteristics of the generated networks. Here we consider the influences of both the network scale and the community size; therefore, four types of networks are generated: small networks with small communities, small networks with big communities, larger networks with small communities, and larger networks with big communities. The small and larger networks contain 1000 and 5000 nodes, respectively. A small community contains at least 10 and at most 50 nodes; the minimum and maximum numbers of nodes in the big communities are 20 and 100, respectively. The generated networks with small and big communities are marked with the suffixes 's' and 'b', respectively. The exponents of the power-law distributions that node degree and community size follow are the default values, −2 and −1, respectively. The parameters used to synthesize the four groups of artificial networks are listed in Table 1.

We also performed experiments on 13 real-world networks, whose sizes span from tens to hundreds of thousands of nodes; information about them is listed in Table 2. These real-world networks can be divided into two categories: the first category includes the first four networks, whose ground-truth communities are known a priori; the second contains the other nine networks, which have no publicly acknowledged ground-truth community structures.

On these networks, we ran our proposed method to detect community structures and compared the results to those of 5 popular community detection algorithms, namely, Fast$Q$ [24], WalkTrap [38], LPA [28], Attractor [41], and IsoFdp [36], which have already been introduced in Section 2. Since LPA is a nondeterministic algorithm, we ran it on each network 10 times and took the average of the evaluation metrics as its resulting metric value for that network. For our proposed method, NSA, we empirically set $\delta = 0.13$ for the dolphin social network and $\delta = 0.1$ for the other networks in the experiments. The details of how to set the optimal value of $\delta$ will be discussed in Section 5.

Table 1: The parameters used to generate the LFR networks. In the header row of this table, $n$ is the number of nodes contained in the network; $\langle d\rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively; $exp_d$ and $exp_{com}$ are the exponents of the power-law distributions that node degree and community size follow; $\min(C_i)$ and $\max(C_i)$ represent the minimal and maximal numbers of nodes contained in every community, respectively.

| Network | $n$ | $\langle d\rangle$ | $d_{max}$ | $exp_d$ | $exp_{com}$ | $\min(C_i)$ | $\max(C_i)$ |
|---|---|---|---|---|---|---|---|
| LFR1000s | 1000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR1000b | 1000 | 20 | 50 | -2 | -1 | 20 | 100 |
| LFR5000s | 5000 | 20 | 50 | -2 | -1 | 10 | 50 |
| LFR5000b | 5000 | 20 | 50 | -2 | -1 | 20 | 100 |

Table 2: The information about the real-world networks. $n$ and $m$ are the numbers of nodes and edges in the network, respectively.

| Network | $n$ | $m$ |
|---|---|---|
| Karate club [14] | 34 | 78 |
| Dolphin social network [15] | 62 | 159 |
| Risk map [16] | 42 | 83 |
| Scientists collaboration network [6] | 118 | 197 |
| Lesmis [17] | 77 | 254 |
| Polbooks [3] | 105 | 441 |
| ColiNeta [18] | 423 | 519 |
| NetScience [10] | 1589 | 2742 |
| Email [19] | 1133 | 5451 |
| YeastL [20] | 2361 | 7182 |
| PGP [21] | 10680 | 24316 |
| DBLP [22] | 317080 | 1049866 |
| Amazon [22] | 334863 | 925872 |

4.2. Evaluation Metrics. Two indexes, namely, NMI (Normalized Mutual Information) [51] and modularity [7], are adopted as the metrics to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\dfrac{n_{ij} \cdot n}{n_{P_i} \cdot n_{P'_j}}\right)}{\sum_{i=1}^{|P|} n_{P_i} \log\left(\dfrac{n_{P_i}}{n}\right) + \sum_{j=1}^{|P'|} n_{P'_j} \log\left(\dfrac{n_{P'_j}}{n}\right)} \tag{8}$$

where $n_{P_i} = |P_i|$, $n_{P'_j} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$, respectively.

The NMI is an information-theory-based metric that measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value lies in the range [0, 1]; larger is better.
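Equation (8) translates almost directly into code. The following is a minimal sketch, assuming each partition is given as a sequence of community labels indexed by node; the function name `nmi` and this input layout are choices made here for illustration. Terms with $n_{ij} = 0$ are skipped (the usual convention $0 \log 0 = 0$), and the degenerate case in which both partitions consist of a single community is treated as perfect agreement, which is an assumption rather than something stated in the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def nmi(truth: Sequence[Hashable], found: Sequence[Hashable]) -> float:
    """NMI of Eq. (8); truth[i] and found[i] are the labels of node i."""
    if len(truth) != len(found):
        raise ValueError("both partitions must label the same nodes")
    n = len(truth)
    size_p = Counter(truth)               # n_{P_i}
    size_q = Counter(found)               # n_{P'_j}
    overlap = Counter(zip(truth, found))  # n_{ij}, nonzero entries only
    numerator = -2.0 * sum(
        n_ij * math.log(n_ij * n / (size_p[i] * size_q[j]))
        for (i, j), n_ij in overlap.items()
    )
    denominator = sum(c * math.log(c / n) for c in size_p.values()) + sum(
        c * math.log(c / n) for c in size_q.values()
    )
    if denominator == 0.0:
        # Both partitions are a single community (assumed convention).
        return 1.0
    return numerator / denominator
```

For example, `nmi([0, 0, 1, 1], [1, 1, 0, 0])` compares two partitions that differ only in label names, while `nmi([0, 0, 1, 1], [0, 1, 0, 1])` compares two partitions that share no information.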

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_i \left(e_{ii} - a_i^2\right) \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$, whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ relative to the total edges in the network; $K$ is the number of communities in the community structure; and $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and degree distribution are the same as in the original network but edges are placed between nodes randomly. The smaller the difference between the two terms, the more closely the network resembles a random graph and the weaker the community structure is. Conversely, the larger the difference, the further the network departs from a random graph and the stronger the community structure is. That is, the modularity measures the quality of the community structure in terms of how far the detected result deviates from a random network; its effective value falls in [0, 1], and higher is better.
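As a worked illustration of (9), the sketch below computes $Q$ for an undirected, unweighted graph given as an edge list. The function name `modularity` and the input layout (a dictionary mapping each node to its community label) are assumptions made for this example. Here $a_i$ is computed as the number of edge ends attached to community $C_i$ divided by $2m$, which is the usual reading of "the fraction of edges associated with nodes in community $C_i$".

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    community: Dict[Hashable, Hashable],
) -> float:
    """Q of Eq. (9) for an undirected, unweighted graph."""
    inner = defaultdict(int)  # edges with both ends in community i
    ends = defaultdict(int)   # edge ends attached to community i
    m = 0
    for u, v in edges:
        m += 1
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inner[cu] += 1
    if m == 0:
        raise ValueError("the graph has no edges")
    # e_ii = inner[i] / m and a_i = ends[i] / (2m).
    return sum(inner[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)
```

For two triangles joined by a single edge ($m = 7$) and one community per triangle, $e_{ii} = 3/7$ and $a_i = 7/14$ for both communities, so $Q = 2\,(3/7 - 1/4) = 5/14 \approx 0.357$. Placing all six nodes in one community gives $Q = 1 - 1 = 0$.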

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter for this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point, above which communities in networks tend to be obscure.
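To make the role of $\mu$ concrete, the following sketch measures, for each node of a given network, the fraction of its edges that lead to other communities. It is an illustration only: `mixing_ratio` is a hypothetical helper that is not part of the LFR generator, and the adjacency-dictionary input is an assumed layout. In an LFR network generated with mixing parameter $\mu$, these per-node ratios are expected to be close to $\mu$.

```python
from typing import Dict, Hashable, Set


def mixing_ratio(
    adj: Dict[Hashable, Set[Hashable]],
    community: Dict[Hashable, Hashable],
) -> Dict[Hashable, float]:
    """Fraction of each node's edges that lead to other communities."""
    ratios = {}
    for node, nbrs in adj.items():
        if not nbrs:
            continue  # the ratio is undefined for isolated nodes
        external = sum(1 for v in nbrs if community[v] != community[node])
        ratios[node] = external / len(nbrs)
    return ratios
```

A node whose ratio exceeds 0.5 has more edges leaving its community than staying inside it, which is why communities become hard to recognize once $\mu$ passes that point.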

Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities. [Plots: NMI (y-axis, 0.0 to 1.0) against x-axis values 0.1 to 0.8; legend: FastQ, WalkTrap, LPA, Attractor, IsoFdp, proposal (NSA).]

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 in increments of 0.1 for each group of LFR networks. To reduce the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the other parameters unchanged. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as input one by one, ran our proposed method and the comparison algorithms to detect communities, and used the average NMI as the resulting metric. The results obtained by our proposal and the comparison algorithms on the small networks with small-sized and big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results on the larger networks with small-sized and big-sized communities are presented in Figures 3(a) and 3(b), respectively.
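The averaging protocol described above can be outlined as follows. This is a sketch under stated assumptions, not the authors' experiment script: `score_one_network` is a hypothetical callback that would generate the `run`-th LFR network for a given $\mu$, run one detection algorithm on it, and return the resulting NMI. The grid of $\mu$ values is built from integers to avoid accumulated floating-point error.

```python
from statistics import mean
from typing import Callable, Dict, List


def mu_grid(start_tenths: int = 1, stop_tenths: int = 8) -> List[float]:
    """Mixing parameters 0.1, 0.2, ..., 0.8, built from integers."""
    return [k / 10 for k in range(start_tenths, stop_tenths + 1)]


def average_score(
    score_one_network: Callable[[float, int], float],
    runs_per_mu: int = 10,
) -> Dict[float, float]:
    """Average a per-network score over several networks for each mu."""
    return {
        mu: mean(score_one_network(mu, run) for run in range(runs_per_mu))
        for mu in mu_grid()
    }
```

With the defaults, one call covers 8 values of $\mu$ times 10 networks, that is, 80 networks per LFR group and algorithm.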

In Figures 2(a) and 2(b), Fast$Q$ tends to introduce mistakes into the results regardless of whether the communities in the networks are well separated or obscure. As mentioned previously, Fast$Q$ is a typical modularity-optimization-based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leqslant 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms on the networks with $\mu < 0.5$, but its performance dropped dramatically for $\mu \geqslant 0.5$; it could not even detect effective communities from networks with $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels when updating their own, leading to trivial results in which even all nodes are labeled as members of one giant community. The proposed method, NSA, acquired NMI = 1 on all networks for $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures slipped for all three of those algorithms and for the proposed method. For $0.5 < \mu \leqslant 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geqslant 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar to those in Figure 2 overall, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; however, in Figure 3, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performance of the comparison algorithms is not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also manifested by the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those with the ground-truth community structure known a priori and those without publicly acknowledged ground truth.

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities. [Plots: NMI (y-axis, 0.0 to 1.0) against x-axis values 0.1 to 0.8; legend: FastQ, WalkTrap, LPA, Attractor, IsoFdp, proposal (NSA).]

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.) [Network drawings of nodes 1 to 34.]

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This network depicts the friendships among members of a karate club; it contains 34 nodes and 78 edges. The network was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During Zachary's period of study, the club split into two factions because of a dispute that arose between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is usually taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. It seems that the detected result deviates from the ground truth in some ways, but this result coincides with the conclusion found in the experiments on synthetic networks that our proposed method tends to find small communities in networks and thus overcomes the problem of resolution limit. Moreover, from the perspective of the metrics, the modularity corresponding to the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with as large a modularity as possible; when its modularity is not the largest, it is the second largest, with a small offset from the largest. These findings are also borne out on the following networks.

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method NSA. [Network drawings with nodes labeled by dolphin names.]

Table 3: The experimental results on networks with ground-truth community structures. The largest values of the two metrics are typed in bold.

| Network | Metric | Fast$Q$ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|---|
| Karate | $Q$ | 0.381 | 0.353 | 0.355 | 0.371 | 0.371 | **0.402** |
| Karate | NMI | 0.693 | 0.504 | 0.62 | 0.924 | **1.00** | 0.699 |
| Dolphin | $Q$ | 0.492 | 0.489 | 0.464 | 0.45 | 0.505 | **0.513** |
| Dolphin | NMI | 0.719 | 0.632 | 0.719 | 0.69 | 0.744 | **0.887** |
| Risk map | $Q$ | **0.625** | 0.624 | 0.59 | 0.598 | 0.519 | 0.624 |
| Risk map | NMI | **0.894** | 0.848 | 0.821 | 0.839 | 0.714 | 0.848 |
| Scientists | $Q$ | **0.749** | 0.733 | 0.64 | 0.694 | 0.668 | 0.744 |
| Scientists | NMI | 0.867 | 0.818 | 0.743 | 0.835 | 0.823 | **0.878** |

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method from this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is better than those of the comparison algorithms.

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories on 6 continents are involved. Accordingly, 42 nodes and 83 edges connecting adjacent countries or territories are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result to the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are misclassified into the wrong communities in the detected result. However, nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer than, or even more than, those within the communities to which these nodes belong. Therefore, if we ignore the actual meaning of these nodes and consider only the topology, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with the three misclassified nodes are located within the community than in the ground truth; thus, these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, the values of both metrics for our proposed method are second only to those of Fast$Q$ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest modularity values are typed in bold; "-" marks cases in which no result was obtained.

| Network | Fast$Q$ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|
| Lesmis | 0.499 | 0.519 | 0.515 | 0.498 | 0.491 | **0.54** |
| Polbooks | 0.502 | 0.507 | 0.508 | 0.501 | 0.518 | **0.524** |
| ColiNeta | **0.779** | 0.746 | 0.693 | 0.718 | - | 0.761 |
| Email | 0.499 | 0.531 | 0.379 | 0.464 | 0.531 | **0.544** |
| NetScience | 0.955 | 0.956 | 0.896 | 0.937 | - | **0.957** |
| YeastL | 0.573 | 0.529 | 0.372 | 0.511 | - | **0.574** |
| PGP | 0.85 | 0.789 | 0.765 | 0.768 | 0.726 | **0.867** |
| DBLP | 0.735 | - | 0.652 | 0.637 | - | **0.782** |
| Amazon | 0.869 | - | 0.743 | 0.741 | - | **0.898** |

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method NSA. [Network drawings of nodes 1 to 42.]

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge indicates that two scientists have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one closest to the ground-truth community structure; although the modularity of the proposed method is not the largest, it is second only to that of Fast$Q$. These results also show that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, owing to the absence of ground-truth community structures. The modularity values obtained by the proposed method and the comparison algorithms are recorded in Table 4. To illustrate them intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm. [Network drawings of nodes 1 to 118.]

Figure 8: The bar chart of the modularity obtained by comparison algorithms and the proposed method NSA. [Bar chart: modularity ($Q$, y-axis, 0.0 to 1.0) for the networks Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon; legend: FastQ, WalkTrap, LPA, Attractor, IsoFdp, proposal (NSA).]

On these networks, our proposed method achieved the largest modularity on 8 of them. On the only remaining network, ColiNeta, it still obtained the second largest modularity. Fast$Q$ is based on the modularity optimization strategy, yet it acquired the largest modularity only on the ColiNeta network. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it could not produce effective results on the Amazon and DBLP networks because of their large scale. LPA and Attractor can extract community structures from all of these networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on the ColiNeta, NetScience, and YeastL networks, as these three networks are disconnected. It cannot detect the community structure from the Amazon and DBLP networks effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and that it outperforms the comparison algorithms on 8 of these 9 networks in terms of modularity.

Figure 9: The setting of parameter $\delta$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network. [Plots: modularity $Q$ (y-axis) against x-axis values 0.00 to 0.30.]

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.

Therefore, $\delta$ works as a parameter in our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conducted a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in steps of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion with $\delta \in [0, 0.3]$, because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method attains the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval [0.05, 0.2] covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 within the range [0.05, 0.2] according to the size and the sparsity of the networks involved in real-world applications.
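The parameter sweep described in this section can be sketched as a simple grid search. This is an illustrative outline under assumptions: `quality` is a hypothetical callback that would run the method with a given $\delta$ on one network and return the resulting modularity $Q$, and the function name `sweep_delta` is chosen here. Grid points are computed from an integer index so that rounding errors do not accumulate over the 201 steps.

```python
from typing import Callable, Tuple


def sweep_delta(
    quality: Callable[[float], float],
    stop: float = 1.0,
    step: float = 0.005,
) -> Tuple[float, float]:
    """Return (best_delta, best_quality) over delta = 0, step, ..., stop."""
    steps = round(stop / step)
    best_delta, best_q = 0.0, float("-inf")
    for k in range(steps + 1):
        delta = round(k * step, 6)  # index-based grid avoids drift
        q = quality(delta)
        if q > best_q:  # strict '>' keeps the smallest delta among ties
            best_delta, best_q = delta, q
    return best_delta, best_q
```

When several values of $\delta$ give the same modularity, this sketch reports the smallest of them; that tie-breaking rule is a choice made here, not one stated in the paper.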

6. Conclusion

In this paper, we presented a novel method to detect communities from networks. It is a local method based on node similarity and overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs, depending on the community assignment of its most similar neighbor; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes that take both the size and sparsity of communities into account to determine which communities should be merged.

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors on most of the tested networks.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which are cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: a structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: a density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International symposium on computer and information sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Table 1: The parameters used to generate the LFR networks. In the header row of this table, $n$ is the number of nodes contained in the network; $\langle d \rangle$ and $d_{max}$ are the average degree and the maximum degree, respectively; $exp_d$ and $exp_{com}$ are the exponents of the power-law distributions that node degree and community size follow; $\min(C_i)$ and $\max(C_i)$ represent the minimal and maximal number of nodes contained in every community, respectively.

Network    $n$    $\langle d \rangle$   $d_{max}$   $exp_d$   $exp_{com}$   $\min(C_i)$   $\max(C_i)$
LFR1000s   1000   20                    50          -2        -1            10            50
LFR1000b   1000   20                    50          -2        -1            20            100
LFR5000s   5000   20                    50          -2        -1            10            50
LFR5000b   5000   20                    50          -2        -1            20            100

Table 2: The information about the real-world networks. $n$ and $m$ are the number of nodes and edges in the network, respectively.

Network                                  $n$      $m$
Karate club [14]                         34       78
Dolphin social network [15]              62       159
Risk map [16]                            42       83
Scientists collaboration network [6]     118      197
Lesmis [17]                              77       254
Polbooks [3]                             105      441
ColiNeta [18]                            423      519
NetScience [10]                          1589     2742
Email [19]                               1133     5451
YeastL [20]                              2361     7182
PGP [21]                                 10680    24316
DBLP [22]                                317080   1049866
Amazon [22]                              334863   925872

adopted as the metric to evaluate the quality of the detected community structure in this paper. The NMI between the ground-truth community structure $P = \{P_1, P_2, \ldots, P_K\}$ and the extracted one $P' = \{P'_1, P'_2, \ldots, P'_{K'}\}$ is calculated as follows:

$$\mathrm{NMI}(P, P') = \frac{-2 \sum_{i=1}^{|P|} \sum_{j=1}^{|P'|} n_{ij} \log\left(\frac{n_{ij} \cdot n}{n_i^{P} \cdot n_j^{P'}}\right)}{\sum_{i=1}^{|P|} n_i^{P} \log\left(\frac{n_i^{P}}{n}\right) + \sum_{j=1}^{|P'|} n_j^{P'} \log\left(\frac{n_j^{P'}}{n}\right)}, \tag{8}$$

where $n_i^{P} = |P_i|$, $n_j^{P'} = |P'_j|$, and $n_{ij} = |P_i \cap P'_j|$.

The NMI is an information-theoretic metric that measures how much the detected community structure agrees with the ground truth. Therefore, it can only be used to evaluate the quality of the detected community structure on networks whose ground-truth community structure is already known. Its value lies in the range $[0, 1]$; larger is better.
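To make (8) concrete, the following is a minimal Python sketch of the computation. The function name `nmi` and the representation of a partition as a dictionary from node to community label are assumptions made for illustration, not part of the original method. The handling of the degenerate case in which both partitions consist of a single community (where the denominator of (8) is zero) is also an assumed convention.

```python
import math
from collections import Counter


def nmi(truth, detected):
    """NMI of equation (8) for two partitions given as node -> label dicts.

    Both dicts are assumed to cover the same set of nodes.
    """
    nodes = list(truth)
    n = len(nodes)
    size_p = Counter(truth[v] for v in nodes)        # n_i^P  = |P_i|
    size_q = Counter(detected[v] for v in nodes)     # n_j^P' = |P'_j|
    overlap = Counter((truth[v], detected[v]) for v in nodes)  # n_ij

    # Only nonzero overlaps contribute to the double sum in the numerator.
    numerator = -2.0 * sum(
        n_ij * math.log(n_ij * n / (size_p[i] * size_q[j]))
        for (i, j), n_ij in overlap.items()
    )
    denominator = (
        sum(c * math.log(c / n) for c in size_p.values())
        + sum(c * math.log(c / n) for c in size_q.values())
    )
    if denominator == 0.0:
        # Assumed convention: two single-community partitions are identical.
        return 1.0
    return numerator / denominator


if __name__ == "__main__":
    ground_truth = {1: "a", 2: "a", 3: "b", 4: "b"}
    same_up_to_labels = {1: 0, 2: 0, 3: 1, 4: 1}
    print(nmi(ground_truth, same_up_to_labels))
```

Because the labels themselves never enter the formula, only the overlaps between groups do, two partitions that differ only in how communities are named receive the same score.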

Another metric widely used to evaluate the performance of community detection methods is modularity [7], which is defined as follows:

$$Q = \sum_{i} \left(e_{ii} - a_i^2\right), \tag{9}$$

where $e_{ii}$ is the diagonal element of a $K \times K$ matrix $e$ whose element $e_{ij}$ is the fraction of edges between nodes in communities $C_i$ and $C_j$ relative to the total edges in the network, $K$ is the number of communities in the community structure, and $a_i$ is the fraction of edges associated with nodes in community $C_i$.

The first term, $\sum_i e_{ii}$, on the right-hand side of (9) is the fraction of edges within communities; the second term, $\sum_i a_i^2$, is the expected value of the same fraction in a random graph in which the nodes and the degree distribution are the same as in the original network but the edges are placed between nodes randomly. The smaller the difference between the two terms is, the more closely the network approaches a random graph, and the weaker the community structure is. Conversely, the larger the difference is, the further the network departs from the random graph, and the stronger the community structure is. That is to say, modularity measures the quality of the community structure from the perspective of how far the detected result deviates from a random network; its effective value falls in $[0, 1]$, and higher is better.
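As an illustration of (9), the sketch below computes $Q$ for an undirected, unweighted network given as an edge list. The function name `modularity` and the input format are assumptions for illustration. The sketch also assumes the usual reading of $a_i$ as the fraction of edge ends attached to nodes of community $C_i$, that is, the sum of their degrees divided by $2m$.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Q = sum_i (e_ii - a_i^2) for an undirected edge list.

    edges      : list of (u, v) pairs, one entry per undirected edge
    membership : dict mapping each node to its community label
    """
    m = len(edges)
    internal = defaultdict(int)    # number of edges inside each community
    degree_sum = defaultdict(int)  # number of edge ends in each community
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # e_ii = internal[c] / m, a_i = degree_sum[c] / (2 m)
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    print(modularity(toy_edges, two_groups))
```

On this toy network, splitting the nodes into the two triangles gives a positive $Q$, whereas placing all nodes in one community gives $Q = 0$, which matches the interpretation of the two terms given above.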

4.3. Synthetic Networks. We carried out experiments on four groups of artificial networks to test the performance of the proposed method. As mentioned above, all four types of artificial networks are synthesized using the LFR benchmark generator software [50]. Besides the parameters listed in Table 1, another critical parameter of this software is the mixing parameter $\mu$, which regulates, for each node, the ratio of edges connected to nodes in other communities. The smaller the value of $\mu$ is, the clearer the community structure will be. Obviously, $\mu = 0.5$ is a transition point above which communities in networks tend to be obscure.
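The role of the mixing parameter can be illustrated by measuring, for each node of a network with known communities, the fraction of its edges that lead to other communities. The sketch below is only an illustration of that ratio; the function name `mixing_fractions` and the edge-list input are assumptions, and the code is not the LFR generator itself.

```python
def mixing_fractions(edges, membership):
    """Per-node fraction of edges that connect to other communities.

    edges      : list of (u, v) pairs, one entry per undirected edge
    membership : dict mapping each node to its community label
    Nodes without any edge are omitted from the result.
    """
    degree = {}
    external = {}
    for u, v in edges:
        # Visit the edge once from each endpoint.
        for a, b in ((u, v), (v, u)):
            degree[a] = degree.get(a, 0) + 1
            if membership[a] != membership[b]:
                external[a] = external.get(a, 0) + 1
    return {v: external.get(v, 0) / degree[v] for v in degree}


if __name__ == "__main__":
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    fractions = mixing_fractions(toy_edges, groups)
    print(sum(fractions.values()) / len(fractions))  # empirical average
```

A network generated with a small $\mu$ would be expected to show small per-node fractions, so most edges stay inside communities.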

Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes. (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities. [Plots: NMI, from 0.0 to 1.0, on the vertical axis; horizontal axis from 0.1 to 0.8; curves for FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA).]

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 in increments of 0.1 for each group of LFR networks. To reduce the effect of randomness, we generated 10 networks for each value of $\mu$ while keeping the other parameters fixed. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We fed these networks one by one to our proposed method and the comparison algorithms to detect communities and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms from the small networks with small-sized and big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results from the larger networks with small-sized and big-sized communities are presented in Figures 3(a) and 3(b), respectively.

In Figures 2(a) and 2(b), FastQ tends to introduce mistakes into the results regardless of whether the communities in the networks are well separated or obscure. As mentioned previously, FastQ is a typical modularity-optimization-based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leq 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms on the networks with $\mu < 0.5$, but its performance dropped dramatically for $\mu \geq 0.5$; it could not even detect meaningful communities for $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels when updating their own, which leads to trivial results in which even all nodes are labeled as members of one giant community. The proposed method, NSA, acquired NMI = 1 on all networks for $\mu < 0.5$, meaning that the detected partitions match the ground-truth community structures in these networks perfectly. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures slipped for all those three algorithms and the proposed method. For $0.5 < \mu \leq 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geq 0.7$ the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar to those in Figure 2 overall, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; in Figure 3, however, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performance of the comparison algorithms is not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also evident from the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves across these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the resolution-limit problem to some extent.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those whose ground-truth community structure is known a priori and those without a publicly acknowledged ground truth.

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities. [Plots: NMI, from 0.0 to 1.0, on the vertical axis; horizontal axis from 0.1 to 0.8; curves for FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA).]

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method, NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.) [Node-link diagrams of nodes 1 to 34.]

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This network depicts the friendships among members of a karate club; it contains 34 nodes and 78 edges. It was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During Zachary's period of study, the club split into two factions because of a dispute that arose between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is usually taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. The detected result seems to deviate from the ground truth in some ways, but it coincides with the conclusion drawn from the experiments on synthetic networks that our proposed method tends to find small communities in networks and thereby overcomes the resolution-limit problem. Moreover, from the perspective of the metrics, the modularity of the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on modularity optimization, it tends to acquire a community structure with as large a modularity as possible: if its modularity is not the largest, it is the second largest, with a small offset from the largest. These findings are also borne out on the following networks.

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method, NSA. [Node-link diagrams with nodes labeled by dolphin name.]

Table 3: The experimental results on networks with ground-truth community structures. The largest value of each metric on each network is marked with an asterisk.

Network      Metric   FastQ    WalkTrap   LPA     Attractor   IsoFdp   NSA
Karate       $Q$      0.381    0.353      0.355   0.371       0.371    0.402*
             NMI      0.693    0.504      0.62    0.924       1.00*    0.699
Dolphin      $Q$      0.492    0.489      0.464   0.45        0.505    0.513*
             NMI      0.719    0.632      0.719   0.69        0.744    0.887*
Risk map     $Q$      0.625*   0.624      0.59    0.598       0.519    0.624
             NMI      0.894*   0.848      0.821   0.839       0.714    0.848
Scientists   $Q$      0.749*   0.733      0.64    0.694       0.668    0.744
             NMI      0.867    0.818      0.743   0.835       0.823    0.878*

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the co-occurrences of pairs of dolphins being observed, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method on this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is clearly better than those of the comparison algorithms.

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories on 6 continents are involved. Accordingly, 42 nodes and 83 edges connecting adjacent countries or territories are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are classified into the wrong communities in the detected result. However, nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer than, or even more than, those within the communities to which these nodes belong. Therefore, if we ignore the actual meaning of these nodes and consider the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three nodes are located within the community than in the ground truth, so these three nodes are more tightly connected to nodes within the same community in Figure 6(b). Quantitatively, the values of both metrics for our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Table 4: The experimental results of modularity on networks. The largest value for each network is marked with an asterisk.

Network      FastQ    WalkTrap   LPA     Attractor   IsoFdp   NSA
Lesmis       0.499    0.519      0.515   0.498       0.491    0.54*
Polbooks     0.502    0.507      0.508   0.501       0.518    0.524*
ColiNeta     0.779*   0.746      0.693   0.718       -        0.761
Email        0.499    0.531      0.379   0.464       0.531    0.544*
NetScience   0.955    0.956      0.896   0.937       -        0.957*
YeastL       0.573    0.529      0.372   0.511       -        0.574*
PGP          0.85     0.789      0.765   0.768       0.726    0.867*
DBLP         0.735    -          0.652   0.637       -        0.782*
Amazon       0.869    -          0.743   0.741       -        0.898*

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method, NSA. [Node-link diagrams of nodes 1 to 42.]

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthorship relations among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge connects two scientists who have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; that is, two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one closest to the ground-truth community structure; the modularity of the proposed method is not the largest, but it is second only to that of FastQ. These results also show that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, owing to the absence of ground-truth community structures. For the proposed method and the comparison algorithms, the obtained values of modularity are recorded in Table 4. To illustrate them intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm. [Node-link diagrams of nodes 1 to 118.]

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method, NSA. [Bar chart: modularity ($Q$), from 0.0 to 1.0, on the vertical axis for the networks Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon; bars for FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA).]

On these networks, our proposed method achieved the largest modularity on 8 of them. On the remaining network, ColiNeta, it still obtained the second largest modularity. FastQ is based on the modularity-optimization strategy, yet it acquired the largest modularity on ColiNeta only. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it could not produce effective results on Amazon and DBLP because of the large scale of these two networks. LPA and Attractor can extract community structures from all those networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on ColiNeta, NetScience, and YeastL, as these three networks are disconnected; it cannot detect the community structure of Amazon and DBLP effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and that it outperforms the comparison algorithms significantly.

Figure 9: The setting of parameter $\delta$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network. [Plots: modularity $Q$ on the vertical axis; horizontal axis from 0.00 to 0.30.]

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.
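The threshold test described above can be sketched in a few lines of Python. The sketch assumes that the sparsity $\alpha_i$ and scale $\beta_i$ of every preliminary community have already been computed according to their definitions earlier in the paper; the function name `communities_to_merge`, the dictionary inputs, and the example values are hypothetical.

```python
def communities_to_merge(alpha, beta, delta=0.1):
    """Return the communities whose metric gamma_i = alpha_i * beta_i is below delta.

    alpha : dict mapping community label -> community sparsity (precomputed)
    beta  : dict mapping community label -> community scale (precomputed)
    delta : merge threshold; 0.1 is the empirical setting reported in Section 5
    """
    flagged = []
    for community in alpha:
        gamma = alpha[community] * beta[community]
        if gamma < delta:
            flagged.append(community)
    return flagged


if __name__ == "__main__":
    # Hypothetical values, used only to exercise the threshold test.
    alpha = {"C1": 0.5, "C2": 0.2}
    beta = {"C1": 0.6, "C2": 0.3}
    print(communities_to_merge(alpha, beta))
```

Only the identification step is shown here; how a flagged community is merged into a neighboring one is defined by the merge procedure of the second phase and is not reproduced in this sketch.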

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conducted a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion $\delta \in [0, 0.3]$ because the largest modularities are obtained for $\delta \leq 0.3$ on all four networks. Our proposed method gets the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1, in the range $[0.05, 0.2]$, according to the size and the sparsity of the networks involved in real-world applications.
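The parameter sweep described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: `modularity` implements the standard Newman-Girvan modularity for an undirected, unweighted graph with at least one edge, and `detect` is a hypothetical callable standing in for the proposed method (it takes an edge list and a value of $\delta$ and returns a node-to-community mapping). The `toy_detect` function is an invented placeholder, not the authors' algorithm.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman-Girvan modularity Q of a partition of an undirected,
    unweighted graph given as a list of (u, v) edges."""
    m = len(edges)
    inner = defaultdict(int)   # edges with both endpoints in community c
    degree = defaultdict(int)  # sum of node degrees in community c
    for u, v in edges:
        degree[membership[u]] += 1
        degree[membership[v]] += 1
        if membership[u] == membership[v]:
            inner[membership[u]] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


def sweep_delta(edges, detect, step=0.005, upper=1.0):
    """Try delta = 0, step, 2*step, ..., upper and return the first
    delta that yields the largest modularity, together with that Q."""
    best_delta, best_q = None, float("-inf")
    for k in range(int(round(upper / step)) + 1):
        delta = k * step
        q = modularity(edges, detect(edges, delta))
        if q > best_q:
            best_delta, best_q = delta, q
    return best_delta, best_q


# Toy graph: two triangles joined by one bridge edge.
two_triangles = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]


def toy_detect(edges, delta):
    """Hypothetical stand-in for the detection method: keeps the two
    triangles apart for small delta and merges everything otherwise."""
    if delta >= 0.5:
        return {v: 0 for v in range(6)}
    return {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}


best_delta, best_q = sweep_delta(two_triangles, toy_detect)
print(best_delta, round(best_q, 3))
```

Stepping with an integer counter rather than repeatedly adding 0.005 avoids accumulating floating-point error over the 200 steps of the sweep.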

6. Conclusion

In this paper, we presented a novel method to detect communities in networks. It is a local method based on node similarity, and it overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, depending on the community assignment of its most similar neighbor, either taking it as the exemplar of a new community or inserting it into the community to which that neighbor belongs. That is, if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has already been assigned to a community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.
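The first phase summarized above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the similarity measure is assumed here to be the Jaccard index of closed neighborhoods (the paper's own similarity definition is not reproduced in this section), degrees are taken from the whole network, ties are broken by node id, and an isolated node is given a community of its own. The names `jaccard` and `preliminary_communities` are hypothetical.

```python
def jaccard(adj, u, v):
    """Jaccard index of the closed neighborhoods of u and v
    (an assumed stand-in for the paper's similarity measure)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(adj):
    """Phase one: assign every node to a preliminary community.
    adj maps each node to the set of its neighbors."""
    membership = {}
    next_id = 0
    # Visit nodes by decreasing degree; skipping nodes that are already
    # assigned is the same as always picking the largest-degree node
    # among the remaining ones.
    for u in sorted(adj, key=lambda n: (-len(adj[n]), n)):
        if u in membership:
            continue
        if not adj[u]:
            # Assumption: an isolated node forms its own community.
            membership[u] = next_id
            next_id += 1
            continue
        best = max(sorted(adj[u]), key=lambda v: jaccard(adj, u, v))
        if best not in membership:
            # The most similar neighbor is unassigned: open a new community.
            membership[best] = next_id
            next_id += 1
        # Either way, u joins the community of its most similar neighbor.
        membership[u] = membership[best]
    return membership


# Toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
membership = preliminary_communities(adj)
print(membership)
```

On this toy graph the two triangles end up in two separate preliminary communities; the second (merge) phase is not part of this sketch.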

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications, and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which are cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.



Figure 2: Comparison of different community-detection algorithms on LFR benchmark networks containing 1000 nodes, plotted as NMI against $\mu$ for FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposed method (NSA). (a) The results detected from small networks with small-sized communities. (b) The results identified from small networks with big-sized communities.

In our experiments, we varied the value of $\mu$ from 0.1 to 0.8 with an increment of 0.1 for each group of LFR networks. To eliminate the effect of chance, we generated 10 networks for each value of $\mu$ while keeping the same setting for the other parameters. Since the community structures are already embedded in these synthetic networks, we use NMI as the metric to evaluate the performance of our proposed method and the comparison algorithms. We took these networks as input one by one, ran our proposed method and the comparison algorithms to detect communities, and used the average NMI as the resulting metric. The results detected by our proposal and the comparison algorithms on the small networks with small-sized and big-sized communities are illustrated in Figures 2(a) and 2(b), respectively; the results on the larger networks with small-sized and big-sized communities are presented in Figures 3(a) and 3(b), respectively.
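The evaluation protocol above (one NMI score per generated network, averaged per value of $\mu$) can be sketched as follows. The NMI formula is not restated in this section, so the sketch assumes the normalization commonly used in community detection, $\mathrm{NMI} = 2I(A;B) / (H(A) + H(B))$; the function names `nmi` and `average_nmi` are hypothetical.

```python
from collections import Counter
from math import log


def nmi(truth, found):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)) between
    two partitions, each given as a dict mapping node -> label."""
    n = len(truth)
    size_a = Counter(truth.values())
    size_b = Counter(found.values())
    joint = Counter((truth[v], found[v]) for v in truth)
    h_a = -sum(c / n * log(c / n) for c in size_a.values())
    h_b = -sum(c / n * log(c / n) for c in size_b.values())
    mi = sum(c / n * log(c * n / (size_a[a] * size_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mi / (h_a + h_b)


def average_nmi(runs):
    """Average NMI over (ground truth, detected partition) pairs,
    e.g. the 10 networks generated for one value of mu."""
    return sum(nmi(truth, found) for truth, found in runs) / len(runs)


truth = {0: "a", 1: "a", 2: "b", 3: "b"}
relabeled = {0: 1, 1: 1, 2: 0, 3: 0}  # same grouping, different labels
one_giant = {0: 0, 1: 0, 2: 0, 3: 0}  # everything in one community

print(nmi(truth, relabeled), nmi(truth, one_giant))
print(average_nmi([(truth, relabeled), (truth, one_giant)]))
```

A partition that matches the ground truth up to relabeling scores 1, while the trivial partition with one giant community scores 0 against any ground truth with more than one community.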

In Figures 2(a) and 2(b), FastQ tends to introduce mistakes in the results, no matter whether the communities in the networks are well separated or obscure. As mentioned previously, FastQ is a typical modularity-optimization-based algorithm; it aims only at acquiring results with larger modularity rather than high accuracy. In our experiments, none of the results uncovered by it is satisfactory. Even in the networks with $\mu = 0.1$, it still failed to identify the exact communities; furthermore, its performance is the worst among the comparison algorithms for $\mu \leq 0.5$. For $\mu > 0.5$, the quality of its results is better only than that of LPA. LPA performed as well as the other comparison algorithms for $\mu < 0.5$, but its performance dropped dramatically for $\mu \geq 0.5$; it could not even detect effective communities for $\mu > 0.6$. This might be due to its label-update mechanism: when the community boundaries become obscure, nodes tend to accept incorrect labels when updating their own, often leading to trivial results in which all nodes are labeled as members of one giant community.

The proposed method, NSA, acquired NMI = 1 on all networks for $\mu < 0.5$, meaning that the detected partitions perfectly match the ground-truth community structures in these networks. For $\mu = 0.5$, NSA also obtained results as good as those of WalkTrap, Attractor, and IsoFdp. For $\mu > 0.5$, the quality of the detected community structures slipped for all three of those algorithms and for the proposed method. For $0.5 < \mu \leq 0.6$, the quality of our proposal is better than that of Attractor in networks with larger communities, and for $\mu \geq 0.7$, the performance of our proposed method is the best.

In Figures 3(a) and 3(b), we obtained results similar overall to those in Figure 2, but they still differ in some ways. In Figure 3(a), our proposed method performed the best on almost all networks. For $0.5 < \mu < 0.7$ in Figure 2, the NMI of the results extracted by our proposed method is lower than those of WalkTrap and IsoFdp; however, in Figure 3, the proposed method performed better than IsoFdp for $\mu > 0.5$. These results suggest that the performances of the comparison algorithms are not stable across different networks, whereas our proposed method can steadily extract high-quality community structures from networks with different characteristics. This is also evident from the fact that all the curves of the proposed method in these figures decline more slowly than the others. Moreover, by comparing the proposal's own curves in these figures, we can conclude that our proposed method tends to perform better on larger networks with small communities; therefore, it overcomes the problem of resolution limit to some extent.

Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes, plotted as NMI against $\mu$ for FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposed method (NSA). (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities.

4.4. Real-World Networks. We also carried out experiments on 13 real-world networks to further test the effectiveness and efficiency of our proposed method. As mentioned in Section 4.1, these networks fall into two categories: those whose ground-truth community structure is known a priori and those without publicly acknowledged ground truth.

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method, NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structure is already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.

The Karate Club Network. This network depicts the friendships among members of a karate club; it contains 34 nodes and 78 edges. It was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During Zachary's period of study, the club split into two factions because of a dispute between the administrator and the instructor. Corresponding to the two factions, the partition into two communities is usually taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities in this network. The detected result seems to deviate from the ground truth in some ways, but it coincides with the conclusion drawn from the experiments on synthetic networks that our proposed method tends to find small communities, which overcomes the problem of resolution limit. Moreover, from the perspective of the measure metrics, the modularity of the detected result is the largest among those of all the algorithms compared. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with modularity as large as possible: if its modularity is not the largest, it is the second largest, with a small gap to the largest. These findings are also borne out in the following networks.

Table 3: The experimental results on networks with ground-truth community structures. The largest value of each metric on each network is marked with an asterisk.

Network     Metric  FastQ   WalkTrap  LPA    Attractor  IsoFdp  NSA
Karate      Q       0.381   0.353     0.355  0.371      0.371   0.402*
            NMI     0.693   0.504     0.62   0.924      1.00*   0.699
Dolphin     Q       0.492   0.489     0.464  0.45       0.505   0.513*
            NMI     0.719   0.632     0.719  0.69       0.744   0.887*
Risk map    Q       0.625*  0.624     0.59   0.598      0.519   0.624
            NMI     0.894*  0.848     0.821  0.839      0.714   0.848
Scientists  Q       0.749*  0.733     0.64   0.694      0.668   0.744
            NMI     0.867   0.818     0.743  0.835      0.823   0.878*

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method, NSA.

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the co-occurrences of observed pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

In Figure 5, our proposed method detected communities in this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method on this network are the largest among those of all the algorithms compared, which means that the community structure identified by the proposed method is better than those of the comparison algorithms.

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories on 6 continents are involved. Accordingly, 42 nodes and 83 edges connecting adjacent countries or territories are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method, NSA.

Table 4: The experimental results of modularity on networks without ground-truth community structure. The largest value on each network is marked with an asterisk; a dash means that no result was obtained.

Network     FastQ   WalkTrap  LPA    Attractor  IsoFdp  NSA
Lesmis      0.499   0.519     0.515  0.498      0.491   0.54*
Polbooks    0.502   0.507     0.508  0.501      0.518   0.524*
ColiNeta    0.779*  0.746     0.693  0.718      -       0.761
Email       0.499   0.531     0.379  0.464      0.531   0.544*
NetScience  0.955   0.956     0.896  0.937      -       0.957*
YeastL      0.573   0.529     0.372  0.511      -       0.574*
PGP         0.85    0.789     0.765  0.768      0.726   0.867*
DBLP        0.735   -         0.652  0.637      -       0.782*
Amazon      0.869   -         0.743  0.741      -       0.898*

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are classified into communities different from their ground-truth ones in the detected result. However, nodes '12', '16', '26', '33', and '34' are special in this network: the outer edges associated with them are no fewer than, or even more than, those within the communities to which these nodes belong. Therefore, if we ignore what these nodes actually represent and consider the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more of the edges associated with these three nodes ('26', '33', and '34') lie within their communities than in the ground truth, so these nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, the values of both measure metrics of our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.
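The topological argument above compares, for a given node, the number of edges that stay inside its community with the number that leave it. A minimal sketch of that count is given below; the function name `inner_outer_edges` is hypothetical, and the small graph is invented toy data, not the actual Risk map adjacency.

```python
def inner_outer_edges(adj, membership, node):
    """Return (inner, outer): how many edges of `node` lead to nodes in
    its own community and how many lead to other communities."""
    inner = sum(1 for v in adj[node] if membership[v] == membership[node])
    return inner, len(adj[node]) - inner


# Invented toy data: node "x" has one neighbor in community 0 and two
# neighbors in community 1.
adj = {"x": {"a", "b", "c"}, "a": {"x"}, "b": {"x", "c"}, "c": {"x", "b"}}
ground_truth = {"x": 0, "a": 0, "b": 1, "c": 1}
detected = {"x": 1, "a": 0, "b": 1, "c": 1}

# Under the toy "ground truth", x has more outer edges than inner ones;
# after moving x to community 1, the balance is reversed.
print(inner_outer_edges(adj, ground_truth, "x"))
print(inner_outer_edges(adj, detected, "x"))
```

A node whose outer count is at least its inner count under one partition, and whose inner count grows when it is moved, is the kind of case the paragraph above describes.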

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthorship relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists, and an edge indicates that two scientists have collaborated on at least one paper. There are 118 nodes and 197 edges in total. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities in this network; that is, two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially in the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them and taking them as independent communities is also reasonable. From the perspective of the measure metrics, the NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the closest to the ground-truth community structure; its modularity is not the largest, but it is second only to that of FastQ. These results also show that our proposed method can extract high-quality community structures from networks.

Networks without Ground-Truth Community Structure Thiscategory contains the last 9 real-world networks listed inTable 2 For the experiments carried out on this category ofnetworks we evaluate the quality of the extracted communitystructures using the modularity only due to the absence ofthe ground-truth community structures For the proposedmethod and comparison algorithms the obtained values ofmodularity have been recorded in Table 4 To illustrate them

Complexity 13

1814 154

172

1 3

5

79

10

12

16 26 386

2437

823

49341332

35

2027

2241

48 46 72

7721

31 33

39

1130

404745

71 76

96

19

98

2528 64

4375

946670

101 97

99

97

4442 100

29

63

7495

6165

93

92

91

60 6762

7378 90

5868

88 10680

8911250 56

82 8769 8186

5251

59

57

54

53

85105 111

104 11783

10255 36

84 103110

118109

108 113116

107 114 115

(a)

1814 154

172

1 3

5

79

10

12

16 26 386

2437

823

49341332

35

2027

2241 48 46 72

7721

31 33

39

1130

404745

71 7696

19

98

2528 64

4375

9466 70 101 97

99

97

4442 100

29

63

7495

6165

93

92

91

60 6762

7378 90

5868

88106

80

8911250 56

82 8769 8186

5251

59

57

54

53

85105 111

104 11783

10255

3684 103

110118

109108 113

116107 114 115

(b)

Figure 7 The collaboration network of scientists working at the Santa Fe Institute (a) The ground-truth community structure (b) Thecommunity structure detected by our proposed NSA algorithm

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method NSA. (The horizontal axis lists the networks Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon; the vertical axis is the modularity Q; the legend lists FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA).)

intuitively, we also plotted them in a bar chart, which is presented in Figure 8.
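To make the modularity comparison concrete, the following is a minimal sketch of how Q can be computed for a given partition. It assumes the standard Newman-Girvan definition for an undirected, unweighted graph without self-loops, Q = Σ_c [ L_c/m − (d_c/(2m))² ], where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. The function name and the edge-list representation are illustrative choices, not part of the original method.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               communities: List[Set[Hashable]]) -> float:
    """Newman-Girvan modularity of a partition (undirected, unweighted, no self-loops)."""
    edge_list = list(edges)
    m = len(edge_list)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    # Map every node to the index of the community that contains it.
    label = {node: idx for idx, members in enumerate(communities) for node in members}
    inner = [0] * len(communities)       # L_c: edges with both ends in community c
    degree_sum = [0] * len(communities)  # d_c: sum of the degrees of nodes in c
    for u, v in edge_list:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            inner[label[u]] += 1
    return sum(inner[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Toy example: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(toy_edges, [{0, 1, 2}, {3, 4, 5}])   # 5/14, about 0.357
q_whole = modularity(toy_edges, [{0, 1, 2, 3, 4, 5}])     # 0.0
```

Splitting the toy graph into its two triangles scores higher than keeping all nodes in one community, which is the behavior the modularity comparison in Table 4 relies on.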

On these networks, our proposed method achieved the largest modularity on 8 of the 9. On the remaining network, ColiNeta, it still obtained the second largest modularity. FastQ is based on a modularity optimization strategy, yet it acquired the largest modularity on ColiNeta only. WalkTrap is based on random walks, so its time complexity is relatively high; it could not produce effective results on the Amazon and DBLP networks because of their large scale. LPA and Attractor can extract community structures from all of these networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks, so it cannot run on ColiNeta, NetScience, and YeastL, which are disconnected; it could not detect community structures from Amazon and DBLP effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications, and that it outperforms the comparison algorithms significantly.

Figure 9: The setting of parameter $\delta$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network. (In each panel, the horizontal axis covers 0.00 to 0.30 and the vertical axis is the modularity $Q$.)

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, in the merge procedure we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.
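The thresholding rule can be sketched in a few lines. This is only an illustration: the sparsity and scale values are taken as precomputed inputs because their definitions are given earlier in the paper, and the function name, the dictionary layout, and the sample numbers are hypothetical.

```python
from typing import Dict, Hashable, List


def communities_to_merge(alpha: Dict[Hashable, float],
                         beta: Dict[Hashable, float],
                         delta: float = 0.1) -> List[Hashable]:
    """Return the ids of communities whose metric gamma_i = alpha_i * beta_i is below delta."""
    return [cid for cid in alpha if alpha[cid] * beta[cid] < delta]


# Hypothetical sparsity (alpha) and scale (beta) values for three preliminary communities.
alpha = {"C1": 0.9, "C2": 0.5, "C3": 0.2}
beta = {"C1": 0.4, "C2": 0.1, "C3": 0.3}
flagged = communities_to_merge(alpha, beta, delta=0.1)  # gamma: 0.36, 0.05, 0.06
```

With these sample values only C2 and C3 fall below the threshold, so they would be the candidates handed to the merge procedure.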

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conducted a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion $\delta \in [0, 0.3]$, because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method attains the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 within the range $[0.05, 0.2]$, according to the size and the sparsity of the networks involved in real-world applications.
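The parameter sweep described in this section can be sketched as a simple grid search. In the sketch below, `quality_at` is a hypothetical callable that would run the detection method with a given threshold and return the resulting modularity; a toy quality curve stands in for it here so the example is self-contained. Generating the grid from integer step counts avoids accumulating floating-point error over the 200 increments.

```python
from typing import Callable, Tuple


def sweep_delta(quality_at: Callable[[float], float],
                step: float = 0.005,
                upper: float = 1.0) -> Tuple[float, float]:
    """Evaluate quality_at on delta = 0, step, 2*step, ..., upper and return the best pair."""
    n_steps = round(upper / step)
    best_delta, best_q = 0.0, float("-inf")
    for k in range(n_steps + 1):
        delta = round(k * step, 10)  # k * step, rounded to suppress float noise
        q = quality_at(delta)
        if q > best_q:               # strict '>' keeps the smallest delta on ties
            best_delta, best_q = delta, q
    return best_delta, best_q


# Toy stand-in for "run the method and measure Q": a curve peaking at delta = 0.13.
def toy_quality(delta: float) -> float:
    return 0.5 - (delta - 0.13) ** 2


best_delta, best_q = sweep_delta(toy_quality)
```

In practice `toy_quality` would be replaced by a function that runs both phases of the method on a network and returns the modularity of the result.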

6. Conclusion

In this paper, we presented a novel method to detect communities from networks. It is a local method based on node similarity, and it overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and either taking it as the exemplar of a new community or inserting it into an existing community, depending on the community assignment of its most similar neighbor: if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.
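The first phase summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity measure is defined earlier in the paper, so a Jaccard index over closed neighborhoods is used here purely as a stand-in, and the tie-breaking by node id and the handling of isolated nodes are also assumptions.

```python
from typing import Callable, Dict, Set

Graph = Dict[int, Set[int]]  # adjacency sets of an undirected graph


def jaccard(adj: Graph, u: int, v: int) -> float:
    """Stand-in similarity (an assumption): Jaccard index of closed neighborhoods."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(
        adj: Graph,
        similarity: Callable[[Graph, int, int], float] = jaccard) -> Dict[int, int]:
    """Assign every node to a preliminary community id, following the phase-one rule."""
    community: Dict[int, int] = {}
    next_id = 0
    # Visit nodes by decreasing degree; ties are broken by node id (an assumption).
    for node in sorted(adj, key=lambda n: (-len(adj[n]), n)):
        if node in community:
            continue  # already placed as some earlier node's most similar neighbor
        if not adj[node]:
            # Isolated node: give it its own community (an assumption).
            community[node] = next_id
            next_id += 1
            continue
        # Most similar neighbor; on ties the neighbor with the smallest id wins.
        best = max(sorted(adj[node]), key=lambda nb: similarity(adj, node, nb))
        if best not in community:
            # Neighbor is unassigned: open a new community for the pair.
            community[best] = next_id
            next_id += 1
        # Otherwise (or afterwards) the node joins its neighbor's community.
        community[node] = community[best]
    return community


# Toy example: two triangles joined by the edge (2, 3).
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = preliminary_communities(toy)
```

On the toy graph, each triangle ends up in its own preliminary community; the second phase would then decide whether any of the resulting communities are too small or too sparse and should be merged.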

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks, and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications, and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the resolution limit problem to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which are cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, “Network analysis: The structure of the web,” Science, vol. 294, no. 5548, pp. 1849-1850, 2001.

[2] P. Chen and S. Redner, “Community structure of the physical review citation network,” Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, “Community structure in jazz,” Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., “Community detection using spectral clustering on sparse geosocial data,” SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, “Community detection in networks: a user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, “The emergent properties of a dolphin social network,” in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, “Models of social networks based on social distance attachment,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, “Community detection in complex networks using structural similarity,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, “A smart local moving algorithm for large-scale modularity-based community detection,” The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, “Detecting network communities by propagating labels under constraints,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, “A semi-synchronous label propagation algorithm with constraints for community detection in complex networks,” Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Este, H. P. Kriegel, S. Jorg, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, “Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs,” in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, “Community detection in complex networks using density-based clustering algorithm and manifold learning,” Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, “Locating structural centers: A density-based clustering method for community detection,” PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in International symposium on computer and information sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, “Personalized PageRank clustering: a graph clustering algorithm based on random walks,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, “A seed-expanding method based on random walks for community detection in networks with ambiguous community structures,” Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, “Community detection based on distance dynamics,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, “A fast community detection method in bipartite networks by distance dynamics,” Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, “The laplacian spectrum of neural networks,” Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, “ComSim: a bipartite community detection algorithm using cycle and node's similarity,” in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, “A community detection method based on local similarity and degree clustering information,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, “Maximal neighbor similarity reveals real communities in networks,” Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, “Robust data clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.



Figure 3: Comparison of different community detection algorithms on LFR benchmark networks containing 5000 nodes. (a) The results extracted from the larger networks with small-sized communities. (b) The results revealed from the larger networks with big-sized communities. (The vertical axis is NMI; the legend lists FastQ, WalkTrap, LPA, Attractor, IsoFdp, and the proposal (NSA).)

Figure 4: The karate club network. (a) The ground-truth community structure. (b) The community structure detected by our proposed method NSA. (The nodes in different communities are plotted in different colors and shapes; this illustration style is also applied in the subsequent figures.)

the ground-truth community structure known a priori, and the other ones without publicly acknowledged ground truth.

Networks with Ground-Truth Community Structure. This category includes the first 4 networks listed in Table 2. Since their ground-truth community structures are already known, we measure the quality of the community structures identified by the proposed method and the comparison algorithms in terms of both NMI and modularity. The values of the two metrics obtained by the proposed method and the comparison algorithms are recorded in Table 3. These networks are relatively small, which makes it easy to visualize the detected results. Below, we analyze the results extracted by the proposed method from these networks individually.
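For readers who want to reproduce the NMI comparison against a ground truth, the following is a minimal sketch. It assumes the widely used normalization NMI = 2 I(A; B) / (H(A) + H(B)), where A and B are the two partitions given as per-node labels; the exact formula used in the experiments is the one defined earlier in the paper, and the function name and input layout here are illustrative.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence


def nmi(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Normalized mutual information between two partitions of the same node list."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label sequences must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information I(A; B) computed from the confusion counts.
    mutual = sum(nij / n * log(nij * n / (count_a[a] * count_b[b]))
                 for (a, b), nij in joint.items())
    # Entropies H(A) and H(B) of the two partitions.
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual / (h_a + h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

The score is 1 when the detected partition matches the ground truth up to a relabeling of communities and 0 when the two partitions are statistically independent.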

The Karate Club Network. This network depicts the friendships among the members of a karate club; it contains 34 nodes and 78 edges. It was compiled by Wayne W. Zachary, who observed the karate club for 3 years. During the period of Zachary's study, the club split into two factions because of a dispute between the administrator and the instructor. Corresponding to the two factions, a partition into two communities is generally taken as the ground truth, which is shown in Figure 4(a). The result detected by our proposed method is presented in Figure 4(b).

From Figure 4, we can see that our proposed method detected 3 rather than 2 communities from the network. It seems that the detected result deviates from the ground truth in some ways, but this result coincides with the conclusion

Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method NSA.

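Table 3: The experimental results on networks with ground-truth community structures. The largest values of the two measure metrics are typed in bold.

| Network | Metric | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|---|
| Karate | Q | 0.381 | 0.353 | 0.355 | 0.371 | 0.371 | **0.402** |
| Karate | NMI | 0.693 | 0.504 | 0.62 | 0.924 | **1.00** | 0.699 |
| Dolphin | Q | 0.492 | 0.489 | 0.464 | 0.45 | 0.505 | **0.513** |
| Dolphin | NMI | 0.719 | 0.632 | 0.719 | 0.69 | 0.744 | **0.887** |
| Risk map | Q | **0.625** | 0.624 | 0.59 | 0.598 | 0.519 | 0.624 |
| Risk map | NMI | **0.894** | 0.848 | 0.821 | 0.839 | 0.714 | 0.848 |
| Scientists | Q | **0.749** | 0.733 | 0.64 | 0.694 | 0.668 | 0.744 |
| Scientists | NMI | 0.867 | 0.818 | 0.743 | 0.835 | 0.823 | **0.878** |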

found in the experiments on synthetic networks: our proposed method tends to find small communities in networks, which helps overcome the resolution limit problem. Moreover, from the perspective of the evaluation metrics, the modularity of the detected result (0.402) is the largest among all the compared algorithms. Although our proposed method is not based on modularity optimization, it tends to obtain a community structure with as large a modularity as possible; where its modularity is not the largest, it is the second largest, with a small gap to the largest. These findings are also borne out on the following networks.

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the observed co-occurrences of pairs of dolphins, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) shows the community structure uncovered by our proposed method.

As Figure 5 shows, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the vast majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI (0.887) and the modularity (0.513) of the result detected by the proposed method are the largest among all the compared algorithms, which means that the community structure identified by the proposed method is better than those of the comparison algorithms.

Risk Map Network. This network is the world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories on 6 continents are involved. Accordingly, the 42 nodes, together with 83 edges connecting adjacent countries or territories, are organized into 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result with the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are misclassified into the wrong communities in the detected result. However, nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer than, or


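Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest value of modularity on each network is typed in bold; a dash means that the algorithm produced no result on that network.

| Network | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|
| Lesmis | 0.499 | 0.519 | 0.515 | 0.498 | 0.491 | **0.54** |
| Polbooks | 0.502 | 0.507 | 0.508 | 0.501 | 0.518 | **0.524** |
| ColiNeta | **0.779** | 0.746 | 0.693 | 0.718 | - | 0.761 |
| Email | 0.499 | 0.531 | 0.379 | 0.464 | 0.531 | **0.544** |
| NetScience | 0.955 | 0.956 | 0.896 | 0.937 | - | **0.957** |
| YeastL | 0.573 | 0.529 | 0.372 | 0.511 | - | **0.574** |
| PGP | 0.85 | 0.789 | 0.765 | 0.768 | 0.726 | **0.867** |
| DBLP | 0.735 | - | 0.652 | 0.637 | - | **0.782** |
| Amazon | 0.869 | - | 0.743 | 0.741 | - | **0.898** |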

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method NSA.

even more than, those within the communities to which these nodes belong. Therefore, if we ignore what these nodes actually represent and consider only the topology, the community structure extracted by our proposed method is more rational than the ground truth: more of the edges associated with the three misclassified nodes lie within their community than in the ground truth, so these three nodes are connected more tightly to the nodes in the same community in Figure 6(b). Quantitatively, the values of both metrics for our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthorship relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists, and an edge indicates that two scientists have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities fromthis network two additional communities are detected inFigure 7(b) These two communities are relatively indepen-dent components especially for the community containingnodes lsquo1rsquo there are much more inner edges than outer edgesThat is to say nodes in these two communities are connectedmore tightly to one another than with the remainder of thenetwork Therefore isolating them from the network andtaking themas independent communities are also reasonableConsidering from the perspective of measure metrics thevalue of NMI obtained by the proposedmethod is the largestwhich suggests that the result detected by our proposal is theonemost approaches the ground-truth community structurethe modularity value of the proposed method is not thelargest though it is also second only to that of Fast119876 Theseresults also testify that our proposed method can extracthigh-quality community structure from networks

Networks without Ground-Truth Community Structure Thiscategory contains the last 9 real-world networks listed inTable 2 For the experiments carried out on this category ofnetworks we evaluate the quality of the extracted communitystructures using the modularity only due to the absence ofthe ground-truth community structures For the proposedmethod and comparison algorithms the obtained values ofmodularity have been recorded in Table 4 To illustrate them

Complexity 13

1814 154

172

1 3

5

79

10

12

16 26 386

2437

823

49341332

35

2027

2241

48 46 72

7721

31 33

39

1130

404745

71 76

96

19

98

2528 64

4375

946670

101 97

99

97

4442 100

29

63

7495

6165

93

92

91

60 6762

7378 90

5868

88 10680

8911250 56

82 8769 8186

5251

59

57

54

53

85105 111

104 11783

10255 36

84 103110

118109

108 113116

107 114 115

(a)

1814 154

172

1 3

5

79

10

12

16 26 386

2437

823

49341332

35

2027

2241 48 46 72

7721

31 33

39

1130

404745

71 7696

19

98

2528 64

4375

9466 70 101 97

99

97

4442 100

29

63

7495

6165

93

92

91

60 6762

7378 90

5868

88106

80

8911250 56

82 8769 8186

5251

59

57

54

53

85105 111

104 11783

10255

3684 103

110118

109108 113

116107 114 115

(b)

Figure 7 The collaboration network of scientists working at the Santa Fe Institute (a) The ground-truth community structure (b) Thecommunity structure detected by our proposed NSA algorithm

Lesmis DBLPPGPYeastLNetScienceEmailColiNetaPolbooks Amazon00

01

02

03

04

05

06

07

08

09

10Q

Networks

FastQWalktrapLPAAttractorIsoFdpproposal(NSA)

Mod

ularity

(Q)

Figure 8 The bar chart of the modularity obtained by comparison algorithms and the proposed method NSA

intuitively we also plotted them in a bar chart which ispresented in Figure 8

On these networks our proposed method achieved thelargest modularity from 8 of them On the only other onenetwork ColiNeta it still obtained the second largest valueof modularity For Fast119876 it is based on the modularityoptimization strategy though it acquired the largest value ofmodularity on network ColiNeta only For WalkTrap it is anapproach based on random walk then its time complexityis relatively high It cannot manage to get effective resultsfrom networks Amazon and DBLP due to the large scaleof these two networks For LPA and Attractor they can

extract community structures from all those networks butthe quality of the detected results is not satisfactory ForIsoFdp it can only be applied to connected networks andcannot run on networks ColiNeta NetScience and YeastLas these three networks are disconnected It cannot detectthe community structure from networks Amazon and DBLPeffectively either because of their large scale These compari-son results manifest that our proposed method can steadilyeffectively and efficiently provide uswith promising solutionsfor the problem of community detection in networks of wide-range applications and outperform comparison algorithmssignificantly

14 Complexity

Figure 9: The setting of parameter $\delta$. Panels: (a) the karate club network, (b) the dolphin social network, (c) the risk map network, (d) the scientists collaboration network. Horizontal axis: $\delta$, from 0.00 to 0.30; vertical axis: modularity $Q$.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As aforementioned, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.
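The threshold test itself is a one-line filter. The sketch below assumes that the community sparsity $\alpha_i$ and community scale $\beta_i$ have already been computed according to their definitions earlier in the paper; the function name and the numeric values are made up purely for illustration.

```python
def communities_to_merge(alpha, beta, delta=0.1):
    """Return the ids of communities whose metric gamma_i = alpha_i * beta_i is below delta.

    alpha, beta: dicts mapping community id -> precomputed sparsity / scale value.
    """
    return [i for i in sorted(alpha) if alpha[i] * beta[i] < delta]


# Hypothetical values: gamma = 0.45, 0.08, 0.18 for communities 0, 1, 2.
alpha = {0: 0.9, 1: 0.4, 2: 0.2}
beta = {0: 0.5, 1: 0.2, 2: 0.9}
flagged = communities_to_merge(alpha, beta, delta=0.1)
```

With these values only community 1 falls below $\delta = 0.1$ and would be handed to the merge procedure.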

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conducted a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion $\delta \in [0, 0.3]$, because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method gets the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 in the range $[0.05, 0.2]$, according to the size and the sparsity of the networks involved in real-world applications.
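The parameter study described in this section amounts to a grid search over $\delta$. A minimal sketch follows, assuming a caller-supplied function that runs the method for a given $\delta$ and returns the resulting modularity; here that function is replaced by a toy stand-in, since the method itself is not reproduced. All names are hypothetical.

```python
def delta_grid(start=0.0, stop=1.0, step=0.005):
    """Evenly spaced delta values; built from integer steps to avoid float drift."""
    n = round((stop - start) / step)
    # Rounding to 3 decimals suits the 0.005 step used in the text.
    return [round(start + k * step, 3) for k in range(n + 1)]


def best_delta(grid, quality):
    """Return the first delta in the grid that maximizes quality(delta)."""
    return max(grid, key=quality)


def toy_quality(delta):
    """Stand-in for 'run the method with this delta and return Q' (illustrative only)."""
    return -(delta - 0.1) ** 2


grid = delta_grid()
chosen = best_delta(grid, toy_quality)
```

In an actual experiment, `toy_quality` would be replaced by a function that runs the two-phase method on one network and evaluates the modularity of the result.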

6. Conclusion

In this paper, we presented a novel method to detect communities from networks. It is a local method based on node similarity, and it overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, on the basis of its most similar neighbor's community assignment, either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs; i.e., if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has been assigned to a certain community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and sparsity of communities into account, to determine which communities should be merged.
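The first phase summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: the similarity measure is not defined in this section, so Jaccard similarity on closed neighborhoods is used as a stand-in; node degrees are taken from the full network and computed once; ties are broken by node id. The merge phase is not included.

```python
def jaccard(adj, u, v):
    """Assumed similarity: Jaccard index of the closed neighborhoods of u and v."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(adj, similarity=jaccard):
    """Assign every node to a preliminary community (phase one only).

    adj: dict mapping node -> set of neighbor nodes (undirected graph).
    Returns a dict mapping node -> community id.
    """
    label = {}
    next_id = 0
    # Visit nodes from largest to smallest degree; skip those already assigned.
    for u in sorted(adj, key=lambda n: (-len(adj[n]), n)):
        if u in label:
            continue
        if not adj[u]:  # isolated node: it forms its own community
            label[u] = next_id
            next_id += 1
            continue
        best = max(sorted(adj[u]), key=lambda v: similarity(adj, u, v))
        if best in label:
            label[u] = label[best]        # join the neighbor's community
        else:
            label[u] = label[best] = next_id  # start a new community for both
            next_id += 1
    return label


# Toy graph (illustrative only): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
prelim = preliminary_communities(adj)
```

On the toy graph the two triangles come out as the two preliminary communities; any other similarity function with the same signature can be passed in place of `jaccard`.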

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications, and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID: 61602225).

References

[1] J. Kleinberg and S. Lawrence, "Network analysis: The structure of the web," Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, "Community structure of the physical review citation network," Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Figure 5: The dolphin social network. (a) The ground-truth community structure. (b) The community structure identified by our proposed method NSA.

Table 3: The experimental results on networks with ground-truth community structures. The largest value of each measure metric on each network is marked with an asterisk.

Network     Metric  FastQ   WalkTrap  LPA    Attractor  IsoFdp  NSA
Karate      Q       0.381   0.353     0.355  0.371      0.371   0.402*
            NMI     0.693   0.504     0.62   0.924      1.00*   0.699
Dolphin     Q       0.492   0.489     0.464  0.45       0.505   0.513*
            NMI     0.719   0.632     0.719  0.69       0.744   0.887*
Risk map    Q       0.625*  0.624     0.59   0.598      0.519   0.624
            NMI     0.894*  0.848     0.821  0.839      0.714   0.848
Scientists  Q       0.749*  0.733     0.64   0.694      0.668   0.744
            NMI     0.867   0.818     0.743  0.835      0.823   0.878*

found in the experiments on synthetic networks that our proposed method tends to find small communities in networks, which helps to overcome the problem of resolution limit. Moreover, from the perspective of the measure metrics, the modularity corresponding to the detected result is the largest among those of the comparison algorithms. Although our proposed method is not based on the strategy of optimizing modularity, it tends to acquire a community structure with as large a modularity as possible; if it is not the largest, it is the second largest, with a small gap to the largest. These findings are also borne out on the following networks.

Lusseau's Dolphin Social Network. This network describes the interactions of a group of dolphins living in Doubtful Sound, New Zealand. It consists of 62 nodes and 159 edges, which represent individual dolphins and the co-occurrences of pairs of dolphins being observed, respectively. This network is generally partitioned into 4 groups as the ground-truth community structure, which is exhibited in Figure 5(a). Figure 5(b) is the community structure uncovered by our proposed method.

As shown in Figure 5, our proposed method detected communities from this network with a high degree of success: it also identified 4 communities, the great majority of nodes are classified into the correct communities, and the result closely approaches the ground-truth community structure. Quantitatively, both the NMI and the modularity of the result detected by the proposed method on this network are the largest among those of the comparison algorithms, which means that the community structure identified by the proposed method is better than those of the comparison algorithms.
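To make the NMI comparison concrete, the following sketch computes the normalized mutual information between a detected partition and a ground-truth partition. It assumes the commonly used normalization $\mathrm{NMI} = 2I(X;Y) / (H(X) + H(Y))$; the exact variant used in the experiments is not restated in this section, so this choice, like the function name and the toy label lists, is an assumption.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same nodes.

    labels_a, labels_b: equal-length sequences; position k holds node k's community.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # I(X;Y) = sum p(a,b) * log( p(a,b) / (p(a) p(b)) )
    mutual = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:  # both labelings put every node in a single community
        return 1.0
    return 2 * mutual / (h_a + h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # same split, different label names
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # splits that share no information
```

NMI depends only on how nodes are grouped, not on the label names, so the first call gives 1 and the second gives 0.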

Risk Map Network. This network is a world political map used in the popular game Risk (https://en.wikipedia.org/wiki/Risk_(game)), in which 42 countries or territories of 6 continents are involved. Therefore, 42 nodes and 83 edges connecting adjacent countries or territories are organized in 6 communities as the ground truth, which is illustrated in Figure 6(a). Feeding this network into the proposed method, we obtained the community structure shown in Figure 6(b).

Comparing the detected result to the ground-truth community structure, the community containing nodes '18' and '23' in the ground truth is split into two small communities in Figure 6(b), owing to the tendency of the proposed method to find small communities. Besides this, nodes '26', '33', and '34' are misclassified into the wrong communities in the detected result. But nodes '12', '16', '26', '33', and '34' are special ones in this network: the outer edges associated with them are no fewer, or even more, than those within the communities to which these nodes belong. Therefore, if we ignore the meaning of the actual representation of these nodes and consider the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three misclassified nodes are located within the community than in the ground truth, so these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, both values of the two measure metrics of our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method NSA.

Table 4: The experimental results of modularity on networks without ground-truth community structures. The largest value on each network is marked with an asterisk; a dash means that the algorithm produced no result on that network.

Network     FastQ   WalkTrap  LPA    Attractor  IsoFdp  NSA
Lesmis      0.499   0.519     0.515  0.498      0.491   0.54*
Polbooks    0.502   0.507     0.508  0.501      0.518   0.524*
ColiNeta    0.779*  0.746     0.693  0.718      -       0.761
Email       0.499   0.531     0.379  0.464      0.531   0.544*
NetScience  0.955   0.956     0.896  0.937      -       0.957*
YeastL      0.573   0.529     0.372  0.511      -       0.574*
PGP         0.85    0.789     0.765  0.768      0.726   0.867*
DBLP        0.735   -         0.652  0.637      -       0.782*
Amazon      0.869   -         0.743  0.741      -       0.898*

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationships among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge connects two scientists who have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, which is presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities in this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node '1', there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. From the perspective of the measure metrics, the value of NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one that most closely approaches the ground-truth community structure; the modularity value of the proposed method is not the largest, though it is second only to that of FastQ. These results also show that our proposed method can extract high-quality community structures from networks.
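The inner-versus-outer edge argument used here (and for individual nodes of the Risk map network) is easy to check mechanically. The sketch below counts, for a given set of nodes, the edges with both endpoints inside the set and the edges leaving it. The function name and the toy graph are illustrative and do not correspond to the networks in Figures 6 and 7.

```python
def community_edge_counts(adj, members):
    """Return (inner, outer) edge counts for a node set in an undirected graph.

    adj: dict mapping node -> set of neighbor nodes.
    members: iterable of nodes forming the community.
    """
    members = set(members)
    inner_endpoints = 0
    outer = 0
    for u in members:
        for v in adj[u]:
            if v in members:
                inner_endpoints += 1  # each inner edge is seen from both ends
            else:
                outer += 1
    return inner_endpoints // 2, outer


# Toy graph (illustrative only): two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
counts = community_edge_counts(adj, {0, 1, 2})
```

Here the triangle {0, 1, 2} has 3 inner edges and 1 outer edge, the kind of imbalance that the text takes as evidence of a relatively independent community.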

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, due to the absence of ground-truth community structures. For the proposed method and the comparison algorithms, the obtained values of modularity are recorded in Table 4. To illustrate them intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Figure 8: The bar chart of the modularity obtained by the comparison algorithms and the proposed method NSA. Horizontal axis: networks (Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, Amazon); vertical axis: modularity (Q), from 0.0 to 1.0; legend: FastQ, Walktrap, LPA, Attractor, IsoFdp, proposal (NSA).

On these networks our proposed method achieved thelargest modularity from 8 of them On the only other onenetwork ColiNeta it still obtained the second largest valueof modularity For Fast119876 it is based on the modularityoptimization strategy though it acquired the largest value ofmodularity on network ColiNeta only For WalkTrap it is anapproach based on random walk then its time complexityis relatively high It cannot manage to get effective resultsfrom networks Amazon and DBLP due to the large scaleof these two networks For LPA and Attractor they can

extract community structures from all those networks butthe quality of the detected results is not satisfactory ForIsoFdp it can only be applied to connected networks andcannot run on networks ColiNeta NetScience and YeastLas these three networks are disconnected It cannot detectthe community structure from networks Amazon and DBLPeffectively either because of their large scale These compari-son results manifest that our proposed method can steadilyeffectively and efficiently provide uswith promising solutionsfor the problem of community detection in networks of wide-range applications and outperform comparison algorithmssignificantly

14 Complexity

000 005 010 015 020 025 030

Karate

020

025

030

035

040

045

050

Q

(a) The karate club network (b) The dolphin social network

000 005 010 015 020 025 030

Riskmap

040

045

050

055

060

065

070

Q

(c) The risk map network

000 005 010 015 020 025 030

Santafe

040

045

050

055

060

065

070

075

080

Q

(d) The scientists collaboration network

Figure 9 The setting of parameter 120575

5 Parameter Setting

In the second phase of the proposed method we introducea threshold 120575 for the community metric to identify thepreliminary communities needed to be merged As afore-mentioned we calculate the community metric 120574119894 = 120572119894 times 120573119894for every preliminary community 119862119894 in the merge procedureif the value of 120574119894 is below the threshold 120575 the correspondingcommunity 119862119894 is identified as the one needed to be merged

Therefore 120575 works as a parameter in our proposedmethod whose setting can influence the quality of theresulting community structure Considering qualitativity thelarger or the sparser the network is the threshold 120575 shouldbe smaller in accordance with the definitions of communitysparsity (120572119894) community scale (120573119894) and community metric(120574119894) To determine the optimal value of 120575 we conduct a groupof experiments to explore the relationship between the valueof 120575 and the quality of the resulting community structure onthe first four networks listed in Table 2 namely the karateclub network the dolphin social network the map of gameRisk and the scientists collaboration network respectivelyThe quality of the resulting community structure is measuredin term of modularity 119876 We vary the value of 120575 from 0 to 10by increasing 0005 each time for each value of 120575 we run ourproposed method on these networks and observe the changeof modularity along with the varies of 120575

The observed results are as illustrated in Figure 9 inwhich we plotted only the proportion of 120575 isin [0 03] because

the largest modularities are obtained during 120575 ⩽ 03 on all ofthose four networks Our proposed method gets the largestmodularity when 120575 = 013 on the dolphin social network and120575 = 01 on the other three networks Therefore we adopt thecorresponding value for those four networks and empiricallyset 120575 = 01 for other networks to perform the experiments InFigure 9 the largest modularity is obtained around the valueof 120575 = 01 and the interval of [005 02] covers the optimalvalue of 120575Therefore we empirically suggest that120575 be adjustedadaptively around 01 in the range of [005 02] according tothe size and the sparsity of networks involved in real-worldapplications

6 Conclusion

In this paper we presented a novel method to detectcommunities from networks It is a local method basedon node similarity and overcomes the deficiency of hightime consumption of global methods First we constructthe preliminary community structure by repeatedly selectingthe node with the largest degree and either taking it asthe exemplar of a new community or inserting it into thecommunity to which its most similar neighbor belongs onthe basis of its most similar neighborrsquos community assign-ment ie if its most similar neighbor has not been assignedto any community yet we create a new community for itand its most similar neighbor if its most similar neighborhas been assigned to a certain community we insert it into

Complexity 15

that community as well At the end of this process weobtain a series of preliminary communities However someof them might be too small or too sparse leading to a low-quality result Therefore we merge some of the preliminarycommunities to acquire the final community structure To doso we also proposed some indexes which take both the sizeand sparsity of communities into account to determine whichcommunities should be merged

To test the performance of the proposed method wehave performed extensive experiments on four groups ofsynthetic networks and 13 real-world networks and comparedthe detected community structures with the results extractedby comparison algorithms in terms of NMI and modular-ity the comparison results demonstrate that our proposedmethod can extract high-quality community structures fromnetworks abstracted from various applications and nodes inthe extracted communities are connected more tightly Theproposed method overcomes the problem of resolution limitto some extent and outperforms the competitors successfully

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, “Network analysis: The structure of the web,” Science, vol. 294, no. 5548, pp. 1849–1850, 2001.

[2] P. Chen and S. Redner, “Community structure of the physical review citation network,” Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, “Community structure in jazz,” Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., “Community detection using spectral clustering on sparse geosocial data,” SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, “Community detection in networks: a user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, “The emergent properties of a dolphin social network,” in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, “Models of social networks based on social distance attachment,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, “Community detection in complex networks using structural similarity,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, “A smart local moving algorithm for large-scale modularity-based community detection,” The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, “Detecting network communities by propagating labels under constraints,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, “A semi-synchronous label propagation algorithm with constraints for community detection in complex networks,” Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, “Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs,” in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, “Community detection in complex networks using density-based clustering algorithm and manifold learning,” Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, “Locating structural centers: A density-based clustering method for community detection,” PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, “Personalized PageRank clustering: a graph clustering algorithm based on random walks,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, “A seed-expanding method based on random walks for community detection in networks with ambiguous community structures,” Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, “Community detection based on distance dynamics,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch’ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, “A fast community detection method in bipartite networks by distance dynamics,” Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, “The Laplacian spectrum of neural networks,” Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, “ComSim: a bipartite community detection algorithm using cycle and node’s similarity,” in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, “A community detection method based on local similarity and degree clustering information,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Žalik, “Maximal neighbor similarity reveals real communities in networks,” Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, “Robust data clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.



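Table 4: The experimental results of modularity on networks. The largest value for each network is typed in bold; “-” marks a network on which the algorithm produced no result.

| Network | FastQ | WalkTrap | LPA | Attractor | IsoFdp | NSA |
|---|---|---|---|---|---|---|
| Lesmis | 0.499 | 0.519 | 0.515 | 0.498 | 0.491 | **0.54** |
| Polbooks | 0.502 | 0.507 | 0.508 | 0.501 | 0.518 | **0.524** |
| ColiNeta | **0.779** | 0.746 | 0.693 | 0.718 | - | 0.761 |
| Email | 0.499 | 0.531 | 0.379 | 0.464 | 0.531 | **0.544** |
| NetScience | 0.955 | 0.956 | 0.896 | 0.937 | - | **0.957** |
| YeastL | 0.573 | 0.529 | 0.372 | 0.511 | - | **0.574** |
| PGP | 0.85 | 0.789 | 0.765 | 0.768 | 0.726 | **0.867** |
| DBLP | 0.735 | - | 0.652 | 0.637 | - | **0.782** |
| Amazon | 0.869 | - | 0.743 | 0.741 | - | **0.898** |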

Figure 6: Risk map network. (a) The ground-truth community structure. (b) The community structure uncovered by our proposed method, NSA.

even more than those within the communities to which these nodes belong. Therefore, if we ignore the meaning of the actual representation of these nodes and consider the topology only, the community structure extracted by our proposed method is more rational than the ground truth: more edges associated with these three nodes are located within their community than in the ground truth; thus, these three nodes are connected more tightly to nodes within the same community in Figure 6(b). Quantitatively, both values of the two measure metrics of our proposed method are second only to those of FastQ and are the same as those of WalkTrap. These results also confirm that our proposed method provides an acceptable solution to the problem of community detection.
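The topological argument above compares, for a given node, how many of its edges stay inside its community and how many leave it. A minimal sketch of that count follows; the adjacency-dict representation and the function name `inside_outside_degree` are illustrative assumptions, not part of the paper.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def inside_outside_degree(
    graph: Graph, node: Hashable, community: Iterable[Hashable]
) -> Tuple[int, int]:
    """Return (edges from node into the community, edges from node leaving it)."""
    members = set(community)
    inside = sum(1 for neighbor in graph[node] if neighbor in members)
    return inside, len(graph[node]) - inside
```

Calling the function once with the ground-truth community of a node and once with the detected community gives the two pairs of counts that the comparison in the text relies on.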

Scientists Collaboration Network. This is the largest connected component of a network delineating the coauthor relationship among scientists working at the Santa Fe Institute, New Mexico. Nodes in this network represent scientists; an edge indicates that the two scientists have collaborated on at least one paper. There are 118 nodes and 197 edges in total in this network. The nodes can be divided into 6 groups as the ground-truth communities according to the specialties of the scientists, as presented in Figure 7(a). Taking this network as the input to the proposed method, we obtained the community structure illustrated in Figure 7(b).

The proposed method revealed 8 communities from this network; two additional communities are detected in Figure 7(b). These two communities are relatively independent components; especially for the community containing node ‘1’, there are many more inner edges than outer edges. That is to say, nodes in these two communities are connected more tightly to one another than to the remainder of the network. Therefore, isolating them from the network and taking them as independent communities is also reasonable. In terms of the measure metrics, the value of NMI obtained by the proposed method is the largest, which suggests that the result detected by our proposal is the one closest to the ground-truth community structure; the modularity value of the proposed method is not the largest, though it is second only to that of FastQ. These results also testify that our proposed method can extract high-quality community structure from networks.
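For reference, the NMI between a ground-truth partition and a detected one can be computed as sketched below. The sketch assumes the commonly used normalization NMI = 2·I(A; B) / (H(A) + H(B)); this section does not restate the exact formula used in the experiments, and the function name and the label-dict input format are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Mapping


def nmi(truth: Mapping[Hashable, int], found: Mapping[Hashable, int]) -> float:
    """NMI = 2 * I(A; B) / (H(A) + H(B)) for two labelings of the same nodes."""
    if truth.keys() != found.keys():
        raise ValueError("both labelings must cover the same nodes")
    n = len(truth)
    count_a = Counter(truth.values())
    count_b = Counter(found.values())
    joint = Counter((truth[v], found[v]) for v in truth)

    def entropy(counts: Counter) -> float:
        return -sum(c / n * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0.0:
        return 1.0  # both labelings place every node in a single community
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    return 2.0 * mutual / (h_a + h_b)
```

The score depends only on how nodes are grouped, not on the label values themselves, so a detected partition with renamed community identifiers scores the same as the original.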

Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Networks without Ground-Truth Community Structure. This category contains the last 9 real-world networks listed in Table 2. For the experiments carried out on this category of networks, we evaluate the quality of the extracted community structures using modularity only, due to the absence of ground-truth community structures. For the proposed method and the comparison algorithms, the obtained values of modularity are recorded in Table 4. To illustrate them intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

Figure 8: The bar chart of the modularity obtained by the comparison algorithms (FastQ, WalkTrap, LPA, Attractor, and IsoFdp) and the proposed method, NSA. Horizontal axis: networks (Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon); vertical axis: modularity (Q), from 0.0 to 1.0.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the remaining network, ColiNeta, it still obtained the second largest value of modularity. FastQ is based on the modularity optimization strategy, yet it acquired the largest value of modularity on the ColiNeta network only. WalkTrap is an approach based on random walks, so its time complexity is relatively high; it cannot obtain effective results on the Amazon and DBLP networks due to their large scale. LPA and Attractor can extract community structures from all of these networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks and cannot run on the ColiNeta, NetScience, and YeastL networks, as these three networks are disconnected. It cannot detect the community structure from the Amazon and DBLP networks effectively either, because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications and outperforms the comparison algorithms significantly.
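The modularity values compared above can be reproduced for any partition with a short routine. The sketch below assumes the standard Newman-Girvan definition, Q = Σ_c [l_c / m − (d_c / (2m))²], where m is the number of edges, l_c the number of edges inside community c, and d_c the total degree of its nodes; it also assumes an undirected simple graph stored as a symmetric adjacency dict. The function name is illustrative.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]


def modularity(graph: Graph, communities: Iterable[Set[Hashable]]) -> float:
    """Newman-Girvan modularity of a partition of an undirected simple graph."""
    m = sum(len(neighbors) for neighbors in graph.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        # Every internal edge is seen from both endpoints, hence the division by 2.
        internal = sum(1 for u in community for v in graph[u] if v in community) / 2
        degree_sum = sum(len(graph[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q
```

Placing all nodes in a single community gives Q = 0 under this definition, which is why only partitions with positive Q indicate community structure beyond what node degrees alone would produce.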

Figure 9: The setting of parameter δ. Each panel plots the modularity Q against δ ∈ [0, 0.3]. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold δ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric γᵢ = αᵢ × βᵢ for every preliminary community Cᵢ in the merge procedure; if the value of γᵢ is below the threshold δ, the corresponding community Cᵢ is identified as one that needs to be merged.

Therefore, δ works as a parameter in our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or sparser the network is, the smaller the threshold δ should be, in accordance with the definitions of community sparsity (αᵢ), community scale (βᵢ), and community metric (γᵢ). To determine the optimal value of δ, we conducted a group of experiments to explore the relationship between the value of δ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity Q. We vary the value of δ from 0 to 1.0 in increments of 0.005; for each value of δ, we run our proposed method on these networks and observe how the modularity changes as δ varies.

The observed results are illustrated in Figure 9, in which we plotted only the portion δ ∈ [0, 0.3], because the largest modularities are obtained for δ ⩽ 0.3 on all four networks. Our proposed method gets the largest modularity when δ = 0.13 on the dolphin social network and δ = 0.1 on the other three networks. Therefore, we adopt the corresponding value for those four networks and empirically set δ = 0.1 for the other networks in the experiments. In Figure 9, the largest modularity is obtained around δ = 0.1, and the interval [0.05, 0.2] covers the optimal value of δ. Therefore, we empirically suggest that δ be adjusted adaptively around 0.1 within the range [0.05, 0.2], according to the size and sparsity of the networks involved in real-world applications.
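The parameter study described in this section amounts to a grid search over δ. A minimal sketch of such a sweep is given below; `score_for` is a hypothetical callable that would run the method with a given δ and return the modularity of the result, and the function name, defaults, and return format are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def sweep_delta(
    score_for: Callable[[float], float],
    start: float = 0.0,
    stop: float = 1.0,
    step: float = 0.005,
) -> Tuple[float, float, List[Tuple[float, float]]]:
    """Evaluate score_for(delta) on a regular grid and return the best point."""
    steps = round((stop - start) / step)
    curve = []
    for k in range(steps + 1):
        # Deriving delta from the integer index avoids accumulating float error.
        delta = round(start + k * step, 10)
        curve.append((delta, score_for(delta)))
    # On ties, max() keeps the first maximum, i.e. the smallest delta.
    best_delta, best_score = max(curve, key=lambda point: point[1])
    return best_delta, best_score, curve
```

The returned `curve` holds the (δ, score) pairs needed to draw a plot like the panels of Figure 9, and narrowing `start` and `stop` to the suggested range [0.05, 0.2] shortens the search on larger networks.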


[51] L Ana and A Jain ldquoRobust data clusteringrdquo in Proceedingsof the IEEE Computer Society Conference on Computer Visionand Pattern Recognition vol 2 pp II-128ndashII-133 Madison WIUSA 2003


Figure 7: The collaboration network of scientists working at the Santa Fe Institute. (a) The ground-truth community structure. (b) The community structure detected by our proposed NSA algorithm.

Figure 8: The bar chart of the modularity ($Q$) obtained by the comparison algorithms (FastQ, WalkTrap, LPA, Attractor, and IsoFdp) and the proposed method NSA on the networks Lesmis, Polbooks, ColiNeta, Email, NetScience, YeastL, PGP, DBLP, and Amazon.

intuitively, we also plotted them in a bar chart, which is presented in Figure 8.

On these networks, our proposed method achieved the largest modularity on 8 of them. On the only other network, ColiNeta, it still obtained the second largest modularity. FastQ is based on a modularity optimization strategy, yet it acquired the largest modularity on the ColiNeta network only. WalkTrap is based on random walks, so its time complexity is relatively high; it could not produce effective results on the Amazon and DBLP networks because of their large scale. LPA and Attractor can extract community structures from all of these networks, but the quality of the detected results is not satisfactory. IsoFdp can only be applied to connected networks, so it cannot run on the ColiNeta, NetScience, and YeastL networks, which are disconnected; it also cannot detect the community structure of the Amazon and DBLP networks effectively because of their large scale. These comparison results show that our proposed method can steadily, effectively, and efficiently provide promising solutions to the problem of community detection in networks from a wide range of applications, and that it outperforms the comparison algorithms significantly.


Figure 9: The setting of parameter $\delta$. Each panel plots the modularity $Q$ against $\delta \in [0, 0.3]$. (a) The karate club network. (b) The dolphin social network. (c) The risk map network. (d) The scientists collaboration network.

5. Parameter Setting

In the second phase of the proposed method, we introduce a threshold $\delta$ for the community metric to identify the preliminary communities that need to be merged. As mentioned above, we calculate the community metric $\gamma_i = \alpha_i \times \beta_i$ for every preliminary community $C_i$ in the merge procedure; if the value of $\gamma_i$ is below the threshold $\delta$, the corresponding community $C_i$ is identified as one that needs to be merged.
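The thresholding rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the sparsity values $\alpha_i$ and scale values $\beta_i$ have already been computed by their definitions given earlier in the paper. The function name `communities_to_merge` and the numeric values are hypothetical.

```python
from typing import Dict, List


def communities_to_merge(alpha: Dict[int, float],
                         beta: Dict[int, float],
                         delta: float = 0.1) -> List[int]:
    """Return the ids of preliminary communities whose metric
    gamma_i = alpha_i * beta_i falls below the threshold delta."""
    flagged = []
    for cid in alpha:
        gamma = alpha[cid] * beta[cid]  # community metric gamma_i
        if gamma < delta:
            flagged.append(cid)
    return sorted(flagged)


# Illustrative (made-up) sparsity and scale values for three communities.
alpha = {0: 0.8, 1: 0.3, 2: 0.5}
beta = {0: 0.5, 1: 0.2, 2: 0.1}
print(communities_to_merge(alpha, beta, delta=0.1))  # [1, 2]
```

Community 0 has $\gamma_0 = 0.4$ and is kept, while communities 1 and 2 have metrics of about 0.06 and 0.05 and are flagged for merging.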

Therefore, $\delta$ works as a parameter of our proposed method, and its setting can influence the quality of the resulting community structure. Qualitatively, the larger or the sparser the network is, the smaller the threshold $\delta$ should be, in accordance with the definitions of community sparsity ($\alpha_i$), community scale ($\beta_i$), and community metric ($\gamma_i$). To determine the optimal value of $\delta$, we conducted a group of experiments to explore the relationship between the value of $\delta$ and the quality of the resulting community structure on the first four networks listed in Table 2, namely, the karate club network, the dolphin social network, the map of the game Risk, and the scientists collaboration network. The quality of the resulting community structure is measured in terms of modularity $Q$. We vary the value of $\delta$ from 0 to 1.0 in increments of 0.005; for each value of $\delta$, we run our proposed method on these networks and observe how the modularity changes as $\delta$ varies.
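The sweep described above could be organized roughly as follows. This is a sketch under stated assumptions: `detect` is a hypothetical stand-in for the proposed method (any callable that maps a graph and a value of $\delta$ to a node-to-community labeling), the graph is an undirected adjacency-set dictionary, and `modularity` implements the standard Newman-Girvan definition of $Q$.

```python
from typing import Callable, Dict, Hashable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets
Labels = Dict[Hashable, int]           # node -> community id


def modularity(graph: Graph, label: Labels) -> float:
    """Newman-Girvan modularity: Q = sum_c [ e_c / m - (d_c / (2m))^2 ]."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    intra: Dict[int, int] = {}  # 2 * (edges inside each community)
    deg: Dict[int, int] = {}    # total degree of each community
    for u, nbrs in graph.items():
        c = label[u]
        deg[c] = deg.get(c, 0) + len(nbrs)
        for v in nbrs:
            if label[v] == c:
                intra[c] = intra.get(c, 0) + 1  # each inner edge is seen twice
    return sum(intra.get(c, 0) / (2 * m) - (deg[c] / (2 * m)) ** 2 for c in deg)


def sweep_delta(graph: Graph,
                detect: Callable[[Graph, float], Labels],
                lo: float = 0.0, hi: float = 1.0,
                step: float = 0.005) -> Tuple[float, float]:
    """Try delta = lo, lo + step, ..., hi and return (best delta, best Q)."""
    n_steps = int(round((hi - lo) / step))
    best_delta, best_q = lo, float("-inf")
    for k in range(n_steps + 1):
        delta = lo + k * step  # avoids accumulating floating-point error
        q = modularity(graph, detect(graph, delta))
        if q > best_q:         # keep the smallest delta among ties
            best_delta, best_q = delta, q
    return best_delta, best_q
```

With the default arguments the loop evaluates 201 values of $\delta$; restricting `hi` to 0.3 would correspond to the range plotted in Figure 9.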

The observed results are illustrated in Figure 9, in which we plotted only the portion with $\delta \in [0, 0.3]$, because the largest modularities are obtained for $\delta \leqslant 0.3$ on all four networks. Our proposed method gets the largest modularity when $\delta = 0.13$ on the dolphin social network and $\delta = 0.1$ on the other three networks. Therefore, we adopt the corresponding value for each of those four networks and empirically set $\delta = 0.1$ for the other networks in the experiments. In Figure 9, the largest modularity is obtained around $\delta = 0.1$, and the interval $[0.05, 0.2]$ covers the optimal value of $\delta$. Therefore, we empirically suggest that $\delta$ be adjusted adaptively around 0.1 in the range $[0.05, 0.2]$, according to the size and the sparsity of the networks involved in real-world applications.

6. Conclusion

In this paper, we presented a novel method to detect communities in networks. It is a local method based on node similarity, and it overcomes the high time consumption of global methods. First, we construct the preliminary community structure by repeatedly selecting the node with the largest degree and, on the basis of its most similar neighbor's community assignment, either taking it as the exemplar of a new community or inserting it into the community to which its most similar neighbor belongs. That is, if its most similar neighbor has not been assigned to any community yet, we create a new community for it and its most similar neighbor; if its most similar neighbor has already been assigned to a community, we insert it into that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and the sparsity of communities into account, to determine which communities should be merged.
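The first phase summarized above can be sketched as follows. This is a simplified illustration rather than the authors' code. It assumes an undirected graph stored as adjacency sets, and uses the Jaccard index on closed neighborhoods as a stand-in similarity measure, since the paper's own similarity definition is not restated in this section. It also breaks ties by node representation and gives isolated nodes their own community. The function names are hypothetical.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph as adjacency sets


def jaccard(graph: Graph, u: Hashable, v: Hashable) -> float:
    """Assumed similarity: Jaccard index of the closed neighborhoods of u and v."""
    nu = graph[u] | {u}
    nv = graph[v] | {v}
    return len(nu & nv) / len(nu | nv)


def preliminary_communities(graph: Graph) -> Dict[Hashable, int]:
    """Phase 1 sketch: assign every node to a preliminary community."""
    label: Dict[Hashable, int] = {}
    next_id = 0
    # Degrees are fixed, so visiting nodes by decreasing degree and skipping
    # assigned ones equals repeatedly picking the largest-degree unassigned node.
    order = sorted(graph, key=lambda u: (-len(graph[u]), repr(u)))
    for u in order:
        if u in label:
            continue
        if not graph[u]:  # isolated node: assumed to form its own community
            label[u] = next_id
            next_id += 1
            continue
        # Most similar neighbor; ties resolved by the sorted neighbor order.
        best = max(sorted(graph[u], key=repr), key=lambda v: jaccard(graph, u, v))
        if best in label:
            label[u] = label[best]            # join the neighbor's community
        else:
            label[u] = label[best] = next_id  # new community for the pair
            next_id += 1
    return label
```

On two triangles joined by a single edge, this sketch returns one preliminary community per triangle; the merge phase would then examine each of them using the community metric.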

To test the performance of the proposed method, we performed extensive experiments on four groups of synthetic networks and 13 real-world networks, and compared the detected community structures with the results extracted by the comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications, and that nodes in the extracted communities are connected more tightly. The proposed method overcomes the resolution limit problem to some extent and outperforms the competitors.

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks were synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which are cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We constructed the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

References

[1] J. Kleinberg and S. Lawrence, “Network analysis: The structure of the web,” Science, vol. 294, no. 5548, pp. 1849-1850, 2001.

[2] P. Chen and S. Redner, “Community structure of the physical review citation network,” Journal of Informetrics, vol. 4, no. 3, pp. 278–290, 2010.

[3] M. E. J. Newman, “Modularity and community structure in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, “Functional cartography of complex metabolic networks,” Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, “Community structure in jazz,” Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., “Community detection using spectral clustering on sparse geosocial data,” SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, “Community detection in networks: a user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, “The emergent properties of a dolphin social network,” in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, “Identifying and evaluating community structure in complex networks,” Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, “The structure and function of complex networks,” SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, “Self-similar community structure in a network of human interactions,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, “Models of social networks based on social distance attachment,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, “Community detection in complex networks using structural similarity,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, “A smart local moving algorithm for large-scale modularity-based community detection,” The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, “Detecting network communities by propagating labels under constraints,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, “A semi-synchronous label propagation algorithm with constraints for community detection in complex networks,” Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A structural clustering algorithm for networks,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, “Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs,” in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, “Community detection in complex networks using density-based clustering algorithm and manifold learning,” Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, “Locating structural centers: A density-based clustering method for community detection,” PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, “Computing communities in large networks using random walks,” in International symposium on computer and information sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, “Personalized PageRank clustering: a graph clustering algorithm based on random walks,” Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, “A seed-expanding method based on random walks for community detection in networks with ambiguous community structures,” Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, “Community detection based on distance dynamics,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch’ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, “A fast community detection method in bipartite networks by distance dynamics,” Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, “The laplacian spectrum of neural networks,” Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., “Spectral redemption in clustering sparse networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, “Local Lanczos Spectral Approximation for Community Detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, “ComSim: a bipartite community detection algorithm using cycle and node’s similarity,” in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, “A community detection method based on local similarity and degree clustering information,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, “Maximal neighbor similarity reveals real communities in networks,” Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, “Robust data clustering,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


Page 14: Neighbor Similarity Based Agglomerative Method for Community Detection in Networks · 2019. 7. 30. · Community Detection in Networks JianjunCheng ,1 XingSu ,1 HaijuanYang,1,2 LongjieLi

14 Complexity

000 005 010 015 020 025 030

Karate

020

025

030

035

040

045

050

Q

(a) The karate club network (b) The dolphin social network

000 005 010 015 020 025 030

Riskmap

040

045

050

055

060

065

070

Q

(c) The risk map network

000 005 010 015 020 025 030

Santafe

040

045

050

055

060

065

070

075

080

Q

(d) The scientists collaboration network

Figure 9 The setting of parameter 120575

5 Parameter Setting

In the second phase of the proposed method we introducea threshold 120575 for the community metric to identify thepreliminary communities needed to be merged As afore-mentioned we calculate the community metric 120574119894 = 120572119894 times 120573119894for every preliminary community 119862119894 in the merge procedureif the value of 120574119894 is below the threshold 120575 the correspondingcommunity 119862119894 is identified as the one needed to be merged

Therefore 120575 works as a parameter in our proposedmethod whose setting can influence the quality of theresulting community structure Considering qualitativity thelarger or the sparser the network is the threshold 120575 shouldbe smaller in accordance with the definitions of communitysparsity (120572119894) community scale (120573119894) and community metric(120574119894) To determine the optimal value of 120575 we conduct a groupof experiments to explore the relationship between the valueof 120575 and the quality of the resulting community structure onthe first four networks listed in Table 2 namely the karateclub network the dolphin social network the map of gameRisk and the scientists collaboration network respectivelyThe quality of the resulting community structure is measuredin term of modularity 119876 We vary the value of 120575 from 0 to 10by increasing 0005 each time for each value of 120575 we run ourproposed method on these networks and observe the changeof modularity along with the varies of 120575

The observed results are as illustrated in Figure 9 inwhich we plotted only the proportion of 120575 isin [0 03] because

the largest modularities are obtained during 120575 ⩽ 03 on all ofthose four networks Our proposed method gets the largestmodularity when 120575 = 013 on the dolphin social network and120575 = 01 on the other three networks Therefore we adopt thecorresponding value for those four networks and empiricallyset 120575 = 01 for other networks to perform the experiments InFigure 9 the largest modularity is obtained around the valueof 120575 = 01 and the interval of [005 02] covers the optimalvalue of 120575Therefore we empirically suggest that120575 be adjustedadaptively around 01 in the range of [005 02] according tothe size and the sparsity of networks involved in real-worldapplications

6 Conclusion

In this paper we presented a novel method to detectcommunities from networks It is a local method basedon node similarity and overcomes the deficiency of hightime consumption of global methods First we constructthe preliminary community structure by repeatedly selectingthe node with the largest degree and either taking it asthe exemplar of a new community or inserting it into thecommunity to which its most similar neighbor belongs onthe basis of its most similar neighborrsquos community assign-ment ie if its most similar neighbor has not been assignedto any community yet we create a new community for itand its most similar neighbor if its most similar neighborhas been assigned to a certain community we insert it into

Complexity 15

that community as well At the end of this process weobtain a series of preliminary communities However someof them might be too small or too sparse leading to a low-quality result Therefore we merge some of the preliminarycommunities to acquire the final community structure To doso we also proposed some indexes which take both the sizeand sparsity of communities into account to determine whichcommunities should be merged

To test the performance of the proposed method wehave performed extensive experiments on four groups ofsynthetic networks and 13 real-world networks and comparedthe detected community structures with the results extractedby comparison algorithms in terms of NMI and modular-ity the comparison results demonstrate that our proposedmethod can extract high-quality community structures fromnetworks abstracted from various applications and nodes inthe extracted communities are connected more tightly Theproposed method overcomes the problem of resolution limitto some extent and outperforms the competitors successfully

Data Availability

We have conducted experiments on some artificial net-works and some real-world datasets The artificial networksare synthesized using LFR benchmark network generatorwhich can be freely available at httpssitesgooglecomsitesantofortunato The parameters used to synthesize the arti-ficial networks are listed in Table 1 The real-world datasupporting this study are from previously reported studieswhich have been cited in Table 2 Most of the real-worlddatasets can also be downloaded from httpwww-personalumichedusimmejnnetdata and httpssnapstanfordedudataindexhtml TheColiNeta dataset was provided by Jeonget al [18] We construct the Risk Map network manuallyaccording to the literature [16]

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work was partially supported by the National NaturalScience Foundation of China (Grant ID 61602225)

References

[1] J Kleinberg and S Lawrence ldquoNetwork analysis The structureof the webrdquo Science vol 294 no 5548 pp 1849-1850 2001

[2] P Chen and S Redner ldquoCommunity structure of the physicalreview citation networkrdquo Journal of Informetrics vol 4 no 3pp 278ndash290 2010

[3] M. E. J. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8582, 2006.

[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási, "Hierarchical organization of modularity in metabolic networks," Science, vol. 297, no. 5586, pp. 1551–1555, 2002.

[5] R. Guimerà and L. A. N. Amaral, "Functional cartography of complex metabolic networks," Nature, vol. 433, no. 7028, pp. 895–900, 2005.

[6] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

[7] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 2, Article ID 026113, 2004.

[8] P. M. Gleiser and L. Danon, "Community structure in jazz," Advances in Complex Systems (ACS), vol. 6, no. 4, pp. 565–573, 2003.

[9] Y. van Gennip, B. Hunter, R. Ahn et al., "Community detection using spectral clustering on sparse geosocial data," SIAM Journal on Applied Mathematics, vol. 73, no. 1, pp. 67–83, 2013.

[10] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 74, no. 3, Article ID 036104, 19 pages, 2006.

[11] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[12] S. Fortunato and D. Hric, "Community detection in networks: a user guide," Physics Reports, vol. 659, pp. 1–44, 2016.

[13] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Labs Technical Journal, vol. 49, no. 1, pp. 291–307, 1970.

[14] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[15] D. Lusseau, "The emergent properties of a dolphin social network," in Proceedings of the Royal Society of London B: Biological Sciences, vol. 270, supplement 2, pp. S186–S188, 2003.

[16] K. Steinhaeuser and N. V. Chawla, "Identifying and evaluating community structure in complex networks," Pattern Recognition Letters, vol. 31, no. 5, pp. 413–421, 2010.

[17] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, no. 2, pp. 167–256, 2003.

[18] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, "The large-scale organization of metabolic networks," Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[19] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, "Self-similar community structure in a network of human interactions," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 68, no. 6, Article ID 065103, 2003.

[20] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, 2002.

[21] M. Boguñá, R. Pastor-Satorras, A. Díaz-Guilera, and A. Arenas, "Models of social networks based on social distance attachment," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 5, Article ID 056122, 2004.

[22] J. Yang and J. Leskovec, "Defining and evaluating network communities based on ground-truth," Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.

[23] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 69, no. 6, Article ID 066133, 2004.

[24] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 70, no. 6, Article ID 066111, 2004.

[25] F. Dabaghi Zarandi and M. Kuchaki Rafsanjani, "Community detection in complex networks using structural similarity," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 882–891, 2018.

[26] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, Article ID P10008, 2008.

[27] L. Waltman and N. J. Van Eck, "A smart local moving algorithm for large-scale modularity-based community detection," The European Physical Journal B, vol. 86, no. 11, article 471, pp. 1–14, 2013.

[28] U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 76, no. 3, Article ID 036106, 2007.

[29] M. J. Barber and J. W. Clark, "Detecting network communities by propagating labels under constraints," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 80, no. 2, Article ID 026129, 2009.

[30] J. Hou Chin and K. Ratnavelu, "A semi-synchronous label propagation algorithm with constraints for community detection in complex networks," Scientific Reports, vol. 7, Article ID 45836, 2017.

[31] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, "Community detection by propagating the label of center," Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, 2018.

[32] A. Laio and A. Rodriguez, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[33] X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A structural clustering algorithm for networks," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 824–833, ACM, New York, NY, USA, August 2007.

[34] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 226–231, AAAI Press, 1996.

[35] H. Shiokawa, Y. Fujiwara, and M. Onizuka, "Scan++: Efficient algorithm for finding clusters, hubs and outliers on large-scale graphs," in Proceedings of the 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006, pp. 1178–1189, Republic of Korea, September 2006.

[36] T. You, H.-M. Cheng, Y.-Z. Ning, B.-C. Shia, and Z.-Y. Zhang, "Community detection in complex networks using density-based clustering algorithm and manifold learning," Physica A: Statistical Mechanics and its Applications, vol. 464, pp. 221–230, 2016.

[37] X. Wang, G. Liu, J. Li, and J. P. Nees, "Locating structural centers: A density-based clustering method for community detection," PLoS ONE, vol. 12, no. 1, Article ID e0169355, 2017.

[38] P. Pons and M. Latapy, "Computing communities in large networks using random walks," in International Symposium on Computer and Information Sciences, pp. 284–293, 2005.

[39] S. A. Tabrizi, A. Shakery, M. Asadpour, M. Abbasi, and M. A. Tavallaie, "Personalized PageRank clustering: a graph clustering algorithm based on random walks," Physica A: Statistical Mechanics and its Applications, vol. 392, no. 22, pp. 5772–5785, 2013.

[40] Y. Su, B. Wang, and X. Zhang, "A seed-expanding method based on random walks for community detection in networks with ambiguous community structures," Scientific Reports, vol. 7, Article ID 41830, 2017.

[41] J. Shao, Z. Han, Q. Yang, and T. Zhou, "Community detection based on distance dynamics," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1075–1084, ACM, Australia, August 2015.

[42] H.-L. Sun, E. Ch'ng, X. Yong, J. M. Garibaldi, S. See, and D.-B. Chen, "A fast community detection method in bipartite networks by distance dynamics," Physica A: Statistical Mechanics and its Applications, vol. 496, pp. 108–120, 2018.

[43] A. A. Amini, A. Chen, P. J. Bickel, and E. Levina, "Pseudo-likelihood methods for community detection in large sparse networks," The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.

[44] S. C. de Lange, M. A. de Reus, and M. P. van den Heuvel, "The Laplacian spectrum of neural networks," Frontiers in Computational Neuroscience, vol. 7, no. 189, 2014.

[45] F. Krzakala, C. Moore, E. Mossel et al., "Spectral redemption in clustering sparse networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 110, no. 52, pp. 20935–20940, 2013.

[46] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, "Local Lanczos Spectral Approximation for Community Detection," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, vol. 10534 of Lecture Notes in Computer Science, pp. 651–667, Springer International Publishing, 2017.

[47] R. Tackx, F. Tarissan, and J. Guillaume, "ComSim: a bipartite community detection algorithm using cycle and node's similarity," in International Workshop on Complex Networks and their Applications, vol. 689 of Studies in Computational Intelligence, pp. 278–289, Springer International Publishing, 2017.

[48] T. Wang, L. Yin, and X. Wang, "A community detection method based on local similarity and degree clustering information," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 1344–1354, 2018.

[49] K. R. Zalik, "Maximal neighbor similarity reveals real communities in networks," Scientific Reports, vol. 5, Article ID 18374, 2015.

[50] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 78, no. 4, Article ID 046110, 2008.

[51] L. Ana and A. Jain, "Robust data clustering," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II-128–II-133, Madison, WI, USA, 2003.


that community as well. At the end of this process, we obtain a series of preliminary communities. However, some of them might be too small or too sparse, leading to a low-quality result. Therefore, we merge some of the preliminary communities to acquire the final community structure. To do so, we also proposed some indexes, which take both the size and sparsity of communities into account, to determine which communities should be merged.

To test the performance of the proposed method, we have performed extensive experiments on four groups of synthetic networks and 13 real-world networks and compared the detected community structures with the results extracted by comparison algorithms in terms of NMI and modularity. The comparison results demonstrate that our proposed method can extract high-quality community structures from networks abstracted from various applications, and nodes in the extracted communities are connected more tightly. The proposed method overcomes the problem of resolution limit to some extent and outperforms the competitors successfully.
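As a minimal illustration of the modularity measure mentioned above, the following Python sketch computes Q for a given partition. It assumes the standard Newman-Girvan definition [7] for an undirected, unweighted network, Q = sum over communities c of (l_c / m - (d_c / 2m)^2), where m is the number of edges, l_c the number of edges inside c, and d_c the total degree of the nodes in c. The function name `modularity`, the edge-list input, and the toy graph are illustrative assumptions, not the implementation used in the experiments.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition (undirected, unweighted graph).

    edges        -- list of (u, v) pairs, each undirected edge listed once
    community_of -- dict mapping every node to its community label
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)    # l_c: edges with both endpoints in c
    degree_sum = defaultdict(int)  # d_c: sum of degrees of the nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}

q_split = modularity(edges, split)    # 2 * (3/7 - (7/14)**2) = 5/14
q_merged = modularity(edges, merged)  # 7/7 - (14/14)**2 = 0
print(q_split, q_merged)
```

On this toy graph the two-triangle partition scores higher than the single merged community, which is the behavior one expects from a measure that rewards dense intra-community and sparse inter-community connections.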

Data Availability

We have conducted experiments on some artificial networks and some real-world datasets. The artificial networks are synthesized using the LFR benchmark network generator, which is freely available at https://sites.google.com/site/santofortunato. The parameters used to synthesize the artificial networks are listed in Table 1. The real-world data supporting this study are from previously reported studies, which have been cited in Table 2. Most of the real-world datasets can also be downloaded from http://www-personal.umich.edu/~mejn/netdata and https://snap.stanford.edu/data/index.html. The ColiNeta dataset was provided by Jeong et al. [18]. We construct the Risk Map network manually according to the literature [16].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant ID 61602225).

