International Journal of Bifurcation and Chaos, Vol. 19, No. 8 (2009) 2677–2685
© World Scientific Publishing Company

IDENTIFICATION OF COMMUNITY STRUCTURE IN NETWORKS USING HIGHER ORDER NEIGHBORHOOD CONCEPTS

ROBERTO F. S. ANDRADE and SUANI T. R. PINHO
Instituto de Física, Universidade Federal da Bahia,
40210-340 Salvador, Bahia, Brazil

THIERRY PETIT LOBÃO
Instituto de Matemática, Universidade Federal da Bahia,
40210-340 Salvador, Bahia, Brazil

Received March 29, 2008; Revised October 2, 2008

The identification of community structures in networks is investigated within a framework based on the concepts of higher order neighborhoods and the neighborhood matrix $\hat{M}$. This procedure is especially relevant for networks representing evolutionary situations, since several lines of evidence show that they are assembled from pre-existing smaller structures rather than by the mere adhesion of individual nodes. We proceed by successively eliminating the links with the largest betweenness degree. The effect of erasing a link at step $k$ is quantified by the distance between $\hat{M}^{k-1}$ and $\hat{M}^{k}$, which describe the network neighborhoods before and after the $k$th link elimination. For modular networks, this measure is characterized by a very long sequence of sharp peaks, which follows a much more complete cascade of cluster splitting. The evidence indicates that this method gives a more precise description of the splitting of smaller communities than the one based on the modularity function.

Keywords : Complex network; modularity; network distance.

1. Introduction

The determination of the community structure of a complex network is an important step towards understanding the dynamical processes that may have originated it [Newman & Girvan, 2004; Boccaletti et al., 2006; Sales-Pardo et al., 2007]. In the last few years, this has been acknowledged to be of utmost relevance for the analysis of networks constructed from biological data, where the evolutionary pathways hint that they were assembled on the basis of pre-existing smaller structures [Gavin et al., 2002; Gavin et al., 2004; Guimerà & Amaral, 2005; Góes Neto et al., 2008]. Such modular networks offer a landscape distinct from other network scenarios, which are based on either random or preferential attachment of individual nodes.

Given a general network, it is by no means obvious how to divide it into communities. Therefore, several procedures have been proposed to identify the basic structures (if any) that have been used in its construction (for a review see [Newman, 2004a]). More recently, faster [Clauset et al., 2004] and local [Bagrow & Bollt, 2005] algorithms have also been discussed in the literature. Efforts have also been made to quantify the precise moment when the subunits were joined together into a larger structure.

The main purpose of this work is to introduce a very precise measure to identify the community splitting process. This is accomplished by considering a suitably defined Euclidean distance $\delta$ between two neighborhood matrices (NM) $\hat{M}$ and $\hat{M}'$, which describe the network neighborhoods at two different steps of the community identification process [Andrade et al., 2006; Andrade et al., 2008a]. The NM $\hat{M}$, in which only the diagonal elements are zero, provides a much more precise measure of the difference between networks than one based on the adjacency matrices, whose elements are 0 and 1, such as the dissimilarity measure used in the hierarchical clustering method [Newman, 2004a].

To provide concrete examples, we consider results for a community structure identification based on the successive elimination of the links $(i, j)$ with the largest betweenness degree $B^{\max}_{i,j}$, as proposed by Newman and Girvan (NG) [2004]. This procedure has the advantage of being deterministic, except when two or more links share the same value $B^{\max}$. Since the definition of $\delta$ is independent of the community separation method, it can also be used together with faster methods that have been proposed recently [Newman, 2004b].

To better present our arguments, this work is structured as follows. Section 2 starts with a brief review of the concepts of higher order neighborhoods. It is followed by a discussion of how to implement the NG procedure by working with the neighborhood matrices $\hat{M}$.

In Sec. 3, we recall that the NG procedure is based on the successive elimination of individual links, so that it is necessary to update the neighborhood structure after the elimination of each link. Thus, it becomes quite natural to base the identification of community splitting on the distance between the two matrices $\hat{M}^{k-1}$ and $\hat{M}^{k}$, which describe, respectively, the network neighborhoods before and after the $k$th link elimination. Indeed, this distance is characterized by the presence of sharp peaks whenever the network is divided into clusters in the course of the procedure.

Examples are given in Sec. 4, where we consider two distinct networks: the well-known Zachary Karate Club network [Zachary, 1977] and the Filtered Yeast Interactome (FYI), a high-quality yeast interaction data set [Han et al., 2004]. As will be shown, the proposed measure gives rise to a very long sequence of sharp peaks, which are much more pronounced than those produced by the modularity function [Newman & Girvan, 2004]. For modular networks, the first peaks are clearly related to the splitting of the network into large clusters. Nevertheless, the sequence of peaks is much longer, showing that the method is able to go on to characterize the splitting of the smaller communities.

Finally, Sec. 5 closes the work with our concluding remarks.

2. Higher Order Neighborhoods and the NG Algorithm

The usual representation of an undirected complex network, constituted by $N$ nodes and $L$ links, is provided either by a list of $L$ pairs of nodes connected by a link, or by an $N \times N$ adjacency matrix (AM) $M$. In the latter case, the matrix elements $m_{i,j}$ are either 1 or 0, depending on whether or not the nodes $i$ and $j$ are connected to each other. The concept of the neighborhood matrix (NM) $\hat{M}$ generalizes that of the AM, in the sense that each matrix element $\hat{m}_{i,j}$ indicates the number of steps $d_{i,j}$ along the shortest path connecting nodes $i$ and $j$. In order to fix the notation, let us first define the set of matrices $M(\ell)$ such that

$$
M(\ell)_{ij} =
\begin{cases}
1, & \text{if } j \in O_i(\ell) \\
0, & \text{otherwise}
\end{cases}
, \tag{1}
$$

where $O_i(\ell)$ denotes the set of nodes $j$ for which $d_{i,j} = \ell$, $\ell = 1, 2, \ldots, D$, and $D$ denotes the network diameter. In this definition we assume that $M(0)$ is equivalent to the identity matrix $I$ and $M(1) = M$. If the network consists of two or more disconnected clusters, $d_{i,j}$ is not well defined when $i$ and $j$ do not belong to the same cluster. Therefore, for such pairs of nodes, we adopt the definition $d_{i,j} = 0$. A definition of $\hat{M}$ can be given in terms of $M(\ell)$ as:

$$
\hat{M} = \sum_{g=0}^{D} g\, M(g). \tag{2}
$$

As has been emphasized in previous contributions [Andrade et al., 2006; Andrade et al., 2008b], the NM is a very convenient way of storing the information that is unfolded by the systematic use of the data in the AM. We have also provided a direct way to obtain $\hat{M}$ based on the matrix Boolean product, although it can also be constructed with other well-known algorithms such as breadth-first search [Ahuja et al., 1993; Cormen et al., 2001]. $\hat{M}$ turns out to be quite useful for visualizing the network structure [Andrade et al., 2006b]. With the help of color or gray tone codes, it is possible to use the values of the matrix elements to construct square panels showing how many steps are required to connect different pairs of nodes.
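As an illustration of Eqs. (1) and (2), the following minimal Python sketch builds the NM by breadth-first search, one of the alternatives mentioned above (it is not the Boolean-product construction). The function names `neighborhood_matrix` and `layer`, and the 0-based node labels, are assumptions made for this example only.

```python
from collections import deque


def neighborhood_matrix(adj):
    """Return the NM of an undirected network given its 0/1 adjacency matrix.

    Entry (i, j) is the shortest-path length d_ij; pairs in different
    clusters (and the diagonal) get 0, following the convention in the text.
    """
    n = len(adj)
    nm = [[0] * n for _ in range(n)]
    for s in range(n):
        dist = [-1] * n          # -1 marks "not reached yet"
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        nm[s] = [d if d > 0 else 0 for d in dist]
    return nm


def layer(nm, ell):
    """Matrix M(ell) of Eq. (1), for ell >= 1: 1 where d_ij == ell, else 0."""
    return [[1 if x == ell else 0 for x in row] for row in nm]


# Six-node chain with connected ends (nodes labeled 0..5 here).
ring = [[1 if abs(i - j) in (1, 5) else 0 for j in range(6)] for i in range(6)]
nm = neighborhood_matrix(ring)
diameter = max(max(row) for row in nm)

# Eq. (2): summing g * M(g) over g = 1..D rebuilds the NM (the g = 0 term vanishes).
rebuilt = [[sum(g * layer(nm, g)[i][j] for g in range(1, diameter + 1))
            for j in range(6)] for i in range(6)]
print(nm[0], diameter, rebuilt == nm)
```

The dense $N \times N$ scan inside the search keeps the sketch short; an adjacency list would be the natural choice for larger networks.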


Let us now explore $\hat{M}$ for the purpose of evaluating the betweenness degree of an existing link between nodes $s$ and $r$. This parameter counts the number of shortest paths between all pairs of nodes $(k, \ell)$, $k, \ell = 1, 2, \ldots, N$, with $k \neq \ell$, that pass through that particular link. If there are multiple shortest paths between two nodes, the contribution of this pair is divided equally among the available paths. In Fig. 1 we illustrate the contribution to the betweenness degree resulting from the shortest paths from node $s = 1$ to all nodes in a simple network.

The formalism we derive below is very convenient if one wants to make use of the NG algorithm. We recall that it prescribes not only the successive elimination of the links with the highest betweenness degree, but also the re-evaluation of these degrees for all remaining links each time a link has been eliminated.

So let us suppose that, for a given network, the corresponding NM $\hat{M}$ has been evaluated. The betweenness procedure starts by defining a betweenness matrix $B$, with initial elements defined by the AM, i.e. $b_{i,j} = m_{i,j}$. At the starting point, $B$ takes into account only the shortest paths of pairs of nodes that are directly connected. Then we proceed along the following sequence of steps, which adds to each matrix element of $B$ the effect of all pairs of nodes connected by the corresponding shortest paths.

Fig. 1. Summed contribution to the betweenness degree resulting from the shortest paths between node $s = 1$ and all other nodes in a simple linear chain network with closed ends and $N = 6$. If the contributions from the other pairs of nodes are added, a constant value 4.5 is obtained for all links.

(1) Start a loop in a variable $\ell$ that runs, in decreasing order, from $D$ to 2. Initially we look for elements $\hat{m}_{i,j} = \ell = D$ for $i < j$. Each one of them defines a pair $(i, j)$ and, for each such pair, we look for new pairs of elements $\hat{m}_{i,t(i,j)}$ and $\hat{m}_{j,t(i,j)}$ which satisfy the following requirement:

$$
\begin{aligned}
&\hat{m}_{i,t(i,j)} = 1 \ \text{and}\ \hat{m}_{j,t(i,j)} = \ell - 1, \quad \text{or} \\
&\hat{m}_{i,t(i,j)} = \ell - 1 \ \text{and}\ \hat{m}_{j,t(i,j)} = 1.
\end{aligned}
\tag{3}
$$

This step aims to identify which links are involved in connecting the pairs of nodes $(i, j)$ that are at the maximal distance $D$ from each other. Note, however, that it identifies only the links that attach a node $t(i, j)$ either to $i$ or to $j$. To take shortest path multiplicity into account, let $T(i, j)$ be the number of values of $t(i, j)$ that satisfy (3).

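(2) Update the elements $b_{i,t(i,j)}$ and $b_{j,t(i,j)}$ of $B$ according to the following rules:

$$
\begin{aligned}
b_{i,t(i,j)} + \frac{b_{i,j} + 1}{T(i,j)} &\to b_{i,t(i,j)} = b_{t(i,j),i}, \\
b_{j,t(i,j)} + \frac{b_{i,j} + 1}{T(i,j)} &\to b_{j,t(i,j)} = b_{t(i,j),j}.
\end{aligned}
\tag{4}
$$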

This step adds to the matrix elements $b_{i,t(i,j)}$ and $b_{j,t(i,j)}$ the information that they take part in the shortest path connecting $i$ and $j$. Note that, for $\ell > 2$, there is no direct link either between $i$ and $t(i, j)$ or between $j$ and $t(i, j)$, since one of these two pairs is at distance $\ell - 1$. Nevertheless, the update needs to be performed on all involved elements, as this process will influence the values of $b_{i,j}$ in the next iteration steps.

(3) Return to item 1, and run items 1 and 2 until $\ell = 2$.

(4) At the end of the process, the value of the betweenness degree associated with each link is given by $b_{i,j}$, provided the pair $(i, j)$ satisfies $m_{i,j} = 1$.
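Steps (1) to (4) can be sketched in Python as follows. This is a literal transcription of the rules (3) and (4) under stated assumptions, not the authors' implementation: the names `neighborhood_matrix` and `link_betweenness` are hypothetical, nodes are labeled from 0, the NM is built by breadth-first search, and exact fractions are used so that the result can be compared with hand calculations. The sketch has only been checked on the six-node closed chain of Fig. 1.

```python
from collections import deque
from fractions import Fraction


def neighborhood_matrix(adj):
    """Shortest-path lengths by breadth-first search; 0 for unreachable pairs."""
    n = len(adj)
    nm = []
    for s in range(n):
        dist = [-1] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        nm.append([d if d > 0 else 0 for d in dist])
    return nm


def link_betweenness(adj):
    """Apply steps (1)-(4) and return {(i, j): b_ij} for every link, i < j."""
    n = len(adj)
    nm = neighborhood_matrix(adj)
    diameter = max(max(row) for row in nm)
    # Starting point: b_ij = m_ij (directly connected pairs only).
    b = [[Fraction(adj[i][j]) for j in range(n)] for i in range(n)]
    for ell in range(diameter, 1, -1):            # step (1): ell = D, ..., 2
        for i in range(n):
            for j in range(i + 1, n):
                if nm[i][j] != ell:
                    continue
                # Nodes t(i, j) satisfying requirement (3); T(i, j) = len(ts).
                ts = [t for t in range(n)
                      if (nm[i][t] == 1 and nm[j][t] == ell - 1)
                      or (nm[i][t] == ell - 1 and nm[j][t] == 1)]
                share = (b[i][j] + 1) / len(ts)
                for t in ts:                      # step (2): update rule (4)
                    for a in (i, j):
                        b[a][t] += share
                        b[t][a] = b[a][t]         # keep B symmetric
    # Step (4): only elements with m_ij = 1 are betweenness degrees of links.
    return {(i, j): b[i][j] for i in range(n) for j in range(i + 1, n) if adj[i][j]}


ring = [[1 if abs(i - j) in (1, 5) else 0 for j in range(6)] for i in range(6)]
result = link_betweenness(ring)
print(result)
```

For the closed chain every link receives the value 9/2, in agreement with the constant 4.5 quoted in the caption of Fig. 1.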

As an example, let us consider the linear chain with six nodes and connected ends shown in Fig. 1, for which $\hat{M}$ and $B$ are expressed as

$$
\hat{M} =
\begin{pmatrix}
0 & 1 & 2 & 3 & 2 & 1 \\
1 & 0 & 1 & 2 & 3 & 2 \\
2 & 1 & 0 & 1 & 2 & 3 \\
3 & 2 & 1 & 0 & 1 & 2 \\
2 & 3 & 2 & 1 & 0 & 1 \\
1 & 2 & 3 & 2 & 1 & 0
\end{pmatrix},
\qquad
B =
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}.
\tag{5}
$$

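Starting step 1 at $\ell = D = 3$, we find the following pairs $(i, j)$ such that $\hat{m}_{i,j} = 3$, with $i < j$: $(1, 4)$, $(2, 5)$, $(3, 6)$. Further, it is easy to show that, for the pair $(1, 4)$, condition (3) is satisfied for $t(i, j) = 2, 3, 5$, and 6, indicating that $T(1, 4) = 4$. Updates according to (4) are explicitly written as

$$
\begin{aligned}
b_{1,2} + \frac{b_{1,4} + 1}{4} &= 1 + 0 + \frac{1}{4} = \frac{5}{4} \to b_{1,2}, \\
b_{4,2} + \frac{b_{1,4} + 1}{4} &= 0 + 0 + \frac{1}{4} = \frac{1}{4} \to b_{4,2}, \\
b_{1,3} + \frac{b_{1,4} + 1}{4} &= 0 + 0 + \frac{1}{4} = \frac{1}{4} \to b_{1,3}, \\
b_{4,3} + \frac{b_{1,4} + 1}{4} &= 1 + 0 + \frac{1}{4} = \frac{5}{4} \to b_{4,3}, \\
b_{1,5} + \frac{b_{1,4} + 1}{4} &= 0 + 0 + \frac{1}{4} = \frac{1}{4} \to b_{1,5}, \\
b_{4,5} + \frac{b_{1,4} + 1}{4} &= 1 + 0 + \frac{1}{4} = \frac{5}{4} \to b_{4,5}, \\
b_{1,6} + \frac{b_{1,4} + 1}{4} &= 1 + 0 + \frac{1}{4} = \frac{5}{4} \to b_{1,6}, \\
b_{4,6} + \frac{b_{1,4} + 1}{4} &= 0 + 0 + \frac{1}{4} = \frac{1}{4} \to b_{4,6}.
\end{aligned}
\tag{6}
$$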

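For $(i, j) = (2, 5)$ and $(3, 6)$ the situation is much the same. Thus, after analyzing all three pairs, which requires 24 update operations, we arrive at

$$
B = \frac{1}{4}
\begin{pmatrix}
0 & 6 & 2 & 0 & 2 & 6 \\
6 & 0 & 6 & 2 & 0 & 2 \\
2 & 6 & 0 & 6 & 2 & 0 \\
0 & 2 & 6 & 0 & 6 & 2 \\
2 & 0 & 2 & 6 & 0 & 6 \\
6 & 2 & 0 & 2 & 6 & 0
\end{pmatrix}.
\tag{7}
$$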

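The final step considers $\ell = 2$. The pairs $(i, j)$ for which $\hat{m}_{i,j} = 2$ with $i < j$ are $(1, 3)$, $(1, 5)$, $(2, 4)$, $(2, 6)$, $(3, 5)$, $(4, 6)$. For each of these pairs, Eq. (3) is satisfied for only one value of $t(i, j)$, so that $T(i, j) = 1$. For instance, $t(1, 3) = 2$ and $t(1, 5) = 6$. Following Eq. (4), 12 recurrent update operations are performed, all of which involve matrix elements $b_{i,j}$ for which $m_{i,j} = 1$:

$$
\begin{aligned}
b_{1,2} + \frac{b_{1,3} + 1}{1} &= \frac{6}{4} + \frac{2}{4} + 1 = \frac{12}{4} \to b_{1,2}, \\
b_{3,2} + \frac{b_{1,3} + 1}{1} &= \frac{6}{4} + \frac{2}{4} + 1 = \frac{12}{4} \to b_{3,2}, \\
b_{1,6} + \frac{b_{1,5} + 1}{1} &= \frac{6}{4} + \frac{2}{4} + 1 = \frac{12}{4} \to b_{1,6}, \\
b_{5,6} + \frac{b_{1,5} + 1}{1} &= \frac{6}{4} + \frac{2}{4} + 1 = \frac{12}{4} \to b_{5,6}, \\
b_{2,3} + \frac{b_{2,4} + 1}{1} &= \frac{12}{4} + \frac{2}{4} + 1 = \frac{18}{4} \to b_{2,3}, \\
b_{4,3} + \frac{b_{2,4} + 1}{1} &= \frac{6}{4} + \frac{2}{4} + 1 = \frac{12}{4} \to b_{4,3}, \\
b_{2,1} + \frac{b_{2,6} + 1}{1} &= \frac{12}{4} + \frac{2}{4} + 1 = \frac{18}{4} \to b_{2,1}, \\
b_{6,1} + \frac{b_{2,6} + 1}{1} &= \frac{12}{4} + \frac{2}{4} + 1 = \frac{18}{4} \to b_{6,1}, \\
b_{3,4} + \frac{b_{3,5} + 1}{1} &= \frac{12}{4} + \frac{2}{4} + 1 = \frac{18}{4} \to b_{3,4}, \\
b_{5,4} + \frac{b_{3,5} + 1}{1} &= \frac{6}{4} + \frac{2}{4} + 1 = \frac{12}{4} \to b_{5,4}, \\
b_{4,5} + \frac{b_{4,6} + 1}{1} &= \frac{12}{4} + \frac{2}{4} + 1 = \frac{18}{4} \to b_{4,5}, \\
b_{6,5} + \frac{b_{4,6} + 1}{1} &= \frac{12}{4} + \frac{2}{4} + 1 = \frac{18}{4} \to b_{6,5}.
\end{aligned}
\tag{8}
$$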

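Thus we end up with the final form for $B$,

$$
B = \frac{1}{4}
\begin{pmatrix}
0 & 18 & 2 & 0 & 2 & 18 \\
18 & 0 & 18 & 2 & 0 & 2 \\
2 & 18 & 0 & 18 & 2 & 0 \\
0 & 2 & 18 & 0 & 18 & 2 \\
2 & 0 & 2 & 18 & 0 & 18 \\
18 & 2 & 0 & 2 & 18 & 0
\end{pmatrix},
\tag{9}
$$

from which we recover the correct values predicted by the NG algorithm, i.e. $b_{i,j} = 4.5$ if $m_{i,j} = 1$.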

3. Measuring Distances

Let us now discuss how to use the distance concept proposed in a previous work [Andrade et al., 2006b]. The concept is based on the fact that, once we are given two distinct networks with the same number $N$ of nodes, labeled $\alpha$ and $\beta$, it is possible to introduce a Euclidean distance $\delta(\alpha, \beta)$ by summing over the (non-negative) squared differences between the matrix elements of the corresponding neighborhood matrices $\hat{M}_\alpha$ and $\hat{M}_\beta$. As we explain hereafter, such a measure can be used to identify those links which, when eliminated within the NG procedure, give rise to important community splitting.

Let us consider the elimination of the first link. The evaluation of the betweenness matrix $B^0$ requires the knowledge of the NM $\hat{M}^0$, where the superscripts indicate that no link has yet been eliminated. After the identification of the first link $(i_1, j_1)$ to be eliminated, it is necessary to re-evaluate all shortest paths or, equivalently, to evaluate the new NM $\hat{M}^1$. In fact, for practical purposes, it is possible to follow the “damage” caused on $\hat{M}^0$ by the elimination of $(i_1, j_1)$, so that, for almost all link eliminations, only a small number of elements need to be re-evaluated to obtain $\hat{M}^1$. The same holds for $k = 2, 3, \ldots, L$. Therefore, it is possible to define the following distance $\delta(\alpha, \beta) = \delta(\hat{M}^{k}, \hat{M}^{k-1}) \equiv \delta(k)$ between two networks, before and after the $k$th link elimination:

$$
\delta^2(k) = \frac{1}{N(N-1)} \sum_{i,j=1}^{N} \left[ (\hat{m}^{k})_{i,j} - (\hat{m}^{k-1})_{i,j} \right]^2. \tag{10}
$$

As we just discussed, if the $k$th link elimination has changed only a small number of NM elements, the distance $\delta(k)$ is small. At the other extreme, when the link elimination divides the network into disconnected communities, $\delta(k)$ becomes very large. This is so because, according to the definition of $\hat{M}$, it amounts to setting to zero, at one single value of $k$, all matrix elements of $\hat{M}^{k}$ corresponding to pairs of nodes that now belong to the two newly disconnected clusters.
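The contrast between the two situations can be seen in a small numerical sketch of Eq. (10). The function name `delta` and the two three-node examples are assumptions chosen for illustration; the matrices were written by hand using the convention $d_{i,j} = 0$ for disconnected pairs.

```python
from math import sqrt


def delta(nm_prev, nm_next):
    """Distance of Eq. (10) between two NMs of the same size N."""
    n = len(nm_prev)
    total = sum((nm_next[i][j] - nm_prev[i][j]) ** 2
                for i in range(n) for j in range(n))
    return sqrt(total / (n * (n - 1)))


# Open chain 1-2-3: removing the link (2, 3) isolates node 3,
# so every NM element involving node 3 drops to zero.
chain_before = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
chain_after = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
split = delta(chain_before, chain_after)

# Triangle: removing the link (2, 3) keeps the network connected;
# only the distance between nodes 2 and 3 changes, from 1 to 2.
triangle_before = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
triangle_after = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]
no_split = delta(triangle_before, triangle_after)

print(split, no_split)
```

Here $\delta^2 = 5/3$ for the elimination that disconnects a node and $\delta^2 = 1/3$ for the one that does not, which is the kind of difference that shows up as a peak in $\delta(k)$.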

The definition (10) is meant to be an alternative way to identify the link eliminations that give rise to important community splitting. As such, it can be directly compared to the modularity function $Q = \sum_{i=1}^{\kappa} \left( e_{ii} - \left( \sum_{j=1}^{\kappa} e_{ij} \right)^2 \right)$ [Newman & Girvan, 2004], which also identifies the process of community splitting by link eliminations. In this definition, one assumes that the network is divided into $\kappa$ parts, and $e_{ij}$ is the fraction of links in the network that connect communities $i$ and $j$. However, as we show in the next section for a well-known network, (10) seems to be able to identify a larger number of community splittings, especially those occurring after the network has already been split into medium size clusters. Moreover, $\delta$ does not depend on ad hoc choices (e.g. the number $\kappa$ of communities), as required by the definition of the modularity function $Q$.
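For comparison, the modularity function can be evaluated with a few lines of Python. The sketch below assumes the usual Newman and Girvan convention in which a link between two different communities contributes half to $e_{ij}$ and half to $e_{ji}$, so that $\sum_j e_{ij}$ is the fraction of link ends attached to community $i$. The function name `modularity` and the example network (two triangles joined by one link) are hypothetical.

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - (sum_j e_ij)^2) for a given division into communities.

    edges: list of undirected links (u, v); community: dict node -> label.
    """
    n_links = len(edges)
    inside = {}   # number of links with both ends in community c
    ends = {}     # number of link ends attached to community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / n_links - (ends[c] / (2 * n_links)) ** 2
               for c in ends)


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_parts = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_part = {node: "a" for node in range(6)}

q_two = modularity(edges, two_parts)
q_one = modularity(edges, one_part)
print(q_two, q_one)
```

With the two triangles taken as communities $Q = 5/14$, while placing all nodes in one community gives $Q = 0$. Unlike $\delta$, the value depends on the division supplied as input.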

The evolution of $\delta$ with link elimination can be drawn together with the corresponding dendrogram [Newman & Girvan, 2004; Góes Neto et al., 2008] produced by the community splitting method. In its tree structure, the number of branches increases with the number $k$ of eliminated links, from one single branch at $k = 0$, describing all nodes in the same cluster, to $N$ single-node communities at $k = L$. The access to the network disassembling process provided by NG's algorithm turns out to be of great practical relevance when analyzing evolutionary networks. Indeed, it becomes possible to identify the time and the agent responsible for community splitting. Accordingly, the resulting dendrogram provides a depth measure, which indicates how far back one must go until given nodes are in the same community. This justifies the choice of this separation method which, together with the successive evaluation of $\delta$, results in a time complexity $O(NL^2)$, with $L$ the number of links.
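The whole procedure, successive elimination of the link with the largest betweenness degree followed by the evaluation of $\delta(k)$, can be sketched as below. Several simplifications are assumed: the betweenness degree is obtained by counting shortest paths from breadth-first searches rather than by the NM-based updates of Sec. 2, the NM is fully recomputed after each elimination instead of tracking the "damage", and ties are broken by taking the first link found. All function names and the example network are hypothetical.

```python
from collections import deque
from math import sqrt


def bfs(adj, s):
    """Return (dist, sigma): shortest-path lengths and path counts from s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def neighborhood_matrix(adj):
    nodes = sorted(adj)
    rows = []
    for s in nodes:
        dist = bfs(adj, s)[0]
        rows.append([dist.get(t, 0) for t in nodes])  # 0 if unreachable
    return rows


def edge_betweenness(adj):
    """Fraction of shortest paths through each link, summed over node pairs."""
    nodes = sorted(adj)
    info = {s: bfs(adj, s) for s in nodes}
    edges = [(u, v) for u in nodes for v in sorted(adj[u]) if u < v]
    score = {e: 0.0 for e in edges}
    for a, s in enumerate(nodes):
        ds, ss = info[s]
        for t in nodes[a + 1:]:
            if t not in ds:
                continue
            dt, st = info[t]
            for u, v in edges:
                for x, y in ((u, v), (v, u)):
                    if x in ds and y in dt and ds[x] + 1 + dt[y] == ds[t]:
                        score[(u, v)] += ss[x] * st[y] / ss[t]
    return score


def delta(nm_prev, nm_next):
    n = len(nm_prev)
    total = sum((nm_next[i][j] - nm_prev[i][j]) ** 2
                for i in range(n) for j in range(n))
    return sqrt(total / (n * (n - 1)))


def ng_with_delta(adj):
    """Eliminate links one by one; return the removal order and delta(k)."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    removed, deltas = [], []
    nm_prev = neighborhood_matrix(adj)
    while any(adj.values()):
        score = edge_betweenness(adj)
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
        nm_next = neighborhood_matrix(adj)
        removed.append((u, v))
        deltas.append(delta(nm_prev, nm_next))
        nm_prev = nm_next
    return removed, deltas


# Two triangles joined by the single link (2, 3).
network = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
removed, deltas = ng_with_delta(network)
print(removed[0], deltas)
```

In this small example the first link eliminated is the one joining the two triangles, and $\delta(1)$ is the largest value of the whole sequence, since that elimination sets to zero all NM elements between the two groups.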

The order in which the nodes appear at the extreme value $k = L$ in the dendrogram can be used to reveal the community structure in the original NM $\hat{M}^0$. Indeed, it turns out that the shortest paths between nodes in the same community are usually shorter than those between pairs of nodes that belong to distinct structures. As we will show in the next section, the color plots based on the original $\hat{M}^0$ assume different patterns if the rows and columns of $\hat{M}^0$ are renumbered according to the final order in which the nodes appear at $k = L$. For modular networks, the new patterns provide useful insights for the interpretation of structural aspects resulting from community splitting.
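The renumbering amounts to applying the same permutation to the rows and columns of the NM. A minimal sketch, assuming the node order has already been read off the dendrogram (the list `order` below is a made-up example, as is the four-node chain):

```python
def reorder(nm, order):
    """Apply the same permutation to the rows and columns of a matrix."""
    return [[nm[p][q] for q in order] for p in order]


# NM of the open chain 0-2-1-3, written with the original node labels.
nm = [[0, 2, 1, 3],
      [2, 0, 1, 1],
      [1, 1, 0, 2],
      [3, 1, 2, 0]]

order = [0, 2, 1, 3]          # hypothetical order taken from a dendrogram
reordered = reorder(nm, order)
for row in reordered:
    print(row)
```

After the renumbering, the small matrix elements gather around the diagonal and the values grow away from it, which is the effect that makes diagonal blocks visible in a color plot of a modular network.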

4. Examples

4.1. Zachary club network

In order to show the efficiency of our measure, we consider first the Zachary Karate Club network, a social network formed by 34 people, any two of whom may or may not have a friendship relation [Zachary, 1977]. Besides its pioneering relevance for introducing the notion of networks into the social sciences, it has been used in recent times as a prototype network for testing newly proposed methods and measures to characterize networks. Among its several features, it has a clear decomposition into two communities that are formed around two key persons in the Club, the trainer and the secretary.

This network has also been used to exemplify the NG method to identify communities [Newman & Girvan, 2004]. As we are using this same identification method, we use the network to show how the presence of large peaks in the distance introduced in (10) is efficient in detecting the link eliminations that cause community splitting.

The largest peak locates the moment at which the network splits into the two main communities. Besides that, it is possible to identify at least five subsequent sharp peaks of decreasing height, which correspond to five secondary subcommunity splittings that can be recognized in the dendrogram.

The comparison of the evolution plots of $\delta(k)$ and $Q(k)$, shown in Fig. 2, clearly indicates that the measures provided by $Q$ and $\delta$ are related but not equivalent. In fact, $\delta(k)$ seems to be more sensitive to branching events in the community structure shown by the dendrogram. The position of the highest peak of $Q$, at elimination step $k = 26$, coincides with the fourth large $\delta$ peak, after which the network is split into five communities. However, $Q$ fails to detect the first two large community splittings at $k = 11$ and 20, as its first peak is observed only at $k = 17$, and $k = 20$ corresponds to a local minimum. Moreover, the much sharper peaks in the plot of $\delta(k)$ easily identify the elimination of links that lead to community splitting. The height of the peaks also provides a measure of how large the communities emerging at a network division are. Finally, this makes it easier to identify the set of communities in a given network for a previously chosen number of branches in the dendrogram.

4.2. Yeast protein interaction network

Let us now consider the much larger and more complex yeast FYI network. Yeast is one of the first organisms to have been systematically investigated within modern molecular biology, which includes the search for properties ranging from DNA coding and gene identification to gene expression, protein synthesis and protein interaction [Gavin et al., 2002; Han et al., 2004; Gavin et al., 2006; Collins et al., 2007]. The corresponding data are stored in large information banks with public access for scientists [NCBI].

[Figure 2. Panel title: Zachary Club, 34 nodes. Horizontal axis: eliminated links k. Vertical axes: community structure in (a); δ(k) and Q(k) in (b).]

Fig. 2. (a) Dendrogram produced by the sequence of edge eliminations controlled by the largest betweenness degree. The numbers at the r.h.s. of the dendrogram indicate the position of each node. With this information, it is possible to follow the nodes backwards and identify to which community each of them belongs. (b) Corresponding values of δ (solid) and Q (dashed) for the same edge elimination. Note that, unlike Fig. 9 in [Newman & Girvan, 2004], Q is drawn for every one of the 76 eliminated links.

Due to the large investment in this field, a very large amount of new results is continuously being produced by several research groups, which either confirm previous results or contribute genuinely new observations. Therefore, yeast data banks need to be updated frequently in order to keep pace with advances in the field. In this work, we consider the FYI network investigated in a previous paper [Han et al., 2004], where the authors found evidence of modularity and identified the most important communities and their biological role. We collected the network data from the Supplementary Information of the quoted reference, which is available online.

As we are mainly interested in discussing the usefulness of (10) in the characterization of community splitting, we will limit our discussion to this specific aspect, and refer the reader interested in the biological aspects of the network to the original work. The network consists of 1,379 proteins, which are separated into 162 clusters, the largest of which contains 778 proteins. This giant cluster, which contains 1,799 links, is much larger than any of the remaining ones, so that we can focus our attention on it.

The corresponding dendrogram and distance measure are shown in Fig. 3. Note that we have used a shorter horizontal scale in order to display more clearly the large peaks that are observed in the early stage of the edge elimination process. Here again it is possible to observe a first splitting into two large communities, which is followed by at least ten noticeable peaks corresponding to secondary subcommunity splittings.

Finally, the effect of node relabeling on the shape of the neighborhood matrix is well illustrated by the two different patterns shown in Fig. 4. In (a), the neighborhood matrix for the giant cluster is completely confused, entirely hiding the community structure existing in the full FYI. On the other hand, the panel in (b), which is obtained from that in (a) by permuting rows and columns in the precise way prescribed by the edge elimination procedure, clearly reveals the existence of an intricate community pattern. Indeed, shortest paths between nodes within the same community are more likely to consist of a few steps, which is expressed by blue colors. The number of steps in shortest paths between nodes of different communities slowly increases, giving rise to green, yellow and red pixels in the off-diagonal blocks of the panels. The illustrations in Figs. 2 and 3 provide vivid pictorial and quantitative information on the network community structure, which was hidden in the original labeling.

[Figure 3. Horizontal axis: eliminated links. Vertical axis in (a): community structure.]

Fig. 3. (a) Dendrogram produced by the sequence of edge eliminations for FYI. (b) Corresponding value of δ for the same edge elimination.

Fig. 4. Color (gray tones) plots based on the values of the matrix elements of the NMs. In (a), the node numbering is that of the input data, while in (b) the node numbering results from community finding; it leads to a dendrogram with no crossing lines. The color (gray tone) codes assign blue (black) and red (white) to the pairs of nodes which have, respectively, the smallest and largest values of the shortest paths. The diagonal blocks in (b) correspond directly to the communities in the dendrogram shown in Fig. 3.

5. Conclusions

In this work we have presented another measure to identify the modular characteristics of complex networks. The main idea is to use the distance $\delta$ between two NMs, which correspond to two successive versions of the network. Such versions differ from each other by the removal of one edge between a pair of nodes. The results indicate that when the edge removal leads to a community splitting, $\delta$ has sharp peaks. We have observed that the peak height can be related both to the size of the original community and to the size of the resulting subcommunities.

We have implemented the evaluation of $\delta$ together with the successive edge elimination based on the NG criterion. For this purpose, we also presented an alternative way to evaluate the betweenness degree for all edges in a network, based on the systematic use of the information contained in the NM. Our procedure leads, as a by-product of the dendrogram evaluation, to the renumbered NM, which clearly reveals the community structure by the presence of diagonal blocks.


Finally, we want to stress that the distance concept introduced herein can also be used in connection with other criteria to detect modularity, due to the fact that the NM carries the whole information on the network structure used in the community separation.

Acknowledgments

This work has been partially supported by the CNPq (Brazilian Agency), grants n. 476325/2004 and 306369/2004-4. The authors acknowledge stimulating discussions with A. Góes-Neto, J. G. V. Miranda, C. N. El-Hani and L. F. Costa.

References

Ahuja, R. K., Magnanti, T. L. & Orlin, J. B. [1993] Network Flows: Theory, Algorithms, and Applications (Prentice Hall, Upper Saddle River, NJ).

Andrade, R. F. S., Miranda, J. G. V. & Petit Lobão, T. [2006] “Neighborhood properties of complex networks,” Phys. Rev. E 73, 046101.

Andrade, R. F. S., Miranda, J. G. V., Pinho, S. T. R. & Petit Lobão, T. [2008a] “Measuring distances between complex networks,” Phys. Lett. A 372, 5265–5269.

Andrade, R. F. S., Miranda, J. G. V., Pinho, S. T. R. & Petit Lobão, T. [2008b] “Characterization of complex networks by higher order neighborhood properties,” Eur. Phys. J. B 61, 247–256.

Bagrow, J. P. & Bollt, E. M. [2005] “Local method for detecting communities,” Phys. Rev. E 72, 046108.

Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D.-U. [2006] “Complex networks: Structure and dynamics,” Phys. Rep. 424, 175–308.

Clauset, A., Newman, M. E. J. & Moore, C. [2004] “Finding community structure in very large networks,” Phys. Rev. E 70, 066111.

Collins, S. R., Miller, K. M., Maas, N. L., Roguev, A., Fillingham, J., Chu, C. S., Schuldiner, M., Gebbia, M., Recht, J., Shales, M., Ding, H., Xu, H., Han, J., Ingvarsdottir, K., Cheng, B., Andrews, B., Boone, C., Berger, S. L., Hieter, P., Zhang, Z., Brown, G. W., Ingles, C. J., Emili, A., Allis, A. D., Toczyski, D. P., Weissman, J. S., Greenblatt, J. F. & Krogan, N. J. [2007] “Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map,” Nature 446, 806–810.

Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. [2001] Introduction to Algorithms, 2nd edition (MIT Press, Cambridge, MA).

Gavin, A. C., Boesche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J. M., Michon, A. M., Cruciat, C. M., Remor, M., Hoefert, C., Schelder, M., Brajenovic, M., Ruffner, H., Merino, A., Klein, K., Hudak, M., Dickson, D., Rudi, T., Gnau, V., Bauch, A., Bastuck, S., Huhse, B., Leutwein, C., Heurtier, M. A., Copley, R. R., Edelmann, A., Querfurth, E., Rybin, V., Drewes, G., Raida, M., Bouwmeester, T., Bork, P., Seraphin, B., Kuster, B., Neubauer, G. & Superti-Furga, G. [2002] “Functional organization of the yeast proteome by systematic analysis of protein complexes,” Nature 415, 141–147.

Gavin, A. C., Aloy, P., Grandi, P., Krause, R., Boesche, M., Marzioch, M., Rau, C., Jensen, L. J., Bastuck, S., Dümpelfeld, B., Edelmann, A., Heurtier, M. A., Hoffman, V., Hoefert, C., Klein, K., Hudak, M., Michon, A. M., Schelder, M., Schirle, M., Remor, M., Rudi, T., Hooper, S., Bauer, A., Bouwmeester, T., Casari, G., Drewes, G., Neubauer, G., Rick, J. M., Kuster, B., Bork, P., Russell, R. B. & Superti-Furga, G. [2006] “Proteome survey reveals modularity of the yeast cell machinery,” Nature 440, 631–636.

Góes-Neto, A., Diniz, M. V. C., Santos, L. B. L., Pinho, S. T. R., Miranda, J. G. V., Andrade, R. F. S. & Petit Lobão, T. [2008] “Comparative protein analysis of the chitin metabolic pathway in extant organisms: A complex network approach,” submitted for publication.

Guimerà, R. & Amaral, L. A. N. [2005] “Functional cartography of complex metabolic networks,” Nature 433, 895–900.

Han, J.-D. J., Bertin, N., Hao, T., Goldberg, D. S., Berriz, G. F., Zhang, L. V., Dupuy, D., Walhout, A. J. M., Cusick, M. E., Roth, F. P. & Vidal, M. [2004] “Evidence for dynamically organized modularity in the yeast protein-protein interaction network,” Nature 430, 88.

NCBI. See, for instance, the webpage of the National Center for Biotechnology Information at www.ncbi.nlm.nih.gov/

Newman, M. E. J. [2004a] “Detecting community structure in networks,” Eur. Phys. J. B 38, 321–330.

Newman, M. E. J. [2004b] “Fast algorithm for detecting community structure in networks,” Phys. Rev. E 69, 066133.

Newman, M. E. J. & Girvan, M. [2004] “Finding and evaluating community structure in networks,” Phys. Rev. E 69, 026113.

Sales-Pardo, M., Guimerà, R., Moreira, A. A. & Amaral, L. A. N. [2007] “Extracting the hierarchical organization of complex systems,” Proc. Nat. Acad. Sci. 104, 15224–15229.

Zachary, W. W. [1977] J. Anthropol. Res. 33, 452.
