vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods

DAVID F. GLEICH PURDUE

C. SESHADHRI SANDIA - LIVERMORE

KDD2012 David Gleich · Purdue

Neighborhoods are good communities

There is a vertex neighborhood with conductance at most (4-4𝜅)/(3-2𝜅), where 𝜅 is the clustering coefficient and the graph has a heavy-tailed degree distribution.

A vertex neighborhood is a “good” conductance community in a graph with a heavy-tailed degree distribution and large clustering coefficient.

Our contributions

1.  The previous theorem and its proof. This shows that good communities are expected and easy to find in modern networks with heavy-tailed degrees and large clustering.

2.  An empirical evaluation of neighborhood communities that shows vertex neighborhoods are the “backbone” of the network community profile.


Formal background for the theorem

1.  Vertex neighborhoods

2.  Low conductance cuts

3.  Clustering coefficients


Vertex neighborhoods

The set of a vertex and all its neighbors. Also called an “egonet”.

Prior research on egonets of social networks from the “structural holes” perspective [Burt95,Kleinberg08].

Used for anomaly detection [Akoglu10], community seeds [Huang11,Schaeffer11], and overlapping communities [Schaeffer07,Rees10].


Conductance communities

Conductance is one of the most important community scores [Schaeffer07].

The conductance of a set of vertices is the ratio of edges leaving the set to total edges in the set:

$\phi(S) = \dfrac{\mathrm{cut}(S)}{\min\left(\mathrm{vol}(S), \mathrm{vol}(\bar{S})\right)}$

Equivalently, it is the probability that a random edge leaves the set.

Small conductance ⇔ good community.

Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.
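The definition above can be sketched in a few lines of Python. This is only an illustration: the adjacency-dictionary representation, the function name `conductance`, and the small example graph (two triangles joined by one edge) are assumptions made here, not part of the talk.

```python
def conductance(adj, S):
    """Conductance of vertex set S in an undirected simple graph.

    adj maps each vertex to the set of its neighbors (an assumed
    representation chosen for this sketch).
    """
    S = set(S)
    # cut(S): edges with exactly one endpoint in S.
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    # vol(S): sum of degrees of the vertices in S.
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    denom = min(vol_S, vol_rest)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has zero volume")
    return cut / denom


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
barbell = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(barbell, {0, 1, 2}))  # cut = 1, vol = 7 on each side
```

Taking the minimum of the two volumes means the score always describes the smaller side of the cut.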

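Clustering coefficients

A wedge is a path on three vertices with a designated center; a wedge is closed when its two endpoints are also adjacent.

Global clustering coefficient:

$\kappa = \dfrac{\text{number of closed wedges}}{\text{number of wedges}}$

This is the probability that a random wedge is closed.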
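As a rough illustration of the wedge-counting definition, the following sketch enumerates every wedge by its center and checks whether it is closed. The adjacency-dictionary input and the function name are assumptions for this example; the quadratic enumeration per vertex is chosen for clarity, not speed.

```python
from itertools import combinations


def global_clustering(adj):
    """Fraction of wedges that are closed (global clustering coefficient).

    adj maps each vertex to the set of its neighbors (assumed representation).
    """
    wedges = 0
    closed = 0
    for center, nbrs in adj.items():
        # Every unordered pair of neighbors forms one wedge centered here.
        for a, b in combinations(nbrs, 2):
            wedges += 1
            if b in adj[a]:
                closed += 1
    return closed / wedges if wedges else 0.0


# Hypothetical example: a triangle {0,1,2} with a pendant vertex 3 attached to 2.
triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(global_clustering(triangle_with_tail))  # 3 closed wedges out of 5
```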


Simple version of theorem

If the global clustering coefficient = 1, then the graph is a disjoint union of cliques. Vertex neighborhoods are optimal communities!


Theorem

Condition: Let graph G have clustering coefficient 𝜅 and have vertex degrees bounded by a power-law function with exponent 𝛾 less than 3.

[Figure: log-log plot of probability versus degree; the degree distribution lies between $\alpha_1 n/d^{\gamma}$ and $\alpha_2 n/d^{\gamma}$.]

Theorem: Then there exists a vertex neighborhood with conductance

$\phi \le 4(1 - \kappa)/(3 - 2\kappa)$


Proof Sketch

1) Large clustering coefficient ⇒ many wedges are closed

2) Heavy-tailed degree distribution ⇒ a few vertices have a very large degree

3) Large degree d ⇒ O(d^2) wedges ⇒ “most” of the wedges

Thus, there must exist a vertex with a high edge density ⇒ “good” conductance.

Use the probabilistic method to formalize.

[Figure: CDF of the number of wedges (vertical axis, 0 to 1) versus degree (horizontal axis, log scale from 10^0 to 10^4).]


Confession: The theory is weak

$\phi(S) \le 4(1 - \kappa)/(3 - 2\kappa)$

Collaboration networks: 𝜅 ~ [0.1 – 0.5]

Social networks: 𝜅 ~ [0.05 – 0.1]

Graph             Verts    Edges     Avg. Deg.  Max Deg.  𝜅      C̄
ca-AstroPh        17903    196972    22.0       504       0.318  0.633
email-Enron       33696    180811    10.7       1383      0.085  0.509
cond-mat-2005     36458    171735    9.4        278       0.243  0.657
arxiv             86376    517563    12.0       1253      0.560  0.678
dblp              226413   716460    6.3        238       0.383  0.635
hollywood-2009    1069126  56306653  105.3      11467     0.310  0.766
fb-Penn94         41536    1362220   65.6       4410      0.098  0.212
fb-A-oneyear      1138557  4404989   7.7        695       0.038  0.060
fb-A              3097165  23667394  15.3       4915      0.048  0.097
soc-LiveJournal1  4843953  42845684  17.7       20333     0.118  0.274
oregon2-010526    11461    32730     5.7        2432      0.037  0.352
p2p-Gnutella25    22663    54693     4.8        66        0.005  0.005
as-22july06       22963    48436     4.2        2390      0.011  0.230
itdk0304          190914   607610    6.4        1071      0.061  0.158


Tech. networks: 𝜅 ~ [0.005 – 0.05]

This bound is useless unless 𝜅 ≥ 1/2
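A quick numeric check makes the point concrete. Conductance never exceeds 1, so the bound 4(1 − 𝜅)/(3 − 2𝜅) only says something when it drops below 1, and 4 − 4𝜅 < 3 − 2𝜅 holds exactly when 𝜅 > 1/2. The function name below is an illustrative assumption; the sample 𝜅 values are taken from the ranges quoted on this slide.

```python
def conductance_bound(kappa):
    """Evaluate the bound 4(1 - kappa) / (3 - 2 kappa) for kappa in [0, 1]."""
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("clustering coefficient must lie in [0, 1]")
    return 4.0 * (1.0 - kappa) / (3.0 - 2.0 * kappa)


for kappa in (0.005, 0.05, 0.1, 0.5, 0.56, 1.0):
    bound = conductance_bound(kappa)
    # A bound of 1 or more carries no information, since conductance <= 1.
    status = "informative" if bound < 1.0 else "vacuous"
    print(f"kappa = {kappa:5.3f}  bound = {bound:.3f}  ({status})")
```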


We view this theory as “intuition for the truth”


Empirical Evaluation using Network Community Profiles

large global clustering coefficients and large mean clustering coefficients.

Social networks. The nodes are people again, and the edges are either explicit “friend” relationships (fb-Penn94 [36], fb-A [39], soc-LiveJournal [4]) or observed network activity over edges in a one-year span (fb-A-oneyear [39]).

Technological networks. The nodes act in a communication network either as agents (p2p-Gnutella25 [27]) or as routers (oregon2 [23], as-22july06 [29], itdk0304 [10]). The edges are observed communications between the nodes.

Web graphs. The nodes are web pages, and the edges are symmetrized links between the pages [25].

6. EMPIRICAL NEIGHBORHOOD COMMUNITIES

6.1 Computation
We first show that we can adapt any procedure to compute all local clustering coefficients to compute the conductance scores for each neighborhood in the graph. Most of the work to compute a local clustering coefficient is performed when finding the number of triangles at the vertex. We can express the number of triangles with v as:

$\mathrm{edges}(N_1(v) \setminus \{v\})/2$

because each edge among v’s neighbors produces a triangle (recall that the edges function double-counts). Note also that

$\mathrm{edges}(N_1(v) \setminus \{v\})/2 = \mathrm{edges}(N_1(v))/2 - d_v.$

Then

$\mathrm{cut}(N_1(v)) = \mathrm{vol}(N_1(v)) - \mathrm{edges}(N_1(v)).$

And so, given the number of triangles, we can compute the cut given the volume of the neighborhood as well. This is easy to do with any graph structure that explicitly stores the degrees.
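The identities in Section 6.1 translate almost directly into code. The sketch below is one possible reading, not the authors' implementation: it assumes an adjacency-dictionary graph, counts triangles naively, and returns `None` for a neighborhood whose complement has zero volume. The function name and example graph are illustrative.

```python
def neighborhood_conductances(adj):
    """Return {v: (size, conductance)} for every vertex neighborhood N_1(v).

    Uses edges(N_1(v)) = 2 * (triangles at v + d_v) and
    cut(N_1(v)) = vol(N_1(v)) - edges(N_1(v)).
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    total_vol = sum(deg.values())
    scores = {}
    for v, nbrs in adj.items():
        # Each edge among v's neighbors is seen from both ends, hence // 2.
        triangles = sum(1 for a in nbrs for b in adj[a] if b in nbrs) // 2
        vol = deg[v] + sum(deg[u] for u in nbrs)
        internal = 2 * (triangles + deg[v])  # double-counted internal edges
        cut = vol - internal
        denom = min(vol, total_vol - vol)
        scores[v] = (len(nbrs) + 1, cut / denom if denom > 0 else None)
    return scores


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
barbell = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
for v, (size, phi) in sorted(neighborhood_conductances(barbell).items()):
    print(v, size, phi)
```

Only degrees and per-vertex triangle counts are needed, which is why any local-clustering-coefficient routine can be reused for this computation.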

6.2 Quality of neighborhood communities
We use Leskovec et al.’s [24] network community plot to show the information on all neighborhood communities simultaneously. These plots will help us understand if the neighborhood communities are high quality (low conductance), and how they compare to other community detection methods. Given the conductance scores from all the neighborhood communities and their size in terms of number of vertices, we first identify the best community at each size. The network community plot shows the relationship between best community conductance and community size on a log-log scale. Leskovec et al. found that these plots had a characteristic shape for modern information networks: an initial sharp decrease until the community size is between 100 and 1000, then a considerable rise in the conductance scores for larger communities. In our case, neighborhood communities cannot be any larger than the maximum degree plus one, and so we mark this point on the figures. We always look at the smaller side of the cut, so no community can be larger than half the vertices of the graph. We also mark this location on the plots. Each subsequent figure in this paper utilizes this size-vs-conductance plot, and we will continually layer information from new methods above results from old methods. The results are information-dense plots that need slightly more study than would be ideal; however, we point out the salient features in each plot in the text. Note also that we deliberately attempt to preserve the axes limits across figures to promote comparisons. However, some figures have different axis limits to exhibit the range of data.
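The step "identify the best community at each size" is a simple reduction over (size, conductance) pairs. A minimal sketch follows; the function name and the input format are assumptions for illustration, and the actual plotting on log-log axes is left to whatever plotting tool is at hand.

```python
def community_profile(communities):
    """Best (minimum) conductance at each community size.

    communities: iterable of (size, conductance) pairs, for example one
    pair per vertex neighborhood. Returns a list of (size, best conductance)
    sorted by size, ready to plot on log-log axes.
    """
    best = {}
    for size, phi in communities:
        if size not in best or phi < best[size]:
            best[size] = phi
    return sorted(best.items())


# Hypothetical scores for a handful of neighborhoods.
scores = [(3, 0.5), (3, 0.2), (4, 0.7), (2, 1.0), (4, 0.9)]
print(community_profile(scores))
```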

[Figure 2 panels, left and right in each row: web-Google, itdk0304; fb-A-oneyear, arxiv; soc-LiveJournal1, ca-AstroPh. Horizontal axis: number of vertices in cluster (log scale). Vertical axis: conductance (log scale). Marked locations: maxdeg and verts/2.]

Figure 2: The best neighborhood community conductance at each size (black) and the Fiedler community (red). (Note the axis limits on ca-AstroPh.)

First, we show these network community plots for six of the networks in Figure 2. These figures are representative of the best and worst of our results. Plots for other graphs are available on the website given in the introduction.

The three graphs on the left show cases where a neighborhood community is, or is near, the best Fiedler community (the red circle). The three graphs on the right highlight instances where the Fiedler community is much better than any neighborhood community. We find it mildly surprising that these neighborhood communities can be as good as the Fiedler community. The structure of the plot for both fb-A-oneyear and soc-LiveJournal1 is instructive. Neighborhoods of the highest degree vertices are not community-like, suggesting that these nodes are somehow exceptional. In fact, by inspection of these communities, many of them are nearly a star graph. However, a few of the large degree nodes define strikingly good communities (these are sets with a few hundred vertices with conductance scores of around $10^{-2}$). This evidence concurs with the intuition from Theorem 4.6.

6.3 Comparison to PPR communities
Note that these plots show the same shape as observed by Leskovec et al. [24]. Consequently, in the next set of figures, and in the remainder of the empirical investigation,

Horizontal axis: community size. Vertical axis: minimum conductance for any community of the given size.

Canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney. Holds for a variety of approximations to conductance.


Empirical Evaluation using Network Community Profiles

















Horizontal axis: community size (degree + 1). Vertical axis: minimum conductance for any neighborhood of the given size.

“Egonet community profile” shows the same shape, 3 secs to compute.

1.1M verts, 4M edges

The Fiedler community computed from the normalized Laplacian is a neighborhood!


Facebook data from Wilson et al. 2009

Not just one graph

















arXiv – 86k verts, 500k edges soc-LiveJournal – 5M verts, 42M edges

15 more graphs available at www.cs.purdue.edu/~dgleich/codes/neighborhoods

Filling in the Network Community Profile

















Vertical axis: minimum conductance for any neighborhood of the given size. Horizontal axis: community size (degree + 1).

We are missing a region of the NCP when we just look at neighborhoods.

Personalized PageRank Communities [Andersen06]

To find the canonical NCP structure, Leskovec et al. used a personalized PageRank (PPR) based community finder. These methods start with a single seed vertex and then expand the community based on the solution of a personalized PageRank problem. The resulting community satisfies a local Cheeger inequality.

This needs to run thousands of times to produce an NCP.
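To make the seed-then-expand idea concrete, here is a simplified sketch. It is not the local push procedure of [Andersen06]: it approximates the personalized PageRank vector with plain power iteration over the whole graph, then applies a sweep cut over vertices ordered by degree-normalized score. The teleportation value `alpha=0.15`, the iteration count, the function names, and the example graph are all assumptions, and the graph is assumed to have no isolated vertices.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Approximate PPR from one seed by power iteration (dense, for clarity)."""
    x = {v: 0.0 for v in adj}
    x[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha  # teleport back to the seed
        for u, nbrs in adj.items():
            share = (1.0 - alpha) * x[u] / len(nbrs)
            for w in nbrs:
                nxt[w] += share  # random-walk step
        x = nxt
    return x


def sweep_cut(adj, x):
    """Best-conductance prefix of vertices sorted by x[v] / degree(v)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    total_vol = sum(deg.values())
    order = sorted((v for v in adj if x[v] > 0),
                   key=lambda v: x[v] / deg[v], reverse=True)
    inside, vol, cut = set(), 0, 0
    best_phi, best_size = float("inf"), 0
    for i, v in enumerate(order, start=1):
        links_in = sum(1 for w in adj[v] if w in inside)
        cut += deg[v] - 2 * links_in  # new cut edges minus newly internal ones
        vol += deg[v]
        inside.add(v)
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break
        if cut / denom < best_phi:
            best_phi, best_size = cut / denom, i
    return set(order[:best_size]), best_phi


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
barbell = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
community, phi = sweep_cut(barbell, personalized_pagerank(barbell, seed=0))
print(sorted(community), phi)
```

A full NCP repeats this from many seeds and parameter settings, which is where the cost noted on the slide comes from.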



Filling in the Network Community Profile

















Vertical axis: minimum conductance for any community of the given size. Horizontal axis: community size.

7807 seconds to compute.

This region fills when using the PPR method (like now!)



Am I a good seed? Locally Minimal Communities

“My conductance is the best locally.”

$\phi(N(v)) \le \phi(N(w))$ for all w adjacent to v

In Zachary’s Karate Club network, there are four locally minimal communities, the two leaders and two peripheral nodes.
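The locally minimal condition can be checked vertex by vertex once every neighborhood has a conductance score. The sketch below assumes an adjacency-dictionary graph and a non-strict comparison (ties count as locally minimal); the function names and the small example graph are illustrative and are not the Karate Club data.

```python
def neighborhood_conductance(adj, v):
    """Conductance of the set containing v and all of its neighbors."""
    S = adj[v] | {v}
    cut = sum(1 for u in S for w in adj[u] if w not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    denom = min(vol_S, vol_rest)
    # Treat a neighborhood that swallows the whole graph as a useless seed.
    return cut / denom if denom > 0 else float("inf")


def locally_minimal_seeds(adj):
    """Vertices whose neighborhood conductance is no worse than any neighbor's."""
    phi = {v: neighborhood_conductance(adj, v) for v in adj}
    return [v for v in adj if all(phi[v] <= phi[w] for w in adj[v])]


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge 2-3.
barbell = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(locally_minimal_seeds(barbell))
```

In this toy graph the two bridge endpoints are rejected because their neighborhoods straddle both triangles, while the remaining vertices qualify.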



Locally minimal communities capture extremal neighborhoods

















Red dots are the conductance and size of a locally minimal community. Usually about 1% of # of vertices.

The red circles – the best local mins – find the extremes in the egonet profile.

Horizontal axis: community size.


Filling in the NCP: Growing locally minimal communities

















Growing only locally minimal communities: 283 seconds vs. 7807 seconds.

Legend: Full NCP, Locally min NCP, Original Egonet.

Horizontal axis: community size.


Filling in the NCP: Growing locally minimal communities

Growing only locally minimal communities: 143 seconds vs. 2211 seconds.

Legend: Original Egonets. Graph: arXiv – 86k verts, 500k edges.

Horizontal axis: community size.

Recap

A theorem relating clustering, heavy-tailed degrees, and low-conductance cuts of vertex neighborhoods.

Empirical evaluation of vertex neighborhoods. More on k-cores in the paper.

⇒ Many communities are easy to find!

⇒ Explains success of community detection?

Acknowledgements: David supported by NSF CAREER award 1149756-CCF. Sesh supported by the Sandia LDRD program (project 158477) and the applied mathematics program at the Dept. of Energy.


Code and results available online: www.cs.purdue.edu/~dgleich/codes/neighborhoods

Two words on computing

Can be done by just counting the triangles at each node. Linear complexity in |E| in a power-law graph. It’s possible to do this in MapReduce too.



Filling in the NCP: Growing k-cores

Growing locally minimal communities and k-cores: 143 seconds vs. 2211 seconds.

Legend: Original Egonets, PPR grown k-cores, k-cores. Graph: arXiv – 86k verts, 500k edges.

Horizontal axis: community size.

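Clustering coefficients

A wedge is a path on three vertices with a designated center; a wedge is closed when its two endpoints are also adjacent.

Global clustering coefficient:

$\kappa = \dfrac{\text{number of closed wedges}}{\text{number of wedges}}$

Local clustering coefficient:

$C_v = \dfrac{\text{number of closed wedges centered at } v}{\text{number of wedges centered at } v}$

The global coefficient is the probability that a random wedge is closed.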
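For comparison with the global coefficient, a small sketch of the local coefficient and its mean over all vertices follows. Assigning 0 to vertices of degree below 2 (which have no wedges) is a convention assumed here, as are the function names and the example graph.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of wedges centered at v that are closed."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention: no wedges means coefficient 0
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / (d * (d - 1) // 2)


def mean_local_clustering(adj):
    """Average of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Hypothetical example: a triangle {0,1,2} with a pendant vertex 3 attached to 2.
triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print([local_clustering(triangle_with_tail, v) for v in range(4)])
print(mean_local_clustering(triangle_with_tail))
```

The mean of the local values weights every vertex equally, whereas the global coefficient weights every wedge equally, so high-degree vertices dominate the global value.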

