vertex neighborhoods, low conductance cuts, and good seeds for local community methods

David F. Gleich (Purdue) · C. Seshadhri (Sandia, Livermore) · KDD2012

Upload: david-gleich

Post on 15-Jan-2015


DESCRIPTION

My talk from KDD2012 about vertex neighborhoods and low conductance cuts. See the paper here: http://arxiv.org/abs/1112.0031 and http://dl.acm.org/citation.cfm?id=2339628

TRANSCRIPT

Page 1: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods

DAVID F. GLEICH PURDUE

C. SESHADHRI SANDIA - LIVERMORE

KDD2012 David Gleich · Purdue

Page 2: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Neighborhoods are good communities

Page 3: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Vertex neighborhoods are good conductance communities

Page 4: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

A vertex neighborhood is a (4-4𝜅)/(3-2𝜅) conductance community

Page 5: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

A vertex neighborhood is a (4-4𝜅)/(3-2𝜅) conductance community, where 𝜅 is the clustering coefficient and the graph has a heavy-tailed degree distribution

Page 6: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

A vertex neighborhood is a (4-4𝜅)/(3-2𝜅) conductance community, where 𝜅 is the clustering coefficient and the graph has a heavy-tailed degree distribution

Page 7: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

A vertex neighborhood is a “good” conductance community in a graph with a heavy-tailed degree distribution and large clustering coefficient.

Page 8: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Our contributions

1.  The previous theorem and its proof. This shows that good communities are expected and easy to find in modern networks with heavy-tailed degrees and large clustering.

2.  An empirical evaluation of neighborhood communities that shows vertex neighborhoods are the “backbone” of the network community profile.


Page 9: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Formal background for the theorem

1.  Vertex neighborhoods

2.  Low conductance cuts

3.  Clustering coefficients

Page 10: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Vertex neighborhoods

The set of a vertex and all its neighbors. Also called an “egonet”.

Prior research on egonets of social networks from the “structural holes” perspective [Burt95, Kleinberg08].

Used for anomaly detection [Akoglu10], community seeds [Huang11, Schaeffer11], overlapping communities [Schaeffer07, Rees10].

Page 11: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Conductance communities

Conductance is one of the most important community scores [Schaeffer07].

The conductance of a set of vertices is the ratio of edges leaving to total edges:

$\phi(S) = \dfrac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}$ = (edges leaving the set) / (total edges in the set)

Equivalently, it’s the probability that a random edge leaves the set.

Small conductance ⇔ Good community

Example: $\mathrm{cut}(S) = 7$, $\mathrm{vol}(S) = 33$, $\mathrm{vol}(\bar{S}) = 11$, so $\phi(S) = 7/11$.
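The definition above can be checked on a small graph. The following is a minimal sketch, not code from the talk: the function name `conductance` and the edge-list input format are assumptions made for illustration, and the graph is taken to be undirected with no self-loops.

```python
from collections import defaultdict


def conductance(edges, S):
    """Conductance of vertex set S in an undirected graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    S: a non-empty proper subset of the vertices (hypothetical input format).
    """
    S = set(S)
    degree = defaultdict(int)
    cut = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        # An edge is cut when exactly one endpoint lies in S.
        if (u in S) != (v in S):
            cut += 1
    vol_S = sum(d for x, d in degree.items() if x in S)
    vol_rest = sum(degree.values()) - vol_S
    # Divide by the volume of the smaller side, as in the definition.
    return cut / min(vol_S, vol_rest)


# Two triangles joined by the single edge (2, 3).
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
print(conductance(example_edges, {0, 1, 2}))
```

For the set {0, 1, 2} there is one cut edge and each side has volume 7, so the score is 1/7: a single triangle is a good community in this toy graph.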

Page 12: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Clustering coefficients

Global clustering coefficient = (number of closed wedges) / (number of wedges)

Probability that a random wedge is closed

[Figure labels: wedge, center of wedge, closed wedge]
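The wedge-counting definition can be written out directly. This is an illustrative sketch under assumptions, not the authors' code: the name `global_clustering` and the adjacency-dictionary input are hypothetical, and the brute-force enumeration is only meant for small graphs.

```python
from itertools import combinations


def global_clustering(adj):
    """Fraction of wedges that are closed.

    adj: dict mapping each vertex to the set of its neighbors
    (undirected, no self-loops; hypothetical input format).
    """
    wedges = 0
    closed = 0
    for center, nbrs in adj.items():
        # Every unordered pair of neighbors forms a wedge centered here.
        for a, b in combinations(nbrs, 2):
            wedges += 1
            if b in adj[a]:
                closed += 1
    return closed / wedges if wedges else 0.0


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
example_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(global_clustering(example_adj))
```

In the example there are five wedges (one each at vertices 0 and 1, three at vertex 2) and three of them are closed, so the coefficient is 3/5.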

Page 13: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Simple version of theorem

If global clustering coefficient = 1, then the graph is a disjoint union of cliques.

Vertex neighborhoods are optimal communities!

Page 14: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Theorem

Condition: Let graph G have clustering coefficient 𝜅 and have vertex degrees bounded by a power-law function with exponent 𝛾 less than 3.

Theorem: Then there exists a vertex neighborhood with conductance

$\phi \le 4(1 - \kappa)/(3 - 2\kappa)$

[Figure: log-log plot of probability versus degree, with the degree distribution bounded between $\alpha_1 n/d^{\gamma}$ and $\alpha_2 n/d^{\gamma}$]

Page 15: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Proof Sketch

1)  Large clustering coefficient ⇒ many wedges are closed

2)  Heavy-tailed degree distribution ⇒ a few vertices have a very large degree

3)  Large degree ⇒ $O(d^2)$ wedges ⇒ “most” of the wedges

Thus, there must exist a vertex with a high edge density ⇒ “good” conductance.

Use the probabilistic method to formalize.

[Figure: CDF of number of wedges versus degree; degree axis on a log scale from 10^0 to 10^4, CDF axis from 0 to 1]

Page 16: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Confession! The theory is weak

$\phi(S) \le 4(1 - \kappa)/(3 - 2\kappa)$

This bound is useless unless 𝜅 ≥ 1/2

Collaboration networks: 𝜅 ~ [0.1 – 0.5]

Social networks: 𝜅 ~ [0.05 – 0.1]

Tech. networks: 𝜅 ~ [0.005 – 0.05]

Graph             Verts    Edges     Avg. Deg.  Max Deg.  𝜅      C̄
ca-AstroPh        17903    196972    22.0       504       0.318  0.633
email-Enron       33696    180811    10.7       1383      0.085  0.509
cond-mat-2005     36458    171735    9.4        278       0.243  0.657
arxiv             86376    517563    12.0       1253      0.560  0.678
dblp              226413   716460    6.3        238       0.383  0.635
hollywood-2009    1069126  56306653  105.3      11467     0.310  0.766
fb-Penn94         41536    1362220   65.6       4410      0.098  0.212
fb-A-oneyear      1138557  4404989   7.7        695       0.038  0.060
fb-A              3097165  23667394  15.3       4915      0.048  0.097
soc-LiveJournal1  4843953  42845684  17.7       20333     0.118  0.274
oregon2-010526    11461    32730     5.7        2432      0.037  0.352
p2p-Gnutella25    22663    54693     4.8        66        0.005  0.005
as-22july06       22963    48436     4.2        2390      0.011  0.230
itdk0304          190914   607610    6.4        1071      0.061  0.158
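The 1/2 threshold follows from asking when the bound drops below 1, which is the trivial upper limit for any conductance score. A short worked derivation, with two clustering coefficients from the table (ca-AstroPh and arxiv) plugged in as illustration:

```latex
\begin{align*}
\frac{4(1-\kappa)}{3-2\kappa} < 1
  &\iff 4 - 4\kappa < 3 - 2\kappa
     && \text{(the denominator } 3 - 2\kappa \text{ is positive for } \kappa \le 1\text{)} \\
  &\iff \kappa > \tfrac{1}{2}. \\[4pt]
\kappa = 0.318:\quad
  \frac{4(1-0.318)}{3-2(0.318)} &= \frac{2.728}{2.364} \approx 1.154
     && \text{(no information)} \\
\kappa = 0.560:\quad
  \frac{4(1-0.560)}{3-2(0.560)} &= \frac{1.760}{1.880} \approx 0.936
     && \text{(barely below 1)}
\end{align*}
```

At exactly 𝜅 = 1/2 the bound equals 1, and at 𝜅 = 1 it equals 0, which matches the disjoint-union-of-cliques case.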

Page 17: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

We view this theory as “intuition for the truth”

Page 18: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Empirical Evaluation using Network Community Profiles

large global clustering coefficients and large mean clustering coefficients.

Social networks. The nodes are people again, and the edges are either explicit “friend” relationships (fb-Penn94 [36], fb-A [39], soc-LiveJournal [4]) or observed network activity over edges in a one-year span (fb-A-oneyear [39]).

Technological networks. The nodes act in a communication network either as agents (p2p-Gnutella25 [27]) or as routers (oregon2 [23], as-22july06 [29], itdk0304 [10]). The edges are observed communications between the nodes.

Web graphs. The nodes are web pages, and the edges are symmetrized links between the pages [25].

6. EMPIRICAL NEIGHBORHOOD COMMUNITIES

6.1 Computation

We first show that we can adapt any procedure to compute all local clustering coefficients to compute the conductance scores for each neighborhood in the graph. Most of the work to compute a local clustering coefficient is performed when finding the number of triangles at the vertex. We can express the number of triangles with v as:

$\mathrm{edges}(N_1(v) \setminus \{v\})/2$

because each edge among v’s neighbors produces a triangle (recall that the edges function double-counts). Note also that $\mathrm{edges}(N_1(v) \setminus \{v\})/2 = \mathrm{edges}(N_1(v))/2 - d_v$. Then $\mathrm{cut}(N_1(v)) = \mathrm{vol}(N_1(v)) - \mathrm{edges}(N_1(v))$. And so, given the number of triangles, we can compute the cut given the volume of the neighborhood as well. This is easy to do with any graph structure that explicitly stores the degrees.
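The identities in this subsection translate into a few lines of code. The sketch below is an illustration under assumptions rather than the authors' implementation: the function name `neighborhood_conductances` and the adjacency-dictionary format are hypothetical, and the triangle count uses a simple nested loop instead of a tuned clustering-coefficient routine.

```python
def neighborhood_conductances(adj):
    """Size and conductance of every vertex neighborhood N_1(v).

    adj: dict mapping each vertex to the set of its neighbors
    (undirected, no self-loops; hypothetical input format).
    Returns {v: (number of vertices in N_1(v), conductance)}.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    total_vol = sum(degree.values())
    result = {}
    for v, nbrs in adj.items():
        # Triangles at v = edges among v's neighbors; the double loop
        # sees each such edge from both endpoints, hence the // 2.
        triangles = sum(1 for u in nbrs for w in adj[u] if w in nbrs) // 2
        vol = degree[v] + sum(degree[u] for u in nbrs)
        # edges(N_1(v)) double-counts: triangle edges plus the d_v edges at v.
        internal = 2 * (triangles + degree[v])
        cut = vol - internal
        # Score against the smaller side of the cut.
        denom = min(vol, total_vol - vol)
        phi = cut / denom if denom > 0 else float("nan")
        result[v] = (len(nbrs) + 1, phi)
    return result


# Two triangles joined by the edge between vertices 2 and 3.
example_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
               3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for vertex, (size, phi) in sorted(neighborhood_conductances(example_adj).items()):
    print(vertex, size, phi)
```

Only degrees and per-vertex triangle counts are needed, which is why the computation can piggyback on any routine that already produces local clustering coefficients.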

6.2 Quality of neighborhood communities

We use Leskovec et al.’s [24] network community plot to show the information on all neighborhood communities simultaneously. These plots will help us understand if the neighborhood communities are high quality (low conductance), and how they compare to other community detection methods. Given the conductance scores from all the neighborhood communities and their size in terms of number of vertices, we first identify the best community at each size. The network community plot shows the relationship between best community conductance and community size on a log-log scale. Leskovec et al. found that these plots had a characteristic shape for modern information networks: an initial sharp decrease until the community size is between 100 and 1000, then a considerable rise in the conductance scores for larger communities. In our case, neighborhood communities cannot be any larger than the maximum degree plus one, and so we mark this point on the figures. We always look at the smaller side of the cut, so no community can be larger than half the vertices of the graph. We also mark this location on the plots. Each subsequent figure in this paper utilizes this size-vs-conductance plot, and we will continually layer information from new methods above results from old methods. The results are information-dense plots that need slightly more study than would be ideal; however, we point out the salient features in each plot in the text. Note also that we deliberately attempt to preserve the axis limits across figures to promote comparisons. However, some figures have different axis limits to exhibit the range of data.
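The step "identify the best community at each size" is a group-by-minimum over (size, conductance) pairs. A minimal sketch follows, assuming the scores are already available as pairs; the function name `community_profile` is hypothetical and the plotting itself is left out.

```python
def community_profile(scored):
    """Best (lowest) conductance observed at each community size.

    scored: iterable of (size, conductance) pairs, e.g. one pair per
    vertex neighborhood (hypothetical input format).
    Returns a list of (size, best conductance), sorted by size.
    """
    best = {}
    for size, phi in scored:
        # Keep the minimum conductance seen for this size.
        if size not in best or phi < best[size]:
            best[size] = phi
    return sorted(best.items())


profile = community_profile([(3, 0.5), (3, 0.2), (4, 0.7), (10, 0.05)])
print(profile)
```

Drawing the resulting points on log-log axes gives the size-vs-conductance plot described above.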

[Figure 2: The best neighborhood community conductance at each size (black) and the Fiedler community (red). (Note the axis limits on ca-AstroPh.) Panels, left and right in each row: web-Google and itdk0304; fb-A-oneyear and arxiv; soc-LiveJournal1 and ca-AstroPh. Horizontal axis: number of vertices in cluster, log scale from 10^0 to 10^5, with markers at maxdeg and verts/2. Vertical axis: log scale from 10^-4 to 10^0 (10^-2 to 10^0 for ca-AstroPh).]

First, we show these network community plots for six of the networks in Figure 2. These figures are representative of the best and worst of our results. Plots for other graphs are available on the website given in the introduction.

The three graphs on the left show cases where a neighborhood community is, or is near, the best Fiedler community (the red circle). The three graphs on the right highlight instances where the Fiedler community is much better than any neighborhood community. We find it mildly surprising that these neighborhood communities can be as good as the Fiedler community. The structure of the plot for both fb-A-oneyear and soc-LiveJournal1 is instructive. Neighborhoods of the highest-degree vertices are not community-like, suggesting that these nodes are somehow exceptional. In fact, by inspection of these communities, many of them are nearly a star graph. However, a few of the large-degree nodes define strikingly good communities (these are sets with a few hundred vertices with conductance scores of around $10^{-2}$). This evidence concurs with the intuition from Theorem 4.6.

6.3 Comparison to PPR communities

Note that these plots show the same shape as observed by Leskovec et al. [24]. Consequently, in the next set of figures, and in the remainder of the empirical investigation,

Axes: community size versus minimum conductance for any community of the given size.

Canonical shape found by Leskovec, Lang, Dasgupta, and Mahoney.

Holds for a variety of approximations to conductance.

Page 19: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Empirical Evaluation using Network Community Profiles


Axes: community size (degree + 1) versus minimum conductance for any neighborhood of the given size.

“Egonet community profile” shows the same shape, 3 secs to compute.

1.1M verts, 4M edges. Facebook data from Wilson et al. 2009.

The Fiedler community computed from the normalized Laplacian is a neighborhood!

Page 20: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Not just one graph


arXiv: 86k verts, 500k edges

soc-LiveJournal: 5M verts, 42M edges

15 more graphs available: www.cs.purdue.edu/~dgleich/codes/neighborhoods

Page 21: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Filling in the Network Community Profile

large global clustering coe�cients and large mean clusteringcoe�cients.

Social networks. The nodes are people again, and theedges are either explicit “friend” relationships (fb-Penn94 [36],fb-A [39], soc-LiveJournal [4]) or observed network activityover edges in a one-year span (fb-A-oneyear [39]).

Technological networks. The nodes act in a commu-nication network either as agents (p2p-Gnutella25 [27]) or asrouters (oregon2 [23], as-22july06 [29], itdk0304 [10]). Theedges are observed communications between the nodes.

Web graphs. The nodes are web-pages, and the edgesare symmetrized links between the pages [25].

6. EMPIRICAL NEIGHBORHOODCOMMUNITIES

6.1 ComputationWe first show that we can adapt any procedure to compute

all local clustering coe�cients to compute the conductancescores for each neighborhood in the graph. Most of the workto compute a local clustering coe�cient is performed whenfinding the number of triangles at the vertex. We can expressthe number of triangles with v as:

edges(N1

(v) \ {v})/2

because each edge among v’s neighbors produces a triangle(recall that the edges function double-counts). Note alsothat edges(N

1

(v) \ {v})/2 = edges(N1

(v))/2 � dv. Thencut(N

1

(v)) = vol(N1

(v)) � edges(N1

(v)). And so, giventhe number of triangles, we can compute the cut given thevolume of the neighborhood as well. This is easy to do withany graph structure that explicitly stores the degrees.

6.2 Quality of neighborhood communities

We use Leskovec et al.'s [24] network community plot to show the information on all neighborhood communities simultaneously. These plots help us understand whether the neighborhood communities are high quality (low conductance) and how they compare to other community detection methods. Given the conductance scores of all the neighborhood communities and their sizes in number of vertices, we first identify the best community at each size. The network community plot shows the relationship between best community conductance and community size on a log-log scale. Leskovec et al. found that these plots have a characteristic shape for modern information networks: an initial sharp decrease until the community size is between 100 and 1000, then a considerable rise in the conductance scores for larger communities. In our case, neighborhood communities cannot be any larger than the maximum degree plus one, and so we mark this point on the figures. We always look at the smaller side of the cut, so no community can be larger than half the vertices of the graph. We also mark this location on the plots. Each subsequent figure in this paper uses this size-vs-conductance plot, and we continually layer information from new methods above results from old methods. The results are information-dense plots that need slightly more study than would be ideal; however, we point out the salient features of each plot in the text. Note also that we deliberately attempt to preserve the axis limits across figures to promote comparisons. However, some figures have different axis limits to exhibit the range of data.
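The "best community at each size" step can be sketched as follows. This is an illustrative reduction only: the function name `network_community_profile` and the input format, a list of (size, conductance) pairs, are assumptions, and plotting on log-log axes is left out.

```python
def network_community_profile(communities):
    """Keep the minimum conductance seen at each community size.

    communities: iterable of (size, conductance) pairs, for example one
                 pair per vertex neighborhood (hypothetical input format).
    Returns a list of (size, best_conductance) pairs sorted by size.
    """
    best = {}
    for size, phi in communities:
        if size not in best or phi < best[size]:
            best[size] = phi
    return sorted(best.items())


# Hypothetical scores: three communities of size 3, one of size 4.
profile = network_community_profile([(3, 0.5), (3, 0.2), (4, 0.7), (3, 0.9)])
print(profile)
```

Layering a second method onto the same plot then amounts to calling the function again on that method's (size, conductance) pairs.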

[Figure 2 panels: web-Google, itdk0304, fb-A-oneyear, arxiv, soc-LiveJournal1, ca-AstroPh. Log-log axes: number of vertices in cluster (x) against conductance (y), with markers at maxdeg and verts/2.]

Figure 2: The best neighborhood community conductance at each size (black) and the Fiedler community (red). (Note the axis limits on ca-AstroPh.)

First, we show these network community plots for six of the networks in Figure 2. These figures are representative of the best and worst of our results. Plots for other graphs are available on the website given in the introduction.

The three graphs on the left show cases where a neighborhood community is, or is close to, the best Fiedler community (the red circle). The three graphs on the right highlight instances where the Fiedler community is much better than any neighborhood community. We find it mildly surprising that these neighborhood communities can be as good as the Fiedler community. The structure of the plot for both fb-A-oneyear and soc-LiveJournal1 is instructive. Neighborhoods of the highest degree vertices are not community-like, suggesting that these nodes are somehow exceptional. In fact, by inspection of these communities, many of them are nearly a star graph. However, a few of the large degree nodes define strikingly good communities (these are sets with a few hundred vertices with conductance scores of around $10^{-2}$). This evidence concurs with the intuition from Theorem 4.6.

6.3 Comparison to PPR communities

Note that these plots show the same shape as observed by Leskovec et al. [24]. Consequently, in the next set of figures, and in the remainder of the empirical investigation,

Minimum conductance for any community

neighborhood of the given size

We are missing a region of the NCP when we just look at neighborhoods


Community Size (Degree + 1)

Page 22: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Personalized PageRank Communities [Andersen06]

To find the canonical NCP structure, Leskovec et al. used a personalized PageRank based community finder. These methods start with a single vertex seed and then expand the community based on the solution of a personalized PageRank problem. The resulting community satisfies a local Cheeger inequality. This needs to run thousands of times for an NCP.
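A seed-and-expand method of this kind can be sketched with a push-style approximate personalized PageRank followed by a sweep cut. This is one common way to realize such a finder and is offered as an illustration only, not as the code used for the talk. The function names, the dict-of-sets graph, and the parameter values `alpha` and `eps` are assumptions; the sketch also assumes no isolated vertices.

```python
from collections import deque


def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank from a single seed."""
    p, r = {}, {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        du, ru = len(adj[u]), r.get(u, 0.0)
        if ru < eps * du:
            continue  # stale queue entry
        # Move an alpha fraction to p, keep half of the rest at u,
        # and spread the other half evenly over u's neighbors.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * du)
        for w in adj[u]:
            old = r.get(w, 0.0)
            r[w] = old + share
            if old < eps * len(adj[w]) <= r[w]:
                queue.append(w)  # w just crossed its push threshold
        if r[u] >= eps * du:
            queue.append(u)
    return p


def sweep_cut(adj, p):
    """Best-conductance prefix of vertices ordered by p[v] / degree(v)."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(p, key=lambda v: p[v] / len(adj[v]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best_phi, best_k = float("inf"), 0
    for k, v in enumerate(order, start=1):
        inside = sum(1 for w in adj[v] if w in in_set)
        in_set.add(v)
        vol += len(adj[v])
        cut += len(adj[v]) - 2 * inside  # edges to the set stop being cut
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_phi:
            best_phi, best_k = cut / denom, k
    return set(order[:best_k]), best_phi


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community, phi = sweep_cut(toy, approximate_ppr(toy, seed=0))
print(community, phi)
```

Each call explores only the region around its seed, but an NCP needs many seeds, which is where the running time goes.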


Page 23: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Filling in the Network Community Profile

Minimum conductance for

any community of the given size

7807 seconds

This region fills when using the PPR method (like now!)


Community Size

Page 24: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Vertex Neighborhoods, Low Conductance Cuts, and Good Seeds for Local Community Methods


Page 25: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Am I a good seed? Locally Minimal Communities

“My conductance is the best locally.”

$\phi(N(v)) \le \phi(N(w))$ for all $w$ adjacent to $v$

In Zachary’s Karate Club network, there are four locally minimal communities, the two leaders and two peripheral nodes.
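The locally minimal test is a direct comparison against each neighbor's score. A minimal sketch, assuming the neighborhood conductances have already been computed and are passed in as a dict; the function name `locally_minimal` and the toy values are hypothetical, and ties are counted as minimal.

```python
def locally_minimal(adj, phi):
    """Vertices whose neighborhood conductance is no worse than any neighbor's.

    adj: dict vertex -> set of neighbors (hypothetical representation).
    phi: dict vertex -> conductance of that vertex's neighborhood.
    """
    return [v for v in adj if all(phi[v] <= phi[w] for w in adj[v])]


# Toy graph (hypothetical): two triangles joined by the edge 2-3, with
# the neighborhood conductances that this graph produces.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
toy_phi = {0: 1 / 7, 1: 1 / 7, 2: 0.5, 3: 0.5, 4: 1 / 7, 5: 1 / 7}
print(locally_minimal(toy, toy_phi))
```

The two bridge vertices are rejected because each has a neighbor with a better score.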


Page 26: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Locally minimal communities capture extremal neighborhoods

Red dots are the conductance and size of a locally minimal community.

Usually about 1% of the number of vertices.

The red circles – the best local mins – find the extremes in the egonet profile.


Community Size

Page 27: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Filling in the NCP: Growing locally minimal communities

Growing only locally minimal communities: 283 seconds vs. 7807 seconds

Full NCP Locally min NCP

Original Egonet


Community Size

Page 28: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Filling in the NCP: Growing locally minimal communities

Growing only locally minimal communities: 143 seconds vs. 2211 seconds

Full NCP Locally min NCP

Original Egonets arXiv – 86k verts, 500k edges


Community Size

Page 29: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Recap

A theorem relating clustering, heavy-tailed degrees, and low-conductance cuts of vertex neighborhoods.
Empirical evaluation of vertex neighborhoods.
More on k-cores in the paper.
⇒ Many communities are easy to find!
⇒ Explains success of community detection?

Acknowledgements

David supported by NSF CAREER award 1149756-CCF. Sesh supported by the Sandia LDRD program (project 158477) and the applied mathematics program at the Dept. of Energy.


Code and results available online: www.cs.purdue.edu/~dgleich/codes/neighborhoods

Page 30: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Two words on computing

Can be done by just counting the triangles at each node. Linear complexity in |E| in a power-law graph. It’s possible to do this in MapReduce too.
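Counting the triangles at each node can be sketched with a degree-ordered enumeration, where every triangle is discovered exactly once from its lowest-ranked vertex. This is a generic technique shown for illustration; it is not claimed to be the implementation behind the complexity statement above. The function name `triangles_per_vertex` and the dict-of-sets graph are assumptions.

```python
def triangles_per_vertex(adj):
    """Number of triangles through each vertex of an undirected simple graph.

    adj: dict vertex -> set of neighbors (hypothetical representation).
    """
    # Rank vertices by (degree, insertion index) so that ties are broken
    # consistently and every pair of vertices is strictly ordered.
    rank = {v: (len(adj[v]), i) for i, v in enumerate(adj)}
    # Keep only the neighbors that rank higher than the vertex itself.
    higher = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    count = dict.fromkeys(adj, 0)
    for v in adj:
        for u in higher[v]:
            # w closes a triangle v < u < w in rank order, found once.
            for w in higher[v] & higher[u]:
                count[v] += 1
                count[u] += 1
                count[w] += 1
    return count


# Hypothetical example: the complete graph on four vertices.
k4 = {i: set(range(4)) - {i} for i in range(4)}
print(triangles_per_vertex(k4))
```

Restricting each vertex to its higher-ranked neighbors keeps the work at high-degree vertices small, which matters for heavy-tailed degree distributions.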


Page 31: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Filling in the NCP: Growing k-cores

Growing only locally minimal communities and k-cores: 143 seconds vs. 2211 seconds

Full NCP Locally min NCP

Original Egonets arXiv – 86k verts, 500k edges


Community Size

PPR grown k-cores

k-cores

Page 32: Vertex neighborhoods, low conductance cuts, and good seeds for local community methods

Clustering coefficients

Wedge (figure labels: center of wedge, closed wedge)

Global clustering coefficient: $\kappa = \frac{\text{number of closed wedges}}{\text{number of wedges}}$, the probability that a random wedge is closed.

Local clustering coefficient: $C_v = \frac{\text{number of closed wedges centered at } v}{\text{number of wedges centered at } v}$
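Both ratios can be computed by counting wedges at each center. A minimal sketch: the function name `clustering_coefficients`, the dict-of-sets graph, and the convention of reporting 0.0 for vertices with fewer than two neighbors are assumptions made here.

```python
def clustering_coefficients(adj):
    """Return (global coefficient, dict of local coefficients).

    adj: dict vertex -> set of neighbors (undirected simple graph).
    """
    local = {}
    closed_total = wedges_total = 0
    for v, nbrs in adj.items():
        d = len(nbrs)
        wedges = d * (d - 1) // 2  # pairs of neighbors of v
        # A wedge centered at v is closed when its two endpoints are
        # adjacent; each such edge is seen from both ends, hence // 2.
        closed = sum(len(adj[u] & nbrs) for u in nbrs) // 2
        local[v] = closed / wedges if wedges else 0.0  # assumed convention
        closed_total += closed
        wedges_total += wedges
    global_cc = closed_total / wedges_total if wedges_total else 0.0
    return global_cc, local


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
kappa, c = clustering_coefficients(toy)
print(kappa, c)
```

The global value is a ratio of totals, so it generally differs from the mean of the local values.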
