Fast Triangle Counting through Wedge Sampling
Ali Pinar, C. Seshadhri, and Tamara G. Kolda
Sandia National Laboratories
7/10/2012 Pinar ‐ SIAM Annual 12 1
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security
Administration under contract DE-AC04-94AL85000.
U.S. Department of Energy, Office of Advanced Scientific Computing Research
U.S. Department of Defense, Defense Advanced Research Projects Agency
Triangles are critical for graph analysis
• Interpreted in many different ways in social sciences.
  – Identifier for bridges between communities.
  – Likelihood to go against norms.
• Applied to spam detection.
• Used to compare graphs.
• Proposed as a guide for community structure.
• Stated as a core feature for graph models [Vivar&Banks11]
  – Cornerstone for Block Two‐level Erdos‐Renyi (BTER)
• Rich set of algorithmic results
  – Algorithms, runtime analysis, streaming algorithms, MapReduce, …
Using graph assays to monitor network traffic
[Figure labels: Open wedge; Closed wedge (i.e., triangle)]
It is not only how many, it is about where they are…
• We need algorithms that can compute the distributions of triangles over a given set of attributes.
  – For social networks, degree‐wise clustering coefficients tend to decrease with degree.
BTER: A New Model with Explicit Community Structure
• Preprocessing: Generate communities
  – Determined by desired degree distribution
  – All nodes have (close to) the same degree
  – Size of cluster = min degree + 1
• Phase 1: Generate an Erdős‐Rényi (ER) graph on each community
  – User must specify connectivity coefficient for each community, ρ_k
  – We use a function of the min degree in the community, d_k
• Phase 2: Generate a Chung‐Lu (CL) graph on “excess” degree
  – e(i) = d(i) - ρ_k d_k, where vertex i is in community k
2/15/2012 Pinar ‐ SIAM PP12 4
Preprocessing: Create explicit communities
Phase 1: Erdős‐Rényi graphs in each community
Phase 2: CL model on “excess” degree
Seshadhri, Kolda, & Pinar, Phys. Rev. E, 2012
Hypothesis: Real‐world interaction networks consist of a scale‐free collection of dense Erdős‐Rényi graphs.
BTER can match properties of real world graphs
• The code is available at http://www.sandia.gov/~tgkolda/bter_supplement/
• Hadoop and MPI implementation will be available soon.
It is not only how many and where they are, it is about what they comprise …
• Tell me about your friends, I will tell you who you are.
• We need algorithms that can reveal the structure of the triangles.
  – For social networks, vertices of a triangle are close in degree, but high‐degree nodes are dominant in triangles of infrastructure networks.
[Figure panels: amazon0312, ca‐AstroPh, Soc_Epinions, cit‐HepPh, as‐caida20071105, web‐Stanford, wiki‐Talk, Oregon1_010331]
Durak, Pinar, Kolda, Seshadhri, 2012
Enumerating triangles
• Core idea: check whether each wedge is closed.
  – For each vertex v in the graph
    • For every pair of neighbors u, w of vertex v,
      – If there is an edge between u and w,
        » report the triangle.
• Runs in cubic time in the worst case.
• Redundant work: each triangle is reported 3 times.
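The loop structure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the adjacency-set representation, the function name `enumerate_triangles_naive`, and the toy graph are all assumptions made for the example.

```python
from itertools import combinations

def enumerate_triangles_naive(adj):
    """Check every wedge u - v - w centered at v and report it if (u, w) is an edge.

    `adj` maps each vertex to the set of its neighbors (undirected, no self-loops).
    Returns (wedges_checked, reports); every triangle shows up three times in
    `reports`, once for each of its vertices acting as the wedge center.
    """
    wedges_checked = 0
    reports = []
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            wedges_checked += 1
            if w in adj[u]:  # the wedge is closed
                reports.append((v, u, w))
    return wedges_checked, reports

# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
wedges, reports = enumerate_triangles_naive(adj)
distinct = {frozenset(t) for t in reports}
print(wedges, len(reports), len(distinct))  # prints: 5 3 1
```

On this toy graph, five wedges are checked and the single triangle is reported three times, which is the redundancy noted in the last bullet.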
Example with 13 wedges and 1 triangle
Clever Enumeration
• By imposing an ordering on the vertices (e.g., order by degree), we can check only one wedge per triangle (the one centered on the vertex with min. degree).
• This can be achieved by assigning each edge to its vertex with lower degree.
• Discovered and rediscovered starting in 1985.
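A minimal sketch of the edge-assignment idea, under stated assumptions: the graph is a dict of neighbor sets, vertex ids are comparable, and degree ties are broken by vertex id (the slides do not say how ties are handled). The function name is hypothetical.

```python
from itertools import combinations

def enumerate_triangles_ordered(adj):
    """Assign each edge to its lower-ranked endpoint, then check only wedges
    whose two edges are assigned to the same center vertex.

    Rank is (degree, vertex id); the id tie-break is an assumption of this sketch.
    Returns (wedges_checked, triangles), with each triangle listed exactly once.
    """
    rank = {v: (len(adj[v]), v) for v in adj}
    # owned[v] holds the neighbors ranked above v, i.e. the edges assigned to v.
    owned = {v: [u for u in adj[v] if rank[u] > rank[v]] for v in adj}
    wedges_checked = 0
    triangles = []
    for v in adj:
        for u, w in combinations(owned[v], 2):
            wedges_checked += 1
            if w in adj[u]:
                triangles.append((v, u, w))
    return wedges_checked, triangles

# Same hypothetical toy graph: triangle 0-1-2 plus pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
wedges, triangles = enumerate_triangles_ordered(adj)
print(wedges, len(triangles))  # prints: 1 1
```

Each triangle has exactly one lowest-ranked vertex, and both of its triangle edges are assigned to that vertex, so the triangle is found once. High-degree vertices own few edges, which is where the reduction in checked wedges comes from.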
Total wedges: 24
Wedges that need to be checked: 4
Naïve vs. Clever enumeration
[Bar chart: normalized wedge counts (axis 0 to 300), Naïve vs. Clever]
• In practice, the clever approach is very effective in reducing the number of wedges that are checked.
• Recent work showed that the clever algorithm runs in linear time for graphs generated with the edge configuration model with a power‐law degree distribution with coefficient > 7/3. [Berry et al, SAND2010‐4474C]
Triangle counting is amenable to sampling
• Clustering coefficient (CC) can be considered as the success rate of an experiment with a binary outcome.
• Each wedge is an experiment, which succeeds if it is closed, and fails otherwise.
• This is an excellent setup for a sampling algorithm, because:
  • Many graphs of interest have a very large number of wedges.
    • Large enough space to benefit from sampling.
  • In many graphs of interest, a nontrivial fraction of the wedges are closed.
    • We are not looking for a needle in a haystack.
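The "each wedge is an experiment" view can be sketched as follows. This is an illustrative implementation under assumptions: the graph is a dict of neighbor sets containing at least one wedge, and the function name and `seed` parameter are hypothetical. A uniform random wedge is drawn by picking a center with probability proportional to its wedge count, then a uniform pair of its neighbors.

```python
import random

def sample_clustering_coefficient(adj, k, seed=0):
    """Estimate the global clustering coefficient as the fraction of k
    uniformly sampled wedges that are closed."""
    rng = random.Random(seed)
    # Only vertices of degree >= 2 are centers of wedges.
    centers = [v for v in adj if len(adj[v]) >= 2]
    # A vertex of degree d is the center of d * (d - 1) / 2 wedges.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    closed = 0
    for v in rng.choices(centers, weights=weights, k=k):
        u, w = rng.sample(sorted(adj[v]), 2)  # uniform pair of distinct neighbors
        if w in adj[u]:
            closed += 1
    return closed / k

# Hypothetical toy graph: 3 of its 5 wedges are closed, so the exact value is 0.6.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sample_clustering_coefficient(adj, 20000))
```

The cost per sample is one neighbor-pair draw and one edge lookup, so the work depends on the number of samples rather than on the number of wedges in the graph.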
Wedge‐sampling
Clustering coefficients can be considered as the success rate of experiments with binary outcomes.
Wedge‐sampling provides provably accurate estimations
• Theorem: For error = ε and confidence = 1‐δ, the number of samples required is
  $k = \left\lceil 0.5\,\varepsilon^{-2} \ln(2/\delta) \right\rceil$
• For 99.9% confidence and 1% error, we need only k = 38,005 samples.
The number of samples is independent of the graph size.
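As a quick check of the arithmetic, the sample-size bound can be evaluated directly. The function name is hypothetical. The bound has the form that follows from Hoeffding's inequality for the mean of k independent 0/1 trials, where requiring 2·exp(−2kε²) ≤ δ gives k ≥ ln(2/δ) / (2ε²).

```python
import math

def wedge_samples_needed(eps, delta):
    """Number of samples k = ceil(0.5 * eps^-2 * ln(2 / delta)) so that the
    estimate is within eps of the true rate with probability at least 1 - delta."""
    return math.ceil(0.5 * eps ** -2 * math.log(2 / delta))

# 1% error with 99.9% confidence, the example from the slide.
print(wedge_samples_needed(0.01, 0.001))  # prints: 38005
```

Neither the number of vertices nor the number of edges appears in the formula, which is why the sample count does not grow with the graph.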
Alternative: Doulion
An alternative to wedge sampling is edge‐based sampling. [Tsourakakis et al, KDD09]
• Generate a smaller graph by removing each edge with probability 1‐p.
• Count the number of triangles in the smaller graph.
• Multiply by 1/p^3 to predict the number of triangles in the original graph.
Drawback:
• Expected value is correct, but the variance may be huge.
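For comparison, here is a minimal sketch of edge-based sparsification as described above, not the Doulion authors' implementation. It assumes a dict of neighbor sets with comparable vertex ids. Each triangle survives only if all three of its edges are kept, which happens with probability p^3, so the count on the sparsified graph is scaled by 1/p^3. Function names and the `seed` parameter are hypothetical.

```python
import random
from itertools import combinations

def count_triangles(adj):
    """Exact count: every triangle is seen once from each of its 3 vertices."""
    closed = sum(1 for v in adj
                 for u, w in combinations(sorted(adj[v]), 2) if w in adj[u])
    return closed // 3

def doulion_estimate(adj, p, seed=0):
    """Keep each edge independently with probability p, count triangles in the
    sparsified graph, and rescale by 1 / p^3."""
    rng = random.Random(seed)
    sparse = {v: set() for v in adj}
    for v in adj:
        for u in adj[v]:
            if v < u and rng.random() < p:  # visit each undirected edge once
                sparse[v].add(u)
                sparse[u].add(v)
    return count_triangles(sparse) / p ** 3

# Hypothetical example: the complete graph on 6 vertices has 20 triangles.
k6 = {i: {j for j in range(6) if j != i} for i in range(6)}
print(count_triangles(k6), doulion_estimate(k6, 0.5))
```

On a small graph like this, a single run can land far from 20 because the estimate moves in steps of 1/p^3 = 8, a small-scale view of the variance issue noted in the drawback.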
Wedge‐sampling offers accurate estimations
[Bar chart: relative error (axis 0 to 0.3) for Wedge‐sampling‐13K, Doulion‐10, and Doulion‐25]
…with big savings in runtime
[Bar chart: runtime (axis 0 to 0.5) for Enumeration, Wedge‐sampling, Doulion‐10, and Doulion‐25]
Times normalized with respect to the IO time.
Counting Directed Triangles
• We have
  – three edge types: in, out, bi‐directional,
  – six wedge types,
  – seven triangle types.
• Sampling works as is for clustering coefficients.
• Estimating the number of triangles needs adjustments.
Counting Directed Triangles
i ii iii iv v vi
a 1 1 1
b 3
c 1 2
d 1 1 1
e 1 2
f 1 1 1
g 3
• Multiple occurrences of the same wedge type in a triangle cause the same triangle to be counted multiple times.
• Algorithm
  – Pick a wedge‐type for the triangle type
  – Compute the success rate
  – #triangles = success rate * |w| / wedge multiplicity
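The last step of the algorithm is a one-line conversion from a success rate to a triangle count. The sketch below uses hypothetical function names and shows the undirected case as a sanity check, where every triangle contains three wedges of the single wedge type, so the multiplicity is 3. For directed triangles, the multiplicity would be read from the wedge-type table, for example 3 for the triangle types whose row holds a single entry of 3.

```python
def total_wedges(adj):
    """Number of wedges in an undirected graph given as a dict of neighbor sets."""
    return sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())

def estimate_triangles(success_rate, wedge_count, multiplicity):
    """#triangles = success rate * |w| / wedge multiplicity."""
    return success_rate * wedge_count / multiplicity

# Undirected sanity check on the complete graph K4: all 12 wedges are closed,
# so the success rate is 1.0 and the formula yields the exact count of 4.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(total_wedges(k4), estimate_triangles(1.0, total_wedges(k4), 3))  # prints: 12 4.0
```

In practice the success rate would come from a sampled estimate rather than the exact value used here.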
Estimating triangles per degree
• Similar principles apply to counting triangles per degree.
• But we need to adjust the counts based on the number of vertices with the same degree in the sampled wedge.
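As a related illustration, the degree-wise clustering coefficient for a degree d can be estimated by restricting the sampling to wedges centered at vertices of degree d. This sketch is an assumption-laden example, not the authors' per-degree triangle counter: it estimates only the closed-wedge fraction, and turning that into triangles per degree would still need the same-degree adjustment mentioned above. All vertices of degree d center the same number of wedges, so a uniform center followed by a uniform neighbor pair gives a uniform wedge.

```python
import random

def sample_degree_clustering(adj, d, k, seed=0):
    """Estimate the fraction of closed wedges among wedges centered at
    vertices of degree d. Returns None when no such wedge exists."""
    rng = random.Random(seed)
    centers = [v for v in adj if len(adj[v]) == d]
    if d < 2 or not centers:
        return None
    closed = 0
    for _ in range(k):
        v = rng.choice(centers)
        u, w = rng.sample(sorted(adj[v]), 2)  # uniform pair of distinct neighbors
        closed += w in adj[u]
    return closed / k

# Hypothetical toy graph: degree-2 centers (0 and 1) have only closed wedges,
# while the degree-3 center (2) has 1 closed wedge out of 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sample_degree_clustering(adj, 2, 1000), sample_degree_clustering(adj, 3, 20000))
```

For real graphs, degrees would typically be grouped into bins before sampling so that each bin contains enough wedge centers; the binning scheme is left out of this sketch.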
[Figure panels: ca‐CondMat, cit‐HepPh, soc‐Epinions1]
Concluding Remarks
[Figure: frequency of Triangles, Wedges (not in Triangles), Edges (not in Wedges or Triangles), and Isolates]
• Triangles can reveal a lot of information about a graph.
• Wedge‐sampling provides provably good estimations with big runtime savings.
  – The number of samples is independent of the graph size.
  – Directed triangles can be counted the same way as undirected ones.
  – Distribution of triangles with a given property can be estimated with the same algorithm.
• Current work:
  – A MapReduce implementation is on the way.
  – Enhancing the algorithm for streaming graphs.
  – Efficient sampling for larger patterns is being investigated.
• Goal:
  – Build graph assays
Related Publications
• Modeling
  – C. Seshadhri, T. Kolda, and A. Pinar, “The Blocked Two‐Level Erdos Renyi Graph Model,” Physical Review E.
  – C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth analysis of Stochastic Kronecker Graphs,” submitted.
  – A. Pinar, C. Seshadhri, and T. Kolda, “The Similarity of Stochastic Kronecker Graphs to Edge‐Configuration Models,” SDM’12.
  – C. Seshadhri, A. Pinar, and T. Kolda, “An In Depth study of Stochastic Kronecker Graphs,” ICDM’12.
• Generating a random graph
  – J. Ray, A. Pinar, and C. Seshadhri, “Are we there yet? When to stop a Markov chain while generating random graphs,” WAW’12.
  – I. Stanton and A. Pinar, “Constructing and uniform sampling graphs with prescribed joint degree distribution using Markov Chains,” to appear in ACM JEA.
  – I. Stanton and A. Pinar, “Sampling graphs with prescribed joint degree distribution using Markov Chains,” ALENEX’11.
• Community structure and triangles
  – C. Seshadhri, A. Pinar, and T. Kolda, “Fast Triangle Counting through Wedge Sampling,” submitted.
  – M. Rocklin and A. Pinar, “On Clustering on Graphs with Multiple Edge Types,” Internet Mathematics.
  – M. Rocklin and A. Pinar, “Latent Clustering on Graphs with Multiple Edge Types,” Proc. 8th Workshop on Algorithms and Models for the Web Graph, WAW’11.
  – M. Rocklin and A. Pinar, “Computing an Aggregate Edge‐weight function for Clustering Graphs with Multiple Edge Types,” WAW’10.