Blair D. Sullivan
Complex Systems Group, Center for Engineering Science Advanced Research
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Branching Out: Quantifying Tree-like Structure in Complex Networks

MMDS, July 12, 2012

Joint work with Michael Mahoney & Aaron Adcock, Stanford University

2 Managed by UT-Battelle for the U.S. Department of Energy

Motivation

• Large networks are becoming ubiquitous in many domains – e.g. biology, physics, chemistry, infrastructure, communications, and sociology

• Many methods exist for understanding structure at very large scale (diameter) and very small scale (clustering coefficient); very few probe the intermediate scale (e.g. clusters of size 5K in a 5M-node network). Can we build good tools to understand and exploit this?

A partial map of the Internet, January 15 2005

The US electric transmission system. Courtesy North American Reliability Corporation.

Drug-Target Network. Nature Biotechnology 25(10), October 2007


Intermediate-Scale Structure

• Determines network evolution & dynamics of diffusion, other processes

• Implicitly affects applicability of common data analysis tools

• This is where all the “interesting stuff” happens.

Ising model (ferromagnetism): Temperature parameter controls scale of local correlations between magnetic spins.

The “intermediate-scale structure” is the coupling of local & global properties.


Prior empirical evidence

Claim: Many large complex networks are “tree-like” when viewed at intermediate scales:

• The Unreasonable Effectiveness of Tree-Based Theory for Networks with Clustering, Melnik, Hackett, Porter, Mucha, Gleeson. Physical Review E, Vol. 83, No. 3 (2010).

• Finding Hierarchy in Directed Online Social Networks, Gupta, Shankar, Li, Muthukrishnan, Iftode. WWW2011.

• “It was noted in recent years that the Internet structure has a highly connected core and long stretched tendrils, and that most of the routing paths between nodes in the tendrils pass through the core. Therefore, we suggest in this work, to embed the Internet distance metric in a hyperbolic space where routes are bent toward the center” Shavitt, Tankel. Hyperbolic embedding of internet graph for distance estimation and overlay construction. IEEE/ACM Trans. Netw. 16, 1 (2008).

However, no consensus has been reached on defining and measuring this tree-like structure, making it difficult to exploit algorithmically.

Image credit: Munzer et al


What do you mean, “tree-like”?

Figures: arXiv GR-QC collaboration; Facebook: Caltech Network; Autonomous Systems

Image credits: Traub, Kelsic, Mucha, Porter; Tim Davis; Graphics@Illinois


Hyperbolic Space

• Through a point not on a given line there pass multiple lines parallel to it, and the angles of a triangle sum to less than 180°.

• At right, see a {7,3}-tessellation of the hyperbolic plane by equilateral triangles, and the dual {3,7}-tessellation by regular heptagons. All triangles and heptagons are of the same hyperbolic size but the size of their Euclidean representations exponentially decreases as a function of the distance from the center, while their number exponentially increases.

• In Euclidean space, a circle’s area grows polynomially with its diameter; in hyperbolic space, it grows exponentially. Think of growth as in a binary tree.

• The shortest paths in hyperbolic space are arcs through the disk, not paths around the exterior (much like travel in a rooted tree)

Image credit Krioukov et al.
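The growth comparison above can be made concrete with a small numeric sketch. It assumes the standard area formulas for a disk of radius r (πr² in the Euclidean plane, 2π(cosh r − 1) in the hyperbolic plane of curvature −1); the function names are illustrative only and do not come from the talk.

```python
import math

def euclidean_disk_area(r: float) -> float:
    """Area of a Euclidean disk of radius r (polynomial growth)."""
    return math.pi * r * r

def hyperbolic_disk_area(r: float) -> float:
    """Area of a disk of radius r in the hyperbolic plane, assuming curvature -1."""
    return 2.0 * math.pi * (math.cosh(r) - 1.0)

def binary_tree_size(depth: int) -> int:
    """Number of nodes within `depth` hops of the root of a complete binary tree."""
    return 2 ** (depth + 1) - 1

if __name__ == "__main__":
    # Print the three growth rates side by side for a few radii.
    for r in (1, 2, 4, 8, 16):
        print(r,
              round(euclidean_disk_area(r), 1),
              round(hyperbolic_disk_area(r), 1),
              binary_tree_size(r))
```

For small radii the two areas nearly agree; for large radii the hyperbolic area, like the node count of the binary tree, is dominated by an exponential term.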


Hyperbolic Embedding and Greedy Routing

• Hyperbolic space gives us “extra room” to embed networks (as opposed to Euclidean space).

• A number of algorithms take advantage of this to devise greedy routing schemes

• Kleinberg uses a minimum spanning tree, embedded as a subset of a d-regular tree, where d is the maximum degree of the MST (d = 4 is shown at right)

Image credit Kleinberg


So is it good or bad?

Image credit M.C.Escher


A generative model

• The three-parameter model introduced by Krioukov et al. uses an underlying hyperbolic geometry and allows us to vary the curvature, degree heterogeneity, and density. (Physicists: this is basically fermions)

• Idea: place nodes in the hyperbolic plane (Poincare disk) and connect them with a probability which is dependent on their hyperbolic distance.

• Knob 1: Power law exponent: determines distribution of nodes in the disk – the higher the exponent, the more nodes go towards the center. This determines the curvature (and degree heterogeneity)

• Knob 2: Temperature: determines how much we ignore the underlying geometry when adding edges; at high temperatures, edge connections become essentially random (independent of distance).

• Knob 3: Average degree (target): approximately allows control over density

Our test parameters

Power Law      2.1    2.25   2.5
Temperature    20     1.5    0.5
Avg. Degree    5      10     20

                  Temp. finite                 Temp. infinite
Curv. finite      Random hyperbolic graphs     Classical random graphs (Erdos-Renyi)
Curv. infinite    Random geometric graphs      Random graphs w/ given expected deg.
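The idea of placing nodes in the disk and connecting them with a distance-dependent probability can be sketched as follows. This is not the code provided by Krioukov; it is a minimal illustration under stated assumptions: curvature −1, a radial density proportional to sinh(alpha·r), and a Fermi-Dirac connection probability 1/(1 + exp((d − R)/(2T))). The parameters `R` (disk radius) and `alpha` (radial density exponent) are hypothetical stand-ins, and how they map onto the power-law exponent and average-degree knobs is deliberately left out.

```python
import math
import random

def sample_radius(R, alpha, rng):
    """Inverse-CDF sample from a density proportional to sinh(alpha * r) on [0, R]."""
    u = rng.random()
    return math.acosh(1.0 + u * (math.cosh(alpha * R) - 1.0)) / alpha

def hyperbolic_distance(r1, theta1, r2, theta2):
    """Distance between two points given in polar coordinates (curvature -1 assumed)."""
    dtheta = math.pi - abs(math.pi - abs(theta1 - theta2))  # angle folded into [0, pi]
    x = (math.cosh(r1) * math.cosh(r2)
         - math.sinh(r1) * math.sinh(r2) * math.cos(dtheta))
    return math.acosh(max(1.0, x))  # clamp guards against rounding below 1

def connection_probability(d, R, T):
    """Assumed Fermi-Dirac form: close pairs connect; large T flattens this toward 1/2."""
    z = (d - R) / (2.0 * T)
    if z > 700.0:  # avoid overflow in exp for very distant pairs
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def random_hyperbolic_graph(n, R, alpha, T, seed=0):
    """Return (points, edges) for n nodes; edges are pairs (i, j) with i < j."""
    rng = random.Random(seed)
    pts = [(sample_radius(R, alpha, rng), rng.uniform(0.0, 2.0 * math.pi))
           for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = hyperbolic_distance(*pts[i], *pts[j])
            if rng.random() < connection_probability(d, R, T):
                edges.append((i, j))
    return pts, edges
```

In this sketch, raising `T` makes `connection_probability` nearly independent of `d`, which mirrors the temperature knob described above.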


Special Thanks

Special thanks to D. Krioukov for providing us code to generate networks according to the model described on the previous slide.

Image credit San Diego Reader


Hyperbolic Embedding for Inference

• Boguñá, Krioukov, and Papadopoulos have mapped “the internet” to hyperbolic space, and used the embedding to identify community structure (and offer suggested routing schemes).

Image credit Boguñá, Krioukov, Papadopoulos

• Their methods rely on iterative maximum-likelihood estimation (MLE), and do not seem to scale to “big data”.


A geometric measure of tree-likeness

• Gromov’s δ-hyperbolicity arises from the geometry of metric spaces and δ measures the extent to which a (geodesic) metric space embeds in a tree metric.

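• For a quadruple u, v, w, x, form the three pairwise distance sums shown in the examples; δ for the quadruple is half the difference between the two largest sums.

Example with δ = 0:
d(u,v) + d(w,x) = 1 + 1 = 2
d(u,x) + d(v,w) = 1 + 1 = 2
d(u,w) + d(v,x) = 1 + 1 = 2

Example with δ = 1:
d(u,v) + d(w,x) = 1 + 1 = 2
d(u,x) + d(v,w) = 2 + 2 = 4
d(u,w) + d(v,x) = 1 + 1 = 2

[Figure: the two four-vertex example graphs on u, v, w, x]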

• Note: d(u,v) is the length of the shortest path between u and v in the graph.

• The minimum δ for which G is δ-hyperbolic can be computed (naively) in O(n^4) time
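A minimal sketch of the naive O(n^4) computation follows. It assumes a connected, unweighted graph given as an adjacency dictionary, and it takes δ of a quadruple to be half the difference between the two largest of the three pairwise distance sums, which is consistent with the two worked examples above. The function names are illustrative.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def four_point_delta(dist, u, v, w, x):
    """Half the gap between the two largest of the three pairwise distance sums."""
    sums = sorted((dist[u][v] + dist[w][x],
                   dist[u][w] + dist[v][x],
                   dist[u][x] + dist[v][w]), reverse=True)
    return (sums[0] - sums[1]) / 2

def hyperbolicity(adj):
    """Maximum four-point delta over all quadruples (assumes a connected graph)."""
    dist = {s: bfs_distances(adj, s) for s in adj}
    return max((four_point_delta(dist, *q) for q in combinations(adj, 4)),
               default=0.0)
```

On a 4-cycle this returns 1.0 and on any tree it returns 0, matching the examples. The loop over all quadruples is what makes the approach impractical beyond a few thousand nodes.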


More on δ-hyperbolicity

• A triangle is δ-thin if the pre-images of every tripod point have distance at most δ.

• A triangle is δ-slim if each of its sides is contained in the δ-neighborhood of the union of the other two sides.

• A graph is δ-hyperbolic if all its geodesic triangles are δ-thin (or δ-slim); each definition gives a slightly different minimum δ, and these are related to each other by small constant factors.

• Viewing a graph as a geodesic metric space (replace edges with length-1 segments intersecting only at endpoints) provides another way to think of δ-hyperbolicity.

• For a geodesic triangle, there is a unique isometry to a tripod so that, except for the leaves, each point on the tripod has two pre-images on the triangle.

Image credit: Bridson, Haefliger

Image credit: Chepoi, Dragan et al


Examples: Small world graphs & Ringed Trees

• Kleinberg’s small-world random graphs add long-range edges with probability proportional to 1/d_B(u,v)^p to a d-dimensional grid.

• Mahoney et al (2011) showed even at the “sweet spot” of p = d, the small-world graphs are not logarithmically hyperbolic w.h.p. When p < d, the graphs are not hyperbolic, and for p > 3 and d = 1, the hyperbolic delta is polynomial in the size of the graph.

• Define a ringed tree to be a binary tree plus edges connecting all vertices at a given tree level into a ring (quasi-isometric to the Poincaré disk)

• Adding long-range edges between the leaves of a ringed tree w/ probability decreasing:

– exponentially fast with the ring distance produces logarithmic hyperbolicity

– as a power-law with the ring distance produces non-hyperbolic random graphs

• Replace the ringed tree with a pure binary tree: none of the resulting graphs are hyperbolic.
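The ringed-tree definition can be illustrated with a short construction. This sketch only builds the deterministic ringed tree (a complete binary tree plus one ring per level); the random long-range edges between leaves are not included. The node labeling `(level, position)` and the function name are assumptions made for illustration.

```python
def ringed_tree(depth):
    """Edge set of a ringed tree: a complete binary tree of the given depth
    plus a ring joining the vertices on each level.

    Nodes are labeled (level, position); edges are frozensets of two nodes.
    """
    edges = set()
    for level in range(depth + 1):
        width = 2 ** level  # number of vertices on this level
        for pos in range(width):
            if level < depth:
                # Tree edges to the two children on the next level.
                edges.add(frozenset({(level, pos), (level + 1, 2 * pos)}))
                edges.add(frozenset({(level, pos), (level + 1, 2 * pos + 1)}))
            if width >= 2:
                # Ring edge to the next vertex on the same level (wrapping around).
                edges.add(frozenset({(level, pos), (level, (pos + 1) % width)}))
    return edges
```

With depth 3 this yields 14 tree edges and 13 ring edges (1 + 4 + 8 across levels 1 to 3); the level-1 ring collapses to a single edge because that level has only two vertices.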

Image credit: Mahoney et al


Empirical Results: Real Graphs


Empirical Results: “Planar”

• Planar graphs have a very different distribution of delta over their quadruples, and very high diameters.


Empirical Results: “Hyperbolic”?

• Much more subtle differences when looking at non-planar graphs.

• Density seems to play a role, and most networks considered had very low diameter.


Computing δ: Sampling

• Due to high computational complexity, a number of prior works have used sampling to estimate the hyperbolicity of large networks.

• Some prior work sampled at a rate of about 0.0002 percent (on their largest data). Although that sampling was biased towards pairs at larger distances, it could still easily miss the maximum delta, which is achieved on a very small subset of quadruplets (in our example, a fraction of about 2 × 10^-11). Sampling is, however, likely to be sufficient for computing average deltas.

• Example below is SNAP graph as20000101 (about 3,600 nodes, consistent with the roughly 6.76 × 10^12 quadruplets counted)

delta    Fraction of quadruplets    # of quadruplets
0.0      0.677473774788751          4577453756970
0.5      0.313235924997126          2116425779202
1.0      0.009262044976055          62580404070
1.5      0.000028008357243          189242691
2.0      0.000000246259522          1663890
2.5      0.000000000022835          154
Total    0.999999999401533          6756650846976
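A sampled version of the distribution above might look like the following sketch. It draws quadruplets uniformly at random, which is an assumption and differs from the distance-biased sampling mentioned for prior work; the graph is assumed connected and given as an adjacency dictionary, and all names are illustrative.

```python
import random
from collections import Counter, deque

def all_pairs_distances(adj):
    """BFS from every vertex of an unweighted, connected graph."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in d:
                    d[w] = d[v] + 1
                    queue.append(w)
        dist[s] = d
    return dist

def sampled_delta_histogram(adj, samples, seed=0):
    """Histogram of four-point delta over uniformly sampled quadruplets."""
    rng = random.Random(seed)
    nodes = list(adj)
    dist = all_pairs_distances(adj)
    hist = Counter()
    for _ in range(samples):
        u, v, w, x = rng.sample(nodes, 4)  # four distinct vertices
        s = sorted((dist[u][v] + dist[w][x],
                    dist[u][w] + dist[v][x],
                    dist[u][x] + dist[v][w]))
        hist[(s[2] - s[1]) / 2] += 1  # half the gap between the two largest sums
    return hist
```

Such a histogram estimates the bulk of the distribution (and hence the average delta) well, but the rare quadruplets attaining the maximum are unlikely to appear unless the sample is very large.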


K-core Decompositions

• Given a graph G = (V,E), the k-core of the graph, denoted H_k, is the maximal subgraph H of G such that deg_H(v) is at least k for all v in H.

Image credit: LaNet-vi

• The core number of a vertex v is defined to be the maximum k such that v is in H_k but not H_{k+1}.

• The set of nodes with core number k is called the k-shell of G.
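The definitions above translate into a simple peeling procedure: repeatedly remove a vertex of minimum remaining degree and record the largest such degree seen so far. The sketch below is a straightforward quadratic-time version for small graphs, not the implementation used for the results in this talk; the function names are illustrative.

```python
def core_numbers(adj):
    """Core number of every vertex, by repeatedly peeling a minimum-degree vertex."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])      # the core level never decreases during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1     # v no longer counts toward w's degree
    return core

def k_shell(core, k):
    """Vertices whose core number is exactly k."""
    return {v for v, c in core.items() if c == k}
```

For a triangle with one pendant vertex attached, the pendant vertex lands in the 1-shell and the triangle forms the 2-shell.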

Condensed Matter Collaboration Network


Empirical Results: Social Graphs

Facebook-Texas84

~36,000 nodes

~3x10^6 edges

soc-Epinions1

~47,000 nodes

~730,000 edges


Empirical Results: Autonomous Systems

AS19990820

~5,500 nodes

~22,000 edges

AS19990818

~5,500 nodes

~22,000 edges


Empirical Results: Collaboration Graphs

CA-AstroPhysics

~18,000 nodes

~394,000 edges

CA-GrQc

~4,000 nodes

~26,000 edges


Empirical results: Synthetic by power law exponent


Empirical results: Synthetic by temperature


Some (oversimplified) Summary Statistics

ca-AstroPhysics:

• ~0.6% of nodes (113 nodes) in two deepest cores (k = 55,56)

• ~1.8% of edges (~7,000 edges) leaving the deepest core (k = 56)

• ~1.8% of edges (~7,000 edges) leaving the next core (k = 55)

• Max average k-shell change is +12 (out of k = 56 max shell)

• Suggests collaborators tend to collaborate with people of similar coreness/peripheryness

• “Typical” for collaboration graphs (and other core-periphery graphs)

Texas84:

• ~8% of nodes (≥2400 nodes) in two deepest cores (k = 80,81)

• ~7% of edges (≥220K edges) leaving the deepest core (k = 81)

• ~17% of edges (≥510K edges) leaving the next core (k = 80)

• Max average k-shell change is +50 (out of k = 80 max shell)

• Suggests that the “periphery” nodes are more tightly connected to “core-like” nodes

• “Typical” for more social graphs (and Facebook in particular)


A combinatorial measure of tree-likeness

• A tree decomposition of a graph G = (V,E) is a pair (X = {X_1, X_2, ..., X_L}, T) with each X_i a subset of V, and T a tree with nodes {1, …, L}, satisfying three conditions:

– The union of the sets in X is equal to V

– For every edge (u,v) in G, {u,v} is a subset of some X_i

– For every v in V, the indices of the sets X_i containing v form a sub-tree of T.

• We call the sets X_i the bags of the decomposition and max(|X_i|) − 1 the width. The tree-width of G is the minimum width over all valid tree decompositions.
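The three conditions can be checked mechanically. The sketch below assumes bags are given as a dictionary from tree-node index to vertex set and the tree as a list of index pairs; these data shapes and the function names are illustrative choices. It also reports the largest bag size; note that the common convention subtracts one from this value when stating a width.

```python
from collections import deque

def _connected(nodes, edges):
    """True if `nodes` induce a connected subgraph of the graph given by `edges`."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {i: set() for i in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j in adj[i] - seen:
            seen.add(j)
            queue.append(j)
    return seen == nodes

def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check that (bags, tree) is a valid tree decomposition of the graph."""
    # T must be a tree: connected with exactly len(bags) - 1 edges.
    tree_ok = len(tree_edges) == len(bags) - 1 and _connected(bags, tree_edges)
    # Condition 1: the bags cover every vertex.
    covers_vertices = set().union(*bags.values()) == set(vertices)
    # Condition 2: every edge lies inside some bag.
    covers_edges = all(any({u, v} <= bag for bag in bags.values())
                       for u, v in edges)
    # Condition 3: the bags containing each vertex form a connected subtree.
    subtrees_ok = all(
        _connected([i for i, bag in bags.items() if v in bag], tree_edges)
        for v in vertices)
    return tree_ok and covers_vertices and covers_edges and subtrees_ok

def largest_bag_size(bags):
    """Size of the largest bag in the decomposition."""
    return max(len(bag) for bag in bags.values())
```

For a 4-cycle a-b-c-d, the two bags {a,b,c} and {a,c,d} joined by one tree edge form a valid decomposition with largest bag size 3.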


Understanding FPT: “problems are easier on trees”

• Many NP-hard problems can be solved in polynomial time on trees (graphs with no cycles)

Example: Maximum Weighted Independent Set: Complexity O(|V|)

• We can generalize this dynamic programming approach to get polynomial-time algorithms (in graph size) on graphs where tree-width is bounded.

[Figure: a tree with vertex weights, annotated with a pair of dynamic-programming values at each vertex, e.g. (3,0), (3,6), (7,5), (8,10), (17,15)]
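The linear-time dynamic program for Maximum Weighted Independent Set on a tree can be sketched as below. Each vertex gets two values, the best total weight in its subtree when the vertex is included and when it is excluded; whether these correspond exactly to the pairs in the figure is an assumption. The input format (adjacency dictionary, weight dictionary, chosen root) and the function name are illustrative.

```python
def max_weight_independent_set(tree_adj, weight, root):
    """Maximum total weight of an independent set in a tree, in O(|V|) time."""
    # Breadth-first ordering from the root; reversing it visits children first.
    parent = {root: None}
    order = [root]
    i = 0
    while i < len(order):
        v = order[i]
        i += 1
        for w in tree_adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)

    include, exclude = {}, {}
    for v in reversed(order):
        children = [w for w in tree_adj[v] if w != parent[v]]
        # If v is taken, none of its children may be taken.
        include[v] = weight[v] + sum(exclude[c] for c in children)
        # If v is skipped, each child subtree independently picks its better option.
        exclude[v] = sum(max(include[c], exclude[c]) for c in children)
    return max(include[root], exclude[root])
```

On bounded tree-width graphs the same idea applies with one table per bag instead of one pair per vertex, at a cost exponential in the bag size.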


Heuristics for low-width decompositions

• In numerical linear algebra, one often wants to permute the rows of a matrix before computing a factorization so that the resulting factors are as sparse as possible. The objective is to minimize the number of “fill edges” added.

Comparison of width and fill from 6 heuristics on graphs known to have tw <= 30

• For tree decompositions, we instead need to minimize the maximum clique size in the resulting chordal graph.

• Numerous implementations of common heuristics are available, and we tested several on a large set of random graphs with a fixed maximum width and varying sizes.

• Min-degree-based heuristics are orders of magnitude faster than min-fill, etc.
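A plain minimum-degree elimination heuristic can be sketched as follows. It is not one of the tested implementations (for example, it is not AMD); it is a minimal illustration that eliminates a minimum-degree vertex, turns its neighbors into a clique, and tracks both the largest clique formed and the number of fill edges added. Ties are broken by dictionary order, which is an arbitrary choice, and the function name is illustrative.

```python
def min_degree_elimination(adj):
    """Greedy minimum-degree elimination.

    Returns (order, largest_bag, fill), where largest_bag is the size of the
    biggest clique formed (vertex plus its neighbors at elimination time) and
    fill is the number of edges added. largest_bag - 1 upper-bounds treewidth.
    """
    graph = {v: set(nbrs) for v, nbrs in adj.items()}  # working copy
    order, largest_bag, fill_twice = [], 0, 0
    while graph:
        v = min(graph, key=lambda u: len(graph[u]))
        nbrs = graph.pop(v)
        largest_bag = max(largest_bag, len(nbrs) + 1)
        for a in nbrs:
            graph[a].discard(v)
        for a in nbrs:
            # Connect a to every other neighbor of v it is not yet adjacent to.
            missing = nbrs - graph[a] - {a}
            fill_twice += len(missing)  # each new edge is seen from both ends
            graph[a] |= missing
        order.append(v)
    return order, largest_bag, fill_twice // 2
```

A min-fill variant would instead pick the vertex whose elimination adds the fewest edges, which requires computing `missing` for every candidate and helps explain the large speed gap noted above.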


Empirical results: Synthetic

MCS Lower Bounds:

AMD Upper Bounds:


MCS Lower Bounds:

AMD Upper Bounds: More…


Empirical Results: Facebook


Empirical Results: Autonomous Systems

• A larger AS graph had similar results: its 600K nodes yielded a largest connected component of 200K nodes, with an upper bound of 5961 and a lower bound of 32.


Problems with Using Tree Decompositions

• Every bag in a tree decomposition is a vertex separator, so a low-width decomposition means many small separators.

• Treewidth is linear in n, i.e. Ω(n), w/ high probability for many random graphs (Gao 2009):

– Erdos-Renyi graphs G(n,m) when m/n > 1.073

– Random intersection graphs G(n,m,p) on universe {1, …, m} with m = n^a, p at least 2/m and a > 0.

– Barabasi-Albert preferential attachment with at least 12 new edges for each additional vertex.

• Current heuristics get lost in “local noise”


Average k-cores on a tree decomposition

Temperature: 20 Power law exp: 2.1 Avg deg target: 5


Average k-cores on a tree decomposition

Temperature: 0.5 Power law exp: 2.1 Avg deg target: 20


Real Graphs


What’s next?

• Clustering

• Diffusions

• Sparse Dimensionality Reduction

• Applications to Statistical Inference


Acknowledgements

Primary support for this work was provided through the ORNL Laboratory Directed Research & Development SEED Program.

These slides would not have been possible without many hours of hard work by Aaron Adcock.


Backup Slides


Motivation for some improvements to min-degree

[Figure: an example graph comparing the Minimum Degree and Minimum Fill-In heuristics; surviving labels: “Eliminate 9”, “Eliminate 2”]


Tiebreaking with second neighbors

• Gloria investigated various strategies for breaking ties within min-degree and min-fill algorithms

• Her hypothesis was that including information about second-neighborhoods could improve the quality of these heuristics

• Even with optimizations, the improved algorithms were often significantly slower than random tie-breaking, due to the computation of additional information (fill or second-neighborhood sizes)

Joint work with Gloria D’Azevedo (ORHS student) and Chris Groer (ORNL).


MIND MIND+(0.5)(SEC)

An example where second neighbors help
