measuring proximity in networks

Measuring and Extracting Proximity in Networks

Yehuda Koren, Stephen North and Chris Volinsky

KDD 2006, Philadelphia

Outline

• What is proximity and why do we care?
• What are the qualities of a good proximity measure?
• A series of proposals
• Our proposal: Cycle-Free Effective Conductance
• Extraction of proximity graphs
• Applying CFEC to large graphs
• Applications: Call detail, IMDB, DBLP
• Summary and Extensions

http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl


What is Proximity?

• What is the distance between two nodes in a social network?

• proximity [prox·im·i·ty || prɑk'sɪmətɪ /prɒ-]n. adjacency, nearness, closeness, vicinity

What is proximity good for?

• Missing Data
• Link Prediction
• Indirect relations
• Information sharing
• Viral marketing
• Identifying clusters

Our Goals

• Measure and visualize proximity between nodes.
• Measurement should have the following qualities:
– “Close” nodes are intuitive
  • Short graph distance
  • Multiple paths
  • High weights on edges
  • Low degree nodes in the paths
– Monotonicity
– Generalizes to n > 2.

Our goals

• Explain proximity by extracting proximity subgraphs that are readily visualized and contain a large percentage of overall proximity.

• Idea comes from “connection subgraphs” (Faloutsos, McCurley and Tomkins 2004), the small subgraph that best captures the connections between two nodes of the graph

Prox = .0053

Prox = .0048

Large social networks

• Proximity is relevant in all social networks; listed below are a few we have played with
• For now, we consider these as undirected graphs (stay tuned)

data source     |V|      |E|
co-authors      438K     31M
actor-actor     896K     1.1M
phone calls     300M     1000M
IM              200M     800M

Measuring proximity

• Many proposals in the literature (n.b. Liben-Nowell and Kleinberg 2003)
• Graph distance: shortest path
– Doesn’t account for path length, multiple paths, or high-degree nodes
• Maximum Network Flow
– Disregards path length, high degree nodes, depends on bottlenecks
• Electrical networks, or “effective conductance” (e.g. Doyle and Snell 1984)
– High degree nodes still a problem

When is the electric current analogy misleading?

[Figure panels: “Significant connection” and “Noise?”]

• Same current flow in both cases!
• Degree-1 nodes are neutral (attract no flow)

Sink-augmented effective conductance [Faloutsos, McCurley & Tomkins, KDD 2004]

• Connect all nodes to a grounded universal sink (with 0V)
• Tax each node: deliver a portion of the flow to the sink

• No nodes of degree 1 (above problem solved)
• Penalizes long paths
• How do we set the taxing system?
• Doesn’t generalize to n > 2
• No monotonicity…

Universal sink and (non-)monotonicity

With universal sink – no monotonicity:

• For larger networks, proximity tends to zero creating a “size bias”.

• Adding s—t paths can either increase or decrease proximity!

[Plot: proximity vs. network size]

Electrical networks = random walks

• Current-flow notions have direct random walk interpretation

• Take a random walk starting at s, following edges of the graph with probability proportional to their weight (conductance).

• Let Deg(s), the degree of s, be the number of random walks originating at s. Then:

– The escape probability, EP(s→t), is the probability that a walk originating at s will reach t before visiting s again, and

– The effective conductance between s and t is EC(s,t) = EP(s→t) * Deg(s)

With the random walk perspective, you can see that the 1-degree nodes have no influence.

By discouraging “backtracking”, we now can properly account for high degree nodes
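The random-walk reading of effective conductance can be checked numerically on a toy graph. The sketch below is only an illustration under stated assumptions: the graph is an undirected weighted adjacency dict, the helper names (`make_graph`, `hitting_probabilities`, `effective_conductance`) are hypothetical, and the hitting probabilities are found by simple iterative relaxation rather than by any method from the talk. It shows that attaching a degree-1 node leaves EC(s,t) unchanged.

```python
def make_graph(edges):
    """Build an undirected weighted adjacency dict from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def hitting_probabilities(graph, s, t, sweeps=2000):
    """h[x] = probability that a walk from x reaches t before s.

    Boundary values are h[s] = 0 and h[t] = 1; every other node is
    repeatedly set to the weighted average of its neighbours.
    """
    h = {x: 0.0 for x in graph}
    h[t] = 1.0
    for _ in range(sweeps):
        for x in graph:
            if x == s or x == t:
                continue
            deg_x = sum(graph[x].values())
            h[x] = sum(w * h[y] for y, w in graph[x].items()) / deg_x
    return h


def effective_conductance(graph, s, t):
    """Return (EP(s->t), EC(s,t)) with EC(s,t) = EP(s->t) * Deg(s)."""
    deg_s = sum(graph[s].values())
    h = hitting_probabilities(graph, s, t)
    # First step leaves s along an edge chosen proportionally to its weight.
    escape = sum(w * h[y] for y, w in graph[s].items()) / deg_s
    return escape, escape * deg_s


# A two-edge path s - a - t, then the same path with a dead-end node on a.
path = make_graph([("s", "a", 1.0), ("a", "t", 1.0)])
with_pendant = make_graph([("s", "a", 1.0), ("a", "t", 1.0), ("a", "x", 1.0)])
print(effective_conductance(path, "s", "t"))
print(effective_conductance(with_pendant, "s", "t"))
```

Both calls report the same conductance, which is the "degree-1 nodes have no influence" observation in random-walk form.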


Our proximity: cycle free effective conductance

• The cycle-free escape probability, CFEP(s→t), is the probability that a random walk originating at s will reach t without visiting any node more than once

• Multiplying by the degree of the source gives an absolute quantity (accounting for the number of "actually initiated" walks)

• The cycle-free effective conductance between s and t: CFEC(s,t) = CFEP(s→t) * Deg(s)

Higher red→green c.f. escape probability

Lower red→green c.f. escape probability

Properties of CFEC as a proximity measure:
• Accounts for multiple paths
• Favors short paths
• Penalizes high-degree nodes
• Penalizes dead-end paths
• Parameter free
• Has the “right” monotonicity
• Accommodates edge directions
• Has a natural extension to multiple endpoints
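On a tiny graph the definition can be evaluated directly by summing the probabilities of all simple s→t paths. The following sketch assumes an undirected weighted adjacency dict and uses hypothetical helper names (`make_graph`, `cfec`); exhaustive enumeration is exponential in general, so this only illustrates the definition and is not a practical algorithm.

```python
def make_graph(edges):
    """Build an undirected weighted adjacency dict from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def cfec(graph, s, t):
    """Return (CFEP(s->t), CFEC(s,t)) by enumerating every simple path.

    A step from u to v has probability w(u,v) / Deg(u); a walk counts only
    if it reaches t without revisiting any node.
    """
    deg = {x: sum(nbrs.values()) for x, nbrs in graph.items()}

    def escape_from(node, visited):
        if node == t:
            return 1.0
        total = 0.0
        for nxt, w in graph[node].items():
            if nxt not in visited:
                total += (w / deg[node]) * escape_from(nxt, visited | {nxt})
        return total

    cfep = escape_from(s, {s})
    return cfep, cfep * deg[s]


# s - a - t, then the same path where a also has three dead-end neighbours.
plain = make_graph([("s", "a", 1.0), ("a", "t", 1.0)])
hub = make_graph([("s", "a", 1.0), ("a", "t", 1.0),
                  ("a", "x1", 1.0), ("a", "x2", 1.0), ("a", "x3", 1.0)])
print(cfec(plain, "s", "t"))
print(cfec(hub, "s", "t"))
```

In the second graph the walk is likely to wander into a dead end at the high-degree node, so the cycle-free value drops, whereas plain effective conductance would be identical for both graphs.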

Computing CFEC

• Unlike previous measures, exact computation is infeasible on large graphs
• Practically, we can estimate it extremely well
• Probability of paths declines exponentially (e.g., the 100th path is 10^6 times less probable than the first one)
• Estimate using the most probable paths:

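\[
P_{\text{c.f.esc}}(s \to t) = \sum_{\text{simple path } p \text{ from } s \text{ to } t} \operatorname{prob}(p)
\]

\[
P_{\text{c.f.esc}}(s \to t) \approx \sum_{\text{highly probable simple path } p \text{ from } s \text{ to } t} \operatorname{prob}(p)
\]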

Finding k most probable paths

• Finding k shortest simple paths takes O(k|E|log|E|) time [Katoh, Ibaraki and Mine, 1982]

• For an edge u-v of weight w(u,v), define its length

\[
l(u,v) = -\log \frac{w(u,v)}{\sqrt{\deg(u)\,\deg(v)}}
\]

• Edge lengths are non-negative
• exp(-length(path)) = C * Prob(path), where length(path) is the sum of l(u,v) over the edges of the path
• Short path = highly probable path
• Stop path computation when the probability drops below 10^-6 of that of the first path
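The length transform above turns "most probable path" into "shortest path": summing l(u,v) along a path from s to t gives exp(-length) = sqrt(deg(s)/deg(t)) * Prob(path), and the factor is the same for every s→t path. The sketch below illustrates this with a plain best-first search over partial simple paths. That search is a stand-in chosen for brevity, not the Katoh, Ibaraki and Mine algorithm, and it can be exponential on large graphs. The names `make_graph`, `most_probable_paths`, `k` and `ratio` are hypothetical.

```python
import heapq
import math
from itertools import count


def make_graph(edges):
    """Build an undirected weighted adjacency dict from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def most_probable_paths(graph, s, t, k=10, ratio=1e-6):
    """Return up to k simple s->t paths as (path, probability), most probable first.

    Stops early once a path would be less than `ratio` times as probable
    as the first path found.
    """
    deg = {x: sum(nbrs.values()) for x, nbrs in graph.items()}

    def length(u, v):
        return -math.log(graph[u][v] / math.sqrt(deg[u] * deg[v]))

    tie = count()  # tie-breaker so the heap never has to compare paths
    heap = [(0.0, next(tie), [s])]
    found = []
    best_len = None
    while heap and len(found) < k:
        dist, _, path = heapq.heappop(heap)
        # exp(-(dist - best_len)) is the probability relative to the first path.
        if best_len is not None and dist - best_len > -math.log(ratio):
            break
        node = path[-1]
        if node == t:
            if best_len is None:
                best_len = dist
            prob = math.prod(graph[a][b] / deg[a] for a, b in zip(path, path[1:]))
            found.append((path, prob))
            continue
        for nxt in graph[node]:
            if nxt not in path:  # keep the path simple
                heapq.heappush(heap, (dist + length(node, nxt), next(tie), path + [nxt]))
    return found


g = make_graph([("s", "a", 1.0), ("a", "t", 1.0),
                ("s", "b", 1.0), ("b", "c", 1.0), ("c", "t", 1.0)])
for path, prob in most_probable_paths(g, "s", "t", k=5):
    print(path, prob)
```

Because edge lengths are non-negative, partial paths leave the queue in order of increasing length, so complete paths are reported in order of decreasing probability.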

Extracting proximity graphs

Recall FMT’04 “connection subgraphs”, the small subgraph that best captures the connections between two nodes of the graph

Extracting proximity graphs

• Achieve an efficient balance between “size” and “proximity” by maximizing the ratio (with CFEC measured within the subgraph):

\[
\frac{\mathrm{CFEC}(s \to t)^{\alpha}}{|\text{subgraph}|}
\]

• Larger α emphasizes proximity and yields a larger subgraph
– α=0 returns the shortest path
– α=∞ returns all paths

Extracting proximity graphs

• We already have the collection R_k of shortest paths {P1,P2,…,Pk}
• Find the subset of the paths that maximizes (with CFEC measured within the subgraph)

\[
\frac{\mathrm{CFEC}(s \to t)^{\alpha}}{|\text{subgraph}|}
\]

… and combine the selected paths into a “proximity graph”

• This is an NP-hard problem, but recall that we have a list of paths sorted by probability

• Use a branch and bound path merging algorithm
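To make the size/proximity trade-off concrete, here is a deliberately simplified selection heuristic; it is not the branch and bound algorithm mentioned above. It assumes the paths arrive sorted by decreasing probability, considers only prefixes of that list, measures subgraph size as the number of distinct nodes, and estimates the subgraph's CFEC as Deg(s) times the summed probabilities of the chosen paths. All of these are assumptions made for illustration, and the name `best_prefix_subgraph` is hypothetical.

```python
import math


def best_prefix_subgraph(paths, deg_s, alpha=10.0):
    """Pick the prefix of `paths` maximizing CFEC^alpha / |nodes|.

    paths: list of (node_list, probability), most probable first.
    deg_s: degree of the source node s in the full graph.
    Returns a dict with the number of paths kept, the node set, and the
    CFEC estimate, or None if `paths` is empty.
    """
    best = None
    best_score = -math.inf
    nodes = set()
    cfep = 0.0
    for i, (path, prob) in enumerate(paths, start=1):
        nodes.update(path)
        cfep += prob
        cfec = deg_s * cfep
        # Compare logarithms of the ratio to avoid overflow for large alpha.
        score = alpha * math.log(cfec) - math.log(len(nodes))
        if score > best_score:
            best_score = score
            best = {"num_paths": i, "nodes": set(nodes), "cfec": cfec}
    return best


paths = [(["s", "a", "t"], 0.25), (["s", "b", "c", "t"], 0.125)]
for alpha in (0.0, 1.0, 10.0):
    print(alpha, best_prefix_subgraph(paths, deg_s=2.0, alpha=alpha))
```

With α=0 only the most probable path is kept, since extra nodes can only lower the score; a larger α rewards the added proximity enough to bring the second path in.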

Working with large graphs

• Dealing with the full graph is sometimes infeasible and usually unnecessary
• Prior to running the algorithm, we construct a candidate graph in main memory (also FMT ’04).

[Figure: full network (N ~ 350M) → candidate graph (N ~ 10,000) → proximity graph (N ~ 20), between endpoints S and T]

Finding the candidate graph

[Figure sequence: candidate graph built around S and T. Labels: Dist(S,i)=2, Dist(T,i)=2; shortest path of length 10; Dist(S,i)=12, Dist(T,i)=12 for a node i.]

• Stop adding nodes when path probabilities are below ε

• Any path through unscanned node is likely to be low probability

• Once we have this candidate graph, apply CFEC algorithm to extract proximity graph.
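One way to read the stopping rule is as a distance budget: any path through node i is at least as long as dist(S,i) + dist(T,i), so nodes whose combined distance exceeds the best S–T length by more than log(1/ε) can only lie on paths that are at least ε times less probable. The sketch below follows that reading. It is an assumption-laden illustration, not the procedure from the talk: it uses the probability-based edge length -log(w(u,v)/sqrt(deg(u)deg(v))), computes degrees for the whole graph up front, and the names `make_graph`, `dijkstra`, `candidate_graph` and `eps` are hypothetical.

```python
import heapq
import math


def make_graph(edges):
    """Build an undirected weighted adjacency dict from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def dijkstra(graph, deg, source, limit=math.inf, target=None):
    """Distances from source, ignoring nodes farther than `limit`.

    If `target` is given, the search stops as soon as the target is settled.
    Node labels are assumed to be mutually comparable (used to break heap ties).
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u == target:
            break
        for v, w in graph[u].items():
            nd = d - math.log(w / math.sqrt(deg[u] * deg[v]))
            if nd <= limit and nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def candidate_graph(graph, s, t, eps=1e-6):
    """Induced subgraph on nodes i with dist(s,i) + dist(t,i) within budget."""
    deg = {x: sum(nbrs.values()) for x, nbrs in graph.items()}
    best = dijkstra(graph, deg, s, target=t).get(t)
    if best is None:
        return {}  # s and t are not connected
    budget = best + math.log(1.0 / eps)
    from_s = dijkstra(graph, deg, s, limit=budget)
    from_t = dijkstra(graph, deg, t, limit=budget)
    keep = {x for x in from_s if x in from_t and from_s[x] + from_t[x] <= budget}
    return {u: {v: w for v, w in graph[u].items() if v in keep} for u in keep}


g = make_graph([("s", "a", 1.0), ("a", "t", 1.0), ("s", "b", 1.0),
                ("b", "c", 1.0), ("c", "d", 1.0), ("d", "e", 1.0), ("e", "t", 1.0)])
print(sorted(candidate_graph(g, "s", "t", eps=0.5)))
print(sorted(candidate_graph(g, "s", "t", eps=1e-6)))
```

A loose ε keeps only the short route, while a tight ε admits the long detour as well; the extraction step would then run on the returned subgraph instead of the full network.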

Summary: Proximity Graphs

• We have a measure of proximity which fulfills our desired criteria
– Intuitive sense of closeness
– Generalizes to n>2
– Parameter free

• Using this measure of proximity we can efficiently extract the proximity graph.

• Let’s apply to real data

Application: call detail

• AT&T’s call detail graph is large (350M nodes, several billion edges).

• To calculate proximity, we just need an adjacency list
– Dynamic, efficient creation of adjacency lists for transaction graphs (Cortes, Pregibon, and Volinsky 2003)

• Select a random sample of 2000 residential TNs and calculate proximity between them.
– We found a path for 1808 of them
– For those for which we found a path, we calculated proximity and rendered a proximity graph.

Building Proximity Graphs

[Figure: full network (N ~ 350M) → candidate graph (N ~ 10,000) → proximity graph (N ~ 20)]

Distribution of proximities in phone-call network

Application: call detail

• Capturing proximity in a proximity graph…
• Studying α
– Low α: smaller graphs, less proximity captured.
• α = 10 seems to give a good tradeoff

[Plot axes: % captured proximity, # graphs, size of graph]

Proximity as link predictor

• Calculate proximities for a sample of pairs in the network that have never communicated.

• Look in the future to see which of these communicate in the next time period t.

• Did those that eventually communicated have closer proximities?

• I.e., is proximity predictive of future communication?

Mean log proximity:
Communicators = -2.4
Non-comm. = -5.9


Using Visualization

• Different Visualizations bring out different aspects of the proximity graph, especially for n>2.

Using a hierarchical layout for n=2 shows different eras of movie stars

Prox webpage: http://public.research.att.com/~volinsky/cgi-bin/prox/prox.pl


Summary

• Proposed cycle-free effective conductance (CFEC) with a random walk interpretation to measure “proximity” in social networks and other ad-hoc networks
• Described a way of approximating CFEC
• Described a way of visualizing CFEC as a subgraph
• Extended the method to external datasets
• Showed empirical evidence for its utility



Extensions

• Compare to other proximity measures (Katz, PageRank, and other methods compared in Liben-Nowell and Kleinberg (2003))
• Quantify proximity across different kinds of networks
• Extend c.f. effective conductance to:
– Multiple endpoints (already demonstrated)
– Directed edges (future work: use k-shortest paths in a directed graph, alg. due to Hershberger et al.)


