PageRank and Diffusion on Large Graphs

Alexander Tsiatas

University of California, San Diego

Graphs

• A mathematical model for a set of objects with pairwise relationships

– Nodes: represent objects within a larger set

– Edges: characterize pairwise relationships between those objects

• Graphs are omnipresent in the real world, both natural and man-made

– Examples…

Technological networks

• The Internet

– Nodes are routers, computers, switches

– Edges are physical wires between them

• World Wide Web

– Nodes are individual webpages

– Edges represent hyperlinks between pages

Technological networks

Subgraph of the Internet - From the OPTE Project, 2005

Social networks

• Nodes represent people in a population

• Edges represent friendship, co-authorship, physical contact

Paul Erdős

Communication networks

• Physical networks

– Nodes represent telephones

– Edges represent physical telephone wires

• Interaction networks

– misc@cs.ucsd.edu

Transportation networks

Biological networks

• Protein interactions

• The brain as a neural network

• Transcription regulatory networks

– Relationships between proteins and genes

• Chemical transmission between bacteria

• Many, many more

Different graphs, same properties

• “Small-world phenomenon”

– Short paths between pairs of nodes

– Adjacent nodes share more neighbors

• Power-law degree distribution

$\Pr[d_v = k] \propto k^{-\gamma}$

Importance of graphs

• Interesting graphs

• Interesting problems

• Very important to have rigorous analysis

One important barrier to analysis

• In the real world, these graphs are LARGE

– Internet: billions of webpages

– Facebook: over 350M active users

– Millions of road miles in the USA

• Many computations become intractable at this scale

Outline

• Diffusion

• Applications of diffusion

– Local graph partitioning

– Network epidemics

• Future directions

Problems characterized by diffusion

• Propagation on graphs

– Adoption of new products

– Infections and epidemics

– Information dissemination

• Ranking

– Finding the most important or relevant webpage, research paper, etc.

• Routing

– Internet, transportation, …

Random walks

• A model for diffusion

• If random walk is at node u, move to a neighbor v chosen uniformly at random

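• W: random walk matrix

$W(u,v) = \Pr[v \mid u] = \begin{cases} 1/d_u & \text{if } u \sim v \\ 0 & \text{otherwise} \end{cases}$

where d_u is the degree of u and u ~ v means that u and v are adjacent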

Random walk

Dolphin social network – Lusseau et al. 2003

Random walk as a probability distribution

Random walk stationary distribution

Proportional to the node degrees: π(v) = d_v / vol(G), where vol(G) is the sum of all degrees
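The link between the random walk matrix and the degrees can be checked by hand on a tiny example. The sketch below is illustrative only: the 4-node graph and the helper names `random_walk_matrix` and `step` are assumptions made for this example, not part of the original slides. It uses exact fractions so the stationarity check has no rounding error.

```python
from fractions import Fraction

def random_walk_matrix(adj):
    """Return W as nested dicts: W[u][v] = 1/d_u if u ~ v (missing entries are 0)."""
    return {u: {v: Fraction(1, len(nbrs)) for v in nbrs} for u, nbrs in adj.items()}

def step(dist, W):
    """One step of the walk: the row vector dist * W."""
    out = {u: Fraction(0) for u in W}
    for u, mass in dist.items():
        for v, w in W[u].items():
            out[v] += mass * w
    return out

# Hypothetical undirected graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
W = random_walk_matrix(adj)

vol = sum(len(nbrs) for nbrs in adj.values())       # sum of all degrees
pi = {u: Fraction(len(adj[u]), vol) for u in adj}   # d_u / vol(G)

# pi is unchanged by one step of the walk, so it is stationary.
assert step(pi, W) == pi
```

The check works for any undirected graph without isolated nodes; uniqueness of the stationary distribution additionally needs the graph to be connected.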

PageRank

• Originally conceived by Brin and Page (1998) for ranking web pages

• Models an Internet user:

– With probability α, jump to a random web page

– With probability (1 – α), click a hyperlink

– PageRank is the stationary distribution

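$\Pr[v \mid u] = \begin{cases} \alpha/n + (1-\alpha)/d_u & \text{if } u \sim v \\ \alpha/n & \text{otherwise} \end{cases}$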

PageRank

• Model is applicable to any graph, not just the Web

• Captures importance and relationships between nodes

• Solution to the equation $p_\alpha = \alpha\,\frac{\mathbf{1}}{n} + (1-\alpha)\,p_\alpha W$:

– 1 parameter: α

– W is the random walk matrix

– pα is the PageRank vector
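A minimal sketch of how such a vector might be computed on a small graph, assuming the equation takes the standard row-vector form p = α·(1/n)·1 + (1 − α)·pW with W the random walk matrix. The function name `pagerank`, the example graph, and the fixed iteration count are choices made for this illustration.

```python
def pagerank(adj, alpha, iters=200):
    """Iterate p <- alpha * uniform + (1 - alpha) * p W on an undirected graph.

    adj maps each node to its neighbor list; every node is assumed to have
    degree at least 1.
    """
    n = len(adj)
    p = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        nxt = {u: alpha / n for u in adj}            # random-jump term
        for u, nbrs in adj.items():
            share = (1 - alpha) * p[u] / len(nbrs)   # mass sent along each edge
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p

# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
p = pagerank(adj, alpha=0.15)
ranking = sorted(adj, key=p.get, reverse=True)       # most important node first
```

Each iteration shrinks the error by a factor of (1 − α), so a fixed number of rounds is enough for a toy graph; it is this full pass over all nodes per round that becomes expensive at web scale.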

Personalized PageRank

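• Random jumps: instead of jumping to a node uniformly at random, choose a node according to a prescribed distribution s:

$\Pr[v \mid u] = \begin{cases} \alpha\, s(v) + (1-\alpha)/d_u & \text{if } u \sim v \\ \alpha\, s(v) & \text{otherwise} \end{cases}$

• Personalized PageRank is the stationary distribution, the solution to

$p = \alpha s + (1-\alpha)\, p W$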

(Personalized) PageRank properties

• Geometrically weighted sum of random walks

$p_\alpha(s) = \alpha \sum_{t=0}^{\infty} (1-\alpha)^t\, s W^t$

• Linear in s
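Both properties follow from a short calculation. The check below assumes the row-vector convention $p = \alpha s + (1-\alpha)\,pW$ and writes $p_\alpha(s)$ for the personalized PageRank vector with starting distribution $s$; substituting the geometric sum into the right-hand side returns the same sum.

```latex
\begin{aligned}
p_\alpha(s) &= \alpha \sum_{t=0}^{\infty} (1-\alpha)^t\, s W^t \\[4pt]
\alpha s + (1-\alpha)\, p_\alpha(s)\, W
  &= \alpha s + \alpha \sum_{t=0}^{\infty} (1-\alpha)^{t+1}\, s W^{t+1} \\
  &= \alpha s + \alpha \sum_{t=1}^{\infty} (1-\alpha)^{t}\, s W^{t} \\
  &= \alpha \sum_{t=0}^{\infty} (1-\alpha)^{t}\, s W^{t} \;=\; p_\alpha(s)
\end{aligned}
```

Every term of the sum is linear in $s$, so $p_\alpha(a s_1 + b s_2) = a\,p_\alpha(s_1) + b\,p_\alpha(s_2)$ for any scalars $a$ and $b$.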

Starting distribution s

α = 0.1

α = 0.01

α = 0.5

With a different s

α = 0.1

Computing PageRank vectors

• Solve matrix equation

– Intractable for large graphs

• Iterate randomized model until convergence

– Fast convergence, but still intractable for large graphs

Approximating PageRank

• Andersen, Chung, Lang (2006)

• ε-approximate PageRank vector for s: a PageRank vector for (s – r), where 0 ≤ r(v) ≤ ε d_v for all v in G

ApproximatePR(s,α,ε)

• Computes p and r such that p is an

ε-approximate PageRank vector

– Starts with p = 0 and r = s

– Iteratively pushes PageRank from r to p until r is small enough

– Maintains p = pr_α(s – r)

ApproximatePR(s,α,ε)

• Computes p and r such that p is an

ε-approximate PageRank vector

– Uses only local computations

– Running time: O(1/(εα)), independent of n

– vol(Supp p): at most 2/((1 – α)ε)

• vol(S): the sum of the degrees of the nodes in S, a measure of the size of S
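The push idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the published procedure: it uses the non-lazy equation p = αs + (1 − α)pW, whereas the algorithm of Andersen, Chung and Lang is stated for a lazy walk, so the constants differ. The function name `approximate_pr`, the queue-based scheduling, and the example graph are choices made for this sketch.

```python
from collections import deque

def approximate_pr(adj, s, alpha, eps):
    """Push-style sketch: returns (p, r) with r[v] < eps * d_v for every node v.

    Assumes the non-lazy equation p = alpha*s + (1-alpha)*p W and a graph
    without self-loops.
    """
    p, r = {}, dict(s)                        # start with p = 0 and r = s
    deg = lambda u: len(adj[u])
    queue = deque(u for u in r if r[u] >= eps * deg(u))
    while queue:
        u = queue.popleft()
        ru = r.get(u, 0.0)
        if ru < eps * deg(u):                 # nothing left to push at u
            continue
        p[u] = p.get(u, 0.0) + alpha * ru     # an alpha fraction settles at u
        r[u] = 0.0
        share = (1 - alpha) * ru / deg(u)     # the rest moves to the neighbors
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + share
            if before < eps * deg(v) <= r[v]:
                queue.append(v)               # v just crossed the push threshold
    return p, r

# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
p, r = approximate_pr(adj, s={0: 1.0}, alpha=0.2, eps=1e-4)
```

Each push settles at least αε of the total mass into p, which bounds the number of pushes by 1/(αε) regardless of how many nodes the graph has; only nodes that actually receive residual mass are ever touched.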

Local graph partitioning

• Small communities within a larger graph structure

– Online communities, social cliques, …

• If v is located within a small community, how can we find it?

• Goal: Design an algorithm to find the community containing v

– Only using local computations

The Cheeger ratio

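• Metric for graph cuts

$h_S = \dfrac{|\partial S|}{\min(\operatorname{vol}(S), \operatorname{vol}(\bar{S}))}$, where $\partial S$ is the set of edges with exactly one endpoint in S

• Cheeger constant: minimum Cheeger ratio over all subsets S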

The Cheeger ratio

h_S = 1

h_S = 0.0645
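A small sketch of how the ratio might be computed, assuming the usual definition: the number of edges leaving S divided by the smaller of vol(S) and the volume of the complement. The function name `cheeger_ratio` and the two-triangle graph are made up for this illustration and are not the graphs pictured on the slide.

```python
def cheeger_ratio(adj, S):
    """h_S = (edges leaving S) / min(vol(S), vol(complement of S)).

    S is assumed to be a nonempty proper subset with positive volume on both sides.
    """
    S = set(S)
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(nbrs) for nbrs in adj.values()) - vol_S
    return boundary / min(vol_S, vol_rest)

# Hypothetical example: two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}

tight = cheeger_ratio(adj, {0, 1, 2})   # one triangle: a single edge leaves the set
loose = cheeger_ratio(adj, {0})         # a single node: every incident edge leaves
```

In this example the triangle has ratio 1/7 while the single node has ratio 1, which mirrors the contrast between a well-separated community and a poorly separated one.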

Relationship between diffusion and graph partitioning

• Suppose S has a low Cheeger ratio

• Start a diffusion process in S

– Unlikely to leave S

• Must be careful about this, though…

Where to start diffusion

Starting a random walk near the boundary of S will make it more likely to leave S.

Where to start diffusion

• Can’t start just anywhere in S

– But there are many nodes in S which work

– Need to find a core of nodes in S

Where to start random walks

• Lovász, Simonovits (1990, 1993)

– If S has Cheeger ratio h_S, then there is a set S_t ⊆ S with volume at least half of vol(S)

– Start a random walk in S_t and run it for t’ ≤ t steps

– The probability that the random walk is outside of S is at most t·h_S

Where to start random walks

Random walk started at green nodes for 5 iterations

Random walk probability on red nodes is less than 5·h_S

Which initial distribution for personalized PageRank

– If S has Cheeger ratio h_S, then there is a set S_α ⊆ S with volume at least half of vol(S)

– Calculate personalized PageRank with the starting distribution s supported on S_α

– The personalized PageRank outside of S is at most h_S/α

Finding small cuts near v: algorithmic ideas

• Simulate diffusion processes

• Examine the results

– Are random walk probabilities or personalized PageRank vectors concentrated among a small set of nodes?

Sweep of a vector

• Suppose p is a vector with components corresponding to nodes in a graph

– Personalized PageRank, random walk probabilities

• Normalize p by dividing by the degree of each node

• Sort p in descending order

• Take the top k nodes to form a set S_k

• Take the S_k with minimal Cheeger ratio
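The steps above can be sketched as follows. The function name `sweep`, the incremental bookkeeping of the cut size, and the example vector are assumptions made for this illustration; the Cheeger ratio is taken to be the number of boundary edges divided by the smaller of the two volumes, on a simple graph without self-loops.

```python
def sweep(adj, p):
    """Return (best_set, best_ratio) over the prefixes S_k of the degree-normalized order."""
    order = sorted((u for u in p if p[u] > 0),
                   key=lambda u: p[u] / len(adj[u]), reverse=True)
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    in_set, vol, cut = set(), 0, 0
    best_set, best_ratio = None, float("inf")
    for u in order:
        inside = sum(1 for v in adj[u] if v in in_set)
        in_set.add(u)
        vol += len(adj[u])
        cut += len(adj[u]) - 2 * inside   # new boundary edges minus edges now internal
        denom = min(vol, total_vol - vol)
        if denom == 0:                    # S_k is the whole graph: no cut to score
            break
        ratio = cut / denom
        if ratio < best_ratio:
            best_set, best_ratio = set(in_set), ratio
    return best_set, best_ratio

# Hypothetical example: two triangles joined by the edge 2-3, with a
# made-up vector p concentrated on the left triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
p = {0: 0.3, 1: 0.3, 2: 0.3, 3: 0.06, 4: 0.02, 5: 0.02}
best_set, best_ratio = sweep(adj, p)
```

Updating the cut size incrementally means the whole sweep costs time proportional to the volume of the support of p plus the sort, which is what keeps the procedure local when p has small support.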

Sweep of a random walk probability vector

• Spielman, Teng (2004)

• If a random walk probability is significantly larger than the stationary distribution, then a sweep over the vector finds a set with small Cheeger ratio.

• More effective if PageRank mixes slowly

Finding a small cut with random walks

• Spielman, Teng (2004)

• Algorithm Nibble

– Simulate a random walk for t_0 steps, starting at v

• (t_0 depends on the size of G and the desired Cheeger ratio h)

– Perform a sweep of the random walk probabilities

• Resulting set S (if one exists)

– Cheeger ratio smaller than a target h

Sweep of a personalized PageRank vector

• If personalized PageRank on a set S is significantly larger than the stationary distribution, then S has low Cheeger ratio

• Can find an S via a sweep on the PageRank vector

• More effective if PageRank mixes slowly

[Formula garbled in extraction; surviving terms: O(…), log, vol(S)]

Finding a small cut with PageRank

• Algorithm PageRank-Nibble

– Compute an (approximate) personalized PageRank vector p, where the starting distribution is on v

– Perform a sweep of p

• Resulting set S (if one exists)

– Cheeger ratio is smaller than target h

– vol(S) is small

Running time

• Nibble: $O(|S| \log^4 m / h^5)$

• PageRank-Nibble: $O(|S| \log^2 m / h^2)$

• Only local computations

– Truncated random walks, approximate PageRank

Graph bipartitioning

What makes a good bipartition?

• Many edges within each subset

• Not many edges between subsets

• Balanced

• Cheeger ratio is a good metric

Putting small cuts together

• Suppose that small cuts found by Nibble or PageRank-Nibble are found by diffusion from a vertex v in the core of a set S

• Suppose that S has small Cheeger ratio

• Can put them together to form a larger cut

– Algorithm Partition (Spielman, Teng 2004)

Putting small cuts together

• Result: a bipartition

– Small Cheeger ratio

– Volume not too large or too small

A refinement

• Andersen and Chung (2008)

• Algorithm Local Partition

– Calculate an approximate personalized PageRank vector p

– Initialize a set S based on the normalized PageRank

– Repeatedly add to S by looking for sharp drops in PageRank, until S is large enough.

A refinement

• Andersen and Chung (2008)

• Algorithm Local Partition

– Resulting S has small Cheeger ratio

– Resulting S is not too large or too small

• Target size x

– If the initial vertex v is within the core of a set C with small Cheeger ratio, then S has a large intersection with C.

• No need to combine smaller cuts

Running time of bipartitioning

• Partition: $O(m \log^6 m / h^5)$

• PageRank-Partition: $O(m \log^4 m / h^2)$

• Local Partition: $O(m \log^2 m / h^2)$

Network epidemics

• Disease in human and animal populations

– H1N1 flu, STDs, SARS, etc.

• Viruses and worms on technological and social networks

– MySpace worms, e-mail attachment viruses, …

• Clear connection to diffusion on graphs

Model for network epidemics

• Contact process (1927)

– Continuous-time Markov process on G

– Each node in G is either healthy or infected

– Infected nodes cure according to vector c

– Healthy nodes become infected by neighbors at rate β

• “SIS” model vs. “SIR” model

• Traditionally, c = 1
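A rough simulation sketch of the SIS version of this process, under one common reading of the rates: an infected node v is cured at rate c[v], and a healthy node is infected at rate β times its number of infected neighbors. The function name `contact_process`, the event-by-event (Gillespie-style) simulation, and the star graph are assumptions made for this illustration.

```python
import random

def contact_process(adj, infected, beta, c, t_max, rng):
    """Simulate the SIS contact process one event at a time.

    Returns (time, infected set) when the infection dies out, or
    (t_max, infected set) if it is still alive at time t_max.
    """
    infected = set(infected)
    t = 0.0
    while infected:
        # Count infected neighbors of every healthy node next to the infection.
        pressure = {}
        for u in infected:
            for v in adj[u]:
                if v not in infected:
                    pressure[v] = pressure.get(v, 0) + 1
        events = [(c[v], v) for v in infected]                   # cure events
        events += [(beta * k, v) for v, k in pressure.items()]   # infection events
        events = [(rate, v) for rate, v in events if rate > 0]
        total = sum(rate for rate, _ in events)
        if total == 0:                     # no event can ever happen again
            return t_max, infected
        t += rng.expovariate(total)        # waiting time until the next event
        if t >= t_max:
            return t_max, infected
        pick = rng.uniform(0, total)       # choose an event proportionally to its rate
        for rate, v in events:
            pick -= rate
            if pick <= 0:
                break
        infected ^= {v}                    # flip v: cured if infected, infected if healthy
    return t, infected

adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}   # hypothetical star graph
c = {v: 1.0 for v in adj}                      # the traditional choice c = 1
t, remaining = contact_process(adj, {0}, beta=0.5, c=c,
                               t_max=100.0, rng=random.Random(0))
```

Passing a different vector c changes only the cure rates, which makes it easy to compare uniform antidote against a degree-based allocation on the same graph.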

Thresholds

• For the contact process on many graphs G, there is an infection threshold β_c

– Existence and properties depend on graph structure

– If β < β_c, then any infection will die out quickly

– If β > β_c, then any infection will persist indefinitely

• First discovered on empirical data, later proven rigorously

Thresholds

• Shown to exist for many classes of graphs (Newman 2002, Ganesh et al. 2005)

– Star graphs

– Complete graphs

– Erdős-Rényi random graphs

– General graphs

• Depending on Cheeger ratio and eigenvalues

Thresholds on power-law graphs

• Threshold tends to zero very quickly for scale-free graphs

– Pastor-Satorras, Vespignani 2001-2; May, Lloyd 2001

– This is for c = 1

• What about a non-constant c?

– Interpret c as the amount of antidote to be given to each node in G

Idea: contact tracing

• Give extra antidote to neighbors of infected nodes

– c now depends on t

• Shown to be ineffective in many cases

– Dezső, Barabási 2002

– Tsimring, Huerta 2003

– Kiss et al. 2005

Contact tracing

• Contact tracing does poorly on star graphs

– Borgs et al. 2008

• Star graphs embedded in scale-free networks

Contact tracing on star graphs

• Star graphs represent the ‘worst case’

– If center is infected, leaves easily infected

– If several leaves infected, center easily infected

• In order for contact tracing to be effective, the total amount of antidote in c must be super-linear in the size of the graph

• Otherwise, threshold goes to zero quickly with the size of the graph

Diffusion-based inoculation scheme

• Borgs et al. 2008

• Give each node antidote equal to its degree

– Random walk stationary distribution

• Any infection dies out in logarithmic time

• Total amount of antidote: vol(G)

– Within a constant factor of the best possible for expander graphs

– c no longer depends on t

Another diffusion-based inoculation scheme

• Previous scheme based on random walks. What about PageRank?

– (Miller, Hyman 2007) Empirical study shows that distributing antidote according to PageRank is effective

– No rigorous analysis

Mathematical relationship between PageRank and network epidemics

• Suppose an infection starts in S, a subset of H, and each node in H receives antidote according to its degree.

• Probability that the infection never leaves H is lower-bounded by s/β times the personalized PageRank on H

• If H has a low Cheeger ratio, then this probability is small.

PageRank-based inoculation scheme

• To combat an infection starting from a set S:

– Find a set H that contains S in its core, with small Cheeger ratio

• This can be done with a local partitioning algorithm

– Give the nodes in H antidote according to their degrees

• Leads to a probabilistic guarantee that the infection will die out in logarithmic time

• vol(H) antidote vs. vol(G)

Future directions

Local clustering

• Directed graphs (some work done)

• Weighted graphs

• Better running time

• Simpler algorithms and analysis

Network epidemics

• Characterize cases when contact tracing is effective

• Smarter ways to find small sets H to inoculate

Game theoretic problems

• Game theory has a lot of problems relating to diffusion on graphs

– Adoption of new products

– Consensus game

New and improved tools

• Better approximations for PageRank

• Heat kernel PageRank and other variants

– Approximation and computation

– Better algorithms?

Conclusions

• Random walks and PageRank lead to tractable algorithms for local graph partitioning

• Random walks and PageRank lead to effective means to combat network infection

Questions?
