PageRank and diffusion on large graphs
Post on 09-Feb-2022
Graphs
• A mathematical model for a set of objects with pairwise relationships
– Nodes: represent objects within a larger set
– Edges: characterize pairwise relationships between those objects
• Graphs are omnipresent in the real world, both natural and man-made
– Examples…
Technological networks
• The Internet
– Nodes are routers, computers, switches
– Edges are physical wires between them
• World Wide Web
– Nodes are individual webpages
– Edges represent hyperlinks between pages
Social networks
• Nodes represent people in a population
• Edges represent friendship, co-authorship, physical contact
Paul Erdős
Communication networks
• Physical networks
– Nodes represent telephones
– Edges represent physical telephone wires
• Interaction networks
– misc@cs.ucsd.edu
Biological networks
• Protein interactions
• The brain as a neural network
• Transcription regulatory networks
– Relationships between proteins and genes
• Chemical transmission between bacteria
• Many, many more
Different graphs, same properties
• “Small-world phenomenon”
– Short paths between pairs of nodes
– Adjacent nodes share more neighbors
• Power-law degree distribution
$$\Pr[d_v = k] \propto k^{-\gamma}$$
Importance of graphs
• Interesting graphs
• Interesting problems
• Very important to have rigorous analysis
One important barrier to analysis
• In the real world, these graphs are LARGE
– Internet: billions of webpages
– Facebook: over 350M active users
– Millions of road miles in the USA
• Many computations become intractable at this scale
Outline
• Diffusion
• Applications of diffusion
– Local graph partitioning
– Network epidemics
• Future directions
Problems characterized by diffusion
• Propagation on graphs
– Adoption of new products
– Infections and epidemics
– Information dissemination
• Ranking
– Finding the most important or relevant webpage, research paper, etc.
• Routing
– Internet, transportation, …
Random walks
• A model for diffusion
• If random walk is at node u, move to a neighbor v chosen uniformly at random
• W: random walk matrix
$$W(u,v) = \Pr[v \mid u] = \begin{cases} 1/d_u & \text{if } u \sim v \\ 0 & \text{otherwise} \end{cases}$$
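The definition of W can be made concrete with a minimal sketch. It assumes the graph is stored as an adjacency dictionary (a representation chosen here for illustration, not taken from the slides), that every node has at least one neighbor, and that distributions are row vectors, so one step of the walk is the product of the distribution with W. The names `random_walk_matrix` and `step` are hypothetical.

```python
from fractions import Fraction

def random_walk_matrix(adj):
    """Return W as nested dicts: W[u][v] = 1/d_u if u ~ v; missing entries are 0."""
    return {u: {v: Fraction(1, len(nbrs)) for v in nbrs} for u, nbrs in adj.items()}

def step(dist, W):
    """One step of the walk: the row vector dist multiplied by W."""
    new = {u: Fraction(0) for u in W}
    for u, mass in dist.items():
        for v, prob in W[u].items():
            new[v] += mass * prob
    return new

# Hypothetical example graph: the path a - b - c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
W = random_walk_matrix(adj)
start = {"a": Fraction(1), "b": Fraction(0), "c": Fraction(0)}
after_two = step(step(start, W), W)  # mass splits evenly between a and c
```

Exact fractions are used only to keep the small example free of rounding; floats would work the same way.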
PageRank
• Originally conceived by Brin and Page (1998) for ranking web pages
• Models an Internet user:
– With probability α, jump to a random web page
– With probability (1 – α), click a hyperlink
– PageRank is the stationary distribution
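$$\Pr[v \mid u] = \begin{cases} \dfrac{\alpha}{n} + \dfrac{1-\alpha}{d_u} & \text{if } u \sim v \\ \dfrac{\alpha}{n} & \text{otherwise} \end{cases}$$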
PageRank
• Model is applicable to any graph, not just the Web
• Captures importance and relationships between nodes
• Solution to equation:
$$p_\alpha = \alpha \frac{\mathbf{1}}{n} + (1-\alpha)\, p_\alpha W$$
– 1 parameter: α
– W is the random walk matrix
– $p_\alpha$ is the PageRank vector
Personalized PageRank
• Random jumps: instead of jumping to a node uniformly at random, choose a node according to a prescribed distribution s:
$$\Pr[v \mid u] = \begin{cases} \alpha s_v + \dfrac{1-\alpha}{d_u} & \text{if } u \sim v \\ \alpha s_v & \text{otherwise} \end{cases}$$
• Personalized PageRank is the stationary distribution, the solution to
$$p = \alpha s + (1-\alpha)\, p W$$
(Personalized) PageRank properties
• Geometrically weighted sum of random walks
• Linear in s
$$p = \alpha \sum_{t=0}^{\infty} (1-\alpha)^t \, s W^t$$
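Both properties follow from unrolling the personalized PageRank equation. The short derivation below is a sketch that assumes the row-vector convention (distributions multiply W from the left) and that p is a probability vector.

```latex
\begin{aligned}
p &= \alpha s + (1-\alpha)\, p W \\
  &= \alpha s + \alpha (1-\alpha)\, s W + (1-\alpha)^2\, p W^2 \\
  &= \alpha \sum_{t=0}^{T-1} (1-\alpha)^t\, s W^t + (1-\alpha)^T\, p W^T .
\end{aligned}
% p W^T is a probability vector, so the remainder term is at most (1-\alpha)^T
% in every entry and vanishes as T grows, leaving the geometric sum.
% Each term s W^t is linear in s, hence so is p.
```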
Computing PageRank vectors
• Solve matrix equation
– Intractable for large graphs
• Iterate randomized model until convergence
– Fast convergence, but still intractable for large graphs
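The "iterate until convergence" option can be sketched as a plain power iteration on the equation p = αs + (1 – α)pW. This is an illustrative sketch, not the method of any particular library: the adjacency-dictionary input, the function name `pagerank`, and the `tol` and `max_iter` parameters are assumptions, and every node is assumed to have at least one neighbor. Passing no `s` uses uniform jumps (standard PageRank); passing a distribution `s` gives personalized PageRank.

```python
def pagerank(adj, alpha, s=None, tol=1e-12, max_iter=10_000):
    """Iterate p <- alpha*s + (1 - alpha)*p*W until the L1 change is below tol."""
    nodes = list(adj)
    if s is None:
        # Uniform jump distribution: standard PageRank.
        s = {v: 1.0 / len(nodes) for v in nodes}
    p = {v: s.get(v, 0.0) for v in nodes}
    for _ in range(max_iter):
        new = {v: alpha * s.get(v, 0.0) for v in nodes}
        for u in nodes:
            # Node u spreads (1 - alpha) of its mass evenly over its neighbors.
            share = (1.0 - alpha) * p[u] / len(adj[u])
            for v in adj[u]:
                new[v] += share
        change = sum(abs(new[v] - p[v]) for v in nodes)
        p = new
        if change < tol:
            break
    return p
```

Each pass shrinks the error by a factor of (1 – α), so few passes are needed, but every pass visits every edge of the graph, which is the cost that becomes prohibitive at large scale.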
Approximating PageRank
• Andersen, Chung, Lang (2006)
• An ε-approximate PageRank vector for s is the exact PageRank vector for (s – r), where the residual r satisfies 0 ≤ r(v) ≤ ε $d_v$ for all v in G
ApproximatePR(s,α,ε)
• Computes p and r such that p is an
ε-approximate PageRank vector
– Starts with p = 0 and r = s
– Iteratively pushes PageRank from r to p until r is small enough
– Maintains the invariant $p = \mathrm{pr}_\alpha(s - r)$, the PageRank vector for the jump distribution s – r
– Uses only local computations
– Running time: O(1/(εα)), independent of n
– vol(Supp p): at most 2/((1 – α)ε)
• vol(S): the sum of the degrees of the nodes in S, a measure of the size of S
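The push idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: it uses the non-lazy walk matrix W with entries 1/d_u as defined in these slides (the published version may differ in details such as using a lazy walk), an adjacency-dictionary graph without self-loops, and the hypothetical name `approximate_pr`. A push at node u moves an α fraction of the residual r(u) into p(u) and spreads the remaining (1 – α) fraction evenly over the neighbors' residuals, which keeps p equal to the PageRank vector for s – r.

```python
from collections import deque

def approximate_pr(adj, s, alpha, eps):
    """Return (p, r) with r[v] < eps * d_v for every node v (assumes eps > 0)."""
    p = {}
    r = dict(s)  # all mass starts in the residual
    queue = deque(u for u in r if r[u] >= eps * len(adj[u]))
    while queue:
        u = queue.popleft()
        d_u = len(adj[u])
        mass = r[u]
        # Push: settle an alpha fraction at u ...
        p[u] = p.get(u, 0.0) + alpha * mass
        r[u] = 0.0
        # ... and spread the rest evenly over the neighbors' residuals.
        share = (1.0 - alpha) * mass / d_u
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + share
            # Enqueue v only at the moment it crosses its threshold.
            if before < eps * len(adj[v]) <= r[v]:
                queue.append(v)
    return p, r
```

Only nodes that receive residual mass are ever touched, which is what makes the computation local: each push settles at least α ε d_u of the total mass, so the work does not depend on the size of the graph.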
Outline
• Diffusion
• Applications of diffusion
– Local graph partitioning
– Network epidemics
• Future directions
Local graph partitioning
• Small communities within a larger graph structure
– Online communities, social cliques, …
• If v is located within a small community, how can we find it?
• Goal: Design an algorithm to find the community containing v
– Only using local computations
The Cheeger ratio
• Metric for graph cuts
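$$h_S = \frac{e(S, \bar{S})}{\min(\operatorname{vol}(S), \operatorname{vol}(\bar{S}))}$$
where $e(S, \bar{S})$ is the number of edges between S and its complement $\bar{S}$
• Cheeger constant: the minimum Cheeger ratio over all subsets S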
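As a small worked example, the Cheeger ratio of a set S (edges leaving S divided by the smaller of vol(S) and the volume of the complement, where volume is the sum of degrees) can be computed directly. The sketch assumes an undirected graph stored as an adjacency dictionary; the function names and the two-triangle example graph are hypothetical.

```python
def volume(adj, nodes):
    """Sum of the degrees of the given nodes."""
    return sum(len(adj[v]) for v in nodes)

def cheeger_ratio(adj, S):
    """Edges leaving S divided by min(vol(S), vol(complement of S))."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    denom = min(volume(adj, S), volume(adj, set(adj) - S))
    if denom == 0:
        raise ValueError("S must be a nonempty proper subset with positive volume")
    return cut / denom

# Hypothetical example: two triangles joined by the single edge 2 - 3.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
                 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
ratio = cheeger_ratio(two_triangles, {0, 1, 2})  # 1 cut edge, volume 7 on each side
```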
Relationship between diffusion and graph partitioning
• Suppose S has a low Cheeger ratio
• Start a diffusion process in S
– Unlikely to leave S
• Must be careful about this, though…
Where to start diffusion
Starting a random walk near the boundary of S will make it more likely to leave S.
Where to start diffusion
• Can’t start anywhere in S
– But there are many nodes in S which work
– Need to find a core of nodes in S
Where to start random walks
• Lovász, Simonovits (1990, 1993)
– If S has Cheeger ratio $h_S$, then there is a set $S_t \subseteq S$ with volume at least half of vol(S)
– Start a random walk in $S_t$, for $t' \le t$ steps
– The probability that the random walk is outside of S is at most $t \, h_S$
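A numerical check can illustrate the flavor of this bound. The sketch below does not locate the core set from the theorem; instead it assumes the walk starts from the degree-weighted distribution on all of S, an averaged variant for which the probability of being outside S after t steps is at most t times the Cheeger ratio of S. The function name and the adjacency-dictionary input are hypothetical.

```python
def escape_probabilities(adj, S, t_max):
    """Probability of being outside S after 1..t_max steps of the walk,
    starting from the degree-weighted distribution on S."""
    S = set(S)
    vol_S = sum(len(adj[v]) for v in S)
    dist = {v: (len(adj[v]) / vol_S if v in S else 0.0) for v in adj}
    outside = []
    for _ in range(t_max):
        new = {v: 0.0 for v in adj}
        for u, mass in dist.items():
            # Each node sends its mass to its neighbors uniformly.
            for v in adj[u]:
                new[v] += mass / len(adj[u])
        dist = new
        outside.append(sum(m for v, m in dist.items() if v not in S))
    return outside
```

On two triangles joined by one edge, with S one of the triangles, the Cheeger ratio is 1/7 and the first-step escape probability is exactly 1/7, so the bound is tight at t = 1 in that example.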
Where to start random walks
Random walk started at green nodes for 5 iterations
Random walk probability on red nodes is less than $5 h_S$
Which initial distribution for personalized PageRank
• Andersen, Chung, Lang (2006)
– If S has Cheeger ratio $h_S$, then there is a set $S_\alpha \subseteq S$ with volume at least half of vol(S)
– Calculate personalized PageRank with s contained in $S_\alpha$
– The personalized PageRank outside of S is at most $h_S / \alpha$
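The link between the random walk bound and the PageRank bound can be seen from the geometric-sum form of personalized PageRank. The derivation below is a heuristic sketch, not the published proof: it assumes that, for the chosen starting distribution s, the random walk satisfies the escape bound at most t h_S for every number of steps t.

```latex
\begin{aligned}
p(\bar{S}) &= \alpha \sum_{t=0}^{\infty} (1-\alpha)^t \, [sW^t](\bar{S})
  && \text{(geometric sum of random walks)} \\
 &\le \alpha \sum_{t=0}^{\infty} (1-\alpha)^t \, t \, h_S
  && \text{(assumed escape bound)} \\
 &= \alpha \, h_S \cdot \frac{1-\alpha}{\alpha^2}
  && \Bigl(\textstyle\sum_{t \ge 0} t x^t = \frac{x}{(1-x)^2},\ x = 1-\alpha\Bigr) \\
 &= \frac{(1-\alpha)\, h_S}{\alpha} \;\le\; \frac{h_S}{\alpha}.
\end{aligned}
```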
Finding small cuts near v: algorithmic ideas
• Simulate diffusion processes
• Examine the results
– Are random walk probabilities or personalized PageRank vectors concentrated among a small set of nodes?
Sweep of a vector
• Suppose p is a vector with components corresponding to nodes in a graph
– Personalized PageRank, random walk probabilities
• Normalize p by dividing each entry p(v) by the degree $d_v$
• Sort the nodes in descending order of the normalized values
• For each k, take the top k nodes to form a set $S_k$
• Return the $S_k$ with minimal Cheeger ratio
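The sweep can be sketched in a few lines. This is an illustrative version under assumptions: the graph is a simple undirected graph stored as an adjacency dictionary, only nodes with positive p are swept (a common choice, not stated in the slides), and the function name `sweep` is hypothetical. The cut size and volume are updated incrementally as each node joins the prefix, so the whole sweep costs one sort plus one pass over the edges of the swept nodes.

```python
def sweep(adj, p):
    """Return (best_set, best_ratio) over prefixes of nodes sorted by p(v)/d_v."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted((v for v in adj if p.get(v, 0.0) > 0.0),
                   key=lambda v: p[v] / len(adj[v]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best_set, best_ratio = None, float("inf")
    for v in order:
        inside = sum(1 for w in adj[v] if w in in_set)
        in_set.add(v)
        vol += len(adj[v])
        # Edges from v to outside nodes join the cut; edges from v to nodes
        # already in the prefix were cut edges and are now internal.
        cut += len(adj[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break  # the prefix is the whole graph
        ratio = cut / denom
        if ratio < best_ratio:
            best_set, best_ratio = set(in_set), ratio
    return best_set, best_ratio
```

For example, on two triangles joined by a single edge, a vector concentrated on one triangle yields that triangle as the best prefix, with Cheeger ratio 1/7.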
Sweep of a random walk probability vector
• Spielman, Teng (2004)
• If a random walk probability is significantly larger than the stationary distribution, then a sweep over the vector finds a set with small Cheeger ratio.
• More effective if the random walk mixes slowly
Finding a small cut with random walks
• Spielman, Teng (2004)
• Algorithm Nibble
– Simulate a random walk for $t_0$ steps, starting at v
• ($t_0$ depends on size of G, desired Cheeger ratio h)
– Perform a sweep of the random walk probabilities
• Resulting set S (if one exists)
– Cheeger ratio smaller than a target h
Sweep of a personalized PageRank vector
• Andersen, Chung, Lang (2006)
• If personalized PageRank on a set S is significantly larger than the stationary distribution, then S has low Cheeger ratio
• Can find an S via a sweep on the PageRank vector
• More effective if PageRank mixes slowly
$O(\log \operatorname{vol}(S))$
Finding a small cut with PageRank
• Andersen, Chung, Lang (2006)
• Algorithm PageRank-Nibble
– Compute an (approximate) personalized PageRank vector p, where the starting distribution is on v
– Perform a sweep of p
• Resulting set S (if one exists)
– Cheeger ratio is smaller than target h
– vol(S) is small
Running time
• Nibble: $O(|S| \log^4 m / h^5)$
• PageRank-Nibble: $O(|S| \log^2 m / h^2)$
• Only local computations
– Truncated random walks, approximate PageRank
What makes a good bipartition?
• Many edges within each subset
• Not many edges between subsets
• Balanced
• Cheeger ratio is a good metric
Putting small cuts together
• Nibble and PageRank-Nibble find small cuts by diffusion from a vertex v in the core of a set S
• Suppose that S has small Cheeger ratio
• Can put them together to form a larger cut
– Algorithm Partition (Spielman, Teng 2004)
Putting small cuts together
• Result: a bipartition
– Small Cheeger ratio
– Volume not too large or too small
A refinement
• Andersen and Chung (2008)
• Algorithm Local Partition
– Calculate an approximate personalized PageRank vector p
– Initialize a set S based on the normalized PageRank
– Repeatedly add to S by looking for sharp drops in PageRank, until S is large enough.
A refinement
• Andersen and Chung (2008)
• Algorithm Local Partition
– Resulting S has small Cheeger ratio
– Resulting S is not too large or too small
• Target size x
– If the initial vertex v is within the core of a set C with small Cheeger ratio, then S has a large intersection with C.
• No need to combine smaller cuts
Running time of bipartitioning
• Partition: $O(m \log^6 m / h^5)$
• PageRank-Partition: $O(m \log^4 m / h^2)$
• Local Partition: $O(m \log^2 m / h^2)$
Outline
• Diffusion
• Applications of diffusion
– Local graph partitioning
– Network epidemics
• Future directions
Network epidemics
• Disease in human and animal populations
– H1N1 flu, STDs, SARS, etc.
• Viruses and worms on technological and social networks
– MySpace worms, e-mail attachment viruses, …
• Clear connection to diffusion on graphs
Model for network epidemics
• Contact process (1927)
– Continuous-time Markov process on G
– Each node in G is either healthy or infected
– Infected nodes are cured at rates given by a vector c
– Healthy nodes become infected by neighbors at rate β
• “SIS” model vs. “SIR” model
• Traditionally, c = 1
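The model can be explored with a small event-driven simulation. The sketch below is illustrative and rests on assumptions not spelled out in the slides: a healthy node with k infected neighbors becomes infected at rate βk (the usual reading of the contact process), an infected node v is cured at rate c[v], and the graph is an adjacency dictionary. The function name, the `t_max` cutoff, and the `seed` parameter are hypothetical.

```python
import random

def simulate_contact_process(adj, beta, c, infected, t_max, seed=0):
    """Run the process until time t_max or extinction; return (time, infected set)."""
    rng = random.Random(seed)
    infected = set(infected)
    t = 0.0
    while infected:
        # List every event possible in the current state with its rate.
        events, rates = [], []
        for v in adj:
            if v in infected:
                events.append(("cure", v))
                rates.append(c[v])
            else:
                k = sum(1 for w in adj[v] if w in infected)
                if k > 0:
                    events.append(("infect", v))
                    rates.append(beta * k)
        total = sum(rates)
        if total <= 0.0:
            break  # no event can occur any more
        t += rng.expovariate(total)  # exponential waiting time to the next event
        if t >= t_max:
            return t_max, infected
        # Pick the event with probability proportional to its rate.
        kind, v = rng.choices(events, weights=rates)[0]
        if kind == "cure":
            infected.discard(v)
        else:
            infected.add(v)
    return t, infected
```

Rebuilding the event list at every step keeps the sketch short but costs time proportional to the graph size per event; a serious simulation would update rates incrementally.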
Thresholds
• For the contact process on many graphs G, there is an infection threshold βc
– Existence and properties depend on graph structure
– If β < βc, then any infection will die out quickly
– If β > βc, then any infection will persist indefinitely
• First discovered on empirical data, later proven rigorously
Thresholds
• Shown to exist for many classes of graphs (Newman 2002, Ganesh et al. 2005)
– Star graphs
– Complete graphs
– Erdős-Rényi random graphs
– General graphs
• Depending on Cheeger ratio and eigenvalues
Thresholds on power-law graphs
• Threshold tends to zero very quickly for scale-free graphs
– Pastor-Satorras, Vespignani 2001-2; May, Lloyd 2001
– This is for c = 1
• What about a non-constant c?
– Interpret c as the amount of antidote to be given to each node in G
Idea: contact tracing
• Give extra antidote to neighbors of infected nodes
– c now depends on t
• Shown to be ineffective in many cases
– Dezső, Barabási 2002
– Tsimring, Huerta 2003
– Kiss et al. 2005
Contact tracing
• Contact tracing does poorly on star graphs
– Borgs et al. 2008
• Star graphs embedded in scale-free networks
Contact tracing on star graphs
• Star graphs represent the ‘worst case’
– If center is infected, leaves easily infected
– If several leaves infected, center easily infected
• In order for contact tracing to be effective, the total amount of antidote in c must be super-linear in the size of the graph
• Otherwise, threshold goes to zero quickly with the size of the graph
Diffusion-based inoculation scheme
• Borgs et al. 2008
• Give each node antidote equal to its degree
– Random walk stationary distribution
• Any infection dies out in logarithmic time
• Total amount of antidote: vol(G)
– Within a constant factor of the best possible for expander graphs
– c no longer depends on t
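As a tiny worked example of this scheme, the antidote vector is just the degree vector, and its total is vol(G), which equals twice the number of edges. The sketch assumes an adjacency-dictionary graph; the function name and the star example are hypothetical.

```python
def degree_antidote(adj):
    """Antidote vector with c[v] = d_v; the total handed out is vol(G)."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

# Hypothetical example: a star with center 0 and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
c = degree_antidote(star)
total = sum(c.values())  # vol(G) = 2 * (number of edges) = 8
```

Under the assumption that each infected neighbor infects at rate β, a healthy node v is infected at rate at most β d_v while being cured at rate d_v, so the center of the star is no longer easier to infect, relative to its cure rate, than a leaf.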
Another diffusion-based inoculation scheme
• Previous scheme based on random walks. What about PageRank?
– (Miller, Hyman 2007) Empirical study shows that distributing antidote according to PageRank is effective
– No rigorous analysis
Mathematical relationship between PageRank and network epidemics
• Suppose an infection starts in S, a subset of H, and each node in H receives antidote according to its degree.
• Probability that the infection never leaves H is lower-bounded by s/β times the personalized PageRank on H
• If H has a low Cheeger ratio, then this probability is small.
PageRank-based inoculation scheme
• To combat an infection starting from a set S:
– Find a set H that contains S in its core, with small Cheeger ratio
• This can be done with a local partitioning algorithm
– Give nodes in H antidote according to their degrees
• Leads to a probabilistic guarantee that the infection will die out in logarithmic time
• vol(H) antidote vs. vol(G)
Outline
• Diffusion
• Applications of diffusion
– Local graph partitioning
– Network epidemics
• Future directions
Local clustering
• Directed graphs (some work done)
• Weighted graphs
• Better running time
• Simpler algorithms and analysis
Network epidemics
• Characterize cases when contact tracing is effective
• Smarter ways to find small sets H to inoculate
Game theoretic problems
• Game theory has a lot of problems relating to diffusion on graphs
– Adoption of new products
– Consensus game
New and improved tools
• Better approximations for PageRank
• Heat kernel PageRank and other variants
– Approximation and computation
– Better algorithms?
Conclusions
• Random walks and PageRank lead to tractable algorithms for local graph partitioning
• Random walks and PageRank lead to effective means to combat network infection