DrunkardMob: Billions of Random Walks on Just a PC

DrunkardMob - RecSys '13

DrunkardMob: Billions of Random Walks on Just a PC

Aapo Kyrola

Carnegie Mellon University

Twitter: @kyrpov

Big Data – small machine

Thanks: A major part of this work was done during a visit to Twitter’s Personalization and Recommendations team (Fall 2012).


This work in a Nutshell

1. Background: Random-walk-based methods are popular in recommender systems.

2. Research problem: How to simulate random walks if your graph does not fit in memory?

3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.


Contents

• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments

All code is available on GitHub: http://github.com/graphchi/graphchi-java


Introduction: Random Walks

• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.

• A walk is a random sequence of t visits to vertices:

w := source(0) → v(1) → v(2) → v(3) → … → v(t)

• Walks follow edges by default, but can also reset or teleport with a certain probability.
– Transition probability: P(v(k+1) | v(k))


Introduction (cont.)

• Usually we are interested in the distribution of the visits.
– Either the global distribution or a separate one for each source.
– Many applications (PageRank, FolkRank, SALSA, ...)

• Can be used to generate candidates:
– Choose the top K visited vertices as candidates to recommend.

Example: Global PageRank

• Model: a random surfer who starts from a random web page and clicks each link on the page with uniform probability:
– With probability d, teleports to a random vertex → infinite walk.

• PageRank(web page) ~ authority of the web page. It can be computed very efficiently using “power iteration” (in seconds or minutes, even for graphs with billions of vertices) → not interesting.

[Figure: a random surfer at a vertex with three out-links. Each link is followed with P = (1-d) / 3; with P = d the surfer teleports to “any vertex”.]
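The power iteration mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the talk: it follows the deck's convention that d is the teleport probability, and the function name, the toy graph, the default d = 0.15 and the handling of vertices without out-links are all assumptions made for the example.

```python
def pagerank(out_links, d=0.15, iterations=50):
    """Global PageRank by power iteration.

    out_links: dict mapping each vertex to a list of its out-neighbors.
    d: teleport probability (the deck's convention: P = d for teleporting).
    """
    vertices = list(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex receives the teleport mass d / n.
        new_rank = {v: d / n for v in vertices}
        for v in vertices:
            neighbors = out_links[v]
            if neighbors:
                # Each out-link is followed with probability (1 - d) / outdegree.
                share = (1.0 - d) * rank[v] / len(neighbors)
                for u in neighbors:
                    new_rank[u] += share
            else:
                # Assumption: a vertex without out-links spreads its mass uniformly.
                share = (1.0 - d) * rank[v] / n
                for u in vertices:
                    new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "e" has no in-links, so it only receives teleport mass.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "e": ["c"]}
ranks = pagerank(toy_graph)
```

Each pass touches every edge once, which is why the global computation is cheap compared with running it once per source.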



Personalized Pagerank

• PageRank | home (source) nodes:
– Compute a PageRank vector for each node separately → resets only to the home node(s).
– Restrict home nodes to some category / topic / pages visited by a user.

• Used e.g. for social network recommendations.

[Figure: the same random surfer, but with P = d the walk resets to the home vertex; each of the three out-links is followed with P = (1-d) / 3.]


Personalized Pagerank (cont.)

• Naïve computation of Personalized PageRank (PPR):
– Compute a PageRank vector for each source separately using power iteration: O(n^2)

• Approximate by sampling:
– Simulate actual walks on the graph.


Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
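A runnable version of the in-memory loop above, used to estimate PPR for one source by sampling. This is a sketch under assumptions: the function name, the reset probability of 0.15, the sequential (non-parallel) loop and the toy graph are illustrative and do not come from the talk.

```python
import random
from collections import Counter


def estimate_ppr(out_links, source, num_walks=1000, num_steps=10,
                 reset_prob=0.15, seed=0):
    """Estimate personalized PageRank for `source` from short random walks."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        vertex = source
        for _ in range(num_steps):
            neighbors = out_links[vertex]
            # Reset to the home vertex with probability reset_prob,
            # or when the walk is stuck at a vertex without out-links.
            if not neighbors or rng.random() < reset_prob:
                vertex = source
            else:
                vertex = rng.choice(neighbors)
            visits[vertex] += 1
    total = sum(visits.values())
    # Normalized visit counts approximate the PPR vector of the source.
    return {v: count / total for v, count in visits.items()}


# Hypothetical toy graph: "e" is unreachable from "a", so it is never visited.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "e": ["c"]}
ppr = estimate_ppr(toy_graph, "a")
```

Every step needs random access to the neighbor list of the current vertex, which is cheap only while the whole graph sits in RAM.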


Problem: What if Graph does not fit in memory?

Twitter network visualization, by Akshay Java, 2009

Distributed graph systems:
- Each hop across a partition boundary is costly.

Disk-based “single-machine” graph systems:
- “Paging” from disk is costly.

(This talk)


DISK-BASED GRAPH SYSTEMS


Disk-based Graph Systems

• Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
– GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13) – model not suitable]

• We assume a vertex-centric model:
– Computation is done one vertex at a time.


GraphChi execution model

For T iterations:
    For p = 1 to P:
        For vertex in interval(p):
            updateFunction(vertex)

[Figure: the vertex ids 1..n are split into interval(1), interval(2), ..., interval(P); each interval(p) has a corresponding shard(p).]
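The loop structure above can be made concrete as follows. This is only a sketch of the iteration order, not GraphChi's API: the names `make_intervals` and `run` are hypothetical, the intervals are simply near-equal ranges of vertex ids, and the loading of shards from disk is reduced to a comment.

```python
def make_intervals(num_vertices, num_intervals):
    """Split vertex ids 0..num_vertices-1 into contiguous, near-equal ranges."""
    base, extra = divmod(num_vertices, num_intervals)
    intervals, start = [], 0
    for p in range(num_intervals):
        size = base + (1 if p < extra else 0)
        intervals.append(range(start, start + size))
        start += size
    return intervals


def run(num_vertices, num_intervals, num_iterations, update_function):
    """Call update_function once per vertex per iteration, interval by interval."""
    intervals = make_intervals(num_vertices, num_intervals)
    for iteration in range(num_iterations):
        for interval in intervals:
            # A disk-based system would load the data for this interval here.
            for vertex in interval:
                update_function(vertex, iteration)


# Usage: count how often each of 10 vertices is updated in 2 iterations.
update_counts = [0] * 10


def count_update(vertex, iteration):
    update_counts[vertex] += 1


run(10, 3, 2, count_update)
```

The point of the structure is that only one interval's worth of graph data has to be in memory at any moment.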


DRUNKARDMOB ALGORITHM

Random walk is often called “Drunkard’s Walk”


DrunkardMob: Basic Idea

• By example:
– Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
• I.e. 1M different home / source nodes.
– For each user, launch 1000 random walks (with resets), in parallel.
• Each walk takes 10 hops.
• ~ Equivalent to one 10,000-hop walk (with resets) per user.
– For each user, keep track of the visits done by its 1000 short walks → PPR for each user.
– Store the state of each walk in RAM, process the graph from disk.

= 1B random walks in parallel → ~5 GB of RAM.
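The arithmetic behind these numbers, as a quick check. The 4 bytes per walk figure is taken from the "Encoding walks" slide later in the deck; that the remaining gap to ~5 GB is bookkeeping overhead is an assumption, not something the slides state.

```python
users = 1_000_000          # home / source nodes
walks_per_user = 1_000     # short walks launched per user
hops_per_walk = 10
bytes_per_walk = 4         # from the "Encoding walks" slide

total_walks = users * walks_per_user              # walks held in RAM at once
hops_per_user = walks_per_user * hops_per_walk    # compare: one long walk per user
raw_state_gb = total_walks * bytes_per_walk / 1e9 # walk state only, no overhead

print(total_walks, hops_per_user, raw_state_gb)
```

The raw walk state comes to 4 GB, which is in line with the ~5 GB quoted once some overhead is allowed for.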


Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only current position of each walk!
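A small in-memory simulation of the loop above may help show the "reverse" order: the outer loops run over intervals and vertices, and the walks sitting at each vertex are moved as a group. This is a sketch under assumptions, not the graphchi-java implementation: the function name, the reset probability and the toy graph are illustrative, the graph is an in-memory list rather than a stream from disk, and walks are double-buffered so that each pass advances every walk by exactly one hop.

```python
import random
from collections import defaultdict


def drunkardmob(out_links, sources, walks_per_source, num_hops,
                num_intervals, reset_prob=0.15, seed=0):
    """out_links[v] lists the out-neighbors of vertex v (ids 0..n-1)."""
    rng = random.Random(seed)
    n = len(out_links)
    interval_size = -(-n // num_intervals)  # ceiling division

    def empty_table():
        # One table per interval: vertex id -> sources of the walks now at it.
        return [defaultdict(list) for _ in range(num_intervals)]

    walks = empty_table()
    for s in sources:
        walks[s // interval_size][s].extend([s] * walks_per_source)
    visits = {s: defaultdict(int) for s in sources}

    for _ in range(num_hops):
        next_walks = empty_table()
        for p in range(num_intervals):
            snapshot = walks[p]                      # walks inside interval p
            start = p * interval_size
            for vertex in range(start, min(n, start + interval_size)):
                neighbors = out_links[vertex]        # would be read from disk
                for source in snapshot.get(vertex, ()):
                    if not neighbors or rng.random() < reset_prob:
                        target = source              # reset to the home vertex
                    else:
                        target = rng.choice(neighbors)
                    # Only the source and the new position are kept per walk.
                    next_walks[target // interval_size][target].append(source)
                    visits[source][target] += 1
        walks = next_walks
    return visits


# Hypothetical toy graph: vertex 3 has no in-links and is not a source.
graph = [[1, 2], [2], [0], [2]]
result = drunkardmob(graph, sources=[0, 1], walks_per_source=100,
                     num_hops=10, num_intervals=2)
```

Because the adjacency list of a vertex is needed only while its interval is being processed, the graph can be read sequentially while the walk table stays in RAM.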


WalkManager

• Store walks in buckets
– An array for each vertex would cost too much.


Encoding walks

Only 4 bytes / walk.

Keeps track of each path knowledge base applications.
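One way to fit a walk into 4 bytes is to pack it into a single 32-bit integer. The layout below is purely an assumption for illustration (24 bits for the index of the source, 8 bits for the offset of the current vertex inside its bucket); the slides do not specify the actual bit layout, and the function names are hypothetical.

```python
SOURCE_BITS = 24   # assumed: index of the walk's source vertex
OFFSET_BITS = 8    # assumed: offset of the current vertex within its bucket
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode_walk(source_index, bucket_offset):
    """Pack (source index, offset within bucket) into one 32-bit integer."""
    if not 0 <= source_index < (1 << SOURCE_BITS):
        raise ValueError("source index does not fit in 24 bits")
    if not 0 <= bucket_offset <= OFFSET_MASK:
        raise ValueError("bucket offset does not fit in 8 bits")
    return (source_index << OFFSET_BITS) | bucket_offset


def decode_walk(encoded):
    """Inverse of encode_walk."""
    return encoded >> OFFSET_BITS, encoded & OFFSET_MASK


walk = encode_walk(123456, 77)
packed = walk.to_bytes(4, "little")   # exactly 4 bytes per walk
```

With such a layout the bucket a walk is stored in supplies the high bits of its position, so the walk itself only needs the offset.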


Keeping track of walks

[Figure: GraphChi processes one execution interval at a time. The vertex walks table (WalkManager) holds the walks at each vertex; the walk distribution tracker (DrunkardCompanion) keeps the top-N visits for each source (Source A, Source B, ...).]






Keeping track of Walks

• If we don’t have enough RAM to store the distributions:
– Cut long tails: similar to the problem of estimating the top-K frequent items in data streams with limited memory.

• Can also write hops to disk (bucket-by-bucket) and analyze later.
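As an illustration of the limited-memory top-K idea, here is the classic Misra-Gries frequent-items summary. It is named here as one well-known technique for this problem, not as the method the talk's DrunkardCompanion uses; the stream of visits is made up for the example.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive; the stored counts are underestimates of the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical visit stream for one source: two hot vertices and a long tail.
visits = ["a"] * 50 + ["b"] * 30 + list("cdefghij") * 2 + ["a"] * 10
summary = misra_gries(visits, k=4)
```

The long tail of rarely visited vertices is discarded automatically, while memory use stays bounded by k regardless of how many distinct vertices the walks touch.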


Validity

• We assume that simulating 2000 x 5-hop walks with resets ~ one 10,000-hop walk with resets.
– Not exactly the same distribution: some longer streaks are not covered.
• But those would not be relevant for recommendations anyway!
– See Fogaras (2005) for analysis.


Related Work

• Fogaras, Racz, Csalogany, Sarlos: “Towards scaling fully personalized pagerank: Algorithms, lower bounds, experiments” (2005)
– Similar idea with a full external-memory implementation.
• We keep walks in memory.

• Plenty of research in approximating PPR.


EXPERIMENTS

See paper for more experiments!


Case Study: Twitter WTF

• Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper)
– Based on the WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate a set of candidates to recommend for each user.
– See paper.


PPR: Full Twitter Graph

With a large server with SSD and 144 GB of memory:

[Figure: runtime results, not preserved in this transcript.]

On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.


Runtime / Graph size

Running time ~ linear with graph size


Comparison to in-memory walks

Competitive with in-memory walks. However, if you can fit your graph in memory – no need for DrunkardMob.


Summary

• DrunkardMob allows simulating random walks efficiently on extremely large graphs
– Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
– Graph size is not limited by RAM.
– Implement Twitter Who-To-Follow on your laptop!

• Future work: Adapt to distributed graph systems.
– Even Hadoop if you really really want.


Thank You!

• Code: http://github.com/graphchi/graphchi-java

Aapo Kyrölä
Ph.D. candidate @ CMU

http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.
