DrunkardMob: Billions of Random Walks on Just a PC

DrunkardMob: Billions of Random Walks on Just a PC Aapo Kyrola Carnegie Mellon University Twitter: @kyrpov Big Data – small machine Thanks: Major part of this work done during visit at Twitter’s Personalization and Recommendations team (Fall- 2012). DrunkardMob - RecSys '13

Upload: aapo-kyroelae

Post on 11-May-2015

DESCRIPTION

Research paper presentation at RecSys explaining how to simulate random walks if your graph does not fit in memory.

TRANSCRIPT

Page 1: DrunkardMob: Billions of Random Walks on Just a PC

DrunkardMob - RecSys '13

DrunkardMob: Billions of Random Walks on Just a PC

Aapo Kyrola

Carnegie Mellon University

Twitter: @kyrpov

Big Data – small machine

Thanks: A major part of this work was done during a visit to Twitter’s Personalization and Recommendations team (Fall 2012).

Page 2: DrunkardMob: Billions of Random Walks on Just a PC

This work in a Nutshell

1. Background: Random walk-based methods are popular in recommender systems.

2. Research problem: How to simulate random walks if your graph does not fit in memory?

3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.

Page 3: DrunkardMob: Billions of Random Walks on Just a PC

Contents

• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments

All code is available on GitHub: http://github.com/graphchi/graphchi-java

Page 4: DrunkardMob: Billions of Random Walks on Just a PC

Introduction: Random Walks

• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.

• A walk is a sequence of t random visits to vertices:

w := source(0) → v(1) → v(2) → v(3) → … → v(t)

• Walks follow edges by default, but can also reset or teleport with a certain probability.
– Transition probability: P(v(k+1) | v(k))
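
As a sketch, the transition probability of a walk that resets to its source can be written out as below. The symbols are assumptions chosen for illustration: d is the reset probability (the same letter the PageRank slides use for teleporting), s is the walk's source, and deg⁺(v) is the out-degree of v, assumed nonzero.

```latex
\[
P\bigl(v(k+1) = u \mid v(k) = v\bigr)
  = d \cdot \mathbf{1}[u = s]
  + (1 - d) \cdot \frac{\mathbf{1}[(v, u) \in E]}{\deg^{+}(v)}
\]
```

For a walk that teleports to a uniformly random vertex instead of resetting, the first term would become d / |V|.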

Page 5: DrunkardMob: Billions of Random Walks on Just a PC

Introduction (cont.)

• Usually we are interested in the distribution of the visits.
– Either the global distribution or one for each source separately.
– Many applications (PageRank, FolkRank, SALSA, ...)

• Can be used to generate candidates:
– Choose the top K visited vertices as candidates to recommend.

Page 6: DrunkardMob: Billions of Random Walks on Just a PC

Example: Global PageRank

• Model: a random surfer who starts from a random web page and clicks each link on the page with uniform probability:
– With probability d, teleports to a random vertex → infinite walk.

• PageRank(web page) ~ authority of the web page.

Can be computed very efficiently using “power iteration” (in seconds or minutes, even for graphs with billions of vertices) → Not interesting.
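
For reference, power iteration for global PageRank can be sketched as follows. This is a minimal illustration, not code from the talk: the dict-of-lists graph layout, the function name, and the handling of dead ends (spreading their mass uniformly) are assumptions, and d is used as the teleport probability, following the slide's convention.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Power iteration; `graph` maps each vertex to its list of out-neighbors."""
    vertices = list(graph)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex receives the teleport share d / n.
        new_rank = {v: d / n for v in vertices}
        for v in vertices:
            out = graph[v]
            if out:
                # Follow each out-link with probability (1 - d) / outdegree.
                share = (1.0 - d) * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                # Assumption: a dead end spreads its mass over all vertices.
                share = (1.0 - d) * rank[v] / n
                for u in vertices:
                    new_rank[u] += share
        rank = new_rank
    return rank


toy_graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank(toy_graph)
```

Each iteration touches every edge once, which is why the global computation is cheap compared with running it separately per source.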

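[Figure: from the current vertex, the surfer teleports to “any vertex” with probability P = d, or follows each of the three out-links with probability P = (1-d) / 3.]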

Page 7: DrunkardMob: Billions of Random Walks on Just a PC

Personalized PageRank

• PageRank | home (source) nodes:
– Compute a PageRank vector for each node separately → resets only to the home node(s).
– Restrict home nodes to some category / topic / pages visited by a user.

• Used e.g. for social network recommendations.

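[Figure: from the current vertex, the walk resets to the home vertex with probability P = d, or follows each of the three out-links with probability P = (1-d) / 3.]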

Page 8: DrunkardMob: Billions of Random Walks on Just a PC

Personalized PageRank (cont.)

• Naïve computation of Personalized PageRank (PPR):
– Compute a PageRank vector for each source separately using power iteration: O(n^2)

• Approximate by sampling:
– Simulate actual walks on the graph.
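
The sampling approach can be sketched as follows: launch many short walks with resets from one source, count the visits, and normalize the counts into an estimated PPR vector. The function name, the dict-of-lists graph layout, and the parameter values are assumptions made for this illustration.

```python
import random
from collections import Counter


def estimate_ppr(graph, source, num_walks, hops_per_walk, reset_prob, seed=0):
    """Estimate personalized PageRank for `source` from simulated walks."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        current = source
        for _ in range(hops_per_walk):
            neighbors = graph[current]
            if not neighbors or rng.random() < reset_prob:
                current = source                 # reset (also used at dead ends)
            else:
                current = rng.choice(neighbors)  # follow a random out-edge
            visits[current] += 1
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}


toy_graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
ppr = estimate_ppr(toy_graph, source=0, num_walks=1000, hops_per_walk=10, reset_prob=0.15)
top_k = sorted(ppr, key=ppr.get, reverse=True)[:2]  # candidate vertices to recommend
```

Vertex 3 has no incoming edges, so no walk from source 0 can reach it and it never appears in the estimate.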

Page 9: DrunkardMob: Billions of Random Walks on Just a PC

Random walk in an in-memory graph

• Compute one walk at a time (multiple in parallel, of course):

parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())

Page 10: DrunkardMob: Billions of Random Walks on Just a PC

Problem: What if Graph does not fit in memory?

Twitter network visualization, by Akshay Java, 2009

Distributed graph systems:
- Each hop across a partition boundary is costly.

Disk-based “single-machine” graph systems:
- “Paging” from disk is costly.

(This talk)

Page 11: DrunkardMob: Billions of Random Walks on Just a PC

DISK-BASED GRAPH SYSTEMS

Page 12: DrunkardMob: Billions of Random Walks on Just a PC

Disk-based Graph Systems

• Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
– GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13) – model not suitable]

• We assume a vertex-centric model:
– Computation is done one vertex at a time.

Page 13: DrunkardMob: Billions of Random Walks on Just a PC

GraphChi execution model

For T iterations:
    For p = 1 to P:
        For vertex in interval(p):
            updateFunction(vertex)

[Figure: the vertex ID range 1..n is split at v1, v2, ... into interval(1), interval(2), ..., interval(P), each with a corresponding shard(1), shard(2), ..., shard(P).]

Page 14: DrunkardMob: Billions of Random Walks on Just a PC

DRUNKARDMOB ALGORITHM

Random walk is often called “Drunkard’s Walk”

Page 15: DrunkardMob: Billions of Random Walks on Just a PC

DrunkardMob: Basic Idea

• By example:
– Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
• I.e., 1M different home/source nodes.
– For each user, launch 1000 random walks (with resets), in parallel.
• Each walk takes 10 hops.
• ~ Equivalent to one 10,000-hop walk (with resets) per user.
– For each user, keep track of the visits done by its 1000 short walks → PPR for each user.
– Store the state of each walk in RAM, process the graph from disk.

= 1B random walks in parallel with ~5 GB of RAM.
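
The arithmetic behind these figures can be checked directly. The 4 bytes per walk comes from the "Encoding walks" slide; the variable names are illustrative, and the slides do not break down how the remaining memory up to ~5 GB is used.

```python
num_sources = 1_000_000    # users (home/source nodes)
walks_per_source = 1_000
hops_per_walk = 10
bytes_per_walk = 4         # per the "Encoding walks" slide

total_walks = num_sources * walks_per_source        # walks simulated in parallel
hops_per_source = walks_per_source * hops_per_walk  # total hops sampled per user
walk_state_bytes = total_walks * bytes_per_walk     # RAM for walk positions alone

print(f"{total_walks:,} walks, {hops_per_source:,} hops per user, "
      f"{walk_state_bytes / 1e9:.1f} GB of walk state")
```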

Page 16: DrunkardMob: Billions of Random Walks on Just a PC

Random walks in GraphChi

• DrunkardMob algorithm
– Reverse thinking

ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only current position of each walk!
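
The loop above can be simulated in memory to see the "reverse thinking" at work: the outer loops sweep over vertices in interval order, and each vertex advances whatever walks currently stand on it. This is a simplified sketch, not the GraphChi implementation: the function name, the list-of-lists graph, the per-iteration snapshot, and storing only the source ID per walk are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def drunkardmob(graph, sources, walks_per_source, num_iterations,
                reset_prob, interval_size, seed=0):
    """`graph[v]` is the out-neighbor list of vertex v (IDs 0..n-1)."""
    rng = random.Random(seed)
    num_vertices = len(graph)
    # walks_at[v] lists the source of every walk currently standing on v;
    # only the current position of a walk is stored, never its history.
    walks_at = defaultdict(list)
    for s in sources:
        walks_at[s].extend([s] * walks_per_source)
    visits = {s: Counter() for s in sources}

    for _ in range(num_iterations):
        next_walks_at = defaultdict(list)
        # Sweep the graph one vertex interval at a time, as a disk-based
        # system would when streaming the graph from disk.
        for start in range(0, num_vertices, interval_size):
            end = min(start + interval_size, num_vertices)
            for vertex in range(start, end):
                neighbors = graph[vertex]  # only place adjacency data is read
                for source in walks_at.get(vertex, ()):
                    if not neighbors or rng.random() < reset_prob:
                        dest = source      # reset to the walk's home vertex
                    else:
                        dest = rng.choice(neighbors)
                    next_walks_at[dest].append(source)
                    visits[source][dest] += 1
        walks_at = next_walks_at
    return visits


toy_graph = [[1, 2], [2], [0], [0]]
visits = drunkardmob(toy_graph, sources=[0, 3], walks_per_source=100,
                     num_iterations=10, reset_prob=0.15, interval_size=2)
```

Because every walk advances exactly one hop per sweep, each source accumulates walks_per_source × num_iterations visits in total.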

Page 17: DrunkardMob: Billions of Random Walks on Just a PC

WalkManager

• Store walks in buckets
– An array for each vertex would cost too much.

Page 18: DrunkardMob: Billions of Random Walks on Just a PC

Encoding walks

Only 4 bytes / walk.

Keeps track of each path → knowledge base applications.
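
One way to fit a walk into 4 bytes when walks are stored in per-bucket arrays is to pack a source index and the vertex's offset within its bucket into a single 32-bit integer. The exact bit layout is not given in this transcript; the 24/8 split, the bucket size of 256, and the function names below are assumptions for illustration only.

```python
SOURCE_BITS = 24                # assumed: index into a table of source vertices
OFFSET_BITS = 8                 # assumed: position of the vertex inside its bucket
BUCKET_SIZE = 1 << OFFSET_BITS  # 256 consecutive vertex IDs per bucket


def encode_walk(source_index, vertex_id):
    """Pack a walk into 32 bits; the bucket itself is implied by where it is stored."""
    assert 0 <= source_index < (1 << SOURCE_BITS)
    offset = vertex_id % BUCKET_SIZE
    return (source_index << OFFSET_BITS) | offset


def decode_walk(bucket_id, encoded):
    """Recover (source_index, vertex_id) for a walk stored in bucket `bucket_id`."""
    source_index = encoded >> OFFSET_BITS
    vertex_id = bucket_id * BUCKET_SIZE + (encoded & (BUCKET_SIZE - 1))
    return source_index, vertex_id


vertex_id = 1_000_003
bucket_id = vertex_id // BUCKET_SIZE
encoded = encode_walk(source_index=70_000, vertex_id=vertex_id)
decoded = decode_walk(bucket_id, encoded)
```

Storing the walk inside its bucket's array is what lets the high bits of the vertex ID be dropped from the encoding.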

Page 19: DrunkardMob: Billions of Random Walks on Just a PC

Keeping track of walks

[Figure: GraphChi's execution interval, the vertex walks table (WalkManager), and the Walk Distribution Tracker (DrunkardCompanion), which holds the top-N visits for Source A and Source B.]

Page 20: DrunkardMob: Billions of Random Walks on Just a PC

Keeping track of walks

[Figure: GraphChi's execution interval, the vertex walks table (WalkManager), and the Walk Distribution Tracker (DrunkardCompanion), with the top-N visit lists for Source A and Source B shown twice.]

Page 21: DrunkardMob: Billions of Random Walks on Just a PC

Keeping track of Walks

• If we don’t have enough RAM to store the distributions:
– Cut long tails: a similar problem to estimating the top-K frequent items in data streams with limited memory.

• Can also write hops to disk (bucket-by-bucket) and analyze later.
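
To illustrate the "cut long tails" idea, the sketch below uses the Misra-Gries summary, one standard algorithm for finding frequent items in a stream with a fixed number of counters. The slides do not say which method the implementation uses, so this is an assumed stand-in; the function name and the example stream are made up.

```python
import random


def misra_gries(stream, capacity):
    """Keep at most `capacity` counters; counts are underestimates."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Visits recorded for one source: two popular vertices plus a long tail.
visit_stream = [7] * 60 + [8] * 40 + list(range(100, 120))
random.Random(1).shuffle(visit_stream)
summary = misra_gries(visit_stream, capacity=3)
```

With `capacity` counters, any item that occurs more than len(stream) / (capacity + 1) times is guaranteed to survive in the summary, so the frequently visited vertices are kept while the tail is discarded.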

Page 22: DrunkardMob: Billions of Random Walks on Just a PC

Validity

• We assume that simulating 2000 x 5-hop walks with resets ~ one 10000-hop walk with resets.
– Not exactly the same distribution: some longer streaks are not covered.
• But those would not be relevant for recommendations anyway!
– See Fogaras (2005) for analysis.

Page 23: DrunkardMob: Billions of Random Walks on Just a PC

Related Work

• Fogaras, Racz, Csalogany, Sarlos: “Towards scaling fully personalized pagerank: Algorithms, lower bounds, experiments” (2005)
– Similar idea with a full external-memory implementation.
• We keep walks in memory.

• Plenty of research in approximating PPR.

Page 24: DrunkardMob: Billions of Random Walks on Just a PC

EXPERIMENTS

See paper for more experiments!

Page 25: DrunkardMob: Billions of Random Walks on Just a PC

Case Study: Twitter WTF

• Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper)
– Based on WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate a set of candidates to recommend for each user.
– See paper.

Page 26: DrunkardMob: Billions of Random Walks on Just a PC

PPR: Full Twitter Graph

On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.

With a large server with SSD and 144 GB of memory:

Page 27: DrunkardMob: Billions of Random Walks on Just a PC

Runtime / Graph size

Running time ~ linear with graph size

Page 28: DrunkardMob: Billions of Random Walks on Just a PC

Comparison to in-memory walks

Competitive with in-memory walks. However, if you can fit your graph in memory – no need for DrunkardMob.

Page 29: DrunkardMob: Billions of Random Walks on Just a PC

Summary

• DrunkardMob allows simulating random walks efficiently on extremely large graphs
– Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
– Graph size not limited by RAM.
– Implement Twitter Who-To-Follow on your laptop!

• Future work: Adapt to distributed graph systems.
– Even Hadoop if you really really want.

Page 30: DrunkardMob: Billions of Random Walks on Just a PC

Thank You!

• Code: http://github.com/graphchi/graphchi-java

Aapo Kyrölä

Ph.D. candidate @ CMU

http://www.cs.cmu.edu/~akyrola

Twitter: @kyrpov

Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.