Streaming Graph Partitioning KDD 8/15

Streaming Graph Partitioning for Large Distributed Graphs

Isabelle Stanton, UC Berkeley
Gabriel Kliot, Microsoft Research XCG

Motivation

• Modern graph datasets are huge
  – The web graph had over a trillion links in 2011. Now?
  – Facebook has “more than 901 million users with average degree 130”
  – Protein networks

Motivation

• We still need to perform computations, so we have to deal with large data
  – PageRank (and other matrix-multiply problems)
  – Broadcasting status updates
  – Database queries
  – And on and on and on…

Graph has to be distributed across a cluster of machines!

Motivation

• The number of edges cut corresponds (approximately) to the communication volume required

• Too expensive to move data on the network
  – Interprocessor communication: nanoseconds
  – Network communication: microseconds

• The data has to be loaded onto the cluster at some point…

• Can we partition while we load the data?
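To make the "edges cut" proxy concrete, here is a minimal sketch that counts the edges whose endpoints land on different machines. The function name `edges_cut` and the toy 4-cycle are illustrative assumptions, not something taken from the talk.

```python
def edges_cut(edges, assignment):
    """Count edges whose two endpoints are assigned to different machines."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Toy example (hypothetical): a 4-cycle 0-1-2-3-0 split across two machines.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
contiguous = {0: 0, 1: 0, 2: 1, 3: 1}   # neighbours kept together
alternating = {0: 0, 1: 1, 2: 0, 3: 1}  # neighbours always separated

print(edges_cut(edges, contiguous))   # 2
print(edges_cut(edges, alternating))  # 4
```

Both assignments are perfectly balanced, yet the second one cuts twice as many edges, which is the kind of difference a partitioner is trying to exploit.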

High Level Background

• Graph partitioning is NP-hard on a good day
• But then we made it harder:
  – Graphs like social networks are notoriously difficult to partition (expander-like)
  – Large data sets drastically reduce the amount of computation that is feasible: O(n) or less
  – The partitioning algorithms need to be parallel and distributed

The Streaming Model

[Diagram: Graph Stream → Partitioner → machines $M_1, M_2, \dots, M_k$]

Graph is ordered:
• Random
• Breadth-First Search
• Depth-First Search

Goal: Generate an approximately balanced k-partitioning

Each machine holds up to $C = (1+\varepsilon)\frac{n}{k}$ nodes

Possible Buffer of size
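The three stream orderings and the capacity bound can be sketched as follows. This is only an illustration under stated assumptions: the adjacency-dict representation, the helper names, rounding the capacity up to a whole node, and restarting the traversal on disconnected pieces are all choices made here, not details from the talk.

```python
import math
import random
from collections import deque


def capacity(n, k, eps):
    """Capacity C = (1 + eps) * n / k, rounded up to a whole node (assumption)."""
    return math.ceil((1 + eps) * n / k)


def _roots(adj, start):
    # Start vertex first, then all others so disconnected pieces are covered.
    return [start] + [v for v in adj if v != start]


def bfs_order(adj, start):
    seen, order = set(), []
    for root in _roots(adj, start):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order


def dfs_order(adj, start):
    seen, order = set(), []
    for root in _roots(adj, start):
        if root in seen:
            continue
        stack = [root]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            order.append(u)
            # Reversed so the first-listed neighbour is explored first.
            stack.extend(w for w in reversed(adj[u]) if w not in seen)
    return order


def random_order(adj, seed=0):
    order = list(adj)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical 4-vertex tree: 0-1, 0-2, 1-3.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(bfs_order(adj, 0))            # [0, 1, 2, 3]
print(dfs_order(adj, 0))            # [0, 1, 3, 2]
print(capacity(n=4, k=2, eps=0.1))  # 3
```

A streaming partitioner would consume one of these orderings vertex by vertex, seeing each vertex together with its adjacency list, and must place it before the next one arrives.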

Lower Bounds On Orderings

Best balanced -partition cuts edges

Adversarial Ordering
- Give every other vertex
- See no edges till !
- Can’t compete

DFS Ordering
- Stream is connected
- Greedy will do optimally

Random Ordering
- Birthday paradox: won’t see edges until
- Still can’t compete with edges cut

Theory says these types of algorithms can’t do well

Current Approach in Real Systems

• Totally ignore edges and hash vertex ID
• Pro
  – Fast to locate data
  – Doesn’t require a complex DHT or synchronization
• Con
  – Hashing the vertex ID cuts a fraction of the edges for any order
  – Great simple approximation for MAX-CUT
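A short simulation shows why hashing is a poor partitioner. If each vertex lands on one of k machines uniformly and independently, an edge's endpoints collide with probability 1/k, so a fraction of about 1 − 1/k of the edges is cut in expectation, whatever the stream order. For k = 2 that is half the edges, which is the classic random-assignment approximation for MAX-CUT. The sketch below assumes a SHA-256-based hash and a synthetic random edge list; both are illustrative choices rather than what any particular system uses.

```python
import hashlib
import random


def hash_partition(v, k):
    """Map a vertex ID to one of k machines via a stable hash (illustrative)."""
    digest = hashlib.sha256(str(v).encode()).digest()
    return int.from_bytes(digest[:8], "big") % k


def cut_fraction(edges, k):
    """Fraction of edges whose endpoints hash to different machines."""
    cut = sum(1 for u, v in edges if hash_partition(u, k) != hash_partition(v, k))
    return cut / len(edges)


# Synthetic edge list (hypothetical data): 20,000 random pairs over 2,000 vertices.
rng = random.Random(0)
n = 2000
edges = [(rng.randrange(n), rng.randrange(n)) for _ in range(20000)]
edges = [(u, v) for u, v in edges if u != v]  # drop self-loops

for k in (2, 4, 8, 16):
    print(k, round(cut_fraction(edges, k), 3), "vs. 1 - 1/k =", round(1 - 1 / k, 3))
```

The measured fraction should sit close to 1 − 1/k for every k, which is why the cut produced by hashing gets worse as machines are added.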

Our Approach

• Evaluate 16 natural heuristics on 21 datasets with each of the three orderings with varying numbers of partitions
• Find out which heuristics work on each graph
• Compare these with the results of
  – Random Hashing to get worst case
  – METIS to get ‘best’ offline performance

Caveats

• METIS is a heuristic, not a true lower bound
  – Does fine in practice
  – Available online for reproducing results
• Used publicly available datasets
  – Public graph datasets tend to be much smaller than what companies have
• Using meta-data for partitioning can be good
  – Partitioning the web graph by URL
  – Using geographic location for social network users

Heuristics

• Balanced
• Chunking
• Hashing
• (weighted) Deterministic Greedy
• (weighted) Randomized Greedy
• Triangles
• Balance Big

Uses a Buffer of size
• Prefer Big
• Avoid Big
• Greedy EvoCut

Weight functions
• Unweighted
• Linear weighted
• Exponentially weighted
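As a rough illustration of the (weighted) Deterministic Greedy family, the sketch below places each arriving vertex on the machine that already holds most of its neighbours, scaled by a penalty for how full that machine is. The weight formulas did not survive on the slide, so the forms used here (1, 1 − size/C, and 1 − exp(size − C)) are assumptions based on how these penalties are usually stated. Skipping full machines and breaking ties by lowest load are likewise choices made for this sketch.

```python
import math


def unweighted(size, C):
    return 1.0


def linear(size, C):
    return 1.0 - size / C


def exponential(size, C):
    return 1.0 - math.exp(size - C)


def deterministic_greedy(stream, adj, k, C, weight=linear):
    """Assign each streamed vertex to the open machine with the best weighted
    count of already-placed neighbours; ties go to the least-loaded machine."""
    assignment = {}
    sizes = [0] * k
    for v in stream:
        # Count v's already-placed neighbours on each machine.
        counts = [0] * k
        for w in adj[v]:
            p = assignment.get(w)
            if p is not None:
                counts[p] += 1
        # Only machines with spare capacity are candidates (assumption).
        open_parts = [i for i in range(k) if sizes[i] < C]
        best = max(open_parts,
                   key=lambda i: (counts[i] * weight(sizes[i], C), -sizes[i]))
        assignment[v] = best
        sizes[best] += 1
    return assignment


# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
stream = [0, 1, 2, 3, 4, 5]  # a BFS order from vertex 0
result = deterministic_greedy(stream, adj, k=2, C=3, weight=linear)
print(result)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

On this toy input each triangle ends up on its own machine, so only the bridging edge is cut; hashing the same six vertices would be expected to cut about half of the seven edges.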

Datasets

• Includes finite element meshes, citation networks, social networks, web graphs, protein networks and synthetically generated graphs
• Sizes: 297 vertices to 41.7 million vertices
• Synthetic graph models
  – Barabasi-Albert (Preferential Attachment)
  – RMAT (Kronecker)
  – Watts-Strogatz
  – Power law-Clustered
• Biggest graphs: LiveJournal and Twitter

Experimental Method

• For each graph, heuristic, and ordering, partition into 2, 4, 8, 16 pieces
• Compare with a random cut – upper bound
• Compare with METIS – lower bound
• Performance was measured by:

$$\frac{\#\,\text{edges cut by random cut} \;-\; \#\,\text{edges cut by heuristic}}{\#\,\text{edges cut by random cut} \;-\; \#\,\text{edges cut by METIS}}$$
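The performance measure reads as "what share of the gap between a random cut and METIS does the heuristic close". A minimal sketch follows; the function name and the numbers are made up purely for illustration.

```python
def improvement(random_cut, heuristic_cut, metis_cut):
    """Share of the random-to-METIS gap closed by the heuristic.

    0.0 means no better than a random cut, 1.0 means as good as METIS,
    and a negative value means worse than a random cut.
    """
    gap = random_cut - metis_cut
    if gap == 0:
        raise ValueError("random cut and METIS cut are equal; measure undefined")
    return (random_cut - heuristic_cut) / gap


# Hypothetical edge-cut counts, not taken from the experiments.
print(improvement(random_cut=1000, heuristic_cut=400, metis_cut=200))   # 0.75
print(improvement(random_cut=1000, heuristic_cut=1000, metis_cut=200))  # 0.0
print(improvement(random_cut=1000, heuristic_cut=200, metis_cut=200))   # 1.0
```

Because METIS is itself a heuristic, values slightly above 1.0 are possible in principle.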

Heuristic Results

Best heuristic, LDG (Linear Deterministic Greedy), gets an average improvement of 76% over all datasets!

[Figure: panels for Synthetic, Social network, and Finite element mesh graphs; labels: Hash, METIS, BFS, DFS, Random]

Scaling in the Size of Graphs: Exploiting Synthetic Graphs

[Figure: curves for LDG, Hash, and METIS]

More Observations

• BFS is a superior ordering for all algorithms
• Avoid Big does 46% WORSE on average than Random Cut
• Further experiments showed Linear Det. Greedy has identical performance to Det. Greedy with load-based tie breaking.

Results on a Real System

• Compared the streamed partitioning with random hashing on SPARK, a distributed cluster computation system (http://www.spark-project.org/)
• Used 2 datasets
  – 4.6 million users, 77 million edges
  – 41.7 million users, 1.468 billion edges
• Computed the PageRank of each graph
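PageRank is a natural test because every iteration pushes a contribution along every edge, so cut edges turn directly into network traffic. The sketch below counts cross-machine messages for one iteration in two ways: one message per cut edge, or one message per (source machine, destination vertex) pair when contributions are pre-aggregated locally. Treating the second count as what a combiner achieves is an assumption made for illustration; the function name and toy graph are hypothetical, and this is not the Spark code used in the experiments.

```python
def network_messages(edges, assignment):
    """Return (per_edge, pre_aggregated) cross-machine message counts for one
    PageRank-style iteration over directed edges u -> v."""
    per_edge = 0
    aggregated = set()
    for u, v in edges:
        if assignment[u] != assignment[v]:
            per_edge += 1
            # With local pre-aggregation, each machine sends at most one
            # message per remote destination vertex.
            aggregated.add((assignment[u], v))
    return per_edge, len(aggregated)


# Hypothetical graph: vertices 0, 1, 2 on machine 0 all link to vertex 3 on
# machine 1, and vertex 3 links back to vertex 0.
edges = [(0, 3), (1, 3), (2, 3), (3, 0)]
assignment = {0: 0, 1: 0, 2: 0, 3: 1}
print(network_messages(edges, assignment))  # (4, 2)
```

Either way, a partitioning that cuts fewer edges sends fewer messages, so a better cut should help both variants.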

Results on SPARK

                    LJ Hash    LJ Stream    Twitter Hash    Twitter Stream
Naïve PR Mean       296.2 s    181.5 s      1199.4 s        969.3 s
Naïve PR STD        5.5 s      2.2 s        81.2 s          16.9 s
Combiner PR Mean    155.1 s    110.4 s      599.4 s         486.8 s
Combiner PR STD     1.5 s      0.8 s        14.4 s          5.9 s

LJ Improvement: Naïve – 38.7%, Combiner – 28.8%
Twitter Improvement: Naïve – 19.1%, Combiner – 18.8%

LiveJournal – 4.6 million users, 77 million edges
Twitter – 41.7 million users, 1.468 billion edges
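The improvement percentages appear to be the relative reduction in mean running time, (hash − stream) / hash. The sketch below recomputes them from the reported means. The formula is an inference from the numbers rather than something the slide states, and the dictionary keys are just labels chosen here.

```python
# Mean running times in seconds, (hash, stream), as reported in the table.
timings = {
    "LJ naive": (296.2, 181.5),
    "LJ combiner": (155.1, 110.4),
    "Twitter naive": (1199.4, 969.3),
    "Twitter combiner": (599.4, 486.8),
}


def reduction_pct(hash_s, stream_s):
    """Relative reduction in running time, in percent."""
    return 100.0 * (hash_s - stream_s) / hash_s


for name, (hash_s, stream_s) in timings.items():
    print(f"{name}: {reduction_pct(hash_s, stream_s):.1f}%")
```

Three of the four values reproduce the slide to one decimal place. The Twitter naïve figure comes out nearer 19.2% from the rounded means, so the slide's 19.1% was presumably truncated or computed from unrounded timings.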

Streaming graph partitioning is a really nice, simple, effective preprocessing step.

Where to now?

• Can we explain theoretically why the greedy algorithm performs so well?*
• What heuristics work better?
• What heuristics are optimal for different classes of graphs?
• Use multiple parallel streams!
• Implement in real systems!

*Work under submission: I. Stanton, Streaming Balanced Graph Partitioning Algorithms for Random Graphs


Acknowledgements

• David B. Wecker
• Burton Smith
• Reid Andersen
• Nikhil Devanur
• Sameh Elnikety
• Sreenivas Gollapudi
• Yuxiong He
• Rina Panigrahy
• Yuval Peres
All at MSR

• Satish Rao
• Virginia Vassilevska Williams
• Alexandre Stauffer
• Ngoc Mai Tran
• Miklos Racz
• Matei Zaharia
All at Berkeley - CS and Statistics

Supported by NSF and NDSEG fellowships, NSF grant CCF-0830797, and an internship at Microsoft Research’s eXtreme Computing Group.
