Comparing Pregel-Related Systems
Comparison of open source implementations of Pregel and related systems, including installation of Hadoop and the Pregel-related systems. Datasets range from very small to very large; the large datasets have around 30 million vertices and 50 million edges. Experiments ran on 1-, 4-, and 8-node Amazon EC2 clusters with four algorithms: PageRank, shortest path, k-means, and collaborative filtering.
Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems
December 2, 2013
Joshua Woo, Prashant Raghav, Vishnu Prathish
David R. Cheriton School of Computer Science
University of Waterloo
Outline
● Motivation
● Our Project
● Setup
● Preliminary Results
● Preliminary Analysis
● In-Progress
● References
Motivation
Recall: Pregel
● Large-scale graph processing system
● Fault-tolerant framework for graph algorithms
● MapReduce for graph operations?
● Vertex-centric model (“think like a vertex”)
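The vertex-centric model can be illustrated with a small single-machine simulation. The sketch below is not Pregel's API (Pregel is proprietary) and not the API of any system compared here; `sssp_compute` and `run_supersteps` are hypothetical names, and the graph is assumed to be a dict mapping each vertex to a list of `(destination, weight)` pairs. Each superstep runs the per-vertex function on vertices that received messages, and the run ends when no messages are in flight.

```python
import math
from collections import defaultdict


def sssp_compute(vertex, value, messages, out_edges, superstep, source):
    """Per-vertex logic ("think like a vertex").

    Returns the vertex's new distance and the messages it sends.
    """
    # The source proposes distance 0 in the first superstep.
    candidate = 0.0 if (superstep == 0 and vertex == source) else math.inf
    if messages:
        candidate = min(candidate, min(messages))
    if candidate < value:
        # Shorter path found: tell every out-neighbor about it.
        return candidate, [(dst, candidate + w) for dst, w in out_edges]
    return value, []  # no improvement: stay silent (vote to halt)


def run_supersteps(graph, source):
    """Run synchronous supersteps until no messages remain."""
    values = {v: math.inf for v in graph}
    inbox = defaultdict(list)
    superstep = 0
    while True:
        # All vertices are active at first; later only message receivers.
        active = list(graph) if superstep == 0 else list(inbox)
        outbox = defaultdict(list)
        for v in active:
            values[v], sent = sssp_compute(
                v, values[v], inbox.get(v, []), graph[v], superstep, source
            )
            for dst, dist in sent:
                outbox[dst].append(dist)
        superstep += 1
        if not outbox:  # global barrier reached with nothing to deliver
            return values, superstep
        inbox = outbox


example_graph = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 2.0)], 2: []}
distances, supersteps_used = run_supersteps(example_graph, source=0)
print(distances, supersteps_used)
```

A real system distributes the vertices across workers and delivers the messages over the network between supersteps; the control flow per vertex stays the same.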
Motivation
● Pregel is proprietary
● Many open source graph processing systems
  ○ Pregel clones
  ○ Pregel-inspired
  ○ BSP
Motivation
● Apache Hama
● Signal/Collect
● Apache Giraph
● GPS
● GraphLab
● Phoebus
● GoldenOrb
● HipG
● Mizan
Motivation

| System | Impl. Language | Type |
| --- | --- | --- |
| Apache Hama | Java | Pure BSP framework |
| Signal/Collect | Scala | Pregel inspired |
| Apache Giraph | Java | Pregel clone |
| GPS | Java | Advanced Pregel clone |
| GraphLab | C++ | Pregel inspired |
| Phoebus | Erlang | Pregel clone |
| GoldenOrb | Java | Pregel clone |
| HipG | Java | Advanced Pregel clone |
| Mizan | C++ | Advanced Pregel clone |
Motivation
● How do these systems compare?
  ○ In terms of performance (runtime)?
  ○ In terms of memory footprint?
  ○ In terms of network utilization (num. messages)?
  ○ Variables:
    ■ Algorithm
    ■ Graph size (number of vertices)
    ■ Cluster size
Our Project
● Compare at least 3 systems
  ○ Apache Hama - general BSP framework
  ○ Apache Giraph - Hadoop Map-only job, Facebook
  ○ GPS - +dynamic repartitioning, +multi vertex-centric
  ○ Signal/Collect - +edges, +async computations
  ○ GraphLab
  ○ Mizan
Our Project
● Measure the runtime of at least two algorithms on each system
  ○ PageRank
    ■ Fixed number of supersteps = 30
  ○ Single Source Shortest Path (SSSP)
  ○ k-means clustering
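Running PageRank for a fixed number of supersteps, rather than until convergence, makes the amount of work per run the same on every system. A minimal single-machine sketch follows; the function name `pagerank`, the damping factor of 0.85, and the dict-of-lists graph layout are assumptions for illustration and are not taken from the slides or from any of the compared systems.

```python
def pagerank(graph, supersteps=30, damping=0.85):
    """Synchronous PageRank for a fixed number of supersteps.

    graph maps each vertex to a list of its out-neighbors.
    The damping value is an assumed, commonly used default.
    """
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # "Message passing": each vertex splits its rank among out-neighbors.
        incoming = {v: 0.0 for v in graph}
        for v, neighbors in graph.items():
            if neighbors:
                share = ranks[v] / len(neighbors)
                for dst in neighbors:
                    incoming[dst] += share
        # Every vertex then updates its rank from the received contributions.
        ranks = {v: (1.0 - damping) / n + damping * incoming[v] for v in graph}
    return ranks


cycle = {0: [1], 1: [2], 2: [0]}
cycle_ranks = pagerank(cycle)
print(cycle_ranks)
```

This simplified version does not redistribute the rank of vertices without outgoing edges, so on graphs with such vertices the ranks sum to less than 1.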
Setup
● Experiments on AWS
  ○ Ubuntu 12.04 m1.medium EC2 instances
    ■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance
    ■ 8 GiB EBS volume per instance
  ○ Cluster sizes:
    ■ Single-node cluster
    ■ 4-node cluster
    ■ 8-node cluster
Setup
● Experiments on AWS
  ○ 5 runs per dataset per algorithm per cluster
    ■ 35 runs per algorithm per cluster
    ■ 70 runs per cluster
    ■ 140 runs in total (single-node, 4-node)
● TODO: another 70 runs (8-node)
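The run counts above follow from a short multiplication, sketched here. The factor of two algorithms is inferred from 70 / 35 (PageRank and SSSP are the two with reported results), and the counts are assumed to be per system; the constant names are illustrative only.

```python
RUNS_PER_CONFIGURATION = 5      # repetitions per dataset/algorithm/cluster
NUM_DATASETS = 7
NUM_ALGORITHMS = 2              # assumed: PageRank and SSSP
COMPLETED_CLUSTERS = ["single-node", "4-node"]
PENDING_CLUSTERS = ["8-node"]

runs_per_algorithm_per_cluster = RUNS_PER_CONFIGURATION * NUM_DATASETS
runs_per_cluster = runs_per_algorithm_per_cluster * NUM_ALGORITHMS
completed_runs = runs_per_cluster * len(COMPLETED_CLUSTERS)
pending_runs = runs_per_cluster * len(PENDING_CLUSTERS)

print(runs_per_algorithm_per_cluster, runs_per_cluster, completed_runs, pending_runs)
```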
Setup
● Dataset
  ○ 7 datasets
    ■ tinyEWD: 8 vertices, 15 edges
    ■ mediumEWD: 250 vertices, 2,546 edges
    ■ 1000EWD: 1,000 vertices, 16,866 edges
    ■ rome99: 3,353 vertices, 8,870 edges
    ■ 10000EWD: 10,000 vertices, 123,462 edges
    ■ NYC: 264,346 vertices, 733,846 edges
    ■ largeEWD: 1,000,000 vertices, 15,172,126 edges
  ○ Source: http://algs4.cs.princeton.edu/44sp/
Setup
● Systems
  ○ Hama
    ■ Hadoop 1.03.0
    ■ Hama 0.6.3
  ○ Giraph
    ■ Hadoop 0.20.203rc1
    ■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a)
  ○ GPS
    ■ Hadoop 0.20.203rc1
    ■ GPS (trunk@Revision 112)
Setup
● Input Graph
  ○ Source files converted into format suitable for each system
    ■ Time for this conversion excluded from results:
      ● Conversion done before algorithms are run (pre-processing?)
      ● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
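The conversion step can be sketched as follows for the EWD files, whose layout is the vertex count, then the edge count, then one `source destination weight` triple per edge. The output layout used here (`vertex<TAB>dst:weight,dst:weight`) is a hypothetical adjacency-list format chosen for illustration; it is not the actual input format of Hama, Giraph, or GPS, and the function name `ewd_to_adjacency_lines` is likewise an assumption. The rome99 and NYC files use a different source format and are not covered.

```python
from collections import defaultdict


def ewd_to_adjacency_lines(text):
    """Convert EWD text (V, E, then 'v w weight' triples) to one
    adjacency-list line per vertex in a hypothetical target format."""
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    edge_tokens = tokens[2:]
    if len(edge_tokens) != 3 * num_edges:
        raise ValueError("edge count does not match the header")

    adjacency = defaultdict(list)
    for i in range(0, len(edge_tokens), 3):
        src = int(edge_tokens[i])
        dst = int(edge_tokens[i + 1])
        weight = float(edge_tokens[i + 2])
        adjacency[src].append((dst, weight))

    lines = []
    for v in range(num_vertices):
        # Vertices without outgoing edges still get a line of their own.
        edges = ",".join(f"{dst}:{w}" for dst, w in adjacency[v])
        lines.append(f"{v}\t{edges}")
    return lines


sample = "3\n2\n0 1 0.5\n1 2 0.25\n"
converted = ewd_to_adjacency_lines(sample)
print("\n".join(converted))
```

Because the whole file is tokenized in memory, this sketch suits the smaller datasets; a streaming reader would be the more natural choice for largeEWD.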
Preliminary Results
| Dataset | Hama | Giraph | GPS |
| --- | --- | --- | --- |
| tinyEWD | 14.17 | 41.60 | 14.40 |
| mediumEWD | 16.36 | 44.00 | 36.00 |
| 1000EWD | 18.06 | 48.80 | 46.60 |
| rome99 | 22.95 | 66.00 | 50.00 |
| 10000EWD | 25.32 | 67.40 | 55.00 |
| NYC | 165.01 | 267.00 | 310.00 |
| largeEWD | 6,109.20 | 602.80 | 618.70 |

Average SSSP runtime on 4-node cluster (in seconds)
Preliminary Results
[Figure: SSSP runtime vs. graph size (num. vertices)]
Preliminary Results
| Dataset | Hama | Giraph | GPS |
| --- | --- | --- | --- |
| tinyEWD | 29.36 | 49.40 | 58.57 |
| mediumEWD | 30.26 | 53.40 | 60.42 |
| 1000EWD | 37.86 | 54.60 | 61.03 |
| rome99 | 29.35 | 56.20 | 61.80 |
| 10000EWD | 302.33 | 61.80 | 64.80 |
| NYC | 1,001.24 | 134.40 | 68.69 |
| largeEWD | Failed | 2,100.00 | 1,213.56 |

Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)
Preliminary Results
[Figure: PageRank runtime vs. graph size (num. vertices)]
Preliminary Analysis
● A point of resource crunch
  ○ No significant change in performance until a point
● Hama does not scale well (vertices ~10^4)
● Giraph and GPS scale better
● In general, PageRank runtime > SSSP runtime
● GPS input reader does not guarantee true partitioning for large datasets
● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
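The observation that PageRank generally takes longer than SSSP can be checked directly against the two 4-node runtime tables. The sketch below copies the averages from those tables (with `None` for the failed Hama run on largeEWD) and lists the cases where the observation does not hold; the function name `compare_runtimes` is illustrative only.

```python
DATASETS = ["tinyEWD", "mediumEWD", "1000EWD", "rome99",
            "10000EWD", "NYC", "largeEWD"]

# Average runtimes in seconds on the 4-node cluster, from the tables.
SSSP_SECONDS = {
    "Hama":   [14.17, 16.36, 18.06, 22.95, 25.32, 165.01, 6109.20],
    "Giraph": [41.60, 44.00, 48.80, 66.00, 67.40, 267.00, 602.80],
    "GPS":    [14.40, 36.00, 46.60, 50.00, 55.00, 310.00, 618.70],
}
PAGERANK_SECONDS = {
    "Hama":   [29.36, 30.26, 37.86, 29.35, 302.33, 1001.24, None],
    "Giraph": [49.40, 53.40, 54.60, 56.20, 61.80, 134.40, 2100.00],
    "GPS":    [58.57, 60.42, 61.03, 61.80, 64.80, 68.69, 1213.56],
}


def compare_runtimes():
    """Return (number of comparable cases, list of (system, dataset)
    pairs where PageRank was NOT slower than SSSP)."""
    comparable = 0
    exceptions = []
    for system, sssp_times in SSSP_SECONDS.items():
        rows = zip(DATASETS, sssp_times, PAGERANK_SECONDS[system])
        for dataset, sssp, pagerank in rows:
            if pagerank is None:  # failed run: nothing to compare
                continue
            comparable += 1
            if pagerank <= sssp:
                exceptions.append((system, dataset))
    return comparable, exceptions


comparable, exceptions = compare_runtimes()
print(f"PageRank slower in {comparable - len(exceptions)} of {comparable} cases")
for system, dataset in exceptions:
    print(f"  exception: {system} on {dataset}")
```

On these numbers the observation holds in 16 of 20 comparable cases; the exceptions are Giraph on rome99, 10000EWD, and NYC, and GPS on NYC, which fits the "in general" wording.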
In-Progress
● Output validation
● Memory footprint
● Network utilization (num. messages)
● GraphLab and Signal/Collect
● Green-Marl?
  ○ (DSL) → [Compiler] → (Giraph, GPS)
Questions?
Extras
Preliminary Results
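| Dataset | Hama | Giraph | GPS |
| --- | --- | --- | --- |
| tinyEWD | 10 | 7 | 7 |
| mediumEWD | 16 | 13 | 18 |
| 1000EWD | 27 | 25 | 23 |
| rome99 | 105 | 102 | 18 |
| 10000EWD | 85 | 80 | 64 |
| NYC | 671 | 905 | 438 |
| largeEWD | 806 | 670 | 730 |

Number of supersteps for SSSP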
Preliminary Results
[Figure: Number of supersteps for SSSP]
Really, really Preliminary
PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated

| Dataset | Native | Green-Marl generated |
| --- | --- | --- |
| tinyEWD | 58.57 | 60.20 |
| mediumEWD | 60.42 | 60.11 |
| 1000EWD | 61.03 | 62.30 |
| rome99 | 61.80 | 62.32 |
| 10000EWD | 64.80 | 65.78 |
| NYC | 68.69 | 71.34 |
| largeEWD | 1,213.56 | - |
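The relative overhead of the Green-Marl generated code can be derived from the table as (generated - native) / native. The sketch below does this for the six datasets that have both measurements; largeEWD is left out because no generated-code runtime is reported. The helper name `overhead_percent` is illustrative only.

```python
# PageRank runtimes on GPS in seconds, copied from the table.
NATIVE = {
    "tinyEWD": 58.57, "mediumEWD": 60.42, "1000EWD": 61.03,
    "rome99": 61.80, "10000EWD": 64.80, "NYC": 68.69,
}
GENERATED = {
    "tinyEWD": 60.20, "mediumEWD": 60.11, "1000EWD": 62.30,
    "rome99": 62.32, "10000EWD": 65.78, "NYC": 71.34,
}


def overhead_percent(native, generated):
    """Relative slowdown of generated code versus native, in percent."""
    return (generated - native) / native * 100.0


overheads = {
    dataset: overhead_percent(NATIVE[dataset], GENERATED[dataset])
    for dataset in NATIVE
}
for dataset, pct in overheads.items():
    print(f"{dataset}: {pct:+.2f}%")
```

A negative value, as for mediumEWD, means the generated code was slightly faster in that measurement; with differences this small, run-to-run variance may account for much of the gap.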
Really, really Preliminary
[Figure: PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated]
References
● Our Project Proposal
● http://algs4.cs.princeton.edu/44sp/
● https://github.com/apache/hadoop-common
● https://github.com/apache/giraph
● https://subversion.assembla.com/svn/phd-projects/gps/trunk/
● http://ppl.stanford.edu/main/green_marl.html