![Page 1: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/1.jpg)
Fast, Scalable Graph Processing:
Apache Giraph on YARN
![Page 2: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/2.jpg)
Hello, I'm Eli Reisman!
![Page 3: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/3.jpg)
Eli is...
• Apache Giraph Committer and PMC Member
• Apache Tajo Committer
• Wrote initial port of Giraph to YARN
• Collaborating with fellow Giraph committers on the book Giraph in Action for Manning Publications
![Page 4: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/4.jpg)
Eli is...
• Only able to do all this with the support of:
![Page 5: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/5.jpg)
Eli is a software engineer at Etsy.
![Page 6: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/6.jpg)
Etsy enables non-technical folks to sell handmade and vintage stuff:
We have a great blog called Code As Craft:
![Page 7: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/7.jpg)
...but enough about me, let's talk Giraph!
![Page 8: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/8.jpg)
Key Topics
What is Apache Giraph?
Why do I need it?
Giraph + MapReduce
Giraph + YARN
Giraph Roadmap
![Page 9: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/9.jpg)
What is Apache Giraph?
Giraph is a framework for performing offline batch processing of semi-structured graph
data on a massive scale.
Giraph is loosely based upon Google's Pregel graph processing framework.
![Page 10: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/10.jpg)
What is Apache Giraph?
Giraph performs iterative calculations on top of an existing Hadoop cluster.
![Page 11: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/11.jpg)
What is Apache Giraph?
Giraph uses Apache ZooKeeper to enforce atomic barrier waits between iterations and to perform leader election.
Done! Done! ...Still working...
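The barrier wait can be illustrated with a minimal sketch. It uses Python's in-process `threading.Barrier` as a stand-in for the ZooKeeper-backed barrier; it does not use any ZooKeeper or Giraph API, and the worker count and number of supersteps are assumptions chosen for illustration.

```python
import threading

NUM_WORKERS = 3      # hypothetical worker count
NUM_SUPERSTEPS = 4   # hypothetical number of iterations

barrier = threading.Barrier(NUM_WORKERS)
log = []                      # (superstep, worker_id) in completion order
log_lock = threading.Lock()


def worker(worker_id):
    for superstep in range(NUM_SUPERSTEPS):
        # Local computation for this superstep would happen here.
        with log_lock:
            log.append((superstep, worker_id))   # "Done!"
        # No worker starts the next superstep until all have arrived.
        barrier.wait()


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

supersteps_in_order = [s for s, _ in log]
```

Because every worker blocks at the barrier, no entry for superstep N+1 can be logged before all entries for superstep N.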
![Page 12: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/12.jpg)
What is Apache Giraph?
Giraph benefits from a vibrant Apache community, and is under active development:
![Page 13: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/13.jpg)
Why do I need it?
Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous
Parallel (BSP) programming model.
In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph
performing a single iteration of the computation.
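To make the vertex-centric view concrete, here is a small sketch of a BSP loop in plain Python. It is not Giraph's API: `run_bsp`, `max_value_compute`, and the three-vertex ring are hypothetical names and data. Each vertex sees only its own value, its incoming messages, and its out-neighbors, and the example propagates the maximum value through the graph.

```python
def run_bsp(graph, values, compute, max_supersteps=50):
    """graph: vertex -> list of out-neighbors; values: vertex -> value."""
    inbox = {v: [] for v in graph}
    active = set(graph)
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in graph}
        next_active = set()
        for v in graph:
            if v not in active and not inbox[v]:
                continue  # a halted vertex with no mail stays asleep
            new_value, messages, halt = compute(
                superstep, values[v], inbox[v], graph[v])
            values[v] = new_value
            for target, msg in messages:
                outbox[target].append(msg)
            if not halt:
                next_active.add(v)
        # Barrier: messages sent now are only visible in the next superstep.
        inbox, active = outbox, next_active
        superstep += 1
    return values


def max_value_compute(superstep, value, messages, neighbors):
    """One vertex, one iteration: keep the largest value seen so far."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        # Tell the neighbors, then vote to halt until new mail arrives.
        return best, [(n, best) for n in neighbors], True
    return best, [], True


ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
result = run_bsp(ring, {"a": 3, "b": 6, "c": 2}, max_value_compute)
```

The whole algorithm lives in `max_value_compute`; the loop around it is the same for any vertex program.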
![Page 14: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/14.jpg)
Why do I need it?
• Giraph makes iterative data processing more practical for Hadoop users.
• Giraph can avoid costly disk and network operations that are mandatory in MR.
• No concept of message passing in MR.
![Page 15: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/15.jpg)
Why do I need it?
Each cycle of an iterative calculation on Hadoop means running a full MapReduce
job.
![Page 16: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/16.jpg)
Let's use simple PageRank as a quick example:
http://en.wikipedia.org/wiki/PageRank
1.0
1.0
1.0
![Page 17: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/17.jpg)
1. All vertices start with same PageRank
1.0
1.0
1.0
![Page 18: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/18.jpg)
2. Each vertex distributes an equal portion of its PageRank to all neighbors:
0.5 0.5
1
1
![Page 19: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/19.jpg)
3. Each vertex multiplies the sum of its incoming values by a weight factor (0.85 here) and adds a small adjustment proportional to 1/(# vertices in graph):
(.5*.85) + (.15/3)
(1.5*.85) + (.15/3)
(1*.85) + (.15/3)
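Written as a formula, the three expressions above are instances of one update rule. The symbols are assumptions introduced here, not notation from the slides: $d$ is the weight factor (0.85), $N$ is the number of vertices (3), and $\mathrm{out}(u)$ is the number of out-edges of vertex $u$.

$$
PR_{t+1}(v) = d \sum_{u \to v} \frac{PR_t(u)}{\mathrm{out}(u)} + \frac{1 - d}{N}
$$

For the vertex whose incoming shares sum to 1.5, this gives $1.5 \times 0.85 + 0.15 / 3$.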
![Page 20: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/20.jpg)
4. This value becomes the vertex's PageRank for the next iteration
.43
.21
.64
![Page 21: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/21.jpg)
5. Repeat until convergence:
(change in PR per-iteration < epsilon)
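Steps 1 through 5 can be sketched in a few lines of plain Python. The three-vertex graph below is an assumption: it was chosen because it reproduces the incoming sums (0.5, 1.5, and 1) used in step 3, and the function names and epsilon are hypothetical. The sketch also assumes every vertex has at least one out-edge.

```python
DAMPING = 0.85  # the weight factor from step 3


def pagerank_step(out_edges, ranks):
    """One iteration: distribute shares (step 2), then sum and adjust (step 3)."""
    n = len(out_edges)
    incoming = {v: 0.0 for v in out_edges}
    for v, targets in out_edges.items():
        share = ranks[v] / len(targets)   # equal portion to each neighbor
        for t in targets:
            incoming[t] += share
    return {v: incoming[v] * DAMPING + (1 - DAMPING) / n for v in out_edges}


def pagerank(out_edges, epsilon=1e-10, max_iterations=1000):
    ranks = {v: 1.0 for v in out_edges}   # step 1: same starting PageRank
    for _ in range(max_iterations):
        new_ranks = pagerank_step(out_edges, ranks)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in out_edges)
        ranks = new_ranks                 # step 4: value for the next iteration
        if delta < epsilon:               # step 5: converged
            break
    return ranks


# Assumed example graph: A -> B, A -> C, B -> C, C -> A
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
first = pagerank_step(GRAPH, {v: 1.0 for v in GRAPH})
final = pagerank(GRAPH)
```

In this graph, C has two in-edges while B has one, and C ends up with the higher converged value.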
![Page 22: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/22.jpg)
Vertices with a higher in-degree tend to converge to a higher
PageRank
![Page 23: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/23.jpg)
Put another way:
![Page 24: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/24.jpg)
PageRank on MapReduce
1. Load complete input graph from disk as [K = vertex ID, V = out-edges and PR]
Map Sort/Shuffle Reduce
![Page 25: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/25.jpg)
PageRank on MapReduce
2. Emit all input records (full graph state), then emit [K = edgeTarget, V = share of PR] for each out-edge
Map Sort/Shuffle Reduce
![Page 26: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/26.jpg)
PageRank on MapReduce
3. Sort and Shuffle this entire mess!
Map Sort/Shuffle Reduce
![Page 27: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/27.jpg)
PageRank on MapReduce
4. Sum incoming PR shares for each vertex, update PR values in graph state records
Map Sort/Shuffle Reduce
![Page 28: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/28.jpg)
PageRank on MapReduce
5. Emit full graph state to disk...
Map Sort/Shuffle Reduce
![Page 29: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/29.jpg)
PageRank on MapReduce
6. ...and start over!
Map Sort/Shuffle Reduce
![Page 30: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/30.jpg)
PageRank on MapReduce
• Awkward to reason about
• I/O bound despite simple core business logic
Map Sort/Shuffle Reduce
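The six steps can be simulated in memory to show why the approach is awkward. This is a sketch, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names, the "STATE"/"SHARE" tags are an assumed way of telling graph-state records from PageRank shares, and the three-vertex graph is an assumption that matches the earlier example values.

```python
from collections import defaultdict

DAMPING = 0.85


def map_phase(records):
    """records: (vertex_id, (out_edges, rank)) pairs, as if read from disk."""
    for vertex, (out_edges, rank) in records:
        yield vertex, ("STATE", out_edges)          # re-emit full graph state
        for target in out_edges:
            yield target, ("SHARE", rank / len(out_edges))


def shuffle(pairs):
    """Group every emitted value by key and sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped, num_vertices):
    for vertex, values in grouped:
        out_edges, total = [], 0.0
        for tag, payload in values:
            if tag == "STATE":
                out_edges = payload                 # recover graph structure
            else:
                total += payload                    # sum incoming PR shares
        rank = total * DAMPING + (1 - DAMPING) / num_vertices
        yield vertex, (out_edges, rank)


def run_job(records):
    """One full job equals one PageRank iteration; output feeds the next job."""
    return list(reduce_phase(shuffle(map_phase(records)), len(records)))


records = [("A", (["B", "C"], 1.0)), ("B", (["C"], 1.0)), ("C", (["A"], 1.0))]
after_one_job = dict(run_job(records))
```

Note that the graph structure itself travels through the shuffle on every iteration, even though only the PageRank values change.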
![Page 31: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/31.jpg)
PageRank on Giraph
1. Hadoop Mappers are "hijacked" to host Giraph master and worker tasks.
Map Sort/Shuffle Reduce
![Page 32: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/32.jpg)
PageRank on Giraph
2. Input graph is loaded once, maintaining code-data locality when possible.
Map Sort/Shuffle Reduce
![Page 33: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/33.jpg)
PageRank on Giraph
3. All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based.
Map Sort/Shuffle Reduce
![Page 34: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/34.jpg)
PageRank on Giraph
4. Output is written from the Mappers hosting the calculation, and the job run ends.
Map Sort/Shuffle Reduce
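For contrast, here is a sketch of the same calculation in the load-once, message-passing style described above. It is plain Python, not Giraph's Java API: the `Vertex` class, `compute`, `run`, and the `send` callback are hypothetical names, and the graph is the same assumed three-vertex example.

```python
DAMPING = 0.85


class Vertex:
    def __init__(self, vertex_id, out_edges):
        self.id = vertex_id
        self.out_edges = out_edges
        self.value = 1.0


def compute(vertex, messages, superstep, num_vertices, send):
    """Runs once per vertex per superstep; only messages cross vertices."""
    if superstep > 0:
        vertex.value = sum(messages) * DAMPING + (1 - DAMPING) / num_vertices
    share = vertex.value / len(vertex.out_edges)
    for target in vertex.out_edges:
        send(target, share)


def run(graph, supersteps):
    # The graph is loaded once and stays in memory for every iteration.
    vertices = {vid: Vertex(vid, edges) for vid, edges in graph.items()}
    inbox = {vid: [] for vid in vertices}
    for superstep in range(supersteps):
        outbox = {vid: [] for vid in vertices}
        for v in vertices.values():
            compute(v, inbox[v.id], superstep, len(vertices),
                    lambda target, msg: outbox[target].append(msg))
        inbox = outbox   # messages become visible in the next superstep
    # Output is written once, at the end of the run.
    return {vid: v.value for vid, v in vertices.items()}


GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = run(GRAPH, supersteps=2)
```

Only the PageRank shares move between vertices; the edges never leave the object that owns them.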
![Page 35: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/35.jpg)
This is all well and good, but must we
manipulate Hadoop this way?
?
![Page 36: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/36.jpg)
Giraph + MapReduce
• Heap and other resources are set once, globally, for all Mappers in the computation.
• No control of which cluster nodes host which tasks.
• No control over how Mappers are scheduled.
• The Mapper and Reducer slot abstraction is at best meaningless for Giraph, and at worst an artificial limit.
![Page 37: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/37.jpg)
YARN
• YARN (Yet Another Resource Negotiator) is Hadoop's next-gen job management platform.
• Powers MapReduce v2, but is a general purpose framework that is not tied to the MapReduce paradigm.
• Offers fine-grained control over each task's resource allocations and host placement for clients that need it.
![Page 38: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/38.jpg)
YARN Architecture
![Page 39: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/39.jpg)
Giraph + YARN
It's a natural fit!
![Page 40: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/40.jpg)
Giraph + YARN
• Giraph has maintained compatibility with Hadoop since the 0.1 release by executing via the MapReduce interface.
• Giraph has featured a "pure YARN" build profile since the 1.0 release. It supports Hadoop-2.0.3 and trunk.
*Patches to add 2.0.4 and 2.0.5 support are in review :)
• Giraph's YARN component is easy to extend or use as a template to port other projects!
![Page 41: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/41.jpg)
Giraph + YARN: Roadmap
• The YARN Application Master allows for more natural and stable bootstrapping of Giraph jobs.
• ZooKeeper management can find a natural home in the Application Master.
• Giraph on YARN can stop borrowing from Hadoop and have its own web interface.
![Page 42: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/42.jpg)
Giraph + YARN: Roadmap
• Variable per-task resource allocation opens up the possibility of Supertasks to manage graph supernodes.
• Ability to spawn or retire tasks per-iteration enables in-flight reassignment of data partitions.
• AppMaster-managed utility tasks become possible, such as dedicated sub-aggregators for tree-like aggregation or data pre-samplers.
![Page 43: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/43.jpg)
Giraph New Developments
• Decoupling of logic and graph data means tasks host computations that are pluggable per-iteration.
• Support for Giraph job scripting, starting with Jython. More to follow...
• A new website, fresh docs, an upcoming Manning book, and a large, active community mean Giraph has never been easier to use or contribute to!
![Page 44: Fast, Scalable Graph Processing: Apache Giraph on YARN](https://reader034.vdocuments.mx/reader034/viewer/2022051322/54431cc1afaf9fe3098b4747/html5/thumbnails/44.jpg)
Great! Where can I learn more?
http://giraph.apache.org
Mailing List: [email protected]