2013-06-03 Berlin Buzzwords
![Page 2: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/2.jpg)
Agenda
1 Background
2 Scaling
3 Results
4 Questions
![Page 3: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/3.jpg)
Background
![Page 4: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/4.jpg)
What is Giraph?
• Apache open source graph computation engine based on Google’s Pregel.
• Support for Hadoop, Hive, HBase, and Accumulo.
• BSP model with a simple “think like a vertex” API.
• Combiners, Aggregators, Mutability, and more.
• Configurable Graph<I,V,E,M>, where each type implements Writable:
– I: Vertex ID
– V: Vertex Value
– E: Edge Value
– M: Message data
What is Giraph NOT?
• A graph database. See Neo4j.
• A completely asynchronous generic MPI system.
• A slow tool.
![Page 5: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/5.jpg)
Why not Hive?
[Figure: MapReduce dataflow. Labels: input format (Input 0, Input 1), map tasks, intermediate files, reduce tasks, output format (Output 0, Output 1), “Iterate!”]
• Too much disk. Limited in-memory caching.
• Each iteration becomes a MapReduce job!
![Page 6: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/6.jpg)
Giraph components
Master – Application coordinator
• Synchronizes supersteps
• Assigns partitions to workers before superstep begins
Workers – Computation & messaging
• Handle I/O – reading and writing the graph
• Computation/messaging of assigned partitions
ZooKeeper
• Maintains global application state
![Page 7: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/7.jpg)
Giraph Dataflow
[Figure: dataflow in three phases, coordinated by the Master with Worker 0 and Worker 1.]
1 Loading the graph: workers read the input splits through the input format, then load and send the graph.
2 Compute/Iterate: workers hold the in-memory graph as partitions (Part 0 to Part 3), compute and send messages, then send stats to the master and iterate.
3 Storing the graph: workers write their partitions (Part 0 to Part 3) through the output format.
![Page 8: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/8.jpg)
Giraph Job Lifetime
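[Figure: job flowchart. Input → Compute Superstep → All Vertices Halted? If no, run another superstep; if yes → Master halted? If no, run another superstep; if yes → Output.]
Vertex Lifecycle
[Figure: vertex state diagram. Active → Inactive on “Vote to Halt”; Inactive → Active on “Received Message”.]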
![Page 9: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/9.jpg)
Simple Example – Compute the maximum value
[Figure: vertex values on Processor 1 and Processor 2, exchanged over time until every vertex holds the maximum.]
Connected Components, e.g. Finding Communities
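A minimal sketch of the maximum-value example, written as a stand-alone Java simulation of supersteps rather than against the Giraph API. The class name, the array-based graph layout, and the driver loop are assumptions made for illustration; only the behaviour (send on change, vote to halt, wake on message) comes from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone simulation of BSP supersteps; not the Giraph API.
final class MaxValueSketch {

    // initial[v] is the starting value of vertex v; neighbors[v] lists the targets of v's out-edges.
    // Runs supersteps until every vertex has voted to halt and no messages are in flight.
    static int[] run(int[] initial, int[][] neighbors) {
        int n = initial.length;
        int[] values = initial.clone();
        boolean[] halted = new boolean[n];
        List<List<Integer>> inbox = emptyInboxes(n);

        for (int superstep = 0; ; superstep++) {
            List<List<Integer>> outbox = emptyInboxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                // A halted vertex is reactivated only by an incoming message.
                if (halted[v] && inbox.get(v).isEmpty()) {
                    continue;
                }
                halted[v] = false;
                anyActive = true;
                boolean changed = superstep == 0; // every vertex announces its value once
                for (int message : inbox.get(v)) {
                    if (message > values[v]) {
                        values[v] = message;
                        changed = true;
                    }
                }
                if (changed) {
                    for (int target : neighbors[v]) {
                        outbox.get(target).add(values[v]);
                    }
                }
                halted[v] = true; // vote to halt
            }
            if (!anyActive) {
                return values; // all vertices halted and no pending messages
            }
            inbox = outbox; // barrier: messages become visible in the next superstep
        }
    }

    private static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }
}
```

On a connected graph every vertex ends up holding the largest initial value, and the loop stops one superstep after the last message is delivered.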
![Page 10: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/10.jpg)
PageRank – ranking websites
Mahout (Hadoop): 854 lines
Giraph: < 30 lines
• Send neighbors an equal fraction of your page rank.
• New page rank = 0.15 / (# of vertices) + 0.85 * (messages sum)
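The update rule above can be sketched as a single superstep in plain Java. This is an illustration under stated assumptions, not the Giraph implementation the slide counts lines for: the class name and array layout are hypothetical, and vertices without out-edges simply send nothing.

```java
// Hypothetical plain-Java sketch of one PageRank superstep; not the Giraph API.
final class PageRankSketch {

    // rank[v] is the current rank of vertex v; outEdges[v] lists the vertices v links to.
    static double[] superstep(double[] rank, int[][] outEdges) {
        int n = rank.length;
        double[] messageSum = new double[n];
        for (int v = 0; v < n; v++) {
            if (outEdges[v].length == 0) {
                continue; // assumption of this sketch: a vertex with no out-edges sends nothing
            }
            // Send neighbors an equal fraction of this vertex's rank.
            double share = rank[v] / outEdges[v].length;
            for (int target : outEdges[v]) {
                messageSum[target] += share;
            }
        }
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
            // New page rank = 0.15 / (# of vertices) + 0.85 * (messages sum)
            next[v] = 0.15 / n + 0.85 * messageSum[v];
        }
        return next;
    }
}
```

On a three-vertex cycle with every rank at 1/3, one superstep leaves the ranks unchanged, which is a quick sanity check of the formula.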
![Page 11: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/11.jpg)
Scaling
![Page 12: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/12.jpg)
Problem: Worker Crash.
[Figure: superstep timeline with checkpoints.]
• Superstep i (no checkpoint), superstep i+1 (checkpoint), superstep i+2 (no checkpoint): worker failure!
• Restart from superstep i+1 (checkpoint), then superstep i+2 (no checkpoint), superstep i+3 (checkpoint): worker failure after checkpoint complete!
• Restart from superstep i+3 (no checkpoint), and so on until the application completes.
Solution: Checkpointing.
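The recovery rule in the timeline (resume from the most recent superstep whose checkpoint completed) can be sketched as follows. The class and method names are hypothetical, and the sketch only tracks which supersteps have a checkpoint; it does not model what Giraph actually writes to disk.

```java
import java.util.TreeSet;

// Hypothetical bookkeeping for checkpoint-based recovery; not Giraph's implementation.
final class CheckpointSketch {
    private final TreeSet<Integer> completed = new TreeSet<>();

    // Records that the checkpoint for this superstep was fully written.
    void checkpointCompleted(int superstep) {
        completed.add(superstep);
    }

    // Superstep to resume from after a worker fails during failedSuperstep.
    // Returns -1 when no checkpoint exists yet, meaning the job reloads its input.
    int restartFrom(int failedSuperstep) {
        Integer latest = completed.floor(failedSuperstep);
        return latest == null ? -1 : latest;
    }
}
```

With checkpoints at supersteps 1 and 3, a failure in superstep 2 resumes from 1, and a failure in superstep 3 after its checkpoint completed resumes from 3, matching the two cases in the figure with i = 0.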
![Page 13: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/13.jpg)
Problem: Master Crash.
Solution: ZooKeeper Master Queue.
[Figure: the active master state is kept in ZooKeeper.]
• Before failure of active master 0: Master 0 is “Active”; Master 1 and Master 2 are “Spare”.
• After failure of active master 0: Master 1 becomes “Active”; Master 2 remains “Spare”.
![Page 14: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/14.jpg)
Problem: Primitive Collections.
• Graphs often parameterized with {Null, Int, Long, Float, Double}.
• Boxing/unboxing. Objects have internal overhead.
Solution: Use fastutil, e.g. Long2DoubleOpenHashMap.
fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion.
[Figures: example graphs for Single Source Shortest Path, Network Flow, and Count In-Degree.]
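To show why a type-specific map avoids boxing, here is a minimal long-to-double hash map built on primitive arrays. It is a sketch of the general idea only: the class is hypothetical, it is not fastutil's implementation, and it supports just put and lookup.

```java
// Hypothetical minimal long -> double open-addressing map.
// Illustrates primitive storage only; this is not fastutil's implementation.
final class LongDoubleMapSketch {
    private long[] keys = new long[16];
    private double[] values = new double[16];
    private boolean[] used = new boolean[16];
    private int size;

    void put(long key, double value) {
        if ((size + 1) * 4 > keys.length * 3) {
            resize(); // keep the load factor at or below 0.75
        }
        int slot = find(key);
        if (!used[slot]) {
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    double getOrDefault(long key, double fallback) {
        int slot = find(key);
        return used[slot] ? values[slot] : fallback;
    }

    int size() {
        return size;
    }

    // Linear probing: returns the slot holding key, or the first free slot on its probe path.
    private int find(long key) {
        int mask = keys.length - 1; // capacity is always a power of two
        int slot = Long.hashCode(key * 0x9E3779B97F4A7C15L) & mask;
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        return slot;
    }

    private void resize() {
        long[] oldKeys = keys;
        double[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new long[oldKeys.length * 2];
        values = new double[keys.length];
        used = new boolean[keys.length];
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldUsed[i]) {
                put(oldKeys[i], oldValues[i]);
            }
        }
    }
}
```

Keys and values live in `long[]` and `double[]`, so no `Long` or `Double` objects are allocated per entry, unlike `HashMap<Long, Double>`.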
![Page 15: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/15.jpg)
Problem: Too many objects. Lots of time spent in GC.
Graph: 1B Vertices, 200B Edges, 200 Workers.
• 1B Edges per Worker. 1 object per edge value.
  List<Edge<I, E>> ~ 10B objects
• 5M Vertices per Worker. 10 objects per vertex value.
  Map<I, Vertex<I, V, E>> ~ 50M objects
• 1 Message per Edge. 10 objects per message data.
  Map<I, List<M>> ~ 10B objects
• Objects used ~= O(E*e + V*v + M*m) => O(E*e)
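The per-worker figures above follow from dividing the graph evenly across workers. A small sketch of that arithmetic, using only the numbers stated on the slide (the class and method names are hypothetical):

```java
// Hypothetical helper reproducing the slide's per-worker arithmetic.
final class ObjectCountSketch {
    static final long VERTICES = 1_000_000_000L;   // 1B vertices
    static final long EDGES = 200_000_000_000L;    // 200B edges
    static final long WORKERS = 200L;

    // 200B edges / 200 workers = 1B edges per worker
    static long edgesPerWorker() {
        return EDGES / WORKERS;
    }

    // 1B vertices / 200 workers = 5M vertices per worker
    static long verticesPerWorker() {
        return VERTICES / WORKERS;
    }

    // 5M vertices * 10 objects per vertex value = 50M objects
    static long vertexObjectsPerWorker() {
        return verticesPerWorker() * 10;
    }

    // 1 message per edge * 10 objects per message = 10B objects
    static long messageObjectsPerWorker() {
        return edgesPerWorker() * 10;
    }
}
```

The edge and message terms are each a couple of hundred times larger than the vertex term, which is why the total is dominated by the per-edge cost.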
[Figure: Label Propagation, e.g. Who’s sleeping? A small weighted graph with the labels Boring, Amazing, and Confusing. Q: What did he think?]
![Page 16: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/16.jpg)
Problem: Too many objects. Lots of time spent in GC.
Solution: byte[]
• Serialize messages, edges, and vertices.
• Iterable interface with representative object.
[Figure: serialized input consumed through repeated next() calls.]
Objects per worker ~= O(V)
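A minimal sketch of the byte[] approach: edges are appended to one byte buffer, and iteration refills a single reused "representative" object on each `next()`. The class, its field names, and the long/double edge layout are assumptions for illustration, not Giraph's classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch: edges stored as serialized bytes and exposed through one reused object.
final class ByteArrayEdgesSketch implements Iterable<ByteArrayEdgesSketch.Edge> {

    // Mutable representative edge; the iterator overwrites it on every next().
    static final class Edge {
        long targetId;
        double value;
    }

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);
    private int count;

    void add(long targetId, double value) {
        try {
            out.writeLong(targetId);
            out.writeDouble(value);
            count++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Iterator<Edge> iterator() {
        // toByteArray() copies the bytes; a real implementation would read the buffer in place.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        Edge representative = new Edge();
        int total = count;
        return new Iterator<Edge>() {
            private int read;

            @Override
            public boolean hasNext() {
                return read < total;
            }

            @Override
            public Edge next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                try {
                    representative.targetId = in.readLong();
                    representative.value = in.readDouble();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                read++;
                return representative; // same instance every time
            }
        };
    }
}
```

Because `next()` always returns the same instance, callers must copy any fields they want to keep before advancing. That is the trade-off that removes the per-edge object count.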
![Page 17: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/17.jpg)
Problem: Serialization of byte[]
• DataInput? Kryo? Custom?
Solution: Unsafe
• Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only).
• AWESOME. As fast as it gets.
• True native. Essentially C: *(long*)(data+offset);
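To make the C expression concrete, the sketch below reads and writes a long at a byte offset in native byte order. It deliberately swaps in the supported `java.nio.ByteBuffer` API instead of Unsafe, so it is bounds-checked and will not match the raw speed the slide describes; the class name is hypothetical, and the comparison with C assumes a 64-bit `long`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper using ByteBuffer, not Unsafe: same result, with bounds checks.
final class NativeLongReadSketch {

    // Reads the 8 bytes starting at offset as a long in the platform's byte order,
    // which is what *(long*)(data+offset) yields in C when long is 64 bits wide.
    static long readLong(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.nativeOrder()).getLong(offset);
    }

    // Writes value into the 8 bytes starting at offset, in the platform's byte order.
    static void writeLong(byte[] data, int offset, long value) {
        ByteBuffer.wrap(data).order(ByteOrder.nativeOrder()).putLong(offset, value);
    }
}
```

An out-of-range offset throws `IndexOutOfBoundsException` here, whereas the Unsafe route performs no such check, which is part of what the slide means by "Dangerous".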
![Page 18: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/18.jpg)
Problem: Large Aggregations.
Solution: Sharded Aggregators.
[Figure: a Master and five Workers, shown in three steps.]
• Workers own aggregators.
• Aggregator owners communicate with Master.
• Aggregator owners distribute values.
K-Means Clustering, e.g. Similar Emails
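The first of the three steps, reducing each aggregator at the worker that owns it, can be sketched as a small simulation. Assigning owners by hashing the aggregator name is an assumption of this sketch, as are the class and method names, and only sum aggregators are modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of sharded sum-aggregators; names and layout are illustrative.
final class ShardedAggregatorSketch {

    // Owner of an aggregator: a stable function of its name (assumed here to be a hash).
    static int ownerOf(String aggregatorName, int workers) {
        return Math.floorMod(aggregatorName.hashCode(), workers);
    }

    // partials.get(w) holds worker w's partial value for each aggregator name.
    // Returns one map per worker containing the final values of the aggregators it owns.
    static List<Map<String, Double>> reduceAtOwners(List<Map<String, Double>> partials) {
        int workers = partials.size();
        List<Map<String, Double>> owned = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            owned.add(new HashMap<>());
        }
        for (Map<String, Double> partial : partials) {
            for (Map.Entry<String, Double> entry : partial.entrySet()) {
                int owner = ownerOf(entry.getKey(), workers);
                owned.get(owner).merge(entry.getKey(), entry.getValue(), Double::sum);
            }
        }
        return owned;
    }
}
```

Each owner would then send only its final values to the master and distribute them to the other workers, so no single machine has to combine every partial value.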
![Page 19: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/19.jpg)
Problem: Network Wait.
• RPC doesn’t fit model.
• Synchronous calls no good.
Solution: Netty. Tune queue sizes & threads.
[Figure: two superstep timelines between barriers, each showing compute, network, and wait phases from “Begin superstep” through “End compute” to “End superstep”; the second also marks “Time to first message”.]
![Page 20: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/20.jpg)
Results
![Page 21: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/21.jpg)
Scalability Graphs
[Figure: Increasing Workers. Iteration time (sec) vs. number of workers (50 to 300). 2B Vertices, 200B Edges, 20 Compute Threads.]
[Figure: Increasing Data Size. Iteration time (sec) vs. number of edges. 50 Workers, 20 Compute Threads.]
![Page 22: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/22.jpg)
Lessons Learned
• Coordinating is a zoo. Be resilient with ZooKeeper.
• Efficient networking is hard. Let Netty help.
• Primitive collections, primitive performance. Use fastutil.
• byte[] is simple yet powerful.
• Being Unsafe can be a good thing.
• Have a graph? Use Giraph.
![Page 23: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/23.jpg)
What’s the final result?
Comparison with Hive:
• 20x CPU speedup.
• 100x elapsed time speedup. 15 hours => 9 minutes.
Computations on entire Facebook graph no longer “weekend jobs”. Now they’re coffee breaks.
![Page 24: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/24.jpg)
Questions?
![Page 25: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/25.jpg)
Problem: Measurements.
• Need tools to gain visibility into the system.
• Problems with connecting to Hadoop sub-processes.
Solution: Do it all.
• YourKit – see YourKitProfiler
• jmap – see JMapHistoDumper
• VisualVM – with jstatd & ssh socks proxy
• Yammer Metrics
• Hadoop Counters
• Logging & GC prints
![Page 26: 2013 06-03 berlin buzzwords](https://reader033.vdocuments.mx/reader033/viewer/2022061110/5453d730af7959aa678b8036/html5/thumbnails/26.jpg)
Problem: Mutations
• Synchronization.
• Load balancing.
Solution: Reshuffle resources
• Mutations handled at barrier between supersteps.
• Master rebalances vertex assignments to optimize distribution.
• Handle mutations in batches.
• Avoid if using byte[].
• Favor algorithms which don’t mutate graph.
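Handling mutations in batches at the barrier can be sketched as a queue that is drained between supersteps. The class and method names are hypothetical, only vertex additions and removals are modelled, and applying removals before additions is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of deferred graph mutations; not the Giraph API.
final class DeferredMutationSketch {
    private final Set<Long> vertices = new HashSet<>();
    private final List<Long> pendingAdds = new ArrayList<>();
    private final List<Long> pendingRemoves = new ArrayList<>();

    // Requests made during a superstep are only queued, so compute never sees a half-changed graph.
    void requestAdd(long id) {
        pendingAdds.add(id);
    }

    void requestRemove(long id) {
        pendingRemoves.add(id);
    }

    boolean contains(long id) {
        return vertices.contains(id);
    }

    // Called once between supersteps: applies the whole batch, removals first (an assumption).
    void barrier() {
        vertices.removeAll(pendingRemoves);
        vertices.addAll(pendingAdds);
        pendingRemoves.clear();
        pendingAdds.clear();
    }
}
```

A vertex requested during superstep s becomes visible only after `barrier()`, which is what keeps synchronization out of the compute phase.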