![Page 1: Adam Belloum (UvA) dr. Jason Maassen (eScience Center ... · Ibis Data Serialization in Apache Spark By Dadepo Aderemi and Mathijs Visser Supervisors: dr. Jason Maassen (eScience](https://reader034.vdocuments.mx/reader034/viewer/2022042311/5ed9cc39c775f12f0c2069ff/html5/thumbnails/1.jpg)
Ibis Data Serialization in Apache Spark
By Dadepo Aderemi and Mathijs Visser
Supervisors:
dr. Jason Maassen (eScience Center)
Adam Belloum (UvA)
We live in a big data world
- Increase in data generation: IoT, mobile devices, social media, logs from large-scale software, etc.
- Large and complex data sets
- Beyond the ability of traditional software tools
- Rich analytical potential
Image source: https://towardsdatascience.com/what-is-big-data-lets-answer-this-question-933b94709caf
We live in a big data world
- Big data is essential not only in business but also in science
  - Computational astrophysics, climate modeling, medical and pharmaceutical research, etc.
- Nature, Volume 455, Issue 7209 (4 September 2008) discussed the challenges of dealing with big data.
- Core problem: an explosion of data that cannot be managed quickly enough using traditional approaches.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
- Gartner Glossary
What is Apache Spark
- A unified analytics engine for large-scale data processing, written in Scala
- Began at UC Berkeley in 2009; became an Apache project in 2013
- Supports the MapReduce programming model
- Supports both batch and streaming processing of data
- Provides SQL, machine learning, and graph processing capabilities
- Provides a distributed computing platform that can run on Apache Mesos, Kubernetes, standalone, or in the cloud
- Can access data in:
  - HDFS (Hadoop Distributed File System)
  - Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources
Common bottlenecks in big data processing
- Network bandwidth
- Disk IO
- Memory
- Serialization
“...the mechanism for converting (graphs of) data (Java objects) to some format that can be stored or transferred (e.g., a stream of bytes, or XML)...”
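To make the quoted definition concrete, here is a minimal sketch of native Java serialization turning a small object graph into a byte array and back. The `Node` class and the helper names are hypothetical and exist only for this illustration; they are not taken from Spark or Ibis.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SerializationDemo {
    // Hypothetical type used only for this illustration.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Convert an object graph to bytes using native Java serialization.
    static byte[] serialize(Object obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuild the object graph from its byte representation.
    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("root");
        root.children.add(new Node("leaf"));
        byte[] data = serialize(root);
        Node copy = (Node) deserialize(data);
        System.out.println(copy.name + " -> " + copy.children.get(0).name
                + " (" + data.length + " bytes)");
    }
}
```

The copy is a distinct object with the same structure, which is what happens whenever data is sent over the network or written to disk in serialized form.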
Research Questions
- Can Apache Spark's performance be improved by taking advantage of Ibis' serialization techniques?
Sub questions:
- What components of Apache Spark can benefit from Ibis' fast serialization?
- How can Ibis' serialization techniques be integrated into Apache Spark?
- How does the performance of Apache Spark differ when using Java, Kryo and Ibis serialization?
What is Ibis
- Ibis is an open source Java distributed computing software project
- Developed at the Vrije Universiteit Amsterdam
- With the goal of creating an efficient Java-based platform for distributed computing [1]
[1] https://www.cs.vu.nl/ibis/
Related work
- Xiaoyi Lu et al. [1][2]
  - Improvements to Spark have been made using various methods such as Remote Direct Memory Access (RDMA)
  - Applying zero-copy buffer management in the network stack
- van Nieuwpoort, Rob et al.
  - Applied compile-time code generation to improve Java's RMI in Ibis RMI
- Apache Spark has also shown that serialization performance can be improved using Kryo serialization.
- But no prior work has been done on using Ibis serialization in Spark
[1] “High-performance design of Apache Spark with RDMA and its benefits on various workloads”. In: 2016 IEEE International Conference on Big Data (BigData). IEEE. 2016, pp. 253–262
[2] “Accelerating Spark with RDMA for big data processing: Early experiences”. In: 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects. IEEE. 2014, pp. 9–16
Overview of Ibis components
The Ibis software stack: component view
The Ibis software stack
What makes Ibis serialization efficient
- Ibis serialization:
  - Optimizes object creation
  - Avoids data copying
  - Optionally moves runtime type inspection to compile time
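The last point can be illustrated with a small sketch. It is not Ibis code and does not use any Ibis API: `writeReflective` discovers the fields of an object through reflection on every call, while `writeGenerated` is a hand-written stand-in for the kind of per-class writer that a compile-time rewriter could emit. The `Point` class and both method names are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;

class TypeInspectionDemo {
    // Hypothetical data class; not part of Ibis or Spark.
    static class Point {
        int x;
        int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Runtime type inspection: look up the int fields via reflection on each call.
    static byte[] writeReflective(Object obj) {
        Field[] fields = obj.getClass().getDeclaredFields();
        // Sort by name so the output order does not depend on the JVM.
        Arrays.sort(fields, Comparator.comparing(Field::getName));
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Field f : fields) {
                if (f.isSynthetic() || Modifier.isStatic(f.getModifiers())
                        || f.getType() != int.class) {
                    continue;
                }
                f.setAccessible(true);
                out.writeInt(f.getInt(obj));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Stand-in for generated code: the field layout is fixed at compile time,
    // so no reflection is needed when the object is written.
    static byte[] writeGenerated(Point p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Point p = new Point(3, 4);
        System.out.println(Arrays.equals(writeReflective(p), writeGenerated(p)));
    }
}
```

Both paths produce the same bytes; the difference is only in how much work is done at run time to find out what to write.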
Overview of how Spark works
How Spark Works
Source: https://spark.apache.org/docs/latest/cluster-overview.html
Spark APIs
RDD (Resilient Distributed Dataset)
DataFrames
Datasets
How Spark executes applications
Source: https://trongkhoanguyen.com/spark/understand-rdd-operations-transformations-and-actions/
Methodology
Methodology
- Identify the Spark components that use serialization
- Extract the serialization component from Ibis
- Modify Spark to use the serialization from Ibis
- Measure the performance difference
Identifying Spark components using serialization
- We analysed the source code of Spark
- We found 17 instances of direct serialization calls
  - Internal operations
  - Network operations
  - Persistence operations (Disk and Memory)
- Available serialization mechanisms:
  - Native Java serialization
  - Kryo serialization [1]
[1] https://github.com/EsotericSoftware/kryo
Modifying Spark to use Ibis serialization
- 17 different components using serialization.
- We managed to replace 15 of those.
Unresolved Incompatibilities
- Incompatibility with NettyBlockRpcServer and NettyBlockTransferService
  - Uses zero-copy I/O
  - Off-heap network buffer management
  - Making a drop-in replacement harder
- Incompatibility with deserializing from Hadoop filesystem
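One general reason off-heap buffers complicate a drop-in serializer swap can be sketched with the standard `java.nio.ByteBuffer` API. This is only an illustration of the mechanism, not a description of the exact obstacle the authors hit: a direct (off-heap) buffer exposes no backing `byte[]`, so code that expects a heap array has to copy the data first, which gives up the zero-copy property. The helper `toHeapArray` is a hypothetical name.

```java
import java.nio.ByteBuffer;

class OffHeapBufferDemo {
    // Copy the readable bytes of any buffer into a heap array, which is
    // what a byte[]-based serializer would need as input.
    static byte[] toHeapArray(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate(); // leaves the caller's position untouched
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(4);
        ByteBuffer direct = ByteBuffer.allocateDirect(4);
        direct.putInt(42);
        direct.flip();

        // A heap buffer is backed by an accessible array; a direct buffer is not.
        System.out.println("heap.hasArray()   = " + heap.hasArray());
        System.out.println("direct.hasArray() = " + direct.hasArray());

        // Reaching the bytes of the direct buffer requires an explicit copy.
        byte[] copied = toHeapArray(direct);
        System.out.println("copied " + copied.length + " bytes");
    }
}
```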
Resolved Incompatibilities
- Modification to support serialization of Scala’s Option type
- Modification to support serialization of Enum with constant method
  - Thanks to the Ibis maintainer: Ceriel Jacobs from the Vrije Universiteit Amsterdam
- Modification to support ByteBuffer
Measuring the performance differences
Benchmark setup
- We now have:
  - A modified version of Spark
  - The original Spark version, to test Kryo and native Java serialization
- Two worker nodes, directly connected
  - Both running an HDFS DataNode
  - Using Hadoop YARN as resource manager
Benchmark setup (diagram labels: Worker Node 1, Worker Node 2, HDFS, Yarn, Spark)
Benchmarking method
- Single test results may not be conclusive
- To get more reliable results we perform each benchmark 50 times
- Take the mean of all results
- Test environments are reset between test runs
- Also comparing Ibis and Ibisc
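The repeat, reset, and average procedure described above can be sketched as a small timing harness. This is an illustrative stand-in, not the authors' benchmark scripts: the workload in `main` is a placeholder in-memory sort rather than a Spark job, and the names `timeRuns` and `mean` are assumptions.

```java
import java.util.Arrays;

class RepeatedBenchmark {
    // Arithmetic mean of the collected samples.
    static double mean(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Run `task` `runs` times, calling `reset` (untimed) before each run,
    // and return the wall-clock time of every run in milliseconds.
    static double[] timeRuns(Runnable reset, Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            reset.run();
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }

    public static void main(String[] args) {
        int[] data = new int[200_000]; // placeholder workload, not a Spark job
        Runnable reset = () -> {
            for (int i = 0; i < data.length; i++) {
                data[i] = data.length - i; // restore the unsorted state
            }
        };
        Runnable task = () -> Arrays.sort(data);
        double[] samples = timeRuns(reset, task, 50);
        System.out.printf("mean over %d runs: %.3f ms%n", samples.length, mean(samples));
    }
}
```

Keeping the reset step outside the timed region means the mean reflects only the work being measured.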
Benchmark types
- Mostly use standardized benchmarks
  - TeraSort:
    - Distributed sorting algorithm
    - Measures shuffling performance
  - SparkPi:
    - Computes an approximation of Pi
    - Measures computing performance
  - Memory persistence
    - Measures memory persistence performance
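To show why SparkPi is a compute-bound benchmark, here is a single-threaded sketch of the Monte Carlo estimate that the SparkPi example is based on: sample random points in a square and count how many fall inside the inscribed circle. The real benchmark distributes the sampling over Spark tasks; this plain Java version, including the `estimatePi` name and the seed parameter, is an assumption for illustration only.

```java
import java.util.Random;

class PiEstimate {
    // Estimate Pi from the fraction of random points in [-1, 1] x [-1, 1]
    // that land inside the unit circle (area ratio Pi / 4).
    static double estimatePi(int samples, long seed) {
        Random rng = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = rng.nextDouble() * 2 - 1;
            double y = rng.nextDouble() * 2 - 1;
            if (x * x + y * y <= 1) {
                inside++;
            }
        }
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimatePi(1_000_000, 42L));
    }
}
```

Almost no data moves between workers in this kind of job, so the choice of serializer has little to influence.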
Results
TeraSort results
Conclusion
- Research question:
  - Can Apache Spark's performance be improved by taking advantage of Ibis' serialization techniques?
- 15 out of 17 components could be replaced
- Ibis was 15-20% faster in benchmarks that extensively use serialization
- Ibis was 10-15% more efficient in memory usage in benchmarks that extensively use serialization
- There was no noticeable performance difference in purely computational benchmarks
Future Work
- Replace remaining two components with Ibis serialization
- Measure performance using other benchmarks
- Research performance on a larger scale
- Apply Ibis rewriter to Spark
- Compare Ibis against dataset encoders
- Experiment with Ibis' networking implementations in Spark
- Investigate Ibis serialization performance in other distributed applications
Questions?