How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Ahsan Javed Awan, EMJD-DC (KTH-UPC) (https://www.kth.se/profile/ajawan/)
Mats Brorsson (KTH), Vladimir Vlassov (KTH) and Eduard Ayguade (UPC and BSC)
Motivation: Why should we care about architecture support?
*Source: SGI
Data Growing Faster Than Technology
Motivation (cont.)
Our Focus: improve the node-level performance through architecture support
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix++, Metis, Ostrich, etc.
Hadoop, Spark, Flink, etc.
Motivation (cont.)
● A mismatch between the characteristics of emerging workloads and the underlying hardware.
– M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in ASPLOS 2012.
– Z. Jia et al., “Characterizing data analysis workloads in data centers,” in IISWC 2013.
– Z. Jia et al., “Characterizing and subsetting big data workloads,” in IISWC 2014.
– A. Yasin et al., “Deep-dive analysis of the data analytics workload in cloudsuite,” in IISWC 2014.
– T. Jiang et al., “Understanding the behavior of in-memory computing workloads,” in IISWC 2014.
Existing studies lack a quantitative analysis of the bottlenecks of scale-out frameworks on a single node.
Progress Meeting 12-12-14
Which Scale-out Framework?
[Picture Courtesy: Amir H. Payberah]
Our Approach
What are the major bottlenecks?
● “Performance characterization of in-memory data analytics on a modern cloud server,” in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
● “How Data Volume Affects Spark Based Data Analytics on a Scale-up Server” (focus of this talk)
What are the remaining questions?
● Do Spark based data analytics benefit from using scale-up servers?
● How severe is the impact of garbage collection on performance of Spark based data analytics?
● Is file I/O detrimental to Spark based data analytics performance?
● How does data size affect the micro-architecture performance of Spark based data analytics?
What are the contributions?
● We evaluate the impact of data volume on the performance of Spark based data analytics running on a scale-up server.
● We quantify the limitations of using Spark on a scale-up server with large volumes of data.
● We quantify the variations in micro-architectural performance of applications across different data volumes.
Methodology
● Use a subset of benchmarks from BigDataBench.
● Use the Big Data Generator Suite (BDGS) to generate synthetic datasets of 6 GB, 12 GB and 24 GB.
● Configure Spark in local mode and tune its internal parameters.
● Rely on GC logs to collect garbage collection times.
● Use Spark logs to gather execution time of benchmarks.
● Use Concurrency Analysis in Intel VTune to collect wait time and CPU time of executor pool threads.
● Use General Micro-architectural Exploration in Intel VTune to analyze the impact of data volume on micro-architecture characteristics.
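As a rough illustration of the "local mode plus GC logs" setup listed above, the sketch below assembles a `spark-submit` command line. It is not the authors' actual configuration: the jar name, main class, core count, heap size and log path are hypothetical placeholders, and the GC logging switches assume a JDK 8-era HotSpot JVM (later JDKs replaced them with `-Xlog:gc`). In local mode the executor threads run inside the driver JVM, which is why the driver options are the ones being set.

```python
from typing import List

# HotSpot switches that select a garbage collector (JDK 7/8 era).
GC_FLAGS = {
    "parallel_scavenge": "-XX:+UseParallelGC",
    "cms": "-XX:+UseConcMarkSweepGC",
    "g1": "-XX:+UseG1GC",
}


def build_spark_submit(app_jar: str, main_class: str, cores: int,
                       heap_gb: int, collector: str, gc_log: str) -> List[str]:
    """Return a spark-submit argument list for a local-mode run with GC logging."""
    jvm_opts = " ".join([
        GC_FLAGS[collector],        # which collector to use
        "-verbose:gc",
        "-XX:+PrintGCDetails",      # JDK 8 style GC logging (assumption)
        "-XX:+PrintGCTimeStamps",
        f"-Xloggc:{gc_log}",        # write GC events to this file
    ])
    return [
        "spark-submit",
        "--master", f"local[{cores}]",       # one JVM, `cores` worker threads
        "--driver-memory", f"{heap_gb}g",    # heap of the single local JVM
        "--driver-java-options", jvm_opts,
        "--class", main_class,
        app_jar,
    ]


# Hypothetical values, for illustration only.
cmd = build_spark_submit("benchmark.jar", "example.WordCount",
                         cores=12, heap_gb=50,
                         collector="parallel_scavenge", gc_log="gc.log")
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the JVM options together as one argument if the command is later passed to `subprocess.run`.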
What are the characteristics of benchmarks?
Our Hardware Configuration
System Details
Our Hardware Configuration
Machine Details
Hyper Threading and Turbo-boost are disabled
Software Parameters
Do Spark based data analytics benefit from using larger scale-up servers?
Spark applications do not benefit significantly from using more than 12-core executors.
Is GC detrimental to scalability of Spark applications?
The proportion of GC time increases with the number of cores
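One way to obtain such a proportion is to sum the pause durations in a GC log and divide by the execution time reported in the Spark log. The sketch below assumes the JDK 8 `-XX:+PrintGCDetails` line format, in which each event ends with `, <seconds> secs]`; the sample log lines and the 10-second execution time are made up for illustration and are not measurements from the talk.

```python
import re

# Pause duration that closes a JDK 8 PrintGCDetails event,
# e.g. "... 65536K->15000K(251392K), 0.0150000 secs]" (format is an assumption).
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")


def total_gc_seconds(gc_log_text: str) -> float:
    """Sum the GC pause time over all events in the log text."""
    total = 0.0
    for line in gc_log_text.splitlines():
        matches = PAUSE_RE.findall(line)
        if matches:
            # Some collectors nest per-phase timings in one line;
            # the last duration on the line is the whole event.
            total += float(matches[-1])
    return total


def gc_fraction(gc_log_text: str, exec_seconds: float) -> float:
    """Fraction of the benchmark's execution time spent in GC."""
    return total_gc_seconds(gc_log_text) / exec_seconds


# Two made-up events: one young-generation and one full collection.
SAMPLE_LOG = (
    "1.234: [GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] "
    "65536K->15000K(251392K), 0.0150000 secs] "
    "[Times: user=0.05 sys=0.01, real=0.02 secs]\n"
    "5.678: [Full GC (Ergonomics) [PSYoungGen: 10720K->0K(76288K)] "
    "[ParOldGen: 150000K->90000K(175104K)] 160720K->90000K(251392K), "
    "[Metaspace: 20000K->20000K(1067008K)], 0.2350000 secs] "
    "[Times: user=0.80 sys=0.02, real=0.24 secs]\n"
)

print(f"GC share of run time: {gc_fraction(SAMPLE_LOG, exec_seconds=10.0):.1%}")
```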
Does performance remain consistent as we enlarge the data size?
The decrease in data processed per second (DPS) ranges from 11% to 93% (Parallel Scavenge).
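To make the metric concrete, the sketch below computes data processed per second as input size divided by execution time and then the relative drop between two data volumes. Reading DPS this way is an assumption based on the metric's name, and the execution times are invented purely to show the arithmetic; they are not results from the talk.

```python
def dps(input_gb: float, exec_seconds: float) -> float:
    """Data processed per second, in GB/s."""
    return input_gb / exec_seconds


def percent_decrease(baseline: float, value: float) -> float:
    """Relative drop of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Hypothetical execution times (seconds) keyed by input size (GB).
runs = {6: 60.0, 24: 480.0}

dps_small = dps(6, runs[6])      # smallest data set
dps_large = dps(24, runs[24])    # largest data set
drop = percent_decrease(dps_small, dps_large)
print(f"DPS falls from {dps_small:.3f} to {dps_large:.3f} GB/s ({drop:.0f}% lower)")
```

If execution time grew only in proportion to input size, DPS would stay flat; any decrease means the run time grows faster than the data.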
Does the choice of garbage collector impact the data processing capability of the system?
On average, DPS is 1.4x to 3.7x higher with Parallel Scavenge than with G1.
How does GC affect the data processing capability of the system?
GC time does not scale linearly with data size.
How does CPU utilization scale with data volume?
CPU utilization decreases as the input data size increases.
Is file I/O detrimental to performance?
The fraction of file I/O increases by 6x, 18x and 25x for Word Count, Naive Bayes and Sort respectively when the input data is increased by 4x.
How does data size affect micro-architectural performance?
Instruction retirement improves by 5 to 10% as we enlarge the data size.
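For readers unfamiliar with the "retiring" metric, the sketch below shows the publicly documented level-1 top-down breakdown that underlies VTune's micro-architectural exploration. It assumes a 4-wide Intel core and uses event names from Ivy Bridge-class processors; whether these match the test machine is an assumption, and the counter values are invented to show the arithmetic only.

```python
from typing import Dict

PIPELINE_WIDTH = 4  # issue slots per cycle on the assumed 4-wide core


def top_down_level1(counters: Dict[str, int]) -> Dict[str, float]:
    """Split pipeline slots into the four level-1 top-down categories."""
    slots = PIPELINE_WIDTH * counters["CPU_CLK_UNHALTED.THREAD"]
    retiring = counters["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    frontend = counters["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_spec = (counters["UOPS_ISSUED.ANY"]
                - counters["UOPS_RETIRED.RETIRE_SLOTS"]
                + PIPELINE_WIDTH * counters["INT_MISC.RECOVERY_CYCLES"]) / slots
    backend = 1.0 - (retiring + frontend + bad_spec)  # remainder of the slots
    return {
        "retiring": retiring,
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "backend_bound": backend,
    }


# Invented counter readings, for illustration only.
sample = {
    "CPU_CLK_UNHALTED.THREAD": 1000,
    "UOPS_RETIRED.RETIRE_SLOTS": 1200,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800,
    "UOPS_ISSUED.ANY": 1400,
    "INT_MISC.RECOVERY_CYCLES": 50,
}

breakdown = top_down_level1(sample)
for category, share in breakdown.items():
    print(f"{category}: {share:.0%}")
```

A higher retiring share means more of the issue slots ended in useful, completed work, which is the sense in which retirement "improves" at larger data sizes.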
How does data size affect micro-architectural performance? (cont.)
Execution units inside the core exhibit improved utilization at larger data sets
How does data size affect micro-architectural performance? (cont.)
Increase in L1 Bound Stalls implies better utilization of L1 Caches
How does data size affect micro-architectural performance? (cont.)
Spark benchmarks exhibit reduced memory bandwidth utilization
Key Findings
● Spark workloads do not benefit significantly from executors with more than 12 cores.
● The performance of Spark workloads degrades with large volumes of data due to substantial increase in garbage collection and file I/O time.
● Without any tuning, the Parallel Scavenge garbage collection scheme outperforms the Concurrent Mark Sweep and G1 garbage collectors for Spark workloads.
● Spark workloads exhibit improved instruction retirement due to lower L1 cache misses and better utilization of functional units inside cores at large volumes of data.
● Memory bandwidth utilization of Spark benchmarks decreases with large volumes of data and is 3x lower than the available off-chip bandwidth on our test machine.
Future Directions
NUMA Aware Task Scheduling
Cache Aware Transformations
Exploiting Processing In Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures