Advanced Visualization of Spark jobs
Agenda
• Introduction
• Motivation and the challenges
• The Spark execution visualization
• An elegant way to tackle data skew
• The visualizer extended to Flink
• Future plans
Introduction
• Hungarian Academy of Sciences (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Cassandra, Hadoop, etc.
• Multiple telco use cases lately, with challenging data volume and distribution
Motivation
• We developed an application aggregating telco data that tested well on toy data
• When deployed against the real dataset, the application seemed healthy
• However, it could become surprisingly slow or even crash
• What went wrong?
Data skew
• When some data points are very frequent and we need to process them on a single machine – aka the Bieber effect
• Power-law distributions are very common: from social media to telecommunications, even in our beloved WordCount
• Default hashing puts these heavy keys into "random" buckets
• We can do better than that
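A minimal sketch (not the speakers' code) of why default hashing struggles with a power-law workload: all records sharing the heavy key hash to the same bucket, so one partition ends up carrying most of the data.

```python
import random
from collections import Counter

random.seed(0)
NUM_PARTITIONS = 8

# Zipf-like workload: the smallest key (the "Bieber key") dominates.
keys = [int(random.paretovariate(1.2)) for _ in range(100_000)]

# Default hash partitioning: partition = hash(key) mod #partitions.
partition_sizes = Counter(hash(k) % NUM_PARTITIONS for k in keys)
heaviest = max(partition_sizes.values())
lightest = min(partition_sizes.values())
print(f"heaviest partition: {heaviest}, lightest: {lightest}")
```

With a balanced key distribution the eight partitions would be roughly equal; here the partition holding the heavy key dwarfs the rest.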
Aim
1. Most telco data-processing workloads suffer from inefficient communication patterns introduced by skewed data
2. Help developers write better applications by making issues in logical and physical execution plans easier to detect
3. Help newcomers understand a distributed data-processing system faster
4. Demonstrate and guide the testing of adaptive (re)partitioning techniques
Spark's execution model
• Extended MapReduce model, where a previous stage has to complete all its tasks before the next stage can be scheduled (global synchronization)
• Data skew can introduce slow tasks
• In most cases, the data distribution is not known in advance
• "Concept drifts" are common

[Diagram: a stage boundary with shuffle; even partitioning completes smoothly, while skewed data produces slow tasks]
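A toy illustration (not Spark code) of the global synchronization at a stage boundary: since every task of a stage must finish before the next stage starts, the stage is only as fast as its slowest task.

```python
# Task durations in seconds; tasks within a stage run in parallel.
even_tasks = [10, 10, 10, 10]    # balanced partitions
skewed_tasks = [10, 10, 10, 40]  # one heavy partition

def stage_time(task_times):
    # The stage ends when its last (slowest) task ends.
    return max(task_times)

print("balanced stage:", stage_time(even_tasks), "s")
print("skewed stage:  ", stage_time(skewed_tasks), "s")
```

Although both stages do a comparable amount of total work per task count, the skewed one takes four times as long to release the next stage.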
Physical execution plan
• Understanding the physical plan
  • can lead to designing a better application
  • can uncover bottlenecks that are hard to identify otherwise
  • can help to pick the right tool for the job at hand
  • can be quite difficult for newcomers
  • can give insights into new, adaptive, on-the-fly partitioning & optimization strategies
Current visualizations in Spark
• Current visualization tools
  • are missing the fine-grained input & output metrics – these are also not available in most systems, including Spark
  • do not capture the data characteristics – a lightweight and efficient solution is hard to accomplish
  • do not visualize the physical plan

[Figure: DAG visualization in Spark]
Collecting information during execution
• Data transfer between tasks is accomplished in the shuffle phase, in the form of shuffle block fetches
• We distinguish local & remote block fetches
• Data characteristics are collected on shuffle write & read
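A hypothetical sketch of the local-vs-remote bookkeeping described above: a fetch is local when the shuffle block was written on the reducer's own executor, remote otherwise. The names (`block_locations`, `classify_fetches`) are illustrative, not Spark's actual API.

```python
# Shuffle block id -> executor that wrote it (assumed sample data).
block_locations = {
    "shuffle_0_0": "exec-1",
    "shuffle_0_1": "exec-1",
    "shuffle_0_2": "exec-2",
    "shuffle_0_3": "exec-3",
}

def classify_fetches(reducer_executor, block_ids):
    # Local fetches avoid the network; remote ones cross executors.
    local = [b for b in block_ids if block_locations[b] == reducer_executor]
    remote = [b for b in block_ids if block_locations[b] != reducer_executor]
    return len(local), len(remote)

# A reduce task running on exec-1 fetches all four map outputs:
local, remote = classify_fetches("exec-1", list(block_locations))
print(local, remote)  # 2 local, 2 remote
```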
A scalable sampling
• Key distributions are approximated with a strategy that
  • is not sensitive to early or late concept drifts,
  • is lightweight and efficient,
  • is scalable by using a backoff strategy

[Diagram: key-frequency counters at successive sampling rates; when the counters grow beyond the budget T_B, they are truncated and the sampling rate is adjusted]
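A sketch of the backoff idea as described on the slide (the details – halving the rate, keeping the top half of the counters – are assumptions for illustration): sample keys into a bounded counter map; whenever the map outgrows its budget, truncate the rare keys and back off the sampling rate, so the cost stays bounded regardless of key cardinality.

```python
import random
from collections import Counter

def sample_key_histogram(keys, budget=100, seed=0):
    rng = random.Random(seed)
    rate = 1.0
    counters = Counter()
    for k in keys:
        if rng.random() < rate:
            counters[k] += 1
        if len(counters) > budget:
            rate /= 2.0  # back off: sample more sparsely from here on
            # truncate: keep only the heaviest half of the counters
            counters = Counter(dict(counters.most_common(budget // 2)))
    return counters, rate

random.seed(1)
# Power-law data: key 1 is by far the most frequent.
data = [int(random.paretovariate(1.2)) for _ in range(50_000)]
hist, final_rate = sample_key_histogram(data)
top_key, _ = hist.most_common(1)[0]
print("heaviest sampled key:", top_key, "final rate:", final_rate)
```

Heavy keys survive every truncation, so the approximation stays accurate exactly where it matters for skew detection.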
Availability through the Spark REST API
• Enhanced the Spark REST API to provide block fetch & data characteristics information of tasks
• New queries to the REST API, for example: "what happened in the last 3 seconds?"
• /api/v1/applications/{appId}/{jobId}/tasks/{timestamp}
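A hypothetical client-side sketch of using the extended endpoint above; the response shape (a list of tasks with a `finishTime` field) is an assumption, not the actual API.

```python
from urllib.parse import quote

def tasks_since_url(base, app_id, job_id, timestamp_ms):
    # Build the endpoint path shown on the slide.
    return (f"{base}/api/v1/applications/{quote(app_id)}/"
            f"{quote(job_id)}/tasks/{timestamp_ms}")

def tasks_in_window(tasks, since_ms):
    # Client-side equivalent of "what happened in the last N seconds?"
    return [t for t in tasks if t["finishTime"] >= since_ms]

url = tasks_since_url("http://driver:4040", "app-42", "job-0", 1_000)
sample_response = [
    {"taskId": 7, "finishTime": 900},
    {"taskId": 8, "finishTime": 1_200},
]
print(url)
print(tasks_in_window(sample_response, 1_000))
```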
THE VISUALIZATION
Our way of tackling data skew
• Goal: use the collected information to handle data skew dynamically, on-the-fly, on any workload (batch or streaming)
• A paper on this is currently in progress

[Diagram: heavy keys create heavy partitions (slow tasks); after repartitioning, the Bieber key is alone and the size of the biggest partition is minimized]
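A sketch of the idea in the figure (the speakers' actual algorithm is in their forthcoming paper; this greedy variant is an assumption for illustration): place the heaviest keys first, each onto the currently lightest partition, so the Bieber key gets a partition to itself and the biggest partition is as small as possible.

```python
import heapq
from collections import Counter

def skew_aware_partition(key_freqs, num_partitions):
    # Min-heap of (current_size, partition_id): pop = lightest partition.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Greedy longest-processing-time rule: heaviest keys first.
    for key, freq in sorted(key_freqs.items(), key=lambda kv: -kv[1]):
        size, p = heapq.heappop(heap)
        assignment[key] = p
        heapq.heappush(heap, (size + freq, p))
    return assignment

freqs = Counter({"bieber": 80, "a": 10, "b": 10, "c": 10, "d": 10})
assignment = skew_aware_partition(freqs, 4)
sizes = Counter()
for k, f in freqs.items():
    sizes[assignment[k]] += f
print(dict(sizes))  # the Bieber key sits alone; max partition size is 80
```

No placement can do better here: the biggest partition can never be smaller than the heaviest single key.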
Making the visualization pluggable
• Let us also support Flink
• The execution model is quite different from Spark's
• Minimal code change in the data reader module of the visualization
• Expose the necessary metrics through the Flink JobManager REST API
• Map Flink's model to Spark's
Computational models
• Spark RDDs break down the computation into stages
• Flink computation is fully pipelined by default

[Diagram: batch sources (HDFS, JDBC, ...) and stream sources (Kafka, RabbitMQ, ...) feeding the two computational models]
The currently available Flink visualizer
Granular visualization of Flink jobs
Future plans
• Open-sourcing the visualization framework
• The dynamic repartitioning paper is in progress
• Task classification to make the visualization more compact
• Visual improvements, more features
• Opening PRs against Spark & Flink with the suggested metrics
• A rewrite is coming for the Flink metrics; track it at FLINK-1504
Conclusions
• The devil is in the details
• Visualizations can help developers better understand the issues and bottlenecks of certain workloads
• Data skew especially can hinder the completion time of a whole stage in a batch engine
• Adaptive repartitioning strategies can be surprisingly efficient
Thank you for your attention
Q&A