Interactive Visual Data Exploration with Spark in Databricks Cloud
Hossein Falaki @mhfalaki
About Databricks
Founded by creators of Apache Spark
Offers Spark as a service in the cloud
Dedicated to open source Spark
> Largest organization contributing to Apache Spark
> Drive the roadmap
Databricks Cloud
Databricks Workspace
> Notebooks
> Dashboards
> Job launcher
Databricks Platform
> Start clusters in seconds
> Dynamically scale up & down
> Latest version
> Configured / Optimized
Spark
Fast & general distributed computing engine: batch, streaming, iterative
Capable of handling petabytes of data
Even faster by caching data in-memory
Versatile programming interfaces
Spark: Mixing SQL with Python/Scala
    # Query an existing table and get results back as a SchemaRDD
    rdd = hiveContext.sql("select article, text from wikipedia")
    # Perform transformations: split each article's text into words
    words = rdd.flatMap(lambda r: r.text.split())
    # Sample a small fraction without replacement and collect it on the driver machine
    sampled_words = words.sample(withReplacement=False, fraction=0.001).collect()
Databricks Platform
Start clusters in seconds
Zero-cost management
Dynamically scale up and down
Databricks Workspace
Notebooks
> SQL
> Python
> Scala
Dashboards
Job Launcher
Notebooks
Supports Python, Scala, SQL
Interactive commands and plots
Online collaboration
Dashboards
WYSIWYG Builder
Interactive jobs
One-click publishing
Exporting from notebooks
Job Launcher
Runs arbitrary Spark jobs programmatically
Expository vs. Exploratory
Large data
“Visualization is critical to data analysis.” William S. Cleveland
But we often skip exploratory visualization with large data
Challenges
1. Interactivity with large data is challenging
2. Visual medium has fewer pixels than there are data points
Solutions
1. Interactivity
> In-memory computation
> High parallelism
Reducing interaction latency with Spark
1. In-memory computation
> Significantly reduces latency
2. High parallelism
> Getting more executors with Mesos or YARN is a challenge in itself
> Click a button to increase cluster size in Databricks Cloud
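The effect of in-memory computation can be illustrated with a small plain-Python analogy. This is only a sketch, not Spark code: `slow_load` and `CachedDataset` are hypothetical names, and the generator stands in for an expensive read of a large dataset. The first query pays the loading cost; later queries reuse the materialised data.

```python
def slow_load():
    """Hypothetical stand-in for reading and parsing a large dataset."""
    for i in range(1000):
        yield i * i


class CachedDataset:
    """Materialise the loader's output once and keep it in memory."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None
        self.loads = 0          # how many times the loader actually ran

    def get(self):
        if self._data is None:
            self._data = list(self._loader())
            self.loads += 1
        return self._data


dataset = CachedDataset(slow_load)
total = sum(dataset.get())      # first query: runs the loader
largest = max(dataset.get())    # second query: served from memory
```

After both queries `dataset.loads` is still 1, which is the property that keeps repeated interactive queries fast.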
Versatile programming interface
Data visualization is very much like programming.
> Point and click doesn’t really cut it
> Requires an API (grammar): ggplot, matplotlib, bokeh, etc.
Spark has SQL, Scala, Python, Java and (experimental) R API
Libraries for distributed statistics and machine learning
Solutions
1. Interactivity
> In-memory computation
> High parallelism
2. Visual medium
> In-browser collaborative notebooks
> Summarizing, Sampling and Modeling
More data points than pixels
Can we visualize 200 GB of multidimensional data?
Short answer: no
Long answer:
> Summarize & visualize
> Sample & visualize
> Model & visualize
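A back-of-the-envelope calculation shows why the short answer is no. The figures below are assumptions for illustration, not numbers from the talk: every value is taken to occupy 8 bytes, and the display is taken to be 1920 by 1080 pixels.

```python
# Assumptions (not from the talk): 8-byte numeric values, full-HD display.
DATA_BYTES = 200 * 10**9        # 200 GB, using decimal gigabytes
BYTES_PER_VALUE = 8             # assumed size of one numeric value
SCREEN_PIXELS = 1920 * 1080     # assumed display resolution

num_values = DATA_BYTES // BYTES_PER_VALUE
values_per_pixel = num_values / SCREEN_PIXELS

print(f"values in the dataset: {num_values:,}")
print(f"pixels on the screen:  {SCREEN_PIXELS:,}")
print(f"values per pixel:      {values_per_pixel:,.0f}")
```

Under these assumptions there are roughly twelve thousand values for every pixel, so the data has to be reduced in some way before it is plotted.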
Summarize and visualize
Extensively used by BI tools
> Aggregation
> Pivoting
Most data scientists’ nightly jobs summarize data
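Summarizing works because the output size depends on the number of groups or bins, not on the number of rows. The sketch below shows two common summaries in plain Python: a per-group mean (aggregation) and a fixed-width histogram. It is an illustration only; the function names and the sample records are hypothetical, and on a cluster the same idea would be expressed as a distributed group-by.

```python
from collections import defaultdict


def aggregate_mean(records, key, value):
    """Group records by `key` and return the mean of `value` per group."""
    totals = defaultdict(lambda: [0.0, 0])   # group -> [running sum, count]
    for rec in records:
        acc = totals[rec[key]]
        acc[0] += rec[value]
        acc[1] += 1
    return {group: s / n for group, (s, n) in totals.items()}


def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width buckets covering [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bucket.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical records, for illustration only.
records = [
    {"country": "US", "latency_ms": 120.0},
    {"country": "US", "latency_ms": 80.0},
    {"country": "DE", "latency_ms": 200.0},
]
means = aggregate_mean(records, "country", "latency_ms")
counts = histogram([r["latency_ms"] for r in records], lo=0.0, hi=200.0, bins=4)
```

Both functions make a single pass over the data and return a result small enough to hand to any plotting library as a bar chart.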
Sample and visualize
Sometimes we need to visualize (feel) individual data points
Sampling is extensively used in statistics
Spark offers native support for:
> Approximate and exact sampling
> Approximate and exact stratified sampling
Approximate sampling is faster and is good enough in most cases
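The difference between approximate and exact sampling can be sketched in plain Python. This is an illustration of the general techniques, not Spark's implementation, and all function names are hypothetical. Approximate sampling keeps each item independently with a given probability, so the sample size varies from run to run. Exact sampling (here, reservoir sampling) returns a fixed number of items at the cost of extra bookkeeping. The stratified variant applies a separate fraction to each key.

```python
import random


def approximate_sample(items, fraction, seed=None):
    """Bernoulli sampling: keep each item with probability `fraction`.
    The resulting size is only approximately fraction * len(items)."""
    rng = random.Random(seed)
    return [x for x in items if rng.random() < fraction]


def exact_sample(items, k, seed=None):
    """Reservoir sampling (Algorithm R): return exactly min(k, n) items,
    with every item equally likely to be selected."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(items):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def stratified_sample(pairs, fractions, seed=None):
    """Approximate stratified sampling over (key, value) pairs:
    each key (stratum) is sampled with its own fraction."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]


data = list(range(10_000))
approx = approximate_sample(data, fraction=0.01, seed=1)   # about 100 items
exact = exact_sample(data, k=100, seed=1)                  # exactly 100 items
```

For a scatter plot, a sample of about a hundred points usually serves as well as a sample of exactly a hundred, which is why the cheaper approximate method is often sufficient.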
Model and visualize
MLlib supports a large (and growing) set of distributed algorithms
> Clustering: k-means
> Classification and regression: linear models (LM), decision trees (DT), naive Bayes (NB)
> Dimensionality reduction: SVD, PCA
> Collaborative filtering: ALS
> Correlation, hypothesis testing
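Modeling reduces a dataset to a handful of parameters that can be plotted instead of the raw points. As a minimal sketch of the idea, the plain-Python k-means below (Lloyd's algorithm) compresses 400 two-dimensional points into two centroids. It is not MLlib's implementation; the function name, parameters, and synthetic data are assumptions made for illustration.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm for 2-D points; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialise from the data
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k), key=lambda c: math.dist(p, centroids[c])
            )
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                        # keep old centroid if empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


# Two hypothetical, well-separated blobs of points.
blob_rng = random.Random(1)
points = [(blob_rng.gauss(0, 0.5), blob_rng.gauss(0, 0.5)) for _ in range(200)]
points += [(blob_rng.gauss(10, 0.5), blob_rng.gauss(10, 0.5)) for _ in range(200)]

centroids, assignments = kmeans(points, k=2)
```

Plotting the two centroids, or coloring a sample of the points by their assignment, conveys the structure of the data without drawing every point.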
Demo
Summary
With new big data tools we can resume interactive visual exploration of data
Using Spark we can manipulate large data in seconds
> Cache data in memory
> Increase parallelism
To visualize millions of data points we can
> Summarize
> Sample
> Model
Databricks Cloud databricks.com
Apache Spark spark.apache.org
Matplotlib matplotlib.org
Python ggplot ggplot.yhathq.com
D3 d3js.org