Big Data Visualization using
Apache Spark and Zeppelin
Prajod Vettiyattil, Software Architect, Wipro
Agenda
• Big Data and Ecosystem tools
• Apache Spark
• Apache Zeppelin
• Data Visualization
• Combining Spark and Zeppelin
BIG DATA AND ECOSYSTEM TOOLS
Big Data
• Data size beyond a single system's capability
– Terabyte, Petabyte, Exabyte
• Storage
– Commodity servers, RAID, SAN
• Processing
– Must complete in a reasonable response time
– This is where the challenge lies
Traditional processing tools
• Move what?
– the data to the code, or
– the code to the data
[Diagram: two servers, one holding Data and one holding Code, with arrows labeled "move data to code" and "move code to data"]
Traditional processing tools
• Traditional tools
– RDBMS, DWH, BI
– High cost
– Difficult to scale beyond certain data size
• price/performance skew
• data variety not supported
Map-Reduce and NoSQL
• Hadoop toolset
– Free and open source
– Commodity hardware
– Scales to exabytes (10^18 bytes), maybe even more
• Not only SQL
– Storage and query processing only
– Complements Hadoop toolset
– Volume, velocity and variety
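To make the map-reduce model concrete, here is a minimal single-process sketch of the classic word count in Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big tools for big data"]))
```

Because each map call looks at one line only and each reduce call looks at one key only, both phases can be spread across commodity servers without coordination.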
All is well ?
• Hadoop was designed for batch processing
• Disk based processing: slow
• Many tools to enhance Hadoop’s capabilities
– Distributed cache, HaLoop, Hive, HBase
• Not suited to interactive and iterative workloads
TOWARDS SINGULARITY
What is singularity?
[Chart: Decade vs AI capacity. X-axis: Decade (1 to 7); y-axis: AI capacity (0 to 8000); the point of singularity is marked.]
Technological singularity
• When AI capability exceeds Human capacity
• AI or non-AI singularity
• 2045: the predicted year (see http://en.wikipedia.org/wiki/Ray_Kurzweil)
APACHE SPARK
History of Spark
• 2015 March: Spark 1.3.0 released
• 2014: Spark 1.0.0 released. 100 TB sort achieved in 23 mins
• 2013: Spark donated to Apache Software Foundation
• 2010: Spark is made open source. Available on GitHub
• 2009: Spark created by Matei Zaharia, a PhD student at UC Berkeley
Contributors in Spark
• Yahoo
• Intel
• UC Berkeley
• …
• 50+ organizations
Hadoop and Spark
• Spark complements the Hadoop ecosystem
• Replaces: Hadoop MR
• Spark integrates with
– HDFS
– Hive
– HBase
– YARN
Other big data tools
• Spark also integrates with
– Kafka
– ZeroMQ
– Cassandra
– Mesos
Programming Spark
• Java
• Scala
• Python
• R
Spark toolset
[Diagram: Apache Spark and its toolset: Spark SQL, Spark Streaming, MLlib, GraphX, SparkR, BlinkDB, Spark Cassandra, Tachyon]
What is Spark for?
• Batch
• Interactive
• Streaming
The main difference: speed
• RAM access vs Disk access
– A RAM access is roughly 100,000 times faster than a disk seek
Source: https://gist.github.com/hellerbarde/2843375
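The factor of 100,000 can be checked with commonly quoted order-of-magnitude latencies: about 100 ns for a main memory reference and about 10 ms for a disk seek. Both figures are assumptions used for illustration; actual hardware varies widely.

```python
# Commonly quoted order-of-magnitude latencies (assumed values, not measurements).
MAIN_MEMORY_REFERENCE_NS = 100        # about 100 nanoseconds
DISK_SEEK_NS = 10_000_000             # about 10 milliseconds


def speedup(slow_ns, fast_ns):
    """Return how many times faster the fast operation is than the slow one."""
    return slow_ns / fast_ns


ratio = speedup(DISK_SEEK_NS, MAIN_MEMORY_REFERENCE_NS)
print(f"A memory reference is about {ratio:,.0f} times faster than a disk seek")
```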
Lambda Architecture pattern
• Spark can be used to implement the Lambda architecture
– Batch layer
– Speed layer
– Serving layer
[Diagram: Data Input, Batch Layer, Speed Layer, Serving Layer, Data consumers]
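The interplay of the three layers can be sketched in a few lines of plain Python. This is a toy, in-memory illustration under stated assumptions: the class names, the event-counting workload and the merge rule are all hypothetical, and no Spark API is used.

```python
from collections import Counter


class BatchLayer:
    """Recomputes a complete view from the full master dataset."""

    def __init__(self):
        self.master_dataset = []

    def append(self, events):
        self.master_dataset.extend(events)

    def compute_view(self):
        return Counter(self.master_dataset)


class SpeedLayer:
    """Keeps an incremental view of events not yet covered by a batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event] += 1

    def reset(self):
        self.realtime_view.clear()


class ServingLayer:
    """Answers queries by merging the batch view with the real-time view."""

    def __init__(self):
        self.batch_view = Counter()

    def load(self, batch_view):
        self.batch_view = batch_view

    def query(self, key, speed_layer):
        return self.batch_view[key] + speed_layer.realtime_view[key]


batch, speed, serving = BatchLayer(), SpeedLayer(), ServingLayer()

# Historical events are covered by a (slow) batch run.
batch.append(["click", "view", "click"])
serving.load(batch.compute_view())

# New input goes to both layers: stored for the next batch run, visible at once.
for event in ["click", "view"]:
    batch.append([event])
    speed.ingest(event)

clicks_before = serving.query("click", speed)   # batch view plus speed view

# The next batch run absorbs the new events, so the speed view can be dropped.
serving.load(batch.compute_view())
speed.reset()
clicks_after = serving.query("click", speed)

print(clicks_before, clicks_after)
```

The query result is the same before and after the second batch run; only the layer that supplies the most recent events changes.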
Deployment Architecture
[Diagram: a Master Node (Spark Driver, Spark's Cluster Manager, HDFS Name Node) and Worker Nodes (Executors running Tasks with a Cache, HDFS Data Node)]
APACHE ZEPPELIN
Interactive data analytics
• For Spark and Flink
• Web front end
• At the back, it connects to
– SQL systems (e.g. Hive)
– Spark
– Flink
Deployment Architecture
[Diagram: Web browsers 1 to 3 connect to the Zeppelin daemon (Web Server, Local Interpreters, optional Remote Interpreters), which connects to Spark / Flink / Hive]
Notebook
• Is where you do your data analysis
• Web UI REPL with pluggable interpreters
• Interpreters
– Scala, Python, Angular, SparkSQL, Markdown and Shell
User Interface features
• Markdown
• Dynamic HTML generation
• Dynamic chart generation
• Screen sharing via websockets
SQL Interpreter
• SQL shell
– Query Spark data using SQL queries
– Return normal text, HTML or chart type results
Scala interpreter for Spark
• Similar to the Spark shell
• Upload your data into Spark
• Query the data sets (RDDs) in your Spark server
• Execute map-reduce tasks
• Actions on RDD
• Transformations on RDD
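The distinction between transformations and actions is that transformations only record what should be done, while an action triggers the actual computation. The sketch below mimics that behaviour in plain Python. ToyRDD is a hypothetical, single-machine stand-in written for this illustration; it is not Spark's API and does no distribution or caching.

```python
from functools import reduce as fold


class ToyRDD:
    """A toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source    # the original data
        self._steps = steps      # recorded transformations, not yet executed

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._steps + (("filter", predicate),))

    def _evaluate(self):
        """Apply every recorded step to the source, lazily, in order."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a plain value.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())

    def reduce(self, fn):
        return fold(fn, self._evaluate())


calls = []


def square(x):
    calls.append(x)              # record each invocation to expose laziness
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)     # no transformation has run yet

result = even_squares.collect()                   # action: triggers evaluation
total = even_squares.reduce(lambda a, b: a + b)   # action: evaluates again
print(calls_before_action, result, total)
```

In this toy each action re-runs the whole chain from the source, which is why keeping intermediate results in memory matters for iterative work.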
DATA VISUALIZATION
Visualization tools
Source: http://selection.datavisualization.ch/
D3 Visualizations
Source: https://github.com/mbostock/d3/wiki/Gallery
The need for visualization
[Diagram: Big Data → Do something to data → User gets comprehensible data]
Tools for Data Presentation Architecture
A data analysis tool/toolset would support:
1. Identify
2. Locate
3. Manipulate
4. Format
5. Present
COMBINING SPARK AND ZEPPELIN
Spark and Zeppelin
[Diagram: Web browsers 1 to 3 connect to the Zeppelin daemon (Web Server, Local Interpreters, Remote Interpreters), which connects to a Spark cluster of one Spark Master Node and two Spark Worker Nodes]
Zeppelin views: Table from SQL
Zeppelin views: Table from SQL
%sql
select age, count(1)
from bank
where marital="${marital=single,single|divorced|married}"
group by age
order by age
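The ${marital=single,single|divorced|married} placeholder in the query above declares a form field named marital with the default value single and three selectable options. The Python sketch below shows one way such a template could be parsed and filled in. It illustrates the idea only and is not Zeppelin's implementation; the names FORM_PATTERN, parse_forms and render are hypothetical.

```python
import re

# Matches ${name=default,opt1|opt2|...}; the default and option list are optional.
FORM_PATTERN = re.compile(r"\$\{(\w+)(?:=([^,}]*))?(?:,([^}]*))?\}")


def parse_forms(template):
    """Return each form field's default value and its list of options."""
    forms = {}
    for name, default, options in FORM_PATTERN.findall(template):
        forms[name] = {
            "default": default,
            "options": options.split("|") if options else [],
        }
    return forms


def render(template, choices=None):
    """Replace every placeholder with the chosen value, or with its default."""
    choices = choices or {}

    def substitute(match):
        name, default = match.group(1), match.group(2) or ""
        return choices.get(name, default)

    return FORM_PATTERN.sub(substitute, template)


query = (
    "select age, count(1) from bank where "
    'marital="${marital=single,single|divorced|married}" '
    "group by age order by age"
)

print(parse_forms(query))
print(render(query))                          # uses the default value
print(render(query, {"marital": "married"}))  # uses the selected option
```

This sketch does not check that a chosen value is one of the declared options, and it does no SQL escaping; a real tool would need both.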
Zeppelin views: Pie chart from SQL
Zeppelin views: Bar chart from SQL
Zeppelin views: Angular
Share variables: MVVM
• Between Scala/Python/Spark and Angular
• Observe Scala variables from Angular
[Diagram: Zeppelin shares the variable x between Scala-Spark (x = “foo”) and Angular (x = “bar”)]
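The idea of observing a shared variable can be sketched with a small observer pattern in Python. SharedVariable is a hypothetical class written for this illustration and does not correspond to any Zeppelin API; it only shows how a change made on one side can be pushed to watchers on the other.

```python
class SharedVariable:
    """A hypothetical observable value; watchers are notified on every change."""

    def __init__(self, value):
        self._value = value
        self._watchers = []

    def watch(self, callback):
        """Register a callback that receives (old_value, new_value)."""
        self._watchers.append(callback)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        old_value, self._value = self._value, new_value
        for callback in self._watchers:
            callback(old_value, new_value)


# One side sets x = "foo"; the other side watches x and later sets it to "bar".
x = SharedVariable("foo")
seen = []
x.watch(lambda old, new: seen.append((old, new)))

x.value = "bar"
print(x.value, seen)
```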
Screen sharing using Zeppelin
• Share your graphical reports
– Live sharing
– Get the share URL from Zeppelin and share it with others
– Uses websockets
• Embed live reports in web pages
FUTURE
Spark and Zeppelin
• Spark
– Machine Learning using Spark
– GraphX and MLlib
• Zeppelin
– Additional interpreters
– Better graphics
– Report persistence
– More report templates
– Better Angular integration
SUMMARY
Summary
• Spark and tools
• The need for visualization
• The role of Zeppelin
• Zeppelin – Spark integration