Spark 2
Alexey Zinovyev, Java/BigData Trainer in EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With EPAM since 2015
Big Data Training
Secret Word from EPAM
itsubbotnik
Contacts
E-mail : [email protected]
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Joker’16: Spark 2 from Zinoviev Alexey
Sprk Dvlprs! Let’s start!
< SPARK 2.0
Modern Java in 2016
Big Data in 2014
Big Data in 2017
Machine Learning EVERYWHERE
Machine Learning vs Traditional Programming
Something wrong with HADOOP
Hadoop is not SEXY
Whaaaat?
Map Reduce Job Writing
MR code
Hadoop Developers Right Now
Iterative Calculations
10x – 100x
MapReduce vs Spark
SPARK 2.0 DISCUSSION
Spark Family
Case #0 : How to avoid DStreams with RDD-like API?
Continuous Applications
Continuous Applications cases
• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning
Write Batches
Catch Streaming
The main concept of Structured Streaming
You can express your streaming computation the
same way you would express a batch computation
on static data.
Batch
// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")
// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")
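Real Time
// Read JSON continuously from S3; streaming file sources need an explicit schema
// (logSchema is a placeholder StructType, not defined on the slide)
val logsDF = spark.readStream.schema(logSchema).json("s3://logs")
// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://checkpoints") // placeholder path; the file sink requires one
  .start("s3://out")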
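The batch and real-time slides share the same transformation step. A minimal sketch of that idea, assuming a local SparkSession and made-up in-memory log rows in place of the S3 data (the object name, the function name and the sample values are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BatchVsStreaming {
  // The shared step: the same select works on a static or a streaming DataFrame.
  def selectLogFields(logs: DataFrame): DataFrame =
    logs.select("user", "url", "date")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, no cluster
      .appName("batch-vs-streaming")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the JSON logs.
    val logs = Seq(
      ("alice", "/home", "2016-10-14", "Mozilla"),
      ("bob", "/cart", "2016-10-15", "curl")
    ).toDF("user", "url", "date", "agent")

    val projected = selectLogFields(logs)
    projected.show()
    println(projected.isStreaming) // tells a batch DataFrame from a streaming one

    spark.stop()
  }
}
```

Only the reader (`read` vs `readStream`) and the writer (`write` vs `writeStream`) differ between the two modes; the function in the middle stays untouched.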
Unbounded Table
WordCount from Socket
import spark.implicits._ // needed for .as[String]
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
Don’t forget to start Streaming
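A possible way to finish the word count, sketched under assumptions: the console sink, the `complete` output mode and the helper names are choices made for this example, not taken from the slides. The counting logic sits in its own function so it can also be tried on a static Dataset.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object SocketWordCount {
  // Same logic as on the slide; works for static and streaming Datasets.
  def countWords(spark: SparkSession, lines: Dataset[String]): DataFrame = {
    import spark.implicits._
    lines.flatMap(_.split(" ")).groupBy("value").count()
  }

  // Nothing is processed until start() is called on the stream writer.
  def startToConsole(wordCounts: DataFrame): StreamingQuery =
    wordCounts.writeStream
      .outputMode("complete") // print the full counts table on every trigger
      .format("console")
      .start()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("socket-wc").getOrCreate()
    import spark.implicits._
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
    // Blocks until the query is stopped; feed it with e.g. `nc -lk 9999`.
    startToConsole(countWords(spark, lines)).awaitTermination()
  }
}
```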
WordCount with Structured Streaming
Structured Streaming provides …
• fast & scalable
• fault-tolerant
• end-to-end with exactly-once semantics
• stream processing
• ability to use DataFrame/DataSet API for streaming
Structured Streaming provides (in dreams) …
• fast & scalable
• fault-tolerant
• end-to-end with exactly-once semantics
• stream processing
• ability to use DataFrame/DataSet API for streaming
Let’s UNION streaming and static DataSets
org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;
Go to UnsupportedOperationChecker.scala and check your operation
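One way to avoid meeting that exception late is to check `isStreaming` before combining two Datasets. This is only a sketch: `unionIfCompatible` is a hypothetical helper, not part of Spark, and it assumes the restriction from the slide applies to the Spark version in use.

```scala
import org.apache.spark.sql.Dataset

object UnionGuard {
  // Hypothetical helper: fail early with a readable message when one side is
  // streaming and the other is static, otherwise delegate to Dataset.union.
  def unionIfCompatible[T](left: Dataset[T], right: Dataset[T]): Dataset[T] = {
    require(
      left.isStreaming == right.isStreaming,
      "cannot union a streaming Dataset with a static one"
    )
    left.union(right)
  }
}
```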
Case #1 : We should think about optimization in RDD terms
Single Thread collection
No perf issues, right?
The main concept
more partitions = more parallelism
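A small sketch of what the partition count means in practice, assuming a local SparkSession; the object name and the 1 to 100 range are made up for the example. Each partition becomes one task, so the number of partitions caps how many tasks can run at once.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSizes {
  // Distribute 1..100 over `numSlices` partitions and report each partition's size.
  def sizes(spark: SparkSession, numSlices: Int): Seq[Int] =
    spark.sparkContext
      .parallelize(1 to 100, numSlices)
      .glom() // one array per partition
      .map(_.length)
      .collect()
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("partitions").getOrCreate()
    println(sizes(spark, 1)) // one partition: a single task does all the work
    println(sizes(spark, 4)) // four partitions: up to four tasks can run at once
    spark.stop()
  }
}
```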
Do it in parallel
I’d like NARROW
Map, filter, filter
GroupByKey, join
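The difference between the two groups of operations shows up in an RDD's dependencies: map and filter keep narrow dependencies, while groupByKey introduces a shuffle. A sketch for inspecting this, with hypothetical helper names and sample pairs:

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object Dependencies {
  // Narrow: every parent partition feeds at most one child partition.
  def isNarrow(rdd: RDD[_]): Boolean =
    rdd.dependencies.forall(_.isInstanceOf[NarrowDependency[_]])

  // Wide: the RDD is produced by a shuffle over its parent.
  def isWide(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("deps").getOrCreate()
    val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    val filtered = pairs.map { case (k, v) => (k, v + 1) }.filter(_._2 > 2)
    val grouped = pairs.groupByKey()

    println(isNarrow(filtered)) // map and filter stay within a partition
    println(isWide(grouped))    // groupByKey moves data between partitions
    println(grouped.toDebugString)
    spark.stop()
  }
}
```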
Case #2 : DataFrames suggest mixing SQL and Scala functions
History of Spark APIs
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame SQL-style
df.filter(df.col("age").gt(21)) // Expression style
dataset.filter(_.age > 21) // Dataset API
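The four styles can be run side by side. A sketch assuming a hypothetical `Person` case class and three made-up rows, with the same predicate (`age > 21`) in every style:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int) // hypothetical sample type

object FilterStyles {
  // Returns the number of matching rows for each of the four API styles.
  def counts(spark: SparkSession): Seq[Long] = {
    import spark.implicits._
    val people = Seq(Person("Ann", 30), Person("Bob", 18), Person("Eve", 22))

    val rdd = spark.sparkContext.parallelize(people)
    val dataset = people.toDS()
    val df = dataset.toDF()

    Seq(
      rdd.filter(_.age > 21).count(),          // RDD: plain Scala lambda
      df.filter("age > 21").count(),           // DataFrame: SQL string
      df.filter(df.col("age").gt(21)).count(), // DataFrame: Column expression
      dataset.filter(_.age > 21).count()       // Dataset: typed lambda
    )
  }
}
```

The SQL-string and expression variants are visible to the Catalyst optimizer as expressions, while the lambdas are opaque functions to it.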
Case #2 : DataFrame refers to data attributes by name
DataSet = RDD’s types + DataFrame’s Catalyst
• RDD API
• compile-time type-safety
• off-heap storage mechanism
• performance benefits of the Catalyst query optimizer
• Tungsten
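What compile-time type-safety buys can be sketched as follows; `Account` and the misspelled column `balanse` are hypothetical. A wrong column name in a DataFrame only fails when Spark analyzes the query, while the equivalent typo in a Dataset lambda is rejected by the Scala compiler.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

case class Account(owner: String, balance: Long) // hypothetical sample type

object NameVsType {
  // True when selecting a misspelled column fails with an AnalysisException.
  def misspelledColumnFails(spark: SparkSession): Boolean = {
    import spark.implicits._
    val ds = Seq(Account("ann", 10L)).toDS()
    val df = ds.toDF()

    // ds.map(_.balanse) // would not compile: Account has no member `balanse`
    Try(df.select("balanse")).failed.toOption
      .exists(_.isInstanceOf[AnalysisException])
  }
}
```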
Structured APIs in SPARK
Unified API in Spark 2.0
DataFrame = Dataset[Row]
A DataFrame is now an untyped Dataset: its rows carry a schema, but no compile-time type
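Since `DataFrame` is a type alias for `Dataset[Row]` in the Scala API, the two can be used interchangeably. A sketch with a hypothetical `City` type and sample row:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class City(name: String, population: Long) // hypothetical sample type

object UnifiedApi {
  // Compiles without any conversion because DataFrame = Dataset[Row].
  def asRows(df: DataFrame): Dataset[Row] = df

  def demo(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val df = Seq(City("Springfield", 1000L)).toDF()

    // Untyped access: fields are looked up by name at run time.
    val untyped = asRows(df).map(row => row.getAs[String]("name")).collect().toSeq
    // Typed access: .as[City] gives a Dataset[City] with checked fields.
    val typed = df.as[City].map(_.name).collect().toSeq
    (untyped, typed)
  }
}
```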
case class User(email: String, footSize: Long, name: String) // define case class
import spark.implicits._ // encoders for .as[User] and map
// Read JSON: DataFrame -> DataSet with Users
val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]
userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect() // filter by field
userDS.rdd // IF YOU REALLY WANT
Case #3 : Spark has many contexts
Spark Session
• New entry point in Spark for creating Datasets
• Replaces SQLContext and HiveContext (a StreamingContext is still created separately for DStreams)
• The move from SparkContext to SparkSession signifies a move away from RDDs
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()
val df = sparkSession.read
  .option("header", "true")
  .csv("src/main/resources/names.csv")
df.show()
No, I want to create my lovely RDDs
Where’s the parallelize() method?
RDD?
userDS.rdd // IF YOU REALLY WANT
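For building an RDD from a local collection, `parallelize()` is still reachable through the session's underlying SparkContext. A minimal sketch with a hypothetical object name and sample numbers:

```scala
import org.apache.spark.sql.SparkSession

object LovelyRdds {
  // SparkSession exposes the SparkContext, which still owns parallelize().
  def squares(spark: SparkSession): Seq[Int] =
    spark.sparkContext
      .parallelize(Seq(1, 2, 3, 4))
      .map(x => x * x)
      .collect()
      .toSeq
}
```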
Case #4 : Spark uses Java serialization A LOT
![Page 75: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/75.jpg)
Two choices to distribute data across the cluster
• Java serialization
By default with ObjectOutputStream
• Kryo serialization
Should register classes (does not support all Serializable types)
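Switching to Kryo is a configuration matter. A minimal sketch, reusing the User case class from the RDD slide; the object name and app name are illustrative.

```scala
import org.apache.spark.SparkConf

case class User(email: String, footSize: Long, name: String)

object KryoConfSketch {
  def conf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-sketch")
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registered classes are written as a small ID instead of a full class name.
      .registerKryoClasses(Array(classOf[User]))
}
```

Pass the resulting SparkConf to SparkSession.builder().config(...) when creating the session.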
![Page 77: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/77.jpg)
The main problem: overhead of serializing
Each serialized object contains the class structure as
well as the values
Don’t forget about GC
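The overhead is easy to see without Spark at all. A minimal sketch in plain Scala: write one boxed Integer (4 bytes of actual payload) with ObjectOutputStream and look at the size of the result. The object and method names are illustrative.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializationOverhead {
  // Size in bytes of one object written with plain Java serialization.
  def javaSerializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // The stream carries a header and class descriptors next to the value itself.
    val size = javaSerializedSize(Integer.valueOf(42))
    println(s"payload: 4 bytes, Java-serialized: $size bytes")
  }
}
```

The gap between payload and stream size is the class structure the slide refers to.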
![Page 78: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/78.jpg)
Tungsten Compact Encoding
![Page 79: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/79.jpg)
Encoder’s concept
Generate bytecode to interact with off-heap
&
Give access to attributes without ser/deser
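To make the encoder idea concrete, here is a minimal sketch that derives an encoder for the User case class from the RDD slide and inspects the schema it maps to. The object name is illustrative; Encoders.product is the built-in factory for case classes.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class User(email: String, footSize: Long, name: String)

object EncoderSketch {
  // Derive an encoder for the case class via reflection.
  val userEncoder: Encoder[User] = Encoders.product[User]

  // Column names in the order the encoder lays them out.
  def fieldNames: Seq[String] = userEncoder.schema.fieldNames.toSeq
}
```

The schema is what lets Spark address a single attribute in the binary row instead of deserializing the whole object.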
![Page 80: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/80.jpg)
Encoders
![Page 81: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/81.jpg)
No custom encoders
![Page 82: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/82.jpg)
Case #5 : Not enough storage levels
![Page 83: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/83.jpg)
Caching in Spark
• Frequently used RDDs can be stored in memory
• One method, one shortcut: persist(), cache()
• SparkContext keeps track of cached RDDs
• Serialized or deserialized Java objects
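A minimal sketch of the two calls, assuming a local session: cache() is the shortcut for persist() with the RDD default level (MEMORY_ONLY), while persist() accepts any level explicitly. The object name and data are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Returns the storage levels recorded for the two RDDs.
  def run(): (StorageLevel, StorageLevel) = {
    val spark = SparkSession.builder()
      .master("local[2]")            // assumption: local mode for the demo
      .appName("caching-sketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      val cached = sc.parallelize(1 to 1000).cache()
      val persisted = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_SER)
      // Nothing is stored until the first action runs.
      cached.count()
      persisted.count()
      val levels = (cached.getStorageLevel, persisted.getStorageLevel)
      // Release the cached blocks when they are no longer needed.
      cached.unpersist()
      persisted.unpersist()
      levels
    } finally spark.stop()
  }
}
```

Note that Dataset.cache() uses a different default (MEMORY_AND_DISK) than RDD.cache().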
![Page 84: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/84.jpg)
Full list of options
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
![Page 85: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/85.jpg)
Spark Core Storage Level
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
![Page 86: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/86.jpg)
Spark Streaming Storage Level
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
![Page 87: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/87.jpg)
Developer API to make new Storage Levels
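A minimal sketch of that developer API: the StorageLevel companion object has a factory that takes the individual flags, so a level missing from the predefined list can be assembled by hand. The three-replica level below is a made-up example, not a recommendation.

```scala
import org.apache.spark.storage.StorageLevel

object CustomStorageLevel {
  // Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
  // Memory + disk, serialized, replicated on three nodes.
  val memoryAndDiskSer3: StorageLevel = StorageLevel(true, true, false, false, 3)
}
```

The result can be passed to persist() like any predefined level.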
![Page 88: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/88.jpg)
What’s the most popular file format in BigData?
![Page 89: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/89.jpg)
Case #6 : External libraries to read CSV
![Page 90: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/90.jpg)
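Easy to read CSV
val data = sqlContext.read.format("csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // extra pass over the data to guess column types
  .load("/datasets/samples/users.csv")
data.cache()
data.createOrReplaceTempView("users")
display(data) // Databricks notebook helper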
![Page 91: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/91.jpg)
Case #7 : How to measure Spark performance?
![Page 92: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/92.jpg)
You should measure performance!
![Page 93: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/93.jpg)
TPC-DS
99 Queries
http://bit.ly/2dObMsH
![Page 94: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/94.jpg)
![Page 95: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/95.jpg)
How to benchmark Spark
![Page 96: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/96.jpg)
Special Tool from Databricks
Benchmark Tool for SparkSQL
https://github.com/databricks/spark-sql-perf
![Page 97: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/97.jpg)
Spark 2 vs Spark 1.6
![Page 98: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/98.jpg)
Case #8 : What’s faster: SQL or DataSet API?
![Page 99: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/99.jpg)
Job Stages in old Spark
![Page 100: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/100.jpg)
Scheduler Optimizations
![Page 101: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/101.jpg)
Catalyst Optimizer for DataFrames
![Page 102: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/102.jpg)
Unified Logical Plan
DataFrame = Dataset[Row]
![Page 103: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/103.jpg)
Bytecode
![Page 104: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/104.jpg)
DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)
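Back to Case #8: both SQL text and the Dataset API go through the same Catalyst optimizer, so explain() is the quickest way to compare them. A minimal sketch, assuming a local session and a tiny made-up users table; for a query this simple I would expect equivalent physical plans, but check the output on your own workload.

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  // Prints both physical plans and returns the row counts of both variants.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")            // assumption: local mode for the demo
      .appName("explain-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      // Hypothetical sample data standing in for users.json.
      val users = Seq(("a@x.org", 36L, "Ann"), ("b@x.org", 42L, "Bob"))
        .toDF("email", "footSize", "name")
      users.createOrReplaceTempView("users")

      val viaSql = spark.sql("SELECT name FROM users WHERE footSize > 38")
      val viaApi = users.filter($"footSize" > 38).select("name")

      viaSql.explain()
      viaApi.explain()
      (viaSql.count(), viaApi.count())
    } finally spark.stop()
  }
}
```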
![Page 105: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/105.jpg)
Case #9 : Why does explain() show so many Tungsten things?
![Page 106: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/106.jpg)
How to be effective with CPU
• Runtime code generation
• Exploiting cache locality
• Off-heap memory management
![Page 107: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/107.jpg)
Tungsten’s goal
Push performance closer to the limits of modern
hardware
![Page 108: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/108.jpg)
Maybe something UNSAFE?
![Page 109: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/109.jpg)
UnsafeRowFormat
• Bit set for tracking null values
• Small values are inlined
• For variable-length values, a relative offset into the
variable-length data section is stored
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw
bytes without requiring additional interpretation
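The layout in these bullets can be turned into simple arithmetic. A minimal sketch in plain Scala, based on the bullets plus my reading of the UnsafeRow layout (null bit set rounded up to 64-bit words, one 8-byte slot per field, variable-length data padded to 8 bytes). It is an estimate with hypothetical helper names, not Spark code.

```scala
object UnsafeRowLayout {
  // One null bit per field, rounded up to whole 8-byte words.
  def nullBitSetBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8

  // Null bit set plus one 8-byte slot per field: the slot holds either the
  // inlined value or the offset/length of the variable-length data.
  def fixedRegionBytes(numFields: Int): Int = nullBitSetBytes(numFields) + 8 * numFields

  // Variable-length payloads are padded so the row stays 8-byte aligned.
  def roundUpToWord(bytes: Int): Int = ((bytes + 7) / 8) * 8

  def rowBytes(numFields: Int, variableLengthSizes: Seq[Int]): Int =
    fixedRegionBytes(numFields) + variableLengthSizes.map(roundUpToWord).sum
}
```

For a User row with a 7-byte email and a 3-byte name, rowBytes(3, Seq(7, 3)) gives 8 + 24 + 8 + 8 = 48 bytes.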
![Page 110: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/110.jpg)
Case #10 : Can I influence Memory Management in Spark?
![Page 111: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/111.jpg)
Case #11 : Should I tune the GC generations?
![Page 112: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/112.jpg)
Cached Data
![Page 113: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/113.jpg)
During operations
![Page 114: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/114.jpg)
For your needs
![Page 115: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/115.jpg)
For Dark Lord
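These four labels presumably map to the regions of the unified memory manager: storage (cached data), execution (during operations), user memory (for your needs) and reserved memory (for the Dark Lord). A minimal sketch of the arithmetic in plain Scala, assuming the Spark 2.x defaults spark.memory.fraction = 0.6 and spark.memory.storageFraction = 0.5 and 300 MB of reserved memory; the object and field names are hypothetical.

```scala
object MemoryRegions {
  val ReservedMb = 300L // fixed amount set aside for Spark internals

  final case class Regions(storageMb: Long, executionMb: Long, userMb: Long, reservedMb: Long)

  // heapMb: executor heap size; the two fractions mirror
  // spark.memory.fraction and spark.memory.storageFraction.
  def split(heapMb: Long, memoryFraction: Double = 0.6, storageFraction: Double = 0.5): Regions = {
    val usable = heapMb - ReservedMb
    val unified = (usable * memoryFraction).toLong   // storage + execution
    val storage = (unified * storageFraction).toLong // boundary is soft: either side can borrow
    Regions(storage, unified - storage, usable - unified, ReservedMb)
  }
}
```

For a 4396 MB heap this gives roughly 1228 MB storage, 1229 MB execution, 1639 MB user memory and 300 MB reserved. The two fractions are the knobs that answer Case #10.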
![Page 116: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/116.jpg)
IN CONCLUSION
![Page 117: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/117.jpg)
We have no ability…
• join Structured Streaming with other sources
• one unified ML API
• GraphX rethinking and redesign
• Custom encoders
• Datasets everywhere
• integrate with something important
![Page 118: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/118.jpg)
Roadmap
• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset is one DSL for all operations
• GraphFrames + Structured MLlib
• Tungsten: custom encoders
• The RDD-based MLlib API (spark.mllib) is expected to be removed in Spark 3.0
![Page 119: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/119.jpg)
And we can DO IT!
[Architecture diagram; labels: Real-Time Data-Marts, Batch Data-Marts, Relations Graph, Ontology Metadata, Search Index, Events & Alarms, Real-time Dashboarding, Real-time Data Ingestion, Batch Data Ingestion, Real-Time ETL & CEP, Batch ETL & Raw Area, Scheduler, Internal, External, Social, Time-Series Data. Notes: all raw data backup is stored here; HDFS → CFS as an option; Titan & KairosDB store data in Cassandra; Push Events & Alarms (Email, SNMP etc.)]
![Page 120: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/120.jpg)
First Part
![Page 121: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/121.jpg)
Second Part
![Page 123: Spark 2 - Contentful: a developer-friendly, API-first CMSassets.contentful.com/.../what-does-spark-prepare.pdfJoker’16: Spark 2 from Zinoviev Alexey 30 The main concept of Structured](https://reader031.vdocuments.mx/reader031/viewer/2022030503/5aaf70a67f8b9a22118d3b3a/html5/thumbnails/123.jpg)
Any questions?