Introduction to Spark 2.0
Next Step in Spark Journey
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Major focus in Spark 2.0
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset vs DataFrame APIs
● Time window
● Custom optimizations
Major focus of Spark 2.0
● Standardizing on the Dataset abstraction
● Moving all Spark libraries to play well with the Dataset abstraction
● Making all APIs available in all languages
● Planting the seeds for future 2.x directions like structured streaming
● Performance improvements on the order of 10x
Dataset Abstraction
● A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
● An RDD represents an immutable, partitioned collection of elements that can be operated on in parallel.
● Dataset has its own DSL and runs on custom memory management (see the sketch below).
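A minimal sketch of this typed/untyped duality, assuming a SparkSession named sparkSession (created on the next slide); the Person class and input file are illustrative, not from the talk:

import org.apache.spark.sql.{DataFrame, Dataset}
import sparkSession.implicits._

case class Person(name: String, age: Long)

// typed view: every element is a Person, checked at compile time
val people: Dataset[Person] = sparkSession.read.json("people.json").as[Person]

// untyped view: the same data seen as a Dataset of Row
val rows: DataFrame = people.toDF()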
Spark Session API
● New entry point in Spark for creating datasets
● Replaces SQLContext, HiveContext and StreamingContext
● Most programs only need to create a SparkSession; no more SparkContext
● The move from SparkContext to SparkSession signifies the move away from RDD
● Ex: SparkSessionExample.scala (a minimal sketch follows)
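A minimal sketch of creating the new entry point, roughly what SparkSessionExample.scala in the linked repository is expected to do; the master and app name here are placeholders:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .master("local")
  .appName("spark session example")
  .getOrCreate()

// the old SparkContext is still reachable from the session when needed
val sc = sparkSession.sparkContext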
Mandatory Dataset WordCount
● Dataset provides a DSL very similar to RDD's
● It combines the best of RDD and DataFrame into a single API
● DataFrame is now an alias for Dataset[Row]
● One of the big changes from the RDD API is the move away from a key/value pair based API to a more SQL-like API
● Dataset signifies a departure from the well-known Map/Reduce style API to a more optimized data handling DSL
● Ex: DataSetWordCount.scala (sketched below)
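A sketch of the Dataset word count, assuming the sparkSession above and a hypothetical input file; note the SQL-like groupByKey/count in place of the RDD-era reduceByKey:

import sparkSession.implicits._

val lines = sparkSession.read.text("input.txt").as[String]
val words = lines.flatMap(_.split(" "))
val counts = words.groupByKey(word => word).count()
counts.show()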
RDD to Dataset
● DataFrame lacked the functional programming aspects of RDD, which made moving code from RDD to DataFrame challenging
● With Dataset, most RDD expressions can be expressed much more elegantly
● Though both expose a DSL, they differ greatly in implementation
● Most Dataset operations run through code generation and custom serialization
● Ex: RDDToDataset.scala (sketched below)
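A sketch of moving the same word count from an RDD to a Dataset, with the file path again a placeholder; createDataset needs only an implicit Encoder, which sparkSession.implicits._ provides:

import sparkSession.implicits._

val rdd = sparkSession.sparkContext.textFile("input.txt")
val ds = sparkSession.createDataset(rdd) // RDD[String] => Dataset[String]
ds.flatMap(_.split(" "))
  .groupByKey(identity)
  .count()
  .show()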
Dataframe vs Dataset
● Most of the logical plans and optimizations of DataFrame have now moved into Dataset
● DataFrame is now just an untyped Dataset, i.e. Dataset[Row]
● One difference of Dataset from DataFrame is that it adds an additional step of serialization and schema checking
● This serialization is different from Spark's Java and Kryo serialization; it's a macro based serialization framework
● Ex: DatasetVsDataframe.scala (sketched below)
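A sketch of the difference, with the Sale class and input file as illustrative assumptions: the DataFrame filter is only checked at runtime, while the typed Dataset filter is checked by the compiler and goes through the encoder-based serialization step:

import sparkSession.implicits._

case class Sale(itemId: String, amount: Double)

val df = sparkSession.read.json("sales.json") // DataFrame = Dataset[Row]
df.filter($"amount" > 100.0)                  // column name resolved at runtime

val ds = df.as[Sale]                          // schema checked, encoder attached
ds.filter(sale => sale.amount > 100.0)        // field checked at compile time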
Catalogue API
● In the theme of support for structured data, the Catalog API brings support for managing external metastores from Spark
● Highly useful for interactive programs like Zeppelin and other notebooks
● Integrates well with the Hive metastore
● Primarily used for DDL operations
● The API is built on the Dataset abstraction
● Ex: CatalogueExample.scala (sketched below)
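A sketch of the Catalog API on the sparkSession from earlier; each listing call returns a Dataset, so the usual Dataset operations apply (the temp view name is a placeholder):

val catalog = sparkSession.catalog

catalog.listDatabases().show() // Dataset[Database]
catalog.listTables().show()    // Dataset[Table]
catalog.dropTempView("sales")  // DDL-style operation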
Time analysis
● One of the important parts of any data manipulation is handling time effectively
● In earlier versions of Spark, only Spark Streaming supported the notion of time
● As Spark 2.0 is trying to merge Spark Streaming with the Dataset abstraction, time is now part of Spark SQL as well
● This is more powerful, as we can use the same time abstraction in both batch and streaming operations
● Ex: TimeWindowExample.scala (sketched below)
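A sketch of a time window aggregation over batch data, assuming a hypothetical sales dataset with "time" and "amount" columns:

import org.apache.spark.sql.functions.window
import sparkSession.implicits._

val sales = sparkSession.read.json("sales.json")
val monthlySales = sales
  .groupBy(window($"time", "30 days")) // fixed-size window on the time column
  .sum("amount")
monthlySales.show()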
Plugging custom optimisations
● The Dataset abstraction runs on the same Catalyst optimiser as DataFrame
● As Dataset becomes the platform-level abstraction, the ability to control this optimiser becomes very important
● In earlier versions, one needed to change the Spark source code to inject custom optimizations
● From Spark 2.0, the framework provides a public API to inject custom optimizations without changing the source code
● Ex: CustomOptimizationExample.scala (sketched below)
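A sketch of injecting a rule through this public API; the rule below is a deliberately empty placeholder that returns the plan unchanged, where a real rule would pattern match on the LogicalPlan and rewrite it:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MarkerRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    println("MarkerRule saw a plan") // observe only, rewrite nothing
    plan
  }
}

// register without touching Spark's source code
sparkSession.experimental.extraOptimizations = Seq(MarkerRule)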
References
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.brighttalk.com/webcast/12891/202021
● https://spark-summit.org/2016/schedule/