Introduction to Spark 2.0
Next Step in Spark Journey
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Technical Lead at Tellius
● Consultant and Trainer at datamantra.io
● Consults on Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Major focus in Spark 2.0
● Dataset abstraction
● Spark Session
● Dataset wordcount
● RDD to Dataset
● Dataset vs DataFrame APIs
● Time window
● Custom optimizations
Major focus of Spark 2.0
● Standardizing on the Dataset abstraction
● Moving all Spark libraries to play well with the Dataset abstraction
● Making all APIs available in all languages
● Planting the seeds for future 2.x directions like structured streaming
● Performance improvements on the order of 10x
Dataset Abstraction
● A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
● An RDD represents an immutable, partitioned collection of elements that can be operated on in parallel.
● Dataset has its own DSL and runs on custom memory management (see the sketch below).
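A minimal sketch of this typed/untyped duality, assuming a SparkSession named sparkSession (created on the next slide); the Person class and input file are illustrative, not from the talk:

import org.apache.spark.sql.{DataFrame, Dataset}
import sparkSession.implicits._

case class Person(name: String, age: Long)

// typed view: every element is a Person, checked at compile time
val people: Dataset[Person] = sparkSession.read.json("people.json").as[Person]

// untyped view: the same data seen as a Dataset of Row
val rows: DataFrame = people.toDF()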
Spark Session API
● New entry point in Spark for creating datasets
● Replaces SQLContext, HiveContext and StreamingContext
● Most programs only need to create a SparkSession; no more SparkContext
● The move from SparkContext to SparkSession signifies the move away from RDD
● Ex: SparkSessionExample.scala (a minimal sketch follows)
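A minimal sketch of creating the new entry point, roughly what SparkSessionExample.scala in the linked repository is expected to do; the master and app name here are placeholders:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .master("local")
  .appName("spark session example")
  .getOrCreate()

// the old SparkContext is still reachable from the session when needed
val sc = sparkSession.sparkContext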
Mandatory Dataset WordCount
● Dataset provides a DSL very similar to RDD's
● It combines the best of RDD and DataFrame into a single API
● DataFrame is now an alias for Dataset[Row]
● One of the big changes from the RDD API is the move away from a key/value pair based API to a more SQL-like API
● Dataset signifies a departure from the well-known Map/Reduce style API to a more optimized data handling DSL
● Ex: DataSetWordCount.scala (sketched below)
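A sketch of the Dataset word count, assuming the sparkSession above and a hypothetical input file; note the SQL-like groupByKey/count in place of the RDD-era reduceByKey:

import sparkSession.implicits._

val lines = sparkSession.read.text("input.txt").as[String]
val words = lines.flatMap(_.split(" "))
val counts = words.groupByKey(word => word).count()
counts.show()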
RDD to Dataset
● DataFrame lacked the functional programming aspects of RDD, which made moving code from RDD to DataFrame challenging
● With Dataset, most RDD expressions can be expressed much more elegantly
● Though both expose a DSL, they differ greatly in implementation
● Most Dataset operations run through code generation and custom serialization
● Ex: RDDToDataset.scala (sketched below)
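A sketch of moving the same word count from an RDD to a Dataset, with the file path again a placeholder; createDataset needs only an implicit Encoder, which sparkSession.implicits._ provides:

import sparkSession.implicits._

val rdd = sparkSession.sparkContext.textFile("input.txt")
val ds = sparkSession.createDataset(rdd) // RDD[String] => Dataset[String]
ds.flatMap(_.split(" "))
  .groupByKey(identity)
  .count()
  .show()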
Dataframe vs Dataset
● Most of the logical plans and optimizations of DataFrame have now moved into Dataset
● DataFrame is now just an untyped Dataset, i.e. Dataset[Row]
● One difference of Dataset from DataFrame is that it adds an additional step of serialization and schema checking
● This serialization is different from Spark's Java and Kryo serialization; it's a macro based serialization framework
● Ex: DatasetVsDataframe.scala (sketched below)
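A sketch of the difference, with the Sale class and input file as illustrative assumptions: the DataFrame filter is only checked at runtime, while the typed Dataset filter is checked by the compiler and goes through the encoder-based serialization step:

import sparkSession.implicits._

case class Sale(itemId: String, amount: Double)

val df = sparkSession.read.json("sales.json") // DataFrame = Dataset[Row]
df.filter($"amount" > 100.0)                  // column name resolved at runtime

val ds = df.as[Sale]                          // schema checked, encoder attached
ds.filter(sale => sale.amount > 100.0)        // field checked at compile time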
Catalogue API
● In the theme of support for structured data, the Catalog API brings support for managing external metastores from Spark
● Highly useful for interactive programs like Zeppelin and other notebooks
● Integrates well with the Hive metastore
● Primarily used for DDL operations
● The API is built on the Dataset abstraction
● Ex: CatalogueExample.scala (sketched below)
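A sketch of the Catalog API on the sparkSession from earlier; each listing call returns a Dataset, so the usual Dataset operations apply (the temp view name is a placeholder):

val catalog = sparkSession.catalog

catalog.listDatabases().show() // Dataset[Database]
catalog.listTables().show()    // Dataset[Table]
catalog.dropTempView("sales")  // DDL-style operation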
Time analysis
● One of the important parts of any data manipulation is handling time effectively
● In earlier versions of Spark, only Spark Streaming supported the notion of time
● As Spark 2.0 is trying to merge Spark Streaming with the Dataset abstraction, time is now part of Spark SQL as well
● This is more powerful, as we can use the same time abstraction in both batch and streaming operations
● Ex: TimeWindowExample.scala (sketched below)
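A sketch of a time window aggregation over batch data, assuming a hypothetical sales dataset with "time" and "amount" columns:

import org.apache.spark.sql.functions.window
import sparkSession.implicits._

val sales = sparkSession.read.json("sales.json")
val monthlySales = sales
  .groupBy(window($"time", "30 days")) // fixed-size window on the time column
  .sum("amount")
monthlySales.show()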
Plugging custom optimisations
● The Dataset abstraction runs on the same Catalyst optimiser as DataFrame
● As Dataset becomes the platform-level abstraction, the ability to control this optimiser becomes very important
● In earlier versions, one needed to change the Spark source code to inject custom optimizations
● From Spark 2.0, the framework provides a public API to inject custom optimizations without changing the source code
● Ex: CustomOptimizationExample.scala (sketched below)
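A sketch of injecting a rule through this public API; the rule below is a deliberately empty placeholder that returns the plan unchanged, where a real rule would pattern match on the LogicalPlan and rewrite it:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object MarkerRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    println("MarkerRule saw a plan") // observe only, rewrite nothing
    plan
  }
}

// register without touching Spark's source code
sparkSession.experimental.extraOptimizations = Seq(MarkerRule)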
References
● http://blog.madhukaraphatak.com/categories/spark-two/
● https://www.brighttalk.com/webcast/12891/202021
● https://spark-summit.org/2016/schedule/