An Overview of Spark DataFrames with Scala
Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP
Who am I?
● Himanshu Gupta (@himanshug735)
● Spark Certified Developer
● Apache Spark Third-Party Package contributor - spark-streaming-gnip
● Sr. Software Consultant at Knoldus Software LLP
Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg
Agenda
● What is Spark?
● What is a DataFrame?
● Why do we need DataFrames?
● A brief example
● Demo
Apache Spark
● Distributed compute engine for large-scale data processing.
● Up to 100x faster than Hadoop MapReduce for in-memory workloads.
● Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4).
● Combines SQL, streaming and complex analytics.
● Runs on Hadoop, Mesos, standalone, or in the cloud.
Img src - http://spark.apache.org/
Spark DataFrames
● Distributed collection of data organized into named columns (formerly SchemaRDD).
● Domain Specific Language for common tasks:
  ➢ UDFs
  ➢ Sampling
  ➢ Project, filter, aggregate, join, …
  ➢ Metadata
● Available in Python, Scala, Java and R (R in Spark 1.4).
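The DSL bullets above can be sketched in a few lines of Scala. This is illustrative only: it assumes a running Spark 1.4-era `sqlContext` and a hypothetical `people.json` file with `name` and `age` fields, so it is not runnable standalone.

```scala
// Sketch only: assumes a Spark 1.4-era SQLContext named `sqlContext`
// and a local "people.json" file (both hypothetical here).
import org.apache.spark.sql.functions.udf

val people = sqlContext.read.json("people.json")

// Project, filter and aggregate through the DSL instead of raw RDD code.
people
  .select("name", "age")
  .filter(people("age") > 21)
  .groupBy("age")
  .count()
  .show()

// A user-defined function (UDF) usable inside the same DSL.
val toUpper = udf((s: String) => s.toUpperCase)
people.select(toUpper(people("name"))).show()
```

Because these calls build a logical plan rather than executing eagerly, Catalyst (covered below) can optimize the whole query before it runs.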
Google Trends for DataFrames
Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30
Speed of Spark DataFrames!
Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html?utm_content=12098664&utm_medium=social&utm_source=twitter
RDD API vs DataFrame API

RDD API:

val linesRDD = sparkContext.textFile("file.txt")
val wordCountRDD = linesRDD
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

val (sum, n) = wordCountRDD
  .map { case (_, count) => (count, 1) }
  .reduce { case ((count1, n1), (count2, n2)) => (count1 + count2, n1 + n2) }
val average = sum.toDouble / n

DataFrame API:

val linesDF = sparkContext.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
val average = wordCountDF.agg(avg("count"))
Catalyst Optimizer
Optimization & Execution Plan shared by DataFrames and SparkSQL
Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Analysis
Begins with a relation to be computed.
Builds an "Unresolved Logical Plan".
Applies Catalyst rules to resolve it.

DataFrame → Unresolved Logical Plan → (Catalyst Rules) → Logical Plan
Logical Optimizations
● Applies standard rule-based optimizations to the logical plan.
● Includes operations like:
  ➢ Constant folding
  ➢ Projection pruning
  ➢ Predicate pushdown
  ➢ Boolean expression simplification
Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits in a Long */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = {
    plan transformAllExpressions {
      case Sum(e @ DecimalType.Expression(prec, scale))
          if prec + 10 <= MAX_LONG_DIGITS =>
        MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
    }
  }
}
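To see how such a rule works in isolation, here is a self-contained toy in plain Scala (not Spark's actual classes): a tiny expression AST and a constant-folding rule applied bottom-up, in the same pattern-matching style Catalyst rules use.

```scala
// Toy AST, purely illustrative — not Spark's Expression hierarchy.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Attr(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Rewrite bottom-up: collapse Add(Lit, Lit) into a single literal.
def constantFold(e: Expr): Expr = e match {
  case Add(l, r) =>
    (constantFold(l), constantFold(r)) match {
      case (Lit(a), Lit(b)) => Lit(a + b) // fold the constant sub-tree
      case (fl, fr)         => Add(fl, fr) // keep non-constant parts as-is
    }
  case other => other
}

// Add(x, Add(1, 2)) is rewritten to Add(x, Lit(3)) at planning time,
// so the addition of 1 + 2 never happens per-row at execution time.
val folded = constantFold(Add(Attr("x"), Add(Lit(1), Lit(2))))
println(folded)
```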
Physical Planning
● Generates one or more physical plans from the optimized logical plan.
● Selects one plan using a Cost Model.

Optimized Logical Plan → Physical Plans → (Cost Model) → Selected Physical Plan
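The "pick the cheapest candidate" step can be sketched in a few lines of plain Scala. The plan names and costs below are invented for illustration; Spark's real cost model is internal to Catalyst and far richer.

```scala
// Illustrative only — not Spark's API. A candidate physical plan with an
// estimated cost, and a selector that keeps the cheapest one.
case class PhysicalPlan(name: String, estimatedCost: Double)

def selectPlan(candidates: Seq[PhysicalPlan]): PhysicalPlan =
  candidates.minBy(_.estimatedCost)

// Hypothetical candidates for the same logical join.
val chosen = selectPlan(Seq(
  PhysicalPlan("broadcast-hash-join", 10.0),
  PhysicalPlan("sort-merge-join", 25.0),
  PhysicalPlan("cartesian-product", 900.0)
))
println(chosen.name)
```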
Code Generation
● Generates Java bytecode for fast execution.
● Uses Scala quasiquotes.
● Quasiquotes allow programmatic construction of ASTs.
def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
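Outside Spark, quasiquotes are just a scala-reflect feature, so they can be tried on their own. A minimal sketch (assumes scala-reflect on the classpath; the expression is arbitrary):

```scala
// Quasiquotes build and deconstruct Scala ASTs from interpolated strings,
// instead of assembling tree nodes by hand.
import scala.reflect.runtime.universe._

val x = 42
val tree: Tree = q"$x + 1" // an AST for the expression 42 + 1

// The same syntax works as a pattern, to take a tree apart.
val q"$lhs + $rhs" = tree

println(showRaw(tree)) // prints the raw tree structure
```

This is why Catalyst's `compile` above can splice the results of recursive calls directly into larger trees, which are then compiled to bytecode.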
Example
val tweets = sqlContext.read.json("tweets.json")
tweets
  .select("tweetId", "username", "timestamp")
  .filter("timestamp > 0")
  .explain(extended = true)
== Parsed Logical Plan ==
'Filter ('timestamp > 0)
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Analyzed Logical Plan ==
tweetId: bigint, username: string, timestamp: bigint
Filter (timestamp#14L > cast(0 as bigint))
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Optimized Logical Plan ==
Project [tweetId#15L,username#16,timestamp#14L]
 Filter (timestamp#14L > 0)
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Physical Plan ==
Filter (timestamp#14L > 0)
 Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]
Example (contd.)

[Diagram: Logical Plan — filter over project over tweets; Optimized Logical Plan — project over filter over tweets (filter pushed down); Physical Plan — filter over Scan(tweets)]
Demo
Download Code
https://github.com/knoldus/spark-dataframes-meetup
References
http://spark.apache.org/
Spark Summit EU 2015
Deep Dive into Spark SQL’s Catalyst Optimizer
Spark SQL: Relational Data Processing in Spark
Spark SQL and DataFrame Programming Guide
Introducing DataFrames in Spark for Large Scale Data Science
Beyond SQL: Speeding up Spark with DataFrames
Presenter:
[email protected]
@himanshug735

Organizer:
@Knolspeak
http://www.knoldus.com
http://blog.knoldus.com
Thanks