An Overview of Spark DataFrames with Scala
Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP
Who am I?
● Himanshu Gupta (@himanshug735)
● Spark Certified Developer
● Apache Spark Third-Party Package contributor - spark-streaming-gnip
● Sr. Software Consultant at Knoldus Software LLP
Img src - https://karengately.files.wordpress.com/2013/06/who-am-i.jpg
Agenda
● What is Spark?
● What is a DataFrame?
● Why do we need DataFrames?
● A brief example
● Demo
Apache Spark
● Distributed compute engine for large-scale data processing.
● Up to 100x faster than Hadoop MapReduce for in-memory workloads.
● Provides APIs in Python, Scala, Java and R (R support added in Spark 1.4).
● Combines SQL, streaming and complex analytics.
● Runs on Hadoop, Mesos, standalone, or in the cloud.
Img src - http://spark.apache.org/
Spark DataFrames
● Distributed collection of data organized into named columns (formerly SchemaRDD).
● Domain Specific Language for common tasks:
  ➢ UDFs
  ➢ Sampling
  ➢ Project, filter, aggregate, join, …
  ➢ Metadata
● Available in Python, Scala, Java and R (R in Spark 1.4).
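The DSL bullets above can be sketched in a few lines of Scala. This is illustrative only: it assumes a running Spark 1.4-era `sqlContext` and a hypothetical `people.json` file with `name` and `age` fields, so it is not runnable standalone.

```scala
// Sketch only: assumes a Spark 1.4-era SQLContext named `sqlContext`
// and a local "people.json" file (both hypothetical here).
import org.apache.spark.sql.functions.udf

val people = sqlContext.read.json("people.json")

// Project, filter and aggregate through the DSL instead of raw RDD code.
people
  .select("name", "age")
  .filter(people("age") > 21)
  .groupBy("age")
  .count()
  .show()

// A user-defined function (UDF) usable inside the same DSL.
val toUpper = udf((s: String) => s.toUpperCase)
people.select(toUpper(people("name"))).show()
```

Because these calls build a logical plan rather than executing eagerly, Catalyst (covered below) can optimize the whole query before it runs.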
Google Trends for DataFrames
Img src - https://www.google.co.in/trends/explore#q=dataframes&date=1%2F2011%2056m&cmpt=q&tz=Etc%2FGMT-5%3A30
Speed of Spark DataFrames!
Img src - https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html?utm_content=12098664&utm_medium=social&utm_source=twitter
RDD API vs DataFrame API

RDD API:

val linesRDD = sparkContext.textFile("file.txt")
val wordCountRDD = linesRDD
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

val (sum, n) = wordCountRDD
  .map { case (_, count) => (count, 1) }
  .reduce { case ((count1, n1), (count2, n2)) => (count1 + count2, n1 + n2) }
val average = sum.toDouble / n

DataFrame API:

val linesDF = sparkContext.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line", "word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
val average = wordCountDF.agg(avg("count"))
Catalyst Optimizer
Optimization & Execution Plan shared by DataFrames and SparkSQL
Img src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Analysis
Begins with a relation to be computed.
Builds an "Unresolved Logical Plan".
Applies Catalyst rules to resolve it.

DataFrame → Unresolved Logical Plan → (Catalyst Rules) → Logical Plan
Logical Optimizations
● Applies standard rule-based optimizations to the logical plan.
● Includes operations like:
  ➢ Constant folding
  ➢ Projection pruning
  ➢ Predicate pushdown
  ➢ Boolean expression simplification
Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
object DecimalAggregates extends Rule[LogicalPlan] {
  /** Maximum number of decimal digits in a Long */
  val MAX_LONG_DIGITS = 18

  def apply(plan: LogicalPlan): LogicalPlan = {
    plan transformAllExpressions {
      case Sum(e @ DecimalType.Expression(prec, scale))
          if prec + 10 <= MAX_LONG_DIGITS =>
        MakeDecimal(Sum(UnscaledValue(e)), prec + 10, scale)
    }
  }
}
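To see how such a rule works in isolation, here is a self-contained toy in plain Scala (not Spark's actual classes): a tiny expression AST and a constant-folding rule applied bottom-up, in the same pattern-matching style Catalyst rules use.

```scala
// Toy AST, purely illustrative — not Spark's Expression hierarchy.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Attr(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Rewrite bottom-up: collapse Add(Lit, Lit) into a single literal.
def constantFold(e: Expr): Expr = e match {
  case Add(l, r) =>
    (constantFold(l), constantFold(r)) match {
      case (Lit(a), Lit(b)) => Lit(a + b) // fold the constant sub-tree
      case (fl, fr)         => Add(fl, fr) // keep non-constant parts as-is
    }
  case other => other
}

// Add(x, Add(1, 2)) is rewritten to Add(x, Lit(3)) at planning time,
// so the addition of 1 + 2 never happens per-row at execution time.
val folded = constantFold(Add(Attr("x"), Add(Lit(1), Lit(2))))
println(folded)
```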
Physical Planning
● Generates one or more physical plans from the optimized logical plan.
● Selects one plan using a Cost Model.

Optimized Logical Plan → Physical Plans → (Cost Model) → Selected Physical Plan
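The "pick the cheapest candidate" step can be sketched in a few lines of plain Scala. The plan names and costs below are invented for illustration; Spark's real cost model is internal to Catalyst and far richer.

```scala
// Illustrative only — not Spark's API. A candidate physical plan with an
// estimated cost, and a selector that keeps the cheapest one.
case class PhysicalPlan(name: String, estimatedCost: Double)

def selectPlan(candidates: Seq[PhysicalPlan]): PhysicalPlan =
  candidates.minBy(_.estimatedCost)

// Hypothetical candidates for the same logical join.
val chosen = selectPlan(Seq(
  PhysicalPlan("broadcast-hash-join", 10.0),
  PhysicalPlan("sort-merge-join", 25.0),
  PhysicalPlan("cartesian-product", 900.0)
))
println(chosen.name)
```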
Code Generation
● Generates Java bytecode for fast execution.
● Uses Scala quasiquotes.
● Quasiquotes allow programmatic construction of ASTs.
def compile(node: Node): AST = node match {
  case Literal(value)   => q"$value"
  case Attribute(name)  => q"row.get($name)"
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"
}
Snippet src - https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
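Outside Spark, quasiquotes are just a scala-reflect feature, so they can be tried on their own. A minimal sketch (assumes scala-reflect on the classpath; the expression is arbitrary):

```scala
// Quasiquotes build and deconstruct Scala ASTs from interpolated strings,
// instead of assembling tree nodes by hand.
import scala.reflect.runtime.universe._

val x = 42
val tree: Tree = q"$x + 1" // an AST for the expression 42 + 1

// The same syntax works as a pattern, to take a tree apart.
val q"$lhs + $rhs" = tree

println(showRaw(tree)) // prints the raw tree structure
```

This is why Catalyst's `compile` above can splice the results of recursive calls directly into larger trees, which are then compiled to bytecode.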
Example
val tweets = sqlContext.read.json("tweets.json")
tweets
  .select("tweetId", "username", "timestamp")
  .filter("timestamp > 0")
  .explain(extended = true)
== Parsed Logical Plan ==
'Filter ('timestamp > 0)
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Analyzed Logical Plan ==
tweetId: bigint, username: string, timestamp: bigint
Filter (timestamp#14L > cast(0 as bigint))
 Project [tweetId#15L,username#16,timestamp#14L]
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Optimized Logical Plan ==
Project [tweetId#15L,username#16,timestamp#14L]
 Filter (timestamp#14L > 0)
  Relation[_id#9,content#10,hashtags#11,score#12,session#13,timestamp#14L,tweetId#15L,username#16] JSONRelation[file:/home/knoldus/data/json/tweets.json]

== Physical Plan ==
Filter (timestamp#14L > 0)
 Scan JSONRelation[file:/home/knoldus/data/json/tweets.json][tweetId#15L,username#16,timestamp#14L]
Example (contd.)

[Diagram: Logical Plan — filter over project over tweets; Optimized Logical Plan — project over filter over tweets (filter pushed down); Physical Plan — filter over Scan(tweets)]
Demo
Download Code
https://github.com/knoldus/spark-dataframes-meetup
References
http://spark.apache.org/
Spark Summit EU 2015
Deep Dive into Spark SQL’s Catalyst Optimizer
Spark SQL: Relational Data Processing in Spark
Spark SQL and DataFrame Programming Guide
Introducing DataFrames in Spark for Large Scale Data Science
Beyond SQL: Speeding up Spark with DataFrames
Presenter:
[email protected]
@himanshug735

Organizer:
@Knolspeak
http://www.knoldus.com
http://blog.knoldus.com
Thanks