Functional Query Optimization with SQL
TRANSCRIPT
Michael Armbrust @michaelarmbrust
spark.apache.org
What is Apache Spark?
Fast and general cluster computing system, interoperable with Hadoop.
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk); 2-5× less code
A General Stack
Spark core, plus libraries:
» Spark Streaming (real-time)
» Spark SQL
» GraphX (graph)
» MLlib (machine learning)
» …
Spark Model
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
» Collections of objects that can be stored in memory or on disk across a cluster
» Parallel functional transformations (map, filter, …)
» Automatically rebuilt on failure
More than Map/Reduce: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
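Most of these transformations mirror the Scala collections API, so a local sketch conveys the flavor. Below, plain Scala collections stand in for RDDs (an illustration only; on a real cluster these calls run in parallel across partitions):

```scala
// Plain Scala collections as a stand-in for RDDs: the transformation
// names and shapes match, but nothing here is distributed.
val pairs = Seq(("ERROR", 1), ("WARN", 1), ("ERROR", 1), ("INFO", 1))

// reduceByKey-style aggregation: merge the values for each key.
val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// filter + map, just as on an RDD.
val errorCounts = counts.filter { case (k, _) => k == "ERROR" }.map(_._2)
```

Here `counts` ends up as `Map(ERROR -> 2, WARN -> 1, INFO -> 1)`.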
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

```scala
val lines = spark.textFile("hdfs://...")           // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()         // Action
messages.filter(_.contains("bar")).count()
```

[Diagram: the Driver sends tasks to Workers and collects results; each Worker builds its messages partition (Cache 1-3) from its lines block (Block 1-3).]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance
RDDs track lineage info to rebuild lost data.

```scala
file.map(record => (record.tpe, 1))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count > 10 }
```

[Diagram: lineage graph from the input file through map, reduceByKey, and filter]
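The recovery idea can be sketched with a hypothetical `Lineage` class (an illustration, not Spark's implementation): because each dataset records how it was derived from its base data, a lost result can simply be recomputed.

```scala
// Toy lineage record: base data plus the chain of transformations
// that derived the result from it (hypothetical class for illustration).
case class Lineage[A, B](base: Seq[A], steps: Seq[A] => Seq[B]) {
  def compute(): Seq[B] = steps(base) // rebuild at any time from the base
}

val logLineage = Lineage[String, (String, Int)](
  Seq("ERROR x", "INFO y", "ERROR z"),
  _.filter(_.startsWith("ERROR")).map(s => (s.split(" ")(1), 1))
)

val result = logLineage.compute()    // normal computation
val recovered = logLineage.compute() // "lost" data is rebuilt the same way
```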
Spark and Scala
Scala provides:
• Concise Serializable* functions
• Easy interoperability with the Hadoop ecosystem
• Interactive REPL

* Made even better by Spores (5pm today)

[Chart: lines of code by language: Scala, Python, Java, Shell, Other]
Reduced Developer Complexity

[Chart: non-test, non-example source lines, 0 to 140,000. Hadoop MapReduce is by far the largest; Storm (Streaming), Impala (SQL), and Giraph (Graph) follow; Spark stays far smaller even with its Streaming, SparkSQL, and GraphX components included.]
Spark Community
One of the largest open source projects in big data:
» 150+ developers contributing
» 30+ companies contributing

[Chart: contributors in the past year, 0 to 150, comparing Spark with Giraph, Storm, and Tez]
Community Growth
» Spark 0.6 (Oct '12): 17 contributors
» Spark 0.7 (Feb '13): 31 contributors
» Spark 0.8 (Sept '13): 67 contributors
» Spark 0.9 (Feb '14): 83 contributors
» Spark 1.0 (May '14): 110 contributors
With great power…
Strict project coding guidelines to make it easier for non-Scala users and contributors:
• Absolute imports only
• Minimize infix function use
• Java/Python-friendly wrappers for user APIs
• …
SQL
Relationship to Shark
Shark modified the Hive backend to run over Spark, but had two challenges:
» Limited integration with Spark programs
» Hive optimizer not designed for Spark
Spark SQL reuses the best parts of Shark:
» Borrows: Hive data loading, in-memory column store
» Adds: RDD-aware optimizer, rich language interfaces
Spark SQL Components
• Catalyst Optimizer: relational algebra + expressions; query optimization
• Spark SQL Core: execution of queries as RDDs; reading in Parquet, JSON, …
• Hive Support: HQL, MetaStore, SerDes, UDFs

[Chart: code split across the components: 26%, 36%, 38%]
Adding Schema to RDDs
Spark + RDDs: functional transformations on partitioned collections of opaque objects.
SQL + SchemaRDDs: declarative transformations on partitioned collections of tuples.

[Diagram: RDD partitions hold opaque User objects; SchemaRDD partitions hold (Name, Age, Height) tuples]
Using Spark SQL
SQLContext:
• Entry point for all SQL functionality
• Wraps/extends existing spark context

```scala
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Importing the SQL context gives access to all the SQL
// functions and conversions.
import sqlContext._
```
Example Dataset
A text file filled with people's names and ages:

```
Michael, 30
Andy, 31
Justin Bieber, 19
…
```
Turning an RDD into a Relation

```scala
// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")
```
Querying Using SQL

```scala
// SQL statements are run with the sql method from sqlContext.
val teenagers = sql("""
  SELECT name FROM people WHERE age >= 13 AND age <= 19""")

// The results of SQL queries are SchemaRDDs but also
// support normal RDD operations.
// The columns of a row in the result are accessed by ordinal.
val nameList = teenagers.map(t => "Name: " + t(0)).collect()
```
Querying Using the Scala DSL
Express queries using functions, instead of SQL strings.

```scala
// The following is the same as:
//   SELECT name FROM people
//   WHERE age >= 10 AND age <= 19
val teenagers = people
  .where('age >= 10)
  .where('age <= 19)
  .select('name)
```
Caching Tables In-Memory
Spark SQL can cache tables using an in-memory columnar format:
• Scan only required columns
• Fewer allocated objects (less GC)
• Automatically selects best compression

```scala
cacheTable("people")
```
Parquet Compatibility
Native support for reading data in Parquet:
• Columnar storage avoids reading unneeded data.
• RDDs can be written to Parquet files, preserving the schema.
Using Parquet

```scala
// Any SchemaRDD can be stored as Parquet.
people.saveAsParquetFile("people.parquet")

// Parquet files are self-describing so the schema is preserved.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used
// in SQL statements.
parquetFile.registerAsTable("parquetFile")
val teenagers = sql(
  "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
```
Hive Compatibility
Interfaces to access data and code in the Hive ecosystem:
• Support for writing queries in HQL
• Catalog info from Hive MetaStore
• Tablescan operator that uses Hive SerDes
• Wrappers for Hive UDFs, UDAFs, UDTFs
Reading Data Stored In Hive

```scala
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hql("LOAD DATA LOCAL INPATH '.../kv1.txt' INTO TABLE src")

// Queries can be expressed in HiveQL.
hql("FROM src SELECT key, value")
```
SQL and Machine Learning

```scala
val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// SQL results are RDDs so can be used directly in MLlib.
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}
val model = new LogisticRegressionWithSGD().run(trainingData)
```
Supports Java Too!

```java
public class Person implements Serializable {
  private String _name;
  private int _age;
  public String getName() { return _name; }
  public void setName(String name) { _name = name; }
  public int getAge() { return _age; }
  public void setAge(int age) { _age = age; }
}

JavaSQLContext ctx = new org.apache.spark.sql.api.java.JavaSQLContext(sc);
JavaRDD<Person> people =
  sc.textFile("examples/src/main/resources/people.txt").map(
    new Function<String, Person>() {
      public Person call(String line) throws Exception {
        String[] parts = line.split(",");
        Person person = new Person();
        person.setName(parts[0]);
        person.setAge(Integer.parseInt(parts[1].trim()));
        return person;
      }
    });
JavaSchemaRDD schemaPeople = ctx.applySchema(people, Person.class);
```
Supports Python Too!

```python
from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})

peopleTable = sqlCtx.applySchema(people)
peopleTable.registerAsTable("people")

teenagers = sqlCtx.sql(
    "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenNames = teenagers.map(lambda p: "Name: " + p.name)
```
Optimizing Queries with Catalyst
What is Query Optimization?
SQL is a declarative language: queries express what data to retrieve, not how to retrieve it.
The database must pick the 'best' execution strategy through a process known as optimization.
Naïve Query Planning

```sql
SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1
```

Logical Plan:
  Project(name)
    Filter(id = 1)
      Project(id, name)
        People

Physical Plan:
  Project(name)
    Filter(id = 1)
      Project(id, name)
        TableScan(People)
Optimized Execution
Writing imperative code to optimize all possible patterns is hard.
Instead write simple rules:
• Each rule makes one change
• Run many rules together to fixed point.

[Diagram: the naïve logical plan on the left is rewritten by rules into a physical plan that is a single IndexLookup(id = 1, return: name).]
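The "simple rules, run to fixed point" idea can be sketched with a toy rewriter over arithmetic expressions (hypothetical classes, nothing Catalyst-specific): one rule, `x + 0 => x`, applied repeatedly until the tree stops changing.

```scala
// Toy expression tree (illustration only).
sealed trait Expr
case class Lit(v: Int) extends Expr
case class Add(l: Expr, r: Expr) extends Expr

// One rule, one change: x + 0 => x, applied wherever it matches.
def simplify(e: Expr): Expr = e match {
  case Add(x, Lit(0)) => simplify(x)
  case Add(Lit(0), x) => simplify(x)
  case Add(l, r)      => Add(simplify(l), simplify(r))
  case other          => other
}

// Run the rule repeatedly until the tree stops changing (fixed point).
def fixedPoint(e: Expr): Expr = {
  val next = simplify(e)
  if (next == e) e else fixedPoint(next)
}
```

Catalyst batches many such rules and reruns the batch until no rule fires any more, i.e., until a fixed point is reached.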
Optimizing with Rules

Original Plan:
  Project(name)
    Filter(id = 1)
      Project(id, name)
        People

Filter Push-Down:
  Project(name)
    Project(id, name)
      Filter(id = 1)
        People

Combine Projection:
  Project(name)
    Filter(id = 1)
      People

Physical Plan:
  IndexLookup(id = 1, return: name)
Prior Work: Optimizer Generators
Volcano / Cascades:
• Create a custom language for expressing rules that rewrite trees of relational operators.
• Build a compiler that generates executable code for these rules.
Cons: developers need to learn this custom language, and the language might not be powerful enough.
TreeNode Library
Easily transformable trees of operators:
• Standard collection functionality: foreach, map, collect, etc.
• transform function: recursive modification of tree fragments that match a pattern.
• Debugging support: pretty printing, splicing, etc.
Tree Transformations
Developers express tree transformations as PartialFunction[TreeType, TreeType]:
1. If the function does apply to an operator, that operator is replaced with the result.
2. When the function does not apply to an operator, that operator is left unchanged.
3. The transformation is applied recursively to all children.
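The three behaviors above can be sketched as a minimal `transform` over a toy tree (hypothetical `Node`, `Leaf`, and `Branch` classes; Catalyst's actual TreeNode library is richer):

```scala
sealed trait Node {
  def children: Seq[Node]
  def withChildren(cs: Seq[Node]): Node

  // Apply `rule` where it is defined, leave other nodes unchanged,
  // then recurse into all children of the (possibly replaced) node.
  def transform(rule: PartialFunction[Node, Node]): Node = {
    val applied = if (rule.isDefinedAt(this)) rule(this) else this
    applied.withChildren(applied.children.map(_.transform(rule)))
  }
}

case class Leaf(name: String) extends Node {
  def children: Seq[Node] = Nil
  def withChildren(cs: Seq[Node]): Node = this
}

case class Branch(label: String, children: Seq[Node]) extends Node {
  def withChildren(cs: Seq[Node]): Node = copy(children = cs)
}

// The partial function only matches leaves; branches pass through unchanged.
val tree = Branch("root", Seq(Leaf("a"), Branch("sub", Seq(Leaf("b")))))
val upper = tree.transform { case Leaf(n) => Leaf(n.toUpperCase) }
```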
Writing Rules as Tree Transformations
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.

[Diagram: the Original Plan, with Filter(id = 1) on top of Project(id, name), becomes the Filter Push-Down plan with the Filter moved below the Project.]
Filter Push Down Transformation

```scala
val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}
```

The rule is a plain partial function over trees: Scala pattern matching finds a Filter on top of a Project, the guard checks that the filter can be evaluated without the project's result, and case-class copy constructors plus the standard collections library build the swapped plan.
Efficient Expression Evaluation
Interpreting expressions (e.g., 'a + b') can be very expensive on the JVM:
• Virtual function calls
• Branches based on expression type
• Object creation due to primitive boxing
• Memory consumption by boxed primitive objects
Interpreting "a + b"
Expression tree: Add(Attribute(a), Attribute(b))
1. Virtual call to Add.eval()
2. Virtual call to a.eval()
3. Return boxed Int
4. Virtual call to b.eval()
5. Return boxed Int
6. Integer addition
7. Return boxed result
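The steps above can be sketched as a toy interpreted evaluator (hypothetical classes in the spirit of the description, not Spark's actual ones). Every `eval` is a virtual call, and each intermediate value is boxed because the declared return type is `Any`:

```scala
// Toy interpreted expression tree: eval is a virtual call at every node,
// and returning Any forces each Int result to be boxed.
sealed trait Expression { def eval(row: Array[Int]): Any }

case class Attribute(ordinal: Int) extends Expression {
  def eval(row: Array[Int]): Any = row(ordinal) // boxes the Int
}

case class AddExpr(left: Expression, right: Expression) extends Expression {
  def eval(row: Array[Int]): Any =              // unbox, add, re-box
    left.eval(row).asInstanceOf[Int] + right.eval(row).asInstanceOf[Int]
}

val sum = AddExpr(Attribute(0), Attribute(1)).eval(Array(2, 3))
```

Generated code, by contrast, collapses the whole tree into straight-line primitive arithmetic, as the following slides show.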
Using Runtime Reflection

```scala
def generateCode(e: Expression): Tree = e match {
  case Attribute(ordinal) =>
    q"inputRow.getInt($ordinal)"
  case Add(left, right) =>
    q"""
      {
        val leftResult = ${generateCode(left)}
        val rightResult = ${generateCode(right)}
        leftResult + rightResult
      }
    """
}
```
Executing "a + b"

```scala
val left: Int = inputRow.getInt(0)
val right: Int = inputRow.getInt(1)
val result: Int = left + right
resultRow.setInt(0, result)
```

• Fewer function calls
• No boxing of primitives
Performance Comparison

[Chart: milliseconds to evaluate 'a+a+a' one billion times, on a 0 to 40 ms scale: interpreted evaluation vs hand-written code vs code generated with Scala reflection]
TPC-DS Results

[Chart: seconds per query, 0 to 400, for Query 19, Query 53, Query 34, and Query 59: Shark 0.9.2 vs SparkSQL + codegen]
Code Generation Made Simple
• Other systems (Impala, Drill, etc.) do code generation.
• Scala reflection + quasiquotes made our implementation an experiment done over a few weekends instead of a major system overhaul.
Initial version: ~1000 LOC
Future Work: Typesafe Results
Currently:

```scala
people.registerAsTable("people")
val results = sql("SELECT name FROM people")
results.map(r => r.getString(0))
```

What we want:

```scala
val results = sql"SELECT name FROM $people"
results.map(_.name)
```

Joint work with: Heather Miller, Vojin Jovanovic, Hubert Plociniczak
Get Started
Visit spark.apache.org for videos & tutorials.
Download the Spark bundle for CDH.
Easy to run on just your laptop.
Free training talks and hands-on exercises: spark-summit.org
Conclusion
Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries, including SQL
» More real-time stream processing
Spark is a fast platform that unifies these apps.

Join us at Spark Summit 2014! June 30-July 2, San Francisco