TRANSCRIPT
Spark SQL Deep Dive
Michael Armbrust, Melbourne Spark Meetup, June 1st 2015
What is Apache Spark?

Fast and general cluster computing system, interoperable with Hadoop and included in all major distros.

Improves efficiency through:
> In-memory computing primitives
> General computation graphs
Up to 100× faster (2-10× on disk)

Improves usability through:
> Rich APIs in Scala, Java, Python
> Interactive shell
2-5× less code
Spark Model

Write programs in terms of transformations on distributed datasets.

Resilient Distributed Datasets (RDDs)
> Collections of objects that can be stored in memory or on disk across a cluster
> Parallel functional transformations (map, filter, …)
> Automatically rebuilt on failure
More than Map & Reduce

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
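A minimal sketch of a few of these operations in the Scala shell (the sample data and the sc SparkContext binding are assumptions for illustration):

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // (a,2) (b,1) (c,1)
val labels = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))
val joined = counts.join(labels)                         // (a,(2,alpha)) (b,(1,beta))
joined.sortByKey().collect().foreach(println)            // actions trigger execution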
On-Disk Sort Record: Time to sort 100TB

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Spark also sorted 1PB in 4 hours.

Source: Daytona GraySort benchmark, sortbenchmark.org
Spark “Hall of Fame”

LARGEST SINGLE-DAY INTAKE: Tencent (1PB+/day)
LONGEST-RUNNING JOB: Alibaba (1 week on 1PB+ data)
LARGEST SHUFFLE: Databricks PB Sort (1PB)
LARGEST CLUSTER: Tencent (8000+ nodes)
MOST INTERESTING APP: Jeremy Freeman, Mapping the Brain at Scale (with lasers!)

Based on Reynold Xin’s personal knowledge
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")            // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()

messages.filter(_.contains("foo")).count()          // Action
messages.filter(_.contains("bar")).count()
...

[Diagram: the driver sends tasks to three workers, each holding one block of lines; the messages RDD is cached on each worker (Cache 1-3) and results flow back to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
A General Stack

Spark
> Spark SQL
> Spark Streaming (real-time)
> GraphX (graph)
> MLlib (machine learning)
> …
Powerful Stack – Agile Development

[Chart: non-test, non-example source lines (0-140,000) in Hadoop MapReduce, Storm (Streaming), Impala (SQL), and Giraph (Graph) versus Spark. Successive builds add Streaming, SparkSQL, and GraphX on top of the Spark bar, each a small fraction of the size of the standalone systems. Final build: Your App?]
About SQL

Spark SQL
> Part of the core distribution since Spark 1.0 (April 2014)
> Runs SQL / HiveQL queries, including UDFs, UDAFs, and SerDes
> Connect existing BI tools to Spark through JDBC
> Bindings in Python, Scala, and Java

SELECT COUNT(*) FROM hiveTable WHERE hive_udf(data)

[Charts: # of commits per month (0-250) and # of contributors (0-200), 2014-03 through 2015-06]
The not-so-secret truth…

Spark SQL is not about SQL.

SQL: The whole story

Create and Run Spark Programs Faster:
> Write less code
> Read less data
> Let the optimizer do the hard work
DataFrame noun – [dey-tuh-freym]
1. A distributed collection of rows organized into named columns.
2. An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).
3. Archaic: Previously SchemaRDD (cf. Spark < 1.3).
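As a minimal sketch of the abstraction in Scala (the people.json path and its name/age columns are assumptions):

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.printSchema()                                  // columns and types inferred from the data
df.filter(df("age") > 21).select("name").show()   // select and filter by column name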
Write Less Code: Input & Output

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Logos: built-in formats including { JSON } and JDBC, plus external data sources, and more…]
Write Less Code: Input & Output

Unified interface to reading/writing data in a variety of formats:

df = sqlContext.read \
  .format("json") \
  .option("samplingRatio", "0.1") \
  .load("/home/michael/data.json")

df.write \
  .format("parquet") \
  .mode("append") \
  .partitionBy("year") \
  .saveAsTable("fasterData")

The read and write functions create new builders for doing I/O. Builder methods are used to specify:
• Format
• Partitioning
• Handling of existing data
• and more

Finally, load(…), save(…) or saveAsTable(…) execute the actual I/O.
ETL Using Custom Data Sources

sqlContext.read
  .format("com.databricks.spark.git")
  .option("url", "https://github.com/apache/spark.git")
  .option("numPartitions", "100")
  .option("branches", "master,branch-1.3,branch-1.2")
  .load()
  .repartition(1)
  .write
  .format("json")
  .save("/home/michael/spark.json")
Write Less Code: Powerful Operations

Common operations can be expressed concisely as calls to the DataFrame API:
• Selecting required columns
• Joining different data sources
• Aggregation (count, sum, average, etc.)
• Filtering
Write Less Code: Compute an Average

Using Hadoop MapReduce:

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Using RDDs:

data = sc.textFile(...).map(lambda s: s.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
  .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
  .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
  .collect()

Using DataFrames:

sqlCtx.table("people") \
  .groupBy("name") \
  .agg(avg("age")) \
  .collect()

Using SQL:

SELECT name, avg(age) FROM people GROUP BY name
Not Just Less Code: Faster Implementations

[Chart: time to aggregate 10 million int pairs (secs), 0-10, comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL.]
Demo: Data Frames. Using Spark SQL to read, write, slice and dice your data using simple functions.
Read Less Data

Spark SQL can help you read less data automatically:
• Converting to more efficient formats
• Using columnar formats (e.g., Parquet)
• Using partitioning (e.g., /year=2014/month=02/…)
• Skipping data using statistics (e.g., min, max)
• Pushing predicates into storage systems (e.g., JDBC)

Optimization happens as late as possible, so Spark SQL can optimize across functions.
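To make the partitioning case concrete, a hedged sketch (the path and column names are assumptions): with a layout like /data/events/year=2014/month=02/…, a filter on the partition columns lets Spark skip whole directories instead of reading and discarding rows.

val events = sqlContext.read.parquet("/data/events")     // partitioned by year/month
events.filter(events("year") === 2014 && events("month") === 2)
      .select("timestamp")
      .explain()                                         // scan touches only matching partitions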
def add_demographics(events):
    u = sqlCtx.table("users")                    # Load Hive table
    return (events
        .join(u, events.user_id == u.user_id)    # Join on user_id
        .withColumn("city", zipToCity(u.zip)))   # udf adds city column

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Palo Alto") \
                      .select(events.timestamp).collect()

[Logical plan: a filter sits on top of a join between the events file and the users table. The join is expensive; we only want to join the relevant users. Physical plan: the filter is pushed below the join, onto the users side: join(scan(events), filter(scan(users))).]
def add_demographics(events):
    u = sqlCtx.table("users")                    # Load partitioned Hive table
    return (events
        .join(u, events.user_id == u.user_id)    # Join on user_id
        .withColumn("city", zipToCity(u.zip)))   # Run udf to add city column

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Palo Alto") \
                      .select(events.timestamp).collect()

[Logical plan: unchanged, a filter on top of the join of the events file and the users table. Physical plan with predicate pushdown and column pruning: join(optimized scan(events), optimized scan(users)); the filter and the required columns are pushed all the way into the Parquet scans.]
Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Diagram: df0 → tokenizer → df1 → hashingTF → df2 → lr → df3; the fitted Pipeline Model carries lr.model.]
So how does it all work?

Plan Optimization & Execution

[Diagram: SQL AST or DataFrame → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs]

DataFrames and SQL share the same optimization/execution pipeline.
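You can watch this pipeline on any DataFrame; a small sketch (the people table is an assumption):

import org.apache.spark.sql.functions.avg

sqlContext.table("people")
  .groupBy("name")
  .agg(avg("age"))
  .explain(true)   // prints the parsed, analyzed, optimized, and physical plans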
An example query

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:

Project(name)
  Filter(id = 1)
    Project(id, name)
      People
Naïve Query Planning

SELECT name
FROM (
  SELECT id, name
  FROM People) p
WHERE p.id = 1

Logical Plan:

Project(name)
  Filter(id = 1)
    Project(id, name)
      People

Physical Plan:

Project(name)
  Filter(id = 1)
    Project(id, name)
      TableScan(People)
Optimized Execution

Writing imperative code to optimize all possible patterns is hard.

Logical Plan:

Project(name)
  Filter(id = 1)
    Project(id, name)
      People

Optimized Physical Plan:

IndexLookup(id = 1, return: name)

Instead write simple rules:
• Each rule makes one change
• Run many rules together to fixed point
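As a minimal sketch of what running rules "to fixed point" means (this is not Catalyst's actual RuleExecutor): apply every rule in turn, repeatedly, until the plan stops changing, with an iteration cap as a safety net.

def fixedPoint[Plan](plan: Plan, rules: Seq[Plan => Plan], maxIterations: Int = 100): Plan = {
  var current = plan
  var iterations = 0
  var changed = true
  while (changed && iterations < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))  // each rule makes one small change
    changed = next != current                                 // converged when nothing changed
    current = next
    iterations += 1
  }
  current
}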
Prior Work: Optimizer Generators

Volcano / Cascades:
• Create a custom language for expressing rules that rewrite trees of relational operators.
• Build a compiler that generates executable code for these rules.

Cons: Developers need to learn this custom language. The language might not be powerful enough.
TreeNode Library

Easily transformable trees of operators:
• Standard collection functionality: foreach, map, collect, etc.
• transform function: recursive modification of tree fragments that match a pattern.
• Debugging support: pretty printing, splicing, etc.
Tree Transformations

Developers express tree transformations as PartialFunction[TreeType, TreeType]:
1. If the function applies to an operator, that operator is replaced with the result.
2. When the function does not apply to an operator, that operator is left unchanged.
3. The transformation is applied recursively to all children.
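A toy analogue of these semantics (a self-contained sketch, not Catalyst's actual TreeNode): an expression tree whose transform applies a partial function where it matches and recurses into the children.

sealed trait Expr {
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val applied = if (rule.isDefinedAt(this)) rule(this) else this  // rules 1 and 2
    applied match {                                                 // rule 3: recurse into children
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case leaf      => leaf
    }
  }
}
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A single constant-folding rule; one pass turns (1 + 2) + 3 into 3 + 3,
// and running it to fixed point folds the tree to Lit(6).
val once = Add(Add(Lit(1), Lit(2)), Lit(3)).transform {
  case Add(Lit(a), Lit(b)) => Lit(a + b)
}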
Writing Rules as Tree Transformations

1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.

Original Plan:

Project(name)
  Filter(id = 1)
    Project(id, name)
      People

After Filter Push-Down:

Project(name)
  Project(id, name)
    Filter(id = 1)
      People
Filter Push Down Transformation

val newPlan = queryPlan transform {
  case f @ Filter(_, p @ Project(_, grandChild))
      if f.references subsetOf grandChild.output =>
    p.copy(child = f.copy(child = grandChild))
}

Reading the rule piece by piece:
• The rule is a partial function over the tree.
• The pattern finds a Filter on top of a Project (Scala: pattern matching).
• The guard checks that the filter can be evaluated without the result of the project (Catalyst: attribute reference tracking).
• If so, it switches the order of the two operators (Scala: copy constructors).
Optimizing with Rules

Original Plan:

Project(name)
  Filter(id = 1)
    Project(id, name)
      People

Filter Push-Down:

Project(name)
  Project(id, name)
    Filter(id = 1)
      People

Combine Projection:

Project(name)
  Filter(id = 1)
    People

Physical Plan:

IndexLookup(id = 1, return: name)
Future Work – Project Tungsten

Consider “abcd” – 4 bytes with UTF8 encoding:

java.lang.String object internals:
 OFFSET  SIZE    TYPE DESCRIPTION      VALUE
      0     4         (object header)  ...
      4     4         (object header)  ...
      8     4         (object header)  ...
     12     4  char[] String.value     []
     16     4     int String.hash      0
     20     4     int String.hash32    0
Instance size: 24 bytes (reported by Instrumentation API)
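The dump above matches the output format of the Java Object Layout (JOL) tool; a hedged sketch of how such numbers can be reproduced (the jol-core dependency is an assumption, not something the talk names):

// Requires org.openjdk.jol:jol-core on the classpath.
import org.openjdk.jol.info.ClassLayout

object StringLayout extends App {
  // Prints field offsets and the instance size of java.lang.String,
  // i.e. the header-and-field overhead on top of the char[] holding "abcd".
  println(ClassLayout.parseClass(classOf[String]).toPrintable)
}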
Project Tungsten

Overcome JVM limitations:
• Memory management and binary processing: leverage application semantics to manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection
• Cache-aware computation: algorithms and data structures to exploit the memory hierarchy
• Code generation: use code generation to exploit modern compilers and CPUs
Questions?

Learn more at: http://spark.apache.org/docs/latest/
Get involved: https://github.com/apache/spark