
Page 1

Spark SQL 漫谈 (An Informal Talk on Spark SQL)

Cheng Hao

Oct 25, 2014

Copyright © 2014 Intel Corporation.

Page 2

Agenda

Spark SQL Overview

Catalyst in Depth

SQL Core API Introduction

vs. Shark & Hive-on-Spark

Our Contributions

Useful Materials


Page 3

Spark SQL Overview

Page 4

Spark SQL in Spark

[Stack diagram: Spark SQL, Spark Streaming (real-time), GraphX (graph, alpha), and MLlib (machine learning) all sit on top of Spark Core.]

Spark SQL was first released in Spark 1.0 (May 2014). It was initially committed by Michael Armbrust & Reynold Xin from Databricks.

Page 5

Spark SQL Component Stack (User Perspective)

Hive-like interface (JDBC service / CLI)

SQL API support (LINQ-like)

Both the Hive QL and simple SQL dialects are supported

DDL is 100% compatible with the Hive Metastore

Hive QL aims to be 100% compatible with Hive DML

The simple SQL dialect is currently weak in functionality, but easy to extend

[Stack diagram: data analysts and user applications reach Spark SQL through the JDBC service, CLI, or SQL API; Hive QL (backed by the Hive Metastore) and simple SQL (backed by a simple catalog) feed into Catalyst, which drives the Spark execution operators on Spark Core.]

Page 6

Spark SQL Architecture

[Architecture diagram by Michael Armbrust @ Databricks: a frontend and a backend, with Catalyst in between.]

Page 7

Catalyst in Depth

Page 8

Understand Some Terminology

Logical and physical query plans

Both are trees representing query evaluation

Internal nodes are operators over the data

The logical plan is higher-level and algebraic

The physical plan is lower-level and operational

Logical plan operators

Correspond to query language constructs

Conceptually describe what operation needs to be performed

Physical plan operators

Correspond to implemented access methods

Physically implement the operation described by the logical operators

[Pipeline diagram:]

SQL Text → (Parsing) → Unresolved Logical Plan → (Binding & Analyzing) → Logical Plan → (Optimizing) → Optimized Logical Plan → (Query Planning) → Physical Plan

Page 9

Examples

We execute the following commands in the Spark SQL CLI:

• CREATE TABLE T (key STRING, value STRING);

• EXPLAIN EXTENDED SELECT a.key * (2 + 3), b.value FROM T a JOIN T b ON a.key = b.key AND a.key > 3;

Page 10

Understand Some Terminology (cont.)

== Parsed Logical Plan ==
Project [('a.key * (2 + 3)) AS c_0#24,'b.value]
 Join Inner, Some((('a.key = 'b.key) && ('a.key > 3)))
  UnresolvedRelation None, T, Some(a)
  UnresolvedRelation None, T, Some(b)

== Analyzed Logical Plan ==
Project [(CAST(key#27, DoubleType) * CAST((2 + 3), DoubleType)) AS c_0#24,value#30]
 Join Inner, Some(((key#27 = key#29) && (CAST(key#27, DoubleType) > CAST(3, DoubleType))))
  MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)

== Optimized Logical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 Join Inner, Some((key#27 = key#29))
  Project [key#27]
   Filter (CAST(key#27, DoubleType) > 3.0)
    MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)

== Physical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None

Page 11

Catalyst Overview

• Catalyst is essentially an extensible framework for analyzing & optimizing logical plans and expression trees.

• Core elements:

• Tree Node API

• Expression Optimization

• Data Type & Schema

• Row API

• Logical Plan (Unresolved) → Binding & Analyzing (rules)

• Logical Plan (Resolved) → Optimizing (rules)

• SPI (Service Provider Interface)

• FunctionRegistry

• Schema Catalog
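To make the tree-plus-rules idea concrete, here is a minimal, self-contained Scala sketch in the spirit of Catalyst; the names (Expression, transformUp, constantFolding) are illustrative only, not Catalyst's actual API:

sealed trait Expression {
  // Rewrite the tree bottom-up by applying a rule (a partial function) wherever it matches.
  def transformUp(rule: PartialFunction[Expression, Expression]): Expression = {
    val withNewChildren = this match {
      case Add(l, r) => Add(l.transformUp(rule), r.transformUp(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(withNewChildren, identity[Expression])
  }
}
case class Literal(value: Int) extends Expression
case class Attribute(name: String) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

// A one-case rule, analogous to the ConstantFolding rule on page 15:
val constantFolding: PartialFunction[Expression, Expression] = {
  case Add(Literal(l), Literal(r)) => Literal(l + r)
}

// key + (2 + 3) folds to key + 5:
Add(Attribute("key"), Add(Literal(2), Literal(3))).transformUp(constantFolding)
// => Add(Attribute("key"), Literal(5))

Catalyst generalizes exactly this shape: each rule is a small partial function over immutable trees, and batches of rules are applied repeatedly until the tree stops changing.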

Page 12

Data Type & Schema

Primitive types

StringType, FloatType, IntegerType, ByteType, ShortType, DoubleType, LongType, BinaryType, BooleanType, DecimalType, TimestampType, DateType, Varchar (not completely supported yet), Char (not completely supported yet)

Complex types

ArrayType: ArrayType(elementType: DataType)

StructType: StructField(name: String, dataType: DataType); StructType(fields: Seq[StructField])

MapType: MapType(keyType: DataType, valueType: DataType)

UnionType (not supported yet)

Relation schema: a StructType over the relation's fields.
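As a sketch of how these types compose (Spark 1.x-era imports; note that the real StructField also takes a nullable flag, which the slide's simplified signature omits):

import org.apache.spark.sql._

// Schema for a relation like: key INT, value STRING
val kvSchema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", StringType, nullable = true)))

// Complex types nest freely:
val nestedSchema = StructType(Seq(
  StructField("tags", ArrayType(StringType), nullable = true),
  StructField("props", MapType(StringType, StringType), nullable = true)))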

Page 13

Row API

trait Row extends Seq[Any] with Serializable {
  def apply(i: Int): Any
  def isNullAt(i: Int): Boolean
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getDouble(i: Int): Double
  def getFloat(i: Int): Float
  def getBoolean(i: Int): Boolean
  def getShort(i: Int): Short
  def getByte(i: Int): Byte
  def getString(i: Int): String
  def getAs[T](i: Int): T
}

The Row class is the key data structure, widely used both inside and outside Spark SQL.

"def getAs[T]" is used for non-primitive data types.

A field value is represented as a native language data type; a field type is represented as a DataType, as described on the previous slide.
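A short usage sketch against the 1.x API (the Row(...) factory shown here is how rows are built by hand, e.g. for applySchema on page 20):

import org.apache.spark.sql._

val row: Row = Row(1, "val_1", null)

val key   = row.getInt(0)     // primitive accessor, native Int
val value = row.getString(1)  // native String
// Guard nullable fields, then use getAs[T] for non-primitive types:
val tags = if (row.isNullAt(2)) Nil else row.getAs[Seq[String]](2)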

Page 14

Logical Plan Binding & Analyzing

• Essentially about data binding & semantic analysis

• Example rules

• Bind attributes and relations to concrete data (ResolveReferences, ResolveRelations)

• Expression analysis

• Data type coercion (PropagateTypes, PromoteStrings, BooleanCasts, Division, etc.)

• Bind UDFs (ResolveFunctions)

• Eliminate / expand the analysis logical plan operators (StarExpansion, EliminateAnalysisOperators)

• Implicit semantic supplement

• Add sort expressions into the child projection list (ResolveSortReferences)

• Convert a projection into an aggregation if the projection contains an aggregate function (GlobalAggregates)

• UnresolvedHavingClauseAttributes

• Semantic checking

• Unresolved functions, relations, attributes (CheckResolution)

• Illegal expressions in the projection of an aggregation (CheckAggregation)

• ….
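The effect of these rules can be observed directly: every SchemaRDD exposes its QueryExecution. A sketch from the sbt hive console (1.x member names; kv_text is the table created in the code example on page 22):

val q = sql("SELECT key * (2 + 3) FROM kv_text WHERE key > '10'")
println(q.queryExecution.logical)   // unresolved plan, straight from the parser
println(q.queryExecution.analyzed)  // attributes bound; type coercion has inserted CASTs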

Page 15

Logical Plan Optimizing

• Simplify the logical plan tree based on relational / logical algebra and common sense (rule based)

• Example rules

• Expression optimization: NullPropagation, ConstantFolding, SimplifyFilters, SimplifyCasts, OptimizeIn, etc.

• Filter pushdown: UnionPushdown, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning

• Combine operators: CombineFilters, CombineLimits

• Concrete examples

• IsNull('a + null) => IsNull(null) => Literal(true)

• SELECT a.key, b.key FROM a JOIN b ON a.key = b.key AND b.key > 10 =>
  SELECT a.key, b.key FROM a JOIN (SELECT key FROM b WHERE key > 10) b ON a.key = b.key

Page 16

Spark SQL Dialects

[Diagram: two concrete frontends share the Catalyst pipeline.

HiveContext: Hive Parser → Hive AST → Unresolved Logical Plan, resolved against the Hive Catalog → Logical Plan → Optimized Logical Plan → Hive+Spark Planner → Execution Operators.

SQLContext: SQL Parser / DSL API → Unresolved Logical Plan, resolved against the Simple Catalog, then the same Catalyst pipeline with the Spark Planner.

In general, a new tool is assembled as: XXXContext = XXX Parser / API (frontend) + Catalyst + SPI (XXX Catalog) + XXX Planner (backend).]
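In 1.x code, choosing a dialect simply means instantiating the corresponding context; a sketch (HiveContext additionally needs the Hive classes and a metastore on the classpath):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("dialect-demo"))

val sqlContext  = new SQLContext(sc)  // SQL parser + simple catalog + Spark planner
val hiveContext = new HiveContext(sc) // Hive parser + Hive catalog + Hive+Spark planner

// Assuming a table kv is visible to each context:
sqlContext.sql("SELECT key FROM kv WHERE key > 10")   // simple SQL dialect
hiveContext.sql("SELECT key FROM kv WHERE key > 10")  // Hive QL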

Page 17

Spark Plan (Physical Plan)

SparkPlan is the root class of the Spark plan operators (the physical plan operators for Spark):

abstract class SparkPlan {
  def children: Seq[SparkPlan]

  /** Specifies how data is partitioned across different nodes in the cluster. */
  def outputPartitioning: Partitioning = UnknownPartitioning(0)

  /** Specifies any partition requirements on the input data for this operator. */
  def requiredChildDistribution: Seq[Distribution] =
    Seq.fill(children.size)(UnspecifiedDistribution)

  def execute(): RDD[Row]
}

Spark plan operators

Joins: BroadcastHashJoin, CartesianProduct, HashOuterJoin, LeftSemiJoinHash, etc.

Aggregate: Aggregate

Basic operators: Distinct, Except, Filter, Limit, Project, Sort, Union, etc.

Shuffle: AddExchange, Exchange

Commands: CacheTableCommand, DescribeCommand, ExplainCommand, etc.

…

Spark Strategies (SparkPlanner) map the optimized logical plan to a Spark plan:

Optimized Logical Plan → Spark Plan → RDD (Spark execution)
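From the hive console, this tail of the pipeline is visible through the same QueryExecution handle (a sketch against the 1.x API; executedPlan is the SparkPlan that actually runs):

val q = sql("SELECT value FROM kv_text WHERE key > 10")
println(q.queryExecution.executedPlan)              // the physical SparkPlan operator tree
val rows = q.queryExecution.executedPlan.execute()  // the underlying RDD[Row]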

Page 18

Case Study for Catalyst in Depth

• StreamSQL

• Reuses the HiveContext but with a different frontend / backend

• Frontend: slight modification of the HiveParser

• Backend: customized query planner that generates the physical plan based on Spark DStreams

• JIRA: https://issues.apache.org/jira/browse/SPARK-1363

• Source: https://github.com/thunderain-project/StreamSQL

• SQL 92 support

• Reuses the HiveContext but with a different frontend

• Frontend: a modified HiveParser & Hive QL translator

• https://github.com/intel-hadoop/spark/tree/panthera

• Pig on Spark POC

• Modifies the SQLContext

• Provides a PigParser to translate Pig scripts into Catalyst unresolved logical plans

• https://github.com/databricks/pig-on-spark

Page 19

SQL Core API Introduction

Page 20

SchemaRDD

• What's a SchemaRDD?

class SchemaRDD(
    @transient val sqlContext: SQLContext,
    @transient val baseLogicalPlan: LogicalPlan)
  extends RDD[Row](sqlContext.sparkContext, Nil)

• The Spark SQL core API (in Scala)

• Create a SchemaRDD instance from:

• Plain SQL text: def sql(sqlText: String)

• An existing logical plan: def logicalPlanToSparkQuery(plan: LogicalPlan)

• A Spark RDD: def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A])

• A Spark RDD with a schema: def applySchema(rowRDD: RDD[Row], schema: StructType)

• Frequently used file formats (JSON, Parquet, etc.): def parquetFile(path: String)

• SQL DSL

• select, where, join, orderBy, limit, groupBy, unionAll, etc.

• Data sinks

• Persist the data with a specified storage level: def persist(newLevel: StorageLevel)

• Save the data as a Parquet file: def saveAsParquetFile(path: String)

• Register the data as a named table: def registerTempTable(tableName: String)

• Insert the data into an existing table: def insertInto(tableName: String, overwrite: Boolean)

• ….

• Java API / Python API supported
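A small DSL sketch (1.x API; the Symbol-based column syntax relies on the implicits imported in the sbt hive console, as in the code example on page 22):

case class KV(key: Int, value: String)
val kvRdd = sparkContext.parallelize((1 to 100).map(i => KV(i, s"val_$i")))

// Implicit RDD -> SchemaRDD conversion, then DSL calls instead of SQL text:
val small = kvRdd.where('key <= 5).select('key, 'value)
small.collect().foreach(println)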

Page 21

Conceptual State Transition Diagram

[Diagram: SQL text, files, and tables enter through the SQL API as an Unresolved Logical Plan, which a SchemaRDD wraps; SchemaRDDs convert to and from plain RDDs, and results flow out to files / memory etc.]

* Unresolved Logical Plan → RDD means: Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plan → Spark RDD

Page 22

Code Example

sbt/sbt hive/console

// A HiveContext is created by default and its members are imported,
// so we can call its methods directly.

// Create a Hive table and load data into it.
sql("CREATE TABLE IF NOT EXISTS kv_text(key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE kv_text")

// Create a normal RDD.
case class KV(key: Int, value: String)
val kvRdd = sparkContext.parallelize((1 to 100).map(i => KV(i, s"val_$i")))

// Implicitly convert the kvRdd into a SchemaRDD and register it as a temp table.
kvRdd.where('key >= 1).where('key <= 5).registerTempTable("kv_rdd")

// Load a JSON file and register it as a table.
jsonFile("/tmp/file2.json").registerTempTable("kv_json")

val result = sql("SELECT a.key, b.value, c.key FROM kv_text a JOIN kv_rdd b JOIN kv_json c")
result.collect().foreach { row =>
  val f0 = if (row.isNullAt(0)) "null" else row.getInt(0)
  val f1 = if (row.isNullAt(1)) "null" else row.getString(1)
  val f2 = if (row.isNullAt(2)) "null" else row.getInt(2)
  println(s"result:$f0, $f1, $f2")
}

Page 23

vs. Shark & Hive-on-Spark

Page 24

Background of Shark / Hive-on-Spark / Spark SQL

Shark was the first SQL-on-Spark product, based on earlier versions of Hive (with a rewritten query planner to generate Spark RDD-based physical plans); Shark is retired now and has been replaced by Spark SQL.

Hive-on-Spark is a query-planner extension of Hive; it focuses on the SparkPlanner and the Spark RDD-based physical operator implementations. Spark users automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future.

Spark SQL is a new SQL engine on Spark, developed from scratch.

Functionality

Spark SQL supports almost all of the functionality that Hive provides, from the perspective of data analysts.

SQL API on the Spark shell vs. Pig Latin.

Spark SQL is an extensible / flexible framework for developers (based on Catalyst); new extensions are very easy to integrate.

Implementation philosophy of Spark SQL (simple & natural)

Largely employs Scala features (pattern matching, implicit conversions, partial functions, etc.).

Many small pieces of simple rules bind, analyze, and optimize the logical plan & expression trees, and also drive physical plan generation.

In-memory computing & maximizing memory usage (cache-related SQL API & commands).

Spark SQL benefits a lot from Hive by reusing its components (Hive QL parser, Metastore, SerDe, StorageHandler, etc.).

Stability

Hive is the de facto standard for SQL on big data so far, and it has been proven as a productive tool over a couple of years in practice; many corner cases are covered by its continuous enhancements.

Spark SQL has just started its journey (~0.5 years); we need more time to prove / improve it.

Page 25

Our Contributions

Page 26

In total: 60+ PRs on Spark SQL, 50+ merged.

Features

Add SerDe support for CTAS (PR2570)

Support Grouping Sets (PR1567)

Support EXTENDED for EXPLAIN (PR1982)

Cross join support in HiveQL (PR2124)

Add support for left semi join (PR837)

Add Date type support (PR2344)

Add Timestamp type support (PR275)

Add Expression RLike & Like support (PR224)

…

Performance Enhancements / Improvements

Avoid table creation in logical plan analyzing for CTAS (PR1846)

Extract the join keys from the join condition (PR1190)

Reduce Expression tree object creation for aggregate functions (min/max) (PR2113)

Push down the join filters & predicates for outer joins (PR1015)

Constant folding for expression optimization (PR482)

Fix a performance issue in data type casting (PR679)

Do not limit argument types for Hive simple UDFs (PR2506)

Use GenericUDFUtils.ConversionHelper for simple UDF type conversions (PR2407)

Fix: selecting null from a table would throw a MatchError (PR2396)

Type coercion should support every type having a null value (PR2246)

…

Bug Fixes

…

Page 27

Useful Materials

Page 28

References

http://spark-summit.org/wp-content/uploads/2013/10/J-Michael-Armburst-catalyst-spark-summit-dec-2013.pptx

http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf

https://www.youtube.com/watch?v=GQSNJAzxOr8

http://www.slideshare.net/ueshin/20140908-spark-sql-catalyst

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark

http://web.stanford.edu/class/cs346/qpnotes.html

http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf

http://codex.cs.yale.edu/avi/db-book/db6/slide-dir/PDF-dir/ch13.pdf

https://courses.cs.washington.edu/courses/cse444/12sp/lectures/

http://www.cs.uiuc.edu/class/sp06/cs411/lectures.html

• User mail list: [email protected]

• Dev mail list: [email protected]

• JIRA: https://issues.apache.org/jira/browse/SPARK/component/12322623

• Dev doc: https://spark.apache.org/docs/latest/sql-programming-guide.html

• GitHub: https://github.com/apache/spark/tree/master/sql

Page 29

Notice and Disclaimers:

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for a full list of Intel trademarks.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications.

Page 30

Copyright © 2014 Intel Corporation.