introduction to spark sql and catalyst / spark sqlおよびcalalystの紹介

38
Introduction to SparkSQL & Catalyst Takuya UESHIN ScalaMatsuri2014 2014/09/06(Sat)

Upload: scalaconfjp

Post on 14-Jun-2015

413 views

Category:

Software


2 download

DESCRIPTION

Presentation material by Mr. Takuya Ueshin at ScalaMatsuri 2014 http://scalamatsuri.org/en/

TRANSCRIPT

Page 1: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Introduction to SparkSQL & Catalyst

Takuya UESHIN !

ScalaMatsuri2014 2014/09/06(Sat)

Page 2: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Who am I?

Takuya UESHIN

@ueshin

github.com/ueshin

Nautilus Technologies, Inc.

A Spark contributor

2

Page 3: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Agenda

What is SparkSQL?

Catalyst in depth

SparkSQL core in depth

Interesting issues

How to contribute

3

Page 4: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

What is SparkSQL?

Page 5: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

What is SparkSQL?

5

SparkSQL is one of Spark components.

Executes SQL on Spark

Builds SchemaRDD like LINQ

Optimizes execution plan.

Page 6: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

What is SparkSQL?Catalyst provides a execution planning framework for relational operations.

Including:

SQL parser & analyzer

Logical operators & general expressions

Logical optimizer

A framework to transform operator tree.

6

Page 7: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

What is SparkSQL?To execute query needs some steps.

ParseAnalyze

Logical PlanOptimize

Physical Plan

Execute

7

Page 8: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

What is SparkSQL?To execute query needs some steps.

ParseAnalyze

Logical PlanOptimize

Physical Plan

Execute

8

Catalyst

Page 9: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

What is SparkSQL?To execute query needs some steps.

ParseAnalyze

Logical PlanOptimize

Physical Plan

Execute

9

SQL core

Page 10: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Catalyst in depth

Page 11: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Catalyst in depthProvides a execution planning framework for relational operations.

Row & DataType’s

Trees & Rules

Logical Operators

Expressions

Optimizations

11

Page 12: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Row & DataType’so.a.s.sql.catalyst.types.DataType

Long, Int, Short, Byte, Float, Double, Decimal

String, Binary, Boolean, Timestamp

Array, Map, Struct

o.a.s.sql.catalyst.expressions.Row

Represents a single row.

Can contain complex types.

12

Page 13: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Trees & Ruleso.a.s.sql.catalyst.trees.TreeNode

Provides transformations of tree.

foreach, map, flatMap, collect

transform, transformUp, transformDown

Used for operator tree, expression tree.

13

Page 14: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Trees & Ruleso.a.s.sql.catalyst.rules.Rule

Represents a tree transform rule.

o.a.s.sql.catalyst.rules.RuleExecutor

A framework to transform trees based on rules.

14

Page 15: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Logical OperatorsBasic Operators

Project, Filter, …

Binary Operators

Join, Except, Intersect, Union, …

Aggregate

Generate, Distinct

Sort, Limit

InsertInto, WriteToFile

15

Join

Table Table

Filter

Project

Page 16: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

ExpressionsLiteral

Arithmetics

UnaryMinus, Sqrt, MaxOf

Add, Subtract, Multiply, …

Predicates

EqualTo, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual

Not, And, Or, In, If, CaseWhen

16

+

1 2

Page 17: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

ExpressionsCast

GetItem, GetField

Coalesce, IsNull, IsNotNull

StringOperations

Like, Upper, Lower, Contains, StartsWith, EndsWith, Substring, …

17

Page 18: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsConstantFolding

NullPropagation

ConstantFolding

BooleanSimplification

SimplifyFilters

FilterPushdown

CombineFilters

PushPredicateThroughProject

PushPredicateThroughJoin

ColumnPruning

18

Page 19: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsNullPropagation, ConstantFolding

Replace expressions that can be evaluated with some literal value to the value.

ex)

1 + null => null

1 + 2 => 3

Count(null) => 0

19

Page 20: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsBooleanSimplification

Simplifies boolean expressions that can be determined.

ex)

false AND $right => false

true AND $right => $right

true OR $right => true

false OR $right => $right

If(true, $then, $else) => $then

20

Page 21: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsSimplifyFilters

Removes filters that can be evaluated trivially.

ex)

Filter(true, child) => child

Filter(false, child) => empty

21

Page 22: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsCombineFilters

Merges two filters.

ex)

Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child)

22

Page 23: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsPushPredicateThroughProject

Pushes Filter operators through Project operator.

ex)

Filter(‘i === 1, Project(‘i, ‘j, child)) => Project(‘i, ‘j, Filter(‘i === 1, child))

23

Page 24: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsPushPredicateThroughJoin

Pushes Filter operators through Join operator.

ex)

Project(“left.i”.attr === 1, Join(left, right) =>Join(Project(‘i === 1, left), right)

24

Page 25: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

OptimizationsColumnPruning

Eliminates the reading of unused columns.

ex)

Join(left, right, LeftSemi, “left.id”.attr === “right.id”.attr) => Join(Project(‘id, left), right, LeftSemi)

25

Page 26: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in depth

Page 27: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in depthProvides:

Physical operators to build RDD

Conversion from Existing RDD of Product to SchemaRDD support

Parquet file read/write support

JSON file read support

Columnar in-memory table support

27

Page 28: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in deptho.a.s.sql.SchemaRDD

Extends RDD[Row].

Has logical plan tree.

Provides LINQ-like interfaces to construct logical plan.

select, where, join, orderBy, …

Executes the plan.

28

Page 29: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in deptho.a.s.sql.execution.SparkStrategies

Converts logical plan to physical.

Some rules are based on statistics of the operators.

29

Page 30: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in depthParquet read/write support

Parquet: Columnar storage format for Hadoop

Reads existing Parquet files.

Converts Parquet schema to row schema.

Writes new Parquet files.

Currently DecimalType and TimestampType are not supported.

30

Page 31: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in depthJSON read support

Loads a JSON file (one object per line)

Infers row schema from the entire dataset.

Giving the schema is experimental.

Inferring the schema by sampling is also experimental.

31

Page 32: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

SQL core in depthColumnar in-memory table support

Caches table like RDD.cache, but as columnar style.

Can prune unnecessary columns when read data.

32

Page 33: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Interesting issues

Page 34: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Interesting issuesSupport the GroupingSet/ROLLUP/CUBE

https://issues.apache.org/jira/browse/SPARK-2663

Use statistics to skip partitions when reading from in-memory columnar data

https://issues.apache.org/jira/browse/SPARK-2961

34

Page 35: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Interesting issuesPluggable interface for shuffles

https://issues.apache.org/jira/browse/SPARK-2044

Sort-merge join

https://issues.apache.org/jira/browse/SPARK-2213

Cost-based join reordering

https://issues.apache.org/jira/browse/SPARK-2216

35

Page 36: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

How to contribute

Page 37: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

How to contributeSee: Contributing to Spark

Open an issue on JIRA

Send pull-request at GitHub

Communicate with committers and reviewers

Congratulations!

37

Page 38: Introduction to Spark SQL and Catalyst / Spark SQLおよびCalalystの紹介

Conclusion

Introduced SparkSQL & Catalyst

Now you know them very well!

And you know how to contribute.

!

Let’s contribute to Spark & SparkSQL!!

38