
Page 1:

Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines

Asterios Katsifodimos
Assistant Professor @ TU Delft
[email protected]
http://asterios.katsifodimos.com

Page 2:

Outline

• Introduction and Motivation
• Databases vs. parallel dataflow systems
• Declarativity in parallel-dataflow systems
  • The Emma language and optimizer
  • The Lara language for mixed linear- and relational-algebra programs
• Optimizing across linear & relational algebra
  • BlockJoin: a join algorithm for mixed relational & linear algebra programs
• The road ahead

Page 3:

A timeline of data analysis tools, from a biased perspective

• 70s: Appearance of relational databases (Ingres, Oracle, System-R, …). 1974: Don Chamberlin publishes SQL.
• 80s: First parallel shared-nothing architectures (Gamma, TRACE, Teradata, …).
• 90s: Open source projects and mainstream databases (BerkeleyDB, MS Access, PostgreSQL, MySQL, …).
• 00s: First columnar storage. Databases, faster! (MonetDB, C-Store, Vertica, Aster Data, GreenPlum, …). 2003: the Internet explodes; the NoSQL movement; open source rises.
• 2004-2009: Google publishes MapReduce. Scalable, UDF-based data analytics blooms (Hadoop MapReduce).
• 2009-2016: Alternative MapReduce implementations go mainstream (Spark, Flink, Impala, Giraph, Hive (SQL!), HBase, Blu, …). Criticism of MapReduce; back to SQL?
• 2016-: Better abstractions for dataflow systems (Spark DataFrames, SystemML, Emma, MRQL). We need easier-to-use tools!

Page 4:

Relational DBs vs. the Modern Big Data Processing Stack

Relational database stack (top to bottom):
• SQL
• SQL Compiler & Optimizer
• Relational Dataflow Engine
• Transaction Manager
• Row/Column Storage
• Operating System

Modern big data stack (top to bottom):
• SQL-like abstraction
• SQL Compiler & Optimizer (e.g., Hive)
• Iterative M/R-like jobs in Java, Scala, … (Hadoop, Apache Flink, Apache Spark)
• Put/Get ops (HBase)
• Files/Bytes (Hadoop Distributed File System, HDFS)
• Cluster Management (Kubernetes, YARN, …)

Page 5:

Declarative Data Processing

• Relational databases: declarative queries (SQL) over relations, executed by an RDBMS.
• Parallel dataflows: second-order functions (map, reduce, etc.) over distributed collections, executed by parallel dataflow engines.

Page 6:

Big data analysts sit at the intersection of two rare skill sets: people with data analysis skills and Big Data systems programming experts.

Page 7:

Quiz: guess the algorithm!

Apache Flink:

... // initialize
val cntrds = centroids.iterate(theta) { currCntrds =>
  val newCntrds = points
    .map(findNearestCntrd).withBcSet(currCntrds, "cntrds")
    .map( (c, p) => (c, p, 1L) )
    .groupBy(0).reduce( (x, y) =>
      (x._1, x._2 + y._2, x._3 + y._3) )
    .map( x => Centroid(x._1, x._2 / x._3) )
  newCntrds
}

Apache Spark:

... // initialize
while (theta) {
  newCntrds = points
    .map(findNearestCntrd)
    .map( (c, p) => (c, (p, 1L)) )
    .reduceByKey( (x, y) =>
      (x._1 + y._1, x._2 + y._2) )
    .map( x => Centroid(x._1, x._2._1 / x._2._2) )
  bcCntrs = sc.broadcast(newCntrds.collect())
}

Page 8:

K-Means Clustering

• Input:
  • Dataset (a set of points in 2D): large
  • K initial centroids: small
• Step 1:
  • Read the K centroids
  • Assign each point to its closest centroid (w.r.t. distance)
  • Output <centroid, point> pairs (i.e., a clustering)
• Step 2:
  • For each centroid, compute the average position of its assigned points
  • Output: <new centroid> (coordinates)

Page 9:

Partial Aggregates (the same Flink and Spark k-means programs as on Page 7, highlighting Flink's groupBy(0).reduce vs. Spark's reduceByKey).

Efficient aggregation requires a special combination of primitives.

Page 10:

Broadcast Variables (the same programs as on Page 7, highlighting withBcSet in Flink and sc.broadcast in Spark).

Explicit data exchange between the driver and the parallel dataflows.

Page 11:

Native Iterations (the same programs as on Page 7, highlighting Flink's iterate vs. Spark's driver-side while loop).

Feedback edges in the dataflow graph have to be specified explicitly.

Page 12:

Beyond Datasets: End-to-end ML Pipelines

A typical data science workflow:

1. Question Formulation
2. Data Source Exploration & Selection
3. Data Cleaning & Preparation
4. Feature Selection
5. ML Algorithm Selection
6. Hyperparameter Search
7. Model Training
8. Model Evaluation

The early, data-management steps consume roughly 80% of data scientists' time; the later steps are machine learning proper.

Page 13:

A very simple ML pipeline example

Factory 1, Factory 2, and Factory 3 are joined (⋈), then:
gather and clean sensor data → Principal Component Analysis → train ML model.

Page 14:

Preprocessing

Gathering and cleaning the sensor data from Factory 1, Factory 2, and Factory 3 is naturally expressed in:
• SQL
• Dataflow APIs

SELECT f1.v, f2.v, f3.v
FROM factory_1 AS f1, factory_2 AS f2, factory_3 AS f3
WHERE f1.sid = f2.sid AND
      f2.sid = f3.sid AND
      f1.v IS NOT NULL AND
      f2.v IS NOT NULL AND
      f3.v IS NOT NULL

Page 15:

Matrix Operations/ML in SQL

Impedance mismatch!

The Principal Component Analysis and model-training steps do not fit SQL well.

(Screenshot: calculating a covariance matrix in SQL. From: http://stackoverflow.com/questions/18019469/finding-covariance-using-sql)

Page 16:

Machine Learning in “Native” Syntax

The Principal Component Analysis and model-training steps are naturally expressed in:
• Matlab
• R
• …

// read factories table as matrix FacMatrix
[rows, cols] = size(FacMatrix)
colMeans = mean(FacMatrix)
NormMat = zeros(rows, cols)
for i = 1:rows
  NormMat(i, :) = FacMatrix(i, :) - colMeans;
end
CovMat = NormMat' * NormMat / (rows - 1)
// ...

Page 17:

SQL + ML Native Syntax = Optimization Barrier

In the pipeline from Page 13 (Factory 1 ⋈ Factory 2 ⋈ Factory 3 → gather and clean sensor data → Principal Component Analysis → train ML model), the hand-off between the relational preprocessing and the linear-algebra part is an optimization barrier!

Page 18:

Big Data Language Desiderata

• Declarativity: what vs. how. Enables optimizability, eases programming.
• User-defined functions as first-class citizens. An important reason why MapReduce took off.
• Control flow for machine learning and graph workloads: iterations, if-then-else branches; a Turing-complete “driver” with parallelizable constructs.
• Portability: should support multiple backend engines (Flink, Spark, MapReduce, etc.).
• Deep embedding: e.g., à la LINQ (Language Integrated Query) in C#, Ferry in Haskell, etc.

Page 19:

Emma/Lara: Databags & Matrices

Deeply embedded in Scala:

// Gather and clean data:
//   SELECT f1.v, f2.v
//   FROM factory_1 as f1, factory_2 as f2
//   WHERE f1.sid == f2.sid and f1.v != null and f2.v != null
val factories = for {
  f1 <- factory_1
  f2 <- factory_2
  if f1.sid == f2.sid &&
     f1.v != null &&
     f2.v != null
} yield (f1.v, …, f2.v, …)

// Principal Component Analysis:
while (…) {
  val FacMatrix = factories.toMatrix()
  // Calculate the mean of each of the columns
  val colMeans = FacMatrix.cols( col => col.mean() )
  val NormMat = FacMatrix - colMeans
  // Calculate covariance matrix
  val CovMat = NormMat.t %*% NormMat / (NormMat.numRows - 1)
  // ...
}

• Databag type: like RDDs in Spark, Datasets in Flink, etc.; operations on Databags are parallelized.
• Declarative for-comprehensions: like Select-From-Where in SQL; can express join, cross, filter, etc.
• Databag => Matrix: enables optimization across linear and relational algebra; an intermediate representation to reason across the two algebras (work in progress).
• “Native” matrix syntax: against the impedance mismatch.
• UDFs: first-class citizens.
• Loops: first-class citizens.

[SIGMOD 2015/2016, SIGMOD Record 2016, BeyondMR 2016, PVLDB 2017]

Page 20:

FP goodness: Comprehension Syntax

Math:   ⦃ (x, y) | x ∈ xs, y ∈ ys, x = y ⦄
SQL:    SELECT x, y FROM xs AS x, ys AS y WHERE x = y
Python: [(x, y) for x in xs for y in ys if x == y]
Scala:  for (x <- xs; y <- ys; if x == y) yield (x, y)

Comprehensions generalize SQL and are available as first-class syntax in modern general-purpose programming languages.
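Part of why comprehensions suit a deeply embedded language is that Scala desugars them into the same second-order functions a dataflow engine exposes. A minimal, runnable illustration on plain Scala Seqs (not Emma's DataBag):

val xs = Seq(1, 2, 3)
val ys = Seq(2, 3, 4)

val pairs  = for (x <- xs; y <- ys; if x == y) yield (x, y)
// ...is desugared by the Scala compiler into:
val pairs2 = xs.flatMap(x => ys.withFilter(y => x == y).map(y => (x, y)))
// pairs == pairs2 == Seq((2, 2), (3, 3))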

Page 21:

Automatic Optimization of Programs

• Caching of intermediate results: reused intermediate results (e.g., within a loop) are automatically cached.
• Choice of partitioning: pre-partition datasets which are used in multiple joins.
• Unnesting: e.g., converting exists() clauses into joins.
• Join order/algorithm selection: classic database optimizations.
• Fold-group fusion: calculate multiple aggregates in one pass (a sketch follows below).
• …
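To illustrate fold-group fusion, here is a minimal sketch on plain Scala collections (not Emma's optimizer itself): two independent folds over the same bag are fused into a single pass by folding over a pair of partial aggregates.

val xs = Seq(3.0, 1.0, 4.0, 1.0, 5.0)

// Unfused: two independent folds, i.e., two passes over xs.
val sum = xs.foldLeft(0.0)(_ + _)
val cnt = xs.foldLeft(0L)((n, _) => n + 1)

// Fused: a single pass folding over a pair of partial aggregates.
val (sum2, cnt2) = xs.foldLeft((0.0, 0L)) { case ((s, n), x) => (s + x, n + 1) }
val mean = sum2 / cnt2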

Page 22:

A concrete optimization
BlockJoin: partitioning data for linear algebra operations through joins

[PVLDB 2017] A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, V. Markl. "BlockJoin: Efficient Matrix Partitioning Through Joins." Proceedings of the VLDB Endowment, Vol. 10, No. 13; presented at VLDB 2018.

Page 23:

Row- vs. block-partitioned operations

• Relational algebra, e.g., an aggregate over a row (avg): certain relational operations are better executed over row-partitioned tables.
• Linear algebra, e.g., matrix multiplication: scalable linear-algebra operations are better executed over block-partitioned matrices, with orders-of-magnitude speedups.

Page 24:

Matrix Block-partitioning

From the BlockJoin paper:

…operators in order to produce or consume block-partitioned datasets has the potential to significantly speed up workloads that mix operators from linear and relational algebra. We put our focus on joins, and propose BlockJoin, a join operator which consumes row-partitioned relational data and outputs a block-partitioned matrix. The task of designing such an operator poses a number of challenges. First, in order to block-partition a matrix which results from a join, the matching tuples of the two joining sides need to be known in advance. This stems from the fact that a matrix block can contain columns from both input relations. Thus, the partitioning function cannot know which those columns are until there is an association of all pairs of tuples in the two joined relations. Therefore, such workloads are typically implemented by first materializing the join results, and subsequently block-partitioning those results.

In contrast, BlockJoin is based on the observation that it is sufficient to know which keys and corresponding tuple identifiers match, prior to materializing the complete join result. Analogous to joins which have been proposed for columnar databases [23, 8, 1], BlockJoin builds on two main concepts: index joins and late materialization. More specifically, BlockJoin first identifies the matching tuple pairs by performing a join on the keys of the two relations (analogous to TID-joins [24]). After the matching tuple-id pairs have been identified, a partitioning function derives, for each tuple, the blocks to which that tuple contributes data. Finally, it materializes the join result by sending tuples to the machines hosting the corresponding blocks. Thus, BlockJoin completely avoids materializing the join results in a row-partitioned manner. Instead, it performs the joining and block-partitioning phases together. Our experiments show that BlockJoin performs up to 7x faster than the baseline of conducting a row-wise join followed by a block-partitioning step.

The contributions of this paper are summarized as follows:

• We discuss and motivate the need for implementing relational operators producing block-partitioned datasets (Section 2.2).
• We propose BlockJoin, a distributed join algorithm which produces block-partitioned results for workloads mixing linear and relational algebra operations. To the best of our knowledge, this is the first work proposing a relational operator for block-partitioned results (Section 3).
• We compare the performance of BlockJoin against a baseline approach, which first joins row-partitioned relations and then block-partitions the results. We thereby exemplify BlockJoin's superiority experimentally and via a cost model (Section 3.3).
• We provide a reference implementation of BlockJoin based on Apache Spark [38] and experimentally show that, depending on the size and shape of the input relations, BlockJoin is up to 7x faster compared to the baseline approach. Moreover, we show that for tables with large numbers of columns, BlockJoin scales gracefully where the baseline approach fails (Sections 4 and 5).

Figure 2: A matrix and its blocked representation.

(a) Original matrix M:

    1.1 1.2 1.3 1.4
    2.1 2.2 2.3 2.4
    3.1 3.2 3.3 3.4
    4.1 4.2 4.3 4.4

(b) 2 x 2 blocked matrix:

    M1,1 = | 1.1 1.2 |    M1,2 = | 1.3 1.4 |
           | 2.1 2.2 |           | 2.3 2.4 |

    M2,1 = | 3.1 3.2 |    M2,2 = | 3.3 3.4 |
           | 4.1 4.2 |           | 4.3 4.4 |

2. BACKGROUND

In this section we introduce the blocked matrix representation and discuss a running example which will be used throughout the paper. For the example, we also depict a baseline implementation in a distributed dataflow system.

2.1 Block-Partitioned Matrix Representation

Distributed dataflow systems use an element-at-a-time processing model in which an element typically represents a line in a text file or a tuple of a relation. Systems which implement matrices in this model can choose among a variety of partitioning schemes (e.g., cell-, row- or column-wise partitioning) of the matrix. For common operations such as matrix multiplications, all of these representations incur huge performance overheads [16]. Block-partitioning the matrix provides significant performance benefits, including a reduction in the number of tuples required to represent and process a matrix, block-level compression, and the optimization of operations like multiplication on a block-level basis. These benefits have led to the widespread adoption of block-partitioned matrices in parallel data processing platforms [16, 18, 19]. In a blocked representation, the matrix is split into submatrices of fixed size, called blocks. These blocks become the processing elements in the dataflow system. Figure 2 shows a matrix and its blocked representation. In the remainder of the paper, we will refer to individual blocks using horizontal and vertical block indices: for instance, given a matrix M, we denote the bottom-left block of Figure 2 (b) as M2,1. While our algorithm works with arbitrarily shaped blocks, the majority of systems use square blocks only, and we adopt this convention in the rest of the paper for the sake of simplicity.

2.2 Motivating Example

For our motivating example we choose a common use case of learning a spam detection model in an e-commerce application. Assume that customers write reviews for products, some of which are spam, and we want to train a classifier to automatically detect the spam reviews. The data for products and reviews is stored in different files in a distributed filesystem, and we need an ML algorithm to learn from attributes of both relations. Therefore, we first need to join the records from these tables to obtain reviews with their corresponding products. Next, we need to transform these product-review pairs into a suitable representation for an ML algorithm, which means we need to apply a user-defined function (UDF) that transforms the attributes into a vector representation. Finally, we aggregate these vectors into a distributed, blocked feature matrix to feed them into an ML system (such as SystemML).
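To make the blocked representation concrete, here is a minimal single-machine sketch in Scala (a hypothetical helper, not code from the paper) that splits a row-major matrix into blocks of at most b x b, keyed by (block-row, block-column); edge blocks are smaller when the dimensions are not multiples of b:

// Split a row-major matrix into blocks of (at most) b x b,
// keyed by (block-row index, block-column index).
def blockPartition(m: Array[Array[Double]], b: Int): Map[(Int, Int), Array[Array[Double]]] = {
  val (nRows, nCols) = (m.length, m.head.length)
  (for {
    bi <- 0 until (nRows + b - 1) / b
    bj <- 0 until (nCols + b - 1) / b
  } yield (bi, bj) -> Array.tabulate(
    math.min(b, nRows - bi * b),   // rows in this (possibly edge) block
    math.min(b, nCols - bj * b)    // columns in this block
  )((i, j) => m(bi * b + i)(bj * b + j))).toMap
}

// Example: the 4 x 4 matrix of Figure 2, blocked with b = 2.
val M = Array(
  Array(1.1, 1.2, 1.3, 1.4),
  Array(2.1, 2.2, 2.3, 2.4),
  Array(3.1, 3.2, 3.3, 3.4),
  Array(4.1, 4.2, 4.3, 4.4))
val blocks = blockPartition(M, 2)
// blocks((1, 0)) is the bottom-left block M2,1: [[3.1, 3.2], [4.1, 4.2]]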

Page 25:

Common Pattern: Blocked Matrices as Join Results

For example, join products and their reviews to form a feature matrix; the matrix is then block-partitioned to be used in linear algebra operations:

Products ⋈ Reviews → Feature Matrix → (repartition) → Block-Partitioned Matrix

Page 26:

Data shuffling dominates execution time

• Challenge: can we join and block-partition results with less shuffling?
• Solution: BlockJoin, a one-step parallel join algorithm which outputs block-partitioned results.
• Basic idea (a single-machine sketch follows this list):
  1. Find out which and how many tuples survive the join (without shuffling).
  2. Give each surviving tuple a unique, sequential ID.
  3. Shuffle only the surviving tuples and block-partition the joining tuples on the fly.
• Results: up to 7x (and more) speedup over a row-wise join followed by block-partitioning.
  • The implementation can be further optimized (compression, sorting, etc.).
  • BlockJoin can be used for multi-way joins, where we expect larger speedups.
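The following is a minimal, single-machine sketch of these three steps (an assumed Row type and in-memory sequences; the real operator is distributed and handles the details discussed on the next pages). It performs the key-only index join, assigns dense sequential row ids, and routes values directly into b x b blocks:

case class Row(key: String, values: Array[Double])

// Sketch: join `left` and `right` on key and emit b x b blocks directly,
// never materializing the row-partitioned join result.
def blockJoin(left: IndexedSeq[Row], right: IndexedSeq[Row], b: Int): Map[(Int, Int), Array[Array[Double]]] = {
  // Step 1: index-join on keys only: which (left TID, right TID) pairs survive?
  val rightByKey = right.indices.groupBy(i => right(i).key)
  val survivors = for {
    li <- left.indices
    ri <- rightByKey.getOrElse(left(li).key, IndexedSeq.empty)
  } yield (li, ri)
  // Step 2: give each surviving result row a unique, sequential row id.
  val rows = survivors.zipWithIndex
  // Step 3: send each value straight to its target block; edge blocks are
  // zero-padded to b x b here for brevity.
  val cells = for {
    ((li, ri), row) <- rows
    (v, col) <- (left(li).values ++ right(ri).values).zipWithIndex
  } yield ((row / b, col / b), (row % b, col % b, v))
  cells.groupBy(_._1).map { case (blockIdx, cs) =>
    val block = Array.ofDim[Double](b, b)
    cs.foreach { case (_, (i, j, v)) => block(i)(j) = v }
    blockIdx -> block
  }
}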


Page 27:

Goal: Join two tables, then block-partition the result

Original tables:

  User:      Transaction:
  a 1        a 2 3
  b 1        a 2 3
  d 1        a 2 3
             b 2 3
             c 2 3

Join result (User ⋈ Transaction), split into the 2 x 2 blocks B[0,0], B[0,1], B[1,0], B[1,1]:

  a 1 2 3
  a 1 2 3
  a 1 2 3
  b 1 2 3

Page 28:

Baseline: join two tables, then block-partition the result

User and Transaction start row-partitioned across two nodes (e.g., Node 1 holds U = {a 1, b 1} and T = {a 2 3, b 2 3, c 2 3}; Node 2 holds U = {d 1} and T = {a 2 3, a 2 3}). The baseline needs two full shuffles:

1. Shuffle to co-locate matching tuples and materialize the row-wise join result (a 1 2 3, a 1 2 3, a 1 2 3, b 1 2 3).
2. Shuffle again to repartition that result into the blocks B[0,0], B[0,1], B[1,0], B[1,1].
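For comparison with the BlockJoin sketch on Page 26, a minimal single-machine sketch of this baseline (reusing the assumed Row type from that sketch and the hypothetical blockPartition helper from Page 24): the join result is fully materialized row-wise first, then re-blocked in a second pass.

// Baseline sketch: materialize the row-wise join result first (shuffle 1),
// then block-partition it in a separate second pass (shuffle 2).
val user = IndexedSeq(Row("a", Array(1.0)), Row("b", Array(1.0)), Row("d", Array(1.0)))
val tx = IndexedSeq(
  Row("a", Array(2.0, 3.0)), Row("a", Array(2.0, 3.0)), Row("a", Array(2.0, 3.0)),
  Row("b", Array(2.0, 3.0)), Row("c", Array(2.0, 3.0)))

val joined: Array[Array[Double]] =
  (for { u <- user; t <- tx; if u.key == t.key } yield u.values ++ t.values).toArray

val blocks = blockPartition(joined, 2)  // 2 x 2 blocking of the 4 x 3 result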

Page 29:

BlockJoin: fused join + block-partitioning operator
Saves heavy repartitioning costs.

Step I: Local Index Join
1. Collect <join key, TID> pairs at the driver and index-join them to find which tuples survive the join.
2. Assign row indices and prepare the surviving tuple IDs and block metadata (a TID is <Relation, Partition-ID, Index within Partition>).
3. Broadcast the tuple IDs and metadata to all nodes.

Step II: Block Materialization
Exchange tuple parts in order to materialize blocks on each node participating in the join. Two materialization strategies:
• Late materialization: fetch kernel → range-partition & local sort → split → merge & materialize blocks.
• Early materialization: fetch kernel → split → create blocks → merge blocks.

(Figure: the Products/Reviews running example, with product tuples such as (Prod1, 'XPhone', 199, 'Cat_SP') joined to review tuples such as (Prod1, '...', 4, false), traced through both strategies into the result blocks (0,0), (0,1), (1,0), (1,1).)

Main results:
• 7x speedups compared to separately joining and block-partitioning.
• Scales more gracefully.

[PVLDB 2017] A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, V. Markl. "BlockJoin: Efficient Matrix Partitioning Through Joins." PVLDB; presented at VLDB 2018, Rio.

Page 30:

BlockJoin step 1: Find the surviving tuples with a semi-join/index-join

The driver node collects the join keys from both relations and calculates the scaffold of the final result (similar to an index join). In the running example, U contributes the keys {a, b, d} and T contributes {a, a, a, b, c}; the surviving result rows (a, a, a, b) receive the sequential row indices 0-3. The scaffold is broadcast to Node 1 and Node 2 and is used for block partitioning into B[0,0], B[0,1], B[1,0], B[1,1] in the next step.

Page 31:

BlockJoin step 2: Fill in the blanks and shuffle the right tuples to their respective blocks

Using the broadcast scaffold, each node knows the target row index (0-3) of every surviving local tuple. The nodes shuffle only the surviving tuple parts (e.g., a 1 and b 1 from U on Node 1; a 2 3 from T on Node 2) directly to the nodes hosting their target blocks, where the blocks B[0,0], B[0,1], B[1,0], B[1,1] of the result (a 1 2 3 / a 1 2 3 / a 1 2 3 / b 1 2 3) are materialized.

Page 32:

The devil is in the detail, but details were skipped to simplify the presentation

• The example shown assumed:
  • A primary key on User and a foreign key on Transaction. This can be relaxed either 1) similarly to late materialization in columnar databases, or 2) with extra counters.
  • Ordered relations; the algorithm also works with unordered relations.
• The algorithm performs best when:
  • Join selectivity is low (fewer tuples survive: huge savings vs. row-wise partitioning).
  • The number of columns is large.
• Up to 7x (and more) speedup over a row-wise join followed by block-partitioning.
  • The implementation can be further optimized (compression, sorting, etc.).
  • BlockJoin can be used for multi-way joins, where we expect larger speedups.

Page 33:

The road ahead
Deep in-database analytics

Page 34:

Linear Algebra + Relational Algebra = Theoretical Foundation

• Goal: an intermediate representation/algebra to support equivalences and transformations of mixed linear-relational algebra programs.
• Fundamental questions:
  • What is the right intermediate representation?
  • Which is the set of transformations/equivalences for optimization?
  • Do current cost models work?
• Main challenge: the data model mismatch between unordered sets (relations) and ordered/indexed matrices.

Page 35:

The same Emma program as on Page 19 (the for-comprehension joining factory_1 and factory_2, followed by toMatrix, the column means, and the covariance matrix) is compiled through the following stack:

Programming Abstraction
→ Compiler/Parser
→ Intermediate Representation
→ Optimizer
→ Optimized Program
→ Code Generator
→ Dataflow Processing Jobs (Apache Flink, Apache Spark, Hadoop)
→ Cluster Management (Kubernetes, YARN, …)
→ Hadoop Distributed File System (HDFS)

Page 36:

Linear Algebra + Relational Algebra = Systems Research

• Goal: systems natively supporting mixed linear-relational algebra programs.
• Fundamental questions:
  • Data partitioning schemes (row- vs. column- vs. block-partitioning)?
  • Relational operators operating on block-partitioned data?
  • Linear algebra operators operating on row/column-partitioned data?
  • Operator fusion?
• Main challenges:
  • Novel, non-trivial operators (e.g., BlockJoin).
  • Reuse of existing operators.

Page 38:

The DataBag abstraction

class DataBag[+A] {
  // Type Conversion
  def this(s: Seq[A])                                      // Scala Seq -> DataBag
  def fetch()                                              // DataBag -> Scala Seq
  // Input/Output (static)
  def read[A](url: String, format: ...): DataBag[A]
  def write[A](url: String, format: ...)(in: DataBag[A])
  // Monad Operators (enable comprehension syntax)
  def map[B](f: A => B): DataBag[B]
  def flatMap[B](f: A => DataBag[B]): DataBag[B]
  def withFilter(p: A => Boolean): DataBag[A]
  // Nesting
  def groupBy[K](k: (A) => K): DataBag[Grp[K, DataBag[A]]]
  // Difference, Union, Duplicate Removal
  def minus[B >: A](subtrahend: DataBag[B]): DataBag[B]
  def plus[B >: A](addend: DataBag[B]): DataBag[B]
  def distinct(): DataBag[A]
  // Structural Recursion
  def fold[B](z: B, s: A => B, p: (B, B) => B): B
  // Aggregates (aliases for various folds)
  def minBy, min, sum, product, count, empty, exists, ...
}
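The fold above is the structural-recursion primitive that the aggregate aliases reduce to: z is the result for the empty bag, s maps a single element, and p combines partial results. A minimal sketch of the same signature on a plain Scala Seq (not Emma's actual implementation):

// fold as structural recursion over a bag: map each element with s,
// then combine the partial results with p, starting from the empty value z.
def fold[A, B](bag: Seq[A])(z: B, s: A => B, p: (B, B) => B): B =
  bag.map(s).foldLeft(z)(p)

val words = Seq("a", "bb", "ccc")
val count = fold(words)(0L, _ => 1L, _ + _)   // = 3
val chars = fold(words)(0, _.length, _ + _)   // = 6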