Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines
Asterios Katsifodimos
Assistant Professor @ TU Delft
[email protected]
http://asterios.katsifodimos.com
Outline
• Introduction and Motivation
  • Databases vs. parallel dataflow systems
• Declarativity in Parallel-Dataflow Systems
  • The Emma language and optimizer
  • The Lara language for mixed linear- and relational-algebra programs
• Optimizing across Linear & Relational Algebra
  • BlockJoin: a join algorithm for mixed relational & linear algebra programs
• The road ahead
A timeline of data analysis tools, from a biased perspective

• 70s — Appearance of relational databases: Ingres, Oracle, System-R, …
  1974: Don Chamberlin publishes SQL.
• 80s — First parallel shared-nothing architectures: Gamma, TRACE, Teradata, …
• 90s — Open-source projects and mainstream databases: BerkeleyDB, MS Access, PostgreSQL, MySQL, …
• 00s — First columnar storage. Databases, faster!: MonetDB, C-Store, Vertica, Aster Data, GreenPlum, …
  2003: the Internet explodes; the NoSQL movement; open source rises.
• 2004–2009 — Google publishes MapReduce; scalable, UDF-based data analytics blooms: Hadoop MapReduce.
• 2009–2016 — Alternative MapReduce implementations go mainstream: Spark, Flink, Impala, Giraph, Hive (SQL!), HBase, Blu, …
  Criticism of MapReduce. Back to SQL?
• 2016– — Better abstractions for dataflow systems: Spark DataFrames, SystemML, Emma, MRQL.
  We need easier-to-use tools!
Relational DBs vs. Modern Big Data Processing Stack

Relational database stack (top to bottom):
• SQL
• SQL Compiler & Optimizer
• Relational Dataflow Engine
• Transaction Manager
• Row/Column Storage
• Operating System

Big data stack (top to bottom):
• SQL-like abstraction
• SQL Compiler & Optimizer
• Iterative M/R-like jobs (Java, Scala, …): Hadoop MapReduce, Apache Flink, Apache Spark
• Put/Get ops: HBase
• Files/Bytes: Hadoop Distributed File System (HDFS)
• Cluster Management (Kubernetes, YARN, …)
Declarative Data Processing

• Relational Databases: Relations → SQL → RDBMS
• Parallel Dataflows: Distributed Collections → Second-Order Functions (Map, Reduce, etc.) → Parallel Dataflow Engines
Big data analysts sit at the intersection of two groups: people with data analysis skills, and Big Data systems programming experts.
Quiz: guess the algorithm!

Apache Flink:

... // initialize
val cntrds = centroids.iterate(theta) { currCntrds =>
  val newCntrds = points
    .map(findNearestCntrd).withBcSet(currCntrds, "cntrds")
    .map { case (c, p) => (c, p, 1L) }
    .groupBy(0).reduce( (x, y) =>
      (x._1, x._2 + y._2, x._3 + y._3) )
    .map( x => Centroid(x._1, x._2 / x._3) )
  newCntrds
}

Apache Spark:

... // initialize
while (theta) {
  newCntrds = points
    .map(findNearestCntrd)
    .map { case (c, p) => (c, (p, 1L)) }
    .reduceByKey( (x, y) =>
      (x._1 + y._1, x._2 + y._2) )
    .map( x => Centroid(x._1, x._2._1 / x._2._2) )
  bcCntrs = sc.broadcast(newCntrds.collect())
}

Answer: K-Means Clustering.
K-Means Clustering

• Input:
  • Dataset (set of points in 2D) — large
  • K initial centroids — small
• Step 1:
  • Read the K centroids
  • Assign each point to the closest centroid (w.r.t. distance)
  • Output <centroid, point> pairs (i.e., cluster assignments)
• Step 2:
  • For each centroid, calculate the average position of its assigned points
• Output: <new centroid> (coordinates)
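The steps above can be sketched on a single machine in plain Scala. This is a collection-based sketch of the algorithm's logic only; `Point`, `dist`, and `kMeans` are illustrative names, not the Flink/Spark APIs shown earlier.

```scala
// Single-machine K-Means sketch on 2D points (plain Scala collections).
case class Point(x: Double, y: Double)

def dist(a: Point, b: Point): Double =
  math.sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y))

// Step 1: index of the closest centroid for a point
def findNearestCntrd(p: Point, cntrds: Seq[Point]): Int =
  cntrds.indices.minBy(i => dist(p, cntrds(i)))

def kMeans(points: Seq[Point], initial: Seq[Point], theta: Int): Seq[Point] = {
  var cntrds = initial
  for (_ <- 1 to theta) {                          // fixed number of iterations
    cntrds = points
      .map(p => (findNearestCntrd(p, cntrds), p))  // Step 1: assign to closest centroid
      .groupBy(_._1).toSeq.sortBy(_._1)            // Step 2: per centroid ...
      .map { case (_, assigned) =>
        Point(assigned.map(_._2.x).sum / assigned.size,   // ... average position
              assigned.map(_._2.y).sum / assigned.size)
      }
    // note: a centroid that attracts no points is dropped in this sketch
  }
  cntrds
}
```

On a dataflow engine, the `groupBy`/average step is exactly where the partial-aggregation and broadcast concerns of the two programs above come in.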
Partial Aggregates (same programs as above): efficient aggregation requires a special combination of primitives.
Broadcast Variables (same programs as above): explicit data exchange between the driver and the parallel dataflows.
Native Iterations (same programs as above): feedback edges in the dataflow graph have to be specified explicitly.
Beyond Datasets: End-to-end ML Pipelines

A typical Data Science workflow:
1. Question Formulation
2. Data Source Exploration & Selection
3. Data Cleaning & Preparation
4. Feature Selection
5. ML Algorithm Selection
6. Hyper-Parameter Search
7. Model Training
8. Model Evaluation

Steps 1–4 are data management — where data scientists spend 80% of their time; steps 5–8 are machine learning.
A very simple ML pipeline example

Factory 1 ⋈ Factory 2 ⋈ Factory 3 → Gather and clean sensor data → Principal Component Analysis → Train ML Model

The preprocessing step is naturally expressed in:
• SQL
• Dataflow APIs
The "gather and clean sensor data" step is a three-way join of the factory tables, filtering out NULL readings:

SELECT f1.v, f2.v, f3.v
FROM   factory_1 AS f1, factory_2 AS f2, factory_3 AS f3
WHERE  f1.sid = f2.sid AND f2.sid = f3.sid
  AND  f1.v IS NOT NULL AND f2.v IS NOT NULL AND f3.v IS NOT NULL
Matrix Operations/ML in SQL: Impedance mismatch!

Calculating a covariance matrix in SQL is possible but painful*, and the Principal Component Analysis and model-training steps do not map naturally onto SQL.

* From: http://stackoverflow.com/questions/18019469/finding-covariance-using-sql
Machine Learning in "Native" Syntax

The Principal Component Analysis and model-training steps are naturally expressed in:
• Matlab
• R
• …

// read factories table as matrix FacMatrix
[rows, cols] = size(FacMatrix)
colMeans = mean(FacMatrix)
NormMat = zeros(rows, cols)
for i = 1:rows
    NormMat(i, :) = FacMatrix(i, :) - colMeans;
end
CovMat = NormMat' * NormMat / (rows - 1)
// ...
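The same covariance computation, transcribed into plain Scala over nested arrays. This is a sketch for illustration; `covariance` and the row-major `Array[Array[Double]]` representation are assumptions, not Emma/Lara's matrix type.

```scala
// Covariance matrix of a row-major data matrix, mirroring the script above.
def covariance(m: Array[Array[Double]]): Array[Array[Double]] = {
  val rows = m.length
  val cols = m(0).length
  // colMeans = mean(FacMatrix)
  val colMeans = Array.tabulate(cols)(j => m.map(_(j)).sum / rows)
  // NormMat(i, :) = FacMatrix(i, :) - colMeans
  val norm = m.map(row => row.indices.map(j => row(j) - colMeans(j)).toArray)
  // CovMat = NormMat' * NormMat / (rows - 1)
  Array.tabulate(cols, cols) { (i, j) =>
    (0 until rows).map(r => norm(r)(i) * norm(r)(j)).sum / (rows - 1)
  }
}
```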
SQL + ML Native Syntax = Optimization Barrier

Factory 1 ⋈ Factory 2 ⋈ Factory 3 → Gather and clean sensor data (SQL) | Principal Component Analysis → Train ML Model (native ML syntax)

The hand-off between the two systems is an optimization barrier!
Big Data Language Desiderata

• Declarativity: what vs. how
  • Enables optimizability, eases programming
• User-defined functions as first-class citizens
  • An important reason why MapReduce took off
• Control flow for machine learning and graph workloads
  • Iterations, if-then-else branches
  • Turing-complete "driver" with parallelizable constructs
• Portability
  • Should support multiple backend engines (Flink, Spark, MapReduce, etc.)
• Deep embedding
  • e.g., à la LINQ (Language Integrated Query) in C#, Ferry in Haskell, etc.
while (…) {
  //SELECT f1.v, f2.v
  //FROM factory_1 AS f1, factory_2 AS f2
  //WHERE f1.sid == f2.sid AND f1.v != null AND f2.v != null
  val factories = for {
    f1 <- factory_1
    f2 <- factory_2
    if f1.sid == f2.sid &&
       f1.v != null &&
       f2.v != null
  } yield (f1.v, …, f2.v, …)

  val FacMatrix = factories.toMatrix()

  //Calculate the mean of each of the columns
  val colMeans = FacMatrix.cols( col => col.mean() )
  val NormMat = FacMatrix - colMeans
  //Calculate covariance matrix
  val CovMat = NormMat.t %*% NormMat / (NormMat.numRows - 1)
  // ...
}
The first half of the program is the "gather and clean data" step; the second half is Principal Component Analysis.

Emma/Lara: Databags & Matrices — deeply embedded in Scala

• Databag type
  • Like RDDs in Spark, DataSets in Flink, etc.
  • Operations on Databags are parallelized
• Declarative for-comprehensions
  • Like Select-From-Where in SQL
  • Can express join, cross, filter, etc.
• Databag => Matrix
  • Enables optimization across linear and relational algebra
  • Intermediate representation to reason across linear and relational algebras (WiP)
• "Native" matrix syntax
  • Against the impedance mismatch
[SIGMOD15/16, SIGMOD Record 16, BeyondMR 2016, PVLDB 2017]
UDFs: First-class citizens
Loops: First-class citizens
FP goodness: Comprehension Syntax

Python:  [ (x, y) for x in xs for y in ys if x == y ]
Scala:   for (x <- xs; y <- ys; if x == y) yield (x, y)
SQL:     SELECT x, y FROM xs AS x, ys AS y WHERE x = y
Math:    ⦃ (x, y) | x ∈ xs, y ∈ ys, x = y ⦄

Comprehensions generalize SQL and are available as first-class syntax in modern general-purpose programming languages.
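The Scala variant above is directly runnable; the for-comprehension desugars into the `flatMap`/`withFilter`/`map` calls that a deep embedding can intercept:

```scala
// A join as a Scala for-comprehension over plain Seqs.
val xs = Seq(1, 2, 3)
val ys = Seq(2, 3, 4)

val joined = for (x <- xs; y <- ys; if x == y) yield (x, y)

// desugared form — what the compiler actually emits:
val desugared = xs.flatMap(x => ys.withFilter(y => x == y).map(y => (x, y)))
```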
Automatic Optimization of Programs

• Caching of intermediate results
  • Reused intermediate results (e.g., within a loop) are automatically cached
• Choice of partitioning
  • Pre-partition datasets which are used in multiple joins
• Unnesting
  • e.g., converting exists() clauses into joins
• Join order/algorithm selection
  • Classic database optimizations
• Fold-group fusion
  • Calculate multiple aggregates in one pass
• …
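Fold-group fusion, for instance, replaces several independent passes over the same bag by a single fold. A hand-written sketch of the rewrite on plain Scala collections (the names are illustrative):

```scala
val vals = Seq(4.0, 1.0, 7.0, 2.0)

// Naive form: three separate passes over the data.
val sumN = vals.sum
val cntN = vals.size
val minN = vals.min

// Fused form: one fold computes all three aggregates in a single pass.
val (sum, cnt, min) = vals.foldLeft((0.0, 0, Double.MaxValue)) {
  case ((s, c, m), v) => (s + v, c + 1, math.min(m, v))
}
```

An optimizer performing fold-group fusion derives the fused form automatically from the three independent aggregate calls.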
A concrete optimization
BlockJoin: partitioning data for linear algebra operations through joins

[PVLDB '17] "BlockJoin: Efficient Matrix Partitioning Through Joins". A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, V. Markl. Proceedings of the VLDB Endowment, Vol. 10, No. 13, presented at VLDB 2018.
Row- vs. block-partitioned operations

• Relational algebra (e.g., aggregating over a row with avg): certain relational operations are better executed over row-partitioned tables.
• Linear algebra (e.g., matrix multiplication): scalable linear algebra operations are better executed over block-partitioned matrices — orders-of-magnitude speedups.
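Why blocks help multiplication: each pair of matching blocks yields one partial product, and partial products for the same target block are summed, so a block becomes the unit of processing and communication. A minimal single-machine sketch; the `Map`-of-blocks representation, `mul`, `add`, and `blockedMul` are illustrative assumptions.

```scala
type Block = Array[Array[Double]]

// naive dense multiply of two equally-sized square blocks
def mul(a: Block, b: Block): Block =
  Array.tabulate(a.length, a.length)((i, j) =>
    (0 until a.length).map(k => a(i)(k) * b(k)(j)).sum)

// element-wise sum of two equally-sized square blocks
def add(a: Block, b: Block): Block =
  Array.tabulate(a.length, a.length)((i, j) => a(i)(j) + b(i)(j))

// Blocked multiply: blocks are keyed by (blockRow, blockCol), mirroring the
// element-at-a-time model where each block is one dataflow element.
def blockedMul(a: Map[(Int, Int), Block], b: Map[(Int, Int), Block],
               nBlocks: Int): Map[(Int, Int), Block] =
  (for {
    i <- 0 until nBlocks; j <- 0 until nBlocks; k <- 0 until nBlocks
  } yield ((i, j), mul(a((i, k)), b((k, j)))))   // one partial product per (i,k,j)
    .groupBy(_._1)                               // group partials by target block
    .map { case (ij, parts) => ij -> parts.map(_._2).reduce(add) }
```

In a dataflow engine the `groupBy`/`reduce` over blocks shuffles far fewer, larger elements than a cell- or row-wise encoding would.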
Matrix Block-partitioning
Extending relational operators in order to produce or consume block-partitioned datasets has the potential to significantly speed up workloads that mix operators from linear and relational algebra. We put our focus on joins, and propose BlockJoin, a join operator which consumes row-partitioned relational data and outputs a block-partitioned matrix. Designing such an operator poses a number of challenges. First, in order to block-partition a matrix which results from a join, the matching tuples of the two joining sides need to be known in advance. This stems from the fact that a matrix block can contain columns from both input relations; the partitioning function cannot know which those columns are until all pairs of matching tuples in the two joined relations have been associated. Therefore, such workloads are typically implemented by first materializing the join result and subsequently block-partitioning it.

In contrast, BlockJoin is based on the observation that it is sufficient to know which keys and corresponding tuple identifiers match prior to materializing the complete join result. Analogous to joins proposed for columnar databases [23, 8, 1], BlockJoin builds on two main concepts: index joins and late materialization. More specifically, BlockJoin first identifies the matching tuple pairs by performing a join on the keys of the two relations (analogous to TID-joins [24]). After the matching tuple-id pairs have been identified, a partitioning function derives, for each tuple, the blocks to which that tuple contributes data. Finally, it materializes the join result by sending tuples to the machines hosting the corresponding blocks. Thus, BlockJoin completely avoids materializing the join result in a row-partitioned manner; instead, it performs the joining and block-partitioning phases together. Our experiments show that BlockJoin performs up to 7x faster than the baseline of conducting a row-wise join followed by a block-partitioning step.

The contributions of this paper are summarized as follows:

• We discuss and motivate the need for relational operators producing block-partitioned datasets (Section 2.2).
• We propose BlockJoin, a distributed join algorithm which produces block-partitioned results for workloads mixing linear and relational algebra operations. To the best of our knowledge, this is the first work proposing a relational operator for block-partitioned results (Section 3).
• We compare the performance of BlockJoin against a baseline approach, which first joins row-partitioned relations and then block-partitions the results. We thereby exemplify BlockJoin's superiority experimentally and via a cost model (Section 3.3).
• We provide a reference implementation of BlockJoin based on Apache Spark [38] and experimentally show that, depending on the size and shape of the input relations, BlockJoin is up to 7x faster compared to the baseline approach. Moreover, we show that for tables with large numbers of columns, BlockJoin scales gracefully where the baseline approach fails (Sections 4 and 5).
Figure 2: A matrix and its blocked representation.

(a) Original matrix M (4 × 4):
    1.1  1.2  1.3  1.4
    2.1  2.2  2.3  2.4
    3.1  3.2  3.3  3.4
    4.1  4.2  4.3  4.4

(b) 2 × 2 blocked matrix: four 2 × 2 blocks; e.g., the top-left block holds
    1.1  1.2
    2.1  2.2
and the bottom-left block M2,1 holds
    3.1  3.2
    4.1  4.2
2. BACKGROUND
In this section we introduce the blocked matrix representation and discuss a running example which will be used throughout the paper. For the example, we also depict a baseline implementation in a distributed dataflow system.

2.1 Block-Partitioned Matrix Representation
Distributed dataflow systems use an element-at-a-time processing model in which an element typically represents a line in a text file or a tuple of a relation. Systems which implement matrices in this model can choose among a variety of partitioning schemes (e.g., cell-, row- or column-wise partitioning) of the matrix. For common operations such as matrix multiplications, all of these representations incur huge performance overhead [16]. Block-partitioning the matrix provides significant performance benefits, including a reduction in the number of tuples required to represent and process a matrix, block-level compression, and the optimization of operations like multiplication on a block-level basis. These benefits have led to the widespread adoption of block-partitioned matrices in parallel data processing platforms [16, 18, 19]. In a blocked representation, the matrix is split into submatrices of fixed size, called blocks. These blocks become the processing elements in the dataflow system. Figure 2 shows a matrix and its blocked representation. In the remainder of the paper, we will refer to individual blocks using horizontal and vertical block indices: for instance, given a matrix M, we denote the bottom-left block of Figure 2 (b) as M2,1. While our algorithm works with arbitrarily shaped blocks, the majority of systems use square blocks only, and we adopt this convention in the rest of the paper for the sake of simplicity.

2.2 Motivating Example
For our motivating example we choose a common use case of learning a spam detection model in an e-commerce application. Assume that customers write reviews for products, some of which are spam, and we want to train a classifier to automatically detect the spam reviews. The data for products and reviews is stored in different files in a distributed filesystem, and we need an ML algorithm to learn from attributes of both relations. Therefore, we first need to join the records from these tables to obtain reviews with their corresponding products. Next, we need to transform these product-review pairs into a suitable representation for an ML algorithm, which means we need to apply a user-defined function (UDF) that transforms the attributes into a vector representation. Finally, we aggregate these vectors into a distributed, blocked feature matrix to feed them into an ML system (such as SystemML).
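The pipeline up to the feature matrix can be sketched on plain collections. The schemas and the `vectorize` UDF below are illustrative assumptions, not taken from the paper:

```scala
// Join products with reviews, then vectorize each pair into a feature row.
case class Product(id: String, name: String, price: Double)
case class Review(productId: String, text: String, stars: Int, spam: Boolean)

// hypothetical vectorization UDF: attributes -> numeric features
def vectorize(p: Product, r: Review): Array[Double] =
  Array(p.price, r.stars.toDouble, r.text.length.toDouble)

def featureMatrix(products: Seq[Product], reviews: Seq[Review]): Seq[Array[Double]] =
  for {
    p <- products
    r <- reviews
    if p.id == r.productId   // the join the pipeline needs before blocking
  } yield vectorize(p, r)
```

In the distributed setting, the resulting rows would then be block-partitioned — exactly the join-then-repartition pattern BlockJoin fuses.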
Common Pattern: Blocked Matrices as Join Results

E.g., join products and their reviews to form a feature matrix; the matrix is block-partitioned to be used in linear algebra operations:

Products ⋈ Reviews → Feature Matrix → (repartition) → Block-Partitioned Matrix

Data shuffling dominates execution time.
• Challenge: can we join and block-partition results with less shuffling?
• Solution: BlockJoin, a one-step parallel join algorithm which outputs block-partitioned results
• Basic idea:
  1. Find out which and how many tuples survive the join (without shuffling)
  2. Give each surviving tuple a unique, sequential ID
  3. Shuffle only surviving tuples and block-partition the joining tuples on the fly
• Results: up to >7x speedup over a row-wise join followed by block-partitioning
  • The implementation can be further optimized (compression, sorting, etc.)
  • BlockJoin can be used for multi-way joins (where we expect larger speedups)
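Step 3 boils down to computing, from a surviving tuple's sequential row id, the coordinates of the blocks it contributes to. A sketch under the square-block convention from the paper; `targetBlocks` itself is an illustrative name:

```scala
// For a row with sequential id `rowIdx` and `numCols` columns, list the
// (blockRow, blockCol) coordinates of all blocks this row contributes to.
def targetBlocks(rowIdx: Long, numCols: Int, blockSize: Int): Seq[(Long, Int)] = {
  val blockRow = rowIdx / blockSize                          // vertical block index
  val numBlockCols = (numCols + blockSize - 1) / blockSize   // ceiling division
  (0 until numBlockCols).map(blockCol => (blockRow, blockCol))
}
```

This mapping drives the shuffle: each surviving tuple is sent only to the machines hosting its target blocks.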
Goal: Join two tables, then block-partition the result

Original tables: User U = { (a,1), (b,1), (d,1) } and Transaction T = { (a,2,3), (a,2,3), (a,2,3), (b,2,3), (c,2,3) }.

U ⋈ T = { (a,1,2,3), (a,1,2,3), (a,1,2,3), (b,1,2,3) }

The join result is then block-partitioned into four blocks B[0,0], B[0,1], B[1,0], B[1,1].
Baseline: join two tables, then block-partition the result

Both tables are shuffled by join key across the nodes; each node then joins its local partitions (e.g., Node 1 joins the 'a' tuples of U and T, Node 2 the 'b' and 'd' tuples), and a second shuffle repartitions the row-wise join result into the blocks B[0,0], B[0,1], B[1,0], B[1,1].
BlockJoin: a fused join + block-partitioning operator. Saves heavy repartitioning costs.
Step I: Local Index Join
1. Find which tuples survive the join.
2. Prepare surviving tuple IDs and metadata.
3. Broadcast tuple IDs and metadata to all nodes.
[Figure: BlockJoin on the Products/Reviews example. A tuple identifier is TID = <Relation, Partition-ID, Index within Partition>. Step I: (1) collect <key, TID> pairs at the driver, (2) index-join them and assign row indices, (3) broadcast the resulting <join key, TID_P, TID_R, row-idx> metadata to all nodes. Step II comes in two variants: late materialization — (4) fetch kernel, (5) range-partition & local sort, (6) merge & materialize blocks — and early materialization — (4) fetch kernel, (5) split, (6) create blocks, (7) merge blocks.]
Step II: Block Materialization
Exchange tuple parts in order to materialize blocks on each node participating in the join. There are two materialization strategies: late and early.

Main results:
• Up to 7x speedups compared to separate joining and block-partitioning
• Scales more gracefully
BlockJoin step 1: Find the surviving tuples with a semi-join/index-join

The driver node collects the join keys of U and T, builds a scaffold of the final result (similar to an index-join) — here the surviving key sequence a, a, a, b with row indices 0–3 — and broadcasts it to all nodes. After the broadcast, every node holds the scaffold, which determines the block partitioning in the next step.
BlockJoin step 2: Fill in the blanks and shuffle the right tuples to their respective blocks

Using the broadcast scaffold, each node determines for its local U and T tuples which blocks they contribute to, and shuffles only the surviving tuple parts directly to the nodes hosting the target blocks B[0,0], B[0,1], B[1,0], B[1,1], where the blocks are materialized.
The devil is in the detail (details skipped to simplify the presentation)

• The example shown assumed:
  • A primary key on User and a foreign key on Transaction — this can be relaxed either (1) similarly to late materialization in columnar databases or (2) with extra counters
  • Ordered relations — the algorithm also works with unordered relations
• The algorithm performs best when:
  • Join selectivity is low (fewer tuples survive — huge savings vs. row-wise partitioning)
  • The number of columns is large
• Up to >7x speedup over a row-wise join followed by block-partitioning
  • The implementation can be further optimized (compression, sorting, etc.)
  • BlockJoin can be used for multi-way joins (where we expect larger speedups)
The road ahead
Deep in-database analytics
Linear Algebra + Relational Algebra = Theoretical Foundation

• Goal
  • An intermediate representation/algebra to support equivalences and transformations of mixed linear-relational algebra programs
• Fundamental questions
  • What is the right intermediate representation?
  • Which set of transformations/equivalences is needed for optimization?
  • Do current cost models work?
• Main challenge
  • Data model mismatch: unordered sets (relations) vs. ordered/indexed matrices
(The Emma example program from earlier — relational preprocessing via for-comprehensions, followed by linear algebra on the converted matrix — is compiled through the stack below.)
Compilation stack:
• Programming Abstraction
• Compiler/Parser
• Intermediate Representation
• Optimizer
• Optimized Program
• Code Generator
• Dataflow Processing Jobs: Apache Flink, Apache Spark, Hadoop
• Cluster Management (Kubernetes, YARN, …)
• Hadoop Distributed File System (HDFS)
Linear Algebra + Relational Algebra = Systems Research

• Goal
  • Systems natively supporting mixed linear-relational algebra programs
• Fundamental questions
  • Data partitioning schemes (row- vs. column- vs. block-partitioning)?
  • Relational operators operating on block-partitioned data?
  • Linear algebra operators operating on row/column-partitioned data?
  • Operator fusion?
• Main challenges
  • Novel, non-trivial operators (e.g., BlockJoin)
  • Reuse of existing operators
Thank you!
[email protected]
http://asterios.katsifodimos.com
The DataBag abstraction

class DataBag[+A] {
  // Type Conversion
  def this(s: Seq[A])                                     // Scala Seq -> DataBag
  def fetch()                                             // DataBag -> Scala Seq
  // Input/Output (static)
  def read[A](url: String, format: ...): DataBag[A]
  def write[A](url: String, format: ...)(in: DataBag[A])
  // Monad Operators (enable comprehension syntax)
  def map[B](f: A => B): DataBag[B]
  def flatMap[B](f: A => DataBag[B]): DataBag[B]
  def withFilter(p: A => Boolean): DataBag[A]
  // Nesting
  def groupBy[K](k: (A) => K): DataBag[Grp[K, DataBag[A]]]
  // Difference, Union, Duplicate Removal
  def minus[B >: A](subtrahend: DataBag[B]): DataBag[B]
  def plus[B >: A](addend: DataBag[B]): DataBag[B]
  def distinct(): DataBag[A]
  // Structural Recursion
  def fold[B](z: B, s: A => B, p: (B, B) => B): B
  // Aggregates (aliases for various folds)
  def minBy, min, sum, product, count, empty, exists, ...
}
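The aggregates at the bottom of the interface all reduce to `fold(z, s, p)` (structural recursion). A minimal in-memory stand-in — `MiniBag` is illustrative, not Emma's actual implementation — shows how the aliases are just special cases of fold:

```scala
// Minimal in-memory stand-in for the DataBag interface above.
class MiniBag[A](private val elems: Seq[A]) {
  // structural recursion: map each element with s, combine results with p from z
  def fold[B](z: B, s: A => B, p: (B, B) => B): B =
    elems.map(s).foldLeft(z)(p)

  // the aggregate "aliases" are folds:
  def count: Long                      = fold(0L, _ => 1L, _ + _)
  def exists(q: A => Boolean): Boolean = fold(false, q, _ || _)
  def sum(implicit n: Numeric[A]): A   = fold(n.zero, identity[A], (x: A, y: A) => n.plus(x, y))
}
```

Because `s` and `p` form a commutative-monoid-style aggregation, each fold can be evaluated in parallel over partitions — which is what makes the abstraction safe to parallelize.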