big data: challenges and some solutions207e4193-139b-40ec-9040... · 2017-11-29 · stream and...

51
1 © Volker Markl © 2013 Berlin Big Data Center • All Rights Reserved 1 © Volker Markl Big Data: Challenges and Some Solutions Stratosphere, Apache Flink, and Beyond Volker Markl http://www.user.tu-berlin.de/marklv/ http://www.dima.tu-berlin.de http:// www.dfki.de/web/forschung/iam http://bbdc.berlin [email protected]

Upload: others

Post on 28-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

1 © Volker Markl© 2013 Berlin Big Data Center • All Rights Reserved1 © Volker Markl

Big Data: Challenges and Some Solutions Stratosphere, Apache Flink, and Beyond

Volker Marklhttp://www.user.tu-berlin.de/marklv/

http://www.dima.tu-berlin.dehttp://www.dfki.de/web/forschung/iam

http://bbdc.berlin

[email protected]

Page 2: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

2 © Volker Markl2

2 © Volker Markl

Thanks to my team members and students

• Dr. Stephan Ewen• Sebastian Schelter• Dr. Kostas Tzoumas• Dr. Asterios Katsifodimos• Fabian Hüske• Alexander Alexandrov• Max Heimeland many more members of the Stratosphere Project, theBerlin Big Data Center, and the Apache Flink community

Page 3: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

3 © Volker Markl3 © 2013 Berlin Big Data Center • All Rights Reserved

3 © Volker Markl

ML

ML

algorithms

algorithms

data too uncertain Veracity Data Mining MATLAB, R, PythonPredictive/Prescriptive MATLAB, R, Python

Data & Analysis: Increasingly Complex!

data volume too large Volumedata rate too fast Velocitydata too heterogeneous Variability

Data

Reporting aggregation, selectionAd-Hoc Queries SQL, XQueryETL/ELT MapReduce

Analysis

DM

DM

scalability

scalability

Page 4: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

4 © Volker Markl4 © 2013 Berlin Big Data Center • All Rights Reserved

4 © Volker Markl

Application

DataScience

Control Flow

Iterative Algorithms

Error Estimation

Active Sampling

Sketches

Curse of Dimensionality

Decoupling

Convergence

Monte Carlo

Mathematical ProgrammingLinear Algebra

Stochastic Gradient Descent

Regression

Statistics

Hashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra / SQL

Scalability

Data Analysis LanguageCompiler

Memory Management

Memory Hierarchy

Data Flow

Hardware Adaptation

Indexing

Resource Management

NF2 /XQuery

Data Warehouse/OLAP

“Data Scientist” – “Jack of All Trades!”Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)

Real-TimeNew Technology to the Rescue!

Page 5: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

5 © Volker Markl5

5 © Volker Markl

http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

A Zoo of Technologies!

Page 6: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

6 © Volker Markl6

6 © Volker Markl

Big Data Analytics Requires Systems Programming

R/Matlab:4 million users

Hadoop:200,000

users

Data Analysis Statistics

AlgebraOptimization

Machine LearningNLP

Signal ProcessingImage Analysis

Audio-,Video AnalysisInformation IntegrationInformation Extraction

Data Value ChainData Analysis Process

Predictive Analytics

IndexingParallelizationCommunicationMemory ManagementQuery OptimizationEfficient AlgorithmsResource ManagementFault ToleranceNumerical Stability

Big Data is now where database systems were in the70s (prior to relational algebra, query optimizationand a SQL-standard)!

People with Big DataAnalytics Skills

Declarative languages to the rescue!

Page 7: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

7 © Volker Markl7 © 2013 Berlin Big Data Center • All Rights Reserved

7 © Volker Markl

Deep Analysis of „Big Data“ is Key!

Small Data Big Data (3V)

Dee

p A

naly

tics

Sim

ple

Ana

lysi

s

Many new companies and products are emerging to enable deep big data analysis;strong European contenders include Apache Flink, Parstream, and Exasol.„New companies“ are the (b)leading users of these technologies, e.g., in theinformation economy (e.g., Zalando, Amazon, Researchgate, Soundcloud, Spotify).„Traditional Big companies“ are following and still determining strategies (Industrie4.0, Logistics, Telco, etc.). Most SMEs are not ready yet to capitalize on Big Data.

Page 8: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

8 © Volker Markl8 © 2013 Berlin Big Data Center • All Rights Reserved

8 © Volker Markl

ML DMFeature EngineeringRepresentationAlgorithms (SVM, GPs, etc.)

Think ML-algorithms in a scalable way

Process iterative algorithms

in a scalable way

Technology X

Machine Learning + Data Management = X

Declarative LanguagesAutomatic AdaptionScalable processing

Control flowIterative Algorithms

Error EstimationActive Sampling

Sketches

Curse of DimensionalityIsolation Convergence

Monte Carlo

Mathematical ProgrammingLinear Algebra

Regression

StatisticHashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra/SQL

Scalability

Data Analysis Language

CompilerMemory Management

Memory Hierarchy

Dataflow

Hardware adaption

Indexing

Resource Management

NF2 /XQuery Data Warehouse/OLAP

declarative

Goal: Data Analysis without System Programming!

Page 9: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

9 © Volker Markl9

9 © Volker Markl

Agenda• On Stratosphere and Flink

– Conception, Initial Contributions– Streaming Data Analysis– The Flink Community

• On Roman Generals and Big Data Analytics• On Emma and Mosaics

Page 10: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

10 © Volker Markl10 © Volker Markl

Apache Flink

June 2, 2008

Aug 26, 2014

Page 11: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

11 © Volker Markl

• Relational Algebra• Declarativity• Query Optimization• Robust Out-of-core

• Scalability• User-defined

Functions • Complex Data Types• Schema on Read

• Iterations• Advanced Dataflows• General APIs• Native Streaming

11

Draws onDatabase Technology

Draws onMapReduce Technology

Adds

Stratosphere: General Purpose Programming + Database Execution

A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010)D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scaleanalytical processing. SoCC 2010: 119-130A. Alexandrov, R.Bergmann, S. Ewen, et al: The Stratosphere platform for big data analytics. VLDB J. 23(6): 939-964 (2014)

Page 12: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

12 © Volker Markl12

12 © Volker Markl

Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

What is Apache Flink?

http://flink.apache.org

• Key Features:• Bounded and unbounded data• Event time semantics• Stateful and fault-tolerant• Running on thousands of nodes with

very good throughput and latency• Exactly-once semantics for stateful

computations.• Flexible windowing based on time,

count, or sessions in addition to data-driven windows

• DataSet and DataStream programming abstractions are the foundation for user programs and higher layers

Page 13: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

13 © Volker Markl13 © 2013 Berlin Big Data Center • All Rights Reserved

13 © Volker Markl

Technology inside Flinkcase class Path (from: Long, to:Long)val tc = edges.iterate(10) {

paths: DataSet[Path] =>val next = paths

.join(edges)

.where("to")

.equalTo("from") {(path, edge) =>

Path(path.from, edge.to)}.union(paths).distinct()

next}

Cost-based optimizer

Type extraction stack

Task scheduling

Recoverymetadata

Pre-flight (Client)

MasterWorkers

DataSource

orders.tbl

Filter

Map DataSource

lineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0] hash-part [0]

GroupRed

sort

forward

Program

DataflowGraph

Memory manager

Out-of-core algos

Batch & Streaming

State & Checkpoints

deployoperators

trackintermediate

results

P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas:Apache Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38(4): 28-38 (2015)

Page 14: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

14 © Volker Markl14 © 2013 Berlin Big Data Center • All Rights Reserved

14 © Volker Markl

Effect of optimization

14

Run on a sampleon the laptop

Run a month laterafter the data evolved

Hash vs. SortPartition vs. BroadcastCachingReusing partition/sortExecution

Plan A

ExecutionPlan B

Run on large fileson the cluster

ExecutionPlan C

Page 15: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

15 © Volker Markl15 © 2013 Berlin Big Data Center • All Rights Reserved

15 © Volker Markl

Why optimization ?

Do you want to hand-tune that?

F. Hueske, M. Peters, A. Krettek, M. Ringwald, K. Tzoumas, V. Markl, J.C. Freytag:Peeking into the optimization of data flow programs with MapReduce-style UDFs. ICDE 2013: 1292-1295

Page 16: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

16 © Volker Markl16 © 2013 Berlin Big Data Center • All Rights Reserved

16 © Volker Markl

ITERATIONS IN DATA FLOWSÆ MACHINE LEARNING ALGORITHMS

S. Ewen, S. Schelter, K. Tzoumas, D. Warneke, V. Markl: Iterative Parallel Data Processing with Stratosphere: an Inside Look. SIGMOD 2013S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl:Spinning Fast Iterative Data Flows. PVLDB 5(11): 1268-1279 (2012)

Page 17: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

17 © Volker Markl17

17 © Volker Markl

Iterate by looping

• for/while loop in client submits one job per iteration step• Data reuse by caching in memory and/or disk

Step Step Step Step Step

Client

Page 18: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

18 © Volker Markl18

18 © Volker Markl

Iterate in the Dataflow

partial solution

partial solution X

other datasets

Y initial solution

iteration result

Replace

Step function

Page 19: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

19 © Volker Markl19 © 2013 Berlin Big Data Center • All Rights Reserved

19 © Volker Markl

Large-Scale Machine Learning

Factorizing a matrix with28 billion ratings forrecommendations

(Scale of Netflixor Spotify)

More at: http://data-artisans.com/computing-recommendations-with-flink.html

Page 20: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

20 © Volker Markl20

20 © Volker Markl

Optimizing iterative programs

Caching Loop-invariant DataPushing work„out of the loop“

Maintain state as index

Page 21: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

21 © Volker Markl21 © 2013 Berlin Big Data Center • All Rights Reserved

21 © Volker Markl

STATE IN ITERATIONSÆ GRAPHS AND MACHINE LEARNING

Page 22: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

22 © Volker Markl22

22 © Volker Markl

Iterate natively with deltas

partial solution

deltasetX

otherdatasets

Y initialsolution

iterationresult

workset A B workset

Merge deltas

Replace

initialworkset

Page 23: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

23 © Volker Markl23

23 © Volker Markl

Effect of delta iterations…

0

5000000

10000000

15000000

20000000

25000000

30000000

35000000

40000000

45000000

1 6 11 16 21 26 31 36 41 46 51 56 61

# of

ele

men

ts u

pdat

ed

iteration

Page 24: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

24 © Volker Markl24 © 2013 Berlin Big Data Center • All Rights Reserved

24 © Volker Markl

… very fast graph analysis

… and mix and matchETL-style and graph analysisin one program

Performance competitivewith dedicated graph

analysis systems

More at: http://data-artisans.com/data-analysis-with-flink.html

Page 25: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

25 © Volker Markl25 © 2013 Berlin Big Data Center • All Rights Reserved

25 © Volker Markl

“STREAMING DATA” ANALYSIS

Page 26: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

26 © Volker Markl26

26 © Volker Markl

Bounded and unbounded data• Bounded data: a dataset with a natural beginning and

end– E.g., current position of all trucks in a fleet, content of a data

warehouse

• Unbounded data: a dataset without a natural beginning and end– E.g., customers of our products, tweets about our product

• Note: few data is bounded by nature; most bounded data is a view over unbounded data

26

By courtesy of Kostas Tzoumas

Page 27: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

27 © Volker Markl27

27 © Volker Markl

Stream and batch processing• Stream processing: continuous processing that

continuously produces results– E.g., a Java program that connects to a socket and parses the

socket contents– Apache Flink, Apache Storm

• Batch processing: processing that takes finite time to complete and produces results only in the end– E.g., sorting a file– Apache Hadoop MapReduce, Apache Spark

27

By courtesy of Kostas Tzoumas

Page 28: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

28 © Volker Markl28

28 © Volker Markl

How do they fit together

28

• Batch processing over bounded data– Natural

• Stream processing over bounded data– Treats bounded data as subset of stream

• Batch processing over unbounded data– Needs a pre-processing stream processing phase that splits

stream into chunks

• Stream processing over unbounded data– Natural

By courtesy of Kostas Tzoumas

Page 29: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

29 © Volker Markl29

29 © Volker Markl

A different view• What changes faster? Your code or your data?

• ddata/dt >> dcode/dt is a data streaming problem

• dcode/dt >> ddata/dt is a data exploration problem (and likely to become a data streaming problem later)

29

Credits Joe Hellerstein

Page 30: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

30 © Volker Markl30

30 © Volker Markl

Defining windows in Flink

• Trigger policy– When to trigger the computation on current window

• Eviction policy– When data points should leave the window– Defines window width/size

• E.g., count-based policy– evict when #elements > n– start a new window every n-th element

• Built-in: Count, Time, Delta policies

Page 31: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

31 © Volker Markl31

31 © Volker Markl

Checkpointing / Recovery• Flink acknowledges batches of records

– Less overhead in failure-free case– Currently tied to fault tolerant data sources (e.g., Kafka)

• Flink operators can keep state– State is checkpointed– Checkpointing and record acks go together

• Exactly one semantics for state

Page 32: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

32 © Volker Markl32 © 2013 Berlin Big Data Center • All Rights Reserved

32 © Volker Markl

Checkpointing / Recovery

32Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots

Pushes checkpoint barriersthrough the data flow

Operator checkpointstarting

Checkpoint done

Data Stream

barrier

Before barrier =part of the snapshot

After barrier =Not in snapshot

Checkpoint done

checkpoint in progress

(backup till next snapshot)

Page 33: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

33 © Volker Markl33 © Volker Markl

Some Benchmark ResultsInitially performed by Yahoo! Engineering, Dec 16, 2015,

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and Flink 0.10.1 show sub- second latencies at relatively high throughputs[..]. Spark

streaming 1.5.1 supports high throughputs, but at a relatively higher

latency.

http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-athttps://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Page 34: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

34 © Volker Markl34 © 2013 Berlin Big Data Center • All Rights Reserved

34 © Volker Markl

THE FLINK COMMUNITY

J. Soto and V. Markl, "A Historical Account of Apache Flink," [Online]. Available: http://www.dima.tu-berlin.de/fileadmin/fg131/ Informationsmaterial/Apache_Flink_Origins_for_Public_Release.pdf

Page 35: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

35 © Volker Markl35

35 © Volker Markl

Flink Community as of Today

479

10,659

50

678

959314

243

1,393365376

2,179

541

452

196

18,884 Members313 Contributors41 Meetups30 Cities15 Countries 0

50

100

150

200

250

300

350

Feb 09 Jun 10 Nov 11 Mrz 13 Jul 14 Dez 15 Apr 17

Contributors

Page 36: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

36 © Volker Markl36 © Volker Markl

36

30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily

Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees

Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second

By courtesy of Kostas Tzoumas

Some Highly Engaged Users

Page 37: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

37 © Volker Markl37

37 © Volker Markl

Other Companies in the Flink Community

https://flink.apache.org/poweredby.html

Page 38: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

38 © Volker Markl38

38 © Volker Markl

Flink in the ecosystem

POSIX Java/ScalaCollections

POSIX

By courtesy of Kostas Tzoumas

Page 39: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

39 © Volker Markl39

39 © Volker Markl

2008 2009

2012

2010

20112012

2010

2014

2015 2015 2015

201620162017

> 30 Companies using Flink

Timeline of Flink

Page 40: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

40 © Volker Markl40 © 2013 Berlin Big Data Center • All Rights Reserved

40 © Volker Markl

Evolution of Big Data Platforms

Page 41: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

41 © Volker Markl41

41 © Volker Markl

Fault tolerancePessimistic Recovery:• Write intermediate state to stable storage• Restart from such a checkpoint in case of a failure

Problematic:• High overhead, checkpoint must

be replicated to other machines• Overhead always incurred, even if no

failures happen!

¾ How can we avoid this overhead in failure-free casesru

nti

me

per

iter

atio

n (

sec)

} actual work

} checkpointingoverhead

20

120

Page 42: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

42 © Volker Markl42

42 © Volker Markl

Optimistic recovery• Many data mining algorithms are fixpoint algorithms• Optimistic Recovery: jump to a different state in case of a failure,

still converge to solution

• No checkpoints Æ No overhead in absense of failures!• algorithm-specific compensation function must restore state

pessimistic recovery optimistic recoveryfailure-free

S. Schelter, S. Ewen, K. Tzoumas, V. Markl: "All roads lead to Rome": optimistic recovery for distributed iterative data processing. CIKM 2013S. Dudoladov, C. Xu, S. Schelter, A. Katsifodimos, S. Ewen, K. Tzoumas, V. Markl: Optimistic Recovery for Iterative Dataflows. SIGMOD 2015

Page 43: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

43 © Volker Markl43 © Volker Markl

Declarative Data Processing and Mosaics

Page 44: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

44 © Volker Markl44

44 © Volker Markl

A Billion $$$ Mantra...

Declarative Data Processing

SQL Relations RDBMS

A simple, high-level language for querying data (Chamberlin ’74).

An effective, formal foundation based on relational algebra and calculus (Codd ’71).

An efficient, low-level execution environment tailored towards the data (Selinger ’79).

Page 45: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

45 © Volker Markl45

45 © Volker Markl

With 40+ years of success...

Declarative Data Processing

SQL Relations RDBMS

Page 46: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

46 © Volker Markl46

46 © Volker Markl

Is Being Revised

Declarative Data Processing

SQL Relations RDBMS

DistributedCollections

Parallel DataflowEngines

Second-OrderFunctions

Page 47: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

47 © Volker Markl47

47 © Volker Markl

Fine grained parallelism through deep language Embedding (EMMA)

Emma Source (backed by Scala AST)

DataBag StreamMatrix

Emma Core (backed by Scala AST)

Target Language + Runtime

(iii) Lowering

Programming Abstractions

(i) Lifting

Common Concrete Syntax(realized as Scala eDSL)

Common Abstract Syntax

e.g. Spark, Flink, MM Engine

A. Alexandrov, A. Salzmann, G. Krastev, A. Katsifodimos, V. Markl: Emma in Action: Declarative Dataflows for Scalable Data Analysis. SIGMOD 2016A. Alexandrov, A. Katsifodimos, G. Krastev, V. Markl: Implicit Parallelism through Deep Language Embedding. SIGMOD Record 45(1): 51-58 (2016)

Page 48: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

48 © Volker Markl48

48 © Volker Markl

MosaicsFrontend:

Backends:

Page 49: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

49 © Volker Markl49

49 © Volker Markl

Research1. Unifying Modelling Across

Theories2. Cross Theory Optimization

3. Optimizing Across Engines4. Predicting and Learning

Program Runtimes

5. Optimizing Across Hardware6. Generating Hardware-Targeted

Code

Page 50: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

50 © Volker Markl50

50 © Volker Markl

Law

Society

EconomyTechnology

Application

Business ModelsBenchmarkingOpen Source & Open DataDeployment ModelsInformation PricingInformation Marketplaces

Data ManagementData ProcessingStatistics/MLLinguistics/Text&SpeechHCI/VisualizationNovel Computer ArchitecturesSecurity

The Five Dimensions of Big DataOwnershipCopyright/IPRLiabilityInsolvencyPrivacy

User BehaviourSocial ProcessesCollaborationPolitical ProcessesPublic Good

Economy & BusinessPublic SectorHealthcareTransportationHumanitiesSciences

SystemsFrameworks

SkillsBest-Practices

Tools

Page 51: Big Data: Challenges and Some Solutions207e4193-139b-40ec-9040... · 2017-11-29 · Stream and batch processing • Stream processing: continuous processing that continuously produces

51 © Volker Markl

Acknowledgements

I would like to thank the members of the Stratosphere project, theBerlin Big Data Center, and my national and internationalcollaborators. In particular, I would like to thank my former andcurrent students and postdocs at DIMA/TU Berlin and DFKI.Without them it would not have been possible to realize the visionsinto concrete software system artifacts. Last, but not least, I wouldlike to thank all of the contributors and users in the Apache Flinkcommunity, without them the Stratosphere project and thecorrelated research of the Berlin Big Data Center would not haveachieved the worldwide impact that we have experienced over thelast few years