
Page 1: Data Stream Processing - Concepts and Frameworks

Data Stream Processing – Concepts and Frameworks
Matthias Niehoff

1

Page 2: Data Stream Processing - Concepts and Frameworks

AGENDA

2

Typical Problems

Basic Ideas

Streaming Frameworks

Current Innovations

Recommendations

Page 3: Data Stream Processing - Concepts and Frameworks

3

Basic Ideas
Data Stream Processing – Why and what is it?

Page 4: Data Stream Processing - Concepts and Frameworks

Batch Layer

Speed Layer

Current Situation of Dealing with (Big) Data

4

Page 5: Data Stream Processing - Concepts and Frameworks

IoT Sensor Data: industrial machines, consumer electronics, agriculture

Click Streams: online shops, self-service portals, comparison portals

Monitoring: system health, traffic between systems, resource utilization

Online Gaming: gamer interactions, reward systems, custom content & experiences

Automotive Industry: vehicle tracking, predictive maintenance, routing information

Financial Transactions: fraud detection, trade monitoring and management

Sources of streaming data are not only found in the frequently mentioned IoT area; streaming data arises in many other industries as well. Strictly speaking, any data can be viewed as a stream. Some of the most popular use cases are the examples shown above.

5

Sources for Streaming Data

Page 6: Data Stream Processing - Concepts and Frameworks

Distributed Stream Processing

6

Page 7: Data Stream Processing - Concepts and Frameworks

7

Endless & Continuous Data

7

Page 8: Data Stream Processing - Concepts and Frameworks

8

Speed & Realtime

Page 9: Data Stream Processing - Concepts and Frameworks

9

Distributed & Scalable

Page 10: Data Stream Processing - Concepts and Frameworks

First step – Microbatching

10

Source

Processing

Sink

Microbatches

Page 11: Data Stream Processing - Concepts and Frameworks

Native Streaming

11

Source

Processing

Sink

Page 12: Data Stream Processing - Concepts and Frameworks

12

Typical Problems
and the way frameworks tackle them

Page 13: Data Stream Processing - Concepts and Frameworks

13

Time

Page 14: Data Stream Processing - Concepts and Frameworks

14

Order

Page 15: Data Stream Processing - Concepts and Frameworks

Event time vs processing time

15

[Figure: an event occurring at one point in event time is processed later in processing time; timeline t in minutes, 1–9]

Page 16: Data Stream Processing - Concepts and Frameworks

Windowing - Slicing data into chunks

16

Tumbling Window Sliding Window Session Window

Time Trigger Count Trigger Content Trigger

Page 17: Data Stream Processing - Concepts and Frameworks

Tumbling & Sliding Windows

17

stream: 4 5 3 6 1 5 9 2 8 6 7 2

tumbling windows (each element belongs to exactly one window):
[4 5 3 6] [1 5 9 2] [8 6 7 2] → sums 18, 17, 23

sliding windows (windows overlap, an element can belong to several windows):
[4 5 3 6] [3 6 1 5] [1 5 9 2] [9 2 8 6] [8 6 7 2] → sums 18, 15, 17, 25, 23
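A minimal sketch of the example above (assuming Flink's DataStream API and processing-time windows; the values are just the sample numbers from the slide): tumbling windows assign every element to exactly one window, sliding windows let an element appear in several overlapping windows.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSums {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Integer> values = env.fromElements(4, 5, 3, 6, 1, 5, 9, 2, 8, 6, 7, 2);

    // tumbling windows: every element belongs to exactly one window
    values.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
          .reduce(Integer::sum)
          .print();

    // sliding windows: windows overlap, an element can belong to several windows
    values.windowAll(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
          .reduce(Integer::sum)
          .print();

    // with a real, continuous source the sums are emitted whenever a window closes
    env.execute("window sums");
  }
}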

Page 18: Data Stream Processing - Concepts and Frameworks

Session Window

18

[Figure: events of two users on a time axis grouped into sessions; a logout event and a still-delayed event raise the question of where the session ends]

Page 19: Data Stream Processing - Concepts and Frameworks

Session Window

19

[Figure: the same sessions once the delayed event has arrived; it falls before the logout and is still assigned to the open session]

Page 20: Data Stream Processing - Concepts and Frameworks

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

20

[...] stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted [...]

http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Page 21: Data Stream Processing - Concepts and Frameworks

•Part 1 of „When will the result be calculated?“

•A watermark tracks the progress of event time over all received data

•A watermark of 10:00 means „it is assumed that all data up to 10:00 has now arrived“

• fixed (perfect) watermark

• heuristic watermark

•A window is materialized/processed when the watermark passes the end of the window

21

Watermarks
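A minimal sketch of a heuristic watermark (assuming Flink's WatermarkStrategy API and a hypothetical SensorReading event type): the watermark trails the highest event time seen so far by 30 seconds, i.e. it assumes data arrives at most 30 seconds out of order.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class WatermarkSketch {

  // hypothetical event type carrying its event-time timestamp
  static class SensorReading {
    long timestampMillis;
    double value;
  }

  public static void main(String[] args) {
    // heuristic watermark: "all data up to (highest seen event time - 30s) has arrived"
    WatermarkStrategy<SensorReading> strategy =
        WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(30))
            .withTimestampAssigner((reading, previousTimestamp) -> reading.timestampMillis);

    // applied to a stream via: stream.assignTimestampsAndWatermarks(strategy)
  }
}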

Page 22: Data Stream Processing - Concepts and Frameworks

Watermarks

[Figure: events plotted against event time and processing time, with the watermark advancing over them]

Page 23: Data Stream Processing - Concepts and Frameworks

Trigger

23

Trigger types: content, event time, processing time, count, composite

•Part 2 of „When will the result be calculated?“

•Triggers an (additional) materialization of the window

•Example

• every 10 minutes (in processing time)

• & when the watermark reaches the end of the window

• & with each delayed event

• but only for an additional 15 minutes of processing time (allowed lateness)

Page 24: Data Stream Processing - Concepts and Frameworks

Accumulators – joining the individual (triggered) results

•Every result on its own (discarding)

•Results based on each other (accumulating)

•Results based on each other & correction of the old result (accumulating & retracting)

24

Page 25: Data Stream Processing - Concepts and Frameworks

Accumulators

25

Input stream (three trigger firings): 5 2 | 8 3 | 4

Fired pane     Discarding   Accumulating   Accumulating & Retracting
(5, 2)         7            7              7
(8, 3)         11           18             18, -7
(4)            4            22             22, -18
Last value     4            22             22
Total sum      22           47             22

Page 26: Data Stream Processing - Concepts and Frameworks

Watermarks, Trigger, Accumulators

cf. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

26

input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(60)))
    .triggering(AtWatermark()
      .withEarlyFirings(AtPeriod(Duration.standardMinutes(10)))
      .withLateFirings(AtCount(1)))
    .withAllowedLateness(Duration.standardMinutes(15))
    .discardingFiredPanes())
  .apply(Sum.integersPerKey());
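The snippet above uses the pseudocode notation of the Dataflow paper / streaming-102 article. In the Apache Beam Java SDK the same pipeline looks roughly like the following sketch (hypothetical input data; the builder names differ from the pseudocode): 60-minute fixed windows, early firings every 10 minutes of processing time, one late firing per delayed record, 15 minutes of allowed lateness, and discarding accumulation.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedSums {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // hypothetical keyed input, e.g. scores per user
    PCollection<KV<String, Integer>> input =
        pipeline.apply(Create.of(KV.of("key1", 5), KV.of("key1", 2), KV.of("key2", 8)));

    input
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(60)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))    // every 10 minutes (processing time)
                .withLateFirings(AfterPane.elementCountAtLeast(1))) // with each delayed event
            .withAllowedLateness(Duration.standardMinutes(15))      // but only for 15 more minutes
            .discardingFiredPanes())                                 // every result on its own
        .apply(Sum.integersPerKey());

    pipeline.run();
  }
}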

Page 27: Data Stream Processing - Concepts and Frameworks

Stream ~ Table Model

•Aggregating a stream over time yields a table

•Changes to a table over time yield a stream

•The table is updated by every entry in the stream

•Every new entry triggers a computation

•Retention period for late events (cf. allowed lateness)

•Stream/Table ⊆ Dataflow

27

Stream entries: (key1, value1), (key2, value2), (key1, value3)

Resulting table (key | value | update count):
after (key1, value1): key1 | value1 | 1
after (key2, value2): key1 | value1 | 1 · key2 | value2 | 1
after (key1, value3): key1 | value3 | 2 · key2 | value2 | 1
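A minimal Kafka Streams sketch of this duality (hypothetical topic names, default string serdes assumed for the input): aggregating the stream of updates yields a continuously updated table, and the table's changes can be emitted as a stream again.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class StreamTableDuality {

  public static Topology build() {
    StreamsBuilder builder = new StreamsBuilder();

    // stream of (key, value) records, e.g. (key1, value1), (key2, value2), (key1, value3)
    KStream<String, String> updates = builder.stream("updates");

    // aggregating the stream yields a table: one row per key, updated by every record
    KTable<String, Long> updateCounts = updates.groupByKey().count();

    // changes to the table over time yield a stream again (its changelog)
    updateCounts.toStream().to("update-counts", Produced.with(Serdes.String(), Serdes.Long()));

    return builder.build();
  }
}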

Page 28: Data Stream Processing - Concepts and Frameworks

28

Stateful Processing

Page 29: Data Stream Processing - Concepts and Frameworks

State & Window Processing•Non trivial applications mostly need some kind of (temporal) persistent state

• i.e aggregations over a longer time, counter, slowly refreshing metadata

• held in memory, can be stored on disk

• interesting: partitioning, rescaling, node failure?

29

[Diagram: an operation with attached local state]
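A sketch of such state in Flink (hypothetical key and value types): a running count per key is held in keyed ValueState, which lives in the configured state backend, is included in checkpoints for fault tolerance, and is redistributed with its key when the job is rescaled.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// emits a running count of events per key, e.g. via events.keyBy(...).process(new RunningCount())
public class RunningCount extends KeyedProcessFunction<String, String, Long> {

  private transient ValueState<Long> count;

  @Override
  public void open(Configuration parameters) {
    count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
  }

  @Override
  public void processElement(String value, Context ctx, Collector<Long> out) throws Exception {
    Long current = count.value();                       // null for the first event of a key
    long updated = (current == null ? 0L : current) + 1;
    count.update(updated);                              // persisted in the state backend
    out.collect(updated);
  }
}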

Page 30: Data Stream Processing - Concepts and Frameworks

State Implementations

•State is usually partitioned

• Distributed over multiple nodes

• Number of nodes might change

•State must be fault-tolerant

•State access must be fast

•Storage backend

• native/own-built, e.g. in Spark Streaming

• existing tools: RocksDB in Kafka Streams

• pluggable, e.g. in Flink (amongst others RocksDB)

•Carbone et al. (2017): State Management in Apache Flink, http://www.vldb.org/pvldb/vol10/p1718-carbone.pdf

30
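A minimal sketch of plugging in a storage backend in Flink (assuming the flink-statebackend-rocksdb dependency): state is kept in embedded RocksDB instances on local disk instead of only on the heap, and periodic checkpoints make it fault-tolerant.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendConfig {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // pluggable backend: keyed state is stored in embedded RocksDB on local disk
    env.setStateBackend(new EmbeddedRocksDBStateBackend());

    // checkpoint the state every 10 seconds for fault tolerance
    env.enableCheckpointing(10_000);
  }
}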

Page 31: Data Stream Processing - Concepts and Frameworks

31

Data Lookup

Page 32: Data Stream Processing - Concepts and Frameworks

Lookup Additional Data

32

[Diagram: records flow from a queue into processing, which looks up additional metadata and writes results]

Page 33: Data Stream Processing - Concepts and Frameworks

Lookup - Remote Read

33

[Diagram: records from the queue are processed on node 1 and node 2; each lookup reads the metadata from a remote store]

Page 34: Data Stream Processing - Concepts and Frameworks

Lookup - Local Read

34

[Diagram: the metadata is replicated to node 1 and node 2, so each lookup during processing is a local read]
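One way to get such a local read is sketched below with Kafka Streams (hypothetical topic names): the metadata topic is materialized as a GlobalKTable, i.e. fully replicated to every instance, so enriching each record is a local lookup instead of a remote call.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class LocalLookup {

  public static void build(StreamsBuilder builder) {
    KStream<String, String> events = builder.stream("events");

    // the metadata topic is replicated completely to every application instance
    GlobalKTable<String, String> metadata = builder.globalTable("metadata");

    events.join(metadata,
            (eventKey, eventValue) -> eventKey,                      // which metadata key to look up
            (eventValue, metaValue) -> eventValue + "|" + metaValue) // enrich the event locally
          .to("enriched-events");
  }
}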

Page 35: Data Stream Processing - Concepts and Frameworks

35

Deployment & Runtime Environment

Page 36: Data Stream Processing - Concepts and Frameworks

Runtime Environment - Cluster vs. Library

36

YARN

Page 37: Data Stream Processing - Concepts and Frameworks

Framework Dependent: UI, REST APIs, metrics

Scheduler Monitoring

Own Logging: technical & business

Java „Classics“: JMX, profiler

37

Monitoring

Page 38: Data Stream Processing - Concepts and Frameworks

38

Delivery Guarantees

Page 39: Data Stream Processing - Concepts and Frameworks

Guarantees

39

Guarantee levels: at-most-once, at-least-once, exactly-once

Typical mechanisms: record acknowledgement, micro batching, snapshots/checkpoints, changelogs
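As an example of how a framework exposes this, here is a minimal sketch (hypothetical application id and broker address) of switching Kafka Streams from its default at-least-once behaviour to the exactly-once processing guarantee via configuration.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {

  public static Properties build() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detection");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    // default is at-least-once; exactly-once uses Kafka transactions under the hood
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
    return props;
  }
}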

Page 40: Data Stream Processing - Concepts and Frameworks

Guarantees

40

at-most-once at-least-once exactly-once

Page 41: Data Stream Processing - Concepts and Frameworks

41

Streaming Frameworks
Helping you implement your solution

Page 42: Data Stream Processing - Concepts and Frameworks

Tyler Akidau

“ ... an execution engine designed for unbounded data sets, and nothing more”

42

T. Akidau et al. (2015): The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing - http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Page 43: Data Stream Processing - Concepts and Frameworks

Apache Spark

•Open Source (2010) & Apache project (2013)

•Unified Batch & Stream Processing

•Wide distribution, especially in Hadoop environments

•Batch: RDD as base, DataFrames and DataSets as optimization

•Streaming: DStream & Structured Streaming

43

Page 44: Data Stream Processing - Concepts and Frameworks

Apache Spark Streaming

•Microbatching

•Similar, partly unified programming model to batch processing

•State and window operations

•Missing support for event time

44

Page 45: Data Stream Processing - Concepts and Frameworks

Apache Spark Structured Streaming

•DataSets/DataFrames for stream processing

•A data stream is treated as an ever-growing table

•Unified API

•Limited support for event time operations

45

val ds = sparkSession.read.json("someFile.json")

ds.write.json("otherFile.json")

val ds = sparkSession.readStream.format("kafka").option("...", "...").load

ds.writeStream.outputMode("complete").format("console").start()

Page 46: Data Stream Processing - Concepts and Frameworks

Apache Flink

•Started as a research project in 2010 (Stratosphere), Apache project since 2014

•Low latency streaming and high throughput batch processing

•Streaming first

•Flexible state and window handling

•Rich support for event time handling

46

Page 47: Data Stream Processing - Concepts and Frameworks

Apache Kafka Streams API

•Only a library, no runtime environment

•Requires a Kafka cluster (>= 0.10)

•Uses Kafka consumer technologies for

• Ordering

• Partitioning

• Scaling

•Source & sink: Kafka topics only

•Kafka Connect for sources & sinks

47
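A minimal sketch of the "library, not cluster" deployment model (hypothetical topic names and broker address): the topology runs inside an ordinary JVM process that you start yourself, and you scale out simply by starting more instances with the same application.id.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamingApp {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streaming-demo");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> input = builder.stream("input-topic"); // source: a Kafka topic
    input.mapValues(value -> value.toUpperCase())
         .to("output-topic");                                      // sink: a Kafka topic

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}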

Page 48: Data Stream Processing - Concepts and Frameworks

48

Current Developments
The latest promises and features

Page 49: Data Stream Processing - Concepts and Frameworks

Queryable State

•Known as

• Queryable state (Flink)

• Interactive Queries (Kafka Streams)

•Still low level

• Data lifecycle

• (De)Serialization

• Partitioned state discovery

49

[Diagram: an operation with attached state, exposed via a query interface]
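A sketch of an interactive query in Kafka Streams (assuming a state store named "update-counts", e.g. materialized by a count() aggregation): the partitioned state of a running application is read directly instead of being exported to an external database.

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateQuery {

  public static Long lookup(KafkaStreams streams, String key) {
    ReadOnlyKeyValueStore<String, Long> store = streams.store(
        StoreQueryParameters.fromNameAndType("update-counts", QueryableStoreTypes.keyValueStore()));

    // only keys whose partitions are hosted by this instance can be read locally;
    // other instances have to be discovered and queried, e.g. via a small REST layer
    return store.get(key);
  }
}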

Page 50: Data Stream Processing - Concepts and Frameworks

Streaming SQL

•Use SQL to query streaming data

• time-varying relations, i.e. relations that change as time progresses

• queries evaluated at multiple points in time

•Standard ANSI SQL + some extensions

• SELECT TABLE, SELECT STREAM

• WINDOWS

• TRIGGERS

•Supported by

• Flink

• Kafka Streams (KSQL)

•https://s.apache.org/streaming-sql-strata-nyc

50

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;

Page 51: Data Stream Processing - Concepts and Frameworks

51

Recommendations
Or at least some hints when choosing a framework

Page 52: Data Stream Processing - Concepts and Frameworks

Spark Streaming might be an option if

•Spark is already used for batch processing

•Hadoop, and therefore YARN, is used

•A huge community is important

•Scala is not a problem

•Latency is not an important criterion *

•Event time handling is not needed *

•* these change with Structured Streaming:

• event time support

• reduced microbatching overhead

52

Page 53: Data Stream Processing - Concepts and Frameworks

Flink is good for...

•flexible event time processing

• watermarks

• triggers

• accumulators

•connectivity to the most important peripheral systems

•low latency stream processing

•excellent state handling

53

Page 54: Data Stream Processing - Concepts and Frameworks

And finally Kafka Streams, for...

•you want an easy deployment

•you already have a scheduler/micro service platform

•low latency and high throughput

•event time support

•a lightweight start in streaming

•if you already use Kafka

•if you are fine with making Kafka your central backbone

54

Page 55: Data Stream Processing - Concepts and Frameworks

Comparison

55

(columns: Spark Streaming | Flink | Kafka Streams)

Engine: Microbatching | Native | Native
Programming model: Declarative | Declarative | Declarative
Guarantees: Exactly-once | Exactly-once | Exactly-once
Event time handling: No/Yes* | Yes | Yes
State storage: Own | Pluggable | RocksDB + topic
Community & ecosystem: Big | Medium | Big
Deployment: Cluster | Cluster | Library
Monitoring: UI, REST API, Dropwizard metrics | UI, metrics (JMX, Ganglia), REST API | Kafka tools, Confluent Control Center, JMX

Page 56: Data Stream Processing - Concepts and Frameworks

A word on

•Apache Beam

• High-level API for different streaming runners, e.g. Google Cloud Dataflow, Flink and Spark Streaming

•Google Cloud Dataflow

• Cloud Streaming by Google

•Apex

• YARN-based, with a static topology that can be modified at runtime

•Flume

• Logfile Shipping, especially into HDFS

•Storm/Heron

• Streaming pioneer from Twitter; Heron is its successor with the same API

56

Page 57: Data Stream Processing - Concepts and Frameworks

Takeaways

•Streaming is not easy

• (Event) Time

• State

• Deployment

• Correctness

•Different concepts and implementations

•Be aware of

• Monitoring

• „Overkill“

•Ongoing research and development

57

Page 58: Data Stream Processing - Concepts and Frameworks

Our mission – to promote agile development, innovation and technology – extends through everything we do.

codecentric AG, Hochstraße 11, 42697 Solingen, Germany

Address

E-Mail: [email protected]
Twitter: @matthiasniehoff
www.codecentric.de

Contact Info

Stay connected!

58