Debunking Six Common Myths in Stream Processing
TRANSCRIPT
Kostas Tzoumas (@kostas_tzoumas)
Flink London Meetup, November 3, 2016

Apache Flink®: State of the Union and What's Next

Debunking Six Common Myths in Stream Processing
Original creators of Apache Flink®
Providers of the dA Platform, a supported Flink distribution
Outline
• What is data streaming
• Myth 1: The Lambda architecture
• Myth 2: The throughput/latency tradeoff
• Myth 3: Exactly once not possible
• Myth 4: Streaming is for (near) real-time
• Myth 5: Batching and buffering
• Myth 6: Streaming is hard
The streaming architecture
Reconsideration of data architecture
• Better app isolation
• More real-time reaction to events
• Robust continuous applications
• Process both real-time and historical data

[Diagram: applications keeping local state, consuming from an event log, served by a query service]
What is (distributed) streaming
• Computations on never-ending "streams" of data records ("events")
• A stream processor distributes the computation in a cluster (a minimal sketch follows below)

[Diagram: your code running in parallel across a cluster]
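A minimal sketch of such a job in Flink's Java DataStream API. The socket source on localhost:9999 is an arbitrary choice for illustration; the same program runs unchanged on one machine or distributed across a cluster.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    // The environment decides how the job is parallelized and distributed.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Never-ending stream of text lines (illustrative source).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    lines
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
              if (!word.isEmpty()) {
                out.collect(Tuple2.of(word, 1));
              }
            }
          }
        })
        .keyBy(0)   // hash-partition the stream by word across the cluster
        .sum(1)     // continuously update the running count per word
        .print();

    env.execute("Streaming word count");
  }
}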
What is stateful streaming
• Computation and state
  • E.g., counters, windows of past events, state machines, trained ML models
• The result depends on the history of the stream
• A stateful stream processor gives you the tools to manage state (a sketch follows below)
  • Recover, roll back, version, upgrade, etc.

[Diagram: your code plus managed local state]
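A sketch of managed keyed state in Flink's Java API: a per-key event counter. The tuple layout (key in field f0) and the state name are assumptions for illustration; because the counter lives in Flink-managed state, the framework can snapshot, restore, and version it.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PerKeyCounter extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

  // Managed keyed state: Flink keeps one counter per key and checkpoints it.
  private transient ValueState<Long> count;

  @Override
  public void open(Configuration parameters) {
    count = getRuntimeContext().getState(
        new ValueStateDescriptor<>("count", Long.class));
  }

  @Override
  public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
    Long current = count.value();                    // null on the first event for this key
    long updated = (current == null ? 0L : current) + 1;
    count.update(updated);
    out.collect(Tuple2.of(event.f0, updated));       // emit (key, running count)
  }
}

// Usage on an assumed stream of (key, timestamp) tuples:
//   events.keyBy(0).flatMap(new PerKeyCounter())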
What is event-time streaming
• Data records associated with timestamps (time series data)
• Processing depends on timestamps
• An event-time stream processor gives you the tools to reason about time (a sketch follows below)
  • E.g., handle streams that are out of order
  • Core feature is watermarks: a clock to measure event time

[Diagram: your code plus state, consuming events t1–t4 that arrive out of order and assigning them to event-time windows]
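A sketch of event-time windows with watermarks in Flink's Java API. The toy elements, the 10-second out-of-orderness bound, and the 1-minute window size are arbitrary choices for illustration.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindows {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // Toy input: (sensor id, event timestamp in ms); in practice this would come from Kafka etc.
    DataStream<Tuple2<String, Long>> events = env.fromElements(
        Tuple2.of("sensor-1", 1478131200000L),
        Tuple2.of("sensor-1", 1478131205000L),
        Tuple2.of("sensor-2", 1478131199000L));

    events
        // Watermarks: a clock for event time; tolerate events up to 10 s out of order.
        .assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(10)) {
              @Override
              public long extractTimestamp(Tuple2<String, Long> event) {
                return event.f1;   // the record's own timestamp, not its arrival time
              }
            })
        .keyBy(0)
        .timeWindow(Time.minutes(1))   // tumbling 1-minute windows in event time
        .maxBy(1)                      // latest event per key and window
        .print();

    env.execute("Event-time windows");
  }
}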
What is streaming
• Continuous processing on data that is continuously generated
• I.e., pretty much all "big" data
• It's all about state and time
Myth 1: The Lambda architecture
Myth variations
• Stream processing is approximate
• Stream processing is for transient data
• Stream processing cannot handle high data volume
• Hence, stream processing needs to be coupled with batch processing
Lambda architecture

[Diagram: a scheduler running periodic batch jobs (Job 1, Job 2) over files (file 1, file 2) alongside a streaming job, both feeding a serve & store layer]
Lambda no longer needed
• Lambda was useful in the first days of stream processing (the beginning of Apache Storm)
• Not any more
  • Stream processors can handle very large volumes
  • Stream processors can compute accurate results
• The good news is that I don't hear about Lambda so often anymore
Myth 2: Throughput/latency tradeoff
Myth flavors
• Low-latency systems cannot support high throughput
• In general, you need to trade off one for the other
• There is a "high-throughput" category and a "low-latency" category (naming varies)
Physical limits
• Most stream processing pipelines are network-bottlenecked
• The network dictates both (1) the latency and (2) the throughput
• A well-engineered system achieves the physical limits allowed by the network
Buffering
• It is natural to handle many records together
  • All software and hardware systems do that
  • E.g., the network bundles bytes into frames
• Every streaming system buffers records for performance (Flink certainly does)
  • You don't want to send single records over the network
  • "Record-at-a-time" does not exist at the physical level
Buffering (2)
• Buffering is a performance optimization (see the sketch below)
  • Should be opaque to the user
  • Should not dictate system behavior in any other way
  • Should not impose artificial boundaries
  • Should not limit what you can do with the system
  • Etc.
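In Flink, for example, network buffering is exposed only as a tuning knob, not as a semantic boundary; a minimal sketch, where the 10 ms timeout is an arbitrary value:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Flink collects records into network buffers for throughput; the buffer timeout
    // bounds how long a partially filled buffer may wait before being flushed,
    // trading a little throughput for lower latency without changing program semantics.
    env.setBufferTimeout(10);   // flush network buffers at least every 10 ms
  }
}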
Some numbers

Some more

[Charts: classic batch jobs such as TeraSort, relational join, graph processing, and linear algebra]
Myth 3: Exactly once not possible
What is "exactly once"
• Under failures, the system computes the result as if there was no failure
• In contrast to:
  • At most once: no guarantees
  • At least once: duplicates possible
• Exactly-once state versus exactly-once delivery
Myth variations
• Exactly once is not possible in nature
• Exactly once is not possible end-to-end
• Exactly once is not needed
• You need to trade off performance for exactly once
• (Usually perpetuated by folks until they implement exactly once)
Transactions
• "Exactly once" is transactions: either all actions succeed or none succeed
• Transactions are possible
• Transactions are useful
• Let's not start eventual consistency all over again…
Flink checkpoints
• Periodic, asynchronous, consistent snapshots of application state
• Provide exactly-once state guarantees under failures (a configuration sketch follows below)
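A minimal configuration sketch; the 10-second interval and the HDFS checkpoint path are arbitrary placeholder values.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Take a consistent, asynchronous snapshot of all operator state every 10 seconds.
    env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

    // Store snapshots durably (path is a placeholder) so state survives process failures.
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

    // ... define sources, transformations, and sinks, then call env.execute(...)
  }
}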
End-to-end exactly once
• Checkpoints double as a transaction coordination mechanism
• Source and sink operators (transactional sinks) can take part in checkpoints
• Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates (a sketch of the idea follows below)
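A sketch of the idempotent-update idea behind "effectively once". The slide's example is Flink + Cassandra, where an INSERT is an upsert by primary key; this sketch uses a plain JDBC upsert instead, and the connection URL, table name, and PostgreSQL-style conflict clause are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class IdempotentUpsertSink extends RichSinkFunction<Tuple2<String, Long>> {

  private transient Connection connection;
  private transient PreparedStatement upsert;

  @Override
  public void open(Configuration parameters) throws Exception {
    // Placeholder connection details for illustration.
    connection = DriverManager.getConnection("jdbc:postgresql://localhost:5432/metrics");
    upsert = connection.prepareStatement(
        "INSERT INTO word_counts (word, cnt) VALUES (?, ?) " +
        "ON CONFLICT (word) DO UPDATE SET cnt = EXCLUDED.cnt");
  }

  @Override
  public void invoke(Tuple2<String, Long> record) throws Exception {
    // Writing the same (key, value) twice leaves the table in the same state,
    // so at-least-once replays after a failure become effectively exactly once.
    upsert.setString(1, record.f0);
    upsert.setLong(2, record.f1);
    upsert.executeUpdate();
  }

  @Override
  public void close() throws Exception {
    if (upsert != null) upsert.close();
    if (connection != null) connection.close();
  }
}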
State management
• Checkpoints triple as a state versioning mechanism (savepoints)
• Go back and forth in time while maintaining state consistency
• Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests
Myth 4: Streaming = real time
Myth variations
• I don't have low-latency applications, hence I don't need stream processing
• Stream processing is only relevant for data before it is stored
• We need a batch processor to do heavy offline computations
Low latency and high latency streams

[Diagram: a partitioned event log with events timestamped from 2016-03-01 through 2016-03-12; labels: Stream (low latency), Stream (high latency), Batch (bounded stream)]
Robust continuous applications
Accurate computation
• Batch processing is not an accurate computation model for continuous data
  • It misses the right concepts and primitives
  • Time handling, state across batch boundaries
• Stateful stream processing is a better model
  • Real-time/low-latency is the icing on the cake
Myth 5: Batching and buffering
Myth variations
• There is a "mini-batch" category between batch and streaming
• "Record-at-a-time" versus "mini-batching" or similar "choices"
• Mini-batch systems can get better throughput
Myth variations (2)
• The difference between mini-batching and streaming is latency
• I don't need low latency, hence I need mini-batching
• I have a mini-batching use case
We have answered this already
• You can get both throughput and latency (myth #2)
  • Every system buffers data, from the network to the OS to Flink
• Streaming is a model, not just fast (myth #4)
  • Time and state
  • Low latency is the icing on the cake
Continuous operation
• Data is continuously produced
• Computation should track data production
  • With dynamic scaling, pause-and-resume
• Restarting our pipelines every second is not a great idea, and not just for latency reasons
Myth 6: Streaming is hard
Myth variations
• Streaming is hard to learn
• Streaming is hard to reason about
• Windows? Event time? Triggers? Oh, my!
• Streaming needs to be coupled with batch
• I know batch already
It's about your data and code
• What's the form of your data?
  • Unbounded (e.g., clicks, sensors, logs), or
  • Bounded (e.g., ???*)
• What changes more often?
  • My code changes faster than my data
  • My data changes faster than my code

* Please help me find a great example of naturally static data
It's about your data and code
• If your data changes faster than your code, you have a streaming problem
  • You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches
  • You are probably living with inaccurate results without knowing it
It's about your data and code
• If your code changes faster than your data, you have an exploration problem
  • Using notebooks or other tools for quick data exploration is a good idea
  • Once your code stabilizes, you will have a streaming problem, so you might as well think of it as such from the beginning
Flink in the real world
Flink community
• > 240 contributors, 95 contributors in Flink 1.1
• 42 meetups around the world with > 15,000 members
• 2x-3x growth in 2015, similar in 2016
Powered by Flink
• Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.
• King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.
• Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.
• Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.
See more at flink.apache.org/poweredby.html

• 30 Flink applications in production for more than one year; 10 billion events (2 TB) processed daily
• Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining hundreds of GB of state with exactly-once guarantees
• Largest job has > 20 operators, runs on > 5,000 vCores in a 1,000-node cluster, and processes millions of events per second
Flink Forward 2016
Current work in Flink
Flink's unique combination of features

[Diagram: features grouped under Performance, Event Time, Consistency, and APIs & Libraries]
• Low latency and high throughput
• Well-behaved flow control (back pressure)
• Works on real-time and historic data
• Out-of-order events
• Exactly-once semantics for fault tolerance
• Savepoints (replays, A/B testing, upgrades, versioning)
• Fast and large out-of-core state
• Stateful streaming
• Windows & user-defined state
• Flexible windows (time, count, session, roll-your-own)
• Complex event processing
• Fluent API
Flink 1.1
• Connectors
• Metric system
• (Stream) SQL
• Session windows
• Library enhancements
Flink 1.1 + ongoing development

Flink 1.1: connectors, session windows, (Stream) SQL, library enhancements, metric system

Ongoing development:
• Metrics & visualization
• Dynamic scaling
• Savepoint compatibility
• Checkpoints to savepoints
• More connectors
• Stream SQL
• Windows
• Large state maintenance
• Fine-grained recovery
• Side inputs/outputs
• Window DSL
• Security and authentication
• Mesos & others
• Dynamic resource management
• Queryable state
[Diagram: the same ongoing development items, grouped into Operations, Ecosystem, Application Features, and Broader Audience]
A longer-term vision for Flink
Streaming use cases

Application                     Technology
(Near) real-time apps           Low-latency streaming
Continuous apps                 High-latency streaming
Analytics on historical data    Batch as a special case of streaming
Request/response apps           Large queryable state
Request/response applications
• Queryable state: query Flink state directly instead of pushing results into a database
• Large state support and a query API are coming in Flink

[Diagram: queries answered directly from the stream processor's state]
In summary
• The need for streaming comes from a rethinking of data infrastructure architecture
  • Stream processing then just becomes natural
• Debunking six myths
  • Myth 1: The Lambda architecture
  • Myth 2: The throughput/latency tradeoff
  • Myth 3: Exactly once not possible
  • Myth 4: Streaming is for (near) real-time
  • Myth 5: Batching and buffering
  • Myth 6: Streaming is hard
Thank you! @kostas_tzoumas @ApacheFlink @dataArtisans
We are hiring!
data-artisans.com/careers