Debunking Six Common Myths in Stream Processing

Kostas Tzoumas @kostas_tzoumas
Flink London Meetup, November 3, 2016
Apache Flink®: State of the Union and What's Next

Uploaded by kostas-tzoumas on 23-Jan-2017
Category: Software

TRANSCRIPT

Page 1: Debunking Six Common Myths in Stream Processing

Kostas Tzoumas @kostas_tzoumas

Flink London Meetup, November 3, 2016

Apache Flink®: State of the Union and What's Next

Page 2: Debunking Six Common Myths in Stream Processing

Kostas Tzoumas @kostas_tzoumas

Flink London Meetup, November 3, 2016

Debunking Six Common Myths in Stream Processing

Page 3: Debunking Six Common Myths in Stream Processing

Original creators of Apache Flink®

Providers of the dA Platform, a supported Flink distribution

Page 4: Debunking Six Common Myths in Stream Processing

Outline

What is data streaming

Myth 1: The Lambda architecture

Myth 2: The throughput/latency tradeoff

Myth 3: Exactly once not possible

Myth 4: Streaming is for (near) real-time

Myth 5: Batching and buffering

Myth 6: Streaming is hard

Page 5: Debunking Six Common Myths in Stream Processing

The streaming architecture

5

Page 6: Debunking Six Common Myths in Stream Processing

Reconsideration of data architecture

• Better app isolation
• More real-time reaction to events
• Robust continuous applications
• Process both real-time and historical data

Page 7: Debunking Six Common Myths in Stream Processing

[Diagram: multiple applications, each with its own app state, connected through an event log, plus a query service]

Page 8: Debunking Six Common Myths in Stream Processing

What is (distributed) streaming

Computations on never-ending “streams” of data records (“events”)

A stream processor distributes the computation in a cluster

[Diagram: parallel instances of “Your code” distributed across the cluster]

Page 9: Debunking Six Common Myths in Stream Processing

What is stateful streaming

Computation and state
• E.g., counters, windows of past events, state machines, trained ML models

Result depends on history of stream

A stateful stream processor gives the tools to manage state
• Recover, roll back, version, upgrade, etc.

[Diagram: “Your code” with embedded state]
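
The dependence of results on stream history can be sketched with a minimal per-key counter (plain Python, not Flink's API; the class and method names are illustrative):

```python
from collections import defaultdict

class CountingOperator:
    """Minimal stateful operator: its output depends on every event seen so far."""
    def __init__(self):
        self.counts = defaultdict(int)  # operator state, keyed by event key

    def process(self, key):
        self.counts[key] += 1
        return self.counts[key]  # emitted result reflects the stream's history

op = CountingOperator()
results = [op.process(k) for k in ["a", "b", "a", "a"]]
# results == [1, 1, 2, 3]: the same input "a" yields different outputs,
# because the operator's state carries the history of the stream.
```

This is exactly why state must be managed by the processor: losing it on failure would change every subsequent result.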

Page 10: Debunking Six Common Myths in Stream Processing

What is event-time streaming

Data records associated with timestamps (time series data)

Processing depends on timestamps

An event-time stream processor gives you the tools to reason about time
• E.g., handle streams that are out of order
• Core feature is watermarks – a clock to measure event time

[Diagram: out-of-order events (t1, t2, t3, t4) grouped into event-time windows (t1-t2, t3-t4)]
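
The watermark idea can be sketched in a few lines (plain Python, not Flink's API; the bounded-out-of-orderness policy and names are illustrative):

```python
def watermarks(events, max_out_of_orderness):
    """Yield (event_time, watermark) pairs: the watermark trails the
    maximum event time seen so far by a fixed allowed lateness."""
    max_ts = float("-inf")
    for ts in events:
        max_ts = max(max_ts, ts)
        yield ts, max_ts - max_out_of_orderness

stream = [1, 3, 2, 7, 5, 9]  # out-of-order event timestamps
wms = [wm for _, wm in watermarks(stream, max_out_of_orderness=2)]
# wms == [-1, 1, 1, 5, 5, 7]: the watermark never regresses, and the
# late events (2 and 5) do not pull it backwards.
```

An event-time window can then fire once the watermark passes the window's end, regardless of arrival order.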

Page 11: Debunking Six Common Myths in Stream Processing

What is streaming

Continuous processing on data that is continuously generated

I.e., pretty much all “big” data

It’s all about state and time

Page 12: Debunking Six Common Myths in Stream Processing

12

Page 13: Debunking Six Common Myths in Stream Processing

Myth 1: The Lambda architecture

13

Page 14: Debunking Six Common Myths in Stream Processing

Myth variations

Stream processing is approximate

Stream processing is for transient data

Stream processing cannot handle high data volume

Hence, stream processing needs to be coupled with batch processing

Page 15: Debunking Six Common Myths in Stream Processing

Lambda architecture

[Diagram: a scheduler runs periodic batch jobs (Job 1, Job 2) over files (file 1, file 2) alongside a streaming job; both paths feed a serve & store layer]

Page 16: Debunking Six Common Myths in Stream Processing

Lambda no longer needed

Lambda was useful in the first days of stream processing (beginning of Apache Storm)

Not any more
• Stream processors can handle very large volumes
• Stream processors can compute accurate results

Good news is I don’t hear about Lambda so often anymore

Page 17: Debunking Six Common Myths in Stream Processing

Myth 2: Throughput/latency tradeoff

17

Page 18: Debunking Six Common Myths in Stream Processing

Myth flavors

Low latency systems cannot support high throughput

In general, you need to trade off one for the other

There is a “high throughput” category and a “low-latency” category (naming varies)

Page 19: Debunking Six Common Myths in Stream Processing

Physical limits

Most stream processing pipelines are network bottlenecked

The network dictates both (1) the latency and (2) the throughput

A well-engineered system achieves the physical limits allowed by the network

19

Page 20: Debunking Six Common Myths in Stream Processing

Buffering

It is natural to handle many records together
• All software and hardware systems do that
• E.g., the network bundles bytes into frames

Every streaming system buffers records for performance (Flink certainly does)
• You don’t want to send single records over the network
• "Record-at-a-time" does not exist at the physical level
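
How buffering stays transparent to per-record semantics can be sketched as follows (plain Python, not Flink's network stack; the class and capacity are illustrative):

```python
class BufferedChannel:
    """Buffer records and flush them in batches: a performance
    optimization that is invisible to what gets delivered."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.sent = []  # stands in for the network

    def send(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        self.sent.extend(self.buffer)  # one "frame" carrying many records
        self.buffer.clear()

ch = BufferedChannel(capacity=3)
for r in range(7):
    ch.send(r)
ch.flush()  # e.g., on a timeout, so buffering never delays records indefinitely
# ch.sent == [0, 1, 2, 3, 4, 5, 6]: every record arrives exactly as sent;
# buffering changed only how records traveled, not what was delivered.
```

This is the sense in which buffering should be opaque: it affects throughput and latency, not results.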

20

Page 21: Debunking Six Common Myths in Stream Processing

Buffering (2)

Buffering is a performance optimization
• Should be opaque to the user
• Should not dictate system behavior in any other way
• Should not impose artificial boundaries
• Should not limit what you can do with the system
• Etc.

21

Page 22: Debunking Six Common Myths in Stream Processing

Some numbers

[Benchmark chart not included in transcript]

Page 23: Debunking Six Common Myths in Stream Processing

Some more

[Chart: classic batch jobs run as benchmarks – TeraSort, Relational Join, Graph Processing, Linear Algebra]

Page 24: Debunking Six Common Myths in Stream Processing

Myth 3: Exactly once not possible

24

Page 25: Debunking Six Common Myths in Stream Processing

What is “exactly once”

Under failures, system computes result as if there was no failure

In contrast to:
• At most once: no guarantees
• At least once: duplicates possible

Exactly once state versus exactly once delivery

Page 26: Debunking Six Common Myths in Stream Processing

Myth variations

Exactly once is not possible in nature

Exactly once is not possible end-to-end

Exactly once is not needed

You need to trade off performance for exactly once

(Usually perpetuated by folks until they implement exactly once)

Page 27: Debunking Six Common Myths in Stream Processing

Transactions

“Exactly once” is transactions: either all actions succeed or none succeed

Transactions are possible

Transactions are useful

Let’s not start eventual consistency all over again…

27

Page 28: Debunking Six Common Myths in Stream Processing

Flink checkpoints

Periodic asynchronous consistent snapshots of application state

Provide exactly-once state guarantees under failures
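
The recovery idea behind checkpoints can be sketched as snapshot-and-replay (a plain-Python sketch of the concept, not Flink's checkpointing protocol; all names are illustrative):

```python
class CheckpointedCounter:
    """Sketch: periodically snapshot (state, input position); after a
    failure, restore the snapshot and replay from that position, so the
    result is as if the failure never happened (exactly-once state)."""
    def __init__(self):
        self.total = 0
        self.checkpoint = (0, 0)  # (state snapshot, input offset)

    def process(self, stream, fail_at=None):
        ckpt_total, pos = self.checkpoint  # restore last checkpoint
        self.total = ckpt_total
        for i in range(pos, len(stream)):
            if i == fail_at:
                raise RuntimeError("simulated failure")
            self.total += stream[i]
            if i % 2 == 1:  # take a periodic checkpoint
                self.checkpoint = (self.total, i + 1)

stream = [1, 2, 3, 4, 5]
c = CheckpointedCounter()
try:
    c.process(stream, fail_at=3)  # crash mid-stream
except RuntimeError:
    c.process(stream)             # restore and replay from the checkpoint
# c.total == 15, the same result as a failure-free run.
```

The sketch assumes a replayable source (e.g., a log with offsets), which is also what Flink's checkpointing relies on.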

28

Page 29: Debunking Six Common Myths in Stream Processing

End-to-end exactly once

Checkpoints double as a transaction coordination mechanism

Source and sink operators can take part in checkpoints

Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates

29

[Diagram: transactional sinks]
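
The "effectively once" point about idempotent updates can be sketched directly (plain Python, not Cassandra's API; the sink class is illustrative):

```python
class IdempotentSink:
    """Sketch: keyed upserts into a store make replays harmless --
    re-applying the same updates after a failure leaves the same
    final state ("effectively once" end to end)."""
    def __init__(self):
        self.table = {}

    def upsert(self, key, value):
        self.table[key] = value  # overwriting with the same value is a no-op

sink = IdempotentSink()
updates = [("a", 1), ("b", 2), ("a", 3)]
for k, v in updates:
    sink.upsert(k, v)
for k, v in updates:  # replay the same updates after a simulated failure
    sink.upsert(k, v)
# sink.table == {"a": 3, "b": 2}: the replay did not change the result.
```

This assumes the replay repeats updates in the original order, which holds when replaying deterministically from a checkpointed source position.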

Page 30: Debunking Six Common Myths in Stream Processing

State management

Checkpoints triple as a state versioning mechanism (savepoints)

Go back and forth in time while maintaining state consistency

Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests

Page 31: Debunking Six Common Myths in Stream Processing

Myth 4: Streaming = real time

31

Page 32: Debunking Six Common Myths in Stream Processing

Myth variations

I don’t have low latency applications, hence I don’t need stream processing

Stream processing is only relevant for data before storing it

We need a batch processor to do heavy offline computations

32

Page 33: Debunking Six Common Myths in Stream Processing

Low latency and high latency streams

[Diagram: a partitioned event log with hourly timestamps (2016-3-11 10:00 pm through 2016-3-12 3:00 am, ...), consumed three ways: as a low-latency stream, as a high-latency stream, and as a batch (bounded stream)]

Page 34: Debunking Six Common Myths in Stream Processing

Robust continuous applications

34

Page 35: Debunking Six Common Myths in Stream Processing

Accurate computation

Batch processing is not an accurate computation model for continuous data
• Misses the right concepts and primitives
• Time handling, state across batch boundaries

Stateful stream processing is a better model
• Real-time/low-latency is the icing on the cake

35

Page 36: Debunking Six Common Myths in Stream Processing

Myth 5: Batching and buffering

36

Page 37: Debunking Six Common Myths in Stream Processing

Myth variations

There is a "mini-batch" category between batch and streaming

“Record-at-a-time” versus “mini-batching” or similar "choices"

Mini-batch systems can get better throughput

Page 38: Debunking Six Common Myths in Stream Processing

Myth variations (2)

The difference between mini-batching and streaming is latency

I don’t need low latency, hence I need mini-batching

I have a mini-batching use case

Page 39: Debunking Six Common Myths in Stream Processing

We have answered this already

Can get both throughput and latency (myth #2)
• Every system buffers data, from the network to the OS to Flink

Streaming is a model, not just fast (myth #4)
• Time and state
• Low latency is the icing on the cake

39

Page 40: Debunking Six Common Myths in Stream Processing

Continuous operation

Data is continuously produced

Computation should track data production
• With dynamic scaling, pause-and-resume

Restarting our pipelines every second is not a great idea, and not just for latency reasons

40

Page 41: Debunking Six Common Myths in Stream Processing

Myth 6: Streaming is hard

41

Page 42: Debunking Six Common Myths in Stream Processing

Myth variations

Streaming is hard to learn

Streaming is hard to reason about

Windows? Event time? Triggers? Oh, my!!

Streaming needs to be coupled with batch

I know batch already

Page 43: Debunking Six Common Myths in Stream Processing

It's about your data and code

What's the form of your data?
• Unbounded (e.g., clicks, sensors, logs), or
• Bounded (e.g., ???*)

What changes more often?
• My code changes faster than my data
• My data changes faster than my code

43

* Please help me find a great example of naturally static data

Page 44: Debunking Six Common Myths in Stream Processing

It's about your data and code

If your data changes faster than your code, you have a streaming problem
• You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches
• You are probably living with inaccurate results without knowing it

44

Page 45: Debunking Six Common Myths in Stream Processing

It's about your data and code

If your code changes faster than your data, you have an exploration problem
• Using notebooks or other tools for quick data exploration is a good idea
• Once your code stabilizes you will have a streaming problem, so you might as well think of it as such from the beginning

Page 46: Debunking Six Common Myths in Stream Processing

Flink in the real world

46

Page 47: Debunking Six Common Myths in Stream Processing

Flink community

> 240 contributors, 95 contributors in Flink 1.1

42 meetups around the world with > 15,000 members

2x-3x growth in 2015, similar in 2016

Page 48: Debunking Six Common Myths in Stream Processing

Powered by Flink

48

Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.

King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.

Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.

Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.

See more at flink.apache.org/poweredby.html

Page 49: Debunking Six Common Myths in Stream Processing

30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily

Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees

Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second

49

Page 50: Debunking Six Common Myths in Stream Processing

50

Page 51: Debunking Six Common Myths in Stream Processing

Flink Forward 2016

Page 52: Debunking Six Common Myths in Stream Processing

Current work in Flink

52

Page 53: Debunking Six Common Myths in Stream Processing

Flink's unique combination of features

[Diagram: Flink's feature map, organized around Performance, Event Time, Stateful Streaming, and APIs & Libraries]
• Low latency & high throughput
• Well-behaved flow control (back pressure)
• Consistency
• Works on real-time and historic data
• Out-of-order events
• Savepoints (replays, A/B testing, upgrades, versioning)
• Exactly-once semantics for fault tolerance
• Windows & user-defined state
• Flexible windows (time, count, session, roll-your-own)
• Fast and large out-of-core state
• Complex Event Processing
• Fluent API

Page 54: Debunking Six Common Myths in Stream Processing

Flink 1.1

• Connectors
• Metric System
• (Stream) SQL
• Session Windows
• Library enhancements
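
Session windows, one of the additions above, can be sketched as gap-based grouping (plain Python illustrating the concept, not Flink's window-assigner API; the function and gap value are illustrative):

```python
def session_windows(timestamps, gap):
    """Sketch of session windowing: a new session starts whenever the
    gap between consecutive events exceeds `gap`."""
    sessions, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # quiet period ended the session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Events at t=1,2,3, then a quiet period, then t=10,11:
# session_windows([1, 2, 3, 10, 11], gap=4) == [[1, 2, 3], [10, 11]]
```

Unlike fixed time or count windows, the window boundaries here are driven by the data itself.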

Page 55: Debunking Six Common Myths in Stream Processing

Flink 1.1 + ongoing development

[Diagram: Flink 1.1 features (Connectors, Session Windows, (Stream) SQL, Library enhancements, Metric System) surrounded by ongoing development]
• Metrics & visualization
• Dynamic scaling
• Savepoint compatibility
• Checkpoints to savepoints
• More connectors
• Stream SQL windows
• Large state maintenance
• Fine-grained recovery
• Side inputs/outputs
• Window DSL
• Security & authentication
• Mesos & others
• Dynamic resource management
• Queryable state

Page 56: Debunking Six Common Myths in Stream Processing

Flink 1.1 + ongoing development

[Diagram: the same Flink 1.1 and ongoing-development items as the previous slide, grouped under the themes Operations, Ecosystem, Application Features, and Broader Audience]

Page 57: Debunking Six Common Myths in Stream Processing

A longer-term vision for Flink

57

Page 58: Debunking Six Common Myths in Stream Processing

58

Streaming use cases

Application and matching technology:
• (Near) real-time apps: low-latency streaming
• Continuous apps: high-latency streaming
• Analytics on historical data: batch as a special case of streaming
• Request/response apps: large queryable state

Page 59: Debunking Six Common Myths in Stream Processing

Request/response applications

Queryable state: query Flink state directly instead of pushing results into a database

Large state support and query API coming in Flink

59

[Diagram: external queries served directly from Flink state]
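
The queryable-state idea can be sketched as serving reads from the operator's own keyed state (plain Python illustrating the concept, not Flink's queryable-state API; all names are illustrative):

```python
class QueryableCounter:
    """Sketch: the stream processor answers queries from its own keyed
    state instead of pushing results into an external database."""
    def __init__(self):
        self.state = {}

    def process(self, key):
        """Ingest one event, updating keyed state."""
        self.state[key] = self.state.get(key, 0) + 1

    def query(self, key):
        """The request/response entry point: read live state directly."""
        return self.state.get(key, 0)

counter = QueryableCounter()
for k in ["page_a", "page_b", "page_a"]:
    counter.process(k)
# counter.query("page_a") == 2, answered from live state with no database hop.
```

The tradeoff is that the state must be fault tolerant and large enough to hold the serving data, which is what the "large state support" above refers to.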

Page 60: Debunking Six Common Myths in Stream Processing

In summary

The need for streaming comes from a rethinking of data infra architecture
• Stream processing then just becomes natural

Debunking six myths
• Myth 1: The Lambda architecture
• Myth 2: The throughput/latency tradeoff
• Myth 3: Exactly once not possible
• Myth 4: Streaming is for (near) real-time
• Myth 5: Batching and buffering
• Myth 6: Streaming is hard

60

Page 61: Debunking Six Common Myths in Stream Processing

Thank you!

@kostas_tzoumas @ApacheFlink @dataArtisans

Page 62: Debunking Six Common Myths in Stream Processing

We are hiring!

data-artisans.com/careers