Debunking Six Common Myths in Stream Processing
TRANSCRIPT
Kostas Tzoumas (@kostas_tzoumas)
Flink London Meetup, November 3, 2016

Apache Flink®: State of the Union and What's Next

Debunking Six Common Myths in Stream Processing
Original creators of Apache Flink®
Providers of the dA Platform, a supported Flink distribution
Outline
• What is data streaming
• Myth 1: The Lambda architecture
• Myth 2: The throughput/latency tradeoff
• Myth 3: Exactly once not possible
• Myth 4: Streaming is for (near) real-time
• Myth 5: Batching and buffering
• Myth 6: Streaming is hard
The streaming architecture
Reconsideration of data architecture
• Better app isolation
• More real-time reaction to events
• Robust continuous applications
• Process both real-time and historical data

[Diagram: applications keeping local state, consuming from an event log, served by a query service]
What is (distributed) streaming
• Computations on never-ending "streams" of data records ("events")
• A stream processor distributes the computation in a cluster (a minimal sketch follows below)

[Diagram: your code running in parallel across a cluster]
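A minimal sketch of such a job in Flink's Java DataStream API. The socket source on localhost:9999 is an arbitrary choice for illustration; the same program runs unchanged on one machine or distributed across a cluster.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    // The environment decides how the job is parallelized and distributed.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Never-ending stream of text lines (illustrative source).
    DataStream<String> lines = env.socketTextStream("localhost", 9999);

    lines
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
              if (!word.isEmpty()) {
                out.collect(Tuple2.of(word, 1));
              }
            }
          }
        })
        .keyBy(0)   // hash-partition the stream by word across the cluster
        .sum(1)     // continuously update the running count per word
        .print();

    env.execute("Streaming word count");
  }
}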
What is stateful streaming
• Computation and state
  • E.g., counters, windows of past events, state machines, trained ML models
• The result depends on the history of the stream
• A stateful stream processor gives you the tools to manage state (a sketch follows below)
  • Recover, roll back, version, upgrade, etc.

[Diagram: your code plus managed local state]
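A sketch of managed keyed state in Flink's Java API: a per-key event counter. The tuple layout (key in field f0) and the state name are assumptions for illustration; because the counter lives in Flink-managed state, the framework can snapshot, restore, and version it.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PerKeyCounter extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

  // Managed keyed state: Flink keeps one counter per key and checkpoints it.
  private transient ValueState<Long> count;

  @Override
  public void open(Configuration parameters) {
    count = getRuntimeContext().getState(
        new ValueStateDescriptor<>("count", Long.class));
  }

  @Override
  public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
    Long current = count.value();                    // null on the first event for this key
    long updated = (current == null ? 0L : current) + 1;
    count.update(updated);
    out.collect(Tuple2.of(event.f0, updated));       // emit (key, running count)
  }
}

// Usage on an assumed stream of (key, timestamp) tuples:
//   events.keyBy(0).flatMap(new PerKeyCounter())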
What is event-time streaming
• Data records associated with timestamps (time series data)
• Processing depends on timestamps
• An event-time stream processor gives you the tools to reason about time (a sketch follows below)
  • E.g., handle streams that are out of order
  • Core feature is watermarks: a clock to measure event time

[Diagram: your code plus state, consuming events t1–t4 that arrive out of order and assigning them to event-time windows]
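A sketch of event-time windows with watermarks in Flink's Java API. The toy elements, the 10-second out-of-orderness bound, and the 1-minute window size are arbitrary choices for illustration.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindows {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // Toy input: (sensor id, event timestamp in ms); in practice this would come from Kafka etc.
    DataStream<Tuple2<String, Long>> events = env.fromElements(
        Tuple2.of("sensor-1", 1478131200000L),
        Tuple2.of("sensor-1", 1478131205000L),
        Tuple2.of("sensor-2", 1478131199000L));

    events
        // Watermarks: a clock for event time; tolerate events up to 10 s out of order.
        .assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(10)) {
              @Override
              public long extractTimestamp(Tuple2<String, Long> event) {
                return event.f1;   // the record's own timestamp, not its arrival time
              }
            })
        .keyBy(0)
        .timeWindow(Time.minutes(1))   // tumbling 1-minute windows in event time
        .maxBy(1)                      // latest event per key and window
        .print();

    env.execute("Event-time windows");
  }
}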
What is streaming
• Continuous processing on data that is continuously generated
• I.e., pretty much all "big" data
• It's all about state and time
Myth 1: The Lambda architecture
Myth variations
• Stream processing is approximate
• Stream processing is for transient data
• Stream processing cannot handle high data volume
• Hence, stream processing needs to be coupled with batch processing
Lambda architecture

[Diagram: a scheduler running periodic batch jobs (Job 1, Job 2) over files (file 1, file 2) alongside a streaming job, both feeding a serve & store layer]
Lambda no longer needed
• Lambda was useful in the first days of stream processing (the beginning of Apache Storm)
• Not any more
  • Stream processors can handle very large volumes
  • Stream processors can compute accurate results
• The good news is that I don't hear about Lambda so often anymore
Myth 2: Throughput/latency tradeoff
Myth flavors
• Low-latency systems cannot support high throughput
• In general, you need to trade off one for the other
• There is a "high-throughput" category and a "low-latency" category (naming varies)
Physical limits
• Most stream processing pipelines are network-bottlenecked
• The network dictates both (1) the latency and (2) the throughput
• A well-engineered system achieves the physical limits allowed by the network
Buffering
• It is natural to handle many records together
  • All software and hardware systems do that
  • E.g., the network bundles bytes into frames
• Every streaming system buffers records for performance (Flink certainly does)
  • You don't want to send single records over the network
  • "Record-at-a-time" does not exist at the physical level
Buffering (2)
• Buffering is a performance optimization (see the sketch below)
  • Should be opaque to the user
  • Should not dictate system behavior in any other way
  • Should not impose artificial boundaries
  • Should not limit what you can do with the system
  • Etc.
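In Flink, for example, network buffering is exposed only as a tuning knob, not as a semantic boundary; a minimal sketch, where the 10 ms timeout is an arbitrary value:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Flink collects records into network buffers for throughput; the buffer timeout
    // bounds how long a partially filled buffer may wait before being flushed,
    // trading a little throughput for lower latency without changing program semantics.
    env.setBufferTimeout(10);   // flush network buffers at least every 10 ms
  }
}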
Some numbers

Some more

[Charts: classic batch jobs such as TeraSort, relational join, graph processing, and linear algebra]
Myth 3: Exactly once not possible
What is "exactly once"
• Under failures, the system computes the result as if there was no failure
• In contrast to:
  • At most once: no guarantees
  • At least once: duplicates possible
• Exactly-once state versus exactly-once delivery
Myth variations
• Exactly once is not possible in nature
• Exactly once is not possible end-to-end
• Exactly once is not needed
• You need to trade off performance for exactly once
• (Usually perpetuated by folks until they implement exactly once)
Transactions
• "Exactly once" is transactions: either all actions succeed or none succeed
• Transactions are possible
• Transactions are useful
• Let's not start eventual consistency all over again…
Flink checkpoints
• Periodic, asynchronous, consistent snapshots of application state
• Provide exactly-once state guarantees under failures (a configuration sketch follows below)
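A minimal configuration sketch; the 10-second interval and the HDFS checkpoint path are arbitrary placeholder values.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Take a consistent, asynchronous snapshot of all operator state every 10 seconds.
    env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

    // Store snapshots durably (path is a placeholder) so state survives process failures.
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

    // ... define sources, transformations, and sinks, then call env.execute(...)
  }
}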
End-to-end exactly once
• Checkpoints double as a transaction coordination mechanism
• Source and sink operators (transactional sinks) can take part in checkpoints
• Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates (a sketch of the idea follows below)
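A sketch of the idempotent-update idea behind "effectively once". The slide's example is Flink + Cassandra, where an INSERT is an upsert by primary key; this sketch uses a plain JDBC upsert instead, and the connection URL, table name, and PostgreSQL-style conflict clause are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class IdempotentUpsertSink extends RichSinkFunction<Tuple2<String, Long>> {

  private transient Connection connection;
  private transient PreparedStatement upsert;

  @Override
  public void open(Configuration parameters) throws Exception {
    // Placeholder connection details for illustration.
    connection = DriverManager.getConnection("jdbc:postgresql://localhost:5432/metrics");
    upsert = connection.prepareStatement(
        "INSERT INTO word_counts (word, cnt) VALUES (?, ?) " +
        "ON CONFLICT (word) DO UPDATE SET cnt = EXCLUDED.cnt");
  }

  @Override
  public void invoke(Tuple2<String, Long> record) throws Exception {
    // Writing the same (key, value) twice leaves the table in the same state,
    // so at-least-once replays after a failure become effectively exactly once.
    upsert.setString(1, record.f0);
    upsert.setLong(2, record.f1);
    upsert.executeUpdate();
  }

  @Override
  public void close() throws Exception {
    if (upsert != null) upsert.close();
    if (connection != null) connection.close();
  }
}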
State management
• Checkpoints triple as a state versioning mechanism (savepoints)
• Go back and forth in time while maintaining state consistency
• Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests
Myth 4: Streaming = real time
Myth variations
• I don't have low-latency applications, hence I don't need stream processing
• Stream processing is only relevant for data before it is stored
• We need a batch processor to do heavy offline computations
Low latency and high latency streams

[Diagram: a partitioned event log with events timestamped from 2016-03-01 through 2016-03-12; labels: Stream (low latency), Stream (high latency), Batch (bounded stream)]
Robust continuous applications
Accurate computation
• Batch processing is not an accurate computation model for continuous data
  • It misses the right concepts and primitives
  • Time handling, state across batch boundaries
• Stateful stream processing is a better model
  • Real-time/low-latency is the icing on the cake
Myth 5: Batching and buffering
Myth variations
• There is a "mini-batch" category between batch and streaming
• "Record-at-a-time" versus "mini-batching" or similar "choices"
• Mini-batch systems can get better throughput
Myth variations (2)
• The difference between mini-batching and streaming is latency
• I don't need low latency, hence I need mini-batching
• I have a mini-batching use case
We have answered this already
• You can get both throughput and latency (myth #2)
  • Every system buffers data, from the network to the OS to Flink
• Streaming is a model, not just fast (myth #4)
  • Time and state
  • Low latency is the icing on the cake
Continuous operation
• Data is continuously produced
• Computation should track data production
  • With dynamic scaling, pause-and-resume
• Restarting our pipelines every second is not a great idea, and not just for latency reasons
Myth 6: Streaming is hard
Myth variations
• Streaming is hard to learn
• Streaming is hard to reason about
• Windows? Event time? Triggers? Oh, my!
• Streaming needs to be coupled with batch
• I know batch already
It's about your data and code
• What's the form of your data?
  • Unbounded (e.g., clicks, sensors, logs), or
  • Bounded (e.g., ???*)
• What changes more often?
  • My code changes faster than my data
  • My data changes faster than my code

* Please help me find a great example of naturally static data
It's about your data and code
• If your data changes faster than your code, you have a streaming problem
  • You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches
  • You are probably living with inaccurate results without knowing it
It's about your data and code
• If your code changes faster than your data, you have an exploration problem
  • Using notebooks or other tools for quick data exploration is a good idea
  • Once your code stabilizes, you will have a streaming problem, so you might as well think of it as such from the beginning
Flink in the real world
Flink community
• > 240 contributors, 95 contributors in Flink 1.1
• 42 meetups around the world with > 15,000 members
• 2x-3x growth in 2015, similar in 2016
Powered by Flink
• Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.
• King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.
• Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.
• Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.
See more at flink.apache.org/poweredby.html

• 30 Flink applications in production for more than one year; 10 billion events (2 TB) processed daily
• Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining hundreds of GB of state with exactly-once guarantees
• Largest job has > 20 operators, runs on > 5,000 vCores in a 1,000-node cluster, and processes millions of events per second
Flink Forward 2016
Current work in Flink
Flink's unique combination of features

[Diagram: features grouped under Performance, Event Time, Consistency, and APIs & Libraries]
• Low latency and high throughput
• Well-behaved flow control (back pressure)
• Works on real-time and historic data
• Out-of-order events
• Exactly-once semantics for fault tolerance
• Savepoints (replays, A/B testing, upgrades, versioning)
• Fast and large out-of-core state
• Stateful streaming
• Windows & user-defined state
• Flexible windows (time, count, session, roll-your-own)
• Complex event processing
• Fluent API
Flink 1.1
• Connectors
• Metric system
• (Stream) SQL
• Session windows
• Library enhancements
Flink 1.1 + ongoing development

Flink 1.1: connectors, session windows, (Stream) SQL, library enhancements, metric system

Ongoing development:
• Metrics & visualization
• Dynamic scaling
• Savepoint compatibility
• Checkpoints to savepoints
• More connectors
• Stream SQL
• Windows
• Large state maintenance
• Fine-grained recovery
• Side inputs/outputs
• Window DSL
• Security and authentication
• Mesos & others
• Dynamic resource management
• Queryable state
[Diagram: the same ongoing development items, grouped into Operations, Ecosystem, Application Features, and Broader Audience]
A longer-term vision for Flink
Streaming use cases

Application                     Technology
(Near) real-time apps           Low-latency streaming
Continuous apps                 High-latency streaming
Analytics on historical data    Batch as a special case of streaming
Request/response apps           Large queryable state
Request/response applications
• Queryable state: query Flink state directly instead of pushing results into a database
• Large state support and a query API are coming in Flink

[Diagram: queries answered directly from the stream processor's state]
In summary
• The need for streaming comes from a rethinking of data infrastructure architecture
  • Stream processing then just becomes natural
• Debunking six myths
  • Myth 1: The Lambda architecture
  • Myth 2: The throughput/latency tradeoff
  • Myth 3: Exactly once not possible
  • Myth 4: Streaming is for (near) real-time
  • Myth 5: Batching and buffering
  • Myth 6: Streaming is hard
Thank you! @kostas_tzoumas @ApacheFlink @dataArtisans
We are hiring!
data-artisans.com/careers