Streaming in the Wild with Apache Flink
TRANSCRIPT
![Page 1: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/1.jpg)
Kostas Tzoumas (@kostas_tzoumas)
Hadoop Summit San Jose, June 6, 2016
Streaming in the Wild with Apache Flink™
![Page 2: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/2.jpg)
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced
Hint: you are already doing streaming
![Page 3: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/3.jpg)
Why embrace streaming?
Monitor your business and react in real time
Implement robust continuous applications
Adopt a decentralized architecture
Consolidate analytics infrastructure
![Page 4: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/4.jpg)
React in real time
![Page 5: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/5.jpg)
Streaming versus real-time
Streaming != Real-time
E.g., streaming that is not real time: continuous applications with large windows
E.g., real-time that is not streaming: very fast data warehousing queries
However: streaming applications can be fast
Streaming
Real time
![Page 6: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/6.jpg)
How real-time is Flink?
Yahoo! benchmark*
data Artisans benchmarks**
* https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
** http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ and http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
![Page 7: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/7.jpg)
When and why does this matter?
Immediate reaction to life
• E.g., generate alerts on anomaly/pattern/special event
Avoid unnecessary tradeoffs
• Even if application is not latency-critical
• With Flink you do not pay a price for latency!
![Page 8: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/8.jpg)
Bouygues Telecom – LUX
One of the largest telcos in France. System (among others) used for real time diagnostics and alarming.
Read more: http://data-artisans.com/flink-at-bouygues-html/
![Page 9: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/9.jpg)
Robust continuous applications
![Page 10: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/10.jpg)
Continuous application
A production data application that needs to be live 24/7, feeding other systems (perhaps customer-facing)
Needs to be efficient, consistent, correct, and manageable
Stream processing is a great way to implement continuous applications robustly
![Page 11: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/11.jpg)
Continuous apps with “batch”
[Diagram: over time, a scheduler runs Job 1, Job 2, and Job 3 on file 1, file 2, and file 3; results go to "Serve & store"]
![Page 12: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/12.jpg)
Continuous apps with “lambda”
[Diagram: a scheduler runs Job 1 and Job 2 on file 1 and file 2, alongside a streaming job; results go to "Serve & store"]
![Page 13: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/13.jpg)
Problems with batch and λ
Way too many moving parts (and code duplication)
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
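A minimal sketch of why implicit batch boundaries and out-of-order events are a problem. It is plain Python, not Flink code; the 60-second window, the function names, and the sample timestamps are all hypothetical. Counting per file groups events by when they arrived, while counting per event-time window groups them by when they happened.

```python
from collections import Counter

WINDOW_SECONDS = 60  # hypothetical window size, chosen for illustration


def window_start(event_ts, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains event_ts."""
    return event_ts - (event_ts % size)


def count_per_file(files):
    """Batch-style counting: the file boundary is the implicit time boundary."""
    return [len(events) for events in files]


def count_per_event_time_window(files, size=WINDOW_SECONDS):
    """Streaming-style counting: group by the timestamp carried in each event."""
    counts = Counter()
    for events in files:
        for event_ts in events:
            counts[window_start(event_ts, size)] += 1
    return dict(counts)


# Event timestamps in seconds. The event with timestamp 59 arrives late
# and lands in the second file, although it belongs to the first minute.
files = [[5, 30], [59, 61, 70, 110]]

per_file = count_per_file(files)                  # [2, 4]
per_window = count_per_event_time_window(files)   # {0: 3, 60: 3}
```

The per-file result misattributes the late event to the second minute; the event-time result does not depend on where the file boundary happened to fall.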
![Page 14: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/14.jpg)
Continuous apps with streaming
[Diagram: a single streaming job feeds "Serve & store"]
![Page 15: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/15.jpg)
Extending the Yahoo! benchmark
Work of Jamie Grier, inspired by a real continuous application at Twitter
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
![Page 16: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/16.jpg)
What is the use case?
Counting!
• Tweet impressions or ad views
Most analytics is continuous counting and aggregations grouped by dimensions
• E.g., anomaly detection
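Continuous counting grouped by dimensions could be sketched roughly as follows. This is a plain-Python illustration, not the benchmark program or the Flink API; the class name, the event fields (`ad`, `country`, `ts`), and the 10-second window are assumptions made up for the example.

```python
from collections import defaultdict


class DimensionCounter:
    """Keeps a running count per (dimension values, event-time window)."""

    def __init__(self, dimensions, window_seconds):
        self.dimensions = tuple(dimensions)
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)

    def add(self, event):
        """Update the count for the event's key and return (key, new count)."""
        window = event["ts"] - event["ts"] % self.window_seconds
        key = tuple(event[d] for d in self.dimensions) + (window,)
        self.counts[key] += 1
        return key, self.counts[key]


# Hypothetical ad-view events; "ts" is the event time in seconds.
counter = DimensionCounter(dimensions=["ad", "country"], window_seconds=10)
updates = [
    counter.add({"ad": "a1", "country": "FR", "ts": 3}),
    counter.add({"ad": "a1", "country": "FR", "ts": 7}),
    counter.add({"ad": "a1", "country": "DE", "ts": 8}),
    counter.add({"ad": "a1", "country": "FR", "ts": 12}),
]
```

Each call returns the updated count immediately, which is the "continuous" part: downstream consumers (an alerting rule, for instance) see every change rather than waiting for a batch to finish.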
![Page 17: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/17.jpg)
Requirements
Performance: millions of events/sec, millions of keys
Correctness: counts correlated with timestamps
Consistency: counts should be correct under failures
Manageability: ability to pause & restart, reprocess, change code, etc.
![Page 18: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/18.jpg)
Before Flink
Performance: 1000s of cores needed to sustain workload
Correctness: time handled in application code (or not)
Consistency: approximate results during the day, exact results once a day (lambda)
Manageability: acceptable
![Page 19: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/19.jpg)
After Flink
Performance: 10s of cores needed to sustain workload
Correctness: time handled by framework
Consistency: correct results on demand
Manageability: acceptable
![Page 20: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/20.jpg)
Results (yet to be beaten!)
Same program as Yahoo! benchmark
30x over Storm, plus consistent results
![Page 21: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/21.jpg)
Manageability
Flink savepoints (Flink 1.0): consistent snapshots of stateful applications
• Planned downtime for code upgrades, maintenance, migration, debugging, etc.
Monitoring (Flink 1.1)
Dynamic scaling (Flink 1.2+)
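The idea behind a consistent snapshot of a stateful application can be illustrated with a toy example. This is not how Flink implements savepoints; it is a plain-Python sketch with hypothetical names (`CountingJob`, `savepoint`, `restore`) that shows only why the state and the input position have to be captured together.

```python
import copy


class CountingJob:
    """Toy stateful job: counts keys read from a replayable log."""

    def __init__(self, counts=None, offset=0):
        self.counts = dict(counts or {})
        self.offset = offset  # position of the next log record to read

    def process(self, log, until=None):
        """Consume records from the current offset up to (not including) `until`."""
        end = len(log) if until is None else until
        while self.offset < end:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def savepoint(self):
        """Capture state and input position as one consistent unit."""
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    @classmethod
    def restore(cls, snapshot):
        """Start a new job instance (possibly with upgraded code) from a snapshot."""
        return cls(counts=snapshot["counts"], offset=snapshot["offset"])


log = ["a", "b", "a", "c", "a"]

job = CountingJob()
job.process(log, until=2)      # run for a while
snapshot = job.savepoint()     # planned downtime starts here

resumed = CountingJob.restore(snapshot)
resumed.process(log)           # continue from where the snapshot left off

uninterrupted = CountingJob()
uninterrupted.process(log)
```

Because the snapshot pairs the counts with the offset they correspond to, the resumed job neither skips nor double-counts records, and ends in the same state as a job that was never stopped.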
![Page 22: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/22.jpg)
Decentralized architecture
![Page 23: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/23.jpg)
Streaming and microservices
[Diagram: several apps, each with local state, plus an archive]
A decentralized architecture favors a streaming-based data infrastructure with local application state
![Page 24: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/24.jpg)
Zalando
Slides at http://www.slideshare.net/ZalandoTech/flink-in-zalandos-world-of-microservices-62376341
![Page 25: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/25.jpg)
Zalando
Transitioning from monolithic architecture to microservices
![Page 26: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/26.jpg)
New BI stack
![Page 27: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/27.jpg)
Flink @ Zalando (present & future)
Business process monitoring
• Check if Zalando platform works
• Order & delivery velocities
• SLAs of related events
Continuous ETL
• Transformation, combination, pre-aggregation
• Data cleansing and validation
Complex Event Processing
Sales monitoring
![Page 28: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/28.jpg)
Consolidate analytics
![Page 29: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/29.jpg)
Stream Processing as a Service
How do we make stream processing more accessible to the data analyst?
More familiar interfaces
• Flink 1.1 includes the first version of SQL for static data sets and data streams
Easier deployment
![Page 30: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/30.jpg)
King.com
![Page 31: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/31.jpg)
King.com - RBEA
RBEA – a platform designed to make stream processing available inside King.com
Data scientists submit scripts in Groovy
Flink backend executes these scripts
https://techblog.king.com/rbea-scalable-real-time-analytics-king/
![Page 32: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/32.jpg)
Netflix
Netflix plans to offer Stream Processing as a Service internally in the company
Currently testing Flink and Apache Beam
http://www.slideshare.net/mdaxini/netflix-keystone-streaming-data-pipeline-scale-in-the-clouddbtb2016-62076009
![Page 33: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/33.jpg)
Closing
![Page 34: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/34.jpg)
Disclaimer
A lot of this presentation is based on the work of very talented engineers building data products with Flink
Bouygues Telecom: Amine Abdessemed, ...
Zalando: Mihail Vieru, Javier Lopez
King.com: Gyula Fora, Mattias Andersson, ...
Netflix: Monal Daxini, ...
![Page 35: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/35.jpg)
More Flink tales at Hadoop Summit
Xiaowei Jiang
Blink - Improved Runtime for Flink and its Application in Alibaba Search
Wednesday, June 29, 2016, 2:10PM - 2:50PM, 210C
Stephan Ewen
Turning the Stream Processor into a Database: Building Online Applications on Streams
Thursday, June 30, 2016, 12:20PM - 1:00PM, 212
![Page 36: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/36.jpg)
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016 (watch website)
Early bird deadline: July 15, 2016
www.flink-forward.org
![Page 37: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/37.jpg)
We are hiring!
data-artisans.com/careers
![Page 38: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/38.jpg)
Appendix
![Page 39: Streaming in the Wild with Apache Flink](https://reader031.vdocuments.mx/reader031/viewer/2022022200/58a783ff1a28abef478b5e9f/html5/thumbnails/39.jpg)
Batch < Streaming
In principle, batch is a special case of streaming (global window)
In practice, batch processors can be more efficient than stream processors on batch workloads
Flink is a very efficient batch processor (DataSet code path)
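The "batch is streaming with a global window" claim can be made concrete with a small sketch. It is plain Python rather than Flink code, and the assigner functions and sample timestamps are hypothetical: the same windowed count yields a batch-style total once every event is assigned to one all-encompassing window.

```python
from collections import Counter


def tumbling(size):
    """Window assigner: map a timestamp to the start of its fixed-size window."""
    return lambda ts: ts - ts % size


def global_window(ts):
    """Window assigner: every timestamp falls into the same single window."""
    return "global"


def windowed_count(timestamps, assign_window):
    """Count events per window; only the assigner decides what a window is."""
    counts = Counter()
    for ts in timestamps:
        counts[assign_window(ts)] += 1
    return dict(counts)


timestamps = [1, 4, 12, 25]

streaming_view = windowed_count(timestamps, tumbling(10))  # {0: 2, 10: 1, 20: 1}
batch_view = windowed_count(timestamps, global_window)     # {"global": 4}
```

The aggregation code is identical in both cases; what differs is the window definition, and a global window can only be considered complete once the input is finite and fully read.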