Real-time data processing with Apache Flink
Gyula Fóra
Flink committer
Swedish ICT
Stream processing
▪ Data stream: Infinite sequence of data arriving in a continuous fashion.
▪ Stream processing: Analyzing and acting on real-time streaming data, using continuous queries
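The two definitions above can be illustrated with a minimal sketch in plain Scala (2.13+ standard library only, no Flink). An unbounded `LazyList` stands in for the data stream, and a running sum plays the role of a continuous query. The names `readings` and `runningSum` are illustrative assumptions, not part of any Flink API.

```scala
// A data stream modelled as an infinite, lazily evaluated sequence: 1, 2, 3, ...
val readings: LazyList[Int] = LazyList.iterate(1)(_ + 1)

// A continuous query: emit an updated running sum for every arriving element.
// scanLeft yields the initial 0 first, so it is dropped with tail.
val runningSum: LazyList[Int] = readings.scanLeft(0)(_ + _).tail

// Only a finite prefix of an infinite stream can ever be observed.
val firstFive: List[Int] = runningSum.take(5).toList // List(1, 3, 6, 10, 15)
```

The query never terminates on its own; results are produced incrementally as elements are pulled.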
3 Parts of a Streaming Infrastructure
Gathering, Broker, Analysis
Sensors, Transaction logs, Server Logs, …
Streaming landscape
Apache Storm
• True streaming over distributed dataflow
• Low-level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing emulated on top of batch system (non-native)
• Functional API (DStreams), restricted by batch runtime
Apache Samza
• True streaming built on top of Apache Kafka, state is a first-class citizen
• Slightly different stream notion, low-level API
Apache Flink
• True streaming over stateful distributed dataflow
• Rich functional API exploiting streaming runtime; e.g. rich windowing semantics
What is Flink
[Figure: the Flink stack]
Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R, MRQL, Cascading (WiP)
APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Tez, Embedded
Program compilation
case class Path (from: Long, to: Long)
val tc = edges.iterate(10) {
  paths: DataSet[Path] =>
    val next = paths
      .join(edges)
      .where("to")
      .equalTo("from") {
        (path, edge) => Path(path.from, edge.to)
      }
      .union(paths)
      .distinct()
    next
}
[Figure: program compilation. Pre-flight (Client): Program, Type extraction stack, Optimizer, Dataflow Graph. Master: Task scheduling, Dataflow metadata; deploy operators, track intermediate results. Workers. Example plan: DataSource orders.tbl, Filter, Map, DataSource lineitem.tbl, Join (Hybrid Hash, build HT / probe, hash-part [0] on both inputs), GroupRed (sort, forward).]
Flink Streaming
What is Flink Streaming
Native stream processor (low-latency)
Expressive functional API
Flexible operator state, stream windows
Exactly-once processing guarantees
Native vs non-native streaming
Non-native streaming: a stream discretizer cuts the stream into a sequence of batch jobs (Job, Job, Job, Job)
while (true) {
  // get next few records
  // issue batch computation
}
Native streaming: long-standing operators
while (true) {
  // process next record
}
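The contrast between the two loops can be sketched in plain Scala, with no Flink or Spark involved. The function and class names (`microBatchSums`, `RunningSum`, `nativeSums`) are hypothetical and only illustrate the two execution models: one batch computation per group of records versus one long-standing stateful operator that handles each record as it arrives.

```scala
// Non-native: discretize the input into micro-batches, then issue one
// batch computation (here: a sum) per batch.
def microBatchSums(records: Iterator[Int], batchSize: Int): List[Int] =
  records.grouped(batchSize).map(_.sum).toList

// Native: a long-standing operator keeps its state across records
// and produces an updated result for every single record.
final class RunningSum {
  private var total = 0
  def process(record: Int): Int = { total += record; total }
}

def nativeSums(records: Iterator[Int]): List[Int] = {
  val operator = new RunningSum
  records.map(operator.process).toList
}
```

With input 1 to 5 and a batch size of 2, the micro-batch version yields one result per batch (3, 7, 5), while the native version yields one result per record (1, 3, 6, 10, 15). In the first model, results are only available once a batch has been collected.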
Pipelined stream processor
Streaming Shuffle!
Defining windows in Flink
Trigger policy
• When to trigger the computation on current window
Eviction policy
• When data points should leave the window
• Defines window width/size
E.g., count-based policy
• evict when #elements > n
• start a new window every n-th element
Built-in: Count, Time, Delta policies
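A count-based trigger and eviction policy, as described above, can be sketched over an in-memory sequence. This is a minimal illustration in plain Scala, not Flink's implementation. The function name `countWindows` and its parameters `size` (eviction threshold) and `every` (trigger interval) are assumptions made for this example.

```scala
// Count-based windowing over a finite sequence:
//  - eviction policy: keep at most `size` elements, dropping the oldest
//  - trigger policy: emit the current window after every `every` elements
def countWindows[A](stream: Seq[A], size: Int, every: Int): List[List[A]] = {
  var buffer = Vector.empty[A]
  val emitted = List.newBuilder[List[A]]
  for ((elem, i) <- stream.zipWithIndex) {
    buffer = (buffer :+ elem).takeRight(size)             // evict when #elements > size
    if ((i + 1) % every == 0) emitted += buffer.toList    // trigger on every n-th element
  }
  emitted.result()
}

// Window of the last 3 elements, evaluated every 2 elements.
val windows = countWindows(1 to 6, size = 3, every = 2)
// List(List(1, 2), List(2, 3, 4), List(4, 5, 6))
```

Keeping trigger and eviction separate is what allows sliding windows (`every` < `size`) and tumbling windows (`every` == `size`) to share one mechanism.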
Expressive APIs
case class Word (word: String, frequency: Int)
DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()
DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
DataStream API
Overview of the API
Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality
Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: Policy based flexible windowing (Time, Count, Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
Data stream outputs
For the details please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html
[Figure: example dataflow with operators Src, Src, Map, Filter, Merge, Reduce, Sum, Sink]
Use-case: Financial analytics
Reading from multiple inputs
• Merge stock data from various sources
Window aggregations
• Compute simple statistics over windows
Data-driven windows
• Define arbitrary windowing semantics
Combine with sentiment analysis
• Enrich your analytics with social media feeds (Twitter)
Streaming joins
• Join multiple data streams
Detailed explanation and source code on our blog
• http://flink.apache.org/news/2015/02/09/streaming-example.html
Reading from multiple inputs
case class StockPrice(symbol: String, price: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Read "symbol,price" lines from a socket and parse them
val socketStockStream = env.socketTextStream("localhost", 9999)
  .map(x => { val split = x.split(",")
              StockPrice(split(0), split(1).toDouble) })
// Generated stock streams
val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)
// Merge all stock streams
val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)
Example data (steps (1) to (4) are marked on the slide):
Socket input: "HDP, 23.8", "HDP, 26.6"
Generated streams: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7)
Merged stream: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)
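The parse-and-merge step can be tried without a Flink cluster. The sketch below uses plain Scala collections in place of data streams. The helpers `parse` and `merge` are assumptions for illustration, and merging is modelled as simple concatenation, whereas a real stream merge interleaves elements in arrival order.

```scala
case class StockPrice(symbol: String, price: Double)

// Parse one "symbol,price" line, tolerating whitespace around the fields.
def parse(line: String): StockPrice = {
  val split = line.split(",")
  StockPrice(split(0).trim, split(1).trim.toDouble)
}

// Model of a stream merge: all inputs end up in one sequence of the same type.
def merge(streams: Seq[StockPrice]*): Seq[StockPrice] = streams.flatten

val socketLines = Seq("HDP, 23.8", "HDP, 26.6")
val generated   = Seq(StockPrice("SPX", 2113.9), StockPrice("FTSE", 6931.7))
val stockStream = merge(generated, socketLines.map(parse))
```

All inputs must share the element type `StockPrice`, which is why the socket lines are parsed before merging.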
Window aggregations
val windowedStream = stockStream
  .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))
val lowest = windowedStream.minBy("price")
val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
Example data:
Window contents: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)
lowest: StockPrice(HDP, 23.8)
maxByStock: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 26.6)
rollingMean: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 25.2)
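The three aggregations can be checked against the example data with ordinary Scala collections. This sketch treats one window's contents as a finite `Seq`. It does not use the Flink API, and the value names are chosen only to mirror the slide.

```scala
case class StockPrice(symbol: String, price: Double)

// Contents of a single window, taken from the example data above.
val window = Seq(
  StockPrice("SPX", 2113.9), StockPrice("FTSE", 6931.7),
  StockPrice("HDP", 23.8), StockPrice("HDP", 26.6))

// Lowest price across the whole window.
val lowest: StockPrice = window.minBy(_.price)

// Highest price per symbol.
val maxByStock: Map[String, StockPrice] =
  window.groupBy(_.symbol).map { case (s, ps) => s -> ps.maxBy(_.price) }

// Mean price per symbol.
val meanByStock: Map[String, Double] =
  window.groupBy(_.symbol).map { case (s, ps) => s -> ps.map(_.price).sum / ps.size }
```

For HDP the mean of 23.8 and 26.6 is 25.2 (up to floating-point rounding), which matches the rolling mean shown on the slide.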
Data-driven windows
case class Count(symbol: String, count: Int)
val priceWarnings = stockStream.groupBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)
val warningsPerStock = priceWarnings.map(Count(_, 1))
  .groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
Example data:
stockStream: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)
Delta window for HDP: StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)
warningsPerStock: Count(HDP, 1)
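A delta policy fires when a data point differs from a reference point by more than a threshold. The sketch below is one possible reading of that idea in plain Scala, not Flink's `Delta` implementation. The body of `priceChange` (relative price change) is an assumption, and the first price seen per symbol is used as the reference, where the slide passes a `defaultPrice`.

```scala
case class StockPrice(symbol: String, price: Double)

// Assumed delta function: relative change between two prices.
def priceChange(p1: StockPrice, p2: StockPrice): Double =
  math.abs(p1.price / p2.price - 1)

// Emit a symbol whenever its price moved more than `threshold`
// relative to the reference point; the trigger point becomes the new reference.
def warnings(prices: Seq[StockPrice], threshold: Double): Seq[String] = {
  var reference = Map.empty[String, StockPrice]
  val out = Seq.newBuilder[String]
  for (p <- prices) {
    reference.get(p.symbol) match {
      case Some(ref) if priceChange(p, ref) > threshold =>
        out += p.symbol
        reference = reference.updated(p.symbol, p)
      case Some(_) => ()                                   // change too small, keep reference
      case None    => reference = reference.updated(p.symbol, p)
    }
  }
  out.result()
}

// Count warnings per symbol (the slide does this over a 30 second window).
def countPerSymbol(symbols: Seq[String]): Map[String, Int] =
  symbols.groupBy(identity).map { case (s, xs) => s -> xs.size }
```

On the example data, HDP moves from 23.8 to 26.6, a change of roughly 12%, so a 5% threshold produces one warning for HDP and none for SPX or FTSE.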
Combining with a Twitter stream
val tweetStream = env.addSource(generateTweets _)
val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
  .map(_.toUpperCase())
  .filter(symbols.contains(_))
val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
Example data:
Tweets: "hdp is on the rise!", "I wish I bought more YHOO and HDP stocks"
tweetsPerStock: Count(HDP, 2), Count(YHOO, 1)
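The symbol-extraction pipeline can be reproduced on the two example tweets with plain Scala collections. The contents of `symbols` are an assumption, since the slide does not show how that set is defined, and the 30 second window is replaced by a fixed list of tweets.

```scala
// Assumed set of tracked stock symbols (not shown on the slide).
val symbols = Set("SPX", "FTSE", "HDP", "YHOO")

// Split tweets into words, normalize to upper case, keep known symbols, count them.
def tweetsPerStock(tweets: Seq[String]): Map[String, Int] =
  tweets
    .flatMap(_.split(" "))
    .map(_.toUpperCase)
    .filter(symbols.contains)
    .groupBy(identity)
    .map { case (symbol, mentions) => symbol -> mentions.size }

val counts = tweetsPerStock(Seq(
  "hdp is on the rise!",
  "I wish I bought more YHOO and HDP stocks"))
// Map(HDP -> 2, YHOO -> 1)
```

Splitting on single spaces means a token such as "HDP!" would not match; a real pipeline would likely strip punctuation first.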
Streaming joins
val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
  .onWindow(30, SECONDS)
  .where("symbol")
  .equalTo("symbol") { (c1, c2) => (c1.count, c2.count) }
val rollingCorrelation = tweetsAndWarning
  .window(Time.of(30, SECONDS))
  .mapWindow(computeCorrelation _)
Example data:
tweetsPerStock: Count(HDP, 2), Count(YHOO, 1)
warningsPerStock: Count(HDP, 1)
tweetsAndWarning: (1,2)
rollingCorrelation: 0.5
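The join step can be sketched for the contents of a single window using a for-comprehension over two in-memory sequences. The function name `joinOnSymbol` is hypothetical. The correlation step is left out because `computeCorrelation` is not defined on the slide.

```scala
case class Count(symbol: String, count: Int)

// Equi-join on `symbol` for one window's worth of data:
// pair up the counts of every matching (warning, tweet) record.
def joinOnSymbol(warnings: Seq[Count], tweets: Seq[Count]): Seq[(Int, Int)] =
  for {
    c1 <- warnings
    c2 <- tweets
    if c1.symbol == c2.symbol
  } yield (c1.count, c2.count)

val joined = joinOnSymbol(
  Seq(Count("HDP", 1)),
  Seq(Count("HDP", 2), Count("YHOO", 1)))
// List((1,2))
```

YHOO has tweets but no warnings in this window, so it produces no joined pair.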
Performance
▪ Performance optimizations
• Effective serialization due to strongly typed topologies
• Operator chaining (thread sharing/no serialization)
• Different automatic query optimizations
▪ Competitive performance
• ~ 1.5m events / sec / core
• As a comparison Storm promises ~ 1m tuples / sec / node
Fault tolerance
Overview
▪ Fault tolerance in other systems
• Message tracking/acks (Apache Storm)
• RDD lineage tracking/recomputation
▪ Fault tolerance in Apache Flink
• Based on consistent global snapshots
• Algorithm inspired by Chandy-Lamport
• Low runtime overhead, stateful exactly-once semantics
Checkpointing / Recovery
Asynchronous Barrier Snapshotting for globally consistent checkpoints
Pushes checkpoint barriers through the data flow
[Figure: a barrier travelling through the data stream. Records before the barrier are part of the snapshot; records after the barrier are not in the snapshot (backup till next snapshot). Stages shown: operator checkpoint starting, checkpoint in progress, checkpoint done.]
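The "before barrier = part of the snapshot" rule can be illustrated with a single stateful operator. This is a deliberately simplified, synchronous sketch in plain Scala. The types `Record` and `Barrier` and the function `runWithSnapshots` are assumptions for this example, and the barrier alignment across multiple inputs and the asynchronous state writes of the real algorithm are not modelled.

```scala
sealed trait Event
case class Record(value: Int) extends Event
case class Barrier(checkpointId: Long) extends Event

// A summing operator that snapshots its state whenever a barrier arrives.
// Returns the final state and the snapshot taken for each checkpoint id.
def runWithSnapshots(events: Seq[Event]): (Int, Map[Long, Int]) =
  events.foldLeft((0, Map.empty[Long, Int])) {
    case ((sum, snapshots), Record(v))   => (sum + v, snapshots)
    case ((sum, snapshots), Barrier(id)) => (sum, snapshots.updated(id, sum))
  }

// Records 1 and 2 precede barrier 1, record 3 follows it.
val (finalState, snapshots) =
  runWithSnapshots(Seq(Record(1), Record(2), Barrier(1), Record(3)))
// finalState == 6, snapshots == Map(1 -> 3)
```

After a failure, restoring the state 3 from checkpoint 1 and replaying only the records after the barrier would reproduce the final state 6 without counting any record twice.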
State management
State declared in the operators is managed and checkpointed by Flink
Pluggable backends for storing persistent snapshots
• Currently: JobManager, FileSystem (HDFS, Tachyon)
State partitioning and flexible scaling in the future
Closing
Streaming roadmap for 2015
State management
• New backends for state snapshotting
• Support for state partitioning and incremental snapshots
• Master Failover
Improved monitoring
Integration with other Apache projects
• SAMOA, Zeppelin, Ignite
Streaming machine learning and other new libraries
flink.apache.org
@ApacheFlink