Spark Streaming
Large-scale near-real-time stream processing
UC BERKELEY
Tathagata Das (TD)
Motivation

Many important applications must process large data streams at second-scale latencies
– Check-ins, status updates, site statistics, spam filtering, …

Require large clusters to handle workloads
Require latencies of a few seconds
Case study: Conviva, Inc.

Real-time monitoring of online video metadata

Custom-built distributed streaming system
– 1000s of complex metrics on millions of video sessions
– Requires many dozens of nodes for processing

Hadoop backend for offline analysis
– Generating daily and monthly reports
– Similar computation as the streaming system

Painful to maintain two stacks
Goals

Framework for large-scale stream processing
– Scalable to large clusters (~100 nodes) with near-real-time latency (~1 second)
– Efficiently recovers from faults and stragglers
– Simple programming model that integrates well with batch & interactive queries

Existing systems do not achieve all of these goals
Existing Streaming Systems

Record-at-a-time processing model
– Each node has mutable state
– For each record, update state & send new records

[Diagram: input records pushed through nodes 1–3, each node holding mutable state]
Existing Streaming Systems

Storm
– Replays records if not processed due to failure
– Processes each record at least once
– May update mutable state twice!
– Mutable state can be lost due to failure!

Trident
– Uses transactions to update state
– Processes each record exactly once
– Per-state transaction updates are slow

Neither offers integration with batch processing, and neither can handle stragglers
Spark Streaming
Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

Batch processing models, like MapReduce, recover from faults and stragglers efficiently
– Divide the job into deterministic tasks
– Rerun failed/slow tasks in parallel on other nodes

Same recovery techniques apply at lower time scales
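The idea can be sketched in plain Scala (this is not the Spark API; `processBatch` and the toy batches are illustrative assumptions). Because each micro-batch is processed by a pure, deterministic function, a failed or slow batch can simply be re-run elsewhere and produce the same result:

```scala
// Sketch: a stream treated as a series of small, deterministic batches.
// processBatch is a pure function of its input, so rerunning a failed
// batch on another node deterministically reproduces the same output.
def processBatch(batch: Seq[String]): Map[String, Int] =
  batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

// two micro-batches of a toy word stream
val stream = Iterator(Seq("a", "b", "a"), Seq("b", "b"))
val results = stream.map(processBatch).toList
// results == List(Map("a" -> 2, "b" -> 1), Map("b" -> 2))
```

Determinism is what makes this recovery model cheap: no replicated mutable state is needed, only the ability to re-run the batch.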
Spark Streaming

State between batches kept in memory as an immutable, fault-tolerant dataset
– Specifically as Spark's Resilient Distributed Dataset (RDD)

Batch sizes can be reduced to as low as 1/2 second to achieve ~1 second latency

Potentially combine streaming and batch workloads to build a single unified stack
Discretized Stream Processing

[Diagram: at time = 0–1 and time = 1–2, each interval of the input stream becomes an immutable distributed dataset (replicated in memory); batch operations transform it into a state / output stream, also stored in memory as RDDs]
Fault Recovery

State stored as a Resilient Distributed Dataset (RDD)
– Deterministically re-computable parallel collection
– Remembers the lineage of operations used to create it

Fault / straggler recovery is done in parallel on other nodes

[Diagram: a replicated, fault-tolerant input dataset feeding an operation that produces a state RDD (not replicated)]

Fast recovery from faults without full data replication
Programming Model

A Discretized Stream, or DStream, is a series of RDDs representing a stream of data
– API very similar to RDDs

DStreams can be created…
– Either from live streaming data
– Or by transforming other DStreams
DStream Data Sources

Many sources out of the box
– HDFS
– Kafka
– Flume
– Twitter
– TCP sockets
– Akka actors
– ZeroMQ

Easy to add your own – several were contributed by external developers
Transformations

Build new streams from existing streams
– RDD-like operations
• map, flatMap, filter, count, reduce
• groupByKey, reduceByKey, sortByKey, join
• etc.
– New window and stateful operations
• window, countByWindow, reduceByWindow
• countByValueAndWindow, reduceByKeyAndWindow
• updateStateByKey
• etc.
Output Operations

Send data to the outside world
– saveAsHadoopFiles
– print – prints on the driver's screen
– foreach – arbitrary operation on every RDD
Example

Process a stream of Tweets to find the 20 most popular hashtags in the last 10 minutes

1. Get the stream of Tweets and isolate the hashtags
2. Count the hashtags over a 10 minute window
3. Sort the hashtags by their counts
4. Get the top 20 hashtags
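Before walking through the DStream version, the same four steps can be sketched in plain Scala on a static batch of tweets (the tweet strings and the naive `split`-based tag extraction are illustrative assumptions, not the talk's `getTags` helper):

```scala
// Plain-Scala sketch of the four steps on a static batch of tweet texts
val tweets = Seq("love #spark", "#spark #streaming rocks", "hello #spark")

// 1. isolate the hashtags
val hashTags = tweets.flatMap(_.split("\\s+")).filter(_.startsWith("#"))

// 2. count the hashtags
val tagCounts = hashTags.groupBy(identity).map { case (t, ts) => (t, ts.size) }

// 3 & 4. sort by count, descending, and take the top 20
val topTags = tagCounts.toSeq.sortBy { case (_, cnt) => -cnt }.take(20)
// topTags.head == ("#spark", 3)
```

The DStream version below follows exactly this shape, with each step replaced by its streaming counterpart operating over a sliding window.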
1. Get the stream of hashtags

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))

[Diagram: the tweets DStream is a series of RDDs at times t-1 … t+4; applying the flatMap transformation to each RDD produces the hashTags DStream]
2. Count the hashtags over 10 min

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1))
                        .map(tag => (tag, 1))
                        .reduceByKey(_ + _)

[Diagram: a sliding window operation over the hashTags DStream at times t-1 … t+4 produces the tagCounts DStream]
2. Count the hashtags over 10 min

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags
  .countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram: each new window's tagCounts are computed incrementally by adding the newly arrived batch and subtracting the batch that fell out of the window]
Smart window-based reduce

Technique with count generalizes to reduce
– Need a function to "subtract"
– Applies to invertible reduce functions

Could have implemented counting as:
hashTags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), …)
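The add/subtract pattern can be sketched in plain Scala (this is not the Spark API; the per-interval counts below are made-up illustration, and the window length of 3 intervals is an assumption). Instead of re-reducing the whole window on every slide, we add the incoming batch's value and subtract the expiring one:

```scala
// Sketch of the incremental window trick for an invertible reduce
// function (here: integer addition, whose inverse is subtraction).
def slideWindow(windowSum: Int, incoming: Int, expiring: Int): Int =
  windowSum + incoming - expiring

// per-interval counts for one key; window length = 3 intervals
val counts = Vector(4, 1, 3, 2, 5)
val firstWindow = counts.take(3).sum   // computed once, the "slow" way
val windows = counts.drop(3).zip(counts).scanLeft(firstWindow) {
  case (w, (incoming, expiring)) => slideWindow(w, incoming, expiring)
}
// windows == Vector(8, 6, 10): each slide costs O(1) instead of O(window)
```

This is exactly why the reduce function must be invertible: without a "subtract", every window would have to be recomputed from scratch.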
3. Sort the hashtags by their counts

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags
  .countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))

transform allows arbitrary RDD operations to create a new DStream
4. Get the top 20 hashtags

val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags
  .countByValueAndWindow(Minutes(10), Seconds(1))
val sortedTags = tagCounts.map { case (tag, cnt) => (cnt, tag) }
                          .transform(_.sortByKey(false))
sortedTags.foreach(showTopTags(20) _)

foreach is an output operation
10 most popular hashtags in the last 10 min

// Create the stream of tweets
val tweets = ssc.twitterStream(<username>, <password>)

// Count the tags over a 10 minute window
val tagCounts = tweets.flatMap(status => getTags(status))
                      .countByValueAndWindow(Minutes(10), Seconds(1))

// Sort the tags by counts
val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                          .transform(_.sortByKey(false))

// Show the top 10 tags
sortedTags.foreach(showTopTags(10) _)
Demo
Other OperationsMaintaining arbitrary state, tracking sessions
tweets.updateStateByKey(tweet => updateMood(tweet))
Selecting data directly from a DStreamtagCounts.slice(<from Time>, <to Time>).sortByKey()
tweets
t-1 t t+1 t+2 t+4t+3
user mood
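The semantics of a stateful update can be sketched on plain Scala maps (this `updateState` helper and the running-count example are illustrative assumptions, not Spark's actual `updateStateByKey` signature): for each key, the new values in the batch are combined with the previous state.

```scala
// Sketch of per-key stateful update semantics on plain Scala maps:
// for each key in the batch, combine its new values with prior state.
def updateState[K, V, S](
    state: Map[K, S],
    batch: Seq[(K, V)]
)(update: (Seq[V], Option[S]) => S): Map[K, S] =
  batch.groupBy(_._1).foldLeft(state) { case (st, (k, kvs)) =>
    st.updated(k, update(kvs.map(_._2), st.get(k)))
  }

// running count per user across two batches
val update = (vs: Seq[Int], prev: Option[Int]) => prev.getOrElse(0) + vs.sum
val s1 = updateState(Map.empty[String, Int], Seq("alice" -> 1, "bob" -> 2))(update)
val s2 = updateState(s1, Seq("alice" -> 3))(update)
// s2 == Map("alice" -> 4, "bob" -> 2)
```

In Spark Streaming the state lives in an RDD rather than a local map, so it inherits the same lineage-based fault tolerance as every other DStream.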
Performance

Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs. # nodes in cluster, at 1 sec and 2 sec batch sizes; WordCount scales to ~3.5 GB/s and Grep to ~7 GB/s at 100 nodes]
Comparison with Others

Higher throughput than Storm
– Spark Streaming: 670k records/second/node
– Storm: 115k records/second/node
– Apache S4: 7.5k records/second/node

[Charts: throughput per node (MB/s) vs. record size (100–10000 bytes), Spark vs. Storm, for WordCount and Grep]
Fast Fault Recovery

Recovers from faults/stragglers within 1 sec
Real Applications: Conviva

Real-time monitoring of video metadata
• Implemented Shadoop – a wrapper for running Hadoop jobs over Spark / Spark Streaming
• Ported parts of Conviva's Hadoop stack to run on Spark Streaming

val shJob = new SparkHadoopJob[…]( <Hadoop job> )
shJob.run( <Spark context> )
Real Applications: Conviva

Real-time monitoring of video metadata
• Achieved 1–2 second latency
• Millions of video sessions processed; scales linearly with cluster size

[Chart: active sessions (millions) vs. # nodes in cluster]
Real Applications: Mobile Millennium Project

Traffic estimation using online machine learning
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU intensive; requires 10s of machines for useful computation
• Scales linearly with cluster size

[Chart: GPS observations per second vs. # nodes in cluster]
Failure Semantics

Input data replicated by the system

Lineage of deterministic ops used to recompute RDDs from input data if worker nodes fail

Transformations – exactly once
Output operations – at least once
Java API for Streaming

Developed by Patrick Wendell
Similar to the Spark Java API
Don't need to know Scala to try streaming!
Contributors

5 contributors from UCB, 3 external contributors
– Matei Zaharia, Haoyuan Li
– Patrick Wendell
– Denny Britz
– Sean McNamara*
– Prashant Sharma*
– Nick Pentreath*
– Tathagata Das
Vision - one stack to rule them all

[Diagram: ad-hoc queries, batch processing, and stream processing, all on Spark + Spark Streaming]
Conclusion

Alpha to be released with Spark 0.7 by the weekend

Look at the new Streaming Programming Guide

More about the Spark Streaming system in our paper: http://tinyurl.com/dstreams

Join us at Strata on Feb 26 in Santa Clara