
Discretized Streams: Fault-Tolerant Streaming Computation at Scale

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica
University of California, Berkeley

Abstract

Many “big data” applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.

1 Introduction

Much of “big data” is received in real time, and is most valuable at its time of arrival. For example, a social network may wish to detect trending conversation topics in minutes; a search site may wish to model which users visit a new page; and a service operator may wish to monitor program logs to detect failures in seconds. To enable these low-latency processing applications, there is a need for streaming computation models that scale transparently to large clusters, in the same way that batch models like MapReduce simplified offline processing.

Designing such models is challenging, however, because the scale needed for the largest applications (e.g., realtime log processing or machine learning) can be hundreds of nodes. At this scale, two major problems are faults and stragglers (slow nodes).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Copyright is held by the Owner/Author(s). SOSP’13, Nov. 3–6, 2013, Farmington, Pennsylvania, USA. ACM 978-1-4503-2388-8/13/11. http://dx.doi.org/10.1145/2517349.2522737

Both problems are inevitable in large clusters [12], so streaming applications must recover from them quickly. Fast recovery is even more important in streaming than it was in batch jobs: while a 30 second delay to recover from a fault or straggler is a nuisance in a batch setting, it can mean losing the chance to make a key decision in a streaming setting.

Unfortunately, existing streaming systems have limited fault and straggler tolerance. Most distributed streaming systems, including Storm [37], TimeStream [33], MapReduce Online [11], and streaming databases [5, 9, 10], are based on a continuous operator model, in which long-running, stateful operators receive each record, update internal state, and send new records. While this model is quite natural, it makes it difficult to handle faults and stragglers.

Specifically, given the continuous operator model, systems perform recovery through two approaches [20]: replication, where there are two copies of each node [5, 34], or upstream backup, where nodes buffer sent messages and replay them to a new copy of a failed node [33, 11, 37]. Neither approach is attractive in large clusters: replication costs 2× the hardware, while upstream backup takes a long time to recover, as the whole system must wait for a new node to serially rebuild the failed node’s state by rerunning data through an operator. In addition, neither approach handles stragglers: in upstream backup, a straggler must be treated as a failure, incurring a costly recovery step, while replicated systems use synchronization protocols like Flux [34] to coordinate replicas, so a straggler will slow down both replicas.

This paper presents a new stream processing model, discretized streams (D-Streams), that overcomes these challenges. Instead of managing long-lived operators, the idea in D-Streams is to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals. For example, we might place the data received every second (or every 100ms) into an interval, and run a MapReduce operation on each interval to compute a count. Similarly, we can run a rolling count over several intervals by adding the new count from each interval to the old result. By structuring computations this way, D-Streams make (1) the state at each timestep fully deterministic given the input data, forgoing the need for synchronization protocols, and (2) the dependencies between this state and older data visible at a fine granularity.

This version contains minor clarifications in the related work section over the official conference version.


We show that this enables powerful recovery mechanisms, similar to those in batch systems, that outperform replication and upstream backup.

There are two challenges in realizing the D-Stream model. The first is making the latency (interval granularity) low. Traditional batch systems, such as Hadoop, fall short here because they keep state in replicated, on-disk storage systems between jobs. Instead, we use a data structure called Resilient Distributed Datasets (RDDs) [43], which keeps data in memory and can recover it without replication by tracking the lineage graph of operations that were used to build it. With RDDs, we show that we can attain sub-second end-to-end latencies. We believe that this is sufficient for many real-world big data applications, where the timescale of the events tracked (e.g., trends in social media) is much higher.

The second challenge is recovering quickly from faults and stragglers. Here, we use the determinism of D-Streams to provide a new recovery mechanism that has not been present in previous streaming systems: parallel recovery of a lost node’s state. When a node fails, each node in the cluster works to recompute part of the lost node’s RDDs, resulting in significantly faster recovery than upstream backup without the cost of replication. Parallel recovery was hard to perform in continuous processing systems due to the complex state synchronization protocols needed even for basic replication (e.g., Flux [34]),1 but becomes simple with the fully deterministic D-Stream model. In a similar way, D-Streams can recover from stragglers using speculative execution [12], while previous streaming systems do not handle them.

We have implemented D-Streams in a system called Spark Streaming, based on the Spark engine [43]. The system can process over 60 million records/second on 100 nodes at sub-second latency, and can recover from faults and stragglers in sub-second time. Spark Streaming’s per-node throughput is comparable to commercial streaming databases, while offering linear scalability to 100 nodes, and is 2–5× faster than the open source Storm and S4 systems, while offering fault recovery guarantees that they lack. Apart from its performance, we illustrate Spark Streaming’s expressiveness through ports of two real applications: a video distribution monitoring system and an online machine learning system.

Finally, because D-Streams use the same processing model and data structures (RDDs) as batch jobs, a powerful advantage of our model is that streaming queries can seamlessly be combined with batch and interactive computation. We leverage this feature in Spark Streaming to let users run ad-hoc queries on streams using Spark, or join streams with historical data computed as an RDD. This is a powerful feature in practice, giving users a single API to combine previously disparate computations. We sketch how we have used it in our applications to blur the line between live and offline processing.

1 The only parallel recovery algorithm we are aware of, by Hwang et al. [21], only tolerates one node failure and cannot handle stragglers.

2 Goals and Background

Many important applications process large streams of data arriving in real time. Our work targets applications that need to run on tens to hundreds of machines, and tolerate a latency of several seconds. Some examples are:

• Site activity statistics: Facebook built a distributed aggregation system called Puma that gives advertisers statistics about users clicking their pages within 10–30 seconds and processes 10^6 events/s [35].

• Cluster monitoring: Datacenter operators often collect and mine program logs to detect problems, using systems like Flume [3] on hundreds of nodes [17].

• Spam detection: A social network such as Twitter may wish to identify new spam campaigns in real time using statistical learning algorithms [39].

For these applications, we believe that the 0.5–2 second latency of D-Streams is adequate, as it is well below the timescale of the trends monitored. We purposely do not target applications with latency needs below a few hundred milliseconds, such as high-frequency trading.

2.1 Goals

To run these applications at large scales, we seek a system design that meets four goals:

1. Scalability to hundreds of nodes.

2. Minimal cost beyond base processing—we do not wish to pay a 2× replication overhead, for example.

3. Second-scale latency.

4. Second-scale recovery from faults and stragglers.

To our knowledge, previous systems do not meet these goals: replicated systems have high overhead, while upstream-backup-based ones can take tens of seconds to recover lost state [33, 41], and neither tolerates stragglers.

2.2 Previous Processing Models

Though there has been a wide set of work on distributed stream processing, most previous systems use the same continuous operator model. In this model, streaming computations are divided into a set of long-lived stateful operators, and each operator processes records as they arrive by updating internal state (e.g., a table tracking page view counts over a window) and sending new records in response [10]. Figure 1(a) illustrates.

While continuous processing minimizes latency, the stateful nature of operators, combined with nondeterminism that arises from record interleaving on the network, makes it hard to provide fault tolerance efficiently. Specifically, the main recovery challenge is rebuilding the state of operators on a lost or slow node.


(a) Continuous operator processing model. Each node continuously receives records, updates internal state, and emits new records. Fault tolerance is typically achieved through replication, using a synchronization protocol like Flux or DPC [34, 5] to ensure that replicas of each node see records in the same order (e.g., when they have multiple parent nodes).

(b) D-Stream processing model. In each time interval, the records that arrive are stored reliably across the cluster to form an immutable, partitioned dataset. This is then processed via deterministic parallel operations to compute other distributed datasets that represent program output or state to pass to the next interval. Each series of datasets forms one D-Stream.

Figure 1: Comparison of traditional record-at-a-time stream processing (a) with discretized streams (b).

Previous systems use one of two schemes, replication and upstream backup [20], which offer a sharp tradeoff between cost and recovery time.

In replication, which is common in database systems, there are two copies of the processing graph, and input records are sent to both. However, simply replicating the nodes is not enough; the system also needs to run a synchronization protocol, such as Flux [34] or Borealis’s DPC [5], to ensure that the two copies of each operator see messages from upstream parents in the same order. For example, an operator that outputs the union of two parent streams (the sequence of all records received on either one) needs to see the parent streams in the same order to produce the same output stream, so the two copies of this operator need to coordinate. Replication is thus costly, though it recovers quickly from failures.

In upstream backup, each node retains a copy of the messages it sent since some checkpoint. When a node fails, a standby machine takes over its role, and the parents replay messages to this standby to rebuild its state. This approach thus incurs high recovery times, because a single node must recompute the lost state by running data through the serial stateful operator code. TimeStream [33] and MapReduce Online [11] use this model. Popular message queueing systems, like Storm [37], also use this approach, but typically only provide “at-least-once” delivery for messages, relying on the user’s code to handle state recovery.2

More importantly, neither replication nor upstream backup handle stragglers. If a node runs slowly in the replication model, the whole system is affected because of the synchronization required to have the replicas receive messages in the same order. In upstream backup, the only way to mitigate a straggler is to treat it as a failure, which requires going through the slow state recovery process mentioned above, and is heavy-handed for a problem that may be transient.3 Thus, while traditional streaming approaches work well at smaller scales, they face substantial problems in a large commodity cluster.

2 Storm’s Trident layer [26] automatically keeps state in a replicated database instead, writing updates in batches. This is expensive, as all updates must be replicated transactionally across the network.

3 Discretized Streams (D-Streams)

D-Streams avoid the problems with traditional stream processing by structuring computations as a set of short, stateless, deterministic tasks instead of continuous, stateful operators. They then store the state in memory across tasks as fault-tolerant data structures (RDDs) that can be recomputed deterministically. Decomposing computations into short tasks exposes dependencies at a fine granularity and allows powerful recovery techniques like parallel recovery and speculation. Beyond fault tolerance, the D-Stream model gives other benefits, such as powerful unification with batch processing.

3.1 Computation Model

We treat a streaming computation as a series of deterministic batch computations on small time intervals. The data received in each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce and groupBy, to produce new datasets representing either program outputs or intermediate state. In the former case, the results may be pushed to an external system in a distributed manner.

3 Note that a speculative execution approach as in batch systems would be challenging to apply here because the operator code assumes that it is fed inputs serially, so even a backup copy of an operator would need to spend a long time recovering from its last checkpoint.


Figure 2: High-level overview of the Spark Streaming system. Spark Streaming divides input data streams into batches and stores them in Spark’s memory. It then executes a streaming application by generating Spark jobs to process the batches.

In the latter case, the intermediate state is stored as resilient distributed datasets (RDDs) [43], a fast storage abstraction that avoids replication by using lineage for recovery, as we shall explain. This state dataset may then be processed along with the next batch of input data to produce a new dataset of updated intermediate states. Figure 1(b) shows our model.

We implemented our system, Spark Streaming, based on this model. We used Spark [43] as our batch processing engine for each batch of data. Figure 2 shows a high-level sketch of the computation model in the context of Spark Streaming. This is explained in more detail later.

In our API, users define programs by manipulating objects called discretized streams (D-Streams). A D-Stream is a sequence of immutable, partitioned datasets (RDDs) that can be acted on by deterministic transformations. These transformations yield new D-Streams, and may create intermediate state in the form of RDDs.

We illustrate the idea with a Spark Streaming program that computes a running count of view events by URL. Spark Streaming exposes D-Streams through a functional API similar to LINQ [42, 2] in the Scala programming language.4 The code for our program is:

pageViews = readStream("http://...", "1s")

ones = pageViews.map(event => (event.url, 1))

counts = ones.runningReduce((a, b) => a + b)

This code creates a D-Stream called pageViews by reading an event stream over HTTP, and groups these into 1-second intervals. It then transforms the event stream to get a new D-Stream of (URL, 1) pairs called ones, and performs a running count of these with a stateful runningReduce transformation. The arguments to map and runningReduce are Scala function literals.

4 Other interfaces, such as streaming SQL, would also be possible.

Figure 3: Lineage graph for RDDs in the view count program. Each oval is an RDD, with partitions shown as circles. Each sequence of RDDs is a D-Stream.

To execute this program, Spark Streaming will receive the data stream, divide it into one second batches and store them in Spark’s memory as RDDs (see Figure 2). Additionally, it will invoke RDD transformations like map and reduce to process the RDDs. To execute these transformations, Spark will first launch map tasks to process the events and generate the url-one pairs. Then it will launch reduce tasks that take both the results of the maps and the results of the previous interval’s reduces, stored in an RDD. These tasks will produce a new RDD with the updated counts. Each D-Stream in the program thus turns into a sequence of RDDs.
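To make the per-interval logic concrete, here is a minimal sketch in plain Scala (not the Spark API; all names are ours) of what one runningReduce step computes: the counts from the new interval’s (url, 1) pairs are merged into the counts carried over from the previous interval.

// Sketch of one runningReduce step for the view-count program: combine the
// previous interval's counts with the counts derived from the new events.
def runningCountStep(prevCounts: Map[String, Long],
                     newEvents: Seq[String]): Map[String, Long] = {
  // What the map tasks produce, summed per URL: this interval's (url, 1) pairs.
  val intervalCounts: Map[String, Long] =
    newEvents.groupBy(identity).map { case (url, hits) => url -> hits.size.toLong }
  // What the reduce tasks produce: the old counts plus the new per-URL counts.
  (prevCounts.keySet ++ intervalCounts.keySet).map { url =>
    url -> (prevCounts.getOrElse(url, 0L) + intervalCounts.getOrElse(url, 0L))
  }.toMap
}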

Finally, to recover from faults and stragglers, both D-Streams and RDDs track their lineage, that is, the graph of deterministic operations used to build them [43]. Spark tracks this information at the level of partitions within each distributed dataset, as shown in Figure 3. When a node fails, it recomputes the RDD partitions that were on it by re-running the tasks that built them from the original input data stored reliably in the cluster. The system also periodically checkpoints state RDDs (e.g., by asynchronously replicating every tenth RDD)5 to prevent infinite recomputation, but this does not need to happen for all data, because recovery is often fast: the lost partitions can be recomputed in parallel on separate nodes. In a similar way, if a node straggles, we can speculatively execute copies of its tasks on other nodes [12], because they will produce the same result.

We note that the parallelism usable for recovery in D-Streams is higher than in upstream backup, even if one ran multiple operators per node. D-Streams expose parallelism across both partitions of an operator and time:

1. Much like batch systems run multiple tasks per node, each timestep of a transformation may create multiple RDD partitions per node (e.g., 1000 RDD partitions on a 100-core cluster). When the node fails, we can recompute its partitions in parallel on others.

5 Since RDDs are immutable, checkpointing does not block the job.


2. The lineage graph often enables data from different timesteps to be rebuilt in parallel. For example, in Figure 3, if a node fails, we might lose some map outputs from each timestep; the maps from different timesteps can be rerun in parallel, which would not be possible in a continuous operator system that assumes serial execution of each operator.

Because of these properties, D-Streams can parallelize recovery over hundreds of cores and recover in 1–2 seconds even when checkpointing every 30s (§6.2).

In the rest of this section, we describe the guarantees and programming interface of D-Streams in more detail. We then return to our implementation in Section 4.

3.2 Timing Considerations

Note that D-Streams place records into input datasets based on the time when each record arrives at the system. This is necessary to ensure that the system can always start a new batch on time, and in applications where the records are generated in the same location as the streaming program, e.g., by services in the same datacenter, it poses no problem for semantics. In other applications, however, developers may wish to group records based on an external timestamp of when an event happened, e.g., when a user clicked a link, and records may arrive out of order. D-Streams provide two means to handle this case:

1. The system can wait for a limited “slack time” before starting to process each batch.

2. User programs can correct for late records at the application level. For example, suppose that an application wishes to count clicks on an ad between time t and t + 1. Using D-Streams with an interval size of one second, the application could provide a count for the clicks received between t and t + 1 as soon as time t + 1 passes. Then, in future intervals, the application could collect any further events with external timestamps between t and t + 1 and compute an updated result. For example, it could output a new count for time interval [t, t+1) at time t+5, based on the records for this interval received between t and t+5. This computation can be performed with an efficient incremental reduce operation that adds the old counts computed at t+1 to the counts of new records since then, avoiding wasted work (see the sketch after this list). This approach is similar to order-independent processing [23].
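The following is a minimal sketch of that application-level correction in plain Scala (not the Spark Streaming API; names are illustrative): counts are kept per external-time interval, and any late (interval, additional count) pairs are folded in so updated results can be emitted.

// Sketch of app-level correction for late records: add late counts to the
// previously reported count of the external-time interval they belong to.
def applyLateRecords(counts: Map[Long, Long],
                     late: Seq[(Long, Long)]): Map[Long, Long] =
  late.foldLeft(counts) { case (acc, (interval, n)) =>
    acc.updated(interval, acc.getOrElse(interval, 0L) + n)
  }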

These timing concerns are inherent to stream processing, as any system must handle external delays. They have been studied in detail in databases [23, 36]. In general, any such technique can be implemented over D-Streams by “discretizing” its computation in small batches (running the same logic in batches). Thus, we do not explore these approaches further in this paper.

3.3 D-Stream API

Because D-Streams are primarily an execution strategy (describing how to break a computation into steps), they can be used to implement many of the standard operations in streaming systems, such as sliding windows and incremental processing [10, 4], by simply batching their execution into small timesteps. To illustrate, we describe the operations in Spark Streaming, though other interfaces (e.g., SQL) could also be supported.

In Spark Streaming, users register one or more streams using a functional API. The program can define input streams to be read from outside, which receive data either by having nodes listen on a port or by loading it periodically from a storage system (e.g., HDFS). It can then apply two types of operations to these streams:

• Transformations, which create a new D-Stream from one or more parent streams. These may be stateless, applying separately on the RDDs in each time interval, or they may produce state across intervals.

• Output operations, which let the program write data to external systems. For example, the save operation will output each RDD in a D-Stream to a database.

D-Streams support the same stateless transformations available in typical batch frameworks [12, 42], including map, reduce, groupBy, and join. We provide all the operations in Spark [43]. For example, a program could run a canonical MapReduce word count on each time interval of a D-Stream of words using the following code:

pairs = words.map(w => (w, 1))

counts = pairs.reduceByKey((a, b) => a + b)

In addition, D-Streams provide several stateful transformations for computations spanning multiple intervals, based on standard stream processing techniques such as sliding windows [10, 4]. These include:

Windowing: The window operation groups all the records from a sliding window of past time intervals into one RDD. For example, calling words.window("5s") in the code above yields a D-Stream of RDDs containing the words in intervals [0,5), [1,6), [2,7), etc.
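Conceptually, with a 1 s batch size the window ending at interval t is just the union of the last five per-interval datasets. A minimal plain-Scala sketch of that grouping (for illustration only, not the actual window implementation):

// Sketch: the dataset for a window of `size` intervals ending at interval t is
// the concatenation of the datasets for intervals t - size + 1 through t.
def slidingWindows[T](intervals: Vector[Seq[T]], size: Int): Vector[Seq[T]] =
  intervals.indices.map { t =>
    intervals.slice(math.max(0, t - size + 1), t + 1).flatten
  }.toVector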

Incremental aggregation: For the common use case of computing an aggregate, like a count or max, over a sliding window, D-Streams have several variants of an incremental reduceByWindow operation. The simplest one only takes an associative merge function for combining values. For instance, in the code above, one can write:

pairs.reduceByWindow("5s", (a, b) => a + b)

This computes a per-interval count for each time interval only once, but has to add the counts for the past five seconds repeatedly, as shown in Figure 4(a).


Figure 4: reduceByWindow execution for the associative-only and associative+invertible versions of the operator. Both versions compute a per-interval count only once, but the second avoids re-summing each window. Boxes denote RDDs, while arrows show the operations used to compute window [t, t+5).

If the aggregation function is also invertible, a more efficient version also takes a function for “subtracting” values and maintains the state incrementally (Figure 4(b)):

pairs.reduceByWindow("5s", (a,b) => a+b, (a,b) => a-b)
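Per key, the invertible version’s update can be sketched in plain Scala (again for illustration, not the operator’s implementation): the new sliding counts are the previous sliding counts plus the newest interval’s counts minus the counts of the interval that just left the window.

// Sketch of the invertible update in Figure 4(b), applied per key.
def nextSlidingCounts(prev: Map[String, Long],
                      newest: Map[String, Long],
                      expired: Map[String, Long]): Map[String, Long] =
  (prev.keySet ++ newest.keySet).map { k =>
    k -> (prev.getOrElse(k, 0L) + newest.getOrElse(k, 0L) - expired.getOrElse(k, 0L))
  }.filter(_._2 != 0L).toMap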

State tracking: Often, an application has to track states for various objects in response to a stream of events indicating state changes. For example, a program monitoring online video delivery may wish to track the number of active sessions, where a session starts when the system receives a “join” event for a new client and ends when it receives an “exit” event. It can then ask questions such as “how many sessions have a bitrate above X.” D-Streams provide a track operation that transforms streams of (Key, Event) records into streams of (Key, State) records based on three arguments:

• An initialize function for creating a State from the first Event for a new key.

• An update function for returning a new State given an old State and an Event for its key.

• A timeout for dropping old states.

For example, one could count the active sessions from a stream of (ClientID, Event) pairs called events as follows:

sessions = events.track(
  (key, ev) => 1,                          // initialize function
  (key, st, ev) => ev == Exit ? null : 1,  // update function
  "30s")                                   // timeout

counts = sessions.count()                  // a stream of ints

This code sets each client’s state to 1 if it is active and drops it by returning null from update when it leaves. Thus, sessions contains a (ClientID, 1) element for each active client, and counts counts the sessions.

These operators are all implemented using the batch operators in Spark, by applying them to RDDs from different times in parent streams.

Figure 5: RDDs created by the track operation.

For example, Figure 5 shows the RDDs built by track, which works by grouping the old states and the new events for each interval.
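A minimal plain-Scala sketch of one such step (not Spark Streaming’s implementation, and omitting the timeout argument): the previous (Key, State) pairs are grouped with this interval’s (Key, Event) records, then the initialize and update functions are applied per key, with update able to drop the state.

// Sketch of one track step: group old states with new events by key, then
// apply initialize for new keys and update for existing ones.
def trackStep[K, E, S](prevStates: Map[K, S],
                       events: Seq[(K, E)],
                       initialize: E => S,
                       update: (S, E) => Option[S]): Map[K, S] = {
  val eventsByKey = events.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  (prevStates.keySet ++ eventsByKey.keySet).flatMap { k =>
    val finalState = eventsByKey.getOrElse(k, Nil).foldLeft(prevStates.get(k)) {
      case (None, ev)    => Some(initialize(ev))
      case (Some(s), ev) => update(s, ev)
    }
    finalState.map(k -> _) // a key whose state was dropped does not appear
  }.toMap
}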

Finally, the user calls output operators to send results out of Spark Streaming into external systems (e.g., for display on a dashboard). We offer two such operators: save, which writes each RDD in a D-Stream to a storage system (e.g., HDFS or HBase), and foreachRDD, which runs a user code snippet (any Spark code) on each RDD. For example, a user can print the top K counts with counts.foreachRDD(rdd => print(rdd.top(K))).

3.4 Consistency Semantics

One benefit of D-Streams is that they provide clean consistency semantics. Consistency of state across nodes can be a problem in streaming systems that process each record eagerly. For instance, consider a system that counts page views by country, where each page view event is sent to a different node responsible for aggregating statistics for its country. If the node responsible for England falls behind the node for France, e.g., due to load, then a snapshot of their states would be inconsistent: the counts for England would reflect an older prefix of the stream than the counts for France, and would generally be lower, confusing inferences about the events. Some systems, like Borealis [5], synchronize nodes to avoid this problem, while others, like Storm, ignore it.

With D-Streams, the consistency semantics are clear, because time is naturally discretized into intervals, and each interval’s output RDDs reflect all of the input received in that and previous intervals. This is true regardless of whether the output and state RDDs are distributed across the cluster—users do not need to worry about whether nodes have fallen behind each other. Specifically, the result in each output RDD, when computed, is the same as if all the batch jobs on previous intervals had run in lockstep and there were no stragglers and failures, simply due to the determinism of computations and the separate naming of datasets from different intervals. Thus, D-Streams provide consistent, “exactly-once” processing across the cluster.

3.5 Unification with Batch & Interactive Processing

Because D-Streams follow the same processing model, data structures (RDDs), and fault tolerance mechanisms as batch systems, the two can seamlessly be combined.


Aspect             | D-Streams                                                   | Continuous proc. systems
Latency            | 0.5–2 s                                                     | 1–100 ms unless records are batched for consistency
Consistency        | Records processed atomically with interval they arrive in  | Some systems wait a short time to sync operators before proceeding [5, 33]
Late records       | Slack time or app-level correction                          | Slack time, out of order processing [23, 36]
Fault recovery     | Fast parallel recovery                                      | Replication or serial recovery on one node
Straggler recovery | Possible via speculative execution                          | Typically not handled
Mixing w/ batch    | Simple unification through RDD APIs                         | In some DBs [15]; not in message queueing systems

Table 1: Comparing D-Streams with record-at-a-time systems.

Spark Streaming provides several powerful features to unify streaming and batch processing.

First, D-Streams can be combined with static RDDs computed using a standard Spark job. For instance, one can join a stream of message events against a precomputed spam filter, or compare them with historical data.

Second, users can run a D-Stream program on previous historical data using a “batch mode.” This makes it easy to compute a new streaming report on past data.

Third, users run ad-hoc queries on D-Streams interactively by attaching a Scala console to their Spark Streaming program and running arbitrary Spark operations on the RDDs there. For example, the user could query the most popular words in a time range by typing:

counts.slice("21:00", "21:05").topK(10)

Discussions with developers who have written both offline (Hadoop-based) and online processing applications show that these features have significant practical value. Simply having the data types and functions used for these programs in the same codebase saves substantial development time, as streaming and batch systems currently have separate APIs. The ability to also query state in the streaming system interactively is even more attractive: it makes it simple to debug a running computation, or to ask queries that were not anticipated when defining the aggregations in the streaming job, e.g., to troubleshoot an issue with a website. Without this ability, users typically need to wait tens of minutes for the data to make it into a batch cluster, even though all the relevant state is in memory on stream processing nodes.

3.6 Summary

To end our overview of D-Streams, we compare them with continuous operator systems in Table 1. The main difference is that D-Streams divide work into small, deterministic tasks operating on batches.

Figure 6: Components of Spark Streaming, showing what we added and modified over Spark.

This raises their minimum latency, but lets them employ highly efficient recovery techniques. In fact, some continuous operator systems, like TimeStream and Borealis [33, 5], also delay records, in order to deterministically execute operators that have multiple upstream parents (by waiting for periodic “punctuations” in streams) and to provide consistency. This raises their latency past the millisecond scale and into the second scale of D-Streams.

4 System Architecture

We have implemented D-Streams in a system called Spark Streaming, based on a modified version of the Spark processing engine [43]. Spark Streaming consists of three components, shown in Figure 6:

• A master that tracks the D-Stream lineage graph and schedules tasks to compute new RDD partitions.

• Worker nodes that receive data, store the partitions of input and computed RDDs, and execute tasks.

• A client library used to send data into the system.

As shown in the figure, Spark Streaming reuses many components of Spark, but we also modified and added multiple components to enable streaming. We discuss those changes in Section 4.2.

From an architectural point of view, the main difference between Spark Streaming and traditional streaming systems is that Spark Streaming divides its computations into short, stateless, deterministic tasks, each of which may run on any node in the cluster, or even on multiple nodes. Unlike the rigid topologies in traditional systems, where moving part of the computation to another machine is a major undertaking, this approach makes it straightforward to balance load across the cluster, react to failures, or launch speculative copies of slow tasks. It matches the approach used in batch systems, such as MapReduce, for the same reasons. However, tasks in Spark Streaming are far shorter, usually just 50–200 ms, due to running on in-memory RDDs.

All state in Spark Streaming is stored in fault-tolerant data structures (RDDs), instead of being part of a long-running operator process as in previous systems. RDD partitions can reside on any node, and can even be computed on multiple nodes, because they are computed deterministically.


The system tries to place both state and tasks to maximize data locality, but this underlying flexibility makes speculation and parallel recovery possible.

These benefits come naturally from running on a batch platform (Spark), but we also had to make significant changes to support streaming. We discuss job execution in more detail before presenting these changes.

4.1 Application Execution

Spark Streaming applications start by defining one or more input streams. The system can load streams either by receiving records directly from clients, or by loading data periodically from an external storage system, such as HDFS, where it might be placed by a log collection system [3]. In the former case, we ensure that new data is replicated across two worker nodes before sending an acknowledgement to the client library, because D-Streams require input data to be stored reliably to recompute results. If a worker fails, the client library sends unacknowledged data to another worker.

All data is managed by a block store on each worker, with a tracker on the master to let nodes find the locations of blocks. Because both our input blocks and the RDD partitions we compute from them are immutable, keeping track of the block store is straightforward—each block is simply given a unique ID, and any node that has that ID can serve it (e.g., if multiple nodes computed it). The block store keeps new blocks in memory but drops them in an LRU fashion, as described later.
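As an illustration of the behavior described above, here is a small Scala sketch of an LRU block store keyed by unique block ID, built on java.util.LinkedHashMap in access order. The class name and the entry-count bound are our own simplifications; the actual block store bounds memory rather than entry count and can drop or spill blocks.

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Sketch: immutable blocks keyed by ID, evicting the least recently used block
// once more than maxBlocks are held in memory.
class LruBlockStore(maxBlocks: Int) {
  private val blocks = new JLinkedHashMap[String, Array[Byte]](16, 0.75f, true) {
    override protected def removeEldestEntry(eldest: JMap.Entry[String, Array[Byte]]): Boolean =
      size() > maxBlocks
  }
  def put(blockId: String, data: Array[Byte]): Unit = blocks.put(blockId, data)
  def get(blockId: String): Option[Array[Byte]] = Option(blocks.get(blockId))
}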

To decide when to start processing a new interval, we assume that the nodes have their clocks synchronized via NTP, and have each node send the master a list of block IDs it received in each interval when it ends. The master then starts launching tasks to compute the output RDDs for the interval, without requiring any further kind of synchronization. Like other batch schedulers [22], it simply starts each task whenever its parents are finished.

Spark Streaming relies on Spark’s existing batch scheduler within each timestep [43], and performs many of the optimizations in systems like DryadLINQ [42]:

• It pipelines operators that can be grouped into a single task, such as a map followed by another map.

• It places tasks based on data locality.

• It controls the partitioning of RDDs to avoid shuffling data across the network. For example, in a reduceByWindow operation, each interval’s tasks need to “add” the new partial results from the current interval (e.g., a click count for each page) and “subtract” the results from several intervals ago. The scheduler partitions the state RDDs for different intervals in the same way, so that data for each key (e.g., a page) is consistently on the same node across timesteps. More details are given in [43].

4.2 Optimizations for Stream Processing

While Spark Streaming builds on Spark, we also had to make significant optimizations and changes to this batch engine to support streaming. These included:

Network communication: We rewrote Spark’s data plane to use asynchronous I/O to let tasks with remote inputs, such as reduce tasks, fetch them faster.

Timestep pipelining: Because the tasks inside each timestep may not perfectly utilize the cluster (e.g., at the end of the timestep, there might only be a few tasks left running), we modified Spark’s scheduler to allow submitting tasks from the next timestep before the current one has finished. For example, consider our first map + runningReduce job in Figure 3. Because the maps at each step are independent, we can begin running the maps for timestep 2 before timestep 1’s reduce finishes.

Task Scheduling: We made multiple optimizations to Spark’s task scheduler, such as hand-tuning the size of control messages, to be able to launch parallel jobs of hundreds of tasks every few hundred milliseconds.

Storage layer: We rewrote Spark’s storage layer to support asynchronous checkpointing of RDDs and to increase performance. Because RDDs are immutable, they can be checkpointed over the network without blocking computations on them and slowing jobs. The new storage layer also uses zero-copy I/O for this when possible.

Lineage cutoff: Because lineage graphs between RDDs in D-Streams can grow indefinitely, we modified the scheduler to forget lineage after an RDD has been checkpointed, so that its state does not grow arbitrarily. Similarly, other data structures in Spark that grew without bound were given a periodic cleanup process.

Master recovery: Because streaming applications need to run 24/7, we added support for recovering the Spark master’s state if it fails (Section 5.3).

Interestingly, the optimizations for stream processing also improved Spark’s performance in batch benchmarks by as much as 2×. This is a powerful benefit of using the same engine for stream and batch processing.

4.3 Memory Management

In our current implementation of Spark Streaming, each node’s block store manages RDD partitions in an LRU fashion, dropping data to disk if there is not enough memory. In addition, the user can set a maximum history timeout, after which the system will simply forget old blocks without doing disk I/O (this timeout must be bigger than the checkpoint interval). We found that in many applications, the memory required by Spark Streaming is not onerous, because the state within a computation is typically much smaller than the input data (many applications compute aggregate statistics), and any reliable streaming system needs to replicate data received over the network to multiple nodes, as we do. However, we also plan to explore ways to prioritize memory use.


Figure 7: Recovery time for single-node upstream backup vs. parallel recovery on N nodes, as a function of the load before a failure. We assume the time since the last checkpoint is 1 min. (Lines shown: Upstream Backup, and Parallel Recovery with N = 5, 10, and 20.)

5 Fault and Straggler Recovery

The deterministic nature of D-Streams makes it possible to use two powerful recovery techniques for worker state that are hard to apply in traditional streaming systems: parallel recovery and speculative execution. In addition, it simplifies master recovery, as we shall also discuss.

5.1 Parallel Recovery

When a node fails, D-Streams allow the state RDD partitions that were on the node, and all tasks that it was currently running, to be recomputed in parallel on other nodes. The system periodically checkpoints some of the state RDDs, by asynchronously replicating them to other worker nodes.6 For example, in a program computing a running count of page views, the system could choose to checkpoint the counts every minute. Then, when a node fails, the system detects all missing RDD partitions and launches tasks to recompute them from the last checkpoint. Many tasks can be launched at the same time to compute different RDD partitions, allowing the whole cluster to partake in recovery. As described in Section 3, D-Streams exploit parallelism both across partitions of the RDDs in each timestep and across timesteps for independent operations (e.g., an initial map), as the lineage graph captures dependencies at a fine granularity.

6 Because RDDs are immutable, checkpointing does not block the current timestep’s execution.

To show the benefit of parallel recovery, Figure 7 compares it with single-node upstream backup using a simple analytical model. The model assumes that the system is recovering from a minute-old checkpoint.

In the upstream backup line, a single idle machine performs all of the recovery and then starts processing new records. It takes a long time to catch up at high loads because new records for it continue to arrive while it is rebuilding old state.

Indeed, suppose that the load before failure was λ. Then during each minute of recovery, the backup node can do 1 min of work, but receives λ minutes of new work. Thus, it fully recovers from the λ units of work that the failed node did since the last checkpoint at a time t_up such that t_up · 1 = λ + t_up · λ, which is

  t_up = λ / (1 − λ).

In the other lines, all of the machines partake in recovery, while also processing new records. Supposing there were N machines in the cluster before the failure, the remaining N − 1 machines now each have to recover λ/N work, but also receive new data at a rate of (N/(N−1)) λ. The time t_par at which they catch up with the arriving stream satisfies t_par · 1 = λ/N + t_par · (N/(N−1)) λ, which gives

  t_par = (λ/N) / (1 − (N/(N−1)) λ) ≈ λ / (N (1 − λ)).

Thus, with more nodes, parallel recovery catches up with the arriving stream much faster than upstream backup.
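For concreteness, the two formulas can be evaluated directly; a small Scala sketch (times are in units of the checkpoint interval, here one minute, and λ is the pre-failure load):

// Recovery-time model: single-node upstream backup vs. parallel recovery on N nodes.
def tUpstreamBackup(lambda: Double): Double =
  lambda / (1 - lambda)

def tParallelRecovery(lambda: Double, n: Int): Double =
  (lambda / n) / (1 - lambda * n / (n - 1))

// Example: at lambda = 0.5, upstream backup needs 1.0 min to catch up, while
// parallel recovery on 10 nodes needs roughly 0.11 min.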

5.2 Straggler Mitigation

Besides failures, another concern in large clusters is stragglers [12]. Fortunately, D-Streams also let us mitigate stragglers like batch systems do, by running speculative backup copies of slow tasks. Such speculation would be difficult in a continuous operator system, as it would require launching a new copy of a node, populating its state, and overtaking the slow copy. Indeed, replication algorithms for stream processing, such as Flux and DPC [34, 5], focus on synchronizing two replicas.

In our implementation, we use a simple threshold to detect stragglers: whenever a task runs more than 1.4× longer than the median task in its job stage, we mark it as slow. More refined algorithms could also be used, but we show that this method still works well enough to recover from stragglers within a second.
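The threshold rule can be sketched as follows (plain Scala; the task identifiers and runtime inputs are illustrative, not Spark’s internal scheduler API):

// Sketch of the straggler rule: flag tasks that have run more than 1.4x longer
// than the median running time of the tasks in the same job stage.
def detectStragglers(taskRuntimesMs: Map[String, Long]): Set[String] = {
  if (taskRuntimesMs.isEmpty) Set.empty[String]
  else {
    val sorted = taskRuntimesMs.values.toSeq.sorted
    val median = sorted(sorted.size / 2)
    taskRuntimesMs.collect { case (task, t) if t > 1.4 * median => task }.toSet
  }
}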

5.3 Master Recovery

A final requirement to run Spark Streaming 24/7 was to tolerate failures of Spark’s master. We do this by (1) writing the state of the computation reliably when starting each timestep and (2) having workers connect to a new master and report their RDD partitions to it when the old master fails. A key aspect of D-Streams that simplifies recovery is that there is no problem if a given RDD is computed twice. Because operations are deterministic, such an outcome is similar to recovering from a failure.7 This means that it is fine to lose some running tasks while the master reconnects, as they can be redone.

7 One subtle issue here is output operators; we have designed operators like save to be idempotent, so that the operator outputs each timestep’s worth of data to a known path, and does not overwrite previous data if that timestep was already computed.


Our current implementation stores D-Stream metadata in HDFS, writing (1) the graph of the user’s D-Streams and Scala function objects representing user code, (2) the time of the last checkpoint, and (3) the IDs of RDDs since the checkpoint in an HDFS file that is updated through an atomic rename on each timestep. Upon recovery, the new master reads this file to find where it left off, and reconnects to the workers to determine which RDD partitions are in memory on each one. It then resumes processing each timestep missed. Although we have not yet optimized the recovery process, it is reasonably fast, with a 100-node cluster resuming work in 12 seconds.
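The write-then-atomic-rename pattern for that metadata file can be sketched as follows (Scala over java.nio against a local filesystem, purely for illustration; the real implementation writes the items above to HDFS):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Sketch: write the new checkpoint metadata to a temporary file, then atomically
// rename it over the current file, so a recovering master never reads a partial write.
def writeMetadataAtomically(dir: String, payload: Array[Byte]): Unit = {
  val tmp     = Paths.get(dir, "metadata.tmp")
  val current = Paths.get(dir, "metadata")
  Files.write(tmp, payload)
  Files.move(tmp, current, StandardCopyOption.ATOMIC_MOVE)
}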

6 Evaluation

We evaluated Spark Streaming both using several benchmark applications and by porting two real applications to it: a commercial video distribution monitoring system and a machine learning algorithm for estimating traffic conditions from automobile GPS data [19]. These latter applications also leverage D-Streams’ unification with batch processing, as we shall discuss.

6.1 Performance

We tested the performance of the system using three applications of increasing complexity: Grep, which finds the number of input strings matching a pattern; WordCount, which performs a sliding window count over 30s; and TopKCount, which finds the k most frequent words over the past 30s. The latter two applications used the incremental reduceByWindow operator. We first report the raw scaling performance of Spark Streaming, and then compare it against two widely used streaming systems, S4 from Yahoo! and Storm from Twitter [29, 37]. We ran these applications on “m1.xlarge” nodes on Amazon EC2, each with 4 cores and 15 GB RAM.

Figure 8 reports the maximum throughput that Spark Streaming can sustain while keeping the end-to-end latency below a given target. By “end-to-end latency,” we mean the time from when records are sent to the system to when results incorporating them appear. Thus, the latency includes the time to wait for a new input batch to start. For a 1 second latency target, we use 500 ms input intervals, while for a 2 s target, we use 1 s intervals. In both cases, we used 100-byte input records.

We see that Spark Streaming scales nearly linearly to 100 nodes, and can process up to 6 GB/s (64M records/s) at sub-second latency on 100 nodes for Grep, or 2.3 GB/s (25M records/s) for the other, more CPU-intensive jobs.8 Allowing a larger latency improves throughput slightly, but even the performance at sub-second latency is high.

8 Grep was network-bound due to the cost to replicate the input data to multiple nodes—we could not get the EC2 network to send more than 68 MB/s per node. WordCount and TopK were more CPU-heavy, as they do more string processing (hashes & comparisons).

Figure 8: Maximum throughput attainable under a given latency bound (1 s or 2 s) by Spark Streaming. (Panels show Grep, WordCount, and TopKCount; cluster throughput in GB/s vs. the number of nodes in the cluster.)

Comparison with Commercial Systems Spark Streaming’s per-node throughput of 640,000 records/s for Grep and 250,000 records/s for TopKCount on 4-core nodes is comparable to the speeds reported for commercial single-node streaming systems. For example, Oracle CEP reports a throughput of 1 million records/s on a 16-core machine [31], StreamBase reports 245,000 records/s on 8 cores [40], and Esper reports 500,000 records/s on 4 cores [13]. While there is no reason to expect D-Streams to be slower or faster per-node, the key advantage is that Spark Streaming scales nearly linearly to 100 nodes.

Comparison with S4 and Storm We also compared Spark Streaming against two open source distributed streaming systems, S4 and Storm. Both are continuous operator systems that do not offer consistency across nodes and have limited fault tolerance guarantees (S4 has none, while Storm guarantees at-least-once delivery of records). We implemented our three applications in both systems, but found that S4 was limited in the number of records/second it could process per node (at most 7500 records/s for Grep and 1000 for WordCount), which made it almost 10× slower than Spark and Storm. Because Storm was faster, we also tested it on a 30-node cluster, using both 100-byte and 1000-byte records.

We compare Storm with Spark Streaming in Figure 9, reporting the throughput Spark attains at sub-second latency. We see that Storm is still adversely affected by smaller record sizes, capping out at 115K records/s/node for Grep for 100-byte records, compared to 670K for Spark. This is despite taking several precautions in our Storm implementation to improve performance, including sending "batched" updates from Grep every 100 input records and having the "reduce" nodes in WordCount and TopK only send out new counts every second, instead of each time a count changes. Storm was faster with 1000-byte records, but still 2× slower than Spark.

6.2 Fault and Straggler Recovery

We evaluated fault recovery under various conditions using the WordCount and Grep applications. We used 1-second batches with input data residing in HDFS.


Figure 9: Throughput vs Storm on 30 nodes. [Three panels: Grep, WordCount, TopKCount; x-axis: record size (100 or 1000 bytes); y-axis: throughput (MB/s/node); series: Spark Streaming and Storm.]

Figure 10: Interval processing times for WordCount (WC) and Grep under failures. We show the average time to process each 1s batch of data before a failure, during the interval of the failure, and during 3-second periods after. Results are over 5 runs. [x-axis: interval (before failure, on failure, and successive 3 s periods after); y-axis: processing time (s); series: Grep and WC with 1 or 2 failures.]

We set the data rate to 20 MB/s/node for WordCount and 80 MB/s/node for Grep, which led to a roughly equal per-interval processing time of 0.58 s for WordCount and 0.54 s for Grep. Because the WordCount job performs an incremental reduceByKey, its lineage graph grows indefinitely (each interval subtracts data from 30 seconds in the past), so we gave it a checkpoint interval of 10 seconds. We ran the tests on 20 four-core nodes, using 150 map tasks and 10 reduce tasks per job.

We first report recovery times under these base conditions, in Figure 10. The plot shows the average processing time of 1-second data intervals before the failure, during the interval of failure, and during 3-second periods thereafter, for either 1 or 2 concurrent failures. (The processing for these later periods is delayed while recovering data for the interval of failure, so we show how the system restabilizes.) We see that recovery is fast, with delays of at most 1 second even for two failures and a 10s checkpoint interval. WordCount's recovery takes longer because it has to recompute data going far back, whereas Grep just loses four tasks on each failed node.

Varying the Checkpoint Interval Figure 11 shows the effect of changing WordCount's checkpoint interval. Even when checkpointing every 30s, results are delayed at most 3.5s. With 2s checkpoints, the system recovers in just 0.15s, while still incurring less overhead than full replication.
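The checkpoint interval used in these experiments is a per-stream setting. A minimal sketch, continuing the earlier WordCount sketch and assuming the released Spark Streaming API (the checkpoint directory and the wordPairs stream are hypothetical):

    // Sketch: configuring state checkpointing for the windowed WordCount stream.
    ssc.checkpoint("hdfs://namenode:9000/checkpoints")   // where checkpointed RDDs are written (hypothetical path)
    val windowedCounts = wordPairs
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b,
                            (a: Int, b: Int) => a - b,
                            Seconds(30), Seconds(1))
    windowedCounts.checkpoint(Seconds(10))               // 10 s interval, as in the base experiments above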

Varying the Number of Nodes To see the effect of parallelism, we also tried the WordCount application on a 40-node cluster.

Figure 11: Effect of checkpoint time in WordCount. [x-axis: interval (before failure, on failure, and successive 3 s periods after); y-axis: processing time (s); series: 30s, 10s, and 2s checkpoints.]

Figure 12: Recovery of WordCount on 20 & 40 nodes. [x-axis: interval (before failure, on failure, and successive 3 s periods after); y-axis: processing time (s); series: 30s and 10s checkpoints on 20 and 40 nodes.]

As Figure 12 shows, doubling the number of nodes roughly halves the recovery time. While it may seem surprising that there is so much parallelism given the linear dependency chain of the sliding reduceByWindow operator in WordCount, the parallelism comes because the local aggregations on each timestep can be done in parallel (see Figure 4), and these are the bulk of the work.

Straggler Mitigation Finally, we tried slowing down one of the nodes instead of killing it, by launching a 60-thread process that overloaded the CPU. Figure 13 shows the per-interval processing times without the straggler, with the straggler but with speculative execution (backup tasks) disabled, and with the straggler and speculation enabled. Speculation improves the response time significantly. Note that our current implementation does not attempt to remember straggler nodes across time, so these improvements occur despite repeatedly launching new tasks on the slow node. This shows that even unexpected stragglers can be handled quickly. A full implementation would blacklist slow nodes.
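Speculative execution is inherited from the underlying Spark engine rather than being streaming-specific. A hedged sketch of how it can be enabled; the property names below are standard Spark configuration keys, and the multiplier value is illustrative:

    // Sketch: turning on backup tasks (speculation) for the streaming job's Spark engine.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.speculation", "true")            // launch backup copies of slow tasks
      .set("spark.speculation.multiplier", "1.5")  // a task this much slower than the median counts as a straggler
    val ssc = new StreamingContext(conf, Seconds(1))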

6.3 Real Applications

We evaluated the expressiveness of D-Streams by porting two real applications. Both applications are significantly more complex than the test programs shown so far, and both took advantage of D-Streams to perform batch or interactive processing in addition to streaming.

6.3.1 Video Distribution Monitoring

Conviva provides a commercial management platform for video distribution over the Internet. One feature of this platform is the ability to track the performance across different geographic regions, CDNs, client devices, and ISPs, which allows broadcasters to quickly identify and respond to delivery problems.


Figure 13: Processing time of intervals in Grep and WordCount in normal operation, as well as in the presence of a straggler, with and without speculation. [Per-interval times (WordCount / Grep): no straggler 0.55 s / 0.54 s; straggler without speculation 3.02 s / 2.40 s; straggler with speculation 1.00 s / 0.64 s.]

Figure 14: Results for the video application. (a) shows the number of client sessions supported vs. cluster size. (b) shows the performance of three ad-hoc queries from the Spark shell, which count (1) all active sessions, (2) sessions for a specific customer, and (3) sessions that have experienced a failure. [Panel (a): x-axis nodes in cluster, y-axis active sessions (millions); panel (b): response time (ms) per query.]

The system receives events from video players and uses them to compute more than fifty metrics, including complex metrics such as unique viewers and session-level metrics such as buffering ratio, over different grouping categories.

The current application is implemented in two systems: a custom-built distributed streaming system for live data, and a Hadoop/Hive implementation for historical data and ad-hoc queries. Having both live and historical data is crucial because customers often want to go back in time to debug an issue, but implementing the application on these two separate systems creates significant challenges. First, the two implementations have to be kept in sync to ensure that they compute metrics in the same way. Second, there is a lag of several minutes before data makes it through a sequence of Hadoop import jobs into a form ready for ad-hoc queries.

We ported the application to D-Streams by wrapping the map and reduce implementations in the Hadoop version. Using a 500-line Spark Streaming program and an additional 700-line wrapper that executed Hadoop jobs within Spark, we were able to compute all the metrics (a 2-stage MapReduce job) in batches as small as 2 seconds. Our code uses the track operator described in Section 3.3 to build a session state object for each client ID and update it as events arrive, followed by a sliding reduceByKey to aggregate the metrics over sessions.
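The shape of that pipeline is roughly as sketched below. This is illustrative only: Session, Event, and updateSession are hypothetical application types and helpers, and the paper's track operator is written as updateStateByKey, its closest counterpart in the released Spark Streaming API.

    // Sketch: per-client session state, then windowed aggregation of session metrics.
    val sessions = playerEvents                              // DStream[(ClientId, Event)], hypothetical input
      .updateStateByKey[Session] { (events: Seq[Event], state: Option[Session]) =>
        Some(updateSession(state.getOrElse(Session.empty), events))
      }
    val metrics = sessions
      .map { case (_, s) => (s.metricKey, s.metricValue) }   // hypothetical metric fields
      .reduceByKeyAndWindow((a: Double, b: Double) => a + b, Seconds(2))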

Figure 15: Scalability of the Mobile Millennium job. [x-axis: nodes in cluster; y-axis: GPS observations processed per second.]

We measured the scaling performance of the application and found that on 64 quad-core EC2 nodes, it could process enough events to support 3.8 million concurrent viewers, which exceeds the peak load experienced at Conviva so far. Figure 14(a) shows the scaling.

In addition, we used D-Streams to add a new feature not present in the original application: ad-hoc queries on the live stream state. As shown in Figure 14(b), Spark Streaming can run ad-hoc queries from a Scala shell in less than a second on the RDDs representing session state. Our cluster could easily keep ten minutes of data in RAM, closing the gap between historical and live processing, and allowing a single codebase to do both.
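One way such queries can be expressed, sketched under the assumption that the application keeps a handle to its most recent session-state RDD (foreachRDD is the released API's hook for this; latestSessions and the filter predicates are hypothetical):

    // Sketch: expose the newest state RDD so the Spark shell can query it.
    @volatile var latestSessions: org.apache.spark.rdd.RDD[(ClientId, Session)] = null
    sessions.foreachRDD { rdd => latestSessions = rdd }

    // From the Scala shell: the three ad-hoc queries of Figure 14(b).
    latestSessions.count()                                               // (1) all active sessions
    latestSessions.filter { case (_, s) => s.customer == "X" }.count()   // (2) one customer's sessions
    latestSessions.filter { case (_, s) => s.hadFailure }.count()        // (3) sessions with a failure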

6.3.2 Crowdsourced Traffic Estimation

We applied D-Streams to the Mobile Millennium traffic information system [19], a machine learning based project to estimate automobile traffic conditions in cities. While measuring traffic for highways is straightforward due to dedicated sensors, arterial roads (the roads within a city) lack such infrastructure. Mobile Millennium attacks this problem by using crowdsourced GPS data from fleets of GPS-equipped cars (e.g., taxi cabs) and cellphones running a mobile application.

Traffic estimation from GPS data is challenging, because the data is noisy (due to GPS inaccuracy near tall buildings) and sparse (the system only receives one measurement from each car per minute). Mobile Millennium uses a highly compute-intensive expectation maximization (EM) algorithm to infer the conditions, using Markov Chain Monte Carlo and a traffic model to estimate a travel time distribution for each road link. The previous implementation [19] was an iterative batch job in Spark that ran over 30-minute windows of data.

We ported this application to Spark Streaming using an online version of the EM algorithm that merges in new data every 5 seconds. The implementation was 260 lines of Spark Streaming code, and wrapped the existing map and reduce functions in the offline program. In addition, we found that only using the real-time data could cause overfitting, because the data received in five seconds is so sparse. We took advantage of D-Streams to also combine this data with historical data from the same time during the past ten days to resolve this problem.
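One way to express that combination is sketched below, using a hypothetical loadHistorical helper that returns an RDD of readings from the same time of day over the past ten days; transform is the released API's hook for mixing arbitrary RDD code into each batch, and Reading is a hypothetical record type.

    // Sketch: enrich each 5 s batch of GPS readings with matching historical data.
    val enriched = gpsReadings.transform { (rdd: org.apache.spark.rdd.RDD[Reading], batchTime: Time) =>
      rdd.union(loadHistorical(batchTime))   // hypothetical helper: same time of day, past ten days
    }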

Figure 15 shows the performance of the algorithm on up to 80 quad-core EC2 nodes. The algorithm scales nearly perfectly because it is CPU-bound, and provides answers more than 10× faster than the batch version.9

7 Discussion

We have presented discretized streams (D-Streams), a new stream processing model for clusters. By breaking computations into short, deterministic tasks and storing state in lineage-based data structures (RDDs), D-Streams can use powerful recovery mechanisms, similar to those in batch systems, to handle faults and stragglers.

Perhaps the main limitation of D-Streams is that they have a fixed minimum latency due to batching data. However, we have shown that total delay can still be as low as 1–2 seconds, which is enough for many real-world use cases. Interestingly, even some continuous operator systems, such as Borealis and TimeStream [5, 33], add delays to ensure determinism: Borealis's SUnion operator and TimeStream's HashPartition wait to batch data at "heartbeat" boundaries so that operators with multiple parents see input in a deterministic order. Thus, D-Streams' latency is in a similar range to these systems, while offering significantly more efficient recovery.

Beyond their recovery benefits, we believe that the most important aspect of D-Streams is that they show that streaming, batch and interactive computations can be unified in the same platform. As "big" data becomes the only size of data at which certain applications can operate (e.g., spam detection on large websites), organizations will need the tools to write both lower-latency applications and more interactive ones that use this data, not just the periodic batch jobs used so far. D-Streams integrate these modes of computation at a deep level, in that they follow not only a similar API but also the same data structures and fault tolerance model as batch jobs. This enables rich features like combining streams with offline data or running ad-hoc queries on stream state.

Finally, while we presented a basic implementation of D-Streams, there are several areas for future work:

Expressiveness: In general, as the D-Stream abstraction is primarily an execution strategy, it should be possible to run most streaming algorithms within D-Streams, by simply "batching" the execution of the algorithm into steps and emitting state across them. It would be interesting to port languages like streaming SQL [4] or Complex Event Processing models [14] over them.

Setting the batch interval: Given any application, setting an appropriate batch interval is very important as it directly determines the trade-off between the end-to-end latency and the throughput of the streaming workload.

9 Note that the raw rate of records/second for this algorithm is lower than in our other programs because it performs far more work for each record, drawing 300 Markov Chain Monte Carlo samples per record.

Currently, a developer has to explore this trade-off and determine the batch interval manually. It may be possible for the system to tune it automatically.

Memory usage: Our model of stateful stream processing generates a new RDD to store each operator's state after each batch of data is processed. In our current implementation, this will incur higher memory usage than continuous operators with mutable state. Storing different versions of the state RDDs is essential for the system to perform lineage-based fault recovery. However, it may be possible to reduce the memory usage by storing only the deltas between these state RDDs.

Approximate results: In addition to recomputing lost work, another way to handle a failure is to return approximate partial results. D-Streams provide the opportunity to compute partial results by simply launching a task before its parents are all done, and offer lineage data to know which parents were missing.

8 Related Work

Streaming Databases Streaming databases such as Aurora, Telegraph, Borealis, and STREAM [7, 9, 5, 4] were the earliest academic systems to study streaming, and pioneered concepts such as windows and incremental operators. However, distributed streaming databases, such as Borealis, used replication or upstream backup for recovery [20]. We make two contributions over them.

First, D-Streams provide a more efficient recovery mechanism, parallel recovery, that runs faster than upstream backup without the cost of replication. Parallel recovery is feasible because D-Streams discretize computations into stateless, deterministic tasks. In contrast, streaming databases use a stateful continuous operator model, and require complex protocols for both replication (e.g., Borealis's DPC [5] or Flux [34]) and upstream backup [20]. The only parallel recovery protocol we are aware of, by Hwang et al. [21], only tolerates one node failure, and cannot handle stragglers.

Second, D-Streams also tolerate stragglers, using speculative execution [12]. Straggler mitigation is difficult in continuous operator models because each node has mutable state that cannot be rebuilt on another node without a costly serial replay process.

Large-scale Streaming While several recent systems enable streaming computation with high-level APIs similar to D-Streams, they also lack the fault and straggler recovery benefits of the discretized stream model.

TimeStream [33] runs the continuous, stateful operators in Microsoft StreamInsight [2] on a cluster. It uses a recovery mechanism similar to upstream backup that tracks which upstream data each operator depends on and replays it serially through a new copy of the operator. Recovery thus happens on a single node for each operator, and takes time proportional to that operator's processing window (e.g., 30 seconds for a 30-second sliding window) [33]. In contrast, D-Streams use stateless transformations and explicitly put state in data structures (RDDs) that can (1) be checkpointed asynchronously to bound recovery time and (2) be rebuilt in parallel, exploiting parallelism across data partitions and timesteps to recover in sub-second time. D-Streams can also handle stragglers, while TimeStream does not.

Naiad [27, 28] automatically incrementalizes dataflow computations written in LINQ and is unique in also being able to incrementalize iterative computations. However, it uses traditional synchronous checkpointing for fault tolerance, and cannot respond to stragglers.

MillWheel [1] runs stateful computations using an event-driven API but handles reliability by writing all state to a replicated storage system like BigTable.

MapReduce Online [11] is a streaming Hadoop runtime that pushes records between maps and reduces and uses upstream backup for reliability. However, it cannot recover reduce tasks with long-lived state (the user must manually checkpoint such state into an external system), and does not handle stragglers. Meteor Shower [41] also uses upstream backup, and can take tens of seconds to recover state. iMR [25] offers a MapReduce API for log processing, but can lose data on failure. Percolator [32] runs incremental computations using triggers, but does not offer high-level operators like map and join.

Finally, to our knowledge, almost none of these systems support combining streaming with batch and ad-hoc queries, as D-Streams do. Some streaming databases have supported combining tables and streams [15].

Message Queueing Systems Systems like Storm, S4, and Flume [37, 29, 3] offer a message passing model where users write stateful code to process records, but they generally have limited fault tolerance guarantees. For example, Storm ensures "at-least-once" delivery of messages using upstream backup at the source, but requires the user to manually handle the recovery of state, e.g., by keeping all state in a replicated database [38]. Trident [26] provides a functional API similar to LINQ on top of Storm that manages state automatically. However, Trident does this by storing all state in a replicated database to provide fault tolerance, which is expensive.

Incremental Processing CBP [24] and Comet [18] provide "bulk incremental processing" on traditional MapReduce platforms by running MapReduce jobs on new data every few minutes. While these systems benefit from the scalability and fault/straggler tolerance of MapReduce within each timestep, they store all state in a replicated, on-disk filesystem across timesteps, incurring high overheads and latencies of tens of seconds to minutes. In contrast, D-Streams can keep state unreplicated in memory using RDDs and can recover it across timesteps using lineage, yielding order-of-magnitude lower latencies. Incoop [6] modifies Hadoop to support incremental recomputation of job outputs when an input file changes, and also includes a mechanism for straggler recovery, but it still uses replicated on-disk storage between timesteps, and does not offer an explicit streaming interface with concepts like windows.

Parallel Recovery One recent system that adds parallel recovery to streaming operators is SEEP [8], which allows continuous operators to expose and split up their state through a standard API. However, SEEP requires invasive rewriting of each operator against this API, and does not extend to stragglers.

Our parallel recovery mechanism is also similar to techniques in MapReduce, GFS, and RAMCloud [12, 16, 30], all of which partition recovery work on failure. Our contribution is to show how to structure a streaming computation to allow the use of this mechanism across data partitions and time, and to show that it can be implemented at a small enough timescale for streaming.

9 Conclusion

We have proposed D-Streams, a new model for distributed streaming computation that enables fast (often sub-second) recovery from both faults and stragglers without the overhead of replication. D-Streams forgo conventional streaming wisdom by batching data into small timesteps. This enables powerful recovery mechanisms that exploit parallelism across data partitions and time. We showed that D-Streams can support a wide range of operators and can attain high per-node throughput, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, because D-Streams use the same execution model as batch platforms, they compose seamlessly with batch and interactive queries. We used this capability in Spark Streaming to let users combine these models in powerful ways, and showed how it can add rich features to two real applications.

Spark Streaming is open source, and is now included in Spark at http://spark-project.org.

10 Acknowledgements

We thank the SOSP reviewers and our shepherd for their detailed feedback. This research was supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, a Google PhD Fellowship, and gifts from Amazon Web Services, Google, SAP, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, FitWave, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!.


References

[1] T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. In VLDB, 2013.

[2] M. H. Ali, C. Gerea, B. S. Raman, B. Sezgin, T. Tarnavski, T. Verona, P. Wang, P. Zabback, A. Ananthanarayan, A. Kirilov, M. Lu, A. Raizman, R. Krishnan, R. Schindlauer, T. Grabs, S. Bjeletich, B. Chandramouli, J. Goldstein, S. Bhat, Y. Li, V. Di Nicola, X. Wang, D. Maier, S. Grell, O. Nano, and I. Santos. Microsoft CEP server and online behavioral targeting. Proc. VLDB Endow., 2(2):1558, Aug. 2009.

[3] Apache Flume. http://incubator.apache.org/flume/.

[4] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein, and J. Widom. STREAM: The Stanford stream data management system. In SIGMOD, 2003.

[5] M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker. Fault-tolerance in the Borealis distributed stream processing system. ACM Trans. Database Syst., 2008.

[6] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquin. Incoop: MapReduce for incremental computations. In SOCC '11, 2011.

[7] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In VLDB '02, 2002.

[8] R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD, 2013.

[9] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.

[10] M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. B. Zdonik. Scalable distributed stream processing. In CIDR, 2003.

[11] T. Condie, N. Conway, P. Alvaro, and J. M. Hellerstein. MapReduce online. In NSDI, 2010.

[12] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[13] EsperTech. Performance-related information. http://esper.codehaus.org/esper/performance/performance.html, Retrieved March 2013.

[14] EsperTech. Tutorial. http://esper.codehaus.org/tutorials/tutorial/tutorial.html, Retrieved March 2013.

[15] M. Franklin, S. Krishnamurthy, N. Conway, A. Li, A. Russakovsky, and N. Thombre. Continuous analytics: Rethinking query processing in a network-effect world. In CIDR, 2009.

[16] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In SOSP '03, 2003.

[17] J. Hammerbacher. Who is using Flume in production? http://www.quora.com/Flume/Who-is-using-Flume-in-production/answer/Jeff-Hammerbacher.

[18] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing for data intensive distributed computing. In SoCC, 2010.

[19] T. Hunter, T. Moldovan, M. Zaharia, S. Merzgui, J. Ma, M. J. Franklin, P. Abbeel, and A. M. Bayen. Scaling the Mobile Millennium system in the cloud. In SOCC '11, 2011.

[20] J.-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In ICDE, 2005.

[21] J.-H. Hwang, Y. Xing, and S. Zdonik. A cooperative, self-configuring high-availability solution for stream processing. In ICDE, 2007.

[22] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys '07, 2007.

[23] S. Krishnamurthy, M. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. In SIGMOD, 2010.

[24] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In SoCC, 2010.

[25] D. Logothetis, C. Trezzo, K. C. Webb, and K. Yocum. In-situ MapReduce for log processing. In USENIX ATC, 2011.

[26] N. Marz. Trident: a high-level abstraction for realtime computation. http://engineering.twitter.com/2012/08/trident-high-level-abstraction-for.html.

[27] F. McSherry, D. G. Murray, R. Isaacs, and M. Isard. Differential dataflow. In CIDR, 2013.

[28] D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In SOSP '13, 2013.

[29] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Intl. Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud), 2010.

[30] D. Ongaro, S. M. Rumble, R. Stutsman, J. K. Ousterhout, and M. Rosenblum. Fast crash recovery in RAMCloud. In SOSP, 2011.

[31] Oracle. Oracle complex event processing performance. http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/cepperformancewhitepaper-128060.pdf, 2008.

[32] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010.

[33] Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, and Z. Zhang. TimeStream: Reliable stream computation in the cloud. In EuroSys '13, 2013.

[34] M. Shah, J. Hellerstein, and E. Brewer. Highly available, fault-tolerant, parallel dataflows. In SIGMOD, 2004.

[35] Z. Shao. Real-time analytics at Facebook. XLDB 2011, http://www-conf.slac.stanford.edu/xldb2011/talks/xldb2011 tue 0940facebookrealtimeanalytics.pdf.

[36] U. Srivastava and J. Widom. Flexible time management in data stream systems. In PODS, 2004.

[37] Storm. https://github.com/nathanmarz/storm/wiki.

[38] Guaranteed message processing (Storm wiki). https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing.

[39] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song. Design and evaluation of a real-time URL spam filtering service. In IEEE Symposium on Security and Privacy, 2011.

[40] R. Tibbetts. StreamBase performance & scalability characterization. http://www.streambase.com/wp-content/uploads/downloads/StreamBase White Paper Performance and Scalability Characterization.pdf, 2009.

[41] H. Wang, L.-S. Peh, E. Koukoumidis, S. Tao, and M. C. Chan. Meteor Shower: A reliable stream processing system for commodity data centers. In IPDPS '12, 2012.

[42] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI '08, 2008.

[43] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.