
Page 1: Introduction to Spark Streaming

Introduction to Streaming in Apache Spark

Based on Apache Spark 1.6.0

Akash Sethi, Software Consultant

Knoldus Software LLP.

Page 2: Introduction to Spark Streaming

Agenda

- What is Streaming
- Abstraction Provided for Streaming
- Execution Process
- Transformation
- Types of Transformations
- Action
- Performance Tuning Options

Page 3: Introduction to Spark Streaming

High Level architecture of Spark Streaming

Streaming in Apache Spark

- Provides a way to consume a continuous stream of data.
- Built on top of Spark Core.
- Supports Java, Scala and Python.
- The API is similar to Spark Core's.

Page 4: Introduction to Spark Streaming

DStream as a continuous series of data

Streaming in Apache Spark

Spark Streaming uses a “micro-batch” architecture. New batches are created at regular time intervals. At the beginning of each time interval a new batch is created, and any data that arrives during that interval gets added to that batch. At the end of the time interval the batch is done growing. The size of the time intervals is determined by a parameter called the batch interval.
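A minimal sketch (not from the slides) of creating a StreamingContext with a 10-second batch interval; the application name and local master are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingApp {
  def main(args: Array[String]): Unit = {
    // Local mode with 2 threads: one for the receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("KnolxStreaming")

    // The batch interval (10 seconds here) decides how often a new batch is created.
    val ssc = new StreamingContext(conf, Seconds(10))

    // ... define DStreams and transformations here ...

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the streaming job is stopped
  }
}
```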

Page 5: Introduction to Spark Streaming

Streaming in Apache Spark

Spark Streaming provides an abstraction called DStreams, or discretized streams. A DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step; RDDs are thus created on the basis of time.

Each input batch forms an RDD, and is processed using Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.

We can also specify the block interval in milliseconds.
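As an illustration (assuming the `ssc` from the sketch above, and a hypothetical socket source on localhost:9999), an input DStream can be created like this; the block interval is a Spark configuration value in milliseconds:

```scala
import org.apache.spark.streaming.dstream.DStream

// Each line read from the socket becomes one element of the DStream;
// every batch interval's worth of lines forms one RDD of the stream.
val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)

// The block interval (how often received data is chunked into blocks inside a
// batch) is set in milliseconds on the SparkConf, before the context is created:
//   conf.set("spark.streaming.blockInterval", "200")   // 200 ms is the default
```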

Page 6: Introduction to Spark Streaming

By default, received data is replicated across two nodes, so Spark Streaming can tolerate single worker failures. Using just lineage, however, recomputation could take a long time for data that has been built up since the beginning of the program. Thus, Spark Streaming also includes a mechanism called checkpointing that saves state periodically to a reliable file system (e.g., HDFS or S3). Typically, you might set up checkpointing every 5–10 batches of data. When recovering lost data, Spark Streaming needs only to go back to the last checkpoint.

Streaming in Apache Spark
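A minimal sketch of enabling checkpointing as described above, assuming the `ssc` StreamingContext from the earlier sketch and a hypothetical HDFS path:

```scala
// State is saved here periodically so recovery does not have to replay the whole lineage.
ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")

// For a windowed or stateful DStream you can also control how often its RDDs are
// checkpointed, e.g. every 10 batches of a 10-second stream:
//   stateStream.checkpoint(Seconds(100))
```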

Page 7: Introduction to Spark Streaming

Execution of Spark Streaming within Spark’s Components

Streaming in Apache Spark

Page 8: Introduction to Spark Streaming

Transformation

Transformations apply some operation on the current DStream and generate a new DStream.

Transformations on DStreams can be grouped into either stateless or stateful:

In stateless transformations the processing of each batch does not depend on the data of its previous batches.

Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.

Page 9: Introduction to Spark Streaming

Stateless Transformation Code
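The code from this slide is not reproduced in the transcript; a typical stateless example, a streaming word count over the hypothetical `lines` DStream from earlier, might look like:

```scala
// Stateless transformations: each batch is processed independently of previous ones.
val words  = lines.flatMap(_.split(" "))   // DStream[String]
val pairs  = words.map(word => (word, 1))  // DStream[(String, Int)]
val counts = pairs.reduceByKey(_ + _)      // per-batch word counts

counts.print()   // output operation: print the first elements of each batch
```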

Page 10: Introduction to Spark Streaming

Transformation on DStream

Stateless Transformation

Page 11: Introduction to Spark Streaming

Stateful Transformations

Stateful transformations are operations on DStreams that track data across time; that is, some data from previous batches is used to generate the results for a new batch.

The two main types of stateful transformations are:

- Windowed Operations
- updateStateByKey()

Page 12: Introduction to Spark Streaming

Stateful Transformation

Windowed Transformations

Windowed operations compute results across a longer time period than the StreamingContext’s batch interval, by combining results from multiple batches.

All windowed operations need two parameters, window duration and sliding duration, both of which must be a multiple of the StreamingContext’s batch interval. The window duration controls how many previous batches of data are considered.

Page 13: Introduction to Spark Streaming

Stateful Transformations

If we had a source DStream with a batch interval of 10 seconds and wanted to create a sliding window of the last 30 seconds (or last 3 batches), we would set the windowDuration to 30 seconds. The sliding duration, which defaults to the batch interval, controls how frequently the new DStream computes results. If we had the source DStream with a batch interval of 10 seconds and wanted to compute our window only on every second batch, we would set our sliding interval to 20 seconds.
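A sketch of the windowed word count described above, assuming the `pairs` DStream from the stateless example and a 10-second batch interval:

```scala
import org.apache.spark.streaming.Seconds

// Count words over the last 30 seconds of data (window duration),
// recomputing every 20 seconds (sliding duration). Both are multiples
// of the 10-second batch interval.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // combine counts within the window
  Seconds(30),                 // window duration
  Seconds(20)                  // sliding duration
)

windowedCounts.print()
```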

Page 14: Introduction to Spark Streaming

Stateful Transformations

updateStateByKey Transformations

Sometimes it’s useful to maintain state across the batches in a DStream.

updateStateByKey() enables this by providing access to a state variable for DStreams of key/value pairs. Given a DStream of (key, event) pairs, it lets you construct a new DStream of (key, state) pairs by taking a function that specifies how to update the state for each key given new events. For example, in a web server log, our events might be visits to the site, where the key is the user ID. Using updateStateByKey(), we could track the last pages each user visited. This list would be our “state” object, and we’d update it as each event arrives.
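A sketch of the simpler running-count variant of this idea (rather than the pages-visited example), again assuming the key/value `pairs` DStream; stateful transformations require checkpointing to be enabled:

```scala
// For each key, merge the new values from the current batch with the
// previously stored state (a running count here).
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

// Checkpointing must be enabled for stateful transformations.
ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")

val runningCounts = pairs.updateStateByKey(updateCount _)
runningCounts.print()
```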

Page 15: Introduction to Spark Streaming

Action/Output Operations

Output operations specify what needs to be done with the final transformed data in a stream.

Output operations are similar to those in Spark Core:

- print the output
- save to a text file
- save as objects in a file, etc.

It also provides some extra methods, such as foreachRDD().
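Sketched against the hypothetical `counts` DStream from earlier, these output operations might look like:

```scala
counts.print()                          // print the first elements of each batch
counts.saveAsTextFiles("counts")        // one directory of text files per batch
counts.saveAsObjectFiles("counts-obj")  // serialized objects, one directory per batch

// foreachRDD gives access to the underlying RDD of every batch, e.g. to push
// results to an external system.
counts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // open a connection per partition and write records (hypothetical sink)
    partition.foreach(record => println(record))
  }
}
```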

Page 16: Introduction to Spark Streaming

Performance Considerations

Spark Streaming applications have a few specialized tuning options.

Batch and Window Sizes

The most common question is what minimum batch size Spark Streaming can use. In general, 500 milliseconds has proven to be a good minimum size for many applications. The best approach is to start with a larger batch size (around 10 seconds) and work your way down to a smaller batch size.

Level of Parallelism

A common way to reduce the processing time of batches is to increase the parallelism.

Page 17: Introduction to Spark Streaming

Increasing the number of receivers

Receivers can sometimes act as a bottleneck if there are too many records for a single machine to read in and distribute. You can add more receivers by creating multiple input DStreams (which creates multiple receivers), and then applying union to merge them into a single stream.

Explicitly repartitioning received data

If receivers cannot be increased any more, you can further redistribute the received data by explicitly repartitioning the input stream (or the union of multiple streams) using DStream.repartition.

Increasing parallelism in aggregation

For operations like reduceByKey(), you can specify the parallelism as a second parameter.

Performance Considerations
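A sketch combining these three techniques, with hypothetical hosts, ports and partition counts:

```scala
// 1. More receivers: one input DStream per receiver, merged with union.
val streams = (1 to 3).map(i => ssc.socketTextStream("localhost", 9000 + i))
val merged  = ssc.union(streams)

// 2. Explicitly repartition the received data across the cluster.
val repartitioned = merged.repartition(8)

// 3. Parallelism in aggregations: pass the number of partitions as a second parameter.
val wordCounts = repartitioned
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _, 16)   // 16 partitions for the shuffle
```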

Page 18: Introduction to Spark Streaming

Demo

https://github.com/knoldus/knolx-spark-streaming

Page 19: Introduction to Spark Streaming

References

- Learning Spark: Lightning-Fast Data Analysis, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
- http://spark.apache.org/streaming/

Page 20: Introduction to Spark Streaming

Thank You