structured streaming in spark

Structured Streaming: Spark Streaming 2.0

Giri R Varatharajan

https://hadoopist.wordpress.com

https://www.linkedin.com/in/girivaratharajan

What is Structured Streaming in Apache Spark

● Continuous data flow programming model introduced in Spark 2.0

● Low Latency & High Throughput System

● Exactly-Once Semantics - No Duplicates

● Stateful aggregation over time, event, window, or record.

● A Streaming platform built on top of Spark SQL

● Express your streaming computation the same way as a batch computation on Spark SQL DataFrames

● Alpha release shipped with Spark 2.0

● Supports HDFS and S3 now; support for Kafka, Kinesis, and other sources is coming very soon.

Spark Streaming < 2.0 Behavior

● Micro-batching: streams are called Discretized Streams (DStreams)

● Running aggregations must be specified with an updateStateByKey method

● Requires careful construction of fault tolerance.
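To make the pre-2.0 style concrete, here is a minimal sketch of a running word count on DStreams. The object name, the 5-second batch interval, the checkpoint path, and the socket host/port are assumptions chosen for illustration; they are not taken from the original demo.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  // Merge the counts seen in the current micro-batch into the running total for a key.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(running.getOrElse(0) + newValues.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // assumed 5-second micro-batches
    ssc.checkpoint("/tmp/dstream-checkpoint")        // stateful operations need a checkpoint dir

    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateCount _)          // state handling is explicit here

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that the state update function and the checkpoint directory both have to be wired up by hand, which is the burden the bullets above describe.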

Micro Batching

Streaming Model

● Live data streams keep appending to a DataFrame called the Unbounded Table.

● Runs incremental aggregates on the Unbounded Table.
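The unbounded-table idea can be illustrated without Spark at all. The following plain-Scala sketch is only a conceptual model (all names are hypothetical): each batch of newly arrived lines is folded into the previous result, rather than recomputing the aggregate over every row seen so far.

```scala
// Conceptual model only; this is not how Spark implements the feature internally.
object UnboundedTableModel {
  // The "result table": running word counts after each batch of new rows.
  type Counts = Map[String, Long]

  // Fold one batch of newly arrived lines into the previous result,
  // touching only the keys that appear in the new batch.
  def update(previous: Counts, newLines: Seq[String]): Counts =
    newLines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .foldLeft(previous) { (acc, word) =>
        acc.updated(word, acc.getOrElse(word, 0L) + 1L)
      }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("cat dog", "dog"), Seq("owl cat"))
    // scanLeft yields the result table after every batch; drop the empty seed.
    val results = batches.scanLeft(Map.empty[String, Long])(update)
    results.tail.foreach(println)
  }
}
```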

Spark Streaming 2.0 Behavior + Demo

● Continuous data flow: streams are appended to an Unbounded Table with DataFrame APIs on it.

● No need to specify any method for running aggregates over time, window, or record.

● Look at the network socket word count program.

● Streaming output is produced in Complete, Append, or Update mode.

Continuous Data Flow

lines = Input Table

wordCounts = Result Table

Streaming Model

// Socket stream: read lines as they arrive on the NetCat channel
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
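A minimal sketch of how the socket reader above might be extended into the complete word count, from input table (lines) to result table (wordCounts). It assumes a NetCat server started with `nc -lk 9999`; the object and method names are illustrative and not from the original demo code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StructuredNetworkWordCount {
  // Build the result table from the input table. The same function also
  // works on a static DataFrame with a single string column named "value".
  def wordCounts(spark: SparkSession, lines: DataFrame): DataFrame = {
    import spark.implicits._
    lines.as[String].flatMap(_.split(" ")).groupBy("value").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredNetworkWordCount")
      .master("local[2]")
      .getOrCreate()

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Complete mode: print the whole result table after every trigger.
    val query = wordCounts(spark, lines).writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

No state update function is passed anywhere; the running aggregate falls out of the ordinary groupBy and count.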

Streaming Model

val windowedCounts = words
  .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
  .count()
  .orderBy("window")
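The windowed snippet leaves `words`, `windowDuration`, and `slideDuration` undefined. One way they could be supplied is sketched below, assuming the socket source is asked to attach an arrival timestamp to each line via its `includeTimestamp` option; the object name and the "10 seconds" / "5 seconds" durations are illustrative assumptions.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.window

object WindowedWordCount {
  // `lines` is expected to have columns (value: String, timestamp: Timestamp).
  def windowedCounts(spark: SparkSession, lines: DataFrame,
                     windowDuration: String, slideDuration: String): DataFrame = {
    import spark.implicits._
    // Split each line into words, keeping the line's timestamp with every word.
    val words = lines.as[(String, Timestamp)]
      .flatMap(line => line._1.split(" ").map(word => (word, line._2)))
      .toDF("word", "timestamp")

    words
      .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
      .count()
      .orderBy("window")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WindowedWordCount").master("local[2]").getOrCreate()
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true) // adds the "timestamp" column
      .load()

    // Sorting a streaming aggregate is only allowed in complete mode.
    windowedCounts(spark, lines, "10 seconds", "5 seconds").writeStream
      .outputMode("complete")
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```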

Create/Read Streams

SparkSession.readStream()

● File Source (HDFS, S3; Text, Parquet, CSV, JSON, etc.)

● Socket Stream (NetCat)

● Kafka, Kinesis, and other input sources are under research, so cross your fingers.

● DataStreamReader API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)
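As a sketch of the file source, the following reads a directory of JSON files as a stream. The schema fields (`user`, `action`, `time`) and the object name are made up for illustration. File-based streaming sources expect the schema to be supplied up front by default, so it is declared explicitly here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object FileStreamRead {
  // Hypothetical event schema; adjust to match the real files.
  val eventSchema: StructType = new StructType()
    .add("user", StringType)
    .add("action", StringType)
    .add("time", TimestampType)

  // Every new JSON file that lands in `dir` becomes new rows of the stream.
  def readEvents(spark: SparkSession, dir: String): DataFrame =
    spark.readStream
      .schema(eventSchema)
      .json(dir)
}
```

The same reader offers `parquet`, `csv`, and `text` in place of `json`; the directory may live on HDFS or S3.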

Outputting Streams

SparkSession.writeStream()

Output Sink Types:

● Parquet Sink - HDFS, S3, Parquet

● Console Sink - Terminal

● Memory Sink - In memory table that can be queried over time interactively

● Foreach Sink

● DataStreamWriter API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)

Output Modes:

● Append Mode (Default)

○ Only new rows are appended

○ Applicable only to non-aggregated queries (select, where, filter, join, etc.)

● Complete Mode

○ Outputs the whole result table to the sink

○ Applicable only to aggregated queries (groupBy, etc.)

● Update Mode

○ Only rows whose attributes were updated are written to the output sink.
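A small sketch of how a sink and an output mode are combined on the writer, using the memory sink with complete mode. The table name `word_counts` and the object name are assumptions for illustration; `counts` is expected to be an aggregated streaming DataFrame, since complete mode applies only to aggregated queries.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object MemorySinkDemo {
  // Keep the full result table in an in-memory table named "word_counts".
  def startCounts(counts: DataFrame): StreamingQuery =
    counts.writeStream
      .format("memory")          // memory sink
      .queryName("word_counts")  // becomes the name of the queryable table
      .outputMode("complete")    // rewrite the whole result on every trigger
      .start()
}
```

While the query is running, the table can be inspected interactively, for example with `spark.sql("SELECT * FROM word_counts").show()`.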

CheckPointing

● In case of failure, recover the progress and state of a previous query and continue where it left off.

● Configure a checkpoint location (the checkpointLocation option) on the DataStreamWriter returned by writeStream

● Must be configured for the Parquet Sink and File Sink.
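Checkpointing can be sketched as a single extra option on the writer. The object name, method name, and directory parameters below are hypothetical; `events` is assumed to be a non-aggregated streaming DataFrame, which is what append mode expects.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object CheckpointedSink {
  // Write the stream as Parquet files and record progress in checkpointDir,
  // so a restarted query can continue where the previous run left off.
  def start(events: DataFrame, outputDir: String, checkpointDir: String): StreamingQuery =
    events.writeStream
      .format("parquet")
      .option("path", outputDir)
      .option("checkpointLocation", checkpointDir)
      .outputMode("append")
      .start()
}
```

Restarting the same query with the same checkpoint directory is what lets it resume; pointing it at a fresh directory starts from scratch.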

Operations Not Yet Supported

● Sort, limit / first N rows, and distinct on input streams

● Joins between two streaming datasets

● Outer joins (full, left, right) between two streaming datasets.

● ds.count() ⇒ use ds.groupBy().count() instead
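The count workaround in the last bullet can be sketched as follows; the object and method names are illustrative. `ds.count()` is an action that tries to return one number immediately, whereas an empty `groupBy()` expresses the same total as a query whose result can be emitted with writeStream.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

object StreamingCount {
  // Global aggregate with no grouping columns: one row holding the running total.
  // Works on both static and streaming Datasets.
  def runningTotal(ds: Dataset[_]): DataFrame =
    ds.groupBy().count()
}
```

On a streaming Dataset the result would typically be started with `outputMode("complete")`, since it is an aggregated query.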

Key Takeaways

● Structured Streaming is still experimental, but please try it out.

● Streaming events are gathered and appended to an infinite DataFrame (the Unbounded Table), and queries run on top of it.

● Development is very similar to development against the static DataFrame/Dataset APIs.

● Execute ad-hoc queries, run aggregates, update DBs, track session data, prepare dashboards, etc.

● readStream() - the schema of a streaming DataFrame is checked only at run time, hence it is untyped.

● writeStream() offers various output modes and output sinks. Always remember which output mode to use when.

● Kafka, Kinesis, MLlib integrations, sessionization, and watermarks are upcoming features being developed by the open source community.

● Structured Streaming is not recommended for production workloads at this point, even for file streaming or socket streaming.

Thank You

Spark code is available on my GitHub: https://github.com/vgiri2015/Spark2.0-and-greater/tree/master/src/main/scala/structStreaming

Other Spark-related repositories: https://github.com/vgiri2015/spark-latest-v1

My blogs and learning in Spark: https://hadoopist.wordpress.com/category/apache-spark/
