Structured Streaming in Spark
Post on 16-Apr-2017
Structured Streaming
Spark Streaming 2.0
Giri R Varatharajan
https://hadoopist.wordpress.com
https://www.linkedin.com/in/girivaratharajan
What is Structured Streaming in Apache Spark?
Continuous data flow programming model introduced in Spark 2.0
Low-latency, high-throughput system
Exactly-once semantics: no duplicates
Stateful aggregation over time, event, ...
A streaming platform built on top of Spark SQL
Express your streaming computation the same way as your batch computation in Spark SQL
Alpha release shipped with Spark 2.0
Supports HDFS and S3 now; support for Kafka, Kinesis and other sources is coming very soon.
Spark Streaming (DStreams):
Micro-batching: streams are represented as Discretized Streams (DStreams).
Running aggregations must be specified with the updateStateByKey method.
Requires careful construction of fault tolerance.
Structured Streaming:
Live data streams keep appending to a DataFrame called the Unbounded Table.
Runs incremental aggregates on the Unbounded Table.
Continuous data flow: streams are appended to an Unbounded Table that exposes the DataFrame APIs.
No need to specify any method for running aggregates over time, window, or record.
Look at the network socket word count program. Streaming output is produced in Complete, Append, or Update mode.
Continuous Data Flow
lines = Input Table
wordCounts = Result Table
// Socket stream: read lines as and when they arrive on the NetCat channel
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
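The socket reader above is only the input half of the word count. A minimal sketch of the whole program follows, assuming a local SparkSession and a NetCat server started with `nc -lk 9999`; the app name and master setting are illustrative, not taken from the slides.

```scala
import org.apache.spark.sql.SparkSession

// Assumed setup: app name and local master are illustrative only.
val spark = SparkSession.builder
  .appName("StructuredNetworkWordCount")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Input table: one row per line received on the socket (single column "value").
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split each line into words, then keep a running count per word.
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Result table: print the complete set of counts after every trigger.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

In spark-shell the query keeps running in the background after start(). In a standalone application, call query.awaitTermination() so the driver does not exit.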
val windowedCounts = words
  .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
  .count()
  .orderBy("window")
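The windowed count above relies on a words DataFrame with word and timestamp columns and on two duration values that the slide does not show. One possible way to build them is sketched here, assuming a Spark version whose socket source supports the includeTimestamp option (the 2.0 alpha may not) and using made-up window settings.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder
  .appName("StructuredNetworkWordCountWindowed")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Assumed window settings; the slide duration must not exceed the window duration.
val windowDuration = "10 seconds"
val slideDuration = "5 seconds"

// includeTimestamp asks the socket source to attach an arrival timestamp to each line.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// Pair every word with the timestamp of the line it came from.
val words = lines.as[(String, Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Count words per sliding window; sorting is allowed here because the
// query is aggregated and written in complete mode.
val windowedCounts = words
  .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
  .count()
  .orderBy("window")

val query = windowedCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("truncate", "false")
  .start()
```

With a 10 second window sliding every 5 seconds, each word is counted in two overlapping windows.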
File Source (HDFS, S3, Text, Parquet, CSV, JSON, etc.)
Socket Stream (NetCat)
Kafka, Kinesis and other input sources are under research, so cross your fingers.
DataStreamReader API (http://spark.apache.org/docs/latest/api/scala/index
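As a rough illustration of the file source, the sketch below watches a directory for new CSV files. The schema, the column names and the temporary directory are all hypothetical; file sources expect the schema to be declared up front.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder
  .appName("FileSourceSketch")
  .master("local[2]")
  .getOrCreate()

// Hypothetical input directory; any CSV file dropped here becomes new rows.
val inputDir = Files.createTempDirectory("streaming-input").toString

// Hypothetical schema for the incoming files.
val userSchema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

val people = spark.readStream
  .schema(userSchema)
  .csv(inputDir)

// A non-aggregated query, so append mode is appropriate.
val adults = people.filter("age >= 18")

val query = adults.writeStream
  .outputMode("append")
  .format("console")
  .start()

// Let the query run for up to ten seconds, then shut it down.
query.awaitTermination(10000)
query.stop()
```

Swapping csv for json, parquet or text follows the same pattern; for HDFS or S3 the directory would be a path on that file system.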
Output Sink Types:
Parquet Sink - HDFS, S3, Parquet
Console Sink - Terminal
Memory Sink - In-memory table that can be queried interactively over time
Foreach Sink
DataStreamWriter
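To make the memory sink concrete, here is a small sketch that writes a running aggregate into an in-memory table and then queries it with SQL. It assumes the built-in "rate" test source, which ships with newer Spark releases rather than 2.0, as a stand-in for a real input; the table name parity_counts is made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("MemorySinkSketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Assumed stand-in input: the rate source emits rows with columns (timestamp, value).
val ticks = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// Running count of even and odd values.
val parityCounts = ticks
  .groupBy(($"value" % 2).as("parity"))
  .count()

// Memory sink: results land in an in-memory table named after queryName.
val query = parityCounts.writeStream
  .format("memory")
  .queryName("parity_counts") // hypothetical table name
  .outputMode("complete")
  .start()

// Let a few micro-batches run, then query the table like any other.
query.awaitTermination(10000)
spark.sql("SELECT * FROM parity_counts ORDER BY parity").show()
query.stop()
```

While the query is active, the same SQL statement can be rerun to watch the counts grow.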
Append Mode (default): only new rows are appended to the sink. Applicable only to non-aggregated queries (select, where, filter, join, etc.).
Complete Mode: the whole result table is written to the sink. Applicable only to aggregated queries (groupBy, etc.).
Update Mode: only the rows whose attributes were updated are written to the output sink.
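The pairing of output mode and query type can be sketched as follows, again assuming the "rate" test source from newer Spark releases as the input. The filter and the aggregate are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("OutputModesSketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Assumed stand-in input with columns (timestamp, value).
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 5)
  .load()

// Non-aggregated query: append mode emits only the rows that are new in each trigger.
val evens = events.filter($"value" % 2 === 0)
val appendQuery = evens.writeStream
  .outputMode("append")
  .format("console")
  .start()

// Aggregated query: complete mode rewrites the whole result table in each trigger.
val totals = events.groupBy().count()
val completeQuery = totals.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Run both queries for up to ten seconds, then stop them.
spark.streams.awaitAnyTermination(10000)
appendQuery.stop()
completeQuery.stop()
```

Writing the aggregated query in append mode without a watermark is rejected by Spark with an AnalysisException when the query starts.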
Checkpointing
In case of failure, recover the progress and state of the previous query and continue where it left off.
Configure a checkpoint location on the DataStreamWriter returned by writeStream, via the checkpointLocation option.
Must be configured for the Parquet sink and file sink.
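A minimal sketch of a checkpointed Parquet sink is shown below. The output and checkpoint directories are temporary, hypothetical paths, and the "rate" test source (newer Spark releases only) stands in for a real input.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("CheckpointSketch")
  .master("local[2]")
  .getOrCreate()

// Hypothetical locations; in practice these would be durable HDFS or S3 paths.
val baseDir = Files.createTempDirectory("structured-streaming-sketch")
val outputDir = baseDir.resolve("output").toString
val checkpointDir = baseDir.resolve("checkpoint").toString

// Assumed stand-in input with columns (timestamp, value).
val ticks = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 10)
  .load()

// The Parquet sink appends files under outputDir and records its progress
// under checkpointDir so a restarted query can resume where it left off.
val query = ticks.writeStream
  .format("parquet")
  .option("path", outputDir)
  .option("checkpointLocation", checkpointDir)
  .outputMode("append")
  .start()

query.awaitTermination(10000)
query.stop()
```

Restarting the same query with the same checkpointLocation should continue from the recorded offsets instead of starting over.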
Unsupported operations (so far)
Sort, limit / first N rows, and distinct on input streams
Joins between two streaming datasets
Outer joins (full outer, left outer, right outer) between two streaming datasets
ds.count(): use ds.groupBy().count() instead
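The count restriction can be illustrated with a short sketch: calling count() directly on a streaming Dataset is rejected, while groupBy().count() produces a streaming DataFrame holding a running total. The "rate" test source from newer Spark releases is assumed as the input.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StreamingCountSketch")
  .master("local[2]")
  .getOrCreate()

// Assumed stand-in input with columns (timestamp, value).
val stream = spark.readStream.format("rate").load()

// A direct count() tries to run the stream as a batch job and is rejected.
val direct = Try(stream.count())
println(s"stream.count() failed: ${direct.isFailure}")

// groupBy() with no keys yields a streaming DataFrame with a single running count.
val runningCount = stream.groupBy().count()

val query = runningCount.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination(10000)
query.stop()
```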
Key Takeaways
Structured Streaming is still experimental, but please try it out.
Streaming events are gathered and appended to an infinite DataFrame series (the Unbounded Table), and queries run on top of that.
Development is very similar to development against the static DataFrame/Dataset APIs in Spark.
Execute ad-hoc queries, run aggregates, update DBs, track session data, prepare dashboards, etc.
readStream() - the schema of a streaming DataFrame is checked only at run time, hence it is untyped.
writeStream() - various output modes and output sinks are available. Always remember when to use which type of output.
Kafka, Kinesis, MLlib integration, sessionization and watermarks are upcoming features being developed in the open.
Structured Streaming is not recommended for production workloads at this point, even for file streaming or socket streaming.
Thank You
Spark code is available on my GitHub: https://github.com/vgiri2015/Spark2.0-and-greater/tree/master/src/main/scala/structStreaming
Other Spark related repositories: https://github.com/vgiri2015/spark-latest-v1
My blogs and learning in Spark: https://hadoopist.wordpress.com/category/apache-spark/