What's new with Apache Spark's Structured Streaming?

Miklos Christine, 3/21/2017


TRANSCRIPT

Page 1: What's new with Apache Spark's Structured Streaming?

What’s new with Apache Spark’s Structured Streaming?

Miklos Christine, 3/21/2017

Page 2: What's new with Apache Spark's Structured Streaming?

$ whoami

Solutions Architect @ Databricks
• Apache Spark Advocate
• Build and architect big data platforms for streaming and batch processing

Previously:
• Sales Engineer @ Cloudera
• Software Engineer @ Cisco

Page 3: What's new with Apache Spark's Structured Streaming?

building robust stream processing apps is hard

Page 4: What's new with Apache Spark's Structured Streaming?

Complexities in stream processing

Complex Data
• Diverse data formats (json, avro, binary, …)
• Data can be dirty, late, out-of-order

Complex Systems
• Diverse storage systems and formats (SQL, NoSQL, parquet, …)
• System failures

Complex Workloads
• Event time processing
• Combining streaming with interactive queries, machine learning

Page 5: What's new with Apache Spark's Structured Streaming?

Spark Streaming 1.x APIs (DStreams)

Difficulties:
• Separate StreamingContext()
• Additional library packages and dependencies (Kafka / Kinesis libraries)
• State management / window functions
• Serialization issues & code changes (upgrade issues)
• 1 StreamingContext() per application

Page 6: What's new with Apache Spark's Structured Streaming?

Spark Streaming 1.x (aka DStreams)

// Function to create a new StreamingContext and set it up
def creatingFunc(): StreamingContext = {
  // Create a StreamingContext
  val ssc = new StreamingContext(sc, Seconds(batchIntervalSeconds))

  // Get the input stream from the source
  val topic1_Stream = createKafkaStream(ssc, kafkaTopic1, kafkaBrokers)

  // … … … logic

  // To make sure data is not deleted by the time we query it interactively
  ssc.remember(Minutes(1))

  println("Creating function called to create new StreamingContext")
  ssc
}
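The createKafkaStream helper referenced above is not shown on the slide; a minimal sketch, assuming the Spark 1.x direct Kafka API (spark-streaming-kafka for Kafka 0.8) and a single input topic, might look like this:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical helper: builds a direct Kafka DStream of (key, value) pairs
// for one topic against the given broker list.
def createKafkaStream(ssc: StreamingContext,
                      topic: String,
                      brokers: String): DStream[(String, String)] = {
  val kafkaParams = Map("metadata.broker.list" -> brokers)
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set(topic))
}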

Page 7: What's new with Apache Spark's Structured Streaming?

Spark Streaming 1.x (aka DStreams)

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(creatingFunc)

// This starts the streaming context in the background.
ssc.start()

// This is to ensure that we wait for some time before the background streaming job
// starts. This will put this cell on hold for 5 times the batchIntervalSeconds.
ssc.awaitTerminationOrTimeout(batchIntervalSeconds * 5 * 1000)

Page 8: What's new with Apache Spark's Structured Streaming?

Structured Streaming

stream processing on the Spark SQL engine
fast, scalable, fault-tolerant

rich, unified, high-level APIs
deal with complex data and complex workloads

rich ecosystem of data sources
integrate with many storage systems

Page 9: What's new with Apache Spark's Structured Streaming?

you should not have to reason about streaming

Page 10: What's new with Apache Spark's Structured Streaming?

you should write simple batch queries & Spark should automatically streamify them

Page 11: What's new with Apache Spark's Structured Streaming?

Treat Streams as Unbounded Tables

[Figure: a data stream modeled as an unbounded input table]

new data in the data stream = new rows appended to an unbounded table

Page 12: What's new with Apache Spark's Structured Streaming?

New Model

Trigger: every 1 sec

Input: data from the source as an append-only table
Trigger: how frequently to check the input for new data
Query: operations on the input (the usual map/filter/reduce, plus new window and session ops)

[Figure: at each trigger t=1, t=2, t=3 the query runs over the input data received up to that time]

Page 13: What's new with Apache Spark's Structured Streaming?

New Model

Result: final operated table, updated after every trigger
Output: what part of the result to write to storage after every trigger

Complete output mode: write the full result table (all rows) to storage every time

[Figure: at each trigger t=1, t=2, t=3 the query consumes input data up to that time and produces the result up to that time]

Page 14: What's new with Apache Spark's Structured Streaming?

New Model

Result: final operated table, updated after every trigger
Output: what part of the result to write to storage after every trigger

Complete output: write the full result table every time
Append output: write only the new rows added to the result table since the previous batch

[Figure: in append mode, at each trigger t=1, t=2, t=3 only the rows added to the result since the last trigger are written to storage]

Page 15: What's new with Apache Spark's Structured Streaming?

static data = bounded table
streaming data = unbounded table

API: Dataset/DataFrame

Single API!

Page 16: What's new with Apache Spark's Structured Streaming?

Batch Queries with DataFrames

input = spark.read.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.write.format("parquet").save("dest-path")

Read from Json file

Select some devices

Write to parquet file

Page 17: What's new with Apache Spark's Structured Streaming?

Streaming Queries with DataFrames

input = spark.readStream.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.writeStream.format("parquet").start("dest-path")

Read from JSON file stream: replace read with readStream

Select some devices: the code does not change

Write to Parquet file stream: replace save() with start()
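One practical difference the slide glosses over: a streaming file source normally needs an explicit schema (Spark does not infer it by default), and the file sink needs a checkpoint location. A minimal Scala sketch, with the schema and paths as assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("streamify").getOrCreate()

// Assumed schema for the JSON records used on the slide
val deviceSchema = new StructType()
  .add("device", StringType)
  .add("signal", IntegerType)

val input = spark.readStream
  .schema(deviceSchema)                              // required for streaming file sources
  .format("json")
  .load("source-path")

val result = input.select("device", "signal").where("signal > 15")

val query = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "checkpoint-path")   // required for the file sink
  .start("dest-path")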

Page 18: What's new with Apache Spark's Structured Streaming?

DataFrames, Datasets, SQL

Spark automatically streamifies!

Spark SQL converts the batch-like query into a series of incremental execution plans operating on new batches of data.

Logical plan: Streaming Source -> Project (device, signal) -> Filter (signal > 15) -> Streaming Sink

[Figure: series of incremental execution plans, each processing new files at triggers t=1, t=2, t=3]

input = spark.readStream.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.writeStream.format("parquet").start("dest-path")

Page 19: What's new with Apache Spark's Structured Streaming?

Fault-tolerance with Checkpointing

Checkpointing: metadata (e.g. offsets) of the current batch is stored in a write-ahead log in HDFS/S3

• The query can be restarted from the log
• Streaming sources can replay the exact data range in case of failure
• Streaming sinks can deduplicate reprocessed data when writing; they are idempotent by design

end-to-end exactly-once guarantees


Page 20: What's new with Apache Spark's Structured Streaming?

Complex Streaming ETL

Page 21: What's new with Apache Spark's Structured Streaming?

Traditional ETL

Raw, dirty, un/semi-structured data is dumped as files

Periodic jobs run every few hours to convert raw data to structured data ready for further analytics

[Figure: raw files are dumped within seconds; the structured table is ready only after hours]

Page 22: What's new with Apache Spark's Structured Streaming?

Traditional ETL

Hours of delay before decisions can be taken on the latest data

Unacceptable when time is of the essence (intrusion detection, anomaly detection, etc.)


Page 23: What's new with Apache Spark's Structured Streaming?

Streaming ETL w/ Structured Streaming

Structured Streaming enables raw data to be available as structured data as soon as possible

[Figure: raw data becomes available as a queryable table within seconds]

Page 24: What's new with Apache Spark's Structured Streaming?

Streaming ETL w/ Structured Streaming

Example
• JSON data being received in Kafka
• Parse nested JSON and flatten it
• Store in a structured Parquet table
• Get end-to-end failure guarantees

val rawData = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", ...)
  .load()

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
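In a runnable version of this pipeline, from_json needs an explicit schema for the incoming JSON, and the write needs start() to launch the query. A minimal sketch, in which the schema, the broker address, and the derived date column are assumptions:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Assumed schema for the nested JSON payload
val jsonSchema = new StructType()
  .add("timestamp", LongType)
  .add("device", StringType)
  .add("signal", IntegerType)

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")           // assumed broker address
  .option("subscribe", "topic")
  .load()

val parsedData = rawData
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", jsonSchema).as("data"))
  .select("data.*")
  .withColumn("date", to_date(from_unixtime($"timestamp")))    // derive the partition column

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoint")
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")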

Page 25: What's new with Apache Spark's Structured Streaming?

Reading from Kafka [Spark 2.1]

Supports Kafka 0.10.0.1. Specify options to configure:

How?
kafka.bootstrap.servers => broker1

What?
subscribe => topic1,topic2,topic3          // fixed list of topics
subscribePattern => topic*                 // dynamic list of topics
assign => {"topicA":[0,1]}                 // specific partitions

Where?
startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
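For example, a source that follows a dynamic set of topics and starts from the earliest available offsets could be configured like this (the topic pattern and broker address are assumptions):

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // assumed broker address
  .option("subscribePattern", "topic.*")               // dynamic list of topics
  .option("startingOffsets", "earliest")               // default is latest
  .load()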

Page 26: What's new with Apache Spark's Structured Streaming?

Reading from Kafka

rawData dataframe has the following columns

key       value     topic     partition  offset  timestamp
[binary]  [binary]  "topicA"  0          345     1486087873
[binary]  [binary]  "topicB"  3          2890    1486086721

val rawData = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .option("kafka.bootstrap.servers", ...)
  .load()

Page 27: What's new with Apache Spark's Structured Streaming?

Transforming Data

Cast the binary value to string; name the column json

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

Page 28: What's new with Apache Spark's Structured Streaming?

Transforming Data

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

Cast the binary value to string; name the column json
Parse the json string and expand into nested columns; name it data

from_json("json") as "data"

json
{ "timestamp": 1486087873, "device": "devA", … }
{ "timestamp": 1486082418, "device": "devX", … }

data (nested)
timestamp    device  …
1486087873   devA    …
1486086721   devX    …

Page 29: What's new with Apache Spark's Structured Streaming?

Transforming Data

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")

Cast the binary value to string; name the column json
Parse the json string and expand into nested columns; name it data
Flatten the nested columns

data (nested)
timestamp    device  …
1486087873   devA    …
1486086721   devX    …

select("data.*") gives the flattened (not nested) columns:
timestamp    device  …
1486087873   devA    …
1486086721   devX    …

Page 30: What's new with Apache Spark's Structured Streaming?

Transforming Data

Cast the binary value to string; name the column json
Parse the json string and expand into nested columns; name it data
Flatten the nested columns

Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, … (100s of functions; see our blog post)

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json").as("data"))
  .select("data.*")
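As a small illustration of those built-ins, a hedged sketch (the readings array column is an assumption, not part of the slide's schema):

import org.apache.spark.sql.functions._
import spark.implicits._

// explode: one output row per element of an array column;
// to_json: serialize a struct column back into a JSON string
val exploded = parsedData
  .select($"device", explode($"readings").as("reading"))     // assumes an array-of-struct column "readings"
  .select($"device", to_json($"reading").as("reading_json"))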

Page 31: What's new with Apache Spark's Structured Streaming?

Writing to a Parquet table

Save the parsed data as a Parquet table in the given path

Partition the files by date so that future queries on time slices of the data are fast
e.g. a query on the last 48 hours of data

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
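The date partitioning is what makes time-slice queries cheap: a batch read over the table can prune partitions. A hedged sketch of the "last 48 hours" query (the table path follows the slide; the filter itself is an assumption):

import org.apache.spark.sql.functions._
import spark.implicits._

// Only the last two days of partitions are scanned, thanks to partition pruning
val recent = spark.read
  .parquet("/parquetTable")
  .where($"date" >= date_sub(current_date(), 2))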

Page 32: What's new with Apache Spark's Structured Streaming?

Checkpointing

Enable checkpointing by setting the checkpoint location to save offset logs

start() actually starts a continuously running StreamingQuery in the Spark cluster

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")

Page 33: What's new with Apache Spark's Structured Streaming?

Streaming Query

query is a handle to the continuously running StreamingQuery

Used to monitor and manage the execution

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")
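A short sketch of how the handle is typically used for monitoring and management (these are standard StreamingQuery methods in Spark 2.1; the usage shown is illustrative):

query.status              // is the query active, waiting for data, or processing a trigger?
query.lastProgress        // metrics for the most recent trigger (input rows, processing rates, ...)
query.recentProgress      // an array of the most recent progress updates
query.stop()              // stop the query gracefully
// or block the current thread until the query terminates:
// query.awaitTermination()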


Page 34: What's new with Apache Spark's Structured Streaming?

Data Consistency on Ad-hoc Queries

Data available for complex, ad-hoc analytics within seconds

Parquet table is updated atomically, ensures prefix integrity

Even if distributed, ad-hoc queries will see either all updates from the streaming query or none. Read more in our blog: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

[Figure: complex, ad-hoc queries on the latest data within seconds]

Page 35: What's new with Apache Spark's Structured Streaming?

Advanced Streaming Analytics

Page 36: What's new with Apache Spark's Structured Streaming?

Event time Aggregations

Many use cases require aggregate statistics by event time. E.g. what is the number of errors in each system in 1-hour windows?

Many challenges: extracting event time from the data; handling late, out-of-order data

DStream APIs were insufficient for event-time processing

Page 37: What's new with Apache Spark's Structured Streaming?

Event time Aggregations

Windowing is just another type of grouping in Structured Streaming. UDAFs are supported!

Number of records every hour:

parsedData
  .groupBy(window("timestamp", "1 hour"))
  .count()

Avg signal strength of each device every 10 mins:

parsedData
  .groupBy("device", window("timestamp", "10 mins"))
  .avg("signal")
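window() also accepts a slide duration for sliding windows; a sketch (the 15-minute slide interval is an assumption):

import org.apache.spark.sql.functions._
import spark.implicits._

// 1-hour windows recomputed every 15 minutes (sliding window)
val hourlyCounts = parsedData
  .groupBy(window($"timestamp", "1 hour", "15 minutes"))
  .count()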

Page 38: What's new with Apache Spark's Structured Streaming?

Stateful Processing for Aggregations

Aggregates have to be saved as distributed state between triggers

Each trigger reads previous state and writes updated state

State stored in memory, backed by write ahead log in HDFS/S3

Fault-tolerant, exactly-once guarantee!

[Figure: at each trigger t=1, t=2, t=3 the query processes new data from the source, reads the previous state, writes the updated state, and writes to the sink; state updates are written to the write-ahead log for checkpointing]

Page 39: What's new with Apache Spark's Structured Streaming?

Watermarking and Late Data

Watermark [Spark 2.1] - threshold on how late an event is expected to be in event time

Trails behind max seen event time

Trailing gap is configurable

[Figure: the watermark (e.g. 12:20 PM) trails the max seen event time (e.g. 12:30 PM) by a configurable trailing gap of 10 mins; data older than the watermark is not expected]

Page 40: What's new with Apache Spark's Structured Streaming?

Watermarking and Late Data

Data newer than watermark may be late, but allowed to aggregate

Data older than watermark is "too late" and dropped

Windows older than watermark automatically deleted to limit the amount of intermediate state

[Figure: late data between the watermark and the max event time (12:30 PM) is allowed to aggregate; data older than the watermark is too late and is dropped]

Page 41: What's new with Apache Spark's Structured Streaming?

Watermarking and Late Data

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

Allowed lateness of 10 mins: late data newer than the watermark is allowed to aggregate; data older than the watermark is not expected.
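With a watermark in place, the windowed aggregation can also be emitted in append output mode, which writes a window only once the watermark says it can no longer change. A minimal sketch (sink path and checkpoint location are assumptions, and the timestamp column is assumed to be of timestamp type):

import org.apache.spark.sql.functions._
import spark.implicits._

val windowCounts = parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()

// Append mode emits each window once it is finalized by the watermark
val query = windowCounts.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/checkpoint-counts")   // assumed path
  .format("parquet")
  .start("/windowCounts")                                // assumed path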

Page 42: What's new with Apache Spark's Structured Streaming?

Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

[Figure: processing time vs. event time, with 5-minute windows]
• The system tracks the max observed event time (e.g. 12:14)
• The watermark is updated to max event time - 10 min for the next trigger (12:14 - 10 min = 12:04); state for windows older than the watermark is deleted
• Data that arrives late but is newer than the watermark is still considered in the counts
• Data older than the watermark is too late: it is ignored in the counts and its state is dropped

More details in the online programming guide.

Page 43: What's new with Apache Spark's Structured Streaming?

Arbitrary Stateful Operations [Spark 2.2]

mapGroupsWithState allows any user-defined stateful ops on a user-defined state
• supports timeouts
• fault-tolerant, exactly-once
• supports Scala and Java

dataset
  .groupByKey(groupingFunc)
  .mapGroupsWithState(mappingFunc)

def mappingFunc(
    key: K,
    values: Iterator[V],
    state: KeyedState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
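A concrete, hedged sketch of such a mapping function, counting events per device. Note that in the released Spark 2.2 API the state handle class is GroupState (the slide shows the pre-release KeyedState name); the DeviceEvent dataset is an assumption:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class DeviceEvent(device: String, signal: Int)

// Running count of events seen per device, kept as user-defined state
def countEvents(
    device: String,
    events: Iterator[DeviceEvent],
    state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)        // persist the updated state for the next trigger
  (device, newCount)
}

val counts = deviceEvents       // assumed Dataset[DeviceEvent]
  .groupByKey(_.device)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(countEvents)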

Page 44: What's new with Apache Spark's Structured Streaming?

Many more updates!

StreamingQueryListener [Spark 2.1]
Receive regular progress heartbeats for health and performance monitoring. Automatic in Databricks; come to the Databricks booth for a demo!!

Kafka Batch Queries [Spark 2.2]
Run batch queries on Kafka just like on a file system (see the sketch below)

Kafka Sink [Spark 2.2]
Write to Kafka; this can only give an at-least-once guarantee as Kafka doesn't support transactional updates (see the sketch below)

Kinesis Source
Read from Amazon Kinesis
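Hedged sketches of the two Kafka additions, with broker addresses, topics, and column choices as assumptions:

import org.apache.spark.sql.functions._
import spark.implicits._

// Kafka batch query [Spark 2.2]: same source options, plain read instead of readStream
val batchDF = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // assumed broker address
  .option("subscribe", "topic")
  .load()

// Kafka sink [Spark 2.2]: the output dataframe needs a string or binary "value"
// column (and optionally "key"); delivery is at-least-once
val toKafka = parsedData
  .select(to_json(struct($"timestamp", $"device", $"signal")).as("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "output-topic")                      // assumed output topic
  .option("checkpointLocation", "/checkpoint-kafka")    // assumed path
  .start()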

Page 45: What's new with Apache Spark's Structured Streaming?

More Info

Structured Streaming Programming Guide

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Databricks blog posts for more focused discussions:

https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html

https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html

and more to come, stay tuned!!


Page 46: What's new with Apache Spark's Structured Streaming?

GET TICKETS NOW!!

EARLY BIRD PRICING!!

Page 47: What's new with Apache Spark's Structured Streaming?

Need time to convince your manager?

Spark Summit Code: ChicagoMU

Good for 15% off starting 4/8.

Page 48: What's new with Apache Spark's Structured Streaming?

Comparison with Other Engines


Read our blog to understand this table

Page 49: What's new with Apache Spark's Structured Streaming?

Thank you!!