A Deep Dive into Structured Streaming in Apache Spark


Page 1: A Deep Dive into Structured Streaming in Apache Spark

A Deep Dive into Structured Streaming

Burak Yavuz

Spark Meetup @ Milano 18 January 2017

Page 2: A Deep Dive into Structured Streaming in Apache Spark

Streaming in Apache Spark

Spark Streaming changed how people write streaming apps

[Stack diagram: SQL, Streaming, MLlib and GraphX on top of Spark Core]

Functional, concise and expressive

Fault-tolerant state management

Unified stack with batch processing

More than 50% of users consider it the most important part of Apache Spark

Page 3: A Deep Dive into Structured Streaming in Apache Spark

Streaming apps are growing more complex


Page 4: A Deep Dive into Structured Streaming in Apache Spark

Streaming computations don’t run in isolation

Need to interact with batch data, interactive analysis, machine learning, etc.

Page 5: A Deep Dive into Structured Streaming in Apache Spark

Use case: IoT Device Monitoring

IoT events from Kafka (event stream)

ETL into long-term storage: prevent data loss, prevent duplicates

Status monitoring: handle late data, aggregate on windows on event time

Interactively debug issues: consistency

Anomaly detection: learn models offline, use online + continuous learning

Page 6: A Deep Dive into Structured Streaming in Apache Spark

Use case: IoT Device Monitoring

The same IoT monitoring pipeline as above: Kafka ingestion, ETL into long-term storage, status monitoring, interactive debugging, and anomaly detection.

Continuous Applications

Not just streaming any more

Page 7: A Deep Dive into Structured Streaming in Apache Spark

Pain points with DStreams

1. Processing with event time, dealing with late data - the DStream API exposes batch time, making it hard to incorporate event time

2. Interoperating streaming with batch AND interactive - RDDs and DStreams have similar APIs, but still require translation

3. Reasoning about end-to-end guarantees - requires carefully constructing sinks that handle failures correctly, and keeping data in storage consistent while it is being updated

Page 8: A Deep Dive into Structured Streaming in Apache Spark

Structured Streaming

Page 9: A Deep Dive into Structured Streaming in Apache Spark

The simplest way to perform streaming analytics is not having to reason about streaming at all

Page 10: A Deep Dive into Structured Streaming in Apache Spark

New Model

[Diagram: trigger every 1 sec; at times 1, 2, 3 the input table grows to data up to 1, data up to 2, data up to 3, and the query runs over it]

Input: data from the source as an append-only table

Trigger: how frequently to check the input for new data

Query: operations on the input - the usual map/filter/reduce plus new window and session ops

Page 11: A Deep Dive into Structured Streaming in Apache Spark

New Model

[Diagram: trigger every 1 sec; at times 1, 2, 3 the query turns the input table (data up to 1, 2, 3) into a result table (result for data up to 1, 2, 3)]

Result: final operated table, updated every trigger interval

Output: what part of the result to write to the data sink after every trigger

Complete output [complete mode]: write the full result table every time, i.e. output all the rows in the result table

Page 12: A Deep Dive into Structured Streaming in Apache Spark

New Model

[Diagram: trigger every 1 sec; under append mode, only the rows added to the result table since the last trigger are written out at each of times 1, 2, 3]

Result: final operated table, updated every trigger interval

Output: what part of the result to write to the data sink after every trigger

Complete output: write the full result table every time

Append output: write only the new rows that got added to the result table since the previous batch

*Not all output modes are feasible with all queries
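For concreteness, here is a minimal Scala sketch of setting the trigger and output mode on a query (the streaming DataFrame input, the checkpoint path and the output path are assumptions for illustration; ProcessingTime is the Spark 2.x trigger helper):

import org.apache.spark.sql.streaming.ProcessingTime

// Check the source for new data every second and append only the new
// result rows to the Parquet output after each trigger.
val query = input
  .select("device", "signal")
  .writeStream
  .format("parquet")
  .outputMode("append")
  .trigger(ProcessingTime("1 second"))
  .option("checkpointLocation", "ckpt-path")   // assumed path
  .start("dest-path")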

Page 13: A Deep Dive into Structured Streaming in Apache Spark

API - Dataset/DataFrame

Static, bounded data

Streaming, unbounded data

Single API!

Page 14: A Deep Dive into Structured Streaming in Apache Spark


Dataset

A distributed bounded or unbounded collection of data with a known schema

A high-level abstraction for selecting, filtering, mapping, reducing, and aggregating structured data

Common API across Scala, Java, Python, R

Page 15: A Deep Dive into Structured Streaming in Apache Spark

Datasets with SQL

Standards-based: supports the most popular constructs in HiveQL and ANSI SQL

Compatible: use JDBC to connect with popular BI tools

SELECT type, avg(signal) FROM devices GROUP BY type
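The same query can be issued from code; a minimal batch sketch, assuming a DataFrame named devices with type and signal columns loaded from an example JSON path:

val devices = spark.read.json("devices.json")   // assumed input path
devices.createOrReplaceTempView("devices")

val avgByType = spark.sql("SELECT type, avg(signal) FROM devices GROUP BY type")
avgByType.show()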

Page 16: A Deep Dive into Structured Streaming in Apache Spark

Dynamic Datasets (DataFrames)

Concise: great for ad-hoc interactive analysis

Interoperable: based on R / pandas, easy to go back and forth

spark.table("device-data")
  .groupBy("type")
  .agg(avg("signal"))
  .map(lambda …)
  .collect()

Page 17: A Deep Dive into Structured Streaming in Apache Spark


Statically-typed Datasets

No boilerplate: automatically convert to/from domain objects

Safe: compile time checks for correctness

case class DeviceData(`type`: String, signal: Int)

// Convert data to domain objects
val ds: Dataset[DeviceData] =
  spark.read.json("data.json").as[DeviceData]

// Compute histogram of signal by device type
val hist = ds.groupByKey(_.`type`).mapGroups { (t, data) =>
  val buckets = new Array[Int](10)
  data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
  (t, buckets)
}

Page 18: A Deep Dive into Structured Streaming in Apache Spark

Unified, Structured APIs in Spark 2.0

SQL vs. DataFrames vs. Datasets:

Syntax errors: SQL at runtime; DataFrames at compile time; Datasets at compile time

Analysis errors: SQL at runtime; DataFrames at runtime; Datasets at compile time

Page 19: A Deep Dive into Structured Streaming in Apache Spark

Batch ETL with DataFrames

input = spark.read.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.write.format("parquet").save("dest-path")

Read from JSON file

Select some devices

Write to Parquet file

Page 20: A Deep Dive into Structured Streaming in Apache Spark

Streaming ETL with DataFrames

input = spark.readStream.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.writeStream.format("parquet").start("dest-path")

Read from JSON file stream: replace read with readStream

Select some devices: code does not change

Write to Parquet file stream: replace save() with start()

Page 21: A Deep Dive into Structured Streaming in Apache Spark

Streaming ETL with DataFrames

readStream…load() creates a streaming DataFrame but does not start any of the computation

writeStream…start() defines where & how to output the data and starts the processing

input = spark.readStream.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.writeStream.format("parquet").start("dest-path")

Page 22: A Deep Dive into Structured Streaming in Apache Spark

Streaming ETL with DataFrames

[Diagram: at triggers 1, 2, 3 the input table grows; in append mode, only the new rows in the result at 2 and at 3 are written to the output]

input = spark.readStream.format("json").load("source-path")

result = input.select("device", "signal").where("signal > 15")

result.writeStream.format("parquet").start("dest-path")

Page 23: A Deep Dive into Structured Streaming in Apache Spark

Demo

Page 24: A Deep Dive into Structured Streaming in Apache Spark

Continuous Aggregations

Continuously compute average signal across all devices

Continuously compute average signal of each type of device

input.agg(avg("signal"))

input.groupBy("device_type")
  .avg("signal")

Page 25: A Deep Dive into Structured Streaming in Apache Spark

Continuous Windowed Aggregations

input
  .groupBy(
    $"device_type",
    window($"event_time", "10 min"))
  .avg("signal")

Continuously compute average signal of each type of device in last 10 minutes using event_time column

Simplifies event-time stream processing (not possible in DStreams). Works on both streaming and batch jobs.
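To illustrate the batch side of that claim, a hedged sketch applying the identical window aggregation to a static DataFrame (the Parquet path and column names are assumptions):

import org.apache.spark.sql.functions.window
import spark.implicits._

// Same grouping code as above, but on bounded data read from storage.
val batchInput = spark.read.parquet("iot-events")   // device_type, event_time, signal

val batchAvg = batchInput
  .groupBy($"device_type", window($"event_time", "10 minutes"))
  .avg("signal")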

Page 26: A Deep Dive into Structured Streaming in Apache Spark

Watermarking for handling late data [Spark 2.1]

input
  .withWatermark("event_time", "15 min")
  .groupBy(
    $"device_type",
    window($"event_time", "10 min"))
  .avg("signal")

Watermark = threshold using event-time for classifying data as "too late"

Continuously compute 10 minute average while considering that data can be up to 15 minutes late
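An end-to-end sketch of the watermarked query written out under append mode (paths, column names and the checkpoint location are assumptions; append mode relies on the watermark to decide when a window is final):

import org.apache.spark.sql.functions.window
import spark.implicits._

val events = spark.readStream.format("json").load("source-path")

val windowedAvg = events
  .withWatermark("event_time", "15 minutes")   // tolerate up to 15 minutes of lateness
  .groupBy($"device_type", window($"event_time", "10 minutes"))
  .avg("signal")

val query = windowedAvg.writeStream
  .outputMode("append")                        // a window is emitted once the watermark passes it
  .format("parquet")
  .option("checkpointLocation", "ckpt-watermark")
  .start("dest-path")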

Page 27: A Deep Dive into Structured Streaming in Apache Spark

Joining streams with static data

kafkaDataset = spark.readStream
  .format("kafka")
  .option("subscribe", "iot-updates")
  .option("kafka.bootstrap.servers", ...)
  .load()

staticDataset = spark.read
  .jdbc("jdbc://", "iot-device-info")

joinedDataset = kafkaDataset
  .join(staticDataset, "device_type")

Join streaming data from Kafka with static data via JDBC to enrich the streaming data …

… without having to think that you are joining streaming data
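One practical detail, as a hedged sketch (topic and broker names are assumptions): the Kafka source exposes its payload as binary key/value columns, so the value is typically cast or parsed before the join:

val kafkaRaw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "iot-updates")
  .load()
// columns: key, value (binary), topic, partition, offset, timestamp, ...

val updates = kafkaRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS value")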

Page 28: A Deep Dive into Structured Streaming in Apache Spark

Output Modes

Defines what is outputted every time there is a trigger. Different output modes make sense for different queries.

input.select("device", "signal")
  .writeStream
  .outputMode("append")
  .format("parquet")
  .start("dest-path")

Append mode with non-aggregation queries

input.agg(count("*"))
  .writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("my_table")
  .start()

Complete mode with aggregation queries
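A hedged usage note on the memory sink above: once the query is started under queryName("my_table"), the continuously updated result table can be queried interactively from the same SparkSession:

spark.sql("SELECT * FROM my_table").show()   // reflects the latest complete-mode result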

Page 29: A Deep Dive into Structured Streaming in Apache Spark

Query Management

query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .start("dest-path")

query.stop()
query.awaitTermination()
query.exception()

query.status()
query.recentProgress()

query: a handle to the running streaming computation for managing it

- Stop it, wait for it to terminate
- Get status
- Get error, if terminated

Multiple queries can be active at the same time

Can ask for current status, and metrics
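A hedged Scala sketch of these management calls on the query handle from the slide (in Scala, status and progress are accessed without parentheses):

query.status              // e.g. waiting for data vs. processing a trigger
query.recentProgress      // recent per-trigger metrics: input rows/sec, durations, state size
query.lastProgress        // the most recent progress report
query.exception           // Some(error) if the query terminated with a failure

spark.streams.active      // all streaming queries running in this SparkSession

query.awaitTermination()  // block the current thread until the query stops
query.stop()              // stop the query (typically from another thread)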

Page 30: A Deep Dive into Structured Streaming in Apache Spark

Logically: Dataset operations on table (i.e. as easy to understand as batch)

Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)

Query Execution

[Diagram: DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental execution]

Page 31: A Deep Dive into Structured Streaming in Apache Spark

Structured Streaming

High-level streaming API built on Datasets/DataFrames
- Event time, windowing, sessions, sources & sinks
- End-to-end exactly-once semantics

Unifies streaming, interactive and batch queries
- Aggregate data in a stream, then serve using JDBC
- Add, remove, change queries at runtime
- Build and apply ML models

Page 32: A Deep Dive into Structured Streaming in Apache Spark

What can you do with this that’s hard with other engines?

True unification
- Combines batch + streaming + interactive workloads
- Same code + same super-optimized engine for everything

Flexible API tightly integrated with the engine
- Choose your own tool: Dataset/DataFrame/SQL
- Greater debuggability and performance

Benefits of Spark: in-memory computing, elastic scaling, fault-tolerance, straggler mitigation, …

Page 33: A Deep Dive into Structured Streaming in Apache Spark

Underneath the Hood

Page 34: A Deep Dive into Structured Streaming in Apache Spark

Batch Execution on Spark SQL

DataFrame/Dataset

Logical Plan

Abstract representation of the query

Page 35: A Deep Dive into Structured Streaming in Apache Spark

Batch Execution on Spark SQL

DataFrame/Dataset

[Diagram: the Catalyst planner. SQL AST / DataFrame / Dataset -> Unresolved Logical Plan -> Analysis (using the Catalog) -> Logical Plan -> Logical Optimization -> Optimized Logical Plan -> Physical Planning -> Physical Plans -> Cost Model -> Selected Physical Plan -> Code Generation -> RDDs]

Helluvalot of magic!
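A small way to peek at this pipeline in practice (paths are assumptions): Dataset.explain(true) prints the parsed, analyzed and optimized logical plans plus the selected physical plan for a query:

val result = spark.read.format("json").load("source-path")
  .select("device", "signal")
  .where("signal > 15")

result.explain(true)   // parsed / analyzed / optimized logical plans and the physical plan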

Page 36: A Deep Dive into Structured Streaming in Apache Spark

Batch Execution on Spark SQL

DataFrame/Dataset -> Logical Plan -> Planner -> Execution Plan

Run super-optimized Spark jobs to compute results

Code Optimizations
- Bytecode generation
- JVM intrinsics, vectorization
- Operations on serialized data

Memory Optimizations
- Compact and fast encoding
- Off-heap memory

Project Tungsten - Phase 1 and 2

Page 37: A Deep Dive into Structured Streaming in Apache Spark

Continuous Incremental Execution

The planner knows how to convert a streaming logical plan into a continuous series of incremental execution plans, each processing the next chunk of streaming data.

[Diagram: DataFrame/Dataset -> Logical Plan -> Planner -> Incremental Execution Plans 1, 2, 3, 4, ...]

Page 38: A Deep Dive into Structured Streaming in Apache Spark

Continuous Incremental Execution

The planner polls the sources for new data, then incrementally executes the new data and writes to the sink.

[Diagram: Incremental Execution 1 - Offsets: [19-105], Count: 87; Incremental Execution 2 - Offsets: [106-197], Count: 92]

Page 39: A Deep Dive into Structured Streaming in Apache Spark

Continuous Aggregations

Maintain the running aggregate as in-memory state, backed by a WAL in the file system for fault tolerance.

State data is generated and used across incremental executions:

[Diagram: Incremental Execution 1 - Offsets: [19-105], Running Count: 87, state: 87 in memory; Incremental Execution 2 - Offsets: [106-197], Count: 87+92 = 179, state: 179]

Page 40: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

All data and metadata in the system need to be recoverable / replayable.

[Diagram: the Planner coordinates the source, state, and sink across Incremental Executions 1 and 2]

Page 41: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

Fault-tolerant Planner

Tracks offsets by writing the offset range of each execution to a write ahead log (WAL) in HDFS

[Diagram: offsets are written to the fault-tolerant WAL before execution; the Planner then runs Incremental Executions 1 and 2 against the source, state, and sink]

Page 42: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

Fault-tolerant Planner

Tracks offsets by writing the offset range of each execution to a write ahead log (WAL) in HDFS

[Diagram: a failed planner fails the current execution; Incremental Execution 2 fails before committing to the sink]

Page 43: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

Fault-tolerant Planner

Tracks offsets by writing the offset range of each execution to a write ahead log (WAL) in HDFS

On restart, the planner reads the log to recover from failures and re-executes the exact range of offsets.

[Diagram: the restarted planner reads offsets back from the WAL and regenerates the same executions from those offsets, replacing the failed execution]
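From the user's side, this offset WAL is driven by the checkpoint location given at start(); a hedged sketch (the path is an assumption):

val query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .option("checkpointLocation", "hdfs:///checkpoints/iot-etl")   // offsets (and state) are stored here
  .start("dest-path")

// Restarting the same query with the same checkpointLocation makes the planner
// read the offsets back and re-execute exactly the ranges that were planned but
// not yet committed, rather than reprocessing from scratch.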

Page 44: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

Fault-tolerant Sources

Structured Streaming sources are by design replayable (e.g. Kafka, Kinesis, files) and generate exactly the same data given the offsets recovered by the planner.

[Diagram: a replayable source re-serves the same offset ranges to Incremental Executions 1 and 2]

Page 45: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

Fault-tolerant State

Intermediate "state data" is a maintained in versioned, key-value maps in Spark workers, backed by HDFS

Planner makes sure "correct version" of state used to re-execute after failure

Planner

source sink

Incremental Execution 1

Incremental Execution 2

st at e

state is fault-tolerant with WAL

Page 46: A Deep Dive into Structured Streaming in Apache Spark

Fault-tolerance

Fault-tolerant Sink

Sinks are by design idempotent and handle re-executions to avoid double-committing the output.

[Diagram: the sink is idempotent by design, so re-executions of Incremental Executions 1 and 2 do not duplicate output]

Page 47: A Deep Dive into Structured Streaming in Apache Spark


offset tracking in WAL +

state management +

fault-tolerant sources and sinks =

end-to-end exactly-once guarantees

Page 48: A Deep Dive into Structured Streaming in Apache Spark


Fast, fault-tolerant, exactly-once

stateful stream processing

without having to reason about streaming

Page 49: A Deep Dive into Structured Streaming in Apache Spark

Blogs for Deeper Technical Dive

Blog 1: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html

Blog 2: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html

Page 50: A Deep Dive into Structured Streaming in Apache Spark

Comparison with Other Engines

[Comparison table not included in the transcript] Read the blog to understand this table.

Page 51: A Deep Dive into Structured Streaming in Apache Spark

Previous Status: Spark 2.0.2

Basic infrastructure and API
- Event time, windows, aggregations
- Append and Complete output modes
- Support for a subset of batch queries

Source and sink
- Sources: files, Kafka
- Sinks: files, in-memory table, unmanaged sinks (foreach)

Experimental release to set the future direction

Not ready for production but good to experiment with and provide feedback

Page 52: A Deep Dive into Structured Streaming in Apache Spark

Current Status: Spark 2.1.0

Event-time watermarks and old state eviction

Status monitoring and metrics
- Current status and metrics (processing rates, state data size, etc.)
- Codahale/Dropwizard metrics (reporting to Ganglia, Graphite, etc.)

Stable, future-proof checkpoint format for upgrading code across Spark versions

Many, many, many stability and performance improvements

Page 53: A Deep Dive into Structured Streaming in Apache Spark

Future Direction

Stability, stability, stability

Support for more queries
- Sessionization
- More output modes

Support for more sinks

ML integrations

Make Structured Streaming ready for production workloads by Spark 2.2/2.3

Page 54: A Deep Dive into Structured Streaming in Apache Spark

https://spark-summit.org/east-2017/