A Deep Dive into Structured Streaming in Apache Spark


  • A Deep Dive into Structured Streaming

    Burak Yavuz

    Spark Meetup @ Milano 18 January 2017

  • Streaming in Apache Spark

    Spark Streaming changed how people write streaming apps

    [Diagram: Spark Core with SQL, Streaming, MLlib, and GraphX libraries on top]

    Functional, concise and expressive

    Fault-tolerant state management

    Unified stack with batch processing

    More than 50% of users consider it the most important part of Apache Spark

  • Streaming apps are growing more complex


  • Streaming computations don't run in isolation

    Need to interact with batch data, interactive analysis, machine learning, etc.

  • Use case: IoT Device Monitoring

    IoT events arrive from Kafka as an event stream

    ETL into long term storage
    - Prevent data loss
    - Prevent duplicates

    Status monitoring
    - Handle late data
    - Aggregate on windows on event time

    Interactively debug issues
    - Consistency

    Anomaly detection
    - Learn models offline
    - Use online + continuous learning

  • Use case: IoT Device Monitoring

    [Same diagram as the previous slide]

    Continuous Applications

    Not just streaming any more

  • Pain points with DStreams

    1. Processing with event-time, dealing with late data
       - DStream API exposes batch time, hard to incorporate event-time

    2. Interoperate streaming with batch AND interactive
       - RDD/DStream has similar API, but still requires translation

    3. Reasoning about end-to-end guarantees
       - Requires carefully constructing sinks that handle failures correctly
       - Data consistency in the storage while being updated

  • Structured Streaming

  • The simplest way to perform streaming analytics is not having to reason about streaming at all

  • New Model

    Trigger: every 1 sec

    [Diagram: at times 1, 2, 3 the input table holds data up to 1, data up to 2, data up to 3, and a query runs over it at each trigger]

    Input: data from source as an append-only table

    Trigger: how frequently to check input for new data

    Query: operations on input - usual map/filter/reduce, new window, session ops

  • New Model

    Trigger: every 1 sec

    [Diagram: at each trigger the query turns the input table (data up to 1, 2, 3) into a result table (result for data up to 1, 2, 3)]

    Result: final operated table, updated every trigger interval

    Output: what part of the result to write to the data sink after every trigger

    Complete output: write the full result table every time

    Output [complete mode]: output all the rows in the result table

  • New Model

    Trigger: every 1 sec

    [Same diagram: input table and result table at triggers 1, 2, 3]

    Output [append mode]: output only new rows since the last trigger

    Result: final operated table, updated every trigger interval

    Output: what part of the result to write to the data sink after every trigger

    Complete output: write the full result table every time

    Append output: write only the new rows that were added to the result table since the previous batch

    *Not all output modes are feasible with all queries
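
    To make the model concrete, here is a minimal PySpark sketch (not from the slides; it assumes Spark 2.x and a local socket source on port 9999 feeding lines of text) that maps the pieces above onto the API: an append-only input, a query over it, a one-second trigger, and complete-mode output.

    # Minimal sketch of the Input / Trigger / Query / Output model.
    # Assumptions: Spark 2.x, a socket source on localhost:9999 (hypothetical).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("new-model-sketch").getOrCreate()

    # Input: the stream is treated as an append-only table of text lines.
    lines = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load())

    # Query: ordinary DataFrame operations over the input table.
    counts = (lines
        .select(explode(split(lines.value, " ")).alias("word"))
        .groupBy("word")
        .count())

    # Trigger + Output: check for new data every second and write the full
    # result table (complete mode) to the console sink.
    query = (counts.writeStream
        .outputMode("complete")
        .format("console")
        .trigger(processingTime="1 second")
        .start())

    query.awaitTermination()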

  • API - Dataset/DataFrame

    Static, bounded data

    Streaming, unbounded data

    Single API!

  • Dataset

    A distributed bounded or unbounded collection of data with a known schema

    A high-level abstraction for selecting, filtering, mapping, reducing, and aggregating structured data

    Common API across Scala, Java, Python, R

  • Datasets with SQL

    SELECT type, avg(signal) FROM devices GROUP BY type

    Standards based: supports most popular constructs in HiveQL, ANSI SQL

    Compatible: use JDBC to connect with popular BI tools
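
    The same statement can also be issued programmatically; a minimal sketch (assumes a table or temporary view named devices with type and signal columns is registered):

    # Run the SQL above from a SparkSession and print the result.
    spark.sql("SELECT type, avg(signal) FROM devices GROUP BY type").show()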

  • Dynamic Datasets (DataFrames)

    spark.table("device-data") \
         .groupBy("type") \
         .agg(avg("signal")) \
         .collect()

    Concise: great for ad-hoc interactive analysis

    Interoperable: based on R / pandas, easy to go back and forth

  • Statically-typed Datasets

    No boilerplate: automatically convert to/from domain objects

    Safe: compile time checks for correctness

    import org.apache.spark.sql.Dataset
    import spark.implicits._

    // Convert data to domain objects
    case class DeviceData(`type`: String, signal: Int)

    val ds: Dataset[DeviceData] =
      spark.read.json("data.json").as[DeviceData]

    // Compute histogram of signal by device type
    val hist = ds.groupByKey(_.`type`).mapGroups {
      case (deviceType, data: Iterator[DeviceData]) =>
        val buckets = new Array[Int](10)
        data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
        (deviceType, buckets)
    }

  • Unified, Structured APIs in Spark 2.0

                       SQL          DataFrames     Datasets
    Syntax Errors      Runtime      Compile Time   Compile Time
    Analysis Errors    Runtime      Runtime        Compile Time

  • Batch ETL with DataFrames

    # Read from JSON file
    input = spark.read.format("json").load("source-path")

    # Select some devices
    result = input.select("device", "signal").where("signal > 15")

    # Write to Parquet file
    result.write.format("parquet").save("dest-path")

  • Streaming ETL with DataFrames

    # Read from JSON file stream: replace read with readStream
    input = spark.readStream.format("json").load("source-path")

    # Select some devices: code does not change
    result = input.select("device", "signal").where("signal > 15")

    # Write to Parquet file stream: replace save() with start()
    result.writeStream.format("parquet").start("dest-path")

  • Streaming ETL with DataFrames

    readStream ... load() creates a streaming DataFrame; it does not start any of the computation

    writeStream ... start() defines where & how to output the data, and starts the processing

    input = spark.readStream.format("json").load("source-path")

    result = input.select("device", "signal").where("signal > 15")

    result.writeStream.format("parquet").start("dest-path")
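
    A minimal runnable variant of the snippet above. The explicit schema and the checkpoint option are assumptions beyond the slides: streaming JSON sources need a user-supplied schema by default, and the file sink needs a checkpoint location.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

    # Streaming file sources require a user-supplied schema.
    schema = StructType([
        StructField("device", StringType()),
        StructField("signal", IntegerType()),
    ])

    # load() only defines the streaming DataFrame; nothing runs yet.
    input = spark.readStream.schema(schema).format("json").load("source-path")

    result = input.select("device", "signal").where("signal > 15")

    # start() kicks off the continuous, incremental execution.
    query = (result.writeStream
        .format("parquet")
        .option("checkpointLocation", "checkpoint-path")   # assumed path
        .start("dest-path"))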

  • Streaming ETL with DataFrames

    [Diagram: at triggers 1, 2, 3, only the new rows in the result (new rows of 2, new rows of 3) are written in append mode]

    input = spark.readStream.format("json").load("source-path")

    result = input.select("device", "signal").where("signal > 15")

    result.writeStream.format("parquet").start("dest-path")

  • Demo

  • Continuous Aggregations

    Continuously compute average signal across all devices:

    input.groupBy().avg("signal")

    Continuously compute average signal of each type of device:

    input.groupBy("device_type").avg("signal")
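
    To actually run the per-type aggregation continuously it needs an output mode and a sink; a minimal sketch (console sink chosen here for illustration, not from the slides):

    avg_by_type = input.groupBy("device_type").avg("signal")

    # Aggregations maintain running state, so complete mode outputs the
    # whole updated result table at every trigger.
    query = (avg_by_type.writeStream
        .outputMode("complete")
        .format("console")
        .start())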

  • Continuous Windowed Aggregations

    input.groupBy(
        $"device_type",
        window($"event_time", "10 min"))
      .avg("signal")

    Continuously compute average signal of each type of device over the last 10 minutes, using the event_time column

    Simplifies event-time stream processing (not possible in DStreams)

    Works on both streaming and batch jobs

  • Watermarking for handling late data [Spark 2.1]

    input.withWatermark("event_time", "15 min")
      .groupBy(
        $"device_type",
        window($"event_time", "10 min"))
      .avg("signal")

    Watermark = threshold using event-time for classifying data as "too late"

    Continuously compute 10 minute average while considering that data can be up to 15 minutes late
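
    The same watermarked, windowed aggregation in PySpark, as a minimal sketch (assumes a streaming DataFrame named input with device_type, signal, and event_time timestamp columns, as in the slides):

    from pyspark.sql.functions import window, avg

    windowed_avg = (input
        .withWatermark("event_time", "15 minutes")
        .groupBy("device_type", window("event_time", "10 minutes"))
        .agg(avg("signal")))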

  • Joining streams with static data

    kafkaDataset = spark.readStream
      .format("kafka")
      .option("subscribe", "iot-updates")
      .option("kafka.bootstrap.servers", ...)
      .load()

    staticDataset = spark.read
      .jdbc("jdbc://", "iot-device-info")

    joinedDataset = kafkaDataset.join(staticDataset, "device_type")

    Join streaming data from Kafka with static data via JDBC to enrich the streaming data, without having to think that you are joining streaming data
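
    A fuller PySpark sketch of the same pattern. Everything concrete here (broker address, JSON payload schema, JDBC URL, credentials, table name) is a hypothetical stand-in, and the Kafka value column is parsed from JSON first, a step the slide leaves implicit. The Kafka source also requires the spark-sql-kafka package on the classpath.

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    # Assumed schema of the JSON events on the iot-updates topic.
    event_schema = StructType([
        StructField("device_type", StringType()),
        StructField("signal", DoubleType()),
    ])

    kafka_events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # assumed address
        .option("subscribe", "iot-updates")
        .load()
        .select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*"))

    device_info = spark.read.jdbc(
        "jdbc:postgresql://db-host:5432/iot",                # assumed URL
        "iot_device_info",                                   # assumed table name
        properties={"user": "spark", "password": "secret"})  # assumed credentials

    # Stream-static join: each micro-batch of Kafka events is enriched
    # with the static device metadata, keyed on device_type.
    enriched = kafka_events.join(device_info, "device_type")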

  • Output Modes

    Defines what is output every time there is a trigger

    Different output modes make sense for different queries

    Append mode with non-aggregation queries:

    input.select("device", "signal")
      .writeStream
      .outputMode("append")
      .format("parquet")
      .start("dest-path")

    Complete mode with aggregation queries:

    input.agg(count("*"))
      .writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("my_table")
      .start()
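
    A usage note (not on the slide): the memory sink registers the complete-mode result as an in-memory table under the query name, so it can be inspected with ordinary SQL while the stream runs:

    # Assumes the complete-mode query above is running with queryName "my_table".
    spark.sql("SELECT * FROM my_table").show()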

  • Query Management

    query = result.writeStream
      .format("parquet")
      .outputMode("append")
      .start("dest-path")

    query.stop()
    query.awaitTermination()
    query.exception()
    query.status()
    query.recentProgress()

    query: a handle to the running streaming computation, for managing it

    - Stop it, wait for it to terminate
    - Get status
    - Get error, if terminated

    Multiple queries can be active at the same time

    Can ask for current status and metrics
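
    Because several queries can be active at once, the session-level StreamingQueryManager is the natural place to manage them all; a small PySpark sketch (the printed fields are just examples):

    # spark.streams is the StreamingQueryManager for this SparkSession.
    for q in spark.streams.active:          # all currently running queries
        print(q.name, q.id, q.status)

    spark.streams.awaitAnyTermination()     # block until any one of them stops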

  • Logically: Dataset operations on a table (i.e. as easy to understand as batch)

    Physically: Spark automatically runs the query in streaming fashion (i.e. incrementally and continuously)

    [Diagram: DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental query execution]
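
    To see this planning in action, a small sketch (assumes the result streaming DataFrame and the query handle from the earlier slides); both calls print the plans Catalyst produced, which Spark then executes incrementally at every trigger:

    result.explain(True)   # parsed/analyzed/optimized logical plans + physical plan
    query.explain()        # physical plan of the running incremental execution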

  • Struct