Stateful distributed stream processing, Gyula Fóra (@GyulaFora)

Post on 09-Feb-2017

  • Stateful distributed stream processing

    Gyula Fóra

    @GyulaFora

  • This talk

    Stateful processing by example

    Definition and challenges

    State in current open-source systems

    State in Apache Flink

    Closing

    Apache Flink Meetup @ MapR, 2015-08-27

  • Stateful processing by example

    Window aggregations
      Total number of customers in the last 10 minutes
      State: Current aggregate

    Machine learning
      Fitting trends to the evolving stream
      State: Model

  • Stateful processing by example

    Pattern recognition
      Detect suspicious financial activity
      State: Matched prefix

    Stream-stream joins
      Match ad views and impressions
      State: Elements in the window

  • Stateful operators

    All these examples use a common processing pattern

    Stateful operator (in essence): $f: (in, state) \rightarrow (out, state')$

    State hangs around and can be read and modified as the stream evolves

    Goal: Get as close as possible to this model while maintaining scalability and fault-tolerance
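    The stateful-operator pattern can be sketched in plain Java as a function that takes an input and the current state and returns an output together with the updated state. The names `StatefulFunction`, `Result`, and `run` below are hypothetical and do not belong to any of the systems discussed in this talk; the running sum mirrors the "current aggregate" example from the earlier slides.

```java
import java.util.ArrayList;
import java.util.List;

final class StatefulOperatorSketch {

    /** One invocation result: the emitted output and the state to carry forward. */
    static final class Result<OUT, S> {
        final OUT out;
        final S state;
        Result(OUT out, S state) { this.out = out; this.state = state; }
    }

    /** Hypothetical interface for f: (in, state) -> (out, state'). */
    interface StatefulFunction<IN, S, OUT> {
        Result<OUT, S> apply(IN in, S state);
    }

    /** Feeds the inputs through f one by one, threading the state along. */
    static <IN, S, OUT> List<OUT> run(List<IN> inputs, S initial,
                                      StatefulFunction<IN, S, OUT> f) {
        List<OUT> outputs = new ArrayList<>();
        S state = initial;
        for (IN in : inputs) {
            Result<OUT, S> r = f.apply(in, state);
            outputs.add(r.out);
            state = r.state; // the state "hangs around" between records
        }
        return outputs;
    }

    /** Running sum: the state is the current aggregate. */
    static List<Integer> runningSum(List<Integer> inputs) {
        StatefulFunction<Integer, Integer, Integer> f =
            (in, sum) -> new Result<>(sum + in, sum + in);
        return run(inputs, 0, f);
    }
}
```

    This single-threaded loop ignores the two hard parts named on the slide: it neither partitions the state for scalability nor protects it against failures.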

  • State-of-the-art systems

    Most systems allow developers to implement stateful programs

    Trick is to limit the scope of f (state access) while maintaining expressivity

    Issues to tackle:
      Expressivity
      Exactly-once semantics
      Scalability to large inputs
      Scalability to large states

  • Storm (Trident)

    States available only in the Trident API

    Dedicated operators for state updates and queries

    State access methods:
      stateQuery()
      partitionPersist()
      persistentAggregate()

    It's very difficult to implement transactional states

    Exactly-once guarantee

  • Storm Word Count


  • Spark Streaming

    Stateless runtime by design
      No continuous operators
      UDFs are assumed to be stateless

    State can be generated as a stream of RDDs: updateStateByKey()
      $f: (Seq[in_k], state_k) \rightarrow state'_k$
      state is scoped to a specific key

    Exactly-once semantics
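    To make the per-key update model concrete, here is a plain-Java sketch of what an update-state-by-key step does for one micro-batch: group the new values by key, call the update function with those values and the previous state of the key, and build a new state map instead of mutating the old one. This is not Spark code; `updateBatch` and `wordCount` are hypothetical names, and the maps stand in for RDDs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.BiFunction;

final class UpdateStateByKeySketch {

    /** Applies one micro-batch to the previous state and returns a NEW state map. */
    static <K, V, S> Map<K, S> updateBatch(
            Map<K, S> previous,
            List<Map.Entry<K, V>> batch,
            BiFunction<List<V>, Optional<S>, Optional<S>> update) {
        // Group the values of this batch by key.
        Map<K, List<V>> grouped = new HashMap<>();
        for (Map.Entry<K, V> e : batch) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // Keys with existing state but no new values are updated with an empty list.
        for (K key : previous.keySet()) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>());
        }
        // Build the next state; the previous map is left untouched.
        Map<K, S> next = new HashMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            Optional<S> old = Optional.ofNullable(previous.get(e.getKey()));
            update.apply(e.getValue(), old).ifPresent(s -> next.put(e.getKey(), s));
        }
        return next;
    }

    /** Word count over a sequence of batches: new count = batch sum + previous count. */
    static Map<String, Integer> wordCount(List<List<String>> batches) {
        BiFunction<List<Integer>, Optional<Integer>, Optional<Integer>> update =
            (values, state) -> Optional.of(
                values.stream().mapToInt(Integer::intValue).sum() + state.orElse(0));
        Map<String, Integer> state = new HashMap<>();
        for (List<String> batch : batches) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : batch) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
            state = updateBatch(state, pairs, update);
        }
        return state;
    }
}
```

    Every batch touches every key that already has state, which is one way to read the later remark that immutability increases update complexity.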

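  • Spark Streaming Word Count

    // Sum the new values of a key and add them to its previous count
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // Adapter to the iterator-based overload, as in Spark's StatefulNetworkWordCount example
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val stateDstream = wordDstream.updateStateByKey[Int](
      newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      true, // rememberPartitioner
      initialRDD)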

  • Samza

    Stateful dataflow operators (any task can hold state)

    State changes are stored as a log by Kafka

    Custom storage engines can be plugged in to the log

    state is scoped to a specific task

    At-least-once processing semantics
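    The log-based approach can be illustrated with a small plain-Java sketch: every write goes to a local store and is also appended to a changelog, and a restarted task rebuilds its store by replaying that log. This is not the Samza API; `ChangelogStoreSketch` and its methods are hypothetical, and an in-memory list stands in for the Kafka topic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ChangelogStoreSketch {

    /** One logged state change: a key and its new value. */
    static final class Change {
        final String key;
        final int value;
        Change(String key, int value) { this.key = key; this.value = value; }
    }

    private final Map<String, Integer> local = new HashMap<>();
    private final List<Change> changelog; // stands in for a Kafka changelog topic

    ChangelogStoreSketch(List<Change> changelog) { this.changelog = changelog; }

    Integer get(String key) { return local.get(key); }

    /** Updates the local store and appends the change to the log. */
    void put(String key, int value) {
        local.put(key, value);
        changelog.add(new Change(key, value));
    }

    /** Rebuilds a task-local store after a failure by replaying the log in order. */
    static ChangelogStoreSketch restore(List<Change> changelog) {
        ChangelogStoreSketch store = new ChangelogStoreSketch(changelog);
        for (Change c : changelog) {
            store.local.put(c.key, c.value); // later entries overwrite earlier ones
        }
        return store;
    }

    /** Word-count step: read the current count, increment, write back. */
    static void count(ChangelogStoreSketch store, String word) {
        Integer current = store.get(word);
        store.put(word, current == null ? 1 : current + 1);
    }
}
```

    The sketch restores state only. If input messages are re-delivered after a restart, their updates are applied again on top of the restored store, which is consistent with the at-least-once semantics stated above.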

  • Samza Word Count

    public class WordCounter implements StreamTask, InitableTask {

      // Some omitted details

      private KeyValueStore<String, Integer> store;

      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {

        // Get the current count
        String word = (String) envelope.getKey();
        Integer count = store.get(word);
        if (count == null) count = 0;

        // Increment, store and send
        count += 1;
        store.put(word, count);
        collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, word, count));
      }
    }

  • What can we say so far?

    Trident
      + Consistent state accessible from outside
      - Only works well with idempotent states
      - States are not part of the operators

    Spark
      + Integrates well with the system guarantees
      - Limited expressivity
      - Immutability increases update complexity

    Samza
      + Efficient log based state updates
      + States are well integrated with the operators
      - Lack of exactly-once semantics
      - State access is not fully transparent

  • Take what's good, make it work + add some more

    Clean and powerful abstractions
      Local (Task) state
      Partitioned (Key) state

    Proper API integration
      Java: OperatorState interface
      Scala: mapWithState, flatMapWithState

    Exactly-once semantics by checkpointing

  • Flink Word Count

    words.keyBy(x => x).mapWithState {
      (word, count: Option[Int]) => {
        val newCount = count.getOrElse(0) + 1
        val output = (word, newCount)
        (output, Some(newCount))
      }
    }

  • Local State

    Task scoped state access

    Can be used to implement custom access patterns

    Typical usage:
      Source operators (offset)
      Machine learning models
      Use cyclic flows to simulate global state access

  • Local State Example (Java)

    // Assumption: the source emits Long values and keeps a Long offset
    public class MySource extends RichParallelSourceFunction<Long> {
      // Omitted details
      private OperatorState<Long> offset;

      @Override
      public void run(SourceContext<Long> ctx) {
        Object checkpointLock = ctx.getCheckpointLock();
        isRunning = true;
        while (isRunning) {
          // Update the state and emit under the checkpoint lock,
          // so that a snapshot never sees a half-finished step
          synchronized (checkpointLock) {
            offset.update(offset.value() + 1);
            // ctx.collect(next);
          }
        }
      }
    }

  • Partitioned State

    Key scoped state access

    Highly scalable

    Allows for incremental backup/restore

    Typical usage:
      Any per-key operation
      Grouped aggregations
      Window buffers
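    Why key-scoped state scales can be shown with a small plain-Java sketch: each key is owned by exactly one partition, so an update only ever touches the state of one parallel instance. This is not Flink code; `PartitionedStateSketch` is a hypothetical class, and hashing the key modulo the parallelism is an assumed routing rule chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartitionedStateSketch {

    // One state map per parallel instance; a key lives in exactly one of them.
    private final List<Map<String, Long>> partitions = new ArrayList<>();

    PartitionedStateSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new HashMap<>());
        }
    }

    /** Picks the partition that owns a key (assumed rule: hash modulo parallelism). */
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    /** Per-key count: reads and writes only the state of the given key. */
    long increment(String key) {
        return partitions.get(partitionFor(key)).merge(key, 1L, Long::sum);
    }

    /** Number of keys currently held by one partition. */
    int keysIn(int partition) {
        return partitions.get(partition).size();
    }
}
```

    Because no state is shared between keys, each partition could be backed up or restored on its own, which fits the incremental backup/restore point above.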

  • Partitioned State Example (Scala)

    // Compute the current average of each city's temperature
    temps.keyBy("city").mapWithState {
      (in: Temp, state: Option[(Double, Long)]) => {
        val current = state.getOrElse((0.0, 0L))
        val updated = (current._1 + in.temp, current._2 + 1)
        val avg = Temp(in.city, updated._1 / updated._2)
        (avg, Some(updated))
      }
    }

    case class Temp(city: String, temp: Double)

  • Exactly-once semantics

    Based on consistent global snapshots

    Algorithm designed for stateful dataflows

    Detailed mechanism
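    The core idea behind snapshot-based exactly-once state can be shown on a single task: if the source offset and the operator state are saved together, a failure can roll both back to the same point and replay the input, so every record affects the state exactly once. The sketch below is a deliberately simplified, single-threaded illustration with hypothetical names; it is not Flink's distributed snapshot algorithm, which has to coordinate many parallel tasks.

```java
import java.util.List;

final class CheckpointReplaySketch {

    /** A snapshot pairs the source offset with the operator state at that offset. */
    static final class Snapshot {
        final int offset;
        final long sum;
        Snapshot(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    /**
     * Sums the input, taking a snapshot every `interval` records.
     * If failAt >= 0, one crash is simulated just before the record at that
     * offset is processed, followed by recovery from the last snapshot.
     */
    static long run(List<Integer> input, int interval, int failAt) {
        Snapshot last = new Snapshot(0, 0L);
        int offset = 0;
        long sum = 0L;
        boolean failed = false;
        while (offset < input.size()) {
            if (!failed && offset == failAt) {
                // Crash: progress since the last snapshot is lost.
                // Offset and state are rolled back together.
                failed = true;
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            sum += input.get(offset);
            offset++;
            if (offset % interval == 0) {
                last = new Snapshot(offset, sum);
            }
        }
        return sum;
    }
}
```

    Restoring only the offset, or only the state, would either double-count or drop the records processed between the last snapshot and the crash.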

  • Exactly-once semantics

    Low runtime overhead

    Checkpointing logic is separated from application logic

    Blogpost on streaming fault-tolerance

  • Summary

    State is essential to many applications

    Fault-tolerant streaming state is a hard problem

    There is a trade-off between expressivity and scalability/fault-tolerance

    Flink tries to hit the sweet spot by
      Providing very flexible abstractions
      Keeping good scalability and exactly-once semantics

  • Thank you!