Distributed Stream Processing - Spark Summit East 2017


TRANSCRIPT

MANCHESTER LONDON NEW YORK

Petr Zapletal @petr_zapletal @spark_summit

@cakesolutions

Distributed Real-Time Stream Processing: Why and How

Agenda

● Motivation

● Stream Processing

● Available Frameworks

● Systems Comparison

● Recommendations

The Data Deluge

● New Sources and New Use Cases

● 8 Zettabytes (1 ZB = 1 trillion GB) created in 2015

● Stream Processing to the Rescue

● Continuous processing, aggregation and analysis of unbounded data

Distributed Stream Processing

Points of Interest

➜ Runtime and Programming Model

➜ Primitives

➜ State Management

➜ Message Delivery Guarantees

➜ Fault Tolerance & Low Overhead Recovery

➜ Latency, Throughput & Scalability

➜ Maturity and Adoption Level

➜ Ease of Development and Operability

Runtime and Programming Model

● The most important trait of a stream processing system

● Defines expressiveness, possible operations and their limitations

● Therefore defines the system's capabilities and its use cases

[Diagram: Native streaming - records flow from a Source Operator through Processing Operators to a Sink Operator and are processed one at a time. Micro-batching - a Receiver groups incoming records into micro-batches, which the Processing Operators handle in short batches before the results reach the Sink Operator.]

Native Streaming

● Records are processed as they arrive

Pros

⟹ Expressiveness

⟹ Low-latency

⟹ Stateful operations

Cons

⟹ Fault-tolerance is expensive

⟹ Load-balancing

Micro-batching

● Splits the incoming stream into small batches (see the sketch below)

Pros

⟹ High-throughput

⟹ Easier fault tolerance

⟹ Simpler load-balancing

Cons

⟹ Higher latency, depends on batch interval

⟹ Limited expressivity

⟹ Harder stateful operations
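A minimal, framework-agnostic sketch of the micro-batching idea (the MicroBatcher class and its timing logic are illustrative assumptions, not code from any of the frameworks discussed): records are only buffered on arrival, and a whole buffer is processed at once when the batch interval elapses.

import scala.collection.mutable.ArrayBuffer

// Buffer incoming records and hand them to `processBatch` as one small batch
// once `intervalMillis` has elapsed since the last flush.
class MicroBatcher[T](intervalMillis: Long)(processBatch: Seq[T] => Unit) {
  private val buffer = ArrayBuffer.empty[T]
  private var lastFlush = System.currentTimeMillis()

  def onRecord(record: T): Unit = {
    buffer += record                                              // arrival only buffers the record
    if (System.currentTimeMillis() - lastFlush >= intervalMillis)
      flush()
  }

  private def flush(): Unit = {
    processBatch(buffer.toVector)                                 // the whole batch is processed at once
    buffer.clear()
    lastFlush = System.currentTimeMillis()
  }
}

// Usage: word counts per 1-second batch
val batcher = new MicroBatcher[String](1000)(batch =>
  println(batch.groupBy(identity).map { case (w, ws) => (w, ws.size) })
)
batcher.onRecord("spark")
batcher.onRecord("flink")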

Programming Model

Compositional

⟹ Provides basic building blocks as operators or sources

⟹ Custom component definition

⟹ Manual topology definition & optimization

⟹ Advanced functionality often missing

Declarative

⟹ High-level API

⟹ Operators as higher order functions (illustrated below)

⟹ Abstract data types

⟹ Advanced operations like state management or windowing supported out of the box

⟹ Advanced optimizers
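A framework-agnostic illustration of the declarative style, with a plain Scala List standing in for a stream; the input lines and the word-count pipeline are made-up examples, not any framework's API:

// Declarative style: the pipeline is expressed as higher-order functions over
// an abstract collection; the engine decides how to execute it.
val lines: List[String] = List("spark storm flink", "spark samza")

val counts = lines
  .flatMap(_.split(" "))                       // split each line into words
  .groupBy(identity)                           // group identical words together
  .map { case (word, ws) => (word, ws.size) }  // count each group

// counts: Map(spark -> 2, storm -> 1, flink -> 1, samza -> 1)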

Apache Streaming Landscape

Storm

● Pioneer in large scale stream processing

Trident

● Higher level micro-batching system built atop Storm

Spark Streaming

● Unified batch and stream processing over a batch runtime

[Diagram: input data stream → Spark Streaming → batches of input data → Spark Engine → batches of processed data]

Samza

● Builds heavily on Kafka's log based philosophy

[Diagram: partitioned input streams consumed by parallel tasks (Task 1, Task 2, Task 3)]

Kafka Streams

● Simple streaming library on top of Apache Kafka

Apex

● Processes massive amounts of real-time events natively in Hadoop

[Diagram: a DAG of operators connected by streams of tuples, producing an output stream]

Flink

● Native streaming & high level API

[Diagram: stream data (Kafka, RabbitMQ, ...) and batch data (HDFS, JDBC, ...) both processed by Flink]

Counting Words

Input:  "Spark Summit 2017 Apache Apache Spark Storm Apache Trident Flink Streaming Samza Scala 2017 Streaming"

Output: (Apache, 3), (Streaming, 2), (2017, 2), (Spark, 2), (Storm, 1), (Trident, 1), (Flink, 1), (Samza, 1), (Scala, 1), (Summit, 1)

Storm

// Build the topology: a sentence spout feeding a splitter bolt and a counting bolt
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
...

// Inside the WordCount bolt: keep running counts in memory and emit the updated count
Map<String, Integer> counts = new HashMap<String, Integer>();

public void execute(Tuple tuple, BasicOutputCollector collector) {
  String word = tuple.getString(0);
  Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
  counts.put(word, count);
  collector.emit(new Values(word, count));
}

Trident

public static StormTopology buildTopology(LocalDRPC drpc) {
  FixedBatchSpout spout = ...

  TridentTopology topology = new TridentTopology();

  // Split each sentence into words, group by word and keep a persistent count as Trident state
  TridentState wordCounts = topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
  ...
}

Spark Streaming

// One-second micro-batches
val conf = new SparkConf().setAppName("wordcount")
val ssc  = new StreamingContext(conf, Seconds(1))

val text = ...

// Word count expressed declaratively over DStreams
val counts = text.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)

counts.print()

ssc.start()
ssc.awaitTermination()

Samza

class WordCountTask extends StreamTask {

  override def process(envelope: IncomingMessageEnvelope,
                       collector: MessageCollector,
                       coordinator: TaskCoordinator) {
    val text = envelope.getMessage.asInstanceOf[String]

    // Count the words of this message
    val counts = text.split(" ").foldLeft(Map.empty[String, Int]) { (count, word) =>
      count + (word -> (count.getOrElse(word, 0) + 1))
    }

    // Send the counts to the "wordcount" Kafka topic
    collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts))
  }
}

Kafka Streams

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Read the input topic as a stream of lines
KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "streams-file-input");

// Split into words, group by word and count
KStream<String, Long> wordCounts = textLines
  .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
  .groupBy((key, value) -> value)
  .count("Counts")
  .toStream();

wordCounts.to(stringSerde, longSerde, "streams-wordcount-output");

Apex

// Wire the DAG: a line reader feeding a parser; the "counter" word-counting
// operator and its wiring to the console output are elided on the slide
val input  = dag.addOperator("input", new LineReader)
val parser = dag.addOperator("parser", new Parser)
val out    = dag.addOperator("console", new ConsoleOutputOperator)

dag.addStream[String]("lines", input.out, parser.in)
dag.addStream[String]("words", parser.out, counter.data)

class Parser extends BaseOperator {
  @transient val out = new DefaultOutputPort[String]()
  @transient val in = new DefaultInputPort[String]() {
    // Split each incoming line and emit the individual words
    override def process(t: String): Unit =
      for (w <- t.split(" ")) out.emit(w)
  }
}

Flink

val env = ExecutionEnvironment.getExecutionEnvironment

val text = env.fromElements(...)

// Word count with Flink's Scala API
val counts = text.flatMap(_.split(" "))
                 .map((_, 1))
                 .groupBy(0)
                 .sum(1)

counts.print()

env.execute("wordcount")

Summary

(Comparison across Streaming Model, API, Guarantees, Fault Tolerance, State Management, Latency, Throughput and Maturity)

● Storm: Native | Compositional | At-least-once | Record ACKs | Not built-in | Very Low latency | Low throughput | High maturity

● Trident: Micro-batching | Compositional | Exactly-once | Record ACKs | Dedicated Operators | Medium latency | Medium throughput | High maturity

● Spark Streaming: Micro-batching | Declarative | Exactly-once* | RDD-based Checkpointing | Dedicated DStream | Medium latency | High throughput | High maturity

● Samza: Native | Compositional | At-least-once | Log-based | Stateful Operators | Low latency | High throughput | Medium maturity

● Apex: Native | Compositional* | Exactly-once | Checkpointing | Stateful Operators | Very Low latency | High throughput | Medium maturity

● Kafka Streams: Native | Declarative* | At-least-once | Log-based | Stateful Operators | Low latency | High throughput | Low maturity

● Flink: Native | Declarative | Exactly-once | Checkpointing | Stateful Operators | Low latency | High throughput | Medium maturity

General Guidelines

As always, it depends

➜ Evaluate particular application needs

➜ Programming model

➜ Available delivery guarantees

➜ Almost all non-trivial jobs have state

➜ Fast recovery is critical

Recommendations [Storm & Trident]

● Fits small and fast tasks

● Very low (tens of milliseconds) latency

● State & fault tolerance degrade performance significantly

● Potential upgrade path to Heron

○ Keeps the API; according to Twitter, better in every single way

○ Open-sourced recently

Recommendations [Spark Streaming]

● Spark Ecosystem

● Data Exploration

● Latency is not critical

● Micro-batching limitations

Recommendations [Samza]

● Kafka is a cornerstone of your architecture

● Application requires large states

● Don’t need exactly once

Recommendations [Kafka Streams]

● Similar functionality to Samza, with nicer APIs

● Operated by Kafka itself

● Great learning curve

● At-least-once delivery

● May not support more advanced functionality

Recommendations [Apex]

● Prefer compositional approach

● Hadoop

● Great performance

● Dynamic DAG changes

Recommendations [Flink]

● Conceptually great, fits most use cases

● Take advantage of batch processing capabilities

● Need functionality that is hard to implement with micro-batching

● Enough courage to use an emerging project

Dataflow and Apache Beam

Dataflow Model & SDKs

● Multiple modes: stream processing and batch processing in one pipeline

● Many runtimes: Direct Pipeline (local), Apache Flink and Apache Spark (local or cloud), Google Cloud Dataflow (cloud)

Questions

MANCHESTER LONDON NEW YORK


@petr_zapletal @cakesolutions

347 708 1518

[email protected]

We are hiring: http://www.cakesolutions.net/careers

References

● http://storm.apache.org/

● http://spark.apache.org/streaming/

● http://samza.apache.org/

● https://apex.apache.org/

● https://flink.apache.org/

● http://beam.incubator.apache.org/

● http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1

● http://data-artisans.com/blog/

● http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple

● https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

● http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem

● https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

● http://stackoverflow.com/questions/29111549/where-do-apache-samza-and-apache-storm-differ-in-their-use-cases

● https://www.dropbox.com/s/1s8pnjwgkkvik4v/GearPump%20Edge%20Topologies%20and%20Deployment.pptx?dl=0

● ...

List of Figures

● https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html

● http://www.slideshare.net/ptgoetz/storm-hadoop-summit2014

● https://databricks.com/wp-content/uploads/2015/07/image41-1024x602.png

● http://data-artisans.com/wp-content/uploads/2015/08/microbatching.png

● http://samza.apache.org/img/0.9/learn/documentation/container/stateful_job

● http://data-artisans.com/wp-content/uploads/2015/08/streambarrier.png

● https://cwiki.apache.org/confluence/display/FLINK/Stateful+Stream+Processing

● https://raw.githubusercontent.com/tweise/apex-samples/kafka-count-jdbc/exactly-once/docs/images/image00.png

● https://4.bp.blogspot.com/-RlLeDymI_mU/Vp-1cb3AxNI/AAAAAAAACSQ/5TphliHJA4w/s1600/dataflow%2BASF.png

Backup Slides


Processing Architecture Evolution

Batch Pipeline

[Diagram: all data lands in HDFS, a batch job loads a serving DB, which answers queries]

Lambda Architecture

[Diagram: all your data flows into both a batch layer and a stream layer; a serving layer combines their views to answer queries]

Standalone Stream Processing

[Diagram: a stream processing job alone answers queries]

Kappa Architecture

[Diagram: stream processing alone serves queries]

Streaming Applications

ETL Operations

● Transformations, joining or filtering of incoming data

Windowing

● Trends in a bounded interval, like tweets or sales (see the sketch below)
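As a hedged illustration of windowing, here is a sketch using Flink's DataStream Scala API (one of the frameworks covered); the hashtag input and job name are made up for the example, and the method names reflect the Flink 1.x API of that era:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// A made-up stream of hashtags standing in for tweets
val hashtags: DataStream[String] = env.fromElements("#spark", "#flink", "#spark")

// Count occurrences per hashtag over tumbling 1-minute windows
val trends = hashtags
  .map(tag => (tag, 1))
  .keyBy(0)
  .timeWindow(Time.minutes(1))
  .sum(1)

trends.print()
env.execute("windowed-hashtag-counts")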

Machine Learning

● Clustering, trend fitting, classification

Pattern Recognition

● Fraud detection, signal triggering, anomaly detection

Fault Tolerance

Fault tolerance in streaming systems is inherently harder than in batch

Managing State

f: (input, state) => (output, state')
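A minimal sketch of this state-transition signature in Scala; the running word count and the updateWordCount helper are assumptions for illustration, not any framework's API:

// f: (input, state) => (output, state')
// Here the state is the map of word counts seen so far.
type State = Map[String, Long]

def updateWordCount(word: String, state: State): ((String, Long), State) = {
  val newCount = state.getOrElse(word, 0L) + 1            // fold the input into the state
  ((word, newCount), state.updated(word, newCount))       // emit an output and the new state
}

// Usage: feeding two records through the function
val (out1, s1) = updateWordCount("spark", Map.empty)      // ((spark, 1), Map(spark -> 1))
val (out2, s2) = updateWordCount("spark", s1)             // ((spark, 2), Map(spark -> 2))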

Performance

It is hard to design an unbiased test; there are lots of variables

➜ Latency vs. Throughput

➜ Costs of Delivery Guarantees, Fault-tolerance & State Management

➜ Tuning

➜ Network operations, Data locality & Serialization

Project Maturity

When picking a framework, you should always consider its maturity

Project Maturity [Storm & Trident]

For a long time the de facto industry standard

Project Maturity [Spark Streaming]

The most trending Scala repository these days and one of the engines behind Scala's popularity

Project Maturity [Samza]

Used by LinkedIn and also by tens of other companies

Project Maturity [Kafka Streams]

???

Project Maturity [Apex]

Graduated very recently, adopted by a couple of corporate clients already

Project Maturity [Flink]

Still an emerging project, but we can see its first production deployments