introduction to apache beam - huodongjia.com · introduction to apache beam some of the slides (the...

38
1 Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Upload: trinhdang

Post on 16-Jul-2018

237 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

1

Introduction to Apache Beam

Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Page 2: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

2

About me

Software engineer @PayPal, working on streaming data processing.

PMC member, committer @ApacheBeam, Spark runner lead.

Page 3: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

3

“A unified programming model for batch and streaming data processing, that can be executed

on various processing engines”

Page 4: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

What’s in the box?

● SDKs for writing Beam pipelines -- Java and Python

● The Beam Model: What / Where / When / How

● Runners for existing distributed processing backends

▪ Apache Apex

▪ Apache Flink

▪ Apache Spark

▪ Google Cloud Dataflow

▪ Direct (in-process) runner for testing

Page 5: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

5

The Beam Pipeline

Page 6: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

6

The Beam pipeline: overview

InputTransform PCollection Transform PCollection Transform

Output

Page 7: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

7

I/O

Page 8: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

8

The Beam pipeline: IOs

InputTransform PCollection Transform PCollection Transform

Output

Page 9: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Pipeline IOs: bounded source

pipeline

.apply("ReadLines", readText)

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply(“WriteFormatted”, writeText);

TextIO.Read.Bound readText = TextIO.Read.from(“path/to/input.txt”);

TextIO.Write.Bound writeText = TextIO.Write.to(“path/to/output”);

*Some code snippets were shortened or elided for clarity.

Page 10: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Pipeline IOs: unbounded source

pipeline

.apply("ReadLines", readKafka.values())

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply(“WriteFormatted”, writeKafka.values());

KafkaIO.Read<Integer, String> readKafka =

KafkaIO.<Integer, String>read().withTopic(“my_input_topic”)...

KafkaIO.Write<Integer, String> writeKafka =

KafkaIO.<Integer, String>write().withTopic(“my_output_topic”)...

*Some code snippets were shortened or elided for clarity.

Page 11: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Supported IOs (April 2017)

● HDFS● HBase● JDBC● MongoDB● Elasticsearch● Kafka● Kinesis● JMS● MQTT● Google - GCS, BigQuery, BigTable, Datastore

Most of this was done in little over a year, thanks to the Beam community!

Page 12: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

12

Transformations

Page 13: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

13

The Beam pipeline: transfomations

InputTransform PCollection Transform PCollection Transform

Output

Page 14: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

SDK core primitives

SDK transformations are mostly built on top of the following core primitives: ● ParDo - executing the user’s DoFn function ~ map/flatmap.● GroupByKey - grouping by key and window.● Window.into - applying a window to a PCollection.● Flatten.pCollections - union one or more PCollections into a single

PCollection.

Page 15: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

CountWords

pipeline

.apply("ReadLines", TextIO.Read.from(options.getInputFile()))

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply("WriteCounts", TextIO.Write.to(options.getOutput()));

*Some code snippets were shortened or elided for clarity.

InputReadLines PCollection CountWords PCollection

FormatAsTex

t

OutputPCollection WriteCounts

Page 16: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The CountWords composite

// Convert lines of text into individual words.

PCollection<String> words =

lines.apply(ParDo.of(new ExtractWordsFn()));

// Count the number of times each word occurs.

PCollection<KV<String, Long>> wordCounts =

words.apply(Count.<String>perElement());

*Some code snippets were shortened or elided for clarity.

PCollection CountWords PCollection

PCollection ExtractWords PCollection

PCollection CountElemnt PCollection

Page 17: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

17

The Beam Model

Page 18: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

18

Processing time vs. event time

Page 19: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

19

The Beam Model: asking the right questions

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

Page 20: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: What is being computed?

PCollection<KV<String, Integer>> scores = input

.apply(Sum.integersPerKey());

*Some code snippets were shortened or elided for clarity.

Page 21: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: What is being computed?

Page 22: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: Where in event time?

PCollection<KV<String, Integer>> scores = input

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))

.apply(Sum.integersPerKey());

*Some code snippets were shortened or elided for clarity.

Page 23: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: Where in event time?

Page 24: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: When in processing time?

PCollection<KV<String, Integer>> scores = input

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))

.triggering(AtWatermark()))

.apply(Sum.integersPerKey());

*Some code snippets were shortened or elided for clarity.

Page 25: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: When in processing time?

Page 26: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: How do refinements relate?

PCollection<KV<String, Integer>> scores = input

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))

.triggering(AtWatermark()

.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))

.withLateFirings(AtCount(1)))

.accumulatingFiredPanes())

.apply(Sum.integersPerKey());

*Some code snippets were shortened or elided for clarity.

Page 27: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam Model: How do refinements relate?

Page 28: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

28

Customizing What Where When How

3

Streaming

4Streaming

+ Accumulation

1

Classic

Batch

For more information see https://beam.apache.org/get-started/mobile-gaming-example/

2Windowed

Batch

Page 29: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

The Beam streaming pipeline

*Some code snippets were shortened or elided for clarity.

pipeline

.apply(KafkaIO.read().withTopic(“team_points_topic”))

.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))

.triggering(AtWatermark()

.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))

.withLateFirings(AtCount(1)))

.accumulatingFiredPanes())

.apply(Sum.integersPerKey())

.apply(KafkaIO.write().withTopic(“team_points_topic”))

Page 30: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

30

Portability

Page 31: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Direct runner

pipeline

.apply("ReadLines", TextIO.Read.from(options.getInputFile()))

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply("WriteCounts", TextIO.Write.to(options.getOutput()));

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline pipeline = Pipeline.create(options);

*Some code snippets were shortened or elided for clarity.

Page 32: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Flink runner

pipeline

.apply("ReadLines", TextIO.Read.from(options.getInputFile()))

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply("WriteCounts", TextIO.Write.to(options.getOutput()));

FlinkPipelineOptions flinkPipelineOptions =

PipelineOptionsFactory.as(FlinkPipelineOptions.class);

flinkPipelineOptions.setRunner(FlinkRunner.class);

Pipeline pipeline = Pipeline.create(flinkPipelineOptions);

*Some code snippets were shortened or elided for clarity.

Page 33: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Spark runner

pipeline

.apply("ReadLines", TextIO.Read.from(options.getInputFile()))

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply("WriteCounts", TextIO.Write.to(options.getOutput()));

SparkPipelineOptions sparkPipelineOptions =

PipelineOptionsFactory.as(SparkPipelineOptions.class);

sparkPipelineOptions.setRunner(SparkRunner.class);

Pipeline pipeline = Pipeline.create(sparkPipelineOptions);

*Some code snippets were shortened or elided for clarity.

Page 34: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

Spark runner

pipeline

.apply("ReadLines", TextIO.Read.from(options.getInputFile()))

.apply(“CountWords”, new CountWords())

.apply(“FormatAsText”, MapElements.via(new FormatAsTextFn()))

.apply("WriteCounts", TextIO.Write.to(options.getOutput()));

SparkPipelineOptions sparkPipelineOptions =

PipelineOptionsFactory.as(SparkPipelineOptions.class);

sparkPipelineOptions.setRunner(SparkRunner.class);

sparkPipelineOptions.setSparkMaster(“spark://IP:PORT”);

Pipeline pipeline = Pipeline.create(sparkPipelineOptions);

*Some code snippets were shortened or elided for clarity.

Page 35: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

35

The Apache Beam Vision

Page 36: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

36

The Apache Beam vision

1. End users: who want to write pipelines in a language that’s familiar.

2. SDK writers: who want to make Beam concepts available in new languages.

3. Runner writers: who have a distributed processing environment and want to support Beam pipelines

Beam Model: Fn Runners

Apache Flink

Apache Spark

Beam Model: Pipeline Construction

OtherLanguagesBeam Java

Beam Python

Execution Execution

Cloud Dataflow

Execution

Apache Apex

Page 37: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

37

Learn more!

Apache Beamhttps://beam.apache.org

The World Beyond Batch 101 & 102 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Join the mailing lists! [email protected]@beam.apache.org

Follow @ApacheBeam on Twitter

Page 38: Introduction to Apache Beam - Huodongjia.com · Introduction to Apache Beam Some of the slides (the really nice ones!) were contributed by Frances Perry & Tyler Akidau, April 2016

38

Thank you!