Fundamentals of Stream Processing with Apache Beam, by Tyler Akidau and Frances Perry

TRANSCRIPT

Frances Perry & Tyler Akidau (@francesjperry, @takidau)
Apache Beam Committers & Google Engineers
Fundamentals of Stream Processing with Apache Beam (incubating)

Kafka Summit - April 2016

Frances and Tyler talk about Apache Beam, a unified model for batch and stream data processing.

NOTE:

These slides are not being actively maintained.

For up-to-date presentations on Apache Beam, please see: beam.incubator.apache.org/presentation-materials/

Agenda

1. Infinite, Out-of-Order Data Sets
2. What, Where, When, How
3. Reasons This is Awesome
4. Apache Beam (incubating)

The kind of data set we're talking about; the four questions at the heart of the Beam model; why these are the right questions; and the current state of the Apache Beam project.

1. Infinite, Out-of-Order Data Sets

Let's talk about the shape of the data. Our running example is analyzing mobile game logs: users around the globe are crushing candy, and we want to analyze these logs to learn about the game...

Data...

Here are the gaming logs. Each square represents an event where a user scored some points for their team.

...can be big...

game gets popular

...really, really big... (spanning Tuesday, Wednesday, Thursday)

We start organizing it into a repeated structure.

maybe infinitely big...

[Timeline: hourly buckets from 1:00 through 14:00, continuing indefinitely]

The repetitive structure is just a cheap way of representing an infinite data source; game logs are continuous. But distributed systems can cause ambiguity...

...with unknown delays.

[Timeline: three events that all occurred at 8:00, arriving at different processing times]

Let's look at some points that were scored at 8am. The red score happened at 8am and was received quickly. The yellow score also happened at 8am but was received at 8:30 due to network congestion. The green element was hours late: this was someone playing in airplane mode, and the score had to wait for the plane to land. So now we've got an unordered, infinite data set. How do we process it...

Element-wise transformations

[Timeline: events flowing past in processing time, 8:00 through 14:00]

Element-wise transformations work on individual elements: parsing, translating, or filtering, applied as elements flow past. But other transforms, like counting or joining, require combining multiple elements together...
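To make that concrete, here is a minimal sketch of an element-wise transform in Beam Java (the bot-filtering rule and the names rawLines/humanEvents are invented for illustration, not from the talk):

// An element-wise transform: each log line is inspected on its own, so the
// operation parallelizes trivially. The "bot=true" marker is hypothetical.
PCollection<String> humanEvents = rawLines.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (!c.element().contains("bot=true")) {
          c.output(c.element());  // keep events from human players
        }
      }
    }));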

Aggregating via Processing-Time Windows

[Timeline: elements grouped by arrival into fixed processing-time windows, 8:00 through 14:00]

When doing aggregations, we need to divide the infinite stream of elements into finite-sized chunks that can be processed independently. The simplest way is to use arrival time in fixed time periods. But this can mean elements are processed out of order: late elements may be aggregated with unrelated elements that arrived at about the same time...

Aggregating via Event-Time Windows

[Diagram: input elements in processing time (top) reorganized into event-time windows in the output (bottom), 10:00 through 15:00]

Here we reorganize the data based on when the events occurred, not when they arrived. The red element arrived relatively on time and stays in the noon window. The green element that arrived at 12:30 was actually created around 11:30, so it moves up to the 11am window. This requires formalizing the difference between processing time and event time.
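One detail the slide glosses over: event-time windowing requires each element to carry its event-time timestamp. A minimal sketch of attaching one in Beam Java (GameEvent and getEventTimeMillis are hypothetical names):

// Stamp each element with its own event time so downstream windowing
// groups by when the event occurred, not when it arrived.
PCollection<GameEvent> stamped = events.apply(
    WithTimestamps.of((GameEvent e) -> new Instant(e.getEventTimeMillis())));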

Formalizing Event-Time Skew

[Graph: event time vs. processing time, showing the ideal line, reality as a squiggle, and the skew between them]

The blue axis is event time; the green axis is processing time. Ideally there is no delay: elements are processed the moment they occur. Reality looks more like that red squiggly line, where processing time lags slightly behind event time. The variable distance between reality and the ideal is called skew, and we need to track it in order to reason about correctness.
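Stated compactly (a formalization implied by the graph, not spelled out on the slide): for an element $x$ with event time $e(x)$ and processing time $p(x)$,

$$\mathrm{skew}(x) = p(x) - e(x) \ge 0,$$

and the ideal line is exactly $\mathrm{skew} = 0$.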

Formalizing Event-Time Skew

Watermarks describe event time progress.

"No timestamp earlier than the watermark will be seen."

[Graph: the watermark approximating the ideal line, with skew between them]

Often heuristic-based.
Too slow? Results are delayed.
Too fast? Some data is late.

The red line is the watermark: no event times earlier than this point are expected to appear in the future. It's often heuristic-based. Too slow? Unnecessary latency. Too fast? Some data comes in late, after we thought we were done for a given time period. So how do we reason about these types of infinite, out-of-order data sets...

2. What, Where, When, How

not too hard if you know what kinds of questions to ask!

What are you computing?
Where in event time?
When in processing time?
How do refinements relate?

What results are calculated? Sums, joins, histograms, machine learning models?

Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated for all time, in fixed windows, or as user activity sessions?

When in processing time are results materialized? Does the time each element arrives in the system affect results? How do we know when to emit a result? What do we do about data that comes in late from those pesky users playing on transatlantic flights?

And finally, how do refinements relate? If we choose to emit results multiple times, is each result independent and distinct, or do they build upon one another?

Let's dive into how each question contributes when we build a pipeline...

What are you computing?

Element-Wise | Aggregating | Composite

The first thing to figure out is what you are actually computing. Some transforms process each element independently, similar to the Map function in MapReduce, and are easy to parallelize. Other transformations, like grouping and combining, require inspecting multiple elements at a time. And some operations are really just subgraphs of other, more primitive operations.

Now let's see a code snippet for our gaming example...

What: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());


Pseudo-Java, for compactness and clarity! We start by reading a collection of raw events, transform it into a more structured collection containing key/value pairs with a team name and the number of points scored during the event, and then use a composite operation to sum up all the points per team. Let's see how this code executes...
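ParseFn itself isn't shown in the talk; a minimal sketch of what it might look like, assuming log lines of the form "team,points":

// Hypothetical parser matching the ParseFn referenced above: turns a raw
// log line like "blue,3" into a team-name/points pair.
class ParseFn extends DoFn<String, KV<String, Integer>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String[] parts = c.element().split(",");  // assumed "team,points" format
    c.output(KV.of(parts[0], Integer.parseInt(parts[1])));
  }
}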

What: Computing Integer Sums

We're looking at points scored for a given team, on the same event-time and processing-time axes with the ideal line. This score of 3 from just before 12:07 arrives almost immediately. Another is 7 minutes delayed (an elevator or subway ride). The graph isn't big enough to show offline mode from a transatlantic flight.

What: Computing Integer Sums

Time is the thick white line. We accumulate the sum into intermediate state and produce output, represented by the blue rectangle. With all the data available, the rectangle covers all events, no matter when in time they occurred, and a single final result is emitted when it's all complete. This is pretty standard batch processing; let's see what happens if we tweak the other questions.

Where in event time?

Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.

[Diagram: fixed windows, sliding windows, and per-key sessions laid out along a time axis]

Windowing lets us create individual results for different slices of event time. It divides the data into finite chunks based on the event time of each element. Common patterns include fixed time (hourly, daily, monthly); sliding windows (say, the last 24 hours' worth of data, every hour), where a single element may fall into multiple overlapping windows; and session-based windows that capture bursts of user activity and are unaligned per key. Windowing is very common when doing aggregations on infinite data, and it's actually a common pattern in batch too, though historically done using composite keys.
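For reference, the three patterns look roughly like this in Beam Java (real API names; the durations are arbitrary examples, not from the talk):

Window.into(FixedWindows.of(Duration.standardHours(1)));              // fixed: hourly buckets
Window.into(SlidingWindows.of(Duration.standardHours(24))
    .every(Duration.standardHours(1)));                               // sliding: last 24h, every hour
Window.into(Sessions.withGapDuration(Duration.standardMinutes(10)));  // sessions: bursts of activity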

Where: Fixed 2-minute Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

Fixed windows that are 2 minutes long.

Where: Fixed 2-minute Windows

An independent answer for every two-minute period of event time, but we're still waiting until the entire computation completes to emit any results. That won't work for infinite data! We want to reduce latency...

When in processing time?

Triggers control when results are emitted.
Triggers are often relative to the watermark.

[Graph: event time vs. processing time, with the watermark approximating the ideal line and skew between them]

Triggers define when in processing time to emit results. They're often relative to the watermark, which is that heuristic about event-time progress.

When: Triggering at the Watermark

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

We request that results are emitted when we think we've roughly seen all the elements for a given window. This is actually the default; it's written out here just for clarity.

When: Triggering at the Watermark

The left graph shows a perfect watermark that tracks exactly when all the data for a given event time has arrived; we emit the result from each window as soon as the watermark passes. But the watermark is usually just a heuristic, so it looks more like the graph on the right, and now the 9 is missed. And if the watermark is delayed, as in the first graph, we need to wait a long time for anything; we'd like speculative results. Let's use a more advanced trigger...

When: Early and Late Firings

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

We ask for early, speculative firings every minute, and we get updates every time a late element comes in.
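For readers translating the pseudo-Java, this is approximately what the trigger above looks like against the real Beam Java API (my mapping, not from the talk; real Beam also requires an explicit allowed lateness and accumulation mode):

PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(
            FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardDays(1))  // example value
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());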

When: Early and Late Firings

In all cases, we're able to get speculative results before the watermark. We now get results when the watermark passes, but still handle the late value 9 even with a heuristic watermark. In this case, we accumulate across the multiple results per window: in the final window, we see and emit 3, but then still include that 3 in the next update of 12. This behavior around multiple firings is configurable, though...

How do refinements relate?

How should multiple outputs per window accumulate? The appropriate choice depends on the consumer.

Firing:             Speculative   Watermark   Late     Last Observed   Total Observed
Elements:           [3]           [5, 1]      [2]      --              --
Discarding:         3             6           2        2               11
Accumulating:       3             9           11       11              23
Acc. & Retracting:  3             9, -3       11, -9   11              11

(Accumulating & Retracting not yet implemented.)

We fire three times for a window: a speculative firing with 3, the watermark firing with two more values (5 and 1), and finally a late value 2. One option is to emit only the new elements that have come in since the last result; this requires the consumer to be able to do the final sum. We could instead produce the running sum every time, but then the consumer may overcount. Or we could produce both the new running sum and a retraction of the old one.

How: Add Newest, Remove Previous

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

use accumulating and retracting.

How: Add Newest, Remove Previous

Speculative results, on-time results, and retractions. Now the final window emits 3, then retracts the 3 when emitting 12.

So those are the four questions...

3. Reasons This is Awesome

Those are the four key questions. But are they the right questions?

Correctness, Power, Composability, Flexibility, Modularity
(What / Where / When / How)

here are 5 reasons...

Correctness

The results we get are correct.

This is not something we've historically gotten with streaming systems.

Distributed Systems are Distributed

Distributed systems are distributed. If the winds had been blowing from the east instead of the west, elements might have arrived in a slightly different order.

Processing Time Results Differ

If we were aggregating based on processing time, the two orderings would yield different results.

Event Time Results are Stable

Aggregating based on event time may produce different intermediate results, but the final results are identical across the two arrival scenarios.

Power

next, the abstractions can represent powerful and complex algorithms.

Sessions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

Earlier we mentioned session windows, which capture bursts of user activity. It's a simple code change...

Identifying Bursts of User Activity

We want to identify two groupings of points. In other words, Tyler was playing the game, got distracted by a squirrel, and then resumed his play.

Identifying Bursts of User Activity

Now you can see the sessions being built over time. At first we see multiple components in the first session; not until late element 9 comes in do we realize it's one big session.

Composability

Next: we've seen what the four questions can do. What if we ask the questions twice?

Calculating Session Lengths

input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark())
               .discardingFiredPanes())
    .apply(CalculateWindowLength());

code to calculate the length of a user session
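CalculateWindowLength isn't defined in the talk; here is one hypothetical way such a composite could be written, assuming a DoFn that can see the window it's invoked in:

// Hypothetical composite: emit each session's length in millis, computed
// from the bounds of the session window the element landed in.
class CalculateWindowLength
    extends PTransform<PCollection<KV<String, Integer>>,
                       PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(
      PCollection<KV<String, Integer>> input) {
    return input.apply(ParDo.of(
        new DoFn<KV<String, Integer>, KV<String, Long>>() {
          @ProcessElement
          public void processElement(ProcessContext c, BoundedWindow window) {
            IntervalWindow session = (IntervalWindow) window;
            long millis = session.end().getMillis() - session.start().getMillis();
            c.output(KV.of(c.element().getKey(), millis));
          }
        }));
  }
}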


Remember that these graphs are always shown per key: here's the graph calculating session lengths for Frances, and the one for Tyler.

Calculating the Average Session Length

input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark())
               .discardingFiredPanes())
    .apply(CalculateWindowLength())
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1))))
               .accumulatingFiredPanes())
    .apply(Mean.globally());

Now let's take those session lengths per user and ask the questions again, this time using fixed windows to take the mean across the entire collection...
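Chained end-to-end, the two stages read as one pipeline (same pseudo-Java style as the slides; CalculateWindowLength is the hypothetical composite sketched earlier):

PCollection<Double> avgSessionLengths = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
               .triggering(AtWatermark())
               .discardingFiredPanes())
    .apply(CalculateWindowLength())
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1))))
               .accumulatingFiredPanes())
    .apply(Mean.globally());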


Now we're calculating the average length of all sessions that ended in a given time period. If we rolled out an update to our game, this would let us quickly understand whether it resulted in a change in user behavior; if the change made the game less fun, we'd see a sudden drop in how long users play.

Flexibility

OK: flexibility, for covering all sorts of use cases.

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions

By tuning our what/where/when/how knobs, we've covered everything from classic batch to sessions.

Modularity

And not only that, we do so with lovely modular code

// 1. Classic Batch
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

// 2. Batch with Fixed Windows
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

// 3. Streaming
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

// 4. Streaming with Speculative + Late Data
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

// 5. Streaming with Retractions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

// 6. Sessions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Minutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());


All these use cases, and we never changed our core algorithm. It's just integer summing here, but the same would apply to much more complex algorithms too.

Correctness, Power, Composability, Flexibility, Modularity
(What / Where / When / How)

so there you go -- 5 reasons that these 4 questions are awesome

4. Apache Beam (incubating)

We've discussed the use case, introduced the four questions in the Beam model, and shown they are the right questions. Now let's get concrete and see how we got to this model and where we're going.

The Evolution of Beam

[Diagram: MapReduce and its Google descendants (BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) leading to Google Cloud Dataflow, and from there to Apache Beam]

Google published the original paper on MapReduce in 2004, and it fundamentally changed the way we do distributed processing. Inside Google, we kept innovating, but just published papers. Externally, the open source community created Hadoop, and an entire ecosystem flourished, partially influenced by those Google papers. In 2014 came Google Cloud Dataflow, which included both a new programming model and a fully managed service. We wanted to share this model more broadly, both because it is awesome and because users benefit from a larger ecosystem and portability across multiple runtimes. So Google, along with a handful of partners, donated this programming model to the Apache Software Foundation as the incubating project Apache Beam...

What is Part of Apache Beam?

1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines, starting with Java
3. Runners for existing distributed processing backends:
   Apache Flink (thanks to data Artisans)
   Apache Spark (thanks to Cloudera)
   Google Cloud Dataflow (a fully managed service)
   Local (in-process) runner for testing

Today, Apache Beam includes the core unified programming model; the initial Java SDK that we developed as part of Cloud Dataflow; and, most important for portability, multiple runners that can execute Beam pipelines on existing distributed processing backends. But as for our longer-term goals...

Apache Beam Technical Vision

End users: who want to write pipelines or transform libraries in a language that's familiar.
SDK writers: who want to make Beam concepts available in new languages.
Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Diagram: pipeline construction via Beam Java, Beam Python, and other-language SDKs on top of the Beam model, feeding the Beam model's Fn runners (Runner A, Runner B, Cloud Dataflow), each with its own execution environment]

We want to fully support three different categories of users. For end users who want to write data processing pipelines, that includes adding value like additional connectors (we've got Kafka!). Additionally, we want to support community-sourced SDKs and runners. Each community has very different sets of goals and needs. But having a vision and reaching it are two different things...

Visions are a Journey

02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Internal API redesign (slight chaos)
Mid 2016: API stabilization
Late 2016: Multiple runners execute Beam pipelines

Beam entered incubation in early February. We quickly did the code donations and began bootstrapping the infrastructure. The initial focus is on stabilizing internal APIs and integrating the additional runners. Part of that is understanding what different runners can do...

Categorizing Runner Capabilities

http://beam.incubator.apache.org/capability-matrix/

The Beam model is attempting to generalize semantics, so it will not align perfectly with all possible runtimes. We've started categorizing the features in the model and the various levels of runner support. This will help users understand mismatches, like using event-time processing in Spark or exactly-once processing with Samza.

Growing the Beam Community

Collaborate: Beam is becoming a community-driven effort with participation from many organizations and contributors.

Grow: We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.

And one of the things we're most excited about is the collaboration opportunities that Beam enables. We've been doing this stuff for a while at Google, in a very hermetic environment, and we're looking forward to incorporating new perspectives to build a truly generalizable solution. We want to grow the Beam development community over the next few months, whether folks are looking to write transform libraries for end users, new SDKs, or new runners. So if you're interested in learning more...

Learn More!

Apache Beam (incubating): http://beam.incubator.apache.org
The World Beyond Batch 101 & 102:
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists: user@beam.incubator.apache.org and dev@beam.incubator.apache.org
Follow @ApacheBeam on Twitter (and @francesjperry and @takidau too!)

Thank you!