Get your hands on implementing a Flink app: A tutorial
Christos Hadjinikolis & Satyasheel | DataReply.uk

Posted 20-Mar-2017

TRANSCRIPT

Page 1: Flink meetup

Get your hands on implementing a Flink app: A tutorial

Christos Hadjinikolis & Satyasheel | DataReply.uk

Page 2

C. Hadjinikolis & Satyasheel | DataReply 2

Tutorial Overview:

What is Apache Flink?
Why Flink?
Processing both bounded and un-bounded data!
Anatomy of a Flink App
Windowing in Flink
Event time & process time in Flink

2/22/17

Page 3

What is Apache Flink?

“A distributed data processing platform…”


Page 4

Flink is a distributed stream- & batch-data-processing platform.

Stream processing: …the real-time processing of data continuously, concurrently, and in a record-by-record fashion, where data is not static.

Batch processing: …the execution of a series of programs, each on a set or "batch" of static inputs, rather than a single input (which would instead be a custom job).

Page 5

…distributed processing dataset types

Unbounded: infinite datasets that are appended to continuously:
End users interacting with mobile or web applications
Physical sensors providing measurements
Financial markets
Machine log data
Surveillance camera frames

Page 6

…distributed processing dataset types

Bounded: finite, unchanging datasets:

Pictures Documents Database tables

Page 7

Why Flink?

“The world is turning more and more towards stream processing…”


Page 8

Opt for Flink because it:

Provides results that are accurate;
Is stateful and fault-tolerant, and can seamlessly recover from failures;
Performs at large scale.

Page 9

…exactly-once semantics

Stateful… apps can maintain summaries of processed data.

Checkpointing… a mechanism that ensures that, in the event of a failure, no duplicate re-computation of an event will take place.
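As a sketch of how checkpointing is switched on in practice, using Flink's Java DataStream API (the 5-second interval is an arbitrary choice for illustration; exactly-once is already the default mode and is set here only to make it explicit):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// draw a consistent snapshot of all operator state every 5 seconds
env.enableCheckpointing(5000);

// exactly-once checkpointing semantics (the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
```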

Page 10

…event time semantics

…event-time-based windowing

Event time makes it easy to compute accurate results over streams where events arrive out of order and where events may arrive delayed.

Page 11

…flexible windowing

Windows can be customized with flexible triggering conditions to support sophisticated streaming patterns, based on:
Time;
Count, and;
Sessions.

Page 12

… lightweight fault tolerance

Recovers from failures with zero data loss, while the trade-off between reliability and latency is negligible.

Page 13

… lightweight fault tolerance

Savepoints provide a state-versioning mechanism: applications can be updated and can reprocess historic data with no lost state.

Page 14

… Scalable

Designed to run on large-scale clusters with many thousands of nodes.

Page 15

So, in summary… Flink is an open-source stream-processing framework which:
Eliminates the “performance vs. reliability” problem, and;
Performs consistently in both categories.

Page 16

Processing both bounded & un-bounded data!

“Unbounding the boundaries…”


Page 17

…the streaming model & bounded datasets

DataStream API: un-bounded data
DataSet API: bounded data

A bounded dataset is handled inside Flink as a “finite stream”, with only a few minor differences from how Flink manages un-bounded datasets.

Page 18

Anatomy of a Flink App

“Let’s get this started…”


Page 19

…Flink programs transform collections of data. Each program consists of the same basic parts:

Obtain an execution environment,
Load/create the initial data,
Specify transformations on this data,
Specify where to put the results of your computations,
Trigger the program execution.

Page 20

Create execution environment

Load streaming data

Trigger transformations

Specify dumping location

Execute
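The five steps above map almost one-to-one onto code. A minimal sketch with Flink's Java DataStream API, assuming a text socket source on localhost:9999 (e.g. started with `nc -lk 9999`) purely for illustration; it needs a Flink runtime on the classpath to execute:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AppSkeleton {
    public static void main(String[] args) throws Exception {
        // 1. Create execution environment
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Load streaming data (socket source assumed for this sketch)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // 3. Trigger transformations
        DataStream<String> upper = lines.map(String::toUpperCase);

        // 4. Specify dumping location (here: stdout)
        upper.print();

        // 5. Execute — nothing actually runs until this call
        env.execute("App skeleton");
    }
}
```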

Page 21

…lazy evaluation

When the program’s main method is executed:
each operation is created and added to the program’s plan;
execution is explicitly triggered by an execute() call.

This helps with constructing an optimised data-flow that is executed as one holistically planned unit.

Page 22

Let’s take 15 mins…


Page 23

Windowing in Flink

“…a simple word count app.”


Page 24

…so what is a window?

A window is a way to get a {snapshot} of the streaming data. A {snapshot} can be based on time or other variables. One can define a window based on the number of records or other stream-specific variables.

Page 25

…enough with theory! Give us some code!

A streaming word count example with no windowing
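The slide's code is not in the transcript, so here is a reconstructed sketch of a streaming word count with no windowing, using the Java DataStream API. It assumes a socket source on localhost:9999; because no window is applied, `sum(1)` emits an ever-growing running count per word:

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999)
            // split each line into (word, 1) pairs
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            .keyBy(0)   // key by the word (field 0 of the tuple)
            .sum(1);    // running count, updated on every single record

        counts.print();
        env.execute("Streaming word count (no windowing)");
    }
}
```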

Page 26

…updating states

Flink automatically updates its state without the user explicitly doing so. To better appreciate this, it is worth contrasting Flink with Spark:

Spark relies on micro-batches: one has to define the batch size, either in terms of time or of size.
Flink does not require defining a batch size: it can process each and every new event individually (it is true stream processing!).

Page 27

Let’s see an example


Page 28

Windowing in Flink

“Don't waste a minute not being happy. If one window closes, run to the next window - or break down a door. …”


Page 29

…so why use windowing at all?

Aggregation on a DataStream is different from aggregation on a DataSet: one cannot count all the records in an infinite stream. DataStream aggregation makes sense on a windowed stream.

Page 30

…what types of windowing can you use?

Tumbling windows: aligned, fixed-length, non-overlapping windows.
Sliding windows: aligned, fixed-length, overlapping windows.
Session windows: non-aligned, variable-length windows.
Count windows: non-overlapping windows over a fixed number of records/events.
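To make "aligned, fixed-length" concrete, here is a plain-Java sketch (no Flink dependency) of the start-alignment arithmetic such time-window assigners use to decide which window(s) a timestamp falls into. The helper names are ours, not Flink's:

```java
import java.util.ArrayList;
import java.util.List;

public class WindowAssignment {
    // Start of the tumbling window of length `size` containing `timestamp`:
    // every timestamp maps to exactly one aligned window [start, start + size).
    static long tumblingWindowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Starts of all sliding windows of length `size`, sliding by `slide`,
    // that contain `timestamp` (assumes size is a multiple of slide).
    // Because windows overlap, one element can belong to several of them.
    static List<Long> slidingWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // event at t=7 falls in tumbling window [5, 10) for size 5
        System.out.println(tumblingWindowStart(7, 5));      // 5
        // with size 10, slide 5, t=7 belongs to windows starting at 5 and 0
        System.out.println(slidingWindowStarts(7, 10, 5));  // [5, 0]
    }
}
```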

Page 31

…anatomy of the window API

3 window functions:

Window assigner: responsible for assigning a given element to a window. Depending on the definition of the window, one element can belong to one or more windows at a time.

Trigger: defines the condition for triggering window evaluation; this function controls when a given window created by the window assigner is evaluated.

Evictor: an optional function which defines preprocessing to apply before the window operation fires.

Page 32

…understanding count window

Window assigner (for count-based windows, user-defined):

There is no start or end to the window, therefore the window is non-time-based. For these windows we use the GlobalWindows window assigner: for a given key, all key-values are filled into the same window.

keyValue.window(GlobalWindows.create())

The window API allows us to add the window assigner to the windowed stream. Every window assigner has a default trigger; for global windows that trigger is NeverTrigger, which never fires. So this window assigner has to be used with a custom trigger.

Page 33

…understanding count window

Count trigger

Once we have the window assigner, we have to define when the window needs to be triggered, for example:

trigger(CountTrigger.of(2))

This results in the window being evaluated every two records.

Evictor

In addition, an evictor can be used for further preprocessing tasks before a window operation fires, e.g. to remove every 3rd element of a window. Some default evictors: CountEvictor, DeltaEvictor, TimeEvictor.
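Putting assigner and trigger together, a sketch of a per-key count window with the Java API (socket source on localhost:9999 assumed, one word per line). Wrapping the trigger in PurgingTrigger makes each firing also clear the window, which is what the built-in countWindow(n) shorthand wires up for you:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;

public class CountWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)        // one word per line
           .map(new MapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public Tuple2<String, Integer> map(String word) {
                   return new Tuple2<>(word, 1);
               }
           })
           .keyBy(0)                                   // group by word
           .window(GlobalWindows.create())             // non-time-based assigner
           .trigger(PurgingTrigger.of(CountTrigger.of(2)))  // fire & purge every 2 records
           .sum(1)
           .print();

        env.execute("Count window example");
    }
}
```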

Page 37

Let’s take 15 mins…


Page 38

Timing in Flink

“The two most powerful warriors are patience and time.”


Page 39

…the time concept in streaming

A streaming application is an always-running application:
…we need to take snapshots of the stream at various points;
…these points can be defined using a time component;
…with it, we can group and correlate different events happening in the stream.

Some constructs, like windows, heavily use the time component. Most streaming frameworks support a single meaning of time, mostly tied to processing time.

Page 40

…time in Flink

When we say the last “t” seconds, what do we mean exactly? Well, in Flink it is one of three things:

Processing time: “…the records that arrived in the last ‘t’ seconds for processing.”
Event time: “…all the records generated in those last ‘t’ seconds at the source.”
Ingestion time: the time when events were ingested into the system; it sits between event time and processing time.
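In the 1.x-era Java API, choosing between these three is a one-line configuration fragment; note that event time additionally requires sources or operators to assign timestamps and watermarks:

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// default is TimeCharacteristic.ProcessingTime
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// alternatives: TimeCharacteristic.IngestionTime, TimeCharacteristic.ProcessingTime
```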

Page 41

…time in Flink

Page 43

Thanks for your attention!