Flink London Meetup 3 March 2016 - Flink Basics

Post on 14-Apr-2017

Motivation

The Evolution of Massive-Scale Data Processing, Tyler Akidau, Staff Software Engineer @ Google, https://goo.gl/5k0xaL

We’re not the newest!

APACHE FLINK LONDON MEETUP, 3rd March 2016 | Bonhill House, London

What we’ll cover today

• Hand-wavey bit
• Practical bit
• Textbook bit

Part 1: The hand-wavey bit

• Aim:
   ◦ Make sure we all have the same basic understanding of what Flink is
   ◦ Introduce key concepts
      ▪ Not exhaustive
      ▪ Not explaining much!

WTF is Flink?

Flink basics…

• Apache Foundation top level open source project…
• …for distributed data processing…
• …with a “streaming first” architecture…
• …running on the JVM.

Or: A ‘free’ way to process a lot of data (especially streaming data) on ‘commodity’ hardware, with a codebase that is continually improving. Useful for reporting, analytics, log processing, machine learning, etc.

Some key terms

• DataStream: A possibly unbounded immutable collection of data items of the same type

• DataSet: An abstract representation of a finite immutable collection of data of the same type that may contain duplicates

• Source: Can be file-based, socket-based, collection-based, custom (e.g. Kafka)

• Sink: Consumes DataSets/DataStreams and forwards them to files, sockets, external systems, or prints them

• Operator: Represents an operation (or a data processing step) in the ‘JobGraph’; includes properties like the actual code and desired parallelism.

Application architecture

Flink ‘skeleton’ program structure

DataStream
1. Obtain a StreamExecutionEnvironment
2. Connect to data stream sources
3. Specify transformations on the data streams
4. Specify output for the processed data
5. Execute the program [env.execute()]

DataSet
1. Obtain an ExecutionEnvironment
2. Load/create the initial data
3. Specify transformations on the data
4. Specify where to put results
5. Execute the program [env.execute(), print(), collect()]
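
The five steps can be mimicked in a few lines of plain Python. This is only a toy sketch of the lazy "build the plan, then execute" shape. `MiniEnv`, `MiniStream`, `from_collection` and `add_sink` are hypothetical stand-ins invented for this illustration, not Flink's API.

```python
class MiniStream:
    """A lazily evaluated stream: it records steps and runs nothing yet."""

    def __init__(self, env, source, steps=()):
        self.env = env
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        return MiniStream(self.env, self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return MiniStream(self.env, self.source, self.steps + (("filter", fn),))

    def add_sink(self, sink):
        # Register the (plan, sink) pair; still nothing is executed.
        self.env.jobs.append((self, sink))


class MiniEnv:
    """Toy stand-in for an execution environment (hypothetical)."""

    def __init__(self):
        self.jobs = []

    def from_collection(self, items):
        return MiniStream(self, list(items))

    def execute(self):
        # Only now does data flow from each source through its steps to its sink.
        for stream, sink in self.jobs:
            for item in stream.source:
                keep = True
                for kind, fn in stream.steps:
                    if kind == "map":
                        item = fn(item)
                    elif not fn(item):  # kind == "filter"
                        keep = False
                        break
                if keep:
                    sink(item)


env = MiniEnv()                                        # 1. obtain an environment
words = env.from_collection(["flink", "", "london"])   # 2. connect to a source
lengths = words.filter(bool).map(len)                  # 3. specify transformations
results = []
lengths.add_sink(results.append)                       # 4. specify the output
before_execute = list(results)                         # still empty: the plan is lazy
env.execute()                                          # 5. execute the program
```

The point of the sketch is step 5: until `execute()` is called, the earlier steps only describe the job.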

(in future meetups Guest Speakers will give us the juicy details…)

Key Flink features

High Performance

Support for out-of-order events

Low latency

Exactly-once semantics

Flexible streaming windows

One runtime for stream & batch / ecosystem

Back pressure

Delta iterate operators

According to the Apache Flink site (http://flink.apache.org/)

High performance / Low latency

High throughput

Low latency

Flow control and back pressure

• Back pressure bottleneck: ‘pressure’ building up because data is arriving faster than it can be processed.
   ◦ Temporary process slow-down (e.g. GC on JVM)
   ◦ Temporary traffic spike
• “Flink achieves the maximum throughput allowed by the slowest part of the pipeline”
   ◦ Not a configurable ‘feature’
   ◦ Inherent in architecture (buffer-based)
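
A rough way to see why a buffer-based design gives back pressure "for free" is a tick-based simulation in plain Python. This is a sketch under assumed numbers (rates and buffer size are made up), not a model of Flink's actual network stack: the producer can only emit while the bounded buffer has room, so its rate settles at the consumer's rate.

```python
from collections import deque


def simulate(ticks, produce_rate, consume_rate, buffer_capacity):
    """Run a producer -> bounded buffer -> consumer pipeline for `ticks` steps.

    Returns (items emitted by the producer, items processed by the consumer).
    """
    buffer = deque()
    emitted = 0
    processed = 0
    for _ in range(ticks):
        # The consumer drains at most consume_rate items per tick.
        for _ in range(min(consume_rate, len(buffer))):
            buffer.popleft()
            processed += 1
        # The producer may only emit into free buffer slots; when the
        # buffer is full it is back-pressured and has to wait.
        free_slots = buffer_capacity - len(buffer)
        for _ in range(min(produce_rate, free_slots)):
            buffer.append(emitted)
            emitted += 1
    return emitted, processed


# A fast producer (5 per tick) feeding a slow consumer (2 per tick).
emitted, processed = simulate(
    ticks=100, produce_rate=5, consume_rate=2, buffer_capacity=4
)
```

After the first tick fills the buffer, the producer emits exactly as many items per tick as the consumer removes, and the number of items in flight never exceeds the buffer capacity.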

Exactly-once semantics for state

• In the event of failure “Pick up where you left off”.
   ◦ Means you need to remember where you left off (data and state)
• 3 levels of risk appetite:
   ◦ L1: Accept misses (“At most once”)
   ◦ L2: Accept duplicates (“At least once”)
   ◦ L3: Don’t accept either (“Exactly once”)
• Checkpointing/snapshots
   ◦ Dependent on stream source, e.g. Kafka
   ◦ Orchestration is tricky (see next slide)
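
The "remember where you left off (data and state)" idea can be sketched in plain Python. This is a single-process illustration with a replayable list standing in for a source such as Kafka. The function name and its parameters are assumptions for this sketch, and it says nothing about how Flink orchestrates distributed snapshots.

```python
def run_with_checkpoints(events, checkpoint_every, fail_at=None, restore_state=True):
    """Sum `events` from a replayable source, surviving one simulated crash.

    A checkpoint stores the read offset and the running sum together.
    With restore_state=False only the offset is rewound, so replayed
    events are counted twice (at-least-once behaviour for the state).
    """
    checkpoint = (0, 0)  # (next offset to read, running sum)
    offset, total = checkpoint
    crashed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not crashed:
            # Simulated failure: go back to the last checkpoint.
            crashed = True
            if restore_state:
                offset, total = checkpoint
            else:
                offset = checkpoint[0]
            continue
        total += events[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)
    return total


events = list(range(1, 11))  # 1..10, true sum is 55
exactly_once = run_with_checkpoints(events, checkpoint_every=3, fail_at=5)
at_least_once = run_with_checkpoints(
    events, checkpoint_every=3, fail_at=5, restore_state=False
)
```

Restoring offset and state from the same checkpoint gives the true sum, while rewinding only the offset double-counts the replayed events.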

Checkpointing orchestration

Support for Out-of-Order events

• Real life: messages will be delayed
• Every event is time-stamped
• It’s harder than it sounds (‘kinds’ of time, windows, watermarks, etc)

t1 t2 t3 t5 t6 t7 t4 t8
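
One common way to cope with the late t4 above is to hold events back until a watermark (here assumed to be "largest timestamp seen so far minus a fixed delay") has passed them. The following plain-Python sketch illustrates that idea only. `reorder` and `max_delay` are names made up for this example, and Flink's own watermark mechanics are richer than this.

```python
import heapq


def reorder(timestamps, max_delay):
    """Return timestamps in the order they are released by the watermark.

    Each event is buffered until watermark = max_seen - max_delay has
    reached its timestamp. Remaining events are flushed at end of stream.
    """
    pending = []  # min-heap of buffered timestamps
    max_seen = float("-inf")
    released = []
    for ts in timestamps:
        heapq.heappush(pending, ts)
        max_seen = max(max_seen, ts)
        watermark = max_seen - max_delay
        while pending and pending[0] <= watermark:
            released.append(heapq.heappop(pending))
    while pending:  # end of stream: release whatever is left
        released.append(heapq.heappop(pending))
    return released


arrival_order = [1, 2, 3, 5, 6, 7, 4, 8]  # t4 arrives after t7
patient = reorder(arrival_order, max_delay=3)
impatient = reorder(arrival_order, max_delay=1)
```

With a delay of 3 the late event is still slotted into place. With a delay of 1 the watermark has already moved past it, so it comes out late, which is the trade-off between latency and completeness.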

Highly flexible streaming windows

A window defines the start and end of the part of the data stream that is being processed.
• Different ways to define the window, including:
   ◦ Time (from 9:00:00 to 9:00:04)
   ◦ Count (from item 12 to item 18)
   ◦ Session (from first ‘keyed event’ until we don’t see the same key for X time; analogous to a cookie session)
   ◦ More complex logic driven by the data, and more complex windows depending on what is needed
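
The three basic window kinds can be sketched over an in-memory list in plain Python. These helper functions are hypothetical illustrations, not Flink's windowing API, and the session rule used here (a gap of at most `gap` keeps the session open) is an assumption of the sketch.

```python
def tumbling_time_windows(events, size):
    """Group (timestamp, value) pairs into fixed, non-overlapping time windows."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size  # window covers [start, start + size)
        windows.setdefault(start, []).append(value)
    return windows


def count_windows(values, size):
    """Group values into windows of `size` consecutive items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of silence."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # silence was too long: new session
    return sessions


by_time = tumbling_time_windows(
    [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (10, "e")], size=5
)
by_count = count_windows([1, 2, 3, 4, 5], size=2)
by_session = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```

In a real keyed stream the session logic would run once per key. The sketch treats all timestamps as belonging to a single key.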

(Delta) iterate operators

Iterate operator

Delta iterate operator

Work on ‘hot’; don’t touch ‘cold’
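
The "work on hot, don't touch cold" idea can be illustrated with a workset-style connected-components loop in plain Python. This is a single-machine sketch of the general technique. The function and variable names are assumptions, and it does not use or reproduce Flink's delta iteration operator.

```python
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    The solution set holds the current label of every vertex. The workset
    holds only the "hot" vertices whose label changed in the last round,
    so "cold" vertices are not revisited.
    """
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    solution = {v: v for v in vertices}  # vertex -> component label
    workset = set(vertices)              # everything is hot at the start
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbours[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]  # a smaller label was found
                    next_workset.add(n)        # n is hot in the next round
        workset = next_workset
    return solution


labels = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

A plain iterate operator would recompute every vertex each round. Here the loop ends as soon as no label changes, and each round only touches vertices that changed in the previous one.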

One runtime / library ecosystem

NB:
• Libraries in beta
• APIs in Java, Scala, [Python]
• FlinkCEP too?
