Java High-Level Stream API


Post on 08-Jan-2017






Stream API For Apex

June 2016

Apex Overview


YARN is the resource manager
HDFS is used for storing any persistent state

Current Development Model

[Diagram: a Directed Acyclic Graph (DAG) of operators connected by streams of tuples; stream labels: Filtered Stream, Enriched Stream, Output Stream]

A stream is a sequence of data tuples
A typical operator takes one or more input streams, performs computations, and emits one or more output streams
Each operator is your custom business logic in Java, or a built-in operator from our open source library
An operator has many instances that run in parallel, and each instance is single-threaded
A Directed Acyclic Graph (DAG) is made up of operators and streams
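The operator and stream model described above can be sketched in plain Java. This is only a toy model under stated assumptions: `ToyOperator`, `connect`, and `process` are hypothetical names invented for illustration and are not part of the Apex operator API, and the sketch runs in a single thread with no partitioning or checkpointing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Toy model only: these classes are hypothetical and are not the Apex operator API.
class ToyOperatorDemo {

    /** A toy operator: receives one tuple, computes, and emits zero or more tuples downstream. */
    static final class ToyOperator<I, O> {
        private final Function<I, List<O>> logic;
        private final List<Consumer<O>> downstream = new ArrayList<>();

        ToyOperator(Function<I, List<O>> logic) {
            this.logic = logic;
        }

        /** Connects this operator's output to a consumer; the connection plays the role of a stream. */
        void connect(Consumer<O> sink) {
            downstream.add(sink);
        }

        /** Processes one tuple and forwards every result to all connected consumers. */
        void process(I tuple) {
            for (O out : logic.apply(tuple)) {
                for (Consumer<O> sink : downstream) {
                    sink.accept(out);
                }
            }
        }
    }

    /** Wires split -> filter -> collect and returns the tuples that reach the end of the graph. */
    static List<String> run(List<String> lines) {
        ToyOperator<String, String> split =
            new ToyOperator<>(line -> List.of(line.split("\\s+")));
        ToyOperator<String, String> filter =
            new ToyOperator<>(word -> word.length() > 3 ? List.of(word) : List.<String>of());
        List<String> collected = new ArrayList<>();

        split.connect(filter::process);   // stream from split to filter
        filter.connect(collected::add);   // stream from filter to the output
        for (String line : lines) {
            split.process(line);
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("apex streams data", "a tiny dag")));
    }
}
```

Each `connect` call corresponds to one edge of the graph, which is why the wiring must stay acyclic: a cycle here would make `process` recurse forever.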

Current Application Example

@ApplicationAnnotation(name="WordCountDemo")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Add the three operators: input, counter, and console output.
    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    UniqueCounter wordCount = dag.addOperator("count", new UniqueCounter());
    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());

    // Connect the operators with named streams.
    // The sink port of the first stream was lost in extraction; it is presumably an input port of wordCount.
    dag.addStream("wordinput-count", input.outputPort, /* sink port missing in source */);
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}

Why we need high-level API

Easier for beginners to start with
Fluent API
Smaller learning curve
Transform methods in one place vs. operator library
Operator API provides flexibility while high-level API provides ease of use

Stream API

ApexStream methods: map(..), filter(..), addOperator(...), with(prop, val), window(Opt...)


Stream API (Application Example)

@ApplicationAnnotation(name = "WordCountStreamingApiDemo")
public class ApplicationWithStreamAPI implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration configuration)
  {
    String localFolder = "./src/test/resources/data";
    // Read lines from the folder, split them into words, and count per key in a global window
    // that fires early every second with accumulating panes.
    ApexStream stream = StreamFactory
        .fromFolder(localFolder)
        .flatMap(new Split())
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
        .countByKey(new ConvertToKeyVal())
        .print();
    // Translate the recorded graph into the Apex DAG.
    stream.populateDag(dag);
  }
}
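To make the data flow of the example concrete, here is a minimal sketch of the same split-then-count logic applied to a bounded in-memory list with the JDK's `java.util.stream`. It is not Apex code: there is no windowing, no triggering, and no distribution, and `WordCountSketch` and `countWords` are hypothetical names chosen for illustration. Splitting on whitespace is an assumption; the source does not show what `Split` does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain JDK illustration of "flatMap into words, then count by key"; not the Apex stream API.
class WordCountSketch {

    /** Splits each line on whitespace (an assumption) and counts occurrences per word. */
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))  // one line -> many words
            .filter(word -> !word.isEmpty())                            // drop tokens from blank lines
            .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("a b a", "b c")));
    }
}
```

On an unbounded stream there is no final result to collect, which is where the window and trigger options in the Apex example come in: they decide when a partial count is emitted.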

How it works

ApexStream<T> literally means a bounded/unbounded data set of type T
ApexStream also holds a graph data structure of all operators and the connections between operators, from the input to the current point
Each transform method attaches one or more operators to the current graph data structure and returns a new ApexStream object
The graph data structure won't be translated to an Apex DAG until the populateDag or run method is called
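The deferred translation described above can be sketched as follows. This is a hand-written model, not the actual ApexStream implementation: `ToyStream` and `ToyDag` are hypothetical classes, operators are reduced to name strings, and the graph is simplified to a linear list rather than a general DAG.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: ToyStream and ToyDag are illustrative names, not Apex classes.
class LazyGraphDemo {

    /** Stand-in for the engine's DAG: records operator names in the order they are added. */
    static final class ToyDag {
        final List<String> operators = new ArrayList<>();

        void addOperator(String name) {
            operators.add(name);
        }
    }

    /** An immutable stream handle carrying the operators from the input up to this point. */
    static final class ToyStream {
        private final List<String> graph;

        private ToyStream(List<String> graph) {
            this.graph = graph;
        }

        static ToyStream fromInput(String name) {
            return new ToyStream(List.of("input:" + name));
        }

        /** Each transform appends an operator to a copy of the graph and returns a new stream object. */
        private ToyStream append(String operator) {
            List<String> next = new ArrayList<>(graph);
            next.add(operator);
            return new ToyStream(Collections.unmodifiableList(next));
        }

        ToyStream map(String label) {
            return append("map:" + label);
        }

        ToyStream filter(String label) {
            return append("filter:" + label);
        }

        /** Only at this point is the recorded graph translated into the target DAG. */
        void populateDag(ToyDag dag) {
            for (String operator : graph) {
                dag.addOperator(operator);
            }
        }
    }

    public static void main(String[] args) {
        ToyDag dag = new ToyDag();
        ToyStream stream = ToyStream.fromInput("folder").map("split").filter("nonEmpty");
        System.out.println(dag.operators.isEmpty()); // nothing has been translated yet
        stream.populateDag(dag);
        System.out.println(dag.operators);
    }
}
```

Returning a new stream object from every transform keeps earlier handles valid, so one intermediate stream could in principle be extended in two different directions.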

How it works (Cont)

Method chain for readability
Stateless transforms (map, flatMap, filter)
Some inputs and outputs are available (file, console, Kafka)
Some interoperability (addOperator, getDag, set property/attributes, etc.)
Local mode and distributed mode
Anonymous function class support
Extensible

Current Status

WindowedStream is in a pull request, along with the operators that support it
A few window transforms (count, reduce, etc.)
3 window types (fixed window, sliding window, session window)
3 trigger types (early trigger, late trigger, at watermark)
3 accumulation modes (accumulate, discard, accumulation_retraction)
In-memory window state (checkpointed)
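Two of the window types and two of the accumulation modes listed above can be illustrated with a small model. This is a sketch of the general concepts only, not the Apex windowed operator: `WindowSketch` and its methods are hypothetical, windows are identified by their start time in milliseconds, and session windows and the accumulation_retraction mode are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-rolled model of the concepts, not the Apex windowed operator.
class WindowSketch {

    /** Fixed windows: each timestamp maps to exactly one window, identified by its start. */
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Sliding windows: a timestamp belongs to every window [start, start + size) that covers it. */
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long slideMillis) {
        List<Long> starts = new ArrayList<>();
        long latest = timestampMillis - Math.floorMod(timestampMillis, slideMillis);
        for (long start = latest; start > timestampMillis - sizeMillis; start -= slideMillis) {
            starts.add(start);
        }
        return starts;
    }

    /**
     * Counts tuples in one window and fires a pane after every fireEvery tuples.
     * Accumulating panes report the running total; discarding panes report only
     * what arrived since the previous firing.
     */
    static List<Integer> firePanes(int tupleCount, int fireEvery, boolean accumulating) {
        List<Integer> panes = new ArrayList<>();
        int total = 0;
        int sinceLastFire = 0;
        for (int i = 0; i < tupleCount; i++) {
            total++;
            sinceLastFire++;
            if (sinceLastFire == fireEvery) {
                panes.add(accumulating ? total : sinceLastFire);
                sinceLastFire = 0;
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(2500, 1000));
        System.out.println(slidingWindowStarts(2500, 2000, 1000));
        System.out.println(firePanes(6, 2, true));
        System.out.println(firePanes(6, 2, false));
    }
}
```

With accumulating panes a downstream consumer can simply overwrite its previous value, while with discarding panes it has to add up the deltas itself.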

Current Status (Cont)

Roadmap

Persistent window state for windowed operators (large state)
Fully follow the Beam model (window, trigger, watermark)
Rich selection of windowed transforms (group, combine, join)
Support custom window assignor
Support custom trigger
More input/output (HBase, Cassandra, JDBC, etc.)
Better schema support
More language support (Java 8, Scala, etc.)
What the community asks for

Resources

Apache Apex website
@ApacheApex (follow)
Examples - request

Demo & Code Example

Word Count
AutoComplete

Thank You!

Comments/[email protected]