Flink Batch Processing and Iterations



Batch Processing using Apache Flink
By - Sameer Wadkar


Flink API

- DataSet: Input is in the form of files or collections (unit testing). Results of transformations are returned as sinks, which may be files, the command-line terminal, or collections (unit testing).
- DataStream: Similar to DataSet, but applies to streaming data.
- Table: A SQL-like expression language embedded in Java/Scala. Instead of working with a DataSet or DataStream, you work with the Table abstraction.
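As a rough illustration of the DataSet row above, the following minimal sketch builds a DataSet from an in-memory collection and reads the result back as a collection, the pattern the slide refers to for unit testing (the element values here are made up):

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> words = env.fromElements("to", "be", "or", "not", "to", "be"); // collection source
List<String> result = words.filter(w -> w.startsWith("b")).collect();          // collection sink, handy in tests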


Source Code

Source code for the examples presented can be downloaded from:

https://github.com/sameeraxiomine/FlinkMeetup


Flink DataSet API - Word Count

public class WordCount {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = getLines(env); // Create DataSet from lines in file
    DataSet<Tuple2<String, Integer>> wordCounts = text
        .flatMap(new LineSplitter())
        .groupBy(0) // Group by first element of the Tuple
        .aggregate(Aggregations.SUM, 1);
    wordCounts.print(); // Execute the WordCount job
  }

  /* FlatMap implementation which converts each line to many (word, 1) pairs */
  public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<>(word, 1));
      }
    }
  }
}

Source Code -https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WordCount.java
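The helper getLines is not shown on the slide; a minimal sketch of what it presumably does, assuming the input is a text file (the path literal here is a placeholder):

public static DataSet<String> getLines(ExecutionEnvironment env) {
  // readTextFile produces one String element per line; the path is hypothetical
  return env.readTextFile("/tmp/input.txt");
}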


Flink Batch API (Table API)

public class WordCountUsingTableAPI {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    TableEnvironment tableEnv = new TableEnvironment();
    DataSet<Word> words = getWords(env);
    Table table = tableEnv.fromDataSet(words);
    Table filtered = table
        .groupBy("word")
        .select("word.count as wrdCnt, word")
        .filter("wrdCnt = 2");
    DataSet<Word> result = tableEnv.toDataSet(filtered, Word.class);
    result.print();
  }

  public static DataSet<Word> getWords(ExecutionEnvironment env) {
    // Return DataSet of Word
  }

  public static class Word {
    public String word;
    public int wrdCnt;

    public Word(String word, int wrdCnt) {
      this.word = word;
      this.wrdCnt = wrdCnt;
    }

    public Word() {} // empty constructor to satisfy POJO requirements

    @Override
    public String toString() {
      return "Word [word=" + word + ", count=" + wrdCnt + "]";
    }
  }
}

Source Code - https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/WordCountUsingTableAPI.java
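The slide leaves getWords empty; one plausible sketch with made-up sample words (wrdCnt is computed by the query, so its initial value is irrelevant):

public static DataSet<Word> getWords(ExecutionEnvironment env) {
  // Hypothetical input; "flink" appears twice, so it survives the wrdCnt = 2 filter
  return env.fromElements(
      new Word("flink", 0), new Word("batch", 0),
      new Word("flink", 0), new Word("table", 0));
}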


Table API - How it Works

Table filtered = table
    .groupBy("word")
    .select("word, word.count as wrdCnt") // count(word)
    .filter("wrdCnt = 2");
DataSet<Word> result = tableEnv.toDataSet(filtered, Word.class);

public static DataSet<Word> getWords(ExecutionEnvironment env) {
  // Return DataSet of Word
}

public static class Word {
  public String word;
  public int wrdCnt;
}

Diagram annotations:
1. Group by Word.word
2. Count the words (word.count as wrdCnt) and emit word, wrdCnt
3. Filter words with wrdCnt == 2
4. Transform to a DataSet<Word> using reflection


Iterative Algorithm

[Diagram: 1. read the input data -> 2. run an iteration -> 3. "Continue?" -> if yes, 4. update the input and feed it back into the next iteration -> otherwise, 5. write the output, the result of the last iteration.]


Iterative Algorithm - MapReduce

[Diagram: the same loop, with each iteration run as a MapReduce job. 1. read the input from HDFS -> 2. a MapReduce job runs the iteration -> 3. "Continue?" is decided by checking counters or by a new MapReduce job -> if yes, 4. the updated input is written back to HDFS for the next job -> otherwise, 5. the result of the last iteration is written to HDFS.]


Iterative Algorithm - Spark

[Diagram: 1. read the input from HDFS into an RDD -> 2. a Spark action runs the iteration -> 3. "Continue?" is decided by a Spark action or by checking counters -> if yes, 4. update the RDD and cache it for the next iteration -> otherwise, 5. write the result of the last iteration to disk.]


Iterative Algorithm - Flink

[Diagram: 1. read the input into a DataSet -> 2. the iteration runs inside the job as an IterativeDataSet or a DeltaIteration -> 3. "Continue?" is decided by an Aggregator with a ConvergenceCriterion -> if yes, 4. the new input DataSet is fed back, pipelined, into the next iteration -> otherwise, 5. write the resulting DataSet of the last iteration to disk.]


Batch Processing - Iteration Operators

Iterative algorithms are commonly used in:
- Graph processing
- Machine learning: Bayesian methods, numerical solutions, optimization algorithms

- Accumulators can be used as job-level counters.
- Aggregators are used as iteration-level counters and are reset at the end of each iteration.
- A convergence criterion can be specified to exit the loop (iterative process).
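To make the Accumulator half of that distinction concrete, here is a minimal sketch of a job-level counter (the class and accumulator names are made up); iteration-level Aggregators appear in the bulk-iteration source code later in the deck:

public static class CountingMapper extends RichMapFunction<String, String> {
  private final IntCounter processed = new IntCounter(); // job-level counter

  @Override
  public void open(Configuration parameters) {
    // Register the accumulator; its final value is reliable only after the job ends
    getRuntimeContext().addAccumulator("processed", processed);
  }

  @Override
  public String map(String value) {
    processed.add(1);
    return value;
  }
}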


Bulk Iterations vs. Delta Iterations

Bulk iterations are appropriate when the entire dataset is consumed in each iteration. Example: the k-means clustering algorithm.

Delta iterations exploit the following features (see the sketch below):
- Each iteration processes only a subset of the full DataSet.
- The working dataset becomes smaller in each iteration, allowing each subsequent step to run faster.
- Example: graph processing (propagate the minimum in a graph).
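A minimal sketch of the delta-iteration skeleton this describes, assuming the solution set and workset both hold Tuple2<vertexId, value> keyed on field 0; computeChanges is a hypothetical stand-in for the per-round logic, and the full propagate-minimum example lives in the linked repository:

DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    initialState.iterateDelta(initialState, MAX_ITERATIONS, 0); // keyed on field 0

// Compute only the elements that changed in this round
DataSet<Tuple2<Long, Long>> delta =
    computeChanges(iteration.getWorkset(), iteration.getSolutionSet());

// Merge the delta into the solution set and use it as the next workset;
// the loop ends when the workset is empty or MAX_ITERATIONS is reached
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(delta, delta);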


Bulk Iteration - Toy Example

Consider a DataSet of random numbers from 0-99. This DataSet can be arbitrarily large. Each number is incremented simultaneously in every iteration. Stop when, at the end of an iteration, the sum of all numbers exceeds an arbitrary but user-defined value (e.g., noOfElements * 20000).

[Diagram: the input dataset of numbers i1, i2, i3, ..., in; each iteration increments all numbers simultaneously (i1+1, i2+1, ..., in+1); if the sum of all numbers is not yet > N, repeat; otherwise end.]


Bulk Iteration - Sample Dataset of 5 Elements

[Diagram: the initial and final datasets side by side; initial total = 238, final total = 100,003.]

A DataSet of Tuple2 elements is used as input, where the first element is the key and the second element is incremented in each iteration. The sum of all second elements of the Tuple2 cannot exceed 100,000. Since every iteration adds 1 to each of the 5 elements (+5 per iteration), the total first exceeds 100,000 after 19,953 iterations: 238 + 5 * 19,953 = 100,003.
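The slide shows the totals but not the five individual values; a hypothetical getData whose made-up values reproduce the stated initial total of 238 might look like:

public static DataSet<Tuple2<Long, Long>> getData(ExecutionEnvironment env) {
  // Five keyed numbers; 50 + 40 + 90 + 30 + 28 = 238, matching the slide's initial total
  return env.fromElements(
      new Tuple2<>(1L, 50L), new Tuple2<>(2L, 40L),
      new Tuple2<>(3L, 90L), new Tuple2<>(4L, 30L),
      new Tuple2<>(5L, 28L));
}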


Bulk Iteration Solution - Highlights

- Counters (Accumulators) cannot be used to determine when to stop: Accumulators are guaranteed to be accurate only at the end of the program.
- Aggregators are checked at the end of each iteration to verify the terminating condition.

Source Code - https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/AdderBulkIterations.java


Bulk Iteration - Implementation

[Diagram: 1. input -> 2. the step function ("add 1") runs as parallel map operators, iterated at most 100,000 times -> 3. the terminating condition is checked (synchronize) and the result is fed back to the next iteration -> 4. output. Terminates after 19,953 iterations.]


Bulk Iteration - Source Code

public static void main(String[] args) throws Exception {
  final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

  // First create an initial dataset
  IterativeDataSet<Tuple2<Long, Long>> initial = getData(env).iterate(MAX_ITERATIONS);

  // Register Aggregator and Convergence Criterion class
  initial.registerAggregationConvergenceCriterion("total",
      new LongSumAggregator(), new VerifyIfMaxConvergence());

  // Iterate
  DataSet<Tuple2<Long, Long>> iteration = initial.map(
      new RichMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
        private LongSumAggregator agg = null;

        @Override
        public void open(Configuration parameters) {
          this.agg = this.getIterationRuntimeContext().getIterationAggregator("total");
        }

        @Override
        public Tuple2<Long, Long> map(Tuple2<Long, Long> input) throws Exception {
          long incrementF1 = input.f1 + 1;
          Tuple2<Long, Long> out = new Tuple2<>(input.f0, incrementF1);
          this.agg.aggregate(out.f1);
          return out;
        }
      });

  DataSet<Tuple2<Long, Long>> finalDs = initial.closeWith(iteration); // Close iteration
  finalDs.print(); // Consume output
}

public static class VerifyIfMaxConvergence implements ConvergenceCriterion<LongValue> {
  @Override
  public boolean isConverged(int iteration, LongValue value) {
    return (value.getValue() > AdderBulkIterations.ABSOLUTE_MAX);
  }
}


Bulk Iteration - Steps

1. Create the initial dataset (IterativeDataSet) and define the maximum number of iterations:

   IterativeDataSet<Tuple2<Long, Long>> initial = getData(env).iterate(MAX_ITERATIONS);

2. Register the convergence criterion:

   initial.registerAggregationConvergenceCriterion("total",
       new LongSumAggregator(), new VerifyIfMaxConvergence());

3. Execute the iterations, updating the aggregator, and check for convergence at the end of each iteration:

   DataSet<Tuple2<Long, Long>> iteration = initial.map(new RichMapFunction<...>() {
     // per element: return new Tuple2<>(input.f0, input.f1 + 1);
   });

   class VerifyIfMaxConvergence implements ConvergenceCriterion<LongValue> {
     public boolean isConverged(int iteration, LongValue value) {
       return (value.getValue() > AdderBulkIterations.ABSOLUTE_MAX);
     }
   }

4. End the iteration by calling closeWith(DataSet) on the IterativeDataSet, and consume the results:

   DataSet<Tuple2<Long, Long>> finalDs = initial.closeWith(iteration);
   finalDs.print(); // Consume results


Bulk Iteration - The Wrong Way

DataSet<Tuple2<Long, Long>> input = getData(env);
DataSet<Tuple2<Long, Long>> output = input;
for (int i = 0; i < MAX_ITERATIONS; i++) {
  output = input.map(new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
    public Tuple2<Long, Long> map(Tuple2<Long, Long> input) {
      return new Tuple2<>(input.f0, input.f1 + 1);
    }
  });
  // This is what slows down the iteration. A job starts immediately here
  long sum = output.map(new FixTuple2()).reduce(new ReduceFunc())
      .collect().get(0);
  input = output; // Prepare for next iteration
  System.out.println("Current Sum=" + sum);
  if (sum > 100) {
    System.out.println("Breaking now:" + i);
    break;
  }
}
output.print();

Flink cannot optimize across iterations because a job executes immediately on each call to:

long sum = output.map(new FixTuple2()).reduce(new ReduceFunc()).collect().get(0);

Source Code - https://github.com/sameeraxiomine/FlinkMeetup/blob/master/src/main/java/org/apache/flink/examples/AdderBulkIterationsWrongWay.java
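FixTuple2 and ReduceFunc are not shown on the slide; a plausible sketch, assuming they extract the second field and sum it so that collect().get(0) yields the running total:

public static class FixTuple2 implements MapFunction<Tuple2<Long, Long>, Long> {
  public Long map(Tuple2<Long, Long> value) {
    return value.f1; // keep only the number being incremented
  }
}

public static class ReduceFunc implements ReduceFunction<Long> {
  public Long reduce(Long a, Long b) {
    return a + b; // sum all values into a single element
  }
}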


Delta Iteration - Example

[Diagram: a graph of events numbered 1-15 showing parent/child relationships.]

Given the following events and their relationships, propagate the root id of each event to its children.

Source Code - http