Apache Flink - Hadoop MapReduce Compatibility

Fabian Hueske @fhueske

TRANSCRIPT

Page 1: Apache Flink - Hadoop MapReduce Compatibility

Apache Flink Hadoop Compatibility

Fabian Hueske @fhueske

Page 2: Apache Flink - Hadoop MapReduce Compatibility

Hadoop MapReduce Jobs

(Diagram: Input → Map → Reduce → Output, implemented by an InputFormat, a Mapper, a Reducer, and an OutputFormat.)

• Jobs have a static structure.

• Input, Output, Map, Reduce run your custom (or library) code.

• If your application logic is too complex for a single Map and Reduce pass, you need more than one job.
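For reference, a minimal sketch of what such a job looks like with the classic mapred API. Tokenizer and Counter stand for the user's Mapper and Reducer (the same names reappear in the WordCount example later); inputPath and outputPath are placeholder strings.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// The job structure is fixed: InputFormat -> Mapper -> Reducer -> OutputFormat.
JobConf conf = new JobConf();
conf.setJobName("wordcount");
conf.setInputFormat(TextInputFormat.class);      // Input
conf.setMapperClass(Tokenizer.class);            // your Map code
conf.setReducerClass(Counter.class);             // your Reduce code
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(LongWritable.class);
conf.setOutputFormat(TextOutputFormat.class);    // Output
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);                          // submit and wait for completion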

Page 3: Apache Flink - Hadoop MapReduce Compatibility

Flink Programs

(Diagram: a DAG data flow with several Sources feeding Map, Reduce, Filter, Join, and CoGroup operators into a Sink.)

• Flink programs are DAG data flows.

• Data Sources, Data Sinks, Map and Reduce operators are included.

• Everything that MapReduce provides and much more (a superset).

• Much better performance.

– Especially if more than one MapReduce job is executed.
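A minimal sketch (not from the slides) of what such a DAG looks like in Flink's Java DataSet API; pagesPath, visitsPath, minTimestamp, and outputPath are placeholder variables.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Two sources, a filter, and a join: operators are wired into an arbitrary DAG,
// not a fixed Map -> Reduce pipeline.
DataSet<Tuple2<Integer, String>> pages =
    env.readCsvFile(pagesPath).types(Integer.class, String.class);
DataSet<Tuple2<Integer, Long>> visits =
    env.readCsvFile(visitsPath).types(Integer.class, Long.class);

// drop old visits
DataSet<Tuple2<Integer, Long>> recentVisits =
    visits.filter(v -> v.f1 > minTimestamp);

// join pages with recent visits on the page id and write the pairs to a sink
pages.join(recentVisits)
     .where(0).equalTo(0)
     .writeAsText(outputPath);

env.execute("Flink DAG example");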

Page 4: Apache Flink - Hadoop MapReduce Compatibility

Run your Hadoop code with Flink?

• Hadoop data types (Writable) are natively supported.

• Hadoop Filesystems are natively supported.

• Flink features Input- & OutputFormats, Map, and Reduce functions, just like Hadoop MapReduce.

• Concepts are the same, but interfaces are not :-(

But Flink provides wrappers for Hadoop code :-)

• mapred.* API: In/OutputFormat, Mappers, & Reducers

• mapreduce.* API: In/OutputFormat
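A minimal sketch of reading with a wrapped mapreduce.* InputFormat; the wrapper's package has moved between Flink versions, and inputPath is a placeholder string.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
// note: the package of this wrapper differs between Flink versions
import org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// wrap a mapreduce.* TextInputFormat so Flink can read with it
Job job = Job.getInstance();
HadoopInputFormat<LongWritable, Text> hadoopIF =
    new HadoopInputFormat<LongWritable, Text>(
        new TextInputFormat(), LongWritable.class, Text.class, job);
TextInputFormat.addInputPath(job, new Path(inputPath));

// each input record arrives as a Tuple2 of (key, value)
DataSet<Tuple2<LongWritable, Text>> lines = env.createInput(hadoopIF);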

Page 5: Apache Flink - Hadoop MapReduce Compatibility

Alright, sounds good…

… but will my WordCount still work?!?

Page 6: Apache Flink - Hadoop MapReduce Compatibility

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// set up Hadoop InputFormat
HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
    new HadoopInputFormat<LongWritable, Text>(
        new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

// read data with the Hadoop InputFormat
DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

DataSet<Tuple2<Text, LongWritable>> words =
    // apply Hadoop Mapper
    text.flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
    // apply Hadoop Reducer
    .groupBy(0)
    .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(new Counter()));

// set up Hadoop OutputFormat
HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
    new HadoopOutputFormat<Text, LongWritable>(
        new TextOutputFormat<Text, LongWritable>(), new JobConf());
hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

// write data with the Hadoop OutputFormat
words.output(hadoopOutputFormat);

// execute the program
env.execute("Hadoop Compat WordCount");

Hadoop data types, Hadoop Input- & OutputFormats, and your Hadoop functions, all in one Flink program.

Yes, it will…
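For reference, Tokenizer and Counter above are ordinary mapred-API functions. A minimal sketch of what they typically look like (each class would live in its own file):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: splits each line into words and emits (word, 1).
public class Tokenizer extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    for (String word : line.toString().toLowerCase().split("\\W+")) {
      if (!word.isEmpty()) {
        out.collect(new Text(word), new LongWritable(1L));
      }
    }
  }
}

// Reducer: sums the counts emitted for each word.
public class Counter extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text word, Iterator<LongWritable> counts,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    out.collect(word, new LongWritable(sum));
  }
}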

Page 7: Apache Flink - Hadoop MapReduce Compatibility

Use MapReduce like you always wanted

• Freely assemble your functions into a program (see the sketch below).

• Very efficient, pipelined execution.

– Program is executed on Flink (no Hadoop involved).

– No writing to/reading from HDFS within a program.

• Caveat: No support for custom Hadoop partitioners & sorters, yet :-(

(Diagram: multiple Inputs, Map and Reduce operators, and Outputs assembled freely into a single pipelined program.)
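A minimal sketch of that freedom, reusing the words DataSet from the WordCount example; SomeOtherMapper is a hypothetical mapred-API Mapper<Text, LongWritable, Text, LongWritable>.

// A Hadoop Mapper applied *after* a Reducer, followed by another Reduce:
// a single MapReduce job cannot express this, but one pipelined Flink program can.
DataSet<Tuple2<Text, LongWritable>> postProcessed = words
    .flatMap(new HadoopMapFunction<Text, LongWritable, Text, LongWritable>(
        new SomeOtherMapper()))   // hypothetical mapred Mapper
    .groupBy(0)
    .reduceGroup(new HadoopReduceFunction<Text, LongWritable, Text, LongWritable>(
        new Counter()));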

Page 8: Apache Flink - Hadoop MapReduce Compatibility

WHAT TO EXPECT NEXT?

Page 9: Apache Flink - Hadoop MapReduce Compatibility


Do not change a single line of code!

• Inject MapReduce jobs as a whole into Flink programs

– with support for custom partitioners, sorters, groupers.

• Run Hadoop MapReduce jobs on Flink

– without changing a single line of code.

(Diagram: the same DAG of Sources, Filter, Join, and CoGroup into a Sink, now with a complete Hadoop Job embedded as one of its operators.)

Page 10: Apache Flink - Hadoop MapReduce Compatibility

Looking for some fun?

Try Hadoop on Flink!