Writing Hadoop Jobs in Scala using Scalding
DESCRIPTION
Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.

TRANSCRIPT
Writing Hadoop Jobs in Scala using Scalding @tonicebrian
How much storage can $100 buy you?
1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies, 170,000 songs, 5 million photos
From single drives… to clusters…
Data Science
“A mathematician is a device for turning coffee into theorems”
Alfréd Rényi

A data scientist is a device for turning data into insights.
Hadoop = Map/Reduce (program model) + Distributed File System (storage)
Word Count

Raw:
Hello cruel world
Say hello! Hello!

Map:
hello 1
cruel 1
world 1
say 1
hello 2

Reduce (Result):
hello 3
cruel 1
world 1
say 1
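The same pipeline can be sketched with plain Scala collections (a toy in-memory illustration, not Hadoop code):

val raw = List("Hello cruel world", "Say hello! Hello!")

// Map: lowercase, strip punctuation, emit (word, 1) per token
val mapped = raw
  .flatMap(_.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+"))
  .map(word => (word, 1))

// Reduce: group by word and sum the counts
val counted = mapped
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// => Map(hello -> 3, cruel -> 1, world -> 1, say -> 1)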
4 Main Characteristics of Scala
• JVM
• Statically Typed
• Object Oriented
• Functional Programming
def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
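For example, in plain Scala:

List(1, 2, 3).map(x => x * 2)  // List(2, 4, 6)
List(1, 2, 3).reduce(_ + _)    // 6

These two operations are exactly the vocabulary that Map/Reduce lifts from a single list to a cluster-sized dataset.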
Recap
• Map/Reduce: programming paradigm that employs concepts from Functional Programming
• Scala: functional language that runs on the JVM
• Hadoop: open source implementation of MR in the JVM
So in what language is Hadoop implemented?
The Result?
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
High level approaches
• SQL
• Data Transformations
Pig:

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
User defined functions (UDF)

Pig:
-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java:
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}
WordCount in Cascading

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}
Good parts
• Data Flow Programming Model
• User Defined Functions

Bad parts
• Still Java
• Objects for Flows
WordCount in Scalding:

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
TDD Cycle
Red → Green → Refactor
Broader view
Unit Testing → Acceptance Testing → Continuous Deployment → … Lean Startup
Big Data
Big Speed
A typical day working with Hadoop
Is Scalding of any help here?
0. Size of code
1. Types
2. Unit Testing
3. Local execution
1. Types
An extra cycle
Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup
Static type-checking makes you a better programmer™
Fail-fast with type errors

With (Int, Int, Int, Int), every combination compiles, whether it makes sense or not:

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

With TypedPipe[(Meters, Miles, Celsius, Fahrenheit)], mixing incompatible units is caught at compile time:

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z  // => type error
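A minimal sketch of how such unit types can be built so the compiler rejects mixed additions (these case classes are illustrative, not part of Scalding):

case class Meters(value: Int) {
  // only Meters + Meters compiles
  def +(other: Meters): Meters = Meters(value + other.value)
}
case class Miles(value: Int) {
  def +(other: Miles): Miles = Miles(value + other.value)
}

Meters(5) + Meters(3)    // Meters(8)
// Meters(5) + Miles(3)  // does not compile: type mismatch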
2. Unit Testing
How do you test a distributed algorithm without a distributed platform?
Source / Tap

A Scalding job reads and writes through Sources and Taps, so a test can swap the real inputs and outputs for in-memory data.
// Scalding
import org.specs._
import com.twitter.scalding._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}
3. Local Execution
HDFS
Local
> run-main com.twitter.scalding.Tool MyJob --local
> run-main com.twitter.scalding.Tool MyJob --hdfs
SBT as a REPL
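For instance, sbt's console gives you a Scala REPL with the project's classes on the classpath, so you can exercise job helpers interactively (an illustrative session):

> sbt console
scala> def tokenize(text: String) = text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
scala> tokenize("Hello cruel world").toList
res0: List[String] = List(hello, cruel, world)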
More Scalding goodness
• Algebird
• Matrix library
(brief sketches of both below)
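Algebird is Twitter's abstract algebra library: it supplies the monoids and semigroups that Scalding uses for aggregation. A small sketch of the idea in plain Scala (the values are made up):

import com.twitter.algebird._
import com.twitter.algebird.Operators._

// The Map monoid merges keys and sums matching values:
Map("hello" -> 3, "say" -> 1) + Map("hello" -> 2)
// => Map(hello -> 5, say -> 1)

// Max aggregates through its semigroup:
Semigroup.plus(Max(3), Max(7))  // Max(7)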
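The Matrix library lets you phrase linear algebra over pipes. A hedged sketch of what such a job could look like (the job name, fields, and paths are invented for illustration):

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix._

class CooccurrenceJob(args : Args) extends Job(args) {
  // (row, col, value) triples become a sparse matrix
  val graph = Tsv( args("input"), ('row, 'col, 'val) )
    .read
    .toMatrix[Long, Long, Double]('row, 'col, 'val)

  // multiplying by the transpose yields co-occurrence counts
  (graph * graph.transpose).write( Tsv( args("output") ) )
}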
Be functional
Questions?