Writing Hadoop Jobs in Scala using Scalding

Writing Hadoop Jobs in Scala using Scalding @tonicebrian


DESCRIPTION

Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing

TRANSCRIPT

Page 1: Writing Hadoop Jobs in Scala using Scalding

Writing Hadoop Jobs in Scala using Scalding @tonicebrian

Page 2: Writing Hadoop Jobs in Scala using Scalding

How much storage can $100 buy you?

Page 3: Writing Hadoop Jobs in Scala using Scalding

1980

1 photo

How much storage can $100 buy you?

Page 4: Writing Hadoop Jobs in Scala using Scalding

1980

1 photo

1990

5 songs

How much storage can $100 buy you?

Page 5: Writing Hadoop Jobs in Scala using Scalding

2000

7 movies

1980

1 photo

1990

5 songs

How much storage can $100 buy you?

Page 6: Writing Hadoop Jobs in Scala using Scalding

2000

7 movies

1980

1 photo

1990

5 songs

600 movies

170,000 songs

5 million photos

2010

How much storage can $100 buy you?

Page 7: Writing Hadoop Jobs in Scala using Scalding

From single drives…

Page 8: Writing Hadoop Jobs in Scala using Scalding

From single drives… to clusters…

Page 9: Writing Hadoop Jobs in Scala using Scalding

Data Science

Page 10: Writing Hadoop Jobs in Scala using Scalding

“A mathematician is a device for turning coffee into theorems”

Alfréd Rényi

Page 11: Writing Hadoop Jobs in Scala using Scalding

“A mathematician is a device for turning coffee into theorems”

Alfréd Rényi

data scientist

Page 12: Writing Hadoop Jobs in Scala using Scalding

“A mathematician is a device for turning coffee into theorems”

Alfréd Rényi

data scientist

and data

Page 13: Writing Hadoop Jobs in Scala using Scalding

“A mathematician is a device for turning coffee into theorems”

Alfréd Rényi

data scientist

and data

insights

Page 14: Writing Hadoop Jobs in Scala using Scalding
Page 15: Writing Hadoop Jobs in Scala using Scalding

Hadoop = Map Reduce + Distributed File System

Page 16: Writing Hadoop Jobs in Scala using Scalding

Hadoop = Map Reduce + Distributed File System (Storage)

Page 17: Writing Hadoop Jobs in Scala using Scalding

Hadoop = Map Reduce (Program Model) + Distributed File System (Storage)

Page 18: Writing Hadoop Jobs in Scala using Scalding

Word Count

Hello cruel world

Say hello! Hello!

Raw

Page 19: Writing Hadoop Jobs in Scala using Scalding

Word Count

Hello cruel world

Say hello! Hello!

hello 1

cruel 1

world 1

say 1

hello 2

Raw Map

Page 20: Writing Hadoop Jobs in Scala using Scalding

Word Count

Hello cruel world

Say hello! Hello!

hello 1

cruel 1

world 1

say 1

hello 2

Raw Map Reduce

Page 21: Writing Hadoop Jobs in Scala using Scalding

Word Count

Hello cruel world

Say hello! Hello!

hello 3

cruel 1

world 1

say 1

Raw Map Reduce Result
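
The same computation can be expressed with ordinary Scala collections, which is a useful mental model before moving to Hadoop. A minimal sketch (the variable names are illustrative, not from the slides):

// Word count over plain Scala collections: flatMap plays the map
// phase, groupBy plus sum plays the reduce phase.
val lines = List("Hello cruel world", "Say hello! Hello!")

val pairs = lines.flatMap { line =>
  line.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+").map(word => (word, 1))
}

val counts = pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counts == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)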

Page 22: Writing Hadoop Jobs in Scala using Scalding
Page 23: Writing Hadoop Jobs in Scala using Scalding
Page 24: Writing Hadoop Jobs in Scala using Scalding
Page 25: Writing Hadoop Jobs in Scala using Scalding
Page 26: Writing Hadoop Jobs in Scala using Scalding
Page 27: Writing Hadoop Jobs in Scala using Scalding
Page 28: Writing Hadoop Jobs in Scala using Scalding

4 Main Characteristics of Scala

Page 29: Writing Hadoop Jobs in Scala using Scalding

JVM

4 Main Characteristics of Scala

Page 30: Writing Hadoop Jobs in Scala using Scalding

JVM Statically Typed

4 Main Characteristics of Scala

Page 31: Writing Hadoop Jobs in Scala using Scalding

JVM Statically Typed

Object Oriented

4 Main Characteristics of Scala

Page 32: Writing Hadoop Jobs in Scala using Scalding

JVM Statically Typed

Object Oriented

Functional Programming

4 Main Characteristics of Scala

Page 33: Writing Hadoop Jobs in Scala using Scalding

def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
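
These are the standard signatures on Scala's List; a quick example of both in action:

// map transforms every element; reduce folds them pairwise
// with an associative operator.
val xs = List(1, 2, 3, 4)
val doubled = xs.map(x => x * 2) // List(2, 4, 6, 8)
val total   = xs.reduce(_ + _)   // 10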

Page 34: Writing Hadoop Jobs in Scala using Scalding

Recap

Page 35: Writing Hadoop Jobs in Scala using Scalding

Recap

• Map/Reduce: programming paradigm that employs concepts from Functional Programming

Page 36: Writing Hadoop Jobs in Scala using Scalding

Recap

• Map/Reduce: programming paradigm that employs concepts from Functional Programming

• Scala: functional language that runs on the JVM

Page 37: Writing Hadoop Jobs in Scala using Scalding

Recap

• Map/Reduce: programming paradigm that employs concepts from Functional Programming

• Scala: functional language that runs on the JVM

• Hadoop: open-source implementation of MR on the JVM

Page 38: Writing Hadoop Jobs in Scala using Scalding

So in what language is Hadoop implemented?

Page 39: Writing Hadoop Jobs in Scala using Scalding
Page 40: Writing Hadoop Jobs in Scala using Scalding

The Result?

Page 41: Writing Hadoop Jobs in Scala using Scalding

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

The Result?

Page 42: Writing Hadoop Jobs in Scala using Scalding

High level approaches

SQL

Data Transformations

Page 43: Writing Hadoop Jobs in Scala using Scalding

High level approaches

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

Page 44: Writing Hadoop Jobs in Scala using Scalding

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

User-defined functions (UDF): the Pig script (top) and its Java implementation (bottom)

Page 45: Writing Hadoop Jobs in Scala using Scalding
Page 46: Writing Hadoop Jobs in Scala using Scalding
Page 47: Writing Hadoop Jobs in Scala using Scalding

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main(String[] args) {
    String docPath = args[0];
    String wcPath = args[1];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

    // create source and sink taps
    Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields("token");
    Fields text = new Fields("text");
    RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
    // only returns "token"
    Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

    // determine the word counts
    Pipe wcPipe = new Pipe("wc", docPipe);
    wcPipe = new GroupBy(wcPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wc")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect(flowDef);
    wcFlow.writeDOT("dot/wc.dot");
    wcFlow.complete();
  }
}

WordCount in Cascading

Page 48: Writing Hadoop Jobs in Scala using Scalding

Good parts

• Data Flow Programming Model
• User Defined Functions

Page 49: Writing Hadoop Jobs in Scala using Scalding

Good parts

• Data Flow Programming Model
• User Defined Functions

Bad parts

• Still Java
• Objects for Flows

Page 50: Writing Hadoop Jobs in Scala using Scalding
Page 51: Writing Hadoop Jobs in Scala using Scalding

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
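
In the field-based API used here, TextLine reads the input and emits each line in a 'line field; flatMap('line -> 'word) produces one tuple per token in the new 'word field, and groupBy('word) { _.size } counts the tuples in each group before Tsv writes the (word, count) pairs.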

Page 52: Writing Hadoop Jobs in Scala using Scalding
Page 53: Writing Hadoop Jobs in Scala using Scalding

TDD Cycle

Red → Green → Refactor

Page 54: Writing Hadoop Jobs in Scala using Scalding

Broader view

Red → Green → Refactor

Unit Testing

Acceptance Testing

Continuous Deployment

… Lean Startup

Page 55: Writing Hadoop Jobs in Scala using Scalding

Big Data ≠ Big Speed

Page 56: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 57: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 58: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 59: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 60: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 61: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 62: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 63: Writing Hadoop Jobs in Scala using Scalding

A typical day working with Hadoop

Page 64: Writing Hadoop Jobs in Scala using Scalding

Is Scalding of any help here?

Page 65: Writing Hadoop Jobs in Scala using Scalding

Is Scalding of any help here?

0 Size of code

Page 66: Writing Hadoop Jobs in Scala using Scalding

Is Scalding of any help here?

1 Types

0 Size of code

Page 67: Writing Hadoop Jobs in Scala using Scalding

Is Scalding of any help here?

1 Types

2 Unit Testing

0 Size of code

Page 68: Writing Hadoop Jobs in Scala using Scalding

Is Scalding of any help here?

1 Types

2 Unit Testing

3 Local execution

0 Size of code

Page 69: Writing Hadoop Jobs in Scala using Scalding

1 Types

Page 70: Writing Hadoop Jobs in Scala using Scalding

Unit Testing

Acceptance Testing

Continuous Deployment

Lean Startup

An extra cycle

Page 71: Writing Hadoop Jobs in Scala using Scalding

Compilation Phase

Unit Testing

Acceptance Testing

Continuous Deployment

Lean Startup

An extra cycle

Page 72: Writing Hadoop Jobs in Scala using Scalding
Page 73: Writing Hadoop Jobs in Scala using Scalding

Static type-checking makes you a better programmer™

Page 74: Writing Hadoop Jobs in Scala using Scalding

(Int,Int,Int,Int)

Fail-fast with type errors

Page 75: Writing Hadoop Jobs in Scala using Scalding

(Int,Int,Int,Int)

TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

Fail-fast with type errors

Page 76: Writing Hadoop Jobs in Scala using Scalding

(Int,Int,Int,Int)

TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

Fail-fast with type errors

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

Page 77: Writing Hadoop Jobs in Scala using Scalding

(Int,Int,Int,Int)

TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

Fail-fast with type errors

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z => type error

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20
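
The slides assume unit types such as Meters and Miles already exist but never show them; a minimal sketch of how such hypothetical wrappers might be declared, enough to make the mixed addition above fail at compile time:

// Hypothetical single-field wrappers; none of them defines `+`,
// so mixing units cannot even compile.
case class Meters(value: Int)
case class Miles(value: Int)
case class Celsius(value: Int)
case class Fahrenheit(value: Int)

val w = Meters(5)
val x = Miles(5)
// w + x  // does not compile: value + is not a member of Meters

// Unit-safe arithmetic has to stay within one type:
val total = Meters(w.value + Meters(7).value) // Meters(12)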

Page 78: Writing Hadoop Jobs in Scala using Scalding

2 Unit Testing

Page 79: Writing Hadoop Jobs in Scala using Scalding

How do you test a distributed algorithm without a distributed platform?

Page 80: Writing Hadoop Jobs in Scala using Scalding

Source

Tap

Page 81: Writing Hadoop Jobs in Scala using Scalding

Source

Tap

Page 82: Writing Hadoop Jobs in Scala using Scalding

Source

Tap

Page 83: Writing Hadoop Jobs in Scala using Scalding

// Scalding
import com.twitter.scalding._
import org.specs._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}
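
JobTest runs the whole job in local memory: source replaces the real input tap with the given lines, and sink captures what the job would have written, so the assertions run without any Hadoop cluster.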

Page 84: Writing Hadoop Jobs in Scala using Scalding

3 Local Execution

Page 85: Writing Hadoop Jobs in Scala using Scalding
Page 86: Writing Hadoop Jobs in Scala using Scalding

HDFS

Local

Page 87: Writing Hadoop Jobs in Scala using Scalding

HDFS

Local

Page 88: Writing Hadoop Jobs in Scala using Scalding

> run-main com.twitter.scalding.Tool MyJob --local

> run-main com.twitter.scalding.Tool MyJob --hdfs

SBT as a REPL
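
Tool here is Scalding's command-line entry point (a Hadoop Tool implementation): with --local the job runs through Cascading's local planner in a single JVM against ordinary files, while --hdfs plans it as real MapReduce against the cluster, so the same code serves both the fast feedback loop and production.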

Page 89: Writing Hadoop Jobs in Scala using Scalding

More Scalding goodness

Page 90: Writing Hadoop Jobs in Scala using Scalding

More Scalding goodness

Algebird

Page 91: Writing Hadoop Jobs in Scala using Scalding

More Scalding goodness

Algebird

Matrix library
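
As a taste of Algebird, a hedged sketch: its Monoid type class knows how to combine values such as Maps by summing per key, which is exactly the shape of many MapReduce aggregations (assuming Algebird's standard implicits are in scope via the package import):

import com.twitter.algebird._

// Two partial word counts, e.g. from two mappers.
val a = Map("hello" -> 3, "cruel" -> 1)
val b = Map("hello" -> 2, "world" -> 1)

// The Map monoid merges keys and sums the values.
val merged = Monoid.plus(a, b)
// Map("hello" -> 5, "cruel" -> 1, "world" -> 1)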

Page 92: Writing Hadoop Jobs in Scala using Scalding
Page 93: Writing Hadoop Jobs in Scala using Scalding

Be functional

Questions?