Writing Hadoop Jobs in Scala using Scalding
DESCRIPTION
Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.

TRANSCRIPT
Writing Hadoop Jobs in Scala using Scalding @tonicebrian
How much storage can $100 buy you?
1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies, 170,000 songs, 5 million photos
From single drives… to clusters…
Data Science
“A mathematician is a device for turning coffee into theorems”
Alfréd Rényi

A data scientist is a device for turning data into insights.
Hadoop = Map/Reduce (program model) + Distributed File System (storage)
Word Count

Raw:
Hello cruel world
Say hello! Hello!

Map:
hello 1
cruel 1
world 1
say 1
hello 2

Reduce (Result):
hello 3
cruel 1
world 1
say 1
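The same pipeline can be sketched with plain Scala collections (a toy in-memory illustration, not Hadoop code):

val raw = List("Hello cruel world", "Say hello! Hello!")

// Map: lowercase, strip punctuation, emit (word, 1) per token
val mapped = raw
  .flatMap(_.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+"))
  .map(word => (word, 1))

// Reduce: group by word and sum the counts
val counted = mapped
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// => Map(hello -> 3, cruel -> 1, world -> 1, say -> 1)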
4 Main Characteristics of Scala
• JVM
• Statically Typed
• Object Oriented
• Functional Programming
def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
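For example, in plain Scala:

List(1, 2, 3).map(x => x * 2)  // List(2, 4, 6)
List(1, 2, 3).reduce(_ + _)    // 6

These two operations are exactly the vocabulary that Map/Reduce lifts from a single list to a cluster-sized dataset.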
Recap
• Map/Reduce: programming paradigm that employs concepts from Functional Programming
• Scala: functional language that runs on the JVM
• Hadoop: open source implementation of MR in the JVM
So in what language is Hadoop implemented?
The Result?
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
High level approaches
• SQL
• Data Transformations
Pig:

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
User defined functions (UDF)

Pig:
-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

Java:
package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}
WordCount in Cascading

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}
Good parts
• Data Flow Programming Model
• User Defined Functions

Bad parts
• Still Java
• Objects for Flows
WordCount in Scalding:

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
TDD Cycle
Red → Green → Refactor
Broader view
Unit Testing → Acceptance Testing → Continuous Deployment → … Lean Startup
Big Data
Big Speed
A typical day working with Hadoop
Is Scalding of any help here?
0. Size of code
1. Types
2. Unit Testing
3. Local execution
1. Types
An extra cycle
Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup
Static type-checking makes you a better programmer™
Fail-fast with type errors

With (Int, Int, Int, Int), every combination compiles, whether it makes sense or not:

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

With TypedPipe[(Meters, Miles, Celsius, Fahrenheit)], mixing incompatible units is caught at compile time:

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z  // => type error
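A minimal sketch of how such unit types can be built so the compiler rejects mixed additions (these case classes are illustrative, not part of Scalding):

case class Meters(value: Int) {
  // only Meters + Meters compiles
  def +(other: Meters): Meters = Meters(value + other.value)
}
case class Miles(value: Int) {
  def +(other: Miles): Miles = Miles(value + other.value)
}

Meters(5) + Meters(3)    // Meters(8)
// Meters(5) + Miles(3)  // does not compile: type mismatch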
2. Unit Testing
How do you test a distributed algorithm without a distributed platform?
Source / Tap

A Scalding job reads and writes through Sources and Taps, so a test can swap the real inputs and outputs for in-memory data.
// Scalding
import org.specs._
import com.twitter.scalding._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}
3. Local Execution
HDFS
Local
> run-main com.twitter.scalding.Tool MyJob --local
> run-main com.twitter.scalding.Tool MyJob --hdfs
SBT as a REPL
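For instance, sbt's console gives you a Scala REPL with the project's classes on the classpath, so you can exercise job helpers interactively (an illustrative session):

> sbt console
scala> def tokenize(text: String) = text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
scala> tokenize("Hello cruel world").toList
res0: List[String] = List(hello, cruel, world)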
More Scalding goodness
• Algebird
• Matrix library
(brief sketches of both below)
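Algebird is Twitter's abstract algebra library: it supplies the monoids and semigroups that Scalding uses for aggregation. A small sketch of the idea in plain Scala (the values are made up):

import com.twitter.algebird._
import com.twitter.algebird.Operators._

// The Map monoid merges keys and sums matching values:
Map("hello" -> 3, "say" -> 1) + Map("hello" -> 2)
// => Map(hello -> 5, say -> 1)

// Max aggregates through its semigroup:
Semigroup.plus(Max(3), Max(7))  // Max(7)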
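The Matrix library lets you phrase linear algebra over pipes. A hedged sketch of what such a job could look like (the job name, fields, and paths are invented for illustration):

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix._

class CooccurrenceJob(args : Args) extends Job(args) {
  // (row, col, value) triples become a sparse matrix
  val graph = Tsv( args("input"), ('row, 'col, 'val) )
    .read
    .toMatrix[Long, Long, Double]('row, 'col, 'val)

  // multiplying by the transpose yields co-occurrence counts
  (graph * graph.transpose).write( Tsv( args("output") ) )
}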
Be functional
Questions?