TRANSCRIPT
IBM Research
®
© 2007 IBM Corporation
INTRODUCTION TO HADOOP & MAP-REDUCE
IBM Research | India Research Lab
Outline
Map-Reduce Features: Combiner / Partitioner / Counter
Passing Configuration Parameters
Distributed-Cache
Hadoop I/O
Passing Custom Objects as Key-Values
Input and Output Formats: Introduction
Input/Output Formats provided by Hadoop
Writing Custom Input/Output Formats
Miscellaneous: Chaining Map-Reduce Jobs
Compression
Hadoop Tuning and Optimization
Combiner
A local reduce
Processes the output of each map function
Has the same signature as a reduce
Often reduces the number of intermediate key-value pairs
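The effect can be sketched in plain Java (no Hadoop involved; the class and method names are illustrative): a map task's raw (word, 1) pairs are summed locally before the shuffle, so fewer pairs cross the network.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // A combiner is a local reduce: it collapses the (word, 1) pairs
    // emitted by one map task into (word, partialCount) pairs
    // before anything is shuffled across the network.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (String word : mapOutputKeys) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        // One map task's output: 7 raw pairs shrink to 3 partial sums.
        System.out.println(
            combine(List.of("Hadoop", "Map", "Map", "Reduce", "Hadoop", "Map", "Map")));
        // {Hadoop=2, Map=4, Reduce=1}
    }
}
```

This mirrors what the first map task in the word-count example below sends to the reducers when a combiner is configured.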
Word-Count (without a Combiner)

Map 1 input: "Hadoop Map Map Reduce Hadoop Map Map"
Map 1 output: (Hadoop,1) (Map,1) (Map,1) (Reduce,1) (Hadoop,1) (Map,1) (Map,1)
Map 2 input: "Map Map Reduce Map Hadoop"
Map 2 output: (Map,1) (Map,1) (Reduce,1) (Map,1) (Hadoop,1)
Map 3 input: "Key Key Value Value"
Map 3 output: (Key,1) (Key,1) (Value,1) (Value,1)

Sort/Shuffle (keys partitioned into ranges A-I, J-Q, R-Z):
(Hadoop, [1,1,1])  (Map, [1,1,1,1,1,1,1])  (Key, [1,1])  (Reduce, [1,1])  (Value, [1,1])

Reduce outputs:
(Hadoop, 3)  (Map, 7)  (Key, 2)  (Reduce, 2)  (Value, 2)
Word-Count (with a Combiner)

Map 1 output: (Hadoop,1) (Map,1) (Map,1) (Reduce,1) (Hadoop,1) (Map,1) (Map,1)
Combiner 1 output: (Hadoop, 2) (Map, 4) (Reduce, 1)
Map 2 output: (Map,1) (Map,1) (Reduce,1) (Map,1) (Hadoop,1)
Combiner 2 output: (Map, 3) (Reduce, 1) (Hadoop, 1)
Map 3 output: (Key,1) (Key,1) (Value,1) (Value,1)
Combiner 3 output: (Key, 2) (Value, 2)

Sort/Shuffle (keys partitioned into ranges A-I, J-Q, R-Z):
(Hadoop, [2,1])  (Map, [4,3])  (Key, [2])  (Reduce, [1,1])  (Value, [2])

Reduce outputs:
(Hadoop, 3)  (Map, 7)  (Key, 2)  (Reduce, 2)  (Value, 2)
COMBINER

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Input key/value types: Text / IntWritable; output key/value types: Text / IntWritable
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}
Word-Count Runner Class

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReduce.class);
        job.setJarByClass(WordCountRunner.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.addOutputPath(job, outputPath);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}
Counters
Counters: Built-in Counters
Report metrics for various aspects of a job
Task Counters
• Gather information about tasks over the course of a job
• Results are aggregated across all tasks
• e.g. MAP_INPUT_RECORDS, REDUCE_INPUT_GROUPS
FileSystem Counters
• BYTES_READ, BYTES_WRITTEN
• Bytes read/written by each file-system (HDFS, KFS, Local, S3, etc.)
FileInputFormat Counters
• BYTES_READ (bytes read through FileInputFormat)
FileOutputFormat Counters
• BYTES_WRITTEN (bytes written through FileOutputFormat)
Job Counters
• Maintained by the Job-Tracker
• TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES
User-Defined Counters

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    enum WCCounters {NOUNS, PRONOUNS, ADJECTIVES};

    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Tokenize(line);
        for (int i = 0; i < tokens.length; i++) {
            if (isNoun(tokens[i]))
                context.getCounter(WCCounters.NOUNS).increment(1);
            else if (isPronoun(tokens[i]))
                context.getCounter(WCCounters.PRONOUNS).increment(1);
            else if (isAdjective(tokens[i]))
                context.getCounter(WCCounters.ADJECTIVES).increment(1);
            context.write(new Text(tokens[i]), new IntWritable(1));
        }
    }
}
Retrieving the Values of a Counter

Counters counters = job.getCounters();
Counter counter = counters.findCounter(WCCounters.NOUNS);
long value = counter.getValue();
Output

13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.NOUNS=2342
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.PRONOUNS=2124
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.ADJECTIVES=1897
Partitioner
Maps keys to reducers/partitions
Determines which reducer receives a certain key
Identical keys produced by different map functions must map to the same partition/reducer
If n reducers are used, then n partitions must be filled
The number of reducers is set by the call "setNumReduceTasks"
Hadoop uses HashPartitioner as the default partitioner
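The default HashPartitioner derives the partition from the key's hash code. A plain-Java sketch of that logic (illustrative, outside the Hadoop API; the class name is hypothetical):

```java
public class HashPartitioningSketch {
    // Mirrors HashPartitioner's logic: mask off the sign bit,
    // then take the hash modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Identical keys always land on the same reducer.
        System.out.println(getPartition("Hadoop", 3) == getPartition("Hadoop", 3));
    }
}
```

Because the partition is a pure function of the key, two map tasks emitting the same word are guaranteed to route it to the same reducer.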
Defining a Custom Partitioner
Implement a class which extends the Partitioner class
Partitioning impacts the load-balancing aspect of a map-reduce program
Word-Count: many words start with vowels
Words starting with different characters are sent to different reducers
For words starting with vowels, the second character may be taken into account
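That heuristic can be sketched in plain Java (the class and method are illustrative; in a real job this logic would be the body of a Partitioner's getPartition):

```java
public class VowelAwarePartitioning {
    // Words starting with a vowel form a large bucket, so spread them
    // by their second character; all other words go by their first.
    static int getPartition(String word, int numPartitions) {
        if (word.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        if ("aeiou".indexOf(first) >= 0 && word.length() > 1) {
            return Character.toLowerCase(word.charAt(1)) % numPartitions;
        }
        return first % numPartitions;
    }

    public static void main(String[] args) {
        // "apple" and "april" share the second character 'p',
        // so both land on the same partition.
        System.out.println(getPartition("apple", 4));
        System.out.println(getPartition("april", 4));
    }
}
```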
Word-Count Runner Class

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReduce.class);
        job.setJarByClass(WordCountRunner.class);
        job.setPartitionerClass(WordCountPartitioner.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.addOutputPath(job, outputPath);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}
Passing Configuration Parameters
Map-Reduce jobs may require certain input parameters
One may want to avoid counting words starting with certain prefixes
Prefixes can be set in the configuration
Word-Count Runner Class

public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        Configuration conf = job.getConfiguration();
        conf.set("PrefixesToAvoid", "abs bts bnm swe");
        ……
        ……
        job.waitForCompletion(true);
    }
}
Word-Count Map

public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private String[] prefixesToAvoid;

    public void setup(Context context) throws InterruptedException {
        Configuration conf = context.getConfiguration();
        String prefixes = conf.get("PrefixesToAvoid");
        this.prefixesToAvoid = prefixes.split(" ");
    }

    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Tokenize(line);
        for (int i = 0; i < tokens.length; i++) {
            context.write(new Text(tokens[i]), new IntWritable(1));
        }
    }
}
Distributed Cache

A file may need to be broadcast to each map-node
For example, a dictionary in a spell-check
Such file-names can be added to a distributed-cache; Hadoop copies files added to the cache to all map-nodes

Step 1: Put the file into HDFS
hdfs dfs -put /tmp/file1 /cachefile1

Step 2: Add the cache file in the job configuration
Configuration conf = job.getConfiguration();
DistributedCache.addCacheFile(new URI("/cachefile1"), conf);

Step 3: Access the cache file locally at each map
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream finputStream = new FileInputStream(cacheFiles[0].toString());
Hadoop I/O: Reading an HDFS File

// Get a FileSystem instance
FileSystem fs = FileSystem.get(conf);
// Open a stream on the file
Path infile = new Path(filePath);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(infile)));
// Read the file line by line
StringBuilder fileContent = new StringBuilder();
String line = br.readLine();
while (line != null) {
    fileContent.append(line).append("\n");
    line = br.readLine();
}
Hadoop I/O: Writing to an HDFS File

// Get a FileSystem instance
FileSystem fs = FileSystem.get(conf);
// Create an output stream on the file
Path path = new Path(filePath);
FSDataOutputStream outputStream = fs.create(path);
// Write to the file
byte[] bytes = content.getBytes();
outputStream.write(bytes, 0, bytes.length);
outputStream.close();
Hadoop I/O: Getting the File being Processed

A map-reduce job may need to process multiple files
The functionality of a map may depend upon which file is being processed

FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
Custom Objects as Key-Values
Passing keys and values from map functions to reducers: IntWritable, DoubleWritable, LongWritable, Text, ArrayWritable
Passing keys and values of custom classes may be desirable
Objects that are passed around must implement certain interfaces: Writable for passing as values,
WritableComparable for passing as keys
Example Use-Case
Consider weather data
Temperature and pressure values at different latitude-longitude-elevation-timestamp quadruples
Data is hence 4-dimensional
Temperature and pressure data are in separate files
File format: latitude, longitude, elevation, timestamp, temperature-value
• Ex: 10 20 10 1 99F and 10 21 10 2 98F
• Similarly for pressure. Ex: 10 20 10 1 101kPa
We want to read the two data files and combine the data
• Ex: 10 20 10 1 99F 101kPa
Let class STPoint represent the coordinates:
class STPoint {
    double lattitude, longitude, elevation;
    long timestamp;
}
Map to Reduce Flow

Temperature file:        Pressure file:
10 20 1 10 99F           10 20 1 10 101kPa
10 21 1 10 98F           10 21 1 10 109kPa

Each MAP emits (Text, DoubleWritable) pairs, e.g.:
("10 20 1 10", 99F) and ("10 20 1 10", 101kPa)

REDUCE joins the values for each key:
(10 20 1 10, 99F, 101kPa)
Map to Reduce Flow (with a custom key class)

Temperature file:        Pressure file:
10 20 1 10 99F           10 20 1 10 101kPa
10 21 1 10 98F           10 21 1 10 109kPa

Each MAP emits (STPoint, DoubleWritable) pairs, e.g.:
(STPoint(10 20 1 10), 99F) and (STPoint(10 20 1 10), 101kPa)

REDUCE joins the values for each key:
(STPoint(10 20 1 10), 99F, 101kPa)
Map

public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    // Map output key type: Text; map output value type: DoubleWritable
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split(" ");
        double lattitude = new Double(tokens[0]).doubleValue();
        double longitude = new Double(tokens[1]).doubleValue();
        double elevation = new Double(tokens[2]).doubleValue();
        long timestamp = new Long(tokens[3]).longValue();
        double attrVal = new Double(tokens[4]).doubleValue();
        String keyString = lattitude + " " + longitude + " " + elevation + " " + timestamp;
        context.write(new Text(keyString), new DoubleWritable(attrVal));
    }
}
New Map

public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
    // Map output key type: STPoint; map output value type: DoubleWritable
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split(" ");
        double lattitude = new Double(tokens[0]).doubleValue();
        double longitude = new Double(tokens[1]).doubleValue();
        double elevation = new Double(tokens[2]).doubleValue();
        long timestamp = new Long(tokens[3]).longValue();
        double attrVal = new Double(tokens[4]).doubleValue();
        STPoint stpoint = new STPoint(lattitude, longitude, elevation, timestamp);
        context.write(stpoint, new DoubleWritable(attrVal));
    }
}

More intuitive, human readable, reduces processing at the reduce side
New Reduce

public class DataReadReduce extends Reducer<STPoint, DoubleWritable, Text, DoubleWritable> {
    // Input key/value types: STPoint / DoubleWritable; output key/value types: Text / DoubleWritable
    public void reduce(STPoint key, Iterable<DoubleWritable> values, Context context) {
        // combine the temperature and pressure values for this point
    }
}
Passing Custom Objects as Key-Values
Key-value pairs are written to the local disk by map functions: the user must tell Hadoop how to write a custom object
Key-value pairs are read by reducers from the local disk: the user must tell Hadoop how to read a custom object
Keys are sorted and compared: the user must specify how to compare two keys
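The write/read contract can be sketched with plain java.io streams (illustrative; a Hadoop Writable does the same thing through DataOutput/DataInput): whatever write puts on the stream, readFields must consume in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripSketch {
    static class Point {
        double lattitude, longitude, elevation;
        long timestamp;

        // Same field order in write and readFields -- this is the
        // contract a Hadoop Writable must honour.
        void write(DataOutputStream out) throws IOException {
            out.writeDouble(lattitude);
            out.writeDouble(longitude);
            out.writeDouble(elevation);
            out.writeLong(timestamp);
        }

        void readFields(DataInputStream in) throws IOException {
            lattitude = in.readDouble();
            longitude = in.readDouble();
            elevation = in.readDouble();
            timestamp = in.readLong();
        }
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point();
        p.lattitude = 10; p.longitude = 20; p.elevation = 1; p.timestamp = 10;

        // Serialize, then deserialize into a fresh object.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        Point q = new Point();
        q.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(q.lattitude + " " + q.timestamp);
    }
}
```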
WritableComparable Interface
Three methods:
public void readFields(DataInput in) {}
public void write(DataOutput out) {}
public int compareTo(Object other) {}
Objects that are passed as keys must implement the WritableComparable interface
Objects that are passed as values must implement the Writable interface; the Writable interface does not have the compareTo method
Only keys are compared, not values, hence the compareTo method is not required for objects being passed only as values
Implementing WritableComparable for STPoint

public void readFields(DataInput in) throws IOException {
    this.lattitude = in.readDouble();
    this.longitude = in.readDouble();
    this.elevation = in.readDouble();
    this.timestamp = in.readLong();
}

public void write(DataOutput out) throws IOException {
    out.writeDouble(this.lattitude);
    out.writeDouble(this.longitude);
    out.writeDouble(this.elevation);
    out.writeLong(this.timestamp);
}

public int compareTo(STPoint other) {
    return this.toString().compareTo(other.toString());
}
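Comparing via toString only works if the string form orders points the way the job expects; as strings, "10" sorts before "9". A field-wise compareTo, sketched here on a plain-Java stand-in for STPoint, avoids that pitfall:

```java
public class FieldwiseCompareSketch {
    static class Point implements Comparable<Point> {
        double lattitude, longitude, elevation;
        long timestamp;

        Point(double la, double lo, double e, long t) {
            lattitude = la; longitude = lo; elevation = e; timestamp = t;
        }

        // Compare field by field instead of via toString, so numeric
        // order is preserved (9 before 10, unlike "10" before "9").
        public int compareTo(Point other) {
            int c = Double.compare(lattitude, other.lattitude);
            if (c != 0) return c;
            c = Double.compare(longitude, other.longitude);
            if (c != 0) return c;
            c = Double.compare(elevation, other.elevation);
            if (c != 0) return c;
            return Long.compare(timestamp, other.timestamp);
        }
    }

    public static void main(String[] args) {
        Point a = new Point(9, 0, 0, 0);
        Point b = new Point(10, 0, 0, 0);
        System.out.println(a.compareTo(b) < 0); // numeric order: 9 before 10
    }
}
```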
InputFormat and OutputFormat
InputFormat: defines how to read data from a file and feed it to the map functions
OutputFormat: defines how to write data to a file
Hadoop provides various input and output formats
A user can also implement custom input and output formats
Defining custom input and output formats is a very useful feature of map-reduce
Input-Format
Defines how to read data from a file and feed it to the map functions
How to define splits? getSplits()
How to define records? getRecordReader()
Hadoop provides various input and output formats
A user can also implement custom input and output formats
Defining custom input and output formats is a very useful feature of map-reduce
Split
[Figure: a file of 20 records (R1-R20, columns A, B, C) stored as four 64 MB blocks; by default each block becomes one split, feeding MAP-1 through MAP-4.]
Split
[Figure: the same four 64 MB blocks grouped into two splits of two blocks each, feeding MAP-1 and MAP-2.]
Split
[Figure: the same four blocks grouped into three splits, feeding MAP-1, MAP-2 and MAP-3; getSplits() lets the input format choose split boundaries independently of block boundaries.]
Record-Reader

R1 1 2 3
R2 2 3 5
R3 2 4 6
R4 6 4 2
R5 1 3 6

All records are fed to the map task one by one
Record-Reader

R1 1 2 3  R2 2 3 5
R3 2 4 6  R4 6 4 2
R5 1 3 6

With two tuples bunched into each record, there are three records now
Record-Reader

R1 1 2 3  R5 1 3 6
R2 2 3 5  R3 2 4 6
R4 6 4 2

All the tuples with identical values in column 1 are bunched in the same record
TextInputFormat

Default input format
Key is the byte offset, value is the line content
Suitable for reading raw text files

Input:
10 20 1 10 99F
10 21 1 10 98F

TEXTINPUTFORMAT feeds the map (offset, line-as-string) pairs:
(0, "10 20 1 10 99F")
(15, "10 21 1 10 98F")
KeyValueInputFormat

Input data is in the form key \tab value
Anything before the tab is the key; anything after the tab is the value
A line without a tab yields the whole line as the key, with an empty value

Input:
10 20 1 10 \t 99F
10 21 1 10 \t 98F

KEYVALUEINPUTFORMAT feeds the map (content-before-tab, content-after-tab) pairs:
("10 20 1 10", "99F")
("10 21 1 10", "98F")
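The parsing rule can be sketched in plain Java (illustrative; in Hadoop a KeyValueLineRecordReader does the real work):

```java
public class KeyValueSplitSketch {
    // Split a line at the first tab: before it is the key,
    // after it is the value; no tab means the whole line is the key.
    static String[] splitKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("10 20 1 10\t99F");
        System.out.println(kv[0] + " -> " + kv[1]); // 10 20 1 10 -> 99F
    }
}
```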
SequenceFileInputFormat

A Hadoop-specific, high-performance binary input format
Key and value types are user-defined

Input: a binary sequence file

SEQUENCEFILEINPUTFORMAT feeds the map user-defined (key, value) pairs:
("10 20 1 10", "99F")
("10 21 1 10", "98F")
OutputFormats

TextOutputFormat
Default output format
Writes data in key \tab value format
This output can be read subsequently by KeyValueInputFormat

SequenceFileOutputFormat
Writes binary files suitable for reading into subsequent MR jobs
Keys and values are user-defined
Text Input and Output Format

Reduce output:
("10 20 1 10", "99F")
("10 21 1 10", "98F")

TEXTOUTPUTFORMAT writes:
10 20 1 10 \tab 99F
10 21 1 10 \tab 98F

Read back with TEXTINPUTFORMAT:
(0, "10 20 1 10 \tab 99F")
(15, "10 21 1 10 \tab 98F")

Or with KEYVALUEINPUTFORMAT:
("10 20 1 10", "99F")
("10 21 1 10", "98F")
Custom Input Formats
Allows a user control over how to read data and subsequently feed it to the map functions
Advisable to implement custom input formats for specific use-cases
Simplifies the process of implementing map-reduce algorithms
CustomInputFormat

Input:
10 20 1 10 99F
10 21 1 10 98F

MYINPUTFORMAT feeds the map pairs whose key is of type STPoint:
(STPoint(10 20 1 10), 99F)
(STPoint(10 21 1 10), 98F)
Map

public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
    // Map output key type: STPoint; map output value type: DoubleWritable
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = line.toString().split(" ");
        double lattitude = new Double(tokens[0]).doubleValue();
        double longitude = new Double(tokens[1]).doubleValue();
        double elevation = new Double(tokens[2]).doubleValue();
        long timestamp = new Long(tokens[3]).longValue();
        double attrVal = new Double(tokens[4]).doubleValue();
        STPoint stpoint = new STPoint(lattitude, longitude, elevation, timestamp);
        context.write(stpoint, new DoubleWritable(attrVal));
    }
}
New Map With Custom Input Format

class MyMap extends Mapper<STPoint, DoubleWritable, STPoint, DoubleWritable> {
    // Map input key/value types: STPoint / DoubleWritable
    // Map output key/value types: STPoint / DoubleWritable
    public void map(STPoint point, DoubleWritable attrValue, Context context)
            throws IOException, InterruptedException {
        context.write(point, attrValue);
    }
}

More intuitive, human readable: all the parsing has moved into the input format
Specifying Input and Output Format

public class MyRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setMapperClass(MyMap.class);
        job.setCombinerClass(MyCombiner.class);
        job.setReducerClass(MyReduce.class);
        job.setJarByClass(MyRunner.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setInputFormatClass(MyInputFormat.class);
        job.setOutputFormatClass(MyOutputFormat.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.addOutputPath(job, outputPath);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}
Implementing a Custom Input Format
Specify how to split the data
Data splitting is handled by the class FileInputFormat; a custom input format can extend this class
RecordReader: reads the data in each split, parses it and passes it to the map
An iterator over the input data
Custom Input Format

public class MyInputFormat extends FileInputFormat<STPoint, DoubleWritable> {
    public RecordReader<STPoint, DoubleWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new MyRecordReader();
    }
}
Custom Record-Reader

public class MyRecordReader extends RecordReader<STPoint, DoubleWritable> {
    private STPoint point;
    private DoubleWritable attrVal;
    private LineRecordReader lineRecordReader;

    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineRecordReader = new LineRecordReader();
        lineRecordReader.initialize(split, context);
    }

    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!lineRecordReader.nextKeyValue()) {
            this.point = null;
            this.attrVal = null;
            return false;
        }
        String lineString = lineRecordReader.getCurrentValue().toString();
        this.point = getSTPoint(lineString);
        this.attrVal = getAttributeValue(lineString);
        return true;
    }

    public STPoint getCurrentKey() { return this.point; }
    public DoubleWritable getCurrentValue() { return this.attrVal; }
}
Chaining Map-Reduce Jobs
Simple tasks may be completed by a single map and reduce
Complex tasks will require multiple map and reduce cycles
Multiple map and reduce cycles need to be chained together:
Chaining multiple jobs in a sequence
Chaining multiple jobs in a complex dependency
Chaining multiple maps in a sequence
Chaining Map-Reduce Jobs In Sequence

Most commonly done
The output of a reducer is an input to the map functions of the next cycle
A new job starts only after the prior job has finished

INPUT -> MAP (JOB 1) -> REDUCE (JOB 1) -> MAP (JOB 2) -> REDUCE (JOB 2) -> MAP (JOB 3) -> REDUCE (JOB 3) -> OUTPUT
Chaining Multiple Jobs In Sequence

Job job1 = new Job();
job1.setInputPath(inputPath1);
job1.setOutputPath(outputPath1);
// set all other parameters
job1.setMapperClass();
……..
job1.waitForCompletion(true);

Job job2 = new Job();
job2.setInputPath(outputPath1);
job2.setOutputPath(outputPath2);
// set all other parameters
job2.setMapperClass();
job2.waitForCompletion(true);
Chaining Multiple Jobs in Complex Dependency
Chaining in sequence assumes that the jobs depend on each other in a chain fashion
That may not be so
Example: Job1 may process a data-set of a certain type, Job2 may process a data-set of another type, and Job3 combines the results of Job1 and Job2
Job1 and Job2 are independent, while Job3 depends on Job1 and Job2
Running the jobs in the sequence Job1, Job2 and then Job3 may not be ideal
Chaining Multiple Jobs in Complex Dependency

Use addDependingJob to specify the dependencies:
job3.addDependingJob(job1);
job3.addDependingJob(job2);

Define a JobControl object:
JobControl jc = new JobControl();
jc.addJob(job1);
jc.addJob(job2);
jc.addJob(job3);
jc.run();
Chaining Multiple Maps In A Sequence
Multiple map tasks can also be chained in a sequence, followed by a reducer
Avoids development of large map methods
Avoids multiple MR jobs with additional I/O overheads
More ease of development, code re-use
Use the ChainMapper API
Chaining Multiple Maps In A Sequence

Job chainJob = new Job(conf);
ChainMapper.addMapper(chainJob, Map-1 Info);
ChainMapper.addMapper(chainJob, Map-2 Info);
ChainMapper.addMapper(chainJob, Map-3 Info);
chainJob.setReducer(Reducer-Info);
chainJob.waitForCompletion(true);
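Conceptually, chained mappers behave like function composition: each map's output records feed the next map before anything is shuffled. A plain-Java sketch of that idea (illustrative only; the real ChainMapper wires Hadoop Mapper classes together):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class ChainSketch {
    // Apply a chain of per-record transformations in order,
    // mimicking ChainMapper's map-1 -> map-2 -> map-3 pipeline.
    static List<String> runChain(List<String> records, List<Function<String, String>> maps) {
        Function<String, String> chain = Function.identity();
        for (Function<String, String> m : maps) {
            chain = chain.andThen(m);
        }
        List<String> out = new ArrayList<>();
        for (String record : records) {
            out.add(chain.apply(record));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Function<String, String>> maps = List.of(
                String::trim,                 // map-1: clean the record
                String::toLowerCase,          // map-2: normalise case
                s -> s.replace(' ', '_'));    // map-3: reshape the key
        System.out.println(runChain(List.of("  Hadoop Map  "), maps)); // [hadoop_map]
    }
}
```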
Compression
MR jobs produce large output
The output of an MR job can be compressed
Saves a lot of space
Need to ensure that the compression algorithm used produces splittable files
bzip2 is one such compression algorithm
If a compression algorithm does not produce splittable files, the output will not be split and a single map will process the whole data in a subsequent job
gzip output is not splittable
Compressing the Output

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
Hadoop Tuning and Optimization
A number of parameters may impact the performance of a job:
Whether to compress output or not
Number of reduce tasks
Block size (64 MB, 128 MB, 256 MB, etc.)
Speculative execution or not
Buffer size for sorting
Temporary space allocation
Many more such parameters
Tuning these parameters is not an exact science
Some recommendations have been developed on how to set these parameters
Compression
mapred.compress.map.output
Default: false
Pros: faster disk writes, lower disk space usage, less time spent on data transfer
Cons: overhead in compression and decompression
Recommendation: for large jobs and large clusters, compress
Speculative Execution
mapred.map/reduce.tasks.speculative.execution
Default: true
Pros: reduces the job-time if task progress is slow due to memory unavailability or hardware degradation
Cons: increases the job-time if task progress is slow due to complex and large calculations
Recommendation: set it to false in case of high average task completion duration (> 1 hr) due to complex and large calculations
Block Size
dfs.block.size
Default: 64 MB
Recommendations for small clusters and large data-sets:
• Many map tasks will be needed
• Data size 160 GB and block size 64 MB => # splits = 2560
• Data size 160 GB and block size 128 MB => # splits = 1280
• Data size 160 GB and block size 256 MB => # splits = 640
In small clusters (6-10 nodes), the map task creation overhead is significant, so the block size should be large, but small enough to utilize all resources
Block size should be set according to the size of the cluster, the map task capacity and the average size of the input files
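The split counts above follow directly from data size divided by block size; a small sketch (plain Java, rounding up to cover a trailing partial block):

```java
public class SplitCount {
    static final long MB = 1024L * 1024;
    static final long GB = 1024 * MB;

    // Number of splits when one split covers one block.
    static long numSplits(long dataSize, long blockSize) {
        return (dataSize + blockSize - 1) / blockSize;  // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(numSplits(160 * GB, 64 * MB));   // 2560
        System.out.println(numSplits(160 * GB, 128 * MB));  // 1280
        System.out.println(numSplits(160 * GB, 256 * MB));  // 640
    }
}
```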
References
Hadoop: The Definitive Guide. O'Reilly Press
Pro Hadoop: Build scalable, distributed applications in the Cloud
Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/
www.slideshare.net