Processing massive amount of data with Map Reduce using Apache Hadoop
Indicthreads Cloud Computing Conference 2011
Processing Data with Map Reduce
Allahbaksh Mohammedali Asadullah
Infosys Labs, Infosys Technologies
Content
• Map Function
• Reduce Function
• Why Hadoop
• HDFS
• Map Reduce – Hadoop
• Some Questions
What is the Map Function?
Map is a classic primitive of functional programming. Map means: apply a function to a list of elements and return the modified list.

function List Map(Function func, List elements){
  List newElements;
  foreach element in elements{
    newElements.put(apply(func, element))
  }
  return newElements
}
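The pseudocode above maps directly onto Java collections. A minimal, self-contained sketch (pure Java, not part of any Hadoop API; the class and method names are illustrative):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MapDemo {
    // Apply func to every element and return the modified list,
    // mirroring the pseudocode Map(func, elements) above.
    static <A, B> List<B> map(Function<A, B> func, List<A> elements) {
        return elements.stream().map(func).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example: give every salary a 15% raise.
        List<Double> raised = map(s -> s * 1.15, List.of(1000.0, 2000.0));
        System.out.println(raised);
    }
}
```

With the inputs above, the salaries 1000.0 and 2000.0 come back as 1150.0 and 2300.0 (up to floating-point rounding).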
Example Map Function

function double increaseSalary(double salary){
  return salary * (1 + 0.15);
}

function List<Employee> Map(Function increaseSalary, List<Employee> employees){
  List<Employee> newList;
  foreach employee in employees{
    Employee tempEmployee = new Employee(employee);
    tempEmployee.income = increaseSalary(tempEmployee.income);
    newList.add(tempEmployee);
  }
  return newList;
}
Fold or Reduce Function
Fold/Reduce reduces a list of values to one. Fold means: apply a function to a list of elements and return a single resulting element.

function Element Reduce(Function func, List elements){
  Element earlierResult;
  foreach element in elements{
    earlierResult = func(element, earlierResult)
  }
  return earlierResult;
}
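The fold above can likewise be sketched in plain Java (illustrative names again, no framework involved):

```java
import java.util.List;
import java.util.function.BinaryOperator;

public class ReduceDemo {
    // Fold a list down to a single value by repeatedly applying func,
    // mirroring the pseudocode Reduce(func, elements) above. An explicit
    // identity value stands in for the uninitialized earlierResult.
    static <T> T reduce(BinaryOperator<T> func, T identity, List<T> elements) {
        T result = identity;
        for (T element : elements) {
            result = func.apply(result, element);
        }
        return result;
    }

    public static void main(String[] args) {
        // Example: total payroll, folding the salary list with addition.
        double total = reduce(Double::sum, 0.0, List.of(1150.0, 2300.0));
        System.out.println(total); // 3450.0
    }
}
```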
Example Reduce Function

function double add(double number1, double number2){
  return number1 + number2;
}

function double Reduce(Function add, List<Employee> employees){
  double totalAmount = 0.0;
  foreach employee in employees{
    totalAmount = add(totalAmount, employee.income);
  }
  return totalAmount;
}
I know Map and Reduce. How do I use them?
I will use some library or framework.
Why some framework?
• Lazy to write boilerplate code
• For modularity
• Code reusability
What is the best choice?
Why Hadoop?
Programming Language Support
Java natively; C++ via Hadoop Pipes; other languages via Hadoop Streaming
Who uses it?
Strong Community
Image Courtesy http://goo.gl/15Nu3
Commercial Support
Hadoop
Hadoop HDFS
Hadoop Distributed File System
• Large distributed file system on commodity hardware
• 4K nodes, thousands of files, petabytes of data
• Files are replicated so that hard-disk failure can be handled easily
• One NameNode and many DataNodes
HDFS Architecture
(Diagram) Clients send metadata operations to the NameNode, which holds the metadata (name, replicas, ...), and perform block reads and writes directly against the DataNodes. Blocks are replicated across DataNodes placed on different racks (Rack 1, Rack 2).
NameNode Meta-data in RAM
The entire metadata is kept in main memory. The metadata consists of:
• List of files
• List of blocks for each file
• List of DataNodes for each block
• File attributes, e.g. creation time
• Transaction log
NameNode uses heartbeats to detect DataNode failure
Data Node
• Stores the data in its local file system
• Stores the meta-data of each block
• Serves data and meta-data to clients
• Pipelines data, i.e. forwards data to other specified DataNodes
• Sends a heartbeat to the NameNode every three seconds
HDFS Commands
Accessing HDFS from the command line:

hadoop dfs -mkdir myDirectory
hadoop dfs -cat myFirstFile.txt

Web interface: http://host:port/dfshealth.jsp
Hadoop MapReduce
Map Reduce Diagrammatically
(Diagram) Input files are divided into input splits (Input Split 0 through Input Split 5). Each split is consumed by a Mapper; each Mapper's intermediate file is divided into R partitions by the partitioning function; each Reducer pulls its partition from every Mapper and writes an output file.
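The data flow in the diagram can be simulated in a few lines of plain Java: a word-count map phase, a hash partitioning function producing R partitions, and one reduce per partition. This is a toy model of the flow, not Hadoop code:

```java
import java.util.*;

public class MapReduceFlow {
    // Simulate the diagram: map each input split to (word, 1) pairs,
    // partition the intermediate pairs into R partitions by hash,
    // then run one reduce (a sum) per partition.
    static Map<String, Integer> wordCount(List<String> splits, int R) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < R; i++) partitions.add(new TreeMap<>());

        // Map phase + partitioning function
        for (String split : splits) {
            for (String word : split.split(" ")) {
                int p = (word.hashCode() & Integer.MAX_VALUE) % R;
                partitions.get(p).computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }

        // Reduce phase: one reducer per partition sums each word's 1s
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, List<Integer>> partition : partitions) {
            partition.forEach((word, ones) -> counts.put(word, ones.size()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"), 2));
        // {a=2, b=2, c=1}
    }
}
```

Because the partitioning function depends only on the key, every occurrence of a word ends up in the same partition, so each reducer sees all counts for its words.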
Input Format
InputFormat describes the input specification of a MapReduce job, i.e. how the data is to be read from the file system:
• Split up the input file into logical InputSplits, each of which is assigned to a Mapper
• Provide the RecordReader implementation used to collect input records from a logical InputSplit for processing by the Mapper

The RecordReader typically converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper and Reducer tasks. It thus takes on the responsibility of handling record boundaries and presenting the tasks with keys and values.
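A line-based RecordReader's byte-to-record conversion can be sketched in plain Java: the byte-oriented input becomes (byte offset, line) records, which is what Hadoop's text input presents to the Mapper. The class below is illustrative, not Hadoop's implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecords {
    // Turn a byte-oriented input into (byte offset -> line) records,
    // the way a line-based RecordReader presents text to the Mapper:
    // the key is where the record starts, the value is its content.
    static Map<Long, String> records(String input) {
        Map<Long, String> recs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : input.split("\n", -1)) {
            if (!line.isEmpty() || offset < input.length()) {
                recs.put(offset, line);
            }
            offset += line.length() + 1; // +1 for the consumed '\n'
        }
        return recs;
    }

    public static void main(String[] args) {
        System.out.println(records("hello\nworld"));
        // {0=hello, 6=world}
    }
}
```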
Creating your Mapper
• The mapper should implement the org.apache.hadoop.mapred.Mapper interface
• Earlier versions used to extend the org.apache.hadoop.mapreduce.Mapper class
• Extend org.apache.hadoop.mapred.MapReduceBase, which provides default implementations of the close and configure methods
• The main method is map(WritableComparable key, Writable value, OutputCollector<K2,V2> output, Reporter reporter)
• One instance of your Mapper is initialized per task. It exists in a separate process from all other Mapper instances, with no data sharing, so static variables will differ between map tasks.
• Writable: Hadoop defines an interface called Writable, its own form of serializability. Examples: IntWritable, LongWritable, Text, etc.
• WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop MapReduce framework should implement this interface.
• InverseMapper swaps the key and value
Combiner
Combiners are used to minimize the number of key/value pairs that are shuffled across the network between mappers and reducers.
A combiner is a sort of mini-reducer that may be applied, potentially several times, during the map phase before the new set of key/value pairs is sent to the reducer(s).
Combiners should be used only when the function you want to apply is both commutative and associative.
Examples: WordCount and mean-value computation. Reference: http://goo.gl/iU5kR
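Map-side combining for WordCount can be illustrated in plain Java: one mapper's (word, 1) pairs are collapsed into (word, partial count) pairs before anything crosses the network. Illustrative code, not the Hadoop Combiner API:

```java
import java.util.*;

public class CombinerDemo {
    // A combiner is a map-side mini-reduce: it collapses this mapper's
    // (word, 1) pairs into (word, partialCount) before the shuffle.
    // This is valid for WordCount because addition is commutative and
    // associative, so partial sums can be summed again at the reducer.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputKeys) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // 5 pairs would cross the network without a combiner; 3 with it.
        System.out.println(combine(List.of("a", "b", "a", "a", "c")));
        // {a=3, b=1, c=1}
    }
}
```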
PartitionerPartitioner controls the partitioning of the keys of the intermediate map-outputs.
The key (or a subset of the key) is used to derive the partition, typically by a hash function.
The total number of partitions is the same as the number of reduce tasks for the job.
Some Partitioners are BinaryPartitioner, HashPartitioner, KeyFieldBasedPartitioner, and TotalOrderPartitioner.
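The default hash partitioning can be shown in a few lines; this mirrors the masking-and-modulo scheme used by HashPartitioner (a sketch, not the actual Hadoop class):

```java
public class PartitionDemo {
    // Hash partitioning: mask off the sign bit so the result is
    // non-negative, then take the modulo by the number of reduce
    // tasks. Equal keys always land in the same partition, which is
    // what guarantees one reducer sees all values for a key.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p1 = getPartition("apple", 4);
        int p2 = getPartition("apple", 4);
        System.out.println(p1 == p2 && p1 >= 0 && p1 < 4); // true
    }
}
```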
Creating your Reducer
• The reducer should implement the org.apache.hadoop.mapred.Reducer interface
• Earlier versions used to extend the org.apache.hadoop.mapreduce.Reducer class
• Extend org.apache.hadoop.mapred.MapReduceBase, which provides default implementations of the close and configure methods
• The main method is reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
• All keys and values sent to one partition go to the same reduce task
• Iterator.next() always returns the same object, refilled with different data
• HashPartitioner partitions based on the key's hash function
• IdentityReducer is the default implementation of the Reducer
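The object-reuse behavior of Iterator.next() can be demonstrated in plain Java: a single holder object is refilled for each value, so a reducer must read (or copy) each value before advancing the iterator. Illustrative code, not the Hadoop iterator:

```java
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hadoop's reduce-side iterator returns the SAME Writable object
    // from every next() call, refilled with new data. This mimics that
    // behavior: copy values out if you need to keep them across calls.
    static class Holder { int value; }

    static Iterator<Holder> reusingIterator(List<Integer> data) {
        Holder shared = new Holder(); // one object, reused for every element
        Iterator<Integer> src = data.iterator();
        return new Iterator<>() {
            public boolean hasNext() { return src.hasNext(); }
            public Holder next() { shared.value = src.next(); return shared; }
        };
    }

    // A reduce that sums values; safe because it reads each value
    // before calling next() again.
    static int sum(Iterator<Holder> values) {
        int total = 0;
        while (values.hasNext()) total += values.next().value;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(reusingIterator(List.of(1, 2, 3)))); // 6
    }
}
```

Storing the Holder references themselves (instead of their int contents) would give the wrong answer, since every reference points at the same, last-filled object.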
Output Format
OutputFormat is similar to InputFormat. Different types of output formats are:
• TextOutputFormat
• SequenceFileOutputFormat
• NullOutputFormat
Mechanics of the Whole Process
• Configure the input and output
• Configure the Mapper and Reducer
• Specify other parameters, like the number of map tasks, number of reduce tasks, etc.
• Submit the job via the client
Example

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf); // or JobClient.submitJob(conf) to submit without waiting
Job Tracker & Task Tracker
(Diagram) The master node runs the JobTracker. Each slave node runs a TaskTracker, which launches and monitors the individual Tasks on that node.
Job Launch Process
• JobClient determines the proper division of the input into InputSplits
• It sends the job data to the master JobTracker server
• It saves the jar and the JobConf (serialized to XML) in a shared location and posts the job into a queue
Job Launch Process (contd.)
• TaskTrackers running on the slave nodes periodically query the JobTracker for work
• The job jar is copied from the master node to the data node
• The main class is launched in a separate JVM: TaskTracker.Child.main()
• Location of the jar: {local.dir}/taskTracker/$user/jobcache/$jobid/jars/
Small File Problem
What should I do if I have lots of small files?
One-word answer: SequenceFile.

SequenceFile layout: a sequence of key/value records; for packing small files, key = file name, value = file content.

Tar to SequenceFile: http://goo.gl/mKGC7
Consolidator: http://goo.gl/EVvi7
Problem of Large Files
What if I have a single big file of 20 GB?
One-word answer: there is no problem with large files. HDFS splits the file into blocks, and MapReduce processes the resulting splits in parallel.
SQL Data
What is the way to access SQL data?
One-word answer: DBInputFormat.
DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
Database Access with Hadoop: http://goo.gl/CNOBc
JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:port/dbName");
String[] fields = { "employee_id", "name" };
DBInputFormat.setInput(conf, MyRow.class, "employees", null /* conditions */, "employee_id", fields);
public class MyRow implements Writable, DBWritable {
  private int employeeNumber;
  private String employeeName;

  public void write(DataOutput out) throws IOException {
    out.writeInt(employeeNumber);
    out.writeUTF(employeeName); // writeUTF pairs with readUTF below
  }

  public void readFields(DataInput in) throws IOException {
    employeeNumber = in.readInt();
    employeeName = in.readUTF();
  }

  public void write(PreparedStatement statement) throws SQLException {
    statement.setInt(1, employeeNumber);
    statement.setString(2, employeeName);
  }

  public void readFields(ResultSet resultSet) throws SQLException {
    employeeNumber = resultSet.getInt(1);
    employeeName = resultSet.getString(2);
  }
}
Questions & Answers
Thank You