MapReduce basics
DESCRIPTION
Covers: distributed processing issues; the MR programming model; a sample MR job; how MR can be implemented; pros and cons of MR; tips for better performance
TRANSCRIPT
![Page 1: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/1.jpg)
MapReduce basics
Harisankar H, PhD student, DOS lab, Dept. CSE, IIT Madras
6-Feb-2013
http://harisankarh.wordpress.com
![Page 2: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/2.jpg)
Distributed processing?
• Processing distributed across multiple machines/servers
Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
![Page 3: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/3.jpg)
Why distributed processing?
– Reduce execution time of large jobs
• E.g., extracting urls from terabytes of data
• In the ideal case, 1000 machines could finish the job up to 1000 times faster
– Fault-tolerance
• Other nodes will take over the jobs if some of the nodes fail
– Typically, if you have 10,000 servers, on average one will fail per day
![Page 4: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/4.jpg)
Issues in distributed processing
• Traditionally realized using special-purpose implementations
– E.g., indexer, log processor
• Implementation is really hard at the socket-programming level
– Fault-tolerance
• Keep track of failures, reassign tasks
– Hand-coded parallelization
– Scheduling across heterogeneous nodes
– Locality
• Minimise movement of data for computation
– How to distribute data?
• Results in:
– Complex, brittle, non-generic code
– Reimplementation of common features like fault-tolerance and distribution
![Page 5: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/5.jpg)
Need for a generic abstraction for distributed processing
• Tradeoff between genericity and performance
– More generic => usually less performance
• MapReduce is probably a sweet spot that offers both to some extent
Separation of concerns: app programmer | abstraction | systems developer
– App programmer: express app logic
– Systems developer: performance, fault handling etc.
![Page 6: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/6.jpg)
MapReduce abstraction (app programmer’s view)
• Model input and output as <key,value> pairs
• Provide map() and reduce() functions which act on <k,v> pairs
• Input: set of <k,v> pairs: {<k,v>}
– For each input <k,v>:
map(k1, v1) → list(k2, v2)
– For each unique output key from map:
reduce(k2, list(v2)) → list(v3)
System will take care of distributing the tasks across thousands of machines, handling locality, fault-tolerance etc.
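The abstraction above can be modelled in a single process to make the data flow concrete. The following is only a sketch under stated assumptions: the class name MiniMapReduce and the method run are hypothetical (not part of Hadoop or any real framework), everything runs sequentially on one machine, and distribution, locality and fault-tolerance are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical single-process model of the MapReduce abstraction (not a real framework API).
class MiniMapReduce {
    // mapFn:    map(k1, v1)          -> list(k2, v2)
    // reduceFn: reduce(k2, list(v2)) -> list(v3)
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapFn,
            BiFunction<K2, List<V2>, List<V3>> reduceFn) {
        // Map phase plus grouping: collect every intermediate value under its key.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapFn.apply(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce phase: exactly one reduce call per unique intermediate key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            result.put(group.getKey(), reduceFn.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy job (illustrative only): key each number by parity, then sum per key.
        List<Map.Entry<String, Integer>> input =
                List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("c", 3));
        BiFunction<String, Integer, List<Map.Entry<String, Integer>>> mapFn =
                (name, n) -> List.of(Map.entry(n % 2 == 0 ? "even" : "odd", n));
        BiFunction<String, List<Integer>, List<Integer>> reduceFn =
                (parity, ns) -> List.of(ns.stream().mapToInt(Integer::intValue).sum());
        System.out.println(run(input, mapFn, reduceFn)); // {even=[2], odd=[4]}
    }
}
```

The application programmer supplies only mapFn and reduceFn; the grouping step in the middle is the part a real platform performs across machines.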
![Page 7: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/7.jpg)
Example: word count
• Problem:
– Count the number of occurrences of each unique word in a big collection of documents
• Input <k,v> set:
– <document name, document contents>
• Organize the files in this format
• Output:
– <word, count>
• Get it in output files
• Next step:
– Define the map() and reduce() functions
![Page 8: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/8.jpg)
Word count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
![Page 9: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/9.jpg)
Program in java
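// Assumed, as in the standard Hadoop WordCount example: word (a Text) and
// one (an IntWritable holding 1) are fields of the enclosing Mapper class.
public void map(LongWritable key, Text value, Context context) throws … {
    // Split the input line into tokens and emit <word, 1> for each token.
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    // Sum the counts emitted for this word and write <word, total>.
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}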
![Page 10: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/10.jpg)
Implementing MapReduce abstraction
• Looked at the application programmer’s view
• Need a platform which implements the MapReduce abstraction
• Hadoop is the popular open-source implementation of the MapReduce abstraction
• Questions for the platform developer
– How to
• parallelize?
• handle faults?
• provide locality?
• distribute the data?
App programmer | abstraction | systems developer
![Page 11: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/11.jpg)
Basics of platform implementation
• parallelize?
– Each map can be executed independently in parallel
– After all maps have finished execution, all reduces can be executed in parallel
• handle faults?
– map() and reduce() have no internal state
• Simply re-execute in case of a failure
• distribute the data?
– Have a distributed file system (HDFS)
• provide locality?
– Prefer to execute map() on the nodes holding the input <k,v> pairs
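Because map() and reduce() keep no internal state, a failed task can simply be run again on the same input. The sketch below illustrates that idea only; RetrySketch and runWithRetry are hypothetical names, failures are simulated with exceptions, and a real platform would reschedule the task on another node rather than loop locally.

```java
import java.util.function.Function;

// Hypothetical helper (not a Hadoop API): re-executes a stateless task until it succeeds.
class RetrySketch {
    static <I, O> O runWithRetry(Function<I, O> task, I input, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Safe to repeat: the result depends only on the input.
                return task.apply(input);
            } catch (RuntimeException e) {
                last = e; // treat as a worker failure and try again
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated task that crashes on its first two executions.
        Function<String, Integer> flaky = s -> {
            if (++calls[0] < 3) throw new RuntimeException("simulated crash");
            return s.length();
        };
        System.out.println(runWithRetry(flaky, "hello", 5)); // 5
    }
}
```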
![Page 12: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/12.jpg)
MapReduce implementation
• Distributed File System (DFS) + MapReduce (MR) engine
– Specifically, the MR engine uses a DFS
• Distributed file system
– Files are split into large chunks and stored in the distributed file system (e.g., HDFS)
– Large chunks: typically 64 MB per block
– Can have a master-slave architecture
• Master assigns and manages replicated blocks in the slaves
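To make the block arithmetic concrete, here is a small sketch of how a file might be cut into 64 MB blocks and how a master could spread replicas over nodes. The names BlockSplitSketch, blockLengths and placeReplicas are hypothetical, and the round-robin placement is an assumption chosen for simplicity; it is not HDFS's actual placement policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of block splitting and replica placement (not HDFS code).
class BlockSplitSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the typical size quoted above

    // Length in bytes of each block that a file of fileSize bytes is split into.
    static List<Long> blockLengths(long fileSize) {
        List<Long> lengths = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += BLOCK_SIZE) {
            lengths.add(Math.min(BLOCK_SIZE, fileSize - offset));
        }
        return lengths;
    }

    // Assumed round-robin policy: replica r of block b goes to node (b + r) % numNodes.
    static List<List<Integer>> placeReplicas(int numBlocks, int numNodes, int replication) {
        if (replication > numNodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((b + r) % numNodes);
            }
            placement.add(nodes);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        System.out.println(blockLengths(fileSize).size()); // 4
        System.out.println(placeReplicas(4, 3, 2)); // [[0, 1], [1, 2], [2, 0], [0, 1]]
    }
}
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB block; each block's replicas land on distinct nodes, so losing one node never loses the only copy.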
![Page 13: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/13.jpg)
MapReduce engine
• Has a master-slave architecture
– Master co-ordinates the task execution across workers
– Workers perform the map() and reduce() functions
• Reads and writes blocks to/from the DFS
– Master keeps track of worker failures and reassigns tasks if necessary
• Failure detection usually done through timeouts
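Timeout-based failure detection can be sketched as a table of last-heartbeat times kept by the master. This is an illustrative model only: HeartbeatMonitor and its methods are hypothetical names, and the current time is passed in explicitly (instead of reading a clock) so that the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical master-side failure detector (not Hadoop code).
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in.
    void heartbeat(String workerId, long nowMillis) {
        lastSeen.put(workerId, nowMillis);
    }

    // Workers whose last heartbeat is older than the timeout are presumed failed;
    // the master would then reassign their tasks.
    List<String> failedWorkers(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        Collections.sort(failed); // stable order for display
        return failed;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(1000);
        monitor.heartbeat("w1", 0);
        monitor.heartbeat("w2", 0);
        monitor.heartbeat("w1", 900); // w1 keeps reporting, w2 goes silent
        System.out.println(monitor.failedWorkers(1500)); // [w2]
    }
}
```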
![Page 14: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/14.jpg)
[Figure not recovered; only the label "network" survives]
![Page 15: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/15.jpg)
Some tips for designing MR jobs
• Reduce network traffic between map and reduce
– Model map() and reduce() jobs appropriately
– Use combine() functions
• combine(<k,[v]>) → <k,[v]>
• combine() executes after all map()s finish in each block
– map() → [same node] → combine() → [network] → reduce()
• Give map tasks roughly equal expected execution times
• Try to keep the load across reduce() tasks from being skewed
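The effect of combine() on network traffic can be illustrated with word count on a single input split. In this sketch (hypothetical class and method names, one split held in a String), the map output without a combiner is one <word, 1> record per occurrence, while the combined output is one <word, partial count> record per distinct word. Summation is associative and commutative, which is what makes pre-aggregating on the map node safe here.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a combiner for word count (not Hadoop code).
class CombinerSketch {
    // Without a combiner: one <word, 1> record per occurrence crosses the network.
    static int recordsWithoutCombiner(String split) {
        int records = 0;
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) records++;
        }
        return records;
    }

    // combine(): sum the counts per word on the map node, before anything is sent.
    static Map<String, Integer> combine(String split) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : split.trim().split("\\s+")) {
            if (!w.isEmpty()) partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        String split = "to be or not to be";
        System.out.println(recordsWithoutCombiner(split)); // 6
        System.out.println(combine(split)); // {be=2, not=1, or=1, to=2}
    }
}
```

For this split, six records shrink to four; on real text with many repeated words the reduction is typically much larger.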
![Page 16: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/16.jpg)
Pros and cons of MapReduce
• Advantages
– Simple, easy-to-use distributed processing system
– Reasonably generic
– Exploits locality for performance
– Simple and less buggy implementation
• Issues
– Not a magic bullet that fits all problems
• Difficult to model iterative and recursive computations
– E.g.: k-means clustering
– Generate-Map-Reduce
• Difficult to model streaming computations
• Centralized entities like the master become bottlenecks
• Most real-world problems require large chains of MR jobs
![Page 17: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/17.jpg)
Summary
• Today
– Distributed processing issues, MR programming model
– Sample MR job
– How MR can be implemented
– Pros and cons of MR, tips for better performance
• Tomorrow
– Details specific to Hadoop
– Downloading and setting up of Hadoop on a cluster
Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
![Page 18: MapReduce basics](https://reader033.vdocuments.mx/reader033/viewer/2022050919/54620e34af7959ba618b4b05/html5/thumbnails/18.jpg)
Hadoop components
• HDFS
– Master: NameNode
– Slave: DataNode
• MapReduce engine
– Master: JobTracker
– Slave: TaskTracker