mapreduce: simplified data processing on large clusterscis.csuohio.edu/~sschung/cis601/cis...

MapReduce: Simplified Data Processing on Large Clusters

By Stephen Cardina

The Problem

● You have a large amount of raw data, such as a database or a web log, and you need to compute some derived data from it, such as how many words start with a specific letter or how many items fall within a specific range.

● When the input is large, the work has to be spread across many machines running at once to finish in a reasonable amount of time.

● Distributing the work raises further issues, such as how to handle machine failures and how to distribute the data. Dealing with these leads to a lot of troubleshooting and a bigger time investment.

The Solution

● MapReduce is a programming model for processing and generating large data sets.

● MapReduce parallelizes the processing automatically, so it gets through the data much faster than a single machine would.

● MapReduce can be separated into two distinct parts, map and reduce.

● Map goes through the data and parses it according to a user-supplied function, producing intermediate key/value pairs that are grouped by key for later use.

● Reduce merges the values that map produced for each key, making the data easier for the user to work with.
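
As a rough illustration of the two parts, here is a minimal single-process sketch in Python. It is not how the real library is implemented: `run_mapreduce`, `first_letter_map`, and `count_reduce` are hypothetical names chosen for this example, the word list is made up, and everything runs in one process with no parallelism.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process illustration: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # map_fn yields (intermediate key, value) pairs for one record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # reduce_fn merges all values that share a key into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def first_letter_map(word):
    # Hypothetical map function: key is the first letter, value is 1.
    yield word[0].lower(), 1

def count_reduce(key, values):
    # Hypothetical reduce function: add up the 1s emitted for this key.
    return sum(values)

words = ["apple", "avocado", "banana", "Apricot", "cherry"]
counts = run_mapreduce(words, first_letter_map, count_reduce)
print(counts)  # {'a': 3, 'b': 1, 'c': 1}
```

The user only writes the two small functions; the driver in the middle stands in for everything the library would do on a cluster.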

A Quick MapReduce Example

● If you had an array with 5 words, such as [car, bike, train, bus, boat], and you wanted to separate them by how many letters they have, you can.

● You can use Map to separate them into 3 different columns:

● 3: [car, bus]
● 4: [boat, bike]
● 5: [train]

● 3, 4, and 5 are the intermediate keys, while the words are the values paired with them.
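
The map step of this example could be sketched in Python as follows. The function name `map_by_length` is an assumption made for illustration; the grouping loop stands in for work the MapReduce library would do between map and reduce.

```python
from collections import defaultdict

def map_by_length(word):
    # Emit one (intermediate key, value) pair: (letter count, word).
    return (len(word), word)

words = ["car", "bike", "train", "bus", "boat"]
pairs = [map_by_length(w) for w in words]

# Collect the values that share an intermediate key into one column.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)
grouped = dict(sorted(grouped.items()))

print(grouped)  # {3: ['car', 'bus'], 4: ['bike', 'boat'], 5: ['train']}
```

In this sketch the words inside each column keep their input order, so the 4-letter column reads [bike, boat]; the order within a column does not matter for the example.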

A Quick MapReduce Example (continued)

● Now you might not care what the words actually say and instead just want to know how many 3-letter words there are.

● You can now use reduce to make the columns even quicker to read:

● 3: 2
● 4: 2
● 5: 1

● Here reduce merged the 3-letter words, [car, bus], into a single value, 2, and did the same for the 4-letter and 5-letter words.
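
A matching sketch of the reduce step, again in Python. `count_reduce` is a hypothetical name, and the grouped input is typed in by hand to mirror the columns from the example.

```python
def count_reduce(key, values):
    # Merge all words that share a key into a single number: how many there are.
    return len(values)

# Columns produced by the map step in the example.
grouped = {3: ["car", "bus"], 4: ["boat", "bike"], 5: ["train"]}

counts = {key: count_reduce(key, values) for key, values in grouped.items()}
print(counts)  # {3: 2, 4: 2, 5: 1}
```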

Execution overview

Step 1:

The MapReduce library splits the input files into M smaller pieces, typically between 16 and 64 megabytes each. It then starts up many copies of the program on a cluster of machines.
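
To get a feel for how M follows from the piece size, here is a small Python sketch. The helper `split_ranges`, the 200 MB input size, and the choice to treat a megabyte as 2^20 bytes are all assumptions made for this example; the real library's splitting logic is not shown in the slides.

```python
MB = 1024 * 1024  # assumption: one megabyte taken as 2**20 bytes

def split_ranges(input_bytes, split_bytes):
    """Return (offset, length) for each piece of the input."""
    return [
        (start, min(split_bytes, input_bytes - start))
        for start in range(0, input_bytes, split_bytes)
    ]

# Hypothetical 200 MB input cut into 64 MB pieces.
pieces = split_ranges(200 * MB, 64 * MB)
M = len(pieces)

print(M)                    # 4
print(pieces[-1][1] // MB)  # 8 (the last piece holds the leftover 8 MB)
```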


Step 2:

One of the copies becomes the master, while all the others are workers. There are M map tasks and R reduce tasks to do, and the master assigns each one to an idle worker.
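
The assignment of tasks to idle workers can be pictured with the following Python sketch. It is a simplification under stated assumptions: `assign_tasks`, the worker names, and the task counts are hypothetical, and a real master would keep assigning as workers become idle again.

```python
from collections import deque

def assign_tasks(tasks, idle_workers):
    """Hand one pending task to each idle worker; return leftovers too."""
    pending = deque(tasks)
    assignments = {}
    for worker in idle_workers:
        if not pending:
            break
        assignments[worker] = pending.popleft()
    return assignments, list(pending)

# Hypothetical job: M = 5 map tasks and R = 2 reduce tasks, 3 idle workers.
tasks = [("map", i) for i in range(5)] + [("reduce", j) for j in range(2)]
workers = ["worker-0", "worker-1", "worker-2"]

assignments, still_pending = assign_tasks(tasks, workers)
print(assignments)         # worker-0 gets map 0, worker-1 map 1, worker-2 map 2
print(len(still_pending))  # 4
```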


Step 3:

A worker that is assigned a map task reads the input piece it has been given. It parses key/value pairs out of the data, passes each one to the user's map function, and buffers the resulting intermediate key/value pairs in memory.


Step 4:

Periodically, the buffered pairs are written to the worker's local disk, partitioned into R regions (one per reduce task). The locations of these regions are passed back to the master, which forwards them to the reduce workers.
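
The referenced paper describes the default partitioning as hash(key) mod R. The Python sketch below follows that idea; using `zlib.crc32` as the hash, the name `partition`, and R = 3 are assumptions made here so the example is deterministic, not details taken from the library.

```python
import zlib

def partition(key, num_reduce_tasks):
    """Pick a reduce region for a key: hash(key) mod R."""
    # crc32 is a stand-in hash chosen for this sketch.
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_reduce_tasks

R = 3  # hypothetical number of reduce tasks
regions = [[] for _ in range(R)]

buffered_pairs = [(3, "car"), (4, "bike"), (5, "train"), (3, "bus"), (4, "boat")]
for key, value in buffered_pairs:
    regions[partition(key, R)].append((key, value))

for index, region in enumerate(regions):
    print(index, region)
```

Whatever hash is used, the property that matters is that every pair with the same key lands in the same region, so a single reduce task sees all the values for that key.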


Step 5:

When a reduce worker is notified of these locations, it reads the buffered data from the map workers' local disks. After it has read everything, it sorts the data by intermediate key so that all occurrences of the same key are grouped together.


Step 6:

The reduce worker goes through the sorted data and, for each unique intermediate key, passes the key and its corresponding set of intermediate values to the reduce function. The output of the function is then appended to the output file.
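
The sort-then-reduce pass on a reduce worker might look like this in Python. It is a sketch only: `reduce_count` is a hypothetical reduce function, the intermediate pairs are taken from the word-length example, and a list stands in for the output file.

```python
from itertools import groupby
from operator import itemgetter

def reduce_count(key, values):
    # Hypothetical reduce function: count the values for this key.
    return key, len(list(values))

# Intermediate pairs as a reduce worker might read them, in arbitrary order.
intermediate = [(4, "bike"), (3, "car"), (5, "train"), (3, "bus"), (4, "boat")]

# Sort by intermediate key so equal keys sit next to each other.
intermediate.sort(key=itemgetter(0))

output = []  # stands in for the output file
for key, group in groupby(intermediate, key=itemgetter(0)):
    values = (value for _, value in group)
    output.append(reduce_count(key, values))

print(output)  # [(3, 2), (4, 2), (5, 1)]
```

Sorting first is what lets the worker walk the data once and see each unique key exactly one time.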


Step 7:

When all of the map and reduce tasks are done, the MapReduce call returns and execution continues in the user code.

Execution speed

The graph on the right shows how fast the input is scanned. The rate climbs gradually as more workers are assigned, peaks around the 55-second mark, and then gradually drops as there are no more tasks left to assign.

Comparing different executions

This graph shows how long a sort program takes on 10^10 records. The run uses 15,000 map tasks, 4,000 reduce tasks (one output file each), and 1,746 workers.

Comparing different executions

Backup tasks come into play when a MapReduce operation is almost done: the master has idle workers run extra copies of the tasks that are still in progress, and a task counts as done as soon as either copy finishes. This saves a lot of time when stragglers are taking unusually long. The graph on the right shows that the sort takes a lot longer, about 400 seconds more, when backup tasks are disabled.
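
A toy Python model shows why backup tasks help. Every number below is made up for illustration and is unrelated to the measurements in the graph; `job_finish_time`, `backup_start`, and `backup_duration` are hypothetical names, and the model assumes each backup copy runs at normal speed.

```python
def job_finish_time(primary_finish, backup_start=None, backup_duration=None):
    """Job ends when its slowest task ends; a backup may finish a task early."""
    finish_times = []
    for finish in primary_finish:
        still_running = backup_start is not None and finish > backup_start
        if still_running:
            # Task is done when either the primary or the backup copy finishes.
            finish = min(finish, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)

# Hypothetical finish times in seconds; the last task is a straggler.
primary = [100, 105, 110, 400]

without_backup = job_finish_time(primary)
with_backup = job_finish_time(primary, backup_start=110, backup_duration=100)

print(without_backup)  # 400
print(with_backup)     # 210
```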

Comparing different executions

MapReduce also handles workers that stop functioning. If the master does not hear back from a worker for a while, it assumes the worker has failed, marks it as such, and reassigns its tasks to other workers. This helps if a worker stops responding for a few minutes, and the job still finishes in a decent time, as shown on the right.
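
The timeout check can be sketched in Python as below. This is an assumption-laden illustration, not the master's actual code: `detect_failures`, `tasks_to_rerun`, the worker and task names, and the 30-second timeout are all hypothetical.

```python
def detect_failures(last_seen, now, timeout):
    """Workers that have been silent for longer than the timeout."""
    return {worker for worker, seen in last_seen.items() if now - seen > timeout}

def tasks_to_rerun(task_owner, failed):
    """Tasks held by failed workers go back to idle so they can be reassigned."""
    return sorted(task for task, owner in task_owner.items() if owner in failed)

# Hypothetical state: time (in seconds) each worker last answered the master.
last_seen = {"worker-a": 98.0, "worker-b": 40.0, "worker-c": 95.0}
task_owner = {
    "map-0": "worker-a",
    "map-1": "worker-b",
    "reduce-0": "worker-b",
    "map-2": "worker-c",
}

failed = detect_failures(last_seen, now=100.0, timeout=30.0)
rerun = tasks_to_rerun(task_owner, failed)

print(failed)  # {'worker-b'}
print(rerun)   # ['map-1', 'reduce-0']
```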

In Summary

● MapReduce is a great way to deal with a large amount of raw data.

● MapReduce automatically does parallel processing.

● It separates the work into 2 tasks, map and reduce.

● Map takes the inputs and separates them into groups.

● Reduce takes the grouped data from map and merges it into a smaller result for the user.

● It handles all of this via a master and several workers.

● It copes with slow and non-responding workers at only a modest cost in run time.

Questions?

Reference

http://eecs.csuohio.edu/~sschung/IST734/mapreduce-osdi04.pdf
