
COSC 6397

Big Data Analytics

Introduction to Map Reduce (I)

Edgar Gabriel

Spring 2014

Recommended Literature

• Original MapReduce paper by google

http://research.google.com/archive/mapreduce-osdi04.pdf

• Fantastic resource for tutorials, examples etc:

http://www.coreservlets.com/


Map Reduce Programming Model

• Input: a set of key/value pairs → output: a set of key/value pairs

• Map

– Input pair → intermediate key/value pairs

– (k1, v1) → list(k2, v2)

• Reduce

– One key and all associated intermediate values

– (k2, list(v2)) → list(v3)

• Reduce stage starts when final map task is done

Task Granularity and Pipelining

• Many tasks means

– Minimal time for fault recovery

– Shuffling can be pipelined with map execution

– Better load balancing

Slide based on Dan Weld’s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google, Inc.)


Map Reduce Framework

• Takes care of distributed processing and coordination

• Scheduling

– Jobs are broken down into smaller chunks called tasks

– Scheduler assigns tasks to available resources

• Tasks are co-located with the data

– Framework strives to place tasks on the nodes that host

the segment of data to be processed by that task

– Code is moved to data, not data to code

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Major Components

• User Components:

– Mapper

– Reducer

– Combiner (Optional)

– Partitioner (Optional) (Shuffle)

– Writable(s) (Optional)

• System Components:

– Master

– Input Splitter*

– Output Committer*

* You can use your own if you really want!

Image source: http://www.ibm.com/developerworks/java/library/l-hadoop-3/index.html


Notes

• Mappers and Reducers are typically single threaded and

deterministic

– Determinism allows for restarting of failed jobs, or

speculative execution

– All are independent of each other and can run on an arbitrary number of nodes

• Mappers/Reducers run entirely independent of each other

– In Hadoop, they run in separate JVMs

Input Splitter

• Is responsible for splitting input into multiple chunks

• These chunks are then used as input for your mappers

• Splits on logical boundaries

• Multiple splitters available in Hadoop,

– Break up data by line, fixed chunk etc.

– An application can write its own splitter (see the sketch below)

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
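To illustrate the last point, here is a minimal sketch of a custom splitter: a hypothetical WholeFileTextInputFormat (class name and use case are illustrative, not from the original slides) that reuses Hadoop's TextInputFormat but hands each input file to a single mapper instead of breaking it into block-sized splits.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical custom splitter: each input file becomes exactly one split
public class WholeFileTextInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // never split a file across mappers
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new LineRecordReader();   // records are still read line by line
  }
}

A job would select it with job.setInputFormatClass(WholeFileTextInputFormat.class).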


Mapper

• Reads in input pair <K,V> (a section as split by the input

splitter)

• Outputs a pair <K’, V’>

• Example: for Word Count with the following input: “The

teacher went to the store. The store was closed; the store

opens in the morning. The store opens at 9am.”

• The output would be:

– <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1>
<store, 1> <the, 1> <store, 1> <was, 1> <closed, 1>
<the, 1> <store, 1> <opens, 1> <in, 1> <the, 1>
<morning, 1> <the, 1> <store, 1> <opens, 1> <at, 1>
<9am, 1>

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx

Reducer

• Accepts the Mapper output, and collects values on the key

– All inputs with the same key must go to the same reducer!

• Input is typically sorted; the output is written out exactly as produced

• For our example, the reducer input would be:

– <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1>
<store, 1> <the, 1> <store, 1> <was, 1> <closed, 1>
<the, 1> <store, 1> <opens, 1> <in, 1> <the, 1>
<morning, 1> <the, 1> <store, 1> <opens, 1> <at, 1>
<9am, 1>

• The output would be (treating "The" and "the" as the same key):

– <The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4>
<was, 1> <closed, 1> <opens, 2> <in, 1> <morning, 1>
<at, 1> <9am, 1>

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx


Combiner

• Combiner is an intermediate reducer

• Optional

• Reduces the output from each mapper

• Reduces bandwidth and limits data size for sorting

• Cannot change the type of its input

– Input types must be the same as output types

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx

Output Committer

• Is responsible for taking the reduce output, and

committing it to a file

• Typically, this committer needs a corresponding input splitter (so that another job can read the output as its input)

• Again, the built-in committers are usually good enough, unless you need to output a special kind of file

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx


Partitioner (Shuffler)

• Decides which pairs are sent to which reducer

• Default is simply:

– Key.hashCode() % numOfReducers

• User can override to:

– Provide (more) uniform distribution of load between

reducers

– Some values might need to be sent to the same reducer

• Ex.: to compute the relative frequency of a pair of words <W1, W2>, you would need to make sure all pairs with the same word W1 are sent to the same reducer (see the sketch below)

– Binning of results

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
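A minimal sketch of such a custom partitioner for the word-pair example above, assuming (for illustration only, not from the original slides) that the pair is encoded as a single Text key of the form "W1 W2":

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes all keys "W1 W2" with the same first word W1 to the same reducer
public class FirstWordPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String firstWord = key.toString().split("\\s+")[0];
    // mask the sign bit so the partition number is never negative
    return (firstWord.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

The job would register it with job.setPartitionerClass(FirstWordPartitioner.class).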

Master

• Responsible for scheduling & managing jobs

• Scheduled computation should be close to the data if

possible

– Bandwidth is expensive! (and slow)

– This relies on a Distributed File System (GFS / HDFS)!

• If a task fails to report progress (reading input, writing output, etc.), crashes, or its machine goes down, it is assumed to be stuck; it is killed and the task is re-launched (with the same input)

• The Master is handled by the framework, no user code

is necessary

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx


Master (II)

• HDFS can replicate data to be local if necessary for scheduling (discussed later in course)

• Because our nodes are (or at least should be)

deterministic

– The Master can restart failed nodes

• Nodes should have no side effects!

– If a node is running the last step and is completing slowly, the master can launch a second copy of that node

• The first one to complete wins; any other runs are then killed (see the configuration sketch below)

Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: "MapReduce: Simplified Data Processing on Large Clusters"

http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
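This backup-task mechanism is called speculative execution in Hadoop. As an aside (not from the original slides), it can be toggled per job in the driver, using the Hadoop 1.x-era property names:

// enable backup copies of slow map and reduce tasks (Hadoop 1.x property names)
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);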

Word Count


Map Reduce word count code sample

• Implement Mapper

– Input is text

– Tokenize the text and emit each token with a count of 1: <token, 1>

• Implement Reducer

– Sum up the counts for each token

– Write out the result to the output file

• Configure the Job

– Specify Input, Output, Mapper, Reducer and Combiner

• Run the job

Mapper

• Mapper Class

– Class has 4 Java Generics parameters

– (1) input key (2) input value (3) output key (4) output

value

• Input and output utilize Hadoop's I/O framework: org.apache.hadoop.io

• The map() method receives a Context object, which is used to:

– Write output

– Create your own counters (see the sketch after the mapper code)

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


Mapper code in Hadoop

public static class TokenizerMapper

extends Mapper<LongWritable, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(LongWritable key, Text value, Context context)

throws IOException, InterruptedException {

StringTokenizer itr = new StringTokenizer(value.toString());

while (itr.hasMoreTokens()) {

word.set(itr.nextToken());

context.write(word, one);

}

}
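As mentioned on the previous slide, the Context can also be used to create custom counters. A minimal sketch (counter group and name are arbitrary, not from the original slides) that could be placed inside map():

// count empty input lines in a user-defined counter
if (value.toString().trim().isEmpty()) {
  context.getCounter("WordCount", "EMPTY_LINES").increment(1);
}

Counter values are aggregated by the framework and printed with the job statistics.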

Writables

• Types that can be serialized / deserialized to a stream

– framework will serialize your data before writing it to

disk

• Required for all key and value classes used as input/output

• Users can implement this interface to use their own types for their input/output/intermediate values (see the sketch after this list)

• There are defaults for basic types, such as

– Strings: Text

– Integers: IntWritable

– Long: LongWritable

– Float: FloatWritable
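A minimal sketch of a user-defined type, assuming (for illustration only, not from the slides) a two-dimensional point used as a map/reduce value; keys would additionally have to implement WritableComparable so they can be sorted:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom value type: a 2D point
public class PointWritable implements Writable {

  private double x;
  private double y;

  public PointWritable() { }   // no-arg constructor required by the framework

  public PointWritable(double x, double y) { this.x = x; this.y = y; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeDouble(x);        // serialize both coordinates
    out.writeDouble(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    x = in.readDouble();       // deserialize in the same order
    y = in.readDouble();
  }
}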


Reducer

• Similar to the Mapper – a generic class with four types:

(1) input key (2) input value (3) output key (4) output value

– The output types of the map function must match the input types of the reduce function

– In this case Text and IntWritable

• Map/Reduce framework groups key-value pairs produced by

mapper by key

• For each key there is a set of one or more values

• Input into a reducer is sorted by key

– Known as Shuffle and Sort

• The reduce function accepts a key and its set of values and outputs key-value pairs

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

Reducer code in Hadoop

public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable>

values, Context context) throws IOException,

InterruptedException {

int sum = 0;

for (IntWritable val : values) {

sum += val.get();

}

result.set(sum);

context.write(key, result);

}


public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();

String[] otherArgs = new GenericOptionsParser(conf,

args).getRemainingArgs();

Job job = new Job(conf, "word count");

job.setJarByClass(WordCount.class);

job.setMapperClass(TokenizerMapper.class);

job.setReducerClass(IntSumReducer.class);

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));

job.setInputFormatClass(TextInputFormat.class);

FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);

}

• Job class

– Encapsulates information about a job

– Controls execution of the job

• A job is packaged within a jar file

– Hadoop Framework distributes the jar on your behalf

– Needs to know which jar file to distribute

– The easiest way to specify the jar that your job resides in is by

calling job.setJarByClass

job.setJarByClass(getClass());

– Hadoop will locate the jar file that contains the provided class

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


job.setInputFormatClass(TextInputFormat.class);

TextInputFormat.addInputPath(job, new Path(otherArgs[0]));

• Specify input data

• Can be a file or a directory

– Directory is converted to a list of files as an input

– Input is specified by implementation of InputFormat - in this case TextInputFormat

– Responsible for creating splits and a record reader

– Controls input types of key-value pairs, in this case LongWritable and Text

– File is broken into lines, mapper will receive 1 line at a time

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

• OutputFormat defines specification for outputting data from

Map/Reduce job

– Count job utilizes an implementation of OutputFormat –

TextOutputFormat

• Define output path where reducer should place its output

– If the path already exists, the job will fail (see the sketch below)

– Each reducer task writes to its own file

– By default a job is configured to run with a single reducer

– Writes key-value pair as plain text

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
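Because an existing output path makes the job fail, a common pattern (not shown on the original slide) is to delete a leftover output directory in the driver before submitting; a minimal sketch using org.apache.hadoop.fs.FileSystem:

Path outputPath = new Path(otherArgs[1]);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputPath)) {
  fs.delete(outputPath, true);      // true = delete recursively
}
FileOutputFormat.setOutputPath(job, outputPath);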


• Specify the output key and value types for both mapper and

reducer functions

– Many times the same type

– If the types differ, then use setMapOutputKeyClass() and setMapOutputValueClass() (see the sketch below)

• job.waitForCompletion(true)

– Submits and waits for completion

– The boolean flag specifies whether progress information should be written to the console

– If the job completes successfully ‘true’ is returned, otherwise

‘false’ is returned

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
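A brief sketch of a job whose mapper and reducer emit different value types; the hypothetical PointWritable from the Writables slide is used as the map output value purely for illustration:

// mapper emits <Text, PointWritable>, reducer emits <Text, IntWritable>
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(PointWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);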

Combiner

• Combine data per Mapper task to reduce amount of

data transferred to reduce phase

• Reducer can very often serve as a combiner

– Only works if reducer’s output key-value pair types are

the same as mapper’s output types

• Combiners are not guaranteed to run

– Optimization only

– Not for critical logic

• Add to the main method of the driver:

job.setCombinerClass(IntSumReducer.class);

Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/


K-means in MapReduce

• Mapper:

– K = datapoint ID, V = datapoint coordinates

– Calculate the distance for a datapoint to each centroid

– Determine closest cluster

– Write K’,V’ with K’ = cluster ID and V’=coordinates of the

point

• Reducer:

– Each reducer gets all entries associated with one cluster

– Calculate sum of all point coordinates per cluster

– Recalculate cluster mean
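A minimal sketch of the mapper just described, assuming (for illustration) two-dimensional points stored one per line as "x y" and a centroid array loaded in setup(), e.g. from a file shipped via the distributed cache (next slide); the class and field names are hypothetical:

public static class KMeansMapper
       extends Mapper<LongWritable, Text, IntWritable, Text> {

  private double[][] centroids;   // [clusterID][dimension], filled in setup() (not shown)

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().trim().split("\\s+");
    double x = Double.parseDouble(parts[0]);
    double y = Double.parseDouble(parts[1]);

    // find the closest centroid for this data point
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      double dx = x - centroids[c][0];
      double dy = y - centroids[c][1];
      double dist = dx * dx + dy * dy;   // squared Euclidean distance
      if (dist < bestDist) { bestDist = dist; best = c; }
    }

    // K' = cluster ID, V' = coordinates of the point
    context.write(new IntWritable(best), value);
  }
}

The corresponding reducer would sum the coordinates per cluster ID and emit the new cluster mean.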

K-means in MapReduce

• This map/reduce pass performs one iteration of the k-means algorithm

• Multiple iterations within one job are not supported by MapReduce

-> Hadoop provides a way to specify a sequence of jobs with dependencies

• How to distribute Cluster centroids to all map tasks?

-> Hadoop distributed cache
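A rough sketch of how the two points above could fit together: a simple driver loop (one job per iteration) that ships the current centroid file to every map task via the 1.x-era org.apache.hadoop.filecache.DistributedCache. All paths, the KMeans/KMeansReducer class names and MAX_ITERATIONS are placeholders, and Hadoop's JobControl class could be used instead to express job dependencies:

for (int i = 0; i < MAX_ITERATIONS; i++) {
  Configuration conf = new Configuration();
  // ship the centroids from the previous iteration (or the initial ones) to all map tasks
  DistributedCache.addCacheFile(new Path("centroids-" + i + "/part-r-00000").toUri(), conf);

  Job job = new Job(conf, "k-means iteration " + i);
  job.setJarByClass(KMeans.class);
  job.setMapperClass(KMeansMapper.class);
  job.setReducerClass(KMeansReducer.class);
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(Text.class);
  FileInputFormat.addInputPath(job, new Path("points"));
  FileOutputFormat.setOutputPath(job, new Path("centroids-" + (i + 1)));

  if (!job.waitForCompletion(true)) {
    System.exit(1);   // abort if an iteration fails
  }
}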


K-means clustering

• The initial cluster centroids have a huge impact on the number of iterations of the k-means algorithm

• Especially important due to the complexities of handling multiple iterations in MapReduce

• A pre-clustering algorithm such as Canopy Clustering can be used to determine the initial cluster centroids

Canopy clustering

• Start with a list of data points and two distances T1 < T2 for processing

– Select any point (at random) from the list to form a canopy center

– Calculate its distance to all other points in the list

– Put all the points which fall within the looser distance threshold T2 into the canopy

– Remove from the original list all the points which fall within the tighter threshold T1; these points are excluded from being the center of a new canopy

– Repeat these steps until the original list is empty
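A stand-alone (non-MapReduce) sketch of the canopy selection steps above, for two-dimensional points; T2 is the looser membership threshold, T1 the tighter removal threshold, and the class is purely illustrative:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class CanopyClustering {

  // returns the list of canopies; each canopy holds the points within T2 of its center
  public static List<List<double[]>> canopies(List<double[]> points,
                                              double t1, double t2) {
    List<double[]> remaining = new ArrayList<double[]>(points);
    List<List<double[]>> result = new ArrayList<List<double[]>>();
    Random rand = new Random();

    while (!remaining.isEmpty()) {
      // select a random remaining point as the next canopy center
      double[] center = remaining.get(rand.nextInt(remaining.size()));
      List<double[]> canopy = new ArrayList<double[]>();

      Iterator<double[]> it = remaining.iterator();
      while (it.hasNext()) {
        double[] p = it.next();
        double dist = Math.hypot(p[0] - center[0], p[1] - center[1]);
        if (dist < t2) {
          canopy.add(p);        // within the loose threshold: member of this canopy
        }
        if (dist < t1) {
          it.remove();          // within the tight threshold: no longer a candidate center
        }
      }
      result.add(canopy);
    }
    return result;
  }
}

The canopy centers (or the mean of each canopy) can then serve as the initial centroids for the k-means job.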