
Page 1: COSC 6397 Big Data Analytics Advanced MapReduce

COSC 6397 Big Data Analytics
Advanced MapReduce

Edgar Gabriel
Spring 2015

Basic statistical operations

• Calculating minimum, maximum, mean, median, and standard deviation
• Data is typically multi-dimensional -> analytics can be based on one or more dimensions of the data
  – parallelism can only be exploited on the map side
  – a single reducer is often required (see the sketch below)

Image source: Hadoop MapReduce Cookbook, chapter 5.
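A minimal sketch of this single-reducer pattern, assuming every mapper emits its values under one constant key so that a single reduce call sees the whole data set (class and key names are illustrative, not from the cookbook):

  import java.io.IOException;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Hypothetical reducer: all mappers emit the constant key "stats",
  // so this one reduce call iterates over every value in the data set.
  public class StatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {
      @Override
      public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
              throws IOException, InterruptedException {
          double min = Double.MAX_VALUE, max = -Double.MAX_VALUE, sum = 0.0;
          long n = 0;
          for (DoubleWritable v : values) {
              double d = v.get();
              min = Math.min(min, d);
              max = Math.max(max, d);
              sum += d;
              n++;
          }
          context.write(key, new Text("min=" + min + " max=" + max + " mean=" + (sum / n)));
      }
  }

The standard deviation can be accumulated in the same pass from the sum of squares; an exact median, however, requires the values in sorted order, which is why these operations parallelize only on the map side.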

Page 2: COSC 6397 Big Data Analytics Advanced MapReduce

Group-by operations

• Calculate the basic operations per group
  – makes it possible to use more than one reducer
  – grouping is based on the key emitted by the mapper step
• Example: calculate the number of accesses to a webpage based on a log file (see the sketch below)

Image source: Hadoop MapReduce Cookbook, chapter 5.
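A minimal sketch of this group-by count in the style of the classic word count (the class names and the position of the URL field in the log line are assumptions):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class UrlAccessCount {

      // Mapper: emit <URL, 1> for every line of the web-server log file
      public static class AccessMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text url = new Text();

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] fields = value.toString().split(" ");
              if (fields.length > 6) {
                  url.set(fields[6]);      // assumption: the 7th field is the requested URL
                  context.write(url, ONE);
              }
          }
      }

      // Reducer: sum the counts per URL; each reducer handles a subset of the groups
      public static class AccessReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }
  }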

Frequency distributions

• Arrangement of the values that one or more variables take in a sample
• Each entry in the table contains the number of occurrences of values within a particular group
• The table summarizes the distribution of values in the sample
• Example:
  – Analyze the log file of a web server
  – Sort the number of hits received by each URL in ascending order
  – Input example: 205.212.115.106 - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985

Page 3: COSC 6397 Big Data Analytics Advanced MapReduce

Frequency distributions

• First MapReduce job counts the number of occurrences of each URL
  – Result of the MapReduce job: a file containing the list of <URL> <no. of occurrences> pairs
• Second MapReduce job (see the mapper sketch below)
  – uses the output of the first MapReduce job as input
  – Mapper: use <no. of occurrences> as key and <URL> as value
  – Reducer: emit the <no. of occurrences> in the output file (ignoring the URL)
• Sorting is done implicitly by the Hadoop framework

Example output

Image source: Hadoop MapReduce Cookbook, chapter 5.
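A minimal sketch of the second job's mapper, assuming the first job wrote tab-separated <URL> <no. of occurrences> lines (class and field names are illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Hypothetical mapper of the second job: swap <URL, count> into <count, URL>
  // so that the shuffle phase sorts the records by occurrence count.
  public class FrequencyMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

      private final IntWritable count = new IntWritable();
      private final Text url = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");  // "<URL>\t<no. of occurrences>"
          url.set(fields[0]);
          count.set(Integer.parseInt(fields[1]));
          context.write(count, url);   // count becomes the key -> implicit ascending sort
      }
  }

The reducer then only writes out the key and drops the URL value, which yields the sorted frequency column shown in the example output.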

Page 4: COSC 6397 Big Data Analytics Advanced MapReduce

Histograms

• Graphical representation of the distribution of data
• Estimate of the probability distribution of a continuous variable
• Representation of tabulated frequencies, shown as adjacent rectangles erected over discrete intervals
  – the area of each rectangle is proportional to the frequency of the observations in the interval
• Example:
  – Determine the number of accesses to the web server per hour

Image source: Hadoop MapReduce Cookbook, chapter 5.

Histograms

• The map step uses the hour as the key and 'one' as the value
• The reducer sums up the number of occurrences for each hour (see the sketch below)

Image source: Hadoop MapReduce Cookbook, chapter 5.
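A minimal sketch of this job; the timestamp layout of the log line is an assumption based on the input example above:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class HourlyAccessHistogram {

      // Mapper: extract the hour from the [dd/Mon/yyyy:hh:mm:ss -zzzz]
      // timestamp of each log line and emit <hour, 1>
      public static class HourMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final IntWritable hour = new IntWritable();

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              int start = line.indexOf('[');
              if (start < 0) return;                 // skip malformed lines
              // timestamp looks like 01/Jul/1995:00:00:12; the hour is at offset 12
              String ts = line.substring(start + 1);
              hour.set(Integer.parseInt(ts.substring(12, 14)));
              context.write(hour, ONE);
          }
      }

      // Reducer: sum the accesses per hour -> one histogram bucket per key
      public static class HourReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
          @Override
          public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }
  }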

Page 5: COSC 6397 Big Data Analytics Advanced MapReduce

Histograms

[Figure: histogram of the number of web-server accesses per hour; x-axis: hour of the day (1-24), y-axis: number of accesses (0 to 140,000)]

Scatter Plots

• A scatter plot uses Cartesian coordinates to display the values of two variables for a set of data
• Typically used when one variable exists that is under the control of the user
  – this parameter is systematically incremented and/or decremented, while the other is measured
    • it is also called the control parameter or independent variable
    • it is typically plotted along the horizontal axis
  – the measured or dependent variable is customarily plotted along the vertical axis

Page 6: COSC 6397 Big Data Analytics Advanced MapReduce

Scatter Plots

• Example: analyze the data to find the relationship between the size of the web pages and the number of hits each web page receives (see the sketch below)

Image source: Hadoop MapReduce Cookbook, chapter 5.
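A minimal sketch of such a job, bucketing the requests by response size so that the reducer yields one (size, hits) point per bucket (class names and the position of the size field are assumptions):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class SizeVsHits {

      // Mapper: emit <page size in KB, 1> for every request in the log file
      public static class SizeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final IntWritable sizeKB = new IntWritable();

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String[] fields = value.toString().split(" ");
              try {
                  // assumption: the last field of the log line is the response size in bytes
                  int bytes = Integer.parseInt(fields[fields.length - 1]);
                  sizeKB.set(bytes / 1024);
                  context.write(sizeKB, ONE);
              } catch (NumberFormatException e) {
                  // the size field can be "-" for failed requests; skip those lines
              }
          }
      }

      // Reducer: sum the hits per size bucket -> one scatter-plot point per bucket
      public static class HitsReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
          @Override
          public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int hits = 0;
              for (IntWritable v : values) {
                  hits += v.get();
              }
              context.write(key, new IntWritable(hits)); // (page size in KB, number of hits)
          }
      }
  }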

Scatter Plots

[Figure: resulting scatter plot of page size versus number of hits]

Image source: Hadoop MapReduce Cookbook, chapter 5.

Page 7: COSC 6397 Big Data Analytics Advanced MapReduce

Secondary Sorting

• MapReduce sorts the intermediate key-value pairs by their keys during the shuffle and sort phase
• Sometimes an additional sorting based on the values would be useful
• Example: data from sensors
  – Intermediate key-value pair: key = m_i, value = (t_j, r_i), with m_i being a sensor id, t_j a time stamp, and r_i the actual measured value
  – The values for a given key do not arrive in increasing order of their timestamps

Secondary Sorting (II)

• Solution: encode the sensor id and the time stamp in the key:
    key = m_i, value = (t_j, r_i)   ->   key = (m_i:t_j), value = r_i
  but one needs to ensure that all keys containing m_i end up in the same reduce call!
• Implement three classes:
  – Partitioner: decides which keys are sent to which reducers
  – SortComparator: decides how the map output keys are sorted
  – GroupComparator: decides which map output keys go to the same reduce method call

job.setPartitionerClass(SensorPartitioner.class);

job.setGroupingComparatorClass(KeyGroupingComparator.class);

job.setSortComparatorClass(CompositeKeyComparator.class);

Page 8: COSC 6397 Big Data Analytics Advanced MapReduce

Partitioner

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class SensorPartitioner extends Partitioner<Text, Text> {

      @Override
      public int getPartition(Text key, Text val, int numReducers) {
          // the composite key has the form (m_i:t_j); part [0] is the sensor id
          String[] parts = key.toString().split(":");
          // hash the sensor id so that non-numeric ids such as "s1" also work
          return (parts[0].hashCode() & Integer.MAX_VALUE) % numReducers;
      }
  }

• Data from one sensor will end up at the same reducer
• Since the keys are still different, the reduce method would still be invoked separately for each key = (m_i:t_j)

SortComparator: determines the order in which keys are presented to the reducer

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class CompositeKeyComparator extends WritableComparator {

      protected CompositeKeyComparator() {
          super(Text.class, true);   // instantiate Text keys for comparison
      }

      @Override
      public int compare(WritableComparable w1, WritableComparable w2) {
          String[] t1 = w1.toString().split(":");
          String[] t2 = w2.toString().split(":");
          // sorting based on the sensor id ...
          int result = t1[0].compareTo(t2[0]);
          if (result == 0) {
              // ... then sorting based on the time stamp, in increasing order
              result = Integer.compare(Integer.parseInt(t1[1]), Integer.parseInt(t2[1]));
          }
          return result;
      }
  }

Page 9: COSC 6397 Big Data Analytics Advanced MapReduce

GroupComparator: determines which keys are grouped together into a single call to the reduce method

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.io.WritableComparator;

  public class KeyGroupingComparator extends WritableComparator {

      protected KeyGroupingComparator() {
          super(Text.class, true);
      }

      @Override
      public int compare(WritableComparable w1, WritableComparable w2) {
          // compare only the sensor-id part of the composite key, so that all
          // keys (m_i:t_j) with the same m_i end up in the same reduce call
          String[] t1 = w1.toString().split(":");
          String[] t2 = w2.toString().split(":");
          return t1[0].compareTo(t2[0]);
      }
  }

Graphical flow

Input file (sensor id, time stamp, value):
  s1 0800 x1 | s1 0805 x2 | s2 0920 x3 | s2 0910 x4 | s1 0715 x5 | s3 1005 x6

• map() emits intermediate key-value pairs with composite keys:
  s1:0800 x1, s1:0805 x2, s2:0920 x3, s2:0910 x4, s1:0715 x5, s3:1005 x6
• SensorPartitioner sends all keys of a sensor to the same reducer:
  reducer 1: s1:0800 x1, s1:0805 x2, s1:0715 x5
  reducer 2: s2:0920 x3, s2:0910 x4, s3:1005 x6
• CompositeKeyComparator sorts the keys by sensor id, then by time stamp:
  reducer 1: s1:0715 x5, s1:0800 x1, s1:0805 x2
  reducer 2: s2:0910 x4, s2:0920 x3, s3:1005 x6
• KeyGroupingComparator groups all keys of a sensor into a single reduce() call:
  reduce(s1, [x5, x1, x2]), reduce(s2, [x4, x3]), reduce(s3, [x6])

Page 10: COSC 6397 Big Data Analytics Advanced MapReduce

Joining two Datasets

• Combining the input from two (or more) data sets is a common problem in Big Data Analytics
  – easy to handle with Pig and Hive
  – slightly more complicated with MapReduce
• Assumptions:
  – data entries having the same keys are combined
  – an inner join vs. an outer join is possible

Option 1: memory-backed join

• Useful if one of the files is relatively small
  – at most a few MBytes
• Solution (see the sketch below):
  – provide the smaller file via the Distributed Cache mechanism of Hadoop
  – read the file in the Mapper setup() function and store it as a global list/array/hashmap
  – in map(), extract the key of the data entry provided from the large file and search the global list/array/hashmap for an entry with the same key
  – generate an intermediate key-value pair, with the value being the combination of both data files for that key
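A minimal sketch of this pattern; the cache file name, the comma-separated field layout, and the class name are assumptions for illustration:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class MemoryJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

      // in-memory copy of the small file: key -> rest of the record
      private final Map<String, String> smallTable = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
          // the small file was registered in the driver, e.g. with job.addCacheFile();
          // it is then visible in the task's working directory under its plain name
          try (BufferedReader in = new BufferedReader(new FileReader("small.txt"))) {
              String line;
              while ((line = in.readLine()) != null) {
                  String[] parts = line.split(",", 2);   // assumption: "key,value" lines
                  smallTable.put(parts[0], parts[1]);
              }
          }
      }

      @Override
      public void map(LongWritable offset, Text value, Context context)
              throws IOException, InterruptedException {
          String[] parts = value.toString().split(",", 2); // record from the large file
          String match = smallTable.get(parts[0]);
          if (match != null) {                             // inner join: keep only matches
              context.write(new Text(parts[0]), new Text(parts[1] + "," + match));
          }
      }
  }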

Page 11: COSC 6397 Big Data Analytics Advanced MapReduce

Option 2: Map-side join

• Useful if
  – the input data sets are divided into the same number of partitions
  – the input data is sorted by the same key in each source
    • often the case if the data sets are the result of previous MapReduce jobs
  – all records for a particular key must reside in the same partition
• A ready-made tool/program is available if you want to join all entries
  – for customizing, you can write your own MapReduce job
• TupleWritable
  – Writable type storing multiple Writables
  – retrieve the i-th Writable with the get(i) method
  – users are encouraged to implement their own serializable types in most cases
• In main():

  job.setInputFormatClass(CompositeInputFormat.class);
  String joinStatement = CompositeInputFormat.compose("inner",
          KeyValueTextInputFormat.class,
          // one Path per data set to join; "/otherinput" is a placeholder for the second input
          new Path("/someinput"), new Path("/otherinput"));
  job.getConfiguration().set("mapreduce.join.expr", joinStatement);

Page 12: COSC 6397 Big Data Analytics Advanced MapReduce

If you need to customize your join

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.join.TupleWritable;

  public class MapSideJoinMapper extends Mapper<Text, TupleWritable, Text, Text> {

      private final Text txtValue = new Text();

      @Override
      public void map(Text key, TupleWritable value, Context context)
              throws IOException, InterruptedException {
          if (value.toString().length() > 0) {
              // value.get(i) holds the matching record from the i-th joined input
              String[] arr1 = value.get(0).toString().split(",");
              String[] arr2 = value.get(1).toString().split(",");
              txtValue.set(arr1[1] + arr1[2] + arr2[0]);
              context.write(key, txtValue);
          }
      }
  }

Reducer Side Join

• Most generic, but also most expensive case
  – both data files go through the mapper and the shuffle/sort step
  – the reducer combines the data emitted for the same key (see the sketch below)
• MultipleInputs
  – supports MapReduce jobs that have multiple input paths with a different InputFormat and Mapper for each path
  – to simplify the logic in the reducer, secondary sorting might be required to ensure that the data arrives in the correct order at the reducer

  MultipleInputs.addInputPath(job, new Path(args[0]),
          TextInputFormat.class, PostsMapper.class);
  MultipleInputs.addInputPath(job, new Path(args[1]),
          TextInputFormat.class, UsersMapper.class);
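A minimal sketch of the reducer for such a job; it assumes that PostsMapper and UsersMapper tag each emitted value with a "P:" or "U:" prefix so the reducer can tell the two sources apart (the tags and class name are illustrative):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class JoinReducer extends Reducer<Text, Text, Text, Text> {

      @Override
      public void reduce(Text key, Iterable<Text> values, Context context)
              throws IOException, InterruptedException {
          String user = null;                  // record from the users data set
          List<String> posts = new ArrayList<>();
          for (Text v : values) {
              String s = v.toString();
              if (s.startsWith("U:")) {        // tag assumed to be written by UsersMapper
                  user = s.substring(2);
              } else if (s.startsWith("P:")) { // tag assumed to be written by PostsMapper
                  posts.add(s.substring(2));
              }
          }
          if (user != null) {                  // inner join: emit one pair per matching post
              for (String post : posts) {
                  context.write(key, new Text(user + "," + post));
              }
          }
      }
  }

With secondary sorting in place, the user record could be guaranteed to arrive first, which would remove the need to buffer the posts in memory.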