
Paahuni Khandelwal Email: [email protected]

20th September, 2019

[Recitation 4]

CS 435 - Introduction to Big Data

Submission


o Submission deadline for PA1 - 30th Sept (by 5 pm)
o Tarball should include:

  - 3 .java files, one per profile (Profile1.java, Profile2.java, Profile3.java)
  - 3 jars, one per profile, containing the .class files (Profile1.jar, Profile2.jar, Profile3.jar)
  - Output folder/part-file from each profile

o NOTE: Your output should be generated after running jobs on this input set -

https://www.cs.colostate.edu/~cs435/PA/PA1/Fall2019/PA1Dataset.tar.gz

Profile 2

o A list of top 500 unigrams and their frequencies within each article
o Output should be grouped by Document ID
o Top 500 unigrams per document
o Total output records <= 500 * (no. of unique documents)


Profile 2 Solution 1 - Using Multiple Jobs with TopN

o Output of Mapper 1: ({DocumentID, unigram}, 1)
o Functionality of Reducer 1:
  - Count the frequency of each (DocumentID, unigram) pair
o Output of Reducer 1 will be in the form: (DocumentID, {unigram, frequency})


MAP REDUCE JOB 1
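
A minimal sketch of what Job 1 could look like. It assumes each mapper input record arrives as "DocumentID<TAB>article text" (adapt the parsing to the actual PA1 dataset format); class and variable names here are illustrative only, not the required solution.

// Imports assumed: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.*
public static class Mapper1 extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text compositeKey = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: each record is "DocumentID<TAB>article text".
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        StringTokenizer itr = new StringTokenizer(parts[1]);
        while (itr.hasMoreTokens()) {
            // Composite key: DocumentID + "\t" + unigram
            compositeKey.set(parts[0] + "\t" + itr.nextToken());
            context.write(compositeKey, one);
        }
    }
}

public static class Reducer1 extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable frequency = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        frequency.set(sum);
        // With the default TextOutputFormat this line becomes "DocumentID\tunigram\tfrequency"
        context.write(key, frequency);
    }
}

Since the count is associative, the same Reducer1 can also serve as the combiner, as the driver snippet later in this deck does with job1.setCombinerClass(Reducer1.class).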

Profile 2 Solution 1 - Using Multiple Jobs with TopN

o Functionality of Mapper 2:
  - Store and update the top 500 unigrams (based on frequency) in a TreeMap
  - Override the cleanup() method of the Mapper class to send the local top 500 unigrams from each mapper
  - Is a TreeMap on the mapper side really needed for Profile 2?

o Output of Mapper 2: (DocumentID, {unigram, frequency})
o Functionality of Reducer 2:
  - Store and update a TreeMap to get the top 500 unigrams per document
o Output of Reducer 2 will be in the form: (DocumentID, {unigram, frequency})


MAP REDUCE JOB 2
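
As a rough sketch of Job 2 (one possible shape, not the required one): Mapper 2 re-keys Job 1's output by DocumentID, and Reducer 2 keeps a per-document top-500 structure. A TreeMap<Integer, List<String>> is used here instead of a plain TreeMap so that unigrams with tied frequencies are not overwritten; this, and the assumed "DocumentID<TAB>unigram<TAB>frequency" input layout, are illustrative choices.

// Additional imports assumed: java.util.*
public static class Mapper2 extends Mapper<Object, Text, Text, Text> {

    private final Text docId = new Text();
    private final Text unigramAndFreq = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Job 1 output line (assumed): "DocumentID<TAB>unigram<TAB>frequency"
        String[] parts = value.toString().split("\t");
        if (parts.length < 3) return;
        docId.set(parts[0]);
        unigramAndFreq.set(parts[1] + "\t" + parts[2]);
        context.write(docId, unigramAndFreq);      // re-key by DocumentID
    }
}

public static class Reducer2 extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All unigrams of one DocumentID arrive in this single call, so the
        // top-500 structure can live entirely inside reduce().
        TreeMap<Integer, List<String>> topN = new TreeMap<>();
        int kept = 0;
        for (Text v : values) {
            String[] parts = v.toString().split("\t");
            int freq = Integer.parseInt(parts[1]);
            topN.computeIfAbsent(freq, f -> new ArrayList<>()).add(parts[0]);
            if (++kept > 500) {
                // Evict one unigram from the lowest-frequency bucket.
                List<String> lowest = topN.firstEntry().getValue();
                lowest.remove(lowest.size() - 1);
                if (lowest.isEmpty()) topN.remove(topN.firstKey());
                kept--;
            }
        }
        // Emit from most to least frequent.
        for (Map.Entry<Integer, List<String>> e : topN.descendingMap().entrySet()) {
            for (String unigram : e.getValue()) {
                context.write(key, new Text(unigram + "\t" + e.getKey()));
            }
        }
    }
}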

Profile 2 Solution 2 - Using Composite Key with sortComparator


o Output of Mapper 1: ({DocumentID, unigram}, 1)
o Functionality of Reducer 1:
  - Count the frequency of each (DocumentID, unigram) pair
o Output of Reducer 1 will be in the form: (DocumentID, {unigram, frequency})

Separate composite key/value pair by “\t”

MAP REDUCE JOB 1

Profile 2 Solution 2 - Using Composite Key with sortComparator


o Output of Mapper 2: ({DocumentID, unigram, frequency}, NullWritable.get())
o myGroupComp:
  - Extends WritableComparator to perform sorting on all three attributes
o Functionality of Reducer 2:
  - Store the top 500 unigrams per document
o Output of Reducer 2 will be in the form: (DocumentID, {unigram, frequency}) or ({DocumentID, unigram, frequency}, null) or ({DocumentID, unigram}, frequency)

MAP REDUCE JOB 2

sortComparator


public static class myGroupComp extends WritableComparator {

    protected myGroupComp() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        Text t1 = (Text) w1;
        Text t2 = (Text) w2;
        String[] t1Items = t1.toString().split("\t");
        String[] t2Items = t2.toString().split("\t");
        Integer docId1 = Integer.parseInt(t1Items[0]);
        Integer docId2 = Integer.parseInt(t2Items[0]);
        String unigram1 = t1Items[1];
        String unigram2 = t2Items[1];
        Integer frequency1 = Integer.parseInt(t1Items[2]);
        Integer frequency2 = Integer.parseInt(t2Items[2]);

sortComparator (Cont.)


    public int compare(WritableComparable w1, WritableComparable w2) {
        . . . .
        Integer docId1 = Integer.parseInt(t1Items[0]);
        Integer docId2 = Integer.parseInt(t2Items[0]);
        String unigram1 = t1Items[1];
        String unigram2 = t2Items[1];
        Integer frequency1 = Integer.parseInt(t1Items[2]);
        Integer frequency2 = Integer.parseInt(t2Items[2]);

        Integer comparison = docId1.compareTo(docId2);
        if (comparison == 0) {
            comparison = -1 * frequency1.compareTo(frequency2);  // -1 for descending order
            if (comparison == 0) {
                comparison = unigram1.compareTo(unigram2);
            }
        }

        return comparison;
    }
}

TreeMap: Profile 2, Job 2 using sortComparator


o How to sort key/value pairs having multiple attributes in a TreeMap?
o Functionality of Reducer 2: store and update a TreeMap to get the global top 500 unigrams per document (a sketch of one possible reducer follows this list)
  - Maintain a global list of only the top 500 items (DocumentID + "\t" + unigram + "\t" + frequency)
  - Maintain 2 global variables that track noOfUnigramsSeenSoFar and for which DocumentID
  - Use cleanup() to write output to context
  - Inside your reduce():
    * Add a new tuple to the list if a new DocIDFromInputKey arrives.
    * Or, increment noOfUnigramsSeenSoFar by 1 if DocIDFromInputKey is the same as DocumentID.
    * Keep setting DocumentID to the current tuple's DocIDFromInputKey.
    * Iterate through the list and write output to context.
    * Handle the last DocIDFromInputKey's unigrams in the cleanup() method.
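
A sketch of that reducer. This is a simplified variant that writes records directly from reduce() instead of buffering a list and flushing it in cleanup(); it works because every composite key is unique and the sort comparator already delivers keys ordered by (DocumentID, descending frequency). It assumes a single reduce task, or a partitioner that keeps all keys of one DocumentID on the same reducer.

public static class Reducer2 extends Reducer<Text, NullWritable, Text, NullWritable> {

    private String currentDocId = null;     // DocumentID currently being processed
    private int noOfUnigramsSeenSoFar = 0;  // unigrams already written for it

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Composite key: "DocumentID\tunigram\tfrequency"
        String docIdFromInputKey = key.toString().split("\t")[0];
        if (!docIdFromInputKey.equals(currentDocId)) {
            currentDocId = docIdFromInputKey;   // a new document begins
            noOfUnigramsSeenSoFar = 0;
        }
        if (noOfUnigramsSeenSoFar < 500) {
            context.write(key, NullWritable.get());
            noOfUnigramsSeenSoFar++;
        }
    }
}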

Map Reduce Chaining Jobs?


Configuration conf1 = new Configuration();
Job job1 = Job.getInstance(conf1, "Job 1");
job1.setNumReduceTasks(4);
job1.setJarByClass(Profile2.class);
job1.setMapperClass(Mapper1.class);
job1.setCombinerClass(Reducer1.class);
job1.setReducerClass(Reducer1.class);
job1.setPartitionerClass(UnigramPartitioner.class);
. . .
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, output_path_for_Job_1);   // intermediate output Path
job1.waitForCompletion(true);

Configuration conf2 = new Configuration();
Job job2 = Job.getInstance(conf2, "Job 2");
job2.setJarByClass(Profile2.class);
job2.setMapperClass(Mapper2.class);
job2.setReducerClass(Reducer2.class);
job2.setSortComparatorClass(myGroupComp.class);

. . .

FileInputFormat.addInputPath(job2, output_path_for_Job_1);     // Job 2 reads Job 1's output
FileOutputFormat.setOutputPath(job2, new Path(args[2]));
System.exit(job2.waitForCompletion(true) ? 0 : 1);

Profile 3

o A list of the top 500 unigrams and their frequencies in the corpus
o The list should be sorted from most frequent unigrams to least frequent ones
o The solution to generate Profile 3 is a combination of Profile 1 and Profile 2
o We will list each unigram with its total occurrence count in the complete dataset


Profile 3 Solution 1 (using TopN) - 1 MapReduce Job


o Output of Mapper: (unigram, 1)
o Functionality of Reducer:
  - Store and update a HashMap<Unigram, Frequency> to get the frequency of each unigram
  - Use cleanup() to perform the sorting and write the top 500 unigrams to context
  - Your cleanup() should use comparingByValue on the entrySet, then sort in reverseOrder()
  - Set count to 0 and iterate through the sorted entries
  - Write <unigram, frequency> to context until count reaches 500
o Output of Reducer will be in the form: (unigram, frequency) - a sketch of this reducer follows the reference link below

Note: Explicitly set number of reducers to 1 in driver as job.setNumReduceTasks(1)

Refer: https://docs.oracle.com/javase/8/docs/api/java/util/Map.Entry.html
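
A minimal sketch of that single reducer, assuming the mapper emits (unigram, 1); comparingByValue().reversed() is one way to get the descending sort described above, and the class name is illustrative.

// Additional imports assumed: java.util.*
// Runs as the single reducer (job.setNumReduceTasks(1)).
public static class Profile3Reducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        counts.put(key.toString(), sum);    // corpus-wide frequency of this unigram
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Sort entries by frequency, highest first, using Map.Entry.comparingByValue.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        int count = 0;
        for (Map.Entry<String, Integer> e : sorted) {
            if (count == 500) break;
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            count++;
        }
    }
}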

Profile 3 Solution 2 (using Multiple Jobs)


MAP REDUCE JOB 1

o Output of Mapper 1: (unigram, 1)
o Functionality of Reducer 1:
  - Store and update a HashMap<Unigram, Frequency> to get the frequency of each unigram
o Output of Reducer will be in the form: (unigram, frequency)

Profile 3 Solution 2 (using Multiple Jobs)


MAP REDUCE JOB 2

o Use the TopN design pattern
o Functionality of Mapper 2:
  - Initialize a TreeMap<String, String>
  - Store and update the top 500 unigrams in the TreeMap
  - Use cleanup() to send the local top 500 unigrams from each mapper
  NOTE: You can set the key as NullWritable
o Output of Mapper 2: (null, {unigram, frequency})

o Functionality of Reducer 2:
  - Initialize a TreeMap<String, String>
  - Store and update the global top 500 unigrams

o Output of Reducer 2 will be in the form: (unigram, frequency) - a sketch of the mapper side follows below
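
One possible sketch of Mapper 2. The slides suggest a TreeMap<String, String>; this sketch uses a TreeMap<Integer, List<String>> instead so that ties in frequency are kept, which is a design choice rather than the required approach. It assumes Job 1's output lines look like "unigram<TAB>frequency".

// Additional imports assumed: java.util.*
public static class Mapper2 extends Mapper<Object, Text, NullWritable, Text> {

    // Frequency -> unigrams with that frequency; keeps ties instead of dropping them.
    private final TreeMap<Integer, List<String>> topN = new TreeMap<>();
    private int kept = 0;

    @Override
    protected void map(Object key, Text value, Context context) {
        // Job 1 output line (assumed): "unigram<TAB>frequency"
        String[] parts = value.toString().split("\t");
        if (parts.length < 2) return;
        int freq = Integer.parseInt(parts[1]);
        topN.computeIfAbsent(freq, f -> new ArrayList<>()).add(parts[0]);
        if (++kept > 500) {                  // keep only the local top 500
            List<String> lowest = topN.firstEntry().getValue();
            lowest.remove(lowest.size() - 1);
            if (lowest.isEmpty()) topN.remove(topN.firstKey());
            kept--;
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top 500; the single reducer merges all of them.
        for (Map.Entry<Integer, List<String>> e : topN.entrySet()) {
            for (String unigram : e.getValue()) {
                context.write(NullWritable.get(), new Text(unigram + "\t" + e.getKey()));
            }
        }
    }
}

Reducer 2 repeats the same bookkeeping over the merged values for the single NullWritable key and writes the final (unigram, frequency) pairs, most frequent first.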

Profile 3 Solution 2 (using Multiple Jobs)


MAP REDUCE JOB 2 - Second Solution

o Output of Mapper 2: (frequency, unigram)

o Use a sortComparator:
  - Set descending order while sorting on the key (frequency)

o Output of Reducer 2 will be in the form: (unigram, frequency)
  - Write only the top 500 unigrams (a sketch of the descending comparator follows below)
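
A minimal sketch of such a comparator, assuming the map output key is an IntWritable frequency; the class name is illustrative.

// Set in the driver with job2.setSortComparatorClass(DescendingIntComparator.class);
// the map output key class is IntWritable (the frequency).
public static class DescendingIntComparator extends WritableComparator {

    protected DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Negate the natural ascending order so higher frequencies come first.
        return -1 * ((IntWritable) a).compareTo((IntWritable) b);
    }
}

Reducer 2 then only needs to keep a running count and stop writing once 500 unigrams have been emitted.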

Logger for intermediate results


import org.apache.log4j.Logger;

. . .

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private Logger LOGGER = Logger.getLogger(TokenizerMapper.class.getName());
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        LOGGER.info("Mapper input key and value received");
        . . .
    }
}

Logger

To view logs, use the command: yarn logs -applicationId <application_id>
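
If you do not know the application ID, it is printed on the console when the job is submitted; running yarn application -list should also show the IDs of currently running applications.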


Next Recitation

o Will be held on 27th September (Friday) in CS130 from 4 to 5 pm
o Introduction to PA 2
o Help Sessions for PA1:

  - Monday (9/23): 2 pm - 4 pm
  - Thursday (9/26): 8 am - 10 am
  - Friday (9/27): 3 pm - 4 pm


Thank you


Questions?