CS417 11/10/15
PaulKrzyzanowski 1
Distributed Systems
Programming Project Assignment 2
Using MapReduce on your Hadoop Cluster
Paul Krzyzanowski
TA: Yuanzhen Gu
Rutgers University
Fall 2015
11/10/15 © 2014-2015 Paul Krzyzanowski
The Assignment
• You are provided with United States Census data
  – Download the Zip Code Tabulation Areas Gazetteer File (1.1 MB), which contains:
    • Zip code identifying an area
    • Population count
    • Housing unit count
    • Land area (m²), water area (m²)
    • Latitude, longitude
• Find the potential trend in the rise of housing prices in the northeast, northwest, southeast, and southwest.
  – The potential trend is estimated simply from the ratio of supply to demand, which is housing unit count / population density. The smaller the result, the stronger the upward trend. This ignores other factors such as public infrastructure, community, and environment.
  – Population density is simply population count / land area.
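To make the two formulas above concrete, here is a minimal Python sketch that computes them for a single area. The function names and the sample numbers are hypothetical and chosen only for illustration; they are not taken from the Census file.

```python
def population_density(population, land_area_m2):
    """Population count divided by land area (people per square meter)."""
    if land_area_m2 <= 0:
        raise ValueError("land area must be positive")
    return population / land_area_m2


def price_trend_score(housing_units, population, land_area_m2):
    """Housing unit count divided by population density.

    Smaller scores mean less supply relative to demand, i.e. a stronger
    potential upward trend in housing prices.
    """
    density = population_density(population, land_area_m2)
    if density == 0:
        raise ValueError("population density is zero; score is undefined")
    return housing_units / density


# Made-up numbers for illustration only (not real Census values).
crowded = price_trend_score(housing_units=400, population=2000, land_area_m2=1_000_000)
sparse = price_trend_score(housing_units=400, population=500, land_area_m2=1_000_000)
print(crowded < sparse)  # the denser area gets the smaller score
```

With the same number of housing units and the same land area, the area with more people has the higher density and therefore the smaller score, which the assignment interprets as the stronger trend.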
Assignment Goals
• Solve the problem using map-reduce
• Briefly explain how the input is mapped into (key, value) pairs in the map phase
• Briefly explain how the (key, value) pairs produced by the map stage are processed by the reduce phase
• If the job cannot be done in a single map-reduce pass, describe how it would be structured into two or more map-reduce jobs, with the output of the first job becoming input to the next one(s)
• Languages: Java, Python, Go, C++
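One possible way to think about the map and reduce phases for this problem is sketched below in plain Python, simulating the whole job in a single process. This is only an illustration under several assumptions that do not come from the assignment: the input is assumed to be tab-separated with fields in the order listed on the assignment slide, the four regions are assumed to be quadrants around a hypothetical center point, and the per-region score is computed from summed totals. All function names are made up.

```python
from collections import defaultdict

# Assumption: split point near the geographic center of the contiguous U.S.
CENTER_LAT, CENTER_LON = 39.8, -98.6


def region_of(lat, lon):
    """Return 'northeast', 'northwest', 'southeast', or 'southwest'."""
    ns = "north" if lat >= CENTER_LAT else "south"
    ew = "east" if lon >= CENTER_LON else "west"
    return ns + ew


def map_record(line):
    """Map one line to a (region, (housing_units, population, land_area)) pair.

    Assumes tab-separated fields in this order: zip, population,
    housing units, land area, water area, latitude, longitude.
    """
    _zip, pop, units, land, _water, lat, lon = line.rstrip("\n").split("\t")
    return region_of(float(lat), float(lon)), (int(units), int(pop), int(land))


def reduce_region(region, values):
    """Sum the per-area tuples for a region, then compute units / density."""
    units = sum(v[0] for v in values)
    pop = sum(v[1] for v in values)
    land = sum(v[2] for v in values)
    density = pop / land
    return region, units / density


def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for line in lines:
        key, value = map_record(line)
        groups[key].append(value)
    return dict(reduce_region(k, vs) for k, vs in groups.items())


# Made-up records for illustration only.
sample = [
    "00001\t1000\t400\t2000000\t0\t41.0\t-74.0",
    "00002\t3000\t600\t2000000\t0\t42.0\t-71.0",
    "00003\t500\t250\t1000000\t0\t34.0\t-118.0",
]
print(run_job(sample))
```

Keying on the region means every record for a quadrant reaches the same reduce call, so this particular structure needs only one pass. Other definitions of the regions or of the per-region score may call for a different key or a second job.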
Recap – What is Hadoop?
• An open source framework for “reliable, scalable, distributed computing”
• It gives you the ability to process and work with large datasets that are distributed across clusters of commodity hardware
• It allows you to parallelize computation and ‘move processing to the data’ using the MapReduce framework
Recap – Hadoop Architecture
Recap – Hadoop Job Configuration Parameters
How to do this assignment: Step 1
Configuring Your First Hadoop Cluster
Prerequisites
• Ubuntu Linux 12.04 LTS
• Install Java v1.7+
• Add a dedicated Hadoop system user
• Configure SSH access
• Disable IPv6
• Or configure your Hadoop environment on LCSR:
  – http://www.cs.rutgers.edu/~watrous/hadoop.html
  – We will give instructions on setting up your own cluster in this recitation
Install Java & Hadoop
• We need to install Java on the cluster machines in order to run Hadoop
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java7-installer
• Configure JAVA_HOME in both ~/.bashrc & hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Hadoop Configuration
Environment Variables Setup
• Modify environment variables
  – Go back to your home directory and edit the .bashrc file
Configure HDFS
• HDFS is the distributed file system (similar to Google’s GFS) that sits behind Hadoop instances, syncing data so that it’s close to the processing and providing redundancy
  – We need to set this up first
• We need to specify some mandatory parameters in various XML configuration files to get HDFS up and running
/usr/local/hadoop/etc/hadoop/yarn-site.xml
Step 1a: Start HDFS
• Begin by starting the HDFS file system from the master server
• There is a script that runs the name node on the master and the data nodes on the slaves:
    $ cd /usr/local/hadoop/bin/
    $ ./start-dfs.sh
• Monitor the log files on the master and slaves:
    $ tail -f /usr/local/hadoop/logs/
Step 1b: Start HDFS
Or you can start everything together:
Explore Hadoop
Web Interfaces
• HDFS NameNode status and health check: http://localhost:50070
• HDFS Secondary NameNode status: http://localhost:50090
• JobTracker Web UI: http://192.168.65.134:50030
• TaskTracker Web UI: http://192.168.65.134:50060/
Hadoop Web Interfaces daemon
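Step 2: Write your MapReduce Code
• The Mapper:

    // Fields of the enclosing Mapper<Object, Text, Text, IntWritable> class
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every whitespace-separated token in the input line
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

• The Reducer:

    // Field of the enclosing Reducer<Text, IntWritable, Text, IntWritable> class
    private IntWritable result = new IntWritable();

    // Sum the counts emitted for each word and write (word, total)
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

• Make sure the above “Word Count” example works properly in your Hadoop environment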
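To see what happens between the map and reduce calls of the Word Count example, the following Python sketch simulates the same flow in a single process: map each line to (word, 1) pairs, group the values by key (the step Hadoop performs during the shuffle), then reduce each group. It does not use Hadoop at all; the function names are hypothetical.

```python
from collections import defaultdict


def map_line(line):
    """Emit (word, 1) for every whitespace-separated token, like the Mapper."""
    return [(word, 1) for word in line.split()]


def reduce_word(word, counts):
    """Sum the counts for one word, like the Reducer."""
    return word, sum(counts)


def word_count(lines):
    """Run map, group values by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_line(line):
            grouped[word].append(one)
    return dict(reduce_word(w, c) for w, c in grouped.items())


print(word_count(["the quick fox", "the lazy dog", "the end"]))
# {'the': 3, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}
```

The reducer never sees individual lines, only a key and all values emitted for that key, which is the property to keep in mind when choosing keys for the census problem.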
Step 3: Copy data to HDFS & Run jar file
• Before running the actual MapReduce job, you must first copy the input file from your local file system to Hadoop’s HDFS
• Download the input data and copy it to HDFS
• Run the MapReduce job
    $ bin/hadoop jar hadoop*your_program*.jar \
        CensusTrend /user/read_file_directory \
        /user/result_output_directory
Step 4: Retrieve MapReduce job result
• Check that the result was successfully stored in the HDFS output directory
• Create a directory locally
    $ mkdir /local_directory/output_result
• Copy the result files from HDFS to the local file system
    $ bin/hadoop dfs -getmerge \
        /user/result_output_directory \
        /local_directory/output_result
• You should also be able to check the result from your Hadoop Web Interface.
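Once the merged result is on the local file system, a small script can rank the regions. The Python sketch below assumes each output line has the form region, a tab, then the score (Hadoop's default text output separates key and value with a tab); that the job's key is the region name and its value is the score is an assumption about your own job, and the sample lines are made up.

```python
def rank_regions(lines):
    """Parse 'region<TAB>score' lines and sort by ascending score.

    The smallest score comes first, i.e. the strongest potential trend.
    """
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        region, score = line.split("\t")
        results.append((region, float(score)))
    return sorted(results, key=lambda pair: pair[1])


# Made-up output for illustration only.
merged = ["northeast\t1000000.0", "southwest\t500000.0", "", "southeast\t750000.0"]
for region, score in rank_regions(merged):
    print(region, score)
```

To rank a real result, pass an open file object for the merged output instead of the sample list, since iterating over a file yields its lines.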
Documentation
• Document your work NEATLY
• For your submission, explain:
  – The files you’re submitting and what they do
  – How input is mapped into (key, value) pairs
  – How (key, value) pairs are processed by the reduce phase
  – If the job cannot be done in a single map-reduce pass, describe how it would be structured into two or more map-reduce jobs
  – How to compile & run
  – Any bugs or peculiarities
The End