

HADOOP MULTIPLE NODES CLUSTER SETUP AND

EXECUTION OF MAP REDUCE PROGRAMS

STUDENT NAME: Harinath Selvaraj

STUDENT NUMBER: C00235324

DEPARTMENT: Department of Computing and Networking

COURSE NAME: Master's in Data Science

COURSE CODE: CW_KCDAT_M

SUPERVISOR: Micheal

DATE OF SUBMISSION: 28 APRIL 2019

WORD COUNT: 1157


1. INTRODUCTION

Apache Hadoop is a framework written in the Java programming language that provides features similar to the Google File System (GFS) and implements the MapReduce computing paradigm. It is primarily used to run programs on large clusters to enable parallel processing. It is generally deployed on low-cost hardware because of its fault-tolerant nature and its focus on handling large data sets (Apache, n.d.-b). This report demonstrates how to set up a multi-node Hadoop cluster with 3 nodes (1 master and 2 slaves), run MapReduce jobs in the Java and Python languages, and validate their outputs.

2. SETUP INSTRUCTIONS

The Hadoop cluster was initially set up on a single node and then extended to support multiple nodes, i.e. 1 master and 2 slave nodes.

Instructions for setting up both the single-node and the multi-node cluster were obtained from the links below:

[1] Single Node Hadoop Cluster -

https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html

[2] Multiple Node Hadoop Cluster -

https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html

The existing Ubuntu virtual machines set up for previous lab exercises were deleted, and a fresh copy of the Ubuntu machine was created from the snapshot.

In order to check whether the nodes were functioning properly after the multi-node cluster was completed, Ubuntu Desktop was installed using the commands below:

sudo apt-get update

sudo apt-get install ubuntu-desktop

Hadoop was installed on a single node by following the instructions given in link [1].

The Hadoop distribution was downloaded from the Apache Hadoop website (Apache, n.d.-a) via the link below:

http://ftp.heanet.ie/mirrors/www.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
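The archive is then extracted and moved into place. The exact target paths follow link [1]; the commands below are a sketch of that step (the /usr/local/hadoop destination is an assumption, not confirmed by this report):

tar -xzf hadoop-3.1.2.tar.gz

sudo mv hadoop-3.1.2 /usr/local/hadoop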


After the single-node setup completed successfully, the machine was cloned and the remaining instructions for the multi-node cluster were followed from link [2]. The image below shows the 3 virtual machines: 1 master node and 2 slave nodes.

Figure 1. 3 Node setup for Hadoop Cluster

Figures 2, 3 and 4 show that the Master, Slave 1 and Slave 2 nodes are up and running.

Figure 2. Hadoop Running on Master Node

Figure 3. Hadoop Running on Slave 1 Node

Figure 4. Hadoop Running on Slave 2 Node


The YARN ResourceManager web UI at http://master:8088 was accessed from the master VM desktop in order to check whether the nodes were running.
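The same check can also be performed from the command line with YARN's node listing (an alternative to the web UI, not part of the original report):

yarn node -list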

Figure 5. Master, Slave1 and Slave2 nodes are visible in the master VM

3. RUNNING A HADOOP MAP REDUCE JOB

MapReduce is the heart of Apache Hadoop. It is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions (IBM, n.d.).
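As a simple illustration (not from the cited source), consider a word count over the two input lines "foo bar" and "bar baz": the map phase emits the pairs (foo, 1), (bar, 1), (bar, 1), (baz, 1); the framework then sorts and groups the pairs by key; and the reduce phase sums the counts per key, producing (bar, 2), (baz, 1), (foo, 1).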


3.1 PERFORMING WORD COUNT USING HADOOP

An input directory is created to hold the files required for processing.

hadoop fs -mkdir /input
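Text files must be present in /input before the word count job runs; for illustration, a file can be uploaded as shown below (example.txt is a placeholder file name, not from the original report):

hadoop fs -put example.txt /input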

The command below runs the word count example from the JAR file $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar on all the files present in the /input directory:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar wordcount /input /output

The job output is shown in the screenshot below.
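Once the job finishes, the aggregated counts can also be read directly from HDFS; by default the reducer writes its output to files named part-r-00000, part-r-00001, and so on inside the output directory, so (as a quick check, not part of the original run):

hadoop fs -cat /output/part-r-00000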

NOTE: The output directory has to be deleted using the command below before running the next Hadoop job.

hadoop fs -rm -R /output


ISSUE – The Java exception shown in the screenshot below was thrown when a new Hadoop job was started.

RESOLUTION – In order to fix the issue, the input folder was deleted with the command below,

hadoop fs -rm -R /input

and then created again with the command below,

hadoop fs -mkdir /input

3.2 RUNNING A PYTHON MAP REDUCE JOB USING HADOOP

Since Hadoop is written in Java, programs in other languages such as Python cannot be executed directly as MapReduce jobs. The Hadoop Streaming API is therefore used to run Python code as a MapReduce job: data is passed to and from the map and reduce code via STDIN (standard input) and STDOUT (standard output), respectively. The programs use sys.stdin to read the input data and sys.stdout to write the output.


3.2.1 PYTHON CODE

The Python code below is equivalent to the Java example that counts the number of words.

Mapper.py

#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Reducer.py

#!/usr/bin/env python
"""reducer.py"""

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

Execute permissions were granted to the mapper.py and reducer.py files so that they can be run by the Hadoop job,

chmod +x /home/hduser/mapper.py

chmod +x /home/hduser/reducer.py
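Before submitting the job to Hadoop, the scripts can be sanity-checked locally with a shell pipeline that mimics the map-sort-reduce flow (a quick local test, assuming the scripts live at the paths above; not part of the original report):

echo "foo foo bar" | python /home/hduser/mapper.py | sort -k1,1 | python /home/hduser/reducer.py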

3.2.2 INPUT FILES REQUIRED FOR PROCESSING

The flat files were obtained from the links below:

http://www.gutenberg.org/cache/epub/20417/pg20417.txt - 750KB

http://www.gutenberg.org/files/5000/5000-8.txt – 1.4MB

http://www.gutenberg.org/files/4300/4300-0.txt – 1.5MB

The files were copied to HDFS using the commands below,

hdfs dfs -copyFromLocal -p 5000-8.txt /input/

hdfs dfs -copyFromLocal -p 4300-0.txt /input/

hdfs dfs -copyFromLocal -p pg20417.txt /input/
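The upload can be verified by listing the input directory:

hdfs dfs -ls /input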

3.2.3 RUNNING THE MAP REDUCE JOB

ISSUE:

The Hadoop version which I had installed did not include the Hadoop streaming library required to run the Python job. This was confirmed by running a search on the Hadoop installation path; the search results shown below are documentation files, not the actual library I was looking for.


Therefore, the Hadoop streaming JAR was downloaded from the link below and copied to the $HADOOP_HOME/share/hadoop/mapreduce/ path:

http://www.java2s.com/Code/Jar/h/Downloadhadoopstreamingjar.htm

The command below was executed to run the Hadoop job,

hadoop \

jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-streaming.jar \

-mapper "python /home/hduser1/mapper.py" \

-reducer "python /home/hduser1/reducer.py" \

-input "/input/*" \

-output "/output"
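If mapper.py and reducer.py are not already present on every node, Hadoop Streaming can ship them with the job via the -file option (deprecated in newer releases in favour of the generic -files option); a sketch of that variant, not the command used in the original run:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-streaming.jar \

-file /home/hduser1/mapper.py -mapper mapper.py \

-file /home/hduser1/reducer.py -reducer reducer.py \

-input "/input/*" \

-output "/output"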

Run Screenshot


Output Screenshot


4. CONCLUSION

The activities below were successfully performed:

1) Install a single-node Hadoop cluster

2) Install a multi-node Hadoop cluster

3) Run a MapReduce program in JAVA to find the word count within all the files inside a directory

4) Run a MapReduce program in PYTHON to find the word count within all the files inside a directory

Running the MapReduce programs helped me understand how tasks are executed in parallel on the slave virtual machines. This exercise gave me the confidence to set up a Hadoop cluster and run MapReduce programs in Python seamlessly.

REFERENCES

Apache. (n.d.-a). Apache Hadoop. Retrieved March 27, 2019, from http://hadoop.apache.org/releases.html

Apache. (n.d.-b). Welcome to Apache Hadoop! Hadoop.Apache.Org. Retrieved from http://hadoop.apache.org

IBM. (n.d.). What is MapReduce? | IBM Analytics. Retrieved March 27, 2019, from https://www.ibm.com/analytics/hadoop/mapreduce