Big Data Processing Using Hadoop: Poster Presentation
TRANSCRIPT
Hadoop: Cloud versus Commodity Hardware
Presenter: Amrut Patil. Advisor: Dr. Rajendra K. Raj. Rochester Institute of Technology.
Contact
Amrut Patil, Rochester Institute of Technology. Email: [email protected]
Overview
• Big Data is becoming more commonplace, both in scientific research and in industrial settings.
• Hadoop, an open-source framework for parallelized and distributed storage and processing, is gaining popularity for processing vast amounts of data.
• This project investigates the use of Hadoop for Big Data processing.
• We compare the design and implementation of Hadoop infrastructure in a cloud setting and on commodity hardware.
Hadoop on the Cloud
• Set up an AWS account and obtain the AWS authentication credentials, namely the Access Key ID, Secret Access Key, X.509 certificate file, X.509 private key file, and AWS account ID.
• Set up the command-line tools used to start and stop EC2 instances.
• Prepare an SSH key pair: the public key is embedded in the EC2 instance and the private key is kept on the local machine; together they establish a secure communication channel.
• Set up Hadoop on EC2 by configuring the security parameters (AWS Account ID, AWS Access Key ID, and AWS Secret Access Key) in the single initialization script at src/contrib/ec2/bin/hadoop-ec2-env.sh.
• To launch a Hadoop cluster on EC2: hadoop-ec2 launch-cluster <cluster-name> <number-of-slaves>
• To log in to the master node of the cluster: hadoop-ec2 login <cluster-name>
• To test the functionality of the Hadoop cluster: bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• To shut down a cluster: bin/hadoop-ec2 terminate-cluster <cluster-name>
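Putting these steps together end to end, a minimal session might look like the sketch below, assuming the contrib/ec2 scripts are on the PATH. The key-pair name gsg-keypair, the ~/.ec2 path, the cluster name test-cluster, and the two-slave size are illustrative choices; the variable names in the hadoop-ec2-env.sh excerpt follow the stock contrib script (worth checking against your Hadoop version), and the access key and secret shown are AWS's documented example placeholders:

# 1. Create an EC2 SSH key pair (classic EC2 API tools); keep only the private key.
ec2-add-keypair gsg-keypair | sed -n '/BEGIN/,/END/p' > ~/.ec2/id_rsa-gsg-keypair
chmod 600 ~/.ec2/id_rsa-gsg-keypair        # ssh rejects keys readable by others

# 2. Credentials block in src/contrib/ec2/bin/hadoop-ec2-env.sh (placeholder values):
#      AWS_ACCOUNT_ID=123456789012
#      AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
#      AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
#      KEY_NAME=gsg-keypair

# 3. Launch, use, and tear down the cluster.
hadoop-ec2 launch-cluster test-cluster 2              # one master plus two slaves
hadoop-ec2 login test-cluster                         # SSH into the master node
bin/hadoop jar hadoop-*-examples.jar pi 10 10000000   # 10 maps x 10,000,000 samples
exit                                                  # leave the master node
hadoop-ec2 terminate-cluster test-cluster             # stop all instances (and billing)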
Conclusions
• Verified the functionality of the Hadoop cluster by installing and running Hive, a data warehousing package (a smoke-test sketch follows this list).
• Accessible: The infrastructure can be set up on commodity hardware or in a cloud setting.
• Scalable: Cluster capacity can easily be increased by adding more machines.
• Fault Tolerant: In case of failure, Hadoop automatically restarts failed jobs.
• Low Cost: A cluster can be created quickly and cheaply from a set of machines.
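As a sketch of such a check, the Hive command-line interface can execute statements passed with its -e option; the table name t is an arbitrary example:

# Smoke test: create a table, list it, and drop it. Run from the Hive install directory.
bin/hive -e 'CREATE TABLE t (x INT); SHOW TABLES; DROP TABLE t;'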
Hadoop Background
• Hadoop employs a master/slave architecture for distributed storage and computation.
• The distributed storage system is called the Hadoop Distributed File System (HDFS).
• Hadoop's building blocks for data processing:
  • NameNode: Master of HDFS. Tracks how files are broken down into file blocks and which nodes store those blocks, and directs the slave DataNodes in performing I/O tasks.
  • DataNode: Reads and writes HDFS file blocks on the node's local file system.
  • Secondary NameNode: Takes snapshots of the HDFS metadata at predefined intervals; useful for fault tolerance.
  • JobTracker: Determines which tasks to process, assigns nodes to tasks, and monitors tasks while they are running.
  • TaskTracker: Manages the execution of individual tasks on each slave node.
• Hadoop uses the MapReduce framework to scale data processing easily over multiple computing nodes; the word-count sketch below shows the data flow in miniature.
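The same map, shuffle/sort, and reduce flow can be imitated with ordinary Unix tools; this word-count sketch is only an analogy for the phases Hadoop runs in parallel across a cluster, and input.txt is a hypothetical file:

# map: emit one word per line; shuffle: sort brings identical words together;
# reduce: uniq -c counts each group. Hadoop distributes these phases over many nodes.
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c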
Approaches for Implementing Hadoop
• In a cloud setting: Used Amazon Web Services (AWS), namely Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• On commodity hardware: Used several old PCs that were being retired, running Ubuntu 12.04 LTS.
Hadoop on Commodity Hardware
• Choose one specific node to host the NameNode and JobTracker daemons. This machine also activates the DataNode and TaskTracker daemons on all slave nodes.
• Set up passphraseless SSH so that the master can remotely access every node in the cluster: the master keeps the private key, and its public key is stored locally on every node (a combined sketch follows this list).
• User accounts should have the same name on all nodes.
• Generate an RSA key pair on the master node: ssh-keygen -t rsa
• Copy the public key to every slave node as well as to the master node itself: scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
• Log in to the target node from the master: ssh target
• Hadoop configuration settings are contained in three XML files, core-site.xml, hdfs-site.xml, and mapred-site.xml (minimal versions also follow this list).
• Hadoop can be run in three operational modes:
  • Local (Standalone) Mode: Hadoop runs completely on the local machine. HDFS is not used and no Hadoop daemons are launched.
  • Pseudo-distributed Mode: All daemons run on a single machine. Mainly used for development work.
  • Fully Distributed Mode: An actual Hadoop cluster runs in this mode.
• To start the Hadoop daemons: bin/start-all.sh
• To stop the Hadoop daemons: bin/stop-all.sh
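Two sketches tie these steps together. First, the passphraseless-SSH key distribution, assuming the account hadoop-user exists on every node and that master, slave1, and slave2 are example host names; appending the public key to authorized_keys is what makes the logins passphraseless:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # empty passphrase, default key file
for target in master slave1 slave2; do     # master included so it can reach itself
  scp ~/.ssh/id_rsa.pub hadoop-user@$target:~/master_key
  ssh hadoop-user@$target 'mkdir -p ~/.ssh; cat ~/master_key >> ~/.ssh/authorized_keys; chmod 600 ~/.ssh/authorized_keys'
done
ssh hadoop-user@slave1 hostname            # should now log in without a password

Second, a minimal version of the three configuration files for a small fully distributed Hadoop 1.x cluster, written here as shell here-documents. The host name master, the ports, and the replication factor of 2 are assumptions to adapt; the conf/masters and conf/slaves host lists (omitted here) must also name the nodes, and the one-time NameNode format before the first start is standard:

cat > conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>   <!-- URI of the NameNode -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

cat > conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>   <!-- host:port of the JobTracker -->
    <value>master:9001</value>
  </property>
</configuration>
EOF

cat > conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>   <!-- copies kept of each HDFS block -->
    <value>2</value>
  </property>
</configuration>
EOF

bin/hadoop namenode -format   # one-time, before the first start
bin/start-all.sh              # starts the master daemons and, via SSH, the slaves
jps                           # sanity check: the master should list NameNode,
                              # SecondaryNameNode, and JobTracker; each slave
                              # lists DataNode and TaskTracker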
Common Architecture of Hadoop Cluster
[Figure 1: Typical Hadoop cluster in a master/slave configuration. The master runs the NameNode and JobTracker (only one of each per cluster) alongside the Secondary NameNode; each of the slave nodes 1 through N runs a DataNode and a TaskTracker.]