Hadoop Operations - Basic
Hafizur Rahman
April 4, 2013
Agenda
● Why Hadoop
● Hadoop Architecture
● Hadoop Installation
● Hadoop Configuration
● Hadoop DFS Commands
● What's next
Challenges at Large Scale
● A single node can't handle the load due to limited resources
  ○ Processor time, memory, hard drive space, network bandwidth
  ○ Individual hard drives can only sustain read speeds of 60-100 MB/second, so adding more cores does not help much
● Multiple nodes are needed, but the probability of failure increases
  ○ Network failure, data transfer failure, node failure
  ○ Desynchronized clocks, locking
  ○ Partial failure in distributed atomic transactions
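The read-speed numbers above can be made concrete with a little back-of-the-envelope arithmetic. The sketch below assumes a 1 TB dataset and a 100-node cluster; both figures are illustrative assumptions, not from the slides:

```python
# Why a single disk is the bottleneck at scale: scanning the same data
# from many disks in parallel divides the read time.
# Assumes ~100 MB/s sustained read speed per drive (the upper end of
# the 60-100 MB/s range cited above).

def scan_time_seconds(data_bytes, drives, mb_per_second=100):
    """Time to read data_bytes spread evenly across `drives` disks."""
    bytes_per_second = mb_per_second * 1_000_000
    return data_bytes / (drives * bytes_per_second)

one_tb = 1_000_000_000_000

single = scan_time_seconds(one_tb, drives=1)     # one machine, one disk
cluster = scan_time_seconds(one_tb, drives=100)  # 100 machines in parallel

print(f"1 disk:    {single / 3600:.1f} hours")   # ~2.8 hours
print(f"100 disks: {cluster / 60:.1f} minutes")  # ~1.7 minutes
```

The same arithmetic motivates the next slides: instead of one fast machine, Hadoop spreads both the data and the scan across many ordinary ones.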
Hadoop Approach (1/4)
● Data Distribution
  ○ Distributed to all the nodes in the cluster
  ○ Replicated to several nodes
Hadoop Approach (2/4)
● Move computation to the data
  ○ Whenever possible, rather than moving data for processing, computation is moved to the node that contains the data
  ○ Most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers
  ○ This data locality results in high performance
Hadoop Approach (3/4)
● MapReduce programming model
  ○ Tasks run as isolated processes
Hadoop Approach (4/4)
● Isolated execution
  ○ Communication between nodes is limited and done implicitly
  ○ Individual node failures can be worked around by restarting tasks on other nodes
    ■ No message exchange is needed by user tasks
    ■ No rolling back to pre-arranged checkpoints to partially restart the computation
    ■ Other workers continue to operate as though nothing went wrong
Hadoop Environment
High-level Hadoop architecture
HDFS (1/2)
● Storage component of Hadoop
● Distributed file system modeled after GFS
● Optimized for high throughput
● Works best when reading and writing large files (gigabytes and larger)
● To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O)
HDFS (2/2)
● Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance
● HDFS replicates each file a configured number of times, is tolerant of both software and hardware failure, and automatically re-replicates data blocks from nodes that have failed
HDFS Architecture
MapReduce (1/2)
● MapReduce is a batch-based, distributed computing framework modeled after Google's MapReduce
● Simplifies parallel processing by abstracting away the complexities involved in working with distributed systems
  ○ computational parallelization
  ○ work distribution
  ○ dealing with unreliable hardware and software
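The programming model itself is small: a map function emits key-value pairs, the framework groups them by key (the shuffle), and a reduce function aggregates each group. A minimal in-process sketch of that flow, using word count as the example (plain Python standing in for what Hadoop distributes across nodes):

```python
from collections import defaultdict

# Word count expressed in the MapReduce model. Hadoop would run many
# mappers and reducers on separate nodes; here each phase is a function.

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: aggregate the values collected for one key."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
mapped = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because each mapper and reducer only sees its own input, the framework is free to run them anywhere and to rerun them on failure, which is exactly the isolation described on the next slides.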
MapReduce (2/2)
MapReduce Logical Architecture
● NameNode
● Secondary NameNode
● DataNode
● JobTracker
● TaskTracker
Hadoop Installation
● Local mode
  ○ No need to communicate with other nodes, so it does not use HDFS, nor will it launch any of the Hadoop daemons
  ○ Used for developing and debugging the application logic of a MapReduce program
● Pseudo-distributed mode
  ○ All daemons run on a single machine
  ○ Helps to examine memory usage, HDFS input/output issues, and other daemon interactions
● Fully distributed mode
Hadoop Configuration

hadoop-env.sh
● Environment-specific settings go here
● If a current JDK is not in the system path, configure your JAVA_HOME here

core-site.xml
● Contains system-level Hadoop configuration items
  ○ HDFS URL
  ○ Hadoop temporary directory
  ○ script locations for rack-aware Hadoop clusters
● Overrides settings in core-default.xml: http://hadoop.apache.org/common/docs/r1.0.0/core-default.html

hdfs-site.xml
● Contains HDFS settings
  ○ default file replication count
  ○ block size
  ○ whether permissions are enforced
● Overrides settings in hdfs-default.xml: http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html

mapred-site.xml
● Contains MapReduce settings
  ○ default number of reduce tasks
  ○ default min/max task memory sizes
  ○ speculative execution
● Overrides settings in mapred-default.xml: http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
Installation: Pseudo-Distributed Mode

● Set up public-key-based login
  ○ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  ○ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
● Update the following configuration
  ○ hadoop.tmp.dir and fs.default.name in core-site.xml
  ○ dfs.replication in hdfs-site.xml
  ○ mapred.job.tracker in mapred-site.xml
● Format the NameNode
  ○ bin/hadoop namenode -format
● Start all daemons
  ○ bin/start-all.sh
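For reference, the properties listed above might look like the following in a single-node setup. This is a sketch: the port numbers, temp directory, and replication value are the conventional pseudo-distributed choices (as in the single-node tutorial linked at the end), not prescribed by the slides.

```xml
<!-- conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>  <!-- single node, so no point replicating further -->
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
```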
Hands On
● HDFS commands
  ○ http://hadoop.apache.org/docs/r0.18.1/hdfs_shell.html
● Execute an example
  ○ WordCount
● Web interfaces
  ○ NameNode daemon: http://localhost:50070/
  ○ JobTracker daemon: http://localhost:50030/
  ○ TaskTracker daemon: http://localhost:50060/
● Hadoop job command
Hadoop FileSystems

| File System | URI Scheme | Java Impl. (all under org.apache.hadoop) | Description |
|---|---|---|---|
| Local | file | fs.LocalFileSystem | Filesystem for a locally connected disk with client-side checksums |
| HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed filesystem |
| WebHDFS | webhdfs | hdfs.web.WebHdfsFileSystem | Filesystem providing secure read-write access to HDFS over HTTP |
| S3 (native) | s3n | fs.s3native.NativeS3FileSystem | Filesystem backed by Amazon S3 |
| S3 (block-based) | s3 | fs.s3.S3FileSystem | Filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit |
| GlusterFS | glusterfs | fs.glusterfs.GlusterFileSystem | Still in beta: https://github.com/gluster/glusterfs/tree/master/glusterfs-hadoop |
Installation: Fully Distributed Mode

Three different kinds of hosts:
● master
  ○ master node of the cluster
  ○ hosts the NameNode and JobTracker daemons
● backup
  ○ hosts the Secondary NameNode daemon
● slave1, slave2, ...
  ○ slave boxes running both the DataNode and TaskTracker daemons
Hadoop Configuration

masters
● The name is misleading; it should have been called secondary-masters
● When you start Hadoop, it launches the NameNode and JobTracker on the local host from which you issued the start command, then SSHes to the nodes listed in this file to launch the SecondaryNameNode

slaves
● Contains a list of hosts that are Hadoop slaves
● When you start Hadoop, it SSHes to each host in this file and launches the DataNode and TaskTracker daemons
Recipes
● S3 configuration
● Using multiple disks/volumes and limiting HDFS disk usage
● Setting the HDFS block size
● Setting the file replication factor
Recipes: S3 Configuration

● Config file: conf/hadoop-site.xml
● To access S3 data using DFS commands:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>

● To use S3 as a replacement for HDFS:

<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>
Recipes: Disk Configuration

● Config file: $HADOOP_HOME/conf/hdfs-site.xml
● For multiple storage locations:

<property>
  <name>dfs.data.dir</name>
  <value>/u1/hadoop/data,/u2/hadoop/data</value>
</property>

● To limit HDFS disk usage, specify the space reserved for non-DFS use (in bytes per volume):

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>6000000000</value>
</property>
Recipes: HDFS Block Size (1/3)

● HDFS stores files across the cluster by breaking them down into coarse-grained, fixed-size blocks
● The default HDFS block size is 64 MB
● Block size affects the performance of
  ○ filesystem operations, where larger block sizes are more effective when storing and processing very large files
  ○ MapReduce computations, as the default behavior of Hadoop is to create one map task for each data block of the input files
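The second point can be quantified: the default number of map tasks for a file is roughly its size divided by the block size, rounded up. A quick sketch (the 10 GB input file is an illustrative assumption):

```python
import math

# One map task per HDFS block: the default MapReduce behavior noted above.

def map_tasks(file_size_bytes, block_size_bytes):
    """Number of blocks, and hence default map tasks, for one file."""
    return math.ceil(file_size_bytes / block_size_bytes)

one_gb = 1024 ** 3
mb_64 = 64 * 1024 ** 2    # the 64 MB default
mb_128 = 128 * 1024 ** 2  # a common larger choice

print(map_tasks(10 * one_gb, mb_64))   # 160 map tasks at the default
print(map_tasks(10 * one_gb, mb_128))  # 80 map tasks with 128 MB blocks
```

Doubling the block size halves the number of map tasks, trading per-task overhead against parallelism, which is why the next two slides show how to change it.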
Recipes: HDFS Block Size (2/3)

● Option 1: NameNode configuration
  ○ Add/modify the dfs.block.size parameter in conf/hdfs-site.xml
  ○ The block size is specified in bytes
  ○ Only files copied after the change will have the new block size
  ○ Existing files in HDFS will not be affected

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>
Recipes: HDFS Block Size (3/3)

● Option 2: During file upload
  ○ Applies only to the specific file paths

> bin/hadoop fs -Ddfs.blocksize=134217728 -put data.in /user/foo

● Use the fsck command to verify the block details:

> bin/hadoop fsck /user/foo/data.in -blocks -files -locations
/user/foo/data.in 215227246 bytes, 2 block(s): ...
0. blk_6981535920477261584_1059 len=134217728 repl=1 [hostname:50010]
1. blk_-8238102374790373371_1059 len=81009518 repl=1 [hostname:50010]
Recipes: File Replication Factor (1/3)

● Replication is done for fault tolerance
  ○ Pros: improves data locality and data access bandwidth
  ○ Cons: requires more storage
● The HDFS replication factor is a file-level property that can be set on a per-file basis
Recipes: File Replication Factor (2/3)

● Set the default replication factor
  ○ Add/modify the dfs.replication property in conf/hdfs-site.xml
  ○ Old files will be unaffected
  ○ Only files copied after the change will have the new replication factor

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
Recipes: File Replication Factor (3/3)

● Set the replication factor during file upload

> bin/hadoop fs -D dfs.replication=1 -copyFromLocal non-critical-file.txt /user/foo

● Change the replication factor of files or file paths that are already in HDFS
  ○ Use the setrep command
  ○ Syntax: hadoop fs -setrep [-R] <rep> <path>

> bin/hadoop fs -setrep 2 non-critical-file.txt
Replication 2 set: hdfs://myhost:9000/user/foo/non-critical-file.txt
Recipes: Merging Files in HDFS

● Use the HDFS getmerge command
● Syntax: hadoop fs -getmerge <src> <localdst> [addnl]
● Copies the files in a given HDFS path into a single concatenated file in the local filesystem

> bin/hadoop fs -getmerge /user/foo/demofiles merged.txt
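getmerge is plain concatenation. As a way to see its semantics without a cluster, here is a local Python analogy (an illustrative sketch, not the HDFS implementation; the sorted-by-name order and the addnl behavior mirror what the command does for a directory of files):

```python
import os

def getmerge(src_dir, local_dst, addnl=False):
    """Concatenate every file in src_dir into local_dst, in sorted
    name order. If addnl is set, append a newline after each file,
    like the optional [addnl] argument of `hadoop fs -getmerge`."""
    with open(local_dst, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as f:
                    out.write(f.read())
                if addnl:
                    out.write(b"\n")
```

The addnl flag matters when the merged parts are line-oriented records that do not end with their own newline.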
Hadoop Operations - Advanced
Examples: Advanced Operations

● HDFS
  ○ Adding a new DataNode
  ○ Decommissioning a DataNode
  ○ Checking filesystem integrity with fsck
  ○ Balancing HDFS block data
  ○ Dealing with a failed disk
● MapReduce
  ○ Adding a TaskTracker
  ○ Decommissioning a TaskTracker
  ○ Killing a MapReduce job
  ○ Killing a MapReduce task
  ○ Dealing with a blacklisted TaskTracker
Links
● http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
● http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
● http://developer.yahoo.com/hadoop/tutorial/
● http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Q/A