Hadoop HP Day 1
-
What Is Hadoop?
A distributed computing framework
For clusters of computers: thousands of compute nodes, petabytes of data
Open source, written in Java; inspired by Google's MapReduce
Developed at Yahoo; now part of the Apache group
3/3/2013 www.hpottech.com 1
-
What Is Hadoop? (contd.)
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes:
Hadoop Common: the common utilities
Avro: a data serialization system with bindings for scripting languages
Chukwa: a data collection system for managing large distributed systems
HBase: a scalable, distributed database for large tables
HDFS: a distributed file system
Hive: data summarization and ad hoc querying
MapReduce: distributed processing on compute clusters
Pig: a high-level data-flow language for parallel computation
ZooKeeper: a coordination service for distributed applications
-
Problem Scope
Hadoop is a large-scale distributed batch-processing infrastructure.
It scales to hundreds or thousands of computers, each with several processor cores.
It efficiently distributes large amounts of work across a set of machines.
-
How Large an Amount of Work?
Hundreds of gigabytes of data is the low end of Hadoop scale.
Hadoop is built to process "web-scale" data: hundreds of gigabytes to terabytes or petabytes.
It includes a distributed file system that breaks up input data and distributes it to several machines.
-
Challenges at Large ScaleChallenges at Large Scale
Performing large-scale computation is difficult. the probability of failures rises. In a distributed environment, partial failures are an
expected and common occurrence. Individual compute nodes may overheat, crash, experience
hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly Data may be corrupted, or maliciously or improperly
transmitted. Clocks may become desynchronized, lock files may not be
released. the rest of the distributed system should be able to
recover from the component failure or transient error condition and continue to make progress.
3/3/2013 5www.hpottech.com
-
Challenges at Large Scale (contd.)
Hadoop is designed to handle hardware failure and data congestion issues very robustly.
Compute hardware has finite resources: processor time, memory, hard drive space, and network bandwidth.
-
Moore's Law
Moore's law (named after Gordon Moore, the co-founder of Intel) states that the number of transistors that can be placed in a processor will double approximately every two years, for half the cost.
-
The Hadoop Approach
Efficiently process large volumes of information by connecting many commodity computers together to work in parallel.
These smaller machines are tied together into a single cost-effective compute cluster.
-
Comparison to Existing Techniques
Hadoop vs. Condor:
Hadoop provides a simplified programming model and efficient, automatic distribution of data and work across machines.
Condor does not automatically distribute data: a separate SAN must be managed in addition to the compute cluster.
With Condor, collaboration between multiple compute nodes must be managed with a communication system such as MPI.
-
Data Distribution - Hadoop
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in.
The Hadoop Distributed File System (HDFS) splits large data files into chunks.
Each chunk is replicated across several machines.
If a machine fails, an active monitoring system re-replicates its data.
-
Data Distribution - Hadoop (contd.)
Data is conceptually record-oriented: individual input files are broken into lines or into other formats specific to the application logic.
Each process running on a node in the cluster then processes a subset of these records.
Data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers.
-
Data is distributed across nodes at load time
-
MapReduce: Isolated Processes
Hadoop limits the amount of communication across nodes.
Hadoop will not run just any program and distribute it across a cluster.
Programs must be written to conform to a particular programming model, named "MapReduce."
-
Flat Scalability
One of the major benefits of using Hadoop, in contrast to other distributed systems, is its flat scalability curve.
A program written in a distributed framework other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines.
After a Hadoop program is written and functioning on ten nodes, very little (if any) work is required for that same program to run on a much larger amount of hardware.
-
Hadoop Installation Preparation - Demo
-
Steps - Hadoop Installation Preparation
Install VM Player.
Import the RedHat Linux VM into VM Player.
Start the VM Player.
Use root/root123 to log on to the VM.
Follow the tutorial.
-
Hadoop File System
-
The Hadoop Distributed File System
HDFS is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information.
Files are stored in a redundant fashion across multiple machines to ensure their durability under failure and high availability to highly parallel applications.
-
Basic Features: HDFS
Highly fault-tolerant
High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
-
Fault Tolerance
Failure is the norm rather than the exception.
An HDFS instance may consist of thousands of server machines, each storing part of the file system's data.
Since we have a huge number of components, and each component has a non-trivial probability of failure, some component is always non-functional.
Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
-
Data Characteristics
Streaming data access
Applications need streaming access to data
Batch processing rather than interactive user access
Large data sets and files: gigabytes to terabytes in size
High aggregate data bandwidth
Scale to hundreds of nodes in a cluster
Tens of millions of files in a single instance
Write-once-read-many: a file, once created, written, and closed, need not be changed; this assumption simplifies coherency
A map-reduce application or a web-crawler application fits perfectly with this model
-
MapReduce
[Diagram: MapReduce data flow - input files (cat, bat, dog, and other words; total size: TByte) are split, fed to map tasks, combined, and reduced into output partitions part0, part1, and part2.]
-
ARCHITECTURE
-
Namenode and Datanodes
Master/slave architecture.
An HDFS cluster consists of a single Namenode: a master server that manages the file system namespace and regulates access to files by clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in files.
A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
-
HDFS Architecture
[Diagram: a client issues metadata operations to the Namenode, which holds the metadata (names and replica locations, e.g. /home/foo/data); block reads and writes go directly to the Datanodes, and blocks are replicated across Rack 1 and Rack 2.]
-
File System Namespace
Hierarchical file system with directories and files.
Create, remove, move, rename, etc.
The Namenode maintains the file system namespace.
Any meta-information changes to the file system are recorded by the Namenode.
An application can specify the number of replicas of a file needed: the replication factor of the file. This information is stored in the Namenode.
-
Data Replication
HDFS is designed to store very large files across machines in a large cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and the number of replicas are configurable per file.
The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
A BlockReport lists all the blocks on a Datanode.
-
Replica Placement
The placement of replicas is critical to HDFS reliability and performance.
Optimizing replica placement distinguishes HDFS from other distributed file systems.
Rack-aware replica placement:
Goal: improve reliability, availability, and network bandwidth utilization.
There are many racks; communication between racks goes through switches.
Network bandwidth between machines on the same rack is greater than between machines on different racks.
The Namenode determines the rack id of each DataNode.
-
Replica Placement (contd.)
Placing replicas on unique racks is simple but non-optimal: writes are expensive.
With a replication factor of 3, replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack.
One third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third is distributed evenly across the remaining racks.
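As a rough illustration of this default policy, the following toy sketch (not Hadoop's actual BlockPlacementPolicy; node and rack names are invented) picks three targets for a block written from a given node:

```java
import java.util.*;

// Toy model of the rack-aware rule above for replication factor 3:
// replica 1 on the writer's node, replica 2 on another node in the same
// rack, replica 3 on a node in a different rack.
class PlacementSketch {
    static List<String> choose3(String writer, Map<String, String> rackOf) {
        String localRack = rackOf.get(writer);
        List<String> targets = new ArrayList<>();
        targets.add(writer);                              // replica 1: local node
        for (String n : rackOf.keySet())                  // replica 2: same rack, other node
            if (!n.equals(writer) && rackOf.get(n).equals(localRack)) { targets.add(n); break; }
        for (String n : rackOf.keySet())                  // replica 3: any off-rack node
            if (!rackOf.get(n).equals(localRack)) { targets.add(n); break; }
        return targets;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new LinkedHashMap<>();
        rackOf.put("node1", "rack1");
        rackOf.put("node2", "rack1");
        rackOf.put("node3", "rack2");
        rackOf.put("node4", "rack2");
        System.out.println(choose3("node1", rackOf)); // [node1, node2, node3]
    }
}
```

The real policy also balances disk usage and per-node load when choosing among candidate nodes; the sketch only captures the rack-placement rule.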
-
Replica Selection
For a READ operation, HDFS tries to minimize bandwidth consumption and latency.
If there is a replica on the reader's node, that replica is preferred.
An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one.
-
Safemode Startup
On startup, the Namenode enters Safemode.
Replication of data blocks does not occur in Safemode.
Each DataNode checks in with a Heartbeat and a BlockReport.
The Namenode verifies that each block has an acceptable number of replicas.
After a configurable percentage of safely replicated blocks check in with the Namenode, the Namenode exits Safemode.
It then makes a list of the blocks that need to be replicated.
The Namenode then proceeds to replicate these blocks to other Datanodes.
-
Filesystem Metadata
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata: for example, creating a new file or changing the replication factor of a file.
The EditLog is stored in the Namenode's local filesystem.
The entire filesystem namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage, also kept in the Namenode's local filesystem.
-
Namenode
Keeps an image of the entire file system namespace and the file Blockmap in memory.
4 GB of local RAM is sufficient to support these data structures, even for a huge number of files and directories.
When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the FsImage back to the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash.
-
Datanode
A Datanode stores data in files in its local file system.
The Datanode has no knowledge of the HDFS filesystem.
It stores each block of HDFS data in a separate file.
The Datanode does not create all files in the same directory.
It uses heuristics to determine the optimal number of files per directory and creates directories appropriately. (A research issue?)
When the filesystem starts up, the Datanode generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport.
-
Configuring HDFS
Cluster configuration
The HDFS configuration is located in a set of XML files in the Hadoop configuration directory: conf/.
-
hadoop-defaults.xml
Contains default values for every parameter in Hadoop.
This file is considered read-only.
Override this configuration by setting new values in hadoop-site.xml.
That file should be replicated consistently across all machines in the cluster. (It is also possible, though not advisable, to host it on NFS.)
-
hadoop-site.xml
Configuration settings are a set of key-value pairs of the format:
<property>
  <name>property-name</name>
  <value>property-value</value>
</property>
Adding the line <final>true</final> inside the property body will prevent the property from being overridden by user applications.
-
hadoop-site.xml
The following settings are necessary to configure HDFS:

key              value                        example
fs.default.name  protocol://servername:port   hdfs://alpha.milkman.org:9000
dfs.data.dir     pathname                     /home/username/hdfs/data
dfs.name.dir     pathname                     /home/username/hdfs/name
-
A single-node configuration (hadoop-site.xml):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>
* After copying this information into your conf/hadoop-site.xml file, copy it to the conf/ directories on all machines in the cluster.
-
Starting HDFS
First, format the file system that was just configured:
user@namenode:hadoop$ bin/hadoop namenode -format
This process should only be performed once. When it is complete, you are free to start the distributed file system:
user@namenode:hadoop$ bin/start-dfs.sh
This command will start the NameNode server on the master machine (which is where the start-dfs.sh script was invoked).
It will also start the DataNode instances on each of the slave machines.
In a single-machine "cluster," this is the same machine as the NameNode instance.
On a real cluster of two or more machines, this script will ssh into each slave machine and start a DataNode instance.
-
Interacting With HDFS
Command format:
user@machine:hadoop$ bin/hadoop moduleName -cmd args...
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.
Two such modules are relevant to HDFS: dfs and dfsadmin.
-
Tutorial:
Hadoop - Single Node :Installation
Tutorial-InstallationHDFSSingleNode.docx
-
PROTOCOL
-
The Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP port on the Namenode machine. It talks the ClientProtocol with the Namenode.
The Datanodes talk to the Namenode using the Datanode protocol.
An RPC abstraction wraps both the ClientProtocol and the Datanode protocol.
The Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by DataNodes or clients.
-
ROBUSTNESS
-
Objectives
The primary objective of HDFS is to store data reliably in the presence of failures.
Three common failures are: Namenode failure, Datanode failure, and network partition.
-
DataNode Failure and Heartbeat
A network partition can cause a subset of Datanodes to lose connectivity with the Namenode.
The Namenode detects this condition by the absence of a Heartbeat message.
The Namenode marks Datanodes without a recent Heartbeat as dead and does not send any IO requests to them.
Any data registered to a failed Datanode is no longer available to HDFS.
The death of a Datanode may also cause the replication factor of some blocks to fall below their specified value.
-
Re-replication
The necessity for re-replication may arise because:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor of the block may be increased.
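The decision itself is simple to picture: any block whose live replica count has fallen below its target factor needs re-replication. A toy sketch (illustrative only; block names are invented, and this is not the Namenode's actual implementation):

```java
import java.util.*;

// Given the live replica count per block (as accumulated from Datanode
// block reports), list the blocks below their target replication factor.
class UnderReplicated {
    static List<String> find(Map<String, Integer> replicaCount, int target) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : replicaCount.entrySet())
            if (e.getValue() < target) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("blk_1", 3); // fully replicated
        counts.put("blk_2", 1); // a Datanode died, or a replica was corrupted
        System.out.println(find(counts, 3)); // [blk_2]
    }
}
```

The Namenode then schedules new replicas for the flagged blocks on other Datanodes.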
-
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes.
A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold.
In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster.
-
Data Integrity
Consider a situation where a block of data fetched from a Datanode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy software.
An HDFS client creates a checksum of every block of its files and stores the checksums in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from a replica.
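The checksum idea can be sketched as follows, using java.util.zip.CRC32 for illustration (HDFS does use CRC32-based checksums, but it computes them per fixed-size chunk; that detail is simplified here):

```java
import java.util.zip.CRC32;

// Write path: compute and store a checksum per block.
// Read path: recompute and compare; a mismatch means the block is corrupt.
class ChecksumSketch {
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    // Recompute the checksum on read and compare with the stored value.
    static boolean verify(byte[] block, long stored) {
        return checksum(block) == stored;
    }

    public static void main(String[] args) {
        byte[] block = "some block data".getBytes();
        long stored = checksum(block);             // recorded at write time
        System.out.println(verify(block, stored)); // true: block is intact

        block[0] ^= 1;                             // simulate corruption in transit
        System.out.println(verify(block, stored)); // false: fetch from another replica
    }
}
```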
-
Metadata Disk Failure
The FsImage and EditLog are central data structures of HDFS.
A corruption of these files can cause an HDFS instance to be non-functional.
For this reason, a Namenode can be configured to maintain multiple copies of the FsImage and EditLog.
The multiple copies of the FsImage and EditLog files are updated synchronously.
Metadata is not data-intensive, so this is affordable.
The Namenode remains a single point of failure: automatic failover is NOT supported.
-
DATA ORGANIZATION
-
Data Blocks
HDFS supports write-once-read-many semantics with reads at streaming speeds.
A typical block size is 64 MB (or even 128 MB).
A file is chopped into 64 MB chunks and stored.
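The number of blocks a file occupies is just a ceiling division by the block size. A small sketch, assuming the 64 MB default mentioned above (the real value is configurable):

```java
// Illustrative sketch: how many blocks a file of a given length occupies,
// assuming a 64 MB block size.
class BlockCount {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB

    // Ceiling division: every byte must land in some block.
    static long numBlocks(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(numBlocks(oneGiB));     // 16 full 64 MB blocks
        System.out.println(numBlocks(oneGiB + 1)); // 17: the last block holds one byte
    }
}
```

Only the last block is allowed to be smaller than the block size, which matches the "all blocks except the last are the same size" rule earlier.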
-
Staging
A client request to create a file does not reach the Namenode immediately.
The HDFS client caches the data into a temporary file. When the data reaches an HDFS block size, the client contacts the Namenode.
The Namenode inserts the filename into its hierarchy and allocates a data block for it.
The Namenode responds to the client with the identity of the Datanode and the destination replicas (Datanodes) for the block.
The client then flushes the block from its local memory.
-
Staging (contd.)
The client sends a message that the file is closed.
The Namenode proceeds to commit the file-creation operation into the persistent store.
If the Namenode dies before the file is closed, the file is lost.
This client-side caching is required to avoid network congestion; it also has precedent in AFS (the Andrew File System).
-
Replication Pipelining
When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies them to the next replica, and so on.
Thus data is pipelined from one Datanode to the next.
-
API (ACCESSIBILITY)
-
Application Programming Interface
HDFS provides a Java API for applications to use.
Python access is also used in many applications.
A C language wrapper for the Java API is also available.
An HTTP browser can be used to browse the files of an HDFS instance.
-
FS Shell, Admin and Browser Interface
HDFS organizes its data in files and directories.
It provides a command line interface called the FS shell that lets the user interact with data in HDFS.
The syntax of the commands is similar to bash and csh.
Example: to create a directory /foodir:
bin/hadoop dfs -mkdir /foodir
There is also a DFSAdmin interface available.
A browser interface is also available to view the namespace.
-
Space Reclamation
When a file is deleted by a client, HDFS renames it to a file in the /trash directory for a configurable amount of time.
A client can request an undelete during this allowed time.
After the specified time, the file is deleted and the space is reclaimed.
When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.
The next Heartbeat transfers this information to the Datanode, which clears the blocks for reuse.
-
HDFS & GFS
The design of HDFS is based on the design of GFS, the Google File System.
HDFS is a block-structured file system: individual files are broken into blocks of a fixed size.
These blocks are stored across a cluster of one or more machines with data storage capacity. Individual machines in the cluster are referred to as DataNodes.
-
HDFS Characteristics
A file can be made of several blocks, and they are not necessarily stored on the same machine; the target machines which hold each block are chosen randomly on a block-by-block basis.
Thus access to a file may require the cooperation of multiple machines, but HDFS supports file sizes far larger than a single-machine DFS: individual files can require more space than a single hard drive could hold.
-
Replication in HDFS
DataNodes holding blocks of multiple files with a replication factor of 2.
The NameNode maps the filenames onto the block ids.
-
Common Example Operations
-
Listing files
someone@anynode:hadoop$ bin/hadoop dfs -ls
someone@anynode:hadoop$
someone@anynode:hadoop$ bin/hadoop dfs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2008-09-20 19:40 /hadoop
drwxr-xr-x - hadoop supergroup 0 2008-09-20 20:08 /tmp
-
Create a home directory, and then populate it with some files
Step 1: Create your home directory if it does not already exist.
someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user
someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone
Step 2: Upload a file. To insert a single file into HDFS, we can use the put command like so:
someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingFile.txt /user/yourUserName/
Step 3: Verify the file is in HDFS. We can verify that the operation worked with either of the two following (equivalent) commands:
someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName
someone@anynode:hadoop$ bin/hadoop dfs -ls
-
Uploading
Step 4: Uploading multiple files at once. The put command is more powerful than moving a single file at a time. It can also be used to upload entire directory trees into HDFS.
Create a local directory and put some files into it using the cp command. Our example user may have a situation like the following:
someone@anynode:hadoop$ ls -R myfiles
myfiles:
file1.txt file2.txt subdir/
myfiles/subdir:
anotherFile.txt
someone@anynode:hadoop$
This entire myfiles/ directory can be copied into HDFS like so:
someone@anynode:hadoop$ bin/hadoop dfs -put myfiles /user/myUsername
someone@anynode:hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/someone/myfiles 2008-06-12 20:59 rwxr-xr-x someone supergroup
someone@anynode:hadoop$ bin/hadoop dfs -ls myfiles
Found 3 items
/user/someone/myfiles/file1.txt 186731 2008-06-12 20:59 rw-r--r-- someone supergroup
/user/someone/myfiles/file2.txt 168 2008-06-12 20:59 rw-r--r-- someone supergroup
/user/someone/myfiles/subdir 2008-06-12 20:59 rwxr-xr-x someone supergroup
-
Uploading of Files
Uploading a file into HDFS first copies the data onto the DataNodes.
When they all acknowledge that they have received all the data and the file handle is closed, the file is then made visible to the rest of the system.
Thus, based on the return value of the put command, you can be confident that a file has either been successfully uploaded or has "fully failed."
You will never get into a state where a file is partially uploaded and the partial contents are visible externally because the upload disconnected and did not complete the entire file contents. In a case like this, it will be as though no upload took place.
-
Uses of the put Command
Command: bin/hadoop dfs -put foo bar
  Assuming: no file or directory named /user/$USER/bar exists in HDFS
  Outcome: uploads local file foo to a file named /user/$USER/bar
Command: bin/hadoop dfs -put foo bar
  Assuming: /user/$USER/bar is a directory
  Outcome: uploads local file foo to a file named /user/$USER/bar/foo
Command: bin/hadoop dfs -put foo somedir/somefile
  Assuming: /user/$USER/somedir does not exist in HDFS
  Outcome: uploads local file foo to a file named /user/$USER/somedir/somefile, creating the missing directory
Command: bin/hadoop dfs -put foo bar
  Assuming: /user/$USER/bar is already a file in HDFS
  Outcome: no change in HDFS, and an error is returned to the user.
-
Retrieving data from HDFS
Step 1: Display data with cat.
someone@anynode:hadoop$ bin/hadoop dfs -cat foo
(contents of foo are displayed here)
someone@anynode:hadoop$
Step 2: Copy a file from HDFS to the local file system.
The get command is the inverse operation of put; it will copy a file or directory (recursively) from HDFS into the target of your choosing on the local file system. A synonymous operation is called -copyToLocal.
someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo
someone@anynode:hadoop$ ls
localFoo
someone@anynode:hadoop$ cat localFoo
(contents of foo are displayed here)
-
Shutting Down HDFS
someone@namenode:hadoop$ bin/stop-dfs.sh
This command must be performed by the same user who started HDFS with bin/start-dfs.sh.
-
HDFS Command Reference
Running bin/hadoop dfs with no additional arguments will list all commands which can be run with the FsShell system.
bin/hadoop dfs -help commandName will display a short usage summary for the operation in question.
-
HDFS Command Reference
Command: -ls path
  Lists the contents of the directory specified by path, showing the names, permissions, owner, size, and modification date for each entry.
Command: -lsr path
  Behaves like -ls, but recursively displays entries in all subdirectories of path.
Command: -du path
  Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
Command: -dus path
  Like -du, but prints a summary of disk usage of all files/directories in the path.
Command: -mv src dest
  Moves the file or directory indicated by src to dest, within HDFS.
Command: -cp src dest
  Copies the file or directory identified by src to dest, within HDFS.
Command: -rm path
  Removes the file or empty directory identified by path.
-
HDFS Command Reference
Command: -get [-crc] src localDest
  Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
Command: -getmerge src localDest [addnl]
  Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
Command: -cat filename
  Displays the contents of filename on stdout.
Command: -copyToLocal [-crc] src localDest
  Identical to -get.
Command: -moveToLocal [-crc] src localDest
  Works like -get, but deletes the HDFS copy on success.
-
Tutorial
HDFS Command.
Tutorial-HDFSComand.docx
-
DFSAdmin Command Reference
The dfs module provides common file and directory manipulation commands; they all work with objects within the file system.
The dfsadmin module manipulates or queries the file system as a whole.
Getting overall status: bin/hadoop dfsadmin -report. This returns basic information about the overall health of the HDFS cluster, as well as some per-server metrics.
More involved status: bin/hadoop dfsadmin -metasave filename will record this information in filename. The metasave command will enumerate lists of blocks which are under-replicated, in the process of being replicated, and scheduled for deletion.
-
Safemode
In Safemode, the file system is mounted read-only:
No replication is performed,
Nor can files be created or deleted.
Safemode is entered automatically as the NameNode starts, to allow all DataNodes time to check in with the NameNode.
The NameNode waits until a specific percentage of the blocks are present and accounted for (dfs.safemode.threshold.pct).
Safemode can also be controlled with bin/hadoop dfsadmin -safemode what, where what is one of:
enter - Enters safemode
leave - Forces the NameNode to exit safemode
get - Returns a string indicating whether safemode is ON or OFF
wait - Waits until safemode has exited and returns
-
Changing HDFS Membership
When decommissioning nodes, it is important to disconnect nodes from HDFS gradually to ensure that data is not lost.
-
Upgrading HDFS Versions
# bin/start-dfs.sh -upgrade
It will then begin upgrading the HDFS version.
# bin/hadoop dfsadmin -upgradeProgress
# bin/hadoop dfsadmin -upgradeProgress force
-
Upgrading HDFS Versions (contd.)
bin/start-dfs.sh -rollback
This will restore the previous HDFS state.
Only one such archival copy can be kept at a time.
bin/hadoop dfsadmin -finalizeUpgrade
The rollback command cannot be issued after this point. This must be performed before a second Hadoop upgrade is allowed.
-
Getting help
bin/hadoop dfsadmin -help cmd
-
Using HDFS in MapReduce
The HDFS is a powerful companion to HadoopMapReduce.
By setting the fs.default.name configuration option to point to the NameNode , Hadoop MapReduce jobs will automatically draw their input files from HDFS. automatically draw their input files from HDFS.
Using the regular FileInputFormat subclasses, Hadoop will automatically draw its input data sources from file paths within HDFS, and will distribute the work over the cluster in an intelligent fashion to exploit block locality where possible.
-
Using HDFS Programmatically

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

public class HDFSHelloWorld {

  public static final String theFilename = "hello.txt";
  public static final String message = "Hello, world!\n";

  public static void main(String[] args) throws IOException {

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path filenamePath = new Path(theFilename);

    try {
      if (fs.exists(filenamePath)) {
        // remove the file first
        fs.delete(filenamePath);
      }

      FSDataOutputStream out = fs.create(filenamePath);
      out.writeUTF(message);
      out.close();

      FSDataInputStream in = fs.open(filenamePath);
      String messageIn = in.readUTF();
      System.out.print(messageIn);
      in.close();
    } catch (IOException ioe) {
      System.err.println("IOException during operation: " + ioe.toString());
      System.exit(1);
    }
  }
}
-
HDFS Permissions and Security
HDFS security is based on the POSIX model of users and groups.
Each file or directory has three permissions (read, write, and execute) associated with it at three different granularities: the file's owner, users in the same group as the owner, and all other users in the system.
As the HDFS does not provide the full POSIX spectrum of activity, some combinations of bits will be meaningless.
For example, no file can be executed; the +x bits cannot be set on files (only directories). Nor can an existing file be written to, although the +w bits may still be set.
-
HDFS Permissions and Security
Security permissions and ownership can be
modified using the
bin/hadoop dfs -chmod, -chown, and -chgrp commands;
they work in a similar fashion to the POSIX/Linux
tools of the same name.
-
Security
Superuser status - The username which was
used to start the Hadoop process (i.e., the
username who actually ran bin/start-all.sh or
bin/start-dfs.sh) is acknowledged to be the
superuser for HDFS.
If Hadoop is shut down and restarted under a
different username, that username is then
bound to the superuser account.
-
Tutorial
Showing Security and dfsadmin command
Tutorial-HDFSAdmin&SecurityComand.docx
-
Additional HDFS Tasks
Rebalancing Blocks
Copying Large Sets of Files
Decommissioning Nodes
Verifying File System Health
Rack Awareness
HDFS Web Interface
-
Rebalancing Blocks
New nodes can be added to a cluster in a straightforward manner.
On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed.
Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster. (The new node should be added to the slaves file on the master server as well, to inform the master how to invoke script-based commands on the new node.)
But the new DataNode will have no data on board initially; it is therefore not alleviating space concerns on the existing nodes.
New files will be stored on the new DataNode in addition to the existing ones, but for optimum usage, storage should be evenly balanced across all nodes.
-
Rebalancing Blocks
The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. (The default is 10%.)
The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory, e.g., bin/start-balancer.sh -threshold 5.
The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.
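The threshold test the balancer applies can be sketched as follows (a toy Python illustration under stated assumptions, not Hadoop's actual Balancer code): a node is over- or under-utilized when its disk usage deviates from the cluster average by more than the threshold percentage.

```python
# Toy sketch of the balancer's threshold test (illustrative only):
# a node is over- or under-utilized when its usage deviates from
# the cluster-wide average by more than the threshold percentage.

def classify_nodes(usage_by_node, threshold_pct=10.0):
    """usage_by_node maps node name -> disk usage in percent."""
    avg = sum(usage_by_node.values()) / len(usage_by_node)
    over = [n for n, u in usage_by_node.items() if u > avg + threshold_pct]
    under = [n for n, u in usage_by_node.items() if u < avg - threshold_pct]
    return over, under

# Example: a freshly added, empty node shows up as under-utilized,
# so the balancer would move blocks toward it.
usage = {"node1": 60.0, "node2": 55.0, "node3": 65.0, "new-node": 0.0}
over, under = classify_nodes(usage, threshold_pct=10.0)
print(over, under)  # ['node1', 'node3'] ['new-node']
```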
-
Rebalancing Blocks
The balancing script can be run when
nobody else is using the cluster (e.g.,
overnight), but can also be run in an "online"
fashion while many other jobs are on-going.
The dfs.balance.bandwidthPerSec configuration
parameter can be used to limit the number of
bytes/sec each node may devote to
rebalancing its data store.
-
Copying Large Sets of Files
Hadoop includes a tool called distcp.
bin/hadoop distcp src dest, Hadoop will start
a MapReduce task to distribute the burden of
copying a large number of files from src to copying a large number of files from src to
dest.
The paths are assumed to be directories, and
are copied recursively. S3 URLs can be
specified with s3://bucket-name/key.
-
Decommissioning Nodes
Nodes can also be removed from a cluster while it is running, without data loss.
But if nodes are simply shut down "hard," data loss may occur, as they may hold the sole copy of one or more file blocks.
Nodes must be retired on a schedule that allows HDFS to ensure that no blocks are entirely replicated within the to-be-retired set of DataNodes.
-
Decommissioning Nodes.. Steps
Step 1: Cluster configuration. If it is assumed that nodes may be retired in your cluster, then before it is started, an excludes file must be configured. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines which are not permitted to connect to HDFS.
Step 2: Determine hosts to decommission. Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude, one per line. This will prevent them from connecting to the NameNode.
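As a sketch, the Step 1 entry in conf/hadoop-site.xml might look like the following; the file path is an illustrative assumption:

```xml
<property>
  <name>dfs.hosts.exclude</name>
  <!-- illustrative path on the NameNode's local file system -->
  <value>/home/hadoop/conf/excludes</value>
</property>
```

The file named in the value then lists the hosts to decommission, one per line (Step 2).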
-
Decommissioning Nodes.. Steps
Step 3: Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-updated excludes file. It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.
Step 4: Shutdown nodes. After the decommission process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.
Step 5: Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after maintenance has been completed, or additional capacity is needed in the cluster again, etc.
-
Verifying File System Health
Hadoop provides an fsck command to do exactly this:
bin/hadoop fsck [path] [options]
e.g., bin/hadoop fsck / -files -blocks
By default, fsck will not operate on files still open for write by another client. A list of such files can be produced with the -openforwrite option
-
Rack Awareness
For larger Hadoop installations which span
multiple racks, it is important to ensure that
replicas of data exist on multiple racks.
HDFS can be made rack-aware by the use of a
script which allows the master node to map
the network topology of the cluster.
-
Rack Awareness
#!/bin/bash
# Set rack id based on IP address.
# Assumes network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.
##
# This is invoked with an IP address as its only argument
# get IP address from the input
ipaddr=$1
# select "x.y" and convert it to "x/y"
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}
-
HDFS Web Interface
HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations.
http://namenode:50070/
The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hadoop-site.xml.
It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.
-
HDFS Web Interface
Each DataNode exposes its file browser interface on port 50075.
You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075.
Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.
-
Tutorial
Copying Large Sets of Files
Verifying File System Health
HDFS Web Interface : features
Tutorial-HDFSMiscelle.docx
-
Lecture 2
MapReduce
-
Outline
MapReduce: Programming Model
MapReduce Examples
A Brief History
MapReduce Execution Overview
Hadoop
MapReduce Resources
-
MapReduce Basics
Designed to process large volumes of data in a parallel fashion.
All data elements in MapReduce are immutable.
MapReduce programs transform lists of input data elements into lists of output data elements, using two list-processing idioms: map and reduce.
-
Mapping Lists
-
Reducing Lists
-
Combination Map Reduce
-
MapReduce
A simple and powerful interface that enables
automatic parallelization and distribution of
large-scale computations, combined with an
implementation of this interface that achieves
high performance on large clusters of
commodity PCs.
Dean and Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google Inc.
-
MapReduce
More simply, MapReduce is: A parallel programming model and associated
implementation.
-
Programming Model
Description: the mental model the programmer has about the detailed execution of their application.
Purpose: improve programmer productivity.
Evaluation: expressibility, simplicity, performance.
-
Programming Models
Parallel Programming Models
Message passing: independent tasks encapsulating local data; tasks interact by exchanging messages.
Shared memory: tasks share a common address space; tasks interact by reading and writing this space asynchronously.
Data parallelization: tasks execute a sequence of independent operations; data is usually evenly partitioned across tasks. Also referred to as embarrassingly parallel.
-
MapReduce:
Programming Model
Process data using special map() and reduce()
functions
The map() function is called on every item in the input and
emits a series of intermediate key/value pairs
All values associated with a given key are grouped together
The reduce() function is called on every unique key, and its
value list, and emits a value that is added to the output
-
MapReduce:
Programming Model
[Diagram: word-count data flow. The input lines "How now brown cow" and "How does it work now" are split across Map tasks; the MapReduce framework groups the intermediate pairs by key; Reduce tasks emit the output counts: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1.]
-
MapReduce:
Programming Model
More formally,
Map(k1,v1) --> list(k2,v2)
Reduce(k2, list(v2)) --> list(v2)
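A minimal in-memory sketch of this model in Python (an illustration of the programming model, not Hadoop itself): map() emits (word, 1) pairs, the framework groups values by key, and reduce() sums each group.

```python
from collections import defaultdict

# Toy single-process MapReduce driver (illustrative only; real
# Hadoop distributes these phases across a cluster).

def map_fn(line):
    # Map(k1, v1) -> list(k2, v2): emit one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce(k2, list(v2)) -> list(v2): sum the counts for one word.
    return (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:
        for k, v in map_fn(item):     # map phase
            groups[k].append(v)       # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

print(map_reduce(["How now brown cow", "How does it work now"],
                 map_fn, reduce_fn))
```

This reproduces the word counts from the earlier word-count slide (brown 1, cow 1, does 1, How 2, it 1, now 2, work 1).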
-
MapReduce Runtime System
1. Partitions input data
2. Schedules execution across a set of
machines
3. Handles machine failures
4. Manages interprocess communication
-
MapReduce Benefits
Greatly reduces parallel programming complexity
Reduces synchronization complexity
Automatically partitions data
Provides failure transparency
Handles load balancing
Practical: approximately 1,000 Google MapReduce jobs run every day.
-
MapReduce Examples
Word frequency
[Diagram: word-frequency data flow. A doc feeds the Map function; the runtime system shuffles intermediate pairs to the Reduce function.]
-
MapReduce Examples
Distributed grep
Map function emits the line if it matches the
search criteria
Reduce function is the identity function
URL access frequency
Map function processes web logs, emits <URL, 1>
Reduce function sums values and emits <URL, total count>
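The distributed grep example above can be sketched with the same toy single-process driver idea (illustrative Python, not Hadoop): the map phase emits matching lines, and the reduce phase is the identity.

```python
import re

# Toy distributed grep (illustrative only): in real Hadoop the map
# tasks run in parallel over input shards.

def grep_map(line, pattern):
    # Emit the line itself when it matches the search criteria.
    return [line] if re.search(pattern, line) else []

def grep(lines, pattern):
    intermediate = []
    for line in lines:                          # map phase
        intermediate.extend(grep_map(line, pattern))
    return intermediate                         # identity reduce

print(grep(["error: disk full", "ok", "error: timeout"], r"error"))
# -> ['error: disk full', 'error: timeout']
```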
-
A Brief History
Functional programming (e.g., Lisp)
map() function
Applies a function to each value of a sequence
reduce() function
Combines all elements of a sequence using a binary
operator
-
MapReduce Execution Overview
1. The user program, via the MapReduce
library, shards the input data
[Diagram: the user program splits the input data into Shards 0-6.]
* Shards are typically 16-64 MB in size
-
MapReduce Execution Overview
2. The user program creates process copies
distributed on a machine cluster. One copy
will be the Master and the others will be
worker threads.
[Diagram: the user program forks a Master process and multiple Workers.]
-
MapReduce Execution Overview
3. The master distributes M map and R reduce
tasks to idle workers.
M == number of shards
R == the intermediate key space is divided into R
parts
[Diagram: the Master sends a do_map_task message to an idle worker.]
-
MapReduce Execution Overview
4. Each map-task worker reads assigned input
shard and outputs intermediate key/value
pairs.
Output is buffered in RAM.
[Diagram: a map worker reads Shard 0 and emits intermediate key/value pairs.]
-
MapReduce Execution Overview
5. Each worker flushes intermediate values,
partitioned into R regions, to disk and
notifies the Master process.
[Diagram: the map worker writes the partitioned output to local storage and sends the disk locations to the Master.]
-
MapReduce Execution Overview
6. Master process gives disk locations to an
available reduce-task worker who reads all
associated intermediate data.
[Diagram: the Master forwards the disk locations to a reduce worker, which reads the intermediate data from remote storage.]
-
MapReduce Execution Overview
7. Each reduce-task worker sorts its
intermediate data, then calls the reduce function,
passing in each unique key and its associated
values. The reduce function's output is appended
to the reduce task's partition output file.
[Diagram: the reduce worker sorts its data and writes the partition output file.]
-
MapReduce Execution Overview
8. Master process wakes up user process when
all tasks have completed. Output contained
in R output files.
[Diagram: the Master wakes up the user program; the results are contained in the R output files.]
-
MapReduce Execution Overview
Fault Tolerance
Master process periodically pings workers.
Map-task failure: re-execute the task (all of its
output was stored locally on the failed machine).
Reduce-task failure: only re-execute partially
completed tasks (completed output is stored in
the global file system).
-
Tutorial:
Running a map Reduce Program
Tutorial-MapReduce.docx