HDFS User Reference

Biju Nair

Local File System

[Figure: local file system layout. A directory maps FileA, FileB and FileC to Inode-n, Inode-m and Inode-p. Each inode holds the file attributes and the addresses of the file's blocks (Block 0, Block 1, Block 2, ...). On disk: MBR and partition table; each partition holds a boot block, super block, free space tracking, i-nodes and the root directory.]

The file block size is the one chosen when the file system is defined.

Hadoop Distributed File System

[Figure: HDFS layered on top of the local file systems of the hosts.]

HDFS directory on the master host (NN):
  FileA     H1:blk0, H2:blk1
  FileB     H3:blk0, H1:blk1
  FileC     H2:blk0, H3:blk1

Local FS directory on each host (every HDFS block is stored as a local file system file on disk):
  Host 1    FileA0 (Inode-x), FileB1 (Inode-y)
  Host 2    FileA1 (Inode-a), FileC0 (Inode-n)
  Host 3    FileB0 (Inode-r), FileC1 (Inode-c)

The local files created are sized to the HDFS block size.
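The mapping from a file to block-sized local files can be illustrated with a little arithmetic. This is only a sketch: the 128 MB figure is the dfs.blocksize default quoted later in the deck, the function name is hypothetical, and it assumes the last block holds just the remaining bytes.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the length of each HDFS block a file of file_size bytes would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    # Assumption: the final block only holds the leftover bytes.
    return [block_size] * full_blocks + ([remainder] if remainder else [])

sizes = split_into_blocks(300 * MB)
print([s // MB for s in sizes])
```

Each entry in the result corresponds to one local file (blk_<id>) on some data node.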

HDFS

[Figure: HDFS components, their on-disk directories and the protocols between them.]

•  Data Node (three shown): ${dfs.data.dir}/current/
    –  VERSION
    –  blk_<id_1>, blk_<id_1>.meta
    –  ...
    –  subdir2/
•  Name Node: ${dfs.name.dir}/current/
    –  VERSION, edits, fsimage, fstime
•  Secondary Name Node: ${fs.checkpoint.dir}/current/
    –  VERSION, edits, fsimage, fstime
•  Client interfaces: Hadoop CLI, HDFS UI, WebHDFS
•  Protocols shown: RPC, HTTP, HTTP/S, and the HDFS Data Transfer Protocol to the Data Nodes

HDFS Config Files and Ports

•  Default configuration
    –  core-default.xml, hdfs-default.xml
•  Site-specific configuration
    –  core-site.xml, hdfs-site.xml under conf
•  Configuration of daemon processes
    –  hadoop-env.sh under conf
•  List of slave/data nodes
    –  “slaves” file under conf
•  Ports
    –  Default NN UI port: 50070 (HTTP), 50470 (HTTPS)
    –  Default NN port: 8020/9000
    –  Default DN UI port: 50075 (HTTP), 50475 (HTTPS)
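The relationship between the default and site-specific files can be sketched as a simple override: values in the *-site.xml file win over the *-default.xml values. The XML fragments and the host name below are hypothetical, and the sketch ignores details such as final properties and variable expansion.

```python
import xml.etree.ElementTree as ET

def parse_hadoop_conf(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config document."""
    props = {}
    for prop in ET.fromstring(xml_text).findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props

def effective_conf(default_xml, site_xml):
    """Site-specific values override the defaults."""
    merged = parse_hadoop_conf(default_xml)
    merged.update(parse_hadoop_conf(site_xml))
    return merged

# Hypothetical fragments standing in for core-default.xml and core-site.xml.
CORE_DEFAULT = """<configuration>
  <property><name>fs.defaultFS</name><value>file:///</value></property>
  <property><name>io.bytes.per.checksum</name><value>512</value></property>
</configuration>"""
CORE_SITE = """<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://nn.example.com:8020</value></property>
</configuration>"""

conf = effective_conf(CORE_DEFAULT, CORE_SITE)
print(conf["fs.defaultFS"])
```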

HDFS - Write Flow

[Figure: write flow between the Client, the Name Node (namespace metadata and block map; fsimage and edit files) and three Data Nodes, with arrows numbered 1 to 8 matching the steps below.]

1.  The client requests to open a file for writing through the fs.create() call. This will overwrite an existing file.
2.  The name node responds with a lease on the file path.
3.  The client writes locally and, when the data reaches the block size, asks the name node for a write.
4.  The name node responds with a new block id and the destination data nodes for the write and its replication.
5.  The client sends the first data node the data and the checksum generated on the data to be written.
6.  The first data node writes the data and checksum and, in parallel, pipelines the replication to the other DNs.
7.  Each data node where the data is replicated responds with success/failure to the first DN.
8.  The first data node in turn informs the name node that the write request for the block is complete, and the name node updates its block map.

Note: There can be only one writer at a time on a file.

HDFS - Read Flow

[Figure: read flow between the Client, the Name Node (namespace metadata and block map; fsimage and edit files) and the Data Nodes, with arrows numbered 1 to 7 matching the steps below.]

1.  The client requests to open a file for reading through the fs.open() call.
2.  The name node responds with a lease on the file path.
3.  The client requests to read the data in the file.
4.  The name node responds with the block ids in sequence and the corresponding data nodes.
5.  The client reaches out directly to the DNs for each block of data in the file.
6.  When a DN sends back data along with its checksum, the client verifies it by generating a checksum of its own.
7.  If the checksum verification fails, the client reaches out to other DNs that hold a replica of the block.
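Steps 6 and 7 of the read flow can be sketched as a loop over the replica locations. This is an in-memory illustration only: the fetch callback, node names and block id are hypothetical, and CRC32 over the whole block stands in for the per-chunk checksums HDFS keeps.

```python
import zlib

def read_block(block_id, locations, fetch):
    """Try each data node in turn; return the first copy whose checksum verifies."""
    for dn in locations:
        data, stored_crc = fetch(dn, block_id)
        if zlib.crc32(data) == stored_crc:  # client recomputes and compares
            return data, dn
    raise IOError(f"no valid replica found for {block_id}")

# Hypothetical replicas: dn1 holds a corrupted copy, dn2 a good one.
good = b"hello hdfs"
replicas = {
    "dn1": (b"hellX hdfs", zlib.crc32(good)),
    "dn2": (good, zlib.crc32(good)),
}

data, source = read_block("blk_1", ["dn1", "dn2"], lambda dn, _blk: replicas[dn])
print(source)
```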

HDFS - Name Node

•  Fsimage (metadata): namespace, ownership, permissions, create/modify/access time, is-hidden flag
•  EditFile (journal): changes to metadata
•  BlockMap (in-memory): details on file blocks and where they are stored

1.  The name node manages the HDFS file system using the fsimage/edit file and block map data structures.
2.  Fsimage and edit file data are stored on disk. When HDFS starts they are read, merged and held in memory.
3.  Data nodes send details about the blocks they store when they start and at regular intervals afterwards.
4.  The name node uses the block maps sent by the data nodes to build the BlockMap data structure.
5.  The BlockMap data is used when read requests on files come to the file system.
6.  The BlockMap data is also used to identify under/over-replicated files that require correction.
7.  At no point does the name node store file data locally or take part directly in transferring data between files and clients.
8.  A client reading/writing data receives metadata details from the NN and then works directly with the DNs.
9.  Name nodes require a large amount of memory since they need to hold all the in-memory data structures.
10.  If the NN is lost, the data in the file system can’t be accessed.

FS Meta Data Change Management

[Figure: at start-up the NameNode merges Fsimage (metadata) and EditFile (journal) into Fsimage_1 and EditFile_1; the Secondary NameNode performs the same merge periodically.]

1.  While HDFS is up and running, changes to file system metadata are stored in edit files.
2.  When the NN starts it looks for edit files in the system and merges their content with the fsimage on disk.
3.  The merging process creates a new fsimage and edit file, and discards the old fsimage and edit files.
4.  Since the edit files can be large for a very active HDFS cluster, the NN start-up can take a long time.
5.  The secondary name node, at regular intervals or after the edit file reaches a certain size, merges the edit file and the fsimage file.
6.  The merge process creates a new fsimage file and an edit file. The secondary NN copies the new fsimage file back to the NN.
7.  This shortens NN start-up, and the fsimage can also be used to restore the NN server after a failure.
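The merge of fsimage and edit file described above amounts to replaying a journal over a snapshot of the namespace. A toy sketch, assuming a namespace held as a dictionary and a made-up three-operation journal format (the real on-disk formats are binary and far richer):

```python
def apply_edits(fsimage, edits):
    """Replay journal entries over a copy of the namespace image.

    Returns the new fsimage and a fresh, empty edit log.
    """
    image = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[value] = image.pop(path)  # value holds the new path
        else:
            raise ValueError(f"unknown op {op!r}")
    return image, []

# Hypothetical namespace and journal.
fsimage = {"/a": {"replication": 3}}
edits = [
    ("create", "/b", {"replication": 3}),
    ("rename", "/a", "/c"),
    ("delete", "/b", None),
]

new_image, new_edits = apply_edits(fsimage, edits)
print(sorted(new_image))
```

The longer the journal, the longer the replay, which is why a periodic checkpoint keeps start-up short.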

HDFS - Data Node

[Figure: Data Nodes send heartbeats and block maps to the Name Node, which keeps the metadata and the BlockMap.]

1.  Data nodes store the blocks of data for each file stored in HDFS; the default block size is 128 MB.
2.  Blocks of data are replicated n times; the default is 3 times.
3.  Each data node periodically sends a heartbeat to the name node to inform the NN that it is alive.
4.  If the NN doesn’t receive a heartbeat, it marks the DN as dead and stops sending further requests to that DN.
5.  At periodic intervals, each data node also sends out a block map that includes all the file blocks it stores.
6.  When a DN is dead, all the files that had blocks stored on that DN are marked as under-replicated.
7.  The NN rectifies under-replication by replicating the blocks to other data nodes.
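How block maps from the data nodes lead to under-replication detection can be sketched as follows. Node names, block ids and function names are hypothetical; the replication target of 3 is the default quoted above.

```python
from collections import defaultdict

def build_block_map(block_reports):
    """Invert per-node block reports into {block id: set of nodes holding it}."""
    block_map = defaultdict(set)
    for node, blocks in block_reports.items():
        for blk in blocks:
            block_map[blk].add(node)
    return block_map

def under_replicated(block_map, live_nodes, replication=3):
    """Return {block id: number of missing replicas}, counting only live nodes."""
    missing = {}
    for blk, holders in block_map.items():
        live = len(holders & live_nodes)
        if live < replication:
            missing[blk] = replication - live
    return missing

# Hypothetical reports; dn4 has stopped sending heartbeats.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1", "blk_2"],
    "dn3": ["blk_1"],
    "dn4": ["blk_2"],
}
to_fix = under_replicated(build_block_map(reports), live_nodes={"dn1", "dn2", "dn3"})
print(to_fix)
```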

Ensuring Data Integrity

•  Through replication/replication assurance
    –  First replica closer to the client node
    –  Second replica on a different rack
    –  Third replica on the same rack as the second replica
•  File system checks run manually
•  Block scanning over a period of time
•  Storing checksums along with block data

Permission and Quotas

•  Files and directories use much of the POSIX model
    –  Associated with an owner and a group
    –  Permissions for owner, group and others
    –  r to read, w to write or append to files
    –  r to list files, w to delete/create files in dirs
    –  x to access child directories
    –  Sticky bit on dirs prevents deletions by others
    –  User identification can be simple (OS) or Kerberos
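The owner/group/others check can be sketched with plain bit arithmetic. This is an illustration of the POSIX-style model only: user and group names are made up, and the super-user, sticky bit and ACLs are deliberately left out.

```python
def is_allowed(mode, owner, group, user, user_groups, want):
    """Check one permission ('r', 'w' or 'x') against an octal mode such as 0o750."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # others
    return bool((mode >> shift) & bit)

# Hypothetical file: owner alice, group etl, mode rwxr-x---
print(is_allowed(0o750, "alice", "etl", "bob", {"etl"}, "r"))
```

Only one class of bits is tested per request: an owner is judged by the owner bits even if the group bits are more permissive.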

Permission and Quotas

•  Quota for number of files
    –  Name quota
    –  dfsadmin -setQuota <N> <dir>...<dir>
    –  dfsadmin -clrQuota <dir>...<dir>
•  Quota on the size of data
    –  Space quota can be set to restrict space usage
    –  dfsadmin -setSpaceQuota <N> <dir>...<dir>
    –  dfsadmin -clrSpaceQuota <dir>...<dir>
•  Replicated data also consumes quota
•  Reporting
    –  fs -count -q <dir>...<dir>
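Because replicas count against the space quota, a file consumes its size multiplied by its replication factor. A simplified sketch of that arithmetic (function names are hypothetical, and it ignores that HDFS checks quota as blocks are allocated rather than per finished file):

```python
GB = 1024 ** 3

def space_consumed(file_size, replication=3):
    """Bytes charged against a space quota for one file."""
    return file_size * replication

def fits_quota(used, quota, file_size, replication=3):
    """True if writing the file keeps the directory within its space quota."""
    return used + space_consumed(file_size, replication) <= quota

# A 2 GB file at replication 3 needs 6 GB of a 10 GB quota with 4 GB already used.
print(fits_quota(used=4 * GB, quota=10 * GB, file_size=2 * GB))
```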

HDFS snapshot

•  No copy of data blocks. Only the metadata (block list and file names) is copied
•  Allow snapshots on a directory
    –  hdfs dfsadmin -allowSnapshot <path>
•  Create snapshot
    –  hdfs dfs -createSnapshot <path> [<name>]
    –  Default name is ‘s’ + timestamp
•  Verify snapshot
    –  hadoop fs -ls <path>/.snapshot
•  A directory with snapshots can’t be deleted or renamed
•  Disallow snapshots
    –  hdfs dfsadmin -disallowSnapshot <path>
    –  All existing snapshots need to be deleted before disallowing
•  Delete snapshot
    –  hdfs dfs -deleteSnapshot <path> <name>
•  Rename snapshot
    –  hdfs dfs -renameSnapshot <path> <oldname> <newname>
•  Snapshot differences
    –  hdfs snapshotDiff <path> <starting snapshot name> <ending snapshot name>
•  List all snapshottable directories
    –  hdfs lsSnapshottableDir

HDFS back-up using snapshot

•  Create a snapshot on the source cluster
•  Perform a “distcp” of the snapshot to the backup cluster
•  Create a snapshot of the copy on the backup cluster
•  Clean up any old back-up copies to comply with the enterprise retention policy
•  The reverse can be followed to recover data from the backup
    –  Data needs to be removed on the production cluster before the restore
    –  During deletion, the -skipTrash option of “rm” will help reduce space usage

distcp

•  Tool to perform inter- and intra-cluster copy of data
•  Utilizes MapReduce to perform the copy
•  It can be used to
    –  Copy data within a cluster
    –  Copy data between clusters
    –  Copy files or directories
    –  Copy data from multiple sources
•  Can be used to create a backup cluster
•  Starts up containers on both source and target
•  Consumes network bandwidth between clusters
•  Needs to be scheduled at an appropriate time
•  Can control resource utilization using parameters

distcp

•  hadoop distcp [options] <srcURL> … <srcURL> <destURL>
    –  Source paths need to be absolute
    –  Destination directory will be created if not present
    –  “update” option will update only the changed files
    –  “skipcrccheck” option disables the checksum comparison
    –  “overwrite” option overwrites existing files, which by default are skipped if present
    –  “delete” option deletes files in the destination that are not in the source
    –  “hftp” fs needs to be used to copy between different versions of HDFS
    –  “m” option specifies the number of mappers

distcp

    –  “atomic” option to commit all changes or none
    –  “async” to run distcp asynchronously, i.e. non-blocking
    –  “i” option to ignore failures during copy
    –  “log” directory on DFS where logs are to be saved
    –  “p [rbugp]” to preserve file status as in the source
    –  “strategy [static|dynamic]”
    –  “bandwidth [MB]” bandwidth per map in MB

HDFS JAVA APIs

Function                 API
Directory Create         FileSystem.mkdirs(path, permission)
Directory Rename/Move    FileSystem.rename(oldpath, newpath)
Directory Delete         FileSystem.delete(path, true)
File Create              FileSystem.createNewFile(path)
File Open                FileSystem.open(path)
File Read                FSDataInputStream.read*
File Write               FSDataOutputStream.write*
File Rename/Move         FileSystem.rename(oldpath, newpath)
File Delete              FileSystem.delete(path, false)
File Append              FileSystem.append(path)
File Seek                FSDataInputStream.seek(long)
File System              FileSystem.get(conf)
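Several of these operations are also reachable over WebHDFS, which the architecture slide lists as a client interface. The sketch below only builds the request line for a few of them and sends nothing; the host name is hypothetical, the port is the default NN HTTP port quoted earlier, and the helper name is made up.

```python
from urllib.parse import quote, urlencode

# HTTP method and WebHDFS "op" value for a few operations from the table above.
OPS = {
    "mkdirs": ("PUT", "MKDIRS"),
    "rename": ("PUT", "RENAME"),
    "delete": ("DELETE", "DELETE"),
    "open": ("GET", "OPEN"),
}

def webhdfs_request(host, path, action, port=50070, **params):
    """Return (HTTP method, URL) for a WebHDFS call without sending it."""
    method, op = OPS[action]
    query = urlencode({"op": op, **params})
    return method, f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

method, url = webhdfs_request(
    "nn.example.com", "/user/demo/old", "rename", destination="/user/demo/new"
)
print(method, url)
```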

HDFS Federation

[Figure: HDFS without and with federation. Diagram source: hadoop.apache.org, JIRA HDFS-1052]

HDFS without Federation
-  Namespace management and block management together
-  Supports one namespace
-  Hinders scalability above 4000 nodes
-  Doesn’t support some multi-tenancy requirements

HDFS with Federation
-  Namespace management and block management separated
-  Block management can run on a node of its own
-  Supports more than one namespace/NN
-  Scalable beyond 4000 nodes and millions of rows
-  Can meet multi-tenancy requirements such as an NN for specific departments, and isolation
-  A namespace and its block pool together are called a namespace volume

Enabling HDFS federation

•  Identify a unique cluster id
•  Identify nameservice ids for the name nodes
•  Add dfs.nameservices to hdfs-site.xml
    –  Comma-separated nameservice (ns) names
•  Update hdfs-site.xml on all NNs and DNs
    –  dfs.namenode.rpc-address.ns
    –  dfs.namenode.http-address.ns
    –  dfs.namenode.servicerpc-address.ns
    –  dfs.namenode.https-address.ns
    –  dfs.namenode.secondary.http-address.ns
    –  dfs.namenode.backup.address.ns
•  Format all name nodes using the cluster id
    –  hdfs namenode -format -clusterId <cluster id>

HDFS Rack Awareness

•  Rack awareness enables efficient data placement
    –  Data writes
    –  Balancer
    –  Decommissioning/commissioning of nodes
•  Each node is assigned to a rack (rack id)
    –  Rack id is used in the path names
•  Data placement
    –  First replica of a block is placed near the client or on a random node/rack
    –  Second replica of the block is placed on a node in a second rack
    –  Third replica is placed on a different node in the second rack
    –  If HDFS is not rack aware, second and third replicas are placed on random nodes
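The placement rules above can be sketched as a small selection function. It is a simplified illustration, not the name node's actual policy code: the topology, host names and function name are hypothetical, and node load and free space are ignored.

```python
import random

def place_replicas(writer, topology, rng=random):
    """Pick up to three target nodes for one block.

    topology maps node name -> rack id.
    """
    nodes = list(topology)
    # First replica: the writer's own node if it is a data node, else a random node.
    first = writer if writer in topology else rng.choice(nodes)
    targets = [first]
    off_rack = [n for n in nodes if topology[n] != topology[first]]
    if off_rack:
        # Second replica: a node on a different rack.
        second = rng.choice(off_rack)
        targets.append(second)
        # Third replica: a different node on the second replica's rack.
        same_rack = [n for n in nodes
                     if topology[n] == topology[second] and n not in targets]
        if same_rack:
            targets.append(rng.choice(same_rack))
    # Fill any remaining slots (single rack, or a one-node second rack) at random.
    spare = [n for n in nodes if n not in targets]
    while len(targets) < 3 and spare:
        targets.append(spare.pop(rng.randrange(len(spare))))
    return targets

TOPOLOGY = {"h1": "/rack1", "h2": "/rack1", "h3": "/rack2", "h4": "/rack2"}
targets = place_replicas("h1", TOPOLOGY, random.Random(0))
print(targets)
```

With two racks of two nodes each, a write from h1 always ends up with one replica on h1 and the other two on h3 and h4, in either order.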

Enabling HDFS Rack Awareness

•  Update core-site.xml with topology properties
    –  topology.script.file.name
        •  Script can be a shell script, Python, Java
    –  topology.script.number.args
•  Copy the script to the conf directory
•  Distribute the script and core-site.xml
•  Stop and start the name node
•  Verify that the racks are recognized by HDFS
    –  hdfs fsck -racks
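A topology script receives host names or IP addresses as arguments and prints one rack id per argument. A minimal Python sketch, assuming a hard-coded lookup table; the addresses, rack names and default rack are made up for illustration:

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping of data node addresses to rack ids.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}
DEFAULT_RACK = "/default-rack"  # assumed fallback for unknown hosts

def resolve(hosts):
    """Return the rack id for each host, in the order given."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]

if __name__ == "__main__":
    # One rack id per argument, separated by spaces.
    print(" ".join(resolve(sys.argv[1:])))
```

The number of arguments passed per invocation is bounded by topology.script.number.args.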

HDFS NFS Gateway

•  Allows HDFS to be mounted as part of the local FS
•  Stateless daemon translates NFS to the HDFS access protocol
•  DFSClient is part of the gateway daemon
    –  Averages 30 MB/s for writes
•  Multiple gateways can be used for scalability
•  Gateway machine requires all software and configs like an HDFS client
    –  Gateway can be run on HDFS cluster nodes
•  Random writes are not supported

[Figure: a client talks NFSv3 to the NFS Gateway (DFSClient), which uses RPC and HDFS protocols to reach the NN and the three DNs of the HDFS cluster.]

HDFS NFS Gateway Configuration

•  Consists of two daemons
    –  portmap and nfs3
•  Configuration
    –  dfs.namenode.accesstime.precision; 3600000 (1 hr)
        •  Requires a name node restart
    –  dfs.nfs3.dump.dir; dir to store out-of-sequence data
        •  Needs enough space to store data for all concurrent file writes
        •  Use NFS for smaller file transfers, in the order of 1 GB
    –  dfs.nfs.exports.allowed.hosts; host access
        •  client*.abc.com r;client*.xyc.com rw
    –  Update log4j.properties file
        •  log4j.logger.org.apache.hadoop.hdfs.nfs=DEBUG
        •  log4j.logger.org.apache.hadoop.oncrpc=DEBUG

HDFS NFS Gateway Configuration

•  Stop nfs & rpcbind services provided by the OS
    –  service nfs stop
    –  service rpcbind stop
•  Start hadoop portmap as root
    –  hadoop-daemon.sh start portmap
    –  To stop, use “stop” instead of “start” as the parameter
•  Start mountd and nfsd as the user starting HDFS
    –  hadoop-daemon.sh start nfs3
    –  To stop, use “stop” instead of “start” as the parameter

HDFS NFS Gateway Configuration

•  Validate NFS services are running
    –  rpcinfo -p $nfs_server_ip
    –  Should see entries for mountd, portmapper & nfs
•  Verify HDFS namespace is exported for mount
    –  showmount -e $nfs_server_ip
    –  Should see the export list
•  Mount HDFS on client
    –  Create a mount point as root
    –  Change ownership of the mount point to the user running the HDFS cluster
    –  mount -t nfs -o vers=3,proto=tcp,nolock $nfs_server:/ $mount_point
    –  Client sends the UID of the user to NFS
    –  NFS looks up the username for the UID and uses it to access HDFS
    –  User name and UID should be the same on the client and NFS

HDFS Name Node HA

[Figure: an Active Name Node and a Passive Name Node share storage; each has a ZKFC connected by heartbeat (HB) to a Zookeeper quorum of three ZK nodes.]

•  Zookeeper does failure detection and helps with active name node election
•  ZKFC: ZooKeeper Failover Controller
    –  Monitors the health of the name node
    –  Holds a session open on ZK and a lock for the active NN
    –  If no other NN holds the zlock, it tries to acquire it to make its NN active
•  Shared storage can be an NFS mount or a quorum of journal nodes
•  Fencing is defined to prevent the split-brain scenario of two NNs writing

HDFS NN HA Configuration

•  Define dfs.nameservices
    –  Nameservice ID
•  Define dfs.ha.namenodes.[nameservice ID]
    –  Comma-separated list of name node IDs
•  Define dfs.namenode.rpc-address.[nameservice ID].[name node ID]
    –  Fully qualified machine name and port
•  Define dfs.namenode.http-address.[nameservice ID].[name node ID]
    –  Fully qualified machine name and port
•  Define dfs.namenode.shared.edits.dir
    –  For NFS: file:///mnt/...
    –  For journal nodes: qjournal://node1:8485;node2.com:8485;
•  Define dfs.client.failover.proxy.provider.[nameservice ID]
    –  org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
•  Define dfs.ha.fencing.methods
    –  sshfence; requires passwordless ssh into the name nodes from one another
    –  shell
•  Define fs.defaultFS as the HA-enabled logical URI
•  For journal nodes
    –  Define dfs.journalnode.edits.dir, where edits and other local state used by the JNs will be stored
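The way the nameservice ID and name node IDs are woven into the property names can be sketched by generating them. Everything concrete below (nameservice name, host names, journal URI, port defaults) is a hypothetical example; fs.defaultFS is included for completeness although it lives in core-site.xml rather than hdfs-site.xml.

```python
def ha_properties(nameservice, namenodes, shared_edits_uri,
                  rpc_port=8020, http_port=50070):
    """Build the HA property set for one nameservice.

    namenodes maps name node ID -> fully qualified host name.
    """
    props = {
        "dfs.nameservices": nameservice,
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
        "dfs.namenode.shared.edits.dir": shared_edits_uri,
        f"dfs.client.failover.proxy.provider.{nameservice}":
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "dfs.ha.fencing.methods": "sshfence",
        "fs.defaultFS": f"hdfs://{nameservice}",  # logical URI, set in core-site.xml
    }
    for nn_id, host in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"] = f"{host}:{rpc_port}"
        props[f"dfs.namenode.http-address.{nameservice}.{nn_id}"] = f"{host}:{http_port}"
    return props

props = ha_properties(
    "mycluster",
    {"nn1": "nn1.example.com", "nn2": "nn2.example.com"},
    "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster",
)
for name in sorted(props):
    print(name, "=", props[name])
```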

HDFS NN HA Configuration

•  Define dfs.ha.automatic-failover.enabled
    –  Set to true
•  Define ha.zookeeper.quorum
    –  Host and port of ZK
•  To enable HA in an existing cluster
    –  Run hdfs dfsadmin -safemode enter
    –  Run hdfs dfsadmin -saveNamespace
    –  Stop the HDFS cluster: stop-dfs.sh
    –  Start the journal node daemons: hadoop-daemon.sh start journalnode
    –  Run hdfs zkfc -formatZK on the existing NN
    –  Run hdfs namenode -initializeSharedEdits on the existing NN
    –  Run hdfs namenode -bootstrapStandby on the new NN
    –  Delete the secondary name node
    –  Start the HDFS cluster: start-dfs.sh

hdfs haadmin

•  -ns <nameserviceId>
•  -transitionToActive <serviceId>
•  -transitionToStandby <serviceId>
•  -failover <serviceId> <serviceId>
    –  [--forcefence] [--forceactive]
•  -getServiceState <serviceId>
•  -checkHealth <serviceId>
•  -help <command>

hdfs dfsadmin

•  -report
•  -safemode [enter|leave|get|wait]
•  -finalizeUpgrade
•  -refreshNodes uses files defined in dfs.hosts & dfs.hosts.exclude
•  -upgradeProgress status
•  -metasave
•  -setQuota <quota>/-clrQuota <dirname>…<dirname>
•  -lsr (a hadoop fs command, not dfsadmin)
•  -setrep [-w] <rep> <path/file> (a hadoop fs command, not dfsadmin)

hdfs fsck

•  hdfs fsck [options] path
    –  -move
    –  -delete
    –  -openforwrite
    –  -files
    –  -blocks
    –  -locations
    –  -racks

Balancer

•  start-balancer.sh
    –  -policy datanode|blockpool
    –  -threshold <percentage>; default 10%
    –  dfs.datanode.balance.bandwidthPerSec specified in bytes
        •  Default 1 MB/sec
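The threshold can be read as follows: a data node counts as balanced when its utilization is within the threshold percentage of the overall cluster utilization. A sketch of that check, with hypothetical node names and capacities:

```python
def classify(nodes, threshold=10.0):
    """Label each node relative to the cluster-wide utilization.

    nodes maps node name -> (used bytes, capacity bytes).
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(cap for _, cap in nodes.values())
    cluster_util = 100.0 * total_used / total_capacity
    labels = {}
    for name, (used, cap) in nodes.items():
        util = 100.0 * used / cap
        if util > cluster_util + threshold:
            labels[name] = "over-utilized"
        elif util < cluster_util - threshold:
            labels[name] = "under-utilized"
        else:
            labels[name] = "balanced"
    return cluster_util, labels

# Hypothetical cluster: capacities of 100 units each.
cluster_util, labels = classify({"dn1": (90, 100), "dn2": (50, 100), "dn3": (10, 100)})
print(cluster_util, labels)
```

The balancer's job is then to move blocks from the over-utilized nodes to the under-utilized ones, within the configured bandwidth.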

Adding New Nodes

•  Add node address to dfs.hosts file
    –  Update mapred.hosts file if using mapred
•  Update namenode with the new set of nodes
    –  hadoop dfsadmin -refreshNodes
    –  Update jobtracker with the new set of nodes
        •  hadoop mradmin -refreshNodes
•  Update “slaves” file with the new node names
•  Start new datanodes (and tasktrackers)
•  Check the availability of the new nodes in the UI
•  Run balancer so that data is distributed

Decommissioning Nodes

•  Add node address to exclude file
    –  dfs.hosts.exclude
    –  mapred.hosts.exclude
•  Update namenode (and jobtracker)
    –  hadoop dfsadmin -refreshNodes
    –  hadoop mradmin -refreshNodes
•  Verify all the nodes are decommissioned (UI)
•  Remove nodes from dfs.hosts (and mapred.hosts) file
•  Update namenode (and jobtracker)
•  Remove nodes from the “slaves” file

HDFS Upgrade

•  No file system layout change
    –  Install new version of HDFS (and MapReduce)
    –  Stop the old daemons
    –  Update the configuration files
    –  Start the new daemons
    –  Update clients to use the new libraries
    –  Remove the old install and the configuration files
    –  Update application code for deprecated APIs

HDFS Upgrade

•  With file system layout changes
    –  When there is a layout change the NN will not start
    –  Run fsck to make sure that the FS is healthy
    –  Keep a copy of the fsck output for verification
    –  Clear HDFS and MapReduce temporary files
    –  Make sure that any previous upgrade is finalized
    –  Shut down MapReduce and kill orphaned tasks
    –  Shut down HDFS and make a copy of the NN directories
    –  Install new versions of HDFS and MapReduce
    –  Start HDFS with the -upgrade option
        •  start-dfs.sh -upgrade
    –  Once the upgrade is complete perform manual spot checks
        •  hadoop dfsadmin -upgradeProgress status
    –  Start MapReduce
    –  Roll back or finalize the upgrade
        •  stop-dfs.sh; start-dfs.sh -rollback
        •  hadoop dfsadmin -finalizeUpgrade

Key Parameters

Parameter                         Description                                                         Default Value
dfs.blocksize                     File block size                                                     128 MB
dfs.replication                   File block replication count                                        3
dfs.datanode.numblocks            No. of blocks after which a new sub directory gets created in DN
io.bytes.per.checksum             Number of data bytes for which a checksum is calculated             512
dfs.datanode.scan.period.hours    Timeframe in hours to complete block scanning                       504 (3 weeks)
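Two of these defaults combine into a quick estimate of how much checksum data accompanies a full block. The sketch assumes each checksum is a 4-byte CRC32 and ignores any file header in the .meta file; both are assumptions, not values from the table.

```python
import math

MB = 1024 * 1024

def checksum_overhead(block_size=128 * MB, bytes_per_checksum=512, checksum_bytes=4):
    """Return (number of checksums, total checksum bytes) for one full block."""
    chunks = math.ceil(block_size / bytes_per_checksum)
    return chunks, chunks * checksum_bytes

chunks, overhead = checksum_overhead()
print(chunks, overhead / MB)
```

Under these assumptions a 128 MB block carries about 1 MB of checksums, which is under 1% of the data.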


[email protected]

blog.asquareb.com

https://github.com/bijugs

@gsbiju
