hdfs internals

HDFS Internals Bhupesh Chawda [email protected] DataTorrent

Upload: bhupesh-chawda

Post on 25-Jan-2017


Category: Software




Image Source: https://help.marklogic.com/news/list/Index/10

Agenda

What are Blocks?
● A physical storage disk has a block size: the minimum amount of data it can read or write. Normally 512 bytes.
● File systems for a single disk also deal with data in blocks, normally a few kilobytes (4 KB).
● Hadoop has a much larger block size. By default it is 64 MB (128 MB from Hadoop 2.x onward).
● Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
● However, a file smaller than the block size does not occupy an entire block's worth of storage.
  ○ Should I care?
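The chunking described on this slide can be illustrated with a small sketch. It is not HDFS code; the function name split_into_blocks and the 64 MB default are assumptions taken from the slide for illustration only.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=64 * MB):
    """Return the sizes (in bytes) of the chunks a file would be split into.

    Hypothetical helper: only the last chunk may be smaller than block_size,
    and it takes up only as many bytes as it actually holds.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 150 MB file becomes two full 64 MB blocks plus one 22 MB block.
print([size // MB for size in split_into_blocks(150 * MB)])  # [64, 64, 22]
```

A 1 MB file yields a single 1 MB chunk, which is the point of the last bullet: the small file does not consume 64 MB on disk.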

Why so large blocks?
● To minimize the cost of disk seeks relative to data transfer.
● Assuming 10 ms of seek time and a 100 MB/s disk transfer rate, a 100 MB block takes about 1 s to transfer, so the seek time is 1% of the transfer time, which is small enough to ignore.
● Hence the default is 64 MB (128 MB from Hadoop 2.x onward), while many production environments also use 128 MB.
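The arithmetic on this slide can be reproduced with a few lines. This is only a worked calculation using the slide's assumed figures (10 ms seek, 100 MB/s transfer); the function name seek_overhead is hypothetical.

```python
def seek_overhead(block_size_mb, seek_time_ms=10.0, transfer_rate_mb_s=100.0):
    """Return seek time as a fraction of the time needed to transfer one block."""
    transfer_time_ms = block_size_mb / transfer_rate_mb_s * 1000.0
    return seek_time_ms / transfer_time_ms


# Compare a few block sizes under the slide's assumptions.
for size_mb in (1, 64, 100, 128):
    print(f"{size_mb:>4} MB block: seek is {seek_overhead(size_mb):.2%} of transfer time")

# The 100 MB case reproduces the slide's 1% figure.
```

Under the same assumptions a 1 MB block would spend as long seeking as transferring, which is why small blocks are a poor fit for large sequential reads.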

HDFS Architecture

Image Source: https://hadoop.apache.org

Namenode and Datanode
● Master - Namenode
  ○ Manages the file system namespace
  ○ Maintains the file system tree and the metadata for all files and directories
  ○ Stores this information in:
    ■ Namespace image
    ■ Edit log
  ○ Knows, for a given file, which datanodes have the corresponding blocks. This block location information is not persisted; it is reconstructed at startup from the datanodes' reports.
● Worker - Datanode
  ○ Stores and retrieves blocks as requested by clients
  ○ Periodically reports back to the namenode with the list of blocks it is storing
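The division of state between namenode and datanodes can be sketched as a toy model. This is not the HDFS implementation; the class NameNodeSketch, its methods, and the block and datanode names are all hypothetical, chosen to show how block locations can be rebuilt purely from block reports.

```python
from collections import defaultdict


class NameNodeSketch:
    """Toy model of the namenode's two kinds of state (illustrative only)."""

    def __init__(self):
        # Persistent in real HDFS (namespace image + edit log): path -> ordered block ids.
        self.file_blocks = {}
        # Not persistent: block id -> datanodes holding it, rebuilt from block reports.
        self.block_locations = defaultdict(set)

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode, block_ids):
        # A report replaces whatever was previously known about this datanode.
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return (block id, sorted datanodes) pairs for every block of a file."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]


nn = NameNodeSketch()
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.receive_block_report("dn1", ["blk_1"])
nn.receive_block_report("dn2", ["blk_1", "blk_2"])
print(nn.locate("/logs/a.txt"))
```

Before any report arrives, locate returns every block with an empty datanode list, which mirrors a freshly started namenode that knows the namespace but not yet where the blocks live.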

HDFS Storage

Image Source: https://developer.yahoo.com/hadoop/tutorial/module2.html

Secondary Namenode

Image Source: http://www.quickmeme.com/meme/35ke38

Secondary Namenode
● Not a backup namenode
● Periodically merges the namespace image with the edit log, to prevent the edit log from becoming too large
● Usually runs on a different machine than the namenode
● The secondary, however, always lags behind the primary, and hence the merged copy cannot be used directly in case of primary failure
● In the event of primary failure, copy the primary's namespace image to the secondary and run it as the new primary.
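The merge performed by the secondary namenode can be pictured as replaying the edit log on top of the last namespace image. The sketch below is a simplification under stated assumptions: the namespace is a plain dict, and the three edit operations (create, delete, rename) and the function names are hypothetical, not the actual HDFS edit log format.

```python
def apply_edit(namespace, edit):
    """Apply one edit-log entry to an in-memory namespace (path -> block ids)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    merged = dict(image)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/a": ["blk_1"]}
edits = [
    ("create", "/b", ["blk_2"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
new_image, new_log = checkpoint(image, edits)
print(new_image)  # {'/c': ['blk_1']}
```

Any edits the primary records after the checkpoint was taken are missing from new_image, which is the lag the slide refers to.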

Writing a File in HDFS

Image Source: Hadoop The definitive guide, 4th edition

Reading a file in HDFS

Image Source: Hadoop The definitive guide, 4th edition

HDFS Block Placement

Image Source: Hadoop The definitive guide, 4th edition

Small File Problem?

Each file occupies namespace irrespective of file size!!

Image Source: http://www.bodhtree.com/blog/2012/09/28/hadoop-how-to-manage-huge-numbers-of-small-files-in-hdfs/
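A rough calculation shows why many small files strain the namenode. The figure of about 150 bytes of namenode memory per file, directory, or block is a commonly quoted rule of thumb and is treated here as an assumption, not a measured value; the function names are hypothetical.

```python
MB = 1024 * 1024
BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object


def namenode_objects(num_files, file_size, block_size=64 * MB):
    """Count the namespace objects (one per file plus one per block) for equal-sized files."""
    blocks_per_file = -(-file_size // block_size)  # integer ceiling division
    return num_files * (1 + blocks_per_file)


def namenode_memory_bytes(num_files, file_size, block_size=64 * MB):
    return namenode_objects(num_files, file_size, block_size) * BYTES_PER_OBJECT


# The same 1 GB of data stored as one large file or as 1024 files of 1 MB each.
one_large = namenode_memory_bytes(1, 1024 * MB)
many_small = namenode_memory_bytes(1024, 1 * MB)
print(one_large, many_small)  # 2550 307200
```

The data volume is identical in both cases, but the small-file layout needs roughly 120 times more namenode memory under these assumptions.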

Sample:

Thank You!!

Please send your questions to: [email protected]