Apache Hadoop - A Deep Dive (Part 1 - HDFS)


DESCRIPTION

This is the next tech talk in our series, in which we dive deep into the Apache Hadoop framework. Hadoop is undoubtedly the current industry leader in Big Data implementations. This tech talk covers core Hadoop and how it works. This is Part 1, which explains HDFS; the next tech talk will be Part 2, explaining MapReduce.

TRANSCRIPT

Page 1: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HADOOP - A DEEP DIVE

Debarchan Sarkar

Sunil Kumar Chakrapani

The call will start soon; please be on mute. Thanks for your time and patience.

Page 2: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

AGENDA

Recap - What is Big Data?

Problems Introduced

Traditional Architecture

Cluster Architecture

Where It All Started

How Does It Work - A 50,000-Feet Overview (Parts 1 & 2)

Hadoop Distributed Architecture

HDFS Architecture

Page 3: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

WHAT IS BIG DATA?

[Diagram: the spectrum of Big Data sources and volumes, from gigabytes (10^9) through terabytes (10^12) and petabytes (10^15) up to exabytes (10^18).]

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0 / Mobile: Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of Things: Audio/Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis/Blogs, Click Stream, Sensors/RFID/Devices, Spatial & GPS Coordinates

Big Data is characterized by Volume, Velocity, Variety, and Variability.

Storage cost per GB: 1980: $190,000 | 1990: $9,000 | 2000: $15 | 2010: $0.07

Page 4: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

STORAGE CAPACITY VS ACCESS SPEED

1990: a typical drive stores 1370 MB and reads at a 4.4 MB/s transfer rate, so reading it end to end takes about 5 minutes.

2010: 1 TB is the norm, read at a 100 MB/s transfer rate, so reading the full drive takes about 2.5 hours.
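The arithmetic behind these figures, as a quick sketch in plain JavaScript (numbers taken from the slide; the 2010 result rounds to the slide's 2.5 hours):

// Full-drive read time = capacity / transfer rate.
var secs1990 = 1370 / 4.4;      // 1370 MB at 4.4 MB/s ~= 311 s
var secs2010 = 1000000 / 100;   // 1 TB (1,000,000 MB) at 100 MB/s = 10,000 s
console.log((secs1990 / 60).toFixed(1) + " minutes");  // ~5.2 minutes
console.log((secs2010 / 3600).toFixed(1) + " hours");  // ~2.8 hours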

Page 5: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

READ 1 TB OF DATA

1 machine, 4 I/O channels, each channel at 100 MB/s: ~45 minutes.

10 machines, 4 I/O channels each, 100 MB/s per channel: ~4.5 minutes.
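The same arithmetic, generalized: assuming the read parallelizes perfectly across machines and channels, aggregate bandwidth scales linearly with machine count. A sketch:

// Aggregate bandwidth = machines x channels x per-channel rate (MB/s).
function readTimeMinutes(machines, channels, mbPerSec, totalMB) {
    return totalMB / (machines * channels * mbPerSec) / 60;
}
console.log(readTimeMinutes(1, 4, 100, 1000000));   // ~41.7 -> the slide's ~45 minutes
console.log(readTimeMinutes(10, 4, 100, 1000000));  // ~4.2  -> the slide's ~4.5 minutes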

Page 6: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HARDWARE FAILURE

As the number of machines grows, hardware failure becomes the norm rather than the exception. A common way of avoiding data loss is through replication.

Page 7: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

TRADITIONAL ARCHITECTURE

[Diagram: servers connected through a SAN to centralized storage.]

Page 8: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

CLUSTER ARCHITECTURE

[Diagram: a rack of ten 1U servers.]

Page 9: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

NUTCH IS WHERE IT ALL STARTED

Google File System  ->  HDFS: Hadoop Distributed File System

Google MapReduce    ->  Hadoop MapReduce

Page 10: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HOW DOES IT WORK - 1

Page 11: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HOW DOES IT WORK - 2

RUNTIME

// MapReduce functions in JavaScript

var map = function (key, value, context) {
    // Split the input line on non-letter characters and emit
    // (word, 1) for every non-empty word.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum the counts emitted for this word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
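To sanity-check the word-count logic outside a cluster, a toy harness can stand in for the runtime. The context object and value iterator below are stand-ins for what the JavaScript runtime would supply, not Hadoop APIs:

// Toy harness (not part of Hadoop): collect map output, group values
// by key, then feed each group to reduce.
var groups = {};
map(null, "Hadoop stores data and Hadoop processes data", {
    write: function (k, v) { (groups[k] = groups[k] || []).push(v); }
});

var results = {};
for (var key in groups) {
    var vals = groups[key], i = 0;
    reduce(key,
           { hasNext: function () { return i < vals.length; },
             next: function () { return vals[i++]; } },
           { write: function (k, v) { results[k] = v; } });
}
console.log(results); // { hadoop: 2, stores: 1, data: 2, and: 1, processes: 1 }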

Page 12: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HADOOP DISTRIBUTED ARCHITECTURE

[Diagram: a master node and multiple slave nodes, each split into a MapReduce layer and an HDFS layer. Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png]

In Hadoop 1.x the master runs the JobTracker (MapReduce layer) and the NameNode (HDFS layer), while each slave runs a TaskTracker and a DataNode.

Page 13: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HDFS ARCHITECTURE

File metadata (held by the NameNode):
/user/kc/data01.txt - Blocks 1, 2, 3, 4
/user/apb/data02.txt - Blocks 5, 6

[Diagram: blocks 1-6 distributed across the DataNodes of RACK 1 and RACK 2.]

Block locations:
Block 1: R1DN01, R1DN02, R2DN01
Block 2: R1DN01, R1DN02, R2DN03
Block 3: R1DN02, R1DN03, R2DN01

Note that each block has three replicas spread across both racks, two in one rack and one in the other, so the data survives the loss of an entire rack.
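Conceptually, the NameNode holds two in-memory maps: file -> block list, and block -> replica locations. A minimal sketch of this slide's metadata as plain JavaScript objects (illustrative only, not Hadoop's real data structures):

// File namespace: which blocks make up each file (from the slide).
var fileToBlocks = {
    "/user/kc/data01.txt":  [1, 2, 3, 4],
    "/user/apb/data02.txt": [5, 6]
};

// Block map: which DataNodes hold each replica (first three blocks shown).
var blockLocations = {
    1: ["R1DN01", "R1DN02", "R2DN01"],
    2: ["R1DN01", "R1DN02", "R2DN03"],
    3: ["R1DN02", "R1DN03", "R2DN01"]
};

// A read asks the NameNode for a file's blocks, then fetches each
// block from any DataNode holding a replica.
fileToBlocks["/user/kc/data01.txt"].slice(0, 3).forEach(function (b) {
    console.log("block " + b + ": " + blockLocations[b].join(", "));
});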

Page 14: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

BLOCK SIZE AND REPLICATION

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>   <!-- 128 MB per block -->
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>   <!-- three replicas of every block -->
</property>
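To make those values concrete: 134217728 bytes is 128 MB, so a hypothetical 1 GB file splits into 8 blocks and, at replication 3, occupies 24 block replicas across the cluster. A quick sketch:

// Assumptions: a hypothetical 1 GB file and the config values above.
var blockSize = 134217728;             // dfs.block.size: 128 MB
var replication = 3;                   // dfs.replication
var fileSize = 1024 * 1024 * 1024;     // 1 GB
var blocks = Math.ceil(fileSize / blockSize);
console.log(blocks);                   // 8 logical blocks
console.log(blocks * replication);     // 24 physical replicas cluster-wide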

Page 15: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

NAMENODE & SECONDARY NAMENODE

NameNode

• A client application creates a new file in HDFS.
• The NameNode logs that transaction in the edits file.
• On start-up, the NameNode reads the fsimage and edits files; the transactions in edits are merged into fsimage, and edits is emptied.

Secondary NameNode - Checkpoint

• The Secondary NameNode periodically creates checkpoints of the namespace.
• It downloads fsimage and edits from the active NameNode.
• It merges fsimage and edits locally.
• It uploads the new image back to the active NameNode.
• Checkpoint timing is controlled by fs.checkpoint.period and fs.checkpoint.size.
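Those two knobs use the same property format shown on the block size slide. As an illustration, the usual Hadoop 1.x defaults are below: checkpoint every hour, or as soon as edits reaches 64 MB, whichever comes first:

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>   <!-- seconds between checkpoints -->
</property>

<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>   <!-- edits size (64 MB) that forces a checkpoint -->
</property>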

Page 16: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

SAFE MODE

During start-up the NameNode loads the file system state from the fsimage and the edits log file, then waits for the DataNodes to report their blocks. During this time the NameNode stays in Safemode. Safemode is essentially a read-only mode for the HDFS cluster: it does not allow any modifications to the file system or its blocks. Normally the NameNode leaves Safemode automatically once the DataNodes have reported that most file system blocks are available.
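Safemode can also be inspected or toggled by hand with the dfsadmin tool; the Hadoop 1.x command names are shown (later releases use hdfs dfsadmin):

hadoop dfsadmin -safemode get    # report whether Safemode is on
hadoop dfsadmin -safemode wait   # block until the NameNode leaves Safemode
hadoop dfsadmin -safemode enter  # force Safemode on, e.g. for maintenance
hadoop dfsadmin -safemode leave  # force Safemode off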

Page 17: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HDFS WRITES

Step 1: The HDFS client caches the file data into a temporary local file.

[Diagram, Steps 2-5, involving the NameNode and DataNodes: once the local file accumulates a full block of data, the client contacts the NameNode (Step 2); the NameNode inserts the file name into the namespace, allocates a block, and replies with the identity of the target DataNode (Step 3); the client flushes the block to that DataNode (Step 4); when the file is closed, the remaining data is flushed and the NameNode commits the file creation (Step 5).]
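To watch this in practice, copy a file into the cluster and then ask the NameNode where its blocks landed (the paths here are hypothetical; the fsck flags are as in Hadoop 1.x):

hadoop fs -put data01.txt /user/kc/data01.txt
hadoop fsck /user/kc/data01.txt -files -blocks -locations

The fsck report lists each block of the file together with the DataNodes holding its replicas, mirroring the metadata shown on the HDFS architecture slide.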

Page 18: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

FEEDBACK

Support Team's blog: http://blogs.msdn.com/b/bigdatasupport/
Facebook Page: https://www.facebook.com/MicrosoftBigData
Facebook Group: https://www.facebook.com/groups/bigdatalearnings/
Twitter: @debarchans

Read more:
http://en.wikipedia.org/wiki/Hadoop
http://en.wikipedia.org/wiki/Big_data

Next Session: Apache Hadoop - MapReduce