Apache Hadoop - A Deep Dive (Part 1 - HDFS)


DESCRIPTION

This is the next tech talk in our series, in which we dive deep into the Apache Hadoop framework. Hadoop is undoubtedly the current industry leader in Big Data implementations. This tech talk covers core Hadoop and how it works. This is Part 1, which explains HDFS; the next tech talk will be Part 2, explaining MapReduce.

TRANSCRIPT

Page 1: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HADOOP - A DEEP DIVE

Debarchan Sarkar

Sunil Kumar Chakrapani

The call will start soon; please be on mute. Thanks for your time and patience.

Page 2: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

AGENDA

Recap - What is Big Data?

Problems Introduced

Traditional Architecture

Cluster Architecture

Where It All Started

How Does It Work - A 50,000-Feet Overview (Parts 1 & 2)

Hadoop Distributed Architecture

HDFS Architecture

Page 3: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

WHAT IS BIG DATA?

[Diagram: the spectrum of Big Data sources and volumes, from gigabytes (10^9) through terabytes (10^12) and petabytes (10^15) up to exabytes (10^18).]

ERP / CRM: Sales Pipeline, Payables, Payroll, Inventory, Contacts, Deal Tracking

WEB 2.0 / Mobile: Advertising, Collaboration, eCommerce, Digital Marketing, Search Marketing, Web Logs, Recommendations

Internet of Things: Audio/Video, Log Files, Text/Image, Social Sentiment, Data Market Feeds, eGov Feeds, Weather, Wikis/Blogs, Click Stream, Sensors/RFID/Devices, Spatial & GPS Coordinates

Big Data is characterized by Volume, Velocity, Variety, and Variability.

Storage cost per GB: 1980: $190,000 | 1990: $9,000 | 2000: $15 | 2010: $0.07

Page 4: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

STORAGE CAPACITY VS ACCESS SPEED

1990: a typical drive stores 1370 MB and reads at a 4.4 MB/s transfer rate, so reading it end to end takes about 5 minutes.

2010: 1 TB is the norm, read at a 100 MB/s transfer rate, so reading the full drive takes about 2.5 hours.
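The arithmetic behind these figures, as a quick sketch in plain JavaScript (numbers taken from the slide; the 2010 result rounds to the slide's 2.5 hours):

// Full-drive read time = capacity / transfer rate.
var secs1990 = 1370 / 4.4;      // 1370 MB at 4.4 MB/s ~= 311 s
var secs2010 = 1000000 / 100;   // 1 TB (1,000,000 MB) at 100 MB/s = 10,000 s
console.log((secs1990 / 60).toFixed(1) + " minutes");  // ~5.2 minutes
console.log((secs2010 / 3600).toFixed(1) + " hours");  // ~2.8 hours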

Page 5: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

READ 1 TB OF DATA

1 machine, 4 I/O channels, each channel at 100 MB/s: ~45 minutes.

10 machines, 4 I/O channels each, 100 MB/s per channel: ~4.5 minutes.
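The same arithmetic, generalized: assuming the read parallelizes perfectly across machines and channels, aggregate bandwidth scales linearly with machine count. A sketch:

// Aggregate bandwidth = machines x channels x per-channel rate (MB/s).
function readTimeMinutes(machines, channels, mbPerSec, totalMB) {
    return totalMB / (machines * channels * mbPerSec) / 60;
}
console.log(readTimeMinutes(1, 4, 100, 1000000));   // ~41.7 -> the slide's ~45 minutes
console.log(readTimeMinutes(10, 4, 100, 1000000));  // ~4.2  -> the slide's ~4.5 minutes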

Page 6: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HARDWARE FAILURE

As the number of machines grows, hardware failure becomes the norm rather than the exception. A common way of avoiding data loss is through replication.

Page 7: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

TRADITIONAL ARCHITECTURE

[Diagram: servers connected through a SAN to centralized storage.]

Page 8: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

CLUSTER ARCHITECTURE

[Diagram: a rack of ten 1U servers.]

Page 9: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

NUTCH IS WHERE IT ALL STARTED

Google File System  ->  HDFS: Hadoop Distributed File System

Google MapReduce    ->  Hadoop MapReduce

Page 10: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HOW DOES IT WORK - 1

Page 11: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HOW DOES IT WORK - 2

RUNTIME

// MapReduce functions in JavaScript

var map = function (key, value, context) {
    // Split the input line on non-letter characters and emit
    // (word, 1) for every non-empty word.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    // Sum the counts emitted for this word.
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
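To sanity-check the word-count logic outside a cluster, a toy harness can stand in for the runtime. The context object and value iterator below are stand-ins for what the JavaScript runtime would supply, not Hadoop APIs:

// Toy harness (not part of Hadoop): collect map output, group values
// by key, then feed each group to reduce.
var groups = {};
map(null, "Hadoop stores data and Hadoop processes data", {
    write: function (k, v) { (groups[k] = groups[k] || []).push(v); }
});

var results = {};
for (var key in groups) {
    var vals = groups[key], i = 0;
    reduce(key,
           { hasNext: function () { return i < vals.length; },
             next: function () { return vals[i++]; } },
           { write: function (k, v) { results[k] = v; } });
}
console.log(results); // { hadoop: 2, stores: 1, data: 2, and: 1, processes: 1 }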

Page 12: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HADOOP DISTRIBUTED ARCHITECTURE

[Diagram: a master node and multiple slave nodes, each split into a MapReduce layer and an HDFS layer. Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png]

In Hadoop 1.x the master runs the JobTracker (MapReduce layer) and the NameNode (HDFS layer), while each slave runs a TaskTracker and a DataNode.

Page 13: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HDFS ARCHITECTURE

File metadata (held by the NameNode):
/user/kc/data01.txt - Blocks 1, 2, 3, 4
/user/apb/data02.txt - Blocks 5, 6

[Diagram: blocks 1-6 distributed across the DataNodes of RACK 1 and RACK 2.]

Block locations:
Block 1: R1DN01, R1DN02, R2DN01
Block 2: R1DN01, R1DN02, R2DN03
Block 3: R1DN02, R1DN03, R2DN01

Note that each block has three replicas spread across both racks, two in one rack and one in the other, so the data survives the loss of an entire rack.
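Conceptually, the NameNode holds two in-memory maps: file -> block list, and block -> replica locations. A minimal sketch of this slide's metadata as plain JavaScript objects (illustrative only, not Hadoop's real data structures):

// File namespace: which blocks make up each file (from the slide).
var fileToBlocks = {
    "/user/kc/data01.txt":  [1, 2, 3, 4],
    "/user/apb/data02.txt": [5, 6]
};

// Block map: which DataNodes hold each replica (first three blocks shown).
var blockLocations = {
    1: ["R1DN01", "R1DN02", "R2DN01"],
    2: ["R1DN01", "R1DN02", "R2DN03"],
    3: ["R1DN02", "R1DN03", "R2DN01"]
};

// A read asks the NameNode for a file's blocks, then fetches each
// block from any DataNode holding a replica.
fileToBlocks["/user/kc/data01.txt"].slice(0, 3).forEach(function (b) {
    console.log("block " + b + ": " + blockLocations[b].join(", "));
});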

Page 14: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

BLOCK SIZE AND REPLICATION

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>   <!-- 128 MB per block -->
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>   <!-- three replicas of every block -->
</property>
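To make those values concrete: 134217728 bytes is 128 MB, so a hypothetical 1 GB file splits into 8 blocks and, at replication 3, occupies 24 block replicas across the cluster. A quick sketch:

// Assumptions: a hypothetical 1 GB file and the config values above.
var blockSize = 134217728;             // dfs.block.size: 128 MB
var replication = 3;                   // dfs.replication
var fileSize = 1024 * 1024 * 1024;     // 1 GB
var blocks = Math.ceil(fileSize / blockSize);
console.log(blocks);                   // 8 logical blocks
console.log(blocks * replication);     // 24 physical replicas cluster-wide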

Page 15: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

NAMENODE & SECONDARY NAMENODE

NameNode

• A client application creates a new file in HDFS.
• The NameNode logs that transaction in the edits file.
• On start-up, the NameNode reads the fsimage and edits files; the transactions in edits are merged into fsimage, and edits is emptied.

Secondary NameNode - Checkpoint

• The Secondary NameNode periodically creates checkpoints of the namespace.
• It downloads fsimage and edits from the active NameNode.
• It merges fsimage and edits locally.
• It uploads the new image back to the active NameNode.
• Checkpoint timing is controlled by fs.checkpoint.period and fs.checkpoint.size.
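Those two knobs use the same property format shown on the block size slide. As an illustration, the usual Hadoop 1.x defaults are below: checkpoint every hour, or as soon as edits reaches 64 MB, whichever comes first:

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>   <!-- seconds between checkpoints -->
</property>

<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>   <!-- edits size (64 MB) that forces a checkpoint -->
</property>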

Page 16: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

SAFE MODE

During start-up the NameNode loads the file system state from the fsimage and the edits log file, then waits for the DataNodes to report their blocks. During this time the NameNode stays in Safemode. Safemode is essentially a read-only mode for the HDFS cluster: it does not allow any modifications to the file system or its blocks. Normally the NameNode leaves Safemode automatically once the DataNodes have reported that most file system blocks are available.
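Safemode can also be inspected or toggled by hand with the dfsadmin tool; the Hadoop 1.x command names are shown (later releases use hdfs dfsadmin):

hadoop dfsadmin -safemode get    # report whether Safemode is on
hadoop dfsadmin -safemode wait   # block until the NameNode leaves Safemode
hadoop dfsadmin -safemode enter  # force Safemode on, e.g. for maintenance
hadoop dfsadmin -safemode leave  # force Safemode off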

Page 17: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

HDFS WRITES

Step 1: The HDFS client caches the file data into a temporary local file.

[Diagram, Steps 2-5, involving the NameNode and DataNodes: once the local file accumulates a full block of data, the client contacts the NameNode (Step 2); the NameNode inserts the file name into the namespace, allocates a block, and replies with the identity of the target DataNode (Step 3); the client flushes the block to that DataNode (Step 4); when the file is closed, the remaining data is flushed and the NameNode commits the file creation (Step 5).]
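To watch this in practice, copy a file into the cluster and then ask the NameNode where its blocks landed (the paths here are hypothetical; the fsck flags are as in Hadoop 1.x):

hadoop fs -put data01.txt /user/kc/data01.txt
hadoop fsck /user/kc/data01.txt -files -blocks -locations

The fsck report lists each block of the file together with the DataNodes holding its replicas, mirroring the metadata shown on the HDFS architecture slide.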

Page 18: Apache Hadoop - A Deep Dive (Part 1 - HDFS)

FEEDBACK

Support Team's blog: http://blogs.msdn.com/b/bigdatasupport/
Facebook Page: https://www.facebook.com/MicrosoftBigData
Facebook Group: https://www.facebook.com/groups/bigdatalearnings/
Twitter: @debarchans

Read more:
http://en.wikipedia.org/wiki/Hadoop
http://en.wikipedia.org/wiki/Big_data

Next Session: Apache Hadoop - MapReduce