hadoop 2.0 architecture | hdfs federation | namenode high availability |

Post on 27-Jan-2015

148 Views

Category:

Technology

3 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

Slide 1 www.edureka.in/hadoop

Slide 2

Hello There!!My name is Annie.

Let me test your Hadoop 1.x knowledge?

Annie’s Introduction

Slide 3

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

Can you store 1 billion files in a Hadoop 1.x cluster?- Yes- No

Annie’s Question

Slide 4

No. Even though you have hundreds of DataNodes in the cluster, the NameNode keeps all its metadata in memory, so you are limited to a maximum of only 50-100M files in the entire cluster because of a Single NameNode in Hadoop 1.x.

Annie’s Answer

Slide 5

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

A Hadoop 1.x cluster can have multiple HDFS Namespaces.- True- False

Annie’s Question

Slide 6

False. Not possible with Hadoop 1.x.

Annie’s Answer

Slide 7

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

Which of the following is (are) a significant disadvantage in Hadoop 1.0?- ‘Single Point Of Failure’ of NameNode- Too much burden on Job Tracker

Annie’s Question

Slide 8

Single Point of Failure of NameNode and too much burden on Job Tracker.

Annie’s Answer

Slide 9

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

Can you use hundreds of Hadoop DataNode for any other processing than MapReduce in Hadoop 1.x?- Yes- No

Annie’s Question

Slide 10

No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots with no or little room for processing any other workload.

Annie’s Answer

Slide 11

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

Can you use Hadoop for Real-time processing?- Yes- No

Annie’s Question

Slide 12

No. Hadoop is designed and developer for massively parallel batch processing.

Annie’s Answer

Limitations of Hadoop 1.x

No horizontal scalability of NameNode

Does not support NameNode High Availability

Overburdened JobTracker

Not possible to run Non-MapReduce Big Data Applications on HDFS

Does not support Multi-tenancy

Slide 14 www.edureka.in/hadoop

Hadoop 1.x – In Summary

Client

HDFS Map Reduce

Secondary NameNode

Data BlocksDataNode

NameNode Job Tracker

Task Tracker

Map Reduce

DataNode Task Tracker

Map Reduce….

DataNode DataNodeTask Tracker

Map Reduce

Task Tracker

Map Reduce

Slide 15 www.edureka.in/hadoop

Problem Description

NameNode – No Horizontal Scalability

Single NameNode and Single Namespace, limited by NameNode RAM

NameNode – No High Availability (HA) NameNode is Single Point of Failure, Need manual recovery usingSecondary NameNode in case of failure

Job Tracker – Overburdened Spends significant portion of time and effort managing the life cycle ofApplications

MRv1 – Only Map and Reduce tasks Humongous Data stored in HDFS remains unutilized and cannot be usedfor other workloads such as Graph processing etc.

Hadoop 1.x - Challenges

NameNode - No High Availability

NameNode - No Horizontal Scale

DataNode

DataNode

DataNode

….

Client Get Block Locations

Block Management

Read Data

NameNodeNS

Slide 16 www.edureka.in/hadoop

NameNode – Scale and HA

Slide 17 www.edureka.in/hadoop

Name Node –Single Point of Failure

Secondary NameNode:

“Not a hot standby” for the NameNode Connects to NameNode every hour* Housekeeping, backup of NameNode metadata Saved metadata can build a failed NameNode

You give me metadata every hour, I will make

it secure

Single Point Failure

Secondary NameNode

NameNode

metadata

metadata

Slide 18 www.edureka.in/hadoop

Job Tracker – Overburdened

CPU

Spends a very significant portion of time and effort managing the life cycle of applications

Network

Single Listener Thread to communicate with thousands of

Map and Reduce Jobs

Task Tracker Task Tracker Task Tracker….

Job Tracker

Slide 19 www.edureka.in/hadoop

MRv1 – Unpredictability in Large Clusters

As the cluster size grow and reaches to 4000 Nodes

Cascading Failures

The DataNode failures results in a seriousdeterioration of the overall clusterperformance because of attempts to replicatedata and overload live nodes, through networkflooding.

Multi-tenancy

As clusters increase in size, you may want toemploy these clusters for a variety of models.MRv1 dedicates its nodes to Hadoop andcannot be re-purposed for other applicationsand workloads in an Organization. With thegrowing popularity and adoption of cloudcomputing among enterprises, this becomesmore important.

Unutilized Data in HDFS

Terabytes and Petabytes of data in HDFS can only be used for MapReduce processing

Slide 11 www.edureka.in/hadoop

Introducing Hadoop 2.0

Features Hadoop 1.x Hadoop 2.0

HDFS Federation One NameNode and a Namespace Multiple NameNode and Namespaces

NameNode High Availability Not present Highly Available

YARN - Processing Control and Multi-tenancy

JobTracker, TaskTracker Resource Manager, Node Manager, App Master, Capacity Scheduler

Other important Hadoop 2.0 features HDFS Snapshots NFSv3 access to data in HDFS Support for running Hadoop on MS Windows Binary Compatibility for MapReduce applications built on Hadoop 1.0 Substantial amount of Integration testing with rest of the projects (such as PIG, HIVE) in Hadoop ecosystem

Slide 12 www.edureka.in/hadoop

Namenode

Block Management

NS

Storage

Datanode Datanode…

Nam

esp

ace

Blo

ckSto

rage

Nam

esp

ace

NS1 NSk NSn

NN-1 NN-k NN-n

Common Storage

Datanode 1

…Datanode 2

…Datanode m

…Blo

ckSto

rage

Pool 1 Pool k Pool n

Block Pools

… …

Hadoop 1.0 Hadoop 2.0

Slide 22 www.edureka.in/hadoop

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html

Hadoop 2.0 Cluster Architecture - Federation

Slide 23 www.edureka.in/hadoop

cluster.

Annie’s Question

How does HDFS Federation help HDFS Scale horizontally?A) Reduces the load on any single NameNode by using the multiple, independent NameNodes to manage individual parts of the file system namespace.B) Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.

Slide 24 www.edureka.in/hadoop

Annie’s Answer

(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.

Slide 25

Annie’s Question

You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory?

www.edureka.in/hadoop

Slide 26

Annie’s Answer

The ‘put’ will fail. None of the namespaces will manage the file and you will get an IOException with a “No such file or directory error”.

www.edureka.in/hadoop

Slide 27

Node Manager

HDFS

YARN

Resource Manager

Shared edit logs

All name space edits logged to shared NFS storage; single writer

(fencing)

Read edit logs and applies to its own namespace

Data Node

Standby NameNode

Active NameNode

ContainerApp

Master

Node Manager

Data Node

ContainerApp

Master

Data Node

Client

Data Node

ContainerApp

Master

Node Manager

Data Node

ContainerApp

Master

Node Manager

Hadoop 2.0 Cluster Architecture - HA

NameNode High Availability

Next Generation MapReduce

HDFS HIGH AVAILABILITY

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html

Slide 28

Hadoop 2.0 Cluster Architecture - HA

www.edureka.in/hadoop

High Availability in Hadoop 2.0

NameNode recovery in Hadoop 1.0

Secondary NameNode

Standby NameNode

Active NameNode

Secondary NameNode

NameNode

Edit logs

Meta-Data

Automatic failover to Standby NameNode

Manually Recover using Secondary

NameNodeFSImage

Slide 29

Annie’s Question

NameNode HA was developed to overcome the following disadvantage in Hadoop 1.0?a) Single Point Of Failure of NameNodeb) Too much burden on Job Tracker

www.edureka.in/hadoop

Slide 30

Annie’s Answer

Single Point of Failure of NameNode.

www.edureka.in/hadoop

Apache Oozie (Workflow)

HDFS(Hadoop Distributed File System)

Pig LatinData Analysis

HiveDW System

MapReduce Framework

HBase

Apache Oozie (Workflow)

HDFS(Hadoop Distributed File System)

Pig LatinData Analysis

HiveDW System

MapReduce Framework HBase

OtherYARN

Frameworks(MPI, GIRAPH)

Slide 23 www.edureka.in/hadoop

YARNCluster Resource Management

YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework

YARN and Hadoop Ecosystem

BATCH(MapReduce)

INTERACTIVE(Text)

ONLINE(HBase)

STREAMING(Storm, S4, …)

GRAPH(Giraph)

IN-MEMORY(Spark)

HPC MPI(OpenMPI)

OTHER(Search)

(Weave..)

Slide 32 www.edureka.in/hadoop

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

YARN – Moving beyond MapReduce

Slide 33 www.edureka.in/hadoop

Organizes jobs into queues

Queue shares as %’s of cluster

FIFO scheduling within eachqueue

Data locality-aware Scheduling

Hierarchical QueuesTo manage the resource within an organization.

Capacity GuaranteesA fraction to the total available capacity allocated to each Queue.

SecurityTo safeguard applications from other users.

ElasticityResources are available in a predictable and elastic manner to queues.

Multi-tenancySet of limit to prevent over-utilization of resources by a singleapplication.

OperabilityRuntime configuration of Queues.

Resource-based schedulingIf needed, Applications can request more resources than the default.

Multi-tenancy - Capacity Scheduler

Slide 34

Annie’s Question

YARN was developed to overcome the following disadvantage in Hadoop 1.0 MapReduce framework?a) Single Point Of Failure Of NameNodeb) Too much burden on Job Tracker

www.edureka.in/hadoop

Slide 35

Annie’s Answer

Too much burden on Job Tracker.

www.edureka.in/hadoop

Slide 36

NameNode HighAvailability

Next Generation MapReduce

Hadoop 2.0 – In Summary

Client

HDFS YARN

Resource ManagerStandby NameNode

Active NameNode

Distributed Data Storage Distributed Data Processing

DataNode

Node Manager

ContainerApp

Master …….

Mast

ers

Sla

ves

Node Manager

DataNode

ContainerApp

Master

DataNode

Node Manager

ContainerApp

Master

Shared edit logs

ORJournal Node

Scheduler

Applications Manager

(AsM)

www.edureka.in/hadoop

Slide 37

Hello There!!My name is Annie. I love quizzes and

puzzles and I am here to make you guys think and

answer my questions.

Can you use Hadoop 2.0 for Real-time processing?- Yes- No

Annie’s Question

Slide 38

No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing.

Annie’s Answer

Slide 39 www.edureka.in/hadoop

What about Real-time Processing?

Hadoop is good for Batch but

How do I process Big Data in Real-time?

Slide 40 www.edureka.in/hadoop

Storm is coming….

APACHE STORM

The Real-time Hadoop

• Continuous commutation system

Distributed, Reliable, Fault-tolerant,

Scalable and Robust

• Suitable for Big Data processing• Guarantees no data loss

Programming Language agnostic

• JSON-based for Ruby, Python etc.

Use case

• Stream processing• Distributed RPC• Continuous Computation

Hadoop Vs. Storm

Hadoop Storm

Differences

Fundamentally as Batch processing system

Real-time processing, process unterminated streams (e.g. twitter feeds) of data, process data as it arrives

MapReduce Jobs run to completion

Topologies (Computation Graph) run forever

Stateful NodesStateless Nodes

Hadoop Storm

Similarities

Scalable Scalable

Guarantees no data loss Guarantees no data loss

Open Source Open Source

Storm Use Cases

Data Normalization• Groupon uses Storm to build real-time data integration

systems.

Analytics• Storm powers Twitter’s publisher analytics product,

processing every tweet and click that happens on Twitter toprovide analytics for Twitter's publisher partners.

• Flipboard use Storm across a wide range of services rangingfrom Content Search to real-time analytics, to generatingcustom magazine fields.

Log processing• Alibaba uses Storm to process the application log and data

change in databases to supply real-time data stats for dataapps.

• NaviSite uses Storm in its server log monitoring and auditingsystem.

Thank YouSee You in Next Class

top related