hadoop architecture meetup

Hadoop Architecture

Agenda

• Different Hadoop daemons and their roles

• What a Hadoop cluster looks like

• Under the hood: how it writes a file

• Under the hood: how it reads a file

• Under the hood: how it replicates a file

• Under the hood: how it runs a job

• How to balance an unbalanced Hadoop cluster

Hadoop – A bit of background

• It’s an open source project

• Based on two technical papers published by Google (the Google File System and MapReduce papers)

• A well known platform for distributed applications

• Easy to scale out

• Works well with commodity hardware (not entirely true)

• Very good for background applications

Hadoop Architecture

Two primary components:

• Distributed File System (HDFS): handles file operations such as read, write and delete

• MapReduce engine: handles parallel computation

Hadoop Distributed File System

• Runs on top of existing file system

• A file is split into blocks of a predefined, equal size, and each block is stored individually

• Designed to handle very large files

• Not well suited to a huge number of small files
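The block splitting described above can be sketched in a few lines of Python. This is only an illustration, not HDFS code: the 128 MiB block size is an assumption (a common default, configurable per cluster), and `split_into_blocks` and `count_blocks` are hypothetical helper names. It also hints at why many small files are a poor fit: every file needs at least one block, so the number of blocks to track grows with the file count rather than with the data volume.

```python
# Assumed block size: 128 MiB is a common HDFS default, but it is configurable.
BLOCK_SIZE = 128 * 1024 * 1024
MIB = 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would be cut into.

    All blocks have the same size except possibly the last one,
    which holds whatever is left over.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


def count_blocks(file_sizes, block_size=BLOCK_SIZE):
    """Total number of blocks needed to store a collection of files."""
    return sum(len(split_into_blocks(size, block_size)) for size in file_sizes)


if __name__ == "__main__":
    # One 300 MiB file: two full blocks plus a 44 MiB tail.
    print([b // MIB for b in split_into_blocks(300 * MIB)])

    # The same 1 GiB of data stored as one file vs. 1024 files of 1 MiB each.
    print(count_blocks([1024 * MIB]))
    print(count_blocks([1 * MIB] * 1024))
```

In this sketch, 1 GiB stored as a single file needs 8 blocks, while the same amount of data spread over 1024 one-MiB files needs 1024 blocks.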

MapReduce Engine

• A MapReduce program consists of a map function and a reduce function

• A MapReduce job is broken into tasks that run in parallel

• Prefers to process data locally, on the node that already holds it, where possible
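To make the map/reduce split concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the "tasks" run one after another here, whereas a real cluster would run one map task per input split in parallel and group the intermediate keys across nodes.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all the values emitted for one key."""
    return word, sum(counts)


def run_job(splits):
    """Run the job over a list of input splits (each split is a list of lines).

    Each split stands in for one map task; in this sketch they simply
    run sequentially in a single process.
    """
    # Map phase, followed by grouping the intermediate pairs by key.
    grouped = defaultdict(list)
    for split in splits:
        for line in split:
            for key, value in map_fn(line):
                grouped[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    splits = [
        ["hadoop stores files", "hadoop runs jobs"],  # split handled by map task 1
        ["jobs read files"],                          # split handled by map task 2
    ]
    print(run_job(splits))
```

Because each split is mapped independently, the map work can be placed on whichever node already stores that split, which is the "local processing" preference mentioned above.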

Hadoop Cluster

Typical Workflow

Cluster Balancing

Quiz

• If you write a 1 TB file into HDFS with a replication factor of 2, how much storage does HDFS actually need to store this file?

• True/False? Even if the Name node goes down, I will still be able to read files from HDFS.

Quiz

• True/False? In a Hadoop cluster, we can have a secondary Job Tracker to improve fault tolerance.

• True/False? If the Job Tracker goes down, you will not be able to write any file into HDFS.

Quiz

• True/False? The Name node stores the actual data itself.

• True/False? The Name node can be rebuilt using the secondary name node.

• True/False? If a data node goes down, Hadoop takes care of re-replicating the affected data blocks.

Quiz

• In which scenario does one data node read data from another data node?

• What are the benefits of the Name node’s rack awareness?

• True/False? HDFS is well suited to applications that write a huge number of small files.

Quiz

• True/False? Hadoop takes care of balancing the cluster automatically.

• True/False? The output of Map tasks is written to an HDFS file.

• True/False? The output of Reduce tasks is written to an HDFS file.

Quiz

• True/False? In a production cluster, commodity hardware can be used to set up the Name node.

Thank You
