DFS and HDFS - wmich.edu · 2019-11-22




11/21/19

DFS and HDFS

CS6030 Zirui Yang

1. Definition of DFS
2. How it works
3. Main concepts: distribution, replication, fault tolerance, and high concurrency

What is a DFS (Distributed File System)?

• A file management system is used by the operating system to access the files and folders stored on a single computer.
• A distributed file system is a system that handles access to data stored across multiple machines (nodes) in a cluster.

The functions of a DFS and how it works

• A distributed file system manages files and data blocks across different nodes and racks. It replicates data blocks on different nodes, which improves fault tolerance and allows parallel (concurrent) access.
• Main concepts: distribution and replication.
• Advantages: fault tolerance and high concurrency.
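The effect of replication on fault tolerance can be illustrated with a small Python sketch. It is a toy model, not a real DFS: the node names, the replication factor of 2, and the `place_replicas` and `read_block` helpers are all assumptions made for illustration.

```python
from itertools import cycle, islice

REPLICATION = 2  # assumed replication factor for this toy example


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the chosen nodes are distinct.
    """
    placement = {}
    for i, block_id in enumerate(block_ids):
        # Start at a different node for each block to spread the load.
        start = i % len(nodes)
        placement[block_id] = list(islice(cycle(nodes), start, start + replication))
    return placement


def read_block(block_id, placement, live_nodes):
    """Return the first live node holding the block, or None if every replica is lost."""
    for node in placement[block_id]:
        if node in live_nodes:
            return node
    return None


nodes = ["node-a", "node-b", "node-c"]
placement = place_replicas(["blk_0", "blk_1", "blk_2"], nodes)

# Simulate a failure of node-a: every block is still readable from another replica.
live = {"node-b", "node-c"}
for block_id in placement:
    print(block_id, "->", read_block(block_id, placement, live))
```

With one node down, each block is still served by a surviving replica. Losing both nodes that hold the same block makes `read_block` return `None`, which is the case that a higher replication factor guards against.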



HDFS: Hadoop Distributed File System

• Introduction
• Architecture: NameNode, DataNodes, HDFS Client
• File I/O Operations and Replica Management

Introduction

HDFS

The Hadoop Distributed File System (HDFS) is the file system component of Hadoop. It is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications. It achieves this by replicating file content across multiple machines (DataNodes).


Outline

• Introduction
• Architecture: NameNode, DataNodes, HDFS Client
• File I/O Operations and Replica Management: File Read and Write, Block Placement, Replication Management, Balancer

Architecture

• A file is made up of one or more data blocks, which are stored across a cluster of one or more machines with data storage capacity.
• Each block of a file is replicated across a number of machines to prevent data loss.
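How a file maps to blocks can be sketched with a little arithmetic. The 128 MB block size and the replication factor of 3 below are common HDFS settings but are treated here as assumptions, and `split_into_blocks` is a hypothetical helper, not part of any Hadoop API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MB)
REPLICATION = 3                 # assumed replication factor
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


file_size = 300 * MB  # a 300 MB file
blocks = split_into_blocks(file_size)

print(len(blocks))                             # 3
print([length // MB for _, length in blocks])  # [128, 128, 44]
print(len(blocks) * REPLICATION)               # 9 block replicas across the cluster
```

Only the final block is partial, so a 300 MB file occupies two full blocks plus one 44 MB block, and each of those is stored three times under the assumed replication factor.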


Architecture

NameNode and DataNodes

• HDFS stores file system metadata and application data separately.
• Metadata refers to file attributes such as permissions, modification and access times, and namespace and disk space quotas.
• HDFS stores metadata on a dedicated server called the NameNode (master).
• Application data are stored on other servers called DataNodes (slaves).

Architecture

• Single NameNode:
  • Maintains the namespace tree (a hierarchy of files and directories) and performs operations such as opening, closing, and renaming files and directories.
  • Determines the mapping of file blocks to DataNodes (the physical location of file data).
  • Collects block reports from DataNodes on block locations.
  • Schedules re-replication of missing data blocks.
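The NameNode responsibilities listed above can be sketched as two in-memory maps: one for the namespace and one for block locations. This is a simplified illustration only; the class layout and the method names `create_file`, `rename`, `process_block_report`, and `locate` are assumptions, not the real NameNode interface.

```python
from collections import defaultdict


class NameNode:
    """Toy NameNode: keeps the namespace and the block-to-DataNode map in memory."""

    def __init__(self):
        self.namespace = {}                      # file path -> ordered list of block IDs
        self.block_locations = defaultdict(set)  # block ID -> DataNodes holding a replica

    def create_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def rename(self, old_path, new_path):
        # Renaming only touches metadata; no block data moves.
        self.namespace[new_path] = self.namespace.pop(old_path)

    def process_block_report(self, datanode, reported_blocks):
        """Replace what is known about `datanode` with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in reported_blocks:
            self.block_locations[block_id].add(datanode)

    def locate(self, path):
        """Return [(block ID, sorted DataNode names)] for a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[path]]


nn = NameNode()
nn.create_file("/logs/a.txt", ["blk_1", "blk_2"])
nn.process_block_report("dn1", ["blk_1", "blk_2"])
nn.process_block_report("dn2", ["blk_1"])
nn.rename("/logs/a.txt", "/logs/b.txt")
print(nn.locate("/logs/b.txt"))
# [('blk_1', ['dn1', 'dn2']), ('blk_2', ['dn1'])]
```

Keeping the file-to-block list separate from the block-to-DataNode map mirrors the slide's split between the namespace tree and the physical location of data: the rename changes only the first map, and block reports change only the second.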

Architecture

• DataNodes:
  • Serve read and write requests from the file system's clients.
  • Perform block creation, deletion, and replication upon instruction from the NameNode.
  • Periodically send block reports to the NameNode.
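A DataNode's three duties can be modelled roughly as follows. The command tuples (`("replicate", block_id, target)` and `("delete", block_id)`) and all method names are invented for this sketch; the real DataNode protocol is considerably richer.

```python
class DataNode:
    """Toy DataNode: holds block contents and obeys NameNode commands."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

    def handle_command(self, command, peers):
        """Execute a NameNode instruction (hypothetical tuple format)."""
        kind, block_id = command[0], command[1]
        if kind == "delete":
            self.blocks.pop(block_id, None)
        elif kind == "replicate":
            target = peers[command[2]]
            target.write_block(block_id, self.blocks[block_id])
        else:
            raise ValueError(f"unknown command: {kind}")

    def block_report(self):
        """List the blocks this node currently stores."""
        return sorted(self.blocks)


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
peers = {"dn1": dn1, "dn2": dn2}

dn1.write_block("blk_1", b"abc")
dn1.write_block("blk_2", b"def")
dn1.handle_command(("replicate", "blk_1", "dn2"), peers)  # copy blk_1 to dn2
dn1.handle_command(("delete", "blk_2"), peers)            # drop blk_2 locally

print(dn1.block_report(), dn2.block_report())  # ['blk_1'] ['blk_1']
```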


Architecture

• Hadoop HDFS data read/write operation
• To read or write a file in HDFS, a client first contacts the NameNode (master). The NameNode provides the addresses of the DataNodes (slaves), and the client then reads from or writes to those DataNodes directly.
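The two-step interaction (ask the NameNode where, then talk to the DataNodes) might look like this in a toy model. Everything here is an assumption for illustration: the `allocate_block` and `get_block_locations` names, the round-robin target choice, and the simplification that the client writes each replica itself.

```python
class DataNode:
    """Toy DataNode: stores block contents in memory."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes


class NameNode:
    """Toy NameNode: hands out block IDs and remembers where replicas live."""

    def __init__(self, datanode_names, replication=2):
        self.datanode_names = sorted(datanode_names)
        self.replication = replication
        self.files = {}  # path -> list of (block ID, [DataNode names])
        self._counter = 0

    def allocate_block(self, path):
        block_id = f"blk_{self._counter}"
        n = len(self.datanode_names)
        # Round-robin choice of targets; a stand-in for real placement policy.
        targets = [self.datanode_names[(self._counter + i) % n]
                   for i in range(self.replication)]
        self._counter += 1
        self.files.setdefault(path, []).append((block_id, targets))
        return block_id, targets

    def get_block_locations(self, path):
        return self.files[path]


def write_file(namenode, datanodes, path, data, block_size):
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_id, targets = namenode.allocate_block(path)  # step 1: ask the NameNode
        for name in targets:                               # step 2: send data to DataNodes
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, path):
    data = b""
    for block_id, holders in namenode.get_block_locations(path):  # step 1: get locations
        data += datanodes[holders[0]].blocks[block_id]            # step 2: read a replica
    return data


datanodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
nn = NameNode(datanodes.keys(), replication=2)

write_file(nn, datanodes, "/demo.txt", b"hello distributed world", block_size=8)
print(read_file(nn, datanodes, "/demo.txt"))  # b'hello distributed world'
```

Note that the file contents never pass through the `NameNode` object; it only stores block IDs and locations, which is the point of separating metadata from application data.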


Thank you!


Architecture

• Heartbeat: the signal that each DataNode periodically sends to the NameNode. If the NameNode does not receive a heartbeat from a DataNode for a certain period, it considers that DataNode dead.
• Balancing: if a DataNode crashes, the blocks stored on it become unavailable, so those blocks are under-replicated compared with the remaining blocks. The NameNode (master) then signals DataNodes that hold replicas of the lost blocks to copy them to other DataNodes, so that the replication level is restored and the overall distribution of blocks stays balanced.
• Replication: the copying itself is performed by the DataNodes.
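The recovery step after a DataNode is declared dead can be sketched as a planning function: find blocks with too few live replicas and pick a live source and a new target for each. The function name `plan_re_replication`, the `(block, source, target)` command shape, and the alphabetical target choice are assumptions for this example, not the actual HDFS policy.

```python
def plan_re_replication(block_locations, live_nodes, replication):
    """Return (block ID, source, target) copy commands that restore `replication` replicas."""
    commands = []
    for block_id, holders in sorted(block_locations.items()):
        alive = sorted(h for h in holders if h in live_nodes)
        if not alive:
            continue  # every replica is lost; there is nothing to copy from
        missing = max(replication - len(alive), 0)
        # Candidate targets: live nodes that do not already hold this block.
        candidates = sorted(n for n in live_nodes if n not in holders)
        for target in candidates[:missing]:
            commands.append((block_id, alive[0], target))
    return commands


block_locations = {
    "blk_1": {"dn1", "dn2"},
    "blk_2": {"dn2", "dn3"},
    "blk_3": {"dn1", "dn3"},
}
live_nodes = {"dn1", "dn3", "dn4"}  # dn2 has stopped sending heartbeats

for command in plan_re_replication(block_locations, live_nodes, replication=2):
    print(command)
# ('blk_1', 'dn1', 'dn3')
# ('blk_2', 'dn3', 'dn1')
```

The function only plans the copies. Consistent with the slide, the data transfer itself would be carried out by the source DataNode, not by the NameNode.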


Architecture

• NameNode and DataNode communication: heartbeats.
• DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available.
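Heartbeat-based liveness tracking can be sketched as a table of last-seen timestamps checked against a timeout. The `HeartbeatMonitor` class and the 30-second timeout are assumptions chosen for the example; timestamps are passed in explicitly so the behaviour is deterministic.

```python
class HeartbeatMonitor:
    """Toy liveness tracker: a DataNode is dead if it has been silent too long."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of last heartbeat

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """Return the DataNodes whose last heartbeat is older than the timeout."""
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout_seconds=30)  # assumed timeout
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)  # dn1 keeps reporting; dn2 goes silent

print(monitor.dead_nodes(now=40))  # ['dn2']
```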
