Hadoop Distributed File System (HDFS)
Survey Results
Total: 19 responses
18 CS and 1 CEN
15 Master (79%) and 4 PhD (21%)
Survey Results
How many hours did you spend in the first week on the reading assignment?
1 hour: 2 responses
2 hours: 7 responses
3-5 hours: 6 responses
6 or more hours: 4 responses
How many hours per week do you plan to spend studying for the course?
0-5 hours: 5 responses
6-10 hours: 11 responses
> 10 hours: 3 responses
Additional Comments
No final exam; give higher weight to assignments and the project
More programming assignments and hands-on experience
Solve real problems in big data using cloud platforms, e.g., AWS or Google Cloud Platform
One review per week, and increase the word limit to 1,000 words
Suggest a book or reference for further reading
Show how big data is used in other fields such as machine learning
HDFS Overview
A distributed file system
Built on the architecture of the Google File System (GFS)
Shares a similar architecture with many other common distributed storage engines such as Amazon S3 and Microsoft Azure
HDFS is a stand-alone storage engine and can be used in isolation from the query processing engine
HDFS Architecture
[Diagram: one name node and several data nodes, each data node storing a set of blocks (B)]
What is where?
[Diagram: name node and data nodes annotated with the metadata each stores]
Name node: file and directory names, block ordering and locations, capacity of the data nodes, architecture of the data nodes
Data nodes: block data, name node location
Analogy to Unix FS
The logical view is similar
[Diagram: a Unix-like directory tree rooted at /, with directories such as etc, hadoop, and user, where user contains mary and chu]
Analogy to Unix FS
The physical model is comparable
Unix: File1 -> list of inodes -> Block 1, Block 2, Block 3, ...
HDFS: File1 -> list of block locations (metadata kept at the name node) -> blocks stored on the data nodes
HDFS Create
[Diagram: a file creator process interacting with the name node and the data nodes]
1. The creator process calls the create function, which translates to an RPC call at the name node.
2. The name node creates three initial block replicas:
   a. The first replica is assigned to a random machine.
   b. The second replica is assigned to another random machine in the same rack as the first machine.
   c. The third replica is assigned to a random machine in another rack.
3. The name node returns an OutputStream to the file creator.
4. The creator writes data through OutputStream#write.
5. When a block fills up, the creator contacts the name node to create the next block.
Notes about writing to HDFS
Data transfers of replicas are pipelined
The data does not go through the name node
Random writing is not supported
Appending to a file is supported, but it creates a new block
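To make the write path concrete, here is a minimal Java sketch against the FileSystem API covered later in this lecture; the path /user/mary/data.txt is a hypothetical example, and a reachable HDFS cluster configuration is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/mary/data.txt"); // hypothetical file
    FileSystem fs = path.getFileSystem(conf);
    // create() translates to an RPC call at the name node,
    // which allocates the initial block replicas
    FSDataOutputStream out = fs.create(path);
    // write calls stream bytes to the data nodes; replica
    // transfers are pipelined and bypass the name node
    out.writeUTF("hello HDFS");
    // close() completes the last block at the name node
    out.close();
  }
}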
Self-writing
[Diagram: the file creator runs on one of the data nodes]
If the file creator is running on one of the data nodes, the first replica is always assigned to that node
Reading from HDFS
Reading is relatively easier than writing
No replication is needed
Replication can be exploited to read from a nearby replica
Random reading is allowed
HDFS Read
[Diagram: a file reader process interacting with the name node and the data nodes]
1. The reader process calls the open function, which translates to an RPC call at the name node.
2. The name node locates the first block of the file and returns the address of one of the nodes that store that block; the name node returns an InputStream for the file.
3. The reader reads data through InputStream#read.
4. When an end-of-block is reached, the name node locates the next block.
5. The InputStream#seek operation locates a block and positions the stream accordingly.
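A matching Java sketch of the read path (again a sketch, assuming the hypothetical file from the write example exists):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/mary/data.txt"); // hypothetical file
    FileSystem fs = path.getFileSystem(conf);
    // open() translates to an RPC call at the name node,
    // which locates the first block of the file
    FSDataInputStream in = fs.open(path);
    // seek() repositions the stream; HDFS locates the
    // block that contains the requested offset
    in.seek(0);
    System.out.println(in.readUTF());
    in.close();
  }
}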
Self-reading
[Diagram: the file reader runs on one of the data nodes and calls open and seek]
1. If the block is stored locally on the reader's machine, that replica is chosen for reading
2. Otherwise, a replica on another machine in the same rack is chosen
3. Otherwise, any other random replica is chosen
When self-reading occurs, HDFS can make it much faster through a feature called short-circuit reads
Notes About Reading
The API is much richer than the simple open/seek/close API
You can retrieve block locations
You can choose a specific replica to read
The same API is generalized to other file systems, including the local FS and S3
Review question: Compare random access reads in local file systems to HDFS
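For example, block locations can be retrieved as follows (a sketch that continues the read example above; fs and path are the FileSystem and Path created there):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

FileStatus status = fs.getFileStatus(path);
// Request the locations of all blocks in the byte range [0, file length)
BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation loc : locations) {
  // Each block reports the hosts that store one of its replicas
  System.out.println(loc.getOffset() + ": " + String.join(", ", loc.getHosts()));
}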
HDFS Special Features
Node decommission
Load balancer
Cheap concatenation
Node Decommission
[Diagram: a data node is decommissioned and its blocks are re-replicated on the remaining data nodes]
Load Balancing
[Diagram: an unbalanced cluster where some data nodes store many more blocks than others]
Start the load balancer
[Diagram: after the load balancer runs, blocks are spread evenly across the data nodes]
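In recent Hadoop releases, the balancer is typically started as a separate administration tool from the command line (hdfs balancer), which iteratively moves blocks from over-utilized data nodes to under-utilized ones until the cluster is balanced within a threshold.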
Cheap Concatenation
[Diagram: the name node reassigns the blocks of File 1, File 2, and File 3 to a new File 4]
Concatenate: File 1 + File 2 + File 3 -> File 4
Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
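A sketch of this operation through the Java API; the paths are hypothetical, and note that FileSystem#concat appends the source files to an existing target file rather than creating a new one:

// fs is an HDFS-backed FileSystem handle; concat is a metadata-only
// operation at the name node, and the source files are removed
Path target = new Path("/data/file1");
Path[] sources = { new Path("/data/file2"), new Path("/data/file3") };
fs.concat(target, sources);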
HDFS API
[Diagram: class hierarchy; FileSystem is the abstract base class, with DistributedFileSystem, LocalFileSystem, and S3FileSystem as subclasses; Path and Configuration are companion classes]

Create the file system:
Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);

Create a new file:
FSDataOutputStream out = fs.create(path, …);

Delete a file:
fs.delete(path, recursive);
fs.deleteOnExit(path);

Rename a file:
fs.rename(oldPath, newPath);

Open a file:
FSDataInputStream in = fs.open(path, …);

Seek to a different location:
in.seek(pos);
in.seekToNewSource(pos);

Concatenate:
fs.concat(destination, src[]);

Get file metadata:
fs.getFileStatus(path);

Get block locations:
fs.getFileBlockLocations(path, from, to);