Hadoop Distributed File System (HDFS)
Survey Results
Total: 19 responses
18 CS and 1 CEN
15 Master (79%) and 4 PhD (21%)
Survey Results
How many hours did you spend in the first week on the reading assignment?
1 hour: 2 responses
2 hours: 7 responses
3-5 hours: 6 responses
6 or more hours: 4 responses
How many hours per week do you plan to spend studying for the course?
0-5 hours: 5 responses
6-10 hours: 11 responses
> 10 hours: 3 responses
Additional Comments
No final exam; give higher weight to assignments and the project
More programming assignments and hands-on experience
Solve real problems in big data using cloud platforms, e.g., AWS or Google Cloud Platform
One review per week, and increase the word limit to 1,000 words
Suggest a book or reference for further reading
Show how big data is used in other fields such as machine learning
HDFS Overview
A distributed file system
Built on the architecture of the Google File System (GFS)
Shares a similar architecture with many other common distributed storage engines such as Amazon S3 and Microsoft Azure
HDFS is a stand-alone storage engine and can be used in isolation from the query processing engine
HDFS Architecture
[Diagram: one name node and several data nodes, each data node storing a set of blocks (B)]
What is where?
[Diagram: name node and data nodes annotated with the metadata each stores]
Name node: file and directory names, block ordering and locations, capacity of the data nodes, architecture of the data nodes
Data nodes: block data, name node location
Analogy to Unix FS
The logical view is similar
[Diagram: a Unix-like directory tree rooted at /, with directories such as etc, hadoop, and user, where user contains mary and chu]
Analogy to Unix FS
The physical model is comparable
Unix: File1 -> list of inodes -> Block 1, Block 2, Block 3, ...
HDFS: File1 -> list of block locations (metadata kept at the name node) -> blocks stored on the data nodes
HDFS Create
[Diagram: a file creator process interacting with the name node and the data nodes]
1. The creator process calls the create function, which translates to an RPC call at the name node.
2. The name node creates three initial block replicas:
   a. The first replica is assigned to a random machine.
   b. The second replica is assigned to another random machine in the same rack as the first machine.
   c. The third replica is assigned to a random machine in another rack.
3. The name node returns an OutputStream to the file creator.
4. The creator writes data through OutputStream#write.
5. When a block fills up, the creator contacts the name node to create the next block.
Notes about writing to HDFS
Data transfers of replicas are pipelined
The data does not go through the name node
Random writing is not supported
Appending to a file is supported, but it creates a new block
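To make the write path concrete, here is a minimal Java sketch against the FileSystem API covered later in this lecture; the path /user/mary/data.txt is a hypothetical example, and a reachable HDFS cluster configuration is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/mary/data.txt"); // hypothetical file
    FileSystem fs = path.getFileSystem(conf);
    // create() translates to an RPC call at the name node,
    // which allocates the initial block replicas
    FSDataOutputStream out = fs.create(path);
    // write calls stream bytes to the data nodes; replica
    // transfers are pipelined and bypass the name node
    out.writeUTF("hello HDFS");
    // close() completes the last block at the name node
    out.close();
  }
}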
Self-writing
[Diagram: the file creator runs on one of the data nodes]
If the file creator is running on one of the data nodes, the first replica is always assigned to that node
Reading from HDFS
Reading is relatively easier than writing
No replication is needed
Replication can be exploited to read from a nearby replica
Random reading is allowed
HDFS Read
[Diagram: a file reader process interacting with the name node and the data nodes]
1. The reader process calls the open function, which translates to an RPC call at the name node.
2. The name node locates the first block of the file and returns the address of one of the nodes that store that block; the name node returns an InputStream for the file.
3. The reader reads data through InputStream#read.
4. When an end-of-block is reached, the name node locates the next block.
5. The InputStream#seek operation locates a block and positions the stream accordingly.
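A matching Java sketch of the read path (again a sketch, assuming the hypothetical file from the write example exists):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("/user/mary/data.txt"); // hypothetical file
    FileSystem fs = path.getFileSystem(conf);
    // open() translates to an RPC call at the name node,
    // which locates the first block of the file
    FSDataInputStream in = fs.open(path);
    // seek() repositions the stream; HDFS locates the
    // block that contains the requested offset
    in.seek(0);
    System.out.println(in.readUTF());
    in.close();
  }
}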
Self-reading
[Diagram: the file reader runs on one of the data nodes and calls open and seek]
1. If the block is stored locally on the reader's machine, that replica is chosen for reading
2. Otherwise, a replica on another machine in the same rack is chosen
3. Otherwise, any other random replica is chosen
When self-reading occurs, HDFS can make it much faster through a feature called short-circuit reads
Notes About Reading
The API is much richer than the simple open/seek/close API
You can retrieve block locations
You can choose a specific replica to read
The same API is generalized to other file systems, including the local FS and S3
Review question: Compare random access reads in local file systems to HDFS
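For example, block locations can be retrieved as follows (a sketch that continues the read example above; fs and path are the FileSystem and Path created there):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

FileStatus status = fs.getFileStatus(path);
// Request the locations of all blocks in the byte range [0, file length)
BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation loc : locations) {
  // Each block reports the hosts that store one of its replicas
  System.out.println(loc.getOffset() + ": " + String.join(", ", loc.getHosts()));
}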
HDFS Special Features
Node decommission
Load balancer
Cheap concatenation
Node Decommission
[Diagram: a data node is decommissioned and its blocks are re-replicated on the remaining data nodes]
Load Balancing
[Diagram: an unbalanced cluster where some data nodes store many more blocks than others]
Start the load balancer
[Diagram: after the load balancer runs, blocks are spread evenly across the data nodes]
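In recent Hadoop releases, the balancer is typically started as a separate administration tool from the command line (hdfs balancer), which iteratively moves blocks from over-utilized data nodes to under-utilized ones until the cluster is balanced within a threshold.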
Cheap Concatenation
[Diagram: the name node reassigns the blocks of File 1, File 2, and File 3 to a new File 4]
Concatenate: File 1 + File 2 + File 3 -> File 4
Rather than creating new blocks, HDFS can just change the metadata in the name node to delete File 1, File 2, and File 3, and assign their blocks to a new File 4 in the right order.
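A sketch of this operation through the Java API; the paths are hypothetical, and note that FileSystem#concat appends the source files to an existing target file rather than creating a new one:

// fs is an HDFS-backed FileSystem handle; concat is a metadata-only
// operation at the name node, and the source files are removed
Path target = new Path("/data/file1");
Path[] sources = { new Path("/data/file2"), new Path("/data/file3") };
fs.concat(target, sources);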
HDFS API
[Diagram: class hierarchy; FileSystem is the abstract base class, with DistributedFileSystem, LocalFileSystem, and S3FileSystem as subclasses; Path and Configuration are companion classes]

Create the file system:
Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);

Create a new file:
FSDataOutputStream out = fs.create(path, …);

Delete a file:
fs.delete(path, recursive);
fs.deleteOnExit(path);

Rename a file:
fs.rename(oldPath, newPath);

Open a file:
FSDataInputStream in = fs.open(path, …);

Seek to a different location:
in.seek(pos);
in.seekToNewSource(pos);

Concatenate:
fs.concat(destination, src[]);

Get file metadata:
fs.getFileStatus(path);

Get block locations:
fs.getFileBlockLocations(path, from, to);