Hadoop Distributed File System (HDFS)
Source slides: eldawy/18WCS226/slides/CS226-03-HDFS.pdf
01/16/2018
Survey Results
Total: 19 responses
18 CS and 1 CEN
15 Master (79%) and 4 PhD (21%)
How many hours did you spend in the first week on the reading assignment?
1 hour (2 responses)
2 hours (7 responses)
3-5 hours (6 responses)
6 or more hours (4 responses)
How many hours per week do you plan to spend studying for the course?
0-5 hours: 5 responses
6-10 hours: 11 responses
> 10 hours:
Additional Comments
No final exam; give higher weights to assignments and the project
More programming assignments and hands-on experience
Solve real problems in big data using cloud platforms, e.g., AWS or Google Cloud Platform
One review per week and increase the word limit to 1000 words
Suggest a book or reference for further reading
Show how big data is used in other fields such as machine learning
HDFS Overview
A distributed file system
Built on the architecture of the Google File System (GFS)
Shares a similar architecture with many other common distributed storage engines such as Amazon S3 and Microsoft Azure
HDFS is a standalone storage engine and can be used independently of the query processing engine
HDFS Architecture
[Figure: one name node and several data nodes; each data node stores blocks (B)]
What is where?
Name node:
File and directory names
Block ordering and locations
Capacity of data nodes
Architecture of data nodes
Data nodes:
Block data
Name node location
Analogy to Unix FS
The logical view is similar
[Figure: directory tree rooted at / containing user, mary, chu, etc, and hadoop]
The physical model is comparable
Unix: File1, list of inodes, Block 1, Block 2, Block 3, …
HDFS: File1, list of block locations (metadata), blocks (B) stored on the data nodes
HDFS Create
[Figure: file creator, name node, and data nodes; the file creator issues Create(…) to the name node]
The creator process calls the create function, which translates to an RPC call to the name node
The name node allocates the first block and assigns its three replicas:
1. The first replica is assigned to a random machine
2. The second replica is assigned to another random machine in the same rack as the first machine
3. The third replica is assigned to a random machine in another rack
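The three placement steps above can be sketched as a small simulation. This is a toy model of the steps as listed on the slide, not Hadoop's actual placement code; the `Node` class, the rack names, and the `place` and `pick` helpers are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class ReplicaPlacementSketch {
    /** A data node identified by a name and the rack it sits in (hypothetical model). */
    static final class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    /** Picks a random node, other than `exclude`, that is inside (or outside) the given rack. */
    static Node pick(List<Node> cluster, Random rnd, String rack, boolean sameRack, Node exclude) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            if (n != exclude && n.rack.equals(rack) == sameRack) {
                candidates.add(n);
            }
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("no eligible node");
        }
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    /** Step 1: random node. Step 2: another node in the same rack. Step 3: a node in another rack. */
    static List<Node> place(List<Node> cluster, Random rnd) {
        Node first = cluster.get(rnd.nextInt(cluster.size()));
        Node second = pick(cluster, rnd, first.rack, true, first);
        Node third = pick(cluster, rnd, first.rack, false, first);
        return Arrays.asList(first, second, third);
    }

    public static void main(String[] args) {
        List<Node> cluster = Arrays.asList(
            new Node("dn1", "rackA"), new Node("dn2", "rackA"),
            new Node("dn3", "rackB"), new Node("dn4", "rackB"));
        for (Node n : place(cluster, new Random())) {
            System.out.println(n.name + " on " + n.rack);
        }
    }
}
```

Keeping two replicas in one rack and one in another trades a cheap intra-rack copy against surviving the loss of a whole rack.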
[Figure sequence: the file creator receives an OutputStream; successive OutputStream#write calls send data to the data nodes holding replicas 1, 2, and 3]
When a block fills up, the creator contacts the name node to create the next block
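To make the block-by-block behavior concrete, here is a minimal sketch of how a write of a given length is cut into blocks, with one request for a new block per iteration. The class and method names are hypothetical, and the block size is just a parameter of the sketch; the slide does not state a value.

```java
import java.util.ArrayList;
import java.util.List;

class BlockSplitSketch {
    /** Returns the length of each block needed to hold fileLength bytes. */
    static List<Long> blockLengths(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and blockSize > 0");
        }
        List<Long> lengths = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            // Each iteration stands for one "create the next block" request to the name node.
            long len = Math.min(blockSize, remaining);
            lengths.add(len);
            remaining -= len;
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Units are arbitrary here: a 300-unit file with 128-unit blocks.
        System.out.println(blockLengths(300, 128));
    }
}
```

Only the last block can be shorter than the block size; all earlier blocks are full.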
Notes about writing to HDFS
Data transfers of replicas are pipelined
The data does not go through the name node
Random writing is not supported
Appending to a file is supported, but it creates a new block
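The first two notes can be illustrated with a small sketch of the data path: the writer sends each packet once, to the first data node, and every data node forwards it to the next replica. The `DataNode` class and `write` helper are hypothetical, and the forwarding here is sequential, whereas a real pipeline overlaps the transfers in time.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class PipelineSketch {
    /** A data node that stores what it receives and forwards it downstream (hypothetical model). */
    static final class DataNode {
        final ByteArrayOutputStream stored = new ByteArrayOutputStream();
        final DataNode next; // next replica in the pipeline, or null at the end

        DataNode(DataNode next) { this.next = next; }

        void receive(byte[] packet) {
            stored.write(packet, 0, packet.length);
            if (next != null) {
                next.receive(packet); // forward along the pipeline
            }
        }
    }

    /** Sends data in packets to the head of the pipeline; returns the bytes the writer itself sent. */
    static long write(DataNode head, byte[] data, int packetSize) {
        long sent = 0;
        for (int off = 0; off < data.length; off += packetSize) {
            int len = Math.min(packetSize, data.length - off);
            head.receive(Arrays.copyOfRange(data, off, off + len));
            sent += len;
        }
        return sent;
    }

    public static void main(String[] args) {
        DataNode third = new DataNode(null);
        DataNode second = new DataNode(third);
        DataNode first = new DataNode(second);
        long sent = write(first, "hello hdfs".getBytes(), 4);
        System.out.println("writer sent " + sent + " bytes; third replica holds " + third.stored.size());
    }
}
```

The writer transmits the data once regardless of the number of replicas, and the name node never appears on the data path.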
Self-writing
If the file creator is running on one of the data nodes, the first replica is always assigned to that node
Reading from HDFS
Reading is simpler than writing
No replication is needed
Replication can be exploited
Random reading is allowed
HDFS Read
The reader process calls the open function, which translates to an RPC call to the name node
The name node locates the first block of that file and returns the address of one of the nodes that store that block
The name node returns an input stream for the file
[Figure: the file reader calls InputStream#read(…) and the data is read from the data nodes]
When the end of a block is reached, the name node locates the next block
The InputStream#seek operation locates a block and positions the stream accordingly
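The arithmetic behind locating a block for a seek position can be sketched as follows. This assumes every block except possibly the last has the same size; the class name, the helper names, and the 128 MiB block size in `main` are example choices rather than values taken from the slides.

```java
class SeekSketch {
    /** Index of the block that contains byte position pos. */
    static long blockIndex(long pos, long blockSize) {
        check(pos, blockSize);
        return pos / blockSize;
    }

    /** Offset of byte position pos inside its block. */
    static long offsetInBlock(long pos, long blockSize) {
        check(pos, blockSize);
        return pos % blockSize;
    }

    private static void check(long pos, long blockSize) {
        if (pos < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("pos must be >= 0 and blockSize > 0");
        }
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed example block size
        long pos = 300L * 1024 * 1024;       // seek target
        System.out.println("block " + blockIndex(pos, blockSize)
            + ", offset " + offsetInBlock(pos, blockSize));
    }
}
```

With these example values, position 300 MiB falls in the block with index 2, at an offset of 44 MiB into that block.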
Self-reading
1. If the block is stored locally on the reader's machine, this replica is chosen
2. If not, a replica on another machine in the same rack is chosen
3. Otherwise, a random replica on any other machine is chosen
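The preference order above can be written as a short selection routine. This is a sketch of the three rules as stated on the slide, not the HDFS client's actual code; the `Replica` class, host names, and rack names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class ReadReplicaSketch {
    /** Location of one replica of a block (hypothetical model). */
    static final class Replica {
        final String host;
        final String rack;
        Replica(String host, String rack) { this.host = host; this.rack = rack; }
    }

    /** Prefers a local replica, then one in the reader's rack, then any replica at random. */
    static Replica choose(List<Replica> replicas, String readerHost, String readerRack, Random rnd) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("block has no replicas");
        }
        for (Replica r : replicas) {
            if (r.host.equals(readerHost)) {
                return r; // rule 1: stored locally
            }
        }
        for (Replica r : replicas) {
            if (r.rack.equals(readerRack)) {
                return r; // rule 2: same rack (first match)
            }
        }
        return replicas.get(rnd.nextInt(replicas.size())); // rule 3: any replica
    }

    public static void main(String[] args) {
        List<Replica> replicas = Arrays.asList(
            new Replica("dn1", "rackA"), new Replica("dn2", "rackB"), new Replica("dn3", "rackB"));
        Replica chosen = choose(replicas, "dn2", "rackB", new Random());
        System.out.println("reading from " + chosen.host);
    }
}
```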
When self-reading occurs, HDFS can make it much faster through a feature called short-circuit reads
Notes About Reading
The API is much richer than the simple open/seek/close API
You can retrieve block locations
You can choose a specific replica to read
The same API is generalized to other file systems including the local FS and S3
Review question: Compare random-access reads in local file systems to HDFS
HDFS Special Features
Node decommission
Load balancer
Cheap concatenation
Node Decommission
[Figure: blocks (B) distributed across the data nodes]
Load Balancing
[Figure: blocks (B) distributed across the data nodes]
Start the load balancer
Cheap Concatenation
Concatenate File 1 + File 2 + File 3 → File 4
Rather than creating new blocks, HDFS can just
change the metadata in the name node to delete
File 1, File 2, and File 3, and assign their blocks to a
new File 4 in the right order.
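A metadata-only concatenation of this kind can be sketched with a map from file names to ordered block IDs. The map stands in for the name node's metadata; the class name, file names, and block IDs are hypothetical, and the sketch follows the description above rather than the exact semantics of the real concat call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConcatSketch {
    /** File name -> ordered list of block IDs (stand-in for name node metadata). */
    final Map<String, List<Long>> files = new LinkedHashMap<>();

    /** Moves the block lists of the sources, in order, into a new target file. No block data is copied. */
    void concat(String target, String... sources) {
        for (String src : sources) {
            if (!files.containsKey(src)) {
                throw new IllegalArgumentException("no such file: " + src);
            }
        }
        List<Long> blocks = new ArrayList<>();
        for (String src : sources) {
            blocks.addAll(files.remove(src)); // delete the source entry, keep its blocks
        }
        files.put(target, blocks);
    }

    public static void main(String[] args) {
        ConcatSketch nameNode = new ConcatSketch();
        nameNode.files.put("file1", new ArrayList<>(List.of(1L, 2L)));
        nameNode.files.put("file2", new ArrayList<>(List.of(3L)));
        nameNode.files.put("file3", new ArrayList<>(List.of(4L, 5L)));
        nameNode.concat("file4", "file1", "file2", "file3");
        System.out.println(nameNode.files);
    }
}
```

The cost depends on the number of blocks listed in the metadata, not on the number of bytes stored in them.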
HDFS API
FileSystem
Subclasses: DistributedFileSystem, LocalFileSystem, S3FileSystem
Related classes: Path, Configuration
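import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// Create the file system
Configuration conf = new Configuration();
Path path = new Path("…");
FileSystem fs = path.getFileSystem(conf);
// To get the local FS
fs = FileSystem.getLocal(conf);
// To get the default FS
fs = FileSystem.get(conf);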
// Create a new file
FSDataOutputStream out = fs.create(path, …);
// Delete a file
fs.delete(path, recursive);
fs.deleteOnExit(path);
// Rename a file
fs.rename(oldPath, newPath);
// Open a file
FSDataInputStream in = fs.open(path, …);
// Seek to a different location
in.seek(pos);
in.seekToNewSource(pos);
// Concatenate
fs.concat(destination, src[]);
// Get file metadata
fs.getFileStatus(path);
// Get block locations for the byte range starting at start and spanning len bytes
fs.getFileBlockLocations(path, start, len);