What does it mean to virtualize the Hadoop File System? Tom Phelan, Chief Architect for BlueData
![Page 1: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/1.jpg)
What does it mean to virtualize the Hadoop
File System?
Tom Phelan
Chief Architect for BlueData
![Page 2: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/2.jpg)
It is HDFS …
![Page 3: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/3.jpg)
Unless it is not
![Page 4: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/4.jpg)
Outline
There are questions to be answered …
Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
  • Instances
  • Advantages and considerations
And a “When”:
• When to choose HDFS storage virtualization?
![Page 5: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/5.jpg)
What is HDFS?
Before we can virtualize it, we need to understand what “it” is.
![Page 6: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/6.jpg)
HDFS
It is a distributed file system built with NameNodes and DataNodes
http://image.slidesharecdn.com/introtohadoop-javamug-110414122200-phpapp01/95/intro-to-the-hadoop-stack-april-2011-javamug-14-728.jpg?cb=1302793500
Source: David Engfer via slideshare.net
![Page 7: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/7.jpg)
HDFS Implementation
[Diagram labels: hadoop-hdfs.jar; org.apache.hadoop.fs.FileSystem; org.apache.hadoop.hdfs.FileSystem; org.apache.hadoop.hdfs.DistributedFileSystem]
![Page 8: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/8.jpg)
HDFS Implementation
It is a stack of Java code used by Hadoop applications to access data.
• Hadoop Distributed File System API / Java class
• Distributed File System client protocol at the TCP/IP level (“over the wire”)
[Diagram labels: YARN; HDFS Implementation]
![Page 9: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/9.jpg)
HDFS Layers of Potential Virtualization
• Generic Java classes: Java class org.apache.hadoop.fs.FileSystem
• HDFS over-the-wire protocol: Java class org.apache.hadoop.hdfs.DFSClient
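The two layers can be pictured with a small standalone sketch. It does not use Hadoop: `GenericFileSystem`, `WireClient`, `WireBackedFileSystem`, and `LayersDemo` are hypothetical stand-ins for the roles the slide assigns to org.apache.hadoop.fs.FileSystem and org.apache.hadoop.hdfs.DFSClient, and an in-memory map stands in for the remote NameNode and DataNodes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the generic layer (the role played by
// org.apache.hadoop.fs.FileSystem): applications code against this type.
abstract class GenericFileSystem {
    abstract byte[] read(String path);
    abstract void write(String path, byte[] data);
}

// Hypothetical stand-in for the wire-level client (the role played by
// org.apache.hadoop.hdfs.DFSClient). A map replaces the remote services.
class WireClient {
    private final Map<String, byte[]> remoteBlocks = new HashMap<>();

    byte[] fetch(String path) {
        byte[] data = remoteBlocks.get(path);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + path);
        }
        return data.clone();
    }

    void store(String path, byte[] data) {
        remoteBlocks.put(path, data.clone());
    }
}

// The concrete file system translates generic calls into wire-client calls.
class WireBackedFileSystem extends GenericFileSystem {
    private final WireClient client;

    WireBackedFileSystem(WireClient client) {
        this.client = client;
    }

    @Override
    byte[] read(String path) {
        return client.fetch(path);
    }

    @Override
    void write(String path, byte[] data) {
        client.store(path, data);
    }
}

class LayersDemo {
    // Application code sees only the generic layer.
    static String roundTrip(GenericFileSystem fs, String path, String text) {
        fs.write(path, text.getBytes(StandardCharsets.UTF_8));
        return new String(fs.read(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        GenericFileSystem fs = new WireBackedFileSystem(new WireClient());
        System.out.println(roundTrip(fs, "/data/part-0", "hello"));
    }
}
```

Either layer is a seam: a different subclass of the generic type can be swapped in, or a different service can answer the wire client, and the application code in `roundTrip` does not change.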
![Page 10: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/10.jpg)
HDFS Implementation
[Diagram: one host runs the NameNode and ResourceManager. Two further hosts each run a DataNode, a NodeManager, and an App that uses the HDFS Impl and DFSClient, each with local disks. The hosts are connected by the wire protocol.]
![Page 11: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/11.jpg)
HDFS Virtualization
The virtualization of either the HDFS Implementation or the Protocols
![Page 12: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/12.jpg)
Outline
There are questions to be answered …
Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
  • Instances
  • Advantages and considerations
And a “When”:
• When to choose HDFS storage virtualization?
![Page 13: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/13.jpg)
HDFS Virtualization Methods
• Virtualize the HDFS Implementation
• Implement one of the Hadoop Compatible File System (HCFS) protocols
  • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
  • Implement a HCFS via the FileSystem protocol (fs.FileSystem)
![Page 14: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/14.jpg)
Virtualize the HDFS Implementation
This is the only method of HDFS virtualization that requires Hadoop compute virtualization.
Simple. Install a Hadoop distro into a cluster of virtualized compute nodes and run the HDFS services in the cluster, storing data on vdisks/vmdks.
Instances of this type of HDFS virtualization include:
• VMware BDE
• OpenStack Sahara
• Cloudera Director
• Hortonworks Cloudbreak
![Page 15: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/15.jpg)
Virtualize the HDFS Implementation
[Diagram: three hosts, each running a VM. One VM holds the NameNode and ResourceManager. The other two VMs each hold a DataNode, a NodeManager, and an App that uses the HDFS Impl and DFSClient, each with local disks.]
![Page 16: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/16.jpg)
Virtualize the HDFS Implementation
Advantages:
• Simple
• No new Java code
• Compute/data locality
Considerations:
• Requires data ingest time
• The clusters become stateful
![Page 17: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/17.jpg)
HDFS Virtualization Methods
• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
  • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
  • Implement a HCFS via the FileSystem protocol (fs.FileSystem)
![Page 18: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/18.jpg)
Implement a HCFS via the over-the-wire protocol
Use the unmodified hadoop-hdfs jar
fs.defaultFS hdfs://1.2.3.4:8020/path
Instance:
• EMC Isilon
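As a rough illustration, the setting on this slide would normally live in core-site.xml. The fragment below is a sketch rather than a vendor-supplied configuration: the address is the slide's placeholder, and the value is shown as scheme and authority only, which is an assumption about how the slide's example would be written in practice.

```xml
<configuration>
  <!-- Point the unmodified HDFS client at the storage service's endpoint.
       1.2.3.4:8020 is the placeholder address from the slide. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://1.2.3.4:8020</value>
  </property>
</configuration>
```

Because only the endpoint changes, the Hadoop applications and the hadoop-hdfs jar stay as they are; the storage service answers the same client protocol.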
![Page 19: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/19.jpg)
Implement a HCFS via the over-the-wire protocol
[Diagram: one host runs the NameNode and ResourceManager. Two further hosts each run a DataNode, a NodeManager, and an App that uses the HDFS Impl and DFSClient, each with local disks. A separate StorageService with its own local disks is also shown.]
![Page 20: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/20.jpg)
Implement a HCFS via the over-the-wire protocol
Advantages:
• Multi-protocol
• No new Java code
• Enterprise storage services
Considerations:
• Open source / proprietary
• No compute / data locality
![Page 21: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/21.jpg)
HDFS Virtualization Methods
• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
  • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
  • Implement a HCFS via the FileSystem protocol (fs.FileSystem)
![Page 22: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/22.jpg)
Implement a HCFS via the FileSystem Java classes
Write the Java code that implements the class, build a jar file, put the jar file in the YARN services class path, and edit the core-site.xml file.
Instances:
• S3 and S3a/S3n – org.apache.hadoop.fs.FileSystem
  https://github.com/Aloisius/hadoop-s3a
• GlusterFS – org.apache.hadoop.fs.FilterFileSystem
  https://github.com/gluster/glusterfs-hadoop
• Tachyon – org.apache.hadoop.fs.FileSystem
  https://github.com/amplab/tachyon
• Apache Ignite – org.apache.hadoop.fs.AbstractFileSystem
  https://github.com/apache/ignite
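The core-site.xml edit in the last step might look roughly like the fragment below. Hadoop resolves a URI scheme to an implementation class through a property named fs.<scheme>.impl; the scheme `customfs` and the class `com.example.CustomFileSystem` are hypothetical names chosen for illustration, not taken from any of the projects listed above.

```xml
<configuration>
  <!-- Map the hypothetical "customfs" scheme to the class packaged in the jar.
       Both the scheme and the class name are placeholders. -->
  <property>
    <name>fs.customfs.impl</name>
    <value>com.example.CustomFileSystem</value>
  </property>
</configuration>
```

With a mapping like this in place, a path such as customfs://host/dir would be served by the custom class instead of the stock HDFS client. Implementations built on org.apache.hadoop.fs.AbstractFileSystem use a fs.AbstractFileSystem.<scheme>.impl property instead.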
![Page 23: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/23.jpg)
Implement a HCFS via the FileSystem Java classes
[Diagram: one host runs the NameNode and ResourceManager. Two further hosts each run a DataNode, a NodeManager, and an App that uses the HDFS Impl and DFSClient, each with local disks. Each of these two hosts also carries a CustomFS Impl, and three StorageService instances are shown.]
![Page 24: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/24.jpg)
Implement a HCFS via the FileSystem Java classes
[Diagram: a variant of the previous slide. The NameNode and ResourceManager are labeled separately, the two worker hosts again carry a DataNode, NodeManager, App, HDFS Impl, DFSClient, and CustomFS Impl, and three StorageService instances appear alongside additional local disks.]
![Page 25: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/25.jpg)
Implement a HCFS via the FileSystem Java classes
Advantages:
• Open source / proprietary
• Multiple file access protocols supported
Considerations:
• These are file systems
• New Java code
• Possibly no compute / data locality
• May lag latest HDFS feature set
![Page 26: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/26.jpg)
HDFS Virtualization
Is there another way?
![Page 27: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/27.jpg)
HDFS Virtualization
• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
  • Implement a HCFS via the over-the-wire protocol
  • Implement a HCFS via the FileSystem Java classes
• Virtualize the Hadoop Compatible File System Protocol
![Page 28: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/28.jpg)
Virtualize the Hadoop Compatible File System Protocol
Instance:
• BlueData EPIC software – org.apache.hadoop.fs.FileSystem
Translate the Hadoop File System calls into native calls to the back-end file systems
Insert an intelligent caching layer
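The idea of translating file system calls and placing a cache in front of the back end can be sketched in a few lines of standalone Java. This is not BlueData's implementation and uses no Hadoop classes: `BackEndStore`, `CachingTranslator`, and `CachingDemo` are hypothetical names, and the sketch covers only a simple read cache, with no write-back or read-ahead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical back-end interface: any file system or object store that
// can serve bytes by path. Not a real Hadoop or BlueData API.
interface BackEndStore {
    byte[] nativeRead(String path);
}

// Translation layer with a read cache in front of the back end.
class CachingTranslator {
    private final BackEndStore backEnd;
    private final Map<String, byte[]> cache = new HashMap<>();
    private int backEndReads = 0;

    CachingTranslator(BackEndStore backEnd) {
        this.backEnd = backEnd;
    }

    // A file-system-style read is translated into one native call; the
    // result is kept so repeated reads of the same path skip the back end.
    byte[] read(String path) {
        byte[] cached = cache.get(path);
        if (cached == null) {
            cached = backEnd.nativeRead(path);
            backEndReads++;
            cache.put(path, cached);
        }
        return cached.clone();
    }

    int backEndReads() {
        return backEndReads;
    }
}

class CachingDemo {
    public static void main(String[] args) {
        // The lambda plays the role of a remote store.
        BackEndStore store = path -> ("contents of " + path).getBytes();
        CachingTranslator fs = new CachingTranslator(store);
        fs.read("/logs/a");
        fs.read("/logs/a");
        System.out.println("back-end reads: " + fs.backEndReads());
    }
}
```

The caller never learns whether a read came from the cache or from the store, which is the sense in which such a layer can stay transparent to the application.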
![Page 29: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/29.jpg)
Virtualize the Hadoop Compatible File System Protocol
[Diagram: one host runs the NameNode and ResourceManager. Two further hosts each run a DataNode, a NodeManager, and an App that uses the HDFS Impl and DFSClient, each with local disks. Each of these two hosts also carries a DTAP Impl and a DTAP Service. A separate host runs a StorageService with local disks.]
![Page 30: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/30.jpg)
HDFS mem cache
[Diagram labels: Page Cache; HDFS Implementation; DFSClient; DataNode; page]
Application is cache aware
![Page 31: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/31.jpg)
Extend mem cache to any File System or Object storage
[Diagram labels: Page Cache; DTAP FileSystem Implementation; DTAPService; page; back ends: HDFS, GlusterFS, Object Store]
Application is cache unaware
![Page 32: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/32.jpg)
Virtualize the Hadoop Compatible File System Protocol
Advantages:
• Not a file system
• Transparent in-memory cache (write back, read ahead)
• Supports multiple protocols
• Supports compute / data locality
Considerations:
• New Java code
• Open source / proprietary
• May lag latest HDFS feature set
![Page 33: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/33.jpg)
Let’s Review
![Page 34: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/34.jpg)
Outline
There are questions to be answered …
Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
  • Instances
  • Advantages and considerations
And a “When”:
• When to choose HDFS storage virtualization?
![Page 35: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/35.jpg)
A Few Words about Performance
Performance measurements are an art as well as a science
• Bottlenecks in applications
• Bottlenecks in infrastructure: network, CPU, disk
• Configuration is key: block size, distro, security
![Page 36: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/36.jpg)
Virtualize the HDFS Implementation
Source of graph: VMware Technical Paper – Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers
Performance – VMware BDE
![Page 37: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/37.jpg)
Performance – Isilon
Source of graph: Stefan Radtke blog post
http://stefanradtke.blogspot.com/2015/05/comparing-hadoop-performance-on-das-and.html
Implement a HCFS via the over-the-wire protocol
![Page 38: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/38.jpg)
Performance – Tachyon
Source of graph: Haoyuan Li
Implement a HCFS via the FileSystem Java classes
https://spark-summit.org/2014/wp-content/uploads/2014/07/Tachyon-Further-Improve-Sparks-Performance-Haoyuan-Li.pdf
![Page 39: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/39.jpg)
Performance – BlueData
Source of Graph: BlueData customer proof-of-concept results
Virtualize the Hadoop Compatible File System Protocol
![Page 40: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/40.jpg)
Virtualized HDFS solutions provide good performance
Even with remote storage
Even in virtualized environments
![Page 41: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/41.jpg)
When it comes to Hadoop storage virtualization, speed is not the whole story
Other factors to consider when implementing a virtualized HDFS option:
• Use of a virtualized compute environment
• Open source / proprietary solution
• Required Hadoop File System features
• Lifespan of Hadoop cluster
![Page 42: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/42.jpg)
When it comes to Hadoop storage virtualization, speed is not the whole story
Other factors to consider when selecting storage:
• Data accessibility
  • Hadoop File System protocol
  • NFS, object store, other protocols
• Enterprise storage services
  • Data protection
  • Geographical replication
  • Offline backup
![Page 43: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/43.jpg)
Consider a Virtualized HDFS Solution
When any of the following are true:
• Hadoop and non-Hadoop applications are required to access the same data
  • Do not want to replicate the data
• Enterprise storage data services required
• Need to run Hadoop in a virtual compute environment
![Page 44: What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData](https://reader035.vdocuments.mx/reader035/viewer/2022062407/56649f415503460f94c60751/html5/thumbnails/44.jpg)
Hadoop File System
Volume, Velocity, Variety
Virtualization