2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.
Implementing the Hadoop Distributed File System Protocol on OneFS
Jeff Hughes EMC Isilon
Outline
Hadoop Overview
OneFS Overview
MapReduce + OneFS
Details of isi_hdfs_d
Wrap up & Questions
Apache Hadoop Project
Two main components
  MapReduce
  Distributed File System (DFS)
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. – http://hadoop.apache.org/
Hadoop: MapReduce
MapReduce
  Distributed computation framework
  Optimized for batch processing
  Typical I/O profile:
    DFS Read → Map Task → Map Output → Shuffle → Reduce Task → DFS Write
Hadoop: HDFS Architecture
http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html
Hadoop: HDFS Semantics
Metadata/data server cluster architecture
  Cluster coherent namespace and data
Write-once-read-many access
  Single writer only
  Can append to existing files
Data mirrored 3x for resiliency
Client exposed to data topology
  Block locations as part of file metadata
http://www.snia.org/sites/default/files2/sdc_archives/2010_presentations/wednesday/DhrubaBorthakur-Hadoop_File_Systems.pdf
Hadoop: Why HDFS?
Portability – All user space and OS independent
Purpose-built – Primary workflow is MapReduce
  Limited set of operations to implement
Single software package
  Fluid client/server protocol development
Exposure of data topology
  Enables client to control data path locality
OneFS: Architecture
[Figure: servers in the client/application layer connect over the Ethernet layer to the Isilon IQ storage layer; intracluster communication runs over InfiniBand]
OneFS: OS
Built from the ground up on FreeBSD
File system is a loadable kernel module with VFS interface
  Supports POSIX syscalls locally
  Protocol servers access /ifs paths
FS built for mixed namespace access
  Supports SMB, NFS, HTTP, etc.
OneFS: Semantics
Symmetric cluster architecture
  Metadata distributed across all nodes
  Tightly coupled group semantics
Globally coherent file system access
  Distributed lock manager
  Two-phase commit for all write operations
Reed-Solomon FEC used for data protection
Running MapReduce against OneFS
MapReduce + OneFS: Architecture
[Figure: a OneFS clustered file system of four OneFS nodes, each running a NameNode and a DataNode. The DFSClient on a Hadoop node talks to them in three steps:]
  1) Request(“/file”) to a NameNode
  2) Response (block locations) from the NameNode
  3) GetBlock(block) to a DataNode
OneFS runs a daemon that speaks NameNode and DataNode natively
MapReduce + OneFS: Benefits
Easier integration with existing workflows
  First class multi-protocol access
  Reduce ETL stages
Increased disk efficiency
  HDFS: 30% usable, OneFS: 80% usable
  Reduced data center footprint
More data management options
  Snapshots, site replication, etc.
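The disk-efficiency figures follow from how each system protects data. A rough sketch of the arithmetic, assuming plain 3x mirroring for HDFS and an 8+2 Reed-Solomon stripe for OneFS. The 8+2 layout is an illustrative assumption chosen because it yields 80%; the slides do not state the actual protection level. Mirroring alone gives about 33%, so the slide's 30% presumably includes overhead that is not itemized.

```python
def replicated_usable(copies):
    """Usable fraction of raw capacity when every block is stored `copies` times."""
    return 1 / copies


def fec_usable(data_units, parity_units):
    """Usable fraction when each stripe holds data_units of data plus parity_units of FEC."""
    return data_units / (data_units + parity_units)


# 3x mirroring as described for HDFS; 8+2 is a hypothetical FEC layout.
print(f"3x mirroring: {replicated_usable(3):.0%} usable")
print(f"8+2 FEC (assumed layout): {fec_usable(8, 2):.0%} usable")
```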
MapReduce + OneFS: Challenges
Typical data path locality changes
  MapReduce ➝ HDFS acts like DAS
  MapReduce ➝ OneFS goes over the network
Client/server compatibility and maintenance
  OneFS and MapReduce clusters run different software versions
  Hidden benefit of multi-HDFS version access
MapReduce + OneFS: Mitigations
1GbE < SATA controller < 10GbE
  Hadoop designed for 1GbE
  10GbE prices dropping
  Denser storage == fewer nodes/networking
“Rack” locality limits cross-switch contention
  [Figure: Rack A, Rack B, Rack C]
MapReduce + OneFS: Performance
DFS Read → Map Task → Map Output → Shuffle → Reduce Task → DFS Write
  DFS Read: typically ~100Mbit per task from HDFS
  Map Output/Shuffle: I/Os against temp vary considerably per job
  DFS Write: more variable, still ~100Mbit per task to HDFS
Performance bottleneck likely to be temp space
  Terasort example: 75% of I/Os against temp
  Latency: not much impact over HDFS large block read/write operations
http://cto.vmware.com/analyzing-hadoops-internals-with-analytics/
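To put the ~100Mbit-per-task figure next to the link speeds from the mitigations slide, here is a back-of-the-envelope sketch. It ignores protocol overhead and assumes every task streams at the full rate at once, which is a simplification rather than a measured result.

```python
def tasks_to_saturate(link_mbit, per_task_mbit=100):
    """How many concurrent task streams fit in one link, ignoring protocol overhead."""
    return link_mbit // per_task_mbit


for name, mbit in (("1GbE", 1_000), ("10GbE", 10_000)):
    print(name, "carries about", tasks_to_saturate(mbit), "task streams")
```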
HDFS Protocols
Two TCP based protocols
  NameNode – metadata operations
  DataNode – data transfers
About 26 NameNode RPCs
  Mostly use fully qualified paths
  POSIX-like file attrs (modebits, user/group)
Only 2 DataNode client operations
  Simple read/write with a block identifier
  More in Apache HDFS for administration
NameNode Request Example
Example getFileInfo(“/testfile”) request (think stat):
[Figure: dump of the request, annotated with the method name, the parameter type (string), and the “/testfile” parameter]
NameNode Response Example
And the reply:
[Figure: dump of the reply, annotated with the object type, owner, and group]
isi_hdfs_d
Multi-threaded daemon runs on all nodes
Services both NN and DN protocols
Translates RPCs to POSIX system calls
Stateless, underlying FS handles coherency
[Figure: on a OneFS node, a request reaches an isi_hdfs_d thread, which issues a syscall through the VFS to OneFS and sends back the response]
Example NameNode RPCs
Most NameNode RPCs are straightforward
  setPermission() → chmod(…)
  setTimes() → utimes(…)
  create() → open(…, O_CREAT, …)
Other RPCs need some creative interpretation
  recoverLease()/renewLease()
  abandonBlock()
  setReplication()
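The straightforward mappings can be sketched as a small dispatch table. This is an illustration in Python, not the isi_hdfs_d source: the function names, the `RPC_TABLE` dictionary, and the `root` directory standing in for /ifs are all hypothetical. HDFS passes times in milliseconds, so the sketch converts them to seconds for the POSIX call.

```python
import os
import stat
import tempfile


def _local(root, hdfs_path):
    # Map a fully qualified HDFS path onto a directory standing in for /ifs.
    return os.path.join(root, hdfs_path.lstrip("/"))


def set_permission(root, path, mode):
    os.chmod(_local(root, path), mode)  # setPermission() -> chmod()


def set_times(root, path, mtime_ms, atime_ms):
    # setTimes() -> utimes(); milliseconds in, seconds out.
    os.utime(_local(root, path), (atime_ms / 1000, mtime_ms / 1000))


def create(root, path, mode=0o644):
    # create() -> open(O_CREAT); the descriptor is closed right away in this sketch.
    fd = os.open(_local(root, path), os.O_CREAT | os.O_WRONLY, mode)
    os.close(fd)


def get_file_info(root, path):
    st = os.stat(_local(root, path))  # getFileInfo() -> stat()
    return {
        "length": st.st_size,
        "mode": stat.S_IMODE(st.st_mode),
        "mtime_ms": st.st_mtime_ns // 1_000_000,
    }


# Hypothetical table from RPC method name to handler.
RPC_TABLE = {
    "setPermission": set_permission,
    "setTimes": set_times,
    "create": create,
    "getFileInfo": get_file_info,
}


def dispatch(root, method, *args):
    """Look up the handler for an RPC name and run it against the local file system."""
    return RPC_TABLE[method](root, *args)


root = tempfile.mkdtemp()
dispatch(root, "create", "/testfile")
dispatch(root, "setPermission", "/testfile", 0o600)
print(dispatch(root, "getFileInfo", "/testfile"))
```

Because each handler is a single system call with no state kept in the daemon, coherency is left entirely to the underlying file system, as the isi_hdfs_d slide describes.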
HDFS Data Path
NN RPC: getBlockLocations(file)/addBlock(file)
  Returns list of LocatedBlocks
LocatedBlock
  long offset
  Block
    long blkid
    long genStamp
    long numBytes
  DatanodeInfo[] ...
DFSClient connects to DN
  Chooses which DatanodeInfo based on locality
  Only Block structure passed to DN in read/write operation
LocatedBlocks Translation
LocatedBlock
  long offset: logical byte offset into the file
  Block: opaque to client, used by DN
    long blkid
    long genStamp
    long numBytes
    Contents specific to OneFS and isi_hdfs_d: inode number, size of extent, absolute byte offset
  DatanodeInfo[] ...: <IP:port> and rack info for different paths to same block
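A sketch of how a file could be described to the client under this translation. The slide lists inode number, size of extent, and absolute byte offset as the values hidden inside the opaque Block, but not which field carries which. The assignment below (blkid for the inode, genStamp for the offset, numBytes for the extent size) is an assumption for illustration, as are the Python names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    blkid: int      # assumed: inode number
    gen_stamp: int  # assumed: absolute byte offset of the extent
    num_bytes: int  # size of the extent


@dataclass(frozen=True)
class LocatedBlock:
    offset: int           # logical byte offset into the file
    block: Block          # opaque to the client, handed back to a DataNode
    datanodes: List[str]  # "<IP:port>" entries that can all serve this block


def locate_blocks(inode, file_size, extent_size, datanodes):
    """Cut a file into fixed-size extents and describe each as a LocatedBlock."""
    located = []
    for off in range(0, file_size, extent_size):
        length = min(extent_size, file_size - off)
        located.append(LocatedBlock(off, Block(inode, off, length), list(datanodes)))
    return located


for lb in locate_blocks(inode=42, file_size=150, extent_size=64,
                        datanodes=["10.0.0.1:8021", "10.0.0.2:8021"]):
    print(lb.offset, lb.block)
```

Since the client never interprets the Block, the daemon is free to pack whatever it needs to find the data again, which is what lets any node serve any block.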
Read Path Example
DFSClient → NameNode: getBlockLocations()
NameNode → DFSClient: LocatedBlocks
DFSClient → DataNode: DN_OP_READ(Block)
DataNode → DFSClient: Data stream
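The read exchange can be mimicked with an in-memory stand-in. Everything here is hypothetical scaffolding: `NAMESPACE`, `INODES`, and the tiny `EXTENT` size only exist to show the order of calls, with the NameNode side answering from metadata and the DataNode side needing nothing but the Block.

```python
NAMESPACE = {"/file": 7}               # path -> inode number
INODES = {7: b"hello hdfs on onefs"}   # inode number -> file contents
EXTENT = 8                             # tiny extent so the file spans several blocks


def get_block_locations(path):
    """NameNode side: return (logical offset, block) pairs for the file."""
    ino = NAMESPACE[path]
    size = len(INODES[ino])
    return [(off, (ino, off, min(EXTENT, size - off)))
            for off in range(0, size, EXTENT)]


def dn_op_read(block):
    """DataNode side: the block alone identifies the bytes to stream back."""
    ino, start, length = block
    return INODES[ino][start:start + length]


def read_file(path):
    """Client side: one metadata call, then one read per located block."""
    return b"".join(dn_op_read(block) for _off, block in get_block_locations(path))


print(read_file("/file"))
```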
NameNode Connection Routing
NameNode is configured as single URL
  Easy configuration: hdfs://log-server.isilon.com:8020/
DNS round-robin to distribute across nodes
  Metadata IOPs get spread out
  OneFS maintains cross-node consistency
IP Failover plus client retries for resiliency
  Hadoop retries ops 5 times at many levels
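From the client's point of view, the routing amounts to resolving one name to several addresses and retrying on failure. The sketch below is a simplified model, not Hadoop's retry code: `call_with_retries`, `flaky_rpc`, and the addresses are made up, and the limit of five attempts mirrors the figure on the slide.

```python
import itertools
import socket


def resolve(host, port):
    """Return every (ip, port) pair that DNS hands back for the NameNode name."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][:2] for info in infos]


def call_with_retries(addresses, rpc, max_attempts=5):
    """Run rpc(address), rotating through the addresses until one answers."""
    if not addresses:
        raise ValueError("no NameNode addresses to try")
    last_error = None
    for _attempt, addr in zip(range(max_attempts), itertools.cycle(addresses)):
        try:
            return rpc(addr)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def flaky_rpc(addr):
    # Stand-in for a NameNode call; the first node is pretended to be down.
    if addr[0] == "10.0.0.1":
        raise ConnectionError(f"{addr[0]} is down")
    return f"getFileInfo answered by {addr[0]}"


# In practice the list would come from resolve(host, 8020); fixed here for the demo.
nodes = [("10.0.0.1", 8020), ("10.0.0.2", 8020)]
print(call_with_retries(nodes, flaky_rpc))
```

Any node can answer because OneFS keeps the namespace consistent across nodes, so a retry against a different address sees the same metadata.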
DataNode References
DataNode references returned by NameNode
  All OneFS DataNodes can access same data
Each LocatedBlocks for reads has 3 DN refs
  Round-robin across available nodes
  Multiple refs let client try other nodes before coming back to the NameNode again
Write path only 1 reference per logical block
  No need for client to replicate writes
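One way to hand out references along these lines is a rotating cursor over the node list. The class below is a guess at the policy for illustration only. The slide says round-robin, three references for reads, and one for writes, but gives no further detail, so `DataNodePicker` and its behavior are assumptions.

```python
class DataNodePicker:
    """Rotate through the available nodes when building DataNode references."""

    def __init__(self, nodes, refs_per_read=3):
        self.nodes = list(nodes)
        self.refs_per_read = refs_per_read
        self.cursor = 0

    def pick(self, for_write=False):
        # Writes get a single reference; reads get up to refs_per_read distinct nodes.
        count = 1 if for_write else min(self.refs_per_read, len(self.nodes))
        chosen = [self.nodes[(self.cursor + i) % len(self.nodes)] for i in range(count)]
        self.cursor = (self.cursor + 1) % len(self.nodes)
        return chosen


picker = DataNodePicker(["node-a", "node-b", "node-c", "node-d"])
print(picker.pick())                # references for a read
print(picker.pick(for_write=True))  # single reference for a write
```

Every node can reach the same data, so the extra read references are alternate paths rather than replicas.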
Few Quirks
User/group identities are strings
  OneFS natively stores UIDs or SIDs only
  Requires name resolution on access
Locking!
  HDFS uses “leases” to restrict to single writer
  Implemented but no cross-protocol contention
Hadoop apps don’t expect files to move
  Caveat emptor when mixing protocols
It Works!
You see this across NFS:
And the same directory on HDFS:
Conclusions
HDFS protocol can map to POSIX pretty easily
Not all traditional shared storage is bad for Hadoop workflows
Locality features worth preserving even without node locality as a possibility
Interoperability can unlock novel workflows
Questions?
[email protected] Special thanks to Conrad Meyer!