implementing the hadoop distributed file system protocol ......isi_hdfs_d multi-threaded daemon runs...

33
Implementing the Hadoop Distributed File System Protocol on OneFS Jeff Hughes EMC Isilon

Upload: others

Post on 23-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Implementing the Hadoop Distributed File System Protocol on OneFS

Jeff Hughes EMC Isilon

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Outline

Hadoop Overview OneFS Overview MapReduce + OneFS Details of isi_hdfs_d Wrap up & Questions

2

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Hadoop Overview

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Apache Hadoop Project

Two main components MapReduce Distributed File System (DFS)

4

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. – http://hadoop.apache.org/

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Hadoop: MapReduce

MapReduce Distributed computation framework Optimized for batch processing Typical I/O profile:

5

DFS Read Map Task Map Output

Shuffle Reduce Task

DFS Write

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Hadoop: HDFS Architecture

6

http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Hadoop: HDFS Semantics

Metadata/data server cluster architecture Cluster coherent namespace and data

Write-once-read-many access Single writer only Can append to existing files

Data mirrored 3x for resiliency Client exposed to data topology Block locations as part of file metadata

7

http://www.snia.org/sites/default/files2/sdc_archives/2010_presentations/wednesday/DhrubaBorthakur-Hadoop_File_Systems.pdf

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Hadoop: Why HDFS?

Portability – All user space and OS independent Purpose-built – Primary workflow is MapReduce Limited set of operations to implement

Single software package Fluid client/server protocol development

Exposure of data topology Enables client to control data path locality

8

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

OneFS Overview

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

OneFS: Architecture

10

Isilon IQ Storage Layer

Intracluster Communication Infiniband

Servers

Client/Application Layer Ethernet Layer

Servers

Servers

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

OneFS: OS

Built from the ground up on FreeBSD File system is a loadable kernel module

with VFS interface Supports POSIX syscalls locally Protocol servers access /ifs paths

FS Built for mixed namespace access Supports SMB, NFS, HTTP, etc.

11

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

OneFS: Semantics

Symmetric cluster architecture Metadata distributed across all nodes Tightly coupled group semantics

Globally coherent file system access Distributed lock manager Two-phase commit for all write operations

Reed-Solomon FEC used for data protection

12

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Running MapReduce against OneFS

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

MapReduce + OneFS: Architecture

14

OneFS Clustered FileSystem

OneFS Node

NameNode

DataNode

OneFS Node

NameNode

DataNode

OneFS Node

NameNode

DataNode

OneFS Node

NameNode

DataNode

Hadoop Node

DFSClient 1) Request(“/file”)

2) Response (block locations) 3) GetBlock(block)

OneFS runs a daemon that speaks NameNode and DataNode natively

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

MapReduce + OneFS: Benefits

Easier integration to existing workflows First class multi-protocol access Reduce ETL stages

Increased disk efficiency HDFS: 30% usable, OneFS: 80% usable Reduced data center footprint

More data management options Snapshots, site replication, etc.

15

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

MapReduce + OneFS: Challenges

Typical data path locality changes MapReduce ➝ HDFS acts like DAS MapReduce ➝ OneFS goes over the network

Client/server compatibility and maintenance OneFS and MapReduce clusters run different

software versions Hidden benefit of multi-HDFS version access

16

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

MapReduce + OneFS: Mitigations

1GbE < SATA controller < 10GbE Hadoop designed for 1GbE 10GbE prices dropping Denser storage == less nodes/networking

“Rack” locality limits cross-switch contention

17

Rack A Rack B Rack C

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

MapReduce + OneFS: Performance

18

DFS Read Map Task Map Output

Shuffle Reduce Task

DFS Write

Typically ~100Mbit per task from HDFS

I/Os against temp vary considerable per job

More variable, still ~100Mbit per task to HDFS

Performance bottleneck likely to be temp space Terasort example: 75% of I/Os against temp Latency not much impact over HDFS large

block read/write operations http://cto.vmware.com/analyzing-hadoops-internals-with-analytics/

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Details of isi_hdfs_d

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

HDFS Protocols

Two TCP based protocols NameNode – metadata operations DataNode – data transfers

About 26 NameNode RPCs Mostly use fully qualified paths POSIX-like file attrs (modebits, user/group)

Only 2 DataNode client operations Simple read/write with a block identifier More in Apache HDFS for administration

20

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

NameNode Request Example

Example getFileInfo(“/testfile”) request (think stat):

21

“/testfile” Parameter type (string) Method name

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

NameNode Response Example

22

And the reply:

Object type Group Owner

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

isi_hdfs_d

Multi-threaded daemon runs on all nodes Services both NN and DN protocols Translates RPCs to POSIX system calls Stateless, underlying FS handles coherency

23

OneFS Node

isi_hdfs_d

Thread

Request VFS

OneFS Syscall

Response

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Example NameNode RPCs

Most NameNode RPCs are straightforward setPermission() → chmod(…) setTimes() → utimes(…) create() → open(…, O_CREAT, …)

Other RPCs need some creative interpretation recoverLease()/renewLease() abandonBlock() setReplication()

24

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

HDFS Data Path

NN RPC: getBlockLocations(file)/addBlock(file) Returns list of LocatedBlocks

25

LocatedBlock long offset

Block

long blkid

long genStamp long numBytes

DatanodeInfo[] ...

DFSClient connects to DN Chooses which DatanodeInfo

based on locality Only Block structure passed to

DN in read/write operation

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

LocatedBlocks Translation

26

LocatedBlock long offset

Block

long blkid

long genStamp

long numBytes

DatanodeInfo[] ...

Logical byte offset into the file

Opaque to client, used by DN

Inode number Size of extent Absolute byte offset

<IP:port> and rack info for different paths to same block

Specific to OneFS and isi_hdfs_d

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Read Path Example

27

getBlockLocations()

LocatedBlocks

DN_OP_READ(Block)

Data stream

DFSClient NameNode DataNode

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

NameNode Connection Routing

NameNode is configured as single URL Easy configuration: hdfs://log-server.isilon.com:8020/

DNS round-robin to distribute across nodes Metadata IOPs get spread out OneFS maintains cross-node consistency

IP Failover plus client retries for resiliency Hadoop retries ops 5 times at many levels

28

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

DataNode References

DataNode references returned by NameNode All OneFS DataNodes can access same data Each LocatedBlocks for reads has 3 DN refs Round-robin across available nodes Multiple refs lets client try other nodes before

coming back to the NameNode again Write path only 1 reference per logical block No need for client to replicate writes

29

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Few Quirks

User/group identities are strings OneFS natively stores UIDs or SIDs only Requires name resolution on access

Locking! HDFS uses “leases” to restrict to single writer Implemented but no cross-protocol contention

Hadoop apps don’t expect files to move Caveat emptor when mixing protocols

30

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

It Works!

31

You see this across NFS:

And the same directory on HDFS:

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

Conclusions

HDFS protocol can map to POSIX pretty easily Not all traditional shared storage is bad for

Hadoop workflows Locality features worth preserving even without

node locality as a possibility Interoperability can unlock new novel workflows

32

2012 Storage Developer Conference. © EMC Corporation. All Rights Reserved.

33

Questions?

[email protected] Special thanks to Conrad Meyer!