Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger
DESCRIPTION
HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa is intended to provide higher scalability and availability and to maintain very large namespaces. The presentation explains the Giraffa architecture and its motivation, addresses the main challenges, and gives an update on the status of the project.
Presenter: Konstantin Shvachko (PhD), Founder, AltoScale
TRANSCRIPT
The Giraffa File System
Konstantin V. Shvachko
AltoStor – Alto Storage Technologies
Hadoop User Group, September 19, 2012
Giraffa
- Giraffa is a distributed, highly available file system
- Utilizes features of HDFS and HBase
- A new open source project in the experimental stage
Apache Hadoop
- A reliable, scalable, high-performance distributed storage and computing system
- The Hadoop Distributed File System (HDFS)
  - Reliable storage layer
- MapReduce – distributed computation framework
  - Simple computational model
- Ecosystem of Big Data tools
  - HBase, ZooKeeper
The Design Principles
- Linear scalability
  - More nodes can do more work within the same time
  - Applies to both data size and compute resources
- Reliability and availability
  - A drive fails once in 3 years, so the probability that a given drive fails today is about 1/1000 (3 years ≈ 1,000 days)
  - Several drives fail every day on a cluster with thousands of drives
- Move computation to data
  - Minimize expensive data transfers
- Sequential data processing
  - Avoid random reads [use HBase for random data access]
Hadoop Cluster
- HDFS – a distributed file system
  - NameNode – namespace and block management
  - DataNodes – block replica containers
- MapReduce – a framework for distributed computations
  - JobTracker – job scheduling, resource management, lifecycle coordination
  - TaskTracker – task execution module
[Diagram: a Hadoop cluster – one NameNode and one JobTracker; each worker node runs a DataNode paired with a TaskTracker]
Hadoop Distributed File System
- The namespace is a hierarchy of files and directories
- Files are divided into large blocks (128 MB)
- Namespace (metadata) is decoupled from data
  - Fast namespace operations, not slowed down by data transfers
  - Direct data streaming from the source storage
- A single NameNode keeps the entire namespace in RAM
- DataNodes store block replicas as files on local drives
  - Blocks are replicated on 3 DataNodes for redundancy and availability
- HDFS client – the point of entry to HDFS (see the read sketch below)
  - Contacts the NameNode for metadata
  - Serves data to applications directly from DataNodes
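To make the decoupling concrete, here is a minimal sketch of an HDFS read using the standard FileSystem API; the NameNode address and file path are placeholders. The open() call goes to the NameNode for block locations only, and the returned stream pulls bytes directly from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for block locations (a metadata-only operation)
    FSDataInputStream in = fs.open(new Path("/data/sample.txt"));

    // read() streams bytes directly from the DataNodes holding the replicas
    byte[] buffer = new byte[4096];
    int n = in.read(buffer);
    System.out.println("Read " + n + " bytes");
    in.close();
  }
}
```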
Scalability Limits
- Single-master architecture: a constraining resource
- A single NameNode limits linear performance growth
  - A handful of "bad" clients can saturate the NameNode
  - Single point of failure: takes the whole cluster out of service
- NameNode space limit
  - 100 million files and 200 million blocks with 64 GB of RAM
  - Restricts storage capacity to about 20 PB
  - Small file problem: the block-to-file ratio is shrinking
- "HDFS Scalability: The Limits to Growth," USENIX ;login:, 2010
Node Count Visualization
[Chart: cluster size (number of nodes) vs. resources per node (cores, disks, RAM)]
- 2008: Yahoo!, 4000-node cluster
- 2010: Facebook, 2000 nodes
- 2011: eBay, 1000 nodes
- 2013: clusters of 500 nodes
Horizontal to Vertical Scaling
- Horizontal scaling is limited by the single-master architecture
- Natural growth of compute power and storage density
  - Clusters are composed of denser and more powerful servers
- Vertical scaling leads to shrinking cluster sizes
  - while storage capacity, compute power, and cost remain constant
- Exponential information growth
  - 2006: Chevron accumulates 2 TB a day
  - 2012: Facebook ingests 500 TB a day
Scalability for Hadoop 2.0
- HDFS Federation
  - Independent NameNodes sharing a common pool of DataNodes
  - The cluster is a family of volumes with a shared block storage layer
  - Users see volumes as isolated file systems
  - ViewFS: the client-side mount table (configuration sketch below)
- YARN: the new MapReduce framework
  - Dynamic partitioning of cluster resources: no fixed slots
  - Separation of JobTracker functions
    1. Job scheduling and resource allocation: centralized
    2. Job monitoring and job life-cycle coordination: decentralized
       - Coordination of different jobs is delegated to other nodes
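ViewFS is driven by client-side mount-table properties. A minimal sketch in Java; the mount-table name, NameNode addresses, and mount points are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One client-side view named "clusterX" stitched over two federated volumes
    conf.set("fs.defaultFS", "viewfs://clusterX/");
    conf.set("fs.viewfs.mounttable.clusterX.link./user", "hdfs://nn1:8020/user");
    conf.set("fs.viewfs.mounttable.clusterX.link./data", "hdfs://nn2:8020/data");

    // /user resolves to NameNode nn1 and /data to nn2, transparently to the caller
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/user")));
  }
}
```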
Namespace Partitioning
- Static: Federation
  - Directory sub-trees are statically assigned to disjoint volumes
  - Relocating sub-trees without copying is challenging
  - Scale x10: billions of files
- Dynamic
  - Files and directory sub-trees can move automatically between nodes based on their utilization or load-balancing requirements
  - Files can be relocated without copying data blocks
  - Scale x100: hundreds of billions of files
- The approaches are orthogonal and independent
  - Federation of distributed namespaces is possible
Giraffa File System
- HDFS + HBase = Giraffa
- Goal: build from existing building blocks
- Minimize changes to existing components
1. Store file and directory metadata in an HBase table
   - Dynamic table partitioning into regions
   - Cached in RegionServer RAM for fast access
2. Store file data on HDFS DataNodes: data streaming
3. Block management
   - Handle communication with DataNodes: heartbeats, blockReports, addBlock
   - Perform block allocation, replication, and deletion
Giraffa Requirements
- Availability – the primary goal
  - Load balancing of metadata traffic
  - Same data streaming speed to / from DataNodes
  - Continuous availability: no SPOF
- Cluster operability and management
  - The cost of running a larger cluster is the same as for a smaller one
- More files and more data
                     HDFS         Federated HDFS   Giraffa
Space                25 PB        120 PB           1 EB (1000 PB)
Files + blocks       200 million  1 billion        100 billion
Concurrent clients   40,000       100,000          1 million
HBase Overview
- Table: big, sparse, loosely structured
  - A collection of rows, sorted by row keys
  - Rows can have an arbitrary number of columns
- Dynamic table partitioning!
  - A table is split horizontally into regions
  - RegionServers serve regions to applications
- Columns are grouped into column families: a vertical partitioning of tables
- Distributed cache
  - Regions are loaded into nodes' RAM
  - Real-time access to data
HBase Architecture
[Diagram: HBase architecture]
HBase API
- HBaseAdmin: administrative functions
  - Create, delete, list tables
  - Create, update, delete columns and column families
  - Split, compact, flush
- HTable: access to table data (illustrated below)
  - Result HTable.get(Get g)           // get cells of a row
  - void HTable.put(Put p)             // update a row
  - void HTable.delete(Delete d)       // delete cells / a row
  - ResultScanner getScanner(family)   // scan a column family
  - A variety of Filters
- Coprocessors: custom actions triggered by update events
  - Like database triggers or stored procedures
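As an illustration, a minimal client sketch against the HTable API listed above (the 0.9x-era interface current at the time of the talk); the "Namespace" table name follows the Giraffa design, while the column family and qualifier names are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Namespace");

    // Update a row: one cell in column family "attr" (names are placeholders)
    Put p = new Put(Bytes.toBytes("/user/alice/file1"));
    p.add(Bytes.toBytes("attr"), Bytes.toBytes("length"), Bytes.toBytes(1024L));
    table.put(p);

    // Get the cells of a row
    Result r = table.get(new Get(Bytes.toBytes("/user/alice/file1")));
    long length = Bytes.toLong(r.getValue(Bytes.toBytes("attr"), Bytes.toBytes("length")));
    System.out.println("length = " + length);

    // Scan one column family across rows
    ResultScanner scanner = table.getScanner(Bytes.toBytes("attr"));
    for (Result row : scanner) {
      System.out.println(Bytes.toString(row.getRow()));
    }
    scanner.close();
    table.close();
  }
}
```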
Building Blocks
- Giraffa clients
  - Fetch file and block metadata from the Namespace Service
  - Exchange data with DataNodes
- Namespace Service
  - An HBase table stores file metadata as rows
- Block Management
  - A distributed collection of Giraffa block metadata
- Data Management
  - DataNodes: a distributed collection of data blocks
Giraffa Architecture
[Diagram: an application calls the NamespaceAgent inside the Giraffa client; the agent queries the Namespace Table (row: path, attrs, block[], DN[][]) in HBase, which is backed by Block Management processors over a Block Management Layer of BM servers and DataNodes]
1. The Giraffa client gets files and blocks from HBase
2. The Block Manager handles block operations
3. Data is streamed to or from DataNodes
Giraffa Client
- GiraffaFileSystem implements FileSystem (configuration sketch below)
  - fs.defaultFS = grfa:///
  - fs.grfa.impl = o.a.giraffa.GiraffaFileSystem
- GiraffaClient extends DFSClient
  - NamespaceAgent replaces the NameNode RPC
[Diagram: client stack GiraffaFileSystem → GiraffaClient → DFSClient, with the NamespaceAgent talking to the Namespace Service and DFSClient streaming to DataNodes]
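Putting the two configuration keys from this slide together, a hedged sketch of a client session; the fully qualified class name expands the slide's "o.a.giraffa" abbreviation to "org.apache.giraffa", which is an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GiraffaClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "grfa:///");
    conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem"); // assumed expansion

    // Applications keep using the ordinary FileSystem API; the grfa scheme
    // routes metadata calls through NamespaceAgent to HBase instead of a NameNode
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(new Path("/user/demo"));
  }
}
```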
Namespace Table
- A single table called "Namespace" stores (sketch below)
  - Row key = file ID
  - File attributes: local name, owner, group, permissions, access time, modification time, block size, replication, isDir, length
  - The list of blocks of a file
    - Persisted in the table
  - The list of block locations for each block
    - Not persisted, but discovered from the BlockManager
  - Directory table
    - Maps a directory entry name to the respective child row key
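To visualize one such row, a sketch that inserts file attributes into the "Namespace" table; the column family "file", the qualifier names, and the path-as-key choice are illustrative assumptions, not the project's actual schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespaceRowSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable namespace = new HTable(conf, "Namespace");

    // One row per file; here the row key is the file's full path
    Put row = new Put(Bytes.toBytes("/user/alice/part-00000"));
    byte[] fam = Bytes.toBytes("file"); // illustrative column family
    row.add(fam, Bytes.toBytes("owner"), Bytes.toBytes("alice"));
    row.add(fam, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
    row.add(fam, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
    row.add(fam, Bytes.toBytes("isDir"), Bytes.toBytes(false));
    // The block list would be persisted as a serialized cell as well;
    // block locations are not stored but fetched from the BlockManager.
    namespace.put(row);
    namespace.close();
  }
}
```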
Namespace Service
[Diagram: (1) the Namespace Service is a set of HBase RegionServers, each running a Namespace (NS) Processor and serving multiple regions; (2) underneath, the Block Management Layer pairs a BM Processor with each RegionServer's regions]
Block Manager
- Maintains a flat namespace of Giraffa block metadata
  1. Block management: block allocation, deletion, and replication
  2. DataNode management: process DataNode block reports and heartbeats; identify lost nodes
  3. Storage for the HBase table: a small file system to store HFiles and the HLog
- A BM server is paired on the same node with a RegionServer
  - A distributed cluster of BM servers
  - Mostly local communication between Region and BM servers
- The NameNode serves as an initial implementation of the BM server
Data Management
- DataNodes store and report data blocks; blocks are files on local drives
- Data transfer to and from clients
- Internal data transfers
- Same as in HDFS
Row Key Design
- Row keys
  - Identify files and directories as rows in the table
  - Define the sorting of rows in the Namespace table
  - And therefore the namespace partitioning
- Different row key definitions based on locality requirements (sketch below)
  - The key definition is chosen when the file system is formatted
- Full-path key is the default implementation
  - Problem: a rename can move an object to another region
- Row keys based on INode numbers
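A minimal sketch of the pluggable row-key idea; the RowKey interface and both implementations below are hypothetical illustrations, not the project's actual classes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// The bytes returned here determine row sorting, and therefore partitioning.
interface RowKey {
  byte[] getKey(String path, long inodeId);
}

// Default: the full path is the key. Files in one directory sort together,
// but renaming into another directory moves the row, possibly across regions.
class FullPathRowKey implements RowKey {
  public byte[] getKey(String path, long inodeId) {
    return path.getBytes(StandardCharsets.UTF_8);
  }
}

// Alternative: a stable INode number is the key. Renames no longer move
// rows, at the cost of giving up path-based locality.
class INodeRowKey implements RowKey {
  public byte[] getKey(String path, long inodeId) {
    return ByteBuffer.allocate(8).putLong(inodeId).array();
  }
}
```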
Locality of Reference
- Files in the same directory are adjacent in the table
  - They belong to the same region (most of the time)
  - Efficient "ls": avoids jumping across regions
- Row keys define the sorting of files and directories in the table
- The tree-structured namespace is flattened into a linear array
- The ordered list of files is self-partitioned into regions
- Challenge: how to retain tree locality in the linearized structure
Partitioning: Random
- Straightforward partitioning based on random hashing
[Diagram: a directory tree whose nodes are hashed to ids and scattered across regions T1–T4, regardless of tree structure]
Partitioning: Full Subtrees
- Partitioning based on lexicographic full-path ordering
- The default for Giraffa
[Diagram: the same tree partitioned so that full sub-trees map to contiguous regions T1–T4]
Partitioning: Fixed Neighborhood
- Partitioning based on fixed-depth neighborhoods
[Diagram: the tree partitioned so that each region T1–T4 holds a neighborhood of nodes within a fixed depth of a common parent]
Atomic Rename
- Giraffa will implement atomic in-place rename
  - No support for atomic file moves from one directory to another
  - Requires INode numbers as unique file IDs
- A move can then be implemented at the application level
  - Non-atomically move the file from the source directory to a temporary file in the target directory
  - Atomically rename the temporary file to its original name
  - On failure, use the simple 3-step recovery procedure
- Eventually implement atomic moves
  - PAXOS
  - Simplified synchronization algorithms (ZAB)
3-Step Recovery Procedure
- A move of a file from srcDir to trgDir failed (recovery sketch below)
  1. If only the source file exists, then start the move over
  2. If only the target temporary file exists, then complete the move by renaming the temporary file to the original name
  3. If both the source and the temporary target file exist, then remove the source and rename the temporary file
     - This step is non-atomic and may fail as well; in case of failure, repeat the recovery procedure
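A sketch of the application-level move with this recovery, written against the generic Hadoop FileSystem API; the ".tmp" naming convention and helper structure are assumptions for illustration:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MoveWithRecovery {

  static void move(FileSystem fs, Path src, Path trgDir) throws Exception {
    Path tmp = new Path(trgDir, src.getName() + ".tmp");
    // Non-atomic: copy into the target directory, deleting the source
    FileUtil.copy(fs, src, fs, tmp, true, fs.getConf());
    // Atomic in-place rename of the temporary file to its original name
    fs.rename(tmp, new Path(trgDir, src.getName()));
  }

  // The 3-step recovery after a failed move
  static void recover(FileSystem fs, Path src, Path trgDir) throws Exception {
    Path tmp = new Path(trgDir, src.getName() + ".tmp");
    Path trg = new Path(trgDir, src.getName());
    boolean srcExists = fs.exists(src);
    boolean tmpExists = fs.exists(tmp);
    if (srcExists && !tmpExists) {
      move(fs, src, trgDir);   // 1. only the source exists: start the move over
    } else if (!srcExists && tmpExists) {
      fs.rename(tmp, trg);     // 2. only the temp file exists: finish the rename
    } else if (srcExists && tmpExists) {
      fs.delete(src, false);   // 3. both exist: drop the source, then rename;
      fs.rename(tmp, trg);     //    if this fails, repeat the recovery
    }
  }
}
```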
New Giraffa Functionality
- Custom file attributes: user-defined file metadata
  - Today hidden in complex file names or nested directories, e.g. /logs/2012/08/31/server-ip.log
  - Or stored in ZooKeeper or even stand-alone DBs, which involves synchronization
  - Enables advanced scanning, grouping, and filtering
- An Amazon S3 API turns Giraffa into reliable storage on the cloud
- Versioning
  - Based on HBase row versioning
  - Restore objects deleted inadvertently
  - An alternative approach to snapshots
Status
- We are on Apache Extras
- A one-node cluster is running
- Row key abstraction
- The HBase implementation is in a separate package
  - Other DBs or key-value stores can be plugged in
- Infrastructure: Eclipse, FindBugs, JavaDoc, Ivy, Jenkins, Wiki
- Server-side processing of FS requests: HBase endpoints
- Testing Giraffa with TestHDFSCLI
- Next: Web UI, multi-node cluster, release…
Thank You!
Related Work
- Ceph
  - Metadata stored on OSDs
  - MDSs cache metadata: dynamic partitioning
- Lustre
  - Plans to release a distributed namespace (in 2.4)
  - Code ready
- Colossus: from Google, per S. Quinlan and J. Dean
  - 100 million files per metadata server
  - Hundreds of servers
- VoldFS, CassandraFS, KTHFS (MySQL): prototypes
- MapR distributed file system
History
- (2008) Idea; study of distributed systems
  - AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
  - Partitioning of the namespace: 4 types of partitioning
- (2009) Study of scalability limits
  - NameNode optimization
- (2010) Design with Michael Stack
  - Presentation at the HDFS contributors meeting
- (2011) Plamen implements a proof of concept
- (2012) Rewrite open-sourced as an Apache Extras project
  - http://code.google.com/a/apache-extras.org/p/giraffa/
Etymology
- Giraffe. Latin: Giraffa camelopardalis
  - Family: Giraffidae
  - Genus: Giraffa
  - Species: Giraffa camelopardalis
- Other languages
  - Arabic: Zarafa
  - Spanish: Jirafa
  - Bulgarian: жирафа
  - Italian: Giraffa
- Favorites of my daughter
  - As the Hadoop traditions require