Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger
DESCRIPTION
HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa is intended to provide higher scalability and availability and to maintain very large namespaces. The presentation explains the Giraffa architecture and its motivation, addresses the main challenges, and gives an update on the status of the project.
Presenter: Konstantin Shvachko (PhD), Founder, AltoScale
TRANSCRIPT
The Giraffa File System
Konstantin V. Shvachko
AltoStor – Alto Storage Technologies
Hadoop User Group, September 19, 2012
Giraffa
- Giraffa is a distributed, highly available file system
- Utilizes features of HDFS and HBase
- A new open source project in the experimental stage
Apache Hadoop
- A reliable, scalable, high-performance distributed storage and computing system
- The Hadoop Distributed File System (HDFS)
  - Reliable storage layer
- MapReduce – distributed computation framework
  - Simple computational model
- Ecosystem of Big Data tools
  - HBase, ZooKeeper
The Design Principles
- Linear scalability
  - More nodes can do more work within the same time
  - Applies to both data size and compute resources
- Reliability and availability
  - A drive fails once in 3 years, so the probability that a given drive fails today is about 1/1000 (3 years ≈ 1,000 days)
  - Several drives fail every day on a cluster with thousands of drives
- Move computation to data
  - Minimize expensive data transfers
- Sequential data processing
  - Avoid random reads [use HBase for random data access]
Hadoop Cluster
- HDFS – a distributed file system
  - NameNode – namespace and block management
  - DataNodes – block replica containers
- MapReduce – a framework for distributed computations
  - JobTracker – job scheduling, resource management, lifecycle coordination
  - TaskTracker – task execution module
[Diagram: a Hadoop cluster – one NameNode and one JobTracker; each worker node runs a DataNode paired with a TaskTracker]
Hadoop Distributed File System
- The namespace is a hierarchy of files and directories
- Files are divided into large blocks (128 MB)
- Namespace (metadata) is decoupled from data
  - Fast namespace operations, not slowed down by data transfers
  - Direct data streaming from the source storage
- A single NameNode keeps the entire namespace in RAM
- DataNodes store block replicas as files on local drives
  - Blocks are replicated on 3 DataNodes for redundancy and availability
- HDFS client – the point of entry to HDFS (see the read sketch below)
  - Contacts the NameNode for metadata
  - Serves data to applications directly from DataNodes
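To make the decoupling concrete, here is a minimal sketch of an HDFS read using the standard FileSystem API; the NameNode address and file path are placeholders. The open() call goes to the NameNode for block locations only, and the returned stream pulls bytes directly from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address
    FileSystem fs = FileSystem.get(conf);

    // open() asks the NameNode for block locations (a metadata-only operation)
    FSDataInputStream in = fs.open(new Path("/data/sample.txt"));

    // read() streams bytes directly from the DataNodes holding the replicas
    byte[] buffer = new byte[4096];
    int n = in.read(buffer);
    System.out.println("Read " + n + " bytes");
    in.close();
  }
}
```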
Scalability Limits
- Single-master architecture: a constraining resource
- A single NameNode limits linear performance growth
  - A handful of "bad" clients can saturate the NameNode
  - Single point of failure: takes the whole cluster out of service
- NameNode space limit
  - 100 million files and 200 million blocks with 64 GB of RAM
  - Restricts storage capacity to about 20 PB
  - Small file problem: the block-to-file ratio is shrinking
- "HDFS Scalability: The Limits to Growth," USENIX ;login:, 2010
Node Count Visualization
[Chart: cluster size (number of nodes) vs. resources per node (cores, disks, RAM)]
- 2008: Yahoo!, 4000-node cluster
- 2010: Facebook, 2000 nodes
- 2011: eBay, 1000 nodes
- 2013: clusters of 500 nodes
Horizontal to Vertical Scaling
- Horizontal scaling is limited by the single-master architecture
- Natural growth of compute power and storage density
  - Clusters are composed of denser and more powerful servers
- Vertical scaling leads to shrinking cluster sizes
  - while storage capacity, compute power, and cost remain constant
- Exponential information growth
  - 2006: Chevron accumulates 2 TB a day
  - 2012: Facebook ingests 500 TB a day
Scalability for Hadoop 2.0
- HDFS Federation
  - Independent NameNodes sharing a common pool of DataNodes
  - The cluster is a family of volumes with a shared block storage layer
  - Users see volumes as isolated file systems
  - ViewFS: the client-side mount table (configuration sketch below)
- YARN: the new MapReduce framework
  - Dynamic partitioning of cluster resources: no fixed slots
  - Separation of JobTracker functions
    1. Job scheduling and resource allocation: centralized
    2. Job monitoring and job life-cycle coordination: decentralized
       - Coordination of different jobs is delegated to other nodes
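ViewFS is driven by client-side mount-table properties. A minimal sketch in Java; the mount-table name, NameNode addresses, and mount points are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One client-side view named "clusterX" stitched over two federated volumes
    conf.set("fs.defaultFS", "viewfs://clusterX/");
    conf.set("fs.viewfs.mounttable.clusterX.link./user", "hdfs://nn1:8020/user");
    conf.set("fs.viewfs.mounttable.clusterX.link./data", "hdfs://nn2:8020/data");

    // /user resolves to NameNode nn1 and /data to nn2, transparently to the caller
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/user")));
  }
}
```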
Namespace Partitioning
- Static: Federation
  - Directory sub-trees are statically assigned to disjoint volumes
  - Relocating sub-trees without copying is challenging
  - Scale x10: billions of files
- Dynamic
  - Files and directory sub-trees can move automatically between nodes based on their utilization or load-balancing requirements
  - Files can be relocated without copying data blocks
  - Scale x100: hundreds of billions of files
- The approaches are orthogonal and independent
  - Federation of distributed namespaces is possible
Giraffa File System
- HDFS + HBase = Giraffa
- Goal: build from existing building blocks
- Minimize changes to existing components
1. Store file and directory metadata in an HBase table
   - Dynamic table partitioning into regions
   - Cached in RegionServer RAM for fast access
2. Store file data on HDFS DataNodes: data streaming
3. Block management
   - Handle communication with DataNodes: heartbeats, blockReports, addBlock
   - Perform block allocation, replication, and deletion
Giraffa Requirements
- Availability – the primary goal
  - Load balancing of metadata traffic
  - Same data streaming speed to / from DataNodes
  - Continuous availability: no SPOF
- Cluster operability and management
  - The cost of running a larger cluster is the same as for a smaller one
- More files and more data
                     HDFS         Federated HDFS   Giraffa
Space                25 PB        120 PB           1 EB (1000 PB)
Files + blocks       200 million  1 billion        100 billion
Concurrent clients   40,000       100,000          1 million
HBase Overview
- Table: big, sparse, loosely structured
  - A collection of rows, sorted by row keys
  - Rows can have an arbitrary number of columns
- Dynamic table partitioning!
  - A table is split horizontally into regions
  - RegionServers serve regions to applications
- Columns are grouped into column families: a vertical partitioning of tables
- Distributed cache
  - Regions are loaded into nodes' RAM
  - Real-time access to data
HBase Architecture
[Diagram: HBase architecture]
HBase API
- HBaseAdmin: administrative functions
  - Create, delete, list tables
  - Create, update, delete columns and column families
  - Split, compact, flush
- HTable: access to table data (illustrated below)
  - Result HTable.get(Get g)           // get cells of a row
  - void HTable.put(Put p)             // update a row
  - void HTable.delete(Delete d)       // delete cells / a row
  - ResultScanner getScanner(family)   // scan a column family
  - A variety of Filters
- Coprocessors: custom actions triggered by update events
  - Like database triggers or stored procedures
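As an illustration, a minimal client sketch against the HTable API listed above (the 0.9x-era interface current at the time of the talk); the "Namespace" table name follows the Giraffa design, while the column family and qualifier names are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Namespace");

    // Update a row: one cell in column family "attr" (names are placeholders)
    Put p = new Put(Bytes.toBytes("/user/alice/file1"));
    p.add(Bytes.toBytes("attr"), Bytes.toBytes("length"), Bytes.toBytes(1024L));
    table.put(p);

    // Get the cells of a row
    Result r = table.get(new Get(Bytes.toBytes("/user/alice/file1")));
    long length = Bytes.toLong(r.getValue(Bytes.toBytes("attr"), Bytes.toBytes("length")));
    System.out.println("length = " + length);

    // Scan one column family across rows
    ResultScanner scanner = table.getScanner(Bytes.toBytes("attr"));
    for (Result row : scanner) {
      System.out.println(Bytes.toString(row.getRow()));
    }
    scanner.close();
    table.close();
  }
}
```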
Building Blocks
- Giraffa clients
  - Fetch file and block metadata from the Namespace Service
  - Exchange data with DataNodes
- Namespace Service
  - An HBase table stores file metadata as rows
- Block Management
  - A distributed collection of Giraffa block metadata
- Data Management
  - DataNodes: a distributed collection of data blocks
Giraffa Architecture
[Diagram: an application calls the NamespaceAgent inside the Giraffa client; the agent queries the Namespace Table (row: path, attrs, block[], DN[][]) in HBase, which is backed by Block Management processors over a Block Management Layer of BM servers and DataNodes]
1. The Giraffa client gets files and blocks from HBase
2. The Block Manager handles block operations
3. Data is streamed to or from DataNodes
Giraffa Client
- GiraffaFileSystem implements FileSystem (configuration sketch below)
  - fs.defaultFS = grfa:///
  - fs.grfa.impl = o.a.giraffa.GiraffaFileSystem
- GiraffaClient extends DFSClient
  - NamespaceAgent replaces the NameNode RPC
[Diagram: client stack GiraffaFileSystem → GiraffaClient → DFSClient, with the NamespaceAgent talking to the Namespace Service and DFSClient streaming to DataNodes]
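Putting the two configuration keys from this slide together, a hedged sketch of a client session; the fully qualified class name expands the slide's "o.a.giraffa" abbreviation to "org.apache.giraffa", which is an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GiraffaClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "grfa:///");
    conf.set("fs.grfa.impl", "org.apache.giraffa.GiraffaFileSystem"); // assumed expansion

    // Applications keep using the ordinary FileSystem API; the grfa scheme
    // routes metadata calls through NamespaceAgent to HBase instead of a NameNode
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(new Path("/user/demo"));
  }
}
```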
Namespace Table
- A single table called "Namespace" stores (sketch below)
  - Row key = file ID
  - File attributes: local name, owner, group, permissions, access time, modification time, block size, replication, isDir, length
  - The list of blocks of a file
    - Persisted in the table
  - The list of block locations for each block
    - Not persisted, but discovered from the BlockManager
  - Directory table
    - Maps a directory entry name to the respective child row key
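To visualize one such row, a sketch that inserts file attributes into the "Namespace" table; the column family "file", the qualifier names, and the path-as-key choice are illustrative assumptions, not the project's actual schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NamespaceRowSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable namespace = new HTable(conf, "Namespace");

    // One row per file; here the row key is the file's full path
    Put row = new Put(Bytes.toBytes("/user/alice/part-00000"));
    byte[] fam = Bytes.toBytes("file"); // illustrative column family
    row.add(fam, Bytes.toBytes("owner"), Bytes.toBytes("alice"));
    row.add(fam, Bytes.toBytes("replication"), Bytes.toBytes((short) 3));
    row.add(fam, Bytes.toBytes("blockSize"), Bytes.toBytes(128L * 1024 * 1024));
    row.add(fam, Bytes.toBytes("isDir"), Bytes.toBytes(false));
    // The block list would be persisted as a serialized cell as well;
    // block locations are not stored but fetched from the BlockManager.
    namespace.put(row);
    namespace.close();
  }
}
```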
Namespace Service
[Diagram: (1) the Namespace Service is a set of HBase RegionServers, each running a Namespace (NS) Processor and serving multiple regions; (2) underneath, the Block Management Layer pairs a BM Processor with each RegionServer's regions]
Block Manager
- Maintains a flat namespace of Giraffa block metadata
  1. Block management: block allocation, deletion, and replication
  2. DataNode management: process DataNode block reports and heartbeats; identify lost nodes
  3. Storage for the HBase table: a small file system to store HFiles and the HLog
- A BM server is paired on the same node with a RegionServer
  - A distributed cluster of BM servers
  - Mostly local communication between Region and BM servers
- The NameNode serves as an initial implementation of the BM server
Data Management
- DataNodes store and report data blocks; blocks are files on local drives
- Data transfer to and from clients
- Internal data transfers
- Same as in HDFS
Row Key Design
- Row keys
  - Identify files and directories as rows in the table
  - Define the sorting of rows in the Namespace table
  - And therefore the namespace partitioning
- Different row key definitions based on locality requirements (sketch below)
  - The key definition is chosen when the file system is formatted
- Full-path key is the default implementation
  - Problem: a rename can move an object to another region
- Row keys based on INode numbers
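A minimal sketch of the pluggable row-key idea; the RowKey interface and both implementations below are hypothetical illustrations, not the project's actual classes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// The bytes returned here determine row sorting, and therefore partitioning.
interface RowKey {
  byte[] getKey(String path, long inodeId);
}

// Default: the full path is the key. Files in one directory sort together,
// but renaming into another directory moves the row, possibly across regions.
class FullPathRowKey implements RowKey {
  public byte[] getKey(String path, long inodeId) {
    return path.getBytes(StandardCharsets.UTF_8);
  }
}

// Alternative: a stable INode number is the key. Renames no longer move
// rows, at the cost of giving up path-based locality.
class INodeRowKey implements RowKey {
  public byte[] getKey(String path, long inodeId) {
    return ByteBuffer.allocate(8).putLong(inodeId).array();
  }
}
```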
Locality of Reference
- Files in the same directory are adjacent in the table
  - They belong to the same region (most of the time)
  - Efficient "ls": avoids jumping across regions
- Row keys define the sorting of files and directories in the table
- The tree-structured namespace is flattened into a linear array
- The ordered list of files is self-partitioned into regions
- Challenge: how to retain tree locality in the linearized structure
Partitioning: Random
- Straightforward partitioning based on random hashing
[Diagram: a directory tree whose nodes are hashed to ids and scattered across regions T1–T4, regardless of tree structure]
Partitioning: Full Subtrees
- Partitioning based on lexicographic full-path ordering
- The default for Giraffa
[Diagram: the same tree partitioned so that full sub-trees map to contiguous regions T1–T4]
Partitioning: Fixed Neighborhood
- Partitioning based on fixed-depth neighborhoods
[Diagram: the tree partitioned so that each region T1–T4 holds a neighborhood of nodes within a fixed depth of a common parent]
Atomic Rename
- Giraffa will implement atomic in-place rename
  - No support for atomic file moves from one directory to another
  - Requires INode numbers as unique file IDs
- A move can then be implemented at the application level
  - Non-atomically move the file from the source directory to a temporary file in the target directory
  - Atomically rename the temporary file to its original name
  - On failure, use the simple 3-step recovery procedure
- Eventually implement atomic moves
  - PAXOS
  - Simplified synchronization algorithms (ZAB)
3-Step Recovery Procedure
- A move of a file from srcDir to trgDir failed (recovery sketch below)
  1. If only the source file exists, then start the move over
  2. If only the target temporary file exists, then complete the move by renaming the temporary file to the original name
  3. If both the source and the temporary target file exist, then remove the source and rename the temporary file
     - This step is non-atomic and may fail as well; in case of failure, repeat the recovery procedure
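A sketch of the application-level move with this recovery, written against the generic Hadoop FileSystem API; the ".tmp" naming convention and helper structure are assumptions for illustration:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MoveWithRecovery {

  static void move(FileSystem fs, Path src, Path trgDir) throws Exception {
    Path tmp = new Path(trgDir, src.getName() + ".tmp");
    // Non-atomic: copy into the target directory, deleting the source
    FileUtil.copy(fs, src, fs, tmp, true, fs.getConf());
    // Atomic in-place rename of the temporary file to its original name
    fs.rename(tmp, new Path(trgDir, src.getName()));
  }

  // The 3-step recovery after a failed move
  static void recover(FileSystem fs, Path src, Path trgDir) throws Exception {
    Path tmp = new Path(trgDir, src.getName() + ".tmp");
    Path trg = new Path(trgDir, src.getName());
    boolean srcExists = fs.exists(src);
    boolean tmpExists = fs.exists(tmp);
    if (srcExists && !tmpExists) {
      move(fs, src, trgDir);   // 1. only the source exists: start the move over
    } else if (!srcExists && tmpExists) {
      fs.rename(tmp, trg);     // 2. only the temp file exists: finish the rename
    } else if (srcExists && tmpExists) {
      fs.delete(src, false);   // 3. both exist: drop the source, then rename;
      fs.rename(tmp, trg);     //    if this fails, repeat the recovery
    }
  }
}
```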
New Giraffa Functionality
- Custom file attributes: user-defined file metadata
  - Today hidden in complex file names or nested directories, e.g. /logs/2012/08/31/server-ip.log
  - Or stored in ZooKeeper or even stand-alone DBs, which involves synchronization
  - Enables advanced scanning, grouping, and filtering
- An Amazon S3 API turns Giraffa into reliable storage on the cloud
- Versioning
  - Based on HBase row versioning
  - Restore objects deleted inadvertently
  - An alternative approach to snapshots
Status
- We are on Apache Extras
- A one-node cluster is running
- Row key abstraction
- The HBase implementation is in a separate package
  - Other DBs or key-value stores can be plugged in
- Infrastructure: Eclipse, FindBugs, JavaDoc, Ivy, Jenkins, Wiki
- Server-side processing of FS requests: HBase endpoints
- Testing Giraffa with TestHDFSCLI
- Next: Web UI, multi-node cluster, release…
Thank You!
Related Work
- Ceph
  - Metadata stored on OSDs
  - MDSs cache metadata: dynamic partitioning
- Lustre
  - Plans to release a distributed namespace (in 2.4)
  - Code ready
- Colossus: from Google, per S. Quinlan and J. Dean
  - 100 million files per metadata server
  - Hundreds of servers
- VoldFS, CassandraFS, KTHFS (MySQL): prototypes
- MapR distributed file system
History
- (2008) Idea; study of distributed systems
  - AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
  - Partitioning of the namespace: 4 types of partitioning
- (2009) Study of scalability limits
  - NameNode optimization
- (2010) Design with Michael Stack
  - Presentation at the HDFS contributors meeting
- (2011) Plamen implements a proof of concept
- (2012) Rewrite open-sourced as an Apache Extras project
  - http://code.google.com/a/apache-extras.org/p/giraffa/
Etymology
- Giraffe. Latin: Giraffa camelopardalis
  - Family: Giraffidae
  - Genus: Giraffa
  - Species: Giraffa camelopardalis
- Other languages
  - Arabic: Zarafa
  - Spanish: Jirafa
  - Bulgarian: жирафа
  - Italian: Giraffa
- Favorites of my daughter
  - As the Hadoop traditions require