
IN DEGREE PROJECT INFORMATION AND COMMUNICATION TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Dataset versioning for Hops File System
Snapshotting solution for reliable and reproducible data science experiments

BRAULIO GRANA GUTIÉRREZ

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY


Dataset versioning for Hops File System

Snapshotting solution for reliable and reproducible data science experiments

Braulio Grana Gutiérrez

Master of Science Thesis

Communication Systems
School of Information and Communication Technology

KTH Royal Institute of Technology

Stockholm, Sweden

15 June 2017

Examiner: Sarunas Girdzijauskas
TRITA-ICT-EX-2017:159


© Braulio Grana Gutiérrez, 15 June 2017


Abstract

As awareness of the potential of Big Data grows, more and more companies are starting to create their own Data Science divisions, and their projects are becoming big and complex, handled by large multidisciplinary teams. Furthermore, with the expansion of fields such as Deep Learning, Data Science is becoming a very popular research field both in companies and in universities.

In this context it becomes crucial for Data Scientists to be able to reproduce their experiments and test them against previous models developed on previous versions of a dataset. This Master Thesis project presents the design and implementation of a snapshotting system for the distributed file system HopsFS, based on Apache HDFS and developed at the Swedish Institute of Computer Science (SICS).

This project improves on previous solutions designed for both HopsFS and HDFS by solving problems such as the handling of incomplete blocks in snapshots, while also adding new features such as automatic snapshots that allow users to undo the last few changes made to a file.

Finally, an analysis of the implementation was performed in order to compare it to the previous state of HopsFS and to calculate the impact of the solution on the different operations performed by the system. This analysis showed an increase of around 40% in the time needed to perform operations such as read and write under different workloads, due mostly to the new database queries used in this solution.




Contents

1 Introduction
    1.1 Problem description
    1.2 Problem context
    1.3 Motivations
    1.4 Goals
    1.5 Structure of this thesis

2 Background and State of the Art
    2.1 Hadoop Distributed FileSystem
        2.1.1 Architecture
        2.1.2 Goals
        2.1.3 Snapshots
    2.2 HopsFS
        2.2.1 Features
        2.2.2 Interactions with the database
        2.2.3 Operations
    2.3 ZFS
        2.3.1 ZFS snapshots
    2.4 BTRFS
        2.4.1 Main features
        2.4.2 BTRFS Snapshots
    2.5 MySQL Cluster
        2.5.1 Features
        2.5.2 NDB
    2.6 Incomplete block problem
    2.7 State of the Art
        2.7.1 HDFS2 Snapshots
        2.7.2 Facebook’s HDFS snapshots
        2.7.3 Previous HopsFS snapshotting system

3 Solution
    3.1 Basic design
        3.1.1 Storing block versions
        3.1.2 Reading specific versions
        3.1.3 Creating and writing files
        3.1.4 Version numbers
    3.2 Automatic snapshots
    3.3 On-demand snapshots
    3.4 Rollback
    3.5 Versioning structure
        3.5.1 Differentiate completed blocks from old completed blocks
        3.5.2 Differentiate onDemand and automatic snapshots
        3.5.3 Differentiate automatic and on-demand old blocks
    3.6 Database changes
        3.6.1 Metadata changes
        3.6.2 Added queries
    3.7 Cost overhead analysis
    3.8 Contributions
        3.8.1 Solving the incomplete blocks problem
        3.8.2 Improved rollback
        3.8.3 Automatic snapshots
    3.9 Design discussion
        3.9.1 On version storage
        3.9.2 On cyclic version numbers
        3.9.3 On on-demand version numbers
        3.9.4 On old blocks
        3.9.5 On incomplete block handling

4 Analysis
    4.1 Evaluation of local File Systems
    4.2 Evaluation of the solution
        4.2.1 File reading
        4.2.2 File writing

5 Conclusions
    5.1 Conclusion
        5.1.1 Goals achieved
        5.1.2 Insights
    5.2 Future work

Bibliography


List of Figures

2.1 HDFS Architecture
2.2 HopsFS Architecture
2.3 Erasure coding in HopsFS
2.4 Create operation flowchart
2.5 Append operation flowchart
2.6 Visual representation of the incomplete block problem
3.1 Structure of a BlockID
3.2 TakeSnapshot RPC workflow
3.3 Hard rollback visual concept
3.4 Rollback RPC workflow
3.5 Old block problem
3.6 Differentiating between on-demand and automatic snapshots
3.7 Visual example of differentiating between automatic and on-demand old blocks
4.1 First test case for ZFS
4.2 Second test case for ZFS
4.3 First test case for BTRFS
4.4 Second test case for BTRFS
4.5 Third test case for BTRFS
4.6 First case for read scalability testing
4.7 Second case for read scalability testing
4.8 Third case for read scalability testing
4.9 First case for write scalability testing
4.10 Second case for write scalability testing
4.11 Third case for write scalability testing



List of Tables

3.1 Metadata fields added to the database
3.2 New table hdfs_version_to_block



Chapter 1

Introduction

1.1 Problem description

Working on data science problems often requires running a series of experiments, usually to train new models that fit the available data and make it easier to produce predictions based on that data. When working on these kinds of projects in a professional environment, it is usual to have changing datasets that evolve as the company dumps new data from different sources to be analyzed.

Taking this into account, it becomes increasingly difficult to test new models and results and compare them with older ones when they have not been trained on the same dataset. One obvious solution would be to simply re-run the older experiments on the current datasets, but that can be an unbearable task as the time needed to train a model scales to days and even weeks.

For these reasons, the need arises for a mechanism to retrieve the previous states of the datasets on which earlier experiments were run. With that in mind, the main goal of this thesis is to add snapshotting capabilities to a distributed file system, in this case HopsFS [1], a distribution of Apache HDFS [2] developed at KTH [3] and SICS [4]. Furthermore, this problem needs to be addressed with high scalability in mind because the solution will work in a distributed environment handling large amounts of data.

1.2 Problem context

As companies realize the potential value of the data they collect and store from their day-to-day activities, the need for infrastructures on which to perform experiments to extract the value from said data becomes a pressing matter.

When handling and analyzing these great amounts of data, it is unfeasible to do so on a single machine, so companies resort to distributed environments [5] where the workload is spread between several machines in order to speed up the data processing. In this case it is usual to have the data storage managed by a distributed file system such as Apache HDFS.

In this context, this thesis work focuses on designing and implementing a snapshotting system for the distributed file system HopsFS.

1.3 Motivations

From the problem stated above we can extract the following motivations for this project:

Reproducible Data Science experiments
Allowing data scientists to perform reproducible experiments on continuously changing datasets is perhaps the most important motivation for this project. With a snapshotting system in place, data scientists will be able to go back to previous versions of a dataset and reproduce past experiments, as well as compare the validity of new models against those created in the past over the same dataset.

Accidental data loss
The accidental deletion or corruption of data in files, whether through administration errors or version-control misuse, is a common problem for developers. Keeping previous versions of every file in storage removes this problem.

Mandatory data retention
In projects where data retention is mandated by law for a given period of time, a snapshotting system allows users to preserve project data in the required state while still working on it and changing it if necessary.

1.4 Goals

The main goal of this Master Thesis project is to provide HopsFS with a snapshotting system offering the following features:

Taking snapshots on-demand
This gives users the opportunity to create a version of their datasets whenever they deem fit (e.g. whenever the same experiment is run with changes to the data). It lets users flag the state of their datasets at a given time, allowing them to come back to that specific state whenever they want.


Revert last changes of a file
In order to protect users against unwanted changes to their data, this project should provide a way to undo changes in users’ datasets.

Sub-tree snapshots
A requirement of the project is to offer the possibility of snapshotting not the entire file system but only a sub-tree. This way the solution can provide encapsulated snapshots of the different datasets.

Rollbacks to previous versions
Perhaps the main feature of any snapshotting system is the ability to go back to a previous version of a file whenever the user requires it.

Scalability
As this project involves a distributed environment where a large amount of data is handled, any solution designed to fulfill the previously described goals must take into account scalability issues that may affect the performance of the system.

1.5 Structure of this thesis

Chapter 1 describes the problem and its context. Chapter 2 provides the background necessary to understand the problem and the specific knowledge that the reader will need for the rest of this thesis. Following this, Chapter 3 describes the goals, metrics, and solution proposed in this thesis project. The solution is analyzed and evaluated in Chapter 4. Finally, Chapter 5 offers some conclusions and suggests future work.


Chapter 2

Background and State of the Art

This chapter explains the background knowledge required to understand this thesis project, as well as some of the state-of-the-art snapshotting and versioning solutions.

2.1 Hadoop Distributed FileSystem

Hadoop Distributed FileSystem (HDFS) [2] is the distributed storage core of the Apache Hadoop project [8], designed to run on commodity hardware. HDFS is based on the Google FileSystem (GFS) [10], a distributed file system developed at Google whose architecture and functioning were released to the public as a paper in October 2003.

2.1.1 Architecture

HDFS has a master-slave [19] architecture composed of two types of nodes: the Data Nodes (DNs) and the Name Node (NN).

Name Node
The Name Node executes file system namespace operations such as open, close and rename. It also determines the mapping between blocks and Data Nodes, manages heartbeat messages from the Data Nodes, and keeps track of the replication of each block.

Data Nodes
The Data Nodes are in charge of storing the actual data of the file system and manage block operations such as reads and writes requested by a client. Data Nodes can also talk to each other to rebalance data, to move copies around, and to keep the replication of data at the required levels.


Figure 2.1: HDFS Architecture

Each Data Node has its own local file system where it stores each HDFS block as a separate file.

2.1.2 Goals

High fault-tolerance
HDFS is designed to scale to hundreds or even thousands of nodes. This means that, realistically, some components will always be non-functional; therefore, quick and automatic detection of, and recovery from, hardware failures is paramount. This implies taking safety measures in both Name Nodes and Data Nodes.

Data Nodes tolerate hardware failures by always maintaining 2 other mirror replicas of their data in different DNs. The Name Node, on the other hand, ensures fault tolerance of its metadata state by storing multiple copies of a transaction log called the EditLog, which persistently records any change in the file system metadata, and of the FsImage, a file that stores the entire file system namespace including the mapping of blocks to files. In any case, the Name Node is a single point of failure in HDFS and, should it fail, manual intervention would be required.


High data throughput
HDFS is not fully POSIX-compliant [12], as POSIX semantics in a few key areas have been traded away to increase data throughput and support non-POSIX operations such as append. The reasoning behind this trade-off is that HDFS does not have the same goals as a general-purpose POSIX-compliant file system, where low-latency data access would take priority over high data throughput.

Support for large data sets
HDFS supports the storage of large files (in the range of gigabytes to terabytes) by partitioning them across multiple machines. This way, RAID [13] storage is not required on the Data Nodes. The file system is specially tuned for these kinds of files because most applications using HDFS have large datasets.

One shortcoming the file system has in this regard appears when handling large numbers of small files, as the Name Node, being the single point of metadata management, easily becomes a bottleneck.

Simple coherency model
Applications using HDFS mostly run analytics jobs and thus benefit from a write-once-read-many access model for their files. This assumption simplifies data coherency issues and enables high-throughput data access. In exchange, the file system gives up the capability of updating its files, aside from appending new content to them.

Locality
In order to minimize network congestion and increase the overall throughput of the system, HDFS assumes that it is better to move the computation closer to the data rather than the data closer to the computation. This becomes especially true when dealing with large datasets.

Portability
In order to become the platform of choice for a great variety of applications, HDFS has been designed to be easily portable across various hardware platforms and a variety of underlying operating systems. Since the Java implementation cannot use features that are exclusive to the platform on which HDFS is running, there are some performance bottlenecks.


2.1.3 Snapshots

There have been some implementations of snapshots in HDFS [14] in recent years, but they presented a severe problem when snapshotting an incomplete HDFS block: if a snapshotted incomplete block were written to again, restoring the snapshot would not restore that block to its original state at the time of the snapshot.

2.2 HopsFS

Hadoop Open Platform-as-a-service (Hops) [1] [16] is a distributed file system developed at the Swedish Institute of Computer Science (SICS) [4] and based on Apache HDFS.

Figure 2.2: HopsFS Architecture


2.2.1 Features

Metadata storage
In HopsFS, the metadata structures that HDFS keeps in memory are migrated to tables in the MySQL Cluster database [17] [18]. Data is partitioned in a way that ensures that all inodes in the same directory are located on the same database node of the cluster. When migrating the metadata to the relational database [15], strong consistency of the metadata is maintained by reordering all file system and block operations to follow the same order when acquiring locks on inodes, starting at the root inode and then following a depth-first-search path (see the sketch at the end of this section).

Multiple Name Nodes
Another difference is that HopsFS maintains multiple Name Nodes to improve the overall fault tolerance of the system, as well as to avoid the performance bottleneck of using just one Name Node. However, some Name Node tasks, such as replication monitoring or lease management, could become problematic if several Name Nodes attempted to perform them concurrently without synchronization. HopsFS implements a leader election algorithm that allows only one Name Node to perform the problematic tasks at any given time.

Figure 2.3: Erasure coding in HopsFS


Erasure coding
HDFS stores 3 copies of the data in total in order to provide high availability and fault tolerance, but this also triples the storage space required for any given project. HopsFS, on the other hand, uses erasure coding to provide the aforementioned properties while reducing the required storage by 44%.
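The lock-ordering rule described under Metadata storage can be made concrete with a short sketch. This is our own illustration, not HopsFS code; the Inode record and its lock field are hypothetical:

    import java.util.List;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustration of deadlock-free lock ordering: every operation locks the
    // inodes on its path in the same global order, from the root down, so two
    // concurrent operations can never hold locks in conflicting orders.
    class PathLocking {
        record Inode(String name, ReentrantReadWriteLock lock) {}

        static void lockPath(List<Inode> pathFromRoot) {
            for (Inode inode : pathFromRoot) {      // root first, leaf last
                inode.lock().writeLock().lock();
            }
        }

        static void unlockPath(List<Inode> pathFromRoot) {
            for (int i = pathFromRoot.size() - 1; i >= 0; i--) {  // leaf first
                pathFromRoot.get(i).lock().writeLock().unlock();
            }
        }
    }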

2.2.2 Interactions with the database

In Apache HDFS, locks are coarse-grained: every worker thread (a Name Node thread serving a client request) requires a lock on the whole file system to operate. Because HopsFS uses a database as the backend, it can lock only the data it needs, allowing more worker threads to operate in parallel. In HopsFS every operation is handled in a transaction wrapping the original Apache HDFS code, allowing it to use the database as metadata store instead of in-memory data structures.

To increase scalability and reduce the number of round-trips to the database, every operation on the file system transparently manipulates a copy of the appropriate tables that is maintained by the TransactionHandler. This table cache allows the client to read and modify data as if it were communicating with the database directly, but without causing any network activity. Row-level locking guarantees that no other process is modifying the same data while its in-memory copy is being altered.

The transactional request handler operates in two conceptual stages: locking and task execution. A code sketch follows the list below.

• The locking phase locks the required rows and populates the cache with the data necessary for the operation requested by the remote procedure call. During the locking phase, data is read directly from the database if it is not already present in the cache. Once the locking phase completes, no more access to the database is allowed until the final commit after task execution has completed.

• The task execution phase runs the operation as originally implemented in Apache HDFS, but it interacts with the cache set up by the TransactionHandler instead of with in-memory data structures. The operations allowed on the cache are find, findList, and update, and all alterations to the state are recorded so that they can be saved in the database once the operation completes. Given that the code may interact with one or more tables, the cache contains a separate object for every table retrieved during the locking phase.
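A minimal sketch of this two-phase structure, with illustrative names standing in for HopsFS’s actual classes (EntityManager here is a local stub, not a real API):

    import java.io.IOException;

    abstract class TransactionalRequestHandler<T> {
        // Stand-in for the transaction/cache manager (hypothetical stub).
        static class EntityManager {
            static void begin() {}
            static void commit() {}
            static void rollback() {}
        }

        /** Phase 1: lock the required rows and populate the transaction cache. */
        protected abstract void acquireLocks() throws IOException;

        /** Phase 2: run the original HDFS logic against the cache only. */
        protected abstract T performTask() throws IOException;

        public T handle() throws IOException {
            EntityManager.begin();            // start a database transaction
            try {
                acquireLocks();               // reads may hit the database here
                T result = performTask();     // no database access in this phase
                EntityManager.commit();       // flush all recorded changes at once
                return result;
            } catch (IOException e) {
                EntityManager.rollback();
                throw e;
            }
        }
    }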


2.2.3 Operations

This section gives an in-depth description of the HopsFS operations relevant to this thesis. These operations also exist in HDFS, but they are described in the HopsFS section in order to have a fully detailed view of how they work internally in HopsFS, which is necessary to understand parts of the solution resulting from this thesis project. All operations are defined via the Client Protocol and communicate with the Name Node using Remote Procedure Calls (RPC) [29] as specified by the Internet Engineering Task Force [31] in RFC 5531 [30].

Create
The Create operation creates the metadata for a new file on the file system and returns an output stream where data can be written. It is also a Client Protocol RPC.

To operate on any file, the Name Node requires the client to hold a lease (basically a lock). The Create operation also takes care of acquiring said lease from the Name Node and renewing it at regular intervals.

Figure 2.4 shows the information flow for the whole Create operation, which will now be detailed; a client-side code sketch follows the steps.

1. The Create method from the client makes an RPC that gives the Name Node the information required to create a new file, such as: path, permissions, client server name, replication parameters, block size and several flags such as overwrite and append. These flags select the way we want to use the file: creating a new one, overwriting the existing one or simply appending to it.

2. On the Name Node, the RPC is handled by properly locking the needed rows in the database, checking the permissions and flags for the directory and the file to ensure it can be created, overwritten or appended, and then deciding what to do with the file using the flags passed by the client.

3. The Name Node then returns a FileStatus to the client containing the metadata, block locations, replication status and other relevant parameters of the file. The client then opens and returns a new DFSOutputStream where users can write new data to the file.

4. Simultaneously, the client starts a worker thread that acquires and periodically renews the client’s lease on the file using the renewLease RPC.

5. The Name Node maintains the lock on the file’s row in the database so no other client can write to the same file while the current lease is active.

6. The client keeps renewing the lease until the close method is called.

Figure 2.4: Create operation flowchart
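From the client’s perspective the whole flow above collapses into a single call to the stock HDFS API; the path and payload below are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf);
                 // create() issues the Create RPC and returns a stream backed
                 // by a DFSOutputStream; lease renewal runs in the background.
                 FSDataOutputStream out = fs.create(new Path("/datasets/example.csv"))) {
                out.writeBytes("id,value\n");   // buffered into packets client-side
            }                                   // close() stops the lease renewal
        }
    }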

Append
The Append operation is invoked by the client on a file that has already been opened and on which the client is already maintaining a lease. The client stores the data written to the DFSOutputStream in packets that are then sent to one of the Data Nodes in the file locations retrieved by the getBlockLocations RPC, part of the ClientProtocol.

Figure 2.5 shows the information flow for the whole Append operation, which will now be detailed; a client-side sketch follows the steps.

The client has the responsibility of requesting the allocation of new blocks from the Name Node whenever the block currently being written becomes full. The Name Node will add a new block to the INode of the file being written and then create and lock a new row in the BlockInfo table in the database, where the blocks’ metadata are stored. Finally, it will return the locations of the new block to the client.

Figure 2.5: Append operation flowchart

As long as the block being written is not full, the client will proceed as follows:

1. Add information to a packet as it comes into the DFSOutputStream until the size of the packet reaches the preset size (which must be a divisor of the block size). Then the client will send the packet to the first of the Data Node replicas given by the block’s locations.

2. The first Data Node replica will then look up the block file in its native file system and move it to a special folder of replicas being written. In that folder it will append the received packet to the block file. In parallel, it will send the packet to be written on the next mirror Data Node in the pipeline. Once the packet is written to disk, it will be added to the ackQueue for this Data Node. A Data Node does not send back the ACK signal for a packet until it has received all ACKs from the mirror(s) after itself.

3. The following Data Nodes follow a similar procedure to the first one until the last (third) mirror is reached. As this Data Node has no other replicas to which to send the packet, it will only wait for its own ACK of the packet being written and then send the ACK back to the previous Data Node in the pipeline.

4. The ACK signal for that packet will then propagate back through the replica pipeline until it reaches the first Data Node, to which the client originally wrote. This one will then have all ACK signals in its ackQueue and will send the last ACK to the client.
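On the client side, the whole pipeline sits behind one call; this snippet reuses fs from the Create sketch above:

    // append() issues the Append RPC; the packet/ACK pipeline just described
    // runs underneath the returned stream.
    try (FSDataOutputStream out = fs.append(new Path("/datasets/example.csv"))) {
        out.writeBytes("42,3.14\n");   // packets flow through the replica pipeline
    }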

2.3 ZFS

ZFS is an advanced POSIX-compliant file system designed to overcome many of the major problems found in previous designs. An important concept for understanding how ZFS works is the zpool. A zpool is the main storage space of ZFS and can be created on one or more virtual devices (vdevs), which can be physical devices, partitions or even files of another file system. In this storage space we create file systems, which are modular instances that can occupy all the storage capacity of the zpool or up to a set quota.

The most relevant part of ZFS concerning this thesis is its snapshotting capability, but there are many other features of interest, listed below:

Data integrity
ZFS protects users’ data on disk from silent data corruption that can be caused by data degradation, bugs in disk firmware, current spikes, phantom writes, misdirected reads and writes, and many others. This integrity is achieved by using a Fletcher-based checksum [20] or a SHA-256 hash [21] throughout the file system tree, checksumming each block and then saving the checksum value in the pointer to that block.

RAID
ZFS supports using hardware RAID, but it is recommended to use RAID-Z, the software RAID provided by ZFS. RAID-Z [23] is a data/parity distribution scheme like RAID-5 [22], but it uses dynamic stripe width which, combined with the copy-on-write transactional semantics of ZFS, eliminates the write hole error [25].

Capacity
ZFS is a 128-bit file system, which allows for a much higher maximum storage capacity than a 64-bit file system. The limits of ZFS are designed to be so large that they should never be reached in practice.

Deduplication
Deduplication is a data compression technique for eliminating copies of repeated data. ZFS implements deduplication, but at a very expensive RAM cost (1-5 GB of RAM per TB of storage). Using this feature without enough memory may cause lower performance or even complete memory starvation.

2.3.1 ZFS snapshots

ZFS implements snapshots by taking advantage of copy-on-write which, when writing new data, retains the blocks containing the old data. ZFS snapshots are very space-efficient since any unchanged data is shared between the snapshot and its file system and, as new changes are made, new blocks are created to reflect those changes.

Snapshots are ZFS’s most interesting capability as far as this thesis is concerned, as they make it one of the candidate file systems to be used as the host file system on the Data Nodes, solving HDFS’s incomplete block snapshotting problem by applying the native snapshotting capabilities to each HDFS incomplete block.

2.4 BTRFS

B-tree FileSystem (BTRFS for short) is a copy-on-write file system integrated in the Linux kernel that is being jointly developed at multiple companies. The file system is still in an experimental phase, but many of its features and structures are already stable.

BTRFS instances can have separate POSIX namespaces that are mountable separately, called subvolumes. Subvolumes can be created at any place within the file system hierarchy, and they can also be nested. The file system presents subvolumes as subdirectories, and a subvolume cannot be deleted until all nested subvolumes are removed.

2.4.1 Main features

Cloning
This operation atomically creates a copy-on-write copy of a file called a reflink. The clone operation does not create a reference to the original inode, but a new inode whose block pointers reference the same blocks as the original inode.

Subvolume quotas
BTRFS lets the user impose an upper limit on how much a subvolume or snapshot can consume using quota groups, which can be joined into hierarchies to implement quota pools.

In-place ext2/3/4 conversion
BTRFS can convert any ext2/3/4 file system into BTRFS by nesting the equivalent BTRFS metadata in its unallocated space while preserving an unmodified original copy of the file system.

This process involves creating a copy of the original file system metadata and letting BTRFS files point to the same blocks used by the original file system. Thanks to the copy-on-write nature of BTRFS, the original blocks are preserved through updates.

RAID
BTRFS allows the use of RAID5 and RAID6 as well as RAID0, RAID1 and RAID10. Unlike in ZFS, the write hole problem is not yet solved in BTRFS, so there is still a risk of data loss if power goes off during a write operation.

Dynamic sizing
BTRFS allows block devices (such as disks and partitions) to be added to or removed from the file system online. It also allows online volume resizing.

Send/Receive
For any pair of subvolumes or snapshots, BTRFS can create a binary diff between them with the btrfs send command and apply those changes later with btrfs receive. This is especially useful for recreating snapshots, or even entire subvolumes, on new file systems.

2.4.2 BTRFS Snapshots

A BTRFS snapshot is just a special case of subvolume that shares its data and metadata with the original subvolume being snapshotted. Thanks to the copy-on-write nature of BTRFS, the creation of snapshots is really fast and their initial storage consumption minimal.

This capability is the main focus of interest of BTRFS as far as this thesis is concerned and was, along with ZFS, considered for use in the Data Nodes of HopsFS, as its integrated snapshot capability would make it possible to solve HDFS’s incomplete block snapshot problem.

2.5 MySQL Cluster

MySQL Cluster provides shared-nothing [24] clustering for the MySQL database management system. It is designed around a distributed, multi-master, ACID-compliant [11] architecture with no single point of failure and automatic sharding [26] to scale out read and write operations on commodity hardware.

2.5.1 Features

Automatic sharding
Data partitioning across the nodes in the system is decided based on a hashing algorithm over the primary key. If this partitioning does not suit an application, users can define their own partitioning schemes, which can add distribution awareness to the application by partitioning on a subkey common to all rows accessed by common queries.

MySQL Cluster can also support cross-shard queries and transactions, which means that a client can connect to any node and have queries automatically access the right shard.

Hybrid storage
MySQL Cluster works both with disk storage for data and with distributed memory for indexed columns. Storing non-indexed columns on disk allows the system to store datasets larger than the total memory of all the clustered machines.

MySQL Cluster also keeps a redo log on disk and updates it regularly to checkpoint the data, so even if there is a full cluster outage, information can be recovered from disk. The redo log is written asynchronously.

Replication
MySQL Cluster ensures data is written to several nodes upon committing by using a two-phase commit mechanism.

It is also possible to replicate asynchronously between clusters (geographical replication) for disaster recovery or to reduce latency for users in certain areas.


2.5.2 NDB

NDB stands for Network Database and is the underlying distributed database system in MySQL Cluster. It can also be used independently of a MySQL Server.

2.6 Incomplete block problem

Before going into the state-of-the-art solutions for snapshotting systems in HDFS and HopsFS, we will take a look at a recurrent problem in some of those solutions: the incomplete block problem.

This is the problem that happens when a snapshot is taken of a file whose last block is not fully written (e.g. a block of 128 MB that is only storing 24 MB of data). In this case, if snapshots are taken without taking into account the current state of the blocks (the amount of data currently stored in them), then incomplete blocks will have an inconsistent state when the snapshot is retrieved later, if there have been further writes since the snapshot was taken.

Figure 2.6: Visual representation of the incomplete block problem

In HDFS and HopsFS, the only incomplete block that any file can have at any given time is the last one, due to the append-only nature of these systems. This diminishes the impact of the problem, but it is still present and must be dealt with.
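One way to state the problem compactly: a snapshot entry that records only the block’s identity is not enough; it must also capture how many bytes of the last block were valid at snapshot time. The record below is our own illustration, not the thesis’s data model:

    // Without validBytesAtSnapshot, appends made after the snapshot would
    // leak into the snapshot when the block is read back.
    record IncompleteBlockSnapshot(long blockId, long validBytesAtSnapshot) {}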

As we will see in section 2.7, some state-of-the-art solutions are unable to deal with this problem.


2.7 State of the Art

This section showcases current solutions used to implement snapshots either in HDFS or in HopsFS.

2.7.1 HDFS2 Snapshots

The Apache HDFS project implemented a snapshotting system consisting of read-only snapshots of both directories and files. In this solution, snapshots are managed by the Name Node only; the other parts of the system are not even aware of their existence and operate as they did before snapshots existed.

They based the implementation of the data structures needed for the solution on the paper “Making Data Structures Persistent” [28] by Driscoll et al.

This solution only allows directory snapshots to be created on special directories which the system administrator has previously marked as snapshottable. If the system administrator decides to take away said permission, all snapshots of that directory must be deleted beforehand.

The design allows snapshot creation to be constant-time and almost insignificant in cost. When a snapshot is created, the snapshot ID is added to the time-ordered list of snapshots of the directory, and the access time for the involved files and directories remains the same. The access time for the snapshot files also remains the same as for the regular files, as long as there have not been any modifications to the file.

But the real challenge of this solution comes with the modification of snapshotted directories. Those cases are handled using diffs in different ways depending on the type of modification done to the directory (deletion, modification or addition of a file). These modifications have a time cost of O(log(diff size)), and access to the snapshot files also has a time cost increase of O(log(diff size)). The storage cost, on the other hand, is proportional to the unique modifications made since the snapshot was taken.

2.7.2 Facebook’s HDFS snapshots

This solution is fully explained in the paper “Snapshots in Hadoop Distributed File System” [6] by Sameer Agarwal, Dhruba Borthakur and Ion Stoica. It consists in maintaining a fault-tolerant in-memory snapshot tree where each node is associated with an HDFS file or directory and is referenced by its snapshots. Each node in this structure also contains an integer value representing the number of snapshots that are currently pointing to it or to any of its children.

This solution takes advantage of the fact that HDFS relaxes some POSIX restrictions and allows writes to happen only at the end of files, in the form of appends. The only other way of modifying a file is truncating it to a previous offset. This means that the only case in which data can be overwritten in HDFS (and therefore the only case in which we need to copy it to maintain a snapshot) is when an append operation occurs after a truncate.

Based on these facts, they use a selective copy-on-append approach in which they only copy a block when one or more appends happen after one or more truncates. In every other case, a pointer-based solution is enough to keep track of all versions of the file.

They also avoid the incomplete block problem by having the snapshots map the block’s generation timestamp to its length and physical locations; this way, the snapshot can return the last block at the correct offset for the time of the snapshot.

Finally, they had to modify the existing Trash functionality in HDFS so it would not automatically delete blocks when a file is deleted or truncated.

2.7.3 Previous HopsFS snapshotting system

A previous snapshotting system was created for HopsFS as a Master Thesis project [27] by Pushparaj Motamari. That thesis work offers two possible approaches to implementing a snapshotting solution in HopsFS, of which we offer only a very brief description, as neither of them was used in this thesis:

Read-Only Nested Snapshots
This solution allows read-only nested snapshots to be created in special directories specified by the system administrator, much in the way HDFS2 snapshots do.

Read-Only Root Level Single Snapshot
The other solution offered is more limited, as it only allows one global snapshot of the whole file system, but it can be useful in some instances, such as making a backup before upgrading the system.


Chapter 3

Solution

This chapter lays out the design of the solution proposed and developed for this Master Thesis project and justifies the design decisions made during the process.

It first explains the basic architecture for creating and storing versions of files and then moves on to the changes needed in the metadata database in order to implement the solution. It then explains the different types of snapshots taken into consideration and the way they are handled in the different parts of the HopsFS architecture.

The chapter closes with the costs and contributions of this work and a discussion of the chosen solution, its limitations and other possible designs, as well as the reasons why they were discarded.

3.1 Basic design

The key concept of the proposed solution is that, in order to take and maintain a snapshot of a given file, there is no need to actually replicate or even handle the data it stores. This is possible due to the nature of the HopsFS architecture, where metadata about files is stored in an external database and file writes are extremely limited, as they are only possible in the form of appends at the end of the file.

The only exception to this statement is incomplete blocks which, due to the aforementioned writing limitations, exist at most once per file and always in the same position: the last block. The reason why the data contained in these blocks must be handled directly (either by copying it or by snapshotting it using the local file system of the Data Nodes) is to preserve the offset to which the block had been written in that specific version. Otherwise, retrieving that block for a given version would yield a block containing data written after the snapshot was taken.


3.1.1 Storing block versions

In order to keep track of the version to which each block belongs, we append the version as the last byte of the BlockID. One of the main advantages of keeping the version as part of the BlockID is that it simplifies the handling of different versions of incomplete blocks, as they will be considered different files on the Data Nodes.

Figure 3.1: Structure of a BlockID

The appending of the version to the BlockID happens every time a new block is requested. When the Name Node requests a new BlockID from its IDGeneratorFactory, the generated ID is shifted 8 bits to the left and the current version of the file is then added using an OR.
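The packing and unpacking can be expressed in a few lines of Java; the helper names are ours, not HopsFS’s API:

    public final class VersionedBlockId {
        static long withVersion(long generatedId, int version) {
            return (generatedId << 8) | (version & 0xFF); // shift, then OR in version
        }
        static int versionOf(long blockId) { return (int) (blockId & 0xFF); }
        static long baseIdOf(long blockId) { return blockId >>> 8; }

        public static void main(String[] args) {
            long id = withVersion(123_456L, 7);
            System.out.println(baseIdOf(id) + " v" + versionOf(id)); // 123456 v7
        }
    }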

3.1.2 Reading specific versions

Currently, the way Hops handles requests for a file’s blocks is by querying the database for all blocks belonging to the INode of the file. With the solution proposed in this thesis, the core concept of querying the metadata database for all blocks of the INode remains the same, but now it has an extra parameter: the version.

In order to understand the queries needed to get the blocks of a specific version, we first need to ask ourselves which blocks belong to a given version. A version or snapshot of a file is composed of the following blocks:

• Version blocks: the blocks that have the exact same version number as the version we are looking for. These blocks have been either added or modified in the current version.

• Previous completed blocks: all the remaining blocks of the file that had already been completed in previous versions and thus keep that version number. When the number of versions stored reaches the maximum allowed, the earlier versions start to be overwritten by new versions in a cyclic manner, but only the remaining incomplete blocks of those versions are actually deleted, as the completed blocks remain part of the file forever and are only marked with the flag old instead. A more thorough description of this mechanism can be found in section 3.5, and further discussion of why this was the chosen design can be found in section 3.9.

What this means is that we now need two queries to get the blocks for a given version of the file instead of just one. That extra query is the only reading overhead of this solution and, more importantly, it remains constant regardless of the number of versions stored and the version being accessed, due to the fact that we only handle the metadata and the rest of the process remains the same.

Moreover, the extra query required to read the blocks of a file is very efficient. The BlockInfo table in the HopsFS database is partitioned by INode, meaning that, as long as we use the INode in our search, the query only has to go through one partition of the database to find the required information. This kind of query is called an index scan, and index scans are among the fastest types of access to the database. A more detailed description of the queries added for this solution can be found in section 3.6.2.
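In code, the read path amounts to combining the two finds; BlockInfoStore and its finder methods are hypothetical stand-ins for the actual queries of section 3.6.2:

    import java.util.ArrayList;
    import java.util.List;

    class VersionedRead {
        interface BlockInfoStore {
            /** Index scan: blocks of this inode tagged with exactly this version. */
            List<Long> findByInodeAndVersion(long inodeId, int version);
            /** Index scan: blocks of this inode completed in earlier versions. */
            List<Long> findCompletedBefore(long inodeId, int version);
        }

        static List<Long> blocksForVersion(BlockInfoStore store, long inodeId, int version) {
            List<Long> blocks = new ArrayList<>(store.findByInodeAndVersion(inodeId, version));
            blocks.addAll(store.findCompletedBefore(inodeId, version));
            return blocks; // both queries stay within the inode's database partition
        }
    }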

3.1.3 Creating and writing files

The creation of a new file does not incur any overhead from this solution because it only handles the allocation of a new block, which does not interfere with our implementation.

The writing of a file, however, does have an extra overhead because every time something is appended to an incomplete block, the block must be copied and the new data then appended, creating a new version; there is another possible overhead if an old version must be deleted because the maximum number of versions has been reached. The process of creating a new version on write is as follows:

1. The Client sends an Append RPC to the Name Node.

2. The Name Node then gets the blocks for the last version of the file, handles the possible deletion of the oldest version stored (which means an update to the database) and clones the metadata of the last block, increasing its version.

3. It then returns the locations of the new block to the client, who, in turn, starts streaming its data to those locations.

4. On the allocated Data Nodes, the previous last block of the file being written is copied (instead of moved) to the under-construction directory while the file is being written, and renamed to its new BlockID (remember that the version is contained in the BlockID and we just changed it on the Name Node).

5. After the block has finished being written, it is moved back to the regular block directory with its new name, resulting in two different versions of the same file.

This means we have some degree of data duplication (only of the partial last block) and also an overhead when copying. A more thorough discussion of why it was designed this way, and of other possible alternatives, can be found in section 3.9.
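The Data Node side of step 4 boils down to a file copy under the new versioned name; the paths and naming below are illustrative, not the actual Data Node layout:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    class BlockCopyOnAppend {
        static void prepareAppend(Path blockDir, Path rbwDir,
                                  long oldBlockId, long newBlockId) throws IOException {
            Path current = blockDir.resolve("blk_" + oldBlockId);
            Path underConstruction = rbwDir.resolve("blk_" + newBlockId);
            // Copy, not move: the bytes of the previous version stay intact.
            Files.copy(current, underConstruction, StandardCopyOption.COPY_ATTRIBUTES);
            // New packets are appended to underConstruction; on completion the
            // file is moved back into blockDir under its new versioned name.
        }
    }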

3.1.4 Version numbers

Storing the block version inside the BlockID means that the number of versions is limited to what we can store in a single byte, that is, 256 versions. Since we also want to keep storing versions automatically once that limit is reached, it also means that we have to keep the version numbers working as if they were a cyclic data structure.

As explained in previous sections, once the limit of versions is reached we start deleting the oldest ones as we create more, deleting only the incomplete blocks they held. A more detailed description of how this cyclic versioning works and the implementation challenges it has represented for this project can be found in section 3.5.
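The arithmetic of the cyclic counter is simple; the constants and helper names below are illustrative:

    final class CyclicVersions {
        static final int MAX_VERSIONS = 256;   // one byte inside the BlockID

        static int next(int current) {
            return (current + 1) % MAX_VERSIONS;   // 255 wraps back to 0
        }

        /** Version whose incomplete blocks are purged once the window is full. */
        static int evicted(int newVersion, int storedLimit) {
            return Math.floorMod(newVersion - storedLimit, MAX_VERSIONS);
        }
    }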

There are, of course, other possible methods to store the version of a block in the metadata database. Those methods, as well as the reasons behind the final design, can be found in section 3.9.

3.2 Automatic snapshots

The purpose of automatic snapshots is to save the last few versions of a file. This can be used as an undo mechanism to quickly revert the last changes to the file, very similar to the one implemented in Dropbox.

Automatic snapshots are taken every time the file is changed, and there is a maximum number of snapshots that can be stored, which varies from 0, meaning only the last version is stored, to 255, which effectively means the last version plus 255 snapshots.

Once the limit is reached, new snapshots simply replace the oldest ones.

3.3 On-demand snapshots

On-demand snapshots are what gives the user the ability to hold on to an automatic version of a file forever, instead of having it overwritten after the limit is reached and more versions are created. This capability not only allows the user to take a snapshot of the current state of the file and save it as an on-demand snapshot so it is not lost; it also allows the user to select any of the currently stored automatic versions and save it as an on-demand snapshot as well.

These on-demand snapshots were implemented by adding a new RPC to the ClientProtocol API called TakeSnapshot, used to send a request to the Name Node with the file and the version that we want saved as an on-demand snapshot. The only prerequisite for executing this RPC is having opened the file previously. Figure 3.2 further details the functionality of the TakeSnapshot RPC.

Figure 3.2: TakeSnapshot RPC workflow

1. The TakeSnapshot method from the client makes an RPC that gives the Name Node the information required to take a snapshot of a version of the file, that is: the path of the file, the version to be converted to an on-demand snapshot, and the Client's name so the lease can be checked.

2. On the Name Node, the RPC is handled by the NameNodeRpcServer by locking the needed rows in the database, checking that the file is open and that the Client has an active lease on the file.


3. Set all incomplete blocks of the given version as OnDemand blocks in the database. This way they will not be deleted when the automatic version they belong to is overwritten.
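A sketch of what this protocol extension could look like follows; the signature is illustrative and may differ from the actual ClientProtocol addition:

import java.io.IOException;

interface SnapshotProtocolSketch {
    // New RPC (sketch): mark the given automatic version of an open file
    // as an on-demand snapshot so it is never rotated out. The client
    // name lets the Name Node verify the lease on the file.
    void takeSnapshot(String src, int version, String clientName)
            throws IOException;
}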

3.4 Rollback

Every snapshotting system must have a rollback capability to allow users to go back to a previous version. In this solution, going back to a previous version is as simple as reading it as if it were the last version; no extra operations are needed and the cost remains the same.

However, it was decided to also implement a "hard rollback" capability to allow users to go back to a previous version of a file, deleting all versions that follow it and setting it as the new last version of the file.

Figure 3.3: Hard rollback visual concept

In order to do this, a new RPC was created in the ClientProtocol that allows users to send a rollback request to the Name Node with the file and the version they want to roll back to. The only prerequisite for executing this RPC is having opened the file previously. Figure 3.4 further details the functionality of the rollback RPC.

Figure 3.4: Rollback RPC workflow

1. The Rollback method from the client makes an RPC that gives the Name Node the information required to roll back to a version of the file, that is: the path of the file, the target version of the rollback, and the Client's name so the lease can be checked.

2. On the Name Node, the RPC is handled by the NameNodeRpcServer by locking the needed rows in the database, checking that the file is open and that the Client has an active lease on the file.

3. All versions following the target version are retrieved from the metadata database. The blocks belonging to those versions are all deleted, whether they are complete or not. The only exceptions are OnDemand blocks, because they are part of on-demand versions and thus cannot be deleted, and Old blocks, because they are simply still part of the file and do not belong to the versions we want to delete.

4. The last version of the file is set to the target rollback version.
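The selection logic of steps 3 and 4 could be sketched as follows; BlockMeta and MetadataStore are hypothetical stand-ins for the metadata-access layer:

import java.util.List;

class RollbackSketch {
    interface BlockMeta {
        boolean isOnDemand();
        boolean isOldBlock();
    }

    interface MetadataStore {
        List<BlockMeta> findLaterVersions(long inodeId, int target, int current);
        void deleteBlock(BlockMeta block);
        void setLastVersion(long inodeId, int version);
    }

    // Hard rollback (sketch): delete every block of the versions after
    // the target, sparing OnDemand blocks (they belong to snapshots that
    // must be kept) and Old blocks (they are still part of the file),
    // then reset the file's last version.
    static void hardRollback(MetadataStore db, long inodeId,
                             int target, int current) {
        for (BlockMeta block : db.findLaterVersions(inodeId, target, current)) {
            if (!block.isOnDemand() && !block.isOldBlock()) {
                db.deleteBlock(block);
            }
        }
        db.setLastVersion(inodeId, target);
    }
}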


3.5 Versioning structure

As explained before, version numbers for the snapshots are stored inside the Block's ID and, inside said ID, only 1 byte is available for the version number. This essentially means that we can only store 256 versions of a single block simultaneously, and we must therefore implement a cyclic functioning of snapshots that allows old versions to be dropped to store new ones once the maximum number of versions is first reached.

The highest possible value for the version number is 255 because version numbers start at 0, but this does not mean that the number of simultaneous versions of a block is always 256. Users can set the maximum version number as they require, as long as it is in the range [0-255].
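Combined with the skipping of version numbers held by on-demand snapshots (described in sections 3.5.2 and 3.6.2), choosing the next automatic version number could be sketched like this; a minimal sketch that assumes at least one version number in the range is free:

import java.util.Set;

class VersionRotationSketch {
    // Pick the next automatic version number in the cyclic space
    // [0, maxVersion], skipping numbers occupied by on-demand snapshots.
    static int nextAutomaticVersion(int current, int maxVersion,
                                    Set<Integer> onDemandVersions) {
        int candidate = current;
        do {
            candidate = (candidate + 1) % (maxVersion + 1); // cyclic wrap
        } while (onDemandVersions.contains(candidate));
        return candidate;
    }
}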

This necessity to have a cyclic structure for versions has generated several problems during development that will now be listed and explained, along with the solutions adopted and their costs.

In order to explain the problems derived from this versioning structure we will use visual representations in which we aggregate all blocks having the same version number in their respective version bubbles, but the blocks shown represent all blocks being stored for this file at a given moment, or state, as they have been called.

3.5.1 Differentiate completed blocks from old completed blocks

The first problem that arises from having cyclic version numbers is that of differentiating between completed blocks belonging to currently stored versions and completed blocks belonging to versions no longer stored that nevertheless are still part of the file. The latter are blocks that must be retrieved for every version because they belong to all versions currently being stored.

As more and more versions are written, the oldest versions start being deleted and replaced by new ones. In case those deleted old versions had complete blocks, these blocks will not be deleted, as they are still part of the file. When a new version with that same version number is written and has its own complete blocks, the queries to retrieve the blocks will not be able to differentiate between them, which will cause them to retrieve blocks that do not belong to the version asked for.

Figure 3.5 shows an example of the old block problem, where the red blocks were completed blocks from versions no longer stored; now that those versions have been deleted, the blocks have to be made distinguishable from new complete blocks with the same version. In this case, if old blocks were not properly differentiated and the user queried for V0, the query would also get the completed version of block B4 stored in version V1, as it would not be distinguishable from older blocks.


Figure 3.5: Old block problem

Solution

The solution proposed for this problem was using a flag in the metadata database to indicate whether a complete block belongs to a currently stored version or is just an old complete block that must be retrieved for every version.

Blocks are flagged as old when they are complete and about to be deleted to make room for a new version. After that, they are never deleted and never retrieved by any query unless it is told specifically to retrieve old blocks.
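A minimal sketch of this eviction step, assuming a hypothetical metadata-access layer:

import java.util.List;

class EvictionSketch {
    interface BlockMeta {
        boolean isComplete();
    }

    interface MetadataStore {
        List<BlockMeta> findByVersion(long inodeId, int version);
        void markOldBlock(BlockMeta block);
        void deleteBlock(BlockMeta block);
    }

    // Retire the oldest stored automatic version to make room for a new
    // one: complete blocks are kept but flagged as old (they are still
    // part of the file), while incomplete blocks are deleted.
    static void evictVersion(MetadataStore db, long inodeId, int oldest) {
        for (BlockMeta block : db.findByVersion(inodeId, oldest)) {
            if (block.isComplete()) {
                db.markOldBlock(block);
            } else {
                db.deleteBlock(block);
            }
        }
    }
}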

This flag was called is_old_block and more information about it can be found in section 3.6.1.

3.5.2 Differentiate onDemand and automatic snapshots

Due to the different natures of on-demand and automatic snapshots, they are deleted at different paces while still sharing a common version number space (0 to 255). Because of this, version numbers that hold on-demand versions must be treated differently in order not to delete said on-demand snapshots, which are supposed to be kept until the file is deleted.

Figure 3.6 shows an example of how an incomplete block is branded as on-demand when it belongs to an on-demand version, in order to avoid its deletion.


Figure 3.6: Differentiating between on-demand and automatic snapshots

Solution

This problem was solved by adding a new flag to the metadata database that tells whether an incomplete block is part of an on-demand version or not.

This flag is only ever put on blocks that are not complete, because completed blocks must still be retrieved regularly; those are only marked as old, as specified in section 3.5.1. Incomplete blocks are flagged as on-demand when a new on-demand version is created by the user.

The flag was called is_on_demand and more information about it can be found in section 3.6.1.

3.5.3 Differentiate automatic and on-demand old blocks

When there is an on-demand version that is out of rotation, there probably are some old blocks that were created afterwards and therefore should not be retrieved with that version. The problem is that there is no way to differentiate these specific old blocks, because the solution of branding blocks as old was conceived in a context where they all became the same as far as queries are concerned.

Figure 3.7 is an example of the problem described in this section. In this example we can see how the complete version of block B2 has become old once another automatic version has overwritten the previous blocks in version V2.


Figure 3.7: Visual example of differentiating between automatic and on-demand old blocks

The problem now is that when a query tries to retrieve the blocks for the on-demand version V1, it will get both the incomplete version of block B2 marked as OnDemand and the complete version of the same block in V2 marked as old (in red), which is an incorrect state of the file.

Solution

To solve this problem a new table was created that links on-demand versions with all the blocks that should be retrieved to read them. More information about this table and other metadata changes can be found in section 3.6.1.

3.6 Database changes

All the architecture changes explained above as part of the proposed solution require changes in the metadata database, such as new fields in certain tables, new queries and even new tables. This section explains those changes and their purpose in detail.


3.6.1 Metadata changes

In this section we list all metadata that was changed in the database schema in order to implement this thesis' solution. Table 3.1 lists all fields added to the database and their tables.

Table 3.1: Metadata fields added to the database

Field        Table             Type
IsOldBlock   hdfs_block_infos  Boolean
IsOnDemand   hdfs_block_infos  Boolean
LastVersion  hdfs_inodes       Int

IsOldBlock
Boolean field in table hdfs_block_infos to mark complete blocks as old.

IsOnDemand
Boolean field in table hdfs_block_infos to mark incomplete blocks as part of an on-demand version.

LastVersion
Integer field in table hdfs_inodes to indicate the current last version of the file.

Apart from these fields, a new table was added to the database to differentiate old blocks in on-demand versions from old blocks in automatic versions. Table 3.2 shows the structure of this new table.

Table 3.2: New table hdfs_version_to_block

Field            Type    Primary Key
BlockID          BigInt  Yes
INodeID          BigInt  Yes
OnDemandVersion  Int     Yes
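For illustration only, a lookup against this table could be written as follows in plain JDBC. The snake_case column names are assumed by analogy with the other queries in this chapter, and the actual implementation accesses NDB through its own data-access library rather than JDBC.

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

class VersionToBlockLookupSketch {
    // Fetch the IDs of all blocks linked to a given on-demand version of
    // an INode through the hdfs_version_to_block table (Table 3.2).
    static List<Long> blocksOfOnDemandVersion(Connection conn, long inodeId,
                                              int version) throws SQLException {
        String sql = "SELECT block_id FROM hdfs_version_to_block"
                   + " WHERE inode_id = ? AND on_demand_version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, inodeId);
            ps.setInt(2, version);
            try (ResultSet rs = ps.executeQuery()) {
                List<Long> blockIds = new ArrayList<>();
                while (rs.next()) {
                    blockIds.add(rs.getLong(1));
                }
                return blockIds;
            }
        }
    }
}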

3.6.2 Added queries

The solution involves making new queries to the database, since we now have a new parameter involved: the version. This section lists and explains each of those new queries and the cases in which they are used.


Find by INodeID and version
Find all blocks of an INode with a specific version. This query is one of the two, along with "Find complete blocks by INodeID and previous versions", needed to retrieve all blocks of a given automatic version. This query is also used when searching for blocks of an on-demand version.

Listing 3.1 shows the equivalent of this query in pure SQL, although the implementation is not quite like this because the block version is not a separate parameter but part of the BlockID.

Listing 3.1: Query to retrieve all blocks from the given version.

SELECT * FROM block_infos WHERE inode_id = X
AND block_version = VERSION;

Because this query searches by INodeID, which is the partition key for the BlockInfo table, it is a pruned index scan. We can see this in the output of the explain command:

Listing 3.2: EXPLAIN output for ”Find by INodeID and version”.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hdfs_block_infos
   partitions: p11
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: const,const
         rows: 2
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.00 sec)

This output shows only one partition being used in the search, and that the type of the join is ref, which is one of the best access types.

Find complete blocks by INodeID and previous versions
This is the second query used to get the elements of a specific version. What this query does exactly is look for all blocks prior to the given version. Due to the cyclic nature of the version numbers explained in section 3.5, this query is not always the same. For example, if the given version is a smaller number than the current version, it must execute the following query:

Listing 3.3: Query to retrieve all complete blocks from previous versions if VERSION < CURRENT_VERSION.

SELECT * FROM block_infos WHERE inode_id = X
AND (block_version < VERSION OR block_version > CURRENT_VERSION)
AND (num_bytes = MAX_BYTES_BLOCK OR is_old_block = TRUE);

On the other hand, if the given version is a greater number than the current version number, the query to be executed becomes:

Listing 3.4: Query to retrieve all complete blocks from previous versions if VERSION > CURRENT_VERSION.

SELECT * FROM block_infos WHERE inode_id = X
AND block_version < VERSION
AND block_version > CURRENT_VERSION
AND (num_bytes = MAX_BYTES_BLOCK OR is_old_block = TRUE);

In either case the cost of the query is the same, as they are both pruned index scans, as we can see in the following explain command output:

Listing 3.5: EXPLAIN output for "Find complete blocks by INodeID and previous versions".

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hdfs_block_infos
   partitions: p11
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 2
     filtered: 100.00
        Extra: Using where with pushed condition ((`hop_braulio`.`hdfs_block_infos`.`block_under_construction_state` = 1) and (`hop_braulio`.`hdfs_block_infos`.`inode_id` = 1) and (`hop_braulio`.`hdfs_block_infos`.`block_version` < 2)); Using MRR
1 row in set, 1 warning (0.00 sec)

We can see that in this case the type of the query is range, which is not among the fastest query types, but that is inevitable, as it is the type for queries that allow a range of values in one of the parameters, as is the case here. Nevertheless, this query is still really fast because it is a pruned index scan and therefore only has to look up blocks in one partition of the database.
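The choice between the two variants reduces to comparing the requested and current version numbers, as in this sketch (the predicate strings follow Listings 3.3 and 3.4; building SQL by string concatenation is for illustration only):

class PreviousVersionsQuerySketch {
    // Build the WHERE predicate for "complete blocks from previous
    // versions", depending on whether the requested version has wrapped
    // around relative to the current one (cf. Listings 3.3 and 3.4).
    static String previousVersionsPredicate(int version, int currentVersion) {
        String complete = " AND (num_bytes = MAX_BYTES_BLOCK OR is_old_block = TRUE)";
        if (version < currentVersion) {
            return "(block_version < " + version
                 + " OR block_version > " + currentVersion + ")" + complete;
        }
        return "(block_version < " + version
             + " AND block_version > " + currentVersion + ")" + complete;
    }
}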

Find on demand versions by INodeID
The purpose of this query is to get a list of all version numbers that are currently holding an on-demand version and thus cannot hold an automatic version and must be skipped when selecting the next version number to be used for automatic versions.

The SQL query equivalent of this would be as follows:

Listing 3.6: Query to retrieve all version numbers from on-demand versions.

SELECT version FROM block_infos WHERE inode_id = X
AND is_on_demand = TRUE;

And as we can see in the following explain output, this query is a pruned index scan of type ref:

Listing 3.7: EXPLAIN output for ”Find on demand versions by INodeID”.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hdfs_block_infos
   partitions: p10
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 2
     filtered: 50.00
        Extra: Using where with pushed condition (`hop_braulio`.`hdfs_block_infos`.`is_on_demand` = TRUE)
1 row in set, 2 warnings (0.00 sec)

Find old blocks by INodeID
This query finds all blocks marked as old by the is_old_block column in the BlockInfo table. It is used when retrieving both automatic and on-demand versions, to get all completed blocks that are left over from previous versions but, as explained before in sections 3.1 and 3.5, will always be part of the file.

The SQL translation of this query would be:

Listing 3.8: Query to retrieve all old blocks with a given INodeID.

SELECT * FROM block_infos WHERE inode_id = X
AND is_old_block = TRUE;

And as with all previous queries, this is also a pruned index scan, in this case with no ranges used in the parameters, so the access type is ref, as can be seen in the explain output:

Listing 3.9: EXPLAIN output for "Find old blocks by INodeID".

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hdfs_block_infos
   partitions: p10
         type: ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 2
     filtered: 50.00
        Extra: Using where with pushed condition (`hop_braulio`.`hdfs_block_infos`.`is_old_block` = TRUE)
1 row in set, 2 warnings (0.01 sec)

Find later versions by INodeID
This last query is the opposite of "Find complete blocks by INodeID and previous versions": we now want to find all blocks that are not marked as old between the given version and the current one. This is used when performing a Rollback operation, to select the blocks that will be deleted.

As before, due to the cyclic nature of the version numbers, this query can have two different variations depending on whether the given version number is higher than the current version number, in which case this query will be executed:

Listing 3.10: Query to retrieve all blocks from later versions if VERSION > CURRENT_VERSION.

SELECT * FROM block_infos WHERE inode_id = X
AND (block_version > VERSION OR block_version < CURRENT_VERSION);

Or, the opposite case, where the given version number is lower than the current version number, in which case we execute the following query:

Listing 3.11: Query to retrieve all blocks from later versions if VERSION < CURRENT_VERSION.

SELECT * FROM block_infos WHERE inode_id = X
AND block_version > VERSION
AND block_version < CURRENT_VERSION;

Either way, both queries are pruned index scans and will have access type range, because only the rows in a given range are selected, using an index to select them. This can be seen in the explain command output:

Listing 3.12: EXPLAIN output for "Find later versions by INodeID".

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: hdfs_block_infos
   partitions: p11
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 12
          ref: NULL
         rows: 2
     filtered: 100.00
        Extra: Using where with pushed condition ((`hop_braulio`.`hdfs_block_infos`.`inode_id` = 1) and (`hop_braulio`.`hdfs_block_infos`.`block_id` > 1) and (`hop_braulio`.`hdfs_block_infos`.`block_id` < 5)); Using MRR
1 row in set, 1 warning (0.00 sec)

3.7 Cost overhead analysis

This section details the overhead costs in time and space derived from the implementation of this solution, for each operation affected or created.

Create
The creation of a new file does not imply any overhead, as it only handles the allocation of the first block of the file, which does not interfere with the snapshotting system.

Read
Reading any version of the file involves the same time cost as reading the last version, but there is still an overhead compared to the original system. This overhead comes from an extra query that must be performed, where it used to be just one. This extra query has a range type of access, because it has to retrieve elements that match a value in a range, which is a worse type than the ref access of the other query, but it is still very fast as it is still a pruned index scan.

Append
Appending data to a file does have an overhead cost both in time and space. The time overhead comes from several sources:

• The Name Node has to retrieve the metadata for all blocks belonging to the last version from the database, and this, as we have already seen, has an overhead in time.

• Then, the Data Nodes must copy the previous version of the last block to the new one, which takes longer than just appending to it.

• In case the maximum number of versions has been reached, it has to delete the incomplete blocks from the oldest version and brand the complete ones as old. This is a series of updates to the database.

As for the space cost, it depends on the size of the blocks and on how much of the last block is written: the more filled with data it is, the more we have to copy.


Take on-demand snapshot
Taking a snapshot on-demand does not have a space cost at all, since no copying of data is involved. It does have a time cost, which is still small compared to that of writing. The time cost of this operation involves:

• Getting all blocks that have the given version number. Not all blocks belonging to that version, which would include previous completed blocks, but only blocks with that version number that are not old or out of rotation.

• From those blocks, update the incomplete ones to be OnDemand.

Rollback
Rolling back is essentially the same as reading, except for the addition of having to delete all versions following the target rollback version. The time cost of this varies depending on how old the target version is: the older the version, the more blocks will have to be deleted.

3.8 Contributions

The solution proposed for this Master Thesis project solves some problems present in HDFS's current snapshotting system, as well as adding some new capabilities to the previous iteration of this work for HopsFS. In this section we will go over those contributions and explain them in detail.

3.8.1 Solving the incomplete blocks problem

This thesis' solution proposes two methods to avoid this problem (explained in section 2.6) and implements one. The method implemented is to simply copy the last block every time a snapshot is taken. This is a bit more time consuming but still solves the problem. The other method proposed is to use a local File System in the Data Nodes that has an integrated snapshot capability, such as BTRFS or ZFS. This capability would be used to take a local snapshot of the last block every time a snapshot of a file is taken in HopsFS.

3.8.2 Improved rollback

Rollbacks were a problem in the previous iteration of this work. They consumed too much time because they had to traverse the whole File System tree. For this thesis work the rollback time has been greatly improved, as it now costs the same as reading the last version of the file.


Another option was also added to allow for hard rollbacks, which delete all versions following the target rollback version up to the current version and then set the target rollback version as the new current version of the file.

3.8.3 Automatic snapshots

Automatic snapshots are a part of this solution that is not present in any other state-of-the-art solution for either HopsFS or HDFS. They give users an undo capability and thus the option to revert accidental data loss or unintended changes in the data. This capability does exist in other systems such as Dropbox and is of great use for users of those systems.

3.9 Design discussion

In this section we will discuss the solutions adopted for certain parts of the thesis, alternative solutions that were considered, and why they were not chosen.

3.9.1 On version storage

As explained in section 3.1, the version number of a block is stored in the blockID. Doing this forces the system to have a limited number of version numbers, which in turn creates several problems, as explained in section 3.5.

But this was not the initial solution considered for this matter. Initially, a block's version would be stored in the hdfs_block_infos table as an Int or even BigInt field, which would allow for a very high number of version numbers, making the use of cyclic version numbers unnecessary.

This solution was not adopted in the end because version numbers would have to be part of the table's Primary Key, which in turn would mean changing other tables that depended on said Primary Key, and their functionalities in the Name Node.

3.9.2 On cyclic version numbers

Once it was decided that versions would be stored as a byte of the BlockID, the number of version numbers available for each block was reduced to 256. Even with this new limit there was no reason why we could not just allow 256 versions of the file to be made on-demand and avoid the cyclic versions and all the problems they have caused.

The reason it was decided that cyclic version numbers would be needed is that automatic versions containing the most recent changes in the file were always an intended feature for this project, and thus cyclic versions would always be needed.

3.9.3 On on-demand version numbers

Problems caused by cyclic versioning come almost entirely from having automatic versions in the same version number range as on-demand versions. The initial design had separate version numbers for automatic and on-demand versions.

The reason this solution was deemed unfeasible is that turning an automatic version into an on-demand version would require copying all blocks belonging to that version, because changing its version number to one in the on-demand range would mean changing the blocks' IDs, as explained before. And in this case it would not be enough to rename the blocks in the Data Nodes, because we also want to preserve the automatic version upon which the new on-demand version is based.

3.9.4 On old blocks

Having old blocks branded with a flag in the database was the solution adopted for this particular problem, explained in section 3.5, but in the process other solutions were contemplated and discarded. Here we explain those solutions and why they were not adopted.

One possible solution for this problem could have been changing the version of old completed blocks to either the oldest stored version or to a special version number that only stores old completed blocks, every time a version is about to be overwritten. This solution was not deemed feasible for two reasons:

• The extra database round-trips were deemed too much overhead for an append operation on which we are already putting some overhead by having an extra query to get all blocks, another to check if the next version is already taken by an on-demand snapshot, and even more by having to copy the last block in the Data Nodes.

• Since version numbers are integrated in the blockID, changing a block's version would mean changing its blockID, which in turn would mean having to rename the block in the Data Nodes, adding even more overhead to the operation.

3.9.5 On incomplete block handling

The way incomplete blocks are handled in an append operation in the current version of this project is by just copying them in the Data Nodes and then adding the new information to the copies. There are several other solutions to this problem that have not been implemented for this thesis due to time constraints.

The first of those solutions has already been mentioned before and consists of using local File Systems in the Data Nodes that can handle local snapshots, which we can use to make different local versions of the incomplete blocks instead of copying them, which is slower and less space-efficient.

The second solution is to use the same approach as the Facebook solution and have the blocks store an offset field in their metadata that tells how much of the block was written in the current version. This approach should be the most efficient both time- and space-wise and is the one that will be recommended for future work.
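A sketch of the extra metadata the offset-based approach would record, reconstructed from the description above (field names are hypothetical):

class OffsetVersionSketch {
    // All versions share the same physical last block; each version's
    // metadata only records how many bytes of that block are visible to
    // it, so an append needs no copy at all.
    static class BlockVersionMeta {
        final long blockId;
        final int version;
        final long validBytes; // bytes of the block visible in this version

        BlockVersionMeta(long blockId, int version, long validBytes) {
            this.blockId = blockId;
            this.version = version;
            this.validBytes = validBytes;
        }
    }

    // Reading a version simply truncates reads of the last block to
    // validBytes instead of using the block's current physical length.
    static long readableLength(BlockVersionMeta meta) {
        return meta.validBytes;
    }
}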


Chapter 4

Analysis

In this chapter, different metrics of the system built are shown to measure the impact of the snapshots on the overall system. Apart from the system itself, the different options of local file systems to be used in the Data Nodes are benchmarked with regard to their time to take snapshots and their scalability.

4.1 Evaluation of local File Systems

This section shows the analysis performed on the snapshotting capabilities of the two local file systems considered for Data Nodes as part of the solution for the incomplete block problem.

Even though this analysis is not used in this thesis, because another solution was actually adopted as indicated in section 3.9, it is still relevant as the optimal solution, as explained in the future lines of work in section 5.2.

For the local File System in the Data Nodes two options were considered: BTRFS and ZFS, of which ZFS is currently being used for Data Nodes in HopsFS. For both of these File Systems several tests were performed, measuring the time to take a snapshot with different parameters such as the number of files, the number of subvolumes and the number of snapshots per file, to test the scalability of both systems.

In all the tests about to be shown, the snapshots taken over the files are all incremental, in this case by a difference of 10 MB. This was meant to ensure that there were always changes between snapshots, as there would be in a real-life case, in order to make the tests more accurate.
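The shape of such a benchmark loop can be pictured roughly as follows. This is a sketch, not the actual harness: the dataset name and path are made up, zfs snapshot is the standard ZFS command, and the BTRFS runs would use btrfs subvolume snapshot analogously.

import java.nio.file.*;

public class SnapshotBenchSketch {
    public static void main(String[] args) throws Exception {
        byte[] chunk = new byte[10 * 1024 * 1024]; // 10 MB increment
        Path file = Paths.get("/tank/bench/file0");
        for (int i = 0; i < 10000; i++) {
            // Grow the file first so every snapshot captures real changes.
            Files.write(file, chunk, StandardOpenOption.CREATE,
                        StandardOpenOption.APPEND);
            long start = System.nanoTime();
            new ProcessBuilder("zfs", "snapshot", "tank/bench@v" + i)
                    .inheritIO().start().waitFor();
            System.out.printf("snapshot %d: %.2f ms%n",
                              i, (System.nanoTime() - start) / 1e6);
        }
    }
}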

ZFS
Here are shown the test results for ZFS in the relevant test cases used for this analysis.


1. The test referenced by Figure 4.1 shows the performance of the system when taking 10000 incremental snapshots over a single file.

Figure 4.1: First test case for ZFS

2. The results for this test are shown in Figure 4.2. In this test case we use 1000 files, taking 10 incremental snapshots of each of them, making 10000 snapshots in total.

Figure 4.2: Second test case for ZFS


BTRFS
And here we show the results obtained with BTRFS when running the same experiments as with ZFS.

1. The test referenced by Figure 4.3 shows the performance of the system when taking 10000 incremental snapshots over a single file.

Figure 4.3: First test case for BTRFS

2. The results for this test are shown in Figure 4.4. In this test case we use 1000 files, taking 10 incremental snapshots of each of them, making 10000 snapshots in total.

3. The results for this test are shown in Figure 4.5. For this test case we have tried separating the files into different BTRFS subvolumes to see their impact on performance. 1000 subvolumes with 1 file each were used, and 100 incremental snapshots of each of them were taken.

Analysis
The tests clearly show that the time necessary to take a snapshot in ZFS scales linearly, which is not good enough in a solution that will handle thousands of local snapshots as the number of files and the changes to those files increase.

BTRFS, on the other hand, showed better scalability, with time remaining constant within a range of values. The downside of this system is that there is a level of instability in the times that leads to some outlier values where the time needed to complete the snapshot was really high. This instability, however, would not affect the overall use of the system that much, besides some writes taking a bit longer to complete for the users.

Figure 4.4: Second test case for BTRFS

Figure 4.5: Third test case for BTRFS

The conclusion that can be drawn from this evaluation is that BTRFS is better suited for the task of handling a large number of local snapshots in the Data Nodes, since the time necessary to take a snapshot stays within the same range, while in ZFS it scales linearly.


4.2 Evaluation of the solution

This section shows some results that let us judge whether or not the solution designed and described in Chapter 3 is viable, by looking at the average times taken to perform the usual operations. This analysis is also intended to discover any possible unknown bottlenecks not spotted during the design and implementation of the solution, to be solved in future iterations of the project.

The test cases analyzed in this section focus on the usual operations of a File System, such as reading, writing and creating files. The focus is on getting the average times of these operations and calculating their impact on the system compared with HopsFS without taking any snapshots. The setup used for these tests had a single Name Node service and a single Data Node service on the same machine and the NDB database on another.

4.2.1 File reading

Reading data from files is the most repeated operation in data processing systems, and therefore minimizing the impact of the solution on this operation is a crucial point. For this operation we have calculated the average time to read a file using 2 different workloads:

1. The first workload uses files of 1 block each (the block of each file is not fully written and only stores 10 MB of data). This case emulates how many frameworks produce a high number of small files for different purposes while in production, and allows us to see how the system would behave in such scenarios.

The results have been that the original HopsFS takes 27.765 milliseconds on average to read a file with the aforementioned parameters, while the system with the solution in place takes 40.282 milliseconds, which is an increase of 45.08% over the time of the original solution.

2. The second workload uses files of 50 blocks each, which means file sizes of 6.25 GB (remember that blocks in HDFS and HopsFS usually have a size of 128 MB). This workload is intended to test how the system handles the reading of big files, a case similar to handling several big files in a dataset.

The results for this case show that the original HopsFS takes 32.456 milliseconds on average to read a file of 50 blocks, while the system with this thesis' solution in place takes 46.645 milliseconds, which shows an increase of 43.71% in the average time.


The reason for this increase in time is that the previous solution needed only one query while the new one needs two, and the round-trips to the database are the most expensive parts of this operation.

Scalability

As specified in the previous section, the queries are the most expensive parts of the operation, so in order to test the scalability of the solution for file reads, the performance of the two queries needed to read any version of a file was benchmarked to see their behaviour as the number of files and the number of blocks per file increase.

1. For the first test case, shown in Figure 4.6, reads were tested with an increasing number of small (1 block) files in the system, up to 1000 files.

This is the first workload tested when measuring the average time for reads, only this time we increase the number of files in the system with each iteration to observe how the query time scales with the number of files.

Figure 4.6: First case for read scalability testing

2. For the second test case, shown in Figure 4.7, reads were tested again with an increasing number of bigger (50 blocks) files in the system, up to 1000 files.

Again, this is one of the workloads used when calculating average times for reads, and again we increase the number of files in the system with each iteration to test scalability with bigger files, such as would belong in a big dataset.


Figure 4.7: Second case for read scalability testing

3. For the third test case, shown in Figure 4.8, reads were tested with a single file but increasing the number of blocks, up to 10000 blocks.

This case was not used when calculating reading time averages, as it is not a relevant real-world workload since it handles an extremely big file of 1.22 TB, but it is useful to observe the scalability of the system as files grow in size.

Figure 4.8: Third case for read scalability testing


Conclusions

The scalability tests show that increasing the number of files in the system has no effect on the query times, while increasing the number of blocks makes the time scale linearly with a very low slope. The third case used in the tests goes to an extreme with a file reaching 10000 blocks of 128 MB each, which is a huge file (1.22 TB).

4.2.2 File writing

File writing, while less common than the other operations, is still a very important operation to measure in this solution, because of the great impact of both solving the incomplete block problem, where the last block of the file must be copied as explained in section 3.9, and the extra update queries for the new flags created, in the cases where they are needed.

For this operation we have calculated the average time to write to a file using different scenarios, as we did for the reading average times. In this case we have used the following workloads:

1. The first workload uses small 1-block files and measures the time taken to append 10 MB of data to such a file. This workload is very commonly seen in real use cases, as most data processing frameworks write to small, and in some cases temporary, files that they create.

The results have been that the original HopsFS takes 39.567 milliseconds on average to append to a file, while the system with this thesis' solution in place takes 56.09 milliseconds, which means an increase of 41.76% in the average time taken to append 10 MB of data to a file.

2. The second workload handles bigger files of 50 blocks each and measures the time taken to append a full block to such a file. This workload is typical when working with big datasets, to which you usually want to append more data.

The results for this case show that the original HopsFS takes 40.324 milliseconds on average to append a new block to a file of 50 blocks, while the system with this thesis' solution in place takes 56.535 milliseconds, which shows an increase of 40.20% in the average time.

Scalability

As the queries are the most expensive parts of the writing operation, the performance of the queries needed to append one block of data to a file was benchmarked in order to test the scalability of the solution for file appends. For these tests, as for the ones used to calculate the average times, the scenario used has each write needing three queries to complete, because it always has to add a new block as well as update some blocks as old (in this case two blocks for each write).

1. For the first test case, shown in Figure 4.9, appends were tested using one file and writing blocks one by one up to 1000 blocks, setting two blocks as old every time.

Figure 4.9: First case for write scalability testing

2. For the second test case, shown in Figure 4.10, appends were tested using 1000 files, writing one new block as well as setting one block as old in each write.

3. The third test case, shown in Figure 4.11, uses the same parameters as the first test case, with the difference that now the query to set blocks as old has to set an increasing number of blocks instead of just 2.

Conclusions

The scalability tests show that increasing both the number of files and the number of blocks per file in the system has no great effect on the query times, while increasing the number of blocks marked as old in every write does have a minor impact on scalability.


Figure 4.10: Second case for write scalability testing

Figure 4.11: Third case for write scalability testing

This case is not a crucial one for the system, because the number of blocks that need to be marked as old with every append does not necessarily increase as the file grows, but depends on the number of blocks added with each write.


Chapter 5

Conclusions

This chapter explains the conclusions obtained throughout the design, development and evaluation described in this thesis and proposes a number of improvements and extensions that may be of interest in order to continue this work.

5.1 Conclusion

5.1.1 Goals achieved

This Master Thesis project had as its initial goal to create a versioning system for datasets in HopsFS. That goal was achieved with the design and implementation of a complete block-based snapshotting system that is able to solve problems, such as the incomplete blocks problem, that the Hadoop solution for HDFS does not address.

The project has also managed to produce a number of other features for the system, such as automatic snapshots that allow users to undo a certain number of the last changes to a file. Since all snapshots are block-based, we allow for sub-tree snapshots, unlike some of the previous solutions for HopsFS.

All these features come at the cost of extra round-trips to the database, which were explained and detailed in section 3.7.

5.1.2 Insights

There are several conclusions that we can gather from the work done in this thesis. First, the design approach of using file and block metadata to create snapshots, instead of the actual data, has remained the right approach, as it has allowed us to manage the whole system from a limited number of points: the Name Nodes and the metadata database.


Second, from the results shown in the analysis we have been able to conclude that the main bottlenecks in HopsFS are in the metadata database, as the round-trips for queries have taken most of the time in the operations tested.

Lastly, we have gathered insights from design mistakes that have taken a lot of time away from this project, which already required a significant amount of time just to catch up with the codebase of the system. These design pitfalls are documented in section 3.9 as a way of ensuring that future developers of the project will not waste time on them.

5.2 Future work

Unfortunately, the time frame allowed for this thesis has not made it possible to address all the problems and optimizations that a project of these characteristics would require. Therefore there are some points which it is recommended to develop further in future iterations of this work:

Efficient handling of incomplete blocks
For this thesis work, and due to the lack of time to implement a more efficient solution, we have approached the incomplete block problem by copying the last block in the Data Nodes and then appending to it. This solution, however, is temporary and not the way it was planned from the beginning.

Undoubtedly there are better ways to handle this problem, such as using local snapshots in the Data Nodes or storing an offset for the blocks; both have been explained in section 3.9.

Deletion handling
The case of a file deletion from the system was not handled in this thesis work, but it is an important part of the system that will have to be added in future iterations of the work. This will probably have to be done by handling versions of the files also in the directories' INodes.

Query merging
As the required number of queries for every operation has increased with this solution, and the query round-trips have proven to be the most time-consuming part of the solution, it would be an interesting line of work to try to merge these queries into one for each operation. At the moment this is not possible due to API limitations of the library used to access the database, but it remains a potential source of improvement nonetheless.
