
    Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers

    A Dissertation Presented

    by

    Richard Paul Spillane

    to

    The Graduate School

    in Partial Fulfillment of the

    Requirements

    for the Degree of

    Doctor of Philosophy

    in

    Computer Science

    Stony Brook University

    Technical Report FSL-02-12

    May 2012

    Copyright by Richard Paul Spillane

    2012

    Stony Brook University

    The Graduate School

    Richard Paul Spillane

    We, the dissertation committee for the above candidate for the Doctor of Philosophy degree, hereby recommend acceptance of this dissertation.

    Erez Zadok, Dissertation Advisor
    Associate Professor, Computer Science Department

    Robert Johnson, Chairperson of Defense
    Assistant Professor, Computer Science Department

    Donald Porter, Third Inside Member
    Assistant Professor, Computer Science Department

    Dr. Margo I. Seltzer, Outside Member
    Herchel Smith Professor, Computer Science, Harvard University

    This dissertation is accepted by the Graduate School

    Charles Taber
    Interim Dean of the Graduate School


    Abstract of the Dissertation

    Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers

    by

    Richard Paul Spillane

    Doctor of Philosophy

    in

    Computer Science

    Stony Brook University

    2012

    A good storage system provides efficient, flexible, and expressive abstractions that allow more concise and general code to be written at the application layer. However, because I/O operations can differ dramatically in performance, a variety of storage system designs have evolved, each tailored to specific kinds of workloads and offering different sets of abstractions. For example, to overcome the gap between random and sequential I/O, databases, file systems, NoSQL database engines, and transaction managers have all come to rely on different on-storage data structures. Consequently, they exhibit different transaction manager designs, offer different or incompatible abstractions, and perform well only for a subset of the universe of important workloads.

    Researchers have worked to coordinate and unify the various storage system designs used in operating systems. They have done so by porting useful abstractions from one system to another, such as adding transactions to file systems, and by improving access to and the efficiency of important interfaces, such as adding a write-ordering system call. These approaches are useful and valid, but they rarely result in major changes to the canonical storage stack, because the need for different storage systems to use different data structures for specific workloads ultimately remains.

    In this thesis, we find that the discrepancy and resulting complexity between various storage systems can be reduced by narrowing the differences in their underlying data structures and transactional designs. This thesis explores two different designs of a consolidated storage system: one that extends file systems with transactions, and one that extends a widely used database data structure with transactions and better support for scaling out to multiple devices. Both efforts result in contributions to file system and database design. Based in part on lessons learned from these experiences, we determined that to significantly reduce system complexity and unnecessary overhead, we must create transactional data structures that support a wider range of workloads. We have designed a generalization of the log-structured merge-tree (LSM-tree) and coupled it with two novel extensions aimed at improving performance. Our system performs sequential and file system workloads 2–5× faster than existing LSM-trees because of algorithmic differences, and runs at 1–2× the speed of unmodified LSM-trees for random workloads because of transactional logging differences. Our system has performance comparable to a pass-through FUSE file system, and superior performance and flexibility compared to the storage engine layers of the widely used Cassandra and HBase systems; moreover, like log-structured file systems and databases, it eliminates unnecessary I/O writes, writing only once when performing I/O-bound asynchronous transactional workloads.

    It is our thesis that by extending the LSM-tree, we can create a viable and new alternative to the traditional read-optimized file system design that efficiently performs both random database and sequential file system workloads. A more flexible storage system can decrease demand for a plethora of specialized storage designs and consequently improve overall system design simplicity while supporting more powerful abstractions such as efficient file and key-value storage and transactions.


    To God: for his creation,

    to Sandra: the bedrock of my house, and

    to Mom, Dad, Sean, and my family and friends: for raising, loving, and teaching me.


    Table of Contents

    List of Figures xi

    List of Tables xii

    Acknowledgments xiv

    1 Introduction 1

    2 Motivation 4
        Motivating Factors 5
        2.1 Important Abstractions: System Transactions and Key-value Storage 5
            2.1.1 Transactions 6
                Transactional Model 7
            2.1.2 Key-Value Storage 9
        2.2 Performance 10
        2.3 Implementation Simplicity 10

    3 Background 12
        3.1 Transactional Storage Systems 13
            3.1.1 WAL and Performance Issues 13
            3.1.2 Journaling File Systems 15
            3.1.3 Database Access APIs and Alternative Storage Regimes 17
                Soft-updates and Featherstitch 17
                Integration of LFS Journal for Database Applications 17
        3.2 Explaining LSM-trees and Background 18
            3.2.1 The DAM Model and B-trees 19
            3.2.2 LSM-tree Operation and Minor/Major Compaction 21
            3.2.3 Other Log-structured Database Data Structures 23
        3.3 Trial Designs 24
            3.3.1 SchemaFS: Building an FS Using Berkeley DB 25
            3.3.2 User-Space LSMFS: Building an FS on LSMs in User-Space with Minimal Overhead 26
        3.4 Putting it Together 27
        3.5 Conclusion 29


    4 Valor: Enabling System Transactions with Lightweight Kernel Extensions 30
        4.1 Background 32
            4.1.1 Database on an LFS Journal and Database FSes 32
            4.1.2 Database Access APIs 33
                Berkeley DB 33
                Stasis 34
        4.2 Design and Implementation 34
            Transactional Model 34
            1. Logging Device 37
            2. Simple Write Ordering 37
            3. Extended Mandatory Locking 37
            4. Interception Mechanism 37
            Cooperating with the Kernel Page Cache 38
            An Example 40
            4.2.1 The Logging Interface 40
                In-Memory Data Structures 41
                    Life Cycle of a Transaction 41
                    Soft vs. Hard Deallocations 43
                On-Disk Data Structures 43
                    Transition Value Logging 43
                LDST: Log Device State Transition 44
                    Atomicity 45
                Performing Recovery 45
                    System Crash Recovery 45
                    Process Crash Recovery 45
            4.2.2 Ensuring Isolation 45
            4.2.3 Application Interception 46
        4.3 Evaluation 47
            4.3.1 Experimental Setup 47
