FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
EXPLORING NOVEL BURST BUFFER MANAGEMENT ON EXTREME-SCALE HPC
SYSTEMS
By
TENG WANG
A Dissertation submitted to the Department of Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
2017
Copyright © 2017 Teng Wang. All Rights Reserved.
Teng Wang defended this dissertation on March 3, 2017. The members of the supervisory committee were:
Weikuan Yu
Professor Directing Dissertation
Gordon Erlebacher
University Representative
David Whalley
Committee Member
Andy Wang
Committee Member
Sarp Oral
Committee Member
The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.
ACKNOWLEDGMENTS
First and foremost, I would like to express my special thanks to my advisor Dr. Weikuan Yu for
his ceaseless encouragement and continuous research guidance. I came to study in the U.S. with little
research background. At the beginning, I had a hard time following the classes and understanding
the research basics. While everything seemed unfathomable to me, Dr. Yu kept encouraging me to
position myself better, and spent plenty of time talking with me about how to quickly adapt to
the course and research environment in the university. I cannot imagine a better advisor in those
moments. Meanwhile, Dr. Yu spared no effort in steering me towards the right research directions
and took every opportunity to expose me to excellent research scholars, such as my mentors during
my three summer internships. It was with Dr. Yu's generous help that I quickly built up the
background knowledge on file systems and I/O, and learned how to identify cutting-edge research
topics and conduct quality-driven research. I am fortunate to have Dr. Yu as my advisor.
In addition, I gratefully acknowledge the support and instruction I received from Dr. Sarp Oral
and Dr. Bradley Settlemyer during my summer internship at Oak Ridge National Laboratory. I
am also deeply indebted to Dr. Kathryn Mohror, Adam Moody, Dr. Kento Sato and Dr. Tanzima
Islam for their generous help during my two summer internships at Lawrence Livermore National
Laboratory. The moments I spent studying with these excellent research scholars have been
permanently engraved in my memory.
I would also like to thank my committee members Dr. Gordon Erlebacher, Dr. David Whalley
and Dr. Andy Wang for their time and comments to improve this dissertation.
I also appreciate the friendship of all the members of the Parallel Architecture and System
Research Lab (PASL). I joined PASL in 2012 and was fortunate to know most of its members
since the establishment of PASL in 2009. During my PhD study, we worked together as a family
and helped each other unconditionally. I am particularly grateful to Dr. Yandong Wang, Dr. Bin
Wang and Yue Zhu for their substantial help on the burst buffer projects, and Dr. Zhuo Liu, Dr.
Hui Chen and Kevin Vasko for their assistance on the GEOS-5 project. I will also cherish my
friendship with Dr. Cristi Cira, Dr. Jianhui Yue, Xiaobing Li, Huansong Fu, Fang Zhou, Xinning
Wang, Lizhen Shi, Hai Pham, and Hao Zou. With this friendship, I never felt lonely in my PhD
study.
My deepest gratitude and appreciation go to my parents, my parents-in-law and my wife Dr.
Mei Li. It is their everlasting love, encouragement and support that always warm my heart and
illuminate the long journey of pursuing my dreams.
The research in this dissertation was sponsored in part by the Office of Advanced Scientific
Computing Research, U.S. Department of Energy, and performed at Oak Ridge National
Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725,
using resources of the Oak Ridge Leadership Computing Facility, located in the National Center for
Computational Sciences at Oak Ridge National Laboratory. It was also performed in part under
the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under
Contract DE-AC52-07NA27344, and funded in part by an Alabama Innovation Award
and National Science Foundation awards 1059376, 1320016, 1340947, 1432892 and 1561041.
TABLE OF CONTENTS
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction 1
1.1 Overview of Scientific I/O Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Checkpoint/Restart I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Multi-Dimensional I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Overview of Burst Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Representative Burst Buffer Architectures . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Burst Buffer Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Storage Models to Manage Burst Buffers . . . . . . . . . . . . . . . . . . . . . 7
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 BurstMem: Overlapping Computation and I/O . . . . . . . . . . . . . . . . . 10
1.3.2 BurstFS: A Distributed Burst Buffer File System . . . . . . . . . . . . . . . . 10
1.3.3 TRIO: Reshaping Bursty Writes . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.4 Publication Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Problem Statement 14
2.1 Increasing Computation-I/O Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Issues of Contention on the Parallel File Systems . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Analysis of Degraded Storage Bandwidth Utilization . . . . . . . . . . . . . . 16
2.2.2 Analysis of Prolonged Average I/O Time . . . . . . . . . . . . . . . . . . . . . 17
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 BurstMem: Overlapping Computation and I/O 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Memcached Based Buffering Framework . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Overview of Memcached . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Challenges from Scientific Applications . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Design of BurstMem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Software Architecture of BurstMem . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Log-Structured Data Organization with Stacked AVL Indexing . . . . . . . . 25
3.3.3 Coordinated Shuffling for Data Flushing . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Enabling Native Communication Performance . . . . . . . . . . . . . . . . . . 29
3.4 Evaluation of BurstMem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Ingress Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 Egress Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 Scalability of BurstMem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.5 Case Study: S3D, A Real-World Scientific Application . . . . . . . . . . . . . 37
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 BurstFS: A Distributed Burst Buffer File System 41
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Design of BurstFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Scalable Metadata Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Co-Located I/O Delegation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Server-Side Read Clustering and Pipelining . . . . . . . . . . . . . . . . . . . 49
4.3 Evaluation of BurstFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.2 Overall Write/Read Performance . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 Performance Impact of Different Transfer Sizes . . . . . . . . . . . . . . . . . 55
4.3.4 Analysis of Metadata Performance . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.5 Tile-IO Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.6 BTIO Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.7 IOR Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 TRIO: Reshaping Bursty Writes 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Design of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Main Idea of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Server-Oriented Data Organization . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.3 Inter-Burst Buffer Ordered Flush . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Implementation and Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Evaluation of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Overall Performance of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 Performance Analysis of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3 Alleviating Inter-Node Contention . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.4 Effect of TRIO under a Varying OST Count . . . . . . . . . . . . . . . . . . 74
5.4.5 Minimizing Average I/O Time . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6 Conclusions 80
7 Future Work 83
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
LIST OF FIGURES
1.1 Checkpoint/restart I/O patterns (adapted from [32]). . . . . . . . . . . . . . . . . . . 3
1.2 I/O access patterns with multi-dimensional variables. . . . . . . . . . . . . . . . . . . 4
1.3 An overview of burst buffers on an HPC system. BB refers to Burst Buffer. PE refers to Processing Element. CN refers to Compute Node. . . . . . . . . . . . . . . . . . . . 6
2.1 Issues of I/O Contention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The impact of process count on the bandwidth of a single OST. . . . . . . . . . . . . 17
2.3 Scenarios when individual writes are distributed to different numbers of OSTs. “N OST” means that each process’s writes are distributed to N OSTs. . . . . . . . . . . . . . 18
2.4 Bandwidth when individual writes are distributed to different number of OSTs. . . . . 19
3.1 Component diagram of Memcached. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Software architecture of BurstMem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Data structures for absorbing writes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Coordinated shuffling for N-1 data flushing. . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 CCI-based network communication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Ingress I/O bandwidth with a varying server count. . . . . . . . . . . . . . . . . . . . 33
3.7 Ingress I/O bandwidth with a varying client count. . . . . . . . . . . . . . . . . . . . . 34
3.8 Egress I/O bandwidth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9 Dissection of coordinated shuffling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.10 Scalability of BurstMem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.11 I/O performance of S3D with BurstMem. . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 BurstFS system architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Diagram of the distributed key-value store for BurstFS. . . . . . . . . . . . . . . . . . 45
4.3 Diagram of co-located I/O delegation on three compute nodes P, Q and R, each with 2 processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Server-side read clustering and pipelining. . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 Comparison of BurstFS with PLFS and OrangeFS under different write patterns. . . 53
4.6 Comparison of BurstFS with PLFS and OrangeFS under different read patterns. . . . 54
4.7 Comparison of BurstFS with PLFS and OrangeFS under different transfer sizes. . . . 55
4.8 Analysis of metadata performance as a result of transfer size and process count. . . . 56
4.9 Performance of Tile-IO and BTIO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.10 Read bandwidth of IOR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 A conceptual example comparing TRIO with the reactive data flush approach. In (b), reactive data flush incurs unordered arrival (e.g. B7 arrives earlier than B5 at Server 1) and interleaved requests from BB-A and BB-B. In (c), Server-Oriented Data Organization increases sequentiality while Inter-BB Flush Ordering mitigates I/O contention. . . . 65
5.2 Server-Oriented Data Organization with a Stacked AVL Tree. Segments of each server can be sequentially flushed following an in-order traversal of the tree nodes under that server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 The overall performance of TRIO under both inter-node and intra-node writes. . . . . 71
5.4 Performance analysis of TRIO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Flush bandwidth under I/O contention with an increasing process count. . . . . . . . 74
5.6 Flush bandwidth with an increasing stripe count. . . . . . . . . . . . . . . . . . . . . . 75
5.7 Comparison of average I/O time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.8 CDF of job response time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.9 Comparison of total I/O time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
ABSTRACT
The computing power of leadership-class supercomputers has been growing exponentially over
the past few decades, and is projected to reach exascale in the near future. This trend, however,
will continue to push up the peak I/O requirements for checkpoint/restart, data analysis and
visualization. As a result, the conventional Parallel File System (PFS) is no longer adequate for
handling exascale I/O workloads. On one hand, the basic storage unit of the conventional PFS
is still the hard drive, which is expensive in terms of I/O bandwidth and operations per dollar.
Provisioning enough hard drives to meet the I/O requirement at exascale is prohibitively costly.
On the other hand, the effective I/O bandwidth of the PFS is limited by I/O contention, which
occurs when multiple computing processes concurrently write to the same shared disks.
Recently, researchers and system architects have been exploring a new storage architecture with tiers
of burst buffers (e.g. DRAM, NVRAM and SSD) deployed between the compute nodes and the
backend PFS. This additional burst buffer layer offers much higher aggregate I/O bandwidth than
the PFS and is designed to absorb the massive I/O workloads on the slower PFS. Burst buffers
have been deployed on numerous contemporary supercomputers, and they are becoming an
indispensable hardware component of next-generation supercomputers.
There are two representative burst buffer architectures being explored: node-local burst buffers
(burst buffers on compute nodes) and remote shared burst buffers (burst buffers on I/O nodes).
Both types of burst buffers rely on a software management system to provide fast and scalable data
service. However, there is still a lack of in-depth study on the software solutions and their impacts.
On one hand, a number of studies on burst buffers are based on modeling and simulation, which
cannot fully capture the performance impact of various design choices. On the other hand,
existing software development efforts are generally carried out by industrial companies, whose
proprietary products are commercialized without releasing sufficient detail on their internal designs.
This dissertation explores alternative burst buffer management strategies based on research
designs and prototype implementations, with a focus on how to accelerate common scientific
I/O workloads, including bursty writes from checkpointing and bursty reads from restart, analysis
and visualization. Our design philosophy is to leverage burst buffers as a fast intermediate
storage layer to orchestrate the data movement between the applications and burst buffers, as well
as the data movement between burst buffers and the backend PFS. On one hand, the performance
benefit of burst buffers can significantly speed up the data movement between the applications and
burst buffers. On the other hand, this additional burst buffer layer offers extra capacity to buffer
and reshape the write requests, and drain them to the backend PFS in a manner catering to the
most effective utilization of PFS capabilities. Rooted in this design philosophy, this dissertation
investigates three data management strategies. The first two strategies answer how to efficiently
move data between scientific applications and the burst buffers; they are designed for the remote
shared burst buffers and the node-local burst buffers, respectively. The third strategy aims to
speed up data movement between the burst buffers and the PFS; it is applicable to both types of
burst buffers. In the first strategy, a novel burst buffer system named
BurstMem is designed and prototyped to manage the remote shared burst buffers. BurstMem
expedites scientific checkpointing by quickly buffering the checkpoints in the burst buffers after
each round of computation and asynchronously flushing the datasets to the PFS during the next
round of computation. It outperforms the state-of-the-art data management systems with efficient
data transfer, buffering and flushing. In the second strategy, we have designed and prototyped an
ephemeral burst buffer file system named BurstFS to manage the node-local burst buffers. BurstFS
delivers scalable write bandwidth by having each process write to its node-local burst buffer. It also
provides fast and temporary data sharing service for multiple coupled applications in the same job.
In the third strategy, a burst buffer orchestration framework named TRIO is devised to address I/O
contention on the PFS. TRIO buffers scientific applications' bursty write requests, and dynamically
adjusts the flush order of all the write requests so that multiple burst buffers do not compete to
flush to the same disk. Our experiments demonstrate that by addressing I/O contention, TRIO not only
improves the storage bandwidth utilization but also minimizes the average I/O service time for
each job.
Through systematic experiments and comprehensive evaluation and analysis, we have validated
that our design and management solutions for burst buffers can significantly accelerate scientific
I/O on the next-generation supercomputers.
CHAPTER 1
INTRODUCTION
The astonishing growth of the Top 500 supercomputers suggests that, around the 2023 time frame,
the computation power of “leadership-class” systems is likely to surpass 1 exaflop/s (10^18 floating-point
operations per second) [74]. Scientific applications on such large-scale computers often produce
gigantic amounts of data. For example, the CHIMERA [37] astrophysics application outputs 160 TB
of data per checkpoint; writing this data takes around an hour on the Titan [14] supercomputer
hosted at Oak Ridge National Laboratory [66]. Titan has 18,688 compute nodes and is currently
the third fastest supercomputer in the world [24]. An exascale supercomputer is expected to have
millions of nodes with a checkpoint frequency of less than one hour [59, 36]. With such rapid growth
in computing power and data intensity, I/O remains a challenging factor in determining applications'
overall execution time.
Over the past two decades, system designers have been bridging the computation-I/O gap by
expanding the backend PFS. The basic storage unit on PFS is the hard drive, which is cheap for
capacity but expensive for I/O bandwidth. Currently, the provisioned storage bandwidth1 of
leadership-scale supercomputers is a few hundred GB/s. In the exascale computing era, the
peak bandwidth demand will become two orders of magnitude higher [65, 38]. Expanding the PFS
for this level of bandwidth requirement is a cost-prohibitive solution.
To make things worse, conventional HPC systems generally have separate compute nodes and
PFS. Contention on the PFS happens when computing processes concurrently issue bursty write
requests to the shared PFS. Since the number of compute nodes is typically 10x∼100x the
number of storage nodes on the PFS, each storage node suffers from heavy contention during a
typical checkpoint workload [55]. Contention has been considered a key issue that limits the PFS's
bandwidth scalability [55, 72, 109].
Recently, the idea of burst buffers has been proposed to shoulder the exploding data pressure
on the PFS. Many consider burst buffers a promising storage solution for the exascale I/O
1The storage bandwidth is defined as the aggregate bandwidth of all the disks in a parallel file system.
workloads. Burst buffers refer to a broad category of high-performance storage devices, such as
DRAM, NVRAM and SSD. They are positioned between the compute nodes and the PFS, offering
much higher aggregate bandwidth than the PFS. This additional burst buffer layer can efficiently
absorb the heavy I/O workloads on the PFS. Moreover, buffering bursty writes inside burst buffers
gives plenty of opportunities to reshape the write requests and avoid contention on the PFS.
Numerous existing studies on burst buffers are based on modeling and simulation of burst
buffers [65, 82, 109, 87]. Though several companies have also announced their ongoing development
efforts and early products [33, 13], few design details have been made publicly available. In contrast
to these works, this dissertation explores various design choices of a burst buffer system and assesses
their benefits by prototyping burst buffers for the general scientific I/O workloads. Three burst
buffer management strategies are proposed and studied, each highlighting a distinct contribution
(see Section 1.3). In the rest of this chapter, we provide a high-level description of the characteristics
of scientific I/O workloads, the representative burst buffer architectures and their use cases that
underpin this dissertation, and the storage models adopted for our burst buffer management. We
then discuss our research contributions.
1.1 Overview of Scientific I/O Workloads
In this section, we provide an overview of the representative I/O workloads on HPC systems,
which are also the target workloads of this dissertation. These workloads include checkpoint/restart
and multi-dimensional I/O. The two categories are not disjoint: checkpoints often contain
multi-dimensional arrays, and these arrays can be retrieved later for crash recovery, data analysis
and visualization. We classify multi-dimensional I/O in a separate category to emphasize the
diversity of read patterns on a multi-dimensional dataset under the analysis/visualization workloads.
1.1.1 Checkpoint/Restart I/O
Checkpoint/restart is a common fault tolerance mechanism used by HPC applications. During
a run, application processes periodically save their in-memory state in files called checkpoints,
typically written to a PFS. Upon a failure, the most recent checkpoint can then be read to restart
the job. Checkpointing operations are usually concurrent across all processes in an application,
and occur at program synchronization points when no messages are in flight.
[Figure panels: N-1 segmented I/O with a shared file; N-1 strided I/O with a shared file; N-N I/O with individual files.]
Figure 1.1: Checkpoint/restart I/O patterns (adapted from [32]).
There are two dominant I/O patterns for checkpoint/restart, N-1 and N-N, as shown in
Fig. 1.1. In N-N I/O, each process writes/reads data to/from a unique file. In N-1 I/O, all processes
write to, or read from, a single shared file. N-1 I/O can be further classified into two patterns:
N-1 segmented and N-1 strided. In N-1 segmented I/O, each process accesses a non-overlapping,
contiguous file region. In N-1 strided I/O, processes interleave their I/O amongst each other.
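The offset arithmetic behind the two N-1 layouts can be sketched as follows. This is an illustrative Python sketch with hypothetical parameters (`chunk`, `writes_per_proc`), not code from any checkpointing library: it computes where process `rank`'s k-th write of `chunk` bytes lands in the shared file.

```python
def n1_segmented_offset(rank, k, chunk, writes_per_proc):
    # N-1 segmented: each process owns one contiguous region of the shared
    # file; its k-th write lands just after its previous writes.
    return (rank * writes_per_proc + k) * chunk

def n1_strided_offset(rank, k, chunk, nprocs):
    # N-1 strided: each round of writes from all processes interleaves,
    # so consecutive writes of one process are nprocs*chunk bytes apart.
    return (k * nprocs + rank) * chunk

# Example: 3 processes, 2 writes each, 4 MB chunks.
chunk, nprocs, wpp = 4 << 20, 3, 2
segmented = [n1_segmented_offset(r, 0, chunk, wpp) for r in range(nprocs)]
strided = [n1_strided_offset(r, 0, chunk, nprocs) for r in range(nprocs)]
```

In the segmented layout each process's writes stay contiguous, while in the strided layout neighboring processes' chunks interleave throughout the file, which is why the two patterns stress a PFS quite differently.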
On current HPC systems, checkpointing can account for 75%-80% of the total I/O traffic [83].
While there is ongoing debate on how checkpointing operations will change on exascale systems
compared to today’s systems, there is general consensus that the data size per checkpoint will
increase due to larger job scales and the interval between checkpoints will decrease due to increased
overall failure rates [36, 76]. The larger file sizes and shorter intervals for checkpointing will require
orders of magnitude faster storage bandwidth [65].
1.1.2 Multi-Dimensional I/O
Figure 1.2: I/O access patterns with multi-dimensional variables.
Another common I/O workload on HPC systems is data access to multi-dimensional data vari-
ables in scientific applications. While multi-dimensional variables are written in one particular
order, they are often read for analysis or visualization in a different order than the write or-
der [95, 70, 61].
Fig. 1.2(a) shows a sample read pattern on a two-dimensional variable. The variable is initially
decomposed into four blocks assigned to four processes, which write them to a shared file
on the backend PFS. When the variable is read back for analysis, one process may require only
one or more columns from it. However, such columns are stored non-contiguously
across the data blocks. Therefore, this process needs to issue numerous small read requests to four
different data blocks in order to retrieve its data for analysis. Fig. 1.2(b) illustrates a similar but
more complex scenario with a three-dimensional variable. The 3-D variable is initially stored as
eight different blocks across burst buffers. A process may need only a subvolume in the middle
of the variable for analysis. This subvolume has to be gathered from eight different blocks,
resulting in many small read operations. These small read operations are
not favored by the PFS [94, 93, 95].
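To make the write-order/read-order mismatch concrete, the following Python sketch (with hypothetical dimensions and element size) computes the byte offsets a column read must touch when a 2-D variable is stored in row-major order: a single column read turns into one small request per row, whereas a row read is one contiguous request.

```python
ROWS, COLS, ITEMSIZE = 4, 4, 8  # hypothetical 2-D variable, row-major on disk

def column_read_offsets(col):
    # Element (r, col) lives at byte (r*COLS + col)*ITEMSIZE, so reading one
    # column requires ROWS small reads, each only ITEMSIZE bytes long and
    # COLS*ITEMSIZE bytes apart from the next.
    return [(r * COLS + col) * ITEMSIZE for r in range(ROWS)]

def row_read_extent(row):
    # By contrast, a whole row is a single contiguous request:
    # (starting byte offset, length in bytes).
    return (row * COLS * ITEMSIZE, COLS * ITEMSIZE)

col_requests = column_read_offsets(2)   # ROWS non-contiguous small reads
row_request = row_read_extent(1)        # one contiguous read
```

The same arithmetic extends to the 3-D subvolume case, where the number of non-contiguous fragments grows with every dimension that is not stored innermost.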
1.2 Overview of Burst Buffers
In this section, we first outline the representative burst buffer architectures, and their use cases
our dissertation is based on. We then discuss the alternative storage models adopted in our burst
buffer management.
1.2.1 Representative Burst Buffer Architectures
Two representative burst buffer architectures are shown in Figure 1.3. In the node-local burst
buffer architecture, burst buffers are located on the individual compute nodes. The benefit of
this architecture is that the aggregate burst buffer bandwidth scales linearly with the number
of compute nodes. Scientific applications can acquire scalable write bandwidth by having each
process write its data to its local burst buffer. However, flushing buffered data to the backend PFS
requires extra computing power on the compute nodes, which can incur computation jitters [31].
In the remote shared burst buffer architecture, burst buffers are placed on the dedicated I/O nodes
deployed between the compute nodes and the backend PFS. Data movement from the compute
nodes to the burst buffers needs to go through the network. This architecture achieves excellent
resource isolation since flushing data to the backend PFS is done without interfering with the
computation on the compute nodes. However, in contrast to the node-local burst buffers, its I/O
bandwidth2 is dependent on several factors, including the network bandwidth, the aggregate burst
buffer bandwidth and the number of I/O nodes.
Both types of burst buffers have been widely deployed on existing high-end computing systems.
They are also projected to ship with a majority of the next-generation supercomputers.
For instance, node-local burst buffers have been equipped on DASH [53] at the San Diego
Supercomputer Center [17], Catalyst [4] and Hyperion [12] at the Lawrence Livermore National
Laboratory (LLNL) [15], TSUBAME supercomputer series [26] at the Tokyo Institute of Technol-
ogy, and Theta [22] at the Argonne National Laboratory (ANL). They will also come with several
next-generation supercomputers in a few years, such as Summit [21] at the Oak Ridge National
Laboratory and Sierra [20] at the Lawrence Livermore National Laboratory. On the other hand,
remote shared burst buffers have been set up on Tianhe-2 [23] at the Chinese National Supercom-
puter Center, Trinity [25] at the Los Alamos National Laboratory, and Cori [5] at the Lawrence
2I/O bandwidth is typically measured as the quotient of the total data size read/written by an application and its total I/O time.
Figure 1.3: An overview of burst buffers on an HPC system. BB refers to Burst Buffer. PE refers to Processing Element. CN refers to Compute Node.
Berkeley National Laboratory. As one of the next-generation supercomputers, Aurora [2] at the
Argonne National Laboratory features a heterogeneous burst buffer architecture including both
node-local burst buffers and remote shared burst buffers.
1.2.2 Burst Buffer Use Cases
Burst buffers are designed to accelerate the bursty write and read operations exhibited by most
scientific applications. Although burst buffers also have the potential to support cloud-based
applications that issue intermittent read/write requests, those use cases are beyond the scope
of this dissertation. In general, burst buffers' support for scientific applications can be
summarized into the following use cases.
• Checkpoint/Restart: Applications periodically write “checkpoints” that include snapshots
of data structures and state information. Upon a failure, they can load the checkpoints
and roll back to a previous state. Applications checkpointing to/restarting from burst buffers
can reap much higher aggregate bandwidth than from the PFS.
• Overlapping Computation and Checkpointing to PFS: Scientific applications' life
cycles generally alternate between a phase of computation and a phase of checkpointing.
Applications can hide the checkpointing latency by temporarily buffering the data in burst buffers
after a phase of computation, and letting burst buffers asynchronously flush the checkpoints to
the PFS during the next phase of computation.
• Reshaping Bursty Writes to PFS: Burst buffers stand as a middle layer that absorbs
bursty writes to the PFS. With all the buffered write requests, burst buffers have global
knowledge of how scientific data are laid out on the PFS. Based on this knowledge, they can
reshape the write traffic to avoid contention on the PFS.
• Staging/Sharing Intermediate data: Intermediate data such as checkpoints or plot files
are staged for two purposes: out-of-core computation and data sharing. In the former case,
applications with insufficient memory can swap out a portion of their in-memory data to burst
buffers, and later read them back for computation. In the latter case, a job usually consists
of multiple scientific workflows sharing data with each other. For instance, the output of a
simulation program can be used as the input for an analysis program to perform post-analysis
(after simulation) and in-situ/in-transit analysis (during simulation). In both cases, staging
the intermediate data to burst buffers and loading the data from burst buffers are much faster
than PFS.
• Prefetching Data for Fast Analysis: Scientific applications with the foreknowledge of
future reads can hide the read latency on PFS by prefetching data to burst buffers.
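The second use case above — overlapping computation with checkpointing to the PFS — can be sketched as a simple producer/consumer pattern. The queue, background thread, and sleep below are illustrative stand-ins for a burst buffer and a slow PFS write; they are not part of any real burst buffer API.

```python
import queue
import threading
import time

flushed = []
flush_q = queue.Queue()

def flusher():
    # Background thread: drains buffered checkpoints to the (simulated) PFS.
    while True:
        ckpt = flush_q.get()
        if ckpt is None:
            break
        time.sleep(0.01)          # stand-in for a slow PFS write
        flushed.append(ckpt)
        flush_q.task_done()

threading.Thread(target=flusher, daemon=True).start()

for step in range(3):
    # ... compute phase ...
    flush_q.put(f"ckpt_{step}")   # fast dump to the burst buffer; returns immediately
    # the next compute phase overlaps with the background flush

flush_q.join()                    # eventually every checkpoint reaches the PFS
print(flushed)                    # ['ckpt_0', 'ckpt_1', 'ckpt_2']
```

Because `put` returns as soon as the data lands in the buffer, the application pays only the burst buffer's write latency; the PFS cost is hidden behind the next compute phase.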
Due to the architectural differences, node-local burst buffers and remote shared burst buffers have
distinct preferences on these use cases. Since node-local burst buffers reside on the compute nodes,
and each compute node is generally dedicated to an individual job, data on node-local burst buffers
are temporarily available within the individual job’s life cycle. Under this constraint, we can
leverage node-local burst buffers to perform scalable checkpointing and enable temporary data
sharing among coupled workflow applications in the same job. On the other hand, data on the remote shared burst buffers are independent of the life cycle of any individual job, so we can harness remote shared burst buffers to provide center-wide data services to all the jobs on the compute nodes. Moreover, remote shared burst buffers are jitter-free for data flushing. They are therefore well suited to overlapping computation with checkpointing to the PFS.
1.2.3 Storage Models to Manage Burst Buffers
A key consideration before structuring a burst buffer system is what type of storage model can be used to manage burst buffers. In particular, two representative storage models are widely adopted in cloud and HPC environments: distributed file systems and databases. Each offers an alternative approach to managing burst buffers. Although there are data management solutions that fall outside these two categories (e.g. DataSpaces [48], PGAS [92]), their architectures can be derived and synthesized from the two storage models. We therefore focus our discussion on distributed file systems and databases; our burst buffer-based solutions are rooted in these two storage models.
Distributed File System vs. Distributed Database. In a file system, data are stored in the form of files. The format of each file is defined by the file system clients (i.e. applications and I/O libraries). For instance, applications using POSIX I/O [29] store each file as a stream of binary bytes; data are accessed based on their sizes and byte addresses in the file. To obtain richer data access semantics, application developers need to implement their own functions beyond POSIX I/O and define their own data format atop the binary stream. High-level I/O libraries (e.g. HDF5 [11], NetCDF [58]) format each file as a container of multi-dimensional arrays; data are accessed based on their dimensionality, positions, and sizes along each dimension. In either case, these file formats are opaque to the file system.
In contrast, a database is generally used to store related, structured data (e.g. tables, indices) with well-defined data formats. Unlike in a file system, the physical formats of these structured data are defined by the database and are opaque to the clients. Clients access databases using database-specific semantics, such as SQL for relational databases and Put/Get APIs for key-value stores. With these interfaces, users can fulfill a wide variety of complex application logic without implementing it themselves; compared with a file system, the complicated implementation of such logic is offloaded to the database. In addition, a database allows indexing on any attribute or data property (i.e. SQL columns for relational databases and keys for key-value stores), and these indexed attributes facilitate fast queries.
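The contrast can be made concrete with a few lines of illustrative Python; the file name, byte offset, and keys below are invented for the example and stand in for application-chosen values.

```python
import os
import tempfile

# File-system model: the application imposes its own format on a byte stream;
# the file system sees only byte addresses and sizes.
path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
with open(path, "wb") as f:
    f.seek(4096)                     # application-chosen byte address
    f.write(b"\x01\x02\x03\x04")
with open(path, "rb") as f:
    f.seek(4096)                     # the reader must already know where...
    data = f.read(4)                 # ...and how many bytes to read

# Database model: the store owns the physical layout; clients use semantic keys.
kv = {}
kv[("pressure", 7)] = b"\x01\x02\x03\x04"   # Put(key, value)
blob = kv[("pressure", 7)]                  # Get(key)

assert data == blob
```

In the first model the meaning of offset 4096 lives entirely in the application; in the second, the key ("pressure", 7) carries that meaning and the store decides where the bytes physically reside.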
The choice between distributed file systems and databases largely depends on the burst buffer semantics to be exposed to applications. A file system-based burst buffer service is advantageous in two aspects. First, it can transparently support the large number of file system-based scientific applications that use POSIX and POSIX-based I/O libraries, such as HDF5 and NetCDF. Second, it gives application developers more flexibility to define their own data formats and implement their application logic. On the other hand, a database-based burst buffer service allows application developers to implement complex application logic more easily by directly using the rich client-side interfaces. It also speeds up data processing with customizable data indexing. However, to support existing file system-based applications, developers need to replace the POSIX/NetCDF/HDF5-based functions with the database client’s native functions, which demands non-trivial effort.
Besides the choice between databases and file systems, another key consideration is what type of file system or database service can be harnessed to manage burst buffers. According to their data layout strategies, existing file systems can be classified into locality-based distributed file systems (e.g. HDFS [90]) and striping-based distributed file systems (e.g. Lustre [35]). There are also two types of databases: relational and non-relational; the non-relational databases can be further classified into key-value stores and graph stores. Among the database options, key-value stores are particularly applicable to managing burst buffers.
Locality-Based Distributed File Systems vs. Striping-Based Distributed File Systems. In locality-based file systems, each process prioritizes writing data to its node-local storage. This design choice avoids data movement across the network during bursty writes, and thus delivers scalable write bandwidth. Because of this benefit, we design a locality-based distributed file system to manage the node-local burst buffers in Chapter 4. A key challenge, however, is reading the data back: because file data are written by each process locally, each process needs to look up the data sources for its read requests in order to read remote data. A scalable metadata service is needed to serve the large number of lookup requests during bursty reads.
In the striping-based distributed file systems, data written by each process are striped across
multiple data nodes. The locations of data sources are calculated based on a pre-determined hash
function. Therefore, during read, each process can directly compute the locations of the requested
data and retrieve the data from the data sources. Compared with the locality-based distributed
file system, this type of file system delivers balanced read/write bandwidth. However, its write
bandwidth is limited by contention when multiple processes concurrently write to the same data
node.
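A minimal sketch of such a layout — assuming fixed-size stripes placed round-robin across data servers, in the style of Lustre's default layout — shows why no metadata lookup is needed on reads:

```python
def stripe_location(offset, stripe_size, num_servers):
    """Index of the data server holding the byte at `offset` of a striped file."""
    stripe_index = offset // stripe_size   # which stripe the byte falls in
    return stripe_index % num_servers      # stripes cycle round-robin over servers

# With 1MB stripes over 4 servers, the byte at offset 5MB lies in stripe 5,
# which round-robins onto server 1.
print(stripe_location(5 * (1 << 20), 1 << 20, 4))   # 1
```

Every client evaluates the same arithmetic, so the location of any byte is known globally without consulting a metadata service — the property that gives striping-based file systems their balanced read/write bandwidth.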
Relational Databases vs. Key-Value Stores. A fundamental difference between relational databases and key-value stores is the data model each exposes to users. Relational databases represent data as one or more tables of rows and columns, with a unique key for each row. These tables are queried and maintained through SQL, which defines a rich set of semantics that allow users to easily fulfill complex application logic (e.g. insert/delete/update/query/join/transaction). However, the complex data model also makes the implementation of relational databases highly complicated. For example, a relatively simple SELECT statement can have dozens of potential query execution paths, which a query optimizer must evaluate at run time. In contrast, key-value stores represent data as key-value pairs and provide much simpler interfaces (e.g. Put/Get for storing/retrieving data). The simplicity of their data services affords key-value stores higher speed and better scalability. In Chapter 3, we leverage a distributed key-value store to manage the remote shared burst buffers.
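The interface difference can be seen side by side in a toy example; the table schema and keys are invented, Python's bundled sqlite3 stands in for a relational database, and a plain dict stands in for a key-value store.

```python
import sqlite3

# Relational: the database owns the schema, parses SQL, and plans the query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ckpt (app TEXT, step INTEGER, size INTEGER)")
db.execute("INSERT INTO ckpt VALUES ('sim', 7, 4096)")
row = db.execute(
    "SELECT size FROM ckpt WHERE app = 'sim' AND step = 7").fetchone()

# Key-value: the client composes the key; the store only offers Put/Get.
store = {}
store[("sim", 7)] = 4096          # Put
value = store[("sim", 7)]         # Get

assert row[0] == value == 4096
```

The relational path buys expressive queries at the cost of parsing, planning, and schema maintenance; the key-value path does one hash lookup, which is why it scales so readily.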
1.3 Research Contributions
In this dissertation, we have thoroughly investigated the I/O challenges on HPC systems, and
researched three approaches for burst buffer management. In particular, this dissertation has made
the following three contributions.
1.3.1 BurstMem: Overlapping Computation and I/O
We have designed BurstMem, a burst buffer system to manage the remote shared burst buffers.
It accelerates checkpointing by temporarily buffering applications’ scientific datasets in burst buffers
and asynchronously flushing the data to PFS. BurstMem is built on top of Memcached, a distributed
key-value store widely adopted in the cloud environment. While inheriting the buffering management framework of Memcached, we have identified its major issues in handling the bursty checkpoint workload: low burst buffer capacity and bandwidth utilization, inability to leverage the native transport of various high-speed interconnects on HPC systems, and lack of data indexing and shuffling support for fast data flush. Based on our analysis, we structure BurstMem
to accelerate checkpointing with indexed and space-efficient buffer management, fast and portable
data transfer, and coordinated data flush. Our evaluations demonstrated that BurstMem is able
to achieve 8.5× speedup over the bandwidth of conventional PFS.
1.3.2 BurstFS: A Distributed Burst Buffer File System
We have analyzed the benefits and challenges of using node-local burst buffers to accelerate sci-
entific I/O, and designed an ephemeral Burst Buffer File System (BurstFS) that supports scalable
and efficient aggregation of I/O bandwidth from burst buffers while having the same life cycle as a
batch-submitted job. BurstFS features several techniques including scalable metadata indexing for
fast and scalable write with amortized metadata cost, co-located I/O delegation for scalable read
and data sharing among coupled applications in the same job, and server-side read clustering and
pipelining to optimize small and noncontiguous read operations widely used in a multi-dimensional
I/O workload. Through extensive tuning and analysis, we have validated that BurstFS has ac-
complished our design objectives, with linear scalability in terms of aggregated I/O bandwidth for
parallel writes and reads.
1.3.3 TRIO: Reshaping Bursty Writes
We have devised TRIO, a burst buffer orchestration framework to efficiently drain data from
burst buffers to the backend PFS. The strategy is applicable to both remote shared burst buffers
and node-local burst buffers. TRIO buffers the applications’ checkpointing write requests in burst
buffers and reshapes the bursty writes to maximize the number of sequential writes to PFS. Mean-
while, TRIO coordinates the flushing orders among concurrent burst buffers to address two levels of
contention on PFS, namely, competing writes from multiple processes on the same storage server,
and the interleaving writes from multiple concurrent applications on the same storage server. Our
experimental results demonstrated that TRIO could efficiently utilize storage bandwidth and reduce
the average I/O time by 37% in typical checkpointing scenarios.
1.3.4 Publication Contributions
During my PhD study, I have contributed to the following publications (listed in chronological order).
1. Yandong Wang, Yizheng Jiao, Cong Xu, Xiaobing Li, Teng Wang, Xinyu Que, Christian
Cira, Bin Wang, Zhuo Liu, Bliss Bailey, Weikuan Yu. Assessing the Performance Impact of
High-Speed Interconnects on MapReduce Programs. Third Workshop on Big Data Bench-
marking, 2013 [107].
2. Zhuo Liu, Jay Lofstead, Teng Wang, Weikuan Yu. A Case of System-Wide Power Man-
agement for Scientific Applications. IEEE International Conference on Cluster Computing,
2013 [67].
3. Zhuo Liu, Bin Wang, Teng Wang, Yuan Tian, Cong Xu, Yandong Wang, Weikuan Yu, Carlos
A Cruz, Shujia Zhou, Tom Clune, Scott Klasky. Profiling and Improving I/O Performance
of a Large-Scale Climate Scientific Application, 2013 [69].
4. Yandong Wang, Robin Goldstone, Weikuan Yu, Teng Wang. Characterization and Opti-
mization of Memory-Resident MapReduce on HPC Systems. IEEE 28th International Parallel
and Distributed Processing Symposium, 2014 [106].
5. Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, Weikuan Yu. BPAR: A Bundle-Based
Parallel Aggregation Framework for Decoupled I/O Execution. International Workshop on
Data Intensive Scalable Computing Systems, in Conjunction with SC 2014 [104].
6. Teng Wang, Sarp Oral, Yandong Wang, Bradley Settlemyer, Scott Atchley, Weikuan Yu.
BurstMem: A High-Performance Burst Buffer System for Scientific Applications. IEEE In-
ternational Conference on Big Data, 2014 [103].
7. Teng Wang, Sarp Oral, Michael Pritchard, Kento Vasko, Weikuan Yu. Development of a
Burst Buffer System for Data-Intensive Applications. International Workshop on The Lustre
Ecosystem: Challenges and Opportunities, 2015 [101].
8. Teng Wang, Kathryn Mohror, Adam Moody, Weikuan Yu, Kento Sato. Poster Presented
at The International Conference for High Performance Computing, Networking, Storage and
Analysis, 2015 [99].
9. Teng Wang, Sarp Oral, Michael Pritchard, Bin Wang, Weikuan Yu. TRIO: Burst Buffer
Based I/O Orchestration. IEEE International Conference on Cluster Computing, 2015 [102].
10. Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, Weikuan Yu. An Ephemeral Burst
Buffer File System for Scientific Applications. IEEE/ACM the International Conference for
High Performance Computing, Networking, Storage and Analysis 2016 [98].
11. Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, Weikuan Yu. Enhance Scientific Application
I/O with Cross-Bundle Aggregation. International Journal of High Performance Computing
Applications [105], 2016.
12. Teng Wang, Adam Moody, Yue Zhu, Kathryn Mohror, Kento Sato, Tanzima Islam, Weikuan
Yu. MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers.
IEEE 31st International Parallel and Distributed Processing Symposium [100], 2017.
1.4 Dissertation Overview
In the rest of this dissertation, we first elaborate on the problems that prevent scientific applications from achieving satisfactory I/O performance. We then detail three burst buffer management strategies that address these issues. Each chapter presents one approach, along with a comprehensive evaluation and comparison between our solution and state-of-the-art techniques.
In Chapter 2, we thoroughly investigate the key issues that constrain scientific I/O performance, namely the increasing computation-I/O gap and I/O contention on the PFS. The insights from this chapter motivate our innovations.
In Chapter 3, we specify the design of BurstMem, a management framework built on remote shared burst buffers. BurstMem complements Memcached for checkpointing with enhanced data services for fast data transfer, data buffering, and flushing. Our performance evaluation demonstrates that our solutions can efficiently accelerate scientific checkpointing.
In Chapter 4, we explore the benefits and key considerations of architecting a distributed burst
buffer file system to manage node-local burst buffers. Based on our exploration, we design BurstFS
to speed up checkpointing/restart, analysis, and visualization. BurstFS supports scalable writes
and fast data sharing under various concurrent I/O workloads. Our comparison with the state-of-
the-art solutions demonstrates that BurstFS carries great potential in handling the exascale I/O
workloads.
In Chapter 5, we introduce TRIO, a burst buffer orchestration framework that buffers and
reorganizes checkpointing write requests for sequential and contention-aware data flush to the
PFS. This strategy can be used on both remote shared burst buffers and node-local burst buffers.
Our experiments with different types of concurrent workloads demonstrate that by addressing the
I/O contention on PFS, TRIO not only improves the bandwidth of a single application, but also
minimizes the average I/O time of the individual application under a multi-application workload.
Finally, we conclude this dissertation and discuss opportunities for future work in Chapter 7.
CHAPTER 2
PROBLEM STATEMENT
This chapter discusses the detailed I/O challenges on HPC systems. First, we highlight the problem
of the increasing computation-I/O gap. We then present the issues incurred by I/O contention on
the PFS, and experimentally analyze their impact.
2.1 Increasing Computation-I/O Gap
Studies have demonstrated that most scientific applications on HPC systems are data-intensive
applications [7]. These applications’ life cycle alternates between computation phases and I/O
phases [67], with a gigantic volume of data stored during each I/O phase. As an example, the
climate scientific application Arctic System Reanalysis (ASR) outputs 23.14TB of data to study one year of climate changes. The volume of output data is projected to reach hundreds of petabytes for a single simulation run on exascale computing systems [40]. This demands a commensurate I/O bandwidth improvement on the backend PFS to quickly absorb the output. However, the growth of storage bandwidth on the PFS consistently lags behind that of computing power, resulting in an ever-increasing computation-I/O gap. For instance, an ongoing upgrade from the Sequoia [19] to the Sierra [20]
supercomputer (expected to come around 2018) at the Lawrence Livermore National Laboratory
will enhance the computing power by a factor of 7. However, a corresponding upgrade on the
backend PFS is expected to improve the bandwidth by a maximum of 2 times. Similarly, the
recent upgrade from Edison [10] to Cori [5] supercomputer (delivered in 2016) at the Lawrence
Berkeley National Laboratory enhanced the computing power by a factor of 12, with a storage
bandwidth improvement of only 4 times. In the era of exascale computing, the computation-I/O
gap is estimated to become 10× wider [49].
Accompanying the widening computation-I/O gap is the booming bandwidth requirement for checkpointing. An exascale system is expected to raise the periodic checkpointing data size by two orders of magnitude, and to reduce the checkpointing interval from 4-8 hours to tens of minutes [36].
This requires two orders of magnitude higher storage bandwidth (e.g. 60TB/s as predicted by [49])
for fast checkpointing and crash recovery.
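A back-of-envelope calculation shows where estimates of this order come from. The checkpoint size, interval, and I/O budget below are assumed illustrative exascale parameters chosen for the example, not figures taken from [49].

```python
# Assumed parameters for an exascale checkpoint (illustrative only).
ckpt_size_tb = 10 * 1024   # dump ~10 PB of aggregate memory per checkpoint
interval_s = 30 * 60       # checkpoint every 30 minutes
io_budget = 0.10           # spend at most 10% of each interval on I/O

required_bw = ckpt_size_tb / (interval_s * io_budget)
print(round(required_bw, 1))   # ~56.9 TB/s, the same order as the 60TB/s prediction
```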
2.2 Issues of Contention on the Parallel File Systems
Figure 2.1: Issues of I/O Contention. (a) Degraded storage bandwidth. (b) Prolonged average I/O time.

Another critical issue faced by the PFS is I/O contention. Under a typical checkpointing
workload, the storage servers on PFS will receive a large burst of write requests from 10× to 100×
more compute nodes. The heavy and concurrent workload on each storage server incurs two issues.
First, when the number of competing write requests exceeds the capabilities of each storage server,
its bandwidth will degrade. Second, when multiple applications compete to use the shared storage
servers, I/O service for mission-critical applications can be frequently interrupted by low-priority
applications. I/O requests from small jobs can be delayed due to concurrent accesses from large
jobs, prolonging the average I/O time.
Figure 2.1(a) shows how storage bandwidth degrades under contention. Processes A and B each
sequentially issues four write requests to PFS. These write requests belong to contiguous regions
of a shared file (e.g. A1, A2, A3, A4 in Figure 2.1(a) are contiguous segments in the file). Due to
I/O contention, although these write requests are issued sequentially, their arrival sequence on PFS
becomes interleaved. This interleaved arrival leads to lower bandwidth utilization. Figure 2.1(b)
illustrates the prolonged average I/O time. Application 1 and Application 2 compete for PFS’s
I/O service by concurrently issuing 2 and 4 write requests, respectively. These write requests
are interleaved on PFS and serviced according to their arrival order. Although Application 1
issues fewer write requests, it still turns around slowly (even completes later than Application 2).
Consequently, the average I/O time of these two applications is prolonged by the PFS’s reactive
I/O service for these applications.
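The effect in Figure 2.1(b) can be replayed with a few lines of simulation, assuming unit service time per request and the interleaved arrival order sketched in the figure:

```python
# Arrival order on the PFS (Fig. 2.1(b)): App1 issues A1, A2; App2 issues B1-B4.
arrivals = ["A1", "B1", "B2", "B3", "B4", "A2"]

t = 0
finish = {}
for req in arrivals:
    t += 1                 # assume one time unit of service per request
    finish[req[0]] = t     # completion time of that application's latest request

print(finish)              # {'A': 6, 'B': 5}
```

App1 completes at time 6 and App2 at time 5: the application with only two requests finishes last, exactly the prolonged-turnaround effect described above.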
2.2.1 Analysis of Degraded Storage Bandwidth Utilization
A key reason for degraded storage bandwidth utilization is that each checkpointing file is striped
across multiple storage servers on PFS, and shared by the participating processes. Therefore, each
process can access multiple storage servers, and each storage server is reachable from multiple
processes. Such N-N mapping poses two challenges. First, each storage server suffers from the
competing writes from multiple processes; second, since the requests of each process are distributed
to multiple storage servers, each process is involved in the competition for multiple storage servers.
We used the IOR benchmark [88] to analyze the impacts of both challenges on the Lustre file system
(Spider file system [89]) connected to Titan Supercomputer [14] in Oak Ridge National Laboratory.
To investigate the first challenge, we used an increasing number of processes to concurrently write in
total 32GB data to a single storage server (OST). Each process wrote a contiguous, nonoverlapping
file segment. The result is exhibited in Fig. 2.2. The bandwidth first increased from 356MB/s with
1 process to 574MB/s with 2 processes, then decreased constantly to 401MB/s with 32 processes,
resulting in 30.1% bandwidth degradation from 2 to 32 processes. The improvement from 1 to 2
processes was because the single-process I/O stream was not able to saturate OST bandwidth. Our
intuition suggested that contention from 2 to 32 processes can incur heavy disk seeks; however, our
lack of privilege to gather I/O traces at the kernel level on ORNL’s Spider filesystem prevented us
from directly proving our intuition. We repeated our experiments on our in-house Lustre filesystem
(running the same version as that on Spider) and observed that up to 32% bandwidth degradation was caused by I/O contention. By analyzing I/O traces using blktrace [3], we found that disk access
time accounted for 97.1% of the total I/O time on average, indicating that excessive concurrent
accesses to OSTs can degrade the bandwidth utilization. Overall, the bandwidth utilization of one
Figure 2.2: The impact of process count on the bandwidth of a single OST.
OST on Spider was 75.6% when there were only two concurrent processes, but dropped to 53.5%
when there were 32 processes.
To evaluate the performance impact of the second challenge, we distributed each IOR process’s
write requests to multiple OSTs. In our experiment, we spread the write requests from each process
to 1, 2, 4 OSTs, which are presented in Fig. 2.3 as 1 OST, 2 OSTs, 4 OSTs, respectively. We fixed
the total data size as 128GB, the number of utilized OSTs as 4, and measured the bandwidth under
the three scenarios using a varying number of processes. The result is demonstrated in Fig. 2.4.
The scenario of 1 OST consistently delivered the highest bandwidth with the same number of
processes, outperforming 2 OSTs and 4 OSTs by 16% and 26% on average, respectively. This was
because, by localizing each process’s writes on 1 OST, each OST was under the contention from
fewer processes. Another interesting observation is that the bandwidth under the same scenario
(e.g. 1 OST) degrades with more processes involved. This trend can be explained by the impact
of the first issue as we measured in Fig. 2.2.
2.2.2 Analysis of Prolonged Average I/O Time
To assess the impact of prolonged average I/O time under a multi-application environment, we
ran BTIO [108], MPI-Tile-IO [16] and IOR [88] benchmarks concurrently but differentiated their
output data sizes as 13.27GB, 64GB and 128GB, respectively. This launch configuration is referred
Figure 2.3: Scenarios when individual writes are distributed to different numbers of OSTs. “N OST” means that each process’s writes are distributed to N OSTs.
Table 2.1: The I/O time of individual benchmarks when they are launched concurrently (MultiWL) and serially (SigWL).

Time(s)    BTIO    MPI-TILE-IO    IOR       AVG       TOT
MultiWL    41      121.83         179.75    114.19    179.75
SigWL      9.79    72.28          161       81.02     161
to as MultiWL. Competition for storage was assured by striping all benchmark output files across
the same four OSTs on the Spider file system. We compared their job I/O time with that when
these benchmarks were launched in a serial order, which is referred to as SigWL. The I/O time
of the individual benchmarks are shown in Table 2.1 as three columns, BTIO, MPI-TILE-IO, and
IOR, respectively. The average and total I/O time of the three benchmarks are shown in columns
AVG and TOT. As we can see, the average I/O time of MultiWL is 1.41× longer than SigWL.
This is because a job’s storage service was affected by the contention from other concurrent jobs.
Furthermore, the contention from large jobs can significantly delay the I/O of small jobs. In our
tests, the most affected benchmark was BTIO, which generated the smallest workload. Its I/O
time in MultiWL was 4.18× longer than SigWL.
Figure 2.4: Bandwidth when individual writes are distributed to different numbers of OSTs.
2.3 Summary
In summary, this chapter has revealed the major I/O challenges on HPC systems, namely the increasing computation-I/O gap and I/O contention on the PFS. Our burst buffer management
strategies provide distinct solutions to these challenges. First, we bridge the computation-I/O
gap by temporarily buffering the checkpoints in burst buffers, and gradually flushing checkpoints
to PFS (see Chapter 3). This avoids scientific applications’ direct interactions with PFS. We
achieve this purpose based on remote shared burst buffers, since the remote shared burst buffers
are jitter-free for data flushing. We have also explored the use of node-local burst buffers to absorb
the read/write traffic on the PFS, by buffering the bursty writes in node-local burst buffers, and then
directly retrieving the data from burst buffers for restart/analysis and visualization (see Chapter 4).
On the other hand, we have addressed I/O contention by taking burst buffers as an intermediate
layer to reshape the applications’ bursty writes on PFS (see Chapter 5).
CHAPTER 3
BURSTMEM: OVERLAPPING COMPUTATION
AND I/O
3.1 Introduction
As mentioned in Chapter 2, the ever-increasing computation-I/O gap imposes stringent checkpoint demands on the PFS. BurstMem is designed to shoulder the data pressure from checkpoint
workloads. It provides simple interfaces that allow applications to quickly dump checkpoint data,
and asynchronously flush the data to the PFS. It avoids direct interaction with the PFS by overlapping computation with data flushing. With a set of storage management strategies, it efficiently
leverages the capacity and bandwidth of burst buffer storage and reduces the overall I/O time. In
addition, BurstMem is designed with a novel tree-based indexing technique that can support fast
data flush in two phases. Finally, BurstMem is implemented with an abstract communication layer
that enables its portability to systems with diverse network configurations [30].
As mentioned in Section 1.2.3 of Chapter 1, there are several storage models for burst buffer
service. We structure BurstMem as a distributed key-value store, since a key-value store well supports the simple semantics required by BurstMem, e.g. storing the checkpoints generated at a
given checkpoint phase to burst buffer, and retrieving checkpoints produced during a given pe-
riod for data flush or crash recovery. BurstMem is implemented by customizing and extending
the functionality of Memcached, a cutting-edge, lightweight, distributed DRAM-based caching system. Though not designed for scientific applications, it includes all the features needed for distributed buffer management. Its fast storage solution and great extensibility for complex applications distinguish it from many other distributed key-value stores (e.g. MongoDB [43], HBase [51]) as a decent candidate for burst buffer services. We customize Memcached
by modifying its data placement strategy, communication layer and memory management module.
Furthermore, we design BurstMem with a mechanism of coordinated data shuffling and flushing to
PFS. In summary, we make the following contributions in this chapter:
• We have examined the storage management issues in Memcached. Based on our analysis
we introduce a log-structured storage management scheme with a novel tree-based indexing
technique. This allows us to efficiently utilize storage resources.
• We have applied a coordinated shuffling scheme for efficient data flushing, and designed a
portable communication layer that supports high-speed data transfer.
• A systematic evaluation of BurstMem is conducted using both synthetic benchmarks and a
real-world application. Our results demonstrate that on average BurstMem can improve the
I/O performance by as much as 8.5×.
The rest of the chapter is organized as follows. We describe the Memcached based buffering
management framework in Section 3.2, followed by Section 3.3 that elaborates on the design of
BurstMem. Section 3.4 provides experimental results. Section 3.5 provides a review of related
work. Finally, we summarize this chapter in Section 3.6.
3.2 Memcached Based Buffering Framework
In this section, we first depict the framework of Memcached, and then analyze the challenges
of using Memcached to support checkpoint workload.
3.2.1 Overview of Memcached
Memcached is an open-source, distributed caching system deployed to address the web-scale
performance and scalability challenges. Its two key components, client and server, function together as a distributed key-value store that serves web servers’ caching requirements. Fig. 3.1 shows
the general architecture along with its main components. The Memcached client can interact with
a number of servers for its data store and data retrieval purposes. As a distributed caching system,
Memcached incorporates several key architectural aspects indispensable to the design of a burst buffer system. First, the Memcached client adopts a two-stage hashing mechanism for balanced
data placement. Second, the Memcached server offers a lossy key-value store that involves all the
major functionality of a local storage system, such as space, data and metadata management.
Figure 3.1: Component diagram of Memcached.
Two-stage Hashing. Memcached applies a two-stage hashing mechanism to store or retrieve a given key-value pair (KVP). In the first stage, the key is hashed to the server responsible for storing the KVP. In the second stage, on that server, the key is hashed again to an entry in the server's local hash table that records the address of the KVP.
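To make the two stages concrete, here is a minimal Python sketch; the specific hash functions and the modulo-based server selection are illustrative assumptions (many Memcached clients use consistent hashing instead), not the exact algorithm:

```python
import hashlib

def server_for_key(key: str, servers: list) -> str:
    """Stage 1: hash the key to pick the server that will store the KVP."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

def bucket_for_key(key: str, num_buckets: int) -> int:
    """Stage 2: hash the key again into the chosen server's local table."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % num_buckets
```

Both stages are deterministic, so every client resolves the same key to the same server and the same local bucket.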
A Lossy Key-Value Store with Disconnected Servers. The Memcached server is designed as a simple but powerful in-memory key-value store. A server pre-allocates many groups of 1 MB slabs, each of which is divided into several chunks. Chunks that belong to the same group are of equal size, but chunk sizes differ across groups. Each group has a unique ID, and the chunk size grows by a factor of 1.25 from one group to the next. For instance, Group 1 contains chunks of 96 B, Group 2 contains chunks of 96 B × 1.25, and so on, up to the largest group (Group 42), whose chunk size is capped at the 1 MB slab size, so each of its slabs holds a single chunk. To insert a KVP, the Memcached server selects a chunk from the group with the closest sufficient chunk size, copies the KVP into that chunk, and records the chunk address in the hash table. Upon a hash conflict, a chained list of entries holds the addresses of multiple KVPs.
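The group sizing and selection rules above can be sketched as follows; the 8-byte alignment and the explicit cap at the 1 MB slab size are assumptions about how the geometric growth is realized, not details taken from Memcached's source:

```python
def slab_chunk_sizes(base=96, factor=1.25, slab_size=1024 * 1024, align=8):
    """Chunk sizes per group: grow by ~`factor`, round up to `align` bytes,
    and cap the final group at the 1 MB slab size (one chunk per slab)."""
    sizes = []
    size = base
    while size < slab_size:
        sizes.append(size)
        size = (int(size * factor) + align - 1) // align * align
    sizes.append(slab_size)
    return sizes

def group_for_item(nbytes, sizes):
    """Select the smallest group whose chunk size fits the item;
    the slack between item and chunk size is wasted memory."""
    for s in sizes:
        if nbytes <= s:
            return s
    raise ValueError("item exceeds the maximum slab size")
```

The slack returned by `group_for_item` is exactly the internal fragmentation that motivates the append-only design in Section 3.3.2.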
3.2.2 Challenges from Scientific Applications
For its simple, scalable, and powerful design, Memcached has been used by many web applications that require fast cache storage for their temporary data. The requests of these applications are generally distributed, arriving in a random order with very little synchronization.
While scientific applications also generate large volumes of data, their data patterns differ significantly from those of web applications. In particular, they possess a number of distinct characteristics, described below, that warrant a new perspective on how to leverage the strengths of Memcached.
• Lock-step I/O from many synchronized clients: Scientific applications typically consist of many parallel processes that enter their I/O phase in a lock-step manner. These processes synchronize frequently and in a well-coordinated fashion, typically to exchange data with their neighbors, so their I/O operations are also closely synchronized.
• Bursty and non-overlapping I/O: Scientific applications usually have well-defined execution phases; for example, they alternate between computation and I/O phases. This characteristic provides the fundamental requirement for designing a burst buffer. The file extents written by different processes do not overlap, and within each I/O phase a process does not rewrite content it has already written.
• Frequent writes and few reads: Scientific applications periodically create snapshots (via checkpointing) of their intermediate results and datasets. They are typically write-intensive but read only sparingly: in-memory variables such as arrays and meshes are written at each snapshot, while checkpoint data is read only during application restart. Given these characteristics, we design the burst buffer mainly for application write throughput. This differs from many other existing buffering systems, such as PreDatA [114] and DataSpaces [48] in ADIOS [96], which focus on in-situ data sharing and analysis for scientific simulations.
The distinct features of scientific applications lead us to rethink the design of Memcached. In this chapter, we carry out a study on the design of a burst buffer system on top of the Memcached framework. While preserving many features of the Memcached architecture, we modify Memcached in three critical aspects: storage management, coordination with the PFS, and communication efficiency.
3.3 Design of BurstMem
The goal of BurstMem is to efficiently absorb a large volume of write requests and provide high-throughput service to migrate the data into the PFS. In this section, we first present an
architectural overview of BurstMem. We then review how BurstMem copes with bursty writes,
describe our strategy to flush the data from BurstMem to the underlying PFS, and detail the
approach to leverage the native transport of various high-speed interconnects on the HPC systems.
3.3.1 Software Architecture of BurstMem
Figure 3.2: Software architecture of BurstMem.
Fig. 3.2 shows the software architecture of BurstMem and its relationship with other system components in a typical HPC environment. As a data buffering system, BurstMem sits between the processing elements and the backend persistent storage hosted by the PFS. It connects to all application processes via a high-speed interconnect, temporarily buffers bursty datasets from these processes, and gradually flushes the datasets to the PFS.
BurstMem is composed of two main components: BurstMem Managers (BMans) and BurstMem
Stores (BStores). Each BMan is designed with a BStore as an internal Memcached server. All
BMans form a parallel set of burst buffer daemons to intercept application data. Each BMan keeps
track of the address and health status of neighboring BMans. These BMans are in charge of the
bulk of responsibilities for system maintenance and resource management. They coordinate all the
BStores for fast data buffering, balanced data distribution and long-term persistent storage.
Using BurstMem, scientific applications can follow a new checkpoint flow. After each phase of computation, the clients coordinate with the BMans for checkpointing. The BMans absorb and store all the checkpoint datasets in the BStores. The application then returns to the next phase of computation, leaving the ensuing data flushing operation to the BStores. In this way, data flushing is overlapped with computation.
Built on top of Memcached, BurstMem exposes to the application a simple checkpoint API, which invokes a customized light-weight Memcached client library for data shipping. Once a BMan receives a KVP, it leverages its BStore for storage management. The BStore follows the same data processing flow as the Memcached server's storage management: it allocates memory for the accepted KVP and records its location so that the KVP can be retrieved in the future.
3.3.2 Log-Structured Data Organization with Stacked AVL Indexing
Figure 3.3: Data structures for absorbing writes: (a) log-structured data store; (b) stacked AVL index tree.
We introduce a log-structured data organization with Adelson-Velskii and Landis (AVL) tree [57] based indexing (LSA) to absorb the large volume of bursty write requests. LSA resolves three major issues in Memcached that prevent it from achieving efficient writes. First, the original Memcached preallocates fixed-size memory chunks to accommodate incoming write requests. However, this leads to underutilized memory when a write request is not aligned with the chunk size, and additional memory allocation is needed each time the chunks of a given size are used up. Second, contemporary HPC platforms are embracing tiered storage with both DRAM and SSD, but Memcached is oblivious to this architecture. Third, the hash-based indexing used by Memcached cannot support range queries, which are a requirement for bulk data flushing and HPC application restart.
The key idea of LSA is to compact the received write requests in an append-only manner to avoid memory waste, as shown in Fig. 3.3(a). When used memory reaches a high watermark, the in-memory data is appended to the SSD.
As illustrated in Fig. 3.3(a), we design a hierarchical data store to log the concrete file data (value) from write requests. A large (by default 4 GB) DRAM block is maintained for logs at the first level, while a separate intermediate file on the SSD is reserved for logs at the second. The address space of the data store spans the storage of both the DRAM and the SSD.
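A minimal sketch of such a two-tier append-only store follows; the class name `TieredLog`, the watermark policy, and the byte-level API are hypothetical simplifications of BStore, not its actual implementation:

```python
class TieredLog:
    """Append-only data store: values land in a DRAM block first; once DRAM
    usage crosses a high watermark, the block is appended to an SSD log file."""

    def __init__(self, watermark: int, ssd_path: str):
        self.dram = bytearray()     # first-level log in memory
        self.watermark = watermark  # spill threshold in bytes
        self.ssd_path = ssd_path    # second-level log on SSD
        self.ssd_bytes = 0          # bytes already spilled to SSD

    def append(self, value: bytes) -> int:
        """Append a value; return its offset in the unified DRAM+SSD space."""
        offset = self.ssd_bytes + len(self.dram)
        self.dram.extend(value)
        if len(self.dram) >= self.watermark:
            self._spill()
        return offset

    def _spill(self) -> None:
        """Move the DRAM log to the SSD log, keeping the address space intact."""
        with open(self.ssd_path, "ab") as f:
            f.write(self.dram)
        self.ssd_bytes += len(self.dram)
        self.dram = bytearray()
```

Because spilling preserves offsets, the index described next can point into one unified address space regardless of which tier currently holds a value.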
Upon its arrival at a BMan, each write request is converted into a KVP. The key records the information that uniquely identifies the request, including the current checkpointing timestamp, the name of its targeted checkpoint file, and the offset (i.e., the position of the write request in the checkpoint file), as well as the length of the value. The value points to the concrete data (e.g., a segment of a plain file) in the write request to be stored into the data store and later flushed to the checkpoint file, as shown in Fig. 3.3(a). When the BStore receives a write request, it separates the key from the value and appends the value to the end of the data store. By doing so, the BStore eliminates the random access caused by following a strict order of checkpoint file offsets, and maximizes the write throughput.
To facilitate data retrieval (e.g., upon application restart) from the data store, the keys of all KVPs are organized in a stacked AVL-tree structure that records the metadata of absorbed write requests. An AVL tree [57] is a self-balancing binary search tree that supports lookup, insertion, and deletion with O(log N) complexity in both the average and worst cases, thus outperforming unbalanced binary search trees. It also delivers an ordered node sequence that allows in-order traversal. While it inherits many of the AVL tree's virtues, our design differs greatly from the conventional AVL tree in its stacked structure. It consists of three categories of layers:
timestamp, filename, and offset, as shown in Fig. 3.3(b). Each write request is first indexed by the timestamp, then by the filename, and finally by the offset pointing to its position in the checkpoint file to be stored in the PFS. A pointer recording the address of each write request in the data store is maintained together with the offset index. The intuition behind such a design is to accelerate retrieval and traversal of the data in a specific range. Using checkpointing as an example, flushing the dataset that belongs to a single timestamp is important. Our stacked AVL tree allows such a dataset to be located in a single range search operation. After pinpointing the index of such a timestamp, each filename subtree under that index is traversed, restoring the order of each checkpoint file through an in-order traversal of all its offset metadata. Taken together, the tree index supports diverse query patterns, for example, retrieving all the data under timestamp 1, from timestamp 1 to 3, or under timestamp 1 belonging to filename 1. These query patterns are not supported by the hash-based indexing in Memcached.
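The stacked index and its range queries can be sketched as follows; nested dictionaries with sorted iteration stand in here for the three AVL layers (a real AVL tree maintains key order without re-sorting), and the method names are illustrative:

```python
class StackedIndex:
    """Three-layer index: timestamp -> filename -> offset -> log address.
    Nested dicts with sorted iteration stand in for the stacked AVL trees."""

    def __init__(self):
        self.root = {}  # ts -> {fname -> {offset: (addr, length)}}

    def insert(self, ts, fname, offset, addr, length):
        self.root.setdefault(ts, {}).setdefault(fname, {})[offset] = (addr, length)

    def range_by_timestamp(self, lo, hi):
        """All records with lo <= ts <= hi, each file restored in offset order."""
        out = []
        for ts in sorted(t for t in self.root if lo <= t <= hi):
            for fname in sorted(self.root[ts]):
                for off in sorted(self.root[ts][fname]):
                    out.append((ts, fname, off) + self.root[ts][fname][off])
        return out

    def drop_timestamp(self, ts):
        """Trim an entire timestamp subtree, e.g., after its data is flushed."""
        self.root.pop(ts, None)
```

A single `range_by_timestamp` call corresponds to the one range search the stacked tree needs to gather a whole checkpoint.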
However, one major issue faced by LSA is determining when to conduct the garbage collection necessary to reclaim used space in the data store and to trim the stacked AVL tree. We address this issue by leveraging a key characteristic of checkpointing: after a checkpointing operation completes, the data belonging to that timestamp can be discarded, since it has been flushed to the underlying file system. Thus, we first mark the timestamp node on the AVL tree as unused. A background process is invoked periodically to traverse the tree for unused timestamps and compact the in-memory data store to reclaim the memory occupied by the values of such a timestamp. When all values within the timestamp have been recycled, we trim the timestamp subtree off our stacked AVL tree. In addition, we leave the log on the SSD untouched. Only when the size of the SSD log approaches a threshold, and all the data within the log has been transferred to the PFS, do we discard the log in its entirety and generate a new log to absorb the data from memory.
3.3.3 Coordinated Shuffling for Data Flushing
BurstMem is responsible for flushing the data into the PFS. In the current design, data flushing takes place after checkpointing; we also allow clients to trigger data flushing explicitly. There are two general checkpointing patterns in scientific applications: N-N and N-1 checkpointing. In N-N checkpointing, every process writes to a separate file. In N-1 checkpointing, all processes write to a single shared file. BurstMem supports both patterns. Under the N-N pattern, each client's
checkpoint data is hosted by one BStore. These BStores can flush data into different checkpoint
files without interfering with each other. In contrast, under the N-1 checkpoint pattern, the shared
checkpoint file spreads across many BStores. Naively flushing data content into a shared file can
incur significant lock overhead, leading to drastically degraded throughput. Coordinated shuffling
is applied to address such an issue for the N-1 case.
Before elaborating on coordinated shuffling, we briefly describe the cause of lock overhead. Many PFSs use distributed locks to guarantee data consistency. Taking Lustre as an example, locking is performed at the granularity of Lustre stripes (by default 1 MB). When a BStore needs to write a stripe during data flushing, it must first acquire the lock for that stripe. If another BStore owns the lock, Lustre has to revoke the prior ownership before granting the lock to the requesting BStore. Once the stripe lock is acquired, the lock and the data are buffered in the Lustre Object Storage Client (OSC) inside the BStore. Under the N-1 case, write requests from multiple BStores may overlap on the same stripe, causing frequent ownership changes on the stripe lock. Associated with these changes of lock ownership, data flushing can cause frequent network traffic and delay the entire process. Therefore, contiguous, stripe-aligned write requests are preferred over noncontiguous, stripe-unaligned ones, since the former effectively reduce the degree of lock contention.
In most of our targeted cases, each BStore possesses several noncontiguous, small segments of the shared file, thus amplifying the lock overhead. Therefore, coordinated shuffling is designed to reshuffle the segments among BStores so that each BStore can flush contiguous segments into the PFS with alleviated lock contention. As illustrated in Fig. 3.4, each shared file is logically divided into several contiguous segments, with the total number of segments equal to the number of BStores. The purpose of data shuffling is to have each BStore hold all the data that belong to the same segment. Fig. 3.4 details this process. BStores 1, 2, and 3 possess data chunks from Clients 1, 2, and 3, respectively. These chunks belong to the same shared file, which is divided into 3 segments mapped to the three BStores. Before data shuffling, each BStore stores noncontiguous data chunks. Data shuffling begins after such a mapping is established. Following this mapping, Chunk 2 is shuffled from BStore 2 to BStore 1, Chunk 3 is shuffled from BStore 1 to BStore 2, and so on. Once the shuffling operation is completed, each BStore can then flush its data to the PFS.
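The offset-to-segment mapping behind this shuffle can be sketched as follows; the function names and the (holder, offset) chunk representation are hypothetical simplifications of the actual protocol:

```python
def segment_of(offset: int, file_size: int, num_bstores: int) -> int:
    """Map a chunk's file offset to the BStore owning that logical segment.
    The shared file is split into num_bstores contiguous, equal segments."""
    seg_size = -(-file_size // num_bstores)  # ceiling division
    return min(offset // seg_size, num_bstores - 1)

def shuffle_plan(chunks, file_size, num_bstores):
    """chunks: list of (holder, offset) pairs. Return the (src, dst, offset)
    transfers needed so each BStore ends up holding only its own segment."""
    moves = []
    for holder, offset in chunks:
        dst = segment_of(offset, file_size, num_bstores)
        if dst != holder:
            moves.append((holder, dst, offset))
    return moves
```

After the plan is executed, each BStore holds one contiguous segment and can issue large, stripe-aligned writes to the PFS.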
Figure 3.4: Coordinated shuffling for N-1 data flushing.
Our data flushing scheme is inspired by the idea of ROMIO [94]. We do not use the ROMIO library directly since it is coupled with the MPI environment, which restricts BurstMem's potential for future fault-tolerance extensions.
3.3.4 Enabling Native Communication Performance
Memcached relies on the BSD Sockets interface and uses a reliable stream (i.e., TCP) to transfer data. Although Sockets eases the implementation, socket-based communication cannot fully exploit the advanced capabilities of leadership-scale HPC systems, such as Remote Direct Memory Access (RDMA) and OS bypass. In addition, its performance is not optimized and is highly sensitive to data sizes.
Therefore, we have employed the Common Communication Interface (CCI) [30], designed at Oak Ridge National Laboratory, to efficiently leverage the performance advantages of HPC facilities. CCI exposes the performance of native network interfaces to scientific applications and has been fully deployed on the Titan supercomputer to serve various scientific applications.
In our optimization, we leverage CCI to accelerate checkpointing from clients to BMans, as well as
data shuffling among BMans.
Figure 3.5: CCI-based network communication.
Figure 3.5 illustrates our implementation of CCI-based network communication. CCI uses client/server semantics to establish a connection. Checkpointing or data shuffling triggers a connection request to the peer. On both the server and the client, CCI abstracts a network device as an Endpoint, a virtualized device containing many resources, such as send and receive queues, as well as the buffers associated with those queues. Once the connection is established, we leverage the Remote Memory Access (RMA) feature, which enables zero-copy in CCI, to transfer the data over the network.
In our design, we use an event-driven model to improve the throughput. We poll the CCI Endpoint for new events (via cci_get_event). On the client side, one thread is dedicated to establishing connections with remote servers. Meanwhile, a data-transferring thread uses non-blocking RMA to transfer the data. RMA is typically one-sided (i.e., only the initiator is actively involved in the transfer). To signal the completion of an RMA operation, we use the completion message option, which sends a message and generates a receive event on the remote endpoint. A central thread, which polls the events from the Endpoint, orchestrates all of the above threads. Similarly, on the server side, a thread detects events from the Endpoint and dispatches the requests, e.g., connection requests or RMA completion events, to the corresponding threads for further processing.
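The dispatch pattern (though not the CCI API itself) can be sketched in Python, with a thread-safe queue standing in for the endpoint's event source; the event types and handler signatures are illustrative assumptions:

```python
import queue
import threading

def event_loop(events: queue.Queue, handlers: dict, stop: threading.Event):
    """Central polling thread: repeatedly pull an event (as the design polls
    cci_get_event on the Endpoint) and dispatch it to the registered handler."""
    while not stop.is_set():
        try:
            kind, payload = events.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing pending; re-check the stop flag
        handlers.get(kind, lambda p: None)(payload)
        events.task_done()
```

On the server side, the same loop shape would route connection requests and RMA-completion events to their worker threads.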
3.4 Evaluation of BurstMem
3.4.1 Methodology
Testbed: Experiments on BurstMem were conducted on the Titan supercomputer [14] hosted at Oak Ridge National Laboratory. Each node is equipped with a 16-core 2.2 GHz AMD Opteron 6274 (Interlagos) processor, 32 GB of RAM, and a connection to the Cray custom high-speed interconnect; every two nodes share one Gemini high-speed interconnect router. Since no I/O servers were deployed for I/O buffering, we used a separate set of compute nodes for BurstMem. Out of the 256 compute nodes allocated to the experiment, 128 were used as clients that write data into BurstMem; the other 128 were allocated as BurstMem servers. In every experiment, we placed one process per physical node.
Titan is connected to Spider II, a center-wide Lustre-based file system. It features 30 PB of disk space, offering 1 TB/s aggregate bandwidth organized in two non-overlapping, identical file systems, each providing 500 GB/s of I/O performance. The default stripe size of each created file is 1 MB, and the default stripe count is 4.
In all the experiments, we pinned a 16 MB DRAM buffer for each RMA channel between two communication entities.
Benchmarks: To evaluate the performance of BurstMem, we employed a synthetic workload using IOR [88] as well as a real-world scientific application called S3D [41]. We report the average of five test runs.
IOR is a flexible synthetic benchmarking tool that is able to emulate diverse I/O access patterns. It was initially designed for measuring the I/O performance of parallel file systems (PFSs). We added BurstMem support to IOR by redirecting all writes from the processes to BurstMem instead of the PFS; this new version of IOR is referred to as BB-IOR. To emulate bursty I/O behavior as described in Section 3.2.2, we set interTestDelay to 20 seconds between any two I/O phases and iterated 10 times. For comparison, we also redirected the writes to Memcached, referred to as MemCache-IOR.
To evaluate the performance of real applications, we integrated BurstMem into S3D, which we refer to as BB-S3D. S3D is a parallel turbulent combustion application using a direct numerical simulation solver developed at Sandia National Laboratories. It solves the fully compressible Navier-Stokes, total energy, and mass continuity equations coupled with detailed chemistry. The problem domain is a conventional 3-D structured Cartesian mesh. All the MPI processes are partitioned along the X-Y-Z dimensions. S3D exhibits bursty patterns during execution: its checkpointing phase regularly alternates with computation, and each checkpointing phase outputs four global arrays representing the variables of mass, velocity, pressure, and temperature.
3.4.2 Ingress Bandwidth
We first investigated the ingress bandwidth that BurstMem can sustain while absorbing write requests. We used the IOR benchmark and increased the number of BurstMem servers from 1 to 128. In each test, we used the same number of IOR clients as BurstMem servers to stress the system, and a 1 MB (default stripe size) transfer unit to alleviate the lock contention issue in the Lustre file system. We use IOR-N-1 and IOR-N-N to denote, respectively, the N-1 and N-N patterns mentioned in Section 3.3.3 for the original IOR. In the IOR-N-1 case, we set the stripe count to the number of clients. For a fair comparison, we set the stripe count to 1 in the IOR-N-N case, so that the same number of Object Storage Targets (OSTs) as clients are utilized. On average, each IOR client wrote 4 GB of data.
Figure 3.6 compares the ingress bandwidth IOR receives with and without BurstMem support, as well as with MemCache-IOR. Overall, BurstMem delivered significantly higher ingress bandwidth than the three alternatives. As seen in Figure 3.6, BB-IOR achieves 278.2%, 246.9%, and 174.5% improvement on average when compared to IOR-N-1, MemCache-IOR, and IOR-N-N, respectively. Such improvement is consistent across different numbers of BurstMem servers.
BB-IOR achieves substantial improvement over the original IOR by buffering the write requests
instead of writing directly to the Lustre file system. Such performance improvement is what we
expect. However, as also shown in Figure 3.6, simply using Memcached as the buffering system
could not maximize the ingress bandwidth. BurstMem effectively outperformed Memcached with
LSA described in Section 3.3.2 and CCI support described in Section 3.3.4. Specifically, BurstMem benefited from Gemini's native transport through CCI and avoided frequent memory allocation through LSA.

Figure 3.6: Ingress I/O bandwidth with a varying server count.
To further examine the ingress bandwidth under different workloads, we reduced the number of BurstMem servers to 4, which equals the default stripe count. We also set the stripe count to 4 and had all the clients write to one shared file for the original IOR. In both BB-IOR and the original IOR, we increased the number of IOR clients from 4 to 128, thereby increasing the workload per BurstMem server and OST. Figure 3.7 illustrates the performance comparison with respect to the increasing number of IOR clients. On average, BB-IOR outperformed the original IOR and MemCache-IOR by 508.62% and 408.30%, respectively. We observed an increasing bandwidth from 4 to 16 IOR clients. This was because, when the number of IOR clients was fewer than 16, the supplied bandwidth of each BurstMem server was not fully saturated. Once the number reached 16, that bandwidth was fully utilized, and BurstMem was able to provide stable ingress bandwidth regardless of the workload.
Figure 3.7: Ingress I/O bandwidth with a varying client count.
3.4.3 Egress Bandwidth
Efficiently flushing data to the PFS to free space for future writes is another essential feature of BurstMem. In this section, we measured the egress bandwidth and evaluated the effectiveness of the coordinated shuffling introduced in Section 3.3.3. To emulate a common I/O access pattern in scientific applications, we interleaved the writes from multiple IOR clients and used a 16 KB transfer size, one of the dominant transfer sizes for scientific applications [56]. Similar to the ingress bandwidth evaluation, we set the number of IOR clients equal to that of the BurstMem servers and had each client output 4 GB of data to a shared file. Because Memcached does not support flushing, the comparison does not include Memcached.
Figure 3.8 shows the egress bandwidth results. The cumulative egress bandwidth of BB-IOR increased from 0.83 GB/s at 4 processes to 6.09 GB/s at 128 processes. BB-IOR was able to
achieve 2× to 19× higher bandwidth when compared to the original IOR whose performance was
consistently below 0.4 GB/s for all cases. Such low performance was mainly due to large overhead
from lock contention caused by unaligned writes. Coordinated shuffling rearranged unaligned write requests into sequential, stripe-aligned writes, thereby significantly improving the overall egress bandwidth.

Figure 3.8: Egress I/O bandwidth.
In Figure 3.9, we show the time spent on shuffling and flushing. The shuffling operation incurred over 30% overhead. However, it enabled flushing to proceed as large, sequential, stripe-aligned writes to the PFS, delivering orders of magnitude better performance at massive scale. Hence, the extra overhead from data shuffling is a worthwhile trade-off given the significant benefit it delivers.
3.4.4 Scalability of BurstMem
Scalability is a critical factor for BurstMem. We want to ensure that BurstMem is able to provide increasing bandwidth when given more resources, such as more BurstMem nodes and additional CPU cores on each node. In this section, we evaluated the scalability of BurstMem from two perspectives: horizontal scalability (scale-out) and vertical scalability (scale-up). We continued using IOR as the benchmark tool for our evaluation. In the horizontal scaling experiment, we increased the number of BurstMem servers and measured the cumulative bandwidth delivered by BurstMem. In the vertical scaling experiment, we increased the number of threads in each individual BurstMem server.
Figure 3.9: Dissection of coordinated shuffling.
Horizontal Scaling. We evaluated horizontal scalability by fixing the number of clients at 128 and increasing the number of BurstMem servers from 4 to 128. The I/O request size was set to 1 MB, and each client wrote 512 MB of data, amounting to 64 GB of input data in total for each iteration.
Figure 3.10(a) shows the performance results of horizontal scaling. As shown in the figure, cumulative bandwidth improved linearly from 9.9 GB/s with 4 BurstMem servers to 62.04 GB/s with 32 BurstMem servers. However, the rate of increase declined when going from 64 to 128 BurstMem servers. This was because the supplied bandwidth of each BurstMem server could efficiently absorb I/O requests from more than two clients. When the number of BurstMem servers was fewer than a quarter of the clients (32), the servers were mostly saturated. Further increasing the number of BurstMem servers beyond that point led to underutilized bandwidth in the BurstMem system, and the bandwidth gradually became client-bound.
In summary, linear horizontal scalability was achievable when ingress bandwidth was bounded by BurstMem. In addition, some other factors could affect the cumulative bandwidth, including varying end-to-end network bandwidth on Titan due to locality, and contention for network resources. These factors caused the cumulative bandwidth to be lower than the theoretical maximum.
Figure 3.10: Scalability of BurstMem: (a) horizontal scaling; (b) vertical scaling.
Vertical Scaling. We evaluated BurstMem's vertical scalability by scaling the number of threads in each BurstMem server from 1 to 15, one thread per core; the remaining core was used to run system daemons. We measured the bandwidth that each individual BurstMem server could supply. On average, each BurstMem server served 16 clients, and each client sent 512 MB of data to the server, amounting to 8 GB of input data per BurstMem server. Figure 3.10(b) shows the bandwidth increasing from 2.49 GB/s at one thread to 6.11 GB/s at 15 threads. There was a sharp increase at 15 threads because each compute node contained two NUMA nodes, each with 8 cores. Titan scheduled the first 8 threads to the first NUMA node and the last 7 threads to the second. With 15 threads, we therefore also harnessed the capability of the second NUMA node, such as its memory bandwidth and computing power.
3.4.5 Case Study: S3D, A Real-World Scientific Application
During the experiments with S3D, we kept the sizes of the X, Y, and Z dimensions at 50, 50, and 50, respectively, and had each process write about 2 GB of checkpoint data. We compared the cumulative bandwidth of BurstMem-enabled S3D (BB-S3D) with that of the original S3D implementation.
Figure 3.11 shows the I/O performance comparison between BB-S3D and S3D. The bandwidth of BB-S3D increased linearly from 1.27 GB/s at 1 process to 80.49 GB/s at 125 processes. This yielded a performance improvement of up to 10× over the original S3D when the number of MPI processes was 125.

Figure 3.11: I/O performance of S3D with BurstMem.
We observed that the original S3D bandwidth was lower than that of the original IOR. This was because, in the IOR tests, we set the transfer size equal to the stripe size, which optimizes performance under the Lustre file system. In contrast, the transfer units of Fortran I/O in S3D vary among 0.95 MB, 2.86 MB, and 10.49 MB, which is less favorable to Lustre.
3.5 Related Work
Improving I/O performance on large-scale HPC systems has gained broad attention over the past decades. A number of studies have introduced new I/O middleware libraries. MPI-IO [94, 62], PnetCDF [58, 69], and HDF5 [11, 64] boost I/O performance using parallel I/O that involves a massive number of participating processes. PLFS [32] introduces an extra I/O layer that converts non-contiguous, interspersed I/O into contiguous, sequential I/O. All these studies aim to optimize I/O on the parallel file system (PFS); therefore, their performance is still restricted by the bandwidth of the PFS.
I/O forwarding [28] is another key technique, applied on Blue Gene/P systems. It leverages two I/O forwarding components, named CIOD [77] and ZOID [54], both of which use synchronous I/O forwarding. Venkatram et al. [97] replace synchronous I/O forwarding with asynchronous staging, thus enhancing an application's overall performance. However, such techniques apply only to the Blue Gene/P architecture.
Orthogonal to this work, asynchronous data staging has been proposed by many other researchers. Such work generally falls into two categories: local staging [83, 60] and remote staging [78, 27]. In the former case, an application uses the local storage of compute nodes as the staging area; however, performance can be affected by computation jitter [31]. Remote staging buffers data in additional partitions of compute nodes. Although remote staging is immune to computation jitter, it is confined by the available resources of those compute nodes, such as the supplied bandwidth and storage capacity.
Burst buffers in HPC are a relatively recent idea. Past burst buffer studies mainly used modeling
and simulations. Liu et al. [65] designed a simulator of burst buffer for the IBM Blue Gene/P
architecture. Bing et al. [109] characterized output burst absorption on Jaguar and made an
important step toward quantitative models of storage system performance behaviors. Different
from them, our work focuses on designing and implementing a prototype burst buffer system and
analyzing its performance benefit.
More recently, there have been an increasing number of ongoing projects to provide software
solutions. Two representative projects are DataWarp [9] from Cray [6] and Infinite Memory Engine
(IME) [13] from DataDirect Networks [8]. DataWarp utilizes flash SSD I/O blades with the Cray Aries
high-speed interconnect. It is designed for the Trinity [25] and Cori [5] supercomputers. It features
a flexible storage allocation mechanism that allows a user to reserve burst buffers, providing
seamless integration with SLURM. In addition, a user can customize a reservation to
act either as a file system mount point or as a layer of local cache that better supports bursty
checkpoint/restart workloads. Different from BurstMem, DataWarp is designed to run on the Cray Aries
interconnect, while BurstMem leverages CCI for portable data transfer. IME is positioned on I/O
nodes. Like BurstMem, it supports diverse interconnects, e.g., InfiniBand and Cray Aries. It is also
optimized for data flushing. Like Cray DataWarp, the whole buffer space is viewed by applications
as a mount point that transparently absorbs clients' I/O requests. Different from IME, BurstMem
exposes a key-value store based interface to scientific applications, which can be more easily extended
to support richer semantics. In addition, while both DataWarp and IME are developed as
commercial products and provide more comprehensive functionality than BurstMem, little of their
design detail has been released; the way we structure BurstMem provides complementary design choices for
these commercial products.
3.6 Summary
In this chapter, we have designed a high-performance burst buffer system on top of Mem-
cached. Through in-depth analysis, we identified that Memcached has several issues that prevent
its direct use as a burst buffer, such as the lack of efficient storage management to absorb large
amounts of bursty writes and the inability to exploit modern high-speed network interconnects.
Based on our analysis, we introduced several techniques to enhance Memcached into the BurstMem
system for bursty I/O in scientific applications. Our techniques include a log-structured data or-
ganization with stacked AVL indexing for fast I/O absorption and low-latency, semantic-rich data
retrieval; coordinated data shuffling for efficient data flushing; and CCI-based communication for
high-speed data transfer. Our experiments on the Titan supercomputer with synthetic benchmarks
and real-world applications demonstrate that BurstMem can efficiently provide high-performance
I/O services for current HPC scientific applications with good scalability.
CHAPTER 4
BURSTFS: A DISTRIBUTED BURST BUFFER
FILE SYSTEM
4.1 Introduction
Node-local burst buffers are a powerful hardware resource for scientific applications to buffer
their bursty I/O traffic. However, the usage of node-local burst buffers is not yet well-studied, nor
are burst buffer software interfaces standardized across systems. Currently, users are left with the
freedom to explore the use of burst buffers in an ad-hoc manner. However, domain scientists would
rather focus on their scientific problems instead of fiddling with the complexity of how to best use
burst buffers.
Several efforts have explored the use of locality-aware distributed file systems (e.g., HDFS [90])
to manage node-local burst buffers [106, 110, 113]. In such file systems, each process stores its
primary data to the local burst buffer. Because compute processes can be co-located with their
data, it is feasible to achieve linearly scalable aggregated bandwidth [113, 83]. However, burst
buffers are only temporarily available to user jobs. A user job can utilize local burst buffers within
the duration of its allocation, but the job loses access to the burst buffer storage when the allocation
terminates.
Conventional file systems such as HDFS [90] or Lustre [35, 81] are typically designed to persist
data indefinitely, on the order of an HPC system's lifetime. They utilize long-running daemons
for I/O services, which are not necessary for temporary burst buffer usage. In addition, the
construction and cleanup of I/O services for these file systems can lead to a waste of resources in terms
of compute cores, storage, and memory. Therefore, for effective use of burst buffers by scientific
users, it is critical to develop software for standardizing the use of node-local burst buffers, so that
they can be seamlessly integrated into the repertoire of HPC tools on leadership supercomputers.
HPC applications typically exhibit two main I/O patterns: shared file (N-1) and file-per-process
(N-N) [32] (see details in Section 1.1). For node-local burst buffers, with the N-N pattern, appli-
cations can achieve scalable bandwidth by having each process write/read its files locally. The
41
difficulty for node-local burst buffers lies with the N-1 I/O pattern, in which all processes write a
portion of a shared file. In particular, a shared file requires the metadata for all data segments to
be properly constructed, indexed, and collected at the time of writes, then later formulated with a
global layout before any process can locate its targeted data for read access. While this issue has
been investigated on persistent parallel file systems [32, 111], the problem of efficiently formulating
and serving the global layout of a shared file remains a critical issue for a temporary file system
across burst buffers.
In addition, datasets from scientific applications are typically multi-dimensional. Such datasets
are usually stored in one particular order of multiple dimensions, but frequently read from different
dimensions based on the nature of scientific simulation or analysis. Often, there is an incompatibility
between the order of writes and the order of reads for data elements in a multi-dimensional dataset,
which typically leads to many small non-contiguous read operations for one process to retrieve
its desired data elements [95] (see Section 1.1.2 for details). An effective node-local burst buffer
file system also needs to provide a mechanism for scientific applications to efficiently read multi-
dimensional datasets without many costly small read operations.
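For instance, for a two-dimensional array stored in row-major order, reading a single column requires one small read per row. A minimal sketch of the offsets involved (the array shape and element size are arbitrary illustrative values):

```python
def column_read_offsets(nrows, ncols, elem_size, col):
    """Byte offsets a reader must fetch to obtain one column of a 2-D
    array stored in row-major order: nrows small, noncontiguous reads
    of elem_size bytes each, one per row."""
    return [(r * ncols + col) * elem_size for r in range(nrows)]
```

Even this tiny example shows how a read order incompatible with the write order turns one logical request into many scattered small reads.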
In this research, we have designed an ephemeral Burst Buffer File System (BurstFS) that has
the same temporary life cycle as a batch-submitted job. BurstFS organizes the metadata for the
data written in local burst buffers into a distributed key-value store. To cope with the challenges
from the aforementioned I/O patterns, we designed several techniques in BurstFS including scalable
metadata indexing, co-located I/O delegation, and server-side read clustering and pipelining. We
used a number of I/O kernels and benchmarks to evaluate the performance of BurstFS, and validate
our design choices through tuning and analysis.
In summary, this research on BurstFS makes the following contributions.
• We present the design and implementation of a burst-buffer file system to meet the need of
effective utilization of burst buffers on leadership supercomputers.
• We introduce several mechanisms inside BurstFS, including scalable metadata indexing for
quickly locating data segments of a shared file, co-located I/O delegation for scalable and
recyclable I/O management, and server-side clustering and pipelining to support fast access
of multi-dimensional datasets.
• We evaluate the performance of BurstFS with a broad set of I/O kernels and benchmarks.
Our results demonstrate that BurstFS achieves linear scalability in terms of aggregated I/O
bandwidth for parallel writes and reads.
• To the best of our knowledge, BurstFS is the first file system designed to have a co-existent
and ephemeral life cycle with one or a batch of scientific applications in the same job.
The rest of this chapter is organized as follows. Section 4.2 presents the design of BurstFS. The
experimental methodology and results are presented in Section 4.3. Section 4.4 sum-
marizes the related work, followed by Section 4.5, which concludes this chapter.
4.2 Design of BurstFS
We designed the Burst Buffer File System (BurstFS) as an ephemeral file system, with the
same lifetime as an HPC job. Our overarching goal for BurstFS is to support scalable aggregation
of I/O operations across distributed, node-local storage for data-intensive simulations, analytics,
visualization, and checkpoint/restart. BurstFS instances are launched at the beginning of a batch
job, provide data services for all applications in the job, and terminate at the end of the job
allocation. Fig. 4.1 shows the system architecture of BurstFS.
When a batch job is allocated a set of compute nodes on an HPC system, an instance of BurstFS
will be constructed on-the-fly across these nodes, using the locally-attached burst buffers, which
may consist of memory, SSD, or other fast storage devices. These burst buffers enable very fast
log-structured local writes; i.e., all processes can store their writes to the local logs. Next, one or
more parallel programs launched on a portion of these nodes can leverage BurstFS to write data
to, or read data from, the burst buffers. In addition, a BurstFS instance exists only during the
lifetime of the batch job. All allocated resources and nodes will be cleaned up for reuse at the end
of the scheduled execution. This avoids any post-mortem interference with other jobs or potential
unforeseeable complications to the operation of file and storage systems. Furthermore, parallel
programs within the same job allocation (e.g., programs launched within the same batch script)
can share data and storage on the same BurstFS instance, which can greatly reduce the need of
back-end persistent file systems for data sharing across these programs.
BurstFS is mounted with a configurable prefix and transparently intercepts all POSIX functions
under that prefix [99]. Data sharing between different programs can be accomplished by mounting
Figure 4.1: BurstFS system architecture.
BurstFS using the same prefix. Upon the unmount operation from the last program, all BurstFS
instances sequentially flush their data for data persistence (if requested), clean up their resources
and exit.
To support the challenging I/O patterns discussed in Section 4.1, we designed several techniques
in BurstFS including scalable metadata indexing, co-located I/O delegation, and server-side read
clustering and pipelining as shown in Fig. 4.1. BurstFS organizes the metadata for the local logged
data into a distributed key-value store. It enables scalable metadata indexing such that a global
view of the data can be generated quickly to facilitate fast read operations. It also provides a
lazy synchronization scheme to mitigate the cost and frequency of metadata updates. In addition,
BurstFS supports co-located I/O delegation for scalable and recyclable I/O management. Further-
more, we introduce a mechanism called server-side read clustering and pipelining for improving the
read performance. We elaborate on these techniques in the rest of this section.
4.2.1 Scalable Metadata Indexing
As discussed in Section 4.1, one of the challenges for the N-1 I/O pattern is accessing the
metadata of segments scattered across all nodes. This leads to a huge scalability problem when all
processes are reading their data and each one needs to gather the metadata from all nodes.
Figure 4.2: Diagram of the distributed key-value store for BurstFS.
Distributed Key-Value Store for Metadata. BurstFS solves this issue using a distributed
key-value store for metadata, along with log-structured writes for data segments. It leverages
MDHIM [52] for the construction of distributed key-value stores and provides additional features
for efficient handling of bursty read and write operations.
Fig. 4.2 shows the organization of data and metadata for BurstFS. Each process stores its data
to the local burst buffer as data logs, which are organized as data segments. New data are always
appended to the data logs, i.e., stored via log-structured writes. With such log-structured writes,
all segments from one process are stored together regardless of their global logical position with
respect to data from other processes.
When the processes in a parallel program create a global shared file, a key-value pair (e.g.,
M1 or M2) is generated for each segment. A key consists of the file ID (an 8-byte hash value)
and the logical offset of the segment in the shared file. The value describes the actual location
of the segment, including the hosting burst buffer, the log containing the segment (there can be
more than one log from multiple processes on the same node), the physical offset in the log, and
the length. The key-value pairs (KVP) for all the segments can then provide the global layout for
the shared file. All the KVPs are consistently hashed and distributed among the key-value servers
(e.g., KVS0, KVS1 and so on). With such an organization, the metadata storage and services are
spread across multiple key-value servers. Many processes from a parallel application can quickly
retrieve the metadata and form a global view of the layout of a shared file.
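The key-value organization above can be sketched as follows. The server count, the MD5-based placement, and the dict-shaped value are illustrative assumptions for this sketch, not BurstFS's actual wire format or MDHIM's real placement function:

```python
import hashlib

NUM_SERVERS = 4  # hypothetical number of key-value servers

def make_key(file_id: int, logical_offset: int) -> bytes:
    # Key: 8-byte file ID followed by the segment's logical offset in the file.
    return file_id.to_bytes(8, "big") + logical_offset.to_bytes(8, "big")

def make_value(node_id: int, log_id: int, physical_offset: int, length: int) -> dict:
    # Value: where the segment actually lives on its hosting burst buffer.
    return {"node": node_id, "log": log_id,
            "phys_offset": physical_offset, "length": length}

def server_for(key: bytes) -> int:
    # Consistent placement: hash the key to choose a key-value server,
    # spreading metadata storage and lookups across all servers.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_SERVERS
```

Because every client computes the same placement function, any process can locate the server holding the metadata for a given (file, offset) without a central directory.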
Lazy Synchronization. In BurstFS, we also develop lazy synchronization to provide efficient
support for bursty writes. Each process maintains a small memory pool for holding the metadata
KVPs from write operations; at the end of a configurable interval, the buffered KVPs are
stored to the distributed key-value stores. An fsync operation can force an explicit synchronization.
BurstFS leverages the batch put operation from MDHIM to transfer these KVPs together in a few
round-trips, minimizing the latency incurred by single put operations. During the synchronization
interval, BurstFS searches for contiguous KVPs in the memory pool to potentially combine. A
combined KVP can span a bigger range. As shown in Fig. 4.2, segments [2-3) MB and [3-4) MB
are contiguous and map to the same server (KVS0), so their KVPs are combined into one KVP. Lazy
synchronization can significantly reduce the number of KVPs required when many data segments
issued by each process are logically contiguous (e.g. N-1 segmented and N-N write in Fig. 1.1).
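The combining of contiguous KVPs during a sync interval can be illustrated with a minimal sketch; it ignores the additional constraint, noted above, that combined segments must map to the same server:

```python
def combine_contiguous(entries):
    """Merge logically contiguous (offset, length) segments buffered in the
    per-process memory pool into fewer key-value pairs before the batch put."""
    merged = []
    for off, length in sorted(entries):
        last = merged[-1] if merged else None
        if last and last[0] + last[1] == off:
            merged[-1] = (last[0], last[1] + length)  # extend the previous KVP
        else:
            merged.append((off, length))
    return merged
```

With logically contiguous writes (as in the N-1 segmented pattern), this reduces many per-segment KVPs to a handful, shrinking the batch put accordingly.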
Parallel Range Queries. To begin a read operation, BurstFS has to first look up the meta-
data for the distributed data segments. Thus, it searches for all KVPs whose offsets fall in the
requested range, e.g., [offset, offset+count] is the requested range in pread. With batched read
requests, BurstFS needs to search for all KVPs that are targeted by the read requests in the batch.
To retrieve the requested metadata entries for different read operations, we need support for a
variety of range queries to the key-value store. However, range queries are not directly supported
by MDHIM; the clients can indirectly perform range queries by iterating over consecutive KVPs
within a range with repeated cursor-type operations. Clients must sequentially invoke one or more
cursor operations for one range server, and must search multiple range servers until all KVPs have
been located. The additive round-trip latencies by all cursor operations to multiple range servers
can severely delay read operations.
To mitigate this, we introduce parallel extensions for both MDHIM clients and servers. On
the client side, we transform an incoming range request and break it into multiple small range
queries to be sent to each server based on consistent hashing. Compared with sequential cursor
operations, this extension allows a range query to be broken into many small range queries, one for
each range server. These small queries are then sent in parallel to all range servers to retrieve all
KVPs. On the server side, for the small range query within its scope, all KVPs inside that range are
retrieved through a single sequential scan in the key-value store. With this parallel optimization,
any combination of queries can be accomplished through only parallel range queries to all servers
and a single local scan operation at each key-value server.
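The client-side splitting of one range query into parallel per-server subqueries might look like the following sketch; the fixed-size slice placement is a stand-in assumption for the actual consistent-hashing scheme:

```python
def split_range(offset, count, slice_size, num_servers):
    """Break a [offset, offset+count) range query into per-server subranges,
    assuming keys are placed by assigning fixed-size offset slices round-robin
    across servers (a hypothetical placement, not MDHIM's real scheme)."""
    subqueries = {}
    end = offset + count
    first_slice = (offset // slice_size) * slice_size
    for slice_start in range(first_slice, end, slice_size):
        server = (slice_start // slice_size) % num_servers
        lo, hi = max(offset, slice_start), min(end, slice_start + slice_size)
        subqueries.setdefault(server, []).append((lo, hi))
    return subqueries  # each server answers its subranges with one local scan
```

The subqueries are then issued to all servers in parallel, replacing the sequential cursor round-trips with one round-trip per server plus a single sequential scan on each.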
4.2.2 Co-Located I/O Delegation
In contrast to BurstFS write operations that store data locally, read operations in BurstFS
may need to transfer data from remote burst buffers to a process initiating a read. To ensure
the efficiency of reads, we need to support fast and scalable data transfer for read operations. A
common approach adopted by many parallel programming models such as MPI [73] and PGAS [44]
is to have each process make read function calls to persistent file and storage service daemons.
Because BurstFS's lifetime is limited to that of a single job, it has special requirements
for I/O services. One implementation option might be to have persistent I/O daemons to support
BurstFS; however, that would lead to a waste of computation and memory resources. Another
implementation choice could be to utilize a simple I/O service thread spawned from the parent
process in a parallel program. However, with this approach, the service thread can only serve the
I/O needs for processes in the same program, and cannot serve subsequent or concurrent programs
in the batch job.
In BurstFS, we introduce scalable read services through a mechanism called co-located I/O
delegation. We launch an I/O proxy process, called a delegator, on each node. Delegators are decoupled
from the applications in a batch job, and are launched across all compute nodes. The delegators
collectively provide data services for all applications in the job.
As shown in Fig. 4.3, processes on three compute nodes will have all their I/O activities del-
egated to the delegator on the same node. Each delegator consists of two main components: a
request manager and an I/O service manager. In this way, a conventional client-server model for
I/O services is transformed into a peer-peer model among all delegators. With this arrangement,
individual processes no longer communicate with I/O servers directly, but go through their I/O
delegators. This leads to a great reduction on the total number of network communication channels
Figure 4.3: Diagram of co-located I/O delegation on three compute nodes P, Q, and R, each with 2 processes.
and the associated resources across the compute nodes. The I/O service manager in each delegator
is dedicated to serve the incoming read requests from peer delegators. The I/O service managers
exploit opportunities to consolidate requests, pipeline data retrieval from local storage, and transfer
data back to requesting delegators (see Section 4.2.3 for details).
The request manager of a delegator is composed of two main data structures: a request send
queue and a data receive queue, as shown in Fig. 4.3. The request send queue is a circular list with a
configurable number of entries. When not full, it receives the read requests from all client processes
through named pipes. Requests are queued based on the destination delegator. Requests to the
same delegator are chained together, which consolidates multiple requests into a single network
message. The data receive queue resides in a shared memory pool constructed across delegator and
client processes on the same node. For each I/O request, an outstanding request entry is created
in the receive queue. Data returned from remote delegators is directly deposited in the shared
memory pool, and the receive queue is searched for a matching outstanding request entry. When a
match is found, the outstanding request is marked as complete. An additional acknowledgment is
sent via the pipe to notify the client process to consume the data.
The request manager monitors the utilization level of the shared memory pool. When it is
higher than a configurable threshold (default 75%), the delegator (1) informs processes of the
urgent need to consume their data and (2) throttles request injection to remote delegators. The
request manager also monitors the ingress bandwidth based on the received data for read requests in
the send queue. When the ingress bandwidth is saturated, the request manager creates additional
network communication channels to send requests and receive data.
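The chaining of queued requests by destination delegator can be sketched as below; the tuple layout is an assumption for illustration:

```python
from collections import defaultdict

def consolidate(send_queue):
    """Group queued read requests by destination delegator so that each
    destination receives one chained network message instead of one
    message per request. Each request is a (dest, offset, length) tuple."""
    by_dest = defaultdict(list)
    for dest, off, length in send_queue:
        by_dest[dest].append((off, length))
    return dict(by_dest)  # one outgoing message per destination
```

Grouping by destination is what cuts the number of network messages and channels from one-per-request down to one-per-peer-delegator.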
4.2.3 Server-Side Read Clustering and Pipelining
As discussed in Section 4.1, with multi-dimensional variables, a process can issue many small,
noncontiguous read requests for scattered data segments in each data log. Various I/O libraries
and tools have provided special support for such noncontiguous read access. For instance, POSIX
lio_listio allows read requests to be transferred in batches, and OrangeFS supports batched
read requests. While being able to combine small requests into a list or a large request, these
techniques mainly work from the client side and rely on the underlying storage system such as the
disk scheduler to prefetch or merge requests for fast data retrieval. However, there is still a lack of
distributed file systems that can globally optimize these batch read requests from all processes.
As an ephemeral file system in a batch job, BurstFS directly manages accesses to the datasets
from scientific applications via delegators. Therefore, besides leveraging the existing techniques
of batched reads from the client side, BurstFS can exploit its visibility of read requests at the
server side (via the I/O service manager) for further performance improvements. To this end, we
introduce a mechanism called server-side read clustering and pipelining (SSCP) in the I/O service
manager to improve the read performance of BurstFS.
SSCP addresses several concurrent, sometimes conflicting objectives: (1) the need to detect
spatial locality among read requests and combine them into large contiguous reads; and (2) the
Figure 4.4: Server-side read clustering and pipelining.
need to serve on-demand read requests as soon as possible for execution progress. As shown in
Fig. 4.4, SSCP provides two key components to achieve these objectives, a two-level request queue
for read clustering and a three-stage pipeline for fast data movement.
In the two-level request queue, SSCP first creates several categories of request sizes, ranging from
32KB to 1MB (see Fig. 4.4). Incoming requests will be inserted to the appropriate size category
either individually, or if contiguous with other requests, combined with the existing contiguous
requests and then inserted into the suitable size category. As shown in the figure, two contiguous
requests of 120KB and 200KB are combined by the service manager. Within each size category,
all requests are queued based on their arrival time. A combined request will use the arrival time
from its oldest member. For best scheduling efficiency, the category with the largest request size is
prioritized for service. Within the same category, the oldest request is served first. BurstFS
enforces a threshold on the wait time of each category (default 5 ms). If any category
has not been serviced for longer than this threshold, BurstFS selects the oldest read request from
that category for service and resets the category's wait time.
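The scheduling policy above can be sketched as a two-level queue; the combining of contiguous requests into larger ones is omitted here, and the category boundaries are the ones named in the text:

```python
import time
from collections import deque

CATEGORIES = [32 << 10, 64 << 10, 128 << 10, 256 << 10, 512 << 10, 1 << 20]
WAIT_THRESHOLD = 0.005  # 5 ms default wait-time threshold per category

class TwoLevelQueue:
    """Size-categorized request queue: largest category first, oldest
    request first within a category, with a starvation guard."""

    def __init__(self):
        self.queues = {c: deque() for c in CATEGORIES}
        self.last_served = {c: time.monotonic() for c in CATEGORIES}

    def insert(self, arrival, size):
        # Place the request in the smallest category that can hold it.
        cat = next((c for c in CATEGORIES if size <= c), CATEGORIES[-1])
        self.queues[cat].append((arrival, size))

    def next_request(self):
        now = time.monotonic()
        # Starvation guard: a category waiting past the threshold wins.
        for c in CATEGORIES:
            if self.queues[c] and now - self.last_served[c] > WAIT_THRESHOLD:
                self.last_served[c] = now
                return self.queues[c].popleft()
        # Otherwise serve the largest non-empty category, oldest first.
        for c in reversed(CATEGORIES):
            if self.queues[c]:
                self.last_served[c] = now
                return self.queues[c].popleft()
        return None
```

Serving large categories first favors efficient bulk reads from the burst buffer, while the per-category wait threshold bounds how long small requests can be deferred.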
The I/O service manager creates a memory pool to temporarily buffer outgoing data. This
facilitates the rearrangement of data segments for network transfer and allows the formulation of a
pipeline. Fig. 4.4 shows the three-stage data movement pipeline: reading, copying, and transferring.
In the reading stage, the I/O service manager picks up a request from the request list based on the
aforementioned scheduling policy and reads the requested data from the local burst buffer into a slot in
the memory buffer. In the copying stage, the data in the memory buffer is prepared as an outgoing
reply for the remote delegator, and then copied from the memory buffer to the network packet.
Data inside the memory buffer may need to be divided into multiple replies for different remote
delegators. The I/O service manager then creates multiple network replies, one for each delegator.
In the transferring stage, the I/O service manager can pack one or more network replies for the
same remote delegator into one network message (1MB maximum), and transmit (Xmit in Fig. 4.4)
it to the delegator.
4.3 Evaluation of BurstFS
4.3.1 Testbed
Our experiments were conducted on the Catalyst cluster [4] at Lawrence Livermore National
Laboratory (LLNL), consisting of 384 nodes. Each node is equipped with two 12-core Intel Xeon
E5-2695v2 processors, 128 GB DRAM and an 800-GB burst buffer comprised of PCIe SSDs.
Configuration: We focused on comparing BurstFS with two contemporary file systems: Or-
angeFS 2.8.8, and the Parallel Log-Structured File System 2.5 (PLFS [32]). As a representative
parallel file system (PFS), OrangeFS stripes each file over multiple storage servers to enable parallel
I/O with high aggregate bandwidth. In our experiments, we established OrangeFS server instances
across all the compute nodes allocated to a job to manage all the node-local SSDs. PLFS is de-
signed to accelerate N-1 writes by transforming random, dispersed, N-1 writes into sequential N-N
writes in a log-structured manner. Data written by each process are stored on the backend PFS as
a log file. In our experiments, we used OrangeFS (over node-local SSDs) as the backend PFS for
PLFS. We used PLFS’s MPI interface for read and write.
Since version 2.0, PLFS has had burst buffer support. In PLFS with burst buffer support
(referred to as "PLFS burst buffer" in the rest of this chapter), instead of writing the log file on the
backend PFS, processes store their metalinks on the backend PFS, which point to the real location
of their log files in the burst buffers. This allows each process to write its log file to the burst
buffer instead of the backend PFS. In our experiments, we had each process write to its node-local
SSD, and the location was recorded in the metalink stored on the center-wide Lustre parallel file
system. This configuration can deliver scalable write bandwidth. In order to read data from the
PLFS burst buffer, each node-local SSD has to be mounted on all other compute nodes as a global
file system (e.g., NFS), which requires system administrator support. A primary goal for BurstFS
is that it be completely controllable from user space, including mounting the file system. Thus, due
to the requirement of administrator intervention to establish the cross-mount environment for reads
with the PLFS burst buffer, we only evaluated the write scalability of PLFS burst buffer and include
this result in Section 4.3.2.
Benchmarks: We have employed microbenchmarks that exhibit three checkpoint/restart I/O
patterns (see Section 1.1.1 of Chapter 1). Note that an N-1 strided pattern is a case of 2-D scientific
I/O as depicted by Fig. 1.2 in Section 1.1.2 of Chapter 1.
To assess BurstFS’s potential to support scientific applications, we evaluated BurstFS using
I/O workloads extracted from MPI-Tile-IO [16] and BTIO [108]. MPI-Tile-IO is a widely adopted
benchmark used for simulating the workloads that exist in visualization and numerical applica-
tions. The two-dimensional dataset is partitioned into multiple tiles, each process rendering pixels
inside one tile. Developed by NASA Advanced Supercomputing Division, BTIO partitions a three-
dimensional array across a square number of processes, each process processing multiple Cartesian
subsets. In both workloads, all processes first write their data into a shared file, then read back
into their memory. To evaluate the support for a batch job of multiple applications, we employed
the Interleaved Or Random (IOR) benchmark [88] to read data provided by Tile-IO and BTIO
programs in the same job.
Figure 4.5: Comparison of BurstFS with PLFS and OrangeFS under different write patterns: (a) N-1 segmented write, (b) N-1 strided write, (c) N-N write.
4.3.2 Overall Write/Read Performance
We first evaluated the overall write/read performance of BurstFS. In this experiment, 16 pro-
cesses were placed on each node, each writing 64MB data following an N-1 strided, N-1 segmented,
or N-N pattern. After each process wrote all of its data, we used fsync to force all writes to be
synchronized to the node-local SSD. We set the stripe size on OrangeFS as 1MB and fixed the
transfer size at 1MB to align with the stripe size, and each file was striped across all nodes in
OrangeFS. This configuration gave OrangeFS the best read/write bandwidth among the tuning
choices we tried (e.g., the 64KB default stripe size).
Fig. 4.5 compares the write bandwidth with PLFS burst buffer (PLFS-BB), PLFS, and Or-
angeFS. In all three write patterns, both BurstFS and PLFS burst buffer scale linearly with process
count. This is because processes in both systems wrote locally and the write bandwidth of each
node-local SSD was saturated. While we also observed linear scalability in OrangeFS and PLFS,
their bandwidths increase at a much slower rate. This is because both PLFS and OrangeFS stripe
their file(s) across multiple nodes, which can cause degraded bandwidth due to contention when
different processes write to the same remote node. On average, BurstFS delivered 3.5×, 2.7×, and
1.3× the performance of OrangeFS for N-1 segmented, N-1 strided, and N-N patterns, respectively.
Its performance was 1.6×, 1.6×, and 1.5× the performance of PLFS, respectively, for the three
patterns.
We observed that PLFS initially delivered higher bandwidth than BurstFS at small process
counts (16 and 32 processes), for all three patterns. After further investigation, we found this
was because, internally, PLFS transformed the N-1 writes into N-N writes. However, when fsync
Figure 4.6: Comparison of BurstFS with PLFS and OrangeFS under different read patterns: (a) N-1 segmented read, (b) N-1 strided read, (c) N-N read.
was called to force these N-N files to be written to PLFS's back-end file system (i.e., OrangeFS),
OrangeFS did not completely flush the files to the SSDs before fsync returned; the measured
bandwidth was even higher than the aggregate SSD bandwidth of the local file systems. Fig. 4.6
compares the read bandwidth of BurstFS with OrangeFS and PLFS. Each process read 64MB data
under N-1 strided, N-1 segmented and N-N patterns. For the N-1 strided reads, we first created a
shared file using N-1 segmented writes, then read all data using the N-1 strided reads. In this way,
each process read data from multiple logs. In order to cluster the non-contiguous read requests
under this pattern, we used POSIX lio_listio to submit read requests to BurstFS in batches. In
the case of OrangeFS, when we enabled its list I/O operations, the observed bandwidth was about
half of that without list I/O. This is because OrangeFS list I/O
does not benefit large read operations. Thus, for this experiment, we report only the performance
of the N-1 strided pattern in OrangeFS without its list I/O support.
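Batching non-contiguous reads with POSIX lio_listio can be sketched as follows. This is a generic POSIX example (the helper name and file path are ours, not part of the BurstFS client): each aiocb carries its own offset, buffer, and length, and one call submits the whole list.

```c
#include <aio.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Submit `n` non-contiguous reads on `fd` in one batched call. The
 * caller pre-fills aio_offset, aio_buf, and aio_nbytes of each aiocb;
 * this helper fills in the opcode and file descriptor. */
int batch_read(int fd, struct aiocb *cbs, struct aiocb **list, int n) {
    for (int i = 0; i < n; i++) {
        cbs[i].aio_lio_opcode = LIO_READ;
        cbs[i].aio_fildes = fd;
        list[i] = &cbs[i];
    }
    /* LIO_WAIT blocks until every request in the list completes. */
    return lio_listio(LIO_WAIT, list, n, NULL);
}
```

On older glibc versions the program must be linked with -lrt; newer glibc folds the AIO routines into libc.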
As we can see from Fig. 4.6(a), the bandwidth of the N-1 segmented read scales linearly with
process count for BurstFS, since each process read all data directly from its local node. In contrast,
both PLFS and OrangeFS read data from remote nodes, losing the benefit from locality. On the
other hand, the bandwidth of N-1 strided read in Fig. 4.6(b) increases at a much slower rate in
BurstFS compared with segmented read. This is because the strided read pattern resulted in
higher contention due to all-to-all reads from remote burst buffers. BurstFS with N-1 strided read
still scales better and outperforms both OrangeFS and PLFS. This is because instead of servicing
each request individually, BurstFS delegators clustered read requests from numerous processes and
served them through a three-stage read pipeline. On average, BurstFS delivered 2.2×, 2.5× and
[Figure: two panels plotting throughput (MB/s, log scale) against transfer size (1KB-1024KB): (a) Write, comparing BurstFS, OrangeFS, and PLFS; (b) Read, comparing BurstFS, OrangeFS, OrangeFS_List, and PLFS.]
Figure 4.7: Comparison of BurstFS with PLFS and OrangeFS under different transfer sizes.
1.4× the performance of OrangeFS, respectively, for N-1 segmented, N-1 strided and N-N patterns.
It delivered 1.6×, 1.4× and 1.6× the performance of PLFS, respectively, for the three patterns.
4.3.3 Performance Impact of Different Transfer Sizes
Fig. 4.7 shows the impact of transfer sizes on the bandwidth of BurstFS. We focused on N-1
strided I/O, because it is a challenging I/O pattern. Similar to the experiment in Section 4.3.2,
for BurstFS strided read operations, we first created a shared file using N-1 segmented writes and
then read the data back using an N-1 strided pattern. In this way, BurstFS will not benefit from
local reads.
The results in Fig. 4.7(a) demonstrate the impact of transfer sizes on write bandwidth when
64 processes wrote to a shared file. BurstFS outperformed OrangeFS and PLFS by having each
process write data locally, and it delivered outstanding performance improvement at small transfer
sizes, for example, 24.4× and 16.7× at 1KB compared to OrangeFS and PLFS, respectively. This
is because both PLFS and OrangeFS suffered from the cost of competing writes and repeated data
transfers to the shared remote SSDs.
Fig. 4.7(b) shows the impact of transfer size on read bandwidth. For small read requests,
OrangeFS provides list I/O support so that a list of read requests can be combined into one
function call. The results of this type of read operation are shown in Fig. 4.7(b) as OrangeFS List.
As we can see from this figure, although OrangeFS List enhances the performance of small reads,
its bandwidth is still lower than that of BurstFS. This is because of the additional benefits of server-side clustering
and pipelining in BurstFS. Overall, BurstFS yielded up to 10.2×, 3× and 12.3× performance
improvement compared to OrangeFS, OrangeFS List and PLFS, respectively.
[Figure: two panels plotting look-up time (s) for BurstFS, MDHIM, and PLFS: (a) metadata performance with varying transfer sizes (1KB-1024KB), (b) metadata performance with varying process counts (16-1024).]
Figure 4.8: Analysis of metadata performance as a result of transfer size and process count.
4.3.4 Analysis of Metadata Performance
As discussed in Section 4.2.1, BurstFS distributes the global metadata indices over a distributed
key-value store. During file open, each process in PLFS needs to construct a global view of a
shared file by reading and combining metadata from other processes. After this step, all look-up
operations are conducted locally. To evaluate the benefit of our design, we compared the metadata
look-up time of BurstFS with that of PLFS (i.e. PLFS’s total time on index construction during
file open and local look-up during read), as well as the original look-up time from the MDHIM
functions. We examined the look-up performance using both cursor and batch get functions from
MDHIM. Each cursor operation triggers a round-trip transfer for each key-value pair, and a look-up
for a range can invoke multiple cursor operations, as described in Section 4.2.1. The total look-up
time with cursor operations was therefore significantly higher than in the other cases; for instance, it took 81 seconds for the 4KB case
in Fig. 4.8(a). We thus omit the look-up time with cursor operations from our figures.
Fig. 4.8(a) compares the look-up time of BurstFS, PLFS, and MDHIM batch get (denoted as
MDHIM). In all three cases, we launched 32 processes, each to look up the locations of 64MB data
written under the N-1 strided pattern. The total data volume is the product of the transfer size
and the number of segments. Thus, a smaller transfer size will lead to more segments, therefore
more indices. As we can see from the figure, the look-up time of all cases drops along with the
increasing transfer size. This is because of fewer metadata look-ups. The look-up time of BurstFS
is significantly lower than that of PLFS. This validates that the scalable metadata indexing technique in
BurstFS can quickly establish a global view of the metadata for a shared file. In contrast, every
process in PLFS has to load all the indices generated during the write phase and construct the global indices
for reading; this all-to-all load dominated the look-up time. BurstFS also outperformed MDHIM
by minimizing the number of read operations with only one sequential scan at the range server
because of its support for parallel range queries (see Section 4.2.1). On average, BurstFS reduced
the look-up time by 77% and 58% compared with PLFS and MDHIM, respectively.
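The relationship between transfer size, index count, and look-up round trips can be sketched with a back-of-the-envelope calculation (the batch size below is a hypothetical parameter, not an MDHIM default): halving the transfer size doubles the number of segments and therefore the number of indices, and a cursor interface pays one round trip per index while a batched range query amortizes many indices per round trip.

```c
#include <assert.h>
#include <stddef.h>

/* Number of index entries for a region written with a fixed transfer
 * size: one entry per segment. */
size_t num_segments(size_t region_bytes, size_t transfer_bytes) {
    return region_bytes / transfer_bytes;
}

/* Cursor look-ups cost one round trip per key-value pair. */
size_t cursor_round_trips(size_t segments) { return segments; }

/* A batched range query retrieves up to `batch` entries per round
 * trip (ceiling division). */
size_t batched_round_trips(size_t segments, size_t batch) {
    return (segments + batch - 1) / batch;
}
```

For the 64MB-per-process, 64KB-transfer configuration in this experiment, each process generates 1024 indices, so cursor look-ups cost 1024 round trips while batching reduces that by the batch factor.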
Fig. 4.8(b) shows the metadata performance with an increasing process count. In this test, each
process looked up 64 MB data written with a transfer size of 64 KB. More processes led to more
look-up operations. As shown in the figure, the look-up time of PLFS increases sharply with the
process count. In contrast, the look-up time for BurstFS and MDHIM increases slowly with more
processes, because of the use of a distributed key-value store for metadata.
4.3.5 Tile-IO Test
Fig. 4.9(a) shows the performance of BurstFS with Tile-IO. In this experiment, a 32GB global
array was partitioned over 256 processes. Each process first wrote its tile to several non-contiguous
regions of the shared file, then read it back to its local memory. For write operations, BurstFS
outperformed OrangeFS and PLFS by directly writing data to local SSDs. For reads, although all
three file systems benefited from the buffer cache, BurstFS still performed best since each process
read data locally. Overall, BurstFS delivered 6.9× and 2.5× improvement over OrangeFS for
reads and writes, respectively, and 7.3× and 1.4× improvement over PLFS for reads and writes,
respectively.
4.3.6 BTIO Test
Fig. 4.9(b) shows the performance of BurstFS under the BTIO workload with problem size D.
In this experiment, the 408 × 408 × 408 global array was decomposed over 64 processes. Similar to
Tile-IO, each process first wrote its own cells to several noncontiguous regions of the shared file, then
read them back to its local memory. Due to the 3-D partitioning, the transfer size (2040B) of each
process is much smaller than Tile-IO (32KB), so the I/O bandwidth of both PLFS and OrangeFS
[Figure: two bar charts of bandwidth (GB/s) for write and read with BurstFS, OrangeFS, and PLFS: (a) performance of Tile-IO, (b) performance of BTIO (problem size D).]
Figure 4.9: Performance of Tile-IO and BTIO.
with BTIO decreases rapidly, compared with Tile-IO. BurstFS sustains this small-message workload
with the benefits of local reads and server-side read clustering. Overall, it delivered 15.6× and 9.5×
performance improvement over OrangeFS for reads and writes, respectively. It also outperforms
PLFS by 16.2× and 7×, respectively, for reads and writes.
4.3.7 IOR Test
In order to evaluate the support for data sharing among different programs in a batch job, we
conducted a test with IOR. We ran IOR with a varying number of processes reading a shared file
written by another set of processes from a Tile-IO program. Processes in both MPI programs were
launched in the same job. Each node hosted 16 Tile-IO processes and 16 IOR processes. Once
Tile-IO processes completed writing to the shared file, this file was read back by IOR processes
using the N-1 segmented read pattern. We kept the same transfer size of IOR as Tile-IO. Since the
read pattern did not match the initial write pattern of Tile-IO, each process read from multiple
logs on remote nodes. We fixed the size of each tile at 128MB and the number of tiles along the Y
axis at 4, and then increased the number of tiles along the X axis with the number of reading
processes. Fig. 4.10(a) compares the read
bandwidth of BurstFS with PLFS and OrangeFS. Both PLFS and OrangeFS are vulnerable to the
small transfer size (32KB). BurstFS maintains high bandwidth by locally combining small
requests and by server-side read clustering and pipelining. On average, when reading data produced
by Tile-IO, BurstFS delivered 2.3× and 2.5× the performance of OrangeFS and PLFS, respectively.
We also evaluated the read bandwidth of IOR over the dataset generated by BTIO, using two
BTIO classes D and E. For Class D, we used 64 processes to write an array of 408 × 408 × 408 to
a shared file. For Class E, 225 processes wrote an array of 1020 × 1020 × 1020 to a shared file. In
both cases, the shared file was then read back by the IOR processes using the N-1 segmented read
pattern. Fig. 4.10(b) shows the read bandwidth. Due to the much smaller transfer size (2040B for
Class D and 2720B for Class E), the bandwidths of OrangeFS and PLFS with BTIO are much lower
than with Tile-IO. While the performance of BurstFS is also impacted by the small transfer size, it
delivers much better bandwidth to these small requests. On average, when reading data produced
by BTIO, BurstFS delivered 10× and 12.2× performance improvement compared to OrangeFS and
PLFS, respectively.
[Figure: (a) read bandwidth (GB/s) of IOR against the number of processes (16-1024) on the shared file written by Tile-IO; (b) read bandwidth (GB/s) of IOR on the shared file written by BTIO for problem sizes D and E; both panels compare BurstFS, OrangeFS, and PLFS.]
Figure 4.10: Read bandwidth of IOR.
4.4 Related Work
The importance of burst buffers is shown by their inclusion in the blueprint of many next-
generation supercomputers [2, 5, 20, 21, 22, 25] with a broad investment in supporting software.
DataWarp [9], IME [13] and aBBa [1] are three ongoing projects at Cray, DDN and EMC, respectively. Their
potential benefits have been explored from various research angles [65, 87, 103]. All these works
target remote, shared burst buffers. In contrast, our work centers on node-local burst buffers, an
equally important architecture that currently lacks standardized support software. Compared with
work on remote burst buffers, our work delivers linear scalability for checkpointing/restart since
most I/O requests are serviced locally. PLFS burst buffer [31] supports node-local burst buffers
(see Section 4.3.1) and can deliver fast, scalable write performance. It relies on a global file system
(e.g., Lustre, NFS) to manage metalinks, which can be an overhead if the number of metalinks
is large. In addition, reading data from PLFS burst buffer requires each of the node-local burst
buffers to be mounted across all compute nodes. BurstFS differs from PLFS burst buffer in that
it is structured as a standalone file system. BurstFS achieves scalable read performance using the
collective services of its delegators. Moreover, BurstFS is specialized for managing node-local burst
buffers, while PLFS burst buffer supports both the node-local burst buffers and remote shared
burst buffers.
The I/O bandwidth demand from checkpoint/restart has been increasing on par with the com-
puting power. SCR [76], CRUISE [83] and FusionFS [113] are notable efforts designed to address
this increasing I/O challenge and achieve linear write bandwidth by having each process write in-
dividual files to node-local storage (N-N). Different from these works, BurstFS supports both N-1
and N-N I/O patterns and delivers scalable read/write bandwidth for both patterns. Multidimen-
sional I/O has long been a challenging workload for parallel file systems. The small, non-contiguous
read/write requests issued from individual processes can dramatically constrain parallel file system
bandwidth. Several state-of-the-art approaches have been developed to address this issue. PLFS
accelerates small, non-contiguous N-1 writes by transforming them into sequential, contiguous N-N
writes [32]. However, PLFS (without burst buffer support) still relies on a back end parallel file
system to store the individual files from the N-N writes. Contention can occur when two files
are striped on the same storage server. In contrast, BurstFS provides an independent file system
service. It addresses write contention via local writes, and is optimized for read-intensive work-
loads. Two-phase I/O [94] is another widely adopted approach to optimize small, non-contiguous
I/O workloads. All processes send their I/O requests to aggregators, which consolidate them into
large, contiguous requests. The read service of BurstFS has some similarity to two-phase I/O: its
delegators are akin to making the I/O aggregators used in two-phase I/O into a service. How-
ever, there are two key distinctions. First, the consolidations of BurstFS are directly conducted
at the file system instead of the aggregators. This avoids the extra transfer from aggregators to
client processes. Second, the consolidation is done by each delegator individually without extra
synchronization overhead.
Cross-application data sharing is a daunting topic since many contemporary programming mod-
els (e.g. MPI, PGAS) define separate name spaces for each program. A widely adopted approach
is leveraging existing distributed systems, such as distributed file systems (e.g. Lustre [35], Or-
angeFS [34], HDFS [90]) and distributed key-value stores (e.g. Memcached [80], Dynamo [46],
BigTable [39]). However, these services are usually distant from computing processes, yielding lim-
ited bandwidth. In addition, the heavy overhead from start up, tear down, and management makes
them unsuitable to be co-located with applications in batch jobs. On the other hand, several
service programs have been developed to run alongside applications in batch jobs. Docan et al. [47] devel-
oped DART, a communication framework that enables data sharing via separate service processes
located on a different set of nodes from the simulation applications (in the same job). Their later
work DataSpaces [47] extends the original design. In both studies, application processes write to
and read from the service process in an ad hoc manner. Each operation requires a separate net-
work transfer. In contrast, the delegator in BurstFS is designed as an I/O proxy process co-located
with application processes on the same node. All writes are local. Reads are deferred to the I/O
delegator, which provides many opportunities to optimize the read operations.
4.5 Summary
In this chapter, we examined the requirements of data management for node-local burst buffers,
a critical topic since node-local burst buffers are in the designs of next-generation, large-scale su-
percomputers. Our approach to managing node-local burst buffers is BurstFS, an ephemeral burst
buffer file system with the same lifetime as batch jobs and designed for high performance with HPC
I/O workloads. BurstFS can be used by multiple applications within the same job, sequentially as
with checkpoint/restart, or concurrently as with ensemble applications. We implemented several
techniques in BurstFS that greatly benefit challenging HPC I/O patterns: scalable metadata in-
dexing, co-located I/O delegation, and server-side read clustering and pipelining. These techniques
ensure scalable metadata handling and fast data transfers. Our performance results demonstrate
that BurstFS can efficiently support a variety of challenging I/O patterns. Particularly, it can
support shared file workloads across distributed, node-local burst buffers with performance very
close to that for non-shared file workloads. BurstFS also scales linearly for parallel write and read
bandwidth and outperforms the state-of-the-art by a significant margin.
CHAPTER 5
TRIO: RESHAPING BURSTY WRITES
5.1 Introduction
To address the I/O contention issues described in Chapter 2, we designed TRIO, a burst buffer
based orchestration framework to reshape the bursty writes in a contention-aware manner. Previous
efforts to mitigate I/O contention generally fall into two categories: client-side and server-side
optimizations. Client-side optimizations mostly resolve I/O contention within a single application, by
buffering the dataset in a staging area [27, 78] or optimizing the application's I/O pattern [42]. Server-
side optimizations generally embed their solutions inside the storage server, overcoming issues of
contention by dynamically coordinating data movement among servers [91, 45, 112].
In this chapter, we describe a novel burst buffer based I/O orchestration framework named TRIO
to address I/O contention. Compared with the client-side optimizations, an orchestration frame-
work on burst buffers is able to coordinate I/O traffic between different jobs, mitigating I/O con-
tention at a larger scope. Compared with the server-side optimization, an orchestration framework
on burst buffers can free storage servers from the extra responsibility of handling I/O contention,
making it portable to other PFSs. As the name suggests, TRIO orchestrates the bursty
writes among three components: computing processes, burst buffers and the PFS. This is accomplished
by two component techniques: Stacked AVL Tree based Indexing (STI) and Contention-Aware
Scheduling (CAS). STI organizes the checkpointing write requests inside each burst buffer according
to their physical layout among storage servers and assists data flush operations with enhanced
sequentiality. CAS orchestrates the flush operations of all burst buffers to mitigate I/O contention. Taken
together, our contributions are three-fold.
• We have conducted a comprehensive analysis on two issues that are associated with check-
pointing operations in HPC systems, i.e., degraded bandwidth utilization of storage servers
and prolonged average job I/O time.
• Based on our analysis, we propose TRIO to orchestrate applications’ write requests that are
buffered in BB for enhanced I/O sequentiality and alleviated I/O contention.
• We have evaluated the performance of TRIO using representative checkpointing patterns.
Our results revealed that TRIO efficiently utilized storage bandwidth and reduced average
job I/O time by 37%.
The rest of this chapter is organized as follows. Section 5.2 and Section 5.3 respectively present
the design and implementation of TRIO. Section 5.4 evaluates the benefits of TRIO. Related work
and conclusion are discussed in Section 5.5 and Section 5.6.
5.2 Design of TRIO
The two I/O contention issues discussed in Section 2.2 in Chapter 2 result from direct and
eager interactions between applications and storage servers on PFS. Many computing platforms
have introduced burst buffers as an intermediate layer to absorb bursty writes. Buffering a large
checkpoint dataset gives more visibility into the I/O traffic, which provides a chance to intercept
and reshape the pattern of I/O operations on the PFS. However, existing works generally use the burst
buffer as a middle layer to avoid applications' direct interaction with the PFS [65]; few works [103] have
discussed the interaction between BB and PFS, i.e., how to orchestrate I/O so that large datasets
can be efficiently flushed from BB to PFS. To this end, we structure TRIO to coordinate the I/O
traffic from compute nodes to burst buffer and to storage servers. In the rest of the section, we first
highlight the main idea of TRIO through a comparison with a reactive data flush approach for BB
management; then we detail two key techniques in Section 5.2.2 and Section 5.2.3.
5.2.1 Main Idea of TRIO
Fig. 5.1(a) illustrates a general framework of how BBs interact with PFS. On each Compute
Node (CN), 2 processes are checkpointing to a shared file that is striped over 2 storage servers. A1,
A2, A3, A4, B5, B6, B7 and B8 are contiguous file segments. These segments are first buffered on
the BB located on each CN during checkpointing, then flushed from BB to two storage servers on
PFS.
An intuitive strategy is to reactively flush the datasets to the PFS as they arrive at the BB. Fig.
5.1(b) shows the basic idea of such a reactive approach. This approach has two drawbacks. First,
[Figure: three panels illustrating data flush from two compute nodes, each with two processes writing contiguous file segments A1-A4 (Node-A) and B5-B8 (Node-B) through node-local burst buffers to two storage servers: (a) burst buffer framework with data flush, (b) reactive data flush, (c) proactive data flush with TRIO, using Server-Oriented Data Organization (STI) within each BB and Inter-BB Flush Ordering (CAS) across BBs.]
Figure 5.1: A conceptual example comparing TRIO with the reactive data flush approach. In (b), reactive data flush incurs unordered arrival (e.g., B7 arrives at Server 1 earlier than B5) and interleaved requests from BB-A and BB-B. In (c), Server-Oriented Data Organization increases sequentiality while Inter-BB Flush Ordering mitigates I/O contention.
directly flushing the unordered segments from each BB can degrade the chance of sequential writes.
We refer to this chance as sequentiality. In Fig. 5.1(a), segments B5 and B7 are contiguously laid out
on Storage Server 1, but they arrive at BB-B out of order in Fig. 5.1(b). Due to reactive flushing,
B7 will be flushed earlier than B5, losing the opportunity to retain sequentiality. Second, multiple
BBs can compete for access to the same storage server. As indicated by this figure, BB-A and
BB-B concurrently flush A4 and B8 to Server 2, so the two segments are interleaved; their arrival
order is against their physical layout on Server 2 (see Fig. 5.1(a)). This will degrade the bandwidth
with frequent disk seeks. In a multi-job environment, segments to a storage server come from files
of different jobs. Interleaved accesses from different applications to the shared storage servers can
prolong the average job I/O time and delay the timely service for mission-critical and small jobs.
In contrast to this reactive data flush approach, we propose a proactive data flush framework,
named TRIO, to address these two drawbacks. Fig. 5.1(c) gives an illustrative example of how it
enhances the sequentiality in flushed data stream and mitigates contention on storage server side.
Before flushing data, TRIO follows a server-oriented data organization to group together segments
to each storage server and establishes an intra-BB flushing order based on their offsets in the file.
This is realized through a server-oriented and stacked AVL Tree based Indexing (STI) technique,
which is elaborated in Section 5.2.2. In this figure, B5 and B7 in BB-B are organized together
and flushed sequentially, which enhances sequentiality on Server 1. Meanwhile, localizing BB-B’s
writes on Server 1 minimizes its interference on Server 2 during this interval. Similarly, BB-A
organizes A2 and A4 together and flushes them sequentially to Server 2, minimizing its interference
on Server 1. However, contention can arise if both BB-A and BB-B flush to the same servers. TRIO
addresses this problem using a second technique, Contention-Aware Scheduling (CAS), as discussed
in Section 5.2.3. CAS establishes an inter-BB flushing order that specifies which BB should flush
to which server each time. In this simplified example, BB-A flushes its segments to Server 1 and
Server 2 in sequence, while BB-B flushes to Server 2 and Server 1 in sequence. In this way, during
the time periods t1 and t2, each server is accessed by a different BB, avoiding contention. More
details about these two optimizations are discussed in the rest of this section.
5.2.2 Server-Oriented Data Organization
As mentioned earlier, directly flushing unordered segments to PFS can degrade I/O sequentiality
on servers. Many state-of-the-art storage systems apply tree-based indexing [84, 85] to increase
sequentiality. These storage systems leverage conventional tree structures (e.g. B-Tree) to organize
file segments based on their locations on the disk. Sequential writes can be enabled by in-order
traversal of the tree.
Although it is possible to organize all segments in a BB using a conventional tree structure (e.g.,
indexing only by offset), doing so would result in a flat metadata namespace, which cannot satisfy the
complex semantic requirements of TRIO. For instance, as mentioned in Section 5.2.1, sequentially
flushing all the file segments under a given storage server together is beneficial. To accomplish
this I/O pattern, BB needs to group all segments on the same storage server together. Meanwhile,
since these segments can come from different files (e.g. File1, File2, File3 on Server 1 in Fig. 5.2),
sequential flush requires BB to group together segments of the same file and then order these
segments based on the offset. Accomplishing the aforementioned purpose using a conventional tree
structure requires a full tree traversal to retrieve all the segments belonging to a given server and
group these segments for different files.
We introduce a technique called Stacked Adelson-Velskii and Landis (AVL) Tree based Index-
ing (STI) [103] to address these requirements. Like many other conventional tree structures, an
AVL tree [57] is a self-balancing tree that supports lookup, insertion and deletion in logarithmic
[Figure: a stacked AVL tree inside a burst buffer: a server-ID layer (Server1-Server3) stacked over a file layer (File1-File3), stacked over an offset layer (offset1-offset3); each offset node points to the segment's raw data in the data store.]
Figure 5.2: Server-Oriented Data Organization with Stacked AVL Tree. Segments of each server can be sequentially flushed following in-order traversal of the tree nodes under this server.
complexity. It can also deliver an ordered node sequence following an in-order traversal of all tree
nodes. STI differs in that all the tree nodes are organized in a stacked manner. As shown in Fig. 5.2,
this example of a stacked AVL tree enables two semantics: sequentially flushing all segments of a
given file (e.g., offset1, offset2, and offset3 of File1), and sequentially flushing all files in a given
server (e.g., File1, File2, and File3 of Server1). The semantic of server-based flushing is stacked on
top of the semantic of file-based flushing. STI is also extensible for new semantics (e.g. flushing all
segments under a given timestamp) by inserting a new layer (e.g. timestamp) in the tree.
The stacked AVL tree of each BB is dynamically built during runtime. When a file segment
arrives at BB, three types of metadata that uniquely identify this segment are extracted: server ID,
file name, and offset. BB first looks up the first layer (e.g. the layer of server ID in Fig. 5.2) to check
if the server ID already exists (it may exist if another segment belonging to the same server has
already been inserted). If not, a new tree node is created and inserted in the first layer. Similarly,
its file name and offset are inserted in the second and third layers. Once the offset is inserted as
a new tree node in the third layer (there is no identical offset under the same file because of the
append-only nature of checkpointing), this tree node is associated with a pointer (see Fig. 5.2) that
points to the raw data of this segment in the data store.
With this data structure, each BB can sequentially issue all segments belonging to a given
storage server by in-order traversal of the subtree rooted at the server node. For instance, flushing
all segments to Server 1 in Fig. 5.2 can be accomplished by traversing the subtree of the node
“Server 1”, sequentially retrieving and writing the raw data of all segments (e.g. raw data pointed
by offset1, offset2, offset3) of all the files (e.g. file1, file2, file3). Once all the data in a given server
is flushed, all the tree nodes belonging to this server are trimmed.
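The stacked server → file → offset layout and its in-order flush traversal can be sketched as follows. To keep the example short we substitute sorted linked lists for the AVL trees, so insertion is linear rather than logarithmic; the stacked layering and the sequential flush order are the same, and all type and function names are ours.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three stacked, ordered layers: server ID over file name over offset.
 * Each offset node carries a pointer to the segment's raw data. */
typedef struct Seg  { size_t offset; const char *data; struct Seg *next; } Seg;
typedef struct File { const char *name; Seg *segs; struct File *next; } File;
typedef struct Srv  { int id; File *files; struct Srv *next; } Srv;

/* Insert a segment, creating server and file nodes on first sight. */
Srv *sti_insert(Srv *root, int srv, const char *fname,
                size_t off, const char *data) {
    Srv **sp = &root;
    while (*sp && (*sp)->id < srv) sp = &(*sp)->next;
    if (!*sp || (*sp)->id != srv) {           /* new server node */
        Srv *n = calloc(1, sizeof *n);
        n->id = srv; n->next = *sp; *sp = n;
    }
    File **fp = &(*sp)->files;
    while (*fp && strcmp((*fp)->name, fname) < 0) fp = &(*fp)->next;
    if (!*fp || strcmp((*fp)->name, fname) != 0) { /* new file node */
        File *n = calloc(1, sizeof *n);
        n->name = fname; n->next = *fp; *fp = n;
    }
    Seg **gp = &(*fp)->segs;                  /* keep offsets sorted */
    while (*gp && (*gp)->offset < off) gp = &(*gp)->next;
    Seg *s = malloc(sizeof *s);
    s->offset = off; s->data = data; s->next = *gp; *gp = s;
    return root;
}

/* In-order traversal under one server: files in name order, segments
 * in ascending offset, i.e., the sequential flush order. Writes the
 * visited offsets into `out` and returns how many were visited. */
int sti_flush_order(Srv *root, int srv, size_t *out, int max) {
    int n = 0;
    for (Srv *s = root; s; s = s->next)
        if (s->id == srv)
            for (File *f = s->files; f; f = f->next)
                for (Seg *g = f->segs; g && n < max; g = g->next)
                    out[n++] = g->offset;
    return n;
}
```

In the real STI each layer is an AVL tree, so look-up and insertion stay logarithmic even with many files and segments per server, and flushed subtrees are trimmed afterwards.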
Our current design for data flush is based on a general BB use case. That is, after an application
finishes one or multiple rounds of computation, it dumps the checkpointing dataset to BB, and
begins the next round of computation. Though we use a proactive approach in reshaping the I/O
traffic inside BB, flushing checkpointing data to PFS is still driven by the demand from applications.
After flushing, the storage space on BB is reclaimed entirely. We leave it as future work to
investigate a more aggressive, automatically triggered flushing mechanism inside the burst
buffer.
5.2.3 Inter-Burst Buffer Ordered Flush
Server-oriented organization enhances sequentiality by allowing each BB to sequentially flush
all file segments belonging to one storage server each time. However, contention can arise when
multiple BBs flush to the same storage server. For instance, in Fig. 5.1(c), contention on Storage
Server 2 can happen if BB-A and BB-B concurrently flush their segments belonging to Storage
Server 2 without any coordination, leading to multiple concurrent I/O operations at Storage Server
2 within a short period. We address this problem by introducing a technique called Contention-
Aware Scheduling (CAS). CAS orders all BBs’ flush operations to minimize competitions for each
server. For instance, in Fig. 5.1(c), BB-A flushes to Server 1 and Server 2 in sequence, while BB-B
flushes to Server 2 and Server 1 in sequence. This ordering ensures that, within any given time
period, each server is accessed by only one BB. Although the flushing order could be decided statically
before all BBs start flushing, this approach requires all BBs to synchronize before flushing, and the
result is unpredictable under real-world workloads. Instead, CAS follows a dynamic approach,
which adjusts the order during the flush in a bandwidth-aware manner.
Bandwidth-Oriented Data Flushing. In general, each storage server can only support a
limited number of concurrently flushing BBs before its bandwidth is saturated. In this chapter, we
refer to this threshold as α, which can be measured via offline characterization. For instance, our
experiment reveals that each OST on Spider II [81] is saturated by the traffic from 2 compute
nodes; thus, setting α to 2 can deliver maximized bandwidth utilization on each OST. Based on
this bandwidth constraint, we propose a Bandwidth-aware Flush Ordering (BFO) to dynamically
order the flush operations of each BB so that each storage server is used by at most α BBs.
For instance, in Fig. 5.1(c), BB-A buffers segments of Server 1 and Server 2. Assuming α = 1, it
needs to select a server that has not been assigned to any BB. Since BB-B is flushing to Server 2
at time t1, BB-A picks Server 1 and flushes the corresponding segments (A1, A3) to this server.
By doing so, contention on Server 1 and Server 2 is avoided, and consequently the two servers'
bandwidth utilization is maximized.
A key question is how to get the usage information of each server. BFO maintains this infor-
mation via an arbitrator located on one of the compute nodes. When a BB wants to flush to one
of its targeted servers, it sends a flushing request to the arbitrator. This request contains several
pieces of information about this BB, such as its job ID, job priority, and utilization. The arbitrator
then selects one of the targeted servers being used by fewer than α BBs, returns its ID to the BB,
and increases the usage count of this server by 1. The BB then flushes all of its data destined
for this server, and afterwards requests to flush to its other targeted servers; the arbitrator then
decreases the usage count of the old server by 1 and assigns another qualified server to this BB.
When no qualified server is available, the arbitrator temporarily queues the BB's request.
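The arbitrator protocol above can be sketched as follows. This is an illustrative Python sketch, not the TRIO C/MPI implementation; the class and method names (`Arbitrator`, `request`, `release`) are hypothetical.

```python
# Illustrative sketch of the BFO arbitrator. It grants a BB one of its
# targeted servers only if fewer than `alpha` BBs are currently flushing
# to that server; otherwise the request is queued until a server frees up.

class Arbitrator:
    def __init__(self, alpha):
        self.alpha = alpha     # max concurrent flushing BBs per server
        self.usage = {}        # server id -> number of BBs flushing to it
        self.waiting = []      # queued (bb_id, targets) requests, FIFO

    def request(self, bb_id, targets):
        """A BB asks to flush; return a granted server id, or None (queued)."""
        for server in targets:
            if self.usage.get(server, 0) < self.alpha:
                self.usage[server] = self.usage.get(server, 0) + 1
                return server
        self.waiting.append((bb_id, targets))
        return None

    def release(self, server):
        """A BB finished flushing to `server`; hand it to a queued BB, if any."""
        self.usage[server] -= 1
        for i, (bb_id, targets) in enumerate(self.waiting):
            if server in targets:
                self.waiting.pop(i)
                self.usage[server] += 1
                return (bb_id, server)
        return None

# Fig. 5.1(c) with alpha = 1: BB-B takes Server 2 first, so BB-A is
# steered to Server 1 and the two flushes proceed without contention.
arb = Arbitrator(alpha=1)
assert arb.request("BB-B", [2, 1]) == 2
assert arb.request("BB-A", [1, 2]) == 1
```

The FIFO scan in `release` is a simplification; the actual arbitrator also weighs job priority and job size when picking a waiting BB, as described under Job-Aware Scheduling below.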
Job-Aware Scheduling. In general, compute nodes greatly outnumber storage servers, so
there may be multiple BBs being queued for flushing to the same storage server. When this storage
server becomes available, the arbitrator needs to assign it to a proper BB. A naive approach would
be to select a BB in FCFS order. However, since each BB is allocated to one job along with its
compute node, treating all BBs equally can delay the service of critical jobs and prolong the I/O
time of small jobs. Instead, the arbitrator categorizes BBs based on their job priorities and job sizes.
It prioritizes service for BBs of high-priority jobs, including those designated as important from
the start and those of higher criticality (e.g., jobs in which some BBs have reached their capacity).
Among BBs of equal priority, it selects the one belonging to the smallest job (e.g., the job with the
smallest checkpointing data size) to reduce average job I/O time.
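A minimal sketch of this job-aware pick, assuming a numeric priority (larger is more urgent) and the job's total checkpoint size as the tie-breaker; the function and field names are our own illustration, not the TRIO implementation.

```python
# Hypothetical sketch of the arbitrator's job-aware pick for a freed
# server: prefer the highest job priority, then the smallest job
# (least checkpointing data) to reduce average job I/O time.

def pick_bb(waiting):
    """waiting: BBs queued for one server, each with 'priority'
    (larger = more urgent) and 'job_size' (total checkpoint bytes)."""
    return min(waiting, key=lambda bb: (-bb["priority"], bb["job_size"]))

waiting = [
    {"id": "BB-1", "priority": 0, "job_size": 16},   # low-priority job
    {"id": "BB-2", "priority": 1, "job_size": 64},   # high priority, large
    {"id": "BB-3", "priority": 1, "job_size": 8},    # high priority, small
]
assert pick_bb(waiting)["id"] == "BB-3"
```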
5.3 Implementation and Complexity Analysis
We have built a proof-of-concept prototype of TRIO using C with Message Passing Interface
(MPI). To emulate I/O in a multi-job environment, multiple communicators can be created among
the BBs, each corresponding to a disjoint set of BBs involved in a common I/O task. The bursty
writes from all these sets are orchestrated by the arbitrator. We also leverage the STI (see
Section 5.2.2) to organize all the data structures inside the arbitrator for efficient, semantics-based
lookup and update. For instance, when a storage server becomes available, the arbitrator needs to
assign it to a waiting BB that has data on it; this BB should belong to a job with higher priority
than the jobs of the other waiting BBs. Under these semantics, the profiles of jobs, storage servers,
and BBs are stacked as three layers in the STI. Assume a system with m BB-augmented compute nodes and n
storage servers, and each job uses k compute nodes. At most, this STI contains m/k + mn/k + mn
nodes, where m/k, mn/k, and mn are respectively the numbers of tree nodes in the three layers.
This means that, for a system with 10,000 compute nodes and 1,000 storage servers, the number
of tree nodes is at most 20M, incurring less than 1 GB of storage overhead. On the other hand,
the time spent in the arbitrator is dominated by its communication with each BB. Existing
high-speed interconnects (e.g., QDR InfiniBand) generally yield a round-trip latency lower than
10 µs for small messages (smaller than 1 KB) [75]; this means a single arbitrator is able to handle
the requests from 10^5 BBs within 1 second.
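The size bound can be checked with a quick back-of-the-envelope computation; the ~50 bytes per tree node used for the memory estimate is our assumption, made only to confirm the total stays under 1 GB.

```python
# Back-of-the-envelope check of the STI size bound m/k + mn/k + mn,
# with one node per job (k = 1) as the worst case.
m, n = 10_000, 1_000   # compute nodes (BBs) and storage servers
k = 1                  # compute nodes per job

nodes = m // k + m * n // k + m * n   # job + server + BB layers
assert nodes == 20_010_000            # ~20M tree nodes
assert nodes * 50 < 2**30             # < 1 GB at ~50 bytes per node
```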
5.4 Evaluation of TRIO
Experiments on TRIO were conducted on the Titan supercomputer. Of the 32GB memory on
each compute node, we reserved 16GB as the burst buffer space. We evaluated TRIO against the
workload from the IOR benchmark. Each test case was run 15 times for every data point, and the
median result was presented.
As discussed in Section 5.2.3, CAS mitigates contention by restricting the number of concurrent
BBs flushing to each storage server to α. In all our tests, we set α to 2 (the number which saturates
the bandwidth of each OST), thus limiting the number of BBs on each OST to at most two.
[Figure 5.3 data: Bandwidth (MB/s, 0-2500) vs. Number of Nodes (4-256); series: NOTRIO-N-1, TRIO-N-1, NOTRIO-N-N, TRIO-N-N]
Figure 5.3: The overall performance of TRIO under both inter-node and intra-node writes.
5.4.1 Overall Performance of TRIO
Fig. 5.3 demonstrates the overall benefit of TRIO under competing workloads with an increasing
number of IOR processes. We compared the aggregated OST bandwidth of two configurations. In
the first configuration (shown in Fig. 5.3 as NOTRIO), all the processes directly issued their write
requests to PFS. In the second configuration (shown in Fig. 5.3 as TRIO), all processes’ write
requests were buffered on the burst buffer space and flushed to PFS using TRIO. N-1 and N-N
patterns were employed in both configurations. We ran 16 processes on each compute node and
stressed the system by increasing the number of compute nodes involved from 4 to 256. Each
process wrote in total 1GB data. Both request size and stripe size were configured as 1MB. I/O
competition of all processes was assured by striping each file over the same four OSTs (default
stripe count).
As we can see from Fig. 5.3, in both N-1 and N-N patterns, bandwidth of NOTRIO dropped
with increasing number of processes involved. This was due to the exacerbated contention from
both intra-node and inter-node I/O traffic. By contrast, TRIO demonstrated much more stable
performance by optimizing intra-node traffic using STI and inter-node I/O traffic using CAS. The
lower bandwidth observed with 4 nodes than with larger node counts in both the TRIO-N-1 and
TRIO-N-N cases was due to OST bandwidth not being fully utilized (only 4 BBs were flushing to 4 OSTs
[Figure 5.4 data: Bandwidth (MB/s, 0-2000) for N-1 and N-N; series: NAIVE, BUFFER-NOOP, BUFFER-SEQ, BUFFER-STI, BUFFER-TRIO]
Figure 5.4: Performance analysis of TRIO.
in these cases). Overall, TRIO improved I/O bandwidth by 77% on average for the N-1 pattern and
by 139% for the N-N pattern.
5.4.2 Performance Analysis of TRIO
To gain more insight into the contributions of each technique of TRIO, we incrementally ana-
lyzed its performance based on the five configurations shown in Fig. 5.4. In NAIVE, each process
directly wrote its data to PFS. In BUFFER-NOOP, all processes’ write requests were buffered in
BB and flushed without any optimization. This configuration corresponds to the reactive approach
discussed in Section 5.2.1. In BUFFER-SEQ, all the buffered write requests were reordered accord-
ing to their offsets and flushed sequentially. In BUFFER-STI, all the write requests were organized
using the STI; each time, a random OST was selected under the AVL tree and all write requests
belonging to this OST were flushed sequentially. In BUFFER-TRIO, CAS was enabled on top of the
STI, restricting the number of concurrently flushing BBs on each OST to 2; this configuration
corresponds to the proactive approach discussed in Section 5.2.1. We evaluated the five configurations
using the workload of 8-node case (128 processes write to 4 OSTs) in Fig. 5.3. In this case, TRIO
reaped its most benefits since the number of flushing BBs was twice the number of OSTs.
As we can see from Fig. 5.4, simply buffering the dataset in BUFFER-NOOP was not able
to improve the performance over NAIVE due to issues discussed in Section 5.2.1. In contrast,
the sequential flush order in BUFFER-SEQ significantly outperformed NAIVE for both N-1 and
N-N patterns. Interestingly, although STI sequentially flushed the write requests to each OST, it
demonstrated no benefit over BUFFER-SEQ. This is because, without control over the number of
flushing BBs on each OST, each OST was flushed by an unbalanced number of BBs; the benefits of
localization and sequential flushing using the STI were offset by prolonged contention on the overloaded
OSTs. This issue was alleviated when CAS was enabled in BUFFER-TRIO: by placing 2 BBs on
each OST and localizing their flush, the bandwidth of each OST was more efficiently utilized than
BUFFER-SEQ.
5.4.3 Alleviating Inter-Node Contention
To evaluate TRIO’s ability to sustain inter-node I/O contention using CAS, we placed 1 IOR
process on each node and had each process dump a 16GB dataset to its local BB; these datasets were
then flushed to the PFS using TRIO. These configurations for N-1 and N-N are referred to as TRIO-N-1
and TRIO-N-N respectively. For comparison, we had each IOR process dump its 16GB in-memory
data directly to the underlying PFS. Such configurations for the two patterns are referred to as
NOTRIO-N-1 and NOTRIO-N-N respectively. Contention for both patterns was assured by striping
all files over the same 4 OSTs.
Fig. 5.5 reveals the bandwidth of both TRIO and NOTRIO with an increasing number of
IOR processes. In the N-1 case, the bandwidth of TRIO first grew from 1.4GB/s with 4 processes
to 1.7GB/s with 8 processes, then stabilized around 1.8GB/s. The stable performance with more
than 8 processes occurred because TRIO scheduled 2 concurrent BBs on each OST. Therefore, even
under heavy contention, each OST was being used by 2 BBs that consumed most OST bandwidth.
In contrast, the bandwidth of NOTRIO peaked at 1.4GB/s with 8 processes, then dropped to
1.1GB/s with 256 processes. This accounted for only 60% of the bandwidth delivered by TRIO
with 256 processes. This bandwidth degradation resulted from the inter-node contention generated
by larger numbers of processes. Overall, by mitigating contention, TRIO delivered a 35% bandwidth
improvement over NOTRIO on average.
We also observed similar trends for TRIO and NOTRIO in the N-N case: the bandwidth of TRIO
ascended from 1.5GB/s with 4 processes to 2.1GB/s with 8 processes, then stabilized from this
point on. The bandwidth of NOTRIO kept dropping as more processes were involved. These
performance trends resulted from the same causes as discussed for the N-1 case.
[Figure 5.5 data: Bandwidth (MB/s, 0-2500) vs. Number of Nodes (4-256); series: NOTRIO_N_1, TRIO_N_1, NOTRIO_N_N, TRIO_N_N]
Figure 5.5: Flush bandwidth under I/O contention with an increasing process count.
5.4.4 Effect of TRIO under a Varying OST Count
Applications sometimes stripe their files over a large number of OSTs to utilize more resources.
Though more OSTs can deliver higher aggregate bandwidth, conventionally issuing write requests
to servers in a round-robin fashion distributes each write request over more OSTs, incurring wider
contention and preventing I/O bandwidth from scaling further. We emulated this scenario by
striping each file over an increasing number of OSTs and
using double the number of IOR processes to write on these OSTs. Contention for both N-1 and
N-N patterns was assured by striping each file over the same set of OSTs.
Fig. 5.6 compares the bandwidth of TRIO and NOTRIO under this scenario. It can be observed
that the bandwidth of NOTRIO-N-1 increased sublinearly from 0.81GB/s with a stripe count of
2 to 27GB/s with a stripe count of 128. In contrast, the bandwidth of TRIO increased at a
much faster rate, resulting in a 38.6% improvement on average over NOTRIO. A similar trend
was observed with the N-N checkpointing pattern. By localizing the writes of each BB on one OST
each time and assigning the same number of BBs to each OST, CAS minimized the interference
between different processes, thereby better utilizing the bandwidth. Sometimes localization may
not help utilize more bandwidth. For instance, when the number of available OSTs is greater than
the number of flushing BBs, localizing on one OST may underutilize the supplied bandwidth. We
[Figure 5.6 data: I/O Bandwidth (GB/s, log scale 0.5-512) vs. Stripe Count (2-128); series: NOTRIO_N_1, NOTRIO_N_N, TRIO_N_1, TRIO_N_N]
Figure 5.6: Flush bandwidth with an increasing stripe count.
believe a similar approach can also work for these scenarios. For instance, we can assign multiple
OSTs to each BB, with each BB only distributing its writes among the assigned OSTs to mitigate
interference.
5.4.5 Minimizing Average I/O Time
As mentioned in Section 5.2.3, TRIO reduces average job I/O time by prioritizing small jobs.
To evaluate this feature, we grouped 128 processes into 8 jobs, each with 16 processes, and placed
1 process on each node. We had each process dump its dataset to its local BB and coordinated
the data flush using TRIO. When multiple BBs requested the same OST, TRIO selected a BB
via the Shortest Job First (SJF) algorithm, which first served a BB belonging to the smallest job.
This configuration is shown in Fig. 5.7 as TRIO SJF. For comparison, we applied FCFS in TRIO
to select a BB. This configuration served the first BB requesting this OST first, and we refer to
it as TRIO FCFS. We also included the result of having each process directly write its dataset to
PFS, which we refer to as NOTRIO. We varied the data size such that each process in the next
job wrote a separate file whose size was twice that of the prior job. Following this approach, each
process in the smallest job wrote a 128MB file, and each process in the largest job wrote a 16GB
file. To enable resource sharing, we striped the file so that each OST was shared by all 8 jobs.
[Figure 5.7 data: Average I/O Time (sec, 0-120) vs. Workload Per OST (WL1-WL4); series: NOTRIO, TRIO-FCFS, TRIO-SJF]
Figure 5.7: Comparison of average I/O time.
We increased the ratio of processes to OSTs to observe scheduling efficiency under different
workloads.
Fig. 5.7 reveals the average job I/O time for all three cases. Workload 1 (WL1), WL2, WL3, and
WL4 refer to scenarios when the number of processes was 2, 4, 8, and 16 times the number of OSTs,
respectively. The average I/O time of TRIO SJF was the shortest for all workloads, accounting
for on average 57% and 63% of TRIO FCFS and NOTRIO, respectively. We also observed that
the I/O time of TRIO SJF increased with growing workloads at a much slower rate than the other
two. This was because, with the heavier workload, each OST absorbed more data from each job.
This gave SJF more room for optimization. Another interesting phenomenon was that TRIO FCFS
demonstrated no benefit over NOTRIO in terms of the average I/O time. This was because, using
TRIO FCFS, once a BB acquired an available OST from the arbitrator, it drained all of its data
onto this OST. Since FCFS is unaware of job sizes, the requests from a large job were likely to be
scheduled first on a given OST; a small job requesting the same OST could
only start draining its data after the large job finished. This monopolizing behavior significantly
delayed small jobs’ I/O time.
[Figure 5.8 data: CDF of Time (%, 0-1) vs. Time (sec, 0-200); series: TRIO_FCFS, TRIO_SJF, NOTRIO]
Figure 5.8: CDF of job response time.
For further analysis, we also plotted the cumulative distribution function (CDF) of job response
time under WL4, as shown in Fig. 5.8. Job response time is defined as the interval between the
arrival time of a job's first request at the arbitrator and the time when the job completes its I/O task.
By scheduling small jobs first, 7 out of 8 jobs in TRIO-SJF were able to complete their work within
80 seconds. By contrast, jobs in TRIO-FCFS and NOTRIO completed at much slower rates.
Fig. 5.9 shows the total I/O time of draining all the jobs’ datasets. There was no significant
distinction between TRIO-FCFS and TRIO-SJF because, from the OSTs' perspective, each OST
handled the same amount of data in the two cases. By contrast, the I/O time of NOTRIO was longer
than the other two due to contention. The impact of contention became more significant under
larger workloads.
5.5 Related Work
I/O Contention: In general, research around I/O contention falls into two categories: client-
side and server-side optimization. In client-side optimization, processes involved in the same job
collaboratively coordinate their access to the PFS to mitigate contention. Abbasi et al. [27] and
Nisar et al. [78] addressed contention by delegating the I/O of all processes involved in the same
[Figure 5.9 data: Total I/O Time (sec, 0-250) vs. Workload Per OST (WL1-WL4); series: NOTRIO, TRIO-FCFS, TRIO-SJF]
Figure 5.9: Comparison of total I/O time.
application to a small number of compute nodes. Chen et al. [42] and Liao et al. [79] mitigated
I/O contention by having processes shuffle data in a layout-aware manner. These mechanisms
have been widely adopted in existing I/O middleware, such as ADIOS [71, 69] and MPI-IO [94].
Server-side optimization embeds some I/O control mechanisms on the server side. Dai et al. [45]
designed an I/O scheduler that dynamically places write operations among servers to avoid con-
gested servers. Zhang et al. [45] proposed a server-side I/O orchestration mechanism to mitigate
interference between multiple processes. Liu et al. [68] researched a low level caching mechanism
that optimizes the I/O pattern on hard disk drives. Different from these works, we address I/O
contention issues using BB as an intermediate layer. Compared with client-side optimization, an
orchestration framework on BB is able to coordinate I/O traffic between different jobs, mitigating
I/O contention at a larger scope. Compared with the server-side optimization, an orchestration
framework on BB can free storage servers from the extra responsibility of handling I/O contention,
making it portable to other PFSs.
Burst Buffer: The idea of BB was proposed recently to cope with the exploding data pressure in
the upcoming exascale computing era. Several next-generation HPC systems in the Coral project,
i.e. Summit [21], Sierra [20], Aurora [2], are designed with BB support. The SCR group is currently
trying to strengthen the support for SCR by developing a multi-level checkpointing scheme on top
of BB [18]. DDN and Cray are developing IME [13] and DataWarp [33], respectively as BB layers
to absorb the bursty read/write traffic from scientific applications. Most of these works use BB as
an intermediate layer to avoid applications' direct interaction with the PFS. The focal point of TRIO
is the interaction between the BB and the PFS, namely, how to efficiently flush data to the PFS.
Inter-Job I/O Coordination: Compared with the numerous research works on intra-job I/O
coordination, inter-job coordination has received very limited attention. Liu et al. [66] designed
a tool to extract the I/O signatures of various jobs to assist the scheduler in making optimal
scheduling decisions. Dorier et al. [50] proposed a reactive approach to mitigate I/O interference
from multiple applications by dynamically interrupting and serializing applications' execution upon
a performance decrease. Our work differs in that it coordinates inter-job I/O traffic in a layout-aware
manner to both avoid bandwidth degradation and minimize average job I/O time under contention.
5.6 Summary
In this chapter, we have analyzed the major performance issues of checkpointing operations on
HPC systems: prolonged average job I/O time and degraded storage server bandwidth utilization.
Accordingly, we have designed TRIO, a burst buffer based orchestration framework, to reshape
I/O traffic from burst buffer to PFS. By increasing intra-BB write sequentiality and coordinating
inter-BB flushing order, TRIO efficiently utilized storage bandwidth and reduced average job I/O
time by 37% in typical checkpointing patterns.
CHAPTER 6
CONCLUSIONS
In summary, this dissertation investigates burst buffer management strategies that accelerate bursty
scientific I/O workloads. We have designed strategies to manage both remote shared burst buffers
and node-local burst buffers. Because of their architectural differences, our strategies contribute to
these two types of burst buffers in distinct ways.
On one hand, when remote shared burst buffers are deployed on the I/O nodes, data movement
between applications and burst buffers needs to go through the network. Based on this feature, we
researched a strategy to fully exploit the high-speed interconnect for fast data transfer. Another
major advantage of this architecture is that data flushing from the remote shared burst buffers to the
backend PFS can be conducted without interfering with the computation on the compute nodes, so
we have explored a burst buffer based checkpointing framework that efficiently hides applications'
checkpoint time by overlapping computation with data flush operations. In contrast, when
node-local burst buffers are deployed on the individual compute nodes, the network transfer
overhead can be avoided by having each process write directly to its local burst buffer. These local
writes deliver scalable write bandwidth, but they also create challenges for reads, so we have
investigated strategies that ensure high read bandwidth under these local write operations. Moreover,
unlike remote shared burst buffers, node-local burst buffers are allocated to an individual job, and
data on these burst buffers are only temporarily available within each job. Based on this characteristic,
we structure the node-local burst buffers to offer a temporary data-sharing service for coupled
applications in the same job.
Besides the aforementioned architectural differences and the distinct contributions to these two
types of burst buffers, we have designed a common data flushing strategy that is applicable to both.
This strategy reshapes the bursty writes in burst buffers before draining them to the backend PFS,
thereby avoiding contention on the PFS.
Since burst buffers are an emerging storage solution to be adopted on exascale computing
systems, we offer first-hand insights and alternative storage solutions for the system
architects tasked with building the next-generation supercomputers. More specifically, we have
made the following three contributions.
• BurstMem: Overlapping Computation and I/O In order to bridge the computation-IO
gap, we introduced the design of a novel remote shared burst buffer system to avoid applica-
tions’ direct interactions with PFS, by temporarily buffering the checkpoints in burst buffers
and gradually flushing the data to PFS. While BurstMem inherits the buffering management
framework of Memcached, it also complements the functionality of Memcached with several
salient features to accelerate checkpointing, including a log-structured data organization that
efficiently utilizes the burst buffer bandwidth and capacity; a stacked AVL tree based indexing
that quickly locates the requested data and retrieves them for data flush and crash recovery;
a coordinated data shuffling that transforms the small and noncontiguous write requests into
large and sequential ones, so that data flushes can be conducted in a stripe-aligned manner; a
CCI-based communication layer that is portable and able to leverage the native transport of
various HPC interconnects. Experiments using both synthetic benchmarks and a real-world
application demonstrated that BurstMem is able to achieve 8.5× speedup over the bandwidth
of the conventional PFS.
• BurstFS: A Distributed Burst Buffer File System We further investigated the use
of node-local burst buffers to handle various I/O workloads. We designed and prototyped
BurstFS, an ephemeral burst buffer file system that co-exists with the individual job. BurstFS
vastly accelerates the checkpointing and multi-dimensional I/O workloads with three tech-
niques. First, it enables scalable log-structured writes using scalable metadata indexing. With
this technique, each process can directly write its data to its node-local burst buffer for both
N-1 and N-N patterns, avoiding the contention issues that commonly exist on the center-wide
file systems. Second, it provides a temporary data sharing service for the coupled applications
in the same job via collocated I/O delegation. I/O delegation also opens up opportunities
for further optimizations on the server side. Finally, it optimizes the multi-dimensional I/O
workload by server-side read clustering and pipelining. This technique combines the small and
noncontiguous read requests into larger ones and pipelines the read, copy and send operations.
Through extensive tuning and analysis, we have validated that BurstFS has accomplished our
design objectives, with linear scalability in terms of aggregated I/O bandwidth for parallel
writes and reads.
• TRIO: Reshaping Bursty Writes on PFS We have introduced the design of a burst
buffer orchestration framework to reshape scientific applications’ bursty writes on the burst
buffer layer. TRIO addresses the contention with two design choices. First, before flushing,
each burst buffer groups together all the bursty writes to be flushed to the same storage server
and sequentially organizes the write requests in each group to maximize the flush sequentiality
on each storage server. Second, during flushing, burst buffers dynamically orchestrate their
flush order on the storage servers to avoid burst buffers’ competing flushes on the same
storage server, and minimize the write interference that occurs when data flushing for one
application is interleaved by the data flushing for other applications. Our experimental results
demonstrated that TRIO could efficiently utilize storage bandwidth and reduce the average
I/O time by 37% on average in the typical checkpointing scenarios.
BurstMem and BurstFS are designed to manage the remote shared burst buffers and node-local
burst buffers, respectively. On the other hand, TRIO takes burst buffers as an intermediate storage
layer between compute nodes and the backend PFS, so it is portable to both types of burst buffers.
CHAPTER 7
FUTURE WORK
This dissertation also opens up new opportunities for future burst buffer related research. In
particular, the following three directions deserve further investigation.
First, a major advantage of the node-local burst buffers is that scientific applications can reap
scalable checkpointing bandwidth by having each process write data to its own node-local burst
buffer. This benefit has been justified in Chapter 4. However, this scalable write bandwidth is
accompanied by escalating failure rates as more compute nodes are involved in a job. Studies [76, 86]
reported that a small portion of failures can be recovered by restarting from the local
checkpoints. However, the vast majority of failures require a restart from the external storage (e.g.
checkpoints on the remote shared burst buffers). The remote shared burst buffers, on the other
hand, have a lower failure rate, since the number of I/O nodes is much smaller than the number of
compute nodes. It is therefore tempting to design a fault-tolerant burst buffer system that combines
the virtues of both the node-local and the remote shared burst buffers. On one hand, we can still
perform scalable checkpointing by writing to the node-local burst buffers; on the other hand, we can
periodically flush part of the checkpoints to the remote shared burst buffers. Data on the remote
shared burst buffers can then be flushed to the PFS at an even lower frequency. This
hierarchical burst buffer management scheme has been theoretically proven efficient in [87]. We
can accomplish this purpose by combining BurstMem and BurstFS. One of the challenges that
demands further investigation is the impact of computation jitters on the compute nodes. It is
critical to quantify this impact since data need to be asynchronously flushed to the remote shared
burst buffers.
Second, our work in Chapter 5 orchestrates burst buffers’ flush order using one arbitrator. This
design choice limits its scalability. A future research topic is to distribute the responsibility of the
arbitrator to a number of arbitrators on different compute nodes. This can be accomplished by
partitioning storage servers into disjoint sets and assigning one arbitrator to orchestrate the I/O
requests in each set. We will extend the current framework to analyze the effect of a distributed
burst buffer orchestration. In addition, the existing framework is designed to handle large and
sequential checkpointing workloads, and it is not well suited to a checkpointing workload dominated
by small and noncontiguous write requests. A potential solution is to leverage some of the existing
works to reshuffle the data in an attempt to transform the small, noncontiguous write requests into
large and sequential ones [32, 63, 96].
Third, POSIX is a de facto standard for file I/O operations. Many high-level I/O
libraries, such as pNetCDF, HDF5, and MPI-IO, are built on top of POSIX, so a burst buffer file
system that transparently supports these libraries can benefit a vast number of the real-world sci-
entific applications. Our work in Chapter 4 is built on top of POSIX, so it is promising to extend
its functionality to transparently accelerate these scientific applications. One of the challenges
is to investigate what consistency is needed by the real-world applications. BurstFS enforces no
consistency control. This design choice works for scientific applications that evenly partition the
global arrays into all the participating processes. However, a comprehensive analysis is required
to characterize the consistency requirement of the applications belonging to broader categories.
Associated with this challenge is the requirement for load balancing. BurstFS is not well suited
to applications that unevenly distribute their data among the participating processes. For these
applications, we need to extend BurstFS to support remote writes, so that processes under heavier
workloads can shift a portion of their data to remote burst buffers.
BIBLIOGRAPHY
[1] Active Burst Buffer Appliance. http://www.theregister.co.uk/2012/09/21/emc_abba.
[2] Aurora Supercomputer. http://aurora.alcf.anl.gov.
[3] Blktrace. http://linux.die.net/man/8/blktrace.
[4] Catalyst Cluster. http://computation.llnl.gov/computers/catalyst.
[5] Cori Supercomputer. http://www.nersc.gov/users/computational-systems/cori.
[6] Cray. http://www.cray.com.
[7] Data intensive computing talk. http://www.exascale.org/mediawiki/images/6/64/Talk-12-Choudhary.pdf.
[8] Datadirect network. http://www.ddn.com/.
[9] Datawarp. http://www.cray.com/products/storage/datawarp.
[10] Edison Supercomputer. http://www.nersc.gov/users/computational-systems/edison/.
[11] HDF5. http://www.hdfgroup.org/HDF5/.
[12] Hyperion Cluster. https://hyperionproject.llnl.gov/index.php.
[13] Infinite Memory Engine. http://www.ddn.com/products/infinite-memory-engine-ime.
[14] Introducing Titan. http://www.olcf.ornl.gov/titan.
[15] Lawrence Livermore National Laboratory. https://asc.llnl.gov/computing_resources/purple.
[16] MPI-Tile-IO. http://www.mcs.anl.gov/research/projects.
[17] San Diego Supercomputer Center. http://www.gsic.titech.ac.jp/en/tsubame2.
[18] Scalable Checkpoint/Restart. https://computation.llnl.gov/project/scr.
[19] Sequoia Supercomputer. https://asc.llnl.gov/sequoia/rfp/02_SequoiaSOW_V06.doc.
[20] Sierra Supercomputer. https://www.llnl.gov/news/next-generation-supercomputer-coming-lab.
[21] Summit Supercomputer. https://www.olcf.ornl.gov/summit.
[22] Theta and Aurora Supercomputers. http://aurora.alcf.anl.gov.
[23] Tianhe-2. http://www.top500.org/system/177999.
[24] Top 500 Supercomputers. https://www.top500.org.
[25] Trinity. http://www.lanl.gov/projects/trinity.
[26] TSUBAME2. http://www.gsic.titech.ac.jp/en/tsubame2.
[27] Hasan Abbasi, Matthew Wolf, Greg Eisenhauer, Scott Klasky, Karsten Schwan, and Fang Zheng. DataStager: Scalable Data Staging Services for Petascale Applications. Cluster Computing, 13(3):277–290, 2010.
[28] Nawab Ali, Philip Carns, Kamil Iskra, Dries Kimpe, Samuel Lang, Robert Latham, Robert Ross, Lee Ward, and P Sadayappan. Scalable I/O Forwarding Framework for High-Performance Computing Systems. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, pages 1–10. IEEE, 2009.
[29] IEEE Standards Association et al. IEEE/ANSI Std 1003.1, 1996 Edition. Information Technology–Portable Operating System Interface (POSIX)–Part, 1.
[30] Scott Atchley, David Dillow, Galen Shipman, Patrick Geoffray, Jeffrey M Squyres, George Bosilca, and Ronald Minnich. The Common Communication Interface (CCI). In High Performance Interconnects (HOTI), 2011 IEEE 19th Annual Symposium on, pages 51–60. IEEE, 2011.
[31] John Bent, Sorin Faibish, Jim Ahrens, Gary Grider, John Patchett, Percy Tzelnic, and Jon Woodring. Jitter-free Co-Processing on a Prototype Exascale Storage Stack. In Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on, pages 1–5. IEEE, 2012.
[32] John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. PLFS: A Checkpoint Filesystem for Parallel Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 21. ACM, 2009.
[33] Wahid Bhimji, Debbie Bard, Melissa Romanus, David Paul, Andrey Ovsyannikov, Brian Friesen, Matt Bryson, Joaquin Correa, Glenn K Lockwood, Vakho Tsulaia, et al. Accelerating Science with the NERSC Burst Buffer Early User Program.
[34] Michael Moore, David Bonnie, Becky Ligon, Mike Marshall, Walt Ligon, Nicholas Mills, Elaine Quarles, Sam Sampson, Shuangyang Yang, and Boyd Wilson. OrangeFS: Advancing PVFS.
[35] Peter J Braam and R Zahir. Lustre: A Scalable, High Performance File System. Cluster File Systems, Inc, 2002.
[36] Michael J Brim, David A Dillow, Sarp Oral, Bradley W Settlemyer, and Feiyi Wang. Asynchronous Object Storage with QoS for Scientific and Commercial Big Data. In Proceedings of the 8th Parallel Data Storage Workshop, pages 7–13. ACM, 2013.
[37] SW Bruenn, A Mezzacappa, WR Hix, JM Blondin, P Marronetti, OEB Messer, CJ Dirk, and S Yoshida. 2D and 3D Core-Collapse Supernovae Simulation Results Obtained with the CHIMERA Code. In Journal of Physics: Conference Series, volume 180, page 012018. IOP Publishing, 2009.
[38] Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, and Robert Ross. Understanding and Improving Computational Science Storage Access Through Continuous Characterization. ACM Transactions on Storage (TOS), 7(3):8, 2011.
[39] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Berkeley, CA, USA, 2006. USENIX Association.
[40] Chao Chen, Yong Chen, Kun Feng, Yanlong Yin, Hassan Eslami, Rajeev Thakur, Xian-He Sun, and William D Gropp. Decoupled I/O for Data-Intensive High Performance Computing. In Parallel Processing Workshops (ICPPW), 2014 43rd International Conference on, pages 312–320. IEEE, 2014.
[41] J H Chen, A Choudhary, B de Supinski, M DeVries, E R Hawkes, S Klasky, W K Liao, K L Ma, J Mellor-Crummey, N Podhorszki, R Sankaran, S Shende, and C S Yoo. Terascale Direct Numerical Simulations of Turbulent Combustion Using S3D. Computational Science and Discovery, 2(1):015001, 2009.
[42] Yong Chen, Xian-He Sun, Rajeev Thakur, Philip C Roth, and William D Gropp. LACIO: A New Collective I/O Strategy for Parallel I/O Systems. In Parallel and Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 794–804. IEEE, 2011.
[43] Kristina Chodorow. MongoDB: the Definitive Guide. O'Reilly Media, Inc., 2013.
[44] Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey, Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao, and Daniel Chavarría-Miranda. An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 36–47. ACM, 2005.
[45] Dong Dai, Yong Chen, Dries Kimpe, and Robert Ross. Two-Choice Randomized Dynamic I/O Scheduler for Object Storage Systems. In SC, 2014.
[46] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.
[47] Ciprian Docan, Manish Parashar, and Scott Klasky. DataSpaces: An Interaction and Coordination Framework for Coupled Simulation Workflows. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 25–36, New York, NY, USA, 2010. ACM.
[48] Ciprian Docan, Manish Parashar, and Scott Klasky. DataSpaces: An Interaction and Coordination Framework for Coupled Simulation Workflows. Cluster Computing, 15(2):163–181, 2012.
[49] Jack Dongarra. Impact of Architecture and Technology for Extreme Scale on Software and Algorithm Design. In The Department of Energy Workshop on Cross-cutting Technologies for Computing at the Exascale, 2010.
[50] Matthieu Dorier, Gabriel Antoniu, Rob Ross, Dries Kimpe, and Shadi Ibrahim. CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 155–164. IEEE, 2014.
[51] Lars George. HBase: the Definitive Guide. O'Reilly Media, Inc., 2011.
[52] Hugh N Greenberg, John Bent, and Gary Grider. MDHIM: A Parallel Key/Value Framework for HPC. In HotStorage. USENIX Association, 2015.
[53] Jiahua He, Arun Jagatheesan, Sandeep Gupta, Jeffrey Bennett, and Allan Snavely. DASH: A Recipe for a Flash-Based Data Intensive Supercomputer. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.
[54] Kamil Iskra, John W Romein, Kazutomo Yoshii, and Pete Beckman. ZOID: I/O-Forwarding Infrastructure for Petascale Architectures. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 153–162. ACM, 2008.
[55] Hui Jin, Tao Ke, Yong Chen, and Xian-He Sun. Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment. In Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on, pages 276–283. IEEE, 2012.
[56] Youngjae Kim, Raghul Gunasekaran, Galen M Shipman, David Dillow, Zhe Zhang, Bradley W Settlemyer, et al. Workload Characterization of A Leadership Class Storage Cluster. In Petascale Data Storage Workshop (PDSW), 2010 5th, pages 1–5. IEEE, 2010.
[57] Donald E. Knuth. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998.
[58] Jianwei Li, Wei-keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Robert Latham, Andrew Siegel, Brad Gallagher, and Michael Zingale. Parallel netCDF: A High-Performance Scientific I/O Interface. In Supercomputing, 2003 ACM/IEEE Conference, pages 39–39. IEEE, 2003.
[59] Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Anupam Rajendran, Zhao Zhang, and Ioan Raicu. ZHT: A Light-Weight Reliable Persistent Dynamic Scalable Zero-Hop Distributed Hash Table. In Parallel and Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 775–787. IEEE, 2013.
[60] Wei-keng Liao, Avery Ching, Kenin Coloma, Arifa Nisar, Alok Choudhary, Jacqueline Chen, Ramanan Sankaran, and Scott Klasky. Using MPI File Caching to Improve Parallel Write Performance for Large-Scale Scientific Applications. In Supercomputing, 2007. SC'07. Proceedings of the 2007 ACM/IEEE Conference on, pages 1–11. IEEE, 2007.
[61] Jialin Liu and Yong Chen. Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–8. IEEE, 2013.
[62] Jialin Liu, Yong Chen, and Yu Zhuang. Hierarchical I/O Scheduling for Collective I/O. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pages 211–218. IEEE, 2013.
[63] Jialin Liu, Bradly Crysler, Yin Lu, and Yong Chen. Locality-Driven High-Level I/O Aggregation for Processing Scientific Datasets. In Big Data, 2013 IEEE International Conference on, pages 103–111. IEEE, 2013.
[64] Jialin Liu, Evan Racah, Quincey Koziol, and Richard Shane Canon. H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems. Cray User Group, 2016.
[65] Ning Liu, Jason Cope, Philip Carns, Christopher Carothers, Robert Ross, Gary Grider, Adam Crume, and Carlos Maltzahn. On the Role of Burst Buffers in Leadership-Class Storage Systems. In Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on, pages 1–11. IEEE, 2012.
[66] Yang Liu, Raghul Gunasekaran, Xiaosong Ma, and Sudharshan S. Vazhkudai. Automatic Identification of Application I/O Signatures from Noisy Server-side Traces. In Proceedings of the 12th USENIX Conference on File and Storage Technologies, FAST'14, pages 213–228, Berkeley, CA, USA, 2014. USENIX Association.
[67] Zhuo Liu, Jay Lofstead, Teng Wang, and Weikuan Yu. A Case of System-Wide Power Management for Scientific Applications. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–8. IEEE, 2013.
[68] Zhuo Liu, Bin Wang, Patrick Carpenter, Dong Li, Jeffrey S Vetter, and Weikuan Yu. PCM-Based Durable Write Cache for Fast Disk I/O. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 IEEE 20th International Symposium on, pages 451–458. IEEE, 2012.
[69] Zhuo Liu, Bin Wang, Teng Wang, Yuan Tian, Cong Xu, Yandong Wang, Weikuan Yu, Carlos A Cruz, Shujia Zhou, Tom Clune, et al. Profiling and Improving I/O Performance of a Large-Scale Climate Scientific Application. In Computer Communications and Networks (ICCCN), 2013 22nd International Conference on, pages 1–7. IEEE, 2013.
[70] Jay Lofstead, Milo Polte, Garth Gibson, Scott Klasky, Karsten Schwan, Ron Oldfield, Matthew Wolf, and Qing Liu. Six Degrees of Scientific Data: Reading Patterns for Extreme Scale Science IO. In Proceedings of the 20th International Symposium on High Performance Distributed Computing, pages 49–60. ACM, 2011.
[71] Jay Lofstead, Fang Zheng, Scott Klasky, and Karsten Schwan. Adaptable, Metadata Rich IO Methods for Portable High Performance IO. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10. IEEE, 2009.
[72] Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, and Matthew Wolf. Managing Variability in the IO Performance of Petascale Storage Systems. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE Computer Society, 2010.
[73] Ewing Lusk, S Huss, B Saphir, and M Snir. MPI: A Message-Passing Interface Standard, 2009.
[74] Huong Luu, Marianne Winslett, William Gropp, Robert Ross, Philip Carns, Kevin Harms, Prabhat, Suren Byna, and Yushu Yao. A Multiplatform Study of I/O Behavior on Petascale Supercomputers. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 33–44. ACM, 2015.
[75] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX Annual Technical Conference, pages 103–114, 2013.
[76] Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R De Supinski. Design, Modeling, and Evaluation of a Scalable Multi-Level Checkpointing System. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11. IEEE, 2010.
[77] Jose Moreira, Michael Brutman, Jose Castanos, Thomas Engelsiepen, Mark Giampapa, Tom Gooding, Roger Haskin, Todd Inglett, Derek Lieber, Pat McCarthy, et al. Designing a Highly-Scalable Operating System: The Blue Gene/L Story. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 118. ACM, 2006.
[78] Arifa Nisar, Wei-keng Liao, and Alok Choudhary. Scaling Parallel I/O Performance through I/O Delegate and Caching System. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–12. IEEE, 2008.
[79] Arifa Nisar, Wei-keng Liao, and Alok Choudhary. Delegation-Based I/O Mechanism for High Performance Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 23(2):271–279, 2012.
[80] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling Memcache at Facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 385–398, 2013.
[81] Sarp Oral, David A Dillow, Douglas Fuller, Jason Hill, Dustin Leverman, Sudharshan S Vazhkudai, Feiyi Wang, Youngjae Kim, James Rogers, James Simmons, et al. OLCF's 1 TB/s, Next-Generation Lustre File System. In Proceedings of Cray User Group Conference (CUG 2013), 2013.
[82] Ramya Prabhakar, Sudharshan S Vazhkudai, Youngjae Kim, Ali R Butt, Min Li, and Mahmut Kandemir. Provisioning a Multi-Tiered Data Staging Area for Extreme-Scale Machines. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 1–12. IEEE, 2011.
[83] Raghunath Rajachandrasekar, Adam Moody, Kathryn Mohror, and Dhabaleswar K Panda. A 1 PB/s File System to Checkpoint Three Million MPI Tasks. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, pages 143–154. ACM, 2013.
[84] Kai Ren and Garth A Gibson. TABLEFS: Enhancing Metadata Efficiency in the Local File System. In USENIX Annual Technical Conference, pages 145–156, 2013.
[85] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-Tree Filesystem. TOS, 9(3):9, 2013.
[86] Kento Sato, Naoya Maruyama, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R de Supinski, and Satoshi Matsuoka. Design and Modeling of a Non-Blocking Checkpointing System. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 19. IEEE Computer Society Press, 2012.
[87] Kento Sato, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R De Supinski, Naoya Maruyama, and Satoshi Matsuoka. A User-Level Infiniband-Based File System and Checkpoint Strategy for Burst Buffers. In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 21–30. IEEE, 2014.
[88] Hongzhang Shan and John Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. Lawrence Berkeley National Laboratory, 2007.
[89] G Shipman, D Dillow, Sarp Oral, and Feiyi Wang. The Spider Center Wide File System: From Concept to Reality. In Proceedings, Cray User Group (CUG) Conference, Atlanta, GA, 2009.
[90] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[91] Huaiming Song, Yanlong Yin, Xian-He Sun, Rajeev Thakur, and Samuel Lang. Server-Side I/O Coordination for Parallel File Systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pages 1–11. IEEE, 2011.
[92] Tim Stitt. An Introduction to the Partitioned Global Address Space (PGAS) Programming Model. Connexions, Rice University, 2009.
[93] Houjun Tang, Suren Byna, Steve Harenberg, Wenzhao Zhang, Xiaocheng Zou, Daniel F Martin, Bin Dong, Dharshi Devendran, Kesheng Wu, David Trebotich, et al. In Situ Storage Layout Optimization for AMR Spatio-temporal Read Accesses. In Parallel Processing (ICPP), 2016 45th International Conference on, pages 406–415. IEEE, 2016.
[94] Rajeev Thakur, William Gropp, and Ewing Lusk. Data Sieving and Collective I/O in ROMIO. In Frontiers of Massively Parallel Computation, 1999. Frontiers '99. The Seventh Symposium on the, pages 182–189. IEEE, 1999.
[95] Yuan Tian, Scott Klasky, Hasan Abbasi, Jay Lofstead, Ray Grout, Norbert Podhorszki, Qing Liu, Yandong Wang, and Weikuan Yu. EDO: Improving Read Performance for Scientific Applications through Elastic Data Organization. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 93–102. IEEE, 2011.
[96] Yuan Tian, Zhuo Liu, Scott Klasky, Bin Wang, Hasan Abbasi, Shujia Zhou, Norbert Podhorszki, Tom Clune, Jeremy Logan, and Weikuan Yu. A Lightweight I/O Scheme to Facilitate Spatial and Temporal Queries of Scientific Data Analytics. In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pages 1–10. IEEE, 2013.
[97] Venkatram Vishwanath, Mark Hereld, Kamil Iskra, Dries Kimpe, Vitali Morozov, Michael E Papka, Robert Ross, and Kazutomo Yoshii. Accelerating I/O Forwarding in IBM Blue Gene/P Systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–10. IEEE, 2010.
[98] Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, and Weikuan Yu. An Ephemeral Burst-Buffer File System for Scientific Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, pages 69:1–69:12, Piscataway, NJ, USA, 2016. IEEE Press.
[99] Teng Wang, Kathryn Mohror, Adam Moody, Weikuan Yu, and Kento Sato. BurstFS: A Distributed Burst Buffer File System for Scientific Applications. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC poster), 2015.
[100] Teng Wang, Adam Moody, Yue Zhu, Kento Sato, Tanzima Islam, and Weikuan Yu. MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers. In Parallel and Distributed Processing Symposium, 2017 IEEE 31st International, pages 799–808. IEEE, 2017.
[101] Teng Wang, Sarp Oral, Michael Pritchard, Kevin Vasko, and Weikuan Yu. Development of a Burst Buffer System for Data-Intensive Applications. CoRR, abs/1505.01765, 2015.
[102] Teng Wang, Sarp Oral, Michael Pritchard, Bin Wang, and Weikuan Yu. TRIO: Burst Buffer Based I/O Orchestration. In 2015 IEEE International Conference on Cluster Computing, pages 194–203. IEEE, 2015.
[103] Teng Wang, Sarp Oral, Yandong Wang, Brad Settlemyer, Scott Atchley, and Weikuan Yu. BurstMem: A High-Performance Burst Buffer System for Scientific Applications. In Big Data (Big Data), 2014 IEEE International Conference on, pages 71–79. IEEE, 2014.
[104] Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, and Weikuan Yu. BPAR: A Bundle-Based Parallel Aggregation Framework for Decoupled I/O Execution. In Data Intensive Scalable Computing Systems (DISCS), 2014 International Workshop on, pages 25–32. IEEE, 2014.
[105] Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, and Weikuan Yu. Enhance Parallel Input/Output with Cross-bundle Aggregation. Int. J. High Perform. Comput. Appl., 30(2):241–256, May 2016.
[106] Yandong Wang, Robin Goldstone, Weikuan Yu, and Teng Wang. Characterization and Optimization of Memory-Resident MapReduce on HPC Systems. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 799–808. IEEE, 2014.
[107] Yandong Wang, Yizheng Jiao, Cong Xu, Xiaobing Li, Teng Wang, Xinyu Que, Cristi Cira, Bin Wang, Zhuo Liu, Bliss Bailey, et al. Assessing the Performance Impact of High-Speed Interconnects on MapReduce. In Specifying Big Data Benchmarks, pages 148–163. Springer, 2014.
[108] Parkson Wong and R Van der Wijngaart. NAS Parallel Benchmarks I/O Version 2.4. NASA Ames Research Center, Moffett Field, CA, Tech. Rep. NAS-03-002, 2003.
[109] Bing Xie, Jeffrey Chase, David Dillow, Oleg Drokin, Scott Klasky, Sarp Oral, and Norbert Podhorszki. Characterizing Output Bottlenecks in a Supercomputer. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11. IEEE, 2012.
[110] Jiangling Yin, Jun Wang, Jian Zhou, Tyler Lukasiewicz, Dan Huang, and Junyao Zhang. Opass: Analysis and Optimization of Parallel Data Access on Distributed File Systems. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 623–632. IEEE, 2015.
[111] W. Yu, J.S. Vetter, and H.S. Oral. Performance Characterization and Optimization of Parallel I/O on the Cray XT. In 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), Miami, FL, April 2008.
[112] Xuechen Zhang, Kei Davis, and Song Jiang. IOrchestrator: Improving the Performance of Multi-Node I/O Systems via Inter-Server Coordination. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.
[113] Dongfang Zhao, Zhao Zhang, Xiaobing Zhou, Tonglin Li, Ke Wang, Dries Kimpe, Philip Carns, Robert Ross, and Ioan Raicu. FusionFS: Toward Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems. In Big Data (Big Data), 2014 IEEE International Conference on, pages 61–70. IEEE, 2014.
[114] Fang Zheng, Hasan Abbasi, Ciprian Docan, Jay Lofstead, Qing Liu, Scott Klasky, Manish Parashar, Norbert Podhorszki, Karsten Schwan, and Matthew Wolf. PreDatA – Preparatory Data Analytics on Peta-Scale Machines. In Parallel and Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12. IEEE, 2010.
BIOGRAPHICAL SKETCH
Teng Wang was born in Puyang, Henan province of China. He received the Master’s degree in
Software Engineering from Huazhong University of Science and Technology, Wuhan, China, in 2012.
He obtained the Bachelor’s degree in Computer Science from Zhengzhou University, Zhengzhou,
China, in 2009. His research interests include high performance computing, parallel I/O, storage
systems, and cloud computing.