FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
EXPLORING NOVEL BURST BUFFER MANAGEMENT ON EXTREME-SCALE HPC
SYSTEMS
By
TENG WANG
A Dissertation submitted to the Department of Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
2017
Copyright © 2017 Teng Wang. All Rights Reserved.
Teng Wang defended this dissertation on March 3, 2017. The members of the supervisory committee were:
Weikuan Yu
Professor Directing Dissertation
Gordon Erlebacher
University Representative
David Whalley
Committee Member
Andy Wang
Committee Member
Sarp Oral
Committee Member
The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.
ACKNOWLEDGMENTS
First and foremost, I would like to express my special thanks to my advisor Dr. Weikuan Yu for
his ceaseless encouragement and continuous research guidance. I came to study in the U.S. with little
research background. At the beginning, I had a hard time following the classes and understanding
the research basics. While everything seemed unfathomable to me, Dr. Yu kept encouraging me to
position myself better, and spent plenty of time talking with me about how to quickly adapt to
the course and research environment in the university. I cannot imagine a better advisor in those
moments. Meanwhile, Dr. Yu spared no effort in steering me towards the right research directions
and took every opportunity to expose me to excellent research scholars, such as my mentors during
my three summer internships. It was with Dr. Yu's generous help that I quickly built up the
background knowledge on file systems and I/O, and learned how to identify cutting-edge research
topics and conduct quality-driven research. I am fortunate to have Dr. Yu as my advisor.
In addition, I gratefully acknowledge the support and instruction I received from Dr. Sarp Oral
and Dr. Bradley Settlemyer during my summer internship at Oak Ridge National Laboratory. I
am also deeply indebted to Dr. Kathryn Mohror, Adam Moody, Dr. Kento Sato and Dr. Tanzima
Islam for their generous help during my two summer internships at Lawrence Livermore National
Laboratory. The moments I spent studying with these excellent research scholars have been
permanently engraved in my memory.
I would also like to thank my committee members Dr. Gordon Erlebacher, Dr. David Whalley
and Dr. Andy Wang for their time and comments to improve this dissertation.
I also appreciate the friendship of all the members of the Parallel Architecture and System
Research Lab (PASL). I joined PASL in 2012 and was fortunate to know most of its members
since the establishment of PASL in 2009. During my PhD study, we worked together as a family
and helped each other unconditionally. I am particularly grateful to Dr. Yandong Wang, Dr. Bin
Wang and Yue Zhu for their substantial help on the burst buffer projects, and Dr. Zhuo Liu, Dr.
Hui Chen and Kevin Vasko for their assistance on the GEOS-5 project. I will also cherish my
friendship with Dr. Cristi Cira, Dr. Jianhui Yue, Xiaobing Li, Huansong Fu, Fang Zhou, Xinning
Wang, Lizhen Shi, Hai Pham, and Hao Zou. With this friendship, I never felt lonely in my PhD
study.
My deepest gratitude and appreciation go to my parents, my parents-in-law and my wife Dr.
Mei Li. It is their everlasting love, encouragement and support that always warm my heart and
illuminate the long journey of pursuing my dreams.
The research in this dissertation was sponsored in part by the Office of Advanced Scientific
Computing Research, U.S. Department of Energy, and performed at Oak Ridge National
Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725,
using resources of the Oak Ridge Leadership Computing Facility, located in the National Center for
Computational Sciences at Oak Ridge National Laboratory. It was also performed in part under
the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under
Contract DE-AC52-07NA27344, and funded in part by an Alabama Innovation Award
and National Science Foundation awards 1059376, 1320016, 1340947, 1432892 and 1561041.
TABLE OF CONTENTS
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction 1
1.1 Overview of Scientific I/O Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Checkpoint/Restart I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Multi-Dimensional I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Overview of Burst Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Representative Burst Buffer Architectures . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Burst Buffer Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Storage Models to Manage Burst Buffers . . . . . . . . . . . . . . . . . . . . . 7
1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 BurstMem: Overlapping Computation and I/O . . . . . . . . . . . . . . . . . 10
1.3.2 BurstFS: A Distributed Burst Buffer File System . . . . . . . . . . . . . . . . 10
1.3.3 TRIO: Reshaping Bursty Writes . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.4 Publication Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Problem Statement 14
2.1 Increasing Computation-I/O Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Issues of Contention on the Parallel File Systems . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Analysis of Degraded Storage Bandwidth Utilization . . . . . . . . . . . . . . 16
2.2.2 Analysis of Prolonged Average I/O Time . . . . . . . . . . . . . . . . . . . . . 17
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 BurstMem: Overlapping Computation and I/O 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Memcached Based Buffering Framework . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Overview of Memcached . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Challenges from Scientific Applications . . . . . . . . . . . . . . . . . . . . . . 22
3.3 Design of BurstMem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Software Architecture of BurstMem . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Log-Structured Data Organization with Stacked AVL Indexing . . . . . . . . 25
3.3.3 Coordinated Shuffling for Data Flushing . . . . . . . . . . . . . . . . . . . . . 27
3.3.4 Enabling Native Communication Performance . . . . . . . . . . . . . . . . . . 29
3.4 Evaluation of BurstMem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Ingress Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 Egress Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 Scalability of BurstMem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.5 Case Study: S3D, A Real-World Scientific Application . . . . . . . . . . . . . 37
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 BurstFS: A Distributed Burst Buffer File System 41
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Design of BurstFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1 Scalable Metadata Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Co-Located I/O Delegation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Server-Side Read Clustering and Pipelining . . . . . . . . . . . . . . . . . . . 49
4.3 Evaluation of BurstFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 Testbed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.2 Overall Write/Read Performance . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 Performance Impact of Different Transfer Sizes . . . . . . . . . . . . . . . . . 55
4.3.4 Analysis of Metadata Performance . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.5 Tile-IO Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.6 BTIO Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.7 IOR Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 TRIO: Reshaping Bursty Writes 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Design of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Main Idea of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Server-Oriented Data Organization . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.3 Inter-Burst Buffer Ordered Flush . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Implementation and Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Evaluation of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Overall Performance of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.4.2 Performance Analysis of TRIO . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4.3 Alleviating Inter-Node Contention . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.4 Effect of TRIO under a Varying OST Count . . . . . . . . . . . . . . . . . . 74
5.4.5 Minimizing Average I/O Time . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6 Conclusions 80
7 Future Work 83
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Biographical Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
LIST OF FIGURES
1.1 Checkpoint/restart I/O patterns (adapted from [32]). . . . . . . . . . . . . . . . . . . 3
1.2 I/O access patterns with multi-dimensional variables. . . . . . . . . . . . . . . . . . . 4
1.3 An overview of burst buffers on an HPC system. BB refers to Burst Buffer. PE refers to Processing Element. CN refers to Compute Node. . . . . . . . . . . . . . . . . . . . 6
2.1 Issues of I/O Contention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The impact of process count on the bandwidth of a single OST. . . . . . . . . . . . . 17
2.3 Scenarios when individual writes are distributed to different numbers of OSTs. “N OST” means that each process’s writes are distributed to N OSTs. . . . . . . . . . . . . . 18
2.4 Bandwidth when individual writes are distributed to different number of OSTs. . . . . 19
3.1 Component diagram of Memcached. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Software architecture of BurstMem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Data structures for absorbing writes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Coordinated shuffling for N-1 data flushing. . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 CCI-based network communication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Ingress I/O bandwidth with a varying server count. . . . . . . . . . . . . . . . . . . . 33
3.7 Ingress I/O bandwidth with a varying client count. . . . . . . . . . . . . . . . . . . . . 34
3.8 Egress I/O bandwidth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.9 Dissection of coordinated shuffling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.10 Scalability of BurstMem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.11 I/O performance of S3D with BurstMem. . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1 BurstFS system architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Diagram of the distributed key-value store for BurstFS. . . . . . . . . . . . . . . . . . 45
4.3 Diagram of co-located I/O delegation on three compute nodes P, Q and R, each with 2 processes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Server-side read clustering and pipelining. . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5 Comparison of BurstFS with PLFS and OrangeFS under different write patterns. . . 53
4.6 Comparison of BurstFS with PLFS and OrangeFS under different read patterns. . . . 54
4.7 Comparison of BurstFS with PLFS and OrangeFS under different transfer sizes. . . . 55
4.8 Analysis of metadata performance as a result of transfer size and process count. . . . 56
4.9 Performance of Tile-IO and BTIO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.10 Read bandwidth of IOR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 A conceptual example comparing TRIO with the reactive data flush approach. In (b), reactive data flush incurs unordered arrival (e.g. B7 arrives earlier than B5 at Server 1) and interleaved requests from BB-A and BB-B. In (c), Server-Oriented Data Organization increases sequentiality while Inter-BB Flush Ordering mitigates I/O contention. . . . 65
5.2 Server-Oriented Data Organization with a Stacked AVL Tree. Segments of each server can be sequentially flushed following an in-order traversal of the tree nodes under that server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3 The overall performance of TRIO under both inter-node and intra-node writes. . . . . 71
5.4 Performance analysis of TRIO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.5 Flush bandwidth under I/O contention with an increasing process count. . . . . . . . 74
5.6 Flush bandwidth with an increasing stripe count. . . . . . . . . . . . . . . . . . . . . . 75
5.7 Comparison of average I/O time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.8 CDF of job response time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.9 Comparison of total I/O time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
ABSTRACT
The computing power of leadership-class supercomputers has been growing exponentially over
the past few decades, and is projected to reach exascale in the near future. This trend, however,
will continue to push up the peak I/O requirements for checkpoint/restart, data analysis and
visualization. As a result, the conventional Parallel File System (PFS) is no longer adequate for
handling exascale I/O workloads. On one hand, the basic storage unit of the conventional PFS
is still the hard drive, which is expensive in terms of I/O bandwidth and operations per dollar.
Provisioning enough hard drives to meet the I/O requirement at exascale is prohibitively costly.
On the other hand, the effective I/O bandwidth of the PFS is limited by I/O contention, which
occurs when multiple computing processes concurrently write to the same shared disks.
Recently, researchers and system architects have been exploring a new storage architecture with tiers
of burst buffers (e.g. DRAM, NVRAM and SSD) deployed between the compute nodes and the
backend PFS. This additional burst buffer layer offers much higher aggregate I/O bandwidth than
the PFS and is designed to absorb the massive I/O workloads on the slower PFS. Burst buffers
have been deployed on numerous contemporary supercomputers, and they are becoming an
indispensable hardware component of next-generation supercomputers.
There are two representative burst buffer architectures being explored: node-local burst buffers
(burst buffers on compute nodes) and remote shared burst buffers (burst buffers on I/O nodes).
Both types of burst buffers rely on a software management system to provide fast and scalable data
service. However, there is still a lack of in-depth study on the software solutions and their impacts.
On one hand, a number of studies on burst buffers are based on modeling and simulation, which
cannot fully capture the performance impact of various design choices. On the other hand,
existing software development efforts are generally carried out by industrial companies, whose
proprietary products are commercialized without releasing sufficient detail on their internal designs.
This dissertation explores alternative burst buffer management strategies based on research
designs and prototype implementations, with a focus on how to accelerate common scientific
I/O workloads, including bursty writes from checkpointing and bursty reads from restart, analysis
and visualization. Our design philosophy is to leverage burst buffers as a fast intermediate
storage layer to orchestrate the data movement between the applications and burst buffers, as well
as the data movement between burst buffers and the backend PFS. On one hand, the performance
benefit of burst buffers can significantly speed up the data movement between the applications and
burst buffers. On the other hand, this additional burst buffer layer offers extra capacity to buffer
and reshape the write requests, and drain them to the backend PFS in a manner catering to the
most effective utilization of PFS capabilities. Rooted in this design philosophy, this dissertation
investigates three data management strategies. The first two strategies answer how to efficiently
move data between scientific applications and the burst buffers; they are designed for the remote
shared burst buffers and the node-local burst buffers, respectively. The third strategy aims to
speed up data movement between the burst buffers and the PFS; it is applicable to both types of
burst buffers. In the first strategy, a novel burst buffer system named
BurstMem is designed and prototyped to manage the remote shared burst buffers. BurstMem
expedites scientific checkpointing by quickly buffering the checkpoints in the burst buffers after
each round of computation and asynchronously flushing the datasets to the PFS during the next
round of computation. It outperforms the state-of-the-art data management systems with efficient
data transfer, buffering and flushing. In the second strategy, we have designed and prototyped an
ephemeral burst buffer file system named BurstFS to manage the node-local burst buffers. BurstFS
delivers scalable write bandwidth by having each process write to its node-local burst buffer. It also
provides fast and temporary data sharing service for multiple coupled applications in the same job.
In the third strategy, a burst buffer orchestration framework named TRIO is devised to address I/O
contention on the PFS. TRIO buffers scientific applications' bursty write requests, and dynamically
adjusts the flush order of all the write requests so that multiple burst buffers do not compete to
flush to the same disk. Our experiments demonstrate that by addressing I/O contention, TRIO not only
improves the storage bandwidth utilization but also minimizes the average I/O service time for
each job.
Through systematic experiments and comprehensive evaluation and analysis, we have validated
that our design and management solutions for burst buffers can significantly accelerate scientific
I/O on the next-generation supercomputers.
CHAPTER 1
INTRODUCTION
The astonishing growth of the Top 500 supercomputers suggests that, around the 2023 time frame,
the computation power of “leadership-class” systems is likely to surpass 1 exaflop/s (10^18 floating-point
operations per second) [74]. Scientific applications on such large-scale computers often produce
gigantic amounts of data. For example, the CHIMERA [37] astrophysics application outputs 160 TB
of data per checkpoint; writing this data takes around an hour on the Titan [14] supercomputer
hosted at Oak Ridge National Laboratory [66]. Titan has 18,688 compute nodes and is currently
the third fastest supercomputer in the world [24]. An exascale supercomputer is expected to have
millions of nodes with a checkpoint frequency of less than one hour [59, 36]. With such rapid growth
in computing power and data intensity, I/O remains a challenging factor in determining applications'
overall execution time.
Over the past two decades, system designers have been bridging the computation-I/O gap by
expanding the backend PFS. The basic storage unit on PFS is the hard drive, which is cheap for
capacity but expensive for I/O bandwidth. Currently, the provisioned storage bandwidth1 of
leadership-scale supercomputers is a few hundred GB/s. In the exascale computing era, the
peak bandwidth demand will become two orders of magnitude higher [65, 38]. Expanding the PFS
for this level of bandwidth requirement is a cost-prohibitive solution.
To make things worse, conventional HPC systems generally have separate compute nodes and
PFS. Contention on the PFS happens when computing processes concurrently issue bursty write
requests to the shared PFS. Since the number of compute nodes is typically 10x∼100x the
number of storage nodes on the PFS, each storage node suffers from heavy contention during a
typical checkpoint workload [55]. Contention has been considered a key issue that limits the PFS's
bandwidth scalability [55, 72, 109].
Recently, the idea of burst buffers has been proposed to shoulder the exploding data pressure
on the PFS. Many consider burst buffers a promising storage solution for the exascale I/O
1The storage bandwidth is defined as the aggregate bandwidth of all the disks in a parallel file system.
workloads. Burst buffers refer to a broad category of high-performance storage devices, such as
DRAM, NVRAM and SSD. They are positioned between the compute nodes and the PFS, offering
much higher aggregate bandwidth than the PFS. This additional burst buffer layer can efficiently
absorb the heavy I/O workloads on the PFS. Moreover, buffering bursty writes inside burst buffers
gives plenty of opportunities to reshape the write requests and avoid contention on the PFS.
Numerous existing studies on burst buffers are based on modeling and simulation of burst
buffers [65, 82, 109, 87]. Though several companies have also announced their ongoing development
efforts and early products [33, 13], few design details have been made publicly available. In contrast
to these works, this dissertation explores various design choices of a burst buffer system and assesses
their benefits by prototyping burst buffers for the general scientific I/O workloads. Three burst
buffer management strategies are proposed and studied, each highlighting a distinct contribution
(see Section 1.3). In the rest of this chapter, we provide a high-level description of the characteristics
of scientific I/O workloads, the representative burst buffer architectures and their use cases that
underpin this dissertation, and the storage models adopted for our burst buffer management. We
then discuss our research contributions.
1.1 Overview of Scientific I/O Workloads
In this section, we provide an overview of the representative I/O workloads on HPC systems,
which are also the target workloads of this dissertation. These workloads include checkpoint/restart
and multi-dimensional I/O. The two categories are not disjoint: checkpoints often contain
multi-dimensional arrays, and these arrays can be retrieved later for crash recovery, data analysis
and visualization. We classify multi-dimensional I/O in a separate category to emphasize the
diversity of read patterns on a multi-dimensional dataset under the analysis/visualization workloads.
1.1.1 Checkpoint/Restart I/O
Checkpoint/restart is a common fault tolerance mechanism used by HPC applications. During
a run, application processes periodically save their in-memory state in files called checkpoints,
typically written to a PFS. Upon a failure, the most recent checkpoint can then be read to restart
the job. Checkpointing operations are usually concurrent across all processes in an application,
and occur at program synchronization points when no messages are in flight.
[Figure panels: N-1 segmented I/O with a shared file; N-1 strided I/O with a shared file; N-N I/O with individual files.]
Figure 1.1: Checkpoint/restart I/O patterns (adapted from [32]).
There are two dominant I/O patterns for checkpoint/restart, N-1 and N-N, as shown in
Fig. 1.1. In N-N I/O, each process writes/reads data to/from a unique file. In N-1 I/O, all processes
write to, or read from, a single shared file. N-1 I/O can be further classified into two patterns:
N-1 segmented and N-1 strided. In N-1 segmented I/O, each process accesses a non-overlapping,
contiguous file region. In N-1 strided I/O, processes interleave their I/O amongst each other.
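The offset arithmetic behind the two N-1 layouts can be sketched as follows. This is an illustrative Python sketch with hypothetical parameters (`chunk`, `writes_per_proc`), not code from any checkpointing library: it computes where process `rank`'s k-th write of `chunk` bytes lands in the shared file.

```python
def n1_segmented_offset(rank, k, chunk, writes_per_proc):
    # N-1 segmented: each process owns one contiguous region of the shared
    # file; its k-th write lands just after its previous writes.
    return (rank * writes_per_proc + k) * chunk

def n1_strided_offset(rank, k, chunk, nprocs):
    # N-1 strided: each round of writes from all processes interleaves,
    # so consecutive writes of one process are nprocs*chunk bytes apart.
    return (k * nprocs + rank) * chunk

# Example: 3 processes, 2 writes each, 4 MB chunks.
chunk, nprocs, wpp = 4 << 20, 3, 2
segmented = [n1_segmented_offset(r, 0, chunk, wpp) for r in range(nprocs)]
strided = [n1_strided_offset(r, 0, chunk, nprocs) for r in range(nprocs)]
```

In the segmented layout each process's writes stay contiguous, while in the strided layout neighboring processes' chunks interleave throughout the file, which is why the two patterns stress a PFS quite differently.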
On current HPC systems, checkpointing can account for 75%-80% of the total I/O traffic [83].
While there is ongoing debate on how checkpointing operations will change on exascale systems
compared to today’s systems, there is general consensus that the data size per checkpoint will
increase due to larger job scales and the interval between checkpoints will decrease due to increased
overall failure rates [36, 76]. The larger file sizes and shorter intervals for checkpointing will require
orders of magnitude faster storage bandwidth [65].
1.1.2 Multi-Dimensional I/O
Figure 1.2: I/O access patterns with multi-dimensional variables.
Another common I/O workload on HPC systems is data access to multi-dimensional data vari-
ables in scientific applications. While multi-dimensional variables are written in one particular
order, they are often read for analysis or visualization in a different order than the write or-
der [95, 70, 61].
Fig. 1.2(a) shows a sample read pattern on a two-dimensional variable. The variable is initially
decomposed into four blocks assigned to four processes, which write them to a shared file
on the backend PFS. When the variable is read back for analysis, one process may require only
one or more columns from it. However, such columns are stored non-contiguously
across the data blocks. Therefore, this process needs to issue numerous small read requests to four
different data blocks in order to retrieve its data for analysis. Fig. 1.2(b) illustrates a similar but
more complex scenario with a three-dimensional variable. The 3-D variable is initially stored as
eight different blocks across burst buffers. A process may need only a subvolume in the middle
of the variable for analysis. This subvolume has to be gathered from eight different blocks,
resulting in many small read operations. These small read operations are
not favored by the PFS [94, 93, 95].
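To make the write-order/read-order mismatch concrete, the following Python sketch (with hypothetical dimensions and element size) computes the byte offsets a column read must touch when a 2-D variable is stored in row-major order: a single column read turns into one small request per row, whereas a row read is one contiguous request.

```python
ROWS, COLS, ITEMSIZE = 4, 4, 8  # hypothetical 2-D variable, row-major on disk

def column_read_offsets(col):
    # Element (r, col) lives at byte (r*COLS + col)*ITEMSIZE, so reading one
    # column requires ROWS small reads, each only ITEMSIZE bytes long and
    # COLS*ITEMSIZE bytes apart from the next.
    return [(r * COLS + col) * ITEMSIZE for r in range(ROWS)]

def row_read_extent(row):
    # By contrast, a whole row is a single contiguous request:
    # (starting byte offset, length in bytes).
    return (row * COLS * ITEMSIZE, COLS * ITEMSIZE)

col_requests = column_read_offsets(2)   # ROWS non-contiguous small reads
row_request = row_read_extent(1)        # one contiguous read
```

The same arithmetic extends to the 3-D subvolume case, where the number of non-contiguous fragments grows with every dimension that is not stored innermost.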
1.2 Overview of Burst Buffers
In this section, we first outline the representative burst buffer architectures, and their use cases
our dissertation is based on. We then discuss the alternative storage models adopted in our burst
buffer management.
1.2.1 Representative Burst Buffer Architectures
Two representative burst buffer architectures are shown in Figure 1.3. In the node-local burst
buffer architecture, burst buffers are located on the individual compute nodes. The benefit of
this architecture is that the aggregate burst buffer bandwidth scales linearly with the number
of compute nodes. Scientific applications can acquire scalable write bandwidth by having each
process write its data to its local burst buffer. However, flushing buffered data to the backend PFS
requires extra computing power on the compute nodes, which can incur computation jitters [31].
In the remote shared burst buffer architecture, burst buffers are placed on the dedicated I/O nodes
deployed between the compute nodes and the backend PFS. Data movement from the compute
nodes to the burst buffers needs to go through the network. This architecture achieves excellent
resource isolation since flushing data to the backend PFS is done without interfering with the
computation on the compute nodes. However, in contrast to the node-local burst buffers, its I/O
bandwidth2 is dependent on several factors, including the network bandwidth, the aggregate burst
buffer bandwidth and the number of I/O nodes.
Both types of burst buffers have been widely deployed on existing high-end computing systems.
They are also projected to ship with a majority of the next-generation supercomputers.
For instance, node-local burst buffers have been equipped on DASH [53] at the San Diego
Supercomputer Center [17], Catalyst [4] and Hyperion [12] at the Lawrence Livermore National
Laboratory (LLNL) [15], TSUBAME supercomputer series [26] at the Tokyo Institute of Technol-
ogy, and Theta [22] at the Argonne National Laboratory (ANL). They will also come with several
next-generation supercomputers in a few years, such as Summit [21] at the Oak Ridge National
Laboratory and Sierra [20] at the Lawrence Livermore National Laboratory. On the other hand,
remote shared burst buffers have been set up on Tianhe-2 [23] at the Chinese National Supercom-
puter Center, Trinity [25] at the Los Alamos National Laboratory, and Cori [5] at the Lawrence
2I/O bandwidth is typically measured as the quotient of the total data size read/written by an application and its total I/O time.
Figure 1.3: An overview of burst buffers on an HPC system. BB refers to Burst Buffer. PE refers to Processing Element. CN refers to Compute Node.
Berkeley National Laboratory. As one of the next-generation supercomputers, Aurora [2] at the
Argonne National Laboratory features a heterogeneous burst buffer architecture including both
node-local burst buffers and remote shared burst buffers.
1.2.2 Burst Buffer Use Cases
Burst buffers are designed to accelerate the bursty write and read operations exhibited by most
scientific applications. Although burst buffers also have the potential to support cloud-based
applications that issue intermittent read/write requests, those use cases are beyond the scope
of this dissertation. In general, burst buffers' support for scientific applications can be
summarized into the following use cases.
• Checkpoint/Restart: Applications periodically write “checkpoints” that include snapshots
of data structures and state information. Upon a failure, they can load the checkpoints
and roll back to a previous state. Applications checkpointing to/restarting from burst buffers
can reap much higher aggregate bandwidth than from the PFS.
• Overlapping Computation and Checkpointing to PFS: Scientific applications' life
cycles generally alternate between a phase of computation and a phase of checkpointing.
Applications can hide the checkpointing latency by temporarily buffering the data in burst buffers
after a phase of computation, and letting burst buffers asynchronously flush the checkpoints to
the PFS during the next phase of computation.
• Reshaping Bursty Writes to PFS: Burst buffers stand as a middle layer that absorbs
bursty writes to the PFS. With all the buffered write requests, burst buffers have global
knowledge of how scientific data are laid out on the PFS. Based on this knowledge, they can
reshape the write traffic to avoid contention on the PFS.
• Staging/Sharing Intermediate data: Intermediate data such as checkpoints or plot files
are staged for two purposes: out-of-core computation and data sharing. In the former case,
applications with insufficient memory can swap out a portion of their in-memory data to burst
buffers, and later read them back for computation. In the latter case, a job usually consists
of multiple scientific workflows sharing data with each other. For instance, the output of a
simulation program can be used as the input for an analysis program to perform post-analysis
(after simulation) and in-situ/in-transit analysis (during simulation). In both cases, staging
the intermediate data to burst buffers and loading the data from burst buffers are much faster
than PFS.
• Prefetching Data for Fast Analysis: Scientific applications with the foreknowledge of
future reads can hide the read latency on PFS by prefetching data to burst buffers.
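The second use case above — overlapping computation with checkpointing to the PFS — can be sketched as a simple producer/consumer pattern. The queue, background thread, and sleep below are illustrative stand-ins for a burst buffer and a slow PFS write; they are not part of any real burst buffer API.

```python
import queue
import threading
import time

flushed = []
flush_q = queue.Queue()

def flusher():
    # Background thread: drains buffered checkpoints to the (simulated) PFS.
    while True:
        ckpt = flush_q.get()
        if ckpt is None:
            break
        time.sleep(0.01)          # stand-in for a slow PFS write
        flushed.append(ckpt)
        flush_q.task_done()

threading.Thread(target=flusher, daemon=True).start()

for step in range(3):
    # ... compute phase ...
    flush_q.put(f"ckpt_{step}")   # fast dump to the burst buffer; returns immediately
    # the next compute phase overlaps with the background flush

flush_q.join()                    # eventually every checkpoint reaches the PFS
print(flushed)                    # ['ckpt_0', 'ckpt_1', 'ckpt_2']
```

Because `put` returns as soon as the data lands in the buffer, the application pays only the burst buffer's write latency; the PFS cost is hidden behind the next compute phase.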
Due to the architectural differences, node-local burst buffers and remote shared burst buffers have
distinct preferences on these use cases. Since node-local burst buffers reside on the compute nodes,
and each compute node is generally dedicated to an individual job, data on node-local burst buffers
are temporarily available within the individual job’s life cycle. Under this constraint, we can
leverage node-local burst buffers to perform scalable checkpointing and enable temporary data
sharing among coupled workflow applications in the same job. On the other hand, data on the remote shared burst buffers are independent of the life cycle of any individual job, so we can harness remote shared burst buffers to provide center-wide data services to all the jobs on the compute nodes. Moreover, remote shared burst buffers are jitter-free for data flushing. They are therefore well suited to overlapping computation with checkpointing to the PFS.
1.2.3 Storage Models to Manage Burst Buffers
A key consideration before structuring a burst buffer system is what type of storage model can be used to manage burst buffers. In particular, two representative storage models are widely adopted in cloud and HPC environments: distributed file systems and databases. Each offers an alternative approach to managing burst buffers. Although there are data management solutions that fall outside these two categories (e.g. DataSpaces [48], PGAS [92]), their architectures can be derived and synthesized from the two storage models. We therefore focus our discussion on distributed file systems and databases; our burst buffer-based solutions are rooted in these two storage models.
Distributed File System vs. Distributed Database. In a file system, data are stored in the form of files. The format of each file is defined by the file system clients (i.e. applications and I/O libraries). For instance, applications using POSIX I/O [29] store each file as a stream of binary bytes; data are accessed based on their sizes and byte addresses in the file. To obtain richer data access semantics, application developers need to implement their own functions beyond POSIX I/O and define their own data format atop the binary stream. High-level I/O libraries (e.g. HDF5 [11], NetCDF [58]) format each file as a container of multi-dimensional arrays; data are accessed based on their dimensionality, positions, and sizes along each dimension. In either case, these file formats are opaque to the file system.
In contrast, a database is generally used to store related, structured data (e.g. tables, indices) with well-defined data formats. Unlike in a file system, the physical formats of these structured data are defined by the database and are opaque to the clients. Clients access databases using database-specific semantics, such as SQL for relational databases and Put/Get APIs for key-value stores. With these interfaces, users can fulfill a wide variety of complex application logic without implementing it themselves; compared with a file system, the complicated implementation of such logic is offloaded to the database. In addition, a database allows indexing on any attribute or data property (i.e. SQL columns for relational databases and keys for key-value stores), and these indexed attributes facilitate fast queries.
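The contrast can be made concrete with a few lines of illustrative Python; the file name, byte offset, and keys below are invented for the example and stand in for application-chosen values.

```python
import os
import tempfile

# File-system model: the application imposes its own format on a byte stream;
# the file system sees only byte addresses and sizes.
path = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
with open(path, "wb") as f:
    f.seek(4096)                     # application-chosen byte address
    f.write(b"\x01\x02\x03\x04")
with open(path, "rb") as f:
    f.seek(4096)                     # the reader must already know where...
    data = f.read(4)                 # ...and how many bytes to read

# Database model: the store owns the physical layout; clients use semantic keys.
kv = {}
kv[("pressure", 7)] = b"\x01\x02\x03\x04"   # Put(key, value)
blob = kv[("pressure", 7)]                  # Get(key)

assert data == blob
```

In the first model the meaning of offset 4096 lives entirely in the application; in the second, the key ("pressure", 7) carries that meaning and the store decides where the bytes physically reside.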
The choice between distributed file systems and databases largely depends on the burst buffer semantics to be exposed to applications. A file system-based burst buffer service is advantageous in two aspects. First, it can transparently support the large number of file system-based scientific applications that use POSIX and POSIX-based I/O libraries, such as HDF5 and NetCDF. Second, it gives application developers more flexibility to define their own data formats and implement their application logic. On the other hand, a database-based burst buffer service allows application developers to implement complex application logic more easily by directly using the rich client-side interfaces. It also speeds up data processing with customizable data indexing. However, to support existing file system-based applications, developers need to replace the POSIX/NetCDF/HDF5-based functions with the database client’s native functions, which demands non-trivial effort.
Besides the choice between databases and file systems, another key consideration is what type of file system or database service can be harnessed to manage burst buffers. According to their data layout strategies, existing file systems can be classified into locality-based distributed file systems (e.g. HDFS [90]) and striping-based distributed file systems (e.g. Lustre [35]). There are also two types of databases: relational and non-relational; the non-relational databases can be further classified into key-value stores and graph stores. Among the database options, key-value stores are particularly applicable to managing burst buffers.
Locality-Based Distributed File Systems vs. Striping-Based Distributed File Systems. In locality-based file systems, each process prioritizes writing data to its node-local storage. This design choice avoids data movement across the network during bursty writes, and thus delivers scalable write bandwidth. Because of this benefit, we design a locality-based distributed file system to manage the node-local burst buffers in Chapter 4. A key challenge, however, is reading the data back: because file data are written by each process locally, each process needs to look up the data sources for its read requests in order to read remote data. A scalable metadata service is needed to serve the large number of lookup requests during bursty reads.
In the striping-based distributed file systems, data written by each process are striped across
multiple data nodes. The locations of data sources are calculated based on a pre-determined hash
function. Therefore, during read, each process can directly compute the locations of the requested
data and retrieve the data from the data sources. Compared with the locality-based distributed
file system, this type of file system delivers balanced read/write bandwidth. However, its write
bandwidth is limited by contention when multiple processes concurrently write to the same data
node.
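A minimal sketch of such a layout — assuming fixed-size stripes placed round-robin across data servers, in the style of Lustre's default layout — shows why no metadata lookup is needed on reads:

```python
def stripe_location(offset, stripe_size, num_servers):
    """Index of the data server holding the byte at `offset` of a striped file."""
    stripe_index = offset // stripe_size   # which stripe the byte falls in
    return stripe_index % num_servers      # stripes cycle round-robin over servers

# With 1MB stripes over 4 servers, the byte at offset 5MB lies in stripe 5,
# which round-robins onto server 1.
print(stripe_location(5 * (1 << 20), 1 << 20, 4))   # 1
```

Every client evaluates the same arithmetic, so the location of any byte is known globally without consulting a metadata service — the property that gives striping-based file systems their balanced read/write bandwidth.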
Relational Databases vs. Key-Value Stores. A fundamental difference between relational databases and key-value stores is the data model each exposes to users. Relational databases represent data as one or more tables of rows and columns, with a unique key for each row. These tables are queried and maintained through SQL, which defines a rich set of semantics that allow users to easily fulfill complex application logic (e.g. insert/delete/update/query/join/transaction). However, the complex data model also makes the implementation of relational databases highly complicated. For example, a relatively simple SELECT statement can have dozens of potential query execution paths, which a query optimizer must evaluate at run time. In contrast, key-value stores represent data as key-value pairs and provide much simpler interfaces (e.g. Put/Get for storing/retrieving data). The simplicity of their data services affords key-value stores higher speed and better scalability. In Chapter 3, we leverage a distributed key-value store to manage the remote shared burst buffers.
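The interface difference can be seen side by side in a toy example; the table schema and keys are invented, Python's bundled sqlite3 stands in for a relational database, and a plain dict stands in for a key-value store.

```python
import sqlite3

# Relational: the database owns the schema, parses SQL, and plans the query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ckpt (app TEXT, step INTEGER, size INTEGER)")
db.execute("INSERT INTO ckpt VALUES ('sim', 7, 4096)")
row = db.execute(
    "SELECT size FROM ckpt WHERE app = 'sim' AND step = 7").fetchone()

# Key-value: the client composes the key; the store only offers Put/Get.
store = {}
store[("sim", 7)] = 4096          # Put
value = store[("sim", 7)]         # Get

assert row[0] == value == 4096
```

The relational path buys expressive queries at the cost of parsing, planning, and schema maintenance; the key-value path does one hash lookup, which is why it scales so readily.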
1.3 Research Contributions
In this dissertation, we have thoroughly investigated the I/O challenges on HPC systems, and
researched three approaches for burst buffer management. In particular, this dissertation has made
the following three contributions.
1.3.1 BurstMem: Overlapping Computation and I/O
We have designed BurstMem, a burst buffer system to manage the remote shared burst buffers.
It accelerates checkpointing by temporarily buffering applications’ scientific datasets in burst buffers
and asynchronously flushing the data to PFS. BurstMem is built on top of Memcached, a distributed
key-value store widely adopted in the cloud environment. While inheriting the buffering management framework of Memcached, we have identified its major issues in handling the bursty checkpoint workload: low burst buffer capacity and bandwidth utilization, inability to leverage the native transport of various high-speed interconnects on HPC systems, and lack of data indexing and shuffling support for fast data flush. Based on our analysis, we structure BurstMem
to accelerate checkpointing with indexed and space-efficient buffer management, fast and portable
data transfer, and coordinated data flush. Our evaluations demonstrated that BurstMem is able
to achieve 8.5× speedup over the bandwidth of conventional PFS.
1.3.2 BurstFS: A Distributed Burst Buffer File System
We have analyzed the benefits and challenges of using node-local burst buffers to accelerate sci-
entific I/O, and designed an ephemeral Burst Buffer File System (BurstFS) that supports scalable
and efficient aggregation of I/O bandwidth from burst buffers while having the same life cycle as a
batch-submitted job. BurstFS features several techniques including scalable metadata indexing for
fast and scalable write with amortized metadata cost, co-located I/O delegation for scalable read
and data sharing among coupled applications in the same job, and server-side read clustering and
pipelining to optimize small and noncontiguous read operations widely used in a multi-dimensional
I/O workload. Through extensive tuning and analysis, we have validated that BurstFS has ac-
complished our design objectives, with linear scalability in terms of aggregated I/O bandwidth for
parallel writes and reads.
1.3.3 TRIO: Reshaping Bursty Writes
We have devised TRIO, a burst buffer orchestration framework to efficiently drain data from
burst buffers to the backend PFS. The strategy is applicable to both remote shared burst buffers
and node-local burst buffers. TRIO buffers the applications’ checkpointing write requests in burst
buffers and reshapes the bursty writes to maximize the number of sequential writes to PFS. Mean-
while, TRIO coordinates the flushing orders among concurrent burst buffers to address two levels of
contention on PFS, namely, competing writes from multiple processes on the same storage server,
and the interleaving writes from multiple concurrent applications on the same storage server. Our
experimental results demonstrated that TRIO could efficiently utilize storage bandwidth and reduce
the average I/O time by 37% in typical checkpointing scenarios.
1.3.4 Publication Contributions
During my PhD study, I have contributed to the following publications (listed in chronological order).
1. Yandong Wang, Yizheng Jiao, Cong Xu, Xiaobing Li, Teng Wang, Xinyu Que, Christian
Cira, Bin Wang, Zhuo Liu, Bliss Bailey, Weikuan Yu. Assessing the Performance Impact of
High-Speed Interconnects on MapReduce Programs. Third Workshop on Big Data Bench-
marking, 2013 [107].
2. Zhuo Liu, Jay Lofstead, Teng Wang, Weikuan Yu. A Case of System-Wide Power Man-
agement for Scientific Applications. IEEE International Conference on Cluster Computing,
2013 [67].
3. Zhuo Liu, Bin Wang, Teng Wang, Yuan Tian, Cong Xu, Yandong Wang, Weikuan Yu, Carlos
A Cruz, Shujia Zhou, Tom Clune, Scott Klasky. Profiling and Improving I/O Performance
of a Large-Scale Climate Scientific Application, 2013 [69].
4. Yandong Wang, Robin Goldstone, Weikuan Yu, Teng Wang. Characterization and Opti-
mization of Memory-Resident MapReduce on HPC Systems. IEEE 28th International Parallel
and Distributed Processing Symposium, 2014 [106].
5. Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, Weikuan Yu. BPAR: A Bundle-Based
Parallel Aggregation Framework for Decoupled I/O Execution. International Workshop on
Data Intensive Scalable Computing Systems, in Conjunction with SC 2014 [104].
6. Teng Wang, Sarp Oral, Yandong Wang, Bradley Settlemyer, Scott Atchley, Weikuan Yu.
BurstMem: A High-Performance Burst Buffer System for Scientific Applications. IEEE In-
ternational Conference on Big Data, 2014 [103].
7. Teng Wang, Sarp Oral, Michael Pritchard, Kento Vasko, Weikuan Yu. Development of a
Burst Buffer System for Data-Intensive Applications. International Workshop on The Lustre
Ecosystem: Challenges and Opportunities, 2015 [101].
8. Teng Wang, Kathryn Mohror, Adam Moody, Weikuan Yu, Kento Sato. Poster Presented
at The International Conference for High Performance Computing, Networking, Storage and
Analysis, 2015 [99].
9. Teng Wang, Sarp Oral, Michael Pritchard, Bin Wang, Weikuan Yu. TRIO: Burst Buffer
Based I/O Orchestration. IEEE International Conference on Cluster Computing, 2015 [102].
10. Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, Weikuan Yu. An Ephemeral Burst
Buffer File System for Scientific Applications. IEEE/ACM the International Conference for
High Performance Computing, Networking, Storage and Analysis 2016 [98].
11. Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, Weikuan Yu. Enhance Scientific Application
I/O with Cross-Bundle Aggregation. International Journal of High Performance Computing
Applications [105], 2016.
12. Teng Wang, Adam Moody, Yue Zhu, Kathryn Mohror, Kento Sato, Tanzima Islam, Weikuan
Yu. MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers.
IEEE 31st International Parallel and Distributed Processing Symposium [100], 2017.
1.4 Dissertation Overview
In the rest of this dissertation, we first elaborate on the problems that prevent scientific applications from achieving satisfactory I/O performance. We then detail three burst buffer management strategies that address these issues. Each chapter presents one approach, along with a comprehensive evaluation and comparison between our solution and state-of-the-art techniques.
In Chapter 2, we thoroughly investigate the key issues that constrain scientific I/O performance, namely the increasing computation-I/O gap and I/O contention on the PFS. The insights from this chapter motivate our innovations.
In Chapter 3, we specify the design of BurstMem, a management framework built on remote shared burst buffers. BurstMem complements Memcached for checkpointing with enhanced data services for fast data transfer, data buffering, and flushing. Our performance evaluation demonstrates that our solutions can efficiently accelerate scientific checkpointing.
In Chapter 4, we explore the benefits and key considerations of architecting a distributed burst
buffer file system to manage node-local burst buffers. Based on our exploration, we design BurstFS
to speed up checkpointing/restart, analysis, and visualization. BurstFS supports scalable writes
and fast data sharing under various concurrent I/O workloads. Our comparison with the state-of-
the-art solutions demonstrates that BurstFS carries great potential in handling the exascale I/O
workloads.
In Chapter 5, we introduce TRIO, a burst buffer orchestration framework that buffers and
reorganizes checkpointing write requests for sequential and contention-aware data flush to the
PFS. This strategy can be used on both remote shared burst buffers and node-local burst buffers.
Our experiments with different types of concurrent workloads demonstrate that by addressing the
I/O contention on PFS, TRIO not only improves the bandwidth of a single application, but also
minimizes the average I/O time of the individual application under a multi-application workload.
Finally, we conclude this dissertation and discuss opportunities for future work in Chapter 7.
CHAPTER 2
PROBLEM STATEMENT
This chapter discusses the detailed I/O challenges on HPC systems. First, we highlight the problem
of the increasing computation-I/O gap. We then present the issues incurred by I/O contention on
the PFS, and experimentally analyze their impact.
2.1 Increasing Computation-I/O Gap
Studies have demonstrated that most scientific applications on HPC systems are data-intensive
applications [7]. These applications’ life cycle alternates between computation phases and I/O
phases [67], with a gigantic volume of data stored during each I/O phase. As an example, the
climate scientific application Arctic System Reanalysis (ASR) outputs 23.14TB of data to study one year of climate changes. The volume of output data is projected to reach hundreds of petabytes for a single simulation run on exascale computing systems [40]. This demands a commensurate I/O bandwidth improvement on the backend PFS to quickly absorb the output. However, the growth of storage bandwidth on the PFS consistently lags behind that of computing power, resulting in an ever-increasing computation-I/O gap. For instance, an ongoing upgrade from the Sequoia [19] to the Sierra [20]
supercomputer (expected to come around 2018) at the Lawrence Livermore National Laboratory
will enhance the computing power by a factor of 7. However, a corresponding upgrade on the
backend PFS is expected to improve the bandwidth by a maximum of 2 times. Similarly, the
recent upgrade from Edison [10] to Cori [5] supercomputer (delivered in 2016) at the Lawrence
Berkeley National Laboratory enhanced the computing power by a factor of 12, with a storage
bandwidth improvement of only 4 times. In the era of exascale computing, the computation-I/O
gap is estimated to become 10× wider [49].
Accompanying the widening computation-I/O gap is the booming bandwidth requirement for checkpointing. An exascale system is expected to raise the periodic checkpointing data size by two orders of magnitude, and to reduce the checkpointing interval from 4-8 hours to tens of minutes [36].
This requires two orders of magnitude higher storage bandwidth (e.g. 60TB/s as predicted by [49])
for fast checkpointing and crash recovery.
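A back-of-envelope calculation shows where estimates of this order come from. The checkpoint size, interval, and I/O budget below are assumed illustrative exascale parameters chosen for the example, not figures taken from [49].

```python
# Assumed parameters for an exascale checkpoint (illustrative only).
ckpt_size_tb = 10 * 1024   # dump ~10 PB of aggregate memory per checkpoint
interval_s = 30 * 60       # checkpoint every 30 minutes
io_budget = 0.10           # spend at most 10% of each interval on I/O

required_bw = ckpt_size_tb / (interval_s * io_budget)
print(round(required_bw, 1))   # ~56.9 TB/s, the same order as the 60TB/s prediction
```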
2.2 Issues of Contention on the Parallel File Systems
Figure 2.1: Issues of I/O Contention. (a) Degraded storage bandwidth. (b) Prolonged average I/O time.

Another critical issue faced by the PFS is I/O contention. Under a typical checkpointing
workload, the storage servers on PFS will receive a large burst of write requests from 10× to 100×
more compute nodes. The heavy and concurrent workload on each storage server incurs two issues.
First, when the number of competing write requests exceeds the capabilities of each storage server,
its bandwidth will degrade. Second, when multiple applications compete to use the shared storage
servers, I/O service for mission-critical applications can be frequently interrupted by low-priority
applications. I/O requests from small jobs can be delayed due to concurrent accesses from large
jobs, prolonging the average I/O time.
Figure 2.1(a) shows how storage bandwidth degrades under contention. Processes A and B each
sequentially issues four write requests to PFS. These write requests belong to contiguous regions
of a shared file (e.g. A1, A2, A3, A4 in Figure 2.1(a) are contiguous segments in the file). Due to
I/O contention, although these write requests are issued sequentially, their arrival sequence on PFS
becomes interleaved. This interleaved arrival leads to lower bandwidth utilization. Figure 2.1(b)
illustrates the prolonged average I/O time. Application 1 and Application 2 compete for PFS’s
I/O service by concurrently issuing 2 and 4 write requests, respectively. These write requests
are interleaved on PFS and serviced according to their arrival order. Although Application 1
issues fewer write requests, it still turns around slowly (even completes later than Application 2).
Consequently, the average I/O time of these two applications is prolonged by the PFS’s reactive
I/O service for these applications.
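The effect in Figure 2.1(b) can be replayed with a few lines of simulation, assuming unit service time per request and the interleaved arrival order sketched in the figure:

```python
# Arrival order on the PFS (Fig. 2.1(b)): App1 issues A1, A2; App2 issues B1-B4.
arrivals = ["A1", "B1", "B2", "B3", "B4", "A2"]

t = 0
finish = {}
for req in arrivals:
    t += 1                 # assume one time unit of service per request
    finish[req[0]] = t     # completion time of that application's latest request

print(finish)              # {'A': 6, 'B': 5}
```

App1 completes at time 6 and App2 at time 5: the application with only two requests finishes last, exactly the prolonged-turnaround effect described above.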
2.2.1 Analysis of Degraded Storage Bandwidth Utilization
A key reason for degraded storage bandwidth utilization is that each checkpointing file is striped
across multiple storage servers on PFS, and shared by the participating processes. Therefore, each
process can access multiple storage servers, and each storage server is reachable from multiple
processes. Such N-N mapping poses two challenges. First, each storage server suffers from the
competing writes from multiple processes; second, since the requests of each process are distributed
to multiple storage servers, each process is involved in the competition for multiple storage servers.
We used the IOR benchmark [88] to analyze the impacts of both challenges on the Lustre file system
(Spider file system [89]) connected to Titan Supercomputer [14] in Oak Ridge National Laboratory.
To investigate the first challenge, we used an increasing number of processes to concurrently write in
total 32GB data to a single storage server (OST). Each process wrote a contiguous, nonoverlapping
file segment. The result is exhibited in Fig. 2.2. The bandwidth first increased from 356MB/s with
1 process to 574MB/s with 2 processes, then decreased constantly to 401MB/s with 32 processes,
resulting in 30.1% bandwidth degradation from 2 to 32 processes. The improvement from 1 to 2
processes was because the single-process I/O stream was not able to saturate OST bandwidth. Our
intuition suggested that contention from 2 to 32 processes can incur heavy disk seeks; however, our
lack of privilege to gather I/O traces at the kernel level on ORNL’s Spider filesystem prevented us
from directly proving our intuition. We repeated our experiments on our in-house Lustre filesystem
(running the same version as that on Spider) and observed that up to 32% bandwidth degradation was caused by I/O contention. By analyzing I/O traces using blktrace [3], we found that disk access
time accounted for 97.1% of the total I/O time on average, indicating that excessive concurrent
accesses to OSTs can degrade the bandwidth utilization. Overall, the bandwidth utilization of one
Figure 2.2: The impact of process count on the bandwidth of a single OST.
OST on Spider was 75.6% when there were only two concurrent processes, but dropped to 53.5%
when there were 32 processes.
To evaluate the performance impact of the second challenge, we distributed each IOR process’s
write requests to multiple OSTs. In our experiment, we spread the write requests from each process
to 1, 2, 4 OSTs, which are presented in Fig. 2.3 as 1 OST, 2 OSTs, 4 OSTs, respectively. We fixed
the total data size as 128GB, the number of utilized OSTs as 4, and measured the bandwidth under
the three scenarios using a varying number of processes. The result is demonstrated in Fig. 2.4.
The scenario of 1 OST consistently delivered the highest bandwidth with the same number of
processes, outperforming 2 OSTs and 4 OSTs by 16% and 26% on average, respectively. This was
because, by localizing each process’s writes on 1 OST, each OST was under the contention from
fewer processes. Another interesting observation is that the bandwidth under the same scenario
(e.g. 1 OST) degrades with more processes involved. This trend can be explained by the impact
of the first issue as we measured in Fig. 2.2.
2.2.2 Analysis of Prolonged Average I/O Time
To assess the impact of prolonged average I/O time under a multi-application environment, we
ran BTIO [108], MPI-Tile-IO [16] and IOR [88] benchmarks concurrently but differentiated their
output data sizes as 13.27GB, 64GB and 128GB, respectively. This launch configuration is referred
Figure 2.3: Scenarios when individual writes are distributed to different numbers of OSTs. “N OST” means that each process’s writes are distributed to N OSTs.
Table 2.1: The I/O time of individual benchmarks when they are launched concurrently (MultiWL) and serially (SigWL).

Time(s)    BTIO    MPI-TILE-IO    IOR       AVG       TOT
MultiWL    41      121.83         179.75    114.19    179.75
SigWL      9.79    72.28          161       81.02     161
to as MultiWL. Competition for storage was assured by striping all benchmark output files across
the same four OSTs on the Spider file system. We compared their job I/O time with that when
these benchmarks were launched in a serial order, which is referred to as SigWL. The I/O time
of the individual benchmarks are shown in Table 2.1 as three columns, BTIO, MPI-TILE-IO, and
IOR, respectively. The average and total I/O time of the three benchmarks are shown in columns
AVG and TOT. As we can see, the average I/O time of MultiWL is 1.41× longer than SigWL.
This is because a job’s storage service was affected by the contention from other concurrent jobs.
Furthermore, the contention from large jobs can significantly delay the I/O of small jobs. In our
tests, the most affected benchmark was BTIO, which generated the smallest workload. Its I/O
time in MultiWL was 4.18× longer than SigWL.
Figure 2.4: Bandwidth when individual writes are distributed to different numbers of OSTs.
2.3 Summary
In summary, this chapter has revealed the major I/O challenges on HPC systems, namely the increasing computation-I/O gap and I/O contention on the PFS. Our burst buffer management
strategies provide distinct solutions to these challenges. First, we bridge the computation-I/O
gap by temporarily buffering the checkpoints in burst buffers, and gradually flushing checkpoints
to PFS (see Chapter 3). This avoids scientific applications’ direct interactions with PFS. We
achieve this purpose based on remote shared burst buffers, since the remote shared burst buffers
are jitter-free for data flushing. We have also explored the use of node-local burst buffers to absorb
the read/write traffic on the PFS, by buffering the bursty writes in node-local burst buffers, and then
directly retrieving the data from burst buffers for restart/analysis and visualization (see Chapter 4).
On the other hand, we have addressed I/O contention by taking burst buffers as an intermediate
layer to reshape the applications’ bursty writes on PFS (see Chapter 5).
CHAPTER 3
BURSTMEM: OVERLAPPING COMPUTATION
AND I/O
3.1 Introduction
As mentioned in Chapter 2, the ever-increasing computation-I/O gap imposes stringent checkpoint demands on the PFS. BurstMem is designed to shoulder the data pressure from checkpoint
workloads. It provides simple interfaces that allow applications to quickly dump checkpoint data,
and asynchronously flush the data to the PFS. It avoids direct interaction with the PFS by overlapping computation with data flushing. With a set of storage management strategies, it efficiently
leverages the capacity and bandwidth of burst buffer storage and reduces the overall I/O time. In
addition, BurstMem is designed with a novel tree-based indexing technique that can support fast
data flush in two phases. Finally, BurstMem is implemented with an abstract communication layer
that enables its portability to systems with diverse network configurations [30].
As mentioned in Section 1.2.3 of Chapter 1, there are several storage models for burst buffer
service. We structure BurstMem as a distributed key-value store, since a key-value store well supports the simple semantics required by BurstMem, e.g. storing the checkpoints generated at a
given checkpoint phase to burst buffer, and retrieving checkpoints produced during a given pe-
riod for data flush or crash recovery. BurstMem is implemented by customizing and extending
the functionality of Memcached, a cutting-edge, lightweight, distributed DRAM-based caching system. Though not designed for scientific applications, it includes all the features needed for distributed buffer management. Its fast storage solution and great extensibility for complex applications distinguish it from many other distributed key-value stores (e.g. MongoDB [43], HBase [51]) as a decent candidate for burst buffer services. We customize Memcached
by modifying its data placement strategy, communication layer and memory management module.
Furthermore, we design BurstMem with a mechanism of coordinated data shuffling and flushing to
PFS. In summary, we make the following contributions in this chapter:
• We have examined the storage management issues in Memcached. Based on our analysis
we introduce a log-structured storage management scheme with a novel tree-based indexing
technique. This allows us to efficiently utilize storage resources.
• We have applied a coordinated shuffling scheme for efficient data flushing, and designed a
portable communication layer that supports high-speed data transfer.
• A systematic evaluation of BurstMem is conducted using both synthetic benchmarks and a
real-world application. Our results demonstrate that on average BurstMem can improve the
I/O performance by as much as 8.5×.
The rest of the chapter is organized as follows. We describe the Memcached based buffering
management framework in Section 3.2, followed by Section 3.3 that elaborates on the design of
BurstMem. Section 3.4 provides experimental results. Section 3.5 provides a review of related
work. Finally, we summarize this chapter in Section 3.6.
3.2 Memcached Based Buffering Framework
In this section, we first depict the framework of Memcached, and then analyze the challenges
of using Memcached to support checkpoint workload.
3.2.1 Overview of Memcached
Memcached is an open-source, distributed caching system deployed to address the web-scale
performance and scalability challenges. Its two key components, client and server, function together as a distributed key-value store that serves web servers’ caching requirements. Fig. 3.1 shows
the general architecture along with its main components. The Memcached client can interact with
a number of servers for its data store and data retrieval purposes. As a distributed caching system,
Memcached incorporates several key architectural aspects indispensable to the design of a burst buffer system. First, the Memcached client adopts a two-stage hashing mechanism for balanced
data placement. Second, the Memcached server offers a lossy key-value store that involves all the
major functionality of a local storage system, such as space, data and metadata management.
Figure 3.1: Component diagram of Memcached.
Two-stage Hashing. Memcached applies a two-stage hashing mechanism to store or retrieve a given key-value pair (KVP). In the first stage, the key is hashed to the server responsible for storing the KVP. In the second stage, on that server, the key is hashed again to an entry in the server's local hash table that records the address of the KVP.
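To make the two stages concrete, here is a minimal Python sketch; the specific hash functions and the modulo-based server selection are illustrative assumptions (many Memcached clients use consistent hashing instead), not the exact algorithm:

```python
import hashlib

def server_for_key(key: str, servers: list) -> str:
    """Stage 1: hash the key to pick the server that will store the KVP."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

def bucket_for_key(key: str, num_buckets: int) -> int:
    """Stage 2: hash the key again into the chosen server's local table."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % num_buckets
```

Both stages are deterministic, so every client resolves the same key to the same server and the same local bucket.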
A Lossy Key-Value Store with Disconnected Servers. The Memcached server is designed as a simple but powerful in-memory key-value store. A server pre-allocates many groups of 1 MB slabs, each of which is divided into several chunks. Chunks that belong to the same group are of equal size, but chunk sizes differ across groups. Each group has a unique ID, and the chunk size grows by a factor of 1.25 from one group to the next. For instance, Group 1 contains chunks of 96 B, Group 2 contains chunks of 96 B × 1.25, and so on, up to the largest group (Group 42), whose chunk size is capped at the 1 MB slab size, so each of its slabs holds a single chunk. To insert a KVP, the Memcached server selects a chunk from the group with the closest sufficient chunk size, copies the KVP into that chunk, and records the chunk address in the hash table. Upon a hash conflict, a chained list of entries holds the addresses of multiple KVPs.
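The group sizing and selection rules above can be sketched as follows; the 8-byte alignment and the explicit cap at the 1 MB slab size are assumptions about how the geometric growth is realized, not details taken from Memcached's source:

```python
def slab_chunk_sizes(base=96, factor=1.25, slab_size=1024 * 1024, align=8):
    """Chunk sizes per group: grow by ~`factor`, round up to `align` bytes,
    and cap the final group at the 1 MB slab size (one chunk per slab)."""
    sizes = []
    size = base
    while size < slab_size:
        sizes.append(size)
        size = (int(size * factor) + align - 1) // align * align
    sizes.append(slab_size)
    return sizes

def group_for_item(nbytes, sizes):
    """Select the smallest group whose chunk size fits the item;
    the slack between item and chunk size is wasted memory."""
    for s in sizes:
        if nbytes <= s:
            return s
    raise ValueError("item exceeds the maximum slab size")
```

The slack returned by `group_for_item` is exactly the internal fragmentation that motivates the append-only design in Section 3.3.2.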
3.2.2 Challenges from Scientific Applications
For its simple, scalable, and powerful design, Memcached has been used by many web applications that require fast cache storage for their temporary data. The requests of these applications are generally distributed, arriving in a random order with very little synchronization.
While scientific applications also generate large volumes of data, their data patterns differ significantly from those of web applications. In particular, they possess a number of distinct characteristics, described below, that warrant a new perspective on how to leverage the strengths of Memcached.
• Lock-step I/O from many synchronized clients: Scientific applications typically consist of many parallel processes that enter their I/O phase in a lock-step manner. These processes synchronize frequently and in a well-coordinated fashion, typically to exchange data with their neighbors, so their I/O operations are also closely synchronized.
• Bursty and non-overlapping I/O: Scientific applications usually have well-defined execution phases; for example, they alternate between computation and I/O phases. This characteristic provides the fundamental requirement for designing a burst buffer. The file extents written by different processes do not overlap, and within each I/O phase a process does not rewrite content it has already written.
• Frequent writes and few reads: Scientific applications periodically create snapshots (via checkpointing) of their intermediate results and datasets. They are typically write-intensive but read only sparingly: in-memory variables such as arrays and meshes are written at each snapshot, while checkpoint data is read only during application restart. Given these characteristics, we design the burst buffer mainly for application write throughput. This differs from many other existing buffering systems, such as PreDatA [114] and DataSpaces [48] in ADIOS [96], which focus on in-situ data sharing and analysis for scientific simulations.
The distinct features of scientific applications lead us to rethink the design of Memcached. In this chapter, we carry out a study on the design of a burst buffer system on top of the Memcached framework. While preserving many features of the Memcached architecture, we modify Memcached in three critical aspects: storage management, coordination with the PFS, and communication efficiency.
3.3 Design of BurstMem
The goal of BurstMem is to efficiently absorb a large volume of write requests and provide high-throughput service to migrate the data into the PFS. In this section, we first present an
architectural overview of BurstMem. We then review how BurstMem copes with bursty writes,
describe our strategy to flush the data from BurstMem to the underlying PFS, and detail the
approach to leverage the native transport of various high-speed interconnects on the HPC systems.
3.3.1 Software Architecture of BurstMem
Figure 3.2: Software architecture of BurstMem.
Fig. 3.2 shows the software architecture of BurstMem and its relationship with other system components in a typical HPC environment. As a data buffering system, BurstMem sits between the processing elements and the backend persistent storage hosted by the PFS. It connects to all application processes via a high-speed interconnect, temporarily buffers bursty datasets from these processes, and gradually flushes the datasets to the PFS.
BurstMem is composed of two main components: BurstMem Managers (BMans) and BurstMem
Stores (BStores). Each BMan is designed with a BStore as an internal Memcached server. All
BMans form a parallel set of burst buffer daemons to intercept application data. Each BMan keeps
track of the address and health status of neighboring BMans. These BMans are in charge of the
bulk of responsibilities for system maintenance and resource management. They coordinate all the
BStores for fast data buffering, balanced data distribution and long-term persistent storage.
Using BurstMem, scientific applications can follow a new checkpoint flow. After each phase of computation, the clients coordinate with the BMans for checkpointing. The BMans absorb and store all the checkpoint datasets in the BStores. The application then returns to the next phase of computation, leaving the ensuing data flushing operation to the BStores. In this way, data flushing is overlapped with computation.
Built on top of Memcached, BurstMem exposes to the application a simple checkpoint API, which invokes a customized light-weight Memcached client library for data shipping. Once a BMan receives a KVP, it leverages its BStore for storage management. The BStore follows the same data processing flow as the Memcached server's storage management: it allocates memory for the accepted KVP and records its location so that the KVP can be retrieved in the future.
3.3.2 Log-Structured Data Organization with Stacked AVL Indexing
Figure 3.3: Data structures for absorbing writes: (a) log-structured data store; (b) stacked AVL index tree.
We introduce a log-structured data organization with Adelson-Velskii and Landis (AVL) tree [57] based indexing (LSA) to absorb the large volume of bursty write requests. LSA resolves three major issues in Memcached that prevent it from achieving efficient writes. First, the original Memcached preallocates fixed-size memory chunks to accommodate incoming write requests. However, this leads to underutilized memory when a write request is not aligned with the chunk size, and additional memory allocation is needed each time the chunks of a given size are used up. Second, contemporary HPC platforms are embracing tiered storage with both DRAM and SSD, but Memcached is oblivious to this architecture. Third, the hash-based indexing used by Memcached cannot support range queries, which are a requirement for bulk data flushing and HPC application restart.
The key idea of LSA is to compact the received write requests in an append-only manner to avoid memory waste, as shown in Fig. 3.3(a). When used memory reaches a high watermark, the in-memory data is appended to the SSD.
As illustrated in Fig. 3.3(a), we design a hierarchical data store to log the concrete file data (value) from write requests. A large (by default 4 GB) DRAM block is maintained for logs at the first level, while a separate intermediate file on the SSD is reserved for logs at the second. The address space of the data store spans the storage of both the DRAM and the SSD.
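A minimal sketch of such a two-tier append-only store follows; the class name `TieredLog`, the watermark policy, and the byte-level API are hypothetical simplifications of BStore, not its actual implementation:

```python
class TieredLog:
    """Append-only data store: values land in a DRAM block first; once DRAM
    usage crosses a high watermark, the block is appended to an SSD log file."""

    def __init__(self, watermark: int, ssd_path: str):
        self.dram = bytearray()     # first-level log in memory
        self.watermark = watermark  # spill threshold in bytes
        self.ssd_path = ssd_path    # second-level log on SSD
        self.ssd_bytes = 0          # bytes already spilled to SSD

    def append(self, value: bytes) -> int:
        """Append a value; return its offset in the unified DRAM+SSD space."""
        offset = self.ssd_bytes + len(self.dram)
        self.dram.extend(value)
        if len(self.dram) >= self.watermark:
            self._spill()
        return offset

    def _spill(self) -> None:
        """Move the DRAM log to the SSD log, keeping the address space intact."""
        with open(self.ssd_path, "ab") as f:
            f.write(self.dram)
        self.ssd_bytes += len(self.dram)
        self.dram = bytearray()
```

Because spilling preserves offsets, the index described next can point into one unified address space regardless of which tier currently holds a value.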
Upon its arrival at a BMan, each write request is converted into a KVP. The key records the information that uniquely identifies the request, including the current checkpointing timestamp, the name of its targeted checkpoint file, and the offset (i.e., the position of the write request in the checkpoint file), as well as the length of the value. The value points to the concrete data (e.g., a segment of a plain file) in the write request to be stored into the data store and later flushed to the checkpoint file, as shown in Fig. 3.3(a). When the BStore receives a write request, it separates the key from the value and appends the value to the end of the data store. By doing so, the BStore eliminates the random access caused by following a strict order of checkpoint file offsets, and maximizes the write throughput.
To facilitate data retrieval (e.g., upon application restart) from the data store, the keys of all KVPs are organized in a stacked AVL-tree structure that records the metadata of absorbed write requests. An AVL tree [57] is a self-balancing binary search tree that supports lookup, insertion, and deletion with O(log N) complexity in both the average and worst cases, thus outperforming unbalanced binary search trees. It also delivers an ordered node sequence that allows in-order traversal. While it inherits many of the AVL tree's virtues, our design differs greatly from the conventional AVL tree in its stacked structure. It consists of three categories of layers:
timestamp, filename, and offset, as shown in Fig. 3.3(b). Each write request is first indexed by the timestamp, then by the filename, and finally by the offset pointing to its position in the checkpoint file to be stored in the PFS. A pointer recording the address of each write request in the data store is maintained together with the offset index. The intuition behind such a design is to accelerate retrieval and traversal of the data in a specific range. Using checkpointing as an example, flushing the dataset that belongs to a single timestamp is important. Our stacked AVL tree allows such a dataset to be located in a single range search operation. After pinpointing the index of such a timestamp, each filename subtree under that index is traversed, restoring the order of each checkpoint file through an in-order traversal of all its offset metadata. Taken together, the tree index supports diverse query patterns, for example, retrieving all the data under timestamp 1, from timestamp 1 to 3, or under timestamp 1 belonging to filename 1. These query patterns are not supported by the hash-based indexing in Memcached.
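The stacked index and its range queries can be sketched as follows; nested dictionaries with sorted iteration stand in here for the three AVL layers (a real AVL tree maintains key order without re-sorting), and the method names are illustrative:

```python
class StackedIndex:
    """Three-layer index: timestamp -> filename -> offset -> log address.
    Nested dicts with sorted iteration stand in for the stacked AVL trees."""

    def __init__(self):
        self.root = {}  # ts -> {fname -> {offset: (addr, length)}}

    def insert(self, ts, fname, offset, addr, length):
        self.root.setdefault(ts, {}).setdefault(fname, {})[offset] = (addr, length)

    def range_by_timestamp(self, lo, hi):
        """All records with lo <= ts <= hi, each file restored in offset order."""
        out = []
        for ts in sorted(t for t in self.root if lo <= t <= hi):
            for fname in sorted(self.root[ts]):
                for off in sorted(self.root[ts][fname]):
                    out.append((ts, fname, off) + self.root[ts][fname][off])
        return out

    def drop_timestamp(self, ts):
        """Trim an entire timestamp subtree, e.g., after its data is flushed."""
        self.root.pop(ts, None)
```

A single `range_by_timestamp` call corresponds to the one range search the stacked tree needs to gather a whole checkpoint.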
However, one major issue faced by LSA is determining when to conduct the garbage collection necessary to reclaim used space in the data store and to trim the stacked AVL tree. We address this issue by leveraging a key characteristic of checkpointing: after a checkpointing operation completes, the data belonging to that timestamp can be discarded, since it has been flushed to the underlying file system. Thus, we first mark the timestamp node on the AVL tree as unused. A background process is invoked periodically to traverse the tree for unused timestamps and compact the in-memory data store to reclaim the memory occupied by the values of such a timestamp. When all values within the timestamp have been recycled, we trim the timestamp subtree off our stacked AVL tree. In addition, we leave the log on the SSD untouched. Only when the size of the SSD log approaches a threshold, and all the data within the log has been transferred to the PFS, do we discard the log in its entirety and generate a new log to absorb the data from memory.
3.3.3 Coordinated Shuffling for Data Flushing
BurstMem is responsible for flushing the data into the PFS. In the current design, data flushing takes place after checkpointing; we also allow clients to trigger data flushing explicitly. There are two general checkpointing patterns in scientific applications: N-N and N-1 checkpointing. In N-N checkpointing, every process writes to a separate file. In N-1 checkpointing, all processes write to a single shared file. BurstMem supports both patterns. Under the N-N pattern, each client's
checkpoint data is hosted by one BStore. These BStores can flush data into different checkpoint
files without interfering with each other. In contrast, under the N-1 checkpoint pattern, the shared
checkpoint file spreads across many BStores. Naively flushing data content into a shared file can
incur significant lock overhead, leading to drastically degraded throughput. Coordinated shuffling
is applied to address such an issue for the N-1 case.
Before elaborating on coordinated shuffling, we briefly describe the cause of lock overhead. Many PFSs use distributed locks to guarantee data consistency. Taking Lustre as an example, locking is performed at the granularity of Lustre stripes (by default 1 MB). When a BStore needs to write a stripe during data flushing, it must first acquire the lock for that stripe. If another BStore owns the lock, Lustre has to revoke the prior ownership before granting the lock to the requesting BStore. Once the stripe lock is acquired, the lock and the data are buffered in the Lustre Object Storage Client (OSC) inside the BStore. Under the N-1 case, write requests from multiple BStores may overlap on the same stripe, causing frequent ownership changes on the stripe lock. Associated with these changes of lock ownership, data flushing can cause frequent network traffic and delay the entire process. Therefore, contiguous, stripe-aligned write requests are preferred over noncontiguous, stripe-unaligned ones, since the former effectively reduce the degree of lock contention.
In most of our targeted cases, each BStore possesses several noncontiguous, small segments of the shared file, thus amplifying the lock overhead. Therefore, coordinated shuffling is designed to reshuffle the segments among BStores so that each BStore can flush contiguous segments into the PFS with alleviated lock contention. As illustrated in Fig. 3.4, each shared file is logically divided into several contiguous segments, with the total number of segments equal to the number of BStores. The purpose of data shuffling is to have each BStore hold all the data that belong to the same segment. Fig. 3.4 details this process. BStores 1, 2, and 3 possess data chunks from Clients 1, 2, and 3, respectively. These chunks belong to the same shared file, which is divided into 3 segments mapped to the three BStores. Before data shuffling, each BStore stores noncontiguous data chunks. Data shuffling begins after such a mapping is established. Following this mapping, Chunk 2 is shuffled from BStore 2 to BStore 1, Chunk 3 is shuffled from BStore 1 to BStore 2, and so on. Once the shuffling operation is completed, each BStore can then flush its data to the PFS.
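The offset-to-segment mapping behind this shuffle can be sketched as follows; the function names and the (holder, offset) chunk representation are hypothetical simplifications of the actual protocol:

```python
def segment_of(offset: int, file_size: int, num_bstores: int) -> int:
    """Map a chunk's file offset to the BStore owning that logical segment.
    The shared file is split into num_bstores contiguous, equal segments."""
    seg_size = -(-file_size // num_bstores)  # ceiling division
    return min(offset // seg_size, num_bstores - 1)

def shuffle_plan(chunks, file_size, num_bstores):
    """chunks: list of (holder, offset) pairs. Return the (src, dst, offset)
    transfers needed so each BStore ends up holding only its own segment."""
    moves = []
    for holder, offset in chunks:
        dst = segment_of(offset, file_size, num_bstores)
        if dst != holder:
            moves.append((holder, dst, offset))
    return moves
```

After the plan is executed, each BStore holds one contiguous segment and can issue large, stripe-aligned writes to the PFS.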
Figure 3.4: Coordinated shuffling for N-1 data flushing.
Our data flushing scheme is inspired by the idea of ROMIO [94]. We do not use the ROMIO library directly since it is coupled with the MPI environment, which restricts BurstMem's potential for future fault-tolerance extensions.
3.3.4 Enabling Native Communication Performance
Memcached relies on the BSD Sockets interface and uses a reliable stream (i.e., TCP) to transfer data. Although Sockets eases the implementation, socket-based communication cannot fully exploit the advanced capabilities of leadership-scale HPC systems, such as Remote Direct Memory Access (RDMA) and OS bypass. In addition, its performance is not optimized and is highly sensitive to data sizes.
Therefore, we have employed the Common Communication Interface (CCI) [30], designed at Oak Ridge National Laboratory, to efficiently leverage the performance advantages of HPC facilities. CCI exposes the performance of native network interfaces to scientific applications and has been fully deployed on the Titan supercomputer to serve various scientific applications.
In our optimization, we leverage CCI to accelerate checkpointing from clients to BMans, as well as
data shuffling among BMans.
Figure 3.5: CCI-based network communication.
Figure 3.5 illustrates our implementation of CCI-based network communication. CCI uses client/server semantics to establish a connection. Checkpointing or data shuffling triggers a connection request to the peer. On both the server and the client, CCI abstracts a network device as an Endpoint, a virtualized device containing many resources, such as send and receive queues, as well as the buffers associated with those queues. Once the connection is established, we leverage the Remote Memory Access (RMA) feature, which enables zero-copy in CCI, to transfer the data over the network.
In our design, we use an event-driven model to improve the throughput. We poll the CCI Endpoint for new events (via cci_get_event). On the client side, one thread is dedicated to establishing connections with remote servers. Meanwhile, a data-transferring thread uses non-blocking RMA to transfer the data. RMA is typically one-sided (i.e., only the initiator is actively involved in the transfer). To signal the completion of an RMA operation, we use the completion message option, which sends a message and generates a receive event on the remote endpoint. A central thread, which polls the events from the Endpoint, orchestrates all of the above threads. Similarly, on the server side, a thread detects events from the Endpoint and dispatches the requests, e.g., connection requests or RMA completion events, to the corresponding threads for further processing.
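The dispatch pattern (though not the CCI API itself) can be sketched in Python, with a thread-safe queue standing in for the endpoint's event source; the event types and handler signatures are illustrative assumptions:

```python
import queue
import threading

def event_loop(events: queue.Queue, handlers: dict, stop: threading.Event):
    """Central polling thread: repeatedly pull an event (as the design polls
    cci_get_event on the Endpoint) and dispatch it to the registered handler."""
    while not stop.is_set():
        try:
            kind, payload = events.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing pending; re-check the stop flag
        handlers.get(kind, lambda p: None)(payload)
        events.task_done()
```

On the server side, the same loop shape would route connection requests and RMA-completion events to their worker threads.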
3.4 Evaluation of BurstMem
3.4.1 Methodology
Testbed: Experiments on BurstMem were conducted on the Titan supercomputer [14] hosted at Oak Ridge National Laboratory. Each node is equipped with a 16-core 2.2 GHz AMD Opteron 6274 (Interlagos) processor, 32 GB of RAM, and a connection to the Cray custom high-speed interconnect; every two nodes share one Gemini high-speed interconnect router. Since no I/O servers were deployed for I/O buffering, we used a separate set of compute nodes for BurstMem. Out of the 256 compute nodes allocated to the experiment, 128 were used as clients that write data into BurstMem; the other 128 were allocated as BurstMem servers. In every experiment, we placed one process per physical node.
Titan is connected to Spider II, a center-wide Lustre-based file system. It features 30 PB of disk space, offering 1 TB/s aggregate bandwidth organized in two non-overlapping, identical file systems, each providing 500 GB/s of I/O performance. The default stripe size of each created file is 1 MB, and the default stripe count is 4.
In all the experiments, we pinned a 16 MB DRAM buffer for each RMA channel between two communication entities.
Benchmarks: To evaluate the performance of BurstMem, we employed a synthetic workload using IOR [88] as well as a real-world scientific application called S3D [41]. We report the average of five test runs.
IOR is a flexible synthetic benchmarking tool that is able to emulate diverse I/O access patterns. It was initially designed for measuring the I/O performance of parallel file systems (PFSs). We added BurstMem support to IOR by redirecting all writes from the processes to BurstMem instead of the PFS; this new version of IOR is referred to as BB-IOR. To emulate bursty I/O behavior as described in Section 3.2.2, we set interTestDelay to 20 seconds between any two I/O phases and iterated 10 times. For comparison, we also redirected the writes to Memcached, referred to as MemCache-IOR.
To evaluate the performance of real applications, we integrated BurstMem into S3D, which we refer to as BB-S3D. S3D is a parallel turbulent combustion application using a direct numerical simulation solver developed at Sandia National Laboratories. It solves the fully compressible Navier-Stokes, total energy, and mass continuity equations coupled with detailed chemistry. The problem domain is a conventional 3-D structured Cartesian mesh. All the MPI processes are partitioned along the X-Y-Z dimensions. S3D exhibits bursty patterns during execution: its checkpointing phase regularly alternates with computation, and each checkpointing phase outputs four global arrays representing the variables of mass, velocity, pressure, and temperature.
3.4.2 Ingress Bandwidth
We first investigated the ingress bandwidth that BurstMem can sustain while absorbing write requests. We used the IOR benchmark and increased the number of BurstMem servers from 1 to 128. In each test, we used the same number of IOR clients as BurstMem servers to stress the system, and a 1 MB (default stripe size) transfer unit to alleviate the lock contention issue in the Lustre file system. We use IOR-N-1 and IOR-N-N to denote, respectively, the N-1 and N-N patterns mentioned in Section 3.3.3 for the original IOR. In the IOR-N-1 case, we set the stripe count to the number of clients. For a fair comparison, we set the stripe count to 1 in the IOR-N-N case, so that the same number of Object Storage Targets (OSTs) as clients are utilized. On average, each IOR client wrote 4 GB of data.
Figure 3.6 compares the ingress bandwidth IOR receives with and without BurstMem support, as well as with MemCache-IOR. Overall, BurstMem delivered significantly higher ingress bandwidth than the three alternatives. As seen in Figure 3.6, BB-IOR achieves 278.2%, 246.9%, and 174.5% improvement on average when compared to IOR-N-1, MemCache-IOR, and IOR-N-N, respectively. Such improvement is consistent across different numbers of BurstMem servers.
BB-IOR achieves substantial improvement over the original IOR by buffering the write requests
instead of writing directly to the Lustre file system. Such performance improvement is what we
expect. However, as also shown in Figure 3.6, simply using Memcached as the buffering system
could not maximize the ingress bandwidth. BurstMem effectively outperformed Memcached with
LSA described in Section 3.3.2 and CCI support described in Section 3.3.4. Specifically, BurstMem benefited from Gemini's native transport through CCI and avoided frequent memory allocation through LSA.

Figure 3.6: Ingress I/O bandwidth with a varying server count.
To further examine the ingress bandwidth under different workloads, we reduced the number of BurstMem servers to 4, which equals the default stripe count. We also set the stripe count to 4 and had all the clients write to one shared file for the original IOR. In both BB-IOR and the original IOR, we increased the number of IOR clients from 4 to 128, thereby increasing the workload per BurstMem server and OST. Figure 3.7 illustrates the performance comparison with respect to the increasing number of IOR clients. On average, BB-IOR outperformed the original IOR and MemCache-IOR by 508.62% and 408.30%, respectively. We observed an increasing bandwidth from 4 to 16 IOR clients. This was because, when the number of IOR clients was fewer than 16, the supplied bandwidth of each BurstMem server was not fully saturated. Once the number reached 16, that bandwidth was fully utilized, and BurstMem was able to provide stable ingress bandwidth regardless of the workload.
Figure 3.7: Ingress I/O bandwidth with a varying client count.
3.4.3 Egress Bandwidth
Efficiently flushing data to the PFS to free space for future writes is another essential feature of BurstMem. In this section, we measured the egress bandwidth and evaluated the effectiveness of the coordinated shuffling introduced in Section 3.3.3. To emulate a common I/O access pattern in scientific applications, we interleaved the writes from multiple IOR clients and used a 16 KB transfer size, one of the dominant transfer sizes for scientific applications [56]. Similar to the ingress bandwidth evaluation, we set the number of IOR clients equal to that of the BurstMem servers and had each client output 4 GB of data to a shared file. Because Memcached does not support flushing, the comparison does not include Memcached.
Figure 3.8 shows the egress bandwidth results. The cumulative egress bandwidth of BB-IOR increased from 0.83 GB/s at 4 processes to 6.09 GB/s at 128 processes. BB-IOR was able to
achieve 2× to 19× higher bandwidth when compared to the original IOR whose performance was
consistently below 0.4 GB/s for all cases. Such low performance was mainly due to large overhead
from lock contention caused by unaligned writes. Coordinated shuffling rearranged unaligned write requests into sequential, stripe-aligned writes, thereby significantly improving the overall egress bandwidth.

Figure 3.8: Egress I/O bandwidth.
In Figure 3.9, we show the time spent on shuffling and flushing. The shuffling operation incurred over 30% overhead. However, it enabled flushing to proceed as large, sequential, stripe-aligned writes to the PFS, delivering orders of magnitude better performance at massive scale. Hence, the extra overhead from data shuffling is a worthwhile trade-off given the significant benefit it delivers.
3.4.4 Scalability of BurstMem
Scalability is a critical factor for BurstMem. We want to ensure that BurstMem is able to provide increasing bandwidth when given more resources, such as more BurstMem nodes and additional CPU cores on each node. In this section, we evaluated the scalability of BurstMem from two perspectives: horizontal scalability (scale-out) and vertical scalability (scale-up). We continued using IOR as the benchmark tool for our evaluation. In the horizontal scaling experiment, we increased the number of BurstMem servers and measured the cumulative bandwidth delivered by BurstMem. In the vertical scaling experiment, we increased the number of threads in each individual BurstMem server.
Figure 3.9: Dissection of coordinated shuffling.
Horizontal Scaling. We evaluated horizontal scalability by fixing the number of clients at 128 and increasing the number of BurstMem servers from 4 to 128. The I/O request size was set to 1 MB, and each client wrote 512 MB of data, amounting to 64 GB of input data in total for each iteration.
Figure 3.10(a) shows the performance results of horizontal scaling. As shown in the figure, cumulative bandwidth improved linearly from 9.9 GB/s with 4 BurstMem servers to 62.04 GB/s with 32 BurstMem servers. However, the rate of increase declined when going from 64 to 128 BurstMem servers. This was because the supplied bandwidth of each BurstMem server could efficiently absorb I/O requests from more than two clients. When the number of BurstMem servers was fewer than a quarter of the clients (32), the servers were mostly saturated. Further increasing the number of BurstMem servers beyond that point led to underutilized bandwidth in the BurstMem system, and the bandwidth gradually became client-bound.
In summary, linear horizontal scalability was achievable when ingress bandwidth was bounded by BurstMem. In addition, some other factors could affect the cumulative bandwidth, including varying end-to-end network bandwidth on Titan due to locality, and contention for network resources. These factors caused the cumulative bandwidth to be lower than the theoretical maximum.
Figure 3.10: Scalability of BurstMem: (a) horizontal scaling; (b) vertical scaling.
Vertical Scaling. We evaluated BurstMem's vertical scalability by scaling the number of threads in each BurstMem server from 1 to 15, one thread per core; the remaining core was used to run system daemons. We measured the bandwidth that each individual BurstMem server could supply. On average, each BurstMem server served 16 clients, and each client sent 512 MB of data to the server, amounting to 8 GB of input data per BurstMem server. Figure 3.10(b) shows the bandwidth increasing from 2.49 GB/s at one thread to 6.11 GB/s at 15 threads. There was a sharp increase at 15 threads because each compute node contained two NUMA nodes, each with 8 cores. Titan scheduled the first 8 threads to the first NUMA node and the last 7 threads to the second. With 15 threads, we therefore also harnessed the capability of the second NUMA node, such as its memory bandwidth and computing power.
3.4.5 Case Study: S3D, A Real-World Scientific Application
During the experiments with S3D, we kept the sizes of the X, Y, and Z dimensions at 50, 50, and 50, respectively, and had each process write about 2 GB of checkpoint data. We compared the cumulative bandwidth of BurstMem-enabled S3D (BB-S3D) with that of the original S3D implementation.
Figure 3.11 shows the I/O performance comparison between BB-S3D and S3D. The bandwidth of BB-S3D increased linearly from 1.27 GB/s at 1 process to 80.49 GB/s at 125 processes. This yielded a performance improvement of up to 10× over the original S3D when the number of MPI processes was 125.

Figure 3.11: I/O performance of S3D with BurstMem.
We observed that the original S3D bandwidth was lower than that of the original IOR. This was because, in the IOR tests, we set the transfer size equal to the stripe size, which optimizes performance under the Lustre file system. In contrast, the transfer units of Fortran I/O in S3D vary among 0.95 MB, 2.86 MB, and 10.49 MB, which is less favorable to Lustre.
3.5 Related Work
Improving I/O performance on large-scale HPC systems has gained broad attention over the past decades. A number of studies have introduced new I/O middleware libraries. MPI-IO [94, 62], PnetCDF [58, 69], and HDF5 [11, 64] boost I/O performance using parallel I/O that involves a massive number of participating processes. PLFS [32] introduces an extra I/O layer that converts non-contiguous, interspersed I/O into contiguous, sequential I/O. All these studies aim to optimize I/O on the parallel file system (PFS); therefore, their performance is still restricted by the bandwidth of the PFS.
I/O forwarding [28] is another key technique, applied on Blue Gene/P systems. It leverages two I/O forwarding components, named CIOD [77] and ZOID [54], both of which use synchronous I/O forwarding. Venkatram et al. [97] replace synchronous I/O forwarding with asynchronous staging, thus enhancing an application's overall performance. However, such techniques apply only to the Blue Gene/P architecture.
Orthogonal to this work, asynchronous data staging has been proposed by many other researchers. Such work generally falls into two categories: local staging [83, 60] and remote staging [78, 27]. In the former case, an application uses the local storage of compute nodes as the staging area; however, performance can be affected by computation jitter [31]. Remote staging buffers data in additional partitions of compute nodes. Although remote staging is immune to computation jitter, it is confined by the available resources of those compute nodes, such as the supplied bandwidth and storage capacity.
Burst buffers in HPC are a relatively recent idea. Past burst buffer studies mainly used modeling
and simulations. Liu et al. [65] designed a simulator of burst buffer for the IBM Blue Gene/P
architecture. Bing et al. [109] characterized output burst absorption on Jaguar and made an
important step toward quantitative models of storage system performance behaviors. Different
from them, our work focuses on designing and implementing a prototype burst buffer system and
analyzing its performance benefit.
More recently, there have been an increasing number of ongoing projects to provide software
solutions. Two representative projects are DataWarp [9] from Cray [6] and Infinite Memory Engine
(IME) [13] from DataDirect Networks [8]. DataWarp utilizes flash SSD I/O blades with the Cray Aries
high-speed interconnect. It is designed for the Trinity [25] and Cori [5] supercomputers. It features
a flexible storage allocation mechanism that allows a user to reserve burst buffers, providing
seamless integration with SLURM. In addition, a user can customize a reservation to
act either as a file system mount point or as a layer of local cache that better supports bursty
checkpoint/restart workloads. Different from BurstMem, DataWarp is designed to run on the Cray Aries
interconnect, while BurstMem leverages CCI for portable data transfer. IME is positioned on I/O
nodes. Like BurstMem, it supports diverse interconnects, e.g., InfiniBand and Cray Aries. It is also
optimized for data flushing. Like Cray DataWarp, the whole buffer space is viewed by applications
as a mount point that transparently absorbs clients' I/O requests. Different from IME, BurstMem
exposes a key-value store based interface to scientific applications, which can be more easily extended
to support richer semantics. In addition, while both DataWarp and IME are developed as
commercial products and provide more comprehensive functionality than BurstMem, little of their
design detail has been released; the way we structure BurstMem provides complementary design choices for
these commercial products.
3.6 Summary
In this chapter, we have designed a high-performance burst buffer system on top of Mem-
cached. Through in-depth analysis, we identified that Memcached has several issues that prevent
its direct use as a burst buffer, such as the lack of efficient storage management to absorb large
amounts of bursty writes and the inability to exploit modern high-speed network interconnects.
Based on our analysis, we introduced several techniques to enhance Memcached into the BurstMem
system for bursty I/O in scientific applications. Our techniques include a log-structured data or-
ganization with stacked AVL indexing for fast I/O absorption and low-latency, semantic-rich data
retrieval; coordinated data shuffling for efficient data flushing; and CCI-based communication for
high-speed data transfer. Our experiments on the Titan supercomputer with synthetic benchmarks
and real-world applications demonstrate that BurstMem can efficiently provide high-performance
I/O services for current HPC scientific applications with good scalability.
CHAPTER 4
BURSTFS: A DISTRIBUTED BURST BUFFER
FILE SYSTEM
4.1 Introduction
Node-local burst buffers are a powerful hardware resource for scientific applications to buffer
their bursty I/O traffic. However, the usage of node-local burst buffers is not yet well-studied, nor
are burst buffer software interfaces standardized across systems. Currently, users are left with the
freedom to explore the use of burst buffers in an ad-hoc manner. However, domain scientists would
rather focus on their scientific problems instead of fiddling with the complexity of how to best use
burst buffers.
Several efforts have explored the use of locality-aware distributed file systems (e.g., HDFS [90])
to manage node-local burst buffers [106, 110, 113]. In such file systems, each process stores its
primary data to the local burst buffer. Because compute processes can be co-located with their
data, it is feasible to achieve linearly scalable aggregated bandwidth [113, 83]. However, burst
buffers are only temporarily available to user jobs. A user job can utilize local burst buffers within
the duration of its allocation, but the job loses access to the burst buffer storage when the allocation
terminates.
Conventional file systems such as HDFS [90] or Lustre [35, 81] are typically designed to persist
data indefinitely, on the order of an HPC system's lifetime. They utilize long-running daemons
for I/O services, which are not necessary for temporary burst buffer usage. In addition, the
construction and cleanup of I/O services for these file systems can lead to a waste of resources in terms
of compute cores, storage, and memory. Therefore, for effective use of burst buffers by scientific
users, it is critical to develop software for standardizing the use of node-local burst buffers, so that
they can be seamlessly integrated into the repertoire of HPC tools on leadership supercomputers.
HPC applications typically exhibit two main I/O patterns: shared file (N-1) and file-per-process
(N-N) [32] (see details in Section 1.1). For node-local burst buffers, with the N-N pattern, appli-
cations can achieve scalable bandwidth by having each process write/read its files locally. The
41
difficulty for node-local burst buffers lies with the N-1 I/O pattern, in which all processes write a
portion of a shared file. In particular, a shared file requires the metadata for all data segments to
be properly constructed, indexed, and collected at the time of writes, then later formulated with a
global layout before any process can locate its targeted data for read access. While this issue has
been investigated on persistent parallel file systems [32, 111], the problem of efficiently formulating
and serving the global layout of a shared file remains a critical issue for a temporary file system
across burst buffers.
In addition, datasets from scientific applications are typically multi-dimensional. Such datasets
are usually stored in one particular order of multiple dimensions, but frequently read from different
dimensions based on the nature of scientific simulation or analysis. Often, there is an incompatibility
between the order of writes and the order of reads for data elements in a multi-dimensional dataset,
which typically leads to many small non-contiguous read operations for one process to retrieve
its desired data elements [95] (see Section 1.1.2 for details). An effective node-local burst buffer
file system also needs to provide a mechanism for scientific applications to efficiently read multi-
dimensional datasets without many costly small read operations.
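For instance, for a two-dimensional array stored in row-major order, reading a single column requires one small read per row. A minimal sketch of the offsets involved (the array shape and element size are arbitrary illustrative values):

```python
def column_read_offsets(nrows, ncols, elem_size, col):
    """Byte offsets a reader must fetch to obtain one column of a 2-D
    array stored in row-major order: nrows small, noncontiguous reads
    of elem_size bytes each, one per row."""
    return [(r * ncols + col) * elem_size for r in range(nrows)]
```

Even this tiny example shows how a read order incompatible with the write order turns one logical request into many scattered small reads.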
In this research, we have designed an ephemeral Burst Buffer File System (BurstFS) that has
the same temporary life cycle as a batch-submitted job. BurstFS organizes the metadata for the
data written in local burst buffers into a distributed key-value store. To cope with the challenges
from the aforementioned I/O patterns, we designed several techniques in BurstFS including scalable
metadata indexing, co-located I/O delegation, and server-side read clustering and pipelining. We
used a number of I/O kernels and benchmarks to evaluate the performance of BurstFS, and validate
our design choices through tuning and analysis.
In summary, this research on BurstFS makes the following contributions.
• We present the design and implementation of a burst-buffer file system to meet the need of
effective utilization of burst buffers on leadership supercomputers.
• We introduce several mechanisms inside BurstFS, including scalable metadata indexing for
quickly locating data segments of a shared file, co-located I/O delegation for scalable and
recyclable I/O management, and server-side clustering and pipelining to support fast access
of multi-dimensional datasets.
• We evaluate the performance of BurstFS with a broad set of I/O kernels and benchmarks.
Our results demonstrate that BurstFS achieves linear scalability in terms of aggregated I/O
bandwidth for parallel writes and reads.
• To the best of our knowledge, BurstFS is the first file system designed to have a co-existent
and ephemeral life cycle with one or a batch of scientific applications in the same job.
The rest of this chapter is organized as follows. Section 4.2 presents the design of BurstFS. The
experimental methodology and results are presented in Section 4.3. Section 4.4 sum-
marizes the related work, followed by Section 4.5, which concludes this chapter.
4.2 Design of BurstFS
We designed the Burst Buffer File System (BurstFS) as an ephemeral file system, with the
same lifetime as an HPC job. Our overarching goal for BurstFS is to support scalable aggregation
of I/O operations across distributed, node-local storage for data-intensive simulations, analytics,
visualization, and checkpoint/restart. BurstFS instances are launched at the beginning of a batch
job, provide data services for all applications in the job, and terminate at the end of the job
allocation. Fig. 4.1 shows the system architecture of BurstFS.
When a batch job is allocated a set of compute nodes on an HPC system, an instance of BurstFS
will be constructed on-the-fly across these nodes, using the locally-attached burst buffers, which
may consist of memory, SSD, or other fast storage devices. These burst buffers enable very fast
log-structured local writes; i.e., all processes can store their writes to the local logs. Next, one or
more parallel programs launched on a portion of these nodes can leverage BurstFS to write data
to, or read data from, the burst buffers. In addition, a BurstFS instance exists only during the
lifetime of the batch job. All allocated resources and nodes will be cleaned up for reuse at the end
of the scheduled execution. This avoids any post-mortem interference with other jobs or potential
unforeseeable complications to the operation of file and storage systems. Furthermore, parallel
programs within the same job allocation (e.g., programs launched within the same batch script)
can share data and storage on the same BurstFS instance, which can greatly reduce the need of
back-end persistent file systems for data sharing across these programs.
BurstFS is mounted with a configurable prefix and transparently intercepts all POSIX functions
under that prefix [99]. Data sharing between different programs can be accomplished by mounting
Figure 4.1: BurstFS system architecture.
BurstFS using the same prefix. Upon the unmount operation from the last program, all BurstFS
instances sequentially flush their data for data persistence (if requested), clean up their resources
and exit.
To support the challenging I/O patterns discussed in Section 4.1, we designed several techniques
in BurstFS including scalable metadata indexing, co-located I/O delegation, and server-side read
clustering and pipelining as shown in Fig. 4.1. BurstFS organizes the metadata for the local logged
data into a distributed key-value store. It enables scalable metadata indexing such that a global
view of the data can be generated quickly to facilitate fast read operations. It also provides a
lazy synchronization scheme to mitigate the cost and frequency of metadata updates. In addition,
BurstFS supports co-located I/O delegation for scalable and recyclable I/O management. Further-
more, we introduce a mechanism called server-side read clustering and pipelining for improving the
read performance. We elaborate on these techniques in the rest of this section.
4.2.1 Scalable Metadata Indexing
As discussed in Section 4.1, one of the challenges for the N-1 I/O pattern is accessing the
metadata of segments scattered across all nodes. This leads to a huge scalability problem when all
processes are reading their data and each one needs to gather the metadata from all nodes.
Figure 4.2: Diagram of the distributed key-value store for BurstFS.
Distributed Key-Value Store for Metadata. BurstFS solves this issue using a distributed
key-value store for metadata, along with log-structured writes for data segments. It leverages
MDHIM [52] for the construction of distributed key-value stores and provides additional features
for efficient handling of bursty read and write operations.
Fig. 4.2 shows the organization of data and metadata for BurstFS. Each process stores its data
to the local burst buffer as data logs, which are organized as data segments. New data are always
appended to the data logs, i.e., stored via log-structured writes. With such log-structured writes,
all segments from one process are stored together regardless of their global logical position with
respect to data from other processes.
When the processes in a parallel program create a global shared file, a key-value pair (e.g.,
M1 or M2) is generated for each segment. A key consists of the file ID (an 8-byte hash value)
and the logical offset of the segment in the shared file. The value describes the actual location
of the segment, including the hosting burst buffer, the log containing the segment (there can be
more than one log from multiple processes on the same node), the physical offset in the log, and
the length. The key-value pairs (KVP) for all the segments can then provide the global layout for
the shared file. All the KVPs are consistently hashed and distributed among the key-value servers
(e.g., KVS0, KVS1 and so on). With such an organization, the metadata storage and services are
spread across multiple key-value servers. Many processes from a parallel application can quickly
retrieve the metadata and form a global view of the layout of a shared file.
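The key-value organization above can be sketched as follows. The server count, the MD5-based placement, and the dict-shaped value are illustrative assumptions for this sketch, not BurstFS's actual wire format or MDHIM's real placement function:

```python
import hashlib

NUM_SERVERS = 4  # hypothetical number of key-value servers

def make_key(file_id: int, logical_offset: int) -> bytes:
    # Key: 8-byte file ID followed by the segment's logical offset in the file.
    return file_id.to_bytes(8, "big") + logical_offset.to_bytes(8, "big")

def make_value(node_id: int, log_id: int, physical_offset: int, length: int) -> dict:
    # Value: where the segment actually lives on its hosting burst buffer.
    return {"node": node_id, "log": log_id,
            "phys_offset": physical_offset, "length": length}

def server_for(key: bytes) -> int:
    # Consistent placement: hash the key to choose a key-value server,
    # spreading metadata storage and lookups across all servers.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NUM_SERVERS
```

Because every client computes the same placement function, any process can locate the server holding the metadata for a given (file, offset) without a central directory.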
Lazy Synchronization. In BurstFS, we also develop lazy synchronization to provide efficient
support for bursty writes. Each process maintains a small memory pool for holding the metadata
KVPs from write operations; at the end of a configurable interval, the buffered KVPs are
stored to the distributed key-value stores. An fsync operation can force an explicit synchronization.
BurstFS leverages the batch put operation from MDHIM to transfer these KVPs together in a few
round-trips, minimizing the latency incurred by single put operations. During the synchronization
interval, BurstFS searches for contiguous KVPs in the memory pool to potentially combine. A
combined KVP can span a bigger range. As shown in Fig. 4.2, segments [2-3) MB and [3-4) MB
are contiguous and map to the same server (KVS0), so their KVPs are combined into one KVP. Lazy
synchronization can significantly reduce the number of KVPs required when many data segments
issued by each process are logically contiguous (e.g. N-1 segmented and N-N write in Fig. 1.1).
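The combining of contiguous KVPs during a sync interval can be illustrated with a minimal sketch; it ignores the additional constraint, noted above, that combined segments must map to the same server:

```python
def combine_contiguous(entries):
    """Merge logically contiguous (offset, length) segments buffered in the
    per-process memory pool into fewer key-value pairs before the batch put."""
    merged = []
    for off, length in sorted(entries):
        last = merged[-1] if merged else None
        if last and last[0] + last[1] == off:
            merged[-1] = (last[0], last[1] + length)  # extend the previous KVP
        else:
            merged.append((off, length))
    return merged
```

With logically contiguous writes (as in the N-1 segmented pattern), this reduces many per-segment KVPs to a handful, shrinking the batch put accordingly.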
Parallel Range Queries. To begin a read operation, BurstFS has to first look up the meta-
data for the distributed data segments. Thus, it searches for all KVPs whose offsets fall in the
requested range, e.g., [offset, offset+count] is the requested range in pread. With batched read
requests, BurstFS needs to search for all KVPs that are targeted by the read requests in the batch.
To retrieve the requested metadata entries for different read operations, we need support for a
variety of range queries to the key-value store. However, range queries are not directly supported
by MDHIM; the clients can indirectly perform range queries by iterating over consecutive KVPs
within a range with repeated cursor-type operations. Clients must sequentially invoke one or more
cursor operations for one range server, and must search multiple range servers until all KVPs have
been located. The additive round-trip latencies by all cursor operations to multiple range servers
can severely delay read operations.
To mitigate this, we introduce parallel extensions for both MDHIM clients and servers. On
the client side, we transform an incoming range request and break it into multiple small range
queries to be sent to each server based on consistent hashing. Compared with sequential cursor
operations, this extension allows a range query to be broken into many small range queries, one for
each range server. These small queries are then sent in parallel to all range servers to retrieve all
KVPs. On the server side, for the small range query within its scope, all KVPs inside that range are
retrieved through a single sequential scan in the key-value store. With this parallel optimization,
any combination of queries can be accomplished through only parallel range queries to all servers
and a single local scan operation at each key-value server.
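The client-side splitting of one range query into parallel per-server subqueries might look like the following sketch; the fixed-size slice placement is a stand-in assumption for the actual consistent-hashing scheme:

```python
def split_range(offset, count, slice_size, num_servers):
    """Break a [offset, offset+count) range query into per-server subranges,
    assuming keys are placed by assigning fixed-size offset slices round-robin
    across servers (a hypothetical placement, not MDHIM's real scheme)."""
    subqueries = {}
    end = offset + count
    first_slice = (offset // slice_size) * slice_size
    for slice_start in range(first_slice, end, slice_size):
        server = (slice_start // slice_size) % num_servers
        lo, hi = max(offset, slice_start), min(end, slice_start + slice_size)
        subqueries.setdefault(server, []).append((lo, hi))
    return subqueries  # each server answers its subranges with one local scan
```

The subqueries are then issued to all servers in parallel, replacing the sequential cursor round-trips with one round-trip per server plus a single sequential scan on each.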
4.2.2 Co-Located I/O Delegation
In contrast to BurstFS write operations that store data locally, read operations in BurstFS
may need to transfer data from remote burst buffers to a process initiating a read. To ensure
the efficiency of reads, we need to support fast and scalable data transfer for read operations. A
common approach adopted by many parallel programming models such as MPI [73] and PGAS [44]
is to have each process make read function calls to persistent file and storage service daemons.
Because BurstFS's lifetime is limited to that of a single job, it has special requirements
for I/O services. One implementation option might be to have persistent I/O daemons to support
BurstFS; however, that would lead to a waste of computation and memory resources. Another
implementation choice could be to utilize a simple I/O service thread spawned from the parent
process in a parallel program. However, with this approach, the service thread can only serve the
I/O needs for processes in the same program, and cannot serve subsequent or concurrent programs
in the batch job.
In BurstFS, we introduce scalable read services through a mechanism called co-located I/O
delegation. We launch an I/O proxy process, called a delegator, on each node. Delegators are decoupled
from the applications in a batch job, and are launched across all compute nodes. The delegators
collectively provide data services for all applications in the job.
As shown in Fig. 4.3, processes on three compute nodes will have all their I/O activities del-
egated to the delegator on the same node. Each delegator consists of two main components: a
request manager and an I/O service manager. In this way, a conventional client-server model for
I/O services is transformed into a peer-peer model among all delegators. With this arrangement,
individual processes no longer communicate with I/O servers directly, but go through their I/O
delegators. This leads to a great reduction on the total number of network communication channels
Figure 4.3: Diagram of co-located I/O delegation on three compute nodes P, Q, and R, each with 2 processes.
and the associated resources across the compute nodes. The I/O service manager in each delegator
is dedicated to serve the incoming read requests from peer delegators. The I/O service managers
exploit opportunities to consolidate requests, pipeline data retrieval from local storage, and transfer
data back to requesting delegators (see Section 4.2.3 for details).
The request manager of a delegator is composed of two main data structures: a request send
queue and a data receive queue, as shown in Fig. 4.3. The request send queue is a circular list with a
configurable number of entries. When not full, it receives the read requests from all client processes
through named pipes. Requests are queued based on the destination delegator. Requests to the
same delegator are chained together, which consolidates multiple requests into a single network
message. The data receive queue resides in a shared memory pool constructed across delegator and
client processes on the same node. For each I/O request, an outstanding request entry is created
in the receive queue. Data returned from remote delegators is directly deposited in the shared
memory pool, and the receive queue is searched for a matching outstanding request entry. When a
match is found, the outstanding request is marked as complete. An additional acknowledgment is
sent via the pipe to notify the client process to consume the data.
The request manager monitors the utilization level of the shared memory pool. When it is
higher than a configurable threshold (default 75%), the delegator (1) informs processes of the
urgent need to consume their data and (2) throttles request injection to remote delegators. The
request manager also monitors the ingress bandwidth based on the received data for read requests in
the send queue. When the ingress bandwidth is saturated, the request manager creates additional
network communication channels to send requests and receive data.
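The chaining of queued requests by destination delegator can be sketched as below; the tuple layout is an assumption for illustration:

```python
from collections import defaultdict

def consolidate(send_queue):
    """Group queued read requests by destination delegator so that each
    destination receives one chained network message instead of one
    message per request. Each request is a (dest, offset, length) tuple."""
    by_dest = defaultdict(list)
    for dest, off, length in send_queue:
        by_dest[dest].append((off, length))
    return dict(by_dest)  # one outgoing message per destination
```

Grouping by destination is what cuts the number of network messages and channels from one-per-request down to one-per-peer-delegator.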
4.2.3 Server-Side Read Clustering and Pipelining
As discussed in Section 4.1, with multi-dimensional variables, a process can issue many small,
noncontiguous read requests for scattered data segments in each data log. Various I/O libraries
and tools have provided special support for such noncontiguous read access. For instance, POSIX
lio_listio allows read requests to be transferred in batches, and OrangeFS supports batched
read requests. While being able to combine small requests into a list or a large request, these
techniques mainly work from the client side and rely on the underlying storage system such as the
disk scheduler to prefetch or merge requests for fast data retrieval. However, there is still a lack of
distributed file systems that can globally optimize these batch read requests from all processes.
As an ephemeral file system in a batch job, BurstFS directly manages accesses to the datasets
from scientific applications via delegators. Therefore, besides leveraging the existing techniques
of batched reads from the client side, BurstFS can exploit its visibility of read requests at the
server side (via the I/O service manager) for further performance improvements. To this end, we
introduce a mechanism called server-side read clustering and pipelining (SSCP) in the I/O service
manager to improve the read performance of BurstFS.
SSCP addresses several concurrent, sometimes conflicting objectives: (1) the need to detect
spatial locality among read requests and combine them into large contiguous reads; and (2) the
Figure 4.4: Server-side read clustering and pipelining.
need to serve on-demand read requests as soon as possible for execution progress. As shown in
Fig. 4.4, SSCP provides two key components to achieve these objectives, a two-level request queue
for read clustering and a three-stage pipeline for fast data movement.
In the two-level request queue, SSCP first creates several categories of request sizes, ranging from
32KB to 1MB (see Fig. 4.4). Incoming requests will be inserted to the appropriate size category
either individually, or if contiguous with other requests, combined with the existing contiguous
requests and then inserted into the suitable size category. As shown in the figure, two contiguous
requests of 120KB and 200KB are combined by the service manager. Within each size category,
all requests are queued based on their arrival time. A combined request will use the arrival time
from its oldest member. For best scheduling efficiency, the category with the largest request size is
prioritized for service. Within the same category, the oldest request is served first. BurstFS
enforces a threshold on the wait time of each category (default 5 ms). If any category
has not been serviced for longer than this threshold, BurstFS selects the oldest read request from
that category for service and resets the category's wait time.
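The scheduling policy above can be sketched as a two-level queue; the combining of contiguous requests into larger ones is omitted here, and the category boundaries are the ones named in the text:

```python
import time
from collections import deque

CATEGORIES = [32 << 10, 64 << 10, 128 << 10, 256 << 10, 512 << 10, 1 << 20]
WAIT_THRESHOLD = 0.005  # 5 ms default wait-time threshold per category

class TwoLevelQueue:
    """Size-categorized request queue: largest category first, oldest
    request first within a category, with a starvation guard."""

    def __init__(self):
        self.queues = {c: deque() for c in CATEGORIES}
        self.last_served = {c: time.monotonic() for c in CATEGORIES}

    def insert(self, arrival, size):
        # Place the request in the smallest category that can hold it.
        cat = next((c for c in CATEGORIES if size <= c), CATEGORIES[-1])
        self.queues[cat].append((arrival, size))

    def next_request(self):
        now = time.monotonic()
        # Starvation guard: a category waiting past the threshold wins.
        for c in CATEGORIES:
            if self.queues[c] and now - self.last_served[c] > WAIT_THRESHOLD:
                self.last_served[c] = now
                return self.queues[c].popleft()
        # Otherwise serve the largest non-empty category, oldest first.
        for c in reversed(CATEGORIES):
            if self.queues[c]:
                self.last_served[c] = now
                return self.queues[c].popleft()
        return None
```

Serving large categories first favors efficient bulk reads from the burst buffer, while the per-category wait threshold bounds how long small requests can be deferred.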
The I/O service manager creates a memory pool to temporarily buffer outgoing data. This
facilitates the rearrangement of data segments for network transfer and allows the formulation of a
pipeline. Fig. 4.4 shows the three-stage data movement pipeline: reading, copying, and transferring.
In the reading stage, the I/O service manager picks up a request from the request list based on the
aforementioned scheduling policy and reads the requested data from the local burst buffer into a slot in
the memory buffer. In the copying stage, the data in the memory buffer is prepared as an outgoing
reply for the remote delegator, and then copied from the memory buffer to the network packet.
Data inside the memory buffer may need to be divided into multiple replies for different remote
delegators. The I/O service manager then creates multiple network replies, one for each delegator.
In the transferring stage, the I/O service manager can pack one or more network replies for the
same remote delegator into one network message (1MB maximum), and transmit (Xmit in Fig. 4.4)
it to the delegator.
4.3 Evaluation of BurstFS
4.3.1 Testbed
Our experiments were conducted on the Catalyst cluster [4] at Lawrence Livermore National
Laboratory (LLNL), consisting of 384 nodes. Each node is equipped with two 12-core Intel Xeon
E5-2695v2 processors, 128 GB DRAM and an 800-GB burst buffer comprised of PCIe SSDs.
Configuration: We focused on comparing BurstFS with two contemporary file systems: Or-
angeFS 2.8.8, and the Parallel Log-Structured File System 2.5 (PLFS [32]). As a representative
parallel file system (PFS), OrangeFS stripes each file over multiple storage servers to enable parallel
I/O with high aggregate bandwidth. In our experiments, we established OrangeFS server instances
across all the compute nodes allocated to a job to manage all the node-local SSDs. PLFS is de-
signed to accelerate N-1 writes by transforming random, dispersed, N-1 writes into sequential N-N
writes in a log-structured manner. Data written by each process are stored on the backend PFS as
a log file. In our experiments, we used OrangeFS (over node-local SSDs) as the backend PFS for
PLFS. We used PLFS’s MPI interface for read and write.
Since version 2.0, PLFS has had burst buffer support. In PLFS with burst buffer support
(referred to as "PLFS burst buffer" in the rest of this chapter), instead of writing the log file on the
backend PFS, processes store their metalinks on the backend PFS, which point to the real location
of their log files in the burst buffers. This allows each process to write its log file to the burst
buffer instead of the backend PFS. In our experiments, we had each process write to its node-local
SSD, and the location was recorded in the metalink stored on the center-wide Lustre parallel file
system. This configuration can deliver scalable write bandwidth. In order to read data from the
PLFS burst buffer, each node-local SSD has to be mounted on all other compute nodes as a global
file system (e.g., NFS), which requires system administrator support. A primary goal for BurstFS
is that it be completely controllable from user space, including mounting the file system. Thus, due
to the requirement of administrator intervention to establish the cross-mount environment for reads
with the PLFS burst buffer, we only evaluated the write scalability of PLFS burst buffer and include
this result in Section 4.3.2.
Benchmarks: We have employed microbenchmarks that exhibit three checkpoint/restart I/O
patterns (see Section 1.1.1 of Chapter 1). Note that an N-1 strided pattern is a case of 2-D scientific
I/O as depicted by Fig. 1.2 in Section 1.1.2 of Chapter 1.
To assess BurstFS’s potential to support scientific applications, we evaluated BurstFS using
I/O workloads extracted from MPI-Tile-IO [16] and BTIO [108]. MPI-Tile-IO is a widely adopted
benchmark used for simulating the workloads that exist in visualization and numerical applica-
tions. The two-dimensional dataset is partitioned into multiple tiles, each process rendering pixels
inside one tile. Developed by NASA Advanced Supercomputing Division, BTIO partitions a three-
dimensional array across a square number of processes, each process processing multiple Cartesian
subsets. In both workloads, all processes first write their data into a shared file, then read back
into their memory. To evaluate the support for a batch job of multiple applications, we employed
the Interleaved Or Random (IOR) benchmark [88] to read data provided by Tile-IO and BTIO
programs in the same job.
Figure 4.5: Comparison of BurstFS with PLFS and OrangeFS under different write patterns: (a) N-1 segmented write, (b) N-1 strided write, (c) N-N write.
4.3.2 Overall Write/Read Performance
We first evaluated the overall write/read performance of BurstFS. In this experiment, 16 pro-
cesses were placed on each node, each writing 64MB data following an N-1 strided, N-1 segmented,
or N-N pattern. After each process wrote all of its data, we used fsync to force all writes to be
synchronized to the node-local SSD. We set the stripe size on OrangeFS as 1MB and fixed the
transfer size at 1MB to align with the stripe size, and each file was striped across all nodes in
OrangeFS. This configuration gave OrangeFS the best read/write bandwidth among the tuning
choices we tried (e.g., the 64KB default stripe size).
Fig. 4.5 compares the write bandwidth with PLFS burst buffer (PLFS-BB), PLFS, and Or-
angeFS. In all three write patterns, both BurstFS and PLFS burst buffer scale linearly with process
count. This is because processes in both systems wrote locally and the write bandwidth of each
node-local SSD was saturated. While we also observed linear scalability in OrangeFS and PLFS,
their bandwidths increase at a much slower rate. This is because both PLFS and OrangeFS stripe
their file(s) across multiple nodes, which can cause degraded bandwidth due to contention when
different processes write to the same remote node. On average, BurstFS delivered 3.5×, 2.7×, and
1.3× the performance of OrangeFS for N-1 segmented, N-1 strided, and N-N patterns, respectively.
Its performance was 1.6×, 1.6×, and 1.5× the performance of PLFS, respectively, for the three
patterns.
We observed that PLFS initially delivered higher bandwidth than BurstFS at small process
counts (16 and 32 processes), for all three patterns. After further investigation, we found this
was because, internally, PLFS transformed the N-1 writes into N-N writes. However, when fsync
Figure 4.6: Comparison of BurstFS with PLFS and OrangeFS under different read patterns: (a) N-1 segmented read, (b) N-1 strided read, (c) N-N read.
was called to force these N-N files to be written to PLFS's back-end file system (i.e., OrangeFS),
OrangeFS did not completely flush the files to the SSDs before fsync returned; the measured
bandwidth was even higher than the aggregate SSD bandwidth of the local file systems. Fig. 4.6
compares the read bandwidth of BurstFS with OrangeFS and PLFS. Each process read 64MB data
under N-1 strided, N-1 segmented and N-N patterns. For the N-1 strided reads, we first created a
shared file using N-1 segmented writes, then read all data using the N-1 strided reads. In this way,
each process read data from multiple logs. In order to cluster the non-contiguous read requests
under this pattern, we used POSIX lio_listio to submit read requests to BurstFS in batches. In
the case of OrangeFS, when we enabled its list I/O operations, the observed bandwidth was about
half of that without list I/O. This is because OrangeFS list I/O
does not benefit large read operations. Thus, for this experiment, we report only the performance
of the N-1 strided pattern in OrangeFS without its list I/O support.
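Batching non-contiguous reads with POSIX lio_listio can be sketched as follows. This is a generic POSIX example (the helper name and file path are ours, not part of the BurstFS client): each aiocb carries its own offset, buffer, and length, and one call submits the whole list.

```c
#include <aio.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Submit `n` non-contiguous reads on `fd` in one batched call. The
 * caller pre-fills aio_offset, aio_buf, and aio_nbytes of each aiocb;
 * this helper fills in the opcode and file descriptor. */
int batch_read(int fd, struct aiocb *cbs, struct aiocb **list, int n) {
    for (int i = 0; i < n; i++) {
        cbs[i].aio_lio_opcode = LIO_READ;
        cbs[i].aio_fildes = fd;
        list[i] = &cbs[i];
    }
    /* LIO_WAIT blocks until every request in the list completes. */
    return lio_listio(LIO_WAIT, list, n, NULL);
}
```

On older glibc versions the program must be linked with -lrt; newer glibc folds the AIO routines into libc.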
As we can see from Fig. 4.6(a), the bandwidth of the N-1 segmented read scales linearly with
process count for BurstFS, since each process read all data directly from its local node. In contrast,
both PLFS and OrangeFS read data from remote nodes, losing the benefit from locality. On the
other hand, the bandwidth of N-1 strided read in Fig. 4.6(b) increases at a much slower rate in
BurstFS compared with segmented read. This is because the strided read pattern resulted in
higher contention due to all-to-all reads from remote burst buffers. BurstFS with N-1 strided read
still scales better and outperforms both OrangeFS and PLFS. This is because instead of servicing
each request individually, BurstFS delegators clustered read requests from numerous processes and
served them through a three-stage read pipeline. On average, BurstFS delivered 2.2×, 2.5× and
[Figure: two panels plotting throughput (MB/s, log scale) against transfer size (1KB-1024KB): (a) Write, comparing BurstFS, OrangeFS, and PLFS; (b) Read, comparing BurstFS, OrangeFS, OrangeFS_List, and PLFS.]
Figure 4.7: Comparison of BurstFS with PLFS and OrangeFS under different transfer sizes.
1.4× the performance of OrangeFS, respectively, for N-1 segmented, N-1 strided and N-N patterns.
It delivered 1.6×, 1.4× and 1.6× the performance of PLFS, respectively, for the three patterns.
4.3.3 Performance Impact of Different Transfer Sizes
Fig. 4.7 shows the impact of transfer sizes on the bandwidth of BurstFS. We focused on N-1
strided I/O, because it is a challenging I/O pattern. Similar to the experiment in Section 4.3.2,
for BurstFS strided read operations, we first created a shared file using N-1 segmented writes and
then read the data back using an N-1 strided pattern. In this way, BurstFS will not benefit from
local reads.
The results in Fig. 4.7(a) demonstrate the impact of transfer sizes on write bandwidth when
64 processes wrote to a shared file. BurstFS outperformed OrangeFS and PLFS by having each
process write data locally, and it delivered outstanding performance improvement at small transfer
sizes, for example, 24.4× and 16.7× at 1KB compared to OrangeFS and PLFS, respectively. This
is because both PLFS and OrangeFS suffered from the cost of competing writes and repeated data
transfers to the shared remote SSDs.
Fig. 4.7(b) shows the impact of transfer size on read bandwidth. For small read requests,
OrangeFS provides list I/O support so that a list of read requests can be combined into one
function call. The results of this type of read operation are shown in Fig. 4.7(b) as OrangeFS List.
As we can see from this figure, although OrangeFS List enhances the performance of small reads,
its bandwidth is still lower than that of BurstFS. This is because of the additional benefits of server-side clustering
and pipelining in BurstFS. Overall, BurstFS yielded up to 10.2×, 3× and 12.3× performance
improvement compared to OrangeFS, OrangeFS List and PLFS, respectively.
[Figure: two panels plotting look-up time (s) for BurstFS, MDHIM, and PLFS: (a) metadata performance with varying transfer sizes (1KB-1024KB), (b) metadata performance with varying process counts (16-1024).]
Figure 4.8: Analysis of metadata performance as a result of transfer size and process count.
4.3.4 Analysis of Metadata Performance
As discussed in Section 4.2.1, BurstFS distributes the global metadata indices over a distributed
key-value store. During file open, each process in PLFS needs to construct a global view of a
shared file by reading and combining metadata from other processes. After this step, all look-up
operations are conducted locally. To evaluate the benefit of our design, we compared the metadata
look-up time of BurstFS with that of PLFS (i.e. PLFS’s total time on index construction during
file open and local look-up during read), as well as the original look-up time from the MDHIM
functions. We examined the look-up performance using both cursor and batch get functions from
MDHIM. Each cursor operation triggers a round-trip transfer for each key-value pair, and a look-up
for a range can invoke multiple cursor operations, as described in Section 4.2.1. The total look-up
time with cursor operations was therefore significantly higher than in the other cases; for instance, it took 81 seconds for the 4KB case
in Fig. 4.8(a). We thus omit the look-up time with cursor operations from our figures.
Fig. 4.8(a) compares the look-up time of BurstFS, PLFS, and MDHIM batch get (denoted as
MDHIM). In all three cases, we launched 32 processes, each to look up the locations of 64MB data
written under the N-1 strided pattern. The total data volume is the product of the transfer size
and the number of segments. Thus, a smaller transfer size will lead to more segments, therefore
more indices. As we can see from the figure, the look-up time of all cases drops along with the
increasing transfer size. This is because of fewer metadata look-ups. The look-up time of BurstFS
is significantly lower than that of PLFS. This validates that the scalable metadata indexing technique in
BurstFS can quickly establish a global view of the metadata for a shared file. In contrast, every
process in PLFS has to load all the indices generated during the write phase and construct the global indices
for reading; this all-to-all load dominated the look-up time. BurstFS also outperformed MDHIM
by minimizing the number of read operations with only one sequential scan at the range server
because of its support for parallel range queries (see Section 4.2.1). On average, BurstFS reduced
the look-up time by 77% and 58% compared with PLFS and MDHIM, respectively.
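The relationship between transfer size, index count, and look-up round trips can be sketched with a back-of-the-envelope calculation (the batch size below is a hypothetical parameter, not an MDHIM default): halving the transfer size doubles the number of segments and therefore the number of indices, and a cursor interface pays one round trip per index while a batched range query amortizes many indices per round trip.

```c
#include <assert.h>
#include <stddef.h>

/* Number of index entries for a region written with a fixed transfer
 * size: one entry per segment. */
size_t num_segments(size_t region_bytes, size_t transfer_bytes) {
    return region_bytes / transfer_bytes;
}

/* Cursor look-ups cost one round trip per key-value pair. */
size_t cursor_round_trips(size_t segments) { return segments; }

/* A batched range query retrieves up to `batch` entries per round
 * trip (ceiling division). */
size_t batched_round_trips(size_t segments, size_t batch) {
    return (segments + batch - 1) / batch;
}
```

For the 64MB-per-process, 64KB-transfer configuration in this experiment, each process generates 1024 indices, so cursor look-ups cost 1024 round trips while batching reduces that by the batch factor.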
Fig. 4.8(b) shows the metadata performance with an increasing process count. In this test, each
process looked up 64 MB data written with a transfer size of 64 KB. More processes led to more
look-up operations. As shown in the figure, the look-up time of PLFS increases sharply with the
process count. In contrast, the look-up time for BurstFS and MDHIM increases slowly with more
processes, because of the use of a distributed key-value store for metadata.
4.3.5 Tile-IO Test
Fig. 4.9(a) shows the performance of BurstFS with Tile-IO. In this experiment, a 32GB global
array was partitioned over 256 processes. Each process first wrote its tile to several non-contiguous
regions of the shared file, then read it back to its local memory. For write operations, BurstFS
outperformed OrangeFS and PLFS by directly writing data to local SSDs. For reads, although all
three file systems benefited from the buffer cache, BurstFS still performed best since each process
read data locally. Overall, BurstFS delivered 6.9× and 2.5× improvement over OrangeFS for
reads and writes, respectively, and 7.3× and 1.4× improvement over PLFS for reads and writes,
respectively.
4.3.6 BTIO Test
Fig. 4.9(b) shows the performance of BurstFS under the BTIO workload with problem size D.
In this experiment, the 408 × 408 × 408 global array was decomposed over 64 processes. Similar to
Tile-IO, each process first wrote its own cells to several noncontiguous regions of the shared file, then
read them back to its local memory. Due to the 3-D partitioning, the transfer size (2040B) of each
process is much smaller than Tile-IO (32KB), so the I/O bandwidth of both PLFS and OrangeFS
[Figure: two bar charts of bandwidth (GB/s) for write and read with BurstFS, OrangeFS, and PLFS: (a) performance of Tile-IO, (b) performance of BTIO (problem size D).]
Figure 4.9: Performance of Tile-IO and BTIO.
with BTIO decreases rapidly, compared with Tile-IO. BurstFS sustains this small-message workload
with the benefits of local reads and server-side read clustering. Overall, it delivered 15.6× and 9.5×
performance improvement over OrangeFS for reads and writes, respectively. It also outperforms
PLFS by 16.2× and 7×, respectively, for reads and writes.
4.3.7 IOR Test
In order to evaluate the support for data sharing among different programs in a batch job, we
conducted a test with IOR. We ran IOR with a varying number of processes reading a shared file
written by another set of processes from a Tile-IO program. Processes in both MPI programs were
launched in the same job. Each node hosted 16 Tile-IO processes and 16 IOR processes. Once
Tile-IO processes completed writing to the shared file, this file was read back by IOR processes
using the N-1 segmented read pattern. We kept the same transfer size of IOR as Tile-IO. Since the
read pattern did not match the initial write pattern of Tile-IO, each process read from multiple
logs on remote nodes. We fixed the size of each tile at 128MB and the number of tiles along the Y
axis at 4, and then increased the number of tiles along the X axis with the number of reading
processes. Fig. 4.10(a) compares the read
bandwidth of BurstFS with PLFS and OrangeFS. Both PLFS and OrangeFS are vulnerable to the
small transfer size (32KB). BurstFS maintains high bandwidth by locally combining small
requests and by server-side read clustering and pipelining. On average, when reading data produced
by Tile-IO, BurstFS delivered 2.3× and 2.5× the performance of OrangeFS and PLFS, respectively.
We also evaluated the read bandwidth of IOR over the dataset generated by BTIO, using two
BTIO classes D and E. For Class D, we used 64 processes to write an array of 408 × 408 × 408 to
a shared file. For Class E, 225 processes wrote an array of 1020 × 1020 × 1020 to a shared file. In
both cases, the shared file was then read back by the IOR processes using the N-1 segmented read
pattern. Fig. 4.10(b) shows the read bandwidth. Due to the much smaller transfer size (2040B for
Class D and 2720B for Class E), the bandwidths of OrangeFS and PLFS with BTIO are much lower
than with Tile-IO. While the performance of BurstFS is also impacted by the small transfer size, it
delivers much better bandwidth to these small requests. On average, when reading data produced
by BTIO, BurstFS delivered 10× and 12.2× performance improvement compared to OrangeFS and
PLFS, respectively.
[Figure: (a) read bandwidth (GB/s) of IOR against the number of processes (16-1024) on the shared file written by Tile-IO; (b) read bandwidth (GB/s) of IOR on the shared file written by BTIO for problem sizes D and E; both panels compare BurstFS, OrangeFS, and PLFS.]
Figure 4.10: Read bandwidth of IOR.
4.4 Related Work
The importance of burst buffers is shown by their inclusion in the blueprint of many next-
generation supercomputers [2, 5, 20, 21, 22, 25] with a broad investment in supporting software.
DataWarp [9], IME [13] and aBBa [1] are three ongoing projects at Cray, DDN and EMC, respectively. Their
potential benefits have been explored from various research angles [65, 87, 103]. All these works
target remote, shared burst buffers. In contrast, our work centers on node-local burst buffers, an
equally important architecture that currently lacks standardized support software. Compared with
work on remote burst buffers, our work delivers linear scalability for checkpointing/restart since
most I/O requests are serviced locally. PLFS burst buffer [31] supports node-local burst buffers
(see Section 4.3.1) and can deliver fast, scalable write performance. It relies on a global file system
(e.g., Lustre, NFS) to manage metalinks, which can be an overhead if the number of metalinks
is large. In addition, reading data from PLFS burst buffer requires each of the node-local burst
buffers to be mounted across all compute nodes. BurstFS differs from PLFS burst buffer in that
it is structured as a standalone file system. BurstFS achieves scalable read performance using the
collective services of its delegators. Moreover, BurstFS is specialized for managing node-local burst
buffers, while PLFS burst buffer supports both the node-local burst buffers and remote shared
burst buffers.
The I/O bandwidth demand from checkpoint/restart has been increasing on par with the com-
puting power. SCR [76], CRUISE [83] and FusionFS [113] are notable efforts designed to address
this increasing I/O challenge and achieve linear write bandwidth by having each process write in-
dividual files to node-local storage (N-N). Different from these works, BurstFS supports both N-1
and N-N I/O patterns and delivers scalable read/write bandwidth for both patterns. Multidimen-
sional I/O has long been a challenging workload for parallel file systems. The small, non-contiguous
read/write requests issued from individual processes can dramatically constrain parallel file system
bandwidth. Several state-of-the-art approaches have been developed to address this issue. PLFS
accelerates small, non-contiguous N-1 writes by transforming them into sequential, contiguous N-N
writes [32]. However, PLFS (without burst buffer support) still relies on a back end parallel file
system to store the individual files from the N-N writes. Contention can occur when two files
are striped on the same storage server. In contrast, BurstFS provides an independent file system
service. It addresses write contention via local writes, and is optimized for read-intensive work-
loads. Two-phase I/O [94] is another widely adopted approach to optimize small, non-contiguous
I/O workloads. All processes send their I/O requests to aggregators, which consolidate them into
large, contiguous requests. The read service of BurstFS has some similarity to two-phase I/O: its
delegators are akin to making the I/O aggregators used in two-phase I/O into a service. How-
ever, there are two key distinctions. First, the consolidations of BurstFS are directly conducted
at the file system instead of the aggregators. This avoids the extra transfer from aggregators to
client processes. Second, the consolidation is done by each delegator individually without extra
synchronization overhead.
Cross-application data sharing is a daunting topic since many contemporary programming mod-
els (e.g. MPI, PGAS) define separate name spaces for each program. A widely adopted approach
is leveraging existing distributed systems, such as distributed file systems (e.g. Lustre [35], Or-
angeFS [34], HDFS [90]) and distributed key-value stores (e.g. Memcached [80], Dynamo [46],
BigTable [39]). However, these services are usually distant from computing processes, yielding lim-
ited bandwidth. In addition, the heavy overhead from start up, tear down, and management makes
them unsuitable to be co-located with applications in batch jobs. On the other hand, several
service programs have been developed to run alongside applications in batch jobs. Docan et al. [47] devel-
oped DART, a communication framework that enables data sharing via separate service processes
located on a different set of nodes from the simulation applications (in the same job). Their later
work DataSpaces [47] extends the original design. In both studies, application processes write to
and read from the service process in an ad hoc manner. Each operation requires a separate net-
work transfer. In contrast, the delegator in BurstFS is designed as an I/O proxy process co-located
with application processes on the same node. All writes are local. Reads are deferred to the I/O
delegator, which provides many opportunities to optimize the read operations.
4.5 Summary
In this chapter, we examined the requirements of data management for node-local burst buffers,
a critical topic since node-local burst buffers are in the designs of next-generation, large-scale su-
percomputers. Our approach to managing node-local burst buffers is BurstFS, an ephemeral burst
buffer file system with the same lifetime as batch jobs and designed for high performance with HPC
I/O workloads. BurstFS can be used by multiple applications within the same job, sequentially as
with checkpoint/restart, or concurrently as with ensemble applications. We implemented several
techniques in BurstFS that greatly benefit challenging HPC I/O patterns: scalable metadata in-
dexing, co-located I/O delegation, and server-side read clustering and pipelining. These techniques
ensure scalable metadata handling and fast data transfers. Our performance results demonstrate
that BurstFS can efficiently support a variety of challenging I/O patterns. Particularly, it can
support shared file workloads across distributed, node-local burst buffers with performance very
close to that for non-shared file workloads. BurstFS also scales linearly for parallel write and read
bandwidth and outperforms the state-of-the-art by a significant margin.
CHAPTER 5
TRIO: RESHAPING BURSTY WRITES
5.1 Introduction
To address the I/O contention issues described in Chapter 2, we designed TRIO, a burst buffer
based orchestration framework to reshape the bursty writes in a contention-aware manner. Previous
efforts to mitigate I/O contention generally fall into two categories: client-side and server-side
optimizations. Client-side optimizations mostly resolve I/O contention within a single application, by
buffering the dataset in a staging area [27, 78] or optimizing the application's I/O pattern [42]. Server-
side optimizations generally embed their solutions inside the storage server, overcoming issues of
contention by dynamically coordinating data movement among servers [91, 45, 112].
In this chapter, we describe a novel burst buffer based I/O orchestration framework named TRIO
to address I/O contention. Compared with the client-side optimizations, an orchestration frame-
work on burst buffers is able to coordinate I/O traffic between different jobs, mitigating I/O con-
tention at a larger scope. Compared with the server-side optimization, an orchestration framework
on burst buffers can free storage servers from the extra responsibility of handling I/O contention,
making it portable to other PFSs. As the name suggests, TRIO orchestrates the bursty
writes among three components: computing processes, burst buffers and the PFS. This is accomplished
by two component techniques: Stacked AVL Tree based Indexing (STI) and Contention-Aware
Scheduling (CAS). STI organizes the checkpointing write requests inside each burst buffer according
to their physical layout among storage servers and assists data flush operations with enhanced
sequentiality. CAS orchestrates the flush operations of all burst buffers to mitigate I/O contention. Taken
together, our contributions are three-fold.
• We have conducted a comprehensive analysis on two issues that are associated with check-
pointing operations in HPC systems, i.e., degraded bandwidth utilization of storage servers
and prolonged average job I/O time.
• Based on our analysis, we propose TRIO to orchestrate applications’ write requests that are
buffered in BB for enhanced I/O sequentiality and alleviated I/O contention.
• We have evaluated the performance of TRIO using representative checkpointing patterns.
Our results revealed that TRIO efficiently utilized storage bandwidth and reduced average
job I/O time by 37%.
The rest of this chapter is organized as follows. Section 5.2 and Section 5.3 respectively present
the design and implementation of TRIO. Section 5.4 evaluates the benefits of TRIO. Related work
and conclusion are discussed in Section 5.5 and Section 5.6.
5.2 Design of TRIO
The two I/O contention issues discussed in Section 2.2 in Chapter 2 result from direct and
eager interactions between applications and storage servers on PFS. Many computing platforms
have introduced burst buffers as an intermediate layer to absorb bursty writes. Buffering a large
checkpoint dataset gives more visibility into the I/O traffic, which provides a chance to intercept
and reshape the pattern of I/O operations on the PFS. However, existing works generally use the burst
buffer as a middle layer to avoid applications' direct interaction with the PFS [65]; few works [103] have
discussed the interaction between BB and PFS, i.e., how to orchestrate I/O so that large datasets
can be efficiently flushed from BB to PFS. To this end, we structure TRIO to coordinate the I/O
traffic from compute nodes to burst buffer and to storage servers. In the rest of the section, we first
highlight the main idea of TRIO through a comparison with a reactive data flush approach for BB
management; then we detail two key techniques in Section 5.2.2 and Section 5.2.3.
5.2.1 Main Idea of TRIO
Fig. 5.1(a) illustrates a general framework of how BBs interact with PFS. On each Compute
Node (CN), 2 processes are checkpointing to a shared file that is striped over 2 storage servers. A1,
A2, A3, A4, B5, B6, B7 and B8 are contiguous file segments. These segments are first buffered on
the BB located on each CN during checkpointing, then flushed from BB to two storage servers on
PFS.
An intuitive strategy is to reactively flush the datasets to the PFS as they arrive at the BB. Fig.
5.1(b) shows the basic idea of such a reactive approach. This approach has two drawbacks. First,
[Figure: three panels illustrating data flush from two compute nodes, each with two processes writing contiguous file segments A1-A4 (Node-A) and B5-B8 (Node-B) through node-local burst buffers to two storage servers: (a) burst buffer framework with data flush, (b) reactive data flush, (c) proactive data flush with TRIO, using Server-Oriented Data Organization (STI) within each BB and Inter-BB Flush Ordering (CAS) across BBs.]
Figure 5.1: A conceptual example comparing TRIO with the reactive data flush approach. In (b), reactive data flush incurs unordered arrival (e.g., B7 arrives at Server 1 earlier than B5) and interleaved requests from BB-A and BB-B. In (c), Server-Oriented Data Organization increases sequentiality while Inter-BB Flush Ordering mitigates I/O contention.
directly flushing the unordered segments from each BB can degrade the chance of sequential writes.
We refer to this chance as sequentiality. In Fig. 5.1(a), segments B5 and B7 are contiguously laid out
on Storage Server 1, but they arrive at BB-B out of order in Fig. 5.1(b). Due to reactive flushing,
B7 will be flushed earlier than B5, losing the opportunity to retain sequentiality. Second, multiple
BBs can compete for access to the same storage server. As indicated by this figure, BB-A and
BB-B concurrently flush A4 and B8 to Server 2, so the two segments are interleaved; their arrival
order is against their physical layout on Server 2 (see Fig. 5.1(a)). This will degrade the bandwidth
with frequent disk seeks. In a multi-job environment, segments to a storage server come from files
of different jobs. Interleaved accesses from different applications to the shared storage servers can
prolong the average job I/O time and delay the timely service for mission-critical and small jobs.
In contrast to this reactive data flush approach, we propose a proactive data flush framework,
named TRIO, to address these two drawbacks. Fig. 5.1(c) gives an illustrative example of how it
enhances the sequentiality in flushed data stream and mitigates contention on storage server side.
Before flushing data, TRIO follows a server-oriented data organization to group together segments
to each storage server and establishes an intra-BB flushing order based on their offsets in the file.
This is realized through a server-oriented and stacked AVL Tree based Indexing (STI) technique,
which is elaborated in Section 5.2.2. In this figure, B5 and B7 in BB-B are organized together
and flushed sequentially, which enhances sequentiality on Server 1. Meanwhile, localizing BB-B’s
writes on Server 1 minimizes its interference on Server 2 during this interval. Similarly, BB-A
organizes A2 and A4 together and flushes them sequentially to Server 2, minimizing its interference
on Server 1. However, contention can arise if both BB-A and BB-B flush to the same servers. TRIO
addresses this problem using a second technique, Contention-Aware Scheduling (CAS), as discussed
in Section 5.2.3. CAS establishes an inter-BB flushing order that specifies which BB should flush
to which server each time. In this simplified example, BB-A flushes its segments to Server 1 and
Server 2 in sequence, while BB-B flushes to Server 2 and Server 1 in sequence. In this way, during
the time periods t1 and t2, each server is accessed by a different BB, avoiding contention. More
details about these two optimizations are discussed in the rest of this section.
5.2.2 Server-Oriented Data Organization
As mentioned earlier, directly flushing unordered segments to PFS can degrade I/O sequentiality
on servers. Many state-of-the-art storage systems apply tree-based indexing [84, 85] to increase
sequentiality. These storage systems leverage conventional tree structures (e.g. B-Tree) to organize
file segments based on their locations on the disk. Sequential writes can be enabled by in-order
traversal of the tree.
Although it is possible to organize all segments in a BB using a conventional tree structure (e.g.,
indexing only by offset), doing so would result in a flat metadata namespace, which cannot satisfy the
complex semantic requirements of TRIO. For instance, as mentioned in Section 5.2.1, sequentially
flushing all the file segments under a given storage server together is beneficial. To accomplish
this I/O pattern, BB needs to group all segments on the same storage server together. Meanwhile,
since these segments can come from different files (e.g. File1, File2, File3 on Server 1 in Fig. 5.2),
sequential flush requires BB to group together segments of the same file and then order these
segments based on the offset. Accomplishing the aforementioned purpose using a conventional tree
structure requires a full tree traversal to retrieve all the segments belonging to a given server and
group these segments for different files.
We introduce a technique called Stacked Adelson-Velskii and Landis (AVL) Tree based Index-
ing (STI) [103] to address these requirements. Like many other conventional tree structures, an
AVL tree [57] is a self-balancing tree that supports lookup, insertion and deletion in logarithmic
[Figure: a stacked AVL tree inside a burst buffer: a server-ID layer (Server1-Server3) stacked over a file layer (File1-File3), stacked over an offset layer (offset1-offset3); each offset node points to the segment's raw data in the data store.]
Figure 5.2: Server-Oriented Data Organization with Stacked AVL Tree. Segments of each server can be sequentially flushed following in-order traversal of the tree nodes under this server.
complexity. It can also deliver an ordered node sequence following an in-order traversal of all tree
nodes. STI differs in that all the tree nodes are organized in a stacked manner. As shown in Fig. 5.2,
this example of a stacked AVL tree enables two semantics: sequentially flushing all segments of a
given file (e.g., offset1, offset2, and offset3 of File1), and sequentially flushing all files in a given
server (e.g., File1, File2, and File3 of Server1). The semantic of server-based flushing is stacked on
top of the semantic of file-based flushing. STI is also extensible for new semantics (e.g. flushing all
segments under a given timestamp) by inserting a new layer (e.g. timestamp) in the tree.
The stacked AVL tree of each BB is dynamically built during runtime. When a file segment
arrives at BB, three types of metadata that uniquely identify this segment are extracted: server ID,
file name, and offset. BB first looks up the first layer (e.g. the layer of server ID in Fig. 5.2) to check
if the server ID already exists (it may exist if another segment belonging to the same server has
already been inserted). If not, a new tree node is created and inserted in the first layer. Similarly,
its file name and offset are inserted in the second and third layers. Once the offset is inserted as
a new tree node in the third layer (there is no identical offset under the same file because of the
append-only nature of checkpointing), this tree node is associated with a pointer (see Fig. 5.2) that
points to the raw data of this segment in the data store.
With this data structure, each BB can sequentially issue all segments belonging to a given
storage server by in-order traversal of the subtree rooted at the server node. For instance, flushing
all segments to Server 1 in Fig. 5.2 can be accomplished by traversing the subtree of the node
“Server 1”, sequentially retrieving and writing the raw data of all segments (e.g. raw data pointed
by offset1, offset2, offset3) of all the files (e.g. file1, file2, file3). Once all the data in a given server
is flushed, all the tree nodes belonging to this server are trimmed.
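The stacked server → file → offset layout and its in-order flush traversal can be sketched as follows. To keep the example short we substitute sorted linked lists for the AVL trees, so insertion is linear rather than logarithmic; the stacked layering and the sequential flush order are the same, and all type and function names are ours.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Three stacked, ordered layers: server ID over file name over offset.
 * Each offset node carries a pointer to the segment's raw data. */
typedef struct Seg  { size_t offset; const char *data; struct Seg *next; } Seg;
typedef struct File { const char *name; Seg *segs; struct File *next; } File;
typedef struct Srv  { int id; File *files; struct Srv *next; } Srv;

/* Insert a segment, creating server and file nodes on first sight. */
Srv *sti_insert(Srv *root, int srv, const char *fname,
                size_t off, const char *data) {
    Srv **sp = &root;
    while (*sp && (*sp)->id < srv) sp = &(*sp)->next;
    if (!*sp || (*sp)->id != srv) {           /* new server node */
        Srv *n = calloc(1, sizeof *n);
        n->id = srv; n->next = *sp; *sp = n;
    }
    File **fp = &(*sp)->files;
    while (*fp && strcmp((*fp)->name, fname) < 0) fp = &(*fp)->next;
    if (!*fp || strcmp((*fp)->name, fname) != 0) { /* new file node */
        File *n = calloc(1, sizeof *n);
        n->name = fname; n->next = *fp; *fp = n;
    }
    Seg **gp = &(*fp)->segs;                  /* keep offsets sorted */
    while (*gp && (*gp)->offset < off) gp = &(*gp)->next;
    Seg *s = malloc(sizeof *s);
    s->offset = off; s->data = data; s->next = *gp; *gp = s;
    return root;
}

/* In-order traversal under one server: files in name order, segments
 * in ascending offset, i.e., the sequential flush order. Writes the
 * visited offsets into `out` and returns how many were visited. */
int sti_flush_order(Srv *root, int srv, size_t *out, int max) {
    int n = 0;
    for (Srv *s = root; s; s = s->next)
        if (s->id == srv)
            for (File *f = s->files; f; f = f->next)
                for (Seg *g = f->segs; g && n < max; g = g->next)
                    out[n++] = g->offset;
    return n;
}
```

In the real STI each layer is an AVL tree, so look-up and insertion stay logarithmic even with many files and segments per server, and flushed subtrees are trimmed afterwards.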
Our current design for data flush is based on a general BB use case. That is, after an application
finishes one or multiple rounds of computation, it dumps the checkpointing dataset to BB, and
begins the next round of computation. Though we use a proactive approach in reshaping the I/O
traffic inside BB, flushing checkpointing data to PFS is still driven by the demand from applications.
After flushing, the storage space on BB is reclaimed entirely. We leave it as future work to
investigate a more aggressive, automatically triggered flushing mechanism inside the burst
buffer.
5.2.3 Inter-Burst Buffer Ordered Flush
Server-oriented organization enhances sequentiality by allowing each BB to sequentially flush
all file segments belonging to one storage server each time. However, contention can arise when
multiple BBs flush to the same storage server. For instance, in Fig. 5.1(c), contention on Storage
Server 2 can happen if BB-A and BB-B concurrently flush their segments belonging to Storage
Server 2 without any coordination, leading to multiple concurrent I/O operations at Storage Server
2 within a short period. We address this problem by introducing a technique called Contention-
Aware Scheduling (CAS). CAS orders all BBs’ flush operations to minimize competitions for each
server. For instance, in Fig. 5.1(c), BB-A flushes to Server 1 and Server 2 in sequence, while BB-B
flushes to Server 2 and Server 1 in sequence. This ordering ensures that, within any given time
period, each server is accessed by only one BB. Although the flushing order could be decided statically
before all BBs start flushing, this approach requires all BBs to synchronize before flushing, and the
result is unpredictable under real-world workloads. Instead, CAS follows a dynamic approach,
which adjusts the order during the flush in a bandwidth-aware manner.
Bandwidth-Oriented Data Flushing. In general, each storage server can only support a
limited number of concurrently flushing BBs before its bandwidth is saturated. In this chapter, we
refer to this threshold as α, which can be measured via offline characterization. For instance, our
experiment reveals that each OST on Spider II [81] is saturated by the traffic from 2 compute
nodes; thus, setting α to 2 can deliver maximized bandwidth utilization on each OST. Based on
this bandwidth constraint, we propose a Bandwidth-aware Flush Ordering (BFO) to dynamically
order the flush operations of each BB so that each storage server is used by at most α BBs.
For instance, in Fig. 5.1(c), BB-A buffers segments of Server 1 and Server 2. Assuming α = 1, it
needs to select a server that has not been assigned to any BB. Since BB-B is flushing to Server 2
at time t1, BB-A picks Server 1 and flushes the corresponding segments (A1, A3) to this server.
By doing so, contention on Server 1 and Server 2 is avoided, and consequently the two servers'
bandwidth utilization is maximized.
A key question is how to get the usage information of each server. BFO maintains this infor-
mation via an arbitrator located on one of the compute nodes. When a BB wants to flush to one
of its targeted servers, it sends a flushing request to the arbitrator. This request contains several
pieces of information about this BB, such as its job ID, job priority, and utilization. The arbitrator
then selects one of the targeted servers being used by fewer than α BBs, returns its ID to the BB,
and increases the usage count of this server by 1. The BB then flushes all of its data destined
for this server, and afterwards requests to flush to its other targeted servers; the arbitrator then
decreases the usage count of the old server by 1 and assigns another qualified server to this BB.
When no qualified server is available, the arbitrator temporarily queues the BB's request.
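The arbitrator protocol above can be sketched as follows. This is an illustrative Python sketch, not the TRIO C/MPI implementation; the class and method names (`Arbitrator`, `request`, `release`) are hypothetical.

```python
# Illustrative sketch of the BFO arbitrator. It grants a BB one of its
# targeted servers only if fewer than `alpha` BBs are currently flushing
# to that server; otherwise the request is queued until a server frees up.

class Arbitrator:
    def __init__(self, alpha):
        self.alpha = alpha     # max concurrent flushing BBs per server
        self.usage = {}        # server id -> number of BBs flushing to it
        self.waiting = []      # queued (bb_id, targets) requests, FIFO

    def request(self, bb_id, targets):
        """A BB asks to flush; return a granted server id, or None (queued)."""
        for server in targets:
            if self.usage.get(server, 0) < self.alpha:
                self.usage[server] = self.usage.get(server, 0) + 1
                return server
        self.waiting.append((bb_id, targets))
        return None

    def release(self, server):
        """A BB finished flushing to `server`; hand it to a queued BB, if any."""
        self.usage[server] -= 1
        for i, (bb_id, targets) in enumerate(self.waiting):
            if server in targets:
                self.waiting.pop(i)
                self.usage[server] += 1
                return (bb_id, server)
        return None

# Fig. 5.1(c) with alpha = 1: BB-B takes Server 2 first, so BB-A is
# steered to Server 1 and the two flushes proceed without contention.
arb = Arbitrator(alpha=1)
assert arb.request("BB-B", [2, 1]) == 2
assert arb.request("BB-A", [1, 2]) == 1
```

The FIFO scan in `release` is a simplification; the actual arbitrator also weighs job priority and job size when picking a waiting BB, as described under Job-Aware Scheduling below.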
Job-Aware Scheduling. In general, compute nodes greatly outnumber storage servers, so
there may be multiple BBs being queued for flushing to the same storage server. When this storage
server becomes available, the arbitrator needs to assign it to a proper BB. A naive approach would
be to select a BB in FCFS order. However, since each BB is allocated to one job along with its
compute node, treating all BBs equally can delay the service of critical jobs and prolong the I/O
time of small jobs. Instead, the arbitrator categorizes BBs based on their job priorities and job sizes.
It prioritizes service for BBs of high-priority jobs, including those designated as important from
the start and those of higher criticality (e.g., jobs in which some BBs have reached their capacity).
Among BBs of equal priority, it selects the one belonging to the smallest job (e.g., the job with the
smallest checkpointing data size) to reduce average job I/O time.
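A minimal sketch of this job-aware pick, assuming a numeric priority (larger is more urgent) and the job's total checkpoint size as the tie-breaker; the function and field names are our own illustration, not the TRIO implementation.

```python
# Hypothetical sketch of the arbitrator's job-aware pick for a freed
# server: prefer the highest job priority, then the smallest job
# (least checkpointing data) to reduce average job I/O time.

def pick_bb(waiting):
    """waiting: BBs queued for one server, each with 'priority'
    (larger = more urgent) and 'job_size' (total checkpoint bytes)."""
    return min(waiting, key=lambda bb: (-bb["priority"], bb["job_size"]))

waiting = [
    {"id": "BB-1", "priority": 0, "job_size": 16},   # low-priority job
    {"id": "BB-2", "priority": 1, "job_size": 64},   # high priority, large
    {"id": "BB-3", "priority": 1, "job_size": 8},    # high priority, small
]
assert pick_bb(waiting)["id"] == "BB-3"
```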
5.3 Implementation and Complexity Analysis
We have built a proof-of-concept prototype of TRIO using C with Message Passing Interface
(MPI). To emulate I/O in a multi-job environment, multiple communicators can be created among
the BBs, each corresponding to a disjoint set of BBs involved in a common I/O task. The bursty
writes from all these sets are orchestrated by the arbitrator. We also leverage the STI (see
Section 5.2.2) to organize all the data structures inside the arbitrator for efficient, semantics-based
lookup and update. For instance, when a storage server becomes available, the arbitrator needs to
assign it to a waiting BB that has data on it; this BB should belong to a job with higher priority
than the jobs of the other waiting BBs. Under these semantics, the profiles of jobs, storage servers,
and BBs are stacked as three layers in the STI. Assume a system with m BB-augmented compute nodes and n
storage servers, and each job uses k compute nodes. At most, this STI contains m/k + mn/k + mn
nodes, where m/k, mn/k, and mn are respectively the numbers of tree nodes in the three layers.
This means that, for a system with 10,000 compute nodes and 1,000 storage servers, the number
of tree nodes is at most 20M, incurring less than 1 GB of storage overhead. On the other hand,
the time spent in the arbitrator is dominated by its communication with each BB. Existing
high-speed interconnects (e.g., QDR InfiniBand) generally yield a round-trip latency lower than
10 µs for small messages (smaller than 1 KB) [75]; this means a single arbitrator is able to handle
the requests from 10^5 BBs within 1 second.
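The size bound can be checked with a quick back-of-the-envelope computation; the ~50 bytes per tree node used for the memory estimate is our assumption, made only to confirm the total stays under 1 GB.

```python
# Back-of-the-envelope check of the STI size bound m/k + mn/k + mn,
# with one node per job (k = 1) as the worst case.
m, n = 10_000, 1_000   # compute nodes (BBs) and storage servers
k = 1                  # compute nodes per job

nodes = m // k + m * n // k + m * n   # job + server + BB layers
assert nodes == 20_010_000            # ~20M tree nodes
assert nodes * 50 < 2**30             # < 1 GB at ~50 bytes per node
```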
5.4 Evaluation of TRIO
Experiments on TRIO were conducted on the Titan supercomputer. Of the 32GB memory on
each compute node, we reserved 16GB as the burst buffer space. We evaluated TRIO against the
workload from the IOR benchmark. Each test case was run 15 times for every data point, and the
median result was presented.
As discussed in Section 5.2.3, CAS mitigates contention by restricting the number of concurrent
BBs flushing to each storage server to α. In all our tests, we set α to 2 (the number which saturates
the bandwidth of each OST), thus limiting the number of BBs on each OST to at most two.
[Figure 5.3 data: Bandwidth (MB/s, 0-2500) vs. Number of Nodes (4-256); series: NOTRIO-N-1, TRIO-N-1, NOTRIO-N-N, TRIO-N-N]
Figure 5.3: The overall performance of TRIO under both inter-node and intra-node writes.
5.4.1 Overall Performance of TRIO
Fig. 5.3 demonstrates the overall benefit of TRIO under competing workloads with an increasing
number of IOR processes. We compared the aggregated OST bandwidth of two configurations. In
the first configuration (shown in Fig. 5.3 as NOTRIO), all the processes directly issued their write
requests to PFS. In the second configuration (shown in Fig. 5.3 as TRIO), all processes’ write
requests were buffered on the burst buffer space and flushed to PFS using TRIO. N-1 and N-N
patterns were employed in both configurations. We ran 16 processes on each compute node and
stressed the system by increasing the number of compute nodes involved from 4 to 256. Each
process wrote in total 1GB data. Both request size and stripe size were configured as 1MB. I/O
competition of all processes was assured by striping each file over the same four OSTs (default
stripe count).
As we can see from Fig. 5.3, in both N-1 and N-N patterns, bandwidth of NOTRIO dropped
with increasing number of processes involved. This was due to the exacerbated contention from
both intra-node and inter-node I/O traffic. By contrast, TRIO demonstrated much more stable
performance by optimizing intra-node traffic using STI and inter-node I/O traffic using CAS. The
lower bandwidth observed with 4 nodes than with larger node counts in both the TRIO-N-1 and
TRIO-N-N cases was due to OST bandwidth not being fully utilized (only 4 BBs were flushing to 4 OSTs
[Figure 5.4 data: Bandwidth (MB/s, 0-2000) for N-1 and N-N; series: NAIVE, BUFFER-NOOP, BUFFER-SEQ, BUFFER-STI, BUFFER-TRIO]
Figure 5.4: Performance analysis of TRIO.
in these cases). Overall, TRIO improved I/O bandwidth by 77% on average for the N-1 pattern and
by 139% for the N-N pattern.
5.4.2 Performance Analysis of TRIO
To gain more insight into the contributions of each technique of TRIO, we incrementally ana-
lyzed its performance based on the five configurations shown in Fig. 5.4. In NAIVE, each process
directly wrote its data to PFS. In BUFFER-NOOP, all processes’ write requests were buffered in
BB and flushed without any optimization. This configuration corresponds to the reactive approach
discussed in Section 5.2.1. In BUFFER-SEQ, all the buffered write requests were reordered accord-
ing to their offsets and flushed sequentially. In BUFFER-STI, all the write requests were organized
using the STI; each time, a random OST was selected under the AVL tree and all write requests
belonging to this OST were flushed sequentially. In BUFFER-TRIO, CAS was enabled on top of the
STI, restricting the number of concurrently flushing BBs on each OST to 2; this configuration
corresponds to the proactive approach discussed in Section 5.2.1. We evaluated the five configurations
using the workload of 8-node case (128 processes write to 4 OSTs) in Fig. 5.3. In this case, TRIO
reaped its most benefits since the number of flushing BBs was twice the number of OSTs.
As we can see from Fig. 5.4, simply buffering the dataset in BUFFER-NOOP was not able
to improve the performance over NAIVE due to issues discussed in Section 5.2.1. In contrast,
the sequential flush order in BUFFER-SEQ significantly outperformed NAIVE for both N-1 and
N-N patterns. Interestingly, although STI sequentially flushed the write requests to each OST, it
demonstrated no benefit over BUFFER-SEQ. This is because, without control over the number of
flushing BBs on each OST, each OST was flushed by an unbalanced number of BBs; the benefits of
localization and sequential flushing using the STI were offset by prolonged contention on the overloaded
OSTs. This issue was alleviated when CAS was enabled in BUFFER-TRIO: by placing 2 BBs on
each OST and localizing their flush, the bandwidth of each OST was more efficiently utilized than
BUFFER-SEQ.
5.4.3 Alleviating Inter-Node Contention
To evaluate TRIO’s ability to sustain inter-node I/O contention using CAS, we placed 1 IOR
process on each node and had each process dump a 16GB dataset to its local BB; these datasets were
then flushed to the PFS using TRIO. These configurations for N-1 and N-N are referred to as TRIO-N-1
and TRIO-N-N respectively. For comparison, we had each IOR process dump its 16GB in-memory
data directly to the underlying PFS. Such configurations for the two patterns are referred to as
NOTRIO-N-1 and NOTRIO-N-N respectively. Contention for both patterns was assured by striping
all files over the same 4 OSTs.
Fig. 5.5 reveals the bandwidth of both TRIO and NOTRIO with an increasing number of
IOR processes. In the N-1 case, the bandwidth of TRIO first grew from 1.4GB/s with 4 processes
to 1.7GB/s with 8 processes, then stabilized around 1.8GB/s. The stable performance with more
than 8 processes occurred because TRIO scheduled 2 concurrent BBs on each OST. Therefore, even
under heavy contention, each OST was being used by 2 BBs that consumed most OST bandwidth.
In contrast, the bandwidth of NOTRIO peaked at 1.4GB/s with 8 processes, then dropped to
1.1GB/s with 256 processes. This accounted for only 60% of the bandwidth delivered by TRIO
with 256 processes. This bandwidth degradation resulted from the inter-node contention generated
by larger numbers of processes. Overall, by mitigating contention, TRIO delivered a 35% bandwidth
improvement over NOTRIO on average.
We also observed similar trends for TRIO and NOTRIO in the N-N case: the bandwidth of TRIO
ascended from 1.5GB/s with 4 processes to 2.1GB/s with 8 processes, then stabilized from this
point on. The bandwidth of NOTRIO kept dropping as more processes were involved. These
performance trends resulted from the same causes as discussed for the N-1 case.
[Figure 5.5 data: Bandwidth (MB/s, 0-2500) vs. Number of Nodes (4-256); series: NOTRIO_N_1, TRIO_N_1, NOTRIO_N_N, TRIO_N_N]
Figure 5.5: Flush bandwidth under I/O contention with an increasing process count.
5.4.4 Effect of TRIO under a Varying OST Count
Applications sometimes stripe their files over a large number of OSTs to utilize more resources.
Though more OSTs can deliver higher aggregate bandwidth, conventionally issuing write requests
to servers in a round-robin fashion distributes each write request over more OSTs, incurring wider
contention and preventing I/O bandwidth from scaling further. We emulated this scenario by
striping each file over an increasing number of OSTs and
using double the number of IOR processes to write on these OSTs. Contention for both N-1 and
N-N patterns was assured by striping each file over the same set of OSTs.
Fig. 5.6 compares the bandwidth of TRIO and NOTRIO under this scenario. It can be observed
that the bandwidth of NOTRIO-N-1 increased sublinearly from 0.81GB/s with a stripe count of
2 to 27GB/s with a stripe count of 128. In contrast, the bandwidth of TRIO increased at a
much faster rate, resulting in a 38.6% improvement on average over NOTRIO. A similar trend
was observed with the N-N checkpointing pattern. By localizing the writes of each BB on one OST
each time and assigning the same number of BBs to each OST, CAS minimized the interference
between different processes, thereby better utilizing the bandwidth. Sometimes localization may
not help utilize more bandwidth. For instance, when the number of available OSTs is greater than
the number of flushing BBs, localizing on one OST may underutilize the supplied bandwidth. We
[Figure 5.6 data: I/O Bandwidth (GB/s, log scale 0.5-512) vs. Stripe Count (2-128); series: NOTRIO_N_1, NOTRIO_N_N, TRIO_N_1, TRIO_N_N]
Figure 5.6: Flush bandwidth with an increasing stripe count.
believe a similar approach can also work for these scenarios. For instance, we can assign multiple
OSTs to each BB, with each BB only distributing its writes among the assigned OSTs to mitigate
interference.
5.4.5 Minimizing Average I/O Time
As mentioned in Section 5.2.3, TRIO reduces average job I/O time by prioritizing small jobs.
To evaluate this feature, we grouped 128 processes into 8 jobs, each with 16 processes, and placed
1 process on each node. We had each process dump its dataset to its local BB and coordinated
the data flush using TRIO. When multiple BBs requested the same OST, TRIO selected a BB
via the Shortest Job First (SJF) algorithm, which first served a BB belonging to the smallest job.
This configuration is shown in Fig. 5.7 as TRIO SJF. For comparison, we applied FCFS in TRIO
to select a BB. This configuration served the first BB requesting this OST first, and we refer to
it as TRIO FCFS. We also included the result of having each process directly write its dataset to
PFS, which we refer to as NOTRIO. We varied the data size such that each process in the next
job wrote a separate file whose size was twice that of the prior job. Following this approach, each
process in the smallest job wrote a 128MB file, and each process in the largest job wrote a 16GB
file. To enable resource sharing, we striped the file so that each OST was shared by all 8 jobs.
[Figure 5.7 data: Average I/O Time (sec, 0-120) vs. Workload Per OST (WL1-WL4); series: NOTRIO, TRIO-FCFS, TRIO-SJF]
Figure 5.7: Comparison of average I/O time.
We increased the ratio of processes to OSTs to observe scheduling efficiency under different
workloads.
Fig. 5.7 reveals the average job I/O time for all three cases. Workload 1 (WL1), WL2, WL3, and
WL4 refer to scenarios when the number of processes was 2, 4, 8, and 16 times the number of OSTs,
respectively. The average I/O time of TRIO SJF was the shortest for all workloads, accounting
for on average 57% and 63% of TRIO FCFS and NOTRIO, respectively. We also observed that
the I/O time of TRIO SJF increased with growing workloads at a much slower rate than the other
two. This was because, with the heavier workload, each OST absorbed more data from each job.
This gave SJF more room for optimization. Another interesting phenomenon was that TRIO FCFS
demonstrated no benefit over NOTRIO in terms of the average I/O time. This was because, using
TRIO FCFS, once a BB acquired an available OST from the arbitrator, it drained all of its data
onto this OST. Since FCFS is unaware of job sizes, the requests from a large job were likely to be
scheduled first on a given OST; a small job requesting the same OST could
only start draining its data after the large job finished. This monopolizing behavior significantly
delayed small jobs’ I/O time.
[Figure 5.8 data: CDF of Time (%, 0-1) vs. Time (sec, 0-200); series: TRIO_FCFS, TRIO_SJF, NOTRIO]
Figure 5.8: CDF of job response time.
For further analysis, we also plotted the cumulative distribution function (CDF) of job response
time under WL4, as shown in Fig. 5.8. Job response time is defined as the interval between the
arrival time of a job's first request at the arbitrator and the time when the job completes its I/O task.
By scheduling small jobs first, 7 out of 8 jobs in TRIO-SJF were able to complete their work within
80 seconds. By contrast, jobs in TRIO-FCFS and NOTRIO completed at much slower rates.
Fig. 5.9 shows the total I/O time of draining all the jobs’ datasets. There was no significant
distinction between TRIO-FCFS and TRIO-SJF because, from the OSTs' perspective, each OST
handled the same amount of data in the two cases. By contrast, the I/O time of NOTRIO was longer
than the other two due to contention. The impact of contention became more significant under
larger workloads.
5.5 Related Work
I/O Contention: In general, research around I/O contention falls into two categories: client-
side and server-side optimization. In client-side optimization, processes involved in the same job
collaboratively coordinate their access to the PFS to mitigate contention. Abbasi et al. [27] and
Nisar et al. [78] addressed contention by delegating the I/O of all processes involved in the same
[Figure 5.9 data: Total I/O Time (sec, 0-250) vs. Workload Per OST (WL1-WL4); series: NOTRIO, TRIO-FCFS, TRIO-SJF]
Figure 5.9: Comparison of total I/O time.
application to a small number of compute nodes. Chen et al. [42] and Liao et al. [79] mitigated
I/O contention by having processes shuffle data in a layout-aware manner. These mechanisms
have been widely adopted in existing I/O middleware, such as ADIOS [71, 69] and MPI-IO [94].
Server-side optimization embeds some I/O control mechanisms on the server side. Dai et al. [45]
designed an I/O scheduler that dynamically places write operations among servers to avoid con-
gested servers. Zhang et al. [45] proposed a server-side I/O orchestration mechanism to mitigate
interference between multiple processes. Liu et al. [68] researched a low level caching mechanism
that optimizes the I/O pattern on hard disk drives. Different from these works, we address I/O
contention issues using BB as an intermediate layer. Compared with client-side optimization, an
orchestration framework on BB is able to coordinate I/O traffic between different jobs, mitigating
I/O contention at a larger scope. Compared with the server-side optimization, an orchestration
framework on BB can free storage servers from the extra responsibility of handling I/O contention,
making it portable to other PFSs.
Burst Buffer: The idea of BB was proposed recently to cope with the exploding data pressure in
the upcoming exascale computing era. Several next-generation HPC systems in the Coral project,
i.e. Summit [21], Sierra [20], Aurora [2], are designed with BB support. The SCR group is currently
trying to strengthen the support for SCR by developing a multi-level checkpointing scheme on top
of BB [18]. DDN and Cray are developing IME [13] and DataWarp [33], respectively as BB layers
to absorb the bursty read/write traffic from scientific applications. Most of these works use BB as
an intermediate layer to avoid applications' direct interaction with the PFS. The focal point of TRIO
is the interaction between the BB and the PFS, namely, how to efficiently flush data to the PFS.
Inter-Job I/O Coordination: Compared with the numerous research works on intra-job I/O
coordination, inter-job coordination has received very limited attention. Liu et al. [66] designed
a tool to extract the I/O signatures of various jobs to assist the scheduler in making optimal
scheduling decisions. Dorier et al. [50] proposed a reactive approach to mitigate I/O interference
from multiple applications by dynamically interrupting and serializing applications' execution upon
a performance decrease. Our work differs in that it coordinates inter-job I/O traffic in a layout-aware
manner to both avoid bandwidth degradation and minimize average job I/O time under contention.
5.6 Summary
In this chapter, we have analyzed the major performance issues of checkpointing operations on
HPC systems: prolonged average job I/O time and degraded storage server bandwidth utilization.
Accordingly, we have designed TRIO, a burst buffer based orchestration framework, to reshape
I/O traffic from burst buffer to PFS. By increasing intra-BB write sequentiality and coordinating
inter-BB flushing order, TRIO efficiently utilized storage bandwidth and reduced average job I/O
time by 37% in typical checkpointing patterns.
CHAPTER 6
CONCLUSIONS
In summary, this dissertation investigates burst buffer management strategies that accelerate bursty
scientific I/O workloads. We have designed strategies to manage both remote shared burst buffers
and node-local burst buffers. Because of their architectural differences, our strategies contribute to
these two types of burst buffers in distinct ways.
On one hand, when remote shared burst buffers are deployed on the I/O nodes, data movement
between applications and burst buffers needs to go through the network. Based on this feature, we
researched a strategy to fully exploit the high-speed interconnect for fast data transfer. Another
major advantage of this architecture is that data flushing from the remote shared burst buffers to the
backend PFS can be conducted without interfering with the computation on the compute nodes, so
we have explored a burst buffer based checkpointing framework that efficiently hides applications'
checkpoint time by overlapping computation with data flush operations. In contrast, when
node-local burst buffers are deployed on the individual compute nodes, the network transfer
overhead can be avoided by having each process write directly to its local burst buffer. These local
writes deliver scalable write bandwidth, but they also create challenges for reads, so we have
investigated strategies that ensure high read bandwidth under these local write operations. Moreover,
unlike remote shared burst buffers, node-local burst buffers are allocated to an individual job, and
data on these burst buffers are only temporarily available within each job. Based on this characteristic,
we structure the node-local burst buffers to offer a temporary data-sharing service for coupled
applications in the same job.
Besides the aforementioned architectural differences and the distinct contributions to these two
types of burst buffers, we have designed a common data flushing strategy that is applicable to both.
This strategy reshapes the bursty writes in burst buffers before draining them to the backend PFS,
thereby avoiding contention on the PFS.
Since burst buffers are an emerging storage solution to be adopted on exascale computing
systems, we offer first-hand insights and alternative storage solutions for the system
architects tasked with building the next-generation supercomputers. More specifically, we have
made the following three contributions.
• BurstMem: Overlapping Computation and I/O In order to bridge the computation-IO
gap, we introduced the design of a novel remote shared burst buffer system to avoid applica-
tions’ direct interactions with PFS, by temporarily buffering the checkpoints in burst buffers
and gradually flushing the data to PFS. While BurstMem inherits the buffering management
framework of Memcached, it also complements the functionality of Memcached with several
salient features to accelerate checkpointing, including a log-structured data organization that
efficiently utilizes the burst buffer bandwidth and capacity; a stacked AVL tree based indexing
that quickly locates the requested data and retrieves them for data flush and crash recovery;
a coordinated data shuffling that transforms the small and noncontiguous write requests into
large and sequential ones, so that data flushes can be conducted in a stripe-aligned manner; a
CCI-based communication layer that is portable and able to leverage the native transport of
various HPC interconnects. Experiments using both synthetic benchmarks and a real-world
application demonstrated that BurstMem is able to achieve 8.5× speedup over the bandwidth
of the conventional PFS.
• BurstFS: A Distributed Burst Buffer File System We further investigated the use
of node-local burst buffers to handle various I/O workloads. We designed and prototyped
BurstFS, an ephemeral burst buffer file system that co-exists with the individual job. BurstFS
vastly accelerates the checkpointing and multi-dimensional I/O workloads with three tech-
niques. First, it enables scalable log-structured writes using scalable metadata indexing. With
this technique, each process can directly write its data to its node-local burst buffer for both
N-1 and N-N patterns, avoiding the contention issues that commonly exist on the center-wide
file systems. Second, it provides a temporary data sharing service for the coupled applications
in the same job via collocated I/O delegation. I/O delegation also opens up opportunities
for further optimizations on the server side. Finally, it optimizes the multi-dimensional I/O
workload by server-side read clustering and pipelining. This technique combines the small and
noncontiguous read requests into larger ones and pipelines the read, copy and send operations.
Through extensive tuning and analysis, we have validated that BurstFS has accomplished our
design objectives, with linear scalability in terms of aggregated I/O bandwidth for parallel
writes and reads.
• TRIO: Reshaping Bursty Writes on PFS We have introduced the design of a burst
buffer orchestration framework to reshape scientific applications’ bursty writes on the burst
buffer layer. TRIO addresses the contention with two design choices. First, before flushing,
each burst buffer groups together all the bursty writes to be flushed to the same storage server
and sequentially organizes the write requests in each group to maximize the flush sequentiality
on each storage server. Second, during flushing, burst buffers dynamically orchestrate their
flush order on the storage servers to avoid burst buffers’ competing flushes on the same
storage server, and minimize the write interference that occurs when data flushing for one
application is interleaved by the data flushing for other applications. Our experimental results
demonstrated that TRIO could efficiently utilize storage bandwidth and reduce the average
I/O time by 37% on average in the typical checkpointing scenarios.
BurstMem and BurstFS are designed to manage the remote shared burst buffers and node-local
burst buffers, respectively. On the other hand, TRIO takes burst buffers as an intermediate storage
layer between compute nodes and the backend PFS, so it is portable to both types of burst buffers.
CHAPTER 7
FUTURE WORK
This dissertation also opens up new opportunities for future burst buffer related research. In
particular, the following three directions deserve further investigation.
First, a major advantage of the node-local burst buffers is that scientific applications can reap
scalable checkpointing bandwidth by having each process write data to its own node-local burst
buffer. This benefit has been justified in Chapter 4. However, this scalable write bandwidth is
accompanied by escalating failure rates as more compute nodes are involved in a job. Studies [76, 86]
reported that a small portion of failures can be recovered by restarting from the local
checkpoints. However, the vast majority of failures require a restart from the external storage (e.g.
checkpoints on the remote shared burst buffers). The remote shared burst buffers, on the other
hand, have a lower failure rate, since the number of I/O nodes is much smaller than the number of
compute nodes. It is therefore tempting to design a fault-tolerant burst buffer system that combines
the virtues of both the node-local and the remote shared burst buffers. On one hand, we can still
perform scalable checkpointing by writing to the node-local burst buffers; on the other hand, we can
periodically flush part of the checkpoints to the remote shared burst buffers. Data on the remote
shared burst buffers can then be flushed to the PFS at an even lower frequency. This
hierarchical burst buffer management scheme has been theoretically proven efficient in [87]. We
can accomplish this purpose by combining BurstMem and BurstFS. One of the challenges that
demands further investigation is the impact of computation jitters on the compute nodes. It is
critical to quantify this impact since data need to be asynchronously flushed to the remote shared
burst buffers.
Second, our work in Chapter 5 orchestrates burst buffers’ flush order using one arbitrator. This
design choice limits its scalability. A future research topic is to distribute the responsibility of the
arbitrator to a number of arbitrators on different compute nodes. This can be accomplished by
partitioning storage servers into disjoint sets and assigning one arbitrator to orchestrate the I/O
requests in each set. We will extend the current framework to analyze the effect of a distributed
burst buffer orchestration. In addition, the existing framework is designed to handle large and
sequential checkpointing workloads, and it is not well suited to a checkpointing workload dominated
by small and noncontiguous write requests. A potential solution is to leverage some of the existing
works to reshuffle the data in an attempt to transform the small, noncontiguous write requests into
large and sequential ones [32, 63, 96].
Third, POSIX is a de facto standard for file I/O operations. Many high-level I/O
libraries, such as pNetCDF, HDF5, and MPI-IO, are built on top of POSIX, so a burst buffer file
system that transparently supports these libraries can benefit a vast number of the real-world sci-
entific applications. Our work in Chapter 4 is built on top of POSIX, so it is promising to extend
its functionality to transparently accelerate these scientific applications. One of the challenges
is to investigate what consistency is needed by the real-world applications. BurstFS enforces no
consistency control. This design choice works for scientific applications that evenly partition the
global arrays into all the participating processes. However, a comprehensive analysis is required
to characterize the consistency requirement of the applications belonging to broader categories.
Associated with this challenge is the requirement for load balancing. BurstFS is not well suited
to applications that unevenly distribute their data among the participating processes. For these
applications, we need to extend BurstFS to support remote writes, so that processes under heavier
workloads can shift a portion of their data to remote burst buffers.
BIBLIOGRAPHY
[1] Active Burst Buffer Appliance. http://www.theregister.co.uk/2012/09/21/emc_abba.
[2] Aurora Supercomputer. http://aurora.alcf.anl.gov.
[3] Blktrace. http://linux.die.net/man/8/blktrace.
[4] Catalyst Cluster. http://computation.llnl.gov/computers/catalyst.
[5] Cori Supercomputer. http://www.nersc.gov/users/computational-systems/cori.
[6] Cray. http://www.cray.com.
[7] Data intensive computing talk. http://www.exascale.org/mediawiki/images/6/64/Talk-12-Choudhary.pdf.
[8] Datadirect network. http://www.ddn.com/.
[9] Datawarp. http://www.cray.com/products/storage/datawarp.
[10] Edison Supercomputer. http://www.nersc.gov/users/computational-systems/edison/.
[11] HDF5. http://www.hdfgroup.org/HDF5/.
[12] Hyperion Cluster. https://hyperionproject.llnl.gov/index.php.
[13] Infinite Memory Engine. http://www.ddn.com/products/infinite-memory-engine-ime.
[14] Introducing Titan. http://www.olcf.ornl.gov/titan.
[15] Lawrence Livermore National Laboratory. https://asc.llnl.gov/computing_resources/purple.
[16] MPI-Tile-IO. http://www.mcs.anl.gov/research/projects.
[17] San Diego Supercomputer Center. http://www.gsic.titech.ac.jp/en/tsubame2.
[18] Scalable Checkpoint/Restart. https://computation.llnl.gov/project/scr.
[19] Sequoia Supercomputer. https://asc.llnl.gov/sequoia/rfp/02_SequoiaSOW_V06.doc.
[20] Sierra Supercomputer. https://www.llnl.gov/news/next-generation-supercomputer-coming-lab.
[21] Summit Supercomputer. https://www.olcf.ornl.gov/summit.
[22] Theta and Aurora Supercomputers. http://aurora.alcf.anl.gov.
[23] Tianhe-2. http://www.top500.org/system/177999.
[24] Top 500 Supercomputers. https://www.top500.org.
[25] Trinity. http://www.lanl.gov/projects/trinity.
[26] TSUBAME2. http://www.gsic.titech.ac.jp/en/tsubame2.
[27] Hasan Abbasi, Matthew Wolf, Greg Eisenhauer, Scott Klasky, Karsten Schwan, and Fang Zheng. DataStager: Scalable Data Staging Services for Petascale Applications. Cluster Computing, 13(3):277–290, 2010.
[28] Nawab Ali, Philip Carns, Kamil Iskra, Dries Kimpe, Samuel Lang, Robert Latham, Robert Ross, Lee Ward, and P Sadayappan. Scalable I/O Forwarding Framework for High-Performance Computing Systems. In Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, pages 1–10. IEEE, 2009.
[29] IEEE Standards Association et al. IEEE/ANSI Std 1003.1, 1996 Edition. Information Technology–Portable Operating System Interface (POSIX)–Part, 1.
[30] Scott Atchley, David Dillow, Galen Shipman, Patrick Geoffray, Jeffrey M Squyres, George Bosilca, and Ronald Minnich. The Common Communication Interface (CCI). In High Performance Interconnects (HOTI), 2011 IEEE 19th Annual Symposium on, pages 51–60. IEEE, 2011.
[31] John Bent, Sorin Faibish, Jim Ahrens, Gary Grider, John Patchett, Percy Tzelnic, and Jon Woodring. Jitter-free Co-Processing on a Prototype Exascale Storage Stack. In Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on, pages 1–5. IEEE, 2012.
[32] John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. PLFS: A Checkpoint Filesystem for Parallel Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 21. ACM, 2009.
[33] Wahid Bhimji, Debbie Bard, Melissa Romanus, David Paul, Andrey Ovsyannikov, Brian Friesen, Matt Bryson, Joaquin Correa, Glenn K Lockwood, Vakho Tsulaia, et al. Accelerating Science with the NERSC Burst Buffer Early User Program.
[34] Michael Moore, David Bonnie, Becky Ligon, Mike Marshall, Walt Ligon, Nicholas Mills, Elaine Quarles, Sam Sampson, Shuangyang Yang, and Boyd Wilson. OrangeFS: Advancing PVFS.
[35] Peter J Braam and R Zahir. Lustre: A Scalable, High Performance File System. Cluster File Systems, Inc, 2002.
[36] Michael J Brim, David A Dillow, Sarp Oral, Bradley W Settlemyer, and Feiyi Wang. Asynchronous Object Storage with QoS for Scientific and Commercial Big Data. In Proceedings of the 8th Parallel Data Storage Workshop, pages 7–13. ACM, 2013.
[37] SW Bruenn, A Mezzacappa, WR Hix, JM Blondin, P Marronetti, OEB Messer, CJ Dirk, and S Yoshida. 2D and 3D Core-Collapse Supernovae Simulation Results Obtained with the CHIMERA Code. In Journal of Physics: Conference Series, volume 180, page 012018. IOP Publishing, 2009.
[38] Philip Carns, Kevin Harms, William Allcock, Charles Bacon, Samuel Lang, Robert Latham, and Robert Ross. Understanding and Improving Computational Science Storage Access Through Continuous Characterization. ACM Transactions on Storage (TOS), 7(3):8, 2011.
[39] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Berkeley, CA, USA, 2006. USENIX Association.
[40] Chao Chen, Yong Chen, Kun Feng, Yanlong Yin, Hassan Eslami, Rajeev Thakur, Xian-He Sun, and William D Gropp. Decoupled I/O for Data-Intensive High Performance Computing. In Parallel Processing Workshops (ICPPW), 2014 43rd International Conference on, pages 312–320. IEEE, 2014.
[41] J H Chen, A Choudhary, B de Supinski, M DeVries, E R Hawkes, S Klasky, W K Liao, K L Ma, J Mellor-Crummey, N Podhorszki, R Sankaran, S Shende, and C S Yoo. Terascale Direct Numerical Simulations of Turbulent Combustion Using S3D. Computational Science and Discovery, 2(1):015001, 2009.
[42] Yong Chen, Xian-He Sun, Rajeev Thakur, Philip C Roth, and William D Gropp. LACIO: A New Collective I/O Strategy for Parallel I/O Systems. In Parallel and Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 794–804. IEEE, 2011.
[43] Kristina Chodorow. MongoDB: the Definitive Guide. O'Reilly Media, Inc., 2013.
[44] Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey, Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao, and Daniel Chavarría-Miranda. An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 36–47. ACM, 2005.
[45] Dong Dai, Yong Chen, Dries Kimpe, and Robert Ross. Two-Choice Randomized Dynamic I/O Scheduler for Object Storage Systems. In SC, 2014.
[46] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.
[47] Ciprian Docan, Manish Parashar, and Scott Klasky. DataSpaces: An Interaction and Coordination Framework for Coupled Simulation Workflows. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 25–36, New York, NY, USA, 2010. ACM.
[48] Ciprian Docan, Manish Parashar, and Scott Klasky. DataSpaces: An Interaction and Coordination Framework for Coupled Simulation Workflows. Cluster Computing, 15(2):163–181, 2012.
[49] Jack Dongarra. Impact of Architecture and Technology for Extreme Scale on Software and Algorithm Design. In The Department of Energy Workshop on Cross-cutting Technologies for Computing at the Exascale, 2010.
[50] Matthieu Dorier, Gabriel Antoniu, Rob Ross, Dries Kimpe, and Shadi Ibrahim. CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 155–164. IEEE, 2014.
[51] Lars George. HBase: the Definitive Guide. O'Reilly Media, Inc., 2011.
[52] Hugh N Greenberg, John Bent, and Gary Grider. MDHIM: A Parallel Key/Value Framework for HPC. In HotStorage. USENIX Association, 2015.
[53] Jiahua He, Arun Jagatheesan, Sandeep Gupta, Jeffrey Bennett, and Allan Snavely. DASH: A Recipe for a Flash-Based Data Intensive Supercomputer. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.
[54] Kamil Iskra, John W Romein, Kazutomo Yoshii, and Pete Beckman. ZOID: I/O-Forwarding Infrastructure for Petascale Architectures. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 153–162. ACM, 2008.
[55] Hui Jin, Tao Ke, Yong Chen, and Xian-He Sun. Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment. In Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on, pages 276–283. IEEE, 2012.
[56] Youngjae Kim, Raghul Gunasekaran, Galen M Shipman, David Dillow, Zhe Zhang, Bradley W Settlemyer, et al. Workload Characterization of A Leadership Class Storage Cluster. In Petascale Data Storage Workshop (PDSW), 2010 5th, pages 1–5. IEEE, 2010.
[57] Donald E. Knuth. The Art of Computer Programming, Volume 3: (2nd Ed.) Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998.
[58] Jianwei Li, Wei-keng Liao, Alok Choudhary, Robert Ross, Rajeev Thakur, William Gropp, Robert Latham, Andrew Siegel, Brad Gallagher, and Michael Zingale. Parallel netCDF: A High-Performance Scientific I/O Interface. In Supercomputing, 2003 ACM/IEEE Conference, pages 39–39. IEEE, 2003.
[59] Tonglin Li, Xiaobing Zhou, Kevin Brandstatter, Dongfang Zhao, Ke Wang, Anupam Rajendran, Zhao Zhang, and Ioan Raicu. ZHT: A Light-Weight Reliable Persistent Dynamic Scalable Zero-Hop Distributed Hash Table. In Parallel and Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 775–787. IEEE, 2013.
[60] Wei-keng Liao, Avery Ching, Kenin Coloma, Arifa Nisar, Alok Choudhary, Jacqueline Chen, Ramanan Sankaran, and Scott Klasky. Using MPI File Caching to Improve Parallel Write Performance for Large-Scale Scientific Applications. In Supercomputing, 2007. SC'07. Proceedings of the 2007 ACM/IEEE Conference on, pages 1–11. IEEE, 2007.
[61] Jialin Liu and Yong Chen. Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–8. IEEE, 2013.
[62] Jialin Liu, Yong Chen, and Yu Zhuang. Hierarchical I/O Scheduling for Collective I/O. In Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, pages 211–218. IEEE, 2013.
[63] Jialin Liu, Bradly Crysler, Yin Lu, and Yong Chen. Locality-Driven High-Level I/O Aggregation for Processing Scientific Datasets. In Big Data, 2013 IEEE International Conference on, pages 103–111. IEEE, 2013.
[64] Jialin Liu, Evan Racah, Quincey Koziol, and Richard Shane Canon. H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems. Cray User Group, 2016.
[65] Ning Liu, Jason Cope, Philip Carns, Christopher Carothers, Robert Ross, Gary Grider, Adam Crume, and Carlos Maltzahn. On the Role of Burst Buffers in Leadership-Class Storage Systems. In Mass Storage Systems and Technologies (MSST), 2012 IEEE 28th Symposium on, pages 1–11. IEEE, 2012.
[66] Yang Liu, Raghul Gunasekaran, Xiaosong Ma, and Sudharshan S. Vazhkudai. Automatic Identification of Application I/O Signatures from Noisy Server-side Traces. In Proceedings of the 12th USENIX Conference on File and Storage Technologies, FAST'14, pages 213–228, Berkeley, CA, USA, 2014. USENIX Association.
[67] Zhuo Liu, Jay Lofstead, Teng Wang, and Weikuan Yu. A Case of System-Wide Power Management for Scientific Applications. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–8. IEEE, 2013.
[68] Zhuo Liu, Bin Wang, Patrick Carpenter, Dong Li, Jeffrey S Vetter, and Weikuan Yu. PCM-Based Durable Write Cache for Fast Disk I/O. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 IEEE 20th International Symposium on, pages 451–458. IEEE, 2012.
[69] Zhuo Liu, Bin Wang, Teng Wang, Yuan Tian, Cong Xu, Yandong Wang, Weikuan Yu, Carlos A Cruz, Shujia Zhou, Tom Clune, et al. Profiling and Improving I/O Performance of a Large-Scale Climate Scientific Application. In Computer Communications and Networks (ICCCN), 2013 22nd International Conference on, pages 1–7. IEEE, 2013.
[70] Jay Lofstead, Milo Polte, Garth Gibson, Scott Klasky, Karsten Schwan, Ron Oldfield, Matthew Wolf, and Qing Liu. Six Degrees of Scientific Data: Reading Patterns for Extreme Scale Science IO. In Proceedings of the 20th International Symposium on High Performance Distributed Computing, pages 49–60. ACM, 2011.
[71] Jay Lofstead, Fang Zheng, Scott Klasky, and Karsten Schwan. Adaptable, Metadata Rich IO Methods for Portable High Performance IO. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–10. IEEE, 2009.
[72] Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, and Matthew Wolf. Managing Variability in the IO Performance of Petascale Storage Systems. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE Computer Society, 2010.
[73] Ewing Lusk, S Huss, B Saphir, and M Snir. MPI: A Message-Passing Interface Standard, 2009.
[74] Huong Luu, Marianne Winslett, William Gropp, Robert Ross, Philip Carns, Kevin Harms, Prabhat, Suren Byna, and Yushu Yao. A Multiplatform Study of I/O Behavior on Petascale Supercomputers. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, pages 33–44. ACM, 2015.
[75] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX Annual Technical Conference, pages 103–114, 2013.
[76] Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R De Supinski. Design, Modeling, and Evaluation of a Scalable Multi-Level Checkpointing System. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11. IEEE, 2010.
[77] Jose Moreira, Michael Brutman, Jose Castanos, Thomas Engelsiepen, Mark Giampapa, Tom Gooding, Roger Haskin, Todd Inglett, Derek Lieber, Pat McCarthy, et al. Designing a Highly-Scalable Operating System: The Blue Gene/L Story. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 118. ACM, 2006.
[78] Arifa Nisar, Wei-keng Liao, and Alok Choudhary. Scaling Parallel I/O Performance through I/O Delegate and Caching System. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–12. IEEE, 2008.
[79] Arifa Nisar, Wei-keng Liao, and Alok Choudhary. Delegation-Based I/O Mechanism for High Performance Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 23(2):271–279, 2012.
[80] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling Memcache at Facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 385–398, 2013.
[81] Sarp Oral, David A Dillow, Douglas Fuller, Jason Hill, Dustin Leverman, Sudharshan S Vazhkudai, Feiyi Wang, Youngjae Kim, James Rogers, James Simmons, et al. OLCF's 1 TB/s, Next-Generation Lustre File System. In Proceedings of Cray User Group Conference (CUG 2013), 2013.
[82] Ramya Prabhakar, Sudharshan S Vazhkudai, Youngjae Kim, Ali R Butt, Min Li, and Mahmut Kandemir. Provisioning a Multi-Tiered Data Staging Area for Extreme-Scale Machines. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 1–12. IEEE, 2011.
[83] Raghunath Rajachandrasekar, Adam Moody, Kathryn Mohror, and Dhabaleswar K Panda. A 1 PB/s File System to Checkpoint Three Million MPI Tasks. In Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, pages 143–154. ACM, 2013.
[84] Kai Ren and Garth A Gibson. TABLEFS: Enhancing Metadata Efficiency in the Local File System. In USENIX Annual Technical Conference, pages 145–156, 2013.
[85] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-Tree Filesystem. TOS, 9(3):9, 2013.
[86] Kento Sato, Naoya Maruyama, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R de Supinski, and Satoshi Matsuoka. Design and Modeling of a Non-Blocking Checkpointing System. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 19. IEEE Computer Society Press, 2012.
[87] Kento Sato, Kathryn Mohror, Adam Moody, Todd Gamblin, Bronis R De Supinski, Naoya Maruyama, and Satoshi Matsuoka. A User-Level Infiniband-Based File System and Checkpoint Strategy for Burst Buffers. In Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pages 21–30. IEEE, 2014.
[88] Hongzhang Shan and John Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. Lawrence Berkeley National Laboratory, 2007.
[89] G Shipman, D Dillow, Sarp Oral, and Feiyi Wang. The Spider Center Wide File System: From Concept to Reality. In Proceedings, Cray User Group (CUG) Conference, Atlanta, GA, 2009.
[90] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[91] Huaiming Song, Yanlong Yin, Xian-He Sun, Rajeev Thakur, and Samuel Lang. Server-Side I/O Coordination for Parallel File Systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pages 1–11. IEEE, 2011.
[92] Tim Stitt. An Introduction to the Partitioned Global Address Space (PGAS) Programming Model. Connexions, Rice University, 2009.
[93] Houjun Tang, Suren Byna, Steve Harenberg, Wenzhao Zhang, Xiaocheng Zou, Daniel F Martin, Bin Dong, Dharshi Devendran, Kesheng Wu, David Trebotich, et al. In Situ Storage Layout Optimization for AMR Spatio-temporal Read Accesses. In Parallel Processing (ICPP), 2016 45th International Conference on, pages 406–415. IEEE, 2016.
[94] Rajeev Thakur, William Gropp, and Ewing Lusk. Data Sieving and Collective I/O in ROMIO. In Frontiers of Massively Parallel Computation, 1999. Frontiers '99. The Seventh Symposium on the, pages 182–189. IEEE, 1999.
[95] Yuan Tian, Scott Klasky, Hasan Abbasi, Jay Lofstead, Ray Grout, Norbert Podhorszki, Qing Liu, Yandong Wang, and Weikuan Yu. EDO: Improving Read Performance for Scientific Applications through Elastic Data Organization. In Cluster Computing (CLUSTER), 2011 IEEE International Conference on, pages 93–102. IEEE, 2011.
[96] Yuan Tian, Zhuo Liu, Scott Klasky, Bin Wang, Hasan Abbasi, Shujia Zhou, Norbert Podhorszki, Tom Clune, Jeremy Logan, and Weikuan Yu. A Lightweight I/O Scheme to Facilitate Spatial and Temporal Queries of Scientific Data Analytics. In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pages 1–10. IEEE, 2013.
[97] Venkatram Vishwanath, Mark Hereld, Kamil Iskra, Dries Kimpe, Vitali Morozov, Michael E Papka, Robert Ross, and Kazutomo Yoshii. Accelerating I/O Forwarding in IBM Blue Gene/P Systems. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–10. IEEE, 2010.
[98] Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, and Weikuan Yu. An Ephemeral Burst-Buffer File System for Scientific Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, pages 69:1–69:12, Piscataway, NJ, USA, 2016. IEEE Press.
[99] Teng Wang, Kathryn Mohror, Adam Moody, Weikuan Yu, and Kento Sato. BurstFS: A Distributed Burst Buffer File System for Scientific Applications. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC poster), 2015.
[100] Teng Wang, Adam Moody, Yue Zhu, Kento Sato, Tanzima Islam, and Weikuan Yu. MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers. In Parallel and Distributed Processing Symposium, 2017 IEEE 31st International, pages 799–808. IEEE, 2017.
[101] Teng Wang, Sarp Oral, Michael Pritchard, Kevin Vasko, and Weikuan Yu. Development of a Burst Buffer System for Data-Intensive Applications. CoRR, abs/1505.01765, 2015.
[102] Teng Wang, Sarp Oral, Michael Pritchard, Bin Wang, and Weikuan Yu. TRIO: Burst Buffer Based I/O Orchestration. In 2015 IEEE International Conference on Cluster Computing, pages 194–203. IEEE, 2015.
[103] Teng Wang, Sarp Oral, Yandong Wang, Brad Settlemyer, Scott Atchley, and Weikuan Yu. BurstMem: A High-Performance Burst Buffer System for Scientific Applications. In Big Data (Big Data), 2014 IEEE International Conference on, pages 71–79. IEEE, 2014.
[104] Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, and Weikuan Yu. BPAR: A Bundle-Based Parallel Aggregation Framework for Decoupled I/O Execution. In Data Intensive Scalable Computing Systems (DISCS), 2014 International Workshop on, pages 25–32. IEEE, 2014.
[105] Teng Wang, Kevin Vasko, Zhuo Liu, Hui Chen, and Weikuan Yu. Enhance Parallel Input/Output with Cross-bundle Aggregation. Int. J. High Perform. Comput. Appl., 30(2):241–256, May 2016.
[106] Yandong Wang, Robin Goldstone, Weikuan Yu, and Teng Wang. Characterization and Optimization of Memory-Resident MapReduce on HPC Systems. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 799–808. IEEE, 2014.
[107] Yandong Wang, Yizheng Jiao, Cong Xu, Xiaobing Li, Teng Wang, Xinyu Que, Cristi Cira, Bin Wang, Zhuo Liu, Bliss Bailey, et al. Assessing the Performance Impact of High-Speed Interconnects on MapReduce. In Specifying Big Data Benchmarks, pages 148–163. Springer, 2014.
[108] Parkson Wong and R Van der Wijngaart. NAS Parallel Benchmarks I/O Version 2.4. NASA Ames Research Center, Moffett Field, CA, Tech. Rep. NAS-03-002, 2003.
[109] Bing Xie, Jeffrey Chase, David Dillow, Oleg Drokin, Scott Klasky, Sarp Oral, and Norbert Podhorszki. Characterizing Output Bottlenecks in a Supercomputer. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–11. IEEE, 2012.
[110] Jiangling Yin, Jun Wang, Jian Zhou, Tyler Lukasiewicz, Dan Huang, and Junyao Zhang. Opass: Analysis and Optimization of Parallel Data Access on Distributed File Systems. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 623–632. IEEE, 2015.
[111] W. Yu, J.S. Vetter, and H.S. Oral. Performance Characterization and Optimization of Parallel I/O on the Cray XT. In 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), Miami, FL, April 2008.
[112] Xuechen Zhang, Kei Davis, and Song Jiang. IOrchestrator: Improving the Performance of Multi-Node I/O Systems via Inter-Server Coordination. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.
[113] Dongfang Zhao, Zhao Zhang, Xiaobing Zhou, Tonglin Li, Ke Wang, Dries Kimpe, Philip Carns, Robert Ross, and Ioan Raicu. FusionFS: Toward Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems. In Big Data (Big Data), 2014 IEEE International Conference on, pages 61–70. IEEE, 2014.
[114] Fang Zheng, Hasan Abbasi, Ciprian Docan, Jay Lofstead, Qing Liu, Scott Klasky, Manish Parashar, Norbert Podhorszki, Karsten Schwan, and Matthew Wolf. PreDatA – Preparatory Data Analytics on Peta-Scale Machines. In Parallel and Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12. IEEE, 2010.
BIOGRAPHICAL SKETCH
Teng Wang was born in Puyang, Henan province of China. He received the Master’s degree in
Software Engineering from Huazhong University of Science and Technology, Wuhan, China, in 2012.
He obtained the Bachelor’s degree in Computer Science from Zhengzhou University, Zhengzhou,
China, in 2009. His research interests include high performance computing, parallel I/O, storage
systems, and cloud computing.