VELOC: Very Low Overhead Checkpointing System
ANL: Bogdan Nicolae, Franck Cappello
LLNL: Adam Moody, Elsa Gonsiorowski, Kathryn Mohror
Description
Problem Statement: Extreme-scale simulations need to checkpoint periodically for fault tolerance, but the overhead of checkpoint/restart is growing beyond acceptable levels and requires non-trivial optimizations.
Hidden Complexity of Storage Interaction
Facilitates ease of use while optimizing performance and scalability
Flexibility through Modular Design
VeloC API
● Application-level checkpoint and restart API
● Minimizes code changes in applications
● Two possible modes:
○ File-oriented API: manually write files and tell VeloC about them
○ Memory-oriented API: declare and capture memory regions automatically
● Fire-and-forget: VeloC operates in the background
● Waiting for checkpoints is optional; a primitive is provided to check progress
High Performance and Scalability
● Scenario: 512 PEs/node, each checkpoints 256 MB (64 GB total/node)
● Sync: blocking writes to Lustre
● Async: writes to SSD followed by async flush to Lustre
● Metric: increase in runtime due to checkpointing
● Conclusions:
○ Sync flush to the PFS shows I/O bottlenecks even at small scale
○ Async mode hides PFS latency (overhead reduced by up to 80%, improved scalability)
VeloC in a nutshell:
● Provides a simple API to checkpoint and restart, based on data structures declared as important (protected) and captured automatically, or on files directly managed by the application
● Hides the complexity of interacting with the heterogeneous storage hierarchy (burst buffers, etc.) of current and future CORAL and ECP systems
● Based on a modular design that facilitates flexibility in choosing a resilience strategy and operation mode (synchronous or asynchronous), while being highly customizable with additional post-processing modules
● Many other use cases beyond fault tolerance: suspend-resume jobs to extend over multiple reservations, revisit previous states (e.g., adjoint computations), etc.
● Initial stress tests show high performance and scalability
Goal: Provide a multi-level checkpoint/restart environment that delivers high performance and scalability for complex heterogeneous storage hierarchies without sacrificing ease of use and flexibility
● Initial stress tests show promising results
● Application: Heat Distribution
● Platform: ANL Theta (KNL, local SSD, Lustre)
● Configurable resilience strategy:
○ Partner replication
○ Erasure coding
○ Optimized transfer to external storage
● Configurable mode of operation:
○ Synchronous mode: resilience engine runs in the application process
○ Asynchronous mode: resilience engine runs in a separate backend process (does not die if the application dies due to software failures)
● Easily extensible:
○ Custom modules can be added for additional post-processing in the engine (e.g., compression)
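The choice between these modes and storage tiers is typically made in a small configuration file rather than in code. The fragment below is a hedged sketch assuming the INI-style keys shown in the VeloC documentation (scratch, persistent, mode); the paths are placeholders.

```ini
; veloc.cfg -- placeholder paths; keys follow the documented INI format
scratch = /local/ssd/veloc      ; fast node-local tier (first checkpoint level)
persistent = /lustre/proj/ckpt  ; external storage for asynchronous flushes
mode = async                    ; "sync" runs the engine in the app process
```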
[Architecture diagram: on each node (NODE 1 … NODE N), the APPLICATION links the VeloC Client, which handles local checkpoint and recovery through the VeloC API. The VeloC Engine runs either inside the application process (synchronous mode) or in a separate VeloC Backend process (asynchronous mode), driven by post-processing requests and completion notifications. Engine modules include Partner Replication, Erasure Coding, Optimized Transfer via a vendor-specific transfer API, and Custom Modules. Nodes cooperate in collaborative resilience, with checkpoint decision making and resource-aware optimal recovery during the local recovery phase; data flows through burst buffers (BB), I/O nodes (ION), and storage nodes (SN).]
[Diagram: User Applications call one simple VeloC API; the VeloC Library and VeloC Engine (coordinating with the job scheduler/resource manager) map it onto many complicated vendor APIs (CCPR, DataWarp, etc.) for the complex heterogeneous storage hierarchy of burst buffers (BB), I/O nodes (ION), storage nodes (SN), parallel file systems, and object stores.]
http://veloc.readthedocs.io
This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations - the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation’s exascale computing imperative. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-POST-745767.