

VELOC: Very Low Overhead Checkpointing System

ANL: Bogdan Nicolae, Franck Cappello
LLNL: Adam Moody, Elsa Gonsiorowski, Kathryn Mohror

Description

Problem Statement: Extreme-scale simulations need to checkpoint periodically for fault tolerance, but the overhead of checkpoint/restart is growing beyond acceptable levels and requires non-trivial optimizations.

Hidden Complexity of Storage Interaction

Facilitates ease of use while optimizing performance and scalability

Flexibility through Modular Design

VeloC API

● Application-level checkpoint and restart API
● Minimizes code changes in applications
● Two possible modes (a sketch of the memory-oriented mode follows this list):
○ File-oriented API: manually write files and tell VeloC about them
○ Memory-oriented API: declare and capture memory regions automatically
● Fire-and-forget: VeloC operates in the background
● Waiting for checkpoints is optional; a primitive is used to check progress
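A minimal sketch of the memory-oriented mode in C, following the usage pattern documented at http://veloc.readthedocs.io. The checkpoint name "heatdis", the configuration file name, the region sizes, and the checkpoint interval are illustrative choices, and the exact function signatures and return codes should be verified against the documentation.

    /* Illustrative sketch of the memory-oriented VeloC API; not the poster's own code. */
    #include <mpi.h>
    #include <veloc.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        double grid[1024] = {0};   /* application state to protect (size is illustrative) */
        int step, first_step = 0;

        /* "heatdis.cfg" is a placeholder config file; it selects sync or async mode. */
        if (VELOC_Init(MPI_COMM_WORLD, "heatdis.cfg") != VELOC_SUCCESS) {
            fprintf(stderr, "VeloC initialization failed\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* Declare the memory regions to capture (done once, before any checkpoint). */
        VELOC_Mem_protect(0, grid, 1024, sizeof(double));
        VELOC_Mem_protect(1, &first_step, 1, sizeof(int));

        /* If a previous checkpoint named "heatdis" exists, restore the protected regions. */
        int v = VELOC_Restart_test("heatdis", 0);
        if (v >= 0)
            VELOC_Restart("heatdis", v);

        for (step = first_step; step < 10000; step++) {
            /* ... one iteration of the computation ... */
            if (step > 0 && step % 100 == 0) {
                first_step = step;
                VELOC_Checkpoint("heatdis", step);   /* fire-and-forget in async mode */
                /* Optional: VELOC_Checkpoint_wait(); blocks until the capture completes. */
            }
        }

        VELOC_Finalize(0);  /* flag controls draining of in-flight operations; see the docs */
        MPI_Finalize();
        return 0;
    }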

High Performance and Scalability

● Scenario: 512 PEs/node, each checkpoints 256 MB (64 GB total/node)
● Sync: blocking writes to Lustre
● Async: writes to SSD followed by asynchronous flush to Lustre
● Metric: increase in runtime due to checkpointing
● Conclusions:
○ Sync flush to the PFS shows I/O bottlenecks even at small scale
○ Async mode hides PFS latency: overhead reduced by up to 80%, improved scalability

VeloC in a nutshell:

● Provides a simple API to checkpoint and restart, based on either data structures declared as important (protected) and captured automatically, or files managed directly by the application
● Hides the complexity of interacting with the heterogeneous storage hierarchy (burst buffers, etc.) of current and future CORAL and ECP systems
● Based on a modular design that facilitates flexibility in choosing a resilience strategy and operation mode (synchronous or asynchronous), while being highly customizable with additional post-processing modules
● Supports many use cases beyond fault tolerance: suspending and resuming jobs to extend over multiple reservations, revisiting previous states (e.g., adjoint computations), etc.
● Initial stress tests show high performance and scalability

Goal: Provide a multi-level checkpoint/restart environment that delivers high performance and scalability for complex heterogeneous storage hierarchies without sacrificing ease of use and flexibility

● Initial stress tests show promising results
● Application: Heat Distribution
● Platform: ANL Theta (KNL, local SSD, Lustre)

● Configurable resilience strategy:
○ Partner replication
○ Erasure coding
○ Optimized transfer to external storage
● Configurable mode of operation (an illustrative configuration is sketched after this list):
○ Synchronous mode: the resilience engine runs in the application process
○ Asynchronous mode: the resilience engine runs in a separate backend process (it survives if the application dies due to a software failure)
● Easily extensible:
○ Custom modules can be added for additional post-processing in the engine (e.g., compression)
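As an illustration of how the mode and storage tiers are selected, a minimal configuration sketch (the file passed to VELOC_Init) is shown below. The key names follow the VeloC documentation; the paths and values are placeholders and should be adapted and verified against http://veloc.readthedocs.io.

    # Minimal VeloC configuration sketch (paths and values are placeholders)
    scratch = /local/ssd/veloc/scratch       # fast node-local tier for frequent checkpoints
    persistent = /lustre/project/veloc/ckpt  # external parallel file system for durable copies
    mode = async                             # sync: engine runs in the application process
                                             # async: engine runs in a separate backend process
    # Additional keys configure modules such as partner replication or erasure coding (see the docs).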

[Figure: Flexibility through Modular Design. On each node (NODE 1 through NODE N), the APPLICATION calls the VeloC API through the VeloC Client, which handles checkpoint decision making, local checkpoint and recovery, the local recovery phase, and resource-aware optimal recovery. In synchronous mode the VeloC Engine runs inside the application process; in asynchronous mode post-processing requests are forwarded to a separate VeloC Backend, which returns a completion notification when done. The engine hosts pluggable modules: Partner Replication, Erasure Coding, Optimized Transfer (using vendor-specific transfer APIs to burst buffers, I/O nodes, and storage nodes), and Custom Modules. Nodes cooperate for collaborative resilience.]

[Figure: Hidden Complexity of Storage Interaction. User Applications see one simple VeloC API; underneath, the VeloC Library and VeloC Engine, in coordination with the Job Scheduler/Resource Manager, manage the complex heterogeneous storage hierarchy (burst buffers, parallel file systems, object stores, etc.) and the many complicated vendor APIs (CCPR, DataWarp, etc.) spanning burst buffers (BB), I/O nodes (ION), and storage nodes (SN).]

http://veloc.readthedocs.io

This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations - the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation’s exascale computing imperative. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-POST-745767.