an introduction to checkpointing - uclouvain · 2016-10-31 · distributed multi-threaded...

An introduction to checkpointing for scientific applications

damien.francois@uclouvain.be
UCL/CISM

November 2016
CISM/CÉCI training session

What is checkpointing?

Without checkpointing:

$ ./count
1
2
3
^C
$ ./count
1
2
3

With checkpointing: the second run continues after 3 instead of starting again at 1.

Checkpointing:

'saving' a computation so that it can be resumed later

(rather than started again)

Today's agenda:

1. General concepts and scientific software

2. Working with Signals

3. Slurm recipes

4. DMTCP

Why do we need checkpointing?

Imagine a text editor without 'checkpointing' ...

The idea:

Save the program state (values in variables, open files...)

every time a checkpoint is encountered (position in the code, signal or event...)

and restart from there upon (un)planned stop

rather than bootstrap again from scratch (starting loops at iteration 0, creating tmp files...)
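As a minimal sketch of this idea, assuming a bash stand-in for the ./count program shown earlier: the state is a single integer, a checkpoint is taken after each iteration, and a restart reloads the saved value instead of starting at 0. The function name `count`, the default file name `count.state` and the three arguments are hypothetical choices for this illustration, not the course material.

```bash
#!/bin/bash
# Hypothetical stand-in for ./count: prints 1..limit, one number per step,
# and checkpoints its only piece of state (the counter) after each step.
count() {
    local state_file=${1:-count.state} limit=${2:-5} delay=${3:-1}
    local i=0
    # Restart: reload the saved state instead of bootstrapping from 0.
    if [ -f "$state_file" ]; then
        i=$(cat "$state_file")
    fi
    while [ "$i" -lt "$limit" ]; do
        i=$((i + 1))
        echo "$i"
        # Checkpoint: write to a temporary file, then rename it, so that an
        # interruption never leaves a half-written state file.
        echo "$i" > "$state_file.tmp"
        mv "$state_file.tmp" "$state_file"
        sleep "$delay"
    done
    # Finished normally: drop the checkpoint so the next run starts afresh.
    rm -f "$state_file"
}

count "$@"
```

Interrupting a run with ^C while it pauses after printing 3 leaves 3 in the state file, so the next run prints 4 and 5. Saving at every iteration is the simplest policy but also the most expensive in I/O.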

Goals of checkpointing in HPC:

1. Fit in time constraints

2. Debugging, monitoring

3. Cope with hardware failures

4. Job preemption

1. Fit in time constraints

All clusters limit maximum 'wall' time of jobs

to allow high job turnover

to ensure fair time sharing of the cluster

(and reduce waiting times...)

2. Debugging, monitoring

Checkpointing means saving the state on disk

-> You can view the state while the job is running

-> You can restart at the checkpoint before a bug occurred

3. Cope with hardware failures

4. Job preemption

Preemption is the ability for a high-priority job to re-queue a low-priority job.

It is not used at CÉCI.

1 Many scientific software packages save their state after N iterations.

Working with checkpoint-restart-able software

Many scientific software packages have built-in checkpointing capabilities

(although it might not be called that way)

Check the documentation

Evaluate the options: tradeoff between

I/O overhead

portability

ease of use

http://www.gaussian.com/g_blog/faq2.htm

http://cfd.direct/openfoam/user-guide/controlDict/

https://www.cp2k.org/restarting

https://www.molpro.net/info/2015.1/doc/quickstart/node65.html

http://www.cfs.dl.ac.uk/docs/html/part3/node6.html

http://cms.mpi.univie.ac.at/vasp/vasp/ISTART_tag.html

http://lammps.sandia.gov/doc/restart.html

http://www.gromacs.org/Documentation/How-tos/Doing_Restarts

http://www.abinit.org/doc/helpfiles/for-v7.10/input_variables/varrlx.html#restartxf

So you can play

On lemaitre2: ~dfr/checkpoint.tgz

2 Using UNIX signals to reduce overhead: do not save the state at each iteration, wait for the signal.

UNIX processes can receive 'signals' from the user, the OS, or another process

fg, bg

kill -9


UNIX processes can receive 'signals' with an associated default action
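A minimal sketch of signal-driven checkpointing in bash, assuming the same kind of counting loop as ./count: the default action of SIGINT and SIGTERM (terminate the process) is replaced by a handler that saves the state first. The names `count_on_signal`, `save_and_exit` and `count.state` are hypothetical.

```bash
#!/bin/bash
# Sketch: write the checkpoint only when SIGINT or SIGTERM is received,
# not at each iteration. Function, variable and file names are hypothetical.

save_and_exit() {
    echo "$i" > "$state_file"
    echo "checkpoint saved at $i" >&2
    exit 1                      # non-zero: the computation is not finished
}

count_on_signal() {
    state_file=${1:-count.state}; limit=${2:-5}; delay=${3:-1}
    i=0
    # Restart from the saved state if a checkpoint exists.
    if [ -f "$state_file" ]; then i=$(cat "$state_file"); fi
    trap save_and_exit INT TERM
    while [ "$i" -lt "$limit" ]; do
        i=$((i + 1))
        echo "$i"
        # bash runs a trap handler only after the current foreground command
        # returns; waiting on a background job lets the handler run at once.
        sleep "$delay" &
        wait $!
    done
    trap - INT TERM             # restore the default actions
    rm -f "$state_file"
}

count_on_signal "$@"
```

No file is written during normal operation; the cost is that a signal that cannot be caught, such as the one sent by kill -9, loses all progress since the last saved state.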

3 Use Slurm signaling abilities to manage checkpoint-able software in Slurm scripts on the clusters.

scancel is used to send signals to jobs

--signal to have Slurm send signals automatically before the end of the allocation

Example: send SIGINT 60 seconds before job is killed (so, here, after 2 minutes)
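A submission script along these lines could look as follows. This is a sketch, not the script from the slide: the 3-minute limit is inferred from "after 2 minutes", and the job name is hypothetical.

```bash
#!/bin/bash
#SBATCH --job-name=checkpoint-demo   # hypothetical job name
#SBATCH --time=00:03:00              # 3-minute allocation (inferred)
#SBATCH --signal=INT@60              # SIGINT about 60 s before the time limit

# Without the B: prefix, --signal targets the job steps, not the batch
# shell, so the program is started as a step with srun.
srun ./count

# To send the same signal by hand: scancel --signal=INT <jobid>
```

The program itself must catch SIGINT and save its state; Slurm only delivers the signal.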

If your program exits with a non-zero exit code in case of interruption, you can have your job re-queued automatically

Note the --open-mode=append, so that the output of a re-queued job is appended to the existing output file
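A sketch of such a self-requeueing script, assuming that ./count saves its state and exits with a non-zero code when it receives SIGINT, and exits with 0 once the computation is complete. The time limit is illustrative.

```bash
#!/bin/bash
#SBATCH --time=00:03:00          # illustrative time limit
#SBATCH --signal=INT@60          # warn the program before the limit
#SBATCH --requeue                # allow this job to be requeued
#SBATCH --open-mode=append       # keep the output of earlier runs

# Non-zero exit code: interrupted before completion, so put the job
# back in the queue; it restarts from the saved state.
if ! srun ./count; then
    scontrol requeue "$SLURM_JOB_ID"
fi
```

A program that exits non-zero for a reason other than interruption would be requeued forever with this sketch, so a real script should tell the two cases apart.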

Or chain the jobs...
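One way to chain jobs is with job dependencies. In this sketch, job.sh is a hypothetical submission script that runs a restartable program.

```bash
# Submit the same script twice; the second job becomes eligible only
# after the first one has terminated, whatever its final state.
first=$(sbatch --parsable job.sh)    # --parsable prints just the job ID
sbatch --dependency=afterany:"$first" job.sh
```

The second job finds the state saved by the first one and resumes from it. On a multi-cluster setup, --parsable also prints the cluster name, which would have to be stripped from the ID.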

Using a signal-based watchdog to re-queue the job just before it is killed
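A sketch of such a watchdog, assuming that ./count checkpoints periodically on its own. The choice of SIGUSR1 and the function name are arbitrary, and the time limit is illustrative.

```bash
#!/bin/bash
#SBATCH --time=00:03:00          # illustrative time limit
#SBATCH --signal=B:USR1@60       # B: signals the batch shell only
#SBATCH --requeue
#SBATCH --open-mode=append

# Watchdog: when Slurm warns that the limit is near, requeue this job.
requeue_job() {
    echo "time limit approaching, requeueing job $SLURM_JOB_ID"
    scontrol requeue "$SLURM_JOB_ID"
}
trap requeue_job USR1

# Run the step in the background: bash handles the trap only while
# it sits in 'wait', not while a foreground command is running.
srun ./count &
wait
```

Here the signal never reaches the program, so the work lost at each requeue is whatever was done since its last periodic checkpoint.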

4 Making non-restartable software restartable with DMTCP

Advertised Features

● Distributed Multi-Threaded CheckPointing
● Works with Linux Kernel 2.6.9 and later
● Supports sequential and multi-threaded computations across single/multiple hosts
● Entirely in user space (no kernel modules or root privilege)
● Transparent (no recompiling, no re-linking)
● Written at Northeastern U. and MIT and under active development for 5+ years
● LGPL'd and freely available
● No remote I/O
● Supports threads, mutexes/semaphores, forks, shared memory, exec, and many more

From their FAQ:

What types of programs can DMTCP checkpoint?

It checkpoints most binary programs on most Linux distributions. Some examples on which users have verified that DMTCP works are: Matlab, R, Java, Python, Perl, Ruby, PHP, Ocaml, GCL (GNU Common Lisp), emacs, vi/cscope, Open MPI, MPICH-2, OpenMP, and Cilk. See Supported Applications for further details. Our goal is to support DMTCP for all vanilla programs. If DMTCP does not work correctly on your program, then this is a bug in DMTCP. We would be appreciative if you can then file a bug report with DMTCP.

Imagine a non-checkpointable program

Run with dmtcp_launch (runs monitoring daemon if necessary)

Restart with dmtcp_restart_script.sh
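In practice the sequence could look like the sketch below, where ./count stands for any program without checkpointing support. The explicit dmtcp_command step is an addition for illustration; command names and flags should be checked against the --help output of the installed DMTCP version.

```bash
# Terminal 1: start the program under DMTCP control.
dmtcp_launch ./count

# Terminal 2: ask the coordinator to checkpoint the processes it manages.
dmtcp_command --checkpoint

# Later, from the directory holding the checkpoint files: resume.
./dmtcp_restart_script.sh
```

The restart script is generated by DMTCP when the checkpoint is written.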

Launch the coordinator and the program withautomatic checkpointing every 30 seconds

Launch coordinator and restart program
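A sketch of both steps with an explicitly started coordinator. The flag names are those of the DMTCP command-line tools as far as can be assumed here, and should be checked against --help for the installed version.

```bash
# First run: coordinator in the background, checkpoint every 30 seconds.
dmtcp_coordinator --daemon --exit-on-last --interval 30
dmtcp_launch ./count

# Restart: start a coordinator again, then resume from the last checkpoint.
dmtcp_coordinator --daemon --exit-on-last --interval 30
./dmtcp_restart_script.sh
```

With --exit-on-last the coordinator stops once its last client process has exited, so no daemon is left behind when the job ends.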

https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job

https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_rstr.job

Summary, Wrap-up and Conclusions.

damien.francois@uclouvain.be
UCL/CISM

October 2014
CISM/CÉCI training session

Never click 'Discard' again...

The submission script(s)

● Either one big one or two small ones
● Checkpoint periodically or --signal
● Requeue automatically
● Open-mode=append
