an introduction to checkpointing - uclouvain · 2016-10-31 · distributed multi-threaded...

62
An introduction to checkpointing for scientifc applications [email protected] UCL/CISM November 2016 CISM/CÉCI training session

Upload: others

Post on 04-Jun-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

An introduction to

checkpointingfor scientifc applications

[email protected]/CISM

November 2016CISM/CÉCI training session

Page 2: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

What ischeckpointing ?

Page 3: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count

Page 4: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count1

Page 5: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count12

Page 6: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123

Page 7: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$

Page 8: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count

Page 9: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count1

Page 10: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count1

Without checkpointing:

Page 11: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count1

$ ./count123^C$ ./count4

Without checkpointing: With checkpointing:

Page 12: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count12

$ ./count123^C$ ./count45

Without checkpointing: With checkpointing:

Page 13: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count12 3

$ ./count123^C$ ./count45 6

Without checkpointing: With checkpointing:

Page 14: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

$ ./count123^C$ ./count12 3

$ ./count123^C$ ./count45 6

Without checkpointing: With checkpointing:

Checkpointing:

'saving' a computation so that it can be resumed later

(rather than started again)

Page 15: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Today's agenda:

1. General concepts and scientifc soft.

2. Working with Signals

3. Slurm recipes

4. DMTCP

Page 16: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Why do we needcheckpointing ?

Page 17: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Imagine a text editor without 'checkpointing' ...

Page 18: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

The idea:

Save the program state

every time a checkpoint is encountered

and restart from there upon (un)planned stop

rather than bootstrap again from scratch

Values in variablesOpen fles...

Position in the codeSignal or event...

starting loops at iteration 0creating tmp fles...

Page 19: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

1. Fit in time constraints

2. Debugging, monitoring

3. Cope with hardware failures

4. Job preemption

Goals of checkpointing in HPC:

Page 20: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

1. Fit in time constraints

Goals of checkpointing in HPC:

All clusters limit maximum 'wall' time of jobs

to allow high job turnover

to ensure fair time sharing of the cluster

-------------(and reduce waiting times...)

Page 21: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

2. Debugging, monitoring

Goals of checkpointing in HPC:

Checkpointing means saving the state on disk

-> You can view the state while the job is running

-> You can restart at the checkpoint before a bug occurred

Page 22: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Goals of checkpointing in HPC:

3. Cope with hardware failures

Page 23: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

4. Job preemption

Goals of checkpointing in HPC:

Not used at CÉCI, preemption is the ability for

a high-priority job to re-queue a low-priority job

Page 24: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

1 Many scientifc software save stateafter N iteration.

Page 25: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

Many scientifc software have built-in checkpointing capabilities

(although it might not be called that way)

Check the documentation

Evaluate the options : tradeoff between

I/O overhead

portabilityease of use

Page 26: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://www.gaussian.com/g_blog/faq2.htm

Page 27: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://cfd.direct/openfoam/user-guide/controlDict/

Page 28: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

https://www.cp2k.org/restarting

Page 29: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

https://www.molpro.net/info/2015.1/doc/quickstart/node65.html

Page 30: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://www.cfs.dl.ac.uk/docs/html/part3/node6.html

Page 31: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://cms.mpi.univie.ac.at/vasp/vasp/ISTART_tag.html

Page 32: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://lammps.sandia.gov/doc/restart.html

Page 33: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://www.gromacs.org/Documentation/How-tos/Doing_Restarts

Page 34: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Working with checkpoint-restart-able software

http://www.abinit.org/doc/helpfiles/for-v7.10/input_variables/varrlx.html#restartxf

Page 35: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

So you can play

On lemaitre2: ~dfr/checkpoint.tgz

Page 36: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

2 Using UNIX signals to reduceoverhead : do not save the state ateach iteration -- wait for the signal.

Page 37: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

UNIX processes can receive 'signals' from the user, the OS, or another process

Page 38: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

UNIX processes can receive 'signals' from the user, the OS, or another process

^C

^Z

^D

fg, bg

kill -9

kill

Page 39: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

UNIX processes can receive 'signals' from the user, the OS, or another process

e.g.

Page 40: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

UNIX processes can receive 'signals' from the user, the OS, or another process

e.g.

Page 41: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

UNIX processes can receive 'signals' with an associated default action

Page 42: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded
Page 43: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

3 Use Slurm signaling abilities tomanage checkpoint-able software inSlurm scripts on the clusters.

Page 44: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

scancel is used to send signals to jobs

Page 45: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

--signal to have Slurm send signals automaticallybefore the end of the allocation

Page 46: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Example: send SIGINT 60 seconds before job is killed (so, here, after 2 minutes)

Page 47: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

If your program exits with a non-zero exit code incase of interruption, you can have your job re-

queued automatically

Page 48: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Note the --open-mode=append

Page 49: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Or chain the jobs...

Page 50: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Using a signal-based watchdogto re-queue the job just before it is killed

Page 51: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

4 Making non restartable softwarerestartable with DMTCP

Page 52: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded
Page 53: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

● Distributed Multi-Threaded CheckPointing● Works with Linux Kernel 2.6.9 and later● Supports sequential and multi-threaded computations across

single/multiple hosts● Entirely in user space (no kernel modules or root privilege)● Transparent (no recompiling, no re-linking)● Written at Northeastern U. and MIT and under active development for 5+

years● LGPL'd and freely available● No remote I/O● Supports threads, mutexes/semaphoes, forks, shared memory, exec, and

many more

Advertised Features

What types of programs can DMTCP checkpoint?It checkpoints most binary programs on most Linux distributions. Some examples on which users haveverified that DMTCP works are: Matlab, R, Java, Python, Perl, Ruby, PHP, Ocaml, GCL (GNU CommonLisp), emacs, vi/cscope, Open MPI, MPICH-2, OpenMP, and Cilk. See Supported Applications for furtherdetails. Our goal is to support DMTCP for all vanilla programs. If DMTCP does not work correctly onyour program, then this is a bug in DMTCP. We would be appreciative if you can then file a bug report with DMTCP.

From their FAQ:

Page 54: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Imagine a non-checkpointable program

Page 55: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Run with dmtcp_launch (runs monitoring daemon if necessary)

Page 56: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Restart with dmtcp_restart_script.sh

Page 57: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

:q

Launch the coordinator and the program withautomatic checkpointing every 30 seconds

Page 58: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Launch coordinator and restart program

Page 59: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.jobhttps://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_rstr.job

Page 60: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Summary,Wrap-up andConclusions.

[email protected]/CISM

October 2014CISM/CÉCI training session

Page 61: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

Never click 'Discard' again...

Page 62: An introduction to checkpointing - UCLouvain · 2016-10-31 · Distributed Multi-Threaded CheckPointing Works with Linux Kernel 2.6.9 and later Supports sequential and multi-threaded

The submission script(s)

● Either one big one or two small ones● Checkpoint periodically or --signal● Requeue automatically● Open-mode=append