Checkpointing with DMTCP

Checkpointing means saving a snapshot of an application’s state so that it can restart from that point after a failure. This is essential for long-running HPC applications. See: https://en.wikipedia.org/wiki/Application_checkpointing

Your long-running programs should be able to do application checkpointing. If they can’t, contact the authors and politely ask them to add checkpointing to their software. Some programmers, though, are recalcitrant, and you will need to find another way to checkpoint your programs. Here we show how you can use DMTCP (Distributed MultiThreaded CheckPointing). See: http://dmtcp.sourceforge.net/
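To make the idea of application checkpointing concrete, here is a minimal sketch of a program that checkpoints itself. It is not the primes.py from the examples directory; the function names, the JSON state file and the max_batches argument are all assumptions made for illustration. The state (next number to test, primes found so far) is saved after every batch, so a restarted run continues from the last saved batch.

```python
import json
import os
import tempfile


def is_prime(n):
    """Trial division; slow, but enough for a small demonstration."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def load_state(path, start):
    """Resume from the checkpoint file if it exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next": start, "primes": []}


def save_state(path, state):
    """Write to a temporary file and then rename it, so that a crash in
    the middle of a write cannot leave a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def find_primes(start, stop, ckpt_path, batch=1000, max_batches=None):
    """Search [start, stop) in batches, checkpointing after each batch.

    max_batches is a hypothetical knob used only to simulate a failure:
    the function returns None once that many batches have been processed.
    """
    state = load_state(ckpt_path, start)
    done = 0
    while state["next"] < stop:
        if max_batches is not None and done >= max_batches:
            return None  # simulated interruption
        upper = min(state["next"] + batch, stop)
        state["primes"] += [n for n in range(state["next"], upper) if is_prime(n)]
        state["next"] = upper
        save_state(ckpt_path, state)
        done += 1
    return state["primes"]
```

Calling find_primes(100000, 200000, "state.json") a second time after an interrupted run picks up at the saved "next" value instead of starting again. When a program has no logic like this built in, a tool such as DMTCP can checkpoint it from the outside.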

Download the DMTCP Example Files

Copy the code from the examples directory to your own directory. The README.txt file in there will explain what the files are.

$ cp -r /shared/eresearch/pbs_job_examples/checkpointing_dmtcp/ .

Running DMTCP

Here we test DMTCP with an example primes program. In this first test we won’t run it under PBS, so that you can concentrate on how to use DMTCP.

The primes program should already be set to search for primes between 100,000 and 200,000. This will take about 3 minutes and find 8392 primes.
Because it runs for only a short time and uses only one core, you can run this test on the login node.

In one terminal, start the primes program under the dmtcp_launch program. Do not add an & at the end, i.e. don’t background the program.

$ dmtcp_launch ./primes.py 
Prime Number Finder
Looking for prime numbers in the range 100000--200000 ...

While that is running, in another terminal you can test out a few DMTCP commands …

$ dmtcp_command --list
Coordinator:
  Host: localhost
  Port: 7779 (default port)
Client List:
PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
1, python[40000:16626]@ermdc13.itd.uts.edu.au, 2937d6087255c268-40000-9290f9d5644a3, WorkerState::RUNNING

This shows that a DMTCP “Coordinator” has been started and it is monitoring the primes program.

In that same terminal you can ask it to create a “checkpoint”:

$ dmtcp_command --checkpoint

This creates two files: a checkpoint file containing all the information needed to restart your program, and a restart script:
ckpt_python_2937d6087255c268-40000-929b4b1365bea.dmtcp
dmtcp_restart_script_2937d6087255c268-40000-929b47452ef78.sh

dmtcp_restart_script.sh is a symlink pointing to the real restart script listed above.

The script ./dmtcp_restart_script.sh is what you use to restart your program. It handles various system-level bookkeeping tasks and is the recommended way to restart your program. (But note: it does not seem to work under PBS.)

Now back at the first terminal kill your program (by hitting Control-C) to simulate something bad happening. (Alternatively you can also type killall python in the second terminal to kill the python process running your primes program, but make sure you don’t have other python programs running!)

$ dmtcp_launch ./primes.py 
Prime Number Finder
Looking for prime numbers in the range 100000--200000 ...
^C
KeyboardInterrupt
$

Wait a few minutes …

Let’s restart our program now. In any terminal restart your killed program as below. Wait till it’s finished.

$ ./dmtcp_restart_script.sh
Found 8392 primes during this run.

This is what part of the primes.txt file will look like:

Prime numbers in the range 100000--200000
100003   2019-01-15 13:19
100019   2019-01-15 13:19
100043   2019-01-15 13:19
.....
120871   2019-01-15 13:20
120877   2019-01-15 13:20
120889   2019-01-15 13:20
120899   2019-01-15 13:20 
120907   2019-01-15 13:44 <-- note time gap
120917   2019-01-15 13:44
120919   2019-01-15 13:44
....
199961   2019-01-15 13:47
199967   2019-01-15 13:47
199999   2019-01-15 13:47

Notice that primes.py finds hundreds of primes every minute. But in the middle of this text file there is a large time gap. That’s because DMTCP checkpointed the program just as it had calculated and written out that 120899 was a prime and after that the program was terminated. Then when we restarted it at 13:44 it picked up where it left off.
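If you would rather find the gap programmatically than by eye, the following sketch scans lines in the format shown above and reports the largest jump between consecutive timestamps. It assumes each data line starts with the prime followed by a date and a time (to the minute), as in the sample; the function names are made up for this example.

```python
from datetime import datetime


def parse_line(line):
    """Return (prime, timestamp) for a data line, or None for headers,
    ellipsis lines and anything else that does not match the format."""
    parts = line.split()
    if len(parts) < 3:
        return None
    try:
        stamp = datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M")
        return int(parts[0]), stamp
    except ValueError:
        return None


def largest_gap(lines):
    """Return (gap, prime_before, prime_after) for the largest time gap
    between consecutive data lines, or None if there are fewer than two."""
    best = None
    prev = None
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue
        if prev is not None:
            gap = row[1] - prev[1]
            if best is None or gap > best[0]:
                best = (gap, prev[0], row[0])
        prev = row
    return best
```

To use it on your own output, pass the open file: with open("primes.txt") as f: print(largest_gap(f)). The two primes returned are the last one written before the gap and the first one written after it.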

Numerous configure-time and run-time options are also available; see man dmtcp. For example, to checkpoint automatically every 30 seconds:

$ dmtcp_launch --interval 30 ./primes.py

Make sure to remove old ckpt images that you no longer need as they can be large.
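To see which images are taking up space before you delete anything, a small helper like the one below can list them. This is only a sketch: it assumes the images match the ckpt_*.dmtcp naming seen in this example, and the function names are hypothetical. It only lists files; deleting is left to you.

```python
import glob
import os


def checkpoint_images(directory="."):
    """Return (path, size_in_bytes, mtime) for each ckpt_*.dmtcp file
    in the directory, oldest first."""
    rows = []
    for path in glob.glob(os.path.join(directory, "ckpt_*.dmtcp")):
        info = os.stat(path)
        rows.append((path, info.st_size, info.st_mtime))
    return sorted(rows, key=lambda row: row[2])


def total_size_mib(rows):
    """Total size of the listed images in MiB."""
    return sum(size for _, size, _ in rows) / (1024 * 1024)
```

For example, for path, size, _ in checkpoint_images(): print(path, size) prints each image with its size in bytes, oldest first.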

Running DMTCP with PBS

We will now run the same primes program with DMTCP but under PBS, as we should be doing for all our jobs. The modified PBS submission script to launch a DMTCP-managed job is job_launch_dmtcp.sh.

$ qsub job_launch_dmtcp.sh
108870.hpcnode0
$

As this is a test, we might want to do something bad to our program. Here is how to do that:
Type qstat -an1 to find out which node your job is running on. SSH into that node (you might have to use an interactive PBS session). Then run ps -x and look for the PID of the python ./primes.py process. Kill it with kill -9 PID, where PID is the correct process ID (e.g. kill -9 50285). Otherwise you can just let the job run to completion.

The job script will continue, so the primes.txt file as-is will be copied to your PBS submission directory and the scratch directory removed.

If you have a look at the primes.txt file from your killed job you will see it does not contain the expected 8392 primes:

$ cat primes.txt | wc -l
2816

and the end of the file might look like this:

$ tail primes.txt
132893   2019-01-15 13:20
132911   2019-01-15 13:20
132929   2019-01-15 13:20

You can see it only got as far as 132,929, well short of 200,000. There will also be a DMTCP checkpoint file in your directory.

Re-running DMTCP with PBS

Now that we have run the job (and maybe got it to fail as a test), let’s re-run the job so it can pick up where it left off. The modified PBS submission script to re-launch a DMTCP-managed job is job_restart_dmtcp.sh. Before we launch that, though, we need to do some things.

When the python program was killed the submission script continued on; it moved your data back to the submission directory (line mv ${SCRATCH}/primes.txt ${PBS_O_WORKDIR}) and removed the scratch directory (line rmdir ${SCRATCH}). This needs to be reversed in the job restart script as the image ckpt_*.dmtcp will expect to find it again.

But what was the full path to the scratch directory? It’s in the checkpoint image! As the scratch directory was set to /scratch/${USER}_${PBS_JOBID%.*} if I grep for “scratch” I can find out what the full path was:

$ strings ckpt_python_568a0c8247ca6381-40000-1ba9356b291ce8.dmtcp | grep scratch
...
/scratch/mlake_108870
...
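If the strings utility is not available, roughly the same search can be sketched in Python. This is an approximation of what strings piped into grep does (runs of at least four printable ASCII characters, filtered by a substring), not an exact reimplementation, and the function name is made up for this example. As with the shell version, it can only find the path if it is stored as plain text in the image.

```python
import re


def find_strings(data, needle, min_len=4):
    """Return the sorted, unique runs of printable ASCII characters in
    `data` (bytes) that are at least min_len long and contain `needle`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    matches = {m for m in pattern.findall(data) if needle in m}
    return sorted(m.decode("ascii") for m in matches)
```

To run it against a checkpoint image, read the file in binary mode and pass the bytes, for example find_strings(open(image_path, "rb").read(), b"scratch"), where image_path is the name of your own ckpt_*.dmtcp file. Note that this reads the whole image into memory.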

So we need to create this directory and copy the partial primes.txt into it. You will see the two lines to do this in the job restart script. Finally, before we restart, let’s make a backup of the partial output:

$ cp primes.txt primes.txt.bak

Now submit the restart script.

$ qsub job_restart_dmtcp.sh
108873.hpcnode0

Wait till it’s finished.

You should now have a primes.txt file that is complete and contains 8392 unique primes.
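One way to check this is to count the distinct values in the first column and look for any that appear twice. The sketch below assumes the file format shown earlier (a header line, then one prime per line followed by a timestamp); the function name is hypothetical.

```python
def unique_primes(lines):
    """Return (count_of_unique_values, list_of_repeated_values) for the
    numbers found in the first column. Lines whose first field is not
    a plain integer (header, ellipsis, blank lines) are skipped."""
    seen = set()
    duplicates = []
    for line in lines:
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue
        value = int(fields[0])
        if value in seen:
            duplicates.append(value)
        else:
            seen.add(value)
    return len(seen), duplicates
```

Run it with: with open("primes.txt") as f: print(unique_primes(f)). For a complete restarted run you would expect a count of 8392 and an empty list of repeats; a non-empty list suggests some primes were written both before the kill and after the restart.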

DMTCP has many options. Read the references, the man page for dmtcp, and the man pages for the other DMTCP commands under /usr/bin/. Although DMTCP works fine for this example, your mileage may vary.

References

DMTCP (Distributed MultiThreaded CheckPointing): http://dmtcp.sourceforge.net and the FAQ at http://dmtcp.sourceforge.net/FAQ.html

One of the maintainers of DMTCP has some example PBS scripts here: https://github.com/rohgarg/dmtcp-job-scripts

A more complex example is here: https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job (note that this is for SLURM, not PBS).