Checkpoints
Program execution is sometimes too long for the duration allowed by the cluster's job submission systems. Long program executions are also subject to system instabilities. A program with a short execution time can be easily restarted. However, when program execution becomes very long, it is preferable to use checkpoints to minimize the chances of losing several weeks of computation. These checkpoints will subsequently allow the program to be restarted.
Creating and Loading a Checkpoint
Checkpoint creation and loading may already be implemented in an application you are using. In this case, simply use this functionality and consult the relevant documentation as needed.
However, if you have access to the application's source code and/or are its author, you can implement checkpoint creation and loading. Fundamentally:
- A checkpoint file should be created periodically; intervals of 2 to 24 hours are suggested.
- While writing the file, keep in mind that the compute task can be interrupted at any time, for any technical reason. Therefore:
  - It is preferable not to overwrite the previous checkpoint when creating a new one.
  - Writing can be made atomic by performing a final operation that confirms the checkpoint was written completely. For example, you can name each file with the date and time of its creation, and only once the write has finished, point a symbolic link named "latest-version" at the new, uniquely named checkpoint file. A more advanced method is to write a second file containing a hash of the checkpoint, which allows the checkpoint's integrity to be validated when it is eventually loaded.
  - Once the atomic write is complete, you can decide whether or not to delete old checkpoints.
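The symbolic-link and hash-sum ideas above can be sketched in a few lines of shell. The file names (`checkpoint-<date>.dat`, `latest-version`) and the dummy "simulation state" payload are illustrative assumptions, not part of any particular application:

```shell
#!/bin/bash
set -e

# Write the new checkpoint under a unique, timestamped name
# so the previous checkpoint is never overwritten.
ckpt="checkpoint-$(date +%Y%m%d-%H%M%S).dat"
echo "simulation state" > "${ckpt}"

# Record a hash so the file's integrity can be verified on load.
sha256sum "${ckpt}" > "${ckpt}.sha256"

# Publish the finished checkpoint by updating the symlink last;
# a reader following "latest-version" never sees a partial file.
# (For a strictly atomic swap on GNU systems, create a temporary
# symlink and rename it with `mv -T`.)
ln -sfn "${ckpt}" latest-version

# On restart, validate the checkpoint before loading it.
sha256sum -c "${ckpt}.sha256"
```

If the job is killed mid-write, `latest-version` still points at the last complete checkpoint, and a corrupted file fails the hash check instead of being loaded silently.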
Note
To avoid reinventing the wheel, especially if modifying the source code is not an option, we suggest using DMTCP.
DMTCP
The DMTCP (Distributed Multithreaded CheckPointing) software allows you to checkpoint programs without having to recompile them. The first execution is performed with the dmtcp_launch program, specifying the interval between checkpoints. Restarting is done by executing the dmtcp_restart_script.sh script. By default, this script and the program's restart files are written to the directory from which the program was launched; you can change the location of the checkpoint files with the --ckptdir <checkpoint-directory> option. Run dmtcp_launch --help for the full list of options. Note that DMTCP does not currently work with MPI-parallelized software.
An example script:
#!/bin/bash
# ---------------------------------------------------------------------
# SLURM script for job resubmission on a Compute Canada cluster.
# ---------------------------------------------------------------------
#SBATCH --job-name=job_chain
#SBATCH --account=def-someuser
#SBATCH --cpus-per-task=1
#SBATCH --time=0-10:00
#SBATCH --mem=100M
# ---------------------------------------------------------------------
echo "Current working directory: $(pwd)"
echo "Starting run at: $(date)"
# ---------------------------------------------------------------------
# Run your simulation step here...
if test -e "dmtcp_restart_script.sh"; then
    # A checkpoint exists: restart from it.
    ./dmtcp_restart_script.sh -h "$(hostname)"
else
    # No checkpoint yet: start a new simulation.
    dmtcp_launch --rm -i 3600 -q <program> <arg1> ... <argn>
fi
# ---------------------------------------------------------------------
echo "Job finished with exit code $? at: $(date)"
# ---------------------------------------------------------------------
Resubmitting a Long-Running Job
If a long computation must be broken down into several Slurm jobs, the two recommended methods are:
- using Slurm job arrays;
- resubmitting from the end of the script.
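The second method can be sketched as a self-resubmitting Slurm script. This is a minimal example, not a complete job script: the `step.txt` counter file and the limit of 10 segments are illustrative assumptions, and the `#SBATCH` values mirror the DMTCP example above:

```shell
#!/bin/bash
#SBATCH --job-name=job_chain
#SBATCH --account=def-someuser
#SBATCH --time=0-10:00

# Count how many segments of the chain have already run
# (hypothetical counter file; starts at 0 if absent).
step=$(cat step.txt 2>/dev/null || echo 0)

# ... run one segment of the simulation here,
#     loading the latest checkpoint if one exists ...

step=$((step + 1))
echo "${step}" > step.txt

# Resubmit this same script until 10 segments have completed.
# (The sbatch check lets the script run outside a cluster for testing.)
if command -v sbatch >/dev/null && [ "${step}" -lt 10 ]; then
    sbatch "$0"
fi
```

Each job runs for at most its requested walltime, saves a checkpoint, and submits its successor; the chain ends when the counter reaches the limit or the computation converges.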