Running interactive jobs
Overview
Teaching: 30 min
Exercises: 0 min
Questions
How do I run jobs that allow me to interact with my Python code?
How do I debug code on an interactive node?
Objectives
Be able to start interactive jobs and run Python code
Batch jobs are great, but …
Batch jobs are great, but you need to implement them correctly before they will work.
It’s quite unsatisfying having to submit a job just to find out whether it will work, particularly when the job spends a long time waiting in the scheduler queue only to fail within a few seconds or minutes of starting.
Interactive jobs give you a much faster turnaround while you figure your job out: the scheduler gives you the resources you ask for and connects you to a terminal session on the compute node, where you can try things out.
They also give you the opportunity to run monitoring tools and see how your program is faring.
Use salloc instead of sbatch
Let’s take the CPU prime example from an earlier lesson (the GPU example would be great to try, but we don’t have enough GPUs to ensure a reasonable waiting time for all students).
submit-cpu.sh
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1000M
#SBATCH --time=00:30:00
module load python/3.11
virtualenv --no-download $SLURM_TMPDIR/venv    # build a virtual environment in the job's local temporary directory
source $SLURM_TMPDIR/venv/bin/activate
pip install --no-index --upgrade pip           # --no-index: install from the cluster's local wheels, not PyPI
pip install --no-index numba
python primes_cpu.py
In this case we would like to get exactly the same resources from the scheduler as we requested in the batch job, so we take the values from the “#SBATCH” lines and give them to salloc instead:
salloc --nodes=1 --cpus-per-task=1 --mem-per-cpu=1000M --time=00:30:00
Now we wait … the command will appear to hang, but it’s just waiting to get the resources from the scheduler. We will eventually get a command prompt.
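A quick way to confirm that the new prompt really is on a compute node, and that the variables used by the script are set (a minimal check; the exact hostname pattern depends on your cluster):
hostname             # should print a compute-node name, not the login node
echo $SLURM_JOB_ID   # the job ID the scheduler assigned to this interactive session
echo $SLURM_TMPDIR   # the per-job temporary directory used in the script above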
From here we can try out the first few lines of the Slurm script, one at a time:
module load python/3.11
virtualenv --no-download $SLURM_TMPDIR/venv
source $SLURM_TMPDIR/venv/bin/activate
pip install --no-index --upgrade pip
pip install --no-index numba
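Before running the main script it’s worth a quick sanity check that the virtual environment is active and numba can be imported (an extra step for illustration, not part of the original script):
which python                                         # should point into $SLURM_TMPDIR/venv
python -c "import numba; print(numba.__version__)"   # prints the installed numba version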
Now we get to the part where the prime detection script, the one that does the actual work, is run. We will force it into the background by adding & at the end of the line:
python primes_cpu.py &
(Be careful putting & in your batch scripts: if the script reaches its end while your code is still running in the background, the scheduler thinks the job has finished and kills it.)
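If you really do need & in a batch script, for example to run several tasks at once, the usual pattern is to end the script with wait so it doesn’t exit before the background processes finish (a sketch, not part of the example above):
python primes_cpu.py &   # first copy, running in the background
python primes_cpu.py &   # second copy, also in the background
wait                     # keep the script alive until both background tasks have finished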
We can “see” the program running by using the jobs command.
If we want to bring the program to the foreground, get the job id from jobs and run fg [job id] (probably fg 1).
We can suspend the program by pressing Ctrl-Z. This stops the program from running, but doesn’t kill it.
Check jobs again to see it and get the job id.
Finally, bg [job id] (probably bg 1) resumes the program in the background.
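Putting those steps together, a typical session looks something like this (the job number is an example; your output will differ slightly):
python primes_cpu.py &   # start the script in the background
jobs                     # lists it, something like “[1]+ Running python primes_cpu.py &”
fg 1                     # bring job 1 back to the foreground
# press Ctrl-Z to suspend it; the shell reports it as “Stopped”
jobs                     # job 1 now shows as Stopped
bg 1                     # resume job 1 in the background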
While the program is running in the background, run the htop command (or htop -u $USER to show only your own processes).
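In htop, look for the python process: with the single CPU we requested, its CPU% should sit near 100 while the primes are being computed, and the RES column shows how much of the 1000M of memory is actually in use (what to look for, not required output):
htop -u $USER   # interactive process viewer filtered to your own processes; press q to quit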
If trying this with a GPU node, add --gres=gpu:1 to your salloc command, load the CUDA module (module load cuda), and check what the GPU is doing with nvidia-smi.
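Put together, the interactive GPU session would look roughly like this (a sketch: the exact module name, and the name of the GPU version of the prime script from the earlier lesson, depend on your setup):
salloc --nodes=1 --cpus-per-task=1 --mem-per-cpu=1000M --gres=gpu:1 --time=00:30:00
module load cuda    # makes the CUDA toolkit, and nvidia-smi, available
nvidia-smi          # shows the GPU model, memory use, utilisation, and running processes
# set up the virtual environment as before, start the GPU prime script in the background with &,
# then run nvidia-smi again to see it appear in the GPU's process list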
Key Points
Interactive jobs are a useful way to set up or solve issues with Python code on a cluster.