Using resources effectively
Overview
Teaching: 15 min
Exercises: 10 min
Questions
How do we monitor our jobs?
How can I get my jobs scheduled more easily?
Objectives
Understand how to look up job statistics and profile code.
Understand job size implications.
We now know virtually everything we need to know about getting things done on a cluster. We can log on, submit different types of jobs, use pre-installed software, and install and use software of our own. What we need to do now is use these systems effectively.
Estimating required resources using the scheduler
Although we covered requesting resources from the scheduler earlier, how do we know what type of resources the software will need in the first place, and the extent of its demand for each?
Unless the developers or prior users have provided some idea, we don’t. Not until we’ve tried it ourselves at least once. We’ll need to benchmark our job and experiment with it before we know how great its demand for system resources is.
Read the docs
Most HPC facilities maintain documentation as a wiki, website, or a document sent along when you register for an account. Take a look at these resources, and search for the software of interest: somebody might have written up guidance for getting the most out of it.
The most effective way of figuring out the resources a job needs to run successfully is to submit a test job, and then ask the scheduler about its impact using sacct -u yourUsername.
You can use this knowledge to set up the next job with a close estimate of its load on the system. A good general rule is to ask the scheduler for 20% to 30% more time and memory than you expect the job to need. This ensures that minor fluctuations in run time or memory use will not result in your job being cancelled by the scheduler. Keep in mind that if you ask for too much, your job may not run even though enough resources are available, because the scheduler will be waiting to match what you asked for.
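As an illustration of this rule of thumb (the numbers and the command here are placeholders, not taken from the lesson): if a test run finished in roughly 8 minutes and peaked at about 800 MB of memory, the follow-up submission might request about 25% more of each.
#!/bin/bash
#SBATCH --time=00:10:00    # ~25% more than the ~8 minutes observed in the test run
#SBATCH --mem=1000M        # ~25% more than the ~800 MB observed in the test run

./your_command             # placeholder for whatever program you are benchmarking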
Benchmarking fastqc
Create a job that runs the following command in the same directory as the .fastq files:
[yourUsername@gra-login1 ~]$ fastqc name_of_fastq_file
You’ll need to figure out a good amount of resources to allocate for this first “test run”. You might also want to have the scheduler email you to tell you when the job is done.
Hint: The job only needs 1 CPU and not too much memory or time. The trick is figuring out just how much you’ll need!
Is fastqc available?
You might need to load the fastqc module before fastqc will be available. Unsure? Run:
[yourUsername@gra-login1 ~]$ which fastqc
Solution
First, write the SLURM script to run fastqc on the file supplied at the command-line.
[yourUsername@gra-login1 ~]$ cat fastqc-job.sh
#!/bin/bash
#SBATCH -t 00:10:00

fastqc $1
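If you also want the scheduler to email you when the job finishes, as the exercise suggests, Slurm’s --mail-type and --mail-user options can be added to the same script (the address below is a placeholder):
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --mail-type=END                # email when the job ends
#SBATCH --mail-user=you@example.com    # placeholder: substitute your own address

fastqc $1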
Now, create and run a script to launch a job for each .fastq file.
[yourUsername@gra-login1 ~]$ cat fastqc-launcher.sh
for f in *.fastq
do
  sbatch fastqc-job.sh $f
done
[yourUsername@gra-login1 ~]$ chmod +x fastqc-launcher.sh
[yourUsername@gra-login1 ~]$ ./fastqc-launcher.sh
Once the job completes (note that it takes much less time than expected), we can query the scheduler
to see how long our job took and what resources were used. We will use sacct -u yourUsername
to
get statistics about our job.
[yourUsername@gra-login1 ~]$ sacct -u yourUsername
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1964 bash standard default 1 COMPLETED 0:0
1964.extern extern default 1 COMPLETED 0:0
1964.0 bash default 1 COMPLETED 0:0
1965 build-ind+ summer-sc+ default 1 COMPLETED 0:0
1965.batch batch default 1 COMPLETED 0:0
1965.extern extern default 1 COMPLETED 0:0
This shows all the jobs we ran recently (note that there are multiple entries per job). To get info about a specific job, we change the command slightly.
[yourUsername@gra-login1 ~]$ sacct -u yourUsername -l -j 1965
It will show a lot of info, in fact, every single piece of info collected on your job by the
scheduler. It may be useful to redirect this information to less
to make it easier to view (use
the left and right arrow keys to scroll through fields).
[yourUsername@gra-login1 ~]$ sacct -u yourUsername -l -j 1965 | less
Some interesting fields include the following:
- Hostname: Where did your job run?
- MaxRSS: What was the maximum amount of memory used?
- Elapsed: How long did the job take?
- State: What is the job currently doing/what happened to it?
- MaxDiskRead: Amount of data read from disk.
- MaxDiskWrite: Amount of data written to disk.
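Rather than scrolling through every field, you can ask sacct to print just the ones you care about with its --format option; a sketch using the fields above and the job ID from this example:
[yourUsername@gra-login1 ~]$ sacct -u yourUsername -j 1965 --format=JobID,JobName,Elapsed,MaxRSS,MaxDiskRead,MaxDiskWrite,State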
Measuring the statistics of currently running tasks
Connecting to Nodes
Typically, clusters allow users to connect directly to compute nodes from the head node. This is useful to check on a running job and see how it’s doing, but is not a recommended practice in general, because it bypasses the resource manager.
If you need to do this, check where a job is running with squeue, then run ssh nodename.
Give it a try!
Solution
[yourUsername@gra-login1 ~]$ ssh nodename
We can also check on stuff running on the login node right now the same way (so it’s
not necessary to ssh
to a node for this example).
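A minimal sketch of that workflow, assuming you have a job running (the node name is a placeholder; the real one appears in the NODELIST column of squeue’s output):
[yourUsername@gra-login1 ~]$ squeue -u yourUsername
[yourUsername@gra-login1 ~]$ ssh nodename       # replace nodename with the entry from NODELIST
[yourUsername@nodename ~]$ top                  # then inspect your processes on that node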
Monitor system processes with top
The most reliable way to check current system stats is with top
. Some sample output might look
like the following (type q
to exit top
):
[yourUsername@gra-login1 ~]$ top
top - 21:00:19 up 3:07, 1 user, load average: 1.06, 1.05, 0.96
Tasks: 311 total, 1 running, 222 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.2 us, 3.2 sy, 0.0 ni, 89.0 id, 0.0 wa, 0.2 hi, 0.2 si, 0.0 st
KiB Mem : 16303428 total, 8454704 free, 3194668 used, 4654056 buff/cache
KiB Swap: 8220668 total, 8220668 free, 0 used. 11628168 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1693 jeff 20 0 4270580 346944 171372 S 29.8 2.1 9:31.89 gnome-shell
3140 jeff 20 0 3142044 928972 389716 S 27.5 5.7 13:30.29 Web Content
3057 jeff 20 0 3115900 521368 231288 S 18.9 3.2 10:27.71 firefox
6007 jeff 20 0 813992 112336 75592 S 4.3 0.7 0:28.25 tilix
1742 jeff 20 0 975080 164508 130624 S 2.0 1.0 3:29.83 Xwayland
1 root 20 0 230484 11924 7544 S 0.3 0.1 0:06.08 systemd
68 root 20 0 0 0 0 I 0.3 0.0 0:01.25 kworker/4:1
2913 jeff 20 0 965620 47892 37432 S 0.3 0.3 0:11.76 code
2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
Overview of the most important fields:
- PID: What is the numerical id of each process?
- USER: Who started the process?
- RES: What is the amount of memory currently being used by a process (in bytes)?
- %CPU: How much of a CPU is each process using? Values higher than 100 percent indicate that a process is running in parallel.
- %MEM: What percent of system memory is a process using?
- TIME+: How much CPU time has a process used so far? Processes using 2 CPUs accumulate time at twice the normal rate.
- COMMAND: What command was used to launch a process?
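On a busy shared node it can help to limit top to your own processes; top’s -u flag does this (yourUsername stands in for your actual account name):
[yourUsername@gra-login1 ~]$ top -u yourUsername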
htop
provides a curses-based overlay for top
, producing a better-organized and “prettier”
dashboard in your terminal. Unfortunately, it is not always available. If this is the case,
politely ask your system administrators to install it for you.
ps
To show all processes from your current session, type ps.
[yourUsername@gra-login1 ~]$ ps
PID TTY TIME CMD
15113 pts/5 00:00:00 bash
15218 pts/5 00:00:00 ps
Note that this will only show processes from our current session. To show all processes you own
(regardless of whether they are part of your current session or not), you can use ps ux.
[yourUsername@gra-login1 ~]$ ps ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
yourUsername 67780 0.0 0.0 149140 1724 pts/81 R+ 13:51 0:00 ps ux
yourUsername 73083 0.0 0.0 142392 2136 ? S 12:50 0:00 sshd: yourUsername@pts/81
yourUsername 73087 0.0 0.0 114636 3312 pts/81 Ss 12:50 0:00 -bash
This is useful for identifying which processes are doing what.
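If you are looking for one particular program among your processes, you can filter the output with grep; for example, to check whether one of the fastqc jobs from earlier is still running (a sketch):
[yourUsername@gra-login1 ~]$ ps ux | grep fastqc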
Key Points
The smaller your job, the faster it will schedule.