Array Jobs
Array jobs provide a way of submitting hundreds or thousands of similar jobs with a single command (and a single job script). They reduce the load that such a large number of individual jobs would place on the scheduler, and they make managing the jobs easier for the user.
Each array task which is run by the scheduler runs the same job script, with different parameters.
The following example runs a hypothetical code "process_file.bin", which takes an input file and produces an output file (with the filenames given as command line parameters to the code). The example script runs this code 10 times on 10 different input files, named in001..in010, producing 10 output files out001..out010:
Example
#!/bin/bash
#
# Example SLURM job script for DiRAC 3
#SBATCH --job-name=array
#SBATCH --account=MY_HPC_PROJECT
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-10
#SBATCH --mem=2G
#SBATCH --time=00:30:00
#SBATCH --output=%x-%A_%a.out
#SBATCH --export=NONE
cd $SLURM_SUBMIT_DIR
module load gcc/12.3.0
module load openmpi/4.1.5
echo "JobID: $SLURM_ARRAY_JOB_ID, TaskID: $SLURM_ARRAY_TASK_ID"
INFILE=$(printf "in%03d" $SLURM_ARRAY_TASK_ID)
OUTFILE=$(printf "out%03d" $SLURM_ARRAY_TASK_ID)
./process_file.bin -i $INFILE -o $OUTFILE
The important directive here is --array=...
This defines the IDs of the tasks to be run. In the example above, 10 array tasks will be run, each with a different SLURM_ARRAY_TASK_ID from 1 to 10. The short bash script above runs the code with the appropriate command line parameters for each task.
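To make the task-ID-to-filename mapping concrete, here is a minimal sketch of the printf expansion used in the script. The variable is set by hand for illustration; inside a real array task the scheduler sets SLURM_ARRAY_TASK_ID for you.

```shell
# Illustration only: the scheduler normally sets this variable.
SLURM_ARRAY_TASK_ID=7

# %03d zero-pads the task ID to three digits, matching in001..in010.
INFILE=$(printf "in%03d" "$SLURM_ARRAY_TASK_ID")
OUTFILE=$(printf "out%03d" "$SLURM_ARRAY_TASK_ID")
echo "$INFILE $OUTFILE"   # in007 out007
```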
Task IDs can be specified in a few other ways:

--array=1,4,7,20,33 - run only the listed task IDs (separated by commas).

--array=1-9:2 - run tasks starting from 1, up to 9, with a stride of 2 (so tasks 1,3,5,7,9).

--array=1-1000%50 - run 1000 tasks, starting at task ID 1 and ending at task ID 1000, but schedule at most 50 of these tasks simultaneously. This is a particularly useful feature when tasks are bottlenecked by a shared resource (often access to the filesystem or a shared database), as it reduces the impact of the job on the cluster and helps the tasks run efficiently.
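As a quick sanity check, the task list that a ranged specification expands to can be reproduced with seq (start, stride, end). This is purely illustrative; SLURM does the expansion itself.

```shell
# The IDs that --array=1-9:2 selects: start 1, stride 2, end 9.
seq -s, 1 2 9   # prints 1,3,5,7,9
```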
Best Practices
There are a few things to consider when running array jobs:
- These types of jobs often bottleneck on a single resource (the filesystem, access to a shared database, etc.), which means that as more tasks run simultaneously their CPU efficiency drops. In these cases you should make use of the % operator in the --array=... directive to limit the number of simultaneous tasks. If you are unsure, please contact RCS support.
- It is easy to start large numbers of very short (a few minutes) tasks over a short period of time. This puts undue stress on the job scheduler and affects scheduler performance for other users. If your jobs contain short (<20 minute) tasks, please consider running a number of these tasks in each array task to reduce the number of array tasks which need to be started.
Managing array jobs
An array job can be cancelled in the same way as any other job. For example, to cancel an array job with job ID 1075:
scancel 1075
This would cancel all array tasks of this job. You can also cancel specific tasks within the job, for example:
scancel 1075_7 1075_19
Would cancel array tasks 7 and 19 of array job_id 1075.
You can also cancel a range of tasks - for example
scancel 1075_[1-5]
Would cancel tasks 1 to 5.