Simple batch jobs

We consider a simple batch job to be one which is serial (uses only a single CPU core) or shared memory (uses multiple CPU cores, but only a single node). This is the simplest type of batch job and will be all that is required for many use cases.

Simple serial job

The following job script requests a single CPU core for 2 hours, with a maximum of 4 gigabytes of memory.

#!/bin/bash
#
# Example SLURM job script for ALICE

#SBATCH --job-name=simple_job
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=02:00:00
#SBATCH --export=NONE

# Optional - if you have an HPC project registered, submit against it to get a higher QoS.
# Remove the following line if you are not registered with an HPC project.
#SBATCH --account=MY_HPC_PROJECT

hostname
date
sleep 60
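
Assuming the script above has been saved as simple_job.sh (the filename here is just an example), submit it to the scheduler with sbatch:

sbatch simple_job.sh

sbatch will print the ID assigned to the new job, which you can use when monitoring or cancelling it.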

The script is a simple shell script; however, the lines beginning with #SBATCH contain directives for the job scheduler. These directives describe the resources required to run the job and provide additional information about how the job should be run.

#SBATCH --job-name=simple_job

Gives the job a name which will be reported in squeue output. If you are running multiple jobs at a time, it is often useful to give them readable names so that you can easily identify which job is which.
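
For example, you can list your own queued and running jobs, including their names, with squeue (the --me option requires a reasonably recent version of SLURM; on older versions use -u $USER instead):

squeue --me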

#SBATCH --cpus-per-task=1

The job runs on a single CPU core.

#SBATCH --mem=4G

The job requires 4 gigabytes of memory. Note that --mem requests memory per node, so if you run a job which requires multiple nodes, this much memory will be allocated on each node the job runs on.
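
If you would rather scale the memory request with the number of CPUs requested, SLURM also accepts a per-CPU request (note that --mem and --mem-per-cpu are mutually exclusive, and your site may enforce its own limits):

#SBATCH --mem-per-cpu=1G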

#SBATCH --time=02:00:00

The job will complete within 2 hours. It is important to provide as accurate an upper limit on the job's runtime as you can: the less runtime you request, the easier the job is to schedule (and so the sooner it will start). However, if the job is still running when its time limit expires, it will be terminated by the system.
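
The time limit accepts several formats, including a days component. For example, to request one day and twelve hours:

#SBATCH --time=1-12:00:00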

#SBATCH --export=NONE

Prevents the environment from your submission shell from being exported to the job when it runs. This is normally the behaviour you want for batch submission.

#SBATCH --account=MY_HPC_PROJECT

If you are a member of one or more HPC projects, you should submit your job against one of these projects in order to obtain a higher QoS and make use of the advanced queues and increased resources available to HPC project members. If you are not an HPC project member, simply delete this line from your script. See the HPC Projects documentation for more information.
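
If you are unsure which accounts you can submit against, the sacctmgr command can usually list them (a sketch; the exact fields available depend on the site's accounting configuration):

sacctmgr show associations user=$USER format=Account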

Some of these directives you will not be able to give accurate values for when you first run a computation. In these cases it is best to be generous with the resources you request and to request an email on job completion (see the shared memory job below for how to do this). The memory usage and runtime of the job will be contained in this email and can be used to provide more accurate values the next time you submit a similar job.
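
You can also query a completed job's actual elapsed time and peak memory usage directly with sacct, and use those figures to tune future requests (123456 below is a placeholder job ID):

sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,State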

Shared memory jobs

Some codes that you may wish to run on ALICE support parallelism and can take advantage of multiple CPU cores. The simplest (and most common) way to achieve this is shared memory parallelism, in which multiple threads or processes run on a single node, often using a library such as OpenMP.

To run a code which supports shared memory parallelism, use a job script like the following:

#!/bin/bash
#
# Example shared memory SLURM job script for ALICE

#SBATCH --job-name=simple_parallel
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=MAIL_ADDRESS
#SBATCH --output=parallel-%j.out
#SBATCH --error=parallel-%j.err
#SBATCH --export=NONE
# Optional - if you have an HPC project registered, submit against it to get a higher QoS.
# Remove the following line if you are not registered with an HPC project.
#SBATCH --account=MY_HPC_PROJECT

# load any needed modules
module load gcc/12.3.0

# cd into the job submission directory
cd "$SLURM_SUBMIT_DIR"

# for OpenMP codes, set OMP_NUM_THREADS to the number of CPUs requested.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./your_openmp_code

# non-OpenMP codes usually require a parameter (typically -n, -t, --threads
# or similar) to specify the number of threads or processes to start. For example:
./your_threaded_code -t $SLURM_CPUS_PER_TASK

There are some changes in the scheduler directives:

#SBATCH --cpus-per-task=16

This time we ask for multiple CPUs to be assigned. Together with --nodes=1, this ensures that all 16 cores are allocated on a single node, which shared memory codes require.

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=MAIL_ADDRESS

This will cause an email to be sent to "MAIL_ADDRESS" (replace this with your email address) when the job starts, ends or fails. The emails sent when the job ends or fails contain useful information on your job's CPU and memory usage, parallel efficiency and runtime.
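
Some systems also provide the seff utility (one of SLURM's contributed tools, so it may not be installed everywhere), which prints a similar efficiency summary for a completed job; 123456 below is a placeholder job ID:

seff 123456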

#SBATCH --output=parallel-%j.out
#SBATCH --error=parallel-%j.err

Change the location to which your job's output and error files are written. If these directives are not specified, both output and error will be written to a single file named slurm-JOB_ID.out. The %j in the filename will be replaced by the job ID when the file is written. You can also use %u for your username and %x for the job name.

#SBATCH --export=NONE

By default, SLURM passes the environment of the shell from which the job was submitted (including all loaded modules) to the shell running your script. This makes sense for many interactive jobs; however, we recommend using this option to prevent this behaviour in batch jobs. It ensures your job starts with a clean environment, which helps jobs run consistently. Note that when doing this, you will need to load any modules your code needs from within the script.

OpenMP codes

The number of threads used by an OpenMP code is read from the environment variable OMP_NUM_THREADS when the code starts. To make your code use the correct number of threads, add export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK to your job script before executing your code.

$SLURM_CPUS_PER_TASK is set by the scheduler to the number of CPUs you requested when submitting the job.
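
If you sometimes run the same script outside of a batch job, standard shell parameter expansion lets you fall back to a sensible default when SLURM_CPUS_PER_TASK is unset (a small sketch):

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}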

Other threaded/multiple processor codes

Other codes which support shared memory parallelism will usually require you to specify the number of threads/processes to start, either via a command line parameter or by setting an environment variable prior to running the code (see the sketch below). Please check the documentation for the code you are using to find the correct parameter to set.
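
For example, a code which reads its worker count from an environment variable (NUM_WORKERS here is purely illustrative; your code's documentation will give the real name) would be run as:

export NUM_WORKERS=$SLURM_CPUS_PER_TASK
./your_code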

Output / error files

By default, both output and error are written to a single file named slurm-{job_id}.out, but this can be customised in the job submission file, e.g.:

#SBATCH --output=myjob.out
#SBATCH --error=myjob.err

Use the job name and number in the file names to avoid overwriting files from previous jobs:

#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

Check the filename pattern section of the manual page for sbatch for other variables which can be used:

man sbatch