MPI jobs

MPI (and other distributed memory) jobs run across multiple compute nodes, with processes only able to access the memory of the node on which they run. The MPI library is typically used to co-ordinate and pass messages between these processes.

MPI implementations typically contain an mpirun or mpiexec command which is used to launch the computation across multiple nodes. Slurm provides its own srun command for the same purpose.

In all cases we recommend launching mpi jobs using srun rather than mpirun or mpiexec.
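
If you are unsure which MPI plugin types srun supports on the system, you can list them (the output depends on how Slurm was configured):

srun --mpi=list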

Launching OpenMPI jobs

The following example runs an MPI job, starting 64 tasks on each of 4 nodes (256 tasks in total). As the job uses entire nodes, it explicitly requests the parallel partition (--partition=parallel) and requests that all available memory on each node be allocated to the job (--mem=0).

... using srun

#!/bin/bash
#  Submit to the parallel queue - replace the account with the hpc project you are using.
#SBATCH --job-name="srun_test"
#SBATCH --account=MY_HPC_PROJECT
#SBATCH --partition=parallel

#  Request 4 nodes for the job, 64 cores and all available memory on each node.
#  Maximum runtime 20 minutes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
#SBATCH --time=00:20:00 
#SBATCH --mem=0

# Do not pass submitting shell environment to the job script
#SBATCH --export=None

cd $SLURM_SUBMIT_DIR

#  Load gcc/openmpi which were used to build the code
module load gcc/12.3.0
module load openmpi/4.1.5

#  Run your code - note --mpi=pmix is required as srun by default treats codes as non-MPI.
srun --mpi=pmix ./mpi_simple_gcc

Codes built with the AMD or Intel compilers which use OpenMPI can be run in the same way; just load the relevant compiler module in place of gcc.
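
For example, a code built with the AMD compilers and OpenMPI might be launched as follows (the aocc module version and the mpi_simple_aocc binary name are illustrative - check which modules are installed on your system):

module load aocc/4.0.0
module load openmpi/4.1.5
srun --mpi=pmix ./mpi_simple_aocc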

... using mpirun

Launching MPI jobs with mpirun is possible, though we recommend using srun instead.

#!/bin/bash
#  Submit to the parallel queue - replace the account with the hpc project you are using.
#SBATCH --job-name="mpirun_test"
#SBATCH --account=MY_HPC_PROJECT
#SBATCH --partition=parallel

#  Request 4 nodes for the job, 64 cores and all available memory on each node.
#  Maximum runtime 20 minutes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
#SBATCH --time=00:20:00 
#SBATCH --mem=0

# Do not pass submitting shell environment to the job script
#SBATCH --export=None

cd $SLURM_SUBMIT_DIR

#  Load gcc/openmpi which were used to build the code
module load gcc/12.3.0
module load openmpi/4.1.5

#  Launch your code with mpirun
mpirun ./mpi_simple_gcc

Launching Intel oneAPI MPI jobs

Intel provide their own MPI implementation, based on MPICH, alongside the Intel oneAPI compiler suite. Codes can also be built using the oneAPI compilers with OpenMPI - in the latter case use the OpenMPI instructions above to launch your code.
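
For example, a code built with the oneAPI compilers but linked against OpenMPI might be launched as follows (the mpi_simple_oneapi_ompi binary name is illustrative, and the OpenMPI module built for the oneAPI compilers may differ on your system):

module load intel-oneapi-compilers/2023.1.0
module load openmpi/4.1.5
srun --mpi=pmix ./mpi_simple_oneapi_ompi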

The following scripts will launch a code built with the Intel oneAPI compilers and linked against the Intel MPI library:

... using srun

#!/bin/bash
#  Submit to the parallel queue - replace the account with the hpc project you are using.
#SBATCH --job-name="srun_test"
#SBATCH --account=MY_HPC_PROJECT
#SBATCH --partition=parallel

#  Request 4 nodes for the job, 64 cores and all available memory on each node.
#  Maximum runtime 20 minutes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
#SBATCH --time=00:20:00
#SBATCH --mem=0
# Do not pass submitting shell environment to the job script
#SBATCH --export=None

#  Load the compiler and MPI modules which were used to build the code
cd $SLURM_SUBMIT_DIR
module load intel-oneapi-compilers/2023.1.0
module load intel-oneapi-mpi/2021.9.0

# Set up the environment to force Intel MPI to use the same PMI2 library as Slurm.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

# Launch the code using pmi2 (PMIx support for oneAPI is not fully tested yet).
srun --mpi=pmi2 ./mpi_simple_oneapi

... using mpirun

Codes linked against Intel MPI can still be launched with mpirun. Note that we have seen slow job start-up times when using mpirun/mpiexec, and we recommend launching with srun as above instead.

#!/bin/bash
#  Submit to the parallel queue - replace the account with the hpc project you are using.
#SBATCH --job-name="mpirun_test"
#SBATCH --account=MY_HPC_PROJECT
#SBATCH --partition=parallel

#  Request 4 nodes for the job, 64 cores and all available memory on each node.
#  Maximum runtime 20 minutes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=64
#SBATCH --time=00:20:00
#SBATCH --mem=0
# Do not pass submitting shell environment to the job script
#SBATCH --export=None

#  Load the compiler and MPI modules which were used to build the code
cd $SLURM_SUBMIT_DIR
module load intel-oneapi-compilers/2023.1.0
module load intel-oneapi-mpi/2021.9.0

# Set up the environment to force Intel MPI to use the same PMI2 library as Slurm.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so

# Launch the code using mpirun
mpirun ./mpi_simple_oneapi

Hybrid MPI codes

It is usual for large MPI codes to also make use of shared-memory parallelism - that is, MPI launches distributed tasks across several nodes, and each of those tasks uses OpenMP to run across multiple CPU cores.

The following example starts 16 MPI tasks on 4 nodes (4 tasks per node), with each task using 16 CPUs via OpenMP. The main change is in the request for tasks per node and CPUs per task; additionally, as this code also uses OpenMP, we set OMP_NUM_THREADS:

#!/bin/bash
#  Submit to the parallel queue - replace the account with the hpc project you are using.
#SBATCH --job-name="srun_test"
#SBATCH --account=MY_HPC_PROJECT
#SBATCH --partition=parallel

#  Request 4 nodes for the job, 4 tasks per node, 16 cpus per task and all available memory on each node.
#  Maximum runtime 20 minutes
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --time=00:20:00 
#SBATCH --mem=0
# Do not pass submitting shell environment to the job script
#SBATCH --export=None

cd $SLURM_SUBMIT_DIR

#  Load gcc/openmpi which were used to build the code
module load gcc/12.3.0
module load openmpi/4.1.5

#  Run your code - note --mpi=pmix is required as srun by default treats codes as non-MPI.
#  As this is a hybrid code and uses OpenMP, we also set OMP_NUM_THREADS
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmix ./hybrid_mpi_code

CPU affinity

By default, when launching MPI tasks with srun, each task is bound to a single CPU core. There may be occasions when you want to change this default binding - this can be done using the --cpu-bind=... option of srun. In most cases the default will be optimal.

--cpu-bind=none:

Do not bind tasks to cores - tasks are free to be scheduled on any core assigned to the job.

--cpu-bind=rank:

Bind by task rank - the lowest numbered task on each node is bound to core (or socket) 0, the next to core (or socket) 1, and so on. Only supported if the entire node is allocated to the job.

--cpu-bind=sockets:

Bind tasks to sockets - useful for hybrid codes where each MPI rank runs on a single socket.

--cpu-bind=verbose:

Print the CPU binding of each task to stdout before the code starts. Useful for checking that the binding is as expected. This option can be combined with the other options - for example:

--cpu-bind=verbose,rank:
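
As a sketch, reusing the whole-node OpenMPI example from above, the srun line would become:

srun --mpi=pmix --cpu-bind=verbose,rank ./mpi_simple_gcc

This reports the binding of each task before the code starts and then binds tasks to cores in rank order.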

The optimal CPU binding, particularly for hybrid MPI codes, can be difficult to identify. If you require help with this, please contact RCS support; the research software engineering team can assist.