This page is a basic overview of ALICE for users new to HPC. It will help you run your first job on the cluster.

After this, consult the sections linked on the left, starting with Getting Started, and use the 'Search' function at the top of the page to find specific topics.

What is ALICE?

ALICE is an HPC (High Performance Computing) cluster of interconnected computers (referred to as 'nodes').

Why use ALICE?

If a computing task requires more memory, processing power, or speed than a desktop or laptop can provide, especially for complex problems or large datasets, then running that task on ALICE may be useful.

Basic architecture of ALICE

ALICE is composed of:

  • Login nodes - where you connect and prepare jobs
  • Compute nodes - where your jobs actually run
  • Storage - shared filesystems for data
  • Scheduler - manages resource allocation and queues to run jobs

Nodes

A node is one computer within the system.

Node types and quantities on ALICE:

  • Login (4)
  • Compute - CPU (56)
    • Compute - large memory (4)
    • Compute - GPU (8)

Scheduler

SLURM is the scheduler used on ALICE. This allocates resources fairly.

Jobs can be submitted to the scheduler requesting resources needed such as:

  • 8 CPUs and 80GB RAM for 2 hours
  • 1 GPU and 64GB RAM for 14 hours
  • An interactive job with 1 CPU and 20GB RAM for 1 hour
  • 4 entire compute nodes for 3 days

The scheduler decides when and where your job runs.
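
As a rough illustration, the example requests above could be written as standard Slurm options along the following lines. This is a sketch under assumptions: any partition, account or GPU-type options that ALICE may require are not given on this page and are left out, and myjob.sub stands for whatever job script is being submitted. The script only prints the commands; it does not submit anything.

```shell
#!/bin/bash
# Sketch only: the four example requests written as generic Slurm options.
# Site-specific options (partition, account, GPU type) are deliberately omitted
# because they are not stated on this page.

cpu_job="--cpus-per-task=8 --mem=80G --time=02:00:00"          # 8 CPUs, 80GB, 2 hours
gpu_job="--gres=gpu:1 --mem=64G --time=14:00:00"               # 1 GPU, 64GB, 14 hours
interactive_job="--cpus-per-task=1 --mem=20G --time=01:00:00"  # 1 CPU, 20GB, 1 hour
whole_nodes="--nodes=4 --exclusive --time=3-00:00:00"          # 4 whole nodes, 3 days

# Print the commands for inspection; nothing is submitted here.
echo "sbatch $cpu_job myjob.sub"
echo "sbatch $gpu_job myjob.sub"
echo "srun $interactive_job --pty bash"
echo "sbatch $whole_nodes myjob.sub"
```

The same options can be placed inside a job script as #SBATCH lines instead of being given on the command line.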

Storage

Each of these storage areas is available from all nodes:

  • /home - 60GB of permanent, backed-up storage for small files / code / scripts
  • /scratch - fast temporary storage
  • /data - shared project storage (if requested)

All nodes also have temporary local storage.

The Research File Store (RFS) is available on login nodes.
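
Because /home is limited to 60GB, it can be useful to check how much space a directory is using. The sketch below is one possible way to do this with standard tools. The function name usage_percent is made up for this example, and the total reported by du may not match exactly how ALICE accounts for quota.

```shell
#!/bin/bash
# usage_percent DIR QUOTA_GB
# Prints the share of QUOTA_GB used by DIR as a whole-number percentage.
# Hypothetical helper for illustration; not an ALICE-provided command.
usage_percent() {
    # du -sk prints the total size in KB, then a tab, then the path
    used_kb=$(du -sk "$1" | cut -f1)
    # 1 GB is taken as 1024 * 1024 KB
    echo $(( used_kb * 100 / ($2 * 1024 * 1024) ))
}

# Demonstration on the current directory against a 60GB limit
echo "Current directory uses $(usage_percent . 60)% of 60GB"
```

On ALICE the first argument would typically be your home directory, for example usage_percent "$HOME" 60. Summing a large directory tree can take a while.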

Log on to ALICE

  • Check you can login via SSH (secure shell) or NoMachine.

Note

When prompted for your username, enter it in the form 'abc1', not your friendly email address.

  • You will need at least a basic understanding of Linux on the command line and be able to create and edit text files. If you have not used Linux via the command line, work through the tutorial here (you do not need to wait to attend a course).

Now that you have logged in to one of the four login nodes, you can interact with the system.

Simple, short tasks can be run on a login node. To take advantage of the full capability of the system, however, jobs should be submitted to the scheduler to run on one or more of the compute nodes.

Important

Computationally intensive tasks should not be run on the login nodes. More on this below.

Run a task on a compute node or nodes via the scheduler

  • Why run jobs via the scheduler when they can be started on a login node?

The login nodes may have hundreds of other users logged in. If a user starts running intensive jobs on a login node, this will clog up the node and affect other users' work. Please be a good ALICE user.

Login nodes should be used for:

  • Editing
  • Compiling
  • Preparing jobs
  • Lightweight testing

A simple batch job

Copy the text in the box below (substituting your email address for MAIL_ADDRESS) into a file named myjob.sub:

#!/bin/bash
#
# Example SLURM job script for ALICE

#SBATCH --job-name=simple_job
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:02:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=MAIL_ADDRESS
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --export=NONE

echo "starting at:"
date
echo
echo "Running on:"
hostname
echo
sleep 60
echo "finishing at:"
date

Now, submit the job to the scheduler:

sbatch myjob.sub

You will receive an email when the job starts and another when it finishes. Please read them, particularly the summary in the 'Job.....Ended' one.

The output and error files will also now have been written to the directory you submitted the job from. Make sure you read their contents.
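
The output and error file names come from the %x-%j pattern in the job script, where Slurm substitutes the job name for %x and the job ID for %j. The sketch below mimics that substitution so you know which files to look for. The function expand_pattern and the job ID 12345 are invented for illustration only.

```shell
#!/bin/bash
# expand_pattern PATTERN JOB_NAME JOB_ID
# Mimics how Slurm fills in %x (job name) and %j (job ID) in file name patterns.
# Hypothetical helper; assumes the job name contains no '/' or '&' characters.
expand_pattern() {
    printf '%s\n' "$1" | sed -e "s/%x/$2/g" -e "s/%j/$3/g"
}

# 12345 is a made-up job ID used purely for illustration.
expand_pattern "%x-%j.out" simple_job 12345
expand_pattern "%x-%j.err" simple_job 12345

# On the cluster itself, the real ID can be captured at submission time:
#   jobid=$(sbatch --parsable myjob.sub)
#   squeue -j "$jobid"               # is the job pending or running?
#   cat "simple_job-${jobid}.out"    # read the output once the job has finished
```

The commented commands at the end use standard Slurm tools and only work on a system where Slurm is available.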

Congratulations! You have just submitted a job to the cluster!

Now, please read the more in-depth documentation, starting with the Running Jobs section.

Using GPU nodes

If your code / program has been written to use GPUs then see here. If you don't know if your code / program has been written to use GPUs, then it probably hasn't, but check with the documentation for the program or with the maintainer / writer.

Running MPI (Message Passing Interface) jobs across multiple nodes

If your code / program has been written to run across multiple compute nodes, using MPI to pass data between processes on those nodes, then see here. If you don't know if your code / program can run across multiple compute nodes, then it probably can't, but check with the program documentation or the maintainer / writer of the code / program.

Useful tips

Good HPC etiquette on ALICE

ALICE is a shared system. Your usage of it will affect other users, as their usage will affect yours. Please be a good user.

Do:

  • Request realistic resources for your jobs
  • Remove temporary files from /system when you have finished with them
  • Keep temporary files out of /data
  • Use batch jobs wherever possible
  • Test small first
  • Read the documentation on this website

Don’t:

  • Run heavy jobs on login nodes
  • Request huge resources “just in case”
  • Leave idle interactive sessions
  • Store unused files on the system long term
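
One way to arrive at a realistic memory request is to run a small test job, note its peak memory use (for example from the summary in the job-ended email, or from the MaxRSS field reported by sacct), and add a modest margin. The sketch below does that arithmetic. The function name suggest_mem_mb and the default 20% headroom are assumptions made for this example, not ALICE policy.

```shell
#!/bin/bash
# suggest_mem_mb PEAK_MB [HEADROOM_PERCENT]
# Prints a memory request in MB: the observed peak plus headroom
# (default 20%), rounded up to the next whole MB.
# Hypothetical helper; the 20% default is an arbitrary illustration.
suggest_mem_mb() {
    peak_mb="$1"
    headroom="${2:-20}"
    echo $(( (peak_mb * (100 + headroom) + 99) / 100 ))
}

# Illustration: a test run that peaked at 3000 MB
echo "#SBATCH --mem=$(suggest_mem_mb 3000)M"
```

If a job is killed for running out of memory, increase the margin and try again rather than jumping straight to a very large request.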