This page is a very basic overview of ALICE for users new to HPC. It will help you run your first job on the cluster.

After this, consult the sections linked on the left, starting with Getting Started, and use the 'Search' function at the top of the page.

What is ALICE?
Why use ALICE?
Basic architecture of ALICE
- Nodes
- Scheduler
- Storage
Log on to ALICE
Run a task on a compute node or nodes via the scheduler
Useful tips
Good HPC etiquette on ALICE

What is ALICE?

ALICE is an HPC (High Performance Computing) cluster of interconnected computers (referred to as 'nodes').

Why use ALICE?

If a computing task requires more memory, processing power, or speed than a desktop or laptop can provide, especially for complex problems or large datasets, then running that task on ALICE may be useful.

Basic architecture of ALICE

ALICE is composed of:

Login nodes - for connecting to ALICE and preparing jobs
Compute nodes - where jobs actually run
Storage - shared filesystems for data
Scheduler - manages resource allocation and queues to run jobs

Nodes

A node is one computer within the system.

Node types and quantities on ALICE:

Login (4)
Compute - CPU (56)
Compute - large memory (4)
Compute - GPU (8)

Scheduler

SLURM is the scheduler used on ALICE. This allocates resources fairly.

Jobs submitted to the scheduler request the resources they need, such as:

8 CPUs and 80GB RAM for 2 hours
1 GPU and 64GB RAM for 14 hours
An interactive job with 1 CPU and 20GB RAM for 1 hour
4 entire compute nodes for 3 days

The scheduler decides when and where your job runs.
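As a rough sketch, the example requests above could be expressed with standard Slurm options as follows. The option names are standard Slurm, but this is an assumption about how they apply on ALICE: the partition or account options that ALICE may also require are not shown, and myjob.sub is only a placeholder script name. The snippet prints the commands rather than running them.

```shell
# Standard Slurm options for each example request.
# Any ALICE-specific partition or account options are not shown (assumption).
cpu_opts="--cpus-per-task=8 --mem=80G --time=02:00:00"
gpu_opts="--gres=gpu:1 --mem=64G --time=14:00:00"
interactive_opts="--cpus-per-task=1 --mem=20G --time=01:00:00"
whole_nodes_opts="--nodes=4 --exclusive --time=3-00:00:00"

# Print the commands that would be used; myjob.sub is a placeholder name.
echo "sbatch $cpu_opts myjob.sub"
echo "sbatch $gpu_opts myjob.sub"
echo "srun $interactive_opts --pty bash"
echo "sbatch $whole_nodes_opts myjob.sub"
```

Times use the Slurm formats HH:MM:SS and D-HH:MM:SS, so 3 days is written as 3-00:00:00.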

Storage

Each of these storage areas is available from all nodes:

/home - 60GB of permanent, backed-up storage for small files / code / scripts
/scratch - fast temporary storage
/data - shared project storage (if requested)

All nodes also have temporary local storage.

The Research File Store (RFS) is available on login nodes.

More detailed information about storage on ALICE
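To keep an eye on how much space you are using, a minimal sketch with the standard du and df tools might look like this. It assumes only the top-level paths listed above; ALICE may provide its own quota command, which is not shown here, and show_usage is a helper name made up for this example.

```shell
# Print the total size of a directory; print nothing if it does not exist.
show_usage() {
    if [ -d "$1" ]; then
        du -sh "$1" 2>/dev/null || true
    fi
}

# Compare this figure with the 60GB available in /home.
show_usage "$HOME"

# Show free space on the shared filesystems where they are mounted.
for fs in /scratch /data; do
    if [ -d "$fs" ]; then
        df -h "$fs"
    fi
done
```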

Log on to ALICE

Log in to ALICE via SSH (secure shell) or NoMachine.

Note

When prompted for your username, enter it in the form 'abc1', not your friendly email address.
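As an illustration of the username form, the SSH command could be put together as below. The hostname is a made-up placeholder, not the real ALICE address, so substitute the address given in the login documentation. The snippet only prints the command.

```shell
# Your username in the short 'abc1' form, not an email address.
alice_user="abc1"
# Placeholder only: NOT the real ALICE address.
alice_host="alice-login.example"

ssh_target="${alice_user}@${alice_host}"
echo "Connect with: ssh ${ssh_target}"
```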

You will need at least a basic understanding of Linux on the command line and must be able to create and edit text files. If you have not used Linux via the command line, work through the tutorial here (you do not need to wait to attend a course).

Now that you have logged in to one of the four login nodes, you can interact with the system.

Simple, short tasks can be run on a login node. To take advantage of the full capability of the system, however, jobs should be submitted to the scheduler to run on one or more of the compute nodes.

Important

Computationally intensive tasks should not be run on the login nodes. More on this below.

Run a task on a compute node or nodes via the scheduler

Why run jobs via the scheduler when they can be started on a login node?

The login nodes may have hundreds of other users logged in. If a user starts running intensive jobs on a login node, this will clog up the node and affect other users' work. Please be a good ALICE user.

Login nodes should be used for:

Editing
Compiling
Preparing jobs
Lightweight testing

A simple batch job

Copy the text in the box below (substituting your email address for MAIL_ADDRESS) into a file named myjob.sub:

#!/bin/bash
#
# Example SLURM job script for ALICE

#SBATCH --job-name=simple_job
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:02:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=MAIL_ADDRESS
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --export=NONE

echo "starting at:"
date
echo
echo "Running on:"
hostname
echo

# This path does not exist, so ls writes an error message to the .err file
ls -l /no/such/path

sleep 60
echo "finishing at:"
date

Now, submit the job to the scheduler:

sbatch myjob.sub

You will receive an email when the job starts and another when it finishes. Please read them, particularly the summary in the 'Job.....Ended' one.

The output and error files will also now have been written to the directory you submitted the job from. Make sure you read their contents.
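The sketch below shows one way to check on the job and read those files. It assumes the job script above, where %x in the file names expands to the job name and %j to the job ID. The job ID 12345 is hypothetical; use the number sbatch printed ("Submitted batch job <ID>").

```shell
# Hypothetical job ID: replace with the number printed by sbatch.
job_id=12345
job_name=simple_job    # matches --job-name in myjob.sub

# --output=%x-%j.out and --error=%x-%j.err give these file names.
out_file="${job_name}-${job_id}.out"
err_file="${job_name}-${job_id}.err"

# List your queued and running jobs where Slurm is installed.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$(id -un)"
fi

# Print each file if the job has created it.
for f in "$out_file" "$err_file"; do
    if [ -f "$f" ]; then
        echo "== $f =="
        cat "$f"
    else
        echo "$f not found (the job may not have started, or the job ID differs)"
    fi
done
```

With this job script, the .err file should contain the error message from the ls command, while the .out file holds the dates and the hostname.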

Congratulations! You have just submitted a job to the cluster!

Now, please read the more in-depth documentation, starting in the Running Jobs section.

Using GPU nodes

If your code / program has been written to use GPUs, then see here. If you don't know whether your code / program has been written to use GPUs, then it probably hasn't, but check the documentation for the program or ask its maintainer / writer.

Running MPI (Message Passing Interface) jobs across multiple nodes

If your code / program has been written to run across multiple compute nodes, using MPI to pass data between processes on those nodes, then see here. If you don't know whether your code / program can run across multiple compute nodes, then it probably can't, but check the program documentation or ask the maintainer / writer of the code / program.
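For orientation only, a multi-node request might look like the sketch below, which writes a hypothetical job script called mpi_job.sub. The program name ./my_mpi_program is made up, and the script assumes that srun can launch MPI programs on ALICE. The linked MPI documentation is the authority on the modules and launcher to use.

```shell
# Write a hypothetical MPI job script; all names in it are illustrative.
cat > mpi_job.sub <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Load the MPI environment your program was built with here.
# srun starts one copy of the program per task (2 nodes x 4 tasks = 8).
srun ./my_mpi_program
EOF

echo "Wrote mpi_job.sub"
```

Here --mem is the memory requested per node, so this job asks for 4GB on each of the two nodes.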

Useful tips

Use the development partition to test that jobs are correctly configured before scaling up and requesting more resources.
A typical workflow to develop and run a job:
- SSH into login node
- Transfer code / data
- Test using an interactive job and / or the development partition
- Write batch script to submit to scheduler
- Submit job
- Collect outputs once job finishes
How to ask for help if stuck

Good HPC etiquette on ALICE

ALICE is a shared system. Your usage of it will affect other users, as their usage will affect yours. Please be a good user.

Do:

Request realistic resources for your jobs
Remove temporary files from /system when you have finished with them
Keep temporary files out of /data
Use batch jobs wherever possible
Test small first
Read the documentation on this website

Don’t:

Run heavy jobs on login nodes
Request huge resources “just in case”
Leave idle interactive sessions
Store unused files on the system long term
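One way to judge whether a resource request was realistic is to compare what a finished job asked for with what it used. The sketch below uses the standard Slurm sacct command and assumes that job accounting is enabled on ALICE. The job ID 12345 is hypothetical.

```shell
# Hypothetical ID of a finished job: replace with one of your own.
job_id=12345

# Standard sacct fields: requested memory, peak memory used,
# run time, CPU time and final state.
fields="JobID,JobName,ReqMem,MaxRSS,Elapsed,TotalCPU,State"

if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format="$fields"
else
    echo "sacct not found: run this on an ALICE login node"
fi
```

If MaxRSS is far below ReqMem, or Elapsed is far below the time requested, the next submission can ask for less.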