- Login/Connection Issues
- Issues with NX / NoMachine
- Issues with software / applications
- Job submission issues
- My process was killed.
- My job was killed but there is no output and no errors were reported.
- My command or job won't run and I see an error message "XXXXXXXX"
- When will my job start running?
- Why hasn't my job started?
- Previous jobs requesting the same resources would have started by now, why hasn't this one started?
- How can a scheduled job be started sooner?
- How much resource (time, memory, nodes, cpu cores) should be requested for a job?
- My job started but only ran for 10 minutes.
- My job shows as Blocked or Deferred?
- How is my job prioritised?
- How do I acknowledge use of the HPC facilities in research publications?
- When do HPC Service Days occur?
- How do I increase my home directory storage quota?
- I cannot write files to /data or /scratch but I am not over my file space quota
Login/Connection Issues
I cannot log into ALICE via SSH nor NoMachine
Please first check that it is not currently an HPC Service Day.
As well as at the link above, upcoming Service Days are listed whenever you log into the system, whether via SSH or NoMachine.
If you attempt to access ALICE during a Service Day you may see a message relating to Multi-Factor Authentication, but you will not be able to access the system even if you enter an MFA verification code.
If you are not on the University network and your account has MFA enabled, you will need to complete MFA authentication after entering your password. The authentication method will vary, depending on how you have configured it.
I cannot authenticate with my SSH key
It is not possible to log into ALICE from outside the University network with SSH keys. ALICE requires MFA for account logins (for accounts enrolled into MFA), and this means we are not able to allow SSH key authentication from outside the network.
SSH key authentication will continue to work when connecting to ALICE from within the University network. That means from a University-based computer, or via the VPN or University Remote Desktop service.
Issues with NX / NoMachine
I see a green screen when using NX / NoMachine
Sometimes when resuming a suspended NX session (particularly from a different computer with a different monitor size or resolution), a green screen will be seen rather than the expected desktop session.
To fix this, access the NoMachine menu either by hovering your mouse over the top right-hand corner of the NX session or by using the "ctrl-alt-0" shortcut. Then select 'Display', ensure 'fit to window' is not enabled and 'resize remote screen' is enabled, then select 'Done' or the '<' icon in the top left (twice, to exit the menu system).
I am unable to log in with NX / NoMachine
There are several situations that can cause this issue:
Screen stuck on 'Creating a new GNOME virtual desktop'
If you see this error it can be a symptom of several different issues. Please work through the list below, even if you do not see the specific error message quoted. If none of these solutions allow you to log in to HPC via NoMachine again, contact the Service Desk and mention that you have worked through all the items in this FAQ.
Password needs to be changed
If your password is approaching the time at which you must change it to comply with the password policy, you will not be able to log in with NX until you change it. You can still log in via SSH, where you will see a message warning you that you will need to change your password soon; you can then change your password with the passwd command. Alternatively, you can change your password via a managed Windows Desktop or Outlook web access.
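For example, after logging in via SSH:
passwd
You will be prompted for your current password and then asked to enter the new password twice.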
Account over quota
Usually a failure to log in with NX is due to your account having gone over quota. The NX server session for ALICE stores some state files within your home directory; if you have reached or exceeded your file quota, these cannot be created and NX fails.
You will see an error message about not being able to create a file, eg:
The session negotiation failed.
Error: Cannot create file 'authority'
Error: Cannot create session directory:
Disk quota exceeded
If this happens it is still possible to log in to ALICE via SSH / secure shell / PuTTY. Log in, check your quota with the quotacheck command and, if necessary, delete any unwanted files (using the rm command) or move them to your scratch directory (using the mv command). Once some space has been cleared, it will be possible to log in with NX again.
You can see which files or directories in your home directory are using the most space with the homeusage command.
Home directory quotas will not be increased. If you need more storage space you should use one of the other file systems available.
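As a rough sketch of clearing space (the file and directory names here are only illustrative; substitute your own, and check the path of your scratch directory):
quotacheck                               # check current usage against your quota
homeusage                                # see what is using the most space
rm ~/old_results.tar                     # delete an unwanted file
mv ~/big_dataset /path/to/your/scratch/  # move a large directory to scratch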
Dbus errors
These are often introduced by local conda/anaconda/miniconda installs in your home directory.
If you see an error message saying something like "Could not connect to session bus: Failed to connect to socket /tmp/dbus-jWUMQLCUP1: Connection refused":
Log in via SSH / secure shell / PuTTY (as you cannot log in via NX) and use a command line terminal.
Check in your ~/.bashrc (or possibly ~/.bash_profile or ~/.profile) files whether something has been added to $PATH or $LD_LIBRARY_PATH which may have caused problems. To display the contents of your .bashrc file from the command line, enter:
cat ~/.bashrc
Examples of possible causes will look similar to one of the following:
1.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/splash/giza/lib
2.
export PATH="${PATH}:${HOME}/glimmer3.02/bin"
3. or, by loading some module files for convenience:
module load softwarepackage/version
4. or, added automatically by running an installation script, eg:
# added by Miniconda2 installer
export PATH="$HOME/miniconda2/bin:$PATH"
or:
# added by Anaconda3 2018.12 installer
# >>> conda init >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$(CONDA_REPORT_ERRORS=false '/home/a/abc1/anaconda3/bin/conda' shell.bash hook 2> /dev/null)"
if [ $? -eq 0 ]; then
    \eval "$__conda_setup"
else
    if [ -f "/home/a/abc1/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/a/abc1/anaconda3/etc/profile.d/conda.sh"
        CONDA_CHANGEPS1=false conda activate base
    else
        \export PATH="/home/a/abc1/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda init <<<
Solution
Then, still whilst logged in via SSH / secure shell / PuTTY, use a command line terminal to either:
If you are using a conda variant: set the 'auto_activate_base' parameter to 'false' in ~/.condarc by entering the following on the command line (this need only be done once)
conda config --set auto_activate_base false
More information on Python and installing conda in your home directory.
Or you can try commenting out (putting # at the start of the relevant line or lines) recent changes in ~/.bashrc, ~/.bash_profile or ~/.profile relating to PATH or LD_LIBRARY_PATH. You can use a text editor such as nano to do this.
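For example, if the Miniconda lines shown earlier are the likely culprit, open the file with nano ~/.bashrc and comment them out so they become:
# added by Miniconda2 installer
# export PATH="$HOME/miniconda2/bin:$PATH"
Log out and back in (or start a new shell) for the change to take effect, then try NX again.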
A corrupted previous NX / NoMachine session
If a previous session did not shut down or disconnect properly there may be invalid saved session data which is preventing access. If you have tried all of the methods above, log on via SSH then on the command line enter:
mv ~/.nx ~/.nx.bak
If you are still unable to log on to your desktop, please contact the Service Desk and mention that you have worked through all the items in this FAQ.
Issues with software / applications
What software is available? / Is software XXXXXX available?
See the Modules page for information on how to search for available software and how to enable it for use.
How do I set up my account so the modules I always need are loaded automatically?
You can edit the file ~/.bashrc, adding entries at the end of the file to load the modules you always need. If you haven't made any other modifications to this file and you always wish to use R version 4.2.1, for example, your .bashrc file would become:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
module load R/4.2.1
I get a message saying "command not found".
This is probably because you haven't loaded the module which sets up your environment correctly for this software. See the Modules page for more information.
If you have loaded the module but still receive the "command not found" error, it may be that access to this module is restricted. A small number of commercial applications are restricted to specific user groups; these include Mathematica, SAS, Tecplot, COMSOL, STATA and Supermongo.
You will need to open a support request with the Service Desk requesting to be added to the relevant group, and authorisation will be sought from the owner.
Job submission issues
My process was killed.
The login nodes on ALICE have a hard limit of two hours of CPU time per process. If this is exceeded, the process will be killed.
This limit applies to CPU time not runtime: an idle process may run for many hours without consuming much CPU time. Similarly a multicore process may consume two hours of CPU time very quickly.
Long running processes should not be launched directly on the login nodes. Instead they should be submitted to the queue using the job scheduler. For further information see the page about submitting Interactive vs Batch jobs.
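As a minimal sketch of a batch submission (the script name, program and resource values here are placeholders; see the pages linked above for full details):
#!/bin/bash
#SBATCH --job-name=long_task
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=04:00:00
# run the long-running command on a compute node rather than a login node
./my_long_running_program
Save this as, for example, long_task.sh and submit it with:
sbatch long_task.sh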
My job was killed but there is no output and no errors were reported.
Almost certainly this is because you have not specified a memory request for the job, or have specified a value which is too low. Increase the value requested and resubmit your job.
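For example, if a job requesting 10G of memory was killed, try doubling the request in the submission script and resubmitting (the new value is only a starting point; refine it once the job completes):
#SBATCH --mem=20G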
My command or job won't run and I see an error message "XXXXXXXX"
Error messages on their own with no context are rarely useful to help diagnose problems. If you are raising a support request, please include useful information such as:
- Command(s) entered
- Full path to and name of job submission script
- Job number(s)
- Error message(s) - but only if you have included the information above.
When will my job start running?
To check if and when a job is scheduled to start (this is an estimate, not a guarantee):
squeue -j JOB_ID --start
If the JOB_ID is not known or has been forgotten, all jobs currently scheduled for a user can be seen with:
squeue --me
Why hasn't my job started?
First, check if and when it is scheduled to start.
If the job is scheduled to start after the next Service Day then it is likely that more time has been requested for the job than was available before the Service Day.
If the 'squeue' commands in the section above show the message below in the 'NODELIST(REASON)' column for the job:
(Job's QOS not permitted to use this partition)
Then the job does not have access to a specialist partition which has been requested, either explicitly or implicitly. Please check here for more information about this message and variations of it.
Other messages which may be seen in the 'NODELIST(REASON)' column for the job when running 'squeue':
(Nodes required for job are DOWN DRAINED or reserved for jobs in higher priority partitions)
(Priority)
(PartitionNodeLimit)
(QOSMaxJobsPerUserLimit)
Please check here for more information about these messages.
Previous jobs requesting the same resources would have started by now, why hasn't this one started?
First, check the status of the job with the command in the 'When will my job start running?' section.
If the job is not scheduled to start for some time yet, this is the earliest the scheduler can fit it in with the other currently running and scheduled jobs on the system. The previous job requesting the same resources may have been submitted when the cluster was less busy with fewer jobs in the scheduling system.
How can a scheduled job be started sooner?
The sooner a job is submitted to the scheduler, the sooner it will start. Do not wait 'until the cluster is less busy' before submitting a job: other users will continue to submit jobs whilst you wait, and if those jobs request similar resources to yours, they will start first.
Make sure a job does not request more resource (time, memory, cpus) than it needs.
- If a submitted job requests 10 hours but runs in 4 hours, it may miss an opportunity to start earlier, e.g. if a scheduling slot is available for more than 4 but less than 10 hours.
- If a job requests 10 cpu cores but only uses 4 of them, it may have to wait longer for the 10 cores to become available than if only 4 had been requested. The extra 6 cores will not be available for other jobs whilst this job is running, which is a waste of compute resources.
See the FAQ entry below for help with determining what resources to request.
How much resource (time, memory, nodes, cpu cores) should be requested for a job?
The number of cpu cores a task can use should be known from the documentation for the software being used.
Wallclock time and memory can only be determined empirically, through experience and testing. Try submitting a job to the scheduler with a guesstimate of the resources needed, eg:
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
#SBATCH --time=10:00:00
Make sure the job is configured to send a report when it starts and ends (or fails):
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=abc1@leicester
A summary of the job will be emailed which will look something like:
Brief Summary for job 288888 (My_Job_name):
Memory per node requested: 10.00GiB
Max Memory per node used: 8.54GiB
Wallclock time requested: 10:00:00
Wallclock time elapsed: 1:00:00
Wallclock time accuracy: 10.00%
Exit state: COMPLETED
The job requested 10 hours of Wallclock time, but used only 1 hour. The next time a similar job is run, the time requested could be reduced to 1 hour 30 minutes. It is always better to request slightly more time in case a job runs for a bit longer than previous similar jobs. However, the less Wallclock time requested, the sooner a job is likely to start. Requesting 10% more time than a job is expected to use would be sensible.
Another example:
Brief Summary for job 288889 (Job_no_2):
Memory per node requested: 10.00GiB
Max Memory per node used: 7.42GiB
Wallclock time requested: 10:00:00
Wallclock time elapsed: 10:00:12
Wallclock time accuracy: 100.00%
Exit state: WALLCLOCK EXCEEDED
The job requested 10 hours of time, but the task was still running after 10 hours and was terminated by the scheduler before it finished its work. Try running the job again requesting maybe 20 hours, then reduce that figure if it finishes in much less time.
If a job tries to use more memory than it has requested, it will also be terminated by the scheduler.
If a job has not been configured to send the email summary, the resources used can be checked afterwards using the 'sacct' command and the JOB_ID number.
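For example, using the job number from the first summary above (the format fields shown are one useful selection; see 'man sacct' for others):
sacct -j 288888 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State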
My job started but only ran for 10 minutes.
It's likely that you've not requested a walltime for the job. This is required to ensure that jobs can be scheduled effectively and that the system can be emptied for Service Days. If it's not specified, the default value is ten minutes. There is more information in the page about Submitting Jobs.
My job shows as Blocked or Deferred?
Jobs can be placed in the Deferred state for several reasons:
- You have too much outstanding work (i.e. a large number of running and/or idle jobs)
- A back-end service was experiencing problems when you submitted your job
- There is something wrong in your job submission script
In the first two cases you don't need to do anything as the scheduler will re-assess Deferred jobs after an hour to see if the condition that led to them being deferred has gone away. If so, your jobs will be moved to the running or idle state as normal. The command
squeue -l -j <jobid>
will give you more information about why a job has been deferred.
How is my job prioritised?
As with all HPC services, there are always competing demands for when a job is run. Jobs are prioritised according to a set of rules; more details can be found in scheduling priorities.
To see the current prioritization of queued jobs:
squeue -l --states=PENDING --sort=P,p
How do I acknowledge use of the HPC facilities in research publications?
You can use the following statements to acknowledge the use of ALICE:
This research used the ALICE High Performance Computing Facility at the University of Leicester.
When do HPC Service Days occur?
The HPC service day schedule is displayed every time you log in via SSH or NX.
The schedule can also be seen here: Service Days.
Additionally there is a shared Outlook calendar available which contains HPC service days and other related events. In Outlook's calendar view, click on the Open Calendar button in the Home ribbon, select From Address Book... and add the calendar of the rcsadmin user.
How do I increase my home directory storage quota?
There is a hard quota of 60GB for all users; you will not be able to write further data to your home directory if you exceed this quota. Home directory quotas are not increased.
There is more information available about this, including how to determine where you are using the most space in your home directory, and about the other filesystems you can use on HPC.
If you have deleted files via NX / NoMachine they may be in your Wastebasket and will still count towards your quota. To reclaim this space, log on using NX / NoMachine, right click on the Wastebasket icon and select Empty Wastebasket.
I cannot write files to /data or /scratch but I am not over my file space quota
There are two quotas for /data and /scratch:
- Disk space
- Number of files
You may be over the quota for number of files (one million). You can check this with the quotacheck command, e.g.:
quotacheck
/home/l/nye1: 25GiB used of 60GiB (hard limit 60GiB)
Overall scratch and data usage: 150GiB, 74943 files
ALICE Shared area Use Files Size Used Avail
/scratch/project 0% 868 0KiB 151GiB -151GiB
/data/project 0% 868 0KiB 151GiB -151GiB
File number quota: 74943 of 1000000 files used (7.49%) across /scratch and /data
Note that shared area values are currently combined for /scratch and /data
Check for a line similar to this one:
File number quota: 998814 of 1000000 files used (99.88%) across /scratch and /data
If you have reached or are close to the limit for number of files you will need to review your usage and reduce it by deleting, archiving or moving data you no longer require.
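As a rough sketch (directory names are illustrative; always verify an archive before deleting the original data):
find /data/project -type f | wc -l   # count the files under a directory
tar -czf results.tar.gz results/     # pack a directory of many small files into one archive
tar -tzf results.tar.gz              # verify the archive lists the expected files
rm -r results/                       # then remove the original directory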