Known Issues HPC3
Below is a list of issues that we're actively working on. We hope to have these resolved soon. This is intended to be a temporary page.
For differences between the new platforms and Mahuika, see the more permanent page on differences from Mahuika.
OnDemand Apps
- Firefox will fail to render the HPC Shell Access app correctly. Please switch to Chrome or Safari until the vendor provides a fix.
- The resources dedicated to interactive work via a web browser are smaller than before, so computations requiring large amounts of memory or many CPU cores are not yet supported.
- Missing user namespaces in Kubernetes pods will interfere with most Apptainer operations. You can run the `apptainer pull` command, but the `apptainer exec`, `run`, and `shell` commands cannot be executed; see the sketch after this list.
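As an illustration, a minimal sketch of what does and does not work (the `docker://ubuntu` image is just a placeholder):

```bash
# Pulling an image still works:
apptainer pull ubuntu.sif docker://ubuntu

# Actually running a container needs user namespaces, so these are
# expected to fail in the OnDemand Kubernetes pods:
apptainer exec ubuntu.sif cat /etc/os-release
apptainer run ubuntu.sif
apptainer shell ubuntu.sif
```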
UCX ERROR
Multi-node MPI jobs may fail on the four nodes `mg[13-16]` with errors like `UCX ERROR: no active messages transport`. If you encounter this, add the sbatch option `-x mg[13-16]` to avoid those nodes. Single-task jobs are not affected.
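For example, a minimal multi-node batch script avoiding those nodes might look like this (the job name, task counts, and executable are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=mpi-example    # placeholder
#SBATCH --nodes=2                 # multi-node MPI jobs are the ones affected
#SBATCH --ntasks-per-node=4       # placeholder
#SBATCH --exclude=mg[13-16]       # long form of the -x option above

srun ./my_mpi_program             # placeholder executable
```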
Core dump files
Contrary to what is stated in our documentation on core files, these are not currently available, even if `ulimit -c unlimited` is set.
Software
- FileSender - If you set the `default_transfer_days_valid` parameter in your `~/.filesender/filesender.py.ini` to a value greater than 20, transfers will fail with a 500 error code. Please do not modify this parameter.
- Legacy Code - Some of our environment modules cause system software to stop working, e.g. after `module load Perl/5.38.2-GCC-12.3.0` you may find that `svn` stops working. This is usually the case if the module loads `LegacySystem/7` as a dependency. The solutions are to ask us to re-build the problematic environment module, or simply to not have it loaded while doing other things (see the sketch after this list).
- MPI software using 2020 or earlier toolchains, e.g. `intel/2020a`, may not work correctly across nodes. Trying a more recent toolchain is recommended.
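As a sketch of the Legacy Code workaround above, assuming the module system also unloads the `LegacySystem/7` dependency (`svn update` is just an illustrative use of an affected tool):

```bash
module load Perl/5.38.2-GCC-12.3.0    # loads LegacySystem/7 as a dependency; svn breaks
# ... do your Perl work here ...
module unload Perl/5.38.2-GCC-12.3.0  # unload before using other system software
svn update                            # svn works again with the module unloaded
```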
Slurm
Requesting GPUs
If you request a GPU without specifying which type you want, you will get an arbitrary one, so please always specify a GPU type.
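For example, a minimal sketch of an explicit GPU request (the `A100` type is a placeholder; use one of the GPU types actually available on the cluster):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example     # placeholder
#SBATCH --gpus-per-node=A100:1     # always name the GPU type, not just a count
# Equivalent form: #SBATCH --gres=gpu:A100:1

srun ./my_gpu_program              # placeholder executable
```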
BadConstraints
This uninformative message can appear in `squeue` output as the reason for a job pending. It does not always reflect a real problem, though; it can be just a side-effect of the mechanism we are using to target jobs to the right-sized node(s), together with a small bug in Slurm. If it causes your job to be put on hold (i.e. its priority appears as zero in the output from `squeue --me -S -p --Format=jobid:10,partition:13,reason:22,numnodes:.6,prioritylong:.6`), then please try `scontrol release <jobid>`, or contact our Support Team if the issue persists.
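As a sketch of the check-and-release sequence above (`<jobid>` stands for the ID of the held job):

```bash
# Check whether any of your pending jobs are held with zero priority:
squeue --me -S -p --Format=jobid:10,partition:13,reason:22,numnodes:.6,prioritylong:.6

# Release a job that squeue shows as held:
scontrol release <jobid>
```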