
Parallel Computing

To make full use of high performance computing hardware, you need to be able to utilise multiple CPUs. Many scientific software applications support parallel execution, but this often requires explicit configuration rather than happening automatically.

Some definitions that will help you understand this page:

  • CPU: The hardware that performs computations
  • Task: An independent process that is run using one or more CPUs. All CPUs assigned to a task share the same memory.
  • Node: The physical hardware. Each node contains memory and a number of CPUs.
  • Shared Memory: Multiple CPUs used within a single task
  • Distributed Memory: Multiple tasks used across nodes

Utilising Multiple CPUs

Requesting extra resources through Slurm doesn't guarantee your program will use them. Always consult the software-specific documentation first to determine which types of parallel computing to use (and how). Software may:

  • Claim implicit multi-core support (verify that this actually works).
  • Require the number of cores to be specified on the command line (e.g., -n 8, -np 16); see the sketch after this list.
  • Require the type of parallelisation to be specified on the command line (e.g., -dis, -mpi=intelmpi).
  • Require input files to be regenerated in a particular configuration (e.g., partitioned into the same number of domains as there are tasks).
  • Give estimates of its initial resource requirements.
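
For software that takes its core count on the command line, a common approach is to pass in the values that Slurm provides as environment variables, so the Slurm request and the program's configuration cannot drift apart. A minimal sketch, assuming a hypothetical multithreaded program my_program with a hypothetical -n thread-count option (check your software's documentation for the real flag):

#!/bin/bash -e

#SBATCH --job-name       FlagExample
#SBATCH --cpus-per-task  8

# Pass the allocated CPU count to the program's own thread flag.
# "my_program" and its "-n" option are placeholders, not real software.
my_program -n ${SLURM_CPUS_PER_TASK}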

Quick Reference

Method               Also Called           Slurm Options                Usage
Shared Memory        Multithreading, SMP   --cpus-per-task              Limited to single node; efficient memory use
Distributed Memory   MPI                   --ntasks                     Scales across nodes; higher overhead
Hybrid               -                     --ntasks + --cpus-per-task   Combines both approaches
Job Array            -                     --array                      Best for independent tasks
GPU                  GPGPU                 --gpus-per-node              Specialized hardware for matrix operations

Shared Memory Parallelisation

Shared memory parallelism, or multi-threading, works by forking (splitting) a single process into multiple parallel threads, using libraries such as OpenMP (OMP), TBB, or pthreads.

A non-parallelised (serial) program works like this:

[Diagram: serial execution]

In contrast, a shared memory parallelised program works like this:

[Diagram: shared memory parallel execution]

A shared memory parallelised program:

  • Requires shared memory, so all CPUs must be on the same node.
  • Is limited by node capacity (e.g., on Mahuika the maximum number of CPUs that can be requested for a shared memory job is 166, or twice that with hyperthreading).
  • Uses --cpus-per-task to specify the thread count.

Example Script

#!/bin/bash -e

#SBATCH --job-name       MultithreadingTest
#SBATCH --account        nesi99991
#SBATCH --time           00:01:00
#SBATCH --mem            2048MB
#SBATCH --cpus-per-task  4

taskset -c -p $$  # Prints available CPUs
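
If the program is parallelised with OpenMP, you would typically also tell it how many threads to start by matching OMP_NUM_THREADS to the Slurm allocation. A minimal sketch, assuming a hypothetical executable my_openmp_program:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}  # one thread per allocated CPU
./my_openmp_program                            # hypothetical OpenMP executable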


Distributed Memory

Distributed memory parallelism, generally implemented with the Message Passing Interface (MPI), enables parallel computation across multiple nodes through inter-process communication.

A distributed memory parallelised program:

  • Does not require shared memory (each task has its own memory).
  • Can utilise multiple nodes.
  • Has higher communication and memory overhead than multi-threading.
  • Memory requirements typically scale with CPU count.
  • Predates shared-memory parallelism and is common in classical HPC applications.
  • Use --ntasks (>1), or --ntasks-per-node together with --nodes.
  • Use --mem-per-cpu instead of --mem to ensure consistent memory per task regardless of how Slurm packs tasks onto nodes.
  • Is launched via mpirun or, preferably, Slurm's srun.

Example Script

#!/bin/bash -e

#SBATCH --job-name       MPIJob
#SBATCH --account        nesi99991
#SBATCH --time           00:01:00
#SBATCH --mem-per-cpu    512MB
#SBATCH --ntasks         4

srun bash -c 'taskset -c -p $$'  # Prints the CPUs available to each task
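
In a real job, the taskset check would be replaced by the MPI program itself, launched once per task by srun. A minimal sketch, assuming a hypothetical executable my_mpi_program:

srun ./my_mpi_program  # hypothetical MPI executable; srun starts one copy per Slurm task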

Warning

For non-MPI programs, either set --ntasks=1 or do not use srun at all. Using srun in conjunction with --cpus-per-task=1 will cause --ntasks to default to 2.

Hybrid Parallel

Hybrid parallelism combines --ntasks and --cpus-per-task to run multiple distributed tasks, each of which is multi-threaded. This approach is not commonly supported by scientific software.

Example Script

#!/bin/bash -e

#SBATCH --job-name       HybridJob
#SBATCH --account        nesi99991
#SBATCH --time           00:01:00
#SBATCH --mem-per-cpu    512MB
#SBATCH --cpus-per-task  4
#SBATCH --ntasks         2

srun bash -c 'taskset -c -p $$'  # Prints CPUs available to each task
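
For a real hybrid program, you would typically set the per-task thread count and then launch the tasks with srun. A minimal sketch, assuming a hypothetical executable my_hybrid_program:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}  # 4 threads per task, as requested above
srun ./my_hybrid_program                       # hypothetical: 2 MPI tasks x 4 threads each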

Job Arrays

Job arrays execute independent tasks simultaneously—ideal for embarrassingly parallel problems with no inter-task dependencies (e.g. parameter sweeps).

  • Tasks can execute in any order.
  • Efficient way to run multiple serial jobs simultaneously rather than applying multiple CPUs to a single job.
  • Scales without efficiency loss.
  • Use --array to specify index range

Example Script

#!/bin/bash -e

#SBATCH --job-name       ArrayJob     # job name (shows up in the queue)
#SBATCH --account        nesi99991    # Project to bill
#SBATCH --time           00:01:00     # Walltime (HH:MM:SS)
#SBATCH --mem            512MB        # Memory
#SBATCH --array          1-2          # Array jobs

echo "This is result ${SLURM_ARRAY_TASK_ID}"


GPUs

GPUs excel at large-scale parallel operations on matrices, making them ideal for machine learning, graphics processing and simulating many kinds of physical systems.

  • Specialized hardware requested in addition to CPUs and memory.
  • Well-suited for large matrix operations and machine learning.
  • Use --gpus-per-node=<gpu_type>:<gpu_number>

Example Script

#!/bin/bash -e

#SBATCH --job-name        gpu-job
#SBATCH --account         nesi99991
#SBATCH --output          %x.out
#SBATCH --mem-per-cpu     2G
#SBATCH --gpus-per-node   P100:1

module load CUDA
nvidia-smi

Running this script should print a summary of GPU status and utilisation.
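
Slurm normally sets CUDA_VISIBLE_DEVICES inside a GPU job so that CUDA applications only see the GPU(s) allocated to the job; a quick way to confirm which device was assigned (run from inside the job script, assuming the variable is set on your system):

echo "Assigned GPU(s): ${CUDA_VISIBLE_DEVICES}"  # typically set by Slurm when a GPU is allocated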

See also

  • Using GPUs for more in-depth documentation about GPUs.
  • Hardware for a full list of available GPUs.