Parallel Computing
To make full use of high performance computing hardware, you need to be able to use multiple CPUs. Many scientific software applications support parallel execution, but this often requires explicit configuration rather than happening automatically.
Some definitions that will help you understand this page:
- CPU: The hardware that performs computations
- Task: An independent process that is run using one or more CPUs. All CPUs assigned to a task share the same memory.
- Node: The physical hardware. Each node contains memory and a number of CPUs.
- Shared Memory: Multiple CPUs used within a single task
- Distributed Memory: Multiple tasks used across nodes
Utilising Multiple CPUs
Requesting extra resources through Slurm doesn't guarantee your program will use them. Always consult the software-specific documentation first when trying to determine what type of parallel computing to use (and how). Software may:
- Claim implicit multi-core support (verify this works).
- Require the number of cores to be given explicitly on the command line (e.g. -n 8, -np 16), as sketched below.
- Need the parallelisation type specified on the command line (e.g. -dis, -mpi=intelmpi).
- Require input files to be regenerated with a particular configuration (e.g. partitioned into the same number of domains as there are tasks).
- Give estimates of initial resource requirements.
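Where a program takes an explicit core count on its command line, it is safest to pass it the value Slurm actually allocated rather than hard-coding a number. A minimal sketch, assuming a hypothetical program my_program with an -n flag for its core count (the flag will differ between packages):
#!/bin/bash -e
#SBATCH --job-name CoreCountExample
#SBATCH --account nesi99991
#SBATCH --time 00:05:00
#SBATCH --mem 512MB
#SBATCH --cpus-per-task 8
my_program -n ${SLURM_CPUS_PER_TASK} input.dat   # hypothetical program and input file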
Quick Reference
| Method | Also Called | Slurm Options | Usage |
|---|---|---|---|
| Shared Memory | Multithreading, SMP | --cpus-per-task | Limited to a single node; efficient memory use |
| Distributed Memory | MPI | --ntasks | Scales across nodes; higher overhead |
| Hybrid | - | --ntasks + --cpus-per-task | Combines both approaches |
| Job Array | - | --array | Best for independent tasks |
| GPU | GPGPU | --gpus-per-node | Specialised hardware for matrix operations |
Shared Memory Parallelisation
Shared memory parallelism, or multi-threading, works by forking (duplicating) a single process into multiple parallel threads, using libraries such as OpenMP (OMP), TBB, or pthreads.
In a non-parallelised (serial) program, a single process executes its instructions one after another on a single CPU. In contrast, in a shared memory parallelised program the process forks into multiple threads, each running on its own CPU while sharing the same memory.
A shared memory parallelised program:
- Requires shared memory (so all CPUs must be on the same node)
- Limited by node capacity (e.g. on Mahuika the maximum number of CPUs that can be requested for a shared memory job is 166, or twice that with hyperthreading)
- Uses --cpus-per-task to specify thread count
Example Script
#!/bin/bash -e
#SBATCH --job-name MultithreadingTest
#SBATCH --account nesi99991
#SBATCH --time 00:01:00
#SBATCH --mem 2048MB
#SBATCH --cpus-per-task 4
taskset -c -p $$ # Prints available CPUs
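If the program you run in place of taskset above is parallelised with OpenMP, its thread count is normally taken from the OMP_NUM_THREADS environment variable, which can be set from the Slurm allocation. A minimal sketch (my_omp_program is a hypothetical OpenMP-enabled binary):
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # use all 4 allocated CPUs as threads
./my_omp_program                                # hypothetical OpenMP-enabled binary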
See also
- Multithreading
- Python Multiprocessing (not threading, which isn't truly parallel in Python).
- MATLAB Parpool
Distributed Memory
Distributed memory parallelism, generally implemented with the Message Passing Interface (MPI), enables parallel computation across multiple nodes through inter-process communication.
A distributed memory parallelised program:
- Does not require shared memory (each task has its own memory).
- Can utilise multiple nodes.
- Has higher communication and memory overhead than multi-threading.
- Memory requirements typically scale with CPU count.
- Predates shared-memory parallelism; common in classical HPC applications.
- Use --ntasks (>1), or --ntasks-per-node together with --nodes (see the placement sketch below).
- Use --mem-per-cpu instead of --mem to ensure consistent memory per task regardless of how Slurm packs tasks onto nodes.
- Launched via mpirun or, preferably, Slurm's srun.
Example Script
#!/bin/bash -e
#SBATCH --job-name MPIJob
#SBATCH --account nesi99991
#SBATCH --time 00:01:00
#SBATCH --mem-per-cpu 512MB
#SBATCH --ntasks 4
srun bash -c 'taskset -c -p $$' # Prints CPUs available to each task
Warning
For non-MPI programs, either set --ntasks=1 or do not use srun at all.
Using srun in conjunction with --cpus-per-task=1 will cause --ntasks to default to 2.
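If you need explicit control over how MPI tasks are spread across nodes, the same kind of job can be requested with --nodes together with --ntasks-per-node instead of a bare --ntasks. A minimal sketch (my_mpi_program is a hypothetical MPI-enabled binary; values are illustrative):
#!/bin/bash -e
#SBATCH --job-name MPIPlacement
#SBATCH --account nesi99991
#SBATCH --time 00:05:00
#SBATCH --mem-per-cpu 512MB
#SBATCH --nodes 2             # spread the job across two nodes
#SBATCH --ntasks-per-node 4   # four MPI tasks per node (8 tasks in total)
srun ./my_mpi_program         # hypothetical MPI-enabled binary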
Hybrid Parallel
Hybrid parallelisation combines --ntasks and --cpus-per-task: the job runs multiple distributed tasks, each of which is multi-threaded. This approach is not commonly supported by scientific software.
Example Script
#!/bin/bash -e
#SBATCH --job-name HybridJob
#SBATCH --account nesi99991
#SBATCH --time 00:01:00
#SBATCH --mem-per-cpu 512MB
#SBATCH --cpus-per-task 4
#SBATCH --ntasks 2
srun bash -c 'taskset -c -p $$' # Prints CPUs available to each task
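For a real hybrid program, the taskset line above would be replaced by a launch along these lines (my_hybrid_program is a hypothetical MPI + OpenMP binary):
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # 4 OpenMP threads per MPI task
srun ./my_hybrid_program                        # 2 tasks x 4 threads = 8 CPUs in total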
Job Arrays
Job arrays execute independent tasks simultaneously—ideal for embarrassingly parallel problems with no inter-task dependencies (e.g. parameter sweeps).
- Tasks can execute in any order.
- Efficient way to run multiple serial jobs simultaneously rather than applying multiple CPUs to a single job.
- Scales without efficiency loss.
- Use --array to specify the index range.
Example Script
#!/bin/bash -e
#SBATCH --job-name ArrayJob # job name (shows up in the queue)
#SBATCH --account nesi99991 # Project to bill
#SBATCH --time 00:01:00 # Walltime (HH:MM:SS)
#SBATCH --mem 512MB # Memory
#SBATCH --array 1-2 # Array jobs
echo "This is result ${SLURM_ARRAY_TASK_ID}"
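Each array task receives its own SLURM_ARRAY_TASK_ID, which can be used to select a different input per task. A minimal sketch, assuming hypothetical input files input_1.dat, input_2.dat, ... and a hypothetical program my_program:
#!/bin/bash -e
#SBATCH --job-name ArrayInputs
#SBATCH --account nesi99991
#SBATCH --time 00:10:00
#SBATCH --mem 512MB
#SBATCH --array 1-10          # one array task per input file
my_program input_${SLURM_ARRAY_TASK_ID}.dat   # each task processes its own file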
See also
- For more info see the page on Job Arrays
GPUs
GPUs excel at large-scale parallel operations on matrices, making them ideal for machine learning, graphics processing and simulating many kinds of physical systems.
- Specialized hardware requested in addition to CPUs and memory.
- Well-suited for large matrix operations and machine learning.
- Use --gpus-per-node=<gpu_type>:<gpu_number>
Example Script
#!/bin/bash -e
#SBATCH --job-name gpu-job
#SBATCH --account nesi99991
#SBATCH --output %x.out
#SBATCH --mem-per-cpu 2G
#SBATCH --gpus-per-node P100:1
module load CUDA
nvidia-smi
This should print a summary of GPU utilisation.
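The GPU types on offer differ between clusters, so the type and count in the request may need changing. For example, requesting a single A100 instead would look like the following (a sketch only; check the Hardware page for what each cluster actually provides):
#SBATCH --gpus-per-node A100:1   # one A100 GPU, if available on the target cluster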
See also
- Using GPUs for more in-depth documentation about GPUs.
- Hardware for a full list of available GPUs.