Message Passing Interface (MPI) on the UEB HPC Cluster¶
MPI enables MPI-compiled programs to run in parallel on every node in the cluster, that is, in theory, on all 864 CPUs of the NGS cluster. In an MPI job, many processes (called ranks) running on one or more nodes exchange data via high-speed interconnects. On this cluster, MPI jobs are launched by Slurm: it sets up the environment (PMI/PMIx), allocates nodes and CPUs, and starts your MPI ranks.
Key concepts
- Rank = one MPI process. In Slurm terms, a rank is one --ntasks slot.
- Threads per rank (OpenMP) = --cpus-per-task (aka -c). Hybrid jobs consume ranks × threads CPUs in total.
- Use a single launcher per job. The examples in this guide use OpenMPI's mpirun inside sbatch scripts; do not mix srun and mpirun in the same job, or launcher and environment can mismatch.
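To illustrate the accounting above, the total CPU count a job consumes can be derived from the standard Slurm environment variables. This is a sketch; the fallback values (24 ranks, 4 threads) are placeholders for running it outside an allocation:

```shell
# Sketch: ranks x threads = total CPUs, from standard Slurm variables.
# The fallback values (24 ranks, 4 threads) are illustrative placeholders.
ranks=${SLURM_NTASKS:-24}
threads=${SLURM_CPUS_PER_TASK:-4}
total=$(( ranks * threads ))
echo "${ranks} ranks x ${threads} threads = ${total} CPUs"
```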
Set the MPI environment using Modules
module use /share/apps/Modules/modulefiles
module avail Gromacs openmpi
module load openmpi
module load Gromacs # or a specific version, e.g. Gromacs/2025.2
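After loading the modules, it can be worth confirming the expected binaries are actually on PATH before submitting a job. A minimal sketch; `have` is a hypothetical helper name, not part of the cluster setup:

```shell
# Sanity check (sketch): report whether each required tool is on PATH.
have() { command -v "$1" >/dev/null 2>&1 && echo "$1: ok" || echo "$1: MISSING"; }
have mpirun
have gmx_mpi
```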
Choosing ranks and threads¶
On a 96‑CPU node, common choices are:
- Pure MPI on a single node: --ntasks=96 -c 1 (96 ranks, 1 thread each, 96 CPUs).
- Pure MPI across several nodes: `-N 3 --ntasks=150 --ntasks-per-node=50 -c 1 --hint=multithread --cpu-bind=threads` (3 nodes × 50 ranks × 1 thread = 150 CPUs).
Set OpenMP vars to match -c and pin threads to cores:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PLACES=cores
export OMP_PROC_BIND=close
For now, only two installed programs are MPI-capable: the Ray assembler and GROMACS. I plan to replace multithreaded versions with MPI ones wherever possible. FYI, the list of planned MPI software installations is here.
Example of usage with GROMACS (gmx_mpi) and Slurm sbatch¶
Assumptions:
- You have prepared input files (e.g. conf.gro, topol.top, md.mdp).
- You will generate topol.tpr before the run (can be done as a 1‑rank step).
Save as gmx_mpi_job.sh and submit with sbatch gmx_mpi_job.sh.
#!/usr/bin/env bash
#SBATCH -J gmx_mpi
#SBATCH -p long # pick your partition
#SBATCH -N 2 # nodes
#SBATCH --ntasks-per-node=24 # MPI ranks per node
#SBATCH -c 4 # OpenMP threads per rank
#SBATCH --mem=0 # use all memory on each node (optional)
#SBATCH -t 12:00:00
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
set -euo pipefail
# Modules
module use /share/apps/Modules/modulefiles
module load openmpi
module load Gromacs_mpi
# Map Slurm CPUs to OpenMP threads and pin placement
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# (1) Preprocess to make the TPR with a single rank (no MPI launcher needed)
gmx_mpi grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr
# (2) Run production MD across the allocation with OpenMPI
# Use PE=<threads per rank> so each MPI rank is bound to the right number of cores.
mpirun -np ${SLURM_NTASKS} \
    --map-by ppr:${SLURM_NTASKS_PER_NODE}:node:PE=${SLURM_CPUS_PER_TASK} \
    --bind-to core \
    gmx_mpi mdrun -deffnm topol -ntomp ${OMP_NUM_THREADS} -pin on -dlb auto
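A common failure mode with explicit `ppr:` mappings is requesting a geometry where nodes × ranks-per-node does not equal --ntasks. A small pre-flight check along these lines (the default values are placeholders for testing outside an allocation) can catch the mismatch before mpirun errors out:

```shell
# Pre-flight geometry check (sketch): nodes x ranks-per-node must equal ntasks.
# Defaults are placeholders for running this outside an allocation.
nodes=${SLURM_JOB_NUM_NODES:-2}
per_node=${SLURM_NTASKS_PER_NODE:-24}
ntasks=${SLURM_NTASKS:-48}
if [ $(( nodes * per_node )) -ne "$ntasks" ]; then
  echo "geometry mismatch: ${nodes} nodes x ${per_node} ranks/node != ${ntasks} tasks" >&2
  exit 1
fi
echo "geometry ok: ${nodes} x ${per_node} = ${ntasks} ranks"
```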
Tuning notes
- Start with 24×4 per 96‑CPU node (or adjust to your socket/NUMA layout). Try pure MPI vs. hybrid and pick what’s better.
- A single-node run may outperform a multi-node run of the same size, because it avoids exchanging domain-decomposition data between nodes over the network. Benchmark both.
- Keep -ntomp equal to OMP_NUM_THREADS (or omit -ntomp; GROMACS also reads the OpenMP environment variables).
- Use node‑local or fast scratch for large trajectories if available.
- Be sure to use the MPI build (like gmx_mpi) and not the thread‑MPI (e.g. gmx) build.
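The scratch advice above can be sketched as follows; /tmp is used as a stand-in, since the real node-local scratch path is cluster-specific and not stated here:

```shell
# Sketch: run in node-local scratch, copy results home afterwards.
# /tmp is a stand-in; the real scratch path is cluster-specific.
workdir=$(mktemp -d "${TMPDIR:-/tmp}/gmx.XXXXXX")
cp md.mdp conf.gro topol.top "$workdir"/ 2>/dev/null || true   # inputs may be absent in a dry run
cd "$workdir"
# ... run gmx_mpi mdrun here, writing trajectories to fast local disk ...
# cp topol.* "$SLURM_SUBMIT_DIR"/   # copy results back at the end
echo "$workdir"
```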
One‑node variant¶
Compare single-node and multi-node runs to find the faster setup for your system.
#!/usr/bin/env bash
#SBATCH -J gmx_1node
#SBATCH -p long
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH -c 4
#SBATCH -t 08:00:00
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
set -euo pipefail
module use /share/apps/Modules/modulefiles
module load openmpi
module load Gromacs_mpi
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# (1) Preprocess on a single rank (no launcher needed)
gmx_mpi grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr
# (2) Run hybrid MPI+OpenMP across the single node with OpenMPI
mpirun -np ${SLURM_NTASKS} \
    --map-by ppr:${SLURM_NTASKS_PER_NODE}:node:PE=${SLURM_CPUS_PER_TASK} \
    --bind-to core \
    gmx_mpi mdrun -deffnm topol -ntomp ${OMP_NUM_THREADS} -pin on -dlb auto
Remember: inside an sbatch script, launch MPI ranks with a single launcher (mpirun in these examples) and do not mix it with srun; mismatched launchers and environments cause hard-to-debug failures.