InfiniBand on the UEB HPC

The cluster uses InfiniBand (IB) on the private cluster network to interconnect the login/head node, compute nodes, and storage. This fabric carries high‑throughput, low‑latency traffic for MPI and I/O (including NFS). Public access (SSH, web) stays on Ethernet; internal data paths run over InfiniBand.

What is InfiniBand?

InfiniBand is a high‑speed, switched interconnect designed for HPC. It provides:

  • Very low latency messaging (single‑digit microseconds)
  • High bandwidth links (e.g., theoretically 56 Gb/s on this cluster fabric)
  • RDMA (Remote Direct Memory Access) and verbs offload, reducing CPU overhead
  • IP over InfiniBand (IPoIB) to move standard TCP/UDP/NFS traffic across the IB fabric
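
Because IPoIB exposes the fabric as a normal IP interface, ordinary network tools work over it. A minimal check, assuming the IPoIB interface is named ibp1s0 (as on the login node) and <compute-node> is a placeholder for a host on the IB subnet:

ping -c 3 -I ibp1s0 <compute-node>   # plain ICMP, carried over the IB fabric via IPoIB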

Why it beats 10 Gb Ethernet for HPC workloads

Aspect               10 Gb Ethernet (effective)    56 Gb InfiniBand (Mellanox ConnectX‑4)   Impact
Link bandwidth       ~10 Gb/s (≈1.25 GB/s)         ~56 Gb/s (≈6.8 GB/s effective)           Much higher sustained throughput
One‑way latency      ~5–50 µs (kernel TCP)         ~1–2 µs (verbs/RDMA)                     Faster MPI collectives & syncs
CPU overhead         Higher (kernel TCP stack)     Lower (RDMA offload)                     More CPU left for the app
Congestion control   Generic                       HPC‑tuned (DC/QoS options)               Smoother multi‑node scaling
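
The effective figures follow from the link encodings: FDR InfiniBand uses 64b/66b encoding, so 56 Gb/s × 64/66 ≈ 54.5 Gb/s ≈ 6.8 GB/s of payload bandwidth, while 10 Gb Ethernet delivers at most 10/8 = 1.25 GB/s before protocol overhead.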

NFS over InfiniBand

  • NFS traffic rides the IB fabric (via IPoIB) even with standard TCP mounts; this isolates I/O from the public LAN.
  • With NFS‑RDMA, NFS uses RDMA instead of TCP: lower CPU, lower latency, and better small‑I/O performance.
  • Admins mount shared filesystems cluster‑wide; users don’t need to mount manually. You can verify the transport mode with:
nfsstat -m        # look for 'proto=rdma' if NFS-RDMA is enabled
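
For reference, an NFS‑RDMA mount (performed by admins, not users) would look roughly like the sketch below; the server name and paths are placeholders, and 20049 is the standard NFS/RDMA port:

mount -t nfs -o proto=rdma,port=20049 storage:/export/data /data   # hypothetical server and paths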

MPI & node interconnect

  • MPI benefits from IB’s low latency and high message rate, improving scaling on multi‑node jobs.
  • Launch MPI ranks with srun as usual; Slurm sets up PMI/PMIx over the IB fabric automatically (see the sketch below).
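
A minimal batch script sketch for a multi‑node MPI run; the binary ./my_mpi_app and the 16‑cores‑per‑node figure are assumptions to adjust for the real hardware:

#!/bin/bash
#SBATCH --job-name=ib_mpi_test     # hypothetical job name
#SBATCH --nodes=2                  # two compute nodes across the IB fabric
#SBATCH --ntasks-per-node=16      # one rank per core; adjust to the node size
#SBATCH --time=00:10:00

srun ./my_mpi_app                  # Slurm wires up PMI/PMIx; MPI traffic rides IB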

How to check InfiniBand status?

# IPoIB interface name on the Slurm login node is ibp1s0
ip -d link show ibp1s0        # for full speed, expect 'state UP',
                              # 'mode connected', and 'mtu 65520'

# Mellanox IB card info
ibstat                        # HCA and port state (ACTIVE), link speed/width
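
For an end‑to‑end sanity check, the perftest tools (if installed) measure raw RDMA bandwidth between two nodes; nodeA is a placeholder hostname:

# RDMA bandwidth test between two nodes
ib_write_bw                   # run on the first node (acts as server)
ib_write_bw nodeA             # run on a second node, pointing at the first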

Practical guidance

  • I/O‑intensive workflows (e.g., alignment to large references, GROMACS trajectories) benefit from the IB‑connected storage path.
  • Small‑file storms will kill performance anywhere. Batch small outputs or use formats that bundle them (e.g., tar/zip, database‑like stores).
  • MPI jobs: use an sbatch script with mpirun inside, and match threads exactly to cores to avoid oversubscription (see the sketch below).
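
A sketch of matching threads to cores for a hybrid MPI/OpenMP job; the binary name and the 16‑core node size are assumptions:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4        # 4 MPI ranks per node
#SBATCH --cpus-per-task=4          # 4 threads per rank: 4 × 4 = 16 cores per node

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # threads exactly match allocated cores
mpirun ./my_hybrid_app             # hypothetical binary; no oversubscription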

See also: HPC layout, Slurm examples, and Modules.