# InfiniBand on the UEB HPC
The cluster uses InfiniBand (IB) on the private cluster network to interconnect the login/head node, compute nodes, and storage. This fabric carries high‑throughput, low‑latency traffic for MPI and I/O (including NFS). Public access (SSH, web) stays on Ethernet; internal data paths run over InfiniBand.
## What is InfiniBand?
InfiniBand is a high‑speed, switched interconnect designed for HPC. It provides:

- Very low latency messaging (single‑digit microseconds)
- High‑bandwidth links (theoretically 56 Gb/s on this cluster fabric)
- RDMA (Remote Direct Memory Access) and verbs offload, reducing CPU overhead
- IP over InfiniBand (IPoIB) to move standard TCP/UDP/NFS traffic across the IB fabric
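As a quick illustration of both faces of the fabric, the sketch below assumes the rdma-core/OFED user tools are installed and uses `node01-ib` as a placeholder for a compute node's IPoIB hostname:

```bash
# Verbs/RDMA view: list the HCA and confirm its port is ACTIVE
ibv_devinfo

# IPoIB view: the fabric also shows up as a normal IP interface,
# so ordinary TCP/UDP tools work ('node01-ib' is a placeholder hostname)
ping -c 3 node01-ib
```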
## Why it beats 10 Gb Ethernet for HPC workloads
| Aspect | 10 Gb Ethernet (effective) | 56 Gb FDR InfiniBand (Mellanox ConnectX‑4 HCA) | Impact |
|---|---|---|---|
| Link bandwidth | ~10 Gb/s (≈1.25 GB/s) | 56 Gb/s (≈6.8 GB/s effective) | Much higher sustained throughput |
| One‑way latency | ~5–50 µs (kernel TCP) | ~1–2 µs (verbs/RDMA) | Faster MPI collectives & syncs |
| CPU overhead | Higher (kernel TCP stack) | Lower (RDMA offload) | More CPU left for the app |
| Congestion control | Generic | HPC‑tuned (DC/QoS options) | Smoother multi‑node scaling |
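If you want to see these numbers for yourself, the `perftest` micro-benchmarks (and `iperf3` on the Ethernet side) make the difference visible. Whether these tools are installed, and the hostnames used, are assumptions, so treat this as a sketch:

```bash
# RDMA bandwidth: start the server on one node, then point the client at it
ib_write_bw                      # on node01 (server)
ib_write_bw node01-ib            # on node02 (client); reports sustained bandwidth

# RDMA latency, same server/client pattern
ib_send_lat                      # on node01 (server)
ib_send_lat node01-ib            # on node02 (client); reports latency in microseconds

# TCP throughput over the Ethernet path, for comparison
iperf3 -s                        # on node01
iperf3 -c node01                 # on node02
```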
## NFS over InfiniBand
- NFS traffic rides the IB fabric (via IPoIB) even with standard TCP mounts; this isolates I/O from the public LAN.
- With NFS‑RDMA, NFS uses RDMA as its transport instead of TCP: lower CPU overhead, lower latency, and better small‑I/O performance (see the mount sketch after this list).
- Admins mount the shared filesystems cluster‑wide; users don't need to mount anything manually. You can verify the transport with:

```bash
nfsstat -m   # look for 'proto=rdma' if NFS-RDMA is enabled
```
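For context on what the RDMA transport looks like at mount time, here is a hypothetical admin-side mount line (server name, export path, and mount point are placeholders; users never need to run this):

```bash
# Hypothetical NFS-RDMA mount; port 20049 is the conventional NFS-RDMA port
mount -t nfs -o rdma,port=20049 storage-ib:/export/data /shared/data
```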
MPI & node interconnect¶
- MPI benefits from IB’s low latency and high message rate, improving scaling on multi‑node jobs.
- Launch MPI ranks with `srun` as usual; Slurm sets up PMI/PMIx over the IB fabric automatically (a minimal example follows this list).
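A minimal sketch of such a launch, assuming an MPI module is available and `./mpi_app` is your compiled binary (both names are placeholders):

```bash
# Inside a job allocation (or an interactive salloc session):
module load openmpi                        # placeholder for the cluster's MPI module
srun -N 2 --ntasks-per-node=16 ./mpi_app   # 2 nodes x 16 ranks, wired up by Slurm over IB
```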
## How to check InfiniBand status?
```bash
# IPoIB interface name on the Slurm login node is ibp1s0
ip -d link show ibp1s0   # for full speed, expect 'state UP',
                         # 'mode connected', and 'mtu 65520'

# Mellanox IB card info
ibstat                   # HCA and port state (ACTIVE), link speed/width
```
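Two more low‑level checks read the kernel's sysfs entries directly; the interface name below is the login node's `ibp1s0`, so adjust it on other nodes:

```bash
# Negotiated link rate of each HCA port, e.g. '56 Gb/sec (4X FDR)'
cat /sys/class/infiniband/*/ports/*/rate

# IPoIB transport mode of the interface: 'connected' enables the 65520-byte MTU
cat /sys/class/net/ibp1s0/mode
```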
## Practical guidance
- I/O‑intensive workflows (e.g., alignment to large references, GROMACS trajectories) benefit from the IB‑connected storage path.
- Small‑file storms kill performance on any filesystem. Batch small outputs or use formats that bundle them (e.g., tar/zip, database‑like stores); a short bundling example follows this list.
- MPI jobs: submit an `sbatch` script with `mpirun` inside, and match the number of ranks/threads exactly to the allocated cores to avoid oversubscription (see the batch‑script sketch below).
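For the small‑file point above, one simple pattern is to bundle a run's many small outputs into a single archive before they hit the shared filesystem (paths are placeholders):

```bash
# Pack a directory of many small result files into one archive
tar -czf results_batch01.tar.gz results/batch01/
```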
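And for the MPI point, a minimal batch‑script sketch, assuming 2 nodes with 16 cores each, a hypothetical `openmpi` module, and a `./mpi_app` binary:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_app
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16   # one MPI rank per physical core (adjust to the node type)
#SBATCH --cpus-per-task=1      # no extra threads, so ranks = cores and nothing is oversubscribed
#SBATCH --time=01:00:00

module load openmpi            # placeholder module name

# mpirun picks up the Slurm allocation; SLURM_NTASKS = nodes * ntasks-per-node
mpirun -np "$SLURM_NTASKS" ./mpi_app
```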
See also: HPC layout, Slurm examples, and Modules.