Grafana: Cluster Monitoring
Grafana is the web dashboard for real-time and historical metrics of the HPC cluster.
Use it to see node health, queue pressure, job efficiency, and per-user activity.
What you’ll find in the dashboards
- Cluster overview
    - Node states (idle/alloc/drain/down), availability, and partition status
    - Queue depth and job throughput
- Partitions & queues (Slurm)
    - Jobs by state (RUNNING/PENDING/COMPLETED/FAILED)
    - Pending-reason breakdown (why jobs wait)
    - Per-partition load vs. capacity
- Node health
    - CPU load/usage, memory usage, swap
    - Temperatures (if exported), uptime, failures/drain flags
- GPU (if present)
    - Allocation per job/user, utilization %, memory, temperature, power
- Job efficiency
    - CPU efficiency (requested vs. used cores)
    - Memory efficiency (requested vs. used RAM)
    - Runtime vs. requested time (under/over-request patterns)
- User / account usage
    - Running/pending jobs per user
    - Core-hours and memory footprints over time
- Storage / I/O (if exporters enabled)
    - Filesystem capacity and inodes
    - Read/write throughput and IOPS
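The efficiency figures listed above compare what a job requested with what it actually used. The exact formulas behind the panels are not documented here, so the following is only a minimal sketch of one common convention: CPU efficiency as CPU time divided by allocated core-seconds, memory efficiency as peak memory divided by requested memory, and time efficiency as elapsed time divided by the requested limit. The `JobUsage` class and its field names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Accounting figures for one finished job (hypothetical field names)."""
    alloc_cpus: int       # cores allocated to the job
    elapsed_s: float      # wall-clock runtime, in seconds
    cpu_time_s: float     # CPU time summed over all cores, in seconds
    req_mem_mb: float     # memory requested, in MB
    max_rss_mb: float     # peak resident memory observed, in MB
    time_limit_s: float   # requested wall-time limit, in seconds


def cpu_efficiency(job: JobUsage) -> float:
    """Fraction of the allocated core-seconds spent doing CPU work."""
    core_seconds = job.alloc_cpus * job.elapsed_s
    return job.cpu_time_s / core_seconds if core_seconds > 0 else 0.0


def memory_efficiency(job: JobUsage) -> float:
    """Fraction of the requested memory that the job actually touched."""
    return job.max_rss_mb / job.req_mem_mb if job.req_mem_mb > 0 else 0.0


def time_efficiency(job: JobUsage) -> float:
    """Fraction of the requested wall time that the job actually ran."""
    return job.elapsed_s / job.time_limit_s if job.time_limit_s > 0 else 0.0


if __name__ == "__main__":
    # Example: 8 cores for 1 hour, 4 core-hours of CPU work,
    # 8 GB peak out of 32 GB requested, 4-hour time limit.
    job = JobUsage(
        alloc_cpus=8,
        elapsed_s=3600,
        cpu_time_s=14400,
        req_mem_mb=32000,
        max_rss_mb=8000,
        time_limit_s=14400,
    )
    print(f"CPU:    {cpu_efficiency(job):.0%}")
    print(f"Memory: {memory_efficiency(job):.0%}")
    print(f"Time:   {time_efficiency(job):.0%}")
```

Low values on all three figures usually suggest that the next submission of the same job could request fewer cores, less memory, or a shorter time limit.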
Notes
- Dashboards are backed by Prometheus, which collects metrics from the Slurm Exporter and the node exporters.
- If a panel shows “No data,” the underlying exporter may be disabled for that resource type.
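One way to check whether a “No data” panel reflects a missing exporter is to ask Prometheus directly through its HTTP API (`/api/v1/query`). The sketch below builds an instant-query URL and parses the JSON response. The server address and the `job="slurm"` label are assumptions for illustration; the real values depend on how this cluster's Prometheus is configured, and the server may not be reachable from user machines at all.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address; replace with the cluster's actual Prometheus server.
PROMETHEUS_URL = "http://prometheus.example.org:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> list[tuple[dict, float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError(f"unexpected result type: {data.get('resultType')}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data.get("result", [])]


def has_data(base_url: str, promql: str, timeout: float = 5.0) -> bool:
    """True if the expression currently returns at least one series."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    return len(parse_instant_vector(payload)) > 0


if __name__ == "__main__":
    # `up` is Prometheus's built-in scrape-health metric; the job label
    # value "slurm" is an assumption about the scrape configuration.
    print(has_data(PROMETHEUS_URL, 'up{job="slurm"}'))
```

An empty result for a query like this would be consistent with the exporter not being scraped, which is the situation the note above describes.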
Quick navigation
- Home / Cluster Overview → high-level health and utilization
- Slurm: Queues → pending reasons, queue pressure, partitions
- Slurm: Jobs & Users → job counts, efficiency, user activity
- Nodes → per-node CPU/RAM, state, and alerts
- GPU (if present) → utilization, memory, power, per-job attribution
Practical use cases
- Is the cluster busy? Check “Queue depth” and “Running cores” on the Overview.
- Why is my job pending? Open Slurm: Queues → Pending reasons.
- Did my job use resources efficiently? See Job efficiency panels (CPU/RAM vs. requested).
- Which nodes are unhealthy or drained? Open Nodes and sort by state/alerts.
- ... and more
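The pending-reason question above can also be cross-checked from a login node. The following is a rough sketch, not the method the dashboards use: it counts the reason strings that `squeue` prints for pending jobs (`%r` is the reason field of `squeue`'s format option). The helper names are hypothetical.

```python
import subprocess
from collections import Counter


def pending_reason_counts(squeue_output: str) -> Counter:
    """Count pending jobs per reason from one-reason-per-line text."""
    reasons = (line.strip() for line in squeue_output.splitlines())
    return Counter(reason for reason in reasons if reason)


def fetch_pending_reasons() -> str:
    """Run squeue and return one reason per pending job (requires Slurm)."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states=PENDING", "--format=%r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print reasons from most to least frequent.
    for reason, count in pending_reason_counts(fetch_pending_reasons()).most_common():
        print(f"{count:6d}  {reason}")
```

Reasons such as `Priority` or `Resources` generally mean the job is simply waiting its turn, while limit-related reasons point to a quota or policy setting.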
Access
- Open: Grafana Web Pages