Grafana - Cluster Monitoring

Grafana is the web dashboard for real-time and historical metrics of the HPC cluster.
Use it to see node health, queue pressure, job efficiency, and per-user activity.

What you’ll find in the dashboards

- Cluster overview
  - Node states (idle/alloc/drain/down), availability, and partition status
  - Queue depth and job throughput
- Partitions & queues (Slurm)
  - Jobs by state (RUNNING/PENDING/COMPLETED/FAILED)
  - Pending-reason breakdown (why jobs wait)
  - Per-partition load vs. capacity
- Node health
  - CPU load/usage, memory usage, swap
  - Temperatures (if exported), uptime, failures/drain flags
- GPU (if present)
  - Allocation per job/user, utilization %, memory, temperature, power
- Job efficiency
  - CPU efficiency (requested vs. used cores)
  - Memory efficiency (requested vs. used RAM)
  - Runtime vs. requested time (under/over-request patterns)
- User / account usage
  - Running/pending jobs per user
  - Core-hours and memory footprints over time
- Storage / I/O (if exporters enabled)
  - Filesystem capacity and inodes
  - Read/write throughput and IOPS
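
The job efficiency panels compare what a job requested with what it actually used. The sketch below shows that arithmetic under commonly used definitions: CPU efficiency as CPU time consumed divided by allocated core-time, memory efficiency as peak memory divided by requested memory, and runtime ratio as elapsed time divided by the requested time limit. The function names and sample figures are illustrative assumptions, and the dashboards may compute these values somewhat differently.

```python
def cpu_efficiency(cpu_seconds_used: float, cores_requested: int,
                   elapsed_seconds: float) -> float:
    """Fraction of the allocated core-time that the job actually consumed."""
    allocated_core_seconds = cores_requested * elapsed_seconds
    if allocated_core_seconds <= 0:
        raise ValueError("allocated core-time must be positive")
    return cpu_seconds_used / allocated_core_seconds


def memory_efficiency(peak_bytes_used: float, bytes_requested: float) -> float:
    """Fraction of the requested memory that the job touched at its peak."""
    if bytes_requested <= 0:
        raise ValueError("requested memory must be positive")
    return peak_bytes_used / bytes_requested


def runtime_ratio(elapsed_seconds: float, time_limit_seconds: float) -> float:
    """Fraction of the requested wall-time limit that the job actually ran."""
    if time_limit_seconds <= 0:
        raise ValueError("time limit must be positive")
    return elapsed_seconds / time_limit_seconds


# Hypothetical job: 8 cores for 1 hour, 4 core-hours of CPU time consumed,
# 4 GiB peak out of 16 GiB requested, and a 4-hour time limit.
GIB = 1024 ** 3
print(cpu_efficiency(4 * 3600, 8, 3600))      # 0.5
print(memory_efficiency(4 * GIB, 16 * GIB))   # 0.25
print(runtime_ratio(3600, 4 * 3600))          # 0.25
```

Values well below 1.0 across many jobs suggest over-requesting, which tends to lengthen queue times for everyone.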

Notes
- Dashboards are powered by Prometheus + Slurm Exporter (and node exporters).
- If a panel shows “No data,” the underlying exporter may be disabled for that resource type.
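
Because the panels read from Prometheus, the same numbers can in principle be fetched through the Prometheus HTTP API, provided the server is reachable from your network (an assumption; this page does not say whether it is exposed). The sketch below builds an instant-query URL for the standard `/api/v1/query` endpoint and parses an instant-vector response. The host name `prometheus.example` and the sample response are illustrative placeholders, not values from this cluster; `node_load1` is a standard node exporter metric.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def vector_values(response_text: str) -> dict:
    """Map each series' sorted label pairs to its value as a float."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    values = {}
    for series in data["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # Prometheus sends the value as a string
        values[labels] = float(value)
    return values


# Hypothetical server address; replace with the real one if it is published.
print(instant_query_url("http://prometheus.example:9090", "node_load1"))

# Hand-written sample in the documented response shape (not real cluster data).
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "node_load1", "instance": "node01:9100"},
             "value": [1700000000, "0.5"]},
        ],
    },
})
for labels, value in vector_values(sample).items():
    print(dict(labels)["instance"], value)
```

An empty `result` list in a successful response corresponds to the “No data” case: the query is valid, but no exporter is reporting that metric.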

Quick navigation

- Home / Cluster Overview → high-level health and utilization
- Slurm: Queues → pending reasons, queue pressure, partitions
- Slurm: Jobs & Users → job counts, efficiency, user activity
- Nodes → per-node CPU/RAM, state, and alerts
- GPU (if present) → utilization, memory, power, per-job attribution

Practical use cases

- Is the cluster busy? Check “Queue depth” and “Running cores” on the Overview.
- Why is my job pending? Open Slurm: Queues → Pending reasons.
- Did my job use resources efficiently? See the Job efficiency panels (CPU/RAM used vs. requested).
- Which nodes are unhealthy or drained? Open Nodes and sort by state/alerts.
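
For the pending-job question, a similar breakdown can be reproduced on a login node. The sketch below tallies pending reasons from text with one reason per line, such as the output of `squeue -h -t PENDING -o "%r"`. The function name and the sample input are illustrative, and the dashboard panel may derive its counts differently (for example, from the exporter rather than from `squeue`).

```python
from collections import Counter


def pending_reason_counts(squeue_output: str) -> Counter:
    """Count pending jobs per reason from one-reason-per-line text."""
    reasons = (line.strip() for line in squeue_output.splitlines())
    return Counter(reason for reason in reasons if reason)  # skip blank lines


# Hand-written sample input (not real cluster output).
sample_output = "Priority\nResources\nPriority\nDependency\n"
for reason, count in pending_reason_counts(sample_output).most_common():
    print(f"{reason}: {count}")
```

Reasons such as `Priority` and `Resources` mean the job is simply waiting its turn, while limit-related reasons usually point to an account or QOS cap.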

Access

Open: Grafana Web Pages
