Grafana — Cluster Monitoring

Grafana is the web dashboard for real-time and historical metrics of the HPC cluster.
Use it to see node health, queue pressure, job efficiency, and per-user activity.


What you’ll find in the dashboards

  • Cluster overview
      • Node states (idle/alloc/drain/down), availability, and partition status
      • Queue depth and job throughput

  • Partitions & queues (Slurm)
      • Jobs by state (RUNNING/PENDING/COMPLETED/FAILED)
      • Pending-reason breakdown (why jobs wait)
      • Per-partition load vs. capacity

  • Node health
      • CPU load/usage, memory usage, swap
      • Temperatures (if exported), uptime, failures/drain flags

  • GPU (if present)
      • Allocation per job/user, utilization %, memory, temperature, power

  • Job efficiency
      • CPU efficiency (requested vs. used cores)
      • Memory efficiency (requested vs. used RAM)
      • Runtime vs. requested time (under/over-request patterns)

  • User / account usage
      • Running/pending jobs per user
      • Core-hours and memory footprints over time

  • Storage / I/O (if exporters enabled)
      • Filesystem capacity and inodes
      • Read/write throughput and IOPS
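
The job efficiency panels compare what a job requested with what it actually used. This page does not give the exact formulas behind those panels, so the following Python sketch uses the common definitions (the ones Slurm's `seff` tool reports): CPU efficiency is CPU time consumed divided by allocated cores times elapsed wall time, and memory efficiency is peak resident memory divided by requested memory. The `JobUsage` class, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Hypothetical container for one finished job's accounting figures."""
    alloc_cpus: int        # cores allocated to the job
    elapsed_s: float       # wall-clock runtime, seconds
    cpu_time_s: float      # total CPU time consumed (user + system), seconds
    req_mem_mb: float      # memory requested, MiB
    max_rss_mb: float      # peak resident memory, MiB
    time_limit_s: float    # requested wall-time limit, seconds


def ratio(used: float, requested: float) -> float:
    """Return used/requested as a fraction, or 0.0 when nothing was requested."""
    return used / requested if requested > 0 else 0.0


def efficiency(job: JobUsage) -> dict[str, float]:
    """Compute CPU, memory, and wall-time efficiency as fractions of the request."""
    return {
        "cpu": ratio(job.cpu_time_s, job.alloc_cpus * job.elapsed_s),
        "memory": ratio(job.max_rss_mb, job.req_mem_mb),
        "time": ratio(job.elapsed_s, job.time_limit_s),
    }


# Made-up job: 8 cores for 1 hour, 4 core-hours of CPU time actually used.
job = JobUsage(alloc_cpus=8, elapsed_s=3600, cpu_time_s=14400,
               req_mem_mb=32000, max_rss_mb=8000, time_limit_s=14400)
print(efficiency(job))  # {'cpu': 0.5, 'memory': 0.25, 'time': 0.25}
```

A CPU value near 0.5 on an 8-core request suggests the job could have run with about half the cores; low memory or time values point to over-requesting in the same way.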

Notes
  • Dashboards are backed by Prometheus, which collects metrics from the Slurm exporter and the node exporters.
  • If a panel shows “No data,” the exporter for that resource type may be disabled or not reporting.
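
A “No data” panel can be cross-checked against Prometheus itself. The sketch below is one possible approach, assuming you can reach the Prometheus HTTP API: it runs the instant query `up` (a metric Prometheus records as 1 for every scrape target that answered) through `/api/v1/query` and lists the targets that did not respond. The server address, function names, and the job and instance labels in the sample are placeholders; this page does not say how the exporters are named on this cluster.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: the real Prometheus address for this cluster is not given here.
PROMETHEUS_URL = "http://prometheus.example:9090"


def down_targets(response: dict) -> list[str]:
    """Return 'job/instance' labels of targets whose `up` value is not 1."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # instant sample: [timestamp, "value"]
        if float(value) != 1.0:
            down.append(f"{labels.get('job', '?')}/{labels.get('instance', '?')}")
    return sorted(down)


def query_up(base_url: str = PROMETHEUS_URL) -> dict:
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Offline example with a hand-written response (labels are hypothetical).
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"job": "slurm", "instance": "head:8080"}, "value": [1700000000, "1"]},
        {"metric": {"job": "node", "instance": "n01:9100"}, "value": [1700000000, "0"]},
    ]},
}
print(down_targets(sample))  # ['node/n01:9100']
```

Against a live server, `down_targets(query_up())` would do the same check. An exporter that was never configured as a scrape target has no `up` series at all, so an empty result does not prove that every exporter exists.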


Quick navigation

  • Home / Cluster Overview → high-level health and utilization
  • Slurm: Queues → pending reasons, queue pressure, partitions
  • Slurm: Jobs & Users → job counts, efficiency, user activity
  • Nodes → per-node CPU/RAM, state, and alerts
  • GPU (if present) → utilization, memory, power, per-job attribution

Practical use cases

  • Is the cluster busy? Check “Queue depth” and “Running cores” on the Overview.
  • Why is my job pending? Open Slurm: Queues → Pending reasons.
  • Did my job use resources efficiently? See Job efficiency panels (CPU/RAM vs. requested).
  • Which nodes are unhealthy or drained? Open Nodes and sort by state/alerts.
  • ... and more
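
The pending reasons in the Slurm: Queues dashboard can also be read from the command line. As a rough sketch, assuming `squeue` is on your PATH (for example on a login node), the following Python asks for one user's pending jobs and counts them by reason code. The function names are hypothetical, and the sample output is made up for illustration.

```python
import subprocess
from collections import Counter


def parse_reasons(squeue_output: str) -> Counter:
    """Count pending reasons from lines formatted as 'jobid|reason'."""
    reasons = Counter()
    for line in squeue_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        _job_id, reason = line.strip().split("|", 1)
        reasons[reason] += 1
    return reasons


def pending_reasons(user: str) -> Counter:
    """Query squeue for a user's pending jobs (%i = job ID, %r = reason)."""
    out = subprocess.run(
        ["squeue", "-h", "-u", user, "-t", "PENDING", "-o", "%i|%r"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_reasons(out)


# Offline example with made-up squeue output.
sample = "101|Priority\n102|Resources\n103|Priority\n"
print(parse_reasons(sample).most_common())  # [('Priority', 2), ('Resources', 1)]
```

On the cluster, `pending_reasons("your_username")` runs the real query. `Priority` means other jobs are ahead in the queue, while `Resources` means the job is waiting for nodes to become free.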

Access