Monitoring & Alerts

Monitoring collects metrics (Prometheus) and displays them (Grafana). On top of this, a fully declarative alerting layer notifies the administrator as soon as a node, service, resource, or network goes awry. When enabled, it adapts automatically to what is deployed.

A zone’s collector runs two distinct services on the same host:

Service | Role | Access
prometheus | Metrics, alert rules, network probes, Alertmanager | https://prometheus.<zone>.domain.tld (SSO-protected)
monitoring | Grafana dashboards | https://stats.<zone>.domain.tld (SSO-protected)

monitoring requires prometheus on the same host: the generator refuses a configuration where Grafana has no data source. You can enable prometheus alone (metrics + alerts without Grafana).
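
The dependency check described above can be pictured as follows. This is a minimal sketch, not the generator's actual code: the function name `validate_collector` and the representation of a host as a set of service names are assumptions made for illustration.

```python
def validate_collector(services):
    """Return configuration errors for the services enabled on one host.

    `services` is a set of service names (hypothetical representation).
    """
    errors = []
    # Grafana needs a data source, so "monitoring" without "prometheus"
    # on the same host is rejected.
    if "monitoring" in services and "prometheus" not in services:
        errors.append("monitoring requires prometheus on the same host")
    return errors


# prometheus alone is valid: metrics and alerts without Grafana.
print(validate_collector({"prometheus"}))
print(validate_collector({"monitoring"}))
```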

Prometheus evaluates rules generated from the topology; when one triggers, Alertmanager routes the notification according to its severity.

Alertmanager deduplicates alerts, groups them (by alert and instance), and inhibits redundant ones: a critical alert silences the warning for the same incident.
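
To make the deduplication and inhibition concrete, here is a small model of the behaviour. It is a sketch under assumptions, not Alertmanager's implementation: it treats two alerts as the same incident when they share the alert name and instance (mirroring the grouping keys), and the `Alert` type, the `inhibit` function, and the `DiskUsage` alert name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alertname: str
    instance: str
    severity: str  # "warning" or "critical"


def inhibit(alerts):
    """Deduplicate alerts, then drop warnings shadowed by a critical."""
    # Deduplicate while preserving order (frozen dataclasses are hashable).
    unique = list(dict.fromkeys(alerts))
    # Assumption: an incident is identified by (alertname, instance).
    critical_keys = {
        (a.alertname, a.instance) for a in unique if a.severity == "critical"
    }
    return [
        a
        for a in unique
        if not (
            a.severity == "warning"
            and (a.alertname, a.instance) in critical_keys
        )
    ]


firing = [
    Alert("DiskUsage", "srv1", "warning"),
    Alert("DiskUsage", "srv1", "critical"),
    Alert("DiskUsage", "srv1", "critical"),  # duplicate
    Alert("DiskUsage", "pc1", "warning"),
]
for alert in inhibit(firing):
    print(alert)
```

In this model the warning on `srv1` disappears because a critical alert exists for the same incident, while the warning on `pc1` is still delivered.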

Escalation is driven by severity, not by time:

Severity | Destination
warning | Matrix room #alert-warnings
critical | Matrix room #alert-incidents + e-mail
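
The routing table amounts to a lookup keyed on severity. The sketch below only illustrates that idea; the `matrix:` and `email` destination strings and the `destinations` function are hypothetical labels, not the real Alertmanager receiver names.

```python
# Severity decides the destination; elapsed time plays no role.
ROUTES = {
    "warning": ["matrix:#alert-warnings"],
    "critical": ["matrix:#alert-incidents", "email"],
}


def destinations(severity):
    """Return the list of destinations for an alert of this severity."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    return ROUTES[severity]


print(destinations("warning"))
print(destinations("critical"))
```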

Mail is sent through the host’s local Postfix relay; the Matrix bridge is a dedicated bot (@alertbot:domain.tld) that posts to both rooms.

The severity of a failure depends on the importance of the node. A critical server going down is an incident; a powered-off workstation is only informational.

Class | Who | Effect
critical | gateway, hcs, server profiles | failures raise critical alerts
non-critical | desktops, laptops | failures raise warning alerts
disabled | | no alert on this node

The default class follows the host profile; it can be overridden with a feature:

  • alert-critical: raises the node to critical level.
  • alert-non-critical: lowers it to non-critical level.
  • alert-disabled: removes it from alerts.

Like monitoring-node:<zone>, these features accept an owner zone: alert-critical:main makes a node (typically the HCS) monitored by the main zone.
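
The resolution of a node's class can be sketched as follows. This is an illustration under assumptions rather than the generator's code: the profile names used as dictionary keys, the `resolve_alert_class` function, and the rule that the last matching feature wins are all hypothetical; the source only specifies the three features and the optional `:<zone>` suffix.

```python
# Default class per host profile (profile key names are assumed).
PROFILE_DEFAULTS = {
    "gateway": "critical",
    "hcs": "critical",
    "server": "critical",
    "desktop": "non-critical",
    "laptop": "non-critical",
}

# Features that override the default class.
FEATURE_CLASSES = {
    "alert-critical": "critical",
    "alert-non-critical": "non-critical",
    "alert-disabled": "disabled",
}


def resolve_alert_class(profile, features, default_zone):
    """Return (alert_class, owner_zone) for one node."""
    alert_class = PROFILE_DEFAULTS[profile]
    zone = default_zone
    for feature in features:
        # "alert-critical:main" -> name "alert-critical", owner "main".
        name, _, owner = feature.partition(":")
        if name in FEATURE_CLASSES:
            # Assumption: if several features match, the last one wins.
            alert_class = FEATURE_CLASSES[name]
            if owner:
                zone = owner
    return alert_class, zone


print(resolve_alert_class("hcs", ["alert-critical:main"], "lab"))
print(resolve_alert_class("laptop", ["alert-disabled"], "lab"))
```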

Rules are derived from what each node declares; nothing has to be listed by hand.

Family | Trigger
Node unreachable | node-exporter silent (NodeDown)
Active service but failing | expected systemd unit not active (ServiceDown)
Unit in failure | a systemd unit failed (SystemdUnitFailed)
Resources | disk, RAM, load, inodes, OOM (overridable thresholds)
Network | gateway, tailnet (headscale) or DNS unreachable (blackbox probes)
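
As an illustration of how rules could be derived from node declarations, the sketch below generates NodeDown and ServiceDown entries for two of the families above. Everything beyond the alert names is an assumption: the node dictionary layout, the `derive_rules` function, the example unit names, and the PromQL expressions (built on the standard `up` metric and node-exporter's `node_systemd_unit_state`) are plausible stand-ins, not the expressions the generator actually emits.

```python
def derive_rules(nodes):
    """Build alert rule dicts from a list of node declarations."""
    rules = []
    for node in nodes:
        if node["alert_class"] == "disabled":
            continue  # disabled nodes produce no alerts at all
        severity = "critical" if node["alert_class"] == "critical" else "warning"
        name = node["name"]
        # One NodeDown rule per monitored node.
        rules.append({
            "alert": "NodeDown",
            "expr": f'up{{instance="{name}"}} == 0',
            "labels": {"severity": severity},
        })
        # One ServiceDown rule per systemd unit the node is expected to run.
        for unit in node.get("units", []):
            rules.append({
                "alert": "ServiceDown",
                "expr": (
                    f'node_systemd_unit_state{{instance="{name}",'
                    f'name="{unit}",state="active"}} == 0'
                ),
                "labels": {"severity": severity},
            })
    return rules


nodes = [
    {"name": "srv1", "alert_class": "critical", "units": ["nginx.service"]},
    {"name": "pc1", "alert_class": "non-critical", "units": []},
    {"name": "lab1", "alert_class": "disabled", "units": ["sshd.service"]},
]
for rule in derive_rules(nodes):
    print(rule["alert"], rule["labels"]["severity"], rule["expr"])
```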

A nixos-rebuild can make alerts flap while services restart. The node being deployed drops a maintenance flag that Prometheus sees; Alertmanager then inhibits all alerts from that node for the duration of the operation. The flag is set and cleared with dnf-maintenance on|off, which the deployment flow automatically wraps around the rebuild.
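
The wrapping of the rebuild can be pictured as a context manager. This is a sketch of the pattern, not the actual deployment flow: the `maintenance` helper and its injectable `run` parameter are hypothetical, and clearing the flag even when the rebuild fails is an assumption the source does not confirm. Only the `dnf-maintenance on` and `dnf-maintenance off` invocations come from the text above.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def maintenance(run=subprocess.run):
    """Set the maintenance flag for the duration of the `with` block."""
    run(["dnf-maintenance", "on"], check=True)
    try:
        yield
    finally:
        # Assumption: the flag is cleared even if the wrapped step raises,
        # so a failed rebuild does not leave alerts inhibited.
        run(["dnf-maintenance", "off"], check=True)
```

A deployment step would then run inside the block, for example `with maintenance(): subprocess.run(["nixos-rebuild", "switch"], check=True)`. Passing a fake `run` callable makes the ordering easy to check without touching a real host.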