Monitoring & Alerts
Monitoring collects metrics (Prometheus) and displays them (Grafana). On top, a fully declarative alerting layer notifies the administrator as soon as a node, service, resource, or network goes awry. When enabled, it adapts automatically to what is deployed.
Two services, one host
A zone’s collector runs two distinct services on the same host:
| Service | Role | Access |
|---|---|---|
| prometheus | Metrics, alert rules, network probes, Alertmanager | https://prometheus.<zone>.domain.tld (SSO-protected) |
| monitoring | Grafana dashboards | https://stats.<zone>.domain.tld (SSO-protected) |
monitoring requires prometheus on the same host: the generator refuses
a configuration where Grafana has no data source. You can enable prometheus
alone (metrics + alerts without Grafana).
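
The dependency check described above can be sketched as follows. This is only an illustration of the rule, not the generator's actual code: the `Host` type and the `validate_collector` function are hypothetical names, and only the service names `prometheus` and `monitoring` come from the table.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical model of a host and the services enabled on it.
    name: str
    services: set[str] = field(default_factory=set)


def validate_collector(host: Host) -> None:
    """Reject Grafana ("monitoring") when Prometheus is absent on the same host."""
    if "monitoring" in host.services and "prometheus" not in host.services:
        raise ValueError(
            f"{host.name}: 'monitoring' requires 'prometheus' on the same host"
        )


# prometheus alone is valid (metrics + alerts without Grafana).
validate_collector(Host("collector", {"prometheus"}))
# Both services on one host is the full setup.
validate_collector(Host("collector", {"prometheus", "monitoring"}))
```

A host that declares `monitoring` without `prometheus` raises `ValueError` in this sketch, mirroring the generator's refusal.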
The alert flow
Prometheus evaluates rules generated from the topology; when one triggers, Alertmanager routes the notification according to its severity.
Alertmanager deduplicates, groups (by alert and instance), and inhibits redundant alerts: a critical alert silences the warning for the same incident.
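
To make deduplication and inhibition concrete, here is a minimal model in Python. It is not how Alertmanager is implemented; it assumes that "the same incident" means the same alert name and instance, and the `Alert` type and function names are hypothetical.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    alertname: str
    instance: str
    severity: str  # "warning" or "critical"


def deduplicate(alerts: list[Alert]) -> list[Alert]:
    """Keep the first occurrence of each identical alert, preserving order."""
    return list(dict.fromkeys(alerts))


def inhibit(alerts: list[Alert]) -> list[Alert]:
    """Drop warnings that share alert name and instance with a critical alert."""
    critical_keys = {
        (a.alertname, a.instance) for a in alerts if a.severity == "critical"
    }
    return [
        a
        for a in alerts
        if not (
            a.severity == "warning"
            and (a.alertname, a.instance) in critical_keys
        )
    ]


firing = [
    Alert("DiskFull", "node1", "warning"),
    Alert("DiskFull", "node1", "critical"),
    Alert("DiskFull", "node1", "critical"),  # duplicate
    Alert("DiskFull", "node2", "warning"),
]
remaining = inhibit(deduplicate(firing))
```

In this example the warning on `node1` is suppressed by its critical counterpart, while the warning on `node2` still goes out.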
Severity and escalation
Escalation is driven by severity, not by time:
| Severity | Destination |
|---|---|
| warning | Matrix room #alert-warnings |
| critical | Matrix room #alert-incidents + e-mail |
Mail is sent through the host’s local Postfix relay; the Matrix bridge is a
dedicated bot (@alertbot:domain.tld) that posts to both rooms.
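
The routing table can be read as a plain lookup from severity to destinations. The sketch below only illustrates that mapping; the `ROUTES` dictionary, the `matrix:` and `email` labels, and the `destinations` function are hypothetical, while the room names come from the table above.

```python
# Severity-to-destination mapping, mirroring the table above.
ROUTES: dict[str, list[str]] = {
    "warning": ["matrix:#alert-warnings"],
    "critical": ["matrix:#alert-incidents", "email"],
}


def destinations(severity: str) -> list[str]:
    """Return where an alert of the given severity is delivered."""
    try:
        return list(ROUTES[severity])
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


warning_targets = destinations("warning")
critical_targets = destinations("critical")
```

Because the lookup depends only on severity, an alert that stays at warning level never reaches e-mail, however long it has been firing.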
Node classes
The severity of a failure depends on the importance of the node. A critical server going down is an incident; a powered-off workstation is only a warning.
| Class | Who | Effect |
|---|---|---|
| critical | gateway, hcs, server profiles | failures in critical |
| non-critical | desktops, laptops | failures in warning |
| disabled | | no alert on this node |
The default class follows the host profile; it can be overridden with a feature:
- alert-critical: raises the node to critical level.
- alert-non-critical: lowers it.
- alert-disabled: removes it from alerts.
Like monitoring-node:<zone>, these features accept an owner zone:
alert-critical:main makes a node (typically the HCS) monitored by the
main zone.
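
The resolution order (profile default first, then feature override, with an optional owner zone) might look like the sketch below. It is written under assumptions: the singular profile keys `desktop` and `laptop`, the `resolve_alert_class` function, and the "last override wins" behaviour are all hypothetical; only the feature names and classes come from the text above.

```python
# Default class per host profile (profile keys are assumed spellings).
DEFAULT_CLASS = {
    "gateway": "critical",
    "hcs": "critical",
    "server": "critical",
    "desktop": "non-critical",
    "laptop": "non-critical",
}

# Features that override the default class.
OVERRIDES = {
    "alert-critical": "critical",
    "alert-non-critical": "non-critical",
    "alert-disabled": "disabled",
}


def resolve_alert_class(
    profile: str, features: list[str], home_zone: str
) -> tuple[str, str]:
    """Return (class, monitoring zone) for a node."""
    node_class = DEFAULT_CLASS[profile]
    zone = home_zone
    for feature in features:
        # "alert-critical:main" -> name "alert-critical", owner "main".
        name, _, owner = feature.partition(":")
        if name in OVERRIDES:
            # Assumption: if several overrides are present, the last one wins.
            node_class = OVERRIDES[name]
            if owner:
                zone = owner
    return node_class, zone


hcs = resolve_alert_class("hcs", ["alert-critical:main"], "home")
laptop = resolve_alert_class("laptop", [], "home")
```

Here `hcs` resolves to the critical class owned by the `main` zone, while `laptop` keeps its profile default in its home zone.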
What is monitored
Rules are derived from what each node declares; there is nothing to list by hand.
| Family | Trigger |
|---|---|
| Node unreachable | node-exporter silent (NodeDown) |
| Active service but failing | expected systemd unit not active (ServiceDown) |
| Unit in failure | a systemd unit failed (SystemdUnitFailed) |
| Resources | disk, RAM, load, inodes, OOM (overridable thresholds) |
| Network | gateway, tailnet (headscale) or DNS unreachable (blackbox probes) |
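
As a rough illustration of "derived from what each node declares", the sketch below turns a node's declared systemd units into a list of alerts. It is not the real generator: the `derive_rules` function and the dictionary shape are hypothetical, resource and network rules are left out, and only the alert names are taken from the table.

```python
def derive_rules(node: str, units: list[str]) -> list[dict[str, str]]:
    """Build the per-node alert list from the node's declared units."""
    # Every monitored node gets the node-level rules.
    rules = [
        {"alert": "NodeDown", "node": node},
        {"alert": "SystemdUnitFailed", "node": node},
    ]
    # One ServiceDown rule per expected systemd unit.
    rules += [
        {"alert": "ServiceDown", "node": node, "unit": unit} for unit in units
    ]
    return rules


rules = derive_rules("gateway1", ["nginx.service", "sshd.service"])
```

Adding a service to a node adds one entry to its `units`, and the matching `ServiceDown` rule follows without any manual list to maintain.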
Silence during rebuilds
A nixos-rebuild may trigger alert flapping (services restarting). The node
being deployed drops a maintenance flag that Prometheus sees; Alertmanager
then inhibits all alerts from that node for the duration of the operation.
The flag is set and cleared with dnf-maintenance on|off, which the deployment
flow automatically wraps around the rebuild.
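
The wrapping behaviour can be sketched as a context manager that sets the flag, runs the rebuild, and always clears the flag afterwards. This is an illustration only, not the deployment flow's actual code: the `maintenance` and `deploy` helpers and the injectable `run` callable are hypothetical, and `nixos-rebuild switch` is an assumed invocation; only `dnf-maintenance on|off` comes from the text.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

Runner = Callable[[Sequence[str]], None]


def run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(list(cmd), check=True)


@contextmanager
def maintenance(run: Runner = run_checked) -> Iterator[None]:
    """Set the maintenance flag, and clear it even if the body fails."""
    run(["dnf-maintenance", "on"])
    try:
        yield
    finally:
        run(["dnf-maintenance", "off"])


def deploy(run: Runner = run_checked) -> None:
    """Rebuild the node while its alerts are inhibited."""
    with maintenance(run):
        # Assumed subcommand; the real flow may call nixos-rebuild differently.
        run(["nixos-rebuild", "switch"])
```

The `try`/`finally` is the point of the design: a failed rebuild must not leave the node silenced indefinitely. Passing a custom `run` makes it possible to dry-run the sequence without touching the system.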