Monitoring & Alerts

Monitoring collects metrics with Prometheus and displays them in Grafana. On top of that, a fully declarative alerting layer notifies the administrator as soon as a node, service, resource, or network link fails. When enabled, it adapts automatically to what is deployed.

Two services, one host

A zone’s collector runs two distinct services on the same host:

Service	Role	Access
`prometheus`	Metrics, alert rules, network probes, Alertmanager	`https://prometheus.<zone>.domain.tld` (SSO-protected)
`monitoring`	Grafana dashboards	`https://stats.<zone>.domain.tld` (SSO-protected)

`monitoring` requires `prometheus` on the same host: the generator rejects any configuration in which Grafana has no data source. `prometheus` can be enabled on its own (metrics and alerts, without Grafana).
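The co-location rule can be pictured as a small validation step. The sketch below is illustrative only: the function name `validate_host_services` and the set-of-names representation are assumptions, not the generator's actual code.

```python
def validate_host_services(services: set[str]) -> None:
    """Reject a host that enables Grafana without a local Prometheus.

    `services` is a hypothetical set of service names enabled on one host.
    """
    if "monitoring" in services and "prometheus" not in services:
        raise ValueError("monitoring (Grafana) requires prometheus on the same host")


validate_host_services({"prometheus"})                # metrics + alerts only: accepted
validate_host_services({"prometheus", "monitoring"})  # full stack: accepted
```

Calling it with `{"monitoring"}` alone raises `ValueError`, which mirrors the refusal described above.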

The alert flow

Prometheus evaluates rules generated from the topology; when one triggers, Alertmanager routes the notification according to its severity.

Alertmanager deduplicates alerts, groups them (by alert and instance), and inhibits redundant ones: a critical alert silences the warning for the same incident.
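To make the deduplication and inhibition step concrete, here is a minimal Python model of it. It is not Alertmanager code: the `Alert` class is hypothetical, and it assumes that "the same incident" means the same alert name and instance, matching the grouping key mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    instance: str
    severity: str  # "warning" or "critical"


def inhibit(alerts: list[Alert]) -> list[Alert]:
    """Drop exact duplicates, then drop warnings shadowed by a critical alert."""
    # dict.fromkeys keeps the first occurrence of each alert, in order.
    unique = list(dict.fromkeys(alerts))
    # Assumption: an incident is identified by (alert name, instance).
    critical_keys = {(a.name, a.instance) for a in unique if a.severity == "critical"}
    return [
        a
        for a in unique
        if not (a.severity == "warning" and (a.name, a.instance) in critical_keys)
    ]
```

A warning on one node is left untouched by a critical alert on another node, since the two do not share an instance.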

Severity and escalation

Escalation is driven by severity, not by time:

Severity	Destination
`warning`	Matrix room `#alert-warnings`
`critical`	Matrix room `#alert-incidents` + e-mail

Mail is sent through the host’s local Postfix relay; the Matrix bridge is a dedicated bot (@alertbot:domain.tld) that posts to both rooms.
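The routing table above amounts to a lookup keyed on severity. A minimal sketch, in which the destination strings (`matrix:...`, `email`) are placeholder labels chosen for illustration rather than real receiver names:

```python
# Severity -> destinations, as listed in the table above.
ROUTES = {
    "warning": ["matrix:#alert-warnings"],
    "critical": ["matrix:#alert-incidents", "email"],
}


def destinations(severity: str) -> list[str]:
    """Return where an alert of the given severity is sent."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    return list(ROUTES[severity])  # copy, so callers cannot mutate the table
```

Because the lookup depends only on severity, an alert never moves to a louder channel just by staying open longer.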

Node classes

The severity of a failure depends on the importance of the node. A critical server going down is an incident; a powered-off workstation is merely a warning.

Class	Who	Effect
critical	`gateway`, `hcs`, `server` profiles	failures in `critical`
non-critical	desktops, laptops	failures in `warning`
disabled		no alert on this node

The default class follows the host profile; it can be overridden with a feature:

`alert-critical`: raises the node to critical level.
`alert-non-critical`: lowers it.
`alert-disabled`: removes it from alerts.

Like `monitoring-node:<zone>`, these features accept an owner zone: `alert-critical:main` makes a node (typically the HCS) monitored by the main zone.
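The resolution of a node's class can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: any profile other than `gateway`, `hcs`, or `server` is treated as non-critical, the last matching feature wins, and the function name `resolve_alert_class` is hypothetical.

```python
from __future__ import annotations

CRITICAL_PROFILES = {"gateway", "hcs", "server"}
OVERRIDES = {
    "alert-critical": "critical",
    "alert-non-critical": "non-critical",
    "alert-disabled": "disabled",
}


def resolve_alert_class(profile: str, features: list[str]) -> tuple[str, str | None]:
    """Return (alert class, owner zone); the zone is None when none is given."""
    # The default comes from the host profile.
    alert_class = "critical" if profile in CRITICAL_PROFILES else "non-critical"
    owner_zone = None
    for feature in features:
        # "alert-critical:main" -> name "alert-critical", zone "main".
        name, _, zone = feature.partition(":")
        if name in OVERRIDES:
            alert_class = OVERRIDES[name]
            owner_zone = zone or None
    return alert_class, owner_zone
```

For example, `resolve_alert_class("hcs", ["alert-critical:main"])` yields the critical class owned by the `main` zone, while unrelated features such as `monitoring-node:main` are ignored here.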

What is monitored

Rules are derived from what each node declares; nothing to list by hand.

Family	Trigger
Node unreachable	`node-exporter` silent (`NodeDown`)
Expected service not running	expected systemd unit not `active` (`ServiceDown`)
Failed unit	a systemd unit in the `failed` state (`SystemdUnitFailed`)
Resources	disk, RAM, load, inodes, OOM (overridable thresholds)
Network	gateway, tailnet (headscale) or DNS unreachable (blackbox probes)
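As an illustration of deriving rules from declarations, the sketch below builds one `ServiceDown` rule per expected unit. It assumes the units are observed through node_exporter's systemd collector metric `node_systemd_unit_state`. The function name, the rule layout, and the exact expression are assumptions and may differ from what the generator emits.

```python
def service_down_rules(instance: str, units: list[str], severity: str) -> list[dict]:
    """Build one ServiceDown rule per systemd unit declared on a node."""
    rules = []
    for unit in units:
        # Fires when the unit's "active" state series reports 0.
        expr = (
            f'node_systemd_unit_state{{instance="{instance}",'
            f'name="{unit}",state="active"}} == 0'
        )
        rules.append(
            {"alert": "ServiceDown", "expr": expr, "labels": {"severity": severity}}
        )
    return rules


rules = service_down_rules("web1", ["nginx.service", "sshd.service"], "critical")
```

The severity argument would come from the node class, so the same unit produces a `critical` rule on a server and a `warning` rule on a workstation.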

Silence during rebuilds

A `nixos-rebuild` can cause alert flapping while services restart. The node being deployed raises a maintenance flag that Prometheus sees; Alertmanager then inhibits all alerts from that node for the duration of the operation. The flag is set and cleared with `dnf-maintenance on|off`, which the deployment flow automatically wraps around the rebuild.
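The wrapping performed by the deployment flow can be pictured as a context manager. This is a sketch, not the real deployment code: the `run` callable is a hypothetical command runner, and clearing the flag even when the rebuild fails is an assumption.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def maintenance(run: Callable[[list[str]], object]) -> Iterator[None]:
    """Set the maintenance flag, yield, then always clear it."""
    run(["dnf-maintenance", "on"])
    try:
        yield
    finally:
        # Assumption: the flag is cleared even if the rebuild raises.
        run(["dnf-maintenance", "off"])


# Dry run: record the commands instead of executing them.
commands: list[list[str]] = []
with maintenance(commands.append):
    commands.append(["nixos-rebuild", "switch"])
```

On a real host, `run` could be something like `functools.partial(subprocess.run, check=True)`; the dry run above only records the order of the three commands.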