Prometheus alerts
Maintenance map of the alert code: what goes where, and how to extend it without breaking anything. The admin-facing operation is described in Monitoring & Alerts.
Pure logic vs wiring
The code is split in two to keep rule generation testable:

- dnf/lib/alerts.nix: pure functions that produce Prometheus rules from the topology. Tested in dnf/tests/unit/lib/alerts_test.nix.
- dnf/modules/service/prometheus.nix: impure wiring (Alertmanager, routing by severity, sops, Matrix bot, vhost, blackbox probes).

Any non-trivial logic goes into alerts.nix; the module merely plugs it in.
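
As an illustration of why the pure side is easy to test, here is a minimal sketch of a table-driven check that evaluates without any NixOS module. It is not the content of alerts_test.nix: the stand-in severityForClass, its mapping, and the case names are all assumptions made for the example.

```nix
# Hypothetical test sketch; the real alerts_test.nix may be organised differently.
let
  # Stand-in for a pure helper such as severityForClass (assumed mapping).
  severityForClass = class: if class == "critical" then "critical" else "warning";

  # Each case pairs an evaluated expression with its expected value.
  cases = {
    criticalMapsToCritical = {
      expr = severityForClass "critical";
      expected = "critical";
    };
    otherMapsToWarning = {
      expr = severityForClass "non-critical";
      expected = "warning";
    };
  };

  # Names of the cases whose result differs from the expectation.
  failures = builtins.filter
    (name: cases.${name}.expr != cases.${name}.expected)
    (builtins.attrNames cases);
in
if failures == [ ]
then "ok"
else throw "failed: ${builtins.concatStringsSep ", " failures}"
```

Because nothing here touches services.prometheus, sops or the network, the file can be evaluated on its own, for example with nix-instantiate --eval.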
Helpers (alerts.nix)
Exposed via dnfLib (see dnf/lib/default.nix):

| Helper | Role |
|---|---|
| serviceUnits | DNF service → systemd unit (e.g. idm → kanidmd.service) |
| nodeClass | Node class (critical / non-critical / disabled): alert-* features first, then profile |
| severityForClass | Class → severity (critical or warning) |
| hostExpectedUnits | Expected units for a host (based on its enabled services) |
| mkNodeRuleGroups | NodeDown, ServiceDown, SystemdUnitFailed |
| mkResourceRuleGroups | Disk, RAM, load, inodes, OOM (thresholds from defaultThresholds) |
| mkNetworkRuleGroups | Blackbox probes (gateway, tailnet, DNS) |
| mkMaintenanceRuleGroups | Maintenance flag (silence during rebuild) |
| mergeRuleGroups | Merges fragments into a single document |
| mkAlertRuleGroups | Shortcut: nodes + resources |
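
To make the role of mergeRuleGroups concrete, here is a minimal sketch of how such a merge could work. It is not the code in dnf/lib/alerts.nix: it assumes that every mkXRuleGroups helper returns an attrset of the form { groups = [ … ]; }, and the two fragments and their rule contents are invented for the example.

```nix
# Hypothetical sketch, not the implementation in dnf/lib/alerts.nix.
# Assumption: each fragment is an attrset shaped like { groups = [ ... ]; }.
let
  # Concatenate the `groups` lists of all fragments into one document.
  mergeRuleGroups = fragments: {
    groups = builtins.concatLists (map (f: f.groups or [ ]) fragments);
  };

  # Two illustrative fragments (rule contents are made up).
  nodeFragment = {
    groups = [
      {
        name = "nodes";
        rules = [
          {
            alert = "NodeDown";
            expr = "up == 0";
            labels.severity = "critical";
          }
        ];
      }
    ];
  };
  resourceFragment = {
    groups = [
      {
        name = "resources";
        rules = [ ];
      }
    ];
  };
in
# One JSON string with a single top-level `groups` key.
builtins.toJSON (mergeRuleGroups [ nodeFragment resourceFragment ])
```

The result contains both the nodes and resources groups under one groups list, which is the shape Prometheus expects from a rule file.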
The trap: a single rule document
The rules must end up in a single rule document, so we merge everything via mergeRuleGroups and emit only one entry:

```nix
# A single entry: every fragment is merged before serialisation.
services.prometheus.rules = [
  (builtins.toJSON (dnfLib.mergeRuleGroups (
    [ (dnfLib.mkAlertRuleGroups { inherit nodes; /* … */ }) ]
    ++ lib.optional alerting.silenceOnRebuild (dnfLib.mkMaintenanceRuleGroups { /* … */ })
    ++ lib.optional alerting.network.enable (dnfLib.mkNetworkRuleGroups { /* … */ })
  )))
];
```

What to touch where
| I want to… | I touch… |
|---|---|
| monitor a new service | serviceUnits in alerts.nix |
| add a new rule family | a new mkXRuleGroups helper in alerts.nix, then add it to the list passed to mergeRuleGroups in the module |
| change a default threshold | defaultThresholds in alerts.nix |
| change a node’s class | alert-*[:zone] feature (no code) |
| add a new destination | the alertmanager block in prometheus.nix (receiver + route) |
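
For the "add a new rule family" row, a new helper might look like the sketch below. Everything in it is hypothetical: the name mkExampleRuleGroups, the assumed shape of nodes (an attrset keyed by node name), the alert name and the PromQL expression are placeholders, and the real helpers in alerts.nix may take different arguments.

```nix
# Hypothetical new rule family; names, arguments and shapes are assumptions.
let
  # Assumed topology shape: node name -> node attributes (unused here).
  nodes = {
    alpha = { };
    beta = { };
  };

  # Pure function: same inputs always give the same rule fragment.
  mkExampleRuleGroups = { nodes, severity ? "warning" }: {
    groups = [
      {
        name = "example";
        # One illustrative rule per node in the topology.
        rules = map (name: {
          alert = "ExampleCondition";
          expr = ''up{instance="${name}"} == 0'';
          labels = { inherit severity; };
        }) (builtins.attrNames nodes);
      }
    ];
  };
in
mkExampleRuleGroups { inherit nodes; }
```

On the wiring side, such a helper would then be appended to the list given to dnfLib.mergeRuleGroups in prometheus.nix, so that it is serialised into the same single entry as the other fragments rather than added as a second element of services.prometheus.rules.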