Alerts
The alerting system delivers Prometheus alerts to Matrix rooms (and via email for incidents). A one-time provisioning of the bot is enough: everything is declarative afterwards. The overall architecture is described in Monitoring & Alerting.
Prerequisites
| Prerequisite | Why |
|---|---|
| `prometheus` + `monitoring` on the collector | hosts Alertmanager and the rules |
| Reachable `matrix` service | hosts the bot and the rooms |
| Registration secret in sops (`matrix-rss-password`) | creates the bot account |
| Admin sops key present | writes the bot secrets |
Installing the bot
A single idempotent command (safe to re-run) creates the bot account and the rooms, then writes the bot identity into `var/generated/matrix.nix`.
1. Declare the human administrator (local part) in the configuration file `etc/config.yaml`:

       network:
         matrix:
           admin: "alice"

2. Run provisioning:

       just configure-alert-bot

   The command, using the admin sops key:
   - creates the `@alertbot:domain.tld` account (nonce HMAC flow on the shared secret);
   - creates or resolves the `#alert-warnings` and `#alert-incidents` rooms, then invites the bot and the admin;
   - writes the bot and room IDs into `var/generated/matrix.nix`;
   - stores the password, access token and webhook secret in sops.

3. Deploy the collector:

       just generate
       just apply <collector>

   Alerting activates automatically as soon as the rooms exist (`var/generated/matrix.nix` populated). To force-disable it: `darkone.service.prometheus.alerting.enable = false;`.
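
The "nonce HMAC flow on the shared secret" used during provisioning presumably refers to Synapse's shared-secret registration API: the client fetches a nonce from `/_synapse/admin/v1/register`, then posts it back with an HMAC-SHA1 computed over the account fields. The sketch below covers only the MAC computation and the request body. That the provisioning script works exactly this way is an assumption, and the function names are illustrative.

```python
import hashlib
import hmac


def registration_mac(shared_secret: str, nonce: str, user: str,
                     password: str, admin: bool = False) -> str:
    """Hex HMAC-SHA1 over nonce, user, password and role, NUL-separated."""
    mac = hmac.new(shared_secret.encode("utf-8"), digestmod=hashlib.sha1)
    mac.update(nonce.encode("utf-8"))
    mac.update(b"\x00")
    mac.update(user.encode("utf-8"))
    mac.update(b"\x00")
    mac.update(password.encode("utf-8"))
    mac.update(b"\x00")
    mac.update(b"admin" if admin else b"notadmin")
    return mac.hexdigest()


def registration_body(shared_secret: str, nonce: str, user: str,
                      password: str, admin: bool = False) -> dict:
    """JSON body to POST back to the registration endpoint."""
    return {
        "nonce": nonce,
        "username": user,
        "password": password,
        "admin": admin,
        "mac": registration_mac(shared_secret, nonce, user, password, admin),
    }
```

The nonce is single-use and short-lived, so it has to be fetched immediately before each registration attempt.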
Created secrets
Provisioning stores three secrets in `usr/secrets/secrets.yaml`:

| Sops key | Role |
|---|---|
| `alertmanager-matrix-password` | bot account password (reconnection) |
| `alertmanager-matrix-token` | bot access token (message sending) |
| `alertmanager-webhook-secret` | authenticates Alertmanager to the Matrix bridge |
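
To confirm that provisioning wrote all three secrets, something like the following sketch could be used. It assumes the keys sit at the top level of `usr/secrets/secrets.yaml` and that the local sops key can decrypt the file; the helper names are hypothetical and not part of the project.

```python
import json
import subprocess

EXPECTED_KEYS = (
    "alertmanager-matrix-password",
    "alertmanager-matrix-token",
    "alertmanager-webhook-secret",
)


def missing_secrets(decrypted: dict) -> list:
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not decrypted.get(key)]


def load_secrets(path: str = "usr/secrets/secrets.yaml") -> dict:
    """Decrypt the file with sops and parse the JSON it prints."""
    result = subprocess.run(
        ["sops", "-d", "--output-type", "json", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `missing_secrets(load_secrets())` returns an empty list when all three keys are present.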
Testing an alert
1. On a supervised node, stop a monitored service:

       sudo systemctl stop <unit>

2. Within a minute, a `ServiceDown` (or `SystemdUnitFailed`) message appears in `#alert-warnings`, and in `#alert-incidents` along with an email if the node is critical. The alert link points to `https://prometheus.<zone>.domain.tld`.

3. Restart the service; the alert resolves and a resolution notification is sent:

       sudo systemctl start <unit>
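
If no message shows up, it can help to check whether Prometheus itself sees the alert before suspecting the Matrix side. The sketch below filters the response of Prometheus's `/api/v1/alerts` endpoint for the two alert names used in this test. It assumes the Prometheus vhost answers without extra authentication, which may not hold in this setup; the function names are illustrative.

```python
import json
import urllib.request

TEST_ALERTS = ("ServiceDown", "SystemdUnitFailed")


def firing_alerts(payload: dict, names=TEST_ALERTS) -> list:
    """Keep only alerts in state 'firing' whose alertname is in names."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") in names
    ]


def fetch_alerts(base_url: str, timeout: float = 10.0) -> dict:
    """Download the current alert list from a Prometheus server."""
    url = base_url.rstrip("/") + "/api/v1/alerts"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

An empty result from `firing_alerts(fetch_alerts(...))` after a minute suggests the rule has not fired yet; a non-empty one points at delivery (Alertmanager, the bridge, or the room).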
Silencing during a rebuild
The deployment flow sets a maintenance flag to avoid triggering false alerts during a `nixos-rebuild`. Manually:

    dnf-maintenance on   # inhibits node alerts
    dnf-maintenance off  # re-enables them

Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| No messages in a room | bot not in the room or expired token | re-run `just configure-alert-bot` |
| Room creation 403 | Caddy anti-bot filter | already handled (neutral agent); do not use raw `curl` |
| admin API not public | `/_synapse/admin` closed | let the automatic SSH tunnel work (Matrix host must be reachable) |
| Unresolved alert link | missing DNS or certificate for the `prometheus` vhost | deploy the gateway / HCS for the new subdomain |
| `matrix.nix` ignored at deploy | file not readable by nix | the script does a `chmod 644`; check `var/generated/` permissions |
| No incident email | local Postfix relay inactive | alerting enables it by default; check `darkone.service.prometheus.alerting.email` |
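
For the first row ("No messages in a room"), the two likely causes can be told apart with the standard Matrix client-server endpoints `/_matrix/client/v3/account/whoami` and `/_matrix/client/v3/joined_rooms`. The following is a rough sketch, not the project's own tooling: the helper names are hypothetical, and `room_id` is the internal `!…` room ID, not the `#alert-…` alias.

```python
import json
import urllib.error
import urllib.request


def matrix_get(base_url: str, path: str, token: str):
    """GET a client-server endpoint; return (HTTP status, JSON body or {})."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as error:
        try:
            return error.code, json.loads(error.read())
        except ValueError:
            # Non-JSON error page, e.g. from a reverse proxy.
            return error.code, {}


def diagnose(whoami_status: int, joined_rooms: list, room_id: str) -> str:
    """Map the two API answers to one of the causes listed in the table."""
    if whoami_status == 401:
        return "token rejected: re-run just configure-alert-bot"
    if whoami_status != 200:
        return f"unexpected whoami status {whoami_status}"
    if room_id not in joined_rooms:
        return "bot not in room: re-run just configure-alert-bot"
    return "token valid and bot joined: check Alertmanager and the bridge"


def check_bot(base_url: str, token: str, room_id: str) -> str:
    """Query the homeserver, then classify the result."""
    status, _ = matrix_get(base_url, "/_matrix/client/v3/account/whoami", token)
    rooms = []
    if status == 200:
        _, body = matrix_get(base_url, "/_matrix/client/v3/joined_rooms", token)
        rooms = body.get("joined_rooms", [])
    return diagnose(status, rooms, room_id)
```

Both failure cases lead to the same fix, but knowing which one applies helps when the problem comes back after re-running provisioning.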