Alerts

The alerting system delivers Prometheus alerts to Matrix rooms (and via email for incidents). A one-time provisioning of the bot is enough: everything is declarative afterwards. The overall architecture is described in Monitoring & Alerting.

Prerequisite | Why
prometheus + monitoring on the collector | hosts Alertmanager and the rules
Reachable matrix service | hosts the bot and the rooms
Registration secret in sops (matrix-rss-password) | creates the bot account
Admin sops key present | writes the bot secrets

Provisioning is a single idempotent command (safe to re-run) that creates the bot account and the rooms, then writes the bot identity into var/generated/matrix.nix.

  1. Declare the human administrator (local part) in the configuration:

    etc/config.yaml
    network:
      matrix:
        admin: "alice"
  2. Run provisioning:

    Terminal window
    just configure-alert-bot

    The command, using the admin sops key:

    • creates the @alertbot:domain.tld account (nonce HMAC flow on the shared secret);
    • creates or resolves the #alert-warnings and #alert-incidents rooms, then invites the bot and the admin;
    • writes the bot and room IDs into var/generated/matrix.nix;
    • stores the password, access token and webhook secret in sops.
  3. Deploy the collector:

    Terminal window
    just generate
    just apply <collector>

    Alerting activates automatically as soon as the rooms exist (var/generated/matrix.nix populated). To force-disable it: darkone.service.prometheus.alerting.enable = false;.
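
The "nonce HMAC flow on the shared secret" in step 2 can be sketched as follows. This is not the code of just configure-alert-bot. It assumes the flow is Synapse's documented shared-secret registration (a nonce fetched from /_synapse/admin/v1/register, then an HMAC-SHA1 over the NUL-joined fields). The function names and the example values are hypothetical.

```python
import hashlib
import hmac


def registration_mac(shared_secret, nonce, username, password, admin=False):
    """Hex HMAC-SHA1 over nonce, username, password and admin flag.

    Assumption: this mirrors Synapse's shared-secret registration scheme,
    where the fields are joined with NUL bytes in this exact order.
    """
    mac = hmac.new(shared_secret.encode("utf-8"), digestmod=hashlib.sha1)
    for field in (nonce, username, password):
        mac.update(field.encode("utf-8"))
        mac.update(b"\x00")
    mac.update(b"admin" if admin else b"notadmin")
    return mac.hexdigest()


def registration_body(shared_secret, nonce, username, password, admin=False):
    """JSON body to POST back to the registration endpoint (hypothetical helper)."""
    return {
        "nonce": nonce,
        "username": username,
        "password": password,
        "admin": admin,
        "mac": registration_mac(shared_secret, nonce, username, password, admin),
    }
```

The nonce is single-use, which is why the account creation needs two requests: one to obtain the nonce, one to submit the signed body.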

Provisioning stores three secrets in usr/secrets/secrets.yaml:

Sops key | Role
alertmanager-matrix-password | bot account password (reconnection)
alertmanager-matrix-token | bot access token (message sending)
alertmanager-webhook-secret | authenticates Alertmanager to the Matrix bridge

To test the alerting chain end to end:

  1. On a supervised node, stop a monitored service:

    Terminal window
    sudo systemctl stop <unit>
  2. Within a minute, a ServiceDown (or SystemdUnitFailed) message appears in #alert-warnings. If the node is critical, it also appears in #alert-incidents and an email is sent. The alert link points to https://prometheus.<zone>.domain.tld.

  3. Restart the service. The alert resolves and a resolution notification is sent.

    Terminal window
    sudo systemctl start <unit>
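
While waiting for the Matrix message in the test above, the alert state can also be read on the Prometheus side. The sketch below only parses a response shaped like Prometheus's /api/v1/alerts endpoint; how the response is fetched from the prometheus vhost is left out. The helper name and the sample payload are illustrative, not taken from the project.

```python
import json


def firing_alerts(payload, names=("ServiceDown", "SystemdUnitFailed")):
    """Return the label sets of firing alerts whose alertname is in `names`."""
    if payload.get("status") != "success":
        raise ValueError("unexpected Prometheus response")
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert["labels"]
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") in names
    ]


# Illustrative payload with the shape of a /api/v1/alerts response.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "ServiceDown", "instance": "node1"}, "state": "firing"},
      {"labels": {"alertname": "SystemdUnitFailed", "instance": "node2"}, "state": "pending"},
      {"labels": {"alertname": "DiskFull", "instance": "node1"}, "state": "firing"}
    ]
  }
}
""")

print(firing_alerts(SAMPLE))
```

An alert still in the pending state has not been sent to Alertmanager yet, which is one reason a message may take up to a minute to appear.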

The deployment flow sets a maintenance flag to avoid triggering false alerts during a nixos-rebuild. To toggle the flag manually:

Terminal window
dnf-maintenance on # inhibits node alerts
dnf-maintenance off # re-enables them
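
When the flag is set by hand, forgetting to clear it leaves the node's alerts inhibited. One way to guard against that is to wrap the manual work so the flag is always cleared, as in this minimal sketch. It assumes dnf-maintenance is on the PATH of the node and accepts exactly the on and off arguments shown above. The maintenance helper and its run parameter are hypothetical.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def maintenance(run=subprocess.run):
    """Set the maintenance flag for the duration of the block.

    `run` is injectable so the sketch can be exercised without the real command.
    """
    run(["dnf-maintenance", "on"], check=True)
    try:
        yield
    finally:
        # Always re-enable alerts, even if the wrapped work raised.
        run(["dnf-maintenance", "off"], check=True)
```

Usage would look like `with maintenance(): ...` around the manual operation. The try/finally ensures the off call runs on both success and failure.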

Symptom | Likely cause | Fix
No messages in a room | bot not in the room or expired token | re-run just configure-alert-bot
Room creation 403 | Caddy anti-bot filter | already handled (neutral agent); do not use raw curl
admin API not public | /_synapse/admin closed | let the automatic SSH tunnel work (Matrix host must be reachable)
Unresolved alert link | missing DNS or certificate for the prometheus vhost | deploy the gateway / HCS for the new subdomain
matrix.nix ignored at deploy | file not readable by nix | the script does a chmod 644; check var/generated/ permissions
No incident email | local Postfix relay inactive | alerting enables it by default; check darkone.service.prometheus.alerting.email
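
For the "matrix.nix ignored at deploy" case, the permission check can be scripted. The sketch below assumes that "readable by nix" means the file has the read bit for others (as chmod 644 gives) and that its parent directory is traversable by others. That interpretation and the helper name are assumptions, not project behavior.

```python
import os
import stat


def readable_by_others(path):
    """True if `path` is a regular file with o+r and its parent directory has o+x."""
    file_mode = os.stat(path).st_mode
    parent = os.path.dirname(os.path.abspath(path))
    parent_mode = os.stat(parent).st_mode
    return (
        stat.S_ISREG(file_mode)
        and bool(file_mode & stat.S_IROTH)
        and bool(parent_mode & stat.S_IXOTH)
    )
```

Calling `readable_by_others("var/generated/matrix.nix")` from the repository root returns False when either the file or the var/generated/ directory is too restrictive.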