# Manage alerts configured with Jsonnet
In addition to Simple HTTPS uptime checks, we also have a set of alerts that are configured with Jsonnet in the support deployments of our infrastructure.
## Paging
We don’t have an on-call rotation currently, and nobody is expected to respond outside working hours. Hence, we currently have no standing paging alerts.
However, we may temporarily mark some alerts to page specific people during ongoing incidents. This is usually done to monitor a temporary fix that may or may not have solved the issue. By adding a paging alert, we buy ourselves a little peace of mind: as long as the page is not firing, we are doing OK.
Alerts should have a label named `page` that can be set to the PagerDuty username of whoever should be paged for that alert.
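
As a minimal sketch, an alert carrying such a label might look like the following. Everything here (the alert name, expression, and username) is illustrative and assumes alerts are expressed as standard Prometheus alerting rules in Jsonnet; the real rules may be organized differently:

```jsonnet
// Illustrative sketch only: field names follow standard Prometheus
// alerting-rule structure, not necessarily our actual configuration.
{
  alert: 'TemporaryFixStillHolding',  // hypothetical alert name
  expr: 'up{job="some-service"} == 0',  // hypothetical expression
  'for': '5m',  // `for` is a Jsonnet keyword, so it must be quoted
  labels: {
    // Page this PagerDuty user while the incident is being monitored.
    page: 'some-pagerduty-username',  // hypothetical username
  },
}
```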
## Configuration
We use the Prometheus Alertmanager to set up alerts that are defined in the `helm-charts/support/values.jsonnet` file.
At the time of writing, we have the following classes of alerts:
- when a persistent volume claim (PVC) is approaching full capacity
- when a pod has restarted
When an alert threshold is crossed, an automatic notification is sent to PagerDuty and the `#pagerduty-notifications` channel on the 2i2c Slack.
### When a PVC is approaching full capacity
We monitor the capacity of the following volumes:
- home directories
- hub database
- prometheus database
The alert is triggered when the volume is more than 90% full.
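
For illustration, a rule along these lines could express the 90% threshold using the standard kubelet volume metrics. The rule name, expression, and durations are assumptions, not copied from our configuration:

```jsonnet
// Hedged sketch of a PVC-capacity rule; the actual rule in
// values.jsonnet may filter by namespace or PVC name and use
// a different threshold or duration.
{
  alert: 'PVCApproachingFullCapacity',  // hypothetical name
  // Standard kubelet metrics: bytes used vs. total capacity per PVC.
  expr: 'kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9',
  'for': '15m',  // only fire if the volume stays over 90% full for a while
  annotations: {
    summary: 'PVC {{ $labels.persistentvolumeclaim }} is more than 90% full',
  },
}
```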
To resolve the alert, follow the guidance below.

To be documented, see GH issue.
### When a pod has restarted
We monitor pod restarts for the following services:
- `jupyterhub-groups-exporter`
If a pod has restarted, it may indicate an issue with the service or its configuration. Check the pod’s logs to identify any errors that may have caused the restart, as shown below. If necessary, add more error handling to the code, redeploy the service, or adjust its configuration to resolve the issue.
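
For example, you can inspect both the current logs and the logs of the previous (crashed) container instance; the namespace and pod name below are placeholders:

```bash
# Logs of the currently running container:
$ kubectl -n <namespace> logs <pod-name>

# Logs of the previous container instance, which usually contain
# the error that caused the restart:
$ kubectl -n <namespace> logs --previous <pod-name>
```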
Once the pod is stable, confirm that the alert has resolved by checking that the pod is running without further restarts, e.g. by running the following command:
```bash
$ kubectl -n <namespace> get pod
NAME                                                 READY   STATUS    RESTARTS   AGE
staging-groups-exporter-deployment-9b4c6749c-sgfcc   1/1     Running   0          10m
```
If you have taken the above actions and the issue persists, open a GitHub issue capturing the details of the problem for consideration by the wider 2i2c team. See Simple HTTPS uptime checks for how to snooze an alert in the meantime.