Alerting#
We have a few alerts configured to notify us when things go wrong and we use PagerDuty to manage them.
How to manage alerts#
Important
The Manage alerts section has many how-to guides for managing the types of alerts that we have set up.
Severity levels#
When an alert threshold is crossed, an automatic notification is sent to PagerDuty and the #pagerduty-notifications channel on the 2i2c Slack.
Each alert set up with Jsonnet has a severity level set through the *.jsonnet configuration file. The severity levels are:
take immediate action
same day action needed
action needed this week
to be planned in sprint planning
The severity level determines how quickly you should respond to the alert, and it translates into the priority of the incident created in PagerDuty. After an incident is created, PagerDuty runs an Event Orchestration that sets the priority based on the severity label.
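The correspondence between severity labels and priorities described on this page can be sketched as a simple lookup. This is only an illustration in Python, not the actual Event Orchestration configuration: the names SEVERITY_TO_PRIORITY and priority_for are hypothetical, and the sketch ignores both Service-based rules and manual changes made by an engineer.

```python
from typing import Optional

# Severity labels as listed on this page, paired with the priority each one
# correlates with. The dictionary itself is an illustrative assumption; the
# real mapping lives in PagerDuty's Event Orchestration.
SEVERITY_TO_PRIORITY = {
    "take immediate action": "P1",
    "same day action needed": "P2",
    "action needed this week": "P3",
    "to be planned in sprint planning": "P4",
}


def priority_for(severity: Optional[str]) -> Optional[str]:
    """Return the priority for a severity label, or None if it is missing or unknown."""
    if severity is None:
        return None
    return SEVERITY_TO_PRIORITY.get(severity.strip().lower())
```

An alert without a recognised severity label yields None here, which mirrors an incident that has no priority set at all.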
Priority levels#
The PagerDuty alerts can have a priority between P1 and P4 or have no priority set at all.
P1 alerts#
These alerts signal an ongoing community outage! An outage is a period of time when a hub is unavailable, or its critical services are not functioning as expected, and the activity of two or more hub users is impacted.
The priority is set by:
PagerDuty’s Event Orchestration, if the alert has a "take immediate action" severity, or based on the Service it pertains to (e.g. all JupyterHub health checks are P1s)
Manually by the engineer
Important
All P1 PagerDuty alerts show up on the 2i2c status page, and subscribed users receive updates related to them.
Warning
If an alert goes from P1 to another priority or to no priority at all, PagerDuty’s status page will lose track of it, and the alert will stay on the status page indefinitely unless it is manually removed.
P2 alerts#
These alerts signal that the community is about to be affected if we don’t act as soon as possible. For example, increasing the size of a hub’s home directory when it has less than 10% available.
The priority is set by PagerDuty’s Event Orchestration, if the alert has a "same day action needed" severity, or based on the Service it pertains to (e.g. all JupyterHub health checks are P1s).
P3 alerts#
These alerts correlate with the "action needed this week" severity level.
The community is about to be affected if we don’t do something soon, but not immediately.
P4 alerts#
These alerts correlate with the "to be planned in sprint planning" severity level.
The community is not necessarily affected on a specific timeline, but we must bring some action into the committed column of the next sprint.
Alerts configured with Jsonnet#
A set of alerts is configured with Jsonnet in the support deployments of our infrastructure.
Configuration#
We use the Prometheus Alertmanager to set up alerts, which are defined in the helm-charts/support/values.jsonnet file.
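To give a rough idea of what such a definition contains, here is the general shape of a Prometheus alerting rule group that carries a severity label, built as a plain Python dictionary and printed as JSON (Jsonnet is a superset of JSON). Everything specific in it is an assumption made for illustration: the PromQL expression, the "for" duration, the annotation text, and the choice of severity are not copied from values.jsonnet.

```python
import json

# Hypothetical rule group, for illustration only. The expression, duration,
# annotation, and severity below are assumptions, not the deployed rules.
rule_group = {
    "name": "PVC available capacity",
    "rules": [
        {
            "alert": "Home Directory Disk 90% full",
            # Fires when less than 10% of the volume's capacity is available.
            "expr": (
                "kubelet_volume_stats_available_bytes"
                " / kubelet_volume_stats_capacity_bytes < 0.1"
            ),
            # How long the condition must hold before the alert fires (assumed).
            "for": "5m",
            # The severity label is what PagerDuty later turns into a priority.
            "labels": {"severity": "same day action needed"},
            "annotations": {"summary": "Home directory disk is almost full"},
        }
    ],
}

print(json.dumps({"groups": [rule_group]}, indent=2))
```

The fields name, rules, alert, expr, for, labels, and annotations are the standard Prometheus alerting rule fields; the real file may structure or generate them differently.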
At the time of writing, we have the following alerting rule groups, each containing one or more alerts:
PVC available capacity: for when a persistent volume claim (PVC) is approaching full capacity, with the following alerts:
Home Directory Disk 90% full
Home Directory Disk 100% full (outage)
Hub Database Disk 90% full
Prometheus Disk 90% full
Important Pod Restart: for when a pod has restarted, with the following alerts:
jupyterhub-groups-exporter restart
jupyterhub-home-nfs restart
Server Startup Failure: for when a user server has failed to start.
DiskIO saturation: for when a disk is approaching IO saturation.
Pods stuck in an undesirable state for too long: for when a pod is stuck in Pending for more than 15m or in Terminating for more than 10m.
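As a small illustration of the thresholds in the last group, the stuck-pod condition can be expressed as a check on a pod's state and the time it has spent there. This is a Python sketch only; the real alerts are Prometheus rules, and the names STUCK_THRESHOLDS and is_stuck are hypothetical.

```python
from datetime import timedelta

# Thresholds taken from the list above: Pending for more than 15 minutes,
# Terminating for more than 10 minutes. The names here are hypothetical.
STUCK_THRESHOLDS = {
    "Pending": timedelta(minutes=15),
    "Terminating": timedelta(minutes=10),
}


def is_stuck(state: str, time_in_state: timedelta) -> bool:
    """Return True if a pod has been in an undesirable state for too long."""
    threshold = STUCK_THRESHOLDS.get(state)
    # States without a threshold (e.g. Running) are never considered stuck.
    return threshold is not None and time_in_state > threshold
```

For example, is_stuck("Pending", timedelta(minutes=20)) is True, while a pod that has been Terminating for exactly 10 minutes is not yet considered stuck because the comparison is strict.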
Each of these alerts is integrated with a PagerDuty Service. These Services can then be grouped under PagerDuty Business Services, which can be presented on the status page.
Important
You can find the existing Services under Service Directory and the existing Business Services on 2i2c’s PagerDuty page.
Important PagerDuty pages to know about#
All of the alerts we have configured are managed by PagerDuty. PagerDuty provides some important web pages that are useful to know about:
List of incidents: this is where all incidents can be found.
Internal status page: this is where outages show up, per Business Service. Clicking on an incident from this page links you to the alert.
External status page: this is where outages show up to the outside world, per Business Service. Here, people can:
subscribe for updates about outages
subscribe to get information about maintenance windows that we might post
find out about the uptime of each Business Service