Manage alerts#

In addition to the Simple HTTPS uptime checks, we also have a set of alerts configured in support deployments using Jsonnet in our Infrastructure. For more about these alerts, see Alerts configured with Jsonnet.

What to do when an alert fires based on its type and severity#

When an alert fires, the person handling it should decide what to do based on the type of alert and its severity. The general steps to take for any alert are:

1. Validate or invalidate, as quickly as possible, whether we are dealing with an outage
2. If it is an outage, add the P1 priority to the PD incident and follow the incident process
3. If it is not an outage, time-box yourself to 30 minutes and either debug it or fix it if you know how
4. After 30 minutes, use your judgement to decide if you need to create a follow-up issue to investigate further. You may conclude that extra time invested won’t shed any light on the root cause. This is ok. Just leave a note in PD.
5. Close the alert in PD and link to the issue if you created one

Below are some guidelines on how to handle the different types of alerts we have configured.

Severity and timeline#

When an alert fires, it will create an incident in PagerDuty and notify the #pagerduty-notifications channel on the 2i2c Slack. Each alert set up with Jsonnet has a severity level that can be one of:

take immediate action
same day action needed
action needed this week
to be planned in sprint planning

This severity level, together with the priority levels described in Priority levels, determines how quickly you should respond to the alert.

What to do for alerts about JupyterHub not being available#

These alerts are outages and have a P1 priority set by default. To resolve this alert:

check if another engineer is doing any work/testing/decommissioning of the hub
if they are, then resolve the alert in PD and remove its P1 priority (in this order)
if they’re not, then
- validate if the hub is indeed not available
- follow the incident response process

What to do for alerts on PVC capacity#

We monitor the capacity of the following volumes:

home directories
hub database
prometheus database

There are two alert types, each triggered based on a capacity threshold:

A P2 alert at 10%
- It is triggered when the volume has less than 10% of free space remaining.
- This alert is assigned the same day action needed severity level, hence a P2 priority.
- To resolve the alert, follow the guidance below:
  1. Ack the alert in PD
  2. Increase the size of a disk storing home directories
  3. To be documented, see GH issue
  4. Resizing Prometheus’ disk

A P1 alert at 0%
- It is triggered when the volume doesn’t have any capacity left.
- If we respond in time to the P2 alert, this one should never trigger.
- This alert is assigned the take immediate action severity level and a P1 priority, hence it is an outage.
- To resolve the alert, use the guides listed for the P2 alert above.
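The two thresholds above can be sketched as a small classifier. This is only an illustration of the logic described in this section, not the actual alert rules (those live in the Jsonnet configuration). The function name `classify_pvc_alert` and its parameters are hypothetical.

```python
from typing import Optional, Tuple

# Threshold taken from the text above: the P2 alert fires below 10% free space.
P2_FREE_SPACE_THRESHOLD = 0.10


def classify_pvc_alert(capacity_bytes: int, available_bytes: int) -> Optional[Tuple[str, str]]:
    """Return (priority, severity) for a volume, or None if no alert applies.

    Hypothetical helper: the real alerts are evaluated by the monitoring
    stack, not by code like this.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    if available_bytes <= 0:
        # No capacity left: this is treated as an outage.
        return ("P1", "take immediate action")
    if available_bytes / capacity_bytes < P2_FREE_SPACE_THRESHOLD:
        return ("P2", "same day action needed")
    return None
```

For example, a 100 GiB volume with 9 GiB free would map to the P2 alert, while the same volume with 50 GiB free would not alert at all.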

What to do for alerts on pod restarts#

We monitor pod restarts for the following services:

jupyterhub-groups-exporter
jupyterhub-home-nfs
jupyterhub-cost-monitoring
support-grafana
support-prometheus-server
proxy

If a pod has restarted, it may indicate an issue with the service or its configuration. To resolve the alert:

Ack the alert in PD
Check if the pod is running or is restarting infinitely
Check the logs of the pod to identify any errors or issues that may have caused the restart
- If the pod is stable and not restarting anymore, see if the logs present anything useful enough to open a tracking issue. If not, mark the alert as resolved.
- If the pod is still restarting, try getting it into a stable state by redeploying it or adjusting its configuration.
If you have taken the above actions and the issue persists, then
- Open a GitHub issue capturing the details of the problem for consideration by the wider 2i2c team.
- Set a priority on the alert

What to do for alerts on server startup failures#

Any time two consecutive spawns fail in a 30m time window, we trigger an alert. This alert doesn’t have a severity level or a priority level set on it by default because the cause can be anything. This is why it is best to investigate these ASAP.
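To make the trigger condition concrete, here is a minimal sketch of the check. It assumes that "in a 30m time window" means the two back-to-back failures are at most 30 minutes apart; the real rule is defined in the alerting configuration and may differ in detail. The function name and the event format are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

WINDOW = timedelta(minutes=30)  # window size from the text above


def has_consecutive_spawn_failures(events: Iterable[Tuple[datetime, bool]]) -> bool:
    """Return True if two back-to-back failed spawns fall within WINDOW.

    `events` is a hypothetical list of (timestamp, succeeded) pairs.
    """
    previous_failure = None
    for timestamp, succeeded in sorted(events, key=lambda event: event[0]):
        if succeeded:
            # A successful spawn breaks the run of failures.
            previous_failure = None
            continue
        if previous_failure is not None and timestamp - previous_failure <= WINDOW:
            return True
        previous_failure = timestamp
    return False
```

A failure, then a success, then another failure does not count, because the failures are not consecutive.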

There is additional automation that runs each time an alert like this is triggered. The automation triggers a GitHub workflow that runs a health check for the alerting cluster and hub.

A note with the status of this run is left in the PagerDuty incident
In addition, if the health check succeeds, the incident is resolved and a message with this status is posted in the #pagerduty-notifications Slack channel
If the health check fails, then a message, mentioning the channel members, is posted in the #pagerduty-notifications Slack channel
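The outcomes above can be summarised as a small decision function. This is a sketch of the documented behaviour only, not the code of the actual GitHub workflow; the function name is hypothetical.

```python
from typing import List

SLACK_CHANNEL = "#pagerduty-notifications"


def actions_after_health_check(health_check_passed: bool) -> List[str]:
    """List what the automation does once the health check has finished."""
    # A note is always left on the incident, whatever the result.
    actions = ["add a note with the run status to the PagerDuty incident"]
    if health_check_passed:
        actions.append("resolve the PagerDuty incident")
        actions.append(f"post the status in {SLACK_CHANNEL}")
    else:
        actions.append(f"post a message mentioning channel members in {SLACK_CHANNEL}")
    return actions
```

Note that a failed health check leaves the incident open, so it still needs a person to investigate.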

What to do for alerts for application outages#

When an application is not working as expected, it is classed as a possible application outage. Similar to the server start alert above, it’s best to investigate these ASAP to validate whether this is an outage or not. If it is an outage, follow the incident process as normal.

To resolve the alert:#

Check if you can spawn a server on that cluster and hub. If not, then it is most likely an outage and you must set the P1 priority on this alert and follow the incident response process for outages.
If you can spawn a server, then this is most likely not an outage. But check the list of common causes below and find the one that matches what you’re seeing in the logs.

If logs are not available or are not providing any useful info, then you can manually resolve the alert in PD as a mystery. It will likely come back if there’s an underlying issue, and a pair of eyes will be available to investigate.
If the logs seem suspicious but you cannot put your finger on the issue, then open a tracking GitHub issue to be discussed with the rest of the engineering team.

The causes for this can be varied, and it always requires investigation. Some common causes are:

Node was too slow to spin up. This may be transient - test again, and if this works, it’s fine.
The user may try to bring their own image and that image is not available or buggy in some way. There is not much we can do here.
Appropriate nodepools have not been created somehow. Check the autoscaler logs, and examine the pod specification carefully (particularly affinity and nodeSelector).
The requested resources are too big to fit on the node type that was requested. Our resource generation script is designed to guard against this. Check to see if we are actually using the resource generation script here.
There is not enough quota in the cloud project for node spin up to happen. Check the cloud console to see if this is the case, and request additional quota.
There is a cloud provider outage. Check out their status page.
A mysterious 7th option. Form a mental model of our infrastructure, and poke around. If you find any useful info,

What to do for alerts on home directory IOPs or Throughput#

Whenever the home directory storage IOPs or Throughput performance is limited (saturated) for an extended period (defined in Terraform as datapoints_to_alarm triggers over evaluation_periods evaluation periods), PagerDuty triggers an alert. Triggers can be back-to-back, or distributed over the interval.
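The "datapoints_to_alarm over evaluation_periods" condition is an M-out-of-N check. The sketch below illustrates it under simplifying assumptions: one boolean per evaluation period, and no special handling of missing datapoints (which the real alarm evaluation may treat differently). The function name and the `breaching` input are hypothetical; only `datapoints_to_alarm` and `evaluation_periods` come from the Terraform configuration mentioned above.

```python
from typing import Sequence


def alarm_should_fire(
    breaching: Sequence[bool],
    datapoints_to_alarm: int,
    evaluation_periods: int,
) -> bool:
    """Return True if enough of the most recent periods were limited.

    `breaching` holds one flag per evaluation period, oldest first,
    set to True when performance was limited in that period.
    """
    if not 0 < datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 0 < datapoints_to_alarm <= evaluation_periods")
    recent = breaching[-evaluation_periods:]
    # Breaching periods do not need to be adjacent, only numerous enough.
    return sum(recent) >= datapoints_to_alarm
```

With `datapoints_to_alarm=3` and `evaluation_periods=5`, both `[True, True, True, False, False]` (back-to-back) and `[True, False, True, False, True]` (distributed) satisfy the condition.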

Under day-to-day conditions, we should treat this alert as an indication that we might need to increase the performance of the home directory disk. We should periodically analyse the number of times that disk performance has been limited (using the Prometheus metrics) over a month, and consider bumping the disk's IOPs or throughput if the community needs more headroom.

Under workshop conditions, this alert should be considered an early warning that the home disk is under pressure. The consequences of this are typically reduced filesystem performance for all workshop users, which may manifest as a slow or laggy user experience in JupyterLab, and poor performance of analysis code running in kernels.

This can be improved by bumping either the iops or the throughput variables for the home directory disk in Terraform, or by temporarily imperatively modifying these values in the AWS console.

How to get useful information about an alert#

Each automatic alert will have a title which is formed using the alert name and various labels considered important.

Example: [FIRING:1] home-nfs has 10% of space left openscapes prod (same day action needed).

The FIRING:n part tracks how many times the alert has been triggered. Because we are not yet grouping alerts, it will always be 1 and can be ignored.
<disk name> has <limit>% of space left: this is the alert name, which says which disk the alert is about and how much free space it has left
<cluster-name> <hub-name>: these are labels that identify the cluster and hub for which the alert has triggered
same day action needed: the severity of the alert, which sets the timeline for handling it
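As an illustration, the title format described above can be split into its parts with a regular expression. This sketch assumes every title follows the layout of the example and that cluster and hub names contain no spaces; the pattern and function name are hypothetical and not part of our tooling.

```python
import re
from typing import Dict, Optional

# Assumed layout: [STATUS:n] <alert name> <cluster> <hub> (<severity>)
TITLE_PATTERN = re.compile(
    r"^\[(?P<status>[A-Z]+):(?P<count>\d+)\]\s+"
    r"(?P<alert_name>.+)\s+"
    r"(?P<cluster>\S+)\s+"
    r"(?P<hub>\S+)\s+"
    r"\((?P<severity>[^)]+)\)\.?$"
)


def parse_alert_title(title: str) -> Optional[Dict[str, str]]:
    """Split an alert title into its parts, or return None if it does not match."""
    match = TITLE_PATTERN.match(title.strip())
    return match.groupdict() if match else None


parts = parse_alert_title(
    "[FIRING:1] home-nfs has 10% of space left openscapes prod (same day action needed)"
)
```

For the example title, `parts["cluster"]` is `openscapes`, `parts["hub"]` is `prod`, and `parts["severity"]` is `same day action needed`.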

Also, clicking on an alert in PagerDuty gets you all the metadata associated with it, where you can find extra info, like the summary.

How to add a new alert#

To add a new alert, first read Jsonnet in our Infrastructure, then add the alert to /helm-charts/support/values.jsonnet.
Then, if this is an alert that doesn’t pertain to any of the existing alerting groups as defined in Alerting - Configuration, you’ll have to:

create a new group
create a new Service in PagerDuty for this group
get the integration key of this service and store it encrypted under a new PagerDuty receiver
write a matcher rule in Alertmanager that will link this group to this new receiver

Configure Slack notifications with the PagerDuty service integrations
- Go to the #pagerduty-notifications channel in the 2i2c Slack
- Make sure your PD account is linked to Slack with the shortcut command /pd link
- Use /pd connect to select the service to connect to the channel and select Responder for how you want to be notified
Test it
If you know what the outage condition for this new group is, create a new Orchestration rule for it, so that outage alerts are automatically assigned P1 and shown on the status page