# Simple HTTPS uptime checks
Ideally, when a hub is down, a machine alerts us, so we do not have to wait for a user to report it to our helpdesk. While we aren't quite there yet, we currently have very simple uptime monitoring for all our hubs with free GCP Uptime Checks.
## Where are the checks?
Uptime checks are centralized: they don't exist in the same project or cloud provider as the hubs they are checking, but in one centralized GCP project (`two-eye-two-see`). This has a few advantages:
- We do not have to implement the same functionality three times (once per cloud provider), as we would have to if the checks lived in the same project as each hub.
- These are all 'black box' external checks, so it does not particularly matter where they run from.
You can browse the existing checks on the GCP Console as well.
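You can also see them from the CLI; a minimal sketch, assuming a recent gcloud where the `monitoring uptime` command group is available (on older versions it may only exist under `gcloud alpha`):

```bash
# List the uptime check configurations in the central project
gcloud monitoring uptime list-configs --project=two-eye-two-see
```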
## Cost
Note that as of October 2022, per Google Stackdriver Pricing, the free monthly quota is 1 million executions of uptime checks per project:
| Feature | Price | Free allotment per month | Effective date |
| --- | --- | --- | --- |
| Execution of Monitoring uptime checks | $0.30/1,000 executions | 1 million executions per Google Cloud project | October 1, 2022 |
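As a back-of-the-envelope illustration (the arithmetic below is an assumption for sizing, not billing data): a check probing every 15 minutes runs roughly 2,880 times per month from each probe region, so the free tier covers a few hundred check-regions.

```bash
# Rough monthly execution count for one check on a 15-minute interval.
# GCP probes each check from several regions, and each regional probe
# counts as one execution, so multiply by the number of regions used.
echo $(( (60 / 15) * 24 * 30 ))   # 2880 executions/month per region
echo $(( 1000000 / 2880 ))        # ~347 check-regions fit in the free tier
```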
## When are notifications triggered?
Our uptime checks are performed every 15 minutes, and we alert if checks have failed for 31 minutes. This makes sure there are at least 2 failed checks before we alert.
We are optimizing for actionable alerts that we can completely trust, and for preventing any kind of alert fatigue for our engineers.
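To see why 31 minutes guarantees at least two failed checks, consider the worst case (a sketch of the arithmetic, not anything GCP-specific):

```bash
INTERVAL=15  # minutes between uptime check executions
WINDOW=31    # minutes of continuous failure before we alert
# Worst case: the outage begins just after a successful check, so the first
# failing check runs 15 minutes into the outage and the second at 30 minutes.
# Since 31 > 2 * 15, the alert window always contains two failed checks.
echo $(( WINDOW > 2 * INTERVAL ))  # prints 1 (true)
```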
### JupyterHub health checks
The JupyterHub does get restarted during deployment, which can cause a few seconds of downtime, and we do not want to alert in case the uptime check hits the hub at just that moment. We trade off a few minutes of responsiveness for trust here. `/hub/health` is the endpoint checked for hubs, and `/health` is checked for BinderHub.
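To probe the same endpoints by hand, something like the following works (the URLs are placeholders; substitute a real hub or BinderHub domain):

```bash
HUB_URL="https://hub.example.org"        # placeholder hub domain
BINDER_URL="https://binder.example.org"  # placeholder binderhub domain

# -f makes curl exit non-zero on HTTP errors, -s silences progress output
curl -sf "$HUB_URL/hub/health" > /dev/null && echo "hub healthy"
curl -sf "$BINDER_URL/health" > /dev/null && echo "binderhub healthy"
```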
When an alert is triggered, it automatically opens an Incident in the Managed JupyterHubs service we maintain in PagerDuty. This also notifies the `#pagerduty-notifications` channel on the 2i2c Slack, and kicks off our incident response process.
### Prometheus health checks
Our Prometheus instances are protected by auth, so we just check that we get a `401 Unauthorized` response from the Prometheus instance.
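A manual version of this check might look like the following (the URL is a placeholder):

```bash
PROM_URL="https://prometheus.example.org"  # placeholder prometheus domain
# Fetch only the HTTP status code; 401 means the instance is up and auth is on
STATUS=$(curl -s -o /dev/null -w '%{http_code}' "$PROM_URL")
[ "$STATUS" = "401" ] && echo "prometheus up (auth enforced)"
```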
When an alert is triggered, it automatically opens an Incident in the Cluster Prometheus service we maintain in PagerDuty. This also notifies the `#pagerduty-notifications` channel on the 2i2c Slack, and kicks off our incident response process.
## How are the checks set up?
We use Terraform in the `terraform/uptime-checks` directory to set up the checks, notification channels and alerting policies. This allows new checks to be created automatically whenever a new hub or cluster is added, with no manual steps required.
Terraform is run in our continuous deployment pipeline on GitHub Actions at the end of every deployment, using a GCP ServiceAccount that was manually created. It has just enough permissions to access the Terraform state (on GCS), the uptime checks, notification channels and alert policies. Nothing destructive can happen if this `terraform apply` goes wrong, so it is alright to run it without human supervision on GitHub Actions.
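For reference, running the same Terraform by hand looks roughly like this (a sketch; CI normally does this, and the exact init/backend flags may differ):

```bash
cd terraform/uptime-checks
terraform init    # state backend is on GCS
terraform plan    # should only touch checks, notification channels, policies
terraform apply
```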
## How do I snooze a check?
As the checks are all in GCP, snoozes can be created through the GCP monitoring console.
The `alpha` gcloud component also supports setting snoozes from the command line. For further documentation, see the Google Cloud Monitoring docs or the `gcloud alpha monitoring snoozes` reference. You may need to add the `alpha` component to your gcloud install.
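Installing the component, if it is missing, is a single command (for a gcloud installed via a system package manager the mechanism may differ):

```bash
gcloud components install alpha
```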
Example CLI use that snoozes the binder-staging check for 7 days:

```bash
HUB=binder-staging
# Find the alert policy for this hub by display name
POLICY=$(gcloud alpha monitoring policies list --filter "displayName ~ $HUB" --format='value(name)')
# echo $POLICY
# projects/two-eye-two-see/alertPolicies/12673409021288629743
# Snooze the policy for 7 days starting now
gcloud alpha monitoring snoozes create --display-name="Uptime Check Disabled $HUB" --criteria-policies="$POLICY" --start-time="$(date -Iseconds)" --end-time="+PT7D"
# Created snooze [projects/two-eye-two-see/snoozes/3009021608334458880].
```
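To inspect snoozes after the fact, the same alpha surface has `list` and `describe` commands (the snooze name below is the one from the example output above):

```bash
gcloud alpha monitoring snoozes list
gcloud alpha monitoring snoozes describe projects/two-eye-two-see/snoozes/3009021608334458880
```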