# About upgrade disruptions

When we upgrade our Kubernetes clusters, we can cause several kinds of disruptions. This text provides an overview of them.

## Kubernetes api-server disruption

K8s clusters’ control plane (api-server etc.) can be either highly available (HA) or not. EKS clusters, AKS clusters, and “regional” GKE clusters are HA, but “zonal” GKE clusters are not. A few of our GKE clusters are still zonal, but as the cost savings are minimal we only create regional clusters now.

When upgrading a zonal cluster, the single k8s api-server will be temporarily unavailable. That is not a big problem, as user servers and JupyterHub will remain accessible. The brief disruption is that JupyterHub won’t be able to start new user servers, and user servers won’t be able to create or scale their dask-clusters.
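To know which kind of disruption to expect, it helps to check whether a given GKE cluster is zonal or regional. A minimal sketch, assuming the gcloud CLI is authenticated against the relevant project:

```bash
# List GKE clusters in the current project; the LOCATION column shows a
# zone (e.g. us-central1-b) for zonal clusters and a region
# (e.g. us-central1) for regional, highly available clusters.
gcloud container clusters list
```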

## Provider managed workload disruptions

When upgrading a cloud provider managed k8s cluster, the provider may also upgrade some workloads it manages as part of the cluster, such as calico, which enforces NetworkPolicy rules. This could cause a disruption for users, but it’s not currently known whether it does, or in what manner.
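One way to see which provider-managed workloads could be touched by an upgrade is to list what runs in the kube-system namespace. A rough sketch, assuming kubectl is pointed at the cluster in question:

```bash
# Provider-managed workloads (e.g. calico, kube-proxy, DNS) typically run
# in kube-system and may be upgraded alongside the control plane.
kubectl get deployments,daemonsets --namespace kube-system
```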

## Core node pool disruptions

A disruption to the core node pool is a disruption to the workloads running on it, and a few of those workloads, when disrupted, would in turn disrupt users.

### ingress-nginx-controller pod(s) disruptions

The support chart we install in each cluster comes with the ingress-nginx chart. The ingress-nginx chart creates one or more ingress-nginx-controller pods that proxy network traffic associated with incoming connections.

Shutting down such a pod breaks the connections of users working against the user servers. A broken connection can be re-established if another replica of this pod is ready to accept a new connection.

We are currently running only one replica of the ingress-nginx-controller pod. This doesn’t cause issues during rolling updates, such as when the Deployment’s pod template specification is changed or when manually running `kubectl rollout restart -n support deploy/support-ingress-nginx-controller`, because the replacement pod is started before the old one is stopped. We will, however, see broken connections, and users unable to establish new connections, if `kubectl delete` is used on this single pod or `kubectl drain` is used on its node.
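To illustrate the difference, here is a rough sketch of the safe and the disruptive ways of replacing that single pod. The label selector is an assumption based on the upstream ingress-nginx chart’s standard labels:

```bash
# Safe: a rolling restart creates the replacement pod and waits for it to
# become ready before the old pod is terminated.
kubectl rollout restart -n support deploy/support-ingress-nginx-controller
kubectl rollout status -n support deploy/support-ingress-nginx-controller

# Disruptive: deleting the single pod outright (or draining its node) breaks
# existing connections, and new connections fail until a replacement is ready.
kubectl delete pod -n support -l app.kubernetes.io/name=ingress-nginx
```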

### hub pod disruptions

Our JupyterHub installations each have a single hub pod, and having more isn’t supported by JupyterHub itself. Due to this, and because the hub pod mounts a disk that can only be mounted by one pod at a time, its Deployment isn’t configured to do rolling updates.
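In practice this means the hub Deployment uses Kubernetes’ Recreate strategy, where the old pod is stopped before the new one starts. One way to confirm this, assuming the Deployment is named hub as in a standard z2jh install:

```bash
# Prints "Recreate" rather than "RollingUpdate": the old hub pod is stopped
# before its replacement starts, so expect a brief hub outage on upgrades.
kubectl get deploy hub -n <hub-namespace> -o jsonpath='{.spec.strategy.type}'
```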

When the hub pod isn’t running, users can’t visit `/hub` paths, but they can still visit `/user` paths and control their already-started user servers.

### proxy pod disruptions

Our JupyterHub installations each have a single proxy pod running configurable-http-proxy. Having more replicas isn’t supported, because JupyterHub will only update one replica with new proxy routes.

When the proxy pod isn’t running, users can’t visit `/hub`, `/user`, or `/service` paths, because they all route through the proxy pod.

When the proxy pod has started and become ready, it also needs to be re-configured by JupyterHub on how to route traffic destined for `/user` and `/service` paths. JupyterHub does this during its startup and then regularly every five minutes. Due to this, a proxy pod being restarted can cause an outage of up to five minutes.
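One way to observe when the proxy has been re-configured is to watch the hub’s logs for it re-adding routes after a proxy restart. A rough sketch, assuming the hub Deployment is named hub as in a standard z2jh install:

```bash
# Follow the hub logs and watch for it (re-)adding routes to the proxy;
# until that happens, /user and /service paths will fail.
kubectl logs -n <hub-namespace> deploy/hub -f | grep -i route
```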

## User node pool disruptions

Disruptions to a user node pool will disrupt user server pods running on it.
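
Before intentionally disrupting a user node, for example by draining it ahead of an upgrade, it can be useful to see which user server pods would be affected. A minimal sketch, with the node name as a placeholder:

```bash
# List the pods scheduled on a given user node; these are the workloads a
# drain or upgrade of that node would disrupt.
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```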