Configure per-user storage quotas#
This guide explains how to enable and configure per-user storage quotas using the jupyterhub-home-nfs
helm chart.
Note
Nest all config examples under a basehub
key if deploying this for a daskhub.
Creating a pre-provisioned disk#
The in-cluster NFS server uses a pre-provisioned disk to store the users’ home directories. We don’t use a dynamically provisioned volume because we want to be able to reuse the same disk even when the Kubernetes cluster is deleted and recreated. So the first step is to create a disk that will be used to store the users’ home directories.
For infrastructure running on AWS, we can create a disk through Terraform by adding a block like the following to the tfvars
file of the cluster:
```terraform
ebs_volumes = {
  "staging" = {
    size        = 100 # in GB
    type        = "gp3"
    name_suffix = "staging"
    tags        = { "2i2c:hub-name": "staging" }
  }
}
```
This will create a disk with a size of 100GB for the staging
hub that we can reference when configuring the NFS server.
Apply these changes with:
```bash
terraform plan -var-file=projects/$CLUSTER_NAME.tfvars
terraform apply -var-file=projects/$CLUSTER_NAME.tfvars
```
Enabling jupyterhub-home-nfs#
To be able to configure per-user storage quotas, we need to run an in-cluster NFS server using jupyterhub-home-nfs
. This can be enabled by setting jupyterhub-home-nfs.enabled = true
in the hub’s values file (or the common values files if all hubs on this cluster will be using this).
jupyterhub-home-nfs expects a reference to a pre-provisioned disk.
You can retrieve the volumeId
by checking the Terraform outputs:
```bash
terraform output
```
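The volume ID for the hub's disk should appear in that output. It might look roughly like the following (illustrative only; the exact output name and shape depend on how the cluster's Terraform outputs are defined):

```
ebs_volumes = {
  "staging" = "vol-0a1246ee2e07372d0"
}
```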
Here are examples of how to connect the volume to jupyterhub-home-nfs in the hub values file, on AWS and on GCP respectively.
On AWS:

```yaml
jupyterhub-home-nfs:
  enabled: true # can be migrated to common values file
  eks:
    enabled: true # can be migrated to common values file
    volumeId: vol-0a1246ee2e07372d0
```
On GCP:

```yaml
jupyterhub-home-nfs:
  enabled: true # can be migrated to common values file
  gke:
    enabled: true # can be migrated to common values file
    volumeId: projects/jupyter-nfs/zones/us-central1-f/disks/jupyter-nfs-home-directories
```
These changes can be deployed by running the following command:
```bash
deployer deploy $CLUSTER_NAME $HUB_NAME
```
Once these changes are deployed, we should have a new NFS server running in our cluster through the jupyterhub-home-nfs
Helm chart. We can get the IP address of the NFS server by running the following commands:
```bash
# Authenticate with the cluster
deployer use-cluster-credentials $CLUSTER_NAME

# Retrieve the service IP
kubectl -n $HUB_NAME get svc ${HUB_NAME}-nfs-service
```
To check whether the NFS server is running properly, see the Troubleshooting section.
Migrating existing home directories and switching to the new NFS server#
See Migrate data across NFS servers for instructions on performing these steps.
Enforcing storage quotas#
Warning
If you attempt to enforce quotas before having performed the migration, you may see the following error:
```
FileNotFoundError: [Errno 2] No such file or directory: '/export/$HUB_NAME'
```
Now we can set quotas for each user and configure the path to monitor for storage quota enforcement.
This can be done by updating basehub.jupyterhub-home-nfs.quotaEnforcer
in the hub’s values file. For example, to set a quota of 10GB for all users on the staging
hub, we would add the following to the hub’s values file:
```yaml
jupyterhub-home-nfs:
  quotaEnforcer:
    hardQuota: "10" # in GB
    path: "/export/staging"
```
The `path` field is the path to the parent directory of the users' home directories on the NFS server. The `hardQuota` field is the maximum allowed size of each user's home directory, in GB.
To deploy the changes, we need to run the following command:
```bash
deployer deploy $CLUSTER_NAME $HUB_NAME
```
Once this is deployed, the hub will automatically enforce the storage quota for each user. If a user’s home directory exceeds the quota, the user’s pod may not be able to start successfully.
Enabling alerting through Prometheus Alertmanager#
Once we have enabled storage quotas, we want to be alerted when the disk usage of the NFS server exceeds a certain threshold so that we can take appropriate action. To do this, we create a Prometheus alerting rule that fires when that threshold is crossed, route it through Alertmanager, and forward the resulting alert to PagerDuty.
Note
Use these resources to learn more about PagerDuty's Prometheus integration and Prometheus' Alertmanager configuration.
First, we need to enable Alertmanager in the hub’s support values file (for example, here’s the one for the nasa-veda
cluster).
```yaml
prometheus:
  alertmanager:
    enabled: true
```
Then, we need to create a Prometheus rule that will alert us when the disk usage of the NFS server exceeds a certain threshold. For example, to alert us when the disk usage of the NFS server exceeds 90% of the total disk size over a 15min period, we would add the following to the hub’s support values file:
```yaml
prometheus:
  serverFiles:
    alerting_rules.yml:
      groups:
        # Duplicate this entry for every hub on the cluster that uses an EBS volume as an NFS server
        - name: <cluster_name> <hub_name> jupyterhub-home-nfs EBS volume full
          rules:
            - alert: <hub_name>-jupyterhub-home-nfs-ebs-full
              expr: node_filesystem_avail_bytes{mountpoint="/shared-volume", component="shared-volume-metrics", namespace="<hub_name>"} / node_filesystem_size_bytes{mountpoint="/shared-volume", component="shared-volume-metrics", namespace="<hub_name>"} < 0.1
              for: 15m
              labels:
                severity: critical
                channel: pagerduty
                cluster: <cluster_name>
              annotations:
                summary: "jupyterhub-home-nfs EBS volume full in namespace {{ $labels.namespace }}"
```
Note
The important variables to note here are:
- `expr`: This is what Prometheus will evaluate
- `for`: This is the duration over which Prometheus will collect data to evaluate `expr`
And finally, we need to configure Alertmanager to send alerts to PagerDuty.
```yaml
prometheus:
  alertmanager:
    enabled: true
    config:
      route:
        group_wait: 10s
        group_interval: 5m
        receiver: pagerduty
        repeat_interval: 3h
        routes:
          # Duplicate this entry for every hub on the cluster that uses an EBS volume as an NFS server
          - receiver: pagerduty
            match:
              channel: pagerduty
              cluster: <cluster_name>
              namespace: <hub_name>
```
Note
The important variables to understand here are:
- `group_wait`: How long Alertmanager will initially wait to send a notification to PagerDuty for a group of alerts
- `group_interval`: How long Alertmanager will wait to send a notification to PagerDuty for new alerts in a group for which an initial notification has already been sent
- `repeat_interval`: How long Alertmanager will wait before sending a notification to PagerDuty again if it has already sent a successful notification
- `match`: These labels are used to group fired alerts together, and are how we manage separate incidents per hub per cluster in PagerDuty
Increasing the size of the volume used by the NFS server#
If the volume used by the NFS server is close to being full, we may need to increase the size of the volume. This can be done by following the instructions in the Increase the size of an AWS EBS volume guide.
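For AWS, that guide boils down to growing the corresponding entry in the cluster's tfvars file and re-applying Terraform. A minimal sketch, assuming the `staging` volume from the earlier example (the new size is illustrative):

```terraform
ebs_volumes = {
  "staging" = {
    size        = 200 # increased from 100 GB; EBS volumes can only be grown, never shrunk
    type        = "gp3"
    name_suffix = "staging"
    tags        = { "2i2c:hub-name": "staging" }
  }
}
```

Apply the change with the same `terraform plan` / `terraform apply` commands shown earlier.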
Troubleshooting#
Checking the NFS server is running properly#
To check whether the NFS server is running properly, we can run the following command in the NFS server pod in the nfs-server container:
```bash
showmount -e 0.0.0.0
```
If the NFS server is running properly, we should see the path to the NFS server’s export directory. For example:
```
Export list for 0.0.0.0:
/export *
```
If we don’t see the path to the export directory, the NFS server is not running properly and we need to check the logs for the NFS server pod.
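If you are working from your local machine rather than from a shell inside the pod, a rough sequence looks like this (the pod name is a placeholder; use whatever `kubectl get pods` reports for the NFS server):

```bash
# Find the NFS server pod in the hub's namespace
kubectl -n $HUB_NAME get pods

# Run showmount inside the nfs-server container (pod name is an example)
kubectl -n $HUB_NAME exec -it <nfs-server-pod> -c nfs-server -- showmount -e 0.0.0.0

# Inspect the nfs-server container logs if the export is missing
kubectl -n $HUB_NAME logs <nfs-server-pod> -c nfs-server
```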
Debugging quota enforcement#
We can check the current usage and quotas for each user by running the following command in the NFS server pod in the enforce-xfs-quota
container:
```bash
xfs_quota -x -c 'report -N -p'
```
This should show us the list of directories being monitored for quota enforcement, and the current usage and quotas for each of the home directories.
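As above, this can also be run from outside the pod with `kubectl exec` (the pod name is a placeholder):

```bash
# Run the quota report inside the enforce-xfs-quota container
kubectl -n $HUB_NAME exec -it <nfs-server-pod> -c enforce-xfs-quota -- xfs_quota -x -c 'report -N -p'
```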
If a user exceeds their quota, the user’s pod will not be able to start successfully. A hub admin will need to free up space in the user’s home directory to allow the pod to start, using the allusers
feature.