New Kubernetes cluster on GCP, Azure or AWS#

This guide will walk through the process of adding a new cluster to our terraform configuration.

You can find out more about terraform in Terraform and their documentation.

Attention

Currently, we do not deploy clusters to AWS solely using terraform. We use eksctl to provision our k8s clusters on AWS and terraform to provision supporting infrastructure, such as storage buckets.

Cluster Design#

This guide will assume you have already followed the guidance in Cluster design considerations to select the appropriate infrastructure.

Prerequisites#

  1. Install kubectl, helm, sops, etc.

    In Setting up your local environment to work on this repo you will find instructions on how to set up sops to encrypt and decrypt files.

  2. Install aws

    Verify install and version with aws --version. You should have at least version 2.

  3. Install or upgrade eksctl

    Mac users with homebrew can run brew install eksctl.

    Verify install and version with eksctl version. You should have the latest version of this CLI.

    Important

    Without the latest version, you may install an outdated version of aws-node because it's hardcoded.

  4. Install jsonnet

    Mac users with homebrew can run brew install jsonnet.

    Verify install and version with jsonnet --version.
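A quick way to confirm the prerequisites above is to check that each tool is on your PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and not part of this repo, and it only checks presence, not versions.

```shell
# Print the name of every tool in the argument list that is not on PATH.
# (Hypothetical helper, for illustration only.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example: check the tools named in the prerequisites.
# missing_tools kubectl helm sops aws eksctl jsonnet
```

An empty output means every listed tool was found. You still need to verify versions with the commands listed above.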

Create a new cluster#

Setup credentials#

Depending on whether this project is using AWS SSO or not, you can use the following links to figure out how to authenticate to this project from your terminal.

Generate cluster files#

We automatically generate the files required to set up a new cluster.

On AWS:

  • A .jsonnet file for use with eksctl

  • A sops encrypted ssh private key that can be used to ssh into the kubernetes nodes.

  • An ssh public key used by eksctl to grant access to the private key.

  • A .tfvars terraform variables file that will set up most of the non EKS infrastructure.

  • The cluster config directory in ./config/clusters/<new-cluster>

  • The support values file support.values.yaml

  • The support credentials encrypted file enc-support.values.yaml

On GCP:

  • A .tfvars file for use with terraform

  • The cluster config directory in ./config/clusters/<new-cluster>

  • A sample cluster.yaml config file

  • The support values file support.values.yaml

  • The support credentials encrypted file enc-support.values.yaml

Warning

On Azure, an automated deployer command doesn’t exist yet, so these files need to be generated manually!

On AWS, you can generate these with:

export CLUSTER_NAME=<cluster-name>
export CLUSTER_REGION=<cluster-region-like ca-central-1>
deployer generate dedicated-cluster aws --cluster-name=$CLUSTER_NAME --cluster-region=$CLUSTER_REGION

After running this command, you will be asked to provide the type of hub that will be deployed in the cluster, i.e. basehub or daskhub.

  • If you already know that there will be daskhubs running in this cluster, then type in daskhub and hit ENTER.

    This will generate a specific node pool for dask workers to run on, in the appropriate .jsonnet file that will be used with eksctl.

  • Otherwise, just hit ENTER and it will default to a basehub infrastructure that you can later amend if daskhubs will be needed by following the guide on how to add support for daskhubs in an existing cluster.

This will generate the following files:

  1. eksctl/$CLUSTER_NAME.jsonnet with a default cluster configuration, deployed to us-west-2

  2. eksctl/ssh-keys/secret/$CLUSTER_NAME.key, a sops encrypted ssh private key that can be used to ssh into the kubernetes nodes.

  3. eksctl/ssh-keys/$CLUSTER_NAME.pub, an ssh public key used by eksctl to grant access to the private key.

  4. terraform/aws/projects/$CLUSTER_NAME.tfvars, a terraform variables file that will setup most of the non EKS infrastructure.
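Before moving on, it can help to confirm that the four files listed above actually exist. This is a minimal sketch, assuming you run it from the repository root; the helper name check_generated_aws_files is hypothetical.

```shell
# Report any of the generated AWS files that are missing.
# Paths are relative to the repository root (an assumption of this sketch).
check_generated_aws_files() {
    cluster_name=$1
    status=0
    for path in \
        "eksctl/$cluster_name.jsonnet" \
        "eksctl/ssh-keys/secret/$cluster_name.key" \
        "eksctl/ssh-keys/$cluster_name.pub" \
        "terraform/aws/projects/$cluster_name.tfvars"
    do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example:
# check_generated_aws_files "$CLUSTER_NAME"
```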

Create and render an eksctl config file

We use an eksctl config file in YAML to specify how our cluster should be built. Since it can get repetitive, we use jsonnet to declaratively specify this config. You can find the .jsonnet files for the current clusters in the eksctl/ directory.

The previous step should’ve created a baseline .jsonnet file you can modify as you like. The eksctl docs have a reference for all the possible options. You’d want to make sure to change at least the following:

  • Region / Zone - make sure you are creating your cluster in the correct region and verify the suggested zones 1a, 1b, and 1c actually are available in that region.

    # a command to list availability zones, for example
    # ca-central-1 doesn't have 1c, but 1d instead
    aws ec2 describe-availability-zones --region=$CLUSTER_REGION
    
  • Size of nodes in instancegroups, for both notebook nodes and dask nodes. In particular, make sure you have enough quota to launch these instances in your selected regions.

  • Kubernetes version - older .jsonnet files might be on older versions, but you should pick a newer version when you create a new cluster.
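The zone check described in the list above can be scripted. The sketch below compares the zones you expect against the zones the region reports; zones_missing is a hypothetical helper, and the commented usage assumes the aws CLI is installed and authenticated.

```shell
# Print each expected zone that is absent from the available zones.
# $1: space-separated available zone names; remaining args: expected zones.
zones_missing() {
    available=" $1 "
    shift
    for zone in "$@"; do
        case "$available" in
            *" $zone "*) ;;
            *) printf '%s\n' "$zone" ;;
        esac
    done
}

# Example usage (needs the aws CLI and credentials):
# available=$(aws ec2 describe-availability-zones --region "$CLUSTER_REGION" \
#     --query 'AvailabilityZones[].ZoneName' --output text | tr '\t' ' ')
# zones_missing "$available" "${CLUSTER_REGION}a" "${CLUSTER_REGION}b" "${CLUSTER_REGION}c"
```

Any zone printed needs to be replaced in the .jsonnet file with one that the region actually offers.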

Once you have a .jsonnet file, you can render it into a config file that eksctl can read.

Tip

Make sure to run this command inside the eksctl directory.

jsonnet $CLUSTER_NAME.jsonnet > $CLUSTER_NAME.eksctl.yaml

Tip

The *.eksctl.yaml files are git ignored because we can regenerate them, so work against the *.jsonnet file and regenerate the YAML file whenever an eksctl command needs it.

Create the cluster

Now you’re ready to create the cluster!

Tip

Make sure to run this command inside the eksctl directory, otherwise it cannot discover the ssh-keys subfolder.

eksctl create cluster --config-file=$CLUSTER_NAME.eksctl.yaml

This might take a few minutes.

If any errors are reported in the config (there is a schema validation step), fix it in the .jsonnet file, re-render the config, and try again.
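Because both the render and create commands must run inside the eksctl directory, a small guard can catch the mistake early. This is a sketch only; in_eksctl_dir and render_and_create are hypothetical helpers that wrap the two commands shown above.

```shell
# Succeed only when the current directory is named "eksctl" and
# contains the ssh-keys subfolder that eksctl needs to discover.
in_eksctl_dir() {
    [ "$(basename "$PWD")" = "eksctl" ] && [ -d ssh-keys ]
}

# Render the jsonnet file, then create the cluster from the rendered YAML.
render_and_create() {
    cluster_name=$1
    in_eksctl_dir || { echo "run this from the eksctl directory" >&2; return 1; }
    jsonnet "$cluster_name.jsonnet" > "$cluster_name.eksctl.yaml" || return 1
    eksctl create cluster --config-file="$cluster_name.eksctl.yaml"
}

# Example:
# render_and_create "$CLUSTER_NAME"
```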

Once it is done, you can test access to the new cluster with kubectl, after getting credentials via:

aws eks update-kubeconfig --name=$CLUSTER_NAME --region=$CLUSTER_REGION

kubectl should be able to find your cluster now! kubectl get node should show you at least one core node running.

On GCP, generate the files with:

export CLUSTER_NAME=<cluster-name>
export CLUSTER_REGION=<cluster-region-like ca-central-1>
export PROJECT_ID=<gcp-project-id>
deployer generate dedicated-cluster gcp --cluster-name=$CLUSTER_NAME --project-id=$PROJECT_ID --cluster-region=$CLUSTER_REGION

After running this command, you will be asked to provide the type of hub that will be deployed in the cluster, i.e. basehub or daskhub.

  • If you already know that there will be daskhubs running in this cluster, then type in daskhub and hit ENTER.

    This will generate a specific node pool for dask workers to run on, in the appropriate .jsonnet file that will be used with eksctl.

  • Otherwise, just hit ENTER and it will default to a basehub infrastructure that you can later amend if daskhubs will be needed by following the guide on how to add support for daskhubs in an existing cluster.

This will generate the following files:

Generating the terraform infrastructure file…

  1. terraform/gcp/projects/$CLUSTER_NAME.tfvars

  2. config/clusters/$CLUSTER_NAME

  3. config/clusters/$CLUSTER_NAME/cluster.yaml

  4. config/clusters/$CLUSTER_NAME/support.values.yaml

  5. config/clusters/$CLUSTER_NAME/enc-support.values.yaml

On Azure, an automated deployer command doesn’t exist yet, so the .tfvars file needs to be created manually. The minimum inputs this file requires are:

  • subscription_id: Azure subscription ID to create resources in. Should be the id, rather than display name of the project.

  • resourcegroup_name: The name of the Resource Group to be created by terraform, where the cluster and other resources will be deployed into.

  • global_container_registry_name: The name of an Azure Container Registry to be created by terraform to use for our image. This must be unique across all of Azure. You can use the following Azure CLI command to check your desired name is available:

    az acr check-name --name ACR_NAME --output table
    
  • global_storage_account_name: The name of a storage account to be created by terraform to use for Azure File Storage. This must be unique across all of Azure. You can use the following Azure CLI command to check your desired name is available:

    az storage account check-name --name STORAGE_ACCOUNT_NAME --output table
    
  • ssh_pub_key: The public half of an SSH key that will be authorised to login to nodes.

See the variables file for other inputs this file can take and their descriptions.

Naming Convention Guidelines for Container Registries and Storage Accounts

Names for Azure container registries and storage accounts must conform to the following guidelines:

  • alphanumeric strings between 5 and 50 characters for container registries, e.g., myContainerRegistry007

  • strings of lowercase letters and numbers between 3 and 24 characters for storage accounts, e.g., mystorageaccount314

Note

A failure will occur if you try to create a storage account whose name is not entirely lowercase.

We recommend the following conventions using lowercase:

  • {CLUSTER_NAME}hubregistry for container registries

  • {CLUSTER_NAME}hubstorage for storage accounts

Note

Changes in Azure’s own requirements might break our recommended convention. If any such failure occurs, please report it.

This increases the probability that we won’t take up a namespace that may be required by the Hub Community, for example, in cases where we are deploying to Azure subscriptions not owned/managed by 2i2c.
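The naming rules above can be checked locally before calling the Azure CLI. The sketch below encodes them as regular expressions; the function names are hypothetical, and the 3-character minimum for storage accounts is taken from Azure's documented limits. The az check-name commands remain the authority on whether a name is actually available.

```shell
# Container registry names: 5-50 alphanumeric characters.
valid_acr_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Za-z0-9]{5,50}$'
}

# Storage account names: 3-24 lowercase letters and digits.
valid_storage_account_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]{3,24}$'
}

# Example, using the recommended conventions:
# valid_acr_name "${CLUSTER_NAME}hubregistry" || echo "bad registry name"
# valid_storage_account_name "${CLUSTER_NAME}hubstorage" || echo "bad storage name"
```

Note that a cluster name containing hyphens will fail both checks, so it has to be adapted before applying the convention.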

Example .tfvars file:

subscription_id                = "my-awesome-subscription-id"
resourcegroup_name             = "my-awesome-resource-group"
global_container_registry_name = "myawesomehubregistry"
global_storage_account_name    = "myawesomestorageaccount"
ssh_pub_key                    = "ssh-rsa my-public-ssh-key"

Add GPU nodegroup if needed#

If this cluster is going to have GPUs, you should edit the generated jsonnet file to include a GPU nodegroup.

Initialising Terraform#

Our default terraform state is located centrally in our two-eye-two-see-org GCP project, therefore you must authenticate gcloud to your @2i2c.org account before initialising terraform. The terraform state includes all cloud providers, not just GCP.

gcloud auth application-default login

Then you can change into the terraform subdirectory for the appropriate cloud provider and initialise terraform.

Our AWS terraform code is now used to deploy supporting infrastructure for the EKS cluster, including:

  • An IAM identity account for use with our CI/CD system

  • Appropriately networked EFS storage to serve as an NFS server for hub home directories

  • Optionally, a shared database

  • Optionally, user buckets

The steps above will have created a default .tfvars file. This file can either be used as-is or edited to enable the optional features listed above.

Initialise terraform for the cloud provider you are deploying to:

# AWS
cd terraform/aws
terraform init

# GCP
cd terraform/gcp
terraform init -backend-config=backends/default-backend.hcl -reconfigure

# Azure
cd terraform/azure
terraform init

Note

There are other backend config files stored in terraform/backends that will configure a different storage bucket to read/write the remote terraform state for projects which we cannot access from GCP with our @2i2c.org email accounts. This saves us the pain of having to handle multiple authentications as these storage buckets are within the project we are trying to deploy to.

For example, to work with Pangeo you would initialise terraform like so:

terraform init -backend-config=pangeo-backend.hcl -reconfigure

Creating a new terraform workspace#

We use terraform workspaces so that the state of one .tfvars file does not influence another. Create a new workspace with the below command, and again give it the same name as the .tfvars filename, $CLUSTER_NAME.

terraform workspace new $CLUSTER_NAME

Note

Workspaces are defined per backend. If you can’t find the workspace you’re looking for, double check you’ve enabled the correct backend.
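If you are unsure whether the workspace already exists on the current backend, you can select it and fall back to creating it. This is a minimal sketch; ensure_workspace is a hypothetical helper built only from the terraform workspace select and terraform workspace new subcommands.

```shell
# Select the workspace named after the cluster, creating it if it is absent.
ensure_workspace() {
    terraform workspace select "$1" 2>/dev/null || terraform workspace new "$1"
}

# Example:
# ensure_workspace "$CLUSTER_NAME"
```

Because the fallback creates a workspace, double check the backend first; otherwise you may create a new, empty workspace on the wrong backend.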

Plan and Apply Changes#

Important

When deploying to Google Cloud, make sure the Compute Engine, Kubernetes Engine, Artifact Registry, and Cloud Logging APIs are enabled on the project before deploying!

First, make sure you are in the new workspace that you just created.

terraform workspace show

Plan your changes with the terraform plan command, passing the .tfvars file as a variable file.

terraform plan -var-file=projects/$CLUSTER_NAME.tfvars

Check over the output of this command to ensure nothing is being created/deleted that you didn’t expect. Copy-paste the plan into your open Pull Request so a fellow 2i2c engineer can double check it too.

If you’re both satisfied with the plan, merge the Pull Request and apply the changes to deploy the cluster.

terraform apply -var-file=projects/$CLUSTER_NAME.tfvars

Congratulations, you’ve just deployed a new cluster!
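One way to make sure the plan your reviewer saw is exactly what gets applied is to save the plan to a file and apply that file. The sketch below uses terraform's -out option and terraform show; the helper names and the file names demo.tfplan / demo.plan.txt style are assumptions for illustration, not repo conventions.

```shell
# Save the plan to a file and write a readable copy for the Pull Request.
plan_and_save() {
    cluster_name=$1
    terraform plan -var-file="projects/$cluster_name.tfvars" \
        -out="$cluster_name.tfplan" || return 1
    terraform show -no-color "$cluster_name.tfplan" > "$cluster_name.plan.txt"
}

# Apply exactly the plan that was saved and reviewed.
apply_saved_plan() {
    terraform apply "$1.tfplan"
}

# Example:
# plan_and_save "$CLUSTER_NAME"     # paste $CLUSTER_NAME.plan.txt into the PR
# apply_saved_plan "$CLUSTER_NAME"  # after the PR is approved and merged
```

Saved plan files can contain sensitive values, so do not commit them to the repo.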

Exporting and Encrypting the Cluster Access Credentials#

In the previous step, we will have created an IAM user with just enough permissions for automatic deployment of hubs from CI/CD. Since these credentials are checked in to our git repository and made public, they should have the least amount of permissions possible.

To begin deploying and operating hubs on your new cluster, we need to export these credentials, encrypt them using sops, and store them in the secrets directory of the infrastructure repo.

  1. First, make sure you are in the right terraform directory:

    cd terraform/aws    # AWS
    
    cd terraform/gcp    # GCP
    
    cd terraform/azure  # Azure
    
  2. Check you are still in the correct terraform workspace

    terraform workspace show
    

    If you need to change, you can do so as follows

    terraform workspace list  # List all available workspaces
    terraform workspace select WORKSPACE_NAME
    
  3. Fetch credentials for automatic deployment

    Create the directory if it doesn’t exist already:

    mkdir -p ../../config/clusters/$CLUSTER_NAME
    
    Then export the credentials with the command for your cloud provider:

    # AWS
    terraform output -raw continuous_deployer_creds > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json
    
    # GCP
    terraform output -raw ci_deployer_key > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json
    
    # Azure
    terraform output -raw kubeconfig > ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.yaml
    
  4. Then encrypt the key using sops.

    Note

    You must be logged into Google with your @2i2c.org account at this point so sops can read the encryption key from the two-eye-two-see project.

    sops --output ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json --encrypt ../../config/clusters/$CLUSTER_NAME/deployer-credentials.secret.json
    

    This key can now be committed to the infrastructure repo and used to deploy and manage hubs hosted on that cluster.

  5. Double check to make sure that the config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json file is actually encrypted by sops before checking it in to the git repo. Otherwise this can be a serious security leak!

    cat ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json
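Reading the file by eye works, but the check can also be automated. The sketch below looks for the two markers sops leaves in an encrypted file: a top-level sops metadata key and at least one ENC[...] value. The helper name looks_sops_encrypted is hypothetical, and this is a heuristic, not a cryptographic verification.

```shell
# Succeed only if the file has a "sops" metadata key (JSON or YAML style)
# and at least one ENC[...] encrypted value.
looks_sops_encrypted() {
    grep -Eq '^[[:space:]]*"?sops"?[[:space:]]*:' "$1" && grep -q 'ENC\[' "$1"
}

# Example:
# looks_sops_encrypted ../../config/clusters/$CLUSTER_NAME/enc-deployer-credentials.secret.json \
#     || echo "NOT encrypted, do not commit!"
```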
    

Create a cluster.yaml file#

See also

We use cluster.yaml files to describe a specific cluster and all the hubs deployed onto it. See Configuration structure for more information.

Create a cluster.yaml file under the config/clusters/$CLUSTER_NAME folder and populate it with the following info:

name: <your-cluster-name>
provider: aws # <copy paste link to sign in url here>
aws:
  key: enc-deployer-credentials.secret.json
  clusterType: eks
  clusterName: $CLUSTER_NAME
  region: $CLUSTER_REGION
  billing:
    # For an AWS account explicitly configured to have the cloud bill
    # paid directly by the community and not through 2i2c, declare
    # paid_by_us to false
    paid_by_us: true
support:
  helm_chart_values_files:
    - support.values.yaml
    - enc-support.secret.values.yaml
hubs: []

Note

The aws.key file is defined relative to the location of the cluster.yaml file.
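If you prefer not to fill in the AWS template by hand, the cluster name and region can be substituted with a here-document. This is a sketch under the assumption that the template above is used unchanged; write_aws_cluster_yaml is a hypothetical helper, and the billing comment and sign-in URL comment are left for you to add.

```shell
# Print an AWS cluster.yaml with the cluster name and region filled in.
write_aws_cluster_yaml() {
    cluster_name=$1
    cluster_region=$2
    cat <<EOF
name: $cluster_name
provider: aws
aws:
  key: enc-deployer-credentials.secret.json
  clusterType: eks
  clusterName: $cluster_name
  region: $cluster_region
  billing:
    paid_by_us: true
support:
  helm_chart_values_files:
    - support.values.yaml
    - enc-support.secret.values.yaml
hubs: []
EOF
}

# Example, run from the repository root:
# write_aws_cluster_yaml "$CLUSTER_NAME" "$CLUSTER_REGION" \
#     > "config/clusters/$CLUSTER_NAME/cluster.yaml"
```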

On GCP, a cluster.yaml file should already have been generated as part of Generate cluster files.

Billing information

For projects where we are paying the cloud bill and then passing costs through, you need to fill in information under gcp.billing.bigquery and set gcp.billing.paid_by_us to true. Partnerships should be able to tell you if we are doing cloud costs pass through or not. You can find the required values as follows:

  1. Go to the Billing tab on the Google Cloud Console

  2. Make sure the correct project is selected in the top bar. You might have to select the ‘All’ tab in the project chooser if you do not see the project right away.

  3. Click ‘Go to billing account’

  4. In the default view (Overview) that opens, you can find the value for billing_id in the right sidebar, under “Billing Account”. It should be of the form XXXXXX-XXXXXX-XXXXXX.

  5. Select “Billing export” on the left navigation bar, and you will find the values for project and dataset under “Detailed cost usage”.

  6. If “Detailed cost usage” is not set up, you should enable it
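A copied billing_id is easy to truncate, so a format check before committing can help. The sketch below matches the XXXXXX-XXXXXX-XXXXXX shape from step 4; it assumes each X is an uppercase letter or digit, which is an assumption of this example, and valid_billing_id is a hypothetical helper.

```shell
# Succeed if the value has the form XXXXXX-XXXXXX-XXXXXX.
# Assumes each X is an uppercase letter or a digit.
valid_billing_id() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[0-9A-Z]{6}-[0-9A-Z]{6}-[0-9A-Z]{6}$'
}

# Example:
# valid_billing_id "$BILLING_ID" || echo "billing_id looks malformed"
```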

Warning

We use this config only when we do not have permissions on the Azure subscription to create a Service Principal with terraform.

name: <cluster-name>  # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: kubeconfig
kubeconfig:
  # The location of the *encrypted* key we exported from terraform
  file: enc-deployer-credentials.secret.yaml

When terraform can create a Service Principal, use the azure provider instead:

name: <cluster-name>  # This should also match the name of the folder: config/clusters/$CLUSTER_NAME
provider: azure
azure:
  # The location of the *encrypted* key we exported from terraform
  key: enc-deployer-credentials.secret.json
  # The name of the cluster *as it appears in the Azure Portal*! Sometimes our
  # terraform code adjusts the contents of the 'name' field, so double check this.
  cluster: <cluster-name>
  # The name of the resource group the cluster has been deployed into. This is
  # the same as the resourcegroup_name variable in the .tfvars file.
  resource_group: <resource-group-name>

Commit this file to the repo.

Access#

Grant additional access

First, we need to grant the freshly created deployer IAM user access to the kubernetes cluster.

  1. As this requires passing in some parameters that match the created cluster, we have a terraform output that can give you the exact command to run.

    terraform output -raw eksctl_iam_command

  2. Run the eksctl create iamidentitymapping command returned by terraform output. That should give the continuous deployer user access.

The command should look like this:

eksctl create iamidentitymapping \
    --cluster $CLUSTER_NAME \
    --region $CLUSTER_REGION \
    --arn arn:aws:iam::<aws-account-id>:user/hub-continuous-deployer \
    --username hub-continuous-deployer \
    --group system:masters

Test the access by running:

deployer use-cluster-credentials $CLUSTER_NAME

and running:

kubectl get node

It should show you the provisioned node on the cluster if everything works out ok.

Grant eksctl access to other users

Note

This section is still required even if the account is managed by SSO. Though a user could run deployer use-cluster-credentials $CLUSTER_NAME to gain access as well.

AWS EKS has a strange access control problem, where the IAM user who creates the cluster has full access without any visible settings changes, and nobody else does. You need to explicitly grant access to other users. Find the usernames of the 2i2c engineers on this particular AWS account, and run the following command to give them access:

Note

You can modify the command output by running terraform output -raw eksctl_iam_command as described in Exporting and Encrypting the Cluster Access Credentials.

eksctl create iamidentitymapping \
   --cluster $CLUSTER_NAME \
   --region $CLUSTER_REGION \
   --arn arn:aws:iam::<aws-account-id>:user/<iam-user-name> \
   --username <iam-user-name> \
   --group system:masters

This gives all the users full access to the entire kubernetes cluster. After this step is done, they can fetch local config with:

aws eks update-kubeconfig --name=$CLUSTER_NAME --region=$CLUSTER_REGION

This should eventually be converted to use an IAM Role instead, so we need not give each individual user access, but just grant access to the role - and users can modify them as they wish.
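Until access moves to an IAM Role, granting access to several engineers means repeating the command above once per user. A loop can do that; this is a sketch, with grant_cluster_admin as a hypothetical helper that assumes CLUSTER_NAME and CLUSTER_REGION are already exported.

```shell
# Run the iamidentitymapping command once per IAM user.
# $1: AWS account id; remaining args: IAM user names.
grant_cluster_admin() {
    account_id=$1
    shift
    for user in "$@"; do
        eksctl create iamidentitymapping \
            --cluster "$CLUSTER_NAME" \
            --region "$CLUSTER_REGION" \
            --arn "arn:aws:iam::$account_id:user/$user" \
            --username "$user" \
            --group system:masters || return 1
    done
}

# Example:
# grant_cluster_admin <aws-account-id> <iam-user-name-1> <iam-user-name-2>
```

The loop stops at the first failure so that a typo in one username does not go unnoticed.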

Test deployer access by running:

deployer use-cluster-credentials $CLUSTER_NAME

and running:

kubectl get node

It should show you the provisioned node on the cluster if everything works out ok.
