CSI Topology – Configuration How-To

In this post, we will look at another feature of the vSphere CSI driver that enables the placement of Kubernetes objects on different vSphere environments using a combination of vSphere Tags and a feature of the CSI driver called topology or failure domains. To achieve this, some additional entries must be added to the vSphere CSI driver configuration file. The CSI driver discovers the topology of each Kubernetes node/virtual machine and, through the kubelet, adds it as labels on the nodes. This is what my test environment looks like:

CSI Topology can be used to provide another level of availability to your Kubernetes cluster and applications. The objective is to deploy Kubernetes Pods with their own Persistent Volumes (PVs) to the same zone. Thus, if a StatefulSet is deployed across multiple zones using CSI Topology and one of the zones fails, the overall application is not impacted. We use vSphere Tags to define Regions and Zones, which are topology constructs used by Kubernetes. In my example, I associate a Region Tag (k8s-region) at the Datacenter level and a unique Zone Tag (k8s-zone) with each of the Clusters in the Datacenter.
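If you use the govc CLI, the tag categories and tags can be created and attached from the command line. This is a sketch only: it assumes govc is installed and the GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD environment variables already point at your vCenter, and the inventory paths (OCTO-Datacenter, Cluster-A) are illustrative names matching my environment.

```shell
# Create the region and zone tag categories (names must match the
# csi-vsphere.conf Labels section shown later in this post)
govc tags.category.create -d "Kubernetes region" k8s-region
govc tags.category.create -d "Kubernetes zone" k8s-zone

# Create one region tag and a zone tag per cluster
govc tags.create -c k8s-region region-1
govc tags.create -c k8s-zone zone-a

# Attach the region tag to the Datacenter and a zone tag to a Cluster
# (inventory paths below are from my lab - adjust for your environment)
govc tags.attach -c k8s-region region-1 /OCTO-Datacenter
govc tags.attach -c k8s-zone zone-a /OCTO-Datacenter/host/Cluster-A
```

Repeat the zone tag creation and attachment (zone-b, zone-c, etc.) for each remaining cluster.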

Note that there is no dependency on the Cloud Provider/CPI. So long as all of the nodes have a ProviderID after the Cloud Provider has initialized, you can proceed with the CSI driver deployment and follow the steps below. Here is how to check the ProviderID after the CPI has been deployed:

$ kubectl get nodes
NAME                  STATUS   ROLES                  AGE     VERSION
k8s-controlplane-01   Ready    control-plane,master   4h35m   v1.20.5
k8s-worker-01         Ready    <none>                 4h33m   v1.20.5
k8s-worker-02         Ready    <none>                 4h32m   v1.20.5
k8s-worker-03         Ready    <none>                 4h31m   v1.20.5

$ kubectl describe nodes | grep ProviderID
ProviderID:                   vsphere://42247114-0280-f5a1-0e8b-d5118f0ff8fd
ProviderID:                   vsphere://42244b45-e7ec-18a3-8cf1-493e4c9d3780
ProviderID:                   vsphere://4224fde2-7fae-35f7-7083-7bf6eaafd3bb
ProviderID:                   vsphere://4224e557-2b4f-61d3-084f-30d524f97238

Let’s look at the setup steps involved in configuring CSI Topology.

Configure the csi-vsphere.conf CSI configuration file

The csi-vsphere.conf file, located on the Kubernetes control plane master in /etc/kubernetes, now has two additional entries. These are the labels for region and zone, k8s-region and k8s-zone. These are the same tags that were used on the vSphere inventory objects.

[Global]
cluster-id = "cormac-upstream"
cluster-distribution = "native"

[VirtualCenter "AA.BB.CC.DD"]
user = "administrator@vsphere.local"
password = "******"
port = "443"
insecure-flag = "1"
datacenters = "OCTO-Datacenter"

[Labels]
region = k8s-region
zone = k8s-zone

Create the CSI secret and deploy the manifests

We can now proceed to the next step, which is deployment of the CSI driver. This is done in 3 steps: (a) create a secret from the csi-vsphere.conf file, (b) modify the CSI manifests to support topology, and finally (c) deploy the updated manifests. The recommendation is to remove the csi-vsphere.conf file from the control plane node once the secret has been created.

Create the secret

$ kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf \
--namespace=kube-system
secret/vsphere-config-secret created
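If you prefer a declarative approach, the same secret can be expressed as a manifest. This is a sketch assuming the csi-vsphere.conf contents shown earlier; stringData lets you supply the file as plain text and Kubernetes encodes it for you.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: vsphere-config-secret
  namespace: kube-system
stringData:
  csi-vsphere.conf: |
    [Global]
    cluster-id = "cormac-upstream"
    cluster-distribution = "native"

    [VirtualCenter "AA.BB.CC.DD"]
    user = "administrator@vsphere.local"
    password = "******"
    port = "443"
    insecure-flag = "1"
    datacenters = "OCTO-Datacenter"

    [Labels]
    region = k8s-region
    zone = k8s-zone
```

Either method produces an identical secret; the key name (csi-vsphere.conf) must match the filename expected by the driver.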

Modify vsphere-csi-controller-deployment.yaml

I am using version 2.2 of the CSI driver manifests in this example. Here are the changes that need to be made. In the CSI controller deployment manifest, you will need to uncomment lines 160 and 161 to enable the topology feature gate, as shown below:

<             #- "--feature-gates=Topology=true"
<             #- "--strict-topology"
---
>             - "--feature-gates=Topology=true"
>             - "--strict-topology"

Modify vsphere-csi-node-ds.yaml

In the CSI node daemonset manifest, you will need to uncomment 3 sections to enable topology.

First, uncomment lines 67 and 68:
<         #- name: VSPHERE_CSI_CONFIG
<         #  value: "/etc/cloud/csi-vsphere.conf" # here csi-vsphere.conf is the name of the file used for creating secret using "--from-file" flag
---
>         - name: VSPHERE_CSI_CONFIG
>           value: "/etc/cloud/csi-vsphere.conf" # here csi-vsphere.conf is the name of the file used for creating secret using "--from-file" flag
Then, uncomment lines 84, 85 and 86:
<         #- name: vsphere-config-volume
<         #  mountPath: /etc/cloud
<         #  readOnly: true
---
>         - name: vsphere-config-volume
>           mountPath: /etc/cloud
>           readOnly: true
Finally, uncomment lines 121, 122 and 123:
<       #- name: vsphere-config-volume
<       #  secret:
<       #    secretName: vsphere-config-secret
---
>       - name: vsphere-config-volume
>         secret:
>           secretName: vsphere-config-secret

Deploy the manifests

Once the above changes are made, deploy the manifests, both for the controller deployment and node daemonset, as well as the various RBAC objects. In CSI 2.2, there are 4 manifests in total that should be deployed.

$ kubectl apply -f rbac/vsphere-csi-controller-rbac.yaml \
-f rbac/vsphere-csi-node-rbac.yaml \
-f deploy/vsphere-csi-controller-deployment.yaml \
-f deploy/vsphere-csi-node-ds.yaml

serviceaccount/vsphere-csi-controller created
clusterrole.rbac.authorization.k8s.io/vsphere-csi-controller-role created
clusterrolebinding.rbac.authorization.k8s.io/vsphere-csi-controller-binding created
serviceaccount/vsphere-csi-node created
role.rbac.authorization.k8s.io/vsphere-csi-node-role created
rolebinding.rbac.authorization.k8s.io/vsphere-csi-node-binding created
deployment.apps/vsphere-csi-controller created
configmap/internal-feature-states.csi.vsphere.vmware.com created
csidriver.storage.k8s.io/csi.vsphere.vmware.com created
service/vsphere-csi-controller created
daemonset.apps/vsphere-csi-node created

Monitor the CSI driver deployment

We can now monitor the CSI driver deployment. If all is working as expected, the CSI controller and CSI node Pods should enter a running state. There will be a node Pod for each K8s node in the cluster.

$ kubectl get pods -A
NAMESPACE     NAME                                          READY   STATUS    RESTARTS   AGE
kube-system   coredns-74ff55c5b-c67jw                       1/1     Running   1          48m
kube-system   coredns-74ff55c5b-vjwpf                       1/1     Running   1          48m
kube-system   etcd-k8s-controlplane-01                      1/1     Running   1          48m
kube-system   kube-apiserver-k8s-controlplane-01            1/1     Running   1          48m
kube-system   kube-controller-manager-k8s-controlplane-01   1/1     Running   1          48m
kube-system   kube-flannel-ds-2vjbr                         1/1     Running   0          46s
kube-system   kube-flannel-ds-58wzn                         1/1     Running   0          46s
kube-system   kube-flannel-ds-gqpdt                         1/1     Running   0          46s
kube-system   kube-flannel-ds-v6n54                         1/1     Running   0          46s
kube-system   kube-proxy-569cm                              1/1     Running   1          47m
kube-system   kube-proxy-9zrbm                              1/1     Running   1          48m
kube-system   kube-proxy-cpzpn                              1/1     Running   1          46m
kube-system   kube-proxy-lr8cx                              1/1     Running   1          45m
kube-system   kube-scheduler-k8s-controlplane-01            1/1     Running   1          48m
kube-system   vsphere-csi-controller-6ddf68d4f-kcwh4        6/6     Running   0          9s
kube-system   vsphere-csi-node-9srwh                        3/3     Running   0          9s
kube-system   vsphere-csi-node-dnhc9                        3/3     Running   0          9s
kube-system   vsphere-csi-node-p7sl7                        3/3     Running   0          9s
kube-system   vsphere-csi-node-s8tkh                        3/3     Running   0          9s

Assuming the CSI controller and CSI node Pods deploy successfully, we can now check the labeling on the worker nodes. I’ve adjusted the LABELS output from the control plane node so that they are a little easier to see, but all nodes should now have these “topology” labels.

$ kubectl get nodes --show-labels

NAME                  STATUS   ROLES                  AGE   VERSION   LABELS

k8s-controlplane-01   Ready    control-plane,master   84m   v1.20.5   beta.kubernetes.io/arch=amd64,
                                                                      beta.kubernetes.io/os=linux,
                                                                      failure-domain.beta.kubernetes.io/region=region-1,
                                                                      failure-domain.beta.kubernetes.io/zone=zone-a,
                                                                      kubernetes.io/arch=amd64,
                                                                      kubernetes.io/hostname=k8s-controlplane-01,
                                                                      kubernetes.io/os=linux,
                                                                      node-role.kubernetes.io/control-plane=,
                                                                      node-role.kubernetes.io/master=
k8s-worker-01         Ready    <none>                 83m   v1.20.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=region-1,failure-domain.beta.kubernetes.io/zone=zone-a,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-worker-01,kubernetes.io/os=linux
k8s-worker-02         Ready    <none>                 82m   v1.20.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=region-1,failure-domain.beta.kubernetes.io/zone=zone-b,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-worker-02,kubernetes.io/os=linux
k8s-worker-03         Ready    <none>                 81m   v1.20.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=region-1,failure-domain.beta.kubernetes.io/zone=zone-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-worker-03,kubernetes.io/os=linux

We should also be able to examine the CSINode objects to see that the topology keys are in place.

$ kubectl get csinodes
NAME                  DRIVERS   AGE
k8s-controlplane-01   0         4m22s
k8s-worker-01         1         24m
k8s-worker-02         1         20m
k8s-worker-03         1         18m


$ kubectl get csinodes -o jsonpath='{range .items[*]}{.metadata.name} {.spec}{"\n"}{end}'
k8s-controlplane-01 {"drivers":null}
k8s-worker-01 {"drivers":[{"name":"csi.vsphere.vmware.com","nodeID":"k8s-worker-01","topologyKeys":["failure-domain.beta.kubernetes.io/region","failure-domain.beta.kubernetes.io/zone"]}]}
k8s-worker-02 {"drivers":[{"name":"csi.vsphere.vmware.com","nodeID":"k8s-worker-02","topologyKeys":["failure-domain.beta.kubernetes.io/region","failure-domain.beta.kubernetes.io/zone"]}]}
k8s-worker-03 {"drivers":[{"name":"csi.vsphere.vmware.com","nodeID":"k8s-worker-03","topologyKeys":["failure-domain.beta.kubernetes.io/region","failure-domain.beta.kubernetes.io/zone"]}]}

Everything seems to be configured as expected. We can now begin to test whether we can indeed place our Pods and PVs using the topology settings.

Simple Pod/PV deployed to a particular zone

If I want to deploy a Pod and PV to a particular zone, in this example zone-a, I can use some manifests for the StorageClass, PVC and Pod as follows. Note the topology references in the StorageClass, which imply that volumes will only be created on available storage in this region and zone.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: zone-a-sc
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: vsan-a
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
         - zone-a
      - key: failure-domain.beta.kubernetes.io/region
        values:
          - region-1

Note that this manifest does not necessarily need to include a storage policy, but without one, the PV could be created on any of the available storage in this zone. Using a storage policy means that a particular datastore in this zone will be chosen. In this case, the policy matches a vSAN datastore.
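To illustrate that point, here is a minimal sketch of a topology-only StorageClass with the storage policy omitted; the CSI driver is then free to provision the PV on any compatible datastore in zone-a (the name zone-a-any-sc is just an illustrative label):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: zone-a-any-sc
provisioner: csi.vsphere.vmware.com
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - zone-a
      - key: failure-domain.beta.kubernetes.io/region
        values:
          - region-1
```

The allowedTopologies section still constrains placement to the zone; only the datastore selection within the zone is left to the driver.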

The PVC is quite straight-forward – it simply references the StorageClass.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zone-a-pvc
spec:
  storageClassName: zone-a-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

Finally, we get to the Pod manifest. Again, it is pretty straightforward. The Pod will be instantiated in the same zone where the PV exists, and will thus be brought up on a worker node in that zone. The PV will be attached to the same worker node, and the kubelet will mount it into the Pod.

apiVersion: v1
kind: Pod
metadata:
  name: zone-a-pod
spec:
  containers:
  - name: zone-a-container
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: zone-a-volume
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: zone-a-volume
      persistentVolumeClaim:
        claimName: zone-a-pvc

If I deploy the above manifests to create a Pod with a PVC/PV, I will end up with all of the objects deployed in the same zone, which in this case is zone-a. We can visualize the Pod and PV deployment as follows:

StatefulSet deployed across multiple zones

Let’s look at a more complex example, such as a StatefulSet, where each replica in the set has its own Pod and PV, and as the set is scaled, a new Pod and PV are instantiated for each replica. So that each PV ends up on a vSAN datastore, I am creating a common policy across all vSAN clusters. This policy is RAID0 since availability will be provided by replication within the application, so I do not need any protection at the infrastructure layer. I am also using a volumeBindingMode of WaitForFirstConsumer instead of the default Immediate. This means that the PV will not be instantiated until the Pod has been scheduled on a worker node, so we do not need the explicit topology statements in the StorageClass that we saw earlier. The PV will be instantiated in the same zone and attached to the Kubernetes node where the Pod has been scheduled, so that the kubelet can format and mount it into the Pod. Thus, it is the Pod which drives the topology placement in this case.

This is the Storage Class manifest which will be used by the StatefulSet.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: wffc-sc
provisioner: csi.vsphere.vmware.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  storagePolicyName: RAID0

The sample application is made up of a single container, called web, which has two volumes, www and logs. These volumes are created based on the storage class defined previously. Of particular interest are the affinity statements in the StatefulSet manifest. There is a nodeAffinity and a podAntiAffinity. The nodeAffinity defines which nodes your Pods are allowed to be scheduled on, based on the node labels. In this case, I have allowed my Pods to run in all 3 zones, in other words, all 3 vSphere clusters. The podAntiAffinity also deals with Pod placement, again based on labels. It states that this Pod cannot be scheduled on a node if a Pod with the same label is already scheduled there. Thus, Pods in this StatefulSet will be scheduled in different zones, and the PVs will then be instantiated in the same zone as the Pod. For podAntiAffinity to work, all nodes in the Kubernetes cluster must be correctly labeled with the label described in topologyKey. We have already verified that this is the case earlier in the post. A full explanation of K8s affinity is available here. Here is the complete StatefulSet manifest.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: failure-domain.beta.kubernetes.io/zone
                    operator: In
                    values:
                      - zone-a
                      - zone-b
                      - zone-c
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
        - name: nginx
          image: gcr.io/google_containers/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
            - name: logs
              mountPath: /logs
  volumeClaimTemplates:
    - metadata:
        name: www
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: wffc-sc
        resources:
          requests:
            storage: 5Gi
    - metadata:
        name: logs
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: wffc-sc
        resources:
          requests:
            storage: 1Gi

When I deploy the 3 x replica StatefulSet with its StorageClass, we can visualize the deployment something like this:

Check the deployment meets the topology rules

We can run a number of kubectl commands to check whether the objects were deployed as per the topology rules. First, let’s check the Pods to ensure that each Pod in the StatefulSet was deployed to a unique zone / K8s worker node. It does indeed look like each Pod is on a different K8s node.

$ kubectl get sts
NAME   READY   AGE
web    3/3     49m


$ kubectl get pods web-0 web-1 web-2
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          49m
web-1   1/1     Running   0          49m
web-2   1/1     Running   0          49m


$ kubectl get pods web-0 web-1 web-2 -o json | egrep "hostname|nodeName|claimName"
                                "f:hostname": {},
                                            "f:claimName": {}
                                            "f:claimName": {}
                "hostname": "web-0",
                "nodeName": "k8s-worker-02",
                            "claimName": "www-web-0"
                            "claimName": "logs-web-0"
                                "f:hostname": {},
                                            "f:claimName": {}
                                            "f:claimName": {}
                "hostname": "web-1",
                "nodeName": "k8s-worker-01",
                            "claimName": "www-web-1"
                            "claimName": "logs-web-1"
                                "f:hostname": {},
                                            "f:claimName": {}
                                            "f:claimName": {}
                "hostname": "web-2",
                "nodeName": "k8s-worker-03",
                            "claimName": "www-web-2"
                            "claimName": "logs-web-2"

Let’s remind ourselves of which K8s nodes are in which zones:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone -L failure-domain.beta.kubernetes.io/region
NAME                  STATUS   ROLES                  AGE   VERSION   ZONE     REGION
k8s-controlplane-01   Ready    control-plane,master   22h   v1.20.5   zone-a   region-1
k8s-worker-01         Ready    <none>                 22h   v1.20.5   zone-a   region-1
k8s-worker-02         Ready    <none>                 22h   v1.20.5   zone-b   region-1
k8s-worker-03         Ready    <none>                 22h   v1.20.5   zone-c   region-1

Next, let’s check the Persistent Volumes to make sure that they are in the same zone and attached to the same K8s node as the Pod:

$ kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}'
pvc-12c9b0d4-e0a5-48b5-8e5d-479a8f96715b    logs-web-0    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-b"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-2d8dfb98-966d-4006-b731-1aedd0cffbc5    logs-web-1    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-a"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-590da63c-1839-4848-b1a1-b7068b67f00b    logs-web-2    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]},{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-c"]}]}]}}
pvc-5ea98c4d-70ef-431f-9e87-a52a2468a515    www-web-1    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-a"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-b23ba58a-3275-4bf5-835d-bf41aefa148d    www-web-2    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-c"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-b8458bef-178e-40dd-9bc0-2a05f1ddfd65    www-web-0    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-b"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}

This looks good as well. Pod web-0 is on k8s-worker-02, which is in zone-b, and its two PVs – www-web-0 and logs-web-0 – are also in zone-b. Likewise for the other Pods and PVs. It would appear the placement is working as designed.

Check the deployment via the CNS UI

We can also check the deployment via the CNS UI. Let’s take a look at the vSphere cluster in zone-c. This is where K8s-worker-03 resides, and should be where Pod web-2 resides along with its volumes, www-web-2 and logs-web-2. From the Cluster > Monitor > Container Volumes view in the vSphere Client, we can see the following volumes have been provisioned on the vSAN datastore for zone-c:

Let’s click into the details of the larger 5GB volume, which should be the www-web-2 volume. This gives us a Kubernetes objects view, from which we can see that Pod web-2 is using the volume.

Let’s take a look at one final CNS view, the Basics view. This gives us more information about the vSphere datastore and shows which K8s node VM this volume is attached to. It looks like everything is working as expected from a topology perspective.

That concludes the post. Hopefully you can see the benefit of using the CSI Topology feature when it comes to providing availability to your Kubernetes applications deployed across vSphere infrastructures. For further information, please check out the official CSI documentation. This links to the CSI deployment steps and this links to a demonstration on how to use volume topology.
