CSI Topology – Configuration How-To
In this post, we will look at another feature of the vSphere CSI driver: the ability to control the placement of Kubernetes objects across different vSphere environments using a combination of vSphere Tags and the driver's topology, or failure domain, support. To achieve this, some additional entries must be added to the vSphere CSI driver configuration file. The CSI driver discovers the topology of each Kubernetes node/virtual machine and, through the kubelet, adds it as labels to the nodes. Please note that at the time of writing, the volume topology and availability zone feature was still in beta with vSphere CSI driver version 2.2 – something to keep in mind if you are planning to use the feature in production.
This is what my test environment looks like:
CSI Topology can be used to provide another level of availability to your Kubernetes cluster and applications. The objective is to deploy Kubernetes Pods with their own Persistent Volumes (PVs) in the same zone. Thus, if a StatefulSet is deployed across multiple zones using CSI Topology, and one of the zones fails, it does not impact the overall application. We utilize vSphere Tags to define Regions and Zones, which are the topology constructs used by Kubernetes. In my example, I associate a Region Tag (category k8s-region) at the Datacenter level and a unique Zone Tag (category k8s-zone) with each of the Clusters in the Datacenter.
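The tags can be created and attached in the vSphere Client, but for reference, roughly the same assignment could be scripted with govc. This is just a sketch: the cluster names Cluster-A, Cluster-B and Cluster-C are placeholders for your own clusters, while the datacenter name matches the one that appears in the CSI configuration later in this post.

# Create the tag categories referenced later in the [Labels] section of csi-vsphere.conf
govc tags.category.create k8s-region
govc tags.category.create k8s-zone

# Create one region tag and a zone tag per cluster
govc tags.create -c k8s-region region-1
govc tags.create -c k8s-zone zone-a
govc tags.create -c k8s-zone zone-b
govc tags.create -c k8s-zone zone-c

# Attach the region tag to the Datacenter and a unique zone tag to each Cluster (cluster names are placeholders)
govc tags.attach region-1 /OCTO-Datacenter
govc tags.attach zone-a /OCTO-Datacenter/host/Cluster-A
govc tags.attach zone-b /OCTO-Datacenter/host/Cluster-B
govc tags.attach zone-c /OCTO-Datacenter/host/Cluster-C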
Note that nothing changes on the Cloud Provider (CPI) side for topology. So long as all of the nodes have a ProviderID after the Cloud Provider has initialized, you can proceed with the CSI driver deployment and the following steps. Here is how to check the ProviderID after the CPI has been deployed.
$ kubectl get nodes
NAME                  STATUS   ROLES                  AGE     VERSION
k8s-controlplane-01   Ready    control-plane,master   4h35m   v1.20.5
k8s-worker-01         Ready    <none>                 4h33m   v1.20.5
k8s-worker-02         Ready    <none>                 4h32m   v1.20.5
k8s-worker-03         Ready    <none>                 4h31m   v1.20.5

$ kubectl describe nodes | grep ProviderID
ProviderID:  vsphere://42247114-0280-f5a1-0e8b-d5118f0ff8fd
ProviderID:  vsphere://42244b45-e7ec-18a3-8cf1-493e4c9d3780
ProviderID:  vsphere://4224fde2-7fae-35f7-7083-7bf6eaafd3bb
ProviderID:  vsphere://4224e557-2b4f-61d3-084f-30d524f97238
Let’s look at the setup steps involved in configuring CSI Topology.
Configure the csi-vsphere.conf CSI configuration file
The csi-vsphere.conf file, located on the Kubernetes control plane master in /etc/kubernetes, now has two additional entries in a [Labels] section: the region and zone labels, k8s-region and k8s-zone. These are the same tag categories that were used on the vSphere inventory objects.
[Global]
cluster-id = "cormac-upstream"
cluster-distribution = "native"

[VirtualCenter "AA.BB.CC.DD"]
user = "administrator@vsphere.local"
password = "******"
port = "443"
insecure-flag = "1"
datacenters = "OCTO-Datacenter"

[Labels]
region = k8s-region
zone = k8s-zone
Create the CSI secret and deploy the manifests
We can now proceed to the next step, which is the deployment of the CSI driver. This is done in 3 steps – (a) create a secret from the csi-vsphere.conf file, (b) modify the CSI manifests to support topology, and finally (c) deploy the updated manifests. The recommendation is to remove the csi-vsphere.conf file from the control plane node once the secret has been created.
Create the secret
$ kubectl create secret generic vsphere-config-secret --from-file=csi-vsphere.conf \
--namespace=kube-system
secret/vsphere-config-secret created
Modify vsphere-csi-controller-deployment.yaml
I am using the version 2.2 CSI driver manifests in this example. Here are the changes that need to be made. In the CSI controller deployment manifest, you will need to uncomment lines 160 and 161, as shown below:
< - "--feature-gates=Topology=true"
< - "--strict-topology"
---
> #- "--feature-gates=Topology=true"
> #- "--strict-topology"
Modify vsphere-csi-node-ds.yaml
In the CSI node daemonset manifest, you will need to uncomment 3 sections to enable topology.
< - name: VSPHERE_CSI_CONFIG
<   value: "/etc/cloud/csi-vsphere.conf" # here csi-vsphere.conf is the name of the file used for creating secret using "--from-file" flag
---
> #- name: VSPHERE_CSI_CONFIG
> #  value: "/etc/cloud/csi-vsphere.conf" # here csi-vsphere.conf is the name of the file used for creating secret using "--from-file" flag
< - name: vsphere-config-volume
<   mountPath: /etc/cloud
<   readOnly: true
---
> #- name: vsphere-config-volume
> #  mountPath: /etc/cloud
> #  readOnly: true
< - name: vsphere-config-volume
<   secret:
<     secretName: vsphere-config-secret
---
> #- name: vsphere-config-volume
> #  secret:
> #    secretName: vsphere-config-secret
Deploy the manifests
Once the above changes are made, deploy the manifests, both for the controller deployment and node daemonset, as well as the various RBAC objects. In CSI 2.2, there are 4 manifests in total that should be deployed.
$ kubectl apply -f rbac/vsphere-csi-controller-rbac.yaml \
-f rbac/vsphere-csi-node-rbac.yaml \
-f deploy/vsphere-csi-controller-deployment.yaml \
-f deploy/vsphere-csi-node-ds.yaml
serviceaccount/vsphere-csi-controller created
clusterrole.rbac.authorization.k8s.io/vsphere-csi-controller-role created
clusterrolebinding.rbac.authorization.k8s.io/vsphere-csi-controller-binding created
serviceaccount/vsphere-csi-node created
role.rbac.authorization.k8s.io/vsphere-csi-node-role created
rolebinding.rbac.authorization.k8s.io/vsphere-csi-node-binding created
deployment.apps/vsphere-csi-controller created
configmap/internal-feature-states.csi.vsphere.vmware.com created
csidriver.storage.k8s.io/csi.vsphere.vmware.com created
service/vsphere-csi-controller created
daemonset.apps/vsphere-csi-node created
Monitor the CSI driver deployment
We can now monitor the CSI driver deployment. If all is working as expected, the CSI controller and CSI node Pods should enter a running state. There will be a node Pod for each K8s node in the cluster.
$ kubectl get pods -A
NAMESPACE     NAME                                           READY   STATUS    RESTARTS   AGE
kube-system   coredns-74ff55c5b-c67jw                        1/1     Running   1          48m
kube-system   coredns-74ff55c5b-vjwpf                        1/1     Running   1          48m
kube-system   etcd-k8s-controlplane-01                       1/1     Running   1          48m
kube-system   kube-apiserver-k8s-controlplane-01             1/1     Running   1          48m
kube-system   kube-controller-manager-k8s-controlplane-01    1/1     Running   1          48m
kube-system   kube-flannel-ds-2vjbr                          1/1     Running   0          46s
kube-system   kube-flannel-ds-58wzn                          1/1     Running   0          46s
kube-system   kube-flannel-ds-gqpdt                          1/1     Running   0          46s
kube-system   kube-flannel-ds-v6n54                          1/1     Running   0          46s
kube-system   kube-proxy-569cm                               1/1     Running   1          47m
kube-system   kube-proxy-9zrbm                               1/1     Running   1          48m
kube-system   kube-proxy-cpzpn                               1/1     Running   1          46m
kube-system   kube-proxy-lr8cx                               1/1     Running   1          45m
kube-system   kube-scheduler-k8s-controlplane-01             1/1     Running   1          48m
kube-system   vsphere-csi-controller-6ddf68d4f-kcwh4         6/6     Running   0          9s
kube-system   vsphere-csi-node-9srwh                         3/3     Running   0          9s
kube-system   vsphere-csi-node-dnhc9                         3/3     Running   0          9s
kube-system   vsphere-csi-node-p7sl7                         3/3     Running   0          9s
kube-system   vsphere-csi-node-s8tkh                         3/3     Running   0          9s
Assuming the CSI controller and CSI node Pods deploy successfully, we can now check the labeling on the worker nodes. I’ve adjusted the LABELS output for the control plane node so that it is a little easier to read, but all nodes should now have the failure-domain.beta.kubernetes.io region and zone labels.
$ kubectl get nodes --show-labels
NAME                  STATUS   ROLES                  AGE   VERSION   LABELS
k8s-controlplane-01   Ready    control-plane,master   84m   v1.20.5   beta.kubernetes.io/arch=amd64,
                                                                      beta.kubernetes.io/os=linux,
                                                                      failure-domain.beta.kubernetes.io/region=region-1,
                                                                      failure-domain.beta.kubernetes.io/zone=zone-a,
                                                                      kubernetes.io/arch=amd64,
                                                                      kubernetes.io/hostname=k8s-controlplane-01,
                                                                      kubernetes.io/os=linux,
                                                                      node-role.kubernetes.io/control-plane=,
                                                                      node-role.kubernetes.io/master=
k8s-worker-01         Ready    <none>                 83m   v1.20.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=region-1,failure-domain.beta.kubernetes.io/zone=zone-a,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-worker-01,kubernetes.io/os=linux
k8s-worker-02         Ready    <none>                 82m   v1.20.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=region-1,failure-domain.beta.kubernetes.io/zone=zone-b,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-worker-02,kubernetes.io/os=linux
k8s-worker-03         Ready    <none>                 81m   v1.20.5   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=region-1,failure-domain.beta.kubernetes.io/zone=zone-c,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-worker-03,kubernetes.io/os=linux
We should also be able to examine the CSI node objects to see that the corresponding topology keys are in place.
$ kubectl get csinodes
NAME                  DRIVERS   AGE
k8s-controlplane-01   0         4m22s
k8s-worker-01         1         24m
k8s-worker-02         1         20m
k8s-worker-03         1         18m

$ kubectl get csinodes -o jsonpath='{range .items[*]}{.metadata.name} {.spec}{"\n"}{end}'
k8s-controlplane-01 {"drivers":null}
k8s-worker-01 {"drivers":[{"name":"csi.vsphere.vmware.com","nodeID":"k8s-worker-01","topologyKeys":["failure-domain.beta.kubernetes.io/region","failure-domain.beta.kubernetes.io/zone"]}]}
k8s-worker-02 {"drivers":[{"name":"csi.vsphere.vmware.com","nodeID":"k8s-worker-02","topologyKeys":["failure-domain.beta.kubernetes.io/region","failure-domain.beta.kubernetes.io/zone"]}]}
k8s-worker-03 {"drivers":[{"name":"csi.vsphere.vmware.com","nodeID":"k8s-worker-03","topologyKeys":["failure-domain.beta.kubernetes.io/region","failure-domain.beta.kubernetes.io/zone"]}]}
Seems like everything is configured as expected. We can now begin to test whether we can indeed place our Pods and PVs using the topology (multi-AZ) settings.
Simple Pod/PV deployed to a particular zone
If I want to deploy a Pod and PV to a particular zone, in this example zone-a, I can use manifests for the StorageClass, PVC and Pod like the following. Note the topology references in the StorageClass, which mean that volumes will only be created on storage available in this region (region-1) and this zone (zone-a).
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: zone-a-sc
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: vsan-a
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - zone-a
  - key: failure-domain.beta.kubernetes.io/region
    values:
    - region-1
Note that this manifest does not necessarily need to include a storage policy, but without one the PV could be created on any of the available storage in this zone. Using a storage policy means that a particular datastore in this region / zone will be picked. In this case, the policy matches a vSAN datastore, and there is a vSAN datastore available in this region / zone.
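For comparison, here is a minimal sketch of what the same class might look like without a storage policy (the name zone-a-sc-any is purely illustrative). With this, the PV could be placed on any datastore accessible in zone-a:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: zone-a-sc-any
provisioner: csi.vsphere.vmware.com
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone
    values:
    - zone-a
  - key: failure-domain.beta.kubernetes.io/region
    values:
    - region-1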
The PVC is quite straightforward – it simply references the zone-a-sc StorageClass shown above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: zone-a-pvc
spec:
  storageClassName: zone-a-sc
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Finally, we get to the Pod manifest. Again, pretty straightforward: this Pod will be instantiated in the same region / zone where the PV exists, and will thus be scheduled on the worker node (or nodes) in that zone. The PV will be attached to the same worker node, and the kubelet process running on that worker will mount it into the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: zone-a-pod
spec:
  containers:
  - name: zone-a-container
    image: "k8s.gcr.io/busybox"
    volumeMounts:
    - name: zone-a-volume
      mountPath: "/mnt/volume1"
    command: [ "sleep", "1000000" ]
  volumes:
  - name: zone-a-volume
    persistentVolumeClaim:
      claimName: zone-a-pvc
If I deploy the above manifests to create a Pod with its PVC/PV, I end up with all of the objects deployed in the same zone, which in this case is zone-a. We can visualize the Pod and PV deployment as follows:
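Before moving on, the placement can also be double-checked from the command line. This is just a rough sketch – the YAML file names below are simply whatever you saved the above manifests as:

$ kubectl apply -f zone-a-sc.yaml -f zone-a-pvc.yaml -f zone-a-pod.yaml

# The PVC should bind and the Pod should be scheduled on a worker in zone-a
$ kubectl get pvc zone-a-pvc
$ kubectl get pod zone-a-pod -o wide

# The resulting PV should carry node affinity terms for region-1 / zone-a
$ kubectl get pv -o jsonpath='{range .items[*]}{.metadata.name} {.spec.claimRef.name} {.spec.nodeAffinity}{"\n"}{end}'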
StatefulSet deployed across multiple zones
Let’s look at a more complex example, such as a StatefulSet, where each replica in the set has its own Pod and PV, and as the set is scaled, a new Pod and PV are instantiated for each new replica. As mentioned, each zone has its own vSAN datastore, and to ensure that each PV ends up on a vSAN datastore and not on any other storage, I am creating a common policy across all vSAN clusters. This policy is RAID0, since availability will be provided by replication within the application itself, so I do not need any protection at the infrastructure layer. I am also using a volumeBindingMode of WaitForFirstConsumer instead of the default Immediate. This means that the PV will not be instantiated until the Pod has first been scheduled on a worker node, so we do not need the explicit topology statements in the StorageClass that we saw earlier. The PV will be instantiated and attached to the Kubernetes node where the Pod has been scheduled, so that the kubelet can format and mount it into the Pod. Thus, it is the Pod which drives the topology placement in this case.
This is the Storage Class manifest which will be used by the StatefulSet. This is where the volumeBindingMode is specified.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: wffc-sc
provisioner: csi.vsphere.vmware.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  storagePolicyName: RAID0
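As an aside, a quick way to observe the WaitForFirstConsumer behaviour is to create a standalone PVC against this class (the claim name below is purely illustrative). It will remain Pending, with a WaitForFirstConsumer event, until some Pod actually references it, at which point the PV is provisioned in the zone where that Pod was scheduled.

# Hypothetical standalone claim, just to illustrate the WaitForFirstConsumer behaviour
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wffc-test-pvc
spec:
  storageClassName: wffc-sc
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
# kubectl get pvc wffc-test-pvc      --> stays Pending until a Pod uses the claim
# kubectl describe pvc wffc-test-pvc --> shows a "WaitForFirstConsumer" event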
The sample application is made up of a single container, called web, which has two volumes, www and logs. These volumes are created based on the StorageClass defined previously. Of particular interest are the affinity statements in the StatefulSet manifest: a nodeAffinity and a podAntiAffinity. The nodeAffinity defines which nodes the Pods are allowed to be scheduled on, based on the node labels. In this case, I have allowed my Pods to run in all 3 zones, in other words, on all 3 vSphere clusters. The podAntiAffinity also deals with Pod placement, again based on labels. It states that a Pod cannot be scheduled in a zone where a Pod with the same label (app: nginx) is already running. Thus, the Pods in this StatefulSet will be scheduled in different zones, and the PVs will then be instantiated in the same zone as their Pod, since we have set the volumeBindingMode to WaitForFirstConsumer, i.e. wait for the Pod. For affinity and anti-affinity to work in Kubernetes, all nodes in the Kubernetes cluster must be labeled correctly with the label described in topologyKey; we have already verified that this is the case earlier in the post. A full explanation of K8s affinity is available here. Here is the complete StatefulSet manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - zone-a
                - zone-b
                - zone-c
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
        - name: logs
          mountPath: /logs
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: wffc-sc
      resources:
        requests:
          storage: 5Gi
  - metadata:
      name: logs
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: wffc-sc
      resources:
        requests:
          storage: 1Gi
When I deploy the 3-replica StatefulSet with its StorageClass, we can visualize the deployment something like this:
Check the deployment meets the topology rules
We can run a number of kubectl commands to check whether the objects were deployed as per the topology rules. First, let’s check the Pods to ensure that each Pod in the StatefulSet was deployed to a unique zone / K8s worker node. It does indeed look like each Pod is on a different K8s node.
$ kubectl get sts
NAME   READY   AGE
web    3/3     49m

$ kubectl get pods web-0 web-1 web-2
NAME    READY   STATUS    RESTARTS   AGE
web-0   1/1     Running   0          49m
web-1   1/1     Running   0          49m
web-2   1/1     Running   0          49m

$ kubectl get pods web-0 web-1 web-2 -o json | egrep "hostname|nodeName|claimName"
"f:hostname": {},
"f:claimName": {}
"f:claimName": {}
"hostname": "web-0",
"nodeName": "k8s-worker-02",
"claimName": "www-web-0"
"claimName": "logs-web-0"
"f:hostname": {},
"f:claimName": {}
"f:claimName": {}
"hostname": "web-1",
"nodeName": "k8s-worker-01",
"claimName": "www-web-1"
"claimName": "logs-web-1"
"f:hostname": {},
"f:claimName": {}
"f:claimName": {}
"hostname": "web-2",
"nodeName": "k8s-worker-03",
"claimName": "www-web-2"
"claimName": "logs-web-2"
Let’s remind ourselves of which K8s nodes are in which zones:
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone -L failure-domain.beta.kubernetes.io/region
NAME                  STATUS   ROLES                  AGE   VERSION   ZONE     REGION
k8s-controlplane-01   Ready    control-plane,master   22h   v1.20.5   zone-a   region-1
k8s-worker-01         Ready    <none>                 22h   v1.20.5   zone-a   region-1
k8s-worker-02         Ready    <none>                 22h   v1.20.5   zone-b   region-1
k8s-worker-03         Ready    <none>                 22h   v1.20.5   zone-c   region-1
Next, let’s check the Persistent Volumes to make sure that they are in the same zone and attached to the same K8s node as their Pod:
$ kubectl get pv -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.claimRef.name}{"\t"}{.spec.nodeAffinity}{"\n"}{end}'
pvc-12c9b0d4-e0a5-48b5-8e5d-479a8f96715b   logs-web-0   {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-b"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-2d8dfb98-966d-4006-b731-1aedd0cffbc5   logs-web-1   {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-a"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-590da63c-1839-4848-b1a1-b7068b67f00b   logs-web-2   {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-c"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-5ea98c4d-70ef-431f-9e87-a52a2468a515   www-web-1    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-a"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-b23ba58a-3275-4bf5-835d-bf41aefa148d   www-web-2    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-c"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
pvc-b8458bef-178e-40dd-9bc0-2a05f1ddfd65   www-web-0    {"required":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"failure-domain.beta.kubernetes.io/zone","operator":"In","values":["zone-b"]},{"key":"failure-domain.beta.kubernetes.io/region","operator":"In","values":["region-1"]}]}]}}
This looks good as well. Pod web-0 is on k8s-worker-02, which is in zone-b, and its two PVs – www-web-0 and logs-web-0 – are also in zone-b. Likewise for the other Pods and PVs. It would appear the placement is working as designed.
Check the deployment via the CNS UI
We can also check the deployment via the CNS UI. Let’s take a look at the vSphere cluster in zone-c. This is where k8s-worker-03 resides, and it should be where Pod web-2 resides along with its volumes, www-web-2 and logs-web-2. From the Cluster > Monitor > Container Volumes view in the vSphere Client, we can see that the following volumes have been provisioned on the vSAN datastore for zone-c:
Let’s click on the details of the larger 5GB volume, which should be the www-web-2 volume. This gives us a Kubernetes objects view, and from it we can see that Pod web-2 is using the volume.
Let’s take a look at one final CNS view, the Basics view. This gives us more information about the vSphere datastore on which the volume resides and the K8s node VM to which it is attached. Everything looks to be working as expected from a topology perspective.
That concludes the post. Hopefully you can see the benefit of using the CSI Topology feature when it comes to providing availability to your Kubernetes applications deployed across vSphere infrastructures. For further information, please check out the official CSI documentation. This links to the CSI deployment steps and this links to a demonstration of how to use the volume topology.
Hi Cormac, we are testing this but we have 3 masters, and it only works for us if the 3 masters are in the same zone. As soon as we move master 3 to zone 3, if we try to provision a PV in zone 1 it is not working. We used a policy in the storage class, and datastores matching the policy are available in all zones, but we get a “No compatible datastore found for storagePolicy” error.
Since it is a beta feature, I guess we cannot open support cases, so we are stuck. If the storage class has no policy it works, but it creates the PV on a datastore that is shared by all the clusters (zones). And this is not the use case – we use zones to provision on different clusters with different storage. Is there any prerequisite we are missing? All the examples we have found use only one master, like the one in your post…
So is the issue that all datastores in each AZ match the same policy? If so, have you tried creating some tags for each of the datastores, and then creating unique policies / Storage Classes, so that PVs are provisioned to the correct datastore in the correct AZ?
If that is not the issue, I would try to progress the issue via https://github.com/kubernetes-sigs/vsphere-csi-driver/issues, even if it is a beta feature.
Hi, not exactly – as I said, it works depending on the master location. If all the masters are in the same zone it works; if we move the third master to zone 3, then it doesn’t. It is similar to this issue: https://github.com/kubernetes-sigs/vsphere-csi-driver/issues/999
So the question is whether all the master nodes need to have access to the datastores where the PVs are provisioned. We haven’t seen any info about that, and all the examples we have found use a single master node…
Nope – the master should not need access to all datastores, as per my example. In that setup, I have only 1 master, but there are 3 AZs, each with its own vSphere cluster and a distinct vSAN datastore. So the master was only in one of those clusters and had access to only one datastore, but it could still facilitate provisioning on all the datastores.
The only thing that I did not test is a multi-master environment, so I would still recommend driving it via the GitHub site, which is monitored by the vSphere CSI engineering team.