With the above in mind, we can now consider some of the requirements when deploying Kubernetes on a vSAN Stretched Cluster. There is a new document now available which highlights these considerations, called Design Considerations and Best Practices When Using vSAN Stretched Clusters for Kubernetes. The main considerations are mostly vSAN stretched cluster specific, such as enabling HA, DRS, Host and VM Affinity Groups, etc., so I won’t go into those in detail here. However, when it comes to PV provisioning, the advice is that the same storage policy should be used for all node VMs, both control plane and worker, as well as for all Persistent Volumes (PVs). This storage policy in vSphere equates to a single, standard Storage Class for all storage objects in the Kubernetes cluster. The other major limitation at the time of writing is that only block-based ReadWriteOnce (RWO) volumes are supported. There is no support for ReadWriteMany (RWX) file volumes backed by vSAN File Service.
Let’s assume that the vSAN stretched cluster is now configured and the Kubernetes cluster has been deployed as a set of VMs across the two data sites. The next steps are to install the vSphere Cloud Provider (CPI) and the vSphere CSI driver. These instructions are provided in detail in the official vSphere Container Storage Plugin docs here. The CPI instructions have changed slightly since I last did a vanilla Kubernetes deployment, especially around the taints that are needed on all nodes, both control plane and worker, so pay particular attention to the steps found here. The CSI instructions are much the same as before, and the driver continues to require a taint on the control plane nodes only, so be sure to pay attention there as well. Finally, note that there is an issue with the vSphere CSI v2.5 manifests, where the Provider IDs reported by the CPI are not matched correctly by the vSphere CSI node registrar. The issue is described here. This has already been fixed, but make sure you use the vSphere CSI v2.5.1 manifests or you might come unstuck. For completeness' sake, here are the Provider IDs returned when the CPI is deployed.
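To make the taint steps a little more concrete, here is a hedged sketch: when the kubelet is started with `--cloud-provider=external`, each node carries an `uninitialized` taint that the CPI removes once it has initialized the node, so a quick check of the taints shows whether the CPI has done its job. The node name below is just an example from my cluster.

```shell
# List node names and taints; nodes still carrying the uninitialized
# taint have not yet been processed by the CPI.
kubectl describe nodes | egrep "Name:|Taints:"

# If a node was somehow created without the taint, it can be added
# manually (example node name) so that the CPI initializes it:
kubectl taint node k8s-worker-01 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
```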
```shell
# kubectl describe nodes | grep "ProviderID"
ProviderID:  vsphere://42247114-0280-f5a1-0e8b-d5118f0ff8fd
ProviderID:  vsphere://42244b45-e7ec-18a3-8cf1-493e4c9d3780
ProviderID:  vsphere://4224fde2-7fae-35f7-7083-7bf6eaafd3bb
```
And here is a snippet from the vSphere CSI node driver registrar from v2.5, which is using an incorrectly formatted Provider ID and so is unable to find the Kubernetes nodes (you can see the Provider ID is ordered incorrectly in the error).
```shell
E0405 15:54:03.953091 1 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc = failed to retrieve topology information for Node: "k8s-controlplane-01". Error: "failed to retrieve nodeVM \"14712442-8002-a1f5-0e8b-d5118f0ff8fd\" using the node manager. Error: virtual machine wasn't found", restarting registration container.
```
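Comparing the UUID in the error with the ProviderID reported by the CPI above, you can see the first three fields have their bytes reversed, i.e. the mismatch is a byte-ordering (mixed-endian vs big-endian) issue. A minimal Python sketch of the transformation, purely as an illustration of the mismatch rather than code taken from the driver:

```python
def swap_uuid_endianness(uuid_str: str) -> str:
    """Reverse the byte order of the first three fields of a UUID string.

    Illustrates the mixed-endian vs big-endian representation mismatch
    seen in the CSI v2.5 error above; not code from the driver itself.
    """
    parts = uuid_str.split("-")

    def reverse_bytes(hex_field: str) -> str:
        # Split into 2-character bytes and emit them in reverse order.
        return "".join(hex_field[i:i + 2] for i in range(len(hex_field) - 2, -1, -2))

    return "-".join([reverse_bytes(parts[0]), reverse_bytes(parts[1]),
                     reverse_bytes(parts[2]), parts[3], parts[4]])

# The ProviderID reported by the CPI for my control plane node:
print(swap_uuid_endianness("42247114-0280-f5a1-0e8b-d5118f0ff8fd"))
# → 14712442-8002-a1f5-0e8b-d5118f0ff8fd (the ID the v2.5 registrar looked up)
```

Applying the swap twice returns the original ID, which is why simply upgrading to the v2.5.1 manifests, where the lookup was fixed, resolves the problem.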
Let’s now take a look at the StorageClass, PVC, PV and Pod deployment on a vanilla Kubernetes deployment running on a vSAN Stretched Cluster. To recap, here are the relevant builds and versions that I have deployed.
- vSphere 7.0U3d, build 19480866
- Kubernetes upstream/vanilla version v1.21.3
- vSphere CSI driver version 2.5.1
This is the vSAN Stretched Cluster configuration, with 4 ESXi hosts on each site:
Below is the common storage policy which implements cross-site replication. This will become the basis for the Kubernetes Storage Class in a moment.
This is the Kubernetes cluster deployment. I only have a single control plane node configured at present, but there are 6 worker nodes, 3 on the preferred data site and 3 on the secondary data site. These are positioned using Host and VM Affinity “Should” rules.
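For anyone scripting this rather than using the vSphere client, the same "should" rules can be built with govc. This is only a sketch under assumed names; the cluster, host and VM names below are examples, and flags may vary with govc version:

```shell
# Group the preferred-site worker VMs and the preferred-site hosts,
# then tie them together with a (non-mandatory, i.e. "should") VM-Host rule.
# All names here are hypothetical examples.
govc cluster.group.create -cluster vsan-stretched -name preferred-site-k8s -vm \
    k8s-worker-01 k8s-worker-02 k8s-worker-03
govc cluster.group.create -cluster vsan-stretched -name preferred-site-hosts -host \
    esxi-01 esxi-02 esxi-03 esxi-04
govc cluster.rule.create -cluster vsan-stretched -name k8s-should-preferred \
    -vm-host -enable -vm-group preferred-site-k8s -host-affine-group preferred-site-hosts
```

A matching pair of groups and a rule would then be created for the secondary site.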
```shell
% kubectl get nodes
NAME                  STATUS   ROLES                  AGE   VERSION
k8s-controlplane-01   Ready    control-plane,master   18h   v1.21.3
k8s-worker-01         Ready    <none>                 18h   v1.21.3
k8s-worker-02         Ready    <none>                 18h   v1.21.3
k8s-worker-03         Ready    <none>                 17h   v1.21.3
k8s-worker-04         Ready    <none>                 17h   v1.21.3
k8s-worker-05         Ready    <none>                 17h   v1.21.3
k8s-worker-06         Ready    <none>                 17h   v1.21.3
```
Now we can create a simple application. Here is the manifest that I am using to build the Storage Class (using the above vSphere storage policy), a PVC (Persistent Volume Claim), a PV (Persistent Volume) and a busybox Pod to use the volume. All YAML is placed in a single manifest for convenience. Note the Storage Class has a reference to the policy defined for vSAN stretched cluster placement, as seen above.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-vsan-sc
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
parameters:
  storagepolicyname: "StretchedClusterPolicy"
  csi.storage.k8s.io/fstype: "ext4"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csivsan-pvc-vsan-claim
spec:
  storageClassName: csi-vsan-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-vsan-pod
spec:
  containers:
    - name: busybox
      image: busybox
      volumeMounts:
        - name: csivsan-vol
          mountPath: "/demo"
      command: [ "sleep", "1000000" ]
  volumes:
    - name: csivsan-vol
      persistentVolumeClaim:
        claimName: csivsan-pvc-vsan-claim
```
Let’s deploy the manifest, and take a look at the resulting objects, paying particular attention to the underlying PV layout on a vSAN stretched cluster.
```shell
% kubectl apply -f demo-sc-pvc-pod-rwo-vsanstretched.yaml
storageclass.storage.k8s.io/csi-vsan-sc created
persistentvolumeclaim/csivsan-pvc-vsan-claim created
pod/csi-vsan-pod created

% kubectl get sc
NAME          PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
csi-vsan-sc   csi.vsphere.vmware.com   Delete          Immediate           true                   5s

% kubectl get pvc
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
csivsan-pvc-vsan-claim   Bound    pvc-cb93fc80-56a7-4fc5-830f-deeec758ff16   2Gi        RWO            csi-vsan-sc    9s

% kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                            STORAGECLASS   REASON   AGE
pvc-cb93fc80-56a7-4fc5-830f-deeec758ff16   2Gi        RWO            Delete           Bound    default/csivsan-pvc-vsan-claim   csi-vsan-sc             10s

% kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
csi-vsan-pod   1/1     Running   0          16s
```
Looks like everything has been created successfully. Let’s move to the vSphere client, where we can examine the metadata of the PV via CNS. If there are multiple datastores, administrators can filter on the list of datastores, and by choosing only the vSAN datastore from the stretched cluster, the volume is clearly seen.
If the details icon (2nd column) is selected, more detail about the volume can be seen, such as the storage policy (as chosen by the Storage Class) as well as the K8s node where the volume is attached.
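For those who prefer the CLI, the CNS volume inventory can also be listed with govc (a sketch; output formatting varies between govc versions, and the PV name appears as the CNS volume name):

```shell
# List the CNS volumes backing the Kubernetes PVs.
govc volume.ls
```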
The final question we need to answer is about placement. The PV object, backed by an object on the vSAN datastore, should be replicated across the two data sites of the vSAN stretched cluster. By checking the Physical Placement view, this can be confirmed. I added the blue and green edits below to make it clearer.
Using the above view, it is possible to conclude that the PV is protected both against failures within a site and across the data sites. It is therefore capable of surviving multiple failures in a vSAN stretched cluster, although depending on the nature of the failure, the application may experience a period of outage while the control plane recovers after a majority loss. Note that the same holds true for applications running on the worker nodes. If the application has been deployed as a set of replicating pods that utilize a majority voting mechanism, it is only when the site holding the majority of the components fails that there is some outage. Any single failure contained within a site, e.g. a single host failure, should not impact the application or the K8s cluster in any way, since vSAN stretched cluster provides both in-site and cross-site protection.
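The majority-voting behaviour described above can be sketched with a trivial check (an illustration of the principle, not any particular product's logic): a replicated cluster remains available only while a strict majority of its replicas survive, which is why losing the site holding the majority causes an outage.

```python
def has_quorum(total_replicas: int, surviving_replicas: int) -> bool:
    """A majority-voting cluster stays available while a strict majority survives."""
    return surviving_replicas > total_replicas // 2

# Three etcd-style replicas split 2 + 1 across two data sites:
print(has_quorum(3, 2))  # site with 1 replica fails -> quorum kept  -> True
print(has_quorum(3, 1))  # site with 2 replicas fails -> quorum lost -> False
```

This also shows why an even split (e.g. 2 + 2 replicas across two sites) cannot survive either site failing, and why a witness or third site is so valuable.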
One final note on this topic that you may want to consider is whether you have more than two sites available. If you have three or more sites that can be used, you might consider a multi-AZ approach where the Kubernetes cluster is distributed across the sites. In that case, a single site failure does not take out the majority of the control plane components and so does not lead to an outage. Multi-AZ / topology deployments are also supported by the vSphere CSI driver. I took a look at this feature back when it was in beta in vSphere CSI 2.2, but it is now fully GA. Just another option to consider when looking to deploy highly available Kubernetes clusters on top of vSphere infrastructure.
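As a hedged sketch of what that might look like, a topology-aware Storage Class restricts volume provisioning to the configured zones. The topology key and zone values below are assumptions for illustration and depend entirely on how zones were set up for the vSphere CSI driver in your environment:

```yaml
# Sketch only: a topology-aware StorageClass for a multi-AZ deployment.
# The topology key and zone names are hypothetical examples.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-vsan-zonal-sc
provisioner: csi.vsphere.vmware.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.csi.vmware.com/k8s-zone
        values:
          - zone-a
          - zone-b
```

`WaitForFirstConsumer` delays provisioning until a pod is scheduled, so the volume is created in a zone reachable by the node that will actually run the pod.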