Kubernetes, vSAN Stretched Cluster with CSI driver v2.5.1

Cormac

2 years ago

In this post, we will look at a relatively new announcement around support for vanilla (upstream) Kubernetes clusters running on a vSAN stretched cluster with the vSphere CSI driver. There have been a number of updates in this area recently, so I want to highlight a few observations before we get into the deployment.

First of all, it is important to highlight that a vSAN Stretched Cluster can have at most 2 fault domains. These are the data sites. While there is a requirement for a third site for the witness, the witness site does not store any application data. Thus all of the application data replicates across 2 sites at most. This should be a consideration in the context of the Kubernetes control plane, which normally contains an odd number of nodes. With only 2 data sites, the majority of control plane nodes will always reside on one of them. If that is the data site which fails or is isolated, then the control plane nodes on that site will need to be restarted on the remaining site. This will cause an outage to the Kubernetes cluster (or clusters) running on the vSAN stretched cluster, and the outage will last until the majority of control plane nodes are back up and running. You should also be cognizant of the fact that the resilience of etcd (the Kubernetes key-value store that holds cluster configuration and state) and of the control plane as a whole must be a consideration in cases like this. In other words, in the event of a site failure, the onus is on vSphere to restart the nodes on the remaining site, but the onus is on Kubernetes to recover its control plane functionality.
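To make the majority argument concrete, here is a minimal Python sketch. It assumes the usual majority rule (more than half of the members must be available) and simply enumerates every way of splitting a small control plane across 2 data sites. The function names are my own and purely illustrative; nothing here talks to a real cluster.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def survives_any_site_loss(site_a: int, site_b: int) -> bool:
    """True only if a majority remains whichever data site is lost."""
    needed = quorum(site_a + site_b)
    return site_a >= needed and site_b >= needed


if __name__ == "__main__":
    # Try every split of 1, 3 and 5 control plane nodes across two sites.
    for total in (1, 3, 5):
        for site_a in range(total + 1):
            site_b = total - site_a
            print(total, (site_a, site_b), survives_any_site_loss(site_a, site_b))
```

Every line printed ends in False: two sites can never both hold a majority, so there is always one data site whose loss takes the control plane below quorum until its nodes are restarted elsewhere.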

The second consideration is the versions of the various components that make up the stack. In the vSAN 7.0U3 Release Notes, updated on January 27, 2022, there was a statement around vanilla Kubernetes support enhancements to include vSAN stretched cluster support and topology support. However, there was also an issue highlighted in the same release notes. This described the fact that when a vSAN stretched cluster has a network partition between sites, volume information could be lost from CNS. When volume metadata is not present in CNS, you cannot create, delete, or re-schedule pods with CNS volumes, since the vSphere CSI Driver must access volume information from CNS to perform these operations. Once the network partition is fixed, CNS volume metadata is restored, and pods with CNS volumes can be created, deleted, or re-scheduled. Unfortunately, there was no workaround for this issue. The good news is that with the release of vSphere & vSAN 7.0U3d, this issue has been addressed.

With the above in mind, we can now consider some of the requirements when deploying Kubernetes on a vSAN Stretched Cluster. There is a new document now available which highlights these considerations, called Design Considerations and Best Practices When Using vSAN Stretched Clusters for Kubernetes. The main considerations are mostly vSAN stretched cluster specific, such as enabling HA, DRS, Host and VM Affinity Groups, etc, so I won’t go into those in detail here. However, when it comes to PV provisioning, the advice is that the same storage policy should be used for all node VMs, both control plane and worker, as well as for all Persistent Volumes (PVs). This storage policy in vSphere equates to a single, standard Storage Class for all storage objects in the Kubernetes cluster. The other major limitation at the time of writing is that only block based read-write-once (RWO) volumes are supported. There is no support for read-write-many (RWX) file volumes based on vSAN File Service.

Let’s assume that the vSAN stretched cluster is now configured and the Kubernetes cluster has been deployed as a set of VMs across the 2 data sites. The next steps are to install the vSphere Cloud Provider (CPI) and the vSphere CSI Driver. These instructions are provided in detail in the official vSphere Container Storage Plugin docs. The CPI instructions have changed slightly since I last did a vanilla Kubernetes deployment, especially around taints, which are needed on all nodes, both control plane and worker, so pay particular attention to those steps. The CSI instructions are much the same as before, and the driver continues to require a taint on the control plane nodes only, so be sure to pay attention there as well. Finally, note that there is a known issue with the vSphere CSI v2.5 manifests, where the Provider IDs reported by the CPI are not matched correctly by the vSphere CSI node registrar. This has already been fixed, but make sure you use the vSphere CSI v2.5.1 manifests or you might come unstuck. For completeness’ sake, here are the Provider IDs returned when the CPI is deployed.

# kubectl describe nodes | grep "ProviderID"
ProviderID: vsphere://42247114-0280-f5a1-0e8b-d5118f0ff8fd
ProviderID: vsphere://42244b45-e7ec-18a3-8cf1-493e4c9d3780
ProviderID: vsphere://4224fde2-7fae-35f7-7083-7bf6eaafd3bb

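And here is a snippet from the vSphere CSI node driver registrar in v2.5. It looks up the node using an incorrectly formatted Provider ID and so is unable to find the Kubernetes node. Comparing the UUID in the error with the first Provider ID above shows that the bytes of the first three groups are reversed (42247114-0280-f5a1 becomes 14712442-8002-a1f5), while the last two groups are unchanged.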

 E0405 15:54:03.953091 1 main.go:122] Registration process failed with error: 
RegisterPlugin error -- plugin registration failed with err: rpc error: code = Internal desc 
= failed to retrieve topology information for Node: "k8s-controlplane-01". 
Error: "failed to retrieve nodeVM \"14712442-8002-a1f5-0e8b-d5118f0ff8fd\" using the node manager. 
Error: virtual machine wasn't found", restarting registration container.
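The relationship between the two IDs can be reproduced with Python's standard uuid module. This is only a sketch to illustrate the mismatch seen in the output above. It is not taken from the CSI driver source, and the function name is my own.

```python
import uuid


def swap_uuid_byte_order(value: str) -> str:
    """Re-read a UUID's 16 bytes with the first three fields byte-reversed."""
    as_written = uuid.UUID(value)
    # bytes_le treats the first three fields as little-endian, which
    # reverses their bytes relative to the canonical text form.
    return str(uuid.UUID(bytes_le=as_written.bytes))


if __name__ == "__main__":
    provider_id = "vsphere://42247114-0280-f5a1-0e8b-d5118f0ff8fd"
    node_uuid = provider_id.split("://", 1)[1]
    print(swap_uuid_byte_order(node_uuid))
    # prints 14712442-8002-a1f5-0e8b-d5118f0ff8fd
```

The printed value matches the UUID in the registrar error, which is why the lookup for the node VM fails with the v2.5 manifests.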

Let’s now take a look at the StorageClass, PVC, PV and Pod deployment on a vanilla Kubernetes deployment running on a vSAN Stretched Cluster. To recap, here are the relevant builds and versions that I have deployed.

vSphere 7.0U3d, build 19480866
Kubernetes upstream/vanilla version v1.21.3
vSphere CSI driver version 2.5.1

This is the vSAN Stretched Cluster configuration, with 4 ESXi hosts on each site:

Below is the common storage policy which implements cross-site replication. This will become the basis for the Kubernetes Storage Class in a moment.

This is the Kubernetes cluster deployment. I only have a single control plane node configured at present, but there are 6 worker nodes, 3 on the preferred data site and 3 on the secondary data site. These are positioned using Host and VM Affinity “Should” rules.

% kubectl get nodes
NAME                STATUS ROLES                AGE  VERSION
k8s-controlplane-01 Ready  control-plane,master 18h  v1.21.3
k8s-worker-01       Ready                       18h  v1.21.3
k8s-worker-02       Ready                       18h  v1.21.3
k8s-worker-03       Ready                       17h  v1.21.3
k8s-worker-04       Ready                       17h  v1.21.3
k8s-worker-05       Ready                       17h  v1.21.3
k8s-worker-06       Ready                       17h  v1.21.3

Now we can create a simple application. Here is the manifest that I am using to build the Storage Class (using the above vSphere storage policy), a PVC (Persistent Volume Claim) and a busybox Pod to use the volume. The PV (Persistent Volume) does not appear in the manifest; it is provisioned dynamically by the CSI driver to satisfy the claim. All YAML is placed in a single manifest for convenience. Note the Storage Class has a reference to the policy defined for vSAN stretched cluster placement, as seen above.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-vsan-sc
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
parameters:
  storagepolicyname: "StretchedClusterPolicy"
  csi.storage.k8s.io/fstype: "ext4"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csivsan-pvc-vsan-claim
spec:
  storageClassName: csi-vsan-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-vsan-pod
spec:
  containers:
  - name: busybox
    image: busybox
    volumeMounts:
    - name: csivsan-vol
      mountPath: "/demo"
    command: [ "sleep", "1000000" ]
  volumes:
    - name: csivsan-vol
      persistentVolumeClaim:
        claimName: csivsan-pvc-vsan-claim

Let’s deploy the manifest, and take a look at the resulting objects, paying particular attention to the underlying PV layout on a vSAN stretched cluster.

% kubectl apply -f demo-sc-pvc-pod-rwo-vsanstretched.yaml
storageclass.storage.k8s.io/csi-vsan-sc created
persistentvolumeclaim/csivsan-pvc-vsan-claim created
pod/csi-vsan-pod created


% kubectl get sc
NAME          PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
csi-vsan-sc   csi.vsphere.vmware.com   Delete          Immediate           true                   5s


% kubectl get pvc
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
csivsan-pvc-vsan-claim   Bound    pvc-cb93fc80-56a7-4fc5-830f-deeec758ff16   2Gi        RWO            csi-vsan-sc    9s


% kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                            STORAGECLASS   REASON   AGE
pvc-cb93fc80-56a7-4fc5-830f-deeec758ff16   2Gi        RWO            Delete           Bound    default/csivsan-pvc-vsan-claim   csi-vsan-sc             10s


% kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
csi-vsan-pod   1/1     Running   0          16s

Looks like everything has been created successfully. Let’s move to the vSphere client, where we can examine the metadata of the PV via CNS. If there are multiple datastores, administrators can filter on the list of datastores, and by choosing only the vSAN datastore from the stretched cluster, the volume is clearly seen.

If the details icon (2nd column) is selected, more detail about the volume can be seen, such as the storage policy (as chosen by the Storage Class) as well as the K8s node where the volume is attached.

The final question we need to answer is about placement. The PV object, backed by an object on the vSAN datastore, should be replicated across the two data sites of the vSAN stretched cluster. By checking the Physical Placement view, this can be confirmed. I added the blue and green edits below to make it clearer.

Using the above view, it is possible to conclude that the PV is both protected against failures within a site, and also protected across the data sites. Therefore it is capable of surviving multiple failures in a vSAN stretched cluster, although depending on the nature of the failure, there may be some period of outage experienced by the application as the control plane recovers after a majority loss. Note that the same holds true for applications running on the worker nodes. If the application has been deployed as a set of replicating pods which utilize a majority voting mechanism, it is only when the site with the majority of components fails that there is some outage. Note that any single outage contained within a site, e.g. a single host failure, should not impact the application or the K8s cluster in any way, as vSAN stretched cluster provides both in-site and cross-site protection.

One final point to consider is whether you have more than 2 sites available. If you have 3 or more sites that can be used, you might consider a multi-AZ approach where the Kubernetes cluster is distributed across the sites. Thus, a single site failure does not take out the majority of control plane components and does not lead to an outage. Multi-AZ / Topology is also supported by the vSphere CSI Driver. I took a look at it back when it was beta in vSphere CSI 2.2, but it is now fully GA. Just another option for you to consider when looking to deploy highly available Kubernetes clusters on top of vSphere infrastructure.
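As a rough illustration of why a third site helps, the Python sketch below compares a 2-site and a 3-site placement of 3 control plane nodes. It assumes a simple majority rule and hypothetical node counts per site, and the function name is my own.

```python
from typing import List


def sites_whose_loss_breaks_quorum(nodes_per_site: List[int]) -> List[int]:
    """Return the indexes of sites whose failure leaves fewer than a majority."""
    total = sum(nodes_per_site)
    needed = total // 2 + 1
    return [
        index
        for index, count in enumerate(nodes_per_site)
        if total - count < needed
    ]


if __name__ == "__main__":
    print(sites_whose_loss_breaks_quorum([2, 1]))     # two data sites: prints [0]
    print(sites_whose_loss_breaks_quorum([1, 1, 1]))  # three sites: prints []
```

With 2 sites, losing the site that holds 2 of the 3 nodes breaks the majority. With 1 node on each of 3 sites, no single site failure does.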