vSphere with Tanzu – Multi-Zone Preview
One of the most interesting announcements for me at VMware Explore 2022 was around the introduction of vSphere Zones. This feature, when it becomes available with vSphere 8.0, enables vSphere with Tanzu deployments to be rolled out across geographically dispersed vSphere clusters placed in separate racks in a single physical datacenter, as per the release notes. This provides an extra level of availability that wasn’t previously possible. This extra availability is not just for the Supervisor Cluster, but also for the Tanzu Kubernetes clusters deployed by the TKG service. And indeed, it provides additional availability to the applications running on those clusters. My colleagues, Jose Manzaneque and Alexander Ullah do a great job explaining the concept and the benefits in their VMware Explore 2022 session #KUBB1939USD – Achieving High Availability for Workloads Using Zones in Tanzu Kubernetes Grid, so give that a watch if you want to learn more. In this post, I will go through the sequence of steps on how to deploy vSphere with Tanzu in a multi-AZ environment using vSphere Zones. At a high level, this post will demonstrate how to:
- Check the requirements for using zones
- Create vSphere Zones
- Create a zonal storage policy
- Deploy a vSphere with Tanzu Supervisor cluster across zones
- Create a vSphere Namespace that is zonal aware
- Deploy a TKG across zones
- Deploy an application (stateful set) on TKG across zones
There is a lot to cover here, so let’s get on with it.
Note: I am using non-GA versions of software to build this post. The screenshots used here may change in the vSphere 8.0 launch, and in subsequent releases.
1. Deployment details
For this particular deployment, I have 3 vSphere clusters. These clusters are all managed by the same vCenter server, and appear under the same datacenter object in the vCenter inventory. Each vSphere cluster contains 3 x ESXi hosts and has vSAN enabled. Thus, each cluster has its own storage/vSAN datastore. Whilst this feature requires vCenter Server version 8.0.x, each ESXi host is running version 7.0U3g, build 20328353. This vSphere with Tanzu deployment is also using vSphere networking, not NSX-T. Therefore, a single distributed switch has been created, with distributed portgroups shared by all hosts in all clusters. I have also deployed a single NSX ALB (Advanced Load Balancer) to provide virtual IPs for the Kubernetes clusters as well as load balancer services.
2. Create vSphere Zones
vSphere Zones are configurable via the vCenter Server inventory object in the vSphere Client. Select the Configure tab, and you should observe a new entry in the navigation bar called vSphere Zones.
Click on the + to add a new vSphere Zone. For the purpose of deploying vSphere with Tanzu in a multi-AZ fashion, a total of three zones are created. Each Zone is associated with a cluster from the inventory.
With the Zones created, we can now turn our attention to the zonal storage policy.
3. Zonal Storage Policy
In essence, this step ensures that there is a common storage policy across all zones. Since all 3 vSphere clusters have vSAN enabled, I will set up a vSAN zonal policy. Note that the zones must be configured before creating the policy. Alternatively, a tag-based zonal policy could be used with VMFS datastores, as demonstrated in the VMware Explore session mentioned in the introduction to this post. [Update] I have been notified by the product management team that vVols (virtual volumes) are also fully supported in multi-AZ. To make the storage policy zone aware, a new Storage topology section has been added to the policy structure. Choosing this option implies that the storage policy will be topology aware, and that each zone should have the policy applied to its storage.
Proceeding with the policy setup, we arrive at the Storage topology type. In this case, it is of type Zonal, or in other words, related to Zones.
Complete the creation of the Zonal policy, ensuring that it matches a datastore in each of the three vSphere clusters. Again, in this example, these will be vSAN datastores, but they could also be vVol datastores or tagged VMFS datastores.
4. Supervisor Cluster deployment across Zones
We are now ready to deploy the Supervisor cluster across zones. The first thing you will notice is that there is a new Supervisor location section where you can choose the deployment method. Of course, the single cluster deployment method that we’ve had all along is still available, but a new vSphere Zone deployment option is now available for selection. If a vSphere Zone deployment is chosen, a name must be provided for the Supervisor. Next, select the datacenter, and finally the zones available in that datacenter.
The rest of the deployment is much the same as before. When it comes to selecting a storage policy, you may now choose the newly created zonal storage policy for the control plane VMs. Note: I am not sure this is strictly necessary – so long as the chosen storage policy is common across all three clusters, the supervisor cluster seems to deploy successfully.
The remainder of the deployment – management, load balancer and workload network settings, as well as content library selection – remains the same as before. Complete the setup and start the Supervisor cluster deployment.
New Supervisor cluster views
Something to highlight at this point is the considerable set of changes to the UI in vSphere 8 for multi-AZ vSphere with Tanzu deployments. The first of these is in the Workload Management > Supervisors view. In the Config Status tab, there is a link to a view option.
By clicking on the view option, you can see which stage the deployment has reached. There are a total of 16 tasks defined in the present UI. In the example below, the vSphere resources have been initialized, and the control plane VMs have been deployed. They are now in the configuring phase, where they will be added to the workload network. This is a very nice feature and should certainly assist both with troubleshooting and with figuring out the current state of the deployment.
When the Supervisor cluster has completed deployment, the tasks should look something like this.
And just like previous versions, the VIP address for the Supervisor cluster should now be displayed, and the status should be shown as Running in the Control Plane Node Address field.
Notice also that the navigation on the left-hand side has changed. In this release, it is Supervisor clusters that can be navigated; in previous versions, this navigation screen held Namespaces. Another important aspect of this change is that all Supervisor cluster management information has moved to this view under Workload Management. Therefore, to see Supervisor networking information, storage information, events, etc., you will now need to navigate to Workload Management and choose the appropriate Supervisor. This then allows you to query the various attributes of the Supervisor, as shown below.
Assuming everything has deployed successfully, your Supervisor cluster should resemble the following. Note that each zone has its own cluster object with 3 x ESXi hosts, as mentioned in the introduction. After deployment of the Supervisor cluster, each zone has a single SupervisorControlPlaneVM, meaning that if any one of the infrastructure zones has a failure, the Supervisor cluster can continue to function.
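If you prefer the command line, a quick sanity check is to log into the Supervisor context and list the nodes. This is just a sketch using the Supervisor address and context name from my environment; the exact node names and roles may vary in the final release.

% kubectl-vsphere login --insecure-skip-tls-verify --server=https://cjh-supervisor-01 \
  --vsphere-username administrator@vsphere.local
% kubectl config use-context cjh-supervisor-01
# Expect three SupervisorControlPlaneVMs listed as control plane nodes,
# one per vSphere Zone (the ESXi hosts also show up as agent/worker nodes).
% kubectl get nodes -o wide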
Let’s proceed with the remaining tasks, namely deploying a Tanzu Kubernetes cluster, and an application on the cluster, both of which can also tolerate failures.
5. Create a vSphere Namespace
Creating a namespace is almost identical to previous versions, except this time, the Supervisor must be chosen when creating the namespace.
The only other difference from previous versions to note is that the storage policy for the namespace must be zonal, i.e. the storage policy that we created in step 3. Here is an example of a configured namespace.
We now have a namespace that is ready to be used for deploying a Tanzu Kubernetes Cluster.
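Before moving on, a quick check worth doing (a sketch, assuming the namespace is called cormac-ns as in my environment) is to log into the namespace context and confirm that the zonal storage policy has been surfaced for consumption:

% kubectl config use-context cormac-ns
# The zonal policy assigned to the namespace should appear as a storage quota
# entry, e.g. multi-az-storage-policy.storageclass.storage.k8s.io/requests.storage.
% kubectl describe namespace cormac-ns
# It should also be listed as a StorageClass that TKC manifests can reference.
% kubectl get sc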
6. Deploy a TKG across zones
The first thing to point out is that there is a new API version for Tanzu Kubernetes Clusters (TKC) called v1alpha3. This includes the ability to do multi-AZ deployments of TKCs. Below is a manifest for such a multi-AZ TKC. Note that this manifest layout could change at GA. Note also that it is using a version of Kubernetes (v1.23.8) that is not currently available, but should become available when vSphere 8.0 is GA. This version is needed for multi-AZ support. I have marked each of the node pools in a different colour so you can see that the failureDomain in each node pool references one of the different vSphere Zones that were created in step 2. This is how the worker nodes are placed into the different zones.
% cat tanzucluster-multi-az-v1alpha3-v1.23.8-zs-3+3.yaml
apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
  name: tkg-multiaz-v1-23-8
  namespace: cormac-ns
spec:
  topology:
    controlPlane:
      replicas: 3
      vmClass: guaranteed-small
      storageClass: multi-az-storage-policy
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
    nodePools:
    - name: node-pool-1
      replicas: 1
      failureDomain: zone-1
      vmClass: guaranteed-small
      storageClass: multi-az-storage-policy
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
    - name: node-pool-2
      replicas: 1
      failureDomain: zone-2
      vmClass: guaranteed-small
      storageClass: multi-az-storage-policy
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
    - name: node-pool-3
      replicas: 1
      failureDomain: zone-3
      vmClass: guaranteed-small
      storageClass: multi-az-storage-policy
      tkr:
        reference:
          name: v1.23.8---vmware.2-tkg.2-zshippable
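Before applying a manifest like this, it may be worth confirming which Tanzu Kubernetes releases have been synced from the content library, and that a v1.23.8 (or later) release is present, since that is what provides multi-AZ support. A minimal sketch:

# List the available Tanzu Kubernetes releases in the Supervisor namespace context.
# Look for v1.23.8---vmware.2-tkg.2-zshippable (or later) in the output.
% kubectl get tanzukubernetesreleases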
In the next step, I will deploy the TKC. Then we will examine it to verify that it has indeed deployed into different zones. We do that by switching into the TKC context after deployment and checking the labels associated with the nodes. These labels should reflect the zones in which the node has been placed. I have shaded them to match the YAML manifest so that they are easy to see.
% kubectl apply -f tanzucluster-multi-az-v1alpha3-v1.23.8-zs-3+3.yaml
tanzukubernetescluster.run.tanzu.vmware.com/tkg-multiaz-v1-23-8 created

% kubectl get tkc
NAME                  CONTROL PLANE   WORKER   TKR NAME                              AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
tkg-multiaz-v1-23-8   3               3        v1.23.8---vmware.2-tkg.2-zshippable   11m   True    True

% kubectl-vsphere login --insecure-skip-tls-verify --server=https://cjh-supervisor-01 \
  --vsphere-username administrator@vsphere.local --tanzu-kubernetes-cluster-namespace cormac-ns \
  --tanzu-kubernetes-cluster-name tkg-multiaz-v1-23-8

Logged in successfully.

You have access to the following contexts:
   cjh-supervisor-01
   cormac-ns
   tkg-multiaz-v1-23-8

If the context you wish to use is not in this list, you may need to try logging in again later, or contact your cluster administrator.

To change context, use `kubectl config use-context <workload name>`

% kubectl config use-context tkg-multiaz-v1-23-8
Switched to context "tkg-multiaz-v1-23-8".

% kubectl get nodes
NAME                                                     STATUS   ROLES                  AGE     VERSION
tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8   Ready    <none>                 6m48s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2    Ready    <none>                 7m18s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf   Ready    <none>                 8m20s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-xm8tp-55kqb                          Ready    control-plane,master   10m     v1.23.8+vmware.2
tkg-multiaz-v1-23-8-xm8tp-db7zl                          Ready    control-plane,master   5m50s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-xm8tp-dwcwq                          Ready    control-plane,master   102s    v1.23.8+vmware.2

% kubectl get nodes --show-labels
NAME                                                     STATUS   ROLES                  AGE     VERSION            LABELS
tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8   Ready    <none>                 7m      v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8,kubernetes.io/os=linux,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-1
tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2    Ready    <none>                 7m30s   v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2,kubernetes.io/os=linux,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-2
tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf   Ready    <none>                 8m32s   v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf,kubernetes.io/os=linux,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-3
tkg-multiaz-v1-23-8-xm8tp-55kqb                          Ready    control-plane,master   10m     v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-xm8tp-55kqb,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-2
tkg-multiaz-v1-23-8-xm8tp-db7zl                          Ready    control-plane,master   6m2s    v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-xm8tp-db7zl,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-3
tkg-multiaz-v1-23-8-xm8tp-dwcwq                          Ready    control-plane,master   114s    v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-xm8tp-dwcwq,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-1
This looks good. Each of the nodes has been deployed in the correct zone and has the correct labels associated with it. To prove it, we can check the vSphere UI. Under the cormac-ns namespace (which appears in each zone) and the TKC cluster (which also appears in each zone), the nodes are visible. Each zone has both a control plane node and a worker/node pool node.
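A more compact way to see the same zone placement from the command line, rather than wading through the full --show-labels output, is to print the zone label as an extra column (standard kubectl behaviour, nothing TKC-specific):

# Adds a ZONE column populated from each node's topology.kubernetes.io/zone label.
% kubectl get nodes -L topology.kubernetes.io/zone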
So compute seems to be correctly configured. What about storage? Well, the topology aware features of the vSphere CSI driver are now included in this latest version of TKC. Thus, if the csinodes are queried, we can also see that they have the appropriate topology keys added. This means that storage placement across zones should also be possible. If the topology keys are not present when you describe the CSI node, it is possible that the version of Kubernetes that you deployed in the TKC does not support topology zones. At the time of writing, v1.23.8 (or higher) is required.
% kubectl describe csinode tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8
Name:               tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins:
                      kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/cinder,kubernetes.io/gce-pd
CreationTimestamp:  Mon, 29 Aug 2022 16:10:51 +0100
Spec:
  Drivers:
    csi.vsphere.vmware.com:
      Node ID:  tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8
      Allocatables:
        Count:        59
      Topology Keys:  [topology.kubernetes.io/zone]
Events:               <none>
With the TKC successfully deployed across zones, if there is some infrastructure failure, not only will the Supervisor cluster be able to tolerate it, but the Tanzu Kubernetes workload clusters will be able to tolerate it as well. We are now ready for the final step, which is the deployment of an application across zones on the TKC cluster.
7. Deploy an application across zones
The application that is about to be deployed is a StatefulSet, meaning that it has both compute and persistent storage requirements. It has a requirement to deploy 3 replicas. Each replica will be deployed in a different zone for availability. Each pod will contain one container (nginx) with two volumes, www (2Gi) and logs (1Gi). This is the manifest for the application. The important sections for multi-AZ are under affinity: both nodeAffinity and podAntiAffinity are highlighted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-1
                - zone-2
                - zone-3
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: topology.kubernetes.io/zone
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
        - name: logs
          mountPath: /logs
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: multi-az-storage-policy-latebinding
      resources:
        requests:
          storage: 2Gi
  - metadata:
      name: logs
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: multi-az-storage-policy-latebinding
      resources:
        requests:
          storage: 1Gi
The nodeAffinity section restricts pod scheduling to nodes whose topology.kubernetes.io/zone label is one of zone-1, zone-2 or zone-3. This means that when it comes to scheduling a pod, Kubernetes knows it can pick any worker node carrying one of those zone labels (although in my example, there is only one worker node in each zone).
The podAntiAffinity section makes sure that pods with the label app: nginx (which our pods have in this example) are not scheduled in the same zone. Again, the label/key topology.kubernetes.io/zone is used to determine where to schedule a pod, so that each replica/pod is scheduled in a different zone. Since we have 3 zones, this is fine for deploying our StatefulSet with 3 replicas.
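One design choice worth calling out: because the manifest uses requiredDuringSchedulingIgnoredDuringExecution, a fourth replica would remain Pending, as there is no fourth zone to place it in. If you need more replicas than zones, a softer spread can be expressed with preferredDuringSchedulingIgnoredDuringExecution instead. The snippet below is an illustrative sketch of that variant, not part of the manifest used in this post:

      affinity:
        podAntiAffinity:
          # Prefer, but do not require, spreading nginx pods across zones.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - nginx
              topologyKey: topology.kubernetes.io/zone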
But, once again, what about the storage? Note that the storageClassName is not the same as the one associated with the cormac-ns namespace. Where did it come from? Let’s look at the storage classes, and in particular the volume binding mode.
% kubectl get sc
NAME                                  PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
multi-az-storage-policy               csi.vsphere.vmware.com   Delete          Immediate              true                   12m
multi-az-storage-policy-latebinding   csi.vsphere.vmware.com   Delete          WaitForFirstConsumer   true                   12m
The first storage class has a volume binding mode of Immediate. This means that the PV is dynamically provisioned as soon as the PVC is created, and only then is the pod scheduled. The PV is then attached to the worker node where the pod is scheduled, and the kubelet formats the volume and mounts it into the pod/containers. This is not a problem when all Kubernetes nodes share the same datastore. However, if the PV was created in a different zone to the pod, and each zone has its own “local” storage, then we would have a problem and the pod could not be scheduled. This is where the WaitForFirstConsumer volume binding mode comes in. As the name implies, it holds off the PV creation until the pod that uses the PVC is scheduled. The PV is then provisioned in the same zone as the pod (adhering to the same topology), and thus no issues arise with scheduling or placement. It is the latter storage class that is used in this example for this very reason.
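For reference, the latebinding class looks something like the sketch below. This is illustrative only – in vSphere with Tanzu both classes are created automatically from the zonal storage policy assigned to the namespace, and the policy-specific parameters are omitted here:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: multi-az-storage-policy-latebinding
provisioner: csi.vsphere.vmware.com
reclaimPolicy: Delete
allowVolumeExpansion: true
# Volume provisioning is delayed until a pod using the PVC is scheduled,
# so the PV can be placed in the same zone as that pod.
volumeBindingMode: WaitForFirstConsumer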
OK – let’s deploy the application, and take a look at the related objects. Then we will check where the pods were scheduled.
% kubectl apply -f multi-az-sts-wffc.yaml
statefulset.apps/web created

% kubectl get sts,pods,pvc,pv
NAME                   READY   AGE
statefulset.apps/web   3/3     6m31s

NAME        READY   STATUS    RESTARTS   AGE
pod/web-0   1/1     Running   0          63s
pod/web-1   1/1     Running   0          43s
pod/web-2   1/1     Running   0          23s

NAME                               STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                          AGE
persistentvolumeclaim/logs-web-0   Bound    pvc-bcbccbcd-d721-40f0-8389-6b3dd4df8de5   1Gi        RWO            multi-az-storage-policy-latebinding   63s
persistentvolumeclaim/logs-web-1   Bound    pvc-07a4bd30-ad77-4c73-afaf-9770d0d03d72   1Gi        RWO            multi-az-storage-policy-latebinding   43s
persistentvolumeclaim/logs-web-2   Bound    pvc-249a42a0-f779-454f-a461-00636824054f   1Gi        RWO            multi-az-storage-policy-latebinding   23s
persistentvolumeclaim/www-web-0    Bound    pvc-49efeaf5-cae3-4cbc-ad55-47a1ad2d02a4   2Gi        RWO            multi-az-storage-policy-latebinding   63s
persistentvolumeclaim/www-web-1    Bound    pvc-c614e2f9-201a-4074-9967-89350fc8cbc8   2Gi        RWO            multi-az-storage-policy-latebinding   43s
persistentvolumeclaim/www-web-2    Bound    pvc-77533266-c508-42d7-aad0-9c33aa530159   2Gi        RWO            multi-az-storage-policy-latebinding   23s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                STORAGECLASS                          REASON   AGE
persistentvolume/pvc-07a4bd30-ad77-4c73-afaf-9770d0d03d72   1Gi        RWO            Delete           Bound    default/logs-web-1   multi-az-storage-policy-latebinding            38s
persistentvolume/pvc-249a42a0-f779-454f-a461-00636824054f   1Gi        RWO            Delete           Bound    default/logs-web-2   multi-az-storage-policy-latebinding            18s
persistentvolume/pvc-49efeaf5-cae3-4cbc-ad55-47a1ad2d02a4   2Gi        RWO            Delete           Bound    default/www-web-0    multi-az-storage-policy-latebinding            58s
persistentvolume/pvc-77533266-c508-42d7-aad0-9c33aa530159   2Gi        RWO            Delete           Bound    default/www-web-2    multi-az-storage-policy-latebinding            18s
persistentvolume/pvc-bcbccbcd-d721-40f0-8389-6b3dd4df8de5   1Gi        RWO            Delete           Bound    default/logs-web-0   multi-az-storage-policy-latebinding            58s
persistentvolume/pvc-c614e2f9-201a-4074-9967-89350fc8cbc8   2Gi        RWO            Delete           Bound    default/www-web-1    multi-az-storage-policy-latebinding            38s

% kubectl describe pod web-0 | grep Node:
Node:         tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2/192.168.41.113

% kubectl describe pod web-1 | grep Node:
Node:         tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8/192.168.41.114

% kubectl describe pod web-2 | grep Node:
Node:         tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf/192.168.41.112
This all looks good. The pods, PVCs and PVs are successfully deployed. The pods have been placed on different worker nodes, each in a different zone. This also indicates that the WaitForFirstConsumer volume binding mode worked as expected, provisioning the PVs only after the pods were scheduled. We have achieved what we set out to do – deploy a cloud native application on a Tanzu Kubernetes cluster and a Supervisor cluster, all of which can tolerate failures in the infrastructure.
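As a final check, you can confirm that the PVs themselves carry zone topology by looking at their node affinity. This is a sketch, assuming the topology-aware CSI driver has stamped each PV with a topology.kubernetes.io/zone requirement, as the csinode output earlier suggests it should:

# Each PV should show a Node Affinity / Required Terms entry referencing
# topology.kubernetes.io/zone, matching the zone of the pod that uses it.
% kubectl describe pv pvc-49efeaf5-cae3-4cbc-ad55-47a1ad2d02a4 | grep -A5 "Node Affinity"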