vSphere with Tanzu – Multi-Zone Preview

One of the most interesting announcements for me at VMware Explore 2022 was the introduction of vSphere Zones. This feature, when it becomes available with vSphere 8.0, enables vSphere with Tanzu deployments to be rolled out across vSphere clusters placed in separate racks in a single physical datacenter, as per the release notes. This provides an extra level of availability that wasn’t previously possible. This extra availability is not just for the Supervisor Cluster, but also for the Tanzu Kubernetes clusters deployed by the TKG service, and indeed for the applications running on those clusters. My colleagues, Jose Manzaneque and Alexander Ullah, do a great job explaining the concept and the benefits in their VMware Explore 2022 session KUBB1939USD, “Achieving High Availability for Workloads Using Zones in Tanzu Kubernetes Grid”, so give that a watch if you want to learn more.

In this post, I will go through the sequence of steps needed to deploy vSphere with Tanzu in a multi-AZ environment using vSphere Zones. At a high level, this post will demonstrate how to:

  1. Check the requirements for using zones
  2. Create vSphere Zones
  3. Create a zonal storage policy
  4. Deploy a vSphere with Tanzu Supervisor cluster across zones
  5. Create a vSphere Namespace that is zonal aware
  6. Deploy a TKG cluster across zones
  7. Deploy an application (stateful set) on TKG across zones

There is a lot to cover here, so let’s get on with it.

Note: I am using non-GA versions of software to build this post. The screenshots used here may change in the vSphere 8.0 launch, and in subsequent releases.

1. Deployment details

For this particular deployment, I have 3 vSphere clusters. These clusters are all managed by the same vCenter server, and appear under the same datacenter object in the vCenter inventory. Each vSphere cluster contains 3 x ESXi hosts and has vSAN enabled. Thus, each cluster has its own vSAN datastore. Whilst this requires vCenter server version 8.0.x, each ESXi host is running version 7.0U3g, build 20328353. This vSphere with Tanzu deployment is also using vSphere networking, not NSX-T. Therefore, a single distributed switch has been created, with distributed portgroups shared by all hosts in all clusters. I have also deployed a single NSX ALB (Advanced Load Balancer) to provide virtual IPs for the Kubernetes clusters as well as for load balancer services.

2. Create vSphere Zones

vSphere Zones are configurable via the vCenter Server inventory object in the vSphere Client. Select the Configure tab, and you should observe a new entry in the navigation bar called vSphere Zones.

Click on the + to add a new vSphere Zone. For the purpose of deploying vSphere with Tanzu in a multi-AZ fashion, a total of three zones are created. Each Zone is associated with a cluster from the inventory.

With the Zones created, we can now turn our attention to the zonal storage policy.

3. Zonal Storage Policy

In essence, this step ensures that there is a common storage policy across all zones. Since I have a vSAN datastore in all 3 vSphere clusters, I will set up a vSAN zonal policy. Note that the zones must be configured before creating the policy. Alternatively, a tag-based zonal policy could be used with VMFS datastores, as demonstrated in the VMware Explore session mentioned in the introduction to this post. [Update] I have been notified by the product management team that vVols (virtual volumes) is also fully supported in multi-AZ. To make the storage policy zone aware, a new Storage topology section has been added to the policy structure. Choosing this option implies that the storage policy will be topology aware, and that each zone should have the policy applied to its storage.

Proceeding with the policy setup, we arrive at the Storage topology type. In this case, it is of type Zonal, or in other words, related to Zones.

Complete the creation of the Zonal policy, ensuring that it matches a datastore in each of the three vSphere clusters. Again, in this example, these will be vSAN datastores, but they could also be vVol datastores or tagged VMFS datastores.

4. Supervisor Cluster deployment across Zones

We are now ready to deploy the Supervisor cluster across zones. The first thing you will notice is that there is a new Supervisor location section where you can choose the deployment method. Of course, the single cluster deployment method that we’ve had all along is still available, but a new vSphere zone deployment is now available for selection. If a vSphere zone deployment is chosen, a name must be provided for the Supervisor. Next, select the datacenter, and finally the zones available in that datacenter.

The rest of the deployment is much the same as before. When it comes to selecting a storage policy, you may now choose the newly created zonal storage policy for the control plane VMs. Note: I am not sure this is strictly necessary – so long as the chosen storage policy is common across all three clusters, the supervisor cluster seems to deploy successfully.

The remainder of the deployment (management, load balancer and workload network settings, as well as content library selection) remains the same as before. Complete the setup and set the Supervisor cluster deploying.

New Supervisor cluster views

Worth highlighting at this point are the considerable changes to the UI in vSphere 8 for multi-AZ vSphere with Tanzu deployments. The first of these is in the Workload Management > Supervisors view. In the Config Status tab, there is a link to a view option.

By clicking on the view option, you can see which stage the deployment has reached. There are a total of 16 tasks defined in the present UI. In the example below, the vSphere resources have been initialized, and the control plane VMs have been deployed. They are now in the configuring phase where they will be added to the workload network. This is a very nice feature and should certainly assist with both troubleshooting and figuring out the current state of the deployment.

When the Supervisor cluster has completed deployment, the tasks should look something like this.

And just like previous versions, the VIP address for the Supervisor cluster should now be displayed under Control Plane Node Address, and the status should be shown as Running.

Now notice that there is a different navigation on the left hand side. In this release, it is Supervisor clusters that can be navigated. In previous versions, this navigation screen held Namespaces. Another important aspect of this change is that all Supervisor cluster management information is moved to this view under Workload Management. Therefore, to see Supervisor networking information, storage information, events, etc, you will now need to navigate to Workload Management and choose the appropriate Supervisor. This then allows you to query the various attributes of the Supervisor, as shown below.

Assuming everything has deployed successfully, your Supervisor cluster should resemble the following. Note that each zone has its own cluster object with 3 ESXi hosts, as mentioned in the deployment details. After deployment of the Supervisor cluster, each zone has a single SupervisorControlPlaneVM, meaning that if any one of the infrastructure zones has a failure, the Supervisor cluster can continue to function.
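The reasoning behind "one control plane VM per zone" can be sketched with a little arithmetic. This is only an illustration, and it rests on an assumption that is not stated in this post: that the control plane needs a strict majority of its members to stay up, as etcd-based Kubernetes control planes typically do. The function name is made up for this example.

```python
from collections import Counter


def survives_any_single_zone_failure(control_plane_zones: list[str]) -> bool:
    """Return True if a majority of members remains after losing any one zone.

    control_plane_zones lists the zone of each control plane VM, e.g.
    ["zone-1", "zone-2", "zone-3"] for one VM per zone.
    Assumption: the control plane needs a strict majority of its members.
    """
    total = len(control_plane_zones)
    quorum = total // 2 + 1  # strict majority of the original membership
    per_zone = Counter(control_plane_zones)
    # Losing a zone removes every member that was placed in it.
    return all(total - lost >= quorum for lost in per_zone.values())


# One SupervisorControlPlaneVM per zone, as in this deployment.
print(survives_any_single_zone_failure(["zone-1", "zone-2", "zone-3"]))  # True
# All three VMs in the same zone: losing that zone loses everything.
print(survives_any_single_zone_failure(["zone-1", "zone-1", "zone-1"]))  # False
```

With three members spread over three zones, losing any one zone still leaves two of three members, which is why the layout shown here can ride out a single zone failure under that assumption.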

Let’s proceed with the remaining tasks, namely deploying a Tanzu Kubernetes cluster, and an application on the cluster, both of which can also tolerate failures.

5. Create a vSphere Namespace

Creating a namespace is almost identical to previous versions, except this time, the Supervisor must be chosen when creating the namespace.

The only other step to note that is different to previous versions is that the storage policy for the namespace must be zonal, i.e. the storage policy that we created in step 3. Here is an example of a configured namespace.

We now have a namespace that is ready to be used for deploying a Tanzu Kubernetes Cluster.

6. Deploy a TKG cluster across zones

The first thing to point out is that there is a new API version for Tanzu Kubernetes Clusters (TKC) called v1alpha3. This includes the ability to do multi-AZ deployments of TKCs. Below is a manifest for such a multi-AZ TKC. Note that this manifest layout could change at GA. Note also that it is using a version of Kubernetes (v1.23.8) that is not currently available but should become available when vSphere 8.0 is GA. This version is needed for multi-AZ support. Each of the three node pools has a failureDomain entry that references one of the vSphere zones created in step 2. This is how the worker nodes are placed into the different zones.

% cat tanzucluster-multi-az-v1alpha3-v1.23.8-zs-3+3.yaml

apiVersion: run.tanzu.vmware.com/v1alpha3
kind: TanzuKubernetesCluster
metadata:
 name: tkg-multiaz-v1-23-8
 namespace: cormac-ns
spec:
 topology:
   controlPlane:
     replicas: 3
     vmClass: guaranteed-small
     storageClass: multi-az-storage-policy
     tkr:
       reference:
         name: v1.23.8---vmware.2-tkg.2-zshippable
   nodePools:
   - name: node-pool-1
     replicas: 1
     failureDomain: zone-1
     vmClass: guaranteed-small
     storageClass: multi-az-storage-policy
     tkr:
       reference:
         name: v1.23.8---vmware.2-tkg.2-zshippable
   - name: node-pool-2
     replicas: 1
     failureDomain: zone-2
     vmClass: guaranteed-small
     storageClass: multi-az-storage-policy
     tkr:
       reference:
         name: v1.23.8---vmware.2-tkg.2-zshippable
   - name: node-pool-3
     replicas: 1
     failureDomain: zone-3
     vmClass: guaranteed-small
     storageClass: multi-az-storage-policy
     tkr:
       reference:
         name: v1.23.8---vmware.2-tkg.2-zshippable

In the next step, I will deploy the TKC. Then we will examine it to verify that it has indeed deployed into different zones. We do that by switching into the TKC context after deployment and checking the labels associated with the nodes. These labels should reflect the zone in which each node has been placed. The one to look for is topology.kubernetes.io/zone, which appears at the end of each line of label output.

% kubectl apply -f tanzucluster-multi-az-v1alpha3-v1.23.8-zs-3+3.yaml
tanzukubernetescluster.run.tanzu.vmware.com/tkg-multiaz-v1-23-8 created

% kubectl get tkc
NAME                  CONTROL PLANE   WORKER   TKR NAME                              AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
tkg-multiaz-v1-23-8   3               3        v1.23.8---vmware.2-tkg.2-zshippable   11m   True    True

% kubectl-vsphere login --insecure-skip-tls-verify --server=https://cjh-supervisor-01 \
--vsphere-username administrator@vsphere.local --tanzu-kubernetes-cluster-namespace cormac-ns \
--tanzu-kubernetes-cluster-name tkg-multiaz-v1-23-8

Logged in successfully.

You have access to the following contexts:

If the context you wish to use is not in this list, you may need to try
logging in again later, or contact your cluster administrator.

To change context, use `kubectl config use-context <workload name>`

% kubectl config use-context tkg-multiaz-v1-23-8
Switched to context "tkg-multiaz-v1-23-8".

% kubectl get nodes
NAME                                                     STATUS   ROLES                  AGE     VERSION
tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8   Ready    <none>                 6m48s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2    Ready    <none>                 7m18s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf   Ready    <none>                 8m20s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-xm8tp-55kqb                          Ready    control-plane,master   10m     v1.23.8+vmware.2
tkg-multiaz-v1-23-8-xm8tp-db7zl                          Ready    control-plane,master   5m50s   v1.23.8+vmware.2
tkg-multiaz-v1-23-8-xm8tp-dwcwq                          Ready    control-plane,master   102s    v1.23.8+vmware.2

% kubectl get nodes --show-labels
NAME                                                     STATUS   ROLES                  AGE     VERSION            LABELS
tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8   Ready    <none>                 7m      v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8,kubernetes.io/os=linux,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-1
tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2    Ready    <none>                 7m30s   v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2,kubernetes.io/os=linux,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-2
tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf   Ready    <none>                 8m32s   v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf,kubernetes.io/os=linux,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-3
tkg-multiaz-v1-23-8-xm8tp-55kqb                          Ready    control-plane,master   10m     v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-2,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-xm8tp-55kqb,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-2
tkg-multiaz-v1-23-8-xm8tp-db7zl                          Ready    control-plane,master   6m2s    v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-3,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-xm8tp-db7zl,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-3
tkg-multiaz-v1-23-8-xm8tp-dwcwq                          Ready    control-plane,master   114s    v1.23.8+vmware.2   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/zone=zone-1,kubernetes.io/arch=amd64,kubernetes.io/hostname=tkg-multiaz-v1-23-8-xm8tp-dwcwq,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=,run.tanzu.vmware.com/kubernetesDistributionVersion=v1.23.8---vmware.2-tkg.2-zshippable,run.tanzu.vmware.com/tkr=v1.23.8---vmware.2-tkg.2-zshippable,topology.kubernetes.io/zone=zone-1

This looks good. Each of the nodes has been deployed in the correct zone, and has the correct labels associated with it. To prove it, we can check in the vSphere UI. Under the cormac-ns namespace (which appears in each zone), and the TKC cluster (which also appears in each zone) the nodes are visible. Each zone has both a control plane node and a worker/node pool node.
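Reading the zone out of long label strings by eye gets tedious as clusters grow. As a rough sketch, the same check could be scripted against the JSON form of the node list (`kubectl get nodes -o json`). The helper name `nodes_by_zone` is hypothetical, and the sample below is a hand-trimmed stand-in built from three node names and zone labels shown in the output above, not real command output.

```python
import json
from collections import defaultdict

ZONE_LABEL = "topology.kubernetes.io/zone"


def nodes_by_zone(node_list_json: str) -> dict[str, list[str]]:
    """Group node names by zone label, given `kubectl get nodes -o json` text."""
    zones = defaultdict(list)
    for item in json.loads(node_list_json)["items"]:
        meta = item["metadata"]
        # Nodes without the label are grouped under a placeholder key.
        zone = meta.get("labels", {}).get(ZONE_LABEL, "<no zone label>")
        zones[zone].append(meta["name"])
    return {zone: sorted(names) for zone, names in sorted(zones.items())}


# Hand-trimmed sample: only the fields the helper reads are included.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8",
                      "labels": {ZONE_LABEL: "zone-1"}}},
        {"metadata": {"name": "tkg-multiaz-v1-23-8-xm8tp-dwcwq",
                      "labels": {ZONE_LABEL: "zone-1"}}},
        {"metadata": {"name": "tkg-multiaz-v1-23-8-xm8tp-55kqb",
                      "labels": {ZONE_LABEL: "zone-2"}}},
    ]
})

for zone, names in nodes_by_zone(sample).items():
    print(zone, names)
```

In practice the JSON text would come from piping the kubectl output into the script rather than from an inline sample.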

So compute seems to be correctly configured. What about storage? Well, the topology aware features of the vSphere CSI driver are now included in this latest version of TKC. Thus, if the csinodes are queried, we can also see that they have the appropriate topology keys added. This means that storage placement across zones should also be possible. If the topology keys are not present when you describe the CSI node, it is possible that the version of K8s that you deployed in TKC does not support topology zones. At the time of writing, v1.23.8 (or higher) is required.

% kubectl describe csinode tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8
Name:               tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/cinder,kubernetes.io/gce-pd
CreationTimestamp:  Mon, 29 Aug 2022 16:10:51 +0100
Spec:
  Drivers:
    csi.vsphere.vmware.com:
      Node ID:  tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8
      Allocatables:
        Count:        59
      Topology Keys:  [topology.kubernetes.io/zone]
Events:               <none>

With the TKC successfully deployed across zones, if there is some infrastructure failure, not only will the Supervisor cluster be able to tolerate it, but the Tanzu Kubernetes workload clusters will be able to tolerate it as well. We are now ready for the final step, which is the deployment of an application across zones on the TKC cluster.

7. Deploy an application across zones

The application that is about to be deployed is a StatefulSet, meaning that it has both compute and persistent storage requirements. It has a requirement to deploy 3 replicas. Each replica will be deployed in a different zone for availability. Each pod will contain one container (nginx) with two volumes, www (2Gi) and logs (1Gi). This is the manifest for the application. The important sections for multi-AZ are under affinity, namely nodeAffinity and podAntiAffinity.

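apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-1
                - zone-2
                - zone-3
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: topology.kubernetes.io/zone
      containers:
        - name: nginx
          image: gcr.io/google_containers/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
            - name: logs
              mountPath: /logs
  volumeClaimTemplates:
    - metadata:
        name: www
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: multi-az-storage-policy-latebinding
        resources:
          requests:
            storage: 2Gi
    - metadata:
        name: logs
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: multi-az-storage-policy-latebinding
        resources:
          requests:
            storage: 1Gi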

The nodeAffinity section restricts scheduling to nodes whose topology.kubernetes.io/zone label has one of the listed values: zone-1, zone-2 or zone-3. This means that when it comes to scheduling a Pod, Kubernetes knows it can pick any worker node that carries one of those zone labels (although in my example, there is only one worker node in every zone).

The podAntiAffinity section is making sure that the pods that have the label app:nginx (which our pods have in this example) are not run in the same zone. Again, the label/key topology.kubernetes.io/zone is used to determine where to schedule a pod, so that each replica/pod is scheduled in a different zone. Since we have 3 zones, this should be fine to deploy our stateful set with 3 replicas.
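To make the combined effect of the two rules concrete, here is a toy model of the placement decision. It is a deliberately simplified sketch, not how the Kubernetes scheduler is implemented: the function name and the short node names are invented for illustration, and the "pick the first candidate alphabetically" choice is an assumption made only to keep the output deterministic. The real scheduler scores candidates and may pick any of them.

```python
from typing import Optional


def schedule_replicas(replicas: int, node_zones: dict[str, str],
                      allowed_zones: set[str]) -> list[Optional[str]]:
    """Place replicas one at a time; return the chosen node (or None) for each.

    nodeAffinity: only nodes whose zone is in allowed_zones are candidates.
    podAntiAffinity with the zone as topologyKey: a zone that already hosts
    a replica is excluded for every later replica.
    """
    used_zones: set[str] = set()
    placements: list[Optional[str]] = []
    for _ in range(replicas):
        candidates = sorted(
            node for node, zone in node_zones.items()
            if zone in allowed_zones and zone not in used_zones
        )
        if not candidates:
            placements.append(None)  # this pod would stay Pending
            continue
        chosen = candidates[0]  # toy tie-break, not the real scoring logic
        used_zones.add(node_zones[chosen])
        placements.append(chosen)
    return placements


# Hypothetical short names for the three worker nodes, one per zone.
workers = {"node-pool-1": "zone-1", "node-pool-2": "zone-2", "node-pool-3": "zone-3"}
zones = {"zone-1", "zone-2", "zone-3"}

print(schedule_replicas(3, workers, zones))
# ['node-pool-1', 'node-pool-2', 'node-pool-3']
print(schedule_replicas(4, workers, zones))
# ['node-pool-1', 'node-pool-2', 'node-pool-3', None]
```

The second call shows the flip side of a required anti-affinity rule: with three zones, a fourth replica has nowhere to go.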

But, once again, what about the storage? Note that the storageClassName is not the same as the one associated with the cormac-ns namespace. Where did it come from? Let’s look at the storage classes, and in particular the volume binding mode.

% kubectl get sc
NAME                                  PROVISIONER              RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
multi-az-storage-policy               csi.vsphere.vmware.com   Delete          Immediate              true                   12m
multi-az-storage-policy-latebinding   csi.vsphere.vmware.com   Delete          WaitForFirstConsumer   true                   12m

The first storage class has a volume binding mode of Immediate. This means that the PV is dynamically provisioned when the PVC is created, and then the pod is scheduled. The PV is then attached to the worker node where the pod is scheduled, and the kubelet formats the volume and mounts it to the pod/containers. This is not a problem when all K8s nodes share the same shared storage/datastore. However, if the PV was created in a different zone from the pod, and each zone has its own “local” storage, then we would have a problem and the pod can’t be scheduled. This is where the WaitForFirstConsumer volume binding mode comes in. As the name implies, it holds off the PV creation until the pod that uses the PVC is scheduled. The PV is then provisioned in the same zone as the pod (adhering to the same topology), and thus no issues arise with scheduling or placement. It is the latter storage class that is used in this example for this very reason.
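The difference between the two binding modes comes down to which decision is made first. The following toy model captures just that ordering. It assumes, as in this environment, that a PV is only reachable from nodes in its own zone. Both function names are invented for the example, and nothing here calls Kubernetes or the CSI driver.

```python
from typing import Optional


def pod_zone_with_immediate(pv_zones: dict[str, str]) -> Optional[str]:
    """Immediate: the PVs already exist, so the pod has to follow them.

    pv_zones maps each claim name to the zone where its PV was provisioned.
    The pod is schedulable only if all of its PVs ended up in one zone.
    """
    zones = set(pv_zones.values())
    return zones.pop() if len(zones) == 1 else None


def pv_zones_with_wffc(pod_zone: str, claims: list[str]) -> dict[str, str]:
    """WaitForFirstConsumer: the pod is placed first and the PVs follow it."""
    return {claim: pod_zone for claim in claims}


# Immediate: the two PVs for web-0 happened to land in different zones,
# so no single node can reach both of them.
print(pod_zone_with_immediate({"www-web-0": "zone-1", "logs-web-0": "zone-3"}))  # None

# WaitForFirstConsumer: web-0 was scheduled to zone-2, so both PVs go there.
print(pv_zones_with_wffc("zone-2", ["www-web-0", "logs-web-0"]))
# {'www-web-0': 'zone-2', 'logs-web-0': 'zone-2'}
```

Note that this model ignores the anti-affinity rule. With Immediate binding, even PVs that all land in one zone could still pin a pod to a zone that another replica already occupies.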

OK – let’s deploy the application, and take a look at the related objects. Then we will check where the pods were scheduled.

% kubectl apply -f multi-az-sts-wffc.yaml
statefulset.apps/web created

% kubectl get sts,pods,pvc,pv
NAME                   READY   AGE
statefulset.apps/web   3/3     6m31s

NAME                          READY   STATUS    RESTARTS   AGE
pod/web-0                     1/1     Running   0          63s
pod/web-1                     1/1     Running   0          43s
pod/web-2                     1/1     Running   0          23s

NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                          AGE
persistentvolumeclaim/logs-web-0                Bound    pvc-bcbccbcd-d721-40f0-8389-6b3dd4df8de5   1Gi        RWO            multi-az-storage-policy-latebinding   63s
persistentvolumeclaim/logs-web-1                Bound    pvc-07a4bd30-ad77-4c73-afaf-9770d0d03d72   1Gi        RWO            multi-az-storage-policy-latebinding   43s
persistentvolumeclaim/logs-web-2                Bound    pvc-249a42a0-f779-454f-a461-00636824054f   1Gi        RWO            multi-az-storage-policy-latebinding   23s
persistentvolumeclaim/www-web-0                 Bound    pvc-49efeaf5-cae3-4cbc-ad55-47a1ad2d02a4   2Gi        RWO            multi-az-storage-policy-latebinding   63s
persistentvolumeclaim/www-web-1                 Bound    pvc-c614e2f9-201a-4074-9967-89350fc8cbc8   2Gi        RWO            multi-az-storage-policy-latebinding   43s
persistentvolumeclaim/www-web-2                 Bound    pvc-77533266-c508-42d7-aad0-9c33aa530159   2Gi        RWO            multi-az-storage-policy-latebinding   23s

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                             STORAGECLASS                          REASON   AGE
persistentvolume/pvc-07a4bd30-ad77-4c73-afaf-9770d0d03d72   1Gi        RWO            Delete           Bound    default/logs-web-1                multi-az-storage-policy-latebinding            38s
persistentvolume/pvc-249a42a0-f779-454f-a461-00636824054f   1Gi        RWO            Delete           Bound    default/logs-web-2                multi-az-storage-policy-latebinding            18s
persistentvolume/pvc-49efeaf5-cae3-4cbc-ad55-47a1ad2d02a4   2Gi        RWO            Delete           Bound    default/www-web-0                 multi-az-storage-policy-latebinding            58s
persistentvolume/pvc-77533266-c508-42d7-aad0-9c33aa530159   2Gi        RWO            Delete           Bound    default/www-web-2                 multi-az-storage-policy-latebinding            18s
persistentvolume/pvc-bcbccbcd-d721-40f0-8389-6b3dd4df8de5   1Gi        RWO            Delete           Bound    default/logs-web-0                multi-az-storage-policy-latebinding            58s
persistentvolume/pvc-c614e2f9-201a-4074-9967-89350fc8cbc8   2Gi        RWO            Delete           Bound    default/www-web-1                 multi-az-storage-policy-latebinding            38s

% kubectl describe pod web-0 | grep Node:
Node:         tkg-multiaz-v1-23-8-node-pool-2-m6pqj-f5f65c68b-bgcp2/

% kubectl describe pod web-1 | grep Node:
Node:         tkg-multiaz-v1-23-8-node-pool-1-jljl8-694844cbf7-8bxj8/

% kubectl describe pod web-2 | grep Node:
Node:         tkg-multiaz-v1-23-8-node-pool-3-sgdxd-5997dc4b8d-zw5bf/

This all looks good. The pods, PVCs and PVs are successfully deployed. The pods appear to have been placed on different worker nodes, each in a different zone. This also indicates that the WaitForFirstConsumer volume binding mode worked as expected, allowing the pods to schedule before the PVs were provisioned. We have achieved what we set out to do – deploy a cloud native application with a Tanzu Kubernetes cluster and a Supervisor cluster which can all tolerate failures in the infrastructure.