vSphere with Kubernetes on VCF 4.0 Consolidated Architecture

Since the release of VMware Cloud Foundation (VCF) 4.0 over a month ago, I have been asked one question repeatedly: when can I run vSphere with Kubernetes (formerly known as Project Pacific) on a VCF 4.0 Consolidated Architecture? In other words, when can I deploy vSphere with Kubernetes on the Management Domain rather than building a separate VI Workload Domain to run it? The main reason for this request is that it reduces the number of ESXi hosts required to run vSphere with Kubernetes from 7 down to 4. So I am delighted to announce that we now have full support for running vSphere with Kubernetes on the Management Domain of VCF 4.0 in what we term a Consolidated Architecture.

Before we get into the “how-to”, I want to point out that the term consolidated has two different meanings. It is used to refer to running VM workloads on the Management Domain, but the term consolidated architecture is also used to refer to running a vSphere with Kubernetes workload on the Management Domain. I want to highlight that we have always supported the former; in other words, you could always run VMs on the Management Domain alongside the management VMs. This post is referring to the latter: the ability to select the Management Domain for a vSphere with Kubernetes workload. With that clarification, let’s get on with the steps on how to deploy vSphere with Kubernetes on the Management Domain of VCF 4.0.

Deploy VCF 4.0 as normal via Cloud Builder

The initial deployment is exactly the same as posted here previously. As before, I am not including the option to deploy any AVN (Application Virtual Networks) during bringup. I will create my NSX-T Edges as a separate step later on.

Once the deployment has completed, launch SDDC Manager. When you log in, you will observe a single workload domain, the Management Domain, consisting of 4 ESXi hosts. This will also be running vCenter 7.0 and has vSAN 7.0 deployed automatically. The Management Domain requires vSAN.

With VCF 4.0, NSX-T 3.0 is also deployed automatically. If you log in to the NSX-T Manager, you can see the Transport Nodes and Zones for the ESXi hosts. However, since I skipped the AVN (Application Virtual Networks) during bringup, we have not yet deployed any NSX-T Edge nodes. Thus, there are no Tier0 or Tier1 Logical Routers or any other network services defined.

Add an NSX-T Edge to the Management Domain

vSphere with Kubernetes requires NSX-T to provide networking services. NSX-T provides Tier0 and Tier1 Logical Routers. NSX-T also provides SNAT egresses and Load Balancer ingresses to the Supervisor Cluster and Guest Cluster. Each namespace created in vSphere with Kubernetes gets its own Tier-1 Logical Router as well as any required SNAT and Load Balancer IP addresses. This functionality is provided by an NSX-T Edge, and in VCF 4.0, the provisioning of NSX-T Edges on workload domains, including the Management Domain, is fully automated.

To deploy an NSX-T Edge, simply right-click on the Management Domain in SDDC Manager. From the drop-down list, select Add Edge Cluster. Full details on how to populate the wizard for an Edge cluster can be found in this earlier post.

The one item that is important to specify is that the NSX-T Edge use-case is for “Workload Management”, which basically means that this NSX-T Edge is being used for vSphere with Kubernetes. This automatically sets the Edge form factor to Large and the Tier0 Service HA to Active-Active. This step also adds the WCPReady tag to the Edge Cluster in the NSX-T Manager, which we will discuss later.

One other thing to mention is that in my lab, I do not have access to an upstream router. This means I cannot peer it to my Tier0 Logical Router through BGP. I am therefore selecting Static as my Tier0 Routing Type. This means a lot of additional manual steps when it comes to routing later on. The other option is EBGP, which requires peering the Tier0 Logical Router to your physical upstream router, which in turn enables automatic route learning. EBGP would be a much more common setting in production environments, but unfortunately it is not possible in my lab. The Workload Management use-case is specified in this same Add Edge Cluster wizard.

After the details of both NSX-T Edge nodes have been added, complete the wizard, ensure validation passes and deploy the NSX-T Edge cluster. Now we will have to return to the NSX-T Manager to allow vSphere with Kubernetes to be deployed on the Management Domain, but first let’s look at what happens if you try to deploy vSphere with Kubernetes on the Management Domain without following the additional steps.

Cluster is not compatible for Workload Management

In VCF 4.0 Consolidated Architecture, SDDC Manager does not allow you to select the vSphere cluster for validation when enabling Workload Management (deploying vSphere with Kubernetes). To proceed with the vSphere with Kubernetes deployment, you must use the vSphere client. However, from the vSphere client, if you now try to enable Workload Management, the vSphere cluster on the Management Domain does not appear as compatible.

This is because ‘trust’ has not been established on the NSX-T Manager for the Management Domain vCenter Server. Let’s do that now.

Enable Trust and Add Tags in NSX-T Manager

From the NSX-T Manager, navigate to the System view. Under Configuration > Fabric, select Compute Managers. This will list the Management Domain vCenter Server. Select it, then click on the Edit link to open the vCenter Server / Compute Manager properties. Set Enable Trust to Yes and click Save.
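
For anyone who would rather script this change than click through the UI, here is a minimal sketch of the request body involved. It assumes that the Enable Trust toggle maps to the set_as_oidc_provider boolean on the compute manager object served by the NSX-T REST API under /api/v1/fabric/compute-managers; please verify that against the API reference for your NSX-T build. The sample object and its values are hypothetical, and the sketch deliberately stops short of sending anything over the network.

```python
import copy
import json

# Assumption: the "Enable Trust" toggle in the UI corresponds to this boolean
# field on the compute manager object read from
# GET /api/v1/fabric/compute-managers.
TRUST_FIELD = "set_as_oidc_provider"


def needs_trust(compute_manager: dict) -> bool:
    """True when trust is absent or disabled on the compute manager."""
    return not compute_manager.get(TRUST_FIELD, False)


def with_trust_enabled(compute_manager: dict) -> dict:
    """Return a copy of the compute manager object with trust switched on.

    Every other field is left exactly as it was read, including _revision,
    which NSX-T expects to be echoed back on an update.
    """
    updated = copy.deepcopy(compute_manager)
    updated[TRUST_FIELD] = True
    return updated


if __name__ == "__main__":
    # Hypothetical, trimmed-down object standing in for a real API response.
    sample = {
        "id": "hypothetical-compute-manager-id",
        "display_name": "mgmt-vcenter.example.local",
        "origin_type": "vCenter",
        "set_as_oidc_provider": False,
        "_revision": 0,
    }
    if needs_trust(sample):
        # This JSON would be the body of a PUT to
        # /api/v1/fabric/compute-managers/<id>.
        print(json.dumps(with_trust_enabled(sample), indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to review exactly what would change before pointing it at a real NSX-T Manager.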

Now, there may be an additional step required. This depends on whether you deployed Application Virtual Networks (AVN) during the bringup operation, or used SDDC Manager to create the Edge Cluster, like I did here. If you deployed AVN during bringup, you will need this step. You will also need it if you did not select “Workload Management” as the use-case when deploying the NSX-T Edge Cluster.

In the NSX-T Manager, navigate once again to the System view. Under Configuration > Fabric, select Nodes. Next, select Edge Clusters and click on your Edge Cluster name. Now examine the Tags. There should be two: VCF and WCPReady. If WCPReady is not present, you will need to add it using the Manage > Add steps.

You should not need to do this step if you deploy the Edge Cluster via SDDC Manager and choose “Workload Management” as the use-case.
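
The tag check can also be expressed in a few lines of code. The sketch below assumes an Edge Cluster object shaped like the ones returned by the NSX-T REST API (GET /api/v1/edge-clusters), where tags is a list of scope/tag pairs. It matches on the tag value only, since I am not asserting which scope each tag carries; the sample cluster and its scope strings are hypothetical.

```python
# Tags the post expects to find on the Edge Cluster.
REQUIRED_TAGS = ("VCF", "WCPReady")


def missing_tags(edge_cluster: dict, required=REQUIRED_TAGS) -> list:
    """Return the required tag values that are not present on the cluster.

    Only the "tag" value is compared; the "scope" is ignored on purpose
    (an assumption made to keep the check independent of scope naming).
    """
    present = {entry.get("tag") for entry in edge_cluster.get("tags", [])}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical Edge Cluster object with only one of the two tags.
    sample = {
        "display_name": "hypothetical-edge-cluster",
        "tags": [{"scope": "hypothetical-scope", "tag": "VCF"}],
    }
    print("Missing tags:", missing_tags(sample))
```

An empty result means both tags are in place and no manual tagging is needed.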

This should now make the vSphere cluster on the Management Domain appear as Compatible in the vSphere UI when enabling Workload Management.
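
If you want to confirm compatibility without opening the wizard, the result can be read programmatically. The following is a rough sketch only. It assumes the vCenter REST endpoint GET /api/vcenter/namespace-management/cluster-compatibility returns a list of summaries carrying cluster, compatible and incompatibility_reasons fields, with each reason holding a default_message. Treat those names as assumptions to check against the vSphere Automation API reference for your build. The cluster IDs and the message text in the sample are made up.

```python
def compatible_clusters(summaries: list) -> list:
    """Return the IDs of clusters reported as compatible."""
    return [s["cluster"] for s in summaries if s.get("compatible")]


def incompatibility_report(summaries: list) -> dict:
    """Map each incompatible cluster ID to its list of reason messages."""
    report = {}
    for s in summaries:
        if not s.get("compatible"):
            report[s["cluster"]] = [
                reason.get("default_message", "")
                for reason in s.get("incompatibility_reasons", [])
            ]
    return report


if __name__ == "__main__":
    # Hypothetical response body; IDs and message text are invented.
    sample = [
        {"cluster": "domain-c8", "compatible": True,
         "incompatibility_reasons": []},
        {"cluster": "domain-c9", "compatible": False,
         "incompatibility_reasons": [
             {"default_message": "hypothetical reason text"}]},
    ]
    print("Compatible:", compatible_clusters(sample))
    print("Incompatible:", incompatibility_report(sample))
```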

You can now proceed with the rest of the Workload Management deployment to roll out vSphere with Kubernetes. The complete steps on how to do this can be found in this previous post. If everything deploys successfully, the Kubernetes API server should receive a Load Balancer IP address from your Ingress range.

And if you selected static routes as the Tier0 routing type, you should now be able to connect to the Control Plane LB and download the necessary kubectl tools for vSphere with Kubernetes.
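
A quick way to test that connection from a workstation is a plain TCP connect to the Load Balancer IP address. This is a minimal sketch using only the Python standard library. The default port of 443 is an assumption based on the tools being downloaded over HTTPS, and the address you pass in is whatever IP your API server was given from the Ingress range.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves that the address is routable and something is
    listening; it does not validate TLS or the Kubernetes API itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

Calling can_connect with the control plane Load Balancer IP address and getting False back usually points at routing rather than at Kubernetes, which is where the caveats further down come in.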

Note that the steps to establish trust between NSX-T and the Management Domain vCenter Server will be automated in a future release.

Additional Caveats

If, like me, you have gone with the static routes approach as your Tier0 Routing Type when you deployed the NSX-T Edges, you will now need to add a static route to your Tier0. This is to enable the control plane to pull container images from external repositories. You will also need to add some additional SNAT rules for the Tier0 if you do not have access to the physical networking infrastructure to make changes. It sucks not having access to your upstream router where this could be automated via EBGP, but at least there is a workaround. I’ve detailed my static routes and SNAT setup in this post.

There is another caveat if you have decided to use EBGP as your Tier0 Routing Type when you deployed the NSX-T Edges. The issue manifests itself as being unable to connect to the control plane API server Load Balancer IP address to download the tools. This is because the BGP Route Advertisements have been inadvertently blocked. To resolve this issue, you will now need to modify the Tier0 Route Advertisement configuration and create a new Custom Route Map. In the Tier0 Routing section of NSX-T, you need to do 3 additional steps:

  1. Create a new IP Prefix List which permits any network.
  2. Create a new Custom Route Map which matches the new IP Prefix created in step 1 and permits any network.
  3. Edit the default Route Map to use the new Custom Route Map created in step 2.

This now means that all routes will be advertised. This issue will also be addressed in an upcoming release. It is specific to the EBGP Routing Type and is not relevant if static routes are chosen as the Routing Type.
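
To see why a permit-any prefix list has this effect, here is a toy model of the three steps above. It is not NSX-T code and makes simplifying assumptions: prefix list entries are evaluated first-match with an implicit deny at the end, an entry with a network matches that exact network only, and the route map simply applies its own action to whatever its prefix list permits. All class names, function names and networks are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefixEntry:
    network: Optional[str]  # None stands for "any"
    action: str             # "PERMIT" or "DENY"


def prefix_list_permits(route: str, entries: List[PrefixEntry]) -> bool:
    """First matching entry decides; no match means deny (toy assumption)."""
    route_net = ipaddress.ip_network(route)
    for entry in entries:
        if entry.network is None or ipaddress.ip_network(entry.network) == route_net:
            return entry.action == "PERMIT"
    return False


def advertised(routes: List[str], entries: List[PrefixEntry],
               route_map_action: str = "PERMIT") -> List[str]:
    """Routes that a route map using this prefix list would let through."""
    if route_map_action != "PERMIT":
        return []
    return [r for r in routes if prefix_list_permits(r, entries)]


if __name__ == "__main__":
    # Hypothetical routes, e.g. an uplink subnet and an ingress range.
    routes = ["192.168.100.0/24", "10.30.0.0/24"]

    restrictive = [PrefixEntry("192.168.100.0/24", "PERMIT")]
    permit_any = [PrefixEntry(None, "PERMIT")]  # step 1 in the list above

    print("Restrictive list advertises:", advertised(routes, restrictive))
    print("Permit-any list advertises:", advertised(routes, permit_any))
```

With the restrictive list, the second route is silently dropped, which is the same kind of symptom as an unreachable Load Balancer IP address. With the permit-any list, everything is advertised.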

Conclusion

At this point, you have successfully deployed vSphere with Kubernetes / Workload Management on the 4-node Management Domain of VCF 4.0. You can now proceed to use this environment just as you would use Workload Management deployed on a separate VI Workload Domain. You can use this environment for creating namespaces, deploying PodVMs in the Supervisor cluster and creating guest Tanzu Kubernetes Grid (TKG) clusters.

Click this link for further details on how to deploy TKG clusters with vSphere with Kubernetes.

For more information on VCF 4.0 Consolidated Architecture, and where to get a detailed white paper on how to deploy vSphere with Kubernetes on VCF 4.0 Consolidated Architecture, check out this blog post from my colleague Kyle Gleed. As Kyle states in his post, we hope this additional qualification of Cloud Foundation 4.0 Consolidated Architecture will make it easier for you to get started with running vSphere with Kubernetes.

19 Replies to “vSphere with Kubernetes on VCF 4.0 Consolidated Architecture”

  1. How does this compare to people running a homelab with 2 x esxi 7 and vcenter (running on 1 host)? I’ve been trying to get my head around it but it seems you need more than 2?

    I got 2 x Supermicro x8ddt-hf+ with 12c/24t and 56GB ram each. On esxi host 1 I got vcenter installed. I’ve been trying to get it to work just to fiddle around with it but seeing you got 4 I doubt that is possible.

    1. Yes – the management domain requires 4 hosts. If you only have 2 physical hosts, then perhaps the way around this is to deploy VCF on nested ESXi hosts where the ESXi hosts are running in VMs.

  2. Thank you very much Cormac. Just wondering could you please give some idea regarding minimum Hardware requirement to setup a Home Lab to deploy Consolidated Arch on a Nested Env…

  3. Can u please help with the SKU for vsphere for Kubernetes add on. Looking to position it with VCF 4

  4. Worked for me, Kubernetes is enabled on the MGMT domain. Thanks a lot!

    Another question: When deploying the image registry, I get the following errors when the containers try to get images externally:

    failed to get images: Image vmware-system-registry-1150478832/harbor-jobservice-a1de3bfc8dd86da413641285600f2e5f60613b83-v1 has failed. Error: Failed to resolve on node hci-vcf-esx-01.lab.local. Reason: Http request failed. Code 400: ErrorType(2) failed to do request: Head https://docker-registry.kube-system.svc:5000/v2/goharbor/harbor-jobservice/manifests/v1.10.1: dial tcp: i/o timeout

    Any idea? Network is working probably to access external resources…

    1. I think it is somehow related to the issue that I cannot access/ping the Control Plane Node IP Address or any other Ingress IP (Image Registry). The Edge Uplinks are configured correctly and the Uplink VLAN Subnet is the same as the Ingress and Egress CIDRs. Any ideas how I can narrow down my issue?

      1. Could this be the issue I highlighted in the post under Additional Caveats? There is a bug whereby route advertisements were blocked.

        1. Somehow I have solved it – changed URPF Mode to “None” on the T0 Uplinks and redeployed the kubernetes cluster. This was the first time, that I could ping / access the control plane node IP from external.

          When I change URPF Mode now back to “Strict”, the control plane node IP is still pingable / accessible… not sure now if URPF is related to it 😛

      1. Yes DNS is set correctly and Harbor setup is still failing:

        Additional errors:
        “readiness probe failed for container core: Get http://10.244.0.218:8080/api/ping: dial tcp 10.244.0.218:8080: connect: connection refused”

        NSX Load Balancer for vmware-system-registry-181265554-harbor-1812655540-443 is down
        because of “Virtual Server Status Down”.

        Can you tell me which logs I can search for further details on these problems?

          1. vCenter 7.0 with the latest patch.

            But in the meantime, the Harbor setup has succeeded.
            Retried it about 10 times with different settings, no luck. What I found out finally, was a faulty ESX Transport Node which was not forwarding packets. I think I was just lucky that during the last Harbor deployment, the Harbor containers must have been deployed on the “right” hosts. Very annoying and not noticeable in the NSX-T Manager. What helped me was vRNI, which showed alarms of segment ports not being up specifically on that faulty host.

            Anyway, thanks for your help!!! Much appreciated.

          2. Ah – interesting. In fact, this is one of the things I always check before deploying vSphere with K8s. I always ping along the tunnel between the ESXi hosts and the NSX-T Edges to make sure there is communication. Then when I start the deployment, I check the tunnels field for both the hosts and the edges to see that they are all up and that none are down (once there is a VM deployed on the hosts). Unless the tunnels are formed correctly, the LBs won’t work.

            Thanks for the updates – appreciated.

  5. Excellent content. Thank you very much for taking the time to document this process so clearly.

  6. In the case of adopting this environment with vSphere with Kubernetes on VCF 4.0 Consolidated Architecture, what is the recommendation for the deployment of Suite vRealize, should one go to an Implementation with VLAN-Backed Networks? Or what would be the steps to follow to have vRealize?

    1. This is straight from https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/whitepaper/products/vmware-cloud-foundation-faq.pdf

      Does VMware Cloud Foundation 4.0 support automated deployment of vRealize Suite?
      VMware Cloud Foundation 4.0 supports automated deployment for vRealize Suite Lifecycle Manager (vRSLCM) and then vRLCM provides deployment of the underlying vRealize components as well as ongoing life cycle management of the vRealize Suite

      So, vRealize is handled through the vRSLCM in VCF 4.0, not SDDC Manager. That is the only caveat I know about. I’m not aware of any specific limitations when it comes to vRealize in consolidated deployments, but that doesn’t mean there aren’t any. It would always be worth asking someone in the support organization first, before doing so. I would defer to one of the many VMware Validated Designs for detailed implementation instructions.
