This week, I have been looking at the new features in TKG v1.4.1 for vSphere, which dropped very recently. You can find the TKG v1.4.1 Release Notes here. Probably the most notable feature is that TKG v1.4.1 is now supported in Tanzu Mission Control (TMC), so you can now add it to the suite of Kubernetes clusters that are centrally managed from TMC. Note that a few things have changed around how a TKG management cluster is registered with TMC, which I will cover shortly. The other item that caught my attention was that the Identity Management components that integrate with OIDC and LDAP, namely Pinniped and Dex, are now assigned LoadBalancer services by default (as long as the NSX Advanced Load Balancer is also configured and available in your vSphere environment). In TKG v1.4, I had to jump through a few additional manual configurations to convert these services from NodePort to LoadBalancer, so it's nice to see that this is no longer necessary. Let's look at these two items in more detail.
Tanzu Mission Control support
If you install the TKG v1.4.1 management cluster via the UI, the first thing you may notice is the absence of the TMC registration section. This is what the TKG v1.4.1 installer UI looks like now:
As mentioned, TKG v1.4.1 can now be added to TMC, something not possible with TKG v1.4. To add your TKG management cluster to Tanzu Mission Control, you must now go to the TMC portal, navigate to the Administration section, select the Management clusters tab and click on the button to Register Management Cluster, as shown below.
After providing a name for the cluster, and adding any necessary proxy details (should they be required), TMC will provide a YAML manifest for creating the necessary TMC components on the TKG management cluster to have that cluster added to TMC. You can also view the contents of the YAML manifest, as shown here.
Copy the manifest, then switch to your TKG v1.4.1 management cluster context, and apply the YAML manifest to the TKG management cluster via kubectl.
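The context switch and apply step can be sketched as follows. The context name and the registration URL below are placeholders, not values from my environment; use the context of your own management cluster and the link that TMC generated for you:

```shell
# List available contexts and switch to the TKG management cluster
# (the context name below is an example only)
kubectl config get-contexts
kubectl config use-context tkg-mgmt-admin@tkg-mgmt

# Apply the registration manifest provided by the TMC portal
# (placeholder URL -- paste the one TMC displayed for your registration)
kubectl apply -f 'https://<your-org>.tmc.cloud.vmware.com/installer?id=<id>&source=registration&type=tkgm'
```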
% kubectl apply -f 'https://xxxxx.tmc.cloud.vmware.com/installer?id=3b49e3a047863f4xxx87f6f5943bbc48&source=registration&type=tkgm'
namespace/vmware-system-tmc created
configmap/stack-config created
secret/tmc-access-secret created
customresourcedefinition.apiextensions.k8s.io/agents.clusters.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensionconfigs.intents.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensionintegrations.clusters.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensionresourceowners.clusters.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensions.clusters.tmc.cloud.vmware.com created
serviceaccount/extension-manager created
clusterrole.rbac.authorization.k8s.io/extension-manager-role created
clusterrolebinding.rbac.authorization.k8s.io/extension-manager-rolebinding created
service/extension-manager-service created
deployment.apps/extension-manager created
serviceaccount/extension-updater-serviceaccount created
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/vmware-system-tmc-agent-restricted created
clusterrole.rbac.authorization.k8s.io/extension-updater-clusterrole created
clusterrole.rbac.authorization.k8s.io/vmware-system-tmc-psp-agent-restricted created
clusterrolebinding.rbac.authorization.k8s.io/extension-updater-clusterrolebinding created
clusterrolebinding.rbac.authorization.k8s.io/vmware-system-tmc-psp-agent-restricted created
deployment.apps/extension-updater created
serviceaccount/agent-updater created
clusterrole.rbac.authorization.k8s.io/agent-updater-role created
clusterrolebinding.rbac.authorization.k8s.io/agent-updater-rolebinding created
deployment.apps/agent-updater created
Warning: batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
cronjob.batch/agentupdater-workload created
All going well, a number of new objects are created in the vmware-system-tmc namespace on your TKG v1.4.1 management cluster, and the cluster should soon be visible in TMC. Any workload clusters that already exist can also be managed in TMC, as shown here. Of course, new workload clusters can also be instantiated directly from TMC as well.
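To confirm that the TMC agents came up cleanly before checking the portal, you can inspect the objects created by the registration manifest; a quick sketch (pod and deployment names will vary by TMC version):

```shell
# List the TMC agent pods created by the registration manifest --
# they should all reach Running status
kubectl get pods -n vmware-system-tmc

# Check that the agent deployments (extension-manager, extension-updater,
# agent-updater, etc.) report as available
kubectl get deployments -n vmware-system-tmc
```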
Pinniped and Dex Load Balancer services
There really is not too much to say about this – it simply works out of the box. Previously in TKG v1.4, I wrote a post about how to create ytt overlays to convert the Pinniped and Dex services from NodePort to LoadBalancer. In TKG v1.4.1 I no longer need to do this – the services are deployed as type LoadBalancer automatically, which is neat.
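You can verify the service types with kubectl. The namespaces below are the defaults that TKG uses for these components in my experience, but check them against your own deployment:

```shell
# Pinniped supervisor service -- TYPE should be LoadBalancer, with an
# external IP assigned by the NSX Advanced Load Balancer
kubectl get svc -n pinniped-supervisor

# Dex service -- likewise should now be of type LoadBalancer
kubectl get svc -n tanzu-system-auth
```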
Heads Up: vSphere Multi-Datacenter deployments
One issue I did encounter, however, was a difficulty in deploying TKG v1.4.1 to vSphere environments that have multiple datacenter objects in the inventory. It seems that the CAPV controller has difficulty parsing this information and concludes that no datacenter setting has been configured. Since it knows that there are multiple datacenters in the inventory, it cannot proceed, and the bootstrap/Kind cluster creation fails by timing out while waiting for the cluster control plane to initialise. The issue has been reported and is under investigation at the time of writing. The log entry showing that the CAPV controller is unable to discover the datacenter can be found as follows. First, use docker ps to find the name of the Kind container that is acting as the bootstrap cluster, then use the following command to display the logs, replacing the name of the Kind cluster and the capv-controller-manager pod name with your own:
% docker exec -it tkg-kind-c7e4237blargg0nnv10g-control-plane \
kubectl logs capv-controller-manager-6b84586c64-749vh -n capv-system manager
You should see an error similar to the following:
E0110 15:03:11.918610 1 controller.go:257] controller-runtime/controller \
"msg"="Reconciler error" "error"="unexpected error while probing vcenter for \
infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster tkg-system/tkg141mgmt: \
unable to find datacenter \"\": default datacenter resolves to multiple instances, \
please specify" "controller"="vspherecluster" "name"="tkg141mgmt" "namespace"="tkg-system"
I was able to work around the issue as follows, though I will highlight that this is not an official method. There may be some drawbacks to this approach that I am unaware of, but it does allow you to proceed with the TKG v1.4.1 deployment in a multi-datacenter vSphere environment. In a nutshell, we are replacing the suspect CAPV controller image (v0.7.11) with a working version (v0.7.10) that was available in TKG v1.4.0.
Step 1: Install an editor in Kind
Since there is no editor in the Kind container, we need to install one. I am installing vi; you could install an alternative, such as nano, if you wish.
% docker ps
CONTAINER ID   IMAGE                                                                COMMAND                  CREATED         STATUS        PORTS                       NAMES
3bcaff36587f   projects.registry.vmware.com/tkg/kind/node:v1.21.2_vmware.1-v0.8.1   "/usr/local/bin/entr…"   6 seconds ago   Up 1 second   127.0.0.1:60798->6443/tcp   tkg-kind-c7g021vblarn8hbvdrs0-control-plane
% docker exec -it 3bcaff36587f bash
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# which vi
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# apt-get update
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# apt-get install vim -y
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# which vi
/usr/bin/vi
Step 2: Edit the capv-controller-manager deployment
Now that we have an editor available, we can make the necessary changes to the configuration. Next, identify the deployment responsible for running the capv-controller-manager pods.
# kubectl get deploy -A
NAMESPACE                           NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager       0/1     1            0           13s
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager   0/1     1            0           9s
capi-system                         capi-controller-manager                         0/1     1            0           16s
capi-webhook-system                 capi-controller-manager                         0/1     1            0           18s
capi-webhook-system                 capi-kubeadm-bootstrap-controller-manager       0/1     1            0           15s
capi-webhook-system                 capi-kubeadm-control-plane-controller-manager   0/1     1            0           11s
capi-webhook-system                 capv-controller-manager                         0/1     1            0           6s
capv-system                         capv-controller-manager                         0/1     1            0           4s
cert-manager                        cert-manager                                    1/1     1            1           9m19s
cert-manager                        cert-manager-cainjector                         1/1     1            1           9m19s
cert-manager                        cert-manager-webhook                            1/1     1            1           9m18s
kube-system                         coredns                                         2/2     2            2           10m
local-path-storage                  local-path-provisioner                          1/1     1            1           9m56s
The following command opens an editor on the capv-controller-manager deployment. Here, we need to change the version of the cluster-api-vsphere-controller image from v0.7.11 to v0.7.10.
# kubectl edit deploy capv-controller-manager -n capv-system
In the deployment manifest, locate the image: entry for the cluster-api-vsphere-controller container. It appears in the container spec alongside entries such as containerPort: 8443 and the HTTP_PROXY, HTTPS_PROXY and NO_PROXY environment variables. Change the image tag from v0.7.11 to v0.7.10, so that the image entry now references v0.7.10.
Simply save the changes to the manifest. This automatically launches a new capv-controller-manager pod and deletes the pod using the older image, which then allows the Kind/bootstrap cluster to deploy the virtual machines that make up the TKG management cluster control plane and worker nodes.
Step 3: Repeat for the TKG management cluster
We are not finished yet, since the same steps now need to be applied to the TKG management cluster itself. Simply repeat the steps above against the CAPV controller manager on the TKG management cluster, changing the version from v0.7.11 to v0.7.10, and the TKG management cluster should then come online successfully. Note that no steps are necessary for the subsequent deployment of TKG workload clusters. However, if you wish to delete the TKG management cluster, you will need to repeat this step on the Kind/bootstrap cluster that is launched for the delete operation. There is no need to repeat it on the TKG management cluster itself, as the change to the Kind/bootstrap cluster is enough to remove the TKG management cluster.
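After making the edit on the management cluster, the rollout and the resulting image version can be verified with standard kubectl commands; a quick sketch:

```shell
# Wait for the patched CAPV deployment to finish rolling out
kubectl rollout status deploy/capv-controller-manager -n capv-system

# Confirm the controller is now running the v0.7.10 image
kubectl get deploy capv-controller-manager -n capv-system \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```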
A VMware KnowledgeBase (KB) article, KB 87396, has now been released which talks through the options for resolving this issue in TKG v1.4.1.