TKG v1.4.1 – Some new features

Cormac

2 years ago

This week, I have been looking at the new features in TKG v1.4.1 for vSphere, which dropped very recently. You can find the TKG v1.4.1 Release Notes here. Probably the most notable feature is that TKG v1.4.1 is now supported in Tanzu Mission Control, so you can now add it to your suite of Kubernetes clusters that are centrally managed from TMC. Note that a few things have changed around how to register a TKG management cluster with TMC, which I will cover shortly. The other item that caught my attention was the fact that the Identity Management components that integrate with OIDC and LDAP, namely Pinniped and Dex, are now assigned Load Balancer services by default (as long as the NSX Advanced Load Balancer is also configured and available in your vSphere environment). In TKG v1.4, I had to jump through a few additional manual configuration steps to convert these services from NodePort to LoadBalancer, so it's nice to see that I no longer need to do this. Let’s look at these two items in more detail.

Tanzu Mission Control support

If you install the TKG v1.4.1 management cluster via the UI, the first thing you may notice is the absence of the TMC registration section. This is what the TKG v1.4.1 installer UI looks like now:

As mentioned, TKG v1.4.1 can now be added to TMC, something not possible with TKG v1.4. To add your TKG management cluster to Tanzu Mission Control, you must now go to the TMC portal, navigate to the Administration section, select the Management clusters tab and click on the button to Register Management Cluster, as shown below.

After providing a name for the cluster, and adding any necessary proxy details (should they be required), TMC will provide a YAML manifest for creating the necessary TMC components on the TKG management cluster to have that cluster added to TMC. You can also view the contents of the YAML manifest, as shown here.

Copy the manifest, then switch to your TKG v1.4.1 management cluster context, and apply the YAML manifest to the TKG management cluster via kubectl.

% kubectl apply -f 'https://xxxxx.tmc.cloud.vmware.com/installer?id=3b49e3a047863f4xxx87f6f5943bbc48&source=registration&type=tkgm'
namespace/vmware-system-tmc created
configmap/stack-config created
secret/tmc-access-secret created
customresourcedefinition.apiextensions.k8s.io/agents.clusters.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensionconfigs.intents.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensionintegrations.clusters.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensionresourceowners.clusters.tmc.cloud.vmware.com created
customresourcedefinition.apiextensions.k8s.io/extensions.clusters.tmc.cloud.vmware.com created
serviceaccount/extension-manager created
clusterrole.rbac.authorization.k8s.io/extension-manager-role created
clusterrolebinding.rbac.authorization.k8s.io/extension-manager-rolebinding created
service/extension-manager-service created
deployment.apps/extension-manager created
serviceaccount/extension-updater-serviceaccount created
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/vmware-system-tmc-agent-restricted created
clusterrole.rbac.authorization.k8s.io/extension-updater-clusterrole created
clusterrole.rbac.authorization.k8s.io/vmware-system-tmc-psp-agent-restricted created
clusterrolebinding.rbac.authorization.k8s.io/extension-updater-clusterrolebinding created
clusterrolebinding.rbac.authorization.k8s.io/vmware-system-tmc-psp-agent-restricted created
deployment.apps/extension-updater created
serviceaccount/agent-updater created
clusterrole.rbac.authorization.k8s.io/agent-updater-role created
clusterrolebinding.rbac.authorization.k8s.io/agent-updater-rolebinding created
deployment.apps/agent-updater created
Warning: batch/v1beta1 CronJob is deprecated in v1.21+, unavailable in v1.25+; use batch/v1 CronJob
cronjob.batch/agentupdater-workload created

All going well, a bunch of new objects will be created in the vmware-system-tmc namespace on your TKG v1.4.1 management cluster, and the cluster should soon be visible in TMC. Any workload clusters that already exist can also be managed in TMC, as shown here. Of course, new workload clusters can also be instantiated directly from TMC.
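If the cluster is slow to appear in TMC, one quick check is whether the agent pods in the vmware-system-tmc namespace have all started. The following is only a minimal sketch: the helper name pending_pods is my own invention, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Sketch: print the names of pods that are not yet Running or Completed.
# Reads `kubectl get pods --no-headers` output on stdin.
# pending_pods is a hypothetical helper name, not part of any TKG or TMC tooling.
pending_pods() {
  # Column 3 is STATUS in the default kubectl table output.
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Usage, from the TKG management cluster context:
#   kubectl get pods -n vmware-system-tmc --no-headers | pending_pods
```

An empty result suggests that everything in the namespace has started. Anything listed is worth a closer look with kubectl describe.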

Identity Management

There really is not too much to say about this – it simply works out of the box. Previously in TKG v1.4, I wrote a post about how to create ytt overlays to do the conversion of the Pinniped and Dex services from NodePort to LoadBalancer. In TKG v1.4.1 I no longer need to do this – the services are deployed as LoadBalancer automatically, which is neat.
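To confirm this on your own management cluster, you could list every service of type LoadBalancer and look for the Pinniped and Dex entries. This is a rough sketch rather than an official procedure. I filter on the service type instead of hard-coding service names or namespaces, since those are not spelled out in this post. The helper name lb_services is hypothetical.

```shell
# Sketch: show namespace/name and external IP for LoadBalancer services.
# Reads `kubectl get svc -A` output on stdin
# (columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE).
# lb_services is a hypothetical helper name.
lb_services() {
  # The header row has TYPE in column 3, so it never matches.
  awk '$3 == "LoadBalancer" { print $1 "/" $2 " " $5 }'
}

# Usage, from the TKG management cluster context:
#   kubectl get svc -A | lb_services
```

If the NSX Advanced Load Balancer is doing its job, the Pinniped and Dex services should show up in the output with an external IP address.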

Heads Up: vSphere Multi-Datacenter deployments

One issue I did encounter, however, was difficulty deploying TKG v1.4.1 to vSphere environments that have multiple datacenter objects in the inventory. It seems that the CAPV controller has difficulty parsing this information, and it concludes that no datacenter setting has been configured. Since it knows that there are multiple datacenters in the inventory, it cannot proceed. The deployment fails with the bootstrap/Kind cluster creation “timing out” waiting for the cluster control plane to initialise. The issue has been reported and is under investigation at the time of writing. You can find the log entry showing that the CAPV controller is unable to discover the datacenter as follows. First, use docker ps to find the name of the Kind container that is acting as the bootstrap cluster, then use the following command to display the logs, replacing the names of the Kind container and the capv-controller-manager pod with your own:

% docker exec -it tkg-kind-c7e4237blargg0nnv10g-control-plane \
kubectl logs capv-controller-manager-6b84586c64-749vh -n capv-system manager

You should see an error similar to the following:

E0110 15:03:11.918610      1 controller.go:257] controller-runtime/controller \
"msg"="Reconciler error" "error"="unexpected error while probing vcenter for \
infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster tkg-system/tkg141mgmt: \
unable to find datacenter \"\": default datacenter resolves to multiple instances, \
please specify" "controller"="vspherecluster" "name"="tkg141mgmt" "namespace"="tkg-system"
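If you would rather not copy the container and pod names by hand, the lookup can be scripted. The sketch below rests on a few assumptions: that the bootstrap container name follows the tkg-kind-...-control-plane pattern seen above, and that the controller container is called manager, as in the logs command earlier. The helper names kind_container and has_datacenter_error are made up for illustration.

```shell
# Sketch: pick the first container name that looks like a TKG bootstrap node.
# Reads `docker ps --format '{{.Names}}'` output on stdin.
# Assumes the tkg-kind-<id>-control-plane naming seen in this post.
kind_container() {
  grep -E '^tkg-kind-.*-control-plane$' | head -n 1
}

# Succeeds (exit 0) if the CAPV log on stdin contains the
# multi-datacenter error message quoted in this post.
has_datacenter_error() {
  grep -q 'default datacenter resolves to multiple instances'
}

# Usage:
#   node=$(docker ps --format '{{.Names}}' | kind_container)
#   docker exec "$node" kubectl logs deploy/capv-controller-manager \
#     -n capv-system -c manager | has_datacenter_error && echo "affected"
```

Pointing kubectl logs at the deployment rather than at a pod avoids having to look up the generated pod name first.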

Workaround

I was able to work around the issue as follows, though I will highlight that this is not an official method. There may be some drawbacks to this approach that I am unaware of, but it does allow you to proceed with the TKG v1.4.1 deployment in a multi-datacenter vSphere environment. In a nutshell, we are replacing the suspect CAPV controller image (v0.7.11) with a working version (v0.7.10) that was available in TKG v1.4.0.

Step 1: Install an editor in Kind

Since there is no editor in the Kind container, we need to install it. I am installing “vi”. You could install an alternative, such as nano, if you wish.

% docker ps
CONTAINER ID  IMAGE                                                                COMMAND                  CREATED        STATUS        PORTS                      NAMES
3bcaff36587f  projects.registry.vmware.com/tkg/kind/node:v1.21.2_vmware.1-v0.8.1  "/usr/local/bin/entr…"  6 seconds ago  Up 1 second  127.0.0.1:60798->6443/tcp  tkg-kind-c7g021vblarn8hbvdrs0-control-plane

% docker exec -it 3bcaff36587f bash
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# which vi
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# apt-get update
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# apt-get install vim -y 
root@tkg-kind-c7g021vblarn8hbvdrs0-control-plane:/# which vi
/usr/bin/vi

Step 2: Edit the capv-controller-manager deployment

Now that we have an editor available, we are able to make the necessary changes to the configuration. Next, identify the deployment responsible for running the capv-controller-manager pods.

# kubectl get deploy -A
NAMESPACE                          NAME                                            READY  UP-TO-DATE  AVAILABLE  AGE
capi-kubeadm-bootstrap-system      capi-kubeadm-bootstrap-controller-manager       0/1    1            0          13s
capi-kubeadm-control-plane-system  capi-kubeadm-control-plane-controller-manager   0/1    1            0          9s
capi-system                        capi-controller-manager                         0/1    1            0          16s
capi-webhook-system                capi-controller-manager                         0/1    1            0          18s
capi-webhook-system                capi-kubeadm-bootstrap-controller-manager       0/1    1            0          15s
capi-webhook-system                capi-kubeadm-control-plane-controller-manager   0/1    1            0          11s
capi-webhook-system                capv-controller-manager                         0/1    1            0          6s
capv-system                        capv-controller-manager                         0/1    1            0          4s
cert-manager                       cert-manager                                    1/1    1            1          9m19s
cert-manager                       cert-manager-cainjector                         1/1    1            1          9m19s
cert-manager                       cert-manager-webhook                            1/1    1            1          9m18s
kube-system                        coredns                                         2/2    2            2          10m
local-path-storage                 local-path-provisioner                          1/1    1            1          9m56s

The following command will open an editor to the capv-controller-manager deployment. Here, we need to change the version of the cluster-api-vsphere-controller image from version 0.7.11 to version 0.7.10. I have removed a large part of the manifest to make it easier to read:

# kubectl edit deploy capv-controller-manager -n capv-system

      containers:
      - args:
        - --secure-listen-address=0.0.0.0:8443
        - --upstream=http://127.0.0.1:8080/
        - --logtostderr=true
        - --v=10
        image: projects.registry.vmware.com/tkg/cluster-api/kube-rbac-proxy:v0.8.0_vmware.1
        imagePullPolicy: IfNotPresent
        name: kube-rbac-proxy
        ports:
        - containerPort: 8443
          name: https
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - --metrics-addr=127.0.0.1:8080
        env:
        - name: HTTP_PROXY
        - name: HTTPS_PROXY
        - name: NO_PROXY
        image: projects.registry.vmware.com/tkg/cluster-api/cluster-api-vsphere-controller:v0.7.11_vmware.1
        imagePullPolicy: IfNotPresent

The image should now read:

image: projects.registry.vmware.com/tkg/cluster-api/cluster-api-vsphere-controller:v0.7.10_vmware.1

Simply save the changes to the manifest, and this will automatically launch a new capv controller pod, as well as delete the pods using the older image. This will now allow the Kind/bootstrap cluster to deploy the virtual machines that make up the TKG management cluster control plane and worker nodes.
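As an alternative to installing an editor and running kubectl edit, the same image change could probably be made non-interactively with kubectl set image. I have not verified this as part of the workaround, so treat it as a sketch. It assumes the controller container is named manager (the name used in the kubectl logs command earlier), and downgrade_capv_image is a hypothetical helper of my own.

```shell
# Sketch: rewrite the tag of a cluster-api-vsphere-controller image reference
# from v0.7.11 to v0.7.10, leaving the registry path and _vmware suffix alone.
# downgrade_capv_image is a hypothetical helper name.
downgrade_capv_image() {
  printf '%s\n' "$1" | sed 's/:v0\.7\.11_/:v0.7.10_/'
}

# Usage, inside the Kind container (container name "manager" is an assumption):
#   current=$(kubectl get deploy capv-controller-manager -n capv-system \
#     -o jsonpath='{.spec.template.spec.containers[?(@.name=="manager")].image}')
#   kubectl set image deploy/capv-controller-manager -n capv-system \
#     manager="$(downgrade_capv_image "$current")"
```

Deriving the new image from the current one, rather than typing it out, keeps the registry path intact if you are pulling from a private registry.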

Step 3: Repeat for the TKG management cluster

We are not finished yet, since the same steps now need to be implemented on the TKG management cluster itself. Simply repeat the steps above on the CAPV controller manager on the TKG management cluster, changing the version from v0.7.11 to v0.7.10, and the TKG management cluster should now successfully come online. Note that no steps are necessary for the successful deployment of the subsequent TKG workload clusters. However, if you wish to delete the TKG management cluster, you will need to repeat this step on the Kind/bootstrap cluster that is also launched for the delete operation. There is no need to repeat it on the TKG management cluster though, as the change to the Kind/bootstrap cluster is enough to remove the TKG management cluster.

VMware Knowledge Base (KB) article 87396 has now been released to talk through the options for resolving this issue in TKG v1.4.1.