Data Services Manager 2.0 – Part 11: Simple troubleshooting guidance

As with any product that requires configuration, it is possible to enter incorrect information and not notice that anything is wrong until you attempt a deployment. In this post, I want to share some of the troubleshooting steps that I have used to figure out some misconfigurations I made with Data Services Manager 2.0. Note that this process relies on the admin having some Kubernetes skills. If this is an area you wish to develop, head on over to https://kube.academy where there are a number of free lessons to get you started. You may also like to check out one of my books, Kubernetes for vSphere Admins, if you think this might be beneficial. It is available in both paper and Kindle editions.

To begin any troubleshooting, start with the DSM UI. There is an information sign-post for each database which provides some status information about the cluster when it is clicked. In a working example, the information returned should look something similar to this.

Let’s retrieve the same information using the command line interface (CLI). Let’s begin with the gateway API, something that I have already mentioned in an earlier blog post. The gateway API can be used to do some basic troubleshooting should a deployment of a database fail for one reason or another. You can download the gateway API kubeconfig from both the vSphere Client and the DSM UI. A kubeconfig is nothing more than a YAML manifest which holds the cluster authentication information for kubectl, the Kubernetes command line interface. Here is where it can be downloaded from the DSM UI.

With the gateway API kubeconfig and the kubectl command line installed, you can now begin to query the status of databases deployed via DSM. Using the api-resources option, the objects which are available for querying can be displayed, and possibly edited. Note that many objects that you might expect to see in a standard Kubernetes cluster are not available through the gateway API. However, there are still many objects that are available, and interesting to query. In this example, I am querying the PostgreSQL and MySQL databases which have already been deployed.

% ls kubeconfig-gateway.yaml
kubeconfig-gateway.yaml


% kubectl api-resources --kubeconfig kubeconfig-gateway.yaml
NAME                        SHORTNAMES   APIVERSION                                        NAMESPACED   KIND
namespaces                  ns           v1                                                false        Namespace
secrets                                  v1                                                true         Secret
customresourcedefinitions   crd,crds     apiextensions.k8s.io/v1                           false        CustomResourceDefinition
apiservices                              apiregistration.k8s.io/v1                         false        APIService
databaseconfigs                          databases.dataservices.vmware.com/v1alpha1        true         DatabaseConfig
mysqlclusters                            databases.dataservices.vmware.com/v1alpha1        true         MySQLCluster
postgresclusters                         databases.dataservices.vmware.com/v1alpha1        true         PostgresCluster
infrastructurepolicies                   infrastructure.dataservices.vmware.com/v1alpha1   false        InfrastructurePolicy
ippools                                  infrastructure.dataservices.vmware.com/v1alpha1   false        IPPool
vmclasses                                infrastructure.dataservices.vmware.com/v1alpha1   false        VMClass
dataservicesreleases                     internal.dataservices.vmware.com/v1alpha1         false        DataServicesRelease
systempolicies                           system.dataservices.vmware.com/v1alpha1           false        SystemPolicy


% kubectl get mysqlclusters --kubeconfig kubeconfig-gateway.yaml
NAME      STATUS   STORAGE   VERSION                AGE
mysql01   Ready    60Gi      8.0.34+vmware.v2.0.0   2d15h


% kubectl get postgresclusters --kubeconfig kubeconfig-gateway.yaml
NAME   STATUS   STORAGE   VERSION              AGE
pg01   Ready    60Gi      15.5+vmware.v2.0.0   2d15h

For the databases, I could replace the get with a describe in the kubectl command. This produces quite a lot of information. However, if I just wanted to see the status of the database, I could use the following command to show only the status conditions. It tells us that the cluster is ready to accept connections, that everything is up, and that even the write-ahead logs (WAL) are archiving successfully. This is very similar to the status conditions which we saw in the UI previously.

% kubectl get postgresclusters pg01 --kubeconfig kubeconfig-gateway.yaml \
--template={{.status.conditions}}
[map[lastTransitionTime:2024-03-08T17:12:25Z message:Cluster is ready to accept connections \
observedGeneration:1 reason:Ready status:True type:Ready] \
map[lastTransitionTime:2024-03-08T17:08:59Z message: \
observedGeneration:1 reason:MachinesReady status:True type:MachinesReady] \
map[lastTransitionTime:2024-03-08T17:12:23Z message:Cluster can accept read and write queries. Everything is UP. \
observedGeneration:1 reason:Operational status:True type:DatabaseEngineReady] \
map[lastTransitionTime:2024-03-08T17:12:25Z message: \
observedGeneration:1 reason:Reconciled status:True type:Provisioning] \
map[lastTransitionTime:2024-03-08T17:11:06Z message: \
observedGeneration:1 reason:ConfigApplied status:True type:CustomConfigStatus] \
map[lastTransitionTime:2024-03-08T17:11:12Z message:wal archiving is successful \
observedGeneration:1 reason:WalArchivingIsSuccessful status:True type:WalArchiving]]%
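
The Go-template output above is accurate but hard to scan. As a minimal sketch, assuming you fetch the same object with kubectl's standard `-o json` output option instead of `--template`, a few lines of Python can print one condition per line. The function name `format_conditions` is my own, and the sample data simply repeats two of the conditions shown above.

```python
import json


def format_conditions(resource: dict) -> list[str]:
    """Return one 'type | status | reason | message' line per status condition."""
    conditions = resource.get("status", {}).get("conditions", [])
    return [
        " | ".join(
            [
                cond.get("type", ""),
                cond.get("status", ""),
                cond.get("reason", ""),
                cond.get("message", ""),
            ]
        )
        for cond in conditions
    ]


# Stand-in for the JSON returned by:
#   kubectl get postgresclusters pg01 --kubeconfig kubeconfig-gateway.yaml -o json
sample = json.loads("""
{"status": {"conditions": [
  {"type": "Ready", "status": "True", "reason": "Ready",
   "message": "Cluster is ready to accept connections"},
  {"type": "WalArchiving", "status": "True", "reason": "WalArchivingIsSuccessful",
   "message": "wal archiving is successful"}
]}}
""")

for line in format_conditions(sample):
    print(line)
```

In practice you would replace `sample` with the parsed output of the kubectl command in the comment, for example by reading it from a file or from standard input.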

OK, so this is a good working cluster. What if we now introduce some problems to mimic the kind of issues one might come across with a misconfiguration? Let’s examine two scenarios. The first is where the IP address pool is valid, but we have run out of IP addresses due to the number of databases that have been provisioned. The second is where the IP address pool is invalid, insofar as the address range added to the IP Pool is not reachable from Data Services Manager. This also means the API server on the Kubernetes cluster that is provisioned for the database is not reachable either.

Scenario 1 – Running out of IP Addresses in an IP Pool

To replicate IP Pool exhaustion, or running out of available addresses, I create a new IP Pool with a range of just two IP addresses. A standalone database requires three IP addresses: one for the VM, one for the API server of the K8s cluster on which the database is deployed, and a final one for the Load Balancer/External IP address of the database itself. Thus, we should not be able to successfully provision a database using an infrastructure policy that uses this pool. In fact, we can very quickly determine that there is IP address exhaustion using the gateway API seen earlier. In the example below, I am once again displaying only the status conditions, but a kubectl describe would show the same issue. Note the message “insufficient amount of IP addresses”.

% kubectl get postgresclusters --kubeconfig kubeconfig-gateway.yaml
NAME  STATUS       STORAGE   VERSION              AGE
pg01  Ready        60Gi      15.5+vmware.v2.0.0   2d16h
pg02  InProgress   60Gi      15.5+vmware.v2.0.0   2m43s


% kubectl get postgresclusters pg02 --kubeconfig kubeconfig-gateway.yaml \
--template={{.status.conditions}}
[map[lastTransitionTime:2024-03-11T09:03:48Z message: observedGeneration:1 reason:InProgress status:False type:Ready] \
map[lastTransitionTime:2024-03-11T09:03:48Z message:insufficient amount of IP addreses - IP pool 'restricted-ip-pool' has 2 free IP addresses, 1 more needed \
observedGeneration:1 reason:IPAddressWarning status:True type:IPAddressWarning] \
map[lastTransitionTime:2024-03-11T09:04:55Z message:Cloning @ /pg02-k4z4w - 1 of 2 completed \
observedGeneration:1 reason:MachinesReady status:False type:MachinesReady]]%
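
The arithmetic behind that warning can be sketched in a few lines. This is only an illustration, assuming the three-addresses-per-standalone-database figure given above and a hypothetical two-address range; the function names and the 10.0.0.x addresses are mine, not anything DSM exposes.

```python
import ipaddress

# Per the text above: one address for the VM, one for the Kubernetes API
# server, and one for the database's Load Balancer/External IP.
ADDRESSES_PER_STANDALONE_DB = 3


def pool_size(first: str, last: str) -> int:
    """Number of addresses in an inclusive IPv4 range."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    if end < start:
        raise ValueError("range end is below range start")
    return end - start + 1


def shortfall(free_addresses: int, standalone_dbs: int = 1) -> int:
    """Addresses still missing for the requested databases (0 if there are enough)."""
    needed = standalone_dbs * ADDRESSES_PER_STANDALONE_DB
    return max(0, needed - free_addresses)


# Hypothetical two-address range, mirroring the size of the pool in this scenario.
free = pool_size("10.0.0.10", "10.0.0.11")
print(free, shortfall(free))  # prints "2 1"
```

The result lines up with the condition message: two free addresses, one more needed. The sketch ignores addresses already consumed by other databases in the same pool, so treat it as a sizing aid only.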

The same issue is reflected in the vSphere Client UI when the IP Pool is queried after the deployment has been initiated.

This issue is easily identified through both the gateway API and the UI. This is because the DSM Provider appliance is able to communicate with the cluster that has been deployed to host the database, and the status conditions can be bubbled up to the admins. Let’s look at an issue that is a little more complex, where the Kubernetes cluster has been deployed using an unreachable IP address, preventing the DSM Provider appliance from communicating with it. More specifically, it is the component responsible for rolling out the Kubernetes cluster (Cluster API) on the DSM Provider which is unable to communicate with the Kubernetes cluster.

Scenario 2 – Unreachable Address Range in IP Pool

For this scenario, I built a new IP Pool with an unreachable address range of 192.168.100.100-200. In other words, my DSM Provider Appliance is unable to reach this network. Here is what the IP Pool looks like.

Let’s start with the gateway API once more and show the status conditions. Note that the IPAddressWarning is false in this case, meaning that there is nothing to report, i.e. no warning. It seems to be stuck waiting for MachinesReady. This one is a little more complex to troubleshoot.

% kubectl get postgresclusters --kubeconfig kubeconfig-gateway.yaml
NAME  STATUS     STORAGE VERSION              AGE
pg01  Ready      60Gi    15.5+vmware.v2.0.0   2d16h
pg03  InProgress 60Gi    15.5+vmware.v2.0.0   6m12s

% kubectl get postgresclusters pg03 --kubeconfig kubeconfig-gateway.yaml \
--template={{.status.conditions}}
[map[lastTransitionTime:2024-03-11T09:13:18Z message: observedGeneration:1 reason:InProgress status:False type:Ready] \
map[lastTransitionTime:2024-03-11T09:13:18Z message: observedGeneration:1 reason:IPAddressWarning status:False type:IPAddressWarning] \
map[lastTransitionTime:2024-03-11T09:13:57Z message:Cloning @ /pg03-tcp9q - 1 of 2 completed observedGeneration:1 reason:MachinesReady status:False type:MachinesReady]]%
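
One way to judge whether a deployment is merely slow or genuinely stuck is to look at how long a condition has sat in its current state. The sketch below assumes the object was fetched with `-o json`, and it only watches Ready and MachinesReady, because for IPAddressWarning a status of True is the bad case. The function name and the `now` timestamp are hypothetical; the condition values are taken from the pg03 output above.

```python
from datetime import datetime, timezone


def pending_conditions(resource: dict, now: datetime,
                       types: tuple = ("Ready", "MachinesReady")) -> list[tuple[str, str, float]]:
    """Return (type, message, minutes in current state) for watched conditions that are not True."""
    pending = []
    for cond in resource.get("status", {}).get("conditions", []):
        if cond.get("type") not in types or cond.get("status") == "True":
            continue
        changed = datetime.strptime(
            cond["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        minutes = (now - changed).total_seconds() / 60
        pending.append((cond["type"], cond.get("message", ""), minutes))
    return pending


# Conditions copied from the pg03 output; 'now' is an arbitrary example time.
pg03 = {"status": {"conditions": [
    {"type": "Ready", "status": "False", "reason": "InProgress",
     "message": "", "lastTransitionTime": "2024-03-11T09:13:18Z"},
    {"type": "IPAddressWarning", "status": "False", "reason": "IPAddressWarning",
     "message": "", "lastTransitionTime": "2024-03-11T09:13:18Z"},
    {"type": "MachinesReady", "status": "False", "reason": "MachinesReady",
     "message": "Cloning @ /pg03-tcp9q - 1 of 2 completed",
     "lastTransitionTime": "2024-03-11T09:13:57Z"},
]}}
now = datetime(2024, 3, 11, 9, 23, 18, tzinfo=timezone.utc)

for cond_type, message, minutes in pending_conditions(pg03, now):
    print(f"{cond_type}: not ready for {minutes:.1f} min {message}")
```

What counts as "too long" depends on your environment, so the sketch reports durations rather than applying a threshold.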

We will use this example to troubleshoot further. So far, we have only used the gateway API to look at things from a high level. What if we needed to troubleshoot ‘under the covers’, so to speak, and examine the state of the Kubernetes cluster which has been deployed to host the DSM database or data service? Let’s see how to do this next.

Deeper Dive into DSM Troubleshooting

Use extreme caution at this point! I would advise against making any changes to the underlying database or K8s components at this level and only use the following commands as “read-only” to query the database configuration. If there is a need to troubleshoot at this level, I strongly recommend engaging our support organisation for assistance.

Begin by logging into the DSM appliance as the root user. From there, navigate to the folder /opt/vmware/tdm-provider/kubernetes-service. Here you will find a number of kubeconfig YAML files. Note that these do not provide direct access to the Kubernetes clusters on which the DSM databases are deployed, but rather provide access to what might be considered the management cluster on the DSM Provider appliance. The DSM Provider appliance runs a version of Kubernetes, and uses a concept called CAPV (Cluster API for vSphere) to provision the workload K8s clusters which run the databases. However, these kubeconfigs on the DSM Provider appliance can be used to gain access to the Kubernetes clusters on which the databases are deployed. To do this, use the kubeconfig-localhost.yaml to determine what objects can be queried on the DSM Provider appliance ‘management’ cluster, using the api-resources argument once more as follows.

# kubectl api-resources --kubeconfig kubeconfig-localhost.yaml
NAME                           SHORTNAMES   APIVERSION                                NAMESPACED   KIND
configmaps                     cm           v1                                        true         ConfigMap
events                         ev           v1                                        true         Event
namespaces                     ns           v1                                        false        Namespace
secrets                                     v1                                        true         Secret
clusterresourcesetbindings                  addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSetBinding
clusterresourcesets                         addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSet
customresourcedefinitions      crd,crds     apiextensions.k8s.io/v1                   false        CustomResourceDefinition
apiservices                                 apiregistration.k8s.io/v1                 false        APIService
kubeadmconfigs                              bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfig
kubeadmconfigtemplates                      bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfigTemplate
clusterclasses                 cc           cluster.x-k8s.io/v1beta1                  true         ClusterClass
clusters                       cl           cluster.x-k8s.io/v1beta1                  true         Cluster
machinedeployments             md           cluster.x-k8s.io/v1beta1                  true         MachineDeployment
machinehealthchecks            mhc,mhcs     cluster.x-k8s.io/v1beta1                  true         MachineHealthCheck
machinepools                   mp           cluster.x-k8s.io/v1beta1                  true         MachinePool
machines                       ma           cluster.x-k8s.io/v1beta1                  true         Machine
machinesets                    ms           cluster.x-k8s.io/v1beta1                  true         MachineSet
kubeadmcontrolplanes           kcp          controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlane
kubeadmcontrolplanetemplates                controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlaneTemplate
vsphereclusteridentities                    infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereClusterIdentity
vsphereclusters                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereCluster
vsphereclustertemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereClusterTemplate
vspheredeploymentzones                      infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereDeploymentZone
vspherefailuredomains                       infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereFailureDomain
vspheremachines                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachine
vspheremachinetemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachineTemplate
vspherevms                                  infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereVM
globalinclusterippools                      ipam.cluster.x-k8s.io/v1alpha2            false        GlobalInClusterIPPool
inclusterippools                            ipam.cluster.x-k8s.io/v1alpha2            true         InClusterIPPool
ipaddressclaims                             ipam.cluster.x-k8s.io/v1alpha1            true         IPAddressClaim
ipaddresses                                 ipam.cluster.x-k8s.io/v1alpha1            true         IPAddress
extensionconfigs               ext          runtime.cluster.x-k8s.io/v1alpha1         false        ExtensionConfig

Those of you familiar with CAPV and Cluster API will probably recognise a number of the objects listed here. The majority of these are used as the building blocks to create a workload Kubernetes cluster on which to build a database. Note that one of the objects listed above is secrets. The kubeconfig for a database cluster is held in a secret. The kubeconfig secrets can be accessed as follows:

# kubectl get secrets --kubeconfig kubeconfig-localhost.yaml | grep kubeconfig
mysql01-kubeconfig              2024-03-08T17:11:34Z
pg01-kubeconfig                 2024-03-08T17:05:07Z
pg03-kubeconfig                 2024-03-11T09:13:26Z

Let’s look at the working PostgreSQL database, pg01. The actual kubeconfig is encoded in base64 in the secret. To retrieve it, and store it in its own kubeconfig file, use the following command:

# kubectl get secrets pg01-kubeconfig --kubeconfig kubeconfig-localhost.yaml \
--template={{.data.value}} | base64 --decode > kubeconfig.pg01
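
For anyone who prefers to script this step, the same decode can be sketched in Python. It assumes the secret was fetched with `-o json` and that the kubeconfig sits under `.data.value`, as in the template above; the function name is hypothetical and the sample secret contains placeholder content rather than a real kubeconfig.

```python
import base64
import json


def kubeconfig_from_secret(secret_json: str, key: str = "value") -> str:
    """Decode the base64 kubeconfig stored under .data.<key> of a Secret."""
    secret = json.loads(secret_json)
    encoded = secret["data"][key]
    return base64.b64decode(encoded).decode("utf-8")


# Stand-in for the output of:
#   kubectl get secrets pg01-kubeconfig --kubeconfig kubeconfig-localhost.yaml -o json
placeholder = base64.b64encode(b"apiVersion: v1\nkind: Config\n").decode("ascii")
sample = json.dumps({"kind": "Secret", "data": {"value": placeholder}})

print(kubeconfig_from_secret(sample))
```

Whichever method you use, the decoded file holds cluster credentials, so keep it somewhere with restricted permissions and remove it when you are done.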

The newly created kubeconfig from the previous command can now be used to look directly at the Kubernetes workload cluster where the PostgreSQL database pg01 is deployed.

# kubectl get all -A --kubeconfig kubeconfig.pg01
NAMESPACE            NAME                                            READY  STATUS     RESTARTS       AGE
cert-manager         pod/cert-manager-799d5bd547-pqgqh               1/1    Running    0              2d18h
cert-manager         pod/cert-manager-cainjector-b6d647475-pp2c5     1/1    Running    0              2d18h
cert-manager         pod/cert-manager-webhook-6cb97fd59f-clh46       1/1    Running    0              2d18h
default              pod/default-full-backup-28500479-krvg2          0/1    Completed  0              35h
default              pod/default-incremental-backup-28499039-4kbcf   0/1    Completed  0              2d11h
default              pod/default-incremental-backup-28500479-k2q9w   0/1    Completed  0              35h
default              pod/default-incremental-backup-28501919-ptk55   0/1    Completed  0              11h
default              pod/pg01-0                                      5/5    Running    0              2d18h
default              pod/pg01-monitor-0                              4/4    Running    0              2d18h
kapp-controller      pod/kapp-controller-5859bd58d7-gnpsh            2/2    Running    0              2d18h
kube-system          pod/antrea-agent-p5vwd                          2/2    Running    0              2d18h
kube-system          pod/antrea-controller-545d78bb49-2t2gg          1/1    Running    0              2d18h
kube-system          pod/coredns-86dbf96446-4wb6j                    1/1    Running    0              2d18h
kube-system          pod/etcd-pg01-m4v8q                             1/1    Running    0              2d18h
kube-system          pod/kube-apiserver-pg01-m4v8q                   1/1    Running    0              2d18h
kube-system          pod/kube-controller-manager-pg01-m4v8q          1/1    Running    0              2d18h
kube-system          pod/kube-proxy-dh8pd                            1/1    Running    0              2d18h
kube-system          pod/kube-scheduler-pg01-m4v8q                   1/1    Running    0              2d18h
kube-system          pod/kube-vip-pg01-m4v8q                         1/1    Running    0              2d18h
kube-system          pod/vsphere-cloud-controller-manager-pp9bb      1/1    Running    0              2d18h
telegraf             pod/telegraf-5d4bc448b6-87cll                   1/1    Running    0              2d18h
vmware-sql-postgres  pod/postgres-operator-7857586776-cskdv          1/1    Running    0              2d18h
vmware-system-csi    pod/vsphere-csi-controller-5ddfddf944-5kks9     7/7    Running    0              2d18h
vmware-system-csi    pod/vsphere-csi-node-26d4s                      3/3    Running    2 (2d18h ago)  2d18h

NAMESPACE            NAME                                        TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                  AGE
cert-manager         service/cert-manager                        ClusterIP      10.106.95.58    <none>          9402/TCP                 2d18h
cert-manager         service/cert-manager-webhook                ClusterIP      10.109.245.185  <none>          443/TCP                  2d18h
default              service/kubernetes                          ClusterIP      10.96.0.1       <none>          443/TCP                  2d18h
default              service/pg01                                LoadBalancer   10.103.74.207   xx.yy.zz.182    5432:31287/TCP           2d18h
default              service/pg01-agent                          ClusterIP      None            <none>          <none>                   2d18h
default              service/pg01-read-only                      ClusterIP      10.96.48.205    <none>          5432/TCP                 2d18h
kapp-controller      service/packaging-api                       ClusterIP      10.102.254.105  <none>          443/TCP                  2d18h
kube-system          service/antrea                              ClusterIP      10.98.251.199   <none>          443/TCP                  2d18h
kube-system          service/kube-dns                            ClusterIP      10.96.0.10      <none>          53/UDP,53/TCP,9153/TCP   2d18h
vmware-sql-postgres  service/postgres-operator-webhook-service   ClusterIP      10.96.209.206   <none>          443/TCP                  2d18h
vmware-system-csi    service/vsphere-csi-controller              ClusterIP      10.99.160.232   <none>          2112/TCP,2113/TCP        2d18h

NAMESPACE          NAME                                             DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR              AGE
kube-system        daemonset.apps/antrea-agent                      1        1        1      1            1          kubernetes.io/os=linux    2d18h
kube-system        daemonset.apps/kube-proxy                        1        1        1      1            1          kubernetes.io/os=linux    2d18h
kube-system        daemonset.apps/vsphere-cloud-controller-manager  1        1        1      1            1          <none>                    2d18h
vmware-system-csi  daemonset.apps/vsphere-csi-node                  1        1        1      1            1          kubernetes.io/os=linux    2d18h
vmware-system-csi  daemonset.apps/vsphere-csi-node-windows          0        0        0      0            0          kubernetes.io/os=windows  2d18h

NAMESPACE            NAME                                      READY  UP-TO-DATE   AVAILABLE  AGE
cert-manager         deployment.apps/cert-manager              1/1    1            1          2d18h
cert-manager         deployment.apps/cert-manager-cainjector   1/1    1            1          2d18h
cert-manager         deployment.apps/cert-manager-webhook      1/1    1            1          2d18h
kapp-controller      deployment.apps/kapp-controller           1/1    1            1          2d18h
kube-system          deployment.apps/antrea-controller         1/1    1            1          2d18h
kube-system          deployment.apps/coredns                   1/1    1            1          2d18h
telegraf             deployment.apps/telegraf                  1/1    1            1          2d18h
vmware-sql-postgres  deployment.apps/postgres-operator         1/1    1            1          2d18h
vmware-system-csi    deployment.apps/vsphere-csi-controller    1/1    1            1          2d18h

NAMESPACE             NAME                                                DESIRED  CURRENT  READY  AGE
cert-manager          replicaset.apps/cert-manager-799d5bd547             1        1        1      2d18h
cert-manager          replicaset.apps/cert-manager-cainjector-b6d647475   1        1        1      2d18h
cert-manager          replicaset.apps/cert-manager-webhook-6cb97fd59f     1        1        1      2d18h
kapp-controller       replicaset.apps/kapp-controller-5859bd58d7          1        1        1      2d18h
kube-system           replicaset.apps/antrea-controller-545d78bb49        1        1        1      2d18h
kube-system           replicaset.apps/coredns-559dcb89b8                  0        0        0      2d18h
kube-system           replicaset.apps/coredns-86dbf96446                  1        1        1      2d18h
telegraf              replicaset.apps/telegraf-5d4bc448b6                 1        1        1      2d18h
vmware-sql-postgres   replicaset.apps/postgres-operator-7857586776        1        1        1      2d18h
vmware-system-csi     replicaset.apps/vsphere-csi-controller-5ddfddf944   1        1        1      2d18h

NAMESPACE  NAME                            READY  AGE
default    statefulset.apps/pg01          1/1    2d18h
default    statefulset.apps/pg01-monitor  1/1    2d18h

NAMESPACE  NAME                                      SCHEDULE       SUSPEND  ACTIVE   LAST SCHEDULE  AGE
default    cronjob.batch/default-full-backup         59 23 * * 6    False    0        35h            2d18h
default    cronjob.batch/default-incremental-backup  59 23 1/1 * *  False    0        11h            2d18h

NAMESPACE  NAME                                            COMPLETIONS  DURATION  AGE
default    job.batch/default-full-backup-28500479          1/1          5s        35h
default    job.batch/default-incremental-backup-28499039   1/1          5s        2d11h
default    job.batch/default-incremental-backup-28500479   1/1          4s        35h
default    job.batch/default-incremental-backup-28501919   1/1          4s        11h

NAMESPACE  NAME                                                                            STATUS     SOURCE INSTANCE   SOURCE NAMESPACE   TYPE          TIME STARTED          TIME COMPLETED
default    postgresbackup.sql.tanzu.vmware.com/default-full-backup-20240309-235901         Succeeded  pg01              default            full          2024-03-09T23:59:11Z  2024-03-09T23:59:25Z
default    postgresbackup.sql.tanzu.vmware.com/default-incremental-backup-20240308-235901  Succeeded  pg01              default            incremental   2024-03-08T23:59:01Z  2024-03-08T23:59:08Z
default    postgresbackup.sql.tanzu.vmware.com/default-incremental-backup-20240309-235901  Succeeded  pg01              default            incremental   2024-03-09T23:59:01Z  2024-03-09T23:59:08Z
default    postgresbackup.sql.tanzu.vmware.com/default-incremental-backup-20240310-235901  Succeeded  pg01              default            incremental   2024-03-10T23:59:01Z  2024-03-10T23:59:07Z
default    postgresbackup.sql.tanzu.vmware.com/initial-45b072fc-20240308-171109            Succeeded  pg01              default            full          2024-03-08T17:11:09Z  2024-03-08T17:11:22Z

NAMESPACE  NAME                                STATUS    DB VERSION  BACKUP LOCATION      AGE
default    postgres.sql.tanzu.vmware.com/pg01  Running   15.5        pg01-backuplocation  2d18h

This provides a complete view of the ‘workload’ Kubernetes cluster where the PostgreSQL database pg01 is running. We can see the deployments, replica sets and pods. We can see the Kubernetes kube-system objects, the vSphere CSI driver for persistent storage, Antrea for networking and even the postgres objects. We have visibility into all of the services (pod networking) and even the jobs that are run to take backups of the database. There is lots of good information here to query for deep-dive troubleshooting but, as I said before, proceed with caution.
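
With this much output, it can help to filter for anything that is not healthy. The following read-only sketch assumes the pod list was fetched with `kubectl get pods -A -o json` and relies on the standard Pod fields `status.phase` and `status.containerStatuses[].ready`. The function name is my own, and the Pending pod in the sample is invented purely to show a non-empty result.

```python
def unhealthy_pods(pod_list: dict) -> list[str]:
    """List pods that are not Running/Succeeded, or Running with containers not ready."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        name = f"{meta.get('namespace', '?')}/{meta.get('name', '?')}"
        if phase == "Succeeded":
            continue  # finished Job pods, displayed by kubectl as Completed
        not_ready = [
            c.get("name", "?")
            for c in status.get("containerStatuses", [])
            if not c.get("ready", False)
        ]
        if phase != "Running":
            problems.append(f"{name}: phase {phase}")
        elif not_ready:
            problems.append(f"{name}: containers not ready: {', '.join(not_ready)}")
    return problems


# Minimal stand-in for `kubectl get pods -A -o json --kubeconfig kubeconfig.pg01`.
sample = {"items": [
    {"metadata": {"namespace": "default", "name": "pg01-0"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "pg-container", "ready": True}]}},
    {"metadata": {"namespace": "default", "name": "default-full-backup-28500479-krvg2"},
     "status": {"phase": "Succeeded"}},
    {"metadata": {"namespace": "default", "name": "example-pending-pod"},  # invented
     "status": {"phase": "Pending"}},
]}

print(unhealthy_pods(sample))
```

An empty list for the real pg01 cluster would match what the output above shows: everything is either Running or Completed.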

Return to Scenario 2 – Unreachable Address Range in IP Pool

Let’s now return to our other Postgres cluster, pg03. Remember that this was given an IP address which is unreachable from vSphere and the DSM Provider. If I retrieved its kubeconfig from a secret, like we just did for pg01, I would not be able to send queries like the one above, since the cluster has been provisioned with a front-end/load-balancer IP address that we are unable to communicate with. A kubectl get command would simply not return anything. What would we do in that case? What we could do is examine the state of the Cluster API and CAPV objects that are being used to create the cluster. Let’s do that next.
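
As an aside, the reachability problem itself can be confirmed with a plain TCP connection test against the API server address recorded in the kubeconfig. This is a rough sketch rather than anything DSM provides: it assumes the kubeconfig has an unquoted or quoted `server:` URL, pulls it out with a regular expression instead of a YAML parser, and uses a hypothetical address from the unreachable range with port 6443.

```python
import re
import socket
from urllib.parse import urlsplit


def api_server_endpoint(kubeconfig_text: str) -> tuple[str, int]:
    """Return (host, port) from the first 'server:' entry in a kubeconfig."""
    match = re.search(r"^\s*server:\s*(\S+)", kubeconfig_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'server:' entry found in kubeconfig")
    url = urlsplit(match.group(1).strip("\"'"))
    if url.hostname is None:
        raise ValueError("server entry is not a URL with a host")
    return url.hostname, url.port or 443  # 443 is the default port for https


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical kubeconfig fragment; the address and port are illustrative only.
sample = """\
apiVersion: v1
kind: Config
clusters:
- name: pg03
  cluster:
    server: https://192.168.100.100:6443
"""

host, port = api_server_endpoint(sample)
print(host, port)  # prints "192.168.100.100 6443"
```

Calling `is_reachable(host, port)` from the machine in question then tells you whether a TCP connection can be opened at all. A False result only says the port could not be reached from that machine; it does not say why.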

There is an excellent utility to extend kubectl called krew. Krew has numerous plugins, and one that I find exceptionally useful for looking at situations like this is the tree plugin. When Kubernetes clusters are deployed using Cluster API, in our case CAPV for vSphere, this Krew ‘tree’ plugin can show the relationship between objects in the cluster. I don’t have krew installed on the DSM Provider appliance, and installing it there is probably not something I would recommend. Instead, I would suggest copying /opt/vmware/tdm-provider/kubernetes-service/kubeconfig-external.yaml from the DSM Provider appliance to your desktop, where kubectl, krew and the tree plugin are installed. From your desktop, you can now reference the same set of objects that we were able to reference from within the DSM Provider appliance previously. But the really useful thing is that you can look at the relationship between all of the Cluster API objects, checking to see which are in a Ready state and which are not. Let’s look at the working pg01 cluster first. To begin, we list the api-resources. Then we list the clusters. Finally, we use the Krew tree plugin to see the relationship between all the Cluster API objects used to build a Kubernetes cluster on vSphere. So, from my desktop, I run the following:

% kubectl api-resources --kubeconfig kubeconfig-external.yaml
NAME                           SHORTNAMES   APIVERSION                                NAMESPACED   KIND
configmaps                     cm           v1                                        true         ConfigMap
events                         ev           v1                                        true         Event
namespaces                     ns           v1                                        false        Namespace
secrets                                     v1                                        true         Secret
clusterresourcesetbindings                  addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSetBinding
clusterresourcesets                         addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSet
customresourcedefinitions      crd,crds     apiextensions.k8s.io/v1                   false        CustomResourceDefinition
apiservices                                 apiregistration.k8s.io/v1                 false        APIService
kubeadmconfigs                              bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfig
kubeadmconfigtemplates                      bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfigTemplate
clusterclasses                 cc           cluster.x-k8s.io/v1beta1                  true         ClusterClass
clusters                       cl           cluster.x-k8s.io/v1beta1                  true         Cluster
machinedeployments             md           cluster.x-k8s.io/v1beta1                  true         MachineDeployment
machinehealthchecks            mhc,mhcs     cluster.x-k8s.io/v1beta1                  true         MachineHealthCheck
machinepools                   mp           cluster.x-k8s.io/v1beta1                  true         MachinePool
machines                       ma           cluster.x-k8s.io/v1beta1                  true         Machine
machinesets                    ms           cluster.x-k8s.io/v1beta1                  true         MachineSet
kubeadmcontrolplanes           kcp          controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlane
kubeadmcontrolplanetemplates                controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlaneTemplate
vsphereclusteridentities                    infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereClusterIdentity
vsphereclusters                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereCluster
vsphereclustertemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereClusterTemplate
vspheredeploymentzones                      infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereDeploymentZone
vspherefailuredomains                       infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereFailureDomain
vspheremachines                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachine
vspheremachinetemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachineTemplate
vspherevms                                  infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereVM
globalinclusterippools                      ipam.cluster.x-k8s.io/v1alpha2            false        GlobalInClusterIPPool
inclusterippools                            ipam.cluster.x-k8s.io/v1alpha2            true         InClusterIPPool
ipaddressclaims                             ipam.cluster.x-k8s.io/v1alpha1            true         IPAddressClaim
ipaddresses                                 ipam.cluster.x-k8s.io/v1alpha1            true         IPAddress
extensionconfigs               ext          runtime.cluster.x-k8s.io/v1alpha1         false        ExtensionConfig


% kubectl get cl --kubeconfig kubeconfig-external.yaml
NAME      PHASE         AGE     VERSION
mysql01   Provisioned   2d16h
pg01      Provisioned   2d16h
pg03      Provisioned   13m


% kubectl tree cl pg01 --kubeconfig kubeconfig-external.yaml
NAMESPACE  NAME                                                        READY  REASON  AGE
default    Cluster/pg01                                                True           2d18h
default    ├─ConfigMap/pg01-lock                                       -              2d18h
default    ├─KubeadmControlPlane/pg01                                  True           2d18h
default    │ ├─Machine/pg01-m4v8q                                      True           2d18h
default    │ │ ├─KubeadmConfig/pg01-cgpt7                              True           2d18h
default    │ │ │ └─Secret/pg01-cgpt7                                   -              2d18h
default    │ │ └─VSphereMachine/pg01-tpl-cp-7658441472598658451-cpmdt  True           2d18h
default    │ │  └─VSphereVM/pg01-m4v8q                                 True           2d18h
default    │ │    └─IPAddressClaim/pg01-m4v8q-0-0                      -              2d18h
default    │ │      └─IPAddress/pg01-m4v8q-0-0                         -              2d18h
default    │ ├─Secret/pg01-ca                                          -              2d18h
default    │ ├─Secret/pg01-etcd                                        -              2d18h
default    │ ├─Secret/pg01-kubeconfig                                  -              2d18h
default    │ ├─Secret/pg01-proxy                                       -              2d18h
default    │ └─Secret/pg01-sa                                          -              2d18h
default    ├─VSphereCluster/pg01                                       True           2d18h
default    │ ├─Secret/pg01-vsphere-creds                               -              2d18h
default    │ └─VSphereMachine/pg01-tpl-cp-7658441472598658451-cpmdt    True           2d18h
default    │  └─VSphereVM/pg01-m4v8q                                   True           2d18h
default    │    └─IPAddressClaim/pg01-m4v8q-0-0                        -              2d18h
default    │      └─IPAddress/pg01-m4v8q-0-0                           -              2d18h
default    └─VSphereMachineTemplate/pg01-tpl-cp-7658441472598658451    -              2d18h

In the example above, everything looks good for pg01. The Cluster object at the top of the tree has reached the Ready state. Now let’s take a look at our broken cluster, pg03.

% kubectl tree cl pg03 --kubeconfig kubeconfig-external.yaml
NAMESPACE  NAME                                                         READY  REASON                 AGE
default    Cluster/pg03                                                 False  WaitingForKubeadmInit  14m
default    ├─ConfigMap/pg03-lock                                        -                             14m
default    ├─KubeadmControlPlane/pg03                                   False  WaitingForKubeadmInit  14m
default    │ ├─Machine/pg03-tcp9q                                       True                          14m
default    │ │ ├─KubeadmConfig/pg03-76fss                               True                          14m
default    │ │ │ └─Secret/pg03-76fss                                    -                             14m
default    │ │ └─VSphereMachine/pg03-tpl-cp-17178983102945282084-cfp8d  True                          14m
default    │ │   └─VSphereVM/pg03-tcp9q                                 True                          14m
default    │ │     └─IPAddressClaim/pg03-tcp9q-0-0                      -                             14m
default    │ │       └─IPAddress/pg03-tcp9q-0-0                         -                             14m
default    │ ├─Secret/pg03-ca                                           -                             14m
default    │ ├─Secret/pg03-etcd                                         -                             14m
default    │ ├─Secret/pg03-kubeconfig                                   -                             14m
default    │ ├─Secret/pg03-proxy                                        -                             14m
default    │ └─Secret/pg03-sa                                           -                             14m
default    ├─VSphereCluster/pg03                                        True                          14m
default    │ ├─Secret/pg03-vsphere-creds                                -                             14m
default    │ └─VSphereMachine/pg03-tpl-cp-17178983102945282084-cfp8d    True                          14m
default    │   └─VSphereVM/pg03-tcp9q                                   True                          14m
default    │     └─IPAddressClaim/pg03-tcp9q-0-0                        -                             14m
default    │       └─IPAddress/pg03-tcp9q-0-0                           -                             14m
default    └─VSphereMachineTemplate/pg03-tpl-cp-17178983102945282084    -                             14m

Not so good. Whilst a lot of the cluster has been built, it appears the kubeadm control plane is stuck initialising. In a nutshell, the cluster is waiting for the control plane provider to indicate that the control plane has been initialised. Obviously we know this is down to the networking issue, but can we figure out from the CAPV objects what or where the issue is? We could start by taking a look at the Machine object. This Machine object, backed by a VM in the vSphere infrastructure, will run the control plane of the K8s cluster. This is perhaps why we saw it stuck on MachinesReady when we queried the status via the gateway API earlier. Let’s look at the status of the Machine object.

% kubectl get Machine/pg03-tcp9q --kubeconfig kubeconfig-external.yaml  \
--template={{.status.conditions}}
[map[lastTransitionTime:2024-03-11T09:14:59Z status:True type:Ready] \
map[lastTransitionTime:2024-03-11T09:13:27Z status:True type:BootstrapReady] \
map[lastTransitionTime:2024-03-11T09:14:59Z status:True type:InfrastructureReady] \
map[lastTransitionTime:2024-03-11T09:13:27Z reason:WaitingForNodeRef severity:Info status:False type:NodeHealthy]]%
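
The Go template output above is compact but awkward to scan. As a rough sketch only, the same conditions could be filtered with a few lines of Python, assuming the Machine is fetched as JSON instead (for example with kubectl get machine pg03-tcp9q -o json). The function name unhealthy_conditions and the trimmed SAMPLE document are illustrative, not part of DSM or Cluster API; the sample only reproduces the fields visible in the output above.

```python
import json


def unhealthy_conditions(machine_json):
    """Return the status conditions whose status is not "True"."""
    machine = json.loads(machine_json)
    conditions = machine.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Hypothetical, trimmed stand-in for `kubectl get machine ... -o json`,
# populated with the condition values shown in the output above.
SAMPLE = json.dumps({
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True",
             "lastTransitionTime": "2024-03-11T09:14:59Z"},
            {"type": "BootstrapReady", "status": "True",
             "lastTransitionTime": "2024-03-11T09:13:27Z"},
            {"type": "InfrastructureReady", "status": "True",
             "lastTransitionTime": "2024-03-11T09:14:59Z"},
            {"type": "NodeHealthy", "status": "False",
             "reason": "WaitingForNodeRef", "severity": "Info",
             "lastTransitionTime": "2024-03-11T09:13:27Z"},
        ]
    }
})

for cond in unhealthy_conditions(SAMPLE):
    print(cond["type"], cond.get("reason", "-"), cond.get("severity", "-"))
```

With the sample above, only the NodeHealthy condition is reported, which is the one discussed next.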

The node is not healthy. It would appear that the Machine object is waiting on a NodeRef. If we look at the code for Cluster API, we can see that WaitingForNodeRef indicates that the machine.spec.providerId is not yet assigned. The ProviderID is a cloud provider ID which identifies the machine, in our case identifying the VM in vSphere. A cloud provider embeds platform or cloud specific control logic into Kubernetes. In this case, it is control logic for Kubernetes running on vSphere. Until the ProviderID is assigned, the machine will not be ready. The cloud provider component in Kubernetes needs to be able to reach vSphere to determine if the VM is online and healthy. When it succeeds in doing this, it sets the nodeRef as the last step in machine provisioning.

Someone with Kubernetes expertise might be able to deduce from this that there is a communication issue between the workload K8s cluster and Cluster API on the management cluster (in our case DSM), and that, as a result, it is not possible to update the machine’s operational state due to the unreachable IP address. Therefore the machine/node/VM will remain in this state indefinitely. But you could also look at the IP address objects in the tree above and realise that the IP addresses are unreachable from vSphere (a simple ping from vCenter to the front-end/load-balancer IP addresses should reveal this). The actual error itself can be found on the DSM appliance, by looking at the logs for the Cluster API (CAPI & CAPV) components under /var/log/tdm/provider/containers:

root@photon [ /var/log/tdm/provider/containers ]# ls -ltr
total 141584
-rw-r----- 1 root root     5129 Mar 11 09:51 data-plane-image-download-v2.0.0.log
-rw-r----- 1 root root    79699 Mar 11 09:59 data-plane-image-deploy-v2.0.0.log
-rw-r----- 1 root root  6843133 Mar 11 02:01 kubernetes-service.log-2024-03-22-02-1711072861.gz
-rw-r----- 1 root root   295813 Mar 11 06:21 kubernetes-service-caip-1.log
-rw-r----- 1 root root    64038 Mar 11 09:47 kubernetes-service-kadm-1.log
-rw-r----- 1 root root   652003 Mar 11 10:02 kubernetes-service-kcp-1.log
-rw-r----- 1 root root  3367036 Mar 11 10:02 kubernetes-service-capi-1.log
-rw-r----- 1 root root  9697207 Mar 11 10:18 kubernetes-service-capv-1.log
-rw-r----- 1 root root 13646784 Mar 11 10:18 dsm-tsql-provisioner-service.log
-rw-r----- 1 root root 41733076 Mar 11 10:18 sgw-service.log
-rw-r----- 1 root root 41414303 Mar 11 10:18 docker-registry.log
-rw-r----- 1 root root 26966987 Mar 11 10:18 kubernetes-service.log

If I take a look at kubernetes-service-capi-1.log, I see the following reconcile error displayed repeatedly:

2024-03-11T11:03:26+00:00 localhost.localdomain container_name/kubernetes-service-capi-1[1968]: \
E0322 11:03:26.785449 1 controller.go:324] "Reconciler error" err="failed to create cluster accessor: \
error creating client for remote cluster \"default/pg03\": error getting rest mapping: \
failed to get API group resources: unable to retrieve the complete list of server APIs: \
v1: Get \"https://192.168.100.100:6443/api/v1?timeout=10s\": net/http: \
request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" \
controller="clusterresourceset" controllerGroup="addons.cluster.x-k8s.io" \
controllerKind="ClusterResourceSet" ClusterResourceSet="default/pg03-csi-crs" \
namespace="default" name="pg03-csi-crs" reconcileID=bdfe85ee-75b4-4502-b00f-b625d9a80c79
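
Since the same entry repeats many times, it can help to reduce the log to a count per cluster and endpoint. The following is a minimal Python sketch under a few assumptions: each log entry sits on a single line (the backslash line breaks above are only for readability), and embedded quotes appear escaped as \" inside the err field. The function name summarise_reconciler_errors and the shortened SAMPLE_LINE are illustrative only.

```python
import re
from collections import Counter

# Both patterns tolerate an optional backslash before the quote, because
# quotes inside the err="..." field are escaped in the log.
CLUSTER_RE = re.compile(r'remote cluster \\?"([^"\\]+)')
URL_RE = re.compile(r'Get \\?"(https?://[^"\\]+)')


def summarise_reconciler_errors(lines):
    """Count "Reconciler error" entries per (cluster, endpoint) pair."""
    counts = Counter()
    for line in lines:
        if '"Reconciler error"' not in line:
            continue
        cluster = CLUSTER_RE.search(line)
        url = URL_RE.search(line)
        key = (
            cluster.group(1) if cluster else "unknown",
            url.group(1) if url else "unknown",
        )
        counts[key] += 1
    return counts


# Shortened copy of the entry shown above, joined onto one line.
SAMPLE_LINE = (
    r'E0322 11:03:26.785449 1 controller.go:324] "Reconciler error" '
    r'err="failed to create cluster accessor: error creating client for '
    r'remote cluster \"default/pg03\": error getting rest mapping: '
    r'failed to get API group resources: unable to retrieve the complete '
    r'list of server APIs: v1: Get '
    r'\"https://192.168.100.100:6443/api/v1?timeout=10s\": net/http: '
    r'request canceled while waiting for connection"'
)

for (cluster, url), n in summarise_reconciler_errors([SAMPLE_LINE]).items():
    print(cluster, url, n)
```

On the appliance, the function could be fed an open file handle for kubernetes-service-capi-1.log rather than the sample list, since it only iterates over lines.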

The Cluster API on the DSM appliance is unable to communicate with the K8s cluster provisioned for the database. Thus, the cluster can never leave the WaitingForNodeRef state. Note that this issue could also arise if there was a firewall blocking port 6443 between the DSM appliance network and the network where the database is being provisioned. This troubleshooting process is also a good way to identify other possible conditions, such as memory or disk pressure, which could also prevent the Kubernetes cluster from becoming ready.
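
A quick way to tell an unreachable address or a blocked port apart from other failures is a plain TCP connection test against the API server endpoint, run from a machine on the DSM appliance network. The sketch below uses only the Python standard library; the helper name can_connect is hypothetical, and the address and port in the usage note are simply the ones from the log entry above.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when the endpoint cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the failing cluster in this post, the call would be can_connect("192.168.100.100", 6443). A False result points at routing or a firewall rather than at Kubernetes itself, although a True result only proves the port is open, not that the API server is healthy.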

Summary

Although there is a significant amount of built-in monitoring in the DSM UI, additional information can sometimes be gleaned by using the kubectl CLI for troubleshooting. The gateway API, which is easily accessible to customers, gives some details about possible root causes related to misconfigurations. However, in certain situations, it may be necessary to go deeper and look at the underlying CAPV-provisioned Kubernetes workload cluster on which the database is deployed. Hopefully the above will help get you started should you need to troubleshoot a DSM environment. But, as highlighted multiple times in the post, don’t take any risks, and consider engaging the Global Support organisation if you need to troubleshoot any production environments to this level.