Data Services Manager 2.0 – Part 11: Simple troubleshooting guidance
As with any product that requires a number of configuration steps, it is possible to input some incorrect information and not notice that an issue has occurred until you try some deployments. In this post, I want to share some of the troubleshooting steps that I have used to figure out misconfigurations I made with Data Services Manager 2.0. Note that this process relies on the admin having some Kubernetes skills. If this is an area you wish to develop, head on over to https://kube.academy where there are a number of free lessons to get you started. You may also like to check out one of my books, Kubernetes for vSphere Admins, if you think this might be beneficial. It is available in both paperback and Kindle editions.
To begin any troubleshooting, start with the DSM UI. There is an information sign-post for each database which, when clicked, provides some status information about the cluster. In a working example, the information returned should look something like this.
Now let's retrieve the same information using the command line interface (CLI). We will begin with the gateway API, something that I have already mentioned in an earlier blog post. The gateway API can be used to do some basic troubleshooting should a deployment of a database fail for one reason or another. You can download the gateway API kubeconfig from both the vSphere Client and the DSM UI. A kubeconfig is nothing more than a YAML manifest which holds the cluster authentication information for kubectl, the Kubernetes command line interface. Here is where it can be downloaded from the DSM UI.
With the gateway API kubeconfig and the kubectl command line installed, you can now begin to query the status of databases deployed via DSM. Using the api-resources option, the objects which are available for querying can be displayed, and possibly edited. Note that many objects that you might expect to see in a standard Kubernetes cluster are not available through the gateway API. However, there are still many objects that are available, and interesting to query. In this example, I am querying the PostgreSQL and MySQL databases which have already been deployed.
% ls kubeconfig-gateway.yaml
kubeconfig-gateway.yaml

% kubectl api-resources --kubeconfig kubeconfig-gateway.yaml
NAME                         SHORTNAMES   APIVERSION                                        NAMESPACED   KIND
namespaces                   ns           v1                                                false        Namespace
secrets                                   v1                                                true         Secret
customresourcedefinitions    crd,crds     apiextensions.k8s.io/v1                           false        CustomResourceDefinition
apiservices                               apiregistration.k8s.io/v1                         false        APIService
databaseconfigs                           databases.dataservices.vmware.com/v1alpha1        true         DatabaseConfig
mysqlclusters                             databases.dataservices.vmware.com/v1alpha1        true         MySQLCluster
postgresclusters                          databases.dataservices.vmware.com/v1alpha1        true         PostgresCluster
infrastructurepolicies                    infrastructure.dataservices.vmware.com/v1alpha1   false        InfrastructurePolicy
ippools                                   infrastructure.dataservices.vmware.com/v1alpha1   false        IPPool
vmclasses                                 infrastructure.dataservices.vmware.com/v1alpha1   false        VMClass
dataservicesreleases                      internal.dataservices.vmware.com/v1alpha1         false        DataServicesRelease
systempolicies                            system.dataservices.vmware.com/v1alpha1           false        SystemPolicy

% kubectl get mysqlclusters --kubeconfig kubeconfig-gateway.yaml
NAME      STATUS   STORAGE   VERSION                AGE
mysql01   Ready    60Gi      8.0.34+vmware.v2.0.0   2d15h

% kubectl get postgresclusters --kubeconfig kubeconfig-gateway.yaml
NAME   STATUS   STORAGE   VERSION              AGE
pg01   Ready    60Gi      15.5+vmware.v2.0.0   2d15h
For the databases, I could replace the get with a describe in the kubectl command. This produces quite a lot of information. However, if I only wanted to see the status of the database, I could use the following command to show just the Status Conditions. It tells us that the cluster is ready to accept connections, everything is up, and even the write-ahead logs (WAL) are archiving successfully. This is very similar to the status conditions which we saw in the UI previously.
% kubectl get postgresclusters pg01 --kubeconfig kubeconfig-gateway.yaml \
  --template={{.status.conditions}}
[map[lastTransitionTime:2024-03-08T17:12:25Z message:Cluster is ready to accept connections observedGeneration:1 reason:Ready status:True type:Ready]
 map[lastTransitionTime:2024-03-08T17:08:59Z message: observedGeneration:1 reason:MachinesReady status:True type:MachinesReady]
 map[lastTransitionTime:2024-03-08T17:12:23Z message:Cluster can accept read and write queries. Everything is UP. observedGeneration:1 reason:Operational status:True type:DatabaseEngineReady]
 map[lastTransitionTime:2024-03-08T17:12:25Z message: observedGeneration:1 reason:Reconciled status:True type:Provisioning]
 map[lastTransitionTime:2024-03-08T17:11:06Z message: observedGeneration:1 reason:ConfigApplied status:True type:CustomConfigStatus]
 map[lastTransitionTime:2024-03-08T17:11:12Z message:wal archiving is successful observedGeneration:1 reason:WalArchivingIsSuccessful status:True type:WalArchiving]]
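Incidentally, if the raw Go-template output above is hard to read, the same conditions can be printed one per line using a jsonpath expression. This is generic kubectl behaviour rather than anything DSM-specific, so treat it simply as an optional convenience:

% kubectl get postgresclusters pg01 --kubeconfig kubeconfig-gateway.yaml \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'

This prints the type, status and message of each condition on its own line, which makes it easier to spot any condition that is not True.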
OK, so this is a good working cluster. What if we now introduce some issues to mimic the sort of misconfigurations one might come across? Let's examine two scenarios. The first is where the IP address pool is valid, but we have run out of IP addresses due to the number of databases that have been provisioned. The second is where the IP address pool is invalid insofar as the address range added to the IP Pool is not reachable from the Data Services Manager. This also means the API server on the Kubernetes cluster that is provisioned for the database is not reachable either.
Scenario 1 – Running out of IP Addresses in an IP Pool
To replicate IP Pool exhaustion, or running out of available addresses, I created a new IP Pool with a range of just two IP addresses. A standalone database requires 3 IP addresses: one for the VM, one for the K8s cluster API server on which the database is deployed, and a final one for the Load Balancer/External IP address for the database itself. Thus, we should not be able to successfully provision a database using an infrastructure policy that uses this pool. In fact, very quickly we can determine there is IP address exhaustion using the gateway API seen earlier. In the example below, I am once again displaying only the status conditions, but a kubectl describe would show the same issue. Note the message “insufficient amount of IP addresses”.
% kubectl get postgresclusters --kubeconfig kubeconfig-gateway.yaml
NAME   STATUS       STORAGE   VERSION              AGE
pg01   Ready        60Gi      15.5+vmware.v2.0.0   2d16h
pg02   InProgress   60Gi      15.5+vmware.v2.0.0   2m43s

% kubectl get postgresclusters pg02 --kubeconfig kubeconfig-gateway.yaml \
  --template={{.status.conditions}}
[map[lastTransitionTime:2024-03-11T09:03:48Z message: observedGeneration:1 reason:InProgress status:False type:Ready]
 map[lastTransitionTime:2024-03-11T09:03:48Z message:insufficient amount of IP addreses - IP pool 'restricted-ip-pool' has 2 free IP addresses, 1 more needed observedGeneration:1 reason:IPAddressWarning status:True type:IPAddressWarning]
 map[lastTransitionTime:2024-03-11T09:04:55Z message:Cloning @ /pg02-k4z4w - 1 of 2 completed observedGeneration:1 reason:MachinesReady status:False type:MachinesReady]]
The same issue is reflected in the vSphere Client UI when the IP Pool is queried after the deployment has been initiated.
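Since the gateway API also exposes the ippools resource (it appeared in the api-resources listing earlier), the pool itself can be queried from the CLI as well. The pool name restricted-ip-pool is the one from my example, and the exact fields returned may vary between DSM releases, so take this as a sketch rather than gospel:

% kubectl get ippools --kubeconfig kubeconfig-gateway.yaml
% kubectl describe ippool restricted-ip-pool --kubeconfig kubeconfig-gateway.yaml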
This issue is easily identified through both the gateway API and the UI. This is because the DSM provider appliance is able to communicate with the cluster that has been deployed to host the database, and the status conditions can be bubbled up to the admins. Let's look at an issue that is a little more complex, where the Kubernetes cluster has been deployed using an unreachable IP address, preventing the DSM provider appliance from communicating with it. More specifically, it is the component responsible for rolling out the Kubernetes cluster (Cluster API) on the DSM provider which is unable to communicate with the Kubernetes cluster.
Scenario 2 – Unreachable Address Range in IP Pool
For this scenario, I built a new IP Pool with an unreachable address range of 192.168.100.100-200. In other words, my DSM Provider Appliance is unable to reach this network. Here is what the IP Pool looks like.
Let’s start with the gateway API once more and show the status conditions. Note that the IPAddressWarning is false in this case, meaning that there is nothing to report, i.e. no warning. It seems to be stuck waiting for MachinesReady. This one is a little more complex to troubleshoot.
% kubectl get postgresclusters --kubeconfig kubeconfig-gateway.yaml
NAME   STATUS       STORAGE   VERSION              AGE
pg01   Ready        60Gi      15.5+vmware.v2.0.0   2d16h
pg03   InProgress   60Gi      15.5+vmware.v2.0.0   6m12s

% kubectl get postgresclusters pg03 --kubeconfig kubeconfig-gateway.yaml \
  --template={{.status.conditions}}
[map[lastTransitionTime:2024-03-11T09:13:18Z message: observedGeneration:1 reason:InProgress status:False type:Ready]
 map[lastTransitionTime:2024-03-11T09:13:18Z message: observedGeneration:1 reason:IPAddressWarning status:False type:IPAddressWarning]
 map[lastTransitionTime:2024-03-11T09:13:57Z message:Cloning @ /pg03-tcp9q - 1 of 2 completed observedGeneration:1 reason:MachinesReady status:False type:MachinesReady]]
We will use this example to troubleshoot further. So far, we have only used the gateway API to look at things from a high level. What if we needed to troubleshoot 'under the covers', so to speak, and examine the state of the Kubernetes cluster which has been deployed to host the DSM database or data service? Let's see how to do this next.
Deeper Dive into DSM Troubleshooting
Use extreme caution at this point! I would advise against making any changes to the underlying database or K8s components at this level, and would only use the following commands in a “read-only” fashion to query the database configuration. If there is a need to troubleshoot at this level, I strongly recommend engaging our support organisation for assistance.
Begin by logging into the DSM appliance as the root user. From there, navigate to the folder /opt/vmware/tdm-provider/kubernetes-service. Here you will find a number of kubeconfig YAML files. Note that these do not provide access to the Kubernetes clusters on which the DSM databases are deployed, but rather provide access to what might be considered the management cluster on the DSM Provider appliance. The DSM Provider appliance runs a version of Kubernetes, and uses CAPV (Cluster API Provider vSphere) to provision the workload K8s clusters which run the databases. However, these kubeconfigs on the DSM Provider appliance can be used to gain access to the Kubernetes clusters on which the databases are deployed. To do this, use the kubeconfig-localhost.yaml to determine what objects can be queried on the DSM Provider appliance 'management' cluster, using the api-resources argument once more as follows.
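One small convenience before we start: rather than adding --kubeconfig to every command, the KUBECONFIG environment variable can be exported instead. This is standard kubectl behaviour, nothing DSM-specific. I keep the explicit --kubeconfig flag in the examples below purely for clarity:

# export KUBECONFIG=/opt/vmware/tdm-provider/kubernetes-service/kubeconfig-localhost.yaml
# kubectl api-resources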
# kubectl api-resources --kubeconfig kubeconfig-localhost.yaml
NAME                           SHORTNAMES   APIVERSION                                NAMESPACED   KIND
configmaps                     cm           v1                                        true         ConfigMap
events                         ev           v1                                        true         Event
namespaces                     ns           v1                                        false        Namespace
secrets                                     v1                                        true         Secret
clusterresourcesetbindings                  addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSetBinding
clusterresourcesets                         addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSet
customresourcedefinitions      crd,crds     apiextensions.k8s.io/v1                   false        CustomResourceDefinition
apiservices                                 apiregistration.k8s.io/v1                 false        APIService
kubeadmconfigs                              bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfig
kubeadmconfigtemplates                      bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfigTemplate
clusterclasses                 cc           cluster.x-k8s.io/v1beta1                  true         ClusterClass
clusters                       cl           cluster.x-k8s.io/v1beta1                  true         Cluster
machinedeployments             md           cluster.x-k8s.io/v1beta1                  true         MachineDeployment
machinehealthchecks            mhc,mhcs     cluster.x-k8s.io/v1beta1                  true         MachineHealthCheck
machinepools                   mp           cluster.x-k8s.io/v1beta1                  true         MachinePool
machines                       ma           cluster.x-k8s.io/v1beta1                  true         Machine
machinesets                    ms           cluster.x-k8s.io/v1beta1                  true         MachineSet
kubeadmcontrolplanes           kcp          controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlane
kubeadmcontrolplanetemplates                controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlaneTemplate
vsphereclusteridentities                    infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereClusterIdentity
vsphereclusters                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereCluster
vsphereclustertemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereClusterTemplate
vspheredeploymentzones                      infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereDeploymentZone
vspherefailuredomains                       infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereFailureDomain
vspheremachines                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachine
vspheremachinetemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachineTemplate
vspherevms                                  infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereVM
globalinclusterippools                      ipam.cluster.x-k8s.io/v1alpha2            false        GlobalInClusterIPPool
inclusterippools                            ipam.cluster.x-k8s.io/v1alpha2            true         InClusterIPPool
ipaddressclaims                             ipam.cluster.x-k8s.io/v1alpha1            true         IPAddressClaim
ipaddresses                                 ipam.cluster.x-k8s.io/v1alpha1            true         IPAddress
extensionconfigs               ext          runtime.cluster.x-k8s.io/v1alpha1         false        ExtensionConfig
Those of you familiar with CAPV and Cluster API will probably recognise a number of the objects listed here. The majority of these are used as the building blocks to create a workload Kubernetes cluster on which to build a database. Note that one of the objects listed above is secrets. The kubeconfig for a database cluster is held in a secret. The kubeconfig secrets can be accessed as follows:
# kubectl get secrets --kubeconfig kubeconfig-localhost.yaml | grep kubeconfig
mysql01-kubeconfig   2024-03-08T17:11:34Z
pg01-kubeconfig      2024-03-08T17:05:07Z
pg03-kubeconfig      2024-03-11T09:13:26Z
Let’s look at the working PostgreSQL database, pg01. The actual kubeconfig is encoded in base64 in the secret. To retrieve it, and store it in its own kubeconfig file, use the following command:
# kubectl get secrets pg01-kubeconfig --kubeconfig kubeconfig-localhost.yaml \
--template={{.data.value}} | base64 --decode > kubeconfig.pg01
The newly created kubeconfig from the previous command can now be used to look directly at the Kubernetes workload cluster where the PostgreSQL database pg01 is deployed.
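A quick sanity check that the extracted kubeconfig actually works is to list the nodes of the workload cluster before running anything broader; for a standalone database a single control plane node should be returned:

# kubectl get nodes --kubeconfig kubeconfig.pg01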
# kubectl get all -A --kubeconfig kubeconfig.pg01
NAMESPACE             NAME                                          READY   STATUS      RESTARTS        AGE
cert-manager          pod/cert-manager-799d5bd547-pqgqh             1/1     Running     0               2d18h
cert-manager          pod/cert-manager-cainjector-b6d647475-pp2c5   1/1     Running     0               2d18h
cert-manager          pod/cert-manager-webhook-6cb97fd59f-clh46     1/1     Running     0               2d18h
default               pod/default-full-backup-28500479-krvg2        0/1     Completed   0               35h
default               pod/default-incremental-backup-28499039-4kbcf 0/1     Completed   0               2d11h
default               pod/default-incremental-backup-28500479-k2q9w 0/1     Completed   0               35h
default               pod/default-incremental-backup-28501919-ptk55 0/1     Completed   0               11h
default               pod/pg01-0                                    5/5     Running     0               2d18h
default               pod/pg01-monitor-0                            4/4     Running     0               2d18h
kapp-controller       pod/kapp-controller-5859bd58d7-gnpsh          2/2     Running     0               2d18h
kube-system           pod/antrea-agent-p5vwd                        2/2     Running     0               2d18h
kube-system           pod/antrea-controller-545d78bb49-2t2gg        1/1     Running     0               2d18h
kube-system           pod/coredns-86dbf96446-4wb6j                  1/1     Running     0               2d18h
kube-system           pod/etcd-pg01-m4v8q                           1/1     Running     0               2d18h
kube-system           pod/kube-apiserver-pg01-m4v8q                 1/1     Running     0               2d18h
kube-system           pod/kube-controller-manager-pg01-m4v8q        1/1     Running     0               2d18h
kube-system           pod/kube-proxy-dh8pd                          1/1     Running     0               2d18h
kube-system           pod/kube-scheduler-pg01-m4v8q                 1/1     Running     0               2d18h
kube-system           pod/kube-vip-pg01-m4v8q                       1/1     Running     0               2d18h
kube-system           pod/vsphere-cloud-controller-manager-pp9bb    1/1     Running     0               2d18h
telegraf              pod/telegraf-5d4bc448b6-87cll                 1/1     Running     0               2d18h
vmware-sql-postgres   pod/postgres-operator-7857586776-cskdv        1/1     Running     0               2d18h
vmware-system-csi     pod/vsphere-csi-controller-5ddfddf944-5kks9   7/7     Running     0               2d18h
vmware-system-csi     pod/vsphere-csi-node-26d4s                    3/3     Running     2 (2d18h ago)   2d18h

NAMESPACE             NAME                                        TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                  AGE
cert-manager          service/cert-manager                        ClusterIP      10.106.95.58     <none>         9402/TCP                 2d18h
cert-manager          service/cert-manager-webhook                ClusterIP      10.109.245.185   <none>         443/TCP                  2d18h
default               service/kubernetes                          ClusterIP      10.96.0.1        <none>         443/TCP                  2d18h
default               service/pg01                                LoadBalancer   10.103.74.207    xx.yy.zz.182   5432:31287/TCP           2d18h
default               service/pg01-agent                          ClusterIP      None             <none>         <none>                   2d18h
default               service/pg01-read-only                      ClusterIP      10.96.48.205     <none>         5432/TCP                 2d18h
kapp-controller       service/packaging-api                       ClusterIP      10.102.254.105   <none>         443/TCP                  2d18h
kube-system           service/antrea                              ClusterIP      10.98.251.199    <none>         443/TCP                  2d18h
kube-system           service/kube-dns                            ClusterIP      10.96.0.10       <none>         53/UDP,53/TCP,9153/TCP   2d18h
vmware-sql-postgres   service/postgres-operator-webhook-service   ClusterIP      10.96.209.206    <none>         443/TCP                  2d18h
vmware-system-csi     service/vsphere-csi-controller              ClusterIP      10.99.160.232    <none>         2112/TCP,2113/TCP        2d18h

NAMESPACE           NAME                                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
kube-system         daemonset.apps/antrea-agent                       1         1         1       1            1           kubernetes.io/os=linux     2d18h
kube-system         daemonset.apps/kube-proxy                         1         1         1       1            1           kubernetes.io/os=linux     2d18h
kube-system         daemonset.apps/vsphere-cloud-controller-manager   1         1         1       1            1           <none>                     2d18h
vmware-system-csi   daemonset.apps/vsphere-csi-node                   1         1         1       1            1           kubernetes.io/os=linux     2d18h
vmware-system-csi   daemonset.apps/vsphere-csi-node-windows           0         0         0       0            0           kubernetes.io/os=windows   2d18h

NAMESPACE             NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager          deployment.apps/cert-manager              1/1     1            1           2d18h
cert-manager          deployment.apps/cert-manager-cainjector   1/1     1            1           2d18h
cert-manager          deployment.apps/cert-manager-webhook      1/1     1            1           2d18h
kapp-controller       deployment.apps/kapp-controller           1/1     1            1           2d18h
kube-system           deployment.apps/antrea-controller         1/1     1            1           2d18h
kube-system           deployment.apps/coredns                   1/1     1            1           2d18h
telegraf              deployment.apps/telegraf                  1/1     1            1           2d18h
vmware-sql-postgres   deployment.apps/postgres-operator         1/1     1            1           2d18h
vmware-system-csi     deployment.apps/vsphere-csi-controller    1/1     1            1           2d18h

NAMESPACE             NAME                                                DESIRED   CURRENT   READY   AGE
cert-manager          replicaset.apps/cert-manager-799d5bd547             1         1         1       2d18h
cert-manager          replicaset.apps/cert-manager-cainjector-b6d647475   1         1         1       2d18h
cert-manager          replicaset.apps/cert-manager-webhook-6cb97fd59f     1         1         1       2d18h
kapp-controller       replicaset.apps/kapp-controller-5859bd58d7          1         1         1       2d18h
kube-system           replicaset.apps/antrea-controller-545d78bb49        1         1         1       2d18h
kube-system           replicaset.apps/coredns-559dcb89b8                  0         0         0       2d18h
kube-system           replicaset.apps/coredns-86dbf96446                  1         1         1       2d18h
telegraf              replicaset.apps/telegraf-5d4bc448b6                 1         1         1       2d18h
vmware-sql-postgres   replicaset.apps/postgres-operator-7857586776        1         1         1       2d18h
vmware-system-csi     replicaset.apps/vsphere-csi-controller-5ddfddf944   1         1         1       2d18h

NAMESPACE   NAME                            READY   AGE
default     statefulset.apps/pg01           1/1     2d18h
default     statefulset.apps/pg01-monitor   1/1     2d18h

NAMESPACE   NAME                                         SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
default     cronjob.batch/default-full-backup            59 23 * * 6     False     0        35h             2d18h
default     cronjob.batch/default-incremental-backup     59 23 1/1 * *   False     0        11h             2d18h

NAMESPACE   NAME                                            COMPLETIONS   DURATION   AGE
default     job.batch/default-full-backup-28500479          1/1           5s         35h
default     job.batch/default-incremental-backup-28499039   1/1           5s         2d11h
default     job.batch/default-incremental-backup-28500479   1/1           4s         35h
default     job.batch/default-incremental-backup-28501919   1/1           4s         11h

NAMESPACE   NAME                                                                              STATUS      SOURCE INSTANCE   SOURCE NAMESPACE   TYPE          TIME STARTED           TIME COMPLETED
default     postgresbackup.sql.tanzu.vmware.com/default-full-backup-20240309-235901          Succeeded   pg01              default            full          2024-03-09T23:59:11Z   2024-03-09T23:59:25Z
default     postgresbackup.sql.tanzu.vmware.com/default-incremental-backup-20240308-235901   Succeeded   pg01              default            incremental   2024-03-08T23:59:01Z   2024-03-08T23:59:08Z
default     postgresbackup.sql.tanzu.vmware.com/default-incremental-backup-20240309-235901   Succeeded   pg01              default            incremental   2024-03-09T23:59:01Z   2024-03-09T23:59:08Z
default     postgresbackup.sql.tanzu.vmware.com/default-incremental-backup-20240310-235901   Succeeded   pg01              default            incremental   2024-03-10T23:59:01Z   2024-03-10T23:59:07Z
default     postgresbackup.sql.tanzu.vmware.com/initial-45b072fc-20240308-171109             Succeeded   pg01              default            full          2024-03-08T17:11:09Z   2024-03-08T17:11:22Z

NAMESPACE   NAME                                 STATUS    DB VERSION   BACKUP LOCATION       AGE
default     postgres.sql.tanzu.vmware.com/pg01   Running   15.5         pg01-backuplocation   2d18h
This provides a complete view of the 'workload' Kubernetes cluster where the PostgreSQL database pg01 is running. We can see the deployments, replicasets and pods. We can see the Kubernetes kube-system objects, the vSphere CSI driver for persistent storage, Antrea for networking, and even the Postgres objects. We have visibility into all of the services (pod networking) and even the jobs that are run to take backups of the database. There is lots of good information in here to query for deep-dive troubleshooting, but as I said before, proceed with caution.
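As an example of the kind of read-only query that is useful at this level, the Postgres custom resource shown at the bottom of the output above can be examined in more detail (again, look but don't touch):

# kubectl describe postgres.sql.tanzu.vmware.com/pg01 -n default --kubeconfig kubeconfig.pg01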
Return to Scenario 2 – Unreachable Address Range in IP Pool
Let's now return to our other Postgres cluster, pg03. Remember that this was given an IP address which is unreachable from vSphere and the DSM Provider. If I retrieve its kubeconfig from a secret, as we just did for pg01, I would not be able to send queries like the one above, since the cluster has been provisioned with a front-end/load-balancer IP address that we are unable to communicate with. A kubectl get command would simply not return anything. What would we do in that case? What we could do is examine the state of the Cluster API and CAPV objects that are being used to create the cluster. Let's do that next.
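Before we do, here is a quick way to confirm the symptom. The pg03 kubeconfig can be extracted from its secret exactly as we did for pg01, but any request against it just hangs until it times out, since the API server address is unreachable. The --request-timeout flag simply stops kubectl from waiting any longer than ten seconds:

# kubectl get secrets pg03-kubeconfig --kubeconfig kubeconfig-localhost.yaml \
  --template={{.data.value}} | base64 --decode > kubeconfig.pg03
# kubectl get nodes --kubeconfig kubeconfig.pg03 --request-timeout=10s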
There is an excellent utility for extending kubectl called krew. Krew has numerous plugins, and one that I find exceptionally useful for looking at situations like this is the tree plugin. When Kubernetes clusters are deployed using Cluster API, in our case CAPV for vSphere, this krew 'tree' plugin can show the relationship between objects in the cluster. Since I don't have krew installed on the DSM Provider appliance (and installing it there is probably something I would not recommend), I would suggest copying the /opt/vmware/tdm-provider/kubernetes-service/kubeconfig-external.yaml from the DSM Provider appliance to your desktop where kubectl, krew and the tree plugin are installed. You can now reference the same set of objects from your desktop that we were able to reference from within the DSM Provider appliance previously. But the really useful thing is that you can look at the relationship of all of the Cluster API objects, checking to see which are in a Ready state and which are not. Let's look at the working pg01 cluster first. To begin, we list the api-resources. Then we list the clusters. Finally, we use the krew tree plugin to see the relationship between all the Cluster API objects used to build a Kubernetes cluster on vSphere.
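If krew is already set up on your desktop, installing the tree plugin is a one-liner (see https://krew.sigs.k8s.io for instructions on installing krew itself):

% kubectl krew install tree

With the plugin in place, I run the following from my desktop: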
% kubectl api-resources --kubeconfig kubeconfig-external.yaml
NAME                           SHORTNAMES   APIVERSION                                NAMESPACED   KIND
configmaps                     cm           v1                                        true         ConfigMap
events                         ev           v1                                        true         Event
namespaces                     ns           v1                                        false        Namespace
secrets                                     v1                                        true         Secret
clusterresourcesetbindings                  addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSetBinding
clusterresourcesets                         addons.cluster.x-k8s.io/v1beta1           true         ClusterResourceSet
customresourcedefinitions      crd,crds     apiextensions.k8s.io/v1                   false        CustomResourceDefinition
apiservices                                 apiregistration.k8s.io/v1                 false        APIService
kubeadmconfigs                              bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfig
kubeadmconfigtemplates                      bootstrap.cluster.x-k8s.io/v1beta1        true         KubeadmConfigTemplate
clusterclasses                 cc           cluster.x-k8s.io/v1beta1                  true         ClusterClass
clusters                       cl           cluster.x-k8s.io/v1beta1                  true         Cluster
machinedeployments             md           cluster.x-k8s.io/v1beta1                  true         MachineDeployment
machinehealthchecks            mhc,mhcs     cluster.x-k8s.io/v1beta1                  true         MachineHealthCheck
machinepools                   mp           cluster.x-k8s.io/v1beta1                  true         MachinePool
machines                       ma           cluster.x-k8s.io/v1beta1                  true         Machine
machinesets                    ms           cluster.x-k8s.io/v1beta1                  true         MachineSet
kubeadmcontrolplanes           kcp          controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlane
kubeadmcontrolplanetemplates                controlplane.cluster.x-k8s.io/v1beta1     true         KubeadmControlPlaneTemplate
vsphereclusteridentities                    infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereClusterIdentity
vsphereclusters                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereCluster
vsphereclustertemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereClusterTemplate
vspheredeploymentzones                      infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereDeploymentZone
vspherefailuredomains                       infrastructure.cluster.x-k8s.io/v1beta1   false        VSphereFailureDomain
vspheremachines                             infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachine
vspheremachinetemplates                     infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereMachineTemplate
vspherevms                                  infrastructure.cluster.x-k8s.io/v1beta1   true         VSphereVM
globalinclusterippools                      ipam.cluster.x-k8s.io/v1alpha2            false        GlobalInClusterIPPool
inclusterippools                            ipam.cluster.x-k8s.io/v1alpha2            true         InClusterIPPool
ipaddressclaims                             ipam.cluster.x-k8s.io/v1alpha1            true         IPAddressClaim
ipaddresses                                 ipam.cluster.x-k8s.io/v1alpha1            true         IPAddress
extensionconfigs               ext          runtime.cluster.x-k8s.io/v1alpha1         false        ExtensionConfig

% kubectl get cl --kubeconfig kubeconfig-external.yaml
NAME      PHASE         AGE     VERSION
mysql01   Provisioned   2d16h
pg01      Provisioned   2d16h
pg03      Provisioned   13m

% kubectl tree cl pg01 --kubeconfig kubeconfig-external.yaml
NAMESPACE  NAME                                                        READY  REASON  AGE
default    Cluster/pg01                                                True           2d18h
default    ├─ConfigMap/pg01-lock                                       -              2d18h
default    ├─KubeadmControlPlane/pg01                                  True           2d18h
default    │ ├─Machine/pg01-m4v8q                                      True           2d18h
default    │ │ ├─KubeadmConfig/pg01-cgpt7                              True           2d18h
default    │ │ │ └─Secret/pg01-cgpt7                                   -              2d18h
default    │ │ └─VSphereMachine/pg01-tpl-cp-7658441472598658451-cpmdt  True           2d18h
default    │ │   └─VSphereVM/pg01-m4v8q                                True           2d18h
default    │ │     └─IPAddressClaim/pg01-m4v8q-0-0                     -              2d18h
default    │ │       └─IPAddress/pg01-m4v8q-0-0                        -              2d18h
default    │ ├─Secret/pg01-ca                                          -              2d18h
default    │ ├─Secret/pg01-etcd                                        -              2d18h
default    │ ├─Secret/pg01-kubeconfig                                  -              2d18h
default    │ ├─Secret/pg01-proxy                                       -              2d18h
default    │ └─Secret/pg01-sa                                          -              2d18h
default    ├─VSphereCluster/pg01                                       True           2d18h
default    │ ├─Secret/pg01-vsphere-creds                               -              2d18h
default    │ └─VSphereMachine/pg01-tpl-cp-7658441472598658451-cpmdt    True           2d18h
default    │   └─VSphereVM/pg01-m4v8q                                  True           2d18h
default    │     └─IPAddressClaim/pg01-m4v8q-0-0                       -              2d18h
default    │       └─IPAddress/pg01-m4v8q-0-0                          -              2d18h
default    └─VSphereMachineTemplate/pg01-tpl-cp-7658441472598658451    -              2d18h
In the example above, everything looks good for pg01. The cluster object at the top of the tree has reached Ready state. Now let’s take a look at our broken cluster pg03.
% kubectl tree cl pg03 --kubeconfig kubeconfig-external.yaml
NAMESPACE  NAME                                                         READY  REASON                 AGE
default    Cluster/pg03                                                 False  WaitingForKubeadmInit  14m
default    ├─ConfigMap/pg03-lock                                        -                             14m
default    ├─KubeadmControlPlane/pg03                                   False  WaitingForKubeadmInit  14m
default    │ ├─Machine/pg03-tcp9q                                       True                          14m
default    │ │ ├─KubeadmConfig/pg03-76fss                               True                          14m
default    │ │ │ └─Secret/pg03-76fss                                    -                             14m
default    │ │ └─VSphereMachine/pg03-tpl-cp-17178983102945282084-cfp8d  True                          14m
default    │ │   └─VSphereVM/pg03-tcp9q                                 True                          14m
default    │ │     └─IPAddressClaim/pg03-tcp9q-0-0                      -                             14m
default    │ │       └─IPAddress/pg03-tcp9q-0-0                         -                             14m
default    │ ├─Secret/pg03-ca                                           -                             14m
default    │ ├─Secret/pg03-etcd                                         -                             14m
default    │ ├─Secret/pg03-kubeconfig                                   -                             14m
default    │ ├─Secret/pg03-proxy                                        -                             14m
default    │ └─Secret/pg03-sa                                           -                             14m
default    ├─VSphereCluster/pg03                                        True                          14m
default    │ ├─Secret/pg03-vsphere-creds                                -                             14m
default    │ └─VSphereMachine/pg03-tpl-cp-17178983102945282084-cfp8d    True                          14m
default    │   └─VSphereVM/pg03-tcp9q                                   True                          14m
default    │     └─IPAddressClaim/pg03-tcp9q-0-0                        -                             14m
default    │       └─IPAddress/pg03-tcp9q-0-0                           -                             14m
default    └─VSphereMachineTemplate/pg03-tpl-cp-17178983102945282084    -                             14m
Not so good. Whilst a lot of the cluster has been built, it appears the kubeadm control plane is stuck initialising. In a nutshell, the cluster is waiting for the control plane provider to indicate that the control plane has been initialised. Obviously we know this is down to the networking issue, but can we figure out from the CAPV objects what or where the issue is? We could start by taking a look at the Machine object. This Machine object, backed by a VM in the vSphere infrastructure, runs the control plane of the K8s cluster. This is perhaps why we saw it stuck on MachinesReady when we queried the status via the gateway API earlier. Let's look at the status of the Machine object.
% kubectl get Machine/pg03-tcp9q --kubeconfig kubeconfig-external.yaml \
  --template={{.status.conditions}}
[map[lastTransitionTime:2024-03-11T09:14:59Z status:True type:Ready]
 map[lastTransitionTime:2024-03-11T09:13:27Z status:True type:BootstrapReady]
 map[lastTransitionTime:2024-03-11T09:14:59Z status:True type:InfrastructureReady]
 map[lastTransitionTime:2024-03-11T09:13:27Z reason:WaitingForNodeRef severity:Info status:False type:NodeHealthy]]
The node is not healthy. It would appear that the Machine object is waiting on a NodeRef. If we look at the code for Cluster API, we can see that WaitingForNodeRef indicates that machine.spec.providerID is not yet assigned. The ProviderID is a cloud provider ID which identifies the machine, in our case identifying the VM in vSphere. A cloud provider embeds platform or cloud specific control logic into Kubernetes; in this case, it is control logic for Kubernetes running on vSphere. Until the ProviderID is assigned, the machine will not be ready. The cloud provider component in Kubernetes needs to be able to reach vSphere to determine if the VM is online and healthy. When it succeeds in doing this, it sets the nodeRef as the last step in machine provisioning. Someone with Kubernetes expertise might be able to deduce from this that there is a communication issue between the workload K8s cluster and Cluster API on the management cluster (in our case DSM), and as a result, it is not possible to update the machine's operational state due to the unreachable IP address. Therefore the machine/node/VM will remain in this state indefinitely. You could also look at the IP address objects in the tree above and realise that the IP addresses are unreachable from vSphere (a simple ping from vCenter to the front-end/load-balancer IP address should reveal this).
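One way to confirm this from the CLI is to check whether the providerID has actually been populated on the Machine object. On the broken pg03 cluster I would expect this to come back empty, whereas pg01 should return a vSphere provider ID:

% kubectl get machine pg03-tcp9q --kubeconfig kubeconfig-external.yaml -o jsonpath='{.spec.providerID}'
% kubectl get machine pg01-m4v8q --kubeconfig kubeconfig-external.yaml -o jsonpath='{.spec.providerID}'

The actual error itself can be found on the DSM appliance, by looking at the logs for the Cluster API (CAPI & CAPV) components in /var/log/tdm/provider: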
root@photon [ /var/log/tdm/provider/containers ]# ls -ltr
total 141584
-rw-r----- 1 root root     5129 Mar 11 09:51 data-plane-image-download-v2.0.0.log
-rw-r----- 1 root root    79699 Mar 11 09:59 data-plane-image-deploy-v2.0.0.log
-rw-r----- 1 root root  6843133 Mar 11 02:01 kubernetes-service.log-2024-03-22-02-1711072861.gz
-rw-r----- 1 root root   295813 Mar 11 06:21 kubernetes-service-caip-1.log
-rw-r----- 1 root root    64038 Mar 11 09:47 kubernetes-service-kadm-1.log
-rw-r----- 1 root root   652003 Mar 11 10:02 kubernetes-service-kcp-1.log
-rw-r----- 1 root root  3367036 Mar 11 10:02 kubernetes-service-capi-1.log
-rw-r----- 1 root root  9697207 Mar 11 10:18 kubernetes-service-capv-1.log
-rw-r----- 1 root root 13646784 Mar 11 10:18 dsm-tsql-provisioner-service.log
-rw-r----- 1 root root 41733076 Mar 11 10:18 sgw-service.log
-rw-r----- 1 root root 41414303 Mar 11 10:18 docker-registry.log
-rw-r----- 1 root root 26966987 Mar 11 10:18 kubernetes-service.log
If I take a look at the kubernetes-service-capi-1.log, I see the following reconcile error displayed repeatedly:
2024-03-11T11:03:26+00:00 localhost.localdomain container_name/kubernetes-service-capi-1[1968]: \
E0322 11:03:26.785449 1 controller.go:324] "Reconciler error" err="failed to create cluster accessor: \
error creating client for remote cluster \"default/pg03\": error getting rest mapping: \
failed to get API group resources: unable to retrieve the complete list of server APIs: \
v1: Get \"https://192.168.100.100:6443/api/v1?timeout=10s\": net/http: \
request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" \
controller="clusterresourceset" controllerGroup="addons.cluster.x-k8s.io" \
controllerKind="ClusterResourceSet" ClusterResourceSet="default/pg03-csi-crs" \
namespace="default" name="pg03-csi-crs" reconcileID=bdfe85ee-75b4-4502-b00f-b625d9a80c79
The Cluster API on the DSM appliance is unable to communicate with the K8s cluster provisioned for the database. Thus, the cluster can never leave the WaitingForNodeRef state. Note that this issue could also arise if there was a firewall blocking port 6443 between the DSM appliance network and the network where the database is being provisioned. This troubleshooting process is also a good way to identify other possible conditions, such as memory or disk pressure, which could also prevent the Kubernetes cluster from becoming ready.
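Incidentally, a quick connectivity test from the DSM appliance to the database cluster's API server address (192.168.100.100 in this example) helps distinguish a routing problem from a firewall blocking port 6443. I am assuming curl is available on the appliance here; ping or nc would do equally well:

# curl -k --connect-timeout 5 https://192.168.100.100:6443/healthz

If this times out, nothing is reachable on that address; if it returns a TLS or HTTP response, the address is reachable and the problem lies elsewhere.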
Summary
As well as the significant amount of built-in monitoring in the DSM UI, additional information can sometimes be gleaned by using the kubectl CLI for troubleshooting. The gateway API, which is easily accessible to customers, gives some details about possible root causes related to misconfigurations. However, in certain situations, it may be necessary to go deeper and look at the underlying CAPV-provisioned Kubernetes workload cluster on which the database is deployed. Hopefully the above will help get you started should you need to troubleshoot a DSM environment. But, as highlighted multiple times in this post, don't take any risks, and consider engaging the Global Support organisation if you need to troubleshoot any production environment to this level.